python - How can I find where files are used as links in HTML pages?
I have a static website, and old versions of some pages are still stored in the root. I want to find these pages and check whether they are used as a link anywhere in the root's files.
I made a list of the files inside the root using PowerShell's command ls -r -name
and stored it in a file, 'filelist.txt', so I have something like:
directory1
directory2
5s.htm
5s.html
5s_introduction.htm
...
images\icons
images\icons\linkedin.png
images\icons\project-slider-arrow-left.png
images\icons\project-slider-arrow-right.png
I want to know whether these files are used, so I thought I could use a simple script in Python (as I don't know Windows PowerShell) that takes each line from the list and looks for occurrences in each HTML page inside the root.
To extract the file name, I tried a regex in Notepad++:
[^\\^\n]+\.[a-z]{0,4}
and it seemed to work (the ^\n is there to exclude the lines that represent directories).
As a second step, I tried to adapt some Python lines I found on Stack Overflow:
import re
with open('filelist.txt') as f:
    for l in f:
        m = re.match('([^\\^\n]+\.[a-z]{0,4})', l)
        if m:
            print(m.group(1))
But it returns wrong strings, full of spaces or single letters, as if the regex were wrong. I thought I would use the regex result as a variable and somehow check it against each HTML page in the root directory, but I'm stuck here.
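One thing that may differ between Notepad++ and Python here: in a normal (non-raw) Python string, '\\' collapses to a single backslash, so the regex engine receives [^\^\n], which excludes carets and newlines but not backslashes. In addition, re.match only matches from the start of the line. The sketch below is one possible adjustment, assuming a raw string and re.search are acceptable; the names FILE_NAME and extract_file_name are hypothetical, and this does not necessarily explain the stray spaces in the output.

```python
import re

# Raw string: the regex engine sees [^\\\n], i.e. any character except a
# backslash or a newline. The trailing $ anchors the match to the end of
# the line, so only the last path component can match.
FILE_NAME = re.compile(r'[^\\\n]+\.[a-z]{0,4}$')


def extract_file_name(line):
    """Return the file name at the end of a path line, or None for directories."""
    m = FILE_NAME.search(line.strip())
    return m.group(0) if m else None
```

Called on a line such as 'images\icons\linkedin.png', this returns only the last component, and it returns None for lines without a dot in the last component, such as 'directory1' or 'images\icons'.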
Since you are sure that file names contain a '.', each path can be split on '\' and the last component checked for a '.'. Also, strip each line to remove the newline characters.
with open('filelist.txt') as f:
    for l in f:
        l = l.strip()               # drop the trailing newline
        name = l.split('\\')[-1]    # last path component
        if '.' in name:             # directories have no '.'
            print(name)
Output:
5s.htm
5s.html
5s_introduction.htm
linkedin.png
project-slider-arrow-left.png
project-slider-arrow-right.png
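For the remaining step in the question (checking each name against the HTML pages in the root), a minimal sketch could look like the following. It is only an illustration under some assumptions: the function names load_file_names and find_usages are made up, pages are assumed to end in .htm or .html and to be readable as UTF-8, and a name counts as "used" whenever it appears in the page text as a whole token, whether or not it sits inside an href or src attribute.

```python
import os
import re


def load_file_names(list_path):
    """Collect the file names (last path component containing a '.')."""
    names = set()
    with open(list_path) as f:
        for line in f:
            name = line.strip().split('\\')[-1]
            if '.' in name:
                names.add(name)
    return names


def find_usages(root, names):
    """Map each file name to the list of HTML pages under root that mention it."""
    # The lookarounds keep '5s.htm' from matching inside '5s.html'
    # or inside a longer name such as 'my-5s.htm'.
    patterns = {
        name: re.compile(r'(?<![\w.-])' + re.escape(name) + r'(?![\w-])')
        for name in names
    }
    usages = {name: [] for name in names}
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            if not fname.lower().endswith(('.htm', '.html')):
                continue
            page = os.path.join(dirpath, fname)
            # Assumption: pages decode as UTF-8; undecodable bytes are skipped.
            with open(page, encoding='utf-8', errors='ignore') as f:
                text = f.read()
            for name, pattern in patterns.items():
                if pattern.search(text):
                    usages[name].append(page)
    return usages
```

Calling find_usages('.', load_file_names('filelist.txt')) from the root would then give a dictionary in which names mapped to an empty list are candidates for unused files. Because only the bare file name is compared, two files with the same name in different directories are not told apart.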