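I know this has been asked several times, but I believe I'm doing everything correctly and it still doesn't work, so before I go clinically insane I'm posting about it. Here is the code (it is supposed to convert an HTML file to a txt file, omitting certain lines):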
import codecs
import os

# Excerpt from inside a function; htmlFile is the path to the input HTML file.
fid = codecs.open(htmlFile, "r", encoding="utf-8")
if not fid:
    return
htmlText = fid.read()
fid.close()

stripped = strip_tags(unicode(htmlText))  # strip HTML tags (this is not the problem)
lines = stripped.split('\n')
out = []
for line in lines:  # just some stuff I want to leave out of the output
    if len(line) < 6:
        continue
    if '*' in line or '(' in line or '@' in line or ':' in line:
        continue
    out.append(line)
result = '\n'.join(out)

base, ext = os.path.splitext(htmlFile)
outfile = base + '.txt'
fid = codecs.open(outfile, "w", encoding="utf-8")
fid.write(result)
fid.close()
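
For reference, the filtering step on its own, separated from the file I/O and from `strip_tags`, can be sketched roughly as follows. The names `keep_line` and `filter_lines` are made up for this illustration and are not part of the code above; the rules are the same ones used in the loop.

```python
def keep_line(line):
    """Return True if the line should appear in the text output.

    Hypothetical helper: mirrors the two checks in the loop above.
    """
    # Skip very short lines.
    if len(line) < 6:
        return False
    # Skip lines containing any of the marker characters.
    return not any(ch in line for ch in '*(@:')


def filter_lines(text):
    """Split text on newlines, drop unwanted lines, and rejoin the rest."""
    return '\n'.join(line for line in text.split('\n') if keep_line(line))
```

Calling `filter_lines` on the tag-stripped text should give the same string as `result` in the snippet, which makes it possible to test the filtering without touching any files.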
Thanks!