我有一些代码在一个更大的程序中独立工作,但现在在更大的程序中它似乎没有工作——即它没有执行所需的操作。
问题发生在第 4 步(见下文),经过反思,我在字符类中的预期逻辑(即“除回车之外的所有内容”)似乎没有正确编码(但我不知道还有什么“表达”逻辑)。
我的目标只是用段落标签包装每一行或段落。
Python代码
import re
# 1. open the html file in read mode
html_file = open('test.html', 'r')
# 2. convert to string
html_file_as_string = html_file.read()
# 3. close the html file
html_file.close()
# 4. replace carriage returns with closing and opening paragraph tags
html_file_as_string = re.sub('([^\r]*)\r', r'\1</p>\n<p>', html_file_as_string)
# 5. remove time and date
html_file_as_string = re.sub(r'(Lorem ipsum \d*/\d*/\d*, \d*:\d* [a-z]{2})', r"", html_file_as_string)
# 6. remove the white space after the opening paragraph tags
html_file_as_string = re.sub('<p>\n*\s*', r"<p>", html_file_as_string)
# 7. remove the white space before the closing paragraph tags
html_file_as_string = re.sub('\s*</p>', r"</p>", html_file_as_string)
# 8. open the file in write mode to clear
html_file = open('test.html', 'w')
# 9. write the new contents to file
html_file.write(html_file_as_string)
# 10. print to screen so we can see what is happening
print html_file_as_string
# 11. close the html file
html_file.close()
这是 HTML 文件的内容:
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Lorem ipsum..consectetur adipiscing elit.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit."Lorem ipsum dolor sit amet", consectetur adipisc'ing elit.Lorem ipsum dolor...sit amet, consectetur adipiscing elit..
Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Lorem ipsum dolor sit amet, consectetur adipiscing elit..
.....Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum 01/01/05, 05:00 am</p>
这是在 SciTE 编辑器中查看的文件内容(因此可以看到空格、回车和换行符)。
编辑:
我根据以下建议更改了正则表达式,然后将替换加倍(从第 4 步中可见的原始代码更改和第 4 步之前的第 6 步复制)。
工作代码:
import re
# 1. open the html file in read mode
html_file = open('test.html', 'r')
# 2. convert to string
html_file_as_string = html_file.read()
# 3. close the html file
html_file.close()
# 6(added). remove the white space after the opening paragraph tags
html_file_as_string = re.sub('<p>\n*\s*', r"<p>", html_file_as_string)
# 4(changed). replace carriage returns with closing and opening paragraph tags
html_file_as_string = re.sub('([^\r\n]*)(\r\n?|\n)', r'\1</p>\2<p>', html_file_as_string)
# 5. remove time and date
html_file_as_string = re.sub(r'(Lorem ipsum \d*/\d*/\d*, \d*:\d* [a-z]{2})', r"", html_file_as_string)
# 6. remove the white space after the opening paragraph tags
html_file_as_string = re.sub('<p>\n*\s*', r"<p>", html_file_as_string)
# 7. remove the white space before the closing paragraph tags
html_file_as_string = re.sub('\s*</p>', r"</p>", html_file_as_string)
# 8. open the file in write mode to clear
html_file = open('test.html', 'w')
# 9. write the new contents to file
html_file.write(html_file_as_string)
# 10. print to screen so we can see what is happening
print html_file_as_string
# 11. close the html file
html_file.close()
编辑2:
上面的代码在其他部分代码过于激进,做了太多的修改,回到绘图板上。