我有一个 html 文件,其中包含以下实例:
<p>[CR][LF]
Here is the text etc
和:
...here is the last part of the text.[CR][LF]
</p>
where [CR]
and[LF]
分别代表回车和换行。
这些段落在具有特定类的 div 中,例如my_class
.
我想在这个特定的 div 类中定位段落标签并执行以下替换:
# remove new line after opening <p> tag
re.sub("<p>\n+", "<p>", div)
# remove new line before closing </p> tag
re.sub("<p>\n+", "<p>", div)
因此,我的方法是:
- 打开html文件
- 隔离特定的 div
- 隔离
<p>
这些 div 中的标签 - 仅对这些
<p>
标签执行替换 - 将修改后的内容写回原来的html文件
这是我到目前为止所拥有的,但是当它进行替换并写回文件时逻辑失败:
from bs4 import BeautifulSoup
import re
# open the html file in read mode
html_file = open('file.html', 'r')
# convert to string
html_file_as_string = html_file.read()
# close the html file
html_file.close()
# create a beautiful soup object
bs_html_file_as_string = BeautifulSoup(html_file_as_string, "lxml")
# isolate divs with specific class
for div in bs_html_file_as_string.find_all('div', {'class': 'my_class'}):
# perform the substitutions
re.sub("<p>\n+", "<p>", div)
re.sub("\n+</p>", "</p>", div)
# open original file in write mode
html_file = open('file', 'w')
# write bs_html_file_as_string (with substitutions made) to file
html_file.write(bs_html_file_as_string)
# close the html file
html_file.close()
我也一直在看美丽的汤的replace_with但不确定它是否与这里相关。
编辑:
下面的解决方案向我展示了另一种在不使用 re.sub 的情况下完成任务的方法。
但是,我需要执行另一个替换,但仍然不知道是否可以执行 re.sub within a specific class
, within a paragraph
. 具体来说,在以下示例中,我想将所有[CR][LF]
's替换为</p>\n<p>
. 我曾设想这会发生在潜艇上:
re.sub('\n+', r'</p>\n<p>', str)
SciTE 编辑器的屏幕截图显示回车和新行:
演示 HTML (demo_html.html):
<html>
<body>
<p>lalalalalalalala</p>
<p>lalalalalalalala</p>
<div class="my_class">
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Lorem ipsum..consectetur adipiscing elit.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit."Lorem ipsum dolor sit amet", consectetur adipisc'ing elit.Lorem ipsum dolor...sit amet, consectetur adipiscing elit..
Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Lorem ipsum dolor sit amet, consectetur adipiscing elit..
.....Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
</div>
<p>lalalalalalalala</p>
<p>lalalalalalalala</p>
</body>
</html>
演示 Python (demo_python.py):
from bs4 import BeautifulSoup
import re
with open('demo_html.html', 'r') as html_file:
html_file_as_string = html_file.read()
bs_html_file_as_string = BeautifulSoup(html_file_as_string, "lxml")
for div in bs_html_file_as_string.find_all('div', {'class': 'my_class'}):
for p in div.find('p'):
p.string.replace('\n','</p>\n<p>')
with open('demo_html.html', 'w') as html_file:
html_file.write(bs_html_file_as_string.renderContents())
print 'finished'