0

I have an HTML file that contains instances like the following:

<p>[CR][LF]
Here is the text etc

and:

...here is the last part of the text.[CR][LF]
</p>

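where [CR] and [LF] stand for carriage return and line feed, respectively.

These paragraphs sit inside divs with a specific class, e.g. my_class.

I want to target the paragraph tags inside this specific div class and perform the following substitutions: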

# remove new line after opening <p> tag
re.sub("<p>\n+", "<p>", div)
# remove new line before closing </p> tag
re.sub("\n+</p>", "</p>", div)

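So my approach is:

  • open the HTML file
  • isolate the specific divs
  • isolate the <p> tags within those divs
  • perform the substitutions only on those <p> tags
  • write the modified content back to the original HTML file

This is what I have so far, but the logic fails when it makes the substitutions and writes back to the file: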

from bs4 import BeautifulSoup
import re
# open the html file in read mode
html_file = open('file.html', 'r')
# convert to string
html_file_as_string = html_file.read()
# close the html file
html_file.close()
# create a beautiful soup object 
bs_html_file_as_string = BeautifulSoup(html_file_as_string, "lxml")
# isolate divs with specific class
for div in bs_html_file_as_string.find_all('div', {'class': 'my_class'}):
    # perform the substitutions
    re.sub("<p>\n+", "<p>", div)
    re.sub("\n+</p>", "</p>", div)
# open original file in write mode
html_file = open('file', 'w')
# write bs_html_file_as_string (with substitutions made) to file
html_file.write(bs_html_file_as_string)
# close the html file
html_file.close()

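I have also been looking at Beautiful Soup's replace_with, but I am not sure whether it is relevant here.

Edit:

The solution below showed me another way to accomplish the task without using re.sub.

However, I need to perform another substitution, and I still don't know whether it is possible to run re.sub within a specific class, within a paragraph. Specifically, in the following example I want to replace every [CR][LF] with </p>\n<p>. I had envisioned doing this with a sub: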

re.sub('\n+', r'</p>\n<p>', str)

Screenshot from the SciTE editor showing the carriage returns and line feeds (image not included).

Demo HTML (demo_html.html):

<html>
<body>
<p>lalalalalalalala</p>
<p>lalalalalalalala</p>
<div class="my_class">
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Lorem ipsum..consectetur adipiscing elit.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit."Lorem ipsum dolor sit amet", consectetur adipisc'ing elit.Lorem ipsum dolor...sit amet, consectetur adipiscing elit..
Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Lorem ipsum dolor sit amet, consectetur adipiscing elit..
.....Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
</div>
<p>lalalalalalalala</p>
<p>lalalalalalalala</p>
</body>
</html>

Demo Python (demo_python.py):

from bs4 import BeautifulSoup
import re

with open('demo_html.html', 'r') as html_file:
    html_file_as_string = html_file.read()
bs_html_file_as_string = BeautifulSoup(html_file_as_string, "lxml")
for div in bs_html_file_as_string.find_all('div', {'class': 'my_class'}):
    for p in div.find('p'):
        p.string.replace('\n','</p>\n<p>')
with open('demo_html.html', 'w') as html_file:
    html_file.write(bs_html_file_as_string.renderContents())

print 'finished'

3 Answers

2

p.string.strip() removes leading and trailing whitespace.

p.string.replace_with(NEW_STRING) replaces the text of the p tag with NEW_STRING.

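from bs4 import BeautifulSoup

with open('file.html', 'r') as f:
    html_file_as_string = f.read()
soup = BeautifulSoup(html_file_as_string, "lxml")
for div in soup.find_all('div', {'class': 'my_class'}):
    # find_all returns every <p> in the div; find would return only the first
    for p in div.find_all('p'):
        # p.string is None when the <p> has child tags, so skip those
        if p.string is not None:
            p.string.replace_with(p.string.strip())
# write back to the same file that was read
with open('file.html', 'w') as f:
    f.write(str(soup))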

By the way, re.sub(..) returns the substituted string. It does not modify the original string in place.

>>> import re
>>> text = '   hello'
>>> re.sub('\s+', '', text)
'hello'
>>> text
'   hello'

Edit

Edited the code to match the edited question:

from bs4 import BeautifulSoup

with open('file.html', 'r') as f:
    html_file_as_string = f.read()
soup = BeautifulSoup(html_file_as_string, "lxml")
for div in soup.find_all('div', {'class': 'my_class'}):
    for p in div.find_all('p'):
        # build one <p> per non-blank line of the original paragraph
        new = BeautifulSoup(u'\n'.join(u'<p>{}</p>'.format(line.strip()) for line in p.text.splitlines() if line.strip()), 'html.parser')
        p.replace_with(new)
# write back to the same file that was read
with open('file.html', 'w') as f:
    f.write(str(soup))
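
On the narrower question of whether re.sub can be applied inside a paragraph of a specific class at all: here is a minimal sketch, assuming each target <p> holds only text (so p.string is not None). The sample markup and the choice to collapse newline runs into a single space are illustrative assumptions, not something taken from the question. The point is only that the value returned by re.sub has to be put back into the tree with replace_with.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the div with class "my_class" should change.
html = (
    '<div class="my_class"><p>first\n\nsecond</p></div>'
    '<div class="other"><p>keep\n\nthis</p></div>'
)

soup = BeautifulSoup(html, "html.parser")
for div in soup.find_all("div", class_="my_class"):
    for p in div.find_all("p"):
        # p.string is None when the <p> contains child tags; skip those here
        if p.string is None:
            continue
        # re.sub returns a new string, so store it back into the tree
        p.string.replace_with(re.sub(r"\n+", " ", p.string))

result = str(soup)
print(result)
```

Text placed into the tree this way is treated as text, so inserting a literal "</p>\n<p>" would be escaped on output rather than creating new tags; that is why the code above builds new <p> elements instead.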
answered 2013-06-23T11:48:03.707
1

You need to check whether the first and last content elements of each p are text nodes (instances of bs4.NavigableString, which is a subclass of str). This should work:

from bs4 import BeautifulSoup, NavigableString
import re

html_file_as_string = """
<p>test1</p>

<p>
test2</p>
<p>test3
</p>

<p></p>

<p>
test4
<b>...</b>
test5
</p>

<p><b>..</b>
</p>

<p>
<br></p>

"""

soup = BeautifulSoup(html_file_as_string, "lxml")
for p in soup.find_all('p'):
    if p.contents:
        if isinstance(p.contents[0], NavigableString):
            p.contents[0].replace_with(p.contents[0].lstrip())
        if isinstance(p.contents[-1], NavigableString):
            p.contents[-1].replace_with(p.contents[-1].rstrip())

print(soup)

Output:

<html><body><p>test1</p>
<p>test2</p>
<p>test3</p>
<p></p>
<p>test4
<b>...</b>
test5</p>
<p><b>..</b></p>
<p><br/></p>
</body></html>
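
The loop above visits every <p> in the document. To limit it to paragraphs inside divs of a given class, as the question asks, the same check can be nested under find_all on the div. A rough sketch, where the function name strip_paragraph_edges and the sample markup are made up for illustration, and html.parser is used only to avoid the extra lxml dependency:

```python
from bs4 import BeautifulSoup, NavigableString

def strip_paragraph_edges(html, css_class):
    """Trim leading/trailing whitespace of <p> tags inside div.css_class only."""
    soup = BeautifulSoup(html, "html.parser")
    for div in soup.find_all("div", class_=css_class):
        for p in div.find_all("p"):
            if not p.contents:
                continue
            first = p.contents[0]
            if isinstance(first, NavigableString):
                first.replace_with(first.lstrip())
            # look up the last node only now: it may be the node just replaced
            last = p.contents[-1]
            if isinstance(last, NavigableString):
                last.replace_with(last.rstrip())
    return str(soup)

# Hypothetical input: one paragraph inside the target div, one outside it.
sample = '<div class="my_class"><p>\na <b>b</b> c\n</p></div><p>\nuntouched\n</p>'
cleaned = strip_paragraph_edges(sample, "my_class")
print(cleaned)
```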

Using regular expressions to parse or process HTML is almost always a bad idea.
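
One concrete way such patterns can go wrong here: if the [CR][LF] line endings reach the regex untranslated (for example, when the file is read in binary mode or with newline=''), the question's pattern "<p>\n+" never matches, because a \r sits between the tag and the \n. A small sketch with a made-up sample string:

```python
import re

# Made-up sample with Windows-style line endings, as described in the question.
sample = "<p>\r\nHere is the text\r\n</p>"

# The question's pattern expects "\n" directly after "<p>", so nothing changes.
unchanged = re.sub(r"<p>\n+", "<p>", sample)

# Allowing "\r" as well makes both substitutions take effect.
fixed = re.sub(r"<p>[\r\n]+", "<p>", sample)
fixed = re.sub(r"[\r\n]+</p>", "</p>", fixed)

print(repr(unchanged))
print(repr(fixed))
```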

answered 2013-06-23T12:55:11.523
-1

You are not storing the result of the substitution inside the for loop; you could try something like:

import re

strings = ['foo', 'bar', 'qux']

for k, s in enumerate(strings):
    strings[k] = re.sub('foo', 'cheese', s)
answered 2013-06-23T11:54:05.460