python - Extracting a line of text with BeautifulSoup

Question

I have two numbers (NUM1; NUM2) that I am trying to extract across webpages that share the same format:

<div style="margin-left:0.5em;">  
  <div style="margin-bottom:0.5em;">
    NUM1 and NUM2 are always followed by the same text across webpages
  </div>

I think a regular expression is probably the way to go for these particular fields. Here is my attempt (borrowed from various sources):

def nums(self):
    nums_regex = re.compile(r'\d+ and \d+ are always followed by the same text across webpages')
    nums_match = nums_regex.search(self)
    nums_text = nums_match.group(0)
    digits = [int(s) for s in re.findall(r'\d+', nums_text)]
    return digits

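On its own, outside a function, this code works when I pass in the actual text source (e.g., nums_regex.search(text)). However, I am modifying someone else's code, and I have never really worked with classes or functions myself. Here is a sample of their code: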

@property
def title(self):
    tag = self.soup.find('span', class_='summary')
    title = unicode(tag.string)
    return title.strip()

As you may have guessed, my code does not work. I get the error:

nums_match = nums_regex.search(self)
TypeError: expected string or buffer

It looks like I am not passing in the raw text correctly, but how do I fix it?

score 0 · Accepted Answer

You can use the same regular expression pattern to have BeautifulSoup find the element by its text, and then extract the numbers you need:

import re

pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")

# soup is assumed to be an existing BeautifulSoup object for the page
for elm in soup.find_all("div", text=pattern):
    print(pattern.search(elm.text).groups())

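Note that since you are trying to match part of the text rather than anything tied to the HTML structure, I think simply applying the regular expression to the whole document would be fine.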

Complete working examples are below.

Using BeautifulSoup's regular expression / "by text" search:

import re

from bs4 import BeautifulSoup

data = """<div style="margin-left:0.5em;">
  <div style="margin-bottom:0.5em;">
    10 and 20 are always followed by the same text across webpages
  </div>
</div>
"""

soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")

for elm in soup.find_all("div", text=pattern):
    print(pattern.search(elm.text).groups())  # prints ('10', '20')

Regular expression search only:

import re

data = """<div style="margin-left:0.5em;">
  <div style="margin-bottom:0.5em;">
    10 and 20 are always followed by the same text across webpages
  </div>
</div>
"""

pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")
print(pattern.findall(data))  # prints [('10', '20')]
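
To tie this back to the TypeError in the question: inside a method, self is the object itself, not a string, so it cannot be passed to search() directly. Below is a minimal sketch of a helper that takes the text explicitly, returns the numbers as integers, and handles the case where the sentence is missing. The names NUMS_PATTERN and extract_nums are made up for illustration and are not part of the original code.

```python
import re

# Same pattern as above, with one capture group per number.
NUMS_PATTERN = re.compile(
    r"(\d+) and (\d+) are always followed by the same text across webpages"
)


def extract_nums(text):
    """Return [NUM1, NUM2] as ints, or None if the sentence is not found."""
    match = NUMS_PATTERN.search(text)  # text must be a string
    if match is None:
        return None  # avoid calling .groups() on None
    return [int(group) for group in match.groups()]


data = """<div style="margin-left:0.5em;">
  <div style="margin-bottom:0.5em;">
    10 and 20 are always followed by the same text across webpages
  </div>
</div>
"""

print(extract_nums(data))  # prints [10, 20]
```

Assuming the class keeps its parsed page in self.soup, as the title property in the question suggests, a nums property could then return extract_nums(self.soup.get_text()), or extract_nums(str(self.soup)) to search the raw markup instead.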
