python - Python BeautifulSoup HTML parsing: getting the text

Question

I have an HTML page that is formatted as follows:

<section class="entry-content">
    <p>...</p>
    <p>...</p>
    <p>...</p>
</section>

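I am trying to extract the text contained in the <p> tags using BeautifulSoup/Python. This is what I have so far, but I am not sure how to "dig into" the <p> tags and get the text. Any suggestions would be greatly appreciated.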

import urllib2
from BeautifulSoup import BeautifulSoup

def main():
    url = 'URL'
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup(data)

    ingreds = bs.find('section', {'class': 'entry-content'})

    fname = 'most.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__":
  main()

score 2 · Accepted Answer

You can "dig in" and get the text from the tags by using the .stripped_strings iterable on each tag:

section = bs.find('section', {'class': 'entry-content'})
ingreds = [' '.join(ch.stripped_strings) for ch in section.find_all(True)]

We use .find_all(True) to loop only over the tags contained in section, not over its direct text content (such as newlines).
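For reference, here is a minimal self-contained sketch of that approach. It assumes the bs4 package (Beautiful Soup 4) is installed, since .find_all and .stripped_strings belong to that version rather than to the BeautifulSoup 3 import used in the question. The sample markup and the paragraph texts are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
html = """
<section class="entry-content">
    <p>2 cups flour</p>
    <p>1 tsp salt</p>
    <p>3 eggs</p>
</section>
"""

# "html.parser" is the parser bundled with Python, so no extra dependency is needed.
bs = BeautifulSoup(html, "html.parser")
section = bs.find("section", {"class": "entry-content"})

# find_all(True) yields only Tag objects, so the whitespace-only strings
# between the <p> elements are skipped.
ingreds = [" ".join(ch.stripped_strings) for ch in section.find_all(True)]

print("\n".join(ingreds))
```

The resulting list of strings can be passed to '\n'.join(...) and written to the output file as in the question.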

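Note that .find_all(True) also descends into any nested tags, which can lead to duplicated strings. The following loops only over the direct children of section: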

ingreds = [' '.join(ch.stripped_strings) for ch in section if hasattr(ch, 'stripped_strings')]
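The duplication is easiest to see with a nested tag. The sketch below again assumes bs4 and uses made-up sample markup. For the direct-children case it swaps in .find_all(True, recursive=False) instead of the hasattr check: depending on the installed bs4 version, plain strings may also expose .stripped_strings, in which case hasattr would no longer filter them out.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with a <b> tag nested inside the first <p>.
html = """
<section class="entry-content">
    <p>1 cup <b>brown</b> sugar</p>
    <p>2 eggs</p>
</section>
"""

bs = BeautifulSoup(html, "html.parser")
section = bs.find("section", {"class": "entry-content"})

# Recursive search: visits <p>, then the nested <b>, then the second <p>,
# so the text inside <b> is collected twice.
all_tags = [" ".join(ch.stripped_strings) for ch in section.find_all(True)]

# Direct child tags only: recursive=False stops find_all from descending.
direct = [
    " ".join(ch.stripped_strings)
    for ch in section.find_all(True, recursive=False)
]

print(all_tags)
print(direct)
```

Here all_tags contains an extra "brown" entry contributed by the nested <b> tag, while direct holds one entry per <p>.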
