python - BeautifulSoup, findAll('table') returns all tables and the text between them

Question

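I'm trying to isolate one part of a web page; unfortunately, it isn't contained in any element that I can pull out.

The closest I can get is to grab the whole body of the page and then try to remove the tables (which are the only part I don't want).

The code I'm using: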

storyText = soup.body
toRemove = storyText.findAll('table')
for each in toRemove:
    print(each)

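The problem right now is that the toRemove line returns the tables along with the text between them, even though that text is not inside any table.

So I get: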

<body>
<table>
    table stuff
</table>
    Text, not in tags </br> #This is what I want.
<table>
    table stuff
</table>
</body>

I worked around my problem by doing the following:

# Isolate body
findBody = soup.body
new = str(findBody)
# Section off the text from the tables before it.
sec = new.split('</table>')
# Select story area
newStory = sec[3]
# Section off the text from the tables after it.
newSec = newStory.split('<table')
# Select the story area; this is the area that we want.
story = newSec[0]

I'm still looking for an answer, because it seems there should be a cleaner way to do this.

score 0 · Accepted Answer

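Your code works fine on my Mac. Which version are you using? I used Beautiful Soup 4.

(Beautiful Soup 3 is not recommended, because it is no longer being developed. http://www.crummy.com/software/BeautifulSoup/bs4/doc/)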
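For anyone unsure which release they have, a quick check along these lines should work. It assumes the package is importable as bs4 (Beautiful Soup 3 was imported as BeautifulSoup instead, so a failing import here is itself a hint):

```python
import bs4

# Beautiful Soup 4 exposes its version string as bs4.__version__.
version = bs4.__version__
print("Beautiful Soup version:", version)
```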

Here is my code:

from bs4 import BeautifulSoup

contents = '''<body>
<table>
     table stuff1
</table>
     Text, not in tags </br> #This is what I want.
<table>
     table stuff2
</table>
</body>'''

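# Name the parser explicitly; html.parser ships with Python.
soup = BeautifulSoup(contents, "html.parser")

storyText = soup.body
toRemove = storyText.findAll('table')
for each in toRemove:
    print(each)
    # extract() removes the table from the tree
    each.extract()

print('----result-------------')
print(soup)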

This produces the following result.

<table>
    table stuff1
</table>
<table>
    table stuff2
</table>
----result-------------
<body>

    Text, not in tags  #This is what I want.

</body>
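If only the leftover text is needed rather than the modified markup, the same idea can be sketched a little more compactly. This is a minimal example, not the only way to do it: the sample markup and the variable names (html, story) are made up for illustration, and it uses decompose(), which removes a tag and destroys it, where the code above uses extract(), which removes the tag and returns it.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the page described in the question.
html = """<body>
<table><tr><td>table stuff1</td></tr></table>
Text, not in tags
<table><tr><td>table stuff2</td></tr></table>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# Drop every table from the tree; find_all() returns only the <table> tags.
for table in soup.body.find_all("table"):
    table.decompose()

# Collect whatever text is left in the body, with surrounding whitespace stripped.
story = soup.body.get_text(strip=True)
print(story)
```

Compared with splitting the serialized body on '</table>', this does not depend on how many tables come before the story text.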
