beautifulsoup - Getting data inside HTML elements with BeautifulSoup

Question

So far I have done this:

import urllib2,re,time
from bs4 import BeautifulSoup
base_url="http://nairobinow.wordpress.com/"
rawEventsData=urllib2.urlopen(base_url).read()
rawEventssoup = BeautifulSoup(rawEventsData)
events=rawEventssoup.findAll("div", {"id": re.compile(r'post-\d+')})

Now I want to get the data that follows the labels Venue and Dates. This is an event block (just one of the items being iterated over):

<div class="post-17149 post type" id="post-17149">
<h2><a href="http://nairobinow.wordpress.com/2012/11/05/out/">Out of Town: Lamuest</a> 
</h2><p>u
Dates: November 15-18, 2012<br/>
Venue: Lamu</p>
<p>Accommodation information: <a href="http://.../index.html"target="_blank"  
>http://www.lamu.org/index.html</a></p></div>

Any help would be appreciated.

score 2 · Accepted Answer

If I understand your question correctly, it sounds like you're interested in the data inside the <p> tags. If that's correct...

In case you didn't already know, .findAll() returns a list. In this case, every div with a matching id will be returned.
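As a rough illustration of that list behaviour, here is a minimal sketch written for Python 3 and bs4. The HTML string is a made-up stand-in for the downloaded page (two posts plus an unrelated div), and find_all is the newer spelling of findAll; neither comes from the original question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two posts and one unrelated div.
html = """
<div class="post" id="post-17149"><h2>Out of Town: Lamu</h2><p>Venue: Lamu</p></div>
<div class="post" id="post-17150"><h2>Another event</h2><p>Venue: Nairobi</p></div>
<div id="sidebar"><p>Not an event</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like result holding every div whose id matches the pattern.
events = soup.find_all("div", id=re.compile(r"^post-\d+$"))

# The result supports len(), indexing and iteration like any list.
event_ids = [event["id"] for event in events]
print(len(events), event_ids)
```

The sidebar div is left out because its id does not match the pattern, so only the two post divs end up in events.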

All you need to do is iterate over events:

for event in events:
    print event('p')[0]

This will return:

<p>u Dates: November 15-18, 2012<br/> Venue: Lamu</p>

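Use .contents to get rid of the tags. Note that .contents is a list, so you pick out the text by index. For example, .contents[0] might return Dates: November 15-18, 2012, while the Venue: Lamu text sits at a later index, because the <br/> tag is itself an entry in the list (so it is more likely .contents[2] than .contents[1]).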
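One way this could look in practice is sketched below for Python 3 and bs4. The HTML is a simplified copy of the event block from the question (the stray leading character in the paragraph is dropped), and labelled_fields is a hypothetical helper name introduced only for this example. It walks .contents, skips tags such as <br/>, and splits each remaining text node on the first colon.

```python
from bs4 import BeautifulSoup, NavigableString

# Simplified copy of the event block from the question (an assumption for illustration).
html = """<div class="post-17149 post type" id="post-17149">
<h2><a href="http://nairobinow.wordpress.com/2012/11/05/out/">Out of Town: Lamuest</a></h2>
<p>Dates: November 15-18, 2012<br/>
Venue: Lamu</p></div>"""

soup = BeautifulSoup(html, "html.parser")
first_p = soup.find("div", id="post-17149").find("p")


def labelled_fields(paragraph):
    """Collect the 'Label: value' text nodes of a paragraph into a dict."""
    fields = {}
    for child in paragraph.contents:
        # .contents mixes text nodes and tags; skip tags such as <br/>.
        if not isinstance(child, NavigableString):
            continue
        text = child.strip()
        if ":" in text:
            label, value = text.split(":", 1)
            fields[label.strip()] = value.strip()
    return fields


info = labelled_fields(first_p)
print(info)
```

Filtering by node type instead of hard-coding indices keeps the lookup working even when a post has extra <br/> tags or a different number of lines.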

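You should play around with this and find what suits your needs. I hope this answers the question; it was a bit vague, but I figured I'd give it a shot anyway.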
