python - Screen scraping based on title using Python bs4

Question

I'm having trouble screen scraping with bs4. Here is my code.

from bs4 import BeautifulSoup
import urllib2
url="http://www.99acres.com/property-in-velachery-chennai-south-ffid?"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
properties=soup.findAll('a',{'title':'Bedroom'})
for eachproperty in properties:
    print eachproperty['href']+",", eachproperty.string

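When I inspected the site, the actual title structure looks like this

1 Bedroom, Residential Apartment in Velachery

for all the anchor links. But I get no output and no error. So how do I tell the program to scrape every anchor whose title contains the word "Bedroom"?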

I hope I've made myself clear.

score 2 · Accepted Answer

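You need to use a regular expression here, because you want to match anchor links that have Bedroom somewhere in the title, not anchors whose entire title is Bedroom: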

import re

properties = soup.find_all('a', title=re.compile('Bedroom'))

For the URL you provided, this gives 47 matches.
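
To see the difference without hitting the live site, here is a minimal sketch that runs both filters against a small inline snippet. The HTML, the href values, and the variable names are made up for illustration and only imitate the title format described in the question; the code assumes Python 3 and the built-in html.parser backend.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup imitating the anchors on the listing page.
html = """
<a href="/p1" title="1 Bedroom, Residential Apartment in Velachery">One</a>
<a href="/p2" title="2 Bedroom, Residential Apartment in Velachery">Two</a>
<a href="/p3" title="Residential Land in Velachery">Land</a>
<a href="/p4" title="Bedroom">Exact</a>
"""
soup = BeautifulSoup(html, "html.parser")

# A plain string only matches when the whole attribute value is equal to it.
exact = soup.find_all("a", title="Bedroom")

# A compiled pattern is applied with search(), so it matches anywhere in the value.
partial = soup.find_all("a", title=re.compile("Bedroom"))

exact_hrefs = [a["href"] for a in exact]
partial_hrefs = [a["href"] for a in partial]

print(exact_hrefs)    # ['/p4']
print(partial_hrefs)  # ['/p1', '/p2', '/p4']
```

The string filter in the question behaves like the first call, which is why it silently returned an empty list: no anchor on the page had a title that was exactly "Bedroom".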
