python - Learning Python regular expressions and web scraping, and getting stuck

Question

I am trying to do web scraping with Python. I am trying to get the link to a product (my target):

http://www.fastfurnishings.com/3-Piece-Reversible-Bonded-Leather-Match-Sofa-Set-i-p/bstrblm3p.htm

I am scraping this URL/site:

 http://www.fastfurnishings.com/SearchResults.asp?Search=3-Piece+Reversible+Bonded+Leather+Match+Sofa+Set+in+Cream

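If you view the page source, you will see that there are no particular ids or tags that would help me pinpoint the URL I need, and I am also not very good with regular expressions. So far I have this in Python: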

import urllib
import re
product = "3-Piece Reversible Bonded Leather Match Sofa Set in Cream"
productSearchUrl = product.replace(" ","+");
myurl = "http://www.fastfurnishings.com/SearchResults.asp?Search="+productSearchUrl
print myurl
htmlfile = urllib.urlopen(myurl)
htmltext = htmlfile.read()
regex = '<td valign="top" width="33%" align="center">(.+?)</td> '
r = re.compile(regex)
print re.findall(r,htmltext)

But that doesn't match anything... any help would be greatly appreciated.

score 3 · Accepted Answer

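You would be better off using a web scraping library such as Scrapy or BeautifulSoup. It will definitely save you a lot of pain and let you focus on what you actually want to achieve once you have scraped the information.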

score 3 · Accepted Answer

This is why you use an HTML parser, such as BeautifulSoup:

>>> import urllib2
>>> from bs4 import BeautifulSoup as BS
>>> html = urllib2.urlopen('http://www.fastfurnishings.com/SearchResults.asp?Search=3-Piece+Reversible+Bonded+Leather+Match+Sofa+Set+in+Cream')
>>> soup = BS(html)
>>> print soup.find('td', {'valign':'top', 'width':'33%', 'align':'center'}).a['href']
http://www.fastfurnishings.com/3-Piece-Reversible-Bonded-Leather-Match-Sofa-Set-i-p/bstrblm3p.htm

See how easy that was ;)
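For readers on Python 3 without BeautifulSoup installed, the same idea (let a parser find the cell, then read the link inside it) can be sketched with the standard library's `html.parser`. This is only an illustration under assumptions: `SAMPLE_HTML` is a hypothetical fragment shaped like the search-results cell rather than the real page, and `FirstProductLink` and `first_product_link` are names made up for this sketch.

```python
from html.parser import HTMLParser

# Hypothetical fragment shaped like the search-results cell; the real page may differ.
SAMPLE_HTML = """
<table><tr>
<td valign="top" width="33%" align="center">
  <a href="http://www.fastfurnishings.com/3-Piece-Reversible-Bonded-Leather-Match-Sofa-Set-i-p/bstrblm3p.htm">
    3-Piece Reversible Bonded Leather Match Sofa Set in Cream
  </a>
</td>
</tr></table>
"""


class FirstProductLink(HTMLParser):
    """Record the href of the first <a> found inside a matching <td>."""

    # The same attributes the BeautifulSoup call filters on.
    WANTED = {"valign": "top", "width": "33%", "align": "center"}

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            # Enter the cell only if it carries all the wanted attributes.
            self.in_cell = all(attrs.get(k) == v for k, v in self.WANTED.items())
        elif tag == "a" and self.in_cell and self.href is None:
            self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False


def first_product_link(html_text):
    """Return the first product link in html_text, or None if no cell matches."""
    parser = FirstProductLink()
    parser.feed(html_text)
    return parser.href


print(first_product_link(SAMPLE_HTML))
```

The sketch does not track nested tables, so a `<td>` nested inside the matching cell would reset the state. BeautifulSoup handles that kind of case for you, which is part of the argument for using it.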

score 0 · Accepted Answer

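Don't do this (parsing HTML with regex), etc. That said, it looks like you are not accounting for newlines. By default `.` does not match a newline character, so `(.+?)` cannot span a cell whose content runs over several lines: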

r = re.compile(regex, re.DOTALL)
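To see the effect of the flag without hitting the network, here is a small Python 3 sketch run against a hypothetical multi-line fragment (`SAMPLE_HTML` is made up for illustration and is not the real page). The pattern also drops the trailing space after `</td>` that appears in the question's regex, since that space would have to be matched literally.

```python
import re

# Hypothetical fragment: the cell content spans several lines.
SAMPLE_HTML = """<td valign="top" width="33%" align="center">
<a href="http://www.fastfurnishings.com/3-Piece-Reversible-Bonded-Leather-Match-Sofa-Set-i-p/bstrblm3p.htm">Sofa Set</a>
</td>"""

# Same pattern as in the question, minus the trailing space after </td>.
PATTERN = r'<td valign="top" width="33%" align="center">(.+?)</td>'

# Without DOTALL, "." stops at the newline right after the opening tag.
without_dotall = re.findall(PATTERN, SAMPLE_HTML)

# With DOTALL, "." also matches newlines, so the whole cell is captured.
with_dotall = re.findall(PATTERN, SAMPLE_HTML, re.DOTALL)

# Pull the href out of the first captured cell.
hrefs = re.findall(r'href="([^"]+)"', with_dotall[0])

print(without_dotall)
print(hrefs)
```

On this fragment the first search finds nothing and the second captures the cell, which is consistent with the empty result described in the question if the real cells span multiple lines.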

answered 2013-09-13T06:36:09.347
