python - Web scraper not producing results in Python

Question

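I'm a young grasshopper in need of your help. I've done a lot of research and can't seem to find a solution. I wrote the code below. When I run it, it doesn't extract any titles. I believe my regular expression is correct, so I'm not sure what the problem is. It's probably obvious to an experienced teacher. Thanks in advance.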

from urllib import urlopen

import re

url = urlopen('http://www.realclearpolitics.com/epolls/2012/senate/ma/massachusetts_senate_brown_vs_warren-2093.html#polls').read()

'''
a href="http://multimedia.heraldinteractive.com/misc/umlrvnov2012final.pdf">Title a>
'''

A = 'a href.*pdf">(expression to pull everything) a>' 

B = re.compile(A) 

C = re.findall(B,url)

print C

score 3 · Accepted Answer

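This comes up a lot on SO. Rather than using regular expressions, you should use an HTML parser that lets you search and traverse the document tree.

I would use BeautifulSoup:

Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match "foo.com", or "Find the table heading that's got bold text, then give me that text."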

>>> from bs4 import BeautifulSoup
>>> html = "..."  # insert your raw HTML here
>>> soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
>>> a_tags = soup.find_all("a")
>>> for anchor in a_tags:
...     print(anchor.contents)
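
Applied to the question, the same idea might look roughly like the sketch below. It assumes the titles are the text of `<a>` tags whose `href` ends in `.pdf`. `SAMPLE_HTML` and `pdf_link_titles` are hypothetical names used for illustration, and the sample markup stands in for the downloaded page rather than reproducing it.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<table>
  <tr><td><a href="http://multimedia.heraldinteractive.com/misc/umlrvnov2012final.pdf">Title</a></td></tr>
  <tr><td><a href="http://example.com/about.html">About</a></td></tr>
  <tr><td><a name="top">No href here</a></td></tr>
</table>
"""


def pdf_link_titles(html):
    """Return (title, href) pairs for every <a> whose href ends in .pdf."""
    soup = BeautifulSoup(html, "html.parser")
    # Passing a compiled regex as href keeps only anchors whose href matches it;
    # anchors without an href attribute are skipped.
    anchors = soup.find_all("a", href=re.compile(r"\.pdf$"))
    return [(a.get_text(strip=True), a["href"]) for a in anchors]


if __name__ == "__main__":
    for title, href in pdf_link_titles(SAMPLE_HTML):
        print(title, href)
```

To run it against the real page, pass the string returned by `urlopen(...).read()` instead of `SAMPLE_HTML`.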

score 0 · Accepted Answer

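I'll echo the other comments about not using RegEx to parse HTML, but sometimes it's quick and easy. It looks like the HTML in your example isn't quite right, but I'd try something like this: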

re.findall(r'href.*?pdf">(.+?)</a>', url)  # search the page contents (url), not the pattern string A
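
For a quick check without fetching the page, the regex approach can be tried on a small sample string. This is only a sketch: `SAMPLE_HTML` and `PDF_LINK` are hypothetical names, and the sample assumes well-formed `<a href="...">...</a>` tags, which may not match the real page exactly. The pattern here is slightly stricter than a bare `.*?`, using `[^"]*` so that a match cannot run past the end of one `href` value.

```python
import re

# Hypothetical sample; assumes the real page uses well-formed <a ...>...</a> tags.
SAMPLE_HTML = (
    '<td><a href="http://multimedia.heraldinteractive.com/misc/umlrvnov2012final.pdf">Title</a></td>'
    '<td><a href="http://example.com/other.pdf">Another poll</a></td>'
    '<td><a href="http://example.com/index.html">Home</a></td>'
)

# [^"]* keeps the match inside a single href value;
# (.+?) lazily captures the link text up to the nearest closing </a>.
PDF_LINK = re.compile(r'<a\s+href="[^"]*\.pdf">(.+?)</a>')

titles = PDF_LINK.findall(SAMPLE_HTML)
print(titles)  # ['Title', 'Another poll']
```

A greedy `.*`, as in the question's pattern, would instead stretch to the last `pdf">` on a line, which is one reason a lazy or negated-class form tends to behave better here.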
