python - Python web scraping involving HTML tags with attributes

Question

I'm trying to build a web scraper that parses a publication's web page and extracts the authors. The skeleton of the page looks like this:

<html>
<body>
<div id="container">
<div id="contents">
<table>
<tbody>
<tr>
<td class="author">####I want whatever is located here ###</td>
</tr>
</tbody>
</table>
</div>
</div>
</body>
</html>

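So far I have been trying to use BeautifulSoup and lxml for this task, but I'm not sure how to handle the two div tags and the td tag, since they have attributes. Beyond that, I'm not sure whether I should rely more on BeautifulSoup, on lxml, or on a combination of the two. What should I do?

At the moment, my code looks like this: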

    import re
    import urllib2,sys
    import lxml
    from lxml import etree
    from lxml.html.soupparser import fromstring
    from lxml.etree import tostring
    from lxml.cssselect import CSSSelector
    from BeautifulSoup import BeautifulSoup, NavigableString

    address='http://www.example.com/'
    html = urllib2.urlopen(address).read()
    soup = BeautifulSoup(html)
    html=soup.prettify()
    html=html.replace('&nbsp', '&#160')
    html=html.replace('&iacute','&#237')
    root=fromstring(html)

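I realize that many of the import statements are probably redundant, but I just copied whatever I currently have in a larger source file.

EDIT: I suppose I didn't make this very clear, but there are multiple tags on the page that I want to scrape.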

score 12 · Accepted Answer

It's not clear to me from your question why you need to worry about the div tags. What about just doing:

soup = BeautifulSoup(html)
thetd = soup.find('td', attrs={'class': 'author'})
print thetd.string

On the HTML you give, running this emits exactly:

####I want whatever is located here ###

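which appears to be what you want. Maybe you can specify more precisely what you need that this super-simple snippet doesn't do: multiple td tags of class author that you need to consider (all of them? just some? which ones?), the possibility that no such tag is present (what do you want to do in that case?), and the like. It's hard to infer exactly what your specs are from just this simple example and an overabundance of code ;-).

Edit: if, as per the OP's latest comment, there are multiple such td tags, one per author: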

thetds = soup.findAll('td', attrs={'class': 'author'})
for thetd in thetds:
    print thetd.string

...i.e., not much harder at all!-)
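For readers on Python 3, the same lookup can be sketched with the newer `bs4` package (BeautifulSoup 4), which is an assumption here since the answer above targets BeautifulSoup 3 on Python 2. The sample author names are made up for illustration. `get_text()` is used instead of `.string` because `.string` is `None` when a cell contains nested tags.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two author cells (names invented for this sketch).
html = """<html><body><div id="container"><div id="contents"><table><tbody>
<tr><td class="author">Alice</td></tr>
<tr><td class="author">Bob</td></tr>
</tbody></table></div></div></body></html>"""

# "html.parser" is the standard-library backend, so lxml is not required.
soup = BeautifulSoup(html, "html.parser")

# class_ is bs4's keyword for the class attribute ("class" is reserved in Python).
authors = [td.get_text(strip=True) for td in soup.find_all("td", class_="author")]

if not authors:
    print("no author cells found")
for name in authors:
    print(name)
```

The empty-list check covers the "no such tag" case raised above; what to do in that case is left to the caller.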

score 6 · Accepted Answer

Or you could use pyquery, since BeautifulSoup is not actively maintained anymore; see http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

First, install pyquery:

easy_install pyquery

Then your script could be as simple as:

from pyquery import PyQuery
d = PyQuery('http://mywebpage/')
# iterating a PyQuery selection yields lxml elements, where .text is an attribute, not a method
allauthors = [ td.text for td in d('td.author') ]

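pyquery uses the CSS selector syntax familiar from jQuery, which I find more intuitive than BeautifulSoup's. It uses lxml underneath and is much faster than BeautifulSoup. But BeautifulSoup is pure Python, and therefore also works on Google's App Engine.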

score 5 · Accepted Answer

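The lxml library is now the standard for parsing HTML in Python. The interface can seem awkward at first, but it is very serviceable for what it does.

You should let the library handle the XML-isms, such as those escaped &entities;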

import lxml.html

html = """<html><body><div id="container"><div id="contents"><table><tbody><tr>
          <td class="author">####I want whatever is located here, eh? &iacute; ###</td>
          </tr></tbody></table></div></div></body></html>"""

root = lxml.html.fromstring(html)
tds = root.cssselect("div#contents td.author")

print tds           # gives [<Element td at 84ee2cc>]
print tds[0].text   # what you want, including the 'í'
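A Python 3 variant of the snippet above, as a sketch only: it swaps `cssselect` for an XPath query, on the assumption that the separate `cssselect` package may not be installed alongside lxml. Note that `@class="author"` matches only when the attribute is exactly `author`, so it is less forgiving than the CSS selector for cells carrying several classes.

```python
import lxml.html

html = """<html><body><div id="container"><div id="contents"><table><tbody><tr>
          <td class="author">####I want whatever is located here, eh? &iacute; ###</td>
          </tr></tbody></table></div></div></body></html>"""

root = lxml.html.fromstring(html)

# Equivalent in intent to the CSS selector "div#contents td.author".
tds = root.xpath('//div[@id="contents"]//td[@class="author"]')

# text_content() also collects text from nested tags; the parser has
# already decoded &iacute; into the character í.
texts = [td.text_content().strip() for td in tds]
print(texts)
```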

score 1 · Accepted Answer

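BeautifulSoup is certainly the canonical HTML parser/processor. But if you just need to match this kind of fragment, rather than build a whole hierarchical object representing the HTML, pyparsing makes it easy to define leading and trailing HTML tags as part of a larger search expression: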

from pyparsing import makeHTMLTags, withAttribute, SkipTo

author_td, end_td = makeHTMLTags("td")

# only interested in <td>'s where class="author"
author_td.setParseAction(withAttribute(("class","author")))

search = author_td + SkipTo(end_td)("body") + end_td

for match in search.searchString(html):
    print match.body

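Pyparsing's makeHTMLTags helper does a lot more than just emit "<tag>" and "</tag>" expressions. It also handles:

caseless matching of tags
"<tag/>" syntax
zero or more attributes in the opening tag
attributes defined in arbitrary order
attribute names with namespaces
attribute values in single quotes, double quotes, or no quotes
intervening whitespace between the tag and symbols, or between the attribute name, '=', and value
attributes accessible after parsing as named results

These are the common pitfalls when considering a regex for HTML scraping.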
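To make a few of these pitfalls concrete, here is a small sketch that compares a naive regex against a tag-aware parser. It uses the standard library's `html.parser` rather than pyparsing, purely so it runs without extra packages; the `AuthorCells` class and the sample markup are hypothetical and not part of any answer above.

```python
import re
from html.parser import HTMLParser

# Four spellings of the same cell: canonical, upper case with single quotes,
# extra attribute plus stray whitespace, and an unquoted attribute value.
samples = (
    '<td class="author">A</td>'
    "<TD CLASS='author'>B</TD>"
    '<td id="x" class="author" >C</td>'
    "<td class=author>D</td>"
)

# The naive pattern only recognises the first spelling.
naive = re.findall(r'<td class="author">(.*?)</td>', samples)


class AuthorCells(HTMLParser):
    """Collects the text of every <td> whose class attribute is 'author'."""

    def __init__(self):
        super().__init__()
        self.in_author = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; quoting is already resolved.
        if tag == "td" and dict(attrs).get("class") == "author":
            self.in_author = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_author = False

    def handle_data(self, data):
        if self.in_author:
            self.cells[-1] += data


parser = AuthorCells()
parser.feed(samples)
parser.close()

print(naive)         # ['A']
print(parser.cells)  # ['A', 'B', 'C', 'D']
```

The regex could be extended to cover each variation, but every added case makes the pattern harder to read, which is the point the list above is making.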
