python - How can I translate this XPath expression to BeautifulSoup?

Question

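In answers to a previous question, several people suggested that I use BeautifulSoup for my project. I have been struggling with its documentation and I just cannot work it out. Can somebody point me to the section that would let me translate this expression into a BeautifulSoup expression?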

hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+')

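The expression above is from Scrapy. I'm trying to apply the regex re('/.a\w+') to the td elements with class altRow to get the links from there.

I would also appreciate pointers to any other tutorials or documentation. I couldn't find any.

Thanks for your help.

EDIT: I am looking at this page: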

>>> soup.head.title
<title>White & Case LLP - Lawyers</title>
>>> soup.find(href=re.compile("/cabel"))
>>> soup.find(href=re.compile("/diversity"))
<a href="/diversity/committee">Committee</a>

But if you look at the page source, "/cabel" is there:

 <td class="altRow" valign="middle" width="34%"> 
 <a href='/cabel'>Abel, Christian</a>

For some reason the search results are not visible to BeautifulSoup, but they are visible to XPath, because hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+') catches "/cabel".

EDIT: cobbal: It is still not working. But when I search for this:

>>>soup.findAll(href=re.compile(r'/.a\w+'))
[<link href="/FCWSite/Include/styles/main.css" rel="stylesheet" type="text/css" />, <link rel="shortcut icon" type="image/ico" href="/FCWSite/Include/main_favicon.ico" />, <a href="/careers/northamerica">North America</a>, <a href="/careers/middleeastafrica">Middle East Africa</a>, <a href="/careers/europe">Europe</a>, <a href="/careers/latinamerica">Latin America</a>, <a href="/careers/asia">Asia</a>, <a href="/diversity/manager">Diversity Director</a>]
>>>

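It returns all the links whose second character is "a", but not the lawyer names. So for some reason those links (such as "/cabel") are not visible to BeautifulSoup. I don't understand why.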
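To make it clearer what the pattern itself accepts, independently of any HTML parser, here is a minimal sketch using only the standard `re` module. The hrefs are copied from elsewhere in this thread; the `results` dictionary is just an illustrative name. It assumes the regex is applied with `search()`, which is consistent with the findAll output shown above (for example, "/diversity/manager" only matches part of the way into the string).

```python
import re

# The pattern from the question: "/", any one character, "a", then word characters.
pattern = re.compile(r'/.a\w+')

# Sample hrefs quoted in this thread.
hrefs = [
    "/cabel",
    "/jzhu",
    "/careers/europe",
    "/diversity/manager",
    "/FCWSite/Include/styles/main.css",
]

# Map each href to the substring the pattern matches, or None if it never matches.
results = {}
for href in hrefs:
    match = pattern.search(href)  # search() scans the whole string, not just the start
    results[href] = match.group(0) if match else None

for href, found in results.items():
    print(href, "->", found)
```

So the pattern does accept "/cabel"; if that link is missing from the findAll results, the regex is not what is filtering it out.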

score 6 · Accepted Answer

One option is to use lxml (I'm not familiar with BeautifulSoup, so I can't say how to do it with that), which supports XPath by default.

Edit:
Try (~~untested~~ tested):

soup.findAll('td', 'altRow')[1].findAll('a', href=re.compile(r'/.a\w+'), recursive=False)

I used the docs at http://www.crummy.com/software/BeautifulSoup/documentation.html

soup should be a BeautifulSoup object:

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html_string)
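One subtlety when comparing the two forms: in XPath, `//td[@class="altRow"][2]` selects, under every parent, the second td child whose class is altRow, whereas `findAll('td', 'altRow')[1]` selects only the second such td in the whole document. The following is a rough standard-library sketch of the per-parent XPath semantics. The class name `AltRowLinks` and the two-row sample table are made up for illustration, `html.parser` is a stand-in rather than what the answers here use, and the real page layout may differ.

```python
import re
from html.parser import HTMLParser

# Elements that never get a closing tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class AltRowLinks(HTMLParser):
    """Collect hrefs roughly equivalent to //td[@class="altRow"][2]/a/@href."""

    def __init__(self):
        super().__init__()
        # Each frame: [tag, is_second_altrow_td, altRow td children seen so far]
        self.stack = [["#document", False, 0]]
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        parent = self.stack[-1]
        is_target = False
        if tag == "td" and attrs.get("class") == "altRow":
            parent[2] += 1                # count altRow cells per parent
            is_target = parent[2] == 2    # the [2] predicate
        elif tag == "a" and parent[1] and "href" in attrs:
            self.hrefs.append(attrs["href"])  # direct <a> child of the target td
        if tag not in VOID:
            self.stack.append([tag, is_target, 0])

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; assumes reasonably well-formed input.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


# Hypothetical layout: two altRow cells per row, names taken from this thread.
html = """
<table>
  <tr>
    <td class="altRow"><a href="/jacevedo">Acevedo, Linda Jeannine</a></td>
    <td class="altRow"><a href='/cabel'>Abel, Christian</a></td>
  </tr>
  <tr>
    <td class="altRow"><a href="/aadler">Adler, Avraham</a></td>
    <td class="altRow"><a href="/jzhu">Zhu, Jie</a></td>
  </tr>
</table>
"""

parser = AltRowLinks()
parser.feed(html)
second_cell_hrefs = parser.hrefs

# Apply the regex from the question to each href, keeping only the matches.
matched = [m.group(0)
           for m in (re.search(r'/.a\w+', h) for h in second_cell_hrefs)
           if m]
print(second_cell_hrefs, matched)
```

On this sample the second altRow cell of each row contributes its link, while an index of `[1]` on a document-wide list would look at the first row only.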

score 4 · Accepted Answer

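I know BeautifulSoup is the canonical HTML parsing module, but sometimes you just want to scrape a few substrings out of some HTML, and pyparsing has some useful methods for doing that. Using this code: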

from pyparsing import makeHTMLTags, withAttribute, SkipTo
import urllib

# get the HTML from your URL
url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
page = urllib.urlopen(url)
html = page.read()
page.close()

# define opening and closing tag expressions for <td> and <a> tags
# (makeHTMLTags also comprehends tag variations, including attributes, 
# upper/lower case, etc.)
tdStart,tdEnd = makeHTMLTags("td")
aStart,aEnd = makeHTMLTags("a")

# only interested in tdStarts if they have "class=altRow" attribute
tdStart.setParseAction(withAttribute(("class","altRow")))

# compose total matching pattern (add trailing tdStart to filter out 
# extraneous <td> matches)
patt = tdStart + aStart("a") + SkipTo(aEnd)("text") + aEnd + tdEnd + tdStart

# scan input HTML source for matching refs, and print out the text and 
# href values
for ref,s,e in patt.scanString(html):
    print ref.text, ref.a.href

I extracted 914 references from your page, from Abel to Zupikova.

Abel, Christian /cabel
Acevedo, Linda Jeannine /jacevedo
Acuña, Jennifer /jacuna
Adeyemi, Ike /igbadegesin
Adler, Avraham /aadler
...
Zhu, Jie /jzhu
Zídek, Aleš /azidek
Ziółek, Agnieszka /aziolek
Zitter, Adam /azitter
Zupikova, Jana /jzupikova

score 2 · Accepted Answer

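I just answered this on the Beautiful Soup mailing list, in reply to Zeynel's email to the list. Basically, there is an error in the web page that totally kills Beautiful Soup 3.1 during parsing, but is merely mangled by Beautiful Soup 3.0.

The thread is in the Google Groups archive.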

score 1 · Accepted Answer

It seems you are using BeautifulSoup 3.1.

I suggest reverting to BeautifulSoup 3.0.7 (because of this problem).

I just tested with 3.0.7 and got the results you expect:

>>> soup.findAll(href=re.compile(r'/cabel'))
[<a href="/cabel">Abel, Christian</a>]

Testing with BeautifulSoup 3.1 gives the results you are seeing. There is probably a malformed tag in the HTML, but I didn't spot what it was at a quick glance.
