python - How to use Beautiful Soup to get plaintext and URLs from an HTML document?

Question

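I use Python and regular expressions to find things in HTML documents and, unlike what most people say, it works well, even though things could go wrong. Anyway, I figured Beautiful Soup would be faster and easier, but I don't really know how to make it do what I did with regex, which was easy but messy.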

I am using the HTML from this page:

http://www.locationary.com/places/duplicates.jsp?inPID=1000000001

EDIT:

Here is the HTML for the main place:

<tr>
<td class="Large Bold" nowrap="nowrap">Riverside Tower Hotel&nbsp;</td>
<td class="Large Bold" width="100%">80 Riverside Drive, New York, New York, United States</td>
<td class="Large Bold" nowrap="nowrap" width="55">&nbsp;<input name="selectCheckBox" type="checkbox" checked="checked" disabled="disabled" />Yes
</td>
</tr>

An example of the first similar place:

<td class="" nowrap="nowrap"><a href="http://www.locationary.com/place/en/US/New_York/New_York/54_Riverside_Dr_Owners_Corp-p1009633680.jsp" target="_blank">54 Riverside Dr Owners Corp</a></td>
<td width="100%">&nbsp;54 Riverside Dr, New York, New York, United States</td>
<td nowrap="nowrap" width="55">

When my program fetches the page and I run it through Beautiful Soup to make it more readable, the HTML comes out a little different from Firefox's "view source"... I don't know why.
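One possible explanation, offered here as an assumption rather than a diagnosis of this particular page: Beautiful Soup parses the markup into a tree and writes it back out in its own normalized form, so entities and void tags may not round-trip byte for byte. The sketch below uses the `bs4` package (Beautiful Soup 4) on Python 3 with the standard-library `html.parser` backend; the fragment in `raw` is a made-up example, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: an &nbsp; entity plus a <br> written without a slash.
raw = '<td nowrap="nowrap">Riverside Tower Hotel&nbsp;<br></td>'

soup = BeautifulSoup(raw, "html.parser")

# Serializing the tree does not reproduce the input exactly:
# &nbsp; comes back as the literal U+00A0 character and <br> as <br/>.
reserialized = str(soup)

print(repr(raw))
print(repr(reserialized))

# prettify() additionally re-indents the markup, one node per line.
print(soup.prettify())
```

Differences like these also matter for regular expressions written against the raw source, since a pattern containing `&nbsp;` will not match the re-serialized text.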

These are my regular expressions:

PlaceName = re.findall(r'"nowrap">(.*)&nbsp;</td>', main)

PlaceAddress = re.findall(r'width="100%">(.*)</td>\n<td class="Large Bold"', main)

cNames = re.findall(r'target="_blank">(.*)</a></td>\n<td width="100%">&nbsp;', main)

cAddresses = re.findall(r'<td width="100%">&nbsp;(.*)</td>\n<td nowrap="nowrap" width="55">', main)

cURLs = re.findall(r'<td class="" nowrap="nowrap"><a href="(.*)" target="_blank">', main)

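The first two are for the main place and its address. The rest are for the information on the other places. After running these, I decided that I only need the first 5 results for cNames, cAddresses, and cURLs, because I don't need 91 or however many there are.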
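For illustration, trimming a result list to its first five entries can be done with a slice. This is a minimal sketch: `main` here is a hypothetical stand-in built from made-up `example.com` rows, not the real page, and only the cURLs pattern from above is reused.

```python
import re

# Hypothetical stand-in for the downloaded page source (`main` in the question):
# seven rows shaped like the "similar place" cells quoted above.
main = "\n".join(
    '<td class="" nowrap="nowrap"><a href="http://example.com/p%d.jsp" '
    'target="_blank">Place %d</a></td>' % (i, i)
    for i in range(1, 8)
)

# Same pattern as the question's cURLs regex.
cURLs = re.findall(r'<td class="" nowrap="nowrap"><a href="(.*)" target="_blank">', main)

# A slice keeps at most five items and never raises,
# even when fewer than five matches were found.
first_five = cURLs[:5]
print(first_five)
```

The same `[:5]` slice would apply to cNames and cAddresses.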

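I don't know how to find this kind of information with Beautiful Soup. All I can do with it is find specific tags and do things with them. This HTML is a bit complicated because all the information I want is in tables, and the table tags are kind of a mess...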

How do you get that information and limit it to the first 5 results or so?

Thanks.

score 3 · Accepted Answer

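People say you can't parse HTML with regular expressions for a reason, but here is a simple one that applies to your regexes: you have \n and &nbsp; hard-coded in them, and those can and will change at random on the page you are trying to parse. When that happens, your regexes won't match and your code will stop working.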
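To make the failure mode concrete, here is a small sketch using the PlaceAddress pattern from the question. The `reformatted` string is a hypothetical variation (Windows line endings plus indentation) of the kind a server could start emitting at any time; it is not taken from the real page.

```python
import re

# The question's PlaceAddress pattern, which hard-codes a newline between cells.
pattern = r'width="100%">(.*)</td>\n<td class="Large Bold"'

# Markup shaped like the question's sample: cells separated by exactly "\n".
original = (
    '<td class="Large Bold" width="100%">80 Riverside Drive, New York</td>\n'
    '<td class="Large Bold" nowrap="nowrap" width="55">'
)

# Hypothetical variation: same content, but "\r\n" and indentation between cells.
reformatted = original.replace("\n", "\r\n    ")

matches_original = re.findall(pattern, original)
matches_reformatted = re.findall(pattern, reformatted)

print(matches_original)     # the address is found
print(matches_reformatted)  # nothing is found, although the data is unchanged
```

An HTML parser treats both inputs as the same two table cells, which is why it is less brittle for this kind of job.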

However, the task you want to perform is really simple:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(open('this-stackoverflow-page.html'))

for anchor in soup('a'):
    print anchor.contents, anchor.get('href')

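yields all the anchor tags, no matter where they appear in the deeply nested structure of this page. Here are a few lines excerpted from the output of that short script: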

[u'Stack Exchange'] http://stackexchange.com
[u'msw'] /users/282912/msw
[u'faq'] /faq
[u'Stack Overflow'] /
[u'Questions'] /questions
[u'How to use Beautiful Soup to get plaintext and URLs from an HTML document?'] /questions/11902974/how-to-use-beautiful-soup-to-get-plaintext-and-urls-from-an-html-document
[u'http://www.locationary.com/places/duplicates.jsp?inPID=1000000001'] http://www.locationary.com/places/duplicates.jsp?inPID=1000000001
[u'python'] /questions/tagged/python
[u'beautifulsoup'] /questions/tagged/beautifulsoup
[u'Marcus Johnson'] /users/1587751/marcus-johnson

It is hard to imagine less code doing that much work for you.
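The snippet above is written for Python 2 and the older BeautifulSoup 3 import. As a rough sketch of how the same idea might be applied to the question's table on Python 3 with the `bs4` package, the code below pulls the main place and up to five similar places. The `html` string is a hypothetical excerpt modelled on the rows quoted in the question, the helper name `similar_places` is made up for illustration, and the code assumes that every similar place is an anchor with target="_blank" whose address sits in the next td; the real page may contain other such links that would need filtering.

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt modelled on the rows quoted in the question.
html = """
<table>
<tr>
<td class="Large Bold" nowrap="nowrap">Riverside Tower Hotel&nbsp;</td>
<td class="Large Bold" width="100%">80 Riverside Drive, New York, New York, United States</td>
</tr>
<tr>
<td class="" nowrap="nowrap"><a href="http://www.locationary.com/place/en/US/New_York/New_York/54_Riverside_Dr_Owners_Corp-p1009633680.jsp" target="_blank">54 Riverside Dr Owners Corp</a></td>
<td width="100%">&nbsp;54 Riverside Dr, New York, New York, United States</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Main place: the first two cells carrying the exact class string "Large Bold".
main_cells = soup.find_all("td", class_="Large Bold", limit=2)
main_name = main_cells[0].get_text(strip=True)
main_address = main_cells[1].get_text(strip=True)


def similar_places(soup, limit=5):
    """Collect name, URL and address for at most `limit` linked places."""
    places = []
    # find_all stops searching once `limit` matching anchors have been found.
    for anchor in soup.find_all("a", target="_blank", limit=limit):
        cell = anchor.find_parent("td")
        # Assumption: the address is in the td right after the name's td.
        address_cell = cell.find_next_sibling("td") if cell else None
        places.append({
            "name": anchor.get_text(strip=True),
            "url": anchor.get("href"),
            # strip=True also removes the leading non-breaking space.
            "address": address_cell.get_text(strip=True) if address_cell else None,
        })
    return places


places = similar_places(soup, limit=5)
print(main_name, "|", main_address)
for place in places:
    print(place["name"], "|", place["address"], "|", place["url"])
```

Because the lookup is by tag and attribute rather than by exact surrounding text, changes in whitespace or in the &nbsp; padding should not affect the result.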
