python - Python 3 Beautiful Soup data type incompatibility issue

Question

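Hello, Stack community!

I've run into a problem that I can't seem to solve, because most of the help out there seems to be aimed at Python 2.7.

I want to extract a table from a web page and then get only the link text rather than the whole anchor.

Here is the code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re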

url = 'http://www.craftcount.com/category.php?cat=5'

html = urlopen(url).read()
soup = BeautifulSoup(html)
alltables = soup.findAll("table")

## This bit captures the input from the previous sequence
results=[]
for link in alltables:
    rows = link.findAll('a')
## Find just the names
    top100 = re.findall(r">(.*?)<\/a>",rows)
print(top100)

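When I run it, I get: "TypeError: expected string or buffer"

Everything up to the second-to-last line works correctly (I checked by replacing 'print(top100)' with 'print(rows)').

Here is an example of the output I get: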

<a href="http://www.etsy.com/shop.php?user_id=5323531"target="_blank">michellechangjewelry</a>

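I only need to get: michellechangjewelry

According to pythex.org, my (ir)regular expression should work, so I wanted to see if anyone here knows how to do this. As a side issue, most people seem to want to go the other way, that is, they have the full text and only want the URL part.

Finally, I'm using BeautifulSoup out of "convenience", but I'm not tied to it if you can suggest a better package for narrowing the parsing down to link text.

Thanks in advance!!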

score 1 · Accepted Answer

BeautifulSoup results are not strings; they are mostly Tag objects.

To look up the text of an <a> tag, use its .string attribute:
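To see where the TypeError comes from, here is a minimal sketch that uses only the standard library. A plain list stands in for what findAll('a') returns (an assumption for illustration, since the real result is a list-like collection of Tag objects), and the names error_raised and names are hypothetical.

```python
import re

# Stand-in for the result of findAll('a'): a list-like container, not a string.
# (Assumption: a plain list of strings replaces the real collection of Tag objects.)
rows = ['<a href="http://www.etsy.com/shop.php?user_id=5323531"target="_blank">michellechangjewelry</a>']

pattern = r">(.*?)</a>"

# re.findall only accepts a string (or bytes), so passing the container fails.
try:
    re.findall(pattern, rows)
    error_raised = False
except TypeError:
    error_raised = True

# Applying the pattern to each element, converted to a string, does work.
names = [name for row in rows for name in re.findall(pattern, str(row))]
print(error_raised, names)
```

So the regex itself is fine; it is the argument type that re.findall rejects.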

for table in alltables:
    link = table.find('a')
    top100 = link.string
    print(top100)

This finds the first <a> link in each table. To get the text of all the links:

for table in alltables:
    links = table.find_all('a')
    top100 = [link.string for link in links]
    print(top100)
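One caveat worth knowing: .string returns None when a tag has more than one child, while get_text() joins all the nested text. The sketch below runs against an inline HTML snippet instead of the live page; the second row is a made-up example, and the names strings and texts are hypothetical.

```python
from bs4 import BeautifulSoup

# Inline sample instead of a network request.
# The first link is taken from the question; the second is invented for illustration.
html = """
<table>
  <tr><td><a href="http://www.etsy.com/shop.php?user_id=5323531" target="_blank">michellechangjewelry</a></td></tr>
  <tr><td><a href="#"><b>bold</b> shop</a></td></tr>
</table>
"""

# Naming the parser explicitly avoids the "no parser specified" warning.
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

# .string is None for the second link because it has two children.
strings = [link.string for link in links]

# get_text() concatenates the text of all descendants.
texts = [link.get_text() for link in links]

print(strings)
print(texts)
```

If every link on the page contains plain text only, the two approaches should give the same result.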
