python - Extracting a link from a string using Python

Question

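First, what I want to do is ask the user for a search term. The program then searches Yahoo and prints the link of the first result. Here is my code so far.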

from urllib import urlopen

import re, time
from BeautifulSoup import BeautifulSoup


print "What Would You Like to Search For?"

user_input = raw_input('') #Gets Search Term from User



search = "http://search.yahoo.com/search;_ylt=A2KLtaJX_1BQfT4AwX2bvZx4?p=baker&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t-701" 

new_search = search.replace('baker', user_input)           
content = urlopen( new_search ).read()                       

soupcontent = BeautifulSoup(content)                    


link1 = soupcontent.find(id="link-1")            
print link1

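Everything works: it takes the user's input and searches Yahoo. The problem I'm having is this. Say I search for "dogs".

The program then prints something like this:

<a id="link-1" class="yschttl spt" href="http://www.dog.com/" data-bk="5101.1"><b>Dog</b> Supplies | <b>Dog</b> Food, <b>Dog</b> Beds, <b>Dog</b> <wbr></wbr>Flea Control &amp; More ...</a>

That is indeed the first link on the page, but I only want it to print "http://www.dog.com/". Can anyone help me?

Thanks.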

score 1 · Accepted Answer

Try using a regular expression. See: http://docs.python.org/library/re.html

match = re.search(r'href="(http://.*?)"', str(link1))
print match.group(1)
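
One thing to watch with the snippet above: `re.search` returns `None` when nothing matches, so `match.group(1)` raises an `AttributeError` if the tag has no `href`. Here is a minimal sketch of the same idea with that case handled. It is written for Python 3, and the helper name `extract_href` and the sample string are made up for illustration. The pattern is also loosened to accept `https`, which is an assumption beyond the original answer.

```python
import re

def extract_href(tag_html):
    """Return the first http(s) href value found in tag_html, or None."""
    # Non-greedy match so we stop at the first closing quote.
    match = re.search(r'href="(https?://.*?)"', tag_html)
    return match.group(1) if match else None

# Hypothetical sample, shaped like the tag printed in the question.
sample = '<a id="link-1" class="yschttl spt" href="http://www.dog.com/">Dog Supplies</a>'
print(extract_href(sample))                             # http://www.dog.com/
print(extract_href('<a id="link-1">no link here</a>'))  # None
```

With the question's code you would pass `str(link1)` as the argument.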

score 1 · Accepted Answer

BeautifulSoup actually makes this very easy:

>>> from bs4 import BeautifulSoup
>>> from urllib2 import urlopen
>>> 
>>> url = 'http://search.yahoo.com/search?p=dog'
>>> content = urlopen(url).read()
>>> soup = BeautifulSoup(content)
>>> 
>>> soup.find(id="link-1")
<a class="yschttl spt" data-bk="5097.1" href="http://www.dog.com/" id="link-1"><b>Dog</b> Supplies | <b>Dog</b> Food, <b>Dog</b> Beds, <b>Dog</b> <wbr></wbr>Flea Control &amp; More ...</a>
>>> soup.find(id="link-1").get("href")
'http://www.dog.com/'

Given your request for UTF-8, you may see

 u'http://www.dog.com/'

instead; the Unicode version works just as well.

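Standard warning: be sure to check whether Yahoo!'s end-user license permits what you are doing, since many licenses rule out certain kinds of automated use.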
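
If BeautifulSoup is not available, the same "find the tag by id, then read its `href` attribute" lookup can be approximated with the standard library's `html.parser`. This is a swapped-in technique, not what the answer above uses. It is a rough sketch for Python 3, and the names `LinkByIdParser` and `href_for_id` and the sample HTML are hypothetical. It runs on a static string rather than a live Yahoo page, whose markup may no longer contain `link-1`.

```python
from html.parser import HTMLParser

class LinkByIdParser(HTMLParser):
    """Records the href of the first <a> tag with the given id that has an href."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.href = None

    def handle_starttag(self, tag, attrs):
        # Skip non-anchor tags, and stop looking once an href was recorded.
        if tag != "a" or self.href is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("id") == self.target_id:
            self.href = attr_map.get("href")

def href_for_id(html, target_id):
    """Return the href of the matching anchor, or None if there is none."""
    parser = LinkByIdParser(target_id)
    parser.feed(html)
    return parser.href

# Hypothetical sample, shaped like the tag shown in the answer above.
sample = ('<div><a id="link-1" class="yschttl spt" '
          'href="http://www.dog.com/"><b>Dog</b> Supplies</a></div>')
print(href_for_id(sample, "link-1"))  # http://www.dog.com/
print(href_for_id(sample, "link-2"))  # None
```

Returning `None` for a missing id means the caller has to check the result before using it.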

score 0 · Accepted Answer


link = your_full_link_string.split('href="')[1].split('"')[0]
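
The one-liner above raises an `IndexError` when the string contains no `href="`. Here is a small sketch of the same split-based idea using `str.partition`, which never raises. It is written for Python 3, and the function name `href_by_split` and the sample string are made up for illustration.

```python
def href_by_split(tag_html):
    """Return the text between the first 'href="' and the next '"', or None."""
    _, marker, rest = tag_html.partition('href="')
    if not marker:
        # No href attribute in the string at all.
        return None
    value, closing_quote, _ = rest.partition('"')
    # If the closing quote is missing, the attribute is malformed.
    return value if closing_quote else None

# Hypothetical sample, shaped like the tag printed in the question.
sample = '<a id="link-1" class="yschttl spt" href="http://www.dog.com/">Dog Supplies</a>'
print(href_by_split(sample))                  # http://www.dog.com/
print(href_by_split('<a id="link-1">x</a>'))  # None
```

Like the original, this only handles double-quoted attributes and takes the first `href` in the string, so it suits a single tag rather than a whole page.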

answered 2012-09-13T00:50:13.597
