python - Parsing JSON output with Mechanize and a Python Django view

Question

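I'm currently using Python and Mechanize to run a site search, e.g. entering site:somedomain.com into Bing.

It submits to Bing fine and returns output that looks like JSON. I can't seem to find a good way to parse the results further. Is it JSON?

I get output like the following: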

Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=478', text='SomeSite -  Professor Rating of Louis Scerbo', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=478'), ('h', 'ID=SERP,5105.1')])
Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=527', text='SomeSite -  Professor Rating of Jahan \xe2\x80\xa6', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=527'), ('h', 'ID=SERP,5118.1')])
Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=645', text='SomeSite -  Professor Rating of David Kutzik', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=645'), ('h', 'ID=SERP,5131.1')])

I want to get all the URLs, for example:

http://www.somesite.com/prof.php?pID=478
http://www.somesite.com/prof.php?pID=527
http://www.somesite.com/prof.php?pID=645

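and so on, i.e. the value of the url attribute in each entry.

How can I do this further in my code with mechanize? Keep in mind that some future URLs may look like this: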

http://www.anothersite.com/dir/dir/dir/send.php?pID=100

Thanks!

score 1 · Accepted Answer

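Well, mechanize is more of a browser-like Python package. For parsing HTML/XML I recommend lxml: you can feed that data to lxml and look up the URLs. Another option is to use a regular expression to find the URLs, which is more flexible.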

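import re

# Match each http/https URL up to the closing single quote.
url_regex = re.compile(r"https?:[^']+")
# html_text: the text to search, as a string.
urls = url_regex.findall(html_text)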

Edit:

Well, instead of printing output, just pass output rather than html_text to re.findall(), then print urls.
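As a rough sketch of that suggestion (Python 3, standard library only): the text below is an abbreviated copy of the output in the question, and the pattern is narrowed to the url='...' field so that base_url and the href entries are not returned as well. The Link namedtuple is a hypothetical stand-in for the objects mechanize returns, included only to show that, assuming they expose the same url attribute seen in the printed output, no string parsing is needed at all.

```python
import re
from collections import namedtuple

# Abbreviated copy of the printed output from the question (two links).
output = (
    "Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', "
    "url='http://www.somesite.com/prof.php?pID=478', tag='a')"
    "Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', "
    "url='http://www.somesite.com/prof.php?pID=527', tag='a')"
)

# Capture only the value of each url='...' field. The \b keeps "base_url"
# from matching, because "_" and "u" are both word characters.
url_regex = re.compile(r"\burl='(https?://[^']+)'")
urls = url_regex.findall(output)

# Hypothetical stand-in for mechanize's Link objects (assumption: the real
# objects expose the `url` attribute shown in the printed output).
Link = namedtuple("Link", "base_url url text tag attrs")
links = [
    Link("http://www.bing.com/search?q=site%3Asomesite.com",
         "http://www.somesite.com/prof.php?pID=478", "SomeSite", "a", []),
    Link("http://www.bing.com/search?q=site%3Aanothersite.com",
         "http://www.anothersite.com/dir/dir/dir/send.php?pID=100", "AnotherSite", "a", []),
]

# Reading the attribute directly works for any URL shape, including deeper paths.
urls_from_attr = [link.url for link in links]

for url in urls + urls_from_attr:
    print(url)
```

If the objects you get back really are Link instances rather than text, the list comprehension at the end is probably the simpler route.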

score 0 · Accepted Answer

Using Microsoft's Azure Datamarket API with Python requests, you can request a JSON string directly:

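import requests, urllib

# Percent-encode the query (Python 2: urllib.quote); safe='' also encodes '/'.
q = u'Hello World'
q = urllib.quote(q.encode('utf8'), '')
# The query is wrapped in single quotes (%27) inside the URL.
req = requests.get(
    u'https://api.datamarket.azure.com/Data.ashx/Bing/SearchWeb/Web?$format=JSON&Query=%%27%s%%27' % q,
    auth=('', u'YOUR_API_KEY')  # empty user name, API key as password
)
# print req.json()
results = req.json()['d']['results']
list_of_urls = [r['Url'] for r in results]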

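Depending on your input data, you may or may not need the .encode('utf8') part for q. A "site:xy.com" query should also work, but I haven't tested that. Also, we occasionally get some odd encodings back from Bing, so we had to re-encode the returned URLs like this: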

url = r['Url'].encode('latin1')

But those really are special cases...
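To illustrate what that re-encoding does, here is a small Python 3 sketch. The sample URL is made up, and it assumes the garbling comes from UTF-8 bytes that were wrongly decoded as Latin-1, which is one common cause and may not be the only one you run into. The helper name fix_latin1_mojibake is hypothetical.

```python
def fix_latin1_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return text unchanged if that does not apply."""
    try:
        # Recover the original bytes, then decode them as the UTF-8 they really were.
        return text.encode("latin1").decode("utf8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not Latin-1 encodable, or the bytes are not valid UTF-8: leave it alone.
        return text

# Made-up example: "caf\u00e9" whose UTF-8 bytes were decoded as Latin-1.
garbled = "http://www.somesite.com/caf\u00c3\u00a9"
fixed = fix_latin1_mojibake(garbled)
print(fixed)
```

Plain ASCII URLs pass through unchanged, so applying the helper to every result should be harmless under that assumption.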

You need to sign up for the Azure API (free); up to 5000 Bing search requests per month are free: http://datamarket.azure.com/dataset/bing/search

There are several parameters for fine-tuning your results: http://datamarket.azure.com/dataset/bing/search#schema
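For anyone on Python 3, the quoting and extraction steps from this answer could be sketched roughly as follows, without making a network call. The helper names build_search_url and extract_urls are hypothetical, and sample_body is a made-up payload that contains only the keys the code above reads (d, results, Url); a real response may carry more fields.

```python
import json
from urllib.parse import quote

# Endpoint taken from the answer above.
BASE = "https://api.datamarket.azure.com/Data.ashx/Bing/SearchWeb/Web"

def build_search_url(query):
    """Percent-encode the query and wrap it in %27 (single quotes)."""
    encoded = quote(query.encode("utf8"), safe="")
    return "%s?$format=JSON&Query=%%27%s%%27" % (BASE, encoded)

def extract_urls(payload):
    """Collect the Url field of every entry under d -> results."""
    return [r["Url"] for r in payload["d"]["results"]]

# Made-up response body with the same nesting the answer's code expects.
sample_body = json.dumps({
    "d": {"results": [
        {"Url": "http://www.somesite.com/prof.php?pID=478"},
        {"Url": "http://www.somesite.com/prof.php?pID=527"},
    ]}
})

request_url = build_search_url("site:somesite.com")
list_of_urls = extract_urls(json.loads(sample_body))
print(request_url)
print(list_of_urls)
```

With requests, the value of request_url would be passed to requests.get() together with the auth tuple, and req.json() would take the place of json.loads(sample_body).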
