
My text looks like this:

Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=478', text='SomeSite -  Professor Rating of Louis Scerbo', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=478'), ('h', 'ID=SERP,5105.1')])Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=527', text='SomeSite -  Professor Rating of Jahan \xe2\x80\xa6', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=527'), ('h', 'ID=SERP,5118.1')])Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=645', text='SomeSite -  Professor Rating of David Kutzik', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=645'), ('h', 'ID=SERP,5131.1')])

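Questions

  1. Does anyone know what format this text is in?

  2. For example, how would I parse out the values of the url element (from the text above): http://www.somesite.com/prof.php?pID=478 http://www.somesite.com/prof.php?pID=527

  3. What Python library would you recommend for parsing this type of output: XML, JSON, etc.?

I just want to loop through the text and parse out only the values of url.

Keep in mind that I am using Django.

Thanks for any help.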

Edit: *Current code:*

domainLinkOutputAsString = str(domainLinkOutput) 

r = re.compile(" url='(.*?)',", )  ## ERRONEOUS, must be 're' compliant.

ProperDomains = r.findall(domainLinkOutputAsString)

return HttpResponse(ProperDomains)

2 Answers


You can simply use Python's regular expression module, re:

import re
text = "Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=478', text='SomeSite -  Professor Rating of Louis Scerbo', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=478'), ('h', 'ID=SERP,5105.1')])Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=527', text='SomeSite -  Professor Rating of Jahan \xe2\x80\xa6', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=527'), ('h', 'ID=SERP,5118.1')])Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=645', text='SomeSite -  Professor Rating of David Kutzik', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=645'), ('h', 'ID=SERP,5131.1')])"

# Create the regexp object to match the value of 'url'
# The leading space stops the pattern from matching inside "base_url=..."
r = re.compile(r" url='(.*?)',")

# Print all matches
print(r.findall(text))

>>>['http://www.somesite.com/prof.php?pID=478', 'http://www.somesite.com/prof.php?pID=527', 'http://www.somesite.com/prof.php?pID=645']
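If you want to loop over the URLs one at a time rather than collect a list, the same pattern can be used with re.finditer. The following is only a sketch under a few assumptions: it targets Python 3, `sample` is a shortened copy of the text from the question, and `iter_urls` and `extract_pids` are hypothetical helper names, not part of any library.

```python
import re
from urllib.parse import urlparse, parse_qs

# Shortened sample in the same shape as the text in the question (two entries).
sample = (
    "Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', "
    "url='http://www.somesite.com/prof.php?pID=478', "
    "text='SomeSite -  Professor Rating of Louis Scerbo', tag='a', "
    "attrs=[('href', 'http://www.somesite.com/prof.php?pID=478'), "
    "('h', 'ID=SERP,5105.1')])"
    "Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', "
    "url='http://www.somesite.com/prof.php?pID=527', "
    "text='SomeSite -  Professor Rating of Jahan', tag='a', "
    "attrs=[('href', 'http://www.somesite.com/prof.php?pID=527'), "
    "('h', 'ID=SERP,5118.1')])"
)

# The leading space keeps the pattern from matching "base_url=u'...".
URL_PATTERN = re.compile(r" url='(.*?)',")


def iter_urls(text):
    """Yield each url='...' value in order of appearance (hypothetical helper)."""
    for match in URL_PATTERN.finditer(text):
        yield match.group(1)


def extract_pids(text):
    """Pair each URL with its pID query parameter, or None if it has none."""
    pairs = []
    for url in iter_urls(text):
        query = parse_qs(urlparse(url).query)
        pairs.append((url, query.get("pID", [None])[0]))
    return pairs


for url, pid in extract_pids(sample):
    print(url, pid)
```

In a Django view, you would probably want to return a single string, for example "\n".join(iter_urls(text)), rather than the list itself. Also, the text looks like the repr of link objects from a scraping library; if that assumption holds and the original objects are still available, reading each object's url attribute directly would avoid converting to a string and back.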
answered 2013-08-09T01:27:29.517

We have a Python library for fetching and parsing Google search results. It can be installed with pip install google-search-results

Usage:

from lib.google_search_results import GoogleSearchResults
query = GoogleSearchResults({"q": "coffee"})
html_results = query.get_html()

It works through the SERP API backend.

A more complete set of options:

query_params = {
  "q": "query",
  "google_domain": "Google Domain",
  "location": "Location Requested",
  "device": "Device",
  "hl": "Google UI Language",
  "gl": "Google Country",
  "safe": "Safe Search Flag",
  "num": "Number of Results",
  "start": "Pagination Offset",
  "serp_api_key": "Your SERP API Key"
}

query = GoogleSearchResults(query_params)
query.params_dict["location"] = "Portland"

html_results = query.get_html()
dictionary_results = query.get_dictionary()
dictionary_results_with_images = query.get_dictionary_with_images()
json_results = query.get_json()
json_results_with_images = query.get_json_with_images()
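
To tie this back to the question, here is a rough sketch of pulling only the result URLs out of the dictionary form. The layout of that dictionary is not shown above, so the "organic_results" and "link" keys are assumptions, `extract_links` is a hypothetical helper, and `example_results` is hand-written stand-in data rather than real API output.

```python
def extract_links(results):
    """Collect result URLs from a parsed results dictionary.

    ASSUMPTION: results look like {"organic_results": [{"link": "..."}, ...]}.
    Entries without a "link" key are skipped.
    """
    links = []
    for item in results.get("organic_results", []):
        link = item.get("link")
        if link:
            links.append(link)
    return links


# Hand-written stand-in data, not real API output.
example_results = {
    "organic_results": [
        {"title": "Example A", "link": "http://www.somesite.com/prof.php?pID=478"},
        {"title": "Example B", "link": "http://www.somesite.com/prof.php?pID=527"},
        {"title": "Entry without a link"},
    ]
}

for link in extract_links(example_results):
    print(link)
```

If the real dictionary uses different key names, only the two string keys in `extract_links` should need to change.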
answered 2018-02-24T00:13:01.140