python - Parsing text in Python (Django)

Question

My text looks like this:

Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=478', text='SomeSite -  Professor Rating of Louis Scerbo', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=478'), ('h', 'ID=SERP,5105.1')])Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=527', text='SomeSite -  Professor Rating of Jahan \xe2\x80\xa6', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=527'), ('h', 'ID=SERP,5118.1')])Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=645', text='SomeSite -  Professor Rating of David Kutzik', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=645'), ('h', 'ID=SERP,5131.1')])

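Questions

Does anyone know what format this text is in?
For example, how would I parse the value of the url element (from the text above): http://www.somesite.com/prof.php?pID=478, http://www.somesite.com/prof.php?pID=527
Which Python library would you recommend for parsing this kind of output: XML, JSON, etc.?

I just want to loop through the text and parse out only the url values.

Keep in mind that I'm using Django.

Thanks for any help.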

Edit: current code:

domainLinkOutputAsString = str(domainLinkOutput) 

r = re.compile(r" url='(.*?)',")  # ERRONEOUS, must be 're' compliant.

ProperDomains = r.findall(domainLinkOutputAsString)

return HttpResponse(ProperDomains)

score 1 · Accepted Answer

You can simply use a Python regular expression:

import re
text = "Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=478', text='SomeSite -  Professor Rating of Louis Scerbo', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=478'), ('h', 'ID=SERP,5105.1')])Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=527', text='SomeSite -  Professor Rating of Jahan \xe2\x80\xa6', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=527'), ('h', 'ID=SERP,5118.1')])Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=645', text='SomeSite -  Professor Rating of David Kutzik', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=645'), ('h', 'ID=SERP,5131.1')])"

# Create the regexp object to match the value of 'url'
r = re.compile(r" url='(.*?)',")

# Print all matches
print(r.findall(text))

>>>['http://www.somesite.com/prof.php?pID=478', 'http://www.somesite.com/prof.php?pID=527', 'http://www.somesite.com/prof.php?pID=645']
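To loop through the url values one at a time, as the question asks, the same pattern can be wrapped in a small helper. This is only a minimal sketch: the names extract_urls and URL_PATTERN and the shortened sample string are made up for illustration, and it assumes the input keeps the exact url='...' shape shown in the question. In a Django view, the resulting list could then be joined, for example with "\n".join(urls), before being passed to HttpResponse.

```python
import re

# Capture the quoted value that follows " url=". The leading space keeps
# base_url='...' from matching, because there "url" follows an underscore.
URL_PATTERN = re.compile(r" url='(.*?)'")


def extract_urls(link_text):
    """Return every url value found in the stringified Link output."""
    return URL_PATTERN.findall(link_text)


# Shortened, hypothetical sample in the same shape as the text in the question.
sample = (
    "Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', "
    "url='http://www.somesite.com/prof.php?pID=478', text='A', tag='a')"
    "Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', "
    "url='http://www.somesite.com/prof.php?pID=527', text='B', tag='a')"
)

# Loop through the extracted values one by one.
for url in extract_urls(sample):
    print(url)
```

A regular expression like this is brittle: it would break if a url value ever contained a single quote or if the field order changed, so it only fits output that stays in this format.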

score 0 · Accepted Answer

We have a Python library for fetching and parsing Google search results. It can be installed with pip install google-search-results

Usage:

from lib.google_search_results import GoogleSearchResults
query = GoogleSearchResults({"q": "coffee"})
html_results = query.get_html()

It works through the SERP API backend.

More comprehensive options:

query_params = {
  "q": "query",
  "google_domain": "Google Domain",
  "location": "Location Requested",
  "device": "Device Type",
  "hl": "Google UI Language",
  "gl": "Google Country",
  "safe": "Safe Search Flag",
  "num": "Number of Results",
  "start": "Pagination Offset",
  "serp_api_key": "Your SERP API Key"
}

query = GoogleSearchResults(query_params)
query.params_dict["location"] = "Portland"

html_results = query.get_html()
dictionary_results = query.get_dictionary()
dictionary_results_with_images = query.get_dictionary_with_images()
json_results = query.get_json()
json_results_with_images = query.get_json_with_images()
