python - Scraping the Google Scholar security page

Question

I have a string like this:

url = 'http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10'

I want to convert it to:

converted_url = 'https://scholar.google.pl/citations?view_op=search_authors&hl=en&mauthors=label:security&after_author=rukAAOJ8__8J&astart=10'

I tried this:

converted_url = url.decode('utf-8')

However, it raises this error:

AttributeError: 'str' object has no attribute 'decode'

score 1 · Accepted Answer

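You can let requests handle this for you: pass the query parameters as a dict and it builds the encoded URL, so there is nothing to decode by hand.

Note: the after_author URL parameter is a next-page token. If you request the exact URL you provided, the returned HTML will not be what you expect, because after_author changes on every request. For example, in my case it was uB8AAEFN__8J, while in your URL it is rukAAOJ8__8J.

To make it work, you need to parse the next-page token from the first page, which leads to the second page, and so on. For example: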

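# from my other answer: 
# https://github.com/dimitryzub/stackoverflow-answers-archive/blob/main/answers/scrape_all_scholar_profiles_bs4.py

from bs4 import BeautifulSoup
import requests, re

params = {
    "view_op": "search_authors",
    "mauthors": "valve",
    "hl": "pl",
    "astart": 0
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

authors_is_present = True
while authors_is_present:
    # request the current page and parse it
    html = requests.get("https://scholar.google.pl/citations", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "html.parser")

    # ... extract the profiles from `soup` here ...

    # if next page is present -> update next page token and increment to the next page
    # if next page is not present -> exit the while loop
    next_button = soup.select_one("button.gs_btnPR")
    token = re.search(r"after_author\\x3d(.*?)\\x26", str(next_button.get("onclick", ""))) if next_button else None
    if token:
        params["after_author"] = token.group(1)  # -> XB0HAMS9__8J
        params["astart"] += 10
    else:
        authors_is_present = False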
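The token extraction step can be tried in isolation, without any network request. The following is only a minimal sketch: the `onclick` value is a made-up sample modelled on the escaped URL in the question, and `extract_next_page_token` is a hypothetical helper name, not part of any library.

```python
import re

# Hypothetical onclick value (an assumption), shaped like the escaped URL
# from the question: "=" appears as \x3d and "&" appears as \x26.
onclick = r"window.location='/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26mauthors\x3dvalve\x26after_author\x3dXB0HAMS9__8J\x26astart\x3d10'"


def extract_next_page_token(onclick_value):
    """Return the after_author token, or None if the attribute has none."""
    # \\x3d in a raw pattern matches a literal backslash followed by "x3d".
    # The lazy (.*?) stops at the first \x26 after the token.
    match = re.search(r"after_author\\x3d(.*?)\\x26", onclick_value)
    return match.group(1) if match else None


token = extract_next_page_token(onclick)
print(token)  # XB0HAMS9__8J
```

Returning None instead of calling `.group(1)` directly avoids an AttributeError on the last page, where the button carries no token.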

Code that extracts the profile data, with an example in the online IDE:

from parsel import Selector
import requests, json

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "mauthors": "label:security",  # author search query (the question's URL uses mauthors, not q)
    "hl": "pl",
    "view_op": "search_authors"
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://scholar.google.pl/citations", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

profiles = []

for profile in selector.css(".gs_ai_chpr"):
    profile_name = profile.css(".gs_ai_name a::text").get()
    profile_link = f'https://scholar.google.com{profile.css(".gs_ai_name a::attr(href)").get()}'
    profile_email = profile.css(".gs_ai_eml::text").get()
    profile_interests = profile.css(".gs_ai_one_int::text").getall()

    profiles.append({
        "profile_name": profile_name,
        "profile_link": profile_link,
        "profile_email": profile_email,
        "profile_interests": profile_interests
    })

print(json.dumps(profiles, indent=2))

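Alternatively, you can use the Google Scholar Profiles API from SerpApi to achieve the same thing. It's a paid API with a free plan.

The difference is that you don't have to figure out how to extract the data, bypass blocks from search engines, scale the number of requests, and so on.

Example code to integrate: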

from serpapi import GoogleSearch
import os, json

params = {
    "api_key": os.getenv("API_KEY"),     # SerpApi API key
    "engine": "google_scholar_profiles", # SerpApi profiles parsing engine
    "hl": "pl",                          # language
    "mauthors": "label:security"         # search query
}

search = GoogleSearch(params)
results = search.get_dict()

for profile in results["profiles"]:
    print(json.dumps(profile, indent=2))

# part of the output:
'''
{
  "name": "Johnson Thomas",
  "link": "https://scholar.google.com/citations?hl=pl&user=eKLr0EgAAAAJ",
  "serpapi_link": "https://serpapi.com/search.json?author_id=eKLr0EgAAAAJ&engine=google_scholar_author&hl=pl",
  "author_id": "eKLr0EgAAAAJ",
  "affiliations": "Professor of Computer Science, Oklahoma State University",
  "email": "Zweryfikowany adres z cs.okstate.edu",
  "cited_by": 159999,
  "interests": [
    {
      "title": "Security",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=pl&mauthors=label%3Asecurity",
      "link": "https://scholar.google.com/citations?hl=pl&view_op=search_authors&mauthors=label:security"
    },
    {
      "title": "cloud computing",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=pl&mauthors=label%3Acloud_computing",
      "link": "https://scholar.google.com/citations?hl=pl&view_op=search_authors&mauthors=label:cloud_computing"
    },
    {
      "title": "big data",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=pl&mauthors=label%3Abig_data",
      "link": "https://scholar.google.com/citations?hl=pl&view_op=search_authors&mauthors=label:big_data"
    }
  ],
  "thumbnail": "https://scholar.google.com/citations/images/avatar_scholar_56.png"
}
'''

Disclaimer: I work for SerpApi.

score 0 · Accepted Answer

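decode is used to convert bytes to a string. Your URL is already a string, not bytes.

You can use encode to convert the string to bytes, and then decode with 'unicode_escape' to get the correct string.

(I use the r prefix to simulate text that has this problem. A URL written without the prefix doesn't need converting.)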

url = r'http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10'
print(url)

url = url.encode('utf-8').decode('unicode_escape')
print(url)

Result:

http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10

http://scholar.google.pl/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:security&after_author=rukAAOJ8__8J&astart=10
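The decoded URL still differs from the target in the question (http instead of https, hl=pl, and an extra oe parameter). One way to close that gap, sketched here with the standard library only, is to parse the query string and rebuild it. `convert_url` is a hypothetical helper name, and dropping `oe` and forcing `hl=en` are assumptions taken from the desired output in the question.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

raw = r'http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10'


def convert_url(raw_url):
    # Turn literal \x3d / \x26 sequences into "=" and "&".
    decoded = raw_url.encode('utf-8').decode('unicode_escape')
    parts = urlsplit(decoded)

    # Drop "oe" and switch the interface language to English.
    query = [(key, value) for key, value in parse_qsl(parts.query) if key != 'oe']
    query = [(key, 'en' if key == 'hl' else value) for key, value in query]

    # safe=':' keeps "label:security" readable instead of "label%3Asecurity".
    new_query = urlencode(query, safe=':')
    return urlunsplit(('https', parts.netloc, parts.path, new_query, parts.fragment))


converted_url = convert_url(raw)
print(converted_url)
# https://scholar.google.pl/citations?view_op=search_authors&hl=en&mauthors=label:security&after_author=rukAAOJ8__8J&astart=10
```

Note that the encode/decode round trip with 'unicode_escape' is only safe for ASCII input such as this URL. Non-ASCII characters would be misread as Latin-1.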

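BTW: first check print(url). You may already have the correct URL and only be using the wrong method to display it. The Python shell displays every result without print() as if you had called print(repr()), which shows some characters as escape codes so that you can see which encoding (utf-8, iso-8859-1, win-1250, latin-1, etc.) is used in the text.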
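A small illustration of that point, using a shortened fragment of the URL (the variable names are only for this sketch): in a normal string literal Python resolves `\x3d` and `\x26` while parsing the source, so only text that really contains backslashes needs the conversion.

```python
plain = 'view_op\x3dsearch_authors\x26hl\x3dpl'   # escapes resolved by the parser
raw = r'view_op\x3dsearch_authors\x26hl\x3dpl'    # backslashes kept literally

print(plain)      # view_op=search_authors&hl=pl
print(repr(raw))  # 'view_op\\x3dsearch_authors\\x26hl\\x3dpl'

# Only the raw variant needs decoding; afterwards both are identical.
decoded = raw.encode('utf-8').decode('unicode_escape')
print(decoded == plain)  # True
```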
