I am currently running the following code:

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin

def hltv_match_list(max_offset):
    offset = 0
    while offset < max_offset:
        url = 'http://www.hltv.org/?pageid=188&offset=' + str(offset)
        base = "http://www.hltv.org/"
        soup = BeautifulSoup(requests.get("http://www.hltv.org/?pageid=188&offset=0").content, 'html.parser')
        cont = soup.select("div.covMainBoxContent a[href*=matchid=]")
        href =  urljoin(base, (a["href"] for a in cont))
        # print([urljoin(base, a["href"]) for a in cont])
        get_hltv_match_data(href)
        offset += 50

def get_hltv_match_data(matchid_url):
    source_code = requests.get(matchid_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for teamid in soup.findAll("div.covSmallHeadline a[href*=teamid=]"):
        print teamid.string

hltv_match_list(5)

Error:

  File "C:/Users/mdupo/PycharmProjects/HLTVCrawler/Crawler.py", line 12, in hltv_match_list
    href =  urljoin(base, (a["href"] for a in cont))
  File "C:\Python27\lib\urlparse.py", line 261, in urljoin
    urlparse(url, bscheme, allow_fragments)
  File "C:\Python27\lib\urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "C:\Python27\lib\urlparse.py", line 182, in urlsplit
    i = url.find(':')
AttributeError: 'generator' object has no attribute 'find'

Process finished with exit code 1

I think my problem is with the line href = urljoin(base, (a["href"] for a in cont)), where I am trying to create a list of URLs that I can feed into get_hltv_match_data to grab the various items from each page. Am I going about this the right way?

Cheers


1 Answer


You need to join each href individually, as in your commented-out code:

urls = [urljoin(base, a["href"]) for a in cont]

You are trying to join the base URL to a generator, i.e. (a["href"] for a in cont), which makes no sense; urljoin expects a single string.

You should also pass url to requests, otherwise you will request the same page over and over:

soup = BeautifulSoup(requests.get(url).content, 'html.parser')
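To illustrate the urljoin fix in isolation, here is a minimal sketch (the href values are made up for the example; the import shown is the Python 3 location, which in your Python 2 code stays from urlparse import urljoin):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://www.hltv.org/"
# Hypothetical hrefs, standing in for a["href"] extracted from the anchors
hrefs = ["match?matchid=1", "match?matchid=2"]

# Join each href to the base individually with a list comprehension;
# passing the generator itself to urljoin raises AttributeError,
# because urljoin calls string methods like .find() on its argument
urls = [urljoin(base, h) for h in hrefs]
print(urls)
# → ['http://www.hltv.org/match?matchid=1', 'http://www.hltv.org/match?matchid=2']
```

Each resulting string can then be passed to get_hltv_match_data one at a time, e.g. by looping over urls.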
answered 2016-06-01T17:46:48.323