I am currently running the following code:

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin

def hltv_match_list(max_offset):
    offset = 0
    while offset < max_offset:
        url = 'http://www.hltv.org/?pageid=188&offset=' + str(offset)
        base = "http://www.hltv.org/"
        soup = BeautifulSoup(requests.get("http://www.hltv.org/?pageid=188&offset=0").content, 'html.parser')
        cont = soup.select("div.covMainBoxContent a[href*=matchid=]")
        href =  urljoin(base, (a["href"] for a in cont))
        # print([urljoin(base, a["href"]) for a in cont])
        get_hltv_match_data(href)
        offset += 50

def get_hltv_match_data(matchid_url):
    source_code = requests.get(matchid_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for teamid in soup.findAll("div.covSmallHeadline a[href*=teamid=]"):
        print teamid.string

hltv_match_list(5)

Error:

  File "C:/Users/mdupo/PycharmProjects/HLTVCrawler/Crawler.py", line 12, in hltv_match_list
    href =  urljoin(base, (a["href"] for a in cont))
  File "C:\Python27\lib\urlparse.py", line 261, in urljoin
    urlparse(url, bscheme, allow_fragments)
  File "C:\Python27\lib\urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "C:\Python27\lib\urlparse.py", line 182, in urlsplit
    i = url.find(':')
AttributeError: 'generator' object has no attribute 'find'

Process finished with exit code 1

I think my problem is with the line href = urljoin(base, (a["href"] for a in cont)), where I am trying to create a list of URLs that I can feed into get_hltv_match_data to grab the various items from each page. Am I going about this the right way?

Cheers


1 Answer


You need to join each href individually, as in your commented-out code:

urls = [urljoin(base, a["href"]) for a in cont]

You are trying to join the base URL to a generator, i.e. (a["href"] for a in cont), which makes no sense; urljoin expects a single string.

You should also pass url to requests, otherwise you will request the same page over and over:

soup = BeautifulSoup(requests.get(url).content, 'html.parser')
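To illustrate the urljoin fix in isolation, here is a minimal sketch (the href values are made up for the example; the import shown is the Python 3 location, which in your Python 2 code stays from urlparse import urljoin):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://www.hltv.org/"
# Hypothetical hrefs, standing in for a["href"] extracted from the anchors
hrefs = ["match?matchid=1", "match?matchid=2"]

# Join each href to the base individually with a list comprehension;
# passing the generator itself to urljoin raises AttributeError,
# because urljoin calls string methods like .find() on its argument
urls = [urljoin(base, h) for h in hrefs]
print(urls)
# → ['http://www.hltv.org/match?matchid=1', 'http://www.hltv.org/match?matchid=2']
```

Each resulting string can then be passed to get_hltv_match_data one at a time, e.g. by looping over urls.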
answered 2016-06-01T17:46:48.323