python - Trustpilot review scraper: pagination not working

Question

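I have been trying to scrape customer reviews of DoorDash from Trustpilot across several pages, but for some reason it only scrapes the first page over and over (the pagination does not seem to work). Here is my code: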

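import json
from random import randint
from time import sleep

import bs4
import numpy as np
import requests

review_text=[]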
review_score=[]
review_date=[]
review_title=[]

pages = np.arange(1, 10, 1)
for page in pages:
    page = requests.get("https://www.trustpilot.com/review/doordash.com" + "?page=" + str(page))
    sleep(randint(2,10))
    if response.status_code == 200:
        soup = bs4.BeautifulSoup(response.text)
        for rev in soup.find_all('div',class_="review-content"):
            nv = rev.find_all('p',class_= 'review-content__text')
            review = rev.p.text.strip() if len(nv) == True else '-'
            review_text.append(review)            
            date_json = json.loads(rev.find('script').string)
            date = date_json['publishedDate']
            review_date.append(date)
        for rev in soup.find_all('div',class_='star-rating star-rating--medium'):
            review_score.append(rev.find('img').get('alt'))
        for rev in soup.find_all('h2',class_='review-content__title'):
            review_title.append(rev.text.strip())
    else:
        print("Issue getting url")

Does anyone know how I can fix this? (Everything except the pagination works perfectly.) Thanks!

score 0 · Accepted Answer

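Pagination on Trustpilot is not done with page 1, page 2, and so on. You need to get the URL of the next page from the current page and then scrape that page's content. This example shows how to obtain the next-page URL and use it to scrape page by page: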

import requests
from lxml import html

base_url = "https://trustpilot.com/review/doordash.com"
general = "https://trustpilot.com"
num_pages = 20  # upper bound on the number of pages to visit

for i in range(num_pages):
    # place the function that collects reviews from one page here
    scrape_page(base_url)
    page = requests.get(base_url)
    tree = html.fromstring(page.content)
    # look for the link to the next page on the current page
    next_page = tree.xpath("//a[contains(@class, 'next-page')]")
    if not next_page:
        break  # no next-page link, so this was the last page
    base_url = general + next_page[0].get('href')
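
As a rough sketch of the same idea, the block below separates "find the next-page link" from "walk the pages" so each part can be checked on its own. It makes several assumptions: the next link is an `<a>` whose class contains `next-page` (taken from the XPath above, not verified against the live site), and `find_next_url`, `crawl`, `fetch` and the two-argument `scrape_page` are hypothetical names introduced for illustration. It uses the standard library's `html.parser` and `urljoin` instead of lxml and string concatenation, purely so that it runs without extra dependencies.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a> whose class contains 'next-page'."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.href is not None:
            return
        attr = dict(attrs)
        # assumption: the next link is marked by a 'next-page' class
        if "next-page" in (attr.get("class") or "") and attr.get("href"):
            self.href = attr["href"]


def find_next_url(current_url, page_html):
    """Return the absolute URL of the next page, or None if there is none."""
    finder = NextLinkFinder()
    finder.feed(page_html)
    if finder.href is None:
        return None
    # urljoin resolves a relative href against the page it was found on
    return urljoin(current_url, finder.href)


def crawl(start_url, fetch, scrape_page, max_pages=20):
    """Scrape start_url, then follow next-page links, visiting each URL once."""
    url, seen = start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_html = fetch(url)        # hypothetical callable: URL -> HTML text
        scrape_page(url, page_html)   # hypothetical callable: collects reviews
        url = find_next_url(url, page_html)
    return len(seen)
```

With requests, `fetch` could be something like `lambda u: requests.get(u).text`. Passing the already-downloaded HTML to `scrape_page` avoids fetching every page twice. It is also worth checking that the object being parsed is the one just fetched: the loop in the question stores the result of `requests.get` in `page` but then reads `response`.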
