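I've been trying to scrape customer reviews of DoorDash from several pages on Trustpilot, but for some reason it only scrapes the first page over and over (pagination doesn't seem to work). Here is my code: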

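import json
from random import randint
from time import sleep

import bs4
import numpy as np
import requests

review_text = []
review_score = []
review_date = []
review_title = []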

pages = np.arange(1, 10, 1)
for page in pages:
    page = requests.get("https://www.trustpilot.com/review/doordash.com" + "?page=" + str(page))
    sleep(randint(2,10))
    if response.status_code == 200:
        soup = bs4.BeautifulSoup(response.text)
        for rev in soup.find_all('div',class_="review-content"):
            nv = rev.find_all('p',class_= 'review-content__text')
            review = rev.p.text.strip() if len(nv) == True else '-'
            review_text.append(review)            
            date_json = json.loads(rev.find('script').string)
            date = date_json['publishedDate']
            review_date.append(date)
        for rev in soup.find_all('div',class_='star-rating star-rating--medium'):
            review_score.append(rev.find('img').get('alt'))
        for rev in soup.find_all('h2',class_='review-content__title'):
            review_title.append(rev.text.strip())
    else:
        print("Issue getting url")

Does anyone know how I can fix this? (Everything works perfectly apart from the pagination.) Thanks!

1 Answer

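Pagination on Trustpilot is not done with page 1, page 2, and so on; you need to get the next-page URL and scrape the content found there. This example shows how to get the next-page URL and use it to scrape each page: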

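import requests
from lxml import html

base_url = "https://trustpilot.com/review/doordash.com"
general = "https://trustpilot.com"
num_pages = 20  # maximum number of pages to visit

for i in range(num_pages):
    # Place the function that collects reviews from one page here
    # (scrape_page is a placeholder you define yourself)
    scrape_page(base_url)
    # verify=False skips TLS certificate verification
    page = requests.get(base_url, verify=False)
    tree = html.fromstring(page.content)
    # Look for the link to the next page of reviews
    next_page = tree.xpath("//a[contains(@class, 'next-page')]")
    if not next_page:
        break  # no next link, so this was the last page
    base_url = general + next_page[0].get('href')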
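
The next-link approach described in this answer can be tried offline with a minimal sketch like the one below. It is only an illustration under stated assumptions: it uses the standard library's `HTMLParser` and `urljoin` instead of lxml so that it runs without extra packages, it matches `next-page` as a whole class token rather than a substring, and the names `NextLinkFinder`, `next_page_url`, `crawl`, and the two-page `FAKE_SITE` are all hypothetical. Whether Trustpilot's current markup still uses a `next-page` class is not something this sketch verifies.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a> whose class list contains 'next-page'."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.href is not None:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "next-page" in classes and attrs.get("href"):
            self.href = attrs["href"]


def next_page_url(current_url, page_html):
    """Return the absolute URL of the next page, or None if there is no next link."""
    finder = NextLinkFinder()
    finder.feed(page_html)
    return urljoin(current_url, finder.href) if finder.href else None


def crawl(start_url, fetch, max_pages=20):
    """Visit pages by following next links; `fetch` maps a URL to its HTML text."""
    visited, url = [], start_url
    while url and len(visited) < max_pages:
        visited.append(url)  # a real scraper would collect the reviews here
        url = next_page_url(url, fetch(url))
    return visited


# Hypothetical two-page site, used only to exercise the loop without a network.
FAKE_SITE = {
    "https://example.com/review/shop":
        '<a class="button next-page" href="/review/shop?page=2">Next</a>',
    "https://example.com/review/shop?page=2": "<p>last page</p>",
}

pages = crawl("https://example.com/review/shop", FAKE_SITE.__getitem__)
print(pages)
```

Against a live site, `fetch` could be something like `lambda u: requests.get(u).text`. Separately, it may be worth checking the loop in the question: it stores the result of `requests.get` in `page` but then reads `response`, so if `response` is left over from an earlier run, every iteration would parse the same page.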

answered 2021-06-05T10:49:48.617