python-3.x - Scraping Kickstarter project pages with Python

Question

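For over a year I have been using the code below to scrape certain Kickstarter pages as part of my day job. There is no malicious intent; I just need to pull some information from the pages to help project creators.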

But in the past 4-6 months Kickstarter has put some kind of blocker in place that stops me from reaching and scraping the actual page. All I get is: Backer or bot? Complete this security check to prove that you’re a human. Once you’ve passed this page, you might need to navigate away from your current screen on Kickstarter to refresh and move on. To avoid seeing this page again, double-check that JavaScript and cookies are enabled on your web browser and that you’re not blocking them from loading with an extension (e.g., ad blockers).

Can anyone think of a way to get past this check and actually land on the page? Any input would be very helpful.

import os
import sys
import requests
import time
import urllib
import urllib.request
import shutil
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from csv import writer
from shutil import copyfile

print('What is the project URL?')
urlInp = input()

elClass = "rte__content"

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

driver.get(urlInp)
time.sleep(2)
html = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()

soup = BeautifulSoup(html, 'lxml')
ele = soup.find('div', {'class': elClass})

print(soup)
quit()

score 1 · Accepted Answer

Looking at your script, it looks like you are trying to get the story section of the project.

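Selenium is great for GUI testing, but it announces to the website who it is, to help prevent DoS attacks. Have a read of the documentation if you want to know more. I think these sites are, for one reason or another, working hard to block GUI automation. They have a lot of smart people working on it, so trying to beat them is going to be an uphill battle.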

As a better alternative, have you considered using the requests library? It lets you simulate the calls without needing a browser.

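I had a look in devtools, and there is even an API that gets the story information for you. You need a CSRF token, and you need to POST some data (which is already available in your URL). This will run much faster than Selenium and lets you do more.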
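To illustrate just the token step, here is a minimal sketch that pulls the csrf-token meta tag out of a page using only the standard library. The names CsrfTokenParser and find_csrf_token are my own (hypothetical), and the sample HTML is made up for the demo; it only assumes the token sits in a meta tag named csrf-token, which is what I saw in devtools. It returns None instead of raising when the tag is missing, which is what you would expect if the security-check page came back instead of the project page.

```python
from html.parser import HTMLParser


class CsrfTokenParser(HTMLParser):
    """Collects the content of the first <meta name="csrf-token"> tag."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except the first matching meta tag.
        if tag != "meta" or self.token is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == "csrf-token":
            self.token = attr_map.get("content")


def find_csrf_token(page_html):
    """Return the CSRF token string, or None if the page has no such tag."""
    parser = CsrfTokenParser()
    parser.feed(page_html)
    return parser.token


# Made-up sample page, only for demonstration.
sample = '<html><head><meta name="csrf-token" content="abc123"></head></html>'
print(find_csrf_token(sample))
```

With a real page you would pass landing.text from the session request; a None result is a hint that you did not get the project page.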

Here is some code I put together for you. I picked a random Kickstarter page, which is hard-coded into this demo:

import requests
from bs4 import BeautifulSoup
from lxml import html

urlInp = 'https://www.kickstarter.com/projects/iamlunasol/soft-like-mochi-enamel-pins?ref=section-homepage-featured-project'


# start a session - this stores cookies
s = requests.session()

# go here to get cookies and the token
landing = s.get(urlInp)
page = html.fromstring(landing.content)
csrf = page.xpath('//meta[@name="csrf-token"]')[0].get('content')
headers = {'x-csrf-token': csrf}


# hit the api with the data
# the slug is the part of the url after "projects/", without the query string
graphslug = urlInp.split("projects/")[1]
graphslug = graphslug.split("?")[0]
graphData= [{
        "operationName": "Campaign",
        "variables": {
            "slug": graphslug
        },
        "query": "query Campaign($slug: String!) {\n  project(slug: $slug) {\n    id\n    isSharingProjectBudget\n    risks\n    showRisksTab\n    story(assetWidth: 680)\n    currency\n    spreadsheet {\n      displayMode\n      public\n      url\n      data {\n        name\n        value\n        phase\n        rowNum\n        __typename\n      }\n      dataLastUpdatedAt\n      __typename\n    }\n    environmentalCommitments {\n      id\n      commitmentCategory\n      description\n      __typename\n    }\n    __typename\n  }\n}\n"
    }]

response = s.post("https://www.kickstarter.com/graph", json=graphData, headers=headers)

#process the response
graph_json = response.json()
story = graph_json[0]['data']['project']['story']
soup = BeautifulSoup(story, 'lxml')
print(soup)

The first few lines of the output are:

<html><body><p>Hi! I'm Felice Regina (<a href="https://www.instagram.com/iamlunasol/" rel="noopener" target="_blank">@iamlunasol</a> on Instagram) but everyone just calls me Luna! I'm an independent illustrator and pin designer! I've run many successful 
Kickstarter campaigns for enamel pins over the past few years. This campaign will help put new hard enamel pin designs into production.</p>
<p>Pledging ensures that the pins get produced, discounts when you purchase multiple pins, plus any freebies that we may unlock. If the campaign is successful, any extra pins will be sold at $12 + shipping in my <a href="https://shopiamlunasol.com/" rel="noopener" target="_blank">web store</a>.</p>

This corresponds to the story field you can see in the JSON response in devtools; the Preview tab is handy for inspecting it.
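If you want to be a bit more defensive when reading that field, something like the sketch below could help. The function name extract_story is my own (hypothetical), and the sample dictionaries are made up; the only thing taken from the real response is the shape used in the demo above, a list whose first item holds data, then project, then story. It returns None rather than raising when any level is missing.

```python
def extract_story(graph_json):
    """Return the story HTML from a Campaign response, or None if absent."""
    # The demo above indexes graph_json[0]['data']['project']['story'];
    # this walks the same path but tolerates missing or null levels.
    if not isinstance(graph_json, list) or not graph_json:
        return None
    first = graph_json[0]
    if not isinstance(first, dict):
        return None
    data = first.get("data") or {}
    project = data.get("project") or {}
    return project.get("story")


# Made-up responses, only for demonstration.
good = [{"data": {"project": {"story": "<p>Hi!</p>"}}}]
empty = [{"data": {"project": None}}]

print(extract_story(good))
print(extract_story(empty))
```

A None result is then something you can check for before handing the story to BeautifulSoup.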

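Finally, if you want to adapt this to use other queries, you can find the JSON data to send under Request Payload on the Headers tab of the request in devtools.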
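To make swapping queries easier, you could wrap the payload in a small helper. This is only a sketch: project_slug and build_payload are names I made up, and the trimmed query string below is an assumption (it keeps only the story field from the full query in the demo; I have not checked that the server accepts a reduced query). Copy the real operation name and query from the request payload you see in devtools.

```python
from urllib.parse import urlparse


def project_slug(project_url):
    """Return the 'creator/project' slug from a Kickstarter project URL."""
    # urlparse drops the query string, so ?ref=... is ignored automatically.
    path = urlparse(project_url).path
    marker = "/projects/"
    if marker not in path:
        raise ValueError("not a project URL: " + project_url)
    return path.split(marker, 1)[1].strip("/")


def build_payload(operation_name, query, slug):
    """Build the JSON body in the same shape as graphData in the demo."""
    return [{
        "operationName": operation_name,
        "variables": {"slug": slug},
        "query": query,
    }]


url = ("https://www.kickstarter.com/projects/iamlunasol/"
       "soft-like-mochi-enamel-pins?ref=section-homepage-featured-project")

# Assumed, trimmed-down query; replace with the one copied from devtools.
story_query = ("query Campaign($slug: String!) {\n"
               "  project(slug: $slug) {\n"
               "    story(assetWidth: 680)\n"
               "  }\n"
               "}\n")

payload = build_payload("Campaign", story_query, project_slug(url))
print(payload[0]["variables"]["slug"])
```

The resulting payload can be passed as json=payload to the same s.post call used in the demo.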
