I am writing a Scrapy program that logs in to this website and scrapes data for different playing cards: http://www.starcitygames.com/buylist/. From that URL, though, I only scrape the ID values; I then use each ID number to redirect to a different URL and scrape that JSON web page, and I do this for all 207 different categories of cards. It looks a bit more authentic than going straight to the URL with the JSON data. Anyway, I have written Scrapy programs with multiple URLs before, and I was able to set those programs up to rotate proxies and user agents, but how would I do that in this program? Since there is technically only one start URL, is there a way to set it up to switch to a different proxy and user agent after every 5 or so JSON pages it scrapes? I do not want it to rotate randomly: I want it to scrape the same JSON web page with the same proxy and user agent every time. I hope that all makes sense.
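Something like the downloader middleware sketch below is what I have in mind, but I am not sure it is the right approach. PROXIES, USER_AGENTS, and BLOCK_SIZE are placeholders, and I am assuming the category IDs in the URL are numeric; keying the choice on the ID should mean the same JSON page always goes out with the same identity:

# Sketch of a downloader middleware that pins each category ID to a fixed
# proxy/user-agent pair. PROXIES and USER_AGENTS are placeholder lists and
# BLOCK_SIZE is the rough "every 5 pages" block; swap in real values.
from urllib.parse import urlparse, parse_qs

PROXIES = ['http://proxy1:8080', 'http://proxy2:8080']          # placeholders
USER_AGENTS = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64)',     # placeholders
               'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14)']
BLOCK_SIZE = 5

class StickyProxyMiddleware:
    def process_request(self, request, spider):
        # Only touch the JSON search requests; leave the login page alone
        query = parse_qs(urlparse(request.url).query)
        if 'id' not in query:
            return None
        # Integer-divide the category ID (assumed numeric) by the block size
        # so IDs 0-4 share slot 0, IDs 5-9 share slot 1, and so on --
        # deterministic per page, not random
        slot = int(query['id'][0]) // BLOCK_SIZE
        request.meta['proxy'] = PROXIES[slot % len(PROXIES)]
        request.headers['User-Agent'] = USER_AGENTS[slot % len(USER_AGENTS)]
        return None

And here is my spider so far: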
# Import the needed modules
import scrapy
import json
from scrapy.spiders import Spider
from scrapy_splash import SplashRequest
from ..items import DataItem
# Spider class
class LoginSpider(scrapy.Spider):
    # Name of the spider
    name = "LoginSpider"
    # URL where the data is located
    start_urls = ["http://www.starcitygames.com/buylist/"]
    # Login function
    def parse(self, response):
        # Log in using email and password, then proceed to the after_login function
        return scrapy.FormRequest.from_response(
            response,
            formcss='#existing_users form',
            formdata={'ex_usr_email': 'example@email.com', 'ex_usr_pass': 'password'},
            callback=self.after_login
        )
    # Function to parse the buylist page
    def after_login(self, response):
        # Loop through the page, grab the ID number for each category of card,
        # plug it into the end of the URL below, then go to the parse_data function
        for category_id in response.xpath('//select[@id="bl-category-options"]/option/@value').getall():
            yield scrapy.Request(
                url="http://www.starcitygames.com/buylist/search?search-type=category&id={category_id}".format(category_id=category_id),
                callback=self.parse_data,
            )
    # Function to parse the JSON data
    def parse_data(self, response):
        # Parse the JSON body of the response
        jsonresponse = json.loads(response.text)
        # Loop over the result groups where the card data is located
        for result in jsonresponse['results']:
            # Run through each card in the group until all data is scraped
            for card in result:
                # Build a fresh DataItem (from items.py) for each card so one
                # yielded item does not overwrite the last
                items = DataItem()
                # Scrape the category name
                items['Category'] = jsonresponse['search']
                # Scrape the rest of the needed data
                items['Card_Name'] = card['name']
                items['Condition'] = card['condition']
                items['Rarity'] = card['rarity']
                items['Foil'] = card['foil']
                items['Language'] = card['language']
                items['Buy_Price'] = card['price']
                # Return the finished item
                yield items
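If that middleware idea is on the right track, I assume I would enable it in settings.py along these lines, where 'myproject.middlewares' is a placeholder for wherever the middleware actually lives (priority 350 so it runs before Scrapy's built-in user-agent and proxy middlewares):

# settings.py -- 'myproject.middlewares' is a placeholder module path
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.StickyProxyMiddleware': 350,
}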