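I am trying to scrape data from Amazon Canada (amazon.ca). I am using the requests and bs4 packages to send the request and parse the HTML, but I cannot extract any data from the response. Can someone help me extract information from the response?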

import requests
from bs4 import BeautifulSoup

# Define headers
headers={
        'content-type': 'text/html;charset=UTF-8',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
        }

# Amazon Canada product url
url = 'https://www.amazon.ca/INIU-High-Speed-Flashlight-Powerbank-Compatible/dp/B07CZDXDG8?ref_=Oct_s9_apbd_otopr_hd_bw_b3giFrP&pf_rd_r=69GE1K9DG49351YHSYBC&pf_rd_p=694b8fdf-0d96-57ba-b834-dc9bdeb7a094&pf_rd_s=merchandised-search-11&pf_rd_t=BROWSE&pf_rd_i=3379552011&th=1'
resp = requests.get(url, headers=headers)
print(resp)

<Response [200]>

Earlier it returned <Response [503]>, so I added headers, and now it returns <Response [200]>. I then tried to extract some information from the page.

# Parse the HTML with the lxml parser
soup = BeautifulSoup(resp.content, 'lxml')

# Extract information from the page
product_title = soup.find('span', id='productTitle')
print('product_title -', product_title)

product_price = soup.find('span', id='priceblock_ourprice')
print('product_price -', product_price)

('product_title -', None)
('product_price -', None)

But both values are None, so I printed the soup to see what data it actually contains.

soup.text

'\n\n\n\nRobot Check\n\n\n\n\nif (true === true) {\n    var ue_t0 = (+
new Date()),\n        ue_csm = window,\n        ue = { t0: ue_t0, d:
function() { return (+new Date() - ue_t0); } },\n        ue_furl =
"fls-na.amazon.ca",\n        ue_mid = "A2EUQ1WTGCTBG2",\n       
ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],\n  
ue_sn = "opfcaptcha.amazon.ca",\n        ue_id =
\'0B2HQATTKET8J6M36Y3G\';\n}\n\n\n\n\n\n\n\n\n\n\n\nEnter the
characters you see below\nSorry, we just need to make sure you\'re not
a robot. For best results, please make sure your browser is accepting
cookies.\n\n\n\n\n\n\n\n\n\n\nType the characters you see in this
image:\n\n\n\n\n\n\n\n\nTry different
image\n\n\n\n\n\n\n\n\n\n\n\nContinue
shopping\n\n\n\n\n\n\n\n\n\n\n\nConditions of Use &
Sale\n\n\n\n\nPrivacy Notice\n\n\n          \xa9 1996-2015,
Amazon.com, Inc. or its affiliates\n          \n           if (true
=== true) {\n             document.write(\'<img src="https://fls-na.amaz\'+\'on.ca/\'+\'1/oc-csi/1/OP/requestId=0B2HQATTKET8J6M36Y3G&js=1"
/>\');\n           };\n          \n\n\n\n\n\n\n    if (true === true)
{\n        var head = document.getElementsByTagName(\'head\')[0],\n   
prefix =
"https://images-na.ssl-images-amazon.com/images/G/01/csminstrumentation/",\n
elem = document.createElement("script");\n        elem.src = prefix +
"csm-captcha-instrumentation.min.js";\n       
head.appendChild(elem);\n\n        elem =
document.createElement("script");\n        elem.src = prefix +
"rd-script-6d68177fa6061598e9509dc4b5bdd08d.js";\n       
head.appendChild(elem);\n    }\n    \n\n'

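I went through the output carefully but found no usable data in the response. I also checked resp.content and found nothing there either. I verified the URL, and it is valid. I even tested the script above through a public proxy, but still got no output.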
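
The text above is Amazon's "Robot Check" CAPTCHA page, so a 200 status alone does not mean the product page came back. A minimal sketch of a check for that case follows. The function name `looks_like_robot_check` and the marker list are my own assumptions, with the marker strings copied from the output above; Amazon may change that wording at any time.

```python
# Marker strings copied from the soup.text output above (assumption:
# they identify the CAPTCHA page; Amazon may change them).
ROBOT_CHECK_MARKERS = (
    "Robot Check",
    "Enter the characters you see below",
    "make sure you're not a robot",
)


def looks_like_robot_check(page_text):
    """Return True if the page text contains any known CAPTCHA marker."""
    lowered = page_text.lower()
    return any(marker.lower() in lowered for marker in ROBOT_CHECK_MARKERS)
```

Calling `looks_like_robot_check(soup.text)` before the `soup.find(...)` lookups would tell a blocked request apart from a page whose element ids are simply different.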

Can someone help me extract the information from this URL, or suggest another way to get it?

1 Answer

Try this:

import requests
from bs4 import BeautifulSoup

headers = {
    'content-type': 'text/html;charset=UTF-8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
}

url = 'https://www.amazon.ca/INIU-High-Speed-Flashlight-Powerbank-Compatible/dp/B07CZDXDG8'
resp = requests.get(url, headers=headers)

soup = BeautifulSoup(resp.content, 'lxml')

# Extracting information from page
print('product_title -', soup.find('span', id='productTitle').text.strip())
print('product_price -', soup.find('span', id='priceblock_ourprice').text.strip())

The code produces:

product_title - INIU Power Bank, Ultra-Slim Dual 3A High-Speed Portable Charger, 10000mAh USB C Input & Flashlight External Phone Battery Pack for iPhone Xs X 8 Plus Samsung S10 Google LG iPad etc. [2020 Upgrade]
product_price - CDN$ 60.66
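
If Amazon serves the CAPTCHA page again, `soup.find(...)` returns None and calling `.text` on it raises AttributeError. A small guard, sketched here under the assumption that a missing element should simply yield None, avoids that. The helper name `text_or_none` is hypothetical; `get_text(strip=True)` is the standard bs4 tag method.

```python
def text_or_none(tag):
    """Return the stripped text of a bs4 tag, or None if the tag was not found."""
    if tag is None:
        return None
    return tag.get_text(strip=True)


# Usage with the soup built in the code above:
#   title = text_or_none(soup.find('span', id='productTitle'))
#   price = text_or_none(soup.find('span', id='priceblock_ourprice'))
```

A None result then signals that the page layout changed or the request was blocked, instead of crashing the script.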
Answered 2020-08-31T17:43:02.090