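I am trying to scrape data from Amazon Canada (amazon.ca). I am using the requests and bs4 packages to send the request and parse the HTML, but I cannot extract any data from the response. Can someone help me extract the information from the response?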
import requests
from bs4 import BeautifulSoup
# Define headers
headers={
'content-type': 'text/html;charset=UTF-8',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
# Amazon Canada product url
url = 'https://www.amazon.ca/INIU-High-Speed-Flashlight-Powerbank-Compatible/dp/B07CZDXDG8?ref_=Oct_s9_apbd_otopr_hd_bw_b3giFrP&pf_rd_r=69GE1K9DG49351YHSYBC&pf_rd_p=694b8fdf-0d96-57ba-b834-dc9bdeb7a094&pf_rd_s=merchandised-search-11&pf_rd_t=BROWSE&pf_rd_i=3379552011&th=1'
resp = requests.get(url, headers=headers)
print(resp)
<Response [200]>
Earlier it returned <Response [503]>, so I added the headers, and now it returns <Response [200]>. I then tried to extract some information from the page.
# Parse the response with the lxml parser
soup = BeautifulSoup(resp.content,'lxml')
# Extracting information from page
product_title = soup.find('span',id='productTitle')
print('product_title -' ,product_title)
product_price = soup.find('span',id='priceblock_ourprice')
print('product_price -' ,product_price)
('product_title -', None)
('product_price -', None)
But both lookups return None, so I printed the soup to check what data it actually contains.
soup.text
'\n\n\n\nRobot Check\n\n\n\n\nif (true === true) {\n var ue_t0 = (+
new Date()),\n ue_csm = window,\n ue = { t0: ue_t0, d:
function() { return (+new Date() - ue_t0); } },\n ue_furl =
"fls-na.amazon.ca",\n ue_mid = "A2EUQ1WTGCTBG2",\n
ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],\n
ue_sn = "opfcaptcha.amazon.ca",\n ue_id =
\'0B2HQATTKET8J6M36Y3G\';\n}\n\n\n\n\n\n\n\n\n\n\n\nEnter the
characters you see below\nSorry, we just need to make sure you\'re not
a robot. For best results, please make sure your browser is accepting
cookies.\n\n\n\n\n\n\n\n\n\n\nType the characters you see in this
image:\n\n\n\n\n\n\n\n\nTry different
image\n\n\n\n\n\n\n\n\n\n\n\nContinue
shopping\n\n\n\n\n\n\n\n\n\n\n\nConditions of Use &
Sale\n\n\n\n\nPrivacy Notice\n\n\n \xa9 1996-2015,
Amazon.com, Inc. or its affiliates\n \n if (true
=== true) {\n document.write(\'<img src="https://fls-na.amaz\'+\'on.ca/\'+\'1/oc-csi/1/OP/requestId=0B2HQATTKET8J6M36Y3G&js=1"
/>\');\n };\n \n\n\n\n\n\n\n if (true === true)
{\n var head = document.getElementsByTagName(\'head\')[0],\n
prefix =
"https://images-na.ssl-images-amazon.com/images/G/01/csminstrumentation/",\n
elem = document.createElement("script");\n elem.src = prefix +
"csm-captcha-instrumentation.min.js";\n
head.appendChild(elem);\n\n elem =
document.createElement("script");\n elem.src = prefix +
"rd-script-6d68177fa6061598e9509dc4b5bdd08d.js";\n
head.appendChild(elem);\n }\n \n\n'
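I went through the output thoroughly but did not find any usable product data in the response; it is a "Robot Check" CAPTCHA page rather than the product page. I did the same check on resp.content and found no product data there either. I also verified that the URL is valid. I even ran the script above through a public proxy, but still got no usable output.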
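To make that manual check repeatable, here is a minimal sketch of how the script could detect the CAPTCHA page before trying to parse it. The helper name `looks_like_robot_check` and the two sample HTML strings are hypothetical; the marker strings are copied from the soup.text output above, and I am assuming they appear unbroken in the raw HTML. This only detects the block, it does not get around it.

```python
# Hypothetical helper; marker strings are taken from the soup.text output above.
ROBOT_CHECK_MARKERS = (
    "Robot Check",
    "Enter the characters you see below",
    "Type the characters you see in this image",
)


def looks_like_robot_check(html_text):
    """Return True if the HTML contains any of the CAPTCHA page markers."""
    return any(marker in html_text for marker in ROBOT_CHECK_MARKERS)


# Made-up sample pages for illustration only (not real Amazon responses).
captcha_page = (
    "<html><head><title>Robot Check</title></head>"
    "<body>Enter the characters you see below</body></html>"
)
product_page = '<html><body><span id="productTitle">Some product</span></body></html>'

print(looks_like_robot_check(captcha_page))
print(looks_like_robot_check(product_page))
```

In the script above this would be called as `looks_like_robot_check(resp.text)` right after the request, so that a 200 status code is not mistaken for a real product page.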
Can someone help me extract the information from this URL, or suggest another way to get it done?