python - Python: getting smartphone prices from a website

Question

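I want to get smartphone prices from the website http://tweakers.net, which is a Dutch site. The problem is that no prices are being collected from the site.

The text file "TweakersTelefoons.txt" contains 3 entries:

samsung-galaxy-s6-32gb-zwart

lg-nexus-5x-32gb-zwart

huawei-nexus-6p-32gb-zwart

I am using Python 2.7, and this is the code I use: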

import urllib
import re

symbolfile = open("TweakersTelefoons.txt")
symbolslist = symbolfile.read()
symbolslist = symbolslist.split("\n")

for symbol in symbolslist:
    url = "http://tweakers.net/pricewatch/[^.]*/" +symbol+ ".html"
## http://tweakers.net/pricewatch/423541/samsung-galaxy-s6-32gb-zwart.html  is the original html

    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()

    regex = '<span itemprop="lowPrice">(.+?)</span>'
## <span itemprop="lowPrice">€ 471,95</span>  is what the original code looks like
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)

    print "the price of", symbol, "is ", price

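Output:

the price of samsung-galaxy-s6-32gb-zwart is  []

the price of lg-nexus-5x-32gb-zwart is  []

the price of huawei-nexus-6p-32gb-zwart is  []

The price is not shown. I tried using [^.] to get rid of the euro sign, but that did not work.

Also, in Europe we use "," instead of "." as the decimal separator. Please help.

Thank you in advance.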

score 1 · Accepted Answer

import requests

from bs4 import BeautifulSoup

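# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(requests.get("http://tweakers.net/categorie/215/smartphones/producten/").content, "lxml")

# each <p class="price"> wraps a link whose text is the price
print([(p.a["href"], p.a.text) for p in soup.find_all("p", {"class": "price"})])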

To get all pages:

import requests
from bs4 import BeautifulSoup

# base url; the page number (1-69 in this case) is filled in with format()
base_url = "http://tweakers.net/categorie/215/smartphones/producten/?page={}"
soup = BeautifulSoup(requests.get("http://tweakers.net/categorie/215/smartphones/producten/").content, "lxml")

# get and store all prices and phone links, keyed by page number
data = {1: [(p.a["href"], p.a.text) for p in soup.find_all("p", {"class": "price"})]}

pag = soup.find("span", attrs={"class": "pageDistribution"}).find_all("a")

# last page number
mx_pg = max(int(a.text) for a in pag if a.text.isdigit())

# get all the pages from the second to mx_pg
for i in range(2, mx_pg + 1):
    req = requests.get(base_url.format(i))
    print(req)
    soup = BeautifulSoup(req.content, "lxml")
    data[i] = [(p.a["href"], p.a.text) for p in soup.find_all("p", {"class": "price"})]

You will need both requests and BeautifulSoup. If you want to scrape more data, the dictionary holds a link to each phone's page, which you can then visit.
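The price strings collected this way are still text in the Dutch format quoted in the question, e.g. "€ 471,95". A minimal sketch of turning such a string into a number follows. It is written for Python 3, the helper name `parse_euro_price` is made up for illustration, and the handling of a trailing ",-" for whole euros is an assumption about how the site may print round prices.

```python
# -*- coding: utf-8 -*-
from decimal import Decimal


def parse_euro_price(text):
    """Convert a Dutch-formatted price such as '€ 471,95' to a Decimal."""
    # Drop the currency sign and any (non-breaking) spaces around the number.
    cleaned = text.replace("€", "").replace("\xa0", " ").strip()
    # Dutch convention: '.' groups thousands, ',' marks the decimals.
    cleaned = cleaned.replace(".", "").replace(",", ".")
    # Assumption: whole-euro prices may be written like '471,-'.
    if cleaned.endswith(".-"):
        cleaned = cleaned[:-2]
    return Decimal(cleaned)
```

Decimal is used instead of float so that amounts like 471,95 are kept exactly rather than rounded to the nearest binary fraction.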

score 0 · Accepted Answer

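I think your problem is that you expect the web server to resolve the wildcard in the URL "http://tweakers.net/pricewatch/[^.]*/", and you are not checking the return code, which I suspect is 404.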
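One way to surface the return code is sketched below. It assumes Python 3 (where `urllib.urlopen` became `urllib.request.urlopen`); `fetch_html` and its `opener` parameter are hypothetical names, and the parameter exists only so the function can be tried without network access.

```python
import urllib.error
import urllib.request


def fetch_html(url, opener=urllib.request.urlopen):
    """Return (status_code, body_bytes); the body is b"" on an HTTP error."""
    try:
        response = opener(url)
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; err.code is the status.
        return err.code, b""
    try:
        return response.getcode(), response.read()
    finally:
        response.close()
```

Printing the status next to each symbol in the loop would show right away whether the server found the page, before any regex is applied to the body. Network failures (`urllib.error.URLError`) are deliberately not caught here.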

You need to determine the product ID (if it is fixed), or submit a search request using the form's POST method.
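As a rough illustration of the product ID route, the sketch below builds the URL from an ID plus the name and applies the question's regex to a sample of the markup quoted there. The function names are hypothetical, and the only ID known from the question is 423541 for the Samsung; the IDs for the other two phones would have to be looked up.

```python
# -*- coding: utf-8 -*-
import re

# Same pattern as in the question; it matches once the right page is fetched.
PRICE_PATTERN = re.compile(r'<span itemprop="lowPrice">(.+?)</span>')


def product_url(product_id, slug):
    """Build a pricewatch URL from a numeric product ID and the product name."""
    return "http://tweakers.net/pricewatch/{}/{}.html".format(product_id, slug)


def extract_low_price(html_text):
    """Return the text inside the lowPrice span, or None if it is absent."""
    match = PRICE_PATTERN.search(html_text)
    return match.group(1) if match else None
```

With this shape, the text file could hold lines such as "423541/samsung-galaxy-s6-32gb-zwart" so that each entry carries its own ID instead of a regex fragment.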
