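I need to collect data for my master's thesis. I would like to scrape it from Vivino.com, but I have no web scraping experience. I have already seen a few questions about this, but I want to collect all the information about each wine (name, country, rating, description, price, etc.) together with its reviews.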
import requests
import pandas as pd

# Query the explore endpoint for one page of wines
r = requests.get(
    "https://www.vivino.com/api/explore/explore",
    params={
        "country_code": "FR",
        "country_codes[]": "pt",
        "currency_code": "EUR",
        "grape_filter": "varietal",
        "min_rating": "1",
        "order_by": "price",
        "order": "asc",
        "page": 1,
        "price_range_max": "500",
        "price_range_min": "0",
        "wine_type_ids[]": "1",
    },
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"
    },
)

# One tuple per match: winery, wine name with year, average rating, rating count
results = [
    (
        t["vintage"]["wine"]["winery"]["name"],
        f'{t["vintage"]["wine"]["name"]} {t["vintage"]["year"]}',
        t["vintage"]["statistics"]["ratings_average"],
        t["vintage"]["statistics"]["ratings_count"],
    )
    for t in r.json()["explore_vintage"]["matches"]
]

dataframe = pd.DataFrame(results, columns=["Winery", "Wine", "Rating", "num_review"])
print(dataframe)
With this code I can collect the columns ['Winery', 'Wine', 'Rating', 'num_review'].
With the following code I can collect the reviews:
import re
import json
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
}

url = "https://www.vivino.com/FR/en/dauprat-pauillac/w/3823873?year=2017&price_id=24797287"
api_url = (
    "https://www.vivino.com/api/wines/{id}/reviews?per_page=9999&year={year}"
)  # <-- increased the number of reviews to 9999

# Pull the wine id and the year out of the page URL
id_ = re.search(r"/(\d{5,})", url).group(1)
year = re.search(r"year=(\d+)", url).group(1)

data = requests.get(api_url.format(id=id_, year=year), headers=headers).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

# Print the text of each review, separated by a rule
for r in data["reviews"]:
    print(r["note"])
    print("-" * 80)
Can someone help me combine all of this information, so that I end up with all the wine information including the corresponding reviews?
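To make the goal concrete, here is a minimal sketch of the kind of combination I mean. It is not working scraper code: it assumes each match in the explore response also carries the wine's numeric id at t["vintage"]["wine"]["id"] (I have not confirmed this), the helper names wine_rows, review_notes and attach_reviews are made up for illustration, and the sample payloads are placeholder values shaped like the two responses, not real Vivino data.

```python
def wine_rows(explore_json):
    """Flatten the explore payload into one dict per wine."""
    rows = []
    for t in explore_json["explore_vintage"]["matches"]:
        vintage = t["vintage"]
        rows.append({
            # Assumption: the wine id is available here; needed to look up reviews.
            "wine_id": vintage["wine"]["id"],
            "year": vintage["year"],
            "Winery": vintage["wine"]["winery"]["name"],
            "Wine": f'{vintage["wine"]["name"]} {vintage["year"]}',
            "Rating": vintage["statistics"]["ratings_average"],
            "num_review": vintage["statistics"]["ratings_count"],
        })
    return rows


def review_notes(reviews_json):
    """Extract the review texts from a reviews payload."""
    return [r["note"] for r in reviews_json["reviews"]]


def attach_reviews(wines, notes_by_wine_id):
    """Return one row per (wine, review); a wine without reviews keeps one row with note=None."""
    combined = []
    for wine in wines:
        notes = notes_by_wine_id.get(wine["wine_id"], [])
        if not notes:
            combined.append({**wine, "note": None})
        for note in notes:
            combined.append({**wine, "note": note})
    return combined


# Placeholder payloads (made-up values) shaped like the two API responses.
sample_explore = {"explore_vintage": {"matches": [
    {"vintage": {"year": 2017,
                 "wine": {"id": 1, "name": "Wine A", "winery": {"name": "Winery A"}},
                 "statistics": {"ratings_average": 4.0, "ratings_count": 2}}},
    {"vintage": {"year": 2018,
                 "wine": {"id": 2, "name": "Wine B", "winery": {"name": "Winery B"}},
                 "statistics": {"ratings_average": 3.5, "ratings_count": 0}}},
]}}
sample_reviews = {1: {"reviews": [{"note": "first note"}, {"note": "second note"}]}}

wines = wine_rows(sample_explore)
notes = {wine_id: review_notes(payload) for wine_id, payload in sample_reviews.items()}
combined = attach_reviews(wines, notes)
for row in combined:
    print(row)
```

In the real script, I imagine the notes dictionary would be filled by calling the reviews endpoint once per wine with its wine_id and year, and pd.DataFrame(combined) would turn the result into a table like the one from my first snippet, with an extra note column.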
Thanks in advance!!