python - Filtering information from a link in Python?

Question

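I am writing a Python program that extracts a movie's rating from my favorite website.

Example review link: http://timesofindia.indiatimes.com/entertainment/movie-reviews/hindi/Madras-Cafe-movie-review/movie-review/21975443.cms

At the moment I use str.partition to cut out the part of the HTML source that contains the rating. However, this approach is very slow.

What is the fastest way to get the movie rating?

Here is the code I am using: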

#POST Request to TOI site, for review source
data_output = requests.post(review_link)

#Clean HTML code
soup = BeautifulSoup(data_output.text)

#Filter source data, via a dirty string partition method

#rating
texted = str(soup).partition(" stars,")
texted = texted[0].partition("Rating: ")
rating = texted[2]
#title
texted = texted[0].partition(" movie review")
texted = texted[0].partition("<title>")
title = texted[2]

#print stuff
print "Title:", title
print "Rating:", rating, "/ 5"

Thanks!

score 1 · Accepted Answer

Here is an example that uses requests to fetch the HTML, lxml to parse it and locate the rating value, and re to extract the actual rating as a number:

import re
from lxml import etree
import requests

URL = "http://timesofindia.indiatimes.com/entertainment/movie-reviews/hindi/Madras-Cafe-movie-review/movie-review/21975443.cms"

response = requests.get(URL)

parser = etree.HTMLParser()
root = etree.fromstring(response.text, parser=parser)
rating_text = root.find('.//div[@id="sshow"]/table/tr/td[2]/div[1]/script[1]').text  # rating_text is: fbcriticRating="4";
print(re.search(r"\d+", rating_text).group(0))  # prints 4

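Note that you don't have to use requests here; urllib2 would work just as well. This is only an example. The main part is parsing the HTML and getting the rating value.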
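If only the rating number is needed, a single regular expression over the raw page text may be enough and avoids building a parse tree. The following is a minimal sketch, assuming the page embeds the rating as fbcriticRating="4"; as in the script text above. The helper name extract_rating and the sample HTML fragment are made up for illustration and are not taken from the site.

```python
import re

# Hypothetical helper. The pattern assumes the page source contains the
# rating in the form fbcriticRating="<digits>", as in the script text above.
RATING_RE = re.compile(r'fbcriticRating="(\d+)"')


def extract_rating(html):
    """Return the critic rating as an int, or None if it is not found."""
    match = RATING_RE.search(html)
    return int(match.group(1)) if match else None


# Made-up fragment standing in for the downloaded page source.
sample_html = '<div id="sshow"><script>fbcriticRating="4";</script></div>'
print(extract_rating(sample_html))
```

With a real page you would pass response.text to the helper. It returns None when the pattern is missing, so a change in the page layout shows up as a missing value rather than an exception.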

Hope that helps.
