python - Python: why won't this website parse?

Question

I'm running this code against a website, juventus.com, and I can parse the title:

from urllib import urlopen
import re

webpage = urlopen('http://juventus.com').read()
patFinderTitle = re.compile('<title>(.*)</title>')
findPatTitle = re.findall(patFinderTitle, webpage)
print findPatTitle

The output is:

['Welcome - Juventus.com']

But when I try the same code on another website, it returns nothing:

from urllib import urlopen
import re

webpage = urlopen('http://bp1.shoguto.com/detail.php?userg=hhchpxqhacciliq').read()
patFinderTitle = re.compile('<title>(.*)</title>')
findPatTitle = re.findall(patFinderTitle, webpage)
print findPatTitle

Does anyone know why?

score 4 · Accepted Answer

The content of http://bp1.shoguto.com/detail.php?userg=hhchpxqhacciliq is (reformatted for readability):

<script type='text/javascript'>
top.location.href = 'https://www.facebook.com/dialog/oauth?
client_id=466261910087459&redirect_uri=http%3A%2F%2Fbp1.shoguto.com&
state=07c9ba739d9340de596f64ae21754376&scope=email&0=publish_actions';
</script>

There is no title tag, so the regular expression matches nothing.
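To make this concrete, here is a minimal sketch (Python 3 syntax, standard library only) that runs a title regex over the body shown above and also looks for a JavaScript redirect target. The helper names `find_titles` and `find_js_redirect` and the redirect pattern are illustrative assumptions, not part of the original code, and the pattern only covers the simple single-quoted `location.href = '...'` form seen here.

```python
import re

# Non-greedy, case-insensitive variant of the question's title pattern.
TITLE_RE = re.compile(r'<title>(.*?)</title>', re.IGNORECASE | re.DOTALL)
# Hypothetical pattern: matches only a single-quoted location.href assignment.
REDIRECT_RE = re.compile(r"location\.href\s*=\s*'([^']+)'")


def find_titles(html):
    """Return every <title> text found in html (empty list if none)."""
    return TITLE_RE.findall(html)


def find_js_redirect(html):
    """Return the URL assigned to location.href, or None if there is none."""
    match = REDIRECT_RE.search(html)
    return match.group(1) if match else None


# The body quoted in this answer, joined onto one line.
body = (
    "<script type='text/javascript'>"
    "top.location.href = 'https://www.facebook.com/dialog/oauth?"
    "client_id=466261910087459&redirect_uri=http%3A%2F%2Fbp1.shoguto.com&"
    "state=07c9ba739d9340de596f64ae21754376&scope=email&0=publish_actions';"
    "</script>"
)

print(find_titles(body))       # no <title> in the body, so nothing to match
print(find_js_redirect(body))  # the Facebook OAuth URL the script jumps to
```

Checking for a redirect like this only tells you why the title is missing; it does not execute the script, so a browser-driven tool is still needed to reach the final page.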

Use selenium to evaluate the JavaScript:

from selenium import webdriver

driver = webdriver.Firefox() # webdriver.PhantomJS()
driver.get('http://bp1.shoguto.com/detail.php?userg=hhchpxqhacciliq')
print driver.title
driver.quit()

score 0 · Accepted Answer

Because the request is being redirected, and the regular expression does not match a title tag on the page it is redirected to.

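Your code should (a) parse the HTML with beautifulsoup, or with lxml if you know the output will be well-formed XML (or lxml with the beautifulsoup backend), rather than with regular expressions, and (b) use requests, a simpler module for making HTTP requests that handles redirects transparently.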
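As a rough illustration of point (a), the sketch below extracts the title with a real HTML parser instead of a regex. It deliberately swaps in the standard library's `html.parser` so that it runs without third-party packages; it is not beautifulsoup or lxml, which are what this answer recommends. The names `TitleParser` and `extract_title` are hypothetical, and the code uses Python 3 syntax.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element it sees."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._parts = []
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; ignore any title after the first.
        if tag == 'title' and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            self.title = ''.join(self._parts).strip()


def extract_title(html):
    """Return the page title as a string, or None if there is no <title>."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


print(extract_title('<html><head><TITLE>\n Welcome - Juventus.com </TITLE></head></html>'))
print(extract_title("<script type='text/javascript'>top.location.href = 'x';</script>"))
```

Unlike the original pattern, this copes with uppercase tags and titles that span several lines. A parser still cannot follow a JavaScript redirect, so the script-only page yields no title either way.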

score 0 · Accepted Answer

That's because the page urlopen fetches contains only a JavaScript redirect; it simply doesn't contain a title tag.

This is what it contains:

<script type='text/javascript'>top.location.href = 'https://www.facebook.com/dialog/oauth?client_id=466261910087459&redirect_uri=http%3A%2F%2Fbp1.shoguto.com&state=0f9abed6de7412b5129a4d105a4be25f&scope=email&0=publish_actions';</script>

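Also, I could be wrong, but if I remember correctly you can't use urlopen to run JavaScript code. You'll need a different Python module. I can't recall its name right now, but I do remember there is one that can run JavaScript, though it needs a GUI and a working browser to drive, e.g. Firefox...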
