I am using BS4 with Python 2.7. Here is the start of my code (thanks to root):

from bs4 import BeautifulSoup
import urllib2

f = urllib2.urlopen('http://yify-torrents.com/browse-movie')
html = f.read()
soup = BeautifulSoup(html)

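When I print html, its contents match the page source as viewed in Chrome. However, when I print soup, the entire body is cut off and I am left with only this (the contents of the head tag):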

<!DOCTYPE html>

<html>
<head>
<title>Browse Movie - YIFY Torrents</title>
<meta charset="utf-8">
<meta content="IE=9" http-equiv="X-UA-Compatible"/>
<meta content="YIFY-Torrents.com - The official YIFY Torrents website. Here you will be able to browse and download all YIFY rip movies in excellent DVD, 720p, 1080p and 3D quality, all at the smallest file size." name="description"/>
<meta content="torrents, yify, movies, movie, download, 720p, 1080p, 3D, browse movies, yify-torrents" name="keywords"/>
<link href="http://static.yify-torrents.com/yify.ico" rel="shortcut icon"/>
<link href="http://yify-torrents.com/rss" rel="alternate" title="YIFY-Torrents RSS feed" type="application/rss+xml"/>
<link href="http://static.yify-torrents.com/assets/css/styles.css?1353330463" rel="stylesheet" type="text/css"/>
<link href="http://static.yify-torrents.com/assets/css/colorbox.css?1327223987" rel="stylesheet" type="text/css"/>
<script src="http://static.yify-torrents.com/assets/js/jquery-1.6.1.min.js?1327224013" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/jquery.validate.min.js?1327224011" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/jquery.colorbox-min.js?1327224010" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/form.js?1349683447" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/common.js?1353399801" type="text/javascript"></script>
<script>
        var webRoot = 'http://yify-torrents.com/';
        var IsLoggedIn = 0  </script>
<!--[if !IE]><!--><style type="text/css">#content input.field:focus, #content textarea:focus{border: 1px solid #47bc15 !important;}</style></meta></head></html> 

Where am I going wrong?!

1 Answer

I had the same problem, and this solved it for me:

soup = BeautifulSoup(html, 'html5lib')

You need to install html5lib:

pip install html5lib

or

easy_install html5lib

You can read more about the different parsers available to Beautiful Soup (with their pros and cons) here:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
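
Since the parsers differ in how leniently they handle malformed markup, it can help to name the parser explicitly instead of relying on whichever one Beautiful Soup picks by default. The following is only a minimal sketch of that idea: the helper names pick_parser and make_soup are hypothetical (they are not part of bs4), and the preference order is an assumption you may want to change.

```python
def pick_parser(preferred=("html5lib", "lxml")):
    """Return the first parser name in `preferred` that can be imported.

    Falls back to "html.parser", which ships with Python itself.
    The default preference order is an assumption, not a bs4 rule.
    """
    for name in preferred:
        try:
            __import__(name)
        except ImportError:
            continue  # this parser is not installed, try the next one
        return name
    return "html.parser"


def make_soup(html, parser=None):
    """Hypothetical helper: parse `html` with an explicitly named parser."""
    # Imported here so the parser check above works even without bs4.
    from bs4 import BeautifulSoup

    return BeautifulSoup(html, parser or pick_parser())
```

With this in place, soup = make_soup(html) would use html5lib when it is installed, and checking soup.body is a quick way to see whether the body survived parsing.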

Answered 2013-03-23T15:02:15.037