python - Python, parsing HTML

Question

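Thanks to the kind users of this site, I have some idea of how to use re as an alternative to non-standard Python modules, so that my script works with minimal overhead. Today I've been experimenting with parsing modules. I came across BeautifulSoup. It all looks great, but I don't understand it.

For educational purposes, I'd like to extract the following information from http://yify-torrents.com/browse-movie (please don't tell me to use a web crawler; I'm not trying to crawl the whole site, just to pull the information from this one page to learn how parsing modules work!):

Movie title
Quality
Torrent link

There are 22 of these items, and I'd like them stored in lists, in order, i.e. item_1, item_2. Each list needs to contain those three values. For example: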

item_1 = ["James Bond: Casino Royale (2006)", "720p", "http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent"]
item_2 = ["Pitch Perfect (2012)", "720p", "http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent"]

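Then, to keep things simple, I just want to print each item to the console. To make things harder, however, the items have no identifiers on the page, so the information has to stay strictly in order. That's all fine, but all I get is either the entire page source inside every list item, or empty items! An example item block looks like this: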

<div class="browse-info">
    <span class="info">
        <h3><a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006">James Bond: Casino Royale (2006)</a></h3>
        <p><b>Size:</b> 1018.26 MB</p>
        <p><b>Quality:</b> 720p</p>
        <p><b>Genre:</b> Action | Crime</p>
        <p><b>IMDB Rating:</b> 7.9/10</p>
            <span>
                <p class="peers"><b>Peers:</b> 698</p>
                <p class="peers"><b>Seeds:</b> 356</p>
            </span>
    </span>
    <span class="links">
        <a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006" class="std-btn-small mright">View Info<span></span></a>
        <a href="http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent" class="std-btn-small mleft torrentDwl" data-movieID="2620" data-torrentID="2812">Download<span></span></a>
    </span> 
</div>

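Any ideas? Could someone give me an example of how to do this? I'm not sure Beautiful Soup can handle all of my requirements! PS. Sorry for the poor English, it's not my first language.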

score 2 · Accepted Answer

from bs4 import BeautifulSoup
import urllib2

f=urllib2.urlopen('http://yify-torrents.com/browse-movie')
html=f.read()
soup=BeautifulSoup(html)


In [25]: for i in soup.findAll("div",{"class":"browse-info"}):
    ...:     name=i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text=="Quality:":
    ...:             quality=x.parent.text
    ...:     link=i.find('a',{"class":"std-btn-small mleft torrentDwl"})['href']
    ...:     print [name,quality,link]
    ...:     
[u'James Bond: Casino Royale (2006)', u'Quality: 720p', 'http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent']
[u'Pitch Perfect (2012)', u'Quality: 720p', 'http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent']
...

Or, to get exactly the output you want (the quality value without the "Quality:" label):

In [26]: for i in soup.findAll("div",{"class":"browse-info"}):
    ...:     name=i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text=="Quality:":
    ...:             quality=x.parent.find(text=True, recursive=False).strip()
    ...:     link=i.find('a',{"class":"std-btn-small mleft torrentDwl"})['href']
    ...:     print [name,quality,link]
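
The sessions above are Python 2 (urllib2, print statement) and depend on the live page. For readers on Python 3, here is a minimal sketch of the same approach, run against an abbreviated copy of the sample markup from the question so that it works offline. The names SAMPLE_HTML and extract_items are hypothetical, and passing "html.parser" as the parser is an assumption made to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the sample block from the question (the real page had 22).
SAMPLE_HTML = """
<div class="browse-info">
    <span class="info">
        <h3><a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006">James Bond: Casino Royale (2006)</a></h3>
        <p><b>Size:</b> 1018.26 MB</p>
        <p><b>Quality:</b> 720p</p>
    </span>
    <span class="links">
        <a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006" class="std-btn-small mright">View Info<span></span></a>
        <a href="http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent" class="std-btn-small mleft torrentDwl">Download<span></span></a>
    </span>
</div>
"""


def extract_items(html):
    """Return one [name, quality, link] list per browse-info block, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for block in soup.find_all("div", class_="browse-info"):
        # The first <a> inside the block is the title link in the <h3>.
        name = block.find("a").get_text(strip=True)
        quality = None
        for bold in block.find_all("b"):
            if bold.get_text(strip=True) == "Quality:":
                # The value is the text after the <b> label inside the same <p>.
                quality = bold.parent.get_text().replace("Quality:", "", 1).strip()
                break
        # class_ matches a single class even when the tag carries several.
        link = block.find("a", class_="torrentDwl")["href"]
        items.append([name, quality, link])
    return items


for item in extract_items(SAMPLE_HTML):
    print(item)
```

Because find_all returns matches in document order, the resulting lists keep the order of the page, which is what the question asks for when the items carry no identifiers.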

score 0 · Answer

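As you requested, I've pasted a simple example of a parser. As you can see, it uses lxml. With lxml there are two ways to work with the DOM tree: one is XPath, the other is CSS selectors. I prefer XPath.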

import lxml.html
import decimal
import urllib

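def parse():
    url = 'https://sometotosite.com'
    doc = lxml.html.fromstring(urllib.urlopen(url).read())
    # Container element that holds every row of interest.
    main_div = doc.xpath("//div[@id='line']")[0]
    main = {}
    tr = []
    category = None  # set once the first category header is seen
    for el in main_div.getchildren():
        # A child containing an <a> whose name includes 'tn' starts a new category.
        if el.xpath("descendant::a[contains(@name,'tn')]/text()"):
            category = el.xpath("descendant::a[contains(@name,'tn')]/text()")[0]
            main[category] = ''
            tr = []
        else:
            # Otherwise collect the child elements whose markup contains &#8212.
            for element in el.getchildren():
                if '&#8212' in lxml.html.tostring(element):
                    tr.append(element)
                    print category, tr

parse()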
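
The parser above was written for a different page. As a rough sketch of how the same XPath approach might be applied to the markup in the question, the following Python 3 example runs against an abbreviated copy of the sample block instead of the live site. The names SAMPLE_HTML and extract_items are hypothetical, and the XPath expressions assume every item follows the structure shown in the question.

```python
import lxml.html

# Abbreviated copy of the sample block from the question.
SAMPLE_HTML = """
<div class="browse-info">
    <span class="info">
        <h3><a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006">James Bond: Casino Royale (2006)</a></h3>
        <p><b>Size:</b> 1018.26 MB</p>
        <p><b>Quality:</b> 720p</p>
    </span>
    <span class="links">
        <a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006" class="std-btn-small mright">View Info<span></span></a>
        <a href="http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent" class="std-btn-small mleft torrentDwl">Download<span></span></a>
    </span>
</div>
"""


def extract_items(html):
    """Return one [name, quality, link] list per browse-info block, in page order."""
    doc = lxml.html.fromstring(html)
    items = []
    for block in doc.xpath("//div[@class='browse-info']"):
        # string(...) yields the text of the first matching node, or '' if none.
        name = block.xpath("string(.//h3/a)").strip()
        # Select the <p> whose <b> label reads 'Quality:' and take its own text.
        quality = block.xpath("string(.//p[b='Quality:']/text())").strip()
        # The download link is the <a> whose class list includes torrentDwl.
        link = block.xpath("string(.//a[contains(@class, 'torrentDwl')]/@href)")
        items.append([name, quality, link])
    return items


for item in extract_items(SAMPLE_HTML):
    print(item)
```

Using relative expressions (starting with .//) inside the loop keeps each lookup scoped to one browse-info block, so the three values of an item cannot get mixed up with those of its neighbours.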

Official lxml website
