3
4

5 回答 5

8

最简单的可能是BeautifulSoup(请务必使用 3.0.8 或更高3.0.*版本,而不是 3.1.*,除非您使用的是 Python 3——请参见此处!)。

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(thehtmlstring)

for anchor in soup.findAll('a'):
  print anchor['href'], anchor.string

BeautifulSoup 生成 unicode 字符串——如果这是一个问题,请确保按照您希望的方式对它们进行编码,以按照您想要的方式获取字节字符串!

于 2010-06-29T22:31:10.817 回答
4

我个人会使用 lxml。安装后,得到你想要的很简单:

from lxml import html

tree = html.fromstring(open("data.html").read())

print [e.text_content() for e in tree.xpath("//a")]
于 2010-06-29T22:25:05.663 回答
2

SGMLParser 在 Python 2.6 中已被弃用,并将在 3.0 中消失。您可能想改用 HTMLParser 模块。我以前从未使用过它(我总是只使用 BeutifulSoup 来处理这类事情),所以我想我会学习它是如何工作的。这是我放在一起的示例脚本,它应该可以满足您的需求。

#!/usr/bin/env python

from HTMLParser import HTMLParser

class URLParser(HTMLParser):
    def __init__(self):
        self.in_link = False
        self.links = []
        self.current_link = ''
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.current_link = self.get_href_from_attrs(attrs)
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.links.append(self.current_link)
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.current_link = '%s - %s' % (self.current_link, data)

    def get_href_from_attrs(self, attrs):
        # The attrs dict is a list of tuples like:
        #  [('href', 'www.google.com'), ('class', 'some-class')]
        for prop, val in attrs:
            if prop == 'href':
                return val
        return ''

if __name__ == '__main__':
    the_html = '''
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">F/T &amp; P/T Sales Associate - Caliente Fashions</a> - <font size="-1"> (North Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>

<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html">Optical Sales Position</a> - <font size="-1"> (New Westminster)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817709780.html">Sales Clerk</a> - <font size="-1"> (Kits)</font></p>

<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817676850.html">MARINE SALES</a> - <font size="-1"> (VANCOUVER ( KITS ))</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817608506.html">Retail Sales Associate</a> - <font size="-1"> (Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817573985.html">Retail with small parts appliance background</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817540938.html">Manager *Enjoyable work atmosphere</a> - <font size="-1"> (Langley Centre)</font></p>

<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html">Team Member - Retail Store - FT</a> - <font size="-1"> (Burnaby South)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817459155.html">STORE MANAGER-SHOE WAREHOUSE</a> - <font size="-1"> (South Surrey-Semiahmoo)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/pml/ret/1817448777.html">Retail Sales</a> - <font size="-1"> (Coquitlam)</font></p>
    '''
    url_parser = URLParser()
    url_parser.feed(the_html)

    print '\n'.join(url_parser.links)

输出

http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - F/T  -  P/T Sales Associate - Caliente Fashions
http://vancouver.en.craigslist.ca/van/ret/1817804151.html - IMMEDIATE EMPLOYMENT WANTED!
http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html - TRAVEL AGENT
http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html - Optical Sales Position
http://vancouver.en.craigslist.ca/van/ret/1817709780.html - Sales Clerk
http://vancouver.en.craigslist.ca/van/ret/1817676850.html - MARINE SALES
http://vancouver.en.craigslist.ca/van/ret/1817608506.html - Retail Sales Associate
http://vancouver.en.craigslist.ca/van/ret/1817573985.html - Retail with small parts appliance background
http://vancouver.en.craigslist.ca/rds/ret/1817540938.html - Manager *Enjoyable work atmosphere
http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html - Team Member - Retail Store - FT
http://vancouver.en.craigslist.ca/rds/ret/1817459155.html - STORE MANAGER-SHOE WAREHOUSE
http://vancouver.en.craigslist.ca/pml/ret/1817448777.html - Retail Sales

更新:经过那个小练习之后,这个界面感觉很糟糕,所以我会坚持使用更干净的 BeutifulSoup 库。查看 Alex 的示例以了解它是如何完成的。

于 2010-06-29T22:39:27.367 回答
1

只要我们在比较选项,这个 pyparsing 片段还会为您提供每个位置的位置,在<font>结束<a>标记之后的标记中给出:

from pyparsing import makeHTMLTags, SkipTo

a,aEnd = makeHTMLTags("A")
font,fontEnd = makeHTMLTags("FONT")
p,pEnd = makeHTMLTags("P")

patt = (p + a("a") + SkipTo(aEnd)("posn") + aEnd + '-' + 
        font + SkipTo(fontEnd)("locn") + fontEnd + pEnd)

for tokens,_,_ in patt.scanString(the_html):
    print tokens.a.href, '-', tokens.posn, tokens.locn

给出:

http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - F/T &amp; P/T Sales Associate - Caliente Fashions (North Vancouver)
http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html - TRAVEL AGENT (NORTH VANCOUVER)
http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html - Optical Sales Position (New Westminster)
http://vancouver.en.craigslist.ca/van/ret/1817709780.html - Sales Clerk (Kits)
http://vancouver.en.craigslist.ca/van/ret/1817676850.html - MARINE SALES (VANCOUVER ( KITS ))
http://vancouver.en.craigslist.ca/van/ret/1817608506.html - Retail Sales Associate (Vancouver)
http://vancouver.en.craigslist.ca/rds/ret/1817540938.html - Manager *Enjoyable work atmosphere (Langley Centre)
http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html - Team Member - Retail Store - FT (Burnaby South)
http://vancouver.en.craigslist.ca/rds/ret/1817459155.html - STORE MANAGER-SHOE WAREHOUSE (South Surrey-Semiahmoo)
http://vancouver.en.craigslist.ca/pml/ret/1817448777.html - Retail Sales (Coquitlam)
于 2010-06-29T23:05:57.833 回答
0
#download BeautifulSoup library for python
from Beautiful import *

fh = open('data.html')
html = fh.read()
soup = BeautifulSoup(html)

tags = soup('a')

for tag in tags:
    print tag.contents[0]
于 2017-03-25T23:00:29.717 回答