5 Answers
8
The simplest option is probably BeautifulSoup (be sure to use version 3.0.8 or a later 3.0.* release, not 3.1.*, unless you are on Python 3; see here!).
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(thehtmlstring)
for anchor in soup.findAll('a'):
    print anchor['href'], anchor.string
BeautifulSoup produces unicode strings. If that is a problem, encode them explicitly, in the encoding you want, to get byte strings.
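To make the unicode-versus-bytes point concrete, here is a minimal sketch of explicit encoding. It is written for Python 3 (an assumption; the snippet above targets Python 2), and the sample string and the two encodings are illustrative choices, not taken from the answer.

```python
# Hypothetical link text, standing in for a value such as anchor.string.
link_text = u"Caf\u00e9 Sales Associate"

# Encode explicitly to get a byte string in the encoding you want.
utf8_bytes = link_text.encode("utf-8")
latin1_bytes = link_text.encode("latin-1")

# The same text can produce different byte sequences per encoding.
print(repr(utf8_bytes))
print(repr(latin1_bytes))
```

Decoding with the matching codec, for example `utf8_bytes.decode("utf-8")`, gives back the original text.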
answered 2010-06-29T22:31:10.817
4
Personally, I would use lxml. Once it is installed, getting what you want is simple:
from lxml import html
tree = html.fromstring(open("data.html").read())
print [e.text_content() for e in tree.xpath("//a")]
answered 2010-06-29T22:25:05.663
2
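SGMLParser was deprecated in Python 2.6 and is gone in 3.0. You may want to use the HTMLParser module instead. I had never used it before (I always just use BeautifulSoup for this kind of thing), so I figured I would learn how it works. Here is a sample script I put together that should do what you need.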
#!/usr/bin/env python
from HTMLParser import HTMLParser

class URLParser(HTMLParser):
    def __init__(self):
        self.in_link = False
        self.links = []
        self.current_link = ''
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.current_link = self.get_href_from_attrs(attrs)
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.links.append(self.current_link)
            self.in_link = False

    def handle_data(self, data):
        # Append each piece of text found inside the open <a> tag.
        if self.in_link:
            self.current_link = '%s - %s' % (self.current_link, data)

    def get_href_from_attrs(self, attrs):
        # attrs is a list of (name, value) tuples like:
        # [('href', 'www.google.com'), ('class', 'some-class')]
        for prop, val in attrs:
            if prop == 'href':
                return val
        return ''

if __name__ == '__main__':
    the_html = '''
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">F/T & P/T Sales Associate - Caliente Fashions</a> - <font size="-1"> (North Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html">Optical Sales Position</a> - <font size="-1"> (New Westminster)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817709780.html">Sales Clerk</a> - <font size="-1"> (Kits)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817676850.html">MARINE SALES</a> - <font size="-1"> (VANCOUVER ( KITS ))</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817608506.html">Retail Sales Associate</a> - <font size="-1"> (Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817573985.html">Retail with small parts appliance background</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817540938.html">Manager *Enjoyable work atmosphere</a> - <font size="-1"> (Langley Centre)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html">Team Member - Retail Store - FT</a> - <font size="-1"> (Burnaby South)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817459155.html">STORE MANAGER-SHOE WAREHOUSE</a> - <font size="-1"> (South Surrey-Semiahmoo)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/pml/ret/1817448777.html">Retail Sales</a> - <font size="-1"> (Coquitlam)</font></p>
'''
    url_parser = URLParser()
    url_parser.feed(the_html)
    print '\n'.join(url_parser.links)
Output:
http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - F/T - P/T Sales Associate - Caliente Fashions
http://vancouver.en.craigslist.ca/van/ret/1817804151.html - IMMEDIATE EMPLOYMENT WANTED!
http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html - TRAVEL AGENT
http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html - Optical Sales Position
http://vancouver.en.craigslist.ca/van/ret/1817709780.html - Sales Clerk
http://vancouver.en.craigslist.ca/van/ret/1817676850.html - MARINE SALES
http://vancouver.en.craigslist.ca/van/ret/1817608506.html - Retail Sales Associate
http://vancouver.en.craigslist.ca/van/ret/1817573985.html - Retail with small parts appliance background
http://vancouver.en.craigslist.ca/rds/ret/1817540938.html - Manager *Enjoyable work atmosphere
http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html - Team Member - Retail Store - FT
http://vancouver.en.craigslist.ca/rds/ret/1817459155.html - STORE MANAGER-SHOE WAREHOUSE
http://vancouver.en.craigslist.ca/pml/ret/1817448777.html - Retail Sales
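For readers on Python 3, where the parser lives in `html.parser`, the same idea might look roughly like the sketch below. This is not the answerer's code: the class name `LinkCollector`, the `(href, text)` tuple format, and the one-line sample input are assumptions made for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True on Python 3.5+
        self.links = []
        self._href = None   # href of the <a> currently open, or None
        self._chunks = []   # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) tuples; value may be None.
            self._href = dict(attrs).get("href") or ""
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._chunks)))
            self._href = None


# Hypothetical sample input, shaped like the listing markup above.
sample = '<p><a href="http://example.com/1.html">F/T &amp; P/T Sales</a> - (Somewhere)</p>'
collector = LinkCollector()
collector.feed(sample)
collector.close()
for href, text in collector.links:
    print(href, "-", text)
```

Collecting the text pieces in a list and joining them at the closing tag keeps the link text intact even if the parser delivers it in several chunks.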
Update: After that little exercise, the interface feels pretty clunky, so I will stick with the cleaner BeautifulSoup library. See Alex's example for how it is done.
answered 2010-06-29T22:39:27.367
1
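As long as we are comparing options, this pyparsing snippet will also give you the location of each position, which is given in the <font> tag that follows the closing <a> tag: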
from pyparsing import makeHTMLTags, SkipTo
a,aEnd = makeHTMLTags("A")
font,fontEnd = makeHTMLTags("FONT")
p,pEnd = makeHTMLTags("P")
patt = (p + a("a") + SkipTo(aEnd)("posn") + aEnd + '-' +
        font + SkipTo(fontEnd)("locn") + fontEnd + pEnd)
for tokens,_,_ in patt.scanString(the_html):
    print tokens.a.href, '-', tokens.posn, tokens.locn
Gives:
http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - F/T & P/T Sales Associate - Caliente Fashions (North Vancouver)
http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html - TRAVEL AGENT (NORTH VANCOUVER)
http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html - Optical Sales Position (New Westminster)
http://vancouver.en.craigslist.ca/van/ret/1817709780.html - Sales Clerk (Kits)
http://vancouver.en.craigslist.ca/van/ret/1817676850.html - MARINE SALES (VANCOUVER ( KITS ))
http://vancouver.en.craigslist.ca/van/ret/1817608506.html - Retail Sales Associate (Vancouver)
http://vancouver.en.craigslist.ca/rds/ret/1817540938.html - Manager *Enjoyable work atmosphere (Langley Centre)
http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html - Team Member - Retail Store - FT (Burnaby South)
http://vancouver.en.craigslist.ca/rds/ret/1817459155.html - STORE MANAGER-SHOE WAREHOUSE (South Surrey-Semiahmoo)
http://vancouver.en.craigslist.ca/pml/ret/1817448777.html - Retail Sales (Coquitlam)
answered 2010-06-29T23:05:57.833
0
# Requires the BeautifulSoup library (3.x) to be installed
from BeautifulSoup import BeautifulSoup
fh = open('data.html')
html = fh.read()
soup = BeautifulSoup(html)
# Calling the soup object is shorthand for soup.findAll('a')
tags = soup('a')
for tag in tags:
    print tag.contents[0]
answered 2017-03-25T23:00:29.717