3

I am trying to parse a website for

blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah 

(there are many of these, and I want all of them in some tokenized form). Unfortunately the HTML is very large and a little complicated, so trying to crawl down the tree might take me some time to just sort out the nested elements. Is there an easy way to just retrieve this?

Thanks!

4

1 回答 1

14

如果您只想要 href 的a标签,请使用:

data = """blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah"""

import lxml.html
tree = lxml.html.fromstring(data)
print tree.xpath('//a/@href')

# ['THIS IS WHAT I WANT']
于 2013-02-02T15:59:17.310 回答