python - Extracting URLs with lxml

Question

I've scraped some HTML into a large txt file (about 50k lines) and want to extract a specific set of URLs. The URLs I'm after appear in one of the following two patterns:

First

<div class="pic">
  <a href="https://www.site.com/joesmith"><img alt="Joe Smith" class="person_image" src="https://s3.amazonaws.com/photos.site.com/medium_jpg?12345678"></a>
</div>

Second

<div class="name">
  <a href="https://www.site.com/joesmith">Joe Smith</a>
</div>

The text I need is https://www.site.com/joesmith. This is my first time using lxml, and I'm having trouble putting it all together.

Here is my code

from lxml import etree
from io import StringIO

def read(filename):
  file = open(filename, 'r')
  text = file.read()
  file.close()
  out = unicode(text, errors='ignore')
  return out

def parse(filename):
  data = read(filename)
  parser = etree.HTMLParser()
  tree = etree.parse(StringIO(data), parser)
  result = etree.tostring(tree.getroot(), pretty_print=True, method='HTML')
  urls = result.findall('<div class="name">')
  return urls

I've tried this code with both findall and findtext, and the result is the same either way: "AttributeError: 'str' object has no attribute 'findall'". I've confirmed with type() that 'result' is a str.

Am I on the right track for extracting the URLs? How should I resolve this attribute error?

score 2 · Accepted Answer

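I'm not sure whether HTML-based trees support XPath (I suspect they do). If so, you can simply query the parsed tree itself rather than the string returned by etree.tostring, which is where the AttributeError comes from: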

urls = (tree.xpath('//div[@class="pic"]/a/@href') +
        tree.xpath('//div[@class="name"]/a/@href'))
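For completeness, here is a minimal end-to-end sketch of the same idea, assuming Python 3 and an installed lxml. The names SAMPLE_HTML and extract_urls are made up for illustration, and the sample markup is just the two snippets from the question. It combines both patterns with an XPath union and drops duplicate links, since both patterns tend to point at the same page.

```python
from lxml import etree

# Sample markup copied from the question; in practice this would be the
# contents of the scraped text file.
SAMPLE_HTML = """
<div class="pic">
  <a href="https://www.site.com/joesmith"><img alt="Joe Smith" class="person_image" src="https://s3.amazonaws.com/photos.site.com/medium_jpg?12345678"></a>
</div>
<div class="name">
  <a href="https://www.site.com/joesmith">Joe Smith</a>
</div>
"""


def extract_urls(html_text):
    """Return the unique hrefs from the 'pic' and 'name' divs, in document order."""
    # etree.HTML parses a string with lxml's lenient HTML parser and returns
    # the root element. Elements support .xpath(); the string produced by
    # etree.tostring() does not.
    root = etree.HTML(html_text)
    hrefs = root.xpath(
        '//div[@class="pic"]/a/@href | //div[@class="name"]/a/@href'
    )
    # Convert lxml's string results to plain str and drop duplicates while
    # keeping the first occurrence of each URL.
    return list(dict.fromkeys(str(h) for h in hrefs))


urls = extract_urls(SAMPLE_HTML)
print(urls)
```

To run this against the real file, something like open(filename, encoding='utf-8', errors='ignore').read() should stand in for the unicode(...) call in the question, though the right encoding is a guess that depends on how the file was saved.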
