python - Scraping HTML in Python

Question

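I'm trying to find a series of URLs (Twitter links) in a page's source and then write them as a list to a text document. The problem is that once I call .readlines() on the urlopen object, I get only 3-4 lines in total, each containing dozens of URLs that I need to collect one by one. This is the snippet of my code where I try to handle that: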

page = html.readlines()
for line in page:
    ind_start = line.find('twitter')
    ind_end = line.find('</a>', ind_start+1)
    while ('twitter' in line[ind_start:ind_end]):
        output.write(line[ind_start:ind_end] + "\n")
        ind_start = line.find('twitter', ind_start)
        ind_end = line.find('</a>', ind_start + 1)

Unfortunately, I can't extract any URLs with it. Any suggestions?

score 3 · Accepted Answer

You can extract links using lxml and an XPath expression:

from lxml.html import parse

# Parse the page and iterate over the href attribute of every <a> element
p = parse('http://domain.tld/path')
for link in p.xpath('.//a/@href'):
    if "twitter" in link:
        print(link, "matches 'twitter'")

Using a regex here is not the best approach: parsing HTML is a solved problem in 2013. See "RegEx match open tags except XHTML self-contained tags".
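
If lxml is not installed, the same parser-based idea can be sketched with the standard library's html.parser module. This is only a minimal sketch: the class name TwitterLinkCollector and the sample markup are made up for illustration, and in practice the string fed to the parser would be the page source read from the urlopen object.

```python
from html.parser import HTMLParser


class TwitterLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose URL mentions 'twitter'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value and "twitter" in value:
                self.links.append(value)


# Hypothetical sample markup with several links on a single line
sample = (
    '<p><a href="http://twitter.com/alice">alice</a> '
    '<a href="http://example.com/">other</a> '
    '<a href="http://twitter.com/bob">bob</a></p>'
)

collector = TwitterLinkCollector()
collector.feed(sample)
twitter_links = collector.links
```

Because the parser works on tags rather than on lines, it does not matter that dozens of links share one line. Each entry of twitter_links could then be written to the output file followed by "\n".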

score 2 · Accepted Answer

You can use the BeautifulSoup module:

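from bs4 import BeautifulSoup

# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup('your html', 'html.parser')
# href=True skips <a> tags that have no href attribute
elements = soup.find_all('a', href=True)

for el in elements:
    print(el['href'])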

If that is not an option, just use a regular expression:

import re

# Match http:// or https:// followed by the rest of the URL
expression = re.compile(r'https?://[^\s"\'<>]+')
m = expression.search('your string')

if m:
    print('match found!')

This will also match URLs in <img /> tags, but you can easily adapt my solution to look for URLs only in <a /> tags.
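
One possible adaptation, as a rough sketch rather than a robust parser: anchor the pattern on the opening <a tag and capture only the quoted href value. The pattern name A_HREF and the sample markup are assumptions made for illustration, and the pattern will miss unquoted attributes and other unusual markup.

```python
import re

# Capture the href value of <a> tags only; <img src=...> is ignored.
A_HREF = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

# Hypothetical sample markup: one image and three links on a single line
sample = (
    '<img src="http://twitter.com/logo.png" />'
    '<a href="http://twitter.com/alice">alice</a>'
    '<a class="x" href="http://example.com/">other</a>'
    '<a href="https://twitter.com/bob">bob</a>'
)

# findall returns every captured href; keep only the Twitter ones
links = [url for url in A_HREF.findall(sample) if "twitter" in url]
```

Using findall instead of search returns every match on a line, which is what is needed when one line holds dozens of URLs.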
