I tried using regex, but after reading around I got directed to Beautiful Soup...

I've kinda figured out how to get URLs from HTML tags with soup, but how would I grab URLs from both the HTML tags (href=*) and the body text of the page?

Also, for the ones in tags, how do I specify that I only want URLs starting with http://, https://... ?

Thanks in advance!

1 Answer

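First, take a look at parsing-html-in-python-lxml-or-beautifulsoup. I read it and never ended up looking at Beautiful Soup, I guess because I found lxml easy. I'm sure there are different ways to do what you're asking, and maybe easier ones, but I'll show what I use.

In lxml you can use XPath, which is to XML/HTML roughly what regular expressions are to plain text. The code below finds all "a" tags that have an "href" attribute and prints every link that starts with http:// or https://. This should help you get started with parsing.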

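from lxml import etree

# Parse the file with the lenient HTML parser rather than the strict XML one.
tree = etree.parse("my.html", etree.HTMLParser())
root = tree.getroot()
# Match every <a> element that carries an href attribute, at any depth.
links = root.findall('.//a[@href]')
for link in links:
    href = link.get("href")
    # Keep only absolute http:// and https:// links.
    if href.startswith(("http://", "https://")):
        print(href)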
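
To also pick up URLs that appear in the body text, one option is to run a regex over the page text once lxml has stripped the markup. The following is only a rough sketch: the sample page, the `extract_urls` helper and the `URL_RE` pattern are made up for illustration, and the pattern is deliberately simple, so it will miss or over-match some real-world URLs (trailing punctuation, for instance). It also moves the http:// / https:// check into the XPath expression itself.

```python
import re
from lxml import html

# Hypothetical sample page; any HTML string would do.
PAGE = """
<html><body>
  <a href="http://example.com/a">A</a>
  <a href="/relative/path">B</a>
  <a href="https://example.org/b">C</a>
  <p>Plain text mentioning https://example.net/c and ftp://example.net/d.</p>
</body></html>
"""

# Assumed pattern: an http(s) scheme followed by characters that are not
# whitespace, quotes or angle brackets. Not a full URL grammar.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(page_source):
    """Return (urls found in <a href>, urls found in the page text)."""
    doc = html.fromstring(page_source)
    # Let XPath do the filtering: keep only href values with an http(s) scheme.
    tag_urls = doc.xpath(
        '//a[starts-with(@href, "http://") or starts-with(@href, "https://")]/@href'
    )
    # text_content() returns the text of all elements with the markup stripped.
    text_urls = URL_RE.findall(doc.text_content())
    return tag_urls, text_urls


tag_urls, text_urls = extract_urls(PAGE)
print(tag_urls)
print(text_urls)
```

Keeping the two lists separate makes it easy to see where each URL came from; merge them into a set if you only need the unique URLs.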
answered 2013-07-09T23:19:22.467