python - Grab url from href and text

Question

I tried using regex, but after reading around I was pointed to Beautiful Soup...

I've kinda figured out how to get URLs from HTML tags with soup, but how would I grab URLs from both HTML tags (href=*) and the body text of the page?

Also, for the ones in tags, how do I specify that I only want URLs starting with http:// or https://?

Thanks in advance!

score 1 · Accepted Answer

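First, take a look at parsing-html-in-python-lxml-or-beautifulsoup. I read it and never looked at soup afterwards, I guess because I found lxml easy. I'm sure there are different ways to do what you're asking, and perhaps easier ones, but I'll show what I use.

In lxml you can use XPath, which plays a role for XML/HTML similar to what regular expressions play for text. The code below finds all "a" tags that have an "href" attribute and prints every link that starts with http. That should help you get started with parsing.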

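from lxml import etree

# Parse the file with the lenient HTML parser rather than the strict XML one.
tree = etree.parse("my.html", etree.HTMLParser())
root = tree.getroot()
# Select every <a> element that has an href attribute.
links = root.findall('.//a[@href]')
for link in links:
    href = link.get("href")
    # Keep only absolute http:// and https:// links.
    if href.startswith(("http://", "https://")):
        print(href)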
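
To also cover URLs that appear in the body text, one possible approach is to combine an XPath query for the href values with a regular expression over the page's text. This is only a rough sketch: the SAMPLE page, the extract_urls helper, and the URL_RE pattern are made up for illustration, and the pattern is deliberately loose, so it may pick up trailing punctuation on real pages.

```python
import re
from lxml import html

# Hypothetical sample page, used only for illustration.
SAMPLE = """
<html><body>
<a href="http://example.com/a">first</a>
<a href="/relative/path">relative</a>
<a href="https://example.org/b">second</a>
<p>Plain text mentioning https://example.net/c in a sentence.</p>
</body></html>
"""

# Loose pattern for http(s) URLs in free text (an assumption, not a full URL grammar).
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(page_source):
    """Return (urls found in <a href>, urls found in the page text)."""
    root = html.fromstring(page_source)
    # XPath starts-with() keeps only hrefs beginning with http:// or https://.
    href_urls = root.xpath(
        "//a[starts-with(@href, 'http://') or starts-with(@href, 'https://')]/@href"
    )
    # text_content() returns the text of the element and its descendants, without tags.
    text_urls = URL_RE.findall(root.text_content())
    return list(href_urls), text_urls


hrefs, in_text = extract_urls(SAMPLE)
print(hrefs)
print(in_text)
```

The XPath filter does the "only http:// or https://" check inside the query itself, so relative links such as /relative/path are skipped without any extra Python code.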
