python - Using a regular expression to find the HTML link nearest a keyword

Question

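Suppose I am searching for the keyword "sales" and I want the nearest link, "http://www.somewebsite.com", even though the file contains several links. I want the closest link, not the first one. This means I need to search for the link that precedes the keyword match.

This doesn't work...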

regex = (http|https)://[-A-Za-z0-9./]+.*(?!((http|https)://[-A-Za-z0-9./]+))sales sales

What is the best way to find the link closest to the keyword?

score 3 · Accepted Answer

Using an HTML parser instead of a regular expression is usually easier and more robust.

Using the third-party module lxml:

import lxml.html as LH

content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''

doc = LH.fromstring(content)    
for url in doc.xpath('''
    //*[contains(text(),"sales")]
    /preceding::*[starts-with(@href,"http")][1]/@href'''):
    print(url)

yields

http://www.somewebsite.com

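I find lxml (and XPath) a convenient way to express which elements I am looking for. However, if installing a third-party module is not an option, you can also do this particular job with HTMLParser from the standard library: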

from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser
import contextlib

class MyParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.last_link = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if 'href' in attrs:
            self.last_link = attrs['href']

content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''

idx = content.find('sales')

with contextlib.closing(MyParser()) as parser:
    parser.feed(content[:idx])
    print(parser.last_link)
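
A variation on the same idea, as a minimal sketch: instead of slicing the raw string at content.find('sales'), the parser can watch the text nodes itself and record the most recent href each time the keyword shows up. This avoids matching "sales" inside a tag or attribute. The class name KeywordLinkParser and its results attribute are made up for illustration, and the code assumes Python 3.

```python
from html.parser import HTMLParser


class KeywordLinkParser(HTMLParser):
    """Hypothetical helper: remembers the last href seen before each keyword hit."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword
        self.last_link = None   # most recent href seen so far
        self.results = []       # one entry per text node containing the keyword

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if 'href' in attrs:
            self.last_link = attrs['href']

    def handle_data(self, data):
        # Called for text between tags, never for tag names or attributes
        if self.keyword in data:
            self.results.append(self.last_link)


content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''

parser = KeywordLinkParser('sales')
parser.feed(content)
parser.close()
print(parser.results)
```

Each entry in results is the href that was current when a matching text node was seen, or None if no link had appeared yet.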

About the XPath used in the lxml solution: the XPath has the following meaning:

 //*                              # Find all elements
   [contains(text(),"sales")]     # whose text content contains "sales"
   /preceding::*                  # search the preceding elements 
     [starts-with(@href,"http")]  # such that it has an href attribute that starts with "http"
       [1]                        # select only the first such element, i.e. the nearest preceding one
         /@href                   # return the value of the href attribute

score 0 · Accepted Answer

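I don't think you can do this with a regular expression alone (especially when looking backwards from the keyword match), because a regex has no notion of comparing distances.

I think you would be better off doing something like this:

Find all occurrences of sales and get their substring indices; call this salesIndex
Find all occurrences of https?://[-A-Za-z0-9./]+ and get their substring indices; call this urlIndex
Loop through salesIndex. For each position i in salesIndex, find the nearest entry in urlIndex.

Depending on how you want to judge "closest", you may need both the start and end indices of the sales and http... occurrences for the comparison. That is, find the URL whose end index is closest to the start index of the current sales occurrence, find the URL whose start index is closest to the end index of the current sales occurrence, and pick whichever of the two is closer.

You can use matches = re.finditer(pattern,string,re.IGNORECASE) to get an iterator over the matches, then match.span() to get the start/end substring indices of each match in matches.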
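
The steps above can be sketched roughly as follows. This is only an illustration under assumptions: the helper names gap and nearest_urls are invented here, "closest" is taken to mean the smallest number of characters between the two spans, and the sample text is made up.

```python
import re

URL_PATTERN = r'https?://[-A-Za-z0-9./]+'


def gap(a, b):
    """Number of characters between two (start, end) spans; 0 if they overlap."""
    return max(a[0] - b[1], b[0] - a[1], 0)


def nearest_urls(text, keyword='sales'):
    """Hypothetical helper: pair each keyword occurrence with its nearest URL."""
    sales_index = [m.span() for m in re.finditer(re.escape(keyword), text, re.IGNORECASE)]
    url_matches = list(re.finditer(URL_PATTERN, text, re.IGNORECASE))
    results = []
    for span in sales_index:
        if not url_matches:
            results.append(None)  # keyword found, but there is no URL at all
            continue
        best = min(url_matches, key=lambda m: gap(m.span(), span))
        results.append(best.group())
    return results


text = 'http://a.example/x foo sales bar baz qux http://b.example/y'
print(nearest_urls(text))
```

Because gap compares span edges rather than start offsets, a long URL that ends right next to the keyword counts as close even though it starts far away.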

score 0 · Accepted Answer

Building on math.coffee's suggestion, you could try the following:

import re
myString = "" ## the string you want to search

link_matches = re.finditer('(http|https)://[-A-Za-z0-9./]+',myString,re.IGNORECASE)
sales_matches = re.finditer('sales',myString,re.IGNORECASE)

link_locations = []

for match in link_matches:
    link_locations.append([match.span(),match.group()])

for match in sales_matches:
    match_loc = match.span()
    distances = []
    for link_loc in link_locations:
        if match_loc[0] > link_loc[0][1]: ## if the link ends before your keyword starts
            ## append the distance between the END of the link and the START of the keyword
            distances.append(match_loc[0] - link_loc[0][1])
        else:
            ## append the distance between the END of the keyword and the START of the link
            distances.append(link_loc[0][0] - match_loc[1])

    if distances: ## at least one link was found
        closest = distances.index(min(distances)) ## index of the first nearest link
        print ("Closest Link: " + link_locations[closest][1] + "\n")

score -1 · Accepted Answer

I tested this code and it seems to work...

import re

def closesturl(keyword, website):
    keylist = []
    urllist = []
    closest = []
    urls = []
    urlregex = "(http|https)://[-A-Za-z0-9\\./]+"
    urlmatches = re.finditer(urlregex, website, re.IGNORECASE)
    keymatches = re.finditer(keyword, website, re.IGNORECASE)
    for n in keymatches:
        keylist.append([n.start(), n.end()])
    if(len(keylist) > 0):
        for m in urlmatches:
            urllist.append([m.start(), m.end()])
    if((len(keylist) > 0) and (len(urllist) > 0)):
        for i in range (0, len(keylist)):
            # Start with the first URL as the best candidate for keyword i
            closest.append(abs(urllist[0][0]-keylist[i][0]))
            urls.append(website[urllist[0][0]:urllist[0][1]])
            for j in range (1, len(urllist)):
                distance = abs(urllist[j][0]-keylist[i][0])
                if(distance < closest[i]):
                    closest[i] = distance
                    urls[i] = website[urllist[j][0]:urllist[j][1]]
                else:
                    break # URLs are in document order, so the distance only grows from here
    if((len(keylist) > 0) and (len(urllist) > 0)):
        return urls #return website[urllist[index[0]][0]:urllist[index[0]][1]]                                                                
    else:
        return ""

somestring = "hey whats up... http://www.firstlink.com some other test http://www.secondlink.com then mykeyword"
keyword = "mykeyword"
print(closesturl(keyword, somestring))

When run, the above displays http://www.secondlink.com...

If anyone has ideas on how to speed up this code, that would be great!

Thanks, V$H.
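
On the speed question, one possible direction (a sketch, not a benchmarked result): re.finditer yields matches in document order, so the URL start offsets are already sorted, and bisect can locate the two neighbouring candidates for each keyword without scanning the whole list. The function name closest_urls_bisect is made up for illustration. It measures distance between start offsets, as the code above does, but returns an empty list rather than "" when nothing is found.

```python
import bisect
import re


def closest_urls_bisect(keyword, website):
    """Hypothetical variant: nearest URL per keyword hit via binary search."""
    urlregex = r"(?:http|https)://[-A-Za-z0-9./]+"
    url_matches = list(re.finditer(urlregex, website, re.IGNORECASE))
    starts = [m.start() for m in url_matches]  # already in ascending order
    urls = []
    for key in re.finditer(keyword, website, re.IGNORECASE):
        if not starts:
            break
        pos = bisect.bisect_left(starts, key.start())
        # Only the neighbours on either side of the keyword can be the nearest
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(starts)]
        best = min(candidates, key=lambda i: abs(starts[i] - key.start()))
        urls.append(url_matches[best].group())
    return urls


somestring = "hey whats up... http://www.firstlink.com some other test http://www.secondlink.com then mykeyword"
print(closest_urls_bisect("mykeyword", somestring))
```

Each lookup is a binary search over the URL offsets, which should matter mainly when a page contains many links and many keyword hits.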
