python - Longest prefix matches for URLs

Question

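I need information about any standard Python package that can be used for "longest prefix match" on URLs. I have gone through two standard packages, http://packages.python.org/PyTrie/#pytrie.StringTrie and http://pypi.python.org/pypi/trie/0.1.1, but they don't seem to be useful for the longest prefix match task on URLs.

For example, suppose my set has these URLs: 1->http://www.google.com/mail, 2->http://www.google.com/document, 3->http://www.facebook.com, etc.

Now if I search for "http://www.google.com/doc" it should return 2, and a search for "http://www.face" should return 3.

I want to confirm whether there is any standard Python package that can help me do this, or whether I should implement a trie for prefix matching myself.

I am not looking for a regular-expression kind of solution, since it does not scale as the number of URLs increases.

Thanks a lot.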

score 13 · Accepted Answer

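Performance comparison

`suffixtree` vs. `pytrie` vs. `trie` vs. `datrie` vs. `startswith`-functions

Setup

The recorded time is the minimum among 3 repetitions of 1000 searches. The trie construction time is included and spread among all searches. The search is performed on collections of hostnames from 1 to 1000000 items.

Three types of search string:

non_existent_key - there is no match for the string
rare_key - around 20 matches in a million
frequent_key - the number of occurrences is comparable to the collection size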
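
As a rough illustration of the timing scheme described above (minimum over 3 repetitions of 1000 searches), here is a minimal sketch using only the standard library. It is not the actual benchmark code: `time_search`, the synthetic hostnames, the collection size and the three example keys are assumptions made up for this sketch, and the rare key here matches a single URL rather than about 20 in a million.

```python
import timeit


def longest_prefix_match(search, urllist):
    """Linear scan: longest stored URL that starts with `search`, or None."""
    matches = [url for url in urllist if url.startswith(search)]
    return max(matches, key=len) if matches else None


# Hypothetical data: N synthetic hostnames (the real benchmark used its own set).
N = 100
urls = ["http://host%d.example.com" % i for i in range(N)]


def time_search(key, number=1000, repeat=3):
    """Minimum over `repeat` runs of the total time for `number` searches (seconds)."""
    timer = timeit.Timer(lambda: longest_prefix_match(key, urls))
    return min(timer.repeat(repeat=repeat, number=number))


# Hypothetical keys, one per search type.
non_existent_key = "ftp://nothing"   # matches no URL
rare_key = "http://host99."          # matches a single URL
frequent_key = "http://host"         # matches every URL

for name, key in [("non_existent_key", non_existent_key),
                  ("rare_key", rare_key),
                  ("frequent_key", frequent_key)]:
    print("%-17s %.6f s" % (name, time_search(key)))
```

Taking the minimum rather than the mean is the usual `timeit` convention, since the slower runs mostly reflect interference from other processes.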

Results

Maximum memory consumption for a million urls:

| function    | memory, | ratio |
|             |     GiB |       |
|-------------+---------+-------|
| suffix_tree |   0.853 |   1.0 |
| pytrie      |   3.383 |   4.0 |
| trie        |   3.803 |   4.5 |
| datrie      |   0.194 |   0.2 |
| startswith  |   0.069 |   0.1 |
#+TBLFM: $3=$2/@3$2;%.1f

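To reproduce the results, run the trie benchmark code.

Rare/non_existent key case

If the number of urls is less than 10000 then `datrie` is the fastest; for N>10000 `suffixtree` is faster; `startswith` is significantly slower on average.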

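Figure: rare_key

Axes:
- the vertical (time) scale is ~1 second (2**20 microseconds)
- the horizontal axis shows the total number of urls in each case: N = 1, 10, 100, 1000, 10000, 100000, and 1000000 (a million).

Figure: nonexistent_key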

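Frequent key

Up to N=100000 `datrie` is the fastest (for a million urls the time is dominated by the trie construction time).

Most of the time is spent finding the longest match among the matches found, so all functions behave similarly, as expected.

Figure: frequent_key

`startswith` - time performance is independent of the type of key.

`trie` and `pytrie` behave similarly to each other.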

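Performance without trie construction time

`datrie` - the fastest, with decent memory consumption
`startswith` is at an even greater disadvantage here because the other approaches are not penalized by the time it takes to build a trie.
`datrie`, `pytrie`, `trie` - almost O(1) (constant time) for rare/non_existent keys

Figures: rare_key_no_trie_build_time, nonexistent_key_no_trie_build_time

Figure: frequent_key_no_trie_build_time

Fitted (approximating) polynomials of known functions for comparison (same log/log scale as in the figures):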

| Fitting polynomial           | Function          |
|------------------------------+-------------------|
| 0.15  log2(N)   +      1.583 | log2(N)           |
| 0.30  log2(N)   +      3.167 | log2(N)*log2(N)   |
| 0.50  log2(N)   +  1.111e-15 | sqrt(N)           |
| 0.80  log2(N)   +  7.943e-16 | N**0.8            |
| 1.00  log2(N)   +  2.223e-15 | N                 |
| 2.00  log2(N)   +  4.446e-15 | N*N               |
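
The slopes in the table can be checked with a short least-squares fit in log/log space. This is only a sketch: `loglog_fit` is a hypothetical helper written for this illustration, not part of the benchmark code, and it covers only the pure power functions (whose fitted slope should equal their exponent).

```python
import math


def loglog_fit(f, sizes):
    """Least-squares line y = a*x + b through x = log2(N), y = log2(f(N))."""
    xs = [math.log2(n) for n in sizes]
    ys = [math.log2(f(n)) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


sizes = [10 ** k for k in range(7)]  # N = 1, 10, ..., 1000000

for name, f in [("sqrt(N)", math.sqrt),
                ("N**0.8", lambda n: n ** 0.8),
                ("N", lambda n: n),
                ("N*N", lambda n: n * n)]:
    a, b = loglog_fit(f, sizes)
    print("%-8s slope %.2f  intercept %.3g" % (name, a, b))
```

On a log/log plot a power law N**k appears as a straight line with slope k, which is why the measured curves can be compared against these reference lines by eye.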

score 12 · Accepted Answer

This example is good for small url lists but does not scale very well.

def longest_prefix_match(search, urllist):
    matches = [url for url in urllist if url.startswith(search)]
    if matches:
        return max(matches, key=len)
    else:
        raise Exception("Not found")
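
A standard-library middle ground, sketched here as an illustration and not taken from any of the answers: if the URLs are kept in a sorted list, every URL that starts with a given prefix sits in one contiguous run, and `bisect_left` finds the start of that run without scanning the whole list. The function name `longest_prefix_match_sorted` is hypothetical, and unlike the function above it returns `None` instead of raising when nothing matches.

```python
from bisect import bisect_left


def longest_prefix_match_sorted(search, sorted_urls):
    """Longest URL starting with `search`; `sorted_urls` must be sorted."""
    # First position whose string is >= search; matches (if any) start here.
    i = bisect_left(sorted_urls, search)
    best = None
    # Walk the contiguous run of URLs that share the prefix.
    while i < len(sorted_urls) and sorted_urls[i].startswith(search):
        if best is None or len(sorted_urls[i]) > len(best):
            best = sorted_urls[i]
        i += 1
    return best


urls = sorted([
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
])

for search in ("http://www.google.com/doc", "http://www.face", "http://fail"):
    print("'%s' -> %s" % (search, longest_prefix_match_sorted(search, urls)))
```

The lookup costs O(log N) for the binary search plus the length of the matching run, so a very frequent prefix still degrades to a long scan.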

An implementation using the trie module.

import trie


def longest_prefix_match(prefix_trie, search):
    # There may well be a more elegant way to do this without using
    # "hidden" method _getnode.
    try:
        return list(node.value for node in prefix_trie._getnode(search).walk())
    except KeyError:
        return list()

url_list = [ 
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

url_trie = trie.Trie()

for url in url_list:
    url_trie[url] = url 

searches = ("http", "http://www.go", "http://www.fa", "http://fail")

for search in searches:
    print("'%s' -> %s" % (search, longest_prefix_match(url_trie, search)))

Result:

'http' -> ['http://www.facebook.com', 'http://www.google.com/document', 'http://www.google.com/mail']
'http://www.go' -> ['http://www.google.com/document', 'http://www.google.com/mail']
'http://www.fa' -> ['http://www.facebook.com']
'http://fail' -> []

Or using PyTrie, which gives the same result but the lists are ordered differently.

from pytrie import StringTrie


url_list = [ 
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

url_trie = StringTrie()

for url in url_list:
    url_trie[url] = url 

searches = ("http", "http://www.go", "http://www.fa", "http://fail")

for search in searches:
    print("'%s' -> %s" % (search, url_trie.values(prefix=search)))

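I'm beginning to think a radix tree / patricia tree would be better from a memory usage point of view. This is what a radix tree would look like:

Figure: radix tree of the example URLs

Whereas a trie looks more like:

Figure: trie of the example URLs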
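
To make the memory remark concrete, here is a minimal dict-based trie with one node per character. It is a sketch for illustration only and not how the trie or PyTrie packages are implemented; the class and method names (`TrieNode`, `Trie`, `insert`, `values`, `node_count`) are made up for this example.

```python
class TrieNode:
    __slots__ = ("children", "value")

    def __init__(self):
        self.children = {}   # maps a single character to the child node
        self.value = None    # set only on nodes that end a stored key


class Trie:
    def __init__(self):
        self.root = TrieNode()
        self.node_count = 1  # counts the root

    def insert(self, key, value):
        node = self.root
        for ch in key:
            nxt = node.children.get(ch)
            if nxt is None:
                nxt = node.children[ch] = TrieNode()
                self.node_count += 1
            node = nxt
        node.value = value

    def values(self, prefix):
        """All stored values whose key starts with `prefix` (order unspecified)."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found, stack = [], [node]
        while stack:  # iterative depth-first walk of the subtree
            node = stack.pop()
            if node.value is not None:
                found.append(node.value)
            stack.extend(node.children.values())
        return found


url_trie = Trie()
for url in ('http://www.google.com/mail',
            'http://www.google.com/document',
            'http://www.facebook.com'):
    url_trie.insert(url, url)

print(url_trie.values("http://www.go"))
print("nodes:", url_trie.node_count)
```

Every character that is not shared with another key gets its own node here, whereas a radix tree collapses each unbranched chain into a single edge, which is where its memory saving comes from.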

score 1 · Accepted Answer

The function below returns the index of the longest match. Other useful information can easily be extracted as well.

from os.path import commonprefix as oscp

def longest_prefix(s, slist):
    pfx_idx = ((oscp([s, url]), i) for i, url in enumerate(slist))
    len_pfx_idx = map(lambda t: (len(t[0]), t[0], t[1]), pfx_idx)
    length, pfx, idx = max(len_pfx_idx)
    return idx

slist = [
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

print(longest_prefix('http://www.google.com/doc', slist))
print(longest_prefix('http://www.face', slist))

score 1 · Accepted Answer

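If you are willing to trade RAM for time performance then SuffixTree might be useful. It has nice algorithmic properties; for example, it allows solving the longest common substring problem in linear time.

If you always search for a prefix rather than an arbitrary substring then you could add a unique prefix while populating SubstringDict():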

from SuffixTree import SubstringDict

substr_dict = SubstringDict()
for url in URLS: # urls must be ascii (valid urls are)
    assert '\n' not in url
    substr_dict['\n'+url] = url #NOTE: assume that '\n' can't be in a url

def longest_match(url_prefix, _substr_dict=substr_dict):
    matches = _substr_dict['\n'+url_prefix]
    return max(matches, key=len) if matches else ''

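Such usage of SuffixTree seems suboptimal, but it is 20-150 times faster (without SubstringDict()'s construction time) than @StephenPaulger's solution [which is based on .startswith()] on the data I've tried, and it could be good enough.

To install SuffixTree, run: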

pip install SuffixTree -f https://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees
