3

If I have a collection of strings, is there a data structure or function that could speed up checking whether any element of the collection is a substring of my main string?

Right now I'm looping through my array of strings and using the in operator. Is there a faster way?

import timing

## string match in first do_not_scan
## 0:00:00.029332

## string not in do_not_scan
## 0:00:00.035179
def check_if_substring():
    for x in do_not_scan:
        if x in string:
            return True
    return False

## string match in first do_not_scan
## 0:00:00.046530

## string not in do_not_scan
## 0:00:00.067439
def index_of():
    for x in do_not_scan:
        try:
            string.index(x)
            return True
        except ValueError:
            # x is not in string; keep checking the remaining entries
            continue
    return False

## string match in first do_not_scan
## 0:00:00.047654

## string not in do_not_scan
## 0:00:00.070596
def find_def():
    for x in do_not_scan:
        if string.find(x) != -1:
            return True
    return False

string = '/usr/documents/apps/components/login'
do_not_scan = ['node_modules','bower_components']

for x in range(100000):
    find_def()
    index_of()
    check_if_substring()
4 Answers

3

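No, there is no faster built-in way.

If you have a large number of strings to test, you may be better off using a third-party Aho-Corasick package, as J.F. Sebastian's answer shows.


With the built-in methods, the worst case is that there is no match, which means you have tested every item in the list and nearly every offset in each item.

Fortunately, the in operator is very fast (at least in CPython), and in my tests it was faster by nearly a factor of three: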

0.3364804992452264  # substring()
0.867534976452589   # any_substring()
0.8401796016842127  # find_def()
0.9342398950830102  # index_of()
2.7920695478096604  # re implementation

Here is the script I used for testing:

from timeit import timeit
import re

def substring():
    for x in do_not_scan:
        if x in string:
            return True
    return False

def any_substring():
    return any(x in string for x in do_not_scan)

def find_def():
    for x in do_not_scan:
        if string.find(x) != -1:
            return True
    return False

def index_of():
    for x in do_not_scan:
        try:
            string.index(x)
            return True
        except ValueError:
            # x is not in string; keep checking the remaining entries
            continue
    return False

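def re_match():
    for x in do_not_scan:
        # re.search takes the pattern first, then the string to scan;
        # re.escape makes x match literally
        if re.search(re.escape(x), string):
            return True
    return False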

string = 'a'
do_not_scan = ['node_modules','bower_components']

print(timeit('substring()', setup='from __main__ import substring'))
print(timeit('any_substring()', setup='from __main__ import any_substring'))
print(timeit('find_def()', setup='from __main__ import find_def'))
print(timeit('index_of()', setup='from __main__ import index_of'))
print(timeit('re_match()', setup='from __main__ import re_match'))
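
For anyone who wants to stay with the standard library, one more variant worth timing is a single precompiled regular expression that joins all entries with `|`. This is only a sketch: the name `re_combined` is made up for illustration, and I have not measured it here, so whether it beats the plain loop on your data is something to check with timeit rather than assume.

```python
import re

do_not_scan = ['node_modules', 'bower_components']
string = '/usr/documents/apps/components/login'

# Escape each entry so it is matched literally, then join the entries with '|'
# and compile once, outside the function that gets called repeatedly.
pattern = re.compile('|'.join(re.escape(s) for s in do_not_scan))

def re_combined(text):
    """Return True if any entry of do_not_scan occurs in text."""
    return pattern.search(text) is not None

print(re_combined(string))                       # no entry occurs in this path
print(re_combined('/usr/lib/node_modules/npm'))  # contains 'node_modules'
```

Compiling once matters because the per-call cost of building the pattern would otherwise dominate on short strings like the ones in this question.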
answered 2016-03-04T19:51:56.433
2

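Yes, there is a faster way to perform found = any(s in main_string for s in collection_of_strings). For example, the Aho-Corasick algorithm improves the any()-based O(n*m*k) algorithm to O(n + m*k) time, where n is len(main_string), m is len(collection_of_strings), and k represents the individual lengths of the strings in the collection.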

#!/usr/bin/env python
import noaho # $ pip install noaho

trie = noaho.NoAho()
for s in collection_of_strings:
    trie.add(s)
found = trie.find_short(main_string)[0] is not None

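Note: there is no point in measuring time performance on tiny strings such as string = 'a' if you are interested in Big-O behavior. Either benchmark with a more representative sample, or you don't need an (asymptotically) faster algorithm in your case.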
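
To make the idea concrete without installing anything, here is a minimal pure-Python sketch of an Aho-Corasick automaton. It is not the noaho API: the names `build_automaton` and `contains_any` are made up for illustration, and it only answers "does any pattern occur?". A pure-Python version like this will probably lose to the in operator on small inputs, so treat it as an explanation of the technique rather than a drop-in speedup.

```python
from collections import deque

def build_automaton(patterns):
    """Build a trie with failure links from a list of non-empty patterns."""
    goto = [{}]         # goto[state][char] -> next state
    fail = [0]          # fail[state] -> state to fall back to on a mismatch
    terminal = [False]  # terminal[state] -> a pattern ends at this state
    for pattern in patterns:
        if not pattern:
            raise ValueError('empty patterns are not supported in this sketch')
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                terminal.append(False)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        terminal[state] = True

    # Breadth-first pass: shallower states are finished before deeper ones,
    # so each failure link can rely on the links above it.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(ch, 0)
            # A state also counts as terminal if a shorter pattern ends
            # at the state its failure link points to.
            terminal[nxt] = terminal[nxt] or terminal[fail[nxt]]
    return goto, fail, terminal

def contains_any(text, automaton):
    """Scan text once and report whether any pattern occurs in it."""
    goto, fail, terminal = automaton
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if terminal[state]:
            return True
    return False

automaton = build_automaton(['node_modules', 'bower_components'])
print(contains_any('/usr/documents/apps/components/login', automaton))
print(contains_any('/home/me/project/node_modules/x', automaton))
```

The automaton is built once from the collection and then reused for every main string, which is where the single pass over the text comes from.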

answered 2016-03-05T16:44:46.450
2
def check():
    if any(w in string for w in do_not_scan):
        return True
    else:
        return False

Or more simply:

def check():
    return any(w in string for w in do_not_scan)

As stated by @Two-Bit Alchemist

answered 2016-03-04T18:13:38.583
2

I don't have a large dataset to try this on:

But maybe something like this would work?

Python 3

from builtins import any
import timeit

do_not_scan = ['node_modules', 'bower_components']
string = 'a'


def check_if_substring():
    return any(string in x for x in do_not_scan)


result = timeit.Timer("check_if_substring()", "from __main__ import check_if_substring")
count = 10000
print(result.timeit(count)/count)

Or the other way around:

def check_if_substring():
    return any(x in string for x in do_not_scan)

My result: 6.48119201650843e-07

answered 2016-03-04T21:18:52.423
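
If a real dataset isn't at hand, a synthetic one can stand in. The sketch below is only an assumption about what "larger" might look like: the sizes, the 'abcd' alphabet and the names `main_string`, `patterns`, `loop_in` and `any_in` are all invented for illustration. Every pattern ends in a character that never appears in the main string, so no pattern can match and each call has to go through the whole list.

```python
import random
from timeit import timeit

random.seed(0)  # fixed seed so repeated runs use the same data

# Hypothetical synthetic data: one long main string and many short patterns.
main_string = ''.join(random.choice('abcd') for _ in range(100_000))
# Each pattern ends in 'z', which never occurs in main_string,
# so none of them can match (the no-match case).
patterns = [''.join(random.choice('abcd') for _ in range(11)) + 'z'
            for _ in range(200)]

def loop_in():
    for p in patterns:
        if p in main_string:
            return True
    return False

def any_in():
    return any(p in main_string for p in patterns)

runs = 10
for func in (loop_in, any_in):
    seconds = timeit(func, number=runs)
    print(f'{func.__name__}: {seconds / runs:.6f} s per call')
```

The absolute numbers will vary by machine and Python version, so they are only useful for comparing the two functions against each other on the same run.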