3

I'm struggling with some performance issues. The task at hand is to extract a similarity value between two strings, and for that I'm using fuzzywuzzy:

from fuzzywuzzy import fuzz

print fuzz.ratio("string one", "string two")
print fuzz.ratio("string one", "string two which is significantly different")
result 1: 80
result 2: 38

That part is fine, though. The problem I'm facing is that I have two lists, one with 1500 rows and the other with several thousand, and I need to compare every element of the first against every element of the second. A simple for loop inside a for loop (as sketched below) will take an enormous amount of time to compute.
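For reference, a minimal sketch of that nested loop (the list names are hypothetical; the question does not give them):

from fuzzywuzzy import fuzz

# Hypothetical list names -- the question does not give them.
list_a = ["string one", "another string"]      # ~1500 rows in practice
list_b = ["string two", "a different string"]  # several thousand rows

# Naive O(len(list_a) * len(list_b)) comparison.
scores = [[fuzz.ratio(a, b) for b in list_b] for a in list_a]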

If anyone has a suggestion on how to speed this up, it would be greatly appreciated.


4 Answers

2

I made something for you myself (Python 2.7):

from __future__ import division

import time

from fuzzywuzzy import fuzz


one = "different simliar"
two = "similar"


def compare(first, second):
    # Sort so that `smaller` is the shorter of the two strings.
    smaller, bigger = sorted([first, second], key=len)

    s_smaller = smaller.split()
    s_bigger = bigger.split()
    # Pre-compute the character set of every word in the longer string.
    bigger_sets = [set(word) for word in s_bigger]

    counter = 0
    for word in s_smaller:
        # A word "matches" when its character set equals the character set
        # of some word in the longer string.
        if set(word) in bigger_sets:
            counter += len(word)
    if counter:
        return counter / len(' '.join(s_bigger)) * 100  # percentage match
    return counter


start_time = time.time()
print "match: ", compare(one, two)
compare_time = time.time() - start_time
print "compare: --- %s seconds ---" % (compare_time)
start_time = time.time()
print "match: ", fuzz.ratio(one, two)
fuzz_time = time.time() - start_time
print "fuzzy: --- %s seconds ---" % (fuzz_time)
print
print "<simliar or similar>/<length of bigger>*100%"
print 7/len(one)*100
print
print "Equals?"
print 7/len(one)*100 == compare(one, two)
print
print "Faster than fuzzy?"
print compare_time < fuzz_time

So I think mine is faster, but is it accurate enough for you? You decide.

EDIT: it is now not only faster, but also more accurate.

Results:

match:  41.1764705882
compare: --- 4.19616699219e-05 seconds ---
match:  50
fuzzy: --- 7.39097595215e-05 seconds ---

<simliar or similar>/<length of bigger>*100%
41.1764705882

Equals?
True

Faster than fuzzy?
True

Of course, if you want to check whole words the way fuzzywuzzy does, here you go:

from __future__ import division
import time

from fuzzywuzzy import fuzz


one = "different simliar"
two = "similar"


def compare(first, second):
    smaller, bigger = sorted([first, second], key=len)

    s_smaller = smaller.split()
    s_bigger = bigger.split()
    bigger_sets = [set(word) for word in s_bigger]

    counter = 0
    for word in s_smaller:
        # Count matching words instead of matching characters.
        if set(word) in bigger_sets:
            counter += 1
    if counter:
        return counter / len(s_bigger) * 100  # percentage of words matched
    return counter


start_time = time.time()
print "match: ", compare(one, two)
compare_time = time.time() - start_time
print "compare: --- %s seconds ---" % (compare_time)
start_time = time.time()
print "match: ", fuzz.ratio(one, two)
fuzz_time = time.time() - start_time
print "fuzzy: --- %s seconds ---" % (fuzz_time)
print
print "Equals?"
print fuzz.ratio(one, two) == compare(one, two)
print
print "Faster than fuzzy?"
print compare_time < fuzz_time

Results:

match:  50.0
compare: --- 7.20024108887e-05 seconds ---
match:  50
fuzzy: --- 0.000125169754028 seconds ---

Equals?
True

Faster than fuzzy?
True
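
To connect this back to the question, a minimal sketch of applying compare() above to the two lists (the list names are hypothetical; the question does not give them):

# Hypothetical list names -- the question does not give them.
list_a = ["string one", "another string"]      # ~1500 rows in practice
list_b = ["string two", "a different string"]  # several thousand rows

best_matches = []
for a in list_a:
    # Keep the highest-scoring element of list_b for every element of list_a.
    best_matches.append(max(list_b, key=lambda b: compare(a, b)))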
answered 2016-08-21T18:25:03.757
1

The best solution I can think of is to use the IBM Streams framework to parallelize your essentially unavoidable O(n^2) solution.

Using that framework, you would be able to write a single-threaded kernel similar to this:

def matchStatements(tweet, statements):
    results = []
    for s in statements:
        r = fuzz.ratio(tweet, s)
        results.append(r)
    return results

Then parallelize it using a setup similar to this:

# Assumed import for the Streams Python topology API (not shown in the original answer).
from streamsx.topology.topology import Topology

def main():
    topo = Topology("tweet_compare")
    source = topo.source(getTweets)   # getTweets: user-supplied callable producing tweets
    cpuCores = 4
    match = source.parallel(cpuCores).transform(matchStatements)
    end = match.end_parallel()
    end.sink(print)

This multithreads the processing and speeds it up considerably, while saving you the work of implementing the multithreading details yourself (which is the main advantage of Streams).

The idea is that each tweet is a Streams tuple to be processed across multiple processing elements.

The Streams Python topology framework documentation is here, and the parallel operator in particular is described here.
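
If pulling in Streams is too heavy for your setup, the same fan-out idea can be sketched with the standard library's multiprocessing module. This is a plain multiprocessing alternative, not the Streams API, and the data names are hypothetical:

from multiprocessing import Pool

from fuzzywuzzy import fuzz

# Hypothetical data -- the original lists are not named in the question.
statements = ["statement one", "statement two"]   # several thousand rows in practice
tweets = ["first tweet", "second tweet"]          # ~1500 rows in practice


def match_statements(tweet):
    # The inner loop of the O(n^2) job: score one tweet against every statement.
    return [fuzz.ratio(tweet, s) for s in statements]


if __name__ == "__main__":
    pool = Pool(processes=4)                      # roughly one worker per core
    results = pool.map(match_statements, tweets)  # tweets are split across workers
    pool.close()
    pool.join()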

answered 2016-08-22T14:29:42.533
1

If you need to score every statement against every other, then no, I don't know of a way to get a huge speedup over the n^2 operations required to compare the elements of the two lists. You can avoid some of the string matching by using the lengths to rule out pairs that cannot possibly match, but you still have nested for loops. You would probably spend more time optimizing it than the processing time it would save you.
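
A minimal sketch of that length pre-filter (the threshold and list names are hypothetical): fuzz.ratio is bounded above by 2 * min(len) / (len(a) + len(b)) * 100, so pairs whose lengths differ too much can be skipped without computing the ratio at all.

from fuzzywuzzy import fuzz

# Hypothetical names and threshold -- not from the original post.
list_a = ["string one", "another string"]
list_b = ["string two", "a significantly longer and different string"]
THRESHOLD = 70

matches = []
for a in list_a:
    for b in list_b:
        shorter, longer = sorted([len(a), len(b)])
        # Upper bound on fuzz.ratio from the lengths alone.
        if 200.0 * shorter / (shorter + longer) < THRESHOLD:
            continue  # cannot possibly reach the threshold, skip the expensive call
        score = fuzz.ratio(a, b)
        if score >= THRESHOLD:
            matches.append((a, b, score))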

answered 2016-08-21T17:47:26.647
0

You can convert a column to a list with column_name.tolist() and assign it to a variable.

There is a Python package called two-lists-similarity that compares the lists of two columns and computes a score.

https://pypi.org/project/two-lists-similarity/
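
A minimal sketch of the .tolist() step (the DataFrame and column names are hypothetical; the scoring call is left out because the package's API is not described here):

import pandas as pd

# Hypothetical DataFrame and column names -- not from the original post.
df = pd.DataFrame({"col_a": ["string one", "string two"],
                   "col_b": ["string one", "another string"]})

list_a = df["col_a"].tolist()  # pandas Series -> plain Python list
list_b = df["col_b"].tolist()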

answered 2020-05-08T20:06:43.927