
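Note: this is not a proper use of LZW compression. I am just playing around with it.

Question

Is it also possible to update the frequency counts of the dictionary entries in a single pass?

My implementation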

from collections import defaultdict

# The silliest string!
inputString = "this is the first sentence in this book the first sentence is really the most interesting the first sentence is always first"
inputString = inputString.lower().split()

StringTable = defaultdict(int)
FreqTable = defaultdict(int)

def DoPass():
    global inputString
    global StringTable
    global FreqTable

    print ""
    print "INPUT STRING:"
    print inputString

    CODE = 256

    STRING = inputString[0]

    output = []

    StringTable[STRING] = CODE
    CODE += 1

    total = len(inputString)

    for i in range(1, total):
        WORD = inputString[i]

        if STRING + " " + WORD in StringTable:
            STRING += " " + WORD
        else:
            if STRING in StringTable:
                output.append(str(StringTable[STRING]))
            else:
                output.append(STRING)
            StringTable[STRING + " " + WORD] = CODE
            CODE += 1
            STRING = WORD

    StringTable[STRING] = CODE
    CODE += 1
    output.append(str(StringTable[STRING]))

    print ""
    print "OUTPUT STRING:"
    print output

    print ""
    print "Dictionary Built..."
    for i in sorted(StringTable.keys(), key=lambda x: len(x)):
        print i, StringTable[i]

    print ""
    print "Frequencies:"
    for i in sorted(FreqTable.keys(), key=lambda x: len(x)):
        print i, FreqTable[i]

def main():
    DoPass()

if __name__ == "__main__":
    main()

Output

INPUT STRING:
['this', 'is', 'the', 'first', 'sentence', 'in', 'this', 'book', 'the', 'first', 'sentence', 'is', 'really', 'the', 'most', 'interesting', 'the', 'first', 'sentence', 'is', 'always', 'first']

OUTPUT STRING:
['256', 'is', 'the', 'first', 'sentence', 'in', '256', 'book', '259', 'sentence', 'is', 'really', 'the', 'most', 'interesting', '265', 'is', 'always', '275']

Dictionary Built...
this 256
first 275
is the 258
in this 262
this is 257
book the 264
the most 269
this book 263
is always 273
is really 267
the first 259
really the 268
sentence in 261
sentence is 266
always first 274
first sentence 260
interesting the 271
most interesting 270
the first sentence 265
the first sentence is 272

Frequencies:
#### I am trying to fill this

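I want FreqTable to be filled with the frequency count of every pattern the pass finds. For obvious reasons I have not included my own attempt here: it does not work, and it gives me wrong counts. Any suggestions on whether this is even possible would be great.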


1 Answer


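Not sure I understand your question. If all you need is a frequency table, it should be straightforward: each time a pattern is found, add 1 to its frequency count. So the real problem is finding the patterns.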
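As a rough illustration of "add 1 each time a pattern is found", here is a minimal Python 3 sketch of the same word-level pass with the counter bumped at the point where a pattern is emitted. The names `lzw_pass_with_counts` and `emit` are made up for this example, and it assumes that "found" means "emitted to the output", which is only one possible reading of the question. Unlike the code in the question, it does not assign a fresh code to the final pattern.

```python
from collections import defaultdict

def lzw_pass_with_counts(words, first_code=256):
    """Word-level LZW-style pass that also counts how often each pattern is emitted."""
    table = {}                # pattern -> code
    freq = defaultdict(int)   # pattern -> number of times it was emitted
    output = []
    code = first_code

    def emit(pattern):
        # Assumption: every emitted pattern counts as one "find".
        freq[pattern] += 1
        output.append(str(table[pattern]) if pattern in table else pattern)

    current = words[0]
    table[current] = code
    code += 1
    for word in words[1:]:
        candidate = current + " " + word
        if candidate in table:
            current = candidate        # keep extending the current match
        else:
            emit(current)              # longest known match ends here
            table[candidate] = code    # learn the new, longer pattern
            code += 1
            current = word
    emit(current)                      # flush whatever is left at the end
    return output, table, freq

text = ("this is the first sentence in this book the first sentence is "
        "really the most interesting the first sentence is always first")
output, table, freq = lzw_pass_with_counts(text.split())
for pattern in sorted(freq, key=len):
    print(pattern, freq[pattern])
```

Counted this way, a pattern such as "the first" is credited only when it is emitted on its own, not when it appears inside a longer emitted pattern like "the first sentence".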

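If you want to keep the patterns sorted, that should be easy too: since the table is kept sorted at all times, each update amounts to a single insertion-sort step, which is very fast.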
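One way to picture the sorted table is the following sketch, which uses the standard `bisect` module to re-insert an entry at its new position after each increment. The class `SortedCounts` and its methods are hypothetical names for this example, not something from the question. The binary search is logarithmic, but deleting from and inserting into a Python list still shifts elements, so "very fast" holds mainly for modest table sizes.

```python
import bisect

class SortedCounts:
    """Frequency table that keeps (count, pattern) pairs sorted after every update."""

    def __init__(self):
        self.counts = {}    # pattern -> current count
        self.ordered = []   # list of (count, pattern), always sorted ascending

    def increment(self, pattern):
        old = self.counts.get(pattern, 0)
        if old:
            # Each pattern appears once, so binary search finds its exact entry.
            idx = bisect.bisect_left(self.ordered, (old, pattern))
            del self.ordered[idx]
        self.counts[pattern] = old + 1
        # Single insertion step: put the updated entry back in sorted position.
        bisect.insort(self.ordered, (old + 1, pattern))

    def most_frequent(self, n=1):
        """Return the n highest-count entries, highest first."""
        return self.ordered[-n:][::-1]

sc = SortedCounts()
for w in "a b a c a b".split():
    sc.increment(w)
print(sc.most_frequent(2))
```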

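Now, finding the right pattern is another matter. You need a tree, or a hash table followed by a tree, or a list, or something else, to find the best matching sequence. This is what makes this kind of algorithm more complex to implement.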
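To make the tree idea concrete, here is a small sketch of a word-level trie with a longest-match lookup. It is only one of the structures mentioned above, and `TrieNode`, `insert`, and `longest_match` are illustrative names chosen for this example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next word -> TrieNode
        self.code = None     # set when the path to this node is a stored pattern

def insert(root, words, code):
    """Store the word sequence `words` in the trie under `code`."""
    node = root
    for w in words:
        node = node.children.setdefault(w, TrieNode())
    node.code = code

def longest_match(root, words, start):
    """Return (length, code) of the longest stored pattern beginning at words[start]."""
    node = root
    best = (0, None)
    for i in range(start, len(words)):
        node = node.children.get(words[i])
        if node is None:
            break                      # no stored pattern continues with this word
        if node.code is not None:
            best = (i - start + 1, node.code)
    return best

root = TrieNode()
insert(root, ["the", "first"], 259)
insert(root, ["the", "first", "sentence"], 265)
words = "is the first sentence is".split()
print(longest_match(root, words, 1))
```

Each lookup walks at most as many nodes as the longest stored pattern, regardless of how many patterns the table holds.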

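Obviously, for very small data sets, a "naive" search (testing every entry one by one) can give some results. But as the data set grows, the cost of that search becomes prohibitive.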

answered 2011-11-09T19:55:33.650