python - 概率解析器的内存使用情况

Question

我正在为范围连接语法编写 CKY 解析器。我想使用树库作为语法，所以语法会很大。我用 Python 写了一个原型1，当我模拟几十个句子的树库时，它似乎工作得很好，但是内存使用是不可接受的。我尝试用 C++ 编写它，但到目前为止这非常令人沮丧，因为我以前从未使用过 C++。这是一些数据（n是语法所基于的句子数）：

考虑到最佳优先算法，这种增长模式是可以预期的，但开销的数量是我关心的问题。根据 heapy 的内存使用量比这些数字小十倍，valgrind 报告了类似的情况。是什么导致了这种差异，我可以在 Python（或 Cython）中做些什么？也许是因为碎片化？或者可能是 python 字典的开销？

一些背景知识：两个重要的数据结构是将边映射到概率的议程和图表，它是将非终结符和位置映射到边的字典。议程通过 heapdict（内部使用 dict 和 heapq 列表）实现，图表带有将非终结符和位置映射到边缘的字典。议程经常被插入和删除，图表只得到插入和查找。我用这样的元组表示边：

(("S", 111), ("NP", 010), ("VP", 100, 001))

字符串是语法中的非终结符标签，位置被编码为位掩码。当成分不连续时，可能有多个位置。所以这条边可以表示对“玛丽快乐”的分析，其中“是”和“快乐”都属于副总裁。图表字典由这条边的第一个元素索引，（“S”，111）在这个在一个新版本中，我尝试转置这个表示，希望它可以通过重用来节省内存：

(("S", "NP", "VP), (111, 100, 011))

我认为如果第一部分与不同的位置组合出现，Python 只会存储一次，尽管我实际上不确定这是真的。在任何一种情况下，它似乎都没有任何区别。

所以基本上我想知道的是是否值得进一步追求我的 Python 实现，包括使用 Cython 和不同的数据结构做事，或者用 C++ 从头开始编写它是唯一可行的选择。

更新：经过一些改进后，我不再遇到内存使用问题。我正在开发优化的 Cython 版本。我会将赏金奖励给提高代码效率的最有用的建议。在http://student.science.uva.nl/~acranenb/plcfrs_cython.html有一个带注释的版本

1 https://github.com/andreasvc/disco-dop/ -- 运行 test.py 来解析一些句子。需要 python 2.6、nltk和heapdict

score 2 · Accepted Answer

我认为如果第一部分与不同的位置结合出现，Python 只会存储一次

不必要：

>>> ("S", "NP", "VP") is ("S", "NP", "VP")
False

您可能想要intern引用非终结符的所有字符串，因为您似乎在rcgrules.py. 如果你想要intern一个元组，那么先把它变成一个字符串：

>>> intern("S NP VP") is intern(' '.join('S', 'NP', 'VP'))
True

否则，您将不得不“复制”元组，而不是重新构建它们。

（如果您是 C++ 新手，那么在其中重写这样的算法不太可能提供很多内存优势。您必须首先评估各种哈希表实现并了解其容器中的复制行为。我已经发现boost::unordered_map有很多小哈希表非常浪费。）

score 2 · Accepted Answer

您是否尝试过使用 PyPy 而不是 CPython 运行您的应用程序？

PyPy在注意共性和避免与不必要的重复事物相关的内存开销方面比 CPython 聪明得多。

无论如何，值得一试：http: //pypy.org/

score 1 · Accepted Answer

在这些情况下，首先要做的是分析：

15147/297    0.032    0.000    0.041    0.000 tree.py:102(__eq__)
15400/200    0.031    0.000    0.106    0.001 tree.py:399(convert)
        1    0.023    0.023    0.129    0.129 plcfrs_cython.pyx:52(parse)
6701/1143    0.022    0.000    0.043    0.000 heapdict.py:45(_min_heapify)
    18212    0.017    0.000    0.023    0.000 plcfrs_cython.pyx:38(__richcmp__)
10975/10875    0.017    0.000    0.035    0.000 tree.py:75(__init__)
     5772    0.016    0.000    0.050    0.000 tree.py:665(__init__)
      960    0.016    0.000    0.025    0.000 plcfrs_cython.pyx:118(deduced_from)
    46938    0.014    0.000    0.014    0.000 tree.py:708(_get_node)
25220/2190    0.014    0.000    0.016    0.000 tree.py:231(subtrees)
    10975    0.013    0.000    0.023    0.000 tree.py:60(__new__)
    49441    0.013    0.000    0.013    0.000 {isinstance}
    16748    0.008    0.000    0.015    0.000 {hasattr}

我注意到的第一件事是很少有函数来自 cython 模块本身。它们中的大多数来自 tree.py 模块，也许这就是瓶颈。

专注于 cython 方面，我看到了richcmp函数：

我们可以简单地通过在方法声明中添加值的类型来优化它

def __richcmp__(ChartItem self, ChartItem other, int op):
        ....

这降低了价值

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
....
18212    0.011    0.000    0.015    0.000 plcfrs_cython.pyx:38(__richcmp__)

添加 elif 语法而不是单个 if 将启用cython 的开关优化

    if op == 0: return self.label < other.label or self.vec < other.vec
    elif op == 1: return self.label <= other.label or self.vec <= other.vec
    elif op == 2: return self.label == other.label and self.vec == other.vec
    elif op == 3: return self.label != other.label or self.vec != other.vec
    elif op == 4: return self.label > other.label or self.vec > other.vec
    elif op == 5: return self.label >= other.label or self.vec >= other.vec

获得：

17963    0.002    0.000    0.002    0.000 plcfrs_cython.pyx:38(__richcmp__)

试图弄清楚 tree.py:399 转换来自哪里我发现 dopg.py 中的这个函数需要所有时间

  def removeids(tree):
""" remove unique IDs introduced by the Goodman reduction """
result = Tree.convert(tree)
for a in result.subtrees(lambda t: '@' in t.node):
    a.node = a.node.rsplit('@', 1)[0]
if isinstance(tree, ImmutableTree): return result.freeze()
return result

现在我不确定树中的每个节点是否都是 ChartItem 以及getitem值是否正在其他地方使用，但添加此更改：

cdef class ChartItem:
cdef public str label
cdef public str root
cdef public long vec
cdef int _hash
__slots__ = ("label", "vec", "_hash")
def __init__(ChartItem self, label, int vec):
    self.label = intern(label) #.rsplit('@', 1)[0])
    self.root = intern(label.rsplit('@', 1)[0])
    self.vec = vec
    self._hash = hash((self.label, self.vec))
def __hash__(self):
    return self._hash
def __richcmp__(ChartItem self, ChartItem other, int op):
    if op == 0: return self.label < other.label or self.vec < other.vec
    elif op == 1: return self.label <= other.label or self.vec <= other.vec
    elif op == 2: return self.label == other.label and self.vec == other.vec
    elif op == 3: return self.label != other.label or self.vec != other.vec
    elif op == 4: return self.label > other.label or self.vec > other.vec
    elif op == 5: return self.label >= other.label or self.vec >= other.vec
def __getitem__(ChartItem self, int n):
    if n == 0: return self.root
    elif n == 1: return self.vec
def __repr__(self):
    #would need bitlen for proper padding
    return "%s[%s]" % (self.label, bin(self.vec)[2:][::-1])

在 mostprobableparse 内部：

from libc cimport pow
def mostprobableparse...
            ...
    cdef dict parsetrees = <dict>defaultdict(float)
    cdef float prob
    m = 0
    for n,(a,prob) in enumerate(derivations):
        parsetrees[a] += pow(e,prob)
        m += 1

我得到：

         189345 function calls (173785 primitive calls) in 0.162 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
6701/1143    0.025    0.000    0.037    0.000 heapdict.py:45(_min_heapify)
        1    0.023    0.023    0.120    0.120 plcfrs_cython.pyx:54(parse)
      960    0.018    0.000    0.030    0.000 plcfrs_cython.pyx:122(deduced_from)
 5190/198    0.011    0.000    0.015    0.000 tree.py:102(__eq__)
     6619    0.006    0.000    0.006    0.000 heapdict.py:67(_swap)
     9678    0.006    0.000    0.008    0.000 plcfrs_cython.pyx:137(concat)

所以接下来的步骤是优化 heapify 和 deduced_from

deuce_from 可以进一步优化：

cdef inline deduced_from(ChartItem Ih, double x, pyCx, pyunary, pylbinary, pyrbinary, int bitlen):
cdef str I = Ih.label
cdef int Ir = Ih.vec
cdef list result = []
cdef dict Cx = <dict>pyCx
cdef dict unary = <dict>pyunary
cdef dict lbinary = <dict>pylbinary
cdef dict rbinary = <dict>pyrbinary
cdef ChartItem Ilh
cdef double z
cdef double y
cdef ChartItem I1h 
for rule, z in unary[I]:
    result.append((ChartItem(rule[0][0], Ir), ((x+z,z), (Ih,))))
for rule, z in lbinary[I]:
    for I1h, y in Cx[rule[0][2]].items():
        if concat(rule[1], Ir, I1h.vec, bitlen):
            result.append((ChartItem(rule[0][0], Ir ^ I1h.vec), ((x+y+z, z), (Ih, I1h))))
for rule, z in rbinary[I]:
    for I1h, y in Cx[rule[0][1]].items():
        if concat(rule[1], I1h.vec, Ir, bitlen):
            result.append((ChartItem(rule[0][0], I1h.vec ^ Ir), ((x+y+z, z), (I1h, Ih))))
return result

我将在这里停下来，尽管我相信随着对问题的更多了解，我们可以继续优化。

一系列的单元测试将有助于断言每个优化不会引入任何细微的错误。

附注，尝试使用空格而不是制表符。

python - 概率解析器的内存使用情况

3 回答 3

Related

Reference