python - How to save a Python NLTK alignment model for later use?

Question

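In Python, I'm using NLTK's alignment module to create word alignments between parallel texts. Aligning bitexts can be a time-consuming process, especially over sizable corpora. It would be nice to do the alignments in a batch one day and use them later on.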

from nltk import IBMModel1 as ibm
biverses = [...]  # list of AlignedSent objects
model = ibm(biverses, 20)

with open(path + "eng-taq_model.txt", 'w') as f:
    f.write(model.train(biverses, 20))  # makes empty file

After creating the model, how can I (1) save it to disk and (2) reuse it later?

score 8 · Accepted Answer

The immediate answer is to pickle it, see https://wiki.python.org/moin/UsingPickle

But because the trained IBMModel1 object holds a reference to a lambda function, it cannot be pickled with the default pickle / cPickle (see https://github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L74 and https://github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L104)
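The failure can be reproduced without NLTK. The sketch below is only an illustration: the defaultdict with a lambda default is a hypothetical stand-in for the model's internal tables, not NLTK's actual data structure.

```python
import pickle
from collections import defaultdict

# Hypothetical stand-in for a model's probability table:
# unseen keys fall back to a tiny default supplied by a lambda.
table = defaultdict(lambda: 1e-6)
table["haus"] = 0.9

try:
    pickle.dumps(table)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError, TypeError) as err:
    # pickle stores functions by name, and a lambda has no importable name
    lambda_pickled = False
    print("pickle failed:", err)

# Dropping the lambda (here by copying into a plain dict) makes the data picklable.
plain = dict(table)
restored = pickle.loads(pickle.dumps(plain))
print(lambda_pickled, restored)
```

Copying into a plain dict loses the default value, which is why a serializer that can handle the lambda itself is the more convenient route for a whole model object.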

So we'll use dill. First, install dill, see "Can Python pickle lambda functions?"

$ pip install dill
$ python
>>> import dill as pickle

Then:

>>> import dill
>>> import dill as pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
...
>>> exit()

To use the pickled model:

>>> import dill as pickle
>>> from nltk.corpus import comtrans
>>> bitexts = comtrans.aligned_sents()[:100]
>>> with open('model1.pk', 'rb') as fin:
...     ibm = pickle.load(fin)
... 
>>> aligned_sent = ibm.align(bitexts[0])
>>> aligned_sent.words
['Wiederaufnahme', 'der', 'Sitzungsperiode']
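The save and load steps above can be folded into one "train once, then reuse" helper. This is a minimal sketch, not part of NLTK or dill: `load_or_train` and `fake_train` are hypothetical names, and the stdlib pickle is used so the snippet runs anywhere. For an object that holds a lambda, `import dill as serializer` would be the assumed replacement.

```python
import os
import pickle as serializer  # assumption: use `import dill as serializer` for objects holding lambdas
import tempfile


def load_or_train(path, train):
    """Return the object cached at `path`, or call `train()` and cache its result."""
    if os.path.exists(path):
        with open(path, "rb") as fin:
            return serializer.load(fin)
    model = train()
    with open(path, "wb") as fout:
        serializer.dump(model, fout)
    return model


# Demonstration with a cheap stand-in for the expensive training step.
cache_path = os.path.join(tempfile.mkdtemp(), "model1.pk")
calls = []


def fake_train():
    calls.append(1)  # record how often training really ran
    return {"haus": {"house": 0.9}}


first = load_or_train(cache_path, fake_train)   # trains and writes the cache
second = load_or_train(cache_path, fake_train)  # reads the cache, no training
```

With NLTK and dill installed, the call would presumably look like `load_or_train('model1.pk', lambda: IBMModel1(bitexts, 20))`.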

If you try to pickle the IBMModel1 object, which holds a lambda function, with the default pickle / cPickle, you'll end up with this:

>>> import cPickle as pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle function objects

(Note: the code snippets above are from NLTK version 3.0.0)

In Python 3 with NLTK 3.0.0, you'll face the same problem, because IBMModel1 holds a lambda function:

alvas@ubi:~$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('mode1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
_pickle.PicklingError: Can't pickle <function IBMModel1.train.<locals>.<lambda> at 0x7fa37cf9d620>: attribute lookup <lambda> on nltk.align.ibm1 failed

>>> import dill
>>> with open('model1.pk', 'wb') as fout:
...     dill.dump(ibm, fout)
... 
>>> exit()

alvas@ubi:~$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> from nltk.corpus import comtrans
>>> with open('model1.pk', 'rb') as fin:
...     ibm = dill.load(fin)
... 
>>> bitexts = comtrans.aligned_sents()[:100]
>>> aligned_sent = ibm.aligned(bitexts[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'IBMModel1' object has no attribute 'aligned'
>>> aligned_sent = ibm.align(bitexts[0])
>>> aligned_sent.words
['Wiederaufnahme', 'der', 'Sitzungsperiode']

(Note: in Python 3, pickle automatically uses the C implementation that used to be called cPickle whenever it is available, see http://docs.pythonsprints.com/python3_porting/py-porting.html)
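For code that has to run on both interpreters, a common import idiom picks whichever module exists. This is a small sketch; the payload is made up for illustration.

```python
try:
    import cPickle as pickle  # Python 2: the C implementation is a separate module
except ImportError:
    import pickle  # Python 3: the C accelerator is used automatically when available

# Protocol 2 is the highest protocol that both Python 2 and Python 3 can read.
payload = pickle.dumps({"order": 1}, protocol=2)
roundtrip = pickle.loads(payload)
```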

score 3 · Accepted Answer

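You discuss saving the aligner model, but your question seems to be more about saving the bitexts that you have already aligned: "It would be nice to do the alignments in a batch one day and use them later on." That is the question I'm going to answer.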

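In the nltk environment, the best way to use a corpus-like resource is to access it through a corpus reader. NLTK doesn't come with corpus writers, but the format supported by NLTK's AlignedCorpusReader is very easy to generate (NLTK 3 version):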

model = ibm(biverses, 20)  # As in your question

# Three lines per sentence pair: source words, target words, alignment
with open("folder/newalignedtext.txt", "w") as out:
    for pair in biverses:
        asent = model.align(pair)
        out.write(" ".join(asent.words) + "\n")
        out.write(" ".join(asent.mots) + "\n")
        out.write(str(asent.alignment) + "\n")

That's it. You can later reload and use your aligned sentences exactly as you would use the comtrans corpus:

from nltk.corpus.reader import AlignedCorpusReader

mycorpus = AlignedCorpusReader(r"folder", r".*\.txt")
biverses_reloaded = mycorpus.aligned_sents()

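As you can see, you don't need the aligner object itself. The aligned sentences can be loaded with a corpus reader, and the aligner itself is of little use unless you want to study the embedded probabilities.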
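To make the three-line file layout concrete, here is a pure-Python round trip that needs no NLTK. It is only a sketch: `write_aligned` and `read_aligned` are hypothetical helpers, the parser is not NLTK's own, the alignment links are invented for illustration, and the `i-j` link notation is assumed to match what `str(asent.alignment)` produces.

```python
import io


def write_aligned(pairs, out):
    """Write (words, mots, alignment) triples, three lines per sentence pair."""
    for words, mots, alignment in pairs:
        out.write(" ".join(words) + "\n")
        out.write(" ".join(mots) + "\n")
        out.write(" ".join("%d-%d" % (i, j) for i, j in sorted(alignment)) + "\n")


def read_aligned(lines):
    """Parse three-line blocks back into (words, mots, alignment) triples."""
    rows = [line.rstrip("\n") for line in lines]
    triples = []
    for k in range(0, len(rows) - 2, 3):
        words, mots = rows[k].split(), rows[k + 1].split()
        alignment = {tuple(int(n) for n in link.split("-"))
                     for link in rows[k + 2].split()}
        triples.append((words, mots, alignment))
    return triples


# Illustrative data only; the links are not real aligner output.
pairs = [(["Wiederaufnahme", "der", "Sitzungsperiode"],
          ["Resumption", "of", "the", "session"],
          {(0, 0), (1, 1), (1, 2), (2, 3)})]

buffer = io.StringIO()
write_aligned(pairs, buffer)
reloaded = read_aligned(buffer.getvalue().splitlines())
```

A plain-text layout like this stays readable and diffable, which is one reason to prefer it over a pickle when only the alignments are needed.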

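Comment: I'm not sure I would call the aligner object a "model". In NLTK 2, the aligner is not set up to align new text, and it does not even have an align() method. In NLTK 3, the function align() can align new text, but only when used from Python 2; in Python 3 it is broken, apparently because of the stricter rules for comparing objects of different types. If you nevertheless want to be able to pickle and reload the aligner, I'll be happy to add that to my answer; from what I've seen, it can be done with vanilla cPickle.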

score 1 · Accepted Answer

If you want, and it looks like you do, you can store it as a list of AlignedSent objects:

from nltk.align import IBMModel1 as IBM
from nltk.align import AlignedSent
import dill as pickle

biverses = [...]  # list of AlignedSent objects
model = IBM(biverses, 20)

# Copy each computed alignment back onto its sentence pair
for sent in biverses:
    sent.alignment = model.align(sent).alignment

After that, you can save it as a pickle with dill:

with open('alignedtext.pk', 'wb') as arquive:
     pickle.dump(biverses, arquive)
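Reading the list back is the mirror image of the dump. The sketch below uses the stdlib pickle and plain dicts as stand-ins for AlignedSent objects so that it runs without NLTK or dill; the sentence data and the temporary path are made up for illustration.

```python
import os
import pickle  # assumption: `import dill as pickle` if the stored objects hold lambdas
import tempfile

# Stand-ins for AlignedSent objects with their alignments filled in.
aligned = [
    {"words": ["der", "Hund"], "mots": ["the", "dog"], "alignment": [(0, 0), (1, 1)]},
]

path = os.path.join(tempfile.mkdtemp(), "alignedtext.pk")
with open(path, "wb") as fout:
    pickle.dump(aligned, fout)

# Later, possibly in another session: load the aligned sentences again.
with open(path, "rb") as fin:
    reloaded_list = pickle.load(fin)
```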

score 0 · Accepted Answer

joblib can also save trained nltk models, for example:

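from nltk.lm import MLE
import joblib

# train_data and padded_sents are assumed to come from
# nltk.lm.preprocessing.padded_everygram_pipeline(2, tokenized_sentences)
model = MLE(2)  # bigram model; the order is the first positional argument
model.fit(train_data, padded_sents)

# save model
with open(model_path, 'wb') as fout:
    joblib.dump(model, fout)

# load model
model = joblib.load(model_path)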
