python - Training an IOB Chunker using nltk.tag.brill_trainer (Transformation-Based Learning)

Question

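I am trying to train a specific chunker (let's say a noun chunker, for simplicity) using NLTK's brill module. I would like to use three features, i.e. word, POS tag, and IOB tag.

(Ramshaw and Marcus, 1995:7) show 100 templates generated from combinations of these three features, for example: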
```
W0, P0, T0     # current word, pos tag, iob tag
W-1, P0, T-1   # prev word, pos tag, prev iob tag
...
```

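I would like to incorporate them into nltk.tbl.feature, but there are only two kinds of feature objects, i.e. brill.Word and brill.Pos. Limited by this design, I could only put the word and POS features together as (word, pos), and thus used ((word, pos), iob) as the feature for training. For example: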

from nltk.tbl import Template
from nltk.tag import brill, brill_trainer, untag
from nltk.corpus import treebank_chunk
from nltk.chunk.util import tree2conlltags, conlltags2tree

# Code from (Perkins, 2013)
def train_brill_tagger(initial_tagger, train_sents, **kwargs):
    templates = [
        brill.Template(brill.Word([0])),
        brill.Template(brill.Pos([-1])),
        brill.Template(brill.Word([-1])),
        brill.Template(brill.Word([0]),brill.Pos([-1])),]
    trainer = brill_trainer.BrillTaggerTrainer(initial_tagger, templates, trace=3,)
    return trainer.train(train_sents, **kwargs)

# generating ((word, pos),iob) pairs as feature.
def chunk_trees2train_chunks(chunk_sents):
    tag_sents = [tree2conlltags(sent) for sent in chunk_sents]
    return [[((w,t),c) for (w,t,c) in sent] for sent in tag_sents]

>>> from nltk.tag import DefaultTagger
>>> tagger = DefaultTagger('NN')
>>> train = treebank_chunk.chunked_sents()[:2]
>>> t = chunk_trees2train_chunks(train)
>>> bt = train_brill_tagger(tagger, t)
TBL train (fast) (seqs: 2; tokens: 31; tpls: 4; min score: 2; min acc: None)
Finding initial useful rules...
    Found 79 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
  12  12   0  17  | NN->I-NP if Pos:NN@[-1]
   3   3   0   0  | I-NP->O if Word:(',', ',')@[0]
   2   2   0   0  | I-NP->B-NP if Word:('the', 'DT')@[0]
   2   2   0   0  | I-NP->O if Word:('.', '.')@[0]

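As shown above, (word, pos) is treated as one feature as a whole. This does not properly capture the three features (word, pos-tag, iob-tag) separately.

Are there other ways to implement the word, pos, and iob features separately in nltk.tbl.feature?
If this is not possible in NLTK, are there other implementations in Python? I was only able to find C++ and Java implementations on the Internet.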
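To make the limitation concrete, here is a small sketch in plain Python (no NLTK involved; the helper names `joint_matches` and `pos_matches` are made up for illustration). With the joint (word, pos) feature, a rule condition fires only on one exact pair, whereas a separate POS feature would generalize over every word carrying that tag.

```python
# Tokens as (word, pos) pairs, mirroring the training data above.
tokens = [("the", "DT"), ("a", "DT"), ("cat", "NN")]

def joint_matches(tokens, index, word_pos):
    """Condition on the combined (word, pos) feature: exact pair only."""
    return tokens[index] == word_pos

def pos_matches(tokens, index, pos):
    """Condition on a separate POS feature: any word with that tag."""
    return tokens[index][1] == pos

joint_hits = [i for i in range(len(tokens))
              if joint_matches(tokens, i, ("the", "DT"))]
pos_hits = [i for i in range(len(tokens))
            if pos_matches(tokens, i, "DT")]

print(joint_hits)  # positions matched by the joint feature
print(pos_hits)    # positions matched by the POS-only feature
```

The joint condition matches only the first token, while the POS-only condition also covers "a".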

score 2 · Accepted Answer

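The nltk3 brill trainer api (I wrote it) does handle training on sequences of tokens described with multidimensional features, as your data is an example of. However, the practical limits may be severe. The number of possible templates in multidimensional learning increases drastically, and the current nltk implementation of the brill trainer trades memory for speed, similar to Ramshaw and Marcus 1994, "Exploring the statistical derivation of transformational rule sequences...". Memory consumption may be HUGE, and it is very easy to give the system more data and/or templates than it can handle. A useful strategy is to rank templates according to how often they produce good rules (see print_template_statistics() in the example below).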
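The ranking idea can be sketched in plain Python. This is an illustration only, not what print_template_statistics() does internally: the function name `rank_templates` and the sample data are hypothetical, and it assumes you can obtain one template id and one score per learned rule (in NLTK these might come from each rule's `templateid` and the trained tagger's training statistics, but check that against your version).

```python
from collections import defaultdict

def rank_templates(template_ids, rule_scores):
    """Sum rule scores per template id; return (id, total) pairs, best first."""
    totals = defaultdict(int)
    for tid, score in zip(template_ids, rule_scores):
        totals[tid] += score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical data: one template id and one score per learned rule.
template_ids = ["003", "017", "003", "042", "017", "003"]
rule_scores = [12, 3, 2, 2, 5, 1]

ranking = rank_templates(template_ids, rule_scores)
print(ranking)
```

Templates near the bottom of such a ranking, and templates that never produced a rule at all, are the natural candidates to drop before retraining on more data.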

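Another or additional possibility is to use the nltk implementation of Brill's original algorithm, which has very different memory-speed tradeoffs; it does no indexing and so will use much less memory. It uses some optimizations and is actually rather quick in finding the very best rules, but it is generally extremely slow towards the end of training, when there are many competing, low-scoring candidates. Sometimes you don't need those anyway. For some reason this implementation seems to have been omitted from newer nltks, but here is the source (I just tested it): http://www.nltk.org/_modules/nltk/tag/brill_trainer_orig.html.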

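There are other algorithms with other tradeoffs; in particular, the fast memory-efficient indexing algorithms of Florian and Ngai 2000 ( http://www.aclweb.org/anthology/N/N01/N01-1006.pdf ) and the probabilistic rule sampling of Samuel 1998 ( https://www.aaai.org/Papers/FLAIRS/1998/FLAIRS98-045.pdf ) would be useful additions. Also, as you noticed, the documentation is not complete and is too focused on part-of-speech tagging, and it is not clear how to generalize from it. Fixing the docs is (also) on the todo list.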

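However, interest in generalized (non-POS-tagging) tbl in nltk has been rather limited (the totally unsuited api of nltk2 was untouched for 10 years), so don't hold your breath. If you get impatient, you may wish to check out more dedicated alternatives, in particular mutbl and fntbl (google them, I only have reputation for two links).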

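Anyway, here is a quick sketch for nltk:

First, a hardcoded convention in nltk is that tagged sequences ("tags" meaning any label you would like to assign to your data, not necessarily part-of-speech) are represented as sequences of pairs, [(token1, tag1), (token2, tag2), ...]. The tags are strings; in many basic applications, so are the tokens. For instance, the tokens may be words and the tags their POS, as in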

[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

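(As an aside, this sequence-of-token-tag-pairs convention is pervasive in nltk and its documentation, but it should arguably be better expressed as named tuples rather than pairs, so that instead of saying

[token for (token, _tag) in tagged_sequence]

you could say for instance

[x.token for x in tagged_sequence]

The first form fails on non-pairs, but the second exploits duck typing, so that tagged_sequence could be any sequence of user-defined instances, as long as they have an attribute "token".)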
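A minimal sketch of that aside, using only the standard library. The names `TaggedToken` and `RichToken` are invented for illustration and are not part of nltk.

```python
from collections import namedtuple

# Hypothetical record type; the field names are chosen for illustration.
TaggedToken = namedtuple("TaggedToken", ["token", "tag"])

class RichToken:
    """A user-defined token carrying more than a (token, tag) pair."""
    def __init__(self, token, tag, iob):
        self.token = token
        self.tag = tag
        self.iob = iob

pairs = [("And", "CC"), ("now", "RB")]
named = [TaggedToken(w, t) for (w, t) in pairs]
rich = [RichToken("And", "CC", "O"), RichToken("now", "RB", "O")]

# Tuple unpacking works for plain pairs and namedtuples,
# but would fail on RichToken instances.
words_from_pairs = [token for (token, _tag) in named]

# Attribute access works for anything that has a .token attribute.
words_from_named = [x.token for x in named]
words_from_rich = [x.token for x in rich]
```

Because a namedtuple is still a tuple, code written against the pair convention keeps working with it, while attribute access additionally admits richer objects.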

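Now, you could well have a richer representation of your tokens. An existing tagger interface (nltk.tag.api.FeaturesetTaggerI) expects each token as a featureset rather than a string, that is, a dictionary that maps feature names to feature values for each item in the sequence.

A tagged sequence may then look like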

[({'word': 'Pierre', 'tag': 'NNP', 'iob': 'B-NP'}, 'NNP'),
 ({'word': 'Vinken', 'tag': 'NNP', 'iob': 'I-NP'}, 'NNP'),
 ({'word': ',',      'tag': ',',   'iob': 'O'   }, ','),
 ...
]

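There are other possibilities (though with less support in the rest of nltk). For instance, you could have a named tuple for each token, or a user-defined class which allows you to add any amount of dynamic computation to attribute access (perhaps using @property to offer a consistent interface).

The brill tagger does not need to know what view you currently provide on your tokens. However, it does require you to provide an initial tagger which can take sequences of tokens-in-your-representation to sequences of tags. You cannot use the existing taggers in nltk.tag.sequential directly, since they expect [(word, tag), ...]. But you may still be able to exploit them. The example below uses this strategy (in MyInitialTagger), and the token-as-featureset-dictionary view.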
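As a rough illustration of the user-defined-class option, the following sketch stores three attributes and computes two more on access. The class name `Token` and the properties `is_capitalized` and `suffix3` are hypothetical choices, not anything nltk provides.

```python
class Token:
    """Hypothetical token class; attribute names are illustrative only."""
    def __init__(self, word, pos, iob):
        self.word = word
        self.pos = pos
        self.iob = iob

    @property
    def is_capitalized(self):
        # Computed on access rather than stored.
        return self.word[:1].isupper()

    @property
    def suffix3(self):
        # Last three characters, lowercased (shorter words return as-is).
        return self.word[-3:].lower()

sent = [Token("Pierre", "NNP", "B-NP"),
        Token("Vinken", "NNP", "I-NP"),
        Token(",", ",", "O")]

caps = [t.is_capitalized for t in sent]
suffixes = [t.suffix3 for t in sent]
```

Stored and computed attributes are read the same way, so a feature extractor does not need to know which is which.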

from __future__ import division, print_function, unicode_literals

import sys

from nltk import tbl, untag
from nltk.tag.brill_trainer import BrillTaggerTrainer
# or: 
# from nltk.tag.brill_trainer_orig import BrillTaggerTrainer
# 100 templates and a tiny 500 sentences (11700 
# tokens) produce 420000 rules and uses a 
# whopping 1.3GB of memory on my system;
# brill_trainer_orig is much slower, but uses 0.43GB

from nltk.corpus import treebank_chunk
from nltk.chunk.util import tree2conlltags
from nltk.tag import DefaultTagger


def get_templates():
    wds10 = [[Word([0])],
             [Word([-1])],
             [Word([1])],
             [Word([-1]), Word([0])],
             [Word([0]), Word([1])],
             [Word([-1]), Word([1])],
             [Word([-2]), Word([-1])],
             [Word([1]), Word([2])],
             [Word([-1,-2,-3])],
             [Word([1,2,3])]]

    pos10 = [[POS([0])],
             [POS([-1])],
             [POS([1])],
             [POS([-1]), POS([0])],
             [POS([0]), POS([1])],
             [POS([-1]), POS([1])],
             [POS([-2]), POS([-1])],
             [POS([1]), POS([2])],
             [POS([-1, -2, -3])],
             [POS([1, 2, 3])]]

    iobs5 = [[IOB([0])],
             [IOB([-1]), IOB([0])],
             [IOB([0]), IOB([1])],
             [IOB([-2]), IOB([-1])],
             [IOB([1]), IOB([2])]]


    # the 5 * (10+10) = 100 3-feature templates 
    # of Ramshaw and Marcus
    templates = [tbl.Template(*wdspos+iob) 
        for wdspos in wds10+pos10 for iob in iobs5]
    # Footnote:
    # any template-generating functions in new code 
    # (as opposed to recreating templates from earlier
    # experiments like Ramshaw and Marcus) might 
    # also consider the mass generating Feature.expand()
    # and Template.expand(). See the docs, or for 
    # some examples the original pull request at
    # https://github.com/nltk/nltk/pull/549 
    # ("Feature- and Template-generating factory functions")

    return templates

def build_multifeature_corpus():
    # The true value of the target fields is unknown in testing, 
    # and, of course, templates must not refer to it in training.
    # But we may wish to keep it for reference (here, truepos).

    def tuple2dict_featureset(sent, tagnames=("word", "truepos", "iob")):
        return (dict(zip(tagnames, t)) for t in sent)

    def tag_tokens(tokens):
        return [(t, t["truepos"]) for t in tokens]
    # connlltagged_sents :: [[(word,tag,iob)]]
    connlltagged_sents = (tree2conlltags(sent) 
        for sent in treebank_chunk.chunked_sents())
    conlltagged_tokenses = (tuple2dict_featureset(sent) 
        for sent in connlltagged_sents)
    conlltagged_sequences = (tag_tokens(sent) 
        for sent in conlltagged_tokenses)
    return conlltagged_sequences

class Word(tbl.Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][0]["word"]

class IOB(tbl.Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][0]["iob"]

class POS(tbl.Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][1]


class MyInitialTagger(DefaultTagger):
    def choose_tag(self, tokens, index, history):
        tokens_ = [t["word"] for t in tokens]
        return super().choose_tag(tokens_, index, history)


def main(argv):
    templates = get_templates()
    trainon = 100

    corpus = list(build_multifeature_corpus())
    train, test = corpus[:trainon], corpus[trainon:]

    print(train[0], "\n")

    initial_tagger = MyInitialTagger('NN')
    print(initial_tagger.tag(untag(train[0])), "\n")

    trainer = BrillTaggerTrainer(initial_tagger, templates, trace=3)
    tagger = trainer.train(train)

    taggedtest = tagger.tag_sents([untag(t) for t in test])
    print(test[0])
    print(initial_tagger.tag(untag(test[0])))
    print(taggedtest[0])
    print()

    tagger.print_template_statistics()

if __name__ == '__main__':
    sys.exit(main(sys.argv))

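The setup above builds a POS tagger. If you instead wish to target another attribute, say to build an IOB tagger, you need a couple of small changes so that the target attribute (which you can think of as read-write) is accessed from the 'tag' position in your corpus [(token, tag), ...] and any other attributes (which you can think of as read-only) are accessed from the 'token' position. For instance:

1) Build your corpus [(token, tag), (token, tag), ...] for IOB tagging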

def build_multifeature_corpus():
    ...

    def tuple2dict_featureset(sent, tagnames=("word", "pos", "trueiob")):
        return (dict(zip(tagnames, t)) for t in sent)

    def tag_tokens(tokens):
        return [(t, t["trueiob"]) for t in tokens]
    ...

2) Change the initial tagger accordingly

...
initial_tagger = MyInitialTagger('O')
...

3) Modify the feature-extracting class definitions

class POS(tbl.Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][0]["pos"]

class IOB(tbl.Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][1]
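To show what the read-only 'token' position and the read-write 'tag' position mean in practice, here is a plain-Python sketch of applying transformation rules to an IOB-tagged sequence. It is not NLTK's implementation; `word_at`, `pos_at`, `iob_at` and `apply_rule` are hypothetical helpers, and the three rules are made up in the style of the trace output.

```python
# Each item is (token, tag): the token dict is read-only context,
# the tag is the IOB label that rules rewrite.
def word_at(tokens, index):
    return tokens[index][0]["word"]

def pos_at(tokens, index):
    return tokens[index][0]["pos"]

def iob_at(tokens, index):
    return tokens[index][1]

def apply_rule(tokens, from_tag, to_tag, feature, offset, value):
    """Rewrite from_tag -> to_tag wherever feature@[offset] == value."""
    # Collect all matching positions first, then rewrite them,
    # so one rule application does not feed on its own changes.
    hits = []
    for i in range(len(tokens)):
        j = i + offset
        if (tokens[i][1] == from_tag and 0 <= j < len(tokens)
                and feature(tokens, j) == value):
            hits.append(i)
    out = list(tokens)
    for i in hits:
        out[i] = (tokens[i][0], to_tag)
    return out

words = [("Pierre", "NNP"), ("Vinken", "NNP"), (",", ","),
         ("61", "CD"), ("years", "NNS")]
# Initial tagging: every token gets the same default IOB tag.
sent = [({"word": w, "pos": p}, "I-NP") for (w, p) in words]

sent = apply_rule(sent, "I-NP", "O", word_at, 0, ",")      # Word:','@[0]
sent = apply_rule(sent, "I-NP", "B-NP", iob_at, -1, "O")   # IOB:O@[-1]
sent = apply_rule(sent, "I-NP", "B-NP", pos_at, 1, "NNP")  # Pos:NNP@[1]

final_tags = [tag for (_tok, tag) in sent]
print(final_tags)
```

Only the second element of each pair ever changes; the word and POS values stay fixed, and the IOB feature reads the current tag, so later rules can condition on what earlier rules produced.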
