python - Is that a list of tags or something else?

Question

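I'm new to NLP and NLTK, and I want to find ambiguous words, meaning words that have at least n different tags. I have this method, but the output is more than confusing.

Code: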

def MostAmbiguousWords(words, n):
    # wordsUniqeTags holds a list of unique tags that have been observed for a given word
    wordsUniqeTags = {}
    for (w,t) in words:
        if wordsUniqeTags.has_key(w):
            wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)
        else:
            wordsUniqeTags[w] = set([t])
    # Starting to count
    res = []
    for w in wordsUniqeTags:
        if len(wordsUniqeTags[w]) >= n:
            res.append((w, wordsUniqeTags[w]))

    return res

MostAmbiguousWords(brown.tagged_words(), 13)

Output:

[("what's", set(['C', 'B', 'E', 'D', 'H', 'WDT+BEZ', '-', 'N', 'T', 'W', 'V', 'Z', '+'])),
("who's", set(['C', 'B', 'E', 'WPS+BEZ', 'H', '+', '-', 'N', 'P', 'S', 'W', 'V', 'Z'])),
("that's", set(['C', 'B', 'E', 'D', 'H', '+', '-', 'N', 'DT+BEZ', 'P', 'S', 'T', 'W', 'V', 'Z'])),
('that', set(['C', 'D', 'I', 'H', '-', 'L', 'O', 'N', 'Q', 'P', 'S', 'T', 'W', 'CS']))]

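Now I have no idea what B, C, Q, etc. could stand for. So, my questions:

What are these?
What do they mean? (In case they are tags)
I think they are not tags, because who's and what's don't have the WH tag that indicates "wh question words".

I'd be happy if someone could post a link to a mapping of all possible tags and their meanings.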

score 3 · Accepted Answer

It looks like you have a typo. In this line:

wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)

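you should have set([t]) (not set(t)), just as you did in the else case.

This explains the behavior you're seeing: t is a string, and set(t) builds a set out of each character in the string. What you want is set([t]), which makes a set that has t as its single element.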

>>> t = 'WHQ'
>>> set(t)
set(['Q', 'H', 'W'])    # bad
>>> set([t])
set(['WHQ'])            # good

By the way, you could simply change that line to:

wordsUniqeTags[w].add(t)
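
As a quick sanity check, the sketch below compares the three spellings on a made-up tag value. The names `tags`, `wrong`, and `right` are only for illustration, and the set-literal syntax assumes Python 2.7 or later.

```python
# Hypothetical starting state: one tag already recorded for some word.
tags = {"DT"}
t = "WPS"

wrong = tags | set(t)   # set(t) iterates over the string, one character at a time
right = tags | {t}      # {t} is a one-element set holding the whole tag

tags.add(t)             # in-place equivalent of the corrected union

print(sorted(wrong))    # ['DT', 'P', 'S', 'W']
print(sorted(right))    # ['DT', 'WPS']
```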

But really, you should use the setdefault method on dict and list comprehension syntax to improve the method overall. So try this:

def most_ambiguous_words(words, n):
    # wordsUniqeTags maps each word to the set of unique tags observed for it
    wordsUniqeTags = {}
    for (w,t) in words:
        wordsUniqeTags.setdefault(w, set()).add(t)
    # Keep only the words with at least n distinct tags
    # (items() works on both Python 2 and Python 3)
    return [(word,tags) for word,tags in wordsUniqeTags.items() if len(tags) >= n]
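
To try the idea without loading the Brown corpus, here is a minimal sketch on a hand-made list of (word, tag) pairs. The data in `sample` is invented for illustration and is not real corpus output; the variable name `tags_by_word` is likewise just a choice made here.

```python
def most_ambiguous_words(words, n):
    # Map each word to the set of distinct tags seen for it.
    tags_by_word = {}
    for word, tag in words:
        tags_by_word.setdefault(word, set()).add(tag)
    # Keep only words with at least n distinct tags.
    return [(word, tags) for word, tags in tags_by_word.items() if len(tags) >= n]

# Hypothetical toy data standing in for brown.tagged_words().
sample = [("that", "CS"), ("that", "DT"), ("that", "WPS"),
          ("dog", "NN"), ("dog", "NN")]

result = most_ambiguous_words(sample, 2)
print(result)
```

With this input only "that" qualifies, since "dog" was seen with a single tag, and each tag stays intact as a whole string.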

score 0 · Accepted Answer

You are splitting your POS tags into single characters in this line:

    wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)

set('AT') results in set(['A', 'T']).

score 0 · Accepted Answer

How about using the Counter and defaultdict functionality from the collections module?

from collections import defaultdict, Counter

def most_ambiguous_words(words, n):
    # Count how often each tag occurs for each word
    counts = defaultdict(Counter)
    for (word, tag) in words:
        counts[word][tag] += 1
    # Keep the words with at least n distinct tags
    return [(w, list(counts[w])) for w in counts if len(counts[w]) >= n]
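
One thing a Counter-based version gives you beyond the set-based one is the frequency of each tag per word. A small sketch of that, again on made-up data (the helper name `tag_counts` and the `sample` list are hypothetical):

```python
from collections import Counter, defaultdict

def tag_counts(words):
    # Map each word to a Counter of how often each tag was seen with it.
    counts = defaultdict(Counter)
    for word, tag in words:
        counts[word][tag] += 1
    return counts

# Hypothetical toy data, not real corpus output.
sample = [("that", "CS"), ("that", "CS"), ("that", "DT"), ("dog", "NN")]

counts = tag_counts(sample)
# Words with at least 2 distinct tags, with the tags sorted for stable output.
ambiguous = [(w, sorted(c)) for w, c in counts.items() if len(c) >= 2]
# The most frequent tag for "that", with its count.
top = counts["that"].most_common(1)
print(ambiguous, top)
```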
