
I am looking for an algorithm or method that would help identify general phrases in a corpus of text written in a particular dialect (the text comes from a specific domain, which in my case amounts to a dialect of English). For example, the following fragment could come from a larger corpus related to World of Warcraft, or perhaps to MMORPGs in general.

players control a character avatar within a game world in third person or first person view, exploring the landscape, fighting various monsters, completing quests, and interacting with non-player characters (NPCs) or other players. Also similar to other MMORPGs, World of Warcraft requires the player to pay for a subscription, either by buying prepaid game cards for a selected amount of playing time, or by using a credit or debit card to pay on a regular basis

As output from the above I would like to identify the following general phrases:

  1. first person
  2. World of Warcraft
  3. prepaid game cards
  4. debit card

Notes:

  1. There are previous questions similar to mine here and here, but to clarify, mine differs in the following ways:

    a. I am trying to use an existing toolkit such as NLTK, OpenNLP, etc.

    b. I am not interested in identifying other Parts of Speech in the sentence

    c. I can use human intervention: the algorithm presents the identified noun phrases to a human expert, who can then confirm or reject the findings. However, we do not have the resources to train a language model on hand-annotated data.


2 Answers


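It looks like you are trying to do noun phrase extraction. The TextBlob Python library includes two noun phrase extraction implementations out of the box.

The simplest way to get started is to use the default FastNPExtractor, which is based on Shlomi Babluki's algorithm described here.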

# TextBlob was first published as the `text` package; current releases use `textblob`.
# The extractors need NLTK data: python -m textblob.download_corpora
from textblob import TextBlob

text = '''
players control a character avatar within a game world in third person or first
person view, exploring the landscape, fighting various monsters, completing quests,
and interacting with non-player characters (NPCs) or other players. Also similar
to other MMORPGs, World of Warcraft requires the player to pay for a
subscription, either by buying prepaid game cards for a selected amount of
playing time, or by using a credit or debit card to pay on a regular basis
'''

blob = TextBlob(text)  # uses the default FastNPExtractor
print(blob.noun_phrases)  # ['players control', 'character avatar' ...]

It is very easy to swap in the other implementation (an NLTK-based chunker).

from textblob.np_extractors import ConllExtractor

# Pass the alternative extractor explicitly when building the blob
blob = TextBlob(text, np_extractor=ConllExtractor())

print(blob.noun_phrases)  # ['character avatar', 'game world' ...]

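If neither of these is sufficient, you can write your own noun phrase extractor class. I recommend looking at the source of the TextBlob np_extractors module for examples. To better understand noun phrase chunking, see Chapter 7 of the NLTK book.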

answered 2013-09-09T05:06:21.080

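NLTK has built-in part-of-speech tagging, which has proven to be quite good at handling unknown words. That said, you seem to misunderstand what a noun is, and you should probably firm up your understanding of both parts of speech and of your own question.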

For example, in "first person", "first" is an adjective. You could automatically assume that adjectives attached to a noun are part of the phrase.
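As a rough illustration of that idea, the sketch below collects runs of adjectives followed by one or more nouns from tokens that are already tagged. The Penn Treebank tags are written by hand here and are an assumption; in practice they would come from a tagger such as the one in NLTK. The function name `adjective_noun_phrases` and the `min_words` parameter are hypothetical, chosen only for this example.

```python
# Hand-tagged tokens (Penn Treebank tags). These tags are an assumption made
# for illustration; a real pipeline would obtain them from a POS tagger.
TAGGED = [
    ("in", "IN"), ("third", "JJ"), ("person", "NN"), ("or", "CC"),
    ("first", "JJ"), ("person", "NN"), ("view", "NN"), (",", ","),
    ("buying", "VBG"), ("prepaid", "JJ"), ("game", "NN"), ("cards", "NNS"),
    ("or", "CC"), ("a", "DT"), ("debit", "NN"), ("card", "NN"),
]


def adjective_noun_phrases(tagged, min_words=2):
    """Collect maximal runs of adjectives followed by at least one noun."""
    phrases, current, has_noun = [], [], False

    def flush():
        # Keep the run only if it contains a noun and is long enough.
        nonlocal current, has_noun
        if has_noun and len(current) >= min_words:
            phrases.append(" ".join(current))
        current, has_noun = [], False

    for word, tag in tagged:
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag.startswith("JJ"):
            if has_noun:
                flush()  # an adjective after a noun starts a new phrase
            current.append(word)
        else:
            flush()  # any other tag ends the current run
    flush()
    return phrases


print(adjective_noun_phrases(TAGGED))
```

On this fragment the candidates include "first person view" rather than just "first person", so a human reviewer (or a stricter rule) would still be needed to trim phrases to the form you want.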

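Alternatively, if you are looking to identify general phrases, my suggestion would be to implement a simple Markov chain model and then look for particularly high transition probabilities.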

If you are looking for a Markov chain implementation in Python, I would point you to this gist that I wrote back in the day: https://gist.github.com/Slater-Victoroff/6227656
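To make the transition-probability idea concrete, here is a minimal first-order sketch. It is not the code from the gist; the function name `high_probability_transitions`, the thresholds `min_count` and `min_prob`, and the toy corpus are all assumptions made for illustration. It estimates P(next word | word) from bigram counts and keeps the pairs that are both frequent and highly predictable.

```python
from collections import Counter


def high_probability_transitions(tokens, min_count=2, min_prob=0.6):
    """Return (word, next_word, P(next_word | word)) for strong transitions."""
    # Count only words that have a successor, so probabilities sum to 1.
    unigrams = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    results = []
    for (first, second), count in bigrams.items():
        prob = count / unigrams[first]
        if count >= min_count and prob >= min_prob:
            results.append((first, second, prob))
    # Strongest transitions first; ties broken alphabetically.
    return sorted(results, key=lambda r: (-r[2], r[0], r[1]))


# Toy corpus, invented for this example.
corpus = (
    "players pay with a debit card or a credit card . "
    "a debit card is accepted . prepaid game cards are sold . "
    "prepaid game cards expire . a debit card works"
).split()

for first, second, prob in high_probability_transitions(corpus):
    print(f"{first} {second}: {prob:.2f}")
```

On this toy corpus the pairs "debit card", "game cards" and "prepaid game" come out with probability 1.00, and "a debit" also passes at 0.75. That last pair shows why a stop-word filter, or the human confirmation step mentioned in the question, is still useful on top of raw transition probabilities.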

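If you want to get much more advanced than this, you will quickly find yourself in thesis territory. I hope this helps.

P.S. NLTK includes a large number of pre-annotated corpora that may be useful for your purposes.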

answered 2013-09-09T01:41:40.527