nlp - identifying general phrases in a particular dialect

Question

I am looking for an algorithm or method that would help identify general phrases in a corpus of text written in a particular dialect (the corpus comes from a specific domain, but for my purposes it is effectively a dialect of English). For example, the following fragment could come from a larger corpus about World of Warcraft, or perhaps MMORPGs in general.

players control a character avatar within a game world in third person or first person view, exploring the landscape, fighting various monsters, completing quests, and interacting with non-player characters (NPCs) or other players. Also similar to other MMORPGs, World of Warcraft requires the player to pay for a subscription, either by buying prepaid game cards for a selected amount of playing time, or by using a credit or debit card to pay on a regular basis

As output from the above I would like to identify the following general phrases:

first person
World of Warcraft
prepaid game cards
debit card

Notes:

There are previous questions similar to mine here and here, but for clarification, mine has the following differences:

a. I am trying to use an existing toolkit such as NLTK, OpenNLP, etc.

b. I am not interested in identifying other parts of speech in the sentence.

c. I can use human intervention: the algorithm presents the identified noun phrases to a human expert, who then confirms or rejects them. However, we do not have the resources to train a language model on hand-annotated data.

score 1 · Accepted Answer

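It looks like you are trying to do noun phrase extraction. The TextBlob Python library includes two noun phrase extraction implementations out of the box.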

The simplest way to get started is to use the default, FastNPExtractor, which is based on Shlomi Babluki's algorithm described here.

# Current TextBlob releases ship as the `textblob` package (older ones used `text`).
from textblob import TextBlob

text = '''
players control a character avatar within a game world in third person or first
person view, exploring the landscape, fighting various monsters, completing quests,
and interacting with non-player characters (NPCs) or other players. Also similar
to other MMORPGs, World of Warcraft requires the player to pay for a
subscription, either by buying prepaid game cards for a selected amount of
playing time, or by using a credit or debit card to pay on a regular basis
'''

blob = TextBlob(text)
print(blob.noun_phrases)  # ['players control', 'character avatar' ...]

Swapping in the other implementation (an NLTK-based chunker) is very easy.

from textblob.np_extractors import ConllExtractor

blob = TextBlob(text, np_extractor=ConllExtractor())

print(blob.noun_phrases)  # ['character avatar', 'game world' ...]

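If neither of these suffices, you can create your own noun phrase extractor class. I recommend looking at the TextBlob np_extractors module source for examples. To get a better understanding of noun phrase chunking, check out Chapter 7 of the NLTK book.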

score 1 · Accepted Answer

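NLTK has built-in part-of-speech tagging that has proven quite good at handling unknown words. That said, you seem to misunderstand what a noun is, and you should probably firm up your understanding of parts of speech and of the problem itself.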

For example, in "first person", "first" is an adjective. You could automatically assume that associated adjectives are part of the phrase.
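As a minimal sketch of that adjective rule, the snippet below groups each run of adjectives with the nouns that follow it. The tags are hand-written Penn Treebank tags for part of the question's text (an assumption for illustration; in practice they would come from a tagger such as NLTK's), and `adjective_noun_phrases` is a hypothetical helper, not an NLTK function.

```python
# Hand-tagged tokens (assumed tags, not tagger output).
TAGGED = [
    ("in", "IN"), ("third", "JJ"), ("person", "NN"), ("or", "CC"),
    ("first", "JJ"), ("person", "NN"), ("view", "NN"), (",", ","),
    ("buying", "VBG"), ("prepaid", "JJ"), ("game", "NN"), ("cards", "NNS"),
]


def adjective_noun_phrases(tagged):
    """Return multi-word runs of optional adjectives followed by nouns."""
    phrases, current, has_noun = [], [], False
    # A sentinel token with a non-matching tag flushes the final run.
    for word, tag in list(tagged) + [("", "END")]:
        is_adj = tag.startswith("JJ")
        is_noun = tag.startswith("NN")
        if is_noun or (is_adj and not has_noun):
            current.append(word)
            has_noun = has_noun or is_noun
            continue
        # The run ends here: keep it only if it is multi-word and has a noun.
        if has_noun and len(current) > 1:
            phrases.append(" ".join(current))
        # An adjective arriving after a noun starts a fresh run.
        current = [word] if is_adj else []
        has_noun = False
    return phrases


print(adjective_noun_phrases(TAGGED))
```

Single nouns are dropped on purpose here, since the question asks for phrases; loosening the `len(current) > 1` check would keep them.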

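Alternatively, if you are looking to identify general phrases, my suggestion would be to implement a simple Markov chain model and then look for especially high transition probabilities.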

If you are looking for a Markov chain implementation in Python, I would point you to this gist that I wrote up back in the day: https://gist.github.com/Slater-Victoroff/6227656
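To make the transition-probability idea concrete, here is a rough sketch that is independent of the gist: a first-order Markov chain estimated from bigram counts, with word pairs kept as phrase candidates when the second word almost always follows the first. The toy corpus, the function names, and the `min_prob` and `min_count` thresholds are all assumptions chosen for illustration.

```python
from collections import Counter

# Toy corpus (made up for this example), already tokenized on whitespace.
TOKENS = (
    "players pay by debit card . other players pay by credit card . "
    "a debit card works in world of warcraft . "
    "world of warcraft takes a credit card"
).split()


def transition_table(tokens):
    """Map each bigram (a, b) to (count, estimated P(b | a))."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count each token only where it has a successor.
    first_counts = Counter(tokens[:-1])
    return {
        pair: (n, n / first_counts[pair[0]])
        for pair, n in pair_counts.items()
    }


def candidate_phrases(tokens, min_prob=0.8, min_count=2):
    """Bigrams seen at least min_count times with a high transition probability."""
    table = transition_table(tokens)
    return sorted(
        pair for pair, (n, p) in table.items()
        if n >= min_count and p >= min_prob
    )


for first, second in candidate_phrases(TOKENS):
    print(first, second)
```

On a corpus this small the filter also lets through pairs such as "players pay", so the candidates would still need the human confirmation step described in the question. Chaining overlapping pairs ("world of" and "of warcraft") is one way to recover longer phrases.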

If you want to get much more advanced than this, you will quickly end up in thesis territory. I hope that helps.

P.S. NLTK includes a large number of pre-annotated corpora that may work for your purposes.
