nlp - How do you extract the various meanings of a given word

Question

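Given "violence" as input, is it possible to work out the ways a person might interpret it (for example physical violence, a book, an album, a musical group, ...), as listed in Reference #1 below?

Assuming the user means the album, what is the best way to find, in a set of tweets, the ones that use "Violence" in the album sense?

Is there a way to infer this with an NLP API such as OpenNLP?

Reference #1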

violence/N1 - intentional harmful physical action.
violence/N2 - the property of being wild or turbulent.
Violence/N6 - a book from Neil L. Whitehead; nonfiction
Violence/N7 - an album by The Last Resort
Violence/N8 - Violence is the third album by the Washington-based Alternative metal music group Nothingface.
Violence/N9 - a musical group which produced the albums Eternal Nightmare and Nothing to Gain
Violence/N10 - a song by Aesthetic Perfection, Angel Witch, Arsenic, Beth Torbert, Brigada Flores Magon, etc on the albums A Natural Disaster, Adult Themes for Voice, I Bificus, Retribution, S.D.E., etc
Violence/N11 - an album by Bombardier, Dark Quarterer and Invisible Limits
Violence/N12 - a song by CharlElie Couture, EsprieM, Fraebbblarnir, Ian Hunter, Implant, etc on the albums All the Young Dudes, Broke, No Regrets, Power of Limits, Repercussions, etc
Violence/N18 - Violence: The Roleplaying Game of Egregious and Repulsive Bloodshed is a short, 32-page roleplaying game written by Greg Costikyan under the pseudonym "Designer X" and published by Hogshead Publishing as part of its New Style line of games.
Violence/N42 - Violence (1947) is an American drama film noir directed by Jack Bernhard.

score 2 · Accepted Answer

Purely automatic inference is usually rather hard for this problem.

Instead, we can use:

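Resources such as WordNet or other semantic lexicons. For languages other than English, you can look at the EuroWordNet datasets (not free).
To obtain more senses (for example the album sense), we process well-curated resources such as Wikipedia. Wikipedia carries a lot of meta-information that is very useful for this kind of processing.
The reliability of the process comes from combining as many data sources as possible and processing each of them correctly with dedicated procedures.
As a last resort, you can try manual processing/annotation. It is slow and costly, but useful in an enterprise setting where you only need a small subset of a language.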
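
The idea of combining several sources into one sense inventory can be sketched roughly as follows. This is only an illustration: the glosses are copied from Reference #1 in the question, but which source each one would come from, and the names LEXICON_SENSES, ENCYCLOPEDIA_SENSES and build_inventory, are assumptions made for the example.

```python
from collections import defaultdict

# Glosses taken from Reference #1 in the question. Attributing them to a
# lexicon-like or an encyclopedia-like source is an assumption.
LEXICON_SENSES = [
    ("violence", "common noun", "intentional harmful physical action"),
    ("violence", "common noun", "the property of being wild or turbulent"),
]
ENCYCLOPEDIA_SENSES = [
    ("Violence", "book", "a book from Neil L. Whitehead; nonfiction"),
    ("Violence", "album", "an album by The Last Resort"),
]

def build_inventory(*sources):
    """Merge (word, kind, gloss) triples from several sources, skipping duplicates."""
    inventory = defaultdict(lambda: defaultdict(list))
    for source in sources:
        for word, kind, gloss in source:
            glosses = inventory[word.lower()][kind]
            if gloss not in glosses:
                glosses.append(gloss)
    return inventory

inventory = build_inventory(LEXICON_SENSES, ENCYCLOPEDIA_SENSES)
for kind, glosses in inventory["violence"].items():
    print(kind, glosses)
```

Keying the inventory by a coarse kind (common noun, album, book) makes it easy to ask later for "the album senses only".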

There is no free lunch here.

score 1 · Accepted Answer

If you are doing English NLP in Python, you can try the WordNet API as follows:

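# Requires the WordNet corpus, e.g. nltk.download('wordnet')
from nltk.corpus import wordnet as wn

query = 'violence'
for ss in wn.synsets(query):
    # In NLTK 3.x, offset(), pos() and definition() are methods, not attributes
    print(query, str(ss.offset()).zfill(8) + '-' + ss.pos(), ss.definition())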

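If you are working on other human languages, you could look at the open wordnets available at http://casta-net.jp/~kuribayashi/multi/

Note: the reason for the zfill(8) expression, which joins the zero-padded offset and the part of speech with a dash, is that the result is used as the unique ID of each sense of a given word. This ID is consistent across the open wordnets of every language. The first 8 digits give the ID, and the character after the dash is the part of speech of the sense.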
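
To make the ID format concrete, here is a small sketch that builds and parses such IDs without needing NLTK. The helper names sense_id and parse_sense_id are hypothetical, and the offset 1234 is a made-up number, not a real WordNet offset.

```python
def sense_id(offset, pos):
    """Build '<offset zero-padded to 8 digits>-<pos>', e.g. for pos 'n' or 'v'."""
    return str(offset).zfill(8) + '-' + pos

def parse_sense_id(sid):
    """Split an ID back into its integer offset and part-of-speech character."""
    offset, pos = sid.split('-')
    return int(offset), pos

example = sense_id(1234, 'n')      # 1234 is an invented offset
print(example)                     # 00001234-n
print(parse_sense_id(example))     # (1234, 'n')
```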

score 1 · Accepted Answer

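Take a look at this: the Twitter filtering demo from Idilia. It does exactly what you need: it first analyzes a piece of text to discover the senses of its words, and then keeps only the texts that contain the sense you are looking for. It is available as an API.

Disclaimer: I work for Idilia.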

score 0 · Accepted Answer

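This will be very hard, because proper-noun uses of the word "violence" are very rare as a proportion of all its uses, and their frequency distribution is probably highly skewed in some way. We run into these problems almost any time we want to do some form of named-entity disambiguation.

No tool that I know of will do this for you, so you will be building your own classifier. Using Wikipedia as a training resource, as Mr. K suggested, is probably your best bet.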
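
As a rough idea of what such a classifier could look like, here is a minimal sketch that assumes a multinomial Naive Bayes model over bag-of-words features with add-one smoothing. The four training sentences and the labels "album" and "physical" are invented for illustration. In practice they would be replaced by text harvested from the relevant Wikipedia articles.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train(examples):
    """examples: iterable of (label, text). Returns per-label counts and the vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, text in examples:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, n_docs in doc_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1)
                                  / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Invented stand-ins for Wikipedia-derived training text.
TRAINING = [
    ("album", "Violence is the third album by the metal band released on a record label"),
    ("album", "the band recorded the album and released new songs and tracks"),
    ("physical", "physical violence is intentional harm against people"),
    ("physical", "police reported violence and injuries after the attack"),
]
model = train(TRAINING)
print(classify(model, "Listening to the new album Violence, great tracks"))
```

With real data the class priors would reflect the skew mentioned above, so the rare album sense would need enough training examples to be picked at all.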

score 0 · Accepted Answer

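You can extract all the contexts in which "violence" occurs (a context can be the whole document or, say, a 50-word window), convert them into features (using bag of words), and then cluster those features. Since clustering is unsupervised, you will not have names for the clusters, but you can label them with a few typical contexts.

Then you need to determine which cluster the "violence" in the query belongs to, either from the other words in the query acting as context, or by asking explicitly (did you mean violence as in "...." or violence as in "...."?).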
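
A minimal sketch of this pipeline, assuming a deliberately simple greedy clustering by cosine similarity (a stand-in chosen for brevity; k-means or any other clustering algorithm could be used instead). The example contexts, the stopword list and the threshold of 0.2 are all invented for illustration.

```python
import math
import re
from collections import Counter

# The target word itself is dropped: it occurs in every context, so it
# carries no information for telling the senses apart.
STOPWORDS = frozenset({"the", "a", "of", "is", "and", "to", "in", "on", "this", "violence"})

def bag_of_words(text):
    return Counter(w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(contexts, threshold=0.2):
    """Put each context in the most similar existing cluster, or start a new one."""
    clusters = []
    for text in contexts:
        vec = bag_of_words(text)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["centroid"].update(vec)   # centroid = summed word counts
            best["members"].append(text)
        else:
            clusters.append({"centroid": Counter(vec), "members": [text]})
    return clusters

CONTEXTS = [
    "new album violence out now on the record label",
    "police report violence and injuries in the city",
    "loving the album violence every track on this record",
    "injuries reported after violence in the city centre",
]
clusters = cluster(CONTEXTS)
for c in clusters:
    print(c["members"])
```

The members of each cluster can then serve as the "typical contexts" used to label it.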
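
The query step could be sketched like this, assuming the clusters have already been built and labelled. The CLUSTERS dictionary, its typical contexts, its word sets and the resolve helper are hypothetical placeholders, not the output of any real tool.

```python
import re

# Hypothetical clusters: a typical context used as the label, mapped to
# words that are characteristic of that cluster.
CLUSTERS = {
    "new album Violence out now": {"album", "record", "track", "band", "song"},
    "police report violence in the city": {"police", "injuries", "attack", "city", "crime"},
}

def resolve(query, clusters=CLUSTERS):
    """Return (label, None) if the query context decides, else (None, question)."""
    words = set(re.findall(r"[a-z']+", query.lower()))
    ranked = sorted(
        ((len(words & vocab), label) for label, vocab in clusters.items()),
        reverse=True,
    )
    best_score, best_label = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0
    if best_score > 0 and best_score > runner_up:
        return best_label, None
    # No usable context in the query: fall back to asking the user.
    options = " or ".join('violence as in "%s"' % label for label in clusters)
    return None, "Did you mean %s?" % options

print(resolve("violence album review"))
print(resolve("violence"))
```

A query with extra context words is resolved silently, while a bare "violence" produces the clarifying question.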
