sentiment-analysis - What exactly is an n Gram?

Question

I found this previous question on SO: N-grams: Explanation + 2 applications. The OP gave this example and asked whether it was correct:

Sentence: "I live in NY."

word level bigrams (2 for n): "# I", "I live", "live in", "in NY", "NY #"
character level bigrams (2 for n): "#I", "I#", "#l", "li", "iv", "ve", "e#", "#i", "in", "n#", "#N", "NY", "Y#"

When you have this array of n-gram-parts, you drop the duplicate ones and add a counter for each part giving the frequency:

word level bigrams: [1, 1, 1, 1, 1]
character level bigrams: [2, 1, 1, ...]
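The counting step described in the quoted example can be sketched in Python with `collections.Counter`. This is only an illustration: the two lists are copied from the example above, and the lower-casing step is an assumption made here to show one reading under which a count of 2 appears (with case preserved, every bigram in the list is distinct).

```python
from collections import Counter

# Bigram lists copied from the quoted example ("I live in NY.").
word_bigrams = ["# I", "I live", "live in", "in NY", "NY #"]
char_bigrams = ["#I", "I#", "#l", "li", "iv", "ve", "e#",
                "#i", "in", "n#", "#N", "NY", "Y#"]

# Counter drops duplicates and keeps a frequency for each distinct part.
word_counts = Counter(word_bigrams)
char_counts = Counter(char_bigrams)

# Assumption: fold case before counting, so "#I" and "#i" become the same bigram.
folded_counts = Counter(gram.lower() for gram in char_bigrams)

print(list(word_counts.values()))
print(folded_counts.most_common(3))
```

With case folding, "#i" is the only character bigram that occurs twice; whether the original poster intended case folding is not stated.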

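Someone in the answers section confirmed this was correct, but unfortunately I'm a bit lost beyond that, as I didn't fully understand everything else that was said! I'm using LingPipe and following a tutorial which says I should choose a value between 7 and 12, but without stating why.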

What is a good nGram value, and how should I take it into account when using a tool like LingPipe?

Edit: This was the tutorial: http://cavajohn.blogspot.co.uk/2013/05/how-to-sentiment-analysis-of-tweets.html

score 52 · Accepted Answer

Usually a picture is worth a thousand words.

Source: http://recognize-speech.com/language-model/n-gram-model/comparison

score 47 · Accepted Answer

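N-grams are simply all sequences of n adjacent words or letters that you can find in your source text. For example, given the word fox, all 2-grams (or "bigrams") are fo and ox. You may also count the word boundary; that would expand the list of 2-grams to #f, fo, ox, and x#, where # denotes a word boundary.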

You can do the same on the word level. As an example, the text hello, world! contains the following word-level bigrams: # hello, hello world, world #.
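As a rough sketch of the two extractions just described, the helpers below produce character-level and word-level n-grams. The function names, the `\w+` tokenizer, and the choice of a single `#` boundary marker on each side are assumptions made for this illustration, not part of any particular library.

```python
import re

def char_ngrams(word, n=2, boundary="#"):
    """Character n-grams of one word, optionally padded with a boundary marker."""
    padded = f"{boundary}{word}{boundary}" if boundary else word
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def word_ngrams(text, n=2, boundary="#"):
    """Word n-grams of a text; punctuation is dropped by the simple tokenizer."""
    tokens = re.findall(r"\w+", text)
    if boundary:
        tokens = [boundary] + tokens + [boundary]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(char_ngrams("fox", boundary=None))  # ['fo', 'ox']
print(char_ngrams("fox"))                 # ['#f', 'fo', 'ox', 'x#']
print(word_ngrams("hello, world!"))       # ['# hello', 'hello world', 'world #']
```

Passing `n=3` to `word_ngrams` gives trigrams in the same way.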

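The basic point of n-grams is that they capture the language structure from a statistical point of view, such as which letter or word is likely to follow a given one. The longer the n-gram (the higher the n), the more context you have to work with. The optimum length really depends on the application: if your n-grams are too short, you may fail to capture important differences. On the other hand, if they are too long, you may fail to capture the "general knowledge" and only stick to particular cases.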
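One way to see the "too long" side of this trade-off is to measure how many distinct n-grams occur only once as n grows. The snippet below is a toy sketch: the sample sentence and the `singleton_share` helper are made up for this illustration, and real corpora behave far less tidily.

```python
from collections import Counter

def singleton_share(tokens, n):
    """Fraction of distinct word n-grams that occur exactly once."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    singles = sum(1 for c in counts.values() if c == 1)
    return singles / len(counts)

# Hypothetical sample text, chosen only because it repeats some phrases.
tokens = "the cat sat on the mat and the cat sat on the hat".split()

shares = {n: singleton_share(tokens, n) for n in range(1, 6)}
for n, share in shares.items():
    print(n, round(share, 2))
```

On this sample the share of one-off n-grams rises steadily with n, which is the sense in which long n-grams describe particular cases rather than general patterns.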

score 3 · Accepted Answer

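An n-gram is an n-tuple, or group of n words or characters (grams, for pieces of grammar) which follow one another. So an n of 3 for the words from your sentence would be like "# I live", "I live in", "live in NY", "in NY #". This is used to create an index of how often words follow one another. You can use this in a Markov chain to create something that resembles language. As you populate a mapping of the distributions of word groups or character groups, you can recombine them, and the longer the n-gram, the more likely the output is to read as natural language.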

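Too high a number, and your output will be a word-for-word copy of the original; too low a number, and the output will be too messy.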
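The Markov chain idea above can be sketched as follows. This is a minimal illustration assuming word-level n-grams with n of at least 2; `build_model`, `generate`, and the sample sentence are hypothetical names and data invented for this example.

```python
import random
from collections import defaultdict

def build_model(tokens, n):
    """Map each (n-1)-word context to the words observed right after it (n >= 2)."""
    model = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context].append(tokens[i + n - 1])
    return model

def generate(model, start, length, rng):
    """Extend `start` (a tuple of n-1 words) by sampling observed followers."""
    out = list(start)
    k = len(start)
    while len(out) < length:
        followers = model.get(tuple(out[-k:]))
        if not followers:  # context never seen in the source: stop
            break
        out.append(rng.choice(followers))
    return out

sample = "the cat sat on the mat and the cat sat on the hat".split()
rng = random.Random(0)  # fixed seed so repeated runs give the same text

# Low n: only one word of context, so the output wanders.
print(" ".join(generate(build_model(sample, 2), ("the",), 10, rng)))
# Higher n: more context, so the output follows the source closely.
print(" ".join(generate(build_model(sample, 4), ("the", "cat", "sat"), 10, rng)))
```

When every context in the source has exactly one follower, the generator can only reproduce the original text, which is the "word-for-word copy" case.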
