
I need to come up with a way to rank and display the most relevant data to users. Our data consists of multiple n-grams extracted from social media. We call these "topics".

The problem I'm facing is that the data contains a lot of duplication. While no string is an exact copy of another, they are subsets of one another. To the user, this information appears duplicated. Here is some sample data:

{
    "count": 1.0, 
    "topic": "lazy people"
}, 
{
    "count": 1.0, 
    "topic": "lazy people taking"
}, 
{
    "count": 1.0, 
    "topic": "lazy people taking away food stamps"
}

An edge case is that the phrase "lazy people" can be extracted from other phrases as well, for example "lazy people are happy". Taking the lowest common denominator (in this case "lazy people") doesn't seem like a good idea, because the end user would not see the differing contexts ("taking away food stamps" vs. "are happy").

On the other hand, taking the longest n-gram may be too much information. In the example I gave above that seems like the logical choice, but it may not always be the right one.

My overall goal is to present this data in a way that is informative and ranked by relevance.

Are there existing solutions and corresponding algorithms for this kind of problem?

Note: Originally my question was very vague and unclear. In fact, that led me to change the question altogether, because what I really need is guidance on what my end result should be.

Note 2: Let me know if I have misused any terminology, or whether I should revise the title of this question so that others searching for this kind of problem can find the answers more easily.


2 Answers


This is a hard problem and solutions tend to be very application specific. Typically you'd collect more than just the n-grams and counts. For example, it usually matters if a particular n-gram is used a lot by a single person, or by a lot of people. That is, if I'm a frequent poster and I'm passionate about wood carving, then the n-gram "wood carving" might show up as a common term. But I'm the only person who cares about it. On the other hand, there might be many people who are into oil painting, but they post relatively infrequently and so the count for the n-gram "oil painting" is close to the count for "wood carving." But it should be obvious that "oil painting" will be relevant to your users and "wood carving" would not be. Without information about what pages the n-grams come from, it's impossible to say which would be relevant to more users.
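As a small illustration of that point, here is a sketch in Python that tracks how many distinct authors use each n-gram alongside its raw count. The posts structure, author names, and field names are made up for the example and are not part of your data.

# Track distinct authors per n-gram in addition to the raw count.
# The `posts` structure below is a hypothetical example.
from collections import defaultdict

posts = [
    {"author": "alice", "ngrams": ["wood carving", "wood carving tools"]},
    {"author": "alice", "ngrams": ["wood carving"]},
    {"author": "bob",   "ngrams": ["oil painting"]},
    {"author": "carol", "ngrams": ["oil painting"]},
]

raw_count = defaultdict(int)          # total occurrences across all posts
distinct_authors = defaultdict(set)   # who actually uses the phrase

for post in posts:
    for gram in post["ngrams"]:
        raw_count[gram] += 1
        distinct_authors[gram].add(post["author"])

for gram, count in raw_count.items():
    print(f"{gram}: count={count}, authors={len(distinct_authors[gram])}")

Here "wood carving" and "oil painting" end up with the same raw count, but the author counts (1 vs. 2) make the difference in relevance visible.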

A common way to identify the most relevant phrases across a corpus of documents is called TF-IDF: Term frequency-inverse document frequency. Most descriptions you see concern themselves with individual words, but it's simple enough to extend that to n-grams.

This assumes, of course, that you can identify individual documents of some sort. You could consider each individual post as a document, or you could group all of the posts from a user as a larger document. Or maybe all of the posts from a single day are considered a document. How you identify documents is up to you.

A simple TF-IDF model is not difficult to build and it gives okay results for a first cut. You can run it against a sample corpus to get a baseline performance number. Then you can add refinements (see the Wikipedia article and related pages), always testing their performance against your pure TF-IDF baseline.
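To make that concrete, here is a minimal TF-IDF sketch in plain Python, under the assumption that each post is treated as one document. The posts list and the extract_ngrams helper are illustrative placeholders, not anything from your pipeline, and the IDF smoothing is just one common variant.

# Minimal TF-IDF over n-grams; each post is treated as one document.
import math
from collections import Counter

def extract_ngrams(text, n_max=3):
    """Return all 1..n_max-grams from a whitespace-tokenized post."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def tfidf_scores(posts, n_max=3):
    """Score every n-gram by its TF-IDF summed across the corpus."""
    docs = [Counter(extract_ngrams(p, n_max)) for p in posts]
    num_docs = len(docs)
    # Document frequency: in how many posts each n-gram appears at least once.
    df = Counter()
    for doc in docs:
        df.update(doc.keys())
    scores = Counter()
    for doc in docs:
        total_terms = sum(doc.values())
        for gram, count in doc.items():
            tf = count / total_terms
            idf = math.log(num_docs / (1 + df[gram])) + 1
            scores[gram] += tf * idf
    return scores

posts = [
    "lazy people taking away food stamps",
    "lazy people are happy",
    "oil painting tips for beginners",
]
for gram, score in tfidf_scores(posts).most_common(5):
    print(f"{score:.3f}  {gram}")

Running it on a sample corpus like this gives you the baseline ranking to compare your refinements against.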

Given the information I have, that's where I would start.

Answered on 2013-09-30T15:05:38.143

Consider using a graph database: have a table of words containing the elements of the n-grams, and a table of n-grams with arcs to the words contained in each n-gram.

As an implementation, you can use Neo4j, which also has a Python library: http://www.coolgarif.com/brain-food/getting-started-with-neo4j-in-python
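As a rough illustration of that layout (using the official neo4j Python driver, version 5.x, rather than the library from the link above), here is a sketch; the connection URI, credentials, and node/relationship names are assumptions for a local Neo4j instance.

# Sketch: one :NGram node per topic, linked to a :Word node for each token.
# Connection details below are placeholders for a local Neo4j instance.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_ngram(tx, topic, count):
    # Create or update the n-gram node, then link it to its words.
    tx.run("MERGE (g:NGram {text: $topic}) SET g.count = $count",
           topic=topic, count=count)
    for word in topic.split():
        tx.run(
            "MERGE (w:Word {text: $word}) "
            "WITH w MATCH (g:NGram {text: $topic}) "
            "MERGE (g)-[:CONTAINS]->(w)",
            word=word, topic=topic,
        )

with driver.session() as session:
    session.execute_write(add_ngram, "lazy people taking away food stamps", 1.0)
    session.execute_write(add_ngram, "lazy people are happy", 1.0)

driver.close()

With the data in this shape, queries over shared :Word nodes let you find which n-grams overlap (e.g. everything containing "lazy" and "people").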

Answered on 2013-09-30T13:47:34.567