2

假设我有一些大的数据行集合,其中行中的每个元素都是一个(键,值)对:

1)    [(bird, "eagle"), (fish, "cod"),      ... , (soda, "coke")]
2)    [(bird, "lark"),  (fish, "bass"),     ...,  (soda, "pepsi")]
n)    ....
n+1)  [(bird, "robin"), (fish, "flounder"), ...,  (soda, "fanta")]

我希望能够运行一些计算来确定新行,与该行“最相似”的行是什么?

我能想到的为任何特定行找到“最相似”行的最直接方法是直接将所述行与所有其他行进行比较。这显然在计算上非常昂贵。

我正在寻找以下形式的解决方案。

  • 一个可以取一行并为该行生成一些导数整数的函数。这个返回的整数将是行的一种“签名”。此签名的重要属性是,如果两行非常“相似”,它们将生成非常接近的整数,如果行非常“不同”,它们将生成远距离整数。显然,如果它们是相同的行,它们将生成相同的签名。

  • 然后,我可以获取这些生成的签名,以及它们指向的行的索引,并按它们的签名对它们进行排序。我会保留这个数据结构,以便我可以快速查找。称它为数据库 B。

  • 当我有一个新行时,我想知道数据库 B 中哪个现有行最相似,我会:

    1. 为新行生成签名
    2. 对数据库 B 中的 (signature,index) 的排序列表进行二分搜索以查找壁橱匹配项
    3. 返回数据库 B 中最接近的匹配(可能是完美匹配)行。

我知道他们在这个问题上挥手致意。我的问题是我实际上并不知道生成此签名的函数是什么。我看到 Levenshtein 距离,但那些代表转换成本,而不是签名。我看到我可以尝试有损压缩,两件事可能是“可存储的”,因为它们压缩为同一件事。我正在寻找有关如何执行此操作的其他想法。

谢谢你。

4

2 回答 2

1

If you had a lot of data, and wanted to do this hardcore, I would suggest a statistical method like PLSA or PSVM, which can extract identifying topics from text and identify documents with similar topic probabilities.

A simpler, but less accurate way of doing it is using Soundex, which is available for many languages. You can store the soundex (which will be a short string, not an integer I'm afraid), and look for exact matches to the soundex, which should point to similar rows.

I think it's unrealistic to expect a function to turn a series of strings into an integer such that integers near each other map to similar strings. The closest you might come is doing a checksum on each individual tuple, and comparing the checksums for the new row to the checksums of existing rows, but I'm guessing you're trying to come up with a single number you can index on.

于 2011-01-23T02:10:26.053 回答
1

EDIT: This is my original answer, which we will call Case 1, where there is no precedence to the keys

You cannot do it as a sorted integer because that is one dimensional and your data is multi-dimensional. So "nearness" in that sense cannot be established on a line.

Your example shows bird, fish and soda for all 3 lines. Are the keys fixed and known? If they are not, then your first step is to hash the keys of a row to establish rows that have the same keys.

For the values, consider this as a poor man's Saturday Night similarity trick. Hash the values, any two rows that match on that hash are an exact match and represent the same "spot", zero distance.

If N is the number of key/value pairs:

The closest non-exact "nearness" would mean matching N-1 out of N values. So you generate N more hashes, each one dropping out one of the values. Any two rows that match on those hashes have N-1 out of N values in common.

The next closest non-exact "nearness" would mean matching N-2 out of N values. So you generate more than N more hashes (I can't figure the binary this late), this time each hash leaves out a combination of two values. Any two rows that match on those hashes have N-2 out of N values in common.

So you can see where this is going. At the logical extreme you end up with 2^N hashes, not very savory, but I'm assuming you would not go that far because you reach a point where too few matching values would be considered to "far" to be worth considering.

EDIT: To see how we cannot escape dimensionality, consider just two keys, with values 1-9. Plot all possible values on a graph. We see see that {1,1} is close to {2,2}, but also that {5,6} is close to {6,7}. So we get a brainstorm, we say, Aha! I'll calculate each point's distance from the origin using Pythagorean theorem! This will make both {1,1} and {2,2} easy to detect. But then the two points {1,10} and {10,1} will get the same number, even though they are as far apart as they can be on the graph. So we say, ok, I need to add the angle for each. Two points at the same distance are distinguished by their angle, two points at the same angle are distinguished by their distance. But of course now we've plotted them on two dimensions.

EDIT: Case 2 would be when there is precedence to the keys, when key 1 is more significant than key 2, which is more significant than key 3, etc. In this case, if the allowed values were A-Z, you would string the values together as if they were digits to get a sortable value. ABC is very close to ABD, but very far from BBD.

于 2011-01-23T02:34:45.693 回答