8

我有大量任意自然语言字符串。为了让我的工具分析它们,我需要将每个字符串转换为唯一的颜色值(RGB 或其他)。我需要颜色对比来取决于字符串的相似性(字符串与其他字符串不同,它们各自的颜色应该越不同)。如果我总是为相同的字符串获得相同的颜色值,那将是完美的。

关于如何解决这个问题的任何建议?

更新字符串之间的距离

我可能需要将“相似性”定义为类似 Levenstein 的距离。不需要自然语言解析。

那是:

"I am going to the store" and 
"We are going to the store"

相似的。

"I am going to the store" and 
"I am going to the store today"

也类似(但略少)。

"I am going to the store" and 
"J bn hpjoh up uif tupsf"

很不相似。

(谢谢,韦尔博格!)

只有当我看到程序输出时,我才可能确切地知道我需要什么距离函数。所以让我们从更简单的事情开始。

任务简化更新

我已经删除了我自己将任务分成两部分的建议——绝对距离计算和颜色分布。这不会很好,因为起初我们将维度信息减少到单个维度,然后尝试将其合成到三个维度。

4

8 回答 8

3

You need to elaborate more on what you mean by "similar strings" in order to come up with an appropriate conversion function. Are the strings

 "I am going to the store" and 
"We are going to the store" 

considered similar? What about the strings

 "I am going to the store" and 
"J bn hpjoh up uif tupsf" 

(all of the letters in the original +1), or

 "I am going to the store" and 
"I am going to the store today"

? Based on what you mean by "similar", you might consider different functions.

If the difference can be based solely on the values of the characters (in Unicode or whatever space they are from), then you can try summing the values up and using the result as a hue for HSV space. If having a longer string should cause the colours to be more different, you might consider weighing characters by their position in the string.

If the difference is more complex, such as by the occurrences of certain letters or words, then you need to identify this. Maybe you can decide red, green and blue values based on the number of Es, Ss and Rs in a string, if your domain has a lot of these. Or pick a hue based on the ratio of vowels to consonents, or words to syllables.

There are many, many different ways to approach this, but the best one really depends on what you mean by "similar" strings.

于 2009-01-30T14:42:25.060 回答
2

It sounds like you want a hash of some sort. It doesn't need to be secure (so nothing as complicated as MD5 or SHA) but something along the lines of:

char1 + char2 + char3 + ... + charN % MAX_COLOUR_VALUE

would work as a simple first step. You could also do fancier things along the lines of having each character act as an 'amplitude' for R,G and B (e could be +1R, +2G and -4B, etc.) and then simply add up all the values in a string... clamp them at the end and you have a method of turning arbitrary length strings into colours as a 'colour hash' sort of process.

于 2009-01-30T14:47:11.943 回答
1

First, you'll need to pick a way to measure string similarity. Minimal edit distance is traditional, but is not sufficient to well-order the strings, which is what you will need if you want to allocate the same colours to the same strings every time - perhaps you could weight the edit costs by alphabetic distance. Also minimal edit distance by itself may not be very useful if what you are after is similarity in speech rather than in written form (if so, consider a stemming/soundex pass first), or some other sense of "similarity".

Then you need to pick a way of traversing the visible colour space based on that metric. It may be helpful to consider using HSL or HSV colour representation - the algorithm could then become as simple as picking a starting hue and walking the sorted corpus, assigning the current hue to each string before offsetting it by the string's difference from the previous one.

于 2009-01-30T14:47:24.643 回答
1

How important is it that you never end up with two dissimilar strings having the same colour?

If it's not that important then maybe this could work?

You could pick a 1 dimensional color space that is "homotopic" to the circle: Say the color function c(x) is defined for x between 0 and 1. Then you'd want c(0) == c(1).

Now you take the sum of all character values modulo some scaling factor and wrap this back to the color space:

c( (SumOfCharValues(word) modulo ScalingFactor) / ScalingFactor )

This might work even better if you defined a "wrapping" color space of higher dimensions and for each dimension pick different SumOfCharValues function; someone suggested alternating sum and length.

Just a thought... HTH

于 2009-10-25T20:39:24.273 回答
1

您可以使用MinHash或其他一些LSH 方法,并将相似性定义为由Jaccard 系数测量的带状疱疹集之间的交集。Rajaraman 和 Ullman在Mining of Massive data sets, Ch.3中有很好的描述。

于 2013-03-24T22:23:03.810 回答
1

这是我的建议(我觉得这个算法有个通用的名字,但是我太累了记不住了):

您希望将每个字符串转换为 3D 点 node(r, g, b) (您可以缩放这些值以使其适合您的范围),从而最大限度地减少以下错误:

Error = \sum_i{\sum_j{(dist(node_i, node_j) - dist(str_i, str_j))^2}}

你可以这样做:

  1. 首先为每个字符串分配一个随机颜色(r,g,b)
  2. 重复直到你认为合适(例如,误差调整小于 \epsilon = 0.0001):
    1. 选择一个随机节点
    2. 调整它的位置 (r, g, b) 使误差最小化
  3. 缩放坐标系,使每个节点坐标在 [0., 1.) 或 [0, 256] 范围内
于 2009-10-25T21:30:24.660 回答
0

I would maybe define some delta between two strings. I don't know what you define as the difference (or "unequality") of two strings, but the most obvious thing I could think about would be string length and the number of occurences of particular letters (and their index in the string). It should not be tricky to implement it such that it returns the same color code in equal strings (if you do an equal first, and return before further comparison).

When it comes to the actual RGB value, I would try to convert the string data into 4 bytes (RGBA), or 3 bytes if you only use the RGB. I don't know if every string would fit into them (as that may be language specific?).

于 2009-01-30T14:36:15.163 回答
0

对不起,但你不能用 levenshtein distance 或类似的方法做你正在寻找的东西。RGB 和 HSV 是 3 维几何空间,但 levenshtein 距离描述了一个度量空间 - 一组更宽松的约束,没有固定的维数。无法将度量空间映射到固定数量的维度,同时始终保持局部性。

但是,就近似而言,对于单个术语,您可以使用诸如 soundex 或 metaphone 之类的算法的修改来选择颜色;例如,对于多个术语,您可以将 soundex 或 metaphone 分别应用于每个单词,然后将它们相加(使用溢出)。

于 2009-03-16T11:30:14.627 回答