algorithm - “绝对”字符串度量

Question

我有一组巨大（但有限）的自然语言字符串。

我需要一种将每个字符串转换为数值的方法。对于任何给定的字符串，每次的值都必须相同。

两个给定的字符串越“不同”，两个对应的值应该越不同。它们越“相似”，不同的值应该越少。

我还不知道我需要的字符串之间差异的确切定义。反正没有自然语言解析。它可能应该类似于 Levenstein（但 Levenstein 是相对的，我需要绝对度量）。让我们从简单的事情开始。

尺寸更新

我很乐意接受多维（最好是 3d）向量而不是单个数值。

更新预期结果的正确性

正如在此处和此处正确指出的那样，从一个字符串到另一个字符串的距离是一个具有MAX(firstStringLength, secondStringLength)维度的向量。一般来说，在不丢失一些信息的情况下减少维数是不可能的。

但是我不需要一个绝对的解决方案。我会满足于从 N 维字符串空间到我的 3D 空间的任何“足够好”的转换。

另请注意，我有有限数量的有限长度的字符串。（虽然字符串的数量相当大，大约 8000 万（10 GB），所以我最好选择一些单通道无状态算法。）

从扫描参考资料来看，我的印象是希尔伯特空间填充曲线可能对我有所帮助。看起来希尔伯特空间填充曲线的聚类特性分析文章讨论了一些接近我的问题的东西......

希尔伯特曲线方法的更新

我们将每个字符串映射到 N 维空间中的一个点，其中 N 是集合中字符串的最大长度。顺便说一句，字符串中的第 i 个字符代码可以用作这里的第 i 个坐标值吗？
我们通过该 N 维空间绘制一条希尔伯特曲线。
对于每个字符串，我们在曲线上取点，最接近字符串的坐标。该点的希尔伯特值（从曲线开始的长度）是我寻求的一维值。
如果我们需要 3D 值，我们在 3D 中绘制 Hilbert 曲线并选择匹配 Hilbert 值的点，如上所述。

这看起来对吗？这里的计算费用是多少？

score 5 · Accepted Answer

我认为这是不可能的。从一个简单的字符串开始，并将其分配为零（数字是什么并不重要）

“你好世界” = 0

以下字符串与它的距离为 2：

“XXllo 世界” = 一个
“HeXXo 世界” = b
“你好 XXrld” = c
“你好，WorXX” = d

然而，这些字符串中的每一个都彼此相距 4。对于以下示例，无法对数字进行排序以使其正常工作：

a = 1, b = -1, c = 2, d = -2

考虑 c 到 0 是 2，而 c 到 a 是 1，但 0 比 a 更接近。

这只是一个简单的案例。

score 3 · Accepted Answer

我认为您将不得不更清楚地说明您的问题，您究竟想用这个指标实现什么？

我这样说是因为 Levenstein 之所以有效，是因为它将字符串对映射到一个度量，这可以保留字符串空间的维度。如果您尝试将字符串映射到数字，会发生大量的维度信息丢失。例如，假设我有字符串“cat”，我希望“bat”、“hat”、“rat”、“can”、“cot”等都合理地接近这个。对于大量的单词，结果是您最终会得到不同的单词相对频繁地接近，例如“bat”和“cot”可能很接近，因为它们都恰好与正面的“cat”距离相似。这类似于当您尝试将平面映射到一条线时发生的问题，很难满足平面上的点远在直线上的限制。因此，这样做的结果是“两个给定的字符串越“不同”，两个对应的值应该越不同”的要求是困难的。

所以，我的第一个建议是，你真的需要这样做吗，一个简单的哈希码是否足以为你提供唯一值，或者你可以使用 Levenstein 并忽略单个字符串的值？如果这些都不够，也许您可以使用多维函数值，即将字符串映射成对、三元组或另一个小的数字元组。以这种方式授予的额外维度将为您提供更好的结果。

一个例子可能是将字符串编码为三元组：长度、字符串中字母值的总和、字母值的交替总和，例如 f("cat") = (3, 3 + 1 + 20, 3 - 1 + 20) = (3, 24, 22)。这将具有您想要的一些属性，但可能不是最佳的。尝试寻找字符串的正交特征来进行这种编码，或者甚至更好，如果您有大量的字符串测试集，那么现有的库可以将此类数据映射到低维同时保留指标（例如 Levenstein 指标），并且您可以训练你的功能。我记得 S 语言支持这种东西。

score 3 · Accepted Answer

所以在这里我希望展示根本问题和解决方案。

问题：寻找“足够好”的解决方案是正确的，因为不可能获得完美的解决方案（我可以在信息论中证明这一点，但我会选择几何，因为它更具可读性）。您有一个 N 维空间，因此无法在不丢失信息的情况下投影距离度量：

distance projected onto X: (x,y,z).(1,0,0) = x

但是，您可以使用考虑多个维度的向量，但最终会得到相距相同的元素：

(30,0,0).(1/3,1/3,1/3) = (0,30,0).(1/3,1/3,1/3) = (0,0,30).(1/3,1/3,1/3) = 10

所以现在解决方案：您希望做的最好的事情是使用主成分分析进行聚类，以找到您的字符串差异最大的三个维度。这又取决于您使用的距离度量的组成部分，并且不重要（即，我不想让这篇文章更长）。

对于一个快速的解决方案，我建议您使用下面描述的 3 个字符串的 Levenstein 距离快速尝试在 head 中的 PCA：

"acegikmoqsuwy" //use half your permitted symbols then repeat until you have a string of size equal to your longest string.
"bdfhjlnprtv" //use the other half then repeat as above.
"" //The empty string, this will just give you the length of the string, so a cheap one.

此外，如果您想深入了解，这对指标/距离可能会有所帮助： http ://www.springer.com/mathematics/geometry/book/978-3-642-00233-5

和 levenstein 距离演示：http ://www.merriampark.com/ld.htm

score 2 · Accepted Answer

我想扩展 FryGuy 的答案，为什么它不能在任何固定数量的维度上工作。让我们拿aaaaaaaaaa和baaaaaaaaa, abaaaaaaaa, ... , aaaaaaaaab. 在此示例中，字符串的长度为 10，但它们可以是任意长度。b10 个字符串中的每一个的距离aaaaaaaaaa是 1，它们之间的距离是 2。一般来说，如果你在 2 个字母的字母表上取长度为 N 的固定字符串，它们的距离图是一个 N 维超立方体。

除非您的字符串长度有限，否则您无法将其映射到固定数量的维度。

score 1 · Accepted Answer

测量与空字符串的编辑距离，但不是将每个编辑视为具有值“1”，而是将其在按使用频率排序的字母表中添加/删除的字母的索引（etaoinshrdlu ...），并且如果您的算法允许您将替换检测为替换而不是插入+删除对，则字母索引之间的差异。

score 1 · Accepted Answer

您还可以尝试研究潜在语义分析和向量空间模型，但问题是您必须限制最大字符串长度。

您的尺寸是字母表元素和字符串中位置的乘积。给定字母表（“a”、“b”、“c”、“t”）和最大长度为 3，尺寸为 (a:1, b:1, c:1, t:1, ... , a:3, b:3, c:3, t:3)

例如，"cat"变为 (0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1)。

这当然是一个巨大的数据集，但您可以使用降维技术（如SVD）来减少维数。这应该很好用，因为单词中有许多重复出现的模式。您可以调整输出维度的数量以满足您的需求。

两个词之间的相似度可以通过词向量之间的余弦相似度来计算。您还可以存储 SVD 的变换向量以获得单词的简化向量，甚至是以前看不见的向量。

score 0 · Accepted Answer

要克服“相对距离”问题，您需要做的就是取一个固定点进行测量。

您仍然可以使用 Levenstein 距离，但从固定的“Origin”字符串中获取。例如，您可以使用所有空格的任意长度字符串作为原始字符串。

无论如何，我会先用一小部分已知字符串对其进行测试，以查看这些值是否反映了您希望看到的内容。

score -1 · Accepted Answer

这是对这个问题的“头脑风暴”回答。

基本上，这计算了句子 2 与句子 1 的距离，作为与句子 1 的笛卡尔距离（假设在原点），其中距离是 2 个句子中单词之间的最小 Levenshtein 差异的总和。它具有 2 个相等的句子给出 0 距离的特性。

如果这种方法已在其他地方发表，我不知道。

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string str1 = "The cat sat on the mat";
            string str2 = "The quick brown fox jumped over the lazy cow";
            ReportDifference(str1, str1);
            ReportDifference(str2, str2);
            ReportDifference(str1, str2);
            ReportDifference(str2, str1);
        }
        /// <summary>
        /// Quick test andisplay routine
        /// </summary>
        /// <param name="str1">First sentence to test with</param>
        /// <param name="str2">Second sentence to test with</param>
        static void ReportDifference(string str1, string str2)
        {
            Debug.WriteLine(
                String.Format("difference between \"{0}\" and \"{1}\" is {2}", 
                str1, str2, Difference(str1, str2))); 
        }
        /// <summary>
        /// This does the hard work.
        /// Basically, what it does is:
        /// 1) Split the stings into tokens/words
        /// 2) Form a cartesian product of the 2 lists of words. 
        /// 3) Calculate the Levenshtein Distance between each word.
        /// 4) Group on the words from the first sentance
        /// 5) Get the min distance between the word in first sentence and all of the words from the second
        /// 6) Square the distances for each word. 
        ///     (based on the distance betwen 2 points is the sqrt of the sum of the x,y,... axises distances
        ///     what this assumes is the first word is the origin)
        /// 7) take the square root of sum
        /// </summary>
        /// <param name="str1">sentence 1 compare</param>
        /// <param name="str2">sentence 2 compare</param>
        /// <returns>distance calculated</returns>
        static double Difference(string str1, string str2)
        {
            string[] splitters = { " " };

            var a = Math.Sqrt(
                (from x in str1.Split(splitters, StringSplitOptions.RemoveEmptyEntries)
                     from y in str2.Split(splitters, StringSplitOptions.RemoveEmptyEntries)
                     select new {x, y, ld = Distance.LD(x,y)} )
                    .GroupBy(x => x.x)
                    .Select(q => new { q.Key, min_match = q.Min(p => p.ld) })
                    .Sum(s =>  (double)(s.min_match * s.min_match )));
            return a;
        }
    }

    /// <summary>
    /// Lifted from http://www.merriampark.com/ldcsharp.htm
    /// </summary>
    public class Distance
    {

        /// <summary>
        /// Compute Levenshtein distance
        /// </summary>
        /// <param name="s">String 1</param>
        /// <param name="t">String 2</param>
        /// <returns>Distance between the two strings.
        /// The larger the number, the bigger the difference.
        /// </returns>
        public static int LD(string s, string t)
        {
            int n = s.Length; //length of s
            int m = t.Length; //length of t
            int[,] d = new int[n + 1, m + 1]; // matrix
            int cost; // cost
            // Step 1
            if (n == 0) return m;
            if (m == 0) return n;
            // Step 2
            for (int i = 0; i <= n; d[i, 0] = i++) ;
            for (int j = 0; j <= m; d[0, j] = j++) ;
            // Step 3
            for (int i = 1; i <= n; i++)
            {
                //Step 4
                for (int j = 1; j <= m; j++)
                {
                    // Step 5
                    cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);
                    // Step 6
                    d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                              d[i - 1, j - 1] + cost);
                }
            }
            // Step 7
            return d[n, m];
        }
    }
}

algorithm - “绝对”字符串度量

尺寸更新

更新预期结果的正确性

希尔伯特曲线方法的更新

8 回答 8

Related

Reference