c# - Check whether two strings share the same pattern of repeated characters

Question

Is there an efficient regex to assert that two strings share the same pattern of repeated characters?

("tree", "loaa") => true
("matter", "essare") => false
("paper", "mime") => false
("acquaintance", "mlswmodqmdlp") => true
("tree", "aoaa") => false

Even if it is not via regex, I'm looking for the most efficient way to perform the task.

score 12 · Accepted Answer

The easiest way is probably to walk through both strings manually at the same time and build up a dictionary (that matches corresponding characters) while you are doing it:

if(input1.Length != input2.Length)
    return false;
var characterMap = new Dictionary<char, char>();
for(int i = 0; i < input1.Length; i++)
{
    char char1 = input1[i];
    char char2 = input2[i];
    if(!characterMap.ContainsKey(char1))
    {
        if (characterMap.ContainsValue(char2))
            return false;
        characterMap[char1] = char2;
    }
    else
    {
        if(char2 != characterMap[char1])
            return false;
    }
}
return true;

In the same manner you could construct a regex. This is certainly not more efficient for a single comparison, but it might be useful if you want to check one repetition pattern against multiple strings in the future. This time we associate characters with their back-references.

var characterMap = new Dictionary<char, int>();
string regex = "^";
int nextBackreference = 1;
for(int i = 0; i < input.Length; i++)
{
    char character = input[i];
    if(!characterMap.ContainsKey(character))
    {
        regex += "(.)";
        characterMap[character] = nextBackreference;
        nextBackreference++;
    }
    else
    {
        regex += (@"\" + characterMap[character]);
    }
}
regex += "$";

For matter it will generate this regex: ^(.)(.)(.)\3(.)(.)$. For acquaintance this one: ^(.)(.)(.)(.)\1(.)(.)(.)\1\6\2(.)$. You could of course optimize this regular expression a bit afterwards (e.g. for the second one ^(.)(.)..\1.(.).\1\3\2.$), but in any case, this would give you a reusable regex that checks against this one specific repetition pattern.

EDIT: Note that the given regex solution has a caveat. It allows mapping of multiple characters in the input string onto a single character in the test strings (which would contradict your last example). To get a correct regex solution, you would have to go a step further to disallow characters already matched. So acquaintance would have to generate this awful regular expression:

^(.)(?!\1)(.)(?!\1|\2)(.)(?!\1|\2|\3)(.)\1(?!\1|\2|\3|\4)(.)(?!\1|\2|\3|\4|\5)(.)(?!\1|\2|\3|\4|\5|\6)(.)\1\6\2(?!\1|\2|\3|\4|\5|\6|\7)(.)$

And I cannot think of an easier way, since you cannot use backreferences in (negated) character classes. So maybe, if you do want to assert this as well, regular expressions are not the best option in the end.
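As a rough illustration of the construction described above, here is a minimal sketch, written in Python rather than C# for brevity. The function name build_strict_regex is made up for this example. It assumes the template has fewer than 100 distinct characters, since Python's \number backreferences only cover groups 1 to 99.

```python
import re

def build_strict_regex(template):
    """Build a regex string for the repetition pattern of `template`.

    Hypothetical helper: every new character gets its own capture group,
    preceded by a negative lookahead that rules out all earlier captures.
    """
    groups = {}            # character -> capture group number
    parts = ["^"]
    for ch in template:
        if ch in groups:
            # Repeated character: must equal the earlier capture.
            parts.append("\\" + str(groups[ch]))
        else:
            if groups:
                # New character: must differ from every earlier capture.
                alternatives = "|".join("\\" + str(n) for n in groups.values())
                parts.append("(?!" + alternatives + ")")
            groups[ch] = len(groups) + 1
            parts.append("(.)")
    parts.append("$")
    return "".join(parts)

# Compile once, then reuse against several candidate strings.
pattern = re.compile(build_strict_regex("tree"))
print(bool(pattern.search("loaa")))  # True
print(bool(pattern.search("aoaa")))  # False: 'a' would stand for both 't' and 'e'
```

Note that in Python `$` also matches just before a trailing newline, so `re.fullmatch` on an unanchored pattern may be preferable if the inputs can end in one.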

Disclaimer: I am not really a .NET guru, so this might not be the best practice for walking through arrays or building up a dictionary or string. But I hope you can use it as a starting point.

score 1 · Accepted Answer

Just because I like LINQ :)

void Main()
{
    Console.WriteLine(Map("tree") == Map("loaa"));
    Console.WriteLine(Map("matter") == Map("essare"));
    Console.WriteLine(Map("paper") == Map("mime"));
    Console.WriteLine(Map("acquaintance") == Map("mlswmodqmdlp"));
    Console.WriteLine(Map("tree") == Map("aoaa"));  
}

public string Map(string input)
{
    var seen = new Dictionary<char,int>();
    var index = 0;
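    // Join with a separator so multi-digit indices stay unambiguous
    // (without one, the sequences 10,1,0 and 1,0,10 both render as "1010").
    return string.Join(
      ",", 
      input.Select(c =>seen.ContainsKey(c) ? seen[c] : seen[c] = index++));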
}

score 1 · Accepted Answer

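I don't know how to do this with a regex, but in code I would walk through both strings one character at a time, comparing as I go and building up a list of comparisons: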

t = l
r = o
e = a
etc.

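Before adding each comparison, I would check whether the character from the first string already exists in the list. If the corresponding character in the second string does not match the one in the comparison list, the string patterns do not match.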
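One way to read the description above, as a minimal Python sketch. The function name same_pattern and the second, reverse lookup table are assumptions added here: the description only mentions checking characters of the first string, and without the reverse check the pair ("tree", "aoaa") would be accepted.

```python
def same_pattern(s1, s2):
    """Return True if s1 and s2 share the same pattern of repeated characters."""
    if len(s1) != len(s2):
        return False
    forward = {}   # character in s1 -> character in s2 (the "comparison list")
    backward = {}  # character in s2 -> character in s1 (assumed extra check)
    for c1, c2 in zip(s1, s2):
        # setdefault records the pairing on first sight and returns the
        # existing pairing otherwise, so a mismatch means the patterns differ.
        if forward.setdefault(c1, c2) != c2:
            return False
        if backward.setdefault(c2, c1) != c1:
            return False
    return True

print(same_pattern("tree", "loaa"))   # True
print(same_pattern("tree", "aoaa"))   # False
```

Using two dictionaries keeps each lookup at constant time on average, at the cost of storing every pairing twice.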

score 1 · Accepted Answer

Edit: the accepted code is now correct. This one will stay here as an alternative (worse in almost every sense).

    private static List<int> FindIndices(string str, char c, int ind)
    {
        var retval = new List<int>();
        int last = ind, temp;
        while (0 < (temp = str.IndexOf(c, last)))
        {
            retval.Add(temp);
            last = temp + 1;
        }           
        return retval;
    }

    public static int[] CanonicalForm(string s)
    {
        string table = String.Empty;
        var res = new int[s.Length];
        int marker = 0;
        int lastInd;

        for(int i=0; i < s.Length-1; ++i)
        {
            if (table.Contains(s[i]))
                continue;

            table += s[i];              
            lastInd = i+1;

            if (s.IndexOf(s[i], lastInd) > 0)
                res[i] = ++marker;
            else
                continue;

            foreach (var v in FindIndices(s, s[i], lastInd))
                res[v] = marker;
        }
        return res;
    }

And the comparison:

    public static bool ComparePatterns(string s1, string s2)
    {
        return ( (s1.Length == s2.Length) && CanonicalForm(s1).SequenceEqual(CanonicalForm(s2)) );
    }

So the key is to build a canonical form that can be compared later. It is not especially clever, but it does give the correct result.

score 0 · Accepted Answer

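I just ran into the same problem and wrote a piece of Python code for it. It is fairly simple and needs no extra modules. The basic idea is to use the relationship between ASCII characters and their numeric values to translate each of the two given strings into a new pattern string, and then compare the two pattern strings.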

def SamePattern(s1, s2):
  i = j = 97  # ord('a'): next unused pattern letter for each string
  p1 = p2 = ""

  for index1, l1 in enumerate(s1):
    if l1 not in s1[0:index1]:
      # First occurrence: assign the next unused pattern letter.
      p1 += chr(i)
      i += 1
    else:
      # Repeat: reuse the letter assigned at the first occurrence.
      p1 += p1[s1.index(l1)]

  for index2, l2 in enumerate(s2):
    if l2 not in s2[0:index2]:
      p2 += chr(j)
      j += 1
    else:
      p2 += p2[s2.index(l2)]

  return p1 == p2
