c# - Find duplicate columns and replace the matches with a count

Question

I have a tab-delimited file that contains duplicate header names:

[Column1] \t [Column2] \t [test] \t [test] \t [test] \t [test] \t [Column3] \t [Column4]

What I want to do is rename the duplicate [test] columns by appending an integer, so the header becomes something like

[Column1] \t [Column2] \t [test1] \t [test2] \t [test3] \t [test4] \t [Column3] \t [Column4]

So far I can isolate the first row and then count the matches I find:

string destinationUnformmatedFileName = @"C:\New\20130816_Opportunities_unFormatted.txt";
string destinationFormattedFileName = @"C:\New\20130816_Opportunities_Formatted.txt";
var unformattedFileStream = File.Open(destinationUnformmatedFileName, FileMode.Open, FileAccess.Read);  // Open (unformatted) file for reading
var formattedFileStream = File.Open(destinationFormattedFileName, FileMode.Create, FileAccess.Write);   // Create (formatted) file for writing

StreamReader sr = new StreamReader(unformattedFileStream);
StreamWriter sw = new StreamWriter(formattedFileStream);

int rowCounter = 0;
string currentRow;
// Read each row in the unformatted file
while ((currentRow = sr.ReadLine()) != null)
{
    // First row: check for duplicate names
    if (rowCounter == 0)
    {
        // Split the header row into an array of column names
        string delimiter = "\t";
        string[] fieldNames = currentRow.Split(delimiter.ToCharArray());

        foreach (string fieldName in fieldNames)
        {
            // fieldName must be followed by a tab for it to be a duplicate
            // original code - causing the issue
            //Regex rgx = new Regex("\\t(" + fieldName + ")\\t");
            // Edit - resolved the issue
            Regex rgx = new Regex("(?<=\\t|^)(" + fieldName + ")(\\t)+");

            // Count how many occurrences of fieldName are in currentRow
            int count = rgx.Matches(currentRow).Count;
            //MessageBox.Show("Match Count = " + count.ToString());

            // If we have a duplicate field name
            if (count > 1)
            {
                string newFieldName = "\t" + fieldName + count.ToString() + "\t";
                //MessageBox.Show(newFieldName);
                // Replace only the first remaining match
                currentRow = rgx.Replace(currentRow, newFieldName, 1);
            }
        }
    }
    rowCounter++;
}

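I think I'm on the right track, but I don't think the regular expression is working correctly?

Edit: I think I've figured out how to find the pattern, using: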

Regex rgx = new Regex("(?<=\\t|^)(" + fieldName + ")(\\t)+");

It's not a deal breaker, but the only remaining problem is that it labels the columns as:

[Column1] \t [Column2] \t [test4] \t [test3] \t [test2] \t [test] \t [Column3] \t [Column4]

instead of

[Column1] \t [Column2] \t [test1] \t [test2] \t [test3] \t [test4] \t [Column3] \t [Column4]

score 0 · Accepted Answer

Use the following:

 Regex rgx = new Regex("(?<=\\t|^)(" + fieldName + ")(\\t)+");

I solved this by using lookarounds, which I found here: http://www.regular-expressions.info/duplicatelines.html

I probably should have spent a few more minutes researching this before posting.
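The lookaround fixes the matching, but the reversed numbering reported in the question appears to come from the loop itself: count is recomputed from the partially renamed row on every pass (4, then 3, then 2), and each pass renames only the first remaining match. A minimal sketch that numbers duplicates left to right without a regex follows. It assumes the square brackets in the question are notation and not part of the file, that the header row is split on real tab characters, and that C# 7 or later is available (for `out int`). `HeaderDeduplicator` and `NumberDuplicates` are hypothetical names, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HeaderDeduplicator
{
    // Hypothetical helper: appends 1, 2, 3, ... to every header name
    // that occurs more than once; unique names are left unchanged.
    public static string NumberDuplicates(string headerRow, char delimiter = '\t')
    {
        string[] names = headerRow.Split(delimiter);

        // Total number of occurrences of each name in the row.
        Dictionary<string, int> totals = names
            .GroupBy(n => n)
            .ToDictionary(g => g.Key, g => g.Count());

        // Running count of occurrences seen so far, per name.
        var seen = new Dictionary<string, int>();

        for (int i = 0; i < names.Length; i++)
        {
            string name = names[i];
            if (totals[name] < 2) continue;   // not a duplicate

            seen.TryGetValue(name, out int n); // n is 0 on first sight
            seen[name] = n + 1;
            names[i] = name + (n + 1);
        }

        return string.Join(delimiter.ToString(), names);
    }

    public static void Main()
    {
        string row = "Column1\tColumn2\ttest\ttest\ttest\ttest\tColumn3\tColumn4";
        // Prints the row with test1, test2, test3, test4 in place of the four test columns.
        Console.WriteLine(NumberDuplicates(row));
    }
}
```

Because the counting happens once up front and the numbering walks the array in order, the result does not depend on how many matches remain in the row.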

score 0 · Accepted Answer

First, test your regular expression in RegExr. I think "\t" is a special character. Try "\\t". In your C# code, that would be "\\\\t".
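To make the escaping levels concrete, here is a small sketch of how C# string escaping and regex escaping combine. Which form is right depends on whether the file contains real tab characters or the two-character text `\t`, which the question does not state outright. `TabEscapeDemo` is a hypothetical name used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TabEscapeDemo
{
    public static void Main()
    {
        string realTab = "a\tb";   // contains an actual tab character
        string literal = @"a\tb";  // contains a backslash followed by the letter t

        // "\t": the C# compiler produces a tab character, and the regex matches it literally.
        Console.WriteLine(Regex.IsMatch(realTab, "\t"));     // True

        // "\\t" (same as @"\t"): the regex engine receives \t, which also matches a tab.
        Console.WriteLine(Regex.IsMatch(realTab, "\\t"));    // True

        // "\\\\t" (same as @"\\t"): matches a literal backslash followed by t,
        // so it does not match a real tab.
        Console.WriteLine(Regex.IsMatch(realTab, "\\\\t"));  // False
        Console.WriteLine(Regex.IsMatch(literal, "\\\\t"));  // True
    }
}
```

In other words, "\\\\t" is only needed when the input holds the literal text `\t`; for a genuinely tab-delimited file, "\t" or "\\t" is enough.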

score 0 · Accepted Answer

This is the best combination of Regex and LINQ:

var input = @"[Column1] \t [Column2] \t [test] \t [test] \t [test] \t [foo] \t [test] \t [Column3] \t [foo] \t [Column4]";
Regex reg = new Regex(@"(?<=\\t )[[](.+?)[]]");
string output = "";
int k = 0;           
foreach (var m in reg.Matches(input)
                     .OfType<Match>()
                     .Select((x,i)=>new {x,i})
                     .GroupBy(g=>g.x.Value)
                     .Where(g=>g.Count()>1)
                     .SelectMany(x=> x.Select((a,i)=>new {a,i=i+1}))
                     .OrderBy(x=>x.a.i)){
    output += input.Substring(k, m.a.x.Index - k) + m.a.x.Result("[${1}" + m.i + "]");
    k = m.a.x.Index + m.a.x.Length;
}
output += input.Substring(k);

Result: [Column1] \t [Column2] \t [test1] \t [test2] \t [test3] \t [foo1] \t [test4] \t [Column3] \t [foo2] \t [Column4]
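The same idea can also be written with `Regex.Replace` and a `MatchEvaluator`, which avoids the manual `Substring` bookkeeping. This is only a sketch of an alternative, not the answer's own method. It reuses the answer's pattern and, like the answer, assumes the input contains the literal text `\t` and literal square brackets. `BracketRenamer` and `Rename` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BracketRenamer
{
    public static string Rename(string input)
    {
        // Same pattern as the answer: a bracketed name preceded by the literal text "\t ".
        var reg = new Regex(@"(?<=\\t )[[](.+?)[]]");

        // Collect the names that occur more than once among the matched headers.
        var duplicated = new HashSet<string>(
            reg.Matches(input).Cast<Match>()
               .GroupBy(m => m.Groups[1].Value)
               .Where(g => g.Count() > 1)
               .Select(g => g.Key));

        // Running count per duplicated name; the evaluator runs on matches left to right.
        var seen = new Dictionary<string, int>();
        return reg.Replace(input, m =>
        {
            string name = m.Groups[1].Value;
            if (!duplicated.Contains(name)) return m.Value; // leave unique names alone

            seen.TryGetValue(name, out int n);
            seen[name] = n + 1;
            return "[" + name + (n + 1) + "]";
        });
    }

    public static void Main()
    {
        var input = @"[Column1] \t [Column2] \t [test] \t [test] \t [foo] \t [test] \t [foo]";
        Console.WriteLine(Rename(input));
    }
}
```

As in the answer, the first header is never matched because nothing precedes it, so a duplicated name in the first position would need a pattern such as `(?<=^|\\t )`.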
