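How can I efficiently extract 1 million substrings from a string of more than 3 million characters in C#?

I have written a program that reads random DNA reads of length 100 (substrings taken from random positions) from a string of 3 million characters. There are 1 million such reads. Currently I run a while loop 1 million times, each time reading a 100-character substring from the 3-million-character string. This takes a long time. What can I do to make it faster?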

Here is my code. len is the length of the original string, 3 million in this case; it can be as low as 50, which is why the while loop checks it.

while (i < 1000000 && len - 100 > 0) // len is 3000000
{
    int randomPos = _random.Next() % (len - ReadLength);
    readString += all.Substring(randomPos, ReadLength) + Environment.NewLine;
    i++;
}

4 Answers

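Assembling the string with a StringBuilder gives you about a 600-fold increase in throughput, because it avoids creating a new string object every time you append.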

Before the loop (initializing the capacity avoids re-creating the backing array inside the StringBuilder):

StringBuilder sb = new StringBuilder(1000000 * ReadLength);

In the loop:

sb.Append(all.Substring(randomPos, ReadLength) + Environment.NewLine);

After the loop:

readString = sb.ToString();
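
Put together, the three fragments above might look like the following minimal sketch. The names `ReadSampler` and `SampleReads` and the method's parameters are illustrative assumptions, not part of the original code. The capacity here also reserves room for the line separators.

```csharp
using System;
using System.Text;

public static class ReadSampler
{
    // Hypothetical helper: draws `count` random reads of `readLength` chars
    // from `all` and returns them as one newline-separated string.
    public static string SampleReads(string all, int count, int readLength, Random random)
    {
        int maxStart = all.Length - readLength;
        if (maxStart <= 0)
        {
            return string.Empty; // input too short to sample from
        }

        // Preallocate so the builder does not have to grow its buffer.
        var sb = new StringBuilder(count * (readLength + Environment.NewLine.Length));
        for (int i = 0; i < count; i++)
        {
            int randomPos = random.Next(maxStart); // 0 <= randomPos < maxStart
            sb.Append(all.Substring(randomPos, readLength));
            sb.Append(Environment.NewLine);
        }
        return sb.ToString();
    }
}
```

A call such as `ReadSampler.SampleReads(all, 1000000, ReadLength, _random)` would then replace the loop from the question.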

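Extracting the values from a char array instead of a string gives another 30% improvement, because you avoid the object creation that happens on every call to Substring():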

Before the loop:

char[] chars = all.ToCharArray();

In the loop:

sb.Append(chars, randomPos, ReadLength);
sb.AppendLine();
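
As a variation on the same idea, `StringBuilder.Append(string, int, int)` copies a range of a string directly, so the up-front `ToCharArray()` copy may not be needed. The sketch below swaps in that overload; the names `RangeAppendSampler` and `SampleReads` are illustrative assumptions, and it writes a single `'\n'` per read rather than `Environment.NewLine`.

```csharp
using System;
using System.Text;

public static class RangeAppendSampler
{
    // Hypothetical helper: same sampling as before, but without creating
    // a temporary string (or a char[] copy of the input) per read.
    public static string SampleReads(string all, int count, int readLength, Random random)
    {
        int maxStart = all.Length - readLength;
        if (maxStart <= 0)
        {
            return string.Empty; // input too short to sample from
        }

        var sb = new StringBuilder(count * (readLength + 1));
        for (int i = 0; i < count; i++)
        {
            int randomPos = random.Next(maxStart);
            sb.Append(all, randomPos, readLength); // copies the chars straight from the string
            sb.Append('\n');
        }
        return sb.ToString();
    }
}
```

Whether this beats the char array version would need to be measured on the target runtime.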

Edit (final version, which does not use StringBuilder and runs in 300 ms):

char[] chars = all.ToCharArray();
var iterations = 1000000;
// One extra char per read for the line separator.
char[] results = new char[iterations * (ReadLength + 1)];
GetRandomStrings(len, iterations, ReadLength, chars, results, 0);
string s = new string(results);

private static void GetRandomStrings(int len, int iterations, int ReadLength, char[] chars, char[] result, int resultIndex)
{
    Random random = new Random();
    int i = 0, index = resultIndex;
    while (i < iterations && len - ReadLength > 0) // len is 3000000
    {
        var i1 = len - ReadLength;
        int randomPos = random.Next() % i1;

        // Copy the read straight into the output buffer, with no intermediate string.
        Array.Copy(chars, randomPos, result, index, ReadLength);
        index += ReadLength;
        // Only one char is reserved per separator, so write '\n'
        // (Environment.NewLine[0] would be a lone '\r' on Windows).
        result[index] = '\n';
        index++;

        i++;
    }
}
answered 2012-03-21T09:56:54.297

I think better solutions will come, but a .NET StringBuilder is faster than repeated String concatenation because it appends into a mutable buffer instead of allocating a new string each time.

You can also split the data into pieces and use the .NET Task Parallel Library for multithreading and parallelism.

Edit: Compute loop-invariant values once, outside the loop, to avoid recalculating them:

int x = len - 100;
int y = len - ReadLength;

Use:

StringBuilder readString= new StringBuilder(ReadLength * numberOfSubStrings);
readString.AppendLine(all.Substring(randomPos, ReadLength));

For parallelism you should split your input into pieces, run these operations on the pieces in separate threads, and then combine the results.

Important: In my past experience these operations ran faster on .NET v2.0 than on v4.0, so consider changing your project's target framework version. The Task Parallel Library is not available on .NET v2.0, though, so there you would have to do the multithreading the old-school way, like:

Thread newThread ......
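
One way to read the parallel suggestion, for runtimes where the Task Parallel Library is available: instead of splitting the input, split the output, so that each iteration writes into its own slice of a preallocated buffer and no combining step is needed. The sketch below is written under that assumption and is not the answerer's code; `ParallelReadSampler` and `SampleReads` are illustrative names.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelReadSampler
{
    // Hypothetical helper: fills one shared char[] from several threads.
    public static string SampleReads(string all, int count, int readLength)
    {
        int maxStart = all.Length - readLength;
        if (maxStart <= 0)
        {
            return string.Empty; // input too short to sample from
        }

        int stride = readLength + 1; // one read plus a '\n' separator
        var result = new char[count * stride];
        int seed = Environment.TickCount;

        Parallel.For(
            0, count,
            // Random is not thread-safe, so each worker gets its own
            // instance with a distinct seed.
            () => new Random(Interlocked.Increment(ref seed)),
            (i, loopState, rng) =>
            {
                int offset = i * stride; // each iteration owns a disjoint slice
                all.CopyTo(rng.Next(maxStart), result, offset, readLength);
                result[offset + readLength] = '\n';
                return rng;
            },
            rng => { });

        return new string(result);
    }
}
```

Because the work per read is a small memory copy, the speedup from extra threads may be modest; it is worth timing against the single-threaded version.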
answered 2012-03-21T09:37:28.940

How long is "a long time"? It shouldn't take that long.

var file = new StreamReader(@"E:\Temp\temp.txt");
var s = file.ReadToEnd();
var r = new Random();
var sw = new Stopwatch();
sw.Start();
var range = Enumerable.Range(0,1000000);
var results = range.Select( i => s.Substring(r.Next(s.Length - 100),100)).ToList();
sw.Stop();
sw.ElapsedMilliseconds.Dump();
s.Length.Dump();

On my machine this takes 807 ms for a string of 4,055,442 characters.

Edit: I just noticed that you want a single string as the result, so my solution above simply changes to:

var results = string.Join(Environment.NewLine,range.Select( i => s.Substring(r.Next(s.Length - 100),100)).ToArray());

That adds about 100 ms, so the total is still under one second.
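
`Dump()` is a LINQPad extension method, so the snippets above will not compile as-is in a regular project. A comparable measurement outside LINQPad could be sketched as follows; `SubstringTiming` and `JoinRandomReads` are illustrative names, and the sketch assumes `s` is longer than `readLength`.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class SubstringTiming
{
    // Hypothetical helper: builds the joined result and reports how long it took.
    // Assumes s.Length > readLength.
    public static string JoinRandomReads(string s, int count, int readLength, out long elapsedMs)
    {
        var r = new Random();
        var sw = Stopwatch.StartNew();
        string joined = string.Join(
            Environment.NewLine,
            Enumerable.Range(0, count)
                      .Select(i => s.Substring(r.Next(s.Length - readLength), readLength))
                      .ToArray());
        sw.Stop();
        elapsedMs = sw.ElapsedMilliseconds;
        return joined;
    }
}
```

The elapsed time can then be printed with `Console.WriteLine(elapsedMs)` in place of `Dump()`.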

answered 2012-03-21T10:10:57.707

Edit: I dropped the idea of using memcpy, and I think the result is great. I split a 3-million-character string into 30,000 strings of length 100 in 43 ms.

private static unsafe string[] Scan(string hugeString, int subStringSize)
{
    var results = new string[hugeString.Length / subStringSize];

    // Pin the string so the GC cannot move it while we hold a raw pointer.
    var gcHandle = GCHandle.Alloc(hugeString, GCHandleType.Pinned);
    try
    {
        var currAddress = (char*)gcHandle.AddrOfPinnedObject();

        for (var i = 0; i < results.Length; i++)
        {
            results[i] = new string(currAddress, 0, subStringSize);
            currAddress += subStringSize;
        }
    }
    finally
    {
        gcHandle.Free(); // release the pin so the string can be moved and collected again
    }

    return results;
}

To use this method for the case shown in the question:

const int size = 3000000;
const int subSize = 100;

var stringBuilder = new StringBuilder(size);
var random = new Random();

for (var i = 0; i < size; i++)
{
    stringBuilder.Append((char)random.Next(30, 80));
}

var hugeString = stringBuilder.ToString();

var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 1000; i++)
{
    var strings = Scan(hugeString, subSize);
}
stopwatch.Stop();

Console.WriteLine(stopwatch.ElapsedMilliseconds / 1000); // average ms per Scan call over 1000 runs: 43
answered 2012-03-21T10:00:35.843