
After reading many blogs and articles, I have arrived at the following code for searching for a string in all files inside a folder. It works fine in my tests.

QUESTIONS

  1. Is there a faster approach for this (using C#)?
  2. Is there any scenario that will fail with this code?

Note: I tested only with very small files, and with only a few of them.

CODE

static void Main()
{
    string sourceFolder = @"C:\Test";
    string searchWord = ".class1";

    List<string> allFiles = new List<string>();
    AddFileNamesToList(sourceFolder, allFiles);
    foreach (string fileName in allFiles)
    {
        string contents = File.ReadAllText(fileName);
        if (contents.Contains(searchWord))
        {
            Console.WriteLine(fileName);
        }
    }

    Console.WriteLine(" ");
    Console.ReadKey();
}

public static void AddFileNamesToList(string sourceDir, List<string> allFiles)
{
    string[] fileEntries = Directory.GetFiles(sourceDir);
    foreach (string fileName in fileEntries)
    {
        allFiles.Add(fileName);
    }

    // Recursion
    string[] subdirectoryEntries = Directory.GetDirectories(sourceDir);
    foreach (string item in subdirectoryEntries)
    {
        // Avoid "reparse points"
        if ((File.GetAttributes(item) & FileAttributes.ReparsePoint) != FileAttributes.ReparsePoint)
        {
            AddFileNamesToList(item, allFiles);
        }
    }
}

REFERENCE

  1. Using StreamReader to check if a file contains a string
  2. Splitting a String with two criteria
  3. C# detect folder junctions in a path
  4. Detect Symbolic Links, Junction Points, Mount Points and Hard Links
  5. FolderBrowserDialog SelectedPath with reparse points
  6. C# - High Quality Byte Array Conversion of Images

5 Answers


Instead of File.ReadAllText(), better to use

File.ReadLines(@"C:\file.txt");

It returns a lazily yielded IEnumerable<string>, so you will not have to read the whole file if your string is found before the last line of the text file is reached.
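Applied to the loop in the question, that might look like the sketch below (the folder and search word are the question's placeholder values). One caveat: a per-line search cannot match a string that spans a line break, which Contains on the whole file text could.

```csharp
using System;
using System.IO;
using System.Linq;

class LazySearch
{
    static void Main()
    {
        string searchWord = ".class1"; // placeholder from the question

        foreach (string fileName in Directory.EnumerateFiles(@"C:\Test"))
        {
            // ReadLines is lazy: Any() stops reading the file as soon
            // as a line containing the search word is found.
            if (File.ReadLines(fileName).Any(line => line.Contains(searchWord)))
            {
                Console.WriteLine(fileName);
            }
        }
    }
}
```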

Answered 2012-12-21T16:20:38

I wrote something very similar; here are a couple of changes I would recommend.

  1. Use Directory.EnumerateFiles/EnumerateDirectories instead of GetFiles/GetDirectories. They return an IEnumerable immediately, so you don't have to wait for the entire directory tree to be read before processing starts.
  2. Use File.ReadLines instead of ReadAllText. It loads only one line into memory at a time, which matters a lot if you hit a large file.
  3. If you are using a new enough version of .NET, use Parallel.ForEach; this lets you search multiple files at once.
  4. You may not be able to open some files: either check for read permission, or declare in the manifest that your program requires administrative privileges (you should still check, though).
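Put together, the four suggestions might look roughly like this sketch (the folder path and search word are placeholders; the exception handlers cover the common cases of unreadable files):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class ParallelSearch
{
    static void Main()
    {
        string folder = @"C:\Test";    // placeholder
        string searchWord = ".class1"; // placeholder

        // EnumerateFiles streams results as they are found;
        // Parallel.ForEach searches several files concurrently.
        Parallel.ForEach(
            Directory.EnumerateFiles(folder, "*", SearchOption.AllDirectories),
            file =>
            {
                try
                {
                    // ReadLines loads one line at a time and stops early on a match.
                    if (File.ReadLines(file).Any(line => line.Contains(searchWord)))
                        Console.WriteLine(file);
                }
                catch (UnauthorizedAccessException)
                {
                    // No read permission on this file; skip it.
                }
                catch (IOException)
                {
                    // File locked or otherwise unreadable; skip it.
                }
            });
    }
}
```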

I was creating a binary search tool; here are some snippets of what I wrote, to give you a hand.

private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    Parallel.ForEach(Directory.EnumerateFiles(_folder, _filter, SearchOption.AllDirectories), Search);
}

//_array contains the binary pattern I am searching for.
private void Search(string filePath)
{
    if (Contains(filePath, _array))
    {
        //filePath points at a match.
    }
}

private static bool Contains(string path, byte[] search)
{
    //I am doing ReadAllBytes because this is a binary search, not a text search:
    //  there are no "lines" to separate on.
    var file = File.ReadAllBytes(path);
    //Note the +1 so the final alignment of the pattern is also checked.
    var result = Parallel.For(0, file.Length - search.Length + 1, (i, loopState) =>
        {
            if (file[i] == search[0])
            {
                byte[] localCache = new byte[search.Length];
                Array.Copy(file, i, localCache, 0, search.Length);
                if (Enumerable.SequenceEqual(localCache, search))
                    loopState.Stop();
            }
        });
    return result.IsCompleted == false;
}

This uses two nested parallel loops. The design is terribly inefficient and could be greatly improved by using the Boyer-Moore search algorithm, but I could not find a binary implementation, and I did not have the time when I originally wrote this to implement it myself.

Answered 2012-12-21T16:36:45

The main problem here is that you are searching all the files in real time for every search. There is also the possibility of file-access conflicts if two or more users search at the same time.

To dramatically improve performance, I would index the files ahead of time, and again as they are edited/saved. Store the index using something like Lucene.NET, then query the index (again using Lucene.NET) and return the file names to the user, so the user never queries the files directly.

If you follow the links in this SO post, you may get a head start on implementing the indexing. I didn't follow the links myself, but it's worth a look.

Just a heads-up: this is a significant shift from your current approach and will require

  1. a service to monitor and index the files, and
  2. a UI project to query the index.
Answered 2012-12-21T16:30:03

I think your code will fail with an exception if you lack permission to open a file.

Compare it with the code here: http://bgrep.codeplex.com/releases/view/36186

The latter code supports

  1. regular expression search and
  2. filters for file extensions

-- things you should probably consider.

Answered 2012-12-21T16:22:56
  1. Instead of Contains, better to use the Boyer-Moore search algorithm.

  2. Failure scenario: a file for which you have no read permission.
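For reference, a minimal Boyer-Moore-Horspool variant (the bad-character rule only, which is the simplest member of the Boyer-Moore family) might be sketched like this; the class and method names are made up for illustration:

```csharp
using System;
using System.Collections.Generic;

static class Horspool
{
    // Returns the first index of 'pattern' in 'text', or -1 if absent.
    public static int IndexOf(string text, string pattern)
    {
        if (pattern.Length == 0) return 0;
        if (pattern.Length > text.Length) return -1;

        // Bad-character table: how far to shift when the character
        // aligned with the pattern's last position mismatches.
        var shift = new Dictionary<char, int>();
        for (int i = 0; i < pattern.Length - 1; i++)
            shift[pattern[i]] = pattern.Length - 1 - i;

        int pos = 0;
        while (pos <= text.Length - pattern.Length)
        {
            // Compare right to left at the current alignment.
            int j = pattern.Length - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos; // full match

            // Shift by the table entry for the text character under
            // the pattern's last position (whole length if unseen).
            char last = text[pos + pattern.Length - 1];
            pos += shift.TryGetValue(last, out int s) ? s : pattern.Length;
        }
        return -1;
    }
}
```

On typical text this skips several characters per alignment instead of advancing one at a time, which is where the speedup over a naive Contains-style scan comes from.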

Answered 2012-12-21T16:34:47