
After reading many blogs and articles, I have arrived at the following code for searching for a string in all files inside a folder. It works fine in my tests.

QUESTIONS

  1. Is there a faster approach for this (using C#)?
  2. Is there any scenario that will fail with this code?

Note: I tested only with very small files, and with only a few of them.

CODE

static void Main()
{
    string sourceFolder = @"C:\Test";
    string searchWord = ".class1";

    List<string> allFiles = new List<string>();
    AddFileNamesToList(sourceFolder, allFiles);
    foreach (string fileName in allFiles)
    {
        string contents = File.ReadAllText(fileName);
        if (contents.Contains(searchWord))
        {
            Console.WriteLine(fileName);
        }
    }

    Console.WriteLine(" ");
    Console.ReadKey();
}

public static void AddFileNamesToList(string sourceDir, List<string> allFiles)
{
    string[] fileEntries = Directory.GetFiles(sourceDir);
    foreach (string fileName in fileEntries)
    {
        allFiles.Add(fileName);
    }

    // Recursion
    string[] subdirectoryEntries = Directory.GetDirectories(sourceDir);
    foreach (string item in subdirectoryEntries)
    {
        // Avoid "reparse points"
        if ((File.GetAttributes(item) & FileAttributes.ReparsePoint) != FileAttributes.ReparsePoint)
        {
            AddFileNamesToList(item, allFiles);
        }
    }
}

REFERENCE

  1. Using StreamReader to check if a file contains a string
  2. Splitting a String with two criteria
  3. C# detect folder junctions in a path
  4. Detect Symbolic Links, Junction Points, Mount Points and Hard Links
  5. FolderBrowserDialog SelectedPath with reparse points
  6. C# - High Quality Byte Array Conversion of Images

5 Answers


Instead of File.ReadAllText(), better to use

File.ReadLines(@"C:\file.txt");

It returns a lazily yielded IEnumerable<string>, so you will not have to read the whole file if your string is found before the last line of the text file is reached.
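Applied to the loop in the question, that might look like the sketch below (the folder and search word are the question's placeholder values). One caveat: a per-line search cannot match a string that spans a line break, which Contains on the whole file text could.

```csharp
using System;
using System.IO;
using System.Linq;

class LazySearch
{
    static void Main()
    {
        string searchWord = ".class1"; // placeholder from the question

        foreach (string fileName in Directory.EnumerateFiles(@"C:\Test"))
        {
            // ReadLines is lazy: Any() stops reading the file as soon
            // as a line containing the search word is found.
            if (File.ReadLines(fileName).Any(line => line.Contains(searchWord)))
            {
                Console.WriteLine(fileName);
            }
        }
    }
}
```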

Answered 2012-12-21T16:20:38

I wrote something very similar; here are a couple of changes I would recommend.

  1. Use Directory.EnumerateFiles/EnumerateDirectories instead of GetFiles/GetDirectories. They return an IEnumerable immediately, so you don't have to wait for the entire directory tree to be read before processing starts.
  2. Use File.ReadLines instead of ReadAllText. It loads only one line into memory at a time, which matters a lot if you hit a large file.
  3. If you are using a new enough version of .NET, use Parallel.ForEach; this lets you search multiple files at once.
  4. You may not be able to open some files: either check for read permission, or declare in the manifest that your program requires administrative privileges (you should still check, though).
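Put together, the four suggestions might look roughly like this sketch (the folder path and search word are placeholders; the exception handlers cover the common cases of unreadable files):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class ParallelSearch
{
    static void Main()
    {
        string folder = @"C:\Test";    // placeholder
        string searchWord = ".class1"; // placeholder

        // EnumerateFiles streams results as they are found;
        // Parallel.ForEach searches several files concurrently.
        Parallel.ForEach(
            Directory.EnumerateFiles(folder, "*", SearchOption.AllDirectories),
            file =>
            {
                try
                {
                    // ReadLines loads one line at a time and stops early on a match.
                    if (File.ReadLines(file).Any(line => line.Contains(searchWord)))
                        Console.WriteLine(file);
                }
                catch (UnauthorizedAccessException)
                {
                    // No read permission on this file; skip it.
                }
                catch (IOException)
                {
                    // File locked or otherwise unreadable; skip it.
                }
            });
    }
}
```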

I was creating a binary search tool; here are some snippets of what I wrote, to give you a hand.

private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    Parallel.ForEach(Directory.EnumerateFiles(_folder, _filter, SearchOption.AllDirectories), Search);
}

//_array contains the binary pattern I am searching for.
private void Search(string filePath)
{
    if (Contains(filePath, _array))
    {
        //filePath points at a match.
    }
}

private static bool Contains(string path, byte[] search)
{
    //I am doing ReadAllBytes because this is a binary search, not a text search:
    //  there are no "lines" to separate on.
    var file = File.ReadAllBytes(path);
    //Note the +1 so the final alignment of the pattern is also checked.
    var result = Parallel.For(0, file.Length - search.Length + 1, (i, loopState) =>
        {
            if (file[i] == search[0])
            {
                byte[] localCache = new byte[search.Length];
                Array.Copy(file, i, localCache, 0, search.Length);
                if (Enumerable.SequenceEqual(localCache, search))
                    loopState.Stop();
            }
        });
    return result.IsCompleted == false;
}

This uses two nested parallel loops. The design is terribly inefficient and could be greatly improved by using the Boyer-Moore search algorithm, but I could not find a binary implementation, and I did not have the time when I originally wrote this to implement it myself.

Answered 2012-12-21T16:36:45

The main problem here is that you are searching all the files in real time for every search. There is also the possibility of file-access conflicts if two or more users search at the same time.

To dramatically improve performance, I would index the files ahead of time, and again as they are edited/saved. Store the index using something like Lucene.NET, then query the index (again using Lucene.NET) and return the file names to the user, so the user never queries the files directly.

If you follow the links in this SO post, you may get a head start on implementing the indexing. I didn't follow the links myself, but it's worth a look.

Just a heads-up: this is a significant shift from your current approach and will require

  1. a service to monitor and index the files, and
  2. a UI project to query the index.
Answered 2012-12-21T16:30:03

I think your code will fail with an exception if you lack permission to open a file.

Compare it with the code here: http://bgrep.codeplex.com/releases/view/36186

The latter code supports

  1. regular expression search and
  2. filters for file extensions

-- things you should probably consider.

Answered 2012-12-21T16:22:56
  1. Instead of Contains, better to use the Boyer-Moore search algorithm.

  2. Failure scenario: a file for which you have no read permission.
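For reference, a minimal Boyer-Moore-Horspool variant (the bad-character rule only, which is the simplest member of the Boyer-Moore family) might be sketched like this; the class and method names are made up for illustration:

```csharp
using System;
using System.Collections.Generic;

static class Horspool
{
    // Returns the first index of 'pattern' in 'text', or -1 if absent.
    public static int IndexOf(string text, string pattern)
    {
        if (pattern.Length == 0) return 0;
        if (pattern.Length > text.Length) return -1;

        // Bad-character table: how far to shift when the character
        // aligned with the pattern's last position mismatches.
        var shift = new Dictionary<char, int>();
        for (int i = 0; i < pattern.Length - 1; i++)
            shift[pattern[i]] = pattern.Length - 1 - i;

        int pos = 0;
        while (pos <= text.Length - pattern.Length)
        {
            // Compare right to left at the current alignment.
            int j = pattern.Length - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos; // full match

            // Shift by the table entry for the text character under
            // the pattern's last position (whole length if unseen).
            char last = text[pos + pattern.Length - 1];
            pos += shift.TryGetValue(last, out int s) ? s : pattern.Length;
        }
        return -1;
    }
}
```

On typical text this skips several characters per alignment instead of advancing one at a time, which is where the speedup over a naive Contains-style scan comes from.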

Answered 2012-12-21T16:34:47