2

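I am building a tool in C# that walks a large directory of files and extracts certain information. The directory is organised by language (LCID), so I want to use multithreading to go through it: one thread per language folder.

My code currently scans a small number of files and extracts the required data without multithreading, but at scale it will take too long.

I set up a thread inside the loop that gets the LCID folders, but got the following error: "No overload for 'HBscan' matches delegate 'System.Threading.ThreadStart'". Based on what I read online, I then moved my method into a class so that it could take parameters. Now there are no errors, but the code does not iterate through the files properly: it leaves files out of the scan.

Can anyone see where my code goes wrong and why it does not execute correctly? Thanks.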

public static void Main(string[] args)
    {
        //change rootDirectory variable to point to directory which you wish to scan through
        string rootDirectory = @"C:\sample";
        DirectoryInfo dir = new DirectoryInfo(rootDirectory);

        //get the LCIDs from the folders
        string[] filePaths = Directory.GetDirectories(rootDirectory);
        for (int i = 0; i < filePaths.Length; i++)
        {
            string LCID = filePaths[i].Split('\\').Last();
            Console.WriteLine(LCID);

            HBScanner scanner = new HBScanner(new DirectoryInfo(filePaths[i]));
            Thread t1 = new Thread(new ThreadStart(scanner.HBscan));              
            t1.Start();             
        } 

        Console.WriteLine("Scanning through files...");

    }
    public class HBScanner
    {
        private DirectoryInfo DirectoryToScan { get; set; }

        public HBScanner(DirectoryInfo startDir)
        {
            DirectoryToScan = startDir;
        }

        public void HBscan()
        {
            HBscan(DirectoryToScan);
        } 

        public static void HBscan(DirectoryInfo directoryToScan)
        {
            //create an array of files using FileInfo object
            FileInfo[] files;
            //get all files for the current directory
            files = directoryToScan.GetFiles("*.*");
            string asset = "";
            string lcid = "";

            //iterate through the directory and get file details
            foreach (FileInfo file in files)
            {
                String name = file.Name;
                DateTime lastModified = file.LastWriteTime;
                String path = file.FullName;

                //first check the file name for asset id using regular expression
                Regex regEx = new Regex(@"([A-Z][A-Z][0-9]{8,10})\.");
                asset = regEx.Match(file.Name).Groups[1].Value.ToString();

                //get LCID from the file path using regular expression
                Regex LCIDregEx = new Regex(@"sample\\(\d{4,5})");
                lcid = LCIDregEx.Match(file.FullName).Groups[1].Value.ToString();

                //if it can't find it from filename, it looks into xml
                if (file.Extension == ".xml" && asset == "")
                {
                    System.Diagnostics.Debug.WriteLine("File is an .XML");
                    System.Diagnostics.Debug.WriteLine("file.FullName is: " + file.FullName);
                    XmlDocument xmlDoc = new XmlDocument();
                    xmlDoc.Load(path);
                    //load XML file in 

                    //check for <assetid> element
                    XmlNode assetIDNode = xmlDoc.GetElementsByTagName("assetid")[0];
                    //check for <Asset> element
                    XmlNode AssetIdNodeWithAttribute = xmlDoc.GetElementsByTagName("Asset")[0];

                    //if there is an <assetid> element
                    if (assetIDNode != null)
                    {
                        asset = assetIDNode.InnerText;
                    }
                    else if (AssetIdNodeWithAttribute != null) //if there is an <asset> element, see if it has an AssetID attribute
                    {
                        //get the attribute 
                        asset = AssetIdNodeWithAttribute.Attributes["AssetId"].Value;

                        if (AssetIdNodeWithAttribute.Attributes != null)
                        {
                            var attributeTest = AssetIdNodeWithAttribute.Attributes["AssetId"];
                            if (attributeTest != null)
                            {
                                asset = attributeTest.Value;
                            }
                        }
                    }
                }

                Item newFile = new Item
                {
                    AssetID = asset,
                    LCID = lcid,
                    LastModifiedDate = lastModified,
                    Path = path,
                    FileName = name
                };

                Console.WriteLine(newFile);

            }

            //get sub-folders for the current directory
            DirectoryInfo[] dirs = directoryToScan.GetDirectories("*.*");
            foreach (DirectoryInfo dir in dirs)
            {
                HBscan(dir);
            }
        }
    }

4 Answers

4

I haven't tested this, but I think it should work.

The code creates one scanner per thread and runs its HBscan method.

public static void Main(string[] args)
        {
            //change rootDirectory variable to point to directory which you wish to scan through
            string rootDirectory = @"C:\sample";
            DirectoryInfo dir = new DirectoryInfo(rootDirectory);

            //get the LCIDs from the folders
            string[] filePaths = Directory.GetDirectories(rootDirectory);
            for (int i = 0; i < filePaths.Length; i++)
            {
                string LCID = filePaths[i].Split('\\').Last();
                Console.WriteLine(LCID);

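                //copy the path into a local first: the lambda would otherwise capture the loop variable i,
                //which may already have changed (or equal filePaths.Length) by the time the thread runs
                string folder = filePaths[i];
                Thread t1 = new Thread(() => new HBScanner(new DirectoryInfo(folder)).HBscan());
                t1.Start();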
            }

            Console.WriteLine("Scanning through files...");

        }
        public class HBScanner
        {
            private DirectoryInfo DirectoryToScan { get; set; }

            public HBScanner(DirectoryInfo startDir)
            {
                DirectoryToScan = startDir;
            }

            public void HBscan()
            {
                HBscan(DirectoryToScan);
            }

            public static void HBscan(DirectoryInfo directoryToScan)
            {
                //create an array of files using FileInfo object
                FileInfo[] files;
                //get all files for the current directory
                files = directoryToScan.GetFiles("*.*");
                string asset = "";
                string lcid = "";

                //iterate through the directory and get file details
                foreach (FileInfo file in files)
                {
                    String name = file.Name;
                    DateTime lastModified = file.LastWriteTime;
                    String path = file.FullName;

                    //first check the file name for asset id using regular expression
                    Regex regEx = new Regex(@"([A-Z][A-Z][0-9]{8,10})\.");
                    asset = regEx.Match(file.Name).Groups[1].Value.ToString();

                    //get LCID from the file path using regular expression
                    Regex LCIDregEx = new Regex(@"sample\\(\d{4,5})");
                    lcid = LCIDregEx.Match(file.FullName).Groups[1].Value.ToString();

                    //if it can't find it from filename, it looks into xml
                    if (file.Extension == ".xml" && asset == "")
                    {
                        System.Diagnostics.Debug.WriteLine("File is an .XML");
                        System.Diagnostics.Debug.WriteLine("file.FullName is: " + file.FullName);
                        XmlDocument xmlDoc = new XmlDocument();
                        xmlDoc.Load(path);
                        //load XML file in 

                        //check for <assetid> element
                        XmlNode assetIDNode = xmlDoc.GetElementsByTagName("assetid")[0];
                        //check for <Asset> element
                        XmlNode AssetIdNodeWithAttribute = xmlDoc.GetElementsByTagName("Asset")[0];

                        //if there is an <assetid> element
                        if (assetIDNode != null)
                        {
                            asset = assetIDNode.InnerText;
                        }
                        else if (AssetIdNodeWithAttribute != null) //if there is an <Asset> element, see if it has an AssetId attribute
                        {
                            //read the attribute only after the null checks; indexing it directly
                            //throws a NullReferenceException when AssetId is missing
                            if (AssetIdNodeWithAttribute.Attributes != null)
                            {
                                var attributeTest = AssetIdNodeWithAttribute.Attributes["AssetId"];
                                if (attributeTest != null)
                                {
                                    asset = attributeTest.Value;
                                }
                            }
                        }
                    }

                    Item newFile = new Item
                    {
                        AssetID = asset,
                        LCID = lcid,
                        LastModifiedDate = lastModified,
                        Path = path,
                        FileName = name
                    };

                    Console.WriteLine(newFile);

                }

                //get sub-folders for the current directory
                DirectoryInfo[] dirs = directoryToScan.GetDirectories("*.*");
                foreach (DirectoryInfo dir in dirs)
                {
                    HBscan(dir);
                }
            }
        }
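
Two details matter when a thread is started from a lambda inside a for loop: which variable the lambda captures, and whether Main waits for the threads. The following is only a minimal sketch of both points, not the poster's code. Scan and the folder paths are hypothetical stand-ins for HBscan and the real LCID folders.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class ThreadPerFolderSketch
{
    // Hypothetical stand-in for HBScanner.HBscan: only reports which folder it was given.
    private static void Scan(string folder)
    {
        Console.WriteLine("Scanning " + folder + " on thread " + Thread.CurrentThread.ManagedThreadId);
    }

    public static void Main()
    {
        // Made-up LCID folder paths, for illustration only; nothing is read from disk.
        string[] folders = { @"C:\sample\1033", @"C:\sample\1036", @"C:\sample\1041" };
        var threads = new List<Thread>();

        for (int i = 0; i < folders.Length; i++)
        {
            // Capture a per-iteration copy. Capturing i itself would let the thread
            // evaluate folders[i] after the loop has already advanced i.
            string folder = folders[i];
            var worker = new Thread(() => Scan(folder));
            threads.Add(worker);
            worker.Start();
        }

        // Block until every scan has finished before reporting completion.
        foreach (Thread worker in threads)
        {
            worker.Join();
        }

        Console.WriteLine("All folders scanned.");
    }
}
```

Joining is optional for foreground threads, since the process stays alive until they end, but it gives a definite point at which all results are known to be complete.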
answered 2013-01-25T11:06:41.127
2

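If you are using .NET 4.0, you can use the TPL: Parallel.For and Parallel.ForEach make it fairly easy to process multiple items at the same time.

I only came across it a few days ago and it is quite interesting. It speeds up your work by running it on multiple threads across different cores, which can improve performance considerably. In your case, of course, the gain may be limited by the heavy IO access.

But it may be worth a try! (Changing your current source should be quite easy, so just check it out.)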
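
As a rough illustration of the Parallel.ForEach approach, here is a minimal sketch under stated assumptions. The root folder comes from the first command-line argument (falling back to the current directory), and counting files is a placeholder for the real scan. Neither is taken from the question's code.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class ParallelForEachSketch
{
    public static void Main(string[] args)
    {
        // Assumed root: first argument, or the current directory if none is given.
        string root = args.Length > 0 ? args[0] : Directory.GetCurrentDirectory();

        // One work item per top-level folder; the TPL decides how many run at once.
        // Parallel.ForEach returns only after every folder has been processed.
        Parallel.ForEach(Directory.EnumerateDirectories(root), folder =>
        {
            // Placeholder for the real scan: count the files under this folder.
            int count = 0;
            foreach (string file in Directory.EnumerateFiles(folder, "*", SearchOption.AllDirectories))
            {
                count++;
            }
            Console.WriteLine(Path.GetFileName(folder) + ": " + count + " files");
        });
    }
}
```

An exception thrown inside the loop body, for example an UnauthorizedAccessException on a protected folder, surfaces from Parallel.ForEach wrapped in an AggregateException, so real code would probably want to catch it per folder.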

answered 2013-01-25T11:11:17.227
2

How about something like this:

public static void Main(string[] args)
{
    const string rootDirectory = @"C:\sample";

    Directory.EnumerateDirectories(rootDirectory)
        .AsParallel()
        .ForAll(f => HBScanner.HBscan(new DirectoryInfo(f)));
}

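After all, the only reason you get the LCID in the loop body is to write it to the console. If you want to keep writing to the console, you could do this: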

public static void Main(string[] args)
{
    const string rootDirectory = @"C:\sample";

    Console.WriteLine("Scanning through files...");

    Directory.EnumerateDirectories(rootDirectory)
        .AsParallel()
        .ForAll(f => 
            {
                var lcid = f.Split('\\').Last();
                Console.WriteLine(lcid);

                HBScanner.HBscan(new DirectoryInfo(f));
            });
}

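Note that EnumerateDirectories should be used in preference to GetDirectories because it is lazily evaluated, so your processing can start as soon as the first directory is found. You don't have to wait for all the directories to be loaded into a list.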
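
The difference between the two calls can be seen in a small sketch. It is illustrative only, and the root folder is an assumption: the first command-line argument, or the current directory if none is given.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LazyEnumerationSketch
{
    public static void Main(string[] args)
    {
        // Assumed root: first argument, or the current directory if none is given.
        string root = args.Length > 0 ? args[0] : Directory.GetCurrentDirectory();

        // GetDirectories builds the complete array before it returns.
        string[] all = Directory.GetDirectories(root);
        Console.WriteLine("GetDirectories returned " + all.Length + " entries at once.");

        // EnumerateDirectories hands back an enumerable without reading the whole
        // listing; entries are produced as the loop asks for them.
        IEnumerable<string> lazy = Directory.EnumerateDirectories(root);
        foreach (string folder in lazy)
        {
            Console.WriteLine("First entry available: " + folder);
            break; // stop early: the remaining entries are never read
        }
    }
}
```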

answered 2013-01-25T11:30:14.423
1

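Using a BlockingCollection (http://msdn.microsoft.com/en-us/library/dd267312.aspx) can improve this task considerably.

The overall structure is this: you create one thread (or use the main thread) that enumerates the files and adds them to the BlockingCollection. Simply enumerating files should be fairly fast, so this thread should finish well before the worker threads do.

Then you create a number of tasks (as many as Environment.ProcessorCount is a good choice). These tasks should look like the first example in the docs (collection.Take()). Each task runs your check on one file at a time.

So one thread finds the file names and puts them into the BlockingCollection, while the other threads check the file contents in parallel. This gives you better parallelism, because creating one thread per folder can distribute the work unevenly (you don't know how many files each folder contains, right?)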
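
The structure described above could be sketched roughly as follows. This is not a definitive implementation: ProcessFile is a hypothetical placeholder for the per-file check, the root folder is assumed to come from the first command-line argument, the capacity of 1000 is an arbitrary choice, and GetConsumingEnumerable is used instead of calling Take() in a loop because it ends cleanly once the producer is done.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class ProducerConsumerSketch
{
    public static void Main(string[] args)
    {
        // Assumed root: first argument, or the current directory if none is given.
        string root = args.Length > 0 ? args[0] : Directory.GetCurrentDirectory();

        // Bounded (arbitrary capacity) so the producer cannot run far ahead of the workers.
        using (var queue = new BlockingCollection<string>(boundedCapacity: 1000))
        {
            // Producer: enumerate file paths and hand them to the workers.
            Task producer = Task.Factory.StartNew(() =>
            {
                try
                {
                    foreach (string path in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
                    {
                        queue.Add(path);
                    }
                }
                finally
                {
                    // Always signal completion, otherwise the consumers would wait forever.
                    queue.CompleteAdding();
                }
            });

            // Consumers: one task per processor, each handling one file at a time.
            var consumers = new Task[Environment.ProcessorCount];
            for (int i = 0; i < consumers.Length; i++)
            {
                consumers[i] = Task.Factory.StartNew(() =>
                {
                    // Blocks while the queue is empty; ends after CompleteAdding
                    // has been called and the queue has been drained.
                    foreach (string path in queue.GetConsumingEnumerable())
                    {
                        ProcessFile(path);
                    }
                });
            }

            Task.WaitAll(consumers);
            producer.Wait(); // rethrows (as AggregateException) if enumeration failed
        }
    }

    // Hypothetical placeholder for the real per-file check.
    private static void ProcessFile(string path)
    {
        Console.WriteLine(Path.GetFileName(path));
    }
}
```

Because every worker pulls from the same queue, a folder with many files no longer ties up a single thread while the others sit idle.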

answered 2013-01-25T11:24:06.527