c# - Indexing data inside blobs using Lucene.NET and C#

Question

I am using Lucene.Net with a custom crawler and IFilter so that I can index the data inside blobs.

// Crawler: downloads each blob to local storage on the Azure instance,
// indexes it, then deletes the local copy.
foreach (var item in containerList)
{
    CloudBlobContainer container = BlobClient.GetContainerReference(item.Name);
    if (container.Name != "indexes")   // skip the container named "indexes"
    {
        IEnumerable<IListBlobItem> blobs = container.ListBlobs();
        foreach (CloudBlob blob in blobs)
        {
            blob.DownloadToFile(path + blob.Name);
            indexer.IndexBlobData(path, blob);
            System.IO.File.Delete(path + blob.Name);
        }
    }
}

The code below is the indexer function, which uses IFilter:

public bool IndexBlobData(string path, CloudBlob blob)
{
    Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
    try
    {
        // FilterReader extracts the text through IFilter and needs a local file path.
        using (TextReader reader = new FilterReader(path + blob.Name))
        {
            doc.Add(new Lucene.Net.Documents.Field("url", blob.Uri.ToString(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
            doc.Add(new Lucene.Net.Documents.Field("content", reader.ReadToEnd(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.ANALYZED));
            indexWriter.AddDocument(doc);
        }
        return true;
    }
    catch (Exception)
    {
        // Any failure is reported to the caller as "not indexed".
        return false;
    }
}

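My problem now is that I don't want to download the file to instance storage. I want to pass the file directly to FilterReader, but it requires a "physical" path; passing an HTTP address does not work. Can anyone suggest a workaround? I don't want to download the same file from the blob again and then index it; instead, I would prefer to download it, keep it in main memory, and use the index filter directly.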

I am using the IFilter implementation from here.

score 1 · Accepted Answer

It is not very clear what you mean by "I don't want to download the same file from the blob again and then index it; instead, I would prefer to download it, keep it in main memory, and use the index filter directly". Which "main memory" do you mean: Azure Blob storage or the local instance memory?

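However, because of the nature of the IFilter interface, the problem you are facing cannot be solved. If you look more closely at the source code you are using from here, you will see that behind the scenes it uses the IPersistFile COM interface. Unfortunately, this interface works only with local files and does not accept streams.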

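I would suggest taking a Stream from the blob and passing it to the reader instead of a physical path. But, as noted above, IFilter uses a COM interface that works only with physical paths. So, with your current approach, there is no way to skip the blob download.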

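Downloading the blob locally is not that bad. If the storage account is in the same affinity group as the compute instance, the download will be very fast and the traffic will be free. If you use a Small instance size, you will have 165 GB of local storage, which is plenty.

You can optimize your process a bit by keeping track of what has been indexed and what has not. You can use Azure Table storage for this: it is another very fast and cheap storage solution, well suited to storing key-value pairs such as file name and ETag. Then, when you enumerate the blobs, first get the blob's ETag and check the table to see whether the blob has already been indexed. Download it only if it has not been indexed, and then add a new record to the table to mark the file as indexed.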
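The bookkeeping described above can be sketched roughly as follows. This is only an illustration: `IndexedBlobTracker` is a hypothetical class, and it keeps the file name and ETag pairs in an in-memory dictionary as a stand-in for Azure Table storage, so the records would not survive a restart.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: remembers which ETag was indexed for each blob name.
// The dictionary is an assumption standing in for an Azure Table storage table.
public class IndexedBlobTracker
{
    private readonly Dictionary<string, string> indexedEtags =
        new Dictionary<string, string>();

    // True when the blob was never indexed, or when its ETag differs from
    // the one recorded at indexing time (the blob has changed since then).
    public bool NeedsIndexing(string blobName, string etag)
    {
        string recorded;
        return !indexedEtags.TryGetValue(blobName, out recorded) || recorded != etag;
    }

    // Records (or overwrites) the ETag once the blob has been indexed.
    public void MarkIndexed(string blobName, string etag)
    {
        indexedEtags[blobName] = etag;
    }
}
```

In the crawler loop from the question, the download could then be wrapped in a check such as `tracker.NeedsIndexing(blob.Name, blob.Properties.ETag)`, with `MarkIndexed` called only when `IndexBlobData` returns true. This assumes the listing populates `blob.Properties.ETag`.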

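Or... or don't use IFilter. I don't see what benefit IFilter brings on Azure. IFilters are registered only when the corresponding application is installed. For example, if you want an IFilter to process Office documents, you have to install Microsoft Office on the VM (which you currently cannot do, even if you have a license, because of the license mobility restrictions on MS Office). If you want an IFilter for PDF, you have to install Adobe Acrobat Reader (which you can do with a startup task). And so on: some applications can be installed and some cannot. Your Windows Azure VM instance is plain Windows with no IFilters at all. Picture a basic installation of Windows Server 2008 R2 with no roles or features added; that is your instance.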
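If you do drop IFilter, blobs whose content is already plain text can be read entirely in memory, with no local file. The sketch below is one possible shape for that, not a replacement for IFilter in general: it only decodes bytes as text and will not extract anything useful from binary formats such as Office documents or PDFs. `InMemoryBlobText` and its `download` parameter are hypothetical names introduced for illustration.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: buffers a download in memory and decodes it as text.
public static class InMemoryBlobText
{
    // 'download' is any action that writes the blob's bytes into the given stream.
    public static string ReadAllText(Action<Stream> download, Encoding encoding)
    {
        using (var buffer = new MemoryStream())
        {
            download(buffer);      // fill the in-memory buffer
            buffer.Position = 0;   // rewind before reading
            using (var reader = new StreamReader(buffer, encoding))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

With the storage client used in the question, the call would presumably look like `InMemoryBlobText.ReadAllText(blob.DownloadToStream, Encoding.UTF8)`, and the returned string would go into the "content" field in place of `reader.ReadToEnd()`. The UTF-8 encoding is an assumption about how the blobs were written.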
