c# - Indexing large amounts of data in Lucene.Net using NHibernate

Question

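We use NHibernate as our data access layer. We have a table with 1.7 million records that we need to index one by one with Lucene for our search. When we run the console application we wrote to build the index, it starts out fast, but it gets progressively slower as it works through the items.

Our first iteration simply indexed all of them. The second iteration indexed them by category. Now we select a subset by category and then split it into "pages" of 100. We still see the performance degradation.

I turned on SQL Profiler, and while iterating over the items it calls SQL Server for the images once per item, one after another, even though lazy loading is turned off for the images.

This is for a commerce site, and we are indexing catalog items (products). Each catalog item has zero to many images (stored in a separate table).

Here is our mapping: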

public class ItemMap : ClassMap<Item>
    {
        public ItemMap()
        {
            Table("Products");

            Id(x => x.Id, "ProductId").GeneratedBy.GuidComb();

            Map(x => x.Model);
            Map(x => x.Description);

            Map(x => x.Created);
            Map(x => x.Modified);
            Map(x => x.IsActive);
            Map(x => x.PurchaseUrl).CustomType<UriType>();

            Component(x => x.Identifier, m =>
                {
                    m.Map(x => x.Upc);
                    m.Map(x => x.Asin);
                    m.Map(x => x.Isbn);
                    m.Map(x => x.Tid);
                });

            Component(x => x.Price, m =>
                {
                    m.Map(x => x.Currency);
                    m.Map(x => x.Amount, "Price");
                    m.Map(x => x.Shipping);
                });

            References(x => x.Brand, "BrandId");
            References(x => x.Category, "CategoryId");
            References(x => x.Supplier, "SupplierId");
            References(x => x.Provider, "ProviderId");

            HasMany(x => x.Images)
                .Table("ProductImages")
                .KeyColumn("ProductId")
                .Not.LazyLoad();

            // TODO: Add variants
        }

    }

Here is the root logic of the indexing application.

public void IndexProducts()
        {
            Console.WriteLine("--- Begin Indexing Products ---");
            Console.WriteLine();
            var categories = categoryRepository.GetAll().ToList();
            Console.WriteLine(String.Format("--- {0} Categories found ---", categories.Count));
            categories.Add(null);

            foreach (var category in categories)
            {
                string categoryName = "\"None\"";

                if (category != null)
                    categoryName = category.Name;

                Console.WriteLine(String.Format("--- Begin Indexing Category ({0}) ---", categoryName));
                var categoryItems = from p in catalogRepository.GetList(new ActiveProductsByCategoryQuery(category))
                                    select p;

                int count = categoryItems.Count();
                int pageSize = 100;
                int currentPage = 0;
                int offest = currentPage * pageSize;
                int current = 1;

                Console.WriteLine(String.Format("Indexing {0} Products...", count));

                while (offest < count)
                {
                    var products = (from p in categoryItems
                                    select p).Skip(offest).Take(pageSize);

                    foreach (var item in products)
                    {
                        indexer.UpdateContent(item);
                        UpdateCounter(current, count);
                        current++;
                    }

                    currentPage++;
                    offest = currentPage * pageSize;
                }
                Console.WriteLine();

                Console.WriteLine(String.Format("--- End Indexing Category ({0}) ---", categoryName));
                Console.WriteLine();
            }

            Console.WriteLine("--- End Indexing Products ---");
            Console.WriteLine();
        }

FYI, the category in question has a count of 26552. The first query it runs is this:

exec sp_executesql N'SELECT TOP 100 ProductId100_1_, Upc100_1_, Asin100_1_, Isbn100_1_, Tid100_1_, Currency100_1_, Price100_1_, Shipping100_1_, Model100_1_, Descrip10_100_1_, Created100_1_, Modified100_1_, IsActive100_1_, Purchas14_100_1_, BrandId100_1_, CategoryId100_1_, SupplierId100_1_, ProviderId100_1_, CategoryId103_0_, Name103_0_, ShortName103_0_, Created103_0_, Modified103_0_, ShortId103_0_, DisplayO7_103_0_, IsActive103_0_, ParentCa9_103_0_ FROM (SELECT this_.ProductId as ProductId100_1_, this_.Upc as Upc100_1_, this_.Asin as Asin100_1_, this_.Isbn as Isbn100_1_, this_.Tid as Tid100_1_, this_.Currency as Currency100_1_, this_.Price as Price100_1_, this_.Shipping as Shipping100_1_, this_.Model as Model100_1_, this_.Description as Descrip10_100_1_, this_.Created as Created100_1_, this_.Modified as Modified100_1_, this_.IsActive as IsActive100_1_, this_.PurchaseUrl as Purchas14_100_1_, this_.BrandId as BrandId100_1_, this_.CategoryId as CategoryId100_1_, this_.SupplierId as SupplierId100_1_, this_.ProviderId as ProviderId100_1_, category1_.CategoryId as CategoryId103_0_, category1_.Name as Name103_0_, category1_.ShortName as ShortName103_0_, category1_.Created as Created103_0_, category1_.Modified as Modified103_0_, category1_.ShortId as ShortId103_0_, category1_.DisplayOrder as DisplayO7_103_0_, category1_.IsActive as IsActive103_0_, category1_.ParentCategoryId as ParentCa9_103_0_, ROW_NUMBER() OVER(ORDER BY CURRENT_TIMESTAMP) as __hibernate_sort_row FROM Products this_ left outer join Categories category1_ on this_.CategoryId=category1_.CategoryId WHERE (this_.IsActive = @p0 and (1=0 or (this_.CategoryId is not null and category1_.CategoryId = @p1)))) as query WHERE query.__hibernate_sort_row > 500 ORDER BY query.__hibernate_sort_row',N'@p0 bit,@p1 uniqueidentifier',@p0=1,@p1='A988FD8C-DD93-4119-8F84-0AF3656DAEDD'

Then, for each product, it executes:

exec sp_executesql N'SELECT images0_.ProductId as ProductId1_, images0_.ImageId as ImageId1_, images0_.ImageId as ImageId98_0_, images0_.Description as Descript2_98_0_, images0_.Url as Url98_0_, images0_.Created as Created98_0_, images0_.Modified as Modified98_0_, images0_.ProductId as ProductId98_0_ FROM ProductImages images0_ WHERE images0_.ProductId=@p0',N'@p0 uniqueidentifier',@p0='487EA053-4DD5-4EBA-AA36-95B30C42F0CD'

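Which is fine. The problem is that the first 2000 or so go really fast, but the longer it runs within a category, the slower it gets and the more memory it consumes, even though it is indexing the same number of products. The GC is working, since memory usage does drop, but overall it keeps climbing as the processor works.

Is there anything we can do to speed up the indexer? Why does its performance keep degrading? I don't think it is NHibernate or the queries, because it starts out so fast. We are really at a loss here.

Thanks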

score 3 · Accepted Answer

Just a few weeks ago, Ayende had a post about doing exactly this (using a stateless session and a custom IList implementation).

http://ayende.com/Blog/archive/2010/06/27/nhibernate-streaming-large-result-sets.aspx

That sounds like exactly what you need, at least to speed up record retrieval and minimize memory usage.

score 0 · Accepted Answer

We ended up moving to Solr for indexing. We could not get it to index efficiently, which was probably due to our implementation.

For reference:

http://lucene.apache.org/solr/

http://code.google.com/p/solrnet/

score 0 · Accepted Answer

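Are you using the same session for all of the calls? If so, it caches every entity it loads and, whenever Flush is called, loops over all of them to check whether they need to be flushed (how often that happens depends on your FlushMode). Use a new session for each page of items, or change the FlushMode.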
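A rough sketch of the session-per-page idea, in case it helps. `Item` is the mapped class from the question; `PagedIndexing`, `IndexActiveItems`, and the `indexItem` callback (standing in for `indexer.UpdateContent`) are names made up for illustration, and how you obtain the `ISessionFactory` depends on your setup. The explicit ordering by `Id` is also an assumption: it is there so that consecutive pages do not overlap.

```csharp
using System;
using System.Collections.Generic;
using NHibernate;
using NHibernate.Criterion;

public static class PagedIndexing
{
    // Hypothetical helper: indexes all active items, one short-lived session per page.
    public static void IndexActiveItems(ISessionFactory sessionFactory,
                                        Action<Item> indexItem,
                                        int pageSize)
    {
        int offset = 0;
        int loaded;

        do
        {
            // A fresh session per page: its first-level cache only ever
            // holds the entities of this one page.
            using (ISession session = sessionFactory.OpenSession())
            {
                IList<Item> page = session.CreateCriteria<Item>()
                    .Add(Restrictions.Eq("IsActive", true))
                    .AddOrder(Order.Asc("Id"))   // stable order so pages do not overlap
                    .SetFirstResult(offset)
                    .SetMaxResults(pageSize)
                    .List<Item>();

                foreach (Item item in page)
                {
                    indexItem(item);             // e.g. indexer.UpdateContent(item)
                }

                loaded = page.Count;
            } // disposing the session discards everything it cached

            offset += pageSize;
        } while (loaded == pageSize);            // a short page means we reached the end
    }
}
```

If you would rather keep a single session, calling `session.Clear()` after each page should have a similar effect on the cache, as long as nothing else still relies on the entities loaded earlier.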

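When using criteria, you can specify that particular properties should be prefetched with a SQL join, which may speed up reading the data. I generally trust the Criteria API more than Linq-to-NHibernate, because I actually decide what each call does.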
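Something along these lines is what I mean by prefetching with criteria. Treat it as a sketch under assumptions: `Item` and `Category` are the classes from the question, `ProductPageQuery` and `LoadPage` are hypothetical names, the choice of `Brand` and `Supplier` is only an example, and `category` is assumed to be non-null (the "no category" case would need `Restrictions.IsNull("Category")` instead).

```csharp
using System.Collections.Generic;
using NHibernate;
using NHibernate.Criterion;

public static class ProductPageQuery
{
    // Hypothetical helper: loads one page of active items in a category and
    // pulls the Brand and Supplier references in with the same SELECT.
    public static IList<Item> LoadPage(ISession session,
                                       Category category,
                                       int offset,
                                       int pageSize)
    {
        return session.CreateCriteria<Item>()
            .Add(Restrictions.Eq("IsActive", true))
            .Add(Restrictions.Eq("Category", category))   // assumes category != null
            .SetFetchMode("Brand", FetchMode.Join)        // many-to-one: joined, no extra query
            .SetFetchMode("Supplier", FetchMode.Join)
            .AddOrder(Order.Asc("Id"))                    // stable order for paging
            .SetFirstResult(offset)
            .SetMaxResults(pageSize)
            .List<Item>();
    }
}
```

I left the `Images` collection out of the join on purpose. Join-fetching a one-to-many collection multiplies the rows per product, and that tends to interact badly with `SetFirstResult`/`SetMaxResults` paging, so the images are probably better handled separately.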
