c# - LINQ-to-objects index within a group + for different groupings (aka ROW_NUMBER with PARTITION BY equivalent)

Question

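After a lot of Googling and code experimentation, I'm stumped by a complex C# LINQ-to-objects problem that in SQL would be easy to solve with a pair of ROW_NUMBER()...PARTITION BY functions and a subquery or two.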

Here, in words, is what I'm trying to do in code. The underlying requirement is removing duplicate documents from a list:

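1) First, group the list by (Document.Title, Document.SourceId), assuming a (simplified) class definition like this:
```
class Document
{
    string Title;
    int SourceId; // sources are prioritized (ID=1 is better than ID=2)
}
```
2) Within each group, assign every document an index (e.g. Index 0 == 1st document with this title from this source, Index 1 == 2nd document with this title from this source, etc.). I'd love the equivalent of ROW_NUMBER() in SQL!
3) Now group by (Document.Title, Index), where Index was computed in step #2. For each group, return only one document: the one with the lowest Document.SourceId.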

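Step #1 is easy (e.g. codepronet.blogspot.com/2009/01/group-by-in-linq.html), but I'm stumped on steps #2 and #3. I can't seem to build a red-squiggly-free C# LINQ query that solves all three steps.

I think Anders Hejlsberg's post on this thread is the answer to steps #2 and #3 above, if I could get the syntax right.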

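I'd prefer to avoid using an external local variable for the index computation, as recommended on slodge.blogspot.com/2009/01/adding-row-number-using-linq-to-objects.html, because that solution breaks if the external variable is modified.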
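
To illustrate the concern, here is a minimal sketch of how an index taken from a captured local variable can drift. It is not the code from the linked blog post; the names `titles` and `counter` are made up for the example, and it assumes a compiler that supports top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;

var titles = new[] { "ABC", "ABC", "123" };

// Fragile: the index comes from a local variable captured by the lambda.
int counter = 0;
var withCapturedIndex = titles.Select(t => new { Title = t, Index = counter++ });

// LINQ queries are lazy, so every enumeration re-runs the lambda
// and keeps incrementing the shared counter.
var firstPass = withCapturedIndex.Select(x => x.Index).ToArray();  // 0, 1, 2
var secondPass = withCapturedIndex.Select(x => x.Index).ToArray(); // 3, 4, 5

// The indexed Select overload keeps the position inside the operator,
// so every enumeration starts again from 0.
var withBuiltInIndex = titles.Select((t, i) => new { Title = t, Index = i });
var stableFirst = withBuiltInIndex.Select(x => x.Index).ToArray();  // 0, 1, 2
var stableSecond = withBuiltInIndex.Select(x => x.Index).ToArray(); // 0, 1, 2

Console.WriteLine(string.Join(",", secondPass));   // 3,4,5
Console.WriteLine(string.Join(",", stableSecond)); // 0,1,2
```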

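Optimally, the group-by-Title step could be done first, so that the "inner" groupings (first by Source to compute the index, then by Index to filter out duplicates) operate on a small number of objects in each "by title" group, since the number of documents in each by-title group is usually under 100. I really don't want an N² solution!

I could certainly solve this with nested foreach loops, but it seems like the kind of problem that should be simple with LINQ.

Any ideas?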

score 6 · Accepted Answer

I think jpbochi missed that you want your groupings to be by pairs of values (Title+SourceId, then Title+Index). Here's a (mostly) LINQ query solution:

var selectedFew = 
    from doc in docs
    group doc by new { doc.Title, doc.SourceId } into g
    from docIndex in g.Select((d, i) => new { Doc = d, Index = i })
    group docIndex by new { docIndex.Doc.Title, docIndex.Index } into g
    select g.Aggregate((a,b) => (a.Doc.SourceId <= b.Doc.SourceId) ? a : b);

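First we group by Title+SourceId (I use an anonymous type because the compiler builds a good hash code for the grouping lookup). Then we use Select to attach each document's index within its group, which we use as part of the key in the second grouping. Finally, for each group we pick the document with the lowest SourceId.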
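
The first two steps are what stand in for ROW_NUMBER() with PARTITION BY. As a minimal sketch on a made-up three-document input (the variable name `numbered` is mine, and it assumes top-level statements, C# 9 or later), they can be tried in isolation like this:

```csharp
using System;
using System.Linq;

var docs = new[] {
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 5 },
    new { Title = "123", SourceId = 7 },
};

// Partition by (Title, SourceId), then number the rows inside each partition.
var numbered = docs
    .GroupBy(d => new { d.Title, d.SourceId })
    .SelectMany(g => g.Select((d, i) => new { d.Title, d.SourceId, Index = i }))
    .ToList();

foreach (var row in numbered) Console.WriteLine(row);
// { Title = 123, SourceId = 7, Index = 0 }
// { Title = 123, SourceId = 7, Index = 1 }
// { Title = 123, SourceId = 5, Index = 0 }
```

The index restarts at 0 for every (Title, SourceId) pair, which is what the second grouping by Title+Index relies on.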

Given this input:

var docs = new[] {
    new { Title = "ABC", SourceId = 0 },
    new { Title = "ABC", SourceId = 4 },
    new { Title = "ABC", SourceId = 2 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 5 },
    new { Title = "123", SourceId = 5 },
};

I get this output:

{ Doc = { Title = ABC, SourceId = 0 }, Index = 0 }
{ Doc = { Title = 123, SourceId = 5 }, Index = 0 }
{ Doc = { Title = 123, SourceId = 5 }, Index = 1 }
{ Doc = { Title = 123, SourceId = 7 }, Index = 2 }

Update: I just saw your point about grouping by Title first. You can do that with a subquery on the Title groups:

var selectedFew =
    from doc in docs
    group doc by doc.Title into titleGroup
    from docWithIndex in
        (
            from doc in titleGroup
            group doc by doc.SourceId into idGroup
            from docIndex in idGroup.Select((d, i) => new { Doc = d, Index = i })
            group docIndex by docIndex.Index into indexGroup
            select indexGroup.Aggregate((a,b) => (a.Doc.SourceId <= b.Doc.SourceId) ? a : b)
        )
    select docWithIndex;
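
As a side note, and not part of the original answer: assuming .NET 6 or later, where `Enumerable.MinBy` is available, the `Aggregate` call that picks the lowest SourceId could probably be written more directly. A minimal sketch on a hand-built index group (the sample data is made up):

```csharp
using System;
using System.Linq;

// A stand-in for one index group: same Title and Index, different sources.
var indexGroup = new[] {
    new { Doc = new { Title = "123", SourceId = 7 }, Index = 0 },
    new { Doc = new { Title = "123", SourceId = 5 }, Index = 0 },
};

// The approach used in the queries above.
var viaAggregate = indexGroup.Aggregate((a, b) => (a.Doc.SourceId <= b.Doc.SourceId) ? a : b);

// Assumption: .NET 6+, which adds Enumerable.MinBy.
var viaMinBy = indexGroup.MinBy(x => x.Doc.SourceId);

Console.WriteLine(viaAggregate); // { Doc = { Title = 123, SourceId = 5 }, Index = 0 }
Console.WriteLine(ReferenceEquals(viaAggregate, viaMinBy)); // True
```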

score 3 · Accepted Answer

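To be honest, I'm quite confused by your question. Perhaps you could explain the problem you're trying to solve. Anyway, I'll try to answer what I understood.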

1) First, I'll assume that you already have a list of documents grouped by Title+SourceId. For testing purposes, I hardcoded a list as follows:

var docs = new [] {
    new { Title = "ABC", SourceId = 0 },
    new { Title = "ABC", SourceId = 4 },
    new { Title = "ABC", SourceId = 2 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 5 },
};

2) To attach an index to every item, you can use the Select extension method, passing it a selector Func that also receives the index. Like this:

var docsWithIndex
    = docs
    .Select( (d, i) => new { Doc = d, Index = i } );

3) From what I understood, the next step is to group the last result by Title. Here's how to do it:

var docsGroupedByTitle
    = docsWithIndex
    .GroupBy( a => a.Doc.Title );

The GroupBy function (used above) returns an IEnumerable<IGrouping<string,DocumentWithIndex>>. Since a group is itself enumerable, we now have an enumerable of enumerables.
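
To make the "enumerable of enumerables" shape concrete, here is a small sketch that walks the groups with two nested loops. The three-document input is made up for the example, and it assumes top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;

var docsWithIndex = new[] {
    new { Title = "ABC", SourceId = 0 },
    new { Title = "ABC", SourceId = 4 },
    new { Title = "123", SourceId = 7 },
}.Select((d, i) => new { Doc = d, Index = i });

var docsGroupedByTitle = docsWithIndex.GroupBy(a => a.Doc.Title);

// Outer loop: one IGrouping per distinct Title, exposing the key.
foreach (var group in docsGroupedByTitle)
{
    Console.WriteLine($"Title {group.Key}: {group.Count()} item(s)");

    // Inner loop: the group itself enumerates its members.
    foreach (var item in group)
        Console.WriteLine($"  SourceId={item.Doc.SourceId}, Index={item.Index}");
}
```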

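4) Now, for each of the groups above, we keep only the item with the minimum SourceId. This takes two levels of nesting. In LINQ, the outer level is a selection (for each group, get one of its items) and the inner level is an aggregation (get the item with the lowest SourceId):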

var selectedFew
    = docsGroupedByTitle
    .Select(
        g => g.Aggregate(
            (a, b) => (a.Doc.SourceId  <= b.Doc.SourceId) ? a : b
        )
    );

Just to make sure it works, I tested it with a simple foreach:

foreach (var a in selectedFew) Console.WriteLine(a);
//The result will be:
//{ Doc = { Title = ABC, SourceId = 0 }, Index = 0 }
//{ Doc = { Title = 123, SourceId = 5 }, Index = 4 }

I'm not sure that's what you wanted. If not, please leave a comment and I can fix the answer. I hope this helps.

Note: all the classes used in my test are anonymous, so you don't really need to define a DocumentWithIndex type. In fact, I haven't even declared a Document class.

score 1 · Accepted Answer

Method-based syntax:

var selectedFew = docs.GroupBy(doc => new {doc.Title, doc.SourceId}, doc => doc)
                      .SelectMany((grouping) => grouping.Select((doc, index) => new {doc, index}))
                      .GroupBy(anon => new {anon.doc.Title, anon.index})
                      .Select(grouping => grouping.Aggregate((a, b) => a.doc.SourceId <= b.doc.SourceId ? a : b));

Would you say the above is the equivalent method-based syntax?
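
One way to sanity-check that, sketched here rather than proven: run the query-syntax version from the accepted answer and the method-syntax version on the same sample input and compare the results. The names `viaQuery`, `viaMethods`, `left`, `right` and `g2` are mine, and agreement on one input is only evidence, not a proof of equivalence.

```csharp
using System;
using System.Linq;

var docs = new[] {
    new { Title = "ABC", SourceId = 0 },
    new { Title = "ABC", SourceId = 4 },
    new { Title = "ABC", SourceId = 2 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 5 },
    new { Title = "123", SourceId = 5 },
};

// Query syntax (second group variable renamed to g2 for readability).
var viaQuery =
    from doc in docs
    group doc by new { doc.Title, doc.SourceId } into g
    from docIndex in g.Select((d, i) => new { Doc = d, Index = i })
    group docIndex by new { docIndex.Doc.Title, docIndex.Index } into g2
    select g2.Aggregate((a, b) => (a.Doc.SourceId <= b.Doc.SourceId) ? a : b);

// Method syntax.
var viaMethods = docs
    .GroupBy(doc => new { doc.Title, doc.SourceId })
    .SelectMany(grouping => grouping.Select((doc, index) => new { doc, index }))
    .GroupBy(anon => new { anon.doc.Title, anon.index })
    .Select(grouping => grouping.Aggregate((a, b) => a.doc.SourceId <= b.doc.SourceId ? a : b));

// Project both results onto plain tuples so they can be compared by value.
var left = viaQuery.Select(x => (x.Doc.Title, x.Doc.SourceId, x.Index)).ToList();
var right = viaMethods.Select(x => (x.doc.Title, x.doc.SourceId, x.index)).ToList();

Console.WriteLine(left.SequenceEqual(right)); // True
```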

score 1 · Accepted Answer

I implemented an extension method. It supports partitioning by multiple fields as well as multiple ordering criteria.

public static IEnumerable<TResult> Partition<TSource, TKey, TResult>(
    this IEnumerable<TSource> source, 
    Func<TSource, TKey> keySelector,
    Func<IEnumerable<TSource>, IOrderedEnumerable<TSource>> sorter,
    Func<TSource, int, TResult> selector)
{
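    // Standard null check, standing in for the author's AssertUtilities helper (not shown in the post).
    if (source == null) throw new ArgumentNullException("source");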

    return source
        .GroupBy(keySelector)
        .Select(arg => sorter(arg).Select(selector))
        .SelectMany(arg => arg);
}

Usage:

var documents = new[] 
{
    new { Title = "Title1", SourceId = 1 },
    new { Title = "Title1", SourceId = 2 },
    new { Title = "Title2", SourceId = 15 },
    new { Title = "Title2", SourceId = 14 },
    new { Title = "Title3", SourceId = 100 }
};

var result = documents
    .Partition(
        arg => arg.Title,  // partition by
        arg => arg.OrderBy(x => x.SourceId), // order by
        (arg, rowNumber) => new { RowNumber = rowNumber, Document = arg }) // select
    .Where(arg => arg.RowNumber == 0)
    .Select(arg => arg.Document)
    .ToList();

Result:

{ Title = "Title1", SourceId = 1 },
{ Title = "Title2", SourceId = 14 },
{ Title = "Title3", SourceId = 100 }
