I have a table with half a million records and I need to find the duplicates. So I am using this code I wrote:

var dups2 = from m in mg_B
    group m by new { m.Addr1, m.Addr2, m.City, m.State }
    into g
    where g.Count() > 1
    select g;

The problem with this code is that it does not treat two records as duplicates when Addr1 is an empty string "" in one and NULL in the other.

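Basically, when it compares an empty string with a null in a field, it treats them as different, but I need them to be treated as the same.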
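
A minimal sketch of the behavior described above, assuming the grouping runs in memory (LINQ to Objects). The class and method names are hypothetical and exist only for illustration. An anonymous-type key compares its properties by value, and a null string is not equal to "", so the two values fall into separate groups unless they are normalized first.

```csharp
using System;
using System.Linq;

// Hypothetical demo class, not part of the original code.
public static class NullVsEmptyDemo
{
    // Groups by an anonymous-type key as-is: null and "" produce different keys.
    public static int CountGroups(string[] addr1Values)
    {
        return addr1Values
            .GroupBy(a => new { Addr1 = a })
            .Count();
    }

    // Same grouping, but null is coalesced to "" before the key is built.
    public static int CountGroupsCoalesced(string[] addr1Values)
    {
        return addr1Values
            .GroupBy(a => new { Addr1 = a ?? string.Empty })
            .Count();
    }
}
```

Passing an array that holds one null and one "" to both methods should give two groups from the first and a single group from the second.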

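I know I can go through each record and replace the nulls with "", but that took 1 minute for 4,000 records, and it would be repeated every time someone clicks the button.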

I found this empty-string problem because I had originally created a class with only some of the fields (the table has more than 40 fields).

List<CombineClass> mg = (from m in db.MG_Backup
   where m.IsArchived == false
   select new CombineClass { id = m.ID, name = m.Name, addr1 = string.IsNullOrEmpty(m.Addr1) ? "" : m.Addr1, addr2 = string.IsNullOrEmpty(m.Addr2) ? "" : m.Addr2, city = m.City, state = m.State }).ToList(); 

Any ideas?

2 Answers

This version is compatible with LINQ to SQL / LINQ to Entities:

var dups2 = from m in mg_B
    group m by new 
    { 
        Addr1 = m.Addr1 ?? string.Empty, 
        Addr2 = m.Addr2 ?? string.Empty, 
        City  = m.City ?? string.Empty, 
        State = m.State ?? string.Empty,
    }
    into g
    where g.Count() > 1
    select g;

The generated SQL looks something like this:

-- Parameters
DECLARE @p0 NVarChar(1000) = ''
DECLARE @p1 NVarChar(1000) = ''
DECLARE @p2 NVarChar(1000) = ''
DECLARE @p3 NVarChar(1000) = ''
DECLARE @p4 Int = 1

SELECT [t2].[value2] AS [Addr1], [t2].[value22] AS [Addr2], [t2].[value3] AS [City], [t2].[value4] AS [State]
FROM (
    SELECT COUNT(*) AS [value], [t1].[value] AS [value2], [t1].[value2] AS [value22], [t1].[value3], [t1].[value4]
    FROM (
        SELECT COALESCE([t0].[Addr1],@p0) AS [value], COALESCE([t0].[Addr2],@p1) AS [value2], COALESCE([t0].[City],@p2) AS [value3], COALESCE([t0].[State],@p3) AS [value4]
        FROM [SettingSystemNodes] AS [t0]
        ) AS [t1]
    GROUP BY [t1].[value], [t1].[value2], [t1].[value3], [t1].[value4]
    ) AS [t2]
WHERE [t2].[value] > @p4

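Note that if you assign string.Empty to a local variable before the query, or even to a let inside the query, only one parameter will be used for the empty string.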
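
As a rough sketch of the let variant mentioned above: the `Address` class and the `LetDemo` wrapper are hypothetical stand-ins, and the query runs here against an in-memory sequence, so it only illustrates the query shape. It does not verify how many parameters a LINQ to SQL provider would emit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one row of mg_B.
public class Address
{
    public string Addr1 { get; set; }
    public string Addr2 { get; set; }
    public string City { get; set; }
    public string State { get; set; }
}

public static class LetDemo
{
    // Returns the size of every group that contains more than one row.
    public static List<int> DuplicateGroupSizes(IEnumerable<Address> rows)
    {
        var dups = from m in rows
                   let empty = string.Empty   // one shared value for every coalesce
                   group m by new
                   {
                       Addr1 = m.Addr1 ?? empty,
                       Addr2 = m.Addr2 ?? empty,
                       City = m.City ?? empty,
                       State = m.State ?? empty,
                   }
                   into g
                   where g.Count() > 1
                   select g.Count();

        return dups.ToList();
    }
}
```

Declaring `var empty = string.Empty;` before the query and using it in place of the let should work the same way.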

answered 2013-04-03T15:24:05.200

Well, here is the brute-force way:

var dups2 = from m in mg_B
    group m by new { 
        Addr1 = (string.IsNullOrEmpty(m.Addr1) ? "" : m.Addr1), 
        Addr2 = (string.IsNullOrEmpty(m.Addr2) ? "" : m.Addr2), 
        City  = (string.IsNullOrEmpty(m.City)  ? "" : m.City ), 
        State = (string.IsNullOrEmpty(m.State) ? "" : m.State),
        ...
        }
    into g
    where g.Count() > 1
    select g;

If you want the code to look cleaner, you can write an extension method on string:

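public static class StringExtensions // hypothetical class name; extension methods must be declared in a static class
{
    public static string EmptyForNull(this string s)
    {
        return string.IsNullOrEmpty(s) ? "" : s;
    }
}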

Then your query would be:

var dups2 = from m in mg_B
    group m by new { 
        Addr1 = m.Addr1.EmptyForNull(), 
        Addr2 = m.Addr2.EmptyForNull(), 
        City  = m.City.EmptyForNull(), 
        State = m.State.EmptyForNull(),
        ...
        }
    into g
    where g.Count() > 1
    select g;
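
For illustration, here is a self-contained sketch of the extension-method approach run against an in-memory list. `Row`, `NullStringExtensions`, and `ExtensionDemo` are hypothetical names, and only two key fields are used to keep it short. It assumes LINQ to Objects; a database-backed LINQ provider generally cannot translate a custom method like this into SQL, so treat that case as untested here.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical container class for the extension method.
public static class NullStringExtensions
{
    // Maps null to "" so that both values yield the same grouping key.
    public static string EmptyForNull(this string s)
    {
        return string.IsNullOrEmpty(s) ? "" : s;
    }
}

// Hypothetical record type with a reduced set of fields.
public class Row
{
    public int Id { get; set; }
    public string Addr1 { get; set; }
    public string City { get; set; }
}

public static class ExtensionDemo
{
    // Returns the Ids of all rows whose normalized (Addr1, City) key
    // is shared with at least one other row.
    public static List<int> DuplicateIds(IEnumerable<Row> rows)
    {
        return (from m in rows
                group m by new
                {
                    Addr1 = m.Addr1.EmptyForNull(), // safe on null: extension methods are static calls
                    City = m.City.EmptyForNull(),
                }
                into g
                where g.Count() > 1
                from row in g
                select row.Id).ToList();
    }
}
```

Because an extension method compiles to an ordinary static call, invoking `EmptyForNull()` on a null string does not throw.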

However, this would probably be much faster if it were done in SQL rather than in LINQ.

answered 2013-04-03T13:42:51.893