I've a problem which is annoying the hell out of me!

I have a database with several thousand users. The data originally came from a database I can't trust, so I imported it into a separate 'clean-up' database to remove duplicate entries.

I performed the query:

SELECT uid, username 
FROM users
GROUP BY username 
HAVING COUNT(username)>1

This is a sample of my table in its present state:

uid     forename     surname     username
1       Jo           Bloggs      jobloggs
2       Jo           Bloggs      jobloggs
3       Jane         Doe         janedoe
4       Jane         Doe         janedoe

After performing the query above, I get the following sample result:

uid     forename     surname     username
2       Jo           Bloggs      jobloggs

As you can see, there are two duplicated users, but the query only displays one of them.

When I run the query, I get roughly 300 results. Obviously, if the query isn't pulling all the duplicates, I can't trust this result set to be accurate and can't proceed with the clean-up.

Any ideas about what I can try?

Thanks

Phil

1 Answer

There's no good explanation for the result set you're getting.

Given the sample data and your query, you should be getting a second row:

3   janedoe

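(Actually, whether the uid returned is 3 or 4 is arbitrary: uid is neither aggregated nor part of the GROUP BY, so MySQL is free to pick either row's value.)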
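
Because the uid reported per group is arbitrary, it may help during clean-up to list every row in each duplicate group instead of one representative. A minimal sketch, assuming the table is named users as in the question and that the columns match the sample data:

```sql
-- List every row whose username occurs more than once,
-- rather than one arbitrary row per duplicated username.
SELECT u.uid, u.forename, u.surname, u.username
  FROM users u
  JOIN ( SELECT username
           FROM users
          GROUP BY username
         HAVING COUNT(*) > 1
       ) d
    ON d.username = u.username
 ORDER BY u.username, u.uid;
```

With the sample data, this should return all four rows (uid 1 to 4) if the usernames really are identical. If uid 3 and 4 are still missing, that points to the values differing in some way that isn't visible on screen.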

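Also, make sure your client isn't returning only a subset of the rows; SQLyog, for example, has a "limit rows" feature that caps the number of rows returned.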

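If that's not the problem, the most likely explanation is that one of the 'janedoe' values contains a non-printable character, or that you've got some wicked character set translation going on, where two different encodings display as the same value.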

As a quick first step, I'd check the length of each of the 'janedoe' values:

SELECT username, LENGTH(username) FROM mytable WHERE uid IN (3,4) ORDER BY uid
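
One thing to keep in mind with the query above: in MySQL, LENGTH() counts bytes while CHAR_LENGTH() counts characters, so the two can disagree for multi-byte values. A small variation that shows both side by side (mytable is the placeholder table name used in this answer):

```sql
-- Compare character count and byte count for the two suspect rows.
-- A mismatch between rows, or between chars and bytes within a row,
-- hints at a hidden or multi-byte character.
SELECT uid
     , username
     , CHAR_LENGTH(username) AS chars
     , LENGTH(username)      AS bytes
  FROM mytable
 WHERE uid IN (3,4)
 ORDER BY uid;
```

For a plain ASCII 'janedoe', both columns should be 7 on both rows.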

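In addition, you could try displaying the actual encoding with the HEX() function to see if there's a difference. (NOTE: I'm not clear whether character set translation happens before or after the HEX; what we're after here is the MySQL equivalent of Oracle's DUMP() function, which displays the actual byte-by-byte representation of a value.)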

You might have some Latin1 encoding getting translated to UTF-8, or vice versa, or some other weird character set thing going on. This might give you some ideas:

SELECT username
     , HEX(username)
     , HEX(BINARY username)
     , CONVERT(BINARY username USING latin1) 
     , CONVERT(BINARY username USING utf8)
  FROM mytable 
 WHERE uid IN (3,4)
 ORDER BY uid
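
If the hex output does turn up stray bytes, a rough way to find other affected rows is to search for usernames containing anything outside printable ASCII. This is only a sketch: it assumes usernames are meant to be plain ASCII, and REGEXP behaviour with multi-byte data differs between MySQL versions, so treat the matches as candidates to inspect rather than a definitive list.

```sql
-- Flag usernames containing any character outside the printable
-- ASCII range (space through tilde), e.g. tabs, control characters,
-- non-breaking spaces or accented letters.
SELECT uid
     , username
     , HEX(username) AS username_hex
  FROM mytable
 WHERE username REGEXP '[^ -~]'
 ORDER BY username, uid;
```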
answered 2012-12-20T23:14:04.233