unicode - PostgreSQL groups Unicode strings incorrectly?

Question

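Something very strange is happening to me: I noticed that a group by (word), where word is a UTF-8 string, does not always group by word. Within the same query I get some words that are grouped correctly and others that are not. Does anyone know what is going on?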

select *,count(*) over (partition by md5(word)) as k
from (
  select word,count(*) as n
  from :tmpwl
  group by 1
) a order by 1,2 limit 12;
/* gives:
 word | n | k 
------+---+---
 いい | 1 | 1
 くず | 1 | 1
 ごみ | 1 | 1
 さま | 1 | 1
 さん | 1 | 1
 へま | 1 | 1
 まめ | 1 | 1
 よく | 1 | 1
 ろく | 1 | 1
 ネガ | 1 | 2   -- what the heck?
 ネガ | 1 | 2
 パス | 1 | 1
*/

Note that the following workaround works correctly:

select word,n,count(*) over (partition by md5(word)) as k
from (
  select md5(word),max(word) as word,count(*) as n
  from :tmpwl
  group by 1
) a order by 1,2 limit 12;
/* gives:
 word | n | k 
------+---+---
 いい | 1 | 1
 くず | 1 | 1
 ごみ | 1 | 1
 さま | 1 | 1
 さん | 1 | 1
 へま | 1 | 1
 まめ | 1 | 1
 よく | 1 | 1
 ろく | 1 | 1
 ネガ | 2 | 1
 パス | 1 | 1
 プア | 1 | 1
*/

The version is PostgreSQL 8.2.14 (Greenplum Database 4.0.4.0 build 3 Single-Node Edition) on x86_64-unknown-linux-gnu, compiled by GCC gcc.exe (GCC) 4.1.1, compiled on Nov 30 2010 17:20:26.

The source table :tmpwl:

\d :tmpwl
Table "pg_temp_25149.pdtmp_foo706453357357532"
  Column  |  Type   | Modifiers 
----------+---------+-----------
 baseword | text    | 
 word     | text    | 
 value    | integer | 
 lexicon  | text    | 
 nalts    | bigint  | 
Distributed by: (word)
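
Since md5(word) is the same for both ネガ rows (k = 2 in the first result), the two values look byte-for-byte identical, so an invisible difference in the text seems unlikely. A minimal sketch of a check that might help narrow this down, assuming the same psql variable :tmpwl as above; the column aliases are made up for illustration, and the idea that the collation setting matters here is only a guess:

```sql
-- Hypothetical diagnostic, not part of the original session.
-- Lists every row whose word falls into more than one group even though
-- the md5 of the word is the same, with character and byte lengths.
select word,
       length(word)       as n_chars,
       octet_length(word) as n_bytes,
       md5(word)          as h
from :tmpwl
where md5(word) in (
  select md5(word)
  from (select word from :tmpwl group by 1) a
  group by 1
  having count(*) > 1
);

-- The collation in effect can influence how text values are compared
-- (assumption: it may be relevant to the grouping seen above).
show lc_collate;
```

If n_chars, n_bytes and h all match for the duplicated rows, the strings themselves are presumably identical and the grouping step is the place to look.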
