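I'm seeing something very strange: when the column in a group by (word) holds UTF-8 strings, the rows are not always grouped by word. Within the same query, some words are grouped correctly and others are not. Does anyone know what is going on?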

select *,count(*) over (partition by md5(word)) as k
from (
  select word,count(*) as n
  from :tmpwl
  group by 1
) a order by 1,2 limit 12;
/* gives:
 word | n | k 
------+---+---
 いい | 1 | 1
 くず | 1 | 1
 ごみ | 1 | 1
 さま | 1 | 1
 さん | 1 | 1
 へま | 1 | 1
 まめ | 1 | 1
 よく | 1 | 1
 ろく | 1 | 1
 ネガ | 1 | 2   -- what the heck?
 ネガ | 1 | 2
 パス | 1 | 1
*/

Note that the following workaround, which groups by md5(word) instead of word, works fine:

select word,n,count(*) over (partition by md5(word)) as k
from (
  select md5(word),max(word) as word,count(*) as n
  from :tmpwl
  group by 1
) a order by 1,2 limit 12;
/* gives:
 word | n | k 
------+---+---
 いい | 1 | 1
 くず | 1 | 1
 ごみ | 1 | 1
 さま | 1 | 1
 さん | 1 | 1
 へま | 1 | 1
 まめ | 1 | 1
 よく | 1 | 1
 ろく | 1 | 1
 ネガ | 2 | 1
 パス | 1 | 1
 プア | 1 | 1
*/

Version: PostgreSQL 8.2.14 (Greenplum Database 4.0.4.0 build 3 Single-Node Edition) on x86_64-unknown-linux-gnu, compiled by GCC gcc.exe (GCC) 4.1.1 compiled on Nov 30 2010 17:20:26.

Source table: tmpwl

\d :tmpwl
Table "pg_temp_25149.pdtmp_foo706453357357532"
  Column  |  Type   | Modifiers 
----------+---------+-----------
 baseword | text    | 
 word     | text    | 
 value    | integer | 
 lexicon  | text    | 
 nalts    | bigint  | 
Distributed by: (word)