我发生了一件非常奇怪的事情,我注意到如果 agroup by (word)
是 UTF-8 字符串,它并不总是按单词分组。在同一个查询中,我得到了正确分组的情况,以及没有正确分组的情况。我想知道是否有人知道这是怎么回事?
select *,count(*) over (partition by md5(word)) as k
from (
select word,count(*) as n
from :tmpwl
group by 1
) a order by 1,2 limit 12;
/* gives:
word | n | k
------+---+---
いい | 1 | 1
くず | 1 | 1
ごみ | 1 | 1
さま | 1 | 1
さん | 1 | 1
へま | 1 | 1
まめ | 1 | 1
よく | 1 | 1
ろく | 1 | 1
ネガ | 1 | 2 -- what the heck?
ネガ | 1 | 2
パス | 1 | 1
*/
请注意,以下解决方法可以正常工作:
select word,n,count(*) over (partition by md5(word)) as k
from (
select md5(word),max(word) as word,count(*) as n
from :tmpwl
group by 1
) a order by 1,2 limit 12;
/* gives:
word | n | k
------+---+---
いい | 1 | 1
くず | 1 | 1
ごみ | 1 | 1
さま | 1 | 1
さん | 1 | 1
へま | 1 | 1
まめ | 1 | 1
よく | 1 | 1
ろく | 1 | 1
ネガ | 2 | 1
パス | 1 | 1
プア | 1 | 1
*/
版本为x86_64-unknown-linux-gnu上的PostgreSQL 8.2.14(Greenplum Database 4.0.4.0 build 3 Single-Node Edition),由GCC gcc.exe(GCC)4.1.1编译,编译于2010年11月30日17:20: 26.
源表:tmpwl
:
\d :tmpwl
Table "pg_temp_25149.pdtmp_foo706453357357532"
Column | Type | Modifiers
----------+---------+-----------
baseword | text |
word | text |
value | integer |
lexicon | text |
nalts | bigint |
Distributed by: (word)