postgresql - PostgreSQL k-nearest neighbor (KNN) on a multidimensional cube

Question

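I have a cube with 8 dimensions and I want to do nearest-neighbor matching. I am completely new to PostgreSQL. I read that 9.1 supports nearest-neighbor matching on multiple dimensions. I would appreciate it if someone could give a complete example:

How do I create a table with an 8D cube?
Sample insert
Lookup - exact match
Lookup - nearest-neighbor match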

Sample data:

For simplicity, we can assume all values are in the range 0-100.

Point1: (1,1,1,1, 1,1,1,1)

Point2: (2,2,2,2, 2,2,2,2)

Lookup value: (1,1,1,1, 1,1,1,2)

This should match Point1 rather than Point2.

References:

What's_new_in_PostgreSQL_9.1

https://en.wikipedia.org/wiki/K-d_tree#Nearest_neighbour_search

score 6 · Accepted Answer

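PostgreSQL supports the distance operator <->, and as far as I understand it, it can be used for analyzing text (with the pg_trgm module) and for geometric data types.

I don't know how to use it with more than one dimension. Perhaps you will have to define your own distance function, or somehow convert your data into a single column of text or geometric type. For example, if you have a table with 8 columns (an 8-dimensional cube):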

c1 c2 c3 c4 c5 c6 c7 c8
 1  0  1  0  1  0  1  2

You can convert it to:

c1 c2 c3 c4 c5 c6 c7 c8
 a  b  a  b  a  b  a  c

And then to a table with a single column:

c1
abababac

Then you can use (after creating a GiST index):

SELECT c1, c1 <-> 'ababab'
 FROM test_trgm 
 ORDER BY c1 <-> 'ababab';
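The query above needs a table and an index to run against. A minimal setup sketch follows, assuming the pg_trgm extension can be installed in your database. The table name test_trgm and the column c1 come from the query; the index name and the second sample row are made up for illustration.

```sql
-- Assumes the contrib module pg_trgm is available on the server.
create extension if not exists pg_trgm;

-- Single text column holding the encoded 8-dimensional value.
create table test_trgm (c1 text);

-- 'abababac' is the encoded row from the example; 'bbbbbbbb' is a
-- hypothetical second row so the ordering has something to compare.
insert into test_trgm (c1) values ('abababac'), ('bbbbbbbb');

-- text has no default GiST operator class, so gist_trgm_ops is named explicitly.
create index test_trgm_c1_gist on test_trgm using gist (c1 gist_trgm_ops);
```

With only a handful of rows the planner will probably choose a sequential scan; the index starts to matter on larger tables.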

Example

Create sample data

-- Create some temporary data
-- Note: the tables are created in the tmp schema (change the SQL to use your own schema) and are dropped first if they already exist!
drop table if exists tmp.test_data;

-- Random integer matrix 100*8 
create table tmp.test_data as (
   select 
      trunc(random()*100)::int as input_variable_1,
      trunc(random()*100)::int as input_variable_2, 
      trunc(random()*100)::int as input_variable_3,
      trunc(random()*100)::int as input_variable_4, 
      trunc(random()*100)::int as input_variable_5, 
      trunc(random()*100)::int as input_variable_6, 
      trunc(random()*100)::int as input_variable_7, 
      trunc(random()*100)::int as input_variable_8
   from 
      generate_series(1,100,1)
);

Convert the input data to text

drop table if exists tmp.test_data_trans;

create table tmp.test_data_trans as (
select 
   input_variable_1 || ';' ||
   input_variable_2 || ';' ||
   input_variable_3 || ';' ||
   input_variable_4 || ';' ||
   input_variable_5 || ';' ||
   input_variable_6 || ';' ||
   input_variable_7 || ';' ||
   input_variable_8 as trans_variable
from 
   tmp.test_data
);

This gives you a single variable, trans_variable, in which all 8 dimensions are stored:

trans_variable
40;88;68;29;19;54;40;90
80;49;56;57;42;36;50;68
29;13;63;33;0;18;52;77
44;68;18;81;28;24;20;89
80;62;20;49;4;87;54;18
35;37;32;25;8;13;42;54
8;58;3;42;37;1;41;49
70;1;28;18;47;78;8;17

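Instead of the || operator you can also use the following syntax (shorter, but more cryptic). Note that it casts the whole row to text, so the values come out in row form, for example (40,88,68,29,19,54,40,90), rather than separated by semicolons: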

select 
   array_to_string(string_to_array(t.*::text,''),'') as trans_variable
from 
   tmp.test_data t

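Add an index

-- Requires the pg_trgm extension (create extension pg_trgm;).
-- text has no default GiST operator class, so gist_trgm_ops must be named explicitly.
create index test_data_gist_index on tmp.test_data_trans using gist(trans_variable gist_trgm_ops);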

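Test the distance

Note: I picked one row from the table, 52;42;18;50;68;29;8;55, and used a slightly altered value (42;42;18;52;98;29;8;55) to test the distance. Of course, your test data will contain completely different values, because it is a random matrix.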

select 
   *, 
   trans_variable <-> '42;42;18;52;98;29;8;55' as distance,
   similarity(trans_variable, '42;42;18;52;98;29;8;55') as similarity
from 
   tmp.test_data_trans 
order by
   -- order by the same lookup value that the distance column uses
   trans_variable <-> '42;42;18;52;98;29;8;55';

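You can use either the distance operator <-> or the similarity function; distance = 1 - similarity. Keep in mind that this is trigram similarity between the strings, not a numeric distance between the points.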
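For comparison, the "define your own distance function" route mentioned earlier can be sketched as a plain Euclidean distance computed over the eight original columns of tmp.test_data. This is an illustration only, not part of the trigram approach: it scans the whole table and cannot use the GiST index. The lookup values are the altered row from the note above.

```sql
-- Brute-force Euclidean distance to the lookup point
-- (42, 42, 18, 52, 98, 29, 8, 55); no index is used.
select
   t.*,
   sqrt(
      power(input_variable_1 - 42, 2) +
      power(input_variable_2 - 42, 2) +
      power(input_variable_3 - 18, 2) +
      power(input_variable_4 - 52, 2) +
      power(input_variable_5 - 98, 2) +
      power(input_variable_6 - 29, 2) +
      power(input_variable_7 - 8, 2) +
      power(input_variable_8 - 55, 2)
   ) as euclidean_distance
from
   tmp.test_data t
order by
   euclidean_distance
limit 5;
```

Comparing this ordering with the trigram ordering on the same data is a quick way to see how far string similarity is from a numeric nearest-neighbor match.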

score 5 · Accepted Answer

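A patch "introducing kNN search for cubes with Euclidean, taxicab and Chebyshev distances" was recently posted to the pgsql-hackers list. If you can use a customized PostgreSQL build, it may suit your purposes.

Note that the cube type is a PostgreSQL extension that can represent points or cubes in n dimensions. (By default n can be up to 100, or more if the limit in cubedata.h is raised.) This patch should therefore enable index-assisted nearest-neighbor search over multidimensional points/vectors/cubes.

(Without this patch, the cube type has no <-> distance operator, and OPERATOR CLASS gist_cube_ops lacks the support function (#8) that GiST needs in order to build distance-aware indexes on these values.)

I have not tried this patch yet, and I noticed that one reply in the discussion thread suggests it may currently break some regression tests.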
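As far as I know, this functionality later shipped with the cube extension in PostgreSQL 9.6, so on such a release the four steps from the question can be sketched as below. Treat it as a sketch under that assumption; the table, column and index names (knn_points, label, pt, knn_points_pt_gist) are made up for illustration.

```sql
-- Assumes PostgreSQL 9.6 or later with the contrib "cube" extension available.
create extension if not exists cube;

-- 1. Table with an 8-dimensional point stored as a cube value.
create table knn_points (
   id    serial primary key,
   label text,
   pt    cube
);

-- 2. Sample inserts (the two points from the question).
insert into knn_points (label, pt) values
   ('Point1', '(1,1,1,1,1,1,1,1)'),
   ('Point2', '(2,2,2,2,2,2,2,2)');

-- GiST index so that ORDER BY ... <-> can be index-assisted.
create index knn_points_pt_gist on knn_points using gist (pt);

-- 3. Lookup - exact match.
select label
from knn_points
where pt = '(1,1,1,1,1,1,1,1)'::cube;

-- 4. Lookup - nearest neighbor by Euclidean distance (<->).
select label, pt <-> '(1,1,1,1,1,1,1,2)'::cube as distance
from knn_points
order by pt <-> '(1,1,1,1,1,1,1,2)'::cube
limit 1;
```

For the lookup value (1,1,1,1, 1,1,1,2) the Euclidean distance to Point1 is 1 and to Point2 is sqrt(7), so the last query should return Point1. In the same releases the operators <#> and <=> are meant to give the taxicab and Chebyshev distances respectively.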
