sql - n-grams from text in PostgreSQL

Question

I am looking to create n-grams from a text column in PostgreSQL. I currently split the data (sentences) in the text column on whitespace into an array.

select regexp_split_to_array(sentenceData,E'\\s+') from tableName

Once I have this array, how do I go about:

Creating a loop to find the n-grams, and writing each one to a row in another table

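Using unnest I can get all the elements of all the arrays on separate rows, and maybe I can then think of a way to get n-grams from a single column, but I would lose the sentence boundaries, which I want to preserve.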

Sample PostgreSQL SQL code to emulate the above scenario:

create table tableName(sentenceData  text);

INSERT INTO tableName(sentenceData) VALUES('This is a long sentence');

INSERT INTO tableName(sentenceData) VALUES('I am currently doing grammar, hitting this monster book btw!');

INSERT INTO tableName(sentenceData) VALUES('Just tonnes of grammar, problem is I bought it in TAIWAN, and so there aint any englihs, just chinese and japanese');

select regexp_split_to_array(sentenceData,E'\\s+')   from tableName;

select unnest(regexp_split_to_array(sentenceData,E'\\s+')) from tableName;

score 3 · Accepted Answer

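Take a look at pg_trgm: "The pg_trgm module provides functions and operators for determining the similarity of text based on trigram matching, as well as index operator classes that support fast searching for similar strings."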
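
As a rough illustration only, not necessarily what the answer had in mind: pg_trgm works on character trigrams within each word, so it does not by itself produce word-level n-grams. The sketch below first shows the extension's `show_trgm` and `similarity` functions, then a plain-SQL way to build word bigrams per row with array slices, so that no n-gram crosses a sentence boundary. It assumes the `tableName` table from the question, PostgreSQL 9.4 or later (for `LATERAL` and `cardinality`), and permission to install extensions. The aliases `w`, `i` and `bigram` are made up for the example.

```sql
-- Character trigrams with pg_trgm (the extension must be installed first).
CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT show_trgm('grammar');                 -- array of character trigrams
SELECT similarity('grammar', 'grammer');     -- similarity score between 0 and 1

-- Word-level bigrams, one row per bigram, computed per sentence.
-- words[i:i + 1] is an array slice holding two consecutive words.
SELECT t.sentenceData,
       array_to_string(w.words[i:i + 1], ' ') AS bigram
FROM tableName AS t
CROSS JOIN LATERAL (
    SELECT regexp_split_to_array(t.sentenceData, E'\\s+') AS words
) AS w
CROSS JOIN LATERAL generate_series(1, cardinality(w.words) - 1) AS i;
```

For n-grams of another size n, the slice would become `words[i:i + n - 1]` and the upper bound of `generate_series` would become `cardinality(w.words) - n + 1`. Prefixing the query with `INSERT INTO ... SELECT` would write each n-gram to a row in another table, as the question asks.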
