regex - Efficient SQL statement to encode text for NLP locutions

Question

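Background

A locution is a noun phrase composed of at least two words, for example:

black olive
hot pepper sauce
rose finn apple potato

Taken individually, the words black and olive are an adjective (black - JJ) and a noun (olive - NN). However, humans know that black olive is a noun (distinguishing it from, say, green olive).

The question here is how to most efficiently convert a list of normalized ingredient names (such as the list above) into a specific format for a natural language processor (NLP).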

Example Data

The table can be created as follows:

CREATE TABLE ingredient_name (
  id bigserial NOT NULL, -- Uniquely identifies the ingredient.
  label character varying(30) NOT NULL
);

The following SQL statements insert actual database records:

insert into ingredient_name (label) values ('alfalfa sprout');
insert into ingredient_name (label) values ('almond extract');
insert into ingredient_name (label) values ('concentrated apple juice');
insert into ingredient_name (label) values ('black-eyed pea');
insert into ingredient_name (label) values ('rose finn apple potato');

Data Format

The general format is:

lexeme1_lexeme2_<lexemeN> lexeme1_lexeme2_lexemeN NN

Given the list of words above, the NLP expects:

black_<olive> black_olive NN
hot_pepper_<sauce> hot_pepper_sauce NN
rose_finn_apple_<potato> rose_finn_apple_potato NN

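The database has a table (recipe.ingredient_name) and a column (label). The labels are normalized (e.g., single spaces, lowercase).

SQL Statement

The code that produces the expected results: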

CREATE OR REPLACE VIEW ingredient_locutions_vw AS 
SELECT
  t.id,
   -- Replace spaces with underscores
  translate( t.prefix, ' ', '_' )
  || '<' || t.suffix || '>' || ' ' ||
  translate( t.label, ' ', '_' )
  || ' NN' AS locution_nlp
FROM (
  SELECT
    id,

    -- Ingredient name
    label,

    -- All words except the last word
    left( label, abs( strpos( reverse( label ), ' ' ) - length( label ) ) + 1 ) AS prefix,

    -- Just the last word
    substr( label,
       length( label ) - strpos( reverse( label ), ' ' ) + 2
    ) AS suffix
  FROM
    ingredient_name
  WHERE
    -- Limit set to ingredient names having at least one space
    strpos( label, ' ' ) > 0
) AS t;
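
To make the position arithmetic in the view easier to follow, here is a minimal sketch that evaluates the intermediate values for a single label. The inline VALUES row and the aliases pos_from_end and len are illustrative only and are not part of the view.

```sql
-- Illustrative only: trace the prefix/suffix arithmetic for one label.
SELECT
  label,
  -- Position of the last space, counted from the end of the string
  strpos( reverse( label ), ' ' ) AS pos_from_end,   -- 7
  length( label ) AS len,                            -- 22
  -- Everything up to and including the last space: 'rose finn apple '
  left( label, abs( strpos( reverse( label ), ' ' ) - length( label ) ) + 1 ) AS prefix,
  -- Everything after the last space: 'potato'
  substr( label, length( label ) - strpos( reverse( label ), ' ' ) + 2 ) AS suffix
FROM ( VALUES ( 'rose finn apple potato' ) ) AS v( label );
```

Note that the prefix keeps its trailing space, which translate() then turns into the underscore that precedes the opening angle bracket.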

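Problem

What is a more efficient (or elegant) way to split the prefix (all words except the last) and the suffix (just the last word) in the code above?

The system is PostgreSQL 9.1.

Thank you!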

score 1 · Accepted Answer

CREATE OR REPLACE VIEW ingredient_locutions_vw AS 
SELECT
    t.id,
    format('%s_<%s> %s NN', 
        array_to_string(t.prefix, '_'), 
        t.suffix, 
        array_to_string(t.label, '_')
    ) AS locution_nlp
FROM (
    SELECT
        id,

        -- Ingredient name
        label,

        -- All words except the last word
        label[1:array_length(label, 1) - 1] AS prefix,

        -- Just the last word
        label[array_length(label, 1)] AS suffix
    FROM (
        select id, string_to_array(label, ' ') as label
        from ingredient_name
    ) s
    WHERE
        -- Limit set to ingredient names having at least one space
        array_length(label, 1) > 1
) AS t;
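
The view relies on splitting each label into an array of words and then slicing it. As a small sketch of those intermediate values for one label (the alias words and the literal label are illustrative, not part of the view):

```sql
-- Illustrative only: show the array, its prefix slice, and its last element.
SELECT
    words,                                              -- {rose,finn,apple,potato}
    words[1:array_length(words, 1) - 1] AS prefix,      -- {rose,finn,apple}
    words[array_length(words, 1)]       AS suffix       -- potato
FROM (
    SELECT string_to_array('rose finn apple potato', ' ') AS words
) s;
```

Because the prefix is an array without a trailing separator, the format string supplies the underscore before the opening angle bracket itself.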

select * from ingredient_locutions_vw ;
 id |                      locution_nlp                      
----+--------------------------------------------------------
  1 | alfalfa_<sprout> alfalfa_sprout NN
  2 | almond_<extract> almond_extract NN
  3 | concentrated_apple_<juice> concentrated_apple_juice NN
  4 | black-eyed_<pea> black-eyed_pea NN
  5 | rose_finn_apple_<potato> rose_finn_apple_potato NN
(5 rows)
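
Since the title mentions regex, a regular-expression variant is also possible. The following is only a sketch of an alternative, not part of the view above: it assumes the labels are normalized as described (single spaces, no trailing space), wraps the last word in angle brackets with regexp_replace, and then replaces the spaces with underscores. The E'' string syntax is used so that the backreference is read the same way regardless of the standard_conforming_strings setting.

```sql
-- Alternative sketch (assumption: labels contain single spaces only).
SELECT
    id,
    -- ' pea' becomes ' <pea>', then every space becomes an underscore
    translate(regexp_replace(label, ' ([^ ]+)$', E' <\\1>'), ' ', '_')
    || ' ' || translate(label, ' ', '_')
    || ' NN' AS locution_nlp
FROM ingredient_name
WHERE
    -- Limit set to ingredient names having at least one space
    strpos(label, ' ') > 0;
```

For the sample rows this should yield the same locution_nlp values as the view; whether it is faster than the array approach would need to be measured on the real data.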
