How can phrases be extracted from selected rows of a table and ranked by how often they occur?
Example 1: http://developer.yahoo.com/search/content/V1/termExtraction.html
Example 2: http://mirror.me/i/love
INPUT:
CREATE TABLE phrases (
id BIGSERIAL,
phrase VARCHAR(10000)
);
INSERT INTO phrases (phrase) VALUES ('Italian sculptors and painters of the renaissance favored the Virgin Mary for inspiration.');
INSERT INTO phrases (phrase) VALUES ('Andrea Bolgi was an italian sculptor');
DESIRED OUTPUT:
phrase | weight
italian sculptor | 5
virgin mary | 2
painters | 1
renaissance | 1
inspiration | 1
Andrea Bolgi | 1
To find single words rather than phrases, one can use:
SELECT * FROM ts_stat('SELECT to_tsvector(''simple'', phrase) FROM phrases')
ORDER BY nentry DESC, ndoc DESC, word;
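
For comparison, here is a rough sketch of what a naive two-word phrase count might look like against the same `phrases` table. It is only an illustration under stated assumptions: it needs PostgreSQL 9.4 or later (for `WITH ORDINALITY`), treats any run of non-letters as a separator, and does no stemming, synonym grouping, or stop-word handling, so "italian sculptors" and "italian sculptor" are counted as different phrases and pairs such as "of the" also appear. The CTE names `words` and `pairs` are made up for this example.

```sql
-- Naive bigram count: split each phrase into lowercase words, pair every word
-- with the word that follows it in the same row, then count identical pairs.
WITH words AS (
    SELECT p.id, t.word, t.pos
    FROM phrases AS p
    CROSS JOIN LATERAL regexp_split_to_table(lower(p.phrase), '[^a-z]+')
         WITH ORDINALITY AS t(word, pos)
    WHERE t.word <> ''          -- drop the empty token left by trailing punctuation
),
pairs AS (
    SELECT word || ' ' || lead(word) OVER (PARTITION BY id ORDER BY pos) AS phrase
    FROM words                  -- the last word of each row yields NULL
)
SELECT phrase, count(*) AS weight
FROM pairs
WHERE phrase IS NOT NULL
GROUP BY phrase
ORDER BY weight DESC, phrase;
```

This falls short of the desired output above precisely because of the missing stemming and stop-word logic, which is the part the question is about.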
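Some considerations:
- Phrases may contain "stop words", e.g. "easy to answer".
- Ideally, English word variations and synonyms would be grouped automatically.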
Would pg_trgm help? (It would be fine if only 2- and 3-word phrases were found.) If so, how exactly?
Related questions: