
I have a database with 80,000 rows, and while testing some FULLTEXT queries I got unexpected results. I have removed the stopwords from MySQL and set the minimum word length to 3.

When I run this query:

SELECT `sentence`, MATCH (`sentence`) AGAINST ('CAN YOU FLY') AS `relevance`
FROM `sentences`
WHERE MATCH (`sentence`) AGAINST ('CAN YOU FLY')
ORDER BY `relevance` DESC

it returns this result:

NO A FLY WITHOUT WINGS WOULD BE CALLED A WINGLESS | 10.623517036438
I CAN FLY                                         | 7.61278629302979
I CAN FLY :)                                      | 7.61278629302979
CAN YOU FLY?                                      | 7.61278629302979
THEY CAN FLY                                      | 7.61278629302979
YOU AM NOT FLY                                    | 7.61278629302979
CAN YOU FLY                                       | 7.61278629302979
HAVE YOU EVER SWALLOWED A FLY?                    | 7.52720737457275
I JUST WANNA FLY                                  | 7.52720737457275

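Why does "NO A FLY WITHOUT WINGS WOULD BE CALLED A WINGLESS" have the highest relevance when it contains only one of the search words? And why isn't "CAN YOU FLY" at the top, given that it is an exact match?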

I would like the results ordered by the number of matching keywords (most first), and then by the fewest words. That would give a logical result:

CAN YOU FLY
CAN YOU FLY?
I CAN FLY
THEY CAN FLY
I CAN FLY :)
YOU AM NOT FLY
HAVE YOU EVER SWALLOWED A FLY?
I JUST WANNA FLY
NO A FLY WITHOUT WINGS WOULD BE CALLED A WINGLESS

1 Answer


The MySQL Internals Manual gives the formula used to compute the weight:

w = (log(dtf)+1)/sumdtf * U/(1+0.0115*U) * log((N-nf)/nf)

where

dtf     is the number of times the term appears in the document
sumdtf  is the sum of (log(dtf)+1)'s for all terms in the same document
U       is the number of Unique terms in the document
N       is the total number of documents
nf      is the number of documents that contain the term

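The first sentence clearly contains more terms than the others, and the formula depends heavily on U, the number of unique terms in the document.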
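To make the formula easier to experiment with, here is a minimal Python sketch that evaluates it for a single term. The function names `term_weight` and `weight_in_sentence` are hypothetical, `log` is assumed to be the natural logarithm, and the value of `nf` in the usage lines is made up for illustration. Only N = 80,000 comes from the question. It is not meant to reproduce the exact scores MySQL reported.

```python
import math
from collections import Counter


def term_weight(dtf, sumdtf, U, N, nf):
    """Evaluate the quoted formula for one term in one document.

    Assumes natural logarithms; argument names follow the definitions above.
    """
    local = (math.log(dtf) + 1) / sumdtf      # term frequency, normalised per document
    pivot = U / (1 + 0.0115 * U)              # factor driven by the unique-term count
    rarity = math.log((N - nf) / nf)          # rarer terms score higher
    return local * pivot * rarity


def weight_in_sentence(words, term, N, nf):
    """Derive dtf, sumdtf and U from a list of indexed words, then apply the formula."""
    counts = Counter(words)
    dtf = counts[term]
    if dtf == 0:
        return 0.0                            # term not present in this document
    sumdtf = sum(math.log(c) + 1 for c in counts.values())
    U = len(counts)
    return term_weight(dtf, sumdtf, U, N, nf)


# Hypothetical inputs: nf=100 is an invented document frequency for "FLY".
short_doc = ["CAN", "YOU", "FLY"]
long_doc = ["FLY", "WITHOUT", "WINGS", "WOULD", "CALLED", "WINGLESS"]
print(weight_in_sentence(short_doc, "FLY", N=80000, nf=100))
print(weight_in_sentence(long_doc, "FLY", N=80000, nf=100))
```

Which words actually reach the index depends on the server's stopword list and minimum word length, so the word lists above are only an approximation of what MySQL would see.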

Based on your comment, I suggest using Boolean full-text search:

SELECT `sentence`, MATCH (`sentence`) AGAINST ('CAN YOU FLY' IN BOOLEAN MODE) AS `relevance`
FROM `sentences`
WHERE MATCH (`sentence`) AGAINST ('CAN YOU FLY' IN BOOLEAN MODE)
ORDER BY `relevance` DESC
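If the ordering still needs adjusting after the rows come back, the rule described in the question (most matching keywords first, then fewest words) can also be applied on the application side. The following is only a sketch of that alternative, not part of the Boolean-mode query above. The function name `rank_sentences` and the simple tokenising regex are assumptions made for illustration.

```python
import re

# Assumed tokeniser: runs of letters, digits and apostrophes count as words.
WORD_RE = re.compile(r"[A-Z0-9']+")


def rank_sentences(sentences, query):
    """Sort by number of distinct query words matched (descending),
    then by total word count (ascending). Ties keep their input order."""
    query_terms = set(WORD_RE.findall(query.upper()))

    def sort_key(sentence):
        words = WORD_RE.findall(sentence.upper())
        matched = len(query_terms & set(words))
        return (-matched, len(words))

    return sorted(sentences, key=sort_key)


rows = [
    "NO A FLY WITHOUT WINGS WOULD BE CALLED A WINGLESS",
    "I CAN FLY",
    "CAN YOU FLY?",
    "YOU AM NOT FLY",
    "CAN YOU FLY",
    "I JUST WANNA FLY",
]
for sentence in rank_sentences(rows, "CAN YOU FLY"):
    print(sentence)
```

Re-ranking in the application only makes sense for a result set small enough to fetch in full. For larger sets the ordering would have to be expressed in the query itself.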
answered 2013-03-21T22:51:24.643