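I am looking into full-text search engines for Django. The engine must be simple to install, index quickly, update its index quickly, not block while indexing, and search quickly.
After reading many web pages, I came up with a short list: MySQL MyISAM fulltext, djapian/python-xapian, and django-sphinx. I did not pick Lucene because it looks complicated, nor Haystack because it has fewer features than djapian/django-sphinx (field weighting, for example).
Then I ran some benchmarks. For this, I collected many free books from the web and generated a database table containing 1 485 000 records (id, title, body), each about 600 bytes long. From the database I also generated a list of 100 000 existing words and shuffled them to create a search list. I ran each test twice on my laptop (4 GB RAM, dual core 2.0 GHz): the first run right after a server reboot, to clear all caches, and the second run immediately afterwards, to see how well results are cached. Here are the "homemade" benchmark results: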
1485000 records with Title (150 bytes) and body (450 bytes)
MySQL 5.0.75/Ubuntu 9.04 Fulltext :
==========================================================================
Full indexing : 7m14.146s
1 thread, 1000 searches with single word randomly taken from database :
First run : 0:01:11.553524
next run : 0:00:00.168508
MySQL 5.5.4 m3/Ubuntu 9.04 Fulltext :
==========================================================================
Full indexing : 6m08.154s
1 thread, 1000 searches with single word randomly taken from database :
First run : 0:01:09.553524
next run : 0:00:20.316903
1 thread, 100000 searches with single word randomly taken from database :
First run : 9m09s
next run : 5m38s
1 thread, 10000 random strings (random strings should not be found in database) :
just after the 100000 search test : 0:00:15.007353
1 thread, boolean search : 1000 x (+word1 +word2)
First run : 0:00:21.205404
next run : 0:00:00.145098
Djapian Fulltext :
==========================================================================
Full indexing : 84m7.601s
1 thread, 1000 searches with single word randomly taken from database with prefetch :
First run : 0:02:28.085680
next run : 0:00:14.300236
python-xapian Fulltext :
==========================================================================
1 thread, 1000 searches with single word randomly taken from database :
First run : 0:01:26.402084
next run : 0:00:00.695092
django-sphinx Fulltext :
==========================================================================
Full indexing : 1m25.957s
1 thread, 1000 searches with single word randomly taken from database :
First run : 0:01:30.073001
next run : 0:00:05.203294
1 thread, 100000 searches with single word randomly taken from database :
First run : 12m48s
next run : 9m45s
1 thread, 10000 random strings (random strings should not be found in database) :
just after the 100000 search test : 0:00:23.535319
1 thread, boolean search : 1000 x (word1 word2)
First run : 0:00:20.856486
next run : 0:00:03.005416
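
For anyone who wants to repeat this kind of measurement, a single-threaded timing loop along the following lines should be enough. This is only a sketch of the procedure described above, not the script that produced the numbers: the table name `books_search`, the helper names, and the use of a DB-API cursor with `%s` placeholders are all assumptions.

```python
import random
import time
from datetime import timedelta


def build_search_list(words, size, seed=None):
    """Shuffle the distinct words and keep the first `size` of them."""
    pool = sorted(set(words))
    random.Random(seed).shuffle(pool)
    return pool[:size]


def time_searches(search, terms):
    """Call `search(term)` for every term on one thread; return the elapsed time."""
    start = time.perf_counter()
    for term in terms:
        search(term)
    return timedelta(seconds=time.perf_counter() - start)


def mysql_fulltext_search(cursor):
    """Wrap a DB-API cursor in a search callable.

    `books_search` and its columns are hypothetical names; the table is
    assumed to have a FULLTEXT index on (title, body).
    """
    sql = "SELECT id FROM books_search WHERE MATCH(title, body) AGAINST (%s)"

    def search(term):
        cursor.execute(sql, (term,))
        return cursor.fetchall()

    return search


if __name__ == "__main__":
    # Stand-in search function so the sketch runs without a database.
    corpus = ["django", "sphinx", "xapian", "mysql", "fulltext"]
    terms = build_search_list(corpus, 3, seed=0)
    print("First run :", time_searches(lambda term: term in corpus, terms))
    print("next run  :", time_searches(lambda term: term in corpus, terms))
```

Running `time_searches` twice with the same term list gives the "First run" and "next run" pair; the reboot between engines has to be done by hand.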
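As you can see, MySQL is not that bad at full-text search. In addition, its query cache is very efficient.
MySQL seems like a good choice to me, since there is nothing extra to install (I only need to write a small script to synchronize the InnoDB production table to a MyISAM search table) and I don't really need advanced search features such as stemming.
So here is the question: what do you think of the MySQL full-text search engine compared with Sphinx and Xapian?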
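
The sync script could look roughly like the sketch below. It is untested against a real server and rests on several assumptions: the production table is called `books`, the search copy `books_search`, both share an `id` primary key, and `books` has an `updated_at` column that changes on every write. None of these names come from my actual schema.

```python
# Copy rows changed since the last sync from the InnoDB table into the
# MyISAM search table. REPLACE overwrites rows that share a primary key.
COPY_CHANGED_SQL = (
    "REPLACE INTO books_search (id, title, body) "
    "SELECT id, title, body FROM books WHERE updated_at > %s"
)

# Remove search rows whose source row no longer exists.
DELETE_ORPHANS_SQL = (
    "DELETE s FROM books_search AS s "
    "LEFT JOIN books AS b ON b.id = s.id "
    "WHERE b.id IS NULL"
)


def sync_search_table(conn, since):
    """Bring the search table up to date using a DB-API connection.

    `since` is the timestamp of the previous successful sync.
    """
    cursor = conn.cursor()
    cursor.execute(COPY_CHANGED_SQL, (since,))
    cursor.execute(DELETE_ORPHANS_SQL)
    # MyISAM ignores transactions, but committing is harmless and keeps
    # the function correct if the driver has autocommit turned off.
    conn.commit()
```

It would be run periodically (from cron, for example), passing the time of the previous run as `since`.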