3

Right now I have table in Mysql with 3 columns.

DocId             Int
Match_DocId       Int
Percentage Match  Int

I am storing document id along with its near duplicate document id and percentage which indicate how closely two documents match.

So if one document has 100 near duplicates, we have 100 rows for that particular document.

Right now, this table has more than 1 billion records for total of 14 millions documents. I am expecting total documents to go upto 30 millions. That means my table which stores near duplicate information will have more than 5 billions rows, may be more than that. (Near duplicate data grows exponentially compare to total document set)

Here are few issues that I have:

  1. Getting all there records in mysql table is taking lot of time.
  2. Query takes lot of time as well.

Here are few queries that I run:

  • Check if particular document has any near duplicate. (this is relatively fast, but still slow)

  • Check for given set of documents, how many near duplicates are there in each percentage range (Percentage range is 86-90, 91-95 , 96-100)?

    This query takes lot of time. Most of the time it fails. I am going group by on percentage column.

Can this be managed with any available NoSql solution?

I am skeptical for SQL query support for NoSql solutions as I need group by support while querying data.

4

2 回答 2

2

MySQL

您可以尝试使用当前的 MySql 解决方案进行分片,即将您的大型数据库拆分为较小的独特数据库。这样做的问题是您一次只能使用一个分片,这会很快。如果您计划跨多个分片使用查询,那么它会非常缓慢。

NoSql

Apache Hadoop堆栈值得一看。有几个系统允许您执行稍微不同的查询。一个好处是它们都倾向于在彼此之间进行良好的互操作。

检查特定文档是否有任何近乎重复的内容。(这相对较快,但仍然很慢)

HBase可以为大表完成这项工作。

检查给定的一组文件,在每个百分比范围内有多少接近重复的文件?(百分比范围为 86-90、91-95、96-100)

这应该非常适合Map-Reduce


还有许多其他解决方案,请参阅此链接以获取其他 NoSql 数据库的列表和简要说明。

于 2012-08-09T10:16:44.353 回答
1

我们对Redis有很好的经验。它很快,可以根据您的需要变得可靠。其他选项可能是CouchDBCassandra

于 2012-08-09T09:25:15.347 回答