
Let's say I have two fairly large data sets - the first is called "Base" and it contains 200 million tab-delimited rows, and the second is called "MatchSet", which has 10 million tab-delimited rows of similar data.

Let's say I then also have an arbitrary function called Match(row1, row2) and Match() essentially contains some heuristics for looking at row1 (from MatchSet) and comparing it to row2 (from Base) and determining if they are similar in some way.

Let's say the rules implemented in Match() are custom and complex, i.e. not a simple string match, and involve some proprietary methods. Let's say for now Match(row1,row2) is written in pseudo-code, so implementation in another language is not a problem (though it's in C++ today).

In a linear model, i.e. a program running on one giant processor, we would read each line from MatchSet and each line from Base, compare one to the other using Match(), and write out our match stats. For example we might capture: X records from MatchSet are strong matches, Y records from MatchSet are weak matches, Z records from MatchSet do not match. We would also write the strong/weak/non-matching records to separate files for inspection. In other words, a nested loop of sorts:

for each row1 in MatchSet
    for each row2 in Base
        var type = Match(row1, row2);
        // do something based on type

I've started considering Hadoop streaming as a method for running these comparisons as a batch job in a short amount of time. However, I'm having a bit of a hard time getting my head around the map-reduce paradigm for this type of problem.

I understand pretty clearly at this point how to take a single input from hadoop, crunch the data using a mapping function and then emit the results to reduce. However, the "nested-loop" approach of comparing two sets of records is messing with me a bit.

The closest I'm coming to a solution is that I would basically still have to do a 10 million record compare in parallel across the 200 million records, so (200 million / n nodes) * 10 million iterations per node. Is that the most efficient way to do this?


3 Answers


From your description, it seems to me that your problem can be arbitrarily complex and could be a victim of the curse of dimensionality.

Imagine for example that your rows represent n-dimensional vectors, and that your matching function returns "strong", "weak" or "no match" based on the Euclidean distance between a Base vector and a MatchSet vector. There are great techniques to solve these problems with a trade-off between speed, memory and the quality of the approximate answers. Critically, these techniques typically come with known bounds on time and space, and on the probability of finding a point within some distance of a given MatchSet prototype, all depending on some parameters of the algorithm.

Rather than ramble about it here, I suggest reading the following:

  1. Locality Sensitive Hashing
  2. The first few hits on Google Scholar when you search for "locality sensitive hashing map reduce". In particular, I remember reading [Das, Abhinandan S., et al. "Google news personalization: scalable online collaborative filtering." Proceedings of the 16th international conference on World Wide Web. ACM, 2007] with interest.
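To give a feel for what such a scheme looks like, here is a minimal sketch of LSH for Euclidean distance using random projections of the form floor((a·v + b) / w). Everything in it is an assumption made for illustration: the function names (`make_hash`, `build_index`, `candidates`), the bucket width, and the number of projections and tables. It is not tuned and says nothing about how your actual rows would be turned into vectors.

```python
import random
from collections import defaultdict

def make_hash(dim, width, k, seed):
    """Build one LSH function from k random projections: floor((a.v + b) / width)."""
    rng = random.Random(seed)
    projections = [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
        for _ in range(k)
    ]

    def h(vec):
        # Each projection maps the vector to an integer bucket; the tuple is the key.
        return tuple(
            int((sum(a_i * v_i for a_i, v_i in zip(a, vec)) + b) // width)
            for a, b in projections
        )

    return h

def build_index(base_vectors, hashes):
    """One hash table per LSH function, mapping bucket key -> Base indices."""
    tables = [defaultdict(list) for _ in hashes]
    for idx, vec in enumerate(base_vectors):
        for table, h in zip(tables, hashes):
            table[h(vec)].append(idx)
    return tables

def candidates(query, hashes, tables):
    """Base indices sharing at least one bucket with the query vector."""
    found = set()
    for table, h in zip(tables, hashes):
        found.update(table.get(h(query), []))
    return found

# Hypothetical parameters: 4 tables, 2 projections each, bucket width 4.0.
hashes = [make_hash(dim=3, width=4.0, k=2, seed=s) for s in range(4)]
base = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (50.0, 50.0, 50.0)]
tables = build_index(base, hashes)
near = candidates((0.0, 0.0, 0.0), hashes, tables)
```

Only the Base rows returned by `candidates` would then be passed to the real Match(). In a MapReduce setting the same idea is usually expressed by having the mapper emit (table number, bucket key) as the key, so that the reducer only sees rows that landed in the same bucket.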

Now, on the other hand, if you can devise a scheme that is directly amenable to some form of hashing, then you can easily produce a key for each record with such a hash (or even a small number of possible hash keys, one of which would match the query "Base" data), and the problem becomes a simple large(-ish) scale join. (I say "largish" because joining 200M rows with 10M rows is quite small if the problem is indeed a join.) As an example, consider the way CDDB computes the 32-bit ID for any music CD (the CDDB1 calculation). Sometimes a given title may yield slightly different IDs (i.e. different CDs of the same title, or even the same CD read several times). But by and large there is a small set of distinct IDs for that title. In that case, at the cost of a small replication of the MatchSet, you can get very fast search results.
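As a rough sketch of that join, assuming a made-up key scheme (the first six normalized characters of the first field, plus a second key for MatchSet rows that start with "The "): each Base row gets exactly one key, each MatchSet row gets a small set of keys, and only rows that share a key are ever paired. The helper names `normalize`, `base_key`, `match_keys` and `join_candidates` are hypothetical, and the in-memory grouping stands in for what the shuffle phase would do.

```python
from collections import defaultdict
from itertools import product

def normalize(text):
    """Lower-case and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def base_key(row):
    """Single key for a Base row (hypothetical scheme: first 6 normalized chars)."""
    return normalize(row.split("\t")[0])[:6]

def match_keys(row):
    """Small set of keys for a MatchSet row; this is the 'replication' cost."""
    name = row.split("\t")[0]
    keys = {normalize(name)[:6]}
    if name.lower().startswith("the "):
        keys.add(normalize(name[4:])[:6])
    return keys

def join_candidates(match_rows, base_rows):
    """Yield (match_row, base_row) pairs that share a key, as a reduce-side join would."""
    groups = defaultdict(lambda: ([], []))
    for row in match_rows:
        for key in match_keys(row):
            groups[key][0].append(row)
    for row in base_rows:
        groups[base_key(row)][1].append(row)
    for m_rows, b_rows in groups.values():
        yield from product(m_rows, b_rows)

match_rows = ["The Beatles\t1969", "Pink Floyd\t1973"]
base_rows = ["Beatles\tAbbey Road", "beatles \tLet It Be",
             "Pink Floyd\tAnimals", "Queen\tJazz"]
pairs = list(join_candidates(match_rows, base_rows))
```

With this toy data only three of the eight possible pairs are produced, and the full Match() would run on those alone. Whether such a key exists for your data depends entirely on what Match() looks at.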

answered 2013-04-10T04:47:03.490

Check Section 3.5 - Relational Joins in the paper "Data-Intensive Text Processing with MapReduce". I haven't gone through it in detail, but it might help you.

answered 2011-11-28T16:07:15.410

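This is an old question, but your proposed solution is correct, assuming that your single-stream job does 200M * 10M Match() computations. By doing N batches of (200M / N) * 10M computations, you achieve a factor-of-N speedup. By doing the computations in the map phase, then thresholding the results and steering them to strong/weak/no-match reducers, you can gather the results for output to separate files.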
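A minimal sketch of that map-then-threshold flow, written as plain functions so it can be tried without a cluster. The `match` function is a stand-in for the proprietary Match() and its rule is invented. The sketch also assumes the MatchSet is small enough to hold in memory on each mapper, and it drops non-matching pairs in the mapper instead of emitting them, since emitting every pair would mean writing on the order of 2 * 10^15 lines; MatchSet rows that never appear in the output can be identified afterwards.

```python
from collections import Counter

def match(row1, row2):
    """Stand-in for the real Match(); hypothetical rule on the first field only."""
    a, b = row1.split("\t")[0], row2.split("\t")[0]
    if a == b:
        return "strong"
    if a.lower() == b.lower():
        return "weak"
    return "none"

def mapper(base_lines, match_rows):
    """Compare one input split of Base against the full MatchSet held in memory."""
    for line in base_lines:
        base_row = line.rstrip("\n")
        for match_row in match_rows:
            label = match(match_row, base_row)
            if label != "none":  # thresholding: keep only strong/weak pairs
                # The label is the key, so each label is routed to one reducer.
                yield f"{label}\t{match_row}\t{base_row}"

def reducer(lines):
    """Count emitted pairs per label (the key is everything before the first tab)."""
    counts = Counter()
    for line in lines:
        counts[line.split("\t", 1)[0]] += 1
    return counts

# Toy data in place of the Base split (stdin) and the MatchSet side file.
match_rows = ["Acme\t1", "Zed\t2"]
base_split = ["Acme\tx\n", "acme\ty\n", "Other\tz\n"]
emitted = list(mapper(base_split, match_rows))
counts = reducer(sorted(emitted))
```

Under Hadoop streaming the mapper would read `sys.stdin` and print each yielded line, with the MatchSet shipped to every node as a side file; those details are left out here because they depend on your cluster setup.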

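If additional optimizations can be used, they would likely apply to both the single-stream and parallel versions. Examples include blocking, so that you need to do fewer than 200M * 10M computations, or precomputing constant portions of the algorithm for the 10M match set.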
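Both ideas can be sketched together on toy data. The assumptions here are invented for illustration: `features` plays the role of the row-only (constant) part of Match(), `block_key` is a cheap blocking key, and `compare` is the remaining pairwise part with arbitrary thresholds. Note that blocking trades recall for speed: pairs that fall into different blocks are never compared.

```python
from collections import defaultdict

def features(row):
    """Hypothetical row-only part of Match(): token set of the first field."""
    return frozenset(row.split("\t")[0].lower().split())

def block_key(feats):
    """Hypothetical blocking key: the alphabetically first token."""
    return min(feats) if feats else ""

def compare(feats1, feats2):
    """Pairwise part: Jaccard overlap with assumed thresholds."""
    union = feats1 | feats2
    score = len(feats1 & feats2) / len(union) if union else 0.0
    if score >= 0.8:
        return "strong"
    if score >= 0.3:
        return "weak"
    return "none"

def blocked_match(match_rows, base_rows):
    # Precompute features once per MatchSet row instead of once per pair.
    blocks = defaultdict(list)
    for row in match_rows:
        feats = features(row)
        blocks[block_key(feats)].append((row, feats))

    comparisons = 0
    results = []
    for base_row in base_rows:
        base_feats = features(base_row)
        # Only MatchSet rows in the same block are compared.
        for match_row, match_feats in blocks.get(block_key(base_feats), []):
            comparisons += 1
            results.append((match_row, base_row, compare(match_feats, base_feats)))
    return results, comparisons

match_rows = ["acme corp\t1", "zenith labs\t2"]
base_rows = ["Acme Corp\ta", "acme inc\tb", "other co\tc"]
results, comparisons = blocked_match(match_rows, base_rows)
```

On this toy input the blocked version makes 2 comparisons where the full nested loop would make 6. The same precomputed structure can be built once per mapper in the parallel version.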

answered 2013-09-16T16:19:25.527