
I'm looking for a distributed solution to screen/filter a large volume of keys in real time. My application generates over 100 billion records per day, and I need a way to filter duplicates out of the stream. I'm looking for a system to store a rolling 10 days' worth of keys, at roughly 100 bytes per key. I'm wondering how this kind of large-scale problem was solved before Hadoop came along. Would HBase be the right solution? Has anyone tried a partially in-memory solution like ZooKeeper?


2 Answers


I can see a number of solutions to your problem, but the real-time requirement really narrows it down. By real-time, do you mean you want to see whether a key is a duplicate as it's being created?

Let's talk about queries per second. You say 100B/day (that's a lot, congratulations!). That's 1.15 Million queries per second (100,000,000,000 / 24 / 60 / 60). I'm not sure if HBase can handle that. You may want to think about something like Redis (sharded perhaps) or Membase/memcached or something of that sort.
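A quick back-of-the-envelope sketch of that rate (nothing store-specific, just the arithmetic):

```python
# Back-of-the-envelope: 100 billion records/day -> queries per second.
records_per_day = 100_000_000_000
seconds_per_day = 24 * 60 * 60          # 86,400

qps = records_per_day / seconds_per_day
print(f"{qps:,.0f} queries/sec")        # roughly 1.16 million
```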

If you were to do it in HBase, I'd simply push upwards of a trillion keys (10 days x 100B keys) into the table as row keys, and put some value in there alongside them (because you have to). Then you can just do a get to figure out whether the key is in there. This is kind of hokey and doesn't fully utilize HBase, since you're only using the keyspace. Effectively, HBase becomes a b-tree service in this case. I don't think this is a good idea.
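The keys-only pattern described above can be sketched with a plain dict standing in for the HBase table (a minimal sketch; a real implementation would go through the HBase client's get/put, and the helper name here is made up for illustration):

```python
# Sketch of the "keys-only" dedup idea: the table stores the key itself
# with a throwaway value, and a get tells you whether you've seen it.
table = {}  # stands in for an HBase table keyed on the record key


def is_duplicate(key: str) -> bool:
    """Return True if key was seen before; otherwise record it."""
    if key in table:            # the 'get'
        return True
    table[key] = b"\x00"        # the 'put' -- the value is irrelevant
    return False


print(is_duplicate("rec-001"))  # False: first time seen
print(is_duplicate("rec-001"))  # True: duplicate
```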

If you relax the constraint so you don't have to do it in real time, you could use MapReduce in batch to dedup. That's pretty easy: it's just Word Count without the counting. You group by the key you have, and then you'll see the dups in the reducer if multiple values come back. With enough nodes and enough latency, you can solve this problem efficiently. Here is some example code for this from the MapReduce Design Patterns book: https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch3/DistinctUserDriver.java
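The "Word Count without the counting" pattern can be simulated locally like this (a sketch with plain Python standing in for the Mapper, shuffle, and Reducer; the real Hadoop job is in the linked DistinctUserDriver):

```python
from itertools import groupby


def map_phase(records):
    # Mapper: emit (key, None) -- no count is needed for distinct.
    return [(r, None) for r in records]


def reduce_phase(pairs):
    # Shuffle sorts/groups by key; the Reducer emits each key once,
    # no matter how many duplicate values arrived for it.
    pairs.sort(key=lambda kv: kv[0])
    return [key for key, _ in groupby(pairs, key=lambda kv: kv[0])]


records = ["k1", "k2", "k1", "k3", "k2"]
print(reduce_phase(map_phase(records)))  # ['k1', 'k2', 'k3']
```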

ZooKeeper is for distributed process communication and synchronization. You don't want to be storing trillions of records in ZooKeeper.

So, in my opinion, you're better served by an in-memory key/value store such as Redis, but you'll be hard pressed to store that much data in memory.

answered 2013-11-21T19:39:23.757

I'm afraid this is impossible with traditional systems :|

Here's what you mentioned:

  1. 100 billion per day means roughly 1 million per second!
  2. Keys are 100 bytes in size.
  3. You want to check for duplicates against a 10-day working set, which means 1 trillion items.

These assumptions mean looking up against a set of 1 trillion objects with a total size of about 91 TiB (100 TB)! Any solution to this real-time problem has to provide a system that can look up 1 million items per second in that volume of data. I have some experience with HBase, Cassandra, Redis, and Memcached. I'm sure you can't achieve that performance on any disk-based storage such as HBase, Cassandra, or HyperTable (and add any RDBMS like MySQL or PostgreSQL to that list). The best performance I've heard of for Redis and Memcached is about 100k operations per second on a single machine. That means you'd need about 90 machines, each with 1 TiB of RAM!
Even a batch system like Hadoop can't do this job in under an hour; I'd guess even a large cluster of 100 machines would take hours or days.
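The arithmetic behind those figures works out roughly like this (a sketch of the estimate, not a benchmark; the 1 TiB-per-machine RAM figure is the assumption from above):

```python
# Working set: 10 days x 100 billion keys/day, 100 bytes each.
keys = 10 * 100_000_000_000            # 1 trillion keys
total_bytes = keys * 100               # 100 TB, about 91 TiB
tib = total_bytes / 2**40

# Lookup rate, and machine count if the data must fit in RAM
# at an assumed 1 TiB per machine.
qps = 100_000_000_000 / 86_400
machines_for_ram = total_bytes / 2**40

print(f"{tib:.0f} TiB working set, {qps:,.0f} lookups/sec, "
      f"~{machines_for_ram:.0f} machines")
```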

You're talking about very, very large numbers (roughly 90 TiB, 1M per second). Are you sure about this?

answered 2013-11-21T22:10:20.597