hdfs - Partitioning data in HDFS

Question

I have a large amount of DNS log data (terabytes) in text files, where each record has the format

timestamp | resolvername | domainlookedfor | dns_answer

where

timestamp       - time at which the record was logged
resolvername    - the dns resolver that served the end-user
domainlookedfor - domain that was looked for by the end user
dns_answer      - final dns resolution record of 'hostname -> ip address'
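As a minimal sketch of the record layout above, the following Python splits one log line into its four fields. It assumes the fields are separated by a literal `|` and that only `dns_answer` might itself contain one; the timestamp is kept as a plain string because its format is not stated. The names `DnsRecord` and `parse_record` and the sample line are hypothetical.

```python
from typing import NamedTuple


class DnsRecord(NamedTuple):
    """One log record; field names follow the format described above."""
    timestamp: str        # kept as text: the timestamp format is not specified
    resolvername: str
    domainlookedfor: str
    dns_answer: str


def parse_record(line: str) -> DnsRecord:
    """Split a 'a | b | c | d' line into a DnsRecord (hypothetical helper)."""
    # maxsplit=3 keeps any further '|' characters inside dns_answer.
    parts = [part.strip() for part in line.split("|", 3)]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    return DnsRecord(*parts)


# Made-up sample line, for illustration only.
sample = "1359417600 | resolver-x | www.google.com | www.google.com -> 192.0.2.1"
record = parse_record(sample)
print(record.resolvername, record.domainlookedfor)
```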

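Currently, I have individual text files for every five minutes of logs from various DNS resolvers. So if I want to look at the records from the last 10 days that contain a hostname, say www.google.com, I have to scan all of the data for those 10 days (say 50 GB) and keep only the records that match the domain (say 10 MB of data). Clearly, a large amount of data is read from disk unnecessarily, and it takes a long time to get the results.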

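To improve this, I am thinking of partitioning the data by domain name, thereby reducing my search space. I also want to keep the notion of separating records by time (if not every 5 minutes, then I would like one file per day).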

One simple approach I can think of is:

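Store the records in buckets based on a hash of the domain name (or perhaps its first two letters): [domain_AC, domain_AF, domain_AI ... domain_ZZ], where the directory domain_AC holds the records of all domains whose first character is A and whose second character is A, B, or C.
Within each bucket, there is a separate file for each day: [20130129, 20130130, ... ]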

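So, to fetch the records for www.google.com, first identify the bucket, then, based on the date range, scan the corresponding files and keep only the records that match www.google.com.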
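The lookup just described can be sketched in Python as follows. This is only an illustration of the first-two-letters variant, under several assumptions: the second letter is grouped in runs of three (A-C, D-F, ..., Y-Z) and each bucket is labelled with the last letter of its run, which matches the domain_AC ... domain_ZZ names; domains that do not start with two ASCII letters go to a hypothetical `domain_other` bucket; and daily files are named `YYYYMMDD` inside the bucket directory. The function names are made up for this sketch.

```python
from datetime import date, timedelta
from typing import List


def _is_ascii_letter(ch: str) -> bool:
    return "A" <= ch <= "Z"


def bucket_for(domain: str) -> str:
    """Map a domain to its bucket directory, e.g. 'abc.com' -> 'domain_AC'."""
    name = domain.upper()
    if len(name) < 2 or not (_is_ascii_letter(name[0]) and _is_ascii_letter(name[1])):
        # Assumption: digits, hyphens, etc. share one catch-all bucket.
        return "domain_other"
    index = ord(name[1]) - ord("A")
    # Last letter of the run of three that contains the second character;
    # the final run (Y, Z) is capped at Z.
    run_end = min((index // 3) * 3 + 2, 25)
    return f"domain_{name[0]}{chr(ord('A') + run_end)}"


def daily_files(domain: str, start: date, end: date) -> List[str]:
    """List the per-day files to scan for a domain, both dates inclusive."""
    bucket = bucket_for(domain)
    days = (end - start).days
    return [
        f"{bucket}/{(start + timedelta(days=offset)).strftime('%Y%m%d')}"
        for offset in range(days + 1)
    ]


print(bucket_for("www.google.com"))
print(daily_files("www.google.com", date(2013, 1, 29), date(2013, 1, 30)))
```

Note that under this scheme every hostname beginning with "www." maps to the same bucket, so bucket sizes may be heavily skewed; hashing the domain name, the other option mentioned above, would spread records more evenly.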

My other requirement is to group the records by resolvername, in order to answer queries such as get all the records by resolver 'x'.

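Please let me know if there are any important details I should consider, and whether there are other known approaches to this problem. I appreciate any help. Thanks!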
