hadoop - Hadoop - Find matching names in two customer lists

Question

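I have two lists of people from different events. I want to find matching person names, and also matching companies, across these lists. I know that people with the same name on each list may not be the same person, but finding the matches would still help.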

First list example:
Name, Company, Title
John Doe, ACME Corporation, Elephant Trainer
Jane Smith, ACME Corporation, CEO
John Smith, Widgets-R-Us, Janitor
+10,000's of rows

Second list example:
Name, Company
Fred Smith, ACME Corporation
John Smith, Widgets-R-Us
John Smith, Company XYZ
Jane Smith, Company XYZ
+10,000's of rows

Desired output
Matching names:
John Smith
Jane Smith

Matching companies:
ACME Corporation
Widgets-R-Us

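I'm running this in an AWS environment and am new to Hadoop. Any programming language is fine. I know how to do this in Excel, but I'd like to be able to scale it over time with more name lists (each one in its own CSV file).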

score 0 · Accepted Answer

You would need a Mapper implementation that emits the person name (or the company name) as a Text key with an IntWritable value of 1.
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Some logic to derive the person name or the company name.
    // Here: take the first comma-separated field of the input line.
    String name = value.toString().split(",")[0].trim();
    context.write(new Text(name), new IntWritable(1));
}
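The "logic to derive the person name or the company name" is left open above. As a rough sketch in plain Java (no Hadoop dependencies), it could look like the following. The class name `RecordKeys`, the method `keysFor`, and the `NAME:` / `COMPANY:` prefixes are all made up for illustration. The sketch assumes simple comma-separated rows without quoted commas, with the name in the first column and the company in the second.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordKeys {
    // Hypothetical helper: turns one CSV row into the keys a mapper could emit.
    // Assumes plain comma-separated fields (no quoted commas),
    // name in column 0 and company in column 1.
    public static List<String> keysFor(String line) {
        List<String> keys = new ArrayList<>();
        String[] fields = line.split(",");
        if (fields.length >= 1 && !fields[0].trim().isEmpty()) {
            // The prefix keeps person names and company names apart in one job.
            keys.add("NAME:" + fields[0].trim());
        }
        if (fields.length >= 2 && !fields[1].trim().isEmpty()) {
            keys.add("COMPANY:" + fields[1].trim());
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(keysFor("John Smith, Widgets-R-Us, Janitor"));
    }
}
```

Inside a real mapper, each returned key would be written with `context.write(new Text(k), new IntWritable(1))`. Header rows would still need to be skipped separately.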

The reduce method in the Reducer would be implemented along these lines:
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    // Start at zero so the total equals the number of occurrences.
    int count = 0;
    for (IntWritable val : values) {
        count += val.get();
    }
    // You get every unique name with the number of times it is repeated.
    context.write(key, new IntWritable(count));
}
Hope this helps.
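One caveat with counting occurrences alone: a name that appears twice within the same list (John Smith in the second list, for example) is indistinguishable from a name that appears once in each list. A possible variation is to record which list each key came from and keep only the keys seen in at least two different lists. The following is a minimal local sketch of that idea in plain Java, not a Hadoop job. The class `ListMatcher` and the method `matches` are hypothetical names, and the rows are assumed to be simple comma-separated lines with the header already removed.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ListMatcher {
    // Returns the values of the given column that occur in at least two different lists.
    // Each inner list stands in for the rows of one CSV file (header already removed).
    public static Set<String> matches(List<List<String>> lists, int column) {
        // "Map" phase: remember which list indexes each key was seen in.
        Map<String, Set<Integer>> sources = new HashMap<>();
        for (int i = 0; i < lists.size(); i++) {
            for (String row : lists.get(i)) {
                String[] fields = row.split(",");
                if (fields.length <= column) continue;
                String key = fields[column].trim();
                if (key.isEmpty()) continue;
                sources.computeIfAbsent(key, k -> new HashSet<>()).add(i);
            }
        }
        // "Reduce" phase: keep keys that came from two or more distinct lists.
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> e : sources.entrySet()) {
            if (e.getValue().size() >= 2) result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList(
            "John Doe, ACME Corporation, Elephant Trainer",
            "Jane Smith, ACME Corporation, CEO",
            "John Smith, Widgets-R-Us, Janitor");
        List<String> second = Arrays.asList(
            "Fred Smith, ACME Corporation",
            "John Smith, Widgets-R-Us",
            "John Smith, Company XYZ",
            "Jane Smith, Company XYZ");
        List<List<String>> lists = Arrays.asList(first, second);
        System.out.println("Matching names: " + matches(lists, 0));
        System.out.println("Matching companies: " + matches(lists, 1));
    }
}
```

With the sample rows from the question, this reports Jane Smith and John Smith as matching names, and ACME Corporation and Widgets-R-Us as matching companies. In an actual MapReduce job, the same idea would mean emitting the source file name as the value instead of a constant 1 and counting distinct sources in the reducer.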
