java - Understanding MapReduce performance?

Question

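Hi, I would like to understand MapReduce performance better.

What dominates the performance of a MapReduce algorithm implemented in Hadoop?

If a node has to process a large amount of data, is it the computation time, or the disk read and write time?

When I ran some MapReduce programs, I observed that disk writes took much longer than disk reads.

I would like to know whether the overhead of disk writes is much larger than the computation time (CPU time) when a node has to process a large amount of data. Is the CPU time negligible compared with the I/O access time?

The following algorithm is what happens at each reduce node. I would like to know whether the CPU time spent executing this algorithm is negligible compared with reading the input from HDFS and, after processing, writing the output to HDFS.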

  Input : R is a multiset of records sorted in increasing order of size; each record has been canonicalized by a global ordering O; a Jaccard similarity threshold t
  Output : all pairs of records (x, y) such that sim(x, y) > t

  S <- empty set;
  I_i <- empty set (1 <= i <= |U|);
  for each x in R do
      p <- |x| - ceil(t * |x|) + 1;
      for i = 1 to p do
          w <- x[i];
          for each (y, j) in I_w such that |y| >= t * |x| do  /* size filtering on |y| */
              s <- |x intersection y| / |x union y|;  /* similarity calculation */
              if s > t then
                  S <- S union {(x, y)};
          I_w <- I_w union {(x, i)};  /* index the current prefix */
  return S
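
For anyone who wants to time the CPU part in isolation, the pseudocode above can be sketched in plain Java roughly as follows. This is only a minimal single-process sketch under my own assumptions, not Hadoop reducer code: records are `int[]` arrays of token ids, already sorted by the global ordering and free of duplicates, and the list is sorted by increasing record size. The class and method names (`PrefixFilterJoin`, `similarPairs`, `jaccard`) are hypothetical. Unlike the pseudocode, candidates are collected in a set so that each pair is verified only once, and the prefix is indexed after probing rather than token by token.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PrefixFilterJoin {

    /** Jaccard similarity of two ascending, duplicate-free token arrays. */
    static double jaccard(int[] x, int[] y) {
        int i = 0, j = 0, common = 0;
        // Merge-style scan: both arrays are sorted, so one pass counts the overlap.
        while (i < x.length && j < y.length) {
            if (x[i] == y[j]) { common++; i++; j++; }
            else if (x[i] < y[j]) { i++; }
            else { j++; }
        }
        int union = x.length + y.length - common;
        return union == 0 ? 0.0 : (double) common / union;
    }

    /**
     * Returns index pairs {y, x} (positions in records) with jaccard > t.
     * Assumes records are sorted by increasing length.
     */
    static List<int[]> similarPairs(List<int[]> records, double t) {
        List<int[]> result = new ArrayList<>();
        // Inverted index: token id -> positions of records whose prefix contains it.
        Map<Integer, List<Integer>> index = new HashMap<>();
        for (int xi = 0; xi < records.size(); xi++) {
            int[] x = records.get(xi);
            // Small epsilon guards against t * |x| landing just above an integer.
            int minSize = (int) Math.ceil(t * x.length - 1e-9);
            int prefix = Math.min(x.length, x.length - minSize + 1);

            // Probe the index with the prefix tokens, applying the size filter.
            Set<Integer> candidates = new LinkedHashSet<>();
            for (int i = 0; i < prefix; i++) {
                for (int yi : index.getOrDefault(x[i], Collections.emptyList())) {
                    if (records.get(yi).length >= minSize) {
                        candidates.add(yi);
                    }
                }
            }
            // Verify each candidate once with the full similarity computation.
            for (int yi : candidates) {
                if (jaccard(x, records.get(yi)) > t) {
                    result.add(new int[] {yi, xi});
                }
            }
            // Index the current prefix so later (larger) records can find x.
            for (int i = 0; i < prefix; i++) {
                index.computeIfAbsent(x[i], k -> new ArrayList<>()).add(xi);
            }
        }
        return result;
    }
}
```

Wrapping a call to `similarPairs` in `System.nanoTime()` measurements on a sample of real reducer input should give a rough idea of how the pure CPU cost compares with the HDFS read and write times reported for the job.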

score 2 · Accepted Answer

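In general, it depends on the kind of processing you are doing. But it is possible to point out what takes time and resources apart from your own code.
Let's go through the MR job flow and point out the noticeable resource consumption.
1. Reading your split from HDFS. Unless the local-read optimization applies, the data is passed through a socket (CPU) and/or network + disk IO. A checksum is also verified during the read (HDFS uses CRC checksums rather than MD5). The input format then has to cut the input data into key-value pairs for the Mapper. Since this is Java, that always means dynamic memory allocation and deallocation. Parsing the input usually takes CPU time.
2. From the record reader to the mapper: no serious overhead.
3. Mapper output is sorted and serialized (a lot of CPU) and written to local disk.
4. Data is pulled by the reducers from the mapper machines. A lot of network.
5. Data is merged on the reducer side. CPU + disk.
6. The reducer output is written to HDFS: x3 of the data size in disk IO + x2 in network traffic, because of replication (with the default replication factor of 3).

In short, 3, 4, and 5 are usually the most time- and resource-consuming.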
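
To make the arithmetic in step 6 concrete, here is a small back-of-the-envelope sketch. It assumes the first replica is written on the node running the reducer and the remaining replicas are sent over the network, which is where the "x3 disk, x2 network" figures come from. The class and method names (`ReplicationCost`, `diskBytesWritten`, `networkBytesSent`) are made up for illustration and are not part of any Hadoop API.

```java
final class ReplicationCost {

    /** Total bytes written to disk across the cluster: one full copy per replica. */
    static long diskBytesWritten(long outputBytes, int replication) {
        return outputBytes * replication;
    }

    /**
     * Bytes sent over the network, assuming the first replica lands on the
     * local node and every additional replica is shipped to a remote node.
     */
    static long networkBytesSent(long outputBytes, int replication) {
        return outputBytes * Math.max(0, replication - 1);
    }

    public static void main(String[] args) {
        long output = 10L * 1024 * 1024 * 1024; // hypothetical 10 GiB of reducer output
        int replication = 3;                    // assumed replication factor
        System.out.println("disk bytes:    " + diskBytesWritten(output, replication));
        System.out.println("network bytes: " + networkBytesSent(output, replication));
    }
}
```

With a replication factor of 3, every byte the reducers emit costs three bytes of disk writes and two bytes of network transfer, which is one reason writes can look much slower than reads in job statistics.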

score 0 · Accepted Answer

This may help your understanding of the problem:

L1 cache reference                            0.5 ns
Branch mispredict                             5   ns
L2 cache reference                            7   ns             14x L1 cache
Mutex lock/unlock                            25   ns
Main memory reference                       100   ns             20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy              3,000   ns
Send 1K bytes over 1 Gbps network        10,000   ns    0.01 ms
Read 4K randomly from SSD*              150,000   ns    0.15 ms
Read 1 MB sequentially from memory      250,000   ns    0.25 ms
Round trip within same datacenter       500,000   ns    0.5  ms
Read 1 MB sequentially from SSD*      1,000,000   ns    1    ms  4X memory
Disk seek                            10,000,000   ns   10    ms  20x datacenter roundtrip
Read 1 MB sequentially from disk     20,000,000   ns   20    ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA     150,000,000   ns  150    ms

Source: https://gist.github.com/2841832
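
As a rough illustration of how to use these numbers, the sketch below scales the per-megabyte figures from the table to a larger input. The constants are copied from the table; the 1000 MB input size and the names (`LatencyEstimate`, `secondsToRead`) are assumptions for the example, and real clusters will differ.

```java
final class LatencyEstimate {

    // Nanoseconds to read 1 MB sequentially, taken from the table above.
    static final long MEMORY_NS_PER_MB = 250_000L;
    static final long SSD_NS_PER_MB = 1_000_000L;
    static final long DISK_NS_PER_MB = 20_000_000L;

    /** Estimated seconds to read the given number of megabytes sequentially. */
    static double secondsToRead(long megabytes, long nanosPerMegabyte) {
        return megabytes * nanosPerMegabyte / 1e9;
    }

    public static void main(String[] args) {
        long megabytes = 1000; // hypothetical input size
        System.out.println("memory: " + secondsToRead(megabytes, MEMORY_NS_PER_MB) + " s");
        System.out.println("SSD:    " + secondsToRead(megabytes, SSD_NS_PER_MB) + " s");
        System.out.println("disk:   " + secondsToRead(megabytes, DISK_NS_PER_MB) + " s");
    }
}
```

By these figures, streaming 1000 MB from a spinning disk takes on the order of 20 seconds versus about a quarter of a second from memory, so whether CPU time is negligible depends on how much computation the algorithm does per megabyte of input.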
