Questions tagged [mrv2]
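eclipse - The new MapReduce architecture and Eclipse
Some major refactoring of MapReduce is under way in Hadoop. Details can be found in the JIRA below.
https://issues.apache.org/jira/browse/MAPREDUCE-279
It has ResourceManager, NodeManager, and HistoryServer daemons. Has anyone tried running them in Eclipse? That would make development and debugging easier.
I sent a mail to the Hadoop forums, but nobody there had tried it. I just wanted to check whether anyone on Stack Overflow has done something similar.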
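hadoop - Hadoop cluster setup for version 0.23 (MRv2 or NextGen MR)
As far as I can see, the latest stable Hadoop release is 0.20.x and the latest release is 0.23. There seem to be a lot of changes between 0.20.x and 0.23.x.
We were able to set up a small cluster with the stable release (0.20.2) and practice MapReduce programming.
We have seen that a lot of new APIs were added in 0.23.x. To explore 0.23.x, we also need to set up a cluster running that version.
Could you point us to a document that explains how to set up a cluster with 0.23.x?
When I extracted the tar file, 0.23.x looked completely different from 0.20.x. Please give us some book references or documentation that cover cluster setup from the very beginning.
Thanks, MRK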
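hadoop - About the secondarynamenode concept in Hadoop
According to the documentation (http://hadoop.apache.org/common/docs/r0.20.203.0/hdfs_user_guide.html), the secondarynamenode is deprecated in Hadoop 0.20.203.0 and replaced by the checkpointnode and backupnode. But the cluster setup documentation (http://hadoop.apache.org/common/docs/r0.20.203.0/cluster_setup.html) does not mention this change. Moreover, bin/start-dfs.sh still starts the secondary namenode on the addresses listed in the conf/masters file.
Can someone explain the difference? Does this mean that the configuration is unchanged and only the internal architecture of the secondarynamenode has changed?
Also, in Hadoop 0.23.0 there is no conf/masters file, which is what we used to specify the host on which the secondary namenode should be started.
Thanks, MRK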
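As a rough illustration of the last point: in the 0.23/2.x line the secondary NameNode host is, to my understanding, taken from the `dfs.namenode.secondary.http-address` property in hdfs-site.xml rather than from conf/masters. The sketch below only renders such a fragment; the host name `snn.example.com` is a made-up placeholder, the port is what I believe to be the default, and the helper `render_hadoop_config` is hypothetical and not part of Hadoop.

```python
from xml.sax.saxutils import escape


def render_hadoop_config(properties):
    """Render a dict of property names and values as a Hadoop <configuration> document."""
    lines = ["<configuration>"]
    for name, value in properties.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(str(value))}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


# Assumption: snn.example.com is a placeholder host for the secondary NameNode.
hdfs_site = render_hadoop_config({
    "dfs.namenode.secondary.http-address": "snn.example.com:50090",
})
print(hdfs_site)
```

The output is meant to be merged into an existing hdfs-site.xml by hand; check the property name against the hdfs-default.xml shipped with the release you actually run.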
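hadoop - Hadoop / Yarn (v0.23.3) pseudo-distributed mode setup :: no job node
I have just set up Hadoop/Yarn 2.x (specifically, v0.23.3) in pseudo-distributed mode.
I followed instructions from several blogs and websites, which all give more or less the same setup recipe. I also followed the 3rd edition of O'Reilly's Hadoop book (which, ironically, was the least helpful).
The problem:
Configuration:
The following environment variables are set in the UNIX account profile (~/.profile) of both my own account and the hadoop account:
hadoop$ java -version
NAMENODE and DATANODE directories, also specified in etc/hadoop/conf/hdfs-site.xml:
Next, the various XML configuration files (again, this is YARN/MRv2/v0.23.3):
core-site.xml
mapred-site.xml
hdfs-site.xml
yarn-site.xml
etc/hadoop/conf/saves
Other closing notes:
Thanks!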
hadoop - Yarn NodeManager and ResourceManager on the same node
By default, does Hadoop YARN run a NodeManager on the same node as the ResourceManager? If not, is it possible to run them on the same node?
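hadoop - Container is running beyond memory limits
In Hadoop v1 I allocated 7 mapper slots and 7 reducer slots of 1 GB each, and my mappers and reducers ran fine. My machine has 8 GB of memory and 8 processors. Now with YARN, when I run the same application on the same machine, I get container errors. By default, I have the following settings:
It gave me this error:
I then tried setting the memory limits in mapred-site.xml:
But I still got an error:
I am confused about why the map task needs this much memory. As I understand it, 1 GB of memory is enough for my map/reduce tasks. Why does a task use more memory when I assign more memory to the container? Is it because each task gets more splits? I feel it would be more efficient to reduce the container size a little and create more containers, so that more tasks can run in parallel. The question is: how can I make sure that each container is not assigned more splits than it can handle?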
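The arithmetic behind container sizing can be sketched as follows. The property names (`mapreduce.map.memory.mb`, `mapreduce.map.java.opts`, `yarn.scheduler.minimum-allocation-mb`, `yarn.scheduler.maximum-allocation-mb`, `yarn.nodemanager.resource.memory-mb`) are real Hadoop 2.x settings. The rounding rule shown is how I understand the CapacityScheduler to normalize requests, and the 0.8 heap fraction is only a common rule of thumb. All numbers are illustrative and are not taken from the question, whose settings and error output are not shown.

```python
def normalize_request(requested_mb, min_alloc_mb, max_alloc_mb):
    """Round a container request up to a multiple of the scheduler minimum, capped at the maximum."""
    multiples = -(-requested_mb // min_alloc_mb)  # integer ceiling division
    rounded = max(multiples, 1) * min_alloc_mb
    return min(rounded, max_alloc_mb)


def suggested_heap_mb(container_mb, heap_fraction=0.8):
    """JVM heap (-Xmx) that leaves headroom inside the container for non-heap memory.

    Assumption: 0.8 is a rule of thumb, not a value mandated by Hadoop.
    """
    return int(container_mb * heap_fraction)


def containers_per_node(node_mb, container_mb):
    """How many containers of this size fit into yarn.nodemanager.resource.memory-mb."""
    return node_mb // container_mb


# Illustrative numbers: 7 GB of an 8 GB machine handed to YARN.
node_mb = 7168
for requested in (1024, 1536):
    granted = normalize_request(requested, min_alloc_mb=1024, max_alloc_mb=8192)
    print(
        f"request={requested} MB -> container={granted} MB, "
        f"-Xmx{suggested_heap_mb(granted)}m, "
        f"{containers_per_node(node_mb, granted)} containers per node"
    )
```

The point of the sketch is that the container limit applies to the whole process, so the heap configured through `mapreduce.map.java.opts` has to stay below `mapreduce.map.memory.mb`, and a larger container directly reduces how many tasks run in parallel on one node.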
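hadoop - How to submit a Hadoop streaming job with Hadoop 2.x and check the execution history
I am new to Hadoop. In Hadoop 1.x, I could submit a Hadoop streaming job from the master node and check the results and execution time from the NameNode web UI.
Here is sample code for Hadoop streaming in Hadoop 1.x:
However, in Hadoop 2.x the JobTracker has been removed. How can I get the same functionality in Hadoop 2.x?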
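One way to inspect finished jobs in Hadoop 2.x is the ResourceManager REST API, which serves application records under `/ws/v1/cluster/apps` (the ResourceManager web UI listens on port 8088 by default). The following is a minimal sketch of reading that endpoint. The host name `rm.example.com` is a placeholder, the helper names are hypothetical, and the sample payload is hand-written to mirror the fields I expect in a response rather than captured from a real cluster.

```python
import json
from urllib.request import urlopen


def apps_url(rm_host, port=8088, states=None):
    """Build the ResourceManager REST URL that lists applications."""
    url = f"http://{rm_host}:{port}/ws/v1/cluster/apps"
    if states:
        url += "?states=" + ",".join(states)
    return url


def summarize_apps(payload):
    """Reduce the REST response to id, name, state, final status, and runtime in seconds."""
    apps = (payload.get("apps") or {}).get("app", [])
    return [
        {
            "id": app["id"],
            "name": app["name"],
            "state": app["state"],
            "final_status": app["finalStatus"],
            "elapsed_s": app["elapsedTime"] / 1000.0,  # elapsedTime is in milliseconds
        }
        for app in apps
    ]


def fetch_apps(rm_host, port=8088, states=None):
    """Query a live ResourceManager (needs network access to the cluster)."""
    with urlopen(apps_url(rm_host, port, states)) as response:
        return summarize_apps(json.load(response))


# Hand-written sample payload (an assumption about the response shape, not real output).
sample = {
    "apps": {
        "app": [
            {
                "id": "application_0000000000000_0001",
                "name": "streamjob.jar",
                "state": "FINISHED",
                "finalStatus": "SUCCEEDED",
                "elapsedTime": 42000,
            }
        ]
    }
}
for row in summarize_apps(sample):
    print(row)
```

Against a real cluster you would call `fetch_apps("rm.example.com", states=["FINISHED"])`. The `yarn application -list` command and the JobHistory server web UI expose similar information.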
hadoop - MRv2 / YARN Features
I'm trying to wrap my head around the actual purpose of the new API, and while reading around the internet I have found different answers to the same questions I was dealing with.
The questions I'd like to know the answers to are:
1) Which of the MRv2/YARN daemons is responsible for launching application containers and monitoring application resource usage?
2) Which two issues is MRv2/YARN designed to address?
I'll try to make this thread educational and constructive to other readers by specifying resources and actual data from my searches, so I hope it wouldn't look like I have provided too much information while I could just ask the questions and make my post shorter.
For the 1st question, reading in the documentation, I could find 3 main resources to rely on:
From Hadoop documentation:
ApplicationMaster<-->NodeManager Launch containers. Communicate with NodeManagers by using NMClientAsync objects, handling container events by NMClientAsync.CallbackHandler
The ApplicationMaster communicates with YARN cluster, and handles application execution. It performs operations in an asynchronous fashion. During application launch time, the main tasks of the ApplicationMaster are:
a) communicating with the ResourceManager to negotiate and allocate resources for future containers, and
b) after container allocation, communicating YARN NodeManagers (NMs) to launch application containers on them.
From Hortonworks documentation
The ApplicationMaster is, in effect, an instance of a framework-specific library and is responsible for negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the containers and their resource consumption. It has the responsibility of negotiating appropriate resource containers from the ResourceManager, tracking their status and monitoring progress.
From Cloudera documentation:
MRv2 daemons -
ResourceManager – one per cluster – Starts ApplicationMasters, allocates resources on slave nodes
ApplicationMaster – one per job – Requests resources, manages individual Map and Reduce tasks
NodeManager – one per slave node – Manages resources on individual slave nodes
JobHistory – one per cluster – Archives jobs’ metrics and metadata
Back to the question (which daemon is responsible for launching application containers and monitoring application resource usage), I ask myself:
Is it the NodeManager? Is it the ApplicationMaster?
From what I understand, the ApplicationMaster is the one that makes the NodeManager actually get the job done, so it is like asking who is responsible for lifting a box from the ground: the hands that did the actual lifting, or the mind that controls the body and makes them do the lifting?
It is a tricky question, I guess, but there has to be only one answer to it.
For the 2nd question, reading online, I could find different answers from many resources and thus the confusion, but my main sources would be:
From Cloudera documentation:
MapReduce v2 (“MRv2”) – Built on top of YARN (Yet Another Resource Negotiator)
– Uses ResourceManager/NodeManager architecture
– Increases scalability of cluster
– Node resources can be used for any type of task
– Improves cluster utilization
– Support for non-MR jobs
Back to the question (Which two issues is MRv2/YARN designed to address?): I know MRv2 made a few changes, such as relieving resource pressure on the JobTracker (in MRv1 the maximum number of nodes in a cluster was around 4,000, and in MRv2 it is more than twice that number), and I also know it provides the ability to run frameworks other than MapReduce, such as MPI.
From documentation:
The Application Master provides much of the functionality of the traditional ResourceManager so that the entire system can scale more dramatically. In tests, we’ve already successfully simulated 10,000 node clusters composed of modern hardware without significant issue.
and:
Moving all application framework specific code into the ApplicationMaster generalizes the system so that we can now support multiple frameworks such as MapReduce, MPI and Graph Processing.
But I also think it dealt with the fact that the NameNode was a Single point of failure, and in the new version there's the Standby NameNode via the high availability mode (I might be confusing features of the old vs. new API, with features of MRv1 vs. MRv2 and that might be the cause for my question):
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.
So if you would have to choose 2 of the 3, which ones would be the 2 that serve as the two issues MRv2/YARN is designed to address?
-Resource pressure on the JobTracker
-Ability to run frameworks other than MapReduce, such as MPI.
-Single point of failure in the NameNode.
Thank you in advance! D
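To make the division of labour in the quotes above concrete, here is a deliberately tiny toy model. It is not the YARN API: the class and method names are invented for illustration, and the behaviour is simplified to one idea taken from the quoted documentation, namely that the ApplicationMaster negotiates with the ResourceManager while the NodeManager is the daemon that launches containers on its node and tracks their resource usage.

```python
class NodeManager:
    """Per-node agent: launches containers and tracks their memory usage."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = {}

    def launch(self, container_id, memory_mb):
        self.containers[container_id] = {"limit_mb": memory_mb, "used_mb": 0}

    def report_usage(self, container_id, used_mb):
        """Record usage; return False when the container exceeds its limit."""
        container = self.containers[container_id]
        container["used_mb"] = used_mb
        return used_mb <= container["limit_mb"]


class ResourceManager:
    """Cluster-wide arbiter: decides which node gets a container, launches nothing itself."""

    def __init__(self, node_managers):
        self.nodes = node_managers
        self.next_id = 0

    def allocate(self, memory_mb):
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                self.next_id += 1
                return f"container_{self.next_id}", node
        return None  # no node has enough free memory


class ApplicationMaster:
    """Per-application coordinator: asks the RM for resources, then asks the NM to launch."""

    def __init__(self, resource_manager):
        self.rm = resource_manager

    def run_task(self, memory_mb):
        grant = self.rm.allocate(memory_mb)
        if grant is None:
            return None
        container_id, node = grant
        node.launch(container_id, memory_mb)
        return container_id, node


node = NodeManager("node1", memory_mb=2048)
am = ApplicationMaster(ResourceManager([node]))
container_id, host = am.run_task(1024)
print(container_id, "running on", host.name)
print("within limit:", host.report_usage(container_id, 1500))
```

In this picture the ApplicationMaster is the "mind" and the NodeManager is the "hands": the request originates in the ApplicationMaster, but the launch and the usage check both happen in the NodeManager.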
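hadoop - MRv1 and MRv2 parameters
The full list of parameters (for Hadoop 2.6) is given at the link.
But you can run a job in either MRv1 or MRv2 style. I think some parameters, such as mapreduce.tasktracker.map.tasks.maximum, apply only to MRv1. Is that true?
If so, is there a smarter way to figure out which parameters these are? Can we pass every parameter with -Dproperty=value or -D property=value, or are there parameters that I cannot pass this way?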
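For the `-D` part of the question, the sketch below assembles a `hadoop jar` command line with generic options. Generic options such as `-D property=value` are parsed by Hadoop's GenericOptionsParser, so they only take effect when the job driver runs through ToolRunner, and they must come before the application's own arguments. The jar path, class name, and input/output paths are placeholders, and `hadoop_jar_command` is a hypothetical helper, not a Hadoop tool.

```python
import shlex


def hadoop_jar_command(jar, main_class, properties, app_args):
    """Build an argument list with generic -D options placed before application arguments."""
    command = ["hadoop", "jar", jar, main_class]
    for name, value in properties.items():
        command += ["-D", f"{name}={value}"]
    command += list(app_args)
    return command


# Placeholders: my-job.jar, com.example.MyJob, and the paths are not from the question.
command = hadoop_jar_command(
    "my-job.jar",
    "com.example.MyJob",
    {"mapreduce.job.reduces": 4, "mapreduce.map.memory.mb": 1024},
    ["/input", "/output"],
)
print(" ".join(shlex.quote(part) for part in command))
```

Only job-level settings are affected this way. Properties that a daemon reads from its own configuration at start-up, which appears to be the case for mapreduce.tasktracker.map.tasks.maximum on an MRv1 TaskTracker, are generally not changed by a per-job `-D` option.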
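caching - Explanation of Hadoop parameters
Hadoop 2.6 has the following parameters listed in its documentation:
mapreduce.job.max.split.locations (The maximum number of block locations to store for each split for locality calculation. How is it used in the locality calculation?)
mapreduce.job.split.metainfo.maxsize (The maximum permissible size of the split metainfo file. The JobTracker won't attempt to read split metainfo files bigger than the configured value. But what is the benefit of fixing it at some value? Why can't we make it flexible?)
mapreduce.job.counters.limit (What are these per-job user counters? Why would we limit them?)
mapreduce.jobhistory.datestring.cache.size (Size of the date string cache. Affects the number of directories that will be scanned to find a job. What is the benefit of setting this limit?)
mapreduce.jobhistory.joblist.cache.size (Size of the job list cache. Why do we use this cache?)
mapreduce.jobhistory.loadedjobs.cache.size (What is the difference between this one and the previous one?)
mapreduce.jobhistory.move.thread-count (The number of threads used to move files. Are they used only for moving history files? Why is this move needed?)
Do they apply to both MRv1-style and MRv2-style job execution?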