I'm trying to wrap my head around the actual purpose of the new API, and reading around the internet, I have found different answers to the same questions I was dealing with.
The questions I'd like to know the answers to are:
1) Which of the MRv2/YARN daemons is the one responsible for launching application containers and monitoring application resource usage?
2) Which two issues is MRv2/YARN designed to address?
I'll try to make this thread educational and constructive for other readers by citing resources and actual data from my searches, so I hope it doesn't look like I've provided too much information when I could have just asked the questions and kept my post shorter.
For the first question, reading the documentation, I found three main resources to rely on:
From the Hadoop documentation:
ApplicationMaster <--> NodeManager: Launch containers. Communicate with NodeManagers by using NMClientAsync objects, handling container events by NMClientAsync.CallbackHandler.
The ApplicationMaster communicates with the YARN cluster, and handles application execution. It performs operations in an asynchronous fashion. During application launch time, the main tasks of the ApplicationMaster are:
a) communicating with the ResourceManager to negotiate and allocate resources for future containers, and
b) after container allocation, communicating with YARN NodeManagers (NMs) to launch application containers on them.
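To make this concrete for myself, here's a minimal sketch of that launch step, written against the NMClientAsync API mentioned above (my own illustration, not code from the docs; the /bin/date command and the launch() helper are assumptions for the example):

```java
// A minimal sketch of an AM launching an already-allocated container.
// My own illustration based on the NMClientAsync API quoted above;
// the /bin/date command and the launch() helper are assumptions.
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.client.api.async.NMClientAsync;
import org.apache.hadoop.yarn.util.Records;

public class LaunchContainerSketch {

    // Receives the NodeManager's asynchronous replies about container lifecycle.
    static class NmCallback implements NMClientAsync.CallbackHandler {
        public void onContainerStarted(ContainerId id, Map<String, ByteBuffer> resp) {
            System.out.println("NodeManager started container " + id);
        }
        public void onContainerStatusReceived(ContainerId id, ContainerStatus status) {
            System.out.println("Status of " + id + ": " + status.getState());
        }
        public void onContainerStopped(ContainerId id) {
            System.out.println("Container " + id + " stopped");
        }
        public void onStartContainerError(ContainerId id, Throwable t) { t.printStackTrace(); }
        public void onGetContainerStatusError(ContainerId id, Throwable t) { t.printStackTrace(); }
        public void onStopContainerError(ContainerId id, Throwable t) { t.printStackTrace(); }
    }

    // 'allocated' is a Container previously granted by the ResourceManager
    // (it would arrive in AMRMClientAsync's onContainersAllocated callback).
    static void launch(Container allocated) {
        NMClientAsync nmClient = NMClientAsync.createNMClientAsync(new NmCallback());
        nmClient.init(new Configuration());
        nmClient.start();

        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
        ctx.setCommands(Collections.singletonList("/bin/date")); // the process to run

        // The AM only *asks* for the launch; the NodeManager on that node
        // is the daemon that actually starts the process.
        nmClient.startContainerAsync(allocated, ctx);
    }
}
```

The callback handler is where the container lifecycle events mentioned in the docs come back to the AM.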
From the Hortonworks documentation:
The ApplicationMaster is, in effect, an instance of a framework-specific library and is responsible for negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the containers and their resource consumption. It has the responsibility of negotiating appropriate resource containers from the ResourceManager, tracking their status and monitoring progress.
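And for the negotiation half it describes, here's a similar sketch of my own against the AMRMClientAsync API (again my code, not the docs'; the 1024 MB / 1 vcore request, the empty registration arguments, and the fixed progress value are dummy values for illustration):

```java
// A minimal sketch of the negotiation half of an AM. My own illustration
// based on the AMRMClientAsync API; the 1024 MB / 1 vcore request, the empty
// registration arguments, and the fixed progress value are dummy values.
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

public class NegotiateResourcesSketch {

    // Receives the ResourceManager's asynchronous replies.
    static class RmCallback implements AMRMClientAsync.CallbackHandler {
        public void onContainersAllocated(List<Container> containers) {
            // The RM granted containers; the AM would now hand each one
            // to NMClientAsync.startContainerAsync() to get it launched.
            System.out.println("Allocated " + containers.size() + " container(s)");
        }
        public void onContainersCompleted(List<ContainerStatus> statuses) {
            System.out.println(statuses.size() + " container(s) completed");
        }
        public void onShutdownRequest() { }
        public void onNodesUpdated(List<NodeReport> updated) { }
        public float getProgress() { return 0.5f; } // progress reported back to the RM
        public void onError(Throwable e) { e.printStackTrace(); }
    }

    public static void main(String[] args) throws Exception {
        AMRMClientAsync<ContainerRequest> rmClient =
            AMRMClientAsync.createAMRMClientAsync(1000, new RmCallback());
        rmClient.init(new Configuration());
        rmClient.start();

        // Register this AM with the ResourceManager (host/port/URL left empty here).
        rmClient.registerApplicationMaster("", 0, "");

        // Negotiate: ask the RM for one container with 1 GB of memory and 1 vcore.
        rmClient.addContainerRequest(new ContainerRequest(
            Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));
    }
}
```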
From the Cloudera documentation:
MRv2 daemons -
ResourceManager – one per cluster – Starts ApplicationMasters, allocates resources on slave nodes
ApplicationMaster – one per job – Requests resources, manages individual Map and Reduce tasks
NodeManager – one per slave node – Manages resources on individual slave nodes
JobHistory – one per cluster – Archives jobs’ metrics and metadata
Back to the question (which daemon is the one responsible for launching application containers and monitoring application resource usage?), I ask myself:
Is it the NodeManager? Is it the ApplicationMaster?
From what I understand, the ApplicationMaster is the one who makes the NodeManager actually get the job done, so it is like asking who's responsible for lifting a box off the ground: was it the hands that did the actual lifting, or the mind that controls the body and makes the hands do the lifting...
It is a tricky question, I guess, but there has to be only one answer to it.
For the second question, reading online, I found different answers in many resources, hence the confusion, but my main sources would be:
From the Cloudera documentation:
MapReduce v2 (“MRv2”) – Built on top of YARN (Yet Another Resource Negotiator)
– Uses ResourceManager/NodeManager architecture
– Increases scalability of cluster
– Node resources can be used for any type of task
– Improves cluster utilization
– Support for non-MR jobs
Back to the question (which two issues is MRv2/YARN designed to address?), I know MRv2 made a few changes, like relieving the resource pressure on the JobTracker (in MRv1, the maximum number of nodes in a cluster was around 4,000, while in MRv2 it is more than twice that number), and I also know it provides the ability to run frameworks other than MapReduce, such as MPI.
From the documentation:
The Application Master provides much of the functionality of the traditional ResourceManager so that the entire system can scale more dramatically. In tests, we’ve already successfully simulated 10,000 node clusters composed of modern hardware without significant issue.
and:
Moving all application framework specific code into the ApplicationMaster generalizes the system so that we can now support multiple frameworks such as MapReduce, MPI and Graph Processing.
But I also think it dealt with the fact that the NameNode was a single point of failure, and in the new version there's a Standby NameNode via the high-availability mode (I might be confusing features of the old vs. new API with features of MRv1 vs. MRv2, and that might be the cause of my confusion):
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.
So if you had to choose two of the three, which ones would be the two issues MRv2/YARN is designed to address?
- Resource pressure on the JobTracker
- Ability to run frameworks other than MapReduce, such as MPI
- Single point of failure in the NameNode
Thank you in advance! D