Understanding Hadoop YARN
YARN (Yet Another Resource Negotiator) was introduced in Hadoop 2 and is one of the two major components of Apache Hadoop, alongside HDFS. It schedules the use of cluster resources as well as the processing applied to the data.
Architecture
In the basic YARN architecture, the ResourceManager is the core component. It is responsible for managing resources throughout the cluster, including memory, CPU, and other resources. The ApplicationMaster is responsible for scheduling an application throughout its life cycle, and the NodeManager is responsible for supplying and isolating resources on its own node.
The ResourceManager: controls resource management for the cluster and makes allocation decisions. It has two main components: the Scheduler and the ApplicationsManager.
- The Scheduler: called the YarnScheduler, it allows different policies for managing constraints such as capacity, fairness, and service level agreements.
- The ApplicationsManager: is responsible for maintaining a list of submitted applications. After a client submits an application, the ApplicationsManager first checks whether the resources required for the application's ApplicationMaster can be satisfied. If enough resources are available, it forwards the application to the Scheduler; otherwise the application is rejected.
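The admission check described above can be sketched as a toy model in Python. This is not YARN's actual code: the `Resource` class, the `admit` function, the returned labels and the numbers are all hypothetical, and the check simply compares the ApplicationMaster's request with the resources assumed to be free.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical resource vector: memory in MB and virtual cores."""
    memory_mb: int
    vcores: int

    def fits_in(self, other: "Resource") -> bool:
        # A request fits only if every dimension fits.
        return self.memory_mb <= other.memory_mb and self.vcores <= other.vcores


def admit(am_request: Resource, available: Resource) -> str:
    """Toy admission check: forward to the scheduler or reject."""
    if am_request.fits_in(available):
        return "FORWARDED_TO_SCHEDULER"
    return "REJECTED"


free = Resource(memory_mb=8192, vcores=4)
small_am = admit(Resource(memory_mb=1024, vcores=1), free)
huge_am = admit(Resource(memory_mb=16384, vcores=1), free)
print(small_am, huge_am)
```

The second request is rejected because its memory demand exceeds what is free, even though its CPU demand would fit.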
The NodeManager: is responsible for launching and managing containers on a node. Containers execute tasks as specified by the ApplicationMaster.
- The container: represents a set of resources allocated to an ApplicationMaster, such as memory, CPU, disk and network I/O. The ResourceManager is responsible for granting containers to an ApplicationMaster.
- The ApplicationMaster: is an instance of a framework-specific library that negotiates resources from the ResourceManager and works with the NodeManager to execute and monitor the granted resources (bundled as containers) for a given application. An application can be, for example, a MapReduce job or a Hive job.
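To make the idea of a container as a bundle of resources on a node more concrete, here is a minimal sketch, assuming a node that tracks only memory and virtual cores. `ToyNodeManager` and its methods are hypothetical names for illustration, not part of the Hadoop API.

```python
class ToyNodeManager:
    """Hypothetical stand-in for a NodeManager tracking free memory and vcores."""

    def __init__(self, memory_mb: int, vcores: int):
        self.free_memory_mb = memory_mb
        self.free_vcores = vcores
        self.containers = {}  # container id -> (memory_mb, vcores)

    def launch(self, container_id: str, memory_mb: int, vcores: int) -> bool:
        # Refuse the container if it would exceed what the node has left.
        if memory_mb > self.free_memory_mb or vcores > self.free_vcores:
            return False
        self.free_memory_mb -= memory_mb
        self.free_vcores -= vcores
        self.containers[container_id] = (memory_mb, vcores)
        return True

    def release(self, container_id: str) -> None:
        # Return the container's resources to the node.
        memory_mb, vcores = self.containers.pop(container_id)
        self.free_memory_mb += memory_mb
        self.free_vcores += vcores


node = ToyNodeManager(memory_mb=4096, vcores=4)
first = node.launch("container_1", 2048, 2)
second = node.launch("container_2", 3072, 1)  # does not fit: only 2048 MB left
node.release("container_1")
third = node.launch("container_2", 3072, 1)   # fits once container_1 is released
```

A real NodeManager also enforces isolation between containers; this sketch covers only the bookkeeping side.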
Steps of executing Applications with YARN:
- A client submits an application to the YARN ResourceManager.
- The ApplicationsManager (in the ResourceManager) negotiates a container and bootstraps the ApplicationMaster instance for the application.
- The ApplicationMaster registers with the ResourceManager and requests containers (memory and CPU).
- The ApplicationMaster communicates with NodeManagers to launch the containers it has been granted.
- The ApplicationMaster manages application execution. During execution, the application provides progress and status information to the ApplicationMaster. The client can monitor the application’s status by querying the ResourceManager or by communicating directly with the ApplicationMaster.
- The ApplicationMaster reports completion of the application to the ResourceManager.
- The ApplicationMaster un-registers with the ResourceManager, which then cleans up the ApplicationMaster container.
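The steps above can be read as a linear sequence of phases. The sketch below models that sequence in Python. The `Step` names are hypothetical labels chosen to mirror the list and are not YARN's internal application states.

```python
from enum import Enum, auto


class Step(Enum):
    SUBMITTED = auto()            # client submits the application
    AM_STARTED = auto()           # ApplicationMaster bootstrapped in a container
    AM_REGISTERED = auto()        # ApplicationMaster registers, requests containers
    CONTAINERS_LAUNCHED = auto()  # ApplicationMaster asks NodeManagers to launch
    RUNNING = auto()              # application executes and reports progress
    COMPLETED = auto()            # completion reported to the ResourceManager
    AM_UNREGISTERED = auto()      # ApplicationMaster un-registers, cleanup follows


# Enum members iterate in definition order, which is the order of the steps.
ORDER = list(Step)


def advance(current: Step) -> Step:
    """Return the step that follows `current`; fail if the flow is over."""
    index = ORDER.index(current)
    if index == len(ORDER) - 1:
        raise ValueError("application has already finished")
    return ORDER[index + 1]


step = Step.SUBMITTED
trace = [step.name]
while step is not Step.AM_UNREGISTERED:
    step = advance(step)
    trace.append(step.name)
print(" -> ".join(trace))
```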
YARN Scheduler:
The Scheduler has a pluggable policy which is responsible for partitioning the cluster resources among the various queues, applications, etc.
1. FIFO scheduler
The FIFO scheduler is one of the earliest scheduling strategies used by Hadoop and can be thought of simply as a queue: only one job runs in the cluster at a time. All applications are executed in order of submission, and each job starts only after the previous job in the queue has completed.
The drawback is that a short application submitted behind a long-running one has to wait until the long one finishes, which makes FIFO a poor fit for shared clusters.
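A small simulation shows how a queue that runs one job at a time affects waiting. This is an illustrative sketch only: the job names and durations are made up, and all jobs are assumed to be submitted at time 0.

```python
from collections import deque


def fifo_finish_times(jobs):
    """jobs: list of (name, duration) in submission order, all submitted at t=0.

    Returns {name: finish_time} when jobs run one at a time in that order.
    """
    queue = deque(jobs)
    clock = 0
    finish = {}
    while queue:
        name, duration = queue.popleft()
        clock += duration  # the whole cluster is busy with this job
        finish[name] = clock
    return finish


times = fifo_finish_times([("long_etl", 60), ("short_query", 2)])
print(times)  # {'long_etl': 60, 'short_query': 62}
```

The two-unit job finishes at time 62 because it had to wait for the 60-unit job ahead of it.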
2. Capacity scheduler
The Capacity scheduler is a pluggable scheduler for Hadoop that allows multiple tenants to securely share a large cluster. Resources are allocated to each tenant's applications in a way that fully utilizes the cluster, governed by the constraints of allocated capacities.
Queues are typically set up by administrators to reflect the economics of the shared cluster. The Capacity Scheduler supports hierarchical queues to ensure that resources are shared among the sub-queues of an organization before other queues are allowed to use free resources.
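To illustrate how hierarchical queues divide the cluster, the sketch below computes each leaf queue's share of the whole cluster from per-level percentages. The queue names and percentages are hypothetical, and the nested dictionary is an ad hoc representation for this example, not the Capacity Scheduler's configuration format.

```python
def absolute_capacities(tree, parent_path="root", parent_share=100.0):
    """Flatten a nested {queue: (percent_of_parent, children)} dict.

    Returns {queue_path: percent_of_cluster} for every leaf queue.
    """
    result = {}
    for name, (percent, children) in tree.items():
        path = f"{parent_path}.{name}"
        share = parent_share * percent / 100
        if children:
            # A parent queue splits its own share among its sub-queues.
            result.update(absolute_capacities(children, path, share))
        else:
            result[path] = share
    return result


# Hypothetical layout: two organizations, one of which has two sub-queues.
queues = {
    "org_a": (60, {"dev": (25, {}), "prod": (75, {})}),
    "org_b": (40, {}),
}
caps = absolute_capacities(queues)
print(caps)
```

Here `root.org_a.dev` ends up with 15% of the cluster (25% of org_a's 60%), which is the guaranteed share. The sketch does not model sub-queues borrowing idle capacity from one another.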
3. Fair scheduler
The FairScheduler is a pluggable scheduler for Hadoop that allows YARN applications to share resources in a large cluster fairly. Fair scheduling is a method of assigning resources to applications such that all applications get, on average, an equal share of resources over time.
By default, the Fair Scheduler bases scheduling fairness decisions only on memory. It can be configured to schedule on both memory and CPU, using Dominant Resource Fairness.
When a single application is running, it can use the entire cluster. When other applications are submitted, resources that free up are assigned to the new applications, so that each application eventually gets approximately the same amount of resources.
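The idea of an equal share can be sketched with a simple max-min fair split over a single resource (memory, matching the default). This is a simplified model under stated assumptions: the function name, application names and demands are hypothetical, and the real Fair Scheduler also handles queues, weights and other settings that are left out here.

```python
def max_min_fair_share(capacity, demands):
    """Split `capacity` so no app gets more than it asked for and leftover
    capacity is shared equally among the apps that still want more."""
    allocation = {}
    remaining = float(capacity)
    pending = dict(demands)
    while pending:
        equal_share = remaining / len(pending)
        satisfied = [app for app, need in pending.items() if need <= equal_share]
        if not satisfied:
            # Everyone left wants more than an equal share: split evenly.
            for app in pending:
                allocation[app] = equal_share
            break
        for app in satisfied:
            # Small demands are met in full; the rest is redistributed.
            allocation[app] = float(pending.pop(app))
            remaining -= allocation[app]
    return allocation


alone = max_min_fair_share(12, {"app_a": 20})
shared = max_min_fair_share(12, {"app_a": 2, "app_b": 8, "app_c": 10})
print(alone)   # {'app_a': 12.0}
print(shared)  # {'app_a': 2.0, 'app_b': 5.0, 'app_c': 5.0}
```

With a single application the whole capacity goes to it. With three, the small request is met in full and the other two split what is left equally.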