HDPCA Exam Objective – Configure the Capacity Scheduler

Note: This post is part of the HDPCA exam objective series.

YARN Schedulers

The Hadoop YARN scheduler is responsible for assigning cluster resources to the applications submitted by users. YARN provides three schedulers:

  1. First in First out (FIFO) (Hadoop 1.x)
  2. Fair scheduler
  3. Capacity scheduler

First in First out (FIFO)

YARN includes a First in First out (FIFO) scheduler, which places jobs in a single queue and runs them in the order they arrive. FIFO scheduling is rarely a good fit for large multi-user Hadoop deployments, because one long-running job can hold up every job queued behind it.

Fair scheduler

The Fair scheduler aims to give all running jobs an equal share of resources over time. As resources free up, they are assigned to newly submitted jobs until every running job holds roughly the same amount of resources.

Capacity scheduler

The Capacity scheduler allows a large cluster to be shared across multiple organizational entities. It guarantees each entity a minimum capacity and prevents any single user or job from holding all the resources. To achieve this, the Capacity scheduler defines queues and queue hierarchies, with each queue having a guaranteed capacity. When other queues have excess resources, the Capacity scheduler allows jobs to borrow them.
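For illustration only, here is a rough sketch of how guaranteed capacity and borrowing might be expressed in capacity-scheduler.xml. The queue names "engineering" and "marketing" and the percentages are hypothetical. The property names follow the yarn.scheduler.capacity.* convention used by the Capacity scheduler.

```xml
<!-- capacity-scheduler.xml fragment; "engineering" and "marketing" are hypothetical queue names -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>engineering,marketing</value>
</property>
<property>
  <!-- guaranteed share, as a percentage of the parent queue -->
  <name>yarn.scheduler.capacity.root.engineering.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.marketing.capacity</name>
  <value>40</value>
</property>
<property>
  <!-- upper limit when borrowing idle resources from other queues -->
  <name>yarn.scheduler.capacity.root.engineering.maximum-capacity</name>
  <value>80</value>
</property>
```

The capacities of the queues under the same parent add up to 100. In this sketch, "engineering" is guaranteed 60% of the cluster and can grow to 80% when "marketing" is idle.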

Note: For the HDPCA exam, we only need to concentrate on the configuration of the Capacity scheduler, so we will not cover the other 2 schedulers. The FIFO scheduler is rarely used in production environments anyway. Also, the HDPCA exam does not expect us to configure actual queues; only enabling the Capacity scheduler is expected.

Enabling the Capacity Scheduler (Command Line)

1. To enable the Capacity scheduler, make sure the following property is set in the YARN configuration file /etc/hadoop/conf/yarn-site.xml on the ResourceManager host:

# vi /etc/hadoop/conf/yarn-site.xml

Property: yarn.resourcemanager.scheduler.class
Value: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
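In yarn-site.xml, this property and value are written as a property element inside the top-level configuration element. A minimal fragment, assuming the standard Hadoop XML configuration layout, would look like this:

```xml
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
```

If the property already exists with a different value, edit the existing entry rather than adding a second one.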

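2. Switch to the user “yarn” and run the command below, which refreshes the current queues. Note that if the scheduler class itself was changed in step 1, the ResourceManager also has to be restarted before the new scheduler takes effect.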

$ yarn rmadmin -refreshQueues
18/07/22 09:04:22 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
18/07/22 09:04:23 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]

Enabling the Capacity Scheduler (With Ambari)

1. To enable the Capacity scheduler using Ambari, go to Services > YARN > Configs. Search for the property yarn.resourcemanager.scheduler.class in the filter box. In this example, the Fair scheduler is currently set as the scheduler.

2. Change the value of the scheduler property to org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler and click Save to save the config.

3. Provide an appropriate description while saving the config.

4. We will have to restart the YARN service for the changes to take effect.

Verify

You can verify the scheduler after restarting the YARN service. Search for the property “yarn.resourcemanager.scheduler.class” in the filter box; its value should now be the CapacityScheduler class.

You can also verify the scheduler type in the yarn configuration file /etc/hadoop/conf/yarn-site.xml.

# cat /etc/hadoop/conf/yarn-site.xml
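
Because yarn-site.xml is usually long, it can be easier to print just the scheduler setting instead of the whole file. The small helper below is a sketch, not part of any Hadoop tooling: the function name scheduler_class is made up for this post, and it assumes the value tag sits on the same line as the name tag or on the line right after it.

```shell
# Print the configured value of yarn.resourcemanager.scheduler.class.
# Assumption: <value> is on the same line as <name> or on the line right after it.
# scheduler_class is a hypothetical helper name, not a Hadoop command.
scheduler_class() {
    conf="${1:-/etc/hadoop/conf/yarn-site.xml}"
    # grep keeps the matching <name> line plus the next line;
    # sed then prints only the text between <value> and </value>.
    grep -A 1 '<name>yarn.resourcemanager.scheduler.class</name>' "$conf" |
        sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}
```

Running scheduler_class with no argument reads /etc/hadoop/conf/yarn-site.xml; pass a different path as the first argument to check another copy of the file. After the change, it should print the CapacityScheduler class name.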