Configure Hadoop Using Ansible Playbook

Jan 13, 2021

Hadoop

Apache Hadoop is an open-source framework for efficiently storing and processing large datasets, ranging in size from gigabytes to petabytes. Instead of relying on one large computer to store and process the data, Hadoop clusters multiple computers so that massive datasets can be analyzed in parallel, and therefore more quickly.

DataNode

DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode runs on commodity hardware, that is, an inexpensive system that is neither high-end nor highly available.

NameNode

The NameNode is the centerpiece of HDFS. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

Our task is to write an Ansible playbook that configures the NameNode in the master host group and the DataNodes in the slave host group.
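To make the shape of such a playbook concrete, here is a minimal sketch with one play per host group. It only creates the storage directory each role needs. The file name hadoop.yml, the variables namenode_dir and datanode_dir, and the paths /nn and /dn are assumptions made for illustration, not values taken from this setup.

```yaml
# hadoop.yml (hypothetical file name)
# Play 1 targets the "master" host group, play 2 the "slave" host group.
- name: Configure the NameNode
  hosts: master
  become: true
  vars:
    namenode_dir: /nn   # assumed path for NameNode metadata
  tasks:
    - name: Create the NameNode metadata directory
      ansible.builtin.file:
        path: "{{ namenode_dir }}"
        state: directory
        mode: "0755"

- name: Configure the DataNodes
  hosts: slave
  become: true
  vars:
    datanode_dir: /dn   # assumed path for DataNode block storage
  tasks:
    - name: Create the DataNode storage directory
      ansible.builtin.file:
        path: "{{ datanode_dir }}"
        state: directory
        mode: "0755"
```

A complete playbook would also need to install Java and Hadoop, write hdfs-site.xml and core-site.xml, and start the daemons. Those steps are left out of this sketch.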

I will use two DataNodes on AWS, one NameNode on AWS, and one Ansible controller node running in a virtual machine.
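An inventory for this layout might look roughly like the sketch below, written in Ansible's YAML inventory format. The host aliases, the IP addresses (taken from a documentation-only range), the ec2-user login, and the key file path are all placeholders; substitute the values of your own AWS instances.

```yaml
# inventory.yml (hypothetical file name; all values below are placeholders)
all:
  vars:
    ansible_user: ec2-user                                # assumed login user on the instances
    ansible_ssh_private_key_file: ~/.ssh/hadoop-key.pem   # assumed key file path
  children:
    master:
      hosts:
        namenode1:
          ansible_host: 203.0.113.10   # placeholder address of the NameNode
    slave:
      hosts:
        datanode1:
          ansible_host: 203.0.113.11   # placeholder address of the first DataNode
        datanode2:
          ansible_host: 203.0.113.12   # placeholder address of the second DataNode
```

From the controller node, running `ansible -i inventory.yml all -m ping` is a quick way to check that all three AWS hosts are reachable before running any playbook.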