Apache Hadoop is an open-source framework for efficiently storing and processing large datasets, ranging in size from gigabytes to petabytes. Instead of relying on one large computer to store and process the data, Hadoop clusters multiple computers so that massive datasets can be analyzed in parallel, and therefore more quickly.
DataNodes are the slave nodes in HDFS; they hold the actual file data. Unlike the NameNode, a DataNode typically runs on commodity hardware, that is, an inexpensive system that is not built for high quality or high availability.
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the file data itself.
Our task is to write an Ansible playbook that configures the DataNodes and the NameNode, which belong to the slave and master host groups respectively.
I will use two DataNodes and one NameNode on AWS, plus one Ansible controller node running in a virtual machine.
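To make the layout concrete, here is a minimal sketch of what an inventory for this setup could look like, written in Ansible's YAML inventory format. The group names `master` and `slave` come from the task description; the host aliases, IP addresses, remote user, and key path are placeholders I made up for illustration and need to be replaced with the values of your own instances.

```yaml
# inventory.yml (hypothetical file name)
# All addresses, the user name, and the key path below are placeholders.
all:
  vars:
    ansible_user: ec2-user                              # assumed login user
    ansible_ssh_private_key_file: ~/.ssh/hadoop-key.pem # assumed key location
  children:
    master:
      hosts:
        namenode1:
          ansible_host: 203.0.113.10
    slave:
      hosts:
        datanode1:
          ansible_host: 203.0.113.11
        datanode2:
          ansible_host: 203.0.113.12
```

Running `ansible-inventory -i inventory.yml --graph` on the controller node should print the two groups with their hosts, which is a quick way to confirm that the file parses as intended.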
- Launch three instances on AWS.
- Update the Ansible inventory file with the new hosts.
- Configure the Ansible configuration file.
- Run the playbook.
- Review the output of the playbook.
- Check the NameNode and the DataNodes to confirm that they have been configured.
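The steps above can be sketched as a playbook along the following lines. This is not the original playbook from this post, only a minimal illustration under several assumptions: Java and Hadoop 3.x are already installed on every node, the `hdfs` command is on the `PATH` of the remote user, and the configuration directory, data directories, and NameNode address are placeholders. On Hadoop 1.x or 2.x the daemons are started with `hadoop-daemon.sh start namenode` and `hadoop-daemon.sh start datanode` instead.

```yaml
# hadoop.yml (hypothetical file name)
# Sketch only: every path, port, and address below is a placeholder.
- name: Configure and start the NameNode
  hosts: master
  become: true
  vars:
    hadoop_conf_dir: /etc/hadoop            # assumed configuration directory
    name_dir: /nn                           # hypothetical metadata directory
    # Placeholder; on EC2 this typically has to be an address the NameNode
    # can bind to, such as its private IP.
    namenode_url: "hdfs://10.0.0.10:9000"
  tasks:
    - name: Create the metadata directory
      ansible.builtin.file:
        path: "{{ name_dir }}"
        state: directory
        mode: "0755"

    - name: Point HDFS at the metadata directory
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/hdfs-site.xml"
        mode: "0644"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.namenode.name.dir</name>
              <value>{{ name_dir }}</value>
            </property>
          </configuration>

    - name: Set the file system URL
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/core-site.xml"
        mode: "0644"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>fs.defaultFS</name>
              <value>{{ namenode_url }}</value>
            </property>
          </configuration>

    - name: Format the NameNode only if it has never been formatted
      ansible.builtin.command:
        cmd: hdfs namenode -format -nonInteractive
        creates: "{{ name_dir }}/current/VERSION"

    - name: Start the NameNode daemon (may fail if it is already running)
      ansible.builtin.command:
        cmd: hdfs --daemon start namenode

- name: Configure and start the DataNodes
  hosts: slave
  become: true
  vars:
    hadoop_conf_dir: /etc/hadoop            # assumed configuration directory
    data_dir: /dn                           # hypothetical block storage directory
    namenode_url: "hdfs://10.0.0.10:9000"   # must match the NameNode play
  tasks:
    - name: Create the data directory
      ansible.builtin.file:
        path: "{{ data_dir }}"
        state: directory
        mode: "0755"

    - name: Point HDFS at the data directory
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/hdfs-site.xml"
        mode: "0644"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.datanode.data.dir</name>
              <value>{{ data_dir }}</value>
            </property>
          </configuration>

    - name: Tell the DataNode where the NameNode is
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/core-site.xml"
        mode: "0644"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>fs.defaultFS</name>
              <value>{{ namenode_url }}</value>
            </property>
          </configuration>

    - name: Start the DataNode daemon (may fail if it is already running)
      ansible.builtin.command:
        cmd: hdfs --daemon start datanode
```

With an inventory that defines the `master` and `slave` groups, such a playbook would be run from the controller node as `ansible-playbook -i inventory.yml hadoop.yml` (both file names are hypothetical). Afterwards, running `hdfs dfsadmin -report` on the NameNode should list the DataNodes that have connected. The format task uses `creates` so that a second run does not try to reformat an existing NameNode; the two start tasks are deliberately simple and are not idempotent.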