Hadoop is an Apache open-source framework, written in Java, for storing and processing large volumes of data. It distributes large data sets across clusters of computers using simple programming models, and it provides an environment of distributed storage and distributed computation across those clusters.
Why was Hadoop developed?
Hadoop was designed to scale up from a single server to thousands of machines, each offering local storage and computation. Because processing happens on the machines where the data is stored, and the data is stored redundantly across the cluster, there is little risk of losing data: even if a node runs into a problem, the data can be recovered from elsewhere in the cluster.
The Hadoop architecture comprises four modules: MapReduce, the Hadoop Distributed File System (HDFS), the YARN framework, and the Common utilities. Around this core sits a collection of related software packages such as Apache Pig, Apache Hive, and Apache HBase.
Now I am going to discuss the first module of the Hadoop architecture, the MapReduce framework.
MapReduce is a software framework for writing applications that process large volumes of data across clusters of computers in a reliable, fault-tolerant way. A MapReduce program performs two kinds of tasks: the Map task and the Reduce task.
- Map Task: The map task runs first; it takes the input data and converts it into key/value pairs (tuples).
- Reduce Task: The reduce task takes the output of the map task as its input and combines those tuples into a smaller set of tuples.
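The two phases can be sketched in plain Java, without any Hadoop dependencies. This is a minimal word-count illustration of the idea, not the Hadoop API; the class and method names here are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A minimal, Hadoop-free sketch of the two MapReduce phases.
public class MapReduceSketch {

    // Map task: turn each input line into (word, 1) key/value tuples.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> tuples = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                tuples.add(Map.entry(word, 1));
            }
        }
        return tuples;
    }

    // Reduce task: combine all tuples that share a key into one total per key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> tuples) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> t : tuples) {
            totals.merge(t.getKey(), t.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> tuples = new ArrayList<>();
        for (String line : new String[] {"big data big clusters", "big data"}) {
            tuples.addAll(map(line));   // map phase: one call per input split
        }
        Map<String, Integer> counts = reduce(tuples); // reduce phase
        System.out.println(counts);    // {big=3, data=2, clusters=1}
    }
}
```

In real Hadoop the framework also sorts and groups the tuples by key between the two phases, so each reduce call sees all values for one key together.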
All the inputs and outputs are stored in a file system. The Hadoop framework takes care of scheduling and monitoring the tasks, and it re-executes any task that fails.
The MapReduce framework consists of a single master, the JobTracker, and one slave TaskTracker per cluster node. The master schedules the tasks and provides the resources the slaves need to execute them; the slaves execute the tasks and report their status back. If the JobTracker fails, all processing halts, so it is a single point of failure.
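The master's schedule-and-retry behaviour can be sketched as follows. This is an illustrative simulation in plain Java, not Hadoop's actual JobTracker code; the `Worker` interface and all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch (not the Hadoop API): a master hands tasks to a
// worker and re-queues any task whose worker reports failure.
public class JobTrackerSketch {

    interface Worker { boolean run(String task); } // true = success

    // Keep scheduling until every task has reported success;
    // returns how many attempts each task needed.
    static Map<String, Integer> schedule(Deque<String> tasks, Worker worker) {
        Map<String, Integer> attempts = new HashMap<>();
        while (!tasks.isEmpty()) {
            String task = tasks.poll();
            attempts.merge(task, 1, Integer::sum);
            if (!worker.run(task)) {
                tasks.add(task); // failed: re-execute later, as the JobTracker does
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        Deque<String> tasks = new ArrayDeque<>(List.of("map-1", "map-2"));
        // A flaky worker: "map-2" fails on its first attempt only.
        Map<String, Boolean> failedOnce = new HashMap<>();
        Worker flaky = t -> !(t.equals("map-2") && failedOnce.putIfAbsent(t, true) == null);
        System.out.println(schedule(tasks, flaky)); // map-1 ran once, map-2 twice
    }
}
```

The real JobTracker additionally tracks which TaskTracker ran each attempt and prefers to reschedule a task on a node that holds the task's input data.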
Hadoop Distributed File System:
The Hadoop Distributed File System is the module that works alongside MapReduce. HDFS is based on the Google File System and provides a distributed environment for performing tasks across a cluster of machines. Hadoop can work with several file systems, such as the local FS, HFTP FS, and S3 FS, but it most commonly uses HDFS.
HDFS also uses a master/slave architecture. The master is a single NameNode, which manages the file system metadata, and the slaves are one or more DataNodes, which store the actual data. A file in HDFS is split into blocks, which are stored on a set of DataNodes; the master (the NameNode) determines the mapping of blocks to DataNodes, and it manages and guides the slaves in processing the data. HDFS also provides a shell of commands for interacting directly with the file system.
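The block-to-DataNode mapping can be sketched in a few lines. This is an illustrative model, not HDFS source code: real HDFS places replicas rack-aware rather than round-robin, and all names here are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: split a file into fixed-size blocks and assign
// each block's replicas to DataNodes round-robin, modelling the kind of
// block-to-DataNode mapping the NameNode records.
public class BlockPlacementSketch {

    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB, the HDFS default

    // Returns, for each block, the list of DataNodes holding a replica.
    static List<List<String>> place(long fileSize, List<String> dataNodes, int replicas) {
        int blocks = (int) ((fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE); // ceiling division
        List<List<String>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<String> holders = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                holders.add(dataNodes.get((b + r) % dataNodes.size()));
            }
            placement.add(holders);
        }
        return placement;
    }

    public static void main(String[] args) {
        long fileSize = 300L * 1024 * 1024; // a 300 MB file -> 3 blocks
        List<String> nodes = List.of("dn1", "dn2", "dn3", "dn4");
        place(fileSize, nodes, 3).forEach(System.out::println);
        // block 0 -> [dn1, dn2, dn3], block 1 -> [dn2, dn3, dn4], ...
    }
}
```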
How does Hadoop work?
Below is a step-by-step example of how the tasks are performed in order to execute a job.
Step 1: A user application submits a job to the Hadoop job client. It must specify the locations of the input and output files in HDFS, supply the Java jar file containing the implementations of the map and reduce functions, and provide any other configuration the job needs.
Step 2: The Hadoop job client then submits the job (the jar file and the configuration) to the JobTracker, which hands tasks over to the slaves, schedules them, and monitors their status.
Step 3: The TaskTrackers on the different nodes execute the tasks according to the MapReduce implementation, and the output of the reduce function is stored in files in the output file system.
Key benefits of the Hadoop framework:
- It makes writing and testing distributed systems quick and efficient.
- Hadoop is designed with its own libraries to detect and handle failures at the application layer.
- Servers can be added to or removed from the cluster without interrupting running processes.
- It is an open-source framework, and since it is based on Java it runs on all major platforms.
Now let’s go through the Hadoop Distributed File System in depth.
Hadoop Distributed File System:
The Hadoop file system was developed using a distributed file system design. It holds very large amounts of data and provides easy access to it. It gives users a fault-tolerant environment: data is stored redundantly across the nodes, which protects against node failure, and data sets are processed very effectively in parallel.
Key Features of Hadoop Distributed File System:
- HDFS is well suited for distributed storage and distributed processing.
- Hadoop provides a command interface for interacting with HDFS.
- The built-in NameNode and DataNode servers make it easy to check the status of the cluster.
- It provides streaming access to file system data, along with file permissions and authentication.
Below is the HDFS architecture representing the NameNode and DataNode in the Hadoop file system.
The HDFS architecture follows the same master/slave pattern as the MapReduce module. It consists of two main elements: the NameNode and the DataNode.
The NameNode runs on commodity hardware with a Linux operating system and the NameNode software. The system hosting the NameNode acts as the master server and does the following: it manages the file system namespace, regulates the clients' access to files, and executes file system operations such as opening, closing, and renaming files and directories.
A DataNode is likewise commodity hardware running a Linux operating system and the DataNode software. Every node in the cluster has a DataNode, which manages the data storage of its node. The data in the files is divided into segments termed blocks; the block size can be changed in the HDFS configuration as your requirements demand.
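As an example, the block size and replication factor are set in the `hdfs-site.xml` configuration file. This is a minimal illustrative fragment; `dfs.blocksize` and `dfs.replication` are real HDFS properties, and the values shown (256 MB blocks, 3 replicas) are just example choices.

```xml
<!-- hdfs-site.xml: minimal example; 268435456 bytes = 256 MB -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```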
Goals of Hadoop Distributed File System:
Here are some major goals of the Hadoop Distributed File System.
- It provides strong mechanisms for automatic fault detection and recovery.
- It should scale to thousands of nodes per cluster, because the data volumes are very high.
- When huge data sets are involved, it moves computation close to the data, which reduces network traffic and increases throughput.
Hadoop is a very efficient technology for processing huge volumes of data. It can recover lost data and is fault tolerant even on unreliable hardware. The Hadoop Distributed File System, with its NameNode and DataNode components, stores the bulk data that the MapReduce functions process and execute. For users who want to process large volumes of data without failure, Hadoop is a very good choice of technology.