Bantu Tech

Introduction to Hadoop


Apache™ Hadoop® is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. As an open-source framework, Hadoop can be installed on industry-standard servers. Hardware can be added to or replaced in a cluster, and VMs (Virtual Machines) can also be cloned. Hadoop is economical in the sense that costs are relatively low, since the software is common across the infrastructure.

At the heart of the Hadoop framework lies MapReduce, a programming framework built around two functions: Map and Reduce. The map job comes first: it takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job (1). In addition to simplifying the processing of big data sets, MapReduce also provides programmers with a common method of defining and orchestrating complex processing tasks across clusters of computers.
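To make the map and reduce jobs concrete, here is a minimal word-count sketch in plain Python. It does not use Hadoop or any Hadoop API; it only imitates the flow on a single machine. The function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the grouping step in the middle stands in for the work the framework does between the two jobs.

```python
from collections import defaultdict


def map_phase(line):
    """Map: break one line of text into (word, 1) key/value tuples."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group every value under its key, ready for the reduce job."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single tuple."""
    return (key, sum(values))


def word_count(lines):
    """Run map, then group, then reduce, in that order."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data nodes"]))
```

The six input words become six tuples in the map step and four tuples after the reduce step, which is the "smaller set of tuples" described above. On a real cluster the map and reduce jobs would run in parallel on many nodes rather than in one process.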

As Hadoop works alongside existing databases and analytical infrastructures, there is no need to worry about it displacing existing systems or data. Hadoop can handle datasets and tasks that would be a problem for a legacy database. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Hadoop clusters are known for boosting the speed of data analysis applications. They are also highly scalable: if a cluster's processing power is overwhelmed by growing volumes of data, additional cluster nodes can be added to increase throughput. Hadoop clusters are also highly resistant to failure because each piece of data is copied onto other cluster nodes, which ensures that the data is not lost if one node fails (2).
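The fault-tolerance idea can be sketched in a few lines of plain Python. This is a toy model, not how Hadoop actually chooses where to put copies: the round-robin placement, the function names (place_replicas, surviving_blocks) and the node and block labels are all assumptions made for illustration. The default of three copies is also just an illustrative value here.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Toy placement policy for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    placement = {}
    for index, block in enumerate(blocks):
        placement[block] = [
            nodes[(index + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


def surviving_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one copy on a live node."""
    failed = set(failed_nodes)
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


if __name__ == "__main__":
    placement = place_replicas(["b0", "b1", "b2", "b3"], ["n1", "n2", "n3", "n4"])
    # Every block is still readable after node n1 fails.
    print(surviving_blocks(placement, ["n1"]) == set(placement))
```

With a single copy per block, losing a node loses whatever that node held. With several copies on different nodes, one failure leaves every block readable.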

If you are interested in Hadoop, I highly recommend the Big Data University course called Hadoop Fundamentals.


The base Apache Hadoop framework is composed of the following modules:

  • Hadoop Common – contains the libraries and utilities needed by the other Hadoop modules.
  • Hadoop Distributed File System (HDFS) – a distributed, Java-based file system for storing large volumes of data. HDFS is a scalable, fault-tolerant storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN.
  • Hadoop YARN – a resource management platform responsible for managing computing resources in clusters and using them to schedule users' applications.
  • Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.

An excellent course on the basics of Hadoop is offered by Big Data University. I highly recommend it as a starting point for learning and understanding Big Data. The course teaches the basics of Apache Hadoop and the concept of Big Data. It is free, including all the testing materials and software, and it is now accredited by IBM, meaning you can earn a professionally recognised IBM Badge, certified by Pearson VUE. More information is available here.

 

References

1. IBM - What is MapReduce [Internet]. www-01.ibm.com. 2016 [cited 20 April 2016]. Available from: https://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/

2. What is Hadoop cluster? - Definition from WhatIs.com [Internet]. SearchBusinessAnalytics. 2016 [cited 20 April 2016]. Available from: http://searchbusinessanalytics.techtarget.com/definition/Hadoop-cluster

Posted in Storage, Hadoop by Stephen Chapendama, April 20, 2016
Tags: Hadoop, Storage, SQL, Java, Open-Source, Ubuntu, Linux, Virtual Machines, Big Data, Big Data University, Hadoop Fundamentals, Open Cloud, Open Support, Yarn, Data, Apache