The Big 3 in Big Data Analytics
One of the three main technologies associated with Big Data analytics is NoSQL, commonly read as "Not Only SQL." It is a database infrastructure well adapted to the demands of big data, allowing more agile processing of information at large scale than other database technologies. The benefit of using NoSQL rather than a relational database is that NoSQL databases are unstructured in nature, trading off stringent consistency requirements for speed and agility. DataJobs.com describes the functional benefit of NoSQL databases as being horizontally scalable: as data continues to explode, more servers can be added to keep up, with no slowdown in performance.
Most NoSQL databases support some form of database replication as a feature linked to disaster recovery plans, and the more sophisticated ones offer automated failover and recovery.
Redis (REmote DIctionary Server) is an example of a NoSQL solution. It is described by www.devbridge.com as an "open source, BSD licensed, advanced key-value store." Redis supports data structures such as strings, hashes, lists, sorted sets with range queries, HyperLogLogs and geospatial indexes, amongst other features. The development of Redis has been sponsored by Redis Labs since June 2015. According to the monthly ranking by DB-Engines.com, Redis is the most popular key-value database. One of its key selling points has been that there is no notable speed difference between write and read operations.
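As a brief illustration of the key-value model, the sketch below uses the redis-py client. The host, port, key names and values are assumptions made for the example, not drawn from any source quoted here; it assumes a Redis server is running locally on the default port.

```python
import redis

# Connect to a local Redis server (assumes redis-py is installed
# and a Redis instance is listening on the default port 6379).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Plain string value under a key
r.set("user:1000:name", "Ada")
print(r.get("user:1000:name"))  # -> "Ada"

# Hash: a map of fields to values stored under one key
r.hset("user:1000", mapping={"name": "Ada", "visits": 1})
r.hincrby("user:1000", "visits", 1)

# Sorted set with a range query: members ordered by score
r.zadd("leaderboard", {"ada": 120, "bob": 95})
print(r.zrangebyscore("leaderboard", 100, 200))  # -> ["ada"]
```

Every operation above is a direct command against a key, which is what keeps reads and writes similarly fast: there is no query planner or join machinery between the client and the data.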
Another commonly used NoSQL solution is MongoDB, a document storage platform designed for high performance, high availability and automatic scaling. Unlike relational databases, which use tables and rows, MongoDB is built on an architecture of collections and documents. The documents are sets of key-value pairs and are the basic unit of data in MongoDB. Rouse also mentions that, "Like other NoSQL databases, MongoDB supports dynamic schema design allowing the documents in a collection to have different fields and structures." MongoDB was created by Dwight Merriman and Eliot Horowitz, was released as open source in 2009, and can be used under the terms of the Free Software Foundation's GNU AGPL Version 3.0 licence, with commercial licences also available.
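To make the collections-and-documents model concrete, here is a minimal sketch using the pymongo driver; the database, collection and field names are hypothetical. Note that the two inserted documents carry different fields, which is the dynamic schema design Rouse describes.

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance (assumes pymongo is installed
# and mongod is listening on the default port 27017).
client = MongoClient("mongodb://localhost:27017")
collection = client["shop"]["products"]  # hypothetical database/collection

# Documents are sets of key-value pairs; two documents in the same
# collection may have entirely different fields (dynamic schema).
collection.insert_one({"name": "Keyboard", "price": 49.99, "keys": 104})
collection.insert_one({"name": "E-book", "price": 9.99, "formats": ["epub", "pdf"]})

# Query by field, much like filtering on a column in SQL
for doc in collection.find({"price": {"$lt": 20}}):
    print(doc["name"])  # -> "E-book"
```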
The second technology associated with Big Data is Hadoop, a Java-based framework primarily used for running applications on clusters of industry-standard commodity hardware. Hadoop is a distributed architecture which requires multiple computers to operate. It is quite flexible and modular, as it is built upon open source frameworks, and can be viewed as two segments: a data storage function (HDFS) and a data processing function (MapReduce). More on Hadoop can be found in a post by Bantu Tech.
HDFS (Hadoop Distributed File System) is a main component of the Apache Hadoop project. In a Hadoop cluster, data is broken apart into manageable smaller chunks (called blocks) and allocated throughout the cluster. This allows the map and reduce functions to be executed on small subsets of large data sets, giving high-performance access to data across Hadoop clusters. As HDFS is typically deployed on low-cost commodity hardware, server failures are very common. According to www.SearchBusinessAnalytics.techtarget.com contributor Emma Preslar, "The file system is designed to be highly fault-tolerant, however, by facilitating the rapid transfer of data between compute nodes and enabling Hadoop systems to continue running if a node fails. This decreases the risk of catastrophic failure, even in the event that numerous nodes fail."
HDFS uses a NameNode/DataNode architecture, essentially a master/slave arrangement. Each cluster contains a single NameNode, which is responsible for managing file system operations, while the DataNodes manage storage on the individual compute nodes. When data arrives, files are split into blocks, which are replicated across the DataNodes. The default block size is 64 MB, and each block is replicated to three nodes in the cluster by default.
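In a real deployment both of these defaults are tunable in HDFS's hdfs-site.xml configuration file. The sketch below shows the two relevant settings: dfs.replication and dfs.blocksize are standard HDFS property names (older 1.x releases spelled the latter dfs.block.size), and the values simply restate the defaults mentioned above.

```xml
<!-- hdfs-site.xml: a minimal sketch of the two settings discussed above. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- each block is copied to three DataNodes -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>67108864</value> <!-- 64 MB, expressed in bytes -->
  </property>
</configuration>
```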
To manage the clusters, Hadoop utilises YARN (Yet Another Resource Negotiator), a key feature of the second-generation Hadoop 2 release. YARN is now characterised as a large-scale, distributed operating system (OS) for Big Data solutions. According to www.SearchDataManagement.techtarget.com, YARN became "a sub-project of the larger Apache Hadoop project". Separating YARN out as a sub-project made Hadoop more suitable for operational applications that cannot wait for batch jobs to finish. YARN provides a central operating system to deliver operational instructions, security and tools for managing data across Hadoop clusters.
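As a flavour of how a cluster is pointed at YARN's central ResourceManager, the sketch below shows a minimal yarn-site.xml. The host name is hypothetical; the two property names are standard YARN settings.

```xml
<!-- yarn-site.xml: a minimal sketch, assuming a single-ResourceManager
     cluster. The host name below is hypothetical. -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master.example.com</value> <!-- where the ResourceManager runs -->
  </property>
  <property>
    <!-- Auxiliary service NodeManagers run so MapReduce jobs can shuffle
         intermediate map output to the reducers. -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```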
MapReduce is described by IBM, a leader in Big Data and Hadoop solutions, as "the heart of Hadoop." MapReduce is the programming structure which supports and allows the scalability of Hadoop. It is implemented in Java and is supported by the Apache Hadoop software framework, but the 'map' and 'reduce' functions themselves can be written in any programming language. Like HDFS, MapReduce is built to be fault-tolerant and to work at massive-scale distributions. MapReduce works by splitting input data into smaller, manageable chunks of work, called map tasks, which can be executed in parallel. Once processed, the output of the map tasks is then combined by reduce tasks and saved onto the system.
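The classic illustration is a word count. The sketch below is a minimal, self-contained Python version of the map/shuffle/reduce flow; in a real cluster Hadoop would run the mapper and reducer as separate distributed tasks (for instance via Hadoop Streaming), and the function and variable names here are ours rather than Hadoop APIs.

```python
"""A minimal word-count sketch in the MapReduce style, runnable locally."""
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map task: emit a (key, value) pair for every word in the input chunk.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce task: combine all values that share a key into one result.
    return (word, sum(counts))

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog"]
    # Shuffle/sort phase: gather and order all intermediate pairs by key.
    pairs = sorted(p for line in lines for p in mapper(line))
    for word, group in groupby(pairs, key=itemgetter(0)):
        print(reducer(word, (count for _, count in group)))
    # -> ('brown', 1) ('dog', 1) ('fox', 1) ('lazy', 1) ('quick', 1) ('the', 2)
```

Because each map call touches only its own chunk of input, the map phase parallelises freely across the cluster; only the shuffle and reduce steps need to bring matching keys together.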
Apache Knox Gateway
The Apache Knox Gateway is a secure entry point for Hadoop clusters. It provides perimeter security so that organisations can open Hadoop up to more users whilst assuring that their data is safe and complies with laws such as the Data Protection Act 1998. Knox functions through YARN as its architectural central point. According to www.HortonWorks.com, "Knox also simplifies Hadoop security for users who access the cluster data and execute jobs."
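As a hedged sketch of what perimeter access looks like in practice, the example below lists an HDFS directory through a Knox gateway using Python's requests library. The host, topology name, path and credentials are all assumptions; the /gateway/&lt;topology&gt;/webhdfs/v1 URL layout is Knox's conventional routing for WebHDFS.

```python
import requests

# Hypothetical gateway host and credentials; "default" is an assumed
# Knox topology name. Knox authenticates the caller at the perimeter
# before proxying the request on to the cluster's WebHDFS service.
resp = requests.get(
    "https://knox.example.com:8443/gateway/default/webhdfs/v1/tmp",
    params={"op": "LISTSTATUS"},
    auth=("alice", "alice-password"),
    verify=False,  # demo only: skip TLS verification for a self-signed cert
)
print(resp.json())
```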
Hadoop already has Kerberos integrated as a security measure to curb the following threats (a minimal configuration sketch follows this list):
- Unauthorized access to the Hadoop Distributed File System and MapReduce systems.
- Unauthorized access to jobs submitted through the Apache Oozie workflow scheduler for Hadoop.
- Malicious servers attempting to access Hadoop clusters.
- Impersonation attacks.
- Attempts to access root accounts.
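To show where Kerberos is switched on, the sketch below gives the two core-site.xml properties involved. Both property names are standard Hadoop settings; everything else a full Kerberos rollout needs (principals, keytabs, a KDC) is out of scope here.

```xml
<!-- core-site.xml: a minimal sketch of enabling Kerberos in Hadoop. -->
<configuration>
  <property>
    <!-- Switch authentication from the default "simple" mode to Kerberos. -->
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <!-- Enforce service-level authorisation checks on RPC callers. -->
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>
```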
According to a report by SAS on Hadoop architecture, Hadoop security is an evolving field, with most major Hadoop distributors developing their own security protocols and projects (Rogers and Keefer, 2016). The Hortonworks Knox Gateway and Cloudera Sentry are two examples that build on having Kerberos enabled to secure the Hadoop environment.