What Hadoop is not
5 misconceptions that are hurting successful Hadoop adoption
Ashish Trehan | June 3, 2015
Hadoop, the thinking goes, is the successor to the commonplace relational database. It isn't.
As it steadily grows in popularity, the Hadoop technology stack is increasingly entering the corporate vernacular at companies whose work spans data management and analytics. Those same companies sit at every point on the Hadoop maturity curve, from working POCs to full production environments.
Unfortunately, many Hadoop projects fail due to ill-conceived business cases built upon incorrect assumptions about the technology. These misconceptions are hurting adoption. In order to better define Hadoop's use cases, and to get the most from the technology, let's tackle those misconceptions by outlining what Hadoop is not.
Misconception 1: Hadoop is the panacea for all data management and analytics ills
Like the alchemical dream of turning base metals into gold, Hadoop is viewed as a sorcerer's stone: throw in unstructured data and out comes valuable business insight.
Emerging technologies such as Hadoop tend to get credited with unrealistic, sometimes fantastical qualities, thanks to slick marketing or a misunderstanding of the underlying technology. Yes, Hadoop is great for massive datasets. Yes, Hadoop is scalable. But Hadoop suits well-defined scenarios and requires applications designed with its infrastructure in mind.
Hadoop is not the single solution to all BI/DW and analytics problems. It is an ecosystem of complementary technologies, each with its own advantages and limitations, and those limitations are steadily being mitigated as tools and utilities within the ecosystem mature. At its core, Hadoop is a fault-tolerant distributed file system coupled with a fault-tolerant parallel execution environment.
Misconception 2: Hadoop is big data
Hadoop is seen as synonymous with big data. Though the terms are sometimes used interchangeably, that is like claiming a paintbrush is synonymous with the artist.
Big data is the paradigm shift disrupting the cultural, economic, and political spheres by providing a level of data granularity and insight never seen before. Hadoop is the technology framework that allows companies to manage the inundation of data proliferating throughout their value chains. The framework is, at its foundation, a distributed file system (HDFS) and a resource manager (Map/Reduce, YARN). The Hadoop ecosystem includes tools that allow companies to manage the volume, variety, and velocity of their data.
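The division of labor in Map/Reduce can be illustrated with the canonical word-count example, sketched here in plain Python in the style of a Hadoop Streaming job. This is an illustrative simulation, not Hadoop API code; the `mapper` and `reducer` names and the local sort standing in for Hadoop's shuffle are all assumptions of the sketch.

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: sum the counts for each word.
    # Hadoop guarantees the pairs arrive grouped (sorted) by key.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Locally simulate the shuffle/sort Hadoop performs between the two phases.
sample = ["to be or not to be"]
counts = dict(reducer(sorted(mapper(sample))))
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In a real cluster, many mapper instances run in parallel on separate HDFS blocks, and the framework, not your code, handles the sort and the routing of keys to reducers.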
But depending on the use case, companies can run big data projects with no Hadoop at all. Hadoop falls under the big data umbrella, as do NoSQL technologies (Cassandra, MongoDB) and machine-learning algorithms (random forests, k-means, naïve Bayes, and more). The Hadoop ecosystem is expansive, and its growing portfolio of tools allows for ever-greater adoption. While traditional Hadoop 1.x (HDFS and Map/Reduce) catered to batch-oriented processing, the platform's continued evolution has made low-latency and even real-time querying suitable use cases. YARN, the resource manager, is a game changer that opens the door to a far greater variety of workloads.
Misconception 3: Hadoop is analytics
This misconception plagues many, especially the C-suite. Hadoop is viewed as a magic black box: dump in unstructured, disorganized data, wave a machine-learning wand (another buzzword), and ta-da! Seven-figure cost savings and sales decisions pour out of the cluster.
Hadoop is an enabler of analytics: it lets data scientists parallelize their algorithms across multiple nodes to handle terabytes or petabytes of data. But many machine-learning tasks iterate toward an answer, as optimization problems do, and these perform poorly on classic Map/Reduce because every iteration is a separate pass over the data. This is being remedied by a newer execution engine, Spark, which is still embryonic but showing great promise. Hadoop 2.0, through YARN, provides resource management and a central platform for consistent operations, security, and data governance, making the ecosystem more enterprise-ready. It gives data scientists, analysts, and developers the ability to run in-database analytics, letting a powerful cluster crunch through billions of rows without hitting performance bottlenecks.
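To see why iteration hurts, consider a toy one-dimensional k-means written in plain Python. This is an illustrative sketch under stated assumptions, not Spark or Hadoop code: the point is that each trip through the loop is a full scan of the dataset, which on classic Map/Reduce means launching a separate job that re-reads the data from HDFS, whereas Spark keeps the working set cached in memory between iterations.

```python
import random

def kmeans_1d(points, k=2, iterations=10, seed=7):
    # Each loop iteration below is one full pass over the dataset.
    # On classic Map/Reduce, every pass is a separate job re-reading
    # from HDFS; Spark instead keeps the dataset cached in memory.
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: nearest center wins
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # update step: move each center to its cluster's mean
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
print(kmeans_1d(data))  # converges to roughly [1.0, 9.0]
```

Ten passes over six numbers is trivial in memory; ten Map/Reduce jobs over petabytes, each paying full disk I/O, is not, which is exactly the gap Spark was built to close.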
Misconception 4: Hadoop is an ETL tool
While it shares many similarities with ETL tools, Hadoop is not one per se. It can serve as the platform for an end-to-end ETL pipeline, but the core framework does not perform ETL by itself.
Within the Hadoop ecosystem, there are tools that handle ETL processes, such as Sqoop (a bulk data loader), Hive (a SQL-like database layer), and Pig (a high-level data-flow scripting language). Traditional Hadoop aligns better with ELT (extract, load, transform) processes, an approach now commonly realized as an "enterprise data lake." Moving to a Hadoop infrastructure lets companies efficiently and economically warehouse all enterprise data in one place and run large models against voluminous data sets. And the ecosystem is bigger than ELT use cases alone: Hadoop with YARN amounts to a distributed, clustered operating system on which future applications can run.
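The ELT pattern can be sketched in a few lines of plain Python. This is a toy illustration; the record format, the `raw_lake` list, and the `transform` function are hypothetical stand-ins for raw files in HDFS and the schema-on-read projection a Hive or Pig job would apply to them.

```python
import json

# "Extract and load": raw, heterogeneous records land in the lake untouched.
raw_lake = [
    '{"user": "a", "amount": "12.50", "ts": "2015-06-01"}',
    '{"user": "b", "amount": "7.25"}',  # missing a field: still stored
    'not json at all',                  # malformed: still stored
]

def transform(records):
    # "Transform" happens later, at read time (schema-on-read), the way
    # a Hive or Pig job projects structure onto raw files in HDFS.
    for rec in records:
        try:
            row = json.loads(rec)
            yield row["user"], float(row["amount"])
        except (ValueError, KeyError):
            continue  # rows that don't fit the schema are simply skipped

print(list(transform(raw_lake)))  # [('a', 12.5), ('b', 7.25)]
```

The contrast with classic ETL is the ordering: nothing is cleaned or rejected on the way in, so the raw data remains available for transformations nobody has thought of yet.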
Misconception 5: Hadoop is a relational database
Hadoop is a file system plus a resource manager (YARN) that orchestrates the necessary processes through an execution engine (Map/Reduce, Spark, Tez, etc.). Hadoop 1.x is typically used for latency-insensitive work, such as bulk processing, and traditionally the framework was ill-suited to real-time analytics or low-latency tasks. Now, with YARN and the continuing adoption and porting of other technologies (such as HBase and Kafka), Hadoop can tackle both. Hadoop empowers companies to fully capture and utilize data that they have yet to add structure to.
Needed: well-defined use case
The single biggest threat to a successful Hadoop project is a misconstrued, ill-defined use case.
The misconceptions outlined here aren't intended to make enterprises apprehensive about engaging in Hadoop-oriented projects. Rather, they're a reminder that a properly defined use case lays the foundation for a virtuous cycle of increasingly ambitious business cases around big data. Small, easy wins encourage wider Hadoop adoption and executive sponsorship, while a swing-for-the-fences project may end in discouragement and disappointment, or worse: throwing the baby out with the bathwater.
The Hadoop ecosystem offers companies a flexible, inexpensive way to store and analyze large quantities of data. It invites companies to capture all enterprise-related, peripheral data (also known as data exhaust) that will in turn drive value. Hadoop allows projects to scale as your data grows, and traditional enterprise tools increasingly integrate with the Hadoop environment. Though the Hadoop stack still lacks mature governance and security, many companies are stepping in to close those gaps. Innovations from Hortonworks, Cloudera, and MapR, to name a few, continue to lower the barriers to entry.
There is, of course, no one-size-fits-all solution, and there may never be one. But companies are picking and choosing the big data technologies that suit their needs, and businesses will soon ask to use new kinds of data that won't fit neatly into current data infrastructure paradigms.
Ready to get started? I hope that you’re now better prepared to handle the hype, and flawlessly execute a big data strategy that complements, not necessarily replaces, your existing paradigm. Happy Hadooping!
Ashish Trehan is an experienced IT and analytics professional in Slalom Atlanta's information management and analytics practice, with an emphasis in big data technology ecosystems, advanced analytics, and visualization within the customer service, manufacturing, and telecommunications industries. Ashish combines deep analytics expertise with keen enterprise understanding, allowing for a holistic and complete approach to business problems. He is a passionate, motivated professional with a strong intellectual curiosity in machine-learning, analytic strategy, and data-driven decision making.