Monday, 26 August 2013

502. Big Data - Hadoop





You can't have a conversation about Big Data for very long without running into the elephant in the room: Hadoop.
This open source software platform managed by the Apache Software Foundation has proven to be very helpful in storing and managing vast amounts of data cheaply and efficiently.
But what exactly is Hadoop, and what makes it so special? Basically, it's a way of storing enormous data sets across distributed clusters of servers and then running "distributed" analysis applications in each cluster.
It's designed to be robust, in that your Big Data applications will continue to run even when individual servers — or clusters — fail. And it's also designed to be efficient, because it doesn't require your applications to shuttle huge volumes of data across your network.
Here's how Apache formally describes it:
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
Look deeper, though, and there's even more magic at work.
Hadoop is almost completely modular, which means that you can swap out almost any of its components for a different software tool.
That makes the architecture incredibly flexible, as well as robust and efficient.
Hadoop Distributed Filesystem (HDFS)
If you remember nothing else about Hadoop, keep this in mind:
It has two main parts: a data processing framework and a distributed filesystem for data storage.
There's more to it than that, of course, but those two components really make things go.
The distributed filesystem is that far-flung array of storage clusters noted above - i.e., the Hadoop component that holds the actual data.
By default, Hadoop uses the cleverly named Hadoop Distributed File System (HDFS), although it can use other file systems as well.
HDFS is like the bucket of the Hadoop system: You dump in your data and it sits there all nice and cozy until you want to do something with it, whether that's running an analysis on it within Hadoop or capturing and exporting a set of data to another tool and performing the analysis there.
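To make the bucket metaphor concrete, here is a minimal sketch of dumping data into HDFS and pulling it back out through Hadoop's Java filesystem API. The NameNode address and the file path are placeholder assumptions for illustration, not anything from the article:

// A minimal sketch of writing to and reading from HDFS via the Java API.
// The fs.defaultFS address and the /data/demo.txt path are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBucketDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder cluster address
        FileSystem fs = FileSystem.get(conf);

        // Dump data into the bucket: create a file and write to it.
        Path file = new Path("/data/demo.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("raw data, sitting nice and cozy until needed");
        }

        // Pull it back out when you want to do something with it.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}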
Data Processing Framework & MapReduce
The data processing framework is the tool used to work with the data itself. By default, this is the Java-based system known as MapReduce.
You hear more about MapReduce than about the HDFS side of Hadoop for two reasons:
  1. It's the tool that actually gets data processed.
  2. It tends to drive people slightly crazy when they work with it.
In a "normal" relational database, data is found and analyzed using queries, based on the industry-standard Structured Query Language (SQL).
Non-relational databases use queries, too; they're just not constrained to use only SQL, but can use other query languages to pull information out of data stores. Hence, the term NoSQL.
But Hadoop is not really a database: It stores data and you can pull data out of it, but there are no queries involved - SQL or otherwise.
Hadoop is more of a data warehousing system - so it needs a system like MapReduce to actually process the data.
MapReduce runs as a series of jobs, with each job essentially a separate Java application that goes out into the data and starts pulling out information as needed. Using MapReduce instead of a query gives data seekers a lot of power and flexibility, but also adds a lot of complexity.
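To see what one of those Java applications looks like, here is the canonical word-count example, sketched against Hadoop's org.apache.hadoop.mapreduce API: the Map class emits a count of 1 for every word it finds in its slice of the data, and the Reduce class sums those counts per word.

// The canonical word-count example: a Map class that tokenizes its slice of
// the input, and a Reduce class that sums the per-word counts.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum)); // emit (word, total count)
        }
    }
}

Even this simplest of jobs needs two classes and a fair amount of boilerplate, which hints at why MapReduce drives people slightly crazy.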
There are tools to make this easier: Hive, for instance, another Apache application, converts SQL-like queries into MapReduce jobs. But MapReduce's complexity and its limitation to one-job-at-a-time batch processing tend to result in Hadoop getting used more often as a data warehousing tool than as a data analysis tool.
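For a flavor of how Hive hides MapReduce behind a query, here is a hedged sketch using Hive's standard JDBC driver. The HiveServer2 address and the docs table are assumptions for illustration only:

// Sketch: submitting a SQL-like query that Hive compiles into MapReduce jobs.
// The HiveServer2 address and the "docs" table are illustrative assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) FROM docs GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}

The query reads like ordinary SQL, but behind the scenes Hive compiles it into the same kind of MapReduce jobs sketched above.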
Scattered Across The Cluster
There is another element of Hadoop that makes it unique: All of the functions described act as distributed systems, not the more typical centralized systems seen in traditional databases.
In a database that uses multiple machines, the work tends to be divided up: all of the data sits on one or more machines, and all of the data processing software is housed on another server (or set of servers).
On a Hadoop cluster, the data within HDFS and the MapReduce system are housed on every machine in the cluster.
This has two benefits:
it adds redundancy to the system in case one machine in the cluster goes down, and
it brings the data processing software into the same machines where data is stored, which speeds information retrieval. 

Like we said: Robust and efficient.
When a request for information comes in, MapReduce uses two components:
a JobTracker that sits on the Hadoop master node, and
TaskTrackers that sit on each node within the Hadoop network.
The process is fairly linear: The Map part is accomplished by the JobTracker dividing computing jobs up into defined pieces and shifting those jobs out to the TaskTrackers on the machines out on the cluster where the needed data is stored.
Once the jobs are run, each TaskTracker's subset of data is Reduced back to the central node of the Hadoop cluster, where it is combined with the subsets from all of the cluster's other machines.
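Tying that process to code: a small driver class is what hands a job to the JobTracker. This sketch reuses the hypothetical WordCount classes from earlier, with input and output paths passed on the command line:

// Sketch of a driver that submits the word-count job to the cluster.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.SumReducer.class); // optional local pre-reduce
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1); // blocks until the job finishes
    }
}

The waitForCompletion(true) call submits the job and prints progress while the pieces are farmed out to the TaskTrackers.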
HDFS is distributed in a similar fashion. A single NameNode tracks where data is housed in the cluster of servers, known as DataNodes.
Data is stored in data blocks, usually 128MB in size, on the DataNodes. HDFS replicates each block and distributes the copies so that every block lives on multiple nodes across the cluster.
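You can actually watch this placement happen. Here is a hedged sketch (the file path is a placeholder) that asks the NameNode which DataNodes hold each block of a given file:

// Sketch: asking the NameNode which DataNodes hold each block of a file.
// The /data/demo.txt path is a placeholder.
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/demo.txt"));

        // One BlockLocation per block; each lists the DataNodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset %d, length %d, replicas on %s%n",
                    block.getOffset(), block.getLength(),
                    Arrays.toString(block.getHosts()));
        }
    }
}

Run against a real cluster, each block typically reports three hosts, HDFS's default replication factor.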
This distribution style gives Hadoop another big advantage:
Since data and processing live on the same servers in the cluster, every time you add a new machine to the cluster, your system gains the space of the hard drive and the power of the new processor.

Kit Your Hadoop
As mentioned earlier, users of Hadoop don't have to stick with just HDFS or MapReduce.
For its Elastic Compute Cloud solutions, Amazon Web Services has adapted its own S3 filesystem for Hadoop.
DataStax's Brisk is a Hadoop distribution that replaces HDFS with Apache Cassandra's CassandraFS.
To get around MapReduce's first-in-first-out limitations, the Cascading framework gives developers an easier way to run jobs and more flexibility in scheduling them.
Hadoop is not always a complete, out-of-the-box solution for every Big Data task. MapReduce, as noted, is enough of a pressure point that many Hadoop users prefer to use the framework only for its capability to store lots of data fast and cheap.
But Hadoop is still the best, most widely used system for managing large amounts of data quickly when you don't have the time or the money to store it in a relational database. That's why Hadoop is likely to remain the elephant in the Big Data room for some time to come.

501. Big Data - 7 Definitions of Big Data You Should Know About



 



        The first thing to note is that – despite what Wikipedia says – everybody in the industry generally agrees that Big Data isn’t just about having more data (since that’s just inevitable, and boring).

(1) The Original Big Data

Big Data as the three Vs:
1.      Volume,
2.      Velocity, and
3.      Variety.
        This is the most venerable and well-known definition, first coined by Doug Laney of Gartner over twelve years ago. Since then, many others have tried to take it to 11 with additional Vs including
4.      Validity,
5.      Veracity,
6.      Value, and
7.      Visibility.

(2) Big Data as Technology

        Why did a 12-year-old term suddenly zoom into the spotlight? It wasn't simply because we do indeed now have a lot more volume, velocity, and variety than a decade ago. Instead, it was fueled by new technology, and in particular the fast rise of open source technologies such as Hadoop and other NoSQL ways of storing and manipulating data.
The users of these new tools needed a term that differentiated them from previous technologies, and – somehow – ended up settling on the woefully inadequate term Big Data. If you go to a big data conference, you can be assured that sessions featuring relational databases – no matter how many Vs they boast – will be in the minority.

(3) Big Data as Data Distinctions

The problem with big-data-as-technology is that
(a) it’s vague enough that every vendor in the industry jumped in to claim it for themselves and
(b) everybody ‘knew’ that they were supposed to elevate the debate and talk about something more business-y and useful.
        Here are two good attempts to help organizations understand why Big Data now is different from mere big data in the past:
·Transactions, Interactions, and Observations. This one is from Shaun Connolly of Hortonworks. Transactions make up the majority of what we have collected, stored and analyzed in the past. Interactions are data that comes from things like people clicking on web pages. Observations are data collected automatically.
·Process-Mediated Data, Human-Sourced Information, and Machine-Generated Data. This is brought to us by Barry Devlin, who co-wrote the first paper on data warehousing. It is basically the same as the above, but with clearer names.

(4) Big Data as Signals

        This is another business-y approach that divides the world by intent and timing rather than the type of data, courtesy of SAP’s Steve Lucas.
        The ‘old world’ is about transactions, and by the time these transactions are recorded, it’s too late to do anything about them: companies are constantly ‘managing out of the rear-view mirror’. In the ‘new world,’ companies can instead use new ‘signal’ data to anticipate what’s going to happen, and intervene to improve the situation.
        Examples include tracking brand sentiment on
social media (if your ‘likes’ fall off a cliff, your sales will surely follow) and
predictive maintenance (complex algorithms determine when you need to replace an aircraft part, before the plane gets expensively stuck on the runway).

(5) Big Data as Opportunity

        This one is from 451 Research’s Matt Aslett and broadly defines big data as ‘analyzing data that was previously ignored because of technology limitations.’
(OK, so technically, Matt used the term ‘Dark Data’ rather than Big Data, but it’s close enough).
This is my personal favorite, since I believe it lines up best with how the term is actually used in most articles and discussions.

(6) Big Data as Metaphor

        In his wonderful book The Human Face of Big Data, journalist Rick Smolan says big data is “the process of helping the planet grow a nervous system, one in which we are just another, human, type of sensor.”
        Deep, huh? But by the time you've read some of the stories in the book or the mobile app, you'll be nodding your head in agreement.

(7) Big Data as New Term for Old Stuff

        This is the laziest and most cynical use of the term: projects that were possible using previous technology, and that would have been called BI or analytics in the past, have suddenly been re-baptized in a fairly blatant attempt to jump on the big data bandwagon.
        And finally, one bonus, fairly useless definition of big data. Still not enough for you? There are 30+ more definitions out there, and counting!
        The bottom line: whatever the disagreements over the definition, everybody agrees on one thing: big data is a big deal, and will lead to huge new opportunities in the coming years.

