You can't have a conversation about Big Data for very long
without running into the elephant in the room: Hadoop.
This open source software platform managed by the Apache Software Foundation has proven to be very helpful in storing and managing vast
amounts of data cheaply and efficiently.
But what exactly is Hadoop, and what makes it so special? Basically, it's a way of storing
enormous data sets across distributed clusters of servers and then running "distributed"
analysis applications on those same clusters.
It's designed to be robust, in that your Big Data applications
will continue to run even when individual servers — or clusters — fail. And
it's also designed to be efficient, because it doesn't require your
applications to shuttle huge volumes of data across your network.
Here's how Apache formally
describes it:
The Apache Hadoop software library is a framework that allows
for the distributed processing of large data sets across clusters of computers
using simple programming models.
It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to deliver
high-availability, the library itself is designed to detect and handle failures
at the application layer, so delivering a highly available service on top of a
cluster of computers, each of which may be prone to failures.
Look deeper, though, and there's even more magic at work.
Hadoop is almost completely modular, which means that you can
swap out almost any of its components for a different software tool.
That makes the architecture incredibly flexible, as well as
robust and efficient.
Hadoop Distributed Filesystem (HDFS)
If you remember nothing else about Hadoop, keep this in mind:
It has two main parts – a data processing framework and a distributed filesystem for data storage.
There's more to it than that, of course, but those two
components really make things go.
The distributed filesystem is that far-flung array of storage
clusters noted above - i.e., the Hadoop component that holds the actual data.
By default, Hadoop uses the cleverly named Hadoop Distributed File
System (HDFS), although it can use
other file systems as well.
HDFS is like the bucket of the Hadoop system: You dump in your
data and it sits there all nice and cozy until you want to do something with
it, whether that's running an analysis on it within Hadoop or capturing and
exporting a set of data to another tool and performing the analysis there.
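To make the "bucket" idea concrete, here is a minimal sketch of dumping a small file into HDFS and reading it back with Hadoop's Java FileSystem API. The NameNode address and the paths are placeholders; in practice the cluster settings are usually picked up from the core-site.xml and hdfs-site.xml files on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBucketExample {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode address; normally loaded from core-site.xml.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // "Dump in your data": write a small file into HDFS.
        Path file = new Path("/data/raw/events.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("first event record\n".getBytes(StandardCharsets.UTF_8));
        }

        // Later, read it back out, whether for analysis inside Hadoop
        // or for export to another tool.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}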
Data Processing Framework & MapReduce
The data processing framework is the tool used to work with the
data itself. By default, this is the Java-based system known as MapReduce.
You hear more about MapReduce than the HDFS side of Hadoop
for two reasons:
- It's the tool that actually gets data processed.
- It tends to drive people slightly crazy when they work with it.
In a "normal" relational database, data is found and
analyzed using queries, based on the industry-standard Structured Query
Language (SQL).
Non-relational databases use queries, too; they're just not
constrained to use only SQL, but can use other query languages to pull
information out of data stores. Hence, the term NoSQL.
But Hadoop is not really
a database: It stores data and you can pull data out of
it, but there are no queries involved - SQL or otherwise.
Hadoop is more of a data warehousing system - so it needs a
system like MapReduce to actually process the data.
MapReduce runs as a series of jobs, with each job essentially a
separate Java application that goes out into the data and starts pulling out
information as needed. Using MapReduce instead of a query gives data
seekers a lot of power and flexibility, but also adds a lot of complexity.
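To give a feel for that complexity, here is the canonical word-count example written against Hadoop's MapReduce API: even this trivial bit of analysis takes a mapper class and a reducer class. The class names are just illustrative.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: runs where the data lives and emits a (word, 1) pair per word.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce step: gathers the pairs for each word and sums the counts.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}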
There are tools to make this easier: the Hadoop ecosystem includes Hive, another Apache project
that converts SQL-like queries into MapReduce jobs, for instance. But MapReduce's complexity and its
limitation to one-job-at-a-time batch processing tend to result in Hadoop
getting used more often as a data warehousing system than as a data analysis tool.
Scattered Across The Cluster
There is another element of Hadoop that makes it unique: All of
the functions described act as distributed systems, not the more typical
centralized systems seen in traditional databases.
In a database that uses multiple machines, the work tends to be
divided out: all of the data sits on one or more machines, and all of the data
processing software is housed on another server (or set of servers).
On a Hadoop cluster, the data within HDFS and the MapReduce
system are housed on every machine in the cluster.
This has two benefits:
- It adds redundancy to the system in case one machine in the cluster goes down.
- It brings the data processing software onto the same machines where the data is stored, which speeds information retrieval.
Like we said: Robust and
efficient.
When a request for information comes in, MapReduce uses two components:
a JobTracker that sits on the Hadoop master node, and
TaskTrackers that sit on each node within the Hadoop cluster.
The process is fairly linear: The Map part is
accomplished by the JobTracker dividing computing jobs up into defined pieces
and shifting those jobs out to the TaskTrackers on the machines out on the
cluster where the needed data is stored.
Once the job has run, the results are Reduced back to the central node of the Hadoop
cluster, where the subsets of data produced on each of the cluster's machines are
combined into a single answer.
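Here is a rough sketch of the driver that kicks that process off, continuing the word-count example above; once waitForCompletion is called, the framework handles the splitting, scheduling, and combining just described. The input and output paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // The mapper and reducer classes from the earlier sketch.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths in HDFS (placeholders). The framework splits
        // the input, schedules map tasks near each block of data, then runs
        // the reduce step to combine their output.
        FileInputFormat.addInputPath(job, new Path("/data/raw"));
        FileOutputFormat.setOutputPath(job, new Path("/data/wordcounts"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}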
HDFS is distributed in a similar fashion. A single NameNode
tracks where data is housed in the cluster of servers, known as DataNodes.
Data is stored in data blocks on the DataNodes.
HDFS replicates those data blocks, usually 128MB in size, and distributes the copies
across multiple nodes in the cluster.
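You can see that layout from client code by asking the NameNode where a file's blocks actually live. Here is a small sketch using the standard FileSystem API, with the file path reused from the earlier illustrative example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Ask the NameNode for the block layout of one file.
        FileStatus status = fs.getFileStatus(new Path("/data/raw/events.txt"));
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Each block is replicated on several DataNodes.
            System.out.printf("offset %d, length %d, hosts: %s%n",
                    block.getOffset(), block.getLength(),
                    String.join(", ", block.getHosts()));
        }
    }
}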
This distribution style gives Hadoop another big advantage:
Since data and processing live on the same servers in the cluster, every time you
add a new machine, your system gains both the space of its hard drive and the power
of its processor.
Kit Your Hadoop
As mentioned earlier, users of Hadoop don't have to stick with
just HDFS or MapReduce.
For its Elastic Compute Cloud
solutions, Amazon Web Services has adapted its own S3 filesystem for Hadoop.
To get around MapReduce's first-in-first-out
limitations, the Cascading framework gives developers an easier way to define and
run jobs, and more flexibility in scheduling them.
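Swapping the storage layer largely comes down to pointing the same FileSystem API at a different URI scheme. Here is a minimal sketch that lists files in an S3 bucket, assuming a Hadoop S3 connector (such as the hadoop-aws module, which provides the s3a:// scheme) is on the classpath; the bucket name and path are placeholders, and credentials would come from configuration or the environment.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3FilesystemExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Same FileSystem API as before, but backed by S3 instead of HDFS.
        FileSystem s3 = FileSystem.get(URI.create("s3a://my-example-bucket/"), conf);

        for (FileStatus status : s3.listStatus(new Path("s3a://my-example-bucket/input/"))) {
            System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
        }
    }
}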
Hadoop is not always a complete, out-of-the-box solution for
every Big Data task. MapReduce, as noted, is enough of a pressure point that
many Hadoop users prefer to use the framework only for its capability to store
lots of data fast and cheap.
But Hadoop is still the best, most widely used system for
managing large amounts of data quickly when you don't have the time or the
money to store it in a relational database. That's why Hadoop is likely to
remain the elephant in the Big Data room for some time to come.