At the recent Big Data Workshop
held by the Boston Predictive Analytics group, airline analyst and R user Jeffrey Breen gave a step-by-step guide to setting up
an R and Hadoop infrastructure.
First, as a local virtual instance of Hadoop with R, using VMware and Cloudera's Hadoop Demo VM. (This is a
great way to get familiar with Hadoop.) Then, as a single-machine cloud-based instance with lots of RAM and CPU, using Amazon EC2. (Good for more
Hadoop experimentation, now with more realistic data sizes.) And finally, as a
true distributed Hadoop cluster in the cloud, using Apache Whirr to spin up multiple
nodes running Hadoop and R.
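(For that last step, a Whirr recipe is just a properties file. A minimal sketch along these lines, with the cluster name and node counts invented for illustration rather than taken from the workshop, launches a small Hadoop cluster on EC2:)

    # hadoop.properties -- illustrative recipe, not the workshop's own
    whirr.cluster-name=rhadoop-demo
    whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,2 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}

    # then, from the shell:
    #   whirr launch-cluster --config hadoop.properties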
With that infrastructure set
up, you're ready to start using RHadoop to write map-reduce jobs in R.
The final part of Jeffrey's
workshop is a tutorial on the rmr package, with a worked example of loading a large data
set of airline departures and arrivals into HDFS and using an R-based
map-reduce task to calculate the scheduled (orange) and actual (yellow) total
flight hours in the US over the last decade or so. (Actual time spent in the
air is also shown, in blue.)
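(As a rough sketch of what such a job looks like, not Jeffrey's actual code: the fragment below uses the style of the later rmr2 API and assumes a four-column extract of the airline data with Year, CRSElapsedTime, ActualElapsedTime and AirTime columns, all in minutes, at a made-up HDFS path.)

    # Minimal sketch: total scheduled, actual and airborne hours per year.
    library(rmr2)

    # Assumed four-column CSV extract of the airline data, times in minutes.
    fmt <- make.input.format("csv", sep = ",",
                             col.names = c("Year", "CRSElapsedTime",
                                           "ActualElapsedTime", "AirTime"))

    hours.by.year <- mapreduce(
      input        = "/data/airline",    # assumed HDFS path
      input.format = fmt,
      map = function(k, flights) {
        # each mapper sees a chunk of flight records; emit per-year hour triples
        keyval(flights$Year,
               cbind(scheduled = flights$CRSElapsedTime,
                     actual    = flights$ActualElapsedTime,
                     airborne  = flights$AirTime) / 60)
      },
      reduce = function(year, hours) {
        # sum the hour triples over all flights in a year
        keyval(year, t(colSums(hours, na.rm = TRUE)))
      })

    totals <- from.dfs(hours.by.year)    # small enough to pull back into R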
Hadoop: What it is, how it works, and what it can do
Cloudera CEO Mike Olson
on Hadoop's architecture and its data applications.
Hadoop gets a lot of buzz these days in database and
content management circles, but many people in the industry still don’t really
know what it is or how it can best be applied.
Cloudera CEO and Strata speaker Mike
Olson, whose company offers
an enterprise distribution of Hadoop and contributes to the project, discusses
Hadoop’s background and its applications in the following interview.
Where did Hadoop come
from?
Mike Olson: The underlying technology was invented by Google back in their earlier
days so they could usefully index all the rich textual and structural information
they were collecting, and then present meaningful and actionable results to
users.
There was nothing on the market that would let
them do that, so they built their own platform.
Google’s innovations were incorporated into Nutch, an open source project, and Hadoop was later
spun off from that.
Yahoo has played a key role in developing Hadoop for enterprise
applications.
What problems can Hadoop
solve?
Mike Olson: The Hadoop platform was designed to solve problems where you
have a lot of data — perhaps a mixture of complex and structured data — and it
doesn’t fit nicely into tables.
It’s for situations where you want to run
analytics that are deep and computationally extensive, like clustering and
targeting.
That’s exactly what Google was doing when it
was indexing the web and examining user behavior to improve performance
algorithms.
Hadoop applies to a bunch of markets. In
finance, if you want to do accurate portfolio evaluation and risk analysis, you
can build sophisticated models that are hard to jam into a database engine. But
Hadoop can handle it. In online retail, if you want to deliver better search
answers to your customers so they’re more likely to buy the thing you show
them, that sort of problem is well addressed by the platform Google built.
Those are just a few examples.
How is Hadoop architected?
Mike Olson: Hadoop is designed to run on a large number of machines that
don’t share any memory or disks. That means you can buy a whole bunch of
commodity servers, slap them in a rack, and run the Hadoop software on each
one. When you want to load all of your organization’s data into Hadoop, what
the software does is bust that data into pieces that it then spreads across
your different servers. There’s no one place where you go to talk to all of
your data; Hadoop keeps track of where the data resides. And because there are
multiple copies stored, data on a server that goes offline or dies can be
automatically replicated from a known good copy.
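(For example, the number of copies Hadoop keeps is an ordinary configuration setting; the stock default is three replicas of each block, set in hdfs-site.xml:)

    <!-- hdfs-site.xml: keep three copies of every data block -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>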
In a
centralized database system, you’ve got one big disk connected to four or eight
or 16 big processors.
But that is as much horsepower as you can
bring to bear. In a Hadoop cluster, every one of those servers has two or four
or eight CPUs. You can run your indexing job by sending your code to each of
the dozens of servers in your cluster, and each server operates on its own
little piece of the data. Results are then delivered back to you in a unified
whole. That’s MapReduce: you map the operation out to all of those
servers and then you reduce the results back into a single result set.
Architecturally, the reason you’re able to
deal with lots of data is because Hadoop spreads it out. And the reason you’re
able to ask complicated computational questions is because you’ve got all of
these processors, working in parallel, harnessed together.
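(To make the map and reduce phases concrete: word counting is the canonical example, sketched here with the rmr2 package from the workshop above. This is an illustration, not code from the interview.)

    # Map sends work out to the data; reduce folds the pieces back together.
    library(rmr2)

    lines <- to.dfs(c("hadoop maps the work out",
                      "and hadoop reduces the results"))

    wordcount <- mapreduce(
      input  = lines,
      map    = function(k, v) {
        words <- unlist(strsplit(v, " "))
        keyval(words, rep(1, length(words)))   # each server emits (word, 1) pairs
      },
      reduce = function(word, counts) {
        keyval(word, sum(counts))              # partial counts combined per word
      })

    from.dfs(wordcount)                        # the single unified result set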
At this point, do
companies need to develop their own Hadoop applications?
Mike Olson: It’s fair to say that a current Hadoop adopter must be more
sophisticated than a relational database adopter. There are not that many
“shrink wrapped” applications today that you can get right out of the box and
run on your Hadoop cluster. It’s similar to the early ’80s when Ingres and
IBM were selling their database engines and people often had to write
applications locally to operate on the data.
That said, you can develop applications in a
lot of different languages that run on the Hadoop framework. The developer
tools and interfaces are pretty simple. Some of our partners — Informatica is a good example — have ported their tools
so that they’re able to talk to data stored in a Hadoop cluster using Hadoop
APIs. There are specialist vendors that are up and coming, and there are also a
couple of general-purpose query tools: a version of SQL that lets you interact
with data stored on a Hadoop cluster, and Pig,
a language developed by Yahoo that allows for data flow and data transformation
operations on a Hadoop cluster.
Hadoop’s deployment is a bit tricky at this
stage, but the vendors are moving quickly to create applications that solve
these problems. I expect to see more of the shrink-wrapped apps appearing over
the next couple of years.
Where do you stand in the
SQL vs NoSQL debate?
Mike Olson: I’m a deep believer in relational
databases and in SQL. I think
the language is awesome and the products are incredible.
I hate the term “NoSQL.” It was invented to create cachet around a
bunch of different projects, each of which has different properties and behaves
in different ways. The real question is, what problems are you solving? That’s
what matters to users.
Doug Cutting, Cloudera's
Chief Architect, helped create Apache Hadoop out of necessity as data
from the web exploded and grew far beyond the ability of traditional systems
to handle it.
Hadoop was initially inspired
by papers published by Google outlining its approach to handling an avalanche
of data, and has since become the de facto standard for storing, processing and analyzing hundreds
of terabytes, and even petabytes, of data.
Apache Hadoop is 100% open
source, and pioneered a fundamentally new way of storing and processing data.
Instead of relying on
expensive, proprietary hardware and different systems to store and process
data, Hadoop enables distributed parallel processing of huge amounts of data
across inexpensive, industry-standard servers that both store and process the
data, and can scale without limits.
With Hadoop, no data is too big.
And in today’s
hyper-connected world where more and more data is being created every day,
Hadoop’s breakthrough advantages mean that businesses and organizations can now
find value in data that was recently considered useless.
Reveal Insight From All Types of Data, From All Types of Systems
Hadoop can handle all types of data from disparate systems: structured and
unstructured data, log files, pictures, audio files, communications records,
email, and just about anything else you can think of, regardless of its
native format.
Even when different types of data have been stored in unrelated systems, you
can dump them all into your Hadoop cluster with no prior need for a schema.
In other words, you don’t need to know how you intend to query your data
before you store it; Hadoop lets you decide later, and over time your data can
reveal questions you never even thought to ask.
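(As a small illustration using the RHadoop rhdfs package, with hypothetical paths: the files go in as-is, and no schema is declared at load time.)

    # Raw files land in HDFS in their native formats; no schema up front.
    library(rhdfs)
    hdfs.init()

    hdfs.put("/tmp/logs/access-2011-06.log", "/data/raw/logs/")   # log files
    hdfs.put("/tmp/mail/archive.mbox",       "/data/raw/email/")  # email
    hdfs.ls("/data/raw")   # stored now; how to query it can be decided later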
By making all of your data
usable, not just what’s in your databases, Hadoop lets you see relationships
that were hidden before and reveal answers that have always been just out of
reach.
You can start making more
decisions based on hard data instead of hunches and look at complete data sets,
not just samples.
Redefine the Economics of Data: Keep Everything, Forever, Online
In addition, Hadoop’s cost
advantages over legacy systems redefine the economics of data.
Legacy systems, while fine
for certain workloads, simply were not engineered with the needs of Big Data in
mind and are far too expensive for general-purpose use with today's
largest data sets.
One of the cost advantages of
Hadoop is that because it relies on an internally redundant data structure and
is deployed on industry-standard servers rather than expensive specialized data
storage systems, you can afford to store data that would not previously have
been viable to keep.
And we all know that once
data is on tape, it’s essentially the same as if it had been deleted,
accessible only in extreme circumstances.
Enterprises that build their
Big Data infrastructure around Cloudera can afford to store literally all the data in their
organization, and keep it all online for real-time interactive querying,
business intelligence, analysis and visualization.
Restructure Your Thinking: Make Big Data the Lifeblood of Your Enterprise
With data growing so rapidly, and with unstructured data now accounting for
90% of all data, the time has come for enterprises to re-evaluate their
approach to data storage, management and analytics.
Legacy systems will remain necessary
for specific high-value, low-volume workloads, and will complement the use of
Hadoop, optimizing the data management structure in your organization by
putting the right Big Data workloads in the right systems.
The cost-effectiveness, scalability and
streamlined architectures of Hadoop will make the technology more and more
attractive.
In fact, the need for Hadoop is no longer a question.
The only question now is how best to take advantage of it, and the
enterprise-proven answer is Cloudera.