R and Hadoop:
Step-by-step tutorials
At the recent Big Data Workshop
held by the Boston Predictive Analytics group, airline analyst and R user Jeffrey Breen gave a step-by-step guide to setting up
an R and Hadoop infrastructure.
He demonstrated three setups. First, a local virtual instance of Hadoop with R, using VMware and Cloudera's Hadoop Demo VM. (This is a
great way to get familiar with Hadoop.) Next, a single-machine cloud-based instance with plenty of RAM and CPU, using Amazon EC2. (Good for more
Hadoop experimentation, now with more realistic data sizes.) And finally, a
true distributed Hadoop cluster in the cloud, using Apache Whirr to spin up multiple
nodes running Hadoop and R.
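The Whirr step works from a small properties file describing the cluster. The sketch below is a minimal illustrative example, not the configuration used in the workshop: the cluster name, node counts, and role names are assumptions for a basic Hadoop layout.

```properties
# hadoop.properties -- hypothetical Whirr cluster definition (names and sizes are assumptions)
whirr.cluster-name=hadoop-r-demo
# one master node plus three workers
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,3 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
# credentials read from the environment rather than hard-coded
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
```

With a file like this in place, `whirr launch-cluster --config hadoop.properties` brings the nodes up, and `whirr destroy-cluster --config hadoop.properties` tears them down again when you're done experimenting.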
The final part of Jeffrey's
workshop is a tutorial on the rmr package, with a worked example of loading a large data
set of airline departures and arrivals into HDFS and using an R-based
map-reduce task to calculate the scheduled (orange) and actual (yellow) total
flight hours in the US over the last decade or so. (Actual time spent in the
air is also shown, in blue.)
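The shape of that map-reduce job can be approximated locally in plain R: the map step emits (year, hours) pairs and the reduce step sums hours per year. The sketch below runs on a tiny invented sample rather than the real airline data, and the column names are assumptions, not Jeffrey's actual schema.

```r
# Tiny stand-in for the airline data: elapsed times in minutes (invented values).
flights <- data.frame(
  year       = c(2009, 2009, 2010, 2010),
  sched_min  = c(120, 90, 150, 60),   # scheduled elapsed time
  actual_min = c(130, 85, 155, 65)    # actual elapsed time
)

# "Map": emit one (year, hours) record per flight for the chosen column.
map_hours <- function(df, col) {
  data.frame(year = df$year, hours = df[[col]] / 60)
}

# "Reduce": sum the emitted hours within each year key.
reduce_sum <- function(kv) aggregate(hours ~ year, data = kv, FUN = sum)

scheduled <- reduce_sum(map_hours(flights, "sched_min"))
actual    <- reduce_sum(map_hours(flights, "actual_min"))

print(scheduled)  # total scheduled flight hours per year
print(actual)     # total actual flight hours per year
```

In the real rmr version, the map and reduce functions play the same roles but are passed to rmr's `mapreduce()` against data stored in HDFS, so the computation is distributed across the cluster instead of running in a single R session.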