Big Data: 10 Emerging Technologies
December 5, 2012, 6:00 AM PST
Takeaway: Thoran Rodrigues
interviewed Dr. Satwant Kaur about the 10 emerging technologies that will drive
Big Data forward.
I’ve recently had the opportunity to
have a conversation with Dr. Satwant Kaur on the topic of Big Data (see my
previous interview with Dr. Kaur, “The 10
traits of the smart cloud”). Dr. Kaur has an extensive
history in IT, being the author of Intel’s Transitioning Embedded Systems to
Intelligent Environments. Her professional background, which includes four
patents while at Intel & CA, 20 distinguished awards, ten keynote
conference speeches at IEEE, and over 50 papers and publications, has earned
her the nickname “The First Lady of Emerging Technologies.” Dr. Kaur will be
delivering the keynote at the 2013 IEEE International Conference on Consumer
Electronics (ICCE), held during the CES show.
While the topic of Big Data is broad
and encompasses many trends and new technology developments, she managed to
give me a very good overview of what she considers to be the top ten emerging
technologies that are helping users cope with and handle Big Data in a
cost-effective manner.
Dr. Kaur:
Column-oriented databases
Traditional, row-oriented databases
are excellent for online transaction processing with high update speeds, but
they fall short on query performance as the data volumes grow and as data
becomes more unstructured. Column-oriented databases store data with a focus on
columns, instead of rows, allowing for huge data compression and very fast
query times. The downside to these databases is that they will generally only
allow batch updates, having a much slower update time than traditional models.
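The trade-off described above can be made concrete with a minimal sketch (no specific product implied): the same table held row-wise and column-wise, showing why the columnar layout enables compression and fast aggregates. The table contents here are invented for illustration.

```python
# Minimal sketch: the same table stored row-wise and column-wise.
rows = [("2012-12-05", "US", 100),
        ("2012-12-05", "US", 150),
        ("2012-12-05", "BR", 200),
        ("2012-12-05", "US", 120)]

# Column-oriented layout: one sequence per column. Values of the same
# type and domain sit together, which is what enables heavy compression
# schemes such as run-length or dictionary encoding.
columns = {
    "date":    [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "amount":  [r[2] for r in rows],
}

def run_length(values):
    # Collapse runs of repeated values into (value, count) pairs.
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded

# The repetitive "date" column collapses to a single entry:
print(run_length(columns["date"]))   # [('2012-12-05', 4)]

# An aggregate query touches only the column it needs, never the rest:
print(sum(columns["amount"]))        # 570
```

Updating a single row, by contrast, means touching every column sequence, which is why these systems favor batch loads.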
Schema-less databases, or NoSQL
databases
There are several database types
that fit into this category, such as key-value stores and document stores,
which focus on the storage and retrieval of large volumes of unstructured,
semi-structured, or even structured data. They achieve performance gains by
doing away with some (or all) of the restrictions traditionally associated with
conventional databases, such as read-write consistency, in exchange for
scalability and distributed processing.
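A toy illustration of the key-value idea (not any particular NoSQL product): keys are hashed across nodes, so capacity grows by adding nodes, and values carry no fixed schema. The class and data below are invented for this sketch.

```python
import hashlib

class TinyKVStore:
    """Toy distributed key-value store: each 'node' is just a dict,
    standing in for a separate server in a real deployment."""

    def __init__(self, num_nodes=3):
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, key):
        # Hash the key to pick a node; adding nodes spreads the load.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def put(self, key, value):
        # Schema-less: the value can be any structure (a "document").
        self._node_for(key)[key] = value

    def get(self, key):
        return self._node_for(key).get(key)

store = TinyKVStore()
store.put("user:42", {"name": "Ana", "tags": ["bigdata", "cloud"]})
print(store.get("user:42")["name"])  # Ana
```

Note what is missing: no schema, no joins, and no cross-key transactions, which is exactly the consistency trade-off the paragraph above describes.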
MapReduce
This is a programming paradigm that
allows for massive job execution scalability against thousands of servers or
clusters of servers. Any MapReduce implementation consists of two tasks:
· The “Map” task, where an input dataset is converted into a different set of
key/value pairs, or tuples;
· The “Reduce” task, where several of the outputs of the “Map” task are
combined to form a reduced set of tuples (hence the name).
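The two tasks above can be sketched in plain Python with the canonical word-count example. This is a single-process illustration of the paradigm, not a real framework, which would run many map and reduce tasks in parallel across a cluster.

```python
from itertools import groupby
from operator import itemgetter

def map_task(document):
    # "Map": convert the input into key/value pairs (word, 1).
    for word in document.split():
        yield (word, 1)

def reduce_task(word, counts):
    # "Reduce": combine all pairs sharing a key into one tuple.
    return (word, sum(counts))

documents = ["big data big ideas", "data drives ideas"]

# Map phase over every input document.
pairs = [kv for doc in documents for kv in map_task(doc)]

# Shuffle: group pairs by key (the framework does this between phases).
pairs.sort(key=itemgetter(0))
result = dict(reduce_task(k, (c for _, c in g))
              for k, g in groupby(pairs, key=itemgetter(0)))
print(result)  # {'big': 2, 'data': 2, 'drives': 1, 'ideas': 2}
```

Because each map call sees only one document and each reduce call only one key, both phases parallelize naturally across thousands of servers.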
Hadoop
Hadoop is by far the most popular
implementation of MapReduce, being an entirely open source platform for
handling Big Data. It is flexible enough to be able to work with multiple data
sources, either aggregating multiple sources of data in order to do large scale
processing, or even reading data from a database in order to run
processor-intensive machine learning jobs. It has several different
applications, but one of the top use cases is for large volumes of constantly
changing data, such as location-based data from weather or traffic sensors,
web-based or social media data, or machine-to-machine transactional data.
Hive
Hive is a “SQL-like” bridge that
allows conventional BI applications to run queries against a Hadoop cluster. It
was developed originally by Facebook, but has been made open source for some
time now, and it’s a higher-level abstraction of the Hadoop framework that
allows anyone to make queries against data stored in a Hadoop cluster just as
if they were manipulating a conventional data store. It amplifies the reach of
Hadoop, making it more familiar for BI users.
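The idea behind Hive can be illustrated without Hive itself: a SQL-like query such as `SELECT country, COUNT(*) FROM visits GROUP BY country` compiles down to a map step (emit the grouping key) and a reduce step (aggregate per key). The pure-Python sketch below mimics that translation; the table and records are invented for illustration.

```python
from collections import defaultdict

visits = [{"country": "US"}, {"country": "BR"}, {"country": "US"}]

# Map step: emit (grouping key, 1) for each record, as the
# generated MapReduce job would.
mapped = [(v["country"], 1) for v in visits]

# Reduce step: aggregate the values per key.
grouped = defaultdict(int)
for key, value in mapped:
    grouped[key] += value

print(dict(grouped))  # {'US': 2, 'BR': 1}
```

A BI user writes only the query; Hive generates and schedules jobs along these lines against the data in the Hadoop cluster.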
PIG
PIG is another bridge that tries to
bring Hadoop closer to the realities of developers and business users, similar
to Hive. Unlike Hive, however, PIG consists of a “Perl-like” language that
allows for query execution over data stored on a Hadoop cluster, instead of a
“SQL-like” language. PIG was developed by Yahoo!, and, just like Hive, has also
been made fully open source.
WibiData
WibiData combines web analytics with Hadoop and is built on top of HBase,
which is itself a database layer on top of Hadoop. It allows web sites to
better explore and work with
their user data, enabling real-time responses to user behavior, such as serving
personalized content, recommendations and decisions.
PLATFORA
Perhaps the greatest limitation of
Hadoop is that it is a very low-level implementation of MapReduce, requiring
extensive developer knowledge to operate. Between preparing, testing and
running jobs, a full cycle can take hours, eliminating the interactivity that
users enjoyed with conventional databases. PLATFORA is a platform that turns
users’ queries into Hadoop jobs automatically, thus creating an abstraction
layer that anyone can exploit to simplify and organize datasets stored in
Hadoop.
Storage Technologies
As the data volumes grow, so does
the need for efficient and effective storage techniques. The main evolutions in
this space are related to data compression and storage virtualization.
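One storage technique from this space, sketched in miniature and not tied to any particular product, is content-addressed deduplication: identical blocks are stored once and referenced by their hash, shrinking the footprint of repetitive data. The block contents below are invented for illustration.

```python
import hashlib
import zlib

def store(blocks):
    """Store blocks deduplicated and compressed; return the chunk
    store plus the list of hash references that reconstructs the data."""
    chunks = {}   # hash -> compressed block, kept only once
    refs = []     # the logical file is just an ordered list of hashes
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in chunks:
            chunks[digest] = zlib.compress(block)
        refs.append(digest)
    return chunks, refs

blocks = [b"sensor-reading-A" * 64,
          b"sensor-reading-B" * 64,
          b"sensor-reading-A" * 64]  # duplicate of the first block
chunks, refs = store(blocks)
print(len(refs), len(chunks))  # 3 logical blocks, only 2 stored
```

Compression handles redundancy within a block; deduplication handles redundancy across blocks, and storage virtualization then pools the resulting capacity across machines.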
SkyTree
SkyTree is a high-performance
machine learning and data analytics platform focused specifically on handling
Big Data. Machine learning, in turn, is an essential part of Big Data, since
the massive data volumes make manual exploration, or even conventional
automated exploration methods, unfeasible or too expensive.
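To illustrate why automated exploration matters (this is a generic sketch, not SkyTree's API): even a simple clustering pass can surface structure in data far too large to inspect by hand. Below, a one-dimensional k-means finds the two groups hiding in a list of measurements; the data points are invented for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Tiny k-means on 1-D data: alternate assignment and update."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Two obvious groups of measurements; the algorithm finds both means.
points = [1.0, 1.2, 0.8, 9.9, 10.1, 10.0]
print(sorted(kmeans_1d(points, centers=[0.0, 5.0])))  # approx. [1.0, 10.0]
```

The same alternating logic, run distributed over billions of points, is what platforms in this category industrialize.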
Big Data in the cloud
As we can see from Dr. Kaur’s roundup above, most, if not all, of these
technologies are closely associated with
the cloud. Most cloud vendors are already offering hosted Hadoop clusters that
can be scaled on demand according to their users’ needs. Also, many of the
products and platforms mentioned are either entirely cloud-based or have cloud
versions themselves.
Big Data and cloud computing go
hand-in-hand. Cloud computing enables companies of all sizes to get more value
from their data than ever before, by enabling blazing-fast analytics at a
fraction of previous costs. This, in turn, drives companies to acquire and store
even more data, creating more need for processing power and driving a virtuous
circle.