Comparing SQL databases and Hadoop
Given that Hadoop is a framework for processing data, what makes it better than standard relational databases, the workhorse of data processing in most of today’s applications? One reason is that SQL (structured query language) is by design targeted at structured data. Many of Hadoop’s initial applications deal with unstructured data such as text. From this perspective Hadoop provides a more general paradigm than SQL.
For working only with structured data, the comparison is more nuanced. In principle, SQL and Hadoop can be complementary, as SQL is a query language which can be implemented on top of Hadoop as the execution engine.
But in practice, SQL databases tend to refer to a whole set of legacy technologies, with several dominant vendors, optimized for a historical set of applications. Many of these existing commercial databases are a mismatch to the requirements that Hadoop targets.
With that in mind, let’s make a more detailed comparison of Hadoop with typical SQL databases on specific dimensions.
SCALE-OUT INSTEAD OF SCALE-UP
Scaling commercial relational databases is expensive. Their design is more friendly to scaling up. To run a bigger database you need to buy a bigger machine. In fact, it’s not unusual to see server vendors market their expensive high-end machines as “database-class servers.” Unfortunately, at some point there won’t be a big enough machine available for the larger data sets. More importantly, the high-end machines are not cost effective for many applications. Hadoop clusters with ten to hundreds of machines is standard. In fact, other than for development purposes, there’s no reason to run Hadoop on a single server.
KEY/VALUE PAIRS INSTEAD OF RELATIONAL TABLES
A fundamental tenet of relational databases is that data resides in tables having relational structure defined by a schema . Although the relational model has great formal properties, many modern applications deal with data types that don’t fit well into this model. Text documents, images, and XML files are popular examples. Also, large data sets are often unstructured or semistructured. Hadoop uses key/value pairs as its basic data unit, which is flexible enough to work with the less-structured data types. In Hadoop, data can originate in any form, but it eventually transforms into (key/value) pairs for the processing functions to work on.
FUNCTIONAL PROGRAMMING (MAPREDUCE) INSTEAD OF DECLARATIVE QUERIES (SQL)
SQL is fundamentally a high-level declarative language. You query data by stating the result you want and let the database engine figure out how to derive it. Under MapReduce you specify the actual steps in processing the data, which is more analogous to an execution plan for a SQL engine . Under SQL you have query statements; under MapReduce you have scripts and codes. MapReduce allows you to process data in a more general fashion than SQL queries. For example, you can build complex statistical models from your data or reformat your image data. SQL is not well designed for such tasks.
OFFLINE BATCH PROCESSING INSTEAD OF ONLINE TRANSACTIONS
Hadoop is designed for offline processing and analysis of large-scale data. It doesn’t work for random reading and writing of a few records, which is the type of load for online transaction processing. In fact, as of this writing (and in the foreseeable future), Hadoop is best used as a write-once , read-many-times type of data store. In this aspect it’s similar to data warehouses in the SQL world.