Wide-column, column-oriented and column-family databases – Part 1
Wide-column, column-oriented and column-family databases belong to the NoSQL family of databases built to store and query massive amounts of data, aka BigData.
They are highly available and scalable, built to work in a distributed environment. Most of the wide-column databases do not support joins to scale linearly when it comes to read and write performance.
Some of the popular wide-column stores are Google Bigtable, Apache Cassandra, HBase, etc.
Before I plunge into the topic, let me give you a little heads up on the common data storage models.
Data storage models
Row-oriented databases are the traditional relational databases like MySQL and PostgreSQL that store data in rows.
For instance, here is how a row-oriented database will store the details of a customer.
Illustration 1.65: Row-oriented data storage
In a row-oriented database, the data is stored in rows on the disk in contiguous locations or blocks. On running a query, it’s easy to fetch an entire row of data for a particular customer since it resides in the same block.
It’s also easy to update customer data since all the data for a certain customer is stored in a single row. Even in the case of partitions, the rows are partitioned horizontally. The columns containing data of a particular customer always stay together on the disk.
This data storage model is best suited for OLTP (Online transactional processing) use cases.
Now let’s move on to column-oriented databases.
Column-oriented databases, on the other hand, store data in columns as opposed to rows.
In this scenario, the customer ids will be stored on the disk at a certain location in one column, customer names in another column, customer city and country will be in respective columns and so on.
Illustration 1.66: Column-oriented data storage
These databases are best suited for OLAP (Online analytical processing) use cases. They provide the best performance when similar data is stored in columns on the disk. Examples of column-oriented databases are Google BigQuery and Amazon Redshift.
But why do we need to store data in columns as opposed to rows?
Imagine the above customer info table having billions of rows and the business needs to run analytics on the data. Say it needs to run a query to determine the percentage of customers from a particular country.
If the data was stored in rows on the disk, the query would have to traverse a large number of disk blocks (across machines, since the data would be partitioned) to figure out the results, processing unnecessary data.
On the flipside, on saving columns on the disk, the query will only process the country column, neglecting other customer data, providing high-throughput reads. It wouldn’t have to scan through rows of customer data. Also, similar data in columns can be compressed by using techniques such as run-length encoding, requiring less storage space than row-based data.
Column-oriented databases are suited for analytical use cases where the business has to make sense of large volumes of data. However, in this scenario, if we need to add a new row of customer data, this would prove to be resource-intensive since we have to access all the columns to add a new record in them.
For writing customer data, a row-oriented database would suit best.
Both the data storage models have their pros and cons and use cases.
Wide-column database is a term for databases built for storing large amounts of data in flexible columns. Different wide-column databases have different implementations when it comes to their data storage model and clustering implementation. I’ll come to that up ahead in the lesson.
There is no strict, well-defined definition of the difference between a wide-column database and a column-oriented datastore.
Now, let’s move on to understanding column-family databases.
Column family databases store similar columns in column families. A row in these databases typically contains multiple column families.
The most popular example of a column-family database is Google Bigtable. Let’s plunge into it, understanding its data storage model.
Google Bigtable – A column-family database
Bigtable is a distributed column-oriented database that can scale to large sizes upto petabyte-scale running across thousands of commodity servers.
Recommended read: YouTube database – How does it store so many videos without running out of storage space?
It can also be used with the MapReduce framework acting as a data source, also to persist the results of the large-scale parallel computation jobs.
The distributed storage system is specifically built by Google to handle structured BigData, keeping low latency in mind. Bigtable powers over sixty Google products, including Google Earth, Google Finance, Search web indexing, and Analytics.
The products have varied requirements, from throughput-focused batch processing to low-latency real-time data access by the end-users.
Speaking of its data model, the product offers dynamic control over the data layout, format, locality of data, and if data is to be served from the memory or the disk.
Bigtable is a sparse, distributed, persistent multidimensional sorted map. The key in the map constitutes a row key, a column key, and a timestamp. And the value is an array of bytes.
What does sparse mean here? Sparse means that most of the cells in the table would be zero or empty. The data model is designed keeping in mind the use cases of a Bigtable-like system. Let’s understand this with the help of an example in the lesson up next.