Column-oriented Storage and Column Families

Traditional databases store data in a row-oriented fashion, i.e., all the values from one row of a table are stored contiguously. Column-oriented storage instead stores all the values from each column together.
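The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration with plain lists and dicts: the `orders` records and the names `row_layout` and `column_layout` are made up for the example and do not correspond to any particular database's on-disk format.

```python
# Three hypothetical records from an "orders" table (illustrative data).
rows = [
    {"id": 1, "country": "IN", "amount": 250},
    {"id": 2, "country": "US", "amount": 100},
    {"id": 3, "country": "IN", "amount": 75},
]

# Row-oriented layout: all values of one row sit next to each other.
row_layout = [value for row in rows for value in row.values()]

# Column-oriented layout: all values of one column sit next to each other.
column_layout = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column only needs to touch that column's values.
total_amount = sum(column_layout["amount"])

print(row_layout)
print(column_layout)
print(total_amount)
```

In the row layout, summing `amount` means stepping over every `id` and `country` value as well; in the column layout the sum reads one contiguous list.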

The advantages of columnar storage are:

Less I/O: a query that reads a single column needs to fetch only that column's data (often stored as its own file or block), not entire rows
Better compression: storing similar values together generally compresses better
Better CPU efficiency: vectorized processing can work on tightly packed column data in the CPU cache
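To make the compression point concrete, here is a minimal sketch of run-length encoding applied to a single column. Run-length encoding is just one technique picked for illustration; the function names and the sample `country_column` are assumptions for this example, and real columnar formats combine several encodings.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# A hypothetical low-cardinality column, as it might look after sorting.
country_column = ["IN", "IN", "IN", "US", "US", "IN"]

encoded = run_length_encode(country_column)
decoded = run_length_decode(encoded)

print(encoded)
```

Because a column holds values of one type and often of low cardinality, runs like these are far more common than they would be in a row layout, where adjacent values belong to different columns.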

Column-oriented stores are generally a good fit for analytical workloads that need to compute aggregate values over columns (as opposed to interactive queries for individual records). Examples of column-oriented datastores are Vertica and Apache Kudu. Column-oriented file formats include Apache Parquet and Apache ORC.

Column Families

Column-oriented stores should not be confused with the similar-sounding column families, a term that likely originated in Google’s Bigtable paper and is present in Cassandra and HBase, databases that inherited concepts from Bigtable. Here, columns are grouped into column families, and within each column family data is still stored row-wise. Data is distributed across nodes by the partitioning key (or set of keys), and all columns in a family are stored together.
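A rough sketch of this model, assuming a toy three-node cluster held in memory: each row is located by hashing its partition key, and within a row the columns of one family are kept together. The helpers `node_for`, `put`, and `get` are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash function a real system uses.

```python
import zlib

NODE_COUNT = 3  # assumed cluster size for the illustration

# Each node maps: partition key -> column family -> {column: value}
nodes = [{} for _ in range(NODE_COUNT)]


def node_for(partition_key):
    """Pick a node by hashing the partition key (crc32 is a stand-in hash)."""
    return zlib.crc32(partition_key.encode("utf-8")) % NODE_COUNT


def put(partition_key, family, columns):
    """Store columns for one row; columns of a family stay together."""
    row = nodes[node_for(partition_key)].setdefault(partition_key, {})
    row.setdefault(family, {}).update(columns)


def get(partition_key, family):
    """Read every column of one family for one row from a single node."""
    return nodes[node_for(partition_key)][partition_key][family]


put("user:42", "profile", {"name": "Ada", "city": "London"})
put("user:42", "activity", {"last_login": "2024-01-01"})

print(get("user:42", "profile"))
```

Note that reading the `profile` family returns whole rows of that family, which is row-wise storage scoped to a group of columns rather than a columnar layout.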

Cassandra in more recent versions has visibly moved away from the column family terminology and presents an almost SQL-like query interface (called CQL). However, when designing schemas for Cassandra, we still need to make sure that queries filter rows only on the keys used for partitioning, clustering, or indexing the data. This is a consequence of how the data is actually laid out.
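Why the layout constrains queries can be seen in a simplified model, sketched here under the assumption that a table is a dict of partitions, each kept sorted by a clustering key. This is not Cassandra's storage engine or API; the sensor table and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right, insort

# partition key -> list of (clustering key, reading), sorted by clustering key
table = {}


def insert(sensor_id, timestamp, reading):
    """Place the row in its partition, keeping clustering order."""
    insort(table.setdefault(sensor_id, []), (timestamp, reading))


def query_by_key(sensor_id, start, end):
    """Filter on partition and clustering keys: one lookup plus a range slice."""
    partition = table.get(sensor_id, [])
    lo = bisect_left(partition, (start,))
    hi = bisect_right(partition, (end, float("inf")))
    return partition[lo:hi]


def query_by_reading(min_reading):
    """Filter on a non-key column: every row of every partition is scanned."""
    return [
        (sensor_id, timestamp, reading)
        for sensor_id, partition in table.items()
        for timestamp, reading in partition
        if reading >= min_reading
    ]


insert("s1", 10, 20.5)
insert("s1", 30, 22.0)
insert("s1", 20, 21.0)
insert("s2", 15, 30.0)

print(query_by_key("s1", 15, 30))
print(query_by_reading(22.0))
```

The key-based query touches a single partition and a contiguous slice of it, while the filter on `reading` has to visit every partition, which in a distributed store means every node.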

Notes from the excellent Designing Data-Intensive Applications by Martin Kleppmann
Am working my way through Database Internals by Alex Petrov
