
Designing Data Intensive Applications - Storage and retrieval (part 2)

Clément Sauvage
Designing Data Intensive Applications Book notes - This article is part of a series.
Part 4: This Article

Transaction processing or analytics #

A transaction represents a group of reads and writes that form a logical unit. The access pattern of looking up a small number of records by their primary key, and writing records based on user input, is called OLTP (Online Transaction Processing).

The access pattern of scanning through many records, filtering them, and computing aggregate statistics over them, is called OLAP (Online Analytical Processing).

We can differentiate the two as follows:

| Property | Transaction processing system (OLTP) | Analytic system (OLAP) |
| --- | --- | --- |
| Main read pattern | Small number of records per query, fetched by key | Aggregate over large number of records |
| Main write pattern | Random-access, low-latency writes from user input | Bulk import (ETL - Extract, Transform, Load) or event stream |
| Primary usage | End user/customer, via application | Internal analyst, for decision support |
| What data represents | Latest state of data at the time | History of events that happened over time |
| Dataset size | GB to TB | TB to PB |
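
To make the contrast concrete, here is a minimal sketch in Python of the two access patterns, using hypothetical in-memory data rather than a real database: a point lookup by primary key versus a scan over every record to compute an aggregate.

```python
# OLTP-style access: fetch a handful of records by primary key.
orders_by_id = {
    1: {"customer": "alice", "amount": 30.0},
    2: {"customer": "bob", "amount": 12.5},
    3: {"customer": "alice", "amount": 99.5},
}

def oltp_lookup(order_id):
    # Point lookup: touches a single record.
    return orders_by_id[order_id]

# OLAP-style access: scan every record to compute an aggregate.
def olap_revenue_per_customer(orders):
    totals = {}
    for order in orders:
        totals[order["customer"]] = totals.get(order["customer"], 0.0) + order["amount"]
    return totals

print(oltp_lookup(2))                                    # {'customer': 'bob', 'amount': 12.5}
print(olap_revenue_per_customer(orders_by_id.values()))  # {'alice': 129.5, 'bob': 12.5}
```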

Data warehousing #

Over time, it became clear that OLTP and OLAP workloads are so different that they are best served by separate systems. This separate database dedicated to analytics is called a data warehouse. OLTP systems are expected to be highly available and performant, since they serve the customer-facing application. Business analysts, on the other hand, often need to run expensive queries that scan large parts of the dataset, which would hurt that performance; it therefore makes sense to give them a separate environment (the data warehouse).

A data warehouse contains a read-only copy of the data from the OLTP database, extracted periodically or through a continuous stream of updates. The process of getting data into the warehouse is known as Extract-Transform-Load (ETL): the data is extracted from the OLTP database, transformed into an analysis-friendly schema, and loaded into the data warehouse.
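
As a rough illustration, here is a tiny ETL sketch in Python. The table names, columns, and hard-coded rows are made up for the example; real pipelines use dedicated tooling and run against actual databases.

```python
from datetime import datetime

def extract():
    # Hypothetical: rows pulled from the OLTP database (hard-coded here).
    return [
        {"id": 1, "customer": "alice", "amount": 30.0, "ts": "2024-01-05T10:21:00"},
        {"id": 2, "customer": "bob", "amount": 12.5, "ts": "2024-01-05T11:02:00"},
    ]

def transform(rows):
    # Reshape into an analysis-friendly schema: derive an order_date column
    # and keep only the fields the analysts need.
    return [
        {
            "order_id": row["id"],
            "customer": row["customer"],
            "order_date": datetime.fromisoformat(row["ts"]).date().isoformat(),
            "amount": row["amount"],
        }
        for row in rows
    ]

def load(rows, warehouse):
    # Append into the warehouse's fact table (read-only for the analysts).
    warehouse.setdefault("fact_orders", []).extend(rows)

warehouse = {}
load(transform(extract()), warehouse)
print(warehouse["fact_orders"])
```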

On the surface, OLTP databases look similar to data warehouses, as both usually offer a SQL interface. But the underlying data structures and access patterns are very different, each optimized for its respective workload.

Stars and snowflakes: Schemas for analytics #

In contrast to the wide variety of data models used in transaction processing, the realm of analytics offers much less diversity. One of the main data models used in analytics is the star schema. It is composed of a central table (the fact table) containing the metrics the business is interested in, and several dimension tables that describe the attributes of those facts. Some of the columns in the fact table are attributes, while others are foreign keys that refer to the dimension tables.

Figure: Star schema

Usually, facts are captured as events: one row in the fact table represents one event, and the dimension tables describe the attributes of that event.
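
For example, here is a toy star schema in Python; the table and column names (fact_sales, dim_product, dim_store) are invented for illustration.

```python
# Dimension tables: descriptive attributes, referenced by foreign key.
dim_product = {
    1: {"name": "banana", "category": "fruit"},
    2: {"name": "espresso", "category": "coffee"},
}
dim_store = {
    10: {"city": "Paris"},
    11: {"city": "Lyon"},
}

# Fact table: one row per event (a sale), holding metrics plus foreign keys.
fact_sales = [
    {"date": "2024-01-05", "product_fk": 1, "store_fk": 10, "quantity": 3, "unit_price": 0.5},
    {"date": "2024-01-05", "product_fk": 2, "store_fk": 11, "quantity": 1, "unit_price": 2.5},
]

# "Revenue per product category" resolves the foreign keys against dim_product.
revenue_by_category = {}
for sale in fact_sales:
    category = dim_product[sale["product_fk"]]["category"]
    amount = sale["quantity"] * sale["unit_price"]
    revenue_by_category[category] = revenue_by_category.get(category, 0.0) + amount
print(revenue_by_category)  # {'fruit': 1.5, 'coffee': 2.5}
```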

The snowflake schema is a variation of the star schema, in which the dimension tables are further normalized into multiple related tables, which can be more efficient in terms of storage, but can be more complex to query.

Figure: Snowflake schema

Column-oriented storage #

When you have trillions of rows, or petabytes of data, which is common in analytics, column-oriented storage can be much more efficient than row-oriented storage.

In most OLTP databases, data is stored in a row-oriented fashion: each row is stored as one record, and records are laid out sequentially on disk. But when doing analytics, you may want to query only a few columns of a table that has hundreds of columns.

The idea of column-oriented storage is simple: instead of storing all the values of one row together, store all the values of each column together. If each column is stored in a separate file, a query that needs to read and parse only a few columns only has to scan the files for those columns, saving a lot of work.
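
A small sketch of the same made-up data laid out row-wise and column-wise; in a real engine the column lists below would live in separate files on disk.

```python
# Row-oriented layout: each record kept whole, one after another.
rows = [
    {"date": "2024-01-05", "product": "banana", "quantity": 3},
    {"date": "2024-01-06", "product": "espresso", "quantity": 1},
    {"date": "2024-01-06", "product": "banana", "quantity": 2},
]

# Column-oriented layout: one sequence per column; the values at index i
# across all columns belong to the same row i.
columns = {
    "date": ["2024-01-05", "2024-01-06", "2024-01-06"],
    "product": ["banana", "espresso", "banana"],
    "quantity": [3, 1, 2],
}

# A query that only needs `quantity` reads a single column ("file")
# instead of parsing every full row.
print(sum(columns["quantity"]))  # 6
```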

Column Compression #

One way of further optimizing column-oriented storage is to compress the columns. Column-oriented storage is well suited for compression because the data in one column is of the same type, and often has low cardinality (few distinct values). One of the most common compression techniques is run-length encoding, which stores a value and the number of times it is repeated.
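
Here is a minimal run-length encoder and decoder for a single column, assuming the column is just a Python list of values; real column stores apply such encodings to packed byte blocks rather than Python lists.

```python
def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out

country_column = ["FR", "FR", "FR", "DE", "DE", "FR"]
encoded = rle_encode(country_column)
print(encoded)  # [('FR', 3), ('DE', 2), ('FR', 1)]
assert rle_decode(encoded) == country_column
```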

Compressing the data is beneficial not only because of the reduced storage space, but also because it can speed up query processing: it reduces the amount of data that needs to be read from disk (disk bandwidth is a big bottleneck in queries that scan over millions of rows), as well as the amount of data that needs to be processed in memory.

Sort order in column storage #

The simplest way to store data in column-oriented storage is in the order it was inserted, but that is not the most efficient. Usually, the data is sorted according to the values in one of the columns, preferably the one that is queried the most. A second (and third, and so on) sort key can further optimize the storage for queries that need to filter on multiple columns.

Sorted order can also help with compression, as the values are more likely to be similar to their neighbors.
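
As a sketch, sorting a column store by a chosen sort key means computing one row permutation and applying it to every column, so that the i-th entry of each column still belongs to the same row; the data below is made up.

```python
columns = {
    "date": ["2024-01-06", "2024-01-05", "2024-01-06"],
    "product": ["espresso", "banana", "banana"],
    "quantity": [1, 3, 2],
}

# Compute one row permutation from the sort key (date first, then product)...
order = sorted(
    range(len(columns["date"])),
    key=lambda i: (columns["date"][i], columns["product"][i]),
)

# ...and apply the same permutation to every column so rows stay aligned.
sorted_columns = {name: [values[i] for i in order] for name, values in columns.items()}
print(sorted_columns["date"])      # ['2024-01-05', '2024-01-06', '2024-01-06']
print(sorted_columns["product"])   # ['banana', 'banana', 'espresso']
print(sorted_columns["quantity"])  # [3, 2, 1]
```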

Some data warehouse systems store the same column in several different sort orders, which helps queries that need to filter on different columns. The data needs to be replicated to multiple machines anyway, so storing it redundantly in several sort orders is acceptable. Having multiple sort orders in a column-oriented store is somewhat like having multiple secondary indexes in a row-oriented store.

Writing to column-oriented storage #

The previously discussed optimizations make sense because most of the load on data warehouses is reads. But they come at the cost of more complex writes, especially since the columns are compressed and sorted.

An LSM-tree is a good fit for writing to column-oriented storage: data is first written to an in-memory table, and when that table reaches a certain size, it is flushed to disk as a new segment.
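
A rough sketch of that write path, with a hypothetical file layout: rows accumulate in an in-memory buffer and, once a threshold is reached, are pivoted into one small file per column for the new segment. Real systems also merge segments in the background and protect the in-memory table with a log.

```python
import json
import os
import tempfile

FLUSH_THRESHOLD = 3                 # flush after this many buffered rows
segment_dir = tempfile.mkdtemp()    # stand-in for the warehouse's data directory
memtable = []                       # recent writes, still in row form
segment_count = 0

def write(row):
    """Buffer a write; pivot the buffer into per-column files when full."""
    global segment_count
    memtable.append(row)
    if len(memtable) >= FLUSH_THRESHOLD:
        for column in memtable[0]:
            path = os.path.join(segment_dir, f"segment{segment_count}.{column}.json")
            with open(path, "w") as f:
                json.dump([r[column] for r in memtable], f)
        memtable.clear()
        segment_count += 1

for quantity in (3, 1, 2):
    write({"product": "banana", "quantity": quantity})

print(sorted(os.listdir(segment_dir)))  # one small file per column of the flushed segment
```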

Aggregations: Data cubes and materialized views #

As data warehouses are used for analytics, they often need to perform aggregations over large datasets (COUNT, SUM, AVG, MIN, MAX, etc.). One way of optimizing for that is to precompute the aggregations into what is called a materialized view. It is like a view in a relational database, except that a relational (virtual) view is computed on the fly, while a materialized view is precomputed and stored on disk.

Materialized views make writes more expensive, but they can still make sense in a read-heavy data warehouse.

One common special case of a materialized view is the data cube, which precomputes aggregations grouped over multiple dimensions.
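
For instance, here is a toy two-dimensional cube in Python that precomputes SUM(amount) grouped by (date, product); any query grouping by a subset of those dimensions can then be answered from the precomputed cells. The data is made up.

```python
from collections import defaultdict

fact_sales = [
    {"date": "2024-01-05", "product": "banana", "amount": 3.0},
    {"date": "2024-01-05", "product": "espresso", "amount": 2.5},
    {"date": "2024-01-06", "product": "banana", "amount": 1.0},
]

# Materialize the cube once: SUM(amount) for every (date, product) cell.
# It has to be refreshed whenever new facts arrive, which is what makes
# writes more expensive.
cube = defaultdict(float)
for sale in fact_sales:
    cube[(sale["date"], sale["product"])] += sale["amount"]

# Aggregates along either dimension are now cheap sums over the cube's cells.
total_per_date = defaultdict(float)
for (date, _product), amount in cube.items():
    total_per_date[date] += amount
print(dict(total_per_date))  # {'2024-01-05': 5.5, '2024-01-06': 1.0}
```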

Summary #

We saw that the choice of storage engine has a big impact on the performance of a database: different storage engines handle storing and retrieving data in very different ways.

On a high level, storage systems can be divided into two categories:

  • Transaction processing (OLTP) systems, which are optimized for a high throughput of small transactions.
  • Analytic (OLAP) systems, which are optimized for scanning many records to compute aggregations.

On the OLTP side, we saw two main schools of thought:

  • The log-structured school, which only appends to files and deletes obsolete data during compaction; it never overwrites data in place. SSTables, LSM-trees, Lucene, and others belong to this group.
  • The update-in-place school, which treats the disk as a set of fixed-size pages, and updates the data in place. B-Trees are the most common example of this group, used in all major relational databases, as well as many non-relational ones.

Log-structured storage engines are a comparatively recent development. They achieve high write throughput by turning random-access writes into sequential writes on disk.
