Faster Database Performance with Denormalization
In the past, we’ve discussed using methods to normalize our database to reduce data redundancy and decrease bugs. One downside to normalizing our database is that there can be performance problems as we interact with this normalized data at scale. We may have to join several tables to get the data we need in one query, and performing calculations on tens of millions of rows can take significant time.
When large amounts of data cause performance problems, we need to find other ways to store it.
In this article, we’ll discuss database denormalization and how it can be used in your application to improve performance.
What Is Denormalization?
When we normalize our database tables, we go through an iterative process to design them. We keep exactly one copy of each piece of data in our tables to remove data redundancy.
Denormalization is a process we use on normalized database tables to increase read performance at the expense of losing some write performance. We do this by adding redundant copies of our data or by grouping data in ways that will be fast to read later on.
When we denormalize our database, we might add columns with derived data, create tables that split an existing table across logical boundaries, or create aggregated data.
We may use either Database Management System (DBMS) denormalization or Database Administrator (DBA) denormalization. DBMS denormalization is when our DBMS provides support to denormalize the data automatically, while DBA denormalization is when we have to manually store and update the denormalized data ourselves, typically in our application logic.
When to Use Denormalization
Denormalization is a great solution for any kind of data that gets updated infrequently or data that is part of a report that needs to be displayed quickly.
Some example techniques include:
- Keeping count of the “many” elements in a one-to-many relationship by storing the count as a column of the “one” element (see the sketch after this list)
- Adding attributes to a table that are copies of data from another table that is routinely joined
- Creating summaries of data
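As a quick sketch of that first technique, the snippet below stores a count of charges on a customers table (the charges_count column and the customer id are made up for illustration; both tables are covered in more detail later in this article):
alter table customers add column charges_count int not null default 0;
-- Keep the count in step whenever a new charge is recorded; in practice this
-- update would live in a trigger or in application code alongside the insert:
update customers
set charges_count = charges_count + 1
where id = 42;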
A normalized database gives us a lot of performance out of the box, and denormalization can be a case of premature optimization that overcomplicates the structure of our code for little to no gain. I recommend not jumping right into using denormalization when it may not be needed. We want to use denormalization only when we start to experience performance problems in our applications that can’t easily be solved by adding more memory to our database server.
What Are Techniques To Structure Denormalized Data?
There are several techniques we can use to structure our denormalized data. We’re going to cover the techniques I find the most helpful, but this is NOT a complete list.
Introducing a Redundant Column/Pre-joining Tables
The first technique is to create a redundant column that pre-joins our data for cases where we commonly join the table with another table to get a specific column.
For example, let’s say we have a charges table that’s used to track financial charges for a specific customer.
create table charges
(
    id int not null auto_increment primary key,
    created_at datetime not null,
    amount_in_cents_usd int not null,
    customer_id int not null
);
For this example, the customers table just tracks the name of the customer.
create table customers
(
    id int not null auto_increment primary key,
    name varchar(255)
);
We commonly join the customers table to the charges table to display which customer is associated with each charge. To improve performance and avoid the join, we can add a customer_name column to the charges table.
create table charges
(
    id int not null auto_increment primary key,
    created_at datetime not null,
    amount_in_cents_usd int not null,
    customer_id int not null,
    customer_name varchar(255) not null
);
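To make the trade-off concrete, here’s a sketch of the read query before and after the redundant column is added (the column list is just illustrative):
-- Before denormalizing, showing the customer name requires a join:
select c.id, c.amount_in_cents_usd, cu.name
from charges c
join customers cu on cu.id = c.customer_id;
-- After adding the redundant column, we can read straight from charges:
select id, amount_in_cents_usd, customer_name
from charges;
The cost is on the write side: any code path that changes a customer’s name now has to update the copies stored in charges as well.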
Table Splitting
Another option is to split the table into multiple smaller tables so they’re faster to query and manage. We can perform either horizontal or vertical table splitting.
Horizontal Table Splitting
When we split a table horizontally, we make multiple copies of the table with the same columns but then split the rows based on customer, region, location, department, or any other logical separation.
For example, our charges table may have reached the point where there are too many entries for us to easily perform actions on all the rows at once. We can generate ten tables for the various regions we service.
create table charges_region_01
(
    id int not null auto_increment primary key,
    created_at datetime not null,
    amount_in_cents_usd int not null,
    customer_id int not null
);
...
create table charges_region_10
(
    id int not null auto_increment primary key,
    created_at datetime not null,
    amount_in_cents_usd int not null,
    customer_id int not null
);
We can then either stop using our charges table or perform double entry, but these region-based tables are going to be much faster to query because we’re only working with a subset of our data.
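As a quick sketch of what the reads look like afterwards, a report that only cares about a single region now scans one much smaller table (the date filter below is purely illustrative):
-- Total charges for region 01 only touch that region's table:
select sum(amount_in_cents_usd)
from charges_region_01
where created_at >= '2024-01-01';
If we occasionally still need to look across every region, a union all view over the regional tables can stitch them back together for those queries.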
Vertical Table Splitting
When we split a table vertically, we keep the same number of rows but split the columns into different tables.
For example, going back to our charges table, someone went and added a bunch of blob columns for supporting evidence, but they’re being used in less than 1% of rows.
create table charges
(
    id int not null auto_increment primary key,
    created_at datetime not null,
    amount_in_cents_usd int not null,
    customer_id int not null,
    support_file_01 blob,
    support_file_02 blob,
    support_file_03 blob,
    support_file_04 blob,
    support_file_05 blob
);
We can easily split this table vertically into a charges_financial table and a charges_support table so we can keep the financial pieces fast without having to worry about loading all of those blob columns into memory every time we need to read the table data.
create table charges_financial
(
    id int not null auto_increment primary key,
    created_at datetime not null,
    amount_in_cents_usd int not null,
    customer_id int not null
);
create table charges_support
(
    charges_financial_id int not null,
    support_file_01 blob,
    support_file_02 blob,
    support_file_03 blob,
    support_file_04 blob,
    support_file_05 blob
);
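To show the payoff, here’s a sketch of the two read paths after the split (the id values are made up for the example):
-- The common financial queries never touch the blob columns:
select id, created_at, amount_in_cents_usd
from charges_financial
where customer_id = 42;
-- The rare query that needs the supporting evidence joins the split-off table:
select f.id, s.support_file_01
from charges_financial f
join charges_support s on s.charges_financial_id = f.id
where f.id = 1001;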
Adding Derived Columns
The third technique we’re going to discuss is adding derived columns. A derived column stores a value that is calculated from other data so we don’t have to recompute it on every read.
For example, we may need to display the total charges each customer has generated in the current year. To generate this data, we can, of course, perform a join between the customers and charges tables, filter it down to just this year, perform a group by, and output the results. Again, we might have some performance issues with this as the tables grow larger and larger.
To get around this, we can add a ytd_charges_in_cents_usd column to our customers table and then store the result there to increase the query speed.
create table customers
(
    id int not null auto_increment primary key,
    name varchar(255),
    ytd_charges_in_cents_usd int not null
);
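One way to populate the derived column is a periodic backfill like the sketch below; the hard-coded year boundary is purely for illustration:
-- Recompute the year-to-date total for every customer in one pass:
update customers cu
set ytd_charges_in_cents_usd = (
    select coalesce(sum(c.amount_in_cents_usd), 0)
    from charges c
    where c.customer_id = cu.id
      and c.created_at >= '2024-01-01'
);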
Using Mirrored Tables
Next, we get to mirrored tables. When we create a mirrored table, we create a full or partial copy of a table stored in a separate location that is specifically optimized for faster query performance. For example, we may have a separate database server set up to perform analytics on our data. Keeping it separate lets those queries run faster and prevents them from affecting the workload of the rest of the system.
Materialized Views
Finally, we can use materialized views, which store the “result” of a query. These are typically queries that aggregate data, and they can be generated manually; some DBMSs also provide native support by automatically regenerating the data stored in the view when it’s stale. For example, we could track ytd_charges_in_cents_usd in a materialized view instead of inside the customers table.
If your DBMS doesn’t support materialized views, it’s easy enough to implement your own by creating a cron task that regenerates the data.
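Here’s a minimal sketch of the idea using PostgreSQL syntax (MySQL, for example, has no native materialized views); the view name is my own, and it assumes a charges table like the one defined earlier:
-- Aggregate the year-to-date charges per customer into a materialized view:
create materialized view customer_ytd_charges as
select customer_id,
       sum(amount_in_cents_usd) as ytd_charges_in_cents_usd
from charges
where created_at >= date_trunc('year', current_date)
group by customer_id;
-- Re-run this on a schedule (or from that cron task) whenever the view goes stale:
refresh materialized view customer_ytd_charges;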
How Do We Build Denormalized Data?
Now that you have a basic understanding of how to store data in a denormalized fashion, you might be wondering how to generate it. Depending on your needs, you can populate the denormalized data using scheduled jobs, a database trigger, or even your application’s internal logic. A lot of it comes down to how “real-time” you need the denormalized data to be. Are you happy with it only updating once a day? Then cron to the rescue. If you need the data to be up to date immediately, then event sourcing might fit your needs better. We use a combination of techniques depending on what we need for each piece of data.
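As one sketch of the trigger approach, the example below (MySQL trigger syntax) keeps the derived ytd_charges_in_cents_usd column from earlier in sync as new charges are inserted; it deliberately ignores updates, deletes, and the year rollover:
-- Bump the customer's year-to-date total every time a charge is inserted:
create trigger charges_after_insert
after insert on charges
for each row
update customers
set ytd_charges_in_cents_usd = ytd_charges_in_cents_usd + new.amount_in_cents_usd
where id = new.customer_id;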
What You Need to Know
- Database denormalization is the process of creating redundant copies of data to improve read performance
- Use it only when you need it
- There are several techniques for structuring denormalized data