Stories by James Coleman on Medium

PostgreSQL at Scale: Saving Space (Basically) for Free

James Coleman — Mon, 28 Sep 2020 20:57:58 GMT

Braintree Payments operates dozens of PostgreSQL clusters with over 100 terabytes of data. At this scale, even a change of a few percentage points in disk space growth rate can meaningfully impact the writable lifespan of a database cluster. Unfortunately, many ideas to save disk space require application changes and therefore need to be slotted into product timelines.

But today I want to focus on a technique that saved us approximately 10% of our disk space with very little effort beyond existing processes. In short, carefully choosing column order when creating a table can eliminate padding that would otherwise be needed.

This technique isn’t revolutionary: it’s been well-documented by 2ndQuadrant in On Rocks and Sand, EDB in Data Alignment in PostgreSQL, GitLab in Ordering Table Columns in PostgreSQL, the classic “Column Tetris” answer on a StackOverflow question, and I’m sure many more. What I hope we’re bringing to the table is tooling encoding these ideas so that you don’t have to re-invent the wheel (or apply the technique manually).

Below I’ll describe the rules and heuristics we apply to determine an ideal column ordering. But a list of rules sounds a lot like the definition of an algorithm. And that implies a problem space we can tackle at the systems, not people, level. Instead of sending a mass email to every engineer writing database DDL changes and expecting them to remember these rules, we authored a Ruby gem called pg_column_byte_packer to automate the solution in our development cycle. We'll talk more about that soon, but first let's take a more in-depth look at the problem space.

Data Alignment, Padding, and Waste

PostgreSQL’s heap storage, much like fields in C-language structs, writes columns guaranteeing alignment boundaries. For example, a column having 8-byte alignment is guaranteed to start at a byte index evenly divisible by 8 (zero-indexed). The heap storage engine automatically introduces any padding necessary to maintain this alignment.

We can introspect all kinds of system behavior and objects using PostgreSQL’s catalog tables, and alignment is no exception. Each datatype is listed in pg_catalog.pg_type, and you can determine the alignment required for any data type in the typalign column of that catalog table. PostgreSQL's documentation provides an excellent summary of how to interpret this column.

What Can We Do About It?

From a high-level perspective we can minimize the amount of space that will be lost to alignment padding by ordering each table’s columns in descending order of their data type’s alignment.

For example, suppose on our 64-bit system we have a table with two columns: a bigint column (which requires 8-byte alignment) and an integer column (which requires 4-byte alignment). If we put the integer column first we'll have the following data layout that takes up 16 bytes:

https://medium.com/media/fa91648efcd6be172826714faa976f42/href

However if we put the bigint column first our data layout will only take up 12 bytes:

https://medium.com/media/efb1858af43a1e152cc5ca9dab9bf793/href
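The arithmetic behind those two layouts can be reproduced with a few lines of Python. This is only a sketch: it models column data alone (no tuple header, null bitmap, or trailing padding), and the small type table is an illustrative assumption based on the typalign codes documented for pg_type ('c' = 1 byte, 's' = 2, 'i' = 4, 'd' = 8 on typical 64-bit builds).

```python
# Bytes of alignment implied by each pg_type.typalign code (typical 64-bit build).
TYPALIGN_BYTES = {"c": 1, "s": 2, "i": 4, "d": 8}

# (typalign, size in bytes) for a few fixed-width types; illustrative subset only.
TYPES = {
    "bigint": ("d", 8),
    "integer": ("i", 4),
    "smallint": ("s", 2),
    "boolean": ("c", 1),
}


def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def data_size(column_types):
    """Bytes occupied by the columns in the given order, padding included."""
    offset = 0
    for col_type in column_types:
        typalign, size = TYPES[col_type]
        offset = align_up(offset, TYPALIGN_BYTES[typalign])  # padding, if any
        offset += size
    return offset


print(data_size(["integer", "bigint"]))  # integer first: 4 bytes of padding
print(data_size(["bigint", "integer"]))  # bigint first: no padding
```

With the integer first, the bigint has to skip from offset 4 to offset 8, which is where the extra four bytes go.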

But there are a few other cases we want to handle at the same time:

Variable length data types like TEXT can have variable alignment requirements depending on their size. While we obviously can't look at the data when creating the table, we do want to take hints based on length constraints on the column, if any.
Binary (BYTEA data type) columns are similarly variable in length (and therefore in alignment); as a heuristic, we assume that binary data is usually "long".
NOT NULL columns are definitionally more likely to contain data than a random nullable column, so it makes sense to order them earlier (for faster unpacking during reads).
Columns with a DEFAULT are more likely to contain data (though slightly less so than NOT NULL), so it makes sense to order them earlier (also for faster unpacking during reads).
PRIMARY KEY columns not only always have data, but are also often the most frequently accessed columns (because they tend to be JOIN conditions), so we order them at the beginning of their alignment group.
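Taken together, the rules above amount to a sort key. The following Python sketch shows one way to express that ordering for columns whose alignment is already known. It is not the gem's actual implementation: the Column class and pack_order function are hypothetical names, and the handling of variable-length TEXT and BYTEA columns is left out.

```python
from dataclasses import dataclass


@dataclass
class Column:
    # Hypothetical description of a column; alignment is in bytes.
    name: str
    alignment: int
    primary_key: bool = False
    not_null: bool = False
    has_default: bool = False


def pack_order(columns):
    """Order columns by descending alignment, then PK, NOT NULL, DEFAULT."""
    return sorted(
        columns,
        key=lambda c: (
            -c.alignment,        # widest alignment first
            not c.primary_key,   # primary keys lead their alignment group
            not c.not_null,      # then NOT NULL columns
            not c.has_default,   # then columns with a DEFAULT
        ),
    )


columns = [
    Column("flag", 1, not_null=True),
    Column("amount", 4),
    Column("ref", 8),
    Column("status", 4, has_default=True),
    Column("created_at", 8, not_null=True),
    Column("id", 8, primary_key=True, not_null=True),
]
print([c.name for c in pack_order(columns)])
```

Because Python's sort is stable, columns that tie on every rule keep the order in which they were declared.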

Bundled Up in a Ruby Gem

I mentioned earlier that we’ve incorporated this set of column ordering rules into our recently open-sourced Ruby Gem pg_column_byte_packer. We implemented two complementary approaches to try to solve the problem holistically.

First, we automatically patch ActiveRecord’s migration code to re-order columns on-the-fly when all of those columns are included in a single create_table call (or safe_create_table, if you're using our pg_ha_migrations gem to maintain uptime guarantees when running migrations!).

Second, we provide an API to re-order the columns found in CREATE TABLE statements in a SQL file generated by PostgreSQL's pg_dump utility.

Beginning to use the tool in your applications now will immediately benefit new tables. But we didn’t want to stop there, because we have many existing large tables! Of course re-ordering existing columns meant we needed to re-write tables. So we created entirely new databases and applied schema files updated using the pg_dump SQL file modification feature described above. Finally we logically replicated all data to these new databases and transparently cut over from the old databases. This is also the foundation of how we achieve zero-downtime major version upgrades in PostgreSQL, but that's a topic for a future post!

We hope many of you will benefit from our work here. And we’d also love to see any ideas you might have for improving it.

PostgreSQL at Scale: Saving Space (Basically) for Free was originally published in The PayPal Technology Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

PostgreSQL at Scale: Database Schema Changes Without Downtime

James Coleman — Fri, 01 Feb 2019 23:01:28 GMT

Braintree Payments uses PostgreSQL as its primary datastore. We rely heavily on the data safety and consistency guarantees a traditional relational database offers us, but these guarantees come with certain operational difficulties. To make things even more interesting, we allow zero scheduled functional downtime for our main payments processing services.

Several years ago we published a blog post detailing some of the things we had learned about how to safely run DDL (data definition language) operations without interrupting our production API traffic.

Since that time PostgreSQL has gone through quite a few major upgrade cycles — several of which have added improved support for concurrent DDL. We’ve also further refined our processes. Given how much has changed, we figured it was time for a blog post redux.

In this post we’ll address the following topics:

Transactional DDL
Rollback Strategy
Locking
Table Operations
Column Operations
Index Operations
Constraints
Enum Types
Bonus: Library for Ruby on Rails

First, some basics

For all code and database changes, we require that:

Live code and schemas be forward-compatible with updated code and schemas: this allows us to roll out deploys gradually across a fleet of application servers and database clusters.
New code and schemas be backward-compatible with live code and schemas: this allows us to roll back any change to the previous version in the event of unexpected errors.

For all DDL operations we require that:

Any exclusive locks acquired on tables or indexes be held for at most ~2 seconds.
Rollback strategies do not involve reverting the database schema to its previous version.

Transactionality

PostgreSQL supports transactional DDL. In most cases, you can execute multiple DDL statements inside an explicit database transaction and take an “all or nothing” approach to a set of changes. However, running multiple DDL statements inside a transaction has one serious downside: if you alter multiple objects, you’ll need to acquire exclusive locks on all of those objects in a single transaction. Because locks on multiple tables create the possibility of deadlock and increase exposure to long waits, we do not combine multiple DDL statements into a single transaction. PostgreSQL will still execute each separate DDL statement transactionally; each statement will either be applied cleanly or fail and have its transaction rolled back.

Note: Concurrent index creation is a special case. Postgres disallows executing CREATE INDEX CONCURRENTLY inside an explicit transaction; instead Postgres itself manages the transactions. If for some reason the index build fails before completion, you may need to drop the index before retrying, though the index will still never be used for regular queries if it did not finish building successfully.

Rollback Strategy

Many database schema management tools assume that each schema change should include both a “forward” and “reverse” change definition. For example, a schema change that adds a table should be accompanied by the ability to undo that change (in this case by dropping the new table).

This strategy is appealing in development (e.g., “run one schema change, see if tests pass, iterate by reversing the change and trying again”), but we do not believe it’s appropriate for production deployments.

Although some schema changes are safe to reverse (e.g., dropping a just-created non-unique index), most possible schema changes are not safe to reverse. A few examples relative to database integrity:

Dropping a newly added column may result in data loss.
Re-adding a dropped unique index (or any other constraint) may fail because data may now exist that violates the constraint.
Dropping an enum value simply isn’t supported by Postgres (and wouldn’t be safe since it might be referenced by rows in the database).

These concerns are magnified if we also consider currently running application code (especially across multiple revisions, i.e., during a deploy). For example, the application may expect:

Indexes to be present for performant queries.
Constraints to hold.
Columns to be present.

For these reasons our deployment tooling and risk mitigation strategies do not involve reverting the database schema to its previous version. Consequently, we do not define “undo” operations in our schema management tooling. Instead we concentrate on ensuring that operations are safe to apply while both old and new revisions of an application are running. In the rare case where we need to “undo” a schema change, we roll forward, rather than rolling back, by having an engineer write a new schema change and deploying that change.

Locking

PostgreSQL has many different levels of locking. We’re concerned primarily with the following table-level locks since DDL generally operates at these levels:

ACCESS EXCLUSIVE: blocks all usage of the locked table.
SHARE ROW EXCLUSIVE: blocks concurrent DDL against, and row modification in, the locked table (while still allowing reads).
SHARE UPDATE EXCLUSIVE: blocks concurrent DDL against the locked table.

Note: “Concurrent DDL” for these purposes includes VACUUM and ANALYZE operations.
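To make the three descriptions above concrete, here is a small Python sketch of PostgreSQL's table-level lock conflict table, as documented in the explicit locking chapter of the manual. The helper name blocked_statements and the mapping of everyday statements to lock modes are illustrative simplifications, not an exhaustive list.

```python
LOCK_MODES = [
    "ACCESS SHARE", "ROW SHARE", "ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE",
    "SHARE", "SHARE ROW EXCLUSIVE", "EXCLUSIVE", "ACCESS EXCLUSIVE",
]

# For each lock mode, the set of modes it conflicts with.
CONFLICTS = {
    "ACCESS SHARE": {"ACCESS EXCLUSIVE"},
    "ROW SHARE": {"EXCLUSIVE", "ACCESS EXCLUSIVE"},
    "ROW EXCLUSIVE": {"SHARE", "SHARE ROW EXCLUSIVE", "EXCLUSIVE",
                      "ACCESS EXCLUSIVE"},
    "SHARE UPDATE EXCLUSIVE": {"SHARE UPDATE EXCLUSIVE", "SHARE",
                               "SHARE ROW EXCLUSIVE", "EXCLUSIVE",
                               "ACCESS EXCLUSIVE"},
    "SHARE": {"ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE",
              "SHARE ROW EXCLUSIVE", "EXCLUSIVE", "ACCESS EXCLUSIVE"},
    "SHARE ROW EXCLUSIVE": {"ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE", "SHARE",
                            "SHARE ROW EXCLUSIVE", "EXCLUSIVE",
                            "ACCESS EXCLUSIVE"},
    "EXCLUSIVE": {"ROW SHARE", "ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE",
                  "SHARE", "SHARE ROW EXCLUSIVE", "EXCLUSIVE",
                  "ACCESS EXCLUSIVE"},
    "ACCESS EXCLUSIVE": set(LOCK_MODES),
}

# Lock mode taken by some everyday statements (simplified).
STATEMENT_LOCKS = {
    "SELECT": "ACCESS SHARE",
    "INSERT/UPDATE/DELETE": "ROW EXCLUSIVE",
    "VACUUM/ANALYZE": "SHARE UPDATE EXCLUSIVE",
}


def blocked_statements(ddl_lock):
    """Statement kinds that must wait while ddl_lock is held on the table."""
    return [stmt for stmt, mode in STATEMENT_LOCKS.items()
            if mode in CONFLICTS[ddl_lock]]


for mode in ("ACCESS EXCLUSIVE", "SHARE ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE"):
    print(mode, "blocks", blocked_statements(mode))
```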

All DDL operations generally necessitate acquiring one of these locks on the object being manipulated. For example, when you run:

https://medium.com/media/b1ba327e98b89a84461c915a075c77aa/href

PostgreSQL attempts to acquire an ACCESS EXCLUSIVE lock on the table foos. Attempting to acquire this lock causes all subsequent queries on this table to queue until the lock is released. In practice your DDL operations can cause other queries to back up for as long as your longest running query takes to execute. Because arbitrarily long queueing of incoming queries is indistinguishable from an outage, we try to avoid any long-running queries in databases supporting our payments processing applications.

But sometimes a query takes longer than you expect. Or maybe you have a few special case queries that you already know will take a long time. PostgreSQL offers some additional runtime configuration options that allow us to guarantee query queueing backpressure doesn’t result in downtime.

Instead of relying on Postgres to lock an object when executing a DDL statement, we acquire the lock explicitly ourselves. This allows us to carefully control the time the queries may be queued. Additionally when we fail to acquire a lock within several seconds, we pause before trying again so that any queued queries can be executed without significantly increasing load. Finally, before we attempt lock acquisition, we query pg_locks¹ for any currently long running queries to avoid unnecessarily queueing queries for several seconds when it is unlikely that lock acquisition is going to succeed.

Starting with Postgres 9.3, you can adjust the lock_timeout parameter to control how long Postgres will allow for lock acquisition before returning without acquiring the lock. If you happen to be using 9.2 or earlier (and those are unsupported; you should upgrade!), then you can simulate this behavior by using the statement_timeout parameter around an explicit LOCK <table> statement.
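The explicit lock, timeout, and pause-then-retry behavior described above can be sketched in Python as follows. This is an illustration rather than our production tooling: execute and rollback are hypothetical callables standing in for a database driver, and the sketch assumes a lock timeout surfaces as Python's TimeoutError (a real driver raises its own exception type). Identifiers are interpolated without quoting, so the inputs are assumed to be trusted.

```python
import time


def locked_ddl(table, ddl, lock_timeout="2s"):
    """Statements for one attempt: take the lock explicitly, then run the DDL."""
    return [
        "BEGIN",
        f"SET LOCAL lock_timeout = '{lock_timeout}'",   # applies to this transaction only
        f"LOCK TABLE {table} IN ACCESS EXCLUSIVE MODE",  # fails fast if it must queue too long
        ddl,
        "COMMIT",
    ]


def run_with_retries(execute, rollback, table, ddl,
                     attempts=5, pause_seconds=10.0, sleep=time.sleep):
    """Try the locked DDL; on a lock timeout, roll back, pause, and retry.

    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, attempts + 1):
        try:
            for statement in locked_ddl(table, ddl):
                execute(statement)
            return attempt
        except TimeoutError:
            rollback()  # release our place in the lock queue
            if attempt < attempts:
                sleep(pause_seconds)  # let queued queries drain before retrying
    raise TimeoutError(f"could not lock {table} after {attempts} attempts")
```

Pausing between attempts matters as much as the timeout itself: each failed attempt briefly queues other queries behind it, so retrying immediately would just recreate the backlog.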

In many cases an ACCESS EXCLUSIVE lock need only be held for a very short period of time, i.e., the amount of time it takes Postgres to update its "catalog" (think metadata) tables. Below we'll discuss the cases where a lower lock level is sufficient or alternative approaches for avoiding long-held locks that block SELECT/INSERT/UPDATE/DELETE.

Note: Sometimes holding even an ACCESS EXCLUSIVE lock for something more than a catalog update (e.g., a full table scan or even rewrite) can be functionally acceptable when the table size is relatively small. We recommend testing your specific use case against realistic data sizes and hardware to see if a particular operation will be "fast enough". On good hardware with a table easily loaded into memory, a full table scan or rewrite for thousands (possibly even 100s of thousands) of rows may be "fast enough".

Table operations

Create table

In general, adding a table is one of the few operations we don’t have to think too hard about since, by definition, the object we’re “modifying” can’t possibly be in use yet. :D

While most of the attributes involved in creating a table do not involve other database objects, including a foreign key in your initial table definition will cause Postgres to acquire a SHARE ROW EXCLUSIVE lock against the referenced table blocking any concurrent DDL or row modifications. While this lock should be short-lived, it nonetheless requires the same caution as any other operation acquiring such a lock. We prefer to split these into two separate operations: create the table and then add the foreign key.

Drop table

Dropping a table requires an exclusive lock on that table. As long as the table isn’t in current use you can safely drop the table. Before allowing a DROP TABLE ... to make its way into our production environments we require documentation showing when all references to the table were removed from the codebase. To double check that this is the case you can query PostgreSQL's table statistics view pg_stat_user_tables² confirming that the returned statistics don't change over the course of a reasonable length of time.

Rename table

While it’s unsurprising that a table rename requires acquiring an ACCESS EXCLUSIVE lock on the table, that's far from our biggest concern. Unless the table is not being read from or written to, it's very unlikely that your application code could safely handle a table being renamed underneath it.

We avoid table renames almost entirely. But if a rename is an absolute must, then a safe approach might look something like the following:

Create a new table with the same schema as the old one.
Backfill the new table with a copy of the data in the old table.
Use INSERT and UPDATE triggers on the old table to maintain parity in the new table.
Begin using the new table.

Other approaches involving views and/or RULEs may also be viable depending on the performance characteristics required.

Column operations

Note: For column constraints (e.g., NOT NULL) or other constraints (e.g., EXCLUDE), see Constraints.

Add column

Adding a column to an existing table generally requires holding a short ACCESS EXCLUSIVE lock on the table while catalog tables are updated. But there are several potential gotchas:

Default values: Introducing a default value at the same time as adding the column will cause the table to be locked while the default value is propagated to all rows in the table. Instead, you should:

Add the new column (without the default value).
Set the default value on the column.
Backfill all existing rows separately.

Note: In the recently released PostgreSQL 11, this is no longer the case for non-volatile default values. Instead adding a new column with a default value only requires updating catalog tables, and any reads of rows without a value for the new column will magically have it “filled in” on the fly.
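For versions before PostgreSQL 11, the three steps above might look like the following Python sketch, which only generates the SQL. The function names are hypothetical, the backfill assumes an integer primary key named id, and identifiers are interpolated without quoting (trusted inputs only).

```python
def add_column_then_default(table, column, col_type, default_sql):
    """Steps 1 and 2: add the column bare, then attach the default."""
    return [
        f"ALTER TABLE {table} ADD COLUMN {column} {col_type}",
        f"ALTER TABLE {table} ALTER COLUMN {column} SET DEFAULT {default_sql}",
    ]


def backfill_batches(table, column, default_sql, min_id, max_id, batch_size=1000):
    """Step 3: one UPDATE per id range, so no single statement touches every row."""
    statements = []
    for start in range(min_id, max_id + 1, batch_size):
        end = min(start + batch_size - 1, max_id)
        statements.append(
            f"UPDATE {table} SET {column} = {default_sql} "
            f"WHERE id BETWEEN {start} AND {end} AND {column} IS NULL"
        )
    return statements


for sql in add_column_then_default("foos", "bar", "integer", "0"):
    print(sql)
for sql in backfill_batches("foos", "bar", "0", 1, 2500):
    print(sql)
```

Each batch runs in its own short transaction, so row locks are held briefly and the backfill can be paused or resumed at any batch boundary.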

Not-null constraints: Adding a column with a NOT NULL constraint is only possible if there are no existing rows or a DEFAULT is also provided. If there are no existing rows, then the change is effectively equivalent to a catalog only change. If there are existing rows and you are also specifying a default value, then the same caveats apply as above with respect to default values.

Note: Adding a column will cause all SELECT * FROM ... style queries referencing the table to begin returning the new column. It is important to ensure that all currently running code safely handles new columns. To avoid this gotcha in our applications we require queries to avoid * expansion in favor of explicit column references.

Change column type

In the general case changing a column’s type requires holding an exclusive lock on a table while the entire table is rewritten with the new type.

There are a few exceptions:

Changing VARCHAR to TEXT [9.1+] (more specifically: "when the old type is binary coercible to the new type and the using clause does not change the column contents").
“When the new type is an unconstrained domain over the old type” [9.1+].
When increasing or removing a length or precision limit, e.g., VARCHAR(5) to VARCHAR(10) and VARCHAR(5) to VARCHAR [9.2+].

Note: Even though some of the exceptions above were added in 9.1, in that release changing the type of an indexed column would always rewrite the index even if a table rewrite was avoided. As of 9.2, any column data type change that avoids a table rewrite also avoids rewriting the associated indexes as long as Postgres can verify that the logical sort order remains the same (for example, a collation change on a text column will still require rebuilding indexes). If you’d like to confirm that your change won’t rewrite the table or any indexes, you can query pg_class³ and verify the relfilenode column doesn't change.

If you need to change the type of a column and one of the above exceptions doesn’t apply, then the safe alternative is:

Add a new column new_<column>.
Dual write to both columns (e.g., with a BEFORE INSERT/UPDATE trigger).
Backfill the new column with a copy of the old column’s values.
Rename <column> to old_<column> and new_<column> to <column> inside a single transaction with an explicit LOCK <table> statement.
Drop the old column.
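As a rough illustration of the procedure above, the Python sketch below emits one possible sequence of statements. Several details are assumptions not spelled out in the steps: the trigger and function names, the explicit cast, and dropping the sync trigger inside the swap transaction (otherwise it would keep referring to a column name that no longer exists). In practice the backfill would also be batched rather than run as one UPDATE.

```python
def change_column_type(table, column, new_type):
    """SQL statements for swapping a column to a new type without a long lock."""
    new_col, old_col = f"new_{column}", f"old_{column}"
    sync = f"{table}_sync_{new_col}"  # hypothetical trigger/function name
    return [
        # 1. Add the new column.
        f"ALTER TABLE {table} ADD COLUMN {new_col} {new_type}",
        # 2. Dual write with a BEFORE INSERT/UPDATE trigger.
        f"CREATE FUNCTION {sync}() RETURNS trigger AS $$ "
        f"BEGIN NEW.{new_col} := NEW.{column}::{new_type}; RETURN NEW; END; "
        f"$$ LANGUAGE plpgsql",
        f"CREATE TRIGGER {sync} BEFORE INSERT OR UPDATE ON {table} "
        f"FOR EACH ROW EXECUTE PROCEDURE {sync}()",
        # 3. Backfill existing rows (batch this on a large table).
        f"UPDATE {table} SET {new_col} = {column}::{new_type} "
        f"WHERE {new_col} IS DISTINCT FROM {column}::{new_type}",
        # 4. Swap the names under one explicit, short-lived lock.
        "BEGIN",
        f"LOCK TABLE {table} IN ACCESS EXCLUSIVE MODE",
        f"DROP TRIGGER {sync} ON {table}",
        f"ALTER TABLE {table} RENAME COLUMN {column} TO {old_col}",
        f"ALTER TABLE {table} RENAME COLUMN {new_col} TO {column}",
        "COMMIT",
        f"DROP FUNCTION {sync}()",
        # 5. Drop the old column once nothing depends on it.
        f"ALTER TABLE {table} DROP COLUMN {old_col}",
    ]


for sql in change_column_type("foos", "bar", "bigint"):
    print(sql)
```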

Drop column

It goes without saying that dropping a column is something that should be done with great care. Dropping a column requires an exclusive lock on the table to update the catalog but does not rewrite the table. As long as the column isn’t in current use you can safely drop the column. It’s also important to confirm that the column is not referenced by any dependent objects that could be unsafe to drop. In particular, any indexes using the column should be dropped separately and safely with DROP INDEX CONCURRENTLY since otherwise they will be automatically dropped along with the column under an ACCESS EXCLUSIVE lock. You can query pg_depend⁴ for any dependent objects.

Before allowing an ALTER TABLE ... DROP COLUMN ... to make its way into our production environments we require documentation showing when all references to the column were removed from the codebase. This process allows us to safely roll back to the release prior to the one that dropped the column.

Note: Dropping a column will require that you update all views, triggers, functions, etc. that rely on that column.

Index operations

Create index

The standard form of CREATE INDEX ... acquires a SHARE lock against the table being indexed, blocking writes (though not reads) while it builds the index using a single table scan. In contrast, the form CREATE INDEX CONCURRENTLY ... acquires a SHARE UPDATE EXCLUSIVE lock but must complete two table scans (and hence is somewhat slower). This lower lock level allows writes as well as reads to continue against the table while the index is built.

Caveats:

Multiple concurrent index creations on a single table will not return from either CREATE INDEX CONCURRENTLY ... statement until the slowest one completes. In addition, prior to Postgres 14 the various phases of building the index concurrently each wait on transactions held open by other such concurrent operations.
CREATE INDEX CONCURRENTLY ... may not be executed inside of a transaction but does maintain transactions internally. Prior to Postgres 14 this always involved holding open a transaction preventing auto-vacuums (against any table in the system) from cleaning up dead tuples introduced after the index build began until the index operation returned. If you have a table with a large volume of updates (particularly bad if to a very small table) this could result in extremely sub-optimal query execution. In Postgres 14+ VACUUM is able to ignore concurrent index operations on other tables so long as they contain only bare columns (no expressions) and are not partial indexes.
CREATE INDEX CONCURRENTLY ... must wait for all transactions using the table to complete before returning.
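Because a failed concurrent build leaves an invalid index behind (see the note under Transactionality), a retry usually needs to drop that leftover first. The Python sketch below generates the statements involved; the function name is hypothetical and identifiers are interpolated without quoting. The catalog query relies on the indisvalid column of pg_index.

```python
# Lists indexes left invalid, e.g., by a failed CREATE INDEX CONCURRENTLY.
INVALID_INDEXES_SQL = (
    "SELECT indexrelid::regclass AS index_name "
    "FROM pg_index WHERE NOT indisvalid"
)


def create_index_concurrently(index, table, columns, retry=False):
    """Statements to build an index without blocking writes.

    Each statement must run outside an explicit transaction block.
    """
    statements = []
    if retry:
        # Remove the invalid leftover from an earlier failed attempt.
        statements.append(f"DROP INDEX CONCURRENTLY IF EXISTS {index}")
    column_list = ", ".join(columns)
    statements.append(
        f"CREATE INDEX CONCURRENTLY {index} ON {table} ({column_list})"
    )
    return statements


print(INVALID_INDEXES_SQL)
for sql in create_index_concurrently("idx_foos_bar", "foos", ["bar"], retry=True):
    print(sql)
```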

Drop index

The standard form of DROP INDEX ... acquires an ACCESS EXCLUSIVE lock against the table with the index while removing the index. For small indexes this may be a short operation. For large indexes, however, file system unlinking and disk flushing can take a significant amount of time. In contrast, the form DROP INDEX CONCURRENTLY ... acquires a SHARE UPDATE EXCLUSIVE lock to perform these operations allowing reads and writes to continue against the table while the index is dropped.

Caveats:

DROP INDEX CONCURRENTLY ... cannot be used to drop any index that supports a constraint (e.g., PRIMARY KEY or UNIQUE).
DROP INDEX CONCURRENTLY ... may not be executed inside of a transaction but does maintain transactions internally. Prior to Postgres 14 this always involved holding open a transaction preventing auto-vacuums (against any table in the system) from cleaning up dead tuples introduced after the index build began until the index operation returned. If you have a table with a large volume of updates (particularly bad if to a very small table) this could result in extremely sub-optimal query execution. In Postgres 14+ VACUUM is able to ignore concurrent index operations on other tables so long as they contain only bare columns (no expressions) and are not partial indexes.
DROP INDEX CONCURRENTLY ... must wait for all transactions using the table to complete before returning.

Note: DROP INDEX CONCURRENTLY ... was added in Postgres 9.2. If you're still running 9.1 or prior, you can achieve somewhat similar results by marking the index as invalid and not ready for writes, flushing buffers with the pgfincore extension, and then dropping the index.

Rename index

ALTER INDEX ... RENAME TO ... requires an ACCESS EXCLUSIVE lock on the index, blocking reads from and writes to the underlying table. However, as of Postgres 12 that requirement is lowered to SHARE UPDATE EXCLUSIVE.

Reindex

REINDEX INDEX ... requires an ACCESS EXCLUSIVE lock on the index blocking reads from and writes to the underlying table. Instead we use the following procedure:

Create a new index concurrently that duplicates the existing index definition.
Drop the old index concurrently.
Rename the new index to match the original index’s name.

However, in Postgres 12+ CONCURRENTLY support has been added to REINDEX, which essentially implements the procedure outlined above as a single command. The same caveats apply as for the CONCURRENTLY variants of CREATE INDEX and DROP INDEX.
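Both routes can be written down as short statement lists. In this Python sketch the function names and the _new suffix for the temporary index are hypothetical choices, and the index definition is reduced to a column list for brevity (a real rebuild must duplicate the full definition, including any WHERE clause or operator classes).

```python
def rebuild_index_manually(index, table, columns, unique=False):
    """Pre-12 approach: build a duplicate, drop the original, rename."""
    temp = f"{index}_new"  # hypothetical temporary name
    unique_sql = "UNIQUE " if unique else ""
    column_list = ", ".join(columns)
    return [
        f"CREATE {unique_sql}INDEX CONCURRENTLY {temp} ON {table} ({column_list})",
        f"DROP INDEX CONCURRENTLY {index}",
        f"ALTER INDEX {temp} RENAME TO {index}",
    ]


def rebuild_index(index, table, columns, postgres_major, unique=False):
    """Use REINDEX ... CONCURRENTLY where available, else the manual steps."""
    if postgres_major >= 12:
        return [f"REINDEX INDEX CONCURRENTLY {index}"]
    return rebuild_index_manually(index, table, columns, unique)


for sql in rebuild_index("idx_foos_bar", "foos", ["bar"], postgres_major=11):
    print(sql)
```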

Note: If the index you need to rebuild backs a constraint, remember to re-add the constraint as well (subject to all of the caveats we’ve documented.)

Constraints

NOT NULL Constraints

Removing an existing not-null constraint from a column requires an exclusive lock on the table while a simple catalog update is performed.

In contrast, adding a not-null constraint to an existing column requires an exclusive lock on the table while a full table scan verifies that no null values exist. Instead you should:

Add a CHECK constraint requiring the column be not-null with ALTER TABLE <table> ADD CONSTRAINT <name> CHECK (<column> IS NOT NULL) NOT VALID;. The NOT VALID tells Postgres that it doesn't need to scan the entire table to verify that all rows satisfy the condition.
Manually verify that all rows have non-null values in your column.
Validate the constraint with ALTER TABLE <table> VALIDATE CONSTRAINT <name>;. With this statement PostgreSQL will block acquisition of other EXCLUSIVE locks for the table, but will not block reads or writes.
On Postgres 12 and later, additionally proceed to ALTER TABLE <table> ALTER COLUMN <column> SET NOT NULL. According to the docs “if a valid CHECK constraint is found which proves no NULL can exist, then the table scan is skipped”.
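The steps above translate into a short sequence of statements. The Python sketch below only generates SQL; the function name and the constraint naming scheme are hypothetical, and identifiers are interpolated without quoting.

```python
def add_not_null(table, column, postgres_major):
    """Statements for enforcing NOT NULL without a long ACCESS EXCLUSIVE lock."""
    constraint = f"{table}_{column}_not_null"  # hypothetical naming scheme
    statements = [
        # Enforced for new writes immediately; existing rows are not scanned.
        f"ALTER TABLE {table} ADD CONSTRAINT {constraint} "
        f"CHECK ({column} IS NOT NULL) NOT VALID",
        # Manual check: this should return no rows before validating.
        f"SELECT 1 FROM {table} WHERE {column} IS NULL LIMIT 1",
        # Scans the table without blocking reads or writes.
        f"ALTER TABLE {table} VALIDATE CONSTRAINT {constraint}",
    ]
    if postgres_major >= 12:
        # The validated CHECK constraint lets Postgres skip the table scan.
        statements.append(f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL")
    return statements


for sql in add_not_null("foos", "bar", postgres_major=12):
    print(sql)
```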

Foreign keys

ALTER TABLE ... ADD FOREIGN KEY requires a SHARE ROW EXCLUSIVE lock (as of 9.5) on both the altered and referenced tables. While this won't block SELECT queries, blocking row modification operations for a long period of time is equally unacceptable for our transaction processing applications.

To avoid that long-held lock you can use the following process:

ALTER TABLE ... ADD FOREIGN KEY ... NOT VALID: Adds the foreign key and begins enforcing the constraint for all new INSERT/UPDATE statements but does not validate that all existing rows conform to the new constraint. This operation still requires SHARE ROW EXCLUSIVE locks, but the locks are only briefly held.
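ALTER TABLE ... VALIDATE CONSTRAINT <constraint>: This operation checks all existing rows to verify they conform to the specified constraint. Validation requires a SHARE UPDATE EXCLUSIVE lock on the altered table, so it may run concurrently with row reading and modification queries. A ROW SHARE lock is also held on the referenced table, which will block any operations requiring exclusive locks on that table while the constraint is validated.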
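As a sketch of that two-step process, the Python function below generates both statements. The function name and the constraint naming scheme are hypothetical, the referenced column defaults to id as an assumption, and identifiers are interpolated without quoting.

```python
def add_foreign_key(table, column, ref_table, ref_column="id"):
    """Two statements: add the key unvalidated, then validate it separately."""
    constraint = f"{table}_{column}_fkey"  # hypothetical naming scheme
    return [
        # Brief SHARE ROW EXCLUSIVE locks; new writes are checked from here on.
        f"ALTER TABLE {table} ADD CONSTRAINT {constraint} "
        f"FOREIGN KEY ({column}) REFERENCES {ref_table} ({ref_column}) NOT VALID",
        # Checks existing rows while reads and writes continue.
        f"ALTER TABLE {table} VALIDATE CONSTRAINT {constraint}",
    ]


for sql in add_foreign_key("orders", "customer_id", "customers"):
    print(sql)
```

Running the two statements as separate transactions is the point: the slow part (validation) then happens under the weaker lock.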

Check constraints

ALTER TABLE ... ADD CONSTRAINT ... CHECK (...) requires an ACCESS EXCLUSIVE lock. However, as with foreign keys, Postgres supports breaking the operation into two steps:

ALTER TABLE ... ADD CONSTRAINT ... CHECK (...) NOT VALID: Adds the check constraint and begins enforcing it for all new INSERT/UPDATE statements but does not validate that all existing rows conform to the new constraint. This operation still requires an ACCESS EXCLUSIVE lock.
ALTER TABLE ... VALIDATE CONSTRAINT <constraint>: This operation checks all existing rows to verify they conform to the specified constraint. Validation requires a SHARE UPDATE EXCLUSIVE lock on the altered table, so it may run concurrently with row reading and modification queries. Because a check constraint has no referenced table, no second table is locked (in contrast to validating a foreign key, which also holds a ROW SHARE lock on the referenced table).

Uniqueness constraints

ALTER TABLE ... ADD CONSTRAINT ... UNIQUE (...) requires an ACCESS EXCLUSIVE lock. However, Postgres supports breaking the operation into two steps:

Create a unique index concurrently. This step will immediately enforce uniqueness, but if you need a declared constraint (or a primary key), then continue to add the constraint separately.
Add the constraint using the already existing index with ALTER TABLE ... ADD CONSTRAINT ... UNIQUE USING INDEX <index>. Adding the constraint still requires an ACCESS EXCLUSIVE lock, but the lock will only be held for fast catalog operations.

Note: If you specify PRIMARY KEY instead of UNIQUE then any nullable columns in the index will be made NOT NULL. This requires a full table scan which currently can't be avoided. See NOT NULL Constraints for more details.
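The two steps above can be sketched as generated SQL. In this Python example the function name and the default index naming scheme are hypothetical, and identifiers are interpolated without quoting; the constraint is given the same name as the index it adopts.

```python
def add_unique_constraint(table, columns, name=None, primary_key=False):
    """Build the unique index concurrently, then attach it as a constraint."""
    index = name or f"{table}_{'_'.join(columns)}_key"  # hypothetical naming scheme
    kind = "PRIMARY KEY" if primary_key else "UNIQUE"
    column_list = ", ".join(columns)
    return [
        # Enforces uniqueness as soon as the build completes.
        f"CREATE UNIQUE INDEX CONCURRENTLY {index} ON {table} ({column_list})",
        # Catalog-only change under a brief ACCESS EXCLUSIVE lock
        # (plus a table scan for nullable columns when adding a PRIMARY KEY).
        f"ALTER TABLE {table} ADD CONSTRAINT {index} {kind} USING INDEX {index}",
    ]


for sql in add_unique_constraint("foos", ["account_id", "external_ref"]):
    print(sql)
```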

Exclusion constraints

ALTER TABLE ... ADD CONSTRAINT ... EXCLUDE USING ... requires an ACCESS EXCLUSIVE lock. Adding an exclusion constraint builds the supporting index, and, unfortunately, there is currently no support for using an existing index (as you can do with a unique constraint).

Enum Types

CREATE TYPE <name> AS ENUM (...) and DROP TYPE <name> (after verifying there are no existing usages in the database) can both be done safely without unexpected locking.

Modifying enum values

ALTER TYPE <enum> RENAME VALUE <old> TO <new> was added in Postgres 10. This statement does not require locking tables which use the enum type.

Deleting enum values

Because enums are stored internally as integers and there is no support for gaps in the valid range, removing a value would currently require shifting values and rewriting all rows using those values. PostgreSQL does not currently support removing values from an existing enum type.

Announcing Pg_ha_migrations for Ruby on Rails

We’re also excited to announce that we have open-sourced our internal library pg_ha_migrations. This Ruby gem enforces DDL safety in projects using Ruby on Rails and/or ActiveRecord with an emphasis on explicitly choosing trade-offs and avoiding unnecessary magic (and the corresponding surprises). You can read more in the project’s README.

Footnotes

[1] You can find active long-running queries and the tables they lock with the following query:

https://medium.com/media/90ee0e73f1d666273b12ad4df0ce2dfd/href

[2] You can see PostgreSQL’s internal statistics about table accesses with the following query:

https://medium.com/media/571d5801ea7602f91f4c6cf98361de39/href

[3] You can see if DDL causes a relation to be rewritten by seeing if the relfilenode value changes after running the statement:

https://medium.com/media/4526240441c8b45733d22b7b04326d48/href

[4] You can find objects (e.g., indexes) that depend on a specific column by running the statement:

https://medium.com/media/5dfc04bc00958bcd9a44f3568c73a49c/href

PostgreSQL at Scale: Database Schema Changes Without Downtime was originally published in The PayPal Technology Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.