Zero Downtime Data Migration using PostgreSQL Logical Replication

by Nikolay Nikolov

February 23, 2024

Embarking on the journey of upgrading or migrating your database can often be challenging, especially when faced with the task of minimizing downtime. When it comes to user data management, the need for zero downtime is most important. This is where PostgreSQL’s Logical Replication offers a robust solution for those seeking a seamless and efficient data migration strategy. In this blog post, we delve into the strategies and benefits that make it a game-changer in the context of database transitions. Join me as we explore how logical replication can turn the complicated process of data migration into a smooth and predictable experience.

Concept

Logical replication follows a publish-subscribe model. Information is transferred by decoding the Write-Ahead Log (WAL) of the publisher, which describes changes on a storage level, into an application-specific form such as a stream of tuples that is then applied at the subscriber. This process begins with a snapshot of existing data, which can take a few minutes for larger tables. During that time new changes are continuously recorded in the WAL. When the initial snapshot for a table is complete, the accumulated changes are applied on top. Then the subscriber will start receiving real-time changes and keep the tables in sync. 

Logical replication works in a deterministic way: you explicitly specify which data will be replicated and under what conditions, which gives you granular control and flexibility. As a drawback, it requires a matching table structure on both ends and does not transfer schema changes or certain objects (e.g. sequence values). Any change to the structure of replicated tables on the originating node will stop an ongoing replication. This makes it most useful for short periods of time, like migration windows.

As an alternative, PostgreSQL offers physical replication. It copies both structural and data changes, but it doesn't provide any granularity: you replicate the whole database cluster. Although it can be more performant than logical replication, its main strength lies in maintaining high-availability replicas for fault tolerance. For upgrading or migrating a database, logical replication is usually the more intuitive choice, especially in a microservice architecture, where a cluster may hold multiple schemas attached to different services. It is far easier to migrate a single service and schema at a time than to move everything at once and risk user downtime if something goes wrong.

During our custom development work for a particular client, we leveraged logical replication to do exactly that. We replicated data to a new environment and version of PostgreSQL. We did this with zero user downtime by copying a single schema (or sometimes set of tables) at a time to break down the migration into smaller scopes and make it easier to manage. We required flexibility and selectivity, which was the biggest reason we went with logical over physical replication. With all that said, let’s get into logical replication in practice and why it might be the solution to your next data migration problem.

In practice

We will demonstrate each of the various replication concepts we talk about. Feel free to skip over the technical details and focus on the snippets and diagrams if you prefer a high-level overview.

The demo will involve replicating a few tables without too much complexity. However, we will cover the minimum requirements and possible caveats one might face in the process. I've prepared two Docker containers for our publisher and subscriber. I will use the postgres superuser to change properties and alter permissions, after which I will continue with a pub and a sub role respectively. Let's begin.

Schema

This is the schema I'll be using for the demo, set in the context of coffee delivery.

CREATE TABLE farm (
name varchar(255) PRIMARY KEY, 
country varchar(100)
);

CREATE TABLE roaster (
name varchar(255) PRIMARY KEY, 
country varchar(100)
);

CREATE TABLE delivery (
	farm varchar(255) REFERENCES farm (name),
	roaster varchar(255) REFERENCES roaster (name),
	quantityInKilos integer,
	sentOn timestamp,
	deliveryDate timestamp
);

I'll fill the tables with data. The exact generation scripts don't matter for the demo; a minimal sketch along these lines (table sizes and values are illustrative) will do:

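-- Illustrative data generation; note delivery deliberately has no primary key
INSERT INTO farm (name, country)
SELECT 'Farm ' || i, 'Kenya'
FROM generate_series(1, 100) AS i;

INSERT INTO roaster (name, country)
SELECT 'Roaster ' || i, 'Bulgaria'
FROM generate_series(1, 20) AS i;

INSERT INTO delivery (farm, roaster, quantityInKilos, sentOn, deliveryDate)
SELECT 'Farm ' || (1 + floor(random() * 100))::int,
       'Roaster ' || (1 + floor(random() * 20))::int,
       (1 + floor(random() * 10))::int,
       now() - (i || ' days')::interval,
       now() - (i || ' days')::interval + interval '3 days'
FROM generate_series(1, 1000) AS i;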

Pre-requisites

There are a few mandatory requirements, regardless of setup. First we will set the minimum required configuration properties for both nodes. You can edit these via a text editor of your choice. Changes to most of these will require a database restart, so make sure to configure all of them together to avoid multiple restarts. 

publisher

# Hint: find the configuration file using
# SHOW config_file;

wal_level = logical
max_replication_slots = 10
max_wal_senders = 10
max_worker_processes = 15

Logical replication is based on WAL streaming, which we won't get into in this post, but it requires setting wal_level to logical. Then we need available replication slots, so set max_replication_slots to a number sufficient for one apply worker plus multiple tablesync workers. The default of 10 is enough for our needs, so we will keep it. max_wal_senders must be equal to or higher than the number of slots, plus any existing physical replicas. We don't have any replicas, so we will keep it equal to the slots. Finally, max_worker_processes should be set higher than that, and it should match the value on the subscriber.
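
If you prefer staying inside psql, the same settings can be written with ALTER SYSTEM instead of editing the file by hand (a sketch; changing wal_level still requires a restart to take effect):

ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET max_wal_senders = 10;
ALTER SYSTEM SET max_worker_processes = 15;

-- verify after the restart
SHOW wal_level;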

subscriber

# Hint: find the configuration file using
# SHOW config_file;

wal_level = logical
max_replication_slots = 10
max_wal_senders = 10
max_worker_processes = 15
max_logical_replication_workers = 4

The subscriber requires the same wal_level. On the subscriber, max_replication_slots has a different meaning: it caps the number of active subscriptions plus any in-progress tablesync operations. We will keep the default of 10. We will also keep max_wal_senders and max_worker_processes equal to or higher than on the publisher. Finally, we will keep max_logical_replication_workers at its default value of 4. This will be enough, but note that these workers are taken from the pool of max_worker_processes, so increasing them mandates a larger pool.

There are other properties, like max_sync_workers_per_subscription, whose name is self-explanatory, but this is enough as a starting point to illustrate the concept. For more details, check out the official PostgreSQL documentation. Let's move on to the other functional requirements.

user

To use replication, a user requires replication and superuser permissions. The userepl and usesuper fields are responsible for this and should be set to t (true). You can see and alter these settings with the following statements:

-- Hint: toggle extended display in psql with \x
-- See permissions
SELECT * FROM pg_user WHERE usename = 'pub';
-[ RECORD 1 ]+---------
usename      | pub
usesysid     | 16388
usecreatedb  | f
usesuper     | t
userepl      | t
usebypassrls | f
passwd       | ********
valuntil     |
useconfig    |

-- Alter using the USER or ROLE keyword; they are identical in function
ALTER USER pub REPLICATION SUPERUSER;
ALTER ROLE pub REPLICATION SUPERUSER;

-- Revert later with the NO- versions of the keywords
ALTER USER pub NOREPLICATION NOSUPERUSER;
ALTER ROLE pub NOREPLICATION NOSUPERUSER;

schema

Now we need to make sure the objects we replicate exist on the subscriber. Extra columns or tables on the subscriber won't cause conflicts, but if some are missing, replication will fail as soon as the publisher attempts to send them across, with errors like these:

-- replicating a column that doesn't exist on the subscriber
2024-02-04 12:00:00.000 UTC [340] LOG:  logical replication table synchronization worker for subscription "sub", table "farm" has started
2024-02-04 12:00:00.000 UTC [340] ERROR:  logical replication target relation "public.farm" is missing replicated column: "country"
2024-02-04 12:00:00.000 UTC [1] LOG:  background worker "logical replication worker" (PID 340) exited with exit code 1

data

Your subscriber's tables should be empty or come with a guarantee of no conflicts for the replicated objects. Replication becomes more complicated if there is a risk of conflicts, and Postgres doesn't provide automated mechanisms to resolve them during replication. We won't cover this in more detail, but you can observe conflicts in the logs and read more about approaches to resolving them in the official documentation.
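
As a taste of what resolution looks like: when a conflict stops the apply worker, the error in the subscriber's logs reports the finish LSN of the offending transaction. On PostgreSQL 15 or later, one option is to skip that transaction entirely, using the subscription we create later in this post (the LSN below is illustrative):

-- The subscriber's logs will show something like:
--   ERROR: duplicate key value violates unique constraint "farm_pkey"
-- with a context line containing the transaction's finish LSN
ALTER SUBSCRIPTION sub SKIP (lsn = '0/14C0378');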

Publisher

The publisher is a PostgreSQL node that generates a stream of changes, which can be configured to include any combination of INSERT, UPDATE, DELETE and TRUNCATE statements. The publisher is where we specify which tables will be replicated. Publishers can have multiple subscribers, allowing for increased flexibility. 

Let’s start our replication. We will create a publication for all tables. Specifying all operations is the default option, but let’s be verbose and show everything that’s happening.

CREATE PUBLICATION pub 
FOR ALL TABLES 
WITH (publish = 'insert, update, delete, truncate');
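
As an aside, publications are where the granularity mentioned earlier comes from. Here is a sketch of a more selective publication, not used in this demo. Row filters require PostgreSQL 15 or later, and when updates or deletes are published the filter may only reference replica identity columns, so this one publishes inserts only:

-- replicate only one table, and only rows matching a filter
CREATE PUBLICATION pub_selective
FOR TABLE farm WHERE (country = 'Kenya')
WITH (publish = 'insert');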

Replica Identity

Each table has a so-called replica identity: the mechanism used to identify which rows change on modifying operations (UPDATE and DELETE). The default is the primary key, but you can also choose a unique index over non-null columns, the full row, or nothing. You will see why this is important a little later in the demo.
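
For reference, a quick sketch of how to inspect and change a table's replica identity:

-- relreplident: d = default (primary key), i = index, f = full, n = nothing
SELECT relname, relreplident FROM pg_class WHERE relname = 'delivery';

ALTER TABLE delivery REPLICA IDENTITY FULL;    -- log the whole old row; expensive on big tables
ALTER TABLE delivery REPLICA IDENTITY DEFAULT; -- use the primary key (if any)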

Replication Slots

Replication slots play a crucial role in logical replication by acting as the attachment point between the publisher and subscriber. A replication slot keeps track of how far a subscriber is lagging behind the publisher: it records the last point in the transaction log that the subscriber has acknowledged. This allows synchronization to start when a subscriber connects for the first time, and if a subscriber loses its connection, it can resume replication in a fault-tolerant way when it comes back online.

Replication slots will be automatically created, managed and deleted by a subscriber. There will be a minimum of one slot, used by the apply worker, plus one more for each active tablesync worker. The maximum number is capped by the max_replication_slots property in the configuration.

In order to show this explicitly, we will manually create a replication slot, instead of relying on the subscriber to do so.

SELECT * FROM pg_create_logical_replication_slot('slot', 'pgoutput');
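
You can confirm the slot exists and is not yet active with a quick look at pg_replication_slots:

SELECT slot_name, plugin, slot_type, active
FROM pg_replication_slots;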

Subscriber

Lastly, we will create a subscriber. The subscriber is a PostgreSQL node that subscribes to changes made by the publisher. It receives and applies changes to maintain synchronization. 

CREATE SUBSCRIPTION sub
CONNECTION 
  'host=localhost 
   port=5432 
   dbname=postgres 
   user=pub 
   password=pub 
   connect_timeout=10'
PUBLICATION pub
WITH (enabled=true, copy_data=true, create_slot=false, slot_name='slot');

Change your connection details accordingly. By default, the subscriber would create its own slot on the publisher and use it. That is not what we want here, so we use the additional parameters to connect to our custom slot instead. We will leave the enabled setting at its default value of true, meaning the subscription starts as soon as you execute the command.
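
A subscription remains manageable after creation. A few commands worth knowing (a sketch):

-- pause and resume applying changes
ALTER SUBSCRIPTION sub DISABLE;
ALTER SUBSCRIPTION sub ENABLE;

-- after adding tables to the publication, pick them up on the subscriber
ALTER SUBSCRIPTION sub REFRESH PUBLICATION;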

Subscription Workers

There are two types of background processes, which receive and apply the data. They are referred to as apply and tablesync workers. 

Apply workers are long-lived and will remain until the subscription is dropped. They are responsible for reading the replication stream and applying changes once a table is synchronized. During the initial sync, they have the additional responsibility of creating and delegating work to tablesync workers. They are capable of handling transactions, so they ensure changes are made in the same order they were committed at the publisher. 

Tablesync workers are transient. They are only used during the initial sync, where their primary task is to copy existing data. After that they pass responsibility back to the main apply worker and exit. They operate on a per-table basis, so only a fixed number of tables can be copied in parallel (capped by max_sync_workers_per_subscription); the rest wait until a worker frees up.
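
You can watch the per-table progress of the initial sync on the subscriber via pg_subscription_rel:

-- srsubstate: i = initialize, d = data copy, f = finished copy,
--             s = synchronized, r = ready (normal streaming)
SELECT srrelid::regclass AS table_name, srsubstate
FROM pg_subscription_rel;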

Monitoring & Debugging

Now that the subscription is up and running, let’s see how we can observe what happened during the process.

Logs

With the small amount of data we have, replication will finish almost instantly. You can confirm this in the logs, where we can observe the activity of the workers. They show that the initial synchronization is complete, meaning we are in sync with the publisher.

2024-02-04 12:00:00.000  UTC [206] LOG:  logical replication apply worker for subscription "sub" has started
2024-02-04 12:00:00.000 UTC [207] LOG:  logical replication table synchronization worker for subscription "sub", table "delivery" has started
2024-02-04 12:00:00.000 UTC [208] LOG:  logical replication table synchronization worker for subscription "sub", table "farm" has started
2024-02-04 12:00:00.000 UTC [208] LOG:  logical replication table synchronization worker for subscription "sub", table "farm" has finished
2024-02-04 12:00:00.000 UTC [209] LOG:  logical replication table synchronization worker for subscription "sub", table "roaster" has started
2024-02-04 12:00:00.000 UTC [209] LOG:  logical replication table synchronization worker for subscription "sub", table "roaster" has finished
2024-02-04 12:00:00.000 UTC [207] LOG:  logical replication table synchronization worker for subscription "sub", table "delivery" has finished

Let’s test this out by updating some data to see real-time changes. 

INSERT INTO farm VALUES ('Another Kenya Farm', 'Not Kenya');
UPDATE farm SET country = 'Kenya' WHERE country ILIKE '%not kenya%';

SELECT * FROM farm; -- check results

Here you won’t see anything in the logs, but if you run a select statement on the subscriber you can find the new data. Let’s try something else.

UPDATE delivery SET quantityinkilos = 2 WHERE quantityinkilos = 1;

-- result
ERROR:  cannot update table "delivery" because it does not have a replica identity and publishes updates
HINT:  To enable updating the table, set REPLICA IDENTITY using ALTER TABLE.

Now you will probably see output like the above in the console. This is where replica identity comes into play. Since we never set a primary key on the delivery table, there is no way for replication to identify which rows are changing and send that information to the subscriber. As a quick fix we could set the replica identity to FULL, but this is our largest table and FULL means logging the entire old row for every change. For tables with millions of rows, this can be seriously detrimental to performance.

Let's resolve this problem by creating a primary key. We will add a new column to serve as an identifier. Do this on both sides, then try the query again. You won't need to change anything else, since the table is already part of an existing replication. You should get output like the last line below, showing the number of updated rows. With this step we have covered the most important concepts and roadblocks you can encounter. Let's wrap up with additional ways of monitoring and how to clean up.

-- execute on both sides
ALTER TABLE delivery ADD COLUMN identifier SERIAL;
ALTER TABLE delivery ADD PRIMARY KEY (identifier);

-- then try the update on the publisher again
UPDATE delivery SET quantityinkilos = 2 WHERE quantityinkilos = 1;
UPDATE 50

Internal tables

Another way to monitor and troubleshoot problems is via the internal Postgres monitoring views, most notably pg_stat_replication, pg_stat_subscription and pg_replication_slots. They show the state of replication, as well as the time and WAL position (LSN) of the last update received and confirmed on each side. You can test this by updating data: each replicated change should advance the timestamps, while with nothing to replicate they only update on a set keepalive interval.

SELECT * FROM pg_stat_replication;

-[ RECORD 1 ]----+------------------------------
pid              | 507
usesysid         | 16388
usename          | pub
application_name | sub
client_addr      | 172.20.0.3
client_hostname  |
client_port      | 44956
backend_start    | 2024-02-04 19:37:54.861947+00
backend_xmin     |
state            | streaming
sent_lsn         | 0/C3F4D28
write_lsn        | 0/C3F4D28
flush_lsn        | 0/C3F4D28
replay_lsn       | 0/C3F4D28
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async
reply_time       | 2024-02-04 20:58:47.002852+00
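
Lag can also be expressed in bytes rather than timestamps. A quick sketch, run on the publisher:

-- how far behind the subscriber's replay position is, in bytes
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;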


SELECT * FROM pg_stat_subscription;

-[ RECORD 1 ]---------+------------------------------
subid                 | 24609
subname               | sub
pid                   | 463
leader_pid            | 
relid                 | 
received_lsn          | 0/C3F4D28
last_msg_send_time    | 2024-02-04 20:58:36.989652+00
last_msg_receipt_time | 2024-02-04 20:58:36.989871+00
latest_end_lsn        | 0/C3F4D28
latest_end_time       | 2024-02-04 20:58:36.989652+00

You can also observe the state of the replication slots. There will always be one for the apply worker, but during a long-running synchronization phase you will find additional rows for tablesync workers. This is something we actively used to double-check when our larger tables had been completely synced, since the logs scrolled too fast to check the start and end statements reliably. Once there are no more tablesync slots and only the apply one is active, you can be sure that the synchronization is complete.

SELECT * FROM pg_replication_slots;

-[ RECORD 1 ]-------+----------
slot_name           | slot
plugin              | pgoutput
slot_type           | logical
datoid              | 5
database            | postgres
temporary           | f
active              | t
active_pid          | 507
xmin                |
catalog_xmin        | 811
restart_lsn         | 0/C3F4CF0
confirmed_flush_lsn | 0/C3F4D28
wal_status          | reserved
safe_wal_size       |
two_phase           | f
conflicting         | f

Cleanup

If your use case does not involve a long-running replication, you should clean up when complete. For example, you don't want to keep wal_level at logical when you are not using it, because of the overhead it adds to WAL generation.

Go through this checklist and cherry-pick what you need. If you are migrating to a new instance and are dropping your whole publisher anyway, you can skip the steps related to it.

-- Drop the subscription
-- (also drops the replication slot automatically, unless the connection has been broken)
DROP SUBSCRIPTION sub;

-- Drop the publication
DROP PUBLICATION pub;
SELECT pg_drop_replication_slot('slot'); -- only if the slot wasn't dropped above

-- Remove permissions from users
ALTER USER pub NOREPLICATION NOSUPERUSER;
ALTER USER sub NOREPLICATION NOSUPERUSER;

-- Revert temporary replica identity changes (optional; DEFAULT means the primary key)
ALTER TABLE delivery REPLICA IDENTITY DEFAULT;

-- Finally, edit the configuration settings again (e.g. revert wal_level)
-- SHOW config_file;
-- e.g. /var/lib/postgresql/data/postgresql.conf

Conclusion

Let's wrap up. PostgreSQL logical replication stands out as a robust mechanism that enables zero-downtime migrations and fine-grained control over the replication process, along with tools for monitoring and debugging potential issues. Additionally, the PostgreSQL community is actively improving it with each new release, enhancing its features and usability.

In this blog post we demonstrated how to leverage the tools provided by logical replication. We showed the prerequisites needed to prepare your environment. We discussed the concepts essential to a successful replication, what makes them important, how to customize them, and how they work in practice. We also discussed the potential challenges and how to troubleshoot common issues that arise during replication. Whatever your use case, PostgreSQL logical replication grants you the ability to perform complex data migrations with minimal setup.

But don’t take my word for it, try it out for yourself!
