How do we keep data in sync when an intensive flow requires sequential processing? Anyone who has gone through at least one architectural planning session, or has simply been involved in a running project, has stumbled upon this topic. In this blog post, we discuss the data sync problem and its potential solutions.
Initial Problem with Data Sync
Imagine you book a hotel room for two through an app and then find out you need to go alone, so you change the booking to a single room. You receive confirmation of the room update, but the price doesn’t change. This simple example illustrates the general need to synchronise processes, or simply data sync.
We will review two options for overcoming this issue: CDC (change data capture) and cloud messaging (Kafka, Pub/Sub, RabbitMQ, and others). We will also give an example with microservices, but first, let’s look at how these synchronisation tools work.
Cloud Messaging
The general concept here is that you have a set of containers, called topics, to which you publish information. On the other side, you have clients, called subscribers, which connect to a specific topic and listen for newly published messages. You can have as many subscribers as you need.
This way you can ensure that every service that needs to update or manipulate specific data receives the correct data.
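The topic and subscriber relationship can be sketched with a toy in-memory broker. This is only an illustration of the concept, not how Kafka, Pub/Sub, or RabbitMQ work internally; the `Broker` class, the `room-updates` topic name, and the message fields are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class Broker:
    """Toy in-memory broker: each topic maps to a list of subscriber callbacks."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        # Register a subscriber for one specific topic.
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        """Deliver the message to every subscriber of the topic; return the delivery count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = Broker()
pricing_inbox: List[dict] = []       # stands in for a pricing service
notification_inbox: List[dict] = []  # stands in for a notification service

broker.subscribe("room-updates", pricing_inbox.append)
broker.subscribe("room-updates", notification_inbox.append)

# One publish reaches both subscribers of the topic.
delivered = broker.publish("room-updates", {"booking_id": 42, "room_type": "single"})
```

Both inboxes end up holding the same message, which is the property the booking example relies on: the pricing side sees the same room change as everyone else.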
What about data synchronisation in this case? What if two services publish different update messages simultaneously? Does arrival order alone decide which one counts as the “latest” update?
Well, not really. Most tools offer bulk processing, in which you fetch up to a given number of messages within a given time window. On the bright side, this keeps the data sync process working, but with a delay, which can confuse users who hit exactly this “syncing timespan”. Another feasible, and even better, option is to configure the consumer so that it processes messages ordered by creation date. This way, if you have 10 updates, all 10 go through one at a time and in the correct order. Most of the time, though, you need only the latest of those events rather than the other 9.
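As a rough sketch of the two ideas above, the snippet below sorts a fetched batch by creation date and then collapses it to the latest event per booking. The `UpdateEvent` type, its fields, and the helper names are assumptions made for illustration; real brokers expose ordering through their own consumer settings.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class UpdateEvent:
    booking_id: int
    room_type: str
    created_at: datetime


def in_creation_order(batch: List[UpdateEvent]) -> List[UpdateEvent]:
    """Return the batch sorted by creation date, oldest first."""
    return sorted(batch, key=lambda event: event.created_at)


def latest_per_booking(batch: List[UpdateEvent]) -> Dict[int, UpdateEvent]:
    """Keep only the most recently created event for each booking."""
    latest: Dict[int, UpdateEvent] = {}
    for event in in_creation_order(batch):
        latest[event.booking_id] = event  # later events overwrite earlier ones
    return latest


# The batch arrives out of order, as it might after a bulk fetch.
batch = [
    UpdateEvent(42, "single", datetime(2024, 1, 1, 10, 5)),
    UpdateEvent(42, "double", datetime(2024, 1, 1, 10, 0)),
    UpdateEvent(7, "suite", datetime(2024, 1, 1, 10, 2)),
]
ordered = in_creation_order(batch)
latest = latest_per_booking(batch)
```

Processing `ordered` applies every update in sequence, while `latest` skips the intermediate states when only the final one matters.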
However, this comes with another concern. If your application can be updated from different sources while it is running in production, the latest update overrides the earlier one. You can address this by including additional information in the update message itself, for instance a version number or a field to compare (e.g. the last modification date), and taking action based on it. All of this can make the implementation feel somewhat hacky or convoluted.
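A minimal sketch of the versioning idea, assuming each update message carries a monotonically increasing `version` number: the consumer applies an update only if it is newer than what it already holds. The `Booking` type and `apply_update` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Booking:
    booking_id: int
    room_type: str
    version: int  # assumed to increase with every change at the source


def apply_update(store: Dict[int, Booking], incoming: Booking) -> bool:
    """Apply the update only if it is newer than the stored version."""
    current = store.get(incoming.booking_id)
    if current is not None and incoming.version <= current.version:
        return False  # stale or duplicate update: ignore it
    store[incoming.booking_id] = incoming
    return True


store: Dict[int, Booking] = {}
first = apply_update(store, Booking(42, "double", version=1))
second = apply_update(store, Booking(42, "single", version=3))
stale = apply_update(store, Booking(42, "twin", version=2))  # arrives late
```

Here the late `version=2` message is rejected, so the stored booking stays at the `single` room from `version=3`. A last-modified timestamp could play the same role as the version number, provided the clocks of the different sources are comparable.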
Change Data Capture
Change Data Capture (CDC) is a technique that observes all data changes (or changes to specific fields) and then notifies the relevant services, which can be done in several ways. Because it is tightly coupled to the database, CDC exposes changes in the order in which they were committed.
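To make the idea concrete, here is a toy sketch in which every write to a table also appends a numbered record to a change log, and a consumer reads that log in commit order. Real CDC tools typically read the database's own transaction log instead; the `TrackedTable` class and its methods are purely hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ChangeRecord:
    sequence: int                      # position in commit order
    key: int
    before: Optional[Dict[str, Any]]   # None when the row is new
    after: Dict[str, Any]


class TrackedTable:
    """Toy table that records every committed write in an ordered change log."""

    def __init__(self) -> None:
        self._rows: Dict[int, Dict[str, Any]] = {}
        self._log: List[ChangeRecord] = []

    def upsert(self, key: int, row: Dict[str, Any]) -> None:
        before = self._rows.get(key)
        self._rows[key] = dict(row)
        self._log.append(ChangeRecord(len(self._log) + 1, key, before, dict(row)))

    def changes_since(self, sequence: int) -> List[ChangeRecord]:
        """Return the changes committed after the given sequence number, in order."""
        return [record for record in self._log if record.sequence > sequence]


table = TrackedTable()
table.upsert(42, {"room_type": "double", "price": 120})
table.upsert(42, {"room_type": "single", "price": 120})

# A consumer that reacts only when the room type changes.
notified: List[Tuple[int, str]] = []
last_seen = 0
for change in table.changes_since(last_seen):
    if change.before is None or change.before["room_type"] != change.after["room_type"]:
        notified.append((change.sequence, change.after["room_type"]))
    last_seen = change.sequence
```

Because the consumer remembers `last_seen`, it can resume from where it stopped and still see the changes in the order they were committed.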
At the beginning of this blog, we gave an example of a booking desync; now we can expand on it a bit. Let’s say we use a third-party application to update our room. That is why we see the room updated, while the pricing is tied directly to an external service our application uses. If we had CDC in place, we could notify this external service and perform the update via cloud messaging or by calling its endpoints directly.
CDC is best suited to cache invalidation, search indexes, in-memory data views, storing audit trails, keeping data from different sources in sync, and many more use cases.
The Power of Microservices
To demonstrate what we have been talking about, we need a simple architecture plan, which could look like this.
We are going to have three microservices in the backend, each with its own database to reduce load. We also need to make sure that each frontend microservice works with exactly one backend microservice. Those databases must stay in sync, which is why we have decided to use cloud messaging as the synchronisation mechanism. Whenever an update occurs, it is channelled through the messaging layer to the other services.
In our case, we have just one publisher, and the rest only read the information. The subscribers receive the message, perform some logic if required, and save or update the object from the event in their database. This could in turn trigger another event to something like a notification service, which passes the message to a Redis server. From there, we could emit a WebSocket event signalling that a certain action occurred. This way, data stays in sync across all microservices, which results in a seamless experience for the users.
We could also combine Change Data Capture with messaging instead of publishing the message directly. Doing so improves data sync even further, as CDC reads the latest committed changes directly from the database. Another approach, preferred when Java (and more specifically Spring) developers don’t want to use CDC, is to use an aspect. This allows you to intercept a specific method, for example the onSave() method of the repository. Whenever the method gets called, you decide what kind of operations you would like to perform on the soon-to-be-updated object.
It is similar to CDC but not as effective: it shows you the current and the new data, and it is up to you to decide what to do with them.
We could send the whole new object with an update message, or we could compare different fields and perform operations based on this comparison. Alternatively, we could produce different events based on the payload. It gives us the freedom to choose what we would like to perform.
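The interception idea can be illustrated outside Spring as well. The sketch below uses a Python decorator as a loose analogue of an aspect; it is not Spring AOP, and the `publish_on_save` decorator, the `BookingRepository` class, and the `events` list (standing in for a message publisher) are all hypothetical. The wrapper sees the current and the new data, compares fields, and emits an event only when something changed.

```python
from functools import wraps
from typing import Any, Callable, Dict, List

# Stands in for a message publisher; a real system would send these to a topic.
events: List[Dict[str, Any]] = []


def publish_on_save(save: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a repository save method and emit an event describing what changed."""

    @wraps(save)
    def wrapper(self: Any, key: int, new: Dict[str, Any]) -> Any:
        current = self.rows.get(key)  # the data as it is before the save
        changed = sorted(
            field for field in new
            if current is None or current.get(field) != new[field]
        )
        result = save(self, key, new)
        if changed:
            events.append({"key": key, "changed": changed, "new": dict(new)})
        return result

    return wrapper


class BookingRepository:
    def __init__(self) -> None:
        self.rows: Dict[int, Dict[str, Any]] = {}

    @publish_on_save
    def save(self, key: int, new: Dict[str, Any]) -> None:
        self.rows[key] = dict(new)


repo = BookingRepository()
repo.save(42, {"room_type": "double", "guests": 2})  # new row: every field counts as changed
repo.save(42, {"room_type": "single", "guests": 2})  # only room_type changes
repo.save(42, {"room_type": "single", "guests": 2})  # nothing changes: no event
```

The comparison step is where the choices described above live: the wrapper could send the whole object, only the changed fields, or a different event type per field.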
In Conclusion
Data sync is quite important, especially in the microservices architecture model. The concern we have discussed is a serious one, as valuable information could get mixed up, resulting in endless troubleshooting and a poor user experience. Needless to say, developers should always discuss data synchronisation during architectural planning. This makes scaling easier and allows developers to add more microservices or sync with external services. Plus, it avoids the need for workarounds to perform certain actions later.