Iterative Architecture — Migration to Event-Driven in a Fast-Moving Start-up


I've been fortunate enough to be part of a platform team at INSHUR, and with that comes the responsibility of nurturing a system that is adaptable to change and allows many stream-aligned teams to work autonomously in a pain-free manner (or as pain-free as possible). Without going into too much detail, the current backend is primarily built from numerous entity-based services that each own a particular domain entity and rely on each other synchronously at runtime (i.e. a distributed monolith). Although this approach can be simple to build, I'm sure most of us are familiar with its pain points: low cohesion and high runtime coupling, which make it difficult to make and test changes in isolation.

With most of the platform team having previously started and built upon event-driven systems elsewhere, it quickly became clear to us that we could benefit from such an approach.

The biggest challenge we face is designing a migration strategy that:

  • Adds value quickly: we're a start-up/scale-up after all, and there's never really time for big-bang architectural changes.
  • Does not disrupt or require input from any of the other squads that are busy delivering business value.
  • Facilitates the reduction of runtime coupling and a move towards truly autonomous microservices. We want teams to be able to build what they need, when they need it, without heavy reliance on others.
  • Allows us to iteratively migrate existing components: we want to make any necessary changes at the last responsible moment, to avoid expensive-to-reverse decisions and unnecessary work.

Ok, so where do we start?

Looking at the main points of coupling between applications, it's clear that in the majority of cases the friction comes at the point where a process needs to access data that is currently owned by someone else. Generally speaking, a single source of truth in a distributed system is a fallacy, and chasing one is probably the most common misstep developers make when jumping into the world of microservices. Instead we should be aiming for well-defined bounded contexts, where each service holds its own view of the world relevant to that particular scope, kept eventually consistent.

**Edit:** As with everything in software, the above is not a golden rule but the outcome of trade-off analysis. As the CAP theorem suggests, you can only guarantee two of these three characteristics in a distributed system: consistency, availability and partition tolerance. For our requirements, we want to trade immediate consistency for availability and partition tolerance.

Think of an (over-simplified) example where a user signs up to an online bookshop that allows them to buy audio books and access them via their account. Alongside this is a social aspect, where users can discuss books and build a profile.

In this example, each bounded context will hold data about the same set of audio books, but each view may look very different. It's likely that the social context would hold data such as links to reviews, while the library would hold references to audio files. That sort of information would be useless in a billing context, where details about cost are more likely to be required. There is no single source of truth for what a ‘book’ looks like.
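
To make that concrete, here is a minimal sketch in Java of how the same audio book might be modelled in each bounded context. The type and field names are purely illustrative assumptions, not taken from any real system:

```java
import java.math.BigDecimal;
import java.net.URI;
import java.time.Duration;
import java.util.List;

// Social context: cares about discussion and reviews of the book.
record SocialBook(String bookId, String title, List<URI> reviewLinks, int discussionCount) {}

// Library context: cares about giving the user access to the audio itself.
record LibraryBook(String bookId, String title, URI audioFileLocation, Duration runtime) {}

// Billing context: cares about what the book costs.
record BillingBook(String bookId, String sku, BigDecimal price, String currency) {}
```

Each context keeps only the attributes it needs, and each stays eventually consistent with the others via events rather than synchronous calls.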

To get from a service-per-entity model to one with well-defined boundaries, we need to liberate our data. In the book Building Event-Driven Microservices, Adam Bellemare defines data liberation as:

Data liberation is the identification and publication of cross-domain data sets to their corresponding event streams and is part of a migration strategy for event-driven architectures.

Liberating our data

Luckily, identifying the data is pretty easy for us: we take the collection of data from each service that represents the real-life entity the service is modelling, and leave out anything that is only used internally within that application. The next step is to work out how to get that data out there. At this point we want to make no assumptions about how the data will be consumed, so we include all of the outwardly-used data when we make it accessible outside of its current home. Because of this, we quickly came to the conclusion that we needed to build fat events, whereby the data payload is a complete representation of the given entity after the handling of the particular command that caused the event to occur.

Within INSHUR, an early decision was made to never update a record, but instead to add a new record with an incremented version number. This data is stored as documents in MongoDB. This is a big win for us: because the data is placed in a MongoDB collection that represents the entity, we can utilise the change streams MongoDB makes available natively. Personally, I think this means of data liberation by change capture is a pretty cool approach. Because every update results in a new document rather than a document being mutated, we also have access to historical state changes, akin to event sourcing without actually having implemented event sourcing.
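
As a rough illustration of change capture (not our actual sidecar code, which uses the Mutiny reactive API), tailing a collection's change stream with the plain synchronous MongoDB Java driver looks something like this. The database, collection and field names are assumptions for the sake of the example:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import com.mongodb.client.model.changestream.FullDocument;
import org.bson.Document;

public class QuoteChangeStreamTail {
    public static void main(String[] args) {
        // Change streams require a replica set (or sharded cluster).
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> quotes =
                    client.getDatabase("insurance").getCollection("quotes");

            // Every insert of a new (versioned) document arrives as a change event.
            // UPDATE_LOOKUP only matters for updates; inserts already carry the full document.
            quotes.watch()
                  .fullDocument(FullDocument.UPDATE_LOOKUP)
                  .forEach((ChangeStreamDocument<Document> change) -> {
                      Document entity = change.getFullDocument();
                      System.out.printf("op=%s version=%s%n",
                              change.getOperationType(),
                              entity == null ? "n/a" : entity.get("version"));
                  });
        }
    }
}
```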

Minimising Disruption

Now, to convert database updates into truly meaningful events we'd have to understand the context in which each update is being applied, which would mean understanding the full domain up front and altering each application to trigger these events at the correct points in the process. If we did that we wouldn't be limiting disruption, and we'd certainly take a lot of time doing so. What we can do, though, is read event context data from the document that triggered the change stream action, and default to an event type based on the Mongo operation if none is available. This allows us to enrich events in the future, when needed, by simply making a small addition to any inserts we make as part of a process.

CRUD-based events are not the goal and are, in fact, generally terrible: they provide no meaningful context about what has occurred. In DDD thinking we should be using ubiquitous language, and event types should be meaningful to technical and non-technical folk alike. However, we are working with mostly immutable data. Think of an insurance quote: if you alter your input, rather than altering the quote, a new one is generated. This means we can get away with directly mapping inserts to ‘<insert-entity-type-here>.created’ event types in most scenarios. For those cases where updates do happen, we will also support the ‘updated’ and ‘deleted’ types. So long as we can support enrichment when we need it, we should be able to add a lot of value with a small amount of work.
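
A minimal sketch of that fallback logic, assuming a hypothetical eventType field that an owning service can optionally write on insert to enrich the event later:

```java
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import com.mongodb.client.model.changestream.OperationType;
import org.bson.Document;

public final class EventTypeResolver {

    // Hypothetical field an owning application can set on insert for a domain-meaningful type.
    private static final String EVENT_TYPE_FIELD = "eventType";

    public static String resolve(String entityName, ChangeStreamDocument<Document> change) {
        Document doc = change.getFullDocument();

        // Prefer an explicit, ubiquitous-language type if the owning service provided one...
        if (doc != null && doc.getString(EVENT_TYPE_FIELD) != null) {
            return doc.getString(EVENT_TYPE_FIELD); // e.g. "quote.priced"
        }

        // ...otherwise fall back to a CRUD-style default derived from the Mongo operation.
        OperationType op = change.getOperationType();
        return switch (op) {
            case INSERT -> entityName + ".created";
            case UPDATE, REPLACE -> entityName + ".updated";
            case DELETE -> entityName + ".deleted";
            default -> entityName + "." + op.getValue();
        };
    }
}
```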

Implementing via change capture

The applications we are working with are packaged as Docker images and deployed to a GCP-managed Kubernetes cluster, so we can simply deploy a sidecar into each pod whose sole purpose is to extract data changes and convert them into events. So what does it look like?

Deployment model for publishing process
  • The sidecar was nice and quick to build and has a small footprint (it is a Quarkus native build), and it uses the Mutiny reactive API for the change stream processing.
  • GCP PubSub is used as the message broker, as we already use this elsewhere.
  • To ensure we can support multiple instances, we use Redis as a data store for leader election between all pods deployed for the same service.
  • For the structure of our events, we wrap the entity JSON in a Cloud Event and store it in a dedicated outbox, with a further change stream binding running as a separate process. This lets us safely publish these events to a Pub/Sub topic without the dual-write problem (a rough sketch of this step follows this list).
  • For resilience against crashes and outages, we store the last change stream id we processed for both bindings (the entity collection and the outbox), which allows us to pick up where we left off upon restart or when we need to republish (via a status flag in the outbox). This process uses passage-of-time events as a means of scheduling checks (I aim to cover this in a future post).
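
To make the outbox step concrete, here is a rough sketch (not the real sidecar code) of wrapping the entity document in a CloudEvent-shaped envelope and writing it to an outbox collection. The envelope fields mirror CloudEvents attributes, but the schema, field names and source value here are assumptions:

```java
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.time.Instant;
import java.util.UUID;

public final class OutboxWriter {

    private final MongoCollection<Document> outbox;

    public OutboxWriter(MongoCollection<Document> outbox) {
        this.outbox = outbox;
    }

    /**
     * Stages a CloudEvent-shaped envelope in the outbox collection. A separate change
     * stream binding on this collection does the actual Pub/Sub publish, so the entity
     * write and the publish never happen as a dual write.
     */
    public void stage(String eventType, Document entitySnapshot) {
        Document envelope = new Document()
                .append("id", UUID.randomUUID().toString())
                .append("specversion", "1.0")
                .append("type", eventType)                   // e.g. "quote.created"
                .append("source", "/services/quote-service") // hypothetical source URI
                .append("time", Instant.now().toString())
                .append("status", "PENDING")                 // flipped once published; used for republish
                .append("data", entitySnapshot);             // the fat event payload

        outbox.insertOne(envelope);
    }
}
```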

The publishing process looks a little bit like this:

Event Publishing process

Where is the data going?

When we first add the sidecar to our services, there will be no active subscribers reacting to any of our events. The whole point of this exercise is to let us iteratively evolve, which means gradually adding more business context to our events and subscribing to them when it makes sense to apply a change in that area of the system. But we do need somewhere for the data to go, or we have solved nothing.

We have decided to deploy a simple service that acts as an event store for all events, so that they remain accessible in the future and can be replayed when needed. This event store is a simple subscriber that stores data in an internal inbox, indexed by a timestamp provided in the cloud event attributes, and it will eventually evolve to include further functionality such as replaying events. This will come in useful when we want to backfill data for new subscribers joining the system, using patterns such as event-carried state transfer. We opted for a single topic on which real-time subscribers receive all integration events, with handlers invoked based on the cloud event type attribute. The global store uses the same feed and becomes eventually consistent with the other services.
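
A simplified sketch of what such a subscriber could look like with the Google Cloud Pub/Sub Java client, assuming the cloud event attributes are carried as message attributes; the attribute names and inbox schema are assumptions, not the real implementation:

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class EventStoreSubscriber {

    public static Subscriber start(String projectId, String subscriptionId,
                                   MongoCollection<Document> inbox) {
        MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
            // Cloud event attributes assumed to be carried as Pub/Sub message attributes.
            String type = message.getAttributesOrDefault("ce-type", "unknown");
            String time = message.getAttributesOrDefault("ce-time", "");

            inbox.insertOne(new Document()
                    .append("type", type)
                    .append("eventTime", time) // indexed so events can be replayed in order
                    .append("payload", Document.parse(message.getData().toStringUtf8())));

            consumer.ack();
        };

        Subscriber subscriber = Subscriber.newBuilder(
                ProjectSubscriptionName.of(projectId, subscriptionId), receiver).build();
        subscriber.startAsync().awaitRunning();
        return subscriber;
    }
}
```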

Overall publishing deployment model

Lastly, we needed the ability to publish events from the past. With our data model this is actually simple for 95% (ish) of the data. We have implemented a mechanism that triggers only once per entity and publishes these events in the same structure, but to a separate topic that only the event store will ever listen to. It is up to subscribers to actively hydrate themselves with data.
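
A sketch of one way such a once-per-entity seed could work, assuming a separate ledger collection (with a unique index on entityId) is used as the trigger-once guard; the guard mechanism, names and reuse of the earlier OutboxWriter sketch are assumptions, not the actual implementation:

```java
import com.mongodb.MongoWriteException;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public final class SeedPublisher {

    private final MongoCollection<Document> entities;
    private final MongoCollection<Document> seededLedger; // unique index on "entityId" assumed
    private final OutboxWriter seedOutbox;                 // from the earlier outbox sketch

    public SeedPublisher(MongoCollection<Document> entities,
                         MongoCollection<Document> seededLedger,
                         OutboxWriter seedOutbox) {
        this.entities = entities;
        this.seededLedger = seededLedger;
        this.seedOutbox = seedOutbox;
    }

    /** One-shot backfill: stage each historical entity exactly once for the seed topic. */
    public void seed(String entityName) {
        for (Document entity : entities.find()) {
            try {
                // The unique index turns a second seeding attempt into a duplicate-key error.
                seededLedger.insertOne(new Document("entityId", entity.get("_id")));
            } catch (MongoWriteException alreadySeeded) {
                continue;
            }
            seedOutbox.stage(entityName + ".created", entity);
        }
    }
}
```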

Publisher ‘seed’ model

What next?

Once this is all up and running, we need to provide a pain-free mechanism for feature teams to leverage the new capability we have introduced here, which I will cover in part 2.
