Project Aurora & The Strangler Pattern

Recently we have had another tech guy join the company, reporting to the CTO. I find that people in these kinds of roles want to put their stamp on things by coming up with a new idea.

He presented his idea in our monthly Tech Meeting. He wants to address our performance problems by taking traffic away from our main on-premise databases. There have been similar ideas recently, and although I’m not great when it comes to hardware, networks and general software/hardware architecture, I am sceptical that these ideas can work.

His idea is that we can replicate the database in the cloud (“the cloud” solves all problems, you see), and then the cloud copy can be used for read access, while writes would still go to the main on-premise databases (and then be synced up to the cloud).
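To make sure I understood the proposal, I sketched what that read/write split might look like. This is purely my own illustration: the class and method names are made up, and it glosses over how the sync from on-premise up to the cloud would actually work.

# A rough sketch of the proposed read/write split (all names are made up).
class DataAccess:
    def __init__(self, on_prem_db, cloud_replica):
        self.on_prem_db = on_prem_db        # primary, on-premise database
        self.cloud_replica = cloud_replica  # read-only copy in the cloud (e.g. Aurora)

    def read(self, query, params=()):
        # Reads are served from the cloud replica, which may lag behind
        # the primary by however long replication takes.
        return self.cloud_replica.execute(query, params).fetchall()

    def write(self, statement, params=()):
        # Writes still go to the on-premise primary; the replica is
        # brought up to date later by the replication mechanism.
        return self.on_prem_db.execute(statement, params)

The obvious catch, which comes up later, is that the replica can lag behind the primary.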

The Announcement

This programme of work is to move workload away from our primary systems to enable these systems to withstand expected load factors from upcoming initiatives as well as expected growth in usage on our APIs during Winter 2023.

The intent is to run focused cross functional teams in work-streams across the group to deliver this initiative. The approach taken here is to place multiple bets, across multiple teams. The expectation is that not all teams will deliver by September, but enough to bring in the headroom we need.

The programme is intending to free up at least 20% load across our core databases.

Upcoming aims:
• Strategic: move read-only workloads to Aurora.
• Redeploy APIs to AWS, move to cloud technology, containerise and optimise services.
• Enable use of replica data when ready.
• Move the Appointment workload.
• Mitigate the 8am peak load.
• Use a caching engine on AWS (ElastiCache/Redis) to mitigate 8.2% of PC DB load (see the sketch after this list).
• Reduce load on the DB during the daytime.
• Reduce datacentre and DB load and improve performance.
• Mitigate 6.2% of DB load by optimising how we summarise task counts.
• Proof of concept is complete; expected to cost £2m a year.
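One of those bullets mentions a caching engine (ElastiCache/Redis). As far as I understand it, that normally means a “cache-aside” approach: check the cache first and only hit the database on a miss. Here is a rough sketch of my own, just to illustrate the idea; the key names, the five-minute expiry and the query function are all guesses on my part, not from the announcement.

import json
import redis  # assumes an ElastiCache/Redis endpoint is reachable

cache = redis.Redis(host="my-cache.example.com", port=6379)
CACHE_TTL_SECONDS = 300  # my guess at an expiry; the announcement doesn't say

def get_task_counts(db, user_id):
    # Cache-aside: try Redis first, and only query the database on a miss.
    key = f"task-counts:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no database load at all

    counts = db.query_task_counts(user_id)  # cache miss: hit the database...
    cache.set(key, json.dumps(counts), ex=CACHE_TTL_SECONDS)  # ...and remember the result
    return counts

Every read answered from Redis is a read the primary database never sees, which I assume is where the 8.2% figure comes from.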

My Conversation With Architect Mark

I think the reason for the replication (as opposed to just moving it all to the Cloud) is that you can’t fully commit to ideas like this. You have to have a rollback plan, so if we find it doesn’t work, or is too expensive, we can just return to the old way without much inconvenience. I asked one of our Software Architects, Mark, what he thought of the plan, because it didn’t sound right to me:

Me
doesn't sending data out to another database just increase traffic, and they wanted to reduce it?
Mark
Yes, it will also be delayed, and often broken
Me
no pain, no gain
Mark
they're replicating data, and it's unlikely it'll be used
Me
I don't see how you migrate things. You have to keep them both running until you are confident it works, then bin off the old database. But then in reality you just end up keeping them both for longer than expected
Mark
you then also need cross-database transactions or to be very careful with queries
yeah, that's basically it. Have the same API at both ends, some sort of replicate and transform on the data to ensure it's in both. Persist to both simultaneously, then when all works, turn off the old
Me
The CTO said that “some people say there is a delay, but it is only 5 minutes”. Does that address any of your concerns at all?
Mark
no, this is only the second time I've heard about this, and the first I laughed
I agree with the principle of strangler pattern for migrating, but this isn't migrating
it's keeping multiple DBs 'in-sync'
Me
does that mean you can view an appointment book which is 5 mins out of date, and you try book an appointment, then it checks the real database and is like "no mate you cannot do that"
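To get Mark’s dual-write idea straight in my own head, I tried to sketch it out: the same API at both ends, persist to both simultaneously, and turn off the old database once it all works. This is purely my own illustration with made-up names; I’ve also assumed that if the second write fails, the first one gets rolled back, so the two stores can’t drift apart.

# A simplified sketch of Mark's dual-write idea (all names are made up).
class DualWriteStore:
    def __init__(self, old_api, new_api):
        self.old_api = old_api  # existing on-premise service
        self.new_api = new_api  # new cloud service exposing the same interface

    def save_appointment(self, appointment):
        # File against the old system first.
        old_result = self.old_api.save_appointment(appointment)
        try:
            # Then file against the new system.
            self.new_api.save_appointment(appointment)
        except Exception:
            # If one side fails to file, the other write is rolled back,
            # so the two stores never drift apart.
            self.old_api.delete_appointment(old_result)
            raise
        return old_result

Even in this toy version you can see where it gets hairy: the rollback itself can fail, which is presumably why Mark mentioned cross-database transactions.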

The Conversation Between Architects

Mark then sent me a conversation he had with two other architects, Andrew and Jon. Mark already had concerns about the “appointment book” example.

Mark
so when this replication system goes down for a few hours, what happens then? I guess the system tries to book appointments for slots already booked, put in requests for items already issued etc.?
seems our business layer needs to be aware of how outdated the original information was, so it can compare something like a changelog number. Sounds like a big challenge to implement correctly

Andrew 11:10
Yes, any write operations will need logic to ensure that cannot happen, Mark.
Jon and I have already called out that Appointments and Orders will have significant challenges with this replication model, and have suggested that the initial focus should be on User Profiles, any historic data, etc.

Mark 11:13
User Profiles and historic data seem just as dangerous to be honest.

Jon 11:15
The idea I suggested here is that you would check the change log on the primary system before even considering going to the replica. If the User had had a recent change (what counts as "recent" is TBC; I suggested 30 minutes) you wouldn't even consider going to the replica.

Mark 11:15
can we implement the strangler pattern properly? set up proper Appointments APIs to use in our datacentre, and AWS.
duplicate the data.
then dual file everything against the APIs? if one fails to file, the other gets rolled back.
we ensure consistency, we can transform the data, and we're using the pattern as intended
Jon, I agree your idea is the right way to do this sort of thing, but it will be adding logic and latency in a lot of places (as well as augmenting every one of our products to be aware of this), and not bringing us forward, but continuing to keep us in the primary data-store model

Jon 11:18
Honestly, if the use case is customers looking at their own data, then it being a touch out of date isn't as critical as an actual User seeing an out-of-date view. As a hypothetical Customer who knows nothing about IT, if I viewed my record straight after a consultation and it wasn't there, I would just assume there was a delay and it would appear later.
When it comes to actual Users viewing the record, it's absolutely critical that they see the up to date view. And when it comes to appointments that's also critical because appointment booking is fast moving, it'd be an awful experience for a User if every "free" slot they booked turned out to be booked minutes earlier.

Mark 11:19
depends, if you've just requested a particular item and the page doesn't update to indicate that, can you continue requesting it?

Jon 11:20
Many of our users (mine included) turned off online appointment booking entirely at the beginning of the pandemic and use a triage system now.
You wouldn’t be able to successfully request duplicate items, because the write would take place conditionally, so if it had been requested already then it'd say no (if designed even vaguely competently).

Mark 11:22
the write wouldn't come through, but it'd be confusing for the User seeing the prescription still requestable, unless the application has its own datastore of state

Jon 11:22
Yes, it would be far from ideal. But the CTO has some ideas about that (having a "recent changes" dataset in a cache that is updated live, and merged with the replica's data).
feels like there's loads of little bits of logic that need 'tacking on' to resolve potentially quite serious incidents, when the correct use of the strangler pattern gets us away from on-premise as the primary DB and moves us in the direction we want to go
Yeah, this isn't easy and requires careful consideration.

Andrew 11:30
You are absolutely right Mark - there are a heck of a lot of potential gotchas and ultimately the plan has to be to use the strangler pattern, but at the moment we are looking at a rescue plan to put out some existing fires in the data centre and to handle predicted significant increase in load that will hit us in the Autumn. Everything that you have flagged is being considered.
The only fall-back plan that we currently have is to spend nearly £4m / year on additional SQL Server readable secondaries (on top of having to pay an additional 12% on our existing SQL Server licences thanks to MS hiking their prices) and nobody has the appetite for that.
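Jon’s “check the change log first” idea and the conditional write made more sense to me once I tried to sketch them out. Again, this is just my own illustration: the 30-minute window is the number Jon mentioned, but the function names and everything else are my own inventions.

from datetime import datetime, timedelta, timezone

RECENT_WINDOW = timedelta(minutes=30)  # Jon's suggested definition of "recent"

def read_user_record(primary, replica, user_id):
    # Jon's idea: check the change log on the primary before even
    # considering the replica.
    last_change = primary.last_change_time(user_id)
    if datetime.now(timezone.utc) - last_change < RECENT_WINDOW:
        # A recent change means the replica may be stale, so read from the primary.
        return primary.get_user_record(user_id)
    # No recent changes: the replica should be safe enough to serve the read.
    return replica.get_user_record(user_id)

def book_appointment(primary, slot_id, user_id):
    # The conditional write: the booking only succeeds if the slot is still
    # free on the primary, however stale the view the user booked from was.
    if not primary.book_slot_if_free(slot_id, user_id):
        raise RuntimeError("Slot was already taken when the booking reached the primary")

As Mark says, though, that is extra logic and latency in a lot of places, and every one of our products would need to be aware of it.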

Closing Thoughts

I don’t know what the Strangler Pattern is, so I’ll add that to my reading list. However, it seems that even with my limited knowledge of architecture, our Software Architects have similar concerns to mine. There have been plenty of ideas that the CTO (or managers at a similar level) have quickly backtracked on because they didn’t consult the people who could tell them whether the idea was actually logically sound. I’ll keep my eye on this one to see how it develops.

Problems With Hosted Services

Recently we have had several major incidents due to software bugs, incorrect configuration being applied, licence keys not being renewed, and servers being migrated to the cloud without checking that all services were correctly configured and running.

Our Hosted Services team gave a presentation on the work in their department, and gave insight into even more failings that have happened recently. As far as I am aware, Hosted deal with servers, data centres and networks.

Hosted explained that, because of the decision to move all servers to the cloud, when the usual time came to replace old servers they didn’t bother. But the migration has been slow and delayed, which meant our software was running on inferior hardware for longer than anticipated.

“We don’t need to invest in the architecture that we’ve got, which was not the right decision in hindsight

We had a team of people who, in some cases, were the wrong people. They didn’t have the appetite to go and actively drive out issues and reduce the points of failure in our networks.”

Hosted Manager

He then goes on to say that the change in strategy caused many of their long-term staff to leave. These were people who really knew how the business worked.

“So we lost around about 90% of the team over a relatively short space of time and that put us into quite a challenging position to say the least. And needless to say, we were probably on the back foot in the first quarter of this year with having to recruit pretty much an entire new team.”

Hosted Manager

Then, because they were short-staffed, their backlog of work kept increasing, putting more stress on the people who remained:

“We had to stop doing some tasks, and some of our incident queues and ticketing queues were going north in terms of volumes, which was really not a good place to be.”

Hosted Manager

I’ve written about this situation in the past. It has happened in the Development department: a new CTO comes in and says that manual software testing is archaic, so people have to learn automation or lose their jobs. Then a few months later, they realise their plan isn’t so feasible, but they have already lost some good software testers to other companies, or allowed others to switch roles who aren’t interested in going back. Then releases slow down because we can’t get fixes tested fast enough, due to the lack of software testers.

They go on to say that the firewalls suffered 50 major incidents in Quarter 2, and now they have “procured new firewalls” to solve it. They have reduced the bandwidth going into the main data centre by routing certain traffic through an alternate link. The “core switches” at our offices and data centres are “End of Life” and will be upgraded to modern hardware (Cisco Nexus 9K).

So it sounds like they have a plan, or at least are doing the best they can with what they have. It seems like every department is shooting itself in the foot at the moment.

The Changes #3 – Tech Stack

Recently, we got a new CTO and a new Head Of Development. Like all new managers with power, they want to make their own changes.

In The Changes #1, I explained the changes to the job titles of Testers. In The Changes #2, I explained that Developers are also affected.

Our previous management decided our upcoming software would specifically use the cloud service AWS, whereas the new managers want a “cloud-agnostic” approach (work with all the clouds!). So they want us to have the possibility of using one, or a combination of: Azure, Google or AWS – although the emphasis seems to be moving to Microsoft’s Azure. 

Currently, I don’t think much rationale has been given for this. Everyone specifically learned how to use AWS, so they will need some time to adapt to the differences. When people started learning about cloud technologies, the problem of “Vendor Tie-in” was raised and instantly dismissed by the previous management. It does make sense to ensure you have that flexibility, because if your provider increases their costs, you can migrate more easily, provided you have the architecture for it.

Another change is that they want every product to have its source code in GitHub. Maybe it does make sense to have all our products in the same place. The reason they are scattered is mainly that our products came from a series of acquisitions, so the original companies had chosen their own languages and technologies. However, our core, flagship product is already in Azure DevOps, and given that the main cloud provider is going to be Azure, surely it would make sense to keep everything there?

These changes to jobs, processes and technologies seem to have been decided solely by the CTO and Head Of Development. I feel like they haven’t discussed these ideas with the Development staff at all. I’m intrigued to see what other arbitrary changes will be forced on us. With any change, I think you always have to ask yourself “what is the problem we are trying to solve?”. If you can’t explain the problem, then it shows you are making changes just for the sake of it.

Recently, I have been reading Sir Alex Ferguson’s “Leading” book and came across a great quote:

“There is no point suddenly changing routines that players are comfortable with. It is counterproductive, saps morale and immediately provokes players to question the new man’s motives. A leader who arrives in a new setting, or inherits a big role, needs to curb the impulse to display his manhood.”

Sir Alex Ferguson (ex Manchester United football manager)

I thought it was a very relevant and appropriate quote for this situation.