CTO overrule

I’ve written before about how our CTO tried to change our release process, announcing it on a “Town Hall” call with the entire department; loads of teams then told him it couldn’t be done, and he had to back down.

Later, on another Town Hall, he tried to change the Change Control process, and wouldn’t back down when we told him it wouldn’t work. He claimed it would be a sackable offence not to follow it. A month later, he found out someone couldn’t turn a server on because of the Change Control process; he called that “malicious compliance” and said it would be a sackable offence in future. Within a few months, nearly the entire process had been reverted.

Last week, he was talking about how we needed to move a Data Centre to a new location. He said he preferred to move to the Cloud since it is “in line with our strategic targets”. After several meetings, though, the experts involved in the Data Centre decided the best solution would be to move to another physical data centre. The CTO didn’t like this, because it wasn’t in line with the strategy and he thought the move would be too slow.

Therefore, he took the executive decision to overrule them and demand they move to the cloud.

“Do I know we can move all servers to the cloud in time? No. But I was prepared to take the risk. I would rather make decisions and be wrong, than sit by and not make any.”

CTO

It seemed strange to me to claim that moving to a physical data centre would be too slow, while admitting that moving to the Cloud probably couldn’t be done in time either.

He then claimed that

“we have wasted enough time deciding what the plan should be: to move to the cloud or to move to a physical data centre”.

CTO

Isn’t this the worst scenario though? He could have made the decision before any meeting was arranged. Instead, it sounds like they debated the decision, came to a conclusion, and then he told them he didn’t like it. Then he moaned that they had wasted time debating.

So they had meetings with the experts, who concluded the data centre was the best option, but since the CTO loves the cloud, he overruled them. What was the value of those meetings? And will the staff be motivated to do something they don’t believe in?

“I think I’m compassionate with my team. It’s what binds us together as a team. Otherwise we are a bunch of individuals.”

CTO

I don’t get how he can make these statements and not realise the hypocrisy. How can you be compassionate if you have shown no confidence in their opinions and decision making?

Migrating from on-prem servers to the cloud

When migrating from on-prem servers to the cloud, the Deployment team decided to change the way servers were allocated, presumably to minimise the cost. They: 

“designed separate pools for the public side and private side so that the compute and memory could be dedicated to (and protected from) other types of traffic. Due to this split we reduce the ratio of CPU cores to sites from 0.42 per site to 0.28 per site (as the cores were now dedicated to public and private separately)“.

Deployment expert

Initially, this new setup worked fine, but during a particularly busy week they saw slower response times. That led to the discovery of a problem we must have had for a while: SQL connections weren’t being properly disposed of, so the pool of available connections drained away and became a bottleneck.
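I don’t know exactly what our data-access code looks like, but the shape of the bug is easy to sketch. Here it is in Python using the standard-library sqlite3 module (purely illustrative – our system actually uses SQL Server):

```python
import contextlib
import sqlite3

def leaky_query(db_path):
    # Bug: the connection is opened but never closed, so every call
    # holds onto a handle until the garbage collector eventually
    # reclaims it. Under heavy load the pool of available connections
    # runs dry and everything queues behind it.
    conn = sqlite3.connect(db_path)
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()

def tidy_query(db_path):
    # Fix: a context manager guarantees the connection is released as
    # soon as the query finishes, even if it raises an exception.
    with contextlib.closing(sqlite3.connect(db_path)) as conn:
        return conn.execute("SELECT COUNT(*) FROM orders").fetchone()
```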

They added a temporary fix, something to do with “Shared app pools“, rather than autoscaling the application servers, which would cost money. But scaling on demand is a major advantage of the cloud.

So, to no one’s surprise, when the load increased again, the performance issues returned.

So now the fix should be autoscaling, right? No, they are still reluctant. Instead, they added a fixed number of extra application servers. Surely that also costs money, and increases our costs during quieter periods. I don’t know all the details, but it seems risky to pick a set number and hope the load never exceeds it.
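For what it’s worth, “scale on demand” isn’t a huge amount of work to set up either. A minimal sketch with boto3, assuming an Auto Scaling group already exists (the group and policy names are made up):

```python
import boto3

# Target tracking: AWS adds or removes instances to hold the group's
# average CPU utilisation near 60%. No fixed server count to guess.
autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="app-server-asg",   # hypothetical group name
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,
    },
)
```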

On Viva Engage, a manager posted a positive message stating that the AWS migration was a big success:

“I am thrilled to announce that we have completed the migration to AWS!

This is a major milestone for our cloud migration programme and has involved many team members across multiple disciplines working together.

We have migrated a whopping 505 TB of data across 4178 databases and have stood up over 1,080 application servers. There has been meticulous planning (over 130 steps in each migration), preparation and countless hours spent migrating our systems, including overnight and weekend working.

The impact of this collective effort extends beyond numbers and statistics. We have successfully improved stability and performance for our end users. The migration has enabled us to navigate the increased load challenges.”

Manager

Yet, someone shared this angry message from a customer. I’m not sure if the first sentence is sarcastic, or if they thought we had been somewhat supportive:

“Thank you for your support in what seems to be a “run” of problems for the business. After our awful experience in November when your software literally tipped over leaving us without a system, I did request that both the ombudsman and your company treated this as a significant event, looked into what went wrong and responded to me with an answer. To date I have not received any such feedback from either party.”

Sarcastic customer

I asked a Software Architect what he thought, since he is usually close to the gossip or involved directly.

The Chief of Smoke and Mirrors will have some explaining to do.
performance improved quite a bit as a result of the 64-bit work done behind the scenes (not to the client)
but now users do things faster with longer sessions
and they have plenty of infrastructure issues around the AWS changes that caused a lot of customers problems
as always, one group of people fix certain things, while one group breaks lots of things at the same time

Architect

So it sounds like there has been some good work done, but also some mistakes made. Yet internally, we announce it as a great success.

Someone also showed me this complaint, where a colleague had visited a customer and reported back what they had witnessed:

“We visited a site yesterday displaying nearly all of the problems we have discussed to date – still having to reboot the software 10 to 15 times per day! System slowness (witnessed), documents not opening, closing when going into the orders module, first record load slow, changing an order – system crashes.”

Another reason for performance issues was missing config after the migration:

“some of our app servers are downloading/installing Windows Updates in core hours, which is causing poor performance for users.”

A simple workaround that sometimes happens is a “cache reset”. That sounds like a similar mindset to “turn it off and on again”, which does magically fix some problems. However, after the migration, Support got a bit confused about how to remote onto the new servers:

“cache resets were done on the wrong servers.”

Manager explaining why performance issues lingered for longer than expected.

Even after further tweaks to the cloud migration, updating the client to 64-bit, fixing SQL connections, and some other miscellaneous changes, the Support team were saying some sites were still having problems:

Can I confirm that things should be improving for all sites following all the brilliant work done? The customer is experiencing the below and I am visiting them tomorrow:

Customer issues

  • Loading can take several minutes
  • Slowness and crashing every day, at least 9 or 10 times a day
  • No discernible pattern or time of day for slowness or crashing, and no particular machine is noticeably better or worse
  • Performance issues for 2 years, but they have become much worse recently (the last 6 months)
  • Significant delays when uploading records
  • Can take up to 1 hour to approve a small number of external requests, which can involve multiple restarts
  • Switching between records can lead to delays and a ‘greyed out screen’ (not responding)
  • Constant, random crashes that need a restart – staff are having to partition out tasks such as viewing documents and approving tasks

Closing statement

It does seem like our performance issues are a bit of a mystery. I think we have run out of things to blame: customer internet, SQL connections, the 32-bit client, on-prem servers, a caching bug. Hopefully one day we will have a fast system.

Cloud FinOps

Over the last few years, my employer has gone Cloud crazy. We are a large company, so we have our own data centres. These are costly to run: you need physical space, staff, electricity, software licensing, and a plan of action for when things go wrong.

I wonder if it is better to have your own servers when you are a big company. I always think Cloud is best for smaller companies that don’t have the resources to host it themselves.

“Our reasons for using the cloud are the same as others using the cloud.”

Our CTO

Not really true, though, is it? From what I saw quoted for our test systems’ virtual machines, I think Cloud is more expensive over time. On-prem has a massive up-front cost, which is what they don’t like, but we have the capital for it, unlike the small companies the Cloud is perfect for.

The recent drive to move away from our data centres came from needing to replace some old hardware and perform SQL Server upgrades.

I could imagine us moving to the cloud, managers then panicking when they see the monthly costs, then demanding we go back.

One aspect of the SQL Server upgrade involved migrating the data to a new physical server. One of the tables they were concerned about was Audit, which gains a new row every time a user edits a record – around 9 billion rows, they said. A copy of the changed data is saved as XML so you can do a before/after comparison, and that particular column is the problem.

So, for the data that would remain in our data centres and move to a new server with a modern SQL Server version, the plan was to migrate the table without the XML column. In its place, a new boolean (true/false) column states whether data exists for that row, and the XML itself is moved to the cloud.

So now we are paying to host the database in our own data centre, while keeping certain data in AWS, which sounds like it should be more expensive. The justification is that we didn’t need to buy as much hard disk storage, which they reckoned could have cost a massive £500k! It also meant the migration to the new server in the data centre was faster.

Still, we needed to transfer the data to AWS cloud storage. I think the idea was that Audit data isn’t accessed much, so it’s better to move it to a cheaper but slower storage method and request it on demand. So in our software, instead of displaying the data instantly when you view a record, there would be a “view more detail” button; only when it is clicked do we request and show the data.
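I don’t know the actual design, but I imagine the “view more detail” flow looks roughly like this (the bucket name and key scheme are invented):

```python
import boto3

s3 = boto3.client("s3")

def load_audit_detail(audit_id: int, has_xml: bool):
    # The migrated row only carries a boolean flag; the XML payload
    # itself lives in cloud storage and is fetched on demand when the
    # user clicks "view more detail".
    if not has_xml:
        return None  # nothing was archived for this row
    obj = s3.get_object(
        Bucket="audit-xml-archive",    # hypothetical bucket
        Key=f"audit/{audit_id}.xml",   # hypothetical key scheme
    )
    return obj["Body"].read().decode("utf-8")
```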

I think the mindset is to focus only on the cost figures that are immediately visible. £500k sounds like a crazy figure, but looked at over a few years, does storing the data on our own servers really cost more than paying Amazon to store it?
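A quick back-of-envelope sum makes the point. Every number below is an assumption for illustration, not our real data:

```python
# Illustrative break-even: one-off disk purchase vs paying per GB-month.
DISK_UPFRONT_GBP = 500_000      # the quoted hardware figure
DATA_TB = 100                   # assumed size of the archived audit XML
CLOUD_GBP_PER_GB_MONTH = 0.02   # assumed blended storage price

monthly = DATA_TB * 1024 * CLOUD_GBP_PER_GB_MONTH
breakeven_years = DISK_UPFRONT_GBP / monthly / 12
print(f"Cloud storage ~£{monthly:,.0f}/month; break-even after {breakeven_years:.1f} years")
```

On those made-up numbers, the cloud takes about 20 years to reach £500k, so the decision hinges entirely on the real data size and prices, which I don’t have.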

A new corporate buzzword that gets thrown around on this subject is FinOps, as in Financial Operations.

One of the challenges we have when we start to build a new service is around estimating the potential cost of that new service in AWS. This ultimately goes towards setting the budget expectation for that service and therefore how we monitor it from a FinOps perspective. Do we have any experience within the department or anything we can leverage to help us get better at understanding the potential budget expectations for a new service we’re building?

Concerned staff member

In one of the recent “Town Hall” meetings, the CEO was ranting about how high our cloud costs were. He said we currently had £250k of AWS servers that are switched off (not sure if that was a yearly figure or, even more unbelievably, monthly). These were servers just for development/testing. If our testing teams are spending £250k on servers we aren’t really using, how much are we spending on the ones we are actively using? And how much does our live system cost?

Now when you see those figures, that £500k hard disk storage doesn’t sound too bad.

“FYI – Stopped instances don’t incur charges, but Elastic IP addresses or EBS volumes attached to those instances do.”

Cloud expert

He is throwing around more jargon there.
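Still, the jargon does point at something you can audit. A rough sketch with boto3 that lists stopped instances still paying for attached EBS volumes (it assumes AWS credentials and a region are already configured):

```python
import boto3

ec2 = boto3.client("ec2")

# Stopped instances are free while stopped, but any EBS volumes still
# attached to them keep billing.
pages = ec2.get_paginator("describe_instances").paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["stopped"]}]
)
for page in pages:
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            volumes = [
                mapping["Ebs"]["VolumeId"]
                for mapping in instance.get("BlockDeviceMappings", [])
                if "Ebs" in mapping
            ]
            print(instance["InstanceId"], "still pays for:", volumes)
```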

Cloud Nomenclature

When people start talking about the cloud, they quickly start dropping in jargon terms. Sometimes they use multiple terms in the same sentence, and it quickly becomes hard to follow if you aren’t familiar with the cloud providers. Even if you are familiar with one particular provider, other providers use different terms for their equivalent services. I think AWS is particularly bad, with names that often aren’t intuitive. So when people start talking about Elastic Beanstalk, Route 53 and Redshift, it’s hard to grasp what the hell they are talking about.

Here’s an example of equivalent services offered by four different cloud providers.

AWS | Azure | Google Cloud | Huawei Cloud
Elastic Compute Cloud (EC2) | Virtual Machine | Compute Engine | Elastic Cloud Server
Elastic Beanstalk | Cloud Services | Google App Engine | ServiceStage
Lambda | Functions | Cloud Functions | FunctionGraph
Elastic Container Service | Container Service | Container Engine | Cloud Container Engine
Auto Scaling | Autoscale | Autoscaler | Autoscaling
Simple Storage Service | Blob Storage | Cloud Storage | Object Storage Service
Elastic Block Storage | Managed Disks | Persistent Disk | Elastic Volume Service
CloudFront | CDN | Cloud CDN | Content Delivery Network
RDS | SQL Database | Cloud SQL | RDS
DynamoDB | DocumentDB | Cloud Datastore | Document Database Service
VPC | Virtual Networks | Cloud Virtual Network | Virtual Private Cloud
Direct Connect | ExpressRoute | Cloud Interconnect | Direct Connect
Route 53 | Traffic Manager | Cloud DNS | Cloud DNS
CloudTrail | Operational Insights | Cloud Logging | Cloud Trace
CloudWatch | Application Insights | Stackdriver Monitoring | Cloud Eye
Identity and Access Management (IAM) | Azure Active Directory | Cloud Identity and Access Management (IAM) | Identity and Access Management
CloudHSM | Azure Trust Center | Google Cloud Platform Security | Data Encryption Workshop
Kinesis | Stream Analytics | Cloud Dataflow | Data Ingestion Service
OpsWorks | Automation | Compute Engine Management | Cloud Service Engine
CloudFormation | Resource Manager | Cloud Deployment Manager | Application Orchestration Service
Simple Notification Service | Notification Hubs | | Simple Message Notification
Elastic Load Balancing | Load Balancing | Cloud Load Balancing | Elastic Load Balancing

Project Aurora & The Strangler pattern

Recently, another tech guy joined the company, reporting to the CTO. I find that people in these kinds of roles want to put their stamp on things by coming up with a new idea.

He presented his idea in our monthly Tech Meeting. He wants to address our performance problems by taking traffic away from our main on-premise databases. There have been some similar ideas recently, and although I’m not great when it comes to hardware, networks and general software/hardware architecture, I am sceptical that these ideas can work.

His idea is that we replicate the database in the cloud (“the cloud” solves all problems, you see); the cloud copy can then be used for Read access, whereas Writes would still go to the main on-premise databases (and then be synced up to the cloud).
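If I’ve understood the proposal, the read/write split looks something like this toy sketch, with two in-memory sqlite3 databases standing in for the real systems:

```python
import sqlite3

primary = sqlite3.connect(":memory:")   # stands in for the on-premise DB
replica = sqlite3.connect(":memory:")   # stands in for the cloud copy

def execute_write(sql, params=()):
    # All writes go to the authoritative on-premise primary; replication
    # to the cloud copy happens asynchronously, so the replica can lag.
    primary.execute(sql, params)
    primary.commit()

def execute_read(sql, params=()):
    # Reads are served by the cloud replica, accepting possible staleness.
    return replica.execute(sql, params).fetchall()
```

The entire debate that follows is about that word “staleness”.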

The Announcement

This programme of work is to move workload away from our primary systems to enable these systems to withstand expected load factors from upcoming initiatives as well as expected growth in usage on our APIs during Winter 2023.

The intent is to run focused cross functional teams in work-streams across the group to deliver this initiative. The approach taken here is to place multiple bets, across multiple teams. The expectation is that not all teams will deliver by September, but enough to bring in the headroom we need.

The programme is intending to free up at least 20% load across our core databases.

Upcoming aims:
• Strategic: move read-only workloads to Aurora
• Redeploy APIs to AWS; move to cloud technology, containerise and optimise services
• Enable use of replica data when ready
• Move appointment workload
• Mitigate 8am peak load
• Use a caching engine on AWS (Elasticache/Redis); mitigate 8.2% of PC DB load
• Reduce load on the DB during the daytime
• Reduce data centre and DB load and improve performance
• Mitigate 6.2% of DB load by optimising how we summarise task counts
• Proof of concept is complete; expected to cost £2m a year
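The Elasticache/Redis item is presumably the classic cache-aside pattern: check the cache first, fall back to the database, and cache the result with a short expiry. A minimal sketch with the redis-py client (the key scheme and TTL are my assumptions):

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379)

def get_task_counts(user_id, load_from_db):
    # Cache-aside: serve from Redis when we can; otherwise run the
    # expensive DB summary and cache it for five minutes.
    key = f"task-counts:{user_id}"          # hypothetical key scheme
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    counts = load_from_db(user_id)
    r.setex(key, 300, json.dumps(counts))   # 300-second TTL
    return counts
```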

My Conversation With Architect Mark

I think the reason for the replication (as opposed to just moving it all to the Cloud) is that you can’t fully commit to ideas like this. You have to have a rollback plan, so that if we find it doesn’t work, or is too expensive, we can return to the old way without much inconvenience. I asked one of our Software Architects what he thought of the plan, because it doesn’t sound right to me:

Me
doesn't sending data out to another database just increase traffic, and they wanted to reduce it?
Mark
Yes, it will also be delayed, and often broken
Me
no pain, no gain
Mark
they're replicating data, and it's unlikely it'll be used
Me
I don't see how you migrate things. You have to keep them both running until you are confident it works, then bin off the old database. But then in reality you just end up keeping them both for longer than expected
Mark
you then also need cross-database transactions or to be very careful with queries
yeah, that's basically it. Have the same API at both ends, some sort of replicate and transform on the data to ensure it's in both. Persist to both simultaneously, then when all works, turn off the old
Me
The CTO said that “some people say there is a delay, but it is only 5 minutes”. Does that address any of your concerns at all?
Mark
no, this is only the second time I've heard about this, and the first I laughed
I agree with the principle of strangler pattern for migrating, but this isn't migrating
it's keeping multiple DBs 'in-sync'
Me
does that mean you can view an appointment book which is 5 mins out of date, and you try book an appointment, then it checks the real database and is like "no mate you cannot do that"

The conversation between architects

Mark then sent me a conversation he had with two other architects, Andrew and Jon. Mark already had concerns with the “appointment book” example.

Mark
so when this replication system goes down for a few hours, what happens then? I guess the system tries to book appointments for slots already booked, put in requests for items already issued etc.?
seems our business layer needs to be aware of how outdated the original information was, so it can compare something like a changelog number. Sounds like a big challenge to implement correctly

Andrew 11:10
Yes, any write operations will need logic to ensure that cannot happen Mark.
John and I have already called out that Appointments and Orders will have significant challenges with this replication model and have suggested that the initial focus should be on User Profiles, and any historic data, etc.

Mark 11:13
User Profiles and historic data seem just as dangerous to be honest.

Jon 11:15
The idea I suggested there is that you would check the change log on the primary system before even considering going to the replica. If the User had had a recent change (what counts as "recent" is TBC, I suggested 30 minutes) you wouldn't even consider going to the replica.

Mark 11:15
can we implement the strangler pattern properly? set up proper Appointments APIs to use in our datacentre, and AWS.
duplicate the data.
then dual file everything against the APIs? if one fails to file, the other gets rolled back.
we ensure consistency, we can transform the data, and we're using the pattern as intended
Jon, I agree your idea is the right way to do this sort of thing, but it will be adding logic and latency in a lot of places (as well as augmenting every one of our products to be aware of this), and not bringing us forward, but continuing to keep us in the primary data-store model

Jon 11:18
Honestly, if the use case is customers looking at their own data, then a touch of out-of-date information isn't as critical as our actual users seeing an out-of-date view. As a hypothetical Customer who knows nothing about IT, if I viewed my record straight after a consultation and it wasn't there, I would just assume there was a delay and it would appear later.
When it comes to actual Users viewing the record, it's absolutely critical that they see the up to date view. And when it comes to appointments that's also critical because appointment booking is fast moving, it'd be an awful experience for a User if every "free" slot they booked turned out to be booked minutes earlier.

Mark 11:19
depends, if you've just requested a particular item and the page doesn't update to indicate that, can you continue requesting it?

Jon 11:20
Many of our users (mine included) turned off online appointment booking entirely at the beginning of the pandemic and use a triage system now.
You wouldn’t be able to successfully request duplicate items, because the write would take place conditionally, so if it had been requested already then it'd say no (if designed even vaguely competently).

Mark 11:22
the write wouldn't come through, but it'd be confusing for the User seeing the prescription still requestable, unless the application has its own datastore of state

Jon 11:22
Yes it would be far from ideal. But the CTO has some ideas about that (having a "recent changes" dataset in a cache that is updated live, and merged with the replica's data).
feels like there's loads of little bits of logic that need 'tacking on' to resolve potentially quite serious incidents. When the correct use of the strangler pattern gets us away from on-premise as primary DB, and moving in the direction we want to go
Yeah, this isn't easy and requires careful consideration.

Andrew 11:30
You are absolutely right Mark - there are a heck of a lot of potential gotchas and ultimately the plan has to be to use the strangler pattern, but at the moment we are looking at a rescue plan to put out some existing fires in the data centre and to handle predicted significant increase in load that will hit us in the Autumn. Everything that you have flagged is being considered.
The only fall-back plan that we currently have is to spend nearly £4m / year on additional SQL Server readable secondaries (on top of having to pay an additional 12% on our existing SQL Server licences thanks to MS hiking their prices) and nobody has the appetite for that.
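For what it’s worth, here is my rough understanding of the “dual filing” Mark described, with every name invented:

```python
# Dual filing (my interpretation of Mark's description): write to both
# the old and new Appointments APIs; if the new one fails, roll the old
# one back so the two stores never drift apart.
def book_appointment(old_api, new_api, slot):
    booking = old_api.book(slot)
    try:
        new_api.book(slot)
    except Exception:
        old_api.cancel(booking)  # compensate so both stores stay in agreement
        raise
    return booking
```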

Closing Thoughts

I don’t know what the Strangler Pattern is, so I’ll add it to my reading list. It seems that, even with my limited knowledge of architecture, our Software Architects share my concerns. There have been plenty of ideas that the CTO (or managers at a similar level) quickly backtracked on because they didn’t consult the people who could tell them whether the idea was actually sound. I’ll keep my eye on this one to see how it develops.

Problems With Hosted Services

Recently we have had several major incidents due to software bugs, incorrect configuration being applied, licence keys not being renewed, and servers being migrated to the cloud without checking that all services were correctly configured and running.

Our Hosted Services team gave a presentation on the work in their department, and gave insight into even more failings that have happened recently. As far as I am aware, Hosted deal with servers, data centres and networks.

Hosted explained that, because of the decision to move all servers to the cloud, when the usual time came to replace old servers they didn’t bother. But the migration has been slow and delayed, which meant our software ran on inferior hardware for longer than anticipated.

“We don’t need to invest in the architecture that we’ve got, which was not the right decision in hindsight

We had a team of people who, in some cases, were the wrong people. They didn’t have the appetite to go and actively drive out issues and reduce the points of failure in our networks.”

Hosted Manager

He then went on to say the change in strategy caused many of their long-term staff to leave – the people who really knew how the business worked.

“So we lost around about 90% of the team over a relatively short space of time and that put us into quite a challenging position to say the least. And needless to say, we were probably on the back foot in the first quarter of this year with having to recruit pretty much an entire new team.”

Hosted Manager

Then, because they were short-staffed, their backlog of work kept increasing, putting more stress on the people who remained:

“We had to stop doing some tasks, and some of our incident queues and ticketing queues were going north in terms of volumes, which was really not a good place to be.”

Hosted Manager

I’ve written about this situation in the past. It happened in the Development department when a new CTO came in and said manual software testing was archaic, so people had to learn automation or lose their jobs. A few months later, they realised the plan wasn’t feasible, but by then they had lost good software testers to other companies, or allowed others to switch roles who weren’t interested in going back. Then releases slowed down because we couldn’t get fixes tested fast enough, due to the lack of software testers.

They go on to say the Firewalls suffered 50 major incidents in Quarter 2, and now they have “procured new firewalls” to solve it. They have reduced bandwidth into the main data centre by routing certain traffic through an alternate link. The “core switches” at our offices and data centres are “End of Life” and will be upgraded to modern hardware (Cisco Nexus 9K).

So it sounds like they have a plan, or at least are doing their best with what they have. It sounds like every department is shooting itself in the foot at the moment.

The Changes #3 – Tech Stack

Recently, we got a new CTO and a new Head Of Development. Like all new managers with power, they want to make their own changes.

In The Changes #1, I explained the changes to Testers’ job titles. In The Changes #2, I explained that Developers were also affected.

Our previous management decided our upcoming software would specifically use the cloud service AWS, whereas the new managers want a “cloud-agnostic” approach (work with all the clouds!). So they want us to have the option of using one, or a combination, of Azure, Google or AWS – although the emphasis seems to be moving to Microsoft’s Azure.

Currently, I don’t think there is much rationale for this. Everyone specifically learned how to use AWS, so they will need some time to adapt to the differences. When people started learning about Cloud Technologies, the problem of “Vendor Tie-in” was raised and instantly dismissed by the previous management. It does make sense to keep that flexibility: if your provider increases their prices, you can migrate easily, provided you built the architecture for it.
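As I understand the cloud-agnostic argument, it amounts to hiding each provider behind your own interface, so that switching providers means writing one new adapter rather than rewriting the application. A sketch (the class names are mine, with S3 as the example):

```python
from abc import ABC, abstractmethod

import boto3

class BlobStore(ABC):
    # Application code depends only on this interface; each cloud
    # provider gets its own small adapter underneath it.
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class S3BlobStore(BlobStore):
    # AWS adapter; an Azure or Google one would implement the same
    # two methods against their own SDKs.
    def __init__(self, bucket: str):
        self._s3 = boto3.client("s3")
        self._bucket = bucket

    def put(self, key: str, data: bytes) -> None:
        self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)

    def get(self, key: str) -> bytes:
        return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()
```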

Another change is that they want every product to have its source code in GitHub. Maybe it does make sense to have all our products in the same place. The reason they are scattered is mainly that our products came from a series of acquisitions, and the original companies had chosen their own languages and technologies. However, our core, flagship product is already in Azure DevOps, and given that the main cloud provider is going to be Azure, surely it makes sense to keep it all in one place?

These changes to jobs, processes and technologies seem to have been decided solely by the CTO and Head Of Development. I feel like they haven’t discussed these ideas with the Development staff at all. I’m intrigued to see what other arbitrary changes will be forced on us. With any change, I think you always have to ask yourself “what is the problem we are trying to solve?”. If you can’t explain the problem, it shows you are making changes just for the sake of it.

Recently, I have been reading Sir Alex Ferguson’s “Leading” book and came across a great quote:

“There is no point suddenly changing routines that players are comfortable with. It is counterproductive, saps morale and immediately provokes players to question the new man’s motives. A leader who arrives in a new setting, or inherits a big role, needs to curb the impulse to display his manhood.”

Sir Alex Ferguson (ex Manchester United football manager)

I thought it was a very apt quote for this situation.