When migrating from on-prem servers to the cloud, the Deployment team decided to change the way servers were allocated, presumably to minimise the cost. They:
“designed separate pools for the public side and private side so that the compute and memory could be dedicated to (and protected from) other types of traffic. Due to this split we reduced the ratio of CPU cores to sites from 0.42 per site to 0.28 per site (as the cores were now dedicated to public and private separately).”
Deployment expert
Initially, the new arrangement worked fine, but during a particularly busy week they saw slower response times. That actually led to the discovery of a problem we must have had for a while: SQL connections weren’t being properly disposed of, which left the remaining available connections as a bottleneck.
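I haven’t seen the code, but the failure mode is a classic one: each request opens a database connection and never gives it back, so under load the pool runs dry. A minimal sketch of the leaky pattern and the disposal fix, in Python, with the standard library’s sqlite3 standing in for whatever database driver the product actually uses (the path and table are made up for illustration):

```python
import sqlite3
from contextlib import closing

DB_PATH = "app.db"  # hypothetical; stands in for the real database


def fetch_orders_leaky(site_id):
    # Leaky pattern: the connection is never closed, so under load the
    # server accumulates open connections until the available pool is exhausted.
    conn = sqlite3.connect(DB_PATH)
    cur = conn.execute("SELECT id, status FROM orders WHERE site_id = ?", (site_id,))
    return cur.fetchall()


def fetch_orders_disposed(site_id):
    # Fixed pattern: the connection is always closed, even if the query raises,
    # so connections are returned promptly and no bottleneck builds up.
    with closing(sqlite3.connect(DB_PATH)) as conn:
        cur = conn.execute("SELECT id, status FROM orders WHERE site_id = ?", (site_id,))
        return cur.fetchall()
```

Whether the real fix looked anything like this I can’t say, but “dispose what you open” is the whole game.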
They added a temporary fix, something to do with “Shared app pools”, rather than autoscaling the application servers, which would cost money. But scaling on demand is supposed to be a major advantage of the cloud.
So, to no one’s surprise, when load increased again, the performance issues returned.
So now the fix should be autoscaling, right? No, they were still reluctant. Instead, they added a fixed number of extra application servers. Surely that costs money too, and increases our costs during quieter periods. I suppose I don’t know all the details, but it seems risky to choose a set number and hope the load never exceeds it.
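For comparison, this is roughly what scaling on demand looks like in AWS: an Auto Scaling group with a floor for quiet periods and a ceiling for busy ones, plus a target-tracking policy so capacity follows load rather than a guess. A sketch using boto3; the group name and the numbers are made up, and I don’t actually know whether our application servers even sit in an Auto Scaling group:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical group name; the real application-server fleet isn't named anywhere.
GROUP = "app-server-asg"

# Keep a small baseline for quiet periods, but allow the fleet to grow under load.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName=GROUP,
    MinSize=4,
    MaxSize=40,
    DesiredCapacity=4,
)

# Target-tracking policy: AWS adds instances when average CPU goes above 60%
# and removes them again when load drops, instead of guessing a fixed number.
autoscaling.put_scaling_policy(
    AutoScalingGroupName=GROUP,
    PolicyName="keep-cpu-around-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,
    },
)
```

Quiet periods then cost a handful of servers, not a fixed fleet sized for the worst week.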
On Viva Engage, a manager posted a positive message stating that the AWS migration was a big success:
“I am thrilled to announce that we have completed the migration to AWS!
This is a major milestone for our cloud migration programme and has involved many team members across multiple disciplines working together.
We have migrated a whopping 505 TB of data across 4178 databases and have stood up over 1,080 application servers. There has been meticulous planning (over 130 steps in each migration), preparation and countless hours spent migrating our systems, including overnight and weekend working.
The impact of this collective effort extends beyond numbers and statistics. We have successfully improved stability and performance for our end users. The migration has enabled us to navigate the increased load challenges.”
Manager
Yet, someone shared this angry message from a customer. I’m not sure if the first sentence is sarcastic, or if they thought we had been somewhat supportive:
“Thank you for your support in what seems to be a “run” of problems for the business. After our awful experience in November when your software literally tipped over leaving us without a system, I did request that both the ombudsman and your company treated this as a significant event, looked into what went wrong and responded to me with an answer. To date I have not received any such feedback from either party.”
Sarcastic customer
I asked a Software Architect what he thought, since he is usually close to the gossip or involved directly.
The Chief of Smoke and Mirrors will have some explaining to do.
performance improved quite a bit as a result of the 64-bit work done behind the scenes (not to the client)
but now users do things faster with longer sessions
and they have plenty of infrastructure issues around the AWS changes that caused a lot of customers problems
as always, one group of people fix certain things, while one group breaks lots of things at the same time
Architect
So it sounds like there’s been some good work done, but also some mistakes made. Yet internally, we are announcing it as a great success.
Someone also showed me this complaint, from a colleague who had visited a customer and reported back what they had witnessed:
“We visited a site yesterday displaying nearly all of the problems we have discussed to date – still having to reboot the software 10 to 15 times per day! System slowness (witnessed), documents not opening, closing when going into the orders module, first record load slow, changing an order – system crashes.”
Another cause of performance issues was missing configuration after the migration:
“some of our app servers are downloading/installing Windows Updates in core hours, which is causing poor performance for users.”
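Patch timing is exactly the kind of thing that can be pinned down centrally rather than left to each box. As a sketch only (and assuming the fleet is managed through AWS Systems Manager, which I don’t know), a maintenance window can push updates into the small hours:

```python
import boto3

ssm = boto3.client("ssm")

# Hypothetical window: Sundays at 02:00, 3-hour window, stop starting new tasks
# in the final hour, so updates never run during core hours. Registering the
# servers and the patch task with the window is a further step, omitted here.
window = ssm.create_maintenance_window(
    Name="app-server-windows-updates",  # made-up name for illustration
    Schedule="cron(0 2 ? * SUN *)",
    Duration=3,
    Cutoff=1,
    AllowUnassociatedTargets=False,
)
print(window["WindowId"])
```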
A simple workaround that sometimes happens is a “cache reset”. It sounds like a similar mindset to “turn it off and on again”, which does magically fix some problems. However, after the migration, Support had got a bit confused about how to remote onto the new servers:
“cache resets were done on the wrong servers.”
Manager explaining why performance issues lingered for longer than expected.
Even after further tweaks to the cloud migration, updating the client to 64-bit, fixing the SQL connections, and some other miscellaneous changes, the Support team were saying some sites were still having problems:
Can I confirm that things should be improving for all sites following all the brilliant work done? The customer is experiencing the below and I am visiting them tomorrow:
Customer issues
- Loading can take several minutes
- Slowness and crashing every day, at least 9 or 10 times a day
- No discernible pattern or time of day for slowness or crashing, and no particular machine is noticeably better or worse
- Been experiencing performance issues for 2 years, but they have gotten much worse recently (the last 6 months)
- Significant delays when uploading records
- Can take up to 1 hour to approve a small number of external requests, which can involve multiple restarts
- Switching between records can lead to delays and a ‘greyed out’ screen (not responding)
- Constant, random crashes needing a restart; staff having to partition out tasks such as viewing documents and approving tasks
Closing statement
It does seem like our performance issues are a bit of a mystery. I think we have run out of things to blame: customer internet, SQL connections, the 32-bit client, on-prem servers, the caching bug. Hopefully one day we will have a fast system.