When migrating from on-prem servers to the cloud, the Deployment team decided to change the way servers were allocated, presumably to minimise the cost. They:
“designed separate pools for the public side and private side so that the compute and memory could be dedicated to (and protected from) other types of traffic. Due to this split we reduced the ratio of CPU cores to sites from 0.42 per site to 0.28 per site (as the cores were now dedicated to public and private separately)”.
Deployment expert
Initially, this new way worked fine, but then during a particularly busy week, they saw slower response times. It actually led to the discovery of a problem we must have had for a while: SQL connections weren’t being properly disposed of, which created a bottleneck as the pool of remaining available connections shrank.
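For anyone unfamiliar with that failure mode: a connection pool has a fixed number of slots, and every connection that is acquired but never disposed of permanently occupies one. Here is a minimal sketch in Python (invented names, not our actual C# code) of how forgotten releases eventually exhaust a pool:

```python
import sqlite3

# Hypothetical illustration: a tiny pool that hands out connections
# and runs dry if callers forget to return them.
class ConnectionPool:
    def __init__(self, size):
        self._free = [sqlite3.connect(":memory:") for _ in range(size)]

    def acquire(self):
        if not self._free:
            raise RuntimeError("pool exhausted - connections were never returned")
        return self._free.pop()

    def release(self, conn):
        self._free.append(conn)

pool = ConnectionPool(size=2)

def leaky_query(pool):
    conn = pool.acquire()        # acquired...
    conn.execute("SELECT 1")
    # BUG: never released - mirrors SqlConnections not being disposed

def safe_query(pool):
    conn = pool.acquire()
    try:
        conn.execute("SELECT 1")
    finally:
        pool.release(conn)       # always returned, like C#'s `using` block

safe_query(pool); safe_query(pool); safe_query(pool)   # fine indefinitely
leaky_query(pool); leaky_query(pool)                   # two slots leaked
try:
    leaky_query(pool)            # third call: nothing left in the pool
except RuntimeError as e:
    print(e)
```

The leaked connections are not “busy”; they are simply lost, which is why the symptom only appears under load, once the pool runs out.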
They added a temporary fix involving “shared app pools”, rather than autoscaling the application servers, which would cost money. But on-demand scaling is a major advantage of the cloud.
So, to no one’s surprise, when load increased again, the performance issues returned.
So now the fix should be autoscaling, right? No, they are still reluctant. Instead, they added a fixed number of extra application servers. Surely that also costs money, and inflates our costs during quieter periods. I suppose I don’t know all the details, but it seems risky to choose a set number and hope the load never exceeds it.
On Viva Engage, a manager posted a positive message stating that the AWS migration was a big success:
“I am thrilled to announce that we have completed the migration to AWS!
This is a major milestone for our cloud migration programme and has involved many team members across multiple disciplines working together.
We have migrated a whopping 505 TB of data across 4178 databases and have stood up over 1,080 application servers. There has been meticulous planning (over 130 steps in each migration), preparation and countless hours spent migrating our systems, including overnight and weekend working.
The impact of this collective effort extends beyond numbers and statistics. We have successfully improved stability and performance for our end users. The migration has enabled us to navigate the increased load challenges.”
Manager
Yet, someone shared this angry message from a customer. I’m not sure if the first sentence is sarcastic, or if they thought we had been somewhat supportive:
“Thank you for your support in what seems to be a “run” of problems for the business. After our awful experience in November when your software literally tipped over leaving us without a system, I did request that both the ombudsman and your company treated this as a significant event, looked into what went wrong and responded to me with an answer. To date I have not received any such feedback from either party.”
Sarcastic customer
I asked a Software Architect what he thought, since he is usually close to the gossip or involved directly.
“The Chief of Smoke and Mirrors will have some explaining to do. Performance improved quite a bit as a result of the 64-bit work done behind the scenes (not announced to the client), but now users do things faster with longer sessions, and there are plenty of infrastructure issues around the AWS changes that caused a lot of customers problems. As always, one group of people fixes certain things while another group breaks lots of things at the same time.”
Architect
So it sounds like there has been some good work done, but also some mistakes made. Yet internally, we announce it as a great success.
Someone also showed me a complaint from a colleague who had visited a customer and reported back what they had witnessed:
“We visited a site yesterday displaying nearly all of the problems we have discussed to date – still having to reboot the software 10 to 15 times per day! System slowness (witnessed), documents not opening, closing when going into the orders module, first record load slow, changing an order – system crashes.”
Another cause of performance issues was missing config after the migration:
“some of our app servers are downloading/installing Windows Updates in core hours, which is causing poor performance for users.”
A simple workaround that sometimes happens is a “cache reset”. It’s a similar mindset to “turn it off and on again”, which does magically fix some problems. However, after the migration, Support had got a bit confused about how to remote onto the new servers:
“cache resets were done on the wrong servers.”
Manager explaining why performance issues lingered for longer than expected.
Even after further tweaks to the cloud migration, updating the client to 64 bit, fixing SQL connections, and some other miscellaneous changes, the Support team were saying some sites were still having problems:
Can I confirm that things should be improving for all sites following all the brilliant work done? The customer is experiencing the below and I am visiting them tomorrow:
Customer issues
Loading can take several minutes
Slowness and crashing every day, at least 9 or 10 times a day
No discernible pattern or time of day for the slowness or crashing, and no particular machine is noticeably better or worse
Been experiencing performance issues for 2 years, but they have gotten much worse recently (last 6 months)
Experiencing significant delays when uploading records
Can take up to 1 hour to approve a small number of external requests, which can involve multiple restarts
Switching between records can lead to delays and a ‘greyed out screen’ (not responding)
Crashes constantly and randomly and needs restarting – staff having to partition out tasks such as viewing documents and approving tasks
Closing statement
It does seem like our performance issues are a bit of a mystery. I think we have run out of things to blame: customer internet, SQL connections, the 32-bit client, on-prem servers, a caching bug. Hopefully one day we will have a fast system.
In our software, we have a task list where “requests” go. They can be created by our users, or online by their customers. We have two boxes where these go: “Requests” and “Requests With Queries”. As far as I understand, the Requests are often safe to approve because each is basically just a repeat order, added by a staff member, so it has already had one official approval. When there is some uncertainty, they go into the “With Queries” box for more scrutiny. Requests coming from online always go into “With Queries” and require more scrutiny.
Clicking approve and then loading up the next task was quite slow. We added a Bulk Approval feature where the user can view tasks quickly, then approve several at once, which means they don’t have to go through the load/send/load/send workflow. It’s more like load/load/load, with the sending done in a background process.
For Requests, this bulk feature worked fine because they can be quickly reviewed, then sent. For ‘With Queries’, it made sense that our users would want to bulk review the user-created ones, but the customer-created ones would require further time to review. So we decided to create a new box where the customer-created ones go.
This was requested by some of our users, and it made sense to us. However, we didn’t ask all our users if it was appropriate for them.
So when it went out, many users complained that we had “doubled their work”.
The comments from our users often seemed strange, but many were saying they had a Receptionist who went through all the tasks and reassigned them to different staff members so everyone had an even number of tasks. Then each user would check their own tasks and approve them. They referred to this as “regulated distribution”. We were baffled why having the same number of tasks as before, just located in two boxes rather than one, would be a problem.
One user said this:
“unfortunately we don’t work like that. The requests have to be counted – so many queries and so many straightforward. They are allocated daily and completed but have to be collected and centralised first. Nightmare for us.”
Another user said this:
“We cannot work out now within this new box which are queries and which are not, so we are having to open every single one (500 today) in order to sort them out.”
But before, all these tasks would have appeared in the Requests With Queries box, because they were all customer-created. Now, in the Requests With Queries box, they should be able to review these faster, because each one requires the same level of scrutiny; whereas before, they had to keep looking at the “source” to see whether it was from a user or a customer, to decide what level of checking it required.
I think it must just be a case of being shocked when something changes, and reluctance to adapt.
During development, we also debated what the Review Date really meant. If it was set, we check whether the date has passed and don’t allow those tasks to be bulk approved. However, customers can have no Review Date at all, which we interpreted to mean it wasn’t applicable to them, so we allowed all their tasks to be bulk approved. One particular organisation thought this was very unsafe for them. They wrote an interesting write-up, full of capitalised words and very much geared towards a Hazard Matrix:
This issue has been discussed at the Joint IT Committee, who are expecting feedback in due course.
The Committee's concern is that there is a HAZARD that items may be issued through Bulk Approvals that have not been appropriately reviewed. The CAUSE of concern is that the Bulk Approvals module includes orders for customers who have no items Review date. Where an organisation's business operations involve using items Review date to govern their ordering, the EFFECT may be that a customer whose items have not been appropriately reviewed may have bulk approved orders (potentially repeatedly). The HARM to the customer may be any of the wide range of harms that can come about through unreviewed access to order items, including death. Therefore a LIKELIHOOD needs to be calculated (you are best placed to do so, as you can audit your records to identify how many customers have had repeat orders issued through Bulk Approvals, describe any case reports you have had of customers who have been harmed (and near-misses), and estimate the future risk by identifying how many customers have repeatable orders and no review date).
I believe that on your hazard matrix, the CONSEQUENCES therefore could plausibly result in death = CATASTROPHIC
The LIKELIHOOD I would appreciate your guidance on, but I wonder if it might be UNLIKELY i.e. likely to occur some time (the longer it is running the higher the chance), or if you have other CONTROLs I'm not aware of, possibly EXCEPTIONAL i.e. unlikely, but may occur exceptionally, which would give the HAZARD a rating of HIGH or MEDIUM.
The Committee would therefore be grateful for more detailed feedback on the HAZARD so that we can respond to our Members. This might be the relevant row from your Hazard Log for example, but a narrative description would be fine.
The suggested REACTIVE CONTROL is to consider excluding those customers from Bulk Approvals, which would ELIMINATE this cause. There are alternative controls, but none that we are aware of that would eliminate the cause entirely. In the interim, an organisation could mitigate this risk until any change in module behaviour by auditing their customer records to identify customers who have current active repeat orders but no review date.
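The rule we shipped, as I understand it, boils down to something like this sketch (function and field names are invented for illustration, not our actual code). The contested branch is the `None` case: we read “no Review Date” as “not applicable”, while the Committee reads it as “never reviewed”:

```python
from datetime import date

def bulk_approvable(review_date, today=None):
    """Eligibility rule for the Bulk Approval feature (illustrative only)."""
    today = today or date.today()
    if review_date is None:
        # Our interpretation: no Review Date means the check doesn't apply.
        # The Committee's HAZARD: it may instead mean "never reviewed".
        return True
    # A Review Date in the past blocks bulk approval.
    return review_date >= today

print(bulk_approvable(None, today=date(2024, 1, 1)))            # the disputed case
print(bulk_approvable(date(2024, 6, 1), today=date(2024, 1, 1)))  # not yet due
print(bulk_approvable(date(2023, 6, 1), today=date(2024, 1, 1)))  # date has passed
```

The organisation’s suggested REACTIVE CONTROL amounts to flipping the `None` branch to `False` for the affected customers.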
My employer made the news recently after the deployment team applied the wrong config to several organisations which led to a heavy spike in network traffic and caused issues nationwide.
Sadly, we made the news again.
“We couldn’t print our forms for three hours. Someone had goofed and did not keep up with their subscriptions. For a company of their size; that is embarrassing.”
Customer quote
When users reported the issue, there was a call between a few managers. I was invited to the call with zero context by someone I had never spoken to before, so I thought there was a good chance they invited the wrong person.
The call was recorded, so I just listened to it when I was free. They had invited me on the suggestion that I had worked on a feature in that area many years ago, and they had no other ideas. Also on the call, they phoned someone else, and remarked to the recipient how strange the call must seem, given they had never spoken to him before. Why didn’t they learn and send some context? He didn’t join either.
Eventually, they found someone who belongs to the team that procures the licences. He explained that they purchase licence keys for this printing software, then send them to an internal support team to update the licence keys in the database. The team receives automated emails reminding them to renew the licence keys 3 months prior to expiry, and they act on it quickly to not risk them expiring.
“It takes a while to go through the purchase process, so I usually do it early, but sometimes it can ‘fall through’.”
Procurement guy
After going through some emails, they found the keys had been promptly purchased and sent to support, but then never applied.
Another guy joined the call and said a total of 43 organisations had reported errors, but those are only the ones we know about, from direct complaints we saw via Facebook.
“And then I need to understand exactly how this is happening, ‘cause this is the second time in two weeks that a licence key was sent to support and wasn’t applied.”
Angry manager
After the issue was resolved, another manager asked for a summary of the issue. One guy remarked
“just to say, the people invited to the call (16 invited in total), are not the ones that ended up being involved or resolving.”
Manager, reflecting on what a mess the meeting was
I don’t understand how all our departments are causing chaos all at once. We don’t seem to learn our lessons either: how can we make the same mistake twice in two weeks? It’s also been a problem for years that we struggle to invite the correct people to major incident calls, so issues take much longer to resolve.
We recently featured in the national newspapers due to an outage with our software, and it was surprising that our system was down for around 12 hours. Usually we respond to outages really quickly, and get the system back online after a few hours max.
I sent the article to a few of my colleagues and they were unaware of the problem, so internally we seemed to be keeping quiet about it.
We had put out a message to users, but from what I saw in the aftermath, we initially claimed that only a specific region was affected, before others on social media stated it was affecting people nationwide – much to the derision of our main rivals, who posted multiple social media posts to insult us.
“They deliver complaints, weekend working, major incidents, stress, insomnia, and hypertension, to name a few. Do their rivals deliver?”
Customer quote
All Hands
Five days later, the Directors arranged an “All Hands” meeting with only a few hours’ notice, which caused panic amongst many colleagues because of the implied urgency. It was led by the CTO, who was announcing changes to our deployment process, though I’m not sure who he consulted. I would have thought he would have learned from the last time he announced changes to our release process – when we all ripped his idea to shreds on the same call, citing several technical and some legal reasons why we can’t possibly work that way. A much better approach would be to state the problem and let the experts discuss how to solve it, but yet again he seems to have had a meeting with the other Directors and taken the attitude of “this is how it’s going to be from now on”. And then we all pushed back and told him why it wasn’t possible.
I love it when situations like these occur. The drama makes you want to grab the popcorn, sit back, and enjoy the show. Hearing all the different points of view and trying to understand the complexities of why things are the way they are – is super interesting. I think when you are told to follow a Process and you don’t know the rationale behind it, it makes it feel like a chore. But focussing on a problem and trying to solve it by forming a new/modified process is something you can get behind.
So, like I said, not learning his lesson from before, he stated what the new process was, and gave only a wishy-washy explanation of the problem he was trying to solve. After we all bombarded him with questions, the truth of the matter came out, and some of his ideas wouldn’t even have solved the particular problem that caused the outage. So it seemed he was adding unnecessary process, slowing us down and causing more stress. We already release far too slowly, so we Software Developers end up idle because we can’t get out the things we do fix or implement.
He described this situation as a
“major incident that impacted the whole estate, attributed directly to a failed Change. We recognise that the change was not intended to have the adverse impact that it did, but sadly the consequences have been a major blow to Users and us. Therefore, we are seeking to create immediate stability across our estate, and are implementing several amendments to the way Technology Changes are approved and implemented, as follows:
1. A comprehensive change freeze for the month of June, with only changes meeting enhanced criteria being passed for implementation.
2. Pre-approved changes are suspended
3. Any changes submitted for approval will require TWO members of SLT.
4. Implementation activity must be witnessed by two or more staff members. Screen sharing technology should be used to witness the change. No additional activities are carried out that are not explicitly in the documentation.
5. All changes must have a comprehensive rollback plan, with proof of testing. The rollback plan must be executable within 50% of the approved change window.
Backout plans will be analysed by the Head Of Development to add more technically-informed scrutiny and assess the risk. Further budget will be allocated to DataDog, for more visibility into problems with our servers. A dedicated team will improve how we alert internal teams and customers in the event of an incident.
What does a “Failed Change” even mean? That we deployed a database change that failed to apply due to a key constraint violation? That a new feature or bug fix completely broke something? It’s very vague. Then what does “We recognise that the change was not intended to have the adverse impact that it did” mean? Is that just restating the same thing? You aren’t going to intend to have an adverse impact. It sounds like the sort of nonsense you put out to non-technical users just to blag an apology. Why is he saying this to the Development department? How are we supposed to understand the process and help come up with something better if he is being deceptive?
As I previously stated, “after we all bombarded him with questions, the truth of the matter came out”, so I’ll try to explain by piecing together the information I did gather.
Years ago we attempted to replace a module, and the replacement was feature flagged. It turned out not to be significantly better than the thing it replaced, and it has its own set of issues, so some users use the new feature (V2) while others stayed on the old one (V1). There was an experimental feature which hoped to improve performance for both V1 and V2, but in the user interface of the configuration tool it was placed as a sub-feature of V2. So you have V2 on/off, and if it is On, there then appears to be an option for Experimental Fix on/off. When the Deployment team came to turn on the experimental feature, they saw that the main feature (V2) was set to Off for some of the users, so they set it to On. Since enabling V2 triggers a download of loads of data into the V2 format, this caused a massive impact on the network, and so affected even the users who weren’t directly touched by the configuration change.
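One mitigation would be for the configuration tool to validate flag dependencies explicitly, so that enabling the experimental fix for a V1 site is rejected with an explanation, rather than tempting the operator to switch V2 on. A minimal sketch of the idea (all names invented; our actual tool presumably works differently):

```python
# Child flag -> the parent flag it depends on (hypothetical names).
FLAG_PARENTS = {"experimental_fix": "module_v2"}

def set_flag(site_config: dict, flag: str, value: bool) -> dict:
    """Return an updated copy of site_config, refusing dependency violations."""
    parent = FLAG_PARENTS.get(flag)
    if value and parent and not site_config.get(parent, False):
        # Refuse, rather than silently (or manually) enabling the parent
        # and triggering the V2 data download for a V1 site.
        raise ValueError(
            f"cannot enable '{flag}': site does not have '{parent}' enabled"
        )
    updated = dict(site_config)
    updated[flag] = value
    return updated

v1_site = {"module_v2": False}
try:
    set_flag(v1_site, "experimental_fix", True)
except ValueError as e:
    print(e)   # the change the Deployment team made would be blocked here
```

The point is that the dependency lives in data the tool can check, instead of only in the layout of the config screen.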
So would the CTO’s new process stop this happening?
New Rules – My View
“There will be a comprehensive change freeze for the month of June, with only changes meeting enhanced criteria being passed for implementation.”
The size of the release wasn’t the problem, so cutting it down won’t solve anything. It might annoy the users even more if we then delay features that we announced.
“Pre-approved changes are suspended”
I’m not sure of the full scope of what is covered by “pre-approved changes”. But the idea of a pre-approved change is that it is something often run on the live servers to fix common issues and is low risk, hence it is pre-approved (e.g. the ability to restart a crashed server/service). This is just going to annoy staff in Deployment. The CTO also remarked:
“Preapproved changes are wonderful. They have been reviewed and tested to death. My goal is to increase the number of preapproved changes in the future. It’s just with the existing ones, we don’t know if they have been reviewed or not”.
You don’t know if they have been “reviewed”, but they have been run hundreds of times and never caused an issue. So you are temporarily banning them on the grounds that they could cause an issue?
“Any changes submitted for approval will require TWO members of SLT. ”
How many times has there been some kind of approval process where the people with authorisation are too busy or on annual leave? And why are we going from 0 approvers to 2? Would the managers understand a change to enable a feature for users belonging to companies A, B and C? Would they go “hang on, C doesn’t have the main feature! I’m rejecting this”? It’s going to be a box-ticking exercise.
We already have a problem when changes are code reviewed by Developers – there aren’t enough “expert” people who can review at the required level of detail. So how would a manager understand the change and its technical impact? It will be more like “does this make us money? Yes, we like money” – approved.
“Implementation activity must be witnessed by two or more staff members. Screen sharing technology should be used to witness the change. No additional activities are carried out that are not explicitly in the documentation.”
This might actually help, although it might be patronising for Deployment. The CTO made a comment on the call about having “competent” people involved in the deployment process. If a Developer has to watch a member of Deployment click a few buttons, it feels like babysitting and not respecting them as employees.
“All changes must have a comprehensive rollback plan, with proof of testing. The rollback plan must be executable within 50% of the approved change window.”
The rollback idea sounds logical and great in theory, but it is the biggest concern for the technical people in Development, as we will discuss below.
My opinion
The Developer Feedback
On our current change approval form there is already a section for a rollback plan, but it isn’t properly filled in and is often just “revert change”. So one developer said all that needs to change is:
“people should be ensuring they have documented their tested process for rollback”. He also added: “In the case of feature-toggling, what happens when you do that? Do the users have a cache? How do we clear the cached data for all users? Can you force everybody to turn that feature off by forcing them to log out? If the change causes a performance impact, we need to be sure that turning something off actually turns it off and stops the performance impact right there and then.”
One person pointed out that the majority of developers are now in India, and due to the time zone difference they aren’t around for the current Change Approval Board meetings, or available when we deploy.
Like many companies, we love deploying on Friday nights because that is considered downtime for most of our users. The problem is that the staff who would be useful to have around don’t work weekends unless they are paid to be on call. Getting people to deploy late probably increases the chance of mistakes. Maybe we should deploy earlier in the week where possible. At the very least, don’t deploy on a bank holiday weekend, because that makes it even harder for staff to fix any issues.
Deployment staff member
Our test databases have probably less than 5% of the data that our live databases have, so testing SQL is always a problem for us. What runs in 1 second in test could easily run for a minute at live scale. So if we come up with a rollback plan, our timings won’t be accurate at all.
Database Expert
With our C# code, you can easily roll back the version by redeploying the old one. The problem is the corresponding database changes. Our database patches are designed for sequential application, e.g. one script creates a table, a later version adds a new column. You can’t just redeploy the old version to get back to a state where the column doesn’t exist; you have to write code to do the reverse – delete the column. And what if data has been added between deployment and realising the feature doesn’t work (and you need to roll back)? Do you just lose the data and tell the users “tough luck”, or do you archive the data, apply the revert patch, later fix the issue, reapply the original patch, then run another script to restore the archived data? It’s much more feasible just to disable the feature, apply a fix, then turn it back on. [We recently employed a new guy with the job title “Technical Director” and he was adamant a rollback had to go back to the original state at all costs: “Fix forward is not a valid risk mitigation strategy.” He doesn’t understand how we work, yet he is making these technical decisions.]
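To make the asymmetry concrete, here is a toy sketch (SQLite in Python, with invented table names; our real patches are obviously far bigger): the “up” patch adds a column, and a genuine “down” patch has to both remove the column and decide what happens to data written to it in the meantime:

```python
import sqlite3

def up(conn):
    # Forward patch: add a column. Redeploying old application code
    # does NOT undo this - the schema change persists.
    conn.execute("ALTER TABLE orders ADD COLUMN priority TEXT")

def down(conn):
    # A real rollback needs explicit reverse steps. First, archive any
    # data written to the new column rather than telling users "tough luck".
    conn.execute(
        "CREATE TABLE orders_priority_archive AS "
        "SELECT id, priority FROM orders WHERE priority IS NOT NULL"
    )
    # Then recreate the table without the column (portable across SQLite
    # versions; a real script would also re-declare constraints/indexes).
    conn.execute("CREATE TABLE orders_old AS SELECT id, item FROM orders")
    conn.execute("DROP TABLE orders")
    conn.execute("ALTER TABLE orders_old RENAME TO orders")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.execute("INSERT INTO orders (item) VALUES ('widgets')")

up(conn)
conn.execute("UPDATE orders SET priority = 'high' WHERE id = 1")  # data arrives
down(conn)

# The schema is back to the old shape, and the new data is parked, not lost.
print(conn.execute("SELECT priority FROM orders_priority_archive").fetchall())
```

Even in this toy case, the down script is longer than the up script and needs its own testing, which is exactly the overhead the Tester’s 50% estimate is pointing at.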
Another developer reiterates that point to tell the new Technical Director that he is talking nonsense:
When a change is implemented Out of Hours for the user, any problem may only emerge when users log on in the morning. If we roll back at that point, any overnight transactions potentially will be lost. This includes transactions on inbound interfaces from 3rd party systems. Rolling back at that point may be even more disruptive than hot-fixing the problem. How are we to handle this, in terms of process management and policies?
We have many products, especially with all the mergers/acquisitions we have had over the years. Each with their own processes.
[The CTO and Technical Director were talking as if we were all using the same process and tools which wasn’t the case.]
One concern was that for our Mobile apps, the likes of Apple and Google have strict update procedures, where you submit the update and it is approved in their own time. I don’t know how rollbacks work in this process. I think the implication was that feature flags are often used, so the app stays on the same version and remote configuration switches the new behaviour on.
Mobile developer
The rollback idea seems to be that if the change took 20 minutes, it should be possible to roll back within 10. Why? I’m not sure; maybe it’s just a rough guide to be “fast”. But doesn’t this assume any failure is identified within the 20-minute change window? It might take a while for users to run into the scenario that causes the failure. If we patch in the early hours (1AM–3AM) and users work 9–5, there is a time delay. Our main patching process takes a long time (possibly an hour, with large file downloads over a slow network), so I’d expect a rollback process to also take a long time. And if users are actively using the system and only a specific feature is the problem, you don’t want to take the entire system down to patch; you want a zero-downtime fix.
We need to be specific about what indicates a “failure”. They talked about using dashboards to look for evidence of failures, but there are cases where a feature doesn’t work yet logs no error – the server calls might still return success. Also, if there is a problem, which rollback plan do you execute? If there are several features/bug fixes in the release, do you roll back all of it, or spend time investigating which feature is at fault and roll back only that?
We have a tendency to release with features switched off and enable them later. Often, implementing a new feature means changing existing code, which may introduce bugs – so even when the feature is off, it can still cause problems. In that case, I think we need to go through the Change Approval Board for the initial release, then again to actually switch the feature on.
As part of the rollback process, have teams considered the overall impact this will have on estimates of deliverables, overall development and testing time, and how we are going to manage customers’ expectations? Depending on the project/feature, this could add an additional 50% to development time to create the rollback and to complete the testing of it.
Tester
After we made all these points the Technical Director just dismissed all our concerns.
“I think I’ve been clear on what a rollback plan is and what the expectation is of it. The onus is on the change proposer to demonstrate they have a Rollback plan”
Technical Director
Well, you’ve been clear about your expectations. We have all been clear that it simply isn’t possible to implement.
Architect
I think we need to change our architecture to support such plans. They want instant process change. But this needs to be planned in.
I asked a Software Architect what he thought:
“I don’t like how we’ve had a bunch of clowns that know how nothing works come up with a plan which makes little sense and is impractical, and opposed to much of industry direction. Patch forwards is the way with databases, supporting rollback of code changes with tech like containers is the way to go with that”
Architect
The Next Release
Since this process affects the next release, the team involved in the biggest feature raised their concern. Their change is super risky, so they will need to spend weeks coming up with a back-out plan and will therefore probably miss the release. The Technical Director said that it is:
“going ahead – but with the additional scrutiny and bolstered processes we’ve detailed. This isn’t about stopping – it’s about wrapping our change activity in more certainty. I’d personally like to understand “high risk” here though.”
Technical Director
I think this illustrates my initial points. He will be one of the people signing off on changes, and here is a proposed change without a rollback plan that he is already stating he is willing to sign off. The Lead Developer then has to convince him it is actually one of the riskiest changes we have made this year.
“I’m not sure how it isn’t a high risk change. It’s massive, wide-ranging, and there are so many things that can go wrong. We are working hard to mitigate the risks but the risks are still there.”
Lead Developer
Since we had a project going into the next release, we arranged a meeting with the Head of Development (in regards to the statement “Backout plans will be analysed by the Head Of Development to add a more technically-informed scrutiny, and assess the risk”).
When we explained the difficulty of coming up with a rollback plan, he said, “why don’t you just add a feature flag”. We told him we had, but that the Tech Director and CTO had made clear that was unacceptable. He said he was fine with it, because it doesn’t make sense otherwise. So the Head of Development who is supposed to be scrutinising these plans also thinks this rollback requirement doesn’t make sense. Brilliant.
Feedback
At our next meeting with the CTO, he asked how we had been adapting to the new process.
One guy said that a server was down for 3 hours and all it needed was the service to be turned back on. Because Pre-Approved changes had been taken away, they weren’t allowed to turn it on until two Senior Managers approved. They could only contact one, and when asked if they could go ahead anyway, he refused, because they had strict instructions not to deviate from the new process. The CTO was outraged, and came out with this amazing, hypocritical speech:
It is essential that people “engage brain” before following process. The idea that we can deliberately leave customers without service because some process says so - It's crazy. And if anyone disagrees, come and argue with me. This is incredibly frustrating for us as an organisation where people follow processes to the detriment of providing good service to our customers. Now I understand that there's no malice or deliberate reason to break something by following that process, but as you explain it, I can't imagine anyone on this call saying “Oh yeah, that was the right thing to do”. If we have to write a process that means people don't have to think, then we've lost the battle already, and people need to be able to think. The challenge I have in front of me is our customers, and their customers are not able to be helped. I can press this button and it will fix things, or I can wait until some process completes and then I'll press the button. At the very least an emergency phone call should be made to someone senior enough to make the decision and that should have proceeded straight away. We have heard of other examples of this. I want you to see how serious I am about how upsetting that is. You know, we employ intelligent people with a brain. We can't get away with automatons that just follow process. If we do that, we'll just automate everything. We need people that are intelligent enough to make good decisions, and when they can see that what is happening is not right, they escalate it until they get an answer.
I thought he said it would be a sackable offence if we even slightly deviated? Even his Senior Management was strictly following the CTO’s process and refused to authorise it when escalated. We told him in advance it was a bad idea to get rid of Pre-Approved changes. We told him it was a bad idea to require 2 approvals from the Senior Managers. Didn’t listen. Refused to “engage brain”. Caused another outage. The fact that the CTO admits it’s actually happened a few times as well is hilarious.
Another team said they weren’t invited to the first meeting when the change was announced, and then trying to follow the process delayed them by 6 hours. The CTO admitted that one was an oversight on his part.
Another team said they were trying to do a demo to a customer but they needed a configuration change applied. Even though it was a “Test System”, it had some dependency on the live infrastructure and so had to go through this new change approval process. They delayed their demo and looked a bit incompetent in front of the customer.
A Technical Manager reiterated the point about not being able to roll back the database changes, and said his team had wasted hours in meetings discussing how it could even be remotely possible. The CTO then said we were getting “hung up on terminology”, and that since the ultimate aim is to get back into a working state, it doesn’t matter if we call it “rollback” or “fix forward” – it just needs to work. But we were telling him there is a massive difference. A rollback literally puts you back into the previous state. Fix-forward means you add extra fixes on top of the broken release, so users are left with the new feature. Rollback also carries the risk of losing data that was added between upgrading and rolling back. It’s a completely different approach. Then the CTO and the Technical Director said that we can “do what we want as long as it gets us in a working state”.
So we can fix forward then? And we can deviate from the rollback plan?
Another person asked if we “will go back to normal mode after July”. I think they miss the point. Surely the process is there to try to prevent major incidents like this from happening again. You don’t just do it for a few months then stop.
Refinement to the process
After a few weeks, they realised that having a limited number of staff that can approve changes was a bad idea, especially when most of them had a trip to India booked for the week. So they increased the pool of people that can approve.
Another week went by and they realised they had way too many approvals to go through, so they knocked the number of approvers down to 1.
Another week went by and they realised most of the approvals were formerly the Pre-Approved changes, so those got reinstated.
We are almost back to our old process!
Conclusion
I think I have lost all confidence in the CTO, and I think the new Technical Director is just as dumb. How can you make decisions without involving the people actually part of the process, and then reject their concerns when they tell you how stupid it is!? Then, when you get evidence of how stupid it was, you blame your staff and tell them to “engage brain” and not follow your process if it wasn’t going to solve the problem of getting our systems up and running. We have been in the industry for years, and haven’t had many instances of long-term major incidents. I think under the new CTO’s leadership, the number of major incidents has increased dramatically. I don’t think it’s necessarily down to decisions he has made, but some of the ideas he has come up with can’t help.
Several months ago, I was on a social call with a few of my colleagues. We were talking about how our flagship product is really outdated but we had been working on a replacement for around 4 years without success. Therefore, the future looks bleak.
I made the point that if our rivals release their new software before we do, then we could be in serious trouble.
A colleague, Robin, agreed with me and stated:
“the only way we can stay in business is to rely on our competitors being worse than us”
Then added:
“we are Yahoo, waiting for Google to turn up”
Robin
I think the comparison is that Yahoo were quite a well-known search engine, but they didn’t build on what they had. So then Google came into the market with their superior search engine and not only took their market share, but then released products like Gmail which made Yahoo even more redundant. Yahoo ended up abandoning their own Search feature and used Google’s instead. Today, it seems they use Bing.
I’m not sure we would have the same opportunity just to back off and utilise our competitor’s features.
Robin then elaborated:
We have lost a lot of expertise over the years, so there’s large parts of our software we don’t truly understand. It’s essentially falling apart at this stage. We need to retire it. The sheer amount of technical debt is so bad that fixing technical debt is stymied by how much technical debt there is.
Robin
Shortly after, an American company placed a bid at a hefty 50% premium on the share price. It was quickly accepted by the board, who will be cashing in for millions.
“This is exciting for us. This is in recognition of the brilliant work we have achieved together in the past few years. It’s a recognition of success. This is about future growth, with potentially a new shareholder who can drive innovation, great service for our customers, and take us to the next stage. We believe this combination will have the resources and expertise to enable us to better support our users through technology innovations. Our long standing track record of delivering effective technology solutions and strong financial and operating performance combined with our partner Group’s resources and expertise will enable us to accelerate our development. We therefore consider that the combined group will be well positioned to serve its customers and partners in the UK, whilst ensuring we remain a strong organisation through technology innovation.”
Director’s corporate-jargon-riddled waffle.
A further update from the directors was aimed at allaying our fears of redundancy or general change to how we work. It was also full of the usual corporate jargon:
great for everyone, the employees, customers, the shareholders
Accelerating investment
redundancies: “Absolutely not the case. This is about growth.
genuinely really excited for this. Excited about the opportunities for individuals
Excited about the positive synergies between the two businesses
accelerate in research
“don’t worry because it’s going to be absolutely EPIC!”
Continue personal journey
accelerate our strategy, accelerate our roadmap, accelerate our customer base
gives our colleagues something to really relish
gunning for growth and innovation for some time. Completely in alignment with the direction we set.
no plan to change what we are doing in UK or India
unique culture that we will maintain and build upon
Benefits and conditions will remain (pensions safeguarded)
At the end of the update, our CEO came out with this ominous statement:
“clearly, what we are focusing on now, is making sure we complete the year”
We are actually highly profitable (somehow), but that made it sound like we were struggling. Is there really any doubt we will see out the year?
I didn’t find many comments from our users posted on the internet, but a small group seemed to be concerned that our data would be owned by an American company. I think for the most part, and especially with our main software, we don’t actually own the data. Our users own the data and we are mainly the “data processor”. So if the American company wanted to buy us to get their hands on our data, then I reckon they would be doing illegal things, so that’s not a worry. I think it is merely a case of the American company wanting a profitable company and to get into the sector. They might be shocked when they find out our current software is really retro without much scope for growth, and the replacement is in “development hell”, and not worth the hype we gave it.
One user was saying that because the idea of Americans owning our data was “unethical”, they wanted to build their own system which would be ethical and not-for-profit. I did wonder if an open-source project could work – our system has essentially been built by hundreds of developers over 15 years. With hindsight, you could build something similar much quicker, especially if you focus on the core features. However, the architecture is a massive problem and costs loads to run. It would take at least 5 years before you had enough features to get users to switch to your system. Then it would still need to be funded to keep running. I think even the big open source projects have salaried software engineers because you cannot just rely on ad-hoc contributions.
Like I said, we have struggled to build our own replacement which is in “development hell”. That’s with full-time staff and loads of managers to coordinate it. Then we were talking about trialling a version which is basically just an RSS feed after hyping it up 4 years ago.
I was discussing this conversation with one of our Software Architects, and he said the open source idea could only feasibly work if an existing company like ours made our product’s code open source; so it was already complete and the architecture was in place. Then the contributions would be to add features, and make it more performant.
“Well, they certainly can’t make them any worse! That wheel of death that says ‘loading’ just goes round and round forever!”
A user on the takeover, and our software’s recent performance issues
Back to the takeover, it will be interesting when/if the takeover fully completes to see what the American company plans to do with us. I’m intrigued if our Director’s statements have any truth.
We used to have a very large support team in our own call centre. As software developers, we were occasionally sent over to talk to them, and I was amazed at how busy they were. Usually, as soon as they had put the phone down, they had another user calling up.
Sometimes it was that the user just didn’t know how to use the system, and other times it was to complain about a software bug or slow performance. The call centre staff were rapid at entering the information into the system, and were brilliant in asking the user the right questions to really understand their problem. They could often tell the user if the issue was logged or not, and also give them some relevant work-arounds.
After speaking to some of the staff, they explained how strict the culture was there – they were monitored on how fast they picked up the phone, how long the call was, and how many breaks they had. They said how annoying it was to be warned about being late when it was due to bad traffic.
It surprised me because it seemed a completely different culture to how the development department is run. We are flexible about when we start, so you can just turn up late and no one cares. We were never challenged on how long our work took to complete. I guess if our work is poor quality, it’s the call centre staff that took the complaints!
At some point, some manager decided to use a 3rd party company for Support, and most of our Support staff either left or (presumably) got redundancy.
The number of complaints seemed to go up on various social media platforms, and I got the impression this 3rd-party company didn’t know our software, so they were just giving users generic statements from a script: “can you try turning it off and on again?”. Maybe if a user got past the initial questions, they then got put through to our smaller 2nd-line Support team.
A few years later, I think a new manager came in and decided to try to reverse the decision, but it’s going to take a bit of time to get the new staff as good as the old ones.
“As part of our drive to strengthen our customer satisfaction and experience along with simplifying our ways of working for both our customers and the service desk, the decision has been made to insource the call centre. All calls will now come directly into the service desk.
We have already run a couple of trial switch offs over the last fortnight and the initial feedback has been unanimously positive with customers preferring to be directly connected to the service desk; just in this small sample there has been an increase in both the quality of cases and first-time fixes. We will continue to invest in developing a world class service desk.”
Company Announcement
It seems obvious to me if you make the support more generic, then it’s going to decrease customer satisfaction.
The only problems I had with Support are:
A) When they would link completely different issues to the same bug report. Sometimes you see that a bug you thought should be super rare has had hundreds of reports from users. Then, when you look into the cases, you see that 90%+ of them are unrelated. We could probably have published some advice on how to decide whether issues share the same root cause, to try to remedy this.
B) Sometimes there’s other data entry errors that end up being misleading, like this:
Support (in the free-text description):
"however we have been able to re-create this on the test system by"...
also support (in the mandatory fields):
"Recreated in-house: No
Reason not recreated: Unable to recreate"
Looking through my draft blogs, I have quite a few to do with performance, so I have grouped them together to make a complete blog entry.
“the whirling circle thing is just whirling”
User annoyed at the slowness of our system
Do we care about performance?
I’ve listened to some Tech podcasts that have discussed Big O notation which is a way of classifying the performance of an algorithm such as O(n), O(n^2), but outside of studying Computer Science at university, I personally have never heard anyone else reference this. I think with some programming jobs, the efficiency of algorithms is crucial. For the most part, I don’t think people care unless you are doing something that turns out incredibly slow. In the age of Cloud Computing where you can be charged by the millisecond, then it will become increasingly important for server-side algorithms.
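To make the notation concrete, here is a hedged little illustration (not from any of our codebases) of the same task written two ways. Both functions behave identically on a test database with hundreds of rows, but on a live table with hundreds of thousands of rows, the quadratic version falls over:

```python
def has_duplicates_quadratic(items):
    # O(n^2): compares every pair of items.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicates_linear(items):
    # O(n): a single pass, remembering what we've seen in a set.
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```

Which is exactly why code that “runs fine on my machine” against small test data can still be a disaster in production.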
Where I work, we are very reactive rather than proactive. Recently, we have had loads of complaints about our performance, so we have identified the slowest areas and addressed them.
Thinking about performance upfront would mean the user gets the best experience and doesn’t complain. However, there are plenty of examples of code which runs fine on your machine but runs poorly in live. This can be because you have a faster computer, the network speeds are faster, or there is lower latency between the servers (when developing, the client and server are on the same computer so there is no latency), or because you don’t have a database which represents live – you have 100s of rows rather than 100,000s. I think this last reason is often the cause for us, in addition to concurrency (thousands of live users simultaneously using the system, compared to a few testers using a test server).
An example of how we don’t often consider performance is as follows:
Example 1
I showed some initiative recently and essentially rewrote a project that was inefficient.
Some server calls had a check to see if a feature was enabled before attempting to retrieve data from the database, but some didn’t. Those code paths ran queries against the database only to return no data.
When the feature was enabled, there were other parts of the code that were making the same server call twice in the same sequence of actions,
and other places that made a server call even though the client already had the data from a previous call.
It was a mess basically.
The functionality behaves correctly and the performance didn’t seem out of the ordinary, so the Software Testers never flagged it. We were probably talking an extra 50ms to 200ms in total when testing internally, although this would be worse when deployed live. It should have been obvious to the Development team that there were inefficiencies if they were thinking about possible improvements to their codebase, or if they were paying attention to the server calls and noticed their new server calls when the feature was switched off.
Volume and Performance Testing
We have loads of test environments – virtual machines with databases and servers where the Software Testers can test our code changes. I would have thought these would be fine to do performance testing, and spam these test servers with calls, but a Test Environment Engineer once sent an angry email:
Our standard test environments are Absolutely not to be used for Volume and Performance testing. This can impact the whole devtest infrastructure, as well as some live services that share the same infrastructure. Doing this could cause a Major Incident.
Test Environment Engineer
This seems bizarre to me. Why would you share resources with the live production environment? Surely the Test Environment should be in its own isolated sandbox, where you can create, update, and delete with safety and freedom.
Example 2
We had another performance issue where we attempted to retrieve a large amount of data. When the developer analysed it, it turned out we weren’t actually trying to retrieve a large amount of data after all.
“There were 142,858 Id’s passed into the comma separated string, I found that there were only 64 distinct ID’s once I removed the duplicates. After the duplicates were removed it went from 76minutes duration to under 1 second”.
Software Developer
What a performance improvement!
I wonder if passing the data in a form that isn’t a comma-separated string would also help. If you are passing in “12, 156, 201, 202, 284”, then, to get the IDs, the database has to split the string on the commas and place the numbers in a table. If you pass it in as a table in the first place, you cut out a step on the database layer.
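The deduplication side of the fix is tiny. A hedged sketch mirroring what the developer described (the function name is mine, not from the real code):

```python
# Remove duplicate IDs before building the comma-separated parameter,
# so 142,858 entries collapse to the 64 distinct ones.
def build_id_param(ids):
    # dict.fromkeys removes duplicates while preserving first-seen order.
    unique_ids = dict.fromkeys(ids)
    return ",".join(str(i) for i in unique_ids)
```

One line of dedup on the client side, and 76 minutes becomes under a second.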
Example 3: SQLite
One part of our system used a local file cache (loads of files were downloaded to the user’s machine, then our software would load these XML files rather than going to the server every time). It has worked fairly well, but over time, the number of files has grown.
I think the performance issue comes from the fact that all the files are loaded and kept in memory, so it is a memory hog, sometimes leading to “Out Of Memory” errors. Maybe the solution is to work out whether we can dynamically load the full files only when they are needed, keeping just the “metadata” in memory at all times.
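A minimal sketch of that metadata-only idea (the class and field names are assumptions for illustration, not the real cache):

```python
import os

# Keep lightweight metadata in memory permanently; read a file's full
# contents only on demand instead of holding every file in memory.
class LazyFileCache:
    def __init__(self, directory):
        self.directory = directory
        # Only metadata (here, just file sizes) lives in memory at all times.
        self.metadata = {
            name: os.path.getsize(os.path.join(directory, name))
            for name in os.listdir(directory)
        }

    def load(self, name):
        # The full file is loaded only when actually needed, and can be
        # discarded afterwards.
        with open(os.path.join(self.directory, name), "rb") as f:
            return f.read()
```

The point is that memory usage scales with the metadata, not with the total size of all the files.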
Colin came up with an idea to move the files into a local database using SQLite. This innovative idea impressed the managers, who were eager to do it. I was sceptical because I think we would still end up loading all the files; only the place where they are stored would change.
Day 1 of release, the following Major Incident was raised:
“SQLite corruption. Out of memory”
After that was fixed, I believe the changes caused 3 additional Major Incidents. Colin still swears by it. He thinks it was an amazing idea. I ain’t convinced.
User Feedback
It’s rare that I come across comments from users about our software, but they do often make me laugh. Part of the problem is that they use software, but aren’t really technical enough to use the correct terms, so when complaining about a performance issue, one user remarked:
“the whirling circle thing is just whirling”
User annoyed at the slowness of our system
Another thing about user complaints is that I tend to only hear about the really angry complaints and not the generic comments. I think Support had told a user there was a large amount of data on that particular record, and that the more data on the record, the longer it is going to take to load.
“Regarding larger records, this is a predictable problem. What mitigations are being put in place to offset that? I still have 20 years of working life yet, don’t tell me that by 2035 I will have to wait 20 minutes to load a record???!”
Angry User
It’s a valid complaint. We need to come up with a different way of loading the record so they aren’t sitting around waiting for the full record to load.
I recently had to deal with another company’s software support team, and this was the second bug I had logged. The first bug didn’t go well at all. For this second issue, I had provided them detailed recreation steps, and videos of the issue occurring.
After I logged the issue, they said they would investigate and get back to me shortly. After nearly 2 weeks, I received an email requesting to arrange a phone call. I thought he was going to give me some news but he wanted me to demo it. So I asked him what that would give him over the videos. He said the videos I sent him wouldn’t play. Brilliant.
I didn’t understand why the videos wouldn’t play. I recorded them using the Microsoft Game Bar feature and they ran fine on my machine. Instead, he wanted me to record the video with PowerPoint. This is bonkers. At least I learned that PowerPoint can record screens. It’s quite useful because you can record just a tiny part of your window…but then there is no option to simply record the full screen!
The problem we had was that our software was intermittently crashing when it was interacting with theirs. However if you changed some Security settings in their software to “never warn me about suspicious activity”, then the crash didn’t occur. You would have thought it should pop up a message box rather than crash. So I asked their Support specifically if he had any idea why this would happen. If it was something I could change at our end, then maybe I could quickly resolve the issue.
“Our software may be thinking this 3rd party app is suspicious. And disabling that security setting helps!”
Support
Well, that sure was helpful. I bet he referred to Captain Obvious for that one.
Why is it intermittent? Why would it think our application is suspicious? Why would it crash instead of popping up a message if it was suspicious? My line of questioning was to prompt him into getting to the bottom of the issue, but it seemed he couldn’t be bothered investigating, or even logging it with their Development team with this information.
I was also annoyed how he kept on chasing my responses when I’d barely had any time to respond. In his email signature, it said he was working 9-5:30 Monday to Friday, and he sometimes sent emails at 8pm on a Friday. Then I’d also get an email 9am Monday reminding me that I haven’t replied to his last email. If he doesn’t work weekends, why does he assume I do? He has literally given me 0 working hours to respond.
There were even occasions where he wouldn’t chase me by email, but would chase me up with an actual phone call that I hadn’t agreed to. We had put our IT department’s phone number on the Support ticket. I told him many times to contact me by email, and that we could arrange a Microsoft Teams call if we needed to talk. He would then email saying he couldn’t get through by phone. So I would remind him:
“The phone number is for our IT department. I don’t have a direct number.”
Then he would sometimes respond with something similar to:
“We tried to connect you by phone, but unfortunately unable to connect”
Support
Absolute wind-ups.
I find that they always want to arrange calls, even though they end up asking something that could have been addressed by email. They must get reviewed on how many calls they make or something. I don’t understand the advantages. Being put on the spot to give information over the phone isn’t as effective as asking in an email and waiting for the person to have time to acquire the information when they are free. But maybe that’s just my preference? Still, they should respect the customer’s preferences.
At one point, he suggested that the reason why some users didn’t encounter the issue was due to a different “Microsoft .Net Framework” version installed on their machine. I asked him the best way of finding this information out. He replies with the following:
I found the framework version in the error listed in Event Viewer for the affected machine. You can check if they are different by comparing a working machine and non-working machine.
Support
Do you see a flaw in their plan?
A working machine doesn’t log an error in the Event Viewer.
I was invited to a call with some of our users who wanted to explain some of the problems they had in a particular feature of our software. One of the problems was actually an issue with a massively-popular 3rd-party software we integrate with, so they wanted me to log a bug with the 3rd-party on their behalf.
A reasonable request I thought, but I really didn’t want to have loads of phone calls/email chains back-and-forth with support. I just wanted to log it and forget about it. Luckily, I am a developer so I have a good understanding of all the information Support/Development would want.
This company that I logged the bug report with must be one of the Top 10 software companies in the world, so you’d expect their Support to be of a high standard. Because the bug report is still open at the time of writing, I’d better keep their identity a secret; otherwise they could easily look up my name from the bug report and my anonymity would be blown.
The feature is that you click a button, then their software will open a dialog with some data pre-filled in. If their software is currently closed, it still opens the dialog (their software is just temporarily open). In this scenario, when you save/close, then their software is supposed to close back down completely. However, it was getting stuck and you had to use Task Manager to kill it. This problem can be recreated consistently every time.
So I put together a great description of the problem, the recreation steps, the version of Windows, the exact version of their software I was using. I told them our users have recently upgraded from ‘X’ edition to ‘Z’ edition; and I was using ‘Y’ edition and can now see the problem after installing ‘Z’ edition so the problem is definitely in this new edition. Since I didn’t want to provide them with our software to recreate the issue, I even found some of their software with a similar feature and recreated it on there too. But to make sure they definitely didn’t consider it to be a problem with our software, or their other software product, I found another 3rd-party, free-to-use software and recreated it there too.
So I’d say they have a 100% chance of recreating that themselves, and I had proven it was a problem with their latest iteration of software.
Within an hour, I get an email asking questions. So it seems I failed to provide them the information required:
What version of Windows are you using?
What exact version of our software are you using
When was the feature last seen working?
I had literally given him all those answers. So I reworded my original report and put them in bullet points. He then mails back saying he cannot recreate it and would like a demo. Unbelievable. I can recreate it every time; our users can recreate it every time. It’s not a Windows issue because I was using Windows 8 and our complaining users are using Windows 10.
He specifies that the demo must be on Windows 10. Why? Is he just trying to mess me about? You can’t expect someone to have access to various computers. The version of Windows is irrelevant.
Regardless, I accept and I set up a Virtual Machine. It’s like a fresh computer install so maybe it is better than my work computer to prove this is a bug. I put their software on it, even created a brand new account. However, now my user doesn’t have the Premium licence but I accept the trial, so had 14 days. I can recreate it every time. Surely he hasn’t tried hard.
So I arrange the call, share my screen and I demo the feature to him. I ask if he did the exact same thing. He then tells me the main difference was that he was using the very latest version. Is he trying to wind me up? I told him the exact version, he asked for the version, then didn’t use it! Surely if he couldn’t recreate it on the latest version, then he could have checked on the version I was using; then rejected the bug as fixed. Instead, I waste an hour configuring the Virtual Machine and another 30 mins demoing it to him.
He was supposed to send me instructions on how to update because it wasn’t as simple as clicking a button. However, he didn’t bother and I found the instructions after some Googling. He was right that it was fixed…or kind of. I tested it several times and I saw the same issue again. So I tried some more and it worked for maybe 30 times then failed. So it’s intermittent. So I tell him it’s not perfect. He wants another demo because he can’t recreate it.
So I demo the problem and he said he cannot help because he noticed I’m “not using the Premium account” and he is in the Premium support team. I told him I do have a licence, I was just using this account to rule out if it was a problem if you had a large user profile: basically doing his investigation for him.
However, I have to deploy a new Virtual Machine since the settings don’t allow me to sign in (seems to be some restriction from our IT department). So after setting up yet another Virtual Machine and signing in with my work account (with the Premium licence), I do another demo. However, the intermittent nature means the issue didn’t happen and the meeting had already lasted 20 mins. I was pretty bored clicking a button and closing a dialog for that length of time. So I said if I had time, I’ll try and record it happening, but in the meantime, I’ll send him the recordings I had from before (with the Standard account).
After a few days, he says that he is going to log a ticket with the Microsoft Windows 8 Support Team, so I ask him why. It doesn’t make sense when the issue occurs on both Windows 8 and Windows 10. He said it was because it only happens on Windows 8 since I had demoed it working on Windows 10. I told him again that I have recreated the issue with the latest software on both Windows 8 and 10, and using 2 different user accounts. The recordings I sent him were Windows 10. It was just that the issue was intermittent and coincidentally worked during the time I was demoing it.
He apologises for the misunderstanding and would need some time with his ‘senior’ to come up with the way forward. A few days later, he says that his senior wants me to arrange another call to test out some scenarios. I told him I didn’t understand what the plan was. He apologises for the misunderstanding, but there’s some scenarios that we need to test. I told him I didn’t understand what the plan was: am I doing a demo? or am I watching them do a demo? He apologises for the misunderstanding, but he wants me to deploy another Virtual Machine and create a new user and demo this to them. Surely, I needed to know that and set up a Virtual Machine and new user account before we start the call. It would have been a bit awkward if I set up the call, he asks me to demo it on a machine I wasn’t logged into and with a user I hadn’t even created.
I told him I didn’t understand what the plan was. Why do I need yet another Virtual Machine and user account when I have already recreated the issue using 2 accounts, 1 physical Windows 8 machine and 2 Windows 10 Virtual Machines? He said they had attempted to recreate the issue with multiple users and on different Virtual Machines, but since the problem seems to exist on my Standard account, he would like me to verify it on another Standard account.
Let’s recap: when he concluded it was a Windows 8 issue, I told him it wasn’t, because I’d recreated it on Windows 10. It seems that me demoing the feature working on Windows 10 was the one thing he picked up on, and he completely disregarded everything I wrote and said. I had told him on that call that it was intermittent, and that I had seen it fail that same day on the same Virtual Machine. Now we are in a situation where I have told him I recreated it with multiple users, but he is still fixated on that demo where it was working. Not only was that Windows 10, but I was using my Premium account at the time. So now he has this idea that the problem is with my Standard account.
I think I need to take him to a hypnotist to train him to forget the existence of that demo. What’s he going to claim next? That because he saw it working on a Friday, then it’s a Monday to Thursday issue?
My dream of just logging the issue and them instantly confirming the bug was completely ruined.
It’s dragged out for a month or something daft, and I feel like I have invested weeks of time investigating it. This is literally his job, yet I’m doing his work and telling him how to do it. It’s ridiculous.
I keep saying to my colleagues that it seems he is trained to mess people about until they drop the issue, so Support never have to fix anything. It’s a bizarre strategy, but what other logical explanation is there for his behaviour?
I like our system: regardless of whether Support can or cannot recreate the issue, the bug gets prioritised based on what they know. If it is deemed high priority then it gets sent to Development; otherwise it probably gets thrown on the backlog and we might look at it in a year… or never. But I think the key aspect is that we always believe our users, so it does get logged (“the customer is always right”). Well, I guess we do ask for evidence like a video or screenshot, but I had already provided those – and we would never dismiss video evidence.
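To make the contrast with that other support team concrete, our triage flow boils down to something like the following sketch. The names and the two-level priority rule are illustrative assumptions on my part, not a literal description of our tooling – the point is just that every report gets logged and routed, reproducible or not.

```python
from dataclasses import dataclass, field

@dataclass
class BugReport:
    title: str
    priority: str  # "high" or "low", assigned by Support based on what they know
    evidence: list = field(default_factory=list)  # videos, screenshots, etc.

def triage(bug: BugReport) -> str:
    """Route a logged bug. Every report is logged regardless of whether
    Support could reproduce it ("the customer is always right")."""
    if bug.priority == "high":
        return "development"  # high-priority bugs go straight to Development
    return "backlog"          # everything else waits... possibly forever

report = BugReport("Dialog reopens intermittently", "high", ["recording.mp4"])
print(triage(report))  # -> development
```

The key design point is that reproducibility never gates logging; it only feeds into the priority Support assigns.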
As a side note: I wish that low priority bugs were dealt with faster, because users will be discouraged from reporting them if they think we will ignore them for a year. Maybe they could be given to Junior Developers to clear down.
Apparently my employer has recently made a big deal and sold our software to a large group of users, but those users now want to reverse their decision based on other customers’ opinions of our software. I’m not sure why this group of customers only found out about these opinions after making the deal.
We had a meeting with the new Head Of Development to discuss what we can do about it. The overall aim is that we need to repair our reputation: fixing major bugs, improving performance, improving the user experience in general, and making it clear to our customers that we value their opinion.
He initially stated that he wants more bug fixes, especially performance improvements, delivered in smaller but more frequent releases. He states that the current process doesn’t work and that we should ignore it and come up with a new one.
This sounds great in principle, and it is a desired philosophy of Agile development. However, Release Management stated it wasn’t possible to update the entire customer base so quickly. This is partially due to the way some users work, and partially due to contractual obligations. Both problems pretty much come down to customers wanting close to zero downtime: they want to choose a specific time to accept our updates, and they like this to be around once a month.
So the discussion turned to the development side. It seemed we were in agreement that we need higher-skilled staff to target these important bugs (we love throwing Juniors at these important issues), and we would also like to work in domain-based teams. For example, one team would exclusively handle the Reporting module, another the Registration module, and so on. This way you build a fuller understanding of all the configuration, features and the codebase, so you can fix and enhance the software faster. Essentially you specialise, rather than going with the jack-of-all-trades, master-of-none style. We also wanted more Testers, since some teams have produced a lot of fixes recently but the Testers couldn’t get through them. To be honest, I suspected that some Testers were just working slowly, and the fix isn’t necessarily to throw more staff at the problem (just have a stern word).
The response from the Head of Development was that “the business” has decided on the team sizes and who is in those teams – and they won’t change.
So to summarise: we are asked how we can make dramatic changes to the way we work, ignoring the current process… then we get told we can’t do our ideas. We’d shot down his suggestion, then he shot down ours.
I think he shot down our suggestion because he is too focussed on the current process and structure; he isn’t taking his own advice of ignoring it and coming up with a new idea. To be honest, it is easier said than done, and I think his way of thinking is: how do we preserve the current in-flight projects? You can’t simply restructure the entire department without dealing with the current projects.
What you can do is make the current project teams bigger and let them take on both the project and some bug fixes. With a bigger team, they can easily carry on with their project work, while the extra people take on live bugs, fix “Technical Debt”, add more automated tests, and so on.
It confuses me that we get told these “Service Improvement” teams (teams that deal with bug fixes only) are the most important, yet it’s the Project teams that get the most experienced developers and the most attention. He has limited the “Service Improvement” teams to 20 developers, only 4 of whom have Senior rank – and one of those isn’t good enough to be Senior. Most of the other developers are essentially Juniors.
How can you deal with important bugs with a lack of talent? It’s frustrating that he arranges a meeting to ask how we can fix our way of working, but then doesn’t listen to the feedback.
If he wants a quality product and to deliver faster, then we need the staff and process to actually achieve it.