Manager feedback

When it comes to performance reviews, I find it hard to fill in the forms about myself, but it’s even harder to fill them about others. I often think you can write something to describe people’s general approach/attitude to work, and maybe point out a few strengths and weaknesses, but trying to answer specific and sometimes cryptic/ambiguous questions just frustrates me. I always think it should just be:

  1. Strengths
  2. Weaknesses
  3. Other comments

But instead, last year we had to fill in the following about our manager. I think it was generally just scoring 1-5, but with optional comments to justify it. Some questions are easy to answer, but questions with nonsense buzz phrases like “change agent” and “builds internal networks” just frustrate me, and destroy the little enthusiasm I had for filling this in. I would have thought having fewer questions would mean people spend more time thinking about each one, and therefore get more accurate results and better comments. 

  1. Mary leads by example aligned to Our Values. Mary exemplifies values within their own teams and organisationally: Collaborative, Responsible, Supportive and Transformative
  2. Mary is responsible for their team and takes accountability for delivery
  3. Mary provides guidance and support in the setting of performance objectives at individual and group level
  4. Mary manages performance of objectives (individual and group level) ensuring achievement of objectives through reviews
  5. Mary builds an environment where people understand, and recognise, the contribution they make and how their role fits into the wider organisation and its performance
  6. Mary is a change agent, leading change to ensure delivery through people with effective communication, support and results
  7. Mary builds people knowledge and business understanding through regular communication with their people and teams
  8. Mary engages with their team, encourages engagement amongst their team and cross-departmentally to drive solutions through collaboration
  9. Mary builds internal networks to drive and collectively deliver through positive relationship-building and understanding
  10. Mary embraces and encourages collaboration within their own team and between teams managing problems directly
  11. Mary is aware of, and responsive to, the changing needs of their people (own team) and their wellbeing at work
  12. Mary supports the engagement, retention and loyalty of key people as well as providing an environment where new people feel empowered and trusted
  13. Mary is solutions-focused not problem-focused bringing all relevant parties together to deliver
  14. Mary provides opportunity for all relevant parties to contribute, build and achieve collectively whether in their own teams or broader cross-department teamwork
  15. What one thing is Mary doing that is having a positive impact on the team?
  16. What one thing could Mary do to have a greater impact on the team?

Development Environments

This blog is basically stealing “Healthy Software Developers” explanation of “Development Environments – Isolating Customers From Your Changes

Introduction

“Development environments” allow software engineers to view a particular version of the software and control who has access to it. 

If you go back about 20 years ago, there was much less maturity with how changes were rolled out to customers, and every company had a different process. In some companies, developers would just immediately roll them out, straight into production. This might work for the most part if you have a very small and trusted team without the process or contractual requirements. These days, the standard is that most companies need a minimum of three separate environments when developing software.

Even where I work, people used to tell me it used to be much less professional, and without as much legal scrutiny – allowing them to drop DLLs onto a server to quickly fix a bug for a single customer. Now there’s all kinds of people who need to sign it off, and 99% of changes are released as an official version to multiple customers.

Development Environment

The main development environment is a place where an individual engineer can actually work on the software in isolation from any other stakeholder such as a Tester or Customer. Instead of making changes directly to a live website/application, the changes can be run on a local copy on the developer’s machine. The developer can begin looking at the impact of the changes he’s making, and will use test data. Any problem introduced only affects the single developer.

It could be as simple as a copy of a database and a version of the website. A more sophisticated program may have even more dependencies, so need multiple websites or services configured – so you could leverage modern technology like “cloud services” or “containers” to easily deploy this setup. The decision of how this looks depends on the complexity of the software architecture, team size and process.

Staging Environment

The “User acceptance” test environment, sometimes known as a “staging environment”, is another copy of the application, but developers aren’t making any changes to it, and it is also using test data. The purpose of this environment is typically to run exhaustive testing to find any issues before deploying the changes to the end users. 

When companies rely on manual testing, and don’t have modern “continuous delivery” processes, an unreleased version may sit in the staging environment for an extended period of time. In a more automated environment, automated tests can be run, and the version easily deployed to customers if passed. This could take as little as fifteen minutes to an hour depending on the tests, software complexity, and process involved.

Production environment 

The Production environment, or Live environment, is where your customers use your product; so the actual website or application. 

Demo environment

You can have other environments like a Demo environment. This is another copy, but not used by Developers or Testers. The version can be “work in progress”, or a version that is deemed to be ready; but the purpose is to show customers upcoming features and to gather feedback.

Capacity testing

You could create a special kind of test environment for capacity/load testing. This is meant to  stress test the software and see how it’s performing under heavy load. Systems can have increased traffic at specific times of the day, e.g. users all logging in at 9am to start the day, increase in website hits during lunch breaks, more shopping traffic during Christmas period etc. If users are paying to use your service and it goes down; it’s a huge problem. If you sell a product/service on your website and it goes down; then it can cause lost sales. So performing this testing can be essential.

Test Specifications

Many  years ago, when I was a Software Tester, I remember when we had to write a Test Specification based on the work that the Developers had planned. This was for both Enhancements and Bug Fixes (so new features and changes to old ones).

It would be a Word document, with the Item Number, Title, and then a description of what you would test (the description would be a bit more high level than the step-by-step description featured in an actual Test Case).

You would spend weeks writing it, then you had to get it approved by all the developers, or the Dev Lead. The developer often then told you it’s nothing like you imagined so you had to rewrite it. 

Sometimes they would demo the feature to you so you had a better idea. If they had no comments, I often felt that they didn’t read it.

When there was a new Developer who wasn’t familiar with the process, he “rejected” a Test Specification because of some grammatical issues. I think it was something to do with the wrong tense, and not using semicolons. The Tester was fuming, but he was quite a belligerent character.

I think we often worked from the Bug description, or some comments from the Developer. However, quite often, the comment section would be the Developer writing something generic like “test bug is fixed”, “check data is saved“. If it was more detailed, sometimes you would paste the developer’s notes and change a few words – and have no idea what it meant until you saw the finished thing.

The Verdict

I think both Developers and Testers saw Test Specifications as a waste of time. The Developers weren’t enthused to read it, especially when most people rewrote what the Developer provided, and that might not be the full test coverage needed. The Testers should be adding more value by using their existing knowledge of the system to come up with additional scenarios to cover “regression testing”. 

I think the only advantage is to quickly verify that the Developers and Testers were on the “same page”, but that only works if the Tester has not used the developers words and tried to illustrate that they do understand the upcoming features.

I think it eventually got binned off for what we called “Cycle Zero Testing” which was where the developer quickly demoed their changes; which I think was still hated by the Developers but was easier to understand the value, and was a more collaborative between the two roles.

The Price Of Cheese

It’s a non-programming blog again, but this story is too good to share.

One of my colleagues, Andy, bought some cheese for Christmas from a specialist, and his mate pointed out that the same cheese was way cheaper at Morrisons (a popular supermarket in northern England).

So he threatened to post a bad review, and the Cheese Care & Quality Manager at the company gave him an essay about the cheese making process and schooling him how their cheese is higher quality.

So if you are interested in the process of making Stilton, then continue reading:

The Response:

Dear Andy,

Thank you so much for emailing us and giving us a chance to respond to this for you. I can certainly see why you would be both concerned and upset to see this really big price difference.

There are two answers to this question. The first is that the cheese isn’t identical. It is certainly made by the same producer but not to the same recipe. Most of the Stilton made by Cropwell Bishop is made in the main building at their dairy. It is a good cheese made by a talented producer, but the method of production involves moving large quantities of curds from the vat to the draining table using gravity.

The cheese that we buy is their hand-ladled animal rennet Stilton. This is made in a smaller room to the side of the main production area. In this room the cheese is made in much smaller vats and is made entirely by hand. A crucial step in Stilton making is the transference of the curds from the vat to the draining table. This is done by the traditional, painstaking method of ladling scoops of curds, one by one, and layering them up gently so that the fragile curds are as undamaged as possible. This method has a big impact on both the texture and flavour of the cheese.

Here is a photo of this taking place. You can see that it is a very different process to the large movement of curds that would happen in most larger scale cheesemaking.

A tiny proportion of the overall make of the dairy is made to this method, but this is the only one that we buy. We believe that you can really taste the difference and it is very interesting to do a side by side blind tasting. I hope you would agree that the cheese is not the same if you were to do this.

The second answer is that even given this difference, I am still really surprised to see their standard Stilton being sold at this price and can’t help wondering if Morrisons had some sort of promotion going on at Christmas. The Best range Stilton from Morrisons is usually made by the Long Clawson dairy rather than Cropwell Bishop. Long Clawson Stilton is much cheaper than Cropwell Bishop and is definitely considered to be of a lower quality by most in the cheese industry. I had a look at the Morrison’s site today and the “The Best” Stilton on the site is listed as being made by Long Clawson. In fact, I have never seen standard Cropwell Bishop Stilton being sold at this price. Cropwell Bishop themselves charge £20.57/kg when buying a half cheese and £23.68/kg when buying a 454g piece. They don’t sell their hand-ladled animal rennet Stilton online so I can’t make a comparison. 

I have also looked at other online retailers to get a sense of what other people are charging. The Cropwell Bishop standard Stilton varies from around £29.75/kg to £34.95/kg. So, while it is certainly true that Morrison’s would have much bigger purchasing power than we do, I still find the price of £15/kg to be a really surprising one and wonder if perhaps there may have been some trouble with supply from Long Clawson that led them to replace “The Best” range with Cropwell Bishop but charge the same price. However, this is speculation, and I may be wrong. I shall go to my local Morrisons and take a look as soon as I can out of interest. Perhaps they really do usually charge this price, in which case, they are undercutting both the supplier themselves and any other retailer that I know. I can only say that it is a huge bargain at this price and we would not have a business at all if we tried to compete with it. Our cheese is always well looked after in our maturing rooms and is cut fresh to order. It is never pre-cut and wrapped, which has a really detrimental impact on quality and we have confidence that we are supplying the best cheese that we can, both by buying only the hand-ladled Stilton and ensuring it reaches our customers as fresh as possible.

I very much hope this goes some way to explaining the price difference. Thank you for ordering your Christmas cheese from us. I very much hope you enjoyed it.

Best wishes,

Ruth

Cheese Care & Quality Manager

Changes to the Software Delivery Process

One of the problems we have where I work – is not releasing fast enough. When you read about Software Development, you hear of these companies that can release minor updates every week. Maybe it is more of a Web Development thing rather than an Application Development one, but there are also contractual reasons why we cannot release faster.

However, over time, the release time has continuously crept up to 4-6 weeks which causes problems.

If there is more time for each release, it means that the scope of the current release often increases further. For example, if there is a fix that needs to go out within 3 weeks to meet the SLA (Service Level Agreement) and the next release will go out in 4 weeks, then you have little choice but to get it in the current release. If you are checking it in close to the deadline, then you might end up delaying the release in order to test it. The more you delay, the more chance of someone else needing to get a fix in the current release and it grows further.

If there’s many big projects targeting the same release, each team developers in their own code “branch”, then will merge into the Main branch for release. Since it’s not really feasible to all merge at the same time, you end up taking it in turns and resolving any conflicting changes. To be honest, it’s quite rare that we will change the same files for the main feature changes, but there’s certain files with a lot of churn, mainly ones containing XML. To merge in projects, it usually takes a few days, then all the bug fixes on top. The Testers can’t really begin testing until it’s all merged so it’s a lot of overhead to manage the release.

When the releases are large, the Testers insist on running more Regression Tests which increases the Testing phase and can cause further delays.

“I think we spent about 2 months on that regression. It was madness. It was a HUUUGE release”

Software Tester

So smaller releases are much more manageable, take much less time to test, incur less risk, and have lower scope for scope-creep.

Our Software Delivery team made an announcement about this (basically just saying the same things I have just discussed), and desire to plan in quarters but release in a couple of weeks.

In the past, we would scope a single release looking at the features, fixes and minor enhancements we wished to deploy. We would follow a process of merging everything into our main release branch before undertaking testing. This was a two-phased testing approach, integration/functional testing each feature and fix, and then regression testing to ensure pre-existing functions continued to work as expected. We would then spend eight to ten weeks deploying the release through Controlled Roll Out and Customer User Acceptance Testing.
 
This approach brought with it a number of challenges. Merging everything in was time consuming, issues or blockers with one feature would slow down or block other features, regression testing was a challenge, and this also put pressure on the roll out and deployment through pushing out a number of changes in one go.
 
To try and mitigate some of these challenges, we are now adopting a strategy of breaking these large releases down into smaller updates.
 
Working in quarterly cycles we will scope what we wish to deliver over a 12 week period. Each feature will be analysed and risk assessed for size and complexity by our Engineering Leads, and have a business value determined by our Product Management and Commercial Teams.
 
Using this feedback we will then determine an order in which we wish to deliver each feature. We will then merge them into a release and test them one at a time (potentially two if both are small and low risk), before signing over to Release Management to commence deployment.
 
We will then deploy the full scope over a series of smaller releases rather than in one large release.
 
The last update in the cycle will be a maintenance release to address the backlog of prioritised service work.
 
The objective behind this approach is to have our users benefit by taking elements of the release scope earlier than they would have before, whilst also simplifying the testing approach and hopefully enabling us to push code out across the estate quicker.

16 weeks work

Some teams post fortnightly updates on what they have done on our internal social media platform Microsoft’s Yammer (Now named Viva Engage). No matter what they post, managers seem to respond with positive comments.

So here is an example. On team stated they had completed:

3 user stories. 1 dev investigation. Improved code coverage.

Then attached was there code analysis report which stated:

  • 0 Bugs
  • 0 Code smells
  • 100% coverage on 1 new lines to cover
  • 0% duplications on 8 new lines

I never know how to make sense of these stats. The duplications mention 8 new lines, but yet the code coverage mentions 1 new line. It can’t mean it is still to cover when it states they have 100% coverage.

Their video demo stated it was “Sprint 8” which means it’s after 16 weeks of work. They logged into their application and there are 2 buttons (not functional).

My Work

I’ve been working on a game in Unity and have also reached the 16 week mark. I have only been working on it sporadically. Early on, I might have worked a few hours and many days per week, sometimes working several hours at a weekend. Other weeks, I have just worked on it for 3 hours or so in one afternoon only.

I have written loads of code for the AI state machine. Your units can automatically move in a formation, break away in certain conditions. Then the defending group of units have their own behaviours, and work together in their own way. Some behaviour is configurable so I have the logic and UI for the player to change it. There’s controls for the cameras, game speed, pausing, save and reloading. A few options for volume and a few graphical effects. There’s some miscellaneous aesthetics such as animations, particle effects.

I am a single developer working fewer hours, and I have to play-test it myself. Compared to a team with a few developers, a few testers and different managers assigned – all working a 9-5 day, 5 days a week; and all they have is a menu and a title banner.

Hackathon Progress

Another comparison is to compare the progress of developers during a “hackathon”, or “game jam”.  Developers can often put together a prototype, or a fairly fun game within a few days…

Developers during hackathon: We built an entire application in just 3 days. Developers after hackathon: Adding that icon is going to take 3 weeks.

Mark Dalgleish

Conclusion

Recently, I’ve been thinking about this a lot. I think if you start with no code, then any code you write has a much bigger impact. You also don’t have to work out how it works alongside other code.

So the more code there is already, then the longer it takes to write. So if the Hackathons are productive, then you can write something fast. Same situation with my Unity game, and since I am working on it myself, then I have no one slowing me down.

The other reason why working on my own is great is that I can set the process. Due to the 90/10 rule, I think it is best just to get it 90% done, then move on. If you change your mind of how something works, it isn’t as big of a time waste. If it turns out to be good, then you improve it later, alongside other features when you are changing code in that area.

So going back to this original team who did 16 weeks work and got nowhere – even though it’s a new software product, and should be able to code fast – I think they are slowed down by their insistence that everything has to be perfect and has all the extras like Unit Tests. So they could get a feature 90% complete quickly, but then they are told to spend several days perfecting it for that last 10% of quality. Then they are messing around with non-functional stuff like the Deployment Pipelines and miscellaneous processes. When you are months away from releasing, I don’t see the point in dedicating so much time to how you are going to deploy it, so I have been 100% focussed on features and gameplay.

I think this could illustrate how process can severely slow you down, and striving for perfection can also lead to inefficiencies. I think my idea of “90% quality is good enough” could be a winning approach, then you can always perfect the features if you are ahead of schedule. If I keep working on my game, I will worry about how to release it when it is close to a releasable state (ie how to get your game listed on Steam).

Printing Licence Key Expiry

My employer made the news recently after the deployment team applied the wrong config to several organisations which led to a heavy spike in network traffic and caused issues nationwide.

Sadly, we made the news again.

“We couldn’t print our forms for three hours. Someone had goofed and did not keep up with their subscriptions. For a company of their size; that is embarrassing.”

Customer quote

When users reported the issue, there was a call between a few managers. I was invited to the call with zero context by someone I had never spoken to before, so I thought there was a good chance they invited the wrong person. 

The call was recorded so I just listened to it when I was free. They had invited me on the suggestion that I had worked on a feature in that area many years ago, and they had no other ideas. Also on the call, they called someone else and remarked how strange it was to the recipient because they had never spoken to him before. Why didn’t they learn and send some context? He didn’t join either.

Eventually, they found someone who belongs to the team that procures the licences. He explained that they purchase licence keys for this printing software, then send them to an internal support team to update the licence keys in the database. The team receives automated emails reminding them to renew the licence keys 3 months prior to expiry, and they act on it quickly to not risk them expiring.

“It takes a while to go through the purchase process, so I usually do it early, but sometimes it can “fall through”

Procurement guy

After going through some emails, they found the key had been promptly purchased, and the licence keys were sent to support, then they were not applied.

Another guy joined the call and said a total of 43 organisations have reported errors, but that’s the only ones we know about due to direct complaints we saw via Facebook.

“And then I need to understand exactly how this is happening, ‘cause this is the second time in two weeks that a licence key was sent to support and wasn’t applied.”

Angry manager

After the issue was resolved, another manager asked for a summary of the issue. One guy remarked

“just to say, the people invited to the call (16 invited in total), are not the ones that ended up being involved or resolving.”

Manager, reflecting on what a mess the meeting was

I don’t understand how all our departments are causing chaos all at once? We don’t seem to learn our lesson either, how can we make the same mistake twice in two weeks? It’s also been a problem for years – that we struggle to invite the correct people to major incident calls so issues take much longer to resolve.

The Outage

Managing the Public Relations

We recently featured in the national newspapers due to an outage with our software, and it was surprising that our system was down for around 12 hours. Usually we respond to outages really quickly, and get the system back online after a few hours max.

I sent the article to a few of my colleagues and they were unaware of the problem, so internally we seemed to be keeping quiet about it.

We had put out a message to users, but from what I saw in the aftermath, we initially claimed it was just a specific region that was affected, before others via social media had stated it was affecting people nationwide, much to the derision from our main rivals, who posted multiple social media posts to insult us.

They deliver complaints, weekend working, major incidents, stress, insomnia, and hypertension, to name a few. Do their rivals deliver?

Customer quote

All Hands

5 days later, the Directors arranged an “All Hands” meeting with only a few hours notice. So that caused panic amongst many colleagues because of the implied urgency. It was led by the CTO who was announcing changes to our deployment process, but I’m not sure who he consulted with. I would have thought he would have learned from the last time he announced changes to our release process – when we all ripped his idea to shreds on the same call, stating several technical and some legal reasons why we can’t possibly work that way. A much better approach would be to state the problem and let the experts discuss how we can solve the problem, but yet again, he seems to have had a meeting with other Directors and took the attitude “this is how it’s going to be from now on”. But then we all pushed back and told him how it wasn’t possible.

I love it when situations like these occur. The drama makes you want to grab the popcorn, sit back, and enjoy the show. Hearing all the different points of view and trying to understand the complexities of why things are the way they are – is super interesting. I think when you are told to follow a Process and you don’t know the rationale behind it, it makes it feel like a chore. But focussing on a problem and trying to solve it by forming a new/modified process is something you can get behind.

So like I said, not learning his lesson from before, he states what the new process is, and only gives a wishy-washy explanation of the problem he was trying to solve. After we all bombarded him with questions, the truth of the matter came out, and some of his ideas wouldn’t solve this particular problem that caused the outage. So it seemed he was adding unnecessary processes, slowing us down and causing more stress. We are already releasing way too slow, so then us Software Developers are just chilling because we can’t get things out that we do fix/implement.

He described this situation as a

“major incident that impacted the whole estate, attributed directly to a failed Change. We recognise that the change was not intended to have the adverse impact that it did, but sadly the consequences have been a major blow to Users and us. Therefore, we are seeking to create immediate stability across our estate, and are implementing several amendments to the way Technology Changes are approved and implemented, as follows;

1. A comprehensive change freeze for the month of June, with only changes meeting enhanced criteria being passed for implementation.

2. Pre-approved changes are suspended

3. Any changes submitted for approval will require TWO members of SLT. 

4. Implementation activity must be witnessed by two or more staff members. Screen sharing technology should be used to witness the change. No additional activities are carried out that are not explicitly in the documentation.

5. All changes must have a comprehensive rollback plan, with proof of testing. The rollback plan must be executable within 50% of the approved change window.

Backout plans will be analysed by the Head Of Development to add a more technically-informed scrutiny, and assess the risk. Further budget will be allocated to DataDog, for more visibility in problems with our servers. A dedicated team will improve how we alert internal teams and customers in the event of an incident.

What does a “Failed Change” even mean? We deployed a database change that failed to apply due to a key constraint violation? That a new feature or bug completely broke a feature? It’s very vague: Then what does “We recognise that the change was not intended to have the adverse impact that it did” mean? Is that just stating the same thing? You aren’t gonna intend to have an adverse impact. It sounds like the sort of nonsense you put out to non-technical users just to blag an apology. Why is he saying this to the Development department? How are we supposed to understand the process and help come up with something better if he is being deceptive?

So as I previously stated: “After we all bombarded him with questions, the truth of the matter came out”, so I’ll try and explain by piecing together the information I did gather.

Years ago we attempted to replace a module and this was feature flagged. But it turns out it wasn’t significantly better than the thing it was replacing and has its own set of issues, so some users are using this feature (V2) and others used the old one (V1). There was an experimental feature which hoped to improve performance for both V2 and V1, but it was put as a sub-feature for V2 in the user interface of the configuration tool. So you have V2 on/off, then if it is On, it looks like there’s then an option for Experimental Fix on/off. When the Deployment team came to turn on the experimental feature, they saw that the main feature (V2) was set to Off for some of the users, so they set it to On. Since enabling V2 would then download loads of data to be in the V2 format, this caused a massive impact on the network and so actually affected everyone that wasn’t impacted directly by the configuration change.

So would the CTO’s new process stop this happening?

New Rules – My View

“There will be a comprehensive change freeze for the month of June, with only changes meeting enhanced criteria being passed for implementation.”The size of the release wasn’t the problem, so cutting it down won’t solve anything. It might annoy the users even more if we then delay features that we announced.
“Pre-approved changes are suspended”Not sure of the full scope what is covered by the “pre-approved changes”. But the idea of a “pre-approved” change is that it is something that is often run on the live servers to fix common issues and is low risk, hence it is pre-approved (eg the ability to restart a crashed server/service.). This is just going to annoy staff members in Deployment. The CTO also remarks:

 “Preapproved changes are wonderful. They have been reviewed and tested to death. My goal is to increase the number of preapproved changes in the future. It’s just with the existing ones, we don’t know if they have been reviewed or not”. 

You don’t know if they have been “reviewed” but they have been run 100’s of times, and never caused an issue. So you are temporarily banning them on the grounds that they could cause an issue?
“Any changes submitted for approval will require TWO members of SLT. ”How many times has there been some kind of approval process and the people with authorisation are too busy or on annual leave? Why are we going from 0 approvers to 2? Would the managers understand a change to enable a feature for users belonging to company A, B and C? Would they go “hang on, C don’t have the main feature! I’m rejecting this”? It’s going to be a box-ticking exercise. 

We already have a problem when changes are Code Reviewed by Developers – there’s not enough “expert” people that can review it in the required level of detail. So how would a manager understand the change and technical impact? It will be more like “does this make us money? Yes we like money”; approved.
“Implementation activity must be witnessed by two or more staff members. Screen sharing technology should be used to witness the change. No additional activities are carried out that are not explicitly in the documentation.”This might actually help, although might be patronising for Deployment. The CTO made a comment on the call about having “Competent” people involved in the deployment process. So if a Developer has to watch a member of Deployment click a few buttons; it feels like babysitting and not respecting them as employees.
“All changes must have a comprehensive rollback plan, with proof of testing. The rollback plan must be executable within 50% of the approved change window.”The rollback idea is one of these ideas that sounds logical and great in theory but this is the biggest concern for the technical people in Development as we will discuss below.
My opinion

The Developer Feedback

On our current change approval form, there is already a section for rollback plan, but this isn’t properly filled in and is often just “revert change”. So one developer said all the change needs to be is:

“people should be ensuring they have a documented their tested process for rollback”. Then also added “In the case of feature-toggling. What happens when you do that? Do the users have a cache? How do we clear the cached data for all users? Can you force everybody to turn that feature off by forcing them to log out? If the change causes a performance impact, we need to be sure that turning something off actually turns it off and stops any performance impact right there and then”.

One person pointed out that the majority of developers are in India now, and due to the time zone differences, they aren’t around for the current Change Approval Board meetings, or available during when we deploy.

Like many companies, we love deploying Friday nights because that is considered downtime for most of our users. The problem is, staff that would be useful to have around – don’t work weekends unless they are paid to be on-call. Getting people to deploy late probably increases the chance of mistakes. Maybe we should deploy earlier in the week if possible. At least don’t deploy on a bank holiday weekend because that is even worse for staff to fix any issues.

Deployment staff member

Our test databases have probably less than 5% of the data that our live databases have, so testing SQL is always a problem for us. What runs in 1 second could easily run for a minute with the scale of data. So if we come up with a rollback plan, our timings won’t be accurate at all.

Databas Expert

With our C# code, you can easily rollback the version by redeploying it. The problem is the corresponding database changes. Our database patches are designed for sequential application, e.g. one script can create a table, a later version may add a new column. You can’t just re-deploy the old version to get back to a state where the column doesn’t exist. You have to write code to do the reverse – delete the column. What if data has been added after it was deployed and before you realised the feature doesn’t work (and you need to rollback)? Do you just lose the data and tell the users that it is “tough luck”, or do you archive the data, apply the revert patch? Then later fix the issue, reapply the original patch, then run another script to apply the archived data? It’s much more feasible just to disable the feature, apply a fix then turn it back on. [We recently employed a new guy with the job title “Technical Director” and he was adamant a rollback had to go back to the original state at all costFix forward is not a valid risk mitigation strategy.”. He doesn’t understand how we work and is making these technical decisions.]

Another developer reiterates that point to tell the new Technical Director that he is talking nonsense:

When a change is implemented Out of Hours for the user, any problem may only emerge when users log on in the morning. If we roll back at that point, any overnight transactions potentially will be lost. This includes transactions on inbound interfaces from 3rd party systems. Rolling back at that point may be even more disruptive than hot-fixing the problem. How are we to handle this, in terms of process management and policies?

We have many products, especially with all the mergers/acquisitions we have had over the years. Each with their own processes.

[The CTO and Technical Director were talking as if we were all using the same process and tools which wasn’t the case.]

One concern was that when it comes to our Mobile apps, the likes of Apple and Google have strict update procedures, where you often have to submit the update which is approved in their own time. I don’t know how rollbacks work in this process. I think the implication was that feature flags are often used so they stay on the same version, but remote configuration switches a new version on.

Mobile developer

The rollback idea seems to be that if the change took 20 minutes, then it should be able to rollback within 10 minutes. Why? I am not sure. Maybe it is just a rough guide to be “fast”. But doesn’t this assume that any failure has been identified within the change window of 20 minutes? It might take a while for the users to run into the scenario that causes such failure. If we patch out during the early hours (1AM-3AM) then the user works 9-5PM, then there is a time delay. Our main patching process takes a long time (possibly 1 hour, large file downloads over a slow network), so I’d expect a rollback process to also take a long time. If users are actively using the system and it’s only a specific feature that is a problem, then you don’t want to take the entire system down to patch. You want a Zero-downtime fix.

We need to be specific of what indicates a “failure”. They were talking about using Dashboards to look for evidence of failures but there can be cases where a feature doesn’t work but it doesn’t log an error. Successful server calls might work. Also, if there is a problem, which rollback plan do you execute? If you have several features/bug fixes in the release, do you rollback it all, or do you spend time investigating which feature it is, then roll that back only.

We have a tendency to release with features switched off and then enable them later. Many times, the implementation of new features causes you to change existing code, which then may contain bugs. So even when the feature is off, it can still cause bugs. So in this case I think we need to go through the Change Approval Board for the initial release, then again to actually switch it on.

As part of the rollback process, have teams considered the overall impact that this will have on estimates of deliverables, overall development time and testing time, and how we are going to manage customers expectations. Depending on the project/feature this could add an additional 50% development time to create the rollback and to complete the testing of the roll back.

Tester

After we made all these points the Technical Director just dismissed all our concerns. 

“I think I’ve been clear on what a rollback plan is and what the expectation is of it. The onus is on the change proposer to demonstrate they have a Rollback plan”

Technical Director

Well, you’ve been clear of your expectations. We have all been clear that it simply isn’t possible to implement.

Architect

I think we need to change our architecture to support such plans. They want instant process change. But this needs to be planned in.

I asked a Software Architect what he thought:

“I don’t like how we’ve had a bunch of clowns that know how nothing works come up with a plan which makes little sense and is impractical, and opposed to much of industry direction. Patch forwards is the way with databases, supporting rollback of code changes with tech like containers is the way to go with that”

Architect

The Next Release

Since this process affects the next release, the team that is involved in the biggest feature raised their concern. Their change is super risky and therefore they will need to start spending weeks coming up with a back-out plan and therefore will probably miss the release. The Technical Director said that it is:

“going ahead – but with the additional scrutiny and bolstered processes we’ve detailed. This isn’t about stopping – it’s about wrapping our change activity in more certainty. I’d personally like to understand “high risk” here though.”

Technical Director

I think this illustrates my initial points. He will be one of the people involved in signing off on changes, and here is a proposed change without a rollback plan, and he is already stating he is willing to sign off on it. Then the Lead Developer has to convince him it is actually one of the riskiest changes we have made this year.

“I’m not sure how it isn’t a high risk change. It’s massive, wide-ranging, and there are so many things that can go wrong. We are working hard to mitigate the risks but the risks are still there.”

Lead Developer

Since we had a project going into the next release, we arranged a meeting with the Head of Development (in regards to the statement “Backout plans will be analysed by the Head Of Development to add a more technically-informed scrutiny, and assess the risk”).

When we explained the difficulty of coming up with a Rollback plan, he then says “why don’t you just add a feature flag“. So we told him we had but it was unacceptable. The Tech Director and CTO had made that clear. He said he was fine with it, because it doesn’t make sense otherwise. So the Head of Development that is supposed to be scrutinising these plans even thinks this rollback plan doesn’t make sense. Brilliant.

Feedback

On our next meeting with the CTO, he asked how we had been adapting to the new process. 

One guy said that there was a server that was down for 3 hours and all it needed was the service to be turned back on. Due to the fact that they took away the Pre-Approved changes, they weren’t allowed to turn it on until they got 2 Senior Managers to approve. They only could contact 1, and after asking him if they could go ahead and turn it on anyway, he refused because they had strict instructions that they couldn’t deviate from this new process. The CTO was outraged. He came out with this amazing hypocritical speech.

It is essential that people “engage brain” before following process. The idea that we can deliberately leave customers without service because some process says so - It's crazy. And if anyone disagrees, come and argue with me. This is incredibly frustrating for us as an organisation where people follow processes to the detriment of providing good service to our customers. Now I understand that there's no malice or deliberate reason to break something by following that process, but as you explain it, I can't imagine anyone on this call saying “Oh yeah, that was the right thing to do”. If we have to write a process that means people don't have to think, then we've lost the battle already, and people need to be able to think. The challenge I have in front of me is our customers, and their customers are not able to be helped. I can press this button and it will fix things, or I can wait until some process completes and then I'll press the button. At the very least an emergency phone call should be made to someone senior enough to make the decision and that should have proceeded straight away. We have heard of other examples of this. I want you to see how serious I am about how upsetting that is. You know, we employ intelligent people with a brain. We can't get away with automatons that just follow process. If we do that, we'll just automate everything. We need people that are intelligent enough to make good decisions, and when they can see that what is happening is not right, they escalate it until they get an answer.

I thought he said it would be a sackable offence if we even slightly deviated? Even his Senior Management was strictly following the CTO’s process and refused to authorise it when escalated. We told him in advance it was a bad idea to get rid of Pre-Approved changes. We told him it was a bad idea to require 2 approvals from the Senior Managers. Didn’t listen. Refused to “engage brain”. Caused another outage. The fact that the CTO admits it’s actually happened a few times as well is hilarious.

Another team said they weren’t invited to the first meeting when the change got announced, then trying to follow the process delayed them 6 hours. The CTO admitted that one was an oversight on his part.

Another team said they were trying to do a demo to a customer but they needed a configuration change applied. Even though it was a “Test System”, it had some dependency on the live infrastructure and so had to go through this new change approval process. They delayed their demo and looked a bit incompetent in front of the customer.

A Technical Manager reiterated the point about not being able to rollback the database changes and his team had wasted hours in meetings discussing how it could even be remotely possible. The CTO then said we were being “hung up on terminology” and we know that the ultimate aim is to get into the previous state, so it doesn’t matter if we call it “rollback” or “fix forward”, it just needs to work. But we were telling him there is massive difference. A rollback literally puts you back into the previous state. Fix-forward means you add extra fixes on top of the thing that was broken so they are left with the new feature. Rollback has the concern about losing data that was added since you upgraded the version, then rolled back. It’s a completely different approach. Then the CTO and the Technical Director said that we can “do what we want as long as it gets us in a working state”.

So we can fix forward then? And we can deviate from the rollback plan?

Another person asked if we “will we go back to normal mode after July”. I think they miss the point. Surely the process is there to try and prevent major incidents like this from happening again. You don’t just do it for a few months then stop.

Refinement to the process

After a few weeks, they realised that having a limited number of staff that can approve changes was a bad idea, especially when most of them had a trip to India booked for the week. So they increased the pool of people that can approve.

Another week went by and they realised they had way too many approvals to go through, so knocked the number of approvers down to 1.

Another week went by and they realised most of the approvals were formally the Pre-approved changes, so they got reinstated.

We are almost back to our old process!

Conclusion

I think I have lost all confidence in the CTO, and I think the new Technical Director is just as dumb. How can you make decisions not involving the people actually involved in the process, and then reject their concerns when they tell you how stupid it is!? Then when you then get evidence how stupid it was, you then blame your staff and tell them they need to “engage brain” and to not follow your process if it wasn’t going to solve the problem of getting our systems up and running. We have been in the industry years, and haven’t had many instances of long term major incidents. I think under the new CTO’s leadership, the number of major incidents has increased dramatically. I don’t think it’s necessarily down to decisions he has made, but some of the ideas he has come up with can’t help.

ISO Project Audit

We were being audited on a project we completed about a year ago. But due to our slow release process, it took about 4 months to go out. So I think that’s why we were chosen because it fell within that year period. I suppose they weren’t auditing us for speed, but I thought it would be a bit embarrassing if they actually saw the project’s start and end dates.

We couldn’t remember much specific details about the project, but our software tester just freestyled a speech about the Agile Development process: so spoke about requirements, User Stories, writing Test Cases, continuous integration; and the auditor seemed to be fine with his content.

I might have only said “hi“. Can’t quite remember. But the auditor was impressed with my knowledge.

“The external auditor was very impressed with the knowledge and understanding from all attendees during the day. I will share the report within the next couple of days once it is available for me to download.”

Project Manager

I do wonder if ISO Audits actually mean anything if you can blag your way through it. They may have been more interested in the Project Management side of the process rather than the Software Development.

Performance Review 2023

Last year, we felt slightly aggrieved that our employer made out they would match inflation with our pay rises, but then disguised it with: a one-off profit share, pension payments, and optional benefits. Our wage only increased by 2% but inflation was currently around 7% and predicted to rise further. Better than nothing though, so can’t really complain.

With inflation being 10%, we wondered what we would get this year. Then came the communication that stated that they understand inflation, and remarked that we are still highly profitable so can cover the costs:

“Further to our communication at the end of 2022, I can now update you on our remuneration approach for 2023. We know and understand that employees are continuing to experience cost pressures and we remain committed to ensuring that the overall pay budget for this year reflects the continued external inflationary environment we are in. Whilst this has been possible due to the ongoing success of the business, it will require us to again manage our costs in 2023 carefully.”

HR Director

However, then they reminded us that they gave us: 

  1. an extra £200 one-off payment
  2. Increased leave for parents
  3. the option to work bank holidays and take a different day off instead.
  4. they will give us a one-off profit share just like last year

So with them clutching at straws for things they have done, you know that means they weren’t giving us 10%. Maybe we could get 5% as a compromise then?

“The exact amount of any increase to an individual’s salary and timing will be agreed locally, as normal, and will be dependent on a number of factors including performance. I believe these measures reflect our responsibility to be aware of the external inflationary environment, our immense gratitude for your efforts over the past year and our desire to become an employer of choice.”

HR Director

What did I get?

0%

immense gratitude for your efforts over the past year” 🤨

I’ve mentioned in previous blogs that judging performance is very subjective and often bent to fit a narrative, and I got completely screwed. So let’s get into it:

The Review

“You had an average year in 2022, working on only a couple of projects, one where you worked as an individual contributor then moved onto a pair project”

Well, that seems to imply just being on 2 projects in a year was a bad thing, but that’s not really my fault is it? Since I single-handedly delivered one project, then was in a team of 2, surely that should show I can handle responsibility and pressure?

“You were good in many technical aspects but were not able to step up to the plate as a leader in the team.”

I was in a team of 2 as a Senior Developer, and I got really good feedback from that team member. What more do you want?

“You were given an opportunity to lead on defining our interview process. I expected you to lead on this but have not shown much interest in it. Ultimately someone else led it, creating the interview pack, redefining interview questions and leading live interviews, which I think is a missed opportunity to shine.”

I blogged about this process. It was a mess, but at least I tried my best to direct it. We created an interview process for Graduate Developers from Indian universities who were using Python, but set them C# and Object-oriented questions…then asked questions on Software Testing which they also don’t do. I was pretty clear we needed to focus on the people we were targeting (Python and Web programming) and not on what the job was (C# Developer); (or even better, change the people we were targeting – but that was already decided upon). I got outvoted by the other Seniors. I felt I couldn’t lead in the interviews because I didn’t believe in the questions.

“I can see that you might not see Unit testing as a productive way forward, and the target set might not be accurate”

In recent times, managers have been demanding we achieve 80% test coverage and I have highlighted how poor many tests are that people write, and the 80% is a random target. It’s not improving the quality of code at all. Due to the fact I wrote little (but high-quality tests) I got criticised and talked about between senior management. I raised in my review that I believe managers now see my output as low quality, so I will need to repair my reputation.

I also raised that I was annoyed at the frequency of team member changes. My projects had Testers switched in and out, we changed Product Owners, Software Architect, and there was talk of more changes that didn’t happen. Colin then stated “I never raised my concerns at the time”, but I definitely did.

Some of the comments received from your colleagues:

“I think that he has been a good technical lead. His knowledge is excellent, and this has been shown in how he has been able to help me with any technical problems. He has constantly been available, has a great attendance record, and will defer annual leave for the good of the project. His code review comments are always clear, concise, and helpful. He is always available to help with Testing as well.”

“He has a good insight into what he thought was and was not possible”

“He needs to make sure that just because an issue is raised he doesn’t jump on it right away as there is a priority order for things”

“He’s reliable and just gets on with the work”

“Can be a bit quiet on calls at times but whenever I specifically call him out in calls, he would answer questions”

“At weekly meetings there wasn’t enough substance in his updates. He doesn’t come across as a leader to me or perhaps he just did not want to be at those meetings.”

I definitely got good feedback from people but then the positive ones were ignored in the discussion. I had good arguments against some of the negative points as well. I think I only picked up non-prioritised work when I had finished the work that was. What do you want me to do? Just sit there idle? I often knew it was issues related to recent work so it was obviously gonna be the highest priority, and as Team Lead, I think I can make that call myself. 

I’m always annoyed when people criticise me for being quiet on calls. If enough people are saying the things I am thinking, then I don’t need to speak. If no one has raised the issue I am thinking of, then I will provide it. Then, if people specifically ask what I think, then I will say, even if I am repeating what others have said. That final feedback was about a meeting similar to a Scrum of Scrums where many teams get together. But the meeting was mainly for Colin and he rarely attended. So I had the attitude of quickly going through my points then moving on. I thought the wording was harsh – they have no idea what it is like to work with me directly.

I have no idea how they can justify giving me nothing. That’s like a 10% pay-cut with today’s economy.

Colin isn’t my manager anymore, so let’s hope my new manager will actually put things right.