Elon Musk fired 6,500 employees at Twitter. A little birdie told me it’s down to:
– 2 designers
– 6 iOS developers
– 20 web developers
– Around 1,400 sales and operations people
How is it possible that we are still using this website? Two words: Parkinson’s Law.
Have you ever wondered why seemingly simple tech companies have tens of thousands of employees? Sometimes, it’s because they have huge sales forces or tech support/operations people. But often it’s also due to Parkinson’s Law.
Parkinson’s Law is like lighter fluid for bureaucracy. It’s a business tapeworm that slowly eats away at companies, making them less efficient and less innovative over time. Parkinson’s Law is the idea that work will expand to fill the time, budget, and headcount allocated to it, and no matter how many people you allocate, those people will feel busy. They’ll feel busy because, thanks to the excess time and slack in the system, they’ll start focusing on less and less important tasks.
Here’s how it manifests on an individual level: Let’s say you have a report due in a week. The report might only take you around five hours to finish if you really focus and work efficiently. However, because you know you have a week to complete it, you might find yourself spending a lot more time on it than you need to. You’ll be more prone to distractions, take longer breaks, or perhaps decide to add more details, tables, graphs, and so forth. Essentially, the task becomes more complex and time-consuming purely because you have more time in which to do it.
And here’s how it manifests across organizations. Imagine a big social media company with various departments, each with tasks it must complete to contribute to the company’s overall output. Suppose each department is given a budget and a set amount of time to complete its tasks for the year. According to Parkinson’s Law, each department will use its entire budget and the entire allotted time, even if the tasks could have been completed more efficiently, because as resources and time increase, departments tend to become more complex and less efficient. A department might add more steps to its procedures, requiring more approvals and creating more paperwork, which slows everything down. It might spend the full budget on additional personnel or equipment that doesn’t necessarily improve productivity. It might also spend the full budget simply to justify the same or a larger budget for next year, since budgets in many organizations are determined based on the previous year’s spending. This phenomenon is known as “budget padding”, or a “spend it or lose it” mentality.
Inefficiencies can also develop in staff allocation. If a department expands, it might add managerial positions that aren’t strictly necessary. More employees are hired to manage, creating layers of bureaucracy that may not contribute to productivity and can even slow decision-making. I have seen this occur over and over again in my career. The larger the team, the larger the budget, the longer the timeline, the less gets accomplished. I’m very curious to see how many more tech companies come to this realization.
Recently, I blogged about how managers love any excuse to visit our office in India. Afterwards they write a blog post about their experience, stating how important face-to-face collaboration in an office environment is… before returning to the UK and telling us that working remotely from home is the modern way of working and has no impact on efficiency.
They actually spend most of their blog post writing about the local cuisine and the landmarks they saw, so it’s definitely a holiday and not a work trip at all.
I also wrote about The Expo, where the entire UK side of the company travelled to one location to watch a day of in-person presentations (which we could have just watched remotely, like we normally do). Then, when it’s back to “business as usual”, managers tell us to find ways to save money and remind us that we want to become a carbon-neutral business.
So after dumping loads of money into travel costs, hotel expenses, venue hire and catering for the Expo in the UK, they decide it would only be fair to host a similar thing in India… which means getting all the directors and senior managers to fly over there to do the presentations.
Obviously they used the opportunity to post a blog about the importance of face-to-face collaboration, Indian landmarks and cuisine.
Key phrases from their blog are as follows:
The India Office
“I am amazed at how much we were able to accomplish”
“India greeted us with its vibrant energy and diverse cultural heritage”
“The workspace was a fantastic environment, promoting team collaboration and productivity”
“Witnessing the teams working closely together was inspiring, and the entire place was abuzz with creativity and a real growth mindset”
“The office boasted excellent facilities, including communal work areas, private group session rooms, a gym, nap rooms, massage chairs, a food court, and garden”.
Expo Day:
“The Expo day itself was an exhilarating experience, with a buzzing atmosphere and a large number of attendees”
“Representing the team on the stands was a humbling experience, as engagement levels were high and the audience had a deep understanding of our work, asking probing questions around aspects of safety, governance and our products.”
Cultural Experiences:
Visiting the UNESCO heritage site at Mahabalipuram allowed us to witness the interplay between Hindu, Chinese, and Roman architectural styles in this historic trade centre.
Learning about the story of Draupadi and understanding the long history of international collaboration.
Our visit to the DakshinaChitra cultural heritage site highlighted the vastness of South India and its rich diversity.
Meeting the skilled craftsmen and hearing them describe their trades first-hand provided a deeper appreciation for the diversity of people and their skills across the country.
We learned about different rice varieties and cooking methods for Biryani, and got some amazing, flavoursome vegetarian dish suggestions.
Here is a collection of a few fails by our IT Department.
PC shutdown & Usage Monitoring
Even though we work from home, we still have some PCs in our office that we remote onto. Certain systems only seem to work on the physical network, so people often call such a PC a “jump box”. Our IT department was planning to temporarily move our PCs while some electrical work was being done in the office. I was invited into a Teams chat which was supposed to include everyone affected. After skimming the list, I spotted 3 people who were missing, and other colleagues spotted more: 9 people were missing in total! How do they not know who owns the PCs? They have been citing “increased security” in recent times, yet surely it’s a security risk if they don’t know who uses the PCs on their own network.
More recently, I was contacted again via email asking “if you use this PC”. Again, why do they need to ask? Surely they should know, especially as they have installed extra network security tools recently; I thought they had said the software monitors network traffic and alerts on anything suspicious. If they don’t know, isn’t that a security concern in itself?
Upgrading Software
I was contacted by IT saying my SQL Server version was no longer supported by Microsoft, so I needed to upgrade it urgently, by the end of the week, because it was considered insecure. They said to reply if I wanted an installer. I thought it would be easy enough to locate the installer myself, but Microsoft’s SQL Server download pages are very confusing, so I replied asking for it. They ignored me. I replied again; they ignored me. Months have gone by. So not that urgent, then.
IT then announced that they were taking increased security measures and removing all admin rights from our PCs. Now we can only install software with their permission. They also said this ensures we can’t install unlicensed software, since it’s easy for someone to install something that is free for personal use but paid for commercial use, leaving the business liable.
A week later, they emailed us saying there was a known security vulnerability in our version of Visual Studio, so we needed to update it. Except we can’t: we need admin rights to keep our software updated and secure! So now we have to log tickets, and then they remote on and type in the admin password to proceed. I bet they love that.
In a similar fashion, they are more fussy with USB devices. They sent one of my colleagues a new laptop but it rejects his smart-card reader which he needs for testing. Can’t be plugging in USB devices these days.
Saving Money
They also said they wanted to be more stringent when it comes to licence keys, as we are seemingly notorious for purchasing more licences than we need, then stopping using the software but continuing to pay. I was contacted in early July 2022, saying that I had had a Jira licence for the last year but had not been using it:
We currently purchase a licence for you to access Jira. We understand a lot of the users will have now migrated to Azure DevOps and as such, your access may no longer be required.
May I kindly ask you to respond to this email by 12pm Friday 8th July confirming whether or not you continue to require access?
IT Email
So I replied saying I wasn’t using it and didn’t think I had used it for 2 years. I was then contacted again in February 2023 saying the same thing. I confirmed that I didn’t need it. Then I was contacted again earlier this month asking the same question. So I’ve now had a licence for 3 years for a product I don’t use at all.
We recently featured in the national newspapers due to an outage with our software, and it was surprising that our system was down for around 12 hours. Usually we respond to outages really quickly, and get the system back online after a few hours max.
I sent the article to a few of my colleagues and they were unaware of the problem, so internally we seemed to be keeping quiet about it.
We had put out a message to users, but from what I saw in the aftermath, we initially claimed that only a specific region was affected, before others on social media stated it was affecting people nationwide, much to the derision of our main rivals, who posted multiple social media posts to insult us.
“They deliver complaints, weekend working, major incidents, stress, insomnia, and hypertension, to name a few. Do their rivals deliver?”
Customer quote
All Hands
5 days later, the Directors arranged an “All Hands” meeting with only a few hours’ notice, which caused panic amongst many colleagues because of the implied urgency. It was led by the CTO, who announced changes to our deployment process, though I’m not sure who he consulted. I would have thought he’d have learned from the last time he announced changes to our release process, when we all ripped his idea to shreds on the same call, stating several technical and some legal reasons why we couldn’t possibly work that way. A much better approach would be to state the problem and let the experts discuss how to solve it, but yet again he seems to have had a meeting with the other Directors and taken the attitude of “this is how it’s going to be from now on”. And then we all pushed back and told him it wasn’t possible.
I love it when situations like these occur. The drama makes you want to grab the popcorn, sit back, and enjoy the show. Hearing all the different points of view and trying to understand why things are the way they are is super interesting. When you’re told to follow a process without knowing the rationale behind it, it feels like a chore; focusing on a problem and trying to solve it by forming a new or modified process is something you can get behind.
So, like I said, not having learned his lesson from before, he stated what the new process was and gave only a wishy-washy explanation of the problem he was trying to solve. After we all bombarded him with questions, the truth came out, and some of his ideas wouldn’t even have solved the particular problem that caused the outage. So it seemed he was adding unnecessary process, slowing us down and causing more stress. We already release far too slowly, so we software developers end up twiddling our thumbs because we can’t ship the things we do fix or implement.
He described this situation as a
“major incident that impacted the whole estate, attributed directly to a failed Change. We recognise that the change was not intended to have the adverse impact that it did, but sadly the consequences have been a major blow to Users and us. Therefore, we are seeking to create immediate stability across our estate, and are implementing several amendments to the way Technology Changes are approved and implemented, as follows;
1. A comprehensive change freeze for the month of June, with only changes meeting enhanced criteria being passed for implementation.
2. Pre-approved changes are suspended
3. Any changes submitted for approval will require TWO members of SLT.
4. Implementation activity must be witnessed by two or more staff members. Screen sharing technology should be used to witness the change. No additional activities are carried out that are not explicitly in the documentation.
5. All changes must have a comprehensive rollback plan, with proof of testing. The rollback plan must be executable within 50% of the approved change window.
Backout plans will be analysed by the Head Of Development to add a more technically-informed scrutiny, and assess the risk. Further budget will be allocated to DataDog, for more visibility in problems with our servers. A dedicated team will improve how we alert internal teams and customers in the event of an incident.
What does a “Failed Change” even mean? That we deployed a database change which failed to apply due to a key constraint violation? That a new feature or bug fix completely broke something? It’s very vague. And what does “We recognise that the change was not intended to have the adverse impact that it did” mean? Isn’t that just restating the same thing? Nobody intends to have an adverse impact. It sounds like the sort of nonsense you put out to non-technical users just to blag an apology. Why is he saying this to the Development department? How are we supposed to understand the process and help come up with something better if he is being deceptive?
So, as I previously stated, the truth only came out after we all bombarded him with questions; I’ll try to explain by piecing together the information I gathered.
Years ago we attempted to replace a module, and the replacement was feature-flagged. It turned out it wasn’t significantly better than the thing it replaced and had its own set of issues, so some users use the new version (V2) while others stayed on the old one (V1). There was also an experimental feature which hoped to improve performance for both V1 and V2, but it was placed as a sub-option of V2 in the user interface of the configuration tool: you have V2 on/off, and if V2 is On, there appears to be an option for the Experimental Fix, on/off. When the Deployment team came to turn on the experimental feature, they saw that the main feature (V2) was set to Off for some users, so they set it to On. Enabling V2 triggers a huge download of data to convert it into the V2 format, which hammered the network and so affected even the users who weren’t directly touched by the configuration change.
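The root cause is essentially a flag-dependency problem: the UI implied the experimental fix required V2, so the operators enabled V2 as a side effect. A minimal sketch of a safer configuration model (all names hypothetical, nothing like our real tooling) would reject the sub-flag instead of quietly flipping the parent:

```python
# Hypothetical sketch: flags with declared parent dependencies.
# Enabling a dependent flag fails loudly when its parent is off,
# rather than letting an operator flip the parent as a side effect.

class FlagError(Exception):
    pass

class FlagStore:
    def __init__(self):
        self.flags = {}      # flag name -> bool
        self.parents = {}    # flag name -> parent flag name (or None)

    def register(self, name, enabled=False, parent=None):
        self.flags[name] = enabled
        self.parents[name] = parent

    def enable(self, name):
        parent = self.parents.get(name)
        if parent is not None and not self.flags[parent]:
            # Refuse, instead of silently enabling the parent too --
            # which is what triggered the mass data re-download.
            raise FlagError(
                f"cannot enable '{name}': parent flag '{parent}' is off"
            )
        self.flags[name] = True

store = FlagStore()
store.register("module_v2", enabled=False)
store.register("experimental_fix", enabled=False, parent="module_v2")

try:
    store.enable("experimental_fix")
except FlagError as e:
    print(e)  # rejected: the parent flag is off
```

With a model like this, the Deployment team would have been forced to raise the “should V2 be on for these users?” question explicitly instead of answering it by accident.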
So would the CTO’s new process stop this happening?
New Rules – My View
“There will be a comprehensive change freeze for the month of June, with only changes meeting enhanced criteria being passed for implementation.”
The size of the release wasn’t the problem, so cutting it down won’t solve anything. It might annoy users even more if we delay features we have already announced.
“Pre-approved changes are suspended”
I’m not sure of the full scope of what is covered by “pre-approved changes”. But the idea of a pre-approved change is that it’s something often run on the live servers to fix common issues and is low risk, hence it is pre-approved (e.g. the ability to restart a crashed server/service). This is just going to annoy staff in Deployment. The CTO also remarked:
“Preapproved changes are wonderful. They have been reviewed and tested to death. My goal is to increase the number of preapproved changes in the future. It’s just with the existing ones, we don’t know if they have been reviewed or not”.
You don’t know if they have been “reviewed”, but they have been run hundreds of times and never caused an issue. So you are temporarily banning them on the grounds that they could cause an issue?
“Any changes submitted for approval will require TWO members of SLT. ”
How many times has there been some kind of approval process where the people with authorisation are too busy or on annual leave? And why are we going from 0 approvers to 2? Would the managers understand a change to enable a feature for users belonging to companies A, B and C? Would they go “hang on, C don’t have the main feature! I’m rejecting this”? It’s going to be a box-ticking exercise.
We already have a problem when changes are code-reviewed by developers: there aren’t enough “expert” people who can review at the required level of detail. So how would a manager understand a change and its technical impact? It will be more like “does this make us money? Yes, we like money”: approved.
“Implementation activity must be witnessed by two or more staff members. Screen sharing technology should be used to witness the change. No additional activities are carried out that are not explicitly in the documentation.”
This one might actually help, although it might be patronising for Deployment. The CTO made a comment on the call about having “competent” people involved in the deployment process. So if a developer has to watch a member of Deployment click a few buttons, it feels like babysitting and not respecting them as employees.
“All changes must have a comprehensive rollback plan, with proof of testing. The rollback plan must be executable within 50% of the approved change window.”
The rollback requirement is one of those ideas that sounds logical and great in theory, but it is the biggest concern for the technical people in Development, as we will discuss below.
The Developer Feedback
On our current change approval form there is already a section for the rollback plan, but it isn’t properly filled in and is often just “revert change”. So one developer said all that really needs to change is:
“people should be ensuring they have documented their tested process for rollback”. He then added: “In the case of feature-toggling, what happens when you do that? Do the users have a cache? How do we clear the cached data for all users? Can you force everybody to turn that feature off by forcing them to log out? If the change causes a performance impact, we need to be sure that turning something off actually turns it off and stops the performance impact right there and then”.
One person pointed out that the majority of developers are in India now and, due to the time zone difference, they aren’t around for the current Change Approval Board meetings, or available when we deploy.
Like many companies, we love deploying on Friday nights because that is considered downtime for most of our users. The problem is, the staff who would be useful to have around don’t work weekends unless they are paid to be on-call. Getting people to deploy late at night probably increases the chance of mistakes. Maybe we should deploy earlier in the week if possible. At the very least, don’t deploy on a bank holiday weekend, because that makes it even harder for staff to fix any issues.
Deployment staff member
Our test databases have probably less than 5% of the data that our live databases have, so testing SQL is always a problem for us. What runs in 1 second could easily run for a minute with the scale of data. So if we come up with a rollback plan, our timings won’t be accurate at all.
Database Expert
With our C# code, you can easily roll back the version by redeploying the old build. The problem is the corresponding database changes. Our database patches are designed for sequential application: one script creates a table, a later version adds a new column. You can’t just redeploy the old version to get back to a state where the column doesn’t exist; you have to write code to do the reverse and delete the column. And what if data was added between the deployment and the point you realised the feature doesn’t work (and you need to roll back)? Do you just lose the data and tell the users “tough luck”, or do you archive the data, apply the revert patch, later fix the issue, reapply the original patch, and then run another script to restore the archived data? It’s much more feasible just to disable the feature, apply a fix, then turn it back on. [We recently employed a new guy with the job title “Technical Director” and he was adamant that a rollback had to go back to the original state at all costs: “Fix forward is not a valid risk mitigation strategy.” He doesn’t understand how we work, yet he is making these technical decisions.]
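The sequential-patch problem can be sketched schematically (this is illustrative pseudocode in Python, not our actual tooling): every patch needs a hand-written reverse script, and rolling back means running those reverses in order, with the data-loss caveat the developer describes baked in:

```python
# Schematic sketch of sequential schema patches (illustrative only).
# Each patch carries an "up" and a hand-written "down"; rolling back
# runs the downs in reverse order -- and the down for a column add
# inherently discards any data written to that column since the up.

applied = []  # stack of applied patch versions

patches = {
    1: {"up": "CREATE TABLE orders (id INT)",
        "down": "DROP TABLE orders"},
    2: {"up": "ALTER TABLE orders ADD COLUMN notes TEXT",
        "down": "ALTER TABLE orders DROP COLUMN notes"},  # loses data
}

def migrate_to(target):
    """Return the SQL statements needed to reach schema version `target`."""
    statements = []
    while len(applied) < target:                 # migrating up
        version = len(applied) + 1
        statements.append(patches[version]["up"])
        applied.append(version)
    while len(applied) > target:                 # rolling back
        version = applied.pop()
        statements.append(patches[version]["down"])
    return statements

print(migrate_to(2))   # both "up" statements, in order
print(migrate_to(1))   # only patch 2's "down"
```

Redeploying the old C# build touches none of this: the “down” scripts have to exist, be tested, and be run, which is exactly the work the rollback mandate glosses over.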
Another developer reiterated that point, telling the new Technical Director that he was talking nonsense:
When a change is implemented Out of Hours for the user, any problem may only emerge when users log on in the morning. If we roll back at that point, any overnight transactions potentially will be lost. This includes transactions on inbound interfaces from 3rd party systems. Rolling back at that point may be even more disruptive than hot-fixing the problem. How are we to handle this, in terms of process management and policies?
We have many products, especially after all the mergers and acquisitions over the years, each with its own processes.
[The CTO and Technical Director were talking as if we were all using the same process and tools which wasn’t the case.]
One concern was that for our mobile apps, the likes of Apple and Google have strict update procedures, where you submit the update and it is approved in their own time. I don’t know how rollbacks work in this process; I think the implication was that feature flags are often used, so the app stays on the same version and remote configuration switches the new behaviour on.
Mobile developer
The rollback idea seems to be that if the change took 20 minutes, then it should be possible to roll back within 10. Why? I’m not sure; maybe it’s just a rough guide to be “fast”. But doesn’t this assume that any failure is identified within the 20-minute change window? It might take a while for users to run into the scenario that causes the failure. If we patch during the early hours (1AM-3AM) and the users work 9AM-5PM, there is a long delay before anyone notices. Our main patching process takes a long time (possibly an hour, with large file downloads over a slow network), so I’d expect a rollback to take a long time too. And if users are actively using the system and only a specific feature is broken, you don’t want to take the entire system down to patch it. You want a zero-downtime fix.
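To spell out the arithmetic the rule implies (all numbers illustrative, not from any real incident): the 50% rule caps the rollback itself, but says nothing about time-to-detect, which for an overnight deploy dominates everything else:

```python
# Toy arithmetic for the "rollback within 50% of the change window"
# rule. Illustrative numbers only.

def rollback_budget(change_window_mins):
    # The rule: rollback must complete in half the change window.
    return change_window_mins * 0.5

def total_outage(detect_mins, rollback_mins):
    # What users actually experience: the rule caps rollback_mins
    # but ignores detect_mins entirely.
    return detect_mins + rollback_mins

budget = rollback_budget(20)
print(budget)                 # 10.0 minutes allowed for the rollback
print(total_outage(360, 10))  # 370: a 6-hour overnight detection gap dominates
```

A 10-minute rollback after a 6-hour detection gap is still a 6-hour-plus incident, which is why the “fast rollback” number feels like it solves the wrong problem.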
We also need to be specific about what indicates a “failure”. They were talking about using dashboards to look for evidence of failures, but there can be cases where a feature doesn’t work yet logs no error; the server calls might all succeed. And if there is a problem, which rollback plan do you execute? If there are several features and bug fixes in the release, do you roll back everything, or do you spend time investigating which feature is at fault and roll back only that?
We have a tendency to release with features switched off and enable them later. Often, implementing a new feature means changing existing code, which may then contain bugs. So even when the feature is off, it can still cause problems. In that case, I think we need to go through the Change Approval Board for the initial release, and then again to actually switch the feature on.
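The “switched-off features can still break things” point comes down to shared code paths. A contrived sketch (hypothetical names, nothing from our codebase): building the new feature meant refactoring a helper the old path also uses, so a bug ships to everyone even with the flag off:

```python
# Contrived sketch: a refactor done for a flagged-off feature can
# still break the existing path, because both paths share a helper.

FEATURE_NEW_PRICING = False  # shipped disabled, to be enabled later

def format_price_old(pence):
    # The original helper: pence -> "£x.yy"
    return f"£{pence // 100}.{pence % 100:02d}"

def format_price_refactored(pence):
    # Rewritten while building the new feature; the rounding bug
    # below ships to everyone even though the flag is off.
    return f"£{pence / 100:.1f}"   # BUG: one decimal place, not two

def checkout_total(pence):
    # Both code paths now call the refactored helper.
    if FEATURE_NEW_PRICING:
        return "NEW: " + format_price_refactored(pence)
    return format_price_refactored(pence)   # old path affected too

print(checkout_total(1999))  # £20.0 -- wrong even with the flag off
```

This is why the initial “feature off” release deserves the same scrutiny as the later switch-on: the flag guards the new behaviour, not the refactoring underneath it.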
As part of the rollback process, have teams considered the overall impact this will have on delivery estimates, overall development and testing time, and how we are going to manage customers’ expectations? Depending on the project or feature, this could add 50% more development time to create the rollback and complete the testing of it.
Tester
After we made all these points, the Technical Director simply dismissed our concerns.
“I think I’ve been clear on what a rollback plan is and what the expectation is of it. The onus is on the change proposer to demonstrate they have a Rollback plan”
Technical Director
Well, you’ve been clear about your expectations. We have all been clear that it simply isn’t possible to implement.
Architect
I think we need to change our architecture to support such plans. They want instant process change. But this needs to be planned in.
I asked a Software Architect what he thought:
“I don’t like how we’ve had a bunch of clowns that know how nothing works come up with a plan which makes little sense and is impractical, and opposed to much of industry direction. Patch forwards is the way with databases, supporting rollback of code changes with tech like containers is the way to go with that”
Architect
The Next Release
Since this process affects the next release, the team responsible for its biggest feature raised a concern: their change is extremely risky, so they will need to spend weeks coming up with a back-out plan and will therefore probably miss the release. The Technical Director said that it is:
“going ahead – but with the additional scrutiny and bolstered processes we’ve detailed. This isn’t about stopping – it’s about wrapping our change activity in more certainty. I’d personally like to understand “high risk” here though.”
Technical Director
I think this illustrates my initial point. He will be one of the people signing off on changes, and here is a proposed change without a rollback plan that he is already stating he is willing to approve. The Lead Developer then had to convince him that it is actually one of the riskiest changes we have made this year.
“I’m not sure how it isn’t a high risk change. It’s massive, wide-ranging, and there are so many things that can go wrong. We are working hard to mitigate the risks but the risks are still there.”
Lead Developer
Since we had a project going into the next release, we arranged a meeting with the Head of Development (in regards to the statement “Backout plans will be analysed by the Head Of Development to add a more technically-informed scrutiny, and assess the risk”).
When we explained the difficulty of coming up with a rollback plan, he said, “why don’t you just add a feature flag?”. We told him we had, but that the Tech Director and CTO had made clear this was unacceptable. He said he was fine with it, because nothing else makes sense. So the Head of Development, who is supposed to be scrutinising these plans, doesn’t think the rollback requirement makes sense either. Brilliant.
Feedback
On our next meeting with the CTO, he asked how we had been adapting to the new process.
One guy said that a server was down for 3 hours when all it needed was for the service to be turned back on. Because pre-approved changes had been taken away, they weren’t allowed to turn it on until two Senior Managers approved. They could only contact one, and when they asked him if they could go ahead anyway, he refused, because they had strict instructions not to deviate from the new process. The CTO was outraged, and came out with this amazingly hypocritical speech.
It is essential that people “engage brain” before following process. The idea that we can deliberately leave customers without service because some process says so - It's crazy. And if anyone disagrees, come and argue with me. This is incredibly frustrating for us as an organisation where people follow processes to the detriment of providing good service to our customers. Now I understand that there's no malice or deliberate reason to break something by following that process, but as you explain it, I can't imagine anyone on this call saying “Oh yeah, that was the right thing to do”. If we have to write a process that means people don't have to think, then we've lost the battle already, and people need to be able to think. The challenge I have in front of me is our customers, and their customers are not able to be helped. I can press this button and it will fix things, or I can wait until some process completes and then I'll press the button. At the very least an emergency phone call should be made to someone senior enough to make the decision and that should have proceeded straight away. We have heard of other examples of this. I want you to see how serious I am about how upsetting that is. You know, we employ intelligent people with a brain. We can't get away with automatons that just follow process. If we do that, we'll just automate everything. We need people that are intelligent enough to make good decisions, and when they can see that what is happening is not right, they escalate it until they get an answer.
I thought he’d said it would be a sackable offence to deviate even slightly? Even his own Senior Management strictly followed the CTO’s process and refused to authorise the fix when it was escalated. We told him in advance that getting rid of pre-approved changes was a bad idea. We told him that requiring 2 approvals from Senior Managers was a bad idea. He didn’t listen. He refused to “engage brain”. He caused another outage. The fact that the CTO admits it has actually happened a few times is hilarious.
Another team said they weren’t invited to the first meeting when the change was announced, and that trying to follow the process then delayed them by 6 hours. The CTO admitted that one was an oversight on his part.
Another team was trying to do a demo for a customer but needed a configuration change applied. Even though it was a “test system”, it had a dependency on the live infrastructure and so had to go through the new change approval process. They had to delay their demo and looked a bit incompetent in front of the customer.
A Technical Manager reiterated the point about not being able to roll back database changes; his team had wasted hours in meetings discussing how it could even be remotely possible. The CTO then said we were getting “hung up on terminology”: we all know the ultimate aim is to get back to a working state, so it doesn’t matter whether we call it “rollback” or “fix forward”, it just needs to work. But we had been telling him there is a massive difference. A rollback literally puts you back into the previous state. Fix-forward means you apply extra fixes on top of the thing that was broken, so users are left with the new feature. Rollback carries the risk of losing data added between upgrading and rolling back. It’s a completely different approach. Then the CTO and the Technical Director said we can “do what we want as long as it gets us in a working state”.
So we can fix forward then? And we can deviate from the rollback plan?
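For what it’s worth, the difference the CTO waved away is concrete. A toy sketch of the data angle (hypothetical table and column names): a true rollback reverts the schema and loses whatever users wrote in the meantime, while a fix-forward keeps the data and repairs the defect in place:

```python
# Toy sketch of rollback vs fix-forward after a schema change.
# v2 added a "notes" column; users wrote data to it before the bug
# was noticed. Rollback reverts the schema (losing that data);
# fix-forward keeps the schema and repairs the defect in place.

rows = [
    {"id": 1, "notes": "call customer"},   # written after the v2 deploy
    {"id": 2, "notes": None},
]

def rollback(rows):
    # Return to the v1 schema: the "notes" column and its data go.
    return [{"id": r["id"]} for r in rows]

def fix_forward(rows):
    # Keep the v2 schema; repair the defect (say, NULLs broke a report).
    return [{"id": r["id"], "notes": r["notes"] or ""} for r in rows]

print(rollback(rows))     # "notes" data is gone for good
print(fix_forward(rows))  # data kept, defect patched
```

So “do what we want as long as it gets us in a working state” is the right answer, but it is the opposite of the mandatory-rollback wording in the announcement.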
Another person asked whether we “will go back to normal mode after July”. I think they missed the point. Surely the process is there to prevent major incidents like this from happening again; you don’t just follow it for a few months and then stop.
Refinement to the process
After a few weeks, they realised that having a limited number of staff who can approve changes was a bad idea, especially when most of them had a trip to India booked for the same week. So they increased the pool of approvers.
Another week went by and they realised they had way too many approvals to go through, so knocked the number of approvers down to 1.
Another week went by and they realised most of the approvals were formally the Pre-approved changes, so they got reinstated.
We are almost back to our old process!
Conclusion
I think I have lost all confidence in the CTO, and I think the new Technical Director is just as bad. How can you make decisions without involving the people who actually work the process, then reject their concerns when they tell you how stupid it is!? Then, when you get evidence of how stupid it was, you blame your staff, tell them to “engage brain”, and say they shouldn’t have followed your process if it wasn’t going to get our systems up and running. We have been in the industry for years and have had very few long-term major incidents. Under the new CTO’s leadership, I think the number of major incidents has increased dramatically. It isn’t necessarily down to decisions he has made, but some of the ideas he has come up with can’t be helping.
In recent times, the CTO and those like him have insisted that we need modernised technology and tools. One thing they love is software that produces statistics and metrics, so that improvements can be judged over time.
Whenever we buy software licences for such a tool, there is always hype among certain groups of people who volunteer to take ownership of implementing it (installation, training, creating a “best practices” process) and who take every opportunity to highlight their work to managers.
So the “soup of the day” is a tool called Datadog, which seems very powerful and has all kinds of integrations. This “jack-of-all-trades” approach made it difficult to understand what Datadog was actually for, and how it differed from what we already had. We already had dashboards showing which servers were running, their processor/memory usage, which versions of our software were installed, and more. Datadog is used for this purpose too.
Jargon Sales Pitch
One reason it is difficult to understand is that Datadog’s webpage spouts loads of jargon – and internally, managers love spouting jargon too. Here is what one DevOps member said about Datadog (warning – the next paragraphs contain a lot of jargon):
“As our organisation continues to grow and evolve, it is essential that we have a comprehensive and centralised observability solution in place. Currently, we are using multiple disparate siloed monitoring tools, which not only is inefficient but also hinders our ability to identify and resolve issues promptly. This leads to decreased visibility and a lack of agility in our operations.
Datadog observability provides a unified platform that consolidates all our monitoring, logging and tracing tools into one solution. This not only reduces the complexity of our monitoring landscape but also gives us a single source of truth for all our operational data. By implementing Datadog observability, we will have the ability to quickly and easily identify and resolve issues across our entire infrastructure, reducing downtime and improving overall service levels.
Moreover, Datadog observability offers the ability to deploy configuration changes to the Datadog agent with agility, which is critical in a fast-paced and dynamic environment where changes to our infrastructure occur regularly. With Datadog observability, we will be able to quickly and easily make updates to our monitoring configuration, ensuring that our monitoring remains up-to-date and relevant at all times.
With a pre-approved change, it will be easier for us to leverage the 600+ integrations that we can configure to further enhance our current infrastructure observability, root cause analysis and incident mitigation. This will allow us to gain greater insights into our operations, improving our ability to identify and resolve issues before they become critical.
In conclusion, authorisation and creation of a Datadog pre-approved change will bring numerous benefits to our organisation, including increased visibility, improved agility, and reduced complexity. This solution will help us effectively monitor and manage our infrastructure, ensuring that our operations run smoothly and efficiently.”
DevOps Engineer
That really sounded like he was saying the same thing multiple times while heavily emphasising speed. I think a concise statement would be: “Datadog is one software product for monitoring, and can replace many of the metric tools we currently have”. So I would imagine it should be cheaper (one licence rather than several), and since everything is in one place, it is probably easier to create new dashboards.
Jargon From The Docs
On their page, Collect SQL Server Custom Metrics, they show how you can run a custom query involving a person’s age. Isn’t that a terrible example? The query would run every minute (or however often it is configured to), and you would create graphs from it. Without good examples, it’s hard to understand how or why you would use the feature. Other problems are due to excessive jargon.
“In v6, DogStatsD is a Golang implementation of Etsy’s StatsD metric aggregation daemon. It is used to receive and roll up arbitrary metrics over UDP or Unix socket, thus allowing custom code to be instrumented without adding latency.”
Datadog
“Enabling JMX Checks forces the Agent to use more memory depending on the number of beans exposed by the monitored JVMs.”
Datadog
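For what it’s worth, once you cut through the jargon, a custom SQL metric is just a block of YAML in the agent’s SQL Server check config. Here is a sketch of a more useful example than the docs’ person’s-age one – the server name, credentials, and table/metric names are all invented:

```yaml
# conf.d/sqlserver.d/conf.yaml (sketch - names are made up)
instances:
  - host: "LIVE-SQL-01,1433"          # hypothetical server
    username: datadog
    password: "<from a secrets store>"
    custom_queries:
      # Each result column becomes a submitted metric
      - query: SELECT COUNT(*) FROM dbo.ScheduledJob WHERE Status = 'Error'
        columns:
          - name: scheduler.jobs_in_error   # graphable metric name
            type: gauge
```

The agent runs the query on each collection cycle and submits the count as a gauge, which you can then graph and alert on.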
Official Training
Members of the Network team, DevOps, a few managers, and some volunteers (who want managers to look upon them favourably) signed up to sessions with official Datadog training staff. The sessions were recorded, and I watched them and made a few notes, although they were riddled with jargon and it was hard to know what anyone was talking about.
“Datadog Expert Services, or DES for short, is a set of “guided hands-on keyboard” pair-programming sessions. These collections of sessions, collectively known as an engagement, are time boxed and specifically designed to enable you to get the most out of Datadog while adhering to best practices. In this session, our team will work you to configure and deploy the Datadog Agent. This includes deployment in a standard, or Kubernetes containerized environment.”
Datadog
There were two courses that these people seemed to be enrolled on:

Name | Time | Cost
QS-INF course | 2 Weeks + 5 Sessions, Curated | $15k
QS-INF-LOG course | 3 Weeks + 8 Sessions, Curated ** | $25k
The training cost is bonkers, isn’t it? Once you have paid all that, it pushes you toward the sunk-cost fallacy.
One of the Instructors asked what our infrastructure was.
“we’ve got resources and infrastructure in Azure, with a bias towards AWS, then we have on-prem; most of it is Windows Server. A combination of 2012…and onwards. 2016, but mainly 2019 as well. They also run on Windows HyperVisor, and also VMware – so they are virtual machines. But actually, we also have physical servers as well.”
deployment dude
Basically, we just made it up as we went along and got all the things! It sounds like a similar thing was done with the monitoring, because the deployment dude said we have “16 or 17 on-prem monitoring tools, as well as custom Powershell scripts to generate some data to monitor”
The Datadog instructor explained that we have to log tickets for anything outside our “engagement time”. They will reply when they can, but there’s no set time-frame.
“That’s fine with us, we log enough tickets already, so that’s fine. I think we will welcome that.”
DevOps Engineer
It’s almost like we were taking any opportunity to slag our company off.
No Going Back
Good news everyone!
The DevOps engineers with support from the Architecture Team have levelled up our Live datacentres!
How? With estate wide deployment (completed Friday evening) of the incredible, uber-awesome full stack monitoring SaaS Datadog!
If you’re aware of Datadog’s capabilities, effortless integration and out-of-the-box features you’ll appreciate how monumental this is.
For the uninitiated, Datadog in a slick, AI driven, intuitive UX allows full stack monitoring of servers, databases, tools, services, containers, et al.
Effortlessly switch from viewing the entirety of all network traffic to drilling down into individual requests, logs, payloads, processes, you name it, in real-time.
Going forward we envisage significant improvements to our reaction and mitigation of all types of incidents, minor to major!
We are currently trialling access – To request access please join our Slack channel.
Stay tuned as we have more exciting stuff coming as a result of our DevOps strategy!
Watch this space!
DevOps Engineer
Web-based Demo
One team put together a small web-based app and presented a demo to the department to promote Datadog, and obviously, take the opportunity to look amazing in front of the management.
The team lead was trying to show a feature called “Cumulative Layout Shift” but didn’t explain it. He made out that it tracks how the parts of a page load: you know how sometimes you load a webpage and see some text, then an image suddenly pops onto the screen, then some adverts, often shifting the layout, then more adverts appear, possibly shifting it again? It’s not a smooth user experience, and it causes a lot of jank if the user tries to navigate the page before it has fully loaded. So how does Datadog track that? What is doing the tracking? Wouldn’t that mean multiple extra server calls to Datadog to log it? The page is already slow, so why would adding extra calls make it better? (From what I can tell, modern browsers expose “layout-shift” entries natively via the PerformanceObserver API, and RUM scripts batch events up rather than phoning home for every shift – but none of that was explained.) I still can’t see how it can be performant with thousands of users, and isn’t this logging an insane amount of trivial data over time? I was left with far more questions than answers.
He also said it can track time spent on a particular web page, view count, error count, action count, and “frustration count” (he claims Datadog detects clicks made out of frustration. How?). People are already worried about the amount of monitoring/tracking/surveillance from the likes of tracking cookies; websites being able to track you at this granular a scale with Datadog is a bit worrying, isn’t it!?
Everyone should use Datadog
In subsequent department meetings, the CTO told us that all teams would eventually use Datadog, that we needed to increase the amount of monitoring, and that we should do it quickly to take advantage of the tool’s benefits.
My manager wanted our team to create a Datadog dashboard. Even if it wasn’t that useful, she wanted to be among the initial users – probably to look good to her manager.
I asked one of the smartest developers whether it was even suitable for my team. He was looking into creating a dashboard for his own team – but they had an API that third parties could use, which was prime for this kind of monitoring.
He was a bit vague though:
“You could create a custom metric for it. But I wouldn’t be too sure. I’m probably going to use custom metrics for “#messages per APP per minute” sort of thing. But I can get all that from my Logs/Traces. You’d have to have something pulling that data from the main databases which would involved pushing it to Datadog.”
Principal Developer
I asked other people who were using it, and they just kept saying they weren’t sure, or that maybe others had done it.
“We’re making heavy use of Datadog for our new software and I believe it’s also being used in other areas too. It’s incredibly powerful and provides a huge amount of detail. Getting the most out of it is important and also building some standards seems like a good idea. Do we have any thoughts around how we ensure we lead the way with this and get some standard/learning/documentation in place?”
Developer
No one can give a straight answer when it comes to this stuff. People say “EVERYONE is using Datadog”, then when you ask about it in more detail, it becomes “well, SOME teams are using Datadog”, and when you ask more people, it’s “there are some metrics, but not quite the ones you want”.
Performance Problems
I asked my Software Architect friend (who seems to know everything) whether Datadog is as flawless as people were implying. My intuition said it couldn’t have zero disadvantages.
Me
Won't Datadog just cause performance issues if we start monitoring everything?
Mark
yep, or run while patching is in progress and block access to the Database/tables, which has already happened. Running ad-hoc scripts is a fairly bad idea
Hosted had to run patching twice the other week, which pushed us out of our Service Level Agreement.
Me
this juicy gossip keeps quiet doesn't it
Mark
yes because Datadog is a massive success and we paid lots of money for it
Technical Director
Recently we hired a “Technical Director”. He asked how Datadog was coming along and whether we could highlight any issues so he could get involved. This prompted John to go on a rant. The TLDR of this section is: “Software developers don’t know about the infrastructure of the live production environment.”
I think one of the company’s biggest challenges is how many products we have, and how diverse they are. We have no real standardisation due to a number of different factors, not sun-setting old services, not tackling tech debt, products that were developed by other companies and came to us via acquisition etc..
As a result, I think it’s difficult for us to template things out such that it can work for multiple people.
Realistically, each team for each product needs to look at how their product works, how it’s used, what tech it’s built on, and build a solution that works for their product. And I think one of the biggest challenges at the company is the ‘DevOps wall of confusion’ isn’t just a normal wall, it’s a Trumpian 15 foot high one with razor wire. Lots of products have dev teams (assuming they have one at all!) with little to no exposure or knowledge of how production works and what it looks like. For so long dev teams were told they had no role in production, no need to access it and were kept locked away from it
For reference, I used to think like that. I’ve been here 15 years and I have been part of the mindset in the past. It’s changing, and I’m happy to be one of the people pushing for that change, breaking down that wall of confusion. But that’s one of your biggest hurdles – is that people don’t know what to monitor in production because they don’t know what it looks like, and trying to monitor it by just copying a template that worked for somebody else, but doesn’t work for their solution isn’t a way to solve it
The key to unlocking Datadog for me, is to get people to have visibility of production, to understand how it’s used and what it looks like, and then start to work out what metrics are important, and what “normal” looks like so we can alert when we deviate from that
I can talk for hours about this, my team has one of the best observabilities out there, and had it before Datadog came around. If you want to have a chat, happy to have a discussion about what we can do.
I may have painted a somewhat negative opinion above, and I agree that there are things that we can improve. But we can’t expect some pretty Datadog dashboard templates to solve the historical problems that have meant we have lots of live services in the business with nobody who understands where they are or how they work and crucially expect Operations 24/7 to be able to magically pick up the pieces and fix it by themselves when it falls apart.
Yes, the company has a long history of developing a solution, moving the team that developed it off onto a new project, and leaving that solution behind. Combine that with a massive wall of confusion between Dev and Hosted, you have hosted running a bunch of servers that they have no idea what they do.
Case in point right now: the “Login and Identity service” is in the main data-centre, and we also have one in the DMZ that was built for the Mobile app, but nobody is quite sure what the main one is for. I have some notes that indicate it was built for the Connect app, but Connect doesn’t use it. Yet that production service still sits there unused, with nobody sure why it’s there.
You’ll find a team that has maybe done work in the past on Appointments, maybe even recently. Are they currently working on Appointments? do they have any knowledge or visibility of production? is it even on their radar that they should be monitoring the performance of it?
This goes deeper than just dashboard templates, it’s a company culture problem
John
Anomaly detection works well if the metrics are predictable across different periods of the day. It’s not “AI” as we thought – when I tried it out, it was more of a fancy algorithm than machine learning.
I found with XMPP that the method would work OK for Mon–Fri, then the alert would trigger all weekend because traffic wasn’t as high on those days.
Lee
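This matches how Datadog’s docs describe anomaly monitors: you wrap a metric query in `anomalies()` and choose an algorithm, and only some of the algorithms model weekly seasonality. A sketch of a monitor query (the metric name is invented):

```
anomalies(avg:xmpp.messages.received{env:live}, 'robust', 2)
```

The ‘basic’ algorithm just tracks a rolling baseline with no notion of day-of-week, which would produce exactly the weekend false alarms Lee describes; ‘agile’ and ‘robust’ are the seasonal ones.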
Scheduler
I was added to a group chat where the team was discussing how to use Datadog on our “Scheduler”. It sounds like an easy task, but there’s far more nuance and technicality to it. The main problems we have with the scheduler are that:
some jobs fail and then wait to be reset,
some jobs run but get stuck (I think most cases the scheduler isn’t informed that the job has finished, so it fails to schedule the next run).
The TLDR of this section: there is a lot of junk data (and I mean A LOT), and reporting on these figures can be misleading, because failed jobs for organisations that no longer exist aren’t a problem (although we should stop them from running, since they are obsolete).
John
Surely we need something that runs that shows us:
a count of jobs in Error Status
A list of jobs with a status of Running,
Long Running Jobs
Matthew
We'll want to take into account the normal running time of a particular job. We don't want to be alerted about a job that usually takes 2 hours to run and it has only been 1 hour.
We'll get to ones that get stuck quicker if they usually take a minute to run
Dave
Someone should run some queries against live now, and get a picture that looks correct.
Matthew
We also want the data to be meaningful otherwise we'll be back to where we are now - where we don't know what's actually working and what isn't. There's a balance to be had here
Christian
Can we summarise the Key Performance Indicators that will cause an action that needs to be performed? These become multiple metrics IMO, that add together to give better context
John
1. Job queue building up
2. jobs failing and not being reset
3. jobs getting stuck
Matthew
• Large numbers of jobs not running when they should
• Jobs stuck in running beyond their normal running time
• Mass job failures
• Mass job queues (this has the potential to false flag when workload is high)
John
There's a bug / unexpected behaviour where the scheduler can fail to update the database with the result of a job, so the table shows it in status Running. The scheduler logic queries the tables for what is / isn't running, leaving it to decide it can't do things because a job is "running" when in fact it isn't.
Matthew
If this is a bug, the smartest thing to do after the monitoring piece is to fix the piece of software causing it surely?
John
the secret to any good bug report is reproduction steps, and it's not an easy one to reproduce
You mentioned you'd had one get "stuck" recently. Do we know how to reproduce that behaviour again on demand?
Matthew
"Just let the scheduler do its thing and wait" is the only way we know how to replicate these
John
hence why any developer would struggle to fix it because it's difficult to know where to look if you can't reproduce it
Christian
"Treasure what you measure" or "Measure what you treasure". Simple counts and alerts will likely get us to what we need very short term which is to prevent or proactively manage / reduce potential for a Major Incident.
Matthew
I've got some initial queries together for this that could be used to alert on if the numbers get too high. I'd appreciate someone who knows TSQL to have a look and to suggest any improvements to the data being returned.
John
the 3000 + jobs in error is scary
Do we need to filter that list by only jobs that we know get automatically reset by the Hosted DBA agent jobs?
Matthew
Maybe, I did think that but I also thought that we should really know how many jobs are actually in error
John
I know that list in itself is a problem. But I think all Domains are going to have a high failed count and it's difficult to know if there are important ones in that 3000 + count
Matthew
We shouldn't alert on that metric, hence the one to track how many in error for the last hour
The scheduler is a massive mess and that 3000+ count suggests we have a fair bit of clean-up to do.
John
the only suitable metric I can think of for "important" is ones that the Database Administrators already deemed as important and created automated resets for.
Matthew
I could add an additional "Important" row to the general stats that includes those (or excludes any that aren't those)
Need that info from the Database Administrators though
John
Do we maybe need a couple of groups rather than just 1 "important" group
Matthew
I'd rather split the jobs and call out their names though, rather than pile them into one huge count if we're doing that
Let's get the data in Datadog first and see what normal looks like and tune alerting accordingly
JobTypeIDs aren't consistent across the estate by the way, so you'll have to match on JobTypeName with a join to the JobType table
<Image of 2048 Ready jobs. 47 Error>
John
Interestingly, those 47 jobs haven't run in years. Some of them last ran successfully in 2016
but we're resetting them every day to try and run and constantly failing (job run count of 271,280)
Matthew
Hence my comment about a lot of clean-up - I'm willing to bet these are trying to run for closed Orgs, or orgs that have moved endpoint
John
Each Domain will probably need work to get rid of all the false alarms
I know when I checked one domain there were 40 + jobs that had never ran and were just constantly being reset
Maybe an idea to simply disable these and change the script to add & Enabled = 1 to the filter so you count only enabled jobs?
That should help remove the false positives you know about - then you can actually alert if the value goes above 0 for jobs in error
Paul
We are assessing whether the best approach to reduce the number of scheduler incidents is to deliver the new scheduler with improved logic and Datadog integration which will take time.
Or support the integration of Datadog with the current scheduler.
Matthew
If it's the former, should we still do the latter anyway until the new scheduler logic is in place?
I suppose what I'm trying to ask is will the time-frames for implementing the new logic be quick enough to satisfy the urgency of monitoring the scheduler?
Paul
Yes agreed, we have just reviewed the last 9 months of incidents and having Datadog reporting would have given us the insight to avoid a number of these.
John
As well as adding an "enabled=1" filter Matthew, do you think it's worth adding a runcount > 0 filter as well to avoid counting jobs that have never ran?
For the sample Domain I looked at, every priority job in error had a run count of 0 showing they've never worked. Adding this would bring that result down to 0 which makes it much easier to then set an alert if that goes above 0
Matthew
I thought about that, but that will mask errors with jobs that should run but haven't. We'll want to see those. New job types as well, for example
John
going to be a hell of a job tidying up all the crap scheduled jobs in the scheduler AND setting up and calibrating monitoring at the same time
My thoughts were to filter those out for now, then look at those crap jobs later
Matthew
Yep, it is, but we can't ignore the mess as it won't go away. A lot of work disabling jobs will be needed to get the overall stats query to show nice figures. We shouldn't shy away from them looking terrible though. I don't believe in fiddling figures to make things look nice when it comes to monitoring
The other queries that show failures and stuck/running jobs for over an hour will help with spotting immediate issues though
One particular situation to take into account is the longest-running job we have is 8 hours. We can take care of that in Datadog with trends and anomaly detection to tell us when more jobs than the expected ones are stuck in a status for longer than an hour.
Similarly, we can use that same alerting method to warn us when the numbers on the overall stats aren't within usual parameters. Change detection is also a good measurement to use here too. We don't necessarily have to use traditional methods of alerting as soon as a value is over X
John
that sounds to me like a case of another metric
count of scheduled jobs running more than 1 hour where job type is NOT “expect long-running jobs”
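The metrics the chat converged on can be sketched in T-SQL. The table and column names here are guesses, not our real schema:

```sql
-- Jobs in Error, ignoring disabled jobs and jobs that have never run
-- (John's Enabled = 1 and RunCount > 0 filters)
SELECT COUNT(*) AS JobsInError
FROM dbo.ScheduledJob
WHERE Status = 'Error'
  AND Enabled = 1
  AND RunCount > 0;

-- Jobs stuck in Running for over an hour, excluding expected long-runners.
-- Match on JobTypeName rather than JobTypeID, since (as Matthew noted)
-- the IDs aren't consistent across the estate.
SELECT j.JobID, jt.JobTypeName, j.StartTime
FROM dbo.ScheduledJob AS j
JOIN dbo.JobType AS jt ON jt.JobTypeID = j.JobTypeID
WHERE j.Status = 'Running'
  AND j.StartTime < DATEADD(HOUR, -1, GETDATE())
  AND jt.JobTypeName NOT IN ('OvernightExtract');  -- hypothetical "expected long-running" list
```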
Performance Problems Part 2
Note: If the min_collection_interval is set to 30, it does not mean that the metric is collected every 30 seconds, but rather that it could be collected as often as every 30 seconds. The collector tries to run the check every 30 seconds but the check might need to wait in line, depending on how many integrations are enabled on the same Agent. Also if the check method takes more than 30 seconds to finish, the Agent skips execution until the next interval.
Datadog
It seems that for custom SQL metrics, you can only specify a single collection frequency for ALL the queries. So if one team wants their query to run every minute and another team wants theirs to run every hour – you can’t.
One team wanted to run a long-running query, but because another team had set the collection to every 60 seconds, the long-running query wasn’t possible.
Similarly, we anticipate problems once the total runtime of all the queries exceeds the 60-second interval – which will happen soon, with only a handful of queries.
Another problem we found is that the interval you set is just a guide, so Datadog could actually run a query twice within the 60-second period. When it comes to building the dashboard, you then have to be careful not to count the data multiple times. Some teams were seeing decimal numbers on charts that counted only whole numbers!
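As far as we could tell, the shared frequency comes from how the check is configured: `min_collection_interval` sits at the instance level, not per query, so every custom query on that instance runs on the same cadence. Roughly (host name invented):

```yaml
instances:
  - host: "LIVE-SQL-01,1433"
    min_collection_interval: 60   # one interval for the whole check instance;
                                  # every custom query below shares it
    custom_queries:
      - query: ...                # team A's fast query
      - query: ...                # team B's slow query - no separate interval
```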
The possibly crazy workaround
I think a good workaround would be to have the data refreshed hourly, and placed in a separate database somewhere then have that separate database queried every 60 seconds by Datadog. If it’s separate to the Live estate – it should reduce the risk. Needs thought putting into how you would pull those stats into a database hourly however. Need a SQL agent job or similar that could collect them once an hour and push them to a central separate location.
John
key thing, would be to ensure we aren’t using the scheduler to kick off SQL that monitors the scheduler 🤣
Christian
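A rough shape for John’s workaround, assuming a SQL Agent job on each domain pushing into a central stats database that Datadog can poll safely (all names invented):

```sql
-- Scheduled hourly via SQL Agent on each domain.
-- Datadog then polls CentralStats every 60 seconds without
-- ever touching the live scheduler databases.
INSERT INTO CentralStats.dbo.SchedulerSnapshot
    (DomainName, CapturedAt, JobsInError, JobsRunning)
SELECT @@SERVERNAME,
       GETDATE(),
       SUM(CASE WHEN Status = 'Error'   THEN 1 ELSE 0 END),
       SUM(CASE WHEN Status = 'Running' THEN 1 ELSE 0 END)
FROM dbo.ScheduledJob
WHERE Enabled = 1;
```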
Need More Tools?
We purchased Datadog to replace several metric tools. Now that we have seen the SQL metrics are a bit basic, we seem to be concluding that we need an alternative tool. I wonder whether we will keep buying different metric tools over time and end up right back in the situation we started in!
You get some basic SQL monitoring in the base DataDog install (or the APM one, not sure which). You can pay extra for “enhanced” SQL monitoring in the DBM module. It’s still very basic and about 3x the cost of “proper” SQL monitoring tools. I’m sure as the DBM module matures it will get closer to fit for purpose, but right now it’s an incredibly expensive SQL Server monitoring tool that’s feature poor. If it was near zero cost, I’d have it everywhere to give a wider audience better visibility of what the DB layer does in live, but the features don’t currently justify the cost of that particular extra module.
Recently, a team was created to tackle major issues, often proactively. Some errors are logged without the user noticing anything, so by monitoring the logs the team can diagnose and fix these “silent” errors. They are also looking out for performance problems. Some of those go unnoticed when the degradation is minor, inconsistent, or builds slowly over time. I would think some users don’t bother reporting slowness because it’s harder to quantify than an obvious crash.
One user, however, had been seeing a gradual drop in performance caused by not dealing with their tasks, and it got to the point where they could no longer log in: retrieving their tasks at login took longer than 30 seconds, so the SQL call timed out and threw an error.
“At the time of logging this bug, the user currently has 136,854 tasks in Tasks Management. The Program performance will start to be negatively affected after 4,000 tasks. I have extended the timeout of sql call for TasksManagement.GetUserTaskCountSummary to 60 seconds as this caused a login failure.”
Walter (Developer)
“let’s be honest, the program performance will start to be negatively affected after 1 task”
Mike (jestingly)
I think this is acceptable as a quick fix to let the user log in again, but is it really acceptable for the login process to take more than 30 seconds? I’d imagine it now takes around 40 seconds for this user.
That’s the problem with this team: they look for quick wins, even when it isn’t the right long-term solution and might just move the problem elsewhere.
What’s better than waiting 30 seconds? Waiting 60 seconds? Threading it off so the loading happens in the background? Or Mark’s idea: no timeout at all.
What if the time taken still exceeds 60 seconds because some other user has a large volume of data? Can you set the timeout to infinity?
Mark
we had a customer with a very large count and it only ran for around 32/33 seconds and their counts were unusually large. We are going to send out communications for customers to keep these low by means of maintenance. The 60 seconds just allows an extra safety net if we get in this situation again. I don’t want to extend the timeout to be too long for this reason as it will unlikely (if ever) be needed to be longer than 60 seconds.
Walter
Why not a try catch / retry attempt for this? It should be a non-essential call to logging in, if it fails, you can catch, log, and show an error message. Should we not look at optimising this so that you can login quicker? Maybe run this on a background thread too?
Lee
I discussed this with Johnny and making changes to this Stored Procedure could result in worse performance for smaller datasets and advised against making changes to it. We’re going to tackle this by means of communications to sites. I thought the simplest and safest approach is just to extend the timeout slightly so that the practice does not suffer a system down as a result of it – the timeout is only breached by a second or two. Once the user logs in, they are displayed their task counts, so I think it might be deemed essential (rather than showing them a loading wheel until the data is returned). Currently, if we did this, when loading up Tasks Management it would just crash with an error.
Walter
It would still crash on logging in if it takes over 60 seconds. Why not make it non-critical to logging in?
Log in
Status bar “Loading…”
Completes – OK.
Fails – Show error and retry link.
Lee
This was the worst site on the estate and was taking roughly 32 seconds. To take over 60 seconds the task count would probably be unheard of. Each time I ran that stored procedure manually, the results were the same so I don’t think a retry is going to work here. Even by changing to make it non critical to logging in, Tasks Management will still be busted when you try to load it. The timeout is on the SQL side so that is the area we need to resolve really.
Walter
However, Johnny did advise against alternative solutions such as:
1. fine-tuning the stored procedure
2. adding indexes
3. removing task counts completely for some types of tasks
My View:
Walter seemed to have put more thought into it than I originally thought, but I still felt Lee was overlooking his suggestion. Yes, it would need more work to actually work (display loading text on the Tasks Count Bar, then a loading screen when launching the Tasks Management page), but it would significantly speed up logging in. If this user could log in 32 seconds quicker, then what would the average user see?
If the other parts of the log-in process also take some time, then users are waiting a long time in total. If task counts are the bulk of that time, we can make log-in very fast by taking them out. I would have thought users expect times of 5 seconds or less (that might not be achievable, but it’s the scale we need to aim for). Lee is talking like users are more than happy to wait 30 seconds or more just to get to the home page. A long wait is better than not being able to log in at all, but surely anything more than several seconds in total is generally unacceptable. It’s one of the reasons users have grown more discontented over time.
When testing with smaller counts, for example 10k, the results are returned in a few seconds (2–3). This organisation had around 120k Appointments Tasks across all users, plus all of their other tasks, which resulted in a production duration of 32 seconds. The more they manage their tasks, the quicker their workflow will be – that’s always been the message we’ve tried to get across.
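For what it’s worth, Walter’s proposal – take the task-count query off the login path, with a catch/log/retry – could be sketched like this. All the names here (`TaskCountLoader`, `LoadAsync`, the `fetchCounts` delegate standing in for the stored-procedure call) are hypothetical, not the real client code:

```csharp
using System;
using System.Threading.Tasks;

// Sketch of Walter's proposal: fetch the task counts off the login path,
// so a slow or failing stored procedure can never block or crash login.
public class TaskCountLoader
{
    private readonly Func<Task<int>> _fetchCounts; // wraps the slow stored-procedure call

    public TaskCountLoader(Func<Task<int>> fetchCounts) => _fetchCounts = fetchCounts;

    // Returns (true, count) on success. Retries once by default, then gives
    // up and reports failure so the UI can show an error with a retry link.
    public async Task<(bool Ok, int Count)> LoadAsync(int retries = 1)
    {
        for (int attempt = 0; ; attempt++)
        {
            try { return (true, await _fetchCounts()); }
            catch when (attempt < retries) { /* log, then loop round for a retry */ }
            catch { return (false, 0); }
        }
    }
}
```

The status bar would show “Loading…” while this runs, then either the counts or an error with a retry link – exactly the flow Walter described.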
This long blog documents what I have been working on for the past year. I had made lots of notes with the aim of writing a blog, in addition to taking extra notes from chat logs.
We actually estimated the project would take around 5 months, plus an extra 2 months for testing and to go through our slow rollout process. It ended up taking closer to a year. I’d say it was a combination of:
realising the feature was more complicated than anticipated
the UX team had little knowledge of the actual user experience
managers changing or trying to change team members
We were told the project was important, yet we were assigned only 2 developers (myself and one other). As the project came to a close, we were being integrated into our new team, so other developers could help out during the final stages.
Here is a list of all the people involved over the project’s lifetime:
Name (Core team in bold) – Role
Me – Developer (Team Lead)
Daniel – Developer
Dean – Developer (Temporary)
Dennis – Developer (Temporary)
Tina – Tester
Tim – Tester
Colin – Technical Manager
Mary – Technical Manager
Olivia – Product Owner
Owen – Product Owner
Carl – Customer Representative
Adam – Architect
Andy – Architect
Grace – Safety & Legal Governance
Ulrika – UX
Ursula – UX
I’ve made each name start with the same letter as the person’s job title, apart from Colin, because he is a recurring person in my blogs. I’ll put role reminders throughout the blog so it is easy to follow.
Current Software
To protect anonymity, I need to come up with a different theme for what the software is for. Let’s say customers request various restricted items of different severity. So a request could come in for a hunting rifle, and the user needs to know whether the customer holds an adequate licence to possess firearms and has been deemed medically safe within a recent time-frame. Possible warnings are shown which the user can dismiss/acknowledge, e.g. “licence is up for renewal in the next 3 months” or “recent purchase of other firearms”. Standard users can create “Awaiting Approval” tasks and assign them to users with the authority to approve. To approve them, the authorised users open the task list, view the details, then click Approve. Many tasks have either no warnings or low-severity warnings, so users often just glance at the info and click Approve. The system then sends the approved request to a central system, then loads up the next task. There’s a couple of seconds’ delay for the “digital signing”, a couple of seconds for sending, then loading the next record. Signing a large batch of tasks one by one is a very slow and laborious process, and it’s a major source of complaints from our users.
Unsafe/Unofficial Automation
Carl [Customer Representative] sent a link to a video where someone was demoing a commercial automated tool that autocompletes the tasks. It waits for the system to load, clicks the approve button, then repeat. So you could set it running, then walk away from your desk.
I thought it seemed ridiculously irresponsible and would cause people to be sacked if they got caught using such a tool:
A) The program is now the one authorising the tasks, not the qualified user. What’s the point of requiring qualifications if you aren’t even going to read what is on-screen? If a task was wrongly approved, the user would still be accountable.
B) If you walk away from your desk, you are leaving your PC unlocked, along with your physical Security Key.
The creator had actually put a bit of thought into it though. If there are any Warnings that require another click to dismiss/override, then the automation is paused.
The video claimed that some users have up to 500 tasks to sign after a weekend. They charge a fixed yearly fee of £295, plus 7p per customer on the system per year.
“the robot does not get bored, does not make human errors, and crucially is a lot cheaper than the user’s hourly wage”
Promotional video for the Automation tool
Probably just makes robotic errors instead!
I said we should change the names of the buttons to try and screw them since it probably uses something like that to locate the button to click. It would be quite funny to make them dish out refunds.
The existence of the automation tool shows how much the users desire a better solution.
UX User Feedback
Given the existence of such an automated tool, it is no surprise that one frequently requested feature is Batch Approval. Our UX team put together some kind of interactive prototype and invited a few users to provide feedback on two designs. The alternative design was actually produced by Mary [Technical Manager] who has no UX qualifications. I’m not sure how that came about and why UX agreed to trial her design, but the feedback was actually extremely favourable to her design.
This caused her to be quite smug and maybe caused some animosity as we will see later. The ratings out of 5 were:
(Option A) 4.3 for Mary’s design
(Option B) 2.3 for UX Team’s design
For additional comments, one user commented:
“I prefer Option A by a country mile – Option B feels even worse than the existing system!”
Another commented:
“Option B feels more clunky, less user friendly than option A. A lot of clicking involved”
One even gave a threatening response:
“Option A or you’re gonna lose me and my franchise”
Shortly after, there was a write-up from a conference where the feature was announced:
“This item is one that really did steal the show – this is something that our customers have been very eager to see us implement and are very excited to learn that we are busy developing this solution.”
“Busy developing this solution” made me laugh, because at the time, all I had was a dialog box with a couple of lines of text and a button.
Proposed Change
The general idea is that:
– the user is presented with key details from the tasks in a data grid
– they can click checkboxes to select which tasks they want to approve
– selected tasks are added to a queue and sent in the background
– the user can continue working while they are sending
– the “digital signing” has to take place on the user’s computer, so a large part is done client-side
– the user has to remain logged in until the process is finished
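The flow above could be sketched as a simple client-side producer/consumer queue. This is a minimal, hypothetical illustration (the real client code is far more involved); `ApprovalQueue` and the `signAndSend` delegate are stand-ins:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Minimal sketch of the batch-approval flow: selected tasks go into a
// queue, and a background worker signs and sends them while the user
// carries on working in the rest of the application.
public class ApprovalQueue
{
    private readonly BlockingCollection<string> _pending = new BlockingCollection<string>();
    private int _sent;

    public int Sent => _sent;

    // Called when the user ticks a checkbox and clicks Approve.
    public void Enqueue(string taskId) => _pending.Add(taskId);

    // Background worker: drains the queue until Finish() is called.
    public Task StartSending(Action<string> signAndSend) => Task.Run(() =>
    {
        foreach (var id in _pending.GetConsumingEnumerable())
        {
            signAndSend(id); // the "digital signing" happens client-side
            _sent++;
        }
    });

    // The user must stay logged in until the queue drains; this marks the
    // queue complete so the worker can exit once it is empty.
    public void Finish() => _pending.CompleteAdding();
}
```

`BlockingCollection` handles the thread-safety, so the UI thread only ever calls `Enqueue` and the worker does the slow signing/sending off-thread.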
This project had actually been discussed for years, but because there wasn’t much of a “commercial drive” for it – we would be giving users this feature for free – it was always low priority.
Product Owner: Owen
I think the initial planning was done by a different Product Owner, but when the project fully began we were assigned a new one, Owen. He was new to the company, but he also gave me the impression that he was new to the role… and, honestly, that he wasn’t very clever in general.
Here are some exchanges from various meetings (mainly Sprint Planning and Refinement).
Owen: "which work item is it?"
Me: “the one right at the top"
Owen: slowly scrolls...chooses 2nd item
Me: "it's not a Must, it is a Could"
Owen saves it with Must tag
Tim [Tester]: "No, Owen, you tagged it wrong, go back"
Owen: "Which WI is this?"
saves it with the Must tag again
Then goes back into the work item and gets confused
then goes back into it again. I think he needs rebooting
Me: "you need to set the state"
Owen clicks to close
Me: "you need to set the state, go back"
Owen is confused
Me: "left hand side Owen!"
Owen hovers over the right
Me: "left hand side Owen!"
Owen moves down
Me: "leave it as it is"
Owen "Which one shall I take out?"
I'm sure he is intentionally 30 seconds behind to wind us all up
Owen changes Story Points from 3 to a 5 without any discussion.
"shall we keep it at 5?"
For another item, I was talking about how the requirement is either obsolete, or needs a completely different approach from the initial proposal.
Owen: "So how many points shall we add?"
"The system crashes when entering incorrect PIN and clicking 'OK' on error prompt"
Owen: "what was the behaviour before we fixed this?"
team: "It crashed"
We were discussing how we had logged a bug a few months back but hadn’t seen it occur since, so it would need some investigation to try to work out what the recreation steps are.
“Assuming the bug still exists, how long will it take to fix it?”
Owen
Estimating software changes is hard, but I always think bugs are even harder to estimate. It’s only possible if there are clear recreation steps; otherwise it is stupid to ask – we can’t fix it if we don’t know what the problem even is.
“depending on Grace’s [Safety & Legal Governance] feedback, do you know how long it would take to fix?”
Owen
Translation: can you predict what Grace would say, and given that she did say it, can you come up with an estimate for it?
I logged a bug with suggestions on how to improve a dialog. It would be up to Owen or UX to decide on the approach. Owen then asked questions along the lines of: “what do we need to do for this? do we need it?” I said it would be nice but it’s not my decision. Then he still asked: “do we need it?” “can we close it?“
What’s the point of asking me these questions, when I logged it precisely so he would decide?
When the project deadline was looming, we ended up having multiple meetings to decide if there were any features we could scrap or defer to a later release. After the first meeting where we decided scope, he may as well have said, “You know those items you said we need to do and couldn’t defer – are you sure we can’t defer them?”, because he kept arranging subsequent meetings to go back over them. And when we came up with estimates showing we would need at least another month, he arranged yet another meeting to re-estimate them.
The Architects
An important project started around the same time ours did. Our architect, Adam [Architect], was reassigned to the new project, and Andy [Architect] joined our team as a replacement. He wasn’t completely new to the company but wasn’t familiar with this area of the system. I don’t think he ever looked at the software or even requested a demo.
Whenever we asked him a question, he would make an excuse that he was busy and would get back to us later. Then, when he did answer, I messaged the original architect, Adam, who told me Andy had asked him about it and simply relayed the answer back to us. So basically Andy wasn’t doing anything. We had him officially assigned, but it was Adam [Architect] answering the questions, via a middle-man.
The July Cancellation
There was a bit of disruption when our project looked to be cancelled, but there was apparently some miscommunication.
Hi All, a decision has been made by Directors to stop Batch Approval and to move resources across to pick up Project France instead. Therefore I will be cancelling the Batch Approval meetings.
Project Manager
1 day later
The directors had decided to move you to the new project so I cancelled the meetings, but then I find that there wasn’t a firm decision from the Directors.
Project Manager
Brian has asked us to proceed with Batch Approval as originally planned. Sorry about the chaos dudes. They must be smoking some good drugs upstairs.
Olivia [Product Owner]
It was off the table, then someone put it back on the table, then someone else swept it off the table, then someone picked it up off the floor and put it back on the table.
Andy [Architect]
Coding Tales
Colin [Technical Manager]: "What sprint are you in?"
Me: "I dunno"
Colin [Technical Manager]: "you are the team lead, you should know"
Me: "No one in the team knows"
Put it in a new tab but make it behave like a dialog
The original UX designs looked like they would fit nicely into the existing Task Framework. The requirements were that Batch Approval had:
Its own folder but is a sub-folder of Approvals
Opening a task opens it in a new tab
After looking at the code, though, the framework didn’t actually support a sub-item. We found a basic workaround to make it look like it did. However, there were quite a few features that we got “for free” but didn’t want, because they weren’t appropriate for a sub-folder. So I had to disable those features with hacky code.
If you double click a task, then it opens in a new tab, which is what they wanted. However, they then didn’t want you to be able to navigate away into other parts of the system, and the Task Framework didn’t support that. With a bit of a workaround, I got that working, but the tab was designed to view one task only, and we are displaying a Batch of them. A few weeks went by and I managed to cobble something together, but the code was awful.
I took a step back and thought about it:
– We have a tab that users would surely expect to be able to navigate away from to view other tabs.
– I’m using this “tab” which is designed for a single task, and I want multiple – so I had to make my own custom page.
– We have hacked in a sub-folder and had to basically fight against the codebase to get it all working.
So why don’t we just have a button on the main folder that launches a modal dialog?
– It would take a couple of days to get working,
– the code would be neat,
– and I think it’s what the user would expect.
After speaking to UX about it, they were happy with my proposal. I had wasted about 3 weeks trying to get it working the way they previously wanted. Also, once again we were the ones telling UX what a good UX design is.
Scrollbar
The UX designs were also clear that we didn’t want a scrollbar to appear; instead, we use pagination. I didn’t see anything obvious in the standard WinForms DataGridView control for this, although I’m sure it’s a common problem/requirement.
I ended up writing my own logic to add controls to the grid, keep track of the size, then stop adding when the size exceeds the height of the control. However, if there is only 1 very large task, we have no choice but to use a scrollbar.
The problem we encountered was that sometimes a scrollbar did appear when it shouldn’t. I made some tweaks to the calculation and it seemed to work fine. But then a Tester found a combination of task sizes where it still appeared. I couldn’t work out what I was missing in the calculations but it seemed about 4 pixels off, so I just added that into the calculation. Again, all seemed fine for a few days, but then the Tester found a combination of sizes where it still appeared.
Olivia [Product Owner] suggested that we detect when there is a scrollbar then disable the Approve button until the user scrolls down.
I said if we know when the scrollbar is there, why don’t we just remove the last task and check for the scrollbar again, repeat until the scrollbar has gone. I thought the code would be messy, and I’d end up writing a stupid code comment like “mate, something has gone wrong with the calculations here, so we’re gonna have to do some jiggery pokery to get out of this mess”.
Adam [Architect] did suggest some alternatives and they were just as wildly wrong.
Dean, a developer in another team, agreed to help, and after a couple of days he said: “you can just set the vertical scrollbar to be disabled”.
But if the scrollbar is appearing so you have to scroll to view the content, then surely disabling the scrollbar will mean content is off the screen?
I tested his idea, and it worked fine! What must have been happening is that the vertical scrollbar appears and takes some of the horizontal space… which causes the text to wrap and creates the need for more vertical space. Therefore the scrollbar is required and so it remains. But if you tell the scrollbar it cannot appear, the controls are added and my calculations fit perfectly in the grid.
It’s a self-fulfilling prophecy!
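If the grid were a standard WinForms DataGridView, Dean’s fix would amount to one property assignment (our real control is custom, so treat this purely as an illustration of the idea):

```csharp
using System.Windows.Forms;

// Sketch of Dean's fix, assuming a standard WinForms DataGridView.
// Forbidding the vertical scrollbar means it can never steal horizontal
// space, so text never wraps, the height calculation holds, and the
// "need" for a scrollbar never arises in the first place.
var grid = new DataGridView { ScrollBars = ScrollBars.Horizontal };
```

`ScrollBars` is the real WinForms property (`None`, `Horizontal`, `Vertical`, `Both`); the point is simply that suppressing the bar breaks the feedback loop described above.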
Olivia [Product Owner]: Do we have concerns about the unknowns?
Tim [Tester]: It's just the unknowns that we don't know about
I feel like you need to know the system inside and out to be able to safely implement this
Conflict With The UX Team
UX: “We want to minimise pop-ups”
Also UX: “Add a pop up after closing the dialog”
Ulrika [UX] had to take time off to deal with some personal problems. Ursula [UX] agreed to join the meeting we arranged on the Wednesday.
“I don’t work Thursday/Friday and have to leave early on a Wednesday to get the kids. I’ll get back to you next week”.
Ursula covers for Ulrika but then also has time off.
When she got back to us, she seemed to overlook how users access this restricted part of the system – and it turned out none of the UX team actually had this knowledge. So halfway through the project, we were still discovering new requirements because they hadn’t designed the user flow.
Don’t Have Time
In early January, we were waiting for UX to give us some approved text, but they seemed to be taking their time. I asked Olivia [Product Owner] what was going on, and she said we didn’t have time to make any more changes, so they “needed to stop requesting changes”. Even when I pointed out that I was the one requesting the changes, she said “we don’t have time to test” (even though testing only involved quickly checking that some text had changed on a message box). Nearly 2 months went by before we actually began to release.
After more protests from me, she says:
“The text is fine for now. We don’t have time to be changing it.”
Olivia [Product Owner]
When it came to the final review, reviewers questioned why we had dialogs with ToDo comments on them saying “ToDo: Awaiting UX approval“. Even without comments like that, I have seen developers question user-facing messages if the grammar isn’t correct or the wording sounds unclear. It definitely wasn’t clear, because we had just written the first thing that popped into our heads at the time, knowing the text would be replaced.
I think what had happened was that Mary [Technical Manager] and Olivia [Product Owner] had fallen out with Ulrika [UX] and were refusing to authorise her changes. Remember, tensions would have been building ever since users criticised Ulrika’s design and wanted Mary’s, and Mary’s arrogance about it wouldn’t have gone down well.
It’s just part of the process though – all text needs to be approved by the UX team; otherwise what is the point of their team?
Conflict With The Architect
When we implemented Adam [Architect]’s suggested invalidation logic, we thought the criteria were too restrictive. Adam was on annual leave for a few weeks, so we couldn’t consult him. We made our own decision to change it, and got Carl [Customer Representative] and Grace [Safety & Legal Governance] in agreement. However, when Adam saw it, he said it was unsafe. In many meetings, I got the impression Grace wasn’t really listening; she tended to just agree with whatever we said. Not exactly great when your job involves telling the team what is safe and legal, only to be overruled by the Architect.
We came up with a compromise, and implemented it. Then when it came to the Code Review, Adam suggested removing one more of the sub-rules which I think would be perfect, but then Olivia [Product Owner] was reluctant for us to make more changes.
Then, a week later, Olivia said she would arrange another meeting to discuss the rules because she felt they might be too restrictive. OMG. However, she then seemed to have personal grievances with Adam, so she told me not to make the simple change, even though it would be what we want. She used the excuse of a lack of testing time.
Adam [Architect]
We shouldn’t be knowingly introducing bugs.
Olivia [Product Owner]
This is not a bug. It’s a change to the criteria and we are not going to change it a week before we finish. I am speaking to Carl [Customer Representative] about changing the criteria, and we’ll look at it then.
Adam [Architect]
A bug is any deviation from requirements. Why are you planning on changing it if it is not a bug?
Olivia [Product Owner]
That’s not a bug. You are right in the sense that we need to change it…we’re just not changing it now. I was happy to leave it as it was to get this out of the door. That’s my call to make.
Mary [Technical Manager]
There's a lot that's not right. But how long do we keep going until we give it to the customers?
A summary of how this situation appears to me:
There is a process, but if you declare you want to move the process to the next release, then it is fine.
It will take too long to change a few lines of code, so we ain’t doing it. Apart from when it is a comment on the Code Review, then we are doing it, apart from those that we aren’t.
It takes longer for Olivia [Product Owner] to argue against it than to fix it.
The CEO had recently posted:
“The most important thing we do every day is keep our users and their customers safe by managing risk effectively. I know you all know this, but it warrants repeating: safety is our number 1 priority all day, every day – regardless of anything else that is going on. It trumps everything. Please always remember that.”
CEO
Our Managers are like:
“Next release”
The Technical Manager change
Colin [Technical Manager] complained that Daniel [Developer] and I hadn’t handled the project well – it had overrun by over a month at that point. A week or so later, the team was on a call with other stakeholders and he said
“you guys have done a tremendous job”,
Colin
then said the delay “was caused purely by scope-creep and nothing to do with the developers at all”.
“Mary is in charge of the team since yesterday”
Colin [Technical Manager] with his timely announcement
I got the impression that Mary just wanted to get rid of the project, because it was dragging on for far too long.
The Testers had nothing to do since us Developers were working on the last few bug fixes. Tina [Tester] said she was just re-testing old features to pass the time, but also get extra confidence there are no remaining bugs. Mary [Technical Manager] replied:
“should we be doing testing when changes are ongoing?”
Mary
Well, in that case, this statement means testers should only be hired for a couple of weeks right at the end of a project – since changes are constantly ongoing. I think she might have intended it to mean like “you’d better not find more bugs!”, but if there are bugs, then you definitely want to find them before our users do.
On the last day of the Sprint, Tina [Tester] took annual leave. She had left her assigned items in the “To Test” column of the Kanban board. There was no evidence she had tested them, so I don’t think it was a case of just forgetting to move them to the “PO Approval” column. Olivia [Product Owner] and Mary [Technical Manager] then decided to just close the items. No evidence, no demo – just close them so the Sprint looks good and looks ready to release.
What annoys me is that Mary had criticised how we ran our team and claimed we don’t follow the process. She stated that she follows the process perfectly – which leads to her successful projects. Then I see her cutting corners like that.
Just like Colin, she criticises me to my face, but then when we are in a group she states:
“I think you’ve done a fantastic job given that there’s only 4 of you”
Mary
A few days later, I had finished what I was assigned, but there was a bug on the backlog which Mary [Technical Manager] seemed to want to defer (again, she just wanted to release the project as soon as possible). I thought it couldn’t be released without this fix. I stated that I would like to look at it and she said:
“don’t do any development work”
Mary
Seems I have the day off then. What is the point in me sitting around doing nothing? If I fix it, we can decide whether it goes straight in or is deferred to the next release. Or maybe I won’t even find a solution. She just seemed desperate to finish the project, so she wasn’t considering the seriousness of the bug, or thinking logically at all.
The Backstab
I didn’t actually sit around doing nothing. I worked hard and found a solution. I knew that there was no chance Mary would accept my changes, so I needed to come up with a way of convincing her. My plan was to get the testers to informally test it, then I can say that I have a fix, and the testers are happy that there’s low risk of introducing more issues – so she would be stupid to reject it.
Testers Tim and Tina were in agreement that the fix should definitely go out in the initial release, and they agreed Mary was making a bad decision to consider releasing without it.
Tim said he would “have to check with Mary if he was allowed to spend time testing it” since they got told not to test anything. I said “there is no way she would approve it, that’s why we are doing this informally/secretively”. If Tim and Tina test it and find a bug, my plan has failed and Mary never needs to know that I attempted it.
It’s a perfect plan, or it would have been, but Tim then goes and tells Mary that I asked them to test it.
“You gotta start being better with your communications – it’s not just yours and Tim/Tina’s decision if something gets put into the release – it’s a whole team decision but ultimately mine and Olivia’s. You’ve messaged them directly asking if they can get it tested, and as much as they’ll also want to get it done, it then puts them under pressure. This is how you’ve all got to a stage of being all over the place and burning yourselves out, it’s got to stop please.”
Mary’s chastisement
I shouldn’t have to go behind people’s backs and make my own decisions, but the entire non-management side of the team thought the fix should go in, and only the managers thought it shouldn’t. As a team we care about quality, but the managers were just focussed on deadlines.
I also didn’t appreciate her accusing my decision-making of adding stress to my team.
80% coverage
As the project neared completion, I recalled our stupid “Merge Ready” process that no one seems to care about other than the small team who came up with it. You have to justify metrics like Code Coverage, and ours was at a low figure like 10%.
I’ll write some future blog posts about my reasoning on when tests are good or bad. The short version: unit tests are good when they cover requirements, but developers often write them to cover implementation – i.e. they verify that a particular method is called, but not that the method actually works. When you switch implementation, you have to rewrite the unit tests, which slows you down. Unit tests are supposed to help you refactor, but in this case they are a hindrance to refactoring. We did a lot of prototyping early on and knew there would be large re-writes, so Daniel [Developer] and I decided to worry about unit tests later.
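A hypothetical illustration of the distinction (none of these types are from the real project). A test that merely verifies “`Save()` called `Put()` on the store” pins the implementation: swap the storage mechanism and the test breaks even though behaviour is intact. A test against the requirement – a saved task can be found again – survives any internal rewrite:

```csharp
using System.Collections.Generic;

// Hypothetical example for the requirements-vs-implementation point.
public interface ITaskStore
{
    void Put(string id);
    bool Contains(string id);
}

public class InMemoryStore : ITaskStore
{
    private readonly HashSet<string> _ids = new HashSet<string>();
    public void Put(string id) => _ids.Add(id);
    public bool Contains(string id) => _ids.Contains(id);
}

public class TaskService
{
    private readonly ITaskStore _store;
    public TaskService(ITaskStore store) => _store = store;
    public void Save(string id) => _store.Put(id);        // implementation detail
    public bool Exists(string id) => _store.Contains(id); // observable behaviour
}
```

The requirement-level test is then just “save a task, assert it exists” – it never mentions `Put()`, so refactoring the storage layer can’t break it.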
When I declared the low number of Unit Tests, Olivia ended up raising it to the Directors for some reason. Why was it their concern? Do they even know what Unit Tests are for, and what the coverage actually means?
It could jeopardise my chances of pay rises (I was correct: I got 0% this year) and it tarnishes my reputation.
When Mary joined the team, she berated me over this decision and made the dramatic statement:
“We can’t go on like this”
Mary
She then asked a couple of her favourite developers to write some Unit Tests for my project, completely undermining me.
The thing is, both Dean [Developer (Temporary)] and Dennis [Developer (Temporary)] spent far longer than they estimated and didn’t do as much as they hyped; then, when it came to making the last few changes, their tests slowed us down.
We ended up around 22% in the end, and the managers decided that is fine.
That’s the problem with us though… Do 80% coverage because it’s important. But actually it’s not that important, so you don’t need 80%. But TRY get 80%, Why?, Dunno, but the Document says.
Tim [Tester]
On track
Dennis [Developer (Temporary)] was also asked to help address the Code Review comments. In some ways, this actually slowed us down. I told him I already had a branch with some changes in it and sent him a link so we could work together. When I caught up with him the next day, he said he had been working on a few of the comments I had already addressed, because he hadn’t looked at the link. What a waste of time.
When Mary asked for a progress report, Dennis reckoned it would take 1 day to go through 20 comments – but he had done 8 easy ones the day before, and we had the hard ones left. So I said it would be more like 4 days, and could take longer if they turned out to be surprisingly complicated. I was correct.
Manager: How's the project coming along?
Developer: It's on track.
At the final Project Demo, Carl [Customer Representative] said the sending process was far too slow. He had been on most of the demos from the start and had seen the progress across the project.
The original version I showed him was incredibly slow, but I had managed to speed it up significantly since. So despite having witnessed the state of the project months earlier, he said the performance was a concern and that users might not see it as a significant improvement.
We had all kinds of people turn up to this final demo – people from support, training, etc. We should have had those guys in the early meetings. They were poking holes in the requirements and asking important questions. Although we gave good answers to most of them, I couldn’t help but think our changes might not be as useful as we thought.
If only we got more users involved throughout the project, rather than just some UX mock-ups before we started, and then a year later – give them the update and hope for the best.
I’d like to reiterate just how hard the team has worked. They have worked their little socks off
Olivia [Product Owner]
Conclusion
We were told how important the project was, but because there wasn’t a direct commercial aspect, I felt that importance wasn’t backed up by the number of developers assigned. With only 2 developers, and key staff like Architects and Product Owners switching throughout the project, it slowed us all down and made us all feel it was actually a low-priority project.
There were other morale-reducing aspects like when we were told the project was on hold, then Mary berating my decisions, and implying the failures were down to me.
There wasn’t a great understanding of the feature in many ways, illustrated by
how many requirements we discovered throughout the project,
the UX team being clueless about many aspects,
one Product Owner so clueless it seemed he struggled to use a computer,
then switching to a clueless Architect who just relayed everything from the original architect.
General Summary Of Why The Software Developer Hiring Process Is Bad
When I was involved in trying to come up with something better, I was frustrated: I didn’t have the answers, but I felt everyone else involved didn’t even understand my concerns.
I hate the way Tech Interviews and tests currently are, but I think it’s hard to come up with a replacement. When researching example questions, I see lots of questions along the lines of “who invented Object Oriented Programming?”, and I think “I don’t care”.
Then there are loads of things you would just Google if you didn’t know them, e.g. the memory usage of each type.
Or you may get questions like “what is wrong with this code?”, but if you actually pasted that code into Visual Studio and tried to build it, you would see the error. Or, where the code is valid but not advisable, Code Analysis would flag it. Some of these are trick questions involving mathematical logic where the order of execution looks ambiguous, but Code Analysis would definitely tell you to add brackets, so it’s silly having these as questions.
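To illustrate my point (this snippet is my own made-up example, not one of our actual questions): the classic “spot the bug” cases either fail to compile outright, or get flagged by the analyser anyway.

```csharp
using System;

int x = 5;
bool flag = true;

// Classic "what's wrong with this?" – in C# this doesn't even compile:
// error CS0029: Cannot implicitly convert type 'int' to 'bool'
// if (x = 5) { }

// Valid but ambiguous-looking precedence – analysers suggest adding
// parentheses to make the intent explicit:
if (flag == x > 3)               // parsed as flag == (x > 3)
{
    Console.WriteLine("ok");     // prints "ok"
}
```

So the “trick” evaporates the moment the candidate has real tooling in front of them.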
Then I hate all the generic FizzBuzz stuff (check if a number is divisible by 3 or 5, or both). These questions are so cliché that people could just learn them from memory anyway.
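For anyone who has somehow avoided it, the entire cliché fits in a few lines – which is exactly why memorising it proves nothing:

```csharp
using System;

class FizzBuzz
{
    static void Main()
    {
        for (int i = 1; i <= 15; i++)
        {
            if (i % 15 == 0) Console.WriteLine("FizzBuzz"); // divisible by both 3 and 5
            else if (i % 3 == 0) Console.WriteLine("Fizz");
            else if (i % 5 == 0) Console.WriteLine("Buzz");
            else Console.WriteLine(i);
        }
    }
}
```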
Also, what do the graduates we are targeting actually know? Back in my day, I knew Java, but I was terrible and didn’t know how to test properly, or even debug. So there’s no point testing them on that either.
Stupid Questions From The Internet
During our meetings, I asked the team how they were coming up with the interview questions, and they said they were just Googling for C# interview questions. This is problematic because applicants may have read and learnt from those very sites, but also because many sites had terrible questions, which were sometimes simply incorrect.
95. Is the use of return statement necessary in every function in C#?
Yes
No
Answer: A) Yes
Explanation: Yes, the use of return statement is necessary in every function in C#.
I think some programming languages do differentiate between “methods” and “functions”, where “functions” always return something. But in C# you just call them “methods”, and they can either return something or be declared “void”. So this answer is just incorrect.
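To make my point concrete, here’s a minimal sketch: a C# method declared void compiles and runs fine with no return statement anywhere; only non-void methods must return a value on every path.

```csharp
using System;

class Demo
{
    // A void method – no return statement required at all.
    static void Greet(string name)
    {
        Console.WriteLine($"Hello, {name}");
    }

    // Only a non-void method needs a return on every path.
    static int Square(int n)
    {
        return n * n;
    }

    static void Main()
    {
        Greet("world");               // prints "Hello, world"
        Console.WriteLine(Square(4)); // prints "16"
    }
}
```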
99. Which is the first line of a C# program?
using System;
using system;
using Namespace;
namespace MyApplication
Answer: A) using System;
If you create a blank class, “using System;” is the first line, but: A) you could remove it entirely, B) the order of using directives doesn’t matter, and C) in modern C# you could replace it with a “global using” in another file.
I also asked “who are we targeting?”, because everyone was coming up with C#-specific questions. My colleagues told me the universities we were targeting primarily taught coding in Python! So we were either writing the wrong test, or targeting the wrong people.
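As a quick sketch of why the “first line” answer falls apart, here is a program with no “using System;” at all, plus the modern alternative as a comment:

```csharp
// Option A: no using directive – fully qualify the type instead.
namespace MyApplication
{
    class Program
    {
        static void Main()
        {
            System.Console.WriteLine("No 'using System;' needed");
        }
    }
}

// Option C: in any other file of the project, a global using
// makes System available project-wide:
// global using System;
```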
The Performance Review
In my recent performance review, my manager Colin stated:
“You were given an opportunity to lead on defining our interview process, but have not shown much interest in it. Ultimately someone else led it, creating the interview pack, redefining interview questions and leading live interviews, which I think is a missed opportunity to shine.”
Colin
I argued that although I didn’t have a huge impact on the end result, it would have been a bigger mess had I not pushed back on some of the questions – although it seems some of the interviewers still asked questions I’d said weren’t suitable. We ended up with a terrible process: some candidates dropped out, the ones who went through with the interview struggled, and we were left to gamble based on the little we saw.
The Best Hire
One of the best hires we made was someone I didn’t interview, but looking at the interviewer’s notes, he didn’t exactly look like a strong hire:
Implemented the basic application to consume webapi and just printed the values based on some predefined values
No unit testcases were written
application is simple
Inheritance – simple explanation
polymorphism – no idea
interface – not good explanation
purpose of inheritance – No idea
Print the numbers and its occurrences – Not completed, but after showing the hints he completed
Print the number from 1 to 100 using Recursion – Completed
print the numbers based on divisible values – Completed
Logical, cut the cakes into 8 pieces with 3 cuts – Completed
Candidate lacks on syntactically at some areas, but has some logical solving skills and good attitude on approaching things
When I told Colin it “wasn’t a great batch of hiring”, he said that out of the people we did interview, we ended up hiring around a third of them – which he said was above the normal hiring rate. I said it was only above the hiring rate because we hired people who should have failed. Later, he quoted the numbers as “15 interviewed, 4 hired” (more like a quarter, then). And that ignores the many applicants rejected before the interview stage, and the others who dropped out along the way.
The Polymorphism question that was asked on some of the interviews was basically “do you know the types of polymorphism?”
I didn’t even understand the question, and I probably have 10 years of experience.
I was only thinking of the usual subtype polymorphism, where you can refer to a derived type via its base type:
IAnimal animal = new Dog();
animal = new Cat();
To check if I wasn’t going mad, I asked some of my colleagues. Here is how they responded:
“not sure i do, no. Inheritance and Composition, are they right?”
Dean (Senior Developer around 10 years experience)
Like covariance and contravariance? But I don’t think I could put it into words. I guess you have real polymorphism and fake…virtual functions have a virtual function table that map overridden functions from base classes. But a lot of polymorphism is just the compiler being intuitive and going “you’re not crazy” it can be treated as that type.
Adam (Software Architect)
Types? an interface can be implemented many times. A method can have the same sig but different parameter types etc
Rob (Senior Developer around 20 years experience)
According to https://www.bmc.com/blogs/polymorphism-programming/, the types are Runtime, Compile Time, Overloading, and Casting. So if very experienced developers can’t explain the concept, why are we using the question to target graduate developers?
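For what it’s worth, my best guess at mapping those terms onto actual C# looks something like this (my own interpretation of that blog’s categories, not an official taxonomy):

```csharp
using System;

class Animal
{
    // Runtime polymorphism: virtual dispatch picks the override at runtime.
    public virtual string Speak() => "...";
}

class Dog : Animal
{
    public override string Speak() => "Woof";
}

class Program
{
    // Compile-time polymorphism: overloading resolved by the compiler.
    static string Describe(int n) => $"int {n}";
    static string Describe(string s) => $"string {s}";

    static void Main()
    {
        Animal animal = new Dog();         // "casting"/subtype polymorphism
        Console.WriteLine(animal.Speak()); // prints "Woof" (runtime dispatch)
        Console.WriteLine(Describe(42));   // prints "int 42" (compile-time overload)
    }
}
```

None of which a graduate is likely to articulate under interview pressure, which was my whole objection.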
Conclusion
It annoyed me that Colin said I “have not shown much interest in it”. I was trying to make sure we had well-defined objectives about the exact type of developer we were hiring, while everyone else just copied questions from the internet without considering the knowledge and experience of the people we were targeting. I didn’t have the solution for a good software development hiring process, but I really wanted to make sure we got close – I just didn’t have any support from anyone else involved.
Colin said that our Sprint Burndown chart wasn’t accurately reflecting the work done. He said we were overrunning on each Work Item, which meant we would have 0 points for one Sprint, then get the full value in the next Sprint – showing as a sharp drop on the Burndown chart.
I told him that’s how Burndown charts work, but he said they wanted more accuracy. I argued further: if the requirements haven’t been met, then the item isn’t complete, so you haven’t added value to the software – i.e. you have made no progress.
A few days later, our Scrum Master had been in a meeting with him and was instructing us on his new process. Colin’s idea apparently was to add a “task” for each day and link it to the Work Item. At the end of the day, you mark the task as closed.
I’m like “eeeeeeer wut”. So now it tracks your daily work.
I told her she must have misunderstood. Adding a task per day is just counting the number of days in the week. I suppose if you take a day off, then you won’t count that day.
I questioned it, and she agreed that it didn’t sound right. So she goes back to Colin. No, he really does want a closed task per day, but also said to create a task even if you are on holiday.
Wut.
add tasks to all the things!
So they want effort tracked more accurately, but are now just counting days – even the ones you haven’t worked. Surely if you create a chart from that, it’s just gonna be a diagonal line with no fluctuations.
What are we supposed to write for the task’s title and description? “Carried on working on it, 7.5 hours”?
I just refused to do it, but the Scrum Master did the admin on my behalf. The idea lasted about a month.
I find that Burndown Charts often look unclear anyway. Here is one from another team:
So what are we even seeing here? The chart on the left shows Tasks, although it doesn’t seem to show the correct Completed figure – it shows as 0%. However, you can see the average is 46, which I think is per day – which illustrates the ridiculous number of tasks teams were creating anyway.
The chart on the right shows User Stories, but I think it’s not the number created, rather the total points assigned – one story might be worth 1 point, another 8, depending on how complicated the work is. I think this is a typical Burndown: in the first few days nothing is complete because the developers are still working on fixes, then the tester gets them a few days later. In the second week, more items are completed. There were even 4 points removed, presumably a change of requirement, or maybe something was deemed redundant.
This is another chart that a team posted to boast about their progress. This example is a bit less clear, but I noticed the Tasks Burndown (starting 12th September) doesn’t cover the same period as the Stories Burndown (starting 29th August).
The Stories Burndown looks interesting, doesn’t it? It looks like only a small amount of work was done, and then at the end of the second week they add even more work. I did theorise that maybe they didn’t officially start the project until after 10th September, but then what does the 72% Completed mean? That seems to imply they are ¾ of the way through their project. ¯\_( ͡° ͜ʖ ͡°)_/¯
A Product Manager recently wrote about how Software Bugs aka Problems are prioritised, so I thought I’d share that here.
Prioritisation Spotlight Report
Product Managers conduct a weekly meeting with other stakeholders to discuss Problems and their effects on our customers. A key output of this meeting is to ensure that we are prioritising the defects that are causing our customers pain or have the ability to do so.
These Problems can range from being the result of a major incident, recent software upgrades, or internal database monitoring. However, what they share is that they all have the ability to generate customer dissatisfaction.
The Product Managers have been ensuring we are able to accurately and consistently apply logic to the Prioritisation process. This is a key requirement of the Problem review that allows us to create the prioritisation for Development to work on.
How does this work?
The weekly Problem Prioritisation meeting is open to anyone who has a business interest in resolving these software defects for our customers. When discussing these defects as a group, a number of areas are covered, some of the new key areas are below:
Number of cases linked to the Problem – this is multiplied by 2, so a Problem with 4 cases generates a score of 8, for example
CSAT (Customer Satisfaction score) – the level of market impact the defect has, or is expected to have, scored at 4 levels: Critical, High, Medium, Low.
Software upgrade blocker – does this hold up the ability to patch customers to a newer version of the software?
Safety rating – Does this have safety implications?
The Prioritisation reason – has the Problem been raised as an internal escalation, Safety, Information Governance, Security, Customer pain/Escalation, Service Level Agreement, or an Enhancement via the User Defined Roadmap?
IG – Does this have Information Governance implications?
Number of users impacted – taken into account based on how widespread the issue is and how many customers are affected.
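Reading between the lines, the scoring presumably boils down to a weighted sum of those factors. A rough sketch of how that might work – note the report only confirms the “cases × 2” rule; the CSAT, blocker, and safety weights here are my own invented placeholders:

```csharp
using System;

class ProblemScore
{
    static int Score(int linkedCases, int csatWeight,
                     bool upgradeBlocker, bool safetyImplication)
    {
        int score = linkedCases * 2;        // confirmed by the report: cases x 2
        score += csatWeight;                // e.g. Critical=8 ... Low=1 (my guess)
        if (upgradeBlocker) score += 5;     // placeholder weight
        if (safetyImplication) score += 10; // placeholder weight
        return score;
    }

    static void Main()
    {
        // 4 linked cases alone give a base score of 8, matching the report's example.
        Console.WriteLine(Score(4, 0, false, false)); // prints "8"
    }
}
```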
What will this allow us to do?
This is going to help align us all with the vision across the different areas of the business, and will give key stakeholders access to a single source of truth when scoping these items into teams’ backlogs. A “Top 100 Problems” list will be updated after the weekly meetings.