Datadog

Introduction

Recently, the CTO and other senior managers have been saying that we need to modernise our technology and tools. One thing they love is software that produces statistics/metrics, so that improvements can be judged over time.

When we buy licences for such a tool, there is always hype among certain groups of people, who volunteer to take ownership of implementing it (installation, training, creating a “best practices” process) and take any opportunity to highlight their work to the managers.

So the “soup of the day” is a tool called Datadog, which seems very powerful and has all kinds of integrations. I found this “jack-of-all-trades” approach made it difficult to understand what Datadog was actually for, and why it was different from what we had before. I knew we had dashboards that showed which servers were running, their processor/memory usage, which versions of our software were installed, and more. Datadog is used for this purpose too.

https://twitter.com/_workchronicles/status/1509146599355781122?s=20&t=QxTz3UkI_BvJg3WdTXk12w

Jargon Sales Pitch

One reason it is difficult to understand is that Datadog’s webpage spouts loads of jargon – and internally, our managers love spouting jargon too. Here is what one DevOps member said about Datadog (warning – the next paragraphs contain a lot of jargon):

“As our organisation continues to grow and evolve, it is essential that we have a comprehensive and centralised observability solution in place. Currently, we are using multiple disparate siloed monitoring tools, which not only is inefficient but also hinders our ability to identify and resolve issues promptly. This leads to decreased visibility and a lack of agility in our operations.

Datadog observability provides a unified platform that consolidates all our monitoring, logging and tracing tools into one solution. This not only reduces the complexity of our monitoring landscape but also gives us a single source of truth for all our operational data. By implementing Datadog observability, we will have the ability to quickly and easily identify and resolve issues across our entire infrastructure, reducing downtime and improving overall service levels.

Moreover, Datadog observability offers the ability to deploy configuration changes to the Datadog agent with agility, which is critical in a fast-paced and dynamic environment where changes to our infrastructure occur regularly. With Datadog observability, we will be able to quickly and easily make updates to our monitoring configuration, ensuring that our monitoring remains up-to-date and relevant at all times.

With a pre-approved change, it will be easier for us to leverage the 600+ integrations that we can configure to further enhance our current infrastructure observability, root cause analysis and incident mitigation. This will allow us to gain greater insights into our operations, improving our ability to identify and resolve issues before they become critical.

In conclusion, authorisation and creation of a Datadog pre-approved change will bring numerous benefits to our organisation, including increased visibility, improved agility, and reduced complexity. This solution will help us effectively monitor and manage our infrastructure, ensuring that our operations run smoothly and efficiently.”

DevOps Engineer

That really sounded like he was saying the same thing multiple times, with heavy emphasis on speed. I think a more concise statement is: “Datadog is one software product for monitoring, and can replace many of the metric tools we currently have”. So I would imagine it should be cheaper (paying one licence rather than several), and since everything is in one place, it is probably easier to create new dashboards.

Jargon From The Docs

On their page, Collect SQL Server Custom Metrics, they show how you can run a custom query involving a person’s age. Isn’t that a terrible example? The query would run every minute (or however often it is configured to), and you would create graphs from it. Without good examples, it’s hard to understand how or why you would use this feature. Other problems are due to excessive jargon.
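For context, their docs show that these custom metrics are defined in the SQL Server integration’s config file on the agent. A rough sketch of the shape it takes – the host, credentials, query, metric name, and tag below are my own made-up examples, not lifted from their docs:

```yaml
# sqlserver.d/conf.yaml -- illustrative sketch only; everything below
# (server, query, table, metric name, tag) is hypothetical
instances:
  - host: "myserver,1433"
    username: datadog
    password: "<PASSWORD>"
    custom_queries:
        # Each column of each returned row maps onto a metric value;
        # this one emits a single gauge with whatever COUNT(*) returns
      - query: SELECT COUNT(*) FROM dbo.Person WHERE Age > 65
        columns:
          - name: people.over_65
            type: gauge
        tags:
          - team:demo
```

The agent then runs that query on its collection interval and ships the number off as a time series you can graph – which is exactly why an “age” example is so odd: it is a value that barely changes minute to minute.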

“In v6, DogStatsD is a Golang implementation of Etsy’s StatsD metric aggregation daemon. It is used to receive and roll up arbitrary metrics over UDP or Unix socket, thus allowing custom code to be instrumented without adding latency.”

Datadog
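Decoded, that jargon means something quite simple: your application fires a tiny UDP datagram at an agent listening locally and never waits for a reply, so the application isn’t slowed down. A minimal sketch of the idea in Python – the metric names and tags are invented, and the real Datadog client libraries handle more of the wire format than shown here:

```python
import socket

def format_metric(name: str, value, mtype: str = "c", tags=None) -> str:
    """Build a StatsD-style datagram, e.g. 'page.views:1|c|#env:test'."""
    payload = f"{name}:{value}|{mtype}"
    if tags:
        payload += "|#" + ",".join(tags)
    return payload

def send_metric(name, value, mtype="c", tags=None, host="127.0.0.1", port=8125):
    """Fire-and-forget: one UDP datagram, no reply is read, so the caller
    is never blocked -- this is the 'without adding latency' part."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(format_metric(name, value, mtype, tags).encode(), (host, port))
    finally:
        sock.close()

print(format_metric("scheduler.jobs_in_error", 3, "g", ["env:test"]))
# → scheduler.jobs_in_error:3|g|#env:test
```

Because UDP is fire-and-forget, the send “succeeds” instantly even if no agent is listening – which is both why it adds no latency and why dropped metrics go unnoticed.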

“Enabling JMX Checks forces the Agent to use more memory depending on the number of beans exposed by the monitored JVMs.”

Datadog

Official Training

Members of the Network team, DevOps, a few managers, and some volunteers (who want managers to look upon them favourably) signed up to a session with official Datadog training staff. The sessions were recorded; I watched them and made a few notes, although they were riddled with jargon and it was hard to know what anyone was talking about.

“Datadog Expert Services, or DES for short, is a set of “guided hands-on keyboard” pair-programming sessions. These collections of sessions, collectively known as an engagement, are time boxed and specifically designed to enable you to get the most out of Datadog while adhering to best practices. In this session, our team will work with you to configure and deploy the Datadog Agent. This includes deployment in a standard, or Kubernetes containerized environment.”

Datadog

These people seemed to be enrolled on two courses:

Name              | Time                          | Cost
QS-INF course     | 2 Weeks + 5 Sessions, Curated | $15k
QS-INF-LOG course | 3 Weeks + 8 Sessions, Curated | $25k

The training cost is bonkers, isn’t it? Once you have paid all that, it pushes you toward the sunk-cost fallacy.

One of the Instructors asked what our infrastructure was.

“we’ve got resources and infrastructure in Azure, with a bias towards AWS, then we have on-prem; most of it is Windows Server. A combination of 2012…and onwards. 2016, but mainly 2019 as well. They also run on Windows HyperVisor, and also VMware – so they are virtual machines. But actually, we also have physical servers as well.”

deployment dude

Basically, we just made it up as we went along and got all the things! It sounds like a similar thing happened with the monitoring, because the deployment dude said we have “16 or 17 on-prem monitoring tools, as well as custom Powershell scripts to generate some data to monitor”.

The Datadog Instructor explained that we have to log tickets for anything outside our “engagement time”. They will reply when they can, but there’s no set time-frame.

“That’s fine with us, we log enough tickets already, so that’s fine. I think we will welcome that.”

DevOps Engineer

It’s almost like we were taking any opportunity to slag our company off.

No Going Back

Good news everyone!

The DevOps engineers with support from the Architecture Team have levelled up our Live datacentres!

How? With estate wide deployment (completed Friday evening) of the incredible, uber-awesome full stack monitoring SaaS Datadog!

If you’re aware of Datadog’s capabilities, effortless integration and out-of-the-box features you’ll appreciate how monumental this is.

For the uninitiated, Datadog in a slick, AI driven, intuitive UX allows full stack monitoring of servers, databases, tools, services, containers, et al.

Effortlessly switch from viewing the entirety of all network traffic to drilling down into individual requests, logs, payloads, processes, you name it, in real-time.

Going forward we envisage significant improvements to our reaction and mitigation of all types of incidents, minor to major!

We are currently trialling access – To request access please join our Slack channel.

Stay tuned as we have more exciting stuff coming as a result of our DevOps strategy!

Watch this space!

DevOps Engineer

Web-based Demo

One team put together a small web-based app and presented a demo to the department to promote Datadog, and obviously, take the opportunity to look amazing in front of the management.

The team lead was trying to show a feature called “Cumulative Layout Shift” but didn’t explain it. He made out it could track how the parts of a website load. You know how you sometimes load a webpage and see some text, then an image suddenly pops onto the screen, then some adverts appear, often shifting the layout, then more adverts appear, possibly shifting it again? It’s not a smooth user experience, and it causes a lot of jerks if the user tries to navigate the page before it has fully loaded. (As far as I can tell, Cumulative Layout Shift is one of Google’s Core Web Vitals, measured in the browser by a monitoring script.)

So how does Datadog track that? What is doing the tracking? Wouldn’t that mean multiple extra server calls to Datadog to log it? The web page is already slow, so why would adding extra calls out to another server make it better? I can’t see how that can be performant, especially with thousands of users. Isn’t this process logging an insane amount of trivial data over time? I think I was left with far more questions than answers.

He also said it can track time spent on a particular web page, view count, error count, action count, and a “frustration count” (he claims Datadog can detect clicks made out of frustration – how?). People are already worried about the amount of monitoring/tracking/surveillance from the likes of tracking cookies, and now websites can track you at this granular a level with Datadog. It is a bit worrying, isn’t it!?

Everyone should use Datadog

In the following department meetings, the CTO told us that all teams would eventually use Datadog, that we need to increase the amount of monitoring, and that we should do it quickly to take advantage of the tool’s benefits.

My manager wanted our team to create a Datadog dashboard. Even if it wasn’t that useful, she wanted to be among the initial users – probably to look good to her manager.

I asked one of the smartest developers whether it was even suitable for my team. He was looking into creating a dashboard for his own team, but his team had an API that third parties could use, which was a prime candidate for this kind of monitoring.

He was a bit vague though:

“You could create a custom metric for it. But I wouldn’t be too sure. I’m probably going to use custom metrics for “#messages per APP per minute” sort of thing. But I can get all that from my Logs/Traces. You’d have to have something pulling that data from the main databases, which would involve pushing it to Datadog.”

Principal Developer

I asked other people that were using it, and people just kept on saying they weren’t sure, or maybe others have done it.

“We’re making heavy use of Datadog for our new software and I believe it’s also being used in other areas too. It’s incredibly powerful and provides a huge amount of detail. Getting the most out of it is important and also building some standards seems like a good idea. Do we have any thoughts around how we ensure we lead the way with this and get some standard/learning/documentation in place?”

Developer

No one can give a straight answer when it comes to this stuff. People are like “EVERYONE is using Datadog”; then, when you ask about it in more detail, they are like “well, SOME teams are using Datadog”; then, when you ask more people, they are like “there are some metrics, but not quite the ones you want”.

Performance Problems

I asked my Software Architect friend (who seems to know everything) whether Datadog was as flawless as people were implying. My intuition said it couldn’t have zero disadvantages.

Me
Won't Datadog just cause performance issues if we start monitoring everything?

Mark
yep, or run while patching is in progress and block access to the Database/tables, which has already happened. Running ad-hoc scripts is a fairly bad idea
Hosted had to run patching twice the other week, which pushed us out of our Service Level Agreement.

Me:
this juicy gossip gets kept quiet, doesn't it

Mark
yes because Datadog is a massive success and we paid lots of money for it


Technical Director

Recently we hired a “Technical Director”. He asked how Datadog was coming along and whether we could highlight any issues so he could get involved. This prompted John to go on a rant. The TLDR of this section is: “Software Developers don’t know about the infrastructure of the Live Production environment.”

I think one of the company’s biggest challenges is how many products we have, and how diverse they are. We have no real standardisation due to a number of different factors, not sun-setting old services, not tackling tech debt, products that were developed by other companies and came to us via acquisition etc..

As a result, I think it’s difficult for us to template things out such that it can work for multiple people.

Realistically, each team for each product needs to look at how their product works, how it’s used, what tech it’s built on, and build a solution that works for their product. And I think one of the biggest challenges at the company is that the ‘DevOps wall of confusion’ isn’t just a normal wall, it’s a Trumpian 15-foot-high one with razor wire. Lots of products have dev teams (assuming they have one at all!) with little to no exposure to, or knowledge of, how production works and what it looks like. For so long, dev teams were told they had no role in production and no need to access it, and were kept locked away from it.

For reference, I used to think like that. I’ve been here 15 years and I have been part of that mindset in the past. It’s changing, and I’m happy to be one of the people pushing for that change, breaking down that wall of confusion. But that’s one of your biggest hurdles: people don’t know what to monitor in production because they don’t know what it looks like, and trying to monitor it by copying a template that worked for somebody else, but doesn’t work for their solution, isn’t a way to solve it.

The key to unlocking Datadog for me, is to get people to have visibility of production, to understand how it’s used and what it looks like, and then start to work out what metrics are important, and what “normal” looks like so we can alert when we deviate from that

I can talk for hours about this, my team has one of the best observabilities out there, and had it before Datadog came around. If you want to have a chat, happy to have a discussion about what we can do.

I may have painted a somewhat negative opinion above, and I agree that there are things that we can improve. But we can’t expect some pretty Datadog dashboard templates to solve the historical problems that have meant we have lots of live services in the business with nobody who understands where they are or how they work and crucially expect Operations 24/7 to be able to magically pick up the pieces and fix it by themselves when it falls apart.

Yes, the company has a long history of developing a solution, moving the team that developed it off onto a new project, and leaving that solution behind. Combine that with a massive wall of confusion between Dev and Hosted, you have hosted running a bunch of servers that they have no idea what they do.

Case in point right now: the “Login and Identity service” is in the main data-centre, and we also have one in the DMZ that was built for the Mobile app, but nobody is quite sure what the main one is for. I have some notes that indicate it was built for the Connect app, but Connect doesn’t use it. Yet that production service still sits there unused, with nobody sure why it’s there.

You’ll find a team that has maybe done work in the past on Appointments, maybe even recently. Are they currently working on Appointments? do they have any knowledge or visibility of production? is it even on their radar that they should be monitoring the performance of it?

This goes deeper than just dashboard templates, it’s a company culture problem

John

Anomaly detection works well if the metrics are predictable across different periods of the day. It’s not “AI” as we thought; when I tried it out, it seemed more like a fancy algorithm than machine learning.

I found with XMPP that the method would work OK for Mon–Fri, but then the alert would trigger all weekend because traffic wasn’t as high on those days.

Lee

Scheduler

I was added to a group chat where the team was discussing how to use Datadog on our “Scheduler”. It sounds like an easy task, but there’s far more nuance and technicality to it. The main problems we have with the scheduler are that:

  1. some jobs fail and then wait to be reset,
  2. some jobs run but get stuck (I think in most cases the scheduler isn’t informed that the job has finished, so it fails to schedule the next run).

The TLDR of this section is that there is a lot of junk data (and I mean A LOT), and reporting on these figures can be somewhat misleading, because failed jobs for organisations that no longer exist aren’t a problem (although we should stop them from running, since they are obsolete).

John
Surely we need something that runs that shows us:
a count of jobs in Error status
a list of jobs with a status of Running
a list of long-running jobs


Matthew
We'll want to take into account the normal running time of a particular job. We don't want to be alerted about a job that usually takes 2 hours to run and it has only been 1 hour.
We'll get to ones that get stuck quicker if they usually take a minute to run

Dave
Someone should run some queries against live now, and get a picture that looks correct.

Matthew
We also want the data to be meaningful otherwise we'll be back to where we are now - where we don't know what's actually working and what isn't. There's a balance to be had here

Christian
Can we summarise the Key Performance Indicators that will cause an action that needs to be performed? These become multiple metrics IMO, that add together to give better context

John
1. Job queue building up
2. jobs failing and not being reset
3. jobs getting stuck

Matthew
• Large numbers of jobs not running when they should
• Jobs stuck in running beyond their normal running time
• Mass job failures
• Mass job queues (this has the potential to false flag when workload is high)

John
There's a bug / unexpected behaviour where the scheduler can fail to update the database with the result of a job, so the table shows it in status Running. The scheduler logic queries the tables for what is / isn't running, leaving it to decide that it can't do stuff because a job is "running" when in fact it isn't.

Matthew
If this is a bug, the smartest thing to do after the monitoring piece is to fix the piece of software causing it surely?

John
the secret to any good bug report is reproduction steps, and it's not an easy one to reproduce
You mentioned you'd had one get "stuck" recently. Do we know how to reproduce that behaviour again on demand?

Matthew
"Just let the scheduler do its thing and wait" is the only way we know how to replicate these

John
hence why any developer would struggle to fix it because it's difficult to know where to look if you can't reproduce it

Christian
"Treasure what you measure" or "Measure what you treasure". Simple counts and alerts will likely get us to what we need very short term which is to prevent or proactively manage / reduce potential for a Major Incident.

Matthew
I've got some initial queries together for this that could be used to alert on if the numbers get too high. I'd appreciate someone who knows TSQL to have a look and to suggest any improvements to the data being returned.

John
the 3000 + jobs in error is scary
Do we need to filter that list by only jobs that we know get automatically reset by the Hosted DBA agent jobs?

Matthew
Maybe, I did think that but I also thought that we should really know how many jobs are actually in error

John
I know that list in itself is a problem. But I think all Domains are going to have a high failed count and it's difficult to know if there are important ones in that 3000 + count

Matthew
We shouldn't alert on that metric, hence the one to track how many in error for the last hour
The scheduler is a massive mess and that 3000+ count suggests we have a fair bit of clean-up to do.

John
the only suitable metric I can think of for "important' is ones that the Database Administrators already deemed as important and created automated resets for.

Matthew
I could add an additional "Important" row to the general stats that includes those (or excludes any that aren't those)
Need that info from the Database Administrators though

John
Do we maybe need a couple of groups rather than just 1 "important" group

Matthew
I'd rather split the jobs and call out their names though, rather than pile them into one huge count if we're doing that
Let's get the data in Datadog first and see what normal looks like and tune alerting accordingly
JobTypeIDs aren't consistent across the estate by the way, so you'll have to match on JobTypeName with a join to the JobType table
<Image of 2048 Ready jobs. 47 Error>

John
Interestingly, those 47 jobs haven't run in years. Some of them last ran successfully in 2016
but we're resetting them every day to try to run, and they're constantly failing (job run count of 271,280)

Matthew
Hence my comment about a lot of clean-up - I'm willing to bet these are trying to run for closed Orgs, or orgs that have moved endpoint

John
Each Domain will probably need work to get rid of all the false alarms
I know when I checked one domain there were 40+ jobs that had never run and were just constantly being reset
Maybe an idea to simply disable these and change the script to add & Enabled = 1 to the filter so you count only enabled jobs?
That should help remove the false positives you know about - then you can actually alert if the value goes above 0 for jobs in error

Paul
We are assessing whether the best approach to reduce the number of scheduler incidents is to deliver the new scheduler with improved logic and Datadog integration which will take time. 
Or support the integration of Datadog with the current scheduler.

Matthew
If it's the former, should we still do the latter anyway until the new scheduler logic is in place?
I suppose what I'm trying to ask is will the time-frames for implementing the new logic be quick enough to satisfy the urgency of monitoring the scheduler?

Paul
Yes agreed, we have just reviewed the last 9 months of incidents and having Datadog reporting would have given us the insight to avoid a number of these.

John
As well as adding an "enabled=1" filter Matthew, do you think it's worth adding a runcount > 0 filter as well, to avoid counting jobs that have never run?
For the sample Domain I looked at, every priority job in error had a run count of 0 showing they've never worked. Adding this would bring that result down to 0 which makes it much easier to then set an alert if that goes above 0

Matthew
I thought about that, but that will mask errors with jobs that should run but haven't. We'll want to see those. New job types as well, for example

John
going to be a hell of a job tidying up all the crap scheduled jobs in the scheduler AND setting up and calibrating monitoring at the same time
My thoughts were to filter those out for now, then look at those crap jobs later

Matthew
Yep, it is, but we can't ignore the mess as it won't go away. A lot of work disabling jobs will be needed to get the overall stats query to show nice figures. We shouldn't shy away from them looking terrible though. I don't believe in fiddling figures to make things look nice when it comes to monitoring
The other queries that show failures and stuck/running jobs for over an hour will help with spotting immediate issues though
One particular situation to take into account is the longest-running job we have is 8 hours. We can take care of that in Datadog with trends and anomaly detection to tell us when more jobs than the expected ones are stuck in a status for longer than an hour.
Similarly, we can use that same alerting method to warn us when the numbers on the overall stats aren't within usual parameters. Change detection is also a good measurement to use here too. We don't necessarily have to use traditional methods of alerting as soon as a value is over X

John
that sounds to me like a case of another metric
count of scheduled jobs running more than 1 hour where job type is NOT “expect long-running jobs”
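For what it’s worth, the metrics the chat converged on could be expressed as Datadog custom queries along these lines. This is only a sketch: every table and column name here (dbo.Job, dbo.JobType, Status, StartTime, Enabled, RunCount, JobTypeName) is a guess at our schema, and the filters and thresholds come straight from the discussion above:

```yaml
custom_queries:
    # Jobs in error, with John's filters: ignore disabled jobs and jobs
    # that have never run, so obsolete junk doesn't drown the signal
  - query: >
      SELECT COUNT(*) FROM dbo.Job
      WHERE Status = 'Error' AND Enabled = 1 AND RunCount > 0
    columns:
      - name: scheduler.jobs_in_error
        type: gauge
    # Jobs stuck in Running for over an hour, excluding job types expected
    # to run long -- matched on JobTypeName rather than JobTypeID, since
    # IDs aren't consistent across the estate
  - query: >
      SELECT COUNT(*) FROM dbo.Job j
      JOIN dbo.JobType t ON t.JobTypeID = j.JobTypeID
      WHERE j.Status = 'Running'
      AND j.StartTime < DATEADD(HOUR, -1, GETDATE())
      AND t.JobTypeName NOT IN ('NightlyReport')
    columns:
      - name: scheduler.jobs_stuck_running
        type: gauge
```

As Matthew pointed out, the RunCount > 0 filter masks new jobs that should have run but never have, so whether to include it is a judgement call rather than an obvious win.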

Performance Problems Part 2

Note: If the min_collection_interval is set to 30, it does not mean that the metric is collected every 30 seconds, but rather that it could be collected as often as every 30 seconds. The collector tries to run the check every 30 seconds but the check might need to wait in line, depending on how many integrations are enabled on the same Agent. Also if the check method takes more than 30 seconds to finish, the Agent skips execution until the next interval.

Datadog

It seems that for custom SQL metrics, you can only specify a single frequency at which ALL the queries run. So if one team creates a query they want to run every minute, and another team wants theirs to run every hour – you can’t do both.
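If I understand the docs correctly, this is because the custom queries hang off a check instance, and min_collection_interval is set per instance, not per query. A possible (untested) workaround would be defining two instances against the same server with different intervals – though I haven’t verified whether the agent handles that well:

```yaml
# Sketch only: the server, queries, and metric names are hypothetical
instances:
  - host: "myserver,1433"
    min_collection_interval: 60      # fast, cheap counters every minute-ish
    custom_queries:
      - query: SELECT COUNT(*) FROM dbo.Job WHERE Status = 'Error'
        columns:
          - name: scheduler.jobs_in_error
            type: gauge
  - host: "myserver,1433"
    min_collection_interval: 3600    # heavier reporting queries, roughly hourly
    custom_queries:
      - query: SELECT COUNT(*) FROM dbo.BigReportingView
        columns:
          - name: scheduler.daily_totals
            type: gauge
```

Even then, the docs’ caveat below still applies: the interval is a floor, not a schedule.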

One team wanted to run a long-running query, but since the first team had set the queries to run every 60 seconds, this long-running query wasn’t possible.

Similarly, we anticipate problems once the total runtime of all the queries exceeds the 60-second interval – which, with only a handful of queries, we will soon hit.

Another problem we found is that the interval you set is just a guide, so Datadog could actually run a query twice within a 60-second period. When it comes to creating the dashboard, you then have to be careful not to count the data multiple times. Some teams were seeing decimal numbers on charts that counted only whole numbers!

The possibly crazy workaround

I think a good workaround would be to have the data refreshed hourly and placed in a separate database somewhere, then have that separate database queried every 60 seconds by Datadog. If it’s separate from the Live estate, it should reduce the risk. Some thought needs putting into how you would pull those stats into a database hourly, however – you’d need a SQL Agent job or similar that could collect them once an hour and push them to a central, separate location.

John

key thing would be to ensure we aren’t using the scheduler to kick off SQL that monitors the scheduler 🤣

Christian

Need More Tools?

We purchased Datadog to replace several metric tools. Now we have seen that the SQL metrics are a bit basic, we seem to be concluding that we need an alternative tool. I wonder if we will keep buying different metric tools over time and end up in the same situation we were in before!

You get some basic SQL monitoring in the base DataDog install (or the APM one, not sure which). You can pay extra for “enhanced” SQL monitoring in the DBM module. It’s still very basic and about 3x the cost of “proper” SQL monitoring tools. I’m sure as the DBM module matures it will get closer to fit for purpose, but right now it’s an incredibly expensive SQL Server monitoring tool that’s feature poor. If it was near zero cost, I’d have it everywhere to give a wider audience better visibility of what the DB layer does in live, but the features don’t currently justify the cost of that particular extra module. 

Database Administrator

manager: can you draw me a pretty graph

pretty graph:

Mentoring #8: Former Apprentice

Intro

A few years back, I was assigned to mentor one of the Software Developer Apprentices and wrote about him in a series, the last one being Mentoring #7. There, I mentioned that our manager, Colin, was supposed to be setting him challenges (with the aim of sacking him) or finding him some kind of alternative role, possibly as a Software Tester.

The Apprentice turned that idea down, but I thought it would be a good career move if he went for it, because he didn’t seem to have the problem-solving skills required to be a developer. I was increasingly thinking he was one of those people that is “all talk and no action”.

So I’ll go through a few events that have happened since then.

Colin’s Kanban

I always thought Colin was a bit disorganised; he’d often come up with ideas then quickly abandon them. When we hired some new developers, Colin created a Kanban board of tasks they needed to complete for their induction. He said The Apprentice needed to do it as well, to ensure we had trained him adequately. The theory was that if the new starters started writing code after completing our training and The Apprentice didn’t, then that would be the evidence HR required to sack him.

After a month, I checked the Kanban board and there was no progress.

Me  15:36
Remember the new starter training programme?

The Apprentice  16:15
What do you mean remember? 
This is my programme, although I'm not exactly working on it like that

Me  16:28
nothing has moved on the board for weeks

The Apprentice  16:30
I don't get your point as we haven't been asked to move anything on the board etc. Maybe it's just for managers to plan etc

Me  16:38
It's a kanban board. It's supposed to be what you are currently doing and what you have left.
I haven't heard a peep out of those new starters

The Apprentice  16:40
I haven't received any such instructions and am doing the tasks I have been asked to do. But I will speak to Colin now that you mention it cos I probably am supposed to be doing that.

So Colin had basically abandoned it, but then there’s no determination to impress from The Apprentice. He is just chilling away without a care. He could have easily provided evidence he had completed everything and impressed Colin.

LibreOffice Config

My Apprentice picked up a bug where he needed to switch the configuration from MS Word to LibreOffice. I told him to configure LibreOffice in Configuration Manager. He asked if it was a feature in the main program. I told him it wasn’t; Configuration Manager is a separate configuration tool. I want him to try to work independently, so I gave him generic advice for the future. To work out how to enable features in our software, I told him that in general you can check:

  1. the independent Configuration Manager tool (newer features are most likely here), 

  2. Organisation Configuration in the main software,

  3. then the modules themselves.

For point 3, one example I gave is that the Users module has its own Configuration screen. 30 minutes later he said:

“I checked User Config and I can’t see an option for LibreOffice”.

Apprentice

Before I gave him the generic advice, I had already told him it was in Configuration Manager. Then, when I gave him the generic advice, I listed Configuration Manager first. Why didn’t he check them in the order I said? Either he doesn’t pay attention, or he comes across as trolling by slowly doing the wrong thing.

Oblivious

We had some mandatory Security Training, presented remotely by a third party, which started at 9:00 and lasted half the day. At 12:45, he asked:

“Is this Security training something everyone should attend?”

Apprentice

The Set Up

When he first joined, I showed him how to check out our code repository, how to build it, and where to get the databases from, and we rewrote the New Starter documentation together. He had recently replaced his laptop, so he had to set it up again. He asked me how to access a database backup server, and I asked him why. He said he wanted a particular database from that server. So I asked “why?” again – if he was following the instructions we wrote, they don’t say to do that. He claimed he was following the instructions.

“I’m honestly on the instructions, I can’t see what you are referring to.”

Apprentice

The funny thing was, I didn’t have the instructions open, but I remembered what they said. So I opened them, clicked the Database section in the Table of Contents, and copied the instructions into the chat – they say something along the lines of “Run the following SQL script to create the database:”.

What was he looking at? Why was he pretending the instructions said to access the database backup server? I could have configured all the databases in 20 minutes at most; he dragged it out for hours.

Performance Review

When it was time to do objectives, he obviously didn’t have much to write about because he hadn’t done any work. Apparently, he had a “spreadsheet of evidence” though, so maybe I am wrong. We had a form that we needed to submit, and he spent the entire day transferring the spreadsheet to the form. The next day, I had some free time, so I told him I’d help him look into his assigned software bug. He said he wanted an extra 30 minutes to finish the form…which then became a few hours. See what I mean about being “all talk and no action”? He just makes excuses to avoid his work.

False Confidence

I ran out of ideas on a bug fix I was working on. I told my colleagues in the group chat on Slack. The Apprentice says

“Fancy a call to talk your thoughts? I’m kind of getting good. And I can share my ideas”

Apprentice

I was completely baffled as to where this confidence was coming from. He hasn’t fixed anything himself and struggles to come up with ideas. I am not opposed to a Junior correcting/inspiring me, but there’s no evidence to suggest he could do it.

Support

Last month, he told me he had a new job, but he is actually staying within the company: he has switched to 2nd Line Support. I don’t really get how that interview went. Being a software developer is about diagnosing and then fixing issues, whereas support is mostly about diagnosis (if there is a known fix they can apply without assistance from development, then they can fix the problem too). So it makes sense that people move from Support into Development once they have learned how to code, but I have never seen the switch go the other way. I am intrigued to see how it goes. He has already started making claims like “this is much more suited to my skills” and “I’m really happy with this role”, but it’s early days.

Virgin Media Hub 3.0

For my home internet and mobile sim, I have Virgin Media and a Virgin Mobile sim. They contacted me saying they were switching me over to o2, so in future, I would be billed by o2 instead, but I also qualify for a few extra benefits for the same price.

Once that was activated, they then said – because I have Virgin Media and an o2 mobile Sim, I now qualify for a bonus speed to my Virgin Media home broadband. Not sure how that makes any sense and why I didn’t qualify before, but cool – free stuff.

However, my current router cannot handle the new speeds or something, so now I have to have their latest “Hub 3.0”.

When I received the package, I had a quick look through the instructions and it seemed as simple as plugging it in. The only thing of note was that when you think you are ready to connect your devices, you need to look at the lights on the Hub:

“When the Wi-fi light is on and the base light is solid white, you are ready to move on. The arrows may still be flashing green”

Instructions

I assumed the flashing arrows meant it was updating (but couldn’t see anything in the instruction manual), and when they stopped flashing after 1 hour (why does it take 1 hour to update!?), I had a stable green Wi-Fi light, stable green update arrows, and a stable yellow main light. So what does that mean? It doesn’t match their description.

After a minute, the green update arrows and green wi-fi light went out, and I was left with a stable yellow main light and no internet connection. So I turned it off and on again. Same sequence of events happened.

So I reconnected my old router to check the internet was still working. It was.

The next day, I asked one of my colleagues (who I knew had Virgin Media broadband). He said he had a Hub 3.0 and his just has a stable yellow main light and had no idea what I was on about when I told him about the white light that the booklet mentioned. It was years ago when he had set his up, but he thought it was as simple as plugging it in, and away you go.

So after I logged off work, I plugged the “Hub 3.0” in again and got the same sequence of events. This time I went to the router’s IP page http://192.168.0.1/. Is it updating? Why so many updates?

Update in progress 
Please wait before updating any settings. 
Refresh

I waited over an hour, but I was still stuck on the update screen. I turned it off and on again. It still said it was updating, but there were no green arrows on the router itself. Can we trust the arrows?

I checked Twitter and found a few people from various years with the same problem. Some said that Virgin call centre staff resolved it – but didn’t say what the resolution was. Then there were some unresolved cases of people Tweeting into the void.

So since it was late, and I assumed Virgin’s call centre wouldn’t be available, I waited till the morning. I then plug it all back in, and call the number in the booklet:

“Connection issues? if you’re still having trouble connecting after following all the steps, waiting 30 minutes for your Hub to set up and making sure the connections are secure – call us on 0800 953 9500”

Instructions

I was greeted with an automated line asking me for my account number. I hung up and went looking for the letter. I call back, type in the account number, then it asks me if I would like to link my phone number to my account for faster calling in future.

That sounds great, because I hate having to read out an account number and go through the “security” checks. If I can bypass one or both of those, it would be amazing. They always ask you for part of your memorable word, and it always trips me up because I have only needed to call them 3 times in 9 years or something – so it is easy to forget. I was convinced I knew it, and this would test my theory, so I went through the process of trying to link it.

The automated voice instructed me to “press the key that corresponds to the first letter”, so the 2 key would represent A, B, or C. Maybe not so secure when there are ambiguous answers. I typed the 3 numbers in, and apparently it was wrong. So I hung up.
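The keypad scheme also shrinks the answer space considerably. Here is a quick sketch (my own illustration, nothing from Virgin’s system) of how 26 letters collapse onto 8 keys:

```python
# Standard phone keypad: each digit stands in for 3-4 letters.
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
LETTER_TO_DIGIT = {letter: digit
                   for digit, letters in KEYPAD.items()
                   for letter in letters}

def to_digits(word):
    """Convert letters to the keypad digits a caller would type."""
    return "".join(LETTER_TO_DIGIT[ch.upper()] for ch in word)

# 26**3 = 17,576 possible 3-letter answers, but only 8**3 = 512
# distinct digit sequences - so unrelated answers collide:
print(to_digits("cat"), to_digits("act"), to_digits("bat"))  # → 228 228 228
```

So a wrong word can still pass, and a right word is easier to brute-force – which is presumably why they still ask the security questions anyway.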

I went to the website, account details, “change memorable word”. You have to choose a word between 8 and 10 characters long, but it’s not quite a word because it needs 1 number. With that level of specific criteria, it probably makes it less memorable too. So I typed in 9 letters and a number to reach 10 characters in length. Apparently it didn’t match the rule “8-10 characters long”!? So “8-10” actually means 8 or 9?

Eventually I managed to set it to something slightly memorable, so call back. Enter the “Account number”, “memorable word”. Right, as long as I call using this mobile number, it should get me straight through in future.

Right, can I speak to a human now? No.

The automated voice says they know I have been sent a new Hub and if I press “1”, they can send a signal to activate it.

WHAAAAAAAAAAAAAAAAAAAAAAAAT!?

Me, raging

The instructions never said that. They said to wait 30 mins for a solid white light and Wi-Fi light, then only call if there are connection issues. Yet this number is an automated line that it is VITAL to call.

So I press “1”. The voice says it “may take 1 hour for the connection to activate”.

What!?

Super fast broadband, like 264 Mbps, and you are saying it takes 1 hour to transfer 1 signal to tell the router it is valid? What the hell. I was supposed to be working and thought I would only be offline for 15 mins.

After waiting 1 hour, there’s still no connection. I waited another 15 mins. I checked the router settings page; “Update in progress”. It’s either lying, or completely broken.

So I called the number again to see what would happen. The automated voice tells me my account number is linked to my phone, so I press “1” to accept. Now I have to enter 3 letters from my memorable word. At least not entering the account number is convenient. I put the letters containing the account number in the drawer; I won’t be needing those again.

The automated voice tells me that the “signal” had failed to activate my router, so I have to be passed onto a human. I connect straight away, and first I need to state my name. Now I need my account number. WHAT!? I can’t have gotten this far without my account number which I had linked to my phone. So I scramble to get the papers out of my drawer so I can read off the account number. Now I need to specify 3 characters from my memorable word. (╯‵□′)╯︵┻━┻

If it is a challenge to make a calm guy like me turn aggressive, then this is certainly the way to go about it.

So I explain that I have this new Hub and it doesn’t work. She asks me what lights I see, and she says it should be working. I then get put on hold for a minute, then she says

“We haven’t registered this Hub at our end”.

Virgin call centre staff member

Brilliant. Why is that even a thing? The connection is coming through to the inside of my house (my old router works perfectly fine). Why do they need to authorise a device inside my house? They sent it to me too, so why wasn’t it automatically registered? You would think they would have the process perfected after all these years.

So after holding a bit longer, she said she would send the signal, but it may take an hour. She then asked if I was “happy with the resolution”.

“Eeeer. Dunno. If it works, then yes. If it doesn’t then no.”

Me, uncertain

“It will work, sir. We will send you a text message when it is activated.”

Virgin call centre staff member

The connection actually came on after 1 minute.

1 hour 45 minutes later, Virgin via text: “We’ve activated the new Virgin Media kit”

Here’s a list of things that are dumb:

  1. If you send someone some new hardware, make sure it is registered on your system.
  2. If it requires the customer to make a phone call, make sure that is clear in the instructions.
  3. The phone number should also state when the line is open, and whether it is automated or not.
  4. If there are lights on the hardware device with different meanings – put them in the instruction booklet.
  5. Don’t tell the user they are looking for a white light, when it is actually yellow.
  6. Don’t make a page stating “Update in progress” when the status is “Unregistered device”
  7. If there is an Update process, explain to the user what this means and how often it should occur, and how long it should take. What if I turn off the device whilst it is updating? Does it become “bricked”?
  8. Don’t send a text 1 hour 45 minutes late.
  9. Don’t tell the user they can register the account number to their phone, then ask them to read out the account number.
  10. Don’t say you can create a memorable word of 10 characters, then tell them they cannot.
  11. As a human, don’t ask for 3 letters of a memorable word, and when the customer gets it wrong, ask for 3 different letters. There’s a good chance you could piece what the full word is by putting together the answers. I assume the call-centre staff cannot see the full word, but it wouldn’t surprise me at all if they could.
  12. There must be a better way of activating a router than via signal that takes up to 1 hour. I assume there’s some serious leeway here, but it’s not good to keep a customer waiting that long before calling support again.

Mum’s email problem: The Train Noise

Recently, Microsoft made a change to their OneDrive terms. I’m sure it was a bug, but if your OneDrive becomes full, they then stop your Outlook emails from being received. I was receiving the warning when my OneDrive was 99% full but my email allowance was only 3% full.

My Mum had set up OneDrive to sync her Photos and Desktop, and had dumped several GB of videos there when she only had a 5GB OneDrive limit.

I had told her to sort her files out. However, she isn’t even confident dragging and dropping files into different folders.

It’s always tough to explain problems to her, or for her to explain problems to me. She says she normally checks emails on her phone these days, rather than on her laptop, and she doesn’t use OneDrive on her phone – so she couldn’t make sense of my explanation. Her thought process was: “How could OneDrive on her laptop prevent Outlook from receiving emails on her phone?”

Then, when she kept showing me the Gmail app, I told her that I couldn’t fix it in Gmail. She then kept saying I’d got it wrong, because it was Outlook and not Gmail that she was using.

She uses an Outlook email account, but the Gmail app on her phone. She couldn’t seem to differentiate between the concept of an account and an app when they share the same name.

It’s easy to take aspects like this for granted. What’s easy for me to understand is simply impossible for the non-technical person. In the age where pretty much everything is pushing for digital, it’s a big ask for the older generation to come on board without every step of the process being intuitive.

When I got to her house, she seemed adamant about clearing out old emails, and I kept telling her that emails are small and weren’t the problem.

She also complained about a “train noise”, which I said I would need to hear for myself because it was an extremely weird statement.

Once I had sorted out the files on her laptop, and stopped OneDrive syncing files on her Desktop, I told her that the emails should come through. She claimed that it still wasn’t working, and was showing me by refreshing the emails by sliding her finger down.

“I get double emails”

Mum

“I thought you said you weren’t getting emails?”

Me

“no, that was before when I was getting the train noise”

Mum

I sent her an email, and it came through along with a notification sound.

“There it is! The train noise!”

Mum

She was adamant it was never her notification sound, and to be fair, there was a different tone for text messages, so I don’t know if it was a Gmail-specific tone, but I couldn’t see an option in the settings.

Even when emails were coming through, she still claimed they weren’t. She said she often gets around 20 emails a day but only 4 had come through, so “it wasn’t fixed”. I didn’t know how to prove how many she should have.

But what about the “double email” problem? I still needed to solve that before I left. So I asked if she knew how we could recreate it. She refreshed Gmail and pointed to the loading wheel. After a few seconds, another loading wheel appeared lower down and she said: “there! double emails”. So there weren’t double emails – just double the loading wheels.

I asked why this was a problem to her, and she said she

“didn’t get double emails or the train noise until the Microsoft thing popped up”.

Mum

Who knows if it’s true or not, but it’s so hard to help her when she describes somewhat fictional problems using the wrong terminology.

For another Mum-story, see Mum’s Frozen Laptop Screen

Performance Tales: Tasks

Recently, a team was created to tackle major issues, often proactively. Some errors are logged without the user noticing anything, so by monitoring the logs, the team can diagnose and fix these “silent” errors. The other thing they look out for is performance problems. Some performance problems go unnoticed when the slowdown is minor, inconsistent, or worsens gradually over time. I would think some users don’t bother reporting slowness because it’s harder to quantify than an obvious crash.

However, one user had seen a recent drop in performance because they hadn’t been dealing with their tasks, and it got to the point where they could no longer log in: retrieving their tasks on login took longer than 30 seconds, so it timed out (an error is thrown when retrieving the tasks takes more than 30 seconds).

“At the time of logging this bug, the user currently has 136,854 tasks in Tasks Management. The Program performance will start to be negatively affected after 4,000 tasks. I have extended the timeout of sql call for TasksManagement.GetUserTaskCountSummary to 60 seconds as this caused a login failure.”

Walter (Developer)

“let’s be honest, the program performance will start to be negatively affected after 1 task”

Mike (jestingly)

I think this is acceptable as a quick fix to allow the user to log in again, but is it really acceptable for the login process to take more than 30 seconds? I’d imagine it takes around 40 seconds for this user.

That’s the problem with this team, they just look for quick wins, even if it isn’t the right solution long-term, and might even move the problem elsewhere.

What’s better than waiting 30 seconds? Waiting 60 seconds? Or threading it off to defer the loading? Or Mark’s idea: no timeout.

What if the time taken still exceeds 60 seconds, assuming some other user has a large volume of data? Can you set the timeout as infinity?

Mark


we had a customer with a very large count and it only ran for around 32/33 seconds and their counts were unusually large. We are going to send out communications for customers to keep these low by means of maintenance. The 60 seconds just allows an extra safety net if we get in this situation again. I don’t want to extend the timeout to be too long for this reason as it will unlikely (if ever) be needed to be longer than 60 seconds.

Walter


Why not a try catch / retry attempt for this? It should be a non-essential call to logging in, if it fails, you can catch, log, and show an error message. Should we not look at optimising this so that you can login quicker?
Maybe run this on a background thread too?

Lee


I discussed this with Johnny and making changes to this Stored Procedure could result in worse performance for smaller datasets and advised against making changes to it. We’re going to tackle this by means of communications to sites. I thought the simplest and safest approach is just to extend the timeout slightly so that the practice does not suffer a system down as a result of it – the timeout is only breached by a second or two.
Once the user logs in, they are displayed their task counts, so I think it might be deemed essential (rather than showing them a loading wheel until the data is returned). Currently, if we did this, when loading up Tasks Management it would just crash with an error.

Walter


It would still crash on logging in if it takes over 60 seconds.
Why not make it non-critical to logging in?

  • Log in
  • Status bar “Loading…”
  • Completes – OK.
  • Fails – Show error and retry link.

Lee


This was the worst site on the estate and was taking roughly 32 seconds. To take over 60 seconds the task count would probably be unheard of.
Each time I ran that stored procedure manually, the results were the same so I don’t think a retry is going to work here.
Even by changing to make it non critical to logging in, Tasks Management will still be busted when you try to load it. The timeout is on the SQL side so that is the area we need to resolve really.

Walter

However, Johnny did advise against alternative solutions such as:
  1. fine-tuning the stored procedure,
  2. adding indexes,
  3. removing task counts completely for some types of tasks.

My View:

Walter had put more thought into it than I originally gave him credit for, but I still thought he was overlooking Lee’s suggestion. Yes, it would need more work to implement (display loading text on the Tasks Count Bar, then a loading screen when launching the Tasks Management page), but it would significantly speed up logging in. If this user could log in 32 seconds quicker, what would the average user see?

If the other parts of the log-in process also take some time, then that’s a long total wait. If task counts are the bulk of the time, then we can make it super fast by taking them out. I would have thought users expect times of 5 seconds or less (that might not be possible, but it’s the scale we need to aim for). Walter talks as if users are more than happy to wait 30 seconds or more just to get to the home page. A long wait is better than not being able to log in at all, but surely anything more than several seconds in total is generally unacceptable. It’s one of the reasons users have grown more discontented over time.

When doing some testing of smaller counts, for example 10k – the results are returned in a few seconds (2-3). This organisation had around 120k Appointments Tasks across all users plus all of their other tasks which resulted in a production duration of 32 seconds. The more they manage their tasks the quicker workflow will be, that’s always been the message we’ve tried to get across.

Walter
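For what it’s worth, Lee’s approach could be sketched roughly like this (a Python illustration for brevity – the real product is C#, and every name here is made up):

```python
import threading

def fetch_task_counts():
    # Stand-in for the slow TasksManagement.GetUserTaskCountSummary call;
    # in production this is the SQL query with its own timeout.
    return {"appointments": 120_000, "other": 16_854}

def load_task_counts_async(on_done, on_error):
    """Return immediately so login can complete; deliver the counts later."""
    def worker():
        try:
            counts = fetch_task_counts()
        except Exception as exc:
            on_error(exc)        # status bar shows an error and a retry link
        else:
            on_done(counts)      # status bar "Loading..." becomes real counts
    threading.Thread(target=worker, daemon=True).start()

# Usage: login proceeds while the counts load in the background.
results = {}
done = threading.Event()

def on_done(counts):
    results["counts"] = counts
    done.set()

def on_error(exc):
    results["error"] = exc
    done.set()

load_task_counts_async(on_done, on_error)
print("logged in")               # reached without waiting on the query
done.wait(timeout=5)
print(sum(results["counts"].values()))  # → 136854
```

The worst case changes from “cannot log in at all” to “task counts arrive late or show a retry link” – which is exactly the trade-off Lee was proposing.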

ISO Project Audit

We were being audited on a project we completed about a year ago. Due to our slow release process, it took about 4 months to go out, and I think that’s why we were chosen – it fell within that year period. I suppose they weren’t auditing us for speed, but I thought it would be a bit embarrassing if they actually saw the project’s start and end dates.

We couldn’t remember many specific details about the project, but our software tester just freestyled a speech about the Agile Development process: he spoke about requirements, User Stories, writing Test Cases, and continuous integration; and the auditor seemed fine with his content.

I might have only said “hi”. Can’t quite remember. But the auditor was impressed with my knowledge.

“The external auditor was very impressed with the knowledge and understanding from all attendees during the day. I will share the report within the next couple of days once it is available for me to download.”

Project Manager

I do wonder if ISO Audits actually mean anything if you can blag your way through them. They may have been more interested in the Project Management side of the process than in the Software Development side.

Performance Review 2023

Last year, we felt slightly aggrieved that our employer made out they would match inflation with our pay rises, but then disguised it with a one-off profit share, pension payments, and optional benefits. Our wages only increased by 2% while inflation was around 7% and predicted to rise further. Better than nothing though, so we couldn’t really complain.

With inflation being 10%, we wondered what we would get this year. Then came the communication that stated that they understand inflation, and remarked that we are still highly profitable so can cover the costs:

“Further to our communication at the end of 2022, I can now update you on our remuneration approach for 2023. We know and understand that employees are continuing to experience cost pressures and we remain committed to ensuring that the overall pay budget for this year reflects the continued external inflationary environment we are in. Whilst this has been possible due to the ongoing success of the business, it will require us to again manage our costs in 2023 carefully.”

HR Director

However, they then reminded us that they had given us:

  1. an extra £200 one-off payment,
  2. increased leave for parents,
  3. the option to work bank holidays and take a different day off instead,
  4. a one-off profit share, just like last year.

So with them clutching at straws for things they have done, you know that means they weren’t giving us 10%. Maybe we could get 5% as a compromise then?

“The exact amount of any increase to an individual’s salary and timing will be agreed locally, as normal, and will be dependent on a number of factors including performance. I believe these measures reflect our responsibility to be aware of the external inflationary environment, our immense gratitude for your efforts over the past year and our desire to become an employer of choice.”

HR Director

What did I get?

0%

“immense gratitude for your efforts over the past year” 🤨

I’ve mentioned in previous blogs that judging performance is very subjective and often bent to fit a narrative, and I got completely screwed. So let’s get into it:

The Review

“You had an average year in 2022, working on only a couple of projects, one where you worked as an individual contributor then moved onto a pair project”

Well, that seems to imply just being on 2 projects in a year was a bad thing, but that’s not really my fault, is it? Since I single-handedly delivered one project, then was in a team of 2, surely that shows I can handle responsibility and pressure?

“You were good in many technical aspects but were not able to step up to the plate as a leader in the team.”

I was in a team of 2 as a Senior Developer, and I got really good feedback from that team member. What more do you want?

“You were given an opportunity to lead on defining our interview process. I expected you to lead on this but have not shown much interest in it. Ultimately someone else led it, creating the interview pack, redefining interview questions and leading live interviews, which I think is a missed opportunity to shine.”

I blogged about this process. It was a mess, but at least I tried my best to direct it. We created an interview process for Graduate Developers from Indian universities who were using Python, but set them C# and Object-oriented questions…then asked questions on Software Testing which they also don’t do. I was pretty clear we needed to focus on the people we were targeting (Python and Web programming) and not on what the job was (C# Developer); (or even better, change the people we were targeting – but that was already decided upon). I got outvoted by the other Seniors. I felt I couldn’t lead in the interviews because I didn’t believe in the questions.

“I can see that you might not see Unit testing as a productive way forward, and the target set might not be accurate”

In recent times, managers have been demanding we achieve 80% test coverage, and I have highlighted how poor many of the tests people write are, and that 80% is an arbitrary target. It isn’t improving the quality of the code at all. Because I wrote few (but high-quality) tests, I got criticised and talked about among senior management. I raised in my review that I believe managers now see my output as low quality, so I will need to repair my reputation.

I also raised that I was annoyed at the frequency of team member changes. My projects had Testers switched in and out, we changed Product Owners and Software Architect, and there was talk of more changes that didn’t happen. Colin then stated that I “never raised my concerns at the time”, but I definitely did.

Some of the comments received from your colleagues:

“I think that he has been a good technical lead. His knowledge is excellent, and this has been shown in how he has been able to help me with any technical problems. He has constantly been available, has a great attendance record, and will defer annual leave for the good of the project. His code review comments are always clear, concise, and helpful. He is always available to help with Testing as well.”

“He has a good insight into what he thought was and was not possible”

“He needs to make sure that just because an issue is raised he doesn’t jump on it right away as there is a priority order for things”

“He’s reliable and just gets on with the work”

“Can be a bit quiet on calls at times but whenever I specifically call him out in calls, he would answer questions”

“At weekly meetings there wasn’t enough substance in his updates. He doesn’t come across as a leader to me or perhaps he just did not want to be at those meetings.”

I definitely got good feedback from people, but the positive comments were ignored in the discussion. I had good arguments against some of the negative points as well. I think I only picked up non-prioritised work when I had finished the work that was prioritised. What do you want me to do? Just sit there idle? I often knew the issues were related to recent work, so they were obviously going to be the highest priority, and as Team Lead, I think I can make that call myself.

I’m always annoyed when people criticise me for being quiet on calls. If enough people are saying the things I am thinking, then I don’t need to speak. If no one has raised the issue I am thinking of, then I will raise it. And if people specifically ask what I think, I will say, even if I am repeating what others have said. That final feedback was about a meeting similar to a Scrum of Scrums where many teams get together. But the meeting was mainly for Colin, and he rarely attended. So I had the attitude of quickly going through my points then moving on. I thought the wording was harsh – they have no idea what it is like to work with me directly.

I have no idea how they can justify giving me nothing. That’s like a 10% pay-cut with today’s economy.

Colin isn’t my manager anymore, so let’s hope my new manager will actually put things right.

The classic Wingdings printout Bug

When I was a Software Tester, one of the first bugs I found was on an appointment booking system. There was this concept called “Assignment List” which had a list of Patients that required appointments. Then you drag and drop them into the appointment slot to book them in. A tick/checkmark would appear next to their name. I then printed it out and saw that most of the printout was in the Wingdings font!

I thought it was pretty clear, so I typed up the basic information in the Bug Report, even suggesting what I thought the problem was. My (correct) assumption was that the font was switched to print the tick/checkmark from the Wingdings font, but was never set back to normal for the next piece of information, resulting in a full page of Wingdings symbols!

Ensure patients are present in the assignment list. Book patients into session. Press the print button – App book is printed and includes patient details – Details for some patients appear in Windings font in assignment lists. This is possibly related to the tick (shown by using Windings font?).

My bug report
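The pattern I suspected looks something like this (a hypothetical Python sketch – none of these names come from the real system):

```python
class Printer:
    """Toy print model: each write records the font in effect at the time."""
    def __init__(self):
        self.font = "Arial"
        self.output = []

    def set_font(self, font):
        self.font = font

    def write(self, text):
        self.output.append((self.font, text))

def print_assignment_list(printer, patients):
    for name, booked in patients:
        if booked:
            printer.set_font("Wingdings")
            printer.write("\u00fc")   # 0xFC renders as a tick in Wingdings
            # BUG: no printer.set_font("Arial") here, so everything
            # after the first tick is still printed in Wingdings.
        printer.write(name)

printer = Printer()
print_assignment_list(printer, [("Alice", True), ("Bob", False)])
print(printer.output)
# → [('Wingdings', 'ü'), ('Wingdings', 'Alice'), ('Wingdings', 'Bob')]
```

One `set_font` back to the normal font after writing the tick would be the whole fix.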

For some reason, the lead developer decided to be a bit aggressive and add extra information to my report:

CRITICAL MISSING INFORMATION: This ONLY happens if a patient in the assignment list has been assigned and has a tick next to them.  When printing it prints the tick but it appears that the rest of the details for that patient are left in the wingdings font.

Developer response

CRITICAL MISSING INFORMATION!