Debate about ChecksumGenerator

I was working on recreating functionality in our new API. It was basically a copy-and-paste job from our current product.

I had another debate with the Team Lead, who said I shouldn’t have copied the ChecksumGenerator and wanted me to replace it with a NuGet package.

The idea of using existing packages comes from the generic advice of “don’t reinvent the wheel”. If the code already exists and has been thoroughly tested by loads of people, then it’s best to use that. In terms of future fixes and enhancements, if it is a popular open-source package, then other developers will update it. If you rewrite the code for your own company’s use, you are stuck maintaining it.

However, in our case, we already have code that we know works: I took it from our other product, where it’s been used for several years. It’s only 58 lines long and has no need to change.

The package he wanted me to use wasn’t actually that popular, so using it adds some risk. I had just finished my code changes and had already run all the tests, so any changes would need to be retested.

Isn’t using loads of packages just adding bloat? That is such a webapp thing to do. My brief experience of developing a React application has shown me that it uses a crazy amount of packages, even for a basic boilerplate application like create-react-app. Then anything you install has loads of other dependencies, and you end up with a bloated install size and loads of potential security risks.

The Team Lead’s justification is that a ChecksumGenerator isn’t directly related to our API so shouldn’t be in our repo, and he wants to “do things properly”. I think if you go to that extreme then the software ends up in development hell, which is exactly what has happened here, because this API is part of a larger project that has been in development for about 7 years now.

It’s quite hard to win an argument against a Team Lead, because unless the majority say that he is wrong, what he says goes. We have a Junior in our team, and the other developer is the Team Lead’s best friend. The other two are Testers who don’t have much interest in a coding discussion like that. So it’s gonna be rare for someone to speak up and cast a winning vote.

I asked my Software Architect friend what he thought:

“That’s completely senseless. Pretty much always use code that works, as long as it performs well as part of it ‘working’.
Does the new package have an allowable licence? 

You should ideally try to avoid dependencies”

Software Architect

I had a discussion with a colleague in another team. He mentioned that infamous time someone got angry and withdrew a package, which took down a large part of the internet (the npm left-pad incident). I think we have mitigation for that scenario though.

Robert:

What if the package gets pulled?

Me:

I was thinking about that yesterday. I think we mitigate it by having our own NuGet store. So I think the package goes from the main NuGet source, down to our local NuGet store, then into our builds
might have made it up, but that's how I picture it
so unless you attempt to update the version, you always have a copy of the one you currently reference because it is cached on your server
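If that setup is right, the repo-side part is just a nuget.config that forces restores through the internal feed. A minimal sketch (the feed name and URL are made up):

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <!-- ignore any machine-wide sources so builds can't silently reach nuget.org -->
    <clear />
    <!-- hypothetical internal server that proxies and caches upstream packages -->
    <add key="InternalFeed" value="https://nuget.internal.example.com/v3/index.json" />
  </packageSources>
</configuration>
```

As I understand it, because the internal server keeps a copy of every version it has ever served, a package being unpublished upstream only blocks new versions, not the one you already reference.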

Asking a Senior Developer how to create an authenticated user

I have a lot of experience as a software developer in Desktop Applications, but moving to a project involving many Web technologies, creating an API is a new challenge to me.

I asked a Senior Developer in the Authentication team how to get set up. I didn’t even know where to begin creating a user, then didn’t know how to get the correct permissions, etc. Then the process of developing against “DEV” environments rather than a “Production” URL was another mystery to solve.

Tony: "you need to exchange the Advanced-Auth token for an 3rd-Party-Token token"
Me: but don't I need to get the Advanced-Auth token first?
Tony: yes
Me: Don't I need to link my user to be enabled for Advanced-Auth?
Tony: yes, just do it through the app
Me: I have never used the app
Tony: create an account then
Me: but I need to know how to get the app in order to create an account
Tony: host.int.solutions.com thats the url you will need
Me: but if I am making changes in a project, then do we use DEV rather than INT
Tony: So everyone integrates with int as dev is unstable
Me: If no one uses DEV then what is it for?

It’s gonna be a long month.

It’s frustrating that he was only giving me basic information. You’d think there would be a general “New starters” guide he could have linked me to. Or his team should have had documentation for creating new users. They probably do; he was just being awkward.

I suppose this is the classic example of a software developer who is skilled in actual coding but not so good at communicating.

How (Not) To Split An API

I’m a software developer that only really has experience on desktop apps, and was recently put on a project to make an API. I had an interesting debate with the Tech Lead of the project about where we should put our code, and how we should view its responsibilities.

To make it more anonymous, I’ll change the functionality slightly but it’s the same idea. 

An Orders API had been in development for a while, and my team needed to add functionality to send some data to a government API, so let’s say it was for ID verification. Even though our initial requirement is that only the OrdersAPI will use the VerifyAPI, you could argue that, in future, other applications we have made, or third parties, could call this VerifyAPI directly.

There’s a famous idea in software development: YAGNI, “You Ain’t Gonna Need It”. It’s the idea that you should program to requirements and not speculative “what if” scenarios.

The Tech Lead argued that we should put our code in a new repository because it was a separate API. I said that adds loads of overhead, because we would need to write code in OrdersAPI to call our code, then add a reference to our VerifyAPI using a NuGet package. This slows down development: you need to update two repositories, maintain some temporary reference as you develop, create multiple Pull Requests, then publish the NuGet package and update the references once more. I stated this was gonna be a huge inconvenience if the project ends up running over a year.

I also called YAGNI on it: we will probably never use the API for anything other than OrdersAPI, so it should just go in the same repository. In the event that I am wrong, it should be fairly easy to move it out, as long as we use separate projects to keep our code separate.

He insisted on doing things his way, but the thing is, even though we had a separate repository, it wasn’t a separate API. It was more like a code library. So several months later, he was asking managers if we could create a “mini project” to turn it into an API for clearer separation.

So it seems like we had 2 opposing viewpoints but ended up somewhere in between with all the disadvantages.

Another interesting debate I had seemed to illustrate his confused view of what our code is. He has always viewed our code as intending to be an API, but I was changing some error messages and he said my messages were misleading because our repository is not an API!

The confusion seemed to be him saying the “client” is the OrdersAPI, but I see the user of our software as the client; the OrdersAPI is the server call, and it doesn’t matter where it goes next.

The message was something like “Field ‘Date of Birth’ is missing”. He didn’t like the word “field”.

Tech Lead
"I'd change the wording on these messages. We're no longer talking about "fields" since we've split the API request out."

Me
“does it matter where our code is? it's part of the same request as far as the client is concerned”

Tech Lead
"fields" just sounds like API talk

Me
but the client has made an API call

Tech Lead
the client hasn't made an API call though
if those prerequisite checks fail then no API has ever been involved
and even if it has, why would the client need to know anything about an API?
are you talking about the browser client?

Me
isn't it
client -> OrdersAPI -> our library {fail checks} -> error status to the client

Tech Lead
sorry i thought you were referring to the OrdersAPI as the client in this context
which it is
our package shouldn't know that it's being used in an API, that's the whole point of this change

Me
it's a black box for the caller. The user shouldn't know that it's using a library. The code could all be in the same place as far as it is concerned

Then, after more discussion, he was adamant that something could use our library in future, so then there would be two usages: an API and non-API. So it cannot have API-related stuff in it.

But our intention was to have a separate space for our team to maintain; we have never discussed it being used by anything other than the API. The early discussions were about having our own repo that was an API.

Daniel
tbh I don't immediately think API when I see "field" I think it's fine

Me
he did say the message could just be
"‘Date of Birth’ is missing"
Which might be better, but then wouldn't you want all the messages to be consistent? However, I guess someone could update the OrdersAPI repo with a new message style, and then forget about ours.

Daniel
you make a good point about consistency though, the API should be consistent regardless of where the code lives
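For what it’s worth, the code at the centre of the debate is tiny either way. Here’s a hypothetical sketch (field names and wording invented to match the anonymised example) of what the transport-agnostic check the Tech Lead wanted would look like:

```python
# Hypothetical prerequisite check living in the shared library.
# It knows nothing about HTTP, APIs, or browsers - it just reports
# which required values are absent from whatever data it was handed.
REQUIRED = {"date_of_birth": "Date of Birth", "surname": "Surname"}

def missing_value_messages(data: dict) -> list[str]:
    """Return one message per missing required value."""
    return [
        f"'{label}' is missing"
        for key, label in REQUIRED.items()
        if not data.get(key)
    ]
```

Whether the message says “Field ‘Date of Birth’ is missing” or just “‘Date of Birth’ is missing” changes nothing structurally; the OrdersAPI passes the messages through, and the end client sees them as part of one request either way.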

It’s a really trivial argument, but I think this is just the beginning of many debates. Sometimes I think we like adding loads of complexity early on, and then work doesn’t get done.

Experimentation vs Being Safe

When it comes to software development, often you can play it safe using technology you already know, or be more adventurous and use something new. I think the trick is to research the pros/cons of the language and make sure it is suitable for your approach.

There’s no point thinking something is cool and therefore using it – when it might not be the correct programming language to use. An entire team investing time learning something new can be a complete waste of time if the project is then cancelled/restarted due to heading the wrong direction.

A rule of thumb when choosing technologies:

  • For an experiment? be as weird as possible.
  • For production? be as boring as possible.

When it comes to maintenance, sometimes you end up in situations where someone is the “Expert” and therefore has to fix any issues themselves, or will be approached for help by another developer. Therefore, if you write something crazy for production, it will be you that maintains it, either directly or indirectly.

Sometimes becoming the expert in something is the way to get promoted or pay rises though, since you become the super important developer that the company can’t afford to let go. However, that also means you will be stuck on this part of the software, and can’t really move on to different projects.

If you do become one of these experts, and if you want to move on to a new project; you need to train a replacement up. Can you find a replacement that wants to take over, knowing that they will be stuck with this special project? How long will it take to train a replacement? How much documentation did you write? 

Software Naming

There was an internal meeting where a new product called “Recruit” was announced. The first question was that “it sounds like it could be confused with a method of recruiting staff to work for us, so was that discussed?” 

The manager said “to be honest, I never considered that”.

He then added there were around 20 people who were in the meetings, and no one had questioned it, or raised any objections.

A few months prior, there was an announcement about a new team that was handling branding in Marketing. We were told we couldn’t create any names without going via them. The last product names they came up with were ASSistant and ANALytics.

I thought that if the software isn’t well received, it could easily gain a negative nickname, and people could make statements like “the software is ass”.

A Product Owner recently stated that the Assistant branding will soon be phased out, and it will just be merged into our main software’s branding. The decision came about when another Product Owner was doing a demo and had created a test user with the name “ass”. A manager flagged it as unprofessional and was concerned that we could easily demo something like that to external clients.

“you probably want to change those Ass users”

Manager

So far, the marketing naming team hasn’t got a good track record.

Twitter -> X

Twitter was a really strong brand, so Elon’s insistence on changing it to X seemed like a baffling choice, and one seemingly based on his fascination with the letter X rather than any business reason.

As a…
– user
I want…
– a strategically-timed, comprehensive rebrand
so that…
– negative discourse regarding company practices can be briefly veiled with commentary regarding new name, logo, and brand assets, alongside excessively-shared parodies of the brand in meme format
here’s why everyone is WRONG about twitter changing its name to x being “a baffling choice” and “not a smart thing to do”: a thread

1/ x is an awesome letter, one of the coolest letters in fact! I just turned 8 years old and I think the letter x is the most coolest thing there is. anyway, my dad made me chicken nuggets so I can’t finish the thread right now

All jokes aside, I did find a serious thread that was actually useful in detailing the reason for X:

The X era has begun at Twitter. Elon Musk says the bird logo will be replaced by an X. For Musk, it’s the continuation of a story that began 25 years ago. Here’s a brief history on that...

Let’s go back to 1999. After selling his first company… …Musk set out to disrupt banking. And the X identity was born. But its time as a standalone brand would be short-lived. Instead, X became part of another brand.

X became part of PayPal. And PayPal was acquired by eBay. Musk walked away with $165 million. He then started building again… …and the letter X would reappear.

Musk would launch Space Exploration Technologies Corporation. It would become known as SpaceX.

Worth noting… SpaceX’s logo has a hidden message. The X symbolizes a rocket’s trajectory.

Musk also used “X” at Tesla. It’s the name of Tesla’s third model. Musk envisioned a car lineup with models that spelled out “sexy.” There’s a Model S. An X. And a Y. Ford had the rights to the Model E. So Musk instead opted for 3, which is basically a backwards E.

X is even the name of Musk’s young son.

Meanwhile, you likely saw all of the headlines leading up to this development… …from rebranding Twitter as X Corp… …to the recent launch of xAI. In other words, wheels were already in motion for the new era.

Our Datadog use

Introduction: Summary of previous blogs

Datadog is a monitoring tool my employer purchased licences for, and quickly became the cool thing to use and impress the senior managers with (see Datadog, and Datadog – The Smooth Out).

I discussed problems in both those blogs, but a concern with all metrics is:

  • What do you want to measure?
  • Who is viewing the data? And when?
  • What does “good” and “bad” look like, and who acts when that state is shown?

In “Datadog Knee Jerk“, I explained how our CTO and Technical Director demanded that everyone create a Datadog dashboard to monitor all services, regardless of what they are.

If we don’t have a clear idea of what to measure, who needs to view it, and how they know it is good or bad, then aren’t we just throwing money away? Even if the dashboard itself doesn’t cost anything, you still spend the time to create one (and some dashboards would require additional logging to be effective). Surely an obvious problem with wanting to monitor everything is that it can become quite costly when you look into Datadog’s pricing model.

Easy To Make Nonsense Dashboards

From my brief time making Datadog dashboards and analysing other teams’ dashboards, I realised that the data can often look wrong, and it’s really easy to misinterpret the metrics due to the jargon used, and when/how the data is actually collected.

“I know nothing about Datadog, yet have been told to make a dashboard”

Principal Tester

Surely the worst case is to make dashboards that show nonsense data. You will waste time investigating problems that don’t exist, or not be alerted to actual problems that happen. So once we create a dashboard, who checks that it is valid?

Look at this one that I saw:

This is supposed to be plotting a line (purple) for failures in the time frame specified, then another (blue) for “week_before”.

It looks wrong at a single glance. If I have set the time-frame combo box to show the “previous month”, should week_before be last week, or the week before last month? It seemed to be neither. Also, notice that the graph is exactly the same shape and numbers. It just seems to be plotting the exact same data while pretending it is a week later.
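I never saw the underlying query, but the intended comparison is normally done with Datadog’s timeshift functions, something like this (the metric name is made up):

```
# current period (the purple line)
sum:orders.failures{env:prod}.as_count()

# the same series shifted back exactly 7 days (the blue line)
week_before(sum:orders.failures{env:prod}.as_count())
```

week_before shifts each data point by seven days relative to the window you’ve selected, so over a month-long timeframe the blue line should show last week’s shape arriving a week early, not an identical copy. If both lines match point for point, the second query is almost certainly plotting the unshifted metric.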

Jargon

You would think you just need some understanding of statistics to draw some charts, but in the usual nerd fashion, they throw around jargon to be cool. So people end up saying stuff like this:

What is Datadog? Fundamentally, a platform like Datadog provides us with a scalable solution for ingesting observability data from our services. Datadog is built upon the three pillars of observability:

  • Metrics provide numerical measurements that allow us to assess our system performance and behaviour
  • Traces allow us to understand the flow of a request or transaction through our systems
  • Logs allow us to capture the details of system events and errors

When you read the official documentation, it’s difficult to understand what it actually can do. It’s the combination of jargon plus hyping up features to be powerful:

Datadog vector
Vector is a high-performance observability data pipeline that puts organizations in control of their observability data. Collect, transform, and route all your logs, metrics, and traces to any vendors you want today and any other vendors you may want tomorrow.

Imagine sending your metrics to vendors that you want in the future. They are like “mate, stop spamming us with your info, you aren’t our customer“.

Then you are given the implication that this is the ultimate solution that can somehow solve some of the major problems with our system:

Having access to this data provides us with opportunities to understand the inner workings of our complex and distributed systems in a way that we haven’t been able to before.
However, the data alone is limited in its usefulness, and it is the insights from this data that offer the greater value. Datadog provides the tooling to surface these insights in a way that enables proactive support and improvement of our systems.

DevOps Engineer

The bad thing about overhyping a tool like this is that you have to manage expectations and make it clear what the scope is, otherwise your interactions with managers are more difficult than they should be. One of the DevOps engineers made a vague statement like:

“Our dashboards monitor everything”

So they got a question from a manager: “Can you tell me who uses our API?”

“no, our dashboards can’t see that”

What we have enabled so far:

  • Configured service metadata to populate service ownership details
  • Enabled traces
  • Enabled RUM (Real User Monitoring) traces to provide full end to end tracing
  • Improved our service & environment tagging
  • Enabled version tracking so that we can observe version related anomalies
  • Defined a baseline set of monitors to cover availability, anomalous throughput, errors, latency and infrastructure performance
  • Defined strict availability & latency SLOs
  • Implemented 24 SLOs & 264 monitors
  • Configured PagerDuty automatic incident creation and resolution
  • Enabled logging
  • Driven several key Information Governance decisions
  • Established a Data asset inventory to provide more objectivity as to what data can be stored in Datadog

Performance Issues

One problem with our system is performance. Although we have blamed all kinds of things, performance issues remain in general. There have been claims that Datadog could help us diagnose where the performance issues are, but its agents have also increased network traffic and server resource usage, so they have caused performance issues of their own!

DD agent is using a lot of resources on our test systems and looks to be causing performance issues, I have stopped the agent multiple times when testing as the CPU and memory usage is maxed out. This has been raised before.

Tester
Architect: 
Datadog seems to be showing memory usage on all app servers is high, wonder why?

Me:
Does it only happen when Datadog is watching it?
We licence Datadog to prevent Major Incidents and performance issues…
Datadog causes Major Incidents and performance issues and tells us about it

Another aspect is that some things we wanted to measure required querying our SQL databases. To write an efficient SQL query, the columns you filter on need Indexes to be performant, but Indexes themselves take up space. Then we are always moaning about the cost of storage.

We wanted to look at adding Datadog to monitor the usage of a particular feature that managers were making a lot of fuss about. So we asked the Database Administrators about the repercussions of adding an index to our tables. It soon adds up to be absurd.

I checked a random server, and a new Index on RecordID (int, 4 bytes), Method (tinyint, 1 byte) and AvailabilityTimeStamp (datetime, 8 bytes) would be around 2.5GB for a server. There are 60 servers, so we need around 150GB for an extra index across Live. Testing the Datadog query before and after the creation of this index shows a 98.6% improvement in total execution time.
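The quoted numbers hang together. A quick sketch of the arithmetic (the per-key size comes straight from the quoted column types; real index overhead per row is ignored):

```python
# Rough check of the DBA's index estimate.
key_bytes = 4 + 1 + 8       # RecordID (int) + Method (tinyint) + AvailabilityTimeStamp (datetime)
per_server_gb = 2.5         # DBA's measured estimate for one server
servers = 60

total_gb = per_server_gb * servers
print(key_bytes, total_gb)  # 13 bytes per key, 150.0 GB across Live
```

So the feature managers were fussing about would cost 150GB of extra storage just to make its usage queryable efficiently.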

Deployment Engineer
Architect
I wondered if anyone else had noticed (and looking into?) poor performance spikes occurring every 2 hours, they seem to present on most servers I checked.

Me
no one actually looks at Datadog
can you create a Meta Dashboard, so it shows you the time since Dashboards were looked at?

Architect
I can only assume it's genuinely the case that no one actually looks at the dashboards
I've raised 4 issues now, purely from observing the trends in the last 2 weeks
we've had wrong servers in the public and private pools
Windows Updates running in the day and killing servers
servers sat idle with no traffic hitting them
SQL Server spikes on F: drive access
these spikes every 2 hours
don't know what they're doing
I've had a look at the Monitoring DB for Server07 this afternoon, and I'm absolutely appalled at how horrendous it is, I can't see the wood for the trees. I can only assume that users aren't getting any work done

Me
Interesting that the spikes are exactly 2h apart, but at different base minutes between servers

Architect
it is interesting, but we're still no closer to anyone paying attention to the issue
Philip will probably sort it, he sorted the last DB-related issue

Datadog pricing

The following are our discounted rates, per-month costs as follows (Sept ‘22 - Sept ‘23):

  • Infrastructure $11.75 (per host, per month)
  • Network Performance Monitoring (NPM) $4.30
  • Application Performance Monitoring (APM) $29.00
  • Custom metrics $5 (per 100, per month)
  • High use of logs (>1m/month) $1.52 (per 1m, per 15 days)
  • Database Monitoring $77.28 (not discounted)

Standard prices are on here https://www.datadoghq.com/pricing/list/

“Related to this, the Azure Pipelines integration for CI Visibility starting September 1st, 2023 will have a cost of $8 per committer per month (on an annual plan, or $12 per committer per month on-demand). Additionally, 400,000 CI Pipeline Spans are included per Pipeline Visibility committer per month. Based on our June usage data, our monthly cost for Azure Pipelines integration for CI Visibility would have been $644.74. We’ve had this enabled for sometime now, is anybody actively using this?”

CTO
Product                                  Charges ($)
APM Hosts                                2,320.00
Audit Trail                              1,846.54
Database Monitoring                      463.68
Fargate Tasks (APM)                      128.06
Fargate Tasks (Continuous Profiler)      70.84
Fargate Tasks (Infra)                    145.73
Infra Host                               42,206.00
Log Events – 15 Days                     10,265.18
Log Ingestion                            28.20
NetFlow Monitoring – 30 Days             507.80
Network Devices                          993.14
Network Hosts                            1,242.70
Pipeline Visibility                      574.08
Profiled Hosts                           275.08
RUM Browser or Mobile Sessions           1,186.91
RUM Session Replay                       48.74
Sensitive Data Scanner                   2,414.65
Serverless APM                           48.77
Serverless Workloads (Functions)         1,385.52
Synthetics – API Tests                   1,825.25
Synthetics – Browser Tests               0.06
Workflow Automation                      13.03
Grand Total                              67,989.96

These were our monthly charges at one point (although one entry covers 15 days, so double it). If you estimate how much this costs yearly, it’s going to be around $936k, unless we cut down what we are logging, make efficiency improvements, and don’t scale by adding more servers. So roughly $1m just for monitoring for a year. How ridiculous is that!?
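The yearly figure can be reconstructed from the table above; the only adjustment is counting the 15-day log line item twice to approximate a full month:

```python
# Back-of-envelope yearly estimate from the monthly Datadog bill.
grand_total = 67_989.96          # billed for the month shown above
log_events_15_days = 10_265.18   # this line only covers 15 days, so add it again
monthly = grand_total + log_events_15_days
yearly = monthly * 12

print(round(monthly, 2))         # ~78255.14 - the "$78k on monitoring" figure
print(round(yearly))             # ~939062 - i.e. "around $1m just for monitoring"
```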

Obviously when the CTO was fully aware of these costs, he then calls an All Hands meeting.

CTO: Tells everyone that Datadog should be used by everyone
Also CTO: Help! the costs are spiralling out of control

Jim
Lol - classic.
Jan: We are investing in this awesome new technology.
Apr: Please stop using this technology - it's too expensive.

Me
ha yeah, that was yesterday's meme for me

we've got this amazing feature
but there's not enough licences to use it properly
so we only partially use it
classic

part of my Dev Haiku collection

We even did that in relation to Datadog. It integrated with a product called PagerDuty to notify the team that something is bad, but there weren’t enough licences to alert everyone involved! What’s the point of even paying for it if it is half done? You can’t get the value. It is bonkers.

One of my colleagues, who works on a different project to do with an API, said it only costs $2,500 to run the API for a month and it’s used by millions of people. Yet here we are, spending $78k a month on monitoring alone.

Service owners and leads have been granted additional permissions to access more detailed billing and account information. Please take some time this next week to:
- Confirm that all active services and features are necessary and being actively used
- Identify any areas where we can optimise our setup to avoid unnecessary costs
- Prioritise production system costs over other environments.

This is a critical opportunity for us to do some housekeeping and ensure our resources are being used efficiently. We spend a significant amount of money on Datadog (7 figures), and we need to ensure that we are getting as much bang for our buck! If anyone has any questions about the above, please ask in here or reach out directly!

The Costly INFO

“Please consider very carefully if storing success status logs is required for your applications.”

As predicted, encouraging people to log everything to Datadog, not think about whether it is useful, and not check on it soon led to an angry message from the CTO.

You know how it was ridiculous that we were spending $10k on logs for 15 days? Well, it quickly grew to $15k and counting. On investigation, it was caused by one particular feature that was constantly logging the status INFO or OK.

327.59m OKs were logged in 1 day.

I looked at the total logs across a 1 week period and it showed as 1.18G. There’s a Gagillion logs!
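Putting that against the discounted log rate from the pricing list above ($1.52 per 1m logs, per 15 days) gives a rough idea of what the success statuses alone were costing. My own back-of-envelope, assuming the rate applies flatly:

```python
# What 327.59m "OK" logs per day cost at $1.52 per million per 15-day window.
oks_per_day_millions = 327.59
window_days = 15
rate_per_million = 1.52

cost_per_window = oks_per_day_millions * window_days * rate_per_million
print(round(cost_per_window, 2))   # ~7469.05 dollars every 15 days, just for success statuses
```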

How does it get to this point that no one noticed the excessive logging though? I suppose the Datadog company aren’t gonna alert you to it. They love money.

It proves I was right about the risk of getting everyone to create dashboards and increase logging without having an owner to actually check and respond to them.

Costly Switch On

This is a big mea culpa. In an effort to get early visibility in Datadog of AWS, I enabled the AWS Integration (many months ago). This means that lots of metrics, hosts, logs etc come from these accounts automatically (adding to our costs!).

I’d like to undo this mistake but want to understand if anyone is using these monitors (for example NLB monitors).

Any views?

The problem is we pay $11.75/host/month whether or not we use the metrics

CTO
James
Both. You pay per metric and for polling the API.
Neal
Didn't we find this out the hard way with granular replicator CW metric capture? (swiftly removed though)
John
yep, that gets expensive if you're pulling from Datadog, which is why there is the Kinesis stream option
James
Yes - we added very granular cloudwatch custom metrics, as we didn't have a direct Datadog connection. This pushed up our AWS spend significantly, so we turned that off. Custom metrics direct in Datadog is significantly cheaper, but still worth keeping an eye on. E.g. we wanted to track latency, error rate and a few others at a per org level - that quickly pushed up to 40K metrics. In DD you pay $5 per month per 200 custom metrics.  So we had to change our approach to metrics / observability to only surface details for error situations.
CTO
I've disabled the metric collection for all these accounts now. That saves at least $6,800 per year. Every little helps!
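Working backwards from the quoted $11.75 per host per month, that saving implies roughly how many hosts were being polled for nothing (my own estimate, assuming the whole saving is at the Infrastructure rate):

```python
# How many hosts does "$6,800 per year" imply at $11.75/host/month?
yearly_saving = 6_800
host_rate_per_month = 11.75

hosts = yearly_saving / 12 / host_rate_per_month
print(round(hosts))   # ~48 hosts' worth of metric collection switched off
```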

Who is viewing the data? And when?

Next challenge is this: I want the rest of the Senior Leadership Team (who are mostly not as technical as me) to be able to connect to DataDog and be able to understand how well our systems are working. I would suggest using the Golden Signals as a standard for all the systems. The dashboard that’s created needs to be easily consumed, consistent across products and reflective of customers’ experience. Can we make this happen?

CTO
Me:
Are the Directors actually gonna connect to DataDog? Trying to picture how this can be useful, and how they should get the info.
Architect
it's all laughable really
I was cringing when he put in the initial request to move it to single-sign on, could see where it was heading!
I don't think they need access to Datadog, surely they just want:
-Everything is ok
-Some things are broken (these things...)
-Everything is in ruins

everything else requires some level of investigation and interpretation
and we should probably have that information on our business status page, unless we're scared about how frequently we have problems

Me
That's true. People probably already tell the CEO if things aren't fine
and if they want to fiddle the figures, they can still do it in the dashboard that she sees

yep that dashboard was a first attempt to get us a first step to what you describe
problem with us at the deep technical level – is knowing what data SLT / GXT find useful. Count of Active Alerts ? Just a simple RAG status ?
average response time ?

DevOps Engineer

The conclusion was, the metrics that we create should have the following properties:

  • Consistent – the same interpretation should be made across all products
  • Comparative – we can tell from the metric how close we are to having an issue (ie percentage of link utilisation is better than Mbps)
  • Trending – we can see the past and if there is an underlying trend that points to an issue in the future the metric would make that obvious
  • RAG status’d – you can tell if a metric is ok quickly by seeing if it’s red, amber or green.
  • Relatable – the metric is connected to the experience by a customer, partner or patient.

We Monitored Everything But Not That Important Bit

Hi all, just letting you know about a lesson learnt from last weekend’s Mobile disruption incident; An internal service Mobile relies on had a dependency on a 3rd party service endpoint that went into an error state. Unfortunately we weren’t monitoring that as part of the service monitoring and therefore we had a significant delay in detecting that down-stream failure. We also didn’t have the Mobile service mapped as depending on the other internal service within PagerDuty, so even if an alert had fired, we wouldn’t have seen that incident bubble up from Identify Link into Mobile as the cause of the Mobile disruption. 

CTO

Conclusion

I say in a lot of these blogs that you really need to understand the problem you are trying to solve. Otherwise you end up wasting time and money, and causing more problems. It’s ridiculous that we have spent $1m a year on monitoring, and we still can’t predict or react to Major Incidents. There are gaps in the monitoring, incidents caused by the monitoring, and people not looking at the monitoring.

Also in predictable fashion, we are moving away from Datadog to Dynatrace which is supposed to be much cheaper. However, all the dashboards will have to be remade so there’s going to be lots of time wasted.

CTO overrule

I’ve written blogs about how our CTO tried to change our release process and announced it on a “Town Hall” call with the entire department; then loads of teams told him it couldn’t be done, so he had to back down.

Then later, on another Town Hall, he tried to change the Change Control process, but wouldn’t back down when we told him it wouldn’t work. He claimed it would be a sackable offence if we didn’t follow it. Then a month later, he found out someone couldn’t turn a server on because of the Change Control process. He said it was “malicious compliance” and that this would be a sackable offence in future. Within a few months, nearly the entire process had been reverted.

Last week, he was talking about how we needed to move a Data Centre to a new location. He said he preferred to move to the Cloud since it is “inline with our strategic targets”. However, after several meetings, the experts involved in the Data Centre decided the best solution would be to move to another physical data centre. The CTO didn’t like this because it wasn’t in line with his strategy, and he thought the move would be too slow.

Therefore, he took the executive decision to overrule them and demand they move to the cloud.

“Do I know we can move all servers to the cloud in time? No. But I was prepared to take the risk. I would rather make decisions and be wrong, than sit by and not make any”

CTO

It seemed strange to me to claim that moving to a physical data centre would be slow, but then moving to the Cloud probably couldn’t be done in time.

He then claimed that

“we have wasted enough time deciding what the plan should be; to move to the cloud or to move to a physical data centre”.

CTO

Isn’t this the worst scenario though? He could have made the decision before any meeting was arranged. But it sounds like they had debated the decision, came to a conclusion, then he told them he didn’t like their conclusion. Then he moaned that they wasted time debating.

So they had meetings with the experts and concluded the data centre was the best decision, but since the CTO loves the cloud, he has overruled them. So what was the value of the meetings? And will the staff be motivated to do something they don’t believe in?

“I think I’m compassionate with my team. It’s what binds us together as a team. Otherwise we are a bunch of individuals.”

CTO

I don’t get how he can make these statements and not realise the hypocrisy. How can you be compassionate if you have shown no confidence in their opinions and decision making?