Continuing the recent trend of failures causing Major Incidents (see Printing Licence Key Expiry and The Outage), we recently had another major problem for a small group of users after migrating their server to “the cloud”.
From what I understand, everything worked apart from one particular service, which they forgot to check, leaving the feature broken for a few days. The most embarrassing part was that it was our main rivals who told us it wasn’t working, when their calls to our interoperability API kept failing. It had been broken for 3 weeks!
This caused another instant reaction from our CTO and Technical Director, who demanded that everyone create a Datadog dashboard to monitor all services, regardless of what they are.
Datadog is a monitoring tool we recently purchased licences for, and it is the cool thing to use to impress the senior managers (see Datadog, and Datadog – The Smooth Out).

I discussed problems in both of those posts, but the concerns with any set of metrics are:
- What do you want to measure?
- Who is viewing the data? And when?
- What do “good” and “bad” look like, and who acts when that state is shown?
Another key point was made by a colleague:
“But we can’t expect some pretty Datadog dashboard templates to solve the historical problems that have meant we have lots of live services in the business with nobody who understands where they are, or how they work…
The company has a long history of developing a solution, moving the team that developed it off onto a new project, and leaving that solution behind. Combine that with a massive wall of confusion between Dev and Hosted, you have Hosted running a bunch of servers that they have no idea what they do.”
So do the developers really understand how things work once they are deployed? Does the development team know how to create an effective dashboard, and how to act upon what it shows?
After the CTO had decided every team needs a dashboard, I was invited to a meeting with several people from different teams. One of the Test Managers said it was
“a knee jerk reaction. We want this and we want it now”
Test Manager
Then he went on to say:
“I know nothing about Datadog, yet have been told to make a dashboard”
Test Manager
People were also told that it was the number one priority, so we needed to pause our current development. The CTO claimed it:
“should take a week. A relatively simple ask.
ANYTHING you are doing at the moment is secondary to this. The only exception is a major incident. If you get invited to any other meeting, invite the Tech Directors and they will get it cancelled”
CTO
People who knew more about how Datadog works raised concerns about performance. If Datadog is running on every server and sending metrics every minute, it will cause far more network traffic than we had before – and we already have a problem with our networks not being able to handle the current load.
Again, someone came up with the idea that the servers could send their metrics to a server acting as a middleman, which would then forward the data to Datadog. But this idea doesn’t make sense: you still have the same number of servers (well, plus one) sending data on the network, and then the central server has to send a massive amount of data in one go. A rough sketch of what that relay amounts to is below.
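To make that concrete, here is a minimal Python sketch of the sort of relay people were describing. The port, flush interval, intake endpoint and API key handling are all illustrative assumptions for this post, not anything we actually run.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

DD_INTAKE_URL = "https://api.datadoghq.com/api/v1/series"  # assumed metrics intake endpoint
DD_API_KEY = "REPLACE_ME"          # placeholder, not a real key
FLUSH_INTERVAL_SECONDS = 60        # hypothetical flush cadence

metric_buffer = []                 # series received from internal hosts, awaiting forwarding
buffer_lock = threading.Lock()


class RelayHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Every internal server still POSTs its payload over the network each minute,
        # so the inbound traffic is the same as if they sent to Datadog directly.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        with buffer_lock:
            metric_buffer.extend(payload.get("series", []))
        self.send_response(202)
        self.end_headers()


def flush():
    # Forward everything collected so far as one large request,
    # the "massive amount of data in one go" problem.
    with buffer_lock:
        pending = metric_buffer[:]
        metric_buffer.clear()
    if pending:
        request = urllib.request.Request(
            DD_INTAKE_URL,
            data=json.dumps({"series": pending}).encode(),
            headers={"Content-Type": "application/json", "DD-API-KEY": DD_API_KEY},
        )
        urllib.request.urlopen(request)
    threading.Timer(FLUSH_INTERVAL_SECONDS, flush).start()


if __name__ == "__main__":
    flush()
    HTTPServer(("0.0.0.0", 8080), RelayHandler).serve_forever()
```

Unless the relay drops or aggregates data before forwarding it, the same data still crosses the internal network, just with an extra hop before it leaves.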
So we are left with the same questions:
- Are people going to create good dashboards?
- Is the data they are showing accurate?
- Are we going to act on them when they show that something has gone wrong?
- Is the increase in metrics going to create performance issues?