Datadog – The Smooth Out

I recently wrote about Datadog, which allows you to create dashboards to monitor servers. It’s the “cool” thing to use at work and the CTO is heavily promoting its use.

I discussed how it’s quite confusing to use, and there are certain limitations that you have to bear in mind. I also suspected people were creating dashboards and then assuming they worked because they showed some data, but when you actually try to verify that data, it turns out some of the dashboards were showing absolute nonsense.

One guy, who had been working with Datadog for months, only just noticed a problem with his dashboard.

“In our team, we monitor the response time of our API calls, and over the last 3 months, we’ve seen a remarkable increase. In May, we were seeing around 140ms, but now we are seeing 550ms.”

So I loaded up his graph; my default view was “1 Hour”. I switched it to “past 1 Day” to zoom out, and the graph looked a bit different from what I expected. The first graph covers 11:00–12:00, so on the second graph, look at the far right for the same 11:00–12:00 window.

The first graph shows a spike of roughly 1.6 seconds around 11am, then drops down to 0.2s and remains very consistent. Yet the second graph seems to spike up to only 1.15 seconds, then drops down to around 0.4s, then finally spikes up to 0.6s at the end.

As I switched between the views, the numbers differed by ever larger margins.

I then moved to the day he mentioned, 7th July, picked a particular time, 12:05, and made a note of the value as I switched between the views (that exact time didn’t exist on all graphs, but I couldn’t be bothered re-checking them for more accuracy):

View                            Recorded value (ms)
15 mins                         179.65
1 hour                          176.51
4 hours                         156.1
1 day                           372.89
2 days (12:00 used instead)     552.93
1 week (12:00 used instead)     554.93
1 month (11:00 used instead)    550
3 months (1:00 used instead)    559

(Values recorded at 12:05, Friday 7th July)
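Out of curiosity, one way to take the graph UI out of the equation entirely is to query the numbers via the API with an explicit rollup, so the aggregation window is pinned regardless of which view happens to be open. This is just a sketch using Datadog’s Python client; the metric name and tag are placeholders, not the actual metric from his dashboard:

```python
# Query a metric directly, forcing a fixed 5-minute average rollup.
# The metric name and tag below are made up for illustration.
import time
from datadog import initialize, api

initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")

# Substitute the Unix timestamps for the window you care about,
# e.g. around 12:05 on 7th July.
end = int(time.time())
start = end - 600  # a 10-minute window

result = api.Metric.query(
    start=start,
    end=end,
    query="avg:my.api.response_time{env:prod}.rollup(avg, 300)",
)
for series in result.get("series", []):
    print(series["scope"], series["pointlist"])  # [timestamp_ms, value] pairs
```

Because the `.rollup(avg, 300)` is explicit, the returned points shouldn’t shift underneath you the way the dashboard values did.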

He didn’t seem to be using any weird formulas, so why is the data so wildly different? I didn’t report my exact findings from the table, but another developer chimed in with this comment:

“Datadog’s long-term reporting is poor when it comes to averages. I would take a 1-4 hour window as a good sample size. Anything 1 day plus, the results are clearly not accurate”

Datadog user

His statement seems consistent with my analysis. So why is Datadog so hyped up in the business when the people who use it don’t even trust the figures it reports? It also sounds like those who have noticed have kept it to themselves rather than sharing the knowledge.

He then theorised that Datadog aggregates an average of its datapoints over a period of time, e.g. 24 hours, and plots that. To me, that doesn’t make much sense: if it has a sample of the exact value for a given time, e.g. 12:00, why would it need to average everything from 12:00 the previous day to 12:00 that day? Especially when you could be monitoring time-sensitive traffic, e.g. a spike in the morning that tails off in the afternoon.
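To make his theory concrete, here is a minimal sketch (toy numbers, nothing to do with Datadog’s internals) of what averaging into buckets of different widths does to a time-sensitive series: a one-hour morning spike is perfectly visible in 5-minute buckets but dissolves into the daily average.

```python
# Toy illustration: average the same series into buckets of different widths.
# The traffic pattern is invented; this is not Datadog's actual algorithm.

def rollup_avg(points, bucket_seconds):
    """Slice (timestamp, value) points into fixed buckets and average each."""
    buckets = {}
    for ts, value in points:
        key = ts - (ts % bucket_seconds)  # bucket start time
        buckets.setdefault(key, []).append(value)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

# One datapoint per minute for 24 hours: 500ms during a 09:00-10:00 spike,
# 150ms the rest of the day.
day = [(m * 60, 500.0 if 9 * 3600 <= m * 60 < 10 * 3600 else 150.0)
       for m in range(24 * 60)]

print(rollup_avg(day, 5 * 60)[9 * 3600])   # 5-minute bucket at 09:00 -> 500.0
print(rollup_avg(day, 24 * 3600)[0])       # single daily bucket -> ~164.6
```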

After searching the Datadog documentation, we found this:

“As Datadog stores data at a 1 second granularity, it cannot display all real data on graphs. See metric aggregation for more details.

For a graph on a 1-week time window, it would require sending hundreds of thousands of values to your browser—and besides, not all these points could be graphed on a widget occupying a small portion of your screen. For these reasons, Datadog is forced to proceed to data aggregation and to send a limited number of points to your browser to render a graph.

For instance, on a one-day view with the ’lines’ display, there is one datapoint every 5 minutes. The Datadog backend slices the 1-day interval into 288 buckets of 5 minutes. For each bucket, the backend rolls up all data into a single value. For instance, the datapoint rendered on your graph with timestamp 07:00 is actually an aggregate of all real datapoints submitted between 07:00:00 and 07:05:00 that day.”

https://docs.datadoghq.com/dashboards/guide/query-to-the-graph/
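Doing the arithmetic from that example: 24 hours sliced into 5-minute buckets is indeed 86,400 / 300 = 288 buckets. If you assume the points-per-graph budget stays roughly constant across views (my extrapolation; the doc only gives the one-day figure), the bucket width for other windows falls out directly:

```python
# Rough extrapolation from the doc's one-day example (288 buckets of 5 min).
# Assuming the same ~288-point budget for every window is my assumption,
# not something the documentation states for other views.
POINTS_PER_GRAPH = 288

for label, seconds in [("1 hour", 3600), ("1 day", 86400),
                       ("1 week", 7 * 86400), ("1 month", 30 * 86400)]:
    width = seconds / POINTS_PER_GRAPH
    print(f"{label:>8}: one point per {width:,.0f}s (~{width / 60:.1f} min)")
```

On that reading, a single point on the 1-month view is already a roughly 2.5-hour average, which would explain smoothing over wider views, though not why the level in my table shifts rather than just flattens.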

That explanation sounds fine in theory. If the graph is showing one point per hour, then each point can be an aggregate of the previous hour, which should mean it is a smoothed value. So if you have “zoomed” into a minute-by-minute view and see a very squiggly line of rapid but small fluctuations, then when you zoom out to the hourly view, the line should look fairly flat, and each value should be the average over that period. But I don’t think it explains how my first graph, which probably averages just over 0.2s, showed as 0.4s.
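That intuition is easy to check: if every bucket holds the same number of points, the average of the bucket averages is exactly the overall average, so zooming out should flatten the squiggles without moving the level. A quick sanity check with toy data:

```python
# Equal-sized buckets: mean of bucket means == overall mean.
# Toy minute-by-minute series hovering around 0.2s; not real dashboard data.
import random

random.seed(1)
series = [0.2 + random.uniform(-0.05, 0.05) for _ in range(1440)]  # 24h of minutes

overall = sum(series) / len(series)
hourly = [sum(series[i:i + 60]) / 60 for i in range(0, 1440, 60)]  # 24 hourly buckets
mean_of_hourly = sum(hourly) / len(hourly)

print(round(overall, 6), round(mean_of_hourly, 6))  # identical (up to float error)
```

So if plain averaging were the whole story, the minute-by-minute view and the hourly view should agree on the level; it only moves if the buckets are weighted unevenly or something other than a plain average is being applied.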

There’s this useless video from Datadog which is supposed to show how your graphs are “smoothed out” when zooming out. It has no sound so you have to interpret what they are showing you with their mouse:

At 12:04, they show a value of 13.32% in the 15 mins view, and at 12:05 you see it drop down to 9.07%.

They then switch to the 4 hours view.

I’d say the 12:04 value is now 15.8%, but they start hovering over times around 11:17.

When they switch to 1 day, they start looking at data around 6:45, so clearly they have no idea what they are even doing. In this view, the times around 12:00 are ~14%.

Over such small timescales, surely there shouldn’t be that much variance in the data. Accurate figures matter most when you are looking at the micro level, in the sub-1-hour views, and when you zoom out, the figures should stay consistent with them.
