Introduction: Summary of previous blogs
Datadog is a monitoring tool that my employer purchased licences for, and it quickly became the cool thing to use and impress the senior managers with (see Datadog, and Datadog – The Smooth Out).
I discussed problems in both those blogs, but a concern with all metrics is:
- What do you want to measure?
- Who is viewing the data? And when?
- What does “good” and “bad” look like, and who acts when that state is shown?
In “Datadog Knee Jerk”, I explained how our CTO and Technical Director demanded that everyone create a Datadog dashboard to monitor all services, regardless of what they are.
If we don’t have a clear idea of what to measure, who needs to view it, and how they know whether it is good or bad, then aren’t we just throwing money away? (Even if the dashboard itself doesn’t cost anything, you still spend the time to create one, and some dashboards would require additional logging to be effective.) Surely an obvious problem with wanting to monitor everything is that it becomes quite costly once you look into Datadog’s pricing model.
Easy To Make Nonsense Dashboards
From my brief time making Datadog dashboards and analysing other teams’ dashboards, I realised that the data can often look wrong, and it’s really easy to misinterpret the metrics due to the jargon used, and when/how the data is actually collected.
“I know nothing about Datadog, yet have been told to make a dashboard”
Principal Tester
Surely the worst case is to make dashboards that show nonsense data. You will waste time investigating problems that don’t exist, or not be alerted to actual problems that happen. So once we create a dashboard, who checks that it is valid?
Look at this one that I saw:
This is supposed to be plotting a line (purple) for failures in the time frame specified, then another (blue) for “week_before”.
It looks immediately wrong at a single glance. If I have set the time-frame combo box to show the “previous month”, should week_before be last week, or should it be the week before last month? It seemed to be neither. Also, notice that the graph is exactly the same shape/numbers. It just seems to be plotting the exact same data but pretending it is a week later.
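For comparison, I would expect a "week before" line to show, at each point, the value recorded exactly seven days earlier, so the two lines should only match if the data genuinely repeats every week. Here is a minimal sketch of that expectation. The failure counts are made up and the `week_before` helper is hypothetical; this is not how Datadog implements it.

```python
from datetime import date, timedelta
from typing import Dict, Optional


def week_before(series: Dict[date, int]) -> Dict[date, Optional[int]]:
    """For each day in the series, look up the value from exactly 7 days earlier.

    Days with no data a week earlier map to None.
    """
    return {day: series.get(day - timedelta(days=7)) for day in series}


# Hypothetical daily failure counts covering two weeks (illustration only).
start = date(2023, 5, 1)
counts = [3, 5, 2, 8, 1, 0, 4, 6, 7, 2, 9, 3, 1, 5]
failures = {start + timedelta(days=i): c for i, c in enumerate(counts)}

shifted = week_before(failures)

for day in failures:
    print(day, "current:", failures[day], "week before:", shifted[day])
```

In the second week the "week before" values are the first week's counts, which differ from the current ones. If both lines on a dashboard are identical, the shift probably isn't doing what the label claims.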
Jargon
You would think you just need some understanding of statistics to draw some charts, but in the usual nerd fashion, they throw around jargon to be cool. So people end up saying stuff like this:
What is datadog? Fundamentally, a platform like Datadog provides us with a scalable solution for ingesting observability data from our services. Datadog is built upon the three pillars of observability:
Metrics provide numerical measurements that allow us to assess our system performance and behaviour
Traces allow us to understand the flow of a request or transaction through our systems
Logs allow us to capture the details of system events and errors
When you read the official documentation, it’s difficult to understand what it actually can do. It’s the combination of jargon plus hyping up features to be powerful:
Datadog vector
Vector is a high-performance observability data pipeline that puts organizations in control of their observability data. Collect, transform, and route all your logs, metrics, and traces to any vendors you want today and any other vendors you may want tomorrow.
Imagine sending your metrics to vendors that you want in the future. They are like “mate, stop spamming us with your info, you aren’t our customer”.
Then you are given the implication that this is the ultimate solution that can somehow solve some of the major problems with our system:
Having access to this data provides us with opportunities to understand the inner workings of our complex and distributed systems in a way that we haven’t been able to before.
DevOps Engineer
However, the data alone is limited in its usefulness, and it is the insights from this data that offer the greater value. Datadog provides the tooling to surface these insights in a way that enables proactive support and improvement of our systems.
The bad thing about overhyping a tool like this is that you have to manage expectations and make it clear what the scope is, otherwise your interactions with managers are more difficult than they should be. One of the DevOps engineers made a vague statement like:
“Our dashboards monitor everything”
So they got a question from a manager “Can you tell me who uses our API?”
“no, our dashboards can’t see that”
What we have enabled so far:
- Configured service metadata to populate service ownership details
- Enabled traces
- Enabled RUM (Real User Monitoring) traces to provide full end to end tracing
- Improved our service & environment tagging
- Enabled version tracking so that we can observe version related anomalies
- Defined a baseline set of monitors to cover availability, anomalous throughput, errors, latency and infrastructure performance
- Defined strict availability & latency SLOs
- Implemented 24 SLOs & 264 monitors
- Configured PagerDuty automatic incident creation and resolution
- Enabled logging
- Driven several key Information Governance decisions
- Established a Data asset inventory to provide more objectivity as to what data can be stored in Datadog
Performance Issues
One problem with our system is performance. Although we have blamed all kinds of things, the performance issues remain. There have been claims that Datadog could help us diagnose where they are, but it has also increased network traffic and server resource usage, so it has caused performance issues of its own!
DD agent is using a lot of resources on our test systems and looks to be causing performance issues, I have stopped the agent multiple times when testing as the CPU and memory usage is maxed out. This has been raised before.
Tester
Architect:
Datadog seems to be showing memory usage on all app servers is high, wonder why?
Me:
Does it only happen when Datadog is watching it?
We licence Datadog to prevent Major Incidents and performance issues…
Datadog causes Major Incidents and performance issues and tells us about it
Another aspect is that some things we wanted to measure required querying our SQL databases. For a SQL query to be performant, the columns you filter on need indexes, but indexes themselves take up space, and we are always moaning about the cost of storage.
We wanted to look at adding Datadog to monitor the usage of a particular feature that managers were making a lot of fuss about. So we asked the Database Administrators about the repercussions of adding an index to our tables. It soon adds up to be absurd.
I checked a random server and a new Index on RecordID (int, 4 bytes), Method (tinyint, 1 byte) and AvailabilityTimeStamp (datetime, 8 bytes) would be around 2.5GB for a server. There are 60 servers so we need around 150GB for an extra index across Live. Testing the Datadog query before and after the creation of this index shows a 98.6% improvement in total execution time.
Deployment Engineer
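The arithmetic behind that estimate can be sketched as below. The column sizes, the 2.5GB figure and the server count come from the quote. The row-count bound is my own rough assumption: it ignores per-row and page overhead and any compression, and takes 1GB as 1024³ bytes.

```python
# Key column sizes in bytes, as quoted by the Deployment Engineer.
KEY_BYTES = {"RecordID": 4, "Method": 1, "AvailabilityTimeStamp": 8}

INDEX_GB_PER_SERVER = 2.5  # measured on one "random" server
SERVER_COUNT = 60

key_bytes_per_row = sum(KEY_BYTES.values())
total_index_gb = INDEX_GB_PER_SERVER * SERVER_COUNT

# Assumption: no row/page overhead and no compression, so this is only a
# rough upper bound on how many rows a 2.5GB index could hold.
max_rows_per_server = INDEX_GB_PER_SERVER * 1024**3 / key_bytes_per_row

print(f"{key_bytes_per_row} bytes of key data per row")
print(f"{total_index_gb:.0f} GB across {SERVER_COUNT} servers")
print(f"at most ~{max_rows_per_server / 1e6:.0f} million rows per server")
```

So 13 bytes of key data per row doesn't sound like much, but multiplied by a couple of hundred million rows and then by 60 servers it becomes a real storage bill.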
Architect
I wondered if anyone else had noticed (and is looking into?) the poor performance spikes occurring every 2 hours; they seem to be present on most servers I checked.
Me
no one actually looks at Datadog
can you create a Meta Dashboard, so it shows you the time since Dashboards were looked at?
Architect
I can only assume it's genuinely the case that no one actually looks at the dashboards
I've raised 4 issues now, purely from observing the trends in the last 2 weeks
we've had wrong servers in the public and private pools
Windows Updates running in the day and killing servers
servers sat idle with no traffic hitting them
SQL Server spikes on F: drive access
these spikes every 2 hours
don't know what they're doing
I've had a look at the Monitoring DB for Server07 this afternoon, and I'm absolutely appalled at how horrendous it is, I can't see the wood for the trees. I can only assume that users aren't getting any work done
Me
Interesting that the spikes are exactly 2h apart, but at different base minutes between servers
Architect
it is interesting, but we're still no closer to anyone paying attention to the issue
Philip will probably sort it, he sorted the last DB-related issue
Datadog pricing
The following are our discounted rates. Per-month costs are as follows (Sept ’22 - Sept ’23):
- Infrastructure $11.75
- Network Performance Monitoring (NPM) $4.30
- Application Performance Monitoring (APM) $29.00
- Custom metrics $5 (per 100, per month)
- High use of logs (>1m/month) $1.52 (per 1m, per 15 days)
- Database Monitoring $77.28 (not discounted)
Standard prices are on here https://www.datadoghq.com/pricing/list/
“Related to this, the Azure Pipelines integration for CI Visibility starting September 1st, 2023 will have a cost of $8 per committer per month (on an annual plan, or $12 per committer per month on-demand). Additionally, 400,000 CI Pipeline Spans are included per Pipeline Visibility committer per month. Based on our June usage data, our monthly cost for Azure Pipelines integration for CI Visibility would have been $644.74. We’ve had this enabled for some time now, is anybody actively using this?”
CTO
| Product | Product Charges ($) |
| APM Hosts | $2,320.00 |
| Audit Trail | $1,846.54 |
| Database Monitoring | $463.68 |
| Fargate Tasks (APM) | $128.06 |
| Fargate Tasks (Continuous Profiler) | $70.84 |
| Fargate Tasks (Infra) | $145.73 |
| Infra Host | $42,206.00 |
| Log Events – 15 Days | $10,265.18 |
| Log Ingestion | $28.20 |
| NetFlow Monitoring – 30 Days | $507.80 |
| Network Devices | $993.14 |
| Network Hosts | $1,242.70 |
| Pipeline Visibility | $574.08 |
| Profiled Hosts | $275.08 |
| RUM Browser or Mobile Sessions | $1,186.91 |
| RUM Session Replay | $48.74 |
| Sensitive Data Scanner | $2,414.65 |
| Serverless APM | $48.77 |
| Serverless Workloads (Functions) | $1,385.52 |
| Synthetics – API Tests | $1,825.25 |
| Synthetics – Browser Tests | $0.06 |
| Workflow Automation | $13.03 |
| Grand Total | $67,989.96 |
These were our monthly charges at one point (although the log events entry covers 15 days, so double it). If you estimate how much this costs yearly, it’s going to be around $936k unless we cut down what we are logging, make efficiency changes, or stop scaling by adding more servers. So roughly $1m a year just for monitoring. How ridiculous is that!?
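To show where that estimate comes from, here is a minimal sketch that adds up the charges from the table, counts the 15-day log entry twice, and multiplies by 12. The exact result lands slightly above the yearly figure I quoted, which appears to be the rounded $78k a month times 12.

```python
from decimal import Decimal

# Monthly charges in dollars, copied from the table above.
charges = {name: Decimal(amount) for name, amount in {
    "APM Hosts": "2320.00",
    "Audit Trail": "1846.54",
    "Database Monitoring": "463.68",
    "Fargate Tasks (APM)": "128.06",
    "Fargate Tasks (Continuous Profiler)": "70.84",
    "Fargate Tasks (Infra)": "145.73",
    "Infra Host": "42206.00",
    "Log Events – 15 Days": "10265.18",
    "Log Ingestion": "28.20",
    "NetFlow Monitoring – 30 Days": "507.80",
    "Network Devices": "993.14",
    "Network Hosts": "1242.70",
    "Pipeline Visibility": "574.08",
    "Profiled Hosts": "275.08",
    "RUM Browser or Mobile Sessions": "1186.91",
    "RUM Session Replay": "48.74",
    "Sensitive Data Scanner": "2414.65",
    "Serverless APM": "48.77",
    "Serverless Workloads (Functions)": "1385.52",
    "Synthetics – API Tests": "1825.25",
    "Synthetics – Browser Tests": "0.06",
    "Workflow Automation": "13.03",
}.items()}

table_total = sum(charges.values(), Decimal("0"))

# Assumption carried over from the text: the log events line covers only
# 15 days, so it is counted a second time to approximate a full month.
monthly_estimate = table_total + charges["Log Events – 15 Days"]
yearly_estimate = monthly_estimate * 12

print(f"Table total:      ${table_total:,.2f}")
print(f"Monthly estimate: ${monthly_estimate:,.2f}")
print(f"Yearly estimate:  ${yearly_estimate:,.2f}")
```

Using `Decimal` rather than floats keeps the cents exact, which makes it easy to check the sum against the Grand Total row.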
Obviously, once the CTO was fully aware of these costs, he called an All Hands meeting.
CTO: Tells everyone that Datadog should be used by everyone
Also CTO: Help! the costs are spiralling out of control
Jim
Lol - classic.
Jan: We are investing in this awesome new technology.
Apr: Please stop using this technology - it's too expensive.
Me
ha yeah, that was yesterday's meme for me
we've got this amazing feature
but there's not enough licences to use it properly
so we only partially use it
classic
part of my Dev Haiku collection
We even did that with Datadog. It integrates with a product called PagerDuty to notify the team that something is wrong, but there aren’t enough licences to alert everyone involved! What’s the point of paying for it at all if it is only half done? You can’t get the value. It is bonkers.
One of my colleagues who works on a different project to do with an API said it only costs $2,500 to run the API for a month and it’s used by millions of people. Yet here we are spending $78k on monitoring alone.
Service owners and leads have been granted additional permissions to access more detailed billing and account information. Please take some time this next week to:
- Confirm that all active services and features are necessary and being actively used
- Identify any areas where we can optimise our setup to avoid unnecessary costs
- Prioritise production system costs over other environments.
This is a critical opportunity for us to do some housekeeping and ensure our resources are being used efficiently. We spend a significant amount of money on Datadog (7 figures), and we need to ensure that we are getting as much bang for our buck!
If anyone has any questions about the above, please ask in here or reach out directly!
The Costly INFO
“Please consider very carefully if storing success status logs is required for your applications.”
As predicted, encouraging people to log everything to Datadog, without thinking about whether it is useful or checking on it afterwards, soon led to an angry message from the CTO.
You know how it was ridiculous we were spending $10k on logs for 15 days? Well, it quickly grew to $15k and counting. On investigation, it was caused by one particular feature that was constantly logging the status INFO or OK.
327.59m OKs were logged in 1 day.
I looked at the total logs across a 1 week period and it showed as 1.18G. There’s a Gagillion logs!
How does it get to this point that no one noticed the excessive logging though? I suppose the Datadog company aren’t gonna alert you to it. They love money.
It proves I was right about the danger of getting everyone to create dashboards and increase logging without having an owner to actually check and respond to them.
Costly Switch On
This is a big mea culpa. In an effort to get early visibility in Datadog of AWS, I enabled the AWS Integration (many months ago). This means that lots of metrics, hosts, logs etc come from these accounts automatically (adding to our costs!).
I’d like to undo this mistake but want to understand if anyone is using these monitors (for example NLB monitors).
Any views?
The problem is we pay $11.75/host/month whether or not we use the metrics
CTO
James
Both. You pay per metric and for polling the API.
Neal
Didn't we find this out the hard way with granular replicator CW metric capture? (swiftly removed though)
John
yep, that gets expensive if you're pulling from datadog which is why there is the kinesis stream option
James
Yes - we added very granular cloudwatch custom metrics, as we didn't have a direct Datadog connection. This pushed up our AWS spend significantly, so we turned that off. Custom metrics direct in Datadog is significantly cheaper, but still worth keeping an eye on. E.g. we wanted to track latency, error rate and a few others at a per org level - that quickly pushed up to 40K metrics. In DD you pay $5 per month per 200 custom metrics. So we had to change our approach to metrics / observability to only surface details for error situations.
CTO
I've disabled the metric collection for all these accounts now. That saves at least $6,800 per year. Every little helps!
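To put the "40K metrics" in money terms, here is a rough sketch. Note that the thread quotes $5 per 200 custom metrics while our pricing list above says $5 per 100, so the code works out both. The helper name and the round-up-to-a-whole-block behaviour are my assumptions, and it ignores any allowance included in the plan.

```python
import math


def monthly_custom_metric_cost(metric_count: int,
                               metrics_per_block: int,
                               dollars_per_block: float = 5.0) -> float:
    """Monthly cost if every started block of metrics is billed in full.

    Hypothetical helper: the rounding rule is an assumption, not taken
    from Datadog's billing documentation.
    """
    blocks = math.ceil(metric_count / metrics_per_block)
    return blocks * dollars_per_block


METRICS = 40_000  # the "40K metrics" mentioned in the thread

cost_at_per_100 = monthly_custom_metric_cost(METRICS, 100)  # rate in our pricing list
cost_at_per_200 = monthly_custom_metric_cost(METRICS, 200)  # rate quoted in the thread

print(f"At $5 per 100 metrics: ${cost_at_per_100:,.0f} per month")
print(f"At $5 per 200 metrics: ${cost_at_per_200:,.0f} per month")
```

Either way, it is a four-figure monthly bill for one team wanting per-org latency and error rates.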
Who is viewing the data? And when?
Next challenge is this: I want the rest of the Senior Leadership Team (who are mostly not as technical as me) to be able to connect to DataDog and be able to understand how well our systems are working. I would suggest using the Golden Signals as a standard for all the systems. The dashboard that’s created needs to be easily consumed, consistent across products and reflective of customers’ experience. Can we make this happen?
CTO
Me
Are the Directors actually gonna connect to DataDog? Trying to picture how this can be useful, and how they should get the info.
Architect
it's all laughable really
I was cringing when he put in the initial request to move it to single-sign on, could see where it was heading!
I don't think they need access to Datadog, surely they just want:
- Everything is ok
- Some things are broken (these things...)
- Everything is in ruins
everything else requires some level of investigation and interpretation
and we should probably have that information on our business status page, unless we're scared about how frequently we have problems
Me
That's true. People probably already tell the CEO if things aren't fine
and if they want to fiddle the figures, they can still do it in the dashboard that she sees
yep that dashboard was a first attempt to get us a first step to what you describe
DevOps Engineer
problem with us at the deep technical level – is knowing what data SLT / GXT find useful. Count of Active Alerts ? Just a simple RAG status ?
average response time ?
The conclusion was, the metrics that we create should have the following properties:
- Consistent – the same interpretation should be made across all products
- Comparative – we can tell from the metric how close we are to having an issue (i.e. percentage of link utilisation is better than Mbps)
- Trending – we can see the past and if there is an underlying trend that points to an issue in the future the metric would make that obvious
- RAG status’d – you can tell if a metric is ok quickly by seeing if it’s red, amber or green.
- Relatable – the metric is connected to the experience by a customer, partner or patient.
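The "Comparative" and "RAG status’d" properties can be sketched together as below: the same Mbps reading means very different things depending on the size of the link, whereas a percentage maps straight onto red, amber or green. The function names and the 70%/90% thresholds are made up for illustration; we never agreed actual thresholds.

```python
def utilisation_pct(used_mbps: float, capacity_mbps: float) -> float:
    """Convert a raw throughput reading into a percentage of link capacity."""
    if capacity_mbps <= 0:
        raise ValueError("capacity_mbps must be positive")
    return 100.0 * used_mbps / capacity_mbps


def rag_status(pct: float, amber_at: float = 70.0, red_at: float = 90.0) -> str:
    """Map a percentage onto red/amber/green using hypothetical thresholds."""
    if pct >= red_at:
        return "red"
    if pct >= amber_at:
        return "amber"
    return "green"


# The same 800 Mbps reading on two different links.
small_link = utilisation_pct(800, 1_000)
big_link = utilisation_pct(800, 10_000)

print(f"1 Gbps link:  {small_link:.0f}% -> {rag_status(small_link)}")
print(f"10 Gbps link: {big_link:.0f}% -> {rag_status(big_link)}")
```

A director doesn't need to know what a Mbps is to read that, which is the whole point of those properties.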
We Monitored Everything But Not That Important Bit
Hi all, just letting you know about a lesson learnt from last weekend’s Mobile disruption incident; An internal service Mobile relies on had a dependency on a 3rd party service endpoint that went into an error state. Unfortunately we weren’t monitoring that as part of the service monitoring and therefore we had a significant delay in detecting that down-stream failure. We also didn’t have the Mobile service mapped as depending on the other internal service within PagerDuty, so even if an alert had fired, we wouldn’t have seen that incident bubble up from Identify Link into Mobile as the cause of the Mobile disruption.
CTO
Conclusion
I say in a lot of these blogs that you really need to understand the problem you are trying to solve. Otherwise you end up wasting time and money, and causing more problems. It’s ridiculous that we have spent $1m a year on monitoring and still can’t predict or react to Major Incidents. There are gaps in the monitoring, incidents caused by the monitoring, and people not looking at the monitoring.
Also, in predictable fashion, we are moving away from Datadog to Dynatrace, which is supposed to be much cheaper. However, all the dashboards will have to be remade, so there’s going to be lots of time wasted.