When it comes to software development, you can often play it safe with technology you already know, or be more adventurous and use something new. I think the trick is to research the pros and cons of the technology and make sure it actually suits the problem.
There’s no point using something just because it seems cool when it might not be the right tool for the job. An entire team investing time learning something new can be a complete waste if the project is then cancelled or restarted because it headed in the wrong direction.
A rule of thumb when choosing technologies:
For an experiment? be as weird as possible.
For production? be as boring as possible.
When it comes to maintenance, sometimes you end up in situations where someone is the “Expert” and therefore has to fix any issues themselves, or will be approached for help by another developer. Therefore, if you write something crazy for production, it will be you that maintains it, either directly or indirectly.
Sometimes becoming the expert in something is the way to get promoted or pay rises though, since you become the super important developer that the company can’t afford to let go. However, that also means you will be stuck on this part of the software, and can’t really move on to different projects.
If you do become one of these experts, and if you want to move on to a new project; you need to train a replacement up. Can you find a replacement that wants to take over, knowing that they will be stuck with this special project? How long will it take to train a replacement? How much documentation did you write?
There was an internal meeting where a new product called “Recruit” was announced. The first question was: “it sounds like it could be confused with a method of recruiting staff to work for us – was that discussed?”
The manager said “to be honest, I never considered that“.
He then added there were around 20 people who were in the meetings, and no one had questioned it, or raised any objections.
A few months prior, there was an announcement about a new team that was handling branding in Marketing. We were told we couldn’t create any names without going via them. The last product names they came up with were ASSistant, and ANALytics.
I thought that if the software isn’t well received, it could easily gain a negative nickname, and people could make statements like “the software is ass”.
A Product Owner recently stated that the Assistant branding will soon be phased out, and it will just be merged into our main software’s branding. The decision came about when another Product Owner was doing a demo and had created a test user with the name “ass”. A manager flagged it as unprofessional and was concerned that we could easily demo something like that to external clients.
“you probably want to change those Ass users”
Manager
So far, the marketing naming team hasn’t got a good track record.
Twitter was a really strong brand, so Elon’s insistence on changing it to X seemed like a baffling choice, and one seemingly based on his fascination with the letter X rather than any business reason.
As a…
– user
I want…
– a strategically-timed, comprehensive rebrand
so that…
– negative discourse regarding company practices can be briefly veiled with commentary regarding new name, logo, and brand assets, alongside excessively-shared parodies of the brand in meme format
here’s why everyone is WRONG about twitter changing its name to x being “a baffling choice” and “not a smart thing to do”: a thread
1/ x is an awesome letter, one of the coolest letters in fact! I just turned 8 years old and I think the letter x is the most coolest thing there is. anyway, my dad made me chicken nuggets so I can’t finish the thread right now
All jokes aside, I did find a serious thread that was actually useful in detailing the reasons for X:
The X era has begun at Twitter. Elon Musk says the bird logo will be replaced by an X. For Musk, it’s the continuation of a story that began 25 years ago. Here’s a brief history on that...
Let’s go back to 1999. After selling his first company… …Musk set out to disrupt banking. And the X identity was born. But its time as a standalone brand would be short-lived. Instead, X became part of another brand.
X became part of PayPal. And PayPal was acquired by eBay. Musk walked away with $165 million. He then started building again… …and the letter X would reappear.
Musk would launch Space Exploration Technologies Corporation. It would become known as SpaceX.
Worth noting… SpaceX’s logo has a hidden message. The X symbolizes a rocket’s trajectory.
Musk also used “X” at Tesla. It’s the name of Tesla’s third model. Musk envisioned a car lineup with models that spelled out “sexy.” There’s a Model S. An X. And a Y. Ford had the rights to the Model E. So Musk instead opted for 3, which is basically a backwards E.
X is even the name of Musk’s young son.
Meanwhile, you likely saw all of the headlines leading up to this development… …from rebranding Twitter as X Corp… …to the recent launch of xAI. In other words, wheels were already in motion for the new era.
Datadog is a monitoring tool my employer purchased licences for, and it quickly became the cool thing to use to impress the senior managers (see Datadog, and Datadog – The Smooth Out).
I discussed problems in both those blogs, but the questions that matter with any metrics are:
What do you want to measure?
Who is viewing the data? And when?
What does “good” and “bad” look like, and who acts when that state is shown?
In “Datadog Knee Jerk“, I explained how our CTO and Technical Director demanded that everyone create a Datadog dashboard to monitor all services, regardless of what they are.
If we don’t have a clear idea of what to measure, who needs to view it, and how they know whether it is good or bad, then aren’t we just throwing money away? (Even if a dashboard itself doesn’t cost anything, someone still has to spend time creating it, and some dashboards need additional logging to be effective.) The obvious problem with wanting to monitor everything is that it becomes quite costly once you look at Datadog’s pricing model.
Easy To Make Nonsense Dashboards
From my brief time making Datadog dashboards and analysing other teams’ dashboards, I realised that the data can often look wrong, and it’s really easy to misinterpret the metrics due to the jargon used, and when/how the data is actually collected.
“I know nothing about Datadog, yet have been told to make a dashboard”
Principal Tester
Surely the worst case is to make dashboards that show nonsense data. You will waste time investigating problems that don’t exist, or not be alerted to actual problems that happen. So once we create a dashboard, who checks that it is valid?
Look at this one that I saw:
This is supposed to be plotting a line (purple) for failures in the time frame specified, then another (blue) for “week_before“.
It looks wrong at a single glance. If I have set the time-frame combo box to show the “previous month”, should week_before be last week, or the week before last month? It seemed to be neither. Also, notice that the two lines are exactly the same shape and numbers. It just seems to be plotting the exact same data but pretending it is a week later.
Jargon
You would think you just need some understanding of statistics to draw some charts, but in the usual nerd fashion, they throw around jargon to be cool. So people end up saying stuff like this:
What is Datadog? Fundamentally, a platform like Datadog provides us with a scalable solution for ingesting observability data from our services. Datadog is built upon the three pillars of observability:
Metrics provide numerical measurements that allow us to assess our system performance and behaviour
Traces allow us to understand the flow of a request or transaction through our systems
Logs allow us to capture the details of system events and errors
When you read the official documentation, it’s difficult to understand what it actually can do. It’s the combination of jargon plus hyping up features to be powerful:
Datadog Vector: Vector is a high-performance observability data pipeline that puts organizations in control of their observability data. Collect, transform, and route all your logs, metrics, and traces to any vendors you want today and any other vendors you may want tomorrow.
Imagine sending your metrics to vendors that you want in the future. They are like “mate, stop spamming us with your info, you aren’t our customer“.
Then you are given the implication that this is the ultimate solution that can somehow solve some of the major problems with our system:
Having access to this data provides us with opportunities to understand the inner workings of our complex and distributed systems in a way that we haven’t been able to before. However, the data alone is limited in its usefulness, and it is the insights from this data that offer the greater value. Datadog provides the tooling to surface these insights in a way that enables proactive support and improvement of our systems.
DevOps Engineer
The bad thing about overhyping a tool like this is that you have to manage expectations and make the scope clear, otherwise your interactions with managers are more difficult than they should be. One of the DevOps engineers made a vague statement like:
“Our dashboards monitor everything”
So they got a question from a manager “Can you tell me who uses our API?”
“no, our dashboards can’t see that”
What we have enabled so far:
Configured service metadata to populate service ownership details
Enabled traces
Enabled RUM (Real User Monitoring) traces to provide full end to end tracing
Improved our service & environment tagging
Enabled version tracking so that we can observe version related anomalies
Defined a baseline set of monitors to cover availability, anomalous throughput, errors, latency and infrastructure performance
Defined strict availability & latency SLOs
Implemented 24 SLOs & 264 monitors
Configured PagerDuty automatic incident creation and resolution
Enabled logging
Driven several key Information Governance decisions
Established a Data asset inventory to provide more objectivity as to what data can be stored in Datadog
Performance Issues
One long-standing problem with our system is performance. Although we have blamed all kinds of things, the performance issues remain. There have been claims that Datadog could help us diagnose where they are, but the agents have also increased network traffic and server resource usage, so they have caused performance issues of their own!
DD agent is using a lot of resources on our test systems and looks to be causing performance issues, I have stopped the agent multiple times when testing as the CPU and memory usage is maxed out. This has been raised before.
Tester
Architect: Datadog seems to be showing memory usage on all app servers is high, wonder why?
Me: Does it only happen when Datadog is watching it?
We licence Datadog to prevent Major Incidents and performance issues… Datadog causes Major Incidents and performance issues and tells us about it
Another aspect is that some of the things we wanted to measure required querying our SQL databases. To write an efficient SQL query, the columns you filter on need indexes, but indexes themselves take up space – and we are always moaning about the cost of storage.
We wanted to look at adding Datadog to monitor the usage of a particular feature that managers were making a lot of fuss about. So we asked the Database Administrators about the repercussions of adding an index to our tables. It soon adds up to be absurd.
I checked a random server and a new Index on RecordID (int 4 byte), Method (tinyint 1 byte) and AvailabilityTimeStamp (datetime 8bytes) would be around 2.5GB for a server. There are 60 servers so we need around 150GB for an extra index across Live. Testing the Datadog query before and after the creation of this index shows a 98.6% improvement in total execution time.
Deployment Engineer
Architect: I wondered if anyone else had noticed (and is looking into?) poor performance spikes occurring every 2 hours – they seem to be present on most servers I checked.
Me: no one actually looks at Datadog. Can you create a Meta Dashboard, so it shows the time since each dashboard was last looked at?
Architect: I can only assume it's genuinely the case that no one actually looks at the dashboards. I've raised 4 issues now, purely from observing the trends in the last 2 weeks: we've had wrong servers in the public and private pools, Windows Updates running in the day and killing servers, servers sat idle with no traffic hitting them, SQL Server spikes on F: drive access, and these spikes every 2 hours that I don't know the cause of. I've had a look at the Monitoring DB for Server07 this afternoon, and I'm absolutely appalled at how horrendous it is – I can't see the wood for the trees. I can only assume that users aren't getting any work done.
Me: Interesting that the spikes are exactly 2h apart, but at different base minutes between servers
Architect: it is interesting, but we're still no closer to anyone paying attention to the issue. Philip will probably sort it, he sorted the last DB-related issue
Datadog pricing
The following are our discounted rates; per-month costs are as follows (Sept ‘22 – Sept ‘23):
•Infrastructure $11.75
•Network Performance Monitoring (NPM) $4.30
•Application Performance Monitoring (APM) $29.00
•Custom metrics $5 (per 100, per month)
•High use of logs (>1m/month) $1.52 (per 1m, per 15 days)
•Database Monitoring $77.28 (not discounted)
“Related to this, the Azure Pipelines integration for CI Visibility starting September 1st, 2023 will have a cost of $8 per committer per month (on an annual plan, or $12 per committer per month on-demand). Additionally, 400,000 CI Pipeline Spans are included per Pipeline Visibility committer per month. Based on our June usage data, our monthly cost for Azure Pipelines integration for CI Visibility would have been $644.74. We’ve had this enabled for sometime now, is anybody actively using this?”
CTO
Product: Charges ($)
APM Hosts: $2,320.00
Audit Trail: $1,846.54
Database Monitoring: $463.68
Fargate Tasks (APM): $128.06
Fargate Tasks (Continuous Profiler): $70.84
Fargate Tasks (Infra): $145.73
Infra Host: $42,206.00
Log Events – 15 Days: $10,265.18
Log Ingestion: $28.20
NetFlow Monitoring – 30 Days: $507.80
Network Devices: $993.14
Network Hosts: $1,242.70
Pipeline Visibility: $574.08
Profiled Hosts: $275.08
RUM Browser or Mobile Sessions: $1,186.91
RUM Session Replay: $48.74
Sensitive Data Scanner: $2,414.65
Serverless APM: $48.77
Serverless Workloads (Functions): $1,385.52
Synthetics – API Tests: $1,825.25
Synthetics – Browser Tests: $0.06
Workflow Automation: $13.03
Grand Total: $67,989.96
These were our monthly charges at one point (although the Log Events entry only covers 15 days, so double it). That works out at roughly $78k a month, or around $940k a year, unless we cut down what we are logging, make efficiency savings, and don’t scale up by adding more servers. So roughly $1m a year just for monitoring. How ridiculous is that!?
Once the CTO was fully aware of these costs, he then called an All Hands meeting.
CTO: Tells everyone that Datadog should be used by everyone
Also CTO: Help! the costs are spiralling out of control
Jim
Lol - classic.
Jan: We are investing in this awesome new technology.
Apr: Please stop using this technology - it's too expensive.
Me
ha yeah, that was yesterday's meme for me
we've got this amazing feature
but there's not enough licences to use it properly
so we only partially use it
classic
part of my Dev Haiku collection
We even did that in relation to Datadog. It integrates with a product called PagerDuty to notify the team that something is bad, but there aren’t enough licences to alert everyone involved! What’s the point in paying for it at all if it’s only half done? You can’t get the value. It is bonkers.
One of my colleagues who works on a different project to do with an API said it only costs $2,500 to run the API for a month and it’s used by millions of people. Yet here we are spending $78k on monitoring alone.
Service owners and leads have been granted additional permissions to access more detailed billing and account information. Please take some time this next week to:
– Confirm that all active services and features are necessary and being actively used
– Identify any areas where we can optimise our setup to avoid unnecessary costs
– Prioritise production system costs over other environments.
This is a critical opportunity for us to do some housekeeping and ensure our resources are being used efficiently. We spend a significant amount of money on Datadog (7 figures), and we need to ensure that we are getting as much bang for our buck! If anyone has any questions about the above, please ask in here or reach out directly!
The Costly INFO
“Please consider very carefully if storing success status logs is required for your applications.”
As predicted, encouraging people to log everything to Datadog, without thinking about whether it is useful or ever checking on it, soon led to an angry message from the CTO.
You know how it was ridiculous that we were spending $10k on logs for 15 days? Well, it quickly grew to $15k and counting. On investigation, it was caused by one particular feature that was constantly logging the status INFO or OK.
327.59m OKs were logged in 1 day.
I looked at the total logs across a 1 week period and it showed as 1.18G. There’s a Gagillion logs!
How did it get to this point without anyone noticing the excessive logging, though? I suppose Datadog themselves aren’t gonna alert you to it. They love money.
It proves I was right: getting everyone to create dashboards and increase logging, without an owner to actually check and respond to them, was always going to end this way.
Costly Switch On
This is a big mea culpa. In an effort to get early visibility in Datadog of AWS, I enabled the AWS Integration (many months ago). This means that lots of metrics, hosts, logs etc come from these accounts automatically (adding to our costs!).
I’d like to undo this mistake but want to understand if anyone is using these monitors (for example NLB monitors).
Any views?
The problem is we pay $11.75/host/month whether or not we use the metrics
CTO
James
Both. You pay per metric and for polling the API.
Neal
Didn't we find this out the hard way with granular replicator CW metric capture? (swiftly removed though)
John
yep, that gets expensive if you're pulling from datadog which is why there is the kinesis stream option
James
Yes - we added very granular cloudwatch custom metrics, as we didn't have a direct Datadog connection. This pushed up our AWS spend significantly, so we turned that off. Custom metrics direct in Datadog is significantly cheaper, but still worth keeping an eye on. E.g. we wanted to track latency, error rate and a few others at a per org level - that quickly pushed up to 40K metrics. In DD you pay $5 per month per 200 custom metrics. So we had to change our approach to metrics / observability to only surface details for error situations.
CTO
I've disabled the metric collection for all these accounts now. That saves at least $6,800 per year. Every little helps!
Who is viewing the data? And when?
Next challenge is this: I want the rest of the Senior Leadership Team (who are mostly not as technical as me) to be able to connect to DataDog and be able to understand how well our systems are working. I would suggest using the Golden Signals as a standard for all the systems. The dashboard that’s created needs to be easily consumed, consistent across products and reflective of customers’ experience. Can we make this happen?
CTO
Me:
Are the Directors actually gonna connect to DataDog? Trying to picture how this can be useful, and how they should get the info.
Architect
it's all laughable really
I was cringing when he put in the initial request to move it to single-sign on, could see where it was heading!
I don't think they need access to Datadog, surely they just want:
-Everything is ok
-Some things are broken (these things...)
-Everything is in ruins
everything else requires some level of investigation and interpretation
and we should probably have that information on our business status page, unless we're scared about how frequently we have problems
Me
That's true. People probably already tell the CEO if things aren't fine
and if they want to fiddle the figures, they can still do it in the dashboard that she sees
yep, that dashboard was a first attempt at a first step towards what you describe. The problem for us at the deep technical level is knowing what data SLT / GXT find useful. A count of active alerts? Just a simple RAG status? Average response time?
DevOps Engineer
The conclusion was that the metrics we create should have the following properties:
Consistent – the same interpretation should be made across all products
Comparative – we can tell from the metric how close we are to having an issue (ie percentage of link utilisation is better than Mbps)
Trending – we can see the past and if there is an underlying trend that points to an issue in the future the metric would make that obvious
RAG status’d – you can tell if a metric is ok quickly by seeing if it’s red, amber or green (see the sketch after this list).
Relatable – the metric is connected to the experience by a customer, partner or patient.
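To make the “Comparative” and “RAG status’d” ideas concrete, here’s a tiny sketch of what I mean – the thresholds, names and the choice of link utilisation are made up for illustration, not what we actually agreed:
using System;

public enum RagStatus { Green, Amber, Red }

public static class Rag
{
    // The raw number is converted into a status against agreed thresholds,
    // so anyone can read it without interpretation.
    // e.g. link utilisation as a percentage: amber at 70%, red at 90% (illustrative values).
    public static RagStatus ForUtilisation(double percentUsed) =>
        percentUsed >= 90 ? RagStatus.Red :
        percentUsed >= 70 ? RagStatus.Amber :
        RagStatus.Green;
}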
We Monitored Everything But Not That Important Bit
Hi all, just letting you know about a lesson learnt from last weekend’s Mobile disruption incident; An internal service Mobile relies on had a dependency on a 3rd party service endpoint that went into an error state. Unfortunately we weren’t monitoring that as part of the service monitoring and therefore we had a significant delay in detecting that down-stream failure. We also didn’t have the Mobile service mapped as depending on the other internal service within PagerDuty, so even if an alert had fired, we wouldn’t have seen that incident bubble up from Identify Link into Mobile as the cause of the Mobile disruption.
CTO
Conclusion
I say in a lot of these blogs that you really need to understand the problem you are trying to solve; otherwise you end up wasting time and money and causing more problems. It’s ridiculous that we have spent $1m a year on monitoring and we still can’t predict or react to Major Incidents. There are gaps in the monitoring, incidents caused by the monitoring, and people not looking at the monitoring.
Also in predictable fashion, we are moving away from Datadog to Dynatrace which is supposed to be much cheaper. However, all the dashboards will have to be remade so there’s going to be lots of time wasted.
I’ve written blogs about how our CTO tried to change our release process and announced it on a “Town Hall” call with the entire department; then loads of teams told him it couldn’t be done, so he had to back down.
Then later, on another Town Hall, he tried to change the Change Control process, but wouldn’t back down when we told him it wouldn’t work. He claimed it would be a sackable offence if we didn’t follow it. A month later, he found out someone couldn’t turn a server on because of the Change Control process; he called that “malicious compliance” and said that would be a sackable offence in future. Within a few months, nearly the entire process had been reverted.
Last week, he was talking about how we needed to move a Data Centre to a new location. He said he preferred to move to the Cloud since it is “in line with our strategic targets”. After several meetings, the experts involved in the Data Centre decided the best solution would be to move to another data centre. The CTO didn’t like this, because it wasn’t in line with the strategy and he thought the move would be too slow.
Therefore, he took the executive decision to overrule them and demand they move to the cloud.
“Do I know we can move all servers to the cloud in time? No. But I was prepared to take the risk. I would rather make decisions and be wrong, than sit by and not make any”
CTO
It seemed strange to me to claim that moving to a physical data centre would be too slow, when moving to the Cloud probably couldn’t be done in time either.
He then claimed that
“we have wasted enough time deciding what the plan should be; to move to the cloud or to move to a physical data centre”.
CTO
Isn’t this the worst scenario though? He could have made the decision before any meeting was arranged. But it sounds like they had debated the decision, came to a conclusion, then he told them he didn’t like their conclusion. Then he moaned that they wasted time debating.
So they had meetings with the experts, who concluded the data centre was the best option, but since the CTO loves the cloud, he overruled them. So what was the value of the meetings? And will the staff be motivated to do something they don’t believe in?
“I think I’m compassionate with my team. It’s what binds us together as a team. Otherwise we are a bunch of individuals.”
CTO
I don’t get how he can make these statements and not realise the hypocrisy. How can you be compassionate if you have shown no confidence in their opinions and decision making?
When migrating from on-prem servers to the cloud, the Deployment team decided to change the way servers were allocated, presumably to minimise the cost. They:
“designed separate pools for the public side and private side so that the compute and memory could be dedicated to (and protected from) other types of traffic. Due to this split we reduced the ratio of CPU cores to sites from 0.42 per site to 0.28 per site (as the cores were now dedicated to public, private all separately)“.
Deployment expert
Initially, this new setup worked fine, but during a particularly busy week they saw slower response times. It actually led to the discovery of a problem we must have had for a while: SQL connections weren’t being properly disposed of, which created a bottleneck on the remaining available connections.
They added a temporary fix which was something to do with “Shared app pools“, rather than autoscaling the application servers which would cost money. But this is a major advantage of the cloud – that you can scale on demand.
So to no one’s surprise, when another increase in load happened, performance issues happened once again.
So now the fix should be autoscaling right? No, they are still reluctant to do so. Instead, they added a fixed number of application servers. Surely that costs money, and increases our costs at quieter periods. I suppose I don’t know all the details but it seems risky to choose a set number and hope that the load never exceeds that.
On Viva Engage, a manager posted a positive message stating that the AWS migration was a big success:
“I am thrilled to announce that we have completed the migration to AWS!
This is a major milestone for our cloud migration programme and has involved many team members across multiple disciplines working together.
We have migrated a whopping 505 TB of data across 4178 databases and have stood up over 1,080 application servers. There has been meticulous planning (over 130 steps in each migration), preparation and countless hours spent migrating our systems, including overnight and weekend working.
The impact of this collective effort extends beyond numbers and statistics. We have successfully improved stability and performance for our end users. The migration has enabled us to navigate the increased load challenges.”
Manager
Yet, someone shared this angry message from a customer. I’m not sure if the first sentence is sarcastic, or if they thought we had been somewhat supportive:
“Thank you for your support in what seems to be a “run” of problems for the business. After our awful experience in November when your software literally tipped over leaving us without a system, I did request that both the ombudsman and your company treated this as a significant event, looked into what went wrong and responded to me with an answer. To date I have not received any such feedback from either party.”
Sarcastic customer
I asked a Software Architect what he thought, since he is usually close to the gossip or involved directly.
The Chief of Smoke and Mirrors will have some explaining to do. Performance improved quite a bit as a result of the 64-bit work done behind the scenes (not to the client), but now users do things faster with longer sessions, and they have plenty of infrastructure issues around the AWS changes that caused a lot of customers problems. As always, one group of people fix certain things, while another group breaks lots of things at the same time.
Architect
So it sounds like there’s been some good work done, but also some mistakes made. Then internally, we are announcing it as a great success.
Someone also showed me this complaint where someone had visited a customer and reported back what they had witnessed:
“We visited a site yesterday displaying nearly all of the problems we have discussed to date – still having to reboot the software 10 to 15 times per day! System slowness (witnessed), documents not opening, closing when going into the orders module, first record load slow, changing an order – system crashes.”
Another reason for performance issues was due to missing config after the migration:
“some of our app servers are downloading/installing Windows Updates in core hours, which is causing poor performance for users.”
A simple workaround that sometimes happens is a “cache reset”. That sounds like a similar mindset to “turn it off and on again”, which does magically fix some problems. However, due to the migration, Support had got a bit confused about how to remote onto the new servers:
“cache resets were done on the wrong servers. ” –
Manager explaining why performance issues lingered for longer than expected.
Even after further tweaks to the cloud migration, updating the client to 64 bit, fixing SQL connections, and some other miscellaneous changes, the Support team were saying some sites were still having problems:
Can I confirm that things should be improving for all sites following all the brilliant work done? The customer is experiencing the below and I am visiting them tomorrow;
Customer issues
loading can take several minutes
Slowness and crashing every day, at least 9 or 10 times a day
No discernible pattern or time of day for slowness or crashing, and no particular machine is noticeably better or worse
Been experiencing performance issues for 2 years, but have gotten much worse recently (last 6 months)
experiencing significant delays when uploading records
Can take up to 1 hour to approve a small amount of external requests which can involve multiple restarts
Switching between records can lead to delays and ‘greyed out screen’ (not responding)
Constant and randomly crashes and needs restarting – staff having to partition out tasks such as viewing documents and approving tasks
Closing statement
It does seem like our performance issues are a bit of a mystery. I think we have run out of things to blame. Customer internet, SQL connections, 32 bit client, on-prem servers, caching bug. Hopefully one day we will have a fast system.
Over the last few years, my employer has gone Cloud crazy. We are a large company so we have our own data centres. These are costly to run when you need physical space, staff, electricity, software licensing, and a plan of action when things go wrong.
I wonder if it is better to have your own servers when you are a big company. I always think Cloud is best for smaller companies that don’t have the resources to host it themselves.
“Our reasons for using the cloud are the same as others using the cloud.”
Our CTO
Not really true though is it? From what I saw quoted for the virtual machines for our test systems, I think Cloud is more expensive over time. On-prem has a massive up-front cost which is what they don’t like, but we have the capital to do it, unlike small companies that the Cloud is perfect for.
The recent drive to move away from our data centres came about because we needed to replace some old hardware and perform SQL Server upgrades.
I could imagine us moving to the cloud, managers then panicking when they see the monthly costs, then demanding we go back.
One aspect of the SQL Server upgrade meant they needed to migrate the data to a new physical server. One of the tables they were concerned about was Audit, which adds a new row every time a user edits a record – around 9 billion records, they stated. A copy of the changed data is saved as XML so you can do a before/after comparison, so that particular column is a problem.
So for the data that would remain in our data centres and be moved to a new server with a modern SQL Server version, the plan was to migrate the table without the XML column. Instead, a new boolean (true/false) column was added to state whether detail data exists, and the XML itself is moved to the cloud.
So now we are paying to host the database on our own data centre, but then have certain data in AWS which sounds like it should be more expensive. The justification is that we didn’t need to buy as much hard disk storage which they reckoned could have cost a massive £500k! Then it would mean the migration to the new server in the data centre was faster.
Still, we needed to transfer the data to the AWS Cloud storage. I think the idea was that Audit data isn’t accessed much, so it’s better to move it to a cheaper but slower storage method, then request it on demand. So in our software, instead of displaying the data instantly when you view that record, there would be a “view more detail” button, and only then do we request it and show it.
I think the mindset is just to focus on the cost figures that are immediately apparent. £500k sounds like a crazy figure, but if we look at the cost over a few years, does storing it on our own servers really work out cheaper than paying Amazon to store it?
A new corporate buzzword that gets thrown around in this subject is FinOps, as in Financial Operations.
One of the challenges we have when we start to build a new service is around estimating the potential cost of that new service in AWS. This ultimately goes towards setting the budget expectation for that service and therefore how we monitor it from a FinOps perspective. Do we have any experience within the department or anything we can leverage to help us get better at understanding the potential budget expectations for a new service we’re building?
Concerned staff member
In one of the recent “Town hall” meetings, the CEO was ranting about how high our cloud costs were. He said we currently had £250k in AWS servers that are switched off (not sure if that was a yearly figure, or even more unbelievable; monthly). These were servers just for development/testing. If our testing teams are spending £250k on servers we aren’t really using, how much are we spending on ones we are actively using? Then how much does our live system cost?
Now when you see those figures, that £500k hard disk storage doesn’t sound too bad.
“FYI – Stopped instances don’t incur charges, but Elastic IP addresses or EBS volumes attached to those instances do.”
When people start talking about the cloud, they quickly start dropping in jargon terms. Sometimes they use multiple terms in the same sentence, and it quickly becomes hard to follow if you aren’t familiar with the cloud providers. Even if you are familiar with one provider, the others use different terms for their equivalent services. I think AWS is particularly bad, with naming that often isn’t intuitive. So when people start talking about Elastic Beanstalk, Route 53 and Redshift, it’s hard to grasp what the hell they are talking about.
Here’s an example of equivalent services by four different cloud providers.
Unit testing is a software testing technique where individual components of a program are tested in isolation. These components or “units” can be functions, methods, or classes.
When implemented correctly, Unit testing is a crucial practice in modern software development. It helps ensure the quality, reliability, and maintainability of your code. By incorporating unit testing into your development workflow, you can catch bugs early, improve code quality, and ultimately deliver better software to your users.
When I first heard about unit tests, they seemed awesome. But the more I used them, the more my opinion of them declined. I find it quite hard to explain why, though.
I think in general, to make things testable, you have to split the logic up into smaller methods. But then when they are smaller, A) they are easier to understand and B) they are less likely to change. So if a developer has looked at that code, what is the chance they are gonna change it and break it? If you have a unit test and it never fails in the software’s lifetime, has it provided any benefit?
Then, if you do decide to change the behaviour, you have the overhead of rewriting all the unit tests, which can basically double the development time.
There are certain scenarios where manual testing would take ages, and that’s where unit tests are very beneficial. When logic has loads of permutations or optional aspects, it is a prime candidate for unit tests. Without them, retesting every time you make a simple change is incredibly tedious; with them, you just click a button and wait a few seconds.
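As a made-up illustration of the permutation point (none of this is our real code), a parameterised test covers every combination in one quick run:
using Xunit;

public static class DeliveryPricing
{
    // Hypothetical pricing rule with several optional flags.
    public static double Calculate(double basePrice, bool express, bool fragile, bool international)
    {
        var price = basePrice;
        if (express) price += 5;
        if (fragile) price += 2.5;
        if (international) price *= 2;
        return price;
    }
}

public class DeliveryPricingTests
{
    // Each InlineData row is one permutation; re-running all of them after a change
    // takes seconds, instead of a tedious manual retest of every combination.
    [Theory]
    [InlineData(10.0, false, false, false, 10.0)]
    [InlineData(10.0, true, false, false, 15.0)]
    [InlineData(10.0, false, true, false, 12.5)]
    [InlineData(10.0, true, true, true, 35.0)]
    public void Calculate_CoversPermutations(double basePrice, bool express, bool fragile, bool international, double expected)
    {
        Assert.Equal(expected, DeliveryPricing.Calculate(basePrice, express, fragile, international));
    }
}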
Unit tests give you confidence that you can refactor without risk. However, they are not automatically a silver bullet. Well-written, fast, reliable tests accelerate development and save time. Poorly written, slow, flaky tests hinder development and waste time.
A test that takes a second to run doesn’t sound slow, but what if you have hundreds or thousands of tests? If the tests take a long time to run, the developers won’t run them as often, or at all, then what value do they serve?
They should also run on the build to ensure only quality releases actually go live, but you want your release process to be fast too.
There was a recent change where the developer claimed to have sped up a long-running call; however, he hadn’t carried that performance mindset over to the tests, and had actually increased the time to run them by 6 seconds.
Thread.Sleep can be used in threaded code to intentionally introduce a delay. I’ve seen many developers add it to a unit test. Tests are supposed to be fast, so you should never put a Thread.Sleep in a unit test.
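A small sketch of the alternative, using a made-up CacheWarmer class: if the work is exposed as a Task, the test can await it instead of sleeping and guessing how long it takes.
using System.Threading.Tasks;
using Xunit;

// Hypothetical class that does some work asynchronously.
public class CacheWarmer
{
    public bool IsWarm { get; private set; }

    public async Task WarmUpAsync()
    {
        await Task.Yield();   // stand-in for real async work
        IsWarm = true;
    }
}

public class CacheWarmerTests
{
    [Fact]
    public async Task WarmUp_SetsIsWarm()
    {
        var warmer = new CacheWarmer();

        // Bad: kick the work off and guess how long it takes.
        //   _ = warmer.WarmUpAsync();
        //   Thread.Sleep(5000);   // 5 wasted seconds on every run, and still unreliable

        // Better: await the work, so the test is only as slow as the work itself.
        await warmer.WarmUpAsync();

        Assert.True(warmer.IsWarm);
    }
}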
Measuring Tests & ExcludeFromCodeCoverage
When people write unit tests, they want to try to understand how much of their code is covered by tests. We have this metric of Code Coverage but it has some severe limitations in the way that it is measured. It’s often a simple metric of “does the line get hit by at least one test”, but since methods can be executed with different combinations of variables, you can end up having 100% statement coverage but without actually testing many combinations at all.
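A made-up illustration of that limitation: one test executes every line, so coverage reports 100%, yet the combination that would expose a bug is never exercised.
using Xunit;

public static class RecordPolicy
{
    public static bool CanEdit(bool isAdmin, bool isLocked)
    {
        // Suppose the intended rule is that nobody can edit a locked record.
        // This implementation gets that wrong for admins.
        return isAdmin || !isLocked;
    }
}

public class RecordPolicyTests
{
    [Fact]
    public void Admin_CanEdit_UnlockedRecord()
    {
        // This single test executes every line, so statement coverage reports 100%,
        // but the (isAdmin: true, isLocked: true) combination is never tested.
        Assert.True(RecordPolicy.CanEdit(isAdmin: true, isLocked: false));
    }
}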
The metric is one that impresses managers, so you often see developers writing bad tests simply to game it. This is dangerous because you end up believing your code changes haven’t caused any bugs, when they could have introduced something severe that the inadequate tests never exercised.
I’ve seen quite a few code changes purely to increase the code coverage. So the title of the change would be like:
“Added more code coverage”
Then when I check the build output:
“There might be failed tests”
How can you add more tests and then not actually run them before submitting the change for review? Madness. The explanation is that their focus is purely on coverage and not on quality. Maybe a bit of arrogance and laziness too.
This week I worked with a team to get code coverage over 80% (a corporate minimum). The problem with this effort: Code coverage can be gamed. Sure, low code coverage means there’s a lot of untested code. But, high code coverage doesn’t mean the code is well tested.
Cory House
You can add the ExcludeFromCodeCoverage attribute to your code, which tells the coverage tool to ignore it. It’s a simple way of reducing the number of lines that are flagged as untested.
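For example (the class here is made up; the attribute itself lives in System.Diagnostics.CodeAnalysis and can be applied to a class, method or property):
using System.Diagnostics.CodeAnalysis;

// Everything in this class is now skipped by the coverage tools, so the headline
// number goes up without a single extra test being written.
[ExcludeFromCodeCoverage]
public class LegacyReportExporter
{
    public string Export() => "...";
}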
Here’s one of our principal developer’s opinion on this attribute:
“Using ExcludeFromCodeCoverage is only good when the goal is 100% coverage. That should never be the goal. The goal should be a test suite that prevents bugs from ever going live. I’m happy never using it and just having coverage reports flag things that are not actually covered, it is a more realistic representation of what the tests cover and makes me much more cautious about changing them as I know I don’t have test coverage. Never add Exclude from Code Coverage, it’s just lying to everyone. Why suppress things that might be a problem, if they are, we need to fix them.”
Principal Developer
Personally, I think adding suppressions/attributes just clutters the code base. I’d rather just treat the stats as relative to each release. The numbers have gone up/down, but why? If we can justify them, then it’s all good. Chasing 0 code smells and a specific test coverage means you can just cheat and add the likes of ExcludeFromCodeCoverage to meet such metrics.
Another developer said:
I value a holistic set of metrics that help us understand quality in software development. Code coverage is a single metric that can be part of that set of metrics you monitor. No single metric can stand by itself, and be meaningful. Nothing is perfect, which is why we should value a toolbox. I don’t believe in gaming the system and “hiding” uncovered code to get to 100%.
You need engineering teams who are prepared and confident enough to publicly share their coverage reports. This sets the tone of the culture. Context is needed, always. There will be reasons why the coverage is as it is. Use tools that help engineering teams with confidence/delivering at pace and ultimately delivering customer satisfaction. You cannot compare reports from different teams or projects.
Useful Tests
You need to make sure your tests actually test some logic. Sometimes people end up writing tests that really just test the programming language itself, and I suspect it is purely so the Code Coverage metric is fooled. Code Coverage checks whether lines of code are “covered” by tests, but the simplistic nature of the check only ensures that a line of code was executed while a test was running, not that anything meaningful was tested.
So for example:
[Fact]
public void DefaultConstructor_ReturnsInstance()
{
    var redisMgr = new RedisStateManager();
    Assert.NotNull(redisMgr);
}
So there you are instantiating an object then checking it is not null. That’s just how objects work in C#: you instantiate an object, and then you have an object. I suppose the constructor could throw an exception so the object is never created, but throwing from a constructor is generally considered bad practice, and there was no other test checking for that situation anyway, so they haven’t tested all scenarios.
Another variant is setting a property and then checking it is set. Unless the property contains loads of logic (which you could argue is bad design anyway), checking a value you just set is really testing the .NET framework. If you think you need this, it means you don’t trust the fundamental features of the programming language you are using. You are supposed to be testing the logic of your code, not the programming language.
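Something like this made-up example:
using Xunit;

public class Person
{
    public string Name { get; set; }
}

public class PersonTests
{
    [Fact]
    public void Name_WhenSet_ReturnsSameValue()
    {
        var person = new Person { Name = "Test" };

        // This only proves that C# auto-properties work; none of our own logic is exercised.
        Assert.Equal("Test", person.Name);
    }
}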
If there’s lots of setup but the Assert is just checking for null, then it’s likely there purely to fudge the code coverage. Another classic I’ve seen is loads of setup that ends with:
Assert.IsTrue(true);
So as long as the test didn’t throw an exception along the way, then it would just always pass because True is definitely equal to True.
Those ones seem intentionally malicious to me, but maybe the following example is more of a case of a clear typo:
Assert.Same(returnTrigger, returnTrigger);
Whereas this following one looks like a typo, but it’s actually two different variables. Need to look closely (one is a single S in Transmission). 🧐
What goes through people’s heads? How can you write code like that and carry on like nothing is weird.
Sometimes tests look a bit more complicated but on analysis they still don’t really test much:
[CollectionDefinition(nameof(LoggerProviderTests), DisableParallelization = true)]
public class LoggerProviderTests : IDisposable
{
    [Theory]
    [InlineData("Verbose")]
    [InlineData("Debug")]
    [InlineData("Fatal")]
    [InlineData("Information")]
    [InlineData("InvalidLogLevel")] // Test with an invalid log level
    public void GetMinimumLevel_ReturnsCorrectLogLevel(string logLevelSetting)
    {
        // Arrange
        System.Configuration.ConfigurationManager.AppSettings["DistributedState.LogLevel"] = logLevelSetting;

        var firstInstance = LoggerProvider.Instance;
        var secondInstance = LoggerProvider.Instance;

        // Assert
        Assert.Same(firstInstance, secondInstance);
    }
}
So this sets a setting on AppSettings, presumably used by the “LoggerProvider”. However, all they are doing is testing that if you call the Instance property twice, it returns the same object both times. So the setting of the different log levels is completely irrelevant. I mean, the log level could be completely wrong but you are comparing ‘is the wrong value of A the same as the wrong value of B’; and it will still pass.
Another common problem is with mocking libraries like Moq, which let you create objects and essentially say “when this code is called with these specific parameters, give me this value back”. The trouble is that when developers make the mock the actual thing they are testing, they are testing Moq, not their own logic.
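As a sketch of that pattern (the names are hypothetical, loosely based on the JobsRepo case discussed next):
using Moq;
using Xunit;

public interface IJobsRepository
{
    Job GetById(int id);
}

public class Job { public int Id { get; set; } }

public class JobsService
{
    private readonly IJobsRepository _repo;
    public JobsService(IJobsRepository repo) => _repo = repo;
    public Job GetById(int id) => _repo.GetById(id);
}

public class JobsServiceTests
{
    [Fact]
    public void GetById_ReturnsJob()
    {
        var job = new Job { Id = 1 };
        var repo = new Mock<IJobsRepository>();
        repo.Setup(r => r.GetById(1)).Returns(job);

        var service = new JobsService(repo.Object);

        // All this asserts is that Moq returned what we told it to return one line earlier.
        Assert.Same(job, service.GetById(1));
    }
}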
“I think all this test is doing – is testing that JobsRepo returns an object that was passed into the constructor on line 22. The GetById is redundant, it will always work if it returns that object because the Moq was configured to return that value. That is testing Moq, and not our code. But then if you are just asserting a property returns an object, you are just testing that C# properties work.”
Me
“yes you are right , I am just testing if JobsRepo could return a value, so that it helps me in code coverage for get functionality of JobsRepo , as it is just set in the constructor of the class and there is no call for get”
Developer who wrote bad tests
So I think they are saying “I am just fudging the coverage”. Checks it in anyway.
There’s been loads of tests where you could actually cut out large parts of the method they are testing and the tests still pass. Again, sometimes you point this out to developers and they still want to check it in, purely for the statistics, and not for any benefit to any developer.
“do these tests add value? a quick glance suggests this is very dependent on your mock object. It might be the case that the production code can be changed without breaking these tests.”
Me
yeah, they’re kind of meaningless. Merging code to use as templates for future, better, tests.
Developer who wrote bad tests
Here is a rant I left on a similar review:
This name implies that it should always be disabled, especially because there’s no coverage for the case where it is true. However, these tests aren’t really testing anything. I think you’re basically testing that Moq works and the default boolean is false. I think the most you can really do is call Verify on Moq to ensure the correct parameters are passed into the GetBool call.
If you replace the contents of IsRequestingFeatureEnabledForOrganisation with return false, your tests pass which illustrate the coverage isn’t complete, or you aren’t guaranteeing the configuration manager code is even called at all. Personally, I don’t think it is worth testing at all though. All your class does is call the FeatureDetails class so you aren’t testing any logic on your side.
I think people are too concerned about getting code coverage up, so they insist on writing tests even if it makes things more confusing.
I suppose it is up to you and your team to decide what you want to do, but I occasionally question people just to make them think about whether it is actually adding any value. I’ve seen tests where they simply assert that an object is not null, but the method could literally return an object with all the wrong values and still pass (and it always returned an object anyway, so the test could never fail). If you see a method has tests, it gives you a false sense of security that they will catch any mistake you make, when really they just always pass anyway.
always think if your tests will add value and if it’s worth adding them. If you need to mock everything then they’re not very valuable, or you’re testing at the wrong level (too high), and you’re better off with integration tests than unit tests. 100% code coverage is a really bad idea for complex software – massive diminishing returns in value the higher you try to push it. We change stuff all the time in our software too, so if everything has high-level unit tests then you spend more time fixing those tests. I tend to find you spend ages writing tests, then if you change the implementation you have to change the tests, and you can’t run them to see if you broke anything because you had to change the test to run it.
Me
Test Driven Development (TDD)
There’s a methodology called Test Driven Development where you write a test first. It will then fail if you run it because there’s no functionality to run. Then you write the implementation and get it to pass. Then move onto writing the next test, and repeat. So you build up your suite of tests and get feedback if your new changes have broken previous logic you wrote.
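As a made-up miniature of that cycle (the Basket class is purely illustrative):
using System.Collections.Generic;
using System.Linq;
using Xunit;

// 1. Red – write the test first; it fails (or doesn't compile) because Basket doesn't exist yet.
public class BasketTests
{
    [Fact]
    public void Total_SumsItemPrices()
    {
        var basket = new Basket();
        basket.Add(2.50m);
        basket.Add(1.25m);
        Assert.Equal(3.75m, basket.Total);
    }
}

// 2. Green – write the simplest implementation that makes the test pass.
public class Basket
{
    private readonly List<decimal> _prices = new List<decimal>();
    public void Add(decimal price) => _prices.Add(price);
    public decimal Total => _prices.Sum();
}

// 3. Refactor, then repeat with the next test (discounts, an empty basket, and so on).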
I was recently listening to a podcast and the guest said that he always writes code first, then adds tests after. If he can’t write tests, he will make a change to the code just to enable the tests. I wasn’t sure what he meant by this – maybe it’s like when people add a new constructor that is only ever called by the tests. But that’s bad design.
I thought he may as well just do TDD from the start, instead of always going through that struggle. He says TDD doesn’t often lead to good design because you aren’t thinking about design, you just think of how to make the tests pass.
But doesn’t the design organically come from TDD? And his way of changing the design just for the tests is exactly what he argues against TDD for. TDD often slightly over-engineers the solution with the likes of interfaces; but by avoiding TDD and writing the tests afterwards, his way adds “Technical Debt” via extra constructors that are only used by the tests.
“I’ll add tests in a separate change later”.
5 reasons to add tests before merge:
1. Clear memory: Before merge, everything is fresh in my mind. I know what the code is supposed to do, because I wrote it. So I also know what tests I should write to assure it works. Every minute that passes after merge, I will understand the feature less, and thus, be less equipped to add proper test coverage.
2. More effective reviews: If I write the tests before merge, then anyone reviewing my code can use my tests to help them understand the code, and to watch the feature run.
3. Faster development: If I write tests during development, I can use the tests to accelerate my development. I can “lean” on my tests as I refactor. Faster feedback loops = faster development.
4. Better design: Writing tests during dev encourages me to write code that is testable. It makes me consider accessibility too since that tends to make automated testing easier by providing well-labeled targets.
5. Changing priorities: After merge, there’s no guarantee that I’ll have time to write the tests at all. I may get pulled away for other more “urgent” tasks.
Bottom line: The proper time to add tests is *before* merge.
Cory House
I recently saw the following conversation. A developer was basically saying he didn’t have time to write the tests, and that adding them might require some drastic refactoring, which would be risky. So the plan was to rely on manual testers and get the changes released. His suggestion was that he would then add the tests with good coverage for the next release – the part that probably won’t happen, because important features will be prioritised.
Senior Developer: This domain supports unit testing, you should be able to add tests to cover the changes you made to make sure it behaves as you expect
Developer: Currently there are no unit tests for the changed class, and the class is tightly coupled. I have written some draft tests and will check them in next month as a priority.
Architect: IMO, given the pressure, timescales and urge to complete this, I think we can defer for now and stress to the testers that they should pay more attention to the areas where we have low confidence.
Senior Developer: So instead of checking if it is correct by adding tests that we can be sure exercise the code changes, we just merge it and hope that the manual testers find any bugs over the next day or so, and if they do, then it is back to the dev team and another change?
Time In Unit Tests
Tests should be deterministic. If a test is run and passes, then if no changes have been made and we run it again, it should also pass (obviously). An unreliable test doesn’t give you confidence in the code changes you make. It’s surprisingly common to make a change, see an unrelated test break, and think “how can those changes break that test?”. Then you look at what it is doing, and it’s often something to do with time.
You see something like the data is "BirthDate":"1957-01-15T00:00:00", and the test result says: Expected "Age":"67y", Actual "Age":"68y". Today is their birthday!
What you need to do is put a “wrapper” around the code that gets the current date. So instead of calling DateTime.Now directly, you create a class called something like DateTimeProvider, and in production that class returns DateTime.Now. Then in your unit tests, you create a MockDateTimeProvider and make it return a hard-coded date. That way, no matter when you run the test, it always uses the same date, and the test is deterministic.
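As a rough sketch of the pattern (the class names here are illustrative, not necessarily what our codebase calls them):
using System;

public interface IDateTimeProvider
{
    DateTime Now { get; }
}

// Production implementation: the only place DateTime.Now should live.
public class SystemDateTimeProvider : IDateTimeProvider
{
    public DateTime Now => DateTime.Now;
}

// Test double: always returns the same instant, so the result never depends on
// which day (or hour) the test happens to run.
public class FixedDateTimeProvider : IDateTimeProvider
{
    private readonly DateTime _fixedNow;
    public FixedDateTimeProvider(DateTime fixedNow) => _fixedNow = fixedNow;
    public DateTime Now => _fixedNow;
}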
I recently fixed some tests that were failing between 9pm-12am. I found that a developer had changed the MockDateTimeProvider to return DateTime.Now, completely rendering it pointless. Other parts of the test were adding 3 hours to the current time, and because 9pm+3 hours is tomorrow’s date, the date comparison it was doing then failed.
public class MockDateTimeProvider : IDateTimeProvider
{
    public DateTime Now { get { return DateTime.Now; } }
}
I think another red flag in unit tests is conditional statements. Logic should be in your production code, not in your tests. Not only does the following code have a DateTime.Now in it, they have also put a conditional if statement in there, so whenever one branch would fail, the other branch executes instead and passes. So the test can probably never fail.
[Fact]
public void ExpiryDateTest()
{
    DateTime? expiryDate = (DateTime?)Convert.ToDateTime("12-Dec-2012");

    _manageSpecialNoteViewModel = new ManageSpecialNoteViewModel(_mockApplicationContext.Object);
    _manageSpecialNoteViewModel.ExpiryDate = Convert.ToDateTime(expiryDate);

    if (_manageSpecialNoteViewModel.ExpiryDate < DateTime.Now.Date)
        Assert.True(_manageSpecialNoteViewModel.IsValid());
    else
        Assert.False(_manageSpecialNoteViewModel.IsValid());
}
Other Bad Unit Tests
Maybe the most obvious red flag, even to non-programmers – is testing that the feature is broken. The developer has left a code comment to say it looks wrong!
Assert.Equal("0", fileRecordResponse.Outcome); // I would have thought this should have been -1
The One Line Test
How do you even read this? Is that actually one line? 🤔🧐
_scheduledJobsRepo.Setup(r => r.GetAllAsNoTracking(It.IsAny<Expression<Func<ScheduledJob, bool>>>(),
It.IsAny<Func<IQueryable<ScheduledJob>, IOrderedQueryable<ScheduledJob>>>(),
It.IsAny<int>(),
It.IsAny<int>(),
It.IsAny<Expression<Func<ScheduledJob, object>>>()))
.Returns((Expression<Func<ScheduledJob, bool>> expression,
Func<IQueryable<ScheduledJob>, IOrderedQueryable<ScheduledJob>> orderBy,
int page, int pageSize,
Expression<Func<ScheduledJob, object>>[] includeProperties)
=>
{
var result = _scheduledJobs.AsQueryable();
if (expression != null)
{
result = result.Where(expression);
}
result = orderBy(result);
result = result.Skip(page * pageSize).Take(pageSize);
return result;
});
When it is that hard to read, I wonder how long it took to write it.
Other Common Mistakes
I think tests can be unclear when people use a unit testing library without understanding what features are available. For example, instead of using the built-in ExpectedException-style check, they come up with some convoluted solution like a try/catch block that then flags the test as passed or failed.
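Our tests appear to use xUnit (hence all the [Fact] and [Theory] attributes), where the idiomatic way is Assert.Throws. A made-up Account class shows the difference:
using System;
using Xunit;

// Hypothetical class, just to have something that throws.
public class Account
{
    private decimal _balance;
    public Account(decimal balance) => _balance = balance;

    public void Withdraw(decimal amount)
    {
        if (amount > _balance)
            throw new InvalidOperationException("Insufficient funds");
        _balance -= amount;
    }
}

public class AccountTests
{
    [Fact]
    public void Withdraw_MoreThanBalance_Throws()
    {
        var account = new Account(balance: 10m);

        // Convoluted version sometimes seen:
        //   try { account.Withdraw(20m); Assert.True(false, "expected an exception"); }
        //   catch (InvalidOperationException) { }

        // Idiomatic xUnit version:
        Assert.Throws<InvalidOperationException>(() => account.Withdraw(20m));
    }
}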
Naming tests can be tricky: the name needs to make clear what the test does and differentiate it from other tests. The worst is when the name says something completely different from what the test actually does, most likely from a copy-and-paste mistake.
I’ve talked about how using DateTime can make tests fail at certain times. You can also end up with tests that rely on shared state, where the order you run them in causes failures because one test expects data to be set (or not set) by another.
Bank
Tests are even more important in certain domains. You know, like when money is involved. Commercial Bank Of Ethiopia allowed customers to withdraw more cash than they had.
“I’m really interested how a bank managed to deploy code that didn’t have tests for “can a user withdraw money when they don’t have enough balance.” Development teams at banks are usually conservative, process-heavy and slow-moving with changes exactly to avoid this. Wow”
Conclusion
Unit tests can be a useful and helpful tool to developers. However, there is an art to writing them and they have to be written with good intentions. If they aren’t written to be useful, fast, and reliable, then developers either won’t run them or won’t trust them.
One of the latest buzzwords to be thrown around is “Customer experience”. My understanding is that it’s a focus on customer interactions, from awareness of the product to purchase. This covers brand perception, sales process, and customer service.
Customer Experience is shortened to the acronym CX, because using the letter X is always cooler. For some reason, we went a bit further and put a hyphen in there for good measure; “C-X Experience Centre”.
The weird thing is that it kinda looks like a letter is missing, as if you are supposed to pronounce it like “SEX”, and a Sex Experience Centre is a different thing entirely. Does it even make sense, or sound sensible, to call it the Customer Experience Experience Centre?
“The Customer-Xcellence Programme is all about putting our customers and users at the heart of everything we do. It directly supports our strategic priority of delighting our customers and partners. But we can only do that if we really put ourselves in their shoes and truly understand what day-to-day working life is like for them. By doing so, we can ensure the products and solutions we design, enhance and implement are directly informed by their experiences.”
We lost even more office space to create this C-X Experience Centre. Since most of us work from home, they made the desk space more spacious for those who do go into the office, then over time have reassigned meeting rooms to nonsense like this.
To make it more pretentious, we invited a local politician for the grand opening.
“The C-X experience Centre is a real gamechanger in how we immerse ourselves in the experiences of our customers and users.”
I think all it is is a few computers in a room decorated to look like a customer’s office.
“This will help everyone learn about the challenges our customers and users face, and how our solutions help them provide a better service.”
As well as showcasing our solutions to customers and key stakeholders, it will be used for:
onboarding new starters
supporting sales enablement training
launching and testing new solutions and products
“Thank you to the whole Customer-Xcellence team for turning this vision into reality – it will make such a difference in how we understand our customer’s and user’s challenges.”