How (Not) To Split An API

I’m a software developer that only really has experience on desktop apps, and was recently put on a project to make an API. I had an interesting debate with the Tech Lead of the project about where we should put our code, and how we should view its responsibilities.

To make it more anonymous, I’ll change the functionality slightly but it’s the same idea.

An Orders API had been in development for a while, and my team needed to add functionality to send some data to a government API, so let’s say it was for ID verification. Even though our initial requirements are that only the OrdersAPI will use the VerifyAPI, you could argue that in future, maybe in future, other applications we have made, or third parties could call this VerifyAPI directly.

There’s a famous idea in software development; YAGNI; You Ain’t Gonna Need It. Which is the idea that you should program to requirements and not speculative “what if” scenarios.

The Tech Lead argued that we should put our code in a new repository because it was a separate API. I said that that adds loads of overhead because we will need to write code in OrdersAPI to call our code, then will need to add a reference to our VerifyAPI using a Nuget package. This will slow down development as you need to update 2 repositories, need some temporary reference as you develop, create multiple “Pull Requests”, then need to publish the Nuget package and update the references once more. I stated this was gonna be a huge inconvenience if the project ends up running over the year.

I also called YAGNI on that we will probably never use the API for anything other than OrdersAPI so it should just go in the same repository. In the event where I am wrong, it should be fairly easy to move it out as long as we just use separate projects to keep our code separate.

He insisted on doing things his way, but the thing is, even though we had a separate repository, it wasn’t a separate API. It was more like a code library. So several months later, he was asking managers if we can create a “mini project” to turn it into an API for clearer separation.

So it seems like we had 2 opposing viewpoints but ended up somewhere in between with all the disadvantages.

Another interesting debate I had seemed to illustrate his confused view of what our code is. He has always viewed our code as intending to be an API, but I was changing some error messages and he said my messages were misleading because our repository is not an API!

The confusion seemed to be him saying the “client” is the OrdersAPI, but I see the user of our software as the client, the OrdersAPI is the server call, and it doesn’t matter where it goes next

The message was something like. “Field ‘Date of Birth’ is missing”. He didn’t like the word “field”

Tech Lead
"I'd change the wording on these messages. We're no longer talking about "fields" since we've split the API request out."

Me
“does it matter where our code is? it's part of the same request as far as the client is concerned”
 
Tech Lead
"fields" just sounds like API talk
 
Me
but the client has made an API call
 
Tech Lead
the client hasn't made an API call though
if those prerequisite checks fail then no API has ever been involved
and even if it has, why would the client need to know anything about an API?
are you talking about the browser client?
 
Me
isn't it
client -> OrdersAPI -> our library {fail checks} -> error status to the client
 
Tech Lead
sorry i thought you were referring to the OrdersAPI as the client in this context
which it is
our package shouldn't know that it's being used in an API., that's the whole point of this change
 
Me
it's a black box for the caller. The user shouldn't know that it's using a library. The code could all the be in the same place as far as it is concerned

Then after more discussion, he is adamant that something could use our library in future so then there’s 2 usages, an API and non-API. So it cannot have API related stuff in it.

But our intention was to have a separate space for our team to maintain, we have never discussed it being used by anything other than the API. The early discussions was to have our repo that was an API.

Daniel
tbh I don't immediately think API when I see "field" I think it's fine
 
Me
he did say the message could just be
"‘Date of Birth’ is missing"
Which might be better but then wouldn't you want all the messages to be consistent. However, I guess someone could update the OrdersAPI repo with a new message style, and then forget about ours.
 
Daniel
you make a good point about consistency though, the API should be consistent regardless of where the code lives

It’s a really trivial argument, but I think this is just the beginning of many debates. Sometimes I think we like adding loads of complexity early on then work doesn’t get done.

Experimentation vs Being Safe

When it comes to software development, often you can play it safe using technology you already know, or be more adventurous and use something new. I think the trick is to research the pros/cons of the language and make sure it is suitable for your approach.

There’s no point thinking something is cool and therefore using it – when it might not be the correct programming language to use. An entire team investing time learning something new can be a complete waste of time if the project is then cancelled/restarted due to heading the wrong direction.

A rule a thumb when choosing technologies:

For an experiment? be as weird as possible.
For production? be as boring as possible.

When it comes to maintenance, sometimes you end up in situations where someone is the “Expert” and therefore has to fix any issues themselves, or will be approached for help by another developer. Therefore, if you write something crazy for production, it will be you that maintains it, either directly or indirectly.

Sometimes becoming the expert in something is the way to get promoted or pay rises though, since you become the super important developer that the company can’t afford to let go. However, that also means you will be stuck on this part of the software, and can’t really move on to different projects.

If you do become one of these experts, and if you want to move on to a new project; you need to train a replacement up. Can you find a replacement that wants to take over, knowing that they will be stuck with this special project? How long will it take to train a replacement? How much documentation did you write?

Software Naming

There was an internal meeting where a new product called “Recruit” was announced. The first question was that “it sounds like it could be confused with a method of recruiting staff to work for us, so was that discussed?”

The manager said “to be honest, I never considered that“.

He then added there were around 20 people who were in the meetings, and no one had questioned it, or raised any objections.

A few months prior, there was an announcement about a new team that was handling branding in Marketing. We were told we couldn’t create any names without going via them. The last product names they came up with were ASSistant, and ANALytics.

I thought that if the software isn’t well received, it could easily gain a negative nickname, and people could make statements like “the software is ass”.

A Product Owner recently stated that the Assistant branding will soon be phased out, and it will just be merged into our main software’s branding. The decision came about when another Product Owner was doing a demo and had created a test user with the name “ass”. A manager flagged it as unprofessional and was concerned that we could easily demo something like that to external clients.

“you probably want to change those Ass users”
Manager

So far, the marketing naming team hasn’t got a good track record.

Twitter -> X

Twitter was a really strong brand, so Elon’s instance to change it to X seemed like a baffling choice, and it seems one based on his fascination with the letter X rather than any business reason.

As a…
– user
I want…
– a strategically-timed, comprehensive rebrand
so that…
– negative discourse regarding company practices can be briefly veiled with commentary regarding new name, logo, and brand assets, alongside excessively-shared parodies of the brand in meme format pic.twitter.com/5OuZLtg9pg
— Shit User Story (@ShitUserStory) October 21, 2021

As a…
– user
I want…
– a strategically-timed, comprehensive rebrand
so that…
– negative discourse regarding company practices can be briefly veiled with commentary regarding new name, logo, and brand assets, alongside excessively-shared parodies of the brand in meme format

1/

x is an awesome letter, one of the coolest letters in fact! I just turned 8 years old and I think the letter x is the most coolest thing there is. anyway, my dad made me chicken nuggets so I can’t finish the thread right now
— Drew Gooden (@drewisgooden) July 23, 2023

here’s why everyone is WRONG about twitter changing its name to x being “a baffling choice” and “not a smart thing to do”: a thread

1/ x is an awesome letter, one of the coolest letters in fact! I just turned 8 years old and I think the letter x is the most coolest thing there is. anyway, my dad made me chicken nuggets so I can’t finish the thread right now

All jokes aside, I did find a serious thread that was actually useful in detail the reason for X:

X became part of PayPal.

And PayPal was acquired by eBay.

Musk walked away with $165 million.

He then started building again…

…and the letter X would reappear. pic.twitter.com/dnRZKvipih
— Jon Erlichman (@JonErlichman) July 23, 2023

The X era has begun at Twitter Elon Musk says the bird logo will be replaced by an X. For Musk, it’s the continuation of a story that began 25 years ago. Here’s a brief history on that...

Let’s go back to 1999. After selling his first company… …Musk set out to disrupt banking. And the X identity was born. But its time as a standalone brand would be short-lived. Instead, X became part of another brand.

X became part of PayPal. And PayPal was acquired by eBay. Musk walked away with $165 million. He then started building again… …and the letter X would reappear.

Musk would launch Space Exploration Technologies Corporation. It would become known as SpaceX.

Worth noting… SpaceX’s logo has a hidden message. The X symbolizes a rocket’s trajectory.

Musk also used “X” at Tesla. It’s the name of Tesla’s third model. Musk envisioned a car lineup with models that spelled out “sexy.” There’s a Model S. An X. And a Y. Ford had the rights to the Model E. So Musk instead opted for 3, which is basically a backwards E.

X is even the name of Musk’s young son.

Meanwhile, you likely saw all of the headlines leading up to this development… …from rebranding Twitter as X Corp… …to the recent launch of xAI. In other words, wheels were already in motion for the new era.

Our Datadog use

Introduction: Summary of previous blogs

Datadog is a monitoring tool my employer purchased licences for, and quickly became the cool thing to use and impress the senior managers with (see Datadog, and Datadog – The Smooth Out).

I discussed problems in both those blogs, but a concern with all metrics is;

What do you want to measure?
Who is viewing the data? And when?
What does “good” and “bad” look like, and who acts when that state is shown?

In “Datadog Knee Jerk“, I explained how our CTO and Technical Director demanded that everyone create a Datadog dashboard to monitor all services, regardless of what they are.

If we don’t have a clear idea of what to measure, who needs to view it, and how do they know it is good/bad; then aren’t we just throwing money away? (even the dashboard itself doesn’t cost, you still have the time to create one. Some dashboards would require additional logging to be effective though). Surely an obvious problem with wanting to monitor everything is that it can become quite costly when you look into Datadog’s pricing model.

Easy To Make Nonsense Dashboards

From my brief time making Datadog dashboards and analysing other teams’ dashboards, I realised that the data can often look wrong, and it’s really easy to misinterpret the metrics due to the jargon used, and when/how the data is actually collected.

“I know nothing about Datadog, yet have been told to make a dashboard”
Principal Tester

Surely the worst case is to make dashboards that show nonsense data. You will waste time investigating problems that don’t exist, or not be alerted to actual problems that happen. So once we create a dashboard, who checks that it is valid?

Look at this one that I saw:

This is supposed to be plotting a line (purple) for failures in the time frame specified, then another (blue) for “week_before“.

It looks immediately wrong at a single glance. If I have set the time-frame combo box to show the “previous month”, should week_before be last week, or should it be the week before last month? It seemed to be neither. Also, notice that the graph is exactly the same shape/numbers. It just seems to be plotting the exact same data but pretending it is a week later.

Jargon

You would think you just need some understanding of statistics to draw some charts, but in the usual nerd fashion, they throw around jargon to be cool. So people end up saying stuff like this:

What is datadog? Fundamentally, a platform like Datadog provides us with a scalable solution for ingesting observability data from our services. Datadog is built upon the three pillars of observability:
Metrics provide numerical measurements that allow us to assess our system performance and behaviour
Traces allow us to understand the flow of a request or transaction through our systems
Logs allow us to capture the details of system events and errors

When you read the official documentation, it’s difficult to understand what it actually can do. It’s the combination of jargon plus hyping up features to be powerful:

Datadog vector
Vector is a high-performance observability data pipeline that puts organizations in control of their observability data. Collect, transform, and route all your logs, metrics, and traces to any vendors you want today and any other vendors you may want tomorrow.

Imagine sending your metrics to vendors that you want in the future. They are like “mate, stop spamming us with your info, you aren’t our customer“.

Then you are given the implication that this is the ultimate solution that can somehow solve some of the major problems with our system:

Having access to this data provides us with opportunities to understand the inner workings of our complex and distributed systems in a way that we haven’t been able to before.
However, the data alone is limited in its usefulness, and it is the insights from this data that offer the greater value. Datadog provides the tooling to surface these insights in a way that enables proactive support and improvement of our systems.
DevOps Engineer

The bad thing about overhyping a tool like this is that you have to manage expectations and make it clear what the scope is, otherwise your interactions with managers is more difficult than it should be. One of the DevOps engineers made a vague statement like:

“Our dashboards monitor everything”

So they got a question from a manager “Can you tell me who uses our API?”

“no, our dashboards can’t see that”

What we have enabled so far:

Configured service metadata to populate service ownership details
Enabled traces
Enabled RUM (Real User Monitoring) traces to provide full end to end tracing
Improved our service & environment tagging
Enabled version tracking so that we can observe version related anomalies
Defined a baseline set of monitors to cover availability, anomalous throughput, errors, latency and infrastructure performance
Defined strict availability & latency SLOs
Implemented 24 SLOs & 264 monitors
Configured PagerDuty automatic incident creation and resolution
Enabled logging
Driven several key Information Governance decisions
Established a Data asset inventory to provide more objectivity as to what data can be stored in Datadog

Performance Issues

One problem with our system – is performance issues. Although we have blamed all kinds of things, performance issues still remain in general. There’s been claims that Datadog could help us diagnose where the performance issues are, but they have also increased network traffic and server resources; so that they have caused performance issues of their own!

DD agent is using a lot of resources on our test systems and looks to be causing performance issues, I have stopped the agent multiple times when testing as the CPU and memory usage is maxed out. This has been raised before.
Tester

Architect: 
Datadog seems to be showing memory usage on all app servers is high, wonder why?  

Me:
Does it only happen when Datadog is watching it?
We licence Datadog to prevent Major Incidents and performance issues…
Datadog causes Major Incidents and performance issues and tells us about it

Another aspect is that some things we wanted to measure required querying our SQL databases. To write an efficient SQL query, the columns you filter on need Indexes to be performant, but Indexes themselves take up space. Then we are always moaning about the cost of storage.

We wanted to look at adding Datadog to monitor the usage of a particular feature that managers were making a lot of fuss about. So we asked the Database Administrators about the repercussions of adding an index to our tables. It soon adds up to be absurd.

I checked a random server and a new Index on RecordID (int 4 byte), Method (tinyint 1 byte) and AvailabilityTimeStamp (datetime 8bytes) would be around 2.5GB for a server. There are 60 servers so we need around 150GB for an extra index across Live. Testing the Datadog query before and after the creation of this index shows a 98.6% improvement in total execution time.
Deployment Engineer

Architect
I wondered if anyone else had noticed (and looking into?) poor performance spikes occurring every 2 hours, they seem to present on most servers I checked.

Me  
no one actually looks at Datadog 
can you create a Meta Dashboard, so it shows you the time since Dashboards were looked at?
 
Architect 
I can only assume it's genuinely the case that no one actually looks the dashboards
I've raised 4 issues now, purely from observing the trends in the last 2 weeks
we've had wrong servers in the public and private pools
Windows Updates running in the day and killing servers
servers sat idle with no traffic hitting them
SQL Server spikes on F: drive access
these spikes every 2 hours
don't know what they're doing 
I've had a look at the Monitoring DB for Server07 this afternoon, and I'm absolutely appalled at how horrendous it is, I can't see the wood for the trees. I can only assume that users aren't getting any work done 
 
Me
Interesting that the spikes are exactly 2h apart, but at different base minutes between servers
 
Architect 
it is interesting, but we're still no closer to anyone paying attention to the issue 
Philip will probably sort it, he sorted the last DB-related issue

Datadog pricing

The following are our discounted rates, Per month costs as follows (Sept ‘22 - Sept ‘23):
•Infrastructure $11.75
•Network Performance Monitoring (NPM) $4.30
•Application Performance Monitoring (APM) $29.00
•Custom metrics $5 (per 100, per month)
•High use of logs (>1m/month) $1.52 (per 1m, per 15 days)
•Database Monitoring $77.28 (not discounted)

Standard prices are on here https://www.datadoghq.com/pricing/list/

“Related to this, the Azure Pipelines integration for CI Visibility starting September 1st, 2023 will have a cost of $8 per committer per month (on an annual plan, or $12 per committer per month on-demand). Additionally, 400,000 CI Pipeline Spans are included per Pipeline Visibility committer per month. Based on our June usage data, our monthly cost for Azure Pipelines integration for CI Visibility would have been $644.74. We’ve had this enabled for sometime now, is anybody actively using this?”
CTO

Product	Product Charges ($)
APM Hosts	$2,320.00
Audit Trail	$1,846 54
Database Monitoring	$463.68
Fargate Tasks (APM)	$128.06
Fargate Tasks (Continuous Profiler)	$70.84
Fargate Tasks (Infra)	$145.73
Infra Host	$42,206.00
Log Events – 15 Days	$10,265.18
Log Ingestion	$28.20
NetFIow Monitoring – 30 Days	$507.80
Network Devices	$993.14
Network Hosts	$1,242.70
Pipeline Visibility	$574.08
Profiled Hosts	$275.08
RUM Browser or Mobile Sessions	$1,186 91
RUM Session Replay	$48.74
Sensitive Data Scanner	$2,414.65
Serverless APM	$48.77
Serverless Workloads (Functions)	$1,385 52
Synthetics – API Tests	$1825.25
Synthetics – Browser Tests	$0.06
Workflow Automation	$13.03
Grand Total	$67,989.96

These were our monthly charges at one point (although one entry is 15 days, so double it). Then if you estimate how much this costs yearly, it’s going to be $936k unless we either cut down what we are logging, make efficient changes, and don’t scale by adding more servers. So around about $1m just for monitoring for a year. How ridiculous is that!?

Obviously when the CTO was fully aware of these costs, he then calls an All Hands meeting.

CTO: Tells everyone that Datadog should be used by everyone
Also CTO: Help! the costs are spiralling out of control

Jim
Lol - classic.
Jan: We are investing in this awesome new technology.
Apr: Please stop using this technology - it's too expensive.

Me
ha yeah, that was yesterday's meme for me

we've got this amazing feature
but there's not enough licences to use it properly
so we only partially use it
classic

part of my Dev Haiku collection

We even did that in relation to Datadog. It integrated with a product called Pagerduty to notify the team that something is bad, but there’s not enough licences to alert everyone involved! What’s the point even paying for it at all if it is half done. You can’t get the value. It is bonkers.

One of my colleagues who works on a different project to do with an API said it only costs $2,500 to run the API for a month and it’s used by millions of people. Yet here we are spending $78k on monitoring alone.

Service owners and leads have been granted additional permissions to access more detailed billing and account information. Please take some time this next week to:
- Confirm that all active services and features are necessary and being actively used
- Identify any areas where we can optimise our setup to avoid unnecessary costs
- Prioritise production system costs over other environments. This is a critical opportunity for us to do some housekeeping and ensure our resources are being used efficiently. We spend a significant amount of money on Datadog (7 figures), and we need to ensure that we are getting as much bang for our buck!If anyone has any questions about the above, please ask in here or reach out directly!

The Costly INFO

“Please consider very carefully if storing success status logs is required for your applications.”

As predicted, encouraging people to log everything to Datadog, not think about if it is useful, and to not check on it; soon led to an angry message from the CTO.

You know how it was ridiculous we were spending $10k on logs for 15 days? Well, it quickly grew to $15k and counting. On investigation, it was caused by one particular feature who was constantly logging the status INFO or OK.

327.59m OKs were logged in 1 day.

I looked at the total logs across a 1 week period and it showed as 1.18G. There’s a Gagillion logs!

How does it get to this point that no one noticed the excessive logging though? I suppose the Datadog company aren’t gonna alert you to it. They love money.

It proves I was right about getting everyone to create dashboards and increase logging, without actually having an owner to actually check and respond to them.

Costly Switch On

This is a big mea culpa. In an effort to get early visibility in Datadog of AWS, I enabled the AWS Integration (many months ago). This means that lots of metrics, hosts, logs etc come from these accounts automatically (adding to our costs!).

I’d like to undo this mistake but want to understand if anyone is using these monitors (for example NLB monitors).

Any views?

The problem is we pay $11.75/host/month whether or not we use the metrics
CTO

James
Both. You pay per metric and for polling the API.
Neal
Didn't we find this out the hard way with granular replicator CW metric capture? (swiftly removed though)
John
yep, that gets expensive if your pulling from datadog which is why there is the kinesis stream option
James
Yes - we added very granular cloudwatch custom metrics, as we didn't have a direct Datadog connection. This pushed up our AWS spend significantly, so we turned that off. Custom metrics direct in Datadog is significantly cheaper, but still worth keeping an eye on. E.g. we wanted to track latency, error rate and a few others at a per org level - that quickly pushed up to 40K metrics. In DD you pay $5 per month per 200 custom metrics. So we had to change our approach to metrics / observability to only surface details for error situations.
CTO
I've disabled the metric collection for all these accounts now. That saves at least $6,800 per year. Every little helps!

Who is viewing the data? And when?

Next challenge is this: I want the rest of the Senior Leadership Team (who are mostly not as technical as me) to be able to connect to DataDog and be able to understand how well our systems are working. I would suggest using the Golden Signals as a standard for all the systems. The dashboard that’s created needs to be easily consumed, consistent across products and reflective of customers’ experience. Can we make this happen?
CTO

Me:
Are the Directors actually gonna connect to DataDog? Trying to picture how this can be useful, and how they should get the info.
Architect
it's all laughable really
I was cringing when he put in the initial request to move it to single-sign on, could see where it was heading!
I don't think they need access to Datadog, surely they just want:
-Everything is ok
-Some things are broken (these things...)
-Everything is in ruins

everything else requires some level of investigation and interpretation
and we should probably have that information on our business status page, unless we're scared about how frequently we have problems

Me
That's true. People probably already tell the CEO if things aren't fine
and if they want to fiddle the figures, they can still do it in the dashboard that she sees

yep that dashboard was a first attempt to get us a first step to what you describe
problem with us at the deep technical level – is knowing what data SLT / GXT find useful. Count of Active Alerts ? Just a simple RAG status ?
average response time ?
DevOps Engineer

The conclusion was, the metrics that we create should have the following properties:

Consistent – the same interpretation should be made across all products
Comparative – we can tell from the metric how close we are to having an issue (ie percentage of link utilisation is better than Mbps)
Trending – we can see the past and if there is an underlying trend that points to an issue in the future the metric would make that obvious
RAG status’d – you can tell if a metric is ok quickly by seeing if it’s red, amber or green.
Relatable – the metric is connected to the experience by a customer, partner or patient.

We Monitored Everything But Not That Important Bit

Hi all, just letting you know about a lesson learnt from last weekend’s Mobile disruption incident; An internal service Mobile relies on had a dependency on a 3rd party service endpoint that went into an error state. Unfortunately we weren’t monitoring that as part of the service monitoring and therefore we had a significant delay in detecting that down-stream failure. We also didn’t have the Mobile service mapped as depending on the other internal service within PagerDuty, so even if an alert had fired, we wouldn’t have seen that incident bubble up from Identify Link into Mobile as the cause of the Mobile disruption.
CTO

Conclusion

I say in a lot of these blogs that you really need to understand the problem you are trying to solve. Otherwise you end up wasting time, money and causing more problems. It’s ridiculous that we have spent $1m a year on monitoring, and we can’t predict or react to Major Incidents. There’s gaps in the monitoring, incidents caused by the monitoring, and people not looking at the monitoring.

Also in predictable fashion, we are moving away from Datadog to Dynatrace which is supposed to be much cheaper. However, all the dashboards will have to be remade so there’s going to be lots of time wasted.

CTO overrule

I’ve written blogs about how our CTO tried to change our release process and announced it on a “Town Hall” call with the entire department; then loads of teams told him it couldn’t be done, so he had to back down.

Then later, on another Town Hall, he tried to change the Change Control process, but wouldn’t back down when we told him it wouldn’t work. He made the claim of it being a sackable offence if we didn’t follow it. Then a month later, he found out someone couldn’t turn a server on because of the Change Control process. He said it was “malicious compliance” and that will be a sackable offence in future. Within a few months, nearly the entire process had been reverted.

Last week, he was talking about how we needed to move a Data Centre to a new location. He said he preferred to move to the Cloud since it is “inline with our strategic targets”. However, after having several meetings with the experts involved in the Data Centre, they decided the best solution would be to move to another data centre. However, the CTO didn’t like this because it wasn’t inline with their strategy and he thought the move would be too slow.

Therefore, he took the executive decision to overrule them and demand they move to the cloud.

“Do I know we can move all servers to the cloud in time? No. But I was prepared to take the risk. I would rather make decisions and be wrong, than sit by and not make any”
CTO

It seemed strange to me to claim that moving to a physical data centre would be slow, but then moving to the Cloud probably couldn’t be done in time.

He then claimed that

“we have wasted enough time deciding what the plan should be; to move to the cloud or to move to a physical data centre”.
CTO

Isn’t this the worst scenario though? He could have made the decision before any meeting was arranged. But it sounds like they had debated the decision, came to a conclusion, then he told them he didn’t like their conclusion. Then he moaned that they wasted time debating.

So they have had meetings with the experts, and conclude the data centre is the best decision, but since the CTO loves the cloud, he has overruled them. So what was the value of the meeting? And will the staff be motivated to do something they don’t believe in?

“I think I’m compassionate with my team. It’s what binds us together as a team. Otherwise we are a bunch of individuals.”
CTO

I don’t get how he can make these statements and not realise the hypocrisy. How can you be compassionate if you have shown no confidence in their opinions and decision making?

Time

When it comes to software, the concept of time can cause problems. There’s actually some really interesting scenarios, but even in simple applications, some developers really struggle with simple concepts.

In terms of standard problems, you can have problems where the client and server times can be out. This can just be because they are set incorrectly, or maybe are using a different timezone. As a developer, if you are looking at times in log files across the client and server, it can cause confusion if the timestamps are out. A common thing I have seen is that some servers don’t use Daylight Savings Time we have in the UK, but the client times often do. So the server can be an hour out.

Daylight savings time is interesting as time shifts forward or backwards one hour. So time isn’t linear.

I recall reading a blog about time by Jon Skeet who then discussed how if you are using historical dates, the time can also suddenly change. Like if a country switches to a different calendar system entirely, so moving a day could suddenly jump in years to align with the new system.
Computerphile have a discussion on this The Problem with Time & Timezones – Computerphile

Leap Years

We once had a leap year bug because someone created a new date using the current day and month, and added a year. So when it was 29th Feb, it tried to create a date of 29th Feb for next year which wasn’t a valid date. So the feature crashed. Everyone was panicking trying to rush out a fix, but then we realised we could only get the fix out to our customers tomorrow, and the bug wouldn’t happen. Not for another 4 years anyway. It was hilarious

-1

One weird mistake I saw recently, is that a developer defined a variable and set it to 5. The code they wrote was supposed to make sure that we never make an API call more than once every 5 minutes. However, they then minused 1, so were checking every 4 minutes instead.

var minimumNumberOfSecondsRequiredBetweenEachAPICall = (NumberOfMinutesRequiredBetweenEachAPICall - 1) * 60;

Ages

You would think everyone would understand the concept of ages since everyone has an age and it increases by 1 every time you have a birthday. However, many developers seem to struggle with the concept. The last implementation I saw had the following:

int age = DateTime.Now.Year - dateOfBirth.Year;

So it can be one year out because it basically assumes your birthday is 1st January.

It reminds me of an exchange on Twitter that I saw years ago. It was in the context of football.

PA: Why are Arsenal paying £25m for a 29 year old striker?
G: he’s 28 btw
PA: He’s a lot nearer to 29 than 28, that’s a fact
G: He’s 28, that’s a fact
PA: Why am I not surprised that fractions are beyond you. The day after his birthday, he is no longer 28.
G: He’s 28 until he becomes 29. That’s how it works
PA: Perhaps if you had paid more attention in Maths lessons? You might remember “round up or down to the nearest whole number”
G: He’s 28. That’s a fact.
PA: No, it is not. £1.75 is not one pound. You don’t even understand what a fact is now.
G: Until he is 29, he is 28.

When it is the next day after your birth, are you 1 day old? technically you could just be a minute old but claim you are 1 day old.

My instinct to perform mathematics on dates would be to use an existing date library. Another developer tried to make something themselves. This seemed a bit complex to me, but I think it actually worked, or at least seemed reasonable for how they wanted to use it.


public static double AgeInYearsAtDate(DateTime effectiveDate, DateTime dateOfBirth)
{
        double daysInYear = 365.25;
        int completeYears = Age.GetYears(dateOfBirth, effectiveDate);

        dateOfBirth = dateOfBirth.AddYears(completeYears);

        double proportion = effectiveDate == dateOfBirth ? 0 : Age.GetDays(dateOfBirth, effectiveDate) / daysInYear;

        return completeYears + proportion;
        }

        public static string ConvertCurrentAgeToYearsAndMonths(double age)
        {
                int monthsInYear = 12;
                int years = (int)age;
                int months = (int)Math.Round((age - (int)age) * monthsInYear);

        return $"{years} year{(years == 1 ? String.Empty : "s")} and {months} month{(months == 1 ? String.Empty : "s")}";
        }

Ages Part 2

Another developer was testing his age code and wrote this:

new object[]
            {
                new DateTime(2010, 05, 31),
                new DateTime(2009, 06, 01),
                AgeRange.UpToOneYear,
                "52 weeks and 0 days"
            },

If there’s 52 weeks in a year, then is that 52 weeks? kinda looks 1 day short to me. Time is mental isn’t it?

Hidden Scanning Portal

Many years ago, my colleague Andy came up with a great software hack to fix a bug. I didn’t understand the fix at the time, so don’t remember the details, but the bug manifested as a red box replacing a UI control whilst the user was scanning a paper document.

Andy implemented a solution dubbed the “Hidden Scanning Portal,” a dialog box that remained invisible until the scan was complete, after which it was disposed of.

After a few months, another developer, Joe, convinced they had discovered a more permanent solution; removed Andy’s Hidden Scanning Portal. This action inadvertently introduced a new bug, so the Hidden Scanning Portal was swiftly restored, averting further complications.

Our Team Lead, Matt revealed that failing to fix the original issue could have resulted in a fine of £16,000. This revelation cast Andy’s quick fix in a new light, attributing a significant value to what might have otherwise been seen as a mere temporary solution.

Andy’s reaction to the situation was a mix of pride and frustration. Despite his contribution to saving the company from a hefty fine, he lamented the lack of recognition in the form of a modest pay rise.

“and they didn’t even give me a measly 3% pay rise”
Andy

Quick fixes might not be ideal, and increase “technical debt”, but they can provide immediate relief, avoid hefty fines, and great stories to reminisce about.

Crazy Code Design

Here is a selection of poor code design that I have come across over the years.

Opacity Slider

Sometimes, programmers write simple programs to help them, or their team. When I first joined this company, they were using a more basic version of a source control like SVN, and someone had designed a program to help with Code Reviews. On the tool, they had a slider that changes the opacity of the dialog, so you can make it see-through. It seemed like a case of the developer simply doing it because they could, and not because anyone actually found the feature useful. I suppose you could be more creative if it’s only used internally, and not for a customer; but still incredibly pointless.

Law of Demeter

Law of Demeter, also known as the principle of least knowledge, advocates for a design where objects are loosely coupled and only communicate with their immediate acquaintances. This means that a method in a class should only call methods on:

Its direct components.
The object itself.
Objects passed as parameters.
Any objects it creates.

In practice, this means you shouldn’t have a massive chain of methods/properties.

One example of breaking this law that I saw in our code looked like this:

responseMessage.acknowledgements.conveyingTransmission.controlActEvent.reason[0].detectedEvent.code.code

Awful.

Partial Record

One of our modules can potentially show a large amount of data. The data can come from the local record, or if there is a sharing agreement, it can come from other companies that they share with. There are certain screens where you don’t need all the information, so to try and cut down loading times, we have this concept of a Partial record and a Full Record.

	public IRecordPartial Partial
	{
		get
		{
			return IsFullLoaded ? Local : EnsurePartial();
		}
	}

The logic, and wording gets confusing real fast. The property above is called Partial, but there is a check for IsFullLoaded which implies that Local can be Full, and not just Partial like the property says. When you look further into the code, Local might not even be local because it can contain shared. Mindbending.

PercentNull

Coming up with descriptive names is always a challenge in programming, and there is occasions where naming can be ambiguous. However, I have no idea what a method called PercentNull does here:

transactionOut.Surname = helperFacade.PercentNull(customerDetail.Surname);

If it assigning the result to Surname, then it should be returning text. Nothing immediately obvious comes to mind if you are passing in Surname to a method called PercentNull and getting a Surname from that. So it’s not like it is returning a percent number if it can, or Null if it cannot. Or returning the percentage that the text contains whitespace. 🤷

How High

	public enum SeverityScale
	{
		None = 0,
		Low = 1,
		Medium = 2,
		High = 3,
		ExtraHigh = 5,
		VeryHigh = 6
	}

We have this enum to represent the Severity. It makes sense until you get further than High. Extra High makes sense to be more severe than High, but should Very High be higher than Extra High? Extra and Very sound like synonyms. You need something more extreme like Extremely High to make it immediately obvious what the ranking is.

Focussed or not?

	public void SetSearchBox()
	{
		SearchFocused = false;
		SearchFocused = true;
	}

When you see code like the above, it is most likely that the code is doing something weird, and because no one worked out how to fix the original issue, you then end up writing more weird code to work around that. These “hacks” just stack up and are hard to remove. If you try to do the right/honourable thing and remove a hack, then you will probably see some strange behaviour which then means you remove more code, and maybe more code when you find more problems. Repeat until you are left with the original issue which you then have to fix.

So how can setting a property to false, then immediately setting it to true actually do something? Probably if the property has loads of logic and side-effects. Probably best not to look. 🙈

Random Parameter

		public Organisation(bool randomParamToChangeSignatureForFasterConstruction)
		{
			// Had to create a new constructor that doesn't initialise internal state, param isn't used.
		}

The thing that is scary about this strange code, is that it was written by one of our smartest developers. I have no idea why he had to create an extra constructor. Presumably the parameter is there because the “default” constructor already existed. Intentionally not initialising data sounds like a recipe for bugs though. Maybe it needs a redesign.

Slow Request

I remember my Dad telling me a story of a software team putting a delay into their code. Then each month, they would simply reduce the delay a bit and tell their managers they have been working hard on performance improvements.

if (Helper.SlowRequest)
	Thread.Sleep(15000);

I found the above code in our codebase but it relies on a configuration value being present in the config file. It was present and set to true by default for Developers though so would always be slow. Changing the value to false would speed it up, but you have to know that it exists to change it.

<add key="SlowRequest" value="true"/>

Although it doesn’t affect our customers, there’s always the chance something will go wrong one day and it could affect them.

Boolean “Mode”

If you want to handle many options, you often use an “enum”. Using a “Boolean” which represents 2 values (true/false) is a very weird choice…

 <param name="mode">If set to <c>true</c> then [mode] is a delete operation, else add, edit and none operations</param>

true = delete

false = add OR edit OR none.

If you put the wrong value, then there’s an if statement after. So if you say it’s a delete and it isn’t, then things don’t get deleted.

		if(!mode)
		{
			if (@event.updateMode != vocUpdateMode.delete)
			{
			}
			else
			{
				if (@event.updateMode == vocUpdateMode.delete)

Not Supported Property

	public virtual String UploadedBy
	{
		get { return _document.Observation.EnteredByUser.DisplayName; }
		set { throw new NotSupportedException("Setting UploadedBy is not supported"); }
	}

Isn’t that Set a complete dick-move. If you call it, it will crash. If the setter wasn’t there at all, you would know it shouldn’t be set.

I guess it could be designed with the expectation that you would override the property. However, I thought it wouldn’t be set correctly, because the “get” returns a massive chain of properties. So it’s not just the case of setting a variable, it’s actually document.Observation.EnteredByUser.DisplayName that needs to be set, and that is breaking the Law of Demeter anyway.

Mutual Exclusive

This is gonna end in tears

private Boolean _isSingleClick;
private Boolean _isDoubleClicked;

Not only are there better ways than detecting clicks, when you have multiple variables tracking similar concepts like this, you can easily end up in invalid states. If _isDoubleClicked is true, then you would expect _isSingleClick to always be false. But it is easy to make a mistake in the code and not set it which then leads to a bug.

Notifications

		public bool IsNotificationConfigEmailEnabled()
		{
			if (!_configSmsEnabled.HasValue)
				_configSmsEnabled = NotificationConfiguration.IsEmailEnabled;

			return _configSmsEnabled.Value;
		}

The fact that this code had been there years means it should work. But when the property is supposed to be checking if Email is enabled but the code only looks at SMS enabled; then who knows how it works.

Resizing

int parentSize = Parent != null
                ? Parent.Width
                : Screen.PrimaryScreen.Bounds.Width;
 
var availableSpace = parentSize - _clientLocation.Y - _yOffset;

What a nonsense calculation that is! We want to calculate the height of a pop up box and make sure it can fit within the window. So we look at the width of control, or maybe the width of the monitor that they might not be using (if they actually have the program on their secondary monitor), then minus the Y coordinate of the window which would be related to height, and not width.

Splitting

Sometimes programmers like jamming as much code as they can on what is technically a single line. It is an absolute nightmare to debug chained logic like this:

return
	RequiresSingleItemOrder(item, customerRequestsSplit)
	?
	null
	:
	batchOrders
	.FirstOrDefault(
		orderInstance =>
			(orderInstance.IssueMethod != IssueMethod.Electronic || orderInstance.Items.Count() < 4)
			&&
			!orderInstance.Items.Any(m => RequiresSingleItemOrder(m, customerRequestsSplit))
			&&
			orderInstance.IssueMethod == issueMethod
			&&
			orderInstance.OrderType == OrderType
			&&
			orderInstance.IsUrgent == isUrgent
			&&
			(!ShouldSeparateControlledItems
			||
			orderInstance.AnyControlledItems == item.IsControlledItem)
			&&
			orderInstance.Warehouse == Warehouse
			&&
			!Authorisation.Separator.ShouldBeSeparated(authoriser: authoriser, orderAuthoriser: orderInstance.Authoriser)
			&&
			(!ShouldSeparatePrivateItems
			||
			orderInstance.IsPrivate == isPrivate)
			&&
			MatchesForRepeatOrdering(item, orderInstance)
			&&
			NullAndNonsequentialEqual(separationTypeIds, orderInstance.SeparationTypeIds));

Funny Youtube Comment

I saw this comment on a programming video on YouTube. It is remarking on how you can write really confusing code as a way to increase your job security because you can be in charge of code that only you can read:

And remember, kids – if you nest multiple null coalescing operators into a single line of poly-nested ternary operators, that’s called “job security” – cause ain’t no one wanna maintain that code when you’re gone.

return a> b ? a < c ? a != d ? e ?? f ?? 0 : f ?? g ?? : 0 e ?? g ?? 0;

Typescript Example

Years ago, we started rewriting our C# program using Web Technologies. Since everyone was new to Javascript and Typescript, everyone wrote awful code. But then you didn’t know if it was good or bad. One of the first bits of code I saw when I joined their team was this:

export const Actions: { [key: string]: <T extends {}> (value: T) => IAction<T> } = {
 setModuleName: <T extends {}>(value: T): IAction<T> => CreateAction<T>(ActionTypes.setModuleName, value),
};

Don’t ask me what it says, because I have no idea.

Contradiction

InsertUpdateResourceImmutableProperties

It’s so frustrating to read, it makes you want to punch someone. Immutability means it shouldn’t change after it has been created. So how can you Update something Immutable? Luckily it has a comment explaining it:

   -- If we are about to insert the most upto date version of the resource, then we should update the
    -- resources immutable properties, because they might have changed in the source system. (Even though they shouldn't).

Well, I am still none-the-wiser.

Boolean Extensions

If you have a boolean variable and you want to check it is false, you can just write:

			bool example = false;

			if (example == false)
			{
			}

Now with this handy extension method:

public static class BooleanExtensions
{
	public static bool IsFalse(this bool boolValue)
	{
		return !boolValue;
	}
}

You can now write:

bool example = false;

if (example.IsFalse())
{
}

What’s the point in that? No advantage really is there?

Not Set Up

		get
		{
			if (_setup)
			{
				_setup = false;
				return true;
			}

			return _noDocumentsInError;
		}

In the get property, we check the value of _setup. If true then we set it to false, and the overall result returns true. However, if we immediately call the same property, it will just return the value of _noDocumentsInError instead. It’s bad design to have side-effects in a get. Gets are supposed to return a value, and not set things. Then we seem to have different fields tracking different concepts which just looks like it will be prone to errors.

Reload

	public void ReloadTriggers()
		{
			// To be implemented
			return;
		}

This code doesn’t even need a return statement. It is just there for the bantz. When this code is in a module that is very error prone, and you have a method that is not implemented and does nothing at all, then doesn’t give a good impression does it?

Dialog Events

            _msgBox.ShowDialog();
        	return _continueWithSelection ? DialogResult.OK : DialogResult.Cancel;
}
 
private void Cancel_Selection(object sender, EventArgs e)
{
       _continueWithSelection = false;
       _msgBox.Dispose();
}

You can get the dialog result when the dialog is closing. However, this developer has decided to create their own event handlers for the button clicks, then using the variable to decide if it is a DialogResult.OK or Cancel.

NoEncryptionEncryptor

new NoEncryptionEncryptor()

Is this an Encryptor or not?🤔

SillyException

You throw an Exception when something bad has happened. However, in our code there is a concept of SillyException where you are supposed to ignore it because apparently, it isn’t a real error. It’s only used in a specific part of our codebase though so goes undiscovered for ages. A colleague Michael found it again and we were reminiscing on a bug I found.

Michael 11:33: 
if (mapped is SillyException) return 0;         // Avoid throwing the SillyException 

Me 11:33: 
I love sillyexception 

Michael 11:33: 
/// <summary>Represents a data provider exception that doesn't indicate an actual error, 
/// and should be ignored.</summary> 
[Serializable] 
public class SillyException 

Me 11:34: 
I remember David going on about SillyExceptions and it took me ages to realise he was referring to an actual thing 

Michael 11:35: 
"I keep getting this SillyException" 
what kind of exception is it David? 
IT'S A SILLYEXCEPTION 
YES, BUT WAT KIND 

Me 11:35: 
yeah, it did go something like that 

Michael 11:35: 
haha 

Me 11:37: 
I think it was around the time he spent weeks investigating that bug I found that no one else could recreate 
and I thought he had worked it out when he said he was getting a SillyException

lame async call

var result = client.BeginUploadFile((Stream)payload, endpoint);
bool uploadCompleted = false;
while (!uploadCompleted)
{
	uploadCompleted = true;
	if (!result.IsCompleted)
	{
		uploadCompleted = false;
	}
}
if (!uploadCompleted)
{
	throw new Exception($"Failure occurred while uploading the data");
}
client.EndUploadFile(result);

weird that they set it to true, then set it back to false. Then it will never get to the exception, it will just be an infinite loop because the check is outside the loop that will only exit if it is complete.

Is this just a reimplementation of a lame async call?

Email Regex

When it comes to data entry, you can perform validation using Regular Expressions which defines a pattern/structure. Certain identifiers have rules, for example; an English postcode (I think) is 2 letters, 1 or 2 numbers, usually formatted with a space, then a number, followed by 2 letters. eg HX1 1QG.

Software developers often come up with convoluted rules to try and define what an email address looks like, but I think they are inconsistent even between email providers that certain special characters can be excluded.

In our software, a developer had decided to change the regular expression in our email address validation. It was complex before, and it looked more complex after the new developer’s changes.

This was what it was:

return (Regex.IsMatch(emailAddress, @"^[A-Za-z0-9_\.\-&\+~]+@((\w+\-+)|(\w+\.))*\w{1,63}\.[a-zA-Z]{2,6}$"));

And he changed it to this:

return (Regex.IsMatch(emailAddress, @"^([\w-]+(?:\.[\w-]+)*(?:\'[\w-]+)*)@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)$"));

Now, glancing at that, I have no idea what the advantage is. Some developers just assumed it was a good change and approved the change. What I noticed is that the end part was changed from [a-zA-Z] to [a-z], which means the last part of the email eg “.com” can only be written in lowercase, but previously we accepted “.COM“. As far as I am aware, email addresses don’t care about casing, so although it would be nice if users only entered them in lowercase, it seems an unnecessary restriction. Was it intentional? What else in this validation is different?

If he had unit tests to show the valid emails, and prove his change wasn’t flagging valid emails as invalid; then I would be more comfortable approving it. But there were no unit tests, and I didn’t understand the change.

Another developer pointed out that the advice for emails is usually just to go for the simple *@*.* validation which translates to; “it must have some characters, followed by an @ sign, more characters, then a period followed by some characters“.

Email validation is surprisingly complicated, so it’s best having a simple restriction just to filter out data-entry errors. Having complex restrictions can then exclude people’s valid email addresses and you definitely don’t want to do that!