Employee Profiles: Neil

Neil was very similar to Gerald. Although I always got on with Neil, his programming skills were a bit lacking.

He was definitely one of those software developers who may have been good in his prime, but the languages he used are now obsolete and he struggles to learn new things, so he was a poor C# developer. Or maybe he has always been poor.

Neil is like “I don’t know why this bug is happening, I might try changing random parameters”

So I said “have you found the barcode code, since that is what is displaying incorrectly?”

“No I haven’t. I suppose that’s important to find”

Neil in the Standup update on Thursday: “gonna switch branches”

Friday: “just in the process of switching branches”

Switching branches takes a few minutes (Git, plus some config changes)

“I don’t have a good feeling about downloading NuGet packages any more. Before the patch, the fix made 21 changes to the .csproj and .config files in order to work. So far it has made 523 changes.”

Neil

What’s he on about? Sounds like Neil is deleting packages, then removing the part of the build script that grabs them, then wondering why he can’t log in.

One trait he had was that he seemed focussed on his own work and didn’t pay attention to what anyone else was doing. There could be well-known employees at the company and he wouldn’t know who they were. So there were plenty of conversations like “go and ask George for assistance” and he would be like “who’s that?” or “what does he do?”, much to the derision of team-mates.

"might be worth getting you and Nick on a call together because you are working on similar items"
Neil: "Nick who?"
"Nick on our team"

He once turned up to the stand up, which started at 9:30, but he was supposed to be on a 4-hour training course from 9. He said he joined the meeting but no one was there. Our manager was like “did you join using the correct link? The calendar invite says PLACEHOLDER so that’s not the correct one”. Why didn’t he ask people if the meeting was on? He should have messaged his manager straight away.

A few months later, he missed another meeting. “did you not see my reminder about this morning’s call in Slack yesterday afternoon?”

“I thought it was some advice on where to look in the code”

The team lead had posted “Announcement: reminder that we’ve got a call at 9am tomorrow about MEDS ISN”.

Probably Neil’s brain – “mmmmmmm MEDS ISN call. Sorry, I can’t seem to find the MedsISNCall code. Is that a 3rd party dll?”

I’m still baffled by Neil. If he thought it really was code, then why wouldn’t he ask how to find it?

He often struggled and didn’t ask for help, even though we repeatedly told him we would help train him up in our software and with C#. He would raise it as a point for himself in the Agile Retrospectives, and say that he will ask for more help and work as a team; but then carried on as normal.

“Yesterday, I failed to log in after 3 attempts. So I am carrying on looking at it”

Neil wants to add a new user to the database because he doesn’t know how to unlock his account by running a script on the database, but it seems he didn’t know how to add users either. I wonder what time he locked it. Did he lock it at 11am, then walk off?

Neil last week: I need to ask for help more

Probably Neil’s brain this week: well, my account is locked, looks like today is a write-off

Later on, the manager asks him if he is all sorted and he said: “I can log in but an error pops up.”

So he is blocked again and never said anything. He is really trying to get himself sacked.

I’ve noticed these types of people often blag their standup updates by saying “Sent a message to…”. Sending an instant message or email to someone could take some time to make sure it is worded correctly, but it’s not really a significant piece of work. Then there can be times where the recipient doesn’t respond because they are busy or out of office. But then what are you doing whilst blocked? They seem to use it as an excuse to write off the entire day and shift the blame to someone else.

There were several times where he was working on items that had already been picked up by others, had completely misinterpreted the requirements, or just had general bad luck with work being deprioritised as he was working on it. It kinda became a running joke, like he was a cartoon character.

We told Neil to take a bug from the backlog. It was one I had investigated and put loads of notes in, so it should have been really easy for him to fix. Instead, he took the one already assigned to me with a status of “In Progress”.

Build Problem

On a standup, he was saying he was struggling with a build error. I volunteered to help him since we didn’t want him stuck all day. I asked him to clarify the situation; “is it on a computer that has always worked then suddenly stopped working?” And is he “currently running the build script without any of his changes in, or could it be caused by something he changed?”. “If he types that az login command, what does it say?”

Then all he says back is

“Curtis from IT is supposed to be setting up elevated access on A20205 but it isn’t working yet”

WHO IS CURTIS? Why is he included in the story now? And what even is that machine name? Am I supposed to recognise it?

“Existing one that has always worked, except for these latest build changes. Curtis said ‘I’ll need to switch your admin access over from Primary to an elevated to match that of the other users on the jumpbox’ and under the circumstances I thought it sounded like a good idea.”

I asked why Curtis is involved. Has Neil asked him to fix his build issues? And why does he think it’s an IT problem rather than a Development one?

“He contacted me. He did not give a reason for making the change so I don’t think he knew about my build issues. He may have noticed that other users have elevated permissions. I do not think they have caused the build issues.”

So he is telling me about Curtis but it’s nothing to do with the problem I am trying to help him with?

After looking at Neil’s local changes, I saw that he had made changes to the build script, which answers one of my initial questions that he could have just told me. I suppose I need to check each stage for tampering.

So I asked him what error he got that made him start changing the build script. It shouldn’t have been necessary; why would the build work for everyone in the department but not Neil? He sent me a screenshot, and I noticed that the command prompt showed Isobel’s name as the local user.

How can he be logging in as Isobel? Have IT merged his accounts? Is this Curtis’ doing?

I asked him if he noticed he was logging in as Isobel. He said he took the screenshot from her documentation! He is a wind-up. He was getting the same error as in her troubleshooting guide, but decided to paste her screenshot to me.

I told him to log in, then run a command. He ran the command without logging in. So I explained again, and I think he logged in but didn’t run the command. Absolute wind-up.

In the end, I think we undid all his random changes to the build script and fixed the problem with the Azure DevOps plugin, as documented in Isobel’s guide.

How Many Items To Return

There was a bug he was assigned that looked like it could be simple at first glance. You have a list of items and the code was calling the LINQ method First(). The fix could be that it should be FirstOrDefault() if the collection can legitimately be empty, or maybe the problem is that there really should be an item in the list and it is missing.

Without recreating the problem, we wouldn’t know what the fix should be.
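To illustrate the two behaviours (a minimal C# sketch, not the actual product code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class FirstDemo
{
    static void Main()
    {
        var empty = new List<int>();

        // First() throws InvalidOperationException on an empty sequence.
        try { empty.First(); }
        catch (InvalidOperationException) { Console.WriteLine("First() threw"); }

        // FirstOrDefault() returns the type's default instead of throwing
        // (0 for int, null for reference types).
        Console.WriteLine(empty.FirstOrDefault()); // prints 0
    }
}
```

The catch is that swapping in FirstOrDefault() only hides the crash; if the real bug is a missing item, the default value just moves the failure somewhere else.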

Looking at the code, it passed in a list of items and possibly switched some depending on the config and certain checks. The method returned a list, but it was not clear at first glance what was returned. Is it all items? Only the ones that have been switched? Any that have been modified?

He said it should be all items, regardless of whether they were swapped.

Then, reading the code in the particular method where the crash occurs, why was it returning items when this swapping feature was enabled, but none when it wasn’t?

Another developer, Dean, points out that in the previous method calls, there is another IF statement. When this swapping feature is on, an empty list is returned, and when it isn’t, the original drugs are returned. But then they are placed in a list called switchedItems.

Neil: “I understand it enough to know that this fix will work, but don’t understand it enough to refactor it”

He says his approach is to make the simplest change possible, and not increase the scope and the chance of introducing a new bug. I said that although that approach is generally good, in this case we can see that the original developer had made a mistake in a previous method by not returning the initial list of items, and then it looked like they had put in a hack to try and work around it.

So I told Neil that the original change was basically a hack – the developer was attempting the simplest and quickest fix, but it has made the code look confusing and it still has a bug. Neil is continuing this mentality by modifying the if statement to work in this additional scenario where we have found a bug, further contributing to the mess.

We discussed other scenarios and whether they are impacted by this particular code. Then Neil says “so my change is fine then?”. Dean and I, in unison, say “NO!”. You need to fix the original method to return the initial list.

Not Hitting Breakpoints 

There was another good example where he did ask for help, but hadn’t recreated the issue at all and didn’t adequately explain what point of the investigation he was stuck on.

The bug report was very poorly worded, but the general gist of the problem, as I understood it, was that there was a certain type of Request Task which contained a list of items. The user then tries to change an item to a Repeat order. Then it crashed, triggering an error dialogue.

His manager asked him if he had tried recreating the issue, and he said he had been “trying for a while”, but didn’t elaborate.

I gave him advice that sometimes you need to find where the crash occurs from the stack trace, then look at the method calls before that to see the initial UI call, then work out which UI control triggers it.
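The technique can be shown with a toy example (hypothetical method names, nothing from our codebase):

```csharp
using System;

class StackTraceDemo
{
    // Invented call chain standing in for UI handler -> business logic.
    static void UiClickHandler() => LoadItems();
    static void LoadItems() => throw new InvalidOperationException("no items");

    static void Main()
    {
        try { UiClickHandler(); }
        catch (Exception ex)
        {
            // The top frame is the crash site (LoadItems); walking down the
            // frames leads back to the UI entry point (UiClickHandler),
            // which tells you what the user must have clicked.
            Console.WriteLine(ex.StackTrace);
        }
    }
}
```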

The next day, he says he is still struggling, so I needed to help him.

I get set up. After I start investigating (and getting sidetracked with other issues), he then says he is getting a crash just creating the Tasks. What has he been doing for the last few hours?

I sorted his environment out, then asked him if he had found any suspicious code, and he said he hadn’t. I wasn’t sure if he meant he hadn’t even found any relevant code, or hadn’t identified what could be the problem, but I left him to it for a bit.

He sent me some method names leading up to the crashing method, mainly taken from the stack trace, though some were irrelevant. I pointed out that the line of code was crashing when a collection was empty, and reminded him of the exact method he needed to look at.

A few hours later, he asked me if I had made any progress, and then said “there is a null being returned in the If statement”. So I reminded him that null is fine; we are looking for when the collection is empty. Empty and null are similar, but there’s a difference in the world of programming.
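The distinction, as a trivial C# sketch:

```csharp
using System;
using System.Collections.Generic;

class NullVsEmpty
{
    static void Main()
    {
        List<string> nothing = null;        // no list at all
        var empty = new List<string>();     // a real list with zero items

        Console.WriteLine(empty.Count);     // prints 0 - safe to use
        // nothing.Count would throw NullReferenceException:
        // an empty list is usable, a null reference is not.
    }
}
```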

So we need to find what can return an empty collection. The returned list sometimes does stuff with the swapping feature so that made me think it could be that.

Neil 
Which dialog are you referring to when you say "in that dialog"?
Me 
In the Task
Neil  
Are you adding a task that appears in the To-Do list or Inbox?
Me 
Requests, not To-Do list
have you not got that far, or have I just confused you by giving you random bits of info 

I don’t really get it, because it sounded like he knew it was Requests, and now he is thinking it could just be some other type of task. Then, when I explained where the link is that the user is clicking in the stack trace, he was like “oooooooh I haven’t got this far”.

It said Requests in the bug report. I told him to first work out what the user was clicking in the UI. But he hadn’t got far enough to know it was a Request, never mind what the user was clicking on.

As he was showing me his debugging, it was hitting all kinds of breakpoints. He had like 50, and hadn’t been hitting them before because he wasn’t using the correct dialog. He had speculatively changed the method I had told him to look at as an experiment, but if he was never using the feature that called that code, then what was the point of experimenting by changing it?

He could at least have told me that he couldn’t hit his breakpoints. Either it’s a problem in Visual Studio or, most likely, he isn’t in the correct feature of our program.

I don’t really understand how he can be a developer for 30 years and not understand how to work out a problem from the stack trace. It’s not always 100% possible, since a feature might need configuration to be enabled, so you may still wonder how the code is called; but at least you can deduce some information, come up with several scenarios, then rule some out.

Conclusion

I always say it is hard to tell how good developers are, but there are certain red flags that suggest they are bad. Obvious ones for me are not asking for help, not understanding basic debugging, and showing a lack of interest in the project or code quality.

Teapot

I was looking through the list of HTTP status codes and saw a strange one.

Error 418 (I’m a teapot)!?

Google have a dedicated teapot page (google.com/teapot).

If you hover over the teapot, his eyes open and he smiles. If you click it, it then animates.

It sounds like the origin was an April Fools prank.

RFC 2324 was written by Larry Masinter, who describes it as a satire, saying “This has a serious purpose – it identifies many of the ways in which HTTP has been extended inappropriately.” The wording of the protocol made it clear that it was not entirely serious; for example, it notes that “there is a strong, dark, rich requirement for a protocol designed espressoly for the brewing of coffee”.

Wikipedia

GitHub Copilot

We recently had staff from GitHub do a presentation on how Copilot can be useful to software developers. I found their answers to be a bit wishy-washy. It’s a really complex topic, and having what I think were essentially sales managers trying to pitch something technical to us was a challenge. They didn’t have a full understanding of how it actually worked.

Someone asked a question to clarify whether Copilot just looked at your open documents, or if it had permission to see all the other files in your repository. Their answer was vague, along the lines of “it might do. Could come down to chance”.

For it to be effective, it really does need to look at your codebase to see what your product does, what features are already developed, and for bonus points, your coding style.

When it suggests calling third-party code and installing additional libraries, does it understand that you may need to abide by a certain licence (pay a fee, not use it in open source, etc.)? Does it know that you may be limited to a certain version due to other dependencies? When features and the API (required parameters etc.) can change drastically between versions, does Copilot understand that?

It’s probably the same scenario as what Wolfram Alpha were talking about when they came to our company to do a presentation on AI. They were emphasising how standard language models often suggest text which reads like it makes sense, but is actually nonsense. They gave an example where it quoted a real journal, stated the title of a chart that exists, and quoted figures and years – but the figures were fictional.

I saw a news article about how a lawyer presented some documentation to a judge about similar cases, but it turns out the lawyer had used ChatGPT and it had made up the case numbers and years.

The way those models work is that they know related words and sentence structure, but the likes of ChatGPT doesn’t understand that something like that needs to be accurate and that you can’t make stuff up. So Wolfram were saying their plugin can be combined with ChatGPT’s conversational structure to plug in actual figures and make accurate essays. TEAMWORK.

I would imagine there’s a good chance Copilot has exactly the same issue. It knows a bit of structure, slaps in the correct programming language, but it has no idea that it’s from a different library version that you aren’t using.

From what I have seen of Copilot, it is very impressive, but it does often give you code that doesn’t quite compile. Still, it gives you a good template and inspiration for how to progress.

In the past I have seen people blindly copy code from the internet, or just do what a colleague suggests without actually thinking about it. I think we are gonna be seeing this more from now on, but it’s gonna be the AI’s fault.

I am not against AI in programming because it can speed up development in certain tedious areas, but it always comes down to the idea that the best programmers are ones with a certain mindset of quality, and I think AI is gonna produce more developers with the wrong mindset because it’s about speed and cutting corners.

I’ve heard people suggest that the next wave of developers could be so dependent on AI that they will be unable to come up with a solution when the AI doesn’t get it right.

Rebase Debate

I’ve written a few blogs about debating issues with the Team Lead. Another argument we had was about the use of Git Rebasing. The blog will be quite long if I go into the detail of what this means, but:

  • Git is a type of source control
  • Source control allows you to see the changes with the date/time they occurred
  • You can create a “branch”, make your changes, then merge many changes at once into the “main” branch.
  • There’s a concept of “merge”, which basically keeps the commits with the timestamps they were originally made
  • There’s another concept of “rebase”, which replays your changes as brand-new commits, so the commit IDs and timestamps look like the changes have just been made
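The difference is easy to demonstrate in a throwaway repo (stock Git, nothing company-specific; assumes Git 2.28+ for `init -b`):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q -b main
git config user.email dev@example.com
git config user.name dev
echo base > file.txt; git add file.txt; git commit -qm "base"

# Make a feature branch with one commit on it
git checkout -qb feature
echo change >> file.txt; git commit -qam "feature change"
before=$(git rev-parse HEAD)

# Meanwhile, main moves on
git checkout -q main
echo other > other.txt; git add other.txt; git commit -qm "main moves on"

# Merge would keep the original "feature change" commit intact;
# rebase instead replays it on top of main as a brand-new commit.
git checkout -q feature
git rebase -q main
after=$(git rev-parse HEAD)
[ "$before" != "$after" ] && echo "rebase created a new commit id"
```

And that new commit ID is exactly why a shared branch breaks: teammates still hold the old IDs, which no longer exist in the rewritten history.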

This new team always rebases, and they end up rebasing repeatedly as they are developing in their branches. A well-known rule is that if multiple developers are working on the same branch, you cannot rebase it without causing problems for your team members.

I raised this as a potential issue because I saw many upcoming items that need multiple developers working on a feature, or working from a shared branch. The Team Lead dismissed it because he has apparently never had problems working with rebase before.

As I forewarned, I got hit:

  1. Jay creates folder A with many files
  2. I branch from his changes
  3. Jay moves all those files to folder B
  4. Jay rebases his branch
  5. I then rebase my branch. Git can’t see the move because the history is rewritten. It keeps folder A with the old files and treats them as mine. It adds folder B with Jay’s edited files.

Later on, the Team Lead was hit with something similar.

Team Lead
rebased fine for me

Team Lead
hmm this is fucking me up now
i rebased onto jay's branch which went fine

Me
but now there is duplicates all over the shop

Team Lead
now i'm trying to rebase onto develop but it's trying to apply jay's commit to it too

Andrew
he rebased his branch again
before merging into develop

Team Lead
but it should recognise that my branch starts at my first commit though shouldn't it

Andrew
not if you rebased onto his before he rebased again
you just have to drop any of his commits

Team Lead
ah right, not come across that before but makes sense

So if you have multiple developers working on the same branch, you should not rebase once commits have been synced by one or more team members. Rewriting the commit history means Git cannot sync the new changes, because it sees everything as a new commit.

Debate about ChecksumGenerator

I was working on recreating functionality in our new API. It was basically a copy-and-paste job from our current product.

I had another debate with the Team Lead who said I shouldn’t have copied the ChecksumGenerator, and wanted me to replace it with a Nuget package.

The idea of using existing packages is from the generic advice of “don’t reinvent the wheel“. If the code already exists and has been thoroughly tested by loads of people, then it’s best to use that. In terms of future fixes and enhancements, if it is a popular open-source package, then other developers will update it. If you rewrite the code for your own company’s use, you are stuck maintaining it.

However, in our case we already have some code, and we know it works since I took it from our other product, where it’s been used for several years. It’s only 58 lines long and has no need to change.

The package he wanted me to use wasn’t actually that popular, so using it adds some risk. I had also just finished my code changes and had already run all the tests, so any changes would need to be retested.

Isn’t using loads of packages just adding bloat? That is such a webapp thing to do. My brief experience of developing a React application has shown me that it uses a crazy number of packages, even with a basic boilerplate application like create-react-app. Then anything you install has loads of other dependencies, and you end up with a bloated install size and loads of potential security risks.

The Team Lead’s justification is that a ChecksumGenerator isn’t directly related to our API so shouldn’t be in our repo, and he wants to “do things properly”. I think if you go to that extreme then the software ends up in development hell. Which is exactly what has happened here, because this API is part of a larger project that has been in development for about 7 years now.

It’s quite hard to win an argument against a Team Lead, because unless the majority say that he is wrong, what he says goes. We have a Junior in our team, the other developer is the Team Lead’s best friend, and the other two are Testers who don’t have much interest in a coding discussion like that. So it’s rare for someone to speak up and cast a winning vote.

I asked my Software Architect friend what he thought:

“That’s completely senseless. Pretty much always use code that works, as long as it performs well as part of it ‘working’.
Does the new package have an allowable licence? 

You should ideally try to avoid dependencies”

Software Architect

I had a discussion with a colleague in another team. He mentioned that infamous time someone got angry and withdrew a package, which took down a large part of the internet. I think we have mitigation for that scenario though.

Robert:

What if the package gets pulled?

Me:

I was thinking about that yesterday. I think we mitigate it by having our own NuGet store. So I think the package goes from the main NuGet source, down to our local NuGet store, then into our builds
might have made it up, but that's how I picture it
so unless you attempt to update the version, you always have a copy of the one you currently reference because it is cached on your server
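If it works the way I picture it, the wiring is just a NuGet.config pointing builds at an internal feed that caches upstream packages. This is a hypothetical sketch; the source key and URL are invented:

```xml
<configuration>
  <packageSources>
    <!-- Resolve only through the internal mirror, which keeps a copy of
         anything it has ever served from nuget.org - so a pulled package
         stays available until we deliberately update the version -->
    <clear />
    <add key="internal-mirror" value="https://nuget.internal.example/v3/index.json" />
  </packageSources>
</configuration>
```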

Asking a Senior Developer how to create an authenticated user

I have a lot of experience as a software developer in Desktop Applications, but moving to a project involving many Web technologies, creating an API is a new challenge to me.

I asked a Senior Developer in the Authentication team how to get set up. I didn’t know where to even begin to create a user, then didn’t know how to get the correct permissions etc. Then the process of developing against “DEV” environments rather than a “Production” URL is another mystery to solve.

Tony: "you need to exchange the Advanced-Auth token for a 3rd-Party-Token"
Me: but don't I need to get the Advanced-Auth token first?
Tony: yes
Me: Don't I need to link my user to be enabled for Advanced-Auth?
Tony: yes, just do it through the app
Me: I have never used the app
Tony: create an account then
Me: but I need to know how to get the app in order to create an account
Tony: host.int.solutions.com thats the url you will need
Me: but if I am making changes in a project, then do we use DEV rather than INT
Tony: So everyone integrates with int as dev is unstable
Me: If no one uses DEV then what is it for?

It’s gonna be a long month.

It’s frustrating that he was only giving me basic information. You’d think there would be a general “New starters” guide he could have linked me to. Or his team should have documentation for creating new users. They probably do; he was just being awkward.

I suppose this is the classic example of a software developer who is skilled at actual coding but not so good at communicating.

How (Not) To Split An API

I’m a software developer who only really has experience with desktop apps, and I was recently put on a project to make an API. I had an interesting debate with the Tech Lead of the project about where we should put our code, and how we should view its responsibilities.

To make it more anonymous, I’ll change the functionality slightly, but it’s the same idea.

An Orders API had been in development for a while, and my team needed to add functionality to send some data to a government API – let’s say it was for ID verification. Even though our initial requirement is that only the OrdersAPI will use the VerifyAPI, you could argue that in future, other applications we have made, or third parties, could call this VerifyAPI directly.

There’s a famous idea in software development: YAGNI (You Ain’t Gonna Need It). It’s the idea that you should program to requirements and not to speculative “what if” scenarios.

The Tech Lead argued that we should put our code in a new repository because it was a separate API. I said that adds loads of overhead, because we will need to write code in OrdersAPI to call our code, then add a reference to our VerifyAPI using a NuGet package. This will slow down development: you need to update 2 repositories, need some temporary reference as you develop, create multiple “Pull Requests”, then publish the NuGet package and update the references once more. I stated this was gonna be a huge inconvenience if the project ends up running over a year.

I also called YAGNI: we will probably never use the API for anything other than OrdersAPI, so it should just go in the same repository. In the event that I am wrong, it should be fairly easy to move out later, as long as we use separate projects to keep our code separate.

He insisted on doing things his way, but the thing is, even though we had a separate repository, it wasn’t a separate API; it was more like a code library. So several months later, he was asking managers if we could create a “mini project” to turn it into an API for clearer separation.

So it seems like we had 2 opposing viewpoints, but ended up somewhere in between with all the disadvantages.

Another interesting debate seemed to illustrate his confused view of what our code is. He had always viewed our code as intending to be an API, but when I was changing some error messages, he said my messages were misleading because our repository is not an API!

The confusion seemed to be him saying the “client” is the OrdersAPI, whereas I see the user of our software as the client; the OrdersAPI is the server call, and it doesn’t matter where it goes next.

The message was something like “Field ‘Date of Birth’ is missing”. He didn’t like the word “field”.

Tech Lead
"I'd change the wording on these messages. We're no longer talking about "fields" since we've split the API request out."

Me
“does it matter where our code is? it's part of the same request as far as the client is concerned”

Tech Lead
"fields" just sounds like API talk

Me
but the client has made an API call

Tech Lead
the client hasn't made an API call though
if those prerequisite checks fail then no API has ever been involved
and even if it has, why would the client need to know anything about an API?
are you talking about the browser client?

Me
isn't it
client -> OrdersAPI -> our library {fail checks} -> error status to the client

Tech Lead
sorry i thought you were referring to the OrdersAPI as the client in this context
which it is
our package shouldn't know that it's being used in an API, that's the whole point of this change

Me
it's a black box for the caller. The user shouldn't know that it's using a library. The code could all be in the same place as far as it is concerned

Then, after more discussion, he was adamant that something else could use our library in future, so then there are 2 usages: an API and a non-API. So it cannot have API-related stuff in it.

But our intention was to have a separate space for our team to maintain; we have never discussed it being used by anything other than the API. The early discussions were about having a repo that was an API.

Daniel
tbh I don't immediately think API when I see "field" I think it's fine

Me
he did say the message could just be
"‘Date of Birth’ is missing"
Which might be better, but then wouldn't you want all the messages to be consistent? However, I guess someone could update the OrdersAPI repo with a new message style, and then forget about ours.

Daniel
you make a good point about consistency though, the API should be consistent regardless of where the code lives

It’s a really trivial argument, but I think this is just the beginning of many debates. Sometimes I think we like adding loads of complexity early on, and then work doesn’t get done.

Experimentation vs Being Safe

When it comes to software development, you can often play it safe using technology you already know, or be more adventurous and use something new. I think the trick is to research the pros/cons of the language and make sure it is suitable for your approach.

There’s no point using something just because it seems cool when it might not be the correct programming language for the job. An entire team investing time learning something new can be a complete waste if the project is then cancelled or restarted due to heading in the wrong direction.

A rule of thumb when choosing technologies:

  • For an experiment? Be as weird as possible.
  • For production? Be as boring as possible.

When it comes to maintenance, sometimes you end up in situations where someone is the “Expert” and therefore has to fix any issues themselves, or will be approached for help by another developer. Therefore, if you write something crazy for production, it will be you that maintains it, either directly or indirectly.

Sometimes becoming the expert in something is the way to get promoted or pay rises though, since you become the super important developer that the company can’t afford to let go. However, that also means you will be stuck on this part of the software, and can’t really move on to different projects.

If you do become one of these experts, and if you want to move on to a new project; you need to train a replacement up. Can you find a replacement that wants to take over, knowing that they will be stuck with this special project? How long will it take to train a replacement? How much documentation did you write? 

Migrating from on-prem servers to the cloud

When migrating from on-prem servers to the cloud, the Deployment team decided to change the way servers were allocated, presumably to minimise cost. They:

“designed separate pools for the public side and private side so that the compute and memory could be dedicated to (and protected from) other types of traffic. Due to this split we reduce the ratio of CPU cores to sites from 0.42 per site to 0.28 per site (as the cores were now dedicated to public and private separately)”.

Deployment expert

Initially, this new way worked fine, but during a particularly busy week they saw slower response times. It actually led to the discovery of a problem we must have had for a while: SQL connections weren’t being properly disposed of, which created a bottleneck on the remaining available connections.
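The connection-leak pattern is a classic. A minimal sketch of the difference, assuming ADO.NET-style code (this is illustrative, not our actual codebase):

```csharp
using System.Data.SqlClient;

class ConnectionUse
{
    // Leaky: without Dispose(), the pooled connection is only returned
    // when the garbage collector eventually finalises it. Under load the
    // pool runs dry and new requests queue up waiting for a free slot.
    static void Leaky(string connectionString)
    {
        var conn = new SqlConnection(connectionString);
        conn.Open();
        // ... run query ...
    }

    // Fixed: Dispose() runs when the using block exits, returning the
    // connection to the pool immediately.
    static void Fixed(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            // ... run query ...
        }
    }
}
```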

They added a temporary fix which was something to do with “Shared app pools”, rather than autoscaling the application servers, which would cost money. But scaling on demand is a major advantage of the cloud.

So, to no one’s surprise, when load increased again, the performance issues returned.

So now the fix should be autoscaling, right? No, they are still reluctant to do so. Instead, they added a fixed number of application servers. Surely that costs money and increases our bill during quieter periods. I suppose I don’t know all the details, but it seems risky to choose a set number and hope that the load never exceeds it.

On Viva Engage, a manager posted a positive message stating that the AWS migration was a big success:

“I am thrilled to announce that we have completed the migration to AWS!

This is a major milestone for our cloud migration programme and has involved many team members across multiple disciplines working together.

We have migrated a whopping 505 TB of data across 4178 databases and have stood up over 1,080 application servers. There has been meticulous planning (over 130 steps in each migration), preparation and countless hours spent migrating our systems, including overnight and weekend working.

The impact of this collective effort extends beyond numbers and statistics. We have successfully improved stability and performance for our end users. The migration has enabled us to navigate the increased load challenges.”

Manager

Yet, someone shared this angry message from a customer. I’m not sure if the first sentence is sarcastic, or if they thought we had been somewhat supportive:

“Thank you for your support in what seems to be a “run” of problems for the business. After our awful experience in November when your software literally tipped over leaving us without a system, I did request that both the ombudsman and your company treated this as a significant event, looked into what went wrong and responded to me with an answer. To date I have not received any such feedback from either party.”

Sarcastic customer

I asked a Software Architect what he thought, since he is usually close to the gossip or involved directly.

The Chief of Smoke and Mirrors will have some explaining to do.
performance improved quite a bit as a result of the 64-bit work done behind the scenes (not to the client)
but now users do things faster with longer sessions
and they have plenty of infrastructure issues around the AWS changes that caused a lot of customers problems
as always, one group of people fix certain things, while another group breaks lots of things at the same time

Architect

So it sounds like there’s been some good work done, but also some mistakes made. Then internally, we are announcing it as a great success.

Someone also showed me this complaint where someone had visited a customer and reported back what they had witnessed:

“We visited a site yesterday displaying nearly all of the problems we have discussed to date – still having to reboot the software 10 to 15 times per day! System slowness (witnessed), documents not opening, closing when going into the orders module, first record load slow, changing an order – system crashes.”

Another reason for performance issues was due to missing config after the migration:

“some of our app servers are downloading/installing Windows Updates in core hours, which is causing poor performance for users.”

A simple workaround that sometimes happens is a “cache reset”. That sounds like a similar mindset to “turn it off and on again”, which does magically fix some problems. However, due to the migration, Support had got a bit confused about how to remote onto the new servers:

“Cache resets were done on the wrong servers.”

Manager explaining why performance issues lingered for longer than expected.

Even after further tweaks to the cloud migration, updating the client to 64 bit, fixing SQL connections, and some other miscellaneous changes, the Support team were saying some sites were still having problems:

“Can I confirm that things should be improving for all sites following all the brilliant work done? The customer is experiencing the below and I am visiting them tomorrow:”

Customer issues

  • Loading can take several minutes
  • Slowness and crashing every day, at least 9 or 10 times a day
  • No discernible pattern or time of day for slowness or crashing, and no particular machine is noticeably better or worse
  • Performance issues for 2 years, but they have gotten much worse recently (last 6 months)
  • Significant delays when uploading records
  • Can take up to an hour to approve a small number of external requests, which can involve multiple restarts
  • Switching between records can lead to delays and a ‘greyed out screen’ (not responding)
  • Crashes constantly and randomly and needs restarting, so staff are having to partition out tasks such as viewing documents and approving tasks

Closing statement

It does seem like our performance issues are a bit of a mystery. We have blamed customer internet, SQL connections, the 32-bit client, on-prem servers, and a caching bug, so I think we have run out of things to blame. Hopefully one day we will have a fast system.

Cloud FinOps

Over the last few years, my employer has gone Cloud crazy. We are a large company so we have our own data centres. These are costly to run when you need physical space, staff, electricity, software licensing, and a plan of action when things go wrong.

I wonder if it is better to have your own servers when you are a big company. I always think Cloud is best for smaller companies that don’t have the resources to host it themselves.

“Our reasons for using the cloud are the same as others using the cloud.”

Our CTO

Not really true though is it? From what I saw quoted for the virtual machines for our test systems, I think Cloud is more expensive over time. On-prem has a massive up-front cost which is what they don’t like, but we have the capital to do it, unlike small companies that the Cloud is perfect for.

The recent drive to move away from our data centres came from needing to replace some old hardware and perform SQL Server upgrades.

I could imagine us moving to the cloud, managers then panicking when they see the monthly costs, then demanding we go back.

One aspect of the SQL Server upgrade involved migrating the data to a new physical server. One of the tables they were concerned about was Audit, which gains a new row every time a user edits a record; they stated it held around 9 billion rows. Each row stores a copy of the changed data as XML so you can do a before/after comparison, and that particular column was the problem.

So, for the data that would remain in our data centres and move to a new server running a modern SQL Server version, the plan was to migrate the Audit table without the XML column. In its place, a new boolean (true/false) column states whether detail data exists, and the XML itself is moved to the cloud.

So now we are paying to host the database in our own data centre but keeping certain data in AWS, which sounds like it should be more expensive. The justification is that we didn’t need to buy as much hard disk storage, which they reckoned could have cost a massive £500k, and the migration to the new server in the data centre was faster without the XML.

Still, we needed to transfer the data to the AWS Cloud storage. I think the idea was that Audit data isn’t accessed much, so it’s better to move it to a cheaper but slower storage method, then request it on demand. So in our software, instead of displaying the data instantly when you view that record, there would be a “view more detail” button, and only then do we request it and show it.
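As a sketch, the “view more detail” flow might look something like this. The schema and names here are hypothetical, since I haven’t seen the real design, and a plain dict stands in for the AWS object store:

```python
def load_audit_detail(row, blob_store):
    # `row` mirrors the migrated table: a boolean `has_detail`
    # column replaces the old XML column. `blob_store` stands in
    # for the cloud object store, keyed by audit id. All names
    # here are hypothetical.
    if not row["has_detail"]:
        return None  # nothing was offloaded for this row
    # Only fetched when the user clicks "view more detail"
    return blob_store.get(row["id"])

# Example: one audit row with offloaded XML, one without
store = {42: "<before/><after/>"}
with_detail = load_audit_detail({"id": 42, "has_detail": True}, store)
without_detail = load_audit_detail({"id": 7, "has_detail": False}, store)
```

The point of the flag is that the common case (browsing records) never touches the slower storage at all; only the rare on-demand click pays the round trip.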

I think the mindset is to focus on the cost figures that are immediately visible. £500k sounds like a crazy figure, but if we look at the cost over a few years, does buying storage for our own servers really outweigh the cost of paying Amazon to store it?
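A back-of-envelope comparison makes the point. Every figure below is a made-up assumption for illustration, not our actual quote or real AWS pricing:

```python
# All figures are illustrative assumptions, not real quotes.
ON_PREM_UPFRONT = 500_000   # the quoted one-off hardware cost (GBP)
ON_PREM_YEARLY = 20_000     # assumed yearly power/licensing share (GBP)
CLOUD_PER_TB_MONTH = 5      # assumed cold-storage price (GBP/TB/month)
DATA_TB = 400               # assumed size of the offloaded audit data

def cumulative_cost(years):
    """Total spend after `years` for each option, under the
    assumptions above: on-prem pays upfront plus running costs,
    cloud pays per TB per month with no upfront."""
    on_prem = ON_PREM_UPFRONT + ON_PREM_YEARLY * years
    cloud = CLOUD_PER_TB_MONTH * DATA_TB * 12 * years
    return on_prem, cloud
```

Under these made-up numbers, five years on-prem comes to £600k against £120k in the cloud; with different assumptions the answer flips, which is exactly why the time horizon matters more than the sticker price.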

A new corporate buzzword that gets thrown around in this subject is FinOps, as in Financial Operations. 

“One of the challenges we have when we start to build a new service is around estimating the potential cost of that new service in AWS. This ultimately goes towards setting the budget expectation for that service and therefore how we monitor it from a FinOps perspective. Do we have any experience within the department or anything we can leverage to help us get better at understanding the potential budget expectations for a new service we’re building?”

Concerned staff member

In one of the recent “Town hall” meetings, the CEO was ranting about how high our cloud costs were. He said we currently had £250k in AWS servers that are switched off (not sure if that was a yearly figure or, even more unbelievable, monthly). These were servers just for development/testing. If our testing teams are spending £250k on servers we aren’t really using, how much are we spending on ones we are actively using? Then how much does our live system cost?

Now when you see those figures, that £500k hard disk storage doesn’t sound too bad.

“FYI – Stopped instances don’t incur charges, but Elastic IP addresses or EBS volumes attached to those instances do.”

Cloud expert

He is throwing around more jargon there.
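To unpack it: stopping an EC2 instance stops the compute bill, but the storage volumes (EBS) and reserved IP addresses attached to it keep billing. A toy illustration of why a “switched off” fleet still costs money, using a placeholder price rather than a real AWS rate:

```python
EBS_GBP_PER_GB_MONTH = 0.08  # placeholder price, not a real AWS rate

def monthly_cost_stopped(volume_gb, n_instances):
    # Compute charge is zero while the instances are stopped,
    # but each attached EBS volume still bills per GB-month.
    compute = 0.0
    storage = EBS_GBP_PER_GB_MONTH * volume_gb * n_instances
    return compute + storage
```

So dozens of stopped test servers, each with a sizeable disk attached, are far from free, which squares with the CEO’s complaint.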