Software Performance Tales

Looking through my draft blogs, I have quite a few to do with performance, so I have grouped them together to make a complete blog entry.

“the whirling circle thing is just whirling”

User annoyed at the slowness of our system

Do we care about performance?

I’ve listened to some tech podcasts that have discussed Big O notation, which is a way of classifying the performance of an algorithm, such as O(n) or O(n^2), but outside of studying Computer Science at university, I personally have never heard anyone else reference it. I think with some programming jobs, the efficiency of algorithms is crucial. For the most part, I don’t think people care unless you are doing something that turns out incredibly slow. In the age of Cloud Computing, where you can be charged by the millisecond, it will become increasingly important for server-side algorithms.
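To make the O(n) versus O(n^2) distinction concrete, here is a minimal sketch in Python. The function names are made up for illustration and nothing here comes from our codebase. Both functions answer the same question (does this list contain a duplicate?), but the first compares every pair of items while the second makes a single pass.

```python
def has_duplicates_quadratic(ids):
    """O(n^2): compares every pair of items."""
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if ids[i] == ids[j]:
                return True
    return False


def has_duplicates_linear(ids):
    """O(n) on average: one pass, remembering what has been seen in a set."""
    seen = set()
    for item in ids:
        if item in seen:
            return True
        seen.add(item)
    return False
```

With a handful of items you would never notice the difference. With hundreds of thousands, the pairwise version does billions of comparisons while the single-pass version does one set lookup per item.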

Where I work, we are very reactive rather than proactive. Recently, we have had loads of complaints about our performance, so we have identified the slowest areas and addressed them.

Thinking about performance upfront would mean the user gets the best experience and doesn’t complain. However, there are plenty of examples of code which runs fine on your machine but runs poorly in live. This can be because you have a faster computer, the network speeds are faster, there is lower latency between the servers (when developing, the client and server are on the same computer so there is no latency), or you don’t have a database which represents live: you have 100s of rows rather than 100,000s. I think this last reason is often a cause for us, in addition to concurrency (thousands of live users simultaneously using the system, compared to a few testers using a test server).

An example of how we don’t often consider performance is as follows:

Example 1

I showed some initiative recently and essentially rewrote a project that was inefficient. 

  1. Some server calls had a check to see if a feature was enabled before attempting to retrieve data from the database, but some didn’t. Therefore these code paths ran queries against the database before returning no data.
  2. When the feature was enabled, other parts of the code made the same server call twice in the same sequence of actions.
  3. Other places made a server call even though the client already had the data from a previous call.
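The three problems in that list, and the kind of fix involved, can be sketched roughly like this. Every class and method name here (FakeDatabase, Server, Client, get_records) is invented for illustration. It is not our real code, just a minimal sketch of checking the feature flag before touching the database and reusing data the client already has.

```python
class FakeDatabase:
    """Stand-in for the real database; counts how many queries are run."""

    def __init__(self):
        self.query_count = 0

    def fetch_records(self, user_id):
        self.query_count += 1
        return [f"record-for-{user_id}"]


class Server:
    def __init__(self, db, feature_enabled):
        self.db = db
        self.feature_enabled = feature_enabled

    def get_records(self, user_id):
        # Point 1: check the flag first, so a disabled feature never queries.
        if not self.feature_enabled:
            return []
        return self.db.fetch_records(user_id)


class Client:
    def __init__(self, server):
        self.server = server
        self._cache = {}

    def get_records(self, user_id):
        # Points 2 and 3: reuse data from an earlier call instead of asking again.
        if user_id not in self._cache:
            self._cache[user_id] = self.server.get_records(user_id)
        return self._cache[user_id]


db = FakeDatabase()
client = Client(Server(db, feature_enabled=True))
client.get_records(42)
client.get_records(42)
print(db.query_count)  # 1
```

The sketch ignores cache invalidation: a real client would need to know when the cached data has gone stale.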

It was a mess basically.

The functionality behaved correctly and the performance didn’t seem out of the ordinary, so the Software Testers never flagged it. We were probably talking an extra 50ms to 200ms in total when testing internally, although this would be worse when deployed live. It should have been obvious to the Development team that there were inefficiencies if they were thinking about possible improvements to their codebase, or if they were paying attention to the server calls and noticed their new server calls still being made when the feature was switched off.

Volume and Performance Testing

We have loads of test environments – virtual machines with databases and servers where the Software Testers can test our code changes. I would have thought these would be fine to do performance testing, and spam these test servers with calls, but a Test Environment Engineer once sent an angry email:

Our standard test environments are Absolutely not to be used for Volume and Performance testing. This can impact the whole devtest infrastructure, as well as some live services that share the same infrastructure. Doing this could cause a Major Incident.

Test Environment Engineer

This seems bizarre to me. Why would you share resources with the live production environment? Surely the Test Environment should be in its own isolated sandbox, where you can create, update, and delete with safety and freedom.

Example 2

We had another performance issue where the system appeared to be trying to retrieve a large amount of data. When the developer analysed it, we weren’t actually trying to retrieve a large amount of data after all.

“There were 142,858 Id’s passed into the comma separated string, I found that there were only 64 distinct ID’s once I removed the duplicates. After the duplicates were removed it went from 76minutes duration to under 1 second”.

Software Developer

What a performance improvement! 

I wonder if passing the data in a form that isn’t a comma-separated string would also help. If you are passing in “12, 156, 201, 202, 284”, then, to get the IDs, the database has to split the string on the commas and place the numbers in a table. If you just pass it in as a table, then you are cutting out a step on the database layer.
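Both ideas, removing the duplicates and handing the database a table of IDs instead of a string, can be sketched as follows. This is only an illustration under assumptions: I’m using Python’s built-in sqlite3 module with an in-memory database as a stand-in for our real database, and the table names (records, wanted_ids) and function names are invented.

```python
import sqlite3


def parse_distinct_ids(csv_ids):
    """Split a comma-separated string into distinct integer IDs, keeping first-seen order."""
    ids = (int(part) for part in csv_ids.split(",") if part.strip())
    return list(dict.fromkeys(ids))


def fetch_by_ids(conn, ids):
    """Load the IDs into a temporary table and join against it."""
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS wanted_ids (id INTEGER PRIMARY KEY)")
    conn.execute("DELETE FROM wanted_ids")
    conn.executemany(
        "INSERT OR IGNORE INTO wanted_ids (id) VALUES (?)", [(i,) for i in ids]
    )
    return conn.execute(
        "SELECT r.id, r.name FROM records AS r "
        "JOIN wanted_ids AS w ON w.id = r.id ORDER BY r.id"
    ).fetchall()


# Hypothetical data, purely for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)", [(12, "a"), (156, "b"), (201, "c")]
)

ids = parse_distinct_ids("12, 156, 12, 201, 156")
print(ids)                      # [12, 156, 201]
print(fetch_by_ids(conn, ids))  # [(12, 'a'), (156, 'b'), (201, 'c')]
```

Other database engines have their own ways of passing a set of values as a table, so the exact mechanism would differ, but the principle of de-duplicating before the query runs is the same.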

Example 3: SQLite

One part of our system used a local file cache (loads of files were downloaded to the user’s machine, then our software would just load up these XML files rather than going to the server every time). It has worked fairly well, but over time, the number of files has grown.

I think the performance issue comes from the fact that all the files are then loaded and kept in memory, so it is a memory hog, which sometimes leads to “Out Of Memory” errors. Maybe the solution is to work out if we can dynamically load these files in full only when they are needed, while only having the “metadata” in memory at all times.
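Here is a rough sketch of what I mean by keeping only the metadata in memory. It is not how our cache actually works. The class name LazyFileCache, the choice of file size as the “metadata”, and the limit on loaded files are all assumptions made for the example.

```python
import tempfile
from functools import lru_cache
from pathlib import Path


class LazyFileCache:
    """Keeps small metadata for every file, but reads file contents on demand."""

    def __init__(self, directory, max_loaded=32):
        self.directory = Path(directory)
        # Metadata only: file name -> size in bytes. No contents are read here.
        self.metadata = {
            p.name: p.stat().st_size for p in self.directory.glob("*.xml")
        }
        # At most max_loaded file contents stay in memory; the least recently
        # used one is dropped when the limit is reached.
        self._load = lru_cache(maxsize=max_loaded)(self._read_file)

    def _read_file(self, name):
        return (self.directory / name).read_text(encoding="utf-8")

    def get(self, name):
        if name not in self.metadata:
            raise KeyError(name)
        return self._load(name)


# Demonstration with three throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        Path(tmp, f"doc{i}.xml").write_text(f"<doc id='{i}'/>", encoding="utf-8")

    cache = LazyFileCache(tmp, max_loaded=2)
    print(sorted(cache.metadata))  # ['doc0.xml', 'doc1.xml', 'doc2.xml']
    print(cache.get("doc1.xml"))   # <doc id='1'/>
```

The trade-off is that the first access to a file pays the cost of reading it from disk, in exchange for memory use that stays bounded however many files accumulate.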

Colin came up with an idea to move the files into a local database using SQLite. This innovative idea impressed the managers, who were eager to do it. I was sceptical because I think we would still end up loading all the files, just the place where they are stored has changed.

Day 1 of release, the following Major Incident was raised:

SQLite corruption. Out of memory

After that was fixed, I believe the changes caused 3 additional Major Incidents. Colin still swears by it. He thinks it was an amazing idea. I ain’t convinced.

User Feedback

It’s rare that I come across comments from users about our software, but they do often make me laugh. Part of the problem is that they use software, but aren’t really technical enough to use the correct terms, so when complaining about a performance issue, one user remarked:

“the whirling circle thing is just whirling”

User annoyed at the slowness of our system

Another thing about user complaints is that I tend to only hear about the really angry complaints and not just generic comments. I think Support had told a user there was a large amount of data on that particular record, and the more data on the record, the longer it is going to take to load.

“Regarding larger records, this is a predictable problem. What mitigations are being put in place to offset that? I still have 20 years of working life yet, don’t tell me that by 2035 I will have to wait 20 minutes to load a record???!”

Angry User

It’s a valid complaint. We need to come up with a different way of loading the record so they aren’t sitting around waiting for the full record to load.
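One possible approach, and it is only a guess at what might suit us, is to load the record in pages so the user can start reading the first page while the rest arrives. A minimal sketch, with an invented function name and a plain list standing in for the record’s entries:

```python
from typing import Iterator, List, Sequence


def load_in_pages(entries: Sequence[str], page_size: int = 50) -> Iterator[List[str]]:
    """Yield the record's entries one page at a time, starting from the first."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for start in range(0, len(entries), page_size):
        yield list(entries[start:start + page_size])


# The caller can display each page as soon as it is produced.
for page in load_in_pages(["a", "b", "c", "d", "e"], page_size=2):
    print(page)
```

In a real system each page would be a separate, smaller server call rather than a slice of a list that is already in memory.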
