Logging Too Many Errors

When things go wrong with your software, it’s obviously good practice for the developer to log relevant information into an error log. You can then know when users are affected, but also how many are affected – to understand how widespread the issue is. With that information, it becomes easier to triage. There can be loads of other projects to work on, and bugs to fix, so being able to prioritise them is key.

Choosing what to log and how often can require some thought. You can split your logging into categories such as Information, Warning, and Error. An Error means something has gone wrong. A Warning means you suspect something has gone wrong, like missing config. Information is anything useful for debugging: for example, if there is an optional service the user connects to, you can log “user connected” and “user disconnected”.
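As a minimal sketch of those three categories, here is how they might map onto the levels in Python’s standard logging module. The logger name, the function and the config key are all made up for illustration; this isn’t our actual code.

```python
import logging

log = logging.getLogger("chat")  # "chat" is a made-up logger name


def connect_optional_service(user_id: str, config: dict) -> bool:
    """Hypothetical helper that logs one message per severity category."""
    url = config.get("service_url")  # "service_url" is an assumed config key
    if url is None:
        # Warning: suspicious, but not a confirmed failure (missing config).
        log.warning("service_url missing from config; skipping optional service")
        return False
    if not url.startswith("https://"):
        # Error: something has definitely gone wrong.
        log.error("cannot connect user %s: invalid service_url %r", user_id, url)
        return False
    # Information: handy when debugging later.
    log.info("user %s connected", user_id)
    return True
```

Keeping the categories separate means you can filter the Information entries out of whatever you monitor day to day and still have them when you need to debug.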

We have a chat functionality which uses a PubSub (publish/subscribe) model, and we were logging status changes and connection statuses. If you just blindly log scenarios like this, it can be counterproductive. If the statuses change frequently and there are thousands of users, you end up spamming the error log, which makes it harder to see the real problems. If you see the same entries logged again and again, you are likely to think “we expect that” and just ignore them.
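One way to stop repeated entries drowning everything else out is to rate limit them. Below is a rough sketch using a filter from Python’s standard logging module. The RepeatLimiter class and its 60 second interval are my own invention for illustration, not what our chat code does.

```python
import logging
import time


class RepeatLimiter(logging.Filter):
    """Let each distinct message template through at most once per interval."""

    def __init__(self, interval: float = 60.0):
        super().__init__()
        self.interval = interval
        self._last_emitted = {}  # (level, template) -> time it was last let through

    def filter(self, record: logging.LogRecord) -> bool:
        now = time.monotonic()
        # Key on the unformatted template so that different user ids
        # in the arguments still count as "the same entry".
        key = (record.levelno, record.msg)
        last = self._last_emitted.get(key)
        if last is not None and now - last < self.interval:
            return False  # drop the repeat
        self._last_emitted[key] = now
        return True
```

You would attach it with something like `logging.getLogger("chat").addFilter(RepeatLimiter())`. A filter on a logger only sees records logged directly on that logger, so attach it to a handler instead if it needs to cover child loggers too.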

There can be extra costs associated with logging too. Data takes up storage space, and adding thousands of rows to a database per day can quickly increase its size. All those extra network calls add up too.

We have had a few projects recently with the aim of cutting down the number of errors logged.

When the logs point to a real problem, fixing the root cause is obviously the best strategy. If the logs aren’t useful, then it’s best to stop logging them.

If the logs are useful, sometimes it’s best to cut them down rather than stop completely. So if you log “failed to connect” and the client retries a few seconds later, do you really want to log another “failed to connect”? Maybe the functionality should try 5 times, then give up until the user manually attempts to reconnect. Maybe the logs could remain on the user’s computer and be submitted once, with the number of failed attempts. So instead of 5 separate entries, it could submit just 1 saying it tried 5 times and then gave up.
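The “try 5 times, then log once” idea could look something like the sketch below. The function name, the parameters and the 2 second delay are assumptions for the example rather than our real implementation.

```python
import logging
import time

log = logging.getLogger("chat")  # made-up logger name


def connect_with_retries(connect, max_attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call `connect` up to max_attempts times and log a single summary on failure.

    `connect` is any callable that raises ConnectionError when it fails.
    Returns whatever `connect` returns, or None if every attempt failed.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc  # remember it, but don't log each failure
            if attempt < max_attempts:
                sleep(delay_seconds)
    # One entry instead of max_attempts separate ones.
    log.error(
        "failed to connect after %d attempts, giving up: %s",
        max_attempts,
        last_error,
    )
    return None
```

After it gives up, nothing more is logged until something calls the function again, for example when the user manually clicks reconnect.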

On a large-scale system like ours, the number of entries in the databases is crazy. Read this statement from a concerned Support team member (I think these were the stats one month after a recent release):

Based on the daily volume of errors logged over the past few days I’m expecting the number of errors logged in Monitoring to increase by 82% over the course of a month.

Current Stats from Server11:

  • Total Rows in ErrorLog: 11,519,198
  • PubSub Errors: 3,283,396
  • Errors excluding PubSub Errors: 8,235,802
  • Avg PubSub Errors/day: 218,893
  • Estimated PubSub Errors (31 Days): 6,785,685
  • Data Size per error (bytes): 3,015
  • Estimated Data Size of PubSub Errors (MB): 21,072
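
As a rough sanity check on how those figures relate, here is a small Python sketch. The idea that the 82% is measured against the non-PubSub errors is my assumption; the Support message doesn’t say which baseline it used.

```python
# Figures copied from the Support stats.
total_rows = 11_519_198
pubsub_errors = 3_283_396
avg_pubsub_per_day = 218_893
estimated_pubsub_31_days = 6_785_685

non_pubsub_errors = total_rows - pubsub_errors

# Days of PubSub logging implied by the total and the daily average.
days_observed = pubsub_errors / avg_pubsub_per_day

# Assumption: the growth is measured against the non-PubSub errors.
growth_percent = 100 * estimated_pubsub_31_days / non_pubsub_errors

print(non_pubsub_errors)      # 8235802
print(round(days_observed))   # 15
print(round(growth_percent))  # 82
```

On that reading, a month of PubSub errors alone would add about 82% on top of everything else already in the ErrorLog.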

 
