Batches
The performance of SQL database queries is an interesting topic that I have never fully understood. When dealing with a large amount of data, we are often told it is beneficial to process the results in batches. So if you want to update or delete rows based on some calculation, for example, we may be told to iterate through groups of 1000 rows, as in the sketch below.
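A minimal sketch of that advice, with invented table and column names:

```sql
-- Minimal sketch of the batching advice (table and column names are
-- invented). Delete matching rows 1000 at a time until none remain.
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (1000)
    FROM dbo.SomeTable
    WHERE SomeFlag = 1               -- stand-in for "some calculations"
      AND ModifiedOn < '2020-01-01';

    SET @rows = @@ROWCOUNT;          -- 0 once the last batch is done
END;
```

The catch, as the next anecdote shows, is that the filtering work runs again for every batch.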
You can’t just put your query in a batch and then assume it is efficient. There was one attempted data fix where the reviewing Database Expert reckoned it would take 2h 46min to execute:
“It doesn’t matter if you select 1 row or 1m rows – it still takes 10 seconds. So 1000 batches x 10 sec = 2h 46min processing time just working out what to delete. We need to change the proc so that it gets all 1m rows in one hit at the start into a temp table (should take 10 seconds) and then use that list to work from when batch deleting the rows”
Database Expert
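A hedged sketch of what the Database Expert is describing, again with invented names: run the expensive filter once into a temp table, then batch the deletes against that list.

```sql
-- Sketch of the suggested fix (names invented): run the expensive
-- filter once into a temp table, then batch-delete by key.
SELECT Id
INTO #ToDelete
FROM dbo.SomeTable
WHERE SomeFlag = 1
  AND ModifiedOn < '2020-01-01';     -- the slow part runs once (~10s)

DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (1000) t
    FROM dbo.SomeTable AS t
    INNER JOIN #ToDelete AS d
        ON d.Id = t.Id;              -- each batch is now a cheap key join

    SET @rows = @@ROWCOUNT;
END;

DROP TABLE #ToDelete;
```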
Performance Tales: LINQ to SQL
Although we mainly use SQL Stored Procedures to retrieve data from the database, some of our code uses LinqToSQL, which generates SQL dynamically from code. Handcrafting your own database query (a Stored Procedure) can sometimes give better and more consistent results, since the SQL is the same every time. Sometimes the performance issue isn’t with the actual query at all, but with a column that doesn’t have an appropriate Index applied to allow fast lookups.
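As a hypothetical illustration (the table, columns, and index name are invented), a covering index on the filtered column is often the fix:

```sql
-- Hypothetical illustration: if the generated SQL filters on CustomerId
-- and that column has no index, a covering index turns a scan into a seek.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
ON dbo.Orders (CustomerId)
INCLUDE (OrderDate, Total);  -- covers the selected columns, avoiding lookups
```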
There was a problem where our default 30 second timeout was being reached for some users. If they attempted to load a large record, it would take 30 seconds and then error out. One developer suggested increasing the timeout from 30 seconds to 5 minutes.
“I’d like the customers to still be able to see the user’s record.”
Developer
The record would presumably load eventually, if the user actually waited for 5 minutes without closing the application thinking it had crashed (often the user interface in this example would just show “Not Responding”).
This idea doesn’t fix the root problem though, and the software still seems unusable from the user’s point of view.
An Architect stated:
“there is something very wrong with the query if it’s taking longer than 30 seconds, particularly when the use case of using LINQ2SQL is commonly for small batches of data.”
Software Architect
Hitting the default timeout is usually a good indicator that something is wrong, and if we increase the timeout across the board we will miss such problems.
The reason the timeout is being breached needs to be investigated (see the sketch after this list):
- Is it general: applying to any query seemingly at random?
- Is it during the connection to SQL Server or during the execution of a query?
- Could a particular problem query be optimised, be made asynchronous, or the timeout be altered for that individual query?
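One way to start that investigation is to ask SQL Server what is running right now, for how long, and whether it is blocked. This sketch uses standard dynamic management views:

```sql
-- Show currently executing requests, their elapsed time, and any blocker.
SELECT r.session_id,
       r.status,
       r.blocking_session_id,                 -- non-zero: waiting on another session
       r.wait_type,
       r.total_elapsed_time / 1000 AS elapsed_s,
       t.text AS query_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID                  -- ignore this query itself
ORDER BY r.total_elapsed_time DESC;
```

A request that has been running for minutes with a non-zero blocking_session_id points at blocking; a long runtime with no blocker points at the query or its plan.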
The Database Expert stated:
“Nothing by default should take over 30 seconds – it’s a good default. If we have particular problems with SQL taking over 30s it should be investigated and addressed. It is possible that some features should be allowed to take over 30s (e.g. not user facing, known to take a long time). Allowing >30s means more chance for blocking and wider impacts.
Having queries run longer than 30s increases the amount of time they are running in SQL – this could lead to more blocking and more CPU/memory demand, which could make the entire server unreliable, so we go from 1 user complaining to many.
LINQtoSQL can be optimised, we’ve done it many times over the years. The simplest is to replace the queries with Stored Procedures – but it is possible to improve bad LINQtoSQL too. It depends what is wrong. None of our daily operations should be anywhere near that.
SQL can do near 0ms operations on many more rows of data than we have. It isn’t a problem because we have more data, it will be something with that query. Poor plan choice or blocking.”
Database Expert
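As a sketch of the “replace the queries with Stored Procedures” route (the procedure, table, and column names here are invented, not from our codebase), with OPTION (RECOMPILE) as one blunt counter to poor plan choice:

```sql
-- Invented example: a handcrafted procedure replacing a LINQ-generated query.
-- OPTION (RECOMPILE) avoids reusing a plan sniffed for an unrepresentative
-- parameter, traded against a compile on every call.
CREATE PROCEDURE dbo.GetRecentOrdersForCustomer
    @CustomerId int
AS
BEGIN
    SET NOCOUNT ON;

    SELECT TOP (100) OrderId, OrderDate, Total
    FROM dbo.Orders
    WHERE CustomerId = @CustomerId
    ORDER BY OrderDate DESC
    OPTION (RECOMPILE);
END;
```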
Performance Tales: Limit User Input
A developer looked into an issue related to our Searches module, which allows the user to query their data and generate tables and charts to create their own reports. The Developer claimed their new changes made performance “up to 10 times better”.
The Database Expert looked at his changes and stated:
“I think that this might not be the right way to approach this issue in isolation. The issue with this procedure is not that it is slow in general, it is that our software allows you to call it in a huge variety of ways, some of which can only produce terrible execution plans as it stands. We should be analysing what kinds of searches users do, and see if we can restrict some of the sillier searches that they can do. For example, users can right now do an audit search filtered on “anything before today’s date” with no further filters.
For example, see this snippet of a live trace done right now: <shows screenshot>
Clearly, some calls are very well optimised and others are terrible.
We should be looking at restricting what users can do, for example requiring people to search on a more focussed timeframe.
The way I would approach this issue would be to look at trace data and look for two main things:
- The styles of queries that are run most often
- The styles of queries that run the worst
Both need focus. The ones that run most often, even if they are relatively quick, can yield a big improvement simply from scale alone. The ones that run the worst improve the overall user experience as well as the impact on the system.
Of course improving a query might not involve any SQL changes at all, instead they might involve app changes to prevent silly queries from being run.”
Database Expert
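A sketch of those two views using the plan-cache statistics DMV (a trace or Query Store would answer the same questions; the TOP counts here are arbitrary):

```sql
-- 1. The styles of queries that are run most often
SELECT TOP (20)
       qs.execution_count,
       qs.total_elapsed_time / qs.execution_count AS avg_elapsed_us,  -- microseconds
       SUBSTRING(t.text, 1, 200) AS query_start
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS t
ORDER BY qs.execution_count DESC;

-- 2. The styles of queries that cost the most overall
SELECT TOP (20)
       qs.execution_count,
       qs.total_elapsed_time / qs.execution_count AS avg_elapsed_us,
       SUBSTRING(t.text, 1, 200) AS query_start
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS t
ORDER BY qs.total_elapsed_time DESC;
```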
Keep lowering the number with each MI
There was a Major Incident regarding a feature. The stored procedure involved already had a limit on the amount of data selected in one go, and a developer changed that number from 50 to 10. Since these are already very small numbers, I couldn’t understand why the change would make a difference. A code comment next to the limit said it was there to “improve performance”, so I looked at the file history to see if I could find any more information about why the line was added in the first place. I saw a very interesting trend: that line was the only line changed in each of the previous two revisions:
original: select top 250
24 Sept 2015: select top 100
2 Oct 2015: select top 50
25 May 2022: select top 10
The developer did explain that the data is passed to another process that takes 90 seconds to process 10 instances. It’s essentially a queue that is polled every 10 minutes and would only have a small number of tasks each time.
The concerning thing is that the number keeps being lowered and performance is still not deemed good enough. Maybe the overall implementation needs revising.
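If the queue design itself were revisited, one common SQL Server pattern is to claim a batch atomically and let concurrent pollers skip locked rows rather than block behind them. This is only a sketch of that pattern, with invented table and column names, not the actual procedure:

```sql
-- Sketch only (invented names): claim up to 10 pending tasks atomically.
-- READPAST lets concurrent pollers skip rows another session has locked.
UPDATE TOP (10) q
SET q.Status    = 'Processing',
    q.ClaimedAt = SYSUTCDATETIME()
OUTPUT inserted.TaskId, inserted.Payload     -- return the claimed batch
FROM dbo.TaskQueue AS q WITH (ROWLOCK, READPAST)
WHERE q.Status = 'Pending';
```

Note that TOP in an UPDATE picks rows in no guaranteed order; if strict first-in-first-out processing matters, the usual workaround is a CTE with an ORDER BY feeding the update.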