SQL Server Performance: Disk Latency

Generally speaking, SQL Server these days is limited by disk read/write speed. This means that one of the most important factors in a SQL Server’s performance is its ability to perform I/O operations quickly. If the I/O subsystem cannot keep up with the demand being placed on it, the SQL Server workload will suffer performance problems.

SQL Server is slow. It looks like the disks…

A problem I’ve seen in the past is people equating increased I/O subsystem latency with poor I/O subsystem performance. This is why baselining systems is so important, or at least having some monitoring in place where you can see trends in these sorts of metrics. The I/O subsystem may have been performing fine under the workload it was designed for, only becoming the bottleneck once the workload grew past that design point. In that case, the workload increase is what’s causing the problem, not the disk subsystem.

Right? Why are you telling me this?

Don’t just go online, read some blog that says “This is how you diagnose disk issues”, then run off and annoy your infrastructure team by telling them their disks are broken. That’s not going to win you any friends.

A Script…

Your first step is knowing whether I/O latency has increased. This script uses sys.dm_io_virtual_file_stats to display per-file disk latency information:

-- Average read/write latency (ms) and transfer sizes per database file,
-- cumulative since the last SQL Server restart.
SELECT
    [ReadLatency] =
        CASE WHEN [num_of_reads] = 0
            THEN 0 ELSE ([io_stall_read_ms] / [num_of_reads]) END,
    [WriteLatency] =
        CASE WHEN [num_of_writes] = 0
            THEN 0 ELSE ([io_stall_write_ms] / [num_of_writes]) END,
    [Latency] =
        CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0)
            THEN 0 ELSE ([io_stall] / ([num_of_reads] + [num_of_writes])) END,
    [AvgBPerRead] =
        CASE WHEN [num_of_reads] = 0
            THEN 0 ELSE ([num_of_bytes_read] / [num_of_reads]) END,
    [AvgBPerWrite] =
        CASE WHEN [num_of_writes] = 0
            THEN 0 ELSE ([num_of_bytes_written] / [num_of_writes]) END,
    [AvgBPerTransfer] =
        CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0)
            THEN 0 ELSE
                (([num_of_bytes_read] + [num_of_bytes_written]) /
                 ([num_of_reads] + [num_of_writes])) END,
    LEFT ([mf].[physical_name], 2) AS [Drive],
    DB_NAME ([vfs].[database_id]) AS [DB],
    [mf].[physical_name]
FROM
    sys.dm_io_virtual_file_stats (NULL, NULL) AS [vfs]
JOIN sys.master_files AS [mf]
    ON [vfs].[database_id] = [mf].[database_id]
    AND [vfs].[file_id] = [mf].[file_id]
-- WHERE [vfs].[file_id] = 2 -- log files only
-- ORDER BY [Latency] DESC
-- ORDER BY [ReadLatency] DESC
ORDER BY [WriteLatency] DESC;
GO

The counters are initialized to zero whenever the SQL Server service is started, so the I/O times reported by the script are averages over the entire uptime. The longer the period the data set covers, the more likely it is that transient latency spikes will be masked.
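If you want to see what latency looks like right now rather than averaged since the last restart, you can sample the DMV twice and work on the deltas. Here’s a rough sketch; the 30-second sample window is arbitrary, so lengthen it for a steadier average:

-- Sample the cumulative counters twice and diff them, so the latency
-- figures cover only the sample window rather than the whole uptime.
SELECT [database_id], [file_id], [num_of_reads], [io_stall_read_ms],
       [num_of_writes], [io_stall_write_ms]
INTO   #Snapshot
FROM   sys.dm_io_virtual_file_stats (NULL, NULL);

WAITFOR DELAY '00:00:30'; -- arbitrary sample window

SELECT
    DB_NAME ([vfs].[database_id]) AS [DB],
    [mf].[physical_name],
    [ReadLatency] =
        CASE WHEN ([vfs].[num_of_reads] - [s].[num_of_reads]) = 0
            THEN 0
            ELSE ([vfs].[io_stall_read_ms] - [s].[io_stall_read_ms]) /
                 ([vfs].[num_of_reads] - [s].[num_of_reads]) END,
    [WriteLatency] =
        CASE WHEN ([vfs].[num_of_writes] - [s].[num_of_writes]) = 0
            THEN 0
            ELSE ([vfs].[io_stall_write_ms] - [s].[io_stall_write_ms]) /
                 ([vfs].[num_of_writes] - [s].[num_of_writes]) END
FROM sys.dm_io_virtual_file_stats (NULL, NULL) AS [vfs]
JOIN #Snapshot AS [s]
    ON [vfs].[database_id] = [s].[database_id]
    AND [vfs].[file_id] = [s].[file_id]
JOIN sys.master_files AS [mf]
    ON [vfs].[database_id] = [mf].[database_id]
    AND [vfs].[file_id] = [mf].[file_id]
ORDER BY [WriteLatency] DESC;

DROP TABLE #Snapshot;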

So what are acceptable values?

Well, that depends. It comes down to what I was explaining earlier: your normal workload might run at 50ms latency, and that might be fine. This is why it’s important to capture a baseline of the system during day-to-day usage when no problems are being reported, so you have something to compare against when issues do come in. That being said, I would generally suggest that anything over 100ms is slow, and 500ms+ should certainly be cause for concern.
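One simple way to build that baseline, if you don’t already have monitoring that does it for you, is to capture the raw DMV counters into a table on a schedule (a SQL Agent job works well). This is just a sketch to illustrate the idea; the dbo.IOStatsHistory table and its layout are my own invention, so adjust to taste:

-- Illustrative baseline-capture table; run the INSERT on a schedule
-- to build up trend data you can compare against later.
CREATE TABLE dbo.IOStatsHistory
(
    CaptureTime       DATETIME2 NOT NULL DEFAULT SYSDATETIME(),
    database_id       INT       NOT NULL,
    [file_id]         INT       NOT NULL,
    num_of_reads      BIGINT    NOT NULL,
    io_stall_read_ms  BIGINT    NOT NULL,
    num_of_writes     BIGINT    NOT NULL,
    io_stall_write_ms BIGINT    NOT NULL
);
GO

-- Scheduled capture step (CaptureTime fills in via its default).
INSERT INTO dbo.IOStatsHistory
    (database_id, [file_id], num_of_reads, io_stall_read_ms,
     num_of_writes, io_stall_write_ms)
SELECT
    [database_id], [file_id], [num_of_reads], [io_stall_read_ms],
    [num_of_writes], [io_stall_write_ms]
FROM sys.dm_io_virtual_file_stats (NULL, NULL);
GO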

This can even go the other way. On a critical system you might normally sit at 1-5ms, with performance problems being reported once latency starts hitting 10ms. Without a baseline of the system you don’t know that: you run this script, see latency is only 10ms, decide it’s no cause for concern, and ignore it.

Don’t cry wolf

If latency has increased, look for changes in SQL Server behavior before blaming the I/O subsystem. For instance:

  • Query plan changes
  • Increased workload
  • Out-of-date statistics
  • Fragmented or missing indexes
  • Code changes
  • Decreased memory pool
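Before pointing at the storage, it’s worth checking whether a handful of queries are simply reading far more than they used to. A query along these lines against sys.dm_exec_query_stats (which, like the file stats, is reset on restart and only covers plans still in cache) will show the top readers:

-- Top 10 cached query plans by total physical reads; a plan regression
-- or missing index usually shows up here before the disks are at fault.
SELECT TOP (10)
    [qs].[total_physical_reads],
    [qs].[total_logical_reads],
    [qs].[execution_count],
    SUBSTRING ([st].[text], ([qs].[statement_start_offset] / 2) + 1,
        ((CASE [qs].[statement_end_offset]
              WHEN -1 THEN DATALENGTH ([st].[text])
              ELSE [qs].[statement_end_offset]
          END - [qs].[statement_start_offset]) / 2) + 1) AS [statement_text]
FROM sys.dm_exec_query_stats AS [qs]
CROSS APPLY sys.dm_exec_sql_text ([qs].[sql_handle]) AS [st]
ORDER BY [qs].[total_physical_reads] DESC;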
