SQL Server Performance: Disk Latency

Generally speaking, SQL Server is most often limited by disk read/write speed. This means one of the most important factors in a SQL Server instance's performance is its ability to perform I/O operations quickly. If the I/O subsystem cannot keep up with the demand being placed on it, the SQL Server workload will suffer performance problems.

SQL Server is slow. It looks like the disks…

A problem I've seen in the past is equating increased I/O subsystem latency with poor I/O subsystem performance. This is why baselining systems is so important, or at least having some monitoring in place where you can see trending on these sorts of metrics. The I/O subsystem may have been performing fine when running the designed workload, but it suddenly becomes the performance bottleneck when the workload increases past the I/O subsystem design point. The workload increase, in this case, is what's causing the problem, not the disk subsystem.

Right, so why are you telling me this?

Don't just go online, read some blog that says "This is how you diagnose disk issues", then run off and annoy your infrastructure team telling them their disks are broken. That's not going to win you any friends.

A Script…….

Your first step is knowing whether I/O latency has increased. This script uses sys.dm_io_virtual_file_stats to display the disk latency information:

SELECT
[ReadLatency] =
CASE WHEN [num_of_reads] = 0
THEN 0 ELSE ([io_stall_read_ms] / [num_of_reads]) END,
[WriteLatency] =
CASE WHEN [num_of_writes] = 0
THEN 0 ELSE ([io_stall_write_ms] / [num_of_writes]) END,
[Latency] =
CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0)
THEN 0 ELSE ([io_stall] / ([num_of_reads] + [num_of_writes])) END,
[AvgBPerRead] =
CASE WHEN [num_of_reads] = 0
THEN 0 ELSE ([num_of_bytes_read] / [num_of_reads]) END,
[AvgBPerWrite] =
CASE WHEN [num_of_writes] = 0
THEN 0 ELSE ([num_of_bytes_written] / [num_of_writes]) END,
[AvgBPerTransfer] =
CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0)
THEN 0 ELSE
(([num_of_bytes_read] + [num_of_bytes_written]) /
([num_of_reads] + [num_of_writes])) END,
LEFT ([mf].[physical_name], 2) AS [Drive],
DB_NAME ([vfs].[database_id]) AS [DB],
[mf].[physical_name]
FROM
sys.dm_io_virtual_file_stats (NULL,NULL) AS [vfs]
JOIN sys.master_files AS [mf]
ON [vfs].[database_id] = [mf].[database_id]
AND [vfs].[file_id] = [mf].[file_id]
-- WHERE [vfs].[file_id] = 2 -- log files
-- ORDER BY [Latency] DESC
-- ORDER BY [ReadLatency] DESC
ORDER BY [WriteLatency] DESC;
GO

The counters are initialized to empty whenever the SQL Server service is started, so the I/O times reported by the script are averages over the whole uptime. The longer the period the data set covers, the more likely it is that transient latency spikes will be masked.
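If you want to see what latency looks like right now rather than since the last restart, you can snapshot the DMV twice and work out the delta. A minimal sketch, assuming a five-minute sample window and a temp table name I've made up (#io_baseline):

SELECT [database_id], [file_id], [num_of_reads], [io_stall_read_ms],
       [num_of_writes], [io_stall_write_ms]
INTO #io_baseline
FROM sys.dm_io_virtual_file_stats (NULL, NULL);

WAITFOR DELAY '00:05:00'; -- sample window, adjust to taste

SELECT DB_NAME ([c].[database_id]) AS [DB],
       [c].[file_id],
       [ReadLatency] =
           CASE WHEN ([c].[num_of_reads] - [b].[num_of_reads]) = 0 THEN 0
           ELSE ([c].[io_stall_read_ms] - [b].[io_stall_read_ms]) / ([c].[num_of_reads] - [b].[num_of_reads]) END,
       [WriteLatency] =
           CASE WHEN ([c].[num_of_writes] - [b].[num_of_writes]) = 0 THEN 0
           ELSE ([c].[io_stall_write_ms] - [b].[io_stall_write_ms]) / ([c].[num_of_writes] - [b].[num_of_writes]) END
FROM sys.dm_io_virtual_file_stats (NULL, NULL) AS [c]
JOIN #io_baseline AS [b]
    ON [b].[database_id] = [c].[database_id]
    AND [b].[file_id] = [c].[file_id];

DROP TABLE #io_baseline;
GO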

So what are acceptable values?

Well, That depends. It comes down to what I was explaining earlier, your normal workload might be running at 50ms latency and that might be fine. this is why it’s important to capture a baseline of the system during day-to-day usage with no reported problems, then you can compare it to when issues are being reported. That being said, I would generally suggest that anything over 100ms is slow. certainly 500+ should be of concern.

This can even go the other way. on a critical system you might have 1-5ms. and performance problems being reported when latency starts hitting 10ms, but you dont have a baseline of the system so you dont know. you run this script and see latency is only 10ms so dont think it’s cause for concern and ignore it.

Don’t shout wolf

If latency has increased, look for changes in SQL Server behavior before blaming the I/O subsystem. For instance:

  • Query plan changes
  • Increased workload
  • Out of date statistics (see the sketch after this list)
  • Fragmented or Missing indexes
  • Code changes
  • Decreased memory pool
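The statistics point is one of the easier ones to eyeball. A hedged sketch, run in the database you're worried about; the 1,000-modification filter is just an illustrative threshold, not a rule:

SELECT OBJECT_NAME ([s].[object_id]) AS [Table],
       [s].[name] AS [Statistic],
       [sp].[last_updated],
       [sp].[rows],
       [sp].[modification_counter]
FROM sys.stats AS [s]
CROSS APPLY sys.dm_db_stats_properties ([s].[object_id], [s].[stats_id]) AS [sp]
WHERE [sp].[modification_counter] > 1000   -- only stats with a meaningful number of changes
ORDER BY [sp].[last_updated] ASC;
GO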

High Virtual Log File (VLF) Count in SQL Server

So what is a VLF?

The SQL Server Database Engine divides each physical log file internally into a number of virtual log files (VLFs). Virtual log files have no fixed size, and there is no fixed number of virtual log files for a physical log file.

Virtual log file (VLF) creation follows this method:

  • If the next growth is less than 1/8 of current log physical size, then create 1 VLF that covers the growth size (Starting with SQL Server 2014 (12.x))
  • If the next growth is more than 1/8 of the current log size, then use the pre-2014 method:
    • If growth is less than 64MB, create 4 VLFs that cover the growth size (e.g. for 1 MB growth, create four 256KB VLFs)
    • If growth is from 64MB up to 1GB, create 8 VLFs that cover the growth size (e.g. for 512MB growth, create eight 64MB VLFs)
    • If growth is larger than 1GB, create 16 VLFs that cover the growth size (e.g. for 8 GB growth, create sixteen 512MB VLFs)

Right? so how could this be an issue?

When your log grows in small increments, it's going to create VLFs for each growth event. If you've grown your log by 1 MB 1,000 times, you're going to have thousands of additional VLFs. Having too many VLFs will cause database backup and recovery tasks to slow down, and in extreme cases it can even affect DML (Insert/Update/Delete) performance.

So what’s ‘high’?

Well, this depends on transaction log size. 100-300 VLFs isn't unreasonable, and I've never seen performance issues with those amounts. It's when the VLF count starts going over 1,000 that you should look at sorting the issue out; that's the point at which I've actually seen performance problems appear.

How can I tell how many VLFs I have?

Either of these two scripts should give you the information:

DBCC LOGINFO(<DatabaseName>)

SELECT [s].[name], [s].[database_id],
    COUNT([l].[database_id]) AS [VLF Count],
    SUM([l].[vlf_size_mb]) AS [VLF Size (MB)],
    SUM(CAST([l].[vlf_active] AS INT)) AS [Active VLF],
    SUM(CAST([l].[vlf_active] AS INT) * [l].[vlf_size_mb]) AS [Active VLF Size (MB)],
    COUNT([l].[database_id]) - SUM(CAST([l].[vlf_active] AS INT)) AS [In-active VLF],
    SUM([l].[vlf_size_mb]) - SUM(CAST([l].[vlf_active] AS INT) * [l].[vlf_size_mb]) AS [In-active VLF Size (MB)]
FROM sys.databases AS [s]
CROSS APPLY sys.dm_db_log_info([s].[database_id]) AS [l]
GROUP BY [s].[name], [s].[database_id]
ORDER BY [VLF Count] DESC;
GO

I've got 300 million VLFs, how do I fix it?

Okay there might be some exaggeration there but you get the idea.

Shrinking the transaction log is the only way to reduce the number of VLFs, but you don't want the issue to occur again, so you will need to regrow the log back to its normal operating size after the shrink is finished. Growing the log in one operation to the correct size will ensure SQL Server allocates the correct number of VLFs for that size of log file (based on the algorithm above).

Remember

Shrinking a log file is a blocking operation!

Oh! And set some reasonable growth parameters while you’re at it…. this will ensure the problem will be kept at bay
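A minimal sketch of that shrink-and-regrow process, assuming a hypothetical database called MyDB with a log file named MyDB_log that normally runs at 8 GB; the names, sizes and growth increment are all illustrative:

USE [MyDB];
GO
-- you may need a log backup first so the active VLFs move back towards the start of the file
-- shrink the log as small as it will go (blocking operation, pick a quiet period)
DBCC SHRINKFILE (N'MyDB_log', 1);
GO
-- regrow to the normal operating size in one step, and set a sensible growth increment
ALTER DATABASE [MyDB]
MODIFY FILE (NAME = N'MyDB_log', SIZE = 8192MB, FILEGROWTH = 512MB);
GO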

Transaction log growing when performing index maintenance

With index maintenance, when operating on large indexes, the index operation can generate a large volume of log, causing the transaction log to fill quickly. This logging ensures the index operation can be rolled back should there be a problem. Unfortunately it also means the transaction log cannot be truncated until the operation has completed (the log can still be backed up). Obviously the transaction log must have sufficient room to store both the index operation transaction and any user transactions for the duration of the index operation.

So what can we do? 

Well, your first option is to throw additional disk space at the database log drive, which isn't ideal as disk space doesn't grow on trees…

Alternatively we can set the SORT_IN_TEMPDB option to ON.

What does that do?

This separates the index transactions from the user transactions.

The index transactions will be stored in the tempdb transaction log, and the concurrent user transactions will be stored in the transaction log of the user database. This allows for the transaction log of the user database to be truncated during the index operation if it is required.

This also means that if the tempdb log is not on the same disk as the user database log, the two logs are not competing for the same disk space.

Does that mean TempDB will incur more usage?

Yes and no.

If SORT_IN_TEMPDB is set to ON, there must be sufficient free space in tempdb to store the sort runs. However, it only applies while the index operation is running; normal query sorts will still occur in memory (or spill to tempdb depending on memory grants).

So how do we set it?

You can either do this per index, by creating or rebuilding the index with the SORT_IN_TEMPDB = ON option (a sketch follows).
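A minimal sketch, assuming a hypothetical index IX_Orders_CustomerID on dbo.Orders:

-- rebuild a single index, doing the sort work in tempdb instead of the user database log/data files
ALTER INDEX [IX_Orders_CustomerID] ON [dbo].[Orders]
REBUILD WITH (SORT_IN_TEMPDB = ON);
GO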

or

If you're using the Ola Hallengren Index Maintenance scripts, set the @SortInTempdb = 'Y' parameter.

SQL Server Configuration: Fill-Factor

Fill-factor determines the percentage of space on each leaf-level page to be filled with data, reserving the remainder on each page as free space for future growth. This only takes effect when an index is created or rebuilt.

For example, setting fill-factor to 90 means that 10 percent of each leaf-level page will be left empty. This provides space for index expansion as data is added. The empty space is reserved between the index rows rather than at the end of the index.

SPACE

There are space implications to leaving this white space in your indexes. For a 100 MB index you would need roughly 10% more space, making the index around 110 MB. That doesn't sound too bad, does it? What if your index was 5 GB? That's now around 5.5 GB for one index… now imagine that for all the indexes on your database…

PAGE SPLITS

By leaving this free space at the leaf level there is the potential to reduce index fragmentation and page splits. Page splits occur when there is no room within an index page for new data to be written; the database engine moves approximately half of the rows on the page to a new page to make room for the new data. This is a fairly resource-intensive operation which can cause an unnecessary increase in disk I/O. Avoiding (too many) page splits should be the main reason you're setting a fill-factor.

READ PERFORMANCE

Configuring the fill-factor of an index may reduce the occurrence of page splits, but it will also decrease read performance, as the database engine has more index pages to read through. Microsoft clearly states: "a fill-factor value of 50 can cause database read performance to decrease by two times".

CONFIGURATION

Configuring a correct fill-factor shouldn't be done on a mass scale, and in my opinion it shouldn't be done at instance level. It's very likely that it will hurt your performance more than it helps.

Workloads should be taken into account, and you should definitely be checking your performance counters to see how SQL Server is coping with the current workload. If a large number of page splits are occurring then it's worth drilling down to index-level analysis to see where the page splits are happening and setting the fill-factor on that individual index.
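A hedged sketch of both halves of that: checking the page splits counter, then setting a fill-factor on one specific index. The index and table names, and the value of 90, are illustrative assumptions rather than recommendations:

-- how many page splits has the instance performed since startup?
SELECT [cntr_value] AS [Page Splits (cumulative since startup)]
FROM sys.dm_os_performance_counters
WHERE [counter_name] = 'Page Splits/sec';

-- once you've identified the painful index, set the fill-factor on that index only
ALTER INDEX [IX_Orders_OrderDate] ON [dbo].[Orders]
REBUILD WITH (FILLFACTOR = 90);
GO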

BUT I READ ONLINE IT CAN REDUCE PAGE SPLITS

It's not uncommon for me to look at an instance that has been under the care of another DBA and find a default fill-factor has been set at instance level (usually 80!). Other than the wasted space within every index, it could decrease your read performance by roughly 20%!

STOP IT!

Find where your pain is, and target your pain relief there. This blanket setting is just going to cause you other issues further down the line.

There is not enough space on the disk – Error when performing CheckDB

The database integrity suite of commands (CheckDB, CheckAlloc, CheckTable, CheckCatalog & CheckFileGroup) takes an internal read-only snapshot of the database while the DBCC command is running.

These database snapshots work the same way regular snapshots do: as data is changed in the original database, the original copy of each changed page is written to the snapshot first, which means the snapshot grows in proportion to the amount of data changed.

Depending on database activity while the DBCC command is running, you could encounter disk space issues, because the internal database snapshot is created on the same drive as the user database. If the drive runs out of space, then as a failsafe to protect the operational integrity of the user database, the snapshot is marked as suspect and the DBCC command fails.

This is failure by design; unfortunately there is no way to allocate a different drive for the internal snapshot to use.

OPTIONS

  • Provide enough disk space for the DBCC Command to execute successfully
  • Perform the Check on a restored backup of the live database.

On large, highly transactional systems, performing a weekly CheckDB on an offline copy of the database might be the preferred option. I've previously had to do this for 4 TB+ CRM & ERP systems, where the CheckDB took over 48 hours to complete.
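A minimal sketch of the second option, assuming a hypothetical database MyDB restored onto a separate server or drive; the paths and logical file names are illustrative:

RESTORE DATABASE [MyDB_Check]
FROM DISK = N'D:\Backups\MyDB_Full.bak'
WITH MOVE N'MyDB' TO N'E:\Data\MyDB_Check.mdf',
     MOVE N'MyDB_log' TO N'F:\Logs\MyDB_Check_log.ldf',
     STATS = 10;
GO

-- check the restored copy instead of the live database
DBCC CHECKDB (N'MyDB_Check') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO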

LEGACY CARDINALITY ESTIMATION Explained

Since SQL Server 2014, Microsoft has shipped an updated Cardinality Estimator, with changes to the assumptions and algorithms that should work better with modern workloads. SQL Server 2012 and earlier use an older version of the Cardinality Estimator that had its last major update in 1998 (CE 70).

SQL Server Version                     Compatibility Level
SQL Server 7.0                         70
SQL Server 2000                        80
SQL Server 2005                        90
SQL Server 2008 & SQL Server 2008 R2   100
SQL Server 2012                        110
SQL Server 2014                        120
SQL Server 2016                        130
SQL Server 2017                        140
SQL Server 2019                        150

By setting the database compatibility level, you're telling SQL Server which version of the cardinality estimator to use. So if you set the database compatibility level to 130, it will use the SQL Server 2016 cardinality estimator.

So what if we get performance issues with the new cardinality estimator but want to keep the other features available in the 2016 compatibility level?

The LEGACY_CARDINALITY_ESTIMATION database-scoped configuration enables the legacy CE (CE 70) regardless of the database compatibility level setting. It is equivalent to trace flag 9481, but it only affects the database in question.

The QUERY_OPTIMIZER_HOTFIXES option is the database-level equivalent of trace flag 4199; it controls whether query optimizer hotfixes released since that version's RTM are applied to the database. So you can patch your SQL Server, and if you notice a performance degradation you can turn the query optimizer hotfixes off for an individual database instead of rolling the patch back.
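Both are set per database with ALTER DATABASE SCOPED CONFIGURATION. A minimal sketch, run in the context of the database you want to change (MyDB here is a hypothetical name):

USE [MyDB];
GO
-- use the legacy (CE 70) cardinality estimator for this database only
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;

-- control optimizer hotfix behaviour (trace flag 4199) for this database only
ALTER DATABASE SCOPED CONFIGURATION SET QUERY_OPTIMIZER_HOTFIXES = OFF;
GO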

Upgrading your SQL Server major version is more complicated than it used to be. Because of these changes to database compatibility levels and cardinality estimators, it is not uncommon to see an impact on performance if you haven't planned and tested what you're migrating to.

SQL Server: When to Enable Lock Pages in Memory (LPIM)

SQL Server listens to physical memory notifications from Windows to trim its working set. The trimming is done by the Windows working set manager, which has its own set of rules for how and when to trim memory. If the currently available resources are insufficient to satisfy a new request for resources made by a driver or another application, the working set manager will trim working sets to satisfy the request. Small but repeated working set trims will eventually result in a significant portion of SQL Server's memory being paged out, and you'll see this message in the error log:

A significant part of sql server process memory has been paged out. This may result in a performance degradation. Duration: #### seconds. Working set (KB): ####, committed (KB): ####, memory utilization: ##%.

This is where min server memory comes into play (sp_configure 'min server memory'). Windows will attempt to trim SQL Server down to its minimum working set (min server memory); after that it will page out to disk.

It's important to configure both min and max server memory correctly within SQL Server before enabling this option, setting aside enough memory for the OS and other running processes like antivirus, server monitoring software… and other instances of SQL!
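A minimal sketch of setting those values with sp_configure; the numbers are illustrative assumptions for a hypothetical server with 32 GB of RAM, not recommendations:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
GO
-- leave roughly 4 GB for the OS and other processes on this hypothetical box
EXEC sp_configure 'max server memory (MB)', 28672;
EXEC sp_configure 'min server memory (MB)', 8192;
RECONFIGURE;
GO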

There is an old debate among DBAs about whether this option should be enabled by default, and I would say in earlier versions of Windows Server (2000, 2003, etc.) that was true. However, Windows Server 2008 (and onward) improved the contiguous memory allocation mechanism. This improvement lets Windows Server 2008 reduce the side effects of paging out the working set of applications when new memory requests arrive. Hard trims can still happen, though, and you should be monitoring your SQL Server error log for error 17890 or the message text above.

How to enable

This Windows policy determines which accounts can use a process to keep data in physical memory, preventing the system from paging the data to virtual memory on disk.

Locking pages in memory may boost performance when paging memory to disk is expected.

 

To enable the lock pages in memory option

  1. On the Start menu, click Run. In the Open box, type gpedit.msc.
  2. On the Local Group Policy Editor console, expand Computer Configuration, and then expand Windows Settings.
  3. Expand Security Settings, and then expand Local Policies.
  4. Select the User Rights Assignment folder. The policies will be displayed in the details pane.
  5. In the pane, double-click Lock pages in memory.
  6. In the Local Security Setting – Lock pages in memory dialog box, click Add User or Group.
  7. In the Select Users, Service Accounts, or Groups dialog box, select the SQL Server Service account.
  8. Restart the SQL Server Service for this setting to take effect.

Starting with SQL Server 2012 (11.x), trace flag 845 is not needed for Standard Edition to use Locked Pages.

SQL Server: SOS_SCHEDULER_YIELD Waits

SOS_SCHEDULER_YIELD Occurs when a task voluntarily yields the scheduler for other tasks to execute. During this wait the task is waiting for its quantum to be renewed.

SOS_SCHEDULER_YIELD is a difficult one to understand; the knee-jerk reaction is that there is CPU pressure…

SQL Server has its own scheduling mechanism that is implemented in what we know as SQLOS;

A small unit of time called the time quantum is assigned to each thread. When the time quantum expires, the CPU is switched to another thread:

  • A quantum (4 ms) of CPU time is assigned to every thread.
  • When a running thread needs to wait for a resource, it is sent to the Waiter List and enters the SUSPENDED state.
  • Suspended threads wait until the resource becomes available, then move to the RUNNABLE queue.
  • Threads at the top of the RUNNABLE queue are granted execution and enter the RUNNING state.

If a thread exhausts its quantum without needing to wait for anything, it voluntarily gives up the processor (known as yielding). When this occurs, the thread moves directly to the bottom of the Runnable Queue (it doesn't go to SUSPENDED as there is nothing for it to wait for). SQLOS must register a wait type for this transition off the processor, and it registers SOS_SCHEDULER_YIELD.

Recently I've seen two main causes for SOS_SCHEDULER_YIELD. I'm not saying there are no other reasons, but look out for these VM-level misconfiguration gotchas!

1: Incorrectly configured sockets and cores

This is mainly down to the edition of SQL Server; for example, assigning 8 sockets to a Standard Edition SQL Server (it can only use 4)… read more about this issue here: https://lukesalter.wordpress.com/2018/08/07/understanding-sql-server-socket-core-restrictions/

2: Oversubscribed vCPUs at host level

If a thread is prevented from running for a few milliseconds or more, that could mean that its execution might exhaust its thread quantum without actually getting 4ms of CPU time and so yield the processor causing an SOS_SCHEDULER_YIELD to be registered.

If a thread is waiting for a resource, and this thread has to wait to be scheduled to execute by the hypervisor due to the host being oversubscribed with vCPUs based on the hardware pCPUs, then the actual resource wait time noted in SQL Server will include that time that the VM was unable to run, and so the wait time will appear to be longer than it would have been had the VM not been delayed.

When the thread quantum expires, the thread *must* yield the processor. It has no knowledge of other threads on that scheduler and there is *always* a context switch when the thread goes to the bottom of the runnable queue, even if it’s the only thread on the scheduler.
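Before pointing fingers at the hypervisor team, a quick sanity check is to look at the runnable queue depth per scheduler: a consistently non-zero runnable_tasks_count points to genuine CPU pressure on the instance itself. A minimal sketch:

SELECT [scheduler_id],
       [cpu_id],
       [current_tasks_count],
       [runnable_tasks_count],   -- threads sitting in the Runnable Queue
       [work_queue_count],
       [pending_disk_io_count]
FROM sys.dm_os_schedulers
WHERE [scheduler_id] < 255;      -- visible (user) schedulers only
GO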

Disclaimer;

SOS_SCHEDULER_YIELD can be normal behavior; queries doing scans of pages that are in memory and aren’t changing, hence there’s no contention for page access and the scanning thread can run until it exhausts its thread quantum…

SQL Server: PAGELATCH_UP – Real World TempDB Contention

When TempDB is heavily used, SQL Server may experience contention when it tries to allocate pages. This may cause queries and requests that involve TempDB to be unresponsive for short periods of time. You will see an increase in Lock Timeouts if you are collecting stats on this SQL Counter.

While heavy load is occurring you will see severe blocking when looking in the DMV sys.dm_exec_requests. The wait type will be PAGELATCH_UP and the wait resource points to pages in TempDB. These pages might be of the format 2:1:1, 2:1:3 and so on (PFS and SGAM pages in TempDB).

Understanding wait resource

DatabaseID:FileNumber:PageNumber

For example: Wait Resource 2:1:3 would be;

  • DatabaseID: 2 (TempDB)
  • File Number: 1 (The first data file)
  • Page Number: 3 (SGAM Page)

It's also important to understand what pages 0-3 of the data file are:

File Header: The file header is the first page in the data file, page 0. You can see its content with DBCC PAGE after turning trace flag 3604 on for the connection.

PFS Page: PFS tracks the allocation status of each page and approximately how much free space it has. There is 1 PFS page for roughly every 64 MB of data file. It is always page 1 in the data file and then repeats every 8,088 pages.

GAM Page: GAM page is page 2 in the data file, GAM Tracks which extents have been allocated. There is 1 GAM page for every 4 GB of data file. It is always page 2 in the data file and then repeats every 511,232 pages.

SGAM Page: The SGAM page is page 3 in the data file. SGAM tracks which extents are being used as mixed extents and have at least one unused page. There is 1 SGAM page for every 4 GB of data file. It is always page 3 in the data file and then repeats every 511,232 pages.

Identifying the Contention

As mentioned, you can use the DMV sys.dm_exec_requests; however, SQLSoldier has provided this query, which does the page-number arithmetic to identify the PFS/GAM/SGAM pages for us.

This looks for page latch waits on TempDB and tells us if the wait resource is one of our system pages.

Select session_id,
    wait_type,
    wait_duration_ms,
    blocking_session_id,
    resource_description,
    ResourceType = Case
        When (Cast(Right(resource_description, Len(resource_description) - Charindex(':', resource_description, 3)) As Int) - 1) % 8088 = 0 Then 'Is PFS Page'
        When (Cast(Right(resource_description, Len(resource_description) - Charindex(':', resource_description, 3)) As Int) - 2) % 511232 = 0 Then 'Is GAM Page'
        When (Cast(Right(resource_description, Len(resource_description) - Charindex(':', resource_description, 3)) As Int) - 3) % 511232 = 0 Then 'Is SGAM Page'
        Else 'Is Not PFS, GAM, or SGAM page'
    End
From sys.dm_os_waiting_tasks
Where wait_type Like 'PAGE%LATCH_%'
    And resource_description Like '2:%'

Resolving your contention

First of all, you should be implementing these two trace flags; they have pretty much become an industry/build standard for most experienced DBAs. Also, as of SQL Server 2016 this behaviour is built in for TempDB by default, so you only need to turn the trace flags on for SQL Server instances prior to 2016.

Trace Flag 1118 – Full Extents Only
This means that each newly allocated object in every database on the instance gets its own private 64KB of data. TempDB is usually the place where most objects are created, so it makes the most difference there.

Trace Flag 1117 – Grow All Files in a File Group Equally
Trace flag 1117 changes the behavior of file growth: if one data file in a filegroup grows, it forces the other files in that filegroup to also grow. This can be useful for TempDB, which is commonly configured with multiple data files as KB 2154845 advises.

Adding additional data files: next is adding additional TempDB data files. Yes, these can all be on the same disk; the contention is around pages within the data files, not around the I/O of the physical disk.

Again, the industry standard is now one TempDB data file per logical core, up to 8. Once you are above 8 cores you should only be adding data files to alleviate contention: add them in groups of 4, then stop as soon as the contention is alleviated.
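A minimal sketch of both steps for a pre-2016 instance; the file path, size and growth increment are illustrative assumptions, and any file you add should match the size and growth settings of the existing TempDB files:

-- enable the trace flags globally on the running instance
-- (add -T1117 and -T1118 as startup parameters so they survive a restart)
DBCC TRACEON (1117, 1118, -1);
GO

-- add an extra TempDB data file
ALTER DATABASE [tempdb]
ADD FILE (NAME = N'tempdev2', FILENAME = N'T:\TempDB\tempdev2.ndf',
          SIZE = 4096MB, FILEGROWTH = 512MB);
GO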

SQL Server is SLOW! What do I do!?

So you’re having performance issues

What's the first thing you look at? This is going to be a basic checklist of where you should start looking.

Firstly you need to assess the situation: is it slow, or is it dead in the water? I've simplified it into four categories:

  1. SQL Server is Dead/Offline and needs Immediate Attention.
  2. SQL Server is going to go offline unless we do something NOW
  3. SQL Server Is currently under performing but no outage will occur
  4. Some parts of Some Queries are running slow

Level 1 and Level 2 issues are critical. You may have to take risks to bring the instance back online, but you also need to know which other system engineers you need to assist you (server/network admins, for example).

At Levels 3 & 4 it’s likely that taking risks can make things worse. We need to prove it’s not an emergency and respond appropriately.

Triage Checklist

System Performance

Check your monitoring software, look for trends in SQL/Performance Counters, if you don’t have any monitoring software you should be looking at implementing some or even having something basic developed in house – these days there is no excuse!

What is currently running?

Connect to the instance,

select * from sys.dm_exec_requests

There are alternatives, like sp_whoisactive (a great piece of kit written by Adam Machanic, check it out if you haven't) or sp_who2.

You really need to be looking at query start time, blocking and possibly wait types, depending on the issue! If there is a query that has been running for hours or days you should be finding out what that query is.

dbcc inputbuffer(<SessionID>)

This should find you the query text to pass on to the application support/development team. If there are any blocking queries, find the head blocker in the chain and check the query text on that too, along with the username executing it; make sure it's safe to kill before killing it.
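If DBCC INPUTBUFFER only gives you part of the picture, a hedged alternative is to pull the statement text straight out of the DMVs. A minimal sketch:

SELECT r.session_id,
       r.start_time,
       r.status,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       t.[text] AS [query_text]
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id > 50          -- skip most system sessions
ORDER BY r.start_time ASC;
GO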

Check for bottlenecks and contention

Check for contention: pending memory grants, PLE, CPU, wait stats… You should build yourself a toolkit of scripts; there are loads out there for free to help find the pressure points in your SQL Server. Find them, familiarise yourself with them and keep them on hand for when there are issues. I'm not going to go through them all here! sp_Blitz (free from Brent Ozar) is a great tool for picking out the bad stuff in your SQL Server. Here's a list of things you should be checking for to get you started (in no specific order):

  1. Memory Pressure
  2. Waiting Tasks
  3. Long Running Transactions
  4. Blocking Sessions
  5. Expensive Queries
  6. Wait Stats
  7. Read/Write Latency

Only Some Queries are slow

It's likely this is due to an indexing issue. Look at your expensive queries, see if you can get the query text for what the users are running, and check the execution plan for it. Check for missing indexes around it, check the statistics for that table, or check for possible parameter sniffing problems.

These sorts of issues should not be mission critical, so changes should be tested on UAT/DEV/QA systems before putting them into production. Remember, you could be doing more harm than good!