What else is backing up my SQL Server Databases?

I recently had an issue where my differential backups were being converted to full backups (using the Ola Hallengren scripts). Ola does this to preserve your LSN backup chain. When a new full backup is taken, Ola’s scripts are intelligent enough to see that something else has taken a backup, and they know that if you were to take a differential backup now, you would run into problems trying to perform a restore in the future. So they convert your would-be diff backup to a full backup to keep YOU safe…

Bloating…

Now, if something else is performing full backups every night, you run into the situation where every diff is being converted to a full backup, resulting in a bloated backup directory. Typically it’s 3rd party backup software that is doing the job for you.

Point of fact: I would always recommend using SQL Server native backups. That’s not to say 3rd party backup software is bad; I’ve worked with CommVault in the past and it does make managing an estate of 300 servers easier, but it needs to be set up and configured correctly to work effectively. I would also recommend regularly restoring your databases to a test system to confirm RPO/RTO and backup integrity.
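As a quick first check between full test restores, RESTORE VERIFYONLY will at least confirm a backup file is readable and complete (it does not prove logical consistency). A minimal sketch, with a hypothetical path:

-- Verify the backup file is complete and readable (does not restore it)
RESTORE VERIFYONLY
FROM DISK = N'X:\Backups\MyDatabase_FULL.bak';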

Anyway!…….

How can I confirm what is backing up my databases from SQL Server?

Well, this SQL Script;

SELECT bs.media_set_id, bs.database_name, bs.backup_start_date, bs.user_name, bmf.physical_device_name, bmf.device_type
FROM msdb.dbo.backupset bs
JOIN msdb.dbo.backupmediafamily bmf ON bs.media_set_id = bmf.media_set_id
WHERE bs.backup_start_date > GETDATE() - 1 AND bs.type = 'D' AND bs.database_name = 'DATABASE_NAME'

You will notice your regular backups will be using the SQL Agent user name (another good reason to have your SQL Agent running under its own service account), and the physical_device_name will be your known backup file.

Other entries could be anything; check the device name and the user name. A device_type of 7 means Virtual Device, which is a telltale sign of 3rd party backup applications. These should help you pinpoint what other application is backing up the system for you. Feed this information back to your Infrastructure/Server Engineers and get them to either stop, or to change the backup type to copy-only.

Copy-only backups do not interfere with the LSN chain.
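If the 3rd party tool can’t simply be switched off, asking for copy-only backups is the usual compromise. As a minimal sketch (the database name and path are hypothetical), this is what a copy-only full backup looks like in T-SQL:

-- A copy-only full backup does not reset the differential base or break the LSN chain
BACKUP DATABASE [MyDatabase]
TO DISK = N'X:\Backups\MyDatabase_COPY_ONLY.bak'
WITH COPY_ONLY, COMPRESSION, CHECKSUM;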

High Virtual Log File (VLF) Count in SQL Server

So what is a VLF?

The SQL Server Database Engine divides each physical log file internally into a number of virtual log files (VLFs). Virtual log files have no fixed size, and there is no fixed number of virtual log files for a physical log file.

Virtual log file (VLF) creation follows this method:

  • If the next growth is less than 1/8 of current log physical size, then create 1 VLF that covers the growth size (Starting with SQL Server 2014 (12.x))
  • If the next growth is more than 1/8 of the current log size, then use the pre-2014 method:
    • If growth is less than 64MB, create 4 VLFs that cover the growth size (e.g. for 1 MB growth, create four 256KB VLFs)
    • If growth is from 64MB up to 1GB, create 8 VLFs that cover the growth size (e.g. for 512MB growth, create eight 64MB VLFs)
    • If growth is larger than 1GB, create 16 VLFs that cover the growth size (e.g. for 8 GB growth, create sixteen 512MB VLFs)

Right? So how could this be an issue?

When your log grows in small increments, it’s going to create VLFs for each growth event. If you’ve grown your log by 1MB 1000 times, you’re going to have at least 1000 additional VLFs. Having too many VLFs is going to cause database backup and recovery tasks to slow down, and in extreme cases it can even affect DML (Insert/Update/Delete) performance.

So what’s ‘high’?

Well, this depends on the transaction log size. 100-300 VLFs isn’t unreasonable, and I’ve never seen performance issues with those amounts. It’s when the VLF count starts going over 1000 that you should look at sorting the issue out.

How can I tell how many VLFs my log has?

Either of these 2 scripts should give you the information;

DBCC LOGINFO ('<DatabaseName>') -- returns one row per VLF

SELECT s.[name], s.database_id,
COUNT(l.database_id) AS [VLF Count],
SUM(l.vlf_size_mb) AS [VLF Size (MB)],
SUM(CAST(l.vlf_active AS INT)) AS [Active VLF],
SUM(CAST(l.vlf_active AS INT) * l.vlf_size_mb) AS [Active VLF Size (MB)],
COUNT(l.database_id) - SUM(CAST(l.vlf_active AS INT)) AS [In-active VLF],
SUM(l.vlf_size_mb) - SUM(CAST(l.vlf_active AS INT) * l.vlf_size_mb) AS [In-active VLF Size (MB)]
FROM sys.databases s
CROSS APPLY sys.dm_db_log_info(s.database_id) l
GROUP BY s.[name], s.database_id
ORDER BY [VLF Count] DESC
GO

I’ve got 300 million VLFs, how do I fix it?

Okay there might be some exaggeration there but you get the idea.

Shrinking the transaction log is the only way to reduce the number of VLFs, but you don’t want the issue to occur again, so you will need to regrow the log back to its normal operating size after the shrink has finished. Growing the log in one operation to the correct size will ensure SQL Server allocates the correct number of VLFs for that size of log file (based on the above algorithm).

Remember

Shrinking a log file is a blocking operation!

Oh! And set some reasonable growth parameters while you’re at it… this will ensure the problem is kept at bay.
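Putting that together, here is a minimal sketch of the shrink-and-regrow approach; the database name, logical file name, sizes and growth increment are all hypothetical and should be adjusted to your environment, and remember the shrink is a blocking operation:

USE [MyDatabase];
-- Shrink the log to release the excess VLFs (blocking, run in a quiet period)
DBCC SHRINKFILE (N'MyDatabase_log', 1024);

-- Regrow the log back to its normal operating size in one operation
ALTER DATABASE [MyDatabase]
MODIFY FILE (NAME = N'MyDatabase_log', SIZE = 16384MB);

-- And set a sensible fixed growth increment while we're here
ALTER DATABASE [MyDatabase]
MODIFY FILE (NAME = N'MyDatabase_log', FILEGROWTH = 1024MB);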

Transaction log growing when performing index maintenance

With index maintenance, when operating on large indexes the index operation can generate large data loads, causing the transaction log to fill quickly. This logging ensures the index operation can be rolled back should there be a problem. Unfortunately it also means the transaction log cannot be truncated until the operation has completed (the log can still be backed up). Obviously the transaction log must have sufficient room to store both the index operation transaction and any user transactions for the duration of the index operation.

So what can we do? 

Well, your first option is to throw additional disk space at the database log drive, which isn’t ideal as disk space doesn’t grow on trees…

Alternatively we can set the SORT_IN_TEMPDB option to ON.

What does that do?

This separates the index transactions from the user transactions.

The index transactions will be stored in the tempdb transaction log, and the concurrent user transactions will be stored in the transaction log of the user database. This allows for the transaction log of the user database to be truncated during the index operation if it is required.

This also means that if the tempdb log is not on the same disk as the user database log, the two logs are not competing for the same disk space.

Does that mean TempDB will incur more usage?

Yes and no.

If SORT_IN_TEMPDB is set to ON, there must be sufficient free space in tempdb to store the sort runs. However, it’s only active during the index maintenance; normal sorting for day-to-day queries will still occur in memory (or spill to tempdb depending on memory grants).

So how do we set it?

You can either do this per index, by creating or rebuilding the index with the SORT_IN_TEMPDB = ON option,

or

If you’re using the Ola Hallengren index maintenance scripts, set the @SortInTempdb = 'Y' parameter.
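As a minimal sketch of both approaches (the table and index names are hypothetical; the @SortInTempdb parameter belongs to Ola Hallengren’s IndexOptimize procedure):

-- Per index: rebuild with the sort work pushed into tempdb
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders
REBUILD WITH (SORT_IN_TEMPDB = ON);

-- Ola Hallengren's IndexOptimize: same idea across the instance
EXEC dbo.IndexOptimize
     @Databases = 'USER_DATABASES',
     @SortInTempdb = 'Y';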

SQL Server Configuration: MODEL Database

The “Model” database is the template database SQL Server uses when a new user database is created; any database settings/sizes are copied from the Model database to the new blank database. Obviously if you’re doing a restore it’s going to restore with the settings the database was backed up with, so this only applies to new databases.

What can be changed?

Any database-level configuration option can be changed on the Model database, and these will be replicated to your new blank databases; things like Read Committed Snapshot Isolation, collation, compatibility level etc…

This means if you know what settings databases will need on the new instance, you can stay ahead of the curve and set them on Model. I would advise leaving most settings as standard unless you know something is going to be specifically required for future databases.

What should I change then?

Growth! Auto-growth settings are my biggest annoyance when it comes to databases. By default (before SQL Server 2016), auto-growth was tiny or percentage based (1MB growth for data files, 10% for log files). This means on small databases there are going to be lots of growth events. When you’re looking through growth history and seeing 100s if not 1000s of 1MB growths, it makes it hard to see what happened. It’s also going to increase your Virtual Log Files (VLFs) by a dramatic amount (which will cause performance issues).

It could also be worth setting the Initial Size of the data & log files to something sensible….

Recommendations?

Again, it depends on your instance size. If it’s a small instance, I would suggest setting growth to 256MB or 512MB. Basically you want to avoid lots of growth events while balancing that against the data file taking “too much space”; if it’s a 100MB database, it would be a little excessive setting auto-growth to 1GB.

ALTER DATABASE Model
MODIFY FILE (NAME = modeldev, FILEGROWTH = 1024MB); -- data file growth
ALTER DATABASE Model
MODIFY FILE (NAME = modellog, FILEGROWTH = 1024MB); -- log file growth

For anything larger, I would suggest 1GB growth is fine. This can always be changed on the actual database once it’s in use; after all, remember this is just configuring the template database.
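If you also want to give Model a sensible initial size (as mentioned above), a minimal sketch might look like this; the sizes are hypothetical and whatever you pick will be copied to new databases:

-- New databases will start at these sizes instead of the tiny defaults
ALTER DATABASE Model
MODIFY FILE (NAME = modeldev, SIZE = 512MB);
ALTER DATABASE Model
MODIFY FILE (NAME = modellog, SIZE = 256MB);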

The main benefit of configuring the model database this way is that if lots of databases have been created, all with bad auto-growth settings, you’re going to have to go back and correct them all at some point; setting up all their growth parameters and shrinking the logs to reduce VLFs… it’s just time-consuming work that can be avoided!

Forgotten Maintenance – SP_CYCLE_ERRORLOG

DBAs tend to get so caught up in the technical performance configuration of SQL Server (like setting MaxDOP, configuring TempDB, installing index maintenance and integrity checks) that they sometimes forget the basics: cycling the error log!

Why Bother?

Yeah, sure, it’s not going to give users a performance boost, but when you’re investigating an outage and you want to look in the error log for any issues, do you really want to open it with millions of entries? I’ve lost count of the times where I’ve had to sit there for 10 minutes waiting for the error log to open because it’s not been cycled in 6+ months.

Oh So Simple

And it’s so easy to implement error log maintenance. The error log is only cycled on 2 events;

  • When SQL Server is restarted.
  • When you execute sp_cycle_errorlog

Outside of this, without intervention it just grows, and if you’re logging user login activity it grows fast and big.

So How?

So first we want to put a limit on how many error logs we keep; this is basic housekeeping;

  • Expand the “Management” folder.
  • Right click on “SQL Server Logs”
  • Select “Configure”
  • Check the box “Limit the number of error log files before they are recycled”
  • Pick some value to put in the “Maximum number of error log files” box
  • Click “OK”

Then we want to set up a SQL Agent job to cycle the error log. All it has to do is run EXEC sp_cycle_errorlog on a schedule which suits you; obviously if you have a particularly noisy log, you might want to schedule it more often.

Generally I limit my SQL Servers to keep 26 logs and cycle every week, which gives us about 6 months of error log. You might want to schedule your cycle monthly or even daily (which is totally fine), just make sure you’re increasing or decreasing the max number of log files to suit. You want enough logs to diagnose any historic problems, but not so many that they use unnecessary disk space.
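As a minimal sketch, the Agent job step is just the single EXEC below; the xp_instance_regwrite call is the commonly used scripted equivalent of the “limit the number of error log files” checkbox, included here on the assumption you’d rather script it than click through SSMS:

-- Job step: cycle the error log (schedule weekly, or to suit)
EXEC sys.sp_cycle_errorlog;

-- Optional: set the retained error log count to 26 via the instance registry keys
EXEC xp_instance_regwrite N'HKEY_LOCAL_MACHINE',
     N'Software\Microsoft\MSSQLServer\MSSQLServer',
     N'NumErrorLogs', REG_DWORD, 26;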

Cluster Resource ‘<ResourceName>’ In Clustered Service or Application Failed – AAG Network Interface Failure

When trying to fail over your Always On Availability Group you get the error message “Cluster Resource ‘<ResourceName>’ In Clustered Service or Application ‘<AAGName>’ Failed”.

Okay? So what happened?

My client received this error when trying to fail over to their DR site. There were 4 Availability Groups on the same server. It’s a multi-subnet configuration, so there is a second IP address specified for each Always On Availability Group listener.

Always reserve your IP’s

The problem came when the second IP attempted to come online and subsequently failed. With the server having 4 AAGs, there are 4 listeners configured.

  1. On the primary site, there are 4 IP addresses (Yep one for each Listener)
  2. On the secondary site, there should be another 4 IP Addresses configured

For the secondary site, closer inspection showed that the same IP had been designated for every listener.

Give that listener its own IP

This means that when we started the failover to DR, the first AAG came up okay; it started the network interface for its IP address and worked correctly. However, that IP address was now in use, and as the other AAGs failed over, their network interfaces gave us the above error.

Great! now what?

Thankfully we don’t have to recreate the listener, and we can resolve the issue with no outage. All that needs to be done is to alter the IP Address in Failover Cluster Manager;

  • Open Failover Cluster Manager
  • Expand Services and Applications
  • For each AAG, right-click the offline (duplicate) IP and click Properties
  • Update the Static IP Address with a new valid IP

Always make sure the IP addresses you’ve assigned are reserved so they don’t get used elsewhere. If an IP address is already in use when WSFC tries to bring that resource online, you will get the error message “Cluster Resource ‘<ResourceName>’ In Clustered Service or Application ‘<AAGName>’ Failed”.

SQL Server Configuration: Parallelism

Max degree of parallelism (MAXDOP)

MaxDOP is the parameter used to specify the maximum number of processors to use for a single execution of a SQL statement. 0 is the default value for MaxDOP; this allows SQL Server to use all the available processors (up to 64 processors). To suppress parallel plan processing you can set MaxDOP to 1, which will limit SQL Server to using 1 processor per query.

Cost threshold for parallelism

Cost threshold for parallelism specifies the threshold at which SQL Server creates and runs parallel plans for queries. When the estimated cost of a query exceeds the threshold, the plan may run in parallel, depending on the MaxDOP setting. The threshold value is the estimated cost of running the query on a specific set of hardware, not a unit of time. The default value for Cost threshold for parallelism is 5, which is very low.

Wait what? what does that mean?

Think of a classroom: the database engine is the teacher and the students are the cores available to SQL Server. In this example we will say there are 4 students. The teacher asks the students to count all the jellybeans in a jar, so each of the 4 students grabs a handful of jellybeans, counts them, then grabs more until the jar is empty, and the teacher then collects the results from each student. This is great if there are hundreds of beans in the jar. But what if there are only 4 beans in the jar? With a low cost threshold for parallelism, each student would still take an equal share (1 bean each), count it, and then the teacher would collect and combine the results… terribly inefficient, because the coordination costs more than the counting.

Guidelines

To avoid simple plans going parallel, for a new installation or a system that has not previously had the Cost threshold for parallelism configured, it’s recommended that it’s set to 50. This can then be tuned depending on your workloads.

MaxDOP, however, depends on the CPU configuration of your server.

MaxDOP and Cost threshold for Parallelism can both be configured on the Advanced tab of the SQL Server properties screen in SSMS.
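They can also be changed with sp_configure. A minimal sketch, using the commonly recommended starting value of 50 for the cost threshold and a MaxDOP of 8 purely as an illustrative value (pick yours based on your core and NUMA layout):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 50;
EXEC sp_configure 'max degree of parallelism', 8; -- illustrative value only
RECONFIGURE;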

SQL Server Installations

During the installation of SQL Server (2019 onwards), there is now a MaxDOP configuration page which automatically suggests a MaxDOP setting, which is great. However, there are some applications that do not work well with parallelism, where the software vendor recommends MaxDOP is set to 1. SharePoint is one of these. Don’t get caught out!

SQL Server Configuration: Fill-Factor

Fill-factor determines the percentage of space on each leaf-level page to be filled with data, reserving the remainder on each page as free space for future growth. This only takes effect when an index is created or rebuilt.

For example, setting fill-factor to 90 means that 10 percent of each leaf-level page will be left empty. This provides space for index expansion as data is added. The empty space is reserved between the index rows rather than at the end of the index.

SPACE

There are space implications to leaving this white space in your indexes. For a 100MB index you would need 10% more space, making the index 110MB. That doesn’t sound too bad, does it? But what if your index was 5GB? That’s now 5.5GB for 1 index… now imagine that for all the indexes on your database…

PAGE SPLITS

By leaving this free space at the leaf level there is the potential to reduce index fragmentation and page splits. Page splits occur when there is no room within the index page for new data to be written; the database engine moves approximately half of the rows on the page to a new page to make room for the new data. This is a fairly resource-intensive operation which can cause an unnecessary increase in disk I/O. Trying to avoid (too many) page splits should be the main reason you’re setting a fill-factor.

READ PERFORMANCE

While configuring the fill-factor of an index may reduce the occurrence of page splits, it will also decrease read performance, as the database engine will have more index pages to read through. Microsoft clearly states: “a fill-factor value of 50 can cause database read performance to decrease by two times”.

CONFIGURATION

Configuring a correct fill-factor shouldn’t be done on a mass scale, and in my opinion it shouldn’t be done at instance level. It’s very likely that it will hurt your performance more than it helps.

Workloads should be taken into account, and you should definitely be checking your performance counters to see how SQL Server is coping with the current workload. If a large number of page splits is occurring, it’s worth drilling down to index-level analysis to see where they are happening and setting the fill-factor on those individual indexes, as in the sketch below.
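Once you’ve identified a specific index that suffers from page splits, a minimal sketch of a targeted fix looks like this (the table and index names are hypothetical):

-- Apply a fill-factor only to the index that is actually splitting
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders
REBUILD WITH (FILLFACTOR = 90);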

BUT I READ ONLINE IT CAN REDUCE PAGE SPLITS

It’s not uncommon for me to look at an instance that has been under the care of another DBA and find a default fill-factor has been set (usually 80!). Other than the wasted space within every index, it could decrease your read performance by 20%!

STOP IT!

Find where your pain is, and target your pain relief there. This blanket setting is just going to cause you other issues further down the line.

SQL Server Configuration: Optimise For Ad Hoc Workloads

Optimise for Ad Hoc Workloads changes the way in which SQL Server retains query plans in the plan cache. When this option is turned on, the Database Engine stores a small compiled plan stub in the plan cache when a batch is compiled for the first time, instead of the full compiled plan. This reduces the size of the plan cache and stops reusable plans being forced out of the cache when the cache threshold is reached. It can also help to alleviate memory pressure, as the plan cache requires less memory.

There are only extreme edge cases where enabling this option would have a negative performance impact, like lots of queries each compiling twice over a short period of time. In 99.9% of cases, SQL Server will benefit more from the memory you’re saving by freeing up this cache space than it loses from the few CPU cycles spent compiling a plan a second time.

You can use this query to find all the single use plans in your SQL Server’s Plan Cache;

Query

SELECT objtype, cacheobjtype,
       AVG(usecounts) AS Avg_UseCount,
       SUM(refcounts) AS AllRefObjects,
       SUM(CAST(size_in_bytes AS bigint)) / 1024 / 1024 AS Size_MB
FROM sys.dm_exec_cached_plans
WHERE objtype = 'Adhoc' AND usecounts = 1
GROUP BY objtype, cacheobjtype;

Turning on the Optimize for ad hoc workloads setting only affects new plans; plans that are already in the plan cache are unaffected. You would need to clear the plan cache or restart SQL Server to flush out all your existing single-use plans.

I always turn Optimize for ad hoc workloads on by default.
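Enabling it is a one-off sp_configure change, sketched below:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;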

There is not enough space on the disk – Error when performing CheckDB

The database integrity suite of commands (CheckDB, CheckAlloc, CheckTable, CheckCatalog & CheckFileGroup) takes an internal read-only snapshot of the database while the DBCC command is running.

These database snapshots work in the same way regular snapshots work: as data is changed in the original database, the pre-change pages are copied into the snapshot, which means the snapshot grows in proportion to the amount of data changed.

Depending on database activity during the time the DBCC command is running, you could therefore encounter disk space issues, and unfortunately the internal database snapshot is created on the same drive as the user database. As a failsafe, to protect the operational integrity of the user database, the snapshot is marked as suspect and the DBCC command fails.

This is failure by design; unfortunately there is no way to allocate a different drive for the internal snapshot to use.

OPTIONS

  • Provide enough disk space for the DBCC Command to execute successfully
  • Perform the Check on a restored backup of the live database.

On large, highly transactional systems, performing the weekly CheckDB on an offline copy of the database might be the preferred option. I’ve previously had to do this for 4TB+ CRM & ERP systems, where the CheckDB took over 48 hours to complete.
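A minimal sketch of that restore-and-check approach (the database name, backup path, file locations and check options are all hypothetical; tailor them to your environment):

-- Restore the latest full backup to a non-production server
RESTORE DATABASE [MyDatabase_Check]
FROM DISK = N'X:\Backups\MyDatabase_FULL.bak'
WITH MOVE N'MyDatabase' TO N'D:\Data\MyDatabase_Check.mdf',
     MOVE N'MyDatabase_log' TO N'L:\Logs\MyDatabase_Check.ldf',
     REPLACE;

-- Run the integrity check against the restored copy
DBCC CHECKDB (N'MyDatabase_Check') WITH NO_INFOMSGS, ALL_ERRORMSGS;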