Channel: Humor Me a T-SQL Princess

SQL Detective: A Curious Tale of Long Waits


This morning while I was on the phone with an IT person from another department discussing a project, he asked if I could take a look at their main production SQL Server instance. The users were experiencing quite a few timeouts. As a side note, this particular database server is highly transactional for our organization and is used 24×7. It’s also mid-morning. Not the middle-of-the-night-and-I-can’t-believe-I’m-awake-and-get-off-my-lawn morning. Note: This one is running SQL Server 2014 Standard edition.

Using our handy dandy monitoring tool, I was fairly quickly able to pinpoint the likely culprit. Everything looked fine except for the SQL Server waits. There was a ten-minute period where the wait times were unusually high. In this case, it was PAGEIOLATCH_SH. The wait times are normally very low, like single digits low. This time they were in the triple digits, coming close to quadruple digits. Yeah, that seemed awfully high.

PAGEIOLATCH_SH: Not Always the Problem

According to Microsoft, a PAGEIOLATCH_SH “Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Shared mode. Long waits may indicate problems with the disk subsystem.”

What does that mean? It basically means it’s taking longer than normal to retrieve data from disk into memory in order to read it. Normally I’d start looking closer at the disk and such. In this case, I wanted to see if something else could have caused this besides a potential problem with the disk subsystem. I then looked at the top SQL that was running at that time and found something very curious.
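If you don't have a monitoring tool handy, a quick look at what's waiting right now is available from the standard DMVs. A minimal sketch (everything here is a built-in SQL Server DMV; nothing is specific to my environment):

```sql
-- Currently executing requests stuck on PAGEIOLATCH waits, worst first.
SELECT r.session_id,
       r.wait_type,
       r.wait_time            AS wait_time_ms,
       r.blocking_session_id,
       t.text                 AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.wait_type LIKE 'PAGEIOLATCH%'
ORDER BY r.wait_time DESC;
```

This only shows waits happening at the moment you run it; for the historical ten-minute spike, you still want a monitoring tool or a baseline of sys.dm_os_wait_stats.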

Wait… What?

Feelings… lots and lots of feelings

First, I had visual confirmation of the timeouts. Yep. Lots and lots of red for lots and lots of timeouts. And then I saw it. Lo and behold, someone was either creating or modifying a non-clustered index on a large table right before all the timeouts started. Wait… what? Did I mention this is a highly transactional database server? Thankfully our tool also lists the login being used, so I could inform the people of importance of the who, what, and where. No, it wasn’t a DBA who did this. It turns out someone ran a script in production instead of non-production. Been there, done that. They confirmed what had happened, and everything had been running fine since. I’m just glad it wasn’t a disk problem. *Sanity note: In this particular case, I don’t have a say as to who has these types of permissions in production.*
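If your monitoring tool doesn't capture logins and you catch the offending statement while it's still running, the session DMVs will tell you who and from where. A sketch using only built-in DMVs (the LIKE filter on index DDL is a rough heuristic, not a precise match):

```sql
-- Who is running index DDL right now, and from which machine/application.
SELECT s.session_id,
       s.login_name,
       s.host_name,
       s.program_name,
       t.text AS sql_text
FROM sys.dm_exec_sessions AS s
JOIN sys.dm_exec_requests AS r
    ON r.session_id = s.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE t.text LIKE '%CREATE%INDEX%'
   OR t.text LIKE '%ALTER%INDEX%';
```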

Index Maintenance: Time After Time

Creating or rebuilding a non-clustered index offline takes a shared lock on the table, making it effectively read-only for the duration. (Online index operations avoid most of that blocking, but they require Enterprise edition, and this instance is Standard.) That means SELECT statements should work, in general, unless the application decides it doesn’t like it for whatever reason. In this case, the application was not a happy camper by any means. I checked the statements that were timing out and sure enough, they were all referencing that particular table.
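You can see the locking for yourself: while an offline index build runs in one session, check the table-level locks from another. A sketch; run it from a separate session in the same database:

```sql
-- Table-level locks held by other sessions; expect request_mode = 'S'
-- (shared) on the table during an offline non-clustered index build.
SELECT l.request_session_id,
       l.resource_type,
       l.request_mode,
       l.request_status,
       OBJECT_NAME(l.resource_associated_entity_id) AS table_name
FROM sys.dm_tran_locks AS l
WHERE l.resource_type = 'OBJECT'
  AND l.request_session_id <> @@SPID;
```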

This is also why it’s generally preferred to test things out on a non-production system and implement any index modifications after hours (or during a low-volume part of the day) unless it’s an emergency, or you’re fairly certain the impact will be low and the people in charge are okay with it. It’s also why I tend to be super cautious and triple-check which server I’m on before I run any scripts.
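My triple-check habit is low-tech: before anything else in a script, print where I am, and actually look at it:

```sql
-- First lines of any script I'm about to run: confirm instance,
-- database, and login before touching anything.
SELECT @@SERVERNAME   AS server_name,
       DB_NAME()      AS database_name,
       SUSER_SNAME()  AS login_name;
```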

Connecting the Dots

So if PAGEIOLATCH_SH normally indicates a possible issue with disks, how does that relate to a non-clustered index modification? The simple answer for this wait type is that the requested page did not reside in memory, had to be retrieved from disk, and that retrieval took longer than normal, causing timeouts. Add in an index build generating heavy reads against that same table on those same disks, on top of all the normal requests, and well, there you have it. *poof*

This instance does not have a timeout period set; the timeouts were coming from the applications, which usually have a timeout setting of some sort. Not all of the SQL calls were stored procedures, either. There were quite a few ad hoc queries as well, and I do not believe the instance is configured to optimize for ad hoc workloads. By the way, this is a vendor-supported system, so there isn’t much I can do with it other than make notes and recommendations.
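Since it's vendor-supported, I can only note that setting rather than change it, and checking it is read-only and safe:

```sql
-- Is 'optimize for ad hoc workloads' enabled? (value_in_use: 0 = off, 1 = on)
SELECT name, value_in_use
FROM sys.configurations
WHERE name = 'optimize for ad hoc workloads';
```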

Solving mysteries is fun!

In parting, please remember the following:

1. Please remember that it rarely hurts to double (or in my case, triple) check which server you’re running that script on. I’m pretty sure most of us have done it at least once or twice or umm… *cough cough*

2. Know thy server and workloads. Do you truly really genuinely need to change that index in production during peak work hours? Are you sure? Are you really, really sure? Yes? Then are you at least emotionally prepared to deal with the fallout in case it doesn’t go as planned? What? Things don’t always go as planned? Yes, yes that does happen on occasion.

3. PAGEIOLATCH_SH doesn’t necessarily mean you have disk problems. Try to get a look at what’s going on in your system as a whole.

4. I’m not a real detective. I just get to play one at work sometimes.

5. No SQL instances were harmed in the writing of this post.

6. Please remember to feed your DBAs and provide them access to copious amounts of caffeine at all times. 🙂

