Drawing conclusions from PAL Reports

Dec 7, 2012 at 9:58 PM

Hi Clint and CodePlex community, 

We have some customers experiencing application performance issues in production, and we have been doing a deep dive to see what we can do to improve performance. We analyze logs, use SQL Profiler to trace query executions, collect logs with PerfMon, and do DB-layer maintenance such as reindexing databases and purging old records, etc. The tool that gives us the most useful picture is PerfMon, and we run PAL to generate reports from the log files.

We see that the PAL report raises the following alerts:

- PhysicalDisk Read Latency (Avg. Disk sec/Read): Avg 0.084, Max 3.213 seconds
- PhysicalDisk Write Latency (Avg. Disk sec/Write): Avg 1.458, Max 17.895 seconds
- PhysicalDisk Current Disk Queue Length: Avg 22, Max 255
- PhysicalDisk Avg. Disk Queue Length: Avg 398, Max 14,464

Processor % Processor Time (Min - Avg - Max per processor core)
==========

0 - 3 - 23 - 100
1 - 0 - 19 - 100
2 - 1 - 24 - 100
3 - 0 - 17 - 100
4 - 0 - 19 - 100
5 - 0 - 46 - 100
6 - 0 - 19 - 100
7 - 1 - 22 - 100

There is no memory contention; basically only disk and CPU generate alerts. 

A little bit of background on the application and server environment:
=============================================
- It is a PMS application written in VB 6.0, with SQL Server as the DBMS.
- There are also some interface apps written in VB 6 that perform a variety of tasks, such as COM-port communication with call accounting and PBX systems, and web-service calls/polls for XML message exchange over the network, etc. All of these interfaces also poll the database for new messages to process. 
- Server environment: Windows Server 2003 Standard Edition SP 2 
- RAM: 4.00 GB
- CPU: Intel(R) Xeon(R) CPU E5335 @ 2.00 GHz 
- Virtual Memory: 6,141 MB
- Server hosts apps and interfaces for 2 properties. This means up to 6-8 users can be logged on to this server running the main PMS application via Remote Desktop from their thin-client Wyse terminals.
- All the apps and interfaces are installed and run on the server itself. 
- DBs (SQL Server 2000 Standard Edition 8.00.2305) are also installed on the server. 
- Since there are 2 properties, there are 2 SQL Server instances.
- Each SQL instance houses 7 databases with size ranging from 10MB up to 2GB.

The server pretty much just hosts and runs SQL Server, the apps, and the interfaces. The only other major thing that visibly runs is McAfee AV; everything else is pretty much server-related services. 

At the time of writing this, there are 5 users logged into the server and Task Manager shows:

2.78 GB RAM used
CPU spiking anywhere from 7%-38%
98 processes running with 31,638 handles and 1,465 threads.
SQL Instance#1 - 849MB memory (not dynamic. Configured at 849)
SQL Instance#2 - 581MB memory (not dynamic. Configured at 581)

Some of the interfaces that are run on the server:
CC dialer interface#1 - I/O Reads: 1,173,447,256 | Writes: 5,936
CC dialer interface#2 - I/O Reads: 1,106,077,515 | Writes: 4,543

Between the 2 properties they are running 20 apps and interfaces combined:
- 5 PMS instances
- 6 interfaces for call accounting, PBX, and voice mail that communicate with 3rd-party systems over COM ports and poll for messages from the PMS side every 1-5 minutes.
- 9 interfaces that communicate over the Internet for reservation delivery via web services, file uploads to a central data warehouse, etc. These also poll for new messages every 1-5 minutes.

There are 8 SQL jobs that run at different times throughout the day to complete tasks such as DB backups, reindexing, DBCC CHECKDB, certain query-intensive application jobs, and exporting data to other servers.
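To verify that these jobs really do stay out of the busy hours, here is the kind of query we can run against SQL 2000's SQL Agent tables (a sketch; times come back in msdb's integer formats):

-- Sketch: when did each SQL Agent job run and how long did it take?
-- run_date is yyyymmdd, run_time is hhmmss, run_duration is hhmmss.
SELECT j.name, h.run_date, h.run_time, h.run_duration
FROM msdb.dbo.sysjobhistory h
JOIN msdb.dbo.sysjobs j ON j.job_id = h.job_id
WHERE h.step_id = 0              -- step 0 = overall job outcome
ORDER BY h.run_date DESC, h.run_time DESC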


Since the apps and interfaces are all DB-intensive, it is understandable that there would be a lot of memory access and disk reads/writes. Now, the fact that memory is not being used much while there are lots of disk reads/writes might be an orange flag here. Logically, if we give SQL Server more memory, the queries won't have to go to disk for data as often, right? (Unless the cache is flushed... I know this is a very generic question that's hard to answer without technical details and variables.) 
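For what it's worth, here is a sanity check we can run ourselves against SQL 2000's master.dbo.sysperfinfo, which exposes the same counters PerfMon sees (a sketch; on a named instance the object_name prefix is MSSQL$<instance> rather than SQLServer):

-- Sketch: is the buffer pool under pressure?
SELECT object_name, counter_name, cntr_value
FROM master.dbo.sysperfinfo
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name IN ('Page life expectancy',
                       'Buffer cache hit ratio',
                       'Buffer cache hit ratio base')

-- Hit ratio (%) = 100.0 * [Buffer cache hit ratio] / [Buffer cache hit ratio base].
-- A Page life expectancy that stays under ~300 seconds suggests the pool is
-- too small and SQL keeps going back to disk for data it recently evicted.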

Back to the main point...

So we collect PerfMon logs, generate PAL reports, and see high disk read/write latency. 

What do we do after that? We know the apps are DB-intensive, so there will be disk reads/writes. When they exceed the standard thresholds, do we look at it from the standpoint of "OK, could the disk be performing at an inefficient level?", or from the standpoint of "How can we make the application less disk-IO intensive?"? 

We have the same application distributed to thousands of other clients on different platforms, and "performance" is not the common denominator: the majority of our clients do not experience application slowness. That's why we are concentrating on this specific incident. 

Again, the question is: we see high disk reads/writes. What is the next step after that? 

 

Any advice, suggestions, recommendations, or shared experience would be greatly appreciated. I might have left out information you would need in order to advise me; if so, please let me know what else I need to provide. 

I haven't provided much detail on the thin-client Wyse terminals used at the properties, but we also have customers in the same chain who use the same terminals and do not experience slowness issues.  

Finding a solution to this would help us a great deal and solve an issue we have been battling for 2 years. 

Thank you,

Amar

Coordinator
Dec 11, 2012 at 5:03 AM

Hi Amar,

Thank you for the high level of detail.

Those disk latency numbers are very high. It tells me that the disks are very likely overwhelmed. Since I am not an expert on SQL Server, I will take a generic, operating-system approach to this.

First, I would get a Process Monitor (http://live.sysinternals.com/procmon.exe) trace. Process Monitor is a Sysinternals tool owned by Microsoft. Have it trace while disk latency is above 25 ms (0.025 seconds) for about 1 minute, then stop the trace. Ensure the filter is set to file and disk only (uncheck registry, network, profiling events, etc.). Go to Tools, Process Summary: you should see SQL Server as the top user of disk activity. If not, you need to find out what those other processes are and why they are consuming so much disk IO. Next, go to Tools, File Summary to ensure that the files SQL Server is using are what they are supposed to be. Finally, go to Tools, Stack Summary, sort by Count (highest at top), then expand the tree view on the left completely. You should see some call stacks with red "K"s; these are kernel calls. We are looking for any suspicious disk filter drivers that are causing more than 5 ms of delay. Look for anything non-Microsoft. This is to prove or disprove the existence of disk filters that can cause unnecessary delays in disk IO processing; I commonly find anti-virus and anti-intrusion software causing delays here.

If you don't find any delays in disk filter drivers, then look at the SQL queries doing the most disk IO. I believe there is a free app on CodePlex called SQL Nexus that can help with this.

Finally, if the disks are not able to sustain more than 200 IOPS (measured by \LogicalDisk(*)\Disk Transfers/sec) or at least 10 MB per second, then it is very likely that the hardware is not working correctly or is overwhelmed. If that is the case, then check out my magazine article, "How to speak SAN-ish" at http://mcpmag.com/articles/2011/05/12/how-to-speak-san-ish.aspx
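By the way, SQL 2000 can report per-file IO and stall totals itself through the fn_virtualfilestats function, which gives you a SQL-side counterpart to the Process Monitor file summary. A rough sketch (the database name is a placeholder; run it per database and file):

-- Sketch: cumulative per-file IO and stall time since SQL Server started.
-- IoStallMS is the total time (ms) that users waited on IO against the file.
SELECT DbId, FileId, NumberReads, NumberWrites, IoStallMS
FROM ::fn_virtualfilestats(DB_ID('YourPmsDb'), 1)   -- file 1: primary data file
UNION ALL
SELECT DbId, FileId, NumberReads, NumberWrites, IoStallMS
FROM ::fn_virtualfilestats(DB_ID('YourPmsDb'), 2)   -- file 2: transaction log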

Dec 12, 2012 at 6:35 PM

Your SQL jobs such as checkdb and reindexing should only be run at night (assuming you aren't a 24x7 shop).

Is SQL getting memory pages swapped to disk due to memory pressure? Have you granted the service account "lock pages in memory" privs?

Why don't you run perfmon to collect SQL stats and run PAL on that to see if SQL is causing you issues? SQL collects all the wait statistics you need on disk IO.
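On SQL 2000 you can also dump those wait stats straight from a query window. A rough sketch:

-- Sketch: cumulative wait statistics since startup (SQL 2000).
-- High wait times on IO-related types (e.g., PAGEIOLATCH_*) point at disk.
DBCC SQLPERF(WAITSTATS)

-- To measure a specific window, clear the counters, reproduce the
-- slowness, then dump them again:
DBCC SQLPERF(WAITSTATS, CLEAR)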

Greg

Dec 12, 2012 at 6:58 PM

Clint: Thank you for taking the time to read through my long note and providing us with some excellent advice we can use for further analysis. I will report my findings here for further discussion, and we will see where it leads us. 

Gcormier: Our customers run 24x7, so the SQL Agent jobs are staggered (so as not to overlap with one another) and run at night and over weekends (during the slowest hours). However, there are other tasks, such as the hourly transaction log backup, that are required to run throughout the day.

My original thought was disk swapping due to memory pressure. However, memory doesn't seem to be the issue. The group that manages the server runs a SiteScope monitoring tool to log metrics such as CPU usage, memory usage, network bandwidth, etc. Analyzing that data over an extended period never supported my theory of low memory or high CPU causing the performance problem. As I mentioned above, the server has 4 GB of RAM and usage does not go beyond 2.87-3 GB. Even if the OS reserves around 300-400 MB, there is still room for extra memory usage.

Now, if you are talking about SQL having limited memory allocated, and the app trying to do more than the maximum amount of memory SQL can use, then yes, it is possible that SQL would start hitting disk because the OS wouldn't let SQL use memory beyond what is configured. However, we have increased SQL memory in the past and haven't seen much improvement. If we allocate more memory to SQL, it will just use whatever it is given, and increasing the allocation means fewer memory resources for the server itself; that can bring its own heavy performance issues. Our client's position is that it is not necessarily the hardware, because there are other clients running on the same server blades with no performance issues reported. 

As for your question about running PerfMon to collect SQL stats: I did run PerfMon to collect SQL stats (though not all of them):

- Buffer manager (all counters)
- Cache Manager (all counters)
- Databases (all counters)
- General Statistics (all counters)
- Latches (all counters)
- Locks (all counters)
- Memory Manager (all counters)
- SQL statistics (all counters) 

However, all of these come back as "stats only" data. The server is SQL 2000, and the latest version of PAL doesn't have an option to analyze SQL 2000 thresholds. I remember another forum discussion where someone asked about this same issue.
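Since PAL can't evaluate these for us, one workaround we are considering is snapshotting the counters into a table from a SQL Agent job and trending them ourselves. A sketch (PerfSnapshot is a made-up table name):

-- Sketch: periodically snapshot key SQL 2000 counters for manual trending.
CREATE TABLE PerfSnapshot (
    sample_time  datetime    NOT NULL DEFAULT GETDATE(),
    object_name  nchar(128)  NOT NULL,
    counter_name nchar(128)  NOT NULL,
    cntr_value   bigint      NOT NULL
)

-- Run this from a SQL Agent job every few minutes:
INSERT INTO PerfSnapshot (object_name, counter_name, cntr_value)
SELECT object_name, counter_name, cntr_value
FROM master.dbo.sysperfinfo
WHERE counter_name IN ('Page life expectancy', 'Buffer cache hit ratio',
                       'Buffer cache hit ratio base', 'Batch Requests/sec')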

Any other thoughts you have or any other light you can shed would be greatly appreciated.

Thanks again!

Amar 

Dec 12, 2012 at 7:06 PM

What is the page life expectancy of SQL?

Why not add more RAM? It's cheap and probably costs less than the time you are spending on the issue. :)

Dec 12, 2012 at 7:39 PM

Greg: The following statistics are directly from the PAL report...

Page life expectancy
Condition: No thresholds
Min: 0
Avg: 193.624
Max: 1480
Hourly Trend: -61
Std Deviation: 236.442
10% of Outliers Removed: 126.794
20% of Outliers Removed: 97.322
30% of Outliers Removed: 80.434

In regards to adding RAM: unless we can prove that memory is low and that this is what is causing the performance problem, we can't really ask our client to throw more hardware at it. If there is room for improvement within the application, the SQL/DB configuration, the anti-virus scanning, the disk subsystem performance, etc., those are the areas we want to look into first. The last thing we want to tell our clients is to spend money on extra hardware, and at that point we would need concrete evidence or data to show them. 

How do we find out whether the volume of DB calls and data requests (reads/writes) we make at any given time exceeds the total amount of memory SQL is allocated, and whether that is what drives the heavy disk activity? Ultimately, lots of disk IO is causing the performance issue; we are just not sure of the source. Is it the application running non-optimal queries and making unnecessary DB calls, is it insufficient memory forcing SQL to go to disk, is it the disk subsystem not working at an optimal level, or is it the antivirus scanning every disk IO and delaying response times, etc.? 
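One data point we plan to gather toward answering that is whether SQL is constantly running at its configured ceiling. A sketch on SQL 2000 ('max server memory' is an advanced option, so it has to be made visible first):

-- Sketch: compare SQL's configured memory ceiling with what it actually uses.
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
EXEC sp_configure 'max server memory (MB)'   -- the configured ceiling

-- Target vs. Total server memory from the Memory Manager object:
SELECT counter_name, cntr_value   -- values are in KB
FROM master.dbo.sysperfinfo
WHERE object_name LIKE '%Memory Manager%'
  AND (counter_name LIKE 'Target Server Memory%'
       OR counter_name LIKE 'Total Server Memory%')
-- If Total sits at Target around the clock, SQL is using everything it is
-- allowed and would likely put more memory to use if it were granted.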

Dec 12, 2012 at 7:52 PM

It looks like there is serious memory pressure. How big are your databases? With the size of your memory limits, I would hope the databases are no more than 2 GB and 1 GB respectively.

Your counter should be at least 300, but the higher the better. Yours averages under 200 seconds, and with outliers removed it drops below half the recommended value, which means a given page lives in memory for only a couple of minutes before it is overwritten by another page. This means there is not enough memory on your server and SQL is forced to fetch data from the disks often.

http://www.sqlmag.com/article/sql-server/page-life-expectancy-a-reliable-indicator-of-sql-server-memory-pressure

Coordinator
Jan 3, 2013 at 10:15 PM

The SQLServer.xml threshold file has a threshold of 300 (5 minutes) or less for page life expectancy. I'm surprised that it did not throw an alert for it.

In regards to "memory swapping", it is only a problem if the disks cannot keep up with hard page faults. Hard page faults occur whenever the requested memory is not in the working set (RAM) of a process. Many people assume that hard page faults refer to the paging file, but that is not correct: a hard page fault simply means disk access, which *could* be the paging file or just files loading from disk. I use a combination of a lack of available RAM (\Memory\Available MBytes), more than 1,000 hard page faults per second (\Memory\Pages/sec, about 4 MB per second), and paging file growth (\Paging File(*)\% Usage) to indicate memory swapping. No one of these counters can be used alone to make this determination.

When Page Life Expectancy is low in SQL Server, it typically means that SQL needs more memory. A common pitfall for DBAs is locking too much RAM using page locking in SQL. Memory locked in this way cannot be used by all parts of SQL; for example, SQL cannot use the memory it locked for its thread stacks.

FYI. David Pless and I are working on an updated SQL threshold file.