Interpreting results based on sampling

Nov 9, 2009 at 5:11 AM

Hi Clint,

I'm a new PAL user (new Perfmon user too for that matter), and I'm trying to get an understanding of the time intervals.

For example, I used the standard MOSS template, which has Perfmon sample every 15 seconds.

I ran the test for 2 hours. The Analysis Interval in the report is 3 minutes.

I'm seeing alerts for the disk drive, for example:


Physical Disk Read Latency Analysis (Alerts: 12)
Physical Disk Write Latency Analysis (Alerts: 28)
Process IO Data Operations/sec (Alerts: 1)
Process IO Other Operations/sec (Alerts: 15)

In the detailed part of the report it shows these are alerts of:

Disk responsiveness is very slow (spike of more than 25ms)

Average disk responsiveness is very slow - more than 25ms

Since this is a sampling, I guess I'm wondering how to multiply this out to get a ROUGH IDEA of how many times in 2 hours these alerts really occurred.

The "average" read and write measurements aren't bad (0.004 / 0.006):

PhysicalDisk(_Total)\Avg. Disk sec/Read 0 0.004 0.737
PhysicalDisk(0 C: D:)\Avg. Disk sec/Write 0 0.006 0.432

Any advice on interpreting the above, taking into account the sampling for both the Perfmon logging and the report analysis interval, would be appreciated.

Thanks

Nov 9, 2009 at 6:08 PM

Clint will probably have a better answer for you than me, but this is what I do. I use PAL to look at my perfmon logs and get a good overview of what is going on. If there is something that I want to look at further, I load the log file into perfmon and look only at the counters I am interested in. If you need a step by step on how to do this, let me know and I will try to post some instructions.

Now I will be interested to see what Clint has to say on the subject. :-)

Coordinator
Nov 10, 2009 at 5:13 AM

The collection interval is the time interval at which the values in the counter log were collected. In your case, 15 seconds.

The analysis interval is the time slice that PAL is using. If you choose AUTO, then PAL will slice up the log into 30 time slices. For example, if you have a 30 hour log, then PAL will slice it into 1 hour time slices. If you have a 30 minute log, then PAL will slice it into 1 minute time slices. The reason it does this is to produce an "average" of a given amount of time.
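As a rough sketch of the arithmetic above (this is not PAL's actual code, and the function names are made up for illustration), the AUTO interval is the log duration divided by 30, and the number of raw samples averaged into each slice is the analysis interval divided by the collection interval:

```python
from datetime import timedelta

AUTO_SLICES = 30  # per the description above: AUTO cuts the log into 30 slices


def auto_analysis_interval(log_duration: timedelta) -> timedelta:
    """Length of one time slice when the log is cut into 30 equal pieces."""
    return log_duration / AUTO_SLICES


def samples_per_slice(analysis_interval: timedelta,
                      collection_interval: timedelta) -> int:
    """How many raw Perfmon samples fall into one analysis slice."""
    return analysis_interval // collection_interval


print(auto_analysis_interval(timedelta(hours=30)))    # 1:00:00
print(auto_analysis_interval(timedelta(minutes=30)))  # 0:01:00

# The log from the first post: 3 minute slices built from 15 second samples.
print(samples_per_slice(timedelta(minutes=3), timedelta(seconds=15)))  # 12
```

So a 2 hour log analyzed at a 3 minute interval has 40 slices, each summarizing 12 samples.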

Disk response times greater than 25ms are very bad, but the counters alone don't tell us why they are bad. For disk I/O, I highly recommend running Process Monitor for a minute or so while the disk I/O is bad. It will tell you what kind of I/O is happening on that server so you can judge whether it's important or not. http://live.sysinternals.com/procmon.exe
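The Avg. Disk sec/Read and Avg. Disk sec/Write counters are reported in seconds, so the figures quoted earlier in the thread need converting before they can be compared with 25 ms. A small sketch, assuming the three values on each quoted line are minimum, average, and maximum (the thread does not label the columns):

```python
THRESHOLD_MS = 25.0  # the "very slow" threshold quoted in the report

# (min, avg, max) in seconds; the meaning of the columns is an assumption.
counters = {
    r"PhysicalDisk(_Total)\Avg. Disk sec/Read": (0.0, 0.004, 0.737),
    r"PhysicalDisk(0 C: D:)\Avg. Disk sec/Write": (0.0, 0.006, 0.432),
}


def to_ms(seconds: float) -> float:
    """Convert a latency in seconds to milliseconds."""
    return seconds * 1000.0


def summarize(values):
    """Compare the average and maximum latency against the threshold."""
    _low, avg, high = (to_ms(v) for v in values)
    return {
        "avg_ms": avg,
        "max_ms": high,
        "avg_over_threshold": avg > THRESHOLD_MS,
        "max_over_threshold": high > THRESHOLD_MS,
    }


for name, values in counters.items():
    print(name, summarize(values))
```

The whole-log averages come out well under the threshold while the maximums are far above it, which is consistent with alerts being raised for individual slices or spikes rather than for the log as a whole.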

Serverguy is correct. PAL is a time-saving tool that gives you the highlights of what to look at and what not to look at in the perfmon log.

Nov 10, 2009 at 10:35 PM

Hi Clint

You wrote "For disk I/O, I highly recommend running Process Monitor for a minute or so collecting when the disk I/O is bad. It will tell you what kind of I/O you are doing on that server to see if it's important or not. http://live.sysinternals.com/procmon.exe"

I ran Process Monitor for the same reason (high disk I/O), but I do not know where in Process Monitor this info can be retrieved. Where should I look in Process Monitor to get this info? Thanks

Coordinator
Nov 10, 2009 at 11:52 PM

After capturing the data, click Tools, then Process Activity Summary to see which process is doing the most I/O. Next, use Tools, File Summary to see which files are hit the most. You can change the filter to limit it to a specific drive or directory. Finally, you can use Tools, Stack Summary to see if any backup software or anti-virus software is involved in the I/O.
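If you would rather tally a saved capture outside the GUI, something like the following could work. It is only a sketch: it assumes the capture was saved as CSV with "Process Name", "Operation", and "Path" columns, it counts read and write operations rather than bytes, and the sample rows are made up purely for illustration.

```python
import csv
import io
from collections import Counter

# Made-up sample rows for illustration only. A real export would be read with
# open("capture.csv", newline="", encoding="utf-8-sig") instead of StringIO.
SAMPLE_CSV = r'''"Time of Day","Process Name","PID","Operation","Path","Result","Detail"
"10:00:01","backup.exe","1200","ReadFile","D:\data\a.mdf","SUCCESS",""
"10:00:02","backup.exe","1200","ReadFile","D:\data\a.mdf","SUCCESS",""
"10:00:03","w3wp.exe","3400","WriteFile","D:\logs\iis.log","SUCCESS",""
"10:00:04","w3wp.exe","3400","RegOpenKey","HKLM\Software","SUCCESS",""
'''

FILE_IO_OPERATIONS = {"ReadFile", "WriteFile"}


def top_io(lines, n=3):
    """Count file reads/writes per process and per path, most frequent first."""
    by_process = Counter()
    by_path = Counter()
    for row in csv.DictReader(lines):
        if row["Operation"] in FILE_IO_OPERATIONS:
            by_process[row["Process Name"]] += 1
            by_path[row["Path"]] += 1
    return by_process.most_common(n), by_path.most_common(n)


processes, paths = top_io(io.StringIO(SAMPLE_CSV))
print("Busiest processes:", processes)
print("Busiest files:", paths)
```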

Nov 23, 2009 at 8:32 PM

Hi Clint

Below are some of the stats I collected from a server over a 3 day period. The PAL report was run with a 5 minute interval. All the alerts below are warnings. Aside from disk, can someone conclude from Memory Pages Input/sec (Alerts: 630) that there is a memory shortage? Thanks

Processor

Processor Utilization Analysis (Alerts: 0)

Processor Queue Length (Alerts: 0)

Privileged Mode CPU Analysis (Alerts: 27)

High Context Switching (Alerts: 0)

Excessive Processor Use by Processes (Alerts: 0)

Interrupt Time (Alerts: 0)

Network

Network Utilization Analysis (Alerts: 0)

Network Output Queue Length Analysis (Alerts: 1)

Memory Committed Bytes (Stats only)

Disk

Physical Disk Read Latency Analysis (Alerts: 94)

Physical Disk Write Latency Analysis (Alerts: 0)

Logical Disk Read Latency Analysis (Alerts: 2)

Logical Disk Write Latency Analysis (Alerts: 0)

Disk Free Space for a Kernel Dump (Alerts: 0)

Process IO Data Operations/sec (Alerts: 18)

Process IO Other Operations/sec (Alerts: 554)

LogicalDisk Disk Transfers/sec (Alerts: 0)

Memory

Free System Page Table Entries (Alerts: 0)

Pool Non Paged Bytes (Alerts: 0)

Pool Paged Bytes (Alerts: 0)

Available Memory (Alerts: 3)

Memory Pages/sec (Alerts: 2)

Memory Leak Detection (Alerts: 0)

Handle Leak Detection (Alerts: 0)

Process Thread Count (Alerts: 0)

High Virtual Memory Usage (Alerts: 0)

Process Working Set (Alerts: 0)

Memory System Cache Resident Bytes (Alerts: 1)

Memory Pages Input/sec (Alerts: 630)

Memory Percent Committed Bytes In Use (Alerts: 0)

Coordinator
Nov 24, 2009 at 8:34 AM

I have learned a lot since I initially wrote the PAL tool, but I have not updated the threshold files in PAL v1.x yet.

Pages Input/sec counts all hard page faults, not only those resolved from the page file. This means that the hard page faults could be something as simple as backup software reading a memory mapped file. See my blog article, "The Case of the Phantom Hard Page Faults" at http://blogs.technet.com/clinth/archive/2009/07/16/the-case-of-the-phantom-hard-page-faults.aspx.

There is no *easy* way to determine if a Windows computer is really running out of memory because of how it can trim working memory to the page file to reuse RAM. I met with the Windows product team last month and recommended new counters to be added to Windows to show page *file* reads/sec and writes/sec, so we can actually get a better idea of when the computer is running out of memory. For now, we have to just look at the Available MBytes counter. If Available MBytes is less than 100 MB or less than 10% of RAM, then the computer is likely running out of RAM. If Pages/sec increases dramatically when Available MBytes goes down, then that is another clue, but not direct evidence.
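The Available MBytes rule of thumb above can be written down as a quick check. This is just a sketch of that rule; the function name is made up, and the result is a hint rather than proof of a memory shortage.

```python
def likely_low_on_ram(available_mbytes: float, total_ram_mbytes: float) -> bool:
    """Rule of thumb: under 100 MB available, or under 10% of installed RAM."""
    return available_mbytes < 100 or available_mbytes < 0.10 * total_ram_mbytes


# Hypothetical readings, not taken from the logs in this thread.
print(likely_low_on_ram(350, 4096))    # True: 350 MB is under 10% of 4 GB
print(likely_low_on_ram(2000, 16384))  # False: above both limits
print(likely_low_on_ram(90, 512))      # True: under 100 MB
```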

When the computer runs out of RAM, the kernel will trim the working sets (RAM used by processes) and page them out. If the Commit Charge (all committed memory on the computer) gets close to the Commit Limit (RAM + page file(s)), then the computer will fail to allocate memory or the system will expand the page file which could lead to page file fragmentation.
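To make the Commit Charge versus Commit Limit relationship concrete, here is a minimal sketch. The function names and the 90% warning level are arbitrary choices for illustration, not PAL thresholds, and the limit is approximated as RAM plus page file sizes as described above.

```python
def commit_limit_mb(ram_mb: float, page_file_sizes_mb) -> float:
    """Approximate Commit Limit: RAM plus the size of every page file."""
    return ram_mb + sum(page_file_sizes_mb)


def percent_committed(commit_charge_mb: float, limit_mb: float) -> float:
    """Commit Charge as a percentage of the Commit Limit."""
    return 100.0 * commit_charge_mb / limit_mb


def near_commit_limit(commit_charge_mb: float, limit_mb: float,
                      warn_at_percent: float = 90.0) -> bool:
    """True when committed memory reaches the (arbitrary) warning level."""
    return percent_committed(commit_charge_mb, limit_mb) >= warn_at_percent


# Hypothetical server: 8 GB of RAM and one 4 GB page file.
limit = commit_limit_mb(8192, [4096])
print(limit)                            # 12288
print(percent_committed(6144, limit))   # 50.0
print(near_commit_limit(11500, limit))  # True
```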

If your company has a Microsoft Premier Support contract, then consider the Vital Signs workshop. It's a Windows architecture workshop that I teach that focuses on performance analysis.