Improper calculation of statistical values (Min, Avg, Max) / No graphs displayed for Memory Leak Detection and High Virtual Memory

Mar 17, 2009 at 3:00 PM

I am experiencing couple of issues with some of the counters:

 

1. Statistical Values: it seems like PAL improperly calculates Avg, Min and Max displayed in the counter instance statistics section.

As an  example I will email  you PAL report, which has ‘0’ for Min and Avg, and ‘20’ for Max value under System\Processor Queue Length, which is incorrect since graphs does shows threads waiting in the queue for the most part of the capture period.  In addition, if you look under Processor/Processor Time you would find that Average ‘7’ and Maximum ‘100’ under Processor/Processor Time are off too. If you visually compare graph against counter statistic section of the report. There are few other places in the report where you would find similar discrepancies.  

2. Processor/Processor: Graph only has one line for all processors, is there a way to view one line per processor?

3. Memory Leak Detection and High Virtual Memory: These graphs never get displayed.

 

Thank You in advance for your help.

Coordinator
Mar 18, 2009 at 2:00 AM
1. I have most of the analyses set to an integer (whole number) data type and the Processor Queue Length is one of them. The stats are being rounded and then averages per each time slice. The processor queue lengh is very low (less then 1 nearly the entire time), so it will naturally skew the stats a bit.'
The processor time is statistics are accurate and are based on the entire log. I just compared them to the perfmon log you sent me and they appear to be correct. Regarding the chart, the chart is plotting the averages of each of the timeslices (analysis interval), so you won't ever see the maximum and minimum values in the chart since they have been averaged out. In PAL v2.0, I am plotting *all* of the datapoints in high resolution charts (not possible in PAL v1.0), so it will be a better experience.

2. Yes, all of the processor instances will be displayed in the PAL analysis. You perfmon log only has the _Total instance, so that is only data it is able to display. Try creating a new perfmon log that has all of the counter instances and try again. You can export a perfmon log template from the System Overview threshold file by clicking the Export to Perfmon Log Template button.

3. The reason those analyses are not charting is because the only counter instance in the log is the _Total instance. Those analyses are trying to analyze all of the processes and explicitly exclude the _Total, Idle, and System processes/instances. I was going to recommend using the Quick System Overview threshold file, but it looks like it has the same problem. A bug on my part. Sorry. It looks like I forgot to enable some of the analyses, so it should work. You can fix this by editing the threshold file, locating the analysis, click Enable, click Update analysis, then click File, Save.

Thank you for the great feedback. PAL isn't perfect, but it's the best I can do for now. Remember, it is simply a time saving tool, not a replacement for traditional performance analysis.
Mar 18, 2009 at 7:46 PM

Clint - Thanks for the detailed explanation and quick response.

Mar 24, 2009 at 8:06 PM
I was impressed with great PAL tool. I have two questions:
1. How the "Hourly Trend" is measured.
2. How the "Process(sqlservr)\% Processor Time" is calculated. Should not I expect a value less then or equal to 100 since its %. My values Max value is 327.
I greatly appreciate your response.

Regards,
Coordinator
Mar 25, 2009 at 1:51 AM
1. Hourly Trend is the calculated slope from the data point on back to the beginning of the log. For example, if we are in the 10th time slice, then data points 1 through 10 are calculated for slope based on an hourly average. if we are in the 11th time slice, then data points 1 through 11 are calculaed for slope based on the hourly average.

2. Each process has the potential to consume 100% times the number of processors - meaning a single process has the potential of getting 400% on a server with 4 logical processors. Here is a detailed explaination that Bruce Adamczak (a brillant guy on my team) wrote:

Many people confuses what they see in Task Manager on the Processes tab in CPU column with Process\% Processor Time\Instance in Perfmon.  They are NOT the same counters.   There is NO counter in PerfMon that matches what you see in Task Manager on the Processes tab in CPU column.<o:p></o:p>

 

Process\% Processor Time\Instance is NOT the amount of time that the CPU’s were busy.  It is the % of time that this instance charges against the Processor\% User time.<o:p></o:p>

 <o:p></o:p>

The theoretical Max for this counter is (# of processors * 100)<o:p></o:p>

 <o:p></o:p>

It is possible to see spikes go above the theoretical Max explained below.<o:p></o:p>

 <o:p></o:p>

Why you see data greater than expected is due to a design bug in perflib V1/PDH. To calculate the percentage of CPU usage, PDH needs two samples (each with a raw value and a timestamp); the problem is that PDH uses only the instance name to match the processes, so it can sometimes use two samples from different processes. <o:p></o:p>

 <o:p></o:p>

For example, in the table below there are 3 processes (black, green and red). Each sample is 1 second apart, and the values represent the total number of milliseconds each process has ran for since it started. The first row indicates how PDH names the process; <o:p></o:p>

 <o:p></o:p>

            X   X#1     X#2<o:p></o:p>

Sample 1    0     0       0 <o:p></o:p>

Sample 2   20    10     500<o:p></o:p>

Sample 3   40    20    1000  ß Process in green is deleted, so process in red becomes X#1

Sample 4   60  1500          ß It looks like process X#1 ran for 1480s in this 1s interval!

Sample 5   80  2000          ß Everything seems back to normal now.

 <o:p></o:p>

In perflib V2 each instance has an instance id, which will prevent the above from happening (that is, the table won’t shift on sample 4 because we’d know process red is X#2, not X#1).<o:p></o:p>

 <o:p></o:p>

Applying the KB below will help with the above issue.<o:p></o:p>

 <o:p></o:p>

The Process object in Performance Monitor can display Process IDs (PIDs)<o:p></o:p>

http://support.microsoft.com/?id=281884<o:p></o:p>

 <o:p></o:p>

Here’s another reason why you might see spikes above theoretic Max:<o:p></o:p>

 <o:p></o:p>

There can be an arbitrary gap between time stamping and data collection, and when this happens it appears that the CPU has been used for longer than time has elapsed (> 100% total). Perflib automatically boosts the thread’s priority during data collection to one below real time to try to alleviate this problem, so any boosting that you’re planning to do will help only if you boost the priority above that and there are other threads running at or above realtime-1. And it only helps, it doesn’t completely avoid the problem (e.g., if a hard page fault happens, we’ll have to wait for the disk).<o:p></o:p>

 <o:p></o:p>

Here’s a diagram to maybe help you understand this. Suppose a thread is taking up 100% of the CPU, and remember that the percentage is calculated as (CPU_i - CPU_(i-1))/(TS_i - TS_(i-1)), where CPU_i is the total CPU usage at sample “i” since the thread was created and TS_i is the time stamp at sample “i”. Here’s the regular case:<o:p></o:p>

 <o:p></o:p>

---+-+---------------------+-+--------------------------> (Time)<o:p></o:p>

   | |                     | |<o:p></o:p>

   | \- Get CPU Time 1     | \- Get CPU Time 2<o:p></o:p>

   |                       |<o:p></o:p>

   \- Get Timestamp 1      \- Get time stamp 2<o:p></o:p>

 <o:p></o:p>

Note that the time elapsed between the two timestamps and the time elapsed between the data collection (“Get CPU time” ) is roughly the same, so the reported value will be roughly 100%. Now suppose, for example, that a hard page fault happened between “get time stamp 2” and “get CPU time 2”, the diagram would look like this:<o:p></o:p>

 <o:p></o:p>

 <o:p></o:p>

---+-+---------------------+----------------+-----------> (Time)<o:p></o:p>

   | |                     |                |<o:p></o:p>

   | \- Get CPU Time 1     |                \- Get CPU Time 2<o:p></o:p>

   |                       |<o:p></o:p>

   \- Get Timestamp 1      \- Get time stamp 2<o:p></o:p>

 <o:p></o:p>

Which means that CPU_2 - CPU_1 is greater than TS2 - TS1, so the reported CPU time will be greater than 100%.<o:p></o:p>

Mar 26, 2009 at 2:39 PM

Thanks for the kind words Clint.  But to be fair the information has been collected over the last years while talking to different members of the product teams.  I can only take credit for compiling the information.

The product team has also started a public forum to Discuss performance monitor (perfmon), resource monitor (resmon), and task manager, focusing on HOW-TO, Errors/Problems, and usage scenarios.

There is a great FAQ here too. 

http://social.msdn.microsoft.com/Forums/en-US/perfmon/threads/

Bruce

Mar 27, 2009 at 2:45 PM
Hi Clint,
Thanks a lot for detailed answer. It really helped.