
Lund Performance Solutions


Windows NT/2000 System Performance
This chapter deals exclusively with data pulse points for Windows NT/2000 systems.
The Windows NT/2000 system performance information in the Performance Gallery Gold User’s Guide is provided by Mark Friedman of Demand Technology Software.

Overview

Performance Gallery Gold works with Demand Technology Software’s Performance SeNTry collection agent to report many crucial aspects of Windows NT/2000 performance. Performance Gallery Gold accepts one or more SMF (System Management Facility) collection files from a single computer as input and produces a variety of charts and graphs that can be used to pinpoint performance bottlenecks. It contains a variety of predefined chart templates that are designed to get you up and running quickly.

NOTE Refer to "Windows NT/2000 Objects and Counters" for definitions of the Windows NT counters.

Processor Performance

The Windows NT processor charts focus on the System performance object, the Processor performance object, and the processor utilization of specific processes. The thread is actually the dispatchable unit on NT, but Performance Gallery Gold cannot report thread activity. Thread CPU activity is summarized to the individual process. NT supports both single-processor and multiprocessor configurations.

NT supports what is known as symmetric multiprocessing, which means, by default, any thread can run on any processor. The tuning options for a large NT Server multiprocessing configuration include defining hard processor affinity, which restricts the processors that threads are eligible to execute on. Using these tuning options demands a good understanding of the CPU resource usage pattern of various processes.

Processor Utilization

The Performance Gallery Gold Processor Utilization Breakdown (NT) graph separates overall processor utilization into four categories:
% Privileged Time counter
Privileged CPU time accumulates when NT operating system services run. These services include any Win32 services, including the Windows Graphical Device Interface (GDI) functions.
% User Time counter
User CPU time accumulates when application programs run and operating system services are not used.
% Interrupt Time counter *
Interrupt CPU time accumulates when interrupt service routines associated with device drivers run at high priority.
% DPC Time counter *
"Well-behaved" NT device drivers rely on deferred procedure calls (DPC’s) for the bulk of their processing. Unlike interrupt service routines that mask off lower-level interrupts, DPC’s run with interrupts enabled.
* Both interrupt service routines and deferred procedure calls run at higher priority than user and kernel threads.

Figure 11.1 Example Processor Utilization Breakdown (NT) graph
The Processor Utilization Breakdown (NT) graph has a reference line at 75%. Systems with processors that run consistently greater than 75% busy can be reaching capacity limits. Verify by reviewing the Performance Gallery Gold Processor Queue Length (NT) graph to see how many threads are delayed in the NT dispatcher ready queue. If the processor queue length value is also large, a CPU bottleneck is probably impacting application performance.

Processor Queue Length

The Performance Gallery Gold Processor Queue Length (NT) graph shows the number of threads that are currently waiting for service in the NT dispatcher ready queue. The queue length is an instantaneous count of the number of ready threads at the end of the measurement interval (an integer value is reported), not an average. Of the threads in the ready queue, one thread per processor is in the running state. For instance, a thread from the Performance SeNTry collector process that is taking the measurements will show up in the running state.

NT has a single dispatcher queue, where ready threads wait, that services all processors. The reference line in the Processor Queue Length (NT) graph is drawn at two ready threads. A good rule of thumb is to have no more than two ready and waiting threads per processor. On a single processor, queue lengths of 6 or 8 ready threads are cause for concern, but 6 to 8 ready threads are okay on a four-way multiprocessor.

High processor utilization (greater than 75% CPU busy) coupled with a long ready queue indicates a CPU bottleneck. A long ready queue without high overall processor utilization may mean that a large number of timer-activated threads all happen to wake up at the same time that the Performance SeNTry collector does. In that case, the long processor queue length measured is a by-product of that behavior, not a sign of trouble.
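The combined rule of thumb above can be sketched as a simple check. This is an illustrative helper, not part of Performance Gallery Gold; the 75% and two-threads-per-processor thresholds are taken from the graph reference lines described in this section.

```python
def cpu_bottleneck(pct_busy, queue_length, n_processors=1):
    """Apply the rule of thumb above: a CPU bottleneck is likely when
    processor utilization exceeds 75% AND more than two threads per
    processor are waiting in the dispatcher ready queue."""
    return pct_busy > 75.0 and queue_length > 2 * n_processors

# A long queue with low overall utilization (for example, many
# timer-activated threads waking at once) is deliberately not flagged.
```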

Processor Utilization by Processor

The Performance Gallery Gold Processor Utilization by Processor (NT) graph shows processor utilization by both usage category and individual processor.

Windows NT/2000 is a symmetric multiprocessing system; therefore, processor utilization statistics are normally very similar across all processors. However, if you use processor hard affinity performance options, it may be necessary to examine these statistics on a processor-by-processor basis.

System Configuration

The Performance Gallery Gold System Configuration (NT) table shows hardware and software configuration information for the Windows NT system being monitored. The fields are described in the following table.

Domain: The security administration domain.
Computer: The computer name.
OS type: The type of operating system. The OS type for Windows NT Workstation is "winnt"; the OS type for Windows NT Server is "lanmannt."
Version: The level of the Windows NT software.
CPU type: The processor type (normally AT-compatible).
Installed memory: The total amount of memory installed, in bytes.
#CPU’s: The number of processors installed.
CPU speed: The clock rate of the processor (not available for Alpha systems).

CPU Utilization by Process

The Performance Gallery Gold CPU Utilization by Process Table (NT) shows which applications are the heaviest consumers of processor resources. Processes include both applications that are started manually and applications that are started automatically by the Service Control Manager. (Services are equivalent to daemons in Unix or started tasks in MVS, and they run in the background.)

Some common processes are described in the following table.

system: The operating system.
csrss.exe: The client-server subsystem, which is the component responsible for desktop Windows management.
services.exe: A number of system services packaged in a single executable program, including the network Computer Browser service, the network Redirector service, the network File Server service, the Alerter service, the Messenger service, and the Event Log service.
lsass: The security remote validation service.
osa: A service associated with the Microsoft Office suite.
rpcss: The service that supports NT’s Remote Procedure Call interface.
smss: The Session Manager subsystem, which is involved in all security and authorization.
spoolss: The printer Spooler service.
winlogon: The application that controls access to the desktop; for example, the application that responds when you type Ctrl+Alt+Delete.

Memory Performance

The amount of random access memory (RAM) installed is one of the critical elements of Windows NT/2000 performance. This component is often referred to as real memory, which is contrasted with virtual memory, the logical view of memory that an application is granted. Each individual process views memory as a 4-gigabyte (GB) virtual address space. The upper half of each virtual address space is reserved for the NT operating system and various authorized components. The lower half of each virtual address space is available for use by each individual process. The system’s virtual addresses are common to each process, while the per-process areas are unique.

The virtual address spaces of active processes (including the System process) define a logical set of memory addresses that is normally considerably larger than the amount of RAM you actually have installed. The Virtual Memory Manager component of the NT operating system attempts to manage real memory, so the active pages of processes reside there, while inactive pages are stored on disk-resident paging files.

Virtual memory locations are mapped to real memory locations using a series of page tables. It is a function of the operating system to build and maintain a page table for each process that executes. Windows NT also tracks which real memory locations are in use and by which process.

Virtual memory systems are very dynamic. Only the active pages of a process need to reside in real memory, which is something that changes over time. When there is not enough real memory to go around, NT trims inactive pages from active processes and attempts to redistribute them based on current activity. NT will swap the pages of an inactive desktop application to the paging file on disk, for example, to free up real memory for some active process. Paging is disk I/O activity that occurs whenever an application references a valid virtual memory location that is not currently present in real memory.

Real memory is also used extensively for caching in Windows NT. Caching refers to storing frequently-used objects from the disk inside memory to speed up access. Caching is important, because memory access is thousands of times faster than disk access. The NT cache is an important part of NT’s ability to serve as a file server for network clients. Other applications, like Internet Information Server, MS SQL Server, and MS Exchange, also rely on caching frequently- used objects. The frequently-accessed objects cached by these applications are Web pages, database data, and messages. There is also a sense that real memory serves as a cache for the frequently-accessed virtual pages of executing processes.

Real memory often has a bigger impact on NT performance than any other single component. This is because a number of applications rely on caching frequently-accessed objects in memory. Virtual memory systems can degrade quite suddenly; one minute they can be running fine and the next minute they can be hopelessly slow. Often, it does not take a major change to transition from a satisfactory state to an unsatisfactory one.

The two most important indicators of real memory performance are 1) the Available Bytes counter, which reports how much real memory is currently available for use, and 2) measures of demand paging activity to disk.

Real Memory Utilization

The Performance Gallery Gold Real Memory Utilization (NT) chart is a stacked area graph that shows the number of bytes currently allocated in the six system real memory pools. Each data point in the graph represents a specific counter; an instantaneous count of the number of bytes in the pool at the end of the measurement interval. The measurements are reported as bytes, because NT supports different page sizes on different hardware (4K pages for systems with an Intel processor, 8K pages for systems with an Alpha processor).

Real Memory Counters

The following six memory counters show how NT is using real memory to complete various system functions.
Available Bytes
The number of available bytes is a very important indicator to track. The Available Bytes counter tracks how much "free" real memory exists on the system during the specified measurement interval. More specifically, it represents the sum of three lists:
  • The Standby List of stolen pages in transition.
  • The Free List of aged, trimmed pages.
  • The Zero List of free pages that are zero-filled.
New allocations are always made from the Zero List. A kernel thread called the "Balance Set Manager" is dispatched once per second to "trim" excess pages from each process address space. Initially, trimmed pages are put on the Standby List. As they age, trimmed pages are moved to the Free List. (If the page in real memory is modified, it must be written to the paging file before it can be moved to the Free List.) The low-priority system zero page thread is responsible for zeroing out pages on the Free List. The page trimming process is illustrated in Figure 11.2.

    Figure 11.2 Working set trimming process diagram
As long as the amount of available memory is greater than 4 megabytes (MB), process working sets (resident pages of an active process) are allowed to grow. If available memory falls below 4 MB, the system can become constrained by a shortage of real memory. If available memory falls below 1 MB, the system is likely to be constrained and, consequently, process working sets are trimmed much more aggressively.
    To confirm a shortage of memory, refer to the Performance Gallery Gold Hard Page Fault Rate (NT) graph.
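The 4 MB and 1 MB thresholds above can be expressed as a small classifier for the Available Bytes counter. This is an illustrative sketch; the labels are not NT terminology.

```python
def memory_pressure(available_bytes):
    """Classify real memory pressure from the Available Bytes counter,
    using the 4 MB / 1 MB thresholds described above (labels are
    illustrative, not part of any NT API)."""
    MB = 1024 * 1024
    if available_bytes > 4 * MB:
        return "ok"            # working sets are allowed to grow
    if available_bytes > 1 * MB:
        return "constrained"   # trimming becomes more aggressive
    return "critical"          # working sets trimmed much more aggressively
```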
    Pool Non-paged Bytes
    The Pool Non-paged Bytes counter tracks the memory that is not pageable. Many kernel threads allocate memory from this pool.
    Pool Paged (resident) Bytes
    The Pool Paged Bytes counter tracks the system memory that is pageable but happens to reside in real memory at the moment, presumably because the pages are active.
    System Code Resident Bytes
    The System Code Resident Bytes counter tracks the system memory where OS code is loaded. Many OS kernel routines must reside in real memory.
    System Driver Resident Bytes
    The System Driver Resident Bytes counter tracks the system memory allocated by device drivers that is currently residing in real memory. During I/O interrupt processing, device driver interrupt service routines must use virtual memory addresses that reside in real memory. I/O memory buffers also must be locked in real memory while they are being accessed by the devices.
    System Cache Resident Bytes
    The System Cache Resident Bytes counter tracks the number of bytes of pageable operating system code in the file system cache. The NT memory cache is mapped to a 512 MB segment of the system’s 2 GB virtual memory address space. It is then allocated to active files in 256-kilobyte (KB) chunks. This is where the portions of the file cache that are residing in memory are counted. The cache bytes also includes some additional memory usage by system threads. Internally, the system process’s working set is known as the cache.

    Real Memory Calculation

    To determine the real memory used by application processes at a given time, do the following:
  • Calculate the sum of real memory bytes used by all six memory counters.
  • Subtract this number from the total real memory installed on the system.
  • Refer to the Performance Gallery Gold Memory Usage by Active Process (NT) graph to see how different applications are using the available real memory.
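The calculation above reduces to a subtraction, sketched here with made-up byte values; the dictionary keys are illustrative placeholders for the six memory counters.

```python
def application_real_memory(installed_bytes, system_counters):
    """Real memory available to application processes: installed RAM
    minus the sum of the six system memory counters (a sketch of the
    steps above)."""
    return installed_bytes - sum(system_counters.values())

# Example with made-up values for the six counters (bytes):
counters = {
    "Available Bytes": 50_000_000,
    "Pool Non-paged Bytes": 4_000_000,
    "Pool Paged Resident Bytes": 12_000_000,
    "System Code Resident Bytes": 3_000_000,
    "System Driver Resident Bytes": 2_000_000,
    "System Cache Resident Bytes": 60_000_000,
}
```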

Memory Usage by Active Processes

    The Performance Gallery Gold Memory Usage by Active Process (NT) graph shows the usage of real memory by active processes.

    NOTE By default, the Performance SeNTry collector filters out intervals where a process was not active, so not all running processes are accounted for in this graph.

    The working set of a process consists of all its virtual memory pages that are currently resident in real memory. The process Working Set counter is an instantaneous value for the process at the end of the measurement interval.

    Pages from a shared DLL that are currently resident in memory are counted as part of every process address space where the DLL is loaded. As a result, resident pages from shared DLL’s can be counted two, three, four, or more times, so the sum of all the process working sets and the amount of system memory used is usually much greater than the amount of real memory installed on the system.

    Virtual Memory Usage (commit%)

    The Performance Gallery Gold Virtual Memory Usage (NT) graph shows the percentage of committed bytes in use.
    NT uses virtual memory. Once virtual memory is committed by application processes, NT allocates space for the virtual memory on one of its paging files. Frequently-accessed, pageable virtual memory pages tend to reside in real memory. However, when more virtual memory is committed than exists inside the computer (its real memory), paging activity occurs. The mapping performed by the operating system of virtual addresses to real memory locations is transparent except when a program references a virtual address that is not resident in real memory. That causes a page fault to occur, which NT must handle.

    Virtual memory pages are committed when NT reserves space for them on one of its paging files. A range of virtual addresses can be reserved without being committed. For example, NT reserves about 512 MB of the system’s 2 GB virtual address range for the file cache. These pages are not committed until the cache manager allocates them to satisfy file requests.

    Virtual memory allocations are limited by the amount of real memory and the amount of space on disk allocated for paging files. (NT supports up to sixteen paging files, one per logical volume, each of which can be as large as 1 GB.) This value is known as the virtual memory commit limit. When the number of committed virtual memory bytes reaches 90% of the commit limit, NT issues an "Out of Virtual Memory" message. Respond by creating another paging file or by increasing the size of the existing paging files. NT will automatically extend the size of the paging file, if possible, when the percentage of committed bytes remains above 90% for an extended period of time.
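The 90% threshold above can be monitored with a one-line check (an illustrative sketch; both values come from the measurement data).

```python
def out_of_virtual_memory(committed_bytes, commit_limit_bytes):
    """True when committed virtual memory reaches 90% of the commit
    limit, the point at which NT issues its "Out of Virtual Memory"
    message."""
    return committed_bytes >= 0.9 * commit_limit_bytes
```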

    The Virtual Memory Usage (NT) graph includes a reference line at 70% (y=70). Systems that run consistently above 70% are more likely to experience excessive paging rates. The paging rate can be referenced from the Performance Gallery Gold Demand Paging (NT) and Hard Page Fault Rate (NT) graphs.

    Demand Paging

    The Performance Gallery Gold Demand Paging (NT) graph breaks down the page fault rate for an NT system according to three page fault categories:
  • Hard faults that lead to paging file I/O.
  • Soft, or transition, faults that are resolved without resorting to I/O operations.
  • Cache faults, which lead to I/O against application files.
Each of these categories is described in detail in the following sections.
    Hard Page Fault Rate
    A hard page fault occurs when a process attempts to access a virtual memory location that is not resident in real memory. A hard page fault requires operations to the physical disk to be performed by the NT memory manager while the program that incurred the fault is forced to wait. A hard page fault is detected and resolved in the following manner:
  • Access to an invalid virtual memory location causes an exception.
  • The NT memory manager examines the exception and determines that the reference is valid and the page does not reside in real memory.
  • The memory manager then "grabs" a real memory page from the Free List and calls the I/O manager to copy the page from the paging file to the designated memory location.
  • Once the page is resident in real memory, the program that incurred the page fault is restarted.
There is no NT memory counter that tracks the number of hard page faults per second; the Page Reads/sec counter is the closest equivalent. To determine the number of hard page faults, calculate:

    Hard page faults = Page faults/sec - (Cache faults + Transition faults)


    NOTE This computation, while logically correct, sometimes leads to a negative number.
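A sketch of the calculation, clamped at zero to cover the case the NOTE describes (all arguments are per-second counter values):

```python
def hard_page_faults(page_faults, cache_faults, transition_faults):
    """Estimate hard page faults/sec as Page Faults/sec minus cache and
    transition faults; clamp at zero because the subtraction can come
    out slightly negative."""
    return max(0.0, page_faults - (cache_faults + transition_faults))
```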

    NT performs bulk paging operations that are more efficient than executing a single disk I/O at a time. When NT encounters a page fault, it reads beyond the address specifically requested and takes several contiguous pages off the disk in a single disk operation.
  • The Page Reads/sec counter tracks the number of times NT initiates a paging operation to read one or more pages from disk. Generally, multiple pages are input for each read request.
  • The Pages Input/sec counter tracks the number of individual pages read from the disk. This counter can be referenced from the Performance Gallery Gold Paging Activity (total) (NT) graph.
When the disk holding the paging file is very busy, page fault requests can accumulate. To check the efficiency of the NT paging operations, refer to the Performance Gallery Gold Clustered Paging I/O Operations (NT) graph.
The Demand Paging (NT) graph has a reference line at 25 hard page faults per second (y=25). The reference line can be adjusted (see "Marker"). The number of hard page faults your system can tolerate is determined by:
  • The speed and number of paging file disks, each with its own capacity limits.
  • The percentage of disk I/O bandwidth that can be devoted to paging.

NOTE Any I/O bandwidth consumed by system paging activity is not available to applications.

    Soft (transition) Fault Rate
    When a page fault occurs, the NT Memory Manager checks the contents of the Standby List to see if the page was among the pages trimmed recently by the Balance Set Manager. If the page requested is located on the Standby List, the Memory Manager is able to add it to the process working set immediately; no time-consuming I/O is involved. This type of page fault is referred to as a "soft" or "transition" fault. Soft/transition faults are just a by-product of the working set trimming practice that NT uses; they are not normally a performance issue.
    Cache Fault Rate
    By default, application I/O requests are diverted to the NT file cache, which represents a specific region of virtual memory. File segments are mapped to this area of virtual memory in 256 KB chunks. The range of the system’s virtual memory associated with the file cache contends for real memory like any other application process. A reference to a file location mapped to virtual memory and not in real memory causes an NT file cache fault.

    When a cache fault occurs, the memory manager calls the I/O manager to generate a read request to the file. While a high rate of cache faults may indicate a shortage of real memory on a file server, many cache faults are simply the result of new files being accessed by different local processes and network clients. To determine the effectiveness of the file cache, refer to the Performance Gallery Gold File Cache Hit% by Type (NT) graph. There is little that can be done to tune the file cache in Windows NT beyond purchasing more RAM.

    Clustered Paging I/O Operations

    The Performance Gallery Gold Clustered Paging I/O Operations (NT) graph can help determine the efficiency of NT’s paging operations (to disk).

    Windows NT is designed to perform bulk paging operations, because the paging file disks are used most efficiently when multiple requests are processed in a single operation. Although bulk paging operations are efficient, they take longer than individual paging requests.

    Page reads are the result of page faults. When NT discovers a page fault in a data page, it automatically reads several neighboring pages. It does the same for a process code page, but is even more aggressive about prefetching the pages from the disk. Trimmed pages that are subsequently modified must be updated on disk before the page in memory can be used by another process address space. The NT kernel modified page writer thread will wait until a number of changed pages accumulate before writing the modified pages to the paging file in bulk.
Pages written per write
The pages written per write data in the Clustered Paging I/O Operations (NT) graph is computed by dividing the number of pages output per second by the number of page writes per second.
Pages input per read
The pages input per read data is computed by dividing the number of pages input per second by the number of page reads per second.
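Both ratios above reduce to the same division, sketched here with a guard for intervals that had no paging operations:

```python
def pages_per_operation(pages_per_sec, operations_per_sec):
    """Pages input per read (Pages Input/sec over Page Reads/sec) or
    pages written per write (Pages Output/sec over Page Writes/sec)."""
    if operations_per_sec == 0:
        return 0.0
    return pages_per_sec / operations_per_sec
```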
    In situations when the paging file disk is very busy and the processor is not, consider the following adjustments to improve system performance:
  • Increase the I/O bandwidth for paging operation by adding paging files.
  • Reduce the number of hard page faults by adding memory.

Paging Activity (total)

    The Performance Gallery Gold Total Paging Activity (NT) graph shows the number of page inputs (Pages Input/sec counter) and outputs (Pages Output/sec counter) per second that occurred during the specified measurement interval.
    The number of hard paging operations that a system can tolerate is determined by:
  • The speed and number of paging file disks, each with its own capacity limits.
  • The percentage of disk I/O bandwidth that can be devoted to paging.
The graph has a reference line at 100 pages per second (y=100). This line can be adjusted as appropriate for your system (see "Marker"). Consider that the I/O bandwidth that is absorbed in system paging activity is not available to application programs.

    Paging Operations

    The Performance Gallery Gold Paging Operations (NT) graph shows the total number of page read and write requests that occurred in the specified interval.
  • The writes data element in the graph represents the Pages Output/sec counter.
  • The reads data element represents the Pages Input/sec counter.
The number of pages read for input is greater than the number of page read operations, because NT performs bulk paging operations. The number of pages output is greater than the number of page write operations for the same reason.
    The total number of pages input and output can be referenced in the Performance Gallery Gold Paging Activity (total) (NT) graph. The efficiency of NT’s bulk paging operations can be referenced in the Clustered Paging I/O Operations (NT) graph.

    Memory Utilization Index

    The Performance Gallery Gold Memory Utilization Index (NT) graph produces a memory contention index that can be useful in predicting when an NT system might experience a memory bottleneck. This index is designed to help identify potential conflicts caused by a memory bottleneck and correct them before they become serious.

    The memory utilization index is the virtual memory bytes allocated in the system pageable pool, divided by the memory bytes that are resident in the same pool. As system activity increases, the number of bytes in the pageable pool tends to increase. If the amount of real memory available to back this pool is limited, the ratio of virtual bytes allocated to real memory bytes consumed will increase. The value computed serves as a memory contention index, because the virtual memory pages in the pageable pool all contend for real memory.
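As a sketch, the index is a single ratio over the two pool counters described above; the guard covers an interval where no pool pages are resident.

```python
def memory_contention_index(pool_paged_bytes, pool_paged_resident_bytes):
    """Virtual bytes allocated in the system pageable pool divided by
    the bytes of that pool resident in real memory; the ratio rises as
    the pool grows faster than the real memory backing it."""
    if pool_paged_resident_bytes == 0:
        return float("inf")
    return pool_paged_bytes / pool_paged_resident_bytes
```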

An increase in the memory contention index is often accompanied by an increase in hard page faults. Combine the Performance Gallery Gold Memory Utilization Index (NT) graph and the Hard Page Fault Rate (NT) graph (see "Secondary Graph") to see the relationship between the two performance indicators. The combined, dual-y-axes graph can be changed to a table (see "Chart Type"), then exported to MS Excel (see "Export") and displayed as a scatter diagram (see Figure 11.3). The example scatter diagram includes a linear regression trend line to show the relationship between the two sets of measurements.


    Figure 11.3 Example Memory Usage Index vs. Hard Page Faults Rate scatter diagram

    File Cache Performance

    On an NT Server machine configured to run as a file server, one of the biggest consumers of real memory is often the file cache. If the file cache is too small, performance will be poor, because there are too many accesses from disk and not enough accesses from memory. The most important way to tune a file server is to ensure it has an adequate amount of real memory to use for caching.
    NT reports a variety of file cache statistics. The most important are the various cache hit ratios— the percentage of accesses resolved from memory compared to all file accesses. Keeping these hit ratios high is critical to system performance. Outside of one tuning parameter called "LargeSystemCache," which is a drastic change from the default working set management policy, there is very little that can be done to tune the file cache other than adding more RAM to the system. Because the file cache can never be larger than 512 MB, there is even an upper limit on how much RAM to add.
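The hit ratios discussed above are all computed the same way: requests satisfied from memory as a percentage of all requests. A minimal sketch (per-second counter values assumed):

```python
def cache_hit_pct(hits_per_sec, requests_per_sec):
    """Cache hit percentage: requests satisfied from the file cache as
    a percentage of all requests in the measurement interval."""
    if requests_per_sec == 0:
        return 0.0
    return 100.0 * hits_per_sec / requests_per_sec
```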

    File Cache Activity by Type

    The Performance Gallery Gold File Cache Activity by Type (NT) graph provides an overview of all NT file cache activities, including several different types of cacheable read requests and the amount of changed pages written to disk by the lazy writer thread.
    The performance of the NT system file cache is critical when an NT system is used as a file server. File segments that are cached in real memory are accessed without having to access the physical disk. The best overall measure of the effectiveness of the file cache is to monitor the cache hit percentages. Cache hits occur when a request for a file was satisfied from the cache.
The File Cache Activity by Type (NT) graph shows the hit ratios for the following counters:
  • Copy Reads/sec counter
  • Data Maps/sec counter
  • Pin Reads/sec counter
  • MDL Reads/sec counter
  • Lazy Write Pages/sec counter
Each of these counters is described in the following sections.
    Copy Reads/sec counter
When application files are accessed, they are read into the NT file cache first, then data buffers are copied from the NT file cache into the process virtual address space. Thus, normal application file requests become copy read requests. The copy reads data element shows how many copy read requests per second were satisfied from the cache without having to access the file. When application files are accessed sequentially, the NT file cache performs read-ahead operations to keep the copy read hit ratio high.
    Data Maps/sec counter
The data mapping function of the cache is used primarily by the NTFS file system to cache master file table (MFT) entries. The NTFS master file table stores information about the files and directories contained in the file system. This data element shows how many times per second data map requests are satisfied from the file cache without accessing the disk.
    Pin Reads/sec counter
MFT entries that are mapped in the cache are pinned when they are modified. NTFS pins the changed MFT entries so the file system can exercise direct control over when the changed pages are flushed to disk. Following a change, but before the change is written to disk, a pinned entry is highly likely to be re-accessed by the file system. This data element shows how many times per second those requests were satisfied without having to perform disk I/O operations.
    MDL Reads/sec counter
    Memory descriptor lists (MDL’s) are buffers used by devices that support "scatter-gather" direct memory access (DMA) I/O operations. Scatter-gather devices reference multiple data areas in real memory within a single logical I/O request. The file server component of Windows NT also uses MDL’s to improve the efficiency of large file requests. This data element shows how many times per second MDL reads were satisfied without having to access the disk.
    Lazy Write Pages/sec counter
See the description for the Lazy Write Pages/sec counter on page 225.

    File Cache Lazy Writer

When writes are issued to a cached file, NT does not immediately write back the changed data to the corresponding area where the file is stored permanently on physical disk. Instead, the action is deferred. This form of deferred write-back caching is called lazy writing.

There are several potential performance benefits associated with a lazy write policy. Because current data is available in the cache, an application that needs to re-read that information subsequently can get to the data without having to access the disk again. Applications like a word processing program that scroll through a large data file can benefit from this. In addition, deferring the write can lead to more efficient physical disk access if the application modifies the data again. Suppose the application program subsequently changes a block near the original modification. Now, when changed pages of the file are flushed to disk, the physical disk can be accessed in a very efficient manner.

    Keep in mind that because deferred write-back caching is used, application write requests do not translate into physical disk write operations immediately. When enough changed pages in the file cache accumulate, a cache manager lazy writer thread is dispatched to write a bunch of changed pages to disk in bulk.

    The Performance Gallery Gold File Cache Lazy Writer (NT) graph includes data from the Lazy Write Flushes/sec counter and the Lazy Write Pages/sec counter.
    Lazy Write Flushes/sec counter
    The rate of the lazy write flushes per second.
    Lazy Write Pages/sec counter
    The number of lazy write pages per second that the lazy write thread writes to disk.

    When the lazy write cache function is working efficiently, multiple changed pages are written to disk during each flush. On the other hand, if write activity is very low, there will be very few pages to flush, no matter how long NT defers the request. This graph does not present a stacked view, because the number of lazy write pages is usually greater than the number of lazy write flushes.
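    One way to gauge how well the lazy writer is batching work is the ratio of pages written to flushes. A minimal sketch, using hypothetical counter samples (the function name and values are illustrative, not part of the product):

```python
def lazy_write_pages_per_flush(lazy_write_pages_per_sec, lazy_write_flushes_per_sec):
    """Average number of changed pages written per lazy write flush.

    A ratio well above 1.0 suggests the lazy writer is batching changed
    pages efficiently; a ratio near 1.0 with low write activity is
    normal and not a cause for concern.
    """
    if lazy_write_flushes_per_sec == 0:
        return 0.0
    return lazy_write_pages_per_sec / lazy_write_flushes_per_sec

# Hypothetical interval: 120 pages/sec written across 15 flushes/sec
print(lazy_write_pages_per_flush(120.0, 15.0))  # 8.0 pages per flush on average
```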

    File Cache Read Activity

    The Performance Gallery Gold File Cache Read Activity (NT) graph shows four different types of cached read requests.
    Copy Reads/sec counter
    When application files are accessed, they are read into the NT file cache first, then data buffers are copied from the NT file cache into the process virtual address space. This data element represents the total number of copy read requests.
    MDL Reads/sec counter
    Memory descriptor lists (MDL’s) are buffers used by the devices that support "scatter-gather" direct memory access (DMA) I/O operations. Scatter-gather devices reference multiple data areas in real memory within a single logical I/O request. The file server component of Windows NT also uses MDL’s to improve the efficiency of large file requests. This is the total number of MDL read requests.
    Read Aheads/sec counter
    When application files are accessed sequentially, the NT file cache performs read ahead operations to keep the copy read hit ratio high. This data element represents the total number of read ahead requests initiated by the file cache.
    Pin Reads/sec counter
    Master file table (MFT) entries mapped in the cache are pinned when they are modified. NTFS (the NT file system) pins changed MFT entries so the file system can exercise direct control over when changed pages are flushed to disk. Following a change, but before the change is written back to disk, a pinned entry is likely to be re-accessed by the file system. This data element represents the total number of pinned read requests.

    Cached File System Mapping Requests

    The data map function of the NT file cache was designed mainly for use by installed file systems. The NTFS file system uses the data map function to cache master file table (MFT) entries. The NTFS master file table is where NT stores information about what directories and files are contained in the file system. This is also referred to as file system metadata.

    The Performance Gallery Gold Cached FS Mapping Requests (NT) graph includes data from the Data Maps/sec counter and the Data Map Pins/sec counter.
    Data Maps/sec counter
    The data maps element reflects the rate of data mapping requests by the installed NT file system(s). MFT entries mapped in the cache are pinned when they are modified. NTFS pins changed MFT entries so the file system can exercise direct control over when these changed pages are flushed to disk.
    Data Map Pins/sec counter
    The data map pins element is the rate at which data-mapped segments in the cache are pinned.

    Normally, file system metadata changes quite frequently as new files are allocated, old files are deleted, and current (active) files grow and change.

    File Server Performance

    NT Server is often deployed as a network file and print server. Support for file server services is built into all versions of Windows NT. This support is associated with the Server service, which resides in the services.exe program that runs as a service. Network file sharing uses a session-oriented wire protocol known as server message blocks (SMBs). SMB defines a protocol for accessing a shared disk resource, querying the contents of disk directories, and accessing individual files for reading or updating. The Server applet in the Control Panel enables you to view the status of the active file server sessions.

    File Server Activity

    File server activity is reported through various counters under the Server performance object, with more detailed performance data under the Server Work Queues object.

    The Performance Gallery Gold File Server Activity (NT) graph tracks NT file server activity and the number of concurrent server sessions. The main indicators of file server activity are the number of bytes transmitted (the Bytes Transmitted/sec counter) and received (the Bytes Received/sec counter).

    File Server Work Queues

    Open the Performance Gallery Gold File Server Work Queues (NT) graph to view detailed file server performance statistics.

    File Server Request Rate

    File server requests are sent by network clients using the SMB (server message block) wire protocol. The Performance Gallery File Server Request Rate (NT) graph tracks the rate of SMB requests serviced by the server component on the specified machine. The server request rate can be used in conjunction with the statistics in the Server Work Queues object to estimate the service time for requests handled by the machine.

    Open the File Server Request Rate (NT) graph to view the rate at which SMB (server message block) requests were sent to the server from the clients across the network.

    Logical Disk Performance

    Disk performance is a critical area, because mechanical disks are relatively slow compared to other electronic components of a computer system. Disk performance can also be complicated, because there are so many options available to improve disk performance, including:
  • Replacing slow disks with faster disks.
  • Implementing software to defragment disks.
  • Using the hardware caching option to speed disk access.
  • Employing different disk configuration options, such as striping data across multiple physical disks.

    Disk performance monitoring in Windows NT is performed at both the physical and logical disk level. It is performed by a component called "diskperf" (see "How diskperf works"). The NT disk performance statistics do a good job of characterizing the I/O workload, and they can be used to spot performance problems, although they are probably not detailed enough to tell you why a performance problem is occurring. To diagnose disk performance problems in NT can require very specialized and very specific knowledge of your disk hardware performance characteristics.

    How diskperf works

    diskperf is a special I/O filter driver program called diskperf.sys. Envision the I/O manager of Windows NT as a series of layers or a stack beginning with the file system and working down to the disk device driver layer, and, finally, to the SCSI miniport driver layer, which is provided by the maker of your SCSI interface board. I/O request packets (IRPs) represent I/O requests. They are passed down through various layers of the I/O manager until they reach the SCSI miniport driver, where they are turned into SCSI disk commands. When the physical disk completes the command requested, status is returned back through the layers of the I/O manager stack all the way back to the original requesting application.

    The diskperf filter driver gathers statistics on I/O requests. The module counts the number of I/O requests, the number of bytes transferred, and whether the request was to read or write data. In addition, diskperf times how long each request takes. The time diskperf measures includes the time the IRP spent in processing at the device itself, plus any time spent in the software layers below diskperf. Think of it as the round trip time to the disk and back, or the response time of the disk.

    By default, the diskperf statistics gatherer is not enabled. Without it, no disk performance data is collected and all the performance counters associated with the logical and physical disk objects are zero. You must enable diskperf and reboot the system in order to collect performance statistics.

    diskperf can be positioned in one of two places in the I/O manager stack, and its position affects how it gathers statistics. The Windows NT disk administrator has basic options to format your hard drive. It also contains more advanced volume manager capabilities. These advanced features include:
  • Volume sets that can be extended dynamically.
  • Disk striping for better performance.
  • Disk mirroring.
  • Disk striping with parity (RAID 5) for fault tolerance.

    An optional component called ftdisk (fault tolerant disk driver) implements these volume manager features. If any advanced volume manager functions are configured using the disk administrator, ftdisk.sys is loaded automatically during system initialization. diskperf.sys can be loaded either above or below ftdisk.sys in the I/O manager stack, as illustrated in Figure 11.4.
  • If you specify diskperf -y when you turn on disk performance monitoring, then diskperf.sys is loaded above ftdisk.sys in the I/O manager stack.
  • If you specify diskperf -ye, then diskperf.sys is loaded below ftdisk.sys in the I/O manager stack.


    Figure 11.4 I/O Manager stack diagram

    Since ftdisk.sys redirects a logical disk request to its appropriate physical disk representation, the place where diskperf.sys is loaded influences the measurement data that is collected. For example, suppose you create a logical disk, D:, which is a mirrored disk constructed from two physical partitions, one on physical disk 1 and the other on physical disk 2. ftdisk.sys is the component responsible for: 1) taking a single logical disk request to write records to the D: disk, and 2) sending the SCSI miniport driver two identical write commands, one to physical disk 1 and the other to physical disk 2.
  • If diskperf.sys is loaded above ftdisk.sys, diskperf will record the logical disk statistics correctly, but it will not understand that two physical disk requests were issued for each write operation requested.
  • If diskperf.sys is loaded below ftdisk.sys, diskperf will measure the two physical disk requests that were issued correctly, but it will not understand that only one logical disk write operation was requested.

    The logical disk and physical disk statistics that diskperf collects are identical. If ftdisk.sys is loaded, diskperf.sys can collect correct logical disk statistics, or correct physical disk statistics, but not both. You are likely to see some very strange performance statistics if you access the wrong object.

    Logical Disk Response Time

    The Performance Gallery Gold Logical Disk Response Time (NT) graph shows the average elapsed time of I/O requests, from the viewpoint of diskperf (see "How diskperf works"). The fields reported here correspond to the Avg. Disk sec/Read counter and Avg. Disk sec/Write counter under the Logical Disk performance object. This is the total response time of I/O requests, so it includes both disk service time and queue time spent waiting for NT software components and the disk hardware.

    A reference line is included in the Logical Disk Response Time (NT) graph at 25 milliseconds (Y=0.025 seconds). This is an arbitrary boundary, but a good place to begin worrying about disk performance problems. If the response time is consistently above the 25 ms threshold, it would be worthwhile to investigate the cause. Access the link to the Logical Disk Utilization (NT) graph to see how heavily the disks are being used. Unless the high response times are associated with high rates of activity, the problem is probably not severe enough to warrant serious attention.
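    The triage rule above, flag a disk only when a high response time coincides with meaningful activity, can be sketched as follows. The 25 ms threshold and the activity floor are the arbitrary starting points suggested by the text, not fixed rules:

```python
def needs_attention(avg_disk_sec_per_transfer, disk_transfers_per_sec,
                    rt_threshold_s=0.025, min_transfers_per_sec=10.0):
    """Flag a disk only when high response time coincides with real activity.

    rt_threshold_s mirrors the 25 ms reference line in the graph; the
    10 transfers/sec activity floor is an illustrative assumption.
    """
    return (avg_disk_sec_per_transfer > rt_threshold_s
            and disk_transfers_per_sec >= min_transfers_per_sec)

# A slow but nearly idle disk is not worth chasing:
print(needs_attention(0.040, 2.0))   # False
# A slow and busy disk warrants investigation:
print(needs_attention(0.040, 60.0))  # True
```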

    Disk Performance Expectations

    A reasonable service time expectation for today’s popular disks is that most requests can be serviced in 10-15 ms or less. Disk service time is generally made up of the following components:
    Seek Time
    Seek time is the time spent positioning the read/write arm of the disk over the proper track. In sequential file access, the arm is already positioned in the right location (unless another application I/O request "steals" the arm), and there is no seek time component. At the opposite extreme, there are long seeks from the beginning of the disk to the end, which might take 20 ms or more. On average, a seek of one-third the distance across the platter normally takes 8-10 ms, depending on the make and model of the disk.

    Since sequential access is a popular mode of access, figure that your workload needs to perform average seeks only 50% of the time. This reduces the expected average seek time per I/O request to about 5 ms. Optimizations like defragmenting your hard drive regularly will increase the number of sequential accesses performed and will improve I/O performance.
    Rotational Delay or Latency
    Once the arm is positioned over the right disk track, it is necessary to wait until the sector requested rotates under the read/write head before the operation can continue. Popular disks rotate at 5,400-10,000 RPM, which is about 6-12 ms per rotation. Sometimes the disk does not have to rotate very far before the right sector is reached. Other times you have to wait much closer to a full disk revolution. On average, you can expect a half-rotation delay per I/O, which is about 3-6 ms.
    Protocol Delay
    Protocol delay refers to the SCSI or ATA commands that must be sent back and forth across the interface bus to establish communication with the disk and tell it what to do. The SCSI protocol requires about 0.5 ms.
    Data Transfer
    Finally, the disk is ready to read or write the data as requested. How fast the device can transfer data is a function of 1) how fast the disk spins and 2) how much data is recorded per track. Disks today are capable of moving data at a rate of 10-25 MB per second.
    File Size
    Another factor in data transfer time is the size of the request. In NTFS, valid sector sizes range from 512 bytes to 4 KB. Today’s disks can transfer even the larger 4 KB blocks in less than 1 ms. Keep in mind that many areas of Windows NT are optimized to perform bulk requests and prefetch data in anticipation of its use. Both demand paging and file cache prefetching requests tend to be bulk requests. In general, this results in larger blocks being transferred in a single I/O request.

    If you add up the various components of I/O service time, you get:

    Seek time + Latency delay + Protocol delay + Data transfer = I/O service time

    A reasonable service time expectation is about 11 ms for an average disk on an average day. Of course, there can be many workload factors that are above or below average. The service time expectation calculated above ignores all other delay factors, like what happens when there are multiple requests active at the same time for a single disk, or when the disk shares access to the SCSI bus with other devices. On a busy system, all of these factors can easily contribute to queuing delays that are as long as the actual service time. So, you might not want to investigate the various I/O tuning strategies until the response time measures are consistently above 20-25 ms per request, on average.
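    The arithmetic above can be sketched as follows. The component values are the text's rough averages (a 5 ms effective seek, a half-rotation latency derived from the spindle speed, 0.5 ms protocol delay, about 1 ms data transfer), not measurements; the 7,200 RPM default is an illustrative assumption:

```python
def rotational_latency_ms(rpm):
    """Average rotational delay: half of one full revolution.
    One revolution takes 60,000 ms / RPM."""
    return (60_000.0 / rpm) / 2.0

def expected_service_time_ms(seek_ms=5.0, rpm=7_200, protocol_ms=0.5, transfer_ms=1.0):
    """Seek time + Latency delay + Protocol delay + Data transfer = I/O service time."""
    return seek_ms + rotational_latency_ms(rpm) + protocol_ms + transfer_ms

# 7,200 RPM: 60,000/7,200 = 8.33 ms per rotation, ~4.2 ms average latency
print(round(expected_service_time_ms(), 1))  # 10.7, close to the 11 ms rule of thumb
```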

    Disk Hardware Performance Options

    Besides simply buying faster disks—ones that spin faster also transfer data faster—there are many other hardware solutions that improve disk performance.
    Actuator-level Buffers
    A feature common to many high-performance disks is a built-in, dedicated, actuator-level buffer. During a read request, buffered devices transfer data from the track requested directly into this built-in buffer memory. If a subsequent request asks for data that is adjacent to the original request, the data can probably be found in the actuator buffer. A buffer hit means only protocol time and data transfer time are needed to satisfy the request. There is no seek time or latency delays associated with a buffer hit. Also, most devices can transfer data from the buffer at full interface speed, which might be 40 MB per second, while the device transfer speed is usually somewhat slower.
    Caching Controllers
    Similar to actuator-level buffers, caching controllers boost performance when the data requested is found in the cache. This allows the request to be serviced without having to access the physical disk. Again, there is no seek time or disk latency delays involved, just protocol and data transfer time.
    Disk Arrays
    Disk arrays are two or more disks grouped together and accessed in parallel to speed up performance. This grouping can be performed two ways:
  • Via software using the NT disk administrator. Array operations are implemented by the ftdisk fault tolerant disk driver module, ftdisk.sys.
  • Via hardware using disk striping. Disk striping spreads a single logical request across multiple physical disks. However, in most Windows NT environments, disk striping adds little performance value, because of the small sector size used by NTFS.
    RAID (Redundant Array of Independent Disks)
    RAID refers to disk arrays in which information is also replicated to make the configuration fault tolerant. In a RAID configuration, a single drive in the array can fail without losing data permanently. The simplest form of RAID organization is disk mirroring, also called RAID level 1. Disk striping is sometimes combined with mirroring to achieve high performance with the benefit of fault tolerance. This is called RAID level 0/1.

    RAID 5 is the most popular fault tolerant disk organization. RAID 5 is equivalent to the striping with parity option in disk administrator. Instead of making a full copy of the original data on a separate disk as in mirroring, RAID 5 creates a single parity error correction code sector for each corresponding set of data sectors. The parity data is sufficient to recreate the original data following a single disk failure. RAID 5 is a high-availability feature, first and foremost. Disk performance actually suffers under RAID 5, because of the extra steps that must be taken to maintain the parity information during writes. Watch out for this RAID 5 write performance penalty. Battery-backed controller cache is used in more expensive RAID 5 subsystems to buffer writes and to mask the RAID 5 write penalty.
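    The write penalty mentioned above follows from the classic small-write accounting for RAID 5: to update parity, the controller must read the old data and old parity before writing the new data and new parity. A sketch under that standard assumption (the counts are the textbook accounting, not measured values):

```python
def physical_ios_per_logical_write(organization):
    """Classic physical I/O cost of one small logical write.

    Standard accounting (assumed, not measured):
      "single": one write to one disk                          -> 1 I/O
      "mirror": the same write goes to both copies (RAID 1)    -> 2 I/Os
      "raid5":  read old data + read old parity + write new
                data + write new parity                        -> 4 I/Os
    """
    costs = {"single": 1, "mirror": 2, "raid5": 4}
    return costs[organization]

# The RAID 5 write penalty: four physical I/Os per small logical write
print(physical_ios_per_logical_write("raid5"))   # 4
print(physical_ios_per_logical_write("mirror"))  # 2
```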

    Logical Disk Detail

    The Performance Gallery Gold Logical Disk Detail (NT) table is a report that summarizes the most important disk performance measurements. Use this table to look for problem disks that have a combination of both a high activity rate and a high response time.

    The data fields in the Logical Disk Detail (NT) report are described below.
    Read RT (response time)
    The read response time data is based on the Avg. Disk sec/Read counter. This is the same value reported in the Logical Disk Response Time (NT) graph. It includes disk service time and all queuing delays.
    Reads/sec
    The reads per second data element is the rate of disk read requests, based on the Disk Reads/sec counter. It is the activity rate for read requests during the measurement interval.
    Avg Read Bytes
    The average read bytes data element is the size, in bytes, of the average read request. It is calculated by dividing the Disk Read Bytes/sec counter by the Disk Reads/sec counter. The average block size is significantly larger than the file system sector size, reflecting NT’s use of bulk requests, which are more efficient than individual disk sector accesses.
    Write RT (response time)
    The write response time data is based on the Avg. Disk sec/Write counter. This is the same value reported in the Logical Disk Response Time (NT) graph. It includes disk service time and all queuing delays.
    Writes/sec
    The writes per second data element is the rate of disk write requests, based on the Disk Writes/sec counter. It is the activity rate for write requests during the interval.
    Avg Write Bytes
    The average write bytes data element is the size, in bytes, of the average write request. It is calculated by dividing the Disk Write Bytes/sec counter by the Disk Writes/sec counter. The average block size is significantly larger than the file system sector size, which reflects the use of bulk requests in NT.
    # in System
    This data element corresponds to the Avg. Disk Queue Length counter, which is the product of the overall I/O request rate (Disk Transfers/sec counter) and the average response time value in the Avg. Disk sec/Transfer counter. The calculation uses Little’s Law, a basic tenet of queuing theory. A more detailed explanation of this calculation is provided in the description for the Q length data (below).
    Q Length
    Queue length is an instantaneous value based on the Current Disk Queue Length counter. It is the number of requests that are currently active, including any I/O request that is in service at the disk at the time the disk performance data was collected. For example, a current disk queue length value of 2 indicates that one request is active and one is currently waiting.
    % Free Space
    This data element is the percentage of the free disk space at the time of the data collection, based on the % Free Space counter.
    MB’s Free
    This data element is the free space on the disk expressed in terms of absolute space, based on the Free Megabytes counter.
    Throughput
    The throughput data element is the sum of the Disk Read Bytes/sec counter and the Disk Write Bytes/sec counter.
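    Several of the derived fields in this table, such as Avg Read Bytes, Avg Write Bytes, and throughput, are simple ratios and sums of raw counters. A minimal sketch with hypothetical counter samples:

```python
def avg_request_bytes(bytes_per_sec, requests_per_sec):
    """Average request size, e.g. Disk Read Bytes/sec divided by Disk Reads/sec."""
    if requests_per_sec == 0:
        return 0.0
    return bytes_per_sec / requests_per_sec

def throughput_bytes_per_sec(read_bytes_per_sec, write_bytes_per_sec):
    """Throughput: Disk Read Bytes/sec plus Disk Write Bytes/sec."""
    return read_bytes_per_sec + write_bytes_per_sec

# Hypothetical interval: 1,310,720 read bytes/sec across 40 reads/sec
print(avg_request_bytes(1_310_720, 40))              # 32768.0 (a 32 KB average read)
print(throughput_bytes_per_sec(1_310_720, 262_144))  # 1572864
```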

    Logical Disk Utilization

    Windows NT reports three disk performance counters, which are described as measures of disk busy:
  • % Disk Time counter
  • % Disk Read Time counter
  • % Disk Write Time counter

    The official Microsoft explanation for % Disk Time is that it is "the percentage of elapsed time that the selected disk drive is busy servicing [emphasis added] read or write requests." This is a misleading explanation. Strictly speaking, these counters do not report disk utilization, which is why they sometimes behave in odd ways. The Performance Gallery Gold Logical Disk Utilization (NT) graph reports the three counters associated with % Disk Time and shows their relationship to the Avg. Disk Queue Length counter.
    Disk Utilization
    Disk utilization can normally be found by measuring the activity rate and the average disk service time. % Disk Time could then be calculated as the product of the Disk Transfers/sec counter and disk service time. This is known as the Utilization Law in queuing theory.
    Disk utilization = Requests/sec x Service time
    However, NT’s diskperf program does not measure disk service time. Instead, it measures disk response time, which is service time plus queue time.
    Disk response time = Service time + Disk queue time
    Disk Average Queue Length
    The product of the Disk Transfers/sec counter and the Avg. Disk sec/Transfer counter (disk response time), according to the well-known Little’s Law formula, is the average number of outstanding disk requests—also known as average queue length.

    # in system = Requests/sec x Response time (Little’s Law)

    Masquerading as utilization, the % Disk Time counters are artificially capped at 100%. This is because it is impossible in queuing theory (and in reality) for a server to be greater than 100% utilized. This can sometimes lead to absurd results. For example, say the Logical Disk D: is 2.49% busy from reads, 100% busy from writes, and 100% total busy, according to the % Disk Time counters. The product of Disk Transfers/sec and Avg. Disk sec/Transfer can be greater than 100%, because the Avg. Disk sec/Transfer counter measures response time.
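    The relationship between the capped % Disk Time counters and the uncapped queue length can be sketched from the two quantities diskperf does report. The sample values below are hypothetical:

```python
def avg_disk_queue_length(transfers_per_sec, avg_sec_per_transfer):
    """Little's Law: # in system = request rate x response time (uncapped)."""
    return transfers_per_sec * avg_sec_per_transfer

def pct_disk_time(transfers_per_sec, avg_sec_per_transfer):
    """The % Disk Time counters report the same product expressed as a
    percentage, but artificially capped at 100."""
    return min(100.0 * avg_disk_queue_length(transfers_per_sec, avg_sec_per_transfer), 100.0)

# Hypothetical busy disk: 80 transfers/sec at a 35 ms average response time
print(round(avg_disk_queue_length(80, 0.035), 2))  # 2.8 requests in the system
print(pct_disk_time(80, 0.035))                    # 100.0, capped and misleading
```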

    Three corresponding Avg. Disk Queue Length counters were introduced in NT 4.0 to try to clear up this confusion. These queue length counters are calculated using Little’s Law, without the capping. The Avg. Disk Queue Length counters are logically consistent while the purported disk busy counters are not.

    How should the Avg. Disk Queue Length counters be interpreted? They literally represent the average number of outstanding requests to the disk, including any requests that are currently in service. A value of 2.79 for Avg. Write Disk Queue Length means, on average, that one request is in service at the disk while almost two requests are always waiting. For reality testing, the calculated values of the Avg. Disk Queue Length counter should be compared to the measured values of the Current Disk Queue Length as shown in the Logical Disk Detail (NT) table. But also keep in mind that the measured values of Current Disk Queue Length are likely to be systematically under-sampled (especially on uniprocessors). This is due to the fact that the disk driver Interrupt Service Routine and Deferred Procedure Calls that manage the device queue run at a higher dispatching priority than the Performance SeNTry collection service. By the time the Performance SeNTry collection service is eligible to run, the disk driver software may have already dispatched the next I/O request.

    Logical Disk Average Queue Length

    The Performance Gallery Gold Logical Disk Avg Q Len Statistics (NT) table provides the raw values for the (false) % Disk Time counters and the (correct) Avg. Disk Queue Length counters. Both are computed as the product of the activity rate and the response time. The (false) % Disk Time counters are capped at 100%, while the (correct) Avg. Disk Queue Length counters are not.

    Additional measurement anomalies are introduced by the ftdisk fault tolerant disk driver (see "Logical Disk Performance"). These anomalies can be explored in the Logical Disk Avg Q Len (NT) table.

    The Performance Gallery Logical Disk Avg Queue Length (NT) graph is a three-dimensional temperature chart (3D surface chart) that enables you to easily spot disk performance problems across a large server disk farm. The color-coded temperature chart legend shows the range of measurement data broken down into ten equal ranges. The red, orange, and yellow peaks identify devices with lengthy queuing delays.

    Physical Disk Performance

    A set of Performance Gallery Gold charts nearly identical to the logical disk charts is available for reporting attached physical disk performance. Physical disk performance statistics are gathered by the NT diskperf driver program (see "Logical Disk Performance"). The same exact measurement data is provided with the same quirks. The only difference is the logical disk measurements include statistics on free disk space—these statistics are not available for the physical disk.

    Redirector Performance

    The Redirector component of Windows NT is used to transform (or redirect) networked file requests to use the network instead of a local disk. It is the client side of file server requests. In fact, the sum of all Redirector requests from network clients is precisely equal to the sum of all file server requests, assuming all network clients are running Windows NT.

    Network Activity (redirector)

    The Performance Gallery Network Activity - Redirector (NT) graph shows the total amount of network traffic associated with redirected file requests. The value shown in the graph corresponds to the Bytes Total/sec counter in the Redirector performance object. From this graph, you can link to the following charts:
  • Redirector Errors by Type (NT) - shows Redirector error statistics.
  • Redirector Bytes Received/Sent (NT) - breaks down bytes transmitted into reads and writes.
  • Redirector File Operations (NT) - shows Redirector file operations.

    Network Traffic Performance

    When the Windows NT Network Monitor Agent is installed on a system, a variety of network utilization statistics can be collected. In general, the networking statistics are gathered at the network interface card level, with the network monitor collecting basic information on the number and types of network requests, including the number of bytes associated with each request. Statistics are available for each of the different networking protocols that Windows NT supports, including TCP/IP, NetBEUI, IPX, and AppleTalk. Additional statistics associated with TCP/IP are available only if the SNMP (Simple Network Management Protocol) service is installed.

    Network traffic per segment data is collected by running the network interface card (NIC) in promiscuous mode, in which the NIC passes every packet it sees on the wire up to the host, not just the packets addressed to it. Using a standard Ethernet hub, the network segment is logically a single shared wire, where all stations see all packets. It is only necessary to gather statistics at one station to be able to summarize activity on the entire segment. On a switched network, each station sees only traffic specifically intended for it.

    Due to their performance characteristics, Ethernet networks can degrade very rapidly when multiple stations contend for access to the shared wire. If two stations attempt to access the wire at the same time, a collision occurs and both stations need to retransmit. Characteristically, utilization tends to increase very quickly once collisions start to occur. Unfortunately, statistics for the number of collisions are only available by running the full Network Monitor. However, Windows NT makes it very easy to gather utilization statistics.

    Network Utilization

    The Performance Gallery Gold Network Utilization by Segment (NT) graph reports the specific network segment(s) with which the station is associated. Utilization is calculated by capturing the number of bytes received and bytes transmitted per segment and dividing by the rated capacity of the network interface card.
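    The utilization calculation described above can be sketched as follows. The function name and sample values are illustrative assumptions; note that the counters report bytes per second while NIC capacity is rated in bits per second, so a factor of 8 is needed:

```python
def segment_utilization_pct(bytes_received_per_sec, bytes_sent_per_sec,
                            nic_capacity_bits_per_sec):
    """Segment utilization: total bytes moved, converted to bits,
    divided by the rated capacity of the network interface card."""
    total_bits_per_sec = (bytes_received_per_sec + bytes_sent_per_sec) * 8
    return 100.0 * total_bits_per_sec / nic_capacity_bits_per_sec

# Hypothetical 10 Mb Ethernet segment moving 500,000 bytes/sec in total
print(segment_utilization_pct(300_000, 200_000, 10_000_000))  # 40.0
```

    A result of 40% is right at the point where, per the discussion below, collisions begin to inflate utilization on a shared segment.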

    The scope of the network segment is defined by the networking hardware. On a standard Ethernet hub, all attached stations are part of the segment, and each station sees all packets. Using switched Ethernet, a station sees only packets specifically addressed to it, so each segment consists of a single station.
    The Network Utilization by Segment (NT) graph has a reference line at 50% utilization (Y=50). With Ethernet segments, be careful about collisions caused by multiple stations that need access to the wire concurrently, because Ethernet has no arbitration phase. Stations that need to transmit data simply wait for the wire to be free, then place their packet on the wire. When two stations attempt to do this at the same time, a collision occurs and both stations must retransmit. This causes a characteristic "bulge" in utilization once the rate of traffic reaches 40% to 50% of capacity. Consequently, you would normally not want to see Ethernet network segment utilization running at a sustained level in excess of 40-50%.

    This rule does not apply when the bulk of network traffic consists of a single session between one station and another, as in bulk file transfers or backup operations. For example, since the two stations performing a bulk file transfer are involved in a sort of "conversation" and must wait until the receiver acknowledges the receipt of messages, there is no contention for the wire. It is possible to drive Ethernet utilization to nearly 100% in these circumstances without incurring collisions.

    From the Network Utilization by Segment (NT) graph you can link to more detailed network interface statistics, error statistics, and TCP/IP statistics.

    Network Interface Traffic

    The Performance Gallery Gold Network Interface Traffic (NT) graph shows network traffic, broken down into bytes sent and bytes received at the network interface level. The data elements in the graph correspond to the Bytes Sent/sec counter and Bytes Received/sec counter in the Network Interface object.

    System Activity

    The Performance Gallery Gold System Activity (NT) graph reports the Context Switches/sec counter and the Total Interrupts/sec counter. These are two separate, but related, performance indicators. Both measures are relative indicators of system activity.

    In contrast to most other Performance Gallery Gold pre-configured charts, this area graph does not show a stacked data view. Since interrupts cause context switches to occur, the two measurements are directly related. However, the number of context switches will always be greater than the number of interrupts, because some context switches are unrelated to the servicing of interrupts.

    Context switching occurs whenever a running thread voluntarily relinquishes the processor, or when a running thread is preempted by a higher-priority ready thread following an interrupt. When the new thread executes in a different address space, the internal Intel processor Task State Segment (TSS) register must be reloaded and some internal processor caches may be invalidated. Context switches also occur when a user-mode thread calls an NT executive service or a Win32 subsystem service. In the latter case, a privileged kernel mode thread is assigned to service the request. A context switch occurs and is counted, but since the TSS does not have to be reloaded, the performance impact of this event is trivial.

    NT device drivers service interrupts whenever peripherals need to notify the processor that some external event has occurred—usually that an I/O request has completed. Look for sudden unexplained changes in the interrupt rate that may be caused by a malfunctioning interface board or device.

    When CPU utilization is running consistently at 100% with very few interrupts, check to see if a program is caught in an endless loop.

    Lund Performance Solutions
    http://www.lund.com/
    Voice: (541) 812-7600
    Fax: (541) 812-7611
    info@lund.com