How Do I Determine Application IOPS?

This is just a note to myself, since I often find myself analyzing an application in order to performance-tune a server or system.
Using Windows Performance Monitor, I rely on the following counters when analyzing an application.

Metric                                    Example Data   Description
Logical Disk: Avg. Disk Bytes/Read        0.00           Read I/O size (data block size)
Logical Disk: Avg. Disk Bytes/Transfer    8192.00        Overall I/O size
Logical Disk: Avg. Disk Bytes/Write       8192.00        Write I/O size (data block size)
Logical Disk: Disk Read Bytes/sec         0.00           Total read bytes per second
Logical Disk: Disk Write Bytes/sec        125000.00      Total write bytes per second
Logical Disk: Disk Transfers/sec          15.258         Total IOPS
Logical Disk: Disk Reads/sec              0.00           Read IOPS
Logical Disk: Disk Writes/sec             15.258         Write IOPS

To calculate write IOPS for the application, divide “Disk Write Bytes/sec” by “Avg. Disk Bytes/Write”:
125000.00 / 8192.00 ≈ 15.258 IOPS
Note that Disk Reads/sec and Disk Writes/sec already correspond to read IOPS and write IOPS.
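The same arithmetic is easy to script. Here is a minimal sketch (the function name is my own; the inputs are the two PerfMon counter values from the table above):

```python
def write_iops(disk_write_bytes_per_sec: float, avg_disk_bytes_per_write: float) -> float:
    """Derive write IOPS from the two Performance Monitor counters.

    PerfMon reports an average write size of 0 when the application is
    not writing at all, so guard against dividing by zero.
    """
    if avg_disk_bytes_per_write == 0:
        return 0.0
    return disk_write_bytes_per_sec / avg_disk_bytes_per_write

# Example values from the table above: 125000 bytes/sec at 8192-byte writes
print(round(write_iops(125000.00, 8192.00), 3))  # -> 15.259
```

The same division with “Disk Read Bytes/sec” and “Avg. Disk Bytes/Read” yields read IOPS.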
 

A few notes on Guest Virtual Machine Memory Ballooning

Looking over some of my Windows 2008 R2 servers, I noticed via Task Manager that their memory usage was frequently above 90%.

Using the Sysinternals RAMMap tool, I can quickly identify whether this is due to ballooning.

Here’s a screenshot showing that 2.2GB of RAM is driver-locked (e.g., ballooned using the VMware Tools driver vmmemctl).

Using the vSphere Client, I checked this VM’s performance charts to confirm that this is ballooned memory and to see the actual current memory usage.

You can see that nearly 2GB of RAM is indeed ballooned on this VM, while the guest is actively using only a little over 1GB of RAM.

To control how much memory the VM will balloon, you can set the sched.mem.maxmemctl parameter in the virtual machine’s .vmx configuration file. (References: VMware KB1003586; MSDN “Lock Pages in Memory”; VirtualLock)
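As a concrete illustration, the setting is a single line in the .vmx file. The value is in megabytes; 1024 below is only an example cap, not a recommendation, and setting it to 0 disables ballooning for the VM entirely (which VMware generally discourages):

```
sched.mem.maxmemctl = "1024"
```

Edit the file while the VM is powered off, or add the entry through the vSphere Client’s advanced configuration parameters.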

The KB article above is a good starting point when troubleshooting guest VMs that pin pages in memory (for example, SQL Server with Lock Pages in Memory (LPIM) enabled).

Testing Disk in Linux using fio

I recently discovered a utility called fio that lets you benchmark the disk subsystem in Linux. My test setup and job files are below.
What is fio?

fio is an I/O tool meant to be used both for benchmarking and for stress/hardware verification. It has support for 13 different types of I/O engines (sync, mmap, libaio, posixaio, SG v3, splice, null, network, syslet, guasi, solarisaio, and more), I/O priorities (for newer Linux kernels), rate I/O, forked or threaded jobs, and much more. It can work on block devices as well as files. fio accepts job descriptions in a simple-to-understand text format. Several example job files are included. fio displays all sorts of I/O performance information. fio is in wide use in many places for benchmarking, QA, and verification purposes. It supports Linux, FreeBSD, NetBSD, OS X, OpenSolaris, AIX, HP-UX, and Windows.

Windows fio download:  http://www.bluestop.org/fio/
OS – Debian Linux “Wheezy” AMD64
RAM – 8GB
Virtualized – YES
VMware Tools – YES
Disk – 1 x 50GB Thin Provisioned
Test File – 10GB
Note: The disk is on a LUN backed by a six-disk RAID5 array of 15k RPM drives; no disk/CPU throttling is configured on the VM.

 
Here are my fio test files:

[randr-4k]
rw=randread
size=10G
direct=1
directory=/tmp/
numjobs=1
group_reporting
name=randr-4k
bs=4k
runtime=30
write_iops_log
write_lat_log
write_bw_log

[randr-8k]
rw=randread
size=10G
direct=1
directory=/tmp/
numjobs=1
group_reporting
name=randr-8k
bs=8k
runtime=30
write_iops_log
write_lat_log
write_bw_log

[randrw-4k]
rw=randrw
size=10G
direct=1
directory=/tmp/
numjobs=1
group_reporting
name=randrw-4k
bs=4k
runtime=30
write_iops_log
write_lat_log
write_bw_log

[randrw-8k]
rw=randrw
size=10G
direct=1
directory=/tmp/
numjobs=1
group_reporting
name=randrw-8k
bs=8k
runtime=30
write_iops_log
write_lat_log
write_bw_log

[randw-4k]
rw=randwrite
size=10G
direct=1
directory=/tmp/
numjobs=1
group_reporting
name=randw-4k
bs=4k
runtime=30
write_iops_log
write_lat_log
write_bw_log

[random-rw-direct]
rw=randrw
size=10G
direct=1
directory=/tmp/
numjobs=1
group_reporting
name=random-rw-direct
bs=8k
runtime=30
write_iops_log
write_lat_log
write_bw_log

[seqr-4k]
rw=read
size=10G
direct=1
directory=/tmp/
numjobs=1
group_reporting
name=seqr-4k
bs=4k
runtime=30
write_iops_log
write_lat_log
write_bw_log

[seqr-8k]
rw=read
size=10G
direct=1
directory=/tmp/
numjobs=1
group_reporting
name=seqr-8k
bs=8k
runtime=30
write_iops_log
write_lat_log
write_bw_log

[seqrw-4k]
rw=rw
size=10G
direct=1
directory=/tmp/
numjobs=1
group_reporting
name=seqrw-4k
bs=4k
runtime=30
write_iops_log
write_lat_log
write_bw_log

[seqrw-8k]
rw=rw
size=10G
direct=1
directory=/tmp/
numjobs=1
group_reporting
name=seqrw-8k
bs=8k
runtime=30
write_iops_log
write_lat_log
write_bw_log

[seqw-4k]
rw=write
size=10G
direct=1
directory=/tmp/
numjobs=1
group_reporting
name=seqw-4k
bs=4k
runtime=30
write_iops_log
write_lat_log
write_bw_log

[seqw-8k]
rw=write
size=10G
direct=1
directory=/tmp/
numjobs=1
group_reporting
name=seqw-8k
bs=8k
runtime=30
write_iops_log
write_lat_log
write_bw_log

I used fio_generate_plot to generate gnuplot graphs.
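The per-interval logs produced by the write_iops_log/write_lat_log/write_bw_log options are plain CSV, so they can also be summarized without gnuplot. This sketch assumes the classic fio log layout of one `time_ms, value, direction, block_size` entry per line (direction 0 = read, 1 = write); the function name and sample data are my own:

```python
import csv
import io

def summarize_fio_log(text: str) -> dict:
    """Average the logged values (IOPS, latency, or bandwidth) per direction.

    fio logs one sample per line: time in ms, value, data direction
    (0 = read, 1 = write), block size. What 'value' means depends on
    which of the three logs you are reading.
    """
    totals = {0: [0.0, 0], 1: [0.0, 0]}  # direction -> [sum, count]
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3:
            continue
        value, direction = float(row[1]), int(row[2])
        totals[direction][0] += value
        totals[direction][1] += 1
    return {d: (s / n if n else 0.0) for d, (s, n) in totals.items()}

# A few synthetic bandwidth samples (KB/s) in fio's log layout
sample = "1000, 4096, 0, 4096\n2000, 4608, 0, 4096\n3000, 512, 1, 4096\n"
print(summarize_fio_log(sample))  # -> {0: 4352.0, 1: 512.0}
```

Feeding it the contents of, say, a randr-4k bandwidth log gives a quick sanity check before bothering with graphs.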

IBM Turbo Performance License Explained

Download the PDF of this: Performance Recommendation Handout_081010

What is Turbo Performance?

An optional premium feature that, when enabled, boosts the performance of a storage system across both throughput and IOPS workloads. The implementation is based on a firmware algorithm and does not require any new hardware.

Turbo Performance Feature Rules of Thumb

Use Turbo Performance Feature when:

The storage system performance demands exceed the base maximum performance across either the FC or SAS host interfaces.

Refer to page 2 for recommendations based on four-corner performance testing.

IOPS gains are often not realized with the Turbo Performance feature until target drive quantities are reached. For throughput, improvement is achieved even at very low drive quantities.

Note: Performance depends not only on the storage system and drives, but also on the host, OS, HBAs, and the application workload. Therefore, results may vary!

Note: Because 1Gb/s iSCSI may be a bottleneck that prevents taking full advantage of the Turbo Performance feature (yielding minimal improvement in overall performance), this feature is not recommended for use with the iSCSI daughter card, but rather for SAS and FC.

Environments that may benefit from Turbo Performance

While almost every environment can realize the benefits of the Turbo Performance feature, the following environments may especially benefit:

  • Consolidation /Virtualization
  • Video streaming, rich media
  • 3D modeling and simulation
  • High performance computing (HPC)
  • Real‐time data backup
  • Online retail (online transaction processing)
  • Data warehousing
  • Distributed data collection
  • Grid computing
  • Multi‐media
  • SAN Volume Controller and ProtecTIER implementations

Focus on positioning the Turbo Performance feature for customers with bandwidth-intensive applications.

Selling Script / Questions to Ask Your Customer

  1. What types of applications/environments do you intend for the DS3500 to support?
    • Multiple (due to virtualization and consolidation) – go to question #2
    • Throughput/bandwidth intensive – go to question #2
    • IOPS intensive – go to question #3
  2. The Turbo Performance feature allows you to take full advantage of your DS3500’s performance capabilities with throughput-intensive applications. Let me share the performance improvements you may achieve, based on your drive count, for both read and write performance (refer to the second page of the handout).
  3. How many drives do you plan to scale your DS3500 to?
    • Over 60 – go to question #4
    • Under 60 – revisit the Turbo Performance feature with this customer when their drive quantity gets closer to 60 drives
  4. For IOPS performance, the Turbo Performance feature is beneficial for high drive-count configurations. Let me share the performance improvements you may achieve for both read and write performance (refer to the second page of the handout).

SQL Server Optimization Rules of Thumb

This helped me today in troubleshooting and understanding some issues.
The performance counter SQL Server Access Methods: Full Scans/sec is very helpful in determining whether you have a query doing full scans; however, remember that it covers the entire database server, not just a particular database.

Optimization Rules of Thumb

  • Always look at the query plan first. It will show you the optimal current execution plan from the query engine’s point of view. Find the most expensive part of the execution plan and start optimizing from there. However, even before that, make sure that the statistics on all tables in your query are up to date, by running the update statistics <TableName> command on all tables in your query.
  • If you see table scan, optimize. Table scan is the slowest possible way of execution. Table scan means not only that no index is used, but that there is no clustered index for this table at all. Even if you can only replace table scan with clustered index scan, it is still worth it.
  • If you see clustered index scan, find out whether it can be replaced with index seek. For that, find what conditions are applied to this table. Usually, conditions exist for two or three fields of the table. Find out the most selective condition (that is, the condition that would produce the smallest number of records if applied alone), and see whether an index on this field exists. Any index that lists this field first will qualify. If there is no such index, create it and see whether the query engine picks it up.
  • If the query engine is not picking up the existing index (that is, if it is still doing a clustered index scan), check the output list. It is possible that seek on your index is faster than clustered index scan, but involves bookmark lookup that makes the combined cost greater than use of a clustered index. Clustered index operations (scan or seek) never need bookmark lookup, since a clustered index already contains all the data. If the output list is not big, add those fields to the index, and see whether the query engine picks it up. Please remember that the combined size is more important than the number of fields. Adding three integer fields to the index is less expensive than adding one varchar field with an average data length of 20. Summarizing this rule, try to make your index covering, and see whether it works better than clustered index scan. Please note that it is not always possible to make the query engine pick up your index automatically. A small table or a low-selectivity index will produce clustered index scan, even if your index is covering.
  • If you see bookmark lookup, it means that your index is not covering. Try to make it covering if it makes sense (see the preceding guidelines).
  • The execution plan selected by the query engine may not be the best one. The query engine makes certain assumptions about disk subsystem and CPU cost versus IO cost. These assumptions sometimes can be incorrect. If you don’t believe that the query engine’s selection is the best one, run a query in the loop for 10 to 15 minutes with automatic selection, change the query to use your index (you will have to use an index hint to force it), and then run it for 10 to 15 minutes again. Compare the results to see which one works better.
  • Avoid any operations on the fields, where possible. Some operations will prevent the use of the index on this field even if it exists—for example, the infamous ltrim(rtrim(FieldName)); other operations will degrade the performance. For example, instead of using the condition cast(DateField as varchar(20)) = @dateString, try to convert @dateString to an expression of datetime type first, and then compare it to DateField.
  • Please note that the query engine cost estimate does not include the cost of embedded procedure or function calls. If you compare between plain join and select from table-value functions, the latter would seem to have smaller cost, but it usually does not. In such a situation, use your own metrics to find out which query performs better.
  • When it is not possible to avoid operation on the field, use an index built on that expression. This can be done in two ways:
    • Create a calculated field based on your expression.
    • Create a view, and build an index on it.
Note: SQL Server requires certain conditions to be met in order to allow the use of calculated fields and indexed views (set quoted_identifier on, set arithabort on, and so on).
  • Indexed views are a good way to further speed up the query if you are not satisfied with the results. An indexed view is a clustered index built over the view’s select list. You can also define additional indexes for the indexed view, just as you can for any regular table. Indexed views take disk space and involve some maintenance overhead (every time underlying tables change, the indexed view also has to change), but they usually provide a good boost in performance, even after all other optimization techniques are exhausted.

Special thanks to Andrei Volkov for many interesting discussions about the SQL Server internals.
Source:  http://msdn.microsoft.com/en-us/library/aa964133(v=SQL.90).aspx