
My Favorite Oracle Solaris Performance Analysis Commands

4 commands that help you find bottlenecks

A while ago, we discussed some performance analysis basics:

  • Define what your problem is.
  • Figure out your goal: What metric needs to be in what ballpark for you to declare victory?
  • Analyze your system from the inside out: CPU, RAM, disk, network. Your bottleneck is always in one of these four areas.

So what are the best commands for finding bottlenecks in each of the four categories above? Here's part two of my Oracle Solaris Performance cheat sheet with some favorite tricks.

Does Your System Have Enough CPU Power?

This is usually the first suspicion when the performance isn't where it should be:

"The CPU is too slow!"

And it's often just plain wrong.

Let's see how we can quickly answer the question: Do I have enough CPU power?

In the old days of single-core, single-CPU systems, we fired up top and watched the system load value, or the top processes' CPU percentage. But in today's multi-CPU, multi-core world, this doesn't work anymore. The old concept of "load" is misleading and quite useless if you want to assess whether your system has enough CPU power.

Here's a more modern way:

constant@fridolin:~$ vmstat 5                                                   
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr cd s0 s2 --   in   sy   cs us sy id
 0 0 0 446144 130076 23 100  0  1  3  0 12  7 -0 13  0  465 1352 1137  6 12 82
 0 0 0 405376 90808  33  41  0  0  0  0  0 39  0  3  0  514  500  571  4 11 85
 0 0 0 405296 90536   0   0  0  0  0  0  0 29  0  1  0  502  778  551  4 10 86
...

(Remember to ignore the first line of the output: it shows averages since boot, not current activity.)

Now watch the rightmost column, which is the system idle time in percent. Is it bigger than 0 most of the time? Then you have enough CPU power. It's that simple. If idle time is 0 most of the time, buy a bigger CPU; if not, look elsewhere.

The above system has enough CPU: It's idle more than 80% of the time, so even if something runs slowly, the CPU can't be the bottleneck in this case.

(Yes, life can be more complex than that, but remember, we're talking about a cheat sheet here. This is the most useful approach for a majority of cases.)
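
If you want to double-check on a machine with many cores, a per-CPU view helps: a single hot thread can max out one core while the overall idle figure still looks comfortable. mpstat gives you that breakdown; watch the idl column for each CPU (the 5-second interval below is just an example, pick whatever suits you):

constant@fridolin:~$ mpstat 5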

How's My Memory Doing?

Now that we've ruled out "not enough CPU horsepower" as the bottleneck, let's look at the next layer: RAM. Do we have enough RAM? Or is the system starving for more memory, resorting to slow disks as a poor substitute? Again,

constant@fridolin:~$ vmstat 5                                                   
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr cd s0 s2 --   in   sy   cs us sy id
 2 0 7   6472 30620   6  85 108 392 546 3060 2617 143 0 111 0 839 408 14606 3 40 57
 0 0 7   8360 33960  10  51 89 155 1910 1816 19090 187 0 52 0 883 529 9512 5 36 59
 0 0 7  12548 42948  19  48 66 215 215 1080 0 121 0 70 0 737 340 10273 3 31 66
 1 0 7  13612 39916  38  90 106 0  0 632 0 171 0 56  0  900  616 10160 5 29 66
 4 0 7   8060 29528  10  47 55  0 383 232 5514 112 0 77 0 854 739 6665 4 26 70
 0 0 7   7312 38468   3   9 15 234 1500 0 17073 33 0 47 0 580 349 3993 2 25 73
 0 0 7   8960 39460  17  46 55  0  0  0  0 101 0 37  0  744  529 7870  3 27 70
 2 0 7   8836 37020   6  31 46  0  0  0  0 87  0 87  0  749  418 6033  3 20 77

is our friend. This time, let's look at three values: swap, free and sr (the scan rate):

  • swap: This is the amount of free virtual memory.
  • free: This is the amount of free physical memory.
  • scan rate (sr): This is the number of memory pages the page scanner examines per second, looking for lightly used pages it can free to make room for new memory demands.

Again, the old adage was: If memory is full, you need more of it. But today that's misleading: Modern operating systems use up as much memory as they can, to get the most out of the RAM you paid for. For example, ZFS uses as much free memory as possible as a read cache, to save you from spending precious IOPS on disks. So if the "free mem" column in top is small, that's actually a good sign: It means your RAM is doing useful work.
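
By the way, if you're curious how much of your RAM the ZFS read cache (the ARC) is holding right now, a quick kstat query prints its current size in bytes (assuming ZFS is loaded on your system):

constant@fridolin:~$ kstat -p zfs:0:arcstats:size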

A better question to ask here: Is my memory system in trouble? That's what the scan rate tells us: The bigger this value, the more stressed the memory subsystem is, because the OS is increasingly busy scanning memory pages for expendable chunks so it can satisfy a high demand for fresh memory. If the scan rate is a single-digit value most of the time, you're OK. If it shows large values over extended periods of time, you'll likely benefit from some extra RAM in your system.
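
And if the scan rate does point to memory pressure, a natural next step is to see which processes hold the most physical memory. One way to do that is prstat sorted by resident set size; here limited to the top 10 processes, sampled every 5 seconds:

constant@fridolin:~$ prstat -s rss -n 10 5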

In the second vmstat example above, I created extra stress for the memory system by starting a ZFS scrub (filling up RAM), starting OpenOffice with a large presentation and asking GIMP to set up a new 8k x 8k picture for me. That resulted in some samples showing more than a thousand page scans. That's certainly a situation where more RAM would have come in handy. The system was unusable, although the CPU showed more than 70% idle.

(Again, there's a lot more detail that we don't cover here, but we don't want to make this post bigger than a good bedtime reading, do we?)

The nice thing about vmstat is that with just one command, you can easily assess if the CPU and RAM situation is ok or not, then move on to the next layer.

Or Is There a Disk Problem?

Now it gets interesting. Most if not all of the performance problems I see are disk I/O related, and there's no indication that this is about to change.

You can get a quick overview of your IO situation with:

constant@fridolin:~$ iostat -xzn 5                                              
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    8.2    6.2  163.8   90.0  0.5  0.2   35.4   13.1   8  10 c3d0
    1.4   12.2   30.0   81.4  0.1  0.2    8.9   13.0   3   7 c6t0d0
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  126.6   33.1 1613.0  400.3  3.5  1.6   21.9    9.8  75  81 c3d0
    0.0   19.7    0.0   40.7  0.6  0.1   28.6    7.5  14  15 c6t0d0
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   33.4    2.0  242.5   14.4  7.1  2.0  200.0   56.4 100 100 c3d0
    0.0   15.8    0.0   39.4  2.3  0.5  148.2   31.3  49  49 c6t0d0

Again, looking at simple performance numbers like reads/writes per second or even kilobytes read/written per second doesn't tell you much. Are 126 reads fast? Or too slow? Wow, 1613k read per second. That's a lot! Is it? Wait, what disks am I using again? (Answer: The above is a Solaris 11 Express system running on VirtualBox on my 3-year-old Mac.)

A more interesting figure to look at is wait: This is the number of IO operations that are waiting to be serviced, in other words, the length of the waiting queue. If your queue looks like the one in front of an Apple store on the day a new iPhone is introduced, you need to work on your disks (here are a few suggestions if you use ZFS). If the wait value stays in the single-digit range, your problem may be elsewhere.
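
By the way, if the busy device is part of a ZFS pool, zpool iostat paints a similar picture per pool and per vdev, which makes it easier to spot a single slow disk dragging down the whole pool (replace rpool with your own pool name):

constant@fridolin:~$ zpool iostat -v rpool 5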

Sometimes you want a more application-level view of your IO situation, and that's what the following command provides:

 admin@krengi:~$ fsstat -F 5
  new  name   name  attr  attr lookup rddir  read read  write write
 file remov  chng   get   set    ops   ops   ops bytes   ops bytes
    0     0     0     0     0      0     0     0     0     0     0 ufs
    0     0     0     0     0      0     0     0     0     0     0 proc
    0     0     0     0     0      0     0     0     0     0     0 nfs
    0     0     0    68     0     43     0     0     0     9 1.06K zfs
    0     0     0     0     0      0     0     0     0     0     0 lofs
    0     0     0     0     0      0     0     0     0     0     0 tmpfs
    0     0     0     0     0      0     0     0     0     0     0 mntfs
    0     0     0     0     0      0     0     0     0     0     0 nfs3
    0     0     0     0     0      0     0     0     0     0     0 nfs4
    0     0     0     0     0      0     0     0     0     0     0 autofs

(I threw away the first batch of data, which only shows cumulative totals, not current activity.)

Or, if you're only interested in a particular filesystem type:

admin@krengi:~$ fsstat zfs 5
 new  name   name  attr  attr lookup rddir  read read  write write
 file remov  chng   get   set    ops   ops   ops bytes   ops bytes
2.08M  613K  171K 7.68G 2.25M  10.0G 43.3M 1.09G 1.97T  189M  638G zfs
    0     0     0    74     0     79     0    35   608    18   860 zfs
    0     0     0    67     0     39     0     0     0     1   112 zfs
    0     0     0    71     0     73     0     1     4     1   112 zfs

This is another great way to get a quick look at what's going on with your disk IO.

Are your users creating lots of files? Or are they modifying/removing/changing attributes a lot? What filesystems are causing the most IO load? How much IO goes through NFS and how much is local? All these questions can be easily answered with fsstat and a few flags.
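
One more fsstat trick: instead of a filesystem type, you can hand it a mount point, which is handy when you want to watch a single filesystem rather than all of them (the path below is just a placeholder, use one of your own mounts):

admin@krengi:~$ fsstat /export/home 5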

Checking Out the Network

Finally, if your problem is neither CPU, nor memory, nor disk IO, it may lie outside your system, perhaps at the networking level. Again, there's a favorite command that gets me a useful picture most of the time. For example, while streaming some video on my home server, I checked the effect on the network with this:

admin@krengi:~$ netstat -I e1000g0 5
    input   e1000g    output       input  (Total)    output
packets errs  packets errs  colls  packets errs  packets errs  colls 
417683472 4     384816503 0     0      420603019 4     387736050 0     0     
5779    0     3282    0     0      5779    0     3282    0     0     
6487    0     3556    0     0      6487    0     3556    0     0     
3672    0     2351    0     0      3673    0     2352    0     0

Notice that netstat counts packets here, not MB/s. Network performance analysis and tuning is a science of its own, but with this command you can quickly assess what each networking interface is doing, and whether the packets they transmit are in the right ballpark. Maybe you have multiple network interfaces configured, but still all your data is sent through the same pipe?
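
And if you suspect that your traffic isn't being spread across your interfaces as intended, drop the -I flag and let netstat list all of them at once, one line per interface (again counted in packets, not bytes):

admin@krengi:~$ netstat -i -f inet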

Digging Deeper

So that's it for my performance cheat sheet: vmstat for CPU and memory, iostat with the -xzn flags and fsstat for disk IO, and good old netstat -I for the network. This is the 20% effort solution, the minimum effective set of commands that will get you a quick overview of a system in 80% of the cases.

Now for that other 20% of more complicated cases, you will need some extra digging. If you want to learn more, here are a few useful pointers:

  • The Solaris Internals Wiki has a great page about CPU/Processor Analysis.
  • dim_STAT is a complete toolset for collecting and analyzing system performance. It can both generate a high level overview or a deep down analysis of a system.
  • Jörg wrote a nice article about fsstat, and he promised a little series of *stat articles. Jörg, why don't you continue your series with some of your favorite tools? That would be cool!

Your Own Favorite Performance Tools

As we have seen, most of the time we can get away with some simple use of vmstat, iostat, fsstat and netstat. What are the tools that you like to use most of the time? What's your own little set of cheat sheet performance tools? Share your own set of tools in the comments, and if Jörg is reading this: Please continue your Meet the stats series!

