[Image: Micro-Star International motherboard with AMD Phenom II 4-core processor.]

Performance Tuning on Linux — Taking Meaningful Measurements

Measure the Right Things

This goes back to where we started: What are we trying to achieve? Throughput or latency? Do you want a bigger truck or a faster car?

We also need to understand performance dependencies. It makes no sense to jump all the way to NFS performance if we haven't built our server, client, and network from adequate hardware, tuned the disk I/O and file systems on the NFS server, and tuned the Ethernet and TCP networking connecting the client and server.

Use Appropriate Tools

See the previous page on application measurement and monitoring, plus the filesystem I/O benchmarking tools.

Select the right tool for the job — vmstat, iostat, and iotop tell you about I/O bottlenecks, but they provide very different information.

Use the Tools Appropriately

Take measurements at appropriate times. At a scale of weeks to years, I/O will degrade as a file system ages, fills up, and becomes fragmented. At a scale of minutes to days, the workload on a production server will vary. Are backups running? A major software build? Is the load down because it's lunch time?

Take measurements over appropriate periods. The CPU runs orders of magnitude faster than rotating disks. The default 3-second interval makes sense for top, but measurement intervals for file system I/O with iostat, and possibly iotop, should be much longer.
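As a rough illustration of what a longer sampling interval looks like, here is a minimal Python sketch that samples write activity from /proc/diskstats every 30 seconds, roughly the kind of thing iostat does for you with far more detail. The device name sda, the 30-second interval, and the write-only focus are arbitrary choices for the example, not recommendations.

#!/usr/bin/env python3
# Minimal sketch: sample sectors written for one block device from
# /proc/diskstats at a long interval. Assumes a Linux system; the
# device name and interval below are arbitrary examples.
import time

DEVICE = "sda"      # hypothetical device name; adjust for your system
INTERVAL = 30       # seconds; much longer than top's default 3 seconds
SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors

def sectors_written(device):
    with open("/proc/diskstats") as stats:
        for line in stats:
            fields = line.split()
            if fields[2] == device:
                return int(fields[9])   # field 10: sectors written
    raise ValueError(f"device {device!r} not found")

previous = sectors_written(DEVICE)
while True:
    time.sleep(INTERVAL)
    current = sectors_written(DEVICE)
    rate = (current - previous) * SECTOR_BYTES / INTERVAL
    print(f"{DEVICE}: {rate / 1024:.1f} KiB/s written over {INTERVAL} s")
    previous = current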

Is this really a fair test of typical performance? If you are taking multiple measurements through repeated tests, does the first test prime the cache and make the rest artificially faster? What do you have to do to flush the cache for the next test?
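On Linux, one common way to flush the page cache between runs is to call sync and then write "3" to /proc/sys/vm/drop_caches, which requires root. Here is a minimal sketch of a repeated-test harness along those lines; run_benchmark() and the /var/tmp/testfile path are hypothetical placeholders for whatever you are actually measuring.

#!/usr/bin/env python3
# Sketch of a repeated-test harness that flushes the page cache between
# runs so later runs are not artificially fast. Requires root, because
# writing to /proc/sys/vm/drop_caches is privileged.
import os
import time

def drop_caches():
    os.sync()                      # flush dirty pages to disk first
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")             # 3 = page cache + dentries + inodes

def run_benchmark():
    # Hypothetical placeholder: read a large test file end to end.
    with open("/var/tmp/testfile", "rb") as f:
        while f.read(1 << 20):
            pass

for trial in range(5):
    drop_caches()
    start = time.monotonic()
    run_benchmark()
    elapsed = time.monotonic() - start
    print(f"trial {trial + 1}: {elapsed:.2f} s")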

How much does the test resemble typical use? Many small read/write accesses to a large database present a load that is very different from streaming large media files. Can your benchmarking tool be tuned to better emulate intended use? Put another way, your users probably aren't interested in running the Bonnie++ benchmarking tool, so tuning a system to optimize its results at the expense of typical use will make things worse, not better.

How much does the measurement itself get in the way? Run top -d 0.2 inside a GNOME terminal on a GNOME desktop and see how much CPU power it takes to monitor CPU use in this wasteful fashion.

Draw Reasonable Conclusions

Collect plenty of data, analyze it carefully, and don't try to make it mean too much. Be honest and fair with your claims. Even after you take several measurements, you have nothing more than an estimate of the actual value. Read on if you don't mind details and a little math.

A quantitative analysis chemistry class I took during undergraduate electrical engineering study had a large influence on me. It was a lecture course with an associated lab. It was in the chemistry department, and we did things in a chemistry laboratory in the weekly 3-hour lab session, but what I found useful long-term was the quantitative analysis part and not the chemistry itself. Distinguishing between accuracy and precision, for example. Deciding how many significant digits I can really believe at the end of a measurement.

Then, in graduate school in electrical engineering, I encountered the trend in robotics research to consider a problem solved once you manage to capture video of an industrial robot accomplishing some task. But all that proves is that it's possible. That can be an interesting result, but to make things useful they must be more than just possible, they must work most of the time, and we need to have a pretty good estimate of the expected success rate.

When that troublesome quantitative analysis chemistry course leads me to bring up these unpleasant details, the typical response is to run the experiment four times with three successes and claim that it works 75% of the time. Or, trading optimism for realism, "at least 75% of the time!"

Well, no. Three successes in four trials doesn't mean very much. All I can definitely say is that the probability of success is greater than 0 and less than 1. Yes, a success rate of 0.75 is the single most likely value, but with just a few measurements there is a lot of noise: the real probability might be fairly close to either 0 or 1, and we just happened to see this short initial sequence.

We can use Bayes' theorem to calculate P(r), the probability that the real success rate is some value r, where 0 ≤ r ≤ 1, given that we observed k successes in n trials. Start with this unnormalized curve:

φ(r) = r^k (1 − r)^(n−k)

Calculate the integral, the area under that curve over the range 0 ≤ r ≤ 1, call that A. The probability that r is the actual success rate is:

P(r) = r^k (1 − r)^(n−k) / A

Now start at the peak, where r = k/n, and integrate outward in both directions to points r1 and r2 chosen so that P(r1) = P(r2) and the integral over the range r1 ≤ r ≤ r2 is 0.9. Then we could honestly say something like "I have 90% confidence that the real success rate is between r1 and r2."

The problem is that with three successes out of four trials, all we can really say is that we're 90% confident that the real success rate is somewhere between 39.53% and 95.75%. Higher confidence means even wider ranges. We can narrow down our estimates of the real range by taking more measurements. But look at how many measurements it takes!

Successes | Trials | 90% confidence range | 95% confidence range | 98% confidence range
        3 |      4 | 39.53% to 95.75%     | 32.99% to 97.40%     | 26.04% to 98.64%
       30 |     40 | 63.00% to 84.86%     | 60.55% to 86.45%     | 57.65% to 88.16%
      300 |    400 | 71.34% to 78.44%     | 70.61% to 79.07%     | 69.77% to 79.79%
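If you want to check these figures, or compute ranges for your own trial counts, here is a small numerical sketch of the calculation described above, using only the Python standard library. It evaluates r^k (1 − r)^(n−k) on a fine grid, normalizes the curve, and then collects probability mass starting from the highest-density grid points until the requested confidence is covered. For a single-peaked curve like this one, the covered points form the interval [r1, r2] with P(r1) = P(r2), so up to grid resolution it should reproduce ranges like those in the table.

#!/usr/bin/env python3
# Numerical sketch of the confidence-range calculation described above:
# evaluate phi(r) = r^k * (1-r)^(n-k) on a grid, normalize it so the
# area is 1, then accumulate probability from the highest-density points
# downward until the requested mass is covered.

def confidence_range(successes, trials, confidence, points=100001):
    k, n = successes, trials
    step = 1.0 / (points - 1)
    grid = [i * step for i in range(points)]
    density = [r**k * (1.0 - r)**(n - k) for r in grid]
    area = sum(density) * step                  # the normalizing constant A
    mass = [d * step / area for d in density]   # probability of each grid slice
    # Take grid points from the highest density downward until the
    # accumulated probability reaches the requested confidence.
    order = sorted(range(points), key=lambda i: density[i], reverse=True)
    covered, chosen = 0.0, []
    for i in order:
        chosen.append(i)
        covered += mass[i]
        if covered >= confidence:
            break
    return grid[min(chosen)], grid[max(chosen)]

for k, n in [(3, 4), (30, 40), (300, 400)]:
    for conf in (0.90, 0.95, 0.98):
        lo, hi = confidence_range(k, n, conf)
        print(f"{k}/{n} at {conf:.0%}: {lo:.2%} to {hi:.2%}")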

If you think this is bad, keep in mind that this is the simple case of binary success versus failure. Estimating a continuous quantity like throughput or latency is harder still.

Be very wary of saying that you measured the speed, made some improvements, measured again, and now it's 20% faster. How meaningful were your estimates of the original and new speeds, and what does that say about the real factor of improvement?
