# Performance Tuning on Linux — Taking Meaningful Measurements

## Measure the Right Things

This goes back to where we started: what are we trying to achieve? **Throughput or latency?** Do you want a bigger truck or a faster car?

We also need to understand performance dependencies. It makes no sense to jump all the way to NFS performance if we haven't built our server, client, and network from adequate hardware, tuned the disk I/O and file systems on the NFS server, and tuned the Ethernet and TCP networking connecting the client and server.

## Use Appropriate Tools

See the previous page on application measurement and monitoring, plus the filesystem I/O benchmarking tools.

Select the right tool for the job: `vmstat`, `iostat`, and `iotop` all tell you about I/O bottlenecks, but they provide very different information.
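As a rough illustration of why they differ: `vmstat`'s iowait column comes from the aggregate `cpu` line in `/proc/stat`, while `iostat`'s per-device numbers come from `/proc/diskstats`. A minimal Python sketch, with field positions from the kernel's documented formats and made-up sample lines shaped like the real files:

```python
def cpu_iowait_fraction(proc_stat_line):
    """Fraction of CPU time spent waiting on I/O (vmstat's 'wa' column)."""
    fields = [int(f) for f in proc_stat_line.split()[1:]]
    # /proc/stat order: user nice system idle iowait irq softirq steal ...
    return fields[4] / sum(fields)

def sectors_read(diskstats_line):
    """Sectors read for one device (what iostat turns into kB/s rates)."""
    fields = diskstats_line.split()
    # /proc/diskstats: major minor name reads merges sectors_read ms_reading ...
    return int(fields[5])

# made-up sample lines, shaped like the real files:
stat = "cpu  10132153 290696 3084719 46828483 16683 0 25195 0 0 0"
disk = "8 0 sda 446216 785926 9537434 4207122 0 0 0 0 0 117130 4309694"
print(cpu_iowait_fraction(stat), sectors_read(disk))
```

The same system, two very different views of it: one tells you the CPUs are stalled waiting on I/O, the other tells you which device is doing the work.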

## Use the Tools Appropriately

**Take measurements at appropriate times.**
At a scale of weeks to years,
I/O will degrade as a file system ages and
becomes more full and fragmented.
At a scale of minutes to days,
the work load on a production server will vary.
Are backups running?
A major software build?
Is the load down because it's lunch time?

**Take measurements over appropriate periods.**
The CPU runs orders of magnitude faster than rotating disks.
Three-second intervals make sense for `top`, but file system I/O measurement intervals with `iostat` and possibly `iotop` should be much larger.

**Is this really a fair test of typical real-world performance?**
If you are taking multiple measures through repeated tests,
does the first test prime the cache and make the rest
artificially faster?
What do you have to do to flush the cache for the next test?
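On Linux, one option is the root-only, system-wide `echo 3 > /proc/sys/vm/drop_caches`. A gentler, per-file alternative is `posix_fadvise(..., POSIX_FADV_DONTNEED)`, which asks the kernel to evict one file's clean pages. A minimal sketch (the helper names are my own):

```python
import os
import tempfile
import time

def time_read(path):
    """Time a full sequential read of `path`."""
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(1 << 20):
            pass
    return time.perf_counter() - t0

def drop_from_cache(path):
    """Ask the kernel to evict this file's pages (Linux; no root needed,
    unlike /proc/sys/vm/drop_caches, which flushes everything)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # only clean pages can be dropped, so flush first
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

# demo on a throwaway 16 MB file:
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (16 << 20))
    demo = f.name
warm = time_read(demo)             # likely served from the page cache
if hasattr(os, "posix_fadvise"):   # Linux/POSIX only
    drop_from_cache(demo)
cold = time_read(demo)             # closer to real device speed after the drop
os.unlink(demo)
print(f"warm {warm:.4f}s  cold {cold:.4f}s")
```

Run the cold-cache measurement several times, dropping the cache between trials, and you get numbers that actually reflect the storage rather than the memory.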

**How much does the test resemble typical use?**
Many small read/write accesses to a large database
present a load that is very different from streaming
large media files.
Can your benchmarking tool be tuned to better emulate
intended use?
Put another way, your users probably aren't interested
in running the
Bonnie++ benchmarking tool,
so tuning a system to optimize its results at the expense
of typical use will make things worse, not better.

**How much does the measurement itself
get in the way?**
Run `top -d 0.2` inside a GNOME terminal on a GNOME desktop and see how much CPU power it takes to monitor CPU use in this wasteful fashion.
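Much of `top`'s real cost at that refresh rate is scanning `/proc` and redrawing the terminal, but you can quantify the same effect in your own monitoring scripts. A small sketch (names are mine) that reports how much CPU the sampling loop itself consumes:

```python
import time

def monitor_cost(interval, duration, sample):
    """Fraction of one CPU spent just taking measurements when `sample`
    is called every `interval` seconds for `duration` seconds."""
    cpu0, wall0 = time.process_time(), time.perf_counter()
    while time.perf_counter() - wall0 < duration:
        sample()
        time.sleep(interval)
    # process_time() excludes the sleeps, so this is pure measurement cost
    return (time.process_time() - cpu0) / (time.perf_counter() - wall0)

def read_cpu_counters():
    """Stand-in for one monitor refresh; reads /proc/stat where it exists."""
    try:
        with open("/proc/stat") as f:
            f.read()
    except OSError:
        pass

# a 0.2 s refresh does 15x the sampling work of top's 3 s default
fast = monitor_cost(0.2, 2.0, read_cpu_counters)
print(f"sampling overhead at 0.2 s refresh: {100 * fast:.2f}% of one CPU")
```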

## Draw Reasonable Conclusions

**Collect plenty of data,
analyze it carefully,
and don't try to make it mean too much.**
Be honest and fair with your claims.
Even after you take several measurements, you have nothing
more than an estimate of the actual value.
Read on if you don't mind details and a little math.

A quantitative analysis chemistry course I took as an undergraduate electrical engineering student had a large influence on me. It was a lecture course with a weekly 3-hour session in the chemistry lab, but what proved useful long-term was the quantitative analysis, not the chemistry itself: distinguishing between accuracy and precision, for example, or deciding how many significant digits I can really believe at the end of a measurement.

Then, in graduate school in electrical engineering, I
encountered the trend in robotics research to consider
a problem solved once you manage to capture video of
an industrial robot accomplishing some task.
**But all that proves is that it's possible.**
That can be an interesting result, but to make things useful
they must be more than just possible, they must work most
of the time, and we need to have a pretty good estimate of
the expected success rate.

When that troublesome quantitative analysis chemistry course
leads me to bring up these unpleasant details,
the typical response is to run the experiment four times
with three successes and claim that it works 75% of the time.
Or, trading optimism for realism,
"*at least* 75% of the time!"

Well, no. Three successes in four trials doesn't mean very much. All I can definitely say is that the probability of success is larger than 0 and less than 1. Yes, a probability of 0.75 is the most likely, but with just a few measurements there is a lot of measurement noise and the real probability might really be pretty close to either 0 or 1 and we just happened to see this short initial sequence.

We can use Bayes' theorem to calculate *P(r)*, the probability density of the real success rate being some value *r*, where 0 ≤ *r* ≤ 1 and there were *k* successes in *n* trials. Assuming we knew nothing beforehand (a uniform prior), the posterior is proportional to:

*φ(r) = r^{k}(1 − r)^{n−k}*

Calculate the integral, the area under that curve over the range 0 ≤ *r* ≤ 1, and call that *A*. The probability density of the success rate at *r* is then:

*P(r) = r^{k}(1 − r)^{n−k} / A*

Now start at the peak where *r = k/n*, and integrate out in both directions to points *r_{1}* and *r_{2}* so that *r_{1} ≤ r ≤ r_{2}*, *P(r_{1}) = P(r_{2})*, and the integral over that range is 0.9. *Then* we could honestly say something like "I have 90% confidence that the real success rate is between *r_{1}* and *r_{2}*."

The problem is that with three successes out of four trials, all we can really say is that we're 90% confident that the real success rate is in the range of 39.53–95.75%. Higher confidence means even wider ranges. We can narrow down our estimates of the real range by taking more measurements. But look at how many measurements it takes!
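That interval is easy to compute numerically. A short Python sketch (the function name and grid size are my own choices): normalize the curve, then lower a horizontal cut through it until the region above the cut holds the requested probability mass. With three successes in four trials it reproduces the 39.53–95.75% range:

```python
def confidence_range(k, n, mass=0.90, points=200001):
    """Shortest range of success rates holding `mass` of the probability,
    given k successes in n trials and the uniform prior used above."""
    dr = 1.0 / (points - 1)
    r = [i * dr for i in range(points)]
    density = [x**k * (1.0 - x)**(n - k) for x in r]
    area = sum(density) * dr                  # this is A in the text
    density = [d / area for d in density]     # now P(r) integrates to 1
    # Lower a horizontal cut through the curve until the region above it
    # holds the requested mass; its edges satisfy P(r1) = P(r2).
    order = sorted(range(points), key=lambda i: -density[i])
    acc, kept = 0.0, []
    for i in order:
        kept.append(i)
        acc += density[i] * dr
        if acc >= mass:
            break
    return min(kept) * dr, max(kept) * dr

r1, r2 = confidence_range(3, 4, 0.90)
print(f"90% confidence: {100 * r1:.2f}% to {100 * r2:.2f}%")
```

Because the curve is unimodal with its peak at *k/n*, the region above the cut is a single interval, which is exactly the integration outward from the peak described above.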

| Successes | Trials | 90% confidence range | 95% confidence range | 98% confidence range |
|---|---|---|---|---|
| 3 | 4 | 39.53 – 95.75% | 32.99 – 97.40% | 26.04 – 98.64% |
| 30 | 40 | 63.00 – 84.86% | 60.55 – 86.45% | 57.65 – 88.16% |
| 300 | 400 | 71.34 – 78.44% | 70.61 – 79.07% | 69.77 – 79.79% |

If you think this is bad, remember that this is just the simple case of success versus failure.

**Be very wary** of saying that you measured
the speed, made some improvements, measured again,
and now it's 20% faster.
How meaningful were your estimates of the original and
new speeds, and what does that say about the real
factor of improvement?
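A quick way to sanity-check such a claim is to put an uncertainty range on each mean before comparing them. A minimal sketch with made-up throughput numbers, using normal-approximation intervals from the standard error:

```python
import statistics as st

def mean_ci(samples, z=1.96):
    """Sample mean with an approximate 95% confidence interval."""
    m = st.mean(samples)
    half = z * st.stdev(samples) / len(samples) ** 0.5
    return m - half, m, m + half

# made-up throughput measurements (MB/s) before and after tuning:
before = [102, 95, 110, 88, 107, 93, 99, 104]
after  = [118, 96, 125, 101, 130, 99, 112, 121]

lo_b, m_b, hi_b = mean_ci(before)
lo_a, m_a, hi_a = mean_ci(after)
print(f"before: {m_b:.1f} ({lo_b:.1f} to {hi_b:.1f})")
print(f"after:  {m_a:.1f} ({lo_a:.1f} to {hi_a:.1f})")
# the means differ by about 13%, but the intervals overlap, so this data
# doesn't yet support a confident "13% faster" claim
```

When the intervals overlap, the honest statement is "it looks faster, but I need more measurements," not a precise percentage.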