I'm still digesting this and may have some details wrong, but here is a summary of what I've learned about properly using statistics to compare the performance of computer systems [1] [2].
0. When you take a sample and calculate the sample mean, it is likely to differ from the population mean. Simply comparing two sample means from different computer systems may therefore lead to misleading conclusions, since the observed difference could be due to random sampling variation rather than a real performance difference (especially when the variance is high).
1. The Central Limit Theorem (CLT) states that, regardless of the underlying distribution (as long as its variance is finite), the sampling distribution of the mean approaches a normal distribution as the sample size grows (a common rule of thumb is n >= 30). To apply the CLT, however, the observations must be independent and drawn from the same distribution.
2. Based on the CLT, we can estimate how close the sample mean is likely to be to the population mean (more generally, how close a sample statistic is to the population parameter it estimates). A confidence interval [x, y] with a confidence level of p% means that if we repeated the sampling process many times under the same conditions, about p% of the resulting intervals would contain the population mean.
3. To compare two distributions, confidence intervals can help determine whether the difference between mean values is statistically significant. This can be done by:
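- Checking whether the confidence intervals of the two samples do not overlap (a conservative check: overlapping intervals do not by themselves rule out a significant difference), or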
- Examining the confidence interval for the mean of the differences to check if it does not include 0.
4. Caveat: The CLT assumes that the observations are collected independently, meaning one observation does not affect another. In computer systems, this assumption usually does not hold. Caches, memory layout, scheduling decisions, etc. can introduce some degree of dependence between observations.
This can be mitigated by 1) reducing dependence between experiments as much as possible, or 2) applying a bootstrapping method [3].
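To make points 2 to 4 concrete, here is a minimal Python sketch, not a definitive implementation. The function names (`mean_ci`, `diff_ci`, `bootstrap_diff_ci`) and the latency numbers are made up for illustration. It assumes two unpaired samples, so it computes an interval for the difference of the means, and it uses the normal approximation, which leans on the CLT and a reasonably large n. Note that the plain resampling shown here still treats observations as independent; as far as I understand, block variants of the bootstrap are the ones designed for dependent data.

```python
import random
from statistics import NormalDist, mean, stdev


def mean_ci(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean of one sample."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(sample) / len(sample) ** 0.5
    m = mean(sample)
    return m - half_width, m + half_width


def diff_ci(a, b, confidence=0.95):
    """Normal-approximation interval for mean(b) - mean(a), unpaired samples."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    std_err = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    d = mean(b) - mean(a)
    return d - z * std_err, d + z * std_err


def bootstrap_diff_ci(a, b, confidence=0.95, resamples=2000, rng=None):
    """Percentile bootstrap interval for mean(b) - mean(a)."""
    rng = rng or random.Random(0)
    # Resample each system with replacement and record the difference of means.
    diffs = sorted(
        mean(rng.choices(b, k=len(b))) - mean(rng.choices(a, k=len(a)))
        for _ in range(resamples)
    )
    lo_idx = int((1 - confidence) / 2 * resamples)
    hi_idx = int((1 + confidence) / 2 * resamples) - 1
    return diffs[lo_idx], diffs[hi_idx]


# Hypothetical latency measurements in ms (synthetic, for illustration only).
rng = random.Random(42)
system_a = [rng.gauss(100.0, 10.0) for _ in range(50)]
system_b = [rng.gauss(110.0, 10.0) for _ in range(50)]

print("CI for A:", mean_ci(system_a))
print("CI for B:", mean_ci(system_b))
print("CI for B - A (normal approx):", diff_ci(system_a, system_b))
print("CI for B - A (bootstrap):", bootstrap_diff_ci(system_a, system_b))
```

If the interval for B - A excludes 0, the difference is unlikely to be sampling noise alone at the chosen confidence level. If it includes 0, the data do not show a difference at that level.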
[1] Jan Kara, Measuring performance regressions, https://youtu.be/HAHhW13ofrg?si=drgegMwXUDegHsQf
[2] Raj Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, https://www.amazon.com/Art-Computer-Systems-Performance-Analysis/dp/0471503363
[3] Bootstrapping (statistics), Wikipedia, https://en.wikipedia.org/wiki/Bootstrapping_(statistics)