P values and error bars help readers infer whether a reported difference would
likely recur, with the sample size *n* used for statistical tests
representing biological replicates, independent measurements of the population
from separate experiments. We provide examples and practical tutorials for
creating figures that communicate both the cell-level variability and the
experimental reproducibility.

### Introduction

While far from perfect, the P value offers a pragmatic metric to infer whether an
observed difference is reproducible and substantial relative to the noise in the
measurements (Greenwald et al., 1996).
The P value should be treated as a mere heuristic, interpreted as the degree
of compatibility between the observed dataset and a given statistical model. A P
value reports the probability that the observed data, or any more extreme
values, would occur by chance under the assumption that there is no real
difference (the “null hypothesis”). But a
small P value does not actually tell us which assumption is incorrect, the null
hypothesis or some other assumption of the statistical model (e.g., normal
distribution, random sampling, equal variance, etc.). In the case of treating
each cell as an *n*, the assumption that is violated is
independent sampling, not necessarily the null hypothesis. The resulting P
values are worse than useless: counting each cell as a separate *n* can easily result in false-positive rates of >50%
(Aarts et al., 2015). For excellent
practical guides to statistics for cell biologists, readers are referred to Lamb et al. (2008) and Pollard et al. (2019). In this paper, we
specifically address simple ways to communicate reproducibility when performing
statistical tests and plotting data.

Error bars and P values are often used to assure readers of a real and persistent
difference between populations or treatments. P values are based on the
difference between population means (or other summary metrics) as well as the
number of measurements used to determine that difference. In general, increasing
the number of measurements decreases the resulting P value. To convey
experimental reproducibility, P values and standard error of the mean should be
calculated using biological replicates—independent measurements of a
population of interest, typically from independent samples or separate
experiments (Hurlbert, 1984; Lazic, 2010; Vaux et al., 2012; Aarts
et al., 2015; Naegle et al.,
2015; Lazic et al., 2018).
Limited time and resources often constrain cell biologists to repeat any
particular experiment only a handful of times, so the sample size *n* is typically in the single digits. However, if authors instead assign *n* as the number of cells observed during the experiment, *n* may be on the order of hundreds or thousands, resulting
in small P values and error bars that convey neither the experimental
reproducibility nor the cell-level variability.

For example, if a researcher measures the length of 20 neurons in a zebrafish and
20 neurons in a fish exposed to a toxin, the correct *n* for each
condition is 1, because the toxin exposure was only performed once. Without
repeating the treatment multiple times with multiple fish, there is no way to
know whether any observed difference was from the toxin or due to natural or
otherwise uncontrolled differences between those two individual fish. The reader
cares not whether those two particular fish happen to differ, but whether the
treatment produces a consistent difference across multiple fish. The P value
should be calculated to reflect the latter, not the former.

Well-designed studies embrace both cell-to-cell and sample-to-sample variation (Altman and Krzywinski, 2015). Repeatedly quantifying a biological parameter rarely converges on a single “true” value, due to the complexity of living cells or because many biological processes are intrinsically stochastic. Calculating standard error from thousands of cells conceals this expected variability. We have written this tutorial to help cell biologists plot data in a way that highlights both experimental robustness and cell-to-cell variability. Specifically, we propose the use of distribution–reproducibility “SuperPlots” that display the distribution of the entire dataset, and report statistics (such as means, error bars, and P values) that address the reproducibility of the findings.

### What population is being sampled?

To clarify what your sample size *n* should be, ask yourself: What
population are you trying to sample? The choice of *n* determines
the population being evaluated or compared (Naegle et al., 2015; Lazic et al.,
2018; Pollard et al., 2019).
A typical cell biology experiment strives to draw general conclusions about an
entire population of cells, so the sample selection should reflect the breadth
of that population. For example, to test if a treatment changes the speed of
crawling cells, you could split a flask of lymphocytes into two wells, treat one
well with a drug of interest and one with a placebo, and then track individual
cells in each of the two wells. If you use each cell as a sample
(*n* = number of cells), the two populations you end up
comparing are the cells in those two particular wells. Multiple observations
within one well increase the precision for estimating the mean for that one
sample, but do not reveal a truth about all cells in all wells. By repeating the
experiment multiple times from new flasks, and using each experiment as a sample
(*n* = number of independent experiments), you evaluate
the effect of the treatment on any arbitrary flask of similar cells. (For more
examples, see Table
S1.)

If you are interested only in cell-to-cell variability within a particular
sample, then *n* could be the number of cells observed. However,
making inferences beyond that sample is difficult, because the natural
variability of individual cells can be overshadowed by systematic differences
between biological replicates. Whether caused by passage number, confluency, or
location in the incubator, cells often vary from sample to sample and day to
day. For example, an entire flask of cells can be described as
“unhappy.” Accordingly, cells from experimental and control
samples (e.g., tubes, flasks, wells, coverslips, rats, tissue samples, etc.) may
differ from each other, regardless of the experimental treatment. When authors
report the sample size as the number of cells, the statistical analysis cannot
help the reader evaluate whether differences are due to the intended treatment
or sample-to-sample variability. We are not prescribing any specific definition
of *n*; researchers should consider what main source of
variability they hope to overcome when designing experiments and statistical
analyses (Altman and Krzywinski,
2015).

### Statistics in cell biology typically assume independent tests of a hypothesis

Analysis becomes challenging when the experimental unit—the item that can be randomly assigned to a treatment—differs from the biological entity of interest. For example, we often care about how individual cells react to a treatment, but typically treat entire dishes of cells at a time. To test the hypothesis that two treatments or populations are different, the treatment must be applied or the populations sampled multiple times. Neighboring cells within one flask or well treated with a drug are not separate tests of the hypothesis, because the treatment was only applied once. But if individual cells are microinjected with a drug or otherwise randomly assigned to different treatments, then each cell really can be a separate test of the hypothesis.

Finding truly independent groups and deciding what makes for a good biological replicate can be challenging (Vaux et al., 2012; Blainey et al., 2014; Naegle et al., 2015; Lazic et al., 2018). For example, is it acceptable to run multiple experiments from just one thawed aliquot of cells? Is it necessary to generate multiple knockout strains? Is it sufficient to test in one cell line? There’s no single right answer: each researcher must balance practicality with robust experimental design. At a minimum, researchers must perform an experiment multiple times if they want to know whether the results are robust.

### Calculating P values from cell-level observations

Cell biologists often observe hundreds of cells per experiment and repeat an experiment multiple times. To leverage that work into robust statistics, one needs to take into account the hierarchy of the data: simply combining the cell-level data from multiple independent experiments into one pool squanders useful information about run-to-run variability (Fig. 1). There is ample literature on the analysis of this type of hierarchical data (Galbraith et al., 2010), including methods that account for both the variance within a sample and the clustering across multiple experimental runs (Aarts et al., 2015), and methods that propagate the error up the analysis chain, such as a nested ANOVA (Krzywinski et al., 2014). Recently, statisticians have proposed a Bayesian approach to multilevel analysis (Lazic et al., 2020). For a detailed resource on hierarchical data analysis, see Gelman and Hill (2006).
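As a small illustration of the hierarchy these methods respect, the total spread of cell-level measurements can be decomposed into a within-run component and a between-run component. The sketch below uses simulated, hypothetical data (it is not a substitute for the cited methods):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical hierarchical data: 4 experimental runs, 100 cells each.
# Each run has its own random offset (between-run variability) on top of
# cell-to-cell noise (within-run variability).
runs = [rng.normal(rng.normal(10.0, 1.0), 0.5, size=100) for _ in range(4)]

run_means = np.array([run.mean() for run in runs])

# Within-run variance: average of each run's cell-to-cell variance
within = np.mean([run.var(ddof=1) for run in runs])

# Between-run variance: spread of the run means around each other
between = run_means.var(ddof=1)

print(f"within-run var = {within:.2f}, between-run var = {between:.2f}")
```

Treating all 400 cells as one flat sample ignores the between-run component entirely, which is exactly the information hierarchical methods preserve.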

A simple approach—which permits conventional *t* test or
ANOVA calculations—is to pool the cell-level data from each experiment
separately and compare the subsequent sample-level means (Altman and Bland, 1997; Galbraith et al., 2010; Lazic,
2010). For example, if you have three biological replicates of
control and treated samples, and you measure the cell diameter of 200 cells in
each sample, first calculate the mean of those 200 measurements for each sample,
then run a *t* test on those sample means (three control, three
treated). When using this simplified method, it is best to keep the number of
observations per sample similar, because each sample gets the same weighting in
the analysis.

While pooling dependent observations avoids false positives (Galbraith et al., 2010; Aarts et al., 2015), this simple approach
might fail to detect small but real differences between groups, cases in which
more advanced hierarchical techniques may prove more powerful. However, increasing the
number of biological replicates usually has a larger influence on the
statistical power than measuring many more cells in each sample (Blainey et al., 2014; Aarts et al., 2015). While *n* of 3 is often considered a pragmatic minimum in cell
biology (Naegle et al., 2015),
distinguishing more subtle observed differences will require planning for more
biological replicates and/or harnessing the power of more robust statistical
analyses.

### Communicating variability with SuperPlots

After analyzing hundreds of cells across multiple rounds of experimentation, it would be useful to incorporate both the cell-level variability and the experimental repeatability into a single diagram. In Fig. 1 A, the plots have tiny error bars and P values, which should raise red flags: such minuscule P values imply that the experiment could be replicated with nearly identical results and/or repeated hundreds of times, which is rarely realistic in cell biology. Bar graphs are problematic because they obscure the distribution of cell-level data as well as the sample-to-sample repeatability (Weissgerber et al., 2015). While beeswarm, box-and-whisker, and violin plots are great at conveying information about the range and distribution of the underlying data, plotting the entire dataset does not make it appropriate to treat repeated measurements on the same sample as independent experiments.

Therefore, we suggest authors incorporate information about distribution and reproducibility by creating “SuperPlots,” which superimpose summary statistics from repeated experiments on a graph of the entire cell-level dataset (Fig. 1, right columns). SuperPlots convey more information than a conventional bar graph or beeswarm plot, and they make it clear that statistical analyses (e.g., error bars and P values) are calculated across separate experiments, not individual cells—even when each cell is represented on the plot. For example, the mean from each experiment could be listed in the caption or plotted as a larger dot on top of the many smaller dots that denote individual measurements.

When possible, it is best to link samples by run, for instance, by color-coding
the dots by experiment or by drawing a line linking paired measurements together (Fig. S1 D). These linkages convey the
repeatability of the work: readers learn more if they know that one experiment
exhibited high readings across the board than if they have to guess the trend in
each sample. Linking data can also eliminate the need to normalize data in order
to directly compare different experimental runs. Often, multiple experiments
might all exhibit the same trend, but different absolute values (Fig. 1 C). By encoding the biological
replicate into the data, such trends can be revealed without normalizing to a
control group: P values can then be calculated using statistical tests that take
into account linkages among samples (e.g., a paired or ratio *t* test). In fact, not taking into account linkages can make the *t* test too conservative, yielding false negatives (Galbraith et al., 2010).
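To see why linkage matters, consider hypothetical per-experiment means in which every run shows the same upward shift but at a different absolute baseline. An unpaired *t* test is swamped by the run-to-run spread, while a paired test compares within-run differences and detects the consistent trend (illustrative numbers only; `scipy` assumed):

```python
from scipy import stats

# Hypothetical per-experiment means: three runs, each with its own baseline,
# but the treated sample is consistently ~2 units higher within every run.
control = [10.1, 14.8, 7.9]
treated = [12.2, 16.9, 10.1]

# Unpaired t test: the large run-to-run spread swamps the consistent shift
p_unpaired = stats.ttest_ind(control, treated).pvalue

# Paired t test: compares within-run differences, revealing the trend
p_paired = stats.ttest_rel(control, treated).pvalue

print(f"unpaired p = {p_unpaired:.3f}, paired p = {p_paired:.3f}")
```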

An impressive amount of information can be depicted by color-coded beeswarm SuperPlots (see Fig. 1, rightmost plots), where each cell-level datapoint divulges which experiment it came from (Galbraith et al., 2010; Weissgerber et al., 2017). This helps convey to the reader whether each experimental round gave similar results or whether one run biases the conclusion (Fig. 1 D). The summary statistics and P values in beeswarm SuperPlots are overlaid on the color-coded scatter. (See Fig. S2, Fig. S3, Fig. S4, and Fig. S5 for tutorials on how to make beeswarm SuperPlots in Prism, Excel, R, and Python using Data S1.)
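A minimal matplotlib sketch of a color-coded SuperPlot is shown below, with random jitter standing in for a true beeswarm layout. All data are simulated and hypothetical, and the axis label is a placeholder:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
conditions = ["Control", "Treated"]
colors = ["tab:blue", "tab:orange", "tab:green"]  # one color per replicate

fig, ax = plt.subplots()
replicate_means = {cond: [] for cond in conditions}
for x, cond in enumerate(conditions):
    for rep in range(3):  # three biological replicates
        # Hypothetical cell-level measurements for this replicate
        cells = rng.normal(10 + x * 1.5 + rng.normal(0, 0.4), 1.0, size=50)
        jitter = rng.uniform(-0.15, 0.15, size=cells.size)  # stand-in for beeswarm
        ax.scatter(x + jitter, cells, s=10, alpha=0.4, color=colors[rep])
        replicate_means[cond].append(cells.mean())
    # Overlay each replicate's mean as a large, outlined dot
    ax.scatter([x] * 3, replicate_means[cond], s=120,
               c=colors, edgecolors="black", zorder=3)

ax.set_xticks(range(len(conditions)))
ax.set_xticklabels(conditions)
ax.set_ylabel("Cell diameter (a.u.)")
fig.savefig("superplot.png", dpi=150)
```

Any error bars or P values added to such a plot should then be computed from the three large replicate means per condition, not from the underlying cloud of cells.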

Whatever way authors choose to display their data, it is critical to list the number of independent experiments in the figure or caption, as well as how the means and statistical tests were calculated.

### Error bars that communicate reproducibility

The choice of error bars on a SuperPlot depends on what you hope to communicate:
descriptive error bars characterize the distribution of measurements (e.g.,
standard deviation), while inferential error bars evaluate how likely it is that
the same result would occur if the experiment were to be repeated (e.g.,
standard error of the mean or confidence intervals; Cumming et al., 2007). To convey how repeatable an
experiment is, it is appropriate to choose inferential error bars calculated
using the number of independent experiments as the sample size. However,
calculating standard error of the mean by inputting data from all cells
individually fails in two ways: first, the natural variability we expect from
biology would be better summarized with a descriptive measure, like standard
deviation; and second, the inflated *n* produces error bars that
are artificially small (due to √*n* in the denominator)
and do not communicate the repeatability of the experiment.

The problems with calculating error bars using cell count as the sample size are illustrated by comparing the error bars in Fig. 1 A to those in Fig. 1, B–D: when each cell measurement is treated as an independent sample, the standard error of the mean is always tiny, whether or not there is variability among experimental replicates. In contrast, the error bars calculated using biological replicates grow when the results vary day to day. In cases where displaying every data point is not practical, authors should consider some way of representing the cell-to-cell variability as well as the run-to-run repeatability. This could be error bars that represent the standard deviation of the entire dataset, but with P values calculated from biological replicates.
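The arithmetic behind this contrast can be sketched directly with simulated, hypothetical numbers:

```python
import numpy as np

rng = np.random.default_rng(2)

# Three biological replicates with noticeable day-to-day offsets,
# 200 hypothetical cell measurements each
runs = [rng.normal(offset, 1.0, size=200) for offset in (9.0, 10.5, 12.0)]

pooled = np.concatenate(runs)

# SEM with every cell as the sample (n = 600): tiny no matter how much
# the runs disagree, because of the sqrt(n) in the denominator
sem_cells = pooled.std(ddof=1) / np.sqrt(pooled.size)

# SEM across biological replicates (n = 3): grows when runs vary day to day
run_means = np.array([run.mean() for run in runs])
sem_replicates = run_means.std(ddof=1) / np.sqrt(run_means.size)

# Descriptive alternative: standard deviation of the whole dataset
sd_cells = pooled.std(ddof=1)

print(f"per-cell SEM = {sem_cells:.2f}, "
      f"per-replicate SEM = {sem_replicates:.2f}, SD = {sd_cells:.2f}")
```

Here the per-cell SEM stays artificially small despite the 3-unit spread between runs, while the per-replicate SEM and the whole-dataset standard deviation both reflect variability the reader should see.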

### Conclusions

When calculating your P value, take a moment to consider these questions: What
variability does your P value represent? How many independent experiments have
you performed, and does this match with your *n*? (See Table S1 for practical
examples of this analysis.) We encourage authors and editors to focus less on
reporting satisfying yet superficial statistical tests such as P values, and
more on presenting the data in a manner that conveys both the variability and
the reproducibility of the work.

### Online supplemental material

Fig. S1 shows other plotting examples. Fig. S2 is a tutorial for making
SuperPlots in Prism. Fig. S3 is a
tutorial for making SuperPlots in Excel. Fig. S4 is a tutorial for making SuperPlots in R. Fig. S5 is a tutorial for making SuperPlots in Python. Table S1 shows how
the choice of *n* influences conclusions. Data S1 is the raw data
used to generate Figs. S4 and S5.

## Acknowledgments

We acknowledge that our past selves are not innocent of the mistakes described in this manuscript. We are grateful to several colleagues who provided feedback on our preprint, including Kenneth Campellone, Adam Zweifach, William Velle, Geoff O’Donoghue, and Nico Stuurman. We also thank Natalie Petek for providing some hints on using Prism and Jiongyi Tan for reviewing the Python code.

This work was supported by grants to L.K. Fritz-Laylin from the National Institutes of Health (from the National Institute of Allergy and Infectious Diseases grant 1R21AI139363), from the National Science Foundation (grant IOS-1827257), and from the Pew Scholars Program in the Biomedical Sciences; and by grants to R.D. Mullins from the National Institutes of Health (R35-GM118119) and Howard Hughes Medical Institute.

The authors declare no competing financial interests.