P values and error bars help readers infer whether a reported difference would likely recur, provided the sample size *n* used for statistical tests represents biological replicates: independent measurements of the population from separate experiments. We provide examples and practical tutorials for creating figures that communicate both the cell-level variability and the experimental reproducibility.

### Introduction

While far from perfect, the P value offers a pragmatic metric to infer whether an observed difference is reproducible and substantial relative to the noise in the measurements (Greenwald et al., 1996). The P value should be treated as a mere heuristic, interpreted as the degree of compatibility between the observed dataset and a given statistical model. A P value reports the probability that the observed data, or any more extreme values, would occur if the null hypothesis and the other assumptions of the statistical model were true. But a small P value does not actually tell us which assumption is incorrect: the null hypothesis, or some other assumption of the statistical model (e.g., normal distribution, random sampling, equal variance, etc.). In the case of treating each cell as an *n*, the assumption that is violated is independent sampling, not necessarily the null hypothesis. The resulting P values are worse than useless: counting each cell as a separate *n* can easily result in false-positive rates of >50% (Aarts et al., 2015). For excellent practical guides to statistics for cell biologists, readers are referred to Lamb et al. (2008) and Pollard et al. (2019). In this paper, we specifically address simple ways to communicate reproducibility when performing statistical tests and plotting data.
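
This failure mode can be demonstrated with a short simulation (our illustration, using made-up parameters, not data from any cited study): there is no true treatment effect, but cells cluster by experiment, and a *t* test that counts every cell as an independent sample rejects the null hypothesis far more often than the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_exps, n_cells = 1000, 3, 50

false_pos_cells = 0  # counting every cell as an independent n
false_pos_reps = 0   # counting each experiment as an n

for _ in range(n_sims):
    # No true treatment effect: both groups are generated identically.
    # Each experiment gets its own offset (run-to-run variability, SD = 1);
    # cells scatter around that offset (cell-to-cell variability, SD = 0.5).
    ctrl = [rng.normal(rng.normal(0, 1), 0.5, n_cells) for _ in range(n_exps)]
    trt = [rng.normal(rng.normal(0, 1), 0.5, n_cells) for _ in range(n_exps)]

    # Inflated n: pool all cells (n = 150 per group).
    p_cells = stats.ttest_ind(np.concatenate(ctrl), np.concatenate(trt)).pvalue
    # Biological replicates: compare the experiment means (n = 3 per group).
    p_reps = stats.ttest_ind([c.mean() for c in ctrl],
                             [t.mean() for t in trt]).pvalue
    false_pos_cells += p_cells < 0.05
    false_pos_reps += p_reps < 0.05

print(f"false-positive rate, cells as n:      {false_pos_cells / n_sims:.2f}")
print(f"false-positive rate, replicates as n: {false_pos_reps / n_sims:.2f}")
```

With these parameters, the cell-level test produces false positives most of the time, while the replicate-level test stays near the nominal 5%.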

Error bars and P values are often used to assure readers of a real and persistent difference between populations or treatments. P values are based on the difference between population means (or other summary metrics) as well as the number of measurements used to determine that difference. In general, increasing the number of measurements decreases the resulting P value. To convey experimental reproducibility, P values and standard error of the mean should be calculated using biological replicates—independent measurements of a population of interest, typically from independent samples or separate experiments (Hurlbert, 1984; Lazic, 2010; Vaux et al., 2012; Aarts et al., 2015; Naegle et al., 2015; Lazic et al., 2018). Limited time and resources often constrain cell biologists to repeating any particular experiment only a handful of times, so a typical sample size *n* is in the single digits. However, if authors assign *n* as the number of cells observed during the experiment, *n* may be on the order of hundreds or thousands, resulting in small P values and error bars that do not convey the experimental reproducibility or the cell-level variability.

For example, if a researcher measures the length of 20 neurons in a zebrafish and 20 neurons in a fish exposed to a toxin, the correct *n* for each condition is 1, because the toxin exposure was only performed once. Without repeating the treatment multiple times with multiple fish, there is no way to know whether any observed difference was caused by the toxin or by natural or otherwise uncontrolled differences between those two individual fish. The reader cares not that those two particular fish differ, but whether the treatment produces a consistent difference across multiple fish. The P value should be calculated to reflect the latter, not the former.

Well-designed studies embrace both cell-to-cell and sample-to-sample variation (Altman and Krzywinski, 2015). Repeatedly quantifying a biological parameter rarely converges on a single “true” value, due to the complexity of living cells or because many biological processes are intrinsically stochastic. Calculating standard error from thousands of cells conceals this expected variability. We have written this tutorial to help cell biologists plot data in a way that highlights both experimental robustness and cell-to-cell variability. Specifically, we propose the use of distribution–reproducibility “SuperPlots” that display the distribution of the entire dataset, and report statistics (such as means, error bars, and P values) that address the reproducibility of the findings.

### What population is being sampled?

To clarify what your sample size *n* should be, ask yourself: What population are you trying to sample? The choice of *n* determines the population being evaluated or compared (Naegle et al., 2015; Lazic et al., 2018; Pollard et al., 2019). A typical cell biology experiment strives to draw general conclusions about an entire population of cells, so the sample selection should reflect the breadth of that population. For example, to test if a treatment changes the speed of crawling cells, you could split a flask of lymphocytes into two wells, treat one well with a drug of interest and one with a placebo, and then track individual cells in each of the two wells. If you use each cell as a sample (*n* = number of cells), the two populations you end up comparing are the cells in those two particular wells. Multiple observations within one well increase the precision for estimating the mean for that one sample, but do not reveal a truth about all cells in all wells. By repeating the experiment multiple times from new flasks, and using each experiment as a sample (*n* = number of independent experiments), you evaluate the effect of the treatment on any arbitrary flask of similar cells. (For more examples, see Table S1.)

If you are interested only in cell-to-cell variability within a particular sample, then *n* could be the number of cells observed. However, making inferences beyond that sample is difficult, because the natural variability of individual cells can be overshadowed by systematic differences between biological replicates. Whether caused by passage number, confluency, or location in the incubator, cells often vary from sample to sample and day to day. For example, an entire flask of cells can be described as “unhappy.” Accordingly, cells from experimental and control samples (e.g., tubes, flasks, wells, coverslips, rats, tissue samples, etc.) may differ from each other, regardless of the experimental treatment. When authors report the sample size as the number of cells, the statistical analysis cannot help the reader evaluate whether differences are due to the intended treatment or sample-to-sample variability. We are not prescribing any specific definition of *n*; researchers should consider what main source of variability they hope to overcome when designing experiments and statistical analyses (Altman and Krzywinski, 2015).

### Statistics in cell biology typically assume independent tests of a hypothesis

Analysis becomes challenging when the experimental unit—the item that can be randomly assigned to a treatment—is different from the biological entity of interest. For example, we often care about how individual cells react to a treatment, but typically treat entire dishes of cells at a time. To test the hypothesis that two treatments or populations are different, the treatment must be applied or the populations sampled multiple times. Neighboring cells within one flask or well treated with a drug are not separate tests of the hypothesis, because the treatment was only applied once. But if individual cells are microinjected with a drug or otherwise randomly assigned to a different treatment, then each cell really can be a separate test of a hypothesis.

Finding truly independent groups and deciding what makes for a good biological replicate can be challenging (Vaux et al., 2012; Blainey et al., 2014; Naegle et al., 2015; Lazic et al., 2018). For example, is it acceptable to run multiple experiments from just one thawed aliquot of cells? Is it necessary to generate multiple knockout strains? Is it sufficient to test in one cell line? There’s no single right answer: each researcher must balance practicality with robust experimental design. At a minimum, researchers must perform an experiment multiple times if they want to know whether the results are robust.

### Calculating P values from cell-level observations

Cell biologists often observe hundreds of cells per experiment and repeat an experiment multiple times. To leverage that work into robust statistics, one needs to take into account the hierarchy of the data. Combining the cell-level data from multiple independent experiments squanders useful information about run-to-run variability (Fig. 1). There is ample literature on the analysis of this type of hierarchical data, including multilevel models that take into account both the variance within a sample and the clustering across multiple experimental runs (Galbraith et al., 2010; Aarts et al., 2015), and methods that propagate the error up the chain, such as a nested ANOVA (Krzywinski et al., 2014). Recently, statisticians have proposed a Bayesian approach to multilevel analysis (Lazic et al., 2020). For a detailed resource on hierarchical data analysis, see Gelman and Hill (2006).

A simple approach—which permits conventional *t* test or ANOVA calculations—is to pool the cell-level data from each experiment separately and compare the subsequent sample-level means (Altman and Bland, 1997; Galbraith et al., 2010; Lazic, 2010). For example, if you have three biological replicates of control and treated samples, and you measure the cell diameter of 200 cells in each sample, first calculate the mean of those 200 measurements for each sample, then run a *t* test on those sample means (three control, three treated). When using this simplified method, it is best to keep the number of observations per sample similar, because each sample gets the same weighting in the analysis.
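
With hypothetical numbers (far fewer cells per replicate than a real experiment would include), the two-step procedure looks like this:

```python
import numpy as np
from scipy import stats

# Hypothetical cell diameters (µm); each array is one biological replicate.
control = [np.array([9.8, 10.4, 10.1, 10.0]),   # replicate 1
           np.array([9.5, 9.9, 10.0, 9.8]),     # replicate 2
           np.array([10.2, 10.5, 10.3, 10.2])]  # replicate 3
treated = [np.array([11.8, 12.1, 12.0, 12.1]),
           np.array([11.5, 11.8, 11.7, 11.8]),
           np.array([12.3, 12.5, 12.4, 12.4])]

# Step 1: collapse each replicate to a single summary value, its mean.
control_means = [r.mean() for r in control]
treated_means = [r.mean() for r in treated]

# Step 2: run the t test on the replicate means (n = 3 per group).
result = stats.ttest_ind(control_means, treated_means)
print(control_means, treated_means, result.pvalue)
```

The *t* test never sees the individual cells, only the three summary values per group, so its *n* matches the number of biological replicates.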

While pooling dependent observations together avoids false positives (Galbraith et al., 2010; Aarts et al., 2015), this simple approach might fail to detect small but real differences between groups, where more advanced techniques may prove to be more powerful. However, increasing the number of biological replicates usually has a larger influence on the statistical power than measuring many more cells in each sample (Blainey et al., 2014; Aarts et al., 2015). While *n* of 3 is often considered a pragmatic minimum in cell biology (Naegle et al., 2015), distinguishing more subtle observed differences will require planning for more biological replicates and/or harnessing the power of more robust statistical analyses.

### Communicating variability with SuperPlots

After analyzing hundreds of cells across multiple rounds of experimentation, it would be useful to incorporate both the cell-level variability and the experimental repeatability into a single diagram. In Fig. 1 A, the plots have small error bars and P values, which should raise red flags: such minuscule P values imply an experiment that could be replicated with near-identical results and/or repeated hundreds of times, which is rarely realistic in cell biology. Bar graphs are problematic because they obscure the distribution of cell-level data as well as the sample-to-sample repeatability (Weissgerber et al., 2015). While beeswarm, box-and-whisker, and violin plots are great at conveying information about the range and distribution of the underlying data, plotting the entire dataset does not make it appropriate to treat repeated measurements on the same sample as independent experiments.

Therefore, we suggest authors incorporate information about distribution and reproducibility by creating “SuperPlots,” which superimpose summary statistics from repeated experiments on a graph of the entire cell-level dataset (Fig. 1, right columns). SuperPlots convey more information than a conventional bar graph or beeswarm plot, and they make it clear that statistical analyses (e.g., error bars and P values) are calculated across separate experiments, not individual cells—even when each cell is represented on the plot. For example, the mean from each experiment could be listed in the caption or plotted as a larger dot on top of the many smaller dots that denote individual measurements.
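
As one illustration, a beeswarm-style SuperPlot can be sketched in matplotlib with made-up data (this is our minimal sketch, not the tutorials provided in the supplemental figures): cell-level points are jittered and color-coded by replicate, and the replicate means are superimposed as larger dots.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Hypothetical dataset: cell speeds (µm/min), 2 conditions x 3 replicates.
conditions = ["Control", "Treated"]
offsets = {"Control": [8, 10, 9], "Treated": [11, 13, 12]}  # per-replicate means
colors = ["tab:blue", "tab:orange", "tab:green"]            # one color per replicate

fig, ax = plt.subplots()
replicate_means = {c: [] for c in conditions}
for x, cond in enumerate(conditions):
    for rep in range(3):
        cells = rng.normal(offsets[cond][rep], 1.0, 50)  # 50 cells per replicate
        jitter = rng.uniform(-0.15, 0.15, cells.size)
        ax.scatter(x + jitter, cells, s=8, alpha=0.4, c=colors[rep])
        replicate_means[cond].append(cells.mean())
    # Superimpose the replicate means as large, color-coded dots.
    ax.scatter([x] * 3, replicate_means[cond], s=120, c=colors,
               edgecolor="black", zorder=3)
ax.set_xticks([0, 1])
ax.set_xticklabels(conditions)
ax.set_ylabel("Cell speed (µm/min)")
fig.savefig("superplot.png")
```

Any statistics reported alongside such a plot would then be computed from the three `replicate_means` per condition, not the underlying cells.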

When possible, it is best to link samples by run, for instance, by color-coding the dots by experiment or by linking paired measurements with a line (Fig. S1 D). These linkages convey the repeatability of the work: readers learn more if they know that one experiment exhibited high readings across the board than if they have to guess the trend in each sample. Linking data can also eliminate the need to normalize data in order to directly compare different experimental runs. Often, multiple experiments might all exhibit the same trend, but different absolute values (Fig. 1 C). By encoding the biological replicate into the data, such trends can be revealed without normalizing to a control group: P values can then be calculated using statistical tests that take into account linkages among samples (e.g., a paired or ratio *t* test). In fact, not taking into account linkages can make the *t* test too conservative, yielding false negatives (Galbraith et al., 2010).
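
The benefit of a paired test can be shown with hypothetical run-level means (made-up numbers): three runs share the same upward trend but differ in baseline, so the unpaired *t* test misses the effect while the paired test detects it.

```python
from scipy import stats

# Hypothetical per-experiment mean speeds: each run shows a similar
# increase, but the absolute values differ run to run.
control = [10.0, 20.0, 15.0]   # runs 1-3
treated = [12.0, 23.0, 18.0]   # runs 1-3, matched by row

p_unpaired = stats.ttest_ind(control, treated).pvalue  # ignores linkage
p_paired = stats.ttest_rel(control, treated).pvalue    # links runs together
print(f"unpaired p = {p_unpaired:.2f}, paired p = {p_paired:.3f}")
```

Because the run-to-run variability dwarfs the treatment effect, only the test that accounts for the pairing recovers the consistent within-run difference.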

An impressive amount of information can be depicted by color-coded beeswarm SuperPlots (see Fig. 1, rightmost plots), where each cell-level datapoint divulges which experiment it came from (Galbraith et al., 2010; Weissgerber et al., 2017). This helps convey to the reader whether each experimental round gave similar results or whether one run biases the conclusion (Fig. 1 D). The summary statistics and P values in beeswarm SuperPlots are overlaid on the color-coded scatter. (See Fig. S2, Fig. S3, Fig. S4, and Fig. S5 for tutorials on how to make beeswarm SuperPlots in Prism, Excel, R, and Python using Data S1.)

Whatever way authors choose to display their data, it is critical to list the number of independent experiments in the figure or caption, as well as how the means and statistical tests were calculated.

### Error bars that communicate reproducibility

The choice of error bars on a SuperPlot depends on what you hope to communicate: descriptive error bars characterize the distribution of measurements (e.g., standard deviation), while inferential error bars evaluate how likely it is that the same result would occur if the experiment were to be repeated (e.g., standard error of the mean or confidence intervals; Cumming et al., 2007). To convey how repeatable an experiment is, it is appropriate to choose inferential error bars calculated using the number of independent experiments as the sample size. However, calculating standard error of the mean by inputting data from all cells individually fails in two ways: first, the natural variability we expect from biology would be better summarized with a descriptive measure, like standard deviation; and second, the inflated *n* produces error bars that are artificially small (due to √*n* in the denominator) and do not communicate the repeatability of the experiment.

The problems with calculating error bars using cell count as the sample size are illustrated by comparing the error bars in Fig. 1 A to those in Fig. 1, B–D: when each cell measurement is treated as an independent sample, the standard error of the mean is always tiny, whether or not there is variability among experimental replicates. In contrast, the error bars calculated using biological replicates grow when the results vary day to day. In cases where displaying every data point is not practical, authors should consider some way of representing the cell-to-cell variability as well as the run-to-run repeatability. This could be error bars that represent the standard deviation of the entire dataset, but with P values calculated from biological replicates.
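The √*n* effect can be made concrete with a toy calculation (simulated numbers, chosen so the replicates disagree): the SEM computed from all 600 cells stays tiny, while the SEM computed from the three replicate means reflects the day-to-day variability.

```python
import numpy as np

rng = np.random.default_rng(2)

# Three hypothetical replicates whose means disagree (day-to-day
# variability), 200 cells each.
replicates = [rng.normal(mu, 1.0, 200) for mu in (9.0, 12.0, 15.0)]

# Inflated n: SEM over all 600 cells.
pooled = np.concatenate(replicates)
sem_cells = pooled.std(ddof=1) / np.sqrt(pooled.size)

# Biological replicates: SEM over the three replicate means (n = 3).
means = [r.mean() for r in replicates]
sem_reps = np.std(means, ddof=1) / np.sqrt(len(means))

print(f"SEM (cells as n):      {sem_cells:.3f}")
print(f"SEM (replicates as n): {sem_reps:.3f}")
```

Despite the replicate means spanning several µm, dividing by √600 shrinks the cell-level SEM to roughly a tenth of a unit, hiding the disagreement between runs.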

### Conclusions

When calculating your P value, take a moment to consider these questions: What variability does your P value represent? How many independent experiments have you performed, and does this match with your *n*? (See Table S1 for practical examples of this analysis.) We encourage authors and editors to focus less on reporting satisfying yet superficial statistical tests such as P values, and more on presenting the data in a manner that conveys both the variability and the reproducibility of the work.

### Online supplemental material

Fig. S1 shows other plotting examples. Fig. S2 is a tutorial for making SuperPlots in Prism. Fig. S3 is a tutorial for making SuperPlots in Excel. Fig. S4 is a tutorial for making SuperPlots in R. Fig. S5 is a tutorial for making SuperPlots in Python. Table S1 shows how the choice of *n* influences conclusions. Data S1 is the raw data used to generate Figs. S4 and S5.

## Acknowledgments

We acknowledge that our past selves are not innocent of the mistakes described in this manuscript. We are grateful to several colleagues who provided feedback on our preprint, including Kenneth Campellone, Adam Zweifach, William Velle, Geoff O’Donoghue, and Nico Stuurman. We also thank Natalie Petek for providing some hints on using Prism and Jiongyi Tan for reviewing the Python code.

This work was supported by grants to L.K. Fritz-Laylin from the National Institutes of Health (from the National Institute of Allergy and Infectious Diseases grant 1R21AI139363), from the National Science Foundation (grant IOS-1827257), and from the Pew Scholars Program in the Biomedical Sciences; and by grants to R.D. Mullins from the National Institutes of Health (R35-GM118119) and Howard Hughes Medical Institute.

The authors declare no competing financial interests.