Understanding biological function requires the identification and characterization of complex patterns of molecules. Single-molecule localization microscopy (SMLM) can quantitatively measure molecular components and interactions at resolutions far beyond the diffraction limit, but this information is only useful if these patterns can be quantified and interpreted. We provide a new approach for the analysis of SMLM data that develops the concept of structures and super-structures formed by interconnected elements, such as smaller protein clusters. Using a formal framework and a parameter-free algorithm, (super-)structures formed from smaller components are found to be abundant in classes of nuclear proteins, such as heterogeneous nuclear ribonucleoprotein particles (hnRNPs), but are absent from ceramides located in the plasma membrane. We suggest that mesoscopic structures formed by interconnected protein clusters are common within the nucleus and have an important role in the organization and function of the genome. Our algorithm, SuperStructure, can be used to analyze and explore complex SMLM data and extract functionally relevant information.
Introduction
Single-molecule localization microscopy (SMLM; van de Linde et al., 2011; Schermelleh et al., 2010; Henriques et al., 2011; Sauer and Heilemann, 2017) is now commonly employed for quantitative analysis of molecular structures and interactions in both cell-based (Cisse et al., 2013; Kapanidis et al., 2018; Chong et al., 2018) and in vitro experiments (Revyakin et al., 2006; Deniz et al., 2008). Unlike other light microscopy techniques, SMLM achieves resolutions far beyond the diffraction limit, and its typical output is a list of 3D coordinates (or localization events) that are naturally analyzed using efficient clustering algorithms borrowed from quantitative big-data analysis and even astronomy (Owen et al., 2010; Sengupta et al., 2011; Garcia-Parajo et al., 2014; Baumgart et al., 2016; Spahn et al., 2016; Griffié et al., 2016). However, traditional clustering algorithms rely on user-defined parameters that are intrinsically intertwined with the notion of similarity that is necessary to define a cluster. These parameters can be either hypothesized by physical intuition or inferred via preemptive analysis (Burgert et al., 2017; Williamson et al., 2020; Malkusch and Heilemann, 2016), yet their choice has a significant impact on the results, in turn hindering the portability of clustering algorithms and the comparison between different datasets.
At the same time, recent evidence suggests that assemblies of proteins (Brangwynne et al., 2015; Larson et al., 2017; Strom et al., 2017; Sabari et al., 2018; Cho et al., 2018; Maharana et al., 2018; Chong et al., 2018) and chromatin (Bintu et al., 2018; Boettiger et al., 2016; Frank and Rippe, 2020) form functional complex structures that are not fully captured by standard clustering algorithms. For example, the heterogeneous nuclear ribonucleoprotein U (hnRNP-U), also called scaffold attachment factor A (SAF-A), is suggested to form a dynamic and functional mesh-like structure while interacting with RNA to maintain transcriptionally active genomic loci in a decompacted configuration (Nozawa et al., 2017; Michieletto and Gilbert, 2019). Other examples include SC35, a nuclear protein involved in RNA splicing and chromatin elongation (Lin et al., 2008) that displays localized nuclear speckles (Xie et al., 2006; Jackson et al., 2000), or actin and microtubules, which form elongated and interconnected networks involved in cell motility and division, as well as in the synaptic plasticity of dendritic spines (Resch et al., 2002; Rogers et al., 2003; Izeddin et al., 2011). Additionally, recent super-resolution studies indicate that chromatin is also functionally organized in connected nano-scale compartments (Prakash et al., 2015; Szabo et al., 2018; Nir et al., 2018; Maiser et al., 2020). Rapidly evolving methods of chromatin tracing (Boettiger et al., 2016; Wang et al., 2016; Beliveau et al., 2015; Nir et al., 2018; Bintu et al., 2018) and super-resolved imaging of the accessible genome (Xie et al., 2020) require sophisticated algorithms to analyze the topology of the generated paths (Goundaroulis et al., 2020).
To understand the relationship between these complex structures and the underlying biological mechanisms and functions of the genome (Bronshtein et al., 2015; Khanna et al., 2019; Leidescher et al., 2020, Preprint; Smeets et al., 2014), a more sophisticated and standardized analysis of SMLM data is urgently required.
Quantification of complex structures is a ubiquitous problem in molecular and cell biology, and it is intimately connected to cellular function. Motivated by this problem, we introduce a new algorithm termed SuperStructure, which extends the popular density-based clustering algorithm DBSCAN. SuperStructure allows (1) a parameter-free detection and quantification of complex structures made of connected clusters in SMLM data and (2) a parameter-free quantification of the density of molecules within clusters.
Here, we demonstrate the capabilities of SuperStructure on simulated datasets and then use it to analyze two groups of experimental datasets: (1) nuclear proteins involved in RNA processing, namely SAF-A, hnRNP-C, and SC35; and (2) ceramide lipids involved in cellular trafficking at the membrane. We find that interconnections between clusters are abundant in classes of proteins in the hnRNP family and that they are surprisingly absent from ceramides, suggesting this feature is relevant for the biological function of SAF-A and hnRNP-C. Therefore, SuperStructure enables us to discover new facets of protein organization in human cells and provides a better understanding of the molecular mechanisms underlying the organization of subcellular (super-)structures.
Finally, since SuperStructure is parameter-free, it provides the community with a standardized tool for the discovery and quantification of complex patterns in SMLM data. Furthermore, beyond helping our understanding of complex biological structures, it might be used to assess fluorophore blinking quality, which makes it versatile enough to evaluate technical imaging properties as well (van de Linde and Sauer, 2014; Hennig et al., 2015; Siegberg and Herten, 2011).
Results
SuperStructure algorithm
SuperStructure is best explained in relation to the well-known DBSCAN algorithm. DBSCAN detects clusters by grouping together high-density localizations and classifies low-density ones as outliers (Ester et al., 1996). In practice, DBSCAN determines that a localization is part of a cluster if more than N_min other localizations are found within a neighborhood distance ε (or if it is part of the neighborhood of another localization with this property). In contrast, SuperStructure extracts connectivity information from the rate at which the number of detected clusters N_c changes with the neighborhood radius ε for a fixed N_min (see Fig. 1). Indeed, the curves N_c(ε) contain important overlooked information about the structure of connections. To simplify the analysis, and without loss of generality, we set N_min = 0, which means that we do not require a minimum number of localizations within the neighborhood to define a cluster. As a consequence, N_c(ε) is necessarily a monotonically decreasing function, as for ε = 0 every localization is detected as a single cluster and increasing ε yields fewer but larger clusters. It follows that the rate at which N_c decays with ε is an indicator of how quickly localizations, and then clusters of localizations, coalesce, thus indicating how much localizations and clusters are connected.
The curves provided by SuperStructure identify different clustering regimes (Fig. 1). The first (small ε) regime describes the merging of localizations within clusters (intra-cluster regime), the second (intermediate ε) regime captures the growth of clusters into super-structures (first super-cluster regime), and the third (large ε) regime describes the merging of super-clusters into higher-order super-structures (second/third super-cluster regimes). The curve in the first regime typically follows a Poissonian function (Eq. 1), and its decay rate is related to the density of emitters within the clusters (see Materials and methods and Figs. 1 and S1). The width of the Poisson function also sets the critical value of ε at which this first regime is expected to end (Eq. 2). On the other hand, the curves in the second and third regimes follow an exponential decay with a characteristic length-scale and are highly dependent on the connectivity between (super-)clusters, as well as on the density of (super-)clusters (Eq. 4).
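The construction of the cluster-count curve can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that, when no minimum number of neighbors is required, DBSCAN clusters reduce to groups of localizations linked by chains of neighbors closer than the neighborhood radius (called `eps` here), and it uses a brute-force pairwise search that is only practical for small examples. The function names and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations


def count_clusters(points, eps):
    """Count groups of points linked by chains of neighbors within eps.

    Assumption: with no minimum-neighbor requirement, every localization
    is a valid cluster seed, so clusters are connected components.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= eps:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_i] = root_j
    return len({find(i) for i in range(len(points))})


def cluster_curve(points, eps_values):
    """Number of clusters for each neighborhood radius in eps_values."""
    return [count_clusters(points, eps) for eps in eps_values]


# Toy data (hypothetical): a triplet, a pair, and one isolated localization.
points = [(0, 0), (1, 0), (0, 1), (10, 0), (11, 0), (30, 30)]
eps_values = [0.0, 1.0, 5.0, 10.0, 40.0]
curve = cluster_curve(points, eps_values)
print(curve)  # [6, 3, 3, 2, 1]
```

At zero radius every localization is its own cluster, and the count can only stay constant or drop as the radius grows, which is the behavior the curve analysis relies on.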
The number of super-cluster regimes depends on the homogeneity of both cluster distribution and connections. In the two extreme cases of a completely connected or unconnected homogeneous distribution of clusters, we expect a single super-cluster regime. However, while in the former case this regime is exponential (because the clusters are connected), in the latter it assumes a Poissonian functional form (see respectively Eqs. 4 and 3). This is not surprising, as free (unconnected) clusters that are randomly distributed behave (on a larger scale) as single emitters inside clusters (see Materials and methods and Fig. S1). Also, in the case of clusters embedded in a random distribution of other localizations (such as noise), we obtain a Poissonian decay. Importantly, a random distribution of localizations (also at high density) is different from “connected” clusters, where nearby localizations are mostly distributed in between clusters. As a result, the curves generated by SuperStructure allow us to identify the presence/absence of connectivity by investigating the functional form of the curves, as well as to extract their decay rates.
In heterogeneous systems that display a mix of randomly dispersed localizations/clusters and connected ones over similar length-scales, we strongly recommend restricting the analysis with regions of interest (ROIs) over subregions that display qualitatively similar phenotypes. A good example of a heterogeneous system is given by the nuclear protein SC35, which we analyze below. Restricting the analysis to ROIs is also recommended when quantifying nuclear or cellular substructures that display boundaries. Masking localizations falling outside these boundaries allows SuperStructure to generate cleaner curves that are easier to interpret.
To quantify the intra-cluster density and (super-)cluster connectivities, one needs to define boundaries between regimes and to fit every regime with the corresponding function (see Eqs. 1, 3, and 4). Regime boundaries and fitting ranges can be selected either manually (where curves change their decay properties) or by rigorously running a preemptive goodness-of-fit test. For instance, once the rough regime range has been identified and fitted, one can modify the fit window to identify the boundaries of the regime outside which the fit is no longer acceptable. Arguably, the optimum regime is found by identifying the best goodness-of-fit window (e.g., the range with the minimum χ2). It is also possible to define a single function fitting the entire curve by (1) defining a piecewise function where every “piece” is the fit of the corresponding regime or (2) adding together the contribution of the different regimes (appropriately weighted).
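The window-selection idea can be sketched as follows. This is an illustration under stated assumptions, not the published procedure: the exponential regime is taken to be a straight line in log(cluster count) versus radius, the goodness of fit is measured by the mean squared residual of that line rather than a χ2 with experimental errors, and the widest window whose residual stays below a user-chosen tolerance `tol` is kept. All names and the synthetic curve are hypothetical.

```python
import math


def fit_exponential(eps, n_clusters):
    """Least-squares fit of log(N_c) = a - eps / length.

    Returns (length, mean squared residual). Assumes positive counts that
    decay over the window, so the fitted slope is nonzero.
    """
    logs = [math.log(v) for v in n_clusters]
    n = len(eps)
    mean_x, mean_y = sum(eps) / n, sum(logs) / n
    sxx = sum((e - mean_x) ** 2 for e in eps)
    slope = sum((e - mean_x) * (l - mean_y) for e, l in zip(eps, logs)) / sxx
    intercept = mean_y - slope * mean_x
    msr = sum((l - intercept - slope * e) ** 2 for e, l in zip(eps, logs)) / n
    return -1.0 / slope, msr


def widest_acceptable_window(eps, n_clusters, tol, min_points=4):
    """Widest index window [start, stop) whose exponential fit has msr <= tol."""
    best = None
    for start in range(len(eps) - min_points + 1):
        for stop in range(start + min_points, len(eps) + 1):
            length, msr = fit_exponential(eps[start:stop], n_clusters[start:stop])
            if msr <= tol and (best is None or stop - start > best[1] - best[0]):
                best = (start, stop, length)
    return best


# Synthetic curve (hypothetical): Gaussian-like decay below 60, exponential above.
eps = [10.0 * k for k in range(20)]
curve = [
    1000.0 * math.exp(-((e / 30.0) ** 2)) if e < 60.0
    else 1000.0 * math.exp(-4.0) * math.exp(-(e - 60.0) / 50.0)
    for e in eps
]
start, stop, decay_length = widest_acceptable_window(eps, curve, tol=1e-6)
print(eps[start], eps[stop - 1], round(decay_length, 3))
```

Choosing the widest acceptable window, rather than the window with the smallest residual, avoids the bias toward very short windows that always fit well.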
The workflow for the application of SuperStructure is shown in Fig. 1 and is described in detail in Materials and methods. Additionally, the code and scripts are open source and available in a git repository (see below).
Characterizing SuperStructure feature extraction from simulated SMLM data
To evaluate the performance of SuperStructure, we analyzed artificial datasets consisting of interconnected clusters of localizations on a 2D plane (see Fig. 2 A). Clusters are homogeneously and randomly positioned on the plane with a cluster density that is comparable to that of some nuclear proteins (see below). Every cluster has average radius and an overall internal localization density where is the number of localizations per cluster. Pairs of clusters are connected with probability by a sparse point distribution and only if the distance between the clusters is less than These choices allow us to readily tune the degree of “connectivity” in the system by varying a single parameter A second parameter, is introduced to control the density of localizations within the connections (see Materials and methods for details).
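A generator for datasets of this kind can be sketched as below. The details of the authors' simulations are given in their Materials and methods and are not reproduced here, so every modeling choice in this sketch is an assumption: clusters are Gaussian blobs, connections are points placed uniformly along the segment joining two cluster centers, and all parameter names (`p_connect`, `max_dist`, `n_per_connection`, and so on) are hypothetical.

```python
import math
import random


def simulate_connected_clusters(n_clusters, n_per_cluster, radius, box,
                                p_connect, max_dist, n_per_connection, seed=0):
    """Return (centers, cluster_points, connection_points) on a 2D square.

    Assumptions: Gaussian clusters of width `radius`; a pair of clusters
    closer than `max_dist` is connected with probability `p_connect` by
    `n_per_connection` points scattered along the joining segment.
    """
    rng = random.Random(seed)
    centers = [(rng.uniform(0, box), rng.uniform(0, box)) for _ in range(n_clusters)]

    cluster_points = [
        (rng.gauss(cx, radius), rng.gauss(cy, radius))
        for cx, cy in centers
        for _ in range(n_per_cluster)
    ]

    connection_points = []
    for i in range(n_clusters):
        for j in range(i + 1, n_clusters):
            close = math.dist(centers[i], centers[j]) < max_dist
            if close and rng.random() < p_connect:
                (x1, y1), (x2, y2) = centers[i], centers[j]
                for _ in range(n_per_connection):
                    t = rng.random()  # position along the segment
                    connection_points.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
    return centers, cluster_points, connection_points


# Unconnected and fully connected examples with otherwise identical settings.
_, pts_free, links_free = simulate_connected_clusters(
    n_clusters=5, n_per_cluster=20, radius=10.0, box=500.0,
    p_connect=0.0, max_dist=1000.0, n_per_connection=8)
_, pts_full, links_full = simulate_connected_clusters(
    n_clusters=5, n_per_cluster=20, radius=10.0, box=500.0,
    p_connect=1.0, max_dist=1000.0, n_per_connection=8)
print(len(pts_free), len(links_free), len(links_full))  # 100 0 80
```

Keeping the connection probability and the number of points per connection as separate arguments mirrors the distinction between how many connections exist and how dense each one is.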
The length-scales associated to density of emitters inside clusters and inside connections define the boundaries among the three regimes of (Fig. 2 B): (1) for the intra-cluster regime follows a Poissonian decay (Eq. 1) with density parameter (as expected, since it was set by construction); (2) for intermediate values of the exponential super-cluster regime dominates (Eq. 4), and the fusion of connected clusters takes place (see inset of Fig. 2 B); (3) for we expect to observe the coalescence of super- and nonconnected clusters in a second super-cluster regime; this is captured by a second exponential for (Eq. 4). Conversely, for we observe a single super-cluster regime that is well fitted by a Poissonian function with lower density (Eq. 3), as it corresponds to the density of clusters rather than emitters within clusters (see dark green curve in Fig. 2 B).
Examination of Fig. 2 B (inset) highlights the exponential behavior of the super-cluster regime (2) for different values of connectivity Importantly, a larger results in an effectively shorter decay length (or larger spatial rate of merging) for the regime in which clusters merge into super-clusters. This strongly suggests that the effective decay length (or rate) mirrors the connectedness of the underlying super-structures (Fig. 2 C). In fact, these simulations reveal that the decay length represents the combined contribution of cluster density and connectivity A larger density of clusters can impact the decay length as much as a larger connectivity, as shown by simulations at fixed and different (Fig. 2 D; and Fig. S2, A and B). In particular, we find that the functional form of the decay length is (Fig. 2, D and E). The cluster density contribution is as it depends on the typical distance between clusters and is relevant when comparing datasets with different cluster density. By combining SuperStructure with a cluster analysis, one can estimate and normalize to obtain the pure connectivity contribution in the decay length:
Finally, in order to characterize the contribution to the curves coming from the density of localizations within the connections, we further simulated SMLM datasets with a fixed, large connectivity and varied the density of points in the connections by tuning (see simulated datasets in Figs. 2 A and S2 F). As expected, we observe a single super-cluster regime, and the denser the connections, the shorter the decay length. This indicates that our algorithm is able to describe not only how well clusters are connected (i.e., the number of connections per cluster) but also how strongly they are connected (i.e., how dense the connections are). These features are likely to be highly relevant for nuclear proteins.
Before applying this methodology to experimental data, we also tested the effect of random noise in the system (i.e., unconnected isolated localizations from biological or technical sources). We observed that in the presence of random noise the decay of SuperStructure curves becomes Poissonian for large ε (see Fig. S2 C) with an effective density larger than the cluster density (see Fig. S2 D). Decay lengths in the first super-cluster regime (yellow regime) are still distinguishable even in the presence of noise at reasonable density (albeit smaller than the connection density), but their absolute values are altered, with weakly connected systems more severely affected (see Fig. S2 E). These observations suggest that, as in most analysis algorithms, large noise might obscure exponential decays of connected systems. If a single Poissonian behavior or a combination of exponential and Poissonian decay is found in the SMLM dataset, it is therefore important to combine SuperStructure with an independent cluster analysis at different length scales (e.g., at three or four selected values of ε) and a direct observation of the dataset in order to exclude the presence of hidden connectivity.
Quantification of super-structures in nuclear proteins
We now examine biological data and apply SuperStructure to dSTORM data acquired for three different nuclear proteins (Fig. 3, A and B): the serine/arginine-rich splicing factor SC35, hnRNP-C, and hnRNP-U (also known as SAF-A). These proteins are abundantly expressed in the nucleus of human cells and are involved with RNA processing at different stages. SC35 is necessary for RNA splicing, while hnRNPs are implicated not only in the regulation and maturation of mRNA but also in chromatin structure (Nozawa et al., 2017; Xiao et al., 2012; Caudron-Herger et al., 2011). In particular, SAF-A is thought to form a dynamic homogeneous mesh that regulates large-scale chromatin organization by keeping gene-rich loci in a decompacted state (Nozawa et al., 2017; Michieletto and Gilbert, 2019). Hence, capturing the organization of this protein beyond the traditional single-cluster analysis is an important step toward understanding how it regulates chromatin structure in different cell stages and conditions.
Curves obtained from SuperStructure analysis after masking signal in the nuclear region are shown in Fig. 3 C, where we highlighted the super-cluster regimes discussed above. Global nuclear analysis is represented by filled curves, while analysis on localized ROIs is represented by dashed ones (hnRNP-C nuclear mesh and SC35 speckles). Both hnRNPs display a first super-cluster regime for which the curves decay as exponentials, suggesting that within this range, distinct clusters are in reality connected. Interestingly, while SAF-A displays a unique long super-cluster regime, hnRNP-C seems to also show a second exponential regime (filled curve). However, this regime appears at very large values of ε and is due to sparse clusters of localizations in nucleoli. Running SuperStructure on ROIs with nucleoli masked out (dashed line) indeed generates a single exponential function, confirming that hnRNP-C clusters are fully connected. We can therefore conclude that both hnRNPs exhibit a single exponential regime, typical of fully connected meshes. On the other hand, SC35 displays exponentials with different characteristic decay rates in two distinct and significant super-cluster regimes (filled curve), one for intermediate ε, when clusters inside speckles merge (first super-cluster regime), and another one for large ε, indicating that speckles merge together and with isolated clusters (second super-cluster regime). The SC35 connectivity is further confirmed by running SuperStructure on ROIs masking the speckles, as we observed a clear single exponential decay (dashed line). These regimes are further confirmed by directly looking at the arrangement of identified clusters for certain values of ε (see Fig. 3, A [inset] and B).
From the SuperStructure curves, we first obtained the density of intra-cluster emitters by fitting the intra-cluster regime with the Poisson function (Eq. 1). Interestingly, both SAF-A and SC35 form clusters with similar densities, while hnRNP-C clusters are less dense (see Fig. 3, D and E). Then, in order to have a quantitative description of the clusters/speckles connectivities, we fitted the curves in the exponential regimes (Eq. 4) to extract the decay length. However, a direct comparison is possible only by normalizing decay lengths by the cluster/speckle density (see Materials and methods for details and Fig. S3, A and B). Fig. 3 F highlights that while hnRNP-C has a short normalized decay length due to the highly connected clusters, SAF-A displays a weaker decay (larger normalized decay length) due to sparser connections. Finally, SC35 displays a first (intra-speckle) very connected regime, even more than that of hnRNPs (smaller normalized decay length). This is followed by a second (inter-speckle) regime that shows a cluster connectivity weaker than that of hnRNPs.
In summary, our analysis revealed that while different nuclear proteins may have similar cluster sizes or densities of emitters within clusters (e.g., SAF-A and SC35), they have distinct super-cluster arrangements and connectivities. For instance, we find that the super-structures inside nuclear speckles are more connected than those formed by hnRNPs and also very dense (see Fig. 3, E and F; and Table S1). We stress that these features, which we verified do not emerge from technical artifacts (see Fig. S3 C), cannot be quantified using standard clustering algorithms or pair-correlation functions. Additionally, the analysis in Fig. 3, E and F shows that our method is sensitive enough to distinguish connectivity features of two closely related wild-type hnRNPs in cell-based experiments.
The results presented in Fig. 3 give us confidence not only that SuperStructure can be applied to a variety of nuclear wild-type or mutated proteins in different cells, cell stages, and conditions, but also that it has the capability to extract unique features that may yield new mechanistic insights into the functioning of such proteins. For instance, the analysis of SC35 reveals that speckles are themselves made of clusters that are as heavily interconnected as the clusters formed by hnRNP proteins. Given the fact that all these proteins interact with RNA, our findings suggest that RNA binding may facilitate the formation of connections between clusters of proteins; in turn, this also points to a suspected structural role of noncoding RNAs in structuring the organization of the nuclear interior (Hall and Lawrence, 2016). Studying the effect of RNA depletion on the super-cluster connectivity is therefore a natural next step to perform in the future.
In general, while certain mutations or conditions may not alter the size of the protein clusters themselves, they may affect the connectivity between clusters. In these cases, the analysis provided by SuperStructure would be invaluable and indeed essential to reveal the underlying mechanisms that guide the formation of such protein assemblies.
Ceramide clusters at the plasma membrane are not connected
To test our algorithm on a different class of molecules, we applied SuperStructure to published dSTORM datasets (Burgert et al., 2017) taken on ceramides, membrane lipids involved in cellular trafficking (Fig. 4 A). The authors (Burgert et al., 2017) found that Bacillus cereus sphingomyelinase (bSMase) treatment increases the size of ceramide clusters and the overall localization density. By applying SuperStructure analysis (Fig. 4 B), we confirmed these results and further detected that the difference in localization density persists inside clusters (see Fig. 4, C and D; and Fig. S4, C and D). Furthermore, we detected the absence of connectivity between clusters, as the large ε regime is well captured by a Poisson function (Eq. 3) and not by an exponential (see Fig. 4, B and E). In other words, clusters of ceramides behave as unconnected, uniformly and randomly distributed emitters. The possibility of local connectivities at intermediate ε has also been ruled out, as no merging of clusters was directly observed (see Fig. S4, A and B). The crossing of the curves is a consequence of the difference in overall localization density (which in turn causes a horizontal shift between the curves; see Fig. 4, B [inset] and C), rather than a difference in local connectivities. The notable absence of connections between clusters of ceramides further supports that the ones detected in hnRNP-U/C and SC35 are significant.
Limitations and potential interpretation pitfalls
While we have provided evidence that SuperStructure can detect connected clusters and distinguish them from noise (at low density) or unconnected but dense clusters, in this section, we discuss potential pitfalls and interpretation issues.
First, as mentioned earlier, datasets should always be segmented in order to identify the main ROI. Spurious localizations outside the ROI (e.g., outside of the nucleus, if we are interested in nuclear proteins) may affect the curves generated by SuperStructure and render their interpretation difficult. An analogous issue may arise if the localizations are embedded within heterogeneous structures, as in the case of SC35 proteins that form structures strongly connected within nuclear speckles and weakly connected outside speckles (see Fig. 3). Due to this mixed behavior over similar length-scales, it is recommended to restrict the analysis to regions that display similar structural phenotypes. Even better, and to be preferred when possible, is to label the region or structure of interest with orthogonal markers.
The key difference between connected and unconnected (albeit possibly more clustered) structures is the functional form of the SuperStructure curves. However, in some cases, Poisson curves may be difficult to distinguish from exponentials (especially over short intervals). In this case, the best way to identify connected clusters (and distinguish them from noisier or more clustered subregions) is to restrict the analysis over smaller ROIs to avoid potential contaminations and to perform goodness-of-fit tests on the curves. Additionally, in these complex cases we also suggest performing an independent cluster analysis over different length-scales and directly inspecting the results.
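One simple way to compare the two functional forms over a chosen interval is sketched below. It rests on assumptions that are not taken from the equations of this paper: the Poissonian regime is modeled as proportional to exp(-π ρ ε²), so that the logarithm of the cluster count is linear in ε², while the exponential regime is linear in ε. Each candidate is fitted by least squares and the mean squared residuals are compared. The function names and synthetic curves are hypothetical, and a real analysis would use a proper goodness-of-fit test with measurement uncertainties.

```python
import math


def mean_squared_residual(x, y):
    """Mean squared residual of the least-squares line through (x, y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    return sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y)) / n


def preferred_form(eps, n_clusters):
    """Compare an exponential model with an assumed Poissonian model.

    Exponential: log(N_c) linear in eps.
    Poissonian (assumed form): log(N_c) linear in eps squared.
    """
    logs = [math.log(v) for v in n_clusters]
    res_exp = mean_squared_residual(eps, logs)
    res_poisson = mean_squared_residual([e * e for e in eps], logs)
    label = "exponential" if res_exp < res_poisson else "poissonian"
    return label, res_exp, res_poisson


# Synthetic curves (hypothetical) over the same interval of radii.
eps = [20.0 + 5.0 * k for k in range(12)]
connected = [500.0 * math.exp(-e / 40.0) for e in eps]
unconnected = [500.0 * math.exp(-math.pi * 1e-4 * e * e) for e in eps]

print(preferred_form(eps, connected)[0])    # exponential
print(preferred_form(eps, unconnected)[0])  # poissonian
```

Over short intervals the two residuals can be close, which is the situation where restricting the ROI and inspecting the clusters directly becomes necessary.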
As with all computational algorithms, the danger of incorrect interpretation can be addressed with quality control. In the case of SuperStructure, this means directly monitoring the formation of connected clusters/structures while increasing ε. Nonetheless, thanks to its parameter-free execution, SuperStructure may currently offer one of the safest ways to analyze SMLM data.
Discussion
In this work, we have introduced a novel algorithm that extends the traditional idea of cluster analysis of SMLM data and that can quantify both the connections between clusters and the density of emitters within clusters. SuperStructure introduces for the first time the concept of connectivity between clusters, which is different from a random distribution of points at high density. In this concept, connection points are preferentially found in between clusters and this feature manifests itself in SuperStructure curves behaving as single exponentials rather than Poissonian. Because SuperStructure is parameter-free, it does not require any prior knowledge of the sample and it thus takes a crucial step toward a more standardized, portable, and democratic quantification of complex patterns and super-structures in SMLM data.
Here, we have tested the capabilities of SuperStructure first on simulated datasets, where we observed that it could capture not only the degree of connectivity between clusters but also the strength of the connections, and then on biological dSTORM data from nuclear proteins and membrane lipids. SuperStructure allowed us to discover that the speckles formed by the splicing factor SC35 are made of connected clusters. Further, we found that the density of emitters in those clusters is high and that the connectivity between clusters is even higher than that of hnRNP proteins. We argue that this may reflect the RNA-binding feature that characterizes both hnRNPs and SC35 and that may be driving the formation of interconnected nuclear super-structures. We highlight that this discovery could not be made simply by looking at clustering with traditional algorithms, as both proteins display clusters of similar size at small/intermediate ε.
We further stress that SuperStructure is perfectly suited to compare different datasets without a priori assumptions (albeit, as discussed before, segmentation to ROIs is recommended for strongly heterogeneous structures). The datasets of nuclear proteins we chose to analyze are an example of this. SAF-A, hnRNP-C, and SC35 are three nuclear proteins involved in the metabolism of RNA at different stages, and they display three different connectivity phenotypes, which point to three different nuclear functions. In particular, SAF-A, which also plays a major role in maintaining the chromatin active loci in a decompacted state, is detected as a fully connected mesh. This finding is in agreement with a previous study that hypothesized the formation of a dynamic and RNA-interacting nuclear mesh made by SAF-A (Nozawa et al., 2017). We thus argue that SuperStructure is a useful tool for studying the structural and functional properties of this nuclear mesh. For instance, we expect that in the absence of RNA, the SAF-A mesh would be disrupted and its connectivity strongly weakened (not necessarily affecting the protein clusters, which may be formed via an RNA-independent mechanism, such as phase separation by weak unspecific interactions of SAF-A’s intrinsically disordered domain). In turn, the application of SuperStructure would in this case be indispensable for understanding the link between the spatial arrangement, mechanics, and function of this nuclear protein. A similar example is given by the V(D)J locus, whereby interacting segments appear to be trapped by a protein or chromatin network whose (super-)structure is still poorly understood (Khanna et al., 2019). We argue that SuperStructure can shed light also on this problem.
In addition to all this, super-resolved chromatin tracing (Boettiger et al., 2016; Bintu et al., 2018) and super-resolved imaging of the accessible genome (Xie et al., 2020) generate complex datasets that will benefit from “beyond-traditional-clustering” algorithms. Connections between nanodomains and chromatin paths do not resemble the structure of isolated clusters but rather that of a mesh of clusters, which would be perfectly suited for quantification via the SuperStructure algorithm.
The use of SuperStructure is not limited to biological applications, and we propose it can be used as a standardized and parameter-free tool for assessing imaging technical aspects (van de Linde and Sauer, 2014; Hennig et al., 2015). One of the main issues in SMLM data, especially in dSTORM, is the evaluation of fluorophore blinking quality, as it strongly affects the localization accuracy in the analysis process. For example, an elevated blinking frequency would result in a high emitter density (per frame) and therefore in a high localization inaccuracy due to overlapping emissions. A similar detrimental effect could also be due to a poor blinking signal (few emitted photons per blinking event). As a consequence, lower localization precision of emitters may create pseudo-clusters, as well as pseudo-connections. We envisage that SuperStructure would be well suited to evaluate the blinking quality of fluorophores, for instance by measuring the emerging pseudo-connectivity in a controlled setup, such as fluorophores attached to a grid.
As discussed above, SuperStructure has been developed with the aim of going beyond “simple clustering” and in particular to measure connectivity between clusters. However, our method might be used in combination with other pairwise distance and clustering methods. For instance, one can compute Ripley’s (pairwise distance) functions to preliminarily detect whether localizations are uniform or clustered and, if they are clustered, to estimate the average cluster radius. Yet, Ripley’s functions cannot identify single clusters or more complex structures. Thus, one could use SuperStructure to determine whether the system under investigation displays connected or isolated clusters. At the same time, computing SuperStructure curves provides firm ground for deciding the value of ε that can be used as input in DBSCAN for cluster analysis. This second approach can be used, for example, to measure the size or shape of local super-structures. Indeed, one can fix ε at the value that identifies super-structures, perform a cluster analysis, and calculate the gyration tensor of the identified clusters.
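The last step, computing a gyration tensor for one identified cluster, can be sketched as follows. This is a generic 2D calculation, not code from the SuperStructure repository: the tensor is the covariance of the localization coordinates about their centroid, its eigenvalues are obtained in closed form, the radius of gyration is the square root of their sum, and the ratio of the smaller to the larger eigenvalue is used here as a hypothetical shape descriptor (1 for an isotropic cluster, close to 0 for an elongated one).

```python
import math


def gyration_tensor_2d(points):
    """Return (sxx, sxy, syy), the 2D gyration tensor about the centroid."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    sxx = sum((p[0] - cx) ** 2 for p in points) / n
    syy = sum((p[1] - cy) ** 2 for p in points) / n
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points) / n
    return sxx, sxy, syy


def shape_descriptors(points):
    """Radius of gyration and eigenvalue ratio of a cluster of localizations."""
    sxx, sxy, syy = gyration_tensor_2d(points)
    mean = 0.5 * (sxx + syy)
    half_gap = math.hypot(0.5 * (sxx - syy), sxy)
    lam_max, lam_min = mean + half_gap, mean - half_gap
    radius = math.sqrt(lam_max + lam_min)
    # A single point has no extent; report an isotropic ratio in that case.
    ratio = lam_min / lam_max if lam_max > 0 else 1.0
    return radius, ratio


# Toy cluster (hypothetical): twice as extended along x as along y.
cluster = [(2.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
radius, ratio = shape_descriptors(cluster)
print(round(radius, 4), ratio)  # 1.5811 0.25
```

Applied to every cluster found at the chosen neighborhood radius, these two numbers give a size and an elongation estimate per super-structure.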
We tested the segmentation capabilities of the latter approach by estimating the radius and circularity of SC35 speckles; we observed that it yields results similar to those of the well-known SR-Tesseler software (Levet et al., 2015; see Fig. S5). Although SuperStructure lacks a graphical user interface, it has several advantages. First, it can be run on any operating system and can be easily automated to run on a large number of cells. Second, since it is based on DBSCAN, the algorithm scales as in its simplest implementation (where is the number of values used in the analysis and is the total number of localizations). The calculations on different are independent, so SuperStructure scales extremely well with the number of central processing units available. For instance, the analysis of values and localizations can be done on a six-core machine in ∼19 min. Third, since our algorithm is aimed at extracting beyond-simple-clustering information, it is flexible and intended to be used in combination with other pair-correlation or segmentation methods that are extensively employed for single-clustering analysis.
We conclude by highlighting that SuperStructure provides an unbiased and parameter-free estimation of (1) the density of localizations within single clusters and (2) the formation of super-structures made of connected clusters. Here, we tested SuperStructure both in simulated and cell-based SMLM datasets. Importantly, we revealed previously undocumented system-spanning structures made of connected clusters of nuclear proteins that we argue may have a functional role in shaping genome organization. The use of SuperStructure on cells under different conditions or with protein mutations is thus an exciting direction to uncover the biological significance of these newly discovered nuclear structures.
Materials and methods
SuperStructure algorithm
SuperStructure is an algorithm that detects and quantifies super-structures formed by interconnected clusters on SMLM datasets. Additionally, it can also evaluate the density of emitters inside clusters.
SuperStructure is mainly based on DBSCAN, a density-based algorithm to detect clusters of points in arbitrary-dimensional space. The key concept underlying the DBSCAN scheme is that it groups together points at high density and marks points in low-density regions as outliers. After defining a neighborhood size a point can be part of a cluster if the number of points within a circular region of size centered in exceeds some threshold (or is within the region of another point satisfying this condition).
The concept of clusters is subject to the choice of and and therefore to some sort of likeness or proximity. Furthermore, the change in number of clusters detected by DBSCAN when varying contains some information of the underlying distribution of points that has been overlooked.
SuperStructure progressively runs DBSCAN to detect the number of clusters within a broad range of the neighborhood parameter while is kept fixed. The resulting curves, and in particular the change due to a small change in neighborhood parameter contain fundamental information about the formation and organization of super-structures and connected clusters.
As we aim for a parameter-free algorithm, without losing generality, we fix which means no minimum number of other emitters necessary in the neighborhood to define a localization as part of a cluster. For any point is found to be a cluster by itself. Then, points merge upon increasing resulting in Additionally, the larger the more identified clusters are coalescing together for a certain
We need to stress that by choosing connections will also be considered as points to be merged. However, it is important that we identify connection points as having a lower local density than the groups of points that are bridged by them (clusters). In this way, they will merge in this second regime to form super-structures. The limiting case in which the local density of connection points is the same as the one in the clusters at the two ends of the connections is indistinguishable from the case of one elongated cluster. A special case is that in which both clusters and connections have the same density of points but the connections are slightly detached from the clusters, thus forming three independent clusters at intermediate which may then merge (we assume this to be a rare event). The above reasoning can be extended to multiply connected clusters via the analysis of pairwise connections.
At larger we could have additional super-cluster regimes if the system is heterogeneous. Most common cases showing two (or more) super-cluster regimes are the following: (1) inhomogeneous system displaying different connectivities at different length-scales, (2) connected clusters embedded in a noisy environment (in this case we observe an exponential followed by a Poissonian decay), and (3) unconnected clusters within a random noise and/or unconnected clusters at different densities (in this case, we observe two or more Poissonian decays).
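The progressive scan described in this section can be sketched in a few lines of Python. This is not the authors' pipeline, which relies on an external DBSCAN implementation driven by bash scripts. The sketch rests on one assumption: when no minimum number of neighbors is required, every localization is a core point, so the DBSCAN clusters coincide with the connected components of the graph that links localizations closer than the neighborhood size. The function names and the column order of the output are illustrative.

```python
import math

def count_clusters(points, eps):
    """Count connected components of the graph linking points within distance eps."""
    n = len(points)
    if eps <= 0:
        return n  # every point is a cluster by itself
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Bin points on a grid of cell size eps so that only adjacent cells are compared.
    cells = {}
    for idx, (x, y) in enumerate(points):
        key = (math.floor(x / eps), math.floor(y / eps))
        cells.setdefault(key, []).append(idx)

    n_clusters = n
    for (cx, cy), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in cells.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i < j and math.dist(points[i], points[j]) <= eps:
                            ri, rj = find(i), find(j)
                            if ri != rj:
                                parent[ri] = rj
                                n_clusters -= 1  # two clusters merged into one
    return n_clusters

def superstructure_curve(points, eps_values):
    """Return rows (eps, number of clusters, total localizations) for each eps."""
    total = len(points)
    return [(eps, count_clusters(points, eps), total) for eps in eps_values]
```

Because each neighborhood size is processed independently, the loop in `superstructure_curve` could be distributed over several processes without changing the result.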
SuperStructure pipeline
To apply SuperStructure, we adopt the following steps.
(1) Generation of SuperStructure curves. We run SuperStructure on an SMLM dataset by first masking our data in the ROI, such as the nucleus for nuclear proteins as mentioned in the section below. Then, we choose a range to analyze. For example, in SMLM datasets of nuclear proteins a typical choice is with One should notice that lower may be necessary for fitting the intra-cluster regime. SuperStructure curves are generated by progressively running the DBSCAN clustering algorithm on the SMLM dataset in the chosen range (and The DBSCAN software we use is from https://github.com/gyaikhom/dbscan, and the progressive run is performed with bash scripts available in the repository. SuperStructure output curves are saved in a three-column file where is the number of detected clusters for the corresponding and the number of total localizations. Additionally, the classification of localizations in clusters is saved in a separate file for every
(2) Evaluation of SuperStructure regimes. As a second step, we evaluate regimes by plotting and investigating SuperStructure curves (we adopt a log scale in the y axis). This step includes a preliminary check for the number of regimes and their decay behavior (exponential versus Poissonian). If we observe a single Poissonian behavior, we can state that the dataset shows no, or very limited, connectivity, and therefore, we are in the presence of homogeneous isolated clusters (and possibly noise). Limited connectivity needs to be checked with a cluster analysis and direct dataset observation in case noise has obscured an exponential decay. On the other hand, if we observe a single exponential regime (a straight line in a log-linear plot), we conclude that the system is made of fully connected clusters. If SuperStructure curves show multiple super-cluster regimes, it is likely that the system is heterogeneous. Indeed, multiple exponential regimes may reflect heterogeneous/multiscale connectivities combined with heterogeneous distributions of clusters. Alternatively, we may also find a combination of exponential and Poissonian regimes, and in this case, the system may be made of connected clusters embedded in a noisy region. Other more complex combinations may be possible; however, one should notice that in heterogeneous systems, it might be difficult to recognize and fit super-cluster regimes. To clarify these contributions, it is useful to combine the analysis of SuperStructure curves with a direct observation of the dataset and identified structures and to run SuperStructure on smaller ROIs to analyze different regions of the sample with similar structural phenotypes. Nonetheless, SuperStructure will be able to unambiguously detect differences in connectivity and behaviors in, for example, samples that have been subjected to different conditions or expressing mutated proteins.
(3) Fit of SuperStructure regimes. Once regimes have been identified, one needs to define the boundaries where regimes cross over from one to another. This can be done either manually or by using a preemptive goodness-of-fit test (this procedure would also define fitting ranges). The intra-cluster regime is typically fitted with a Poisson equation (Eq. 1) to evaluate the density of emitters inside clusters as well as obtain an estimation of the upper limit of the intra-cluster regime (using Eq. 2). For super-cluster regimes, we use Eq. 3 if they show a Poissonian decay (curved on a log-linear plot) or Eq. 4 if they otherwise appear straight on a log-linear plot; from the latter, we quantify the connectivity parameter We can then additionally calculate the cluster density to extract the pure connectivity part The cluster density can be computed by performing a cluster analysis with DBSCAN on local circular regions representative of that decay regime and by fixing at the start of that regime (e.g., by counting the number of clusters one obtains by fixing at the beginning of the yellow area in Fig. 3). In the section below and in Fig. S3, we describe in detail the procedure for normalization for the nuclear protein datasets. Finally, and optionally, it is also possible to define a single function fitting the entire curve by either (1) defining a piecewise function where every piece is the fit of the corresponding regime or (2) adding together the contribution of the different regimes (appropriately weighted). We performed fits with a combination of bash and gnuplot scripts available in the repository.
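As an illustration of step (3), the sketch below fits an exponential regime by a straight-line least-squares fit on a log-linear scale. It assumes a generic decay of the form N = A exp(−x/λ) within an already chosen fitting range; the exact parametrization of Eq. 4 may differ, and the function name is hypothetical. The fits reported here were performed with bash and gnuplot scripts, not with this code.

```python
import math

def fit_exponential_decay(eps_values, n_clusters):
    """Fit ln(N) = ln(A) - eps / lam by ordinary least squares.

    Returns (A, lam): the amplitude and the decay length of the assumed
    exponential form. All cluster counts must be strictly positive.
    """
    xs = list(eps_values)
    ys = [math.log(n) for n in n_clusters]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -1.0 / slope
```

A fit in log space weights all points equally on the logarithmic scale, which may not match a nonlinear fit on the raw counts; comparing both is a reasonable check when the regime is short or noisy.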
Simulated dataset generation and SuperStructure analysis
The simulated dataset consists of spatially homogeneous and interconnected clusters randomly distributed on a plane. We set to work with clusters made by taking random cluster centers on the plane and by sampling emitters within a Gaussian of standard deviation thereby setting the cluster radius to with a 95% confidence and the intra-cluster emitter density at The clusters are positioned in a large area, and their number is varied in order to consider different cluster densities. In the example shown in the main text, we fixed thus fixing a cluster density to approximately roughly similar to the values found in experiments for some nuclear proteins. Pairs of clusters are connected with probability if they are positioned closer than a distance The value of is calculated as the ratio between the actual drawn connections and which is the maximum possible number of connections (i.e., when every cluster is connected with every other cluster). To generate a single connection, we considered the vector joining the centers of two clusters and sampled one emitter with probability every 10 nm. Emitters are sampled from a 2D Gaussian centered on the vector connecting the two cluster centers and with a width In the main text, we fixed Note that controls the number of connections, while controls their density. We generated at least 20 independent replicas for each simulated dataset using a combination of bash and python scripts, and then we ran SuperStructure analysis in the range with a change Unless otherwise specified, the first super-cluster regime was fitted with Eq. 4 for while the second super-cluster regime was fitted with either Eq. 3 (unconnected systems) or Eq. 4 (connected systems) for
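The generation procedure described above can be sketched as follows. This is an illustrative reimplementation and not the scripts used for the paper: all parameter names are hypothetical, no numerical value from the study is reproduced, and the pair count used for the connection fraction (every cluster linked to every other cluster) is inferred from the description in the text.

```python
import math
import random

def simulate_connected_clusters(n_clusters, emitters_per_cluster, sigma, box,
                                d_max, p_conn, p_emit, width, step=10.0, seed=0):
    """Generate Gaussian clusters on a square plane plus sparse connections.

    Returns (centers, points, connection_fraction). Lengths are in nm;
    step is the spacing at which connection emitters may be sampled.
    """
    rng = random.Random(seed)
    centers = [(rng.uniform(0, box), rng.uniform(0, box)) for _ in range(n_clusters)]

    # Emitters inside each cluster: isotropic 2D Gaussian around the center.
    points = []
    for cx, cy in centers:
        for _ in range(emitters_per_cluster):
            points.append((rng.gauss(cx, sigma), rng.gauss(cy, sigma)))

    n_connections = 0
    for i in range(n_clusters):
        for j in range(i + 1, n_clusters):
            (x0, y0), (x1, y1) = centers[i], centers[j]
            length = math.dist(centers[i], centers[j])
            # Connect only pairs closer than d_max, with probability p_conn.
            if length >= d_max or rng.random() >= p_conn:
                continue
            n_connections += 1
            # Walk along the joining vector; at every step sample one emitter
            # with probability p_emit, scattered around the axis.
            for k in range(1, int(length // step) + 1):
                if rng.random() < p_emit:
                    t = k * step / length
                    px = x0 + t * (x1 - x0)
                    py = y0 + t * (y1 - y0)
                    points.append((rng.gauss(px, width), rng.gauss(py, width)))

    max_connections = n_clusters * (n_clusters - 1) // 2
    fraction = n_connections / max_connections if max_connections else 0.0
    return centers, points, fraction
```

In this sketch `p_conn` sets how many connections are drawn and `p_emit` sets how densely each connection is populated, mirroring the two roles distinguished in the text.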
Cell preparation for dSTORM imaging of nuclear proteins
hTERT-RPE1 cells (catalog no. ATCC-CRL-4000; American Type Culture Collection) were grown overnight in an eight-well Lab-Tek II Chambered Coverglass–1.5 borosilicate glass (Thermo Fisher Scientific) at 37°C at initial concentration of in (∼40% confluency). We fixed the cells with 4% PFA (Sigma-Aldrich) for 10 min, washed three times in PBS, permeabilized with 0.2% Triton X-100 (Sigma-Aldrich) for 10 min, washed three times in PBS, and blocked with 1% BSA (Sigma-Aldrich) for 10 min.
Immunofluorescence labeling was done by exposing the cells for 2 h to (1) hnRNP-U polyclonal rabbit antibody (A300-690A; Bethyl Laboratories) at (2) hnRNP-C1/C2 (4F4) mouse monoclonal antibody (sc-32308; Santa Cruz Biotechnology) at or (3) SC-35 mouse monoclonal antibody (ab11826; Abcam) at and then three washes. Cells were then exposed for 1 h to secondary antibody. The secondary antibody was made by AffiniPure fragment donkey anti-rabbit or donkey anti-mouse IgG (H+L; 711–006-152 and 715–007-003, Jackson ImmunoResearch Europe Ltd.) conjugated to the organic fluorophore CF647 (92238A-IVL; Sigma-Aldrich) at a stoichiometric ratio of ∼1. After that, cells were washed three times in PBS.
Oxygen scavenger imaging buffer based on the glucose oxidase enzymatic system (GLOX) for dSTORM was prepared fresh. The recipe employed was similar to that used previously (McSwiggen et al., 2019). We mixed (1) 5.3 ml of 200 mM Tris and 50 mM NaCl solution with (2) 2 ml of 40% glucose solution, (3) 200 μl GLOX, (4) 1.32 ml of 1 M 2-mercaptoethanol (Sigma-Aldrich), and (5) 100 μl of 50 μg/ml DAPI solution (Sigma-Aldrich). The GLOX solution was made by mixing 160 μl of 200 mM Tris and 50 mM NaCl with 40 μl catalase from bovine liver (Sigma-Aldrich) and 18 mg glucose oxidase (Sigma-Aldrich).
The 8.9-ml final solution was enough to fill the chambers of the eight-well dish; a coverglass was sealed at the top of the dish to prevent inflow of oxygen.
dSTORM acquisition of nuclear proteins
We performed 3D-STORM acquisitions using a Nikon N-STORM total internal reflection fluorescence system (TIRF) with Eclipse Ti-E inverted microscope and laser TIRF illuminator (Nikon). We equipped the microscope with a CFI SR HP Apo TIRF 100× objective lens (N.A. 1.49) and applied a 1.5× additional optical zoom. We also used a cylindrical astigmatic lens to obtain elliptical shapes for emitters that reflect their z-position (Huang et al., 2008). Laser light was provided via a Nikon LU-NV laser bed with 405-, 488-, 561-, and 640-nm laser lines. In particular, CF647 fluorophores were stochastically excited using the 640-nm laser beam with an additional 405-nm weak pulse. Images were acquired with an Andor iXon 897 EMCCD camera (Andor Technologies). The z-position was stabilized during the entire acquisition by the integrated perfect focus system. Acquisitions were performed at room temperature.
For every nucleus, we acquired a stack of 20,000 frames at 19-ms exposure time by using the Nikon NIS-Elements software. Acquired images have a 256 × 256 pixel resolution with pixel size equal to 106 nm. For every condition (SAF-A, hnRNP-C, and SC35), we acquired six nuclei (i.e., six independent datasets).
Raw images and post-processing analysis for nuclear protein data
The raw stack of frames was initially segmented based on a DAPI marker to carefully mask out the extranuclear signal. Then, frames were analyzed using FIJI (Schindelin et al., 2012) and in particular the ThunderSTORM plugin (Ovesný et al., 2014). First, we filtered them by using Wavelet functions to separate signal from noise. The B-Spline order was set to 3 and the B-Spline scale to 2.0 as suggested previously (Ovesný et al., 2014) for localizations of ∼5 pixels. To localize the emitter centroids, we thresholded filtered images (the threshold value was set to 1.2 times the standard deviation of the first Wavelet function) and calculated the local maximum relative to the eight nearest neighbors. Finally, we fitted the emitter signal distribution with elliptical Gaussians (ellipses are necessary for z-position reconstruction) using the weighted least-squares method and by setting 3 pixels as fitting radius and 1.6 pixels as initial sigma.
Localized data were then postprocessed using the same plugin. We corrected the xy drift using a pair-correlation analysis, filtered data with a position uncertainty restricted the z-position to the interval , and projected the data in a 2D plane, as the z-axis precision is ∼100 nm.
Reconstructed images shown in the main text were created by using the average shifted histograms method of the same plugin with 10× magnification (final resolution set to 10.6 nm/pixel).
SuperStructure analysis for nuclear protein data
SuperStructure analysis was run on the entire nuclear region by setting and by increasing in the range and “all-nucleus” curves were generated for six independent nuclei. We set the change rate for and for This choice was due to the higher resolution necessary to extract intra-cluster information at small As shown in Fig. 3, SuperStructure all-nucleus curves show that SAF-A has a single exponential super-cluster regime, while hnRNP-C and SC35 have two regimes. In the case of hnRNP-C, the second regime is due to weakly connected and sparse clusters in nucleoli, while in SC35 it is due to the cluster/connectivity heterogeneity in the system (i.e., speckles). Therefore, we additionally ran SuperStructure analysis on local ROIs for hnRNP-C and SC35 to obtain the isolated contribution for the first super-cluster regime. In particular, for hnRNP-C, we considered five independent circular ROIs per nucleus with radius within the nuclear mesh; for SC35, we considered five independent circular ROIs per nucleus with radius within speckles. We ran the analysis on these ROIs and generated SuperStructure “local” curves (five for each nucleus).
The values of the intra-cluster density were extracted by fitting with Eq. 1 the intra-cluster regime in the all-nucleus curves in the range Resulting average values are and
Then, we identified the super-cluster regimes of interest: the first super-cluster regimes of SAF-A and hnRNP-C and both super-cluster regimes of SC35 (SC35-1 and SC35-2). For SAF-A and SC35-2, the decay length was obtained by fitting all-nucleus curves with Eq. 4. For hnRNP-C and SC35-1, we fitted the local curves (five curves per nucleus) and then averaged values obtained from different local curves in the same nucleus. Fit ranges are for SAF-A, for hnRNP-C, for SC35-1, and for SC35-2.
Finally, the values of for SAF-A, hnRNP-C, SC35-1, and SC35-2 were normalized by the cluster density: In the case of SAF-A and SC35-2, the normalization was performed for for every nucleus by using the average cluster density of that nucleus. In particular, was calculated as the average of the cluster density in five independent circular regions of radius in the same nucleus as shown in the example of Fig. S3 A. In the case of hnRNP-C and SC35-1, where values were obtained from local curves, the normalization of was performed using the cluster density of the same local region; then, values obtained from different regions in the same nucleus were averaged (see Table S1). The number of clusters estimation (to calculate the cluster density) was made with DBSCAN by setting and close to the beginning of the exponential regime of interest, as shown in Fig. S3 B, and by keeping only clusters with at least 30 particles. To compute the cluster density, for SAF-A and hnRNP-C, we set local circular regions of radius and fixed for cluster analysis (for hnRNP-C, we used the same local regions as defined above). For SC35, we considered two sets of local regions: (1) inside speckles to normalize the shorter decay length, where we used ROIs with radius and fixed for cluster analysis (same regions as above); and (2) outside speckles to normalize the longer decay length, where we used ROIs with radius and for cluster analysis. Average nuclear values of and are shown in Table S1.
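The cluster-density estimate used for this normalization can be illustrated with a short sketch. It assumes a circular ROI, a fixed neighborhood size, clustering by connected components (equivalent to DBSCAN when no minimum number of neighbors is imposed), and the 30-localization size cutoff stated above. The function and parameter names are hypothetical, and the ROI radii and neighborhood values used in the study are not reproduced.

```python
import math
from collections import Counter

def cluster_labels(points, eps):
    """Label points by connected component of the eps-neighborhood graph."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Simple O(n^2) pair scan; adequate for a small local ROI.
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

def cluster_density(points, center, radius, eps, min_size=30):
    """Number of clusters with at least min_size points per unit ROI area."""
    inside = [p for p in points if math.dist(p, center) <= radius]
    sizes = Counter(cluster_labels(inside, eps))
    n_clusters = sum(1 for size in sizes.values() if size >= min_size)
    return n_clusters / (math.pi * radius ** 2)
```

Averaging `cluster_density` over several independent ROIs in the same nucleus would give a per-nucleus value of the kind described in the text.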
SuperStructure analysis of ceramide data
SuperStructure analysis was run on the two ceramide datasets provided by the authors from Burgert et al. (2017), namely +bSMase and −bSMase, by setting and We set for and for This choice was due to the higher resolution necessary to extract intra-cluster information at small From the curves in Fig. 4 B, it is clear that there is no strong connectivity (we observe a Poissonian decay). Therefore, we identified free unclustered emitters as noise. We additionally ran SuperStructure in 16 independent local circular regions of radius to extract the quantities of interest. In particular, we measured the average densities of total localizations, and respectively, for +bSMase and −bSMase treatment. This is in accordance with results in the original paper. Then, we fitted local SuperStructure curves in the intra-cluster regime with Eq. 1 for and respectively, for +bSMase and −bSMase treatments. Finally, we fitted local SuperStructure curves in the super-cluster regime with Eq. 3 in the range for +bSMase and for −bSMase (the difference in fit starting value is explained by a horizontal shift between the two curves): and These two values are in accordance with the sum of cluster density and noise at the value where the fit starts. We additionally performed a cluster analysis with DBSCAN, and results are in agreement with the original paper (see Fig. S4 for details). To verify that there is no limited connectivity hidden by noise, we performed a cluster analysis at two different values of and monitored the change in density of clusters and density of free emitters (see Fig. S4 for details).
Data availability
The simulated and experimental datasets that support the findings of this study are available from the corresponding authors upon request.
Code availability
The code for the generation of SuperStructure curves is available at https://git.ecdf.ed.ac.uk/dmichiel/superstructure.
Online supplemental material
Fig. S1 shows a simulated distribution of points inside a single cluster and how it is well represented by Eq. 1 in Materials and methods. Fig. S2 shows SuperStructure curves for simulated datasets of connected clusters in different conditions, including systems with different cluster densities, systems embedded in a noisy environment, and fully connected meshes. Fig. S3 shows how the normalization of was performed in nuclear protein data (exhaustively explained in Materials and methods) and that nuclear proteins connectivity is not a technical artifact. Fig. S4 shows that there is no local connectivity in ceramide data and confirms the original paper’s results on ceramide cluster size. Fig. S5 shows SuperStructure + DBSCAN segmentation capabilities by estimating the radius and circularity of SC35 speckles alongside SR-Tesseler software. Table S1 recapitulates values for and in nuclear protein data.
Acknowledgments
The authors thank the Edinburgh Super-Resolution Imaging Consortium (Institute of Genetics and Molecular Medicine section), in particular Matthew Pearson and Ann Wheeler, for help and support. The authors are grateful to Markus Sauer for providing the ceramides data. M. Marenda and D. Michieletto also thank Ibrahim Cissé for an igniting discussion and Davide Marenduzzo's group for discussions.
M. Marenda is a cross-disciplinary postdoctoral fellow supported by funding from the University of Edinburgh and the Medical Research Council (core grant MC_UU_00009/2 to the Medical Research Council Institute of Genetics and Molecular Medicine). S. van de Linde is supported by the Academy of Medical Sciences, the British Heart Foundation, the Government Department of Business, Energy and Industrial Strategy, and the Wellcome Trust Springboard Award (SBF003\1163). N. Gilbert is funded by the UK Medical Research Council (grant MC_UU_00007/13). D. Michieletto is a Royal Society University Research Fellow and was supported by the Leverhulme Trust (grant ECF-2019-088) and European Research Council Starting Grant (Topologically Active Polymers [TAP] grant 947918). The authors thank the Scottish University Life Science Alliance for support through a technology seed grant (Worktribe Project ID 8824507).
The authors declare no competing financial interests.
Author contributions: M. Marenda, D. Michieletto, and N. Gilbert conceived the project. M. Marenda and D. Michieletto analyzed both simulated and experimental datasets. M. Marenda, S. van de Linde, and D. Michieletto generated the simulated dataset. M. Marenda, E. Lazarova, and D. Michieletto performed super-resolution experiments and localization analysis. M. Marenda, D. Michieletto, S. van de Linde, and N. Gilbert wrote the manuscript, with input from all authors.