Confocal micrographs of EGFP fusion proteins localized at key cell organelles in murine and human cells were acquired for use as subcellular localization landmarks. For each of the respective 789,011 and 523,319 optically validated cell images, morphology and statistical features were measured. Machine learning algorithms using these features permit automated assignment of the localization of other proteins and dyes in both cell types with very high accuracy. Automated assignment of subcellular localizations for model tail-anchored (TA) proteins with randomly mutated C-terminal targeting sequences allowed the discovery of motifs responsible for targeting to mitochondria, endoplasmic reticulum, and the late secretory pathway. Analysis of directed mutants enabled refinement of these motifs and characterization of protein distributions in within cellular subcompartments.
Introduction
Subcellular localization of proteins is a key feature of eukaryotes. Understanding subcellular localization has long been a goal for cell biologists interested in basic mechanisms of protein sorting and for understanding the generation of organelles with distinct compositions and morphologies. Moreover, there is much to be learned about disease processes, mechanisms of signal transduction, and cellular metabolism that is directly linked to subcellular localization. Traditionally, subcellular localization of a protein of interest has been assigned by visual comparison with one or more proteins of known localization (antibody based or fluorescence protein tagged) or with organelle-specific dyes (e.g., Mitotracker) in fluorescence microscope images by an experimentalist. However, human visual inspection is prone to both drift and bias. Therefore, machine learning tools have been developed to automate the analysis of subcellular localization.
Early classifiers built to distinguish subcellular structures in fluorescence micrographs in HeLa cells based on features tailored specifically for subcellular location studies functioned well with small datasets (Boland and Murphy, 2001). In addition to the newly designed region of interest (ROI) features, the well-known textural features by Haralick et al. (1973) and Zernike moments features (Zernike, 1934) were used. A new set of statistical features called threshold adjacency statistics (TAS) is faster to calculate than other commonly used statistical features (Hamilton et al., 2007), and also shows good performance (Nanni and Lumini, 2008). As a result, several supervised classification strategies to distinguish subcellular structures of the main subcellular locations (e.g., cytoplasm, nucleus, Golgi apparatus, mitochondria, and ER) have been published (Hamilton et al., 2007; Conrad et al., 2004; Li et al., 2012).
A major limitation to using automated methods of image analysis for determining subcellular localization is the paucity of high-quality images with explicit annotations. This is particularly problematic for proteins that transit between organelles or those at steady-state that are located at more than one subcellular location. For such proteins, accurate annotation based on biological experiments or images of subcellular distributions can be very difficult. One approach to dealing with this problem has been to use semi-supervised methods to assign subcellular localization from lower-quality data together with multi-label classification (Xu et al., 2016). A similar approach was used to detect mis-localization of proteins in cancer cells using images from the human protein atlas (Xu et al., 2015, 2019). However, in these cases, automated analysis was limited to detection of relatively coarse localization changes such as between the cytoplasm and nucleus or mitochondria.
More recently, deep learning approaches alone and together with crowd sourcing have been used to tackle the problem of classifying subcellular localization of proteins in yeast (Chong et al., 2015; Pärnamaa and Parts, 2017) and in human cell lines (Sullivan et al., 2018). A major advance in these studies was the use of hundreds of thousands of images to overcome differences inherent in the data deposited in repositories such as the human protein atlas as well as the cell to cell variations inherent in normal biology. Using multiple markers in the same cell, it was possible to automatically classify a number of subcellular structures, particularly subnuclear spot types (Sullivan et al., 2018); however, automated identification of the compartments within the secretory pathway has not been achieved.
Improving automated assignment of localization requires a large dataset of high-quality images and an alternative approach to the problem of proteins having multiple subcellular localizations. Our approach was to generate a reference library of 789,011 and 523,319 optically validated landmark-based localization images in murine and human epithelia, respectively. Numerical analyses identified 160 features most useful for assignment of localization while minimizing the effects of expression levels. Rather than forcing assignment to predefined organelles, we classify images by similarity to landmarks that may themselves localize to multiple destinations. We then used this library of images to examine localization of model tail-anchor proteins. We show that automated analysis outperformed even highly trained human observers and enabled the identification multiple morphologically distinct distributions of TA proteins. Our results demonstrate the utility of the reference library of images, derived features, and machine learning to provide unbiased assignment of subcellular localizations with high accuracy in individual living cells.
Results
Distinct subcellular phenotypes identified from multidimensional descriptors of cell images of landmark proteins
Landmark proteins were generated by fusing EGFP to proteins selected from previous reports establishing localization at specific subcellular organelles (Table 1). A reference library used for automated assignment of subcellular localization in individual cells was generated by calculating multidimensional descriptors of micrographs of cells expressing these proteins. The reference library contains 789,011 and 523,319 quality-validated cell images of the EGFP-landmark fusion proteins expressed in normal murine mammary gland (NMuMG) cells and in MCF10A cells, respectively. For organelles with recognized subdomains (mitochondria, ER, and Golgi apparatus) multiple landmark proteins were used (Fig. 1 A and Table 1).
In addition to forward transit, proteins recycle between different organelles or within compartments of the same organelle (e.g., Golgi), which increases the heterogeneity of distribution. To capture the varying phenotypes caused by mobility of TA proteins, thousands of images were acquired for each landmark (Table 1). Prior to feature extraction, all images undergo automated quality control to remove out-of-focus cells and segmentation artifacts (Fig. S1 and Materials and methods).
To visualize in two dimensions the separation of these landmark proteins in 160-dimensional feature space, we used t-distributed stochastic neighbor embedding (t-SNE) projection (Van der Maaten and Hinton, 2008) to compress the data. As displaying over 500,000 multidimensional points in two dimensions is not meaningful, a subset comprising the 200 nearest-neighbor data points to the centroid of each landmark was plotted. Landmarks in the dataset are clearly separated from each other, with only a few points distributed to other areas within the t-SNE landscape (Fig. 2 A). As another way to represent the distribution of the data and demonstrate the separation of landmarks, data points for individual landmarks were over-clustered using Phenograph (Levine et al., 2015; Fig. 2 B). As a third measure of the relationships between the multidimensional classes, the Euclidean distances between landmark centroids were tabulated and plotted as a heatmap (Fig. S2 ).
An ensemble of Random Forest (RF) classifiers generated from a subset (less than half) of the landmark images was used to assign localization of images not used for training to generate confusion matrices (Fig. 1 B). The high accuracy of the resulting classification of individual cells (78%) demonstrated that even though many of the images for different landmarks appear visually similar, a classifier based on the image features table accurately separates and identifies fluorescent proteins targeted to different subcellular organelles for individual cells without a colocalization marker (Fig. 1). The confusion matrix indicates that even for the two most similar and visually indistinguishable landmarks, monoamine oxidase A (MAO) and cytochrome c oxidase (CCO; outer and inner mitochondrial membranes, respectively), <15% of cells were misclassified to another single landmark (Fig. 1 B). Control experiments in which we deliberately impair the classifier by training on images of the highest or lowest intensity and then classify images of different intensity levels demonstrated that the expression level (intensity) of the landmarks neither contributed to nor confounded classification (Fig. 3). Indeed, the classification accuracy drops only in the extreme cases such as when a classifier trained using images of only the lowest intensity (highest shot noise) was used to classify images from the highest-intensity quartile (Fig. 3) and the errors introduced were confined to the most closely related patterns. This effect is negligible in our other experiments as all intensity levels were used in training. Furthermore, the SD in the assignments across 20 classifiers was <1% of cells and is therefore not specified in the figures (see Materials and methods). Taken together, these results suggest that an assignment of >20% of cells to a specific landmark location is a conservative definition for significance.
At steady-state, many of the landmarks have some cells reproducibly assigned to other localizations. This type of “mis-localization” to other organelles actually reflects normal organelle and protein dynamics. Thus, assignment of a protein to a particular subcellular location is never realistically 100%.
To our surprise, different landmarks ostensibly targeted to the same organelle were efficiently discriminated by the RF classifier. For example, the images of EGFP fused to the TA sequences of the ER localized TA proteins Bcl-2 interacting killer (Bik; Germain et al., 2002) and cytochrome b5 (Cytb5) are visually similar to each other and to images of EGFP fused to Calr-KDEL, another ER landmark (Fig. 1 A). However, the cell images were well-separated from each other in multidimensional space in both cell lines (Fig. 1 B and,Fig. 2 A). As a result, in NMuMG cells, only 8% of Cytb5-expressing cells were classified as Bik (Fig. 1 B); 5% of Bik-expressing cells were classified as Cytb5, while 9% were classified as the nearest Euclidean neighbor, phosphatidylserine synthase 1 (PTDSS1; mitochondria-associated ER membrane [MAM] and ER; Table 1); and 9% were classified as the ER–Golgi intermediate compartment (ERGIC) protein 53. Classification of Bik localization to these landmarks in a small fraction of cells is highly plausible as both compartments are related to the ER. Importantly, only 2% of cells expressing Cytb5 were classified as VAMP5, the nearest Euclidean neighbor. Thus, images that are not assigned to their own landmark are assigned according to features representing their biology, and not simply by centroid Euclidean distances. Controls (Fig. 3 and described below) demonstrated that intensity differences do not account for the different classifications. It is possible that classification differences for landmarks ostensibly located at the same organelle represent different steady-state distributions within the organelle, and different extents to which the landmarks localize to different organelles, as seen previously (Xu et al., 2016). As discussed below, different steady-state distributions within an organelle may also represent functional or physical subdomains. Because the different landmarks for the same organelle are well separated, we refer to the various subcellular locations by the landmark protein rather than the organelle name. In this way, the variability of localization for each landmark is preserved in the classification. When organelle names are used, it is to designate multiple landmarks in aggregate.
The localization of novel proteins and dyes can be identified across species using reference libraries
To validate our multiparametric definition of the subcellular landscape, we used a new set of marker proteins previously reported to target to specific organelles and classified them using our existing set of NMuMG landmarks (Fig. 4 A). The ER-specific dye (BODIPY-thapsigargin) was predominantly classified with the ER marker Calr-KDEL (47%), but also with ER markers Cytb5 (15%) and Bik (20%) for a total ER classification of 82%. Similarly, the cis-Golgi–localized TA protein Golgi SNAP receptor complex member 2 (Membrin) was classified as mostly Golgin84 (62%) with a component assigned to the Golgi protein GalT (31%). Moreover, the mitochondrial stain Mitotracker and the mitochondrial matrix protein ornithine transcarbamylase (OTC) were both classified with the mitochondrial landmarks MAO (55% and 27%, respectively) and CCO (34% and 62%, respectively). In addition, the classifier successfully assigned outer membrane mitochondrial localization to the outer mitochondrial membrane protein Harakiri fused to the red fluorescence protein mLumin (66% outer mitochondrial MAO and 25% inner membrane CCO). The recycling endosome marker (Rab11) was classified with Rab5 (42%).
In addition to analyzing a new set of marker proteins, we also tested the performance of the NMuMG cell classifier by using human MCF10A cells expressing five of the landmarks (Rab5, ERGIC53, LAMP-1, Bik, and Golgin84). These proteins were stably expressed in MCF10A cells and imaged, and the resulting micrographs were analyzed using the classifier derived from images of NMuMG cells expressing all the landmarks. The classifier correctly assigned the localization of the landmark proteins expressed in MCF10A cells (Fig. 4 B). However, the ER protein Bik was mostly classified as the other ER-localized TA protein, Cytb5. Moreover, only 41% of Golgin84 in MCF10A was assigned to the NMuMG Golgin84 compartment, suggesting that there is a difference in the morphology of these compartments between the two cell lines. This hypothesis was verified by visual comparison of the corresponding images in the two cell lines (Fig. 4 C). This result demonstrates that subtle changes in morphology can be detected by the classifier, suggesting utility for a wide variety of genetic and chemical perturbation studies. However, for automated classification in more divergent cell types, it will be necessary to build a new library of landmark images.
We generated a second landmark library in the human cell line MCF10A (Fig. 1) that includes additional landmarks (Table 1) and resulted in accurate assignment (77%) of subcellular localizations for images of individual cells not used in training (Fig. 1 B). Comparison of the five query proteins expressed in MCF10A revealed that as expected, the accuracy was somewhat (on average <20%) higher when assigned using the MCF10A instead of the NMuMG library. As expected, the improvement was dramatic for Golgin84 (41% to 84%; Fig. 4, B and C; and Fig. 1 B, respectively).
These results demonstrate that our analysis captures the subcellular landscape with sufficient resolution to accurately assign the localization of proteins and organelle specific dyes not used for training, regardless of whether they expressed RFP or GFP fusions or in human or murine epithelial cells, attesting to the robustness of the approach. Furthermore, the analysis is sensitive enough to separate landmarks ostensibly targeted to the same organelle (e.g., Cytb5, Bik, and Calr-KDEL) and thereby identify different protein distributions within organelles (Cytb5 and Bik) and/or discriminate resident versus recycling proteins (Cytb5 vs. Calr-KDEL).
Image-based analysis of the determinants of subcellular localization for TA proteins
As an example of one of the uses for the localization libraries, we systematically examined the targeting behavior of TA proteins. In multiple cell types, VAMP1 targets to the ER and then transits through the secretory pathway to the cell surface (Raptis et al., 2005; Chen and Scheller, 2001). When EGFP-TA was expressed in NMuMG cells, the images of 62,000 of 104,000 cells (∼60%) were classified as VAMP5 localization. Although VAMP5 has been reported to be located primarily at the plasma membrane (Hong, 2005) due to the dynamic trafficking of proteins, images of the VAMP5 landmark in our library show protein localization throughout the secretory pathway. Thus, our results indicate that EGFP-TA has a distribution compatible with the biology of the protein the TA sequence was derived from.
We used EGFP-TA to investigate the features of the TA sequence that determine the specificity of localization within the secretory pathway and to other intracellular membranes. To analyze one region of the sequence systematically, we performed random mutagenesis on the codons for the last five amino acids of the TA sequence (IYFFT), which constitute the C-terminal sequence (CTS). Sequencing revealed 995 unique sequences, each of which was individually expressed in NMuMG cells. As expected, when classified using the reference library, many of the mutants were assigned VAMP5 localization. However, a few were localized primarily at the ER (n = 8) or mitochondria (n = 13) and were therefore selected for further analysis. Our initial focus was the mutants localized at mitochondria because the requirements for TA targeting to this organelle have been analyzed extensively, yet a clearly defined consensus has not emerged.
Targeting to the mitochondria is determined by positions of amino acid properties in the CTS
Mitochondria are a well-known targeting destination for TA proteins that play a crucial role in apoptosis (Cory and Adams, 2002), protein import (Horie et al., 2002), organelle and vesicle fission and fusion (Scott and Youle, 2010), and other functions. While several approaches have been used to decipher the sequence requirements for localization, it is still unclear what comprises a TA mitochondrial targeting sequence.
12 of the 13 CTS sequences that resulted in mitochondrial localization of EGFP-TA were five amino acids long. The one sequence with a four–amino acid CTS due to random incorporation of a termination codon was assigned to the outer mitochondrial membrane (MAO), but not examined further. Of the five amino acid CTSs, eight were assigned only to mitochondria, all primarily to the outer mitochondrial membrane localization MAO landmark as expected for proteins anchored to mitochondria by a TA sequence. For three of these mutants, a smaller fraction was also assigned as the inner membrane landmark CCO. Of the four sequences not primarily assigned to MAO, none was primarily assigned to CCO. Even though 12 sequences is a small number, it was possible to visualize amino acid enrichment using a pLogo representation (O’Shea et al., 2013). The most overrepresented amino acid in sequences that localized EGFP-TA to mitochondria was arginine (R) at CTS position 2 (Fig. 5 A). Moreover, lysine (K) at CTS position 1 was also significantly overrepresented. There was a high occurrence of arginine at positions 3 and 5, but individually this did not reach statistical significance (Fig. 5 A). By carrying out an enrichment analysis for the number rather than position of positive charges, we observed that mitochondrial localization of mutants with at least three positively charged residues (K or R) was statistically significant (three positive charged residues, P = 5.6 e−5; four positive charged residues, P = 4.9 e−9, Fig. S3 A). This result is consistent with reports that positively charged residues are frequently observed in the CTS of mitochondrial TA proteins (Rapaport, 2003). However, previous publications (Marty et al., 2014; Rapaport, 2003) suggested that a net charge of the CTS region of +2 or greater is sufficient to target TA sequences to mitochondria. In contrast, for the TA protein Fis1, it has been reported that at least four positive charges are required for targeting to mitochondria (Rao et al., 2016). With our large datasets of random mutants, we can evaluate these rules systematically by examining sequences that target EGFP-TA to mitochondria and how frequently sequences that match the consensus do not target to mitochondria. When we examined the frequency of EGFP-TA mutants with five-residue CTS sequences with KR or RR as the first two amino acids, or net +2 or greater charges in the CTS targeting to mitochondria, the specificity of these simple rules was 0.86 or higher. However, the sensitivities and positive predictive values (PPV) are low (table in Fig. 5 A). Only 12 out of 133 +2 or greater charged CTS sequences (9.0%) targeted EGFP-TA to the mitochondria. Previous studies have not reported the frequency of sequences that adhere to the predicted rules but do not target as predicted, presumably due to the difficulty of manually assigning localization for large numbers of mutants, a task that is easily achievable with our automated reference library.
To further test the importance of positive charges, we generated mutants with a high proportion of positive amino acids (e.g., RRRNR, QRRNR, TRRNR, and SRRNR), all of which contain at least three positive charges and the over-represented R at position 2 (Fig. 5 B). The micrographs of EGFP-TA with CTS sequences RRRNR and QRRNR were assigned 89% and 80% mitochondrial localization, primarily to outer mitochondrial membrane localization (MAO, 65% and 62%), respectively. However, image-based classification assigned the mutants SRRNR and TRRNR to the ER landmark Calr-KDEL and Golgi, respectively, clearly indicating that three positive charges with an R at position 2 is not sufficient to target EGFP-TA to mitochondria.
To generate a rule for predicting localization of EGFP-TA proteins at mitochondria, we used the sequence data to generate a position weight matrix (PWM) based on amino acid properties. The PWM enabled calculating a likelihood value for localizing at mitochondria for all of the mutant sequences with at least two positive charges (Fig. 5 B). Visual inspection of this data suggests that a threshold of 5.68 (equal to that of the sequence RHRAN) optimally identifies CTSs that target EGFP-TA to mitochondria. With that threshold, the PWM correctly assigns the mutants RRQVH and RHRAN that are missed by the other rules. Moreover, it correctly assigns the designed mutants RRRNR and QRRNR to mitochondria and rejects the SRRNR and TRRNR sequences. The PPV, sensitivity, and specificity demonstrate that the PWM predictive rule clearly outperformed any of the other rules (Fig. 5, table). Thus, the locations of the biochemical properties of the amino acids contribute to the targeting signal within the CTS that specifies mitochondrial outer membrane localization.
To evaluate a CTS sequence predicted by the PWM to target to mitochondria in the context of a different protein, we examined the TA sequence of Bcl-2 fused to EGFP. Wild-type EGFP-TA-Bcl-2 targets to the ER, MAM, and mitochondria, as cell images were classified 45% Calr-KDEL, 20% PTDSS1, and 17% MAO (Fig. 6), consistent with previous reports demonstrating Bcl-2 localization at both ER and mitochondria (Zhu et al., 1996). Localization at multiple organelles also demonstrates that the TA sequence of Bcl-2 is permissive for insertion into multiple membranes.
After removing the SHK sequence that constitutes the CTS of Bcl-2 (Henderson et al., 2007), the protein Bcl-2-ΔSHK was localized almost exclusively to the ERGIC compartment, confirming that the CTS can determine localization for this protein. Replacing the SHK sequence with the CTS sequence KRRNR generated the protein EGFP-TA-Bcl-2-ΔSHK-KRRNR, which, as predicted from the PWM, was classified exclusively (96%) as MAO (outer mitochondrial membrane). Furthermore, when the CTS of EGFP-TA-Bcl-2 was replaced with the sequences FPCVN or WTNFK that localized EGFP-TA to the ER, the resulting proteins were primarily classified as targeting to the ER (~60% of the cell images). Similar to the EGFP-TA versions, most individual images were classified as Bik, but some were assigned to the other ER subdomains defined by Cytb5 and Calr-KDEL (Fig. 6). Taken together, these results indicate that for permissive TA sequences, targeting to the mitochondrial outer membrane can be achieved by CTS sequences defined by the PWM. The PWM-defined motif is the first demonstration that the position of amino acid properties within the CTS rather than sequence identity determines mitochondrial localization.
The ensemble of RF classifier correctly assigned cell images of several TA proteins that were not used in training to the outer mitochondrial rather than inner mitochondrial membrane. These included Harakiri, EGFP-TA-Bcl-2-ΔSHK-KRRNR, EGFP-TA-RRRNR, and EGFP-TA-QRRNR. This is remarkable considering a human observer would be unable to accurately determine outer from inner mitochondrial localization from the images (Fig. S4). Consistent with the classifier efficiently distinguishing localization at the outer and inner mitochondrial membrane, the confusion matrix (Fig. 1 B) indicates <15% overlap between targeting assignments for MAO and CCO. An alternative explanation is that an intensity difference between the inner and outer mitochondrial membrane protein landmarks (CCO and MAO) was sufficient to result in classification of queries as MAO based on intensity despite our selecting features less influenced by intensity. To test this hypothesis, individual cells were selected for training that expressed CCO or MAO with a similar defined intensity range (Fig. 7). This intensity filter was also applied to cell images of OTC and some of the EGFP-TA mutants that targeted to the mitochondria for use as a query set. The classification result (Fig. 7) of this image data resulted in all of the tail-anchored proteins correctly assigned as localized in the outer membrane, while OTC was correctly assigned localization to the inner membrane. Thus, the classifier accurately distinguished inner and outer mitochondrial membranes for six query proteins not used in training, independent of intensity (Fig. 7, heatmap).
ER localization and secretory pathway motifs can be identified from a sparse dataset
The 995 mutants constitute a very sparse dataset representing <0.03% of the ∼3.36 million possible five–amino acid CTS sequences. The more specific the requirements are for a targeting sequence for any particular subcellular location, the fewer examples there will be in a sparse dataset. As a consequence in our dataset, even a single sequence might represent a class of related sequences that localize similarly (a motif). To test this hypothesis, we examined further the eight EGFP-TA mutants that were assigned only ER localization. These sequences already suggest that there are sequence dependencies for the localizations defined by the landmarks Bik, Cytb5, and Calr-KDEL. Of the eight sequences, three (FPCVN, WTNFK, and DPTDS) localized EGFP-TA primarily (45–59%) to the localization defined by Bik, while the other five sequences (DEPGH, PEHVS, PKWVT, PSNHQ, and RVRPG) were assigned (43–55%) to the localization identified by Cytb5. Consistent with those representing distinct distributions within or subcompartments of the ER, none of these proteins were assigned significant localization to any other locations, including the other ER landmarks (Fig. S5 B).
To assess the importance of individual amino acids in determining ER localization, we chose one mutant, FPCVN, for further study; because of the 995 sequences, it is the only one ending in CVN. Images of this mutant were assigned 61% to the ER landmarks (46% Bik localization) with <20% assigned to any other location (Fig. 8 and Fig. S5 B).
To determine if the CVN sequence is part of a motif, the first two residues of the FPCVN sequence were mutated to create a new set of CTS sequences on EGFP-TA that differ by one or two amino acids. We did not design “CVN” mutants with negative charges because mutants without negatively charged amino acids are more likely to escape the ER/post-ER compartments and localize at the plasma membrane (Fig. S3 B). Previous work on membrane proteins suggests that the length of the transmembrane domain (TMD) is a key feature that determines localization to the Golgi versus plasma membrane (Sharpe et al., 2010). To test this notion further, the changes were selected to alter the hydrophobicity at positions 1 and 2, thereby potentially extending the length of the hydrophobic region of the TA sequence by up to four residues as the C and V of the CVN sequence are also hydrophobic. Consistent with our sparse data-motif hypothesis, 14 of the 22 “X-X-C-V-N” sequences resulted in mutants that were classified as localizing primarily to the ER (Cytb5, Bik, or Calr-KDEL). This is a remarkable result considering that only 8 of 995 random mutants were localized similarly. To demonstrate that assignment to ER localization was not driven by image intensity, we repeated the analysis with selected images for training and testing with similar distributions of image intensities as shown above (Fig. 7). Restricting the distribution of intensities made no significant difference to the image classification results (Fig. S5 C).
Plotting the average hydrophobicity (Kyte and Doolittle, 1982) of the two positions (Fig. 8) demonstrated that sequences with hydrophobicity <0.6 (i.e., hydrophilic) together with CVN constitute a new motif for targeting EGFP-TA to the ER. All of the mutants with this motif were assigned significant localization at the ER with variable localization also at ERGIC and endosomes. Of the seven sequences with intermediate hydrophobicity (>0.6 and <1.7), four (FPCVN, SLCVN, GLCVN, and SVCVN) were retained in the ER. The exceptions (LRCVN, LWCVN, and LGCVN) that were assigned significant VAMP5 localization demonstrate that hydrophobicity is not required to progress through the secretory pathway. Of the seven mutants with hydrophobicity (1.7–1.8), LGCVN, WICVN, and IWCVN were all >48% assigned VAMP5, while GLCVN and SVCVN were primarily assigned to the ER with only partial localization at endosomes and no significant localization with VAMP5. Furthermore, the sequences LWCVN, WICVN, IWCVN, and CICVN that exhibited the strongest preference, >50% assigned localization to the VAMP5 compartment, ranged in CTS sequence hydrophobicity from modest (1.45) to high (3.8). Finally, some mutants suggest that rather subtle sequence features are involved in retention or export from the ER. For example, RWCVN was assigned to ER, yet switching the order of the first two amino acids to generate the CTS sequence WRCVN resulted in assigned localization throughout the secretory pathway with equivalent numbers of cell images assigned to ER (29%) and endosomes (30%). Thus, while our results confirm a weak correlation between hydrophilicity and retention at the ER, it is clear that hydrophobicity is not the only driving force for sorting to or exclusion from the plasma membrane or endosomes.
To our surprise, some of the mutants with ER-assigned localizations were classified as similar to Bik, while others were assigned to Cytb5, suggesting significant differences in their distribution in the ER. Consistent with this interpretation, several other mutants were assigned primarily to one or the other of these landmarks (Fig. S5, A and C).
Discussion
Our development and characterization of reference datasets, each containing more than 500,000 optically validated individual cell images, enabled the identification of 160 features useful for assigning subcellular localization with an ensemble of RF classifier. These reference sets of optically validated images, computed features, classification results, and cell clones (available from Addgene) are tools that can be used to interrogate subcellular localization, for analysis of diversity in organelle morphologies, and as a reference standard for algorithm development. For example, our automated analysis of the images highlighted both morphological similarities (mitochondria) and differences (Golgi) between premalignant murine and human mammary epithelia cells grown in monolayers (Fig. 4, B and C). Furthermore, by assigning localizations to landmarks rather than specifying a specific organelle, our approach accounts for proteins having multiple localizations. For example, VAMP5 is generally referred to as a plasma membrane protein, but here the VAMP5 localization encompasses much of the secretory pathway because at steady-state, much of the protein is in transit. Because the cell lines and the derived clones have relatively stable genomes, they are expected to constitute a platform for image-based analyses of genetic manipulations, cell signaling events, and quantitative measurement of cellular responses to environmental and extracellular matrix changes as well as responses to drugs and other perturbations. Here we have used the images and feature sets as tools to examine the sequence requirements for motifs that regulate TA protein subcellular localization.
Motifs for localization at mitochondria
Our dataset included only 12 mutants with a five–amino acid CTS that targeted EGFP-TA to mitochondria. However, comparison with the almost 1,000 sequences that did not result in localization at mitochondria was sufficient to identify the shared characteristics of proteins assigned mitochondrial localization as a PWM for the amino acids within the CTS. Thus, in contrast to previous results, our data directly demonstrate that the presence of positively charged amino acids is not sufficient for targeting mitochondria. Indeed, 133 EGFP-TA mutants with positively charged CTS sequences (+2 or greater) did not localize to mitochondria (Fig. 5). CTS sequences that targeted to mitochondria did not include those with a longer hydrophobic core sequence, a result consistent with other studies using smaller datasets (Costello et al., 2017). However, a shorter hydrophobic core was also not sufficient to target a TA protein to mitochondria (Fig. S3 C). Indeed, even when the ≥2 positive charges and minimal hydrophobic core were combined, there were 121 proteins targeted to nonmitochondrial locations compared with 12 assigned to the mitochondria.
Automated assignment of localization will facilitate future analyses of more sequences to determine if there are multiple independent motifs for mitochondrial localization of TA proteins. Classification based on the computed feature data clearly enables automated assignment of outer mitochondrial membrane localization even though such assignment is not possible based on visual inspection of the images (Fig. 7 and Fig. S4). Our approach will also enable generation of a more robust PWM for such motifs to fully capture the rules governing localization of TA proteins at mitochondria.
ER-localized TA proteins
Our observation that different ER-localized TA proteins have distinct distributions suggests that the ER may be composed of multiple morphologically distinct subcompartments. A related explanation is that different TA proteins have different residence times in the ER or distinct regions of the ER. Of particular interest is the observation that the TA sequences from Bik- and Cytb5-localized EGFP-TA result in different distributions of the proteins (Fig. 1 B and Fig. 2 A).
Consistent with the CTS sequence FPCVN being a representative of a bona fide localization motif for a novel Bik distribution within the ER, the sequence resulted in the same localization for EGFP-TA-Bcl-2-ΔSHK (Fig. 4). Unexpectedly, inverting the first two amino acids of FPCVN to generate PFCVN resulted in export of EGFP-TA from the ER and localization at Golgi and VAMP5 destinations. In contrast, the sequence RRCVN resulted in 48% assignment of EGFP-TA localization as most similar to Calr-KDEL with negligible protein assigned to Bik localization, indicating that the different sequences result in images that are morphologically distinct. The differences in assigned localizations are most likely due to differences in the distributions of the protein within the ER. Whether the distributions reflect differences in residence times and/or functionally distinct subdomains remains to be determined, but the differing assignments are clearly not the results of different expression levels of the proteins (Fig. S5 C).
Data from other mutants reinforce the concept that proteins ostensibly targeted to the same organelle have different distributions, potentially the result of specific ER subdomains. The sequences of two designed mutants, RWCVN and GSCVN, resulted in EGFP-TA assignment to Calr-KDEL, while the CTS sequence WTNFK resulted in assignment of both EGFP-TA and EGFP-TA-Bcl-2-ΔSHK to the Bik landmark. These results also demonstrate that despite our not identifying by random mutagenesis (i.e., within our 995 random mutants) a CTS sequence for localization of EGFP-TA to the Calr-KDEL landmark, such sequences clearly exist.
Mutants of EGFP-TA with a CVN sequence and a single R residue that lost assignment preference to a specific ER landmark provided further insight into localization requirements: those with an R in the first position (RWCVN, RCCVN, and RFCVN) were all retained in the ER (40–53%), but less than a third of the protein was assigned to any one of the ER landmarks (Fig. S3). In contrast, mutants with an R at the second position (WRCVN, LRCVN, and VRCVN) were all assigned to the Cytb5 landmark and to locations further along the secretory pathway, including endosomes and the VAMP5 compartments.
To our surprise, only mutations that resulted in a distribution similar to the Cytb5 landmark also had partial localization at endosomes. These sequences (WRCVN, RLCVN, LRCVN, VRCVN, SLCVN, SVCVN, MVCVN, and VMCVN) share no obvious similarity and vary in hydrophobicity across the entire scale. However, export of EGFP-TA to the secretory pathway is not a defining feature of the distribution characteristic of the Cytb5 landmark, as other TA CTS sequences including DEPGH, PEHVS, PKWVT, PSNHQ, and RVRPG resulted in localizations assigned as the Cytb5 landmark without significant endosome or plasma membrane localization (Fig. S5). Thus, in this case, if the localization is defined by residence times, it must reflect residences with distinct distributions within the ER. Furthermore, the extent to which the different mutant proteins were exported from the Cytb5 compartment was variable. The mutant MVCVN was equally distributed between the Cytb5, endosome, and VAMP5 compartments, while VMCVN was assigned primarily to VAMP5. In contrast, the six TA proteins assigned to either the Bik or Calr-KDEL landmarks were not exported from the ER (Fig. S5).
The bias toward export of TA proteins assigned to the Cytb5 ER landmark suggests the intriguing possibility that the distribution of this landmark represents a subcompartment that may play an important role in sorting TA proteins. Future experimentation is needed to determine if this is unique to TA proteins or if other proteins are sorted from a different ER subcompartments. For example, the Calr-KDEL distribution may represent a subcompartment involved in transport of secretory proteins since that landmark is an ER luminal protein localized to ER in part because it is efficiently recycled from the Golgi.
Conclusions
We present a database of optically validated images for both human and murine epithelia together with a set of quantitative features that can be used to study protein localization in individual living cells. It combines (1) a large number of images to provide examples of most if not all organelle morphologies including those throughout the secretory pathway that result from dynamic movement, (2) use of an unbiased classification tool to assess subcellular localization that can be readily extended to include additional landmarks, and (3) use of a PWM approach to predict and verify subcellular localization motifs. With these tools, we identified a new motif for targeting TA proteins to mitochondria and multiple morphologically and likely functionally distinct protein distributions within the ER. The sparsity of the sequence coverage provided by 995 unique mutants suggests that our dataset contains many more targeting motifs similar to the CVN sequence that can be characterized using these techniques. Ultimately, the utility of the tools described here is limited only by the creativity of the user.
Materials and methods
Plasmid construction
All coding regions were cloned into pQCXIP, a retroviral expression vector (Clontech) encoding the monomeric fluorescent reporter protein EGFP-S65T upstream of the gene of interest, unless otherwise stated. The vector pQCXIP has an incomplete retroviral 3′ long terminal repeat (LTR) to prevent replication. Therefore, the LTR was repaired by excising the missing piece from pBabe-puro and introducing it into pQCXIP. This restored the function of 3′ LTR (pQCXIP-Repaired [R]), enabling rescue of the virus as described below. To generate pQCIXP-R-EGFP-TA, EGFP from pEGFP (Clontech) was cloned into pQCIXP-R by digestion of the pEGFP vector with AgeI-HF(NEB) and BamHI-HF(NEB). The open reading frame from human VAMP1 (Open Biosystem/Dharmacon) was amplified using Phusion High-Fidelity DNA Polymerase (NEB) and forward (Fwd) and reverse (Rev) primers with the following sequences: Fwd-VAMP1: 5′-GCGTCGACATGGAGAGCAGTGCTGCCAAGCTAAA-3′, Rev-VAMP1: 5′-GCGGATCCTCAAGTAAAAAAGTAGATTACAATAAC-3′. The resulting products were subcloned into pQCXIP-R-EGFP digested with SalI-HF (NEB) and BamHI-HF (NEB). As landmarks, the following coding regions were cloned into pQCXIP: Rab5, Ras-related protein Rab7A (Rab7), and Rab11 (kindly provided by J.C. Simpson, University College Dublin, Dublin, Ireland); Golgin84 and Bik (Open Biosystem/Dharmacon); ERGIC53, VAMP2, and VAMP5 (OriGene Technologies); Bcl-2 (TA sequence residues: 213–239 [isoform 2]; Zhu et al., 1996); PTDSS1 (Stone and Vance, 2000; kindly provided by J. Vance, University of Alberta, Edmonton, Canada); cytochrome b5 (Zhu et al., 1996); harakiri, membrin, Ras-related protein Rab3C, emerin, and ribosome-attached membrane protein 4 (OriGene Technologies); and the peroxisomal targeting sequence. The sequence encoding the TA region of MAO was assembled from oligonucleotides. pBABE-puro-GFP-wt-lamin A was a gift from T. Misteli (National Cancer Institute, National Institutes of Health, Bethesda, MD; Addgene plasmid 17662; Scaffidi and Misteli, 2008), and LAMP-1-mGFP was a gift from E. Dell’Angelica (University of California, Los Angeles, Los Angeles, CA; Addgene plasmid 34831; Falcón-Pérez et al., 2005). The following coding regions were fused to the N-terminus of EGFP-S65T in the pQCXIP vector: OTC (32 amino acids of the N-terminal sequence; Horwich et al., 1986); the mitochondrial targeting sequence from subunit VIII human CCO; a sequence encoding the N-terminal 81 amino acids of GalT; and the N-terminal 20 amino acids of neuromodulin (Skene and Virág, 1989). Calr-KDEL includes the ER targeting sequence of calreticulin fused to the N-terminus of EGFP and the ER retention sequence KDEL at the C-terminus of EGFP. A plasmid encoding the fluorescent protein mLumin was kindly provided by J, Hardy (University of Massachusetts, Amherst, MA). The coding region for Harakiri was fused to the 3′ end of the mLumin coding sequence by using SalI-HF and BamHI-HF. For expression in MCF10A, the coding regions were subcloned into pLVX-EF1a-IRES-Puro (Clontech) by using NotI-HF (NEB) and AvrII (NEB).
Random mutagenesis
Generation of random mutations in the VAMP1 CTS in EGFP-TA was performed using PCR with degenerate primers. The coding sequence for the fusion protein EGFP-TA was amplified with primers (Fwd-CF: 5′-CCGCGGCCGCACCGGTCGCCACCAT-3′, Rev-CF: 5′-GCGGAATTCCGGATCCTCAMNNMNNMNNMNNMNNTACAATAACTACCACGATGATGG-3′; Integrated DNA Technologies) containing five degenerate codons NNK at the 3′ end of the coding sequence. To generate the “CVN” mutants, the following reverse primer was used: Rev-CVN: 5′-GAATTCCGGATCCTCAATTAACACAMMNMMNTACAATAACTGCCACGATGATGGC-3′ (Integrated DNA Technologies). Fwd primer is the same that was used above. The pQCXIP-R plasmid was digested with BamHI-HF (NEB) and AgeI-HF (NEB), the PCR products were ligated into the cut plasmid using the Cold Fusion technology (System Biosciences), and the DNA was transformed into Escherichia coli bacteria DH5α. After incubation at 37°C for 12 h to complete ligation and amplify the plasmids while minimizing the number of copies of identical plasmids, the transformants were harvested and pooled, and plasmid DNA was isolated using a Presto DNA Mini Plasmid kit (FroggaBio) and packaged into retroviral particles (Phoenix cell line). Sequence analysis after rescue from cells (see below) revealed 995 unique sequences from 1,220 clones.
Cell lines and culture
The cell lines used were selected because they are both relatively genomically stable breast epithelia. Our assumption is that genomically stable cell lines would exhibit a relatively normal distribution of localization phenotypes characteristic of the cell type. NMuMG cells (a generous gift of J. Wrana, Lunenfeld-Tanenbaum Research Institute, Toronto, Canada) were cultured in DMEM, containing 10 µg/ml bovine insulin (Sigma), 10% FBS (Gibco), and penicillin/streptomycin (Wisent). The retroviral packaging cell line (Phoenix) and HEK293T were grown in DMEM (Gibco), supplemented with 10% FBS and penicillin/streptomycin. MCF10A were cultured in DMEM/F12 (Gibco) supplemented with 5% horse serum (Gibco), 10 µg/ml insulin (Sigma), 0.02 µg/ml EGF (Preprotech), 0.5 µg/ml hydrocortisone (Sigma), and 0.25 mg/liter isoproterenol (Sigma). All cell lines were maintained in a 5% CO2 atmosphere at 37°C. All cell lines tested mycoplasma-free using a PCR-based detection system (Hopert et al., 1993). MCF10A were analyzed by comparative genomic hybridization (Mills et al., 2015). Both Phoenix and HEK293T were genotyped at the Centre for Applied Genomics, SickKids, Toronto, Canada.
Transfection and transduction
Retrovirus was derived by transient transfection of pQCXIP-R-EGFP-TA into the Phoenix packaging cell line using Fugene HD (Promega). After 24 h, the virus-containing cell medium was filtered (0.45 µm, PALL) and transferred onto the target cell line. To increase the efficiency of transduction, 8 µg/ml polybrene (Sigma) was added. Stable colonies were selected in 10% FBS/DMEM containing 2 µg/ml puromycin (Sigma) and followed by sorting single cells (BD FACSAria II) expressing EGFP into individual wells of multiwall plates (TC plate 96-well, standard, F, Sarstedt). Once a colony formed, it was grown under puromycin selection as above until further analysis.
Rescuing of proviral DNA
To retrieve the nucleotide sequences of the mutants, cell clones individually transduced with the replication-repaired pQCXIP-R-EGFP-TA mutants were seeded in a 96-well format plate (TC plate 96-well, standard, F, Sarstedt) and transfected with the pCL-ECO packaging vector (Imgenex) using Fugene HD (Promega). Transfection with pCL-ECO enables virus production from the cells transduced with pQCXIP-R-EGFP-TA mutants. After 48 h, the supernatant was collected for viral RNA extraction (Murdoch et al., 1997). Briefly, Trizol LS (ThermoFisher) was added to the supernatant in a 3:1 ratio along with chloroform and yeast tRNA under RNase-free conditions. After centrifugation and isopropanol precipitation, viral RNA was subjected to reverse transcription by using SuperScript III Reverse transcription (ThermoFisher) to generate cDNA. Sequence analysis was performed at the Centre for Applied Genomics sequencing facility.
Lentivirus production and transduction
To express all landmarks in the human cell line MCF10A, the coding regions were subcloned into the lentiviral vector pLVX-EF1a-IRES-Puromycin (Clontech). The lentiviral DNA plasmid and both pPAX2 and pMD2.G plasmids were transfected into HEK293T cells at 1:1:0.1 ratios by means of calcium phosphate precipitation. Prior to transfection, the three plasmids were briefly mixed with sterile 1× Hepes buffered saline. 1× Hepes buffered saline is composed of Hepes (5 g/liter), NaCl (8 g/liter), dextrose (1 g/liter), KCl (3.7 g/liter), and Na2HPO4 × 7H2O (0.19 g/liter). This mixture was then supplemented with 2.5 M CaCl2 to a final concentration of 0.14 M and incubated for 20 min at room temperature before it was added to the HEK293T cells.
After 48 h, the virus-containing supernatant was collected from the HEK293T cells and filtered through a 0.45-µm filter unit and transferred onto MCF10A cells. To obtain stable cell lines, selection was performed by adding puromycin (2 µg/ml) 48 h after transduction.
Live cell imaging
NMuMG and MCF10A cells expressing one of the landmarks or EGFP-TA mutants were seeded in separate wells in 384-well microplates (CellCarrier-384 ultra, B128 SRI/160; Perkin Elmer) and allowed to grow for 24 h before staining with the nuclear dye DRAQ5 (5 nM; Biostatus). To account for any technical interference that may affect the classification output, cell clones expressing the landmarks or mutants were imaged on multiple days and positions within the wells. In addition, the positions within the plates where the cell clones were grown and imaged was varied. Where indicated, cells were also stained with either Mitotracker Red (500 µM; ThermoFisher) or BODIPY-thapsigargin (500 µM; Setareh Biotech) according to the manufacturer’s instructions. Plates were imaged (two to eight wells, 20–30 fields of view) on two different spinning disk automated confocal microscopes of the same model (OPERA QHS; PerkinElmer) with 40× water objectives (NA = 0.9) in a defined temperature (37°C) and CO2 (5%) environment by using EvoShell acquisition software. Images were collected using 3-Peltier-cooled 12-bit CCD cameras (Type sensiCam, camera resolution 1.3 megapixels; PCO.imaging) either with a binning of two or unbinned. Unbinned images were binned numerically before segmentation and feature extraction. Imaging the controls and some query lines on multiple microscopes, with and without camera binning, enabled feature counter selection to remove features characteristic of the imaging platforms. Assessment of new query cell images was processed as described above along with a set of established landmarks as controls.
Image processing
Image segmentation
To identify individual cells, as well as nuclear and cytoplasmic areas for each cell in fluorescence micrographs, image segmentation was performed using PerkinElmer Acapella 2.0 software (Nuclei Detection Algorithm A). The nuclear detection algorithm includes following parameters: threshold adjustment (1.5), individual threshold adjustment (0.4), minimum nuclei distance (15), nuclear splitting adjustment (15), minimum nuclear area (300), and minimum nuclear contrast (0.1). Briefly, the nuclear region was identified using an image of DRAQ5 staining. Using the nucleus as a seed, the cell region was identified using a watershed algorithm and the low-level cytoplasmic staining due to DRAQ5. The region of the cell that is not in the nucleus is considered to be the cytoplasm. Additionally, a ROI mask was computed by identifying all the pixels that are 1.5 times the mean pixel intensity of the cell mask in the EGFP channel. Regions with at least 30 contiguous pixels were retained as ROIs. Images of cells touching the micrograph edge were discarded.
Removing out-of-focus cells
To investigate the steady-state localization of a protein in cells, images of dividing or dying cells were removed as they have been shown to interfere with classification (Huang and Murphy, 2004). Such cells are typically out of focus and exhibit a different texture and morphology of the nucleus in the DRAQ5 channel compared with other cells. These features enabled the generation of an automated image quality control algorithm that effectively removed images of cells that were out of focus, dividing, or dead. To train a classifier to remove these images, we collected cell images of landmarks that were in-focus and 4 µm below the in-focus plane. The cells were identified by image segmentation, and image features were calculated. Since information regarding cell focus can be obtained from the nuclear staining, which should be similar for cell clones independent of the EGFP protein expressed, we used texture and morphology features from the DRAQ5 images to create a binary random forests classifier and identity features necessary to separate cell images that are in focus and out of focus. Using the new feature subset, a new random forests classifier was built to identify within fields of view images of individual cells that are in focus and out of focus. The classifier was then applied to the entire dataset to remove individual cell images that were out of focus.
Examination of the rejected images revealed that this classifier also removed many cells that, when visually inspected, appeared to be in focus, but were usually not correctly segmented, suggesting there may have been a minor problem in focus. Our approach differs from previous work on focus quality control in which an entire image field was identified as out of focus rather than images of individual cells within an image field (Bray et al., 2012). The sensitivity for individual class before and after removal of out-of-focus cells is shown in Fig. S1.
Removing cells with improper segmentation
Cell image feature extraction
Features were measured for nuclear and cytoplasmic (including ROI) areas for DRAQ5 staining and EGFP expression, respectively, using a custom PerkinElmer Acapella script (code available on https://github.com/DWALab/Schormann-et-al). For each cell image, 495 morphological and statistical image features were calculated (Collins et al., 2015). Micrographs of individual cells were automatically selected for inclusion in the reference library based on numerical assessment of image and segmentation quality (R1 and R2; Fig. S1).
Feature selection
For classification of localization, a reduced set of features was selected to minimize (1) the impact of intensity variations due to changes in the intensity of the illumination or efficiency of collection, and (2) features that were sensitive to differences in imaging instrument or data collection scheme. The final feature list comprises 160 features, i.e., TAS (n = 140), morphology (n = 18; Boland and Murphy, 2001), and texture (radial moment and angular second moment; Haralick et al., 1973), that were derived exclusively from the EGFP channel. A spot detection script implemented in PerkinElmer Acapella software was used. All feature data tables and a description of feature calculation is available on GitHub (https://github.com/DWALab/Schormann-et-al).
Using the images of NMuMG cells expressing the landmarks, the following steps were taken to select features most relevant for the final classification.
We first retained only features pertaining to the EGFP channel to eliminate influence of nuclei-derived features on protein localization. Further, only texture TAS and shape features were retained to minimize the effects of protein expression on the final classification.
To minimize variation due to different microscopes, a random forests classifier was created to output features that can separate individual landmark cell images based on which microscope they were imaged on. This was done for each landmark separately. We then computed the frequency for each feature, i.e., the number of times the same feature was important in separating the microscopes across landmarks. Features that were important to separate microscopes for at least five landmarks were removed.
Correlations between all features and cell intensity were computed. Features with correlation values between −0.5 to 0.5 were retained.
A minimum set was identified from the remaining features using a “leave one feature out” test of classification accuracy. For each feature, a classifier was built in which that feature was omitted. Features were retained if the classification accuracy decreased when the feature was removed.
Intensity preprocessing
For further data processing and final classification, only features from the EGFP channel were used. Cells with <100 intensity units of EGFP expression were eliminated from the analyses. The threshold value of 100 intensity units was obtained empirically by manually thresholding 20 randomly selected EGFP channel images into foreground and background regions. All the features were scaled to have zero mean and unit SD.
Further, for each landmark and mutant dataset, cells with integrated total intensity in the top or bottom fifth percentile were removed. The remaining 789,011 and 523,319 validated landmark cell images of NMuMG and MCF10A cells, respectively, constitute the reference libraries uploaded to the Image Data Resource (idr0072; https://idr.openmicroscopy.org) and have been made available as a resource for general use.
Quantify effects of protein expression
To determine whether differences in the expression levels of the landmark proteins contributed to the localizations assigned by the random forests classifier, landmark images were divided into four different classes based on intensity quartiles (0–25, 25–50, 50–75, and 75–100 quartiles). Cross-validation was performed by training on the landmarks from a quartile bin and using the other quartile bins as queries (Fig. 3, quartile 1–quartile 4). When assessed for all quartiles, the landmarks are classified with high accuracy irrespective of the quartile fraction used for training, thus independent of protein expression levels. To further verify that assignment of localization was not due to differences in intensity, especially for proteins ostensibly targeted to the same organelle, images were randomly selected for training and classification in which the query and landmark had similar distributions of intensities (see below).
For each mutant, thousands of cells were imaged from independent experiments performed on multiple days. Further cells were deliberately imaged from different plate locations. Pooling training set cell images from independent experiments and from different automated microscopes reduces the impact of minor fluctuations between experiments. After filtering cell images for focus, segmentation, and expression artifacts, the random forests classifier created using the reference library was then used to classify images of cells not used in training to individual landmarks (Breiman, 2001).
Classification of the cell images
Random forest classifiers were used to separate the landmarks. Code scripting was performed in the MATLAB environment (R2012b, MathWorks,) including following the RF package (https://github.com/ajaiantilal/randomforest-matlab/tree/master/RF_Class_C). All codes are downloadable from GitHub (https://github.com/DWALab/Schormann-et-al). The classifier was generated setting the “mtry” variable to 12 (approximate square root of the total number of features). The “mtry” variable specifies the number of features that are available for the classifier at each split point when building decision trees. The number of trees was set to 500. Due to differences in the number of cell images for each landmark, we randomly selected 3,200 (70% of Calr-KDEL, the landmark with the smallest number of cells) cell images per landmark for training. The rest of the data were used for testing. To avoid bias due to random sampling, five different random forest classifiers were used, each with randomly sampled data from the landmarks. Every unknown cell image was classified using all five classifiers. Each classifier assigns a cell to a specific landmark. For any cell, the final class was decided using the mode of the predicted class among the five different classifiers.
To establish the extent of variation between classifications, 20 classification runs were performed on mitochondrial and CVN mutants. The result is presented as mean values (percentage classified) including SD (see Table S1). In general, the variation in percentage of cells assigned to a particular location was <1.
Classification of cell images within intensity range
For control experiments using cells with a defined range of intensities, cells were automatically selected from the image dataset based on total cellular EGFP intensity. The intensity ranges used were 300–500 and 200–1,000 arbitrary units for mitochondrial and CVN tail-anchored mutants, respectively. To display the intensity profiles as continuous lines, the cell intensities were fit to a density curve (Fig. 7). Comparison of the classification results confirmed that intensity differences did not contribute significantly (Fig. 7 and Fig. S5).
Visualization of the spatial relationships of the landmark cell images in 160-dimensional space
We used the t-SNE algorithm (Van der Maaten and Hinton, 2008) to reduce the multidimensional data into a two-dimensional plot using code obtained from the author’s website (Van der Maaten and Hinton, 2008). We set the perplexity value to 10 and used 1,000 iterations to generate the two-dimensional representation. Due to the limitations of the algorithm regarding memory and computation cost, we did not use all the data for visualization. For each landmark only the 200 cells nearest to the median of the landmark were used for visualization. This resulted in a total of 3,400 (murine landmark library) and 4,000 (human landmark library) data points for visualization.
Naming system
When organelle names are used, it is to designate multiple landmarks in aggregate. The following labeling was used: ER corresponds to the landmarks Cytb5, Calr-KDEL, and Bik; Golgi apparatus corresponds to GalT and Golgin84; endosome corresponds to Rab5 and Rab7; MAM is represented by PTDSS1; mitochondria correspond to MAO and CCO; and ERGIC is represented by ERGIC53.
Motif visualization
Visualization of amino acid motifs was performed using pLogo plots (O’Shea et al., 2013). The foreground sequences were all the sequences that belonged to the cluster, while the background sequences were all the rest of the mutants generated by random mutagenesis with a CTS five amino acids long.
PWM
As an alternative approach to predict the localization of an unknown sequence of amino acids, a PWM was computed for the amino acid sequences of the CTS for mutants assigned to a specific localization. A foreground-normalized frequency matrix was first computed by calculating for each position the frequency of each amino acid at that position. If an amino acid was not present at a given position, its probability would be zero and would bring the likelihood computation to zero. To avoid this, a pseudocount value of 0.25 was added to the raw frequency for all amino acids for all positions. Thus, the frequency matrix was divided by the number of sequences in the cluster plus a value of 5 (0.25 × 20), resulting in a normalized frequency matrix. The background-normalized frequency matrix was calculated for all the sequences that were generated using random mutagenesis with a five–amino acids CTS. In this case, we did not use pseudocounts as every amino acid was represented at least once at every position. The foreground-normalized frequency matrix was then divided by the background frequency matrix. The log of the resultant gives the PWM.
The simple rules and PWMs for predicting mitochondrial localization were tested on the database of sequences having five amino acids in the CTS. The list of predicted localizations based on the simple rule or PWM was compared with the “true assignments” based on image classification. The PPV, sensitivity, and specificity were calculated from the predicted and true assignments.
Statistical enrichment
All enrichments were computed by fitting a hypergeometric distribution. For all calculations (added hydrophobic core, positive charges, and negative charges) except enrichment of actual length of the CTS, only sequences with five amino acids in the CTS were used. All of the sequences were used to calculate the enrichment for the lengths of the CTS.
Online supplemental material
Fig. S1 illustrates the image processing pipeline used and sample images. Fig. S2 provides Euclidean distances between centroids of the landmarks in NMuMG cells as a colored heat map and as a table of values. Fig. S3 provides heatmaps of the amino acid enrichments associated with different localizations for EGFP-TA mutants. Fig. S4 provides sample images at high magnification demonstrating that images of different mitochondrial localized proteins correctly assigned by classification care not reliably distinguished visually. Fig. S5 provides heatmaps of the localization assignments for the XXCVN mutants to different distributions within the ER. Table S1 shows mean values (including SD) of 20 classification runs of CVN and mitochondrial mutants.
Acknowledgments
The authors thank Robert Mullen for critical reading of the manuscript and Jarkko Ylanko for technical support with the high-content imaging.
Funding for this research was provided by a foundation grant from the Canadian Institutes of Health Research (FDN 143312) to D.W. Andrews. D.W. Andrews holds the Tier 1 Canada Research Chair in Membrane Biogenesis.
The authors declare no competing financial interests.
Author contributions: All authors participated in experimental design and data interpretation. W. Schormann and S. Hariharan performed experiments, imaged the cells, and analyzed the data. W. Schormann generated all of the mutants and cell lines. S. Hariharan carried out programming for and automated image analysis and informatics analysis. D.W. Andrews supervised the project. All authors participated in writing, reviewing, and editing the manuscript.
References
Author notes
W. Schormann and S. Hariharan contributed equally to this paper.