Data sharing is important in the biological sciences to prevent duplication of effort, to promote scientific integrity, and to facilitate and disseminate scientific discovery. Sharing requires centralized repositories, and submission to and utility of these resources require common data formats. This is particularly challenging for multidimensional microscopy image data, which are acquired from a variety of platforms with a myriad of proprietary file formats (PFFs). In this paper, we describe an open standard format that we have developed for microscopy image data. We call on the community to use open image data standards and to insist that all imaging platforms support these file formats. This will build the foundation for an open image data repository.
Recent letters and editorials have highlighted the importance of open access to the large datasets now being collected by biologists in laboratories around the world (COSEPUP, 2009; Field et al., 2009; Schofield et al., 2009). Researchers, universities, and funding bodies all agree that scientific data produced from public- and charity-funded research (not just the results, but complete workflows including raw data) should be shared and accessible. The arguments in favor of open access data are now well established, and protocols and principles for data sharing are emerging (http://sciencecommons.org/projects/publishing/open-access-data-protocol). However, access to and sharing of scientific data require substantial effort and investment to define specifications and build resources to support them. For the successful sharing of DNA sequence data, the genome communities built, maintained, and in some cases fought for the standards and resources that were ultimately accepted by the whole community. This effort laid the foundation for the release of genomic data and the development of online resources, accessible by anyone, for any purpose, that now underpin all modern biomedical research.
We believe the imaging community can achieve the same success for digital image data. In this paper, we review the current status of online biological image repositories and provide a set of recommendations to drive the use of open standardized data formats in biological microscopy as a prerequisite for creating a global image data repository.
Scientific image data repositories for the life sciences
In December 2008, the Journal of Cell Biology (JCB) launched the JCB DataViewer, an online repository for original image data in the life sciences (Fig. 1). To our knowledge, this system is the first open repository that enables routine archiving and sharing of original image datasets supporting published scientific articles. One key attribute of the JCB DataViewer that distinguishes it from past and current data repositories is that the original binary data and metadata (additional information captured by acquisition software about an image, such as the instruments used, acquisition settings, image size, and resolution) are preserved and accessible to the community. As of this writing, the JCB DataViewer contains 6,446 multidimensional (5D; including space, channel, and time) images in support of 186 published articles. The JCB DataViewer is a customized application based on the open source and open development Open Microscopy Environment (OME) Remote Objects (OMERO) and Bio-Formats projects, released by the OME Consortium (http://openmicroscopy.org).
One goal of the JCB DataViewer was to initiate the development of a functional, scientifically valuable online image repository. The first step was to make original data available alongside a publication for examination by reviewers and readers of a submitted or published manuscript. Currently, the JCB DataViewer allows access to original data for viewing, simple measurement, and review, but users cannot download the original data files, and sophisticated image analysis and querying tools are not included in the application. In the next update, users will be able to download video versions of data stored in the JCB DataViewer, and original image data will be available in an open, standardized data format that preserves the original image metadata (OME tagged image file format [TIFF]). Authors will also retain access to their original data, thereby making the JCB DataViewer an archive where authors can store their own published data. These updates represent one more step toward the development of a fully functional data repository.
The data in the JCB DataViewer are freely available to the public immediately upon publication, without a subscription to the JCB. In the future, as image repositories mature, we plan to merge the data held in the JCB DataViewer with whatever resources emerge as the definitive public repository of image data in the life sciences.
The JCB DataViewer is one of a growing number of image data repositories that are now available, each focused on providing access not only to results but also to some combination of sophisticated visualization, analysis, and mining of these complex data (Table I and Fig. 2). Each of these efforts has emphasized specific applications and functionality and reflects the simple fact that the diversity of scientific exploration and images cannot yet be addressed by a single resource. However, there are ongoing efforts to align data models where possible, and perhaps most importantly, simplify submission and subsequent processing through the definition and use of file formats that support standardized metadata. These are examples of real progress toward the goals that many have discussed and that have recently been reiterated (COSEPUP, 2009; Field et al., 2009; Schofield et al., 2009).
In summary, significant effort by peer-reviewed, competitively funded groups in the US and Europe has produced image informatics tools that the research community uses. The tools and resources are by no means finished, and our current status seems analogous to the state of the genomics resources in the mid-1980s, when individual authors submitted their own sequence data to GenBank, SWISSPROT, and others. The diversity of imaging platforms, experiments, techniques, and data makes this analogy only partially correct and undoubtedly makes the challenge of building and running scientifically useful image repositories harder. Regardless, the sophistication of centralized scientific image resources is growing, and as a result, so will the value they deliver to the scientific community. Those resources that depend on submissions from the community will require the development, adoption, and use of standardized file formats that support as rich a metadata structure as possible. This is why the development and use of standardized image data and metadata formats are so important.
Microscopy file formats
Many laboratories have at least one sophisticated imaging system, and many large shared-use facilities provide access to an array of imaging systems. After many years of innovation and development, modern digital imaging systems enable temporally and spatially resolved, multichannel measurement and visualization of molecular and ion concentrations in cells and tissues. Emerging imaging techniques such as multispectral, polarization, fluorescence lifetime, and fluorescence correlation are extending the complexity of analysis of biological cells and tissues. This rapid growth and evolution within the field is a double-edged sword. It certainly enables new discovery and insight. However, most digital microscope imaging systems, whether commercial products or laboratory prototypes, are usually run by custom software that saves and processes data using a PFF. In general, every new imaging platform comes with a new PFF, so rapid advances in imaging simultaneously make data exchange and access more difficult. To realize the dream of open data access and sharing, we first must solve the basic problem of accessing the data contained in PFFs. Any solution will not directly lead to new scientific insights, but it is a prerequisite for submission to repositories and the discoveries they enable through reanalysis. For example, if the data from cell-based phenotypic screens were available, they could be reanalyzed for aberrations that were not of interest to the investigators who did the original screen.
Generally speaking, image data are written in formats that include the binary data (the actual image measurement) along with some representation of the metadata: the size of the binary data, its dimensions, acquisition system settings, and any other information that the developer of the acquisition software considered useful. In our experience, storage of binary data in many commercial microscopy formats is based on common formats (TIFF, HDF5, OLE2, etc.) or other formats that most software tools can read (although there are some notable, extreme exceptions). The much more challenging problem is the metadata. Because standards are not yet agreed upon, microscope and imaging companies define their own metadata formats in their PFFs, and these are often incompatible with those from competing companies.
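To make the split between binary data and metadata concrete, the following is a minimal sketch, not a description of any real file format. The `ImageRecord` class, its field names, and the pixel-type table are all hypothetical. It only illustrates that the metadata (dimensions and pixel type) determine how large the binary data must be, which is the kind of check a reader can perform once metadata are openly specified.

```python
from dataclasses import dataclass
from math import prod

# Hypothetical lookup of bytes per pixel for a few common pixel types.
BYTES_PER_PIXEL = {"uint8": 1, "uint16": 2, "float32": 4}


@dataclass
class ImageRecord:
    """Illustrative container: raw pixel bytes plus the metadata describing them."""
    pixels: bytes        # the binary data (the actual image measurement)
    shape: dict          # axis name -> size, e.g. x, y, z, channel, time
    pixel_type: str      # key into BYTES_PER_PIXEL
    acquisition: dict    # free-form acquisition settings

    def expected_bytes(self) -> int:
        # Size implied by the metadata: product of all axis sizes times pixel width.
        return prod(self.shape.values()) * BYTES_PER_PIXEL[self.pixel_type]

    def is_consistent(self) -> bool:
        # True when the binary data are exactly as large as the metadata claim.
        return len(self.pixels) == self.expected_bytes()


record = ImageRecord(
    pixels=bytes(96),
    shape={"x": 4, "y": 3, "z": 2, "c": 2, "t": 1},
    pixel_type="uint16",
    acquisition={"objective": "60x", "exposure_ms": 100},  # made-up values
)
print(record.expected_bytes(), record.is_consistent())
```

When the metadata layout is proprietary, a third-party reader cannot perform even this basic consistency check without reverse engineering the format.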
Since 2000, the OME has been dedicated to building tools for specification, management, and sharing of biological light microscopy data (Swedlow et al., 2003, 2009; Goldberg et al., 2005). OME has developed and released the OME Compliant specification (Fig. 2), which covers most of the metadata in PFFs from many sources and includes most of the fundamental imaging metadata in cell and developmental biology. This specification, used within the context of a TIFF file (OME-TIFF), provides a simple, easy-to-use format for microscope imaging data that can be used by any software that reads the TIFF file format. Several commercial imaging systems now support OME-TIFF in their software. A popular tool (>13,000 installations worldwide) is Bio-Formats, a software library that interfaces with a large number of software tools (such as ImageJ), enables the reading of >75 PFFs, and supports output to OME-TIFF.
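As a rough illustration of why an openly specified metadata block is easy for third-party software to use, the sketch below reads the image dimensions from a small OME-XML fragment with only the Python standard library. The fragment is hand-written for this example and is not a complete or validated OME-XML document. Its namespace comes from a schema release later than this article, so the lookup matches element names without relying on the namespace. In an OME-TIFF file the XML would be carried inside the TIFF itself; here it is supplied as a string to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Hand-written, illustrative fragment (assumption: not a full OME-XML document).
OME_XML = """<OME xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06">
  <Image ID="Image:0" Name="example">
    <Pixels ID="Pixels:0" DimensionOrder="XYZCT" Type="uint16"
            SizeX="512" SizeY="512" SizeZ="10" SizeC="3" SizeT="20"/>
  </Image>
</OME>"""


def local_name(tag: str) -> str:
    # Strip the "{namespace}" prefix that ElementTree puts on qualified tags.
    return tag.rsplit("}", 1)[-1]


def pixel_dimensions(ome_xml: str) -> dict:
    """Return the 5D sizes, dimension order, and pixel type of the first Pixels element."""
    root = ET.fromstring(ome_xml)
    for element in root.iter():
        if local_name(element.tag) == "Pixels":
            dims = {axis: int(element.attrib["Size" + axis]) for axis in "XYZCT"}
            dims["order"] = element.attrib["DimensionOrder"]
            dims["type"] = element.attrib["Type"]
            return dims
    raise ValueError("no Pixels element found")


dims = pixel_dimensions(OME_XML)
# Number of 2D planes in the 5D image: one per (Z, C, T) combination.
plane_count = dims["Z"] * dims["C"] * dims["T"]
print(dims, plane_count)
```

The same few lines work for any file that follows the specification, which is the practical benefit of a shared metadata model over per-vendor layouts.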
Future directions and recommendations
For many years, the imaging community has expressed a desire to move away from the current ad hoc approach toward more defined standards for metadata representation (Goldberg et al., 2005). However, creating a reasonable standard takes years of community discussion and effort. For the standard to be successful, it must be widely used and functional enough to be worth the effort of conformance, and it takes time for the “snowball effect” to occur. Given the diversity and rapid evolution of imaging applications in biology, we don’t believe that standards can be mandated by any one entity. Instead, we argue that standards for biological imaging must be supported and developed, and once they are valuable for scientific discovery and data sharing, and have demonstrated the ability to rapidly adapt to new technologies, the community must demand the support of these formats in the commercial platforms they purchase. Under the umbrella of the OME, we have been collecting community feedback for several years now, and our recommendations for this process are detailed in Box 1.
Box 1. Recommendations for use of PFFs
Image metadata must be associated with the binary image data, preferably as a single file.
Microscope systems must not store metadata in proprietary databases that are available only on the data acquisition system.
Metadata must be readable by third party software using a common, openly accessible software package or library. PFF developers must work with developers of open translation libraries to ensure their format is correctly interpreted.
Scientists must use image processing and analysis tools that preserve image metadata.
Image data must reflect the original measurement. If compression is supported, the user must be given the option of saving uncompressed or losslessly compressed images (which allows the exact original data to be reconstructed after compression). If compression or encryption is used, the algorithm and parameters must be stated and stored in the metadata.
Commercial software programs must provide data export to an open metadata specification. To ensure that commercial software writes these formats correctly, open, freely available libraries and format validators must be available to enable compliance.
Public and charity funding for imaging systems must include a requirement that the system writes data in an open, accessible format, wherever possible.
All file formats must use versioning to reflect any changes in the data model.
When PFFs must be used, new versions must be announced to the scientific community, and users and funding bodies must predicate their purchases on this type of support for the scientific community.
Once a standardized repository is available, journals must require deposition of original data supporting scientific manuscripts.
In some cases, PFFs are needed to ensure the proper performance of the acquisition system. However, in our experience with Bio-Formats, OMERO, and the JCB DataViewer, most of the data we have seen could be recorded in an open, standardized, multidimensional file format.
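Two of the recommendations in Box 1 (lossless compression with the algorithm and parameters recorded in the metadata, and explicit format versioning) can be sketched in a few lines. This is an illustrative toy, not an OME or Bio-Formats API. The record layout, the `FORMAT_VERSION` string, and the `pack` and `unpack` helpers are hypothetical, and zlib stands in for whatever lossless codec a real format would use.

```python
import hashlib
import json
import zlib

FORMAT_VERSION = "1.0"  # hypothetical version of this toy data model


def pack(pixels: bytes, level: int = 6) -> dict:
    """Losslessly compress pixel bytes and record how it was done in the metadata."""
    metadata = {
        "format_version": FORMAT_VERSION,
        "compression": {"algorithm": "zlib", "level": level},
        "uncompressed_bytes": len(pixels),
        "sha256": hashlib.sha256(pixels).hexdigest(),
    }
    return {"metadata": json.dumps(metadata), "payload": zlib.compress(pixels, level)}


def unpack(record: dict) -> bytes:
    """Recover the exact original bytes, using only what the metadata state."""
    metadata = json.loads(record["metadata"])
    if metadata["compression"]["algorithm"] != "zlib":
        raise ValueError("unsupported compression algorithm")
    pixels = zlib.decompress(record["payload"])
    # Verify that the reconstruction is bit-for-bit identical to the original.
    if hashlib.sha256(pixels).hexdigest() != metadata["sha256"]:
        raise ValueError("pixel data do not match the recorded checksum")
    return pixels


original = bytes(range(256)) * 64  # stand-in for raw pixel data
record = pack(original)
restored = unpack(record)
print(restored == original)
```

Because the algorithm, its parameters, and the format version travel with the data, a reader written years later can still reconstruct the original measurement without access to the acquisition software.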
As the number of imaging systems and the rate of innovation grow, maintaining a tool like Bio-Formats, simply because commercial vendors do not use standardized file formats, becomes increasingly untenable. Reverse engineering is slow and inherently error prone, as metadata stored in PFFs are decoded and translated. As popular as Bio-Formats is, it is time to reconsider the value PFFs deliver for a specific commercial product against the costs, which are paid for by public and charity funding: lost time for scientific researchers, inhibited collaborations, and impeded access to data using the aforementioned emerging data repositories.
Many scientific funding bodies now require the published output from the work they fund to be deposited in open access repositories. The same open access principle should be extended to the data generated through their funding to enable broad dissemination and further analysis. As with other forms of data, there is no requirement to publish all images associated with a paper, just the ones that form the definitive representation of the reported discovery. The OME, International Society for Advancement of Cytometry (http://www.isac-net.org), and Digital Imaging and Communications in Medicine (http://medical.nema.org/dicom/) formats are all well developed, supported, and available for use. It may be that no single format can satisfy every requirement or data type, but our experience demonstrates that the vast majority of the data used to support scientific publications can be properly stored in these formats. We can support a range of open file formats with Bio-Formats, thus allowing interconversion between open file formats where necessary. We have developed the OME metadata standards through extensive direct experience and discussion with the user and commercial developer communities. We plan to use them as we progress to the development of a public repository but remain open to suggestions about how they can be improved.
As noted in Box 1, the adoption of these file formats won’t happen by itself; the community must work to drive it. Individual scientists and their funding bodies must require support for these formats when they purchase or fund new imaging systems. The argument for this concerted action is based on a simple, practical goal: scientific data, funded by the public and nonprofit charities, must be publicly available. Over the next few years, the technical capabilities in image repositories will mature. Data to fill these repositories must be open, accessible, and ready for use.
We thank Dr. Alexia Ferrand for preparation of samples for structured illumination data and Angus Lamond for critical reading of the manuscript.
Work on OME in J.R. Swedlow’s laboratory is supported by the Wellcome Trust (grant 085982) and the Biotechnology and Biological Sciences Research Council (grant BB/G022585).