Data management is the intentional process of collecting, storing, processing, and protecting data for research, and it is vital for effectively testing hypotheses and initiating peer-review. Good data management will allow data to be preserved at a high quality, so the data are discoverable, accessible, understandable, and reproducible now and into the future.
This page provides an overview of data management planning and practices that will help you successfully organize, preserve, and share your data. For a detailed explanation of these recommendations and more, read the section Managing Ecological Data in:
Recknagel, F., W. Michener. 2018. Ecological Informatics. Springer, Cham. https://doi.org/10.1007/978-3-319-59928-1
Also see Standards and Practices for NASA's Earth Science Data and Information System Project.
The best practices for data management are summarized below. For a detailed explanation of these recommendations, see:
Cook, R.B., Y. Wei, L.A. Hook, S.K.S Vannan, J.J. McNelis. 2018. Preserve: Protecting Data for Long-Term Use. In: F. Recknagel, W. Michener (eds) Ecological Informatics. Springer, Cham. https://doi.org/10.1007/978-3-319-59928-1_6
Define the parameters
For others to use your data, the contents of the dataset must be understandable, including the parameter names, units of measure, formats, and definitions of coded values. Standardize the parameters across files, datasets, and the project by using commonly accepted parameter names and abbreviations. Develop a data dictionary that defines each parameter, attribute, or variable. To ensure consistency, follow standards established for data interoperability, like the Climate and Forecast (CF) Metadata Convention.
The units of parameters should be stated in the data dictionary. SI units are preferable, but each discipline may have its own commonly used units of measure.
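As an illustration only, a data dictionary can be kept in a machine-readable form and used to check that every column in a file has been defined. The parameter names, units, and the helper `undefined_columns` below are hypothetical examples, not part of any standard.

```python
# Hypothetical data dictionary: one entry per parameter, with units and a definition.
DATA_DICTIONARY = {
    "site_id": {"units": "none", "definition": "Unique site code"},
    "air_temp": {"units": "degC", "definition": "Air temperature at 2 m"},
    "precip": {"units": "mm", "definition": "Daily total precipitation"},
}


def undefined_columns(header, dictionary=DATA_DICTIONARY):
    """Return the column names in a file header that the dictionary does not define."""
    return [name for name in header if name not in dictionary]


# "wind" is reported because it has no dictionary entry yet.
print(undefined_columns(["site_id", "air_temp", "wind"]))
```

Running a check like this before sharing a file helps keep parameter names consistent across files and datasets.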
Date & Time
Use International Organization for Standardization (ISO) standard date formats: yyyy-mm-dd or yyyymmdd. If only the month or only the year is of interest, use yyyy-mm or yyyy. Use 24-hour notation (13:30 instead of 1:30 p.m. and 04:30 instead of 4:30 a.m.). Report times in both local time (with the time zone defined in the documentation) and Coordinated Universal Time (UTC); if only one can be reported, UTC is preferred. For more information on this standard, see ISO 8601:2004.
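A minimal sketch of these date and time conventions, using only the Python standard library. The function name `to_iso_local_and_utc` and the fixed UTC offset argument are assumptions made for this example; a real project may prefer named time zones.

```python
from datetime import datetime, timedelta, timezone


def to_iso_local_and_utc(year, month, day, hour, minute, utc_offset_hours):
    """Return (local, UTC) timestamps as ISO 8601 strings in 24-hour notation."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    local = datetime(year, month, day, hour, minute, tzinfo=local_zone)
    utc = local.astimezone(timezone.utc)
    return local.isoformat(timespec="minutes"), utc.isoformat(timespec="minutes")


# 1:30 p.m. local time at a site five hours behind UTC.
local_stamp, utc_stamp = to_iso_local_and_utc(2021, 7, 4, 13, 30, -5)
print(local_stamp)  # 2021-07-04T13:30-05:00
print(utc_stamp)    # 2021-07-04T18:30+00:00
```

Writing the offset into the local timestamp keeps the time zone unambiguous even if the documentation is separated from the data.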
Coordinates & Geospatial Information
Report coordinates in decimal degrees (≥ 4 decimal places). Record south latitudes and west longitudes as negative values. All location information in a dataset should use the same coordinate system, including coordinate type, datum, and spheroid. Define the projection and coordinate/spatial reference system, spatial extent, spatial resolution, boundary, and scale. Define fill values, valid ranges, scale factors, and offset of the data values in your documentation.
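The sign convention above can be sketched as a small conversion from degrees, minutes, and seconds to signed decimal degrees. The function names and the default of four decimal places (the minimum suggested above) are choices made for this example.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere, places=4):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError("hemisphere must be one of N, S, E, W")
    value = degrees + minutes / 60 + seconds / 3600
    if hemisphere in ("S", "W"):
        value = -value  # south latitude and west longitude are negative
    return round(value, places)


def is_valid_lat_lon(latitude, longitude):
    """Check that a coordinate pair falls inside the possible range of values."""
    return -90 <= latitude <= 90 and -180 <= longitude <= 180


latitude = dms_to_decimal(35, 55, 48, "N")
longitude = dms_to_decimal(84, 18, 36, "W")
print(latitude, longitude, is_valid_lat_lon(latitude, longitude))
```

This sketch handles only the notation; the datum and coordinate reference system still need to be stated in the documentation.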
Coded Fields & Data Flags
Define and standardize coded fields in your data. A separate field may be used for quality considerations, reasons for missing values, or indicating replicated samples. Codes and flags should be consistent across parameters and data files within a dataset. Definitions of flag codes should be included in the dataset documentation.
Use consistent notation for no-data values throughout your dataset. For numeric fields, it is common to represent missing data with a specified extreme value (e.g., -9999). For character fields, use "NA".
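A short sketch of why the no-data value must be documented and handled explicitly: if a placeholder such as -9999 is treated as a measurement, summary statistics become meaningless. The constant and function names here are hypothetical.

```python
MISSING_NUMERIC = -9999.0  # no-data value, as it would be stated in the documentation


def mean_ignoring_missing(values, missing=MISSING_NUMERIC):
    """Average the valid values, skipping the no-data placeholder."""
    valid = [value for value in values if value != missing]
    if not valid:
        return None  # every value was missing
    return sum(valid) / len(valid)


print(mean_ignoring_missing([10.0, -9999.0, 20.0]))  # 15.0
```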
Use consistent data organization
There are two common ways to organize tabular data. In either case, each separate line (i.e., row) represents an observation and is a complete record. Often, the columns represent all the parameters that make up the record. Similar to a spreadsheet, this is the potentially "short and fat" style of data organization. However, if most parameters in a record do not have measurements, you can define the parameter and value in two columns. Other columns can be used to describe the measurement, like site information, date, units of measure, etc. This is the "long and skinny" style of data organization.
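The difference between the two layouts can be illustrated with a small conversion from the "short and fat" style to the "long and skinny" style. This is a sketch under simple assumptions: records are dictionaries of strings, an empty string means that no measurement was made, and the column names are invented for the example.

```python
def wide_to_long(rows, id_fields):
    """Turn one-column-per-parameter records into parameter/value records."""
    long_rows = []
    for row in rows:
        identifiers = {name: row[name] for name in id_fields}
        for name, value in row.items():
            if name in id_fields or value in ("", None):
                continue  # skip identifier columns and parameters with no measurement
            long_rows.append({**identifiers, "parameter": name, "value": value})
    return long_rows


wide = [
    {"site": "A", "date": "2020-01-01", "temp_degC": "4.2", "ph": ""},
    {"site": "B", "date": "2020-01-01", "temp_degC": "", "ph": "6.8"},
]
for record in wide_to_long(wide, id_fields=("site", "date")):
    print(record)
```

In the long layout, each output row carries its own site and date, so sparse records take no more space than the measurements they actually contain.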
Keep similar measurements (i.e., same investigator, methods, time basis, and instruments) together in one data file. Many small files are more difficult to process than one large file. There are exceptions: observations of different types of measurements can be placed into separate data files, and data collected on different time scales or temporal resolutions might be handled more efficiently in separate files.
Favor a flatter structure for data files. When you work with data manually, it may be easier to have more folders with fewer files in each. But when working with data programmatically and enabling analysis-in-place, it is better to have fewer folders with more files (or even a completely flat file structure). There may be times when you need to split things up (typically when a folder reaches a few thousand files), but the bias should be towards a flatter structure.
Use stable file formats
Select a file format that can be read indefinitely and is independent of changes to relevant applications. If you collected data that used proprietary file formats, convert those files into a stable, well-documented, and non-proprietary format to maximize others' ability to use your data. Throughout the dataset, use consistent formats given the type of data.
Delimited text file formats are used for tabular or "spreadsheet" data. Report any summary information in supplementary documents, not in the file's contents. The file's header row should contain names that describe the content of each column, including parameter names and units. For examples of how to structure data for delimited text file formats, see:
Geospatial file formats should be self-descriptive; that is, metadata are included inside the contents of the file. For introductions to popular geospatial file formats for raster or image data, see the following:
- ORNL DAAC NetCDF Data Requirements
- Unidata NetCDF Documentation
- Data Curation Network GeoTIFF Primer
- OSGeo GeoTIFF Documentation
- NEON Hierarchical Data Formats - What is HDF5?
For introductions to spatial file formats for vector data, see:
Assign descriptive file names
File names should reflect the file contents and must uniquely identify the data enclosed. File names can include information such as a project acronym, study title, location, investigator, year of the study, data type, version number, and file type. File names should contain only lower-case letters, numbers, hyphens, and underscores, and no spaces or special characters. This will allow for easy management by various data systems and decrease software and platform dependency. Similar logic is useful when designing directory-naming schemes.
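The naming rule can be checked automatically. The pattern below is one possible reading of the rule and assumes that a single dot is allowed before the file extension; the example file names are invented.

```python
import re

# Lower-case letters, digits, hyphens, and underscores, then a dot and an extension.
# Allowing exactly one dot for the extension is an assumption of this sketch.
VALID_FILE_NAME = re.compile(r"[a-z0-9_-]+\.[a-z0-9]+")


def is_valid_file_name(name):
    """Return True when a file name uses only the recommended characters."""
    return VALID_FILE_NAME.fullmatch(name) is not None


print(is_valid_file_name("above_soil_moisture_2017_v1.csv"))  # True
print(is_valid_file_name("Soil Moisture (final).csv"))        # False
```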
Use programming to manipulate your data and preserve versions
To preserve your data and their integrity, save a "read-only" copy of your raw data with no transformations, interpolation, or analyses. Make a separate file and use a programming language to manipulate your data. Code is an excellent record of data processing, and it can quickly be revised and rerun in the event of data loss or if new analyses are needed. Be sure to write comments throughout your code to allow others to understand the purpose of particular lines. Programming has the added benefit of allowing a colleague to follow up on or reproduce your processing.
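One way to protect a raw data file, sketched with the Python standard library, is to remove write permission from it once it has been saved. The function name `make_read_only` is hypothetical, and file permissions behave differently across operating systems, so treat this as a starting point rather than a guarantee.

```python
import os
import stat


def make_read_only(path):
    """Remove write permission for owner, group, and others; keep the other bits."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    os.chmod(path, current & ~write_bits)
```

All processing scripts would then read from the protected file and write their results to a separate file.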
Track changes to your files using a version number, processing date, or both to identify versions. Keep a history of changes to the data and who made the changes. Make sure the data files submitted to the archive are the correct version. Code-hosting platforms, like GitHub, have built-in version control tools and are useful when multiple people will manipulate the data.
Perform basic data quality assurance
You should perform data quality assurance checks on your data files before sharing them.
- Review the organization of the file's content and descriptors to ensure that there are no missing data values for key parameters.
- Sort the records by key parameters to highlight discrepancies.
- Check the validity of measured or derived values and scan for impossible values (e.g., a pH of 74).
- Check the time frame and the temporal units. Generate time series plots to detect anomalous values or data gaps.
- Review statistical summaries (e.g., mean, median, minimum values, maximum values, etc.).
- If geolocation is a parameter, use scatter plots or GIS software to map each location to check for errors in coordinates. For GIS image and vector files, ensure the projection parameters have been accurately stated.
- Additional information such as data type, scale, corner coordinates, missing data value, size of an image, and the number of bands should be checked for accuracy.
- Remove any unnecessary parameters or columns used in processing that are uninformative.
Try using our Data Quality Review Checklist as a guide.
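Some of these checks are easy to script. The sketch below scans one field for values outside a valid range while skipping the no-data value. The function name, the field name, the range, and the -9999 placeholder are assumptions for the example, and it expects every value in the field to be numeric text.

```python
def find_out_of_range(records, field, valid_min, valid_max, missing=-9999.0):
    """Return (row_index, value) pairs whose value falls outside the valid range."""
    problems = []
    for index, record in enumerate(records):
        value = float(record[field])
        if value == missing:
            continue  # documented no-data value, not an error
        if not valid_min <= value <= valid_max:
            problems.append((index, value))
    return problems


records = [{"ph": "6.8"}, {"ph": "74"}, {"ph": "-9999"}]
print(find_out_of_range(records, "ph", valid_min=0, valid_max=14))  # [(1, 74.0)]
```

A scripted check does not replace plots and visual review, but it can be rerun every time the files change.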
Document your data
Documentation should be stored using stable, non-proprietary formats. Documentation is most useful when it is structured as a "user's guide" for the data product, which provides content details for each file and how the data were created. More information about the data is always preferable to less because those who are not familiar with your project will need more guidance. Long-term experimental activities also require detailed documentation because personnel could change over time. In general, write documentation for an audience who is inexperienced in your particular research, methods, and observations. See the documentation section of our Detailed Submission Guidelines for examples of documentation.
Protect your data
Ensure that file transfers can be performed without error by comparing checksums or file sizes before and after data transfers. Create and test back-up copies often to prevent the loss of data. Maintain at least three copies of your data: the original, an on-site but external backup, and an off-site backup such as on cloud-based storage. Periodically test that you can recover your data in the case of an emergency.
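A checksum comparison can be sketched with Python's `hashlib` module. SHA-256 is used here as one common choice, and the function names are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(source_path, copied_path):
    """Return True when the source and the copy have identical checksums."""
    return sha256_of_file(source_path) == sha256_of_file(copied_path)
```

Reading in chunks keeps memory use low for large files. Storing the checksum alongside each backup also lets you confirm later that a copy has not been silently corrupted.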
Publish your data
There are many benefits to publishing your data in an open data repository. You get a digital object identifier (DOI) for archived data products that can be cited in academic papers and other publications. When a data repository performs quality assurance checks and writes documentation for your data product, it adds value and could reveal errors that you overlooked. Archived data allow collaborators to access your data, and scientists outside your project can also access, understand, and use your data to address their own hypotheses. Financial sponsors of your research and other stakeholders also protect their investment when you archive your data.
A data management plan (DMP) is a brief document that outlines what you will do with your data during and after your research. NASA's Terrestrial Ecology Program requires that each research proposal include a DMP of up to two pages. Moreover, many journals require that data be made publicly available before the associated manuscript is published, and a DMP can expedite this process.
Components of a data management plan
A DMP for a proposal must include at least these components:
- Information about the data. Describe the data and their intended organization. Consider the acquisition, processing, and storage of the data that will be produced.
- Description of the data. How much data will you collect and how will the data be stored? Avoid proprietary formats that might not be readable in the future. Will you explicitly measure data quality?
- Metadata content and format. Describe the metadata necessary to make the data understood by someone outside of your project. How will metadata be created or captured and in what format or standard?
- Policies for data access, sharing, and re-use. Data sharing details and obligations should be considered. Address ethical, privacy, intellectual property, and copyright issues for the dataset. When will the data be delivered?
- Long-term preservation. Where will the data be published or archived? Who will be responsible for questions about the data and updates to the data?
The project budget should include considerations for the time, hardware, software, and personnel needed for proper data management.
Before getting started on your DMP, see:
Michener, W.K. 2018. Project Data Management Planning. In: F. Recknagel, W. Michener (eds) Ecological Informatics. Springer, Cham. https://doi.org/10.1007/978-3-319-59928-1_2
Examples of data management plans for research proposals
DMPs will vary in content and style depending upon the type of project. Below are some examples of DMPs for research proposals.
- Multi-scale synthesis and Terrestrial Model Intercomparison Project (MsTMIP) Phase II
- Development of a Data-Assimilation Framework for Arctic Ecosystems
- Determining the Extent and Dynamics of Surface Water for the ABoVE Field Campaign
- Daymet Daily Gridded Meteorological Data
- Remote Sensing Data and Land-Use Transitions
Tools for creating data management plans
Several tools are available to reduce the amount of time and effort that you need to put into producing a DMP.
- DMPTool is a collaborative effort by several institutions
- Integrated Earth Data Applications (IEDA) Data Management Tool
- Johns Hopkins University Data Management Services
NASA guidance for data management plans
After receiving funding, large missions within NASA (airborne- or satellite-based missions) are required to develop DMPs that describe in detail how the plan will be implemented. As background, you may find NASA's Data Management Plan Guidance for Earth Science Missions useful as you develop short DMPs for proposals and implementation plans for successfully funded projects.
Department of Energy guidance for data management
The Department of Energy (DOE) Office of Science requires that a DMP be integrated with the overall research plan for projects that received funding on or after October 1, 2014. Read the Office of Science Statement on Digital Data Management for detailed information that might be useful for all DMPs. The Biological and Environmental Research program of the DOE provides additional requirements and guidance for digital data management for Climate and Environmental Sciences Division Research funded data repositories.