Why should you worry about good data management practices?
To prepare data for archival, you must organize them into well-formatted, described, and documented datasets. Benefits of good data management include:
- Spend less time doing data management and more time doing research
- Easier to prepare and use data for yourself
- Collaborators can readily understand and use data files
- Long-term (data publication)
- Scientists outside your project can find, understand, and use your data to address broad questions
- You get credit for archived data products and their use in other papers
- Sponsors protect their investment
This page provides an overview of data management planning and preparation. It offers practical methods to successfully share and archive your data at the ORNL DAAC.
ORNL DAAC Data Management Best Practices
The ORNL DAAC has developed data management best practices for preparing datasets for sharing and archival.
- Use stable file formats
- Define the contents of your data files
- Assign descriptive file names
- Use consistent data organization
- Preserve information with version control
- Document your data
- Perform basic data quality assurance
- Protect your data
- Publish your data
Use stable file formats
Select a consistent format that can be read well into the future and is independent of changes in applications. If your data collection process used proprietary file formats, convert those files into a stable, well-documented, non-proprietary format to maximize others' ability to use and build upon your data.
Tabular or "spreadsheet" formatted data
Delimited text file formats ensure data are readable in the future. The ORNL DAAC recommends comma-separated values (CSV) format for all "spreadsheet" type data. Use a consistent structure throughout the dataset. Report summary information and analyses in supplementary documents, not in the data files. A header row should contain column headings that describe the content of each column, including parameter names and units.
- ORNL DAAC CSV standards
- For more information on data organization in spreadsheets, see Data Carpentry's Data Organization in Spreadsheets.
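The header-row guidance above can be illustrated with a minimal sketch using Python's standard csv module. The column names, units, and values here are hypothetical and chosen only for illustration; they are not ORNL DAAC requirements.

```python
import csv
import io

# Hypothetical columns: each heading names the parameter and, where relevant, its units.
HEADER = ["site_id", "date", "air_temp_degC", "precip_mm"]
ROWS = [
    ["site_a", "2015-06-01", 21.4, 0.0],
    ["site_a", "2015-06-02", 19.8, 3.2],
]


def to_csv_text(header, rows):
    """Return CSV text with one header row followed by one record per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv_text(HEADER, ROWS)
print(csv_text)
```

Summary statistics and analyses are deliberately left out of the file, so every line after the header is a data record.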
Good spatial file formats are open, non-proprietary, simple, and commonly used. More importantly, they are self-descriptive; in other words, metadata are included inside the file. The ORNL DAAC recommends:
- GeoTIFF, NetCDF, or HDF for raster or image data
- shapefile or KML/KMZ for vector data
NetCDF and CF-compliance data guidelines
Define the contents of your data files
In order for others to use your data, they must fully understand the contents of the dataset, including the parameter names, units of measure, formats, and definitions of coded values. Be consistent throughout your data.
Describe each parameter, standardized across files, datasets, and the project using commonly accepted parameter names and abbreviations. Develop a data dictionary that defines each attribute, variable, and parameter in the data. Standards for parameters currently in use include the CF Conventions and Metadata. These standards are applicable for NetCDF as well as other data formats.
Units of reported parameters must be explicitly stated in the data file and in the documentation. SI units are preferable, but each discipline may have its own commonly used units of measure. Standards for units are available, such as the CF Conventions and Metadata.
Date and time formats
Use ISO standard date formats: yyyy-mm-dd or yyyymmdd. If only the month or only the year is of interest, use yyyy-mm or yyyy. Use 24-hour notation (13:30 instead of 1:30 p.m. and 04:30 instead of 4:30 a.m.). Report both local time and Coordinated Universal Time (UTC). Standard time is preferred. Be sure to define the local time in the documentation. These recommendations are based on ISO 8601:2004.
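As a rough sketch of these date and time conventions, the following uses Python's standard datetime module. It assumes a hypothetical site that keeps standard time at UTC-5 all year; the offset, the label "EST", and the function name are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the site reports standard time at UTC-5 year-round (no daylight saving).
LOCAL_STANDARD = timezone(timedelta(hours=-5), "EST")


def format_timestamps(local_dt):
    """Return (date, local 24-hour time, UTC timestamp) as ISO 8601 style strings."""
    aware = local_dt.replace(tzinfo=LOCAL_STANDARD)
    utc = aware.astimezone(timezone.utc)
    return (
        aware.strftime("%Y-%m-%d"),
        aware.strftime("%H:%M"),
        utc.strftime("%Y-%m-%dT%H:%MZ"),
    )


stamps = format_timestamps(datetime(2015, 6, 1, 13, 30))
print(stamps)
```

Whatever local time is used, the offset from UTC still needs to be stated in the dataset documentation.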
Coordinates and geospatial information
Report coordinates in decimal degrees (≥ 4 decimal places). Provide south latitude and west longitude recorded as negative values. All location information in a dataset should use the same coordinate system, including coordinate type, datum, and spheroid.
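If field coordinates were recorded in degrees, minutes, and seconds, a conversion along the following lines yields signed decimal degrees. This is a minimal sketch; the function name and the sample coordinates are hypothetical.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere, places=4):
    """Convert degrees/minutes/seconds to signed decimal degrees.

    South latitudes and west longitudes are returned as negative values.
    """
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError("hemisphere must be one of N, S, E, W")
    value = degrees + minutes / 60 + seconds / 3600
    if hemisphere in ("S", "W"):
        value = -value
    return round(value, places)


# Hypothetical location used only for illustration.
latitude = dms_to_decimal(35, 55, 52, "N")
longitude = dms_to_decimal(84, 18, 36, "W")
print(f"{latitude:.4f}, {longitude:.4f}")
```

The default of four decimal places is the minimum suggested above; pass a larger `places` value when the measurements support it.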
Define the projection & coordinate reference system, referenced datum, EPSG code, spatial resolution, and bounding box. Provide a projection file in .prj or Well-Known Text (WKT) format. Embed the 'no-data' or fill values in the image files if possible. Document all no-data values, fill values, valid ranges, scale factor and offset of the data values.
Coded fields and data flags
Define and standardize any coded fields in your data. A separate field may be used for quality considerations, reasons for missing values, or indicating replicated samples. Codes and flags should be consistent across parameters and data files. Definitions of flag codes should be included in the dataset documentation.
Use consistent missing value notations throughout your dataset. For numeric fields, represent missing data with a specified extreme value (e.g., -9999). For character fields, use "NA". Explicit missing value representations are better than empty fields. Document how missing and no-data values are represented.
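One way to apply a consistent missing-value notation is sketched below. The field names are hypothetical, and treating both None and empty strings as "missing" is an assumption about how the raw records arrive.

```python
NUMERIC_MISSING = -9999  # extreme value used for numeric fields
TEXT_MISSING = "NA"      # marker used for character fields


def fill_missing(record, numeric_fields):
    """Replace empty or None values with an explicit missing-value code."""
    filled = {}
    for field, value in record.items():
        if value is None or value == "":
            filled[field] = NUMERIC_MISSING if field in numeric_fields else TEXT_MISSING
        else:
            filled[field] = value
    return filled


# Hypothetical record with one missing numeric and one missing character value.
raw = {"site_id": "site_a", "air_temp_degC": "", "observer": None}
clean = fill_missing(raw, numeric_fields={"air_temp_degC"})
print(clean)
```

The chosen codes (-9999 and "NA" here) should also be recorded in the dataset documentation.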
Assign descriptive file names
File names should reflect the contents of the file and uniquely identify the data file. File names may contain information such as project acronym, study title, location, investigator, year(s) of study, data type, version number, and file type.
File names should be constructed to contain only lower-case letters, numbers, and underscores – no spaces or special characters – for easy management by various data systems and to decrease software and platform dependency. Similar logic is useful when designing directory structures and names.
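The naming rule above can be sketched as a small helper that builds a file name from descriptive parts and checks the result. The project name, parts, and helper name are hypothetical.

```python
import re

# Lower-case letters, digits, and underscores, followed by a single extension.
VALID_NAME = re.compile(r"^[a-z0-9_]+\.[a-z0-9]+$")


def build_file_name(parts, extension):
    """Join descriptive parts into a lower-case, underscore-separated file name."""
    cleaned = []
    for part in parts:
        # Collapse any run of disallowed characters into one underscore.
        text = re.sub(r"[^a-z0-9]+", "_", str(part).lower()).strip("_")
        if text:
            cleaned.append(text)
    return "_".join(cleaned) + "." + extension.lower().lstrip(".")


name = build_file_name(["TestProj", "Soil Moisture", "2015-2017", "v1"], "CSV")
print(name)
print(bool(VALID_NAME.match(name)))
```

The same pattern could be applied to directory names by dropping the extension part of the check.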
Use consistent data organization
There are two common ways to organize tabular data. In either case, each separate line or row represents an observation. Each line is a complete record.
Most often, the columns represent all the parameters that make up the record. Similar to a spreadsheet, this is the potentially "short and fat" style of data organization.
If most parameters in a record do not have measurements, you can define the parameter and value in two columns. Other columns may be used for data about the measurement like site, date, units of measure, etc. This is the "long and skinny" style of data organization.
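To make the two layouts concrete, here is a minimal sketch that converts "short and fat" records into the "long and skinny" form. The field names and values are hypothetical, and skipping None values is an illustrative choice for sparse records.

```python
def wide_to_long(records, id_fields):
    """Turn one-column-per-parameter records into parameter/value rows."""
    long_rows = []
    for record in records:
        ids = {field: record[field] for field in id_fields}
        for name, value in record.items():
            if name in id_fields or value is None:
                continue  # skip identifying columns and unmeasured parameters
            long_rows.append({**ids, "parameter": name, "value": value})
    return long_rows


# Hypothetical wide records: most parameters are unmeasured on any given day.
wide = [
    {"site": "site_a", "date": "2015-06-01", "air_temp_degC": 21.4, "precip_mm": None},
    {"site": "site_a", "date": "2015-06-02", "air_temp_degC": None, "precip_mm": 3.2},
]
long_rows = wide_to_long(wide, id_fields=("site", "date"))
for row in long_rows:
    print(row)
```

In the long form, units could be carried in an additional column rather than in the parameter name.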
Keep similar measurements together (same investigator, methods, time basis, and instruments) in one data file. Many small files are more difficult to process than one larger file. There are exceptions: observations of different types of measurements might be placed into separate data files. Data collected on different time scales or temporal resolution might be handled more efficiently in separate files.
Use similar data organization, parameter formats, and common site names across the dataset. Include dataset organization and provide definitions for all coded values or abbreviations, including spatial coordinates, in the documentation.
Preserve information with version control
To preserve your data and its integrity, save a "read-only" copy of your raw data files with no transformations, interpolation, or analyses. Use a programming language to process data in a separate file. The code you have written is an excellent record of data processing. Your code can easily and quickly be revised and rerun in the event of data loss or requests for edits. Programming has the added benefit of allowing a future worker to follow up on or reproduce your processing. GUI-based tools are easy on the front end, but they do not keep a record of changes to your data and make reproducing results difficult.
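A read-only raw copy can be approximated by clearing the write permission bits on the file, as in the sketch below. It assumes a POSIX-style file system; the file name is hypothetical, and permission bits alone do not replace backups.

```python
import os
import stat
import tempfile


def make_read_only(path):
    """Clear the write permission bits for owner, group, and others."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


# Demonstration in a temporary directory with a hypothetical raw data file.
with tempfile.TemporaryDirectory() as workdir:
    raw_path = os.path.join(workdir, "raw_observations.csv")
    with open(raw_path, "w") as handle:
        handle.write("site_id,value\nsite_a,1.0\n")
    make_read_only(raw_path)
    owner_can_write = bool(os.stat(raw_path).st_mode & stat.S_IWUSR)
    print(owner_can_write)
```

Processing scripts would then read from the protected file and write their results to a separate file.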
Track versions of your data files as changes are made using a number, date, or both to identify versions. Save and store earlier versions in a separate directory. Keep a history of changes to data files/versions and who made the changes. Make sure the data files submitted to the archive are the correct version. Version control tools, like GitHub, are useful, especially if a group manipulates the data.
Document your data
As with data, documentation should be saved using stable, non-proprietary formats. The documentation is most useful when structured as a user's guide for the data product. Documentation can never be too complete. Users who are not familiar with your data will need more detailed documentation to understand your dataset. Long-term experimental activities require more documentation because personnel change over time. Write documentation for a user who is unfamiliar with your project, methods, or observations. See the documentation section in Submission Guidelines for examples.
Perform basic data quality assurance
You should perform basic data QA on the data files prior to sharing them.
- Check file organization and descriptors to ensure that there are no missing values for key parameters (such as sample identifier, station, time, date, geographic coordinates).
- Sort the records by key data fields to highlight discrepancies.
- Check the validity of measured or derived values. Scan parameters for impossible values (e.g., pH of 74 or negative values where negative values are not possible).
- Check the time frame specified and the temporal units against the data. Generate time series plots to detect anomalous values or data gaps.
- Perform statistical summaries (frequency of parameter occurrence) and review results.
- If location is a parameter (latitude/longitude), then use scatter plots or GIS software to map each location to see if there are any errors in coordinates. For GIS image and vector files, ensure the projection parameters have been accurately given.
- Additional information such as data type, scale, corner coordinates, missing data value, size of image, number of bands, and endian type should be checked for accuracy.
- Remember to remove any "leftover" parameters or columns used in processing the data that are uninformative to other users of your dataset.
You can use our Data Quality Review Checklist as a guide.
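Two of the checks listed above, missing key values and impossible measured values, can be sketched as follows. The field names, the valid range, and the -9999 missing code are assumptions for illustration; this is not the ORNL DAAC checklist itself.

```python
def qa_report(records, key_fields, valid_ranges, missing=-9999):
    """Return (row number, field, message) tuples for basic QA problems."""
    problems = []
    for row_number, record in enumerate(records, start=1):
        # Key parameters must always be present.
        for field in key_fields:
            if record.get(field) in (None, "", "NA", missing):
                problems.append((row_number, field, "missing key value"))
        # Measured values must fall inside their physically possible range.
        for field, (low, high) in valid_ranges.items():
            value = record.get(field)
            if value is None or value == missing:
                continue  # documented missing values are not range errors
            if not low <= value <= high:
                problems.append((row_number, field, f"value {value} outside {low}..{high}"))
    return problems


# Hypothetical records, including the impossible pH of 74 mentioned above.
samples = [
    {"sample_id": "s1", "ph": 7.1},
    {"sample_id": "", "ph": 74},
    {"sample_id": "s3", "ph": -9999},
]
issues = qa_report(samples, key_fields=["sample_id"], valid_ranges={"ph": (0, 14)})
for issue in issues:
    print(issue)
```

Plots, maps, and statistical summaries remain necessary for the checks that simple range rules cannot capture.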
Protect your data
Ensure that file transfers are done without error by comparing checksums or file sizes before and after transfers. Create and test back-up copies often to prevent the disaster of lost data. Maintain at least three copies of your data: the original, an on-site but external backup, and an off-site backup, such as on cloud-based storage, in case of a disaster. Periodically test your ability to recover your data.
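Comparing checksums before and after a transfer can be sketched with Python's standard hashlib module. SHA-256 is one reasonable choice of algorithm, not a stated ORNL DAAC requirement, and the local copy below merely stands in for a real transfer.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy a file and report whether source and copy have the same checksum."""
    shutil.copy2(source, destination)
    return sha256_of(source) == sha256_of(destination)


# Demonstration with a hypothetical data file in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    original = Path(workdir) / "data_v1.csv"
    original.write_text("site_id,value\nsite_a,1.0\n")
    transfer_ok = copy_and_verify(original, Path(workdir) / "data_v1_backup.csv")
    print(transfer_ok)
```

For a transfer between machines, the checksum would be computed on each side and the two hexadecimal strings compared.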
Publish your data
There are many benefits to publishing your data in an open data repository. You get credit for archived data products and their use in papers. The process of QA and documentation adds value and may catch errors in your data. Collaborators can readily understand and use your data both in the near future and in the long-term. Scientists outside your project can find, understand, and use your data to address broad questions. Sponsors protect their investment.
What is a Data Management Plan?
A Data Management Plan (DMP) for a proposal is a brief document that outlines what you will do with your data during and after your research, to ensure your data will be safe, documented, and accessible now and in the future. A DMP developed early and used throughout the research project will increase research efficiency by making the data understandable and usable in the future and preventing duplication of research efforts. NASA's Terrestrial Ecology Program now requires that each proposal include a DMP of up to two pages.
Components of a Data Management Plan
A DMP for a proposal is a short (less than two pages) document that describes what you will do to manage your data. A DMP must include at least these components:
- Information about the data - Describe the data and their organization. Consider the acquisition, processing, and storage of the data to be produced.
- Description of data - How will the data be saved? Avoid proprietary formats that may not be readable in the future.
- Metadata content and format - Describe the metadata needed. How will they be created and/or captured and in what format or standard?
- Policies for access, sharing and re-use - Data sharing details and obligations should be considered. Address ethical, privacy, intellectual property, and copyright issues for the dataset.
- Long-term archival - Where will the data be preserved? Who will be responsible?
The project budget should include considerations for the time, hardware, software, and personnel required for Data Management.
Download this outline DMP to get started on your DMP.
Examples of Data Management Plans for Proposals
DMPs for different types of projects will vary in content and style. Here are several example proposal DMPs:
- Multi-scale synthesis and Terrestrial Model Intercomparison Project (MsTMIP) Phase II
- Development of a Data-Assimilation Framework for Arctic Ecosystems (ABoVE)
- Determining the extent and dynamics of surface water for the ABoVE field campaign
- Daily Gridded Meteorological Data (Daymet)
- Remote Sensing Data and Land-Use Transitions
- Mauna Loa CO2 Record (from DataONE)
Tools for Creating Data Management Plans
Tools are available to reduce the amount of time and effort that goes into producing a DMP.
- DMPTool, a collaborative effort by several institutions
- Integrated Earth Data Applications (IEDA) Data Management Tool
- Johns Hopkins University Data Management Services
NASA Guidance for Data Management Plans
After receiving funding, large missions within NASA (airborne- or satellite-based missions) are required to develop DMPs that describe in detail how the plan will be implemented. As background, you may find NASA's Data Management Plan Guidance for Earth Science Missions useful as you develop short DMPs for proposals and implementation plans for successfully funded projects.
DOE Office of Science Statement on Digital Data Management
The Office of Science of the DOE now requires an integrated DMP with an overall research plan for projects that receive funding issued on or after October 1, 2014. Read the Office of Science Statement on Digital Data Management for detailed information and guidance that may be useful for all DMPs. The Office of Science of the DOE has further guidance on Suggested Elements for a Data Management Plan.
Data Carpentry Lessons
- Data organization in spreadsheets
- Intro to vector-format spatial data (i.e. shapefiles, points, lines, or polygons) in R
- Intro to raster-format (i.e. gridded) spatial data in R
- Intro to HDF5 data in R
- Intro to time series data in R
- Data cleaning with OpenRefine
- Link your published datasets to your ORCID Record
- Additional data management and analysis tutorials can be found at datacarpentry.org
ORNL DAAC Resources
- Best Practices for Preparing Environmental Data Sets to Share and Archive
- Data Management Best Practices Workshop
- Environmental Data Management Best Practices: Part 1 Tabular Data
- Environmental Data Management Best Practices: Part 2 Geospatial Data
- Other webinars are available on the NASA Earthdata YouTube Channel