LibGuides: Research Data Management: Organize and Document Data

Data Organization

A clear, logical structure helps you and others find and understand your data easily. Consistency is key for maintaining order as your dataset grows.

File Naming

‣ Elements to Consider in Your File Naming Scheme

Project name/acronym
File status
Date (e.g. YYYYMMDD)
Short description of content
Data type information
Creator name/initials
Version number

‣ File name best practices:

Establish an 'order of elements', e.g. YukonClimate_20240103_TemperatureData_v3.csv (ProjectName_Date_Description_Version#)
Use date in ISO 8601 format YYYYMMDD
Avoid spaces or special characters (e.g. ! @ $ % * () ‘;<>,[]{}”); use underscores (e.g. file_name) or camel case (e.g. FileName)
Keep file names short, ideally fewer than 30 characters
Include version control elements: version numbers (e.g., v1, v2) or dates
When numbering files sequentially, use leading zeros so that files sort properly, e.g. 0001, 0002 … 1001 rather than 1, 2 … 1001

It is also a good idea to create a "README.TXT" file that explains your naming convention and abbreviations.
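
The naming practices above can be sketched in a few lines of Python. This is a minimal illustration, not a required tool: the function names build_file_name and sequence_label are hypothetical, and the element order (ProjectName_Date_Description_Version) simply follows the example in the list.

```python
import re
from datetime import date


def build_file_name(project, day, description, version, extension):
    """Join elements as ProjectName_Date_Description_Version.ext."""
    # Date is written as YYYYMMDD so that names sort chronologically.
    parts = [project, day.strftime("%Y%m%d"), description, f"v{version}"]
    for part in parts:
        # Only letters and digits are allowed inside an element;
        # underscores are reserved for separating elements.
        if not re.fullmatch(r"[A-Za-z0-9]+", part):
            raise ValueError(f"Unsupported character in element: {part!r}")
    return "_".join(parts) + "." + extension.lower()


def sequence_label(number, width=4):
    """Pad a sequence number with leading zeros, e.g. 2 -> '0002'."""
    return str(number).zfill(width)


print(build_file_name("YukonClimate", date(2024, 1, 3), "TemperatureData", 3, "csv"))
print(sorted(sequence_label(n) for n in (10, 2, 1)))
```

Rejecting spaces and special characters at the point where the name is built keeps the convention consistent as the number of files grows.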

File Formats

For long-term storage and management, try to select non-proprietary, uncompressed formats. Some preferred file formats are listed below.

Text: XML, TXT, PDF/A, HTML, ASCII, UTF-8 (not Word)
Tabular Data: CSV (not Excel)
Still Images: TIFF, JPEG 2000, PDF, PNG, BMP (not GIF or JPG)
Moving Images: MOV, MPEG, AVI, MXF (not Quicktime)
Sounds: WAVE, AIFF, MP3, MXF
Databases: XML, CSV
Statistics: ASCII, DTA, POR, SAS, SAV
Containers: TAR, GZIP, ZIP
Geospatial: SHP, DBF, GeoTIFF, NetCDF
Web Archive: WARC
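
As a rough sketch of how the list above might be applied, the following Python snippet flags files whose extensions correspond to formats the list advises against and suggests an alternative from the same list. The extension mapping and the function name suggest_format are assumptions made for illustration; they are not an exhaustive or authoritative registry.

```python
from pathlib import Path

# Assumed mapping from discouraged extensions to alternatives named in the list above.
# The extensions chosen here are illustrative, not exhaustive.
SUGGESTED_ALTERNATIVE = {
    ".doc": "TXT or PDF/A",
    ".docx": "TXT or PDF/A",
    ".xls": "CSV",
    ".xlsx": "CSV",
    ".gif": "TIFF or PNG",
    ".jpg": "TIFF or PNG",
}


def suggest_format(file_name):
    """Return a suggested preservation format, or None if nothing is flagged."""
    return SUGGESTED_ALTERNATIVE.get(Path(file_name).suffix.lower())


for name in ["survey.XLSX", "notes.docx", "stations.csv"]:
    print(name, "->", suggest_format(name))
```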

More resources:

Digital File and Folder Management (Thompson Rivers University)
Research Data Management Guide (UBC)
Naming and organizing your files and folders worksheet (MIT)
File naming best practices handout (pdf) (MIT)
Version control tools & techniques handout (pdf) (MIT)
Sustainability of Digital Formats (Library of Congress)

Data Documentation

Data documentation is essential for ensuring that your research data is discoverable, understandable, reusable, and reproducible.

Common documentation files:

README File - A README file provides a quick overview of your project. It answers basic questions about the data’s purpose, origin, structure, and how to use it, making it the first place users should look to understand your data.
Data Dictionary - A data dictionary is a detailed list of the variables in your dataset, explaining what each variable represents, its data type (e.g., number, text), and any relevant units or coding schemes. It ensures that users can interpret your dataset accurately.
Codebook - A codebook is a comprehensive guide that describes the coding scheme used for a dataset. It provides essential information about each variable, including descriptions, values, units of measurement, etc.
Laboratory Notebook - A laboratory notebook records the day-to-day process of your research, including experiment dates, methods, changes, and observations. This helps users understand how the data was collected and any modifications that may impact results.
Metadata - Metadata is information that describes data, such as the title, creator, keywords, and license information. It makes your data more searchable, discoverable, and easier to cite in other research.
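
A data dictionary can be started from the dataset itself. The sketch below, using only Python's standard library, reads the header of a CSV file and drafts one entry per variable with a guessed data type; units and descriptions are left blank for the researcher to fill in. The sample data, the function names, and the simple number/text type guess are all assumptions for illustration.

```python
import csv
import io


def guess_type(values):
    """Return 'number' if every non-empty value parses as a float, else 'text'."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "unknown"
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return "text"
    return "number"


def draft_data_dictionary(csv_text):
    """Build a skeleton data dictionary: one entry per column in the CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for name in reader.fieldnames or []:
        column = [row[name] for row in rows]
        entries.append(
            {"variable": name, "type": guess_type(column), "units": "", "description": ""}
        )
    return entries


# Hypothetical sample data; the empty temp_c cell stands for a missing reading.
sample = "site_id,temp_c,notes\nA1,-3.5,clear\nA2,,snow\n"
for entry in draft_data_dictionary(sample):
    print(entry)
```

The guessed types are only a starting point; units, coding schemes, and the meaning of missing values still need to be documented by hand.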

Organizing Data Variables with READMEs

README File

A README file is a plain text file that includes descriptive information and is commonly used for software, games, and code. It is a supplementary document in which the creator explains the contents to the user. When working with data, it is useful to create a README file and include it with your data, so that future users will understand the data and any terms used.

What Should a README Include?

There are no standards for writing a README text file, but it is recommended to include:

Title
Principal Investigator(s)
Dates/Locations of data collection
Keywords
Language
Funding
Descriptions of every folder, file, format, data collection method, instrument, etc.
Definitions
People involved
Recommended citation

If creating a README file for a dataset, be mindful of the following:

Abbreviations and acronyms are defined
Variables/parameters and units are described
Data treatment and methodology are described
Headings are explained
Known limitations of or problems with the data are mentioned (including explanations of missing data, negative values for parameters where they are not expected, no reads, etc.)
Reference to papers describing methodology, if applicable
Related datasets are properly cited
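
One way to make sure none of the recommended items is forgotten is to generate the README from a fixed list of sections. The Python sketch below is a hypothetical helper (render_readme is not a standard tool); the section names come from the recommended list above, and any section without a value is marked TODO.

```python
# Section names taken from the recommended README contents above.
README_SECTIONS = [
    "Title",
    "Principal Investigator(s)",
    "Dates/Locations of data collection",
    "Keywords",
    "Language",
    "Funding",
    "Descriptions of folders, files, formats, methods, instruments",
    "Definitions",
    "People involved",
    "Recommended citation",
]


def render_readme(values):
    """Return plain-text README content, marking unfilled sections as TODO."""
    lines = [f"{section}: {values.get(section, 'TODO')}" for section in README_SECTIONS]
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
print(render_readme({"Title": "Example dataset", "Language": "English"}))
```

Searching the generated file for "TODO" then gives a quick check of what still needs to be documented.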

How to Write a README File?

Guide to writing "readme" style metadata (Cornell University)
A template README for social science replication packages (Social Science Data Editors)

Metadata & Describing Data

Metadata

Metadata is information about data. It provides the context and details by addressing the who, what, when, where, why, and how of the dataset, making it easier to find, access, and use. Good metadata aligns with the FAIR data principles, ensuring data is Findable, Accessible, Interoperable, and Reusable.

Key Metadata Elements

Title: full title by which the dataset is known
Creator: the name(s) of the person or organization responsible for creating the work
Contact Information: name and email address for the main contact for the dataset
Description: summary of purpose, nature, and scope of the dataset
Subject: broad domain-specific subject category
Date(s): consider including collection date, production date, deposit date, distribution date, publication date, etc.
Keywords: relevant terms that help users search for the dataset
Location: for geospatial data
Licensing Information: details on how the data can be used, such as a Creative Commons license.
Funding or granting agency
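
The key elements above can be recorded in a simple machine-readable form. The following Python sketch stores them as a dictionary, writes it out as JSON, and reports which elements are still empty. The field names, the placeholder values, and the choice to treat location as optional (since it applies to geospatial data) are assumptions for illustration; a repository will normally prescribe its own fields.

```python
import json

# Element names adapted from the list above; "location" is treated as optional
# here because it applies only to geospatial data (an assumption of this sketch).
KEY_ELEMENTS = [
    "title", "creator", "contact", "description", "subject",
    "dates", "keywords", "location", "license", "funding",
]
OPTIONAL = {"location"}


def missing_elements(record):
    """List required elements that are absent or empty in the record."""
    return [k for k in KEY_ELEMENTS if k not in OPTIONAL and not record.get(k)]


# Placeholder record for illustration only.
record = {
    "title": "Example climate dataset",
    "creator": "A. Researcher",
    "contact": "a.researcher@example.org",
    "description": "Daily temperature readings from example stations.",
    "subject": "Climatology",
    "dates": {"collection": "2024-01-03"},
    "keywords": ["temperature", "climate"],
    "license": "CC BY 4.0",
}

print(json.dumps(record, indent=2))
print("Missing:", missing_elements(record))
```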

Choosing a Metadata Standard

Metadata standards or schemas consist of specific elements used to describe or document your data. Many data repositories, disciplines and organizations have established specific metadata standards.

To find an appropriate metadata standard for your data:

Disciplinary Metadata (Digital Curation Centre)
The RDA Metadata Standards Catalog (Research Data Alliance for the international academic community)
The RDA Metadata Standards Directory (Research Data Alliance)

Examples of some commonly used ones:

Dublin Core - a basic, widely used, domain-agnostic metadata standard
DDI (Data Documentation Initiative) - commonly used in social, behavioral, economic, and health sciences
EML (Ecological Metadata Language) - specific to ecological disciplines
ISO 19115 - a standard for geographic information
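
To give a sense of what a metadata standard looks like in practice, the sketch below writes a small record using Dublin Core element names (title, creator, subject, date, rights) with Python's standard library. The wrapper element name "metadata", the function name to_dublin_core, and the values are assumptions for illustration; repositories typically define their own wrapper and required elements.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dublin_core(fields):
    """Serialize a dict of Dublin Core element names and values to XML text."""
    root = ET.Element("metadata")  # wrapper name is an assumption of this sketch
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values for illustration only.
xml_text = to_dublin_core(
    {
        "title": "Example climate dataset",
        "creator": "A. Researcher",
        "subject": "Climatology",
        "date": "2024-01-03",
        "rights": "CC BY 4.0",
    }
)
print(xml_text)
```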