species | site_01 | site_02 | site_03 |
---|---|---|---|
Tilia americana | 3 | 3 | 5 |
Pinus strobus | 1 | 0 | 2 |
6 Make Your Data Software Ready
Those sharing or managing data can take small steps to make them “software ready.” These include using non-proprietary formats, structuring tables with specific columns and entries, and following standards for information about time, place, and organism.
6.1 Use non-proprietary formats
What is it?
Non-proprietary file formats do not require specific software and can be accessed without licenses and within different software systems. For example, the comma-separated values (CSV) format is an increasingly popular non-proprietary alternative to the proprietary .xlsx format.
Why?
- Allows data to be useful in perpetuity by ensuring data readability and reusability across multiple platforms
- Aligns better with the FAIR data principles
- Makes data more socially equitable, supporting open science
Many applications (e.g. Microsoft Office) allow exporting into multiple formats, which makes it easy to share data in non-proprietary formats even if it was created using proprietary software.
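As a minimal sketch of what sharing in a non-proprietary format can look like, the following writes a small table to CSV text and reads it back using only Python's standard library. The records and the helper name `to_csv_text` are hypothetical and used only for illustration.

```python
import csv
import io

# Hypothetical records, used only for illustration.
rows = [
    {"species": "Tilia americana", "site": "site_01", "count": 3},
    {"species": "Pinus strobus", "site": "site_01", "count": 1},
]


def to_csv_text(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv_text(rows, ["species", "site", "count"])

# Any CSV-aware tool can read the text back without special software.
# Note that CSV stores everything as text, so values come back as strings.
round_trip = list(csv.DictReader(io.StringIO(csv_text)))
```

Because CSV carries no type information, counts come back as strings here; documenting column types alongside the file helps downstream users.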
Top Resources
Table of commonly used formats for common data types
A more detailed table that is specific to U.S. Federal records management
6.2 Structure tabular data in tidy/long format
What is it?
Long (sometimes called “tidy”) format for tabular data has one observation per row.
The following example shows two different formats – wide and long – of the same data. Notice that while sites 1, 2, and 3 are the column names filled with counts for each species in the wide format, site and count become the column names in long format.
Why?
- The clear structure makes data more machine readable, particularly with commonly used analytical software.
- Data are as atomic as possible (e.g. no mixed types in one field).
- It is easier to aggregate data across multiple files.
Example of Wide Format
Example of Long Format
species | site | count |
---|---|---|
Tilia americana | site_01 | 3 |
Tilia americana | site_02 | 3 |
Tilia americana | site_03 | 5 |
Pinus strobus | site_01 | 1 |
Pinus strobus | site_02 | 0 |
Pinus strobus | site_03 | 2 |
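The wide-to-long reshaping described in this section can be sketched in a few lines of plain Python. This is an illustrative sketch, not a prescribed method: the function name `wide_to_long` and its parameters are hypothetical, and the counts are taken from the wide-format example table.

```python
# Wide table: one row per species, one column per site.
wide = [
    {"species": "Tilia americana", "site_01": 3, "site_02": 3, "site_03": 5},
    {"species": "Pinus strobus", "site_01": 1, "site_02": 0, "site_03": 2},
]


def wide_to_long(rows, id_column, var_name, value_name):
    """Turn every non-identifier column into its own row (one observation per row)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on each output row instead
            long_rows.append(
                {id_column: row[id_column], var_name: column, value_name: value}
            )
    return long_rows


long_table = wide_to_long(wide, "species", "site", "count")
for record in long_table:
    print(record["species"], record["site"], record["count"], sep=" | ")
```

Data-frame libraries offer the same operation directly (for example, `melt` in pandas with `id_vars`, `var_name`, and `value_name`), which is usually preferable for larger tables.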
Top Resources
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23.
Video: Data Sharing and Management Snafu in 3 Short Acts
Tips for working with data in BASH
6.3 Follow ISO 8601 for dates
What is it?
ISO 8601 is a convention for dates and times, where dates are listed as YYYY-MM-DD and time is given in Coordinated Universal Time (UTC, also called Zulu time or GMT), the time standard relative to 0° longitude that regulates global clocks.
The following table outlines how to write dates, times, and time intervals using ISO 8601:
Examples: April 3, 2023 standardized to ISO 8601
Description | Written in ISO 8601 |
---|---|
Date | 2023-04-03 |
Date and Time with timezone offset | 2023-04-03T18:29:38+00:00 |
Date and Time in UTC | 2023-04-03T18:29:38Z |
Time Interval in UTC (April 3 - 5, 2023) | 2023-04-03T18:29:38Z/2023-04-05T00:29:38Z |
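The strings in the table can be produced with Python's standard `datetime` module. The sketch below is one way to do it, assuming the timestamps are already known to be in UTC; the local timestamp with a `-04:00` offset is a hypothetical input added to show conversion to UTC.

```python
from datetime import datetime, timedelta, timezone

# The example instant from the table, as a timezone-aware UTC datetime.
start = datetime(2023, 4, 3, 18, 29, 38, tzinfo=timezone.utc)

date_only = start.date().isoformat()            # date
with_offset = start.isoformat()                 # date and time with offset
# Appending "Z" is only valid because `start` is in UTC.
with_z = start.strftime("%Y-%m-%dT%H:%M:%SZ")   # date and time in UTC

# Time interval: start and end separated by a slash.
end = start + timedelta(days=1, hours=6)
interval = f"{with_z}/{end.strftime('%Y-%m-%dT%H:%M:%SZ')}"

# Hypothetical local timestamp with an explicit offset, converted to UTC.
local = datetime.fromisoformat("2023-04-03T14:29:38-04:00")
local_as_utc = local.astimezone(timezone.utc)

print(date_only, with_offset, with_z, interval, sep="\n")
```

Keeping datetimes timezone-aware from the start avoids the ambiguity of naive local times, since the offset travels with the value.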
Why?
- Internationally accepted format used across multiple schemas (e.g. Darwin Core, EML, ISO 19115)
- Removes ambiguity related to timezone, daylight saving time changes, and time of day
- Better software integration of date/time elements
Top References
6.5 Record latitude and longitude in decimal degrees in WGS84
What is it?
WGS84 is a coordinate reference system that clarifies location. Recording latitude and longitude coordinates in decimal degrees (DD), rather than degrees-minutes-seconds (DMS) or decimal-minutes (DM or DDM), standardizes them to be more machine and human readable. Degrees West and South are negative in decimal degrees; longitude ranges from -180 to 180 and latitude from -90 to 90. Below are example coordinates in each format. Once locations are recorded in DD, the number of decimal places included should be adjusted to match the precision of the observation.
Example Coordinates
Format | Example |
---|---|
Decimal Degrees (DD) | 30.25277778 |
Degrees Minutes Seconds (DMS) | 30° 15' 10" N |
Degrees Decimal Minutes (DM or DDM) | 30° 15.1667' N |
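Converting DMS or DDM values to decimal degrees is simple arithmetic: degrees + minutes/60 + seconds/3600, negated for the southern and western hemispheres. The following is a minimal sketch; the function names `dms_to_dd` and `check_lat_lon` are hypothetical, and the choice of six decimal places is only an example that should be adjusted to the precision of the observation.

```python
def dms_to_dd(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    dd = degrees + minutes / 60 + seconds / 3600
    # South and West are negative in decimal degrees.
    return -dd if hemisphere in ("S", "W") else dd


def check_lat_lon(lat, lon):
    """Raise ValueError if a coordinate pair falls outside the valid ranges."""
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180 <= lon <= 180:
        raise ValueError(f"longitude out of range: {lon}")


# DMS input: 30 degrees, 15 minutes, 10 seconds North.
lat_from_dms = round(dms_to_dd(30, 15, 10, "N"), 6)
# DDM input: decimal minutes are passed as minutes, with zero seconds.
lat_from_ddm = round(dms_to_dd(30, 15.1667, 0, "N"), 6)
# Hypothetical western longitude, to show the sign convention.
lon_west = dms_to_dd(97, 30, 0, "W")

check_lat_lon(lat_from_dms, lon_west)
print(lat_from_dms, lat_from_ddm, lon_west)
```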
Why?
- Users have to know where you collected this data, which requires a latitude, longitude, reference system and uncertainty.
- Decimal degrees avoid special symbols (° or ′), which is preferable for machine-readable formats.
- WGS84 is a reference coordinate system that is widely used and incorporated in many GPS units and tools, and recognized as a standard by many government agencies.
Top Resources
Existing R/python/ESRI packages/functions
R package measurements
EML - find bounding coordinates
Some background on precision
6.6 Use persistent unique identifiers
Why?
- It can be useful to have unique identifiers to unambiguously identify granules of information, e.g. dataset, collection, database, taxonomic concept, etc. This will allow users to precisely refer to the data and allow your data to remain identifiable when aggregated with other datasets.
- Unique identifiers let you identify a record within your data system or across data systems, which is useful for creating relational databases or merging records.
- Although it increases workload, it safeguards against confusion and inefficiency in the future.
Key Information
- There are good reasons to keep an identifier opaque, i.e. it does not indicate anything about the content of information it points to. However, there are also transparent, or semi-opaque identifiers in use that take advantage of semantics to guide humans as well as machines.
- One way to create a unique identifier is to concatenate the sampling event, location, time, and an enumeration of the unique observation or event (e.g. Station_95_Date_09JAN1997:14:35:00.000).
- Some prefer using opaque identifiers (e.g. 10FC9784-B30F-48ED-8DB5-FF65A2A9934E).
- If there is an existing persistent unique identifier, it’s usually a good idea to use it (e.g. when using a taxonomic authority like WoRMS and applying their LSID).
- If the identifiers you create are not managed by an authority (as DOIs are), it is important to manage them yourself.
- Identifiers should be persistent; consider, for example, that samples may move between institutions.
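Both styles of identifier described above can be generated with Python's standard library. This is a sketch under stated assumptions: the `example.org` namespace, the helper names `record_uuid` and `composite_id`, and the station and timestamp values are all hypothetical, and generating an identifier does not by itself make it persistent; that still depends on how it is managed.

```python
import uuid

# Opaque identifier: a random UUID carries no information about the record.
opaque_id = str(uuid.uuid4()).upper()

# Hypothetical namespace for deterministic identifiers; replace "example.org"
# with a domain your organization controls.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def record_uuid(station, timestamp_iso, sequence):
    """Derive the same opaque UUID every time from the same record key."""
    key = f"{station}|{timestamp_iso}|{sequence}"
    return str(uuid.uuid5(NAMESPACE, key))


def composite_id(station, timestamp_iso, sequence):
    """Build a transparent identifier by concatenating event attributes."""
    return f"{station}_{timestamp_iso}_{sequence:03d}"


transparent = composite_id("Station_95", "1997-01-09T14:35:00Z", 1)
deterministic = record_uuid("Station_95", "1997-01-09T14:35:00Z", 1)
print(opaque_id, transparent, deterministic, sep="\n")
```

A name-based UUID (`uuid5`) is reproducible from the same inputs, which helps when identifiers must be regenerated consistently, whereas a random UUID (`uuid4`) must be stored the first time it is assigned.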
Examples of PIDs
Type of PID | Use Case | Example |
---|---|---|
Digital Object Identifier (DOI) | Actionable persistent link for papers, data, and other digital objects | https://doi.org/10.6084/m9.figshare.16806712.v2 |
International Geo Sample Number (IGSN) | Persistent identifier for physical samples | http://igsn.org/AU1243 |
Life Science Identifier (LSID) | Persistent structured method for biologically significant data | urn:lsid:marinespecies.org:taxname:218214 |
Open Researcher and Contributor ID (ORCID) | Persistent actionable link for individuals | https://orcid.org/0000-0002-4391-107X |
Top References
- Software and Packages to generate uuids:
- Guidance on how to use GUIDs (Globally Unique Identifiers) to meet specific requirements of the biodiversity information community: http://bioimages.vanderbilt.edu/pages/guid-applicability-final-2011-01.pdf
- Use of globally unique identifiers (GUIDs) to link herbarium specimen records to physical specimens: https://bsapubs.onlinelibrary.wiley.com/doi/full/10.1002/aps3.1027
- A Beginner’s Guide to Persistent Identifiers: http://links.gbif.org/persistent_identifiers_guide_en_v1.pdf