13 June Comparison of data model structures

13.1 Introduction and goals

We considered five data products in this section, selected for their relevance to research-driven soil carbon data: International Soil Carbon Network version 3 (ISCN3), Coastal Carbon Research Coordination Network (CCRCN), International Soil Radiocarbon Database (ISRaD), and two soil warming meta-analyses conducted by Crowther (Crowther et al. (2016)) and vanGestel (van Gestel et al. (2018)). These include ongoing studies (CCRCN, ISCN3), an ongoing study with incremental publications (ISRaD: Lawrence et al. (2020)), and completed projects (Crowther: Crowther et al. (2016), vanGestel: van Gestel et al. (2018)).

13.2 Individual PI studies were smaller.

In general, individual PI projects had smaller data models (see Table 1 and Figure 1). Multi-PI studies tended to have multiple columns describing the same variable; these extra columns describe methods, units, standard deviations, and other quantities related to that variable. While Crowther technically had 8 data tables, most of the data were in three main data tables (Figure 2). In contrast, multi-PI projects had larger data tables with more complex keyed references across them (Figure 3). The same pattern held for the number of variables in each study. The single-PI studies had between 40 (Crowther: Figure 2) and 56 (vanGestel) unique variable names, whereas multi-PI studies had between 144 (CCRCN) and 351 (ISRaD: Figure 3).
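The difference between a variable and the columns that describe it can be made concrete with a small sketch. It assumes a data model is described as a list of (table, column, variable, role) records, one per column. The table names, column names, and roles below are hypothetical and are not taken from any of the five data products.

```python
from collections import defaultdict

# Hypothetical data-model description: one record per column.
# Each column is tagged with the variable it describes and the role it plays.
DATA_MODEL = [
    ("layer", "layer_id", "layer_id", "id"),
    ("layer", "bd", "bulk_density", "value"),
    ("layer", "bd_units", "bulk_density", "unit"),
    ("layer", "bd_method", "bulk_density", "method"),
    ("layer", "oc", "organic_carbon", "value"),
    ("layer", "oc_sd", "organic_carbon", "standard_deviation"),
    ("site", "site_id", "site_id", "id"),
    ("site", "lat", "latitude", "value"),
]


def summarize(data_model):
    """Count tables, unique variables, and columns in a data-model description."""
    tables = {table for table, _, _, _ in data_model}
    variables = {variable for _, _, variable, _ in data_model}
    columns = {(table, column) for table, column, _, _ in data_model}
    return {
        "table_count": len(tables),
        "variable_count": len(variables),
        "column_count": len(columns),
    }


def columns_by_variable(data_model):
    """Group (column, role) pairs under the variable they describe."""
    grouped = defaultdict(list)
    for _, column, variable, role in data_model:
        grouped[variable].append((column, role))
    return dict(grouped)


summary = summarize(DATA_MODEL)
grouped = columns_by_variable(DATA_MODEL)
print(summary)
print(grouped["bulk_density"])
```

In this toy description, bulk density is one variable carried by three columns (value, unit, method), which is the pattern that inflates column counts relative to variable counts in the multi-PI data models.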

Table 13.1: Table 1: Increasing the number of researchers involved with a study increased the complexity of the data model. The studies varied in the number of data tables they contained, from 3 in the single-PI study by Crowther to 12 in the multi-PI study CCRCN. The number of unique variables varied more widely, from 27 to 216. These variables were associated with columns that contained data values, units, and other methods notes.
Study ID     Table count   Variable count   Column count   Multi-PI?
Crowther     3             27               33             No
vanGestel    6             52               54             No
ISCN3        4             108              139            Yes
CCRCN        12            124              144            Yes
ISRaD        8             216              349            Yes
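One way to summarize Table 1 is the number of columns per unique variable in each study. The sketch below copies the counts from Table 1; the ratio and the group means are an illustrative summary that we add here, not a statistic reported by the studies.

```python
from statistics import mean

# Counts copied from Table 1:
# study -> (table count, variable count, column count, multi-PI?)
TABLE_1 = {
    "Crowther": (3, 27, 33, False),
    "vanGestel": (6, 52, 54, False),
    "ISCN3": (4, 108, 139, True),
    "CCRCN": (12, 124, 144, True),
    "ISRaD": (8, 216, 349, True),
}


def columns_per_variable(counts):
    """Return the column-to-variable ratio for each study."""
    return {
        study: columns / variables
        for study, (_, variables, columns, _) in counts.items()
    }


def group_mean(counts, multi_pi):
    """Mean column-to-variable ratio for single-PI or multi-PI studies."""
    ratios = columns_per_variable(counts)
    return mean(ratios[study] for study, row in counts.items() if row[3] == multi_pi)


ratios = columns_per_variable(TABLE_1)
for study, ratio in ratios.items():
    print(f"{study}: {ratio:.2f} columns per variable")
print(f"single-PI mean: {group_mean(TABLE_1, False):.2f}")
print(f"multi-PI mean: {group_mean(TABLE_1, True):.2f}")
```

With the Table 1 counts, the ratio ranges from about 1.04 (vanGestel) to about 1.62 (ISRaD). With only five studies, this is a descriptive comparison rather than a test.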

Figure 13.1: Figure 1: Data models with id keys only.

Figure 13.2: Figure 2: Crowther data model. An example of a less complex, single-PI data model.

Figure 13.3: Figure 3: ISRaD data model. An example of a complex, multi-PI data model.

13.3 Vocabularies across studies were not obviously harmonizable.

Initial efforts to harmonize the vocabulary across studies showed over 580 unique variables out of 924 total variables across all data models. Only 5 variables were shared across all data models. Variables shared by more than two data models tended to focus on study information, site location, climate, bulk density, organic carbon percentage, sand/silt/clay fractions, pH, and cation exchange capacity (see Table 2).

Table 13.2: Table 2: Common variables (>2 data models) across data models. Total variable count is 21.
variable
latitude
longitude
clay
sand
silt
elevation
mean_annual_temperature
bulk_density
carbon_organic
cation_exchange_capacity
depth_max
depth_min
total_carbon
mean_annual_precipitation
nitrogen_total
soil_organic_carbon
soil_texture_class
vegetation_species
carbon_to_nitrogen
citation
dataset_name
drainage_class
loss_on_ignition
observation_date
pH
site_name
soil_series
13c
14c
15n
aspect_degree
author
caco3
coarse_fraction
curator_name
curator_organization
doi
effective_cation_exchange_capacity
modification_date
reference
site
slope
treatment_duration_start
treatment_type
aluminum_dithionate
aluminum_oxalate
aspect_class
base_saturation
base_sum
biome
calcium_extractable
contact_name
country
depth_mean
email
exchangeable_cations_sum
fraction_modern
fraction_note
fraction_scheme
iron_dithionate
iron_oxalate
layer_name
magnesium_extractable
pH_h2o
potassium_extractable
site_description
sodium_extractable
soil_horizon
soil_type
treatment_duration
vegetation
2d_position
age
aluminum_pyrophosphate
bulk_density_sample
bulk_density_total
burn_evidence
coarse_size_thresh
color
contact_email
contact_orcid_id
datum
depth
depth_water
ecoregion
fraction_property
horizon
land_cover
layer_note
net_primary_productivity
nitrogen_total_stock
organic_matter
parent_material
pH_cacl
phosphorus_extractable
profile_name
silicon_dithionate
silicon_oxalate
silicon_pyrophosphate
slope_shape
soil_taxonomy
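The sharing counts behind a table like Table 2 can be sketched as follows. The sketch assumes each data model has already been mapped to a set of harmonized variable names; the assignments below are hypothetical and do not reproduce the actual contents of the five data models.

```python
from collections import Counter

# Hypothetical harmonized variable sets, for illustration only.
HARMONIZED = {
    "ISCN3": {"latitude", "longitude", "bulk_density", "soil_series"},
    "CCRCN": {"latitude", "longitude", "bulk_density", "depth_min"},
    "ISRaD": {"latitude", "longitude", "bulk_density", "14c"},
    "Crowther": {"latitude", "longitude", "pH"},
    "vanGestel": {"latitude", "longitude", "depth_mean"},
}


def shared_variables(models, min_models):
    """Return variables present in at least `min_models` data models, with counts."""
    counts = Counter(
        variable for variables in models.values() for variable in variables
    )
    return {variable: n for variable, n in counts.items() if n >= min_models}


# "More than two data models" corresponds to min_models=3.
common = shared_variables(HARMONIZED, min_models=3)
in_all = shared_variables(HARMONIZED, min_models=len(HARMONIZED))

# Most widely shared first, then alphabetical within a count.
for variable, n in sorted(common.items(), key=lambda item: (-item[1], item[0])):
    print(f"{variable}: {n} data models")
```

The count per variable is only as good as the mapping that feeds it. Two columns that measure the same quantity by different methods may or may not belong under one harmonized name, and that choice is made before any counting happens.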

13.4 Study feature summary

Common across most models are location (latitude, longitude, elevation), observation time, mean annual temperature, and mean annual precipitation, all of which describe site-level characteristics. Depth of core or layer paired with bulk density, organic carbon percentage, sand, silt, clay, pH, soil texture class, cation exchange capacity, and 14C describe soil-level characteristics. Columns for vegetation class notes were common, but not directly comparable across the studies. Bulk density was typically broken into several categories depending on the measurement method used.

13.4.1 Unique features

  • CCRCN
    • min/max latitude
    • detailed author information
    • ‘one_liner’ summary
    • break out bulk density mass/volume
    • many specific isotopes listed (Am241, C14, Cs137, Be7, Pb210, Ra226)
    • X_class is free text or controlled vocabulary
    • coastal specific vocabulary
      • inundation/salinity
    • anthropogenic impacts
    • core-level vs site latitude/longitude and elevation
  • ISCN3
    • disturbance table
    • high level of site details
      • frost free days, ponding, runoff
    • higher than average amount of layer-level information
    • fraction table only shared with ISRaD
  • ISRaD
    • interstitial table
    • flux table
    • incubation table
    • fraction table only shared with ISCN3
    • higher than average amount of layer-level information
      • mineral abundance, mass of element extracted
  • Crowther
    • Author-updated data (outside sources)
      • Biome, % Clay, pH
    • Detailed soil warming data
      • planned temperatures, control temperatures, mean temperatures
    • Cation exchange capacity reported
    • % Nitrogen reported
    • distinguished between total raw carbon and total carbon
    • Difference between detailed_site_id (new name) and site_id (old name)
  • vanGestel
    • mean depth instead of depth of core and layer
    • carbon, nitrogen, and phosphorus pools above and below ground
    • treatments and information about treatments (mean, standard error, size)
    • ‘input’ variable
    • soil horizon and percent soil organic matter

13.5 Initial ontology search for relevant controlled vocabulary.

In general, the Ecosystem Ontology (http://bioportal.bioontology.org/ontologies/ECSO) was the most relevant ontology to this study. However, its controlled vocabulary was not as method-specific as many of the larger data products examined here. We searched http://bioportal.bioontology.org/ for ontologies matching four terms common across the studies: ‘soil bulk density’, ‘soil organic carbon’, ‘soil pH’, and ‘soil depth’. None of the data products considered in this study referred to a formalized ontology; instead, each developed its own vocabulary or adapted and extended vocabulary from previous data products. We report an initial set of search results for these terms.

The search term ‘soil bulk density’ returned 39 ontology matches (search date: 19 May 2020). Many of these were entries for generalized ‘density’ or ‘bulk density’. The Ecosystem Ontology was the only ontology with a complete match (http://purl.dataone.org/odo/ECSO_00001110), though the definition of this entry was ambiguous. It did not specify whether the bulk density referred to sieved or dry soil, making it challenging to use in a soil study without further specification.

The search term ‘soil organic carbon’ returned 26 ontology matches (search date: 19 May 2020). While many of these referred to soil or carbon independently, two ontologies had entries specific to soil organic carbon. Interlinking Ontology for Biological Concepts had a complete match to ‘soil organic carbon’ (http://purl.jp/bio/4/id/200906061124670034), though no units were specified, making it ambiguous whether this was a mass fraction or a density. The Ecosystem Ontology also had a complete match under ‘organic carbon percentage in soil’ (http://purl.dataone.org/odo/ECSO_00000648), but also had ‘total organic carbon percentage’ (http://purl.dataone.org/odo/ECSO_00002149). While the units in this case were well defined, the method of measurement, as with bulk density, needed more specificity to make the label broadly applicable.

The search term ‘soil pH’ returned 43 ontology matches (search date: 19 May 2020). These hits were dominated by entries for either ‘soil’ or ‘pH’ alone. Only two ontologies had specific soil pH entries. The Ecosystem Ontology had a match for ‘soil pH’ (http://purl.dataone.org/odo/ECSO_00001646) but did not specify an extraction method. Interlinking Ontology for Biological Concepts had a match for ‘soil acidity’ (http://purl.jp/bio/4/id/200906080708260606), also without an extraction method specified.

The search term ‘soil depth’ returned 33 ontology matches (search date: 19 May 2020). Only two of these hits specifically referred to depth of soil. The Ecosystem Ontology had a match for ‘Soil Depth’ (http://purl.dataone.org/odo/ECSO_00001207), specifically mentioning soil depth in the context of layers. Interlinking Ontology for Biological Concepts had a match for ‘soil depth’ (http://purl.jp/bio/4/id/201006028017141570), which seemed to refer more to the total depth of the soil.
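The triage of search hits described above can be approximated with a simple string rule. This is a minimal sketch, not the procedure used for this report: it treats a hit as a complete match only when its label equals the search term after normalization, which is stricter than the judgement applied above (for example, ‘organic carbon percentage in soil’ was counted as a complete match for ‘soil organic carbon’). The function names are hypothetical, and the labels are examples of the kinds of entries mentioned in the text.

```python
def normalize(label):
    """Lower-case a label and collapse whitespace for comparison."""
    return " ".join(label.lower().split())


def classify_hit(search_term, label):
    """Classify a returned label as a complete, partial, or unrelated match."""
    term = normalize(search_term)
    candidate = normalize(label)
    if candidate == term:
        return "complete"
    if set(candidate.split()) & set(term.split()):
        return "partial"
    return "unrelated"


# Labels of the kinds described above, checked against one search term.
SEARCH_TERM = "soil bulk density"
LABELS = ["soil bulk density", "bulk density", "density", "pH"]

results = {label: classify_hit(SEARCH_TERM, label) for label in LABELS}
for label, outcome in results.items():
    print(f"{label!r}: {outcome}")
```

A rule like this only narrows the list. Whether a complete match is usable still depends on reading the definition for units and method, which is where the entries above fell short.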

13.6 Next steps

All groups in this study have been contacted and have confirmed interest in participating. We are currently expanding the data products being considered to include the Soil Incubation Database (SIDb), World Soil Information Service (WoSIS), ISCN-2016-template, the LTER soil organic matter working group, and the Soil Health Database. We have drafted the initial questions for the long-format interviews below and plan to start conducting interviews in June. By the end of July we expect to have a more general survey targeted broadly at the soil science community. We will continue to explore the developed ontologies.

13.6.1 Interview questions

  1. Why did you start this study?
  2. What is your workflow for ingesting data sets?
  3. What decisions did you make to arrive at this workflow?
  4. How would someone get a copy of the data in this study?
  5. What would you do differently if you had to start again? What would be the same?

References

Crowther, T. W., K. E. O. Todd-Brown, C. W. Rowe, W. R. Wieder, J. C. Carey, M. B. Machmuller, B. L. Snoek, et al. 2016. “Quantifying Global Soil Carbon Losses in Response to Warming.” Nature 540 (7631). Springer Science and Business Media LLC: 104–8. https://doi.org/10.1038/nature20150.

Lawrence, Corey R., Jeffrey Beem-Miller, Alison M. Hoyt, Grey Monroe, Carlos A. Sierra, Shane Stoner, Katherine Heckman, et al. 2020. “An Open-Source Database for the Synthesis of Soil Radiocarbon Data: International Soil Radiocarbon Database (ISRaD) Version 1.0.” Earth System Science Data 12 (1). Copernicus GmbH: 61–76. https://doi.org/10.5194/essd-12-61-2020.

van Gestel, Natasja, Zheng Shi, Kees Jan van Groenigen, Craig W. Osenberg, Louise C. Andresen, Jeffrey S. Dukes, Mark J. Hovenden, et al. 2018. “Predicting Soil Carbon Loss with Warming.” Nature 554 (7693). Springer Science and Business Media LLC: E4–E5. https://doi.org/10.1038/nature25745.