2 Summary of Interviews

Our conversations with project teams began in the Summer of 2020 and consisted of five guiding questions covering the general motivation of the project, workflow, decision points, availability, and possible improvements. From these questions, we hoped to gain a better understanding of the current status of synthesis projects and to inform a best practices paper with suggestions for future efforts.

2.1 Project motivation

Specific motivations for creating larger datasets varied but generally centered on outstanding questions about the carbon cycle that could not be addressed with data from a single project. To answer these questions, researchers needed large-scale harmonized datasets that were unavailable, unorganized, or nonexistent; scattered, undocumented data was a common frustration. Creating these data synthesis products proved long and tedious, typically taking five or more years to develop, and the efforts were often unfunded passion projects.

2.2 Workflow

Workflows focused on efficiently identifying, aggregating and vetting data contributions to create a harmonized final data product.

The process typically began with identifying and prioritizing data for synthesis. Data was often prioritized based on its size (more data was higher priority) and its relevance to the project's question.
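As a toy illustration of this prioritization rule, the sketch below ranks hypothetical candidate datasets by relevance and then size; the study names, fields, and scores are assumptions for illustration, not a scheme reported by any interviewed project.

```python
# Toy prioritization of candidate datasets using the two criteria mentioned
# above: relevance to the project question first, then size (more data wins).
# All names and scores here are invented for illustration.
candidates = [
    {"name": "study_A", "n_obs": 5000, "relevance": 0.9},
    {"name": "study_B", "n_obs": 120, "relevance": 0.9},
    {"name": "study_C", "n_obs": 800, "relevance": 0.4},
]
ranked = sorted(candidates, key=lambda c: (c["relevance"], c["n_obs"]), reverse=True)
# ranked order: study_A, study_B, study_C
```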

The next step was the aggregation process, where much of the diversity in these workflows arose. Different groups utilized manual data transcription to varying degrees. The most common approach entailed manual transcription from the original format (a data table or information embedded in a figure) to a study-specific data template. This transcription was done in some cases primarily by the data provider (ISCN, ISRaD, WoSIS, CC-RCN, and Crowther) and in others by the data synthesis team themselves (ISRaD, van Gestel, and SIDb). A distinct approach was the keyed transcription used in the SoDaH project, where the data provider developed a data thesaurus to translate the original data model into the common project model. There are numerous tradeoffs between the two approaches around expertise, reproducibility, and backward compatibility. No group was entirely happy with its aggregation workflow.
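To make the keyed approach concrete, here is a minimal sketch of how a data thesaurus might translate a provider's column names into a common data model. The column names and key format are assumptions for illustration; they do not reproduce the actual SoDaH key files.

```python
import pandas as pd

# Hypothetical data thesaurus supplied by the data provider: each row maps a
# column in the provider's original file onto the common project data model.
thesaurus = pd.DataFrame({
    "provider_column": ["SOC_pct", "depth_cm_top", "depth_cm_bot"],
    "project_column": ["soil_organic_carbon", "layer_top", "layer_bottom"],
})

def apply_thesaurus(provider_data: pd.DataFrame, key: pd.DataFrame) -> pd.DataFrame:
    """Rename keyed provider columns to the common data model; drop the rest."""
    mapping = dict(zip(key["provider_column"], key["project_column"]))
    return provider_data[list(mapping)].rename(columns=mapping)

# Example provider submission in its original format.
submission = pd.DataFrame({
    "SOC_pct": [1.2, 0.8],
    "depth_cm_top": [0, 10],
    "depth_cm_bot": [10, 30],
    "field_notes": ["rocky", "clay lens"],  # unkeyed columns are dropped
})
harmonized = apply_thesaurus(submission, thesaurus)
```

The appeal of this design is that the provider, who knows the original data best, writes the key, while the synthesis team maintains a single scripted translation step.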

Next, groups typically applied a QA/QC procedure, which varied among projects and included gap filling, unifying units, and range checking the data. This QA/QC procedure often had both automated and manual components.
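As a minimal sketch of the automated side of such a procedure, the example below unifies units and flags out-of-range values for manual review. The conversion factors, plausible bounds, and column names are assumptions for illustration, not values drawn from any interviewed project.

```python
import pandas as pd

# Assumed conversion factors to percent and assumed plausible bounds for
# soil organic carbon; both are placeholders for this sketch.
UNIT_TO_PERCENT = {"percent": 1.0, "g/kg": 0.1}
SOC_RANGE = (0.0, 60.0)

def qaqc(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Unify units: convert soil organic carbon values to percent.
    out["soil_organic_carbon"] *= out["provider_unit"].map(UNIT_TO_PERCENT)
    # Range check: flag out-of-range values for manual review rather than
    # silently dropping them, mirroring the mixed automated/manual workflow.
    low, high = SOC_RANGE
    out["soc_flag"] = ~out["soil_organic_carbon"].between(low, high)
    return out

checked = qaqc(pd.DataFrame({
    "soil_organic_carbon": [12.0, 950.0],
    "provider_unit": ["g/kg", "percent"],
}))
# Row one converts to 1.2 percent; row two is flagged for review.
```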

Finally, data releases often utilized GitHub repositories or were hosted on the project website. We did not find any group that utilized a DataONE-federated repository. Typically the final data product was a set of tabular relational data tables. SIDb was an exception, with a unique tree structure containing embedded rectangular data.
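A toy contrast between the two release shapes may help: flat relational tables linked by shared keys, versus a SIDb-style tree with rectangular data embedded at the leaves. The field names below are invented for illustration and do not reflect any project's actual schema.

```python
# Relational tables: flat tables linked by a shared key (the common pattern).
sites = [{"site_id": "S1", "latitude": 45.0, "longitude": -93.0}]
layers = [{"site_id": "S1", "layer_top": 0, "layer_bottom": 10, "soc": 1.2}]

# Tree structure: nested metadata with rectangular data embedded at the
# leaves, loosely modeled on the incubation time series SIDb collects.
tree = {
    "site": "S1",
    "location": {"latitude": 45.0, "longitude": -93.0},
    "incubation": {
        "temperature_C": 20,
        # embedded rectangular data: rows of (time, flux) observations
        "timeseries": [
            {"time_days": 1, "co2_flux": 0.9},
            {"time_days": 7, "co2_flux": 0.4},
        ],
    },
}
```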

2.3 Decisions

Workflows were a balance between ease of use and flexibility. While no team was entirely happy with its workflow, each felt it had arrived at a good balance for its team. Very large (ISRaD) and very small (van Gestel) project teams tended to focus on ease of use and short development time, leading them to favor a data transcription template approach. Intermediate projects like SoDaH, which had a smaller funded team with a diverse skill set, were able to explore a more flexible scripted approach.

Across all projects, communication, both within the synthesis team and between the synthesis team and data providers, was critical. Unsurprisingly, documentation of the data template was critical for teams with more than a single PI. Hackathons, where the working group was co-located and able to commit to focusing on completing tasks, were critical to several projects (ISRaD, SIDb, and SoDaH). Hackathons allowed group members from different institutions to work in a focused period of time and address questions within the group immediately, as opposed to a lag of several hours or days over email. Finally, back-and-forth between the data providers and the synthesis team was common, often extensive, and critical for addressing ambiguities in the submitted data.

2.4 Obtaining a copy of the project

Notably, online repositories were not the primary access point for synthesis products. Data was generally stored in a GitHub repository or hosted on a project website. The data was mainly formatted as relational data tables, frequently distributed as an Excel file.
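For a product released this way, access might look like the following minimal sketch, assuming a hypothetical Excel workbook with one sheet per relational table and a shared key column; the file, sheet, and column names are invented here.

```python
import pandas as pd

# Load every sheet of a hypothetical release as a dict of DataFrames,
# then join two of the relational tables on their assumed shared key.
workbook = pd.read_excel("synthesis_product.xlsx", sheet_name=None)
merged = workbook["layers"].merge(workbook["sites"], on="site_id", how="left")
```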

2.5 Pain points and suggestions

Common pain points stated by those interviewed included the generational transfer of the data product, the required mix of informatics and soils expertise, and complex, comprehensive data structures that led to large amounts of unique vocabulary. Data product synthesis efforts typically take longer than a standard 3-year funding cycle. Thus, data products are often repurposed to address slightly different questions over the course of their lifecycle (ISRaD is a particularly good example of this). Both the inherent complexity of soils data and this repurposing frequently led to changes in the project data template. These changes in template structure were often challenging to implement, requiring revisiting the original data sources to ensure complete capture of the data under the new template.

Soil data templates are also often particularly large and diverse, consisting of hundreds of unique columns (ISCN, ISRaD) unless deliberately constrained (Crowther, WoSIS). These two approaches reflect different philosophies: capturing all of the data to contextualize each measurement, versus deliberately stripping context so as not to artificially inflate confidence. These large datasets therefore required a unique skill combination, soil science paired with computer programming or database expertise, which can be particularly difficult to find.

To address some of these concerns, the researchers suggested starting with the right team, involving both soil scientists and informaticians or programmers, as well as taking more time to develop the template before its use so that it would not need to be modified later. Positive suggestions included (1) having the community weigh in on needs and wants for the database and (2) hosting hackathons and workshops to focus on completing tasks. These steps help ensure that the final data product is completed efficiently and that, once completed, it will be used.

3 Next Steps

This work has highlighted the need for ontologies, which would allow soil scientists to make their studies more organized and extendable in the future. Through the community surveys and the recent ESIP convention, Dr. Todd-Brown and others have started a new ESIP Cluster: Soil Ontology and Informatics (https://wiki.esipfed.org/Soil_Ontologies_and_Informatics). These meetings take place once every two weeks and have demonstrated the need to bridge the gap in ontologies across different fields of the soil science community.

Additionally, we are in the process of developing a soil data product best practice manuscript that would aid future efforts (pending as of Fall 2020).

Through the best practices paper and the soil ontology effort, we intend to use our findings to change the way that soil data is organized, documented, and stored in order to promote ease of use and longevity of the data.