agenda

  • reproducibility & open science
  • data management resources and expectations of the CAP LTER
  • data management resources at ASU

goals

  • open, publicly accessible data
  • available, transparent workflow

Reproducible research is the ability to recompute data analytic results given an observed data set and knowledge of the data analysis pipeline.

Replicating studies remains the gold standard for rigorous scientific research, but reproducibility is viewed increasingly as a minimum standard that all scientists should strive toward.

reproducibility for scientific reasons

  • increased trustworthiness
  • more rigorous, reliable science
  • learning more from other works





“The climate scientists at the centre of a media storm over leaked emails were yesterday cleared of accusations that they fudged their results and silenced critics, but a review found they had failed to be open enough about their work.”

reproducibility for personal reasons

  • what if something needs to be redone, how do you manage that?
  • how do you keep track of data sources, versions?
  • how did I make that figure?

“Basically, if the thought of redoing your analyses is terrifying then you are doing it wrong.” — J. Bryan

Lowndes, J., Best, B., Scarborough, C. et al. Our path to better science in less time using open data science tools. Nat Ecol Evol 1, 0160 (2017)

most basic principle for reproducible research: do everything via code

  • downloading data from the web
  • converting an Excel file to CSV
  • renaming columns or variables
  • omitting bad samples or data points

…do all of these programmatically
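
The four steps listed above can be sketched in one R script. This is a minimal illustration under stated assumptions, not a CAP LTER workflow: the function names, file names, and column names are hypothetical, the Excel step assumes the readxl package is installed, and a small in-memory data frame stands in for the downloaded file so that the cleaning steps run on their own.

```r
# All function names, file names, and column names here are hypothetical.

# Step 1: download data from the web (skipped if the file is already on disk).
fetch_raw <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  dest
}

# Step 2: convert an Excel file to CSV (assumes the readxl package is installed).
excel_to_csv <- function(xlsx_path, csv_path) {
  sheet <- readxl::read_excel(xlsx_path)
  write.csv(sheet, csv_path, row.names = FALSE)
  csv_path
}

# Steps 3 and 4: rename columns, then omit bad samples.
clean_samples <- function(raw, bad_ids) {
  names(raw)[names(raw) == "Sample.ID"] <- "sampleID"
  names(raw)[names(raw) == "NO3.mg.L"] <- "nitrate_mg_l"
  raw[!raw$sampleID %in% bad_ids, , drop = FALSE]
}

# Stand-in for the table that steps 1 and 2 would produce.
raw <- data.frame(
  Sample.ID = 1:6,
  NO3.mg.L = c(0.4, 0.5, 0.3, 9.9, 8.7, 0.6)
)

cleaned <- clean_samples(raw, bad_ids = c(4, 5))
print(cleaned)
```

Because every step is a function call in a script, rerunning the whole pipeline after a change to the raw data is a single command, and the script itself documents what was done.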

on the importance of scripting

consider, we want to remove samples that we feel may have been compromised…

in an R script

some work...

# remove samples 4 and 5, which may have
# been compromised due to the monsters;
# assign the result so the removal persists

chemistry_data <- chemistry_data %>%
  dplyr::filter(!sampleID %in% c(4, 5))

...more work

in a spreadsheet

it is all about taming the chaos

Bryan, J. Excuse Me, Do You Have a Moment to Talk About Version Control? (2018)

Braga, P. H. P., Hébert, K., et al. Not just for programmers: How GitHub can accelerate collaborative and reproducible research in ecology and evolution. Methods in Ecology and Evolution, 14, 1364–1380. (2023)

tidy data

“Open an intro to ANY statistics textbook and you will find that statistics (analysis, plotting - anything, really) starts once you have tidy data.” - Dr. Dianne Cook
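
As a small illustration of what tidy data looks like in practice, the sketch below reshapes a wide table (one column per sampling year) into a long table with one row per observation and one column per variable. The data and column names are made up for the example, and base R's `reshape()` is used only to avoid extra dependencies; `tidyr::pivot_longer()` is a common alternative.

```r
# Hypothetical wide table: one row per site, one column per sampling year.
wide <- data.frame(
  site = c("A", "B"),
  nitrate_2021 = c(0.4, 0.7),
  nitrate_2022 = c(0.5, 0.9)
)

# Tidy (long) form: one row per site-year observation,
# one column each for site, year, and nitrate.
long <- reshape(
  wide,
  direction = "long",
  idvar = "site",
  varying = c("nitrate_2021", "nitrate_2022"),
  v.names = "nitrate",
  timevar = "year",
  times = c(2021, 2022)
)
rownames(long) <- NULL
print(long)
```

In the long form, year is a variable that can be filtered, grouped, or plotted directly rather than information hidden in column names.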


project organization

“If I could go back to my grad student self and tell her one thing, it would be: have a system to organize your files, data, and project notes. You’re thinking a lot about how to generate data, and what those data mean in the broader body of literature, but this is as important.” - Dr. Jacquelyn Gill


  • use one folder per project
  • separation of data, methods, output, etc.
  • craft informative names
  • include README files
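
The points above can be put into practice with a short script that builds the same skeleton for every new project. This is one possible layout, not a required standard: the folder names, the project name, and the `create_project()` helper are all assumptions made for illustration.

```r
# Create a one-folder-per-project skeleton.
# Folder names are an assumption, chosen to keep data, methods (code),
# and output separate.
create_project <- function(root) {
  subdirs <- c("data_raw", "data_clean", "scripts", "output", "docs")
  for (d in file.path(root, subdirs)) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }

  # A README that explains what each folder holds.
  readme <- file.path(root, "README.md")
  if (!file.exists(readme)) {
    writeLines(c(
      paste("#", basename(root)),
      "",
      "data_raw/    original data, never edited by hand",
      "data_clean/  data produced by scripts",
      "scripts/     analysis code, numbered in run order",
      "output/      figures and tables",
      "docs/        notes and manuscripts"
    ), readme)
  }
  invisible(root)
}

# Hypothetical, informative project name; built in a temporary directory here.
project <- file.path(tempdir(), "2024_stormwater_nitrate")
create_project(project)
print(list.files(project))
```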

standardization

comprehensive map of all countries in the world that use MMDDYYYY format
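
One concrete standardization step is to store dates in ISO 8601 form (YYYY-MM-DD). The sketch below, using made-up dates, parses month-first strings with an explicit format and writes them back out in ISO 8601.

```r
# Hypothetical dates recorded month-first (MM/DD/YYYY).
us_dates <- c("03/04/2021", "12/31/2021")

# State the input format explicitly rather than letting software guess
# whether 03/04 means March 4 or April 3.
parsed <- as.Date(us_dates, format = "%m/%d/%Y")

# Write dates out as YYYY-MM-DD, which is unambiguous and sorts
# correctly even when treated as text.
iso_dates <- format(parsed, "%Y-%m-%d")
print(iso_dates)
```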

good coding practices

A recent study aiming to run 2,000 projects’ worth of R code found that 74% of the associated R files failed to complete without error (Trisovic et al. 2022). Many of those errors involve coding practices that hinder reproducibility but are easily preventable by the original code authors.
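
The slide does not name specific practices, so the sketch below is an assumption about what "easily preventable" can look like: declaring packages at the top of a script, building paths relative to the project folder, fixing the random seed, and recording the session. The `missing_packages()` helper and the file names are hypothetical.

```r
# Packages the analysis needs, listed once at the top of the script.
required <- c("stats", "utils")

# Report any missing packages up front instead of failing midway through.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

not_installed <- missing_packages(required)
if (length(not_installed) > 0) {
  stop("Install before running: ", paste(not_installed, collapse = ", "))
}

# Build paths relative to the project folder; avoid setwd() with an
# absolute path that exists on one machine only.
input_path <- file.path("data_raw", "chemistry.csv")  # hypothetical file

# Fix the random seed so any resampling gives the same result on every run.
set.seed(42)

# Record R and package versions alongside the results
# (written to a temporary directory here; a project would use its output folder).
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), session_file)
```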

publishing data (and code)

work with your mentor to determine the best way to publish your data and code

https://sustainability-innovation.asu.edu/caplter/

ASU research data management services

https://rto.asu.edu/research-data-management/

    PLAN

  • proposal editing
  • data management plans
  • training
    • information security
    • HIPAA
  • ADHS (health data)

   GENERATE

  • LabArchives
  • REDCap
  • secure data environment
  • data storage locator

   PROCESS

  • high-performance computing
  • library data science & analytics
  • CHS Biostats core

   SHARE

  • digital repository