Data housekeeping

File names

General principles

  • Machine readable
  • Human readable
  • Plays well with default ordering

Separators

  • No spaces
  • Underscore to separate
  • Hyphen to combine

Date format follows ISO 8601

2018-12-03
2018-12-06_1700

Bad names

  • PhD-project-Jan19 alldata_final.foo
  • Finacial detailes BIocore 19/11/12.xls
  • ATACseq1Londonmapped.bam
  • Hlad.jez.M-L-průtoky JíObj.z Ohře-od 10-2011.xlsx

Good names

  • Iris-setosa_samples_1927-05-12.csv
  • PI102_Mouse12_EEG_2018-11-03_1245.tsv
  • Bioinfiniti_FullProposal_2018-11-15_1655.do


From Jenny Bryan by CC-BY (https://speakerdeck.com/jennybc/how-to-name-files)
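
The naming rules on this slide can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the helper names `build_name` and `follows_convention` and the exact regular expression are assumptions, and a group may well want a stricter or looser pattern.

```python
import re
from datetime import datetime

def build_name(parts, when, ext, with_time=False):
    """Build a file name: underscores separate fields, hyphens combine words.

    Hypothetical helper; the field order (fields, then date) is an assumption.
    """
    # Inside a field, replace spaces and underscores with a hyphen.
    fields = [re.sub(r"[\s_]+", "-", p.strip()) for p in parts]
    # ISO 8601 date, optionally followed by a 24-hour HHMM time.
    stamp = when.strftime("%Y-%m-%d_%H%M" if with_time else "%Y-%m-%d")
    return "_".join(fields + [stamp]) + "." + ext

# Assumed pattern: fields of letters, digits and hyphens, separated by
# underscores, then YYYY-MM-DD, an optional _HHMM, and an extension.
NAME_RE = re.compile(
    r"[A-Za-z0-9-]+(_[A-Za-z0-9-]+)*_\d{4}-\d{2}-\d{2}(_\d{4})?\.[A-Za-z0-9]+"
)

def follows_convention(name):
    """Return True if the whole name matches the assumed pattern."""
    return NAME_RE.fullmatch(name) is not None

if __name__ == "__main__":
    print(build_name(["Iris setosa", "samples"], datetime(1927, 5, 12), "csv"))
    print(follows_convention("PI102_Mouse12_EEG_2018-11-03_1245.tsv"))
    print(follows_convention("ATACseq1Londonmapped.bam"))
```

Names built this way sort chronologically under default ordering, because the ISO 8601 date puts the year first.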

Data housekeeping

File organization

  • Have folder organization conventions for your group

    • Per Paper
    • Per Study/Project
    • Per Collaborator
  • Keep readme files for data

    • Title
    • Date of Creation/Receipt
    • Instrument or software specific information
    • People involved
    • Relations between multiple files/folders
  • Separate files you are actively working on from the old ones

  • Orient newcomers to the group's conventions
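
A readme with the fields listed above could be generated from a small template. The sketch below is one possible layout, not a group standard: the function names, the `README.txt` file name and the sample values in the usage part are all hypothetical.

```python
from datetime import date
from pathlib import Path

def readme_text(title, created, instrument, people, relations):
    """Render the readme fields from the slide as plain text."""
    lines = [
        f"Title: {title}",
        f"Date of creation/receipt: {created.isoformat()}",
        f"Instrument or software: {instrument}",
        "People involved: " + ", ".join(people),
        "Relations between files/folders:",
    ]
    lines += [f"  - {relation}" for relation in relations]
    return "\n".join(lines) + "\n"

def write_readme(folder, **fields):
    """Write README.txt (an assumed file name) into the data folder."""
    path = Path(folder) / "README.txt"
    path.write_text(readme_text(**fields), encoding="utf-8")
    return path

if __name__ == "__main__":
    # All values below are made-up placeholders.
    print(readme_text(
        title="Mouse EEG recordings",
        created=date(2018, 11, 3),
        instrument="hypothetical amplifier, acquisition software v1",
        people=["A. Researcher", "B. Technician"],
        relations=["raw/ holds source files", "derived/ is computed from raw/"],
    ))
```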

Data housekeeping

When working

  • Clarify and separate source and intermediate data
  • Keep data copies to a minimum
  • Clean up after analysis
  • Clean up copies created for presentations or for sharing
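
One way to spot leftover copies is to group files by a hash of their content. The sketch below assumes that byte-identical files count as copies. `find_duplicates` is a hypothetical helper, and it only reports candidates; deciding which copy to keep stays a manual step.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(root):
    """Return lists of paths under root whose contents are byte-identical."""
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_hash[sha256_of(path)].append(path)
    # Keep only hashes that occur more than once.
    return [paths for paths in by_hash.values() if len(paths) > 1]

if __name__ == "__main__":
    for group in find_duplicates("."):
        print("Identical content:", ", ".join(str(p) for p in group))
```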

Data housekeeping

End of project

  • Hand over data to a new responsible person when leaving
  • Data should be kept as a single copy on server-side storage
    • No copies on desktops and external devices
  • Use non-proprietary formats
  • Provide minimal metadata
  • Sensitive data (e.g. whole genome) must be encrypted


* If not specified otherwise, data must be kept for **10 years** following project end for reproducibility purposes.

Note: sometimes it is hard to find or understand a dataset that is only 10 days old.

In doubt on data archival?

Contact R3 for support on archival of datasets using tickets:

Data housekeeping - Summary

Server is your friend!

  • Allows a consistent backup policy for your datasets
  • Keeps number of copies to minimum
  • Specification of clear access rights
  • High accessibility
  • Data are discoverable
  • Server can't be stolen

General guidelines

  • Use institutional media for storage of all data
  • Research data (particularly sensitive data) should be in a single source location
  • Enable encryption for data stored on movable media
  • Clarify and separate source and intermediate data
  • Disable write access to relevant source data (read-only)
  • Backup research data!
  • Install antivirus software
  • Generate checksums
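
The last guidelines, read-only source data and checksums, can be combined in a short script. This is a minimal sketch under stated assumptions: SHA-256 is chosen here only as an example algorithm, `protect_source` and `changed_files` are hypothetical names, and on Windows `chmod` only toggles the read-only flag.

```python
import hashlib
import stat
from pathlib import Path

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH

def checksum(path):
    """SHA-256 of a file; reads the whole file at once, fine for a sketch."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def protect_source(folder):
    """Record a checksum per file, then remove write permission from it."""
    manifest = {}
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            manifest[path.relative_to(folder).as_posix()] = checksum(path)
            path.chmod(path.stat().st_mode & ~WRITE_BITS)
    return manifest

def changed_files(folder, manifest):
    """Return names whose current checksum differs from the manifest."""
    return [
        name for name, expected in manifest.items()
        if checksum(Path(folder) / name) != expected
    ]
```

Storing the manifest next to the data (for example as a text file) lets anyone later confirm that the source files are unchanged.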