Commit d3b1e26d authored by Laurent Heirendt

Merge branch 'file_naming' into 'develop'

File naming

See merge request R3/outreach/presentations!21
parents 170dc8f0 d9bcb166
325 additions and 0 deletions
## Checksums - What are they?
<center><img src="slides/img/checksum.png" height="600px"></center>
# Checksums - When?
* transmission
  - new dataset from collaborator
  - upload to remote repository
* long-term storage
  - master version of dataset
  - snapshot of data for publication
# Checksums - How?
* Linux and macOS
  - command-line tools: `shasum`, `md5sum` (Linux), `md5` (macOS)
    ```bash
    shasum -a 256 my_file.csv > my_file.sha256
    ```
  - verifying the checksum
    ```bash
    shasum -c my_file.sha256
    my_file.csv: OK
    ```
* Windows
  - creation
    ```
    certutil -hashfile my_file.csv SHA256 > my_file.sha256
    ```
  - verification: re-run **certutil** and manually compare the generated checksums
  - **MD5Summer**: a free checksum tool with a GUI
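
The manual comparison step can also be scripted in a platform-independent way. The following is a minimal sketch using Python's standard `hashlib` module; the function names `sha256_of` and `verify` are illustrative, and the checksum file is assumed to be in the `shasum` format (digest first, then the file name). The output of `certutil` is laid out differently and would need its own parsing.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large datasets do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, checksum_file):
    """Compare a file with the digest stored in a shasum-style checksum file.

    Assumption: the first whitespace-separated token of the checksum file
    is the expected hex digest.
    """
    expected = Path(checksum_file).read_text().split()[0].lower()
    return sha256_of(path) == expected
```

Calling `verify("my_file.csv", "my_file.sha256")` would then return `True` only if the file is unchanged since the checksum was created.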
# Final notes
* plan ahead
* follow guidelines of your team
* ensure proper dissemination of your naming policy
* consistency overrules personal preference
# Human readable
name carries information about content
* use English
* balance name length against clarity
* don't be too creative and stay professional
<!-- "PEPA_d-pic.jpeg" - a fourth **pic**ture from your paper on **P**erformance **E**valuation **P**rocess **A**lgebra -->
* **never** use suffixes (or prefixes) like **"final"**, **"old"**, **"new"**, **"current"**, **"obsolete"**, **"recent"**, **"latest"**, **"best"**...
* leave out meaningless words, e.g. "the", "and", "a", "file", "data" ...
# Human readable
* do not forget about extensions
```
Iris-setosa-table.csv
video-2019-annual_meeting.avi
2019-12-11-notes.log
ATAC-seq1-London-mapped.bam
A2452-description_tutorial.info
```
* be specific
| Bad name | Better name |
| ------------------------- | ----------------------------------------------------- |
| myabstract.txt | John-White_Sensitivity-of-PLFA-analyses_abstract.txt |
| samples_project_start.csv | PA324_samples_2019-12-11.csv |
| ms_cresp_final.doc | John-White_Cell-respiration_manuscript_2019-12-11.doc |
| fig_1.png | John-White_Cell-respiration_fig-1_2019-12-11.png |
## File naming
<br><br>
#### June 25th, 2019
<div style="top: 6em; left: 0%; position: absolute;">
<img src="theme/img/lcsb_bg.png">
</div>
<div style="top: 5em; left: 60%; position: absolute;">
<img src="slides/img/r3-training-logo.png" height="200px">
<br><br><br><br>
<h3>Naming files and checksums</h3>
<br><br><br><br>
<h4>
Vilem Ded<br><br>
vilem.ded@uni.lu<br><br>
<i>Luxembourg Centre for Systems Biomedicine</i>
</h4>
</div>
[
{
"filename": "index.md",
"attr": {
}
},
{
"filename": "three_principles.md",
"attr": {
}
},
{
"filename": "machine_readable.md",
"attr": {
}
},
{
"filename": "human_readable.md",
"attr": {
}
},
{
"filename": "ordering.md",
"attr": {
}
},
{
"filename": "final-notes.md",
"attr": {
}
},
{
"filename": "checksums_intro.md",
"attr": {
}
},
{
"filename": "thanks.md",
"attr": {
}
}
]
# Machine readable
* case sensitivity
* <span style="color:red">special characters</span>:
**&#35;&#36;&#37;&#38;&#39;&#40;&#34;&#41;&#42;&#43;&#44; &#45;&#46;&#47;&#58;&#59;&#60;&#61;&#62;&#63;&#64;&#91;&#92; &#93;&#94;&#95;&#96;&#123;&#124;&#125;&#126;** and
whitespace characters such as **space** or **tab**
* <span style="color:red">accented characters</span>:
**&#231;**, **&#228;**, **&#244;**,
**&#283;**, **&#341;**, ...
* <span style="color:red">lack of consistency</span>:
```
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFractions_B03.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Celline-100-1MutantFraction_B03.csv
2013-06-26_BRAFWTNEGASSAY_Plazmid-Cellline-100-1MutantFraction_B03.csv
2013-06-26_BRAFWTNEGASSAY_plasmid-Celline-100-1MutantFraction_B03.csv
```
# Machine readable
So what should I use?
* alphanumeric characters
* hyphen "**&#45;**" to combine t-h-i-n-g-s
* underscore "**&#95;**" to separate
Example:
"projectID_method-name_sampleID_YYYY-MM-DD.ext"
"Author-Name_Paper-Name_manuscript_YYYY-MM-DD.doc"
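
A naming rule like this can be checked automatically. The sketch below is one possible reading of the rule, not an official policy: fields made of ASCII letters and digits, hyphens only inside a field, underscores only between fields, and a single extension. The names `FIELD`, `NAME_PATTERN` and `is_machine_readable` are made up for this illustration.

```python
import re

# One field: alphanumeric chunks, optionally combined with single hyphens.
FIELD = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"

# Whole name: fields separated by underscores, then ".ext".
# Assumption: exactly one alphanumeric extension (so "archive.tar.gz" fails).
NAME_PATTERN = re.compile(rf"{FIELD}(?:_{FIELD})*\.[A-Za-z0-9]+")


def is_machine_readable(name):
    """Return True if the name uses only the allowed characters and layout."""
    return NAME_PATTERN.fullmatch(name) is not None
```

The check covers characters and separators only. A name such as `ATACseq1Londonmapped.bam` passes it even though it has no separators to split on, so the rule complements the guidelines rather than replacing them.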
# Machine readable
**Ugly** names:
```
"Hlad.jez.M-L-průtoky JíObj.z Ohře-od 10-2011.xlsx"
"Finacial detailes BIocore 19/11/12.xls"
"ATACseq1Londonmapped.bam"
```
**Nice** names:
```
Hipo-1631_CA1_iba-1488_GFAP-647.tiff
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv
2013-06-26_BRAFWTNEGASSAY_FFPEDNA-CRC-1-41_A01.csv
PI102_Mouse12_EEG_2018-11-03_1245.tsv
Bioinfiniti_FullProposal_2018-11-15_1655.docx
```
# Machine readable
* globbing and pattern search
```{r}
> list.files(pattern = "Plasmid")
```
<!-- Python: re.search("Plasmid", txt)
Bash: ls -l *Plasmid* -->
Result
```
...
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A02.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A03.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_B01.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_B02.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_B03.csv
...
```
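
The same kind of selection can be sketched in Python with shell-style wildcards from the standard `fnmatch` module. The list `NAMES` reuses file names from these slides, and `select` is a helper name chosen for this example; on a real directory the names would come from something like `os.listdir()`.

```python
import fnmatch

# Sample names from the slides; normally read from a directory listing.
NAMES = [
    "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv",
    "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_B03.csv",
    "2013-06-26_BRAFWTNEGASSAY_FFPEDNA-CRC-1-41_A01.csv",
    "PI102_Mouse12_EEG_2018-11-03_1245.tsv",
]


def select(names, pattern):
    """Return the names matching a shell-style wildcard pattern (case-sensitive)."""
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]


plasmid_files = select(NAMES, "*Plasmid*")  # every file of the Plasmid origin
well_a01 = select(NAMES, "*_A01.csv")       # every CSV for sample A01
```

Because the names are consistent, one pattern is enough to pick out all files of a given origin or sample.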
# Machine readable
* (meta)data extraction
```{r}
library(stringr)  # provides str_split_fixed()
str_split_fixed(flist, "[_\\.]", 5)
```
<!-- names(flist_df) <- c("Date", "Project", "Origin", "SampleID", "Format") -->
| Date | Project | Origin | SampleID | Format |
|--------------|------------------|----------------------------------------|----------|--------|
| "2013-06-26" | "BRAFWTNEGASSAY" | "Plasmid-Cellline-100-1MutantFraction" | "A01" | "csv" |
| "2013-06-26" | "BRAFWTNEGASSAY" | "Plasmid-Cellline-100-1MutantFraction" | "A02" | "csv" |
| "2013-06-26" | "BRAFWTNEGASSAY" | "Plasmid-Cellline-100-1MutantFraction" | "A03" | "csv" |
| "2013-06-26" | "BRAFWTNEGASSAY" | "Plasmid-Cellline-100-1MutantFraction" | "B01" | "csv" |
| "2013-06-26" | "BRAFWTNEGASSAY" | "Plasmid-Cellline-100-1MutantFraction" | "B02" | "csv" |
| "2013-06-26" | "BRAFWTNEGASSAY" | "Plasmid-Cellline-100-1MutantFraction" | "B03" | "csv" |
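
For comparison, a rough Python equivalent of the split above could look as follows. The column names are the ones used in the table; `parse_name` is a hypothetical helper, and it assumes that underscores and dots appear only as field separators.

```python
import re

# Column names as in the table above.
FIELDS = ("Date", "Project", "Origin", "SampleID", "Format")


def parse_name(name):
    """Split a file name on "_" and "." and label the resulting parts."""
    parts = re.split(r"[_.]", name)
    if len(parts) != len(FIELDS):
        # A different number of parts means the name breaks the convention.
        raise ValueError(f"unexpected number of fields in {name!r}")
    return dict(zip(FIELDS, parts))


record = parse_name(
    "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv"
)
```

Raising an error on a wrong field count is a deliberate choice here: inconsistent names are caught when they are parsed instead of silently producing shifted columns.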
## Default ordering
use of inbuilt ordering
* terms in general-to-specific order - logical ordering
"projectID_method-name_sampleID_YYYY-MM-DD.ext"
<!-- TODO -->
* date first - chronological ordering
* digits first - explicit ordering
## Default ordering
use of inbuilt ordering
* terms in general-to-specific order - logical ordering
<!-- TODO -->
* date first - chronological ordering
```
2013-06-26_Plasmid_A01.csv
2014-06-26_Plasmid_C02.csv
2015-06-30_Plasmid_A03.csv
2015-07-12_Plasmid_B01.csv
2015-07-13_Plasmid_B02.csv
2015-11-10_Plasmid_B03.csv
```
* digits first - explicit ordering
## Default ordering
use of inbuilt ordering
* terms in general-to-specific order - logical ordering
<!-- TODO -->
* date first - chronological ordering
* digits first - explicit ordering
```
01_Plasmid_A01_2013-06-26.csv
02_Plasmid_C02_2014-06-26.csv
03_Plasmid_A03_2015-06-30.csv
10_Plasmid_B01_2015-07-12.csv
11_Plasmid_B02_2015-07-13.csv
25_Plasmid_B03_2015-11-10.csv
```
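
Default ordering is usually a plain character-by-character comparison, which is why ISO dates (YYYY-MM-DD) and zero-padded numbers behave well. The short Python sketch below illustrates this with file names from the slides; `prefix_with_index` is a hypothetical helper for adding padded counters.

```python
def prefix_with_index(names, width=2):
    """Prefix each name with a zero-padded running number (01_, 02_, ...)."""
    return [f"{index:0{width}d}_{name}" for index, name in enumerate(names, start=1)]


# ISO dates sort chronologically under plain string comparison.
by_date = sorted([
    "2015-07-12_Plasmid_B01.csv",
    "2013-06-26_Plasmid_A01.csv",
    "2014-06-26_Plasmid_C02.csv",
])

# Without padding, "10" sorts before "1" and "2".
unpadded = sorted(["1_x.csv", "2_x.csv", "10_x.csv"])

# With padding, the intended order survives sorting.
padded = sorted(prefix_with_index(["first.csv", "second.csv", "third.csv"]))
```

The `width` parameter should be chosen with the expected number of files in mind, since two digits only keep their order up to 99.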
## Three main principles
* Machine readable:
  * easily search for files later
  * easily narrow file lists based on names
  * easily extract info from file names, e.g. by splitting, regex, ...
* Human readable:
  * easily understand what the file is and what it contains
  * easily share files with others
* Plays well with default ordering:
  * logically ordered/clustered
  * fast manual search
## Thank you.
<center><img src="slides/img/r3-training-logo.png" height="150px"></center>
Contact us if you need help:
<a href="mailto:r3lab.core@uni.lu">r3lab.core@uni.lu</a>
<br><br>
Resources:
Jenny Bryan's [slides](https://speakerdeck.com/jennybc/how-to-name-files) on "Naming things" from Reproducible Science Workshop, Duke, 2015
Semantic versioning - [semverdoc.org](https://semverdoc.org/)
LCSB *IT101* training [presentation](https://git-r3lab.uni.lu/R3/labCards/uploads/738930b9a533a2f308cc62c431d9246f/it101.html)
## Three main principles
* Machine readable:
  * search for files later
  * narrow file lists based on names
  * extract info from file names, e.g. by splitting, regex, ...
* Human readable:
  * understand what the file is and what it contains
  * share files with others
* Plays well with default ordering:
  * logically ordered/clustered
  * fast manual search