diff --git a/cards.md b/cards.md index 320ddda8fa8da2e328cbdbc912bc8ed1bc41df22..f4715b1a808714d4694b519968964c5ba8692bf9 100644 --- a/cards.md +++ b/cards.md @@ -79,7 +79,8 @@ order: -1 <li><a href="external/integrity/encryption/file">Encrypting Files and Folders</a></li> <li><a href="external/integrity/naming">Naming files</a></li> <li><a href="external/integrity/organization">Organization</a></li> + <li><a href="external/integrity/spreadsheets">Working with spreadsheets</a></li> </ul> </div> -</div><br><center><a href='/'>go back</a></center><br><center><a href='/cards'>Overview of all HowTo cards</a></center> \ No newline at end of file +</div><br><center><a href='/'>go back</a></center><br><center><a href='/cards'>Overview of all HowTo cards</a></center> diff --git a/external/integrity/spreadsheets/img/excel_analyses-sheet.jpeg b/external/integrity/spreadsheets/img/excel_analyses-sheet.jpeg new file mode 100644 index 0000000000000000000000000000000000000000..1e24b5575fa05d9e6e7aabb3cb2eaa3bc3251301 --- /dev/null +++ b/external/integrity/spreadsheets/img/excel_analyses-sheet.jpeg @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b17184a42b86f91bd4067c4bd00f9fdd717ed38ba60df1eca0c3c4f421c0c4a1 +size 122967 diff --git a/external/integrity/spreadsheets/img/excel_data-sheet.png b/external/integrity/spreadsheets/img/excel_data-sheet.png new file mode 100644 index 0000000000000000000000000000000000000000..63a824bcefb1c303f813ec5367400664496ee2aa --- /dev/null +++ b/external/integrity/spreadsheets/img/excel_data-sheet.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:dace0d7847c2fbe29a49869180b58697ceeaa8633d7f1d570fe4e8910542ba9f +size 26789 diff --git a/external/integrity/spreadsheets/spreadsheets.md b/external/integrity/spreadsheets/spreadsheets.md new file mode 100644 index 0000000000000000000000000000000000000000..c5146a650fdabec45bf6691cd9169ca7be763bf8 --- /dev/null +++ 
b/external/integrity/spreadsheets/spreadsheets.md @@ -0,0 +1,79 @@ +--- +layout: page +permalink: /external/integrity/spreadsheets/ +shortcut: integrity:spreadsheets +redirect_from: + - /cards/integrity:spreadsheets + - /external/cards/integrity:spreadsheets +--- + +# Working with spreadsheets + +Spreadsheets are widely used tools for manipulating tabular data. They make data input easy and allow simple formatting, validation and visualization. + +This card describes the tabular data format, common mistakes made in spreadsheets, and how to use your spreadsheet application effectively while increasing the re-usability, quality and accuracy of your data. + +## What is a table (tabular data format)? + +Data in tabular format meet 3 key conditions: + + 1. the table has a one-line header with unique, machine-readable column names + 2. the table has rows representing individual observations + 3. the table has columns representing attributes/features of the observations, each containing values of one data type + +| Table | Not a table | +|:-----------------------------------|---------------------------| +|<img src="./img/excel_data-sheet.png" height=200> | <img src="./img/excel_analyses-sheet.jpeg" height=200>| + +## Tips and Tricks + +### Keep the original + +Changes in spreadsheets are not tracked. Any update or change should produce a new file labeled by version, with the changes described in a change log. + +### Export data after collection + +For reproducibility purposes, the collected data should always be exported from a proprietary format (.xlsx, .xls, ...) into a non-proprietary format (.csv, .tsv, etc.) with minimal metadata in a README file. + +### Cell + +- Use field validation - validation rules on columns ensure your data is checked automatically at input. +- Avoid non-exportable proprietary content - visual formatting (cell coloring / outlining), embedded comments and charts, merged cells, ... + +### Table + +- Keep header column names machine readable. 
You can follow the same best practices as for file naming (see our [Card on file naming](../naming/file_naming.md)). +- Keep values in columns atomic. +- Use primary keys - values in one particular column should be unique for the whole table. This will allow you to create unique references pointing to one and only one observation/record. +- Do not insert empty rows or columns that would split the table in two. +- Keep data in long format (sometimes referred to as narrow, gathered or melted format). All columns should be meaningful for all observations. If a new observation requires a new column to be created, or if the observation's data ends up in just one cell instead of the whole row, your table is most probably **not** in long format. + +#### MS Excel Table tool + +The MS Excel feature called **Table** (found in Insert->Table) lets you create a real table object instead of just a cell range. Its main advantages are: + +- Table formatting and column validation are extended automatically to each new observation/record. +- Each table object can be referenced by its name - no more (named) cell ranges and hard-to-read formulas. +- Automatically adds filter buttons and subtotals. + +### Sheet + +- Keep one table per sheet or workbook. +- Start your table in the first column (and preferably on the first row). +- Do not insert any values next to or below your table - add the content to a new column, new table, analyses sheet or file with metadata. +- Keep metadata about the table in a separate sheet in tabular format or in a separate file. If you must, keep metadata **above** the table itself. + +### Analyses + +- Keep data separate from the analyses - create a link to the data from sheets or workbooks containing the analyses. +- Use pivot tables - if your data is in long format (it should be), it is very easy to create dynamic summary tables. +- Use pivot charts - you can produce your desired auto-refreshing charts while still keeping your data in long format. 
+- Script more advanced analyses and data manipulation using standard tools for data processing (R, Python, Bash, ...). + +## Further reading + +Data Curation Network - [Microsoft Excel Data Curation Primer](https://github.com/DataCurationNetwork/data-primers/blob/master/Excel%20Data%20Curation%20Primer/Excel%20Data%20Curation%20Primer.md) + +Data Carpentry - [Spreadsheet ecology lesson](https://datacarpentry.org/spreadsheet-ecology-lesson/) + +Wikipedia - [Wide and narrow data](https://en.wikipedia.org/wiki/Wide_and_narrow_data)
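
Several of the rules in this card (unique machine-readable column names, no empty rows, a primary key column) can be checked with a script once the data is exported to CSV. The following is a minimal sketch in Python using only the standard library, not part of the card itself. The function name `check_table`, the sample column names, and the naming pattern are assumptions made for illustration; the card does not define what counts as "machine readable", so the pattern here (letters, digits and underscores, starting with a letter) is only one possible choice.

```python
import csv
import io
import re

# Assumption: "machine readable" means letters, digits and underscores only,
# starting with a letter. The card itself does not define a pattern.
NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def check_table(csv_text, key_column):
    """Return a list of human-readable problems found in a CSV table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["table is empty"]
    header, records = rows[0], rows[1:]
    problems = []

    # One header line with unique, machine-readable column names.
    for name in header:
        if not NAME_PATTERN.match(name):
            problems.append(f"column name not machine readable: {name!r}")
    if len(set(header)) != len(header):
        problems.append("duplicate column names in header")

    # Empty rows split the table in two; other rows must match the header width.
    for number, record in enumerate(records, start=2):
        if not any(cell.strip() for cell in record):
            problems.append(f"line {number}: empty row")
        elif len(record) != len(header):
            problems.append(
                f"line {number}: expected {len(header)} cells, got {len(record)}"
            )

    # Primary key: values in the key column must be unique for the whole table.
    if key_column in header:
        index = header.index(key_column)
        keys = [record[index] for record in records if len(record) > index]
        if len(set(keys)) != len(keys):
            problems.append(f"duplicate values in key column {key_column!r}")
    else:
        problems.append(f"key column {key_column!r} not found")
    return problems


if __name__ == "__main__":
    # Hypothetical sample data with a bad header, an empty row and a repeated key.
    sample = "Sample ID,species\nS1,mouse\n\nS1,rat\n"
    for problem in check_table(sample, "Sample ID"):
        print(problem)
```

The check works on exported CSV text rather than on the .xlsx file, which matches the advice to export data into a non-proprietary format first. It does not test for atomic values or long format, since those depend on what the columns mean.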