From Chaotic Scripts to Structured Science
Essential Components:
Where should it live?
- $SCRATCH storage on HPC systems
- data sub-folder
1. Raw data is read-only:
2. Document data provenance:
3. Separate processing stages:
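One way to document provenance is to keep a small machine-readable record next to the raw files. The sketch below is only an illustration: the helper names `sha256sum` and `record_provenance` and the file name `PROVENANCE.json` are assumptions, not something the lesson prescribes. It stores a SHA-256 checksum, a free-text source description, and the date of recording for every raw file.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def sha256sum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_provenance(raw_dir, source: str, out_name: str = "PROVENANCE.json") -> Path:
    """Write one provenance entry per file found under raw_dir.

    The record is written beside raw_dir (in its parent directory),
    so raw_dir itself can stay read-only. The file name is hypothetical.
    """
    raw_dir = Path(raw_dir)
    entries = [
        {
            "file": p.relative_to(raw_dir).as_posix(),
            "sha256": sha256sum(p),
            "source": source,
            "recorded": date.today().isoformat(),
        }
        for p in sorted(raw_dir.rglob("*"))
        if p.is_file()
    ]
    out_path = raw_dir.parent / out_name
    out_path.write_text(json.dumps(entries, indent=2))
    return out_path
```

Re-running `sha256sum` on a raw file at a later date and comparing it with the stored digest is a cheap way to check that the file has not changed.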
Where should it live?
You can make the data/raw/ directory read-and-execute only for its owner (with no access for anyone else) by running this from the project root: chmod -R 500 data/raw/
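The same restriction can be applied from Python, for example at the end of a download script. This is a minimal sketch with a hypothetical helper name, `make_read_only`. It is a variant of the chmod command rather than an exact equivalent: directories get mode 500, but regular files get mode 400, since data files do not need the execute bit. It assumes a POSIX file system.

```python
import os
import stat
from pathlib import Path


def make_read_only(root) -> None:
    """Remove write permission from everything under root (owner access only).

    Directories end up as r-x------ (0o500) and files as r-------- (0o400).
    """
    root = Path(root)
    # Walk bottom-up so a directory's contents are handled before the directory.
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            os.chmod(Path(dirpath) / name, stat.S_IRUSR)
        os.chmod(dirpath, stat.S_IRUSR | stat.S_IXUSR)
```

To add new raw data later, write permission has to be restored on the directory first (for example with chmod u+w), which makes accidental modification much less likely.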
Where should it live?
We will talk later in more depth about releasing and sharing code, but open data should be saved and shared in a data repository if at all possible:
my-research-project/
├── README.md # Project overview and setup
├── environment.yml # Dependencies
├── data/ # Possibly some minor output data
├── src/ # Your source code
│   ├── data_processing/
│   ├── analysis/
│   └── visualization/
├── tests/ # Unit tests for your functions
├── notebooks/ # Jupyter notebooks for exploration
├── results/ # Figures, tables, model outputs
├── docs/ # Additional documentation
└── scripts/ # Standalone utility scripts
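If you start new projects often, the layout above can be created by a short script instead of by hand. The following is a sketch only: the function name `create_skeleton` is an assumption, and the directory and file names are copied from the tree above. Files are created empty.

```python
from pathlib import Path

# Directory and file names taken from the layout shown above.
PROJECT_DIRS = [
    "data",
    "src/data_processing",
    "src/analysis",
    "src/visualization",
    "tests",
    "notebooks",
    "results",
    "docs",
    "scripts",
]
PROJECT_FILES = ["README.md", "environment.yml"]


def create_skeleton(root) -> Path:
    """Create the project folders and empty placeholder files under root.

    Existing folders and files are left untouched, so the function is
    safe to run more than once.
    """
    root = Path(root)
    for d in PROJECT_DIRS:
        (root / d).mkdir(parents=True, exist_ok=True)
    for f in PROJECT_FILES:
        (root / f).touch(exist_ok=True)
    return root
```

Calling `create_skeleton("my-research-project")` in an empty working directory produces the folders listed above.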
The code (src) and analysis, notes, and notebooks are all in the same high-level project folder.
Often research work involves:
I tended to bump into a few questions or problems that I didn’t see an obvious solution to:
- I want to keep my codebase nice and tidy and comprehensible, but I also tend to produce lots and lots of analysis scripts, notebooks and figures as I analyse my results;
- I want to compare my models to analytical cases to check their validity, again producing lots of various outputs, figures etc.;
- I don’t want to accidentally modify some of my numerical model code when trying to analyse my results at a later stage;
- I want to update my core code, fix some problems and add some functionality; how do I keep track of what results I produced with which version of my code?
- Other people would probably be able to use my model, or adapt it for their own projects, but they won’t want to sift through all my iterative work in the meantime.
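For the question of which results came from which version of the code, one possible approach is to save a small metadata record with every set of results. The sketch below is an illustration under assumptions: `run_metadata` is a hypothetical helper, the package name is whatever your model is installed as, and the git commit is read from the repository at `repo_dir` (if there is one). Values that cannot be determined are stored as `None`.

```python
import subprocess
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path


def run_metadata(package: str, repo_dir=Path(".")) -> dict:
    """Collect the installed package version and the current git commit."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # package is not installed in this environment

    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = None  # git is missing or repo_dir is not a repository

    return {
        "package": package,
        "package_version": version,
        "git_commit": commit,
        "created": datetime.now(timezone.utc).isoformat(),
    }
```

Writing the returned dictionary to a JSON file next to each figure or output table makes it possible to trace a result back to the code that produced it.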
Sometimes a single project folder can become unwieldy:
planet-evolution/
├── src/
│   └── planet_evolution/
│       ├── __init__.py      # Makes the folder a package.
│       └── source.py        # An example module containing source code.
├── tests/
│   ├── __init__.py          # Sets up the test suite.
│   └── test_source.py       # A file containing tests for the code in source.py.
├── README.md                # README with information about the project.
├── docs                     # Package documentation
├── pyproject.toml           # Allows me to install this as a package
├── LICENSE                  # License text to allow for reuse
└── CITATION.cff             # Citation file that makes it easy for people to cite you!
This will contain the code that you will import as a library in your project!
pallasite-parent-body-evolution/
├── LICENSE
├── README.md
├── env.yml or requirements.txt  # The libraries I need for analysis (including planet_evolution!)
├── data                         # I usually load in large data from storage elsewhere
│   ├── interim                  # But sometimes do keep small summary datafiles in the repository
│   ├── processed
│   └── raw
├── docs                         # Notes on analysis, process etc.
├── notebooks                    # Jupyter notebooks used for analysis
├── reports                      # For a manuscript source, e.g., LaTeX, Markdown, etc., or any project reports
│   └── figures                  # Figures for the manuscript or reports
├── src                          # Source code for this project
│   ├── data                     # Scripts and programs to process data
│   ├── tools                    # Any helper scripts go here
│   └── visualization            # Scripts for visualisation of your results, e.g., matplotlib, ggplot2 related.
└── tests                        # Test code for this project, benchmarking, comparison to analytical models
If you opt for a single folder layout, but keep it tidy and organised, it is easy to split it into two separate directories at a later point!
The “I’ll organize it later” trap:
The “only I will use this” fallacy: