Organizing Your Research Coding Project
From Chaotic Scripts to Structured Science
In this section, we will explore different ways of organising our code, data, and notes.
After finishing this section, you should have an idea of:
- Different ways to organise your project
- Sensible ways to store/access your code
Presentation content
Keeping your projects tidy
- The best way to keep your code, notes, and data tidy is not always obvious
- Research tends to be very fluid and experimental, creating lots of scripts and notebooks that can easily become a tangled mess
- We want to be able to easily find the right piece of code and the right snippet of notes!
Why Project Organization Matters
Your future self will thank you when you can:
- Find your analysis code from 6 months ago
- Understand what each script actually does
- Reproduce your results without panic
- Share your work with collaborators confidently
- Build on your previous work instead of starting over
Poor organization leads to:
- Lost time searching for the "final" version
- Irreproducible results, mistakes, and embarrassing retractions
- Collaboration nightmares and frustrated colleagues
- Thesis chapters that can't be verified
A Well-Organized Research Project
Essential Components:
- Clear directory structure - Everything has a logical place
- Comprehensive documentation - README files that actually help
- Version control - Git tracks every change and decision
- Organized data workflow - Raw → Processed → Results
- Testing framework - Confidence that your code works correctly
- Dependency management - Reproducible computational environment
The first step: Clear directory structure
Research data
Where should it live?
- On backed-up University Storage (appropriate for its tier) mounted on your workstation
- (Temporarily) on $SCRATCH storage on HPC systems
- Not mixed in with your code!
  - Lots of project structure templates online suggest having a data sub-folder
  - In general, it's often better practice to keep your data in a separate folder:
    - It may be sensitive data, but you want your code to be shareable
    - It may be far too large to store on your desktop permanently alongside your code
    - It will cause issues when version-controlling your code
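One way to keep data out of the code folder is to have scripts look up the data location at run time instead of hard-coding a path. The following is a minimal sketch, assuming a hypothetical environment variable called RESEARCH_DATA_DIR (a name made up for this example, not something defined by this course or by any HPC system):

```python
import os
from pathlib import Path


def get_data_dir(env_var: str = "RESEARCH_DATA_DIR") -> Path:
    """Return the data directory named by an environment variable.

    RESEARCH_DATA_DIR is a hypothetical name chosen for this sketch.
    """
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(
            f"Set {env_var} to the folder where your research data lives"
        )
    data_dir = Path(value).expanduser()
    if not data_dir.is_dir():
        raise FileNotFoundError(f"{data_dir} does not exist or is not a folder")
    return data_dir
```

On a workstation the variable could point at the mounted University storage, and on an HPC system at a folder under $SCRATCH. The code itself stays the same in both places and never needs to contain the data.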
Research data rules
1. Raw data is read-only:
- Never edit original data files directly
- All modifications happen through documented code
- Keep multiple backups of irreplaceable data
2. Document data provenance:
- Where did each dataset come from?
- What processing steps were applied?
- What are the known limitations or biases?
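The provenance questions above can be answered in a small file stored next to each dataset. This is only a sketch of one possible approach: the function name, the JSON field names, and the ".provenance.json" suffix are all assumptions made for illustration, not an established standard.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_provenance(data_file, source, processing_steps, limitations):
    """Write a JSON provenance record next to a data file and return its path."""
    data_file = Path(data_file)
    record = {
        "file": data_file.name,
        # The checksum lets you confirm later that the file has not changed.
        # read_bytes() loads the whole file; hash very large files in chunks.
        "sha256": hashlib.sha256(data_file.read_bytes()).hexdigest(),
        "source": source,
        "processing_steps": processing_steps,
        "limitations": limitations,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    # Hypothetical naming convention: <data file name>.provenance.json
    sidecar = data_file.with_name(data_file.name + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar
```

A plain-text README in the data folder that answers the same three questions works just as well; the point is that the answers are written down somewhere that travels with the data.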
3. Separate processing stages:
- Raw → Cleaned → Analysis-ready → Results
- Each stage produces documented intermediate files
- Clear scripts connect each transformation step
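As a rough illustration of separate stages connected by scripts, the sketch below reads from a raw/ folder, writes a cleaned intermediate file to processed/, and writes a summary to results/. The file names (measurements.csv and so on) and the "value" column are hypothetical, and the two stages are deliberately trivial.

```python
import csv
from pathlib import Path


def clean(raw_file: Path, cleaned_file: Path) -> None:
    """Raw -> Cleaned: drop rows with a missing 'value' field."""
    with raw_file.open(newline="") as src, cleaned_file.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if (row["value"] or "").strip():
                writer.writerow(row)


def summarise(cleaned_file: Path, results_file: Path) -> None:
    """Cleaned -> Results: write the mean of the 'value' column."""
    with cleaned_file.open(newline="") as src:
        values = [float(row["value"]) for row in csv.DictReader(src)]
    # Assumes at least one row survived cleaning
    mean = sum(values) / len(values)
    results_file.write_text(f"mean_value,{mean}\n")


def run_pipeline(data_dir: Path) -> None:
    """Run each stage in order; the raw file is only ever read, never written."""
    processed = data_dir / "processed"
    results = data_dir / "results"
    processed.mkdir(exist_ok=True)
    results.mkdir(exist_ok=True)
    clean(data_dir / "raw" / "measurements.csv", processed / "measurements_clean.csv")
    summarise(processed / "measurements_clean.csv", results / "summary.csv")
```

Because every stage writes to a new file, any intermediate result can be deleted and regenerated from the raw data by re-running the pipeline.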
data/
├── raw/
├── processed/
└── results/
You can set the permissions on the data/raw/ directory to read and execute only (run from inside data/): chmod -R 500 raw/
- Make sure to pick the correct permissions: check the documentation for the chmod command first
- You can always undo these later
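The same idea can be scripted so that it becomes part of your project set-up. The sketch below is an assumption-laden variant, not a drop-in replacement for the command above: it is written for POSIX systems (Linux, macOS), the function name is made up, and unlike chmod -R 500 it gives files mode 400 (read only) and keeps the execute bit for directories only, where it is needed to enter and list them.

```python
import os
import stat
from pathlib import Path


def make_read_only(directory) -> None:
    """Remove write permission from a directory tree (POSIX systems).

    Directories get mode 500 (owner read + execute); files get mode 400
    (owner read only). Group and other users get no access.
    """
    directory = Path(directory)
    # Walk bottom-up so each directory is locked after its contents
    for root, dirs, files in os.walk(directory, topdown=False):
        for name in files:
            os.chmod(Path(root) / name, stat.S_IRUSR)
        os.chmod(root, stat.S_IRUSR | stat.S_IXUSR)
```

To undo this later, chmod -R u+w raw/ gives the owner write permission back.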
We will talk later in more depth about releasing and sharing code, but open data should be saved and shared in a data repository if at all possible:
- Your funder may have specific repositories for research data
- Publications may have requirements about sharing output data
- Use the University's data deposit service
Research code
- There are lots of different ways of organising your research code
- There is no one "correct" way
Single Folder Layout
my-research-project/
├── README.md              # Project overview and setup
├── environment.yml        # Dependencies
├── data/                  # Possibly some minor output data
├── src/                   # Your source code
│   ├── data_processing/
│   ├── analysis/
│   └── visualization/
├── tests/                 # Unit tests for your functions
├── notebooks/             # Jupyter notebooks for exploration
├── results/               # Figures, tables, model outputs
├── docs/                  # Additional documentation
└── scripts/               # Standalone utility scripts
- This is one way of keeping your code organised
- The source code (in src) and analysis, notes, and notebooks are all in the same high-level project folder
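If you start new projects often, the single folder layout can be created by a short script so that every project begins with the same structure. A minimal sketch, assuming you want exactly the folders listed above plus empty README.md and environment.yml placeholders (the function name create_project is made up for this example):

```python
from pathlib import Path

# Sub-folders taken from the single folder layout
LAYOUT = [
    "data",
    "src/data_processing",
    "src/analysis",
    "src/visualization",
    "tests",
    "notebooks",
    "results",
    "docs",
    "scripts",
]


def create_project(root) -> Path:
    """Create the project folders and empty placeholder files."""
    root = Path(root)
    for folder in LAYOUT:
        # exist_ok=True makes the script safe to re-run on an existing project
        (root / folder).mkdir(parents=True, exist_ok=True)
    for filename in ("README.md", "environment.yml"):
        (root / filename).touch(exist_ok=True)
    return root
```

Calling create_project("my-research-project") creates the folders in the current working directory. Existing files are left untouched, so nothing is overwritten if you run it twice.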
Two Folder Layout
Often research work involves:
- Building some form of numerical model, analysis pipeline, or other code that is somewhat generic or modular
- Applying that model to specific parameters and testing various inputs and outputs, using tools like Jupyter notebooks to explore output and write notes
I tended to bump into a few questions or problems that I didn't see an obvious solution to:
- I want to keep my codebase nice and tidy and comprehensible, but I also tend to produce lots and lots of analysis scripts, notebooks and figures as I analyse my results;
- I want to compare my models to analytical cases to check their validity, again producing lots of various outputs, figures etc.;
- I don't want to accidentally modify some of my numerical model code when trying to analyse my results at a later stage;
- I want to update my core code, fix some problems and add some functionality; how do I keep track of what results I produced with which version of my code?
- Other people would probably be able to use my model, or adapt it for their own projects, but they won't want to sift through all my iterative work in the meantime.
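On the question of which version of the code produced which results: once the core code is an installable package, one possible approach is to save the installed version numbers next to each set of results. The sketch below uses the standard library's importlib.metadata; the function name and the output file layout are assumptions made for illustration.

```python
import json
import platform
from importlib import metadata
from pathlib import Path


def record_versions(packages, output_file):
    """Save the installed version of each package alongside your results."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Recorded explicitly so a missing package is visible in the file
            versions[name] = "not installed"
    Path(output_file).write_text(json.dumps(versions, indent=2))
    return versions
```

For example, an analysis script might call record_versions(["planet_evolution"], "versions.json") just before saving its figures. This is only useful if the package's version number is actually updated when the code changes, or if the package is installed from a tagged release.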
Sometimes a single project folder can become unwieldy:
- It can be difficult to separate out your methods and core code from various experiments and tests
- Important information and code snippets can go missing in Jupyter Notebooks
A two folder layout addresses this by splitting the project in two:
- One folder for "core code": the functions you will re-use
- One folder for the "application": whether that's running experiments, writing a paper or a thesis chapter, etc.
Folder 1: the "core code" or "package" folder
planet-evolution/
├── src/
│   └── planet_evolution/
│       ├── __init__.py      Makes the folder a package.
│       └── source.py        An example module containing source code.
├── tests/
│   ├── __init__.py          Sets up the test suite.
│   └── test_source.py       A file containing tests for the code in source.py.
├── README.md                README with information about the project.
├── docs                     Package documentation
├── pyproject.toml           Allows me to install this as a package
├── LICENSE                  License text to allow for reuse
└── CITATION.cff             Citation file that makes it easy for people to cite you!
This will contain the code that you will import as a library in your project!
Folder 2: the "application" repository
pallasite-parent-body-evolution/
├── LICENSE
├── README.md
├── env.yml or requirements.txt   The libraries I need for analysis (including planet_evolution!)
├── data                     I usually load in large data from storage elsewhere
│   ├── interim              But sometimes do keep small summary datafiles in the repository
│   ├── processed
│   └── raw
├── docs                     Notes on analysis, process etc.
├── notebooks                Jupyter notebooks used for analysis
├── reports                  For a manuscript source, e.g., LaTeX, Markdown, etc., or any project reports
│   └── figures              Figures for the manuscript or reports
├── src                      Source code for this project
│   ├── data                 Scripts and programs to process data
│   ├── tools                Any helper scripts go here
│   └── visualization        Scripts for visualisation of your results, e.g., matplotlib, ggplot2 related.
└── tests                    Test code for this project, benchmarking, comparison to analytical models
Deciding on a folder layout
If you opt for a single folder layout, but keep it tidy and organised, it is easy to split it into two separate directories at a later point!
- In this course, we're going to stick with a single folder set-up since we are building a basic project
- It's less important to choose the perfect layout, and more important to be intentional, tidy, and consistent about where you save things!
Common pitfalls
The "I'll organize it later" trap:
- Start with basic structure from day one
- Organization gets harder as projects grow
- Good habits compound over time
The "only I will use this" fallacy:
- You are your most important collaborator
- Future you has forgotten current you's logic
- A clear and tidy directory structure will help you down the line
Preparing for practical 2
- We are going to set up a virtual machine on GitHub Codespaces
- We are going to use our project workflow to set up a coding environment