Software Development Skills for Research Computing
During this course, you will:
Learn to apply basic software development practices to improve your code
Get to grips with organising your codebase
Develop a blueprint for dealing with dependencies, conda environments, and code versions
Learn about various tools and resources you can implement in the future
Software Development Skills for Research Computing
During this course, you will not:
Learn best-practice software development: we are aiming for "good enough" rather than perfect, but can point you to resources if you want to learn more
Become a software developer overnight: it takes practice!
Learn the complicated mathematics behind your numerical models or statistical analysis, or how to implement these in Python
An example from my research: Electron Microprobe Analysis
```mermaid
flowchart LR
    subgraph lab[1. Lab analysis of samples]
        direction TB
        A[Primary Standards: known comp. - P1] -->
        B(Samples: unknown comp.) -->
        C[Primary Standards again: known comp. - P2]
    end
    lab -..-> END[2. Instrument validation after data collection]
```
Bracket samples with standards of known composition (published and trusted standards)
Why apply software dev principles to your coding?
```mermaid
flowchart LR
    subgraph inst[2. Instrument validation after data collection]
        direction LR
        D[/Do P1 and P2<br>match each other<br>within error?/] -->|Yes| F
        D -->|No| E
        E(Instrument drift)
        F[/Do P1 and P2 match<br>published values<br>within error?/]
        F -->|No| G
        G(Calibration issue)
    end
    START[1. Lab analysis of samples] -.-> inst
    F -->|Yes| pos
    E --> neg
    G --> neg
    neg(fa:fa-ban Results not valid)
    pos(Results may be valid)
    pos --> posnext[Test scientific<br>validity of results]
    neg -.-> negnext[Check instrument settings<br>Rerun analyses]
```
Compare standards to each other to check that results are consistent over time
Compare standards to their published compositions
Well-established allowable error
Why apply software dev principles to your coding?
```mermaid
flowchart LR
    subgraph lab[1. Lab analysis of samples]
        direction TB
        A[Primary Standards: known comp. - P1] -->
        B(Samples: unknown comp.) -->
        C[Primary Standards again: known comp. - P2]
    end
    subgraph inst[2. Instrument validation after data collection]
        direction LR
        D[/Do P1 and P2<br>match each other<br>within error?/] -->|Yes| F
        D -->|No| E
        E(Instrument drift)
        F[/Do P1 and P2 match<br>published values<br>within error?/]
        F -->|No| G
        G(Calibration issue)
    end
    lab ---> inst
    F -->|Yes| pos
    E --> neg
    G --> neg
    neg(fa:fa-ban Results not valid)
    pos(Results may be valid)
    pos --> posnext[Test scientific<br>validity of results]
    neg -.-> negnext[Check instrument settings<br>Rerun analyses]
```
Without the above documented steps, my results would not be publishable or considered in any way robust
How do we implement a similar workflow for computational research?
We treat code as a laboratory instrument!
GitHub codespaces and devcontainers
Today, we are going to be using GitHub codespaces to run our code
This is essentially just a remote Linux machine running in the cloud
You get restricted free access (120 hours per month) which is plenty for this course
When using what we’ve discussed for your own research, install everything locally
We have created a template repository for you to use
Make sure you have a GitHub account and know your login details!
```shell
git status            # check on status of current git repo
git branch NAME       # create a branch called NAME
git checkout NAME     # swap over to the branch called NAME
git add .             # stage all changed files for commit; replace "." with FILE to add a single file called FILE
git commit            # commit the staged files (this will open your text editor to create a commit message)
git push origin NAME  # push local commits to the remote branch tracking the branch NAME
```
Version control
All your files and the git history will be stored in a public repository on GitHub
Transparency, easy to see your process, useful for reviewing code
Don’t worry about your “messy workings” being visible - it’s part of the scientific process
Project Organisation
Project organisation
What does your project currently look like?
Lots of Python scripts in different folders?
Very long, convoluted Python files?
Tests?
Comments?
How do you share your Python work?
How do you record what version of each script you used?
How do you transfer your work to the HPC system and back?
Basic Structure Suggestion
```
# The most basic structure for a code project should look like:
my-package
├── README.md
├── pyproject.toml
├── src    <- Source code for this project
└── tests  <- Test code for this project
```
This folder contains tests that run small sections of your code with known expected results
All test units (files and methods) must be named starting with test_ and placed inside a directory called tests.
Tests can be grouped in just one folder for the entire repository or they can be organized within each package/subpackage.
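As a minimal sketch of that convention (the file, function, and test names here are placeholders), pytest will automatically discover and run something like:

```python
# tests/test_example.py -- pytest collects files and functions named test_*
def add(a, b):
    return a + b

def test_add():
    # A unit test: known input, known expected result
    assert add(2, 3) == 5
```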
Two directory structure
I keep my large-scale code development separate from my scientific output
For example, I want to analyse the thermal evolution of a planet for a scientific paper
I build a numerical model of the heating and cooling of the planet
I use this model to test a range of parameters and compare to various datasets
I might be able to reuse my numerical model in other situations so want to keep this separate
I know that my research process involves lots of exploratory plotting and analysis, which produces lots of scripts, and I don’t want these to get mixed up with my model code
Repository 1: the numerical model as a Python package
```
planet-evolution/          The package git repository
├── src/
│   └── planet_evolution/
│       ├── __init__.py    Makes the folder a package.
│       └── source.py      An example module containing source code.
├── tests/
│   ├── __init__.py        Sets up the test suite.
│   └── test_source.py     A file containing tests for the code in source.py.
├── README.md              README with information about the project.
├── docs                   Package documentation
├── pyproject.toml         Allows me to install this as a package
├── LICENSE                License text to allow for reuse
└── CITATION.cff           Citation file that makes it easy for people to cite you!
```
This model can be installed as a package, cited in your research, and reused in a later project.
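A minimal pyproject.toml sketch that would make the package installable (the exact fields, backend, and metadata here are illustrative, not prescriptive):

```toml
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "planet-evolution"
version = "0.1.0"
description = "Thermal evolution model of a planet"
```

With a file like this at the repository root, `pip install -e .` gives you an editable local install that other projects can import.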
Repository 2: my scientific analysis
```
pallasite-parent-body-evolution/   The project git repository
├── LICENSE
├── README.md
├── env.yml or requirements.txt    The libraries I need for analysis (including planet_evolution!)
├── data               I usually load in large data from storage elsewhere
│   ├── interim        But sometimes do keep small summary datafiles in the repository
│   ├── processed
│   └── raw
├── docs               Notes on analysis, process etc.
├── notebooks          Jupyter notebooks used for analysis
├── reports            For a manuscript source, e.g., LaTeX, Markdown, etc., or any project reports
│   └── figures        Figures for the manuscript or reports
├── src                Source code for this project
│   ├── data           Scripts and programs to process data
│   ├── tools          Any helper scripts go here
│   └── visualization  Scripts for visualisation of your results, e.g., matplotlib, ggplot2 related.
└── tests              Test code for this project, benchmarking, comparison to analytical models
```
This is the actual work for the scientific project - while others are unlikely to use this code as-is, it’s public and citeable so that you can point to a specific version in your published paper and readers can reproduce your work with it if they wish.
```
.
├── AUTHORS.md
├── LICENSE
├── README.md
├── bin                <- Your compiled model code can be stored here (not tracked by git)
├── config             <- Configuration files, e.g., for doxygen or for your model if needed
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
├── docs               <- Documentation, e.g., doxygen or scientific papers (not tracked by git)
├── notebooks          <- Ipython or R notebooks
├── reports            <- For a manuscript source, e.g., LaTeX, Markdown, etc., or any project reports
│   └── figures        <- Figures for the manuscript or reports
├── src                <- Source code for this project
│   ├── data           <- Scripts and programs to process data
│   ├── external       <- Any external source code, e.g., pull other git projects, or external libraries
│   ├── models         <- Source code for your own model
│   ├── tools          <- Any helper scripts go here
│   └── visualization  <- Scripts for visualisation of your results, e.g., matplotlib, ggplot2 related.
└── tests              <- Test code for this project
```
Testing code
Testing code
Remember our example of using known standards to check the instruments in the lab?
This is the equivalent for computational work!
Tests ensure that your code runs in the way it’s intended to
Tests will flag if any changes you made either
Produce an error or break the code
“Silently” introduce errors - the code still runs, but the output is different
Testing code
The good news is, you’ve probably already created a test without realizing it. Remember when you ran your application and used it for the first time? Did you check the features and experiment using them? That’s known as exploratory testing and is a form of manual testing.
Exploratory testing is a form of testing that is done without a plan. In an exploratory test, you’re just exploring the application.
To have a complete set of manual tests, all you need to do is make a list of all the features your application has, the different types of input it can accept, and the expected results. Now, every time you make a change to your code, you need to go through every single item on that list and check it.
N.B. rarely in scientific applications can we use == as we are often dealing with floats and some degree of error; we will discuss the various alternatives that allow tolerances during the testing session.
Each function in your project should have a unit test
Tests edge cases
Tests how the package works together as a whole
Tests various combinations of functions
Tests the full workflow you use for research
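A sketch of a unit test that uses a tolerance rather than `==` (the function under test and its constants are invented purely for illustration):

```python
import math

def radiated_power(temp_k):
    # Hypothetical function under test: a Stefan-Boltzmann-like T^4 law
    return 5.67e-8 * temp_k**4

def test_radiated_power():
    # Floats: compare within a relative tolerance, never with ==
    assert math.isclose(radiated_power(300.0), 459.27, rel_tol=1e-6)
```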
Testing your science
This is often where tutorials stop when it comes to code testing: but you also need to scientifically validate your code!
Unit testing and integration testing check that the code is functional; they do not check for scientific validity
Depending on your area/the code you are writing, you might need to test for:
Numerical precision and accuracy
Stability
Agreement with previous numerical models/analytical solutions*
Scientific sense: does the answer make physical sense? (Does the thing cool down when you expect it to? Does time run forwards?)
There are other problems with the circle of purely validating numerical models against other numerical models… but that is too long a debate for today!
Testing your science
In our two-repository set up, tests for science can be split across both:
Some tests will always need to be true (the model should always be cooling, gravitational acceleration should always be >20 ms^-2 on a giant planet etc.)
Some tests will be specific to your application for a scientific output and can live in that second repository
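For example, a "physical sense" test of the always-cooling requirement might look like this (the model here is a stand-in exponential-cooling toy, not a real thermal evolution code):

```python
import numpy as np

def toy_thermal_model(t_initial=1600.0, t_background=250.0, steps=100):
    # Stand-in for a real model: exponential cooling toward a background temperature
    times = np.arange(steps)
    return t_background + (t_initial - t_background) * np.exp(-times / 50.0)

def test_model_always_cools():
    temps = toy_thermal_model()
    # Scientific sanity check: temperature should never increase in this scenario
    assert np.all(np.diff(temps) <= 0.0)
```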
Linting and Formatting code
Coding conventions
If your language or project has a standard policy, use that. For example:
```
$ conda install flake8
$ flake8 myscript.py
myscript.py:2:6: E201 whitespace after '{'
myscript.py:2:11: E231 missing whitespace after ':'
myscript.py:2:14: E231 missing whitespace after ','
myscript.py:2:18: E231 missing whitespace after ':'
myscript.py:3:1: E128 continuation line under-indented for visual indent
myscript.py:3:4: E231 missing whitespace after ':'
myscript.py:4:13: E225 missing whitespace around operator
myscript.py:4:14: E222 multiple spaces after operator
myscript.py:5:1: E302 expected 2 blank lines, found 0
myscript.py:5:13: E201 whitespace after '('
myscript.py:5:25: E202 whitespace before ')'
myscript.py:6:4: E111 indentation is not a multiple of 4
myscript.py:6:9: E211 whitespace before '('
myscript.py:6:20: E202 whitespace before ')'
myscript.py:7:8: E111 indentation is not a multiple of 4
myscript.py:7:14: E271 multiple spaces after keyword
myscript.py:7:25: E225 missing whitespace around operator
myscript.py:8:4: E301 expected 1 blank line, found 0
myscript.py:8:4: E111 indentation is not a multiple of 4
myscript.py:8:17: E203 whitespace before ':'
myscript.py:8:18: E231 missing whitespace after ':'
myscript.py:9:8: E128 continuation line under-indented for visual indent
myscript.py:9:9: E203 whitespace before ':'
myscript.py:9:15: E252 missing whitespace around parameter equals
myscript.py:9:16: E252 missing whitespace around parameter equals
myscript.py:10:8: E124 closing bracket does not match visual indentation
myscript.py:10:8: E125 continuation line with same indent as next logical line
myscript.py:11:8: E111 indentation is not a multiple of 4
myscript.py:12:1: E302 expected 2 blank lines, found 0
myscript.py:12:6: E211 whitespace before '('
myscript.py:12:9: E201 whitespace after '('
myscript.py:12:13: E202 whitespace before ')'
myscript.py:12:15: E203 whitespace before ':'
myscript.py:13:4: E111 indentation is not a multiple of 4
myscript.py:13:10: E271 multiple spaces after keyword
myscript.py:13:26: E203 whitespace before ':'
myscript.py:13:34: W291 trailing whitespace
```
Linters and Formatters
This is the equivalent of a spellchecker for your code
Do yourself a favour and ensure whatever IDE you are using has this enabled!
I prefer having the linter run while you code, rather than running after, but this is personal preference
Many different tools available, we will have some preloaded in our devcontainer
You can see what the Black code formatter will do to your code here:
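As an illustrative sketch (not taken from the linked page), the kind of change Black makes looks like this:

```python
# Before Black: inconsistent spacing
def area( radius = 2.0 ):
    return 3.14159*radius**2

# After Black: normalised spacing, same behaviour
def area(radius=2.0):
    return 3.14159 * radius**2
```

Note that Black only changes presentation, never behaviour, so it is safe to run on working code.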
If application A needs version 1.0 of a particular module but application B needs version 2.0, then the requirements are in conflict and installing either version 1.0 or 2.0 will leave one application unable to run.
The solution to this problem is to create a virtual environment: a self-contained directory tree that contains installations of particular versions of software/packages.
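A minimal sketch using Python's built-in `venv` module (conda, covered next, solves the same problem with support for more languages):

```shell
python3 -m venv .venv        # create a self-contained environment directory
source .venv/bin/activate    # use this environment in the current shell
python -m pip list           # packages now come from .venv, isolated from the system
deactivate                   # leave the environment
```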
Conda
Conda is an open source package management system and environment management system that runs on Windows, macOS, and Linux.
It offers dependency and environment management for any language—Python, R, Ruby, Lua, Scala, Java, JavaScript, C/ C++, Fortran, and more.
```python
from collections import defaultdict

def get_top_cities(prices):
    top_cities = defaultdict(int)
    # For each price range
    # Get city searches in that price
    # Count num times city was searched
    # Take top 3 cities & add to dict
    return dict(top_cities)
```
Commenting for others
In later iterations of your code you might want to clean up your comments to yourself and formalise your documentation more
```python
def complex(real=0.0, imag=0.0):
    """Form a complex number.

    Keyword arguments:
    real -- the real part (default 0.0)
    imag -- the imaginary part (default 0.0)
    """
    if imag == 0.0 and real == 0.0:
        return complex_zero
```
Releases on GitHub
Releases
Releases are deployable software iterations you can package and make available for a wider audience to download and use.
A release takes a snapshot of your entire repository at a specific time, bundles it into a zipped file, and stamps it with a version number (like v1.2.0), making it easy for you to reference the exact version of your code you used for a scientific project.
You can link your GitHub repository to Zenodo and get a DOI for your releases
You probably already have multiple different projects in progress, and don’t have the time or capacity to go back and organise everything as we’ve explained.
What can you do when faced with an overwhelmingly messy codebase?