Preventing dependency hell
Dependencies are the specific versions of the modules, packages, or other software that your research code depends on: all of your imports.
Don’t be like Ruby and Avi! Record your dependencies.
Image CC BY Candace Savonen, retrieved from Reproducibility in Cancer Informatics.
Managing dependencies is something many, many people find complicated and difficult. Let’s start with some key ideas to simplify things.
NYU Libraries provide some straightforward guidance that is language agnostic:
Package management software usually handles installing packages and resolving their dependencies for you; it might also manage isolated environments, so that different projects can each use different package versions.
Depending on the language you are using, there may be many different options available. For example, for Python, you may have heard of pip, conda, virtualenv, Poetry, or pipenv.
Python dependencies can get very messy…
To use conda on your machine, I recommend the Miniforge installer, which by default loads packages from conda-forge.
If you are using Anaconda (provided by Anaconda.com), ensure you are not using the defaults channel to install packages, as this falls under the recent Anaconda repository licensing changes.
Tl;dr: stick to open source to avoid licensing issues, by using a tool such as Miniforge, which by default downloads packages from conda-forge and not from proprietary channels.
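If you want to check which channels your conda installation is actually using, the conda config subcommand can show and edit them. A minimal sketch; the exact output, and whether defaults is present at all, will depend on your setup:

# Show the channels conda is currently configured to use
conda config --show channels

# Prefer conda-forge, and drop the proprietary defaults channel if it is listed
conda config --add channels conda-forge
conda config --remove channels defaults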
The environment.yml file
Who here regularly uses environment.yml files in their workflow?
YAML is a human-readable format for plain text files (usually with the file ending .yml), often used for configuration of other programs. It's the format environment.yml files are written in.
What does a .yml file look like?
name: my-env-name
dependencies:
- python=3.12
- pytest
- setuptools
- blackd
- isort
- numpy
- matplotlib
- pandas
We just need to create a plain text file called environment.yml
or something else sensible, and list the packages we need!
We can then turn this into a conda environment:
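Assuming the file above is saved as environment.yml in your working directory, the standard conda command is:

conda env create --file environment.yml

Conda reads the name field from the file, so once it has finished you can activate the new environment with conda activate my-env-name.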
You'll get a lot of output as conda finds the various packages online and solves all your dependencies in the background.
Do I need a .yml file?
Can't I just create a fresh environment from the command line like this?
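Something like the following, for example; the environment name and package list here are only an illustration, chosen to match the python, numpy, and pytest discussion below:

# Create a fresh environment directly from the command line
conda create --name my-env-name python=3.12 numpy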
And then activate it and add dependencies like this?
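Again, an illustrative sketch:

# Activate the environment, then install extra packages one at a time
conda activate my-env-name
conda install pytest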
You absolutely can, but there are a few reasons why you shouldn't.
Any packages you add later (like pytest in the example above) will be pinned by the versions of libraries installed at an earlier stage (so the base python version and numpy), which can lead to dependency conflicts.
If you want to add a new package that you didn't include in your original environment.yml file, or pin a package to a specific version, you can do so within the environment.yml file.
You can pin packages to specific versions using the = notation, as in the first example. Once your environment.yml file is up to date, you can apply the changes to your conda environment:
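The command for this is the conda env update invocation that also appears later in this post:

conda env update --file environment.yml --prune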
The --prune argument here clears out old unused libraries and is key to keeping your .conda folder a reasonable size.
Remember, adding packages one at a time with conda install package-name from within your environment can lead to dependency conflicts (say your env has an older version of numpy and you've tried to conda install another package that can't support this). Using a .yml file allows the solver to work through all the dependencies at the same time. There may still be conflicts, but many easily avoidable issues will disappear.
Myth: env.yml is too prescriptive
Conda has an env export command - let's learn about it! Say you have a conda environment file similar to the one shown at the start of this post, with very minimal pinned dependencies. For the sake of reproducibility, you want a better record of exactly which libraries you used, right?
This is where the export option comes in. From inside your active environment, simply run:
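For example, redirecting the output into a file:

conda env export > env-record.yml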
This gives you a full record of your environment in a file called env-record.yml. It can also be helpful to date-stamp these records, for example 2024-11-27-env-record.yml.
The myth of env.yml files being restrictive
This is where the myth of the conda env.yml being prohibitively restrictive comes in:
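The exported file pins every package in the environment, including all the background dependencies you never asked for directly, to an exact version and build string. Below is a truncated, illustrative sketch; the package versions, build strings, and prefix path are made up and will differ on your machine:

name: my-env-name
channels:
  - conda-forge
dependencies:
  - numpy=2.1.3=py312h58c1407_0       # exact version plus build string
  - pandas=2.2.3=py312hf9745cd_1
  - python=3.12.7=hc5c86c4_0_cpython
  # ...and every other package installed in the environment, not just the ones you asked for
prefix: /home/your-username/miniforge3/envs/my-env-name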
This is why I prefer to export it to a file named something like env-record instead of just environment: it makes it very obvious that this is a record of the state of the environment, as opposed to a recipe for rebuilding it.
This exported environment file is mainly useful as a record for the sake of reproducibility, not for reusability.
If you produce results with your code that are being used in some form of research output (e.g. a paper), export your environment at the time when the results are being generated, so you have a record of the versions of different libraries you used.
This is why I said it is good to build and update your environment from your environment.yml
file: that way you have a reusable recipe that you can share and use to rebuild your environment, but you can also use export
to get a super detailed snapshot for any set of results!
If you ignored all my advice about not building a haphazard environment incrementally with conda install, there is still hope: this is where the --from-history flag comes into play:
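The flag is added to the same export command; redirecting into environment.yml (or whatever file name you prefer) gives you a recipe file to work from:

conda env export --from-history > environment.yml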
This will produce a clean conda environment file similar to the example we gave at the start of this post, listing only the packages you explicitly installed (without background dependencies or build details).
However, things will get messy if you start adding in pip
dependencies…
Using a conda environment.yml
makes working with pip and conda together less painful.
Pip dependencies can go straight into your environment.yml file. Just add them to your environment file, then run conda env update --file environment.yml --prune as previously described:
name: env-with-pip-dependencies
dependencies:
# Whatever packages you need for your project
- python=3.12
- numpy
- matplotlib
- pandas
- pip
- pip:
- black
- https://github.com/YOUR-USERNAME/YOUR-REPO-NAME/releases/download/YOUR-VERSION-NAME/PACKAGENAME-VERSION.tar.gz # you can even install your own packages that you host on GitHub
You can update this as described above.
Exporting the full record works the same if you have pip dependencies:
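That is, the same export command as before; pip-installed packages show up under their own pip: section in the output:

conda env export > env-record.yml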
However, --from-history
will not include pip dependencies…
If you’ve followed our advice until now (and built your conda environment from a file), this won’t be an issue.
But if you ignored us or are trying to salvage old code, thankfully, there are a few different workarounds!
This is outside of the scope of the talk today, but the instructions for exporting are here:
Modified from this conversation on GitHub, this code snippet will export your conda and pip dependencies without version numbers (so that the environment.yml
file can be used to build a new environment):
# Extract installed pip packages
pip_packages=$(conda env export | grep -A9999 ".*- pip:" | grep -v "^prefix: " | cut -f1 -d"=")
# Export conda environment without builds, and append pip packages
conda env export --from-history | grep -v "^prefix: " > new-environment.yml
echo "$pip_packages" >> new-environment.yml
But remember: it is better to keep your environment.yml
file current, and update your conda env from this file, as opposed to adding packages using conda install
and then trying to export details to your environment file to track these changes.
Remember that for reusability, you want:
- a minimal, human-editable environment.yml recipe that you (or anyone else) can use to rebuild your environment.
For reproducibility, you need the code to be reusable, but also:
- a detailed record (from conda env export) of the exact package versions used to generate your results.
Environments should be treated as disposable and easily rebuildable; we can use version control and tests to make sure that's true.
- Build and update your environment from your environment.yml file to keep control of the packages you have installed.
- Use conda env export > env-record.yml to export records of your environments for reproducibility, but use the --from-history flag to make the export more reusable.