The art of conda

conda

python

dependencies

packages

reproducibility

Grappling conda yaml files for reproducible research

Author

Maeve Murphy Quinlan

Published

November 27, 2024

Conda is a widely used package management system that lets you isolate different Python “environments” from each other, so you can use different versions of libraries or modules for different projects. However, mismanagement of packages can lead to dependency hell, with tangled environments and incompatible versions of different modules.

Installing conda

To use conda on your machine, I recommend the Miniforge installer, which by default loads packages from conda-forge.

If you are using Anaconda (provided by Anaconda.com), make sure you are not installing packages from the defaults channel, as this channel falls under the recent Anaconda repository licensing changes.

TL;DR: stick to open source to avoid licensing issues by using a tool such as Miniforge, which by default downloads packages from conda-forge and not proprietary channels.

The key to conda: the `.yml` file

Although conda is very widely used, especially in research, science, and data science, people often neglect the real magic of the system: the environment.yml file. This file is the recipe or configuration for your environment.

Creating an environment and installing new libraries

A conda env.yml or environment.yml file will look something like this:

name: my-env-name

dependencies:
  - python=3.12
  - pytest
  - setuptools
  - blackd
  - isort
  - numpy
  - matplotlib
  - pandas

This is then turned into a conda environment with all the listed dependencies installed by calling the following (from the folder containing the .yml file):

conda env create -f environment.yml
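If you set up environments from a script (for example, as part of a project bootstrap), the same command can be wrapped in a few lines of Python. This is only a minimal sketch: the function names are my own, and it assumes conda is available on your PATH.

```python
import subprocess
from pathlib import Path


def create_command(env_file="environment.yml"):
    """Build the same command shown above as an argument list."""
    return ["conda", "env", "create", "-f", str(env_file)]


def create_environment(env_file="environment.yml"):
    """Create the environment, failing early if the file is missing."""
    if not Path(env_file).is_file():
        raise FileNotFoundError(f"No environment file at {env_file}")
    # check=True raises CalledProcessError if conda exits with an error
    subprocess.run(create_command(env_file), check=True)


if __name__ == "__main__":
    # Only print the command here; call create_environment() to run it
    print(" ".join(create_command()))
```

Checking for the file first gives a clearer error message than letting conda fail when the script is run from the wrong folder.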

Why should I bother with a .yml file?

Can’t I just create a fresh environment from the command line like this?

conda create -n ENV-NAME python=3.12 numpy

And then activate it and add dependencies like this?

conda install pytest

You absolutely can, but there are a few reasons why you shouldn’t.

Things you install later (e.g. pytest in the example above) will be constrained by the versions of libraries installed at an earlier stage (so the base python version and numpy above), which can lead to dependency conflicts.
You can end up with a lot of crud and old unneeded libraries that you no longer use bloating your environment.

Read on to see what you can do instead.

Updating an environment

If you want to add a new package that you didn’t include in your original environment.yml file, or pin a package to a specific version, you can do so directly in the environment.yml file. Just add any new packages to the list of dependencies, and pin libraries with the = notation as in the first example.

Once your environment.yml file is up to date, you can apply the changes to your conda environment:

conda env update --file environment.yml --prune

The --prune argument removes packages that are no longer required by the file, and is key to keeping your .conda folder a reasonable size.

Whereas running conda install package-name from within your environment can lead to dependency conflicts (say your env has an older version of numpy and you’ve tried to conda install another package that can’t support this), updating the environment from the .yml file allows the solver to work through the dependencies at the same time. There may still be conflicts, but many easily avoidable issues will disappear.
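To make the "edit the file, then update" workflow concrete, here is a rough sketch of a helper that adds or re-pins a package in the text of an environment.yml. The function name is hypothetical, and it leans on some simplifying assumptions: two-space indentation as in the example file, the dependency list being the last block in the file, and no pip section. For anything more complicated, editing the file by hand or using a proper YAML parser is the safer route.

```python
import re


def set_dependency(yml_text, package, version=None):
    """Add `package` to the dependency list, or re-pin it if already listed."""
    spec = f"{package}={version}" if version else package
    new_line = f"  - {spec}"
    lines = yml_text.rstrip("\n").split("\n")
    # Matches "  - numpy" or "  - numpy=1.26", but not "  - numpy-financial"
    pattern = re.compile(rf"^  - {re.escape(package)}\s*([=<>!].*)?$")
    for i, line in enumerate(lines):
        if pattern.match(line):
            lines[i] = new_line  # replace the existing entry in place
            break
    else:
        # Not listed yet: append to the end of the dependency list
        lines.append(new_line)
    return "\n".join(lines) + "\n"
```

After writing the modified text back to environment.yml, you would still run the conda env update command shown above to apply the change.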

Myth: conda environment.yml files are too prescriptive!

This is simply not true, and is often stated by people who have run conda env export > environment.yml to generate the .yml file, and have then taken this file (which has every package pinned to the exact version, and even includes builds) and tried to build an environment on a different computer or system with it.

The section on “Exporting a conda env” digs into this a little deeper, but essentially an environment.yml file can be as prescriptive as you want. If we consider the clean file we wrote above with only a few pinned dependencies as a recipe, then the file produced by conda env export > environment.yml is the original recipe, but heavily annotated, scribbled on, and highlighted to get the best results working with a slightly temperamental gas oven that runs a bit cold at the front: useful if you are using the exact same old oven, but likely to cause issues if you follow the instructions for a new electric fan oven.

Exporting a conda env

So let’s say you have a conda environment file similar to the one shown above, with very minimal pinned dependencies. For the sake of reproducibility, you want a better record of exactly what libraries you used, right?

This is where the export option comes in. From inside your active environment, simply run:

conda env export > env-record.yml

The command above will export an extremely detailed list of everything in your environment (including background dependencies and their exact version numbers) to the file env-record.yml. Sometimes, you might find it appropriate to export this to a filename with the date, for example 2024-11-27-env-record.yml.
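If you want the dated filename generated for you, something along these lines would work. It is a sketch under a couple of assumptions: conda is on your PATH, the script is run from inside the activated environment, and the helper names are my own invention.

```python
import subprocess
from datetime import date


def record_filename(day=None, suffix="env-record.yml"):
    """Return a name such as 2024-11-27-env-record.yml."""
    day = day or date.today()
    return f"{day.isoformat()}-{suffix}"


def export_record(day=None):
    """Write the full export of the active environment to a dated file."""
    # Assumes conda is on the PATH and the target environment is active
    result = subprocess.run(
        ["conda", "env", "export"], capture_output=True, text=True, check=True
    )
    filename = record_filename(day)
    with open(filename, "w", encoding="utf-8") as handle:
        handle.write(result.stdout)
    return filename
```

Using the ISO date format (year first) means the record files sort chronologically in a file listing.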

This is where the myth of the conda env.yml being prohibitively restrictive comes in: people often try to use this file to build a replica of the same environment on a different machine; however, this exported file contains specific details of backends and builds that will likely not be transferable across different computers. This is why I prefer to export it into a file name like env-record instead of just environment: it makes it very obvious this is recording the state of the environment as opposed to building a recipe to rebuild it.

This exported environment file is mainly useful as a record for the sake of reproducibility, not for reusability.

If you produce results with your code that are being used in some form of research output (e.g. a paper), export your environment at the time when the results are being generated, so you have a record of the versions of different libraries you used.
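When it comes to writing up, you may want to pull a handful of exact versions back out of the record file (for a methods section, say). The following is a minimal sketch that reads the exported text line by line; the function name is hypothetical, and it assumes the usual export layout of name=version=build for conda packages and name==version for pip packages.

```python
def recorded_versions(record_text, packages):
    """Look up the exact versions of `packages` in an exported record."""
    wanted = set(packages)
    found = {}
    for line in record_text.splitlines():
        entry = line.strip()
        if not entry.startswith("- "):
            continue  # skip keys such as "name:" or "prefix:"
        # Splitting on "=" handles both name=version=build and name==version
        parts = [part for part in entry[2:].split("=") if part]
        if len(parts) >= 2 and parts[0] in wanted:
            found[parts[0]] = parts[1]
    return found
```

For example, recorded_versions(open("env-record.yml").read(), ["numpy", "pandas"]) would return a dictionary mapping each of those names to the version recorded at export time. Packages that do not appear in the record are simply left out of the result.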

Let’s say you ignored our advice about updating your conda environment purely through modifying your environment file, and used conda install to add packages, so you know your environment state is not in line with your environment.yml. Is there a way to export a simple environment file that can be used to build an environment again?

Absolutely, this is where the --from-history flag comes into play:

conda env export --from-history > environment.yml # again, from inside the activated env

This will produce a clean conda environment file similar to the example we gave at the start of this post, listing only the packages you explicitly installed (without background dependencies or build details).
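The same output can also be used to check whether an environment has drifted away from its recipe. Here is a rough sketch that compares the package names in your environment.yml with those in the --from-history export and reports anything that was installed by hand. The function names are hypothetical, and the line-based parsing is a simplification that assumes the plain list layout used in this post.

```python
import re


def conda_names(yml_text):
    """Collect the conda package names listed under `dependencies:`."""
    names = set()
    in_deps = False
    pip_indent = None
    for line in yml_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        indent = len(line) - len(line.lstrip())
        if not stripped.startswith("- "):
            # A key such as "dependencies:" or "prefix: ..."
            in_deps = stripped == "dependencies:"
            pip_indent = None
            continue
        if not in_deps:
            continue  # e.g. entries under "channels:"
        if stripped == "- pip:":
            pip_indent = indent
            continue
        if pip_indent is not None and indent > pip_indent:
            continue  # nested pip requirement, not a conda package
        pip_indent = None
        # Keep only the name, dropping any version pin or comparison
        names.add(re.split(r"[=<>!~\s]", stripped[2:], maxsplit=1)[0])
    return names


def untracked_packages(recipe_text, history_text):
    """Names present in the --from-history export but not in the recipe."""
    return sorted(conda_names(history_text) - conda_names(recipe_text))
```

An empty list suggests the environment.yml still describes everything that was explicitly requested; anything else is a candidate for adding to the file.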

Mixing in pip

Using a conda environment.yml makes working with pip and conda together less painful. You will have heard (or experienced first hand) that once you install something with pip in a conda env, everything from that point on must be installed with pip, or you will break the environment. This is true, but you can get around it by adding your pip dependencies to your environment.yml file:


name: env-with-pip-dependencies

dependencies:
# Whatever packages you need for your project
  - python=3.12
  - numpy
  - matplotlib
  - pandas
  - pip
  - pip:
    - black
    - https://github.com/YOUR-USERNAME/YOUR-REPO-NAME/releases/download/YOUR-VERSION-NAME/PACKAGENAME-VERSION.tar.gz # you can even install your own packages that you host on GitHub

You can update this as described above.

Exporting with pip

Exporting the full record works the same if you have pip dependencies:

conda env export > env-record.yml

However, --from-history will not include pip dependencies. Thankfully, there are a few different workarounds! Modified from this conversation on GitHub, this code snippet will export your conda and pip dependencies without version numbers (so that the environment.yml file can be used to build a new environment):

# Extract installed pip packages
pip_packages=$(conda env export | grep -A9999 ".*- pip:" | grep -v "^prefix: " | cut -f1 -d"=")

# Export conda environment without builds, and append pip packages
conda env export --from-history | grep -v "^prefix: " > new-environment.yml
echo "$pip_packages" >> new-environment.yml
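If you would rather not rely on grep and cut (on Windows, for instance), the same idea can be sketched in Python. This is not the snippet from the GitHub conversation, just my own approximation of it: the function name is hypothetical, and it takes the text of the two exports as input so that it can be tried without conda installed.

```python
def merge_history_and_pip(full_export, history_export):
    """Combine the --from-history export with the pip section of the full export."""
    # Keep everything from the history export except the machine-specific prefix
    merged = [
        line for line in history_export.splitlines()
        if not line.startswith("prefix:")
    ]

    pip_lines = []
    in_pip = False
    for line in full_export.splitlines():
        if line.strip() == "- pip:":
            in_pip = True
            pip_lines.append(line)
        elif in_pip:
            if not line.strip().startswith("- "):
                break  # reached "prefix:" or another top-level key
            # Drop the "==version" pin so the file stays reusable
            pip_lines.append(line.split("==", 1)[0])
    return "\n".join(merged + pip_lines) + "\n"
```

The two inputs would be the output of conda env export and conda env export --from-history respectively. Unlike cutting at the first = sign, splitting on == leaves pip entries given as URLs intact even if the URL contains an = character.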

But remember: it is better to keep your environment.yml file current, and update your conda env from this file, as opposed to adding packages using conda install and then trying to export details to your environment file to track these changes.

In Conclusion

If you are using conda, use your conda environment.yml to keep control of the packages you have installed.
Use conda env export > env-record.yml to export records of your environments for reproducibility, and use the --from-history flag when you need a file that can be reused to rebuild the environment.

Citation

BibTeX citation:

@online{murphy_quinlan2024,
  author = {Murphy Quinlan, Maeve},
  title = {The Art of Conda},
  date = {2024-11-27},
  url = {https://murphyqm.github.io/posts/2024-11-27-conda-envs},
  langid = {en}
}

For attribution, please cite this work as:

Murphy Quinlan, Maeve. 2024. “The Art of Conda.” November 27, 2024. https://murphyqm.github.io/posts/2024-11-27-conda-envs.
