What's the point of bulker?

Container systems like docker and singularity are really useful, but there are two use cases, described below, that they don't handle well. Bulker is a general framework that solves these two niche problems nicely:

How to containerize a workflow

There's a lot of existing tooling around individual containers or sets of containers running simultaneously (like kubernetes or docker swarm). But what about containers for workflows that run a series of commands sequentially? In a workflow, we're only running one container at a time, so we don't need the complexity of a container swarm. But going one-by-one is a pain when your workflow needs 15 different tools. What's the best way to containerize a workflow?

One solution is to build a mega-image with all 15 command-line tools required for the entire workflow. This is pretty common, but it's inefficient because different workflows that use the same software will not be able to share it. It also makes it hard to update individual components, makes images huge, and goes against the modular philosophy of containers, where each individual tool has its own image. Keeping tools isolated promotes reusability, reduces resource usage, and is easier to maintain.

So, if not mega-images, how should we containerize a workflow that needs to run lots of commands? Using individual images sounds nice, but there are a few challenges with the individual image approach:

  1. Distribution. It would be nice to distribute a collection of related images as a unit, but there's no easy way to do that.

  2. Portability. It's counter-intuitive, but in a way, using individual images actually reduces portability. If each of my workflow commands requires a different container, then I need some way to specify the container for each command, which makes the workflow harder to write and maintain. It also makes it less portable: I could prefix every command in my workflow with docker run ... or singularity run ..., but then the workflow requires a specific container engine and is no longer really portable. Making the workflow flexible enough to use docker, singularity, or no container at all seems like a lot of extra work to build and maintain. How can I make the workflow independent of any one container system, and also make it work for users who don't want to use containers at all?
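
To make the problem concrete, here's what a single workflow step might look like under each approach (the image name and mount paths below are purely illustrative):

```bash
# Native: simple and engine-agnostic, but requires samtools installed locally
samtools sort -o sorted.bam input.bam

# Docker: the workflow is now tied to docker
docker run --rm -v "$PWD":/data -w /data some/samtools-image \
    samtools sort -o sorted.bam input.bam

# Singularity: the same step again, rewritten for a different engine
singularity exec docker://some/samtools-image \
    samtools sort -o sorted.bam input.bam
```

The command the workflow actually cares about is identical in all three cases; only the container boilerplate around it changes.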

How to use one installation across computing environments

The second use case is also a problem of portability. I use a lot of different computing environments: my desktop at work, one or more remote servers, my laptop, etc. I want a common set of commands and tools installed on all of these systems, but I really don't want to install and maintain all this software five times. This seems like the perfect use case for containers, but I struggled for years to figure out how to do it right. Plus, the HPC at work runs singularity while I use docker at home, which adds complexity. What I want is to be able to just clone a git repository on each computer and automatically have pandoc and samtools and latex and R, etc, available without doing anything else. I need these commands to just run in a container, so I don't have to install them -- and I want that to just work on all my computers, without having to do anything.

Eventually I refined my own little docker system that let me install a command-line tool once and then use it on all my computers. It combined template shell scripts that could be populated with a few variables with individual, manually created wrapper scripts for each executable, stored in a folder in a git repository. By just cloning that repository and adding its bin folder to my PATH, I get access to these containerized executables. The problem I faced is that the volumes I want to mount and the environment variables I want to pass into these containers are not the same in every computing environment. For example, on the HPC I need to mount a particular filesystem that doesn't exist on my laptop.
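
As a rough sketch of that pre-bulker setup, each tool got a tiny hand-written wrapper committed to the repository's bin folder, something like this (the image name and mount options are illustrative):

```bash
#!/bin/sh
# bin/pandoc -- hand-maintained containerized executable (illustrative sketch)
# Runs pandoc in a container; to the caller it behaves like a native pandoc.
# The pain point: the volumes mounted below differ on every machine I use.
exec docker run --rm -i \
    -v "$PWD":"$PWD" -w "$PWD" \
    -u "$(id -u):$(id -g)" \
    pandoc/core "$@"
```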

How bulker solves these problems

Bulker solves both of these problems in a really simple and elegant way that merges three important concepts:

Containerized executables. First, bulker builds on the idea of containerized executables. A containerized executable is just a script that runs a command in a container; it could be written using docker run or singularity exec. Containerized executables are nice because, to the user, they behave just like native tools -- but they're actually running in a container. This means any workflow can use containers without being changed. Bulker makes it really easy to create these executables automatically, because it handles the details of both the container engine and any environment-specific settings like volumes. Adding a layer like bulker over existing container engines makes the system work across container engines and across computing environments, with a single user interface.
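
For example, a singularity-based version of the same kind of wrapper could be as small as this (the image URI is illustrative); to the caller, it's indistinguishable from a native pandoc command:

```bash
#!/bin/sh
# A containerized executable built on singularity instead of docker (illustrative)
exec singularity exec docker://pandoc/core pandoc "$@"
```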

Decoupling container settings. One of the challenges of automatically producing containerized executables is that some container settings depend on the tool, while others depend on the environment. Bulker solves this by decoupling the tool-specific settings for each image from the environment-specific settings that vary by computer system. An environment-specific config file (bulker_config.yaml) keeps track of settings like volumes and environment variables, and a template-driven executable builder then combines the two to create all the executables needed on a given system.
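
Conceptually, the environment config holds only the machine-specific pieces. The exact schema is described in the bulker documentation; the sketch below is only meant to show the idea, and the key names are assumptions rather than the authoritative format:

```yaml
# bulker_config.yaml on the HPC (illustrative sketch, not the exact schema)
bulker:
  container_engine: singularity   # docker on my laptop
  volumes:                        # filesystems to mount into every container
    - /scratch
    - /project
  envvars:                        # environment variables to pass through
    - DISPLAY
```

Bulker then renders one executable per command by filling an engine-specific template with the image from the manifest plus the volumes and variables from this config.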

Bulker crates. For distribution, bulker introduces the concept of a crate, which is a collection of containerized executables. Really, it's just a folder with a bunch of command-line scripts in it, each tailored to a particular computing system and container engine. To build a crate, bulker requires a list of every command to be containerized, plus the container image and any settings required to run it. We call this list of commands a manifest. With a single command (bulker load), users can install all the software listed in a manifest, making every tool available in the PATH. This makes the workflow portable, because bulker removes any notion of containerization from the workflow itself. The workflow doesn't change -- it is completely unaware of containers and need only be written as if it were running natively.
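
A manifest is itself just a small YAML file listing each command and the image that provides it, roughly along these lines (the field names here are an illustrative sketch; see the bulker documentation for the exact format):

```yaml
# demo_manifest.yaml (illustrative sketch)
manifest:
  name: demo
  commands:
    - command: cowsay            # executable name to expose on the PATH
      docker_image: nsheff/cowsay
    - command: fortune
      docker_image: nsheff/fortune
```

Running bulker load on a manifest like this builds the crate of executables described above.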

Manifests are also useful beyond workflow portability. I wrote my own personal manifest with all the tools I want available in my workspace, and I configure each environment (HPC, laptop, desktop, cloud) with a simple local bulker_config.yaml. Bulker pairs the tool manifest with the environment configuration to automatically produce containerized executables for each environment. To install a new package across all my systems, all I have to do is update the manifest and then re-run bulker load databio/nsheff. With almost no maintenance, I keep my computing systems in sync, even across different container engines.
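
In practice, the sync loop on any given machine is just:

```bash
# After adding a new tool to the shared manifest:
bulker load databio/nsheff   # rebuilds the crate using this machine's engine and volumes
```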