Bulker refgenie tutorial

Here we'll show you how to use bulker together with refgenie to create a bunch of custom refgenie assets without having to install any indexing software.

Introduction to refgenie

Refgenie manages storage, access, and transfer of reference genome data assets. Among other tasks, it builds aligner indexes for custom genome assemblies. You can read more about refgenie if you like, but for now, you can just think of it as a simple pipeline that will run a few commands: bowtie2-build, bwa index, and hisat2-build to produce 3 different genome indexes for a custom fasta file.

You'll need to make sure refgenie is installed for this to work, using some variation of this command:

pip install --user refgenie

Refgenie is easy to install, but we don't want to have to install all of those individual genome indexing tools, which probably require compiling and are distributed in different places. Luckily, there's a bulker manifest available that groups together all the necessary biocontainers to run a refgenie build pipeline.

Activating the crate

All we have to do to make those commands available is load and activate the bulker crate:

bulker load databio/refgenie:0.7.0
bulker activate databio/refgenie:0.7.0

This populates your environment with the commands necessary for the refgenie pipeline to run.

Downloading data

We'll need a fasta file to feed to our indexing pipeline. Here we'll use a small decoy sequence so the pipeline doesn't take too long to run the indexes. You could use this same approach to produce indexes for any fasta file

wget -O hs38d1.fna.gz \
  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/786/075/GCA_000786075.2_hs38d1/GCA_000786075.2_hs38d1_genomic.fna.gz 

Running the pipeline

Make sure you have a refgenie config file initialized. If not, run these commands:

export REFGENIE="refgenie.yaml"
refgenie init -c $REFGENIE

Now, start the refgenie pipeline:

refgenie build \
    --genome hs38d1 \
    --fasta hs38d1.fna.gz \
    fasta bowtie2_index bwa_index hisat2_index star_index

This command will run bowtie2-build, bwa index, and hisat2-build in succession on your given fasta file. These tools don't have to be installed on your computer; they are included in the bulker crate.

Using the assets

Now we can use the refgenie seek command to retrieve the paths to any of these assets like this:

refgenie seek hs38d1/fasta
refgenie seek hs38d1/fasta.fai
refgenie seek hs38d1/fasta.chrom_sizes
refgenie seek hs38d1/bowtie2_index
refgenie seek hs38d1/hisat2_index
refgenie seek hs38d1/bwa_index