PRO-seq pipeline tutorial with bulker and PEPPRO

In this tutorial we'll show how to use bulker to set up a controlled environment for a bioinformatics workflow. I assume you've already gone through the install and configure instructions. Before we begin, we need to set up a bash function to allow the activate function to work in the jupyter notebook for this tutorial:

bulker-activate() {
  eval "$(bulker activate -e $@)"
}

This will make the activate command adjust the current shell instead of returning a new one. You may want to add this to your .bashrc (You can read more about this in tips).

Next, make sure you've initialized a bulker config file:

rm "peppro_example/bulker_config.yaml"
export BULKERCFG="peppro_example/bulker_config.yaml"
bulker init -c $BULKERCFG
rm: cannot remove 'peppro_example/bulker_config.yaml': No such file or directory
Guessing container engine is docker.
Wrote new configuration file: peppro_example/bulker_config.yaml

Loading the peppro crate

We're going to be using the peppro pipeline, which processes nascent RNA-seq data (from PRO-seq or GRO-seq experiments). We've produced a bulker crate for the pipeline so it can be run using modular containers. Let's load the peppro crate, which is available on the bulker registry:

bulker load databio/peppro -r
Bulker config: /home/nsheff/code/bulker/docs_jupyter/peppro_example/bulker_config.yaml
Importing crate 'bulker/alpine:default' from '/home/nsheff/bulker_crates/bulker/alpine/default'.
Importing crate 'bulker/coreutils:default' from '/home/nsheff/bulker_crates/bulker/coreutils/default'.
Populating host commands
Loading manifest: 'databio/peppro:default'. Activate with 'bulker activate databio/peppro:default'.
Commands available: samtools, bowtie2, seqkit, fastp, seqtk, preseq, fastq_pair, wigToBigWig, bigWigCat, fastqc, pigz, cutadapt, flash, bedtools, Rscript, R, java, perl
Host commands available: python3

You can see this crate offers several common bioinformatics tools, like samtools and bowtie2. You can see this crate in your list of available crates:

bulker list
Available crates:
databio/peppro:default -- /home/nsheff/bulker_crates/databio/peppro/default

Run commands in the peppro crate

Now we've loaded the crate, which means there's a folder somewhere on your computer with a bunch of executables. Bulker provides two ways to activate this crate conveniently, depending on your use case: bulker activate, and bulker run.

Let's start with a bulker run command, which is how we could execute an ad-hoc command. If you try to run samtools natively, it doesn't work, because I don't have it installed natively on this system:

samtools --version

Command 'samtools' not found, but can be installed with:

sudo apt install samtools


But we can run it inside the crate:

bulker run databio/peppro samtools --version
Bulker config: /home/nsheff/code/bulker/docs_jupyter/peppro_example/bulker_config.yaml
Activating crate: databio/peppro

samtools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.

If we need to run more than one command (or, say, an entire workflow), then it's much simpler to use bulker activate than bulker run. Here, I'm using the hypenated version I created at the beginning of this tutorial:

bulker-activate databio/peppro
Bulker config: /home/nsheff/code/bulker/docs_jupyter/peppro_example/bulker_config.yaml
Activating bulker crate: databio/peppro

Now we can run any crate commands as if they were native:

samtools --version
fastp --version
fastp 0.20.0

cutadapt --version
2.4

Running a pipeline

Now that we've proven we can run each of these commands, let's put them all together and run a whole pipeline. Since the environment we've just activated has all those commands available as if they were installed natively, all we have to do is run a bunch of commands in succession and they will automatically run in individual containers.

Keep in mind: the bulker environment contains all the commands that the pipeline will run, but the pipeline code itself is distinct. Therefore, we must retrieve the pipeline code and any requirements of the workflow system before we can run it within the bulker environment.

First, we'll clone our pipeline from github:

git clone https://github.com/databio/peppro peppro_example/peppro
Cloning into 'peppro_example/peppro'...
remote: Enumerating objects: 58, done.
remote: Counting objects: 100% (58/58), done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 1714 (delta 28), reused 41 (delta 18), pack-reused 1656
Receiving objects: 100% (1714/1714), 2.62 MiB | 0 bytes/s, done.
Resolving deltas: 100% (1004/1004), done.
Checking connectivity... done.

We'll also need a few other things required by the pipeline. First, we have to install the workflow system used by the pipeline. Remember, the workflow system itself is native, it's the individual commands that will be run in containers. This workflow uses pypiper, which can be installed from PyPI:

pip install --user piper
Requirement already satisfied: piper in /home/nsheff/.local/lib/python3.5/site-packages (0.12.1)
Requirement already satisfied: pandas in /home/nsheff/.local/lib/python3.5/site-packages (from piper) (0.24.2)
Requirement already satisfied: ubiquerg>=0.4.5 in /home/nsheff/.local/lib/python3.5/site-packages (from piper) (0.4.10.dev0)
Requirement already satisfied: yacman in /home/nsheff/.local/lib/python3.5/site-packages (from piper) (0.6.2)
Requirement already satisfied: psutil in /home/nsheff/.local/lib/python3.5/site-packages (from piper) (5.6.1)
Requirement already satisfied: logmuse>=0.2.4 in /home/nsheff/.local/lib/python3.5/site-packages (from piper) (0.2.4)
Requirement already satisfied: attmap>=0.12.5 in /home/nsheff/.local/lib/python3.5/site-packages (from piper) (0.12.9)
Requirement already satisfied: python-dateutil>=2.5.0 in /home/nsheff/.local/lib/python3.5/site-packages (from pandas->piper) (2.8.0)
Requirement already satisfied: numpy>=1.12.0 in /home/nsheff/.local/lib/python3.5/site-packages (from pandas->piper) (1.16.2)
Requirement already satisfied: pytz>=2011k in /home/nsheff/.local/lib/python3.5/site-packages (from pandas->piper) (2018.9)
Requirement already satisfied: oyaml in /home/nsheff/.local/lib/python3.5/site-packages (from yacman->piper) (0.9)
Requirement already satisfied: pyyaml>=3.13 in /home/nsheff/.local/lib/python3.5/site-packages (from yacman->piper) (5.1)
Requirement already satisfied: six>=1.5 in /home/nsheff/.local/lib/python3.5/site-packages (from python-dateutil>=2.5.0->pandas->piper) (1.12.0)
WARNING: You are using pip version 19.2.3, however version 19.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

This pipeline also requires reference genome assets that are managed by refgenie. If you don't already have an initialized refgenie config, you can easily initialize one like this:

pip install --user refgenie
export REFGENIE="peppro_example/refgenie.yaml"
refgenie init -c $REFGENIE -s http://staging.refgenomes.databio.org
refgenie pull -g hs38d1 fasta bowtie2_index
Requirement already satisfied: refgenie in /home/nsheff/.local/lib/python3.5/site-packages (0.7.0.dev0)
Requirement already satisfied: ubiquerg>=0.4.8 in /home/nsheff/.local/lib/python3.5/site-packages (from refgenie) (0.4.10.dev0)
Requirement already satisfied: yacman>=0.6.1 in /home/nsheff/.local/lib/python3.5/site-packages (from refgenie) (0.6.2)
Requirement already satisfied: piper>=0.12.1 in /home/nsheff/.local/lib/python3.5/site-packages (from refgenie) (0.12.1)
Requirement already satisfied: logmuse>=0.2 in /home/nsheff/.local/lib/python3.5/site-packages (from refgenie) (0.2.4)
Requirement already satisfied: refgenconf>=0.4.0 in /home/nsheff/.local/lib/python3.5/site-packages (from refgenie) (0.5.0.dev0)
Requirement already satisfied: attmap>=0.12.5 in /home/nsheff/.local/lib/python3.5/site-packages (from refgenie) (0.12.9)
Requirement already satisfied: oyaml in /home/nsheff/.local/lib/python3.5/site-packages (from yacman>=0.6.1->refgenie) (0.9)
Requirement already satisfied: pyyaml>=3.13 in /home/nsheff/.local/lib/python3.5/site-packages (from yacman>=0.6.1->refgenie) (5.1)
Requirement already satisfied: pandas in /home/nsheff/.local/lib/python3.5/site-packages (from piper>=0.12.1->refgenie) (0.24.2)
Requirement already satisfied: psutil in /home/nsheff/.local/lib/python3.5/site-packages (from piper>=0.12.1->refgenie) (5.6.1)
Requirement already satisfied: future in /home/nsheff/.local/lib/python3.5/site-packages (from refgenconf>=0.4.0->refgenie) (0.17.1)
Requirement already satisfied: requests in /home/nsheff/.local/lib/python3.5/site-packages (from refgenconf>=0.4.0->refgenie) (2.21.0)
Requirement already satisfied: tqdm in /home/nsheff/.local/lib/python3.5/site-packages (from refgenconf>=0.4.0->refgenie) (4.32.1)
Requirement already satisfied: numpy>=1.12.0 in /home/nsheff/.local/lib/python3.5/site-packages (from pandas->piper>=0.12.1->refgenie) (1.16.2)
Requirement already satisfied: pytz>=2011k in /home/nsheff/.local/lib/python3.5/site-packages (from pandas->piper>=0.12.1->refgenie) (2018.9)
Requirement already satisfied: python-dateutil>=2.5.0 in /home/nsheff/.local/lib/python3.5/site-packages (from pandas->piper>=0.12.1->refgenie) (2.8.0)
Requirement already satisfied: certifi>=2017.4.17 in /home/nsheff/.local/lib/python3.5/site-packages (from requests->refgenconf>=0.4.0->refgenie) (2019.3.9)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /home/nsheff/.local/lib/python3.5/site-packages (from requests->refgenconf>=0.4.0->refgenie) (1.24.3)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /home/nsheff/.local/lib/python3.5/site-packages (from requests->refgenconf>=0.4.0->refgenie) (3.0.4)
Requirement already satisfied: idna<2.9,>=2.5 in /home/nsheff/.local/lib/python3.5/site-packages (from requests->refgenconf>=0.4.0->refgenie) (2.8)
Requirement already satisfied: six>=1.5 in /home/nsheff/.local/lib/python3.5/site-packages (from python-dateutil>=2.5.0->pandas->piper>=0.12.1->refgenie) (1.12.0)
WARNING: You are using pip version 19.2.3, however version 19.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Initializing refgenie genome configuration
Wrote new refgenie genome configuration file: peppro_example/refgenie.yaml
'hs38d1/fasta:default' archive size: 1.7MB
Downloading URL: http://staging.refgenomes.databio.org/v2/asset/hs38d1/fasta/archive
hs38d1/fasta:default: 1.74MB [00:00, 28.8MB/s]
Download complete: /home/nsheff/code/bulker/docs_jupyter/peppro_example/hs38d1/fasta__default.tgz
Extracting asset tarball and saving to: /home/nsheff/code/bulker/docs_jupyter/peppro_example/hs38d1/fasta/default
Default tag for 'hs38d1/fasta' set to: default
'hs38d1/bowtie2_index:default' archive size: 8.8MB
Downloading URL: http://staging.refgenomes.databio.org/v2/asset/hs38d1/bowtie2_index/archive
hs38d1/bowtie2_index:default: 9.22MB [00:00, 21.1MB/s]                          
Download complete: /home/nsheff/code/bulker/docs_jupyter/peppro_example/hs38d1/bowtie2_index__default.tgz
Extracting asset tarball and saving to: /home/nsheff/code/bulker/docs_jupyter/peppro_example/hs38d1/bowtie2_index/default
Default tag for 'hs38d1/bowtie2_index' set to: default

Now execute the example code to run it.

./peppro_example/peppro/pipelines/peppro.py \
  --sample-name test \
  --genome hs38d1 \
  --input peppro_example/peppro/examples/data/test_r1.fq.gz \
  --single-or-paired single \
  -O peppro_example/output/
### Pipeline run code and environment:

*              Command:  `./peppro_example/peppro/pipelines/peppro.py --sample-name test --genome hs38d1 --input peppro_example/peppro/examples/data/test_r1.fq.gz --single-or-paired single -O peppro_example/output/`
*         Compute host:  puma
*          Working dir:  /home/nsheff/code/bulker/docs_jupyter
*            Outfolder:  /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/
*  Pipeline started at:   (10-21 11:29:03) elapsed: 0.0 _TIME_

### Version log:

*       Python version:  2.7.12
*          Pypiper dir:  `/home/nsheff/.local/lib/python2.7/site-packages/pypiper`
*      Pypiper version:  0.12.0
*         Pipeline dir:  `/home/nsheff/code/bulker/docs_jupyter/peppro_example/peppro/pipelines`
*     Pipeline version:  0.8.1
*        Pipeline hash:  05095a2cc78a2e210916f215dd0828940b894a6f
*      Pipeline branch:  * dev
*        Pipeline date:  2019-10-21 11:21:23 -0400

### Arguments passed to pipeline:

*           `TSS_name`:  `None`
*            `adapter`:  `fastp`
*          `anno_name`:  `None`
*         `complexity`:  `True`
*        `config_file`:  `peppro.yaml`
*              `cores`:  `1`
*           `coverage`:  `False`
*              `dedup`:  `seqkit`
*              `dirty`:  `False`
*  `ensembl_gene_body`:  `None`
*        `ensembl_tss`:  `None`
*          `exon_name`:  `None`
*       `force_follow`:  `False`
*    `genome_assembly`:  `hs38d1`
*              `input`:  `['peppro_example/peppro/examples/data/test_r1.fq.gz']`
*             `input2`:  `None`
*        `intron_name`:  `None`
*               `keep`:  `False`
*             `logdev`:  `False`
*            `max_len`:  `30`
*                `mem`:  `4000`
*          `new_start`:  `False`
*            `no_fifo`:  `False`
*      `output_parent`:  `peppro_example/output/`
*         `paired_end`:  `False`
*              `parts`:  `4`
*           `pre_name`:  `None`
*      `prealignments`:  `[]`
*           `protocol`:  `pro`
*            `recover`:  `False`
*        `sample_name`:  `test`
*              `scale`:  `False`
*             `silent`:  `False`
*   `single_or_paired`:  `single`
*                `sob`:  `False`
*           `testmode`:  `False`
*            `trimmer`:  `seqtk`
*          `umi_fastp`:  `False`
*            `umi_len`:  `0`
*          `verbosity`:  `None`

----------------------------------------

Some assets are not found. You can update your REFGENIE config file or point directly to the file using the noted command-line arguments:
  Assets not existing: refgene_tss (--TSS-name), ensembl_tss (--pi-tss), ensembl_gene_body (--pi-body), refgene_pre_mRNA (--pre-name), feat_annotation (--anno-name), refgene_exon (--exon-name), refgene_intron (--intron-name)
Local input file: peppro_example/peppro/examples/data/test_r1.fq.gz

> `File_mb` 1.02    PEPPRO  _RES_

> `Read_type`   single  PEPPRO  _RES_

> `Genome`  hs38d1  PEPPRO  _RES_
Detected PRO input

### Merge/link and fastq conversion:  (10-21 11:29:03) elapsed: 0.0 _TIME_

Number of input file sets: 1
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/raw/test.fastq.gz`  

> `ln -sf /home/nsheff/code/bulker/docs_jupyter/peppro_example/peppro/examples/data/test_r1.fq.gz /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/raw/test.fastq.gz` (20905)
<pre>
</pre>
Command completed. Elapsed time: 0:00:00. Running peak memory: 0.0GB.  
  PID: 20905;   Command: ln;    Return code: 0; Memory used: 0.0GB

Local input file: '/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/raw/test.fastq.gz'
Found .fastq.gz file
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/fastq/test_R1.fastq`  

> `gzip -f -d -c /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/raw/test.fastq.gz > /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/fastq/test_R1.fastq` (20907)
<pre>
</pre>
Command completed. Elapsed time: 0:00:00. Running peak memory: 0.002GB.  
  PID: 20907;   Command: gzip;  Return code: 0; Memory used: 0.002GB


> `Raw_reads`   12500   PEPPRO  _RES_

> `Fastq_reads` 12500   PEPPRO  _RES_
['/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/raw/test.fastq.gz']

### FASTQ processing:  (10-21 11:29:03) elapsed: 0.0 _TIME_

You set adapter arg to 'fastp' but you must select 'cutadapt' for single end data. Overriding.

> `cutadapt --version`
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/fastq/test_R1_processed.fastq`  

> `( cutadapt -j 1 -m 18 -a TGGAATTCTCGGGTGCCAAGG /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/fastq/test_R1.fastq | seqtk trimfq -b 0 -L 30 - | seqtk seq -r - > /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/fastq/test_R1_processed.fastq ) 2> /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/cutadapt/test_cutadapt.txt` (20985)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.057GB.  
  PID: 20985;   Command: ;  Return code: 0; Memory used: 0.057GB

Evaluating read trimming

> `Trimmed_reads`   11880   PEPPRO  _RES_

> `Trim_loss_rate`  4.96    PEPPRO  _RES_

### Plot adapter insertion distribution (10-21 11:29:06) elapsed: 2.0 _TIME_

Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/cutadapt/test_adapter_insertion_distribution.pdf`  

> `Rscript ./peppro_example/peppro/tools/PEPPRO.R cutadapt -i /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/cutadapt/test_cutadapt.txt -o /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/cutadapt` (21191)
<pre>
Error: Install the "PEPPROr" R package before proceeding.
i.e. devtools::install_github("databio/peppro", subdir="PEPPROr")
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.057GB.  
  PID: 21191;   Command: Rscript;   Return code: 0; Memory used: 0.019GB

> `Adapter insertion distribution`  cutadapt/test_adapter_insertion_distribution.pdf    Adapter insertion distribution  cutadapt/test_adapter_insertion_distribution.png    PEPPRO  _OBJ_

### Prealignments (10-21 11:29:07) elapsed: 1.0 _TIME_

You may use `--prealignments` to align to references before the genome alignment step. See docs.

### Map to genome (10-21 11:29:07) elapsed: 0.0 _TIME_

Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam`  

> `bowtie2 -p 1 --very-sensitive -X 2000 --rg-id test -x /home/nsheff/code/bulker/docs_jupyter/peppro_example/hs38d1/bowtie2_index/default/hs38d1 -U /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/fastq/test_R1_processed.fastq | samtools view -bS - -@ 1  | samtools sort - -@ 1 -T /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/tmpjLFtWP -o /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_temp.bam` (21283,21287,21313)
<pre>
11880 reads; of these:
  11880 (100.00%) were unpaired; of these:
    10476 (88.18%) aligned 0 times
    372 (3.13%) aligned exactly 1 time
    1032 (8.69%) aligned >1 times
11.82% overall alignment rate
</pre>
Command completed. Elapsed time: 0:00:02. Running peak memory: 0.057GB.  
  PID: 21287;   Command: samtools;  Return code: 0; Memory used: 0.018GB  
  PID: 21283;   Command: bowtie2;   Return code: 0; Memory used: 0.018GB  
  PID: 21313;   Command: samtools;  Return code: 0; Memory used: 0.018GB


> `samtools view -q 10 -b -@ 1 -U /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_fail_qc.bam /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_temp.bam > /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam` (21527)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.057GB.  
  PID: 21527;   Command: samtools;  Return code: 0; Memory used: 0.018GB


> `samtools depth /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam | awk '{counter++;sum+=$3}END{print sum/counter}'`

> `Mapped_reads`    1404    PEPPRO  _RES_

> `QC_filtered_reads`   989.0   PEPPRO  _RES_

> `Aligned_reads`   415 PEPPRO  _RES_

> `Alignment_rate`  3.49    PEPPRO  _RES_

> `Total_efficiency`    3.32    PEPPRO  _RES_

> `Read_depth`  1.26    PEPPRO  _RES_

### Compress all unmapped read files (10-21 11:29:11) elapsed: 4.0 _TIME_

Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_temp.bam.bai`  

> `samtools index /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_temp.bam` (21800)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.057GB.  
  PID: 21800;   Command: samtools;  Return code: 0; Memory used: 0.019GB


> `samtools idxstats /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_temp.bam | grep -we 'chrM' -we 'chrMT' -we 'M' -we 'MT' -we 'rCRSd' -we 'rCRSd_3k'| cut -f 3`

> `samtools stats /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam | grep '^SN' | cut -f 2- | grep 'maximum length:' | cut -f 2-`

### Calculate NRF, PBC1, and PBC2 (10-21 11:29:13) elapsed: 2.0 _TIME_

Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam.bai`  

> `samtools index /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam` (22028)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.057GB.  
  PID: 22028;   Command: samtools;  Return code: 0; Memory used: 0.019GB

Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/test_bamQC.tsv`  

> `./peppro_example/peppro/tools/bamQC.py --silent -i /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam -c 1 -o /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/test_bamQC.tsv --silent` (22096)
<pre>
</pre>
Command completed. Elapsed time: 0:00:03. Running peak memory: 0.068GB.  
  PID: 22096;   Command: ./peppro_example/peppro/tools/bamQC.py;    Return code: 0; Memory used: 0.068GB


> `awk '{ for (i=1; i<=NF; ++i) { if ($i ~ "NRF") c=i } getline; print $c }' /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/test_bamQC.tsv`

> `awk '{ for (i=1; i<=NF; ++i) { if ($i ~ "PBC1") c=i } getline; print $c }' /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/test_bamQC.tsv`

> `awk '{ for (i=1; i<=NF; ++i) { if ($i ~ "PBC2") c=i } getline; print $c }' /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/test_bamQC.tsv`

> `NRF` 0.89    PEPPRO  _RES_

> `PBC1`    0.95    PEPPRO  _RES_

> `PBC2`    30.75   PEPPRO  _RES_
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_unmap.bam`  

> `samtools view -b -@ 1 -f 4  /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_temp.bam > /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_unmap.bam` (22108)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.068GB.  
  PID: 22108;   Command: samtools;  Return code: 0; Memory used: 0.019GB


> `samtools view -c -f 4 -@ 1 /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_temp.bam`

> `Unmapped_reads`  10476.0 PEPPRO  _RES_

### Split BAM by strand (10-21 11:29:18) elapsed: 5.0 _TIME_

Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_plus.bam`,`/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_minus.bam`  

> `samtools view -bh -F 20 /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam > /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_plus.bam` (22257)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.068GB.  
  PID: 22257;   Command: samtools;  Return code: 0; Memory used: 0.019GB


> `samtools view -bh -f 16 /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam > /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_minus.bam` (22322)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.068GB.  
  PID: 22322;   Command: samtools;  Return code: 0; Memory used: 0.019GB

Skipping TSS -- TSS enrichment requires TSS annotation file: refgene_tss
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/chr_order.txt`  

> `samtools view -H /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam | grep 'SN:' | awk -F':' '{print $2,$3}' | awk -F' ' -v OFS=' ' '{print $1,$3}' > /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/chr_order.txt` (22386,22391,22395,22403)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.068GB.  
  PID: 22386;   Command: samtools;  Return code: 0; Memory used: 0.018GB  
  PID: 22395;   Command: awk;   Return code: 0; Memory used: 0.004GB  
  PID: 22391;   Command: grep;  Return code: 0; Memory used: 0.001GB  
  PID: 22403;   Command: awk;   Return code: 0; Memory used: 0.004GB

Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/chr_keep.txt`  

> `cut -f 1 /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/chr_order.txt > /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/chr_keep.txt` (22455)
<pre>
</pre>
Command completed. Elapsed time: 0:00:00. Running peak memory: 0.068GB.  
  PID: 22455;   Command: cut;   Return code: 0; Memory used: 0.0GB

Skipping PI -- Pause index requires 'TSS' and 'gene body' annotation files: ensembl_tss and ensembl_gene_body
Skipping FRiP -- Fraction of reads in pre-mRNA requires pre-mRNA annotation file: refgene_pre_mRNA

### Calculate fraction of reads in features (FRiF) (10-21 11:29:20) elapsed: 2.0 _TIME_


### Plot FRiF (10-21 11:29:20) elapsed: 0.0 _TIME_


> `samtools view -@ 1 -q 10 -c -F4 /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_plus.bam`

> `samtools view -@ 1 -q 10 -c -F4 /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_minus.bam`

### Produce bigWig files (10-21 11:29:21) elapsed: 1.0 _TIME_

Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/signal_hs38d1/test_plus_body_0-mer.bw`  

> `samtools index /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_plus.bam` (22609)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.068GB.  
  PID: 22609;   Command: samtools;  Return code: 0; Memory used: 0.02GB


> `./peppro_example/peppro/tools/bamSitesToWig.py -i /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_plus.bam -c /home/nsheff/code/bulker/docs_jupyter/peppro_example/hs38d1/fasta/default/hs38d1.chrom.sizes -o /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/signal_hs38d1/test_plus_body_0-mer.bw -p 1 --variable-step --tail-edge` (22674)
<pre>
Registering input file: '/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_plus.bam'
Temporary files will be stored in: 'tmp_test_plus_cuttrace_qU7o0L'
Processing with 1 cores...
stdin is empty of data
Reduce step (merge files)...
Merging 121 files into output file: '/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/signal_hs38d1/test_plus_body_0-mer.bw'
Couldn't open tmp_test_plus_cuttrace_qU7o0L/JTFH01000350.1.txt.bw
</pre>
Command completed. Elapsed time: 0:01:19. Running peak memory: 0.068GB.  
  PID: 22674;   Command: ./peppro_example/peppro/tools/bamSitesToWig.py;    Return code: 0; Memory used: 0.057GB

Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/signal_hs38d1/test_minus_body_0-mer.bw`  

> `samtools index /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_minus.bam` (31803)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.068GB.  
  PID: 31803;   Command: samtools;  Return code: 0; Memory used: 0.019GB


> `./peppro_example/peppro/tools/bamSitesToWig.py -i /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_minus.bam -c /home/nsheff/code/bulker/docs_jupyter/peppro_example/hs38d1/fasta/default/hs38d1.chrom.sizes -o /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/signal_hs38d1/test_minus_body_0-mer.bw -p 1 --variable-step --tail-edge` (31872)
<pre>
Registering input file: '/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_minus.bam'
Temporary files will be stored in: 'tmp_test_minus_cuttrace_ljeMBw'
Processing with 1 cores...
Reduce step (merge files)...
Merging 124 files into output file: '/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/signal_hs38d1/test_minus_body_0-mer.bw'
</pre>
Command completed. Elapsed time: 0:01:23. Running peak memory: 0.068GB.  
  PID: 31872;   Command: ./peppro_example/peppro/tools/bamSitesToWig.py;    Return code: 0; Memory used: 0.057GB

Starting cleanup: 5 files; 4 conditional files for cleanup

Cleaning up flagged intermediate files. . .

Cleaning up conditional list. . .

### Pipeline completed. Epilogue
*        Elapsed time (this run):  0:03:01
*  Total elapsed time (all runs):  0:02:59
*         Peak memory (this run):  0.0675 GB
*        Pipeline completed time: 2019-10-21 11:32:04

The pipeline has now completed successfully and we can explore the results:

tree peppro_example/output/
peppro_example/output/
└── test
    ├── aligned_hs38d1
    │   ├── test_fail_qc.bam
    │   ├── test_minus.bam
    │   ├── test_minus.bam.bai
    │   ├── test_plus.bam
    │   ├── test_plus.bam.bai
    │   ├── test_sort.bam
    │   ├── test_sort.bam.bai
    │   └── test_unmap.bam
    ├── cutadapt
    │   └── test_cutadapt.txt
    ├── fastq
    ├── fastqc
    ├── objects.tsv
    ├── PEPPRO_cleanup.sh
    ├── PEPPRO_commands.sh
    ├── PEPPRO_completed.flag
    ├── PEPPRO_log.md
    ├── PEPPRO_profile.tsv
    ├── QC_hs38d1
    │   └── test_bamQC.tsv
    ├── raw
    │   └── test.fastq.gz -> /home/nsheff/code/bulker/docs_jupyter/peppro_example/peppro/examples/data/test_r1.fq.gz
    ├── signal_hs38d1
    │   └── test_minus_body_0-mer.bw
    └── stats.tsv

8 directories, 19 files

We've successfully run a complete pipeline without having to install any of the software that runs the commands in the workflow. We're also able to interactively explore the environment that ran the workflow.

Conclusion

That's basically it. If you're a workflow developer, all you need to do is write your own manifest and distribute it with your workflow; in 3 lines of code, users will be able to run your workflow using modular containers, using the container engine of their choice.