Introduction to biobase
Bulker comes with a core manifest called biobase
that includes more than 50 common bioinformatics tools. Biobase is useful for everyday interactive analysis, basic pipelines, and also as a starting point for more complicated manifests. In this tutorial we'll show how to activate biobase and use it for some interactive RNA-seq analysis, and to run a pipeline. I assume you've already gone through the install and configure instructions.
Loading the biobase crate
Let's load the biobase crate, which is available on the bulker registry. First I'm initializing an empty bulker config for this tutorial.
export BULKERCFG="biobase_example/bulker_config.yaml"
rm $BULKERCFG
bulker init -c $BULKERCFG
Guessing container engine is docker.
Wrote new configuration file: biobase_example/bulker_config.yaml
bulker load biobase -f $HOME/code/hub.bulker.io/bulker/biobase.yaml
Bulker config: /home/nsheff/code/bulker/docs_jupyter/biobase_example/bulker_config.yaml
Importing crate 'bulker/alpine:default' from '/home/nsheff/bulker_crates/bulker/alpine/default'.
Importing crate 'bulker/coreutils:default' from '/home/nsheff/bulker_crates/bulker/coreutils/default'.
Loading manifest: 'bulker/biobase:default'. Activate with 'bulker activate bulker/biobase:default'.
Commands available: aws, ascp, bamtools, bedClip, bedCommonRegions, bedGraphToBigWig, bedIntersect, bedItemOverlapCount, bedops, bedPileUps, bedToBigBed, bedtools, bigWigAverageOverBed, bigWigCat, bigWigSummary, bismark, bissnp, blast, bowtie2, bowtie, bsmap, bwa, cellranger, cufflinks, curl, cutadapt, faSplit, fastq-dump, fasterq-dump, vdb-config, fastqc, gatk, gt, hisat2, homer, kallisto, khmer, liftOver, macs2, mashmap, picard, pigz, prefetch, R, repeatmasker, rg, Rscript, sambamba, salmon, samtools, segway, seqkit, seqtk, skewer, samblaster, STAR, tabix, trim_galore, trimmomatic, vep, wigToBigWig
You can see this crate offers many common bioinformatics tools, like samtools
and bowtie2
. You can see this crate in your list of available crates:
bulker list
Available crates:
bulker/biobase:default -- /home/nsheff/bulker_crates/bulker/biobase/default
Before activating the crate, these commands are not available because they are not installed natively on this sytem:
samtools --version
Command 'samtools' not found, but can be installed with:
sudo apt install samtools
kallisto
Command 'kallisto' not found, but can be installed with:
sudo apt install kallisto
So, we activate the crate. In a shell you just type:
bulker activate biobase
Due to a limitation with jupyter, because bulker spawns a new shell by default, you have to things a bit differently...there 2 ways to use bulker with a jupyter note book:
- You can simply activate the crate before starting the notebook (the easier way).
- You can use this
eval
trick below to activate the crate in the existing jupyter shell.
Either way should have the same affect as typing bulker activate biobase
outside of jupyter.
eval "$(bulker activate -e -p biobase)"
Bulker config: /home/nsheff/pCloudSync/env/bulker_config/zither.yaml
Activating bulker crate: biobase
It's worth emphasizing: that single line of code is all you need to do to "install" dozens of common bioinformatics tools. Now, in just a few minutes you've given yourself the ability to create a portable computing environment instantly, on demand.
Now we can inspect what commands are made available by this crate:
bulker inspect
Bulker config: /home/nsheff/code/bulker/docs_jupyter/biobase_example/bulker_config.yaml
Bulker manifest: bulker/biobase
Crate path: /home/nsheff/bulker_crates/bulker/biobase/default
Available commands: ['cellranger', 'fasterq-dump', 'cksum', 'hisat2', 'paste', 'chmod', 'aws', 'picard', 'arch', 'tail', 'unexpand', 'liftOver', 'cutadapt', 'expr', 'rm', 'STAR', 'base32', 'runcon', 'timeout', 'gt', 'samtools', 'nohup', 'rmdir', 'kallisto', 'sleep', 'vdir', 'bowtie2', 'pigz', 'samblaster', 'echo', 'base64', 'bwa', 'which', 'bash', 'shuf', 'homer', 'trimmomatic', 'salmon', 'stty', 'fastqc', 'cp', 'chgrp', 'ptx', 'blast', 'mkfifo', 'mashmap', 'seq', 'du', 'bigWigCat', 'logname', 'factor', 'vep', 'dircolors', 'bedCommonRegions', 'pwd', 'bigWigAverageOverBed', 'expand', 'sh', 'mkdir', 'whoami', 'bedops', 'R', 'bedPileUps', 'tsort', 'users', 'prefetch', 'fmt', 'truncate', 'csplit', 'repeatmasker', 'bissnp', 'gatk', 'uptime', 'tee', 'touch', 'false', 'od', 'mv', 'numfmt', 'bismark', 'seqtk', 'fastq-dump', 'b2sum', 'bamtools', 'groups', 'id', 'docker', 'seqkit', 'basename', 'cat', 'cufflinks', 'bedtools', 'nproc', 'unlink', 'md5sum', 'link', 'cut', 'sambamba', 'bowtie', 'pr', 'shred', 'bedItemOverlapCount', 'bsmap', 'bedGraphToBigWig', 'yes', 'lesspipe', 'dir', 'sha512sum', 'skewer', 'tr', 'split', 'hostname', 'macs2', 'Rscript', 'sha256sum', 'chown', 'tty', 'uname', 'pinky', 'ascp', 'vdb-config', 'grep', 'true', 'bedClip', 'wc', 'faSplit', 'printf', 'tabix', 'install', 'bigWigSummary', 'stdbuf', 'ln', 'bedIntersect', 'khmer', 'trim_galore', 'mknod', 'df', 'comm', 'sha224sum', 'realpath', 'dd', 'pathchk', 'uniq', 'nl', 'test', 'curl', 'ls', 'env', 'sha384sum', 'tac', 'segway', 'fold', 'bedToBigBed', 'who', 'wigToBigWig', 'hostid', 'printenv', 'sha1sum', 'join', 'nice', 'stat', 'date', 'awk', 'sed', 'sum', 'sync', 'rg', 'dirname', 'head', 'sort']
Any of these commands can be run as if they are native, but they will actually be running in containers.
samtools --version
samtools 1.10
Using htslib 1.10.2
Copyright (C) 2019 Genome Research Ltd.
kallisto
kallisto 0.46.2
Usage: kallisto <CMD> [arguments] ..
Where <CMD> can be one of:
index Builds a kallisto index
quant Runs the quantification algorithm
bus Generate BUS files for single-cell data
pseudo Runs the pseudoalignment step
merge Merges several batch runs
h5dump Converts HDF5-formatted results to plaintext
inspect Inspects and gives information about an index
version Prints version information
cite Prints citation information
Running kallisto <CMD> without arguments prints usage information for <CMD>
cutadapt --version
Running an analysis
Now that we've proven we can run each of these commands, let's put them all together and run a whole pipeline. Since the environment we've just activated has all those commands available as if they were installed natively, all we have to do is run a bunch of commands in succession and they will automatically run in individual containers.
Let's grab some reads from SRA:
wget https://sra-download.ncbi.nlm.nih.gov/traces/sra10/SRR/011497/SRR11773889
--2020-07-10 14:36:20-- https://sra-download.ncbi.nlm.nih.gov/traces/sra10/SRR/011497/SRR11773889
Resolving sra-download.ncbi.nlm.nih.gov (sra-download.ncbi.nlm.nih.gov)... 130.14.250.25, 130.14.250.24, 165.112.9.231
Connecting to sra-download.ncbi.nlm.nih.gov (sra-download.ncbi.nlm.nih.gov)|130.14.250.25|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15095418 (14M) [application/octet-stream]
Saving to: ‘SRR11773889’
SRR11773889 100%[===================>] 14.40M 49.3KB/s in 2m 31s
2020-07-10 14:38:53 (97.4 KB/s) - ‘SRR11773889’ saved [15095418/15095418]
fasterq-dump -L info SRR11773889 -O raw_reads
spots read : 742,462
reads read : 742,462
reads written : 742,462
Now we want to try to align these. We'll use refgenie to easily pull some bowtie2 indexes. Refgenie is included in the biobase.
export REFGENIE="biobase_example/refgenie.yaml"
refgenie init -c $REFGENIE -s http://staging.refgenomes.databio.org
refgenie pull t7/bowtie2_index
Requirement already satisfied: refgenie in /home/nsheff/.local/lib/python3.5/site-packages (0.7.0.dev0)
Requirement already satisfied: ubiquerg>=0.4.8 in /home/nsheff/.local/lib/python3.5/site-packages (from refgenie) (0.4.10.dev0)
Requirement already satisfied: yacman>=0.6.1 in /home/nsheff/.local/lib/python3.5/site-packages (from refgenie) (0.6.2)
Requirement already satisfied: piper>=0.12.1 in /home/nsheff/.local/lib/python3.5/site-packages (from refgenie) (0.12.1)
Requirement already satisfied: logmuse>=0.2 in /home/nsheff/.local/lib/python3.5/site-packages (from refgenie) (0.2.4)
Requirement already satisfied: refgenconf>=0.4.0 in /home/nsheff/.local/lib/python3.5/site-packages (from refgenie) (0.5.0.dev0)
Requirement already satisfied: attmap>=0.12.5 in /home/nsheff/.local/lib/python3.5/site-packages (from refgenie) (0.12.9)
Requirement already satisfied: oyaml in /home/nsheff/.local/lib/python3.5/site-packages (from yacman>=0.6.1->refgenie) (0.9)
Requirement already satisfied: pyyaml>=3.13 in /home/nsheff/.local/lib/python3.5/site-packages (from yacman>=0.6.1->refgenie) (5.1)
Requirement already satisfied: pandas in /home/nsheff/.local/lib/python3.5/site-packages (from piper>=0.12.1->refgenie) (0.24.2)
Requirement already satisfied: psutil in /home/nsheff/.local/lib/python3.5/site-packages (from piper>=0.12.1->refgenie) (5.6.1)
Requirement already satisfied: future in /home/nsheff/.local/lib/python3.5/site-packages (from refgenconf>=0.4.0->refgenie) (0.17.1)
Requirement already satisfied: requests in /home/nsheff/.local/lib/python3.5/site-packages (from refgenconf>=0.4.0->refgenie) (2.21.0)
Requirement already satisfied: tqdm in /home/nsheff/.local/lib/python3.5/site-packages (from refgenconf>=0.4.0->refgenie) (4.32.1)
Requirement already satisfied: numpy>=1.12.0 in /home/nsheff/.local/lib/python3.5/site-packages (from pandas->piper>=0.12.1->refgenie) (1.16.2)
Requirement already satisfied: pytz>=2011k in /home/nsheff/.local/lib/python3.5/site-packages (from pandas->piper>=0.12.1->refgenie) (2018.9)
Requirement already satisfied: python-dateutil>=2.5.0 in /home/nsheff/.local/lib/python3.5/site-packages (from pandas->piper>=0.12.1->refgenie) (2.8.0)
Requirement already satisfied: certifi>=2017.4.17 in /home/nsheff/.local/lib/python3.5/site-packages (from requests->refgenconf>=0.4.0->refgenie) (2019.3.9)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /home/nsheff/.local/lib/python3.5/site-packages (from requests->refgenconf>=0.4.0->refgenie) (1.24.3)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /home/nsheff/.local/lib/python3.5/site-packages (from requests->refgenconf>=0.4.0->refgenie) (3.0.4)
Requirement already satisfied: idna<2.9,>=2.5 in /home/nsheff/.local/lib/python3.5/site-packages (from requests->refgenconf>=0.4.0->refgenie) (2.8)
Requirement already satisfied: six>=1.5 in /home/nsheff/.local/lib/python3.5/site-packages (from python-dateutil>=2.5.0->pandas->piper>=0.12.1->refgenie) (1.12.0)
WARNING: You are using pip version 19.2.3, however version 19.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Initializing refgenie genome configuration
Wrote new refgenie genome configuration file: peppro_example/refgenie.yaml
'hs38d1/fasta:default' archive size: 1.7MB
Downloading URL: http://staging.refgenomes.databio.org/v2/asset/hs38d1/fasta/archive
hs38d1/fasta:default: 1.74MB [00:00, 28.8MB/s]
Download complete: /home/nsheff/code/bulker/docs_jupyter/peppro_example/hs38d1/fasta__default.tgz
Extracting asset tarball and saving to: /home/nsheff/code/bulker/docs_jupyter/peppro_example/hs38d1/fasta/default
Default tag for 'hs38d1/fasta' set to: default
'hs38d1/bowtie2_index:default' archive size: 8.8MB
Downloading URL: http://staging.refgenomes.databio.org/v2/asset/hs38d1/bowtie2_index/archive
hs38d1/bowtie2_index:default: 9.22MB [00:00, 21.1MB/s]
Download complete: /home/nsheff/code/bulker/docs_jupyter/peppro_example/hs38d1/bowtie2_index__default.tgz
Extracting asset tarball and saving to: /home/nsheff/code/bulker/docs_jupyter/peppro_example/hs38d1/bowtie2_index/default
Default tag for 'hs38d1/bowtie2_index' set to: default
bowtie2 -x `refgenie seek rCRSd/bowtie2_index` raw_reads/SRR11773889.fastq -S aligned.sam
742462 reads; of these:
742462 (100.00%) were unpaired; of these:
742461 (100.00%) aligned 0 times
0 (0.00%) aligned exactly 1 time
1 (0.00%) aligned >1 times
0.00% overall alignment rate
bowtie2 -x `refgenie seek t7/bowtie2_index` raw_reads/SRR11773889.fastq -S aligned.sam
742462 reads; of these:
742462 (100.00%) were unpaired; of these:
49655 (6.69%) aligned 0 times
689848 (92.91%) aligned exactly 1 time
2959 (0.40%) aligned >1 times
93.31% overall alignment rate
This pipeline also requires reference genome assets that are managed by refgenie. If you don't already have an initialized refgenie config, you can easily initialize one like this:
Now execute the example code to run it.
./peppro_example/peppro/pipelines/peppro.py \
--sample-name test \
--genome hs38d1 \
--input peppro_example/peppro/examples/data/test_r1.fq.gz \
--single-or-paired single \
-O peppro_example/output/
### Pipeline run code and environment:
* Command: `./peppro_example/peppro/pipelines/peppro.py --sample-name test --genome hs38d1 --input peppro_example/peppro/examples/data/test_r1.fq.gz --single-or-paired single -O peppro_example/output/`
* Compute host: puma
* Working dir: /home/nsheff/code/bulker/docs_jupyter
* Outfolder: /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/
* Pipeline started at: (10-21 11:29:03) elapsed: 0.0 _TIME_
### Version log:
* Python version: 2.7.12
* Pypiper dir: `/home/nsheff/.local/lib/python2.7/site-packages/pypiper`
* Pypiper version: 0.12.0
* Pipeline dir: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/peppro/pipelines`
* Pipeline version: 0.8.1
* Pipeline hash: 05095a2cc78a2e210916f215dd0828940b894a6f
* Pipeline branch: * dev
* Pipeline date: 2019-10-21 11:21:23 -0400
### Arguments passed to pipeline:
* `TSS_name`: `None`
* `adapter`: `fastp`
* `anno_name`: `None`
* `complexity`: `True`
* `config_file`: `peppro.yaml`
* `cores`: `1`
* `coverage`: `False`
* `dedup`: `seqkit`
* `dirty`: `False`
* `ensembl_gene_body`: `None`
* `ensembl_tss`: `None`
* `exon_name`: `None`
* `force_follow`: `False`
* `genome_assembly`: `hs38d1`
* `input`: `['peppro_example/peppro/examples/data/test_r1.fq.gz']`
* `input2`: `None`
* `intron_name`: `None`
* `keep`: `False`
* `logdev`: `False`
* `max_len`: `30`
* `mem`: `4000`
* `new_start`: `False`
* `no_fifo`: `False`
* `output_parent`: `peppro_example/output/`
* `paired_end`: `False`
* `parts`: `4`
* `pre_name`: `None`
* `prealignments`: `[]`
* `protocol`: `pro`
* `recover`: `False`
* `sample_name`: `test`
* `scale`: `False`
* `silent`: `False`
* `single_or_paired`: `single`
* `sob`: `False`
* `testmode`: `False`
* `trimmer`: `seqtk`
* `umi_fastp`: `False`
* `umi_len`: `0`
* `verbosity`: `None`
----------------------------------------
Some assets are not found. You can update your REFGENIE config file or point directly to the file using the noted command-line arguments:
Assets not existing: refgene_tss (--TSS-name), ensembl_tss (--pi-tss), ensembl_gene_body (--pi-body), refgene_pre_mRNA (--pre-name), feat_annotation (--anno-name), refgene_exon (--exon-name), refgene_intron (--intron-name)
Local input file: peppro_example/peppro/examples/data/test_r1.fq.gz
> `File_mb` 1.02 PEPPRO _RES_
> `Read_type` single PEPPRO _RES_
> `Genome` hs38d1 PEPPRO _RES_
Detected PRO input
### Merge/link and fastq conversion: (10-21 11:29:03) elapsed: 0.0 _TIME_
Number of input file sets: 1
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/raw/test.fastq.gz`
> `ln -sf /home/nsheff/code/bulker/docs_jupyter/peppro_example/peppro/examples/data/test_r1.fq.gz /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/raw/test.fastq.gz` (20905)
<pre>
</pre>
Command completed. Elapsed time: 0:00:00. Running peak memory: 0.0GB.
PID: 20905; Command: ln; Return code: 0; Memory used: 0.0GB
Local input file: '/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/raw/test.fastq.gz'
Found .fastq.gz file
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/fastq/test_R1.fastq`
> `gzip -f -d -c /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/raw/test.fastq.gz > /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/fastq/test_R1.fastq` (20907)
<pre>
</pre>
Command completed. Elapsed time: 0:00:00. Running peak memory: 0.002GB.
PID: 20907; Command: gzip; Return code: 0; Memory used: 0.002GB
> `Raw_reads` 12500 PEPPRO _RES_
> `Fastq_reads` 12500 PEPPRO _RES_
['/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/raw/test.fastq.gz']
### FASTQ processing: (10-21 11:29:03) elapsed: 0.0 _TIME_
You set adapter arg to 'fastp' but you must select 'cutadapt' for single end data. Overriding.
> `cutadapt --version`
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/fastq/test_R1_processed.fastq`
> `( cutadapt -j 1 -m 18 -a TGGAATTCTCGGGTGCCAAGG /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/fastq/test_R1.fastq | seqtk trimfq -b 0 -L 30 - | seqtk seq -r - > /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/fastq/test_R1_processed.fastq ) 2> /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/cutadapt/test_cutadapt.txt` (20985)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.057GB.
PID: 20985; Command: ; Return code: 0; Memory used: 0.057GB
Evaluating read trimming
> `Trimmed_reads` 11880 PEPPRO _RES_
> `Trim_loss_rate` 4.96 PEPPRO _RES_
### Plot adapter insertion distribution (10-21 11:29:06) elapsed: 2.0 _TIME_
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/cutadapt/test_adapter_insertion_distribution.pdf`
> `Rscript ./peppro_example/peppro/tools/PEPPRO.R cutadapt -i /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/cutadapt/test_cutadapt.txt -o /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/cutadapt` (21191)
<pre>
Error: Install the "PEPPROr" R package before proceeding.
i.e. devtools::install_github("databio/peppro", subdir="PEPPROr")
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.057GB.
PID: 21191; Command: Rscript; Return code: 0; Memory used: 0.019GB
> `Adapter insertion distribution` cutadapt/test_adapter_insertion_distribution.pdf Adapter insertion distribution cutadapt/test_adapter_insertion_distribution.png PEPPRO _OBJ_
### Prealignments (10-21 11:29:07) elapsed: 1.0 _TIME_
You may use `--prealignments` to align to references before the genome alignment step. See docs.
### Map to genome (10-21 11:29:07) elapsed: 0.0 _TIME_
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam`
> `bowtie2 -p 1 --very-sensitive -X 2000 --rg-id test -x /home/nsheff/code/bulker/docs_jupyter/peppro_example/hs38d1/bowtie2_index/default/hs38d1 -U /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/fastq/test_R1_processed.fastq | samtools view -bS - -@ 1 | samtools sort - -@ 1 -T /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/tmpjLFtWP -o /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_temp.bam` (21283,21287,21313)
<pre>
11880 reads; of these:
11880 (100.00%) were unpaired; of these:
10476 (88.18%) aligned 0 times
372 (3.13%) aligned exactly 1 time
1032 (8.69%) aligned >1 times
11.82% overall alignment rate
</pre>
Command completed. Elapsed time: 0:00:02. Running peak memory: 0.057GB.
PID: 21287; Command: samtools; Return code: 0; Memory used: 0.018GB
PID: 21283; Command: bowtie2; Return code: 0; Memory used: 0.018GB
PID: 21313; Command: samtools; Return code: 0; Memory used: 0.018GB
> `samtools view -q 10 -b -@ 1 -U /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_fail_qc.bam /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_temp.bam > /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam` (21527)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.057GB.
PID: 21527; Command: samtools; Return code: 0; Memory used: 0.018GB
> `samtools depth /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam | awk '{counter++;sum+=$3}END{print sum/counter}'`
> `Mapped_reads` 1404 PEPPRO _RES_
> `QC_filtered_reads` 989.0 PEPPRO _RES_
> `Aligned_reads` 415 PEPPRO _RES_
> `Alignment_rate` 3.49 PEPPRO _RES_
> `Total_efficiency` 3.32 PEPPRO _RES_
> `Read_depth` 1.26 PEPPRO _RES_
### Compress all unmapped read files (10-21 11:29:11) elapsed: 4.0 _TIME_
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_temp.bam.bai`
> `samtools index /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_temp.bam` (21800)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.057GB.
PID: 21800; Command: samtools; Return code: 0; Memory used: 0.019GB
> `samtools idxstats /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_temp.bam | grep -we 'chrM' -we 'chrMT' -we 'M' -we 'MT' -we 'rCRSd' -we 'rCRSd_3k'| cut -f 3`
> `samtools stats /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam | grep '^SN' | cut -f 2- | grep 'maximum length:' | cut -f 2-`
### Calculate NRF, PBC1, and PBC2 (10-21 11:29:13) elapsed: 2.0 _TIME_
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam.bai`
> `samtools index /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam` (22028)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.057GB.
PID: 22028; Command: samtools; Return code: 0; Memory used: 0.019GB
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/test_bamQC.tsv`
> `./peppro_example/peppro/tools/bamQC.py --silent -i /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam -c 1 -o /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/test_bamQC.tsv --silent` (22096)
<pre>
</pre>
Command completed. Elapsed time: 0:00:03. Running peak memory: 0.068GB.
PID: 22096; Command: ./peppro_example/peppro/tools/bamQC.py; Return code: 0; Memory used: 0.068GB
> `awk '{ for (i=1; i<=NF; ++i) { if ($i ~ "NRF") c=i } getline; print $c }' /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/test_bamQC.tsv`
> `awk '{ for (i=1; i<=NF; ++i) { if ($i ~ "PBC1") c=i } getline; print $c }' /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/test_bamQC.tsv`
> `awk '{ for (i=1; i<=NF; ++i) { if ($i ~ "PBC2") c=i } getline; print $c }' /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/test_bamQC.tsv`
> `NRF` 0.89 PEPPRO _RES_
> `PBC1` 0.95 PEPPRO _RES_
> `PBC2` 30.75 PEPPRO _RES_
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_unmap.bam`
> `samtools view -b -@ 1 -f 4 /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_temp.bam > /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_unmap.bam` (22108)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.068GB.
PID: 22108; Command: samtools; Return code: 0; Memory used: 0.019GB
> `samtools view -c -f 4 -@ 1 /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_temp.bam`
> `Unmapped_reads` 10476.0 PEPPRO _RES_
### Split BAM by strand (10-21 11:29:18) elapsed: 5.0 _TIME_
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_plus.bam`,`/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_minus.bam`
> `samtools view -bh -F 20 /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam > /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_plus.bam` (22257)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.068GB.
PID: 22257; Command: samtools; Return code: 0; Memory used: 0.019GB
> `samtools view -bh -f 16 /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam > /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_minus.bam` (22322)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.068GB.
PID: 22322; Command: samtools; Return code: 0; Memory used: 0.019GB
Skipping TSS -- TSS enrichment requires TSS annotation file: refgene_tss
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/chr_order.txt`
> `samtools view -H /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam | grep 'SN:' | awk -F':' '{print $2,$3}' | awk -F' ' -v OFS=' ' '{print $1,$3}' > /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/chr_order.txt` (22386,22391,22395,22403)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.068GB.
PID: 22386; Command: samtools; Return code: 0; Memory used: 0.018GB
PID: 22395; Command: awk; Return code: 0; Memory used: 0.004GB
PID: 22391; Command: grep; Return code: 0; Memory used: 0.001GB
PID: 22403; Command: awk; Return code: 0; Memory used: 0.004GB
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/chr_keep.txt`
> `cut -f 1 /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/chr_order.txt > /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/QC_hs38d1/chr_keep.txt` (22455)
<pre>
</pre>
Command completed. Elapsed time: 0:00:00. Running peak memory: 0.068GB.
PID: 22455; Command: cut; Return code: 0; Memory used: 0.0GB
Skipping PI -- Pause index requires 'TSS' and 'gene body' annotation files: ensembl_tss and ensembl_gene_body
Skipping FRiP -- Fraction of reads in pre-mRNA requires pre-mRNA annotation file: refgene_pre_mRNA
### Calculate fraction of reads in features (FRiF) (10-21 11:29:20) elapsed: 2.0 _TIME_
### Plot FRiF (10-21 11:29:20) elapsed: 0.0 _TIME_
> `samtools view -@ 1 -q 10 -c -F4 /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_plus.bam`
> `samtools view -@ 1 -q 10 -c -F4 /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_minus.bam`
### Produce bigWig files (10-21 11:29:21) elapsed: 1.0 _TIME_
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/signal_hs38d1/test_plus_body_0-mer.bw`
> `samtools index /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_plus.bam` (22609)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.068GB.
PID: 22609; Command: samtools; Return code: 0; Memory used: 0.02GB
> `./peppro_example/peppro/tools/bamSitesToWig.py -i /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_plus.bam -c /home/nsheff/code/bulker/docs_jupyter/peppro_example/hs38d1/fasta/default/hs38d1.chrom.sizes -o /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/signal_hs38d1/test_plus_body_0-mer.bw -p 1 --variable-step --tail-edge` (22674)
<pre>
Registering input file: '/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_plus.bam'
Temporary files will be stored in: 'tmp_test_plus_cuttrace_qU7o0L'
Processing with 1 cores...
stdin is empty of data
Reduce step (merge files)...
Merging 121 files into output file: '/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/signal_hs38d1/test_plus_body_0-mer.bw'
Couldn't open tmp_test_plus_cuttrace_qU7o0L/JTFH01000350.1.txt.bw
</pre>
Command completed. Elapsed time: 0:01:19. Running peak memory: 0.068GB.
PID: 22674; Command: ./peppro_example/peppro/tools/bamSitesToWig.py; Return code: 0; Memory used: 0.057GB
Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/signal_hs38d1/test_minus_body_0-mer.bw`
> `samtools index /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_minus.bam` (31803)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.068GB.
PID: 31803; Command: samtools; Return code: 0; Memory used: 0.019GB
> `./peppro_example/peppro/tools/bamSitesToWig.py -i /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_minus.bam -c /home/nsheff/code/bulker/docs_jupyter/peppro_example/hs38d1/fasta/default/hs38d1.chrom.sizes -o /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/signal_hs38d1/test_minus_body_0-mer.bw -p 1 --variable-step --tail-edge` (31872)
<pre>
Registering input file: '/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_minus.bam'
Temporary files will be stored in: 'tmp_test_minus_cuttrace_ljeMBw'
Processing with 1 cores...
Reduce step (merge files)...
Merging 124 files into output file: '/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/signal_hs38d1/test_minus_body_0-mer.bw'
</pre>
Command completed. Elapsed time: 0:01:23. Running peak memory: 0.068GB.
PID: 31872; Command: ./peppro_example/peppro/tools/bamSitesToWig.py; Return code: 0; Memory used: 0.057GB
Starting cleanup: 5 files; 4 conditional files for cleanup
Cleaning up flagged intermediate files. . .
Cleaning up conditional list. . .
### Pipeline completed. Epilogue
* Elapsed time (this run): 0:03:01
* Total elapsed time (all runs): 0:02:59
* Peak memory (this run): 0.0675 GB
* Pipeline completed time: 2019-10-21 11:32:04
The pipeline has now completed successfully and we can explore the results:
tree peppro_example/output/
peppro_example/output/
└── test
├── aligned_hs38d1
│ ├── test_fail_qc.bam
│ ├── test_minus.bam
│ ├── test_minus.bam.bai
│ ├── test_plus.bam
│ ├── test_plus.bam.bai
│ ├── test_sort.bam
│ ├── test_sort.bam.bai
│ └── test_unmap.bam
├── cutadapt
│ └── test_cutadapt.txt
├── fastq
├── fastqc
├── objects.tsv
├── PEPPRO_cleanup.sh
├── PEPPRO_commands.sh
├── PEPPRO_completed.flag
├── PEPPRO_log.md
├── PEPPRO_profile.tsv
├── QC_hs38d1
│ └── test_bamQC.tsv
├── raw
│ └── test.fastq.gz -> /home/nsheff/code/bulker/docs_jupyter/peppro_example/peppro/examples/data/test_r1.fq.gz
├── signal_hs38d1
│ └── test_minus_body_0-mer.bw
└── stats.tsv
8 directories, 19 files
We've successfully run a complete pipeline without having to install any of the software that runs the commands in the workflow. We're also able to interactively explore the environment that ran the workflow.
Conclusion
That's basically it. If you're a workflow developer, all you need to do is write your own manifest and distribute it with your workflow; in 3 lines of code, users will be able to run your workflow using modular containers, using the container engine of their choice.