ocms_16s dada2 - amplicon sequencing analysis

Author:: Nick Ilott, Sandi Yen
Release:: $Id$
Date:: Aug 21, 2023
Tags:: Python

Purpose

This is a pipeline that is built using the cgat-core framework. The purpose of the pipeline is to run dada2 processing of amplicon sequencing data either on a compute cluster or locally. The pipeline consists of a number of wrapper scripts in R that are executed on the commandline and as such there is no requirement for R coding by the user. The hope is that the pipeline provides an accessible and user-friendly interface to produce reproducble results from an amplicon sequencing study.

A useful feature of this pipeline is that it outputs an sqlite database that serves as input to OCMSlooksy, an R/Shiny application designed for downstream analysis and visualisation of amplicon sequencing data.

You should familiarise yourself with the dada2 workflow before running the pipeline to ensure that you understand the parameterisation.

Installation

The pipeline depends on having a number of python and R libraries installed. There are various ways that you can install and use the pipeline and these are outlined below.

conda

It is possible to set up a conda environment that is compatible with our dada2 pipeline that has all of the relevant dependencies installed. The following steps outline how to install the pipeline using conda to manage the dependencies.

Download and install conda if you don’t already have it.

Clone the OCMS_16S repository:

git clone git@github.com:OxfordCMS/OCMS_16S.git

Change into OCMS_16S directory:
```
cd OCMS_16S
```

Create ocms_16s conda environment:

conda env create -f envs/environment_ocms_16s.yaml

This can often be slow as conda tries to solve environment. If this happens you should try using mamba instead:

conda install -c conda-forge mamba

mamba env create -f envs/environment_ocms_16s.yaml

Activate the newly created environment:
```
conda activate ocms_16s
```
Install ocms_16s dada2:
```
python setup.py install
```

Python virtual environment

You can use a python virtual environment to install python dependencies whilst relying on system-wide software for non-python dependencies (R etc.).

Create a python virtual environment with your chosen python install (call the environment what you like. We call it ocms_pipeline as an example):
```
python -m venv ocms_pipeline
```
Activate virtualenv:
```
source ocms_pipeline/bin/activate
```

Install cgat-core:

# first install some dependencies
pip install apsw gevent numpy==1.19.5 pandas paramiko pep8 pytest pytest-pep8 drmaa pyyaml ruffus setuptools six sqlalchemy

pip install cgatcore

Make sure you have other R dependencies installed:

dplyr
ggplot2
gplots
gridextra
gtools
rmarkdown
optparse
plotly
rcolorbrewer
rcppparallel
reshape

You can install these one-by-one or using the script, install_r_packages.R, that is present in the top-level directory of the OCMS_16S repository. Start R and type:

source("install_r_packages.R")

Install dada2 as described here

Install OCMS_16S:
```
pip install ocms-16s
```

Running the pipeline

The pipeline runs using fastq files as input. It also requires parameters to be set that will be passed to the dada2 R scripts.

Input files

The input is a directory of fastq files. These should be placed in the directory in which you wish to run the pipeline. They must be of the format <name>.fastq.1.gz for single-end data and two files <name>.fastq.1.gz and <name>.fastq.2.gz for paired-end data.

For the pipeline to run succesfully you will also need to have downloaded relevant dada2 databases and point to them in the pipeline.yml parameters file as described in the next section.

Parameterisation

The parameters for dada2 processing are specified in the pipeline.yml file. To create this file, move into the directory containing the fastq files that you wish to process and type:

ocms_16s dada2 config

This will create the pipeline.yml file in the current working directory which you can edit using your favourite text editor. The parameters are provided in a standard yaml format as outlined below:

# specify whether data are paired or single end. The
# pipeline will pick up whether this is true but being
# explicit here is helpful
paired: 1

# dada2 parameters
trim:

    # parameters used for trimming reads. If the data are
    # paired-end then you need to specify 2 values for
    # maxee, truncLen and trimLeft. These parameters must be specified
    maxn: 0
    maxee: 2,2
    truncq: 2
    trunclen: 250,160
    trimleft: 0,0

sample_inference:

    # parameters for sample inference. This includes
    # error learning, de-replication, merging (if paired) and
    # sample inference.

    # number of reads to use (per sample) to estimate error
    # model
    nbases: 10000000

    # additional options
    options: ''

taxonomy:

    memory: 10G

    # assigning taxonomy
    taxonomy_file: RefSeq-RDP16S_v2_May2018.fa.gz

    # This is the file that is used for the addSpecies function in
    # dada2 for exact matching and species assignment. It must therefore
    # be derived from the same database used as taxonomy_file above
    species_file: silva_species_assignment_v132.fa.gz

report:
    # whether to run diagnostics report. This is only necessary if after the
    # main report is built you want to get into more regarding the specifics of
    # how dada2 processed sequences. Specify as 1 if you wish to run it
    diagnostics:

database:
    # name of the output database. This is a database that is built to
    # be compatible with the OCMSlooksy.
    name: output_db

The majority of the parameters correspond to the dada2 arguments to the various functions in the dada2 package.

Getting help on pipeline tasks

The pipeline is run using a simple commandline interface. You can view the tasks that are going to be run by using the ‘show’ command. In the directory that you plan to run the pipeline:

ocms_16s dada2 show full

This will print out the tasks that are going to be run:

----------------------------------------------------
Tasks which will be run:

Task = "mkdir('tree.dir')   before pipeline_dada2.buildTree "
Task = "mkdir('abundance.dir')   before pipeline_dada2.runSampleInference "
Task = "mkdir('filtered.dir')   before pipeline_dada2.filterAndTrim "
Task = 'pipeline_dada2.filterAndTrim'
Task = 'pipeline_dada2.runSampleInference'
Task = 'pipeline_dada2.mergeAbundanceTables'
Task = "mkdir('taxonomy.dir')   before pipeline_dada2.assignTaxonomy "
Task = 'pipeline_dada2.assignTaxonomy'
Task = 'pipeline_dada2.addUniqueIdentifiers'
Task = 'pipeline_dada2.mergeTaxonomyTables'
Task = 'pipeline_dada2.buildDefinitiveTable'
Task = 'pipeline_dada2.buildTree'
Task = "mkdir('taxonomy_abundances.dir')   before pipeline_dada2.splitTableByTaxonomicLevels "
Task = 'pipeline_dada2.splitTableByTaxonomicLevels'
Task = 'pipeline_dada2.full'
________________________________________
# 2021-11-25 21:52:46,850 INFO job finished in 0 seconds at Thu Nov 25 21:52:46 2021 --  1.59  1.52  0.00  0.02 -- 1cae61fa-de0c-4b85-86b0-38dfd964c155

There are often numerous parameters that can be passed to dada2 functions. The most commone parameters that need to be changed are explicitly stated in the pipeline.yml. However additional options can be specified and these are commadline options to the various R scripts. You can view these parameters by running the ‘help’ script. For example:

ocms_16s help --sampleInference

This will provide the possible options that can be passed to the runSampleInference task via the pipeline.yml.

Once you have set the parameters, the pipeline should be simple to run. You can run the pipeline locally or on a compute cluster in order to maximise parallelisation that is afforded by using cgat-core workflow management.

Running the pipeline locally

In general the pipeline can be run using the following command:

ocms_16s dada2 make full -v5 -p100

where -v specifies the verbosity level of the logging output and -p specifies the number of processes you want to lauch per task e.g if you want to process 100 samples then specifiy -p100 and each sample will be processed in parallel and data combined in the final output tables.

If you want to run it locally on a laptop you will need access to a unix-like operating system (e.g. Mac). You must specify the –local flag:

ocms_16s dada2 make full -v5 -p1 --local

specifying -p as the number of processors you have available.

Running the pipeline on a cluster

The best way to maximise the utility of the pipeline is to run it on a high performance cluster - allowing you to parallelise sample processing.To run on a cluster you will have to have a .cgat.yml file in your home directory that specifies the queue manager, queue to use etc. An example is belo:

cluster:
    queue_manager: <slurm|sge|pbstorque>
    parallel_environment: <pe name>
    queue: <queue_name>

You will also need to make sure that the pipeline has access to the drmaa library so it’s best to set this as an environmental variable in your ~/.bashrc:

export DRMAA_LIBRARY_PATH=/<full-path>/libdrmaa.so

Once set up you should be able to run:

ocms_16s dada2 make full -v5 -p100

As the pipeline runs, logging information will be printed to the screen and also saved in the file pipeline.log. This file is useful to inspect if the pipeline crashes and you need to debug.

Building a report

Once the pipeline has finished, there is opportunity to assess the dada2 processing results in an html report by running:

ocms_16s dada2 make build_report

This will build the report, report.dir/report.html which you can inspect.

Transition to OCMSlooksy

OCMSlooksy is an R/Shiny application that enables users to inspect data from this dada2 processing pipeline as well as perform statistical analysis and visualisation. By running:

ocms_16s dada2 make build_db

you will build an sqlite database that contains all of the outputs neccessary to load into OCMSlooksy. The database will be named according to the specification in the pipipeline.yml. In the example above it would be called ‘output_db’ and this would be present in the current working directory.

Other output files

OCMS_16S will also output flat files that can be used for downstream analysis. The main output file of the pipeline is the counts matrix that consists of amplicon sequence variants and their abundance in each sample. The pipeline assigns taxonomy to each ASV and this is incorporated into the ASV name in the resulting file. It is of the form:

test_id	Sample1	Sample2
ASV1:p__phylum1;c__class1;o__order1;f__family1;g__genus1;s__species1	1000	1239
ASV2:p__phylum2;c__class2;o__order2;f__family2;g__genus2;s__species2	500	10
ASV3:p__phylum3;c__class3;o__order3;f__family3;g__genus3;s__species3	1000	2300

This file is created as abundance.dir/taxa_abundances.tsv.

The purpose of this output file is that it can be taken forward in a easy fashion to look at differential abundance using software such as DESeq2 and this will be done on a per ASV level. If you wish to perform analysis on counts that have been summed over taxa at a particular taxonomic level you can use the following output files:

taxonomy_abundances.dir/phylum_abundances.tsv
taxonomy_abundances.dir/class_abundances.tsv
taxonomy_abundances.dir/order_abundances.tsv
taxonomy_abundances.dir/family_abundances.tsv
taxonomy_abundances.dir/genus_abundances.tsv
taxonomy_abundances.dir/species_abundances.tsv

Acknowledgements

This pipeline is based off of a lot of work that has gone before it. It is basically a wrapper for dada2 functionality and so if you use the pipeline in a publication please remember to cite the dada2 paper. The cgat-core framework is of course another important tool that has enabled the development of this pipeline.