Tutorial: Using OCMS_16S to process amplicon sequencing data

Here we show you how to run OCMS_16S dada2 to process 16S rRNA amplicon sequencing data. We are going to use publically available data that were published here. The data were generated from stool samples of either wild type or MMP9 -/- mice that were either treated with dextran sodium sulphate to induce colitis or vehicle. Sequencing was performed on the Illumina MiSeq platform which produced 2 x 250bp reads targetting the V4 region of the 16S rRNA gene. This step-by-step tutorial should help you to get a sense of the pipeline and how you might use it to process your own data.

Step 1: Download tutorial data

We have provided links to raw data in the OCMS_16S repository (OCMS_16S/tutorial). To download the data, create a working directory where you will run the pipeline and download the files:

mkdir ocms_16s_tutorial
cd ocms_16s_tutorial
bash <path-to-OCMS_16S/tutorial/download_files.sh>

This will place the raw fastq files in your working directory.

Step 2: Create and edit configuration file

We need a pipeline.yml file to pass parameters to pipeline tasks. To create this we type:

ocms_16s dada2 config

This produces the pipeline.yml file that we can now edit:

#############################################################################
#############################################################################
#############################################################################
# Configuration file for OCMS_16S dada2.
#
# The parameters and options that are passed to the wrappers around dada2
# functions are the same as those in the dada2 package. For a full list of
# the parameters and verbatim explanation of them (from dada2 package)
# you can use the help functions
#
# e.g. ocms_16s dada2 help --filterAndTrim
#      ocms_16s dada2 help --sampleInference
#      ocms_16s dada2 help --assignTaxonomy
#
#############################################################################
#############################################################################
#############################################################################

# Edit parameters

# specify whether data are paired or single end. The
# pipeline will pick up whether this is true but being
# explicit here is helpful
paired: 1

# dada2 parameters
trim:

    # parameters used for trimming reads. If the data are
    # paired-end then you need to specify 2 values for
    # maxee, truncLen and trimLeft. These parameters must be specified.
    maxn: 0
    maxee: 2,2
    truncq: 2
    trunclen: 150,150
    trimleft: 0,0

sample_inference:

    memory: 10G

    # parameters for sample inference. This includes
    # error learning, de-replication, merging (if paired) and
    # sample inference.

    # number of reads to use (per sample) to estimate error
    # model
    nbases: 10000000

    # additional options
    options: ''

taxonomy:

    memory: 10G

    # assigning taxonomy
    taxonomy_file: RefSeq-RDP16S_v2_May2018.fa.gz

    # This is the file that is used for the addSpecies function in
    # dada2 for exact matching and species assignment. It must therefore
    # be derived from the same database used as taxonomy_file above
    species_file: RefSeq-RDP_dada2_assignment_species.fa.gz

report:
    #   whether to run diagnostics report. This is only necessary if after the
    # main report is built you want to get into more regarding the specifics of
    # how dada2 processed sequences. Specify as 1 if you wish to run it
    diagnostics:

database:
    # name of the output database. This is a database that is built to
    # be compatible with the OCMSlooksy.
    name: output_db

For this example, we specify that the data are paired and we want the final length of both the forward and reverse reads to be 150bp.

The default settings are taken from the defaults used by dada2. If you want an explanation of the parameters for the dada2 steps then you can type for example:

ocms_16s help --filterAndTrim

This will show you the options that are passed from the pipeline.yml to the R scripts along with a description:

-packages/ocms_16S-0.0.1-py3.8.egg/ocms16S/R/dada2_filter_and_trim.R [options]


Options:
        -i INFILE, --infile=INFILE
                input fastq file [default NA]

        -f FILTERED-DIRECTORY, --filtered-directory=FILTERED-DIRECTORY
                directory for filtered fastq files [default filtered]

        -p, --paired
                is it paired-end data [default FALSE]

        -n MAXN, --maxN=MAXN
                maxN parameter [default 0]

        -e MAXEE, --maxEE=MAXEE
                maximum number of expected errors [default 2,2]

        --truncQ=TRUNCQ
                truncate reads at the first instance of a quality score less
                than or equal to truncQ [default 2]

        --truncLen=TRUNCLEN
               truncate  reads  after truncLen bases [default 250,250]

        --trimLeft=TRIMLEFT
               trim left sequence (primers) [default 0,0]

        -h, --help
               Show this help message and exit

We will leave the majority of settings as they are for this example. However, we need to specify annotation files that will be used to assign taxonomic information to the amplicon sequence variants (ASVs) that are produced by dada2. In this example we will download them into our working directory, however you may want to have them somewhere else for using with future data. E.g. to use RefSeq databases do:

wget https://zenodo.org/record/2541239/files/RefSeq-RDP16S_v2_May2018.fa.gz
wget https://zenodo.org/record/2658728/files/RefSeq-RDP_dada2_assignment_species.fa.gz

These are then specified in the pipeline.yml as above.

Step 3: Run the pipeline

Once you are happy with the parameterisation, you can run the pipeline:

ocms_16s dada2 make full -v5 -p40

Here we are running the pipeline using 80 processors as this is the number of samples we have - they will be processed in parallel on the cluster. If you are running this on a laptop make sure to specify the –local flag:

ocms_16s dada2 make full -v5 -p1 --local

The -v5 specifies the verbosity level of the logging information. At 5 this will be very verbose and useful for debugging. you can check how the pipeline is progressing by viewing the pipeline.log file that is created in the working directory:

cat pipeline.log

When the pipeline has finished running, the log file will look like this:

tail pipeline.log

...

2020-02-06 12:08:28,597 INFO main task - Task enters queue = 'full'
2020-02-06 12:08:28,598 INFO main task -     Job no function to check if up-to-date
2020-02-06 12:08:28,799 INFO main task -     Job completed
2020-02-06 12:08:28,799 INFO main control - {"task": "'full'", "task_status": "completed", "task_total": 0, "task_completed": 0, "task_completed_percent": 0}
2020-02-06 12:08:28,799 INFO main task - Completed Task = 'full'
2020-02-06 12:08:28,801 INFO main experiment - job finished in 3002 seconds at Thu Feb  6 12:08:28 2020 -- 66.94 45.21  5.67  9.90 -- 8e55fc3a-59c9-4f94-b014-1062ee84c9bc

Step 4: Build the report

You can then build the report to inspect the performance of the processing:

ocms_16s dada2 make build_report

This will produce the file report.dir/report.html that you can view in your browser.

Step 5: Downstream analysis with OCMSlooksy

OCMSlooksy is an R/Shiny application that will take the output from OCMS_16S dada2 and enable the user to inspect the parameters of the dada2 run, the processing results as well as perform downstream viosualisation and statistical analyses. To format the outputs from OCMS_16S dada2 for OCMSlooksy you can run:

ocms_16s dada2 make build_db

In the example above this will create the SQLite database, ‘output_db’, that serves as the main input to OCMSlooksy.