Development of a 16S rRNA gene sequencing pipeline for the MiSeq

http://goo.gl/KATDw8

Patrick D. Schloss, PhD (@PatSchloss)
The Department of Microbiology & Immunology at The University of Michigan

mothur

The Small Subunit rRNA genes

Evolutionary chronometer
Ubiquitous across all of life (16S: B/A and 18S: E)
Highly conserved (good and bad)
Low rates of horizontal gene transfer
Most deposited gene into GenBank
Conserved primers allow us to target variable regions within the gene

How do we analyze 16S rRNA genes?

Classification based (supervised)
- Limited by database, length of reads, conflicting taxonomies
- What value are names?
- Sequencing errors not as critical
Operational taxonomic units (OTUs; unsupervised)
- Assign sequences to OTUs based on similarity to each other
- Computationally demanding
- Sensitive to sequencing errors
Should not have to justify the need high quality data

Extensive resources

What we developed

Stand-alone, platform-independent, open source, community supported (-ive) software package
Builds upon popularity of DOTUR, which was published in 2005 (1473 citations; as of 8/13/14)
The design is very PAUP*/R-ish

How are we doing?

AltMetrics:

2952 mailing list recipients
313 wiki articles
915 forum members
7486 forum posts for 2252 topics

Sequencing the 16S rRNA gene

Which factors to consider before choosing a method?

Data quality
Number of reads
Sequence length
Cost

This is not genome sequencing

Traditional genomics approach builds contigs to "solve" errors
Sequencing 16S rRNA gene has limited ability to resequence individual bases
Sequencing variants represent novel biodiversity
Would like to have as fine-scale a resolution as possible

plot of chunk error

But first... sequencing with other platforms

Titanium: 250 bp reads with an 0.02% error rate
GSFLX+: 600 bp reads with an 0.02% error rate
Others...
- PacBio: Full length reads, but error rates of 0.5%
- IonTorrent: Disaster - short and high error rates

Illumina-based approaches

Single read sequencing (Ugh...)
Sequence through primer and index (Illumina's protocol)
- Waste base calls
- Requires two PCR steps to get adapters on (chimeras?)
Caporasso/Knight (EMP/AG/etc.)
- Single index - large number of long primers
- Generally being used with HiSeq (?)
Kozich/Schloss
- Dual index - considerably fewer primers required
- Wet-lab SOP available

Questions

Does a dual-index strategy work?
Can we create a pipeline like we have for 454 that results in low error rate data?
Is it possible to move beyond the Knight approach now that reads have gone from 150 to 250 to 300 bp to sequence longer regions?
What are the principles of primer design so we can amplify other regions and genes?

Dual indexing makes things easier

# Samples	#F Primers	#R Primers	Reads/Sample*
96	12	8	156250
192	12	16	78125
384	24	16	39062
1536	48	32	9766

* assumes: 15x10⁶ reads per run

Overlapping regions within the 16S rRNA gene

Experimental design

Amplified V34, V4, and V45 from mock community DNA and DNA isolated from human feces, mouse feces, and soil
All samples amplified in triplicate with different barcodes (12 total samples)
Have used same design/samples with other sequencing platforms (454, PacBio, IonTorrent)
Sequencing runs...
- Cluster density ~ 800 k/mm2
- %PhiX ~ 5-10%

Mock community

DNA from 21 bacterial isolates with genome sequences:
Acinetobacter baumannii, Actinomyces odontolyticus, Bacillus cereus, Bacteroides vulgatus, Clostridium beijerinckii, Deinococcus radiodurans, Enterococcus faecalis, Escherichia coli, Helicobacter pylori, Lactobacillus gasseri, Listeria monocytogenes, Neisseria meningitidis, Porphyromonas gingivalis, Propionibacterium acnes, Pseudomonas aeruginosa, Rhodobacter sphaeroides, Staphylococcus aureus, Staphylococcus epidermidis, Streptococcus agalactiae, Streptococcus mutans, Streptococcus pneumoniae (Available through BEI: v3.1, HM-278D)
Since we know the true 16S rRNA gene sequences we can measure the error rates
Can remove chimeras a priori

Error profile

Basic error correction

Trust MiSeq software to call indices
Form contigs
- If there is a mismatch, the base with a qual score that is greater than or equal to a threshold wins
- If it is less than threshold the base becomes an N and the read will ultimately get rejected
- Contig length must make “sense”
Remove anything with…
- Ambiguous base calls
- Homopolymer length greater than 8
- Doesn’t map to correct region

Error rates

The V4 region yields the best error rates and retains the most reads
Require Q to be at least 6 points higher to break a mismatch

Error rate summary

Region	Length	Error rate after making contigs (%)	Error rate after pre.cluster (%)
V3-V4	429	0.41	0.26
V4	253	0.04	0.01
V4-V5	375	0.57	0.26
"V3-V5"	260	0.06	0.02

"V3-V5": Based on our 454 SOP using PyroNoise

Richness

Sample	V3-V4	V4	V4-V5
Mock - PC(20)	NA	21.4	83.5
Mock - actual	NA	32.4	126.2
Mouse	NA	135.7	145.7
Human	NA	121.6	187.8
Soil	NA	1271.3	1221.2

All rarefied to 5,000 reads

(XX) – number of OTUs w/ no errors

Mock-PC has no chimeras

"Richness" is correlated with sequencing depth / crap
Error rates have a significant impact on number of observed OTUs

Error rate with new chemsitry

Region	Length	250PE error rate after making contigs (%)	300PE error rate after pre.cluster (%)
V3-V4	429	0.41	0.80
V4	253	0.04	0.04
V4-V5	375	0.57	0.90
"V3-V5"	260	0.06	NA

"V3-V5": Based on our 454 SOP using PyroNoise

Meh.

Quality scores with the V3 chemistry

Note that the quality scores crap out around cycle 500
Similar results with varying cluster density/%PhiX

The MiSeq mothur SOP

method	what it does...	reference
`make.contigs`	assemble reads	Kozich et al. 2014
`screen.seqs`	remove bad reads
`unique.seqs`	collapse duplicate reads
`count.seqs`	count frequency of each sequence in each sample
`align.seqs`	align sequences to a reference	Schloss 2009, Schloss 2010
`filter.seqs`	trim reads to same alignment space	Schloss 2013
`pre.cluster`	remove one and two offs	Schloss et al. 2011
`chimera.uchime`	identify and remove chimeras	Edgar et al. 2011
`classify.seqs`	classify sequences against a database	Wang et al. 2009
`remove.lineage`	remove contaminants
`cluster.split`	assign sequences to OTUs	Schloss & Westcott 2011
`phylotype`	assign sequences to phylotypes
`make.shared`	generate sample by OTU frequency table
`clearcut`	generate a neighbor-joining tree	Sheneman et al. 2006

Concluding the MiSeq pipeline

Sequence reads must fully overlap with each other to get good error remediation
If they don't...
- Inflated number of unique reads / OTUs
- Samples will look more different than they really are
- Significant computational hurdles await
- Cheaper to throw data away and resquence correctly than to buy sufficient RAM/time
Staggered primers and high %PhiX aren't necessary
The V3 MiSeq kit isn't going to cut it for longer reads

Acknowledgements

Schloss Lab
- Sarah Westcott
- Jim Kozich
Sequencing centers
- Illumina (no $ to SchlossLab)
- 454 (free seq. runs)
- U of M Seq. Core
NIH: R01HG005975, R01GM099514
http://goo.gl/KATDw8

You must enable Javascript to view this page properly.

Development of a 16S rRNA gene sequencing pipeline for the MiSeq

http://goo.gl/KATDw8

mothur

The Small Subunit rRNA genes

How do we analyze 16S rRNA genes?

Extensive resources

What we developed

How are we doing?

Sequencing the 16S rRNA gene

Which factors to consider before choosing a method?

This is not genome sequencing

But first... sequencing with other platforms

Illumina-based approaches

Questions

Dual indexing makes things easier

* assumes: 15x106 reads per run

Overlapping regions within the 16S rRNA gene

Experimental design

Mock community

Error profile

Basic error correction

Error rates

Error rate summary

"V3-V5": Based on our 454 SOP using PyroNoise

Richness

All rarefied to 5,000 reads

(XX) – number of OTUs w/ no errors

Mock-PC has no chimeras

Error rate with new chemsitry

"V3-V5": Based on our 454 SOP using PyroNoise

Meh.

Quality scores with the V3 chemistry

The MiSeq mothur SOP

Concluding the MiSeq pipeline

Acknowledgements

* assumes: 15x10⁶ reads per run