Identifying components of mixed and contaminated soil samples by detecting specific signatures of control 16S rRNA libraries

Bibliographic Details
Published in:	Ecological Indicators Vol. 94; pp. 446-453
Main Authors:	Igolkina, A.A., Grekhov, G.A., Pershina, E.V., Samosorov, G.G., Leunova, V.M., Semenov, A.N., Baturina, O.A., Kabilov, M.R., Andronov, E.E.
Format:	Journal Article
Language:	English
Published:	Elsevier Ltd 01-11-2018
Subjects:	16S rRNA; Contaminated soil; Mixed samples; Soil signature; Source identification; Suffix array

Description
Summary:	•A supervised NGS-based algorithm to identify the sources of soil samples is proposed.
•The algorithm extracts characteristic molecular markers from a control set of samples.
•It then quantifies the signals of these markers in a test sample to indicate its sources.
•The algorithm accurately identified the sources of mixed and contaminated soil samples.
•The maximal information content approach can be integrated into existing pipelines.

Identifying which control soils are components of a test sample that is mixed, contaminated, improperly stored or damaged is an important problem in soil forensics, soil monitoring and other types of soil analysis. The problem reduces to determining whether two soil samples, test and control, have the same origin or source. Here, we propose an algorithm that addresses this problem using 16S rRNA gene libraries of test and control soil samples and does not rely on OTU clustering. The algorithm first extracts the Library-SPECific sets of sequences (LSPECs) for alternative control libraries and then quantifies the signals of these LSPECs in a test library. Extensive use of suffix arrays for sequence comparison accelerates the algorithm significantly. To evaluate the performance of the algorithm, we collected a control set of 29 soil samples and created two test sets (real and simulated) containing mixed, contaminated and extremely small single-source soil samples (the last of these resemble forensic specimens). We then carried out 16S rRNA amplicon sequencing of total soil DNA isolated from both test and control soil samples. The algorithm successfully identified the origin of all single-source soil samples and the compositions of mixed samples, including both lightly and heavily contaminated ones. The algorithm also remained robust when the control set was enlarged from 9 to 29 samples. We believe the proposed algorithm is suitable for identification problems of varying complexity and is flexible enough to handle other molecular markers and microbiological samples from non-soil sources.
ISSN:	1470-160X 1872-7034
DOI:	10.1016/j.ecolind.2018.06.060
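
The two steps described in the summary (extracting library-specific sequences from the controls, then quantifying their signal in a test library) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes exact matching of whole reads using plain sets instead of suffix arrays, it does not model the maximal information content approach, and the function names `library_specific_sets` and `marker_signals` are hypothetical.

```python
from collections import Counter


def library_specific_sets(controls):
    """Return, for each control library, the sequences found in no other control.

    controls: dict mapping a library name to an iterable of read sequences.
    Assumption: a sequence is "specific" if it is absent (exact match)
    from every other control library.
    """
    unique = {name: set(seqs) for name, seqs in controls.items()}
    specific = {}
    for name, seqs in unique.items():
        # Union of all sequences seen in the other control libraries.
        others = set().union(*(s for n, s in unique.items() if n != name))
        specific[name] = seqs - others
    return specific


def marker_signals(test_reads, specific):
    """Fraction of test reads that exactly match each control's specific set."""
    counts = Counter(test_reads)
    total = sum(counts.values())
    signals = {}
    for name, markers in specific.items():
        hits = sum(c for seq, c in counts.items() if seq in markers)
        signals[name] = hits / total if total else 0.0
    return signals


# Toy data (invented for illustration only, not taken from the study).
controls = {
    "A": ["ACGT", "AAAA", "CCCC"],
    "B": ["ACGT", "GGGG"],
    "C": ["ACGT", "TTTT", "CCCC"],
}
test_library = ["AAAA", "AAAA", "GGGG", "ACGT"]

specific = library_specific_sets(controls)
signals = marker_signals(test_library, specific)
```

In this toy case the test library carries specific sequences from controls A and B but none from C, which is the kind of pattern a mixed sample would be expected to show. Real amplicon data would need tolerance for sequencing errors and a statistical threshold on the signal, both of which this sketch omits.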