\(^1\)Senior AI solutions developer, Parexel
\(^2\)Associate Professor, The University of Verona

Amplicon sequencing for microbial community analysis

The fundamental principle of amplicon sequencing for microbial community analysis is to target a specific, short, and conserved genetic region that is present in most (or all) members of the microbial group of interest (e.g., bacteria, fungi, archaea). This region also needs to contain enough variation to distinguish different species or even strains.

For bacteria and archaea, the most common target is the 16S ribosomal RNA (rRNA) gene. For fungi, it’s often the Internal Transcribed Spacer (ITS) region. These genes are chosen because:

The Illumina MiSeq system has been, and largely remains, the most common and widely used platform for amplicon sequencing in microbial community analysis, particularly for 16S rRNA gene sequencing:

Pros and Cons of amplicon variant sequencing using MiSeq

Pros

  • Ease of Use: Method is relatively easy to perform with well-established protocols.

  • Cost-Effectiveness: Highly cost-effective for multiplexed targeted sequencing applications (e.g., 16S rRNA gene sequencing) due to high throughput.

  • High Throughput for Targeted Sequencing: Capable of generating a large number of reads suitable for analyzing many samples simultaneously in amplicon-based studies.

  • Straightforward Analysis for Standard Applications: Well-developed bioinformatics pipelines and tools are available for common applications like 16S amplicon analysis.

  • High Accuracy (Low Raw Error Rate): Illumina chemistry generally provides a low raw sequencing error rate.

  • Robust Ecosystem and Community Support: Large user base, extensive documentation, and readily available bioinformatics tools and resources.

  • Relatively Fast Turnaround: Offers quicker run times compared to higher-throughput sequencers, suitable for moderate-sized projects.

Cons

  • Limited Resolution for Taxa Identification: In amplicon sequencing, taxonomic assignment is based on a short, single sequence region, limiting resolution, especially at the species or strain level. This is true for amplicon sequencing in general.

  • Inherent Biases in PCR-Based Workflows:

    • DNA Extraction Efficiency: Variations in DNA extraction efficiency across different organisms can bias the initial DNA pool.
    • PCR Amplification Bias: PCR primers may not amplify all sequence variants with the same efficiency, leading to an inaccurate representation of relative abundances.
    • Chimeric Sequences: PCR can generate artificial sequences (chimeras) that need to be identified and removed during analysis.
  • Impact of Sequencing Errors: While low, sequencing errors are not perfectly constant and can still lead to overestimation of diversity if not properly mitigated by denoising algorithms.

  • Assumptions Regarding Quantitative Accuracy:

    • Assumes a linear dependency between sequence variants and true population size, which is often challenged by the aforementioned biases.
    • Assumes similar linear dependency across all taxa, which may not hold true.
  • Limited Read Length: Max read lengths (2x300 bp)

  • Consumable Costs (when analyzing few samples): The cost of a MiSeq reagent kit can be substantial for very low-throughput projects, making it less economical in such scenarios.

  • Requires Bioinformatics Expertise: Despite straightforward pipelines, interpreting complex biological data and troubleshooting still demands significant bioinformatics knowledge.

The experimental workflow

Phase 1: Sample Collection and DNA Extraction
  • Sample Collection

  • DNA Extraction: This step is critical as it influences the diversity and quantity of DNA obtained and can be source-dependent (e.g., specific kits for soil vs. stool).

  • Quality Control: Assess DNA quantity (e.g., Qubit, NanoDrop) and integrity (e.g., gel electrophoresis) to ensure sufficient and high-quality input for downstream steps.

Phase 2: PCR Amplification of 16S rRNA Gene
  • Primer Design/Selection: Choose appropriate 16S rRNA gene region-specific primers (e.g., V3-V4, V4, V1-V2) that are suitable for the target microbial community and sequencing platform.

  • PCR Amplification (First Round): Perform PCR to amplify the target 16S rRNA gene region from the extracted DNA. This step uses primers that bind to conserved regions flanking the variable region of interest.

  • PCR Product Cleanup: Purify the PCR products to remove unligated primers, nucleotides, and enzymes. This is often done using magnetic beads (e.g., AMPure XP) or gel extraction.

  • Quality Control: Verify amplicon size and presence using gel electrophoresis or a bioanalyzer/TapeStation. Quantify purified PCR products.

Phase 3: Library Preparation (Adding Adapters and Indices)
  • Index PCR (Second Round PCR): Second PCR step is performed using primers that incorporate:

    • Illumina Sequencing Adapters: Sequences required for binding to the flow cell.
    • Dual Indices (Barcodes): Unique DNA sequences that allow multiple samples to be pooled and sequenced together on a single run (multiplexing). These indices are read during the sequencing process to demultiplex the data later.
  • Library Cleanup: Purify the indexed PCR products again to remove excess primers, primer-dimers, and short fragments.

  • Library Quantification and Quality Control:

    • Quantification: Accurately quantify the final libraries (e.g., Qubit, qPCR for library quantification). This is crucial for pooling.
    • Size Distribution Assessment: Check the size distribution of the libraries using a bioanalyzer, TapeStation, or Fragment Analyzer to ensure they are within the expected range (amplicon size + adapter length).
Phase 4: MiSeq Sequencing
  • Library Pooling: Based on the quantification results, pool equimolar amounts of the barcoded libraries. This ensures an even representation of samples on the sequencing run.

  • Denaturation and Dilution: Denature the pooled libraries into single strands and dilute them to the appropriate loading concentration for the MiSeq flow cell.

  • Sequencing Run: Initiate the sequencing run on the MiSeq instrument. The instrument performs bridge amplification on the flow cell, followed by sequencing-by-synthesis, generating millions of reads.

Phase 5: Data Output and Initial Processing
  • Raw Data Generation: The MiSeq instrument generates raw sequencing data in BCL (Base Call) files.

  • BCL to FASTQ Conversion (Demultiplexing): Illumina’s instrument control software (or a separate utility like bcl2fastq/dragen) processes the BCL files. This step performs base calling which converts the fluorescent signals into base calls (A, T, C, G).

  • Quality Scoring: Assigns a quality score (Phred score) to each base call.

  • Demultiplexing: Reads the unique index sequences for each cluster and separates the reads originating from different samples into individual files.

  • FASTQ File Output: The final output of the sequencing run for each sample consists of FASTQ files. Each FASTQ file contains the raw sequence reads and their corresponding quality scores. Typically, you’ll get two FASTQ files per sample (forward and reverse reads for paired-end sequencing).

But what is a FASTQ file?

@M03461:328:000000000-JLTJF:1:1101:8238:1893 1:N:0:38
TACGGGAGGCAGCAGTGGGGAATATTTGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGTGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTTAGGAAGATAAT
+
AACCCE7@CFGGEDGG<8,+--C,CF,;,@<-C9--@+++++;;E,,CCE9,6@,C8<C@B+B,,,<,,,,,,,,69EEF9,,BB+B,,6,6:CEFFF9,C,,,4,,,,C,5B
  • Line 1: Read sequence identifier (encoded descriptions of instrument, lane…)
  • Line 2: Read sequence
  • Line 3: « + » sign (can be followed by seq identifier)
  • Line 4: ASCII characters denoting the quality score associated to each Read
  • The method of encoding quality scores:
    • Phred+33: Sanger
    • Phred+64: Illumina 1.8+
Quality score encoding
  • Quality score can be used to describe the chance that a call was correct (confidence) or the chance that a call was incorrect (probability of error).
  • Illumina reads will often assign a maximum per-base confidence of 99.99%, or 1/10,000 probability of error.
  • PacBio reading in high-confidence mode, can assign even higher confidence to individual bases due to the PacBio’s ability to read a given sequence several times to ensure correctness.
  • Keeping such large numbers with sequencing data would require a significantly larger amount of storage than necessary, so an encoding scheme is used for compression.
  • Encoding is designed to lead to loss of fine detail at the very high-confidence end of the spectrum, with much less loss at the low-confidence end. Difference between a 1/10,000 chance of error and a 1/9500 chance of error is considered trivial; Difference between a 1/100 chance of error and a 1/10 chance of error is of great importance.
  • Phred is the most common encoding scheme:
    \(-10 \log_{10} P\), where \(P\) is the estimated probability of the base call being incorrect.
library(ggplot2)
# 1. Define a range of probabilities of incorrect base calls (P)
# We'll use a sequence from a very small number (close to 0) to 1.
# Using a log-scale for P to get a good distribution for Phred scores.
p_values <- c(10^seq(-4, 0, length.out = 100)) # From 0.0001 to 1

# 2. Calculate the corresponding Phred scores (Q)
# For P=0, Q would theoretically be infinity, but practically Phred scores are finite.
q_values <- -10 * log10(p_values)

# 3. Calculate the corresponding probabilities of correct base calls (1-P)
p_correct <- 1 - p_values

# 4. Create a data frame with these values
phred_data <- data.frame(
  p_error = p_values,
  p_correct = p_correct,
  phred_score = q_values
)

# 5. Use ggplot2 to plot Phred score against the probability of incorrect base call
ggplot(phred_data, aes(x =  p_error, y = phred_score)) +
  # Add a line plot
  geom_line(color = "steelblue", size = 1.2) +
  labs(
    title = "Dependency of Phred Score on Probability of Error",
    x = "Probability of Incorrect Base Cal",
    y = "Phred Score"
  ) +
  scale_x_continuous(
    limits = c(0, 1),
    breaks = c(0,  0.1, 0.2, 0.4, 0.6, 0.8, 1.0), 
    labels = scales::percent_format(accuracy = 0.01) # Format x-axis as percentages
  ) +
  scale_y_continuous(
    limits = c(0, 40), # Common range for Phred scores
    breaks = seq(0, 40, by = 5)
  ) +
  theme_minimal() + # Use a minimal theme for a clean look
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
    axis.title.x = element_text(size = 12, margin = margin(t = 10)),
    axis.title.y = element_text(size = 12, margin = margin(r = 10)),
    axis.text = element_text(size = 10),
    panel.grid.major = element_line(color = "grey90", linetype = "dashed"),
    panel.grid.minor = element_blank() # Remove minor grid lines
  ) +
  # Add annotations for key Phred scores, now based on P (probability of error)
  annotate("text", x = 0.1, y = 10, label = "Q=10 (P=0.1)", hjust = 0, vjust = -0.5, size = 3, color = "darkgreen") +
  annotate("text", x = 0.01, y = 20, label = "Q=20 (P=0.01)", hjust = 0, vjust = -0.5, size = 3, color = "darkgreen") +
  annotate("text", x = 0.001, y = 30, label = "Q=30 (P=0.001)", hjust = 0, vjust = -0.5, size = 3, color = "darkgreen") +
  annotate("text", x = 0.0001, y = 40, label = "Q=40 (P=0.0001)", hjust = 0, vjust = -0.5, size = 3, color = "darkgreen")

  • To encode a quality score as a single character, first it is compressed down to a two-digit number by Phred encoding.
  • For a call with a confidence of 99.9%, or 1/1000 chance of error, we would get a Phred score of 30.
  • Obtained number is offset by adding 33 (Phred +33.) and the corresponding symbol is taken from the ASCII table
    • A Phred score of 0 (meaning 100% error probability) corresponds to an ASCII value of 0+33=33, which is the exclamation mark (!).
    • A Phred score of 10 (90% accuracy, 10% error) corresponds to an ASCII value of 10+33=43, which is the plus sign (+).
    • A Phred score of 30 (99.9% accuracy, 0.1% error) corresponds to an ASCII value of 30+33=63, which is the question mark (?).
    • A Phred score of 40 (99.99% accuracy, 0.01% error) corresponds to an ASCII value of 40+33=73, which is the character I.
ascii_data <- data.frame(
  Decimal = 0:127,
  Character = sapply(0:127, function(i) {
    char_val <- if (i >= 0 && i <= 31) {
      # Common control characters
      # Using switch for readability for control characters
      switch(i + 1, # R indexing starts at 1
             "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
             "BS", "HT", "LF", "VT", "FF", "CR", "SO", "SI",
             "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
             "CAN", "EM", "SUB", "ESC", "FS", "GS", "RS", "US"
      )
    } else if (i == 127) {
      "DEL" # Delete character
    } else {
      # Printable ASCII characters
      intToUtf8(i)
    }
    # Pad all characters to a consistent width (e.g., 5 characters) for alignment
    # This ensures columns line up nicely when printed side-by-side
    sprintf("%-5s", char_val)
  }),
  stringsAsFactors = FALSE # Prevent automatic conversion to factors
)

section_size <- 32

# split the data into 4 sections so we can show side by side
df_list <- lapply(1:4, function(k) {
  start_row <- (k - 1) * section_size + 1
  end_row <- k * section_size
  section_df <- ascii_data[start_row:end_row, ]
  return(section_df)
})
# Combine the sections side-by-side into a single wide data frame
combined_df <- do.call(cbind, df_list)
knitr::kable(combined_df, row.names = FALSE)
Decimal Character Decimal Character Decimal Character Decimal Character
0 NUL 32 64 @ 96 `
1 SOH 33 ! 65 A 97 a
2 STX 34 66 B 98 b
3 ETX 35 # 67 C 99 c
4 EOT 36 $ 68 D 100 d
5 ENQ 37 % 69 E 101 e
6 ACK 38 & 70 F 102 f
7 BEL 39 71 G 103 g
8 BS 40 ( 72 H 104 h
9 HT 41 ) 73 I 105 i
10 LF 42 * 74 J 106 j
11 VT 43 + 75 K 107 k
12 FF 44 , 76 L 108 l
13 CR 45 - 77 M 109 m
14 SO 46 . 78 N 110 n
15 SI 47 / 79 O 111 o
16 DLE 48 0 80 P 112 p
17 DC1 49 1 81 Q 113 q
18 DC2 50 2 82 R 114 r
19 DC3 51 3 83 S 115 s
20 DC4 52 4 84 T 116 t
21 NAK 53 5 85 U 117 u
22 SYN 54 6 86 V 118 v
23 ETB 55 7 87 W 119 w
24 CAN 56 8 88 X 120 x
25 EM 57 9 89 Y 121 y
26 SUB 58 : 90 Z 122 z
27 ESC 59 ; 91 [ 123 {
28 FS 60 < 92 \ 124 |
29 GS 61 = 93 ] 125 }
30 RS 62 > 94 ^ 126 ~
31 US 63 ? 95 _ 127 DEL

16s amplicon metagenomics analysis

In microbial community analysis using amplicon sequencing (like 16S rRNA gene sequencing), there are two primary ways to categorize and count the different “types” of microbes present in a sample:

In this workshop we will be using ASV estimation as implemented in the dada2 (Callahan et al. 2016) R package.

Why dada2?

We will be working with a subset of the PRJNA616141 BioProject Containing sequences of two Drosophila species: D. melanogaster and D. subobscura subject to lead poisoning.

required software:

On Windows:

On linux:

IDE:

Required R libraries

library(tidyverse)
library(dada2)
library(phyloseq)
library(phangorn)
library(ggrepel)
library(DECIPHER)

How to install?

cran_packages <- c("tidyverse", "vegan", "cowplot", "ggrepel", "BiocManager")
install.packages(cran_packages )
bioconductor_packages <- c("dada2", "phyloseq", "DESeq2", "DECIPHER")
BiocManager::install(bioconductor_packages)

required non R software:

Inspecting reads

reads.in <- list.files("fastq/raw", full.names = TRUE)
fwd.reads.in <- reads.in[grepl("_1.fq.gz", reads.in)]
rev.reads.in <- reads.in[grepl("_2.fq.gz", reads.in)]

fwd_qual <- plotQualityProfile(fwd.reads.in)
rev_qual <- plotQualityProfile(rev.reads.in)

plot_grid(fwd_qual + facet_wrap(~file, ncol = 4),
          rev_qual + facet_wrap(~file, ncol = 4),
          ncol = 1)

Primer removal (cutadapt tool)

Why remove primers?

  • PCR Artifacts, Not Biological Sequences - Including them in your final sequence data means you’re carrying along non-biological information
  • Impact on Sequence Similarity and Diversity Metrics (Especially for OTU Clustering):
    • Artificial Homogeneity: All reads for a given amplicon will have the same primer sequence at their ends. These identical stretches artificially increase the overall sequence similarity between reads that originated from different organisms.
    • Distorted Distance Calculations - Downstream analyses involving distance matrices (calculating beta diversity or for traditional OTU clustering at 97% similarity), rely on accurate sequence divergence. If common primer sequences are included, they will reduce the calculated genetic distance between sequences, potentially leading to sequences from different species being grouped into the same OTU.
    • OTU Definition Bias: The commonly used 97% similarity threshold for species-level OTUs is based on the full 16S rRNA gene or significant variable regions without primer sequences. Including the conserved primer regions can skew this similarity, making different organisms appear more similar than they actually are based on their variable regions, leading to an underestimation of true diversity.
  • Interference with Denoising Algorithms
    • Ambiguous Nucleotides (Degeneracy): Many universal 16S primers contain degenerate bases (e.g., ‘N’, ‘Y’, ‘R’, ‘S’) to account for natural variation in the primer binding sites across different taxa.
      • DADA2 and similar tools work by identifying exact sequence variants (ASVs). Ambiguous nucleotides introduce “noise” that the algorithm might misinterpret as true biological variation or, worse, can lead to a massive number of false-positive ASVs or issues with chimera detection. DADA2 chimera removal step requires primers to have been removed, as otherwise the ambiguous nucleotides in most primer sets cause large numbers of false-positive chimeras to be identified.
  • Accurate Taxonomic Assignment: Taxonomic assignment tools rely on comparing your ASVs to reference databases (e.g., SILVA, GreenGenes). These databases contain full-length or primer-free 16S gene sequences.

How to remove primers?

  • specialized tools like cutadapt
    • for paired end reads use paired end mode
fwd_primer <- "TACGGGAGGCAGCAG"
rev_primer <- "CCAGGGTATCTAATCC"
fwd_fastq_file <- fwd.reads.in[1]
rev_fastq_file <- rev.reads.in[1]
reads_out_folder =  "fastq/cutadapt"


cutadapt_call <- cutadapt_primers <- paste("-a ^",
                          fwd_primer,
                          "...",
                          as.character(Biostrings::reverseComplement(Biostrings::DNAString(rev_primer))),
                          " -A ^",
                          rev_primer,
                          "...",
                          as.character(Biostrings::reverseComplement(Biostrings::DNAString(fwd_primer))), sep = "")

  
fwd_cutadapt_out <- file.path(reads_out_folder, basename(fwd.read))
rev_cutadapt_out <- file.path(reads_out_folder, basename(rev.read))

cutadapt_call <- paste("cutadapt ",
                       cutadapt_primers,
                       " --discard-untrimmed -o ",
                       fwd_cutadapt_out,
                       " -p ",
                       rev_cutadapt_out,
                       " ",
                       fwd.read,
                       " ",
                       rev.read,
                       sep = "")

cutadapt_call       

#^TACGGGAGGCAGCAG...GGATTAGATACCCTGG: linked adapter

#-a:  sequence of a 5’ adapter 
#^: anchored - expected to occur in full length at the beginning of the read
#...: any number of bases
#left adapter is anchored and required, right adapter is not anchored ($) so its is optional, will be removed if found

#--discard-untrimmed: if adapter not found remove read

Utility function to call cutadapt from R (cutadapt should be visible by R - on system path):

cutadapt <- function(fwd.reads.in,
                     rev.reads.in,
                     reads.out,
                     fwd_primer,
                     rev_primer){
  if(missing(fwd.reads.in)){
    stop("fwd.reads.in must be specified")
  }
  
  if(missing(rev.reads.in)){
    stop("rev.reads.in must be specified")
  }
  
  if(missing(reads.out)){
    stop("reads.out must be specified")
  }
  
  if(missing(fwd_primer)){
    stop("fwd_primer must be specified")
  }
  
  if(missing(rev_primer)){
    stop("rev_primer must be specified")
  }
  
  
  cutadapt_primers <- paste("-a ^",
                            fwd_primer,
                            "...",
                            as.character(Biostrings::reverseComplement(Biostrings::DNAString(rev_primer))),
                            " -A ^",
                            rev_primer,
                            "...",
                            as.character(Biostrings::reverseComplement(Biostrings::DNAString(fwd_primer))), sep = "")
  dir.create(reads.out)
  for(i in seq_along(fwd.reads.in)){
    fwd.read <- fwd.reads.in[i]
    rev.read <- rev.reads.in[i]
    
    fwd.cutadapt <- file.path(reads.out, basename(fwd.read))
    rev.cutadapt <- file.path(reads.out, basename(rev.read))
    
    cutadapt_call <- paste("cutadapt ",
                           cutadapt_primers,
                           " --discard-untrimmed -o ",
                           fwd.cutadapt,
                           " -p ",
                           rev.cutadapt,
                           " ",
                           fwd.read,
                           " ",
                           rev.read,
                           sep = ""
    )
    
    message(cutadapt_call)
    cut_primers <-  system(cutadapt_call, intern = TRUE)
    
    fileConn <- file(paste(fwd.cutadapt, "_stats.txt", sep = ""))
    writeLines(cut_primers, fileConn)
    close(fileConn)
  }
}


# usage:
cutadapt(fwd.reads.in = fwd.reads.in,
         rev.reads.in = rev.reads.in ,
         reads.out = "fastq/cutadapt",
         fwd_primer = "TACGGGAGGCAGCAG",
         rev_primer = "CCAGGGTATCTAATCC")
  • alternatively, simply trim primer length from sequence start

Filtering and trimming sequences (the filterAndTrim() function)

  • Removal of low quality regions
  • Trimming of read ends - sequencing errors are more common at the ends of reads
  • Removal of sequences with too many estimated errors
  • Removal of short sequences
reads.in <- list.files("fastq/cutadapt", full.names = TRUE)

fnFs_trim <- reads.in[grepl("_1.fq.gz$", reads.in)]
fnRs_trim <- reads.in[grepl("_2.fq.gz$", reads.in)]
filtFs <- file.path("fastq", "filtered", basename(fnFs_trim))
filtRs <- file.path("fastq", "filtered", basename(fnRs_trim))

filter_and_trim <- filterAndTrim(fnFs_trim,
                                 filtFs,
                                 fnRs_trim,
                                 filtRs,
                                 truncLen = c(270, 200),
                                 maxN = 0,
                                 maxEE = c(2, 2), #After truncation, reads with higher than maxEE "expected errors" will be discarded. Expected errors are calculated from the nominal definition of the quality score: EE = sum(10^(-Q/10))
                                 minLen = 150,
                                 rm.phix = TRUE, #discard reads that match against the phiX genome (internal control for sequencing)
                                 compress = TRUE,
                                 multithread = FALSE,
                                 matchIDs = TRUE)

If you do not want to use cutadapt but rather to left trim the sequences to remove primers:

reads.in <- list.files("fastq/raw", full.names = TRUE)

fnFs_trim <- reads.in[grepl("_1.fq.gz$", reads.in)]
fnRs_trim <- reads.in[grepl("_2.fq.gz$", reads.in)]
filtFs <- file.path("fastq", "filtered", basename(fnFs_trim))
filtRs <- file.path("fastq", "filtered", basename(fnRs_trim))

out <- filterAndTrim(fnFs_trim,
                     filtFs,
                     fnRs_trim,
                     filtRs,
                     trimLeft = 20,
                     truncLen = c(270, 200),
                     maxN = 0,
                     maxEE = c(2, 2), #
                     minLen = 150,
                     rm.phix = TRUE,
                     compress = TRUE,
                     multithread = FALSE,
                     matchIDs = TRUE)

Inspect reads after filtering:

fwd_qual <- plotQualityProfile(filtFs)
rev_qual <- plotQualityProfile(filtRs)

plot_grid(fwd_qual + facet_wrap(~file, ncol = 4),
          rev_qual + facet_wrap(~file, ncol = 4),
          ncol = 1)

Central step of the DADA2 algorithm. Error rates (The learnErrors() function) and denoising (The dada() function) with selfConsist = TRUE

Components:

  1. The Error Model

DADA2 learns the specific error rates (e.g., A→T, G→C errors) from your own sequencing data using a parametric error model, which is more accurate than using a universal error model.

The error model is a set of nucleotide-transition probability tables. Specifically, for each possible base transition (e.g., A to C, G to T, C to A, etc.), and for each Phred quality score, DADA2 estimates the probability of that error occurring.

The error model assumes that errors at different nucleotide positions are independent, and that the probability of a specific base transition at a given quality score is generally independent of the surrounding sequence context.

The Phred quality score (\(Q\)) is a crucial component, as it represents the estimated probability of an incorrect base call (\(P\)):

\[ Q = -10 \log_{10}(P) \]

Conversely, the probability of an error (\(P\)) can be derived from the quality score (\(Q\)):

\[ P = 10^{-Q/10} \]

DADA2 utilizes these quality scores to estimate the probability of specific nucleotide transitions. We define this as \(p(B_{actual} \to B_{observed} | Q)\), the probability that a true base \(B_{actual}\) is sequenced as \(B_{observed}\) given a Phred quality score \(Q\).

DADA2 constructs an error rate matrix, or set of matrices \(M\). For specific bases and quality scores, \(M_{B_{actual}, B_{observed}, Q}\) represents the estimated probability (or rate) of a true base \(B_{actual}\) being miscalled as \(B_{observed}\) when the sequencer reports quality score \(Q\).

The observed error rates are often noisy. To obtain a generalized model, DADA2 uses LOESS (Locally Estimated Scatterplot Smoothing) regression. LOESS is applied to fit a smooth curve for each possible base transition (e.g., A \(\to\) C, A \(\to\) G, etc.) as a function of the quality score. This smooth curve represents the refined estimated probability \(p(B_{actual} \to B_{observed} | Q)\).

The probability that a specific observed read \(i\) was generated from a true biological sequence \(j\) is modeled as the product of the position-specific transition probabilities. This probability, denoted \(\lambda_{j \to i}\) assumes that errors at different nucleotide positions are independent.

\[ \lambda_{j \to i} = \prod_{l=1}^{L} p(j(l) \to i(l) | q_i(l)) \]

Where:

  • \(\lambda_{j \to i}\) is the overall per-read probability that true sequence \(j\) would produce observed sequence \(i\).

  • \(L\) is the length of the aligned sequences.

  • \(j(l)\) is the nucleotide at position \(l\) in the true sequence \(j\)

  • \(i(l)\) is the nucleotide at position \(l\) in the observed read \(i\).

  • \(q_i(l)\) is the Phred quality score of the base at position \(l\) in read \(i\).

  • \(p(j(l) \to i(l) | q_i(l)))\) is the specific probability of a true nucleotide \(j(l)\) being read as \(i(l)\), given the quality score \(q_i(l)\) at that position.

The formula multiplies (\(\prod\)) the probabilities for each nucleotide position \(l\) up to the aligned sequence length \(L\).

  1. Expected Number of Error Reads

    This calculates the expected number of times you’d see read \(i\) if it were purely the result of errors from the more abundant sequence \(j\).

\[ \mathbf{\Lambda_{ji}} = n_j \cdot \lambda_{j \to i} \] Where:

  • \(\mathbf{\Lambda_{ji}}\) is the expected count of error reads of sequence \(i\) generated from sequence \(j\).

  • \(n_j\) is the total number of reads (abundance) for the true sequence \(j\).

  • \(\lambda_{j \to i}\) is the per-read error probability calculated in the first formula.

  1. Abundance P-value

To decide if a unique sequence is a true variant or just an error, DADA2 calculates an abundance p-value. It quantifies the likelihood that the observed abundance of unique sequence \(i\) (\(a_i\)) is too high to be explained purely by sequencing errors from a higher-abundance sequence \(j\). It is defined as:

\[ p_A(j \to i) = \frac{P(X \ge a_i \mid X \sim \text{Poisson}(\mathbf{\Lambda_{ji}}))}{1 - P(X=0 \mid X \sim \text{Poisson}(\mathbf{\Lambda_{ji}}))} \] Where:

  • \(a_i\): The observed abundance of unique sequence \(i\).

  • \(X\): A random variable representing the number of times sequence \(i\) would be observed if it were purely an error from sequence \(j\).

  • \(\text{Poisson}(\mathbf{\Lambda_{ji}})\): A Poisson distribution with an expected value equal to \(\mathbf{\Lambda_{ji}}\).

  • The numerator \(P(X \ge a_i \mid X \sim \text{Poisson}(\mathbf{\Lambda_{ji}}))\) is the probability of observing \(a_i\) or more reads of \(X\).

  • The denominator \(1 - P(X=0 \mid X \sim \text{Poisson}(\mathbf{\Lambda_{ji}}))\) is the probability of observing at least one read of \(X\), conditioning the p-value on its observation.

The p-value (represented by the sum in the numerator) calculates the probability of observing \(k\) reads, from the actual observed abundance \(a_i\) up to infinity. A low p-value suggests the sequence is a true variant (ASV).

If this p-value is smaller than a defined threshold (\(\Omega_A\)), the null hypothesis is rejected, and the sequence \(i\) is considered a true Amplicon Sequence Variant (ASV).

To sum up:

We have some probability of getting an observed sequence from a true sequence just due to sequencing errors If the amount of the observed sequence is higher than we would expect just based on the errors we declare it as a true sequence.

But how to know which are the true sequences from the pool of sequenced data in order to calculate the above?

The self consistent partitioning model

The selfConsist argument enables the simultaneous inference of the error model and the true sample sequences. When selfConsist = TRUE, DADA2 performs an iterative process:

  1. Initialization: The algorithm starts with an initial guess for the error rates. This is typically a “maximum possible” error rate matrix derived from the data, assuming all reads are errors apart from a single, most abundant sequence.

  2. Sample Inference (Partitioning): Using the current error model, the algorithm partitions the dereplicated sequences into groups that are likely to have originated from the same true sequence (later we go into more details).

  3. Error Rate Re-estimation: A new error model is estimated from the partitioned sequences. This is done by comparing the observed sequences to the inferred “true” sequences within each partition.

  4. Convergence Check: The algorithm checks if the error rates have converged. If not, it repeats steps 2 and 3 with the updated error model. This iterative process continues until the error rates stabilize, resulting in a jointly consistent solution for the error model and the inferred sample composition.

The Partitioning Process

Dereplicate sequences to generate unique sequence + abundances.

Start with one partition and declare the most abundant sequence a true ASV.

  1. Bud: Searches through all sequences in all current partitions to find the single sequence that is the best candidate to become a new ASV. This “best candidate” is the one with the lowest abundance p-value (after passing other checks like minimum abundance and fold-change). If this lowest p-value is significant (i.e., less than the threshold \(\Omega_A\)after correction), that sequence “buds” off to form a new partition/cluster.

  2. Shuffle: After a new ASV is found (or even after an iteration where no new ASV is found), re-evaluate the partition assignment for every non-center sequence. Each sequence is moved to the partition/cluster for which its is most likely to originate from ensuring all sequences are in their most likely partition based on the current set of identified ASVs.

    For any given unique sequence (\(seq_i\)), the algorithm compares it against every established ASV partition (\(ASV_j\), \(ASV_k\), etc.). For each comparison, it calculates the expected number of times \(seq_i\) would appear as an error from that ASV.

\[ E_{ji} = n_j \cdot \lambda_{ji} \]

Where:

  • \(n_j\) is the total number of reads assigned to the partition for \(ASV_j\).
  • \(\lambda_{ji}\) is the per-read probability that an error in \(ASV_j\) would produce \(seq_i\), based on their sequence differences and quality scores.

Continue this “Bud & Shuffle” loop until no more sequences can be significantly separated into new ASVs.

In practice

  1. Error estimation (The learnErrors() function)
errF <- learnErrors(filtFs, 
                    multithread = TRUE,
                    nbases = 1e+10,
                    MAX_CONSIST = 20,
                    verbose = TRUE) 
dada2::plotErrors(errF)

errR <- learnErrors(filtRs,
                    multithread = TRUE,
                    nbases = 1e+10,
                    MAX_CONSIST = 20,
                    verbose = TRUE)  
dada2::plotErrors(errR)

  1. Dereplication (The derepFastq() function)

DADA2 groups identical reads together. Instead of processing millions of individual reads, it works with a much smaller set of unique sequences and a count of how many times each sequence was observed. This significantly reduces computation time.

derepFs <- derepFastq(filtFs, verbose = TRUE)
derepRs <- derepFastq(filtRs, verbose = TRUE)
  1. Denoising (The dada() function)
dadaFs <- dada(derepFs,
               err = errF,
               multithread = TRUE,
               selfConsist = TRUE)

dadaRs <- dada(derepRs,
               err = errR,
               multithread = TRUE,
               selfConsist = TRUE)

The selfConsist = TRUE iterative process, performed independently for each sample (or on pooled sequences), allows DADA2 to robustly infer the true ASVs present and their abundances, effectively stripping away sequencing noise. The reliance on the learned error model and statistical inference makes it highly sensitive to single-nucleotide differences while being robust against typical sequencing errors.

Merging left and right reads (mergePairs())

Merge the dereplicated sequences rejecting with a sufficient overlap (minOverlap). The overlap needs to be perfect by default (maxMismatch). trimOverhang is recommended if not used cutadapt in paired end mode.

mergers <- mergePairs(dadaFs,
                      derepFs,
                      dadaRs,
                      derepRs,
                      verbose = TRUE,
                      minOverlap = 20)

Create a sequence table (makeSequenceTable())

Result is a matrix of sequences as rows and samples in columns, each cell is a count. collapseNoMismatch will collapse sequences that are identical up to shifts or length variation. The representative will the most abundant sequence.

seqtab <- makeSequenceTable(mergers)
seqtab_collapse <- collapseNoMismatch(seqtab)

Bimera removal (removeBimeraDenovo())

A bimera is a two-parent chimera, in which the left side is made up of one parent sequence, and the right-side made up of a second parent sequence.

For a given sequence being tested (the “query”) in a specific sample:

  • Identify Potential Parents: The algorithm first identifies a set of valid parent sequences. A sequence is considered a potential parent only if it is significantly more abundant than the query sequence within that same sample. This is controlled by the minFoldParentOverAbundance and minParentAbundance parameters (isBimeraDenovo function - can be passed from removeBimeraDenovo).
  • Find the Best Left and Right Matches: The algorithm then tries to explain the query sequence using the set of parents. It does this by finding the longest possible matches from both ends:
    • Best Left Parent: It compares the query to all potential parents to find the one that provides the longest stretch of exact matching sequence starting from the left end. The length of this longest match is recorded (max_left).
    • Best Right Parent: It repeats the process from the right end, finding the parent that provides the longest exact match. This length is recorded (max_right).
  • Flag as bimera: : The query sequence is flagged as a chimera if the best left match and the best right match can collectively cover its entire length (max_left + max_right >= query_sequence_length).
    If this condition is true, it means a two-parent model perfectly explains the query sequence, and it is flagged as a chimera for that sample. A sequence might be flagged as chimeric in a sample where abundant parents are present, but not in another sample where they are absent.
seqtab_nochim <- removeBimeraDenovo(seqtab_collapse,
                                    multithread = FALSE,
                                    verbose = TRUE)

Inspecting sequence distribution and read loss

t(seqtab_nochim) %>%
  as.data.frame() %>%
  rownames_to_column("seq") %>%
  mutate(length = nchar(seq),
         reads = rowSums(.[2:ncol(.)])) %>%
  group_by(length) %>%
  summarise(reads = sum(reads)) %>%
  ggplot()+
  geom_line(aes(x = length, y = reads)) +
  theme_classic() +
  geom_vline(xintercept = 400, color = "grey50", lty = 2)+
  geom_vline(xintercept = 428, color = "grey50", lty = 2)

V3V4 has a bimodal length distribution..

Removal of unexpectedly short sequences:

seqtab_length <- seqtab_nochim[,nchar(colnames(seqtab_nochim)) %in% seq(400,428)]

Track of sequences:

input_count <- ShortRead::countFastq("fastq/raw")
input_count <- input_count  %>%
  rownames_to_column("sample") %>%
  filter(grepl("_1\\.fq\\.gz$", sample))

cutadapt_count <- ShortRead::countFastq("fastq/cutadapt")
cutadapt_count <- cutadapt_count %>%
  rownames_to_column("sample") %>%
  filter(grepl("_1\\.fq\\.gz$", sample))

filtered_count <- ShortRead::countFastq("fastq/filtered")
filtered_count <- filtered_count %>%
  rownames_to_column("sample") %>%
  filter(grepl("_1\\.fq\\.gz$", sample))

##helper function
getN <- function(x) sum(getUniques(x))

track <- data.frame(sample = input_count$sample,
                    input = input_count$records,
                    primers = cutadapt_count$records,
                    filtered = filtered_count$records,
                    denoisedF = sapply(dadaFs, getN),
                    denoisedR = sapply(dadaRs, getN),
                    merged = sapply(mergers, getN),
                    chimera_remove = rowSums(seqtab_nochim),
                    length_trim = rowSums(seqtab_length))


rownames(track) = NULL

track <- track %>%
  mutate(percent_final = length_trim/input * 100)

knitr::kable(track, digits = 2)
sample input primers filtered denoisedF denoisedR merged chimera_remove length_trim percent_final
Dmel_K_St_M_1.fq.gz 127453 124153 115864 115824 115832 115634 115634 115629 90.72
Dmel_Sl_St_F_1.fq.gz 102775 99978 92205 92172 92175 92050 92050 92050 89.56
Dmel_Sl_St_M_1.fq.gz 123931 120552 112549 112528 112535 112354 112354 112354 90.66
Dmel_St_K_1.fq.gz 92464 90110 83373 83356 83359 83339 83283 83283 90.07
Dmel_St_Sl_1.fq.gz 148858 145068 134031 133996 134016 133952 133862 133862 89.93
Dsub_K_St_F_1.fq.gz 50278 48864 44627 44622 44605 44580 44478 44478 88.46
Dsub_K_St_M_1.fq.gz 69370 67390 61467 61459 61462 61445 61421 61421 88.54
Dsub_Sl_St_F_1.fq.gz 144672 140609 129372 129363 129370 129145 124535 124535 86.08
Dsub_Sl_St_M_1.fq.gz 98992 96208 88100 88098 88091 87832 80888 80888 81.71
Dsub_St_K_1.fq.gz 119238 116060 106907 106897 106897 106855 106849 106849 89.61
Dsub_St_Sl_1.fq.gz 91187 88741 81798 81772 81774 81730 81712 81712 89.61

Taxonomy assignment

Taxonomic assignment is the process of putting a biological name on the inferred ASVs. After denoising, an ASV is just a DNA sequence; assigning it a taxonomy (e.g., Lactobacillus) gives it biological meaning, which is essential for downstream analysis and interpretation. Two common approaches are used to accomplish this:

  • Direct Comparison Methods. The most intuitive method is to search for the sequence in a comprehensive reference database, typically using an alignment tool like BLAST. The taxonomy of the best-matching reference sequence is then transferred to the query ASV.
    • Computationally slow. requires pairwise alignment of each ASV against a massive database.
    • Difficult to interpret the results. There is no universal rule that links percentage identity to a specific taxonomic rank. For example, a 99% match might be to a well-defined species, but it could also be to a completely different genus that just happens to be the closest relative in the database. This ambiguity makes it hard to decide at which taxonomic level the assignment is truly reliable. Lower percentage identity (<90%) is even harder to interpret.
  • Classifier-Based Methods. Supervised learning algorithm is trained on features extracted from sequences from a curated reference database (like SILVA or GTDB) to “learn” the characteristic features of different taxonomic groups. For example, algorithms like the RDP Classifier (Wang et al. 2007) or IdTaxa (Murali et al. 2018) learn which short DNA words, or k-mers, are distinctive for each taxon. When classifying a new ASV, the algorithm checks which of these characteristic features the ASV possesses and calculates a score for each possible taxonomy.
    • Much faster. They avoid full alignments.
    • Statistical rigor. Usually these methods provide a confidence score (e.g., a bootstrap value) for each taxonomic level assigned. This allows researchers to trim the taxonomy at the last confident rank (e.g., Family level with >90% confidence), preventing over-confident assignments to uncertain species or genera.

The IdTaxa algorithm

Links to reference data.

Download the training sequences:

# download and save training sequences
gtdb_genus <- "https://zenodo.org/records/13984843/files/GTDB_bac120_arc53_ssu_r220_genus.fa.gz"
gtdb_species <- "https://zenodo.org/records/13984843/files/GTDB_bac120_arc53_ssu_r220_species.fa.gz"

gtdb_genus_train_set <- Biostrings::readDNAStringSet(gtdb_genus)
gtdb_species_train_set <- Biostrings::readDNAStringSet(gtdb_species)

if (!dir.exists("gtdb")) dir.create("gtdb")
Biostrings::writeXStringSet(gtdb_genus_train_set, "gtdb/gtdb_genus_train_set.fasta",  format="fasta")
Biostrings::writeXStringSet(gtdb_species_train_set, "gtdb/gtdb_species_train_set.fasta",  format="fasta")

Optional! Trim the sequences based on used primers:

forward_primer <- "TACGGGAGGCAGCAG"
reverse_primer <-  "CCAGGGTATCTAATCC"
reverse_primer_rc <- as.character(Biostrings::reverseComplement(Biostrings::DNAString(reverse_primer)))

input_fasta_file <- "gtdb/gtdb_genus_train_set.fasta"
output_fasta_file <- "gtdb/gtdb_genus_trimmed_train_set.fasta"
cutadapt_command <- paste(
  "cutadapt",
  "-a", paste(forward_primer, "...", reverse_primer_rc, sep = ""),
  "--discard-untrimmed",
  " --action trim",
  "-o", output_fasta_file,
  input_fasta_file
)

cutadapt_output <- system(cutadapt_command, intern = TRUE)

fileConn <- file(paste(output_fasta_file, "_stats.txt", sep = ""))
writeLines(cutadapt_output , fileConn)
close(fileConn)

train_set <- Biostrings::readDNAStringSet( "gtdb/gtdb_genus_trimmed_train_set.fasta")
hist(Biostrings::width(train_set))

Optional! Remove too short or too long sequences

final_train_set <- train_set[Biostrings::width(train_set) > 350 & Biostrings::width(train_set) < 450]
length(final_train_set)/length(train_set)
## [1] 0.7652882

Train:

tax_labels <- paste("Root", names(final_train_set), sep = ";")
set.seed(123)
gtdb_idtaxa <- DECIPHER::LearnTaxa(final_train_set,
                                    tax_labels)
## ================================================================================
## 
## Time difference of 884.75 secs

Optional! Save model for reuse in later sessions:

saveRDS(gtdb_idtaxa, "gtdb/gtdb_idtaxa.rds")

Optional! Load model:

gtdb_idtaxa <- readRDS("gtdb/gtdb_idtaxa.rds")

Infer taxonomy:

set.seed(123)
gtdb_tax_seqtab_length <- DECIPHER::IdTaxa(Biostrings::DNAStringSet(colnames(seqtab_length)),
                                           gtdb_idtaxa) 
## ================================================================================
## 
## Time difference of 21.68 secs

Cleanup:

lapply(gtdb_tax_seqtab_length, function(x) c(x$taxon, rep("", 8 - length(x$taxon)))) %>%
  do.call(rbind, .) %>%
  as.data.frame() %>%
  select(-V1, -V8) %>%
  dplyr::rename(Kingdom = V2,
                Phylum = V3,
                Class = V4,
                Order = V5,
                Family = V6,
                Genus = V7) %>%
  mutate_all(~ifelse(. == "", NA, .)) %>%
  mutate_all(~ifelse(grepl("^unclassified_", .), NA, .)) -> gtdb_tax_seqtab_length 


gtdb_tax_seqtab_length %>%
  knitr::kable(format = "html") %>%
  kableExtra::kable_styling() %>%
  kableExtra::scroll_box(height = "400px")
Kingdom Phylum Class Order Family Genus
Bacteria Pseudomonadota Alphaproteobacteria Acetobacterales Acetobacteraceae Acetobacter
Bacteria Pseudomonadota Alphaproteobacteria Rickettsiales Anaplasmataceae Wolbachia
Bacteria Pseudomonadota Alphaproteobacteria Acetobacterales Acetobacteraceae NA
Bacteria Pseudomonadota Alphaproteobacteria Acetobacterales Acetobacteraceae Acetobacter
Bacteria Bacillota Bacilli Lactobacillales Lactobacillaceae Latilactobacillus
Bacteria Bacillota Bacilli Lactobacillales Lactobacillaceae Lactiplantibacillus
Bacteria Bacillota Bacilli Lactobacillales Lactobacillaceae Latilactobacillus
Bacteria Bacillota Bacilli Lactobacillales Lactobacillaceae Lactiplantibacillus
Bacteria Bacillota Bacilli Lactobacillales Lactobacillaceae Levilactobacillus
Bacteria Bacillota Bacilli Lactobacillales Lactobacillaceae Fructilactobacillus
Bacteria Bacillota Bacilli Lactobacillales Lactobacillaceae Furfurilactobacillus
Bacteria Bacillota Bacilli Lactobacillales Lactobacillaceae Levilactobacillus
NA NA NA NA NA NA
Bacteria Pseudomonadota Alphaproteobacteria Acetobacterales Acetobacteraceae Acetobacter
Bacteria Bacillota Bacilli Staphylococcales Staphylococcaceae Staphylococcus
Bacteria Bacteroidota Bacteroidia Chitinophagales Chitinophagaceae Sediminibacterium
Bacteria Pseudomonadota Alphaproteobacteria Rhizobiales Xanthobacteraceae NA
Bacteria Pseudomonadota Alphaproteobacteria Rickettsiales Anaplasmataceae Wolbachia
Bacteria Bacteroidota Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
Bacteria Actinomycetota Actinomycetes Mycobacteriales Nakamurellaceae Nakamurella
Bacteria Bacillota_A Clostridia Lachnospirales Lachnospiraceae Lachnospira
Bacteria Pseudomonadota Gammaproteobacteria Burkholderiales Burkholderiaceae NA
Bacteria Bacteroidota Bacteroidia Bacteroidales Bacteroidaceae Prevotella
Bacteria NA NA NA NA NA
Bacteria Pseudomonadota Gammaproteobacteria Pseudomonadales Moraxellaceae Acinetobacter
Bacteria Pseudomonadota Gammaproteobacteria Xanthomonadales Xanthomonadaceae Thermomonas
Bacteria Pseudomonadota Alphaproteobacteria Rhizobiales Beijerinckiaceae Methylobacterium
Bacteria Pseudomonadota Gammaproteobacteria Enterobacterales Enterobacteriaceae NA
Bacteria Bacillota_A Clostridia Tissierellales Peptoniphilaceae Finegoldia
Bacteria NA NA NA NA NA
Bacteria Pseudomonadota Gammaproteobacteria Pseudomonadales Moraxellaceae Moraxella_A
Bacteria Actinomycetota Actinomycetes Mycobacteriales Mycobacteriaceae Lawsonella
Bacteria Pseudomonadota Alphaproteobacteria Sphingomonadales Sphingomonadaceae NA
Bacteria Pseudomonadota Alphaproteobacteria Sphingomonadales Sphingomonadaceae Sphingomonas
Bacteria Pseudomonadota Gammaproteobacteria Pseudomonadales Pseudomonadaceae NA
Bacteria Pseudomonadota Gammaproteobacteria Xanthomonadales Xanthomonadaceae Stenotrophomonas
Bacteria Actinomycetota Actinomycetes Mycobacteriales Mycobacteriaceae Corynebacterium
Bacteria Pseudomonadota Gammaproteobacteria Pseudomonadales Pseudomonadaceae NA
Bacteria NA NA NA NA NA
Bacteria NA NA NA NA NA
Bacteria NA NA NA NA NA
Bacteria Actinomycetota Thermoleophilia Solirubrobacterales Solirubrobacteraceae JAGIBJ01
Bacteria Actinomycetota Acidimicrobiia Acidimicrobiales NA NA
Bacteria Actinomycetota Actinomycetes Actinomycetales Micrococcaceae Rothia
Bacteria Pseudomonadota Gammaproteobacteria Burkholderiales Burkholderiaceae Comamonas
Bacteria Bacillota_A Clostridia Peptostreptococcales Peptostreptococcaceae NA
Bacteria Bacillota Bacilli Bacillales Anoxybacillaceae Anoxybacillus_A
Bacteria Actinomycetota Actinomycetes Actinomycetales Actinomycetaceae Varibaculum
Bacteria Pseudomonadota Gammaproteobacteria Burkholderiales Burkholderiaceae Pelomonas
Bacteria Pseudomonadota Gammaproteobacteria Pseudomonadales Moraxellaceae Psychrobacter
Bacteria Pseudomonadota Gammaproteobacteria Burkholderiales Burkholderiaceae NA
Bacteria Deinococcota Deinococci Deinococcales Deinococcaceae Deinococcus
Bacteria NA NA NA NA NA
Bacteria Pseudomonadota Gammaproteobacteria Enterobacterales Enterobacteriaceae NA
Bacteria Actinomycetota Actinomycetes Mycobacteriales Mycobacteriaceae Corynebacterium
Bacteria Pseudomonadota NA NA NA NA
Bacteria Pseudomonadota Gammaproteobacteria Burkholderiales Burkholderiaceae Caldimonas
Bacteria Bacteroidota Bacteroidia Cytophagales Hymenobacteraceae Hymenobacter

Species assignment:

taxa <- as.matrix(gtdb_tax_seqtab_length)
rownames(taxa) <- colnames(seqtab_length)

taxa <- dada2::addSpecies(taxa,
                         "gtdb/gtdb_species_train_set.fasta",
                          allowMultiple = TRUE)

taxa %>%
  knitr::kable(format = "html") %>%
  kableExtra::kable_styling() %>%
  kableExtra::scroll_box(height = "400px")
Kingdom Phylum Class Order Family Genus Species
TGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGTCTTCGGATTGTAAAGCACTTTCGACGGGGACGATGATGACGGTACCCGTAGAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGCTCGGAATGACTGGGCGTAAAGGGCGTGTAGGCGGTTTGTACAGTCAGATGTGAAATCCCCGGGCTTAACCTGGGAGCTGCATTTGATACGTGCAGACTAGAGTGTGAGAGAGGGTTGTGGAATTCCCAGTGTAGAGGTGAAATTCGTAGATATTGGGAAGAACACCGGTGGCGAAGGCGGCAACCTGGCTCATTACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Alphaproteobacteria Acetobacterales Acetobacteraceae Acetobacter ascendens/ghanensis/oryzifermentans/oryzoeni/pasteurianus
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCGGATTAGTAAGTTAAAAGTGAAATCCCAAGGCTCAACCTTGGAATTGCTTTTAAAACTGCTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTAGTGTAGAGGTGAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAGGCGCGAAGGCGTGGGGAGCAAACA Bacteria Pseudomonadota Alphaproteobacteria Rickettsiales Anaplasmataceae Wolbachia sp936270435
TGGGGAATATTGGACAATGGGCGCAAGCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGTTTTCGGATTGTAAAGCACTTTCAGCGGGGACGATGATGACGGTACCCGCAGAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCAAGCGTTGCTCGGAATGACTGGGCGTAAAGGGCGCGTAGGCGGTTGTTACAGTCAGATGTGAAATTCCCGGGCTTAACCTGGGGGCTGCATTTGATACGTGACGACTAGAGTGTGAGAGAGGGTTGTGGAATTCCCAGTGTAGAGGTGAAATTCGTAGATATTGGGAAGAACACCGGTGGCGAAGGCGGCAACCTGGCTCATGACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Alphaproteobacteria Acetobacterales Acetobacteraceae NA NA
TGGGGAATATTGGACAATGGGGGAAACCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGTTTTCGGATTGTAAAGCACTTTCGGCGGGGACGATGATGACGGTACCCGCAGAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGCTCGGAATGACTGGGCGTAAAGGGCGTGTAGGCGGTTTGTACAGTCAGATGTGAAATCCCCGGGCTTAACCTGGGAGCTGCATTTGATACGTGCAGACTAGAGTATGAGAGAGGGTTGTGGAATTCTCAGTGTAGAGGTGAAATTCGTAGATATTGGGAAGAACACCGGTGGCGAAGGCGGCAACCTGGCTCATTACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Alphaproteobacteria Acetobacterales Acetobacteraceae Acetobacter NA
TAGGGAATCTTCCACAATGGACGAAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAACTCTGTTGTTGGAGAAGAACGTATTTGATAGTAACTGATCAGGTAGTGACGGTATCCAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCTTCGGCTCAACCGAAGAAGTGCATCGGAAACTGGGAAACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCATGGGTAGCAAACA Bacteria Bacillota Bacilli Lactobacillales Lactobacillaceae Latilactobacillus curvatus
TAGGGAATCTTCCACAATGGACGAAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGGTTTCGGCTCGTAAAACTCTGTTGTTAAAGAAGAACATATCTGAGAGTAACTGTTCAGGTATTGACGGTATTTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGATGTGAAAGCCTTCGGCTCAACCGAAGAAGTGCATCGGAAACTGGGAAACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGTATGGGTAGCAAACA Bacteria Bacillota Bacilli Lactobacillales Lactobacillaceae Lactiplantibacillus fabifermentans/paraplantarum/pentosus/pingfangensis/plantarum/plantarum_A
TAGGGAATCTTCCACAATGGACGAAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAACTCTGTTGTTGGAGAAGAACGTATTTGATAGTAACTGATCAGGTAGTGACGGTATCCAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCTTCGGCTCAACCGAAGAAGTGCATCGGAAACTGGGGAACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCATGGGTAGCAAACA Bacteria Bacillota Bacilli Lactobacillales Lactobacillaceae Latilactobacillus NA
TAGGGAATCTTCCACAATGGACGAAAGTCTGATGGAACAACGCCGCGTGAGTGAAGAAGGGTTTCGGCTCGTAAAACTCTGTTGTTAAAGAAGAACATATCTGAGAGTAACTGTTCAGGTATTGACGGTATTTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGATGTGAAAGCCTTCGGCTCAACCGAAGAAGTGCATCGGAAACTGGGAAACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGTATGGGTAGCAAACA Bacteria Bacillota Bacilli Lactobacillales Lactobacillaceae Lactiplantibacillus NA
TAGGGAATCTTCCACAATGGACGAAAGTCTGATGGAGCAATGCCGCGTGAGTGAAGAAGGGTTTCGGCTCGTAAAACTCTGTTGTTAAAGAAGAACACCTTTGAGAGTAACTGTTCAAGGGTTGACGGTATTTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGGAGAAGTGCATCGGAAACTGGGAGACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTAGTCTGTAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACA Bacteria Bacillota Bacilli Lactobacillales Lactobacillaceae Levilactobacillus brevis
TAGGGAATCTTCCACAATGGACGAAAGTCTGATGGAGCAATGCCGCGTGAGTGAAGAAGGGTTTCGGCTCGTAAAACTCTGTTGTTAGAGAAGAACGGGCGTGAGAGTAACTGCTCACGTCGCGACGGTATCTAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTCTTCTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGGAGAAGTGCATCGGAAACTGGGAAACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCAGTGGCGAAGGCGGCTGTCTAGTCTGTAACTGACGCTGAGGCTCGAAAGCATGGGTAGCAAACA Bacteria Bacillota Bacilli Lactobacillales Lactobacillaceae Fructilactobacillus fructivorans
TAGGGAATCTTCCACAATGGGCGCAAGCCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGGTTTCGGCTCGTAAAACTCTGTTGTTAAAGAAGAACGAATCTGAGAGTAACTGTTCAGATTGTGACGGTATTTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGGGAGTGCAGGCGGTTACTTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGGAGAAGGGCATCGGAAACTGGGCAACTTGAGTGCAGAAGAGGGTAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTACCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCATGGGTAGCAAACA Bacteria Bacillota Bacilli Lactobacillales Lactobacillaceae Furfurilactobacillus NA
TAGGGAATCTTCCACAATGGACGAAAGTCTGATGGAGCAATGCCGCGTGAGTGAAGAAGGGTTTCGGCTCGTAAAACTCTGTTGTTAAAGAAGAACACCTTTGAGAGTAACTGTTCAAGGGTTGACGGTATTTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGGAGAAGTGCATCGGAAACTGGGAGACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTAGTCTGTAACTGACGCTGAGGCTCGAAAACATGGGTAGCGAACA Bacteria Bacillota Bacilli Lactobacillales Lactobacillaceae Levilactobacillus NA
TGGGGAATCTTGGTCAATGGGCGCAAGCCTGAACCAGCGATGCCGCGTGAGTGATGAAGGCCTTCGGGTTGTAAAGCTCTGTGGGGAGGGACGAAAGAATCGGTTCTAACAGGACCGAAGTTGACGGTACCTCCTTAGCAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAAGACAGAGGGTGCAAACGTTGCTCGGAATCACTGGGCGTAAAGGGCGTGTAGGCGGGAAGGTAAGTCGGGCGTGAAATCCCTGAGCTCAACTCAGGAACTGCACTCGAAACTACTTTTCTTGAGTGCCGGAGAGGAAAGCGGAATTCCTGGTGTAGAGGTGAAATTCGTAGATATCAGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGACGGTAACTGACGCTGAGACGCGAAAGCGTGGGGAGCAAACA NA NA NA NA NA NA NA
TGGGGAATATTGGACAATGGGGGAAACCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGTTTTCGGATTGTAAAGCACTTTCGGCGGGGACGATGATGACGGTACCCGCAGAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGCTCGGAATGACTGGGCGTAAAGGGCGTGTAGGCGGTTTGTACAGTCAGATGTGAAATCCCCGGGCTTAACCTGGGAGCTGCATTTGATACGTGCAGACTAGAGTGTGAGAGAGGGTTGTGGAATTCCCAGTGTAGAGGTGAAATTCGTAGATATTGGGAAGAACACCGGTGGCGAAGGCGGCAACCTGGCTCATTACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Alphaproteobacteria Acetobacterales Acetobacteraceae Acetobacter NA
TAGGGAATCTTCCGCAATGGGCGAAAGCCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTCTTCGGATCGTAAAACTCTGTTATTAGGGAAGAACAAATGTGTAAGTAACTATGCACGTCTTGACGGTACCTAATCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCAAACA Bacteria Bacillota Bacilli Staphylococcales Staphylococcaceae Staphylococcus caprae/epidermidis/saccharolyticus
TAAGGAATATTGGTCAATGGACGCAAGTCTGAACCAGCCATGCCGCGTGAAGGATGAAGGTCCTCTGGATTGTAAACTTCTTTTATAGGGGGCGAAAAAAGGTCTTTCTAGATCACTTGACAGTACCCTATGAATAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGCGGGTAGGTAAGTCAGAGGTGAAATCCTGGAGCTTAACTCCAGAACTGCCTTTGATACTATCTATCTTGAATATGGTGGAGGTAAGCGGAATATGTCATGTAGCGGTGAAATGCATAGATATGACATAGAACACCTATTGCGAAGGCAGCTTACTACGCCTATATTGACGCTGAGGCACGAAAGCGTGGGGATCAAACA Bacteria Bacteroidota Bacteroidia Chitinophagales Chitinophagaceae Sediminibacterium magnilacihabitans
TGGGGAATATTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTTGTGCGGGAAGATAATGACGGTACCGCAAGAATAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGCTCGGAATCACTGGGCGTAAAGGGTGCGTAGGCGGGTCTTTAAGTCAGGGGTGAAATCCTGGAGCTCAACTCCAGAACTGCCTTTGATACTGAAGATCTTGAGTTCGGGAGAGGTGAGTGGAACTGCGAGTGTAGAGGTGAAATTCGTAGATATTCGCAAGAACACCAGTGGCGAAGGCGGCTCACTGGCCCGATACTGACGCTGAGGCACGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Alphaproteobacteria Rhizobiales Xanthobacteraceae NA NA
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCGGATTAGTAAGTTAAAAGTGAAATCCCAAGGCTCAACCTTGGAATTGCTTTTAAAACTGCTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTAGTGTAGAGGTGAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAGGCGCGAAGGCGTGGGGAGCAACCA Bacteria Pseudomonadota Alphaproteobacteria Rickettsiales Anaplasmataceae Wolbachia NA
TGAGGAATATTGGTCAATGGACGAGAGTCTGAACCAGCCAAGTAGCGTGAAGGATGACTGCCCTATGGGTTGTAAACTTCTTTTATACGGGAATAAAGTGAGGCACGTGTGCCTTTTTGTATGTACCGTATGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACGCTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGGTGTCTTGAGTACAGTAGAGGCAGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTTGCTGGACTGTAACTGACGCTGATGCTCGAAAGTGTGGGTATCAAACA Bacteria Bacteroidota Bacteroidia Bacteroidales Bacteroidaceae Bacteroides NA
TGGGGAATATTGCGCAATGGGCGGAAGCCTGACGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCTCTGACGAAGCGAGAGTGACGGTAGGAGCAGAAGAAGCACCGGCCAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTCGCGTCGAATGTGAAAACCCGGGGCTCAACCCCGGGCCTGCATTCGATACGGGCAGACTAGAGTTCGGTAGGGGAGTCTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGACTCTGGGCCGATACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA Bacteria Actinomycetota Actinomycetes Mycobacteriales Nakamurellaceae Nakamurella NA
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGCCTTTTGGTTGTAAAGCACTTTAAGCGAGGAGGAGGCTACTTGGATTAATACTCTAGGATAGTGGACGTTACTCGCAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGGTGCGAGCGTTAATCGGATTTACTGGGCGTAAAGCGTGCGTAGGCGGCTTTTTAAGTCGGATGTGAAATCCCTGAGCTTAACTTAGGAATTGCATTCGATACTGGGAAGCTAGAGTATGGGAGAGGATGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCCATCTGGCCTAATACTGACGCTGAGGTACGAAAGCATGGGGAGCAAACA Bacteria Bacillota_A Clostridia Lachnospirales Lachnospiraceae Lachnospira NA
TGGGGAATTTTGGACAATGGGCGCAAGCCTGATCCAGCAATGCCGCGTGAGTGAAGAAGGCCTTCGGGTTGTAAAGCTCTTTTGTCAGGGAAGAAACGGCTGTGGCTAATATCCACGGCTAATGACGGTACCTGAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTTGTAAGTCTGTCGTGAAATCCCCGGGCTCAACCTGGGAATTGCGATGGAGACTGCAAGGCTAGAATCTGGCAGAGGGGGGTAGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAACACCGATGGCGAAGGCAGCCCCCTGGGTCAAGATTGACGCTCATGCACGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Gammaproteobacteria Burkholderiales Burkholderiaceae NA NA
TGAGGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGCAGGAAGACGGCCCTATGGGTTGTAAACTGCTTTTATAAGGGAATAAAGTGGGAGTCGTGACTCTTTTTGCATGTACCTTATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACA Bacteria Bacteroidota Bacteroidia Bacteroidales Bacteroidaceae Prevotella NA
TGGGGAATATTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGCCTTTTGGTTGTAAAGCACTTTAAGCGAGGAGGAGGCTACTTGGATTAATACTCTGAGATAGTGGACGTTACTCGCAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGGTGCGAGCGTTAATCGGATTTACTGGGCGTAAAGCGTGCGTAGGCGGCTTTTTAAGTCGGATGTGAAATCCCTGAGCTTAACTTAGGAATTGCATTCGATACTGGGAAGCTAGAGTATGGGAGAGGATGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCCATCTGGCCTAATACTGACGCTGAGGTACGAAAGCATGGGGAGCAAACA Bacteria NA NA NA NA NA NA
TGGGGAATATTGGACAATGGGGGGAACCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGCCTTTTGGTTGTAAAGCACTTTAAGCGAGGAGGAGGCTACTCTAGTTAATACCTAGAGATAGTGGACGTTACTCGCAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGGTGCGAGCGTTAATCGGATTTACTGGGCGTAAAGCGTGCGTAGGTGGCCAATTAAGTCAAATGTGAAATCCCCGAGCTTAACTTGGGAATTGCATTCGATACTGGTTGGCTAGAGTATGGGAGAGGATGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCCATCTGGCCTAATACTGACACTGAGGTACGAAAGCATGGGGAGCAAACA Bacteria Pseudomonadota Gammaproteobacteria Pseudomonadales Moraxellaceae Acinetobacter NA
TAGGGAATATTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTTATCCGGAAAGAAAAGCGCTGGGTTAATACCCTGGTGTCCTGACGGTACCGGAAGAATAAGCACCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGTGCAAGCGTTACTCGGAATTACTGGGCGTAAAGCGTGCGTAGGTGGTTTGTTAAGTCTGATGTGAAAGCCCTGGGCTCAACCTGGGAATTGCATTGGATACTGGCAGGCTAGAGTGCGGTAGAGGGTAGTGGAATTCCCGGTGTAGCAGTGAAATGCGTAGAGATCGGGAGGAACATCTGTGGCGAAGGCGACTACCTGGACCAGCACTGACACTGAGGCACGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Gammaproteobacteria Xanthomonadales Xanthomonadaceae Thermomonas carbonis
TGGGGAATATTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGAGTGATGAAGGCCTTAGGGTTGTAAAGCTCTTTTGTCCGGGACGATAATGACGGTACCGGAAGAATAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGCTCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGCCATTCAAGTCGGGGGTGAAAGCCTGTGGCTCAACCACAGAATTGCCTTCGATACTGTTTGGCTTGAGTTTGGTAGAGGTTGGTGGAACTGCGAGTGTAGAGGTGAAATTCGTAGATATTCGCAAGAACACCAGTGGCGAAGGCGGCCAACTGGACCAACACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Alphaproteobacteria Rhizobiales Beijerinckiaceae Methylobacterium sp001423085
TGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGCGAGGAGGAAGGGTAGTGTGTTAATAGCACATTGCATTGACGTTACTCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGCGCTTAACGTGGGAACTGCATTTGAAACTGGCAAGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Gammaproteobacteria Enterobacterales Enterobacteriaceae NA NA
TGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAACGAAGAAGGTATTCGTATCGTAAAGTTCTGTCCTATGGGAAGATAATGACAGTACCATAGAAGAAAGCTCCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGAGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGTACGCAGGCGGTTTAATAAGTCGAATGTTAAAGATCGGGGCTCAACCCCGTAAAGCATTGGAAACTGATAAACTTGAGTAGTGGAGAGGAAAGTGGAATTCCTAGTGTAGTGGTGAAATACGTAGATATTAGGAGGAATACCAGTAGCGAAGGCGACTTTCTGGACACAAACTGACGCTGAGGTACGAAAGCGTGGGGAGCAAACA Bacteria Bacillota_A Clostridia Tissierellales Peptoniphilaceae Finegoldia magna/magna_F/magna_H/magna_J/sp910589545
TGGGGAATATTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGCCTTTTGGTTGTAAAGCACTTTAAGCGAGGAGGAGGCTACCTGGATTAATACTCTAGGATAGTGGACGTTACTCGCAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGGTGCGAGCGTTAATCGGATTTACTGGGCGTAAAGCGTGCGTAGGCGGCTTTTTAAGTCGGATGTGAAATCCCTGAGCTTAACTTAGGAATTGCATTCGATACTGGGAAGCTAGAGTATGGGAGAGGATGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCCATCTGGCCTAATACTGACGCTGAGGTACGAAAGCATGGGGAGCAAACA Bacteria NA NA NA NA NA NA
TGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGCCTTTTGGTTGTAAAGCACTTTAAGCAGGGAGGAGAGGCTAATGGTTAATACCCATTAGATTAGACGTTACCTGCAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGAGTGTAGGTGGCTCATTAAGTCACATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATGTGATACTGGTGGTGCTAGAATATGTGAGAGGGAAGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCTTCCTGGCATAATATTGACACTGAGATTCGAAAGCGTGGGTAGCAAACA Bacteria Pseudomonadota Gammaproteobacteria Pseudomonadales Moraxellaceae Moraxella_A osloensis/sp002478835/sp902506215
TGGGGAATATTGCACAATGGGCGGAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCGTAAGTGACGGTACCTGCAGAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGCGGTTTGTCACGTCGTCTGTGAAATCCTAGGGCTTAACCCTGGACGTGCAGGCGATACGGGCTGACTTGAGTACTACAGGGGAGACTGGAATTTCTGGTGTAGCGGTGGAATGCACAGATATCAGGAAGAACACCGATGGCGAAGGCAGGTCTCTGGGTAGTAACTGACGCTGAGGAGCGAAAGCATGGGTAGCGAACA Bacteria Actinomycetota Actinomycetes Mycobacteriales Mycobacteriaceae Lawsonella clevelandensis_A
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGAGTGATGAAGGCCTTAGGGTTGTAAAGCTCTTTTACCCGGGATGATAATGACAGTACCGGGAGAATAAGCTCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGAGCTAGCGTTATTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCTTTGTAAGTTAGAGGTGAAAGCCTGGAGCTCAACTCCAGAATTGCCTTTAAGACTGCATCGCTTGAATCCAGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGGCTCACTGGACTGGTATTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Alphaproteobacteria Sphingomonadales Sphingomonadaceae NA NA
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGAGTGATGAAGGCCTTAGGGTTGTAAAGCTCTTTTACCCGGGATGATAATGACAGTACCGGGAGAATAAGCTCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGCTTTGTAAGTTAGAGGTGAAAGCCTGGAGCTCAACTCCAGAATTGCCTTTAAGACTGCATCGCTCGAATCCGGGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGGCTCACTGGACCGGTATTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Alphaproteobacteria Sphingomonadales Sphingomonadaceae Sphingomonas sp900109205
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGTCTTCGGATTGTAAAGCACTTTAAGTTGGGAGGAAGGGCAGTGAATTAATACTTTACTGTTTTGACGTTACCGACAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTGTTAAGTTGAATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCCAAAACTGGCAAGCTAGAGTAGGGCAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGGCTCATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Gammaproteobacteria Pseudomonadales Pseudomonadaceae NA NA
TGGGGAATATTGGACAATGGGCGCAAGCCTGATCCAGCCATACCGCGTGGGTGAAGAAGGCCTTCGGGTTGTAAAGCCCTTTTGTTGGGAAAGAAAAGCAGTCGGCTAATACCCGGTTGTTCTGACGGTACCCAAAGAATAAGCACCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGTGCAAGCGTTACTCGGAATTACTGGGCGTAAAGCGTGCGTAGGTGGTTGTTTAAGTCTGTTGTGAAAGCCCTGGGCTCAACCTGGGAATTGCAGTGGATACTGGGCGACTAGAGTGTGGTAGAGGGTAGTGGAATTCCCGGTGTAGCAGTGAAATGCGTAGAGATCGGGAGGAACATCCATGGCGAAGGCAGCTACCTGGACCAACACTGACACTGAGGCACGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Gammaproteobacteria Xanthomonadales Xanthomonadaceae Stenotrophomonas bentonitica_A/rhizophila_A/sp018612915
TGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGGGGGATGACGGCCTTCGGGTTGTAAACCCCTTTCGGCAGGGACGAAGCTTTTGTGACGGTACCTGCATAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCGCGTCGTCTGTGAAATCCCGGGGCTTAACTCCGGGCGTGCAGGCGATACGGGCATAACTTGAGTGCTGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCATGGGTAGCGAACA Bacteria Actinomycetota Actinomycetes Mycobacteriales Mycobacteriaceae Corynebacterium bovis
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGTCTTCGGATTGTAAAGCACTTTAAGTTGGGAGGAAGGGCTGCTGGTTAATACCCTGCAGTTTTGACGTTACCAACAGAATAAGCACCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTGGGTAAGTTGAATGTGAAAGCCCCGGGCTCAACCTGGGAACTGCATCCAAAACTGCCCGACTAGAGTACGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Gammaproteobacteria Pseudomonadales Pseudomonadaceae NA NA
TGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA Bacteria NA NA NA NA NA NA
TAGGGAATCTTCGGCAATGGGGGCAACCCTGACCGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTAAGTCAAGAACGGGTGTGAGAGTGGAAAGTTCACACTGTGACGGTAGCTTACCAGAAAGGGACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTGATAAGTCTGAAGTTAAAGGCTGTGGCTCAACCATAGTTCGCTTTGGAAACTGTCAAACTTGAGTGCAGAAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCGAACA Bacteria NA NA NA NA NA NA
TGGGGAATATTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGCCTTTTGGTTGTAAAGCACTTTAAGCGAGGAGGAGGCTACTTGGATTAATACTCTAGGATAGTGGACGTTACTCGCAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGGTGCGAGCGTTAATCGGATTTACTGGGCGTAAAGCGTGCGTAGGCGGCTTTTTAAGTCGGATGTGAAATCCCTGAGCTTAACTTAGGAATTGCATTCGATACTGGGAAGCTAGAGTATGGGAGAGGATGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCCATCTGGCCTAATACTGACGCTGAGGTACGAAAGCATGGGGAGCAAACA Bacteria NA NA NA NA NA NA
TGGGGAATCTTGCGCAATGCACGAAAGTGTGACGCAGCAACGCCGCGTGCGGGAAGAAGGTCTTCGGATCGTAAACCGCTTTCAGTTGGGACGAGGGTTGCACGGTGAATAGCCGGCGTAACTGACGGTACCTTCAGAAGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTATCCGGAATCATTGGGCGTAAAGCGCGTGTAGGCGGTTCAGTAAGTCCGCTCTGAAAGCCCAGGGCTCAACCCTGGGAGGCGGGTGGATACTACTGGACTAGAGTACGGAAGAGGAGATTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCAACAGCGAAGGCAGATCTCTGGGACGTAACTGACGCTGAGACGCGAAAGCGTGGGGAGCAAACA Bacteria Actinomycetota Thermoleophilia Solirubrobacterales Solirubrobacteraceae JAGIBJ01 NA
TGGGGAATCTTGCGCAATGGGCGAAAGCCTGACGCAGCAACGCCGCGTGGGGGATGAAGGCTCTCGGGTTGTAAACCCCTTTCAGCAGGGACGAATCAAGACGGTACCTGCAGAAGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAACACGTAGGGGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGTGGCTGTGTAAGTCGGGTGTTAAATCCCCAGGCTCAACCTGGGACGCCACCCGATACTGCACTAGCTAGAGTCCGGTAGGGGAGCGTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCAGCGGCGAAGGCGGCGCTCTGGGCCGGTACTGACACTGAGGAGCGAAAGCGTGGGTAGCAAACA Bacteria Actinomycetota Acidimicrobiia Acidimicrobiales NA NA NA
TGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTGTTAGCATCGAAGAAGCGAAAGTGACGGTAGGTGCAGAGAAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTTGTAGGCGGTTGGTCGCGTCTGCTGTGAAAGGCTGGGGCTTAACCCTGGTTTTGCAGTGGGTACGGGCTAACTAGAGTGCAGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCTGTAACTGACGCTGAGAAGCGAAAGCATGGGGAGCGAACA Bacteria Actinomycetota Actinomycetes Actinomycetales Micrococcaceae Rothia dentocariosa
TGGGGAATTTTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGCAGGATGAAGGCCTTCGGGTTGTAAACTGCTTTTGTACGGAACGAAAAGTCTCGGGATAACACCCCGGGACCATGACGGTACCGTAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTATGTAAGACAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGTGACTGCATAGCTGGAGTACGGCAGAGGGGGATGGAATTCCGCGTGTAGCAGTGAAATGCGTAGATATGCGGAGGAACACCGATGGCGAAGGCAATCCCCTGGGCCTGTACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Gammaproteobacteria Burkholderiales Burkholderiaceae Comamonas antarctica
TGGGGAATATTGCACAATGGGCGAAAGTCTGATGCAGCAACGCCGCGTGAGCGATGAAGGCCTTCGGGTCGTAAAGCTCTGTCCTCAAGGAAGATAATGACGGTACTTGAGGAGGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCTAGCGTTATCCGGAATTACTGGGCGTAAAGGGTGCGTAGGTGGTTTCTTAAGTCAGAGGTGAAAGGCTACGGCTCAACCGTAGTAAGCCTTTGAAACTGGGAAACTTGAGTGCAGGAGAGGAGAGTGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTTGCGAAGGCGGCTCTCTGGACTGTAACTGACACTGAGGCACGAAAGCGTGGGGAGCAAACA Bacteria Bacillota_A Clostridia Peptostreptococcales Peptostreptococcaceae NA NA
TAGGGAATCTTCCGCAATGGACGAAAGTCTGACGGAGCAACGCCGCGTGAGCGAAGAAGGTCTTCGGATTGTAAAGCTCTGTTGTTAGGGAAGAACAAGTACCGTTCGAAGAGGGCGGTACCGTGACGGTACCTAACGAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGCGCGCGCAGGCGGTCCTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGGGACTTGAGTGCAGAAGAGGAGAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACA Bacteria Bacillota Bacilli Bacillales Anoxybacillaceae Anoxybacillus_A NA
TGGGGGATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGGGGGATGAAGGCCTTCGGGTTGTAAACTTCTTTTGCTCTGAACAAGGCATGCCAATGGGTGTGTTGAGTGTAGGGGTTGATTAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTTGTAGGTGGCTGGTTGCGTCTGTCGTGAAAGCTCATGGCTTAACTGTGGGTTTGCGGTGGGTACGGGCTGGCTTGAGTGCAGTAGGGGAGGCTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAATACCGGTGGCGAAGGCGGGTCTCTGGGCTGTTACTGACACTGAGGAGCGAAAGCATGGGGAGCGAACA Bacteria Actinomycetota Actinomycetes Actinomycetales Actinomycetaceae Varibaculum NA
TGGGGAATTTTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGCGGGAAGAAGGCCTTCGGGTTGTAAACCGCTTTTGTCAGGGAAGAAAAGGTTCTGGTTAATACCTGGGACTCATGACGGTACCTGAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTATGCAAGACAGAGGTGAAATCCCCGGGCTCAACCTGGGAACTGCCTTTGTGACTGCATGGCTAGAGTACGGTAGAGGGGGATGGAATTCCGCGTGTAGCAGTGAAATGCGTAGATATGCGGAGGAACACCGATGGCGAAGGCAATCCCCTGGACCTGTACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Gammaproteobacteria Burkholderiales Burkholderiaceae Pelomonas aquatica
TGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGCCTTTTGGTTGTAAAGCACTTTAAGCAGTGAAGAAGACTCTATGGTTAATACCCATAGACGATGACATTAGCTGCAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGAGCGTAGGTGGCTTGATAAGTCAGATGTGAAAGCCCCGGGCTTAACCTGGGAACGGCATCTGATACTGTTAGGCTAGAGTAGGTGAGAGGAAGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCCTTCTGGCATCATACTGACACTGAGGTTCGAAAGCGTGGGTAGCAAACA Bacteria Pseudomonadota Gammaproteobacteria Pseudomonadales Moraxellaceae Psychrobacter faecalis
TGGGGAATTTTGGACAATGGGCGCAAGCCTGATCCAGCAATGCCGCGTGAGTGAAGAAGGCCTTCGGGTTGTAAAGCTCTTTTGTCAGGGAAGAAACGGTGAAGGCTAATATCCTTTGCTAATGACGGTACCTGAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTTGTAAGTCTGATGTGAAATCCCCGGGCTCAACCTGGGAATTGCATTGGAGACTGCAAGGCTAGAATCTGGCAGAGGGGGGTAGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAACACCGATGGCGAAGGCAGCCCCCTGGGTCAAGATTGACGCTCATGCACGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Gammaproteobacteria Burkholderiales Burkholderiaceae NA NA
TTAGGAATCTTCCCCAATGGGCGCAAGCCTGAGGGAGCGACGCCGCGTGAGGGATGAAGGTCCTCGGATTGTAAACCTCTGAACGAGTGACGAAAGGCCCCGGATGGGGAGATGACGGTAGCTCGGTAATAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTACCCGGAATCACTGGGCGTAAAGGGCGTGTAGGCGGAAGGTTAAGTCCGACTTTAAAGACCGGGGCTCAACCCCGGGAGTGGGTTGGAGACTGGCTTTCTGGACCTCTGGAGAGGCAACTGGAATTCCTGGTGTAGCGGTGGAATGCGTAGATACCAGGAGGAACACCAATGGCGAAGGCAGGTTGCTGGACAGAAGGTGACGCTGAGGCGCGAAAGTGTGGGGAGCGAACC Bacteria Deinococcota Deinococci Deinococcales Deinococcaceae Deinococcus NA
TAGGGAATCTTCGGCAATGGACGGAAGTCTGACCGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTAAGAGAAGAACGAGTATGAGAGTGGAAAGTTCGTACTGTGACGGTATCTTACCAGAAAGGGACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGCTTTGGAAACTGTTTAACTTGAGTGCAAGAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGCTTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACA Bacteria NA NA NA NA NA NA
TGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGTGTTGAGGTTAATAACCTCAGCAATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Gammaproteobacteria Enterobacterales Enterobacteriaceae NA NA
TGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGGGGGATGACGGCCTTCGGGTTGTAAACTCCTTTCAGCCATGACGAAGCCCTTGTGGTGACGGTAGTGGTAGAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTCTGTCGCGTCATTTGTGAAAGCCCGGGGCTTAACTCCGGGTTGGCAGGTGATACGGGCATGACTGGAGTACTGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCATGGGTAGCGAACA Bacteria Actinomycetota Actinomycetes Mycobacteriales Mycobacteriaceae Corynebacterium kroppenstedtii
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGTCTTCGGATTGTAAAGCACTTTAAGTTGGGAGGAAGGGCAGTTACCTAATACGTGATTGTTTTGACGTTACCGACAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTAGTTAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCAAAACTGACTGACTAGAGTATGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota NA NA NA NA NA
TGGGGAATTTTGGACAATGGGGGCAACCCTGATCCAGCCATGCCGCGTGCGGGAAGAAGGCCTTCGGGTTGTAAACCGCTTTTGTTCGGGAAGAAATCCTCTGGGCTAATACCCCGGGGGGATGACGGTACCGGAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTGTGCAAGACAGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATTTGTGACTGCACGGCTAGAGTGCGGCAGAGGGGAGTGGAATTCCGCGTGTAGCAGTGAAATGCGTAGATATGCGGAGGAACACCGATGGCGAAGGCAGCTCCCTGGGCCTGCACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACA Bacteria Pseudomonadota Gammaproteobacteria Burkholderiales Burkholderiaceae Caldimonas aquatica
TAGGGAATATTGGGCAATGGGCGAGAGCCTGACCCAGCCATGCCGCGTGCCGGATGAAGGCCTTCTGGGTTGTAAACGGCTTTTCTCAGGGAAGAAAAAGACCTTGCGAGGTACACTGACGGTACCTGAGGAATAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTGTCCGGATTTATTGGGTTTAAAGGGTGCGTAGGTGGCTTGTTAAGTCCGGGGTGAAAGCCCACAGCTCAACTGTGGAACTGCCCTGGATACTGGCGAGCTTGAGTCCAGACGAGGTTGGCGGAATGGAAGCTGTAGCGGTGAAATGCATAGATAGCTTCCAGAACCCCGATTGCGTAGGCAGCTGACTAGGCTGGTACTGACACTGAGGCACGAAAGCGTGGGGAGCGAACA Bacteria Bacteroidota Bacteroidia Cytophagales Hymenobacteraceae Hymenobacter NA

Phyloseq (McMurdie and Holmes, 2013)

Phyloseq is an R package designed to import, store, analyze, and visualize complex microbiome data. Providing a powerful, unified data structure that seamlessly combines abundance tables (OTUs or ASVs), taxonomy, sample data, and phylogenetic trees into a single object. This integration simplifies data management and enables a wide range of downstream analyses and ggplot2-based visualizations within a consistent workflow.

Model matrix:

samples <- rownames(seqtab_length)
mm <- data.frame(sample = gsub("_\\d+.fq.gz$", "", samples),
                 species = ifelse(grepl("Dmel", samples), "D. melanogaster", "D. subobscura"))

rownames(mm) <- samples
mm
##                            sample         species
## Dmel_K_St_M_1.fq.gz   Dmel_K_St_M D. melanogaster
## Dmel_Sl_St_F_1.fq.gz Dmel_Sl_St_F D. melanogaster
## Dmel_Sl_St_M_1.fq.gz Dmel_Sl_St_M D. melanogaster
## Dmel_St_K_1.fq.gz       Dmel_St_K D. melanogaster
## Dmel_St_Sl_1.fq.gz     Dmel_St_Sl D. melanogaster
## Dsub_K_St_F_1.fq.gz   Dsub_K_St_F   D. subobscura
## Dsub_K_St_M_1.fq.gz   Dsub_K_St_M   D. subobscura
## Dsub_Sl_St_F_1.fq.gz Dsub_Sl_St_F   D. subobscura
## Dsub_Sl_St_M_1.fq.gz Dsub_Sl_St_M   D. subobscura
## Dsub_St_K_1.fq.gz       Dsub_St_K   D. subobscura
## Dsub_St_Sl_1.fq.gz     Dsub_St_Sl   D. subobscura

Create phyloseq object:

ps <- phyloseq::phyloseq(phyloseq::otu_table(seqtab_length, taxa_are_rows = FALSE),
                         phyloseq::sample_data(mm),
                         phyloseq::tax_table(taxa)) 
ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 58 taxa and 11 samples ]
## sample_data() Sample Data:       [ 11 samples by 2 sample variables ]
## tax_table()   Taxonomy Table:    [ 58 taxa by 7 taxonomic ranks ]

Filter methods (take specific taxa for further analysis):

ps_acetobacteraceae <- phyloseq::subset_taxa(ps, Family == "Acetobacteraceae")
ps_acetobacteraceae
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 4 taxa and 11 samples ]
## sample_data() Sample Data:       [ 11 samples by 2 sample variables ]
## tax_table()   Taxonomy Table:    [ 4 taxa by 7 taxonomic ranks ]

Filter methods (take only abundant ASVs):

phyloseq::filter_taxa(ps, function(x) any(x > 100), prune = TRUE)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 13 taxa and 11 samples ]
## sample_data() Sample Data:       [ 11 samples by 2 sample variables ]
## tax_table()   Taxonomy Table:    [ 13 taxa by 7 taxonomic ranks ]

Aggregation methods based on taxonomy:

phyloseq::tax_glom(ps, "Genus")
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 32 taxa and 11 samples ]
## sample_data() Sample Data:       [ 11 samples by 2 sample variables ]
## tax_table()   Taxonomy Table:    [ 32 taxa by 7 taxonomic ranks ]
phyloseq::tax_glom(ps, "Order")
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 22 taxa and 11 samples ]
## sample_data() Sample Data:       [ 11 samples by 2 sample variables ]
## tax_table()   Taxonomy Table:    [ 22 taxa by 7 taxonomic ranks ]

Various plots:

phyloseq::plot_bar(ps, fill = "Family")

Transformation of counts + full integration of plotting functions with ggplot2:

ps_percentage <- phyloseq::transform_sample_counts(ps, function(x) x / sum(x) * 100)
ps_percentage <- phyloseq::transform_sample_counts(ps, function(x) x / sum(x))
phyloseq::plot_bar(ps_percentage, fill = "Family") + 
  scale_y_continuous(labels = scales::percent) +
  theme_bw() +
  ggtitle("Relative Abundance by Family") +
  theme(
    axis.text.x = element_text(
      angle = 45,      
      hjust = 1,       
      vjust = 1,      
      size = 10       
    )
  )

Alpha diversity:

phyloseq::estimate_richness(ps) %>%
  knitr::kable()
Observed Chao1 se.chao1 ACE se.ACE Shannon Simpson InvSimpson Fisher
Dmel_K_St_M_1.fq.gz 39 39 0 39 2.8823068 0.5763420 0.2156292 1.274907 3.7755474
Dmel_Sl_St_F_1.fq.gz 17 17 0 17 1.9703687 1.1658685 0.5604619 2.275116 1.5462610
Dmel_Sl_St_M_1.fq.gz 20 20 0 20 2.1330729 0.3380089 0.1184892 1.134416 1.8124576
Dmel_St_K_1.fq.gz 6 6 0 6 0.9128709 0.5546775 0.3437347 1.523774 0.4989470
Dmel_St_Sl_1.fq.gz 7 7 0 7 1.1952286 0.5514376 0.3296458 1.491749 0.5656896
Dsub_K_St_F_1.fq.gz 4 4 0 NaN NaN 0.1009014 0.0336597 1.034832 0.3394968
Dsub_K_St_M_1.fq.gz 4 4 0 NaN NaN 0.0498331 0.0145750 1.014791 0.3296388
Dsub_Sl_St_F_1.fq.gz 4 4 0 4 0.8660254 0.2809978 0.1443709 1.168731 0.3100068
Dsub_Sl_St_M_1.fq.gz 6 6 0 6 0.9128709 0.6032352 0.4060142 1.683542 0.5002721
Dsub_St_K_1.fq.gz 8 8 0 8 0.9354143 0.1860356 0.0792211 1.086037 0.6675988
Dsub_St_Sl_1.fq.gz 9 9 0 9 1.2472191 0.4211907 0.1971506 1.245564 0.7784515

Alpha diversity plot:

phyloseq::plot_richness(ps,
                        x = "species", # this is the "soecies" from our model matrix
                        color = "species")

Rarefaction + alpha diversity:

ps_rarefied <- phyloseq::rarefy_even_depth(ps)
phyloseq::estimate_richness(ps_rarefied) %>%
  knitr::kable()
Observed Chao1 se.chao1 ACE se.ACE Shannon Simpson InvSimpson Fisher
Dmel_K_St_M_1.fq.gz 38 38 0.0000000 38.000000 3.0435436 0.5726938 0.2134447 1.271366 4.0883524
Dmel_Sl_St_F_1.fq.gz 17 17 0.1616904 17.365497 1.9658972 1.1711663 0.5609636 2.277716 1.6681246
Dmel_Sl_St_M_1.fq.gz 20 20 0.1218349 20.564043 2.2420828 0.3310231 0.1161814 1.131454 1.9978590
Dmel_St_K_1.fq.gz 6 6 0.0000000 6.000000 0.9128709 0.5532551 0.3426930 1.521359 0.5291344
Dmel_St_Sl_1.fq.gz 6 6 0.0000000 6.000000 1.1547005 0.5506788 0.3288589 1.489999 0.5291344
Dsub_K_St_F_1.fq.gz 4 4 0.0000000 NaN NaN 0.0971727 0.0322545 1.033330 0.3394968
Dsub_K_St_M_1.fq.gz 4 4 0.0000000 NaN NaN 0.0533940 0.0157810 1.016034 0.3394968
Dsub_Sl_St_F_1.fq.gz 4 4 0.4330127 NaN NaN 0.2758875 0.1410752 1.164246 0.3394968
Dsub_Sl_St_M_1.fq.gz 6 6 0.2282177 7.032922 1.2696131 0.5989098 0.4033108 1.675914 0.5291344
Dsub_St_K_1.fq.gz 8 8 0.4677072 8.565972 1.3056819 0.1869213 0.0789741 1.085746 0.7257330
Dsub_St_Sl_1.fq.gz 8 8 0.0000000 8.000000 1.2247449 0.4284049 0.2012855 1.252012 0.7257330

Estimate a GTR + gamma + inv model ML phyogetic tree and add to phylosec object:

gtr_tree <- function(ps){
  # 1. Extract sequences from the phyloseq object
  seqs <- colnames(phyloseq::otu_table(ps))
  
  # 2. Assign names to the sequences
  names(seqs) <- seqs
  
  # 3. Align the sequences using DECIPHER
  alignment <- DECIPHER::AlignSeqs(Biostrings::DNAStringSet(seqs),
                                   anchor = NA,
                                   verbose = FALSE,
                                   iterations = 10,
                                   refinements = 10)
  
  # 4. Mask Poorly aligned regions of a multiple sequence alignment. 
  # May lead to incorrect results in downstream analyses. 
  # One method to mitigate their effects is to mask columns of the alignment that may be poorly aligned,
  # such as highly-variable regions or regions with many insertions and deletions (gaps).
  alignment_masked <- DECIPHER::MaskAlignment(alignment,
                                              correction = TRUE,
                                              showPlot = FALSE)
  
  alignment <- as(alignment_masked, "DNAStringSet")

  # 5. Convert the alignment to a phangorn object
  phangAlign <- phangorn::phyDat(as(alignment, "matrix"),
                                 type = "DNA")
  
  # https://cran.r-project.org/web/packages/phangorn/vignettes/MLbyHand.html
  # fit a maximum likelihood model using General Time Reversible. substitution model, "G(4)" discrete gamma model, "I" invariant sites
  fitGTR <- phangorn::pml_bb(phangAlign,
                             model = "GTR+G(4)+I",
                             rearrangement = "stochastic")
  
  # 7.  Add the final phylogenetic tree to the phyloseq object
  phyloseq::phy_tree(ps) <- phangorn::midpoint(fitGTR$tree)
  return(ps)
}

ps <- gtr_tree(ps)

Plot the phylogenetic tree and decorate with abundances and sample information:

taxa <- phyloseq::tax_table(ps)
seqs <- rownames(taxa)
taxa_df <- as.data.frame(taxa)

taxa_df[,"Family_Genus"] <- ifelse(!is.na(taxa_df[,"Family"]) & !is.na(taxa_df[,"Genus"]),
                                   paste(taxa_df[,"Family"], taxa_df[,"Genus"]),
                                   taxa_df[,"Family"])
rownames(taxa_df) <- seqs


phyloseq::tax_table(ps) <- as.matrix(taxa_df)


phyloseq::plot_tree(ps,
                    size = "Abundance",
                    color = "Sample",
                    label.tips = "Family_Genus") +
  scale_color_manual(values = c("Dmel_K_St_M_1.fq.gz" = "red",
                                "Dmel_Sl_St_F_1.fq.gz" = "red",
                                "Dmel_Sl_St_M_1.fq.gz" = "red",
                                "Dmel_St_K_1.fq.gz" = "red",
                                "Dmel_St_K_1.fq.gz" = "red",
                                "Dmel_St_Sl_1.fq.gz" = "red",
                                "Dsub_K_St_F_1.fq.gz" = "dodgerblue",
                                "Dsub_K_St_M_1.fq.gz" = "dodgerblue",
                                "Dsub_Sl_St_F_1.fq.gz" = "dodgerblue",
                                "Dsub_Sl_St_M_1.fq.gz" = "dodgerblue",
                                "Dsub_St_K_1.fq.gz" = "dodgerblue",
                                "Dsub_St_Sl_1.fq.gz" = "dodgerblue"))

Various distance metrics. For example weighted unifrac (Lozupone and Knight, 2005):

wunifrac_dist <- phyloseq::distance(ps,
                                    method = "wunifrac",
                                    type = "samples")

wunifrac_dist
##                      Dmel_K_St_M_1.fq.gz Dmel_Sl_St_F_1.fq.gz
## Dmel_Sl_St_F_1.fq.gz        0.2784217416                     
## Dmel_Sl_St_M_1.fq.gz        0.0670253022         0.3408667239
## Dmel_St_K_1.fq.gz           0.4813945160         0.5913327770
## Dmel_St_Sl_1.fq.gz          0.4820056614         0.5914439461
## Dsub_K_St_F_1.fq.gz         0.5979706690         0.6939696020
## Dsub_K_St_M_1.fq.gz         0.5982594431         0.6940976185
## Dsub_Sl_St_F_1.fq.gz        0.5733874538         0.6772882795
## Dsub_Sl_St_M_1.fq.gz        0.5786895989         0.6807684427
## Dsub_St_K_1.fq.gz           0.5919359895         0.6898234760
## Dsub_St_Sl_1.fq.gz          0.5868037690         0.6852822098
##                      Dmel_Sl_St_M_1.fq.gz Dmel_St_K_1.fq.gz Dmel_St_Sl_1.fq.gz
## Dmel_Sl_St_F_1.fq.gz                                                          
## Dmel_Sl_St_M_1.fq.gz                                                          
## Dmel_St_K_1.fq.gz            0.4543574356                                     
## Dmel_St_Sl_1.fq.gz           0.4552081958      0.0143020262                   
## Dsub_K_St_F_1.fq.gz          0.5743697243      0.1431786202       0.1408633421
## Dsub_K_St_M_1.fq.gz          0.5746846360      0.1436303422       0.1413144741
## Dsub_Sl_St_F_1.fq.gz         0.5476504321      0.1640710758       0.1629227350
## Dsub_Sl_St_M_1.fq.gz         0.5534644232      0.1540421434       0.1529483530
## Dsub_St_K_1.fq.gz            0.5679253567      0.1151311847       0.1126606621
## Dsub_St_Sl_1.fq.gz           0.5626387675      0.1116805238       0.1072713518
##                      Dsub_K_St_F_1.fq.gz Dsub_K_St_M_1.fq.gz
## Dmel_Sl_St_F_1.fq.gz                                        
## Dmel_Sl_St_M_1.fq.gz                                        
## Dmel_St_K_1.fq.gz                                           
## Dmel_St_Sl_1.fq.gz                                          
## Dsub_K_St_F_1.fq.gz                                         
## Dsub_K_St_M_1.fq.gz         0.0006153382                    
## Dsub_Sl_St_F_1.fq.gz        0.0753384083        0.0759018075
## Dsub_Sl_St_M_1.fq.gz        0.0580447995        0.0586089296
## Dsub_St_K_1.fq.gz           0.0335539026        0.0341618584
## Dsub_St_Sl_1.fq.gz          0.0384633132        0.0390692739
##                      Dsub_Sl_St_F_1.fq.gz Dsub_Sl_St_M_1.fq.gz
## Dmel_Sl_St_F_1.fq.gz                                          
## Dmel_Sl_St_M_1.fq.gz                                          
## Dmel_St_K_1.fq.gz                                             
## Dmel_St_Sl_1.fq.gz                                            
## Dsub_K_St_F_1.fq.gz                                           
## Dsub_K_St_M_1.fq.gz                                           
## Dsub_Sl_St_F_1.fq.gz                                          
## Dsub_Sl_St_M_1.fq.gz         0.0181067009                     
## Dsub_St_K_1.fq.gz            0.0656932893         0.0571506156
## Dsub_St_Sl_1.fq.gz           0.0657348415         0.0571206377
##                      Dsub_St_K_1.fq.gz
## Dmel_Sl_St_F_1.fq.gz                  
## Dmel_Sl_St_M_1.fq.gz                  
## Dmel_St_K_1.fq.gz                     
## Dmel_St_Sl_1.fq.gz                    
## Dsub_K_St_F_1.fq.gz                   
## Dsub_K_St_M_1.fq.gz                   
## Dsub_Sl_St_F_1.fq.gz                  
## Dsub_Sl_St_M_1.fq.gz                  
## Dsub_St_K_1.fq.gz                     
## Dsub_St_Sl_1.fq.gz        0.0073500933

Beta Diversity using double principal coordinate analysis (DPCoA) (Pavoine et al., 2004):

ps_dpcoa <- phyloseq::ordinate(ps,
                               "DPCoA",
                               "wunifrac")

plot_ordination(ps, 
                ps_dpcoa,
                "samples",
                color = "species") +
  ggrepel::geom_text_repel(aes(label = sample)) +
  theme_bw()

Session Info:

sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: Europe/Budapest
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] DECIPHER_3.0.0      Biostrings_2.72.0   GenomeInfoDb_1.40.1
##  [4] XVector_0.44.0      IRanges_2.38.0      S4Vectors_0.42.0   
##  [7] BiocGenerics_0.50.0 ggrepel_0.9.6       phangorn_2.12.1    
## [10] ape_5.8-1           phyloseq_1.48.0     dada2_1.32.0       
## [13] Rcpp_1.0.12         lubridate_1.9.3     forcats_1.0.0      
## [16] stringr_1.5.1       dplyr_1.1.4         purrr_1.0.2        
## [19] readr_2.1.5         tidyr_1.3.1         tibble_3.2.1       
## [22] tidyverse_2.0.0     ggplot2_3.5.1      
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.2                   bitops_1.0-7               
##   [3] deldir_2.0-4                permute_0.9-8              
##   [5] rlang_1.1.4                 magrittr_2.0.3             
##   [7] ade4_1.7-22                 matrixStats_1.5.0          
##   [9] compiler_4.4.1              mgcv_1.9-1                 
##  [11] systemfonts_1.2.3           png_0.1-8                  
##  [13] vctrs_0.6.5                 reshape2_1.4.4             
##  [15] quadprog_1.5-8              pwalign_1.0.0              
##  [17] pkgconfig_2.0.3             crayon_1.5.2               
##  [19] fastmap_1.2.0               labeling_0.4.3             
##  [21] utf8_1.2.4                  Rsamtools_2.20.0           
##  [23] rmarkdown_2.27              tzdb_0.4.0                 
##  [25] UCSC.utils_1.0.0            xfun_0.44                  
##  [27] zlibbioc_1.50.0             cachem_1.1.0               
##  [29] jsonlite_1.8.8              biomformat_1.32.0          
##  [31] highr_0.11                  rhdf5filters_1.16.0        
##  [33] DelayedArray_0.30.1         Rhdf5lib_1.26.0            
##  [35] BiocParallel_1.38.0         jpeg_0.1-11                
##  [37] cluster_2.1.6               parallel_4.4.1             
##  [39] R6_2.5.1                    bslib_0.7.0                
##  [41] stringi_1.8.4               RColorBrewer_1.1-3         
##  [43] GenomicRanges_1.56.2        jquerylib_0.1.4            
##  [45] SummarizedExperiment_1.34.0 iterators_1.0.14           
##  [47] knitr_1.47                  splines_4.4.1              
##  [49] igraph_2.0.3                Matrix_1.7-0               
##  [51] timechange_0.3.0            tidyselect_1.2.1           
##  [53] rstudioapi_0.16.0           abind_1.4-8                
##  [55] yaml_2.3.8                  vegan_2.7-1                
##  [57] codetools_0.2-20            hwriter_1.3.2.1            
##  [59] lattice_0.22-6              plyr_1.8.9                 
##  [61] Biobase_2.64.0              withr_3.0.0                
##  [63] ShortRead_1.62.0            evaluate_0.23              
##  [65] survival_3.6-4              RcppParallel_5.1.10        
##  [67] xml2_1.3.6                  pillar_1.9.0               
##  [69] MatrixGenerics_1.16.0       foreach_1.5.2              
##  [71] generics_0.1.3              hms_1.1.3                  
##  [73] munsell_0.5.1               scales_1.3.0               
##  [75] glue_1.7.0                  tools_4.4.1                
##  [77] interp_1.1-6                data.table_1.15.4          
##  [79] GenomicAlignments_1.40.0    fastmatch_1.1-6            
##  [81] rhdf5_2.48.0                grid_4.4.1                 
##  [83] latticeExtra_0.6-30         colorspace_2.1-0           
##  [85] nlme_3.1-164                GenomeInfoDbData_1.2.12    
##  [87] cli_3.6.2                   textshaping_0.4.0          
##  [89] kableExtra_1.4.0            fansi_1.0.6                
##  [91] viridisLite_0.4.2           S4Arrays_1.4.1             
##  [93] svglite_2.2.1               gtable_0.3.5               
##  [95] sass_0.4.9                  digest_0.6.35              
##  [97] SparseArray_1.4.8           farver_2.1.2               
##  [99] multtest_2.60.0             htmltools_0.5.8.1          
## [101] lifecycle_1.0.4             httr_1.4.7                 
## [103] MASS_7.3-60.2