ANALYTICAL FRAMEWORK FOR ESTIMATING ANTIMICROBIAL RESISTANCE GENE ABUNDANCE IN METAGENOMIC SAMPLES OF ANIMAL AGRICULTURE ORIGIN By Leland K. Ackerson IV A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Animal Science – Master of Science 2023 ABSTRACT Antimicrobial resistance (AMR) has become an apex global public health threat that requires a multifaceted One Health approach. According to the CDC, 2.8 million antimicrobial resistant infections occur in the United States each year, resulting in more than 35,000 deaths. Although the development of AMR is incredibly intricate, it is widely recognized that the employment of antibiotics is one of the largest selective pressures of AMR. In many countries, antimicrobial consumption in animal agriculture surpasses that of human usage, and it is estimated that nearly 73% of global antibiotics can be attributed to livestock. Monitoring AMR emergence and historical data on a global scale is crucial when working towards the large-scale mitigation of this public health threat. One tool that can contribute to monitoring AMR is shotgun metagenomics, which entails comprehensive evaluation of the genetic material extracted from all the organisms in a complex sample. This subsequently gives genomic insights into the microorganisms residing in the sample of interest. The Sequence Read Archive (SRA) is a public repository housed by the National Center for Biotechnology Information (NCBI) containing extensive sequence data from metagenomic samples in animal agriculture, as well as the associated spatiotemporal attributes. Here we proposed to develop analytical framework to leverage the SRA and estimate relative antimicrobial resistant gene abundances across animal agriculture on a global scale from publicly available metagenomi c sequence information. The developed analytical framework was then employed to evaluate metagenomic samples from cattle and swine housed in the SRA. Estimated abundances are utilized as a proof of concept for evaluating AMR characteristics on a global scale using publicly available, highly heterogenous data. The resulting abundance estimation will offer insights into AMR emergence and dynamics as well as inform further development of mitigation strategies. This thesis is dedicated to family, the most important thing in this world. iii ACKNOWLEDGEMENTS I would like to express my deepest gratitude and appreciation to the individuals who have played significant roles in the completion of this thesis and my development as a scientist. Their unwavering support, guidance, and encouragement have been invaluable throughout this challenging journey. First and foremost, I would like to thank Dr. Wen Huang for going above and beyond both my expectations and his responsibilities as a mentor. Dr. Huang is a remarkable scientist and researcher who genuinely cares for each and every individual in his lab. From day one, I have appreciated Dr. Huang’s mentoring philosophy which is embodied by the ancient quote “Give a man a fish, and you feed him for a day. Teach a man to fish, and you feed him for a lifetime. ”. As a young researcher, I feel that learning how to think critically and developing the skills to independently find answers to difficult questions is vital to my success, both of which are encouraged by the aforementioned mentoring philosophy. I have received, and retained, a myriad of invaluable advice and guidance from Dr. Huang in the past couple years; all of which had my best interest in mind. I am incredibly fortunate to have Dr. Huang as my mentor, and I am ecstatic to return to his lab in pursuit of my PhD following the conferral of my MS. I would also like to thank the rest of my guidance committee members, Dr. Cedric Gondro, Dr. Pamela Ruegg, and Dr. Paul Coussens for their advice and support throughout the progression of the project. Each of them served an integral and unique role on my well-rounded guidance committee which made work efficient and enjoyable. I was fortunate enough to take multiple classes with Dr. Gondro, and we could often be found catching up and discussing new ideas in his office. This greatly influenced my positive attitude towards research and education despite any discouraging hardships encountered along the way. Dr. Ruegg and Dr. Coussens were both always more than willing to offer a helping hand, and with their expertise in their respective fields acting as guidance, iv my research and aspirations seemed drastically less daunting. Thank you all for your dedication to my research and as mentors on my guidance committee. Additionally, I am very appreciative of Samuel Snowden – his contributions regarding sample annotation were instrumental in the timely completion of this project. Moreover, many thanks to my close friends and fellow lab members for their consistent moral support throughout my academic endeavors. I would also like to thank my wonderful girlfriend, Susan Hoffman, for being by my side every step of the way. Thank you for being my rock and for always pushing me to reach new heights. I am extremely grateful to have you in my life. Finally, I would like to thank my mother and father, Angela Ackerson, and Leland Ackerson III. I would not be where I am today without each of you and your unwavering support. Your words of encouragement, reassurance, and belief in my potential have been instrumental in my academic success and development as an individual. These kind words fail to do my appreciation for you both justice, but from the bottom of my heart – thank you! v TABLE OF CONTENTS Chapter 1. Background and significance ...................................................................................... 1 1.1 Antimicrobial resistance is a global public health threat ..................................................... 1 1.2 Risks and impact of antimicrobial resistance ...................................................................... 2 1.3 Emergence of AMR .......................................................................................................... 3 1.4 Classification and mechanisms of antimicrobials ............................................................... 6 1.5 Mechanisms of antimicrobial resistance ........................................................................... 10 1.6 Monitoring and surveillance of AMR............................................................................... 12 1.7 One Health...................................................................................................................... 15 1.8 Metagenomic data in public databases ............................................................................. 16 Chapter 2. Development of an ARG estimation pipeline from metagenomic data ............. 17 2.1 Introduction .................................................................................................................... 17 2.2 Materials and Methods .................................................................................................... 19 2.2.1 Metagenomic data ........................................................................................................ 19 2.2.2 Reference databases ..................................................................................................... 19 2.2.3 Sequence processing .................................................................................................... 22 2.2.4 Gene quantification...................................................................................................... 24 2.3 Results............................................................................................................................. 28 2.3.1 Correlation between alignment and assembly based methods....................................... 28 2.3.2 Computational processing time .................................................................................... 29 2.4 Conclusion ...................................................................................................................... 30 Chapter 3. Application of developed framework on metagenomic samples ........................ 31 3.1 Introduction ....................................................................................................................... 31 3.2 Materials and methods..................................................................................................... 33 3.2.1 Experimental design..................................................................................................... 33 3.2.2 Sample curation ........................................................................................................... 33 3.2.3 Sample annotation ....................................................................................................... 35 3.2.4 Sequence data download and processing ...................................................................... 38 3.3 Results and discussion ..................................................................................................... 38 3.3.1 Analysis of ARGs in metagenomic samples of cattle origin .......................................... 39 3.3.2 Analysis of ARGs in metagenomic samples of swine origin ......................................... 40 3.3.3 Limitations of the developed tool ................................................................................ 41 3.4 Conclusion ...................................................................................................................... 42 BIBLIOGRAPHY ....................................................................................................................... 43 APPENDIX ................................................................................................................................. 49 vi Chapter 1. Background and significance 1.1 Antimicrobial resistance is a global public health threat Antimicrobial resistance (AMR) has become a serious global public health threat. AMR can be partitioned into two types of resistance: intrinsic and acquired. Intrinsic AMR can be defined as a universally shared resistance within a microbial species that is independent of previous antimicrobial exposure. Alternatively, acquired AMR is defined as the ability of a disease-causing microbe to survive exposure to an antimicrobial agent that was previously an effective treatment. As a result, treating infections caused by microorganisms such as bacteria, viruses, fungi, and parasites gradually becomes increasingly impracticable. The Center for Disease Control (CDC) and the World Health Organization (WHO) have both determined that drug resistance to potentially disease-causing pathogens is amongst the top global health security risks. The WHO in particular has declared that AMR exists within the top 10 global public health threats facing humanity. Predictive statistical models have recently estimated that in the year 2019 alone, 4.95 million deaths globally were associated with bacterial AMR; of which, 1.27 million deaths could be directly attributed to bacterial AMR (Antimicrobial Resistance Collaborators, 2022). Alarmingly, in 2016 a government funded review on the risks associated with drug-resistant infections estimated that annual global deaths associated with AMR could spike to 10 million by the year 2050 (O’Neill, 2016; Antimicrobial Resistance Collaborators, 2022). While the global burden is shared internationally, there are regional disparities. In 2019, estimated deaths attributable to bacterial AMR was found to be highest in western Sub-Saharan Africa and lowest in Australasia with rates of 27.3 and 6.5 per 100,000 deaths, respectively (Antimicrobial Resistance Collaborators, 2022), reinforcing consistent findings within the literature that the burden of AMR is more prevalent in low-income and low-resource geographical locations. 1 Nonetheless, the impact of this health concern is far from obsolete in high-income areas. In the United States of America, 2.8 million antimicrobial resistant infections occur each year, resulting in more than 35,000 deaths (CDC, 2019). It is abundantly clear that without the rapid implementation of comprehensive steps towards mitigation, the risks associated with AMR will continue to rise. 1.2 Risks and impact of antimicrobial resistance The risks and impact of widespread antimicrobial resistance cannot be overstated. The list of directly and indirectly associated risks is extensive and in part includes a) increased mortality, b) economic burdens, c) reduced effectiveness of medical procedures, and d) diminished food security. Increased mortality can result from multiple factors in and of itself, but the primary contributor is that of the loss of effective treatments. Regarding economic burdens there are a multitude of contributors. Individuals with AMR infections may require more time intensive and specialized treatment regimen thus inducing extended hospital stays and consequently higher healthcare costs. Additionally, there are increased costs associated with the development of new antimicrobial agents which subsequently can lead to higher drug and healthcare costs. According to a report by the CDC in 2019, it is estimated that AMR costs the United States $55 billion annually; this can be partitioned into costs associated with healthcare and loss of productivity, surmounting to $20 billion and $35 billion, respectively (Dadgostar, 2019). The decrease in efficacy of antimicrobials and or limitations imposed on their use due to prevalent AMR in animal agriculture would greatly contribute to economic burden as a result of a variety factors such as reduced production, increased livestock mortality and or cull rates, and treatment costs, among others. Furthermore, the use of antimicrobials in animal agriculture can lead to the proliferation of resistant bacteria with potential to enter the food chain, thus subjecting human individuals to additional sources of AMR transmission and compromising food safety (Davies and Wales, 2019). Diminished food security in regard to food availability is an additional concern. The global population is continuously expanding, 2 demanding increased food production to mitigate malnutrition, and rampant AMR development in animal agriculture could reduce production leading to food shortages. One indirect risk of AMR development is the potential reduced effectiveness of medical procedures that are highly dependent on the use of antimicrobials. Surgeries such as organ transplantations, joint replacement, and caesarean sections, as well as treatments such as chemotherapy could experience lower success and survivability rates or even become too risky to perform solely as a result of the inefficacy of the antimicrobials for which they are reliant upon (O’Neill, 2016). Therefore, AMR can essentially undermine recent developments in contemporary medicine and handicap medical professionals when developing courses of action for patients experiencing various health related issues. It is abundantly clear that there are a plethora of risks and negative impacts of a large magnitude associated with AMR development and emergence. Additionally, the known list of factors influenced by AMR is arguably incomplete as unknown factors and implications may come to fruition as the severity and comprehensive understanding of this global health concern develops further. 1.3 Emergence of AMR Although the development of AMR is incredibly intricate, it is widely recognized that the employment of antibiotics is one of the largest selective pressures of AMR. When used appropriately and precisely, antimicrobial utilization can effectively terminate or inhibit microorganisms lacking defense mechanisms; however, microbes containing the intrinsic or acquired mechanisms to resist the action of antimicrobials are able to persist and reproduce. Thus, microbes with the ability to resist antimicrobials have a higher relative fitness than their counterparts when exposed to an antimicrobial rich environment, and over time the resistance phenotype and underlying mutations will become more prevalent in the population. The functional mechanisms of antimicrobial resistance result from the expression of encoded genomic material referred to as antibiotic resistant 3 genes (ARGs). Antibiotic resistant genes are either produced by means of genetic mutations or acquired via gene transfer. The random and spontaneous occurrences of genomic mutations carry the potential to alter functionality of synthesized proteins resulting from antecedent gene expression (Schwarz et al., 2016). If said mutation is nonsynonymous, and therefore induces alterations in both the genetic and amino acid sequences of the encoded protein, subsequent protein functionality could be altered in terms of efficiency, dissolved completely (loss of function), or even redefined. In particular, the redefinition of protein function can result in the acquisition of antimicrobial resistant mechanisms by microbes. Mutations conferring subsequent antimicrobial resistance are often found in regions associated with the target sites of antimicrobial agents as this is a focal point for a variety of resistance mechanisms. In addition, mutations can further induce resistant phenotypes or lessened vulnerability to antimicrobials as a result of the enhanced expression for separate genes such as those pertaining to efflux. Alternatively, antibiotic resistant genes can be inherited by the actions of vertical gene transfer (VGT) and or horizontal gene transfer (HGT). Primarily, genetic material is passed down from parent to offspring in a vertical process. Resulting in the transmission of various genes and developed mutations to the inheriting organisms. However, genomic material can also be acquired by means of horizontal gene transfer, a process by which genetic material is transferred between distinct evolutionary lineages (Thomas, 2005; Stokes, 2011; Dunning Hotopp, 2011; Soucy, 2015). HGT is relatively common amongst microbes, and typically manifests between organisms of the same domain (archaea, bacteria, eukarya). HGT is most frequently documented within the bacteria domain. Nonetheless, interdomain HGT such as acquisition by eukaryotes from bacteria also occurs (Dunning Hotopp, 2011). This is exemplified by organelles such as mitochondria and chloroplast transmitting DNA to the nuclear genome of eukaryotes. Mitochondrial transmission in particular is well documented in Arabidopsis thaliana (Lin et al., 1999). Regardless of the organisms’ residential 4 domains, each case of HGT can result in the acquisition of novel genes and traits independent of those received via VGT inheritance. The process of horizontal gene transfer can be split into two independent processes: the transfer of the DNA, and incorporation of the DNA into the recipient organisms’ genome (Stokes, 2011). There are several mechanisms that can facilitate the horizontal transfer of DNA and mobile elements, but the three most common and primary mechanisms are conjugation, transformation, and transduction. Mobile elements pertain mostly to plasmids and transposons, plasmids are small circular DNA strands and transposons are genetic elements that consist of repeat sequences and a transposase protein (Thomas, 2005). Mobile genetic elements (MGEs) are capable of carrying and disseminating ARGs. Conjugation entails transmission via physical contact of the donor and recipient bacterium. A pilus is formed to connect the cells, and then utilized as a medium for the exchange of genetic material. The uptake of environmental DNA is classified as transformation and is most common in bacteria and archaea. Transduction is the transfer of genetic material by means of phages and can be both general and specialized. General transduction is classified as the intake of a random segment of the host DNA involving a lysed bacterial cell, whereas specialized transduction is performed by temperate bacteriophages without the utilization of lysis. In specialized transduction, viral DNA is integrated with the host DNA and exists in a prophage stage (Soucy, 2015). There are alternative mechanisms of horizontal gene transfer such as cell fusion, gene transfer agents, and intercellular transfer, but these are uncommon compared to the aforementioned mechanisms. After the genetic material is transferred, it must be stably incorporated into the recipient genome. The mechanisms for incorporation are primarily autonomous replication, transposition, homologous recombination, and site-specific recombination (Stokes, 2011). Autonomous replication pertains primarily to plasmids, whereas transposition pertain mostly to transposons, and site-specific recombination involves integrons. Integrons are assembly platforms 5 used to produce functional genes and are considered mobile elements that can be associated with transposons and plasmids. Each of the discussed mobile elements can be utilized in parallel with homologous recombination to integrate the new genetic information into the host (Stokes, 2011 ; van Hoek et al., 2011). Summarized and simply stated, both VGT and HGT present means by which antimicrobial resistant gene uptake is facilitated. 1.4 Classification and mechanisms of antimicrobials Antimicrobial agents can be classified in a variety of ways. At the highest tier, antimicrobials can be partitioned into groups based on the microorganism they impact such as bacteria (antibiotics), fungi (antifungal), viruses (antiviral), and parasites (antiparasitic). We focus in this review on antibiotics as they are utilized more frequently than the others in human medicine and animal agriculture. It should be noted that the term antibiotic is commonly used to refer to both antibacterial and antifungal drugs, as well as used interchangeably with the term antimicrobial despite not conforming to the exact definitions. Within the faction of antibiotics, agents can be further classified by their type of action: bactericidal and bacteriostatic. Bactericidal refers to antibiotics that kill bacteria, whereas bacteriostatic encompasses antibiotics that inhibit bacterial growth and stall cellular activity without directly killing the bacteria. Alternatively, spectrum-based classification can cluster antibiotics into groups of narrow, broad, or extended spectrum. Spectrum is defined as the specificity and range of microorganism that an antibiotic can negatively impact. Narrow spectrum antibiotics only target a limited number of bacterial species. Extended spectrum refers to antibiotics that have been chemically modified to affect a broader range of bacteria. Broad spectrum antibiotics have an even larger scope and can therefore affect a variety of species and types of bacteria such as both Gram positive and Gram negative. In a clinical setting when treating infections, typically broad- spectrum antibiotics are used when the origin of an infection is unknown and narrow spectrum is utilized when the origin has been isolated or refined to a small group of possibilities with high 6 confidence. Of note, the employment of broad-spectrum antibiotics is more likely to promote antimicrobial resistance and multidrug resistance in bacterial populations as well as disrupt the microbiome by altering the intestinal flora (Gerber et al., 2018; Grada & Bunick, 2021). Multidrug resistance (MDR) characterizes microbials that are resistant to multiple antimicrobial agents, this occurrence makes clinical treatment or corresponding infections more difficult (Alekshun & Levy, 2007). Nonetheless, most commonly antibiotics are classified in terms of their mechanism of action or their chemical structure. Classification in terms of chemical structure produces categories that antimicrobials are typically referred to by such as beta-lactams, aminoglycosides, tetracyclines, macrolides and more. For example, beta-lactams are characterized by the presence of a beta-lactam ring in their chemical structure and macrolides are all comprised of a 14, 15, or 16-membered ring. Tetracyclines, the most heavily utilized antimicrobial for animal agriculture domestically in the United States which accounts for 65% of sales, are identifiable by four adjacent cyclic hydrocarbon rings ( FDA, 2019). Biologically speaking, the identity of molecular structures directly equates to the function of the molecule. Thus, antimicrobials classified into the same groups by chemical structure primarily exhibit similar function, mechanism of action, and or molecular mechanisms. Antibiotic mechanisms of action, as shown in Figure 1, primarily fall within three factions: cell wall synthesis, protein synthesis, and nucleic acid synthesis (Kapoor et al., 2017). The cell wall synthesis category pertains to antibiotics that target the cell wall and inhibit biosynthesis. Beta - lactams and glycopeptides are two antibiotic classes that fall within the confines of this mechanism of action. Beta-lactams, and more specifically penicillin, target penicillin-binding proteins and block the production of peptidoglycans which are a significant component of the cell wall. The inhibition of this production results in the weakening of the cell wall and leads to subsequent lysis of the bacteria. Glycopeptides, vancomycin in particular, also inhibits cell wall synthesis associated with 7 penicillin-binding proteins. However, vancomycin binds to the side chain of peptides that normally bind to penicillin-binding proteins as opposed to the binding proteins themselves, subsequently blocking the binding of the peptide to the binding proteins and peptidoglycan production, achieving cell wall synthesis inhibition. The second mechanism of action, pertaining to inhibition of protein synthesis, is carried out by targeting ribosomal subunits necessary for the translation of RNA into proteins. Antimicrobial classes associated with this mechanism of action include but are not limited to aminoglycosides, tetracyclines, and macrolides which inhibit the 30S, 30S, and 50S subunits of ribosomes in bacteria, respectively. As this mechanism occurs within the cell, antimicrobials residing in this classification are aided by the aforementioned antimicrobials that target the cell wall since they allow easier access into the cell. Nonetheless, the presence of cell wall synthesis inhibitors is not required for the effectiveness of protein synthesis inhibitors. Aminoglycosides are positively charged molecules capable of forming pores in the cellular membrane to allow entrance of the antibiotic. Aminoglycosides inhibit protein synthesis by interacting with the 16S rRNA 30S subunit to induce errors in mRNA translation. 16S rRNA comprises the 30S ribosomal subunit and is required for protein synthesis and the correct codon-anticodon pairing during mRNA translation. Notably, the genes that encode 16S rRNA exist in all bacteria and are both highly conserved and species specific. As a result of these characteristics, 16S rRNA genes are commonly used in phylogenetic studies and for microbial identification, classification, and quantification. Nonetheless, the aminoglycoside induced errors are primarily associated with misreading the mRNA and the untimely termination of the process of translation subsequently resulting in cell death. Tetracyclines accomplish protein synthesis inhibition by directly binding with the 16S rRNA 30S subunit, therefore blocking association with tRNA and preventing a key step in producing the translated amino acid sequences. Migrating from 30S subunit inhibitors to a 50S subunit inhibitor, macrolides target conserved sequences in 23S rRNA resulting in the production of truncated peptide chains. Finally, 8 antimicrobial agent mechanisms of action can be characterized by inhibitors of nucleic acid synthesis. Representatives of this faction are quinolones and fluoroquinolones, which inhibit DNA gyrase and topoisomerase IV thereby negatively influencing DNA replication. DNA gyrase is an enzyme association with the relaxation of positive supercoils in DNA to allow for the progression of DNA replication by DNA polymerase and topoisomerase IV is primarily involved with separating daughter DNA strands immediately following replication. Inhibition of nucleic acid synthesis regarding either DNA replication or transcription is detrimental to microbial health and is the reason that quinolones are commonly bactericidal. The final criterion for antimicrobial classification is in terms of medical importance (Gelalcha & Kerro Dego, 2022; FDA, 2023). There are two subclasses associated with medical importance, antimicrobials deemed medically important, and those that are not medically important. Within the medically important antimicrobials (MIAs) faction there are three additional subclasses, critically important antimicrobials (CIAs), highly important antimicrobials (HIAs), and important antimicrobials (IAs). The more important the antimicrobial, the larger the human health risk associated with potential resistance development (FDA, 2023). The ranking criteria used to classify antimicrobials into one of the three classes of importance are as follows. Antimicrobial drugs that are unique or exist within a limited group of drugs with the ability to Figure 1 | The three primary factions of antimicrobial treat serious human infections are mechanisms of action. A large subset of antimicrobials can be partitioned into three main classes for mechanism considered CIAs. HIAs encompasses of action: cell wall, nucleic acid, and protein synthesis inhibitors. Adopted from Kapoor et al., 2017. two characteristics, these drugs are either 9 those that can treat serious human infections but with multiple drug classes available for therapeutic use, or they are one of the limited therapies that can be utilized to treat non-serious infections in humans. IAs are used to treat non-serious human infections when multiple antimicrobial classes of therapeutic drugs are effective and available. Antimicrobials deemed not medically important failed to meet any of the abovementioned ranking criteria and are considered to have less potential impact on human health. Both medically and not medically important antimicrobials are heavily utilized in animal agriculture, which can increase the risk of direct and indirect negative impacts on human health as a result of AMR selection and emergence, reinforcing the notion that AMR is a One Health concern that requires full scale collaboration to aid mitigation. 1.5 Mechanisms of antimicrobial resistance Effective mitigation of AMR is heavily reliant on the large-scale comprehension of the issue and its means of operation. The molecular mechanisms associated with antimicrobial resistance (Figure 2) are broad, and still require additional research to develop a complete understanding. Discussed here is a limited variety of the understood and common AMR mechanisms including those that a) modify the antibiotic target, b) modify or degrade the antibiotic itself, and c) reduce the intracellular accumulation and or concentration of the antibiotic. Modification of an antimicrobials target can reduce binding affinity for the antimicrobial, thus rendering the drug ineffective and conferring resistance. Alteration of the target protein is typically mediated though enzy matic function and or genomic mutations. Beta lactam and macrolide resistance are commonly developed in this manner through mutations of genes associated with penicillin-binding proteins, and enzymatic methylation of 16S rRNA by methyltransferases, respectively. Inactivation of the antimicrobials themselves can be accomplished though enzymatic modification and degradation of its molecular structure. Degradation of the antimicrobials is carried out by hydrolysis of the molecules functional group; modification typically involves the enzymatic transfer and addition of 10 chemical groups thus restricting antimicrobial binding to its target. This mechanism of resistance is also common in beta lactams, as beta-lactamases are capable of hydrolyzing the amide bond in the beta lactam ring which results in antimicrobial degradation (Darby et al., 2022; Jeong et al., 2010; Tooke et al., 2019). Modification by the transfer of a chemical group can be exemplified by aminoglycosides, for which a multitude of different enzymes can alter the hydroxyl or amino groups subsequently reducing binding affinity and decreasing efficacy. Lastly, mechanisms pertaining to the reduction of intracellular accumulation and or concentration of antimicrobials has proven an effective method of AMR. These mechanisms are primarily associated with membrane permeability and efflux systems. As previously discussed, many antimicrobials act within the confines of the cell such as nucleic acid and protein synthesis inhibitors, consequently, their effectiveness is reliant upon their ability to enter and persist in the cell. Thus, resistance mechanisms that downregulate transmembrane proteins such as porins or alter the composition of the cellular envelope can directly reduce cell permeability for antibiotics. Additionally, if antibiotics do enter the cell, they can be expelled though efflux activity. Efflux pumps are transmembrane proteins capable of diminishing the concentration of toxic materials such as antibiotics by directly exporting the compounds out of the cell. Figure 2 | Primary mechanisms of antimicrobial resistance. In Efflux is a major contributing order to combat the mechanisms of action associated with antimicrobials, microbes commonly employ these mechanisms of mechanism to AMR and resistance. Mechanisms illustrated here include decreased influx, increased efflux, target protection, bypass, and site modification, as multidrug resistance and has well as antimicrobial inactivation. Adopted from Darby et al., 2022. 11 been shown to work cooperatively with many of the other resistance mechanisms already discussed (Darby et al., 2022). In depth understanding of mechanisms such as these is crucial to mitigating AMR, and the dynamics of its emergence moving forward. 1.6 Monitoring and surveillance of AMR The development of global surveillance systems to monitor AMR dynamics in humans, animals, and the environment is critical to fully understand the interface between the three and inform future management strategies. Accomplishing this feat is reliant upon the accurate quantification of antimicrobial resistance in samples. Traditionally, the benchmark for determining antimicrobial resistance has been to selectively culture microbes and measure the exhibited phenotypes when subjected to the presence of various antimicrobials. This methodology is employed globally for multiple surveillance systems; however, it is based on a limited number of culturable pathogens and typically only evaluates antimicrobials relevant to human health. As a result, a plethora of unculturable pathogens and many antimicrobials utilized in alternative sectors such as animal agriculture are not evaluated. In many countries, antimicrobial consumption in animal agriculture surpasses that of human usage, and it is estimated that nearly 73% of global antibiotics can be contributed to livestock (Van Boeckel et al., 2017). It should be noted that is it difficult to draw comparisons between antimicrobial use in humans and animals due to the large disparity in human and animal numbers as well as the dose and duration of antimicrobial administration between species (FDA, 2019). Nonetheless, animal agriculture contributes immensely to total global antimicrobial usage and moreover the selective pressure on AMR development. Unfortunately, surveillance for AMR in farm animals typically relies on passive reporting to gather representa tive data (Woolhouse et al., 2015). Active surveillance on the other hand provides more accurate and up to date information. Regarding alternative methodologies for evaluating AMR presence, recent advancements in 12 high throughput DNA sequencing technologies have made way for cost effective shotgun metagenomic sequencing, allowing for the interrogation of an entire collection of DNA present in a complex system such as animal and environmental samples, including novel and unculturable microorganisms (Quince et al., 2017). The process of shotgun metagenomics from start to finish entails 1) sample collection, 2) DNA extraction, 3) random fragmentation of the genomic material, 4) high throughput sequencing, and 5) subsequent bioinformatic analysis (Quince et al., 2017). This design allows for subsequent data analyses to mine critical information from vast amounts of data including the identification and estimation of antibiotic resistance genes (ARGs). Sequencing is typically performed on the Illumina platform, but advancements in long read sequencing technology from Oxford Nanopore and PacBio has made their utilization viable in the field (Quince et al., 2017). Shotgun metagenomics is commonly used to identify taxonomic composition and function of microbial communities, however, its prevalence in AMR research has grown in recent years. Contemporarily, this methodology has been employed to evaluate ARG presence in samples of human fecal matter origin (urban sewage and toilet waste from airplanes), and animal agricul ture (feces from swine, milk filters from dairy farms, and feces from broiler farms). In 2019, a study by Hendriksen et al. sampled untreated urban sewage from 79 cities in 60 countries and subjected them to shotgun metagenomic sequencing to interrogate regional variation of ARG abundance and dissect factors having potential influence. Their findings concluded that there are regional disparities in ARG abundance which are strongly correlated with socioeconomic, health, and environmental factors. Regional differences could be conglomerated into two working groups of similarity: Europe, North America, and Oceania and Africa, Asia, and South America. Of note, the most highly abundant ARGs were associated with macrolides, tetracyclines, aminoglycosides, beta -lactams, and sulfonamides (Hendriksen et al., 2019). This study highlights the ability of metagenomics to characterize AMR development in healthy populations, and more importantly, that there is global 13 disparity and nonuniform dynamics of AMR. Transitioning to animal agriculture associated studies, researchers in Denmark shotgun metagenomically sequenced fecal samples from 181 swine and 178 Figure 3 | The routes of transmission of AMR between livestock, humans, and the environment. Antimicrobial poultry farms from nine resistance can be acquired through multiple direct and indirect pathways for each sector. Adopted from Woolhouse and Ward, European countries (Munk 2013. et al. 2018). Interestingly, upon quantification of the resistome for each species there were significant differences in abundance and composition. Additionally, the study concluded that AMR prevalence was associated with the degree of antibiotic utilization in livestock. Moreover, a country effect on resistome composition was determined significant with variable degrees between species. Showcasing the ability to differentiate AMR profiles based on associated factors in animal agriculture such as species, antibiotic usage, and geographical location. An additional study pertaining to dairy cattle in animal agriculture sought out to evaluate the distribution of ARGs in complex samples such as bulk tank milk filters (Rubiola et al., 2022). The purpose of utilizing an uncommon sampling source such as milk filters is to a) investigate the ability to determine resistomes with raw milk and b) interrogate food security associated risks relevant to AMR. The study concluded that the shotgun metagenomic sequencing of raw milk found in bulk tank filters can successfully determine the associated resistome, and that raw milk can be considered a source of AMR bacteria and genes. Outlining the importance of proper hygiene on dairies, and food safety risks associated with the consumption of raw milk. Once again emphasizing the importance of the 14 human, livestock, and environment interface. The routes of AMR transmission in this interface are illustrated at a fundamental level in Figure 3 (Woolhouse et al., 2015). Antimicrobial usage in humans and livestock encourages the development of resistant microbials, which can then follow various routes of transmission. Transmission can occur between different species within livestock and humans via close proximity or the consumption of food products from animal agriculture. Additionally, excrement from the living organisms can transmit AMR microbes and genomic material into the environment. Environmental AMR can also be routed back to humans, livestock, and food production, where in the case of the living organisms it can be recycled and spread within the species. It is clear that animal agriculture plays an integral role in AMR development and dissemination, and thus it is vital that rapid and quality livestock surveillance systems be established to aid in the mitigation of this global One Health threat. 1.7 One Health Despite the encouragement of amicrobial resistance due to selective pressures of antimicrobial use, it should be noted that antimicrobial resistance is not anthropogenic and existed prior to the discovery, development, and utilization of contemporary antimicrobial agents (Zhang et al., 2022). In fact, a recent study in reconstructed ancient microbial genomes from the human gut utilizing over a thousand-year-old preserved palaeofaeces samples identified the presence of antimicrobial resistance genes (D’Costa et al., Figure 4 | The relationship between 2011). Additionally, ARGs exist in the environment and Animal, Human, and Environmental Health. At the cross section of these have been discovered in permafrost cores and pristine soil three sectors resides the concept of One Health, illustrating the importance for samples from the Yukon territory of Canada and the optimization of each sectors health both individually and as a conglomerate. Antarctica, respectively (Zhang et al., 2022; Van Goethem Adopted from the International Livestock Research Institute. et al., 2018; Wibowo et al., 2021). Each of these studies 15 reinforces that antimicrobial resistance is a natural occurring phenomenon that was not established in response to antimicrobial use, but rather amplified though selective pressures and microbial proliferation. The existence of antibiotic resistant genes in the environment poses a threat to human health as the genes can be acquired by bacterial hosts in humans as well as potential pathogens. The development and emergence of AMR is multifaceted, and the large breadth of influence associated with this global health threat classifies it as a One Health issue. One Health is the idea that human, animal, and environmental health are all intertwined (Figure 4), and the optimization of this kingpin requires a collaborative, multisectoral, and transdisciplinary approach. Regarding AMR, practices employed within each sector heavily influence the health of all three. 1.8 Metagenomic data in public databases Vast amounts of data produced from published and unpublished studies utilizing shotgun metagenomic sequencing have accumulated in public repositories, most notably in the Sequence Read Archive (SRA) database hosted by the National Center for Biotechnology Information (NCBI). The tens of thousands of metagenomic samples of animal agriculture origin residing in the SRA are incredibly diverse in terms of species, sampling source (fecal, intestinal, tissue), time, and location. Additionally, the metagenomic data is rather unbiased as a result of unintended sampling since the majority of studies were not primarily investigating AMR. That is to say, most studies were not actively pursuing samples where a certain degree of AMR prevalence would be likely. As a resul t of the aforementioned characteristics associated with these metagenomic samples in the SRA, there exists the ability to capture large scale antimicrobial resistance information in global animal agriculture though the quantification of ARGs. The estimates of antimicrobial resistance gene abundance, and their correlation with a multitude of factors such as national economic indices and antimicrobial usage can then be investigated to offer insights into the dynamics and emergence of AMR as well as mitigation strategies. 16 Chapter 2. Development of an ARG estimation pipeline from metagenomic data 2.1 Introduction The development and optimization of a bioinformatic pipeline capable of processing data from raw metagenomic sequences to gene abundance is instrumental in the accurate and efficient quantification of ARGs. Primarily, bioinformatic pipelines such as these employ one of two methodologies: assembly based, or alignment (mapping) based. The process for each method starts identically, microbial DNA is randomly fragmented into many smaller pieces of genomic material and then massively parallel sequenced i.e., shotgun metagenomic sequencing. This results in the production of fastq files which contain a large quantity of reads (nucleic acid sequences). As illustrated in Figure 5, it is here where the divergence of assembly-based and Figure 5 | Shotgun metagenomic sequencing and ARG alignment-based methods occur. abundance estimation. The procedure associated with shotgun metagenomic sequencing and the estimation of absolute ARG The alignment-based method abundance are illustrated. Three distinct hypothetical bacteria are sampled and subjected to shotgun metagenomic sequencing. entails mapping the prior The produced sequence reads are subsequently utilized to estimate ARG abundance by either a two-step assembly based produced sequence reads to a (left) or mapping based (right) approach. 17 reference gene catalog, in our case one comprised of existing ARGs. Abundance estimation is subsequently quantified by the number of times each gene in the reference catalog was mapped to from the input query read sequences. Alternatively, the assembly-based methods require a two-step approach. Reads are first computationally assembled into complete genomes via overlaps of similarity between reads. This is made possible due to the high throughput and coverage produced from shotgun metagenomics sequencing. The resulting assemblies can then be functionally classified to produce a reference gene catalog. Finally, abundance can be estimated by mapping the shotgun reads to the assembled gene catalog. In layman’s terms, alignment-based methods entail mapping to a known reference gene catalog, whereas assembly-based methods produce their own reference gene catalog and entail mapping to the assembled gene catalog. The utilization of each method comes with its associated benefits and disadvantages when compared with the alternative. Compared to assembly-based methods, alignment-based methods are computationally efficient but may miss unannotated genes and normalization across samples is more challenging. On the other hand, the assembly-based method is more computationally expensive but is better at capturing the total microbiome and may discover novel genes undefined in reference databases. Each method has been shown to effectively capture ARG abundance, nonetheless, due to the sheer volume of data we wish to process, an alignment-based method was employed. The objective of this project was to establish alignment-based analytical framework for estimating ARG abundance in metagenomic samples, that is additionally capable of processing large amounts of bulk genomic sequence data in a feasible time frame. 18 2.2 Materials and Methods 2.2.1 Metagenomic data The development and optimization of the constructed bioinformatic pipeline utilized in house published data from the metagenomic sequencing of ileum, cecum, and colon samples (n=18) from 6 distinct pigs (Quan et al., 2020). The objective of the aforementioned study was to compare the metagenomes from different intestinal regions of pigs with contrasting feed efficiency, and an assembly-based bioinformatic approach was utilized for quantification of ARG abundance. The produced data from this study serves here as an evaluation dataset for the development and optimization of a new bioinformatic pipeline as well as a means for comparison between assembly and alignment based methodologies. 2.2.2 Reference databases Databases utilized for alignment include the Comprehensive Antibiotic Resistance Database (CARD) for antimicrobial resistant genes, and the GreenGenes database for 16 S rRNA genes which are later used for taxonomic purposes. The genes comprising each respective database were clustered based on sequence identity to remove any redundancy using CD-HIT. More precisely, CD- HIT-EST (W. Li & Godzik, 2006) was utilized with the following parameters: local sequence identity (as opposed to global), alignment coverage of 90%, clustering to the most similar representative (as opposed to the first cluster to meet clustering criteria), and either a 95% or 99% sequence identify threshold. The CD-HIT-EST algorithm works as follows: 1) input sequences are sorted in order of decreasing length, 2) the longest sorted sequence becomes the representative of the first cluster, 3) the following sequences of shorter length are compared to the representatives of the existing clusters. The sequence of interest will be partitioned into the first representative sequence for which it meets the clustering criteria defined above (sequence identity, alignment coverage, etc.). Therefore, 19 if a sequence of interest meets the clustering criteria for multiple clusters, it will be partitioned into the cluster with the longest representative sequence. If a sequence fails to meet a ll of the clustering criteria for all of the existing clusters, a new cluster is formed with the sequence of interest serving as the cluster representative. Once clustering was finished, the representative sequences were retained and used moving forward. Each database was clustered twice, once with 95% and once with 99% sequence identity, to inform optimization of the pipeline discussed later in this chapter. Upon clustering at 95% sequence identity, the CARD and GreenGenes databases were filtered from 4,6 05 genes to 1,323 genes, and 1,075,170 genes to 52,161 genes, respectively. At a sequence identity threshold of 99%, the CARD and GreenGenes databases were filtered from 4,605 genes to 1,986 genes, and 1,075,170 genes to 175,724 genes, respectively. Gene length distributions for the four individual nonredundant databases are illustrated in Figure 6. Each clustered database was then indexed using BWA-INDEX as required for subsequent mapping to the gene catalogs. This process is performed once and is not repeated for each sample processed by the constructed bioinformatic pipeline. All sequence processing and abundance estimation was performed on the Michigan State University (MSU) High Performance Computing Cluster (HPCC) hosted by the MSU Institute for Cyber-Enabled Research (ICER). 20 Figure 6 | Disparity in Gene Length Between Each Nonredundant Reference Database at Variable Sequence Identity Clustering Thresholds. Each Database (CARD and GreenGenes) were clustered at 95% and 99% sequence identity. The distribution of each nonredundant databases gene lengths, and the average gene length are illustrated. Of note, the 99% sequence identity clustering equated to a relatively smaller average gene length as due to the parameters, less genes met the clustering criteria for each large cluster representative sequence and instead formed their own cluster. Consequently, more clusters with a lower average gene length were produced in the 99% sequence identity reference databases compared to that of the 95% sequence identity reference databases. 21 2.2.3 Sequence processing 2.2.3.1Fastq quality control First and foremost, the raw sequence data housed in fastq files are subjected to quality control. This primarily pertains to adapter removal and quality trimming of the shotgun reads, which is performed using BBduk (Bushnell, 2014). A variety of alternative tools exist, such as Cutadapt and Trimmomatic, but BBduk has been shown to be extremely fast, scalable, and memory-efficient while still retaining relatively comparable accuracy to the competition (Guzman & D’Orso, 2017). Additionally, BBduk handles both single-end and paired-end reads very smoothly. Regarding parameterization, a quality threshold of 20 on the Phred scale (99% base calling accuracy) and a minimum length of 50 bases was employed. Filtering of reads based on length is performed after adapter removal and quality trimming. Quality control metrics such as these are a balance between sensitivity and specificity. If the parameters are too relaxed, then filtering is considered sensitive as the retention rate of truly positives is higher. Alternatively, if the parameters are rather strict (specific) then a higher proportion of the true negatives are accurately discarded. When selecting quality control thresholds, it is desirable to optimize a middle ground between the two ends of the spectrum so that reads with high confidence of accuracy off the sequencer are retained and potentially accurate data with lesser confidence aren’t massively removed. A quality trimming threshold of 20 resides in the middle of this spectrum, and a minimum length of 50 bases errs on the side of specificity; both of which generally agree with parameters commonly used in the literature. While not included presently, future development of this quality control step can allow for the filtering of host reads. This serves two primary purposes, to validate the sample species origin and remove contaminants in the data. Validation of the quality control step is performed using FastQC on both the raw sequence data and the retained clean data. FastQC performs various basic quality control (QC) metrics 22 including sequence quality, read length distribution, duplication levels, GC content, and adapter content. By subjecting both the raw and clean sequence data to FastQC, any QC metrics flagged for concern in the raw data can be cross-checked with the associated clean data QC metrics to validate that the issues have been resolved. The utilization of this technique allows for the wrangling of additional erroneous data that escaped quality control filtering prior to further downstream analysis. 2.2.3.2Alignment A two-pronged alignment is then performed to query the clean sequence data against the nonredundant 16S rRNA and ARG gene sets. The BWA-MEM algorithm was employed to map the input reads against the references (Figure 7), despite the availability of its contemporary successor BWA-MEM2. BWA-MEM2 produces an identical alignment to that of BWA-MEM but is significantly faster (up to 3.1x). However, the cost of this rapid alignment is skyrocketed memory costs, and due to the high throughput nature of shotgun metagenomic sequencing the utilization of BWA-MEM2 became infeasible with the available resources. The Burrows-Wheeler Alignment (BWA) (Li & Durbin, 2009) is a read alignment tool contingent upon the Burrows–Wheeler Transform (BWT) (Burrows and Wheeler, 1994). In the case of BWA-MEM, alignments are seeded with maximal exact matches (MEMs) and then seeds are extended with the affine-gap Smith- Waterman algorithm (SW) (H. Li, 2013). The algorithm handles sequencing errors very well and works with a wide range of sequence lengths. The execution of this algorithm given reference and query sequences will result in the production of SAM (Sequence Alignment Map) files containing the query Figure 7 | Overview of the ARG abundance estimation pipeline. sequences and the 23 reference sequence they aligned to (if applicable). Upon the completion of BWA-MEM alignment with default parameters, each output SAM file was converted to BAM (binary) format with SAMtools software to conserve storage space on the MSU HPCC. The outputs of the alignment step are two BAM files for each input clean fastq file, individually containing mapping information for the input reads to either the 16S rRNA or ARG references. These files contain both mapped and unmapped reads from the clean data. 2.2.4 Gene quantification 2.2.4.1Absolute abundance estimation In order to estimate the absolute abundances for the ARG and 16S rRNA genes of interest, the unmapped reads from the previously produced alignment files must be filtered out and discarded. This is accomplished based on each reads associated MAPQ scores produced by BWA- MEM during alignment. MAPQ scores for BWA-MEM range from zero to sixty [0,60] on the Phred scale, a score of zero means that the read could not meet the mapping criteria for any of the genes in the reference catalog (Figure 8). MAPQ scores from one to sixty [1,60] express a level of confidence in the mapping of an input read to a particular gene in the catalog and is directly related to the number of genes for which a read meets the required mapping criteria. A MAPQ score of 60 equates to extreme confidence by the alignment algorithm that the read maps to a single particular gene in the reference. As the number of genes a read can map to increases, the MAPQ score decreases as the aligner is less confident in which gene is the accurate mapping for the read. Any reads that have a MAPQ score of zero are therefore unmapped to any genes in the reference catalog and are discarded from further quantification steps in the analysis. At this point, quantification is temporarily partitioned into two different bioinformatic processes for single and paired end reads, respectively. 24 Single end analysis is A C relatively straight forward. The number of mappings for each read in the alignment file are summed, and a B D three-column table is produced with columns 1, 2, and 3, equating to read name, mapped gene name, and number Figure 8 | MAPQ Score Distribution Produced by Alignment to Each of mappings for the Reference. The distribution of MAPQ scores associated with the alignment of one individual sample to the 99% sequence identity clustered read in question, 16S rRNA and ARG gene catalogs is illustrated. The sample utilized (SRS6485642), was randomly selected from the aforementioned pig respectively. To combat intestinal feed efficiency study. A and B both represent 16S rRNA sequence alignment MAPQ scores, however, B illustrates solely MAPQ multimapping for genes scores ranging from [1,60]. Similarly, C and D both represent ARG sequence alignment MAPQ scores, with D illustrating solely MAPQ in the reference catalog, scores ranging from [1,60]. A and C serve to highlight the prevalence of unmapped reads (MAPQ=0), as the distribution is so disproportionate the value in column 3 is that none of the alternative MAPQ scores [1,60] can be depicted on the same scale. then inverted to weight each mapping (i.e., 3 gene mappings by one read is inverted to 1/3 of a count for each gene). This ensures that multiple mappings for a single read does not equate to multiple gene counts, as this would overestimate abundance. Paired end analysis requires additional considerations since the two paired end reads may not necessarily confer identical gene alignments. Consequently, each paired end alignment file must be spliced into two alignment files containing the left end and right end reads, respectively. Each of 25 these files are then used to produce a table comprising of three columns where columns 1, 2, and 3, represent read name, mapped gene name, and MAPQ score, respectively. Subsequently, the resulting tables are merged into a five-column table for which columns 1, 2, 3, 4, and 5, equate to read name, left end gene mapping name, right end gene mapping name, left end gene mapping MAPQ score, and right end gene mapping MAPQ score, respectively. Here, one of the paired end reads (left end or right end) are selected as the representative for which gene the read was mapped. Selection of this representative was determine using the following procedure: 1) if only one read mapped to a gene in the reference, the mapped read is retained as the representative, 2) if each read was mapped to an identical gene the left end read is selected, 3) if each read mapped to different genes, whichever mapping boasts a higher MAPQ score is selected, 4) if each read mapped to different genes with identical MAPQ score, the left end read is selected. Because of the random nature of left and right assignment, the last (4) selection effectively randomly assigns reads to one of the two genes. Once representatives for the paired end reads have been determined, the remaining reads are processed identically to the single end analysis previously described to produce a three-column table containing read name, mapped gene name, and weighted mapping count, respectively. At this point, the bioinformatic analysis of single end and paired end reads reconvene. For each file, the weighted mappings for each gene are summed subsequently producing a two-column table comprised of reference gene name and associated summation of weighted mappings, respectively. Of note, there are often multiple sequencing runs and produced fastq files for a single sample. Now that each of these files have been processed from raw data to a table of absolute weighted gene counts, they need to be pooled into a single file representing the original sample. This is performed by summing the gene counts for each file and representing them in a new table with the identical format of the input tables. The resulting table represents the absolute abundance of each gene in the reference catalog for the shotgun metagenomically sequenced sample. 26 2.2.4.2Normalization Quantification was performed four times using gene alignments to the nonredundant CARD and GreenGenes databases at 95% and 99% sequence identity for both. Resulting in the construction of four large tables formatted as sample by reference gene with values equating to the absolute abundance of each gene within each sample. However, this abundance estimation can be heavily biased due to disparities in microbial abundance between samples which is the purpose for 16S rRNA quantification. The total abundance of 16S rRNA genes serves as a proxy for the total microbial abundance in the metagenomic samples and is used as a means for normalizing quantified ARG abundance. Relative ARG abundance is calculated using the formula found in Equation 1 : Equation 1 | ARG Abundance Normalization. where NARG is the number of reads mapped to an ARG, LARG is the length of the ARG gene, N16S is the number of reads mapped to a 16S rRNA gene and L 16S is the length of the 16S rRNA gene. After normalization, the relative abundances for each ARG are enveloped in a sample by ARG table which will serve as the basis for all subsequent bioinformatic analysis. 27 2.3 Results 2.3.1 Correlation between alignment and assembly based methods As previously mentioned, the sequence data utilized in this chapter originates from publicly available and published data from Quan et al., 2020. As shown in the publication, ARG abundances were estimated utilizing an assembly-based bioinformatic approach (Quan et al., 2020). Thus providing available benchmarks for comparison between the authors assembly-based approach, and the alignment-based method detailed in this chapter. To evaluate concordance between the developed framework utilizing the alignment-based approach, and the existing assembly-based method, correlations A B between the concluded ARG abundances for each method using the same dataset were calculated. Furthermore, as to determine which sequence identity threshold should be used for reference database clustering, two correlations were Figure 9 | Correlation of ARG abundance estimation between the alignment-based and assembly-based methodologies. Figure A calculated. The first of represents the correlation between the alignment-based approach using the 95% sequence identity database and the assembly-based which is the correlation method. Figure B represents the correlation between the alignment- based approach using the 99% sequence identity database and the between the estimated assembly-based method. The mean correlation of the estimate relative ARG abundance for all samples in A and B are 0.337 and 0.376, ARG abundances from the respectively. 28 95% sequence identity reference database alignment-based approach, and the assembly-based ARG abundances from the publication. Secondly, correlations were calculated between the estimated ARG abundances from the 99% sequence identity reference database alignment-based approach, and the same aforementioned assembly-based ARG abundances from the publication. It should be noted that the assembly-based approach is by no means the gold standard for determining ARG abundance, but there is merit to concluding comparable results when the alignment-based approach is substantially more computationally efficient. As shown in Figure 9, the average correlation for the 99% sequence identity reference database (0.376) was marginally higher than its 95% counterpart (0.337). Moreover, the sample with the highest correlation for the 99% reference database was 0.620, significantly higher than 0.497, which is the highest sample correlation from the alternative 95% database. As a result, further analyses with the alignment-based approach will utilize the nonredundant CARD and GreenGenes databases clustered at 99% sequence identity. Furthermore, the results from the alignment-based approach generally agree with the concluded ARG abundances produced via the assembly-based approach. 2.3.2 Computational processing time When it comes to computational efficiency, CPU time spent on gene quantification and normalization is negligible and nearly instantaneous. However, sequence data trimming and alignment can prove taxing when it comes to large scale data processing. As the pig intestinal data processed above comprises of merely 18 samples, computational burden analysis was performed using the data set discussed later in chapter 3 which contains 126 samples and provides more accurate estimations. As shown in Figure 10, the average time spent to process 1 million reads for trimming, 16S alignment, and ARG alignment, was 0.08, 55.09, and 54.60 minutes, respectively. Therefore, the average CPU time required to process 1 million reads is 109.77 minutes. The relationship between number of reads and time spent for each step generally illustrates a positive 29 and linear relationship. However, as exhibited in Figure 10, components b and c, there is clustering of files containing over ~600 million reads for the alignment steps that do not agree with the concluded linear relationship. This artifact is a result of variable node efficiency on the MSU HPCC, as some nodes are newer and faster than others and the processing of Figure 10 | Computational Burdens Associated with Read Processing. the samples CPU time allocation for read processing is not equal proportioned. The average time to process 1 million reads is 109.77 minutes, 99.9% of which can included in these be attributed to alignment. Trimming (a) time is negligible. As shown in a, b, and c, number of reads is directly related with the time spent processing. clusters took place on these particular nodes. Thus, partially explaining the contradictory nature of these clusters to the generally exhibited correlation between number of reads and CPU processing time. This process can be accelerated to a certain extent through optimized multithreading and parallelization in the future. 2.4 Conclusion In conclusion, the outlined analytical framework for estimating antimicrobial resistance gene abundance in metagenomic samples of animal agriculture origin is capable of processing vast amounts of genomic sequence data in an incredibly short amount of time. Moreover, the processing time can be accelerated to an even shorter time duration with the incorporation of optimized multithreading and parallelization on HPCC’s moving forward. Furthermore, the concluded ARG abundances generally correlate with those of which produced by the alternative assembly-based methodology. This makes large scale application to bulk datasets of genomic sequence data feasible, opening the door for a variety of future explorative studies regarding the identification of ARG presence in animal agriculture. 30 Chapter 3. Application of developed framework on metagenomic samples 3.1 Introduction Vast amounts of shotgun metagenomic sequence data pertaining to six particular animal species in animal agriculture (pig, cattle, chicken, sheep, goat, and horse) exist in the Sequence Read Archive (SRA) hosted by NCBI. The SRA is heavily intertwined with the BioSample and BioProject databases, also hosted by NCBI. The BioProject database houses information pertaining to ea ch study at large and is therefore applicable to multiple samples. The BioSample database contains metadata for each unique sample collection. Curation of characterized sequence data from the SRA requires the utilization of information stored in each of the three databases. Moreover, the public repository has exhibited continual growth in terms of volume for sequence data and number of metagenomic samples and is still actively growing. Notably, the number of metagenomic samples submitted the last two years (2021, 2022) has surpassed the previous maximum, and 2022 in particular experienced the largest growth of sample submissions to date. As the cost of sequencing in modern day continues to drop, it can be Figure 11 | Rapid growth of shotgun metagenomic data anticipated that growth will continue of animal agriculture origin in the SRA. (a) Cumulative number of metagenomic samples for six species in the SRA moving forward. As of November over the past three decades. (b) Cumulative size of metagenomic sequence data in the SRA over time, split by 2022, the number of shotgun six animal species. The size is expressed in T (10 12) Bases. (c) Number of samples submitted to the SRA in each year metagenomic samples in the SRA for each species, yearly totals stated above the stacked bars. 31 was estimated to be 15,850 by filtering the metadata with associated keywords. It’s important to note that this number comprises only shotgun metagenomic samples and does not include 16S amplicon sequencing which is a commonly used methodology in the field of metagenomics. 16S amplicon sequencing by nature fails to offer insights into the genome at large, and therefore is of no use for ARG abundance estimation. Illustrated in Figure 11 are trends and characterization for metagenomic samples and sequence data submitted over the years. Currently, the two species with the largest amount of metagenomic samples are swine and cattle with 6,009 and 4,379 samples respectively. Of particular importance, samples and their associated sequence data in the SRA are accompanied by their respective metadata. The associated metadata varies tremendously in terms of detail, but ideally each samples metadata comprises of sample species, collection data, a nd sampling location, at a minimum. Upon inspection, 11,831 of the 15,850 (~75%) samples contained at least sample collection year information, and collection dates ranged from 1994 to modern day. Moreover, of the 15,850 samples, 15,085 (95%) contained at least country information representing 37 countries in all six continents other than Antarctica. However, sample distribution is uneven as three counties (USA, China, New Zealand) comprise 55% of the total number of samples. Figure 12 illustrates the distribution of samples on a global scale as well as the dynamics of sample contribution over the past decade. As shown, sample submission is growing Figure 12 | Spatiotemporal distribution of metagenomic samples from animal agriculture in the SRA. Snapshots globally but additional sampling and of the cumulative number of SRA samples from animal agriculture are illustrated for the years 2013, 2016, 2019, and sequencing is required, especially in 2022. Significant growth in number of metagenomic samples over the past decade is apparent. low-income countries. Nonetheless, 32 the significant amount of shotgun metagenomic sequence data of animal agriculture origin present in the SRA could prove useful for evaluating spatiotemporal disparity associated with ARGs. The objective of this project was to apply the previously discussed analytical framework for estimating ARG abundance to a subset of metagenomic samples of animal agriculture origin from the SRA, and explore trends associated with the geographical distribution of ARGs on a global scale. 3.2 Materials and methods 3.2.1 Experimental design An experiment was designed to evaluate regional variation of ARG abundance and serve as a proof of concept for further large-scale analysis of the SRA. Shotgun metagenomic samples from a variety of BioProjects, countries, time points, and species housed in the SRA were hand curated to produce a working data set representing a subset of global metagenomic samples from animal agriculture. The curated samples were then subjected to in-depth annotation, and the associated sequence data was downloaded and pushed though the bioinformatic pipeline established in chapter 2 to produce relative ARG abundance estimations. In conjunction with total absolute and relative ARG abundance estimation, ARG composition in terms of associated antimicrobial drug class was also explored. Subsequently, ARG variation between and within various geographical regions and BioProjects were evaluated, and conclusions were drawn. 3.2.2 Sample curation As previously mentioned, cattle and swine comprise the largest portions of submitted metagenomic samples in the SRA. Therefore, we focused on these two species when curating a working data set as this allowed for the inclusion of more geographical regions within the same species group. The goal was to accumulate a diverse set of samples representing 6 continents (exclusion of Antarctica), and a multitude of countries, for both cattle and swine. When selecting 33 samples, the availability of a publication was deemed a prerequisite as to 1) ensure that quality data is obtained and utilized, and 2) allow for confident, accurate annotation of the associated samples. Additionally, as an attempt to keep the study relatively balanced, 12 samples were selected for each country included in this study. Of note, availability of quality shotgun metagenomic sequence data (not 16S amplicon sequencing) in Africa was incredibly limited, and available data was refined to one available publication in Malawi which comprised of 31 samples. Of these 31 samples, only 10 were sequenced via shotgun metagenomics and usable for this study. Thus, the selected representative samples from Malawi, Africa (10) did not fit the mold of 12 samples per country despite the availability of both cattle and swine samples. However, for countries with a surplus of available data, curated samples were scattered amongst the geographical landscape within the country. For example, in China, samples were split amongst the cities of Hangzhou and Yunfu, representing the eastern and southern regions, respectively. In the USA, samples were split amongst the states of Pennsylvania, Texas, Colorado, and South Dakota resulting in latitudinal and longitudinal variation for the curated representative samples. Selection of samples within BioProjects were hand selected to represent diverse collection times and geographical locations. For example, if a large study sampled continuously for three years, a relatively equal number of samples were selected from each year. However, within each block of year and location, the utilized samples were randomly selected. Table 1 highlights the number of samples allocated to each country, BioProject, and species for the final working group of curated samples. The selected samples represent 15 unique BioProjects, 11 different countries, and 6 continents; of the 4581 samples attributable to these projects, 130 samples were retained. These 130 curated samples, and their associated publications then progressed to annotation and bioinformation analysis. 34 3.2.3 Sample annotation Due to the highly heterogeneous nature of the SRA, selected samples were annotated to confirm available data and or fill in missing data. Additionally, annotation allows for the standardization of the metadata fields. This can be exemplified by collection date heterogeneity as the date October 3 rd , 1998, can be expressed as 10/3/1998, 3/10/1998, 10/3/98, etc. Following sample annotation, variation in metadata formatting such as the collection date example can be refined to a singular format which will aid downstream analyses. Prior to sample annotation, the hand curated samples and their associated metadata housed within the BioSample, BioProject, and SRA databases are conglomerated into a local database so that each sample is accompanied by a plethora of information including species, collection date, geographical location, and various accession IDs (SAMN, PRJNA, SRS, SRR). The curated sample metadata is then subjected to full scale annotation by multiple independent reviewers. The sample annotation procedure is rather detailed, and is solely summarized here, additional information can be found in the appendix. First and foremost, the presence of any missing data in the local database is evaluated and noted. Then, the associated accession IDs for each sample are searched in the three online NCBI databases, and any disparity between the online and local databases is evaluated. Further annotation from this point is reliant upon access to a publication for the sample of interest, which can be identified within the BioProject metadata or through internet exploration. BioProject IDs are typically found within the data availability section of publications, thus providing confirmation once a publication is isolated for a sample of interest. If publications fail to be found during exploration, the submitting authors of the data are contacted via email to acquire missing data or confirmation for the curated metadata. Due to the experimental design defined above, all samples curated for this study are accompanied by detailed publications which were heavily utilized for annotation by the reviewers. Each reviewer read the publication, 35 identified key information, and annotated the hand curated samples in the local database. Each curated sample was annotated for sample collection date (DD:MM:YYYY format), host species (cattle or swine), animal breed, geographical location (Farm:City:State/Providence:Country format), and source of sample (feces, intestinal, drinking water, etc.). Additionally, variation within BioProjects such as number of species sampled, different sources of sampling, and multiple methods for sequencing (16S amplicon and shotgun metagenomics) are explicitly stated in the annotations. Following annotation of the previously mentioned metadata, reviewers were encouraged to provide additional notes as well as a confidence level in their produced annotations. Confidence levels range from 0% to 100% and are a direct function of specificity in the evaluated publications, if each piece of the necessary metadata was explicitly stated in the publication, then confidence levels were very high. Lastly, concordance between reviewers was evaluated in an all or nothing binary sense. Reviewers either completely agreed with every aspect of the samples’ metadata, or they disagreed, regardless of the degree of disagreement. Following the conclusion of local database annotation, a diverse dataset of shotgun metagenomic sequence data in terms of geographical location and collection time was constructed and validated. The lowest confidence level allocated to a samples’ metadata by a reviewer was 70%, and the average confidence level amongst all reviewers for all samples was approximately 91%. Additionally, the annotations for each sample between reviewers was in complete concordance as no disparity existed between annotations. As a result, every sample curated in the local database progressed to the next stage, sequence data download and processing. 36 Table 1 | Anatomy of Curated Samples. The selected SRA samples are associated with 15 BioProjects, 11 distinct countries, and either cattle or swine. Filtering to isolate only samples related to animal agriculture and shotgun metagenomi c sequencing resulted in 1,945 samples comprising these 15 projects, for which a subset of 130 was hand selected for further analysis. Collection dates for each sample is shown. The publication DOI and citation for each study is provided. 37 3.2.4 Sequence data download and processing From the original 130 metagenomic samples selected, 126 samples were successfully downloaded onto the MSU HPCC as 4 samples were problematic on the SRA end. The associated sequence data for each of the 126 samples were then processed using the developed bioinformatic pipeline characterized in chapter 2. Estimated relative abundances of ARG for each sample were then reassociated with their respective annotated metadata, and conclusions were drawn from the produced abundances in each geographical location. Figure 13 illustrates the employed procedure for sample annotation, downloading the sequence Figure 13 | Framework for sample annotation, download, and processing. (1) illustrates the muti-person data, and estimating global annotation of curated SRA data, (2) highlights the general schema of the analytical framework to progress from antimicrobial resistance gene curated sample information and download to global antibiotic resistance gene abundance estimation. abundance. 3.3 Results and discussion Although the number of samples utilized in the study was limited and therefore did not provide enough power to perform formal statistical analyses, some interesting trends in the ARG abundance data could be teased out. Disparities in ARG abundance is evaluated on the BioProject level, however, geographical location such as country of sampling location is indicated to facilitate interpretation. 38 3.3.1 Analysis of ARGs in metagenomic samples of cattle origin In cattle, the largest average total relative ARG abundance was found in PRJEB42019 (Malawi), followed by PRJNA420682 (Canada), PRJNA292471 (USA), and PRJNA684454 (Brazil), respectively. Of note, although PRJEB42019 (Malawi) had the highest average ARG abundance, this could be an artifact of the limited number of samples found in the study. Moreover, the individual sample with the highest total relative ARG abundance was found in PRJNA684454 (Brazil). This sample has nearly twice the total ARG abundance than the second highest sample. Additionally, as illustrated in Figure 14a, there is disparity between BioProjects in terms of within BioProject variation. PRJNA684454 (Brazil) has the largest within BioProject variation, which is significantly wider than that of BioProjects such as PRJNA420682 (Canada), PRJNA597489 (China), and PRJEB23561 (France). Regarding the antimicrobial class composition of the estimated ARGs in cattle (Figure 15), there is clearly clustering based on the geographical location of sampling. Samples residing in the same country exhibit highly similar compositions, whereas Figure 14 | Regional Disparities in ARG Abundance for Cattle and Swine. Total relative ARG abundance is represented for each composition between BioProject, which are split into two subsets: cattle (A) and swine (B). Within BioProject variation is represented by the box and whisker countries varies rather plots. BioProjects are color coded by continent as indicated by the legends in the lower right corners. significantly. Across the 39 board, most commonly ARGs correlated to tetracycline comprise the largest proportion. For PRJNA631951 (Iran), PRJNA420682 (Canada), and PRJEB23561 (France) in particular, ARGs associated with tetracycline comprise over 50% of the total ARG abundance. Contrastingly, the majority of ARGs in PRJNA292471 (USA), PRJNA390551 (USA), PRJNA563872 (USA), PRJNA684454 (Brazil), and PRJEB42019 (Malawi) are associated with classes other than macrolides, streptogramins, lincosamides, aminoglycosides, and tetracyclines. According to a report from the FDA in 2021, of all the domestic sales for antibiotics, an estimated 79% of cephalosporins, 45% of sulfonamides, 52% of aminoglycosides, and 43% of tetracyclines were intended for use in cattle. That being said, in terms of mass (kg) for antimicrobials utilized in food producing animals, cephalosporins and aminoglycosides are rather negligible while tetracyclines are the clear number one (FDA 2021). Due to the large volume and proportion of tetracyclines utilized in the USA, the identification of vast amounts of ARGs associated with tetracyclines in the composition analysis (Figure 15) is far from perplexing. However, no formal conclusion regarding the influence of antimicrobial use on AMR development could be extrapolated in this study. 3.3.2 Analysis of ARGs in metagenomic samples of swine origin In the curated swine samples, the largest average total relative ARG abundance was found in PRJNA823879 (New Zealand), followed by PRJEB42019 (Malawi) and PRJNA526405 (Australia), respectively. Similar to cattle, as illustrated in Figure 14b, there is disparity between BioProjects in terms of within BioProject variation for ARGs in metagenomic samples for swine. The largest amount of variation, and the sample with the highest total relative ARG abundance are both found in PRJNA823879 (New Zealand). Moreover, while there is variation between BioProjects, the degree of variation is not as dramatic as the results from metagenomic samples of cattle origin. Regarding ARG composition (Figure 15), tetracyclines once again comprised the largest proportion of ARGs by drug class. Clustering of drug class proportions based on each samples’ associated 40 country allows for differentiation to a minimal extent, a larger sample size is necessary to further differentiate countries by ARG composition for swine. Of note, the proportion of ARGs associated with tetracyclines, aminoglycosides, and lincosamides in PRJNA526405 (Australia) are consistent across samples. PRJNA823879 (New Zealand), PRJEB42019 (Malawi), and PRJNA575543 (China) all vary significantly in terms of ARG composition. In PRJEB31650 and PRJEB26961 (both Denmark), composition of ARGs in each sample generally agree, but to a lesser extent than that of PRJNA526405 (Australia). Interestingly, the prevalence of ARGs associated with lincosamides appears to be greater in PRJNA526405 (Australia) than the alternative BioProjects. 3.3.3 Limitations of the developed tool As samples for both species were not curated for each country, it is impossible to draw any conclusion pertaining to disparity between cattle and swine on the global stage. One significant limitation imposed by the utilized experimental design. Moreover, a larger sample size is required to determine temporal dynamics of AMR at the country level. Additionally, it is important to note that the results from large-scale application of this pipeline to the SRA should merely be used for exploration. The metadata, sequence data, and sampling methods associated with samples in the SRA are highly heterogenous, Figure 15 | Composition of total estimated ARG abundance by drug class. Samples from each country and species are clustered which limits highly together. Composition for each antimicrobial drug class as a percent of total ARG abundance for each sample is illustrated. The most prevalent confident ARG drug classes (tetracyclines, aminoglycosides, lincosamides, macrolides and streptogramins, and macrolides) are color coordinated with the quantification. graphic, any abundance of genes not partitionable into the aforementioned classes are pooled together are deemed as ‘other’. Furthermore, 41 comparisons and conclusions should primarily be made at the BioProject level as larger, randomly selected, sample sizes are needed for the experiment to summarize the results at the country level. 3.4 Conclusion In conclusion, estimated total ARG abundance and ARG composition varied between BioProjects for metagenomic samples of both cattle and swine origin. Furthermore, the degree of variation within BioProjects for both ARG abundance and ARG composition is not consistent. While the estimated abundances cannot be interpreted as absolute, this tool could prove useful to inform future studies as far determining sampling locations and exploring regional trends. Moreover, this tool could identify potentially high priority areas where actions can be made to aid the mitigation of AMR. The primary conclusion of this study is that large-scale application of the developed analytical framework for estimating ARG abundance in metagenomic samples of animal agriculture origin is feasible, and when employed conjointly with publicly housed genomic sequence data such as that found in the SRA, the spatiotemporal distribution of ARGs on a global scale can be deduced. 42 BIBLIOGRAPHY Alekshun M. N. & Levy S. B. (2007). Molecular mechanisms of antibacterial multidrug resistance. Cell 128(6) 1037–1050. https://doi.org/10.1016/j.cell.2007.03.004 Andersen V. D. Aarestrup F. M. Munk P. Jensen M. S. de Knegt L. V. Bortolaia V. Knudsen B. E. Lukjancenko O. Birkegård A. C. & Vigre H. (2020). Predicting effects of changed antimicrobial usage on the abundance of antimicrobial resistance genes in finisher’ gut microbiomes. Preventive Veterinary Medicine 174 104853. https://doi.org/10.1016/j.prevetmed.2019.104853 Antimicrobial Resistance Collaborators. (2022). Global burden of bacterial antimicrobial resistance in 2019: A systematic analysis. Lancet (London England) 399(10325) 629 –655. https://doi.org/10.1016/S0140-6736(21)02724-0 Burrows, M., & Wheeler, D.J. (1994). A Block-sorting Lossless Data Compression Algorithm. Bushnell Brian. (2014) BBMap: A Fast Accurate Splice-Aware Aligner. United States. D’Costa V. M. King C. E. Kalan L. Morar M. Sung W. W. L. Schwarz C. Froese D. Zazula G. Calmels F. Debruyne R. Golding G. B. Poinar H. N. & Wright G. D. (2011). Antibiotic resistance is ancient. Nature 477(7365) 457–461. https://doi.org/10.1038/nature10388 Dadgostar P. (2019). Antimicrobial Resistance: Implications and Costs. Infection and Drug Resistance 12 3903–3910. https://doi.org/10.2147/IDR.S234610 Darby E. M. Trampari E. Siasat P. Gaya M. S. Alav I. Webber M. A. & Blair J. M. A. (2022). Molecular mechanisms of antibiotic resistance revisited. Nature Reviews. Microbiology. https://doi.org/10.1038/s41579-022-00820-y Davies R. & Wales A. (2019). Antimicrobial Resistance on Farms: A Review Including Biosecurity and the Potential Role of Disinfectants in Resistance Selection. Comprehensive Reviews in Food Science and Food Safety 18(3) 753–774. https://doi.org/10.1111/1541-4337.12438 de Carvalho F. M. Valiatti T. B. Santos F. F. Silveira A. C. de O. Guimarães A. P. C. Gerber A. L. Souza C. de O. Cassu Corsi D. Brasiliense D. M. Castelo-Branco D. de S. C. M. Anzai E. K. Bessa-Neto F. O. Guedes G. M. de M. de Souza G. H. de A. Lemos L. N. Ferraz L. F. C. Bahia M. de N. M. Vaz M. S. M. da Silva R. G. B. … Gales A. C. (2022). Exploring the Bacteriome and Resistome of Humans and Food-Producing Animals in Brazil. Microbiology Spectrum 10(5) e00565-22. https://doi.org/10.1128/spectrum.00565-22 Dunning Hotopp J. C. (2011). Horizontal gene transfer between bacteria and animals. Trends in Genetics: TIG 27(4) 157–163. https://doi.org/10.1016/j.tig.2011.01.005 Gaio D. DeMaere M. Z. Anantanawat K. Eamens G. J. Liu M. Zingali T. Falconer L. Chapman T. A. Djordjevic S. P. & Darling A. E. (2021). A large-scale metagenomic survey dataset of the post-weaning piglet gut lumen. GigaScience 10(6) giab039. 43 https://doi.org/10.1093/gigascience/giab039 Gelalcha B. D. & Kerro Dego O. (2022). Extended-Spectrum Beta-Lactamases Producing Enterobacteriaceae in the USA Dairy Cattle Farms and Implications for Public Health. Antibiotics (Basel Switzerland) 11(10) 1313. https://doi.org/10.3390/antibiotics11101313 Gerber J. Ross R. Bryan M. Localio A. R. Szymczak J. Fiks A. Barkman D. Odeniyi F. Conaboy K. Bell L. Zaoutis T. & Wasserman R. (2018). Comparing Broad- and Narrow-Spectrum Antibiotics for Children with Ear Sinus and Throat Infections. Patient-Centered Outcomes Research Institute (PCORI). http://www.ncbi.nlm.nih.gov/books/NBK582423/ Gharechahi J. Vahidi M. F. Bahram M. Han J.-L. Ding X.-Z. & Salekdeh G. H. (2021). Metagenomic analysis reveals a dynamic microbiome with diversified adaptive functions to utilize high lignocellulosic forages in the cattle rumen. The ISME Journal 15(4) 1108–1120. https://doi.org/10.1038/s41396-020-00837-2 Grada A. & Bunick C. G. (2021). Spectrum of Antibiotic Activity and Its Relevance to the Microbiome. JAMA Network Open 4(4) e215357. https://doi.org/10.1001/jamanetworkopen.2021.5357 Guzman C. & D’Orso I. (2017). CIPHER: A flexible and extensive workflow platform for integrative next-generation sequencing data analysis and genomic regulatory element prediction. BMC Bioinformatics 18(1) 363. https://doi.org/10.1186/s12 859-017-1770-1 Haley B. J. Kim S.-W. Salaheen S. Hovingh E. & Van Kessel J. A. S. (2020). Differences in the Microbial Community and Resistome Structures of Feces from Preweaned Calves and Lactating Dairy Cows in Commercial Dairy Herds. Foodborne Pathogens and Disease 17(8) 494–503. https://doi.org/10.1089/fpd.2019.2768 Hendriksen R. S. Munk P. Njage P. van Bunnik B. McNally L. Lukjancenko O. Röder T. Nieuwenhuijse D. Pedersen S. K. Kjeldgaard J. Kaas R. S. Clausen P. T. L. C. Vogt J. K. Leekitcharoenphon P. van de Schans M. G. M. Zuidema T. de Roda Husman A. M. Rasmussen S. Petersen B. … Aarestrup F. M. (2019). Global monitoring of antimicrobial resistance based on metagenomics analyses of urban sewage. Nature Communications 10(1) 1124. https://doi.org/10.1038/s41467-019-08853-3 Jeong J. Song W. Cooper W. J. Jung J. & Greaves J. (2010). Degradation of tetracycline antibiotics: Mechanisms and kinetic studies for advanced oxidation/reduction processes. Chemosphere 78(5) 533–540. https://doi.org/10.1016/j.chemosphere.2009.11.024 Kapoor G. Saigal S. & Elongavan A. (2017). Action and resistance mechanisms of antibiotics: A guide for clinicians. Journal of Anaesthesiology Clinical Pharmacology 33(3) 300 –305. https://doi.org/10.4103/joacp.JOACP_349_15 Li J. Zhong H. Ramayo-Caldas Y. Terrapon N. Lombard V. Potocki-Veronese G. Estellé J. Popova M. Yang Z. Zhang H. Li F. Tang S. Yang F. Chen W. Chen B. Li J. Guo J. Martin C. Maguin E. … Morgavi D. P. (2020). A catalog of microbial genes from the bovine rumen unveils a 44 specialized and diverse biomass-degrading environment. GigaScience 9(6) giaa057. https://doi.org/10.1093/gigascience/giaa057 Li Y. Kreuzer M. Clayssen Q. Ebert M.-O. Ruscheweyh H.-J. Sunagawa S. Kunz C. Attwood G. Amelchanka S. & Terranova M. (2021). The rumen microbiome inhibits methane formation through dietary choline supplementation. Scientific Reports 11(1) 21761. https://doi.org/10.1038/s41598-021-01031-w Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv: Genomics. Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754–1760. https://doi.org/10.1093/bioinformatics/btp324 Li, W., & Godzik, A. (2006). Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13), 1658–1659. https://doi.org/10.1093/bioinformatics/btl158 Lin X. Kaul S. Rounsley S. Shea T. P. Benito M. I. Town C. D. Fujii C. Y. Mason T. Bowman C. L. Barnstead M. Feldblyum T. V. Buell C. R. Ketchum K. A. Lee J. Ronning C. M. Koo H. L. Moffat K. S. Cronin L. A. Shen M. … Venter J. C. (1999). Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature 402(6763) 761–768. https://doi.org/10.1038/45471 Munk P. Andersen V.D. de Knegt L. Jensen M.S. Knudsen B.E. Lukjancenko O. Mordhorst H. Clasen J. Agersø Y. Folkesson A. et al. (2017). A sampling and metagenomic sequencing- based methodology for monitoring antimicrobial resistance in swine herds. J. Antimicrob. Chemother. 72 385–392. Munk P. Knudsen B. E. Lukjancenko O. Duarte A. S. R. Van Gompel L. Luiken R. E. C. Smit L. A. M. Schmitt H. Garcia A. D. Hansen R. B. Petersen T. N. Bossers A. Ruppé E. EFFORT Group Lund O. Hald T. Pamp S. J. Vigre H. Heederik D. … Aarestrup F. M. (2018). Abundance and diversity of the faecal resistome in slaughter pigs and broilers in nine European countries. Nature Microbiology 3(8) 898–908. https://doi.org/10.1038/s41564- 018-0192-9 Noyes N. R. Yang X. Linke L. M. Magnuson R. J. Dettenwanger A. Cook S. Geornaras I. Woerner D. E. Gow S. P. McAllister T. A. Yang H. Ruiz J. Jones K. L. Boucher C. A. Morley P. S. & Belk K. E. (2016). Resistome diversity in cattle and the environment decreases during beef production. ELife 5 e13195. https://doi.org/10.7554/eLife.13195 O’Neill J. Tackling drug-resistant infections globally: final report and recommendations. London: Review on Antimicrobial Resistance 2016. Poulsen C. S. Ekstrøm C. T. Aarestrup F. M. & Pamp S. J. (2022). Library Preparation and Sequencing Platform Introduce Bias in Metagenomic-Based Characterizations of 45 Microbiomes. Microbiology Spectrum 10(2) e00090-22. https://doi.org/10.1128/spectrum.00090-22 Quan J. Wu Z. Ye Y. Peng L. Wu J. Ruan D. Qiu Y. Ding R. Wang X. Zheng E. Cai G. Huang W. & Yang J. (2020). Metagenomic Characterization of Intestinal Regions in Pigs With Contrasting Feed Efficiency. Frontiers in Microbiology 11 32. https://doi.org/10.3389/fmicb.2020.00032 Quince C. Walker A. W. Simpson J. T. Loman N. J. & Segata N. (2017). Shotgun metagenomics from sampling to analysis. Nature Biotechnology 35(9) 833–844. https://doi.org/10.1038/nbt.3935 Rubiola S. Macori G. Chiesa F. Panebianco F. Moretti R. Fanning S. & Civera T. (2022). Shotgun metagenomic sequencing of bulk tank milk filters reveals the role of Moraxellaceae and Enterobacteriaceae as carriers of antimicrobial resistance genes. Food Research International (Ottawa Ont.) 158 111579. https://doi.org/10.1016/j.foodres.2022.111579 Schwarz S. Shen J. Kadlec K. Wang Y. Brenner Michael G. Feßler A. T. & Vester B. (2016). Lincosamides Streptogramins Phenicols and Pleuromutilins: Mode of Action and Mechanisms of Resistance. Cold Spring Harbor Perspectives in Medicine 6(11) a027037. https://doi.org/10.1101/cshperspect.a027037 Shaffer J. P. Nothias L.-F. Thompson L. R. Sanders J. G. Salido R. A. Couvillion S. P. Brejnrod A. D. Lejzerowicz F. Haiminen N. Huang S. Lutz H. L. Zhu Q. Martino C. Morton J. T. Karthikeyan S. Nothias-Esposito M. Dührkop K. Böcker S. Kim H. W. … Zhang S. (2022). Standardized multi-omics of Earth’s microbiomes reveals microbial and metabolite diversity. Nature Microbiology 7(12) 2128–2150. https://doi.org/10.1038/s41564-022-01266-x Soucy S. M. Huang J. & Gogarten J. P. (2015). Horizontal gene transfer: building the web of life. Nature Reviews Genetics 16 (8) 472-482. Stokes H. W. & Gillings M. R. (2011). Gene flow mobile genetic elements and the recruitment of antibiotic resistance genes into Gram-negative pathogens. FEMS microbiology reviews 35 (5) 790-819. Thomas C. M. & Nielsen K. M. (2005). Mechanisms of and barriers to horizontal gene transfer between bacteria. Nature reviews microbiology 3 (9) 711-721. Thomas M. Webb M. Ghimire S. Blair A. Olson K. Fenske G. J. Fonder A. T. Christopher- Hennings J. Brake D. & Scaria J. (2017). Metagenomic characterization of the effect of feed additives on the gut microbiome and antibiotic resistome of feedlot cattle. Scientific Reports 7(1) 12257. https://doi.org/10.1038/s41598-017-12481-6 Tooke C. L. Hinchliffe P. Bragginton E. C. Colenso C. K. Hirvonen V. H. A. Takebayashi Y. & Spencer J. (2019). β-Lactamases and β-Lactamase Inhibitors in the 21st Century. Journal of Molecular Biology 431(18) 3472–3500. https://doi.org/10.1016/j.jmb.2019.04.002 46 US Centers for Disease Control and Prevention. Antibiotic resistance threats in the United States 2019. Atlanta GA: US Department of Health and Human Services 2019. US Food and Drug Administration Center for Veterinary Medicine. Antimicrobials Sold or Distributed for Use in Food-Producing Animals 2019. US Department of Health and Human Services 2019. US Food and Drug Administration Center for Veterinary Medicine. Antimicrobials Sold or Distributed for Use in Food-Producing Animals 2021. US Department of Health and Human Services 2021. US Food and Drug Administration Center for Veterinary Medicine. Evaluating the Safety of Antimicrobial New Animal Drugs with Regard to their Microbiological Effects on Bacteria of Human Health Concern 2023. Rockville MD: US Department of Health and Human Services 2023. Van Boeckel T.P. Glennon E.E. Chen D. Gilbert M. Robinson T.P. Grenfell B.T. Levin S.A. Bonhoeffer S. and Laxminarayan R. (2017). Reducing antimicrobial use in food animals. Science 357 1350–1352. Van Goethem M. W. Pierneef R. Bezuidt O. K. I. Van De Peer Y. Cowan D. A. & Makhalanyane T. P. (2018). A reservoir of ‘historical’ antibiotic resistance genes in remote pristine Antarctic soils. Microbiome 6(1) 40. https://doi.org/10.1186/s40168-018-0424-5 van Hoek A. H. A. M. Mevius D. Guerra B. Mullany P. Roberts A. P. & Aarts H. J. M. (2011). Acquired antibiotic resistance genes: An overview. Frontiers in Microbiology 2 203. https://doi.org/10.3389/fmicb.2011.00203 Wibowo M. C. Yang Z. Borry M. Hübner A. Huang K. D. Tierney B. T. Zimmerman S. Barajas- Olmos F. Contreras-Cubas C. García-Ortiz H. Martínez-Hernández A. Luber J. M. Kirstahler P. Blohm T. Smiley F. E. Arnold R. Ballal S. A. Pamp S. J. Russ J. … Kostic A. D. (2021). Reconstruction of ancient microbial genomes from the human gut. Nature 594(7862) 234–239. https://doi.org/10.1038/s41586-021-03532-0 Woolhouse M. Ward M. van Bunnik B. & Farrar J. (2015). Antimicrobial resistance in humans livestock and the wider environment. Philosophical Transactions of the Royal Society of London. Series B Biological Sciences 370(1670) 20140083. https://doi.org/10.1098/rstb.2014.0083 Xue M.-Y. Xie Y.-Y. Zhong Y. Ma X.-J. Sun H.-Z. & Liu J.-X. (2022). Integrated meta-omics reveals new ruminal microbial features associated with feed efficiency in dairy cattle. Microbiome 10(1) 32. https://doi.org/10.1186/s40168-022-01228-9 Young W. Maclean P. Dunstan K. Ryan L. Peters J. Armstrong K. Anderson R. Dewhurst H. van Gendt M. Dilger R. N. Dekker J. Haggarty N. & Roy N. (2022). Lacticaseibacillus rhamnosus HN001 alters the microbiota composition in the cecum but not the feces in a piglet model. Frontiers in Nutrition 9 1002369. https://doi.org/10.3389/fnut.2022.1002369 47 Zaheer R. Noyes N. Ortega Polo R. Cook S. R. Marinier E. Van Domselaar G. Belk K. E. Morley P. S. & McAllister T. A. (2018). Impact of sequencing depth on the characterization of the microbiome and resistome. Scientific Reports 8(1) 5890. https://doi.org/10.1038/s41598 - 018-24280-8 Zhang Z. Zhang Q. Wang T. Xu N. Lu T. Hong W. Penuelas J. Gillings M. Wang M. Gao W. & Qian H. (2022). Assessment of global health risk of antibiotic resistance genes. Nature Communications 13(1) 1553. https://doi.org/10.1038/s41467-022-29283-8 48 APPENDIX SRA Sample Annotation Procedure 1. Evaluate the presence of any missing data for the sample in the local database. 1. Collection Date (NOT submission date), Organism, Location (State/City), Source of Sample (Fecal, Intestinal, etc.), additional Accession IDs 2. Access the SRA website, search associated Accession ID, and locate samples or project. 1. Can search any or multiple of the accession numbers (SRS, SRP, PRDJ, etc.) 2. Information on meaning of different IDs can be found here: 3. Jump between BioProject, SRA, and BioSample hyperlinks to confirm that you have found the exact samples and or project of interest. 1. Note similar and or contradicting information between online metadata and our local metadata. Update local database as needed. 4. If there is still missing / unconfirmed / contradicting data in our local file: 1. Is there an explicit and exact publication stated in the SRA metadata? 1. YES 1. Navigate to the publication online, and review the paper: 1. What to Find: 1. Sample Collection Date 1. Format: 1. DD:MM:YYYY 2. If there is a range of dates all in same year, just record the year: 1. YYYY 3. If it is a multi-year study, annotate each sample individually with a collection date. If per sample resolution is not available just record the more recent year (YYYY). 2. Host Species 3. Geographical Location 1. Format: 1. Farm:City:State/Province/Country 1. If any missing replace with “NA”: 2. i.e. NA:NA:Michigan:USA 4. Source of Sample (fecal, nasal, etc.) 5. Additional (these can be indicated in notes section): 1. Were all samples collected roughly in the same area for the study? If they vary significantly then we will need resolution down to each sample for that project. 2. Were all samples taken from the same species? If not, are samples collected from an alternative species annotated correctly? AND are these alternative species samples included in our local database? 49 3. Are there duplicated sample IDs in our spreadsheet for this project? This likely indicates 16s rRNA amplicon sequencing. b. Order for Evaluation: 1. Title 2. Materials and Methods 3. Abstract 4. Supplemental Figures and Information 5. Data Availability 1. Use this to check alternative areas that data information may be housed if the above four sections fail to offer insight. ii. NO 1. Search the internet for publications tied to authors and data that may be available but not explicitly stated in the SRA metadata. 1. Copy and paste #PRJxxxxx from SRA into google. 2. Copy and paste abstract and or title or alternative Accession IDs into google. 3. Once a paper has been located and confirmed use the same procedure as identified above in 4.a.i.1. 2. If still can’t find publication: 1. Send to Lee Ackerson or Dr. Wen Huang 2. They will email authors for confirmation or discovery of the associated data. 3. An executive decision may be made to remove the PRJ from our study due to low confidence metadata. 1. Factors Affecting this decision: 1. Size of study (#samples) 2. Completeness and detail of SRA annotation by sample submitter 5. Lastly, sign initials in column saying that you finished the annotation and record any notes you feel may be significant in the designated column. Also state your confidence level in your annotation from 0%-100% in the designated column. If you are the second reviewer, note your annotation concordance (agree or disagree) with the first reviewer; if disagreement, note what deciphered metadata differs. 50