This is to certify that the dissertation entitled

GENERATION OF RULES FOR ORGANIC SUBSTRUCTURE DETERMINATION BY CLASSIFICATION OF MS/MS SPECTRA

presented by Eric C. Hemenway has been accepted towards fulfillment of the requirements for the Ph.D. degree in Chemistry.

By

Eric C. Hemenway

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Chemistry

1996

ABSTRACT

Methodologies have been developed to assist in the elucidation of chemical structures from mass spectrometry / mass spectrometry (MS/MS) spectra. Because of the complexity of MS and MS/MS ionization and fragmentation pathways, determining ionic structure is difficult and perhaps impossible to do in a convenient manner. The methods developed here are intended to avoid current deficiencies in ion structure determination by classifying observed product spectral patterns.
Classification of MS/MS product spectra for isomass precursor ions is first accomplished through the use of cluster analysis. The similarities and differences that exist among the measured product spectra are determined. This process is entirely empirical and does not assume that the structural integrity of ions is maintained in the ionization and fragmentation of analyte molecules. Hence, identification of ion structure is not required.

The ability to group similar product spectra based on spectral features, and to correlate those groups with common substructures found in analyte molecules, is central to this substructure determination process. Therefore, a method was developed to match unknown MS/MS spectra against the representative group spectra. The method developed here was designed for the feature-poor product spectra observed in low-energy collision-induced dissociation.

The clustered product spectra are then analyzed to determine what characteristic features they possess. These characteristic features are then used to construct a hierarchical rule-tree suitable for rapid classification of unknown isomass product spectra. These rule-trees are also suitable for targeted component analysis.

The methodologies developed here have been successfully applied to the classification of product spectra of m/z 59. By expansion of the product spectral database, it is expected that these methods can be applied to other precursor m/z values.

To my parents Marjorie and Dan Hemenway for always being there
ACKNOWLEDGMENTS

First and foremost I thank my research advisor, Dr. Chris Enke. While we have had our differences, I greatly respect his intellect, insight, and creativity. I also appreciate the freedom he allows his students in the formulation and pursuit of their research projects. I also thank the other members of my research committee, Dr. Stan Crouch, Dr. Ned Jackson, and Dr. William Reusch.

I also thank my undergraduate research advisor, Dr. Adrian Wade, for getting me involved in research and for his advice on all aspects of graduate research. A special thank you to University of British Columbia Lab Director Ben Clifford for his many stories about undergraduate teaching adventures, or perhaps I should say misadventures. These stories have provided me with a unique perspective on software development and the necessity of student-proofing software, as well as a ready source of humor.

I thank the members of the Crouch group, Stephen Medlin, James (Odie) Ridge, Brett Quencer, and Edwin Townsend, for the many basketball games, lunches, and discussions, for their friendship, and for making me an honorary Crouch group member. My thanks to Dr. Tom Atkinson and Dr. Tom Carter for the many afternoon discussions on all aspects of computing.

I would also like to thank several past and present members of the Enke group, in particular Mark Cole, David McLane, Kate Noon, Ron Lopshire, Paul Vlasak, Fei Liu Overney, Gregor Overney, and Dinorah Frutos, for their advice and friendship, which have meant a lot to me over my graduate career.
Of course Tina Erickson deserves special mention as she has come to be much more than a friend. Her support has been very important to the completion of this dissertation.

I do not thank the TSQ 70B/700 mass spectrometer at Michigan State University for its ability to de-tune within minutes. Fortunately, the TSQ 7000 proved to behave in exactly the opposite manner.

Last, I thank the National Science Foundation's Center for Microbial Ecology at Michigan State University, the American Society for Mass Spectrometry, and Finnigan MAT for their financial support.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: Introduction
    References
CHAPTER 2: Structure Elucidation Methods in Mass Spectrometry
    Introduction
    Pattern Matching Methods
    Interpretive Systems
    Advantages of MS/MS over MS
    Conclusions
    References
CHAPTER 3: Cluster Analysis as an Analytical Tool
    Introduction
    Background
    Similarity Calculation Methods
    Euclidean Distance
    Minkowski Distance
    Mahalanobis Distance
    Hierarchical Clustering Methods
    Single Linkage Clustering
    Complete Linkage Clustering
    Average Linkage Clustering
    Partitional Clustering Methods
    K-Means Clustering
    Probability-based and Density-based Clustering
    Kernel Density Clustering
    Fuzzy Clustering
    Application of Cluster Analysis to Problems of Chemical Significance
    Application of Cluster Analysis to Mass Spectral Data
    Conclusions
    References
CHAPTER 4: Application of Cluster Analysis to Microbial Characterization
    Introduction
    Background
    Experimental
    Results and Discussion
    Conclusions
    References
CHAPTER 5: Generating Clusterable MS/MS Data
    Introduction
    Acquisition Method [remainder illegible]
    Instrumental Parameters
    Sample Introduction
    Effects of Collision Energy on CID Spectra
    Effects of Target Gas Pressure on CID Spectra
    Instrumental Conditions Selected for MS/MS Library Generation
    Generating Reproducible Spectra
    Post-Acquisition Data Processing and Storage
    Conclusions
    References
CHAPTER 6: Development of Clustering Procedure for MS/MS Data
    Introduction
    Similarity Calculations on MS/MS Data
    Clustering MS/MS Data
    Interpretation of the Clusters
    Conclusions
    References
CHAPTER 7: Product Ion Classification for Standards and Unknowns
    Introduction
    Establishing Representative Descriptors
    Discovery of Non-Discriminating Features
    Classification of Unknowns by Spectral Matching
    Classification of Unknowns by Rule Application
    Improving Rules with Rule-Trees
    Conclusions
    References

LIST OF TABLES

Table 5.1 Tune parameters for Q1MS mode with typical values in volts.
Table 6.1 Compounds analyzed for m/z 59 and their representative substructures. Duplications were used to investigate reproducibility.
Table 6.2 Results for partitioning m/z 59 into seven clusters.
Table 7.1 A comparison of the fuzzy matches of the m/z 59 product mass spectra by the three membership functions shown in Figure 7.3.

LIST OF FIGURES

Figure 1.1 Schematic of the triple quadrupole mass spectrometer.
Figure 1.2 The major operational scan modes of the triple quadrupole mass spectrometer.
Figure 2.1 The DENDRAL approach to structure elucidation.
Figure 2.2 Schematic diagram of the classification and identification expert system developed by Scott.
Figure 2.3 The ACES approach to structure elucidation.
Figure 2.4 A proposed feedback loop for ACES that would allow the system to recommend ancillary experiments.
Figure 3.1 Examples of cluster structures: (a) well-separated clusters; (b) touching or overlapping clusters. The axes x1 and x2, representing the two clustering dimensions, are generic in this example but could represent such things as relative intensities for two m/z values or infrared intensities at two vibrational frequencies.
Figure 3.2 Four examples of complex cluster shapes: (a) spherical, (b) elongated, (c) linearly inseparable, and (d) dense clusters.
Figure 3.3 Categories of classification types.
Figure 3.4 A graphical representation of the Euclidean distance metric.
Figure 3.5 An example of a dendrogram.
Figure 3.6 The single linkage hierarchical clustering of selected unknown microorganisms obtained by phospholipid profiling.
Figure 3.7 The complete linkage hierarchical clustering of the same selected unknown microorganisms presented in Figure 3.6 obtained by phospholipid profiling.
Figure 3.8 The relative distance measures for single, complete, and average linkage methods. Samples 1 and 2 would initially be joined at distance d_1. Sample 3 would join at d_single, d_complete, or d_average, respectively.
Figure 3.9 The average linkage hierarchical clustering of the same selected unknown microorganisms presented in Figure 3.6 and Figure 3.7 obtained by phospholipid profiling.
Figure 4.1 The proposed CME approach to incorporate the research objectives in the Community Diversity Thrust Group into a more cohesive package.
Figure 4.2 The general glycerophospholipid structure. The head group substituent is represented by Y and the two fatty acids by R and R'.
Figure 4.3 The phospholipid head groups and their respective neutral losses utilized in this study.
Figure 4.4 Low-energy collision-induced dissociation neutral loss fragmentation of phospholipids for positive ions.
Figure 4.5 Phosphatidylethanolamine (PE) mass profiles for Salmonella abaetetuba, Citrobacter freundii, and Escherichia coli.
Figure 4.6 The FAB/MS/MS phospholipid (class) mass profile obtained in part by this analysis procedure. The phospholipid classes axis is determined by the neutral losses monitored and the phospholipid masses axis is determined from the respective neutral loss scans. The fatty acid masses axis is determined by operating the instrument in negative ion mode, which was not utilized for this study.
Figure 4.7 The Instrument Control Procedure used to control the mass spectrometer in this study.
Figure 4.8 Dendrogram for the analysis of phosphatidylglycerol (PG).
Figure 4.9 Dendrogram for the analysis of phosphatidic acid (PA).
Figure 4.10 Dendrogram for the analysis of phosphatidylethanolamine (PE).
Figure 4.11 Dendrogram for the analysis of [illegible] (PDM).
Figure 4.12 Dendrogram for the analysis of [illegible]ylethanolamine (PM).
Figure 4.13 Dendrogram for the analysis of all combined phospholipids demonstrating the loss of differentiating effectiveness.
Figure 5.1 A schematic of the TSQ-7000 mass spectrometer.
Figure 5.2 Typical Q1MS, Q3MS, and NEU 0 profiles for the tuning compound PFTBA.
Figure 5.3 The leak inlet system used for high volatility compounds.
Figure 5.4 Product spectra of the molecular ion of cyclohexanone with a collision energy of (a) 10 eV, (b) 40 eV, and (c) 80 eV with a target gas pressure of 3x10[exponent illegible] torr (manifold). (* denotes off-scale precursor peak at 100%)
Figure 5.5 Relative intensity of several product ions of the molecular ion of cyclohexanone plotted as a function of collision energy at a target gas pressure of 3x10[exponent illegible] torr (manifold).
Figure 5.6 Product spectra of the molecular ion of 3-heptanone with a collision energy of (a) 10 eV, (b) 40 eV, and (c) 80 eV with a target gas pressure of 3x10[exponent illegible] torr (manifold).
Figure 5.7 Relative intensity of several product ions of the molecular ion of 3-heptanone plotted as a function of collision energy at a target gas pressure of 3x10^-6 torr (manifold).
Figure 5.8 Relative intensity of several product ions of the molecular ion of 3-heptanone plotted as a function of collision energy at a target gas pressure of 3x10^-6 torr (manifold). These results were obtained by repeating the experiment shown in Figure 5.7.
Figure 5.9 Effect of increasing target gas pressure at a constant collision energy (-30 V) for products of the molecular ion of cyclohexanone.
Figure 5.10 Log-log plot of the data presented in Figure 5.9. The slopes indicate reaction order.
Figure 5.11 Effect of increasing target gas pressure at a constant collision energy (~30 V) for products of the molecular ion of 3-heptanone.
Figure 5.12 Log-log plot of the data presented in Figure 5.11. The slopes indicate reaction order.
Figure 5.13 Instrument Control Procedure for the collection of normal mass spectra.
Figure 5.14 Instrument Control Procedure for the collection of product mass spectra.
Figure 5.15 Product spectra for the molecular ion of cyclohexanone. (a) and (b) were collected on the same day while (c) and (d) were collected on a different day with a different instrument tune.
Figure 5.16 Product spectra for the molecular ion of 3-heptanone. (a) and (b) were collected on the same day while (c) was collected on a different day with a different instrument tune.
Figure 6.1 The m/z 59 substructures of interest. Potential bonding positions are labeled a through d.
Figure 6.2 The Euclidean distance / single linkage hierarchical clustering dendrogram for selected m/z 59 product spectra.
Figure 6.3 Sample product spectra of m/z 59 for selected compounds listed in Table 6.1.
Figure 6.4 Mixture product spectrum for m/z 59 of sec-butyl alcohol.
Figure 7.1 Cluster mean product spectra for m/z 59. The error bars indicate the standard deviation in relative intensity of the cluster members. Clusters 2, 6, and 7 are single member clusters and have no error bars.
Figure 7.2 Reduced feature set dendrogram for the m/z 59 product spectra.
Figure 7.3 Fuzzy membership functions investigated for spectral matching.
Figure 7.4 Rule-tree for m/z 59 product spectra.
Figure 7.5 The members and expected substructures for m/z 59 cluster 1.
Figure 7.6 The members and expected substructures for m/z 59 clusters 2 and 6 respectively.
Figure 7.7 The members and expected substructures for m/z 59 cluster 3.
Figure 7.8 The members and expected substructures for m/z 59 cluster 4.
Figure 7.9 The members and expected substructures for m/z 59 cluster 5.

CHAPTER 1

Introduction

The focus of this thesis is the development of software-based tools to assist in the structural elucidation of organic chemical species using tandem mass spectral (MS/MS) data. These tools attempt to identify the substructure of the molecule that gives rise to each fragment ion in the normal mass spectrum, based on the product spectrum of that ion, and then to reconstruct the molecule from the partial substructures that have been identified.

The development of "intelligent" software for structure elucidation is not a new concept in chemistry and is indeed a necessity given the large amounts of data generated by modern analytical instruments. Mass spectrometers in particular are capable of producing large quantities of data related to molecular ion structure from small amounts of material. However, the interpretation of the data into structural information is often complicated, ambiguous, and incomplete.
For this reason, substantial effort has been made in the development of structure elucidation software based on mass spectral data.

In 1965 work began at Stanford on the DENDRAL project. The objective of the DENDRAL project was to develop software which could determine molecular structures consistent with the normal mass spectrum for an unknown compound [1,2]. As input, the DENDRAL software accepted the unit-resolution mass spectrum of an unknown and the empirical formula of the molecular ion. The software would then attempt to generate all possible candidate molecular structures, using those functionalities believed to be present, which were consistent with the mass spectrum. The DENDRAL software was found to be as effective as a human expert in elucidating chemical structures, due in large part to the systematic approach used. The major shortcoming of DENDRAL was that the conventional mass spectral data alone were insufficient for complete and unambiguous identification of unknown structures. In order to provide better candidate structures, the DENDRAL software often required NMR and IR data in conjunction with the mass spectrum. A more detailed description of the DENDRAL project is presented in Chapter 2.

The Self-Training Interpretive and Retrieval System (STIRS) developed by McLafferty et al. is another attempt to develop software to assist in structural elucidation by mass spectrometry. The STIRS software is of value in cases where library spectral matching alone does not provide unambiguous identification of an unknown [3-5]. The STIRS software attempts to identify the major substructural feature(s) of an unknown by correlating the conventional mass spectrum with the best matching library spectra through the use of class libraries. A more detailed discussion of the STIRS approach can be found in Chapter 2. The STIRS software, like the DENDRAL software, is in many cases unable to provide definitive structural information.
Many of the problems with the DENDRAL and STIRS approaches can be attributed to the complexities of the fragmentation processes that occur in the electron impact (EI) ion source. Signals from all the ions produced in the ion source are represented in the conventional mass spectrum. As a result, the conventional mass spectrum contains peaks due to single and multiple fragmentation processes, recombination reactions, and rearrangement reactions. The DENDRAL and STIRS software were unable to interpret the effect of all these processes from the mass spectrum and hence could not provide unambiguous information on the structures present.

In order to circumvent as many of the inherent shortcomings of conventional mass spectral data as possible in the elucidation of chemical structures, Enke and co-workers utilized MS/MS and the additional structure-related data it affords. To this end, they developed the Methods for Analyzing Patterns in Spectra (MAPS) software [6-13], in which feature/substructure relationships are determined empirically from both conventional mass spectral (MS) and tandem mass spectral (MS/MS) information. The initial MAPS approach was to identify the structure of observed ions and correlate the ion structure with substructures present in the molecule. However, this approach was soon discontinued because the correlation process required comprehensive knowledge of the ion formation processes occurring in the ion source. This requirement was later circumvented by the "feature bucket" approach, which was based on the idea that spectral features and substructures could be correlated without complete spectral rationalization [8]. The premise of the feature bucket approach was that by examining the MS/MS spectra for several compounds which share only one particular substructure, the most prevalent spectral features observed should be representative of this substructure.
These features were then combined into a rule which could be applied to an unknown to determine the presence or absence of this particular substructure. However, this concept suffered from the arbitrary selection of the substructure present in the molecule and from the assumption that the ions representing this substructure would fragment from different molecules in a consistent manner. The quality of the rules was also degraded by the inclusion of features that were observed in the MS/MS spectra but were not derived from the substructure of interest.

The third generation of the MAPS software, developed by Hart and Enke [11], introduced the concept of "feature combinations". The correlated features generated by this approach are obtained in the same manner as in the previous version, but the substructure rules are made up of combinations of those features which occur concertedly. As a result, a substructure is considered to be present in an unknown only if all elements of a given combination of features are observed, rather than if a fraction of the elements in a set of structure-related features is observed. While feature combination rules are more definitive than discrete feature rules, they are difficult to discover and they still suffer from arbitrary substructure selection and substructure contamination. The MAPS approach is described in greater detail in Chapter 2.

In this research project, the problem of structure elucidation using tandem mass spectral information is approached from a more data-oriented perspective. The tools developed here have no preconceived constraints regarding either fragmentation pathways or the specific substructures to be determined. Substructure identification rules are determined from the identification of common data patterns and the subsequent discovery of the substructures to which they are related. This differs from the MAPS approach, which correlates arbitrarily selected substructures with common data features.
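The distinction drawn above between discrete feature rules and feature combination rules can be illustrated with a short sketch. This is not the MAPS implementation: the feature labels (written here as "precursor->product" strings), the 0.5 fraction threshold, and both function names are hypothetical and chosen only to make the contrast concrete.

```python
# Illustrative sketch only: feature labels, threshold, and function
# names are assumptions, not taken from the MAPS software.

def discrete_rule(observed, features, min_fraction=0.5):
    """Indicate the substructure if at least min_fraction of the
    structure-related features appear in the observed set."""
    hits = sum(1 for feature in features if feature in observed)
    return hits / len(features) >= min_fraction


def combination_rule(observed, combinations):
    """Indicate the substructure only if every feature of at least
    one feature combination appears in the observed set."""
    return any(combo <= observed for combo in combinations)


# Hypothetical MS/MS features observed for an unknown.
observed = {"59->31", "59->41", "73->45"}

# Hypothetical features correlated with one substructure.
features = ["59->31", "59->43", "73->45", "87->59"]

# The same features grouped into combinations that occur together.
combinations = [
    frozenset({"59->31", "59->43"}),
    frozenset({"73->45", "87->59"}),
]

fraction_says_present = discrete_rule(observed, features)
combination_says_present = combination_rule(observed, combinations)
```

In this made-up case two of the four features are observed, so the fraction-based rule reports the substructure, while neither combination is fully observed, so the combination rule does not. This mirrors the statement that combination rules are the more definitive of the two.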
The determination of the substructures that are represented by the data is a more natural approach and has been used by chemists since the early development of MS/MS instruments. An approach similar to that which is followed and expanded on in this project has been presented by Cross and Enke [6,7]. However, Cross and co-workers never implemented the substructure searching component of their system. As a result, the system was only capable of matching unknown product spectra against the reference spectra. Any feature/substructure correlations had to be made by the operator. In the approach used in this project, the spectrum matching and substructure searching components are incorporated through the use of exploratory analysis techniques such as cluster analysis. Mass spectral variance is accommodated through the use of fuzzy logic. A description of cluster analysis in chemistry, with particular focus on those methods used in this research, is provided in Chapter 3. Further details of the clustering implementation utilized in this work are given in Chapter 6. The details of the fuzzy logic implementation are given in Chapters 6 and 7.

The intent of this project is to develop an intelligent computer-assisted structure elucidation system embodying those qualities envisioned by Yost and Enke [14] and Cross and Enke [6,7]. This system uses clustering algorithms to group product spectra based on similarities in their m/z and relative abundance values. The structures of the molecules which formed these groups are then examined for common substructural features. Where common substructural features are obtained, rules are constructed relating these substructures to the product spectra. The basic form of the rule is that if the unknown product spectrum matches well with a particular reference product spectrum, then the substructural unit represented by the reference product spectrum is present in the unknown compound.
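The basic rule form stated above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the matching procedure developed in this work: it uses a plain Euclidean distance with a fixed cutoff in place of the fuzzy matching described in Chapters 6 and 7, and the reference spectra, substructure labels, and the `max_distance` value are all hypothetical.

```python
import math

# Illustrative sketch only: spectra, labels, and the distance cutoff
# are invented for this example.

def euclidean_distance(spectrum_a, spectrum_b):
    """Distance between two spectra given as {m/z: relative intensity};
    a peak missing from one spectrum counts as intensity 0."""
    all_mz = set(spectrum_a) | set(spectrum_b)
    return math.sqrt(
        sum((spectrum_a.get(mz, 0.0) - spectrum_b.get(mz, 0.0)) ** 2
            for mz in all_mz)
    )


def apply_rules(unknown, references, max_distance=20.0):
    """Return (substructure, distance) for the closest reference
    spectrum, or (None, distance) if no reference matches well."""
    best_label, best_distance = None, float("inf")
    for label, reference in references.items():
        distance = euclidean_distance(unknown, reference)
        if distance < best_distance:
            best_label, best_distance = label, distance
    if best_distance <= max_distance:
        return best_label, best_distance
    return None, best_distance


# Hypothetical representative product spectra for one precursor m/z.
references = {
    "substructure A": {31: 100.0, 41: 20.0},
    "substructure B": {43: 100.0, 15: 35.0},
}

unknown = {31: 100.0, 41: 28.0, 29: 5.0}
label, distance = apply_rules(unknown, references)
```

Here the unknown lies close to the reference for "substructure A", so the rule reports that substructure; a spectrum far from every reference returns `None`, which corresponds to no rule firing.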
These rules can then be applied to unknown spectra to determine the presence of the substructures. The rules would be of particular interest in cases where precursor ion isomers are not readily distinguishable by conventional mass spectrometry. Potential substructures are obtained from each product spectrum of the unknown. A further enhancement in this work is the development of rule-trees. Rule-trees are generated by a distillation and extraction of the most discriminating factors of the rules developed by the method described above. Since rule-trees contain only the most discriminating spectral features and are a conjunction of all the rules relating to a particular precursor m/z, they are of particular use for targeted component analysis. Once a list of known substructures is constructed, they can then be correlated to determine plausible molecular structures using GENOA. GENOA, developed as part of the DENDRAL project, is a structure generation program which will assemble all the substructures believed to be present into plausible molecular structures [15]. The details of rule generation and implementation approach developed in this work are provided in Chapters 6 and 7. In order to develop a database of representative MS/MS spectra (a product spectrum for selected significant masses in the normal spectrum) and to be able to use these data for the structural elucidation of unknowns, it is necessary to be able to obtain MS/MS spectra in a reproducible and reliable manner. Therefore, this research will also identify and characterize those instrumental parameters which have the greatest effect on the quality and reproducibility of the product spectra. For this research, mass spectra were obtained on Finnigan TSQ-700 and Finnigan TSQ-7000 tandem mass spectrometers equipped with an electron impact ion source as diagrammed in Figure 1.1. 
While the TSQ has several different modes of operation, some of which are shown in Figure 1.2, the product scan mode is most commonly used and will be of most use to this project. The product scan mode of operation is as follows. Rather than directly detecting the ions formed from overlapping fragmentation processes in the ion source, the first mass analyzer (Q1) individually selects these ions. The selected precursor ions which pass through the first mass analyzer are allowed to undergo collision(s) with a target gas, argon in the work presented here, in the collision chamber. The products of these collisions can then be mass analyzed by scanning the second mass analyzer (Q3) [16].

The major advantage of using the collision chamber to induce fragmentation of the precursor ions is that the products of these reactions are separated from the fragmentation products of all other ions in the source. In other words, a second dimension of information is obtained because the spectra obtained at the detector contain the collision-induced dissociation (CID) product ions of only the precursor ions of the selected mass. These spectra provide not only the precursor ion mass-to-charge ratio (as observed in a normal mass spectrum), but also the CID product ions, which contain structural information about the precursor fragment ion. The key parameters for controlling the CID process are the collision offset voltage, the collision target gas, and the collision gas pressure. The characterization of these parameters is provided in Chapter 5.
Figure 1.1 Schematic of the triple quadrupole mass spectrometer. [Figure not reproduced.]

Figure 1.2 The major operational scan modes of the triple quadrupole mass spectrometer. [Figure not reproduced. Panel labels: Normal (Q1MS) Scan Mode: scan m/z 10-500. Product Scan Mode: select m/z 500, scan m/z 50-510. Precursor Scan Mode: scan m/z 140-500, select m/z 150. Neutral Loss/Gain Scan Mode: scan m/z 100-500, select m/z 50-450 (loss), select m/z 150-550 (gain).]

References

1. Duffield, A. M.; Robertson, A. V.; Djerassi, C.; Buchanan, B. G.; Sutherland, G. L.; Feigenbaum, E. A.; Lederberg, J. J. Am. Chem. Soc. 1969, 91, 2977-2981.
2. Buchs, A.; Delfino, A. B.; Duffield, A. M.; Djerassi, C.; Buchanan, B. G.; Feigenbaum, E. A.; Lederberg, J. Helv. Chim. Acta 1970, 53, 1394-1417.
3. Gray, N. A. B. In Computer-Assisted Structure Elucidation. John Wiley & Sons: New York, 1986; pp. 995-100.
4. Haraki, K. S.; Venkataraghavan, R.; McLafferty, F. W. Anal. Chem. 1981, 53, 386-393.
5. Lowry, S. R.; Isenhour, T. L.; Justice, J. B. Jr.; McLafferty, F. W.; Dayringer, H. E.; Venkataraghavan, R. Anal. Chem. 1977, 49, 1720-1722.
6. Cross, K. P.; Palmer, P. T.; Beckner, C. F.; Giordani, A. B.; Gregg, H. G.; Hoffman, P. A.; Enke, C. G. In Artificial Intelligence Applications in Chemistry. ACS Symposium Series No. 306, 1986; pp. 321-336.
7. Cross, K. P. Ph.D. Thesis, Michigan State University, 1985.
8. Wade, A. P.; Palmer, P. T.; Hart, K. J.; Enke, C. G. Anal. Chim. Acta 1988, 215, 169-186.
9. Palmer, P. T. Ph.D. Thesis, Michigan State University, 1988.
10. Palmer, P. T.; Hart, K. J.; Enke, C. G.; Wade, A. P. Talanta 1989, 36, 107-116.
11. Hart, K. J. Ph.D. Thesis, Michigan State University, 1989.
12. Hart, K. J.; Enke, C. G. Chemom. and Intell. Lab. Sys. 1990, 8, 293-302.
13. Hart, K. J.; Enke, C. G. In Computer-Enhanced Analytical Spectroscopy; Jurs, P. C., Ed.; Plenum: New York, 1992; Vol. 3, Chapter 6.
14. Yost, R. A.; Enke, C. G. In American Laboratory. June 1981; pp 88-95.
15. Genoa Reference Manual. Molecular Design Ltd., Hayward, California, 1984.
16. TSQ/SSQ 700 Series Systems Operator's Manual. Finnigan MAT, San Jose, California, 1990; pp 3-6.

CHAPTER 2

Structure Elucidation Methods in Mass Spectrometry

Introduction

The determination of molecular structure has been a fundamental problem for researchers for the last century. To solve this problem, a researcher may employ one of numerous methodologies. On the modern research front, these may include nuclear magnetic resonance (NMR), spectroscopic methods such as infrared (IR), ultraviolet (UV), or Raman, and mass spectrometry (MS).

Mass spectrometry has been used routinely for chemical structure elucidation since the 1960s. Unfortunately, molecular structure determination from mass spectral data is often complicated, ambiguous, and incomplete. This situation is further complicated by the large amounts of data that can be produced by modern mass spectrometers. For example, liquid chromatographic interfaces to mass spectrometers are often outfitted with multiwavelength ultraviolet detectors to provide UV information on eluents entering the mass spectrometer source. For this reason, several significant initiatives have been undertaken to develop automated and semi-automated computer software to assist in the elucidation of molecular structures from MS data.

These initiatives generally fall into one of two categories: Pattern Matching Methods or Interpretive Systems. Small [1] describes these two broad categories as direct and indirect database methods, respectively. Pattern Matching, or direct database methods, are those which require the presence of a spectral database in both the development and implementation of the interpretation procedure. Interpretive systems, or indirect database methods, use a database in the development of the method, but not necessarily in the implementation.
Pattern Matching Methods

All commercial mass spectrometers sold today contain a basic spectral library search package. Differences exist between the packages with respect to the number and type of molecules represented by their spectra, the database indexing and retrieval algorithms, and the relative intensity and/or mass/charge (m/z) weighting scheme applied to spectral peaks. Nevertheless, all the library searching approaches have the same objective: to accurately match the mass spectra of unknown compounds or mixtures with known library spectra. While some pattern matching methods have been more successful than others, they still share, in many respects, the same limitations.

There are generally four situations that a successful library search system must address in order to be useful for identifying unknowns. First, a reference spectrum for an unknown is included in the library with a pattern very close to the measured unknown. Second, a reference spectrum for an unknown is included in the library but with a different pattern. Third, the unknown spectrum is a mixture, the components of which are contained in the library. Fourth, the unknown spectrum, if pure, is not included in the library, or one or more components are not present if the unknown is a mixture.

The above four items are arranged in order of increasing difficulty. Obviously, item one is the ideal case and can be made possible using small custom databases intended for use where the components of an analyte are known. The second case often arises as a result of day-to-day variations in instrument tuning or from mixing spectra collected from different types of mass spectrometers that may exhibit different mass-to-charge sensitivities. The third case can often be alleviated by the use of a chromatographic inlet system to the mass spectrometer, except where components co-elute and cannot be readily resolved.
The fourth case is perhaps the most limiting case for any library search system intended for general use. A library search system, no matter how involved the matching algorithm, can only be as good as the spectral database from which it works. As a result, the most detrimental limitation of library search systems is that they can never be complete. At present, there are in excess of ten million different chemical structures that have been identified in nature or synthesized in the laboratory. Current mass spectral libraries contain on the order of tens of thousands of reference spectra and are therefore severely lacking in completeness. Since even the list of known compounds is nowhere close to being complete, no mass spectral library can ever be complete. To make matters worse, as mass spectral libraries become more comprehensive, their differentiating ability decreases, because the library is increasingly likely to contain reference spectra that closely resemble each other. Furthermore, the reference spectra may be subject to situation two in the preceding paragraph, as they are often collected on multiple systems or on a reference system that may be different from the one used to collect the unknown.

The most common library match system in use today is the Probability Based Matching (PBM) algorithm developed by McLafferty and coworkers [2,3]. The PBM algorithm uses a statistical weighting of the peaks in a spectral database in inverse proportion to their occurrence in the database. Therefore, peaks that occur often in the database are given a lower differentiating-ability weighting factor than peaks that occur less often. PBM also uses various methods of peak flagging and abundance scaling to attempt to compensate for distortions in the database caused by differences between instruments and tuning conditions.
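The idea of weighting peaks in inverse proportion to their occurrence can be illustrated with a short sketch. This is not the PBM algorithm itself: the function name `occurrence_weights`, the representation of a spectrum as a set of integer m/z values, and the simple weight N/count are all assumptions made for illustration, and the four-spectrum library is invented.

```python
from collections import Counter

def occurrence_weights(library):
    """Weight each m/z value inversely to how often it occurs in the library.

    library: list of spectra, each given as a set of integer m/z values
    (a hypothetical, intensity-free representation used only for this sketch).
    """
    n_spectra = len(library)
    # Count the number of library spectra in which each m/z value appears.
    counts = Counter(mz for spectrum in library for mz in set(spectrum))
    # A peak present in every spectrum gets weight 1; rarer peaks get more.
    return {mz: n_spectra / count for mz, count in counts.items()}

# Invented toy library: m/z 43 appears in every spectrum, m/z 71 in only one.
library = [{43, 57, 71}, {43, 58}, {43, 57, 105}, {43, 77, 105}]
weights = occurrence_weights(library)
```

In this toy library, the ubiquitous peak at m/z 43 receives the lowest weight and the peaks that occur in a single spectrum receive the highest, which is the qualitative behavior the text describes.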
PBM, like most mass spectral search systems that utilize statistical measures to determine match factors, assumes that all peaks in a spectrum are independent of each other. Since many fragments observed in a normal mass spectrum are highly correlated, the validity of the match factors depends on all matched compounds having approximately the same degree of correlation. The PBM system has been demonstrated to work quite well where the reference spectrum for an unknown exists in the database and for identifying components in mixtures. However, it has proven to be inferior to the SISCOM library search system, described below, in retrieving homologous compounds [4].

Another library search system that is quite similar to PBM is the INCOS library search system developed at Finnigan MAT (San Jose, CA). This system differs from PBM primarily in the way mass spectral peaks are weighted during matching.

A library search system for mass spectra developed by Henneberg and coworkers [5], called SISCOM, utilizes a spectral coding scheme based on selecting the most important peaks within homologous ion series and on a multiple-factor assessment of the results. The name SISCOM is an acronym for Search for Identical and Similar COMpounds. This system has been demonstrated to be suitable for detecting structural similarities, such as common substructures, even in cases where visual inspection and differentiation of spectra is difficult. The SISCOM system contains a pre-search algorithm that retrieves the 150 best-matching reference spectra. A more extensive matching algorithm is then used to refine the similarity order of the 150 spectra. The matching procedure uses a series of match factors, the results of which are all displayed for evaluation by the user.

The SISCOM algorithm was extended by Domokos et al. to make the system more of an identification (retrieval) system than a match similarity system [6].
While this extension has met with some success, the authors state that a system with 100% success cannot be expected. Several of the reasons for this limitation are the same as those that affect all library search systems.

A recent addition to SISCOM has been the implementation of structure searches [7]. The structure search can be used either to search for compounds with the presence or absence of defined fragments or to search for structures similar to a target structure. The first search method is useful for retrieving spectra for certain classes of compounds. The second search method is useful for validating a proposed structure for an unknown by retrieving spectra for structurally similar compounds.

Recently, Stein and Scott [8] conducted a comparison of five spectral matching algorithms. These algorithms were PBM, dot-product, the Hertz et al. similarity index, Euclidean distance, and absolute value distance. Each match algorithm was optimized separately for a test set consisting of 12,592 alternate spectra of about 8000 compounds. The rank of the correct compound in the list of candidate spectra was used as the criterion for match accuracy. The algorithms were found to perform in the order of dot-product (75%), Euclidean distance (72%), absolute value distance (68%), PBM (65%), and Hertz et al. (64%), where each percentage represents a hit rate for correct identification. This criterion directly measures the key function of a search algorithm: to place the correct result as high as possible in the ranked candidate list. Furthermore, it is independent of the relative ranking systems utilized by the individual search algorithms.

Interpretive Systems

Structure elucidation systems that attempt to derive differentiating rules regarding information contained in reference spectra are considered interpretive systems.
These systems usually perform a distillation step on the reference spectral database to derive the most differentiating feature rules. These rules are then used for the classification and/or identification of unknowns.

Unlike pattern matching methods, interpretive systems have the potential to be much more applicable for the general use of identifying true unknowns. Pattern matching algorithms are limited by the direct reference/unknown spectral correlations they perform. However, interpretive systems generally focus on the identification of substructural components from the available spectral information. Where enough substructural information can be determined by an interpretive system, complete structure elucidation is possible.

The substructure identification rules are developed for known substructures from a known reference database of molecules. Since interpretive systems start with the same mass spectral databases as pattern matching methods, they are susceptible to some of the same limitations. For example, spectral distortion and skew resulting from different instruments and tuning values can contaminate substructure identification rules, and the presence of mixture components can result in the misapplication of a substructure rule.

The DENDRAL project, which began at Stanford University in 1965, is one of the most well known and publicized attempts at applying artificial intelligence to the elucidation of chemical structures. The objective of the DENDRAL project was to determine molecular structures consistent with the conventional mass spectrum for an unknown compound. To accomplish this goal, DENDRAL employed three stages: plan, generate, and test as diagrammed in Figure 2.1.
In the plan stage, also called Heuristic DENDRAL [9,10], constraints are derived on the unknown structure based on peaks in the mass spectrum. Empirical fragmentation rules are then used to determine which molecular fragments are or are not present in the unknown. The empirical fragmentation rules are inferred from the mass spectra of known compounds by Meta-DENDRAL [11]. The determined molecular fragments are then used in the generate stage of DENDRAL to generate all possible molecular structures that are consistent with the supplied constraints. The third stage, test, involves the simulation of a mass spectrum for each of the candidate molecular structures generated in stage two. Spectra are simulated using the fragmentation rules of Heuristic DENDRAL. Due to the systematic search and test methods utilized by DENDRAL, the system has been shown to be as good as a human expert in elucidating structures [12].

[Figure 2.1: flow diagram titled "The DENDRAL Approach". MS, IR, and NMR data enter the Plan stage (Heuristic DENDRAL), which yields substructure constraints and a molecular formula; the Generate stage (GENOA) yields candidate structures; the Test stage performs mass spectral simulation and ranking of candidate structures.]

Figure 2.1 The DENDRAL approach to structure elucidation.

In many cases, the empirical fragmentation rules in Heuristic DENDRAL were not enough to provide the complete structure of an unknown. As a result, DENDRAL also used NMR and IR data to provide additional structural constraints. In large part, the inability of Heuristic DENDRAL to provide enough structural constraints can be attributed to the ambiguities that often arise when using normal mass spectra for structure elucidation.

The Self-Training Interpretive and Retrieval System (STIRS), developed by McLafferty and coworkers [3], attempts to identify the major substructural feature(s) of an unknown by correlation of the conventional mass spectrum to the best matching library spectra through the use of class libraries representing some 589 substructures [13].
The correlation technique consists of the application of a number of nearest neighbor analyses, each of which is the result of a particular type of spectral abbreviation. The abbreviations are based on characteristic ions, ion series, and primary and secondary neutral losses in various mass ranges that are specific for a given substructure. At the conclusion of the individual searches, the overall match factors are calculated as the weighted sum of the individual match factors. The fifteen compounds with the best overall match factors are then examined for the presence of the 589 substructures. The result is a list of the possible substructures contained in the unknown and an estimate of the reliability of the assignment. It has been recommended by McLafferty et al. [3] that STIRS be used in cases where PBM fails to provide a reliable match.

Sasaki and coworkers [14,15] have developed an integrated rule-based system for structure elucidation which they have called CHEMICS (Combined Handling of Elucidation Methods for Interpretable Chemical Structures). This system was capable of utilizing data from IR, proton NMR, carbon-13 NMR, and MS. The CHEMICS approach utilized correlation tables to determine the most likely substructures present in an unknown. By combining the results from each spectroscopic method, complementary information and confirmatory evidence can help identify or exclude specific substructures. The utilization of mass spectrometric information in this system was primarily for the purpose of predicting the molecular formula of an unknown, with substructures being identified from IR and NMR data. Later versions of CHEMICS did not make use of MS information [16,17].

An expert system for inferring structures of acyclic organic compounds from their mass spectra has been developed by Sastry and coworkers [18]. The expert system requires, as input, the mass spectrum, molecular formula, and presence, if known, of any functional groups.
Given this information, the system generates chemically possible structures for the molecular formula constrained by information obtained from the mass spectrum and any specified functional groups.

Recently, Scott [19] has reported the implementation of an expert system for the identification of a target set of toxic organic compounds from their mass spectra. The system was designed to accommodate low concentration spectra and to provide some information for mixtures. A schematic diagram of the system is provided in Figure 2.2. The target classes for this expert system are the nonhalobenzenes, chlorobenzenes, bromo- and bromochloroalkanes/alkenes, mono- and dichloroalkanes/alkenes, and tri-, tetra-, and pentachloroalkanes/alkenes.

[Figure 2.2: schematic diagram of the expert system designed by Scott.]

In the system designed by Scott, there is a separate molecular weight estimator, molecular weight filter, and base peak filter for each of the above five classes. The classification module is used to assign an unknown spectrum to one of the above classes. The molecular weight estimator for that particular class then assigns a molecular weight to the spectrum. A molecular weight exclusion filter is then applied to ensure that the spectrum was not misclassified [20]; this filter was found to be effective only for molecular weights below about 200 amu. Next, the base peak exclusion filter is used to exclude those compounds that were misclassified but passed the molecular weight exclusion filter. Finally, if a spectrum passes the base peak exclusion filter, it is passed to the identification module for compound identification. The identification module for the 75 target compounds relies heavily upon the accuracy of the molecular weight estimators and base peak data for unique compound identification.
If a spectrum is rejected by the classification module, or the exclusion filters, it is passed to the unknown molecular weight predictor.

A computer-assisted interpretation system developed at Michigan State University by Enke and coworkers was the first system to utilize the added information provided by tandem mass spectrometry (MS/MS). This system was called the Automated Chemical Structure Elucidation System, or ACES for short. A schematic of this system is provided in Figure 2.3. One of the main goals of this system was to generate candidate molecular structures for unknown compounds [21]. This objective was based on the ability of the [...]

[Figure 2.3: schematic of the ACES system.]

[Figure 3.1: two scatter plots on axes x1 and x2, panels (a) and (b).]

Figure 3.1 Examples of cluster structures: (a) well-separated clusters; (b) touching or overlapping clusters. The axes x1 and x2, representing the two clustering dimensions, are generic in this example but could represent such things as relative intensities for two m/z values or infrared intensities at two vibrational frequencies.

Therefore, it is appropriate to expand the simple definition of cluster analysis presented previously to include the presence of other clusters. A more exact definition would be that a cluster is a grouping of two or more objects that have one or more characteristics in common. However, given the information available, it may not be possible to distinguish between two or more clusters. The clusters themselves may have complex shapes. Figure 3.2 shows some of the possible complex shapes that clusters may have.
Lance and Williams [3] divide classification problems into categories, which include cluster analysis techniques, as shown in Figure 3.3. A detailed description of each of these categories with particular reference to cluster analysis is given by Jain and Dubes [4]. The two broadest categories are non-exclusive and exclusive. In non-exclusive clustering, groups are allowed to overlap and every object is assigned a "degree of belongingness" or "degree of membership" for each cluster. In exclusive clustering techniques, a given object can belong to one and only one cluster or group. Exclusive clustering techniques are further divided into extrinsic and intrinsic methods. Intrinsic methods, also called unsupervised learning, group samples solely on the basis of a proximity matrix or similarity matrix. In contrast, extrinsic methods also rely on the researcher to supply category labels for the objects being clustered. Extrinsic methods then attempt to establish a discriminant surface that separates the objects according to their category. Intrinsic methods are further separated into hierarchical and partitional clustering methods in Figure 3.3. Hierarchical and partitional clustering will be described in more detail later in this chapter.

[Figure 3.2: four scatter plots, panels (a) through (d).]

Figure 3.2 Four examples of complex cluster shapes: (a) spherical, (b) elongated, (c) linearly inseparable, and (d) dense clusters.

[Figure 3.3: tree diagram. Classifications divide into Non-Exclusive (Overlapping) and Exclusive; Exclusive divides into Extrinsic (Supervised) and Intrinsic (Unsupervised); Intrinsic divides into Hierarchical and Partitional.]

Figure 3.3 Categories of classification types.

The remainder of this chapter is dedicated to providing a background on the methods, assumptions, and some chemical applications of cluster analysis techniques.
Since cluster analysis is such a broad area by itself, only the more common methods of cluster analysis, particularly those that have been utilized in this research, will be discussed.

Similarity Calculation Methods

In order for cluster analysis to be performed, it is necessary to quantify the similarity between observations. Usually, this quantification is done on the basis of distance between observations in a multidimensional space [2]. In this section, I will describe the more common similarity (or distance) metrics.

There currently is no clear definition of similarity in the cluster analysis literature [5]. However, similarity is often taken to be a scaled form of the inter-object distance with values in the range of zero, for no similarity, to one, for identical values. If the distance value is scaled to a value from zero to one, where unity would represent the maximum distance possible between two objects k and l, then the similarity would be calculated as

$$ S_{kl} = 1 - d_{kl} \qquad (3.1) $$

Euclidean Distance

The Euclidean distance metric is by far the most common method for determining the distance between two objects. The equation for calculating a two-dimensional Euclidean distance between two points k and l is given by the formula

$$ D_{kl} = \sqrt{\sum_{j=1}^{2} (x_{kj} - x_{lj})^2} \qquad (3.2) $$

For multivariate observations, this equation can be generalized to n dimensions.

$$ D_{kl} = \sqrt{\sum_{j=1}^{n} (x_{kj} - x_{lj})^2} \qquad (3.3) $$

The most important criterion for using the Euclidean distance metric is that accurate calculation of the distance requires that the dimensions be orthogonal. In many situations, this does not occur because of cross-correlation between variables. A graphical representation of the Euclidean distance metric is given in Figure 3.4.

[Figure 3.4: two points a and b in the x-y plane separated by $d = \sqrt{(x_b - x_a)^2 + (y_b - y_a)^2}$.]

Figure 3.4 A graphical representation of the Euclidean distance metric.

Minkowski Distance

The Minkowski distance metric between two objects k and l is given by

$$ D_r(k,l) = \left( \sum_{j=1}^{n} \left| x_{kj} - x_{lj} \right|^r \right)^{1/r} \qquad (3.4) $$

where r is larger than or equal to one.
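As a minimal sketch, Equations 3.1, 3.3, and 3.4 can be written in a few lines of Python. The function names `minkowski` and `similarity`, and the argument `d_max` used to scale a raw distance onto the zero-to-one range, are assumptions introduced here for illustration; they do not come from the original work.

```python
def minkowski(x_k, x_l, r=2):
    """Minkowski distance of Equation 3.4 between two observations.

    x_k, x_l: equal-length sequences of numbers (one value per dimension).
    r: order of the metric; r = 2 gives the Euclidean distance of Equation 3.3.
    """
    if r < 1:
        raise ValueError("r must be greater than or equal to one")
    return sum(abs(a - b) ** r for a, b in zip(x_k, x_l)) ** (1.0 / r)

def similarity(distance, d_max):
    """Similarity of Equation 3.1 after scaling the distance by d_max.

    d_max is the maximum possible distance between two objects; how it is
    obtained is left to the caller (an assumption of this sketch).
    """
    return 1.0 - distance / d_max

# Two observations in two dimensions (a 3-4-5 right triangle).
d_euclid = minkowski((0.0, 0.0), (3.0, 4.0), r=2)
d_r1 = minkowski((0.0, 0.0), (3.0, 4.0), r=1)
s = similarity(d_euclid, d_max=10.0)
```

For these two points the r = 2 distance is the familiar hypotenuse, while r = 1 simply sums the coordinate differences, so the same pair of observations is farther apart under the lower-order metric.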
By comparison of Equation 3.4 with Equation 3.3, the Minkowski distance metric is a generalized form of the Euclidean distance metric. For the case where r = 2, the Minkowski distance reduces to the Euclidean distance. For r = 1, Equation 3.4 reduces to

$$ D_1(k,l) = \sum_{j=1}^{n} \left| x_{kj} - x_{lj} \right| \qquad (3.5) $$

This resulting distance metric is commonly called the Manhattan, city-block, or taxi distance and is mainly used in two-dimensional cases, although it is more common to use the Euclidean distance [6].

Mahalanobis Distance

The Mahalanobis distance is sometimes used as an alternative to the Euclidean distance metric. The Mahalanobis distance between two points k and l is given by the general equation

$$ D_{kl} = \sqrt{(x_k - x_l)^T C^{-1} (x_k - x_l)} \qquad (3.6) $$

where C is the covariance matrix. The advantage of the Mahalanobis distance is that it is scale-invariant and takes account of correlations among the variables. If C is replaced by a diagonal matrix of standard deviations, one obtains the auto-scaled Euclidean distance [5]. Likewise, replacing C by a diagonal matrix of ranges, one obtains the range-scaled Euclidean distance [5]. If within the clusters the variables are standardized and uncorrelated, then the Mahalanobis and Euclidean distances are the same, since the covariance matrix term, $C^{-1}$, drops out.

Hierarchical Clustering Methods

Hierarchical methods of cluster analysis are the most commonly used today. In particular, hierarchical clustering is widely used in the biological, social, and behavioral sciences because of the need to construct taxonomies [4]. There are three main types of hierarchical clustering that I will describe here: single linkage (or nearest neighbor linkage), complete linkage (or furthest neighbor linkage), and average linkage.

The general approach in hierarchical methods is to link, or join, observations based on their similarities. This is accomplished using one of the similarity or distance metrics described previously.
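Of the metrics described previously, the Mahalanobis distance of Equation 3.6 is the least obvious to compute, so a small sketch may help. It is written in plain Python under stated assumptions: the helper `solve` (a Gauss-Jordan solver) and the function name `mahalanobis` are hypothetical, the covariance matrix is supplied by the caller as a list of rows, and a singular covariance matrix is not handled.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination.

    a: square matrix as a list of rows; b: right-hand-side list.
    A singular matrix is not handled (it would raise ZeroDivisionError).
    """
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def mahalanobis(x_k, x_l, cov):
    """Mahalanobis distance of Equation 3.6 for covariance matrix cov."""
    diff = [a - b for a, b in zip(x_k, x_l)]
    y = solve(cov, diff)  # y = C^-1 (x_k - x_l), without forming C^-1
    return sum(d * v for d, v in zip(diff, y)) ** 0.5

identity = [[1.0, 0.0], [0.0, 1.0]]
d_identity = mahalanobis((0.0, 0.0), (3.0, 4.0), identity)

# Uncorrelated variables with variances 9 and 16 (invented numbers).
d_scaled = mahalanobis((0.0, 0.0), (3.0, 4.0), [[9.0, 0.0], [0.0, 16.0]])
```

With the identity matrix as C the result equals the Euclidean distance, which matches the statement that the covariance term drops out for standardized, uncorrelated variables; with unequal variances each coordinate difference is rescaled before it contributes.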
Hierarchical clustering methods, unlike partitional or density clustering methods, which will be described later, do not directly assign observations to a particular cluster. Rather, hierarchical clustering methods construct dendrograms that show the relationships between the observations. It is left to the researcher to determine which clusters, if any, are represented by the dendrogram and where they lie.

An example of a dendrogram is provided in Figure 3.5. On the vertical axis of the dendrogram are the objects being clustered. On the horizontal axis is the similarity index with a limiting value of 1.0 or 100% on the left and a limiting value of 0.0 or 0% on the right. The lengths of the horizontal line segments represent the dissimilarities between objects. The join points, represented by the vertical line segments, represent the similarities between objects as measured on the similarity index scale. The closer to the left two objects join, the more similar are their representative values.

[Figure 3.5: an example dendrogram; horizontal axis: similarity, from 1.0 on the left to 0.0 on the right.]

An advantage of hierarchical clustering is that it provides, in a readily interpretable form, the most similar observations as well as inter-group relationships formed by the branch points between dissimilar objects or groups. A disadvantage of hierarchical clustering techniques is that the resulting dendrograms can become difficult to interpret cleanly with very large data sets [7].

Single Linkage Clustering

Before single linkage joining can be performed, it is first necessary to calculate a similarity or distance matrix by one of the similarity calculation methods described previously, or by some other method if it is desired.
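The distance matrix that serves as the starting point can be sketched as follows. This is an illustration only: the function names are hypothetical, the Euclidean metric is chosen arbitrarily from those described previously, and scaling by the largest distance observed in the data set (rather than the maximum distance theoretically possible) is an assumption of this sketch.

```python
import math

def distance_matrix(observations):
    """Symmetric matrix of pairwise Euclidean distances (Equation 3.3)."""
    n = len(observations)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(observations[i], observations[j])))
            d[i][j] = d[j][i] = dist  # the matrix is symmetric
    return d

def similarity_matrix(d):
    """Convert distances to similarities with Equation 3.1.

    Distances are scaled by the largest observed distance, so the most
    distant pair has similarity 0 and identical observations have 1.
    Assumes at least two distinct observations (otherwise d_max is zero).
    """
    d_max = max(max(row) for row in d)
    return [[1.0 - value / d_max for value in row] for row in d]

# Three invented observations lying on a line in two dimensions.
d = distance_matrix([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
s = similarity_matrix(d)
```

Only the upper triangle is computed and then mirrored, since the distance between k and l equals the distance between l and k.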
Single linkage joining is then accomplished by searching the distance matrix for the lowest value. The two observations that match this value are then joined at their respective distance (similarity) level. This step is then repeated for the next lowest distance value and so on. If an observation for a selected distance value is already joined to other observations, a branch point is formed and the new observation is joined on to the existing group. This process continues until all the observations have been joined.

An example of single linkage clustering is presented in Figure 3.6 for a selection of microorganisms analyzed by phospholipid profiling. The details of the phospholipid profiling technique are presented in Chapter 4. Since dendrograms are a display technique, it is left to the researcher to interpret the results. In the case of Figure 3.6, one might choose to consider samples 74, 36, 80, and 78 in one group, samples 64, 51, 66, and 50 in another group, and the remaining three samples in the third group. Therefore, the members of each group have more in common with each other than with other samples or groups. However, this particular interpretation is only valid if samples with a similarity of 0.65 (65%) or more can be considered a group. If the threshold similarity for a group is set at 0.8, then only samples 66 and 50 form a group and samples 69, 24, and 94 form another. All other samples form single-member groups, as no other sample is significantly similar. Likewise, if the threshold similarity were set at 0.4 or lower, then all the samples would constitute one group.

[Figure 3.6: single linkage dendrogram of the selected microorganism phospholipid profiles; horizontal axis: similarity.]
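The joining procedure and the threshold-based reading of the result described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the function names `single_linkage` and `cut`, the use of a distance (not similarity) threshold, and the four one-dimensional example values are all assumptions.

```python
def single_linkage(d):
    """Join observations in order of increasing pairwise distance.

    d: symmetric distance matrix (list of rows).
    Returns a list of joins (distance, group_a, group_b), where each group
    is a sorted list of observation indices.
    """
    n = len(d)
    group = list(range(n))  # current group label of each observation
    pairs = sorted((d[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    joins = []
    for dist, i, j in pairs:
        gi, gj = group[i], group[j]
        if gi == gj:
            continue  # these two were already joined at a lower distance
        a = sorted(k for k in range(n) if group[k] == gi)
        b = sorted(k for k in range(n) if group[k] == gj)
        joins.append((dist, a, b))
        for k in b:
            group[k] = gi  # group b is absorbed into group a
    return joins

def cut(d, max_distance):
    """Groups obtained by keeping only joins at or below max_distance."""
    group = list(range(len(d)))
    for dist, a, b in single_linkage(d):
        if dist > max_distance:
            break  # joins are in increasing order, so the rest are too far
        for k in b:
            group[k] = group[a[0]]
    clusters = {}
    for k, g in enumerate(group):
        clusters.setdefault(g, []).append(k)
    return sorted(clusters.values())

# Four invented one-dimensional observations.
points = [0, 1, 3, 7]
d = [[abs(a - b) for b in points] for a in points]
joins = single_linkage(d)
groups_tight = cut(d, 2)
groups_loose = cut(d, 4)
```

Raising the threshold passed to `cut` plays the same role as lowering the similarity threshold on the dendrogram: more joins are accepted and the groups merge until a single group remains.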
Because of the manner by which the distance matrix is searched, and joins are formed, single linkage clustering tends to overestimate the degree of similarity between successive observations at branch points. The reason for this is that new observations are joined to previously joined observations based on the lowest distance between the new observation and any observation in the previously joined group. This means that every other observation in this joined group has a greater distance from, and therefore less in common with, the new observation than the branch point indicates. Nevertheless, single linkage clustering is the most common method of hierarchical clustering due in large part to its ease of implementation.

Complete Linkage Clustering

Complete linkage clustering is performed in the same manner as single linkage clustering with one exception. While single linkage clustering joins new observations onto previously joined observations based on the lowest distance between the new observation and any of the joined observations, complete linkage clustering does the opposite. Complete linkage clustering joins new observations to previously joined observations at the highest distance between the new observation and any of the joined observations. As a result, while single linkage clustering tends to overestimate the similarity between observations, complete linkage clustering tends to underestimate it.

An example of complete linkage clustering is presented in Figure 3.7 for a selection of microorganisms analyzed by phospholipid profiling, the details of which are presented in Chapter 4. Figure 3.7 may be interpreted in much the same manner as Figure 3.6 in the previous section. However, it becomes readily apparent that the complete linkage algorithm significantly reduces the similarities at the join positions for clusters of three or more members in this case.
For example, at a similarity of 0.65, sample 74 would no longer belong in the same group as samples 36, 80, and 78. At a similarity value of 0.8, only samples 66 and 50 form a group and samples 24 and 94 form another. At a similarity of 0.4, the same three groups are formed that were formed at a similarity of 0.6 for the single linkage example in Figure 3.6.

[Figure 3.7: complete linkage dendrogram of the selected microorganism phospholipid profiles.]

Average Linkage Clustering

Average linkage clustering represents the mid-point between single linkage clustering and complete linkage clustering. With average linkage clustering, the algorithm is the same as single linkage clustering in that the distance matrix is searched for the lowest distance values when joining observations. However, once two or more observations are joined, a mean observation is calculated, which replaces the joined observations at each cycle. Figure 3.8 demonstrates the differences in distance measures between single, complete, and average linkage methods.

For comparison purposes, the microorganism phospholipid profiles clustered by single and complete linkage clustering in Figure 3.6 and Figure 3.7, respectively, are also clustered by average linkage and presented in Figure 3.9. By comparison of Figure 3.9 with Figure 3.6 and Figure 3.7, it becomes readily apparent how the average linkage method represents the midpoint between single and complete linkage clustering. If the measurements taken on the samples and used in the clustering are taken to represent normal distributions, then it is more statistically valid to cluster using average linkage.

Because single, complete, and average linkage methods only affect the joining of three or more samples, they always join the first two samples in a branch at the same similarity.
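The three rules differ only in how the distance between a new observation and an existing group is measured, which a short sketch makes concrete. The function name `join_distance` and the coordinates are hypothetical; the "average" branch follows the description given above, in which the joined observations are replaced by their mean observation.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def join_distance(group, new, method):
    """Distance at which observation `new` joins an existing group.

    group: list of already-joined observations; method: "single",
    "complete", or "average" (mean-observation rule, as described above).
    """
    if method == "single":
        return min(euclidean(member, new) for member in group)
    if method == "complete":
        return max(euclidean(member, new) for member in group)
    if method == "average":
        mean = [sum(values) / len(group) for values in zip(*group)]
        return euclidean(mean, new)
    raise ValueError("unknown linkage method: " + method)

# Invented coordinates: two joined samples and a third, more distant one.
joined = [(0.0, 0.0), (2.0, 0.0)]
third = (5.0, 0.0)
d_single = join_distance(joined, third, "single")
d_complete = join_distance(joined, third, "complete")
d_average = join_distance(joined, third, "average")

# A group of one: all three rules give the same distance for the first join.
first_pair = [join_distance([(0.0, 0.0)], (2.0, 0.0), m)
              for m in ("single", "complete", "average")]
```

In this example the average-rule distance falls between the single and complete values, and for a group containing only one observation all three rules coincide, so the first two samples in a branch join at the same level regardless of method.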
For example, samples 80 and 78, 64 and 51, 66 and 50, and 24 and 94 are joined to each other at the same similarity in all three cases. Although it is not uncommon, no rearrangement of the sample order occurred with these example data. As a result, it is possible to use any of these three linkage methods with these data and still come to the same conclusions, albeit at different similarity levels, regarding the data structure.

[Figure 3.8: diagram of Samples 1, 2, and 3 showing the distances d1, d_single, d_complete, and d_average.]

Figure 3.8 The relative distance measures for single, complete, and average linkage methods. Samples 1 and 2 would initially be joined at distance d1. Sample 3 would join at d_single, d_complete, or d_average, respectively.

[Figure 3.9: average linkage dendrogram of the microorganism phospholipid profiles presented in Figure 3.6 and Figure 3.7.]

Average linkage clustering is more computationally expensive than single or complete linkage clustering because it is necessary to recalculate portions of the distance matrix after every join to take into account the joined mean observation. However, average linkage clustering does tend to provide a more accurate measure of observation similarity on the resulting dendrogram.

Partitional Clustering Methods

Partitional methods of cluster analysis are used to classify samples into distinct groups without the intervention of the researcher. Unlike the hierarchical methods, observations are assigned to specific clusters. It then falls to the researcher to evaluate and validate the correctness of the clusters formed. However, it is not necessary for the researcher to decide where the separation exists between clusters. As a result, partitional methods of cluster analysis are generally less susceptible, though certainly not immune, to researcher bias.
Partitional clustering methods are more popular in the engineering sciences, where single partitions are important, than in the biological sciences [4]. While there are several types of partitional clustering methods in the literature, I will only be describing the K-means method. As a general scheme, most partitional methods rely on maximizing some between-group distance function or minimizing generalized variance [8].

K-Means Clustering

The K-means clustering algorithm is one of the most popular of the partitional clustering methods. This method relies on maximizing the between-group Euclidean distance while minimizing the within-group distances. The K-means method is described in detail with FORTRAN code by Dr. John Hartigan [8] and by Johnson and Wichern [9].

The K-means method is composed of four main steps. In Step 1, the observations are partitioned into K initial clusters and initial cluster centroids (or cluster means) are calculated. In Step 2, the observations are all reassigned to the cluster whose centroid is nearest. The new cluster centroids, after observation reassignment, are calculated in Step 3. In Step 4, Steps 2 and 3 are repeated until no more reassignments of observations occur.

Probability-based and Density-based Clustering

The previously mentioned clustering methods all rely on distance or similarity as a means of dividing observations into clusters. However, almost any means by which observations can be divided into groups can be utilized in cluster analysis [5]. In Figure 3.1, it is readily apparent that a cluster implies a greater concentration or density of observations in a given region. In this section, I will describe a few clustering methods that seek to form groups based on density or distributional assumptions.

Kernel Density Clustering

Massart and Kaufman [10] describe the kernel density clustering method as CLUPLOT.
In this clustering approach, each observation has associated with it a kernel or potential density function, often Gaussian, which can be considered to measure the influence exerted by the observation on the other observations in the data set. The density function need not be Gaussian but can be any function which measures the density of an object at another point as a function of the distance between them. At each iteration, the observation with the greatest average density is chosen as the new cluster center. Unallocated observations are assigned to a cluster if the influence exerted on them by that cluster is greater than some threshold. Kernel density clustering can be modified to permit overlapping clusters and generally provides more information than standard clustering approaches. Like the fuzzy clustering method described below, the kernel density clustering method is computationally expensive.

Fuzzy Clustering

At present, there exist in the literature only a few fuzzy clustering implementations; it is not yet a commonly used clustering method [5]. However, I will describe fuzzy clustering here because it has significant advantages over traditional clustering methods and is likely to become a more commonly utilized clustering method in the future even though it is computationally expensive.

Fuzzy clustering differs fundamentally from all of the clustering methods mentioned previously. In 1965 Dr. Lotfi Zadeh [11] introduced the concept of fuzzy sets by significantly extending the principles of set theory. Set theory is based on binary assumptions about elements in a set [12]. That is, each element may either belong to a particular set or not. Fuzzy set theory is based on the premise that an element does not belong totally to a set but has a "degree of belongingness" to each set. Consequently, under fuzzy set theory, an element X may be described as belonging 70% to set A, 25% to set B, 5% to set C, and 0% to all other sets.
Indeed, it is not even necessary that the memberships of an element sum to exactly 100% over all the sets in question [13].

By comparison, traditional cluster analysis, as described previously, is much like set theory: observations (or elements) either belong to a cluster (or set) or they do not. This exclusive idea in cluster analysis works well for cleanly separated clusters. However, exclusive cluster analysis does not handle overlapped clusters very efficiently. Furthermore, exclusive clustering methods provide at best a minimal amount of information, through inter-cluster distances, regarding commonality of features between clusters [14]. The degree of cluster membership assigned to each element provides much more information in this regard.

Current implementations of fuzzy clustering operate to some degree like partitional clustering methods [5]. The number of clusters to be found is specified by the researcher and each observation is then assigned a degree of membership for each competing cluster. Somewhat like K-means clustering, observations are moved between clusters until the set of memberships for each observation to each cluster is optimal. That is, they minimize some criterion defined by the clustering implementation. Fuzzy clustering can also be useful for identifying outliers or those observations positioned between clusters, since the membership values provide some indication of the likelihood with which observations belong to competing clusters.

Application of Cluster Analysis to Problems of Chemical Significance

In recent years, interest in multivariate analysis techniques for interpreting chemical information has increased significantly. This is due in large part to advances in computers and the large quantities of multivariate information produced by modern chemical analysis instrumentation. At present, there are several hundred articles referencing the use of cluster analysis to aid data interpretation in the chemical literature.
These references span the full spectrum of chemical science. The main underlying theme of all these articles is the multivariate nature of the data investigated. For example, Frapporti et al. [15] report using fuzzy cluster analysis in the analysis of the hydrogeochemistry of shallow Dutch groundwater samples collected from 350 sites in the national groundwater quality monitoring network. Haswell and Walmsley [16] describe using hierarchical clustering to assist in the selection of sensors for an array device for the detection of analytes in the vapor phase. Likewise, Bondarenko et al. [17] report using hierarchical cluster analysis in the classification of Lake Baikal aerosol particle samples using electron probe X-ray microanalysis.

Vogt et al. [18] report using cluster analysis to facilitate the interpretation and the formulation of diagnostic models for the application of chemical tests in clinical chemical diagnosis. Their approach uses hierarchical clustering by Ward's method [4], which is based on the minimization of the square error popularized in analysis of variance, to reduce data while increasing utilizable information.

In the area of biochemistry, Alexandrov [19] reports using hierarchical clustering for the purpose of interpreting the biological meaning, statistical significance, and classification of local spatial similarities in nonhomologous proteins. Sowdhamini and Blundell [20] describe a method using cluster analysis to reduce the complexity of performing protein searches in the Brookhaven Protein Data Bank by establishing protein domains. These domains are considered to be clusters of secondary protein structure elements.

Cluster analysis utilization has also been reported by Livingstone [21] for molecular modeling and drug design and by Chu [22] for drug pharmacological activity.
Likewise, several articles have been written on the use of cluster analysis to determine chemical structure-biological activity relationships [23, 24, 25]. Jurs and Lawson [23] describe the use of a modified Hopkins statistic, which is intended to assess whether or not a given data set differs from a set of uniform random numbers. They further present an example using this statistic in a structure-toxicity relationship investigation of 143 acrylate compounds. Darwish et al. [24] compare principal component analysis and cluster analysis in quantitative structure-activity relationships (QSAR). Specifically, they examined the inhibitory effect of some benzothiazole pesticide derivatives on the bacterium Bacillus subtilis and the fungi Aspergillus niger, Helminthosporium sativum, and Fusarium graminearum. They found that cluster analysis required less computation time but gave results comparable to those of principal component analysis. Novellino et al. [25] present the clustering of the principal components generated from comparative molecular field analysis (CoMFA). This yields an approach to drug series design that produces a relatively smaller set of series structures that provides nearly the same amount of information as a much larger data set of structural analogs.

With an objective similar to that described previously by Sowdhamini and Blundell [20], Domokos et al. [26] report a procedure to reduce mass spectrometric library search times by presearching. The presearch is made on a reduced mass spectrometric library where large groups of compounds exhibiting similar spectral features are replaced by their respective prototypes. This approach significantly reduces the number of spectra used in the subsequent search.

While the above list of applications of cluster analysis to chemical problems is certainly not comprehensive, it hopefully serves to indicate the diversity of tasks to which cluster analysis can be applied.
Application of Cluster Analysis to Mass Spectral Data

Using cluster analysis with normal mass spectral data as a means of characterizing compounds generally fails to provide useful information because the grouping is usually based on an undefined mass-to-charge correlation. For example, the clustering of compounds whose spectra contain m/z 18, which could be either H2O+ or NH4+, might place these compounds in the same group. Since the m/z 18 ions are not identical, any grouping which places these ions in the same cluster would be incorrect. Consequently, any assumption that the two compounds must have something in common would also be incorrect. Therefore, one cannot make reliable feature-substructure correlations by clustering normal mass spectra. Perhaps the only exception is the use of cluster analysis as an inefficient means of library searching. This search process would be based largely on the assumption that each unique compound in the library database forms its own group.

However, the difficulty or failure of cluster analysis with normal mass spectral data does not apply to tandem mass spectral data provided that the spectra clustered are derived from the same precursor m/z value. Product spectra can be directly correlated with the substructure of the precursor from which they were derived. Hence, the groupings in such a case are based on a limited set of predefined substructures. Cluster analysis of MS/MS spectra can, therefore, be used as a reliable means of formulating feature-substructure correlations. The use of cluster analysis with tandem mass spectral data will be explained further in Chapter 6.

Conclusions

In this chapter, I have portrayed cluster analysis as a relatively non-biased method for determining the natural clustering or grouping tendencies amongst data. While such levels of perfection are desirable, this is not strictly correct.
So long as the researcher has any choice or control over clustering parameters, there exists the possibility for personal bias. Also, of all the clustering methodologies described in this chapter, none is suitable for all possible cases [4]. This is due in part to the underlying assumptions each clustering method makes regarding the data being analyzed. We have found that this is not a problem with mass spectral data, however. Indeed, the above concerns are not a significant problem in most practical applications of cluster analysis with "real" data because the researcher usually knows enough about the data to distinguish "good" groupings from "bad" groupings. Consequently, it is often necessary to investigate the data space with known test samples before relying on a particular clustering algorithm for unknowns.

References

1. Johnson, R. A.; Wichern, D. W. Applied Multivariate Statistical Analysis; Prentice Hall: Englewood Cliffs, 1992; Chapter 12.
2. Everitt, B. S. Cluster Analysis; Halsted: New York, 1993; Chapter 1.
3. Lance, G. N.; Williams, W. T. Computer Journal 1967, 10, 271-277.
4. Jain, A. K.; Dubes, R. C. Algorithms for Clustering Data; Prentice Hall: Englewood Cliffs, 1988; Chapter 3.
5. Bratchell, N. Chemom. Intell. Lab. Syst. 1989, 6, 105-125.
6. Massart, D. L.; Kaufman, L. Interpretation of Analytical Chemical Data by the Use of Cluster Analysis; John Wiley and Sons: New York, 1983; Chapter 2.
7. Hodes, L.; Feldman, A. J. Chem. Inf. Comput. Sci. 1991, 31, 347-350.
8. Hartigan, J. A. Clustering Algorithms; John Wiley and Sons: New York, 1975; Chapter 4.
9. Johnson, R. A.; Wichern, D. W. op. cit.; 597-602.
10. Massart, D. L.; Kaufman, L. op. cit.; Chapter 4.
11. Zadeh, L. A. Fuzzy Sets. Information and Control 1965, 8, 338-353.
12. Otto, M. Anal. Chim. Acta 1990, 235, 169-175.
13. Bezdek, J. C.; Dunn, J. C. IEEE Transactions on Computers 1975, August, 835-838.
14. Ruspini, E. H. Information and Control 1969, 15, 22-32.
15. Frapporti, G.; Vriend, S. P.; Van Gaans, P. F. M. Water Resour. Res. 1993, 29, 2993-3004.
16. Haswell, S. J.; Walmsley, A. D. Anal. Proc. 1991, 28(4), 115-117.
17. Bondarenko, I.; Van Malderen, H.; Treiger, B.; Van Espen, P.; Van Grieken, R. Chemom. Intell. Lab. Syst. 1994, 22, 87-95.
18. Vogt, W.; Sator, H.; Nagel, D. TrAC, Trends Anal. Chem. 1984, 3(7), 166-171.
19. Alexandrov, N. N.; Go, N. Protein Sci. 1994, 3, 866-875.
20. Sowdhamini, R.; Blundell, T. L. Protein Sci. 1995, 4, 506-520.
21. Livingstone, D. J. Anal. Proc. 1991, 28(8), 247-248.
22. Chu, K. C. Anal. Chem. 1974, 46, 1181-1187.
23. Jurs, P. C.; Lawson, R. G. Chemom. Intell. Lab. Syst. 1991, 10, 81-83.
24. Darwish, Y.; Cserhati, T.; Forgacs, E. Chemom. Intell. Lab. Syst. 1994, 24, 169-176.
25. Novellino, E.; Fattorusso, C.; Greco, G. Pharm. Acta Helv. 1995, 70, 149-154.
26. Domokos, L.; Pretsch, E.; Maendli, H.; Koenitzer, H.; Clerc, J. T. Fresenius' Z. Anal. Chem. 1980, 304, 241-249.

CHAPTER 4

Application of Cluster Analysis to Microbial Characterization

Introduction

My interest in cluster analysis as a tool for data interpretation began during my affiliation with the National Science Foundation Center for Microbial Ecology (CME) at Michigan State University. The Community Diversity Group, of which I was a member, is one of several in the CME and is focused on investigating and identifying what constitutes a viable microbial community [1]. In nature, that is, outside of the laboratory, microbes often cannot survive without being part of a microbial community. Microbes, being single-celled organisms, are not complex enough to provide for all of their environmental needs. As a result, microbes are generally members of a community where each member or microbial strain contributes something to the whole.

The concept of a microbial community is very important if one is to understand microbes in their natural habitat. In the laboratory, microbes are usually grown and studied as isolates.
However, for every organism that can be grown as an isolate in the laboratory, there are scores of others that cannot survive in this environment. It is, therefore, necessary to look at the larger community picture if one is to begin to understand the interaction of microbes with nature. However, in order to investigate a microbial community, it is also necessary to be able to reliably characterize and hopefully identify the microbial members of the community. In this chapter, I will present my efforts in furthering the development of methodologies for microbial characterization with specific application to the characterization of selected food-borne contaminants by means of phospholipid profiling and hierarchical cluster analysis.

Background

When I joined the Enke research group, Mark Cole was conducting his doctoral research in the area of developing biomarkers for microbes using mass spectrometric methods. The development of reliable biomarkers is very important as it allows for the rapid screening of microbes as to form and/or function without necessarily identifying the specific microbe. Mass spectrometric methods have proven quite useful in biomarker method development [2].

While the general goal of any analysis procedure of an unknown microbe is to determine its identity, this is usually not possible or feasible for several reasons. For example, only a very small percentage of microbes that occur in nature have ever been fully characterized and identified. Of those microbes that have been identified, most have been characterized because they exhibit some pathogenic character to other life forms, be they human, animal, or plant [3]. Furthermore, the extensive characterization and identification of a single microbe can take several days to weeks to accomplish. It is obviously impractical to perform such a protracted procedure on items like food harvests due to the storage requirements unless there is some previous indication that the food may be contaminated.
This is part of the reason why foods like tainted meat slip through Food and Drug Administration (FDA) inspections. The food industry, in particular, is therefore very interested in establishing a set of biomarker-based screening methods to identify or at least provide an indicator as to the microbial composition in a food sample. Only then might a more extensive analysis of the food become necessary, which would represent a significant cost and time reduction to the food industry.

The development of reliable biomarkers was a primary objective of the CME Community Diversity Thrust Group. Towards that end, a general approach was devised by which current and future biomarker data could be collected, stored, and analyzed; this approach is diagrammed in Figure 4.1. These biomarker methods, databases, and analytical tools would then be made available to others both within and outside the CME. My involvement with the CME ended not long after this approach was proposed.

One of the biomarkers that Mark Cole investigated was the glycerophospholipids, which I shall refer to by the more general term phospholipids. The glycerophospholipid structure, Figure 4.2, contains four main components: a substituted phosphate head group, two fatty acid chains, and a glycerol backbone. The phospholipid classes are determined by the Y substituent attached to the phosphate group. The phospholipid classes used in this study are provided in Figure 4.3.

Phospholipids are a major component of the lipid bilayer that makes up the cellular membrane of a microorganism [4]. It is through this bilayer that all nutrients must travel into the cell and all waste byproducts must travel out. As a result, the cell membrane serves as both a screen or shield for the organism and an interface to the environment. For this reason, the membrane composition can vary greatly depending on the form and/or function of an organism.
Further variations are also observed within a particular microbial strain as affected by the extracellular environment: food supply and composition, temperature, moisture content, etc. [5]. Therefore, utilizing components of the cellular membrane, like phospholipids, as biomarkers should provide for the differentiation between microorganisms and perhaps also for the correlation with the form and/or function of a microorganism.

[Figure 4.1 flowchart: Computer-Assisted Characterization. Sampling (organisms: isolates or communities); Methods (morphology, fatty acids, phospholipids, genetic, traditional methods); Data Storage and Retrieval (taxonomy, databases); Analytical Tools (numerical methods, cluster analysis, rule-based expert system); outputs (taxonomical information, i.e., species, strain, etc.; behavioral qualities; community qualities; environmental qualities).]

Figure 4.1 The proposed CME approach to incorporate the research objectives in the Community Diversity Thrust Group into a more cohesive package.

Figure 4.2 The general glycerophospholipid structure. The head group substituent is represented by Y and the two fatty acids by R and R'.

Phospholipid                                        Head Group (Y)         Mass of HO-P(O)(OH)-O-Y
Phosphatidylglycerol (PG)                           CH2-CH(OH)-CH2(OH)     172 u
Phosphatidic acid (PA)                              H                      98 u
Phosphatidylethanolamine (PE)                       CH2-CH2-NH2            141 u
Phosphatidyldimethylethanolamine (PDME)             CH2-CH2-N(CH3)2        169 u
Phosphatidylmonomethylethanolamine (PMME)           CH2-CH2-NH-CH3         155 u

Figure 4.3 The phospholipid head groups and their respective neutral losses utilized in this study.

One of the common characterization methods for microorganisms, shown in Figure 4.1, is to perform a complete lipid extraction, esterify the fatty acids, and then analyze the resulting esters by gas chromatography. This method is called FAME, which stands for fatty acid methyl ester. While being relatively simple to perform, this method suffers from a number of drawbacks.
In particular, the extraction process removes all the fatty acids in the organism. This is accomplished by first saponifying the whole microorganism cells in a strong base and then acidifying, methylating, and finally extracting the fatty acids [6]. It is, therefore, not possible by this method to selectively characterize the cell membrane composition or the cell contents. Another deficiency of the FAME method is the loss of channels of information. During the extraction process, the fatty acids are cleaved from the substituent to which they may have been bonded. In many cases, the backbone substituent, coupled with the fatty acid composition, may be very specific to a particular strain of organism [7-9]. While this information is lost in FAME, it is not lost in the analysis of phospholipids by tandem mass spectrometry. As a result, phospholipid analysis, which characterizes intact phospholipids, provides a much more organism-specific method of characterization.

When phospholipids that have been ionized by fast atom bombardment (FAB) undergo collision-induced dissociation (CID), they fragment at specific bond positions that are common to all classes of phospholipids [10-13]. In positive ion scan mode, phospholipids are observed to fragment by losing the polar substituted phosphate head group as a neutral loss (Figure 4.4). The mass of this neutral loss is specific to each phospholipid class and is used to selectively profile each phospholipid class. The neutral loss masses are shown in Figure 4.3. Therefore, positive ion neutral loss scans produce individual spectra for each of the phospholipid classes. These spectra are without spectral interferences from other components in the crude microbial extract mixture as a result of the coupled filtering of the quadrupoles. The detected ion mass-to-charge values correspond to the sum of the remaining components of the original phospholipid.
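The arithmetic behind a neutral loss scan can be sketched briefly. The loss masses are those of Figure 4.3. The function names, the assumption of singly charged ions, and the 600 to 850 scan window with unit steps are illustrative assumptions rather than a description of the instrument software.

```python
# Neutral loss masses (u) for each phospholipid class, taken from Figure 4.3.
NEUTRAL_LOSS_U = {"PG": 172, "PA": 98, "PE": 141, "PDME": 169, "PMME": 155}


def detected_fragment_mz(precursor_mz, phospholipid_class):
    """m/z of the fragment left after the head group leaves as a neutral.

    Assumes singly charged ions, so mass and m/z are numerically equal.
    """
    return precursor_mz - NEUTRAL_LOSS_U[phospholipid_class]


def scan_pairs(phospholipid_class, first_mz=600, last_mz=850):
    """(precursor, fragment) m/z pairs that the two mass filters track together.

    The scan window and the unit step are assumptions made for illustration.
    """
    loss = NEUTRAL_LOSS_U[phospholipid_class]
    return [(q1, q1 - loss) for q1 in range(first_mz, last_mz + 1)]


# A PE precursor at m/z 718 is detected through its fragment at m/z 577.
pe_fragment = detected_fragment_mz(718, "PE")
pa_pairs = scan_pairs("PA")
```

Because the two filters are always offset by the class-specific loss, only ions that lose exactly that mass reach the detector, which is what makes each scan selective for a single phospholipid class.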
The measured signal intensity is recorded against the mass-to-charge setting of the first filtering quadrupole so that the resulting spectrum is calibrated for the ion mass-to-charge before the CID process occurred. Sample neutral loss spectra (mass profiles) for three organisms are provided in Figure 4.5. When this mass spectral information is combined with the information that can be obtained using negative ion scan mode, a 4-dimensional phospholipid (class) mass profile can be constructed (Figure 4.6). The negative ion scan mode information, which provides the mass-to-charge value of each fatty acid moiety, was not utilized in this study.

Figure 4.4 Low-energy collision-induced dissociation neutral loss fragmentation of phospholipids for positive ions.

[Figure 4.5 panels: Salmonella abaetetuba, Citrobacter freundii, Escherichia coli; vertical axis Relative Abundance, horizontal axis Mass/Charge.]

Figure 4.5 Phosphatidylethanolamine (PE) mass profiles for Salmonella abaetetuba, Citrobacter freundii, and Escherichia coli.

By comparison, the levels of information provided by the phospholipid characterization method greatly exceed the information provided by bulk methods such as FAME. By the FAME method, a single 2-dimensional gas chromatogram is obtained per organism, which contains the retention times and quantities of the bulk extracted and esterified fatty acids.

Experimental

All phospholipid analyses were performed using a Finnigan MAT (San Jose, CA) TSQ-70B triple stage quadrupole mass spectrometer that had been upgraded to the TSQ-700 model. This instrument was equipped with a JEOL (Boston, MA) MS-009 charge-transfer fast atom bombardment (FAB) gun and a modified power supply. Mass spectra were acquired and initially processed using the Finnigan TSQ data system and software.

The bacteria investigated in this study were Citrobacter freundii (C.
freundii), Escherichia coli (E. coli), Salmonella abaetetuba (S. abaetetuba), Salmonella enteritidis (S. enteritidis), and Listeria monocytogenes (L. monocytogenes). The bacteria used in this study, with the exception of E. coli, were obtained in lyophilized form from Dr. Edward Richter at Silliker Laboratories of Ohio, Inc. The E. coli was obtained lyophilized from Sigma Chemical Co. (St. Louis, MO).

Figure 4.6 The FAB/MS/MS phospholipid (class) mass profile obtained in part by this analysis procedure. The phospholipid classes axis is determined by the neutral losses monitored and the phospholipid masses axis is determined from the respective neutral loss scans. The fatty acid masses axis is determined by operating the instrument in negative ion mode, which was not utilized for this study.

Crude lipid extracts were prepared from the bacteria using a modified Bligh-Dyer extraction procedure [14]. The extraction procedure used is as follows:

1. Rinse extraction vials (scintillation vials) twice with chloroform.
2. Prepare extraction reagent by mixing one part chloroform with two parts methanol in a clean Erlenmeyer flask.
3. Rinse a long (9 inch) Pasteur pipette once with the extraction reagent and discard.
4. Remove cells pooled at the bottom of a broth culture with a clean pipette, or transfer approximately one milliliter of the broth culture. Alternatively, scrape the cells off of a culture plate with a sterile loop and transfer to a clean vial.
5. Add extraction reagent to each vial until it is one-half to two-thirds full. Layers should not separate at this point if a broth is used. If the bacterial cells do not separate, put the vials in a sonicator for a few minutes.
6. Place all vials in a sonicator and sonicate for at least 30 minutes.
7. Add distilled water to each vial until the layers separate. The methanol will go into the water and the chloroform layer will be on the bottom.
8. Allow the layers to settle overnight, or centrifuge and remove the chloroform layer. Place the chloroform extracts into clean small vials.
9. Evaporate the chloroform extracts to dryness under nitrogen. Reconstitute with a small volume of fresh chloroform.

At the end of the procedure, a water/organic extraction is performed, resulting in the phospholipids being contained in the chloroform organic layer. The chloroform was then evaporated off under a stream of nitrogen. The remaining extract was then reconstituted with 5-10 µL of fresh chloroform just prior to analysis.

Mass spectral analyses were performed by dissolving 3-5 µL of the chloroform extract solution in a drop of nitrobenzyl alcohol matrix on a custom-made FAB probe tip. The FAB gun was operated with a filament current of 10 µA and a xenon FAB gas beam energy of 8 keV. Neutral loss spectra were obtained in the positive ion mode of the mass spectrometer after tuning with cesium iodide. The argon collision gas pressure was set at 0.5 mtorr and the collision energy was set at 30 eV.

In order to obtain repetitive neutral loss scans for five of the possible phospholipid classes known to occur in the organisms studied, an Instrument Control Language (ICL) procedure was written and utilized. The ICL procedure used is provided in Figure 4.7. By using the ICL procedure, a complete phospholipid analysis can be efficiently performed from a single probe sample within the useful lifetime of the sample droplet. Twenty neutral loss scans were collected in centroid mode for each of the five phospholipid classes used in this study. Three or four replicates were run for each organism investigated.

Results and Discussion

It was necessary to perform some fairly rigorous filtering of the mass spectral data to extract the useful phospholipid data and to reduce the number of dimensions (features) used in the cluster analysis.
Much of the data filtering was necessary to reduce or eliminate the high levels of background interferents resulting from the poor sensitivity of the mass spectrometer at the time of the analysis. Filtering of the mass spectral data was accomplished in three steps: (1) a peak (a mass/intensity pair) must occur in 50% or more of the 20 neutral loss scans collected for each phospholipid class, (2) the scans thus filtered are then averaged, and (3) a peak must occur in 100% of the averaged spectra within each set of replicates to be used as a dimension in the clustering procedure [15].

Hierarchical cluster analysis was then performed on the filtered mass spectra for each phospholipid class analyzed to yield the dendrograms in Figures 4.8-4.13 [15]. The development and interpretation of these dendrograms are explained below.

    PROF NEU 172,600,850,1                 # setup
    COFF=-30;ON;CDYN 18;EMULT=1700;SN=1
    APAUSE;ARESUME
    NEU 172,600,850,1                      # scan PG
    REPEAT 20;GO;STOP;END
    NEU 98,600,850,1                       # scan PA
    REPEAT 20;GO;STOP;END
    NEU 141,600,850,1                      # scan PE
    REPEAT 20;GO;STOP;END
    NEU 169,600,850,1                      # scan PDME
    REPEAT 20;GO;STOP;END
    NEU 155,600,850,1                      # scan PMME
    REPEAT 20;GO;STOP;END
    Q1MS;SW 20;ST=0.5                      # standby mode
    ASTOP
    OFF

Figure 4.7 The Instrument Control Procedure used to control the mass spectrometer in this study.

The organism/mass spectral data was converted into an interorganism triangular distance matrix using the Euclidean distance metric [16]. The mass spectral data were translated into a multidimensional data space by allowing each integral mass-to-charge value, for which at least one organism had an intensity greater than zero, to be treated as a separate orthogonal dimension (feature). Since the mass spectra were all normalized to 100% relative abundance, the percent abundance was used to represent the distance along each dimensional vector. As a result, no further scaling was necessary.
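The filtering steps and the distance calculation described above can be sketched as follows. This is an illustrative reconstruction, not the program used in this study. Spectra are represented as dictionaries that map an integral m/z value to a relative abundance, all function names are hypothetical, and averaging over all scans (counting an absent peak as zero) is an assumption, since the averaging convention is not stated.

```python
from math import sqrt


def average_scans(scans, min_fraction=0.5):
    """Steps 1 and 2: keep peaks found in at least `min_fraction` of the
    scans, then average each kept peak over all scans (absent = 0)."""
    n = len(scans)
    masses = {mz for scan in scans for mz in scan}
    averaged = {}
    for mz in masses:
        hits = sum(1 for scan in scans if scan.get(mz, 0.0) > 0.0)
        if hits / n >= min_fraction:
            averaged[mz] = sum(scan.get(mz, 0.0) for scan in scans) / n
    return averaged


def keep_common_peaks(replicates):
    """Step 3: keep only peaks present in every averaged replicate spectrum."""
    common = set(replicates[0]).intersection(*replicates[1:])
    return [{mz: rep[mz] for mz in sorted(common)} for rep in replicates]


def distance_matrix(spectra):
    """Euclidean distances with each integral m/z treated as one dimension."""
    axes = sorted({mz for spectrum in spectra for mz in spectrum})
    vectors = [[s.get(mz, 0.0) for mz in axes] for s in spectra]
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):  # lower triangle, mirrored for convenience
            d = sqrt(sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j])))
            matrix[i][j] = matrix[j][i] = d
    return axes, matrix


# Hypothetical scans: m/z 702 occurs in only one of four scans and is dropped.
scans = [{700: 100.0, 701: 10.0}, {700: 80.0},
         {700: 90.0, 702: 5.0}, {700: 90.0, 701: 30.0}]
averaged = average_scans(scans)
replicates = keep_common_peaks([{700: 100.0, 714: 40.0}, {700: 100.0, 716: 30.0}])
axes, matrix = distance_matrix([{700: 100.0, 714: 40.0}, {700: 100.0, 716: 30.0}])
```

In the last call the two hypothetical spectra share the base peak at m/z 700 and differ only at m/z 714 and 716, so the distance between them is determined entirely by those two dimensions.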
The resulting distance matrix was normalized by the number of dimensions and scaled to yield distance values (similarities) in the range of zero to one. Dendrograms were then constructed from the normalized distance matrix using single linkage joining [17]. Single linkage joining was accomplished by searching the distance matrix for the lowest value. The two organisms that correspond to this value are then joined at their respective similarity (distance) level. This step is then repeated for the next lowest distance value and so on. If an organism for a selected distance value is already joined to other organisms, a branch point is formed, and the new organism is joined on to the existing group. This process continues until all the organisms have been joined. The advantage of hierarchical clustering is that it provides the most similar organisms as well as any parent-offspring relationships formed by the branch points in a readily interpretable form.

[Figures 4.8-4.13: single linkage dendrograms; vertical axis, microorganism; horizontal axis, similarity index from 1.0 to 0.0.]

Figures 4.8-4.13 are the dendrograms generated for each phospholipid class by the above clustering procedure. On the vertical axis of each dendrogram are the microorganisms that contained the given phospholipid. On the horizontal axis is the similarity index with a limiting value of 1.0 or 100% on the left and a limiting value of 0.0 or 0% on the right. The lengths of the horizontal line segments represent the dissimilarities between organisms. The join points, represented by the vertical line segments, represent the similarities as measured on the similarity index scale. The closer two organisms join to the left, the more similar are their mass spectra. It is interesting to note that while differentiation between several of the organisms can be made from each of the individual phospholipid class dendrograms, no single dendrogram cleanly differentiates among all five organisms.
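The single linkage joining procedure described above can be sketched as a walk through the pairwise distances in ascending order, merging groups whenever a pair spans two different groups. The sketch below is a hedged illustration only: the function name, the dictionary of pairwise distances, and the organism labels in the usage note are hypothetical and are not taken from the original software.

```python
def single_linkage(names, dist):
    """Return the joins as (group_a, group_b, level) in the order they occur.

    names: list of item labels.
    dist:  dict mapping (a, b) label pairs to a distance (assumed layout).
    """
    # Every item starts in its own group.
    group = {n: frozenset([n]) for n in names}
    merges = []
    # Visit pairs from the lowest distance upward, as in the text.
    for (a, b), level in sorted(dist.items(), key=lambda kv: kv[1]):
        ga, gb = group[a], group[b]
        if ga == gb:
            continue  # already joined through an earlier, closer pair
        merged = ga | gb
        merges.append((ga, gb, level))
        for n in merged:
            group[n] = merged  # a branch point: the new group replaces both
    return merges
```

For example, with three hypothetical organisms A, B, and C and distances A-B = 0.1, B-C = 0.3, and A-C = 0.5, A and B join first at 0.1, and C then joins the existing A-B group at 0.3; the A-C distance is never used because both are already in the same group.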
Even the phosphatidylmonomethylethanolamine (PMME) dendrogram, which provides very good differentiation between E. coli, C. freundii, and S. abaetetuba, does not differentiate between S. enteritidis and L. monocytogenes because they do not contain measurable levels of PMME. However, using a combination of two or more phospholipid class dendrograms, all five microorganisms can be cleanly differentiated from each other. For example, the phosphatidylglycerol (PG) dendrogram does not cleanly differentiate between the two Salmonella species but does cleanly differentiate among the other organisms and between them and the two Salmonella species. However, either the phosphatidyldimethylethanolamine (PDME) or the phosphatidic acid (PA) dendrogram can be used to differentiate between the Salmonella species. Since differentiation can be made among all five organisms using just two of the phospholipid class dendrograms, it might be expected that combining the phospholipid neutral loss data would yield a single dendrogram that cleanly differentiates among all five organisms. However, this test produces the dendrogram in Figure 4.13, which does not cleanly differentiate among any of the organisms. The reason is that combining the data removes the level of discrimination provided by the phospholipid classes themselves. Furthermore, each of the five phospholipid class dendrograms differentiates among the organisms differently. Combining the data before performing the cluster analysis therefore results in a loss of selectivity for any one organism. The important lesson from combining the phospholipid data is that it is the most discriminating features in the mass spectra that produce the cleanest differentiations between the microorganisms.
The presence of any nondiscriminating features merely serves to increase the computational complexity of the clustering and to decrease the differentiating ability of the resulting dendrogram. It is therefore advisable to select out of the data set only the most discriminating features to work with. One technology that can assist in determining the most discriminating features, particularly for large data sets, is genetic algorithms, which were invented by Dr. John Holland at the University of Michigan [18-20]. The importance of isolating discriminating features will be discussed further in Chapter 7.

Conclusions

The use of phospholipid biomarker screening and hierarchical cluster analysis provides an effective means of differentiating between food-borne microbial contaminants and likely other organisms as well. The advantage of hierarchical cluster analysis to this study becomes particularly relevant once there are more than a few mass spectra to compare, as it becomes difficult to differentiate among them using visual inspection. For example, even with the small microbial sample set used here, a total of 67 individual mass spectra were used to construct the dendrograms. Each mass spectrum may also contain a few to dozens of mass/intensity pairs. However, by this effective implementation of hierarchical cluster analysis, this data complexity is reduced to five easily interpreted dendrograms. With less noisy data, even better differentiation should be possible. Likewise, the implementation of a method for determination of the most discriminating factors should further improve the differentiation between microorganisms.

References

1. Archibold, E. R.; Bourquin, A W. J.; F., C.; Chakrabarty, A. M.; Fletcher, M. M.; Lenski, R. E.; White, D. W. Center for Microbial Ecology - Science Advisory Panel Report, National Science Foundation: Washington, DC, 1990.
2. Fenselau, C. In ACS Symposium Series; Comstock, M. J., Ed.; American Chemical Society: Washington, DC, 1994; Vol. 541.
3. Cole, M. J. Ph.D. Thesis, Michigan State University, East Lansing, 1990; Chapter 3.
4. Brock, T. D.; Madigan, M. T. Biology of Microorganisms, 5th ed.; Prentice Hall: Englewood Cliffs, 1988; Chapter 19.
5. Cole, M. J. Ph.D. Thesis, Michigan State University, East Lansing, 1990; Chapter 5.
6. Miller, L.; Berger, T. Gas Chromatography Application Note 228-1 - Bacteria Identification by Gas Chromatography of Whole Cell Fatty Acids, Hewlett Packard: Palo Alto, CA, 1985.
7. Ratledge, C.; Wilkinson, S. G. Microbial Lipids, Volume 1, Academic Press: San Diego, 1988.
8. Kates, M. Advances in Lipid Research, Volume 2, Paoletti, R., Ed., Academic Press: New York, 1964; Chapter 1.
9. Lechevalier, M. P. CRC Crit. Rev. Microbiol., 1977, 5, 109-210.
10. Heller, D. N.; Cotter, R. J.; Fenselau, C.; Uy, O. M. Anal. Chem., 1987, 59, 2806-2809.
11. Heller, D. N.; Murphy, C. M.; Cotter, R. J.; Fenselau, C.; Uy, O. M. Anal. Chem., 1988, 60, 2787-2791.
12. Cole, M. J.; Enke, C. G. Anal. Chem., 1991, 63, 1032-1038.
13. Cole, M. J.; Enke, C. G. J. Am. Soc. Mass Spectrom., 1991, 2, 470-475.
14. Bligh, E. G.; Dyer, W. J. Can. J. Biochem. Physiol., 1959, 37, 911-917.
15. Cole, M. J.; Hemenway, E. C.; Enke, C. G. Presented at the 39th Annual Conference on Mass Spectrometry and Allied Topics, May 19-24, 1991, Nashville, TN.
16. Jain, A. K.; Dubes, R. C. Algorithms for Clustering Data; Prentice Hall: Englewood Cliffs, 1988; Chapter 2.
17. Bratchell, N. Chemomet. Intell. Lab. Syst., 1989, 6, 105-125.
18. Holland, J. H. Adaptation in Natural and Artificial Systems; The MIT Press: Cambridge, MA, 1992.
19. Siedlecki, W.; Sklansky, J. Pattern Recognition Letters 1989, 335-347.
20. Goldberg, D. E. Genetic Algorithms in Search, Optimization and Machine Learning; Addison-Wesley: New York, NY, 1989.
CHAPTER 5

Generating Clusterable MS/MS Data

Introduction

As with any spectral matching or data interpretation technique, assumptions must be made, and indeed allowed, regarding the consistency of the data being investigated. For the purpose of data collection, this requires that the data be collected under a consistent set of defined conditions. Alternatively, data collected under varying conditions must be adjusted to match the data that would have been generated under a standard set of conditions before spectra can be compared with one another. Unfortunately, the second option is not practical because it requires a set of adjustment algorithms that have not been developed and may not be possible. It also requires data about the sample and analysis conditions that are generally not available. Indeed, the applicability of such adjustments probably requires a knowledge of the analyte to a degree that makes the spectral matching or interpretation unnecessary.

Therefore, for the purpose of general sample analysis, it is a requirement that samples be analyzed under a standard set of operating conditions. This is particularly true in the field of mass spectrometry, where sample introduction, ionization, and fragmentation conditions affect the spectra so significantly. For mass spectrometric analysis, standard conditions have been established for certain types of analyses by consensus or by various agencies such as the Environmental Protection Agency (EPA). For example, electron impact ionization (EI) is performed using an electron beam energy of 70 eV. Furthermore, the normal mass spectrum (MS) profile of a standard tuning compound like perfluorotributylamine (PFTBA) should be consistent among different types and models of mass spectrometers.

Attempts have been made to develop standardized methods for the tandem mass spectral (MS/MS) analysis of small organic molecules using EI ionization and collision-induced dissociation (CID). Dawson et al.
[1], for example, conducted a round-robin study involving seven laboratories prior to establishing a standard set of operating conditions for MS/MS analyses. The focus of this study was the setting of the collision energy between selected precursor ions and the argon collision gas and of the collision gas target thickness or pressure. The results demonstrated that spectral profiles varied widely from laboratory to laboratory even when the laboratories were provided with a specific set of operating conditions to follow. The spectral differences can be ascribed to differences between instrumental designs and characteristic behavior, operating conditions, and operator experience.

For the majority of MS/MS analyses performed in tandem quadrupole mass spectrometers, the instrument is optimized for the specific sample and the type of information desired. Product spectra are generally much simpler in nature and provide fewer spectral peaks than normal mass spectra. For example, isotope peaks may be missing, and low-energy CID spectra do not have many of the higher energy products typically produced in the EI source. High-energy CID [2] and 193 nm photo-induced dissociation (PID) [3] have been found to produce product spectra that are very similar to normal EI spectra but with MS/MS isotopic characteristics. Under typical collision energy and single collision conditions, the product spectral peaks are often of very low intensity (a few percent) relative to the precursor peak. However, the substructures represented by the product m/z fragments are generally the result of direct unimolecular decomposition of the precursor ion. As the target gas density (pressure) is increased, ions passing through the collision chamber octapole undergo multiple collisions with the target gas.
As a result, more product peaks are produced, which may represent a combination of higher energy fragmentations of the precursor ion directly and/or secondary fragmentation of other product ions. The product ion peak intensities are also greater relative to the precursor ion peak. The first set of conditions, at low pressure, is useful for direct characterization of ions and for investigating the physical chemistry of the ion fragmentation process. The second, though it produces more complex results, is often used for obtaining better sensitivity and additional ion characterizing information.

Due to the above complexities, there is neither a set of standard operating conditions for MS/MS analyses nor any readily available MS/MS spectral libraries. Those MS/MS libraries that do exist are generally specialty libraries created in-house by such organizations as pharmaceutical companies for the characterization of drugs and drug metabolites. It was, therefore, necessary to develop a set of working conditions for this project as well as to generate a library of MS/MS spectra. The remainder of this chapter will focus on the process of characterizing the relevant operating parameters of the mass spectrometer, the development of an MS/MS analysis protocol, and the development of a library system for spectral storage and retrieval.

Acquisition Method Development

Instrumental Parameters

For this research, mass spectra were obtained on Finnigan TSQ-700 and Finnigan TSQ-7000 tandem mass spectrometers (referred to later as simply TSQ) equipped with an electron impact ion source as diagrammed in Figure 5.1. This instrument contains fifteen electrical elements that must be optimized or tuned individually.

[Figure 5.1: diagram of the TSQ tandem mass spectrometer.]
These elements, listed by position from the source to the detector, are the source EI filament, lenses 1-1, 1-2, and 1-3, quadrupole 1, lenses 2-1, 2-2, and 2-3, the collision chamber octapole, lenses 3-1, 3-2, and 3-3, quadrupole 3, the conversion dynode, and the electron multiplier. With the exception of the EI filament, the conversion dynode, and the electron multiplier, the potentials at any moment during a scan are controlled by nineteen different parameters. The parameters for the normal mass analysis mode, with quadrupole 1 operating as the filtering quadrupole, are given with typical values in Table 5.1.

Due to the complexity of the TSQ, it is necessary to tune the instrument in stages. This is accomplished by first tuning the instrument with PFTBA in normal MS mode for both Q1MS and Q3MS modes. In Q1MS mode, only the first quadrupole is operating as a mass (m/z) filter while quadrupoles 2 and 3 are operating in pass-through (rf only) mode. Q3MS mode is the same as Q1MS mode except that quadrupole 3 is the mass filtering quadrupole while quadrupole 1 operates in pass-through mode. As a result, the procedures for tuning in Q1MS and Q3MS modes are essentially the same.

Table 5.1 Tune parameters for Q1MS mode with typical values in volts.

    L11     -13.4    // lens 1-1
    L12     -50.0    // lens 1-2
    L13     -10.9    // lens 1-3
    POFF    -10.0    // quadrupole 1 offset
    L21      -2.3    // lens 2-1
    L22    -180.0    // lens 2-2
    L23     -93.1    // lens 2-3
    COFF    -10.0    // collision chamber octapole offset
    L31     -92.0    // lens 3-1
    L32    -180.0    // lens 3-2
    L33     -35.7    // lens 3-3
    DOFF    -10.0    // quadrupole 3 offset
    PRES      4.1    // quadrupole 1 resolution
    PCAL      2.6    // quadrupole 1 calibration
    CRFP     27.4    // collision chamber octapole rf amplitude
    DRFP     -1.1    // quadrupole 3 rf amplitude
    COLL    135.0    // source electron lens

The tuning procedure for normal MS mode involves first adjusting the resolution and offset of the filtering quadrupole to obtain the optimal signal and peak shape for a single tune mass (m/z) while maintaining unit mass resolution. Due to the energy spread of the ions leaving the source, tuning the filtering quadrupole to maintain the above criteria results in approximately 90% of the ions of the selected m/z being discarded. This is repeated for all tune masses, which for PFTBA are nominally 69, 100, 119, 131, 169, 219, 264, 414, 502, and 614. Tuning is also performed using background water (m/z 18) and air (m/z 28 and 32). After unit resolution for the tune masses is achieved, the mass range is calibrated for each tune mass. Then the rf potentials for the collision chamber octapole and the non-filtering quadrupole are optimized for the transmission of each tune mass.

The next step in tuning is to set the lens voltages. The center lenses 1-2, 2-2, and 3-2 are manually adjusted first for m/z 32 and 502. It is not necessary to tune the center lenses for each tune mass because these lenses are not particularly mass sensitive. Next, the first and third lens voltages for each set are adjusted to obtain the optimal signal intensity and stability for the specific tune mass.

There are so many ion optic elements that must be tuned on the tandem quadrupole mass spectrometer that a single pass is not adequate for optimal tuning. The reason for this is that most of the ion optic elements have inter-related effects on the filtering and transmission of ions.
For example, the initial tuning step involved the adjustment of the quadrupole resolution and offset. These values are sensitive to the energies of the ions leaving the source. Since the first set of lenses is located between the source and the first quadrupole, adjustment of the first lens set can significantly affect the resolution and offset parameters on the first quadrupole. Therefore, once the above tuning steps are completed, the process is repeated to obtain a better tune approximation.

After both Q1MS and Q3MS modes have been tuned, it is necessary to tune the instrument for MS/MS operation. This step is much simpler than the normal MS tuning procedures. For the first quadrupole and lens sets 1 and 2, values from the Q1MS tune are used. For the third quadrupole and lens set 3, values from the Q3MS tune are used. As a result, the only adjustment necessary to complete MS/MS tuning is the rf potential on the collision chamber octapole. This step is performed by setting both quadrupoles 1 and 3 to filter on the same tune mass and then adjusting the collision chamber octapole rf potential to optimize ion transmission. This is repeated for all tune masses. This particular mode of operation is called neutral loss/gain mode (NEU). For the purpose of tuning, the neutral loss/gain value is zero. Typical tune profiles of PFTBA for Q1MS, Q3MS, and NEU 0 operating modes are provided in Figure 5.2.

Sample Introduction

Sample introduction for this research was accomplished by two methods. For moderate to low volatility samples, a solids probe, which is controlled by the mass spectrometer, was utilized. This probe uses a combination of heaters and water-ethylene glycol coolant to control the temperature of the probe. Through the data system software, a heating curve can be defined to control the temperature of the probe over the course of an analysis. The operating range of the probe is from the coolant temperature up to 250 °C.
The probe tip is designed to hold sample crucibles, usually aluminum or glass, which have an internal sample volume of approximately 10 µL.

For highly volatile compounds, a custom designed leak inlet system was utilized. This device, shown in Figure 5.3, provides a controlled inlet of sample that cannot be obtained from the solids probe. Highly volatile samples generally last only a few seconds on the solids probe, with the highest volatility samples gone before the solids probe has been inserted. This problem is circumvented by the leak inlet. The glass sample holders are cleaned and the glass capillary is replaced between samples. The glass capillary is a section of gas chromatographic guard column approximately 18 inches long. Using this device, a complete MS/MS map can be obtained from a few µL of sample. The volatility range of samples introduced by this means is limited by the lack of heating of the sample bulb and capillary tubing.

Figure 5.2 Typical Q1MS, Q3MS, and NEU 0 profiles for the tuning compound PFTBA. (Three panels, one per mode; relative intensity (%) versus m/z.)

[Figure 5.3: the custom leak inlet system.]

One characteristic problem with the quadrupole design of the TSQ-7000 and both the solids probe and leak inlet systems is that the mass filtering ability of the first quadrupole is significantly affected by source pressure. As a result, care must be taken not to allow too high a sample flow to the source.

Effects of Collision Energy on CID Spectra

For low-energy collision-induced dissociation, collision energy, target gas pressure (density), and target gas composition are the key instrumental parameters.
The most common target gas for low-energy CID is argon, and it was used exclusively for this research. Due to the importance of these parameters, it was necessary to investigate how sensitive product spectral profiles are to them and to determine what the optimal operating parameters might be. Similar studies were conducted by Hart [4], Palmer [5], and Cross [6] as part of the ACES project, which also utilized triple quadrupole mass spectrometers for the generation of product spectra. The results of the studies for this research are consistent with these previously conducted studies.

Under low-energy collision conditions (i.e., 1-100 V), the transfer of kinetic energy of the precursor ion to internal energy occurs through vibrational excitation of the electronic ground state. The amount of kinetic energy converted to internal energy from the collision of a polyatomic ion with a neutral collision gas can be directly correlated with collision energy [7]. On triple quadrupole mass spectrometers, the maximum lab-centered collision energy is given by the potential difference between the ion source region, which is usually at ground potential, and the collision chamber offset potential. The center of mass collision energy, which is dependent on the masses of the precursor ion and the target gas, is given by the following equation [8]:

$$E_{CM} = E_{lab} \, \frac{m_g}{m_p + m_g}$$

where $E_{CM}$ is the center of mass collision energy, $E_{lab}$ is the laboratory collision energy, $m_g$ is the mass of the target gas, and $m_p$ is the mass of the precursor ion.

It has been shown for n-butylbenzene that the fraction of the kinetic energy converted to internal energy increases with collision energy up to 40 V(lab), with about 50% conversion [7]. This fraction decreases as the collision energy exceeds 40 V(lab).
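As a worked illustration of the center of mass relation given above, the short Python sketch below evaluates it for the cyclohexanone molecular ion (m/z 98) colliding with argon. The function name is hypothetical, and the use of a nominal argon mass of 40 and of singly charged ions (so that m/z equals the ion mass) are assumptions made only for this example.

```python
def center_of_mass_energy(e_lab, m_precursor, m_gas):
    """Center of mass collision energy: E_CM = E_lab * m_g / (m_p + m_g)."""
    return e_lab * m_gas / (m_precursor + m_gas)

ARGON_NOMINAL_MASS = 40.0  # assumption: nominal mass, not the exact atomic mass

# Cyclohexanone molecular ion (m/z 98) at 10, 40, and 80 eV laboratory energy.
for e_lab in (10.0, 40.0, 80.0):
    e_cm = center_of_mass_energy(e_lab, 98.0, ARGON_NOMINAL_MASS)
    print(f"E_lab = {e_lab:5.1f} eV  ->  E_CM = {e_cm:5.2f} eV")
```

Because the mass ratio is fixed for a given precursor and target gas, the center of mass energy scales linearly with the laboratory energy, and a heavier precursor ion receives a smaller fraction of it.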
Figure 5.4 Product spectra of the molecular ion of cyclohexanone with a collision energy of (a) 10 eV, (b) 40 eV, and (c) 80 eV with a target gas pressure of 3 × 10⁻⁶ torr (manifold). (* denotes offscale precursor peak at 100%)

Product spectra of the molecular ion of cyclohexanone (m/z 98) showing the effect of increasing collision energy are provided in Figure 5.4. These spectra were collected at a fixed target gas pressure of 3 × 10⁻⁶ torr (manifold, 0.47 mtorr collision chamber) and plotted relative to the unfragmented precursor ion intensity obtained without target gas present. Few products are produced at 10 V since a 10 V collision for the most part results in an internal energy lower than the critical dissociation energy, also known as the appearance energy, for many fragment ions. At 40 V, more precursor ions exceed these energy barriers and high intensity product ions are the result. The product ions of m/z 41, 42, and 55 appear to be more sensitive than the other product ions to collision energy. The relative intensities of several product ions of cyclohexanone are plotted versus collision energy in Figure 5.5.

A similar set of experiments to that described above was conducted on the molecular ion of 3-heptanone (m/z 114). Product spectra showing the effect of increasing collision energy are given in Figure 5.6. The relative intensities of several product ions of 3-heptanone are plotted versus collision energy in Figure 5.7. The results of a repetition of this experiment are given in Figure 5.8. While the lower intensity product ions appear consistent, the intensity of the m/z 72 product ion varies significantly. The m/z 72 ion is the result of a McLafferty rearrangement reaction from m/z 114. The McLafferty

[Figure 5.5: relative intensities of several product ions of cyclohexanone versus collision energy.]
[Figures 5.6-5.10 are not reproduced here.]

indicates the occurrence of metastable decomposition as well. Slopes higher than unity indicate a multiple collision product. In the case of the products of cyclohexanone, these reaction orders vary from about 1.4 to 1.6, indicating that some of the products are formed from single collision reactions and some from two collision reactions.

Figure 5.11 shows the effect of increasing the target gas pressure at a constant collision energy (-30 V) for the molecular ion of 3-heptanone. The trends represented in these data are quite similar to those obtained previously for cyclohexanone. The equivalent log-log plot for the data presented in Figure 5.11 is given in Figure 5.12. The slopes of these plots, which are in the range of 1.4 to 1.6 as well, indicate, as for the cyclohexanone data, that at a target gas pressure of 3 × 10⁻⁶ torr (manifold) and -30 V collision energy, a mixture of single and multiple collisions is occurring in the collision chamber.

Instrumental Conditions Selected for MS/MS Library Generation

Configuring the TSQ-7000 for MS/MS library spectra generation first requires tuning the instrument to provide consistent PFTBA profiles between tunes. Source conditions were the same as those used for standard EI analysis (70 eV electrons, 150 °C) with a filament emission current of 1300 µA.
[Figure 5.11: effect of increasing target gas pressure at a constant collision energy for the molecular ion of 3-heptanone.]

Figure 6.3 Sample product spectra of m/z 59 for selected compounds listed in Table 6.1. Each panel plots relative intensity (%) against m/z from 10 to 60. Panels: Cmpd27 (cluster 1); Cmpd34 (cluster 1, 2bc); Cmpd29 (cluster 2, 3ac); Cmpd23 (cluster 3, 1ac); Cmpd32 (cluster 3, 9bc); Cmpd21 (cluster 4, 1ao); Cmpd24 (cluster 4); Cmpd35 (cluster 4, 1bo); Cmpd12 (cluster 5, 1cc); Cmpd14 (cluster 5, 1ao); Cmpd26 (cluster 5, 1cc).
Figure 6.3 (cont'd). Panel: Cmpd17 (cluster 6, 1bc).

This cluster contains members expected to produce precursor ions from substructures 2bc, 2cc, and 3cc. As can be seen from the sample spectra in Figure 6.3, compounds expected to express substructures 2bc or 2cc produce nearly indistinguishable m/z 59 product spectra. From the sample spectrum of a compound expected to express substructure 3cc, it can be seen that the only real difference from the 2bc and 2cc spectra is the ratio of the m/z 43 and 44 peaks. Since the 43/44 ratio is all that really differentiates these substructures and those peaks are not particularly dominant, it is easy to see why these substructure types would cluster together. From these results, it appears that substructures 2bc, 2cc, and 3cc cannot be separated from each other based on their representative product spectra by the clustering methodologies utilized here.

Partitional cluster 2 has all the same m/z values as cluster 1. However, it is significantly different because of the m/z 41 base peak. This is the only cluster that has such a strong m/z 41 relative intensity. The single member of this cluster is expected to express the 3ac substructure. This substructure, being a primary alcohol, could produce an m/z 41 ion by losing H2O.

Partitional cluster 5 members are characterized by intense peaks at both m/z 29 and 31 with much lower intensities at all other m/z peaks. This cluster contains product spectra predominantly from compounds expected to express substructures 1cc and 1co, which are ethyl ether substructures.

Partitional cluster 4 members are characterized by an intense m/z 15 peak, resulting from methyl ion formation, and intermediate intensity peaks at m/z 29 and 31. This cluster contains compounds expected to express different substructures, specifically 7bc, 1bo, and 1ao.
Substructure 7 is significantly different from substructure 1, which likely indicates either that there is some rearrangement of the precursor ion to a common structure or that the fragmentation patterns are similar.

Partitional cluster 6 contains a base peak at m/z 29, intermediate peaks at m/z 27, 30, 33, and 43, and no peak at m/z 15. This is the only cluster with intermediate intensity peaks at m/z 27 and 33. The compound contained in this cluster is expected to express substructure 1bc. This cluster grouped closely with cluster 5.

Compounds expected to express substructure 1a are present in clusters 3, 4, and 5 with the compounds expected to express substructures 1c, 7b/1b, and 9b respectively. This would seem to indicate something unique about the 1a substructure. In particular, substructure 1ao is joined in two locations on the dendrogram that appear to have relatively little in common. There are a few possible reasons for this discrepancy. For example, there may have been some contamination of the 1ao samples, perhaps due to aging of the samples. Another possibility is that the fragmentation of the molecular ion in the EI source resulted in significant rearrangements such that more than one m/z 59 substructure was analyzed. A third possibility is that substructure 1ao is more sensitive to variations in EI source and/or CID conditions and this sensitivity is reflected in the product spectrum. Whatever the reason, it is readily apparent that, based on the data base product spectra, substructure 1ao cannot be readily isolated from some of the other substructures.

Clusters 2 and 6 each have only one member. Obviously, it is difficult to validate these as true clusters. However, the spectra for these two clusters are significantly different from the other spectra, particularly for compound 29 (3ac).
This difference is enough to support placing these compounds, and the substructures they represent, into groups other than those already present.

It is important to ensure that the product spectra obtained do not result from a combination of two or more m/z 59 precursor ion structures, as this can easily lead to blurred or overlapping clusters. In many cases, however, even though a compound may be capable of producing more than one ion structure of the same precursor m/z, one ion structure may dominate over the others. This can be seen in the following m/z 59 product spectrum for sec-butyl alcohol, which can potentially produce a mixture of substructures 2ac and 3cc. The spectral pattern matches quite well with the 3cc sample spectrum given in Figure 6.3. Nevertheless, this is not always the case, so care must be taken when selecting the compounds from which the representative product spectra will be derived.

Figure 6.4 Mixture product spectrum for m/z 59 of sec-butyl alcohol. [Horizontal axis m/z.]

Conclusions

The clustering results presented in this chapter demonstrate that it is possible to group product spectra with a reasonable degree of delineation according to substructure. However, these results also demonstrate that there is enough similarity among product spectra for different substructures to make full delineation to specific substructures impossible in some cases. Nevertheless, the ability of this technique to reduce the number of candidate substructures for an unknown to two or three substructures still provides tremendously valuable information for structure elucidation.

References

1. Hartigan, J. A. Clustering Algorithms; John Wiley and Sons: New York, 1975; Chapter 4.
CHAPTER 7

Product Ion Classification for Standards and Unknowns

Introduction

In Chapter 6, it was demonstrated that low-energy CID product mass spectra could be classified into groups based on their m/z-intensity patterns. These classifications further demonstrated that compounds that produced the same ion in the EI ionization source in general produced similar product spectra which formed a single cluster. It was also shown that different ions generally gave different product spectra which formed into different clusters.

The objective of this chapter is to leverage the information obtained in Chapter 6 to form the basis of a classification system for product spectra for standards and unknowns. To accomplish this objective, it will be necessary to determine those spectral features that make a cluster distinct from other clusters. Also, it will be necessary to represent this information in a manner suitable for the characterization of other product spectra.

Establishing Representative Descriptors

The establishment of representative descriptors for the cluster information obtained in Chapter 6 is an important first step towards general product ion classification. This process can be accomplished in a variety of ways. For example, if the intent were to develop a binary or bit field representation for a spectrum, an intensity threshold might be applied. Any peaks above the threshold would be assigned a value of unity while any peaks below the threshold would be assigned a value of zero. Bit field encoding of normal mass spectra for the purpose of reduced storage and matching has been performed by others [1,2]. Such an approach would, however, be of little value for matching product mass spectra because product spectra generally have few peaks, most of which occur for most possible precursor ions.
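The limitation of bit field encoding noted above can be illustrated with a minimal sketch. The two spectra, the m/z axis, and the 10% threshold below are hypothetical values chosen only for illustration; they are not taken from the data base used in this work. Treating a peak exactly at the threshold as zero is also an assumption.

```python
def bit_field(spectrum, mz_axis, threshold):
    """Encode a spectrum as a list of 0/1 flags, one per m/z on mz_axis.

    spectrum: dict mapping m/z -> relative intensity (%).
    A peak strictly above the threshold is encoded as 1, otherwise 0.
    (Treating a peak exactly at the threshold as 0 is an assumption.)
    """
    return [1 if spectrum.get(mz, 0.0) > threshold else 0 for mz in mz_axis]


# Hypothetical product spectra: the same m/z values are present in both,
# but the intensity ratios are very different.
MZ_AXIS = [15, 27, 29, 31, 41, 43]
spectrum_a = {15: 12.0, 29: 100.0, 31: 85.0, 43: 20.0}
spectrum_b = {15: 95.0, 29: 40.0, 31: 100.0, 43: 15.0}

bits_a = bit_field(spectrum_a, MZ_AXIS, threshold=10.0)
bits_b = bit_field(spectrum_b, MZ_AXIS, threshold=10.0)

# Both spectra collapse to the same bit pattern, so the encoding
# cannot tell them apart even though their peak ratios differ.
print(bits_a, bits_b, bits_a == bits_b)
```

Because the two hypothetical spectra differ only in intensity, their bit fields are identical, which is the situation that makes this encoding unsuitable for feature poor product spectra.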
In order for a product spectral matching system to work, it must, therefore, retain the intensity information since peak ratios may be the only available mechanism for differentiating among spectra.

For this work, the most obvious descriptor system is to retain the intensity information by calculating a representative mean spectrum for each cluster. In fact, this information is already available from the partitional clustering results shown in Table 6.2. For each cluster found by the partitional cluster algorithm, the results include a list of the m/z values present in at least one compound. For each m/z, the minimum, maximum, and average intensity are listed, along with the standard deviation in the intensity. Therefore, the average intensity column, which represents the cluster mean, was used as the most appropriate descriptor for each cluster. The standard deviation information is useful as it helps to characterize the averaged peak intensities. The cluster mean spectra are shown in Figure 7.1 with standard deviation error bars.

Discovery of Non-Discriminating Features

While the cluster representative product spectra discussed in the previous section are, by themselves, very useful information to have for product ion classification, these spectra are not necessarily optimal for differentiation between product spectra. This is because some peaks occur commonly in product spectra of the same precursor m/z at about the same intensity. Since the goal is to find features that bring together spectra that should group together and keep apart those that should not, common non-differentiating features simply dilute this process. Therefore, the most desirable course of action is to either reduce or remove these influences. For example, m/z 27 is represented in all of the cluster mean spectra at low intensity except for cluster 6, where it is just under 30% relative intensity.
It was expected that removing m/z 27 might improve the clustering with the remaining data. After reducing the spectra down to m/z 15, 29, 31, 41, and 43, no reordering was found in either the hierarchical or partitional clustering results. The reason for this is that the m/z values that were removed were not the dominant features. As a result, they contributed much less to the overall distance than the remaining features. The reduced feature dendrogram is given in Figure 7.2. While the distance between samples has increased in some cases, the dendrogram still strongly resembles the full feature dendrogram shown in Figure 6.2.

Figure 7.1 Cluster mean product spectra for m/z 59. The error bars indicate the standard deviation in relative intensity of the cluster members. Clusters 2 and 6 are single member clusters and have no error bars. [Six panels, Cluster 1 through Cluster 6; relative intensity (%) versus m/z.]

While it appears that the non-discriminating features above do not apply across the entire data base, there may be cases where they are useful for direct cluster-to-cluster differentiation. For example, cluster 6 has a significant m/z 33 peak. This peak is useful for differentiating cluster 6 from the other clusters. However, in some situations, the reliability of these features may be in doubt unless they have significant intensities.
Therefore, the most discriminating features in the m/z 59 product spectra given here are those in the reduced set discussed here and in Chapter 6 in the cluster interpretation section.

[Figure 7.2: reduced feature dendrogram for the m/z 59 product spectra.]

Classification of Unknowns by Spectral Matching

While spectral matching systems have been developed for use with normal mass spectra [3,4], these approaches are optimized for the feature rich spectra typically obtained by EI mass spectrometry. It was, therefore, decided to develop a spectral matching approach for this work that is better suited for feature poor low-energy CID spectra. Because such product spectra often have the same m/z values represented, this approach would have to rely less on m/z occurrence and more on intensity. However, this approach would also need to handle peak intensity variations typical of normal mass spectral analysis. Obviously, no pattern matching or classification approach is particularly well suited for handling differences caused by experimental error.

To represent the intensities of the mass spectral peaks, a membership function was developed. The membership function is based on the principles of fuzzy set theory introduced by Zadeh [5]. Applications of fuzzy sets to problems of chemical significance have been reported in the literature [6,7]. Several of these publications contain overviews of the relevant aspects of fuzzy set theory [6,8].
Fuzzy set theory is based on the principles of traditional crisp set theory, where membership in a set is designated by the values zero (no membership) or one (full membership). Fuzzy set theory extends this concept to allow members to have values in the range of zero to one, where zero represents no membership and one represents full membership. Under crisp set operations something may or may not be a member of a set as constrained by the zero and one values. This is generally referred to in terms of A or NOT A. Conversely, fuzzy set theory is referred to in terms of A and NOT A.

The application of fuzzy set theory membership functions to peak intensities provides a useful mechanism by which the similarities in intensity among peaks can be expressed. For this purpose, three membership functions were investigated and are shown in Figure 7.3. The first membership function is of the standard type used in many fuzzy set implementations. The vertical scale runs from 0 to +1, indicating degree of membership. The horizontal axis in this case is the relative intensity of the library peak being compared. The relative intensity of the library peak is centered on the plateau region. If the unknown peak is in the range of the library peak plus or minus an intensity variability value, its membership is one. This is done so as not to penalize the match quality due to normal intensity variations. Outside of this region, the membership value falls off linearly to zero, after which it remains zero.

For this work, an intensity variability value was used to accommodate the spectrum-to-spectrum variations routinely observed among mass spectra. Another option would have been to use the standard deviation information provided by the partitional clustering. However, the standard deviation information can be misleading and is inappropriate in this implementation. The intensity variability value determines the width of the match window for a given peak.
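The first membership function described above can be sketched as a simple trapezoid. This is a minimal illustration, not the implementation used in this work: the function name and the width of the linear fall-off (`falloff`) are assumptions, since the text does not state the slope. The default variability of 10% relative intensity follows the value reported later in this section.

```python
def membership_one(unknown, library, variability=10.0, falloff=20.0):
    """Trapezoidal membership (0 to 1) of an unknown peak intensity.

    unknown, library: relative intensities (%) at the same m/z.
    variability: half-width of the plateau around the library intensity
                 (the intensity variability value).
    falloff: intensity range over which membership drops linearly from
             1 to 0 on each side of the plateau. ASSUMED value; the
             slope is not given in the text.
    """
    distance = abs(unknown - library)
    if distance <= variability:
        return 1.0  # inside the plateau: full membership
    # Linear fall-off outside the plateau, clamped at zero.
    return max(0.0, 1.0 - (distance - variability) / falloff)


# Hypothetical library peak at 50% relative intensity.
for unknown in (55.0, 70.0, 90.0):
    print(unknown, membership_one(unknown, 50.0))
```

Under these assumed parameters, an unknown peak within 10% of the library peak receives full membership, a peak 20% away receives partial membership, and a peak 40% away receives none.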
A wide window makes it easier for peaks to give a positive match while a narrow window makes this outcome less likely. Using the standard deviation to set the window width can have an undesirable side effect. If the members of a cluster have features that are very close to each other in intensity (the desirable situation), the average intensities would have a correspondingly small standard deviation. This in turn will make it harder for other spectra to positively match with this representative spectrum. Conversely, the opposite effect can happen with more diffuse cluster members. In this case, the poorer the quality of the cluster, the more easily other spectra will match with the representative spectrum. Hence, the more diffuse the cluster, the better the chance of a match. The preferred situation is for the opposite to occur. In this case, an intensity variability value of 10% relative intensity was found suitable for all peak intensity windows to equalize the probability of a match between representative spectra and an unknown. For the purpose of evaluating the three membership functions, the same intensity variability value was used in all cases.

Figure 7.3 Fuzzy member functions investigated for spectral matching. [Three panels; membership versus relative intensity (%).]

The second membership function extends the scale from 0 down to -1 but otherwise is the same as the first membership function. In most fuzzy set implementations the membership scale ranges from +1 to 0, where +1 indicates full membership and zero indicates no membership. However, this approach is weighted towards membership because non-membership simply makes no contribution as opposed to a negative membership influence. For example, the membership scale can be interpreted as a continuous voting scale.
In most voting systems (albeit these are based on crisp set theory principles) the +1 to zero membership scale would translate into +1 being a yes vote and zero a no vote. The result, therefore, is almost always a positive vote though it may be very small. Furthermore, a no vote is the same as an abstention or no-influence position and contributes no negative influence. In terms of pattern matching, the negative half of the membership scale provides a mechanism for determining and expressing that two observations are really not the same. Therefore, a +1 to -1 membership scale is investigated in this work for membership functions two and three.

The third membership function has a membership value that increases from -1 to 0 at low intensities. Since there is no intensity weighting used in this search, low intensity peaks contribute the same as high intensity peaks to the final match factor. Due to the possibility that some low intensity peaks may be noise, this membership function seeks to reduce the negative influence of such peaks if the library peak is intense at the m/z in question. If the library peak is not intense, the trapezoidal portion of the membership function, which is the same as the second investigated membership function, takes precedence.

A comparison of the three membership functions is given in Table 7.1. This was accomplished by matching selected unknown spectra against the cluster mean spectra shown in Figure 7.1. The test unknown spectra were obtained by removing them from the data base used to create the clusters. In order to prevent undue influence on the match factors by matching a spectrum used to determine the clusters, the cluster mean for which the spectrum is a member was recalculated without the test spectrum. Specifically, compounds 14, 5, 11, and 22 were used to test the match quality of the three membership functions.
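The recalculation of a cluster mean without the test spectrum, as described above, can be sketched as follows. The member names and intensities are hypothetical and are used only to show the mechanics; treating a peak that is absent from a member spectrum as zero intensity is also an assumption of this sketch.

```python
from statistics import mean


def cluster_mean(spectra, mz_axis, exclude=None):
    """Mean relative intensity per m/z over the members of one cluster.

    spectra: dict mapping member name -> {m/z: relative intensity (%)}.
    exclude: optional member name to leave out, so that a test spectrum
             is not matched against a mean it helped to define.
    Peaks missing from a member are counted as 0.0 (an assumption).
    """
    members = [s for name, s in spectra.items() if name != exclude]
    if not members:
        raise ValueError("no cluster members remain after exclusion")
    return {mz: mean(s.get(mz, 0.0) for s in members) for mz in mz_axis}


# Hypothetical three-member cluster.
cluster = {
    "member_a": {29: 100.0, 31: 60.0},
    "member_b": {29: 90.0, 31: 70.0},
    "member_c": {29: 80.0, 31: 50.0},
}

full_mean = cluster_mean(cluster, [29, 31])
adjusted_mean = cluster_mean(cluster, [29, 31], exclude="member_c")
print(full_mean)
print(adjusted_mean)
```

The adjusted mean is the one a held-out spectrum such as `member_c` would be matched against, which avoids inflating its match factor with its own contribution.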
As can be seen from Table 7.1, of the tested membership functions, the second function proved to give the best separation between the expected match and the incorrect matches. The match factors given in Table 7.1 range from 0 to 1, indicating no membership to full membership respectively. The match factors for Cmpd14 for clusters 5 and 6 are similar in value, with membership function two providing a slightly better match for cluster 5. It was expected that if any of the compounds were to be incorrectly assigned to a cluster, it would have been Cmpd14. Cmpd5, from cluster 3, matched up with the adjusted cluster 3 mean perfectly while matching at 0.81 with cluster 4 using the first and third membership functions. Likewise, Cmpd11, from cluster 1, and Cmpd22, also from cluster 1, were matched to the correct cluster by all three membership functions. However, in all cases, membership function two gave the best distinction between the expected cluster match and the remaining clusters.

Table 7.1 A comparison of the fuzzy matches of the m/z 59 product mass spectra by the three membership functions shown in Figure 7.3.

Matched: Cmpd14 (expected Cluster 5 but potentially close to Cluster 6)
          Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6
First       0.73       0.84       0.71       0.79       0.88       0.82
Second      0.59       0.70       0.49       0.62       0.86       0.75
Third       0.59       0.77       0.56       0.69       0.86       0.82

Matched: Cmpd5 (expected Cluster 3)
          Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6
First       0.68       0.74       1.00       0.81       0.77       0.57
Second      0.46       0.53       1.00       0.67       0.55       0.19
Third       0.61       0.67       1.00       0.81       0.70       0.50

Matched: Cmpd11 (expected Cluster 1)
          Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6
First       0.93       0.77       0.70       0.64       0.84       0.57
Second      0.87       0.60       0.45       0.35       0.70       0.23
Third       0.87       0.60       0.53       0.46       0.74       0.39

Matched: Cmpd22 (expected Cluster 1)
          Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6
First       0.93       0.86       0.71       0.65       0.85       0.58
Second      0.86       0.76       0.53       0.42       0.78       0.29
Third       0.89       0.76       0.58       0.52       0.82       0.47

Classification of Unknowns by Rule Application

One of the simplest tools for structure elucidation is that of basic spectral pattern matching, or library searching as it is commonly called. All commercial mass spectrometers come equipped with a library search package. However, such packages have significant deficiencies, some of which can be traced to the semantics of library searching. In a library search an unknown spectrum is usually matched in a forward or reverse manner with the spectra in the library data base. The result of such an operation is one-dimensional, with each library spectrum being given as a distance from the unknown on whatever similarity scale is used. As the size of the library is increased, the number of points along the match distance vector also increases, making it more difficult to choose one library spectrum over another. This process contributes to the unreliability of standard library search methods for mass spectral data, which have recently been examined by Stein and Scott [9].
A further deficiency in library search methodology is that it is somewhat limited and inappropriate in some cases, such as targeted component analysis.

A more advanced mechanism is to represent individual spectra as rules relating compound structure to spectrum. Simply converting all spectrum-structure correlations represented in the library to rules does not provide significant benefit. It would still be necessary to test every rule during the analysis of an unknown, just as it is necessary to match every spectrum in a library search. This assumes, of course, that no pre-filtering has been implemented in either case. The significant benefit of a rule based (or expert system) approach to spectral matching is the interaction of rules under the control of the inference engine. For example, if a spectrum for a compound is in the data base more than once, each entry acts individually under spectral matching methodologies. However, a rule base approach can combine these replicates to improve the reliability that a positive match has occurred, based on the certainty theory expression A + B - A × B, where A and B denote the confidence factors for each of two replicates.

Improving Rules with Rule-Trees

In the previous section, it was discussed that rules, unless combined, yield little benefit over standard library searching for mass spectral data. However, a distillation of the rule information into decision-trees, referred to here as rule-trees, can be very useful. Furthermore, the process of constructing a rule-tree is straightforward in this implementation and can be updated automatically as simple rules or spectra are added to the data base. Rule-trees are also more efficient than standard spectral matching, and even than rule based methods, because a rule-tree only has as many decisions as there are spectral features. In contrast, a spectral match must make the same number of comparisons for each spectrum tested.
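The certainty theory expression A + B - A × B given in the preceding section can be illustrated with a short sketch. The function names and the confidence factors used below are hypothetical; extending the pairwise expression to more than two replicates by repeated application is an assumption of this sketch rather than something stated in the text.

```python
from functools import reduce


def combine(a, b):
    """Combine two confidence factors (each between 0 and 1)
    using the certainty theory expression A + B - A x B."""
    return a + b - a * b


def combine_all(factors):
    """Fold the pairwise expression over any number of replicates.

    Starting from 0.0 is safe because combine(0.0, x) == x.
    Applying the expression repeatedly is an assumed extension.
    """
    return reduce(combine, factors, 0.0)


# Two hypothetical replicate rules for the same compound.
two_replicates = combine(0.8, 0.6)        # roughly 0.92
three_replicates = combine_all([0.5, 0.5, 0.5])  # roughly 0.875
print(two_replicates, three_replicates)
```

The combined confidence is never lower than the larger of the two inputs and never exceeds one, which is why replicate entries strengthen a positive match instead of competing with one another.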
Figure 7.4 is a rule-tree for the m/z 59 product spectra constructed based on examination of the representative product spectra. This rule-tree was constructed in the following manner. First, the cluster mean spectra were summed together to determine which m/z values are the most intense across the clusters. Then, in order of decreasing summed intensity, the clusters are compared for regions of separation among the clusters. For example, cluster 3 is characterized by a base peak at m/z 15. In other clusters, m/z 15 may be of low intensity or even completely absent. As a result, there exists a good region of separation about which to form a decision point. For each cluster that does not overlap another, there exists a region of separation from which a leaf node is formed, which indicates the end of a decision branch. Where clusters overlap, the next most intense peak is examined in the above manner. This process continues until all the clusters have been assigned to leaf nodes or there are no more m/z features to compare.

Figure 7.4 Rule-tree for m/z 59 product spectra. [Tree diagram: root "Precursor ion m/z = 59", annotated "Limited to commonly occurring substructures containing only C, H, and O"; decision nodes m/z 31 = 50% ± 20%, m/z 15 = 60% ± 30% (appearing twice), and m/z 29 = 30% ± 15%, with yes/no branches leading to leaf nodes for Clusters 1 through 6.]

At each decision or branch point, a confidence interval is assigned to the decision. The critical decision point is taken to be the median position between the clusters. The confidence interval is designated by plus or minus an intensity value on the critical decision point. The confidence interval is calculated by taking the region between the nearest proximity of the error bars, defined by the standard deviations in relative intensity among cluster members. Outside the critical decision point plus or minus the confidence interval, one can be certain of which direction to branch.
However, within the median region, the decision confidence decreases to zero at the critical decision point.

To traverse the tree once it has been completed, one can begin at the top if the unknown spectrum is available. However, for targeted component analysis, one can begin at the cluster of interest and traverse upwards to assemble a list of identifying peaks and intensities to monitor during analysis.

Validation of the rule-tree was accomplished by testing various m/z 59 product spectra against the rule-tree. The compounds most likely to not be properly described by the rule-tree are those that are farthest from the cluster means due to their greater overall relative intensity difference from the mean. For example, in cluster 1, Cmpd3 is the farthest from the cluster mean as shown in Table 6.2. However, it is correctly assigned to cluster 1. Likewise, compounds 11, 22, 27, and 33, which were the next farthest from the mean, were also assigned correctly. Compounds 2 and 10 were tested and assigned correctly to cluster 3. Compounds 24 and 35 from cluster 4 were tested. Cmpd24 was correctly assigned but Cmpd35 was incorrectly assigned to cluster 3. The reason for this is that the m/z 31 peak for this compound was measured at 31%. From cluster 5, compounds 14, 15, and 23 were tested and were assigned correctly to cluster 5. Therefore, with one exception, this m/z 59 rule-tree successfully classified all tested compounds. Given that these compounds were the farthest from the cluster mean, it is likely that all other compounds analyzed here would also be properly assigned. Hence, this rule-tree can be used with a high degree of reliability for determining likely candidate substructures for m/z 59 low-energy product spectra.

Rule-trees are potentially very powerful as an on-line tool for targeted component analysis, particularly where sample quantities are low.
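The behavior of a single decision point, as described above, can be sketched in a few lines. This is an illustration under stated assumptions and not the implementation used in this work: the function name is hypothetical, the confidence is assumed to fall linearly to zero at the critical decision point (the text states only that it decreases to zero), and the branch is reported as "above" or "below" the critical point without asserting which of these corresponds to the yes or no branch in Figure 7.4.

```python
def evaluate_node(intensity, critical, interval):
    """Evaluate one rule-tree decision point.

    intensity: measured relative intensity (%) of the peak being tested.
    critical:  critical decision point (median position between clusters).
    interval:  confidence interval (plus or minus, in % relative intensity).

    Returns (side, confidence). Confidence is 1.0 outside
    critical +/- interval and is ASSUMED to fall linearly to 0.0
    at the critical decision point itself.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    side = "above" if intensity >= critical else "below"
    confidence = min(1.0, abs(intensity - critical) / interval)
    return side, confidence


# Decision point values taken from Figure 7.4: m/z 31 = 50% +/- 20%.
# A peak measured at 31% lies just inside the 30% to 70% window,
# so the branch is taken with slightly less than full confidence.
print(evaluate_node(31.0, 50.0, 20.0))
print(evaluate_node(80.0, 50.0, 20.0))
```

A reduced confidence at a node is a useful flag during classification, since a spectrum that passes close to a critical decision point is the kind most likely to be assigned to a neighboring cluster.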
Unfortunately, rule-trees are significantly more difficult to generate than simple or combination rules and may not be possible to generate cleanly in all cases.

Conclusions

The above methods have proven to be effective for the classification of product spectra and the recommendation of candidate substructures. The ideal case would be for every substructure to produce a unique ion that in turn would yield a unique product spectrum when analyzed by MS/MS methods. Unfortunately, and obviously, this is not the case. Therefore, without more clarifying information, the best case scenario is to identify those substructures that do produce a particular product spectral pattern. The tools developed here accomplish this goal in a satisfactory manner. The m/z 59 clusters discovered in this work and the substructures believed to be expressed in them are given in Figures 7.5 to 7.9.

While rule-trees have distinct advantages over spectral matching, spectral matching with fuzzy intensity regions can also be an important tool for classifying product mass spectra. In particular, the spectral matching approach used here can be used to obtain an estimate of the membership of an unknown product mass spectrum in all the clusters identified so far.

The methodologies developed here do not, by themselves, provide complete chemical structure elucidation. Rather, the focus of this work was to develop methods for classifying product spectra and recommending candidate substructures. Structure generators have been developed by others [10,11] to generate all possible candidate molecular structures given a series of substructures and other constraints.

As tandem mass spectrometry becomes more accepted for routine chemical analysis, MS/MS databases are bound to follow. The tools developed here were conceived to provide assistance to the mass spectrometric investigator at such a time.
Figure 7.5 The members and expected substructures for m/z 59 cluster 1. [Structure drawings for Compounds 3, 8, 11, 13, 16, 22, 27, 28, 33, and 34, and for the m/z 59 substructures present.]

Figure 7.6 The members and expected substructures for m/z 59 clusters 2 and 6 respectively. [Structure drawings for Compound 29 (cluster 2) and Compound 17 (cluster 6), and for the corresponding m/z 59 substructures.]

Figure 7.7 The members and expected substructures for m/z 59 cluster 3. [Structure drawings for Compounds 1, 2, 4, 5, 6, 7, 9, 10, 19, 31, and 32, and for the m/z 59 substructures present.]

Figure 7.8 The members and expected substructures for m/z 59 cluster 4. [Structure drawings for Compounds 21, 24, and 35, and for the m/z 59 substructures present.]

Figure 7.9 The members and expected substructures for m/z 59 cluster 5. [Structure drawings for Compounds 12, 14, 15, 18, 20, 23, 25, 26, and 30, and for the m/z 59 substructures present.]

References

1. Varmuza, K. Fresenius' Z. Anal. Chem. 1976, 282, 129-134.
2. Scott, D. R. Anal. Chim. Acta 1988, 211, 11-29.
3. McLafferty, F. W.; Hertel, R. H.; Villwock, R. D. Org. Mass Spectrom. 1974, 9, 690.
4. Damen, H.; Henneberg, D.; Weimann, B. Anal. Chim. Acta 1978, 103, 289-302.
5. Zadeh, L. A. Fuzzy Sets. Information and Control 1965, 8, 338-353.
6. Blaffert, T. Anal. Chim. Acta 1984, 161, 135-148.
7. Otto, M. Anal. Chim. Acta 1993, 283, 500-507.
8. Bandemer, H.; Otto, M. Mikrochim. Acta 1986, 2, 93-124.
9. Stein, S. E.; Scott, D. R. J. Am. Soc. Mass Spectrom. 1994, 5, 859-866.
10. Buchanan, B.; Feigenbaum, E. Artificial Intelligence 1978, 11, 5.
11. Blaffert, T. Anal. Chim. Acta 1986, 191, 161-168.