This is to certify that the dissertation entitled

An Automated Structure Elucidation System for MS/MS Data: Substructure Determination Through Spectral Matching

presented by Kevin Patrick Cross has been accepted towards fulfillment of the requirements for the Ph.D. degree in Chemistry.

Date: November 8, 1985

AN AUTOMATED STRUCTURE ELUCIDATION SYSTEM FOR MS/MS DATA: SUBSTRUCTURE DETERMINATION THROUGH SPECTRAL MATCHING

By Kevin Patrick Cross

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY

Department of Chemistry

1985

Copyright by KEVIN PATRICK CROSS 1985

ABSTRACT

AN AUTOMATED STRUCTURE ELUCIDATION SYSTEM FOR MS/MS DATA: SUBSTRUCTURE DETERMINATION THROUGH SPECTRAL MATCHING

By Kevin Patrick Cross

The building blocks for an automated structure elucidation system have been developed to evaluate the structural information contained in mass spectrometry/mass spectrometry (MS/MS) spectra. The system employs several software tools to assist in the determination of unknown organic structures. These include tools to: 1) match conventional and MS/MS spectra, 2) assist in the determination of spectrum/substructure correlations, and 3) assist in the determination of structures from identified substructures.

Correlations of MS/MS daughter spectra with substructures are determined by matching daughter spectra of parent ions of identical masses. The molecular substructures giving rise to the parent ions with matching daughter spectra are then identified. This method of substructure determination is totally empirical and does not assume that the structural integrity of an ion is maintained in the ionization or fragmentation process. It does not, therefore, require the identification of ion structures.

The ability to group similar daughter spectra representing a common substructure is central to the substructure determination process. Therefore, a program was developed to match an unknown MS/MS spectrum against either conventional or MS/MS spectra in a reference data base. It is an interactive, transparent program which employs several matching techniques for flexibility in matching criteria. An overall match factor is calculated as a combination of "forward" and "reverse" searching techniques. This approach emphasizes common spectral features while de-emphasizing experimental conditions.
The ability of the search program to distinguish MS/MS spectra of the same compound taken under different operating conditions from similar spectra of other compounds has proven valuable in determining how instrumental conditions affect MS/MS spectra. By determining the voltage and pressure ranges for each parameter that yield matching MS/MS spectra, acceptable standard conditions for acquiring MS/MS reference spectra were determined.

The MS/MS search program has been successfully applied to determine the substructures of the compound di-n-octylphthalate by grouping daughter spectra similar to those of the sample. These substructures were then combined to generate the complete phthalate molecule. The program has also been used to identify substructure/daughter spectrum correlations by matching daughter spectra of several known compounds to determine the molecular substructure associated with a particular daughter spectrum.

ACKNOWLEDGMENTS

A project such as this results only from the cooperation and help of many individuals. They all deserve a word of thanks. My advisor, Chris Enke, deserves special acknowledgement for his patient guidance, innovative ideas, and pursuit of excellence. I am also grateful to the members of the "structure determination group" who provided an environment where all benefited from the ideas and work of others. These individuals include Phil Hoffman, Hugh Gregg, Anne Giordani, Pete Palmer, and Kevin Hart. A special thanks goes out to Tom Atkinson for all his efforts toward helping anyone with any computer problem at any time. I wish to thank Adam Schubert for the time he spent evaluating and discussing my work, and the rest of the Enke group for their help and friendship. I would also like to thank all the professors at Lawrence University for the creative environment and intellectual stimulation they provided, especially my advisors, Dr. Robert M. Rosenberg and Dr. James S. Evans.
Most of all, I would like to thank my wife, Carol, for all the love and endurance throughout this period.

Financial support for this work was provided by the National Institutes of Health.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER I. INTRODUCTION
  Artificial Intelligence
  Expert Systems
  The Focus of Expert Systems
  Applications of Expert Systems to Structure Elucidation
  MS/MS in Automated Structure Elucidation
  References
FOR MS/MS DATA
  Abstract
  Introduction
  Development of an MS/MS Data Base
  Structure/Substructure Data Base Format
  Matching MS/MS Spectra
  Substructure Identification
  Generation of Molecular Structures
  An Example: The Elucidation of Di-n-octylphthalate
  Conclusions
  References
CHAPTER III. The Development of a Mass Spectra/Mass Spectra Information Management System
  Introduction
  Data Base Designs
  Computer Architecture
  Software Tools for Data Manipulation
  Software Tools for Structure Determination
  The Storage of MS/MS Data
  Mass Spectrometry Data Base Characteristics
  Statistical Occurrence of Mass Values in Mass Spectra
  Statistical Occurrence of Abundance Values in Mass Spectra
  Conclusions
  References
CHAPTER IV. A SPECTRAL MATCHING SYSTEM FOR MS/MS DATA
  Abstract
  Introduction
  Reducing the Number of Candidate MS/MS Spectra
  Intensity-Based Matching of MS/MS Spectra
  Results
  Matching N-butylbenzene Daughter Spectra Against Similar Compounds
  Performance Characteristics of the MS/MS Automated Search Program
  Conclusions
  References
CHAPTER V. INSTRUMENTAL PARAMETER EFFECTS ON MATCHING DAUGHTER SPECTRA
  Introduction
  Instrumental Parameter Effects on CID Efficiency
  Instrumental Parameter Effects on Spectral Matching
  Automated Resolution of MS/MS Mixtures
  Conclusions
  References
CHAPTER VI.
A STRUCTURE/SUBSTRUCTURE DATA BASE ASSOCIATED WITH MS/MS SPECTRA
  Abstract
  Introduction
  Data Base Design
  Structure Storage Format
  Characteristics and Operation
  Summary
  References
CHAPTER VII. FUTURE DEVELOPMENTS
  References

LIST OF TABLES

Match Factor Definitions
Daughter Spectra of Di-n-octylphthalate
Match of 149+ Di-n-octylphthalate Daughter Spectrum
Match of 105+ Di-n-octylphthalate Daughter Spectrum
Abundance Bins (Percent Total Ion Current)
Match Factor Definitions
Frequency of Mass Spectra Peaks of N-butylbenzene in the Data Base
Match of N-butylbenzene Mass Spectrum
M/Z 136+ Daughter Match Factors, Sample Spectrum (N-butylbenzene, 0.33 P/Po, 28 eV CE)
M/Z 136+ Methylbenzoate Daughter Spectra Match Factors, Sample Spectrum (Coll Press: 9.9 x 10^-3 Torr, CE: 20 eV, Drawout: -10 V)
M/Z 105+ Methylbenzoate Daughter Spectra Match Factors, Sample Spectrum (Coll Press: 9.9 x 10^-3 Torr, CE: 20 eV, Drawout: -10 V)
Match Factors for Determination of the Major Component of the Ether Mixture
Match Factors for Determination of the Minor Component of the Ether Mixture

LIST OF FIGURES

Structure-Property Relationships in Mass Spectrometry
Software Tools for Structure Determination by MS/MS
Di-n-octylphthalate Mass Spectrum
Di-n-octylphthalate. I) Molecular Substructure. II) 149+ Ion Structure. III) 105+ Ion Structure. IV) Molecular Structure
149+ Di-n-octylphthalate Parent Spectrum
149+ Di-n-octylphthalate M+1 Spectra
MS/MS Information Management System
MS/MS Computer Network
Example of MSPLOT Output
Molecular Weight Distribution of Compounds in the Reference Data Base
Frequency Distribution of Spectral Peaks in the Reference Data Base
Log Frequency Distribution of Spectral Peaks in the Reference Data Base
Frequency Distribution of Abundance Values in the Reference Data Base
Log Frequency Distribution of Abundance Values in the Reference Data Base
Logical Reduction of Candidate Spectra (Venn Diagram)
Logical Reduction of Candidate Spectra
Substituted Benzene Matching Results, Sample: n-Butylbenzene, P/Po = 0.33, CE = 27 eV
Substituted Benzene Matching Results, Sample: n-Butylbenzene, P/Po = 0.10, CE = 27 eV, PC Match Factor Results
Intensity-Based Matching Speeds
ExtraNuclear EL 400-TQ3 Triple Quadrupole Mass Spectrometer
Instrumental Effects of Total Ion Current. Each Peak is a Scan of Drawout Potential From Q2+20 V to Q2-30 V
Instrumental Effects on Collision Induced Dissociation Efficiency for 136+ Methylbenzoate Daughter Spectra. Each Peak is a Scan of Drawout Potential From Q2+20 V to Q2-30 V
5.4 Instrumental Parameter Effects on Collision Induced Dissociation Efficiency for 105+ Methylbenzoate Daughter Spectra. Each Peak is a Scan of Drawout Potential From Q2+20 V to Q2-30 V
5.5 Drawout Potential Effects on Collision Induced Dissociation Efficiency for 136+ Methylbenzoate Daughter Spectra
5.6 Collision Energy Effects on Collision Induced Dissociation Efficiency for 105+ Methylbenzoate Daughter Spectra
5.7 Collision Energy Effects on Collision Induced Dissociation Efficiency for 105+ Methylbenzoate Daughter Spectra
5.8 CE Breakdown Curves for 136+ Methylbenzoate Daughter Spectra
5.9 CE Breakdown Curves for 105+ Methylbenzoate Daughter Spectra
5.10 Instrumental Parameter Effects on the Overall Match Factor for 136+ Methylbenzoate Daughter Spectra
5.11 Instrumental Parameter Effects on the Overall Match Factor for 105+ Methylbenzoate Daughter Spectra
5.12 Logical Reduction of Candidate Spectra During Mixture Analysis (Venn Diagram)
5.13 Logical Reduction of Candidate Spectra During Mixture Analysis (Stepping Through M/Z Values)
5.14 MS Ether Mixture Resolution
5.15 MS/MS Pesticide Mixture Resolution
6.1 Structure Data Base Format
6.2 Master Header Record Format
6.3 Structure Header Record Format
6.4 Structure Storage Record Format
6.5 Structure Representing n-Butylbenzene
6.6 Structures Output from DRAWCZ. A) n-Butylbenzene, B) 1-(3-methyloxiranyl)-Ethanone, C) O,O'-(sulfinyldi-4,1-phenylene) O,O,O',O'-tetramethyl ester Phosphorothioic Acid, D) Oe-methyl-Pancracine

CHAPTER I

INTRODUCTION

Organic structure determination is one of the most sought-after goals of chemical analysis. Positive identification of a compound, whether in a clinical, academic, pharmaceutical, or industrial environment, requires the determination of its structure. Compounds with minor structural differences that are often difficult to analyze qualitatively may have greatly differing effects on biological systems. As the chemist seeks to identify complex compounds present in trace amounts, the number of possible interferences increases. Therefore, increasing the selectivity of modern instruments has become as important as improvements in sensitivity. Consequently, scientists are turning increasingly toward integrating techniques with complementary capabilities, such as liquid chromatography/mass spectrometry and mass spectrometry/mass spectrometry (MS/MS). Huge volumes of data are being produced by integrated technique instruments. The chemist has turned to the computer for help in storing and interpreting all this data.
The development of artificial intelligence guided instrumentation is a long range goal of the Enke research group. This has included development of an expert system to determine structures from low-resolution mass spectral data and an intelligent instrument control system for the triple quadrupole mass spectrometer. The integration of these two systems will allow data acquisition decisions to be made by the expert system and then be automatically performed on the MS/MS instrument. This thesis will focus on my part in the ongoing design and development of an expert system to perform structure determination using MS/MS data, and in particular, the development of spectral matching algorithms for MS and MS/MS spectra for the determination of molecular substructures.

Before describing the structure determination system, it is useful to digress to illustrate how and why the proposed method was developed, to examine previous work performed in this area, and to note how the current project compares with previous accomplishments. The discussion starts by defining artificial intelligence and expert systems, and proceeds to discuss applications of expert systems to structure elucidation.

Artificial Intelligence

Artificial intelligence (AI) has many definitions; one of the broadest and probably most accurate was stated by Patrick Winston: "Artificial intelligence is the study of ideas which enable computers to do the things that make people seem intelligent" (1). This definition seems all-encompassing since intelligence appears to be a conglomeration of many different information storage and processing abilities. The definition of artificial intelligence programs is more restrictive: "AI programs are those programs designed to emulate human performance in problem-solving activities through inductive reasoning and semantic information processing" (1). In general, those programs which use inductive reasoning are those most commonly thought of as AI programs.
Inductive reasoning is the process of reasoning from some observed cases to a universal conclusion regarding similar cases, some of which are unobserved.

The first conceptualization of AI as a usable tool was in the 1930s by E. Post, a logistics mathematician (2). He developed a set of production rules for manipulating groups of symbols. These rules served as a foundation for building several levels of rules into a knowledge base. In 1971, the first natural language interaction with a computer was demonstrated (2). In the early 1980s, Carnegie-Mellon University and Digital Equipment Corporation (DEC) collaborated on an AI project to configure computer systems (3,4). The development of AI concepts and tools for general use was slow to emerge until very recently, when the development of powerful, inexpensive hardware renewed interest in exploring what the computer can do for both the scientist and the businessman.

Artificial intelligence research and development can be divided into four areas:

1) Expert systems - The computer reasons from knowledge in a particular domain with expert ability.
2) Natural language processing - The computer interprets oral or written commands, acts upon them, and then reports the results to the operator.
3) Cognitive research - The exploration of the human mind by using a computer to emulate the thinking process.
4) Robotics - The development of computers with special appendages to perform hazardous or mundane tasks.

The application of artificial intelligence to automated structure elucidation involves the development of an expert system to help the scientist analyze acquired experimental data and to postulate plausible structures. For MS/MS in particular, the goal is to let the expert system identify substructures represented by MS/MS spectra and then to combine these substructural units to elucidate a molecular structure.
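
As a rough illustration of the production-rule idea mentioned above, the following Python sketch applies if-then rules to a set of known facts until nothing new can be derived. It is a minimal sketch and not the system described in this thesis: the function name `forward_chain`, the `RULES` list, and the example facts are all hypothetical and were invented only for this illustration.

```python
from typing import FrozenSet, List, Set, Tuple

# A rule is (premises, conclusion): if every premise is known, assert the conclusion.
Rule = Tuple[FrozenSet[str], str]


def forward_chain(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Fire rules whose premises are all known until no new fact appears."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical rules, invented for this example only.
RULES: List[Rule] = [
    (frozenset({"substructure A identified", "substructure B identified"}),
     "candidate structure can be generated"),
    (frozenset({"daughter spectrum matches reference for A"}),
     "substructure A identified"),
    (frozenset({"daughter spectrum matches reference for B"}),
     "substructure B identified"),
]

if __name__ == "__main__":
    observed = {
        "daughter spectrum matches reference for A",
        "daughter spectrum matches reference for B",
    }
    for fact in sorted(forward_chain(observed, RULES)):
        print(fact)
```

The loop repeats until a full pass over the rules adds nothing, so the order in which the rules are listed does not change the final set of facts.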
Expert Systems

An expert system is a program that uses resident knowledge to solve specific problems normally requiring human expertise (4-9). Although types of expert systems vary, each system is made up of three parts: a means of knowledge representation, an inference engine, and a user interface.

(1) Knowledge representation. A method must be available to either maintain a knowledge base or to deduce knowledge from acquired data. Those systems which maintain a knowledge base are termed knowledge-based systems (6). Knowledge is stored in them as heuristic rules, or rules-of-thumb, which are entered by an expert in the specific field. Those systems which deduce knowledge from experimental data are termed power-based systems (6). Instead of operating with an established set of rules, conclusions are derived solely using the data present. These systems are termed data-driven systems and are usually accompanied by large numeric data bases (8). The system we developed for identifying organic compounds using MS/MS data is best described as a data-driven, power-based expert system (although some heuristics are present). Large spectral and structure data bases are used to help elucidate the compound of interest.

(2) Inference engine. Each expert system must have a method for operating on the knowledge. In knowledge-based systems, an inference engine sorts through the available heuristic rules and applies those that are relevant to the situation. In power-based systems, the inference engine is a computer algorithm that uses numerical techniques to compare experimental and reference data and thereby reach a conclusion.

(3) User interface. Each expert system must identify the task to be solved. For specific problems this interface is very simple. As the range of problems increases, the user interface becomes more complicated. The interface reports the results of applying the system's expertise and demonstrates the validity of its conclusions by illustrating its reasoning.
This is extremely important not only in providing the operator with confidence in its conclusions, but also in determining the shortcomings of the system by pointing out flaws in the knowledge base or inference engine.

The most important characteristic of the expert system is the ability to comprehensively evaluate all possibilities for a given situation. In mathematical terms, the expert system systematically reduces the search space of possibilities until only one remains (5). It does this by subdividing the problem into many pieces and solving each piece by applying the appropriate algorithm. Solutions from subproblems are then combined to generate the final conclusion. Although the use of building blocks in solving complex problems is not new, it is well applied to expert systems (2). The inference engine methodically evaluates every possible situation. This feature alerts the operator to possibilities that he did not consider or have time to evaluate.

Those inference engines that use experimental data to reason toward a conclusion are data-driven systems. They use forward-chaining rules (5) to progress from the problem's beginning to the finish. There are also inference engines that reason backwards from the goals to the data. Engines that run in reverse are used in goal-driven systems. They use backward-chaining logic to progress through the knowledge base (5). Some expert systems use both forms of logic. The expert system being developed for structure determination uses MS/MS data and forward reasoning to reach its conclusions. The results from the expert system, however, are only as good as the information in the knowledge base. The saying "garbage in, garbage out" is nowhere more applicable than in describing the misuse of expert systems.

Expert systems model certain aspects of human behavior better than others (10). Reasoning from well established rules and from a large data base are the strong points of expert systems.
There are, however, several areas of weakness. An expert system cannot reason by analogy or from "first principles". In addition, an expert system has no common sense and therefore may produce ludicrous results. The advice and consultation of experts is needed when building a knowledge base, even for power-based systems. Most expert systems do not have certainty values associated with heuristic rules. While rules-of-thumb hold in the majority of cases, many exceptions cause conflicts in logic by the inference engine. Demon rules (1) and fuzzy logic (11) are two methods currently being explored to resolve these situations.

The Focus of Expert Systems

The growth of expert system development in society stems from human demands and from developments in technology (5). Human demands for expert systems are fostered by the scarcity of human expertise and its perishable nature. The slow and error-prone diffusion of knowledge among coworkers is unsatisfactory in many situations. Lastly, the automation of laboratories has generated volumes of data that need interpretation (12,13).

Recent developments in both computer hardware and software have helped spur development of expert systems. Inexpensive processors, memory, and disk drives have made AI techniques available to laboratories that could not afford a personal computer five years ago. Maturing of the AI field has produced new commercial software and hardware. Symbolic processors (Xerox, Symbolics), AI operating systems (PROLOG), knowledge environments (KEE, M1), and AI languages (LISP) are now available for the expert system developer (5). The application of expert systems will continue to increase as fifth generation computers, using large numbers of parallel processors, become available (14). In 1984 there were over 100 expert systems under development by over 2500 knowledge engineers and AI programmers. These figures are expected to double in 1985 and to increase exponentially over the next ten years (5).
Recent developments of expert systems have focused on interpretation of data from intelligent instruments and on the development of high-value specialized systems (5). Both of these areas represent limited knowledge domains, where a closed system of expertise can be developed. One example of a high-value expert system already developed is R1, a system developed by Carnegie-Mellon and DEC to configure VAX computer systems based on customer needs (3). DEC estimates that this system saves them 10 million dollars annually.

Several established companies currently involved in developing expert systems include: Digital Equipment Corporation, Ford Aerospace, General Motors, Hewlett-Packard, International Business Machines, National Aeronautics and Space Administration, Rand Corporation, and Xerox (5,15). Several new companies exclusively marketing expert systems have been born. These include: Intellicorp, IntelliGenetics, Corporation Systems, Teknowledge, and Visual Intelligence (5,8).

Applications of Expert Systems to Structure Elucidation

The development of computer methods for analyzing spectral data to elucidate structures has matured from the numerical reduction of data from a single instrument to expert systems capable of interpreting data from several different sources. Expert systems have commonly been applied to spectroscopic instruments including mass spectrometry (16-22), 1H NMR (16,22,23), 13C NMR (16,19,22-27), IR (19,23,26,28), and UV spectroscopy (29). Even x-ray powder diffraction and x-ray crystallography data have been incorporated into one structure elucidation system (16). All expert systems performing structure elucidation include one or more of the techniques described below. Although a variety of data sources have been used, the discussion in this work focuses on those systems using mass spectra.

(1) Numerical reduction of spectral data. This category includes spectral identification through library searching and data grouping.
The comparison of a sample spectrum to a stored library of spectra has been implemented for both low- and high-resolution mass spectra (17). Library search methods range from simple linear regression (18,30-35), to multiple regression techniques (36), and statistical methods (37). Identification of the sample spectrum often depends on the size and nature of the accompanying spectral library. Data grouping methods include factor analysis (38), pattern recognition (26,36,39,40), cluster analysis (41,42), and applied information theory (43,44). In many cases the complete molecular structure cannot be determined by numerical methods alone, and the identification of a substructure or functional group is considered satisfactory (45).

(2) Substructure determination. Often the entire molecular structure is not elucidated by using a single spectroscopic technique or numerical method. The operator may, however, be able to identify relevant substructures. Substructures are commonly determined by comparing the molecular structures of top matching compounds for structural similarities (42,46-53). Substructures representing the skeletal backbone of a molecule or its prominent functional groups can often be determined. Several structure determination methods involve the interpretation of mass spectra.

(3) Spectral interpretation. Applications of interpretive methods to analyze mass spectra have had mixed success. Techniques have included the use of high-resolution mass spectra (17), self-training algorithms (54-57), and learning machines (58-62). The Dendral project, which started in 1965 and concluded in 1980, was the most successful and ambitious project to date (63-65). It represented the first attempt to develop heuristic rules which represent chemical knowledge concerning structure stability and the fragmentation of molecules. It was also the first to use substructures as building blocks to elucidate structures.
Various substructures were determined through spectral interpretation and then presented to a structure generator (GENOA) to generate possible molecular structures.

(4) Structure generation. Once all relevant substructures of an unknown compound are determined, they must be combined to produce the molecular structure. This is a straightforward yet constrained process. The structure generator must contain heuristics to comprehensively generate all plausible molecules yet prevent a combinatorial explosion by eliminating structures that are chemically impossible. Several structure generators have been developed (26,66). The most celebrated is the GENOA program (63,64), which allows overlapping and alternative substructural fragments to be identified as substructural constraints.

(5) Assessment of determined structures. Once molecular structures have been postulated, they are validated by predicting the spectral properties they should exhibit. Structure checking programs generate theoretical spectra from heuristics for the candidate compound and compare them with the measured experimental spectrum (64). If the two spectra are not similar, the candidate structure may be suspect.

MS/MS in Automated Structure Elucidation

The application of conventional mass spectrometry to structure elucidation has several drawbacks. The mass spectrum represents a superposition of products resulting from all reaction pathways a fragmented molecular ion may follow. In addition, ion rearrangements and consecutive neutral losses complicate the mass spectrum, making interpretation difficult. The abundance of some spectral peaks is small due to multiple consecutive neutral losses, while others appear high if that peak represents products from several fragmentation pathways. When electron-impact (EI) sources are used to fragment large molecules, the molecular weight information disappears as the intensity of the molecular ion peak decreases.
Chemical-ionization (CI) sources provide molecular weight information but leave little structural information.

Mass spectrometry/mass spectrometry (MS/MS) instruments provide additional capabilities over conventional mass spectrometry for structure elucidation. By selecting the parent ion and fragmenting it, information concerning an isolated portion of the molecule is obtained. Hence the molecular substructure associated with the parent ion in the normal mass spectrum may be determined. In this manner, structures that differ by only one spectral feature may be identified. Secondary fragmentation conditions are carefully controlled so that only first-order fragmentations occur. Daughter peaks, therefore, result from single neutral losses representing simple bond cleavages of the parent ion. A complete description of MS/MS spectrometry and the triple quadrupole mass spectrometer is readily available (67).

The substructure-property relationship determined by MS/MS is unique in that unambiguous substructural information is determined (Figure 1.1). In contrast to the interpretation of a mass spectrum to elucidate a molecule's structure, MS/MS allows substructures of the molecule to be deduced from MS/MS spectra. The molecular structure may then be generated by using a structure generator. Structure elucidation using MS/MS becomes an empirical determination, as opposed to the interpretative methods used with conventional mass spectrometry.

[Figure 1.1]

The identification of the substructural components of a molecule eliminates the need for a large conventional mass spectra data base. To identify a conventional mass spectrum through spectral matching, the spectral library must contain the unknown compound (or a similar analog). The use of MS/MS data in spectral matching requires only a data base of spectra representing known substructures.
This data base is relatively small, while the number of compound spectra needed in conventional mass spectrometry is impossibly large.

Another capability of MS/MS is the ability to determine the molecular formula by fragmenting parent ions of isotopic species (68). This eliminates the need for high-resolution mass spectrometry and determines a piece of information essential to the structure generation process.

The use of MS/MS for structure elucidation was first proposed by Beynon in 1978 (68). The capabilities of MS/MS at that time (MIKES) were still immature. Poor resolution of daughter spectra (> 1 amu) and complex instrumentation hindered progress. In addition, an MS/MS instrument's ion path is complicated by several electrostatic lenses and varying longitudinal energies. Variation in peak intensities with instrumental parameters has traditionally made it difficult to establish standard MS/MS operating conditions and has slowed the development of MS/MS libraries (69).

Increases in resolution were made as improved mass filter hardware became available (70). The triple quadrupole mass spectrometer and FT-MS provide unit mass resolution or better. Recently developed hybrid instruments (EBEB, BEQQ, BEQ) provide high mass resolution for at least one mass filter and at least unit resolution for the other (71,72).

Technological advances in computer hardware have helped increase the reproducibility of MS/MS spectra. Those instrumental parameters along the ion path which affect the intensity of MS/MS peaks have been identified, and standard operating conditions determined. In addition, expert systems have been developed to control the instrument and to "tune" the ion path to obtain reproducible, library-quality spectra without human intervention (73,74).

The focus of this thesis is the design and development of several elements in an expert system for elucidating organic compounds using MS/MS data.
The system proposed herein employs numerical reduction, substructure determination, and structure generation techniques to optimize the elucidation process using MS/MS data. Since the daughter spectrum/substructure relationship is central to the structure determination process, the majority of this work describes the determination of substructures through spectral matching.

The overall structure determination scheme using MS/MS data is described in chapter 2. An MS/MS information management system to support this process is presented in chapter 3. The development of a spectral matching program to determine substructures from MS/MS data is presented in chapter 4. The effects of instrumental parameters on MS/MS spectra and the substructure determination process are described in chapter 5. Lastly, chapter 6 details the design and development of a data base to maintain structures and substructures, and chapter 7 suggests future work on the project.

References

1. Winston, P. B., Artificial Intelligence, Addison-Wesley Publishing Co., Don Mills, Ontario (1977).
2. Fleig, C. P., Hardcopy, August, p. 56 (1983).
3. McDermott, J., Proc. First Ann. Nat. Conf. Art. Intell., 269 (1980).
4. Duda, R. O., Shortliffe, E. R., Science, 220 (1983).
5. Hayes-Roth, F., Computer, October, p. 263 (1984).
6. Hayes-Roth, F., Computer, September, p. 11 (1984).
7. Stefik, M., Aikins, J., Balzer, R., Benoit, J., Birnbaum, L., Hayes-Roth, F., and Sacerdoti, E., Artificial Intelligence, 18, 135 (1982).
8. Gevarter, W. B., IEEE Spectrum, August, 39, 261 (1983).
9. Hayes-Roth, F., Waterman, D. A., Lenat, D. B., Building Expert Systems, Addison-Wesley Publishing Co., Don Mills, Ontario (1983).
10. Interview with Natalie Dehn, Personal Computing, June, p. 49 (1983).
11. Zadeh, L. A., Fuzzy Sets Syst., 11, 3 (1978).
12. Dagani, R., Chemical and Engineering News, August 12, 7 (1985).
13. Zurer, P. S., Chemical and Engineering News, August 19, 21 (1985).
14. Cohen, C., Electronics, July 28, 101 (1983).
15. Elmer-DeWitt, P., Time, September 2, 44 (1985).
16. Milne, G. W., Heller, S. R., Computer Assisted Structure Elucidation, ACS Symposium Series, 54, 26 (1977).
17. Hilmer, R. M., Taylor, J. W., Anal. Chem., 51, 1361 (1979).
18. Venkataraghavan, R., Dayringer, H. E., Pesyna, G. M., Atwater, B. L., Mun, I. K., Cone, M. M., McLafferty, F. W., Computer Assisted Structure Elucidation, ACS Symposium Series, 54, 1 (1977).
19. Heller, S. R., J. Chem. Inf. Comput. Sci., 25, 224 (1985).
20. McLafferty, F. W., Stauffer, D. B., J. Chem. Inf. Comput. Sci., 25, 245 (1985).
21. Sasaki, S., Kudo, Y., J. Chem. Inf. Comput. Sci., 25, 252 (1985).
22. Zupan, J., Denca, M., Hadzi, D., Marsel, J., Anal. Chem., 49, 2141 (1977).
23. Yamasaki, T., Abe, R., Kudo, Y., Sasaki, S., Computer Assisted Structure Elucidation, ACS Symposium Series, 54, 108 (1977).
24. Suprenant, H. L., Reilley, C. N., Computer Assisted Structure Elucidation, ACS Sym. Series, 54, 77 (1977).
25. Swenzer, G. M., Mitchell, T. M., Computer Assisted Structure Elucidation, ACS Sym. Series, 54, 58 (1977).
26. Shelley, C. A., Woodruff, H. B., Snelling, C. R., Munk, M. E., Computer Assisted Structure Elucidation, ACS Sym. Series, 54, 92 (1977).
27. Abe, H., Yamasaki, T., Fujiwara, I., Sasaki, S., Anal. Chim. Acta, 133, 499 (1981).
28. Lowery, S. R., Huppler, D. A., Anderson, C. R., J. Chem. Inf. Comput. Sci., 25, 235 (1985).
29. Sasaki, S., Kudo, Y., J. Chem. Inf. Comput. Sci., 25, 252 (1985).
30. Heller, S. R., Anal. Chem., 44, 1951 (1972).
31. Damen, H., Henneberg, D., Wiemmann, B., Anal. Chim. Acta, 103, 289 (1978).
32. Lebedev, K. S., Tormyshev, V. M., Derendyaev, B. G., Koptyug, V. A., Anal. Chim. Acta, 133, 517 (1981).
33. Lefkovitz, D., J. Chem. Inf. Comput. Sci., 15, 14 (1975).
34. Pesyna, G. M., McLafferty, F. W., in Determination of Organic Structures by Physical Methods, 6, 91 (1976).
35. Knock, B. A., Smith, I. C., Wright, D. E., Ridley, R. G., Kelly, W., Anal. Chem., 42, 1516 (1970).
36. Clark, H. A., Jurs, P. C., Anal. Chim. Acta, 132, 75 (1981).
37. McLafferty, F. W., Stauffer, D. B., Int. J. of Mass Spectrom. Ion Phys., 58, 139 (1984).
38. Malinowski, E. R., Howery, D. G., Factor Analysis in Chemistry, Wiley and Sons, New York, NY, 165 (1980).
39. McGill, J. R., and Kolwalski, B. R., J. Chem. Inf. Comput. Sci., 18, 52 (1978).
40. Isenhour, T. L., Kolwalski, B. R., Jurs, P. C., CRC Crit. Rev. Anal. Chem., 3, 1 (1974).
41. Massart, D. L., Kaufman, L., The Interpretation of Analytical Chemistry by the use of Cluster Analysis, Vol 65, Anal. Chem. and Its Applications, Wiley and Sons, New York, NY (1983).
42. Willet, P., J. Chem. Inf. Comput. Sci., 24, 29 (1984).
43. Fetteralf, D. D., Yost, R. A., Int. J. Mass Spectrom. Ion Proc., 62, 33 (1984).
44. deHaseth, J. A., Insenhour, T. L., Computer Assisted Structure Elucidation, ACS Sym. Series, 54, 46 (1977).
45. Rasmussen, G. T., Isenhour, T. L., J. Chem. Inf. Comput. Sci., 19, 179 (1979).
46. Dromey, R. G., J. Chem. Inf. Comput. Sci., 18, 222 (1978).
47. Adamsom, G. W., Cowell, J., Lynch, M. F., J. Chem. Doc., 13, 153 (1972).
48. Bawden, D., J. Chem. Inf. Comput. Sci., 23, 14 (1983).
49. Varkony, T. R., Siloach, Y., Smith, D. M., J. Chem. Inf. Comput. Sci., 19, 104 (1979).
50. Cone, M. M., Venkataraghavan, R., McLafferty, F. W., J. Am. Chem. Soc., 99, 7688 (1977).
51. Willet, P., J. Chem. Inf. Comput. Sci., 25, 114 (1985).
52. Synge, R. L. M., J. Chem. Inf. Comput. Sci., 25, 50 (1985).
53. Kudo, Y., Chihara, H., J. Chem. Inf. Comput. Sci., 23, 109 (1983).
54. Dayringer, H. E., McLafferty, F. W., Venkataraghavan, R., Org. Mass Spectrom., 11, 895 (1976).
55. Haraki, K. S., Venkataraghavan, R., McLafferty, F. W., Anal. Chem., 53, 386 (1981).
56. Kwok, K., Venkataraghavan, R., McLafferty, F. W., J. Am. Chem. Soc., 95, 4185 (1973).
57. Hippe, Z., J. Chem. Inf. Comput. Sci., 25, 344 (1985).
58. Bender, C. F., Shepherd, H. D., Kolwalski, B. R., Anal. Chem., 45, 617 (1973).
59. Isenhour, T. L., Jurs, P. C., Anal. Chem., 43, 20A (1971).
60. Jurs, P. C., Anal. Chem., 43, 1812 (1971).
61. Smith, D. M., Anal. Chem., 44, 536 (1972).
62. Martinson, D. P., Applied Spectroscopy, 35, 255 (1981).
63. Smith, D. E., Anal. Chim. Acta, 133, 471 (1981).
64. Carhart, R. E., Smith, D. R., Gray, N. B., Nourse, J. G., Djerassi, C., J. Org. Chem., 1708 (1981).
65. Lindsay, R. K., Buchanan, B. G., Feigenbaum, E. A., Lederberg, J., Applications of Artificial Intelligence to Organic Chemistry: The Dendral Project, McGraw Hill, New York, N. Y. (1980).
66. Carhart, R. E., Varkony, T. H., Smith, D. H., Computer Assisted Structure Elucidation, ACS Sym. Series, 54, 126 (1977).
67. Yost, R. A., Enke, C. G., J. Am. Chem. Soc., 100, 2274 (1978).
68. Borogzadeh, M. R., Morgan, R. P., Beynon, J. H., Analyst, 103, 613 (1978).
69. Dawson, P. R., Sun, W. F., Int. J. Mass Spectrom. Ion Proc., 55, 155 (1983).
70. Cook, R. G., Bush, K. L., Glish, G. L., Science, 222, 273 (1983).
71. McLafferty, F. W., Accounts of Chem. Res., 13, 33 (1980).
72. Chapman, J. R., J. Phys. Ed., 365 (1978).
73. Wong, C. M., Lanning, S., Energy and Technology Review, Lawrence Livermore National Laboratory, February, p. 8 (1984).
74. Wong, C. M., Crawford, R. W., Barton, V. C., Brand, H. R., Neufield, K. W., Bowman, J. E., Rev. Sci. Inst., 54, 996 (1983).

CHAPTER II

AN AUTOMATED STRUCTURE ELUCIDATION SYSTEM FOR MS/MS DATA*

Abstract

An automated system was developed to evaluate the structural information contained in mass spectrometry/mass spectrometry (MS/MS) spectra. The system employs several software tools to assist in the determination of unknown organic structures. These include tools to: 1) match conventional and MS/MS spectra, 2) assist in the determination of correlations between spectral characteristics and substructures, and 3) assist in the determination of structures from identified substructures.
Correlations of MS/MS spectra with substructures are determined by matching MS/MS spectra with common mass parent ions and identifying substructures leading to ions with common spectral characteristics. Identified substructures are then combined using a constrained structure generator to postulate molecular structures. The scheme is totally empirical and does not assume that structural integrity is maintained in the ionization or fragmentation process; it does not require the ion structures to be identified.

*Note: This chapter is adapted from a preliminary draft of a manuscript written by the author of this thesis, to be published in the ACS Symposium Series entitled "Artificial Intelligence Applications in Chemistry", with P. T. Palmer, C. F. Beckner, A. B. Giordani, H. R. Gregg, P. A. Hoffman, and C. G. Enke as coauthors.

Introduction

The development of mass spectrometry/mass spectrometry (MS/MS) has given the analyst a powerful tool for structure elucidation. The primary goal of this project has been to further develop triple quadrupole mass spectrometry (TQMS) as a tool for structure determination by developing a software system to organize MS/MS data, to aid in the discovery of MS/MS spectrum/substructure relationships, and to aid in the determination of a compound's structure from identified substructures.

The information and data presented in this chapter represent the culmination of several years of work by several different individuals. The structure elucidation project goals and methods were conceived by Chris Enke and Anne Giordani, and expounded upon by other project members. The multi-dimensional data base was designed and developed by Hugh Gregg (see Figure 2.1). The reference spectrum data base was designed and developed by Phil Hoffman. The data format for storing structural information was developed by Carl Beckner. Pete Palmer has been active in acquiring data for an MS/MS spectra data base.
The phthalate data presented in this chapter were contributed by him. Kevin Hart has been active in implementing the molecular structure generator. My contributions include the design and development of MS/MS spectral matching routines and the structure/substructure data base. In addition, I have combined the developed software tools into a cohesive, interactive information management system capable of aiding in the structure determination process.

[Figure 2.1. Software Tools for Structure Determination by MS/MS]

The use of two-dimensional analytical instruments allows the experimenter the unique opportunity to categorize molecular substructures by their physical properties. In the case of MS/MS, the mass spectrum of a chosen parent ion, called a daughter spectrum, acts as a fingerprint in identifying a particular substructure (or substructures). MS/MS data are very clear: 1) daughter spectra reveal structural characteristics of isolated sections of the molecule, and 2) all masses in a daughter spectrum are simple cleavages or rearrangements from the parent ion. Hence MS/MS provides clear substructure-property relationships.

The philosophy of the structure elucidation process is to employ an MS/MS instrument along with computer automated spectral interpretation to exploit this substructure-property relationship.
Data from this instrument were used in two different ways: 1) to develop substructure/spectrum correlations from the spectra of known compounds, and 2) to use the developed correlations to determine the substructures and overall structure of unknown compounds. We are in effect substituting a reference library of substructures for a library of spectra of known compounds. If successful, this approach should allow determination of unknown compounds not previously studied by mass spectrometry.

The development of a reference library of MS/MS spectra and the substructures associated with each spectrum is a difficult preliminary step to the determination of an unknown compound. The instrumental operating conditions must be carefully controlled to obtain substructures representative of all data in the daughter spectrum. In addition, all numerical and structural information for each compound must be correlated and stored in a data base where it can be easily and quickly retrieved for spectral matching.

In automating the structure elucidation process, several software tools were developed and integrated into a comprehensive system to acquire, store, match, and correlate the MS/MS data (Figure 2.1). The experimental data are placed into a user's data base, termed a multi-dimensional data base (MDDB). This data base was designed and developed by Hugh Gregg (1). The MDDB is local to each experiment and allows data inspection and massaging before inclusion into an MS/MS reference library. Once MS/MS data are introduced into the information management system, they may be plotted, compared with existing data, or stored in a reference data base for later referral.

Substructure determination is initiated by the spectral matching program, which compares the experimental daughter spectra against those in a reference library and retrieves similar MS/MS reference spectra from the reference data base (2,3).
By comparing an experimental daughter spectrum against a library of daughter spectra from the same mass parent, the unknown daughter spectrum is identified. The best matching reference daughter spectra are used to extract corresponding molecular structures or substructures from a structure-related data base (4). By acquiring a daughter spectrum for every significant ion in the conventional mass spectrum, the user can identify many of the substructures making up the complete unknown molecule. The process yields redundant and overlapping substructures while identifying the major substructures.

If relevant substructural data representing the unknown spectrum are not available, the complete molecular structures of the top matching daughter spectra are compared for the largest common substructure. This substructure matching process is currently performed manually using a program for plotting both molecular structures and substructures (5).

Neutral fragments (lost when the major substructures are formed) are identified by acquiring parent spectra of the highest m/z parent ion representing each identified substructure. These fragments are identified by matching daughter spectra (whose parent mass corresponds to the mass of the neutral lost) against a library of daughter spectra. This method ensures acquisition of substructural information from all parts of the unknown compound that can be represented by MS/MS spectra.

When all substructural information has been determined, the overlapping substructures are transferred to a constrained molecular structure generator called GENOA (6). GENOA postulates the number and identity of all possible, yet plausible, molecular structures of the unknown compound. If the number of possible molecular structures is too large, additional substructural information must be provided to limit the number of structural possibilities.
This step may require acquiring additional MS/MS data or adding other spectral or non-spectral information. This scheme ensures completeness in evaluating the structural possibilities of an unknown compound. The experimenter's chemical knowledge and intuition, however, remain central to the elucidation process and crucial to its success. An example demonstrating the elucidation process will be presented later.

Development of an MS/MS Data Base

The concern over instrumental operating conditions stems from the need to reliably identify daughter spectra of molecular substructures and to distinguish them from all other daughter spectra. The latitude in the operating conditions meeting these criteria was tested by exhaustively collecting data under different experimental operating conditions for certain classes of compounds (see chapter 5).

The instrumental parameters experimentally determined as crucial in obtaining reproducible MS/MS spectra are collision energy and collision cell pressure. For this application it was critical that daughter spectra correspond only to substructures represented by single collisions. Therefore, the collision cell pressure used for acquiring reference spectra must ensure first-order fragmentations of the compounds. Brief kinetic studies were carried out for each class of compounds to determine the acceptable range of collision cell pressures for acquiring reference daughter spectra. The optimum collision energy differs for each parent, and also for each daughter of that parent. The collision energy setting used was based on the parent ion m/z value.

The procedure for acquiring MS/MS reference spectra parallels that for identifying an unknown compound. A daughter spectrum is obtained at every significant m/z value in the conventional spectrum and matched against a library of reference daughter spectra.
The substructures associated with the top matching spectra are examined to determine their relevance to the known compound. Acquired daughter spectra having significant associated substructures are stored in the spectrum data base and linked to their respective substructures. Substructures not identified through spectra matching are identified by comparing the complete molecular structures corresponding to top matching spectra for the largest common substructural fragment. If the fragment represents a significant substructure in the known molecule, it is stored in the structure data base and the associated daughter spectrum is stored in the spectrum data base. Only daughter spectra representing significant substructures are saved for later referral.

MS/MS Spectra Data Base Format

There are two data bases present in our MS/MS information management system. One data base manages the MS/MS spectra, while the other manages the structures and substructures. The two data bases are logically linked together so that all information concerning a particular molecule or substructure is associated.

The MS/MS reference spectrum data base is capable of storing and correlating all types of MS/MS spectral data including parent, daughter, neutral-loss, and conventional mass spectra. All spectra for each compound are logically associated with that compound. This data base was designed and developed by Phil Hoffman (2). Redundant spectra, such as those taken under different operating conditions, are all associated with a single compound registry number, thereby simplifying both the retrieval and maintenance of the data base information.

The most important design feature of the reference spectrum data base is the provision to generate and store inverted data. The data in the spectrum data base may be inverted upon a specified characteristic, such as m/z value, and then be retrieved using that characteristic.
For instance, a data file inverted about the daughter m/z value will contain, for each m/z value, a list of reference daughter spectra. Hence all reference daughter spectra containing m/z 43.0 may be retrieved. When Boolean algebra operations are performed on inverted data lists, the power of the design becomes apparent. When all reference daughter spectra containing peaks at 43.0 and 57.0 but not 119.0 are retrieved, the list of suitable reference spectra rapidly shrinks to a manageable size. In addition to a daughter m/z value, spectral data may be inverted about molecular weight, empirical formula, and parent ion m/z value. This feature allows the matching program to prefilter candidates before retrieving candidate spectra, thereby significantly reducing the overall spectral matching time. Over 30,000 conventional mass spectra are currently stored in the spectrum data base, as well as MS/MS spectra corresponding to specific classes of compounds.

Structure/Substructure Data Base Format

The structure data base was designed to contain both molecular structures and substructures (4). The MS/MS instrument specifically provides a substructure-property relationship where many daughter spectra may correspond to a single substructure. Hence, unlike many existing structure data bases, there is no logical link between the molecular structure and its associated substructures. There is, however, a logical link between the MS/MS spectra in the spectrum data base and the respective molecular structure and substructures in the structure data base. This link allows retrieval of structural information from the reference daughter spectra best matching the unknown spectrum.

Structures present in the structure data base may be retrieved and drawn via substructure number, Chemical Abstracts Service number, or spectrum data base number. The structures and substructures are stored unambiguously using the Morgan algorithm for encoding molecular structures via connectivity tables.
The version of the algorithm implemented was that described by Wipke and Dyott (7) and includes representation for stereochemical isomers. The notation of the elements was expanded from the organic elements included in the original version to include all known elements. This notation was developed by Carl Beckner. Any molecule up to 128 atoms in size (excluding hydrogens) may be included in the data base. The structure data base contains over 30,000 structures corresponding to the spectra in the MS/MS reference library, as well as substructures corresponding to various reference daughter spectra.

Matching MS/MS Spectra

The MS/MS spectra matching program allows the chemist to match any MS/MS spectrum against either MS or MS/MS spectra in the reference spectrum data base (3). The program uses inverted data organized by m/z value to eliminate inappropriate reference spectra. The program determines the data base frequency of each significant peak in the experimental daughter spectrum and ranks the peaks in ascending order of frequency. Inverted lists of reference spectra containing each spectral peak are then retrieved and logically ANDed together to reduce the number of candidate reference spectra.

Additional prefiltering of candidate spectra using molecular weight, parent ion m/z value, and empirical formula may be invoked to further reduce the number of candidate spectra. When matching daughter spectra, the parent ion m/z value usually serves as an adequate prefilter. The exception is the case where no similar daughter spectra of that parent ion are available. In this case, daughter spectra of higher parent ion m/z values may help deduce any substructures present. Until intensity-based matching is performed, the reference data base is not accessed and abundance values are not considered. This design considerably reduces the overall matching time and makes it practical to work with unabridged spectra.
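The prefiltering step described above can be illustrated with a minimal sketch. This is not the program developed in this work; it only assumes that an inverted list is a set of reference spectrum identifiers keyed by integer m/z value. The names build_inverted_index and prefilter, the max_candidates cutoff, the early-stopping rules, and the toy library are all hypothetical and are introduced for illustration only.

```python
from collections import defaultdict


def build_inverted_index(library):
    """Invert a {spectrum_id: {mz: intensity}} library about m/z value.

    Returns {mz: set of spectrum ids with a peak at that m/z}.
    """
    index = defaultdict(set)
    for spec_id, peaks in library.items():
        for mz in peaks:
            index[mz].add(spec_id)
    return index


def prefilter(index, sample_peaks, max_candidates=100):
    """AND together inverted lists, rarest sample peak first.

    Assumptions (not taken from the original program): peaks absent from
    the data base are skipped, ANDing stops once the candidate list has
    at most max_candidates entries, and an AND that would empty the list
    is not applied.
    """
    present = [mz for mz in sample_peaks if index.get(mz)]
    # Rank peaks in ascending order of data base frequency.
    ranked = sorted(present, key=lambda mz: (len(index[mz]), mz))
    candidates = None
    for mz in ranked:
        narrowed = index[mz] if candidates is None else candidates & index[mz]
        if not narrowed:
            break
        candidates = set(narrowed)
        if len(candidates) <= max_candidates:
            break
    return candidates if candidates is not None else set()


# Hypothetical toy library: only m/z values matter at this stage,
# since abundances are not considered until intensity-based matching.
library = {
    "ref_A": {43: 16, 57: 79, 71: 100},
    "ref_B": {43: 40, 57: 100, 119: 30},
    "ref_C": {43: 100, 77: 20},
    "ref_D": {65: 4, 93: 11, 121: 9, 149: 100},
}
index = build_inverted_index(library)

# Boolean retrieval: peaks at 43 and 57 but not 119.
boolean_hits = (index[43] & index[57]) - index[119]

# Frequency-ranked prefilter for a sample with peaks at 43, 57, and 71.
candidates = prefilter(index, [43, 57, 71], max_candidates=1)
```

Because only set operations on the inverted lists are involved, no reference spectrum needs to be read until the candidate list is short, which is the property the design relies on.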
Once the number of candidate reference spectra has been reduced to reasonable size (25-100), intensity-based matching is performed to characterize the correspondence between the experimental and remaining candidate spectra. Several match factors describing the quality of the match are used to quantitatively characterize the match and to infer whether any substructures representing those in the experimental spectrum are present. The various match factors returned to the user are listed in Table 2.1.

The overall match factor (PT) is a combination of forward and reverse searching techniques. It takes into account the deviations in intensity of the sample spectrum peaks with respect to the candidate spectrum peaks and vice versa for all peaks in both spectra.

The pattern correspondence match factor (PC) is a forward searching match factor which takes into account the intensity deviations of sample spectrum peaks with respect to the candidate spectrum peaks for peaks common to both spectra. This factor detects structural similarities, such as substructures, based on common spectral patterns.

Table 2.1. Match Factor Definitions

PT    An overall match factor that indicates how well the intensities of all the peaks in the two spectra match.
      $PT = 100 \times \dfrac{\sum Y_s + \sum Y_r - 2 \sum |Y_r - Y_s|}{\sum Y_s + \sum Y_r}$
      where $Y_i = \log_2(\mathrm{Intensity}/\mathrm{Total\ Ion\ Current}) + \mathrm{SENS}$
PC    A pattern correspondence factor that indicates how well the intensities of the peaks in common match.
      $PC = 100 \times \dfrac{\sum Y_s - \sum |Y_r - Y_s|}{\sum Y_s}$
NC    The number of peaks common to both the candidate and unknown sample spectrum.
NS    The number of peaks remaining unmatched in the unknown sample spectrum.
NR    The number of peaks remaining unmatched in the reference spectrum.
IS    The percent total ion current of the sample spectrum that was unmatched in the comparison due to NS.
IR    The percent total ion current of the reference spectrum that was unmatched in the comparison due to NR.
NP    The number of peaks in common between the two spectra that are used in parabolic fitting of the quotient spectrum.
Chi2  The reduced chi-square value obtained from parabolic fitting of the quotient spectrum.

Substructure Identification

After the spectral matching process has been completed, substructures associated with the top matching daughter spectra are identified and retrieved. If relevant substructures are unavailable, the molecular structures of the top matching candidates are drawn and compared for common substructures. A heuristic program written by Dr. Craig Shelley (5) has been adapted for our computer system to display molecular structures and substructures from connectivity tables. Since molecular structures and substructures are stored in a unique form, the structure drawings facilitate visual comparison for commonalities. Although the process of substructure determination is currently performed manually, it will soon be automated using an atom-by-atom substructural search program.

Generation of Molecular Structures

The GENOA program is a constrained molecular structure generator resulting from the Stanford Dendral project (6) and is marketed by Molecular Design Ltd. (8). This program generates molecular structures using the overlapping substructural information obtained from the daughter spectrum/substructure relationship and the empirical formula of the compound. Additional spectral and non-spectral information from other sources may also be included. Heuristic rules determine whether a particular generated structure is chemically plausible, and whether or not it is retained.

The advantage of the GENOA program is its ability to exhaustively produce all the plausible compounds given the generation constraints. This capability eliminates the possibility that the chemist might overlook any chemically possible compounds. In many cases the number and types of different structures produced will suggest missing pieces of structural data.
An essential piece of information required by GENOA is the empirical formula of the unknown compound. M+1 daughter spectrum data have been used instead of the high—resolution molecular—ion mass spectrum to assist in determining the empirical formula. The daughter spectrum of the M+1 isotope ion contains peak pairs at adjacent masses representing the 013 isotope mixture of the M+1 isotope fragment ion. The relative peak areas of these daughter pairs depends on the ratio of carbon atoms lost to carbon atoms retained by the M+1 ion. Hence the peak area ratios determine the number of carbon atoms present in the 36 compound. An existing program then calculates all possible empirical formulas from the molecular weight and number of carbons. The resulting reasonable empirical formulas are given to GENOA. This method was developed by Pete Palmer. An Example: The Elucidation of Di-n—octylphthalate To demonstrate the automated structure determination process, di-n-octylphthalate MS/MS spectra were acquired and treated as unknown spectra (Figure 2.2). Di-n-octylphthalate daughter spectra of m/z 149* and m/z 105+ served to identify the phthalate substructure while daughter spectra of m/z 113+ served to identify the alkyl groups. The results are presented in Table 2.2. The m/z 149+ daughter spectrum of di-n-octylphthalate was matched against a reference library of similar m/z 149+ daughter spectra (Table 2.3). The top three matching spectra all correspond to the same molecular substructure, namely the phthalate substructure (Structure #1 in Figure 2.3). RELATIVE ABUNDANCE 100 80 60 40 20 37 Dl-N-OCTYLPHTHALATE MASS SPECTRUM in O 50 100 150 200 250 300 -M/Z Figure 2.2 Di-n—octylphtholote Moss Spectrum 350 '1 ..ll 1 l L llllllIllllllllllllllllllllll Illllll‘lllllllllllIlllllllllllllllllllllllll'lllll 400 38 Table 2.2. 
Daughter Spectra of Di-n-octylphthalate

Parent Ion   Daughter Ion/Relative Abundance
149          65/4, 93/11, 121/9, 149/100
113          43/16, 57/79, 71/100, 84/17, 113/46
105          77/62, 105/100

Table 2.3. Match of 149+ Di-n-octylphthalate Daughter Spectrum

PT    PC    NC   NS   NR   IS   IR   Reg #   Name
100   100   6    0    0    0    0    15      Di-n-octylphthalate
96    100   4    2    0    3    0    11      Dibutylphthalate
91    99    5    1    0    2    0    13      Dipentylphthalate
72    95    3    3    7    5    2    7       2-t-butyl-4-methylphenol
67    92    1    5    1    8    29   9       p-t-amylphenol
66    96    4    2    9    3    14   5       p-t-butylbenzyl alcohol
50    88    1    5    3    8    40   3       Benzyl-t-butanol
35    75    3    3    10   1    26   1       2-t-butyl-6-methylphenol

Figure 2.3 Di-n-octylphthalate. I) Molecular Substructure. II) 149+ Ion Structure. III) 105+ Ion Structure. IV) Molecular Structure.

The ion structure represented by these daughter spectra (Structure #2 in Figure 2.3) is not identified during the elucidation process. Instead, the molecular substructure is associated with the daughter spectra and stored in the structure data base. The elucidation process is totally empirical and does not assume that the structural integrity of an ion is maintained in the ionization or fragmentation process. As a result, the ion structures need not be identified.

The compounds yielding the top three daughter spectra are di-n-octylphthalate, dibutylphthalate, and dipentylphthalate. Although the daughter spectra represent the same substructure, they are not identical. Different NS values among the three candidates indicate that the three spectra contain different spectral peaks. It is important that the spectral matching program properly group these spectra together and ensure a substantial difference between the overall match factors of these spectra and those corresponding to unrelated substructures. The difference between the overall match factors of the phthalate daughter spectra and that of the best matching daughter spectrum corresponding to a different substructure is 19 (91 versus 72 in Table 2.3).
Since the overall match factor range is 0-100 and the spread among similar daughter spectra is 9, a value of 19 represents good separation. This best non-phthalate daughter spectrum corresponds to a substructure of 2-t-butyl-4-methylphenol. The need for multiple reference daughter spectra representing one substructure still remains, since daughter spectra vary for different compounds and under different conditions.

The m/z 105+ daughter spectrum of di-n-octylphthalate was matched against a reference library of similar m/z 105+ daughter spectra; the results are presented in Table 2.4. Again, the top three matching spectra all correspond to the same phthalate substructure (Structure #1 in Figure 2.3). In this case the daughter spectra are highly similar: all three contain the same spectral peaks, and only the intensity patterns differ (NR, NS, IS, and IR for the three are all zero). The better clustering is probably due to the greater stability of the ion structure yielding these daughter spectra (Structure #3 in Figure 2.3). Note again the large difference in overall match factor values (25) between daughter spectra representing the correct substructure and that of the next best match.

Table 2.4. Match of 105+ Di-n-octylphthalate Daughter Spectrum

PT    PC    NC   NS   NR   IS   IR   Reg #   Name
100   100   2    0    0    0    0    16      Di-n-octylphthalate
94    94    2    0    0    0    0    14      Dipentylphthalate
91    91    2    0    0    0    0    12      Dibutylphthalate
66    89    2    0    2    0    31   10      p-t-amylphenol
60    80    2    0    2    0    20   8       2-t-butyl-4-methylphenol
49    54    1    1    4    30   20   4       Benzyl-t-butanol
47    56    1    1    3    30   29   6       p-t-butylbenzyl alcohol
37    51    1    1    3    30   52   2       2-t-butyl-6-methylphenol

The parent spectrum of m/z 149+ was used to determine the alkyl groups attached to the phthalate substructure (Figure 2.4). The largest ion (149+) associated with the phthalate substructure was used since it will yield neutrals corresponding to the groups attached to the complete substructure and not include pieces of the identified substructure.
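The neutral losses read from a parent spectrum are simply the differences between each parent mass and the selected daughter mass. The following is a minimal Python sketch of that arithmetic, not part of the original software. The function name neutral_losses is hypothetical, and the peak list uses the four parent masses that appear in the subtractions quoted for the m/z 149+ parent spectrum (167, 262, 279, and 390).

```python
def neutral_losses(parent_mz, daughter_mz):
    """Return {parent m/z: neutral mass lost} for the parents of one daughter ion."""
    # Only parents heavier than the daughter can produce it by losing a neutral.
    return {p: p - daughter_mz for p in parent_mz if p > daughter_mz}


# Major (non-isotopic) peaks of the m/z 149+ parent spectrum, as used in the text.
losses = neutral_losses([167, 262, 279, 390], 149)
for parent, loss in sorted(losses.items()):
    print(f"{parent} -> 149: neutral loss of {loss}")
```

Assigning a composition to each loss (water, an alkyl radical, and so on) remains a chemical judgment; the sketch only tabulates the mass differences.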
The parent spectrum has 4 major (non-isotopic) peaks at m/z 167, 262, 279, and 390. The neutral corresponding to a loss of 18 (167-149) is water. The neutral corresponding to the loss of 130 (279-149) is C8H17OH, which may represent an alkyl group. The neutral corresponding to the loss of 113 (262-149) is the C8H17 radical, which confirms the presence of a C8H17 alkyl group. The neutral corresponding to the loss of 241 (390-149) is the C8H17OC8H17 radical, a rearrangement product. To determine the branching of the C8H17 group, the daughter spectrum of m/z 113+ was matched against a library of daughter spectra. The alkyl group was found to be unbranched. Hence the alkyl groups n-C8H17 and n-C8H17OH were used in conjunction with the phthalate substructure for generating possible molecular structures.

The last piece of information required is the empirical formula. To determine the empirical formula, the daughter spectrum of the M+H+1 peak in the CI di-n-octylphthalate spectrum (m/z 392) was obtained (Figure 2.5). The relative peak areas of the adjacent peak pair at m/z 149 and 150 are 2:1, indicating that the M+1 ion is twice as likely to lose a 13C atom as to retain it. The ratio of the number of carbon atoms lost to those retained is, therefore, 2:1. Since the identified phthalate substructure contains 8 carbons, the unknown compound (di-n-octylphthalate) must contain 24 carbon atoms and the empirical formula must be C24H38O4.

Figure 2.4 149+ Di-n-octylphthalate Parent Spectrum (log abundance vs. m/z)

Figure 2.5 149+ Di-n-octylphthalate M+1 Spectra. A) Di-n-octylphthalate M+1 Daughter Spectrum. B) Daughter Spectrum Expansion (m/z 148-151). C) Daughter Spectrum Expansion (m/z 260-263).

Armed with the phthalate substructure, the two alkyl substructures, and the empirical formula, we are now ready to generate all plausible molecular structures. The oxygen in the C8H17OH group is allowed to overlap with either terminal phthalate oxygen. With this information, GENOA constructs only one molecular structure (Structure #4 in Figure 2.3), and it is the correct structure.

The number of generated structures depends on the completeness of the information provided. If the branching of the alkyl group had not been specified, 89 different structures would have been generated. The identities of these generated structures, however, would provide clues as to the needed information. In cases where MS/MS information cannot determine a unique result, additional spectral and non-spectral information may be given to GENOA as structural constraints.

Conclusions

The automated structure determination system, a set of software tools for aiding in the elucidation of organic structures from MS/MS data, is now at a stage where the chemist can actively apply it to real elucidation problems. Nearly all of the software tools have been developed and integrated into a comprehensive, interactive system. The system has been successfully used to develop daughter spectra/substructure correlations and to extend MS/MS data bases.
Preliminary results from applying the system to structure determination problems have been very encouraging. An expert system is currently being implemented to oversee the entire structure determination process by examining results from the software tools and suggesting further experimentation.

References

1. Gregg, H. R., Hoffman, P. A., Enke, C. G., Crawford, R. W., Brand, H. R., Wong, C. M., Anal. Chem., 56, 1121 (1984).
2. Hoffman, P. A., Enke, C. G., presented at 31st Annual Conference on Mass Spectrometry and Allied Topics, Boston, MA (1983); bound p. 556.
3. Cross, K. P., Enke, C. G., Computers and Chemistry, in press.
4. Cross, K. P., Beckner, C. F., Enke, C. G., in preparation.
5. Shelley, C. A., J. Chem. Inf. Comput. Sci., 23, 61 (1978).
6. Carhart, R. E., Smith, D. B., Gray, N. B., Nourse, J. G., Djerassi, C., J. Org. Chem., 46, 1708 (1981).
7. Wipke, T. W., Dyott, T. M., J. Am. Chem. Soc., 96, 4834 (1974).
8. Molecular Design Ltd., 11223 Street, Hayward, CA 94541.

CHAPTER III

The Development of a Mass Spectra/Mass Spectra Information Management System

Introduction

To determine unknown molecular structures using MS/MS data, a system was needed to handle the additional dimension of data provided by MS/MS instruments. The large amount of data generated using computer-controlled instrumentation has inspired the development of commercial information management systems for organizing and storing relevant information. Many spectroscopic instruments commonly contain data acquisition and reduction systems. Laboratory information management systems (LIMS) capable of integrating data acquired from several different laboratory instruments have also become quite popular (1). Since our research group has access to three MS/MS instruments, an information management system was needed to acquire, view, and compare MS/MS data obtained from different instruments and different laboratories.
The MS/MS information management system was developed to move information onto a central minicomputer and to transform spectral data into a common format. Various software tools allow the user to massage experimental data, to view MS/MS data in different two-dimensional planes, to store MS/MS spectra in a reference data base, and to match MS/MS data against a reference library (Figure 3.1).

Figure 3.1 MS/MS Information Management System

Figure 5.4 Instrumental Parameter Effects on Collision Induced Dissociation Efficiency for 105+ Methylbenzoate Daughter Spectra. Each Peak is a Scan of Drawout Potential From Q2+20 V to Q2-30 V.

... each 5-peak multiplet increases with scan number, indicating that fragmentation efficiency increases as collision cell pressure increases. This effect was expected, as a greater number of argon atoms become collision targets. The effect of collision energy at higher pressures also becomes apparent (scans 300 - 420), since the impact energy determines the extent of fragmentation. The maximum CID efficiency for both m/z 136+ and m/z 105+ daughter spectra is obtained at 4.3 × 10⁻² torr, 20 electron volts collision energy, and -40 volts drawout potential.

The effects of drawout potential on CID efficiency are illustrated by plotting CID efficiency versus collision cell pressure at a constant collision energy (Figure 5.5). The results indicate that CID efficiency increases as the drawout potential becomes more negative. Peak shape, however, suffers as the drawout potential reaches large negative values.
This causes peak splitting and causes the CID efficiency to become artificially large. Therefore, I found it best to apply a constant drawout potential by tracking Q3 behind Q2 by only -10 or -20 volts.

Figure 5.5 Drawout Potential Effects on Collision Induced Dissociation Efficiency for 136+ Methylbenzoate Daughter Spectra (Q2 = 0 V; Q3 = -30, -20, -10, and 0 V; CID efficiency vs. collision pressure in torr)

Figure 5.6 Collision Energy Effects on Collision Induced Dissociation Efficiency for 136+ Methylbenzoate Daughter Spectra (Q3 = Q2 - 10 V; CE = 40, 30, 20, 10, and 0 eV; CID efficiency vs. collision pressure in torr)

Figure 5.7 Collision Energy Effects on Collision Induced Dissociation Efficiency for 105+ Methylbenzoate Daughter Spectra (Q3 = Q2 - 10 V; CE = 40, 30, 20, 10, and 0 eV; CID efficiency vs. collision pressure in torr)

The effects of collision energy and collision cell pressure upon CID efficiency for m/z 136+ and m/z 105+ methylbenzoate daughter spectra are illustrated in Figures 5.6 and 5.7. CID efficiency initially increases with collision cell pressure as fragmentation efficiency increases. As collision cell pressure becomes higher, ion scattering becomes dominant. This causes the collection efficiency to decrease and the CID efficiency to subsequently decline. By judiciously selecting the collision energy, ion scattering can be minimized.
For methylbenzoate, 20 eV collision energy produced a large, constant CID efficiency value in the higher collision cell pressure region for both m/z 136+ and m/z 105+ daughter spectra.

To determine collision energy and collision cell pressure effects on individual daughter ions, it is necessary to examine the collision energy breakdown curves for m/z 136+ and m/z 105+ daughter spectra at different collision cell pressures (Figures 5.8, 5.9). At very low collision cell gas pressures (5.2 × 10⁻⁵ to 3.5 × 10⁻⁴ torr), little fragmentation occurs and the parent ion dominates the daughter spectrum. At medium collision cell pressures (1.7 × 10⁻³ to 3.2 × 10⁻³ torr), fragmentation increases and the relative abundance of the parent ion decreases. At high collision cell pressures (9.9 × 10⁻³ to 4.3 × 10⁻² torr), the decomposition order changes, causing different ions to appear in the spectrum.

The collision cell pressure required for a particular experiment depends on the type of information desired. If consistent fragmentation resulting from single collisions of the parent ion with the collision gas is desired, the collision cell pressure should remain in a region low enough that only first-order fragmentation occurs. Higher collision cell pressures may cause multiple collisions in the collision cell, resulting in additional fragmentation that complicates the spectrum. The effect of collision cell pressure on matching MS/MS spectra is investigated in the next section.
Figure 5.8 CE Breakdown Curves for 136+ Methylbenzoate Daughters (relative intensity vs. collision energy in electron volts; panels at 5.2 × 10⁻⁵, 1.9 × 10⁻⁴, 3.5 × 10⁻⁴, 1.7 × 10⁻³, 3.2 × 10⁻³, 9.9 × 10⁻³, and 4.3 × 10⁻² torr)

Figure 5.9 CE Breakdown Curves for 105+ Methylbenzoate Daughters (relative intensity vs. collision energy in electron volts; panels at 5.2 × 10⁻⁵, 1.9 × 10⁻⁴, 3.5 × 10⁻⁴, 1.7 × 10⁻³, 3.2 × 10⁻³, 9.9 × 10⁻³, and 4.3 × 10⁻² torr)

Instrumental Parameter Effects on Spectral Matching

The MS/MS spectral matching program was designed to be a flexible, interactive tool for comparing MS/MS spectra. Flexibility is necessary to account for deviations due to a variety of experimental situations. The search program uses several match factors to characterize the correspondence between the sample and each candidate spectrum (5). Similarity and difference factors are weighted and combined to form an overall match factor. A complete description of the MS/MS spectral matching program was presented in Chapter 4. The resulting match factors returned to the user are described in Table 4.1.

To test the effectiveness of the MS/MS spectral matching program on varying MS/MS spectra, methylbenzoate daughter spectra were acquired under a variety of instrumental conditions.
When different instrument conditions were required, only the parameter of interest was changed. This parameter, however, was varied over a range sufficient to cause an appreciable change in the spectrum's appearance. The resulting spectra were matched against each other to determine which combinations of instrumental parameters produced similar spectra.

The results of matching a methylbenzoate (m/z 136+) daughter spectrum against spectra of the same compound acquired under various experimental conditions are displayed in Table 5.1. Each table entry represents an m/z 136+ daughter spectrum of methylbenzoate taken under different instrumental conditions. Results for the top matching spectra are displayed in the table.

Table 5.1 m/z 136+ Methylbenzoate Daughter Spectra Match Factors

The spectral matching data (Tables 5.1 and 5.2) demonstrate that similar spectra may be obtained by using different combinations of instrumental parameters. By grouping spectra with the best match factors, an acceptable operating range for each instrumental parameter required to obtain similar matching daughter spectra was determined. Candidate spectra acquired at the same collision cell pressure matched the best, indicating that collision cell pressure affected the spectra the most. Collision energy also showed a significant effect on the spectral pattern of the daughter spectra.
Drawout potential had a lesser effect on the appearance of the mass spectrum, since different values of this parameter appear throughout the top matching spectra.

Collision energy and pressure effects on the overall match factor for the tabular data are illustrated in Figures 5.10 and 5.11. To eliminate the effects of other parameters, only spectra taken at a drawout potential of Q2 - 10 volts were considered. The spectrum treated as the sample spectrum in the matching process was acquired at 9.9 × 10⁻³ torr collision cell pressure and 20 eV collision energy.

The effects of collision energy vary with collision pressure. At low collision cell pressures (5.2 × 10⁻⁵ to 3.5 × 10⁻⁴ torr), the collision energy makes little difference because fragmentation is sparse. At higher collision cell pressures, the probability of an ion/atom collision increases. The collision energy determines whether fragmentation will occur and, therefore, has a greater effect on the overall match factor.

Table 5.2 m/z 105+ Methylbenzoate Daughter Spectra Match Factors
Figure 5.10 Instrumental Parameter Effects on the Overall Match Factor for 136+ Methylbenzoate Daughter Spectra (overall match factor vs. collision energy in electron volts; curves at 5.2 × 10⁻⁵, 1.9 × 10⁻⁴, 3.5 × 10⁻⁴, 1.7 × 10⁻³, 3.2 × 10⁻³, 9.9 × 10⁻³, and 4.3 × 10⁻² torr)

Figure 5.11 Instrumental Parameter Effects on the Overall Match Factor for 105+ Methylbenzoate Daughter Spectra (overall match factor vs. collision energy in electron volts; curves at 5.2 × 10⁻⁵, 1.9 × 10⁻⁴, 3.5 × 10⁻⁴, 1.7 × 10⁻³, 3.2 × 10⁻³, 9.9 × 10⁻³, and 4.3 × 10⁻² torr)

The effect of collision pressure on the overall match factor is more pronounced than that of collision energy. Low overall match factors are obtained at low collision cell pressures (5.2 × 10⁻⁵ to 3.5 × 10⁻⁴ torr). Poor results are also obtained at high pressures (4.3 × 10⁻² torr) and collision energies (30 - 40 eV), where non-first-order decomposition occurs in the collision cell. The search program cannot compensate for spectral deviations due to decomposition mechanism changes in either the source or the collision cell. Acceptable overall match factors are obtained, however, at all first-order collision cell pressures. Hence, MS/MS library spectra should be acquired in a collision cell pressure region where only first-order decomposition occurs.

Automated Resolution of MS/MS Mixtures

Many compounds analyzed are impure. If the chemist has not identified a compound, he is usually unaware of its purity. Therefore, the development of a method to resolve spectra from mixtures is helpful to the structure determination process. Many algorithms have been invoked to resolve mass spectra arising from mixtures.
Common techniques include spectrum stripping (6-8), multiple linear regressions (9-11), graphical rotation (12,13), pattern recognition (14,16), reverse searching (7,17), factor analysis (18,19), block-cutpoint tree methods (20), and parabolic fits to quotient spectra (21). Most methods are limited to resolving components of a mixture whose spectral peaks do not overlap. This is a serious drawback when analyzing mass spectra of mixtures of high molecular weight compounds.

By selecting a parent ion with the initial mass filter, fragmenting it, and recording its mass spectrum using a second mass filter, the triple quadrupole mass spectrometer provides good selectivity. This selectivity reduces the probability that a given spectrum will represent more than one compound. Many MS/MS spectra resulting from a single collision arise from single neutral losses (22). This results in a cleaner, simpler spectrum. Hence, there is less chance that spectral peaks of several components in the mixture will overlap. These characteristics have enabled MS/MS to be successfully used for direct analysis of complex mixtures (23). Given these capabilities, is it necessary to develop a means to resolve MS/MS spectral mixtures?

In triple quadrupole MS/MS, a daughter spectrum is obtained by selecting a parent ion by m/z value in the first mass filter. Ions at this m/z are passed into a collision region and undergo collisionally induced dissociation. Intensities of the resulting fragment ions are recorded by a second mass filter as a daughter spectrum. If two mixture components fragment in the ion source to give ions of equal m/z values, both ions will enter the collision cell. The resulting daughter spectrum will be a mixture of fragments from these isobaric parent ions. Hence MS/MS mixture spectra do exist and must be resolved. A means to determine the presence of a mixture in a conventional mass spectrum is also very useful.
If the conventional mass spectrum is proven to be pure, no MS/MS mixture spectra resulting from more than one compound will exist. Isobaric parent ions must then come from the same molecule.

The MS/MS spectral matching program was extended to resolve spectra due to mixtures of isomeric and isobaric parent ions, poor resolution MS/MS spectra (such as MIKES spectra (22)), and the presence of reagent ions in CI spectra. In particular, isomers of synthesized compounds tend to produce isobaric MS/MS mixture spectra. An example resolving this type of mixture will be demonstrated later.

Iterative parabolic fits to quotient spectra (unknown spectrum/candidate spectrum) were used to identify mixture components and to reduce the dependence of component identification upon instrumental parameters. This algorithm was designed to integrate with existing software tools. It takes advantage of the inverted data in the reference data base library and the several match factors already present in the spectral matching program.

The strategy of the algorithm is to quickly reduce the number of candidate spectra by eliminating those compounds whose mass spectra contain peaks not found in the mixture spectrum. An intensity-based matching algorithm then uses several match factors to identify the major component. Multiple parabolic fits to the quotient spectrum determine which peaks in the mixture spectrum are not completely accounted for by the major component. The portion of the peak intensity belonging to the minor components is calculated and placed with the unmatched peaks in a residual spectrum. The residual spectrum is then matched against the library to identify the second component. This process is repeated until all components in the mixture have been resolved or until the residual spectrum cannot effectively be matched.

Two examples demonstrate the algorithm developed to resolve mixture spectra. The first example resolves a conventional mass spectrum arising from an ether mixture.
The second example resolves an MS/MS mixture spectrum arising from isobaric parent ions.

A 60:40 spectral mixture of β,β-dibromodiethyl ether and diisopropyl ether was created by adding the two experimental mass spectra. This mixture spectrum was used to test the mixture resolution algorithm. Data from the inverted list of mass spectral peaks were used to logically reduce the number of possible component spectra. A list representing all spectra containing an m/z value not in the mixture spectrum was placed into a buffer. The list of spectra containing another m/z value (not in the mixture spectrum) was logically ANDed with the original list (Figure 5.12). This process continued until all spectra containing spectral peaks not present in the mixture spectrum were eliminated. The most frequent spectral peaks in the data base are present at lower mass values. Hence, the logical reduction process started at m/z zero and proceeded to m/z 500. Since the frequency of spectral peaks in the data base drops off rapidly above mass 500, few additional spectra would be eliminated beyond that point.

Figure 5.12 Logical Reduction of Candidate Spectra During Mixture Analysis (Venn Diagram)

As successive sets of these spectra are combined, the intersecting portion of all sets becomes very small. The data in Figure 5.13 illustrate how the number of candidate spectra decreased as successive inverted lists were ANDed together.

When the subset of candidate spectra was obtained, intensity-based matching determined the major component of the mixture. The reverse-search match factors, NR and IR, helped determine the major component, since they place no emphasis on peaks in the mixture spectrum that are not present in the reference spectrum. Table 5.3 lists the results of searching for the major component. The compound with the highest overall match factor (β,β-dibromodiethyl ether) was identified as the major component of the mixture.
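The logical reduction step can be sketched in Python with sets standing in for the inverted lists. This is an illustrative sketch under stated assumptions, not the original program: the names build_inverted_lists and reduce_candidates are hypothetical, the toy library is invented, and set difference is used as the equivalent of ANDing the running candidate list with the list of spectra that lack a peak at each absent m/z value.

```python
def build_inverted_lists(library):
    """Map each integer m/z value to the set of spectrum IDs having a peak there."""
    inverted = {}
    for spec_id, peaks in library.items():
        for mz in peaks:
            inverted.setdefault(mz, set()).add(spec_id)
    return inverted


def reduce_candidates(library, mixture_peaks, max_mz=500):
    """Keep spectra that have no peak at any m/z (0..max_mz) absent from the mixture."""
    inverted = build_inverted_lists(library)
    candidates = set(library)
    for mz in range(0, max_mz + 1):
        if mz in mixture_peaks:
            continue
        # Equivalent to ANDing with "spectra that do NOT contain a peak at mz".
        candidates -= inverted.get(mz, set())
    return candidates


# Invented toy library: spectrum ID -> set of integer m/z values with peaks.
library = {
    "A": {27, 43, 45, 87},
    "B": {29, 43, 57},
    "C": {43, 45},
    "D": {43, 600},  # its only absent peak lies above the m/z 500 cutoff
}
mixture = {27, 43, 45, 87, 102}
print(sorted(reduce_candidates(library, mixture)))
```

Spectrum B is eliminated because it has peaks at m/z 29 and 57, which the mixture lacks. Spectrum D survives only because the scan stops at m/z 500, which mirrors the trade-off described above.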
The match factors NR and IR are both zero, indicating that all peaks in the β,β-dibromodiethyl ether spectrum were present in the mixture spectrum. The minor component, diisopropyl ether, was listed as the second best matching compound.

Figure 5.13 Logical Reduction of Candidate Spectra During Mixture Analysis (Stepping Through M/Z Values) (number of candidate spectra vs. m/z value)

Table 5.3 Match Factors for Determination of the Major Component of the Ether Mixture

All peaks in the mixture spectrum that were not in the major component were then removed and placed into the residual spectrum. A quotient spectrum was created by dividing the intensities of the remaining peaks in the mixture spectrum (now termed the reduced component spectrum) by the intensities of the corresponding peaks in the spectrum of β,β-dibromodiethyl ether. Several peaks in the mixture result from overlap of the component spectra. A parabolic fit to the quotient spectrum determines whether any portions of peaks in the mixture spectrum belong to minor components. Since the chi-square value was above a set threshold, the peak creating the largest discrepancy in the quotient spectrum (m/z 43.0) was removed. The chi-square value was then recalculated. This process was repeated until the chi-square value fell below a specified minimum.
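The iterative fit can be sketched as follows. This is a minimal Python illustration, not the original implementation: the names fit_parabola and iterative_quotient_fit, the unweighted reduced chi-square, the threshold value, and the toy spectra are all assumptions made for the example.

```python
def fit_parabola(xs, ys):
    """Least-squares fit of y = a + b*u + c*u**2 with u = x - mean(xs).

    Returns a function that evaluates the fitted parabola at any x.
    """
    x0 = sum(xs) / len(xs)
    us = [x - x0 for x in xs]
    # Normal equations: sums of powers of u and of y * u**k.
    s = [sum(u ** k for u in us) for k in range(5)]
    t = [sum(y * u ** k for u, y in zip(us, ys)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the 3x4 system.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [p - f * q for p, q in zip(m[r], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    return lambda x: a + b * (x - x0) + c * (x - x0) ** 2


def iterative_quotient_fit(mixture, reference, chi2_max=1e-4, min_peaks=4):
    """Fit the quotient spectrum, dropping the worst peak until chi-square is small."""
    masses = sorted(mz for mz in reference if mz in mixture)
    if len(masses) < min_peaks:
        raise ValueError("too few common peaks for a parabolic fit")
    removed = []
    while True:
        quotients = [mixture[mz] / reference[mz] for mz in masses]
        predict = fit_parabola(masses, quotients)
        residuals = [q - predict(mz) for mz, q in zip(masses, quotients)]
        # Reduced chi-square with 3 fitted parameters (unweighted: an assumption).
        chi2 = sum(r * r for r in residuals) / (len(masses) - 3)
        if chi2 <= chi2_max or len(masses) <= min_peaks:
            return predict, removed, chi2
        worst = max(range(len(masses)), key=lambda i: abs(residuals[i]))
        removed.append(masses.pop(worst))


# Invented toy data: the mixture is the reference scaled by a smooth quotient,
# plus 40 intensity units from a second component at m/z 60.
reference = {20: 30.0, 30: 55.0, 40: 100.0, 50: 42.0, 60: 80.0,
             70: 15.0, 80: 60.0, 90: 25.0, 100: 10.0}
mixture = {mz: inten * (0.6 + 0.001 * mz) for mz, inten in reference.items()}
mixture[60] += 40.0

predict, removed, chi2 = iterative_quotient_fit(mixture, reference)
print("removed peaks:", removed)
```

Centering the masses before fitting keeps the normal equations well conditioned; the min_peaks guard reflects the observation that a parabola fitted to very few quotient points carries little information.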
When a suitable parabola was determined, the parabolic equation was used to calculate the probable intensity of the removed peak (m/z 43.0) due to the major component, and this value was placed in the reduced component spectrum (Figure 5.14). The difference between the actual intensity and this value was placed into the residual spectrum. The remaining mixture spectrum was termed the reduced component spectrum. Intensity-based matching performed on the residual spectrum correctly found the minor component to be diisopropyl ether (Table 5.4).

Figure 5.14 MS Ether Mixture Resolution (relative intensity vs. m/z; panels: Dibromodiethyl Ether/Diisopropyl Ether Mixture, Dibromodiethyl Ether (60% Component), Reduced Component Spectrum, Diisopropyl Ether (40% Component), Residual Spectrum)

Table 5.4 Match Factors for Determination of the Minor Component of the Ether Mixture

A 60:40 mixture of two pesticide MS/MS spectra (pirimiphos and the oxygen analog of chlorpyrifos) was created to test the mixture resolution algorithm on MS/MS spectral mixtures resulting from isobaric parent ions (m/z 332). The resulting component spectra determined by the mixture resolution algorithm are presented in Figure 5.15.
Although the MS/MS spectrum of the mixture contains few peaks, the algorithm correctly deconvoluted it into the two component spectra. No portion of the parent ion intensity was placed into the residual spectrum. In the future, the algorithm will be modified to ignore the parent ion intensity in the spectrum stripping process.

[Figure 5.15 MS/MS Pesticide Mixture Resolution. Panels: pirimiphos/oxygen analog of chlorpyrifos mixture; pirimiphos (60% component); reduced component spectrum; oxygen analog of chlorpyrifos (40% component); residual spectrum. Axes: relative intensity versus m/z.]

Conclusions

The MS/MS search program has helped illustrate instrumental parameter effects on MS/MS spectral patterns. Different combinations of instrumental conditions may yield similar spectra. The ability of the program to correctly identify MS/MS spectra taken under different conditions has helped determine standard operating condition limits for acquiring MS/MS reference spectra. The most important instrumental parameter is collision cell pressure. MS/MS library spectra should be acquired in a collision cell pressure region where only first-order fragmentation occurs. Collision energy should be adjusted for maximum CID efficiency (approximately 20 eV). Regarding drawout potential, 02 should trail the 02 voltage setting by a fixed negative amount (-10 V).

Results of the mixture resolution algorithm depend on the number of overlapping peaks from component spectra and the number of peaks in each component spectrum.
As the number of overlapping peaks decreases, the process of mixture resolution becomes easier. However, if the number of peaks in the quotient spectrum decreases to less than 4, parabolic fits to the data are no longer profitable. If one component spectrum is a subset of another component spectrum, mixture resolution by this method becomes unattainable.

References

1. Dawson, P. R., presented at 31st Annual Conference on Mass Spectrometry and Allied Topics, Boston, MA (1983); bound p. 203.
2. Dawson, P. R., Sun, W. F., Int. J. Mass Spectrom. Ion Proc., 55, 155 (1983).
3. Wong, C. M., Crawford, R. W., Barton, V. C., Brand, H. R., Neufield, K. W., Bowman, J. E., Rev. Sci. Inst., 54, 996 (1983).
4. Meg Osterby, unpublished work, MSU Chemistry Dept., East Lansing, Michigan.
5. Wong, C. M., Lanning, S., Energy and Technology Review, Lawrence Livermore National Laboratory, February, p. 8 (1984).
6. Dromey, R. G., Stefik, M. J., Rindfleisch, T. C., Duffield, A. M., Anal. Chem., 48, 1368 (1976).
7. McLafferty, F. W., Acc. Chem. Res., 13, 33 (1980).
8. Atwater, B. L., Venkataraghavan, R., McLafferty, F. W., Anal. Chem., 51, 1945 (1978).
9. Biller, J. E., Biemann, K., Anal. Letters, 7, 515 (1974).
10. Biller, J. E., Herlihy, W. C., Biemann, K., Computer Assisted Structure Elucidation, ACS Sym. Series, 54, 18 (1977).
11. Giblin, D. E., Peake, D. A., Lapp, R. L., presented at 32nd Annual Conference on Mass Spectrometry and Allied Topics, San Antonio, TX, May (1984); bound p. 644.
12. Windig, W., Meuzelaar, H. L. C., presented at 32nd Annual Conference on Mass Spectrometry and Allied Topics, San Antonio, TX, May (1984); bound p. 665.
13. Ioup, G. E., Thomas, B. S., J. Chem. Phys., 46, 3959 (1967).
14. Chien, M., Anal. Chem., 57, 348 (1985).
15. Van der Greff, J., Tas, A. C., Bouwman, J., Tennuever de Brauw, M. C., Schrueurs, W. H. P., Anal. Chim. Acta, 150, 45 (1983).
16. Soltzberg, S. L., Kaberline, S. L., Lam, T. L., Brunner, T. R., Wilkens, C. L., J. Am. Chem. Soc., 98, 7139 (1976).
17. Abramson, F. P., Anal. Chem., 47, 45 (1975).
18. Clemens, J., Kowalski, B. R., Anal. Chim. Acta, 133, 538 (1981).
19. Ritter, G. L., Lowery, S. R., Isenhour, T. L., Wilkens, C. L., Anal. Chem., 48, 591 (1978).
20. Nakayama, T., Fujiwara, Y., J. Chem. Inf. Comput. Sci., 21, 142 (1981).
21. Henneberg, D., Wiemann, B., presented at 32nd Annual Conference on Mass Spectrometry and Allied Topics, San Antonio, TX, May (1983); bound p. 185.
22. Borogzadeh, M. R., Morgan, R. P., Beynon, J. B., Analyst, 103, 1613 (1978).
23. Kondrat, R. W., Cooks, R. G., Anal. Chem., 50, 81A (1978).

CHAPTER VI

A STRUCTURE/SUBSTRUCTURE DATA BASE ASSOCIATED WITH MS/MS SPECTRA*

Abstract

A structure/substructure data base was developed to store substructure-property relationships determined using mass spectrometry/mass spectrometry (MS/MS) instruments. The structures and substructures are correlated with the MS/MS spectra that represent them, which are held in a separate spectrum data base. The substructures of each compound are not associated with any other substructure of the molecule or with the structure of the compound. The structures are stored in connectivity matrices using an extended Morgan algorithm to generate a unique, unambiguous form allowing for representation of stereochemistry, charged species, radicals, and isotopic species. In addition, a unique, invariant linear name describing the molecule is generated. A heuristic drawing program was adapted to allow structures and substructures to be drawn in such a manner that common substructural features are easily recognized.

*Note: This chapter is a draft of a manuscript written by the author of this thesis in preparation for submission to Computers and Chemistry with C. F. Beckner and C. G. Enke as coauthors.
Introduction

The development of two-dimensional techniques such as mass spectrometry/mass spectrometry (MS/MS) (1) has inspired the development of data bases to handle the different types of related MS/MS spectra as well as the structures and substructures determined from each spectrum. MS/MS instruments allow the determination of a unique substructure-property relationship arising from each spectrum and require an information data base to store and maintain these relationships. To this end, an MS/MS spectrum data base and an associated structure/substructure data base were designed and developed. The spectrum data base is able to store and associate the various types of MS/MS spectra for each compound and is described elsewhere (2).

The focus of this paper is the design and implementation of a structure/substructure data base that can manage the structural information generated from MS/MS instruments. Central to the discussion will be the design requirements for this particular application in contrast to the requirements of other structure and substructure data bases.

The structure determination scheme does not require the elucidation of ion structures; thus only molecular substructures need be stored (3). The structure determination scheme uses the substructures as building blocks to produce plausible complete molecular structures. Therefore, the substructures in the data base need not be associated with each other or with the parent molecular structure. Instead, each structure or substructure in the data base exists independently of the others and references the MS/MS spectra in the spectral data base that are associated with it. Likewise, each MS/MS spectrum contains references to the structure or substructures that are identified by that spectrum. These links allow the MS/MS spectrum to be correlated with its respective substructure and a system of substructure-property relationships to be developed.
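The two-way references between the structure data base and the spectrum data base might be modelled as in the sketch below. The class names `StructureRecord` and `SpectrumRecord` and the helper `link` are hypothetical and serve only to show that each side holds the registry numbers of the other, so that neither a structure nor a spectrum needs to know about any other structure.

```python
from dataclasses import dataclass, field


@dataclass
class StructureRecord:
    registry: int                                   # structure registry number
    spectra: list = field(default_factory=list)     # spectrum registry numbers


@dataclass
class SpectrumRecord:
    registry: int                                   # spectrum registry number
    structures: list = field(default_factory=list)  # structure registry numbers


def link(structure, spectrum):
    """Record the association on both sides, ignoring duplicates."""
    if spectrum.registry not in structure.spectra:
        structure.spectra.append(spectrum.registry)
    if structure.registry not in spectrum.structures:
        spectrum.structures.append(structure.registry)


if __name__ == "__main__":
    phenyl = StructureRecord(registry=1)
    benzyl = StructureRecord(registry=2)
    daughter_a = SpectrumRecord(registry=100)
    daughter_b = SpectrumRecord(registry=101)
    link(phenyl, daughter_a)      # one spectrum may point at several substructures
    link(benzyl, daughter_a)
    link(phenyl, daughter_b)      # one substructure may point at several spectra
    print(phenyl.spectra, daughter_a.structures)
```

Keeping only registry numbers on each side, rather than object references, mirrors the independence of the two data bases described in the text.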
The structure/substructure data base was designed to store both molecular structures and substructures with no distinction in data base structure or data storage format. Hence any comments regarding structures are assumed to refer to substructures as well.

The substructure-property relationships determined by MS/MS are not unique. One daughter spectrum may correspond to several substructures. Likewise, one substructure may be associated with several daughter spectra. The relationship between the spectrum data base and the structure/substructure data base information is complex (n-to-n mapping) and is constantly changing as new information is added to both data bases. New information may include redundant substructures that have yet to be pruned out. Their structural pointers in the spectrum data base must be redirected to the sole remaining representation of that structure when the redundant entries in the structure data base are eliminated. Redundant structures in the data base are found by searching the data base for identical structures. The empirical formula of the structure serves to prefilter nonidentical structures.

Due to the dynamic nature of the structural information, the structure data base was designed to be easily maintained and to efficiently reclaim vacant space. This includes easy entry and updating of information in the structure data base. At the present time the structures are automatically drawn and then compared for structural similarities and commonalities. In the future an atom-by-atom substructural searching program will eliminate redundant structures through automated comparison (4,5).

Data Base Design

The design of the structure/substructure data base is illustrated in Figure 6.1. Flexibility and ease of maintenance are the most important features of the design. To accomplish this, a linked-list architecture was used for storing all data records in the data base (6,7).
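A linked-list store of this kind can be sketched in a few lines. The sketch assumes fixed-length records chained by continuation pointers, with vacated records returned to a free list; the class name `RecordStore`, the record size `RECORD_WORDS`, and the `NIL` end marker are hypothetical choices for illustration and are not taken from the data base software itself.

```python
RECORD_WORDS = 4   # hypothetical payload capacity of one fixed-length record
NIL = 0            # hypothetical marker meaning "no continuation record"


class RecordStore:
    """Variable-length data held in chains of fixed-length records."""

    def __init__(self):
        self.records = {}   # physical record number -> (payload, next record)
        self.free = []      # record numbers vacated by deletions
        self.highest = 0    # highest physical record number assigned so far

    def _allocate(self):
        if self.free:                 # reclaim vacated space first
            return self.free.pop()
        self.highest += 1
        return self.highest

    def write(self, data):
        """Store data as a chain of records; return the first record number."""
        chunks = [data[i:i + RECORD_WORDS]
                  for i in range(0, len(data), RECORD_WORDS)] or [[]]
        numbers = [self._allocate() for _ in chunks]
        for i, (num, chunk) in enumerate(zip(numbers, chunks)):
            nxt = numbers[i + 1] if i + 1 < len(numbers) else NIL
            self.records[num] = (list(chunk), nxt)
        return numbers[0]

    def read(self, first):
        """Follow continuation pointers and return the reassembled data."""
        out, num = [], first
        while num != NIL:
            payload, num = self.records[num]
            out.extend(payload)
        return out

    def delete(self, first):
        """Release every record in the chain for later reuse."""
        num = first
        while num != NIL:
            _, nxt = self.records.pop(num)
            self.free.append(num)
            num = nxt


if __name__ == "__main__":
    store = RecordStore()
    small = store.write([1, 2, 3])            # fits in one record
    store.write(list(range(6)))               # overflows into a second record
    store.delete(small)
    big = store.write(list(range(10, 20)))    # reuses the vacated record
    print(big, store.read(big))
```

In the usage above the larger replacement starts in the record vacated by the deleted entry and overflows into newly assigned records, which is the behaviour that lets the data base grow and shrink without being rewritten.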
Variable lengths of information are represented using fixed length records. If the information overflows one record, a continuation pointer specifies another overflow record. The use of fixed length records conserves space while the use of linked lists maintains flexibility. If the header records were to be extended at a later date, the data base would not have to be rewritten. A continuation pointer would merely specify the overflow record. Likewise, if a structure were deleted and replaced with a larger structure, the space vacated by the deleted record could be reclaimed since a continuation pointer would allow the data to overflow into another record.

[Figure 6.1 Structure Data Base Format. Record order shown: master header; directory; structure header #1; structure #1; structure header #2; structure #2; structure header #3; structure #3; structure #2 (cont'd); ...; structure header #20; structure #20.]

This flexible design allows the data base to expand and contract with varying amounts and types of data. In theory all vacated space may be reclaimed. However, when the number of internal continuation pointers becomes too large, the data associated with each structure becomes badly fragmented and the access speed of the information suffers. At this point the data base can be "merged" with an empty data base and the data restructured such that it resides in contiguous records. Any vacant space in the data base is also compressed out.

A master header contains all static information describing the data base (Figure 6.2). The structure of the data base is denoted by storing the version number of the data base management software, the size of the internal directory, and the master header size. The version number of the software helps maintain consistency during growth and revision of the software. The directory size parameter sets aside space for an internal directory and protects the directory from being written to unless new data are being entered.
The size of the master header is fixed but may change if forthcoming versions of the software contain more static information. It is therefore maintained as a variable. The master header maintains a historical record of the data base by containing the dates of data base creation and latest update. The highest assigned physical record number is kept to determine the next available physical record number. The highest assigned registry number determines the next directory entry (logical record) available for new data. The last registry number defines the limits of the internal directory and also the number of structures in the data base. There is no restriction on the length of the data base, only on the number of structures that may be contained within the data base.

[Figure 6.2 Master Header Record Format. Fields: software version number; master header size; directory size; creation date; last update date; highest record number; highest registry number; last registry number.]

An internal directory containing the physical record number for each structure header record follows the master header (Figure 6.1). This design maintains data independence between the logical structure registry numbers and the actual physical record numbers where the data are stored. This feature allows easy maintenance of the data base. When a structure is deleted, the directory entry for that structure is merely forgotten. New structures are sequentially assigned new registry numbers as they are added to the data base. The size of the directory dictates the maximum number of structures that can be stored in the data base. When the last registry number has been assigned, no further additions to the data base may be made. This design conserves overhead space while retaining quick access to the data records by requiring only a single disk read to obtain the location where the desired data resides.

A header record precedes the data record for each structure and identifies the structure (Figure 6.3).
An arbitrary registry number is assigned to each structure and uniquely identifies the structure. This number corresponds to the entry in the internal directory for the same structure. The size of the structure header is fixed but an extension variable exists for flexibility. A status variable in the header maintains the current status of the structure: it identifies the data as either a complete structure or a substructure, and specifies whether or not it is logically deleted. This is the only piece of data in the entire data base which distinguishes structures from substructures. All other storage criteria are the same.

[Figure 6.3 Structure Header Record Format. Fields: registry number; header size; structure status; structure size; insertion date; update date; Chemical Abstracts number; Wiley number; empirical formula; NATOMS; NBONDS; NRINGS; NISO; NSCHEM; NMODS; LNKHDR; LNKDAT; REGDAT.]

The size of the structure is maintained in terms of data records since continuation pointers allow the size to vary. The insertion date and update date are included in the header to help maintain the history of the structure. The Chemical Abstracts Service number is stored to identify complete structures and serves as a key when retrieving molecular structures. In the case of substructures, this variable is an arbitrary but uniquely assigned substructure number which also serves as a search key for substructure retrieval. Hence the structure and substructure retrieval routines are one and the same. The Wiley number of the MS/MS spectrum representing the structure in the spectra data base is stored as a cross-reference. The empirical formula is stored even though the information is redundant with that contained in the structure connectivity tables. In addition to informing the user of the number of hydrogens present in the molecule, the empirical formula serves as a double check on the integrity of the data and as a prefilter when comparing structures.
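The role of the empirical formula as a prefilter, and the redirection of spectrum pointers once redundant entries are found, can be illustrated with the following sketch. It assumes that two entries are identical when their canonical linear names agree; the function names `find_redundant` and `redirect` and the dictionary layouts are hypothetical and do not describe the actual routines.

```python
from collections import defaultdict


def find_redundant(structures):
    """Map each redundant registry number to the entry that is kept.

    structures: dict of registry number -> (empirical formula, canonical name).
    Entries are first grouped by empirical formula (the prefilter), so the
    more expensive identity comparison only runs within a group.
    """
    by_formula = defaultdict(list)
    for reg in sorted(structures):
        by_formula[structures[reg][0]].append(reg)

    replace = {}
    for regs in by_formula.values():
        kept = {}                      # canonical name -> first registry seen
        for reg in regs:
            name = structures[reg][1]  # assumed identity test: equal names
            if name in kept:
                replace[reg] = kept[name]
            else:
                kept[name] = reg
    return replace


def redirect(spectrum_pointers, replace):
    """Point every spectrum at the surviving copy of each structure."""
    return {spec: sorted({replace.get(reg, reg) for reg in regs})
            for spec, regs in spectrum_pointers.items()}


if __name__ == "__main__":
    structures = {
        1: ("C6H5", "name-A"),
        2: ("C7H7", "name-B"),
        3: ("C6H5", "name-A"),   # duplicate of entry 1
        4: ("C6H5", "name-C"),   # same formula, different structure
    }
    replace = find_redundant(structures)
    pointers = {100: [3], 101: [1, 3], 102: [2]}
    print(replace, redirect(pointers, replace))
```

The identity test is deliberately isolated in one line so that a different comparison, such as an atom-by-atom match, could be substituted without changing the grouping or the pointer redirection.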
The variables NATOMS, NBONDS, NRINGS, NISO, NSCHEM, and NMODS describe the characteristics of the structure by specifying the sizes of the connectivity matrices. NATOMS specifies the number of atoms in the structure; NBONDS specifies the number of bonds; NRINGS specifies the number of ring closure bonds; NISO specifies the number of isomeric atoms; NSCHEM specifies the number of stereochemical isomers; and NMODS specifies the number of modified atoms in the structure. Modified atoms are those containing charges, free valences, or consisting of an isotopic mass.

Lastly, the header maintains continuation pointers for each header (LNKHDR) and data record (LNKDAT). The array REGDAT contains the registry numbers in the spectra data base for spectra representing this structure. Since several spectra may be identified with a single structure, this array may contain up to ten pointers. If more spectra pointers are needed, the size of the header may be expanded by using the continuation pointer LNKHDR.

Structure Storage Format

The representation form of structures is central to the performance of the data base (Figure 6.4). The criteria for choosing a storage format include the ability to uniquely and unambiguously represent the structure as compactly as possible. There are several methods for representing structures, Wiswesser line notation (WLN) and connectivity matrix methods being two of the most popular (8,9). While various linear notations such as WLN are unique and compact, they have difficulties representing large molecules and stereochemistry. Due to their flexibility in storing molecules in a non-linear fashion, connectivity matrices have become popular for storing large, complex structures. A drawback to using connectivity matrices is that they commonly represent identical structures in a number of different ways. They also require a relatively large amount of space. An expanded version of the Morgan algorithm (10) was chosen to encode the structures.
This algorithm uses a set of connectivity matrices to represent each molecular structure or substructure in a unique, unambiguous fashion. In addition, the structure can be recalled as an expanded linear name to easily determine identical structures.

[Figure 6.4 Structure Storage Record Format. Arrays: NODE; FROM; BOND; RING; MODS; CISTRN; STEREO.]

The version of the Morgan algorithm implemented includes revisions made by Todd and Wipke and by our own lab. Todd and Wipke used the Morgan naming algorithm to generate a unique linear name for each molecule including representation for cis/trans and stereochemical isomers (11). They expanded the atom numbering scheme developed by Morgan, which classifies atoms based on the number of non-hydrogen substituents attached to each atom. Wipke's method calculates the location of each atom relative to the center of the molecule by observing the substituents attached to each atom. These extended connectivity values, along with matrices describing the stereochemistry of the molecule, generate a stereochemically unique name whereby only one form of the connectivity matrices uniquely determines the molecular structure.

The number of elements encoded in Wipke's algorithm was expanded by Carl Beckner from the original organic elements to include all the known elements. Carl Beckner's algorithm transformed the structure from the input matrix into a unique representation in the connectivity matrices. The bond types originally available included aromatic, single, double, and triple bonds. This notation was expanded to include tautomer bonds, ionic bonds, and structural discontinuities and to allow the representation of free valence substructures as well as polymer structural units. The connectivity matrices are large enough to include any molecule up to 128 atoms in size (excluding hydrogens).

Several connectivity matrices are used to describe a structure (Figure 6.4).
The NODE array contains the elemental symbols for each atom. The FROM array contains the lowest numbered atom attached to each NODE atom. The BOND array describes the bond between the NODE atom and the FROM atom. The RING array is a two-dimensional array where each pair of elements corresponds to the two atom connectivity numbers closing a ring. One pair of elements exists for each ring present in the structure. The MODS array denotes any modifications for the NODE atom such as charge, free valence, and isotopic masses. The CISTRN array denotes the cis/trans arrangement for a three bond system and is a two-dimensional array whose second dimension contains three elements. Lastly, the STEREO array denotes the stereochemical arrangement of three atoms about a chiral center.

Characteristics and Operation

The structure/substructure data base currently has over 30,000 structures and substructures that correspond to MS/MS spectra in the reference data base library. The majority of the entries are structures that correspond to their normal (electron-impact) mass spectra. The remaining substructures are associated with the various MS/MS spectra in the spectral library.

Several functions enable the retrieval and manipulation of the data in the structure data base. A new data base is created by initializing the master header and the internal directory. New structures are entered manually or by using a designated file format. Structures may be logically deleted and later undeleted if needed. A merge operation physically compresses out logically deleted structures and also merges structures from two different data bases. An update operation replaces an existing structure with a revised structure by reclaiming the space used by the original structure and overflowing into another record if needed. The most useful operation of the program is the display of structures.
A tabular dump function allows the structure to be retrieved by structure number, Chemical Abstracts number, or spectrum data base registry number. Structural data are presented in a readable tabular form. The output may be directed to the terminal, the printer, or an output file. In addition, several structures may be displayed with a single command. An example of the tabular output is given in Figure 6.5 for the compound n-butylbenzene.

In addition to tabular output, graphical output to one of several graphics devices is available. A heuristic structure drawing program (DRAWC2) that was developed by Shelley (12) has been incorporated to draw the structures. Several structures drawn using this program are presented in Figure 6.6. DRAWC2 initially perceives the ring systems in the molecule and assigns spatial coordinates such that they are conventionally displayed, oriented as the chemist would normally draw them. The heuristics of the program eliminate any overlapping bonds or atoms, maintain correct bond lengths, and try to minimize atom crowding.
Identical and similar structures are similarly represented so that the chemist can perceive the commonalities of the structures. DRAWC2 was adapted to take advantage of the graphics devices in our lab. Structures are output in stick fashion with carbon atoms unlabelled. A drawing option allows all atoms, or just carbon atoms, to be tagged with sequence numbers.

To provide flexibility and transportability, the structure data base software was written in FORTRAN-77 and C. The program is currently implemented on an LSI 11/23 minicomputer running the RSX-11M operating system in a multi-user environment. A large capacity, 474 megabyte disk drive with an average access time of 18 milliseconds is used to hold the structure and spectral data bases.

Figure 6.5 Structure Representing n-Butylbenzene

    Software version: 1
    Creation date: 15-JUL-85
    Last update date: 15-JUL-85
    Last record number: 110160
    Highest assigned structure number: 32014
    Maximum number of structures: 40000

    Structure number: 322
    Structure type (Structure=1, Substructure=2): 1
    Empirical formula: C10 H14
    Insertion date: 13-JUL-85
    Last update date: 13-JUL-85
    Number of data records: 1
    Chemical abstracts number: 104518
    Wiley number: 6448
    Number of atoms: 10
    Number of bonds: 11
    Number of rings: 1
    Number of associated spectra: 1
    Spectra registry # 4043

    Atom # 1 type: C
    Atom # 2 (C ) is single bonded to atom # 1
    Atom # 3 (C ) is single bonded to atom # 1
    Atom # 4 (C ) is single bonded to atom # 2
    Atom # 5 (C ) is aromatic bonded to atom # 3
    Atom # 6 (C ) is aromatic bonded to atom # 3
    Atom # 7 (C ) is single bonded to atom # 4
    Atom # 8 (C ) is aromatic bonded to atom # 5
    Atom # 9 (C ) is aromatic bonded to atom # 6
    Atom # 10 (C ) is aromatic bonded to atom # 8
    Ring closure # 1: Atom # 9 attached to atom # 10

[Figure 6.6 Structures Drawn Using DRAWC2. Compound labels are not recoverable from the source.]
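The n-butylbenzene record of Figure 6.5 can be used to illustrate how the NODE, FROM, BOND, and RING arrays encode a structure and how the stored empirical formula can serve as an integrity check. The sketch below is illustrative only: the array contents are transcribed from the figure, while the bond-order and valence tables, the assumption that the ring closure bond is aromatic, and all function names are hypothetical.

```python
from collections import Counter

# Arrays transcribed from the tabular dump of n-butylbenzene (Figure 6.5).
NODE = ["C"] * 10
FROM = [0, 1, 1, 2, 3, 3, 4, 5, 6, 8]       # 0 marks the first atom (no parent)
BOND = [None, "single", "single", "single", "aromatic", "aromatic",
        "single", "aromatic", "aromatic", "aromatic"]
RING = [(9, 10)]                            # atom pair closing the ring

ORDER = {"single": 1.0, "double": 2.0, "triple": 3.0, "aromatic": 1.5}
VALENCE = {"C": 4, "N": 3, "O": 2}          # assumed neutral-atom valences


def bond_list(from_, bond, ring, ring_type="aromatic"):
    """All bonds as (atom, atom, type); ring closure type is an assumption."""
    bonds = [(i + 1, f, b) for i, (f, b) in enumerate(zip(from_, bond)) if f]
    bonds += [(a, b, ring_type) for a, b in ring]
    return bonds


def hydrogen_count(node, bonds):
    """Implicit hydrogens: valence minus the bond orders used by each atom."""
    used = [0.0] * (len(node) + 1)
    for a, b, kind in bonds:
        used[a] += ORDER[kind]
        used[b] += ORDER[kind]
    return sum(round(VALENCE[el] - used[i + 1]) for i, el in enumerate(node))


def empirical_formula(node, bonds):
    counts = Counter(node)
    counts["H"] += hydrogen_count(node, bonds)
    return "".join("%s%d" % (el, counts[el]) for el in sorted(counts))


def dump(node, from_, bond):
    """Reproduce the atom lines of the tabular output."""
    lines = ["Atom # 1 type: %s" % node[0]]
    for i in range(1, len(node)):
        lines.append("Atom # %d (%s) is %s bonded to atom # %d"
                     % (i + 1, node[i], bond[i], from_[i]))
    return lines


if __name__ == "__main__":
    bonds = bond_list(FROM, BOND, RING)
    print(empirical_formula(NODE, bonds))
    print("\n".join(dump(NODE, FROM, BOND)))
```

Under these assumptions the formula derived from the connectivity arrays agrees with the stored empirical formula, which is the kind of consistency check the header field makes possible.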
Summary

The ability to acquire an MS/MS spectrum and to retrieve and draw associated substructures is central to the structure determination process. The structure/substructure data base allows the substructure-property relationships determined by MS/MS instruments to be represented and developed. Substructures determined through spectral comparisons form the building blocks for generating possible molecular structures.

References

1. Yost, R. A., Enke, C. G., J. Am. Chem. Soc., 100, 2274 (1978).
2. Hoffman, P. A., Beckner, C. F., Enke, C. G., in preparation.
3. Cross, K. P., Palmer, P. T., Beckner, C. F., Giordani, A. B., Gregg, H. R., Hoffman, P. A., Enke, C. G., accepted by ACS Symposium Series.
4. Varkony, T. H., Shiloach, Y., Smith, D. H., J. Chem. Inf. Comp. Sci., 19, 104 (1979).
5. Cone, M., Venkataraghavan, R., McLafferty, F. W., J. Am. Chem. Soc., 99, 7668 (1977).
6. Heller, S. R., Anal. Chem., 44, 1951 (1974).
7. deHaseth, J. A., Woodruff, H. B., Lowry, S. R., Isenhour, T. L., Anal. Chim. Acta, 103, 109 (1978).
8. Wiswesser, W. J., Comput. Automat., 19, 2 (1970).
9. Lederberg, J., Sutherland, G. L., Buchanan, B. G., Feigenbaum, E. A., Robertson, A. V., Duffield, A. M., Djerassi, C., J. Am. Chem. Soc., 91, 2973 (1969).
10. Morgan, H. L., J. Chem. Doc., 5, 107 (1965).
11. Wipke, W. T., Dyott, T. M., J. Am. Chem. Soc., 96, 4834 (1974).
12. Shelley, C. A., J. Chem. Inf. Comp. Sci., 23, 61 (1983).

CHAPTER VII

FUTURE DEVELOPMENTS

The automated structure determination system utilizing MS/MS spectra has advanced to the point where the software tools can be routinely used in elucidating molecular structures. The temptation exists to leave the software tools unchanged and to judiciously proceed to determine structures for many compounds. While this urge may be inviting, further development and revision of the software tools should still continue. Our software tools have evolved as new and better ideas were suggested.
Much of the software has been written from the "ground up" without the benefit of previous examples. In addition, the design of some of the tools has progressed beyond their original goals. The MDDB data base was not designed to handle thousands of spectra. Likewise, the spectral matching program was not originally designed to handle multiple data base formats or to perform automated resolution of MS/MS mixture spectra. The phrase "hindsight is 20/20" becomes appropriate when evaluating the capabilities and efficiency of the software tools. Hence those whose responsibility it becomes to maintain and upgrade the software should consider the following suggestions.

The spectra data base management software was written by Phil Hoffman in MACRO-11 assembly language for optimization on a PDP 11/40 minicomputer. Since that time our laboratory has upgraded through three generations of DEC minicomputers to the current micro-VAX I workstation. Since the trend of hardware will continue towards faster computers at cheaper prices, all the software tools developed in our lab should be as transportable and upwardly mobile as possible. This statement is particularly true of the reference spectrum data base management software as it currently cannot be implemented on our micro-VAX. This software should be rewritten as soon as possible to avoid maintenance of archaic, out-of-date software.

In addition, the structure data base should be combined with the spectrum data base under a new architecture combining the best features of both data base management programs. This action will decrease the number of programs and the number of data bases to be maintained. We now have the disk space available to keep all spectra and structural information as a single file.

The matching program has suffered from the evolution process and 32K word memory restrictions.
If this program were implemented on the micro-VAX, the performance of the spectral matching routines would increase dramatically without sacrificing any of its many features.

Neutral-loss spectra provide valuable information in substructure determination. The approach presented in this work uses neutral-loss information in a very limited manner. The widespread use of neutral-loss information in spectral matching will greatly enhance the determination of substructures.

A substructural searching program needs to be implemented to determine the largest common fragments of any two structures or substructures. This program will determine the structural fragments that are given to the GENOA program as well as identify redundant structures or substructures in the structure/substructure data base. I strongly urge the developer of this software to consider the approaches presented in the literature and especially those referenced here (1-9).

The integration of the micro-VAX into the mass spectrometry information management system should be exploited as fully as possible. There are many applications where the computing power of this computer excels. These include GENOA, IONSIM, and other compute-bound tasks. The average access time of the micro-VAX disk drives is slow (78 milliseconds) relative to the Fujitsu Eagle disk drive now on the MS/MS 11/23 (18 milliseconds). Hence I/O-bound tasks will run slower on the micro-VAX than on the PDP 11/23. These include the spectra and structure data base management software. If these tasks are run on the micro-VAX, they should use the faster disk on the PDP 11 as a file server over the Ethernet network while taking advantage of the faster micro-VAX processor.

The Xerox 1108 computer should serve as a base to write and develop a knowledge-based expert system for MS/MS.
An expert system that evaluates the effect of operating conditions upon MS/MS spectra will represent a milestone in the development of standard operating conditions for MS/MS. Such a system could spur the development of badly needed community-wide MS/MS libraries.

A secondary application of the Xerox computer could be the development of an expert system to replace GENOA. Molecular Design Ltd. is no longer supporting this software and will not maintain it. An expert system would run more efficiently than GENOA and could be updated as we determine new heuristic rules and conditions.

The Xerox computer could play an integral part in the development of AI guided instrumentation. It should be linked to the microcomputers controlling the MS/MS instrument as well as the minicomputers. It should complete a feedback loop to the instrument where it can make decisions regarding the information desired and then instruct the MS/MS instrument to perform experiments acquiring such information.

References

1. Willet, P., J. Chem. Inf. Comput. Sci., 24, 29 (1984).
2. Dromey, R. G., J. Chem. Inf. Comput. Sci., 18, 222 (1978).
3. Adamson, G. W., Cowell, J., Lynch, M. F., J. Chem. Doc., 13, 153 (1972).
4. Bawden, D., J. Chem. Inf. Comput. Sci., 23, 14 (1983).
5. Varkony, T. H., Shiloach, Y., Smith, D. H., J. Chem. Inf. Comput. Sci., 19, 104 (1979).
6. Cone, M. M., Venkataraghavan, R., McLafferty, F. W., J. Am. Chem. Soc., 99, 7668 (1977).
7. Willet, P., J. Chem. Inf. Comput. Sci., 25, 114 (1985).
8. Synge, R. L. M., J. Chem. Inf. Comput. Sci., 25, 50 (1983).
9. Kudo, Y., Chihara, H., J. Chem. Inf. Comput. Sci., 23, 109 (1983).