Michigan State University

This is to certify that the dissertation entitled

GENERATION OF SUBSTRUCTURE IDENTIFICATION RULES AND MOLECULAR STRUCTURES FROM MS/MS SPECTRA

presented by Kevin Joseph Hart has been accepted towards fulfillment of the requirements for the Ph.D. degree in Chemistry.

MSU is an Affirmative Action/Equal Opportunity Institution

GENERATION OF SUBSTRUCTURE IDENTIFICATION RULES AND MOLECULAR STRUCTURES FROM MS/MS SPECTRA

By Kevin Joseph Hart

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Department of Chemistry

1989

ABSTRACT

GENERATION OF SUBSTRUCTURE IDENTIFICATION RULES AND MOLECULAR STRUCTURES FROM MS/MS SPECTRA

By Kevin Joseph Hart

An Automated Chemical Structure Elucidation System (ACES) is being developed to provide a method for obtaining molecular structures from mass spectrometry / mass spectrometry (MS/MS) spectra. The MAPS (Method for Analyzing Patterns in Spectra) program generates rules to identify substructures in unknown compounds. This program requires a reference database of MS/MS spectra and a listing of the substructure content of each of the reference compounds to provide these rules. Two complete reference databases (using single and multiple collision conditions) were compiled, each consisting of 6,106 MS/MS spectra for 105 compounds, many of which are regulated drugs. A new version of the MAPS program has been developed to generate “feature-combination” rules. Feature-combination rules possess greater reliability and recall than those generated using previous versions of MAPS.
An Automated Structure Library Search (ASLS) function was implemented to provide the substructure listing used by the MAPS software. A commercial generator program was modified to automatically provide candidate structures for an unknown based on a list of candidate molecular formula(e) and substructures identified as present in the unknown. Both the ASLS and Automated Structure Generator (ASG) programs were modified versions of programs originally developed at Stanford University (STRCHK and GENOA).

The rules obtained from the MAPS software were evaluated by 1) the recall and reliability of the rules with respect to the reference database, 2) a comparison of the content of several rules with known fragmentation patterns and 3) application of the rules to compounds not in the reference database (test compounds). Among the substructures for which MAPS rules were generated are phenothiazine, barbiturate, t-butyl, phenol and amphetamine. Many of these rules were found to be reliable when applied to the test compounds. In some cases, the substructures for rules which produced false positives were cross-correlated with another substructure. Other false positives could be classified as “near misses” and were often due to inappropriate substructure definitions. It was concluded that while the current reference database was large enough to prove the viability of this method, it was too small to produce sufficient rules for a generalized structure elucidation system.

Copyright by Kevin Joseph Hart 1989

“It’s a terrible thing to lose one’s mind, or never to have had one.”
Newsweek
Dan Quayle, Vice President of the United States, speaking to the NAACP.

“Stress can be avoided...If your car isn’t working and that’s causing you stress, buy a new one.”
State News
This guy obviously never had to make car payments.

“There is yet time enough for you to take a different path.”
A fortune in a Chinese fortune cookie acquired by the author on the eve of his thesis defense.
“Toaster Ovens! Did you hear me? PROGRAMMABLE TOASTER OVENS!”
Bloom County
A threat issued to a “Banana JR.” personal computer by its manufacturer after it spit out a user’s “cheap software”.

ACKNOWLEDGEMENTS

I would like to thank my research advisor Chris Enke for providing his support, intellectual and moral, over the last several years. I decided to come to Michigan State University because of Chris’s interesting and active research group. It has proven to be one of the best decisions I have ever made. I have grown a great deal (admittedly not without pain) in the environment that Chris has striven to build here at State. I would also like to thank the members of my committee, Dr. Charles Sweeley, Dr. William Reusch, Dr. Thomas Atkinson and Dr. Victoria McGuffin, for their assistance.

I would like to thank George Yefchak, Eric Erickson, Jon Wahl, Mark Cole and the other members of the Enke group for their friendship and assistance. I also want to thank the “old farts”: Mark Bauer, Hugh Gregg, Bruce Newcome, Kathy Fix and Norm Penix. I would never be able to play cards as well as I do without their expert training. Fortunately I learned a great deal of science from them as well.

Pete Palmer and Adrian Wade deserve special mention. The ACES system could not have been realized without their contributions. I am grateful to them both for lending their expertise and their friendship in this endeavor. A special thank you goes to Drake Diedrich and Chris Weaver, two undergraduate programmers, for their programming efforts. The development of the ACES system is truly a joint effort of some very talented people.

I want to acknowledge the huge amount of love and support I received from my family, especially my mother, Joan. She has always been there when I needed her. In addition, I want to acknowledge my good friend Chris Evans for providing a shoulder to lean on during the bad times and for all of the entertaining experiences we shared in graduate school.
I also want to acknowledge the love and support of my very special friend, Bobette Nourse. You are, and always will be, my friend.

I would be remiss if I did not extend my regards to the MSU Department of Public Safety for towing my car while I was proctoring an exam for 300 freshman chemistry students. Their dedication to the educational effort here at MSU is truly astounding. The secretary in the Dean’s office also deserves special note. She once replied, “Does he really need it now?” and “Can he wait until next pay period?” (read that, “next month”) to an inquiry about why my paycheck was three days late. Why do Fellowships always get screwed up?

Finally, I would like to acknowledge the National Institutes of Health for funding much of this work. Finnigan MAT has also been generous in their support of our efforts. Thanks are due to Molecular Design, LTD., especially Dennis Smith, for licensing the source code to the GENOA and STRCHK programs.

TABLE OF CONTENTS

List of Tables
List of Figures

Chapter 1 “Automated Structure Elucidation using MS/MS Spectra”
  Introduction
  Historical Background
  Triple Quadrupole Mass Spectrometry
  Structure Elucidation of Pharmaceuticals Using MS/MS Data
  Automated Structure Elucidation for MS and Other Methods
  MS/MS Spectral Matching and Reference Databases
  Pattern Recognition and Other Methods
  The Automated Chemical Structure Elucidation System
  System Components
  Multiple Rulebases
  Potential for Recursive Operation
  Conclusions
  References

Chapter 2 “Automated MS/MS Spectral Data Acquisition for MAPS Reference Databases”
  Introduction
  Key Instrumental Parameters
  Automated Acquisition of MS/MS Spectra
  Data Transfer Software and Computer Facilities
  Standard Compounds Selected for the Reference Database
  Purity of the Reference Database Standards
  Irregularities in the Reference Database
  References

Chapter 3 “Generation of MAPS Substructure Identification Rules”
  Introduction
  MAPS Software Development
  The MAPS Software - Version II
  Initial Work on Feature-Combination Rules
  Feature-Combination Rules Obtained Using MAPS Version II
  The MAPS Software - Version III
  The GENT Program
  The MAPS (v. III) Program
  The RULE Program
  References

Chapter 4 “Automated Substructure Search and Structure Generation”
  Introduction
  Computerized Representation of Chemical Structure
  The Structure Generation Program, GENOA
  The Structure Checking Program, STRCHK
  Use of Substructure Search in Rule Generation
  Automatic Structure Generation
  Potential for the Reduction of the Number of Candidate Structures through Ancillary Experiments
  References

Chapter 5 “Evaluation of MAPS Feature-Combination Rules”
  Introduction
  Reliability and Recall for the MAPS Rulebase
  Analysis of Several MAPS Rules
  PHENOTHIAZINE
  BARBITURATE
  PHENOL and T-BUTYL
  Alternate MAPS Rules from Multiple Collision MS/MS Data
  Comparative Recall for MAPS Rulebases Generated from Data Acquired at Different Collision Gas Pressures
  Effect of Collision Gas Pressure on Rule Content
  Conclusions
  References

Chapter 6 “Evaluation of MAPS Rules by Application to Test Compounds”
  Introduction
  Evaluation of Individual Rules using Test Compounds
  PHENOTHIAZINE
  BARBITURATE
  PHENOL and T-BUTYL
  AMPHETAMINE
  Evaluation of Rulebases Generated using Global Parameters
  Recommendations for Future Development of MAPS
  Conclusions

LIST OF TABLES

1.1 Comparison of the number of potential spectral features in the MS and MS/MS data spaces for selected masses
1.2 List of key parameters affecting relative daughter ion intensity in a triple quadrupole mass spectrometer (adapted from reference 47)
2.1 Standard instrumental conditions for creation of MAPS reference databases using a Finnigan TSQ-70 TQMS
2.2 List of compound names and CAS numbers for each of the reference database compounds
2.3 List of nominal masses, molecular formulae and number of daughter spectra obtained for each reference database compound
3.1 Summary of the inputs to the MAPS (v.II) software
3.2 Procedure for generating MAPS (v.II) rules
3.3 Reliability and recall estimates obtained for the MAPS (v.II) PHENOTHIAZINE substructure identification rule at several different match factors
4.1 Connectivity table for chloral adapted from reference 17
4.2 DENDRAL connectivity table for hydroquinone adapted from reference 15
4.3 Atom connectivity matrix for hydroquinone adapted from reference 15
4.4 GENOA connectivity table obtained for the barbiturate substructure
4.5 Summary of major GENOA commands
4.6 List of substructure constraints (ordered by reference number) and the number of cases generated by the structure generator. The number of atoms and a descriptive name for each substructure reference is provided in parentheses.
4.7 List of substructure constraints (ordered by number of atoms) and the number of cases generated by the structure generator. The number of atoms and a descriptive name for each substructure reference is provided in parentheses.
5.1 Cross-correlation values calculated for each of the spectral features contained in the MAPS rule for the PHENOTHIAZINE substructure (shown in Figure 3.22) with respect to the “SS110” substructure
6.1 Compound name, molecular weight, molecular formula, and CAS number for each of the test compounds used to evaluate the MAPS rules
6.2 List of MAPS rules and the results obtained when the rules were applied to the test compounds (X - correct identification, F - incorrect identification).
6.3 Reliability, number of predictions and number of rules obtained for three MAPS rulebases
6.4 Rule reliabilities, number of predictions and number of rules for several MAPS rulebases obtained using the indicated parameters

LIST OF FIGURES

1.1 Block diagram of the major components of a triple quadrupole mass spectrometer
1.2 MS/MS map for isopropanol. Adapted from Yost and Enke, American Laboratory, (June, 1981).
1.3 Primary mass spectra of (a) pentobarbital and (b) amobarbital
1.4 Daughter spectra of the molecular ions of (a) pentobarbital and (b) amobarbital
1.5 Daughter spectra of the 156+ fragment ions of (a) pentobarbital and (b) amobarbital
1.6 Daughter spectra of the 141+ fragment ions of (a) pentobarbital and (b) amobarbital
1.7 Schematic of the Automated Chemical Structure Elucidation System (ACES)
1.8 The initial MAPS (v.II) rule obtained for the PHENOTHIAZINE substructure
2.1 Daughter spectra of the molecular ion of promazine with a collision energy of (a) 5 eV, (b) 15 eV and (c) 30 eV
2.2 Daughter spectra of the molecular ion of promazine with a collision energy of (a) 45 eV, (b) 60 eV and (c) 75 eV
2.3 Relative abundance of several fragment ions of meperidine plotted as a function of collision energy
2.4 Daughter spectra of the molecular ion of promazine with a collision gas (argon) pressure of (a) 0.4 mtorr, (b) 1.2 mtorr and (c) 1.8 mtorr (CE = 30 eV)
2.5 Daughter spectra of the m/z 199 fragment ion of promazine with a collision gas (argon) pressure of (a) 0.4 mtorr, (b) 1.2 mtorr and (c) 1.8 mtorr.
2.6 Relative abundance of the m/z 56 daughter ion of meperidine plotted as a function of collision energy at collision gas pressures of 0.4 mtorr, 1.2 mtorr and 1.8 mtorr.
2.7 Daughter spectra of the m/z 141 fragment ion of a variety of barbiturates (Elab = 30 eV, p = 0.4 mtorr). See Table 2.2 for compound names corresponding to the reference names provided in this figure.
2.8 Daughter spectra of the m/z 91 fragment ion of a variety of phenols (Elab = 30 eV, p = 0.4 mtorr). See Table 2.2 for compound names corresponding to the reference names provided in this figure.
2.9 Plots of collision gas pressure measured by (a) manifold ion gauge and (b) Q2 convectron gauge versus peak area of the m/z 69 fragment ion of perfluoro-tert-butylamine (PFTBA)
2.10 Primary mass spectra of PFTBA using (a) Q1 and (b) Q3 as the scanning quadrupole
2.11 The ICL procedure used to repetitively acquire 100 primary mass spectra for characterization of probe temperature programs
2.12 Reconstructed ion chromatograms of (a) the m/z 58 fragment ion, (b) the m/z 284 molecular ion and (c) the total ion current for promazine
2.13 The ICL procedure used to acquire the primary mass spectra found in the reference database
2.14 The ICL procedure used to acquire the daughter spectra found in the reference database
2.15 Schematic of the computing facilities available for running the ACES programs
2.16 Equations used to calculate the uniqueness and correlation values for the spectral features found in the reference databases
2.17 Substructure “SS143” representing the base substructure found in the structure of barbiturates
2.18 A partial primary mass spectrum (55-75 amu) of hydroxyamphetamine showing doubly charged ions at m/z 58.5 and 67.5
2.19 Daughter spectra of the (a) m/z 58.5 and (b) m/z 67.5 fragment ions of hydroxyamphetamine (Elab = 2 eV, p = 1.8 mtorr)
2.20 Daughter spectrum of the m/z 67.5 fragment ion of hydroxyamphetamine with Elab = 30 eV
2.21 Daughter spectra of (a) the doubly charged molecular ion and (b) the m/z 108.5 fragment ion of morphine (Elab = 2 eV, p = 1.8 mtorr)
2.22 Daughter spectra of the (a) m/z 98.5 and (b) m/z 99.5 fragment ions of oxymorphone (Elab = 2 eV, p = 1.8 mtorr)
2.23 Daughter spectra of the (a) m/z 112.5 and (b) m/z 113.5 fragment ions of oxymorphone (Elab = 2 eV, p = 1.8 mtorr)
2.24 Daughter spectrum of the doubly charged molecular ion of oxymorphone (Elab = 2 eV, p = 1.8 mtorr)
3.1 An excerpt from the “substructure-buckets”, SS-BUCKETS
3.2 An excerpt from the “feature-buckets”, FEATURE-BUCKETS, showing different features with the same nominal mass
3.3 Equations used to calculate the uniqueness and correlation of spectral features in MAPS for use by the U and C filters
3.4 The initial MAPS (v.II) rule obtained for the PHENOTHIAZINE substructure
3.5 Equations used to calculate the rule reliability and rule recall estimates in MAPS
3.6 The MAPS (v.I) feature-combination rule obtained for the BROMO substructure. Adapted from reference 11.
3.7 Plot of the number of initial spectral features for the BARBITURATE substructure for use in generating feature-combination rules versus minimum correlation for several minimum uniqueness values (Ui)
3.8 MAPS (v.II) feature-combination rule for the PHENOTHIAZINE substructure with a recall and reliability estimate of 100%
3.9 The GENT output file format
3.10 An excerpt from the GENT output file produced using 40 reference database compounds and two substructure definitions
3.11 An illustration of feature-combination bit string generation with backchecking
3.12 Sample output from the MAPS (v.III) program with user input highlighted in boldface type
3.13 The format of the MAPS feature-combination rule save files
3.14 Sample output from the RULE program with user input highlighted in boldface type
4.1 Structure corresponding to the WLN notation given in text. Adapted from reference 17.
4.2 A sample GENOA session
4.3 Substructure drawings for the SS145 and SS72 substructures
4.4 A demonstration of the SURVEY function of STRCHK
4.5 GENOA session illustrating the creation of a substructure definition
4.6 Format and an excerpt of the “SUBSTRLIS” file
4.7 Schematic of automated structure generation in ACES
4.8 Format and example of the “.MFG” results file
4.9 Format and example of the “.MPS” results file
4.10 Flowchart of the automatic structure generator
4.11 Sample output from the automated structure generator (ASG) program
4.12 Schematic showing the program components and interactions required to obtain ancillary experiment recommendations for the intelligent controller (IC)
4.13 Format and an example of the “.GEN” file
4.14 Sample output from the survey function using 54 candidate structures generated from the molecular formula shown
4.15 Flowchart showing the basic algorithm for obtaining ancillary experiment recommendations
5.1 Histogram of the substructure library showing the number of unique occurrences of each substructure in the reference database compounds
5.2 Histogram showing the number of initial features obtained for each substructure in the substructure library using the indicated minimum uniqueness and correlation values
5.3 Histogram showing the overall recall achieved for each substructure in the substructure library using the indicated Ui/Ci/Cc program parameters
5.4 The initial MAPS (v.II) rule obtained for the PHENOTHIAZINE substructure
5.5 Structures for several fragment ions of phenothiazine compounds. Adapted from references 4 and 6.
5.6 Comparison of documented fragmentation pathways for phenothiazine derivatives with the features contained in the MAPS PHENOTHIAZINE rule
5.7 MAPS (v.II) feature-combination rule for the PHENOTHIAZINE substructure with a recall and reliability estimate of 100%
5.8 The MAPS feature-combination rule for the PHENOTHIAZINE substructure (SS132) with the mass filter enabled and the indicated program parameters
5.9 MAPS (v.II) exclusion rule for the PHENOTHIAZINE substructure (SS132) generated using the indicated program parameters
5.10 MAPS (v.III) feature-combination rule for the PHENOTHIAZINE substructure (SS132) generated using a high initial uniqueness and low initial correlation value
5.11 Structure summary for several of the phenothiazine derivatives in the reference database
5.12 Substructure drawings for the BARBITURATE substructure and two specific derivatives
5.13 MAPS (v.III) feature-combination rule for the BARBITURATE substructure (SS141) generated using the indicated program parameters
5.14 Fragmentation pathways for the a) “SS145” and b) “SS147” substructures. Adapted from references
5.15 MAPS (v.III) feature-combination rule for the “SS145” substructure generated using the indicated program parameters
5.16 MAPS (v.III) feature-combination rule for the “SS147” substructure generated using the indicated program parameters
5.17 MAPS (v.III) feature-combination rule for the T-BUTYL substructure (SS21) generated using the indicated program parameters
5.18 MAPS (v.II) exclusion rule for the T-BUTYL substructure (SS21) generated using the indicated program parameters
5.19 MAPS (v.III) feature-combination rule for the PHENOL substructure generated using the indicated program parameters
5.20 Histogram showing the recall of the feature-combination rules obtained using Ui=30%, Ci=30% and Cc=30%
5.21 Daughter m/z values observed in a MAPS rule from a m/z 198 parent ion using single and multiple collision conditions
5.22 Additional structures for fragment ions observed from phenothiazine derivatives
5.23 MAPS (v.III) feature-combination rule for the PHENOTHIAZINE substructure generated using the indicated program parameters and the multiple collision data
6.1 Structure for each test compound listed in Table 6.1
6.2 Substructure definition for the “SS118” substructure (a) with benzylic hydrogens (initial definition) and (b) without benzylic hydrogens (new definition)

CHAPTER 1

Automated Structure Elucidation Using MS/MS Spectra

Introduction

Structure elucidation of organic compounds is often essential to resolving a variety of chemical problems in academic, industrial and governmental settings. The mass spectrometry / mass spectrometry technique (MS/MS) has played an important role in fundamental studies of ion structure, reaction mechanisms and thermochemistry, as well as being used in the study of chemical problems in the environment, natural products, industrial products, foods, forensic science, petroleum products, bioorganic compounds and pharmaceuticals [1,2]. The MS/MS technique was initially popular for its potential in analyzing mixtures but has increasingly been applied to structure elucidation [3]. Interest in automating the structure elucidation process has grown as the popularity of this technique has increased. This interest parallels that in other areas of analytical chemistry, as new developments in hyphenated techniques [4] have led to the ever-increasing ability of new instrumentation to generate large quantities of multidimensional data [5].
Despite the increasing involvement of computers in MS/MS instrumentation, automation of the interpretation of MS/MS spectral data continues to be one of the foremost challenges to workers in this field. A similar demand occurred with the introduction of GC/MS instruments, which are also capable of producing huge quantities of data. Advances in the collection of multi-order mass spectra have drastically increased the potential size of the MSn data space from which spectral feature / molecular structure correlations can be derived. These advances include the ability to collect a complete MS/MS map in a few seconds [6] and instruments that acquire multi-order mass spectra (MSn, where n > 1), such as Fourier transform mass spectrometers (FT-MS), which are capable of five consecutive stages of MS (n=5) [7], and ion trap mass spectrometers (ITMS), which are capable of twelve stages (n=12) [8].

The combination of GC with MS/MS has become more powerful with the increase in MS/MS data acquisition speed [6]. This combination is a compromise between the length of time that a chromatographic peak is available for analysis as it emerges from the column and the number of MS/MS spectra that can be acquired during this time. New developments in MS/MS instrumentation based on tandem time-of-flight mass analysis are especially interesting since the most ambitious of these promise to deliver full MS/MS fragmentation maps on the capillary GC time scale [9]. Thus, higher resolution chromatography (i.e. capillary GC vs. packed column GC) can be used to separate complex mixtures without the loss of the structural information contained in the MS/MS spectra. If such a system is realized, a data system that fully exploits the information contained in the MS/MS data space will be required to speed the analysis of mixtures.

The full MS/MS data space (the MS/MS map) consists of a primary mass spectrum and daughter spectra for each of the m/z values in the primary mass spectrum.
The information contained in all of the major MS/MS scans on instruments such as the triple quadrupole mass spectrometer (TQMS) [10] resides in this data space. The problem facing chemists is to extract the structurally relevant information from this data space and use it to deduce the molecular structure of unknowns. This problem can be quite challenging since the MS/MS data space increases much more rapidly with mass than the MS data space. A comparison of the number of potential features in the MS and MS/MS data spaces is provided in Table 1.1. The number of potential features in the MS/MS data space is given by the following equation:

    0.5 * (n^2 + 5n)

where n is the nominal mass of the largest ion produced by a molecule (i.e. the molecular mass, plus isotope contributions, if any). For example, the molecular ion of isopropanol has a nominal mass of 60 amu and the number of potential features in the MS/MS data space is 1,950 (60 primary m/z, 60 daughter m/z, 60 neutral loss masses and 1770 specific parent-daughter combinations) versus 60 in the MS data space.

    Mass Range (amu)      Number of Features
                          MS          MS/MS
    60                    60          1,950
    500                   500         126,250
    4000                  4000        8,010,000

Table 1.1: Comparison of the number of potential spectral features in the MS and MS/MS data spaces for selected masses.

The m/z limit for some of the early TQMS instruments was 500 (potentially 126,250 features in the MS/MS data space) but new instruments are capable of 4000 (potentially 8,010,000 features in the MS/MS data space). While not all of the data channels in the MS and MS/MS data spaces are utilized for a given compound, it is significant that the amount of structural data derived from MS/MS experiments greatly exceeds that for MS experiments. It should be noted that this information can be multiplied when the number of possible instrumental conditions (e.g. ionization method, collision energy, collision gas pressure, etc.) is considered.

A further complication in analyzing MS/MS data is the variety of instruments capable of producing MS/MS data with varying degrees of spectral resolution (e.g. TQMS: unit resolution [10], FTMS: high resolution [11], sector instruments: less than unit resolution [12]), different energetics in the process used to obtain daughter ions (e.g. sectors: keV range [13], TQMS: 5-150 eV [14]), and assorted ionization techniques to provide the parent ions (e.g.
70 eV electron impact (EI), chemical ionization (CI) with a wide selection of reagent gases, fast atom bombardment and other ionization techniques [15]). While much has been done to provide sophisticated data acquisition systems [16-18], no data interpretation software has yet been developed that fully exploits the information obtainable from MS/MS instruments such as the TQMS. Any data system capable of interpreting MS/MS spectra must take into account the size of the data space and the versatility of the MS/MS technique to be of general use to the scientific community.

The following section provides some historical background on the TQMS instrument, the value of MS/MS data in the analysis of drug compounds, methods used in automating the analysis of conventional mass spectra and previous attempts at automated interpretation of MS/MS data. The final section of this chapter provides an overview of the Automated Chemical Structure Elucidation System (ACES). This system provides software tools for the automated interpretation of MS/MS data from unknown compounds and generation of candidate structures for the unknown.

HISTORICAL BACKGROUND

Triple Quadrupole Mass Spectrometry

The major components of a TQMS instrument are an inlet system, an ion source, two quadrupole mass filters, one quadrupole collision chamber, a variety of lenses for focusing ions and a detector. The orientation of these components is shown in Figure 1.1. The purpose of the inlet system is to introduce the sample in the gas phase into the ion source. Diverse inlet systems are available for TQMS instruments, including a liquid inlet for volatile liquids, a heated direct insertion probe for solid samples and a gas chromatograph transfer inlet for GC/MS or GC/MS/MS. Once the sample is in the ion source, a number of ionization methods exist to produce ions from the molecular sample. These methods include 70 eV electron impact (EI), chemical ionization (CI) and fast atom bombardment (FAB) [15].
After the sample is ionized, there are a number of experiments, or scan modes, that can be used for structure elucidation of the sample. The most common scan mode is the daughter scan. This scan is performed by selecting a specific m/z ratio for Q1, passing the ions with that m/z ratio into Q2, collisionally dissociating the ions into fragments and acquiring the mass spectrum of the fragment ions produced using Q3. The inlet system and ionization method used to create the database for this work were the direct insertion probe and EI ionization. The database consists of primary and daughter spectra for a number of compounds. The creation of this database is discussed in detail in Chapter 2.

Figure 1.1: Block diagram of the major components of a triple quadrupole mass spectrometer.

The MS/MS map obtained for isopropanol on a TQMS instrument is shown in Figure 1.2. The m/z ratios found along the front edge of the figure comprise the primary mass spectrum. The m/z ratios for the diagonal edge beginning on the right and moving up and to the left constitute the daughter spectrum for the specified parent ion. The parent scan (the spectrum of parent ions yielding a specified daughter) is found along a diagonal edge beginning on the left and moving up and to the right.

Figure 1.2: MS/MS map for isopropanol. Adapted from Yost and Enke, American Laboratory (June, 1981).

Beynon and coworkers demonstrated in 1978 that the advantages of the MS/MS technique for structure elucidation include: i) daughter spectra give clear indication of the parent ion structure, ii) the molecular formula can be deduced without high resolution mass spectrometry, iii) the presence of particular substructures can be established with certainty, and iv) library searches of spectra related to substructures are possible [19].
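The scan modes described above can be viewed as different slices through the MS/MS map, and the feature counts quoted for Table 1.1 follow directly from the expression 0.5 * (n^2 + 5n). The following Python sketch is illustrative only: the function names and the small map of intensities are hypothetical and are not taken from the isopropanol data in Figure 1.2 or from any ACES program.

```python
def potential_features(n: int) -> int:
    """Potential MS/MS features for a largest nominal mass n: 0.5 * (n^2 + 5n)."""
    # n * (n + 5) is always even, so integer division is exact.
    return (n * n + 5 * n) // 2


# Toy MS/MS map: parent m/z -> {daughter m/z: relative intensity}.
# All values are invented for illustration.
msms_map = {
    60: {45: 100.0, 43: 20.0},
    45: {43: 35.0, 27: 10.0},
    43: {27: 50.0},
}


def daughter_scan(msms, parent):
    """All daughters produced by one selected parent ion."""
    return dict(msms.get(parent, {}))


def parent_scan(msms, daughter):
    """All parents that yield one selected daughter ion."""
    return {p: d[daughter] for p, d in msms.items() if daughter in d}


def neutral_loss_scan(msms, loss):
    """All parents that lose a neutral fragment of the given nominal mass."""
    return {p: d[p - loss] for p, d in msms.items() if (p - loss) in d}


if __name__ == "__main__":
    for mass in (60, 500, 4000):
        print(mass, potential_features(mass))
    print(daughter_scan(msms_map, 45))
    print(parent_scan(msms_map, 43))
    print(neutral_loss_scan(msms_map, 15))
```

Storing the map as a dictionary keyed by parent m/z keeps the daughter scan a direct lookup, while the parent and neutral loss scans are computed by searching across all parents.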
The determination of molecular structure of pharmaceuticals using MS/MS spectra is discussed in the next section to illustrate the value of MS/MS spectra for this purpose.

Structure Elucidation of Pharmaceuticals Using MS/MS Data

Drug testing and toxicology require the determination of drugs and their metabolites in biological samples. A number of recent reviews of the application of the MS/MS technique to structure determination of pharmaceuticals demonstrate the utility of the technique in these important analyses [20-23]. Cooks and coworkers have reported the utility of mass-analyzed-ion-kinetic-energy (MIKES) spectra in the identification of closely related barbiturates [24]. In this study, protonated molecular ions of a series of barbiturates were obtained by CI using methane as a reagent gas. Daughter spectra for these molecular ions were then obtained and compared. Among the findings of this study were the marked differences found in the daughter spectra of two isomeric barbiturates (amobarbital and pentobarbital). The primary mass spectra of these compounds, however, were closely similar. Thus, MS/MS spectra could be used to differentiate between these two compounds while primary mass spectra could not.

The primary mass spectra and selected daughter spectra for these isomeric barbiturates are examined here to illustrate the ability of MS/MS to differentiate closely related compounds. They are among the spectra collected to create the reference databases used by the ACES software. The primary mass spectra and molecular structures for pentobarbital and amobarbital are shown in Figure 1.3. As can be seen, the spectra are quite similar. In fact, when the amobarbital primary mass spectrum was searched against the library provided with the TQMS instrument, pentobarbital was the closest matching compound (with a high degree of match). The daughter spectra for the 197+ ion of both compounds are shown in Figure 1.4.
Note the presence of the 155+ daughter ion (neutral loss of 42 amu) in the daughter spectrum of pentobarbital but not in the daughter spectrum of amobarbital. Recall that the base peak in the primary mass spectrum is 156+. There are also differences in the smaller 129+, 112+, 58+ and 57+ daughter ions. Both spectra have strong 141+ daughter ions corresponding to a neutral loss of 56 amu. Possible structures for the 197+ parent ions are also shown in Figure

Figure 1.8: The initial MAPS (v.II) rule obtained for the PHENOTHIAZINE substructure (Umin = 40%, Cmin = 70%).

rules are then applied to the MS/MS data obtained. The result is a list of substructures present in or absent from the structure of the unknown compound. A match factor (MF) must be specified to set the minimum number of rule clauses that must be true for a substructure identification to be made. A match factor of 100% means that all the clauses in the rule must be true for the system to conclude that the indicated substructure is present in the structure of the unknown. As will be shown in Chapter 3, there is a tradeoff between the reliability and the recall of the rule using this method. The reliability is an estimate of the accuracy of the rule. The recall of the rule is the frequency with which the rule correctly predicted the presence of the substructure when applied to the data for the standard compounds in the reference database. A new method for generating substructure identification rules that eliminates the reliability / recall tradeoff was described and tested by Palmer [35].
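The role of the match factor and the definition of recall given above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MAPS implementation: the function names, the representation of a rule clause as a (parent m/z, daughter m/z) pair and the example feature sets are all hypothetical.

```python
def rule_fires(observed_features, rule_clauses, match_factor):
    """True if the percentage of rule clauses found in the observed
    features is at least the match factor (0-100 %)."""
    hits = sum(1 for clause in rule_clauses if clause in observed_features)
    return 100.0 * hits / len(rule_clauses) >= match_factor


def recall(rule_clauses, match_factor, reference_compounds):
    """Percentage of reference compounds containing the substructure
    for which the rule fires."""
    fired = sum(
        rule_fires(features, rule_clauses, match_factor)
        for features in reference_compounds
    )
    return 100.0 * fired / len(reference_compounds)


# Hypothetical rule: three parent -> daughter features.
rule = [(198, 154), (198, 166), (70, 27)]

# Hypothetical feature sets for four standards that contain the substructure.
standards = [
    {(198, 154), (198, 166), (70, 27)},
    {(198, 154), (70, 27)},
    {(198, 154)},
    set(),
]

if __name__ == "__main__":
    for mf in (100, 60, 30):
        print(mf, recall(rule, mf, standards))
```

With these invented data, lowering the match factor raises the recall, which is one side of the reliability / recall tradeoff mentioned above; estimating reliability would additionally require compounds that lack the substructure.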
The implementation of this new method and its application to generation of substructure identification rules using the new reference database are also described in Chapter 3.

The molecular weight of the unknown compound is also required. This piece of information may not be present in EI spectra, so a CI experiment (a "soft ionization" technique that often yields molecular ions) may be necessary. The molecular weight and the elemental compositions of the substructures found to be present in the structure of the unknown compound are input into the molecular formula generator (MFG) [60]. The MFG, developed by Peter Palmer, is an exhaustive molecular formula generator that uses the nominal mass of a molecule and elemental constraints to provide a list of candidate molecular formula(e). Palmer has also developed the CARBON program, which uses the ratios of daughter ions found in the daughter spectrum of the M+1 ion to determine the number of carbon atoms in the molecular ion [60]. Bozorgzadeh has recently described further development and generalization of this method for the determination of all the constituent elements in daughter ions [63,64]. The MFG will also accept an exact mass if this information is available.

The molecular formula(e) and the substructures found to be present in or absent from the structure of the unknown compound are used as constraints in GENOA, an exhaustive structure generation program [41]. GENOA provides a set of candidate structures that are consistent with the input constraints. A series of articles on the application of GENOA to structure elucidation problems using a variety of spectroscopic data to infer structural constraints has been published [65-71]. The use of GENOA and structure generation in general was a major focus of a recent book on computer-assisted structure elucidation [72]. The changes made to GENOA to incorporate the software into ACES are discussed in Chapter 4.
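The idea behind an exhaustive molecular formula generator of the kind described above for the MFG can be sketched as follows. This is not Palmer's program: the function names, the restriction to C, H, N and O, and the use of minimum element counts to stand in for the elemental compositions of identified substructures are assumptions made for illustration, and no valence or chemical plausibility checks are applied.

```python
# Nominal masses of the elements considered in this sketch.
NOMINAL_MASS = {"C": 12, "H": 1, "N": 14, "O": 16}


def candidate_formulas(nominal_mass, min_counts=None):
    """Enumerate every combination of C, H, N and O counts whose nominal
    masses sum to nominal_mass, honouring minimum counts per element."""
    min_counts = min_counts or {}
    elements = list(NOMINAL_MASS)
    results = []

    def extend(index, remaining, counts):
        if index == len(elements):
            if remaining == 0:
                results.append(dict(counts))
            return
        element = elements[index]
        mass = NOMINAL_MASS[element]
        lowest = min_counts.get(element, 0)
        for k in range(lowest, remaining // mass + 1):
            counts[element] = k
            extend(index + 1, remaining - k * mass, counts)
        counts.pop(element, None)

    extend(0, nominal_mass, {})
    return results


def to_string(counts):
    """Format element counts as a formula string such as C3H8O."""
    parts = []
    for element in NOMINAL_MASS:
        n = counts.get(element, 0)
        if n > 0:
            parts.append(element + (str(n) if n > 1 else ""))
    return "".join(parts)


if __name__ == "__main__":
    # Nominal mass 60 with at least three carbons and one oxygen required.
    constrained = candidate_formulas(60, {"C": 3, "O": 1})
    print([to_string(f) for f in constrained])
    print(len(candidate_formulas(60)))
```

The constrained call illustrates how elemental constraints shrink the list of candidates compared with the unconstrained enumeration at the same nominal mass.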
One of the key issues in using this approach is the accuracy of the input constraints. Each of the candidate structures that GENOA produces is completely consistent with the input constraints. If one of the input constraints is wrong, then each of the candidate structures is wrong. This fact leads to the paramount importance of accurate substructure and molecular formula determinations and accounts for the continuing search for better ways to generate reliable substructure identification rules. The MAPS rules are evaluated in Chapter 5 by comparison of the features found in the rules with those contained in documented fragmentation pathways, and in Chapter 6 by application to 20 test compounds. The evaluation of MAPS rules and subsequent refinement of the rule generation software are among the major accomplishments of this research.

Multiple Rulebases

MS/MS databases for reference compounds acquired under different operating conditions are being compiled to assess the viability of recommending ancillary experiments for reduction of the number of candidate structures for an unknown. This reduction can be achieved by identifying substructures that were previously unidentified using another set of operating conditions. The variables that may be explored include ionization conditions (e.g. electron impact, chemical ionization, FAB), ion polarity (e.g. positive ions or negative ions) and collision conditions (e.g. collision energy, collision gas pressure, target gas, reactive collisions). It is not expected that one set of instrumental conditions will be optimal for all substructures but rather that a number of instrumental conditions will be found which are best suited for some fraction of the substructures being investigated. As will be seen in the next section, these rulebases will not only be useful in obtaining new and better MAPS rules, but may also be useful in reducing the number of candidate structures obtained using any one of the rulebases.
Potential for Recursive Operation

In general, systems which use structure generators provide a method for ranking candidate structures to assist the user in determining the most likely candidate structure for an unknown compound. The approach to be used in a forthcoming version of ACES does not rank candidate structures but instead seeks to assist the user in identifying ancillary experiments which will potentially reduce the number of candidate structures. Implementation of this approach will establish a "feedback loop" which includes the instrument (the TQMS in this case) [73,74]. This approach preserves the experimental versatility that MS/MS instrumentation can provide to the structure elucidation chemist. The dashed box in Figure 1.7 highlights the components necessary to implement this approach.

The first step is to perform a substructure analysis of the candidate structures using a modified version of STRCHK (discussed in Chapter 4). The original version of STRCHK was developed as part of the DENDRAL project. The modifications made to STRCHK provide the level of automation necessary for integration of this program into ACES. This program outputs a list of substructures that differentiate among the candidate structures (discriminating substructures). MAPS rules that have been collected under different operating conditions (e.g. a higher collision gas pressure to provide more collisions) could then be consulted by the EXPT program (currently under development) to determine if a new experiment exists that can confirm the presence of the discriminating substructures. Thus, multiple rulebases must be generated using data obtained under a variety of conditions to implement this feature of ACES. Once the EXPT program has selected an experiment, the new instrumental conditions can be passed to the user (or directly to the instrument control software in a totally automated system) to be implemented.
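The selection step described above can be illustrated with a small sketch. It is not the STRCHK or EXPT code: the function names, the representation of each candidate structure as a set of substructure names, and the example rulebases and condition labels are hypothetical. A discriminating substructure is taken here to be one that occurs in some, but not all, of the candidate structures.

```python
def discriminating_substructures(candidates):
    """candidates maps a candidate structure name to the set of
    substructures it contains. Returns the substructures found in
    some but not all of the candidates."""
    substructure_sets = list(candidates.values())
    present_in_any = set().union(*substructure_sets)
    present_in_all = set.intersection(*substructure_sets)
    return present_in_any - present_in_all


def suggest_experiments(discriminating, rulebases):
    """rulebases maps a label for a set of operating conditions to the
    substructures for which that rulebase holds a rule. Returns, per
    rulebase, the discriminating substructures it could test."""
    suggestions = {}
    for conditions, covered in rulebases.items():
        testable = sorted(discriminating & covered)
        if testable:
            suggestions[conditions] = testable
    return suggestions


# Hypothetical candidate structures and rulebases.
candidates = {
    "candidate A": {"PHENOL", "T-BUTYL"},
    "candidate B": {"PHENOL"},
    "candidate C": {"PHENOL", "BARBITURATE"},
}
rulebases = {
    "EI, lower collision gas pressure": {"PHENOL"},
    "EI, higher collision gas pressure": {"PHENOL", "T-BUTYL"},
}

if __name__ == "__main__":
    found = discriminating_substructures(candidates)
    print(sorted(found))
    print(suggest_experiments(found, rulebases))
```

In this invented case PHENOL is common to every candidate and therefore cannot discriminate among them, so only the rulebase holding a T-BUTYL rule is suggested.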
The list of candidate structures can then be pruned depending on the results of the new experiment. The goal of this pruning is to reduce the list of candidate structures to one, the structure of the unknown compound.

This approach is quite different from the attempts to use only a single, standardized set of instrumental conditions to provide automated interpretation of MS/MS spectra. There is a historical precedent for the standardization approach in conventional mass spectrometry, which uses 70 eV electron impact ionization for creation of spectral libraries. The value of 70 eV was chosen because the differences in fragmentation observed using this ionization method "plateaued" around this value. In other words, the amount of fragmentation (information) peaked around 70 eV for electron impact ionization. An analogous situation in MS/MS does not exist since the information obtainable from MS/MS spectra does not plateau around a given set of instrumental conditions. While the requirement that standards be run under each of the desired instrumental conditions may be considered a disadvantage, it has the decided advantage of preserving the versatility of the MS/MS technique.

Conclusions

In a recent book on MS/MS, Busch and co-authors provided an outlook for this important analytical technique. A critical area of research that was cited was the computerized interpretation of MS/MS spectra. “As computer systems are more fully integrated with the control of the MS/MS instrument, the details and quality of the MS/MS data will come to depend more explicitly on the design and operation of the computer / mass spectrometer interface.
This is especially so with computer-controlled triple-quadrupole mass spectrometers, in which the operating characteristics of each of the three quadrupoles are under direct computer control...To develop new experiments, or for the most demanding applications, mass spectrometrists cannot allow themselves to be drawn into situations in which their instrumental options are limited [1].”

We believe the automated structure elucidation system described here conforms to the spirit of the quotation above. It allows the user to utilize reference databases containing MS/MS spectra acquired under any set of MS/MS operating conditions. The only constraint is that the user be consistent within a database. Thus, multiple rulebases can be generated using complementary sets of MS/MS instrumental conditions. These rulebases will provide the basis for recommendation of ancillary experiments by a forthcoming version of the ACES software. This structure elucidation expert system (ACES) has the distinct advantage of allowing the use of optimal MS/MS instrumental conditions in solving structure elucidation problems.

REFERENCES

1. Busch, K.L., Glish, G.L., McLuckey, S.A., "Mass Spectrometry / Mass Spectrometry: Techniques and Applications of Tandem Mass Spectrometry", VCH Publishers, Inc., New York, 1988.
2. McLafferty, F.W., "Tandem Mass Spectrometry", John Wiley & Sons, New York, 1983.
3. Burlingame, A.L., Maltby, D., Russel, D.H., Holland, P.T., Anal. Chem., 60, 300R (1988).
4. Hirschfeld, T., Anal. Chem., 52, 297A (1980).
5. Gurka, D.F., Betowski, L.D., Hinners, T.A., Heithmar, E.M., Titus, R., Henshaw, J.M., Anal. Chem., 60, 454A (1988).
6. Eckenrode, B.E., Watson, J.T., Holland, J., Enke, C.G., [journal and volume illegible], 177 (1988).
7. Laukein, F.H., Abstracts of Papers, 14th Annual FACSS Meeting, Detroit, MI (1987); American Chemical Society, Abstract 524, Washington, D.C. (1987).
8. Louris, J.M., Brodbelt-Lustig, J.S., Cooks, R.G., Glish, G.L., Van Berkel, G.J., McLuckey, S.A., [journal illegible], submitted.
9. Enke, C.G., personal communication, November, 1989.
10. Yost, R.A. and Enke, C.G., J. Am. Chem. Soc., 100, 2274 (1978).
11. Cody, R.B. and Freiser, B.S., [journal illegible], 55, 571 (1983).
12. Beynon, J.H., Caprioli, R.M., Ast, T., [journal illegible], 2, 229 (1971).
13. Busch, K.L., et al., ibid., p. 75.
14. ibid., p. 78.
15. Watson, J.T., "Introduction to Mass Spectrometry", Raven Press, New York, 1985.
16. Johnson, R.H. and Steiner, U., Pittsburgh Conference and Exposition Abstracts, Atlantic City, NJ, 1986, abstract #541.
17. Lammbert, S.A., Chapman, W.K., Steiner, U., and Schoen, A.E., Pittsburgh Conference and Exposition Abstracts, Atlantic City, NJ, 1987, abstract #1105.
18. Parr, V.C., Waddicor, J., and Wood, D., Pittsburgh Conference and Exposition Abstracts, Atlanta, GA, 1989, abstract #1523.
19. Bozorgzadeh, M.H., Morgan, R.P., Beynon, J.H., Analyst, 103, 613 (1978).
20. Busch, K.L., et al., ibid., p. 265.
21. McLafferty, F.W., ibid., p. 385.
22. Yost, R.A., Perchalski, R.J., Brotherton, H.O., Johnson, J.V., Budd, M.B., [journal illegible], 31, 929 (1984).
23. Straub, K.M., Rudewicz, P., Garvic, C., [journal illegible], 11, 413 (1987).
24. Soltero-Rigau, E., Kruger, T.L., Cooks, R.G., Anal. Chem., 49, 435 (1977).
25. Perchalski, R.J., Lee, M.S., Yost, R.A., [journal illegible], 26, 435 (1986).
26. Perchalski, R.J., Yost, R.A., Wilder, B.J., [journal and volume illegible], 1466 (1982).
27. Covey, T.R., Lee, E.D., Henion, J.D., Anal. Chem., 58, 2453 (1986).
28. Small, G.W., Anal. Chem., 59, 535A (1987).
29. Barr, A., Feigenbaum, E.A., "The Handbook of Artificial Intelligence", Volume 2, HeurisTech, Stanford (1982).
30. Lindsay, R.K., Buchanan, B.G., Feigenbaum, E.A., Lederberg, J., "Applications of Artificial Intelligence for Organic Chemistry - The Dendral Project", McGraw-Hill, New York, 1980.
31. Munk, M.B., Christie, E.D., [journal illegible], 216, 57 (1989).
32. Sasaki, S., Fujiwara, I., Abe, H., Yamasaki, T., Anal. Chim. Acta, 121, 87 (1980).
33. Fujiwara, I., Okuyama, T., Yamasaki, T., Abe, H., Sasaki, S., [journal and volume illegible], 527 (1981).
34. Dubois, J.E., Carabedian, M., Dagane, I., Anal. Chim. Acta, 158, 217 (1984).
35. Palmer, P.T., Ph.D. Dissertation, Michigan State University, East Lansing, MI, 1988.
36. Martinsen, D.P., Sung, B., [journal illegible], 4, 461 (1985).
37. Stauffer, D.B., Loh, S.Y., Henry, K.D., Twiss-Brooks, A.B., McLafferty, F.W., 35th ASMS Conference on Mass Spectrometry and Allied Topics, Denver, CO, p. 391 (1987).
38. McLafferty, F.W., Stauffer, D.B., [journal illegible], 25, 245 (1985).
39. Dayringer, H.E., Pesyna, G.M., Venkataraghavan, R., McLafferty, F.W., [journal illegible], 11, 529 (1976).
40. Buchanan, B.G., Smith, D.H., White, W.C., Gritter, R.J., Feigenbaum, E.A., Lederberg, J., Djerassi, C., J. Am. Chem. Soc., 98, 6168 (1976).
41. Carhart, R.E., Smith, D.H., Gray, N.A.B., Nourse, J.G., Djerassi, C., J. Org. Chem., 46, 1708 (1981).
42. Crawford, R.W., Brand, H.R., Wong, C.H., Gregg, H.R., Hoffman, P.A., and Enke, C.G., Anal. Chem., 56, 1121 (1984).
43. Cross, K.P., and Enke, C.G., [journal and volume illegible], 175 (1986).
44. Cross, K.P., Palmer, P.T., Beckner, C.F., Giordani, A.B., Gregg, H.R., Hoffman, P.A., and Enke, C.G., in Pierce, T.H., Hohne, B.H. (Eds.), “Artificial Intelligence in Chemistry”, ACS Symposium Series No. 306, American Chemical Society, Washington, D.C., 1986, p. 321.
45. Dawson, P.H., Sun, W., [journal illegible], 55, 155 (1983/1984).
46. Martinez, R.I., Cooks, R.G., 35th ASMS Conference on Mass Spectrometry and Allied Topics, Denver, CO, p. 1175 (1987).
47. Martinez, R.I., Dheandhanoo, S., [journal and volume illegible], 1 (1988).
48. Martinez, R.I., [journal and volume illegible], 8 (1988).
49. Martinez, R.I., [journal and volume illegible], 127 (1989).
50. Dheandhanoo, S., [journal and volume illegible], 266 (1988).
51. Martinez, R.I., 37th ASMS Conference on Mass Spectrometry and Allied Topics, Miami Beach, FL, in press (1989).
52. Weber, J.J., Thuijl, J.V., De Jong, J., [journal and volume illegible], 195 (1986).
53. Giblin, D.B., Peake, D.A., Lapp, R.L., 32nd ASMS Conference on Mass Spectrometry and Allied Topics, San Antonio, TX, p. 644 (1984).
54. Zidarov, D., Bertrand, J., 37th ASMS Conference on Mass Spectrometry and Allied Topics, Miami Beach, FL, in press (1989).
55. Bertrand, J., Zidarov, D., 36th ASMS Conference on Mass Spectrometry and Allied Topics, San Francisco, CA, p. 1384 (1988).
56. Johnson, R.S., Biemann, K., 36th ASMS Conference on Mass Spectrometry and Allied Topics, San Francisco, CA, p. 1398 (1988).
57. Enke, C.G., Wade, A.P., Palmer, P.T., Hart, K.J., Anal. Chem., 59, 1363A (1987).
58. Wade, A.P., Palmer, P.T., Hart, K.J., Enke, C.G., [journal illegible], 215, 169 (1988).
59. Palmer, P.T., Hart, K.J., Enke, C.G., [journal illegible], 36, 107 (1989).
60. Palmer, P.T., Enke, C.G., [journal and volume illegible], 81 (1989).
61. Hart, K.J., Wade, A.P., Palmer, P.T., Nourse, B., Enke, C.G., [journal illegible], accepted.
62. Hart, K.J., Enke, C.G., [journal illegible], in press.
63. Bozorgzadeh, M.H., [journal and volume illegible], 61 (1988).
64. Bozorgzadeh, M.H., [journal illegible], 21, 712 (1988).
65. Gray, N.A.B., Buchs, A., Smith, D.H., Djerassi, C., [journal and volume illegible], 458 (1981).
66. Smith, D.H., Gray, N.A.B., Nourse, J.G., Crandell, C.W., Anal. Chim. Acta, [volume illegible], 471 (1981).
67. Crandell, C.W., Gray, N.A.B., Smith, D.H., [journal illegible], 21, 48 (1982).
68. Lindley, M.R., Gray, N.A.B., Smith, D.H., Djerassi, C., [journal illegible], 41, 1027 (1982).
69. Egli, H., Smith, D.H., Djerassi, C., [journal illegible], 61, 1898 (1982).
70. Djerassi, C., Smith, D.H., Crandell, C.W., Gray, N.A.B., Nourse, J.G., Lindley, M.R., [journal and volume illegible], 2425 (1982).
71. Lindley, M.R., Shoolery, J.N., Smith, D.H., Djerassi, C., Org. Magn. Reson., 21, 405 (1983).
72. Gray, N.A.B., "Computer-Assisted Structure Elucidation", John Wiley and Sons, New York, NY (1986).
73. Hart, K.J., Palmer, P.T., Enke, C.G., 36th ASMS Conference on Mass Spectrometry and Allied Topics, San Francisco, CA, p. 1388 (1988).
74. Hart, K.J., Enke, C.G., 37th ASMS Conference on Mass Spectrometry and Allied Topics, Miami Beach, FL, in press (1989).
CHAPTER 2

Automated MS/MS Spectral Data Acquisition for MAPS Reference Databases

Introduction

Reference databases containing infrared spectra and mass spectra are routinely used for the identification of molecular structure by matching the spectrum of an unknown with spectra of reference compounds [1]. The MAPS (Method for Analyzing Patterns in Spectra) software attempts to deduce spectral feature / substructure correlations from reference spectra for use in determining the presence of substructures in unknown compounds [2]. An advantage of the MAPS approach is that the number of spectra required to glean these correlations is much smaller than a typical database used for spectral matching. An additional advantage of this approach is the ability to predict the presence (or absence) of a substructure in an unknown compound regardless of whether the spectrum for the unknown compound is contained in the database. Only a sufficient number of compounds with a substructure to produce a reliable rule are required.

The instrumental conditions used to acquire the spectra for a given reference database are important and must be maintained within the database. For the analysis of unknowns, the instrumental conditions used must be the same as those used to create the database from which the MAPS rules were generated. The instrumental conditions used to acquire MS/MS spectra for three reference databases are discussed in the next section. The procedures used for automated data acquisition and the data transfer software are then described. The scope of the MAPS rules will depend upon the diversity of the substructures contained in the standard compounds used to create the database. Thus, a description of the standard compounds selected for the reference databases is provided in a subsequent section. A discussion of some of the irregularities discovered in the databases completes this chapter on reference databases.
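The requirement that an unknown be analyzed under the same instrumental conditions as the reference database can be expressed as a simple consistency check. The sketch below is only an illustration: the class name, the choice of fields and the numerical values are hypothetical and do not describe the conditions actually used for the reference databases.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingConditions:
    """A hypothetical record of the conditions attached to a database."""
    ionization: str
    collision_energy_ev: float
    collision_gas_pressure_mtorr: float
    target_gas: str


def mismatches(database, unknown, rel_tol=1e-6):
    """Return the names of the conditions that differ between the
    reference database and the spectrum of an unknown."""
    differing = []
    if database.ionization != unknown.ionization:
        differing.append("ionization")
    if not math.isclose(database.collision_energy_ev,
                        unknown.collision_energy_ev, rel_tol=rel_tol):
        differing.append("collision_energy_ev")
    if not math.isclose(database.collision_gas_pressure_mtorr,
                        unknown.collision_gas_pressure_mtorr, rel_tol=rel_tol):
        differing.append("collision_gas_pressure_mtorr")
    if database.target_gas != unknown.target_gas:
        differing.append("target_gas")
    return differing


# Invented example values.
reference = OperatingConditions("EI", 30.0, 0.4, "Ar")
same_run = OperatingConditions("EI", 30.0, 0.4, "Ar")
other_run = OperatingConditions("EI", 60.0, 0.4, "Ar")

if __name__ == "__main__":
    print(mismatches(reference, same_run))
    print(mismatches(reference, other_run))
```

An empty list indicates that rules generated from the database may be applied to the unknown; any entry names a condition that would have to be brought into agreement first.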
Key Instrumental Parameters

The instrumental parameters that affect the relative intensity of daughter ions and the set of daughter ions obtained in a daughter mass spectrum have been studied in depth for the purpose of defining a set of standard instrumental conditions [3,4]. An instrument-independent CAD (collisionally assisted dissociation) database is the ultimate goal of the research into standardized operating conditions [5]. Palmer has also studied these instrumental parameters for two different TQMS instruments [6]. Collision energy and collision gas pressure were found to be the most significant instrumental parameters since they affected not only the relative intensities of the daughter ions but also which daughter ions were obtained [6]. Another important parameter, the target gas, was constant in these comparisons.

The MAPS software relies on the appearance of characteristic spectral features within the MS/MS data space rather than their relative intensities in the development and application of substructure identification rules. Thus, only those instrumental parameters that affect the appearance of spectral features need to be strictly controlled. Therefore, collision energy and collision gas pressure were carefully controlled for the acquisition of MS/MS spectra to be included in the reference databases described in this chapter. However, other parameters that affect instrument performance should not be completely discounted. A carefully tuned instrument is important for optimal ion transmission and mass assignment. Also, some MS/MS methods such as energy-resolved mass spectrometry (ERMS) rely heavily on the relative intensity of daughter ions. This method is often used for isomer distinction. Since reproducible daughter ion intensities are difficult to obtain, energy-resolved mass spectra are likely to remain instrument dependent.

Ion activation in the "eV" region (i.e.
1-100 eV) occurs predominantly via excitation of vibrational modes of the electronic ground state. The amount of internal energy deposited in a collision between a neutral collision gas and a polyatomic ion is directly correlated with collision energy [7]. The collision energy is defined by the difference in potential between the source and the Q2 offset on a TQMS instrument. Thus, the collision energy determines the kinetic energy that an ion has on entering the reaction region (Q2). It has been shown for n-butylbenzene that the internal energy of parent ions after collision increases with collision energy up to 40 eV (laboratory energy), with about 50% of the collision energy being converted to internal energy; the fraction of the collision energy converted to internal energy decreases as the collision energy increases over the entire range tested [7].

The effect of increasing collision energy on the daughter spectrum of the molecular ion of promazine is shown in Figures 2.1 and 2.2. These spectra were obtained with a fixed collision gas pressure (0.4 mtorr). Few fragment ions are obtained at 5 eV since the internal energy of the parent ion after collision is lower than the critical dissociation energy for a number of the fragment ions. At a higher energy, 30 eV, a number of new fragment ions appear in the spectrum. The efficiency of fragmentation has also increased, since the relative intensity of the parent ion is about 60% at 30 eV rather than the 100% observed at 5 eV. At the higher collision energies shown in Figure 2.2, more of the parent ion dissociates into smaller fragments. At 60 eV, the base peak shifts from m/z 86 to m/z 58. Almost all of the parent ion is dissociated at 75 eV.

The relative abundances of several fragment ions of meperidine are plotted versus collision energy in Figure 2.3. Single collision conditions were used to acquire the data for this plot.
Two important features of this plot are the appearance of the m/z 56 daughter ion at 30 eV collision energy and of the m/z 70 daughter ion at a collision energy of 50 eV.

Figure 2.1: Daughter spectra of the molecular ion of promazine with a collision energy of (a) 5 eV, (b) 15 eV and (c) 30 eV.

Figure 2.2: Daughter spectra of the molecular ion of promazine with a collision energy of (a) 45 eV, (b) 60 eV and (c) 75 eV.

Figure 2.3: Relative abundance of several fragment ions of meperidine plotted versus collision energy.

Collision gas pressure is the other key instrumental parameter which requires careful consideration. The average number of collisions which an ion is likely to undergo is a function of the collision gas pressure. At low collision gas pressures, fragment ions are obtained from a single collision with the target gas. This pressure regime is referred to as “single-collision” conditions.
At higher collision gas pressures, some parent ions can collide with two or more target gas atoms, producing granddaughter ions, great-granddaughter ions, and so on. This pressure regime is referred to as “multiple collision” conditions. The effect of increasing collision gas pressure at constant collision energy (i.e. 30 eV) is shown in Figures 2.4 and 2.5. Figure 2.4 demonstrates the effect on the molecular ion of promazine and Figure 2.5 shows the effect on the m/z 199 promazine fragment ion. It is interesting to note the difference in the degree of fragmentation of the parent ions at 0.4 mtorr. The promazine molecular ion shows much more fragmentation at this pressure than does the m/z 199 fragment ion, thus demonstrating the different reaction cross sections that different ions possess. It is also interesting to note that there are several ions in the daughter spectrum of the m/z 199 fragment ion (m/z 45, 71, 96 and 155) that do not appear or are very low in intensity in the daughter spectrum of the molecular ion.

The relative abundance of the m/z 56 ion of meperidine is plotted against collision energy for three different collision gas pressures in Figure 2.6. At the higher collision gas pressures, the m/z 56 daughter ion is observed at all collision energies, while under single collision conditions this ion is only observed at collision energies above 30 eV.

Figure 2.4: Daughter spectra of the molecular ion of promazine with a collision gas (argon) pressure of (a) 0.4 mtorr, (b) 1.2 mtorr and (c) 1.8 mtorr (CE = 30 eV).
Figure 2.5: Daughter spectra of the m/z 199 fragment ion of promazine with a collision gas (argon) pressure of (a) 0.4 mtorr, (b) 1.2 mtorr and (c) 1.8 mtorr (CE = 30 eV).

Figure 2.6: Relative abundance of the m/z 56 daughter ion of meperidine plotted versus collision energy at three collision gas pressures.

The major goal in establishing a collision energy and collision gas pressure standard for a reference database is to ensure that the collision products obtained for identical parent ions are reproducible. The standard will ensure that parent ions encounter the same number of collisions (on average) at the same collision energy. Figure 2.7 demonstrates the reproducibility of the daughter spectra of a particular m/z 141 ion obtained from six different barbiturate compounds. Figure 2.8 shows similar consistency in the daughter spectra of the m/z 91 ion from six different phenols.

There are two major collision gas regimes that are routinely used to obtain daughter spectra. The first is single collision conditions, which have the advantage of detecting bona fide neutral losses (i.e. a parent ion dissociating to a daughter ion via one collision, with no daughter ions being further dissociated by additional collisions with the target gas). The second pressure regime is multiple collision conditions. These conditions have the advantage of producing more fragment ions than single collision conditions, especially for parent ions with small cross sections.
Thus, two complete MS/MS databases were acquired for use in generating MAPS rules. A third database was created which contains MS/MS data for some of the reference compounds at a third collision gas pressure. This third pressure is about 3 times the single collision gas pressure. The data from this database were used mainly for evaluation purposes rather than for rule generation. Table 2.1 summarizes the standard conditions used in creating the three reference databases. The collision energy selected was 30 eV for all three databases. Three different collision gas pressures were selected so that the possibility of using different pressure regimes as ancillary experiments to reduce the number of candidate structures could be assessed.

Figure 2.7: Daughter spectra of the m/z 141 ion obtained from barbiturate compounds.

Figure 2.8: Daughter spectra of the m/z 91 ion obtained from phenols.
Collision energy, Elab:  30 eV
Collision gas:           argon
Collision gas pressure:

    Reference DB    Pressure*
    #1              0.4 mtorr (2.0 x 10^-6 torr)
    #2              1.2 mtorr (6.0 x 10^-6 torr)
    #3              1.8 mtorr (1.0 x 10^-5 torr)

* The first reading is the average convectron gauge reading, while the reading in parentheses is the indicated manifold reading.

Table 2.1: Standard instrumental conditions for creation of MAPS reference databases using a Finnigan TSQ-70 TQMS.

Automated Acquisition of MS/MS Spectra

Data systems which allow for automated data collection on commercial TQMS instruments have become increasingly available over the last few years. This important development has spurred the acquisition of MS/MS spectra for inclusion in the MAPS reference databases. The instrument used for this work was a Finnigan TSQ-70 (triple stage quadrupole) equipped with gas chromatographic and direct inlets. An instrument control language (ICL) is provided with this instrument, which allows the user to pre-program sample introduction, ionization and spectral acquisition. The following paragraphs describe the experimental method used to acquire MS/MS spectra of reference compounds for inclusion in the reference database.

MS/MS fragmentation maps using EI ionization were acquired for a variety of standard compounds to build the reference databases. Each compound possesses an assortment of substructures, so it is probable that one compound will contribute spectral features to several different substructure identification rules. A fragmentation map consists of the primary mass spectrum and daughter spectra for each ion that had an intensity greater than 1% of the base peak in the primary mass spectrum. Daughter spectra were not collected for peaks less than 1% of the base peak, both to decrease the scan time required to take a complete map and to avoid unsuitable daughter spectra due to insufficient signal.
As shown in Table 2.1, daughter spectra for each reference compound were collected at two collision gas pressures: 0.4 mtorr (2.0 x 10^-6 torr indicated pressure on the manifold gauge) and a pressure three times higher than the first, 1.2 mtorr (6.0 x 10^-6 torr manifold). The first pressure provides single collision conditions while the latter pressure is closer to commonly reported CAD pressures on the TSQ-70. A third fragmentation map was also collected at a third collision gas pressure (1.8 mtorr, 1.0 x 10^-5 torr manifold) for some of the standards.

The collision gas pressure used to acquire daughter spectra was set using an indirect indication of the pressure in Q2. The stability of the convectron gauge which reads the pressure in Q2 was insufficient for setting a specific pressure on a routine basis. The relative stability of the manifold ion gauge versus the Q2 convectron gauge is shown in Figure 2.9. The peak area of the m/z 69 fragment ion of the calibration compound perfluorotributylamine (PFTBA) is plotted against collision gas pressure as read from the manifold ion gauge and from the Q2 convectron gauge. The manifold ion gauge reads the pressure in the manifold (low vacuum) region around the quadrupole assembly and can give an indirect measure of the collision gas pressure in Q2 from the leakage of collision gas into the manifold region. The smooth curve obtained using the manifold ion gauge readings and the reproducible spectra shown in Figures 2.7 and 2.8 indicate that the manifold gauge can be used to set the collision gas pressure reproducibly for daughter scan experiments.

Before any experiments can be performed on any mass spectrometer, the instrument must be properly tuned to ensure proper mass calibration and ion transmission. This is accomplished by monitoring the mass spectrum of a calibration compound while adjusting the tune parameters.
This process continues until the mass spectrum matches the “accepted” spectrum for the calibration compound. Typical mass spectra of PFTBA for Q1 and Q3 scans are shown in Figure 2.10. While adjusting the tune parameters, it is recommended that the user check to ensure that m/z 69 is the base peak, that m/z 131 and m/z 219 are approximately 50% of the base peak and that m/z 502 and m/z 614 are discernible (over 1% of the base peak).

Figure 2.9: Plots of collision gas pressure measured by (a) the manifold ion gauge and (b) the Q2 convectron gauge versus peak area of the m/z 69 fragment ion of perfluorotributylamine (PFTBA).

Figure 2.10: Primary mass spectra of PFTBA using (a) Q1 and (b) Q3 as the scanning quadrupole.

The mass calibration of these ions should not deviate from the actual values by more than 0.1 amu, since masses that fall within +/- 0.1 amu of a “half mass” (e.g. 37.5 amu) are rounded to the half mass value. Masses outside this range are rounded to the nearest unit mass for use in the MAPS software.
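The rounding rule just described can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the code used by the data transfer software; the function name `round_for_maps` and the small floating-point allowance are assumptions introduced here.

```python
import math


def round_for_maps(mz: float, tol: float = 0.1) -> float:
    """Round a measured m/z value following the rule stated in the text.

    Values within +/- tol of a half mass (x.5) are assigned the half-mass
    value; all other values are rounded to the nearest unit mass.
    The name and signature are hypothetical.
    """
    frac = mz - math.floor(mz)
    # 1e-9 allowance so that values such as 37.6 are not excluded by
    # binary floating-point representation error (an assumption).
    if abs(frac - 0.5) <= tol + 1e-9:
        return math.floor(mz) + 0.5
    return float(math.floor(mz + 0.5))


if __name__ == "__main__":
    for mz in (37.5, 58.45, 67.58, 68.1, 198.9):
        print(mz, "->", round_for_maps(mz))
```

A mass-calibration error larger than 0.1 amu can therefore move a genuine unit-mass peak into the half-mass window, or the reverse, which is why the tuning tolerance matters.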
Four experiments were performed to collect the MS/MS data of each standard. All of the standards used were solids and were introduced into the ion source via a direct insertion probe (DIP). The samples were volatilized by heating the probe tip according to a predefined temperature program. The first experiment used an ICL procedure called KHTIC, shown in Figure 2.11, and a probe procedure called KHPROBE to determine the total ion current generated versus probe temperature.

KHTIC.ICL
    I=100               {set counter}
    ON;EMULT=1000       {turn multiplier on and set voltage to 1000 V}
    Q3MS 40,550,1.0     {set scan mode to a Q3 mass spectrum, scanning from 40 to 550 amu in 1 second}
    CENT                {collect centroid data}
    ASTART              {start acquiring data to the disk}
    WHILE I > 0         {begin data acquisition loop}
      GO;STOP           {acquire one scan}
      I-=1              {decrement counter}
    END                 {end loop}
    OFF                 {turn multiplier off}
    ASTOP               {stop acquiring data to the disk}

Figure 2.11: The ICL procedure used to repetitively acquire 100 primary mass spectra for characterization of probe temperature programs.

The results of this experiment were used to create a probe temperature program to volatilize the sample at a relatively constant rate. The reconstructed total ion current obtained for promazine is plotted against time in Figure 2.12. The reconstructed ion currents obtained for m/z 284 (the molecular ion of promazine) and for the m/z 199 fragment ion are also shown in this figure. As indicated in the KHTIC procedure, each scan is a Q3 mass spectrum (Q1 and Q2 are in "rf only" mode, that is, they pass all ions) and requires 1 second. Thus, Figure 2.12 shows that after about 20 seconds, the spectra acquired as promazine is heated off the probe crucible remain relatively constant.

A second experiment was performed to collect the primary mass spectrum for the sample. The ICL procedure used was KHMAP1, which is listed in Figure 2.13.
This procedure sets the scan mode, mass limits, scan rate and multiplier voltage, calculates which ions are at least 1% of the base peak and appends the masses of those ions to a user list. The procedure also estimates the total scan time for collection of daughter spectra for all of the masses in the user list.

The third experiment uses the KHMAP2 ICL procedure, shown in Figure 2.14, to set the instrumental parameters for each daughter spectrum. The ions selected in Q1 (parent ions) are read from user list 1 created in the previous experiment. The KHMAP2 procedure calculates a scan time for each mass which corresponds to a scan rate of 200 amu/s. Thus, all daughter spectra in the reference database have been collected at the same scan rate.

Figure 2.12: Reconstructed ion currents for promazine plotted against time: the total ion current, m/z 284 (molecular ion) and m/z 199 (fragment ion).

KHMAP1.ICL
    UCLR 1;UCLR 2;UCLR 3;UCLR 4   {clear user lists}
    I=190;J=1;TSCANT=0            {initialize counters and variables}
    CENT                          {acquire centroid data}
    ON;EMULT=800                  {turn multiplier on and set voltage to 800 V}
    DOZE 30                       {wait 30 seconds}
    ASTART                        {start acquiring data to the disk}
    Q3MS 40,190,2.0               {set scan mode to a Q3 mass spectrum from 40 to 190 amu in 2 seconds}
    GO;STOP                       {acquire one spectrum}
    MAXAREA=AREA(40,190,1)        {return base peak intensity}
    ASTOP                         {stop acquiring data to the disk}
    OFF                           {turn multiplier off}
    WHILE I>40                    {begin loop}
      PAREA=AREA(I)               {return area of largest peak +/- .5 amu of I}
      RATIO=PAREA/MAXAREA         {calculate relative intensity}
      IF RATIO > 0.010            {if > 1% of base peak...}
        UAPP MASS(I),1            {...append the mass of the peak to user list 1}
        SCANT=(I-15)/200          {calculate an estimated scan time}
        TSCANT=TSCANT+SCANT       {total scan times for each peak}
      END                         {end if}
      I-=1                        {decrement mass counter}
    END                           {end while loop}
    UAPP TSCANT,3                 {append estimated scan time to user list 3}

Figure 2.13: The ICL procedure used to acquire the primary mass spectra found in the reference database.

KHMAP2.ICL
    T1=MINUTE*60+SECOND           {return current time}
    ON;EMULT=1200                 {turn multiplier on and set voltage to 1200 V}
    CENT;J=1                      {acquire centroid data and set variable}
    DOZE 20                       {wait 20 seconds}
    ASTART                        {start acquiring data to the disk}
    K = USIZE 1                   {return the number of scans to be acquired}
    WHILE J <= K                  {start data acquisition loop}
      PMASS = ULIST 1,J           {get mass J in user list 1}
      SCANT=(PMASS-15)/200        {calculate a scan time corresponding to a scan rate of 200 amu/sec}
      %1=MANPR*1000000            {return manifold gauge reading, scaled}
      UAPP %1,2                   {append scaled reading to user list 2}
      %2=CPR                      {get Q2 convectron gauge pressure reading}
      UAPP %2,4                   {append pressure reading to user list 4}
      DAU PMASS,10,PMASS+5,SCANT,-30   {set scan mode to a daughter spectrum of PMASS from 10 to PMASS+5 amu in SCANT seconds with collision energy of 30 eV}
      GO;STOP                     {acquire daughter spectrum}
      J+=1                        {increment counter}
    END                           {end while loop}
    T2=MINUTE*60+SECOND           {get end time}
    ET=T2-T1                      {calculate elapsed time}
    UAPP ET,3                     {append elapsed time to user list 3}
    ASTOP;OFF                     {stop acquiring data to disk and turn multiplier off}

Figure 2.14: The ICL procedure used to acquire the daughter spectra found in the reference database.

The fourth experiment also uses KHMAP2, but at a higher collision gas pressure (set manually). Separate experiments were run to collect these spectra to allow the collision gas pressure time to equilibrate. Sample residence times in the source were insufficient in most cases to allow the ICL procedure to "doze" while the collision gas was equilibrating. For some compounds, data for a third pressure were also collected using this procedure so that three data points would be available for the determination of reaction orders of daughter ions.
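The parent-ion selection and scan-time estimate carried out by KHMAP1 and KHMAP2 can be restated as a short Python sketch. It is offered only as a reading aid and is not part of the original work: the function names, the dictionary representation of the primary spectrum and the example peak areas are all assumptions, while the 1% threshold and the (mass - 15)/200 expression are taken from the ICL listings.

```python
SCAN_RATE_AMU_PER_S = 200.0   # scan rate stated in the text
MASS_OFFSET_AMU = 15.0        # offset used in SCANT=(PMASS-15)/200
THRESHOLD = 0.010             # 1% of the base peak


def select_parents(peak_areas, threshold=THRESHOLD):
    """Return the masses whose area exceeds `threshold` times the base peak.

    `peak_areas` maps nominal mass -> peak area (hypothetical input format).
    Masses are returned from high to low, the order in which KHMAP1 walks
    its mass counter.
    """
    base = max(peak_areas.values())
    return [m for m in sorted(peak_areas, reverse=True)
            if peak_areas[m] / base > threshold]


def scan_time(parent_mass):
    """Scan time in seconds for one daughter spectrum, as in the listings."""
    return (parent_mass - MASS_OFFSET_AMU) / SCAN_RATE_AMU_PER_S


def total_scan_time(parents):
    """Estimated time to scan daughter spectra for every selected parent."""
    return sum(scan_time(m) for m in parents)


if __name__ == "__main__":
    # Made-up areas for illustration only.
    spectrum = {58: 1000.0, 86: 400.0, 120: 8.0, 180: 25.0}
    parents = select_parents(spectrum)
    print(parents, round(total_scan_time(parents), 3))
```

Because the estimate sums only scan times, it does not include the DOZE delays or any per-scan instrument overhead; the elapsed time that KHMAP2 writes to user list 3 is the measured counterpart.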
Data Transfer Software and Computer Facilities

The DUMP utility was used on the TSQ-70 data system to create ASCII formatted files (with a file extension ".DAT") for the files containing the primary mass spectrum, the daughter spectra at pressure 1, the daughter spectra at pressure 2 and, if appropriate, the daughter spectra at pressure 3. These files were then transferred to a VAXstation 3200. The TSQXFER program was then used to convert the ASCII data into a MAPS compatible format (a LISP list format with a file extension ".LSP"). A program called GENF has recently been written which generates the feature-bucket list used by MAPS in the rules generation process from the various ".LSP" files. This program was written in C and optimized for speed. The LISP function "GEN-FEATURE-BUCKETS" was very inefficient when the large data set was used: the computation time for generating the feature buckets was 6 hours. The C version required 46 minutes (real time) to process the entire data set on the VAXstation 3200.

The computing facilities available for the ACES software are shown in Figure 2.15. The TSQ-70 data system runs on a PDP-11/73 computer and provides an interface to the instrument control computer, an instrument control language (ICL) to allow automated data collection, disk storage for data and library files and data manipulation software for display of the MS/MS spectra. Two DEC VAXstation computers are available to run the ACES software. The VAXstation 3200 runs the ACES software while the AI VAXstation is used as a general purpose group computer and for auxiliary storage of data files.

Figure 2.15: Schematic of the computing facilities available for running the ACES programs.
A Macintosh II using a Tektronix 4014 terminal emulation package is used to allow remote data processing capabilities and access to computer drafting packages and a PostScript printer. All these computers are linked by DECnet to allow file transfers and remote logins. Once the MS/MS maps are acquired, they are stored in a reference database on the AI VAXstation.

Standard Compounds Selected for the Reference Database

An important area of application of the MS/MS technique is the analysis of pharmaceuticals. These analyses involve the screening of formulations for active drug components, impurities and synthetic markers, structural analyses of new drugs and quantitation of drug metabolites in biological fluids [7]. Since pharmaceutically active drugs often have similar structures, MS/MS can be used to establish the structures of variants of more commonly encountered drugs [7]. This last point is especially important in the analysis of so-called “designer drugs” [8]. The appellation “designer drug” stems from the process of substituting functional groups on known drugs to avoid regulation of the possession and distribution of the original drug. Unfortunately, these substitutions can radically change the potency of a drug and thus lead to accidental overdoses [8].

The generation of substructure identification rules will provide characteristic fragmentation patterns in the MS/MS data space for a variety of drug compound classes. In fact, the rules generated by the MAPS software represent the first step in mixture analysis, which is the characterization of important fragment ions by examination of spectra of standard compounds. These fragment ions are often used to identify and quantitate drugs in complex mixtures as well as for structure elucidation of pure compounds.
Many of the compounds selected for use in creating the reference databases are regulated drug compounds and are grouped according to the following classes: opioids, stimulants, antipsychotics, morphine substitutes and sedative hypnotics. Of the 105 compounds that were included in the reference databases, 84 were obtained from the Theta Corporation in “Theta-Kits”. Each standard in these kits was dissolved in an appropriate solvent (usually methanol), with the concentration usually being 1 mg/ml. The sample crucible for the direct insertion probe holds approximately 7 microlitres. Thus, filling the crucible with standard solution and allowing the solvent to evaporate delivers approximately 7 micrograms of the standard into the crucible. The other 21 standards used were obtained from General Motors Research. All of these standards were phenols.

The reference names, compound names and CAS numbers for the standards used to create the reference databases are listed in Table 2.2. The nominal mass, molecular formula and the number of daughter spectra collected for each of the 105 standards are shown in Table 2.3. This table shows the diversity of elements and molecular weights present in the databases. The masses shown in bold face type in Table 2.3 have entries which are isomeric, that is, there are other entries with the same elemental composition.

A total of 14,097 spectra were collected for the reference databases. Each entry in a database consists of a primary mass spectrum and daughter spectra for each of the ions which had an intensity greater than 1% of the base peak in the primary mass spectrum. The database for the third pressure (P3) does not include data for all of

Table 2.2: Reference names, compound names and CAS numbers for the standards used to create the reference databases.

... M2+ + T + e-) do not usually occur under low energy CAD [7]. Also, there were a number of misclassified neutral loss entries (e.g. a neutral loss of 9 amu). Examination of the profile data (the database contains centroid data) indicates two sources for these entries in the feature-buckets. The first is the presence of doubly charged parent ions which dissociate with charge retention into doubly charged daughter ions. These ions are potentially useful for structure determination; however, they seem to be formed only in low concentrations. The other source of the “half mass” entries is erroneous peak finding.

A partial primary mass spectrum collected in profile mode for hydroxyamphetamine is shown in Figure 2.18. The instrument was tuned for unit resolution from 12 to 700 amu. The peaks at m/z 59 and m/z 68, however, are severely broadened. The peaks at m/z 57, 65 and 71 possess the peak shape typically obtained on this instrument. Therefore, the peak broadening is likely due to the presence of peaks at m/z 58.5 and 67.5 (doubly charged: z=2, odd mass ions). Daughter spectra of these ions were taken to establish their structure and are shown in Figure 2.19. The scan range for Q3 was doubled to detect any singly charged daughters with masses larger than the m/z of the parent ion. Significant daughter ions were obtained at m/z 116, m/z 89 and m/z 67 for the m/z 58.5 parent ion. For the m/z 67.5 parent ion, daughters were obtained at m/z 134 (loss of H+), m/z 116 (loss of H3O+) and m/z 107 (loss of C2H5+). The m/z 19 and m/z 29 daughter ions formed from these losses are also observed in the spectrum.
The daughter ion at m/z 58.5 (obtained from the 67.5 parent ion) is of particular interest in explaining apparently spurious neutral losses (e.g. 9 amu). The software that converts raw data into LISP format calculates neutral losses by subtracting the mass of the daughter ion from the mass of the parent. Thus, for the m/z 67.5 -> m/z 58.5 reaction, a neutral loss of 9 amu is calculated. In reality, the mass of the m/z 67.5 parent ion is 135 amu and the mass of the m/z 58.5 daughter ion is 117 amu. The actual neutral loss is, therefore, 18 amu, which corresponds to a neutral loss of water (with the retention of both charges). The loss of 19 amu (H3O+) to form the m/z 116 daughter ion was not found in the reference database because this ion was beyond the mass range selected for Q3. In addition, the optimal instrumental conditions required to observe the larger daughter ions of the doubly charged parents were different from those used for collecting the reference database spectra. The conditions used to obtain the spectra shown in Figure 2.19 were low collision energy (Elab = 2 eV) and multiple collision conditions (1.2 mtorr Ar). Figure 2.20 shows the daughter spectrum for the m/z 67.5 parent ion obtained using the same conditions as those used for collecting the reference databases (Elab = 30 eV, p = 0.4 and 1.2 mtorr). The only major ions obtained using these conditions were m/z 58.5 and m/z 29.0.

Figure 2.18: Partial primary mass spectrum of hydroxyamphetamine collected in profile mode.

Figure 2.19: Daughter spectra of the (a) m/z 58.5 and (b) m/z 67.5 fragment ions of hydroxyamphetamine (Elab = 2 eV, p = 1.8 mtorr).
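The arithmetic behind the spurious 9 amu loss can be made explicit with a small Python sketch. The function names and signatures are assumptions for illustration; the original conversion software is not reproduced here. The numbers are those quoted in the text for the doubly charged m/z 67.5 parent ion of hydroxyamphetamine.

```python
def apparent_loss(parent_mz, daughter_mz):
    """Loss obtained by subtracting m/z values directly, which is what the
    text says the conversion software does (both ions assumed singly charged)."""
    return parent_mz - daughter_mz


def mass_difference(parent_mz, daughter_mz, parent_charge=1, daughter_charge=1):
    """Mass lost once each m/z value is multiplied by its charge.

    If the charges are equal the lost fragment is neutral; if the daughter
    carries fewer charges, the lost fragment is itself an ion.
    """
    return parent_mz * parent_charge - daughter_mz * daughter_charge


if __name__ == "__main__":
    # m/z 67.5 -> m/z 58.5, treated as singly charged: spurious 9 amu loss.
    print(apparent_loss(67.5, 58.5))
    # Both ions doubly charged: 135 amu -> 117 amu, neutral loss of 18 amu.
    print(mass_difference(67.5, 58.5, 2, 2))
    # Doubly charged parent to singly charged m/z 116: loss of 19 amu (H3O+).
    print(mass_difference(67.5, 116.0, 2, 1))
```

The charge state is not available from centroid data alone, which is why the text recommends acquiring additional daughter spectra before trusting such a feature.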
Doubly charged daughter ions have been observed in the spectra for a number of other reference compounds. Morphine, for instance, produces a doubly charged ion at m/z 108.5 (5% RA in the primary mass spectrum). The daughter spectrum obtained for this parent ion is shown in Figure 2.21 (a). A daughter spectrum was also taken for m/z 142.5 (half the molecular weight of morphine) to determine if there was an appreciable amount of doubly charged molecular ion produced. The daughter spectrum obtained is shown in Figure 2.21 (b). A loss of H+ from the doubly charged molecular ion can be seen in this spectrum (m/z 284) as well as a substantial amount of m/z 108.5. Doubly charged ions are also present in the primary mass spectrum of oxymorphone at m/z 112.5 (4% RA), m/z 113.5 (2% RA), m/z 99.5 (2% RA), and m/z 98.5 (4% RA). Daughter spectra were taken for these parent ions and m/z 150.5 (half the molecular weight of oxymorphone). The daughter spectra obtained for these ions are shown in Figures 2.22 - 2.24.

Figure 2.20: Daughter spectrum of the m/z 67.5 fragment ion of hydroxyamphetamine with Elab = 30 eV.

Figure 2.21: Daughter spectra of the (a) doubly charged molecular ion and (b) the m/z 108 fragment ion of morphine (Elab = 2 eV, p = 1.8 mtorr).

The other source of the “half-mass” entries is erroneous peak finding. The peak at m/z 66 has a “shoulder” which can be erroneously assigned as a separate peak and rounded to a “half-mass” by the data transfer software. Since both the shoulder and the major portion of the peak are present, no data are lost by ignoring the “half-mass” peak. However, if the utility of the real “half-mass” ions is to be explored, there is a possibility that some of the spurious peaks may be included in a rule.
Therefore, “half-mass” entries in a MAPS rule must be manually verified. The mass filter normally used by the MAPS software does not allow “half-mass” features to be considered for inclusion in the substructure identification rules. Also, the rules generated so far do not include the misclassified neutral losses such as 9 amu. Judicious choice of the minimum U and C values used by the MAPS software effectively prevents the inclusion of these features. If one of these features is passed by the U and C filters, then daughter spectra for the associated compounds should be taken to determine the actual value (i.e. 18 amu vs. 9 amu).

Figure 2.22: Daughter spectra of the (a) m/z 98.5 and (b) m/z 99.5 fragment ions of oxymorphone (Elab = 2 eV, p = 1.8 mtorr).

Figure 2.23: Daughter spectra of the (a) m/z 112.5 and (b) m/z 113.5 fragment ions of oxymorphone (Elab = 2 eV, p = 1.8 mtorr).

Figure 2.24: Daughter spectrum of the doubly charged molecular ion of oxymorphone (m/z 150.5).

References

1. Small, G.W., [journal title illegible] 52, 535A (1987).
2. Wade, A.P., Palmer, P.T., Hart, K.J., Enke, C.G., [journal title illegible] 215, 169 (1988).
3. Dawson, F.H., Sun, W., [journal title illegible] 55, 155 (1983/1984).
4. Martinez, R.I., Dheandhanoo, S., [journal title illegible] 1 (1988).
5. Martinez, R.I., [journal title illegible] 127 (1989).
6. Palmer, P.T., Ph.D. Dissertation, Michigan State University, East Lansing, MI, 1988.
7. Busch, K.L., Glish, G.L., McLuckey, S.A., "Mass Spectrometry / Mass Spectrometry: Techniques and Applications of Tandem Mass Spectrometry", VCH Publishers, Inc., New York, 1988.
8. Baum, R.M., [journal title illegible] 61, 7 (1985).
9.
NBS/NIH/EPA/MSDC Mass Spectral Database, National Technical Information Service (NTIS), 5285 Port Royal Road, Springfield, VA 22161.

CHAPTER 3

Generation of MAPS Substructure Identification Rules

Introduction

Reliable substructure identification rules are one of the crucial elements in the Automated Chemical Structure Elucidation System (ACES). The rules used in this system are generated by the MAPS (Method for Analyzing Patterns in Spectra) software [1-4]. This software utilizes the structural information contained in mass spectrometry / mass spectrometry (MS/MS) data to formulate spectral feature / substructure correlations. Other interpretive systems have been devised based on spectral matching (e.g. STIRS) and fragmentation rules (e.g. DENDRAL) [5-7]. As was discussed earlier, a spectral matching approach is inappropriate for interpretation of MS/MS data due to the variability of daughter spectra and the incumbent lack of a library of daughter spectra.

The DENDRAL project took a knowledge engineering approach to provide substructure identification rules. Knowledge engineering is an artificial intelligence method where the knowledge of human experts regarding a specific problem domain is captured in a format usable by computers. The objective was to formulate rules, typically “IF-THEN” rules, to emulate the process by which the human experts solve problems. This project was quite successful in developing artificial intelligence (AI) technology but has not been recognized as a success in mass spectrometry. The limited results achieved in interpreting mass spectra are due primarily to the incomplete understanding of mass fragmentations which encompass the huge variety of compounds that are analyzed by this technique. Another significant disadvantage of this method is the reliance on primary mass spectra, which have, in our experience, less information for an interpretive approach than techniques such as MS/MS.
The MAPS software, on the other hand, uses an empirical approach to derive the substructure identification rules. Few assumptions, if any, are made by this software about the potential fragmentation pathways that are open for any given substructure. All pre-programmed information regarding the substructures is well known and not subject to interpretation (e.g. when using a mass filter, the elemental composition of the substructures and their masses are used to limit the spectral features considered for each substructure).

There are two types of MAPS rules. The first type of rule is used to predict the presence of a substructure in the structure of an unknown compound. These rules are referred to as inclusion rules. These rules have the general format:

"IF <spectral feature f1> is present and
    <spectral feature f2> is present and
    ...
    <spectral feature fn> is present
 THEN substructure X is present."

The second type of rule is used to predict the absence of a substructure and has the general form:

"IF <spectral feature f1> is absent and
    <spectral feature f2> is absent and
    ...
    <spectral feature fn> is absent
 THEN substructure X is absent."

These rules are referred to as exclusion rules. This chapter is devoted mainly to the generation and evaluation of inclusion rules. The generation of more effective exclusion rules remains an open area for further research.

The following section describes the evolution of MAPS software and is followed by a detailed description of the software. A new method for generating the rules is introduced and the software written to use this method is then described. This discussion presents two different versions of the MAPS software, an interactive LISP version and an optimized C version. An analysis of several of the MAPS rules and a comparison of MAPS rules generated using MS/MS data acquired under different collision conditions are provided in Chapter 5.

MAPS Software Development

The original version of the MAPS software was written by Dr. Adrian Wade in InterLISP-D for a Xerox 1108 Workstation [1].
This software was subsequently modified by Dr. Peter Palmer to incorporate a number of the features used for this work (MAPS version I) [2]. Since the hardware resources (i.e. hard disk space, physical memory and computation speed) of the Xerox computer were limiting the number and types of experiments that could be run using this system, the MAPS software was ported to Common LISP running on a VAXstation computer [3-4]. This new code (MAPS version II) was developed with the programming assistance of Chris Weaver, with several functions being converted to C for speed by another undergraduate, Drake Diedrich. Common LISP was chosen because it runs on a number of computers with more computing power than the Xerox 1108 (e.g. 80386-based PCs and DEC VAXstation computers). A decision was made to run the software on a DEC VAXstation 3200 rather than the 80386-based PC because more physical memory was available to the software on the VAXstation (i.e. 16 MB vs 640 KB).

Much was learned using the LISP version of the MAPS software but it was limited in some respects by the inefficiency of LISP versus other programming languages such as C. Thus, a C version of MAPS (version III) was developed with the programming assistance of Drake Diedrich. Many AI programs are developed in LISP and later translated into C for speed. This transition is often observed in AI-based projects where the LISP prototype is used to solve problems with a previously unknown algorithm and an optimized C version is used to increase the efficiency of the newly discovered algorithm [8-10]. The following discussion focuses on MAPS version II. The third version of MAPS is described in a subsequent section.

The MAPS Software - Version II

The MAPS software requires several inputs in order to create substructure identification rules. These inputs are summarized in Table 3.1. The major data inputs are the “substructure buckets” (SS-BUCKETS) and the “feature buckets” (FEATURE-BUCKETS).
The SS-BUCKETS provide the substructure content of each of the reference compounds in an inverted format to optimize the calculations by the MAPS software. The LISP version of the substructure buckets has the format:

(SS-reference-name-1 CMPD-name-a CMPD-name-b ...)
(SS-reference-name-2 CMPD-name-a CMPD-name-b ...)
(SS-reference-name-x CMPD-name-a CMPD-name-b ...)
(...)

A portion of the substructure buckets used in this work is provided in Figure 3.1. The origin of the substructure buckets is discussed in Chapter 4 since that chapter deals with the substructure search and structure generation software. Similarly, the FEATURE-BUCKETS data input is a list of the spectral features found in the MS/MS spectra of the reference compounds. The spectral features used in creating the FEATURE-BUCKETS are primary scan ions (PS), daughter scan ions (DS), neutral loss masses (NL), and parent-ion / daughter-ion combinations (PD). Selected portions of the FEATURE-BUCKETS used in this work are shown in Figure 3.2. The SUBSTRUCTURES information is contained in a file called “SUBSTRUCTURES.DAT”. This file contains atom definitions (i.e. chemical symbol, nominal mass, and number of valences) and substructure definitions (i.e. substructure reference name, nominal mass, and empirical formula).

INPUT VARIABLE NAME    PURPOSE

SUBSTRUCTURES          a list of atom definitions (the number of valences), the elemental compositions and maximum masses allowed for each substructure for use by the mass filter
SS-BUCKETS             a list containing each substructure and the reference compound names which are associated with the specified substructure
FEATURE-BUCKETS        a list containing each spectral feature and the reference compound names which are associated with the specified spectral feature
Umin                   the minimum uniqueness to be used by the uniqueness filter
Cmin                   the minimum correlation to be used by the correlation filter

Table 3.1: Summary of the inputs to the MAPS (v.II) software.
This information, along with Umin and Cmin, is used by the MAPS “filters” in selecting the spectral features to be included in a substructure identification rule.

(...)
(SS131 MI4060)
(SS132 MI9492 MI7723 MI7688 MI64 MI5826 MI3696 MI1485 MI1844 MI5755 MI5862 MI7044 MI7691 MI9202)
(...)

Figure 3.1: An excerpt from the “substructure-buckets”, SS-BUCKETS.

The first filter used by the MAPS software reduces the number of potential spectral features that can be correlated to a given substructure to those that are plausible given the mass and elemental composition of the substructure. In MAPS (v.II), this reduction is performed by the RELEVANT function. This function uses the mass and the elemental composition of the substructure to calculate the masses of potential fragments of the substructure. The RELEVANT function passes these masses to the uniqueness and correlation filters, thus limiting the number of uniqueness and correlation calculations. This function can be quite useful for focusing the rule generation on those features which are directly attributable to a substructure. The mass filter has the disadvantage, however, of eliminating spectral features which may be due to larger (and possibly not defined) substructures which encompass the relevant substructure. The spectral features due to the larger substructure can provide clues for defining new substructures. Since there are tradeoffs to using the mass filter, it is possible to enable and disable this filter using a compiler switch in the latest version of MAPS.

(...)
((P 198.0) GMR8 GMR10 GMR11 GMR17 GMR24 GMR25 MI64 MI592 MI1125 MI3152 MI5755 MI6834 MI7691 MI9042 MI1485 MI3696 MI5826 MI6837 MI7723 MI9141 MI1844 MI3774 MI5862 MI6990 MI7978 MI9202 MI2411 MI4687 MI6129 MI6998 MI8009 MI9492 MI2423 MI4714 MI6208 MI7044 MI8268 MI2885 MI5297 MI6827 MI7688 MI8728 MI9895)
(...)
((NL 198.0) GMR3 GMR12 GMR14 GMR17 GMR19 GMR24 GMR25 MI64 MI680 MI1122 MI1125 MI1844 MI2166 MI2411 MI2423 MI2885 MI3152 MI3691 MI3774 MI4687 MI4714 MI5297 MI5826 MI5857 MI6086 MI6129 MI6208 MI6827 MI6837 MI6881 MI7044 MI7688 MI7691 MI7978 MI8009 MI8728 MI9141)
(...)
((D 198.0) GMR12 MI64 MI1125 MI1485 MI1844 MI2411 MI2423 MI3152 MI3696 MI3774 MI4687 MI5297 MI5755 MI5826 MI5862 MI6129 MI6208 MI6827 MI6834 MI6837 MI7044 MI7688 MI7691 MI7723 MI8009 MI9202 MI9895)
(...)
((PD 198.0 154.0) MI64 MI1485 MI1844 MI3696 MI5755 MI5826 MI5862 MI7044 MI7688 MI7691 MI7723 MI9042 MI9202)
(...)

Figure 3.2: An excerpt from the “feature buckets”, FEATURE-BUCKETS, showing different features with the same nominal mass.

The uniqueness and correlation filters are actually implemented simultaneously. These filters calculate two values which describe the specificity (uniqueness) of a spectral feature for a given substructure and the frequency (correlation) with which the spectral feature appears in the presence of the specified substructure. These values are then checked against the minimum uniqueness and correlation values input by the user to determine if the spectral feature should be placed in the rule for a given substructure. The equations used to calculate these values are given in Figure 3.3. The ability to describe the MS/MS spectral features in this way is, in itself, quite useful for MS/MS practitioners since these descriptors provide a means to rapidly summarize the relevant information (i.e. with respect to substructure, compound class, etc.) contained in a large body of MS/MS reference data. For example, the “PD” spectral features in a MAPS rule can be used to assist a user in selecting the specific CAD reactions to monitor in solving a mixture analysis problem.

SPECTRAL FEATURE UNIQUENESS:

\[ U_x = \frac{\text{number of compounds with } SS_x \text{ and } F_x}{\text{number of compounds with } F_x} \times 100\% \]

SPECTRAL FEATURE CORRELATION:

\[ C_x = \frac{\text{number of compounds with } SS_x \text{ and } F_x}{\text{number of compounds with } SS_x} \times 100\% \]

SSx: substructure x
Fx: feature x

Figure 3.3: Equations used to calculate the uniqueness and correlation of spectral features in MAPS for use by the U and C filters.

The procedure for generating a MAPS (v.II) rule starting with raw data is given in Table 3.2. This version of MAPS is, like its predecessor, a collection of LISP functions that manipulate substructure and spectral feature data. One important function, GEN-FEATURE-BUCKETS, has been replaced by a more efficient C program, GENF. This program requires: 1) a yes/no response to use intensity classifiers, 2) the minimum number of compounds in which each feature must be observed to be included in the FEATURE-BUCKETS, 3) the output filename and 4) a list of data filenames. Another C program is used to convert the ASCII MS/MS data from the TSQ-70 triple quadrupole mass spectrometer to a LISP format (TSQXFER). This program requires the filenames of the primary mass spectrum and daughter spectra datafiles as well as the substructure list obtained from the ASLS program (see Chapter 4). The format of this file is the same as that used with the version I software. A disadvantage of this software is that a great deal of knowledge of LISP and the MAPS functions is required to successfully generate rules. This problem has been largely overcome with MAPS (v.III).

1. Define molecular structures for all reference compounds using GENOA (see Chapter 4).
2. Define all substructures of interest (see Chapter 4).
3. Acquire MS/MS data of all reference compounds (see Chapter 2).
4. Use the RSX DUMP program on the TSQ-70 data system to convert binary datafiles to ASCII.
5. Transfer ASCII data to a VAX running the ACES software.
6. Use the TSQXFER program to create LISP compatible datafiles which contain the compound name, reference name, mass, MS/MS data in LISP list format, a list of substructures contained in the structure of the compound and the molecular formula of the compound. This is best done using a command file.
7. Use the GENF C program to create the FEATURE-BUCKETS for the entire reference database.
   This is best done using a command file. This program replaces the GEN-FEATURE-BUCKETS LISP command.
8. Create (or modify) “SUBSTRUCTURES.DAT” to reflect changes to the substructure library and / or substructure definitions.
9. Invoke LISP using the VAX VMS command: LISP/RESUME=MAPS_BASE_SYSTEM. This command restores a “suspended system” which includes the MAPS LISP functions, among other things.
10. If changes have been made to the substructure library, compounds have been added to the database, or there are no existing substructure buckets, use the MAPS “GEN-SS-BUCKETS” function to generate the substructure buckets.
11. If compounds have been added to the database or there are no existing FEATURE-BUCKETS, load the file containing the newly generated FEATURE-BUCKETS using the LISP command: (LOAD “filename”).
12. If changes have been made to the “SUBSTRUCTURES.DAT” file, load the new version using the LISP command: (LOAD “filename”).
13. If new versions of the MAPS functions have been written, these need to be loaded at this time.
14. Create a new suspended system if any new datafiles have been loaded using the LISP command: (SUSPEND “filename”). This will facilitate a future MAPS session which requires the same data.
15. To generate a set of rules for a specified U/C combination: i) modify the MAKERULE function for the desired U/C, ii) type “(SETQ SSRULES (MAKERULE))”. This function creates a MAPS rule for each substructure contained in the SUBSTRUCTURES.DAT file using the minimum U/C value specified in the MAKERULE function. The resulting rules are stored in a list identified by a symbol (e.g. SSRULES). The main use of this function as of this writing has been to generate initial MAPS rules using a relatively low U/C combination (e.g. 10/10). U/C combinations with higher values may be obtained using the GENUCRULE function. Use of this function is much more efficient than regeneration of the rules using a new U/C combination.
16. Use the MAPS function GENUCRULE to obtain a MAPS rule for a particular substructure with a higher U/C combination (e.g. “(SETQ NEWRULE (GENUCRULE 40 70 LOWRULE))”). This function has mainly been used to obtain the starting features for feature-combination rules (e.g. “(SETQ SS1-CR (GEN-COMBINATIONS 'SS1 NEWRULE 100 SS-BUCKETS))”). The parameter “100” in the example is used to specify the minimum uniqueness for a combination.

Table 3.2: Procedure for generating MAPS (v.II) rules.

The advantage of the version II software for experienced users was the ability to examine intermediate lists (a characteristic of the interactive environment that LISP provides) and the relative ease of programming recursive functions.

The MAPS inclusion rule obtained for the phenothiazine substructure is shown in Figure 3.4. The numbers in square brackets correspond to the uniqueness and correlation values of the specified spectral feature with respect to the phenothiazine substructure. The original method for applying rules to unknowns used a match factor (MF) to determine the fraction of rule clauses (i.e. spectral features) which had to be found in the MS/MS spectra of an unknown for a substructure identification to be made. The substructure identifications obtained when the phenothiazine rule was applied against all of the reference database compounds for several different match factors are summarized in Table 3.3. There are two indices which describe the effectiveness of a substructure identification rule with respect to the reference database. These are the reliability and recall estimates of a rule. The equations used to calculate these values are shown in Figure 3.5. The best possible rule is one with 100% reliability and 100% recall. The recall observed for the phenothiazine rule (with 100% reliability), for example, is 77% for a match factor of 70%.
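The match-factor test and the two rule indices can be sketched in Python as follows. This is a minimal illustration, not the MAPS code: the function names and the feature labels F1 to F10 are hypothetical, and reliability and recall are assumed to be the number of correct predictions divided by, respectively, the total number of predictions and the total number of possible correct predictions. The counts in the last two lines are invented so as to reproduce the 100% reliability and 77% recall quoted above.

```python
def rule_fires(rule_features, observed_features, match_factor):
    """True when the percentage of rule clauses found in the unknown's
    MS/MS spectra is at least the match factor (given in percent)."""
    found = sum(1 for feature in rule_features if feature in observed_features)
    percent_found = 100.0 * found / len(rule_features)
    return percent_found >= match_factor


def reliability(correct_predictions, total_predictions):
    """Percentage of the predictions made that are correct."""
    return 100.0 * correct_predictions / total_predictions


def recall(correct_predictions, possible_correct_predictions):
    """Percentage of the possible correct predictions that were made."""
    return 100.0 * correct_predictions / possible_correct_predictions


# Hypothetical ten-clause rule; the unknown shows seven of the ten features.
rule = ["F%d" % i for i in range(1, 11)]
unknown = set(rule[:7]) | {"unrelated feature"}

print(rule_fires(rule, unknown, 70))   # seven of ten clauses meets MF = 70%
print(rule_fires(rule, unknown, 80))   # but not MF = 80%

# Hypothetical counts: 10 predictions, all correct, 13 compounds with the substructure.
print(round(reliability(10, 10)), round(recall(10, 13)))
```

Lowering the match factor lets the rule fire on more compounds, which raises recall but can admit false positives and therefore lowers reliability.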
As can be seen from Table 3.3, recall can only be increased to 100% by accepting a reliability below 100% (i.e. allowing false positives). Since false positives need to be held to an absolute minimum in our system, a new method of rule generation called “feature combinations” was developed.

IF  “D  (211.0)           [42,77]”  {F1}
and “D  (210.0)           [41,85]”  {F2}
and “D  (209.0)           [40,77]”  {F3}
and “D  (198.0)           [44,92]”  {F4}
and “PD (198.0 -> 171.0)  [77,77]”  {F5}
and “PD (198.0 -> 154.0)  [92,92]”  {F6}
and “PD (197.0 -> 196.0)  [69,85]”  {F7}
and “PD (197.0 -> 153.0)  [91,77]”  {F8}
and “PD (196.0 -> 152.0)  [83,77]”  {F9}
and “PD (70.0 -> 27.0)    [42,85]”  {F10}
THEN substructure PHENOTHIAZINE is present.
(Umin = 40%, Cmin = 70%)

[Structure drawing of the PHENOTHIAZINE substructure not reproduced.]

Figure 3.4: The initial MAPS (v.II) rule obtained for the PHENOTHIAZINE substructure.

MF (%)    RELIABILITY (%)    RECALL (%)
100       100                38
90        100                54
80        100                69
70        100                77
60        80                 92
50        75                 92
40        65                 100

Table 3.3: Reliability and recall estimates obtained for the MAPS (v.II) PHENOTHIAZINE substructure identification rule at several different match factors.

RULE RELIABILITY ESTIMATE:

\[ REL = \frac{\text{number of correct predictions}}{\text{total number of predictions}} \times 100\% \]

RULE RECALL ESTIMATE:

\[ REC = \frac{\text{number of correct predictions}}{\text{total number of possible correct predictions}} \times 100\% \]

Figure 3.5: Equations used to calculate the rule reliability and rule recall estimates in MAPS.

Initial Work on Feature-Combination Rules

A new method of generating MAPS rules to provide highly reliable rules with increased recall was explored by Dr. Peter Palmer using MAPS version I [11]. This method uses “feature combinations” rather than individual spectral features to generate substructure identification rules. A “feature combination” is simply a collection of two or more spectral features. The importance of these combinations lies in the fact that the uniqueness of a feature combination is equal to or greater than the highest uniqueness of the individual features.
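The effect of combining features can be checked against the bucket excerpts in Figures 3.1 and 3.2. The Python sketch below is illustrative only and rests on several assumptions: uniqueness is taken as the percentage of compounds showing the feature(s) that also contain the substructure, correlation as the percentage of compounds containing the substructure that also show the feature(s); the SS132 bucket is assumed to be the phenothiazine bucket (it reproduces the bracketed values in Figure 3.4); and the compound identifiers are transcribed with an "MI" prefix. The function and variable names are hypothetical.

```python
# Buckets transcribed from the excerpts in Figures 3.1 and 3.2.
SS132 = {"MI9492", "MI7723", "MI7688", "MI64", "MI5826", "MI3696", "MI1485",
         "MI1844", "MI5755", "MI5862", "MI7044", "MI7691", "MI9202"}

FEATURE_BUCKETS = {
    "D 198": {"GMR12", "MI64", "MI1125", "MI1485", "MI1844", "MI2411",
              "MI2423", "MI3152", "MI3696", "MI3774", "MI4687", "MI5297",
              "MI5755", "MI5826", "MI5862", "MI6129", "MI6208", "MI6827",
              "MI6834", "MI6837", "MI7044", "MI7688", "MI7691", "MI7723",
              "MI8009", "MI9202", "MI9895"},
    "PD 198->154": {"MI64", "MI1485", "MI1844", "MI3696", "MI5755", "MI5826",
                    "MI5862", "MI7044", "MI7688", "MI7691", "MI7723",
                    "MI9042", "MI9202"},
}


def uniqueness_and_correlation(feature_names, substructure_bucket, feature_buckets):
    """Return (U, C) in whole percent for one feature or a combination.

    A compound counts as showing a combination only if it appears in the
    bucket of every feature in the combination.
    """
    with_features = set.intersection(*[feature_buckets[n] for n in feature_names])
    with_both = with_features & substructure_bucket
    u = 100.0 * len(with_both) / len(with_features) if with_features else 0.0
    c = 100.0 * len(with_both) / len(substructure_bucket)
    return round(u), round(c)


for names in (["D 198"], ["PD 198->154"], ["D 198", "PD 198->154"]):
    print(names, uniqueness_and_correlation(names, SS132, FEATURE_BUCKETS))
```

With these excerpts, each single feature reproduces its bracketed [U,C] pair from Figure 3.4, while the two-feature combination reaches 100% uniqueness without losing correlation, because the one compound outside the bucket that shows the PD feature does not show the D feature.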
In fact, the feature-combination method is targeted at producing combinations of 100% uniqueness with respect to a given substructure (although a lower uniqueness could be specified, if desired). A feature-combination rule is constructed by combining spectral features until the uniqueness of the combination meets a specified minimum value (usually 100%). The construction of a feature combination is aborted if the correlation of the combination falls to zero. The other objective of the feature-combination method is to discover a sufficient number of feature combinations to achieve 100% recall. A potential bonus obtained by generating feature-combinations is the isolation of fragmentation pathways which may be indicative of the fragmentation of the indicated substructure within a particular structural environment. Feature-combination rules have the general format:

"IF <feature combination 1> are present or
    <feature combination 2> are present or
    ...
    <feature combination n> are present
 THEN substructure X is present."

These rules also differ from individual-feature rules in the way they are applied to MS/MS spectra from an unknown. If any feature combination from a MAPS rule is found in the MS/MS spectra of an unknown, then the substructure indicated in the rule is concluded to be present in the structure of the unknown. An example of a MAPS (v.I) feature-combination rule is provided in Figure 3.6.

IF (neutral loss of 80s)
OR (neutral loss of 81s)
OR (parent giving neutral losses of 81s and 82s)
OR (neutral loss of 82s AND parent ion at m/z 79w AND parent ion at m/z 80w)
OR (neutral loss of 82s AND parent ion at m/z 80m AND parent ion at m/z 81m)
OR (parent ion at m/z 79w AND parent ion at m/z 80w AND parent ion at m/z 81w)
THEN the BROMO substructure is present.

Figure 3.6: The MAPS (v.I) feature-combination rule obtained for the BROMO substructure. Adapted from reference 11.

There are two differences between MAPS version I and MAPS version II that are evident from this rule.
First, intensity categories were not implemented in the version II software. Intensity categorization significantly increases the number of features in the FEATURE-BUCKETS with only a modest increase in the individual-feature uniqueness. Thus, the rule in Figure 3.6 contains intensity classifiers while the rest of the rules in this and other chapters do not. Second, the rule shown in Figure 3.6 has a fifth type of feature called “multiple neutral loss” (e.g. “parent giving neutral losses of 81s and 82s”). The discovery of multiple neutral losses has yet to be implemented in versions II and III of MAPS.

The BROMO rule shown in Figure 3.6 has a reliability of 100% with respect to the reference database (the Extrel database, not the TSQ-70 database) as well as 83% recall. Thus, the feature-combination method of generating MAPS rules can provide rules with high reliability and high recall. This work was limited to small substructures (e.g. chloro, ethyl, phenyl) to restrict the number of initial features used by the feature-combinations function. The reason for this caution was that the upper limit on the number of possible feature-combinations is 2^n - 1, where n is the number of initial features. Thus, for larger substructures, an inordinate number of combinations were obtained (i.e. a combinatorial explosion is observed). The major conclusions reached in this study were: 1) feature-combinations provide rules with high reliability and high recall, 2) the feature-combinations method requires a great deal of computation time and 3) refinements to the feature-combination search algorithm and increased computer resources are required to define the limits of this approach [11]. The following paragraphs describe recently completed research to overcome the computational barriers encountered in this study.
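The growth in the number of candidate combinations can be illustrated with a short Python sketch. It assumes that the upper limit is the number of non-empty subsets of the n initial features, 2^n - 1, and simply enumerates or counts them; the function names are hypothetical and this is not the search algorithm used by MAPS.

```python
from itertools import combinations


def count_combinations(n_features):
    """Upper limit on the number of non-empty feature subsets: 2^n - 1."""
    return 2 ** n_features - 1


def enumerate_combinations(features):
    """Yield every non-empty subset of the features, smallest subsets first."""
    for size in range(1, len(features) + 1):
        for combination in combinations(features, size):
            yield combination


# Explicit enumeration agrees with the closed form for a small feature set.
ten_features = ["F%d" % i for i in range(1, 11)]
enumerated = sum(1 for _ in enumerate_combinations(ten_features))
print(enumerated, count_combinations(10))

# The closed form shows how quickly exhaustive enumeration becomes impractical.
for n in (10, 13, 15, 30):
    print(n, count_combinations(n))
```

Each additional initial feature roughly doubles the number of candidate combinations, which is why the initial feature set has to be kept small for an exhaustive search.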
Feature-Combination Rules Obtained Using MAPS (v.II)

A simple feature-combination generation function (GEN-COMBINATIONS) was incorporated into MAPS (v.II) so feature-combination rules generated using the new TSQ-70 reference database could be evaluated. The algorithm for this function searches all possible combinations of spectral features produced by the GENUCRULE function (a function which removes all features from an initial rule which do not meet specified U/C criteria). Redundant combinations were then pruned using the PRUNE-FC function. While this algorithm was inefficient, it was relatively easy to implement and produced feature-combination rules for use in ACES five months before MAPS version III was completed. In fact, the experience with MAPS version II led to some design changes in MAPS version III (e.g. specification of a minimum desired correlation for individual feature combinations).

In order to use the MAPS (v.II) software effectively, the number of initial features must be kept to a minimum (usually 10 initial features, which corresponds to 1023 combinations). The method used in this study for selection of initial features was to use the U and C filters already present in the code (using the GENUCRULE function) to restrict the number of features. A disadvantage of this method is that no "global" minima could be specified for uniqueness and correlation which would limit the number of initial spectral features for all of the substructures of interest. Thus, the initial uniqueness and correlation values varied depending upon which substructure rule was being generated.

The number of spectral features obtained from the U and C filters for the barbiturate substructure using a number of initial uniqueness and correlation values is shown in Figure 3.7. The complete reference database for 105 compounds contains 78,308 MS/MS features.
After initial generation of MAPS rules using a minimum uniqueness and correlation of 10%, the number of features is reduced to 4,756. This represents a 94% data reduction with no loss of significant information! This reduced database can be considered as an initial rulebase for all defined substructures. Initial MAPS rules are simply collections of spectral features which meet specified Ui and Ci minima. Feature-combination rules can then be generated using the features contained in the initial rules to improve their reliability and recall.

Figure 3.7: Plot of the number of initial spectral features obtained for the BARBITURATE substructure for use in generating feature-combination rules versus minimum correlation (Ci) for several minimum uniqueness values (Ui).
[Plot not reproduced. Curves are shown for Ui = 30, 40 and 50. Annotations: 78,308 features in database; 4,756 features with U = 10% and C = 10% from initial rule generation.]

The Ui and Ci parameters used in the MAPS (v.III) program are used to control the number and type of features used in the generation of feature-combination rules. For example, there are three different combinations of the Ui and Ci parameters that yield 30 features (see Figure 3.7). These combinations are approximately Ui = 50%/Ci = 28%, Ui = 40%/Ci = 30% and Ui = 30%/Ci = 35%. The set of initial features obtained using the first combination of Ui/Ci parameter values contains higher uniqueness but lower correlation features while the set of features obtained with the third Ui/Ci combination contains lower uniqueness and higher correlation features. The effect of using different Ui and Ci parameters varies widely among the substructures.
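The selection of initial features by the Ui and Ci minima amounts to a simple threshold filter, sketched below in Python. The function names are hypothetical and this is not the MAPS code; the feature list reuses the [U,C] values printed for the phenothiazine rule in Figure 3.4 purely as sample data.

```python
# (feature, uniqueness %, correlation %) as listed in Figure 3.4.
INITIAL_RULE = [
    ("D 211", 42, 77), ("D 210", 41, 85), ("D 209", 40, 77), ("D 198", 44, 92),
    ("PD 198->171", 77, 77), ("PD 198->154", 92, 92), ("PD 197->196", 69, 85),
    ("PD 197->153", 91, 77), ("PD 196->152", 83, 77), ("PD 70->27", 42, 85),
]


def filter_features(rule, u_min, c_min):
    """Keep only the features whose uniqueness and correlation meet both minima."""
    return [(name, u, c) for name, u, c in rule if u >= u_min and c >= c_min]


def percent_reduction(before, after):
    """Percentage of entries removed when going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Raising either minimum trades the number of features against their quality.
print(len(filter_features(INITIAL_RULE, 40, 70)))
print([name for name, _, _ in filter_features(INITIAL_RULE, 80, 70)])
print([name for name, _, _ in filter_features(INITIAL_RULE, 40, 85)])

# Data reduction quoted in the text: 78,308 features reduced to 4,756.
print(round(percent_reduction(78308, 4756)))
```

A high uniqueness minimum favours specific features, while a high correlation minimum favours features seen in most compounds containing the substructure, which mirrors the trade-off between the Ui/Ci settings described above.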
The procedure used in generating feature-combination rules with this version of the MAPS software was: 1) input a large initial correlation value with various initial uniqueness values until the desired number of initial features was obtained and 2) if no features were obtained, lower the initial correlation value and try again. Thus, the initial features obtained using this method had an optimal degree of correlation with respect to the substructure of interest (with the 10-13 initial feature limitation). The spectral features contained in the initial rule obtained for the phenothiazine substructure, shown in Figure 3.4, were used to generate feature-combinations. The minimum uniqueness and 110 correlation values used to generate this rule were 40% and 70%, respectively. A total of 10 spectral features were passed by the U and C filters. The mass filter was disabled for this particular rule so several features with masses larger than the nominal mass of the phenothiazine substructure (i.e. 198 amu) were included in the rule. The feature-combination rule obtained for the phenothiazine substructure is shown in Figure 3.8. Note that each feature combination has a U value of 100%. The overall recall of the rule is 100% with a reliability of 100% when applied to the reference database. Thus, by using the feature combination method, the recall for this substructure was raised from 77% to 100% while maintaining 100% reliability. 1F “ D (198) and PD (198 -> 154) [100,92] ” (ll “ D (211) and PD (198 -> 154) [100,85] ” THEN substructure PHENOTHIAZINE is present. REL = 100% REC = 100% (Umin = 40%, Cmin = 70% for initial features; rule clauses with correlation < 85% were deleted) Figure 3.8: MAPS (v.II) feature—combination rule for the PHENOTHIAZINE substructure with a recall and reliability estimate of 100%. Twenty—five substructure identification rules were generated using this software. However, the recall observed for several of these rules was lower than expected. 
For example, an overall recall of 46% was obtained for “SS108”, a substructure consisting of a 6 membered ring with 1 nitrogen and 5 carbons. One of the reasons that a high degree of recall was not observed was the relatively small number of initial features used in generating the rule. The latest version of the MAPS software, MAPS version III, incorporates several strategies to improve the generation of feature-combination rules from a relatively large pool of initial spectral features. For example, the algorithm used in MAPS version II could only manipulate about 13-15 initial features before exhausting virtual memory. In contrast, a recall of 96% was achieved for the “SS108” substructure using 123 initial features and MAPS version III.

The MAPS Software - Version III

There are three programs which comprise the MAPS (v.III) software: GENT (GENerate Training set), a new version of GENF which generates the reduced substructure and feature data used by MAPS; the MAPS (v.III) program itself; and the RULE program, which applies feature-combination rules to the MS/MS data of unknown compounds. These programs are not only more efficient, but are easier to use. The following paragraphs describe the major enhancements found in this version of the MAPS software.

The GENT Program

The GENT program effectively performs steps 6 and 12 from Table 3.2. The program begins by prompting for the primary mass spectrum and daughter spectra filenames. A reference filename for each compound is also requested. After the datafiles for each reference compound have been entered, the program prompts for the substructure list filename (usually “SUBSTR.LIS”, the default filename used by the ASLS program), the filename for the substructure data (usually “SUBSTRUCTURES.DAT”), the minimum initial feature uniqueness (Ui), the minimum initial feature correlation (Ci), and the output filename.
Thus, the new “FEATURE-BUCKETS” produced by GENT are actually initial MAPS rules. This program is best run using a command file because there are usually many datafiles to be entered. Low initial feature uniqueness and correlation are input to GENT (usually 10% for each).

The format of the GENT output file is provided in Figure 3.9 and the contents of the file obtained for a small number of compounds are furnished in Figure 3.10. The file begins with a list of substructure reference names and mass entries (from “SUBSTRUCTURES.DAT”). This list is followed by another list detailing the compound reference names and the substructures contained in the structure of each compound (from “SUBSTR.LIS”). A list of features is then included at the end of this file. Each feature in this final list is followed by a bit string (i.e. a line of ones and zeros) which relates compounds with spectral features. Thus, for a reference database containing 105 compounds, a string containing 105 bits would follow each feature. The bit string is, in turn, followed by a list of substructures with a corresponding uniqueness and correlation for the feature/substructure pair. The GENT output file contains most of the data required by the MAPS (v.III) program.

    < substructure reference name 1 > < mass 1 > [< formula 1 >]
    < substructure reference name 2 > < mass 2 > [< formula 2 >]
    < substructure reference name n > < mass n > [< formula n >]
    < compound reference name 1 > < (SSa, SSb, ..., SSz) >
    < compound reference name 2 > < (SSa, SSb, ..., SSz) >
    < compound reference name n > < (SSa, SSb, ..., SSz) >
    < feature 1 > < 1|0 > < (SSa Ua Ca), (SSb Ub Cb), ..., (SSz Uz Cz) >
    < feature 2 > < 1|0 > < (SSa Ua Ca), (SSb Ub Cb), ..., (SSz Uz Cz) >
    < feature n > < 1|0 > < (SSa Ua Ca), (SSb Ub Cb), ..., (SSz Uz Cz) >
    < eof >

Figure 3.9: The GENT output file format.

[Figure 3.10: GENT output file obtained for a small number of compounds.]

The MAPS (v.III) Program

The MAPS (v.III) program was developed with the programming assistance of Drake Diedrich and includes several enhancements not found in previous versions of MAPS. This program requires: 1) the GENT output filename (e.g. “GENT.OUT”), 2) a substructure reference name to identify the desired substructure rule (e.g. SS141 or ALL for all substructures), 3) the minimum initial uniqueness (Ui), 4) the minimum initial correlation (Ci), 5) the output filename for the feature-combination rule(s), and 6) the minimum correlation for each feature-combination (Cc). MAPS (v.III) is best run using a command file if the user is exploring assorted Ui/Ci/Cc combinations to determine their effect on rule recall and content.

Several parameters which control the rule generation were implemented as compiler switches. One parameter controls the sort order of the features which meet the newly specified minimum uniqueness and correlation (i.e. the features used to generate the combination). This parameter is set by defining one of three symbols to select a uniqueness sort (USORT), correlation sort (CSORT) or mass sort (MSORT).
For example, the initial features used to generate feature combinations could be sorted so that the highest uniqueness features are tried first by defining USORT (e.g. #define USORT) in the MAPS source code. Other parameters set the minimum and maximum allowed feature-combination length (MINF and MAXF, respectively), the minimum number of compounds which must produce a feature before it can be used in a rule (MINCMPDS) and a maximum elapsed feature-combination generation time (SSTIME). A typical value for SSTIME is one hour. An additional parameter, HITS, was included to stop feature-combination generation after recalling the same compounds a specified number of times.

The feature-combination generation algorithm used in MAPS (v.III) takes a different approach to calculating uniqueness values for partially constructed combinations. The idea behind this algorithm is that eliminating false positives (an integer value) is precisely the same as increasing uniqueness (a floating-point value). Thus, the goal in this approach is to reduce the false positives for a given combination by adding or subtracting initial features. The integer false positives calculations can be performed more efficiently than the corresponding uniqueness calculations using our current hardware.

The feature-combination generation algorithm begins with the sorted initial features. A feature-combination is constructed by adding features from the sorted list until no false positives are observed or the recall for the combination falls below a specified minimum value. Figure 3.11 illustrates this procedure. In part A of this figure, feature 2 was added to feature 1 but this did not decrease the false positives associated with feature 1. Thus, a zero was placed in position 2 of the feature-combination bit string. Feature 3 was then tried in the feature-combination. In this case, adding feature 3 decreased the false positives associated with the combination, so a 1 was placed in position 3 of the feature-combination bit string. An additional optimization called “back-checking” comes into play at this point. Only the non-redundant features (i.e. features that have some effect on the uniqueness or correlation of a combination) should be included in a feature-combination. Thus, features which do not affect the reliability or recall of a feature-combination need to be pruned from the feature-combination.

    Feature Combination Bit String

          f1  f2  f3  f4  f5  ...  fn
    A)     1   0   1   0   0    0   0
    B)     0   0   1   0   0    0   0
    C)     0   0   1   0   1    0   0

Figure 3.11: An illustration of feature combination bit string generation with back-checking.

The back-checking optimization occurs after a new feature is added to a combination (part A, Figure 3.11). The feature-combination generation function checks the features added before the latest one to determine if removal of any of these features increases false positives or decreases recall. If this check is negative for a feature (i.e. removing the feature does not increase false positives or decrease recall), then the bit for that feature is changed to zero, as shown in part B of Figure 3.11. This process does not preclude the feature from future consideration in feature-combinations, but only from the current one. In other words, it is certain that the features removed from one feature-combination are tried in future combinations since they have already demonstrated some ability to decrease false positives. The major advantage of back-checking is that feature-combinations with non-redundant features are produced by the generation function.

Sample output for the MAPS (v.III) program is provided in Figure 3.12. The program will output the initial features and their U and C values with respect to the input substructure.
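The construction of a single feature-combination with back-checking, as described above, can be sketched as follows. This is an illustration under stated assumptions, not the MAPS (v.III) source: each feature is represented by an integer bit mask over the reference compounds, a combination recalls the compounds that show all of its features, and the function and variable names are hypothetical. The toy data were chosen so that the result follows parts A to C of Figure 3.11.

```python
from typing import List, Tuple


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def build_combination(
    features: List[int], ss_mask: int, n_cmpds: int, min_recall: int = 1
) -> Tuple[List[int], int, int]:
    """Build one feature-combination from features in sorted order.

    features[i] has bit k set if compound k shows feature i;
    ss_mask has bit k set if compound k contains the substructure.
    Returns (bit string, false positives, recalled compounds).
    """
    all_mask = (1 << n_cmpds) - 1

    def stats(sel: List[int]) -> Tuple[int, int]:
        m = all_mask
        for k in sel:
            m &= features[k]          # compounds showing every selected feature
        return popcount(m & ~ss_mask), popcount(m & ss_mask)

    selected: List[int] = []
    fp, rec = stats(selected)
    for i in range(len(features)):
        new_fp, new_rec = stats(selected + [i])
        if new_rec < min_recall:      # recall would fall below the minimum
            break
        if new_fp >= fp:              # no fewer false positives: bit stays 0
            continue
        selected.append(i)
        fp, rec = new_fp, new_rec
        # Back-checking: drop earlier features whose removal neither
        # increases false positives nor decreases recall.
        for j in selected[:-1]:
            trial = [k for k in selected if k != j]
            t_fp, t_rec = stats(trial)
            if t_fp <= fp and t_rec >= rec:
                selected, fp, rec = trial, t_fp, t_rec
        if fp == 0:                   # no false positives left: done
            break
    bits = [1 if k in selected else 0 for k in range(len(features))]
    return bits, fp, rec


# Hypothetical data: 6 compounds, compounds 0-2 contain the substructure.
ss = 0b000111
f = [0b011111, 0b111111, 0b001111, 0b101111, 0b010111]  # f1 .. f5
bits, false_pos, recalled = build_combination(f, ss, n_cmpds=6)
```

Because only integer counts of false positives are compared, no floating-point uniqueness value is computed during the search. The sketch produces one clause; a full rule would repeat the process to cover the compounds that the first clause does not recall.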
The number of compounds with the input substructure, the number of rule clauses (feature-combinations) generated, the overall recall of the rule and the generation time are provided by the program. The format for the rule output file has not been changed from the LISP format and is shown in Figure 3.13.

    Training set file: gent_p1_10_10.out
    105 compounds 127 substructures 4756 features
    Substructures:
    SS4 SS18 SS19 SS20 SS21 SS44 SS45 SS46 SS50 SS93 SS133 SS60 SS12 SS22 SS23 SS24
    SS25 SS26 SS27 SS28 SS29 SS30 SS31 SS32 SS33 SS34 SS35 SS36 SS37 SS43 SS11 SS10
    SS78 SS56 SS70 SS87 SS92 SS3 SS7 SS88 SS51 SS54 SS2 SS40 SS63 SS15 SS48 SS90
    SS77 SS118 SS57 SS59 SS100 SS108 SS116 SS124 SS6 SS144 SS148 SS71 SS72 SS141
    SS143 SS145 SS85 SS110 SS132 SS147 SS142 SS13 SS47 SS130 SS61 SS84 SS104 SS155
    SS91 SS102 SS157 SS82 SS122 SS64 SS65 SS83 SS98 SS121 SS156 SS158 SS41 SS55
    SS62 SS111 SS112 SS113 SS120 SS137 SS125 SS127 SS128 SS131 SS95 SS159 SS5 SS119
    SS97 SS149 SS146 SS105 SS109 SS67 SS69 SS107 SS94 SS126 SS135 SS123 SS139 SS117
    SS86 SS153 SS154 SS96 SS103 SS151 SS152 SS115 SS74
    Generate rule for substructure (name,ALL,return): ss132
    Feature uniqueness[0.3]: .4
    Feature correlation[0.3]: .7
    Output file: ss132_p1_40_70_100_50.out
    Combination correlation[0.7]: .5
    SS132 7 features
    7 SS132 features
    ( (PD 198.0 154.0) U=92 C=92)
    ( (PD 197.0 153.0) U=90 C=76)
    ( (PD 196.0 152.0) U=83 C=76)
    ( (PD 198.0 171.0) U=76 C=76)
    ( (PD 197.0 196.0) U=68 C=84)
    ( (PD 70.0 27.0) U=44 C=84)
    ( (D 198.0) U=42 C=92)
    SS132 7 features 13 compounds 10 clauses 100% recall 0 s
    Generate rule for substructure (name,ALL,return):

Figure 3.12: Sample output from the MAPS (v.III) program with user input highlighted in bold faced type.

    (setq SS1_COMB_RULE '(
      (((feature-a1) (feature-b1) (...) (feature-n1)) Uc Cc (compound-1 compound-2 compound-x))
      (((feature-a2) (feature-b2) (...) (feature-n2)) Uc Cc (compound-1 compound-2 compound-x))
      (((feature-ay) (feature-by) (...) (feature-ny)) Uc Cc (compound-1 compound-2 compound-x))
    ))
    (setq SS2_COMB_RULE '(
      (((feature-a1) (feature-b1) (...) (feature-n1)) Uc Cc (compound-1 compound-2 compound-x))
      (((feature-a2) (feature-b2) (...) (feature-n2)) Uc Cc (compound-1 compound-2 compound-x))
      (((feature-ay) (feature-by) (...) (feature-ny)) Uc Cc (compound-1 compound-2 compound-x))
    ))
    (setq SSn_COMB_RULE '(
      (((feature-a1) (feature-b1) (...) (feature-n1)) Uc Cc (compound-1 compound-2 compound-x))
      (((feature-a2) (feature-b2) (...) (feature-n2)) Uc Cc (compound-1 compound-2 compound-x))
      (((feature-ay) (feature-by) (...) (feature-ny)) Uc Cc (compound-1 compound-2 compound-x))
    ))

    where SS1, etc. is the substructure reference name, feature-a1, etc. are any spectral features, Uc is the feature-combination uniqueness, Cc is the feature-combination correlation, and compound-1, etc. are the compound reference names that each feature combination recalls.

Figure 3.13: The format of the MAPS feature-combination rule save files.

A total of 42 substructure identification rules were generated out of 49 possible rules with an average recall of 82% using MAPS (v.III). The substructures corresponding to these rules vary widely in complexity (e.g. phenyl, phenothiazine, barbiturate and dimethylamino). The initial conditions used were: Ui=30%, Ci=30% and Cc=30%. However, these values are not necessarily optimal for all substructures. Therefore, recall can be further improved using different starting values for some of the substructures with less than 100% recall. An important and limiting consideration, however, is the reliability of the rules when they are applied to compounds outside of the reference database. This question is addressed in Chapter 6.

The RULE Program

The RULE program is used to apply MAPS feature-combination rules to MS/MS data of unknown compounds.
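Because a feature-combination rule is a set of alternative clauses, each of which is a conjunction of spectral features (see the IF/OR/THEN form of Figure 3.8), applying a rule to an unknown reduces to a subset test. The sketch below is a minimal illustration of that idea, not the RULE program itself; the tuple encoding of features, the function name and the unknown's feature list are assumptions made for the example.

```python
from typing import Dict, FrozenSet, List, Tuple

# A feature is encoded as a tuple, e.g. ("D", 198.0) or ("PD", 198.0, 154.0).
Feature = Tuple
Clause = FrozenSet[Feature]


def apply_rules(
    rules: Dict[str, List[Clause]], unknown: FrozenSet[Feature]
) -> Dict[str, List[Clause]]:
    """Return, for each substructure found, the clauses that fired.

    A clause fires when every one of its features is present in the
    unknown; a substructure is reported when at least one clause fires.
    """
    hits: Dict[str, List[Clause]] = {}
    for substructure, clauses in rules.items():
        fired = [clause for clause in clauses if clause <= unknown]
        if fired:
            hits[substructure] = fired
    return hits


# The two clauses of the phenothiazine rule in Figure 3.8.
rules = {
    "PHENOTHIAZINE": [
        frozenset({("D", 198.0), ("PD", 198.0, 154.0)}),
        frozenset({("D", 211.0), ("PD", 198.0, 154.0)}),
    ]
}

# Hypothetical feature sets extracted from two unknowns.
unknown_a = frozenset({("D", 198.0), ("PD", 198.0, 154.0), ("P", 77.0)})
unknown_b = frozenset({("D", 198.0), ("P", 77.0)})

found_a = apply_rules(rules, unknown_a)
found_b = apply_rules(rules, unknown_b)
```

Keeping the fired clauses, rather than a bare yes/no answer, mirrors the sample output in which each identification is listed with the features that produced it.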
The required inputs are: 1) the filenames of the primary and daughter mass spectra of an unknown, 2) the name of the file containing the rules and 3) the output filename. The substructures identified as present (using inclusion rules) are written to the output file in a format given in Chapter 4. A sample of the output from the RULE program for a test compound is shown in Figure 3.14.

    Input Primary Spectrum[*.dat]: khunk1mp 1
    Input Daughter Spectra[*.dat]: khunk1md 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
    Rule file: maps2_p1_all_rules.out
    Output file: khunk1.res
    Unknown has the following substructures:
    SS18 (P 63.0) (D 59.0)
    SS18 (D 59.0) (N 14.0)
    SS18 (D 59.0) (N 50.0)
    SS18 (D 59.0) (N 51.0)
    (and so on for all remaining substructure identifications)

Figure 3.14: Sample output from the RULE program with user input highlighted in bold faced type.

The MAPS (v.III) software represents a significant advance in the development of ACES. These programs have increased the amount of data that can be used to generate substructure identification rules using the feature-combination method. Previous versions of the MAPS software could not effectively utilize the data in the newly generated reference database. The feature-combination method has proven to give highly reliable rules with respect to the reference database. These rules are critical to the success of the ACES system. All three of the programs that comprise the MAPS (v.III) software can be used in batch mode through the use of command files. Thus, a range of parameter values for each program can be efficiently explored. The result file written by the MAPS program is used by the automated structure generator to obtain candidate structures for the unknown. The structure checking and generation software used in the ACES system is described in the next chapter.

CHAPTER 4

Automated Substructure Search and Structure Generation

Introduction

Two of the important functions required to implement the Automated Chemical structure Elucidation System (ACES) are 1) the ability to search structures of standard compounds to determine their substructure content and 2) the ability to construct candidate structures for an unknown compound. The latter function is needed to provide the structures which are consistent with the results of the ACES interpretive software. The first function is essential for both the generation of the MAPS substructure identification rules and for the evaluation of the candidate structures of an unknown compound.

Programs for substructure search and isomer generation have already been developed [1-4] but lack the interfaces and level of automation required by the ACES system. Consequently, one of the major tasks in making ACES a reality was modification of one of these programs to provide the substructure search and structure generation functions needed by ACES.

The software selected for use in this project was GENOA [5], a structure generation package originally developed as part of the DENDRAL project [6]. There are a number of publications in the literature regarding the use of GENOA. Programs have been developed for the interpretation of mass spectra using rules of fragmentation and for the prediction of mass spectra of candidate structures [7]. These programs were applied to the analysis of marine sterols. Another program was developed to interpret and predict 13C-NMR spectra. Substructures derived from these programs were used to constrain generation of candidate structures. The structures obtained were then ranked according to the similarity of their predicted mass and 13C-NMR spectra [8-10]. Similar programs were developed for 1H-NMR [11] and two-dimensional NMR [12]. Use of these programs in the analysis of natural products was also described [13]. The advances in the interpretation of NMR data were accompanied by updates to GENOA to include generation of stereoisomers. Much of the GENOA software was subsequently commercialized by Molecular Design, LTD [14]. Among the functions not included in the commercial version of the software were mass spectral prediction and generation of stereoisomers.

This chapter begins by introducing the techniques of computerized representation of chemical structure. Connectivity tables are discussed in greatest detail since they are the most appropriate representation for computerized structure elucidation. The programs included in the GENOA software are then described, along with the modifications that were necessary to provide the structure manipulation functions essential to ACES.

Computerized Representation of Chemical Structures

There are several representations of molecular structures that are compatible with computer data storage and manipulation. These have been reviewed in a book by Gray [15]. Structure representation schemes have been categorized as 1) fragment codes, 2) linear notations, 3) connection tables and 4) coordinate representations [16].
These representations vary in the information they provide (e.g. atomic connectivity vs. molecular shape) and in the computation time required to develop and interpret them. Consequently, different applications may best be served by different representations. The following paragraphs provide brief descriptions of the major structure representations and the applications for which they are used.

Fragment codes usually describe the key functionalities contained in a structure and are most often used in the searching of large databases. These codes are also referred to as screens since they sometimes consist of a range of atoms (e.g. 5-10 carbons). The advantage of fragment codes is that a large database can be rapidly searched using these codes. The principal disadvantages are 1) they do not allow direct reconstruction of a structure and 2) they do not provide for generalized substructure search [15].

Linear notations have also been used extensively in searching large databases. The Wiswesser linear notation (WLN) is an example of this type of structure representation. The WLN string obtained for the structure shown in Figure 4.1, for example, was: T56 BN DN FN HN BU FU HUTJ IZ DBT5OTJ CQ DQ E1Q [17]. However, the conversion of structure diagrams to WLN and back again has been only partially successful. Another limitation of this notation is the difficulty of manipulating WLN strings to derive substructure information [15].

[Figure 4.1: Structure corresponding to WLN notation given in text. Adapted from reference 17.]

Connectivity tables provide a list of the atoms contained in a molecule or substructure and the atoms to which they are bonded. Thus, they impart the topology of a molecule or substructure. An early format for computerized connection tables was published by Ray and Kirsch in 1957 [15,18]. The connection table published for chloral is shown in Table 4.1 [19]. The structure of chloral and the corresponding atom numbering are also shown. Among the notable characteristics of this format are 1) the explicit numbering of hydrogens and 2) the use of element symbols for bond order.

    Component   Connections   Element Symbol
    1           2             O
    2           1,3           =
    3           2,4,6         C
    4           3             H
    5           6             Cl
    6           3,7,8,5       C
    7           6             Cl
    8           6             Cl

    [Structure diagram of chloral, Cl3C-CHO, labelled with the component numbers used in the table.]

Table 4.1: Connectivity table for chloral from reference 17.

The DENDRAL format for hydroquinone is shown in Table 4.2. This format does not express the connectivity of hydrogen explicitly. Also, double bonds are expressed as multiple occurrences of a connection to an atom (e.g. a double bond between carbon #4 and carbon #6 in the table). A number of other atom properties, such as hybridization, are provided in this format as well.

    Atom #   Type   Neighbors   Artype   Hybrid
    1        O      3                    SP3
    2        O      8                    SP3
    3        C      5 5 1 4     AROM     SP2
    4        C      6 6 3       AROM     SP2
    5        C      7 3 3       AROM     SP2
    6        C      8 4 4       AROM     SP2
    7        C      8 8 5       AROM     SP2
    8        C      2 6 7 7     AROM     SP2

    [Structure diagram of hydroquinone with atom numbering; the C=C double bond between atoms 4 and 6 is marked.]

Table 4.2: DENDRAL connectivity table for hydroquinone from reference 15.

Atomic connectivity can also be represented in matrix form using the atom connectivity matrix (ACM) [15]. The ACM for hydroquinone is shown in Table 4.3, and uses the same atom numbering as shown in Table 4.2. The diagonal elements are used to denote atom properties (e.g. atom name) and the off-diagonal elements are used to define connectivity (e.g. bond order).

         1  2  3  4  5  6  7  8
    1    O  0  1  0  0  0  0  0
    2    0  O  0  0  0  0  0  1
    3    1  0  C  1  2  0  0  0
    4    0  0  1  C  0  2  0  0
    5    0  0  2  0  C  0  1  0
    6    0  0  0  2  0  C  0  1
    7    0  0  0  0  1  0  C  2
    8    0  1  0  0  0  1  2  C

Table 4.3: Atom connectivity matrix for hydroquinone from reference 15.

Connectivity tables are most appropriate in systems where structure manipulation and substructure search are required. Consequently, they are often encountered in structure elucidation and synthesis programs. The major disadvantage of connectivity tables for some applications is the computation time required to manipulate them.
Coordinate representations provide the shape of molecules and are of value in studying structure-activity relationships [15]. The difference between this representation and connectivity tables is that coordinate representations show the relative positions of atoms in 3-dimensional space (e.g. conformational stereochemistry). The torsion angles contained in these representations are derived from X-ray crystallography or quantum mechanical calculations.

The Structure Generation Program, GENOA

GENOA is an interactive isomer generation program which produces an exhaustive and nonredundant set of structural isomers based on substructure constraints. The program requires as input the molecular formula of an unknown and substructure constraints. These constraints can be inclusive (i.e. a substructure present in the unknown), exclusive (i.e. a substructure not present in the unknown) or alternative (i.e. a list of one or more substructures, one of which is present in the unknown). This generation program also has the distinct advantage of being able to handle overlapping substructures. Thus, each substructure constraint entered into the program does not have to be a distinct structural entity. In terms of the MAPS software, the substructures found as present can all be entered into the program without further analysis to determine whether a substructure identification is a unique occurrence or is, in fact, due to a larger substructure.

Structures and substructures are represented in this software using connectivity tables. The connectivity table obtained for the barbiturate substructure is illustrated in Table 4.4. The HRANGE descriptor is used to control the number of hydrogens which may be attached to each atom. The sites where other atoms, either free atoms or atoms in other substructures, may bond are called "free valences". Free valences can be designated in a substructure definition explicitly using the FV descriptor or implicitly using the HRANGE descriptor. Free valences are actually converted to HRANGES in the computer software [15].

    Atom #   Type   Neighbors    Hrange
    1        N      6 2 FV       0-1
    2        C      7 7 3 1
    3        N      4 2 FV       0-1
    4        C      8 8 5 3
    5        C      6 4 FV FV    0-2
    6        C      9 9 1 5
    7        O      2 2
    8        O      4 4
    9        O      6 6

    [Structure diagram of the barbiturate substructure with atom numbering.]

Table 4.4: GENOA connectivity table obtained for the barbiturate substructure.

The major functions of the GENOA program are summarized in Table 4.5. Atom and substructure definitions can be saved to a file for repetitive use. Atom definitions consist of a symbol and the number of valences (e.g. C, valences = 4). Substructures are defined by creating the connectivity table for each substructure. The molecular formula definition usually changes for each session. The draw command allows substructures to be viewed. However, these drawings are collections of characters and are occasionally confusing to interpret. The constraint command allows structural constraints to be entered with a range of occurrences (e.g. none, at least 2, exactly 1). Once the molecular formula and substructure constraints have been entered, the generate command can be issued to begin the generation of candidate structures.

    COMMAND        DESCRIPTION
    ALTERNATIVE    is used to define a substructure constraint on one of a number of substructures (e.g. ALTERNATIVE carbonyl ethyl).
    BLEACH         is used to remove color designations in cases. Atoms may be “colored” to control how substructures overlap.
    CONSTRAINT     is used to define a substructure constraint with a number of occurrences (e.g. CONSTRAINT ethyl at least 1).
    DRAW           is used to draw structures, cases and substructures using ASCII characters in either a numbered or atom-labelled format.
    DEFINE         is used to define atoms, substructures and the molecular formula.
    [illegible]    is used to terminate the program.
    [illegible]    is used to correct an already existing definition.
    FORGET         is used to remove definitions.
    GENERATE       is used to initiate exhaustive structure generation.
    PRESUPPOSE     is used to start structure generation using a file of alternative substructures (e.g. in some natural products one might assume the presence of a terpenoid substructure).
    RESTORE        is used to restart a previously saved session or a file of substructure definitions.
    SAVE           is used to store the current status of a session in a disk file.
    SEARCH         is used to search files for definitions.
    SHOW           is used to show current status, history and connectivity tables.

Table 4.5: Summary of major GENOA commands.

A typical GENOA session is shown in Figure 4.2. User supplied input is highlighted in bold faced type. The program is command driven with the “#” symbol as a prompt. The first command given to the program in the example defines the molecular formula of a test compound (i.e. butalbital). The next two commands retrieve substructures from a user defined substructure library called “SUB.LIB”. Drawings for the “SS145” and “SS72” substructures are provided in Figure 4.3. The program is highly interactive and prompts for required information if it is not found in the entered command line. This feature is demonstrated in the example for the first “constraint” command.

    WELCOME TO GENOA, VERSION 83.0, MOLECULAR DESIGN LTD.
    #define molform c 11 h 16 n 2 o 3
    MOLECULAR FORMULA DEFINED
    #search mylib sub.lib ss145
    SS145 DEFINED
    #search mylib sub.lib ss72
    SS72 DEFINED
    #constraint
    SUBSTRUCTURE NAME:ss145
    RANGE OF OCCURRENCES:at least 1
    1 CASE WAS OBTAINED
    #constraint ss72 at least 1
    #.
    1 CASE WAS OBTAINED
    #generate
    #..D..DDD.
    5 STRUCTURES WERE GENERATED
    TRANSFERRING CONTROL TO STRCHK...

Figure 4.2: A sample GENOA session.
Another constraint is entered and then the generate command is issued. In this example, 5 structures were obtained. GENOA outputs a period to the screen for each unique structure generated and a D for each duplicate structure. Program control is then transferred to another program called STRCHK, which is described in the next section.

[Figure 4.3: Substructure drawings for the SS145 and SS72 substructures.]

The Structure Checking Program, STRCHK

One of the major programs included with GENOA is a structure checking program called STRCHK [8]. This program provides the ability to interactively search complete structures (obtained from the structure generation program, GENOA) for occurrences of predefined substructures. The STRCHK program has many of the same commands as GENOA (see Table 4.5) with the following additions:

    PRUNE    a command used to remove structures which do not meet the constraints given in the command line (e.g. PRUNE SS3 at least 1 would remove all structures which did not have at least 1 occurrence of SS3).
    SURVEY   provides a synopsis of the substructure content of the list of candidate structures obtained from GENOA.

The SURVEY results using the structures generated in the previous example are shown in Figure 4.4. Once again, user input is highlighted in bold faced type. The key outputs from the program are shown in bold faced, italic type. The most important output, in terms of evaluating a set of candidate structures, is the list of discriminating substructures. This output consists of a list of substructure names and an associated number of occurrences in the candidate structures. Discriminating substructures indicate quantitatively the relative importance of substructures not yet determined as being present or absent by the interpreter (or interpretive software). Thus, the interpreter can focus attention on identifying these substructures using existing data or possibly data from new experiments.
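The bookkeeping behind a survey of this kind can be sketched in a few lines. The example below is an illustration only and does not reproduce STRCHK: it assumes each candidate structure has already been reduced to the set of library substructures it contains, and the function names and candidate data are hypothetical (the counts were chosen to echo the sample in Figure 4.4).

```python
from collections import Counter
from typing import Dict, List, Sequence, Set, Tuple


def survey(
    candidates: Sequence[Set[str]], library: Sequence[str]
) -> Tuple[List[str], List[str], Dict[str, int]]:
    """Classify library substructures against a list of candidates.

    Returns (found in no structure, found in all structures,
    discriminating substructures with their occurrence counts).
    """
    counts = Counter(ss for cand in candidates for ss in cand)
    n = len(candidates)
    not_found = [ss for ss in library if counts[ss] == 0]
    in_all = [ss for ss in library if counts[ss] == n]
    discriminating = {ss: counts[ss] for ss in library if 0 < counts[ss] < n}
    return not_found, in_all, discriminating


def select(candidates: Sequence[Set[str]], required: Set[str]) -> List[Set[str]]:
    """Return the candidates that contain every required substructure."""
    return [cand for cand in candidates if required <= cand]


# Five hypothetical candidate structures, each given as its substructure set.
candidates = [
    {"SS72", "SS145", "SS6", "SS71"},
    {"SS72", "SS145", "SS6", "SS71"},
    {"SS72", "SS145", "SS6", "SS71"},
    {"SS72", "SS145", "SS6"},
    {"SS72", "SS145"},
]
library = ["SS3", "SS6", "SS71", "SS72", "SS145"]

not_found, in_all, discriminating = survey(candidates, library)
both = select(candidates, {"SS6", "SS71"})
```

Only the substructures that occur in some, but not all, candidates carry information for narrowing the list, which is why they are singled out.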
It is also possible to determine the number of structures which possess combinations of discriminating features using the SELECT command (a sub-command of SURVEY) as shown in Figure 4.4.

    #survey y mylib sub.lib all n
    READING ENTRIES FROM LIBRARY FILE.
    SCANNING THROUGH STRUCTURES.
    THE FOLLOWING LIBRARY FEATURES WERE NOT FOUND IN ANY STRUCTURE.
    (long list of substructures)
    THE FOLLOWING LIBRARY FEATURES ARE IN ALL STRUCTURES.
    SS19 SS43 SS72 SS116 SS141 SS143 SS145
    #STRUCTURES WITH DISCRIMINATING FEATURES:
    4 SS6
    3 SS71
    Do you want to select structures with combinations of features?y
    ->select
    Desired features>ss6 and ss71
    3 structures.
    ->done
    #

Figure 4.4: A demonstration of the SURVEY function of STRCHK.

Use of Substructure Search in Rule Generation

The substructure buckets are one of the major pieces of information required by the MAPS software. These data summarize the substructure content of each of the reference database compounds. Manual construction of the substructure buckets is time consuming and prone to errors, especially as the size of the reference database increases. Consequently, a computerized method of creating molecular structures, storing the computer representations obtained, and searching each structure for occurrences of predefined substructures is required. This problem was largely solved with the acquisition of the GENOA software. However, three major tasks remained to provide automated structure searching. First, a library of predefined substructures had to be constructed. The libraries provided with the software were not directly applicable to the reference database compounds. Second, the structures of each of the reference database compounds had to be generated using the GENOA program. Third, the STRCHK program had to be modified to allow a series of structures with differing molecular formulae to be searched. The substructures found to be present are written to a disk file.
The construction of the substructure library required the use of the substructure definition commands in GENOA. The interactive version of STRCHK was used to complete this operation. An example of the commands used to create the substructure definitions is shown in Figure 4.5. Two types of substructure definitions were included in the library. The first type of definition includes those substructures which define a particular compound class (e.g. barbiturate, phenothiazine, phenol, etc.). The second type of definition includes aromatic ring isomers with differing substitution patterns. These substructures were included mainly to detect ring isomers among the set of candidate structures, rather than to produce MAPS substructure identification rules (a difficult, if not impossible, task without resorting to isomer-specific experiments such as energy resolved mass spectrometry).

WELCOME TO GENOA, VERSION 83.0, MOLECULAR DESIGN LTD.
#define substructure SS161
(NEW SUBSTRUCTURE)
>?
CHAIN RING BRANCH LINK JOIN BORD HYBRIDIZATION ATNAME HRANGE FREEV ARTYPE UNJOIN COLOR ERASE SHOW DRAW GET DONE HALT
>ring 6
>bord
FIRST ATOM:1              /* define a benzene ring */
SECOND ATOM:2
BOND ORDER:2
>bord 3 4 2 5 6 2
>branch
FROM ATOM:1               /! fuse 4 carbons to make a !/
LENGTH OF BRANCH:4        /! 6 membered ring !/
>join 2 10
>branch 7 4               /+ fuse 4 more carbons to +/
>join 14 8                /+ make another ring +/
>branch 7 3
>at 17 N                  /> fuse 2 C's and 1 N to >/
>join 17 9                /> make a 5 membered ring >/
>done
SS161 DEFINED

Figure 4.5: GENOA session illustrating the creation of a substructure definition.

Once all of the substructures were defined, they were saved to a file called “SUB.LIB”. The substructure library currently contains 160 substructure definitions. The reference names of the substructures were generalized to “SSx” where “x” is the entry number in the list of substructures. Previously, descriptive names such as “ethyl” or “1,2-phenyl” were used.
However, it became increasingly difficult to formulate short descriptive names, especially for closely related substructures. Usually, the substructure definitions were kept as general as possible since it is difficult to make assumptions about substitution patterns on the substructures. For example, “SS18” is a generic benzene ring with no specific substitution pattern. For rule application, this is the only substructure that is submitted to the structure generator. There are 127 substructures out of this library which are found in the reference database compounds.

It should be emphasized that our approach to substructure identification is not to identify the specific structure of an ion and then relate that structure to the overall molecular structure. Rather, our approach is to discover those spectral features in the MS/MS dataspace which are indicative of the presence (or absence) of a predefined substructure. This point has been illustrated using the ion structure of the 149+ ion and the phthalate-ester substructure [20]. Thus, the substructure definitions are descriptors of key molecular substructures (not ions) and a reliable MAPS rule may or may not be found for a given substructure definition.

The GENOA-formatted structure for each of the reference database compounds was obtained in the following manner. A substructure definition was created which included the connectivity of all of the atoms in each of the compounds except hydrogen. The resulting substructure definition was then used as a constraint in the structure generator. Thus, only one structure was obtained when the structure generation was initiated. Each structure was then saved to a file with a standard file extension, “.MSMS”. This extension allows a file to be easily created which contains the filenames of all of the GENOA-format structures. There are currently 105 of these structures corresponding to the reference database compounds listed in Chapter 2.
The Molecular Design version of the STRCHK program was modified as part of this research to automatically check the structures of each of the “.MSMS” files. Several file protocols were also developed to automate this program for use in ACES. The modified program is referred to as ASLS (Automated Structure Library Search) within the ACES system. This program assumes that there is a file called “STR.LIS” which contains the filenames of all of the reference database structures. This file can be easily generated using the VMS (the VAX operating system) DIRECTORY / NOHEADER / NOFOOTER command. For convenience, a command file which creates the “STR.LIS” file and then invokes the ASLS program was written as part of this research. The results of the program are written to another file called “SUBSTR.LIS”. This is the file used by the MAPS software to create the substructure buckets and, subsequently, substructure identification rules. The format of the “SUBSTR.LIS” file is shown in Figure 4.6.

Probably the only statistic of general interest regarding the “SUBSTR.LIS” file obtained from the current reference database is the average number of substructures found in the reference database compounds. On average, 11 substructures were identified per compound. This average indicates that the compounds contained in our database are not simple (i.e. monofunctional) but are, in reality, quite complex.

A number of improvements could still be made to the ASLS program. These improvements include: 1) creating the “STR.LIS” file from within the program using VMS run-time library commands, thus eliminating the need for a command file, 2) creating a “unified” data format so all of the reference database structures can exist in one file, and 3) modifying the code so the substructure library, “SUB.LIB”, is only loaded once, rather than each time a search is performed.
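The average quoted above could be computed from a substructure listing with a sketch such as the following. The record layout is an assumption: each record is taken to begin with a token of the form "@NAME" followed by whitespace-separated substructure names, which is only a guess at the “SUBSTR.LIS” layout. The compound names and contents below are invented for illustration.

```python
def parse_records(text):
    """Parse '@NAME SSa SSb ...' records into {name: [substructures]}.

    Assumed layout: a token starting with '@' opens a new record and every
    following token, up to the next '@' token, is a substructure name.
    """
    records = {}
    current = None
    for token in text.split():
        if token.startswith("@"):
            current = token[1:]
            records[current] = []
        elif current is not None:
            records[current].append(token)
    return records


def average_substructures(records):
    """Mean number of substructures per compound (0.0 for an empty file)."""
    if not records:
        return 0.0
    return sum(len(subs) for subs in records.values()) / len(records)


# Hypothetical listing with two compounds.
listing = """@CMPD1 SS4 SS18 SS19
@CMPD2 SS44
"""
records = parse_records(listing)
mean_count = average_substructures(records)
```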
It should be noted at this point that the STRCHK software (STRCHK and associated modules) comprises approximately 10,500 lines of undocumented C code. While Molecular Design was kind enough to provide us with the source code to this software, they provided no software support. Thus, the modifications made to this software (and to GENOA) required more “sleuthing” than is generally desirable.

FORMAT / EXCERPT (column layout lost in extraction):
@ SS4 SS18 SS19 @ SS44 SS46 SS50 @ SS70 SS87 SS133 @GMRIO SS4 SS18 SS19 SS20 SS21 SS44 SS45 SS46 SS50 SS93 SS133 (...)

Figure 4.6: Format and an excerpt of the “SUBSTR.LIS” file.

Automatic Structure Generation

Three ACES programs are involved in the generation of candidate structures. These programs are: 1) the molecular formula generator (MFG) [21], 2) the RULE program, and 3) GENOA. As was discussed in Chapter 1, the MFG program produces a list of molecular formula(e) that are consistent with a given molecular weight and elemental constraints. If an exact mass is input to MFG, only one formula will be generated. If only TQMS data are available, a molecular mass with a typical tolerance of 0.1 amu is entered. Additional constraints based on the elemental compositions of substructures found to be present by the RULE software or from ratio measurements of ions in the M+1 daughter spectrum can be used to reduce the number of formulae generated by the MFG program. The RULE program was recently written by Drake Diedrich in C to apply MAPS rules to MS/MS data of unknowns. This program prompts for the filenames of the MS/MS data, the file containing the MAPS rules and the output filename. The interaction of these programs and their I/O are shown in Figure 4.7.

[Figure 4.7: Interaction of the MFG, RULE and GENOA programs and their I/O. The diagram is not recoverable.]
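The role of the mass tolerance in formula generation can be illustrated with a brute-force sketch. This is not the MFG algorithm of reference [21]; it simply enumerates heavy-atom counts within user-supplied limits (a stand-in for the elemental constraints) and solves for the hydrogen count. The function name, the element set and the limits are assumptions, and no valence or ring/double-bond checks are applied.

```python
from itertools import product

# Monoisotopic masses (amu) of the most abundant isotopes.
MASS = {"C": 12.0, "H": 1.007825, "N": 14.003074,
        "O": 15.994915, "S": 31.972071, "Cl": 34.968853}


def formulas(target, tol, limits):
    """List element-count dicts whose mass lies within `tol` of `target`.

    `limits` gives the maximum count allowed for each element, for example
    {"C": 20, "H": 40, "N": 3}. Elements missing from `limits` are not used.
    """
    heavy = [e for e in limits if e != "H"]
    hits = []
    for counts in product(*(range(limits[e] + 1) for e in heavy)):
        heavy_mass = sum(n * MASS[e] for e, n in zip(heavy, counts))
        # Nearest hydrogen count that could close the mass gap.
        h = round((target - heavy_mass) / MASS["H"])
        if h < 0 or h > limits.get("H", 0):
            continue
        if abs(heavy_mass + h * MASS["H"] - target) <= tol:
            composition = dict(zip(heavy, counts))
            composition["H"] = h
            hits.append(composition)
    return hits


limits = {"C": 20, "H": 40, "N": 3, "O": 3, "S": 2, "Cl": 2}
tight = formulas(318.0957, 0.001, limits)   # exact-mass style input
loose = formulas(318.0957, 0.1, limits)     # 0.1 amu tolerance
```

Widening the tolerance can only add candidates, which is why additional elemental constraints become valuable when only nominal-resolution data are available.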
GENOA has been modified to read and utilize the information available from the MFG and RULE programs. The modified version of this program is called AGEN (Automated GENoa). The GENOA program requires, at a minimum, the molecular formula of an unknown before candidate structures can be generated. Structural constraints are also necessary to limit the number of structures generated. Automation of the structure generation process requires a link between GENOA, the MFG program and the RULE program. The links used in this work are files written to include the results of each program. The MFG program writes a file with a standard file extension of “.MFG” which contains a list of molecular formula(e). The format and an example of this file are shown in Figure 4.8. The RULE program writes a file using a standard file extension of “.MPS” which lists the substructures found to be present in an unknown. This software will be modified to utilize exclusion rules, as well. The format and an example of this file are shown in Figure 4.9.

FORMAT       EXAMPLE
molform1     C20H8N2S1O2
molform2     C20H8N2S2
             C20H12N4S1
molformn     C20H21N2S1F1

Figure 4.9: Format and example of the “.MPS” results file.

The automated procedure used in the AGEN program is summarized in Figure 4.10. The AGEN program can produce more than one result file; one file for each entry in the “.MFG” file. The format of these files is the standard “save” format used by GENOA with a standard file extension of “.STR”. Note in Figure 4.10 that the AGEN program can reject a molecular formula if the substructure constraints are inconsistent with a given molecular formula. Sample output from AGEN is provided in Figure 4.11. The output shown was obtained for chlorpromazine, another test compound. For brevity, only one molecular formula was used and the output for substructure retrieval and case generation (i.e. calculating overlapping substructures) were abbreviated.
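The per-formula loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the AGEN code: the structure generator is represented by a caller-supplied function, the element-count pre-check is only one possible way a formula could be found inconsistent with the substructure constraints, and the substructure compositions given for SS13 and SS10 are inferred from their descriptive names.

```python
import re


def parse_formula(formula):
    """Turn a formula string such as 'C17H19ClN2S' into element counts."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def fits(formula_counts, substructure_counts):
    """True if the formula has enough of every element in the substructure."""
    return all(formula_counts.get(e, 0) >= n
               for e, n in substructure_counts.items())


def run_generator(molecular_formulas, substructures, generate):
    """Process each formula: generate structures, then save or mark as bad.

    `substructures` maps a reference name to its heavy-atom composition.
    `generate(formula, names)` is a hypothetical stand-in for the structure
    generator and returns a list of structures (possibly empty).
    """
    saved, bad = {}, []
    for formula in molecular_formulas:
        counts = parse_formula(formula)
        if not all(fits(counts, comp) for comp in substructures.values()):
            bad.append(formula)          # constraints cannot be satisfied
            continue
        structures = generate(formula, list(substructures))
        if structures:
            saved[formula] = structures  # one result set per formula
        else:
            bad.append(formula)
    return saved, bad


# Hypothetical inputs: two formulas and two required substructures.
required = {"SS13": {"Cl": 1}, "SS10": {"C": 2, "S": 1}}
saved, bad = run_generator(
    ["C17H19ClN2S", "C18H22N2O3"],
    required,
    generate=lambda formula, names: ["placeholder structure"],
)
```

The element-count test is a necessary condition only. Because substructures may overlap, summing their atoms would overestimate the requirement, so each substructure is checked against the formula on its own.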
Ten structures were obtained in this example and written to a file as specified in Figure 4.7. The AGEN program transfers control to another program, DISCRIM, when structure generation has completed. The DISCRIM program is a modified version of STRCHK and is discussed in the next section.

The computation time required to generate structures can vary widely (e.g. from several minutes to several hours) depending upon the constraints given to the structure generator. The structure generation time is a function of, among other things, the number of undetermined free valences (bonding sites) and atoms. The order in which the constraints are entered can also have a drastic effect on structure generation time. The number of cases generated using chlorpromazine as an example and two different constraint orderings are shown in Table 4.6 and Table 4.7. The constraints were ordered according to their reference names in Table 4.6 and according to the number of atoms contained in each of the substructure definitions in Table 4.7. The number of cases generated was greatly reduced by starting with the substructures which had the greatest number of atoms. This optimization leads to greatly reduced structure generation times (e.g. 36 minutes using the ordering found in Table 4.6 versus 6 minutes using the ordering found in Table 4.7).

AUTOMATIC STRUCTURE GENERATOR (ASG)
  GET SS'S (INCLUSION)
  GET MOL. FORMULA(E)
  DEFINE MOL. FORMULA
  GET SS DEFINITIONS
  CONST SS(X) AT LEAST 1
  GENERATE
  STRUCTURES?
    YES: SAVE STRUCTURES
    NO:  MARK MF AS BAD

Figure 4.10: Flowchart of the automatic structure generator.

Automated Structure Generator
GENOA: MSU Version 1.0
REFERENCE NAME (EXIT to end program): UNK1
There are 1 molecular formula(e) associated with UNK1
There are 11 substructures associated with UNK1.
SUBSTRUCTURE TO BE RETRIEVED: SS132 DEFINED
SUBSTRUCTURE TO BE RETRIEVED: (and so on...)
MOLECULAR FORMULA DEFINED
SUBSTRUCTURE NAME: 1 CASE WAS OBTAINED
SUBSTRUCTURE NAME: (and so on...)
10 STRUCTURES WERE GENERATED
SAVED ON UNK100.STR
Structure generation complete.
Transferring to DISCRIM for discriminating substructure analysis.
ENTERING DISCRIM PROGRAM...
UNK100.STR RESTORED
10 STRUCTURES
Starting SURVEY
READING ENTRIES FROM LIBRARY FILE.
SCANNING THROUGH STRUCTURES.
SAVE FILENAME (STRCHK) = UNK1.GEN
ASG exited.

Figure 4.11: Sample output from the automated structure generator (ASG) program.

SUBSTRUCTURE REFERENCE            # OF CASES GENERATED
SS10  (3, 2 carbons and a S)      1
SS13  (1, chloro)                 1
SS18  (6, x-phenyl)               6
SS19  (1, methyl)                 6
SS47  (7, x-chloro-phenyl)        149
SS51  (7, x-nitrogen-phenyl)      942
SS56  (3, dimethylamino)          3436
SS78  (7, x-sulfur-phenyl)        3278
SS85  (6, 1,2-phenyl)             562
SS90  (6, 1,2,3-phenyl)           210
SS132 (14, phenothiazine)         4

Table 4.6: List of substructure constraints (ordered by reference number) and the number of cases generated by the structure generator. The number of atoms and a descriptive name for each substructure reference are provided in parentheses.

SUBSTRUCTURE REFERENCE            # OF CASES GENERATED
SS132 (14, phenothiazine)         1
SS78  (7, x-sulfur-phenyl)        1
SS51  (7, x-nitrogen-phenyl)      1
SS47  (7, x-chloro-phenyl)        4
SS90  (6, 1,2,3-phenyl)           10
SS85  (6, 1,2-phenyl)             2
SS18  (6, x-phenyl)               2
SS56  (3, dimethylamino)          2
SS10  (3, 2 carbons and a S)      2
SS19  (1, methyl)                 2
SS13  (1, chloro)                 2

Table 4.7: List of substructure constraints (ordered by number of atoms) and the number of cases generated by the structure generator. The number of atoms and a descriptive name for each substructure reference are provided in parentheses.

Potential for the Reduction of the Number of Candidate Structures through Ancillary Experiments

One of the long-term goals in the development of the ACES software is the integration of the various software tools under an “intelligent controller” program. Among the desired features of this program is easy access to the ACES software tools, thus minimizing the number of programs a user must learn to achieve results. Another important capability of the intelligent controller is the ability to suggest ancillary experiments (MS/MS experiments, in this case) which can further resolve a structure elucidation problem. Implementation of these ancillary experiments, either by a human operator or by using the data acquisition / instrument control software, will close a “feedback loop” which includes the mass spectrometer. This feedback loop will allow the structure elucidation system to solve problems in much the same way human experts solve problems. The following paragraphs describe how such a system may be implemented.

The automated version of the structure generator is an important step in achieving an “intelligent” MS/MS instrument. A method for summarizing the unidentified substructures within a set of candidate structures is also required to make this instrument a reality. The approach used in the ACES system is to use another modified version of the Molecular Design STRCHK program to provide a substructure analysis of a set of candidate structures for an unknown. This version of STRCHK (referred to as DISCRIM within the ACES system) was modified to compile substructure counts from the files created by the AGEN program (the “.STR” files) and to write the results to a file with a standard extension of “.GEN”. More than one structure file is obtained for an unknown only if there is more than one viable molecular formula for that unknown. This process is shown schematically in Figure 4.12 and the format of the “.GEN” file is given in Figure 4.13. The modified version of STRCHK is called DISCRIM in ACES because it provides a list of “discriminating substructures”. The list of discriminating substructures produced by the SURVEY function for a set of candidate structures is shown in Figure 4.14.
This list consists of the substructure names and the number of occurrences of each discriminating substructure within the set of candidate structures. Thus, the automated version of the structure generator, AGEN, and the automated version of the structure checker, DISCRIM, provide the basis for incorporating the ability to suggest ancillary experiments into the ACES system. The flowchart which describes a proposed program, EXPT, that will be able to recommend ancillary experiments is shown in Figure 4.15. The EXPT program will read the “.GEN” file for an unknown to obtain the discriminating substructures. The MAPS rulebase will [...]

[Figure 4.15: Flowchart of the proposed EXPT program for recommending ancillary experiments. The diagram is not recoverable.]

[Figure residue, apparently the “.GEN” file format: a reference name (example: NEWTEST), the number of molecular formulae, and then for each of MOLFORM # 1, MOLFORM # 2, ... MOLFORM # n the molecular formula (example: C 24 H 38 O 4), the number of structures, and the list of DISCRIMINATING SUBSTRUCTURES.]

[...] 171.0) {77,77} ” {F5} and
“PD (198.0 -> 154.0) {92,92} ” {F6} and
“PD (197.0 -> 196.0) {69,85} ” {F7} and
“PD (197.0 -> 153.0) {91,77} ” {F8} and
“PD (196.0 -> 152.0) {83,77} ” {F9} and
“PD (70.0 -> 27.0) {42,35} ” {F10}
THEN substructure PHENOTHIAZINE is present.
(Umin = 40%, Cmin = 70%)

Figure 5.4: The initial MAPS (v.II) rule obtained for the PHENOTHIAZINE substructure.

[...] pathways gleaned from the references discussed above are given in Figure 5.6. Note that the "PD" features contained in the MAPS rule for PHENOTHIAZINE (Figure 5.4) correspond directly to the key fragmentation pathways outlined in Figure 5.6. The feature number from the PHENOTHIAZINE rule is provided in brackets for each of the corresponding fragmentations shown in Figure 5.6.
These figures demonstrate the ability of the MAPS rule generation software to select diagnostic spectral features from within the MS/MS database for inclusion in substructure identification rules. The feature combination rule produced from the initial spectral features shown in Figure 5.4 is provided in Figure 5.7. This rule has 100% reliability and 100% recall for the PHENOTHIAZINE substructure with respect to the reference database.

Another initial rule for the PHENOTHIAZINE substructure was generated using the mass filter but with the same Ui and Ci. The three features shown in Figure 5.4 with masses higher than 198 amu were eliminated by this filter but the remainder of the rule was identical. A new feature combination rule, shown in Figure 5.8, was then generated. This rule contains a different clause but retains 100% reliability and 100% recall with respect to the reference database.

The mass filter effectively restricts the initial rule features to those that are directly related to the fragmentation of a substructure. Since the MAPS program stops searching for feature-combinations after a recall of 100% is reached, it is important that the initial features be closely related to the indicated substructure and not just due to the combination of the indicated substructure and another substructure. If the latter were true, then the MAPS rule for the indicated substructure might only identify the substructure in the presence of the additional substructure. On the other hand, the rule generated without the mass filter provides some clues to other substructures bonded to the indicated substructure (e.g. a carbon bonded to the nitrogen in the PHENOTHIAZINE substructure).

[Figure (apparently 5.5): structure drawings with a rotated caption; not recoverable.]
There is, as yet, no automated method for determining the origin of the features with higher masses. Thus, MAPS rules generated without the mass filter are, at this point, most useful for manual inspection to determine additional substructure / spectral feature correlations, while those generated with the mass filter can be directly applied to an unknown.

Fragmentation Path                  Corresponding Rule Feature    Neutral Loss
m/z 198 (III) -> m/z 171 (IV)       {F5}                          (loss of HCN - 27 amu)
m/z 198 (III) -> m/z 154 (VII)      {F6}                          (loss of CS - 44 amu)
m/z 197 (IV) -> m/z 196 (V)         {F7}                          (loss of H - 1 amu)
m/z 197 (IV) -> m/z 153 (VIII)      {F8}                          (loss of CS - 44 amu)
m/z 196 (V) -> m/z 152 (IX)         {F9}                          (loss of CS - 44 amu)
m/z 70 (X) -> m/z 27 (XI)           {F10}                         (loss of C2H5N - 43 amu)

Figure 5.6: Comparison of documented fragmentation pathways for phenothiazine derivatives with the features contained in the MAPS PHENOTHIAZINE rule.

IF “ D (198) and PD (198 -> 154) [100,92] ”
OR “ D (211) and PD (198 -> 154) [100,85] ”
THEN substructure PHENOTHIAZINE is present.
REL = 100%   REC = 100%
(Umin = 40%, Cmin = 70% for initial features; rule clauses with correlation < 85% were deleted)

Figure 5.7: MAPS (v.II) feature-combination rule for the PHENOTHIAZINE substructure with a recall and reliability estimate of 100%.

Other starting values for the uniqueness and correlation filters were used to determine their effect on rule content. There are two sets of values which provide useful information but not necessarily optimal reliability or recall for a specified substructure. First, an initial MAPS rule can be generated using a high minimum correlation value and a very small minimum uniqueness value to identify the spectral features with the greatest frequency of appearance with the presence of a given substructure. This procedure, when Ci = 100% is used, provides “exclusion” rules since the absence of these "universally" observed spectral features can be used to predict the absence of substructures [7-9].
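The logic of applying an inclusion rule (an OR of AND-clauses, as in Figure 5.7) and an exclusion rule can be sketched as set operations. This is an assumed reading of how such rules would be evaluated, not the RULE program itself. Spectral features are treated as opaque tuples, the function names are hypothetical, and the "unknown" spectrum is invented for illustration.

```python
def inclusion_matches(clauses, observed):
    """True if every feature of at least one clause is observed."""
    return any(clause <= observed for clause in clauses)


def exclusion_matches(universal_features, observed):
    """True if any 'universally' observed feature is missing.

    A missing feature that appears for every reference compound containing
    the substructure is taken as evidence that the substructure is absent.
    """
    return not universal_features <= observed


# Clauses transcribed from the feature-combination rule of Figure 5.7.
phenothiazine_clauses = [
    {("D", 198.0), ("PD", 198.0, 154.0)},
    {("D", 211.0), ("PD", 198.0, 154.0)},
]
# Two illustrative features of the kind found in an exclusion rule.
universal = {("PD", 70.0, 42.0), ("D", 29.0)}

# Hypothetical features extracted from the MS/MS data of an unknown.
unknown = {("D", 198.0), ("PD", 198.0, 154.0), ("P", 212.0)}

present = inclusion_matches(phenothiazine_clauses, unknown)
ruled_out = exclusion_matches(universal, unknown)
```

In this invented case the two tests disagree, which is the kind of conflict a rule-application program would have to resolve or report.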
The exclusion rule for phenothiazine is provided in Figure 5.9. Second, a high minimum uniqueness value and a very low correlation value can be used to select spectral features that are strongly dependent upon the presence of other substructures. The MAPS feature-combination inclusion rule obtained for the PHENOTHIAZINE substructure using an initial uniqueness of 100% and an initial correlation of 20% is shown in Figure 5.10. The mass filter was utilized in generating this rule. There are 8 spectral features with 100% uniqueness (and therefore 100% reliability) with respect to the reference database. The compound reference names which produce the indicated feature are included in Figure 5.10. A feature-combination rule was not generated using this initial rule because each feature already has a 100% reliability estimate.

[Figure 5.8: MAPS feature-combination rule for the PHENOTHIAZINE substructure generated with the mass filter. The listing is not recoverable.]

(((SS132 (
((PD 70.0 42.0) 25 100)
((NL 1.0) 13 100) ((NL 15.0) 13 100) ((NL 16.0) 13 100) ((NL 27.0) 13 100)
((NL 28.0) 12 100) ((NL 29.0) 13 100) ((NL 32.0) 19 100) ((NL 41.0) 15 100)
((NL 42.0) 14 100) ((NL 43.0) 14 100) ((NL 44.0) 14 100) ((NL 56.0) 13 100)
((NL 71.0) 14 100) ((NL 85.0) 15 100) ((NL 157.0) 20 100) ((NL 187.0) 27 100)
((D 27.0) 14 100) ((D 29.0) 15 100) ((D 41.0) 14 100) ((D 42.0) 15 100)
((D 43.0) 16 100) ((D 55.0) 14 100) ((D 69.0) 18 100) ((D 70.0) 18 100)
((D 71.0) 19 100)
((P 41.0) 14 100) ((P 42.0) 13 100) ((P 43.0) 14 100) ((P 57.0) 14 100)
((P 65.0) 13 100) ((P 68.0) 16 100) ((P 69.0) 14 100) ((P 70.0) 14 100)
((P 71.0) 14 100) ((P 77.0) 13 100) ((P 82.0) 15 100) ((P 83.0) 15 100)
((P 84.0) 16 100) ((P 95.0) 17 100) ((P 125.0) 22 100) ((P 126.0) 19 100)
((P 127.0) 18 100) ((P 139.0) 20 100) ((P 149.0) 14 100) ((P 152.0) 21 100)
((P 153.0) 20 100) ((P 167.0) 17 100) ((P 171.0) 24 100) ((P 178.0) 28 100)
((P 179.0) 27 100) ((P 184.0) 25 100) ((P 185.0) 29 100) ((P 196.0) 31 100)
((P 197.0) 25 100) ((P 198.0) 30 100) ((P 210.0) 31 100) ((P 212.0) 34 100)
((P 223.0) 36 100) ((P 224.0) 38 100)) ))

Initial conditions: Ui = 10, Ci = 100

Figure 5.9: MAPS (v.II) exclusion rule for the PHENOTHIAZINE substructure (SS132) generated using the indicated program parameters.

[Figure 5.10: MAPS feature-combination inclusion rule for the PHENOTHIAZINE substructure (initial uniqueness 100%, initial correlation 20%, mass filter enabled), listing the compound reference names for each feature. The listing is not recoverable.]

An interesting trend is found when the lists of reference compounds in Figure 5.10 are examined. The features in this rule with correlation values below 31% are all produced by the same compounds! The structures for these compounds are summarized in Figure 5.11. This analysis indicates that the features contained in this rule are highly correlated with the substructure bonded at the 10 position of the PHENOTHIAZINE substructure. This type of correlation, referred to as “cross-correlation”, can be expressed quantitatively for each feature-combination using the following equation:

\[ XC(F)_j = \frac{\text{number of compounds with } F \text{ and } SS_j}{\text{number of compounds with } F} \times 100\% \]

where F is a spectral feature or combination and SSj is substructure j. The cross-correlation values obtained for each of the features found in Figure 5.10 are summarized in Table 5.1. The substructure used for these calculations was “SS110” as shown. Thus, the features with high XC values are indicative of the presence of “SS110” and PHENOTHIAZINE.

[Figure 5.11: Structure summary for several of the phenothiazine compounds in the reference database (compound reference names MI64, MI1485, MI1844 and MI7044, with substituents R1 and R2). The drawings are not recoverable.]
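The cross-correlation of a spectral feature with a second substructure can be computed directly from two sets of compound reference names. The sketch below assumes that the set of compounds producing the feature and the set containing the substructure are already known; the function name and the compound names are hypothetical.

```python
def cross_correlation(compounds_with_feature, compounds_with_substructure):
    """Percentage of the compounds showing F that also contain SSj."""
    if not compounds_with_feature:
        raise ValueError("feature is not produced by any compound")
    both = compounds_with_feature & compounds_with_substructure
    return 100.0 * len(both) / len(compounds_with_feature)


# Hypothetical reference sets.
with_substructure = {"CMPD_A", "CMPD_B", "CMPD_C"}
feature_1 = {"CMPD_A", "CMPD_X"}            # half also contain the substructure
feature_2 = {"CMPD_A", "CMPD_B", "CMPD_C"}  # all contain the substructure

xc_1 = cross_correlation(feature_1, with_substructure)
xc_2 = cross_correlation(feature_2, with_substructure)
```

A value near 100% marks a feature that may owe its presence to the second substructure rather than to the one named in the rule.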
[Drawing of substructure “SS110”; not recoverable.]

Feature Number    XC(F)110 (%)
F1                50
F2                100
F3                100
F4                100
F5                100
F6                100
F7                100
F8                100

Table 5.1: Cross-correlation values calculated for each of the spectral features contained in the MAPS rule for the PHENOTHIAZINE substructure (shown in Figure 3.22) with respect to the “SS110” substructure.

The cross-correlation factor has two potential uses in the MAPS program. First, a cross-correlation value could be calculated for each feature-combination that meets the specified U/C criteria. If the cross-correlation value for the combination is too high (i.e. above a set maximum) then that combination could be discarded. This ability could be built into the feature-combination generation function of MAPS. A second potential use of the cross-correlation factor is to place feature-combinations with high cross-correlation into an ancillary rulebase to be applied once a more general rule has identified the major substructure (e.g. applying the rule shown in Figure 5.10 only after the more general rule shown in Figure 5.7 has identified the presence of the major substructure PHENOTHIAZINE). This function could be built into the RULE program.

“BARBITURATE”

Barbiturates are another class of compounds that find use in pharmaceuticals and in the illicit drug trade [2,10]. There are ten compounds which contain the BARBITURATE substructure in the reference database with a variety of substituents at the 1, 3 and 5 positions. Three substructures relating to barbiturate derivatives (including the generic BARBITURATE substructure) are provided in Figure 5.12. The MAPS feature-combination rule for the BARBITURATE substructure is shown in Figure 5.13 along with the parameters input to the MAPS program. Unlike the rules for the PHENOTHIAZINE substructure, the MAPS rules for the BARBITURATE substructure do not contain spectral features indicative of the intact substructure (e.g. m/z 129).
Among the notable “PD” dissociations are m/z 98 -> m/z 80, 70, 28, 27 and m/z 97 -> 69, 55. The features in this rule represent the largest common fragments produced from a variety of barbiturate standards. A potential problem can occur when using these types of features to identify compounds in biological matrices due to their relatively low intensity compared to other higher mass spectral features [9].

“SS141”   alias: “BARBITURATE”
“SS145”   alias: none
“SS147”   alias: none

Figure 5.12: Substructure drawings for the BARBITURATE substructure and two specific derivatives. (The drawings are not recoverable.)

Several studies of the fragmentation pathways of barbiturates have been published [10-13]. Two different pathways are illustrated in Figure 5.14. These two pathways represent common fragments for the 5-allyl and 5-ethyl barbiturates. Structures for the major ions formed from these barbiturates (i.e. m/z 168, 167 and m/z 156) are also shown in Figure 5.14. The MAPS rule shown in Figure 5.13 does not contain spectral features indicative of these ions.
However, the MAPS rules for the “SS145” (5-allyl barbiturate) and “SS147” (5-ethyl barbiturate) substructures do show spectral features indicative of the intact barbiturate substructure with part of the side chains attached (see Figure 5.14). The MAPS feature combination rules for these substructures are provided in Figures 5.15 and 5.16. The spectral features indicative of the intact barbiturate ions (shown in Figure 5.14) are highlighted in bold faced type in these figures. The fragmentation of the BARBITURATE substructure appears to be quite dependent on the side chains. Thus, the MAPS feature-combination rule for the BARBITURATE substructure, with uniqueness and correlation minima set to achieve 100% recall, contains the smaller common fragments produced by large numbers of the barbiturate standards.

(setq SS141_COMB_RULE '(
(((PD 97.0 55.0) (PD 97.0 69.0) (P 112.0) (P 40.0) (D 52.0)) 100 60 (MI1478 MI1490 MI247 MI592 MI7109 MI777))
(((PD 98.0 28.0) (PD 98.0 27.0) (PD 98.0 70.0)) 100 60 (MI1478 MI1490 MI247 MI592 MI777 MI960))
(((PD 98.0 28.0) (PD 98.0 27.0) (D 26.0)) 100 60 (MI1478 MI1490 MI247 MI592 MI777 MI960))
(((PD 98.0 28.0) (PD 98.0 27.0) (PD 54.0 27.0)) 100 50 (MI1478 MI247 MI592 MI777 MI960))
(((PD 98.0 28.0) (PD 98.0 27.0) (PD 56.0 29.0)) 100 50 (MI1478 MI1490 MI247 MI592 MI960))
(((PD 98.0 28.0) (PD 98.0 27.0) (PD 56.0 41.0)) 100 50 (MI1478 MI1490 MI247 MI592 MI960))
(((PD 98.0 80.0) (PD 98.0 70.0) (D 28.0) (NL 57.0) (D 27.0)) 100 50 (MI1478 MI247 MI5821 MI592 MI777))
(((PD 97.0 55.0) (PD 97.0 69.0) (P 112.0) (P 40.0) (PD 167.0 149.0)) 100 50 (MI1478 MI1490 MI247 MI7109 MI777))
(((PD 98.0 28.0) (PD 98.0 27.0) (PD 83.0 55.0)) 100 50 (MI1490 MI247 MI592 MI777 MI960))
(((PD 98.0 28.0) (PD 98.0 27.0) (PD 167.0 149.0)) 100 40 (MI1478 MI1490 MI247 MI777))
(((PD 98.0 28.0) (PD 98.0 27.0) (PD 54.0 26.0)) 100 40 (MI1490 MI247 MI592 MI960))
(((PD 98.0 80.0) (PD 98.0 70.0) (D 28.0) (NL 112.0) (D 27.0)) 100 40 (MI1478 MI247 MI5821 MI777))
(((PD 98.0 80.0) (PD 68.0 40.0) (PD 97.0 69.0)) 100 40 (MI247 MI5821 MI777 MI960))
(((PD 98.0 80.0) (PD 68.0 40.0) (P 40.0) (D 27.0)) 100 40 (MI247 MI5821 MI777 MI960))
(((PD 98.0 80.0) (PD 68.0 40.0) (D 66.0) (D 27.0)) 100 40 (MI247 MI5821 MI777 MI960))
(((PD 169.0 126.0) (PD 69.0 39.0) (NL 87.0) (PD 69.0 68.0) (D 27.0)) 100 40 (MI1478 MI5682 MI5821 MI777))
(((PD 169.0 126.0) (PD 69.0 39.0) (PD 69.0 68.0) (P 113.0) (D 27.0)) 100 40 (MI1478 MI5682 MI5821 MI777))
(((PD 169.0 126.0) (PD 69.0 39.0) (PD 69.0 68.0) (P 98.0) (D 27.0)) 100 40 (MI1478 MI5682 MI5821 MI777))
(((PD 169.0 126.0) (PD 69.0 39.0) (PD 69.0 68.0) (P 99.0) (D 27.0)) 100 40 (MI1478 MI5682 MI5821 MI777))
(((PD 169.0 126.0) (PD 69.0 39.0) (PD 69.0 68.0) (P 54.0) (D 27.0)) 100 40 (MI1478 MI5682 MI5821 MI777))
(((PD 68.0 40.0) (PD 55.0 29.0) (P 40.0)) 100 40 (MI247 MI777 MI960 MI961))
(((PD 68.0 40.0) (PD 69.0 41.0) (NL 86.0) (P 40.0) (P 44.0)) 100 40 (MI247 MI777 MI960 MI961))
(((PD 68.0 40.0) (PD 69.0 41.0) (NL 86.0) (P 40.0) (D 27.0)) 100 40 (MI247 MI777 MI960 MI961))
(((PD 68.0 40.0) (PD 69.0 41.0) (P 112.0) (P 40.0) (P 44.0)) 100 40 (MI247 MI777 MI960 MI961))
(((PD 68.0 40.0) (PD 69.0 41.0) (P 112.0) (P 40.0) (D 27.0)) 100 40 (MI247 MI777 MI960 MI961))
(((PD 97.0 55.0) (PD 97.0 69.0) (PD 83.0 55.0) (P 40.0) (PD 149.0 93.0)) 100 40 (MI247 MI592 MI7109 MI960))
(((PD 97.0 55.0) (PD 97.0 69.0) (NL 86.0) (P 40.0) (P 50.0)) 100 40 (MI1478 MI247 MI7109 MI777))
(((PD 97.0 55.0) (PD 97.0 69.0) (NL 87.0) (P 40.0)) 100 40 (MI1478 MI247 MI7109 MI777))
))

Initial Conditions: Ui = 10, Ci = 50, Cc = 40, mass filter enabled.

Figure 5.13: MAPS (v.III) feature-combination rule for the BARBITURATE substructure (SS141) generated using the indicated program parameters.
This situation makes for a somewhat weaker rule for a substructure, since other substructures with higher masses, contained in compounds not in our database, may fragment to give ions of the same nominal mass as those contained in these types of rules (thus leading to a false positive). While there is always some risk of this situation arising in a “trained” system such as ACES, the risk seems more acute if a substructure rule does not contain spectral features due to the intact substructure.

[Figure 5.14: structure drawings not recoverable from the source scan; labelled ions include m/z 167, m/z 98 and m/z 156.]

Figure 5.14: Fragmentation pathways for the A) “SS145” and B) “SS147” substructures. Adapted from references 10 and 11.

(setq SS145_COMB_RULE '(
  (((PD 169.0 152.0) (PD 96.0 28.0)) 100 100 (MI1478 MI247 MI777))
  (((PD 169.0 152.0) (PD 168.0 96.0) (PD 124.0 43.0)) 100 100 (MI1478 MI247 MI777))
  (((PD 96.0 28.0) (PD 153.0 136.0)) 100 100 (MI1478 MI247 MI777))
  (((PD 96.0 28.0) (PD 167.0 43.0)) 100 100 (MI1478 MI247 MI777))
  (((PD 96.0 28.0) (PD 169.0 109.0)) 100 100 (MI1478 MI247 MI777))
))

Initial Conditions: Ui = 40, Ci = 70, Cc = 50

Figure 5.15: MAPS (v.III) feature-combination rule for the “SS145” substructure generated using the indicated program parameters.

[Figure 5.16: MAPS feature-combination rule for the “SS147” substructure. The listing and its program parameters are not recoverable from the source scan.]

“PHENOL and T-BUTYL”

The T-BUTYL substructure is an example of a highly cross-correlated substructure. The feature-combination rule shown in Figure 5.17 has a number of features which are most likely due to PHENOL (e.g. m/z 105), not T-BUTYL. This observation was noted in a previous study which used many of the same phenol standards [9].

(setq SS21_COMB_RULE '(
  (((PD 175.0 133.0) (PD 145.0 105.0)) 100 57 (...))
  (((PD 175.0 133.0) (PD 147.0 107.0)) 100 57 (...))
  (((PD 161.0 119.0)) 100 50 (...))
  (((PD 175.0 142.0)) 100 50 (...))
  (((PD 175.0 133.0) (PD 175.0 145.0)) 100 50 (...))
  (((PD 105.0 65.0) (PD 119.0 103.0)) 100 50 (...))
  (((PD 175.0 133.0) (PD 119.0 103.0)) 100 50 (...))
  (((PD 175.0 133.0) (PD 161.0 121.0)) 100 50 (...))
  (((PD 161.0 128.0) (PD 147.0 107.0)) 100 50 (...))
  (((PD 175.0 133.0) (PD 135.0 91.0)) 100 50 (...))
))

Initial Conditions: Ui = 20, Ci = 50, Cc = 40, MINF = 1 (rule clauses with Cc < 50 not shown)

Figure 5.17: MAPS (v.III) feature-combination rule for the T-BUTYL substructure (SS21) generated using the indicated program parameters.

A cross-correlation factor similar to the one already presented for feature-combinations can be calculated using the following equation:

\[ XC(SS_j) = \frac{\text{number of compounds with } SS_j \text{ and } SS_k}{\text{number of compounds with } SS_j} \times 100\% \]

where SS_j and SS_k are substructures j and k, respectively. The result of this calculation for T-BUTYL, XC(SS21), is 93% (i.e. 13 out of 14 of the T-BUTYL-containing compounds also contain PHENOL).
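The cross-correlation factor can be sketched as a simple set calculation. The following is a minimal illustration, not part of the MAPS software: the function name and the compound labels are hypothetical, and the membership sets are invented so that 13 of 14 T-BUTYL compounds also contain PHENOL, as stated in the text.

```python
def cross_correlation(compounds_with_j, compounds_with_k):
    """Return XC(SSj): percent of compounds containing SSj that also contain SSk."""
    with_j = set(compounds_with_j)
    if not with_j:
        raise ValueError("substructure j does not occur in the database")
    both = with_j & set(compounds_with_k)
    return 100.0 * len(both) / len(with_j)


# Hypothetical compound labels: 14 T-BUTYL compounds, 13 of which also contain PHENOL.
t_butyl = {f"cmpd{i}" for i in range(1, 15)}
phenol = {f"cmpd{i}" for i in range(1, 14)} | {"cmpd20", "cmpd21"}

xc_t_butyl = cross_correlation(t_butyl, phenol)
print(f"XC(T-BUTYL) = {xc_t_butyl:.0f}%")  # 13/14 -> 93%
```

Note that the factor is not symmetric: the value for PHENOL with respect to T-BUTYL uses the number of PHENOL-containing compounds as the denominator.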
Thus, the only correlation value that has potential for isolating the spectral features due to T-BUTYL from those due to PHENOL is 100%. The initial MAPS rule obtained using a 100% minimum correlation with a minimum uniqueness of 10% is shown in Figure 5.18. The features found in this rule are those which might be expected from fragmentation of a T-BUTYL substructure. Unfortunately, a feature-combination rule with 100% reliability could not be generated from these features. The addition of compounds to the reference database which contain T-BUTYL, but not PHENOL, is the only recourse for obtaining a reliable substructure identification rule for the T-BUTYL substructure.

(((SS21 (
  ((NL 1.0) 14 100)   ((NL 2.0) 15 100)   ((NL 14.0) 22 100)
  ((NL 15.0) 14 100)  ((NL 16.0) 14 100)  ((NL 26.0) 14 100)
  ((NL 28.0) 13 100)  ((NL 29.0) 14 100)  ((NL 30.0) 15 100)
  ((NL 33.0) 21 100)  ((NL 42.0) 15 100)  ((NL 43.0) 15 100)
  ((NL 44.0) 15 100)  ((NL 55.0) 18 100)  ((NL 56.0) 14 100)
  ((NL 57.0) 15 100)  ((D 55.0) 15 100)   ((P 53.0) 17 100)
  ((P 55.0) 14 100)   ((P 57.0) 15 100)   ((P 58.0) 16 100)
)))

Ui = 10, Ci = 100, mass filter enabled

Figure 5.18: MAPS (v.II) exclusion rule for the T-BUTYL substructure (SS21) generated using the indicated program parameters.

The cross-correlation factor for PHENOL with respect to the T-BUTYL substructure, on the other hand, is 41%. The MAPS feature-combination rule obtained for PHENOL, using an initial correlation of 50% and a feature-combination correlation of 45%, is shown in Figure 5.19. The reliability of this rule is 100% and the recall is 46% with respect to the reference database.

(setq SS44_COMB_RULE '(
  (((PD 133.0 105.0) (NL 14.0) (PD 91.0 65.0) (D 53.0) (D 67.0)) 100 46
   (GMR10 GMR11 GMR12 GMR1 GMR23 GMR24 GMR2 GMR3 GMR5 GMR7 MI5297 MI6129 MI6208 MI6834 MI6990))
))

Ui = 10%, Ci = 50%, Cc = 45%, mass filter disabled

Figure 5.19: MAPS (v.III) feature-combination rule for the PHENOL substructure generated using the indicated program parameters.
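To make the structure of a feature-combination rule concrete, the sketch below shows one plausible way to test whether a rule clause "fires" for an unknown. It is an illustration only and not the RULE program: the tuple encoding of features, the function names and the two example feature sets are assumptions, while the clause itself is the single PHENOL clause of Figure 5.19. A clause is taken to fire when every feature in its combination is present, and a rule to identify the substructure when any clause fires.

```python
# Features are tuples: ("PD", parent, daughter), ("P", m/z), ("D", m/z), ("NL", mass).
# Clause layout follows the figures: (features, reliability %, recall %).
PHENOL_RULE = [
    (
        [("PD", 133.0, 105.0), ("NL", 14.0), ("PD", 91.0, 65.0), ("D", 53.0), ("D", 67.0)],
        100,
        46,
    ),
]


def clause_fires(clause, observed):
    """A clause fires only if every feature in its combination was observed."""
    features, _reliability, _recall = clause
    return all(feature in observed for feature in features)


def rule_fires(rule, observed):
    """A rule identifies the substructure if any one of its clauses fires."""
    return any(clause_fires(clause, observed) for clause in rule)


# Hypothetical feature sets extracted from the MS/MS spectra of two unknowns.
unknown_a = {("PD", 133.0, 105.0), ("NL", 14.0), ("PD", 91.0, 65.0),
             ("D", 53.0), ("D", 67.0), ("P", 40.0)}
unknown_b = {("PD", 133.0, 105.0), ("NL", 14.0), ("D", 53.0)}

print(rule_fires(PHENOL_RULE, unknown_a))  # True
print(rule_fires(PHENOL_RULE, unknown_b))  # False
```

Exact floating-point comparison is used here only because the example values are literals; real spectra would need a mass tolerance.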
Alternate MAPS Rules From Multiple Collision Data

MS/MS spectra for use in generating MAPS rules were acquired under two different sets of operating conditions (see Chapter 2). A major question regarding the selection of pressure regimes for CAD spectra concerns the relative value of the spectral features observed in these spectra for structure elucidation. One school of thought treats all spectral features as having equal value in determining molecular structure. On this view, an increase in the number of spectral features should result in an increased ability to determine a structure. Since increasing the collision gas pressure increases the number of fragment ions observed, multiple collision conditions are preferred. MAPS rules have been generated using a variety of program parameters to vary the number and types of initial features used by the rule generator. While in some cases recall increased with the number of initial features, there were also many examples where zero recall was obtained no matter how many features were used. Thus, it is not just the number of features which is important in identifying a substructure.

Another school of thought recognizes that some spectral features are more important than others and holds that the collision pressure should be set to an optimal value to detect them (or to avoid the production of features which interfere with features that would otherwise be unique). In some cases this means using single collision conditions to obtain bona fide neutral losses. In other cases multiple collision conditions are required to obtain fragments of parent ions which possess very small cross sections (a case where one collision provides insufficient energy deposition for dissociation to occur). Thus, selection of a pressure regime is compound dependent. This dependence is one of the difficulties facing the adoption of standard conditions for a CAD spectral database.
The MAPS rules discussed heretofore were generated using MS/MS spectra acquired under single collision conditions. The base peak in these spectra is almost always the parent ion and there are often relatively few major fragment ions (daughter ions). Single collision conditions are established by adjusting the target gas pressure so that the probability of exactly one collision between a mass-selected parent ion and the target gas is very large compared to the probability of more than one collision between the parent (or daughter) ions and the target gas. In other words, the fragment ions in a CAD (collisionally-activated dissociation) mass spectrum acquired under single collision conditions result from dissociation of the parent ions only, and not from further dissociation of daughter ions (the products of which are sometimes referred to as granddaughter ions). A preliminary comparison of the rules obtained from the single-collision data to those generated using the multiple-collision spectra is provided below.

Comparative Recall for MAPS Rulebases Generated from Data Acquired at Different Collision Gas Pressures

An additional MAPS rulebase was generated using the MS/MS spectra acquired under multiple collision conditions. The spectra used to obtain these rules are characterized by a great deal more fragmentation and often by a base peak other than the parent ion. The MAPS program parameters (e.g. initial feature uniqueness) were the same as those used to generate the rulebase discussed in the previous section. The recall obtained for substructures with at least 5 occurrences in the reference database is provided in Figure 5.20 for the two rulebases. Overall, 40 MAPS feature-combination rules with non-zero recall were obtained (versus 42 rules using the single-collision data) with an overall recall of 79% (versus 82% recall using the single-collision data).
One significant detail in Figure 5.20 is that there are four substructures for which non-zero recall was obtained using one dataset and zero recall using the other (i.e. SS7 - monosubstituted phenyl, SS19 - methyl, SS23 - butyl and SS56 - dimethylamino). Of these four substructures, only SS19 (methyl) was better identified using multiple collision conditions. The recall shown for the remaining substructures in Figure 5.20 shows some variability between the two rulebases. Thus, the likelihood that a MAPS rule will identify a substructure, as estimated by the recall, varies between single and multiple collision conditions.

[Figure 5.20: recall obtained with the single-collision and multiple-collision rulebases for substructures with at least 5 occurrences in the reference database. The plot and its caption are not recoverable from the source scan.]

The significance of this observation for the ACES system is that an ancillary experiment using an alternative pressure regime (e.g. multiple collisions) may identify a substructure that could not be identified using the original pressure regime (e.g. single collision). Therefore multiple rulebases, one based on single collision conditions and one based on multiple collision conditions, can be used to reduce the number of candidate structures by securing additional substructure identifications. This component of the ACES system requires further data and program support to be fully realized. The most critical program development in this regard is the intelligent controller/program shell. The intelligent controller should lead the user through the programs which comprise the ACES system (a shell function) and act as an interface to the user and/or mass spectrometer so that ancillary experiments can be performed (a control function).
Effect of Collision Gas Pressure on Rule Content

The uniqueness and correlation values of a given spectral feature in the reference database vary depending on the pressure regime chosen for data acquisition. Since the MAPS software uses these values to identify the starting features for feature-combination generation, the content of a MAPS rule may also differ from one generated using another pressure regime. Consider, for example, the “PD” features in the rules for PHENOTHIAZINE (single and multiple collision conditions) that have m/z 198 as the parent ion. These features are shown in Figure 5.21. Three “PD” features with a m/z 198 parent met the initial uniqueness and correlation minima of 30% using the single-collision data. Four additional daughter ions are observed under multiple collision conditions using the same MAPS parameters. The structures of the ions labelled in Figure 5.21 are provided in Figures 5.5 and 5.22. The uniqueness and correlation for each of the “PD” features are given in brackets in Figure 5.21. Note that the very diagnostic feature, “PD 198 -> 154”, is less characteristic of the phenothiazine substructure under multiple collision conditions than under single collision conditions (U = 68 vs. U = 92). However, the new features included in the initial rule using multiple collision conditions are also quite diagnostic (e.g. “PD 198 -> 45”, U = 90 and “PD 198 -> 166”, U = 87). The feature-combination rule obtained for Ui = 40%, Ci = 50%, Cc = 50%, mass filter enabled, and the multiple collision data is shown in Figure 5.23. Several features observed only under multiple collision conditions are highlighted in this rule in bold-faced type (e.g. “PD 198 -> 45”). Note that even though the rule content changed, 100% reliability and 100% recall were maintained for the PHENOTHIAZINE rule generated using the multiple collision data.
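The bracketed uniqueness and correlation values can be illustrated with a short sketch. It assumes that uniqueness is the percentage of feature-exhibiting reference compounds that contain the substructure, and that correlation is the percentage of substructure-containing compounds that exhibit the feature; this reading is an assumption, as are the function names and the invented compound labels. The counts are chosen only so that the output resembles a [92,92] entry and do not reproduce the actual reference data.

```python
def uniqueness_and_correlation(feature_compounds, substructure_compounds):
    """Return (U, C) in percent for one feature and one substructure."""
    has_feature = set(feature_compounds)
    has_substructure = set(substructure_compounds)
    both = has_feature & has_substructure
    uniqueness = 100.0 * len(both) / len(has_feature) if has_feature else 0.0
    correlation = 100.0 * len(both) / len(has_substructure) if has_substructure else 0.0
    return uniqueness, correlation


def passes_minima(uniqueness, correlation, ui, ci):
    """Initial-feature filter: keep a feature only if it meets both minima."""
    return uniqueness >= ui and correlation >= ci


# Hypothetical database: 13 compounds contain the substructure; the feature is
# seen in 12 of them and in one compound that lacks the substructure.
substructure = {f"ref{i}" for i in range(1, 14)}
feature = {f"ref{i}" for i in range(1, 13)} | {"other1"}

u, c = uniqueness_and_correlation(feature, substructure)
print(f"[{int(u)},{int(c)}]")                # [92,92]
print(passes_minima(u, c, ui=30, ci=30))     # True
```

Changing the pressure regime changes which compounds exhibit a feature, so both sets, and therefore both values, can shift for the same feature.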
There is one single-feature clause in this rule, “PD 197 -> 170”, which has 100% reliability in predicting the presence of the PHENOTHIAZINE substructure within the reference database. As will be shown in the next chapter, single-feature clauses are somewhat unreliable outside of the reference database. This result is not particularly surprising if one considers that a substructure identification based on one feature is not a result that an “expert” mass spectrometrist would find compelling. Generally a series of features is the most reliable indicator of the presence of a substructure. The rules shown here often have two or three features in each rule clause and it is expected that this length will increase as more compounds are added to the reference database. The length of the rule clauses will increase because more features will be required to achieve 100% uniqueness (reliability) within the reference database. The rate of increase, however, will decrease rapidly with increasing database size since all legitimate examples of the fragmentation of a given substructure will eventually be represented in the MAPS rules. Thus, the size of the reference database required to achieve this stabilization of rule content is probably much smaller than the size required for reliable spectral matching of unknowns.

Parent ion: m/z 198 (III). Values in brackets are [uniqueness, correlation].

P1 (single collision)         P2 (multiple collision)
m/z 197 [42,46] (IV)          m/z 197 [45,75] (IV)
m/z 171 [76,76] (VI)          m/z 171 [61,84] (VI)
m/z 154 [92,92] (VII)         m/z 166 [87,53] (XII)
                              m/z 154 [68,84] (VII)
                              m/z 128 [53,53] (XIII)
                              m/z 127 [81,69] (XIV)
                              m/z 45 [90,69]

Figure 5.21: Daughter m/z values observed in a MAPS rule from a m/z 198 parent ion using single and multiple collision conditions.

[Figure 5.22: structure drawings not recoverable from the source scan; labelled ions: m/z 166, m/z 128 and m/z 127.]

Figure 5.22: Additional structures for fragment ions observed from phenothiazine derivatives.

(setq SS132_COMB_RULE '(
  (((PD 197.0 170.0)) 100 76 (...))
  (((PD 198.0 45.0) (PD 198.0 154.0)) 100 69 (...))
  (((PD 199.0 167.0) (PD 198.0 154.0)) 100 69 (...))
  (((PD 198.0 45.0) (PD 198.0 171.0)) 100 69 (...))
  (((PD 199.0 167.0) (PD 197.0 196.0)) 100 61 (...))
  (((PD 199.0 167.0) (PD 197.0 153.0)) 100 61 (...))
  (((PD 196.0 45.0) (PD 197.0 196.0)) 100 61 (...))
  (((PD 196.0 45.0) (PD 197.0 153.0)) 100 61 (...))
  (((PD 196.0 45.0) (PD 196.0 69.0)) 100 61 (...))
  (((PD 198.0 45.0) (PD 198.0 127.0)) 100 53 (...))
  (((PD 198.0 45.0) (PD 199.0 155.0)) 100 53 (...))
  (((PD 198.0 45.0) (PD 197.0 196.0)) 100 53 (...))
  (((PD 198.0 45.0) (PD 197.0 153.0)) 100 53 (...))
  (((PD 196.0 45.0) (PD 196.0 169.0)) 100 53 (...))
  (((PD 198.0 166.0) (PD 198.0 127.0)) 100 53 (...))
  (((PD 171.0 45.0) (PD 199.0 167.0)) 100 53 (...))
  (((PD 171.0 45.0) (PD 199.0 155.0)) 100 53 (...))
  (((PD 171.0 45.0) (PD 196.0 169.0)) 100 53 (...))
  (((PD 171.0 45.0) (PD 197.0 196.0)) 100 53 (...))
  (((PD 199.0 167.0) (PD 179.0 153.0)) 100 53 (...))
))

Ui = 40%, Ci = 50%, Cc = 50%, mass filter enabled

Figure 5.23: The MAPS (v.III) feature-combination rule for PHENOTHIAZINE generated using the indicated program parameters and the multiple collision data.

Conclusions

The evaluation of the rules generated using MAPS (v.III) indicates that this new software provides substructure identification rules with high reliability and high recall. The rules obtained for several substructures were found to contain features which correspond to ion structures established using a variety of instrumental techniques. Some rules, however, clearly contain features from other substructures. This contamination stems from a high degree of cross-correlation with another substructure in the reference compounds. In some instances, raising the correlation above the cross-correlation value can eliminate the contaminating features.
In other cases, additional compounds containing the pertinent substructure (and not the contaminating substructure) need to be added to the reference database to alleviate the cross-correlation problem. Further evaluation of rule reliability is addressed in the next chapter using test compounds. Of particular interest are the minimum advisable uniqueness and correlation values to be used in rule generation to ensure reliable predictions outside of the reference database (e.g. is a feature-combination rule based on a combination of low-uniqueness and low-correlation features reliable, or just a statistical anomaly?).

References

1. Busch, K.L., Glish, G.L., McLuckey, S.A., "Mass Spectrometry/Mass Spectrometry: Techniques and Applications of Tandem Mass Spectrometry", VCH Publishers, Inc., New York, 1988.
2. Merck Index.
3. Morosawa, S., Kamal, S., Dandiya, P.C., Sharma, H.C., [journal title illegible in source], 309 (1982).
4. Hallberg, A., Al-Showaier, I., Martin, A.R., [journal title illegible in source], 21, 841 (1984).
5. Flurer, R.A., Busch, K.L., [journal title illegible in source], 23, 118 (1988).
6. Gilbert, J.N.T., Millard, B.J., [journal title illegible in source], 2, 17 (1969).
7. Wade, A.P., Palmer, P.T., Hart, K.J., Enke, C.G., [journal title illegible in source], 215, 169 (1988).
8. Palmer, P.T., Hart, K.J., Enke, C.G., [journal title illegible in source], 16, 107 (1989).
9. Palmer, P.T., Ph.D. Dissertation, Michigan State University, East Lansing, MI, 1988.
10. Falkner, F.C., Watson, J.T., [journal title illegible in source], 8, 257 (1974).
11. Watson, J.T., Falkner, F.C., [journal title illegible in source], 1227 (1973).
12. Thompson, R.M., Desiderio, D.M., [journal title illegible in source], 987 (1973).
13. Grutzmacher, W.F., Arnold, W., [journal title illegible in source], 1365 (1966).

Chapter 6

Evaluation of MAPS Rules by Application to Test Compounds

Introduction

There are two methods of generating MAPS feature-combination rules. The first method is to generate rules individually for each substructure using a number of different program parameters (e.g. Ui). The second method is to generate rules for a number of substructures using global program parameters (i.e. fixed Ui/Ci/Cc values for all rules).
The reliability of the MAPS (v.III) rules obtained thus far was evaluated by applying the rules to test compounds (i.e. compounds not in the reference database). These compounds were drawn from several compound classes that are represented in the database (e.g. phenothiazines, amphetamines, opioids and barbiturates). A list containing the compound name, molecular weight, molecular formula and CAS number of each of the test compounds is provided in Table 6.1. The structure for each of these compounds is given in Figure 6.1. The rules used in this evaluation were those described in Chapter 5 and those contained in several rulebases generated using global program parameters.

[Table 6.1 and the text that follows it are not recoverable from the source scan. The text resumes at the end of a sentence giving a naming pattern.]

... <pressure>_<Ui>_<Ci>_<Uc>_<Cc>”. Other program parameters used were: USORT (enabled), MAXF = 10, MINF = 1, MINCMPDS = 10, HITS = 5 (unless otherwise indicated), SSTIME = 3600 seconds.

“PHENOTHIAZINE”

The first three rules listed in Table 6.2 were generated for the PHENOTHIAZINE substructure. The first and third rules were generated using the single collision data while the second was generated using the multiple collision data. The first two rules both have 100% reliability with respect to the test compounds. The recall of the second rule (generated using multiple collision conditions) is lower than the recall observed for the first rule (single collision conditions). Thus, the use of multiple collision conditions was not advantageous for identifying this substructure in the compounds tested.

[Table 6.2: MAPS rules applied to the test compounds, with the result obtained for each test compound. The table contents and caption are not recoverable from the source scan.]

The third PHENOTHIAZINE rule was generated using the single collision data but lower Ui/Ci/Cc program parameter values. The reliability of this rule, however, is lower than that of the previous two rules (i.e. two false positives were obtained). Thus, the reliability of the MAPS rules is dependent on the program parameters used in generating the rules. The minimum advisable Ui/Ci/Cc program parameters are difficult to assess since the number of compounds containing a substructure and the number of initial features obtained for a Ui/Ci combination vary widely among the substructures in the substructure library (see Chapter 5). Generally, for substructures that are represented in only a few reference compounds, larger Ui/Ci parameter values are required to ensure a reliable rule outside of the reference database. For substructures contained in many reference compounds, smaller Ui/Ci parameter values can be used. For example, the Ui/Ci parameters are percentages, so a 50% initial correlation (Ci) means that for a substructure represented in 36 compounds, a feature must be exhibited in the MS/MS spectra of 18 compounds before it can be used in the generation of feature-combinations. On the other hand, for a substructure represented in only 4 compounds, a 50% initial correlation means that only two compounds must exhibit a feature before it can be used in rule generation.
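The arithmetic behind these two cases can be written out as a small sketch. The function name is hypothetical, and rounding a fractional count up to the next whole compound is an assumption; the text only gives cases that divide evenly.

```python
def min_compounds_for_ci(n_with_substructure, ci_percent):
    """Smallest number of substructure-containing compounds that must exhibit a
    feature for its correlation to reach ci_percent (integer percent)."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return (n_with_substructure * ci_percent + 99) // 100


print(min_compounds_for_ci(36, 50))  # 18
print(min_compounds_for_ci(4, 50))   # 2
```

A feature supported by 18 of 36 compounds and one supported by 2 of 4 both satisfy Ci = 50%, but the first rests on nine times as many observations.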
The features for substructures represented in many compounds, therefore, have more statistical validity outside the reference database than those exhibited by only a few compounds.

An additional noteworthy observation on Rule 3 in Table 6.2 is that an identification was made for Test Compound #10 based on a single feature. As was mentioned in Chapter 5, identifications based on only a single feature are generally unreliable. This error can be avoided by setting MINF to a value greater than one; the reliability of Rule 3 with no single-feature clauses is 83%.

“BARBITURATE”

There are seven MAPS rules listed in Table 6.2 that are associated with barbiturates (i.e. SS143, SS145 and SS147, as shown in Chapter 5). The feature-combination rule shown in Figure 5.13 for the barbiturate substructure SS143 gave disappointing results (two false positives and no correct identifications). This rule was generated with a low initial uniqueness parameter (10%) and a relatively high initial correlation parameter (50%). The reliability and recall estimates for this rule with respect to the reference database were 100%. The reliability and recall for this rule with respect to the test compounds, however, were 0%. Thus, the initial uniqueness value of 10% (i.e. at least 1 in every 10 reference compounds that exhibit the feature must also contain the substructure) was too low to produce a substructure identification rule which was reliable outside of the reference database.

Rules 5-7 shown in Table 6.2 were generated using an initial uniqueness value of 30% and initial correlation values of 20% and 40%. Rule 6 in Table 6.2 was generated using a lower initial correlation parameter than Rule 5. Thus, Rule 6 should have an equal or higher recall than Rule 5 because there are more features to use in generating a rule. However, as can be seen from Table 6.2, Rule 6 has a lower recall than Rule 5. The reason for this difference is the HITS optimization.
This optimization exits a rule generation loop once rule clauses have been generated which recall all of the compounds a given number of times (as specified by the HITS parameter). Thus, the MAPS (v.III) program will not continue to add feature-combinations past the point where the HITS parameter is satisfied. All of the rules presented here were generated using a HITS parameter value of 5 unless otherwise indicated. A non-zero recall was obtained for Rule 5 and not for Rule 6 because a feature-combination was generated that identified SS143 in Test Compound #17 using Rule 5, while this same combination was overlooked during the generation of Rule 6. This combination was not obtained for Rule 6 because the program exited the feature-combination generation loop before the combination was tried (due to the HITS limit being met).

The effect of the HITS parameter was explored further using Rule 7. Rule 7 in Table 6.2 was generated using the same Ui/Ci/Cc parameters as were used in generating Rule 6, but with a HITS parameter of 20. The substructure identification for Test Compound #17 was observed and a non-zero recall was obtained for this rule. Thus, when using a lower set of Ui and Ci program parameters, a higher HITS parameter can be used to increase the number of feature-combinations that are tried. This will reduce the probability that the MAPS program will stop before trying important combinations.

Rule 8 (see Figure 5.15) for the 5-allyl barbiturates did not hit on the one test compound that had this substructure. There were, however, only three examples of this substructure in the reference database. Rule 9 in Table 6.2 for the 5-ethyl barbiturates was also a poor rule outside of the reference database. This poor performance is probably due to the low initial uniqueness parameter used in generating the rule (10%).
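The HITS early exit described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the MAPS (v.III) implementation: the function name, the representation of candidate combinations as (label, set of recalled compounds) pairs, and the toy data are all hypothetical, and every candidate is assumed to have already passed the reliability test.

```python
def generate_clauses(candidates, compounds, hits):
    """Accept candidate combinations in order, stopping once every compound
    containing the substructure has been recalled at least `hits` times."""
    recall_counts = {compound: 0 for compound in compounds}
    accepted = []
    for combination, recalled in candidates:
        accepted.append(combination)
        for compound in recalled:
            if compound in recall_counts:
                recall_counts[compound] += 1
        if all(count >= hits for count in recall_counts.values()):
            break  # HITS satisfied: later combinations are never tried
    return accepted


# Hypothetical candidates; "combo3" stands for a combination that would also
# identify the substructure in a compound outside the reference database.
compounds = ["A", "B"]
candidates = [
    ("combo1", {"A", "B"}),
    ("combo2", {"A", "B"}),
    ("combo3", {"A", "B"}),
]

print(generate_clauses(candidates, compounds, hits=2))  # ['combo1', 'combo2']
print(generate_clauses(candidates, compounds, hits=3))  # ['combo1', 'combo2', 'combo3']
```

With the lower HITS value the loop ends before "combo3" is examined, which mirrors how a useful combination could be missed for Rule 6 and recovered for Rule 7.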
Rule 10 in Table 6.2 was generated for the same substructure using a higher initial uniqueness parameter (30%), a lower initial correlation parameter (20%) and an increased HITS parameter (20 hits). The new rule did not increase the recall obtained for this substructure but did decrease the number of false positives by one. Overall, it appears that there is an insufficient number of barbiturate standards in the reference database which contain the SS143, SS145 and SS147 substructures to generate reliable rules for these substructures.

“T-BUTYL” and “PHENOL”

Not surprisingly, Rules 11-14 in Table 6.2 for the T-BUTYL and PHENOL substructures exhibit poor reliability outside the reference database. As was noted in Chapter 5, these substructures are highly cross-correlated. Significantly, the T-BUTYL substructure is also highly cross-correlated with the “XPHENYL” substructure, since in most of the reference compounds containing the T-BUTYL substructure it is attached to a benzene ring. For Rule 14, the compounds corresponding to 6 out of the 7 false positives for PHENOL contain a benzylic carbon. The remedy for this problem is a larger and more diverse reference database. Ultimately, however, MS/MS spectra must be able to discriminate among these substructures in order to obtain reliable rules for them.

“AMPHETAMINE”

The AMPHETAMINE substructure was originally defined in an attempt to generate a MAPS rule that would identify compounds in this general compound class. The substructure definition used for this purpose is shown in Figure 6.2(A). Examination of the “substructure-buckets” for this substructure definition revealed that many of the opioid reference compounds (e.g. Test Compound #6 shown in Figure 6.1) also had this substructure. A third compound class which contains a very similar substructure was discovered when Rule 15 listed in Table 6.2 was applied to the test compounds. All four of the false positives for this rule are phenobarbitals.
The only difference between the substructure in the phenobarbitals and the substructure shown in Figure 6.2(A) is the presence of two benzylic hydrogens. The AMPHETAMINE substructure (henceforward referred to as SS118) was changed to the more generic definition shown in Figure 6.2(B). The MAPS rule generated using this new definition is listed in Table 6.2 (Rule 16). The reliability of this rule is much improved over the previous rule (i.e. 88% versus 50%). Thus, substructure definitions which are too specific, that is, definitions which exceed the specificity which MS/MS spectra afford, can also affect rule reliability outside of the reference database.

[Figure 6.2: structure drawings not recoverable from the source scan.]

Figure 6.2: Substructure definition for the “SS118” substructure (a) with benzylic hydrogens (initial definition) and (b) without benzylic hydrogens (new definition).

Several other rules for SS118 are listed in Table 6.2. Of these, Rule 20 is important in that it demonstrates that rules based on the multiple collision data can identify substructures in compounds where rules based on single collision data failed (e.g. Test Compounds 13 and 14). This information will be of use in implementing the EXPT program (see Chapters 1 and 4). The EXPT program is intended for use in suggesting ancillary experiments to reduce the number of candidate structures for an unknown.

Evaluation of Rulebases Generated using Global Parameters

The reliability of two rulebases, one generated using Ui = 30%, Ci = 30% and Cc = 30% and another generated using Ui = 30%, Ci = 50% and Cc = 30%, was calculated to determine the relative merit of using global parameters to generate MAPS rules. The overall reliability of each rulebase is provided in Table 6.3. The total number of predictions and the number of rules in each rulebase are also provided.

                         P1_30_30_30    P1_30_50_30
REL (%)                  64             80
number of predictions    118            51
number of rules          31             11

Table 6.3: Reliability, number of predictions and number of rules obtained for two MAPS rulebases.
The overall reliabilities shown in Table 6.3 indicate that the global parameters used (30%/30%I30% and 30%/50963096) were not optimal for reliable rule generation (i.e. they had overall reliabilities less than 100%). Specifically, two false positives were observed for the PHENOTHIAZINE substructure using the “Pl_30_30_30” rulebase. Yet, it has been shown that a 100% reliable rule (with respect to the test compounds) can be generated using a Ui/Ci/Cc combination of 40%/70%/50%. The “P1_30_50_30” rulebase has a higher overall reliability than the “Pl_30_30_30” rulebase, however, fewer rules are obtained using the higher correlation value (i.e. Ci=50%). One problem with using global parameters is that it allows the number of initial features used in rule generation to vary over a broad range for the substructures contained in the substructure library (see Chapter 5). Thus, selecting initial features using global parameters (percent uniqueness and correlation) does not appear to be the best method of creating a rulebase. A possible new method for generating feature- combination rules that may yield better results is to allow the MAPS 220 software to vary the Ui and Ci parameters to select a fixed number of features (i.e. the “best” features available in the database). Thus, each rule in a rulebase created using this new method will be generated utilizing approximately the same number of initial features (with the advantage that the features are the best that are available within the database). Additional problems with the MAPS rules were discovered during this investigation. Since each feature-combination in a MAPS (v.III) rule possesses 100% reliability, the presence of any feature- combination in the MS/MS spectra of an unknown was deemed to be sufficient to make a substructure identification. However, many of the false positives observed, using the “Pl_30_30_30” rulebase were based on a small number of rule clauses “firing” (i.e. 
the feature-combination representing a rule clause was found in the MS/MS spectra of an unknown). The RULE program lists each rule clause that fires for an unknown. The overall reliability for the rulebase was recalculated using a new rule application method (i.e. at least 10 rule clauses firing before an identification is made). The results of this study are summarized in Table 6.4. The overall reliability of the rulebase increased from 64% to 82% using this new method of rule application.

Ten of the test compounds were added to the database to increase the number of reference database compounds used for rule generation. The reliability of the rulebase obtained from the new reference database was determined using the remaining 10 test compounds. This process was repeated by switching the ten compounds added to the reference database and the ten used as test compounds. The results of this study are also provided in Table 6.4. There was a slight improvement in the reliabilities observed using the larger reference databases (using both methods of rule application).

# of cmpds in DB    REL       # of pred.'s    # of rules
105                 64%       135             20
                    (82%)*    68              20
115                 65%       32              19
                    (88%)*    28              19
115                 69%       43              19
                    (86%)*    32              19

CE = 30 eV, p = 0.4 mtorr
Ui = 30%, Ci = 30%, Cc = 30%
HITS = 20
* minimum of 10 rule clauses for an identification to be made

Table 6.4: Rule reliabilities, number of predictions and number of rules for several MAPS rulebases obtained using the indicated parameters.

A very promising trend was observed in the false positives obtained using the rulebases shown in Table 6.4. Many of these false positives may be classified as “near-misses” and have to do with the way substructures are interpreted by the structure checking software. The substructure list used to determine correct and incorrect predictions was obtained in the same way as the list used to generate the “substructure buckets”.
For example, one false positive was obtained for test compound #5 (i.e. “SS121”, a six-membered carbon ring with one CH2 substituent and the remaining free valences undefined). The structure corresponding to test compound #5 (see Figure 6.1) clearly has a six-membered ring with a CH2 substituent. The reason this substructure identification is considered a false positive is that the ring corresponding to the “SS121” substructure in test compound #5 is fused to a benzene ring. Thus, two of the carbon atoms in the potential “SS121” ring are doubly bonded (in one resonance form) and the structure checking software does not include the “SS121” substructure in the substructure list for test compound #5.

One-half of the false positives associated with the first rulebase listed in Table 6.4 can be classified as “near-misses”. Another one-quarter of these false positives may be attributed to cross-correlation, removed by use of a better rule (i.e. a rule generated using more stringent program parameters) or neglected because the MS/MS data used cannot reasonably be expected to identify the substructure (e.g. aromatic ring isomers). Thus, there is considerable evidence that further expansion of the reference database and refinement of the MAPS software will yield substructure identification rulebases with sufficient reliability for use in a generalized structure elucidation system (ACES).

Another important point is that the reliability of the first database increases to 88% if test compound #10 is neglected. The threshold used to acquire daughter spectra for test compound #10 was set an order of magnitude lower than that used for the rest of the test compounds because the large majority of the ion current for this compound is concentrated in only a few ions. Thus, no identifications were obtained using the 1% threshold (i.e. the threshold used for the rest of the test compounds).
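The clause-count criterion used for the starred entries of Table 6.4 can be illustrated with a short Python sketch. This is not the RULE program itself: the Rule class, the feature labels (such as "d91") and the substructure names are hypothetical, and reliability is assumed here to mean the percentage of predictions that are correct.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A rule for one substructure; each clause is one feature-combination."""
    substructure: str
    clauses: tuple[frozenset[str], ...]


def fired_clauses(rule: Rule, observed: frozenset[str]) -> list[frozenset[str]]:
    """Return the clauses whose features are all present in the unknown's spectra."""
    return [clause for clause in rule.clauses if clause <= observed]


def identify(rules: list[Rule], observed: frozenset[str], min_clauses: int = 1) -> set[str]:
    """Report a substructure only if at least min_clauses of its clauses fire."""
    return {
        rule.substructure
        for rule in rules
        if len(fired_clauses(rule, observed)) >= min_clauses
    }


def reliability(predicted: set[str], actual: set[str]) -> float | None:
    """Assumed definition: percent of predictions that are correct."""
    if not predicted:
        return None  # no predictions, so reliability is undefined
    return 100.0 * len(predicted & actual) / len(predicted)


# Hypothetical toy data: feature labels and substructure names are made up.
rules = [
    Rule("SS_A", (frozenset({"d91", "d65"}),
                  frozenset({"d91", "n17"}),
                  frozenset({"d65", "n17"}))),
    Rule("SS_B", (frozenset({"d77", "n28"}),
                  frozenset({"d51", "n44"}))),
]
observed = frozenset({"d91", "d65", "n17", "d77", "n28"})
actual = {"SS_A"}  # substructures truly present in the toy unknown

loose = identify(rules, observed, min_clauses=1)   # any single clause suffices
strict = identify(rules, observed, min_clauses=2)  # require two clauses to fire

print(sorted(loose), reliability(loose, actual))
print(sorted(strict), reliability(strict, actual))
```

In this toy case the second rule fires on a single clause only, so raising the minimum clause count removes it from the predictions. The study above used a minimum of 10 clauses; the value 2 is chosen here only to keep the example small.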
In summary, the major causes of false positives for the MAPS rules were: 1) cross-correlation among substructures, 2) inappropriate substructure definitions and 3) inappropriate application of the rules. The use of a lower threshold for acquiring daughter spectra also increased false positives. This last result may change, however, when a larger reference database is used to generate the rules. A lower threshold is desirable because more substructure identifications can be made if daughter spectra of weak primary scan ions are available.

Recommendations for Future Development of MAPS

The following are the major recommendations for future development of the MAPS software. These recommendations are based on the evaluations of the MAPS rules discussed in this and other chapters of this thesis. First, the MAPS code should be modified to select the values of the Ui/Ci variables so that a fixed number of initial features (i.e. the “best” features) are selected for use in generation of feature combinations. Second, the reference database should be increased in size to provide more examples (i.e. provide a better statistical basis) of the fragmentation of the substructures contained in the substructure library. A larger reference database will also decrease the number of false correlations in the rules and, therefore, decrease false positives. Third, the structure software should be modified to eliminate the problem of overlapping substructures (i.e. to take into account the “near-misses”). Fourth, the substructure library should be increased to exploit a larger number of the functionalities present in the reference database compounds. Fifth, a truly novel and extremely valuable contribution to the optimal generation of MAPS rules would be to provide a mechanism for the software to define its own substructures, rather than looking for only those defined in the substructure library.
The molecular formula generator and the “daughter buckets” may provide useful information for implementing this new procedure since the daughter ions are indicative of important functionalities present in the reference database compounds. Lastly, the RULE program should be modified to allow a user-defined minimum number of rule clauses which must “fire” before a substructure identification is made by the program.

Conclusions

The MAPS (v.III) software has proven to be effective in generating substructure identification rules based on feature-combinations. The rules obtained for several substructures were found to be highly reliable when applied to test compounds. Other substructure identification rules which exhibited false positives among the test compounds were often found to be “near-misses” or due to cross-correlation with another substructure which was also present in the test compounds. The addition of more compounds to the reference database and a further modification of the MAPS code should alleviate these problems with the MAPS rules. It is also important to examine closely the substructure definitions used in generating the MAPS rules and in assigning the list of correct predictions for the test compounds. Inappropriate substructure definitions often lead to false positives. These changes will increase the overall reliability and applicability of the substructure identification rulebase(s) used by the ACES system.
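The first recommendation, letting the software vary Ui and Ci until a fixed number of initial features is selected, might look roughly like the following Python sketch. It is only one possible reading of that recommendation: the Feature record, the feature names, and the choice to lower both thresholds together in equal steps are assumptions made for illustration, not part of the MAPS code.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Feature(NamedTuple):
    """A candidate feature for one substructure (hypothetical record)."""
    name: str
    uniqueness: float   # percent uniqueness, compared against Ui
    correlation: float  # percent correlation, compared against Ci


def select_features(
    features: Sequence[Feature],
    target: int,
    start: float = 100.0,
    step: float = 5.0,
) -> Tuple[List[Feature], float, float]:
    """Lower Ui and Ci together until at least `target` features pass.

    Returns the selected features and the thresholds at which selection stopped.
    Lowering both thresholds in lockstep is an assumption of this sketch.
    """
    ui = ci = start
    while ui >= 0.0:
        chosen = [
            f for f in features
            if f.uniqueness >= ui and f.correlation >= ci
        ]
        if len(chosen) >= target:
            return chosen, ui, ci
        ui -= step
        ci -= step
    # Fewer than `target` candidates exist: fall back to all of them.
    return list(features), 0.0, 0.0


# Hypothetical candidate features for a single substructure.
candidates = [
    Feature("d91", 90.0, 80.0),
    Feature("n17", 60.0, 70.0),
    Feature("d65", 40.0, 95.0),
    Feature("d77", 20.0, 30.0),
]

chosen, ui, ci = select_features(candidates, target=2)
print([f.name for f in chosen], ui, ci)
```

Because the thresholds are chosen per substructure rather than globally, every rule would start from roughly the same number of initial features, which is the property the recommendation is after.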