
This is to certify that the

dissertation entitled

GENERATION OF RULES FOR ORGANIC SUBSTRUCTURE
DETERMINATION BY CLASSIFICATION OF MS/MS SPECTRA

presented by

Eric C. Hemenway

has been accepted towards fulfillment
of the requirements for

Ph.D. degree in Chemistry

 

 

 


Major professor

Date

MSU is an Affirmative Action/Equal Opportunity Institution

 

 


 

GENERATION OF RULES FOR ORGANIC SUBSTRUCTURE
DETERMINATION BY CLASSIFICATION OF MS/MS SPECTRA

By

Eric C. Hemenway

A DISSERTATION
Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Chemistry

1996

 

ABSTRACT

GENERATION OF RULES FOR ORGANIC SUBSTRUCTURE
DETERMINATION BY CLASSIFICATION OF MS/MS SPECTRA

By

Eric C. Hemenway

Methodologies have been developed to assist in the elucidation of
chemical structures from mass spectrometry/mass spectrometry (MS/MS)
spectra. Because of the complexity of MS and MS/MS ionization and
fragmentation pathways, determining ionic structure is difficult and
perhaps impossible to do in a convenient manner. The methods developed
here are intended to avoid current deficiencies in ion structure
determination by classifying observed product spectral patterns.

Classification of MS/MS product spectra for isomass precursor ions is
first accomplished through the use of cluster analysis, which determines
the similarities and differences that exist among the measured product
spectra. This process is entirely empirical and does not assume that the
structural integrity of ions is maintained during the ionization and
fragmentation of analyte molecules. Hence, identification of ion structure
is not required.

The ability to group similar product spectra based on spectral features
and to correlate those groups with common substructures found in analyte
molecules is central to this substructure determination process.
Therefore, a method was developed to match unknown MS/MS spectra
against the representative group spectra. The method was designed for the
feature-poor product spectra observed in low-energy collision-induced
dissociation.

The clustered product spectra are then analyzed to determine what
characteristic features they possess. These characteristic features are then
used to construct a hierarchical rule-tree suitable for rapid classification of
unknown isomass product spectra. These rule-trees are also suitable for
targeted component analysis.

The methodologies developed here have been successfully applied to
the classification of product spectra of m/z 59. With expansion of the
product spectral database, it is expected that these methods can be applied
to other precursor m/z values.

 

 

 

 

 

 

 

To my parents Marjorie and Dan Hemenway
for always being there

 

  
   
   
  
   
    
     
   


ACKNOWLEDGMENTS

First and foremost, I thank my research advisor, Dr. Chris Enke. While
we have had our differences, I greatly respect his intellect, insight, and
creativity. I also appreciate the freedom he allows his students in the
formulation and pursuit of their research projects. I also thank the other
members of my research committee, Dr. Stan Crouch, Dr. Ned Jackson, and
Dr. William Reusch.

I also thank my undergraduate research advisor, Dr. Adrian Wade, for
getting me involved in research and for his advice on all aspects of graduate
research. A special thank you to University of British Columbia Lab Director
Ben Clifford for his many stories about undergraduate teaching adventures,
or perhaps I should say misadventures. These stories have provided me with
a unique perspective on software development and the necessity of
student-proofing software, as well as with a ready source of humor.

I thank the members of the Crouch group, Stephen Medlin, James
(Odie) Ridge, Brett Quencer, and Edwin Townsend, for the many basketball
games, lunches, and discussions, their friendship, and for making me an
honorary Crouch group member.

My thanks to Dr. Tom Atkinson and Dr. Tom Carter for the many
afternoon discussions on all aspects of computing.

I would also like to thank several past and present members of the
Enke group, in particular Mark Cole, David McLane, Kate Noon, Ron
Lopshire, Paul Vlasak, Fei Liu Overney, Gregor Overney, and Dinorah
Frutos, for their advice and friendship, which have meant a lot to me over my
graduate career. Of course, Tina Erickson deserves special mention, as she has
come to be much more than a friend. Her support has been very important to
the completion of this dissertation.

I do not thank the TSQ 70B/700 mass spectrometer at Michigan State
University for its ability to de-tune within minutes. Fortunately, the
TSQ 7000 proved to behave in exactly the opposite manner.

Last, I thank the National Science Foundation's Center for Microbial
Ecology at Michigan State University, the American Society for Mass
Spectrometry, and Finnigan MAT for their financial support.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1: Introduction
    References

CHAPTER 2: Structure Elucidation Methods in Mass Spectrometry
    Introduction
    Pattern Matching Methods
    Interpretive Systems
    Advantages of MS/MS over MS
    [illegible]
    References

CHAPTER 3: Cluster Analysis as an Analytical Tool
    Introduction
    Background
    Similarity Calculation Methods
        Euclidean Distance
        Minkowski Distance
        Mahalanobis Distance
    Hierarchical Clustering Methods
        Single Linkage Clustering
        Complete Linkage Clustering
        Average Linkage Clustering
    Partitional Clustering Methods
        K-Means Clustering
    Probability-based and Density-based Clustering
        Kernel Density Clustering
        Fuzzy Clustering
    Application of Cluster Analysis to Problems of Chemical Significance
    Application of Cluster Analysis to Mass Spectral Data
    [illegible]
    References

CHAPTER 4: Application of Cluster Analysis to Microbial Characterization
    Introduction
    Background
    Experimental
    Results and Discussion
    [illegible]
    References

CHAPTER 5: Generating Clusterable MS/MS Data
    Introduction
    Acquisition Method [illegible]
    Instrumental Parameters
    Sample Introduction
    Effects of Collision Energy on CID Spectra
    Effects of Target Gas Pressure on CID Spectra
    Instrumental Conditions Selected for MS/MS Library Generation
    Generating Reproducible Spectra
    Post-Acquisition Data Processing and Storage
    [illegible]
    References

CHAPTER 6: Development of Clustering Procedure for MS/MS Data
    Introduction
    Similarity Calculations on MS/MS Data
    Clustering MS/MS Data
    Interpretation of the Clusters
    [illegible]
    References

CHAPTER 7: Product Ion Classification for Standards and Unknowns
    Introduction
    Establishing Representative Descriptors
    Discovery of Non-Discriminating Features
    Classification of Unknowns by Spectral Matching
    Classification of Unknowns by Rule Application
    Improving Rules with Rule-Trees
    [illegible]
    References


LIST OF TABLES

Table 5.1 Tune parameters for Q1MS mode with typical values in volts.

Table 6.1 Compounds analyzed for m/z 59 and their representative
substructures. Duplications were used to investigate reproducibility.

Table 6.2 Results for partitioning m/z 59 into seven clusters.

Table 7.1 A comparison of the fuzzy matches of the m/z 59 product mass
spectra by the three membership functions shown in Figure 7.3.

 

LIST OF FIGURES

Figure 1.1 Schematic of the triple quadrupole mass spectrometer.

Figure 1.2 The major operational scan modes of the triple quadrupole mass
spectrometer.

Figure 2.1 The DENDRAL approach to structure elucidation.

Figure 2.2 Schematic diagram of the classification and identification expert
system developed by Scott.

Figure 2.3 The ACES approach to structure elucidation.

Figure 2.4 A proposed feedback loop for ACES that would allow the system to
recommend ancillary experiments.

Figure 3.1 Examples of cluster structures: (a) well-separated clusters; (b)
touching or overlapping clusters. The axes x1 and x2, representing the
two clustering dimensions, are generic in this example but could
represent such things as relative intensities for two m/z values or
infrared intensities at two vibrational frequencies.

Figure 3.2 Four examples of complex cluster shapes: (a) spherical,
(b) elongated, (c) linearly inseparable, and (d) dense clusters.

Figure 3.3 Categories of classification types.

Figure 3.4 A graphical representation of the Euclidean distance metric.

Figure 3.5 An example of a dendrogram.

Figure 3.6 The single linkage hierarchical clustering of selected unknown
microorganisms obtained by phospholipid profiling.

Figure 3.7 The complete linkage hierarchical clustering of the same selected
unknown microorganisms presented in Figure 3.6 obtained by
phospholipid profiling.

Figure 3.8 The relative distance measures for single, complete, and average
linkage methods. Samples 1 and 2 would initially be joined at distance d1.
Sample 3 would join at d_single, d_complete, or d_average, respectively.

Figure 3.9 The average linkage hierarchical clustering of the same selected
unknown microorganisms presented in Figure 3.6 and Figure 3.7
obtained by phospholipid profiling.

Figure 4.1 The proposed CME approach to incorporate the research objectives
in the Community Diversity Thrust Group into a more cohesive package.

Figure 4.2 The general glycerophospholipid structure. The head group
substituent is represented by Y and the two fatty acids by R and R'.

Figure 4.3 The phospholipid head groups and their respective neutral losses
utilized in this study.

Figure 4.4 Low-energy collision induced dissociation neutral loss
fragmentation of phospholipids for positive ions.

Figure 4.5 Phosphatidylethanolamine (PE) mass profiles for Salmonella
abaetetuba, Citrobacter freundii, and Escherichia coli.

Figure 4.6 The FAB/MS/MS phospholipid (class) mass profile obtained in
part by this analysis procedure. The phospholipid classes axis is
determined by the neutral losses monitored and the phospholipid masses
axis is determined from the respective neutral loss scans. The fatty acid
masses axis is determined by operating the instrument in negative ion
mode, which was not utilized for this study.

Figure 4.7 The Instrument Control Procedure used to control the mass
spectrometer in this study.

Figure 4.8 Dendrogram for the analysis of phosphatidylglycerol (PG).

Figure 4.9 Dendrogram for the analysis of phosphatidic acid (PA).

Figure 4.10 Dendrogram for the analysis of phosphatidylethanolamine (PE).

Figure 4.11 Dendrogram for the analysis of [illegible] (PDM).

Figure 4.12 Dendrogram for the analysis of [illegible]ylethanolamine (PM).

Figure 4.13 Dendrogram for the analysis of all combined phospholipids
demonstrating the loss of differentiating effectiveness.

Figure 5.1 A schematic of the TSQ-7000 mass spectrometer.

Figure 5.2 Typical Q1MS, Q3MS, and NEU 0 profiles for the tuning
compound PFTBA.

Figure 5.3 The leak inlet system used for high volatility compounds.

Figure 5.4 Product spectra of the molecular ion of cyclohexanone with a
collision energy of (a) 10 eV, (b) 40 eV, and (c) 80 eV with a target gas
pressure of 3 x 10^[illegible] torr (manifold). (* denotes offscale precursor
peak at 100%)

Figure 5.5 Relative intensity of several product ions of the molecular ion of
cyclohexanone plotted as a function of collision energy at a target gas
pressure of 3 x 10^[illegible] torr (manifold).

Figure 5.6 Product spectra of the molecular ion of 3-heptanone with a
collision energy of (a) 10 eV, (b) 40 eV, and (c) 80 eV with a target gas
pressure of 3 x 10^[illegible] torr (manifold).

Figure 5.7 Relative intensity of several product ions of the molecular ion of
3-heptanone plotted as a function of collision energy at a target gas
pressure of 3 x 10^-6 torr (manifold).

Figure 5.8 Relative intensity of several product ions of the molecular ion of
3-heptanone plotted as a function of collision energy at a target gas
pressure of 3 x 10^-6 torr (manifold). These results were obtained by
repeating the experiment shown in Figure 5.7.

Figure 5.9 Effect of increasing target gas pressure at a constant collision
energy (-30 V) for products of the molecular ion of cyclohexanone.

Figure 5.10 Log-log plot of the data presented in Figure 5.9. The slopes
indicate reaction order.

Figure 5.11 Effect of increasing target gas pressure at a constant collision
energy (~30 V) for products of the molecular ion of 3-heptanone.

Figure 5.12 Log-log plot of the data presented in Figure 5.11. The slopes
indicate reaction order.

Figure 5.13 Instrument Control Procedure for the collection of normal mass
spectra.

Figure 5.14 Instrument Control Procedure for the collection of product mass
spectra.

Figure 5.15 Product spectra for the molecular ion of cyclohexanone. (a) and
(b) were collected on the same day while (c) and (d) were collected on a
different day with a different instrument tune.

Figure 5.16 Product spectra for the molecular ion of 3-heptanone. (a) and (b)
were collected on the same day while (c) was collected on a different day
with a different instrument tune.

Figure 6.1 The m/z 59 substructures of interest. Potential bonding positions
are labeled a through d.

Figure 6.2 The Euclidean distance / single linkage hierarchical clustering
dendrogram for selected m/z 59 product spectra.

Figure 6.3 Sample product spectra of m/z 59 for selected compounds listed in
Table 6.1.

Figure 6.4 Mixture product spectrum for m/z 59 of sec-butyl alcohol.

Figure 7.1 Cluster mean product spectra for m/z 59. The error bars indicate
the standard deviation in relative intensity of the cluster members.
Clusters 2, 6, and 7 are single member clusters and have no error bars.

Figure 7.2 Reduced feature set dendrogram for the m/z 59 product spectra.

Figure 7.3 Fuzzy membership functions investigated for spectral matching.

Figure 7.4 Rule-tree for m/z 59 product spectra.

Figure 7.5 The members and expected substructures for m/z 59 cluster 1.

Figure 7.6 The members and expected substructures for m/z 59 clusters 2
and 6 respectively.

Figure 7.7 The members and expected substructures for m/z 59 cluster 3.

Figure 7.8 The members and expected substructures for m/z 59 cluster 4.

Figure 7.9 The members and expected substructures for m/z 59 cluster 5.

 

CHAPTER 1

Introduction

The focus of this thesis is the development of software-based tools to
assist in the structural elucidation of organic chemical species using tandem
mass spectral (MS/MS) data. These tools will attempt to identify the
molecular substructure that gave rise to each fragment ion in the normal MS
spectrum, based on the product spectrum for that ion, and then reconstruct
the molecule from the partial substructures that have been identified.

The development of "intelligent" software for structure elucidation is
not a new concept in chemistry and is indeed a necessity given the large
amounts of data generated by modern analytical instruments. Mass
spectrometers in particular are capable of producing large quantities of data
related to molecular ion structure from small amounts of material. However,
the interpretation of the data into structural information is often
complicated, ambiguous, and incomplete. For this reason, substantial effort
has been made in the development of structure elucidation software based on
mass spectral data.

In 1965 work began at Stanford on the DENDRAL project. The
objective of the DENDRAL project was to develop software which could
determine molecular structures consistent with the normal mass spectrum
for an unknown compound [1,2]. As input, the DENDRAL software accepted
the unit-resolution mass spectrum of an unknown and the empirical formula
of the molecular ion. The software would then attempt to generate all
possible candidate molecular structures, using those functionalities believed
to be present, which were consistent with the mass spectrum. The DENDRAL
software was found to be as effective as a human expert in elucidating
chemical structures due in large part to the systematic approach used. The
major shortcoming of DENDRAL was that the conventional mass spectral
data alone were insufficient for complete and unambiguous identification of
unknown structures. In order to provide better candidate structures, the
DENDRAL software often required NMR and IR data in conjunction with the
mass spectrum. A more detailed description of the DENDRAL project is
presented in Chapter 2.

The Self-Training Interpretive and Retrieval System (STIRS)
developed by McLafferty et al. is another attempt to develop software to
assist in structural elucidation by mass spectrometry. The STIRS software is
of value in cases where library spectral matching alone does not provide
unambiguous identification of an unknown [3-5]. The STIRS software
attempts to identify the major substructural feature(s) of an unknown by
correlation of the conventional mass spectrum to the best matching library
spectra through the use of class libraries. A more detailed discussion of the
STIRS approach can be found in Chapter 2.

The STIRS software, like the DENDRAL software, is in many cases
unable to provide definitive structural information. Many of the problems
with the DENDRAL and STIRS approaches can be attributed to the
complexities of the fragmentation processes that occur in the electron impact
(EI) ion source. Signals from all the ions produced in the ion source are
represented in the conventional mass spectrum. As a result, the conventional
mass spectrum contains peaks due to single and multiple fragmentation
processes, recombination reactions, and rearrangement reactions. The
DENDRAL and STIRS software were unable to interpret the effect of all
these processes from the mass spectrum and hence could not provide
unambiguous information on the structures present.

In order to circumvent as many of the inherent shortcomings of
conventional mass spectral data as possible in the elucidation of chemical
structures, Enke and co-workers utilized MS/MS and the additional
structure-related data it affords. To this end, they have developed the
Methods for Analyzing Patterns in Spectra (MAPS) software [6-13], in which
feature/substructure relationships are determined empirically from both
conventional mass spectral (MS) and tandem mass spectral (MS/MS)
information.

The initial MAPS approach was to identify the structure of observed
ions and correlate the ion structure with substructures present in the
molecule. However, this approach was soon discontinued as the correlation
process required comprehensive knowledge of the ion formation processes
occurring in the ion source. This requirement was later circumvented by the
"feature bucket" approach, which was based on the idea that spectral features
and substructures could be correlated without complete spectral
rationalization [8]. The premise of the feature bucket approach was that by
examining the MS/MS spectra for several compounds which share only one
particular substructure, the most prevalent spectral features observed should
be representative of this substructure. These features were then combined
into a rule which could be applied to an unknown to determine the presence
or absence of this particular substructure. However, this concept suffered
from the arbitrary selection of the substructure present in the molecule and
the assumption that the ions representing this substructure would fragment
from different molecules in a consistent manner. The quality of the rules was
also degraded by the inclusion of features which were observed in the MS/MS
spectra but were not derived from the substructure of interest.
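The premise of the feature bucket approach can be illustrated with a minimal Python sketch. It is not the MAPS implementation: the function name `feature_bucket`, the `min_fraction` threshold, and the peak lists are hypothetical, and a "feature" is simplified here to the presence of a peak at an integer m/z value.

```python
from collections import Counter

def feature_bucket(spectra, min_fraction=0.8):
    """Return the features (integer m/z values) that occur in at least
    min_fraction of the spectra assumed to share one substructure."""
    # Count each m/z at most once per spectrum.
    counts = Counter(mz for spectrum in spectra for mz in set(spectrum))
    cutoff = min_fraction * len(spectra)
    return sorted(mz for mz, n in counts.items() if n >= cutoff)

# Hypothetical product-ion peak lists (m/z only) for three compounds
# assumed to share a single substructure.
training = [
    [31, 41, 43],
    [31, 41, 59],
    [31, 43, 41],
]
print(feature_bucket(training))
```

In this toy data set only m/z 31 and 41 appear in every spectrum, so only they survive the default threshold. The sketch also makes the weakness noted above visible: a peak that is prevalent for reasons unrelated to the chosen substructure would be admitted to the rule just the same.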


 

The third generation of the MAPS software, developed by Hart and
Enke [11], introduced the concept of "feature combinations". The correlated
features generated by this approach are obtained in the same manner as the
previous version, but the substructure rules were made up of combinations of
those features which occurred concertedly. As a result, substructures are
considered to be present in unknowns only if all elements of a given
combination of features are observed rather than if a fraction of the elements
in a set of structure-related features are observed. While feature combination
rules are more definitive than discrete feature rules, they are difficult to
discover and they still suffer from arbitrary substructure selection and
substructure contamination. The MAPS approach is described in greater
detail in Chapter 2.
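The difference between a discrete feature rule and a feature combination rule can be sketched as follows. This is only an illustration of the distinction drawn in the text, not the MAPS code; the function names, the `min_fraction` value, and the m/z values are hypothetical.

```python
def discrete_rule(observed, features, min_fraction=0.3):
    """Discrete feature rule: the substructure is reported when a
    sufficient fraction of its related features is observed."""
    hits = sum(1 for f in features if f in observed)
    return hits / len(features) >= min_fraction

def combination_rule(observed, combinations):
    """Feature combination rule: the substructure is reported only when
    every element of at least one combination is observed."""
    observed = set(observed)
    return any(set(combo) <= observed for combo in combinations)

# Hypothetical features and feature combinations for one substructure.
features = [31, 41, 43]
combinations = [(31, 41), (31, 43)]

partial = [31, 59]        # only one related feature present
complete = [31, 41, 59]   # one full combination present

print(discrete_rule(partial, features), combination_rule(partial, combinations))
print(discrete_rule(complete, features), combination_rule(complete, combinations))
```

The partial spectrum satisfies the discrete rule but not the combination rule, which is the sense in which combination rules are the more definitive of the two.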

In this research project, the problem of structure elucidation using
tandem mass spectral information is approached from a more data-oriented
perspective. The tools to be developed here have no preconceived constraints
regarding either fragmentation pathways or the specific substructures to be
determined. Substructure identification rules are determined from the
identification of common data patterns and the subsequent discovery of the
substructures to which they are related. This is different from the MAPS
approach, which correlates arbitrarily selected substructures with common
data features. The determination of the substructures that are represented
by the data is a more natural approach and has been used by chemists since
the early development of MS/MS instruments.

An approach similar to that which is followed and expanded on in this
project has been presented by Cross and Enke [6,7]. However, Cross and
co-workers never implemented the substructure searching component of their
system. As a result, the system was only capable of matching unknown
product spectra against the reference spectra. Any feature/substructure
correlations had to be made by the operator. In the approach used in this
project, the spectrum matching and substructure searching components are
incorporated through the use of exploratory analysis techniques such as
cluster analysis. Mass spectral variance is accommodated through the use of
fuzzy logic. A description of cluster analysis in chemistry with particular
focus on those methods used in this research is provided in Chapter 3.
Further details of the clustering implementation utilized in this work are
given in Chapter 6. The details of the fuzzy logic implementation are given in
Chapters 6 and 7.

The intent of this project is to develop an intelligent computer-assisted
structure elucidation system embodying those qualities envisioned by Yost
and Enke [14] and Cross and Enke [6,7]. This system uses clustering
algorithms for the grouping of product spectra based on similarities in their
m/z and relative abundance values. The structures of the molecules which
formed these groups are then examined for common substructural features.
Where common substructural features are obtained, rules are constructed
relating these substructures to the product spectra. The basic form of the rule
is that if the unknown product spectrum matches well with a particular
reference product spectrum, then the substructural unit represented by the
reference product spectrum is present in the unknown compound. These
rules can then be applied to unknown spectra to determine the presence of
the substructures. The rules would be of particular interest in cases where
precursor ion isomers are not readily distinguishable by conventional mass
spectrometry. Potential substructures are obtained from each product
spectrum of the unknown.
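The basic form of the rule stated above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure developed in Chapters 6 and 7: spectra are represented as dictionaries of m/z to relative intensity, the match is a plain Euclidean distance, and the reference spectra, substructure labels, and `max_distance` threshold are all hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two spectra given as {m/z: intensity};
    a peak missing from one spectrum counts as intensity 0."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def apply_rule(unknown, references, max_distance=20.0):
    """Return (substructure, distance) for the closest reference spectrum.
    The substructure is None when even the best match is too distant."""
    label, ref = min(references.items(),
                     key=lambda item: euclidean(unknown, item[1]))
    distance = euclidean(unknown, ref)
    return (label if distance <= max_distance else None), distance

# Hypothetical reference product spectra, one per cluster of standards.
references = {
    "substructure A": {31: 100.0, 41: 20.0},
    "substructure B": {43: 100.0, 41: 35.0},
}

unknown = {31: 95.0, 41: 25.0, 29: 5.0}
print(apply_rule(unknown, references))
```

A hard distance threshold is the simplest possible matching criterion; the thesis accommodates spectral variance with fuzzy logic instead, which this sketch does not attempt to reproduce.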

A further enhancement in this work is the development of rule-trees.
Rule-trees are generated by a distillation and extraction of the most
discriminating factors of the rules developed by the method described above.
Since rule-trees contain only the most discriminating spectral features and
are a conjunction of all the rules relating to a particular precursor m/z, they
are of particular use for targeted component analysis.
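One way to picture a rule-tree is as a small decision tree in which each node tests a single discriminating spectral feature. The sketch below assumes that form; the node layout, m/z values, intensity thresholds, and cluster labels are hypothetical and are not taken from the rule-tree presented in Chapter 7.

```python
def classify(spectrum, node):
    """Walk a rule-tree. Internal nodes are dicts that test the relative
    intensity at one m/z value; leaves are plain string labels."""
    while isinstance(node, dict):
        intensity = spectrum.get(node["mz"], 0.0)
        node = node["yes"] if intensity >= node["min_intensity"] else node["no"]
    return node

# Hypothetical rule-tree for a single precursor m/z.
tree = {
    "mz": 31, "min_intensity": 50.0,
    "yes": "cluster 1",
    "no": {
        "mz": 43, "min_intensity": 50.0,
        "yes": "cluster 2",
        "no": "unclassified",
    },
}

print(classify({31: 100.0, 41: 15.0}, tree))
print(classify({43: 80.0, 31: 10.0}, tree))
```

Because only a few features are examined on the way to a leaf, classification is fast, which is consistent with the use of rule-trees for targeted component analysis.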

Once a list of known substructures has been constructed, the substructures
can be combined to determine plausible molecular structures using GENOA.
GENOA, developed as part of the DENDRAL project, is a structure
generation program which will assemble all the substructures believed to be
present into plausible molecular structures [15]. The details of the rule
generation and implementation approach developed in this work are
provided in Chapters 6 and 7.

In order to develop a database of representative MS/MS spectra (a
product spectrum for selected significant masses in the normal spectrum) and
to be able to use these data for the structural elucidation of unknowns, it is
necessary to be able to obtain MS/MS spectra in a reproducible and reliable
manner. Therefore, this research will also identify and characterize those
instrumental parameters which have the greatest effect on the quality and
reproducibility of the product spectra. For this research, mass spectra were
obtained on Finnigan TSQ-700 and Finnigan TSQ-7000 tandem mass
spectrometers equipped with an electron impact ion source as diagrammed in
Figure 1.1. While the TSQ has several different modes of operation, some of
which are shown in Figure 1.2, the product scan mode is most commonly
used and will be of most use to this project.

The product scan mode of operation is as follows. Rather than directly
detecting the ions formed from overlapping fragmentation processes in the
ion source, the first mass analyzer (Q1) individually selects these ions. The
selected precursor ions which pass through the first mass analyzer are
allowed to undergo collision(s) with a target gas, argon in the work presented
here, in the collision chamber. The collision-induced products of these
collisions can then be mass analyzed by scanning the second mass analyzer
(Q3) [16].

The major advantage to using the collision chamber to induce
fragmentation of the precursor ions is that the products of these reactions are
separated from the fragmentation products of all other ions in the source. In
other words, a second dimension of information is obtained because the
spectra obtained at the detector are the collision induced dissociation (CID)
product ions of only the precursor ions of the selected mass. These
spectra not only provide the precursor ion mass-to-charge ratio (as observed
in a normal mass spectrum), but also the CID product ions which contain
structural information about the precursor fragment ion. The key parameters
for controlling the CID process are the collision offset voltage, the collision
target gas, and the collision gas pressure. The characterization of these
parameters is provided in Chapter 5.

 

 

[Figure 1.1: instrument diagram not recoverable from the source scan.]

Figure 1.1 Schematic of the triple quadrupole mass spectrometer.

[Figure 1.2 panels:
Normal (Q1MS) Scan Mode: scan m/z 10-500.
Product Scan Mode: select m/z 500; scan m/z 50-510.
Precursor Scan Mode: scan m/z 140-500; select m/z 150.
Neutral Loss/Gain Scan Mode: scan m/z 100-500; select m/z 50-450 (loss)
or m/z 150-550 (gain).]

Figure 1.2 The major operational scan modes of the triple quadrupole mass
spectrometer.
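The arithmetic behind the neutral loss/gain scan mode in Figure 1.2 can be written out as a short Python sketch. It assumes, as the m/z ranges in the figure imply, that the second analyzer tracks the first at a fixed mass offset (50 in the figure's example); the function name and its parameters are hypothetical and do not correspond to any instrument control command.

```python
def neutral_scan(q1_start, q1_stop, neutral_mass, mode="loss", step=1):
    """Return the (Q1, Q3) m/z pairs visited in a neutral loss or gain scan.
    Q3 tracks Q1 at a fixed offset equal to the neutral mass."""
    if mode not in ("loss", "gain"):
        raise ValueError("mode must be 'loss' or 'gain'")
    sign = -1 if mode == "loss" else 1
    return [(q1, q1 + sign * neutral_mass)
            for q1 in range(q1_start, q1_stop + 1, step)]

# The ranges shown in Figure 1.2, with an offset of 50.
loss = neutral_scan(100, 500, 50, mode="loss")
gain = neutral_scan(100, 500, 50, mode="gain")
print(loss[0], loss[-1])
print(gain[0], gain[-1])
```

The product and precursor scan modes are the two limiting cases in which one analyzer is held at a fixed m/z while the other is scanned.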

 

References

1. Duffield, A. M.; Robertson, A. V.; Djerassi, C.; Buchanan, B. G.;
   Sutherland, G. L.; Feigenbaum, E. A.; Lederberg, J. J. Am. Chem. Soc.
   1969, 91, 2977-2981.

2. Buchs, A.; Delfino, A. B.; Duffield, A. M.; Djerassi, C.; Buchanan, B.
   G.; Feigenbaum, E. A.; Lederberg, J. Helv. Chim. Acta. 1970, 53,
   1394-1417.

3. Gray, N. A. B. In Computer-Assisted Structure Elucidation. John
   Wiley & Sons; New York, 1986, pp. 995-100.

4. Haraki, K. S.; Venkataraghavan, R.; McLafferty, F. W. Anal. Chem.
   1981, 53, 386-393.

5. Lowry, S. R.; Isenhour, T. L.; Justice, J. B. Jr.; McLafferty, F. W.;
   Dayringer, H. E.; Venkataraghavan, R. Anal. Chem. 1977, 49,
   1720-1722.

6. Cross, K. P.; Palmer, P. T.; Beckner, C. F.; Giordani, A. B.; Gregg, H.
   G.; Hoffman, P. A.; Enke, C. G. In Artificial Intelligence Applications
   in Chemistry. ACS Symposium Series No. 306, 1986, pp. 321-336.

7. Cross, K. P. Ph.D. Thesis, Michigan State University, 1985.

8. Wade, A. P.; Palmer, P. T.; Hart, K. J.; Enke, C. G. Anal. Chim. Acta.
   1988, 215, 169-186.

9. Palmer, P. T. Ph.D. Thesis, Michigan State University, 1988.

10. Palmer, P. T.; Hart, K. J.; Enke, C. G.; Wade, A. P. Talanta. 1989, 36,
    107-116.

11. Hart, K. J. Ph.D. Thesis, Michigan State University, 1989.

12. Hart, K. J.; Enke, C. G. Chemom. and Intell. Lab. Sys. 1990, 8,
    293-302.

13. Hart, K. J.; Enke, C. G. In Computer-Enhanced Analytical
    Spectroscopy, Jurs, P. C., Ed.; Plenum: New York, 1992; Vol. 3,
    Chapter 6.

14. Yost, R. A.; Enke, C. G. In American Laboratory. June 1981; pp 88-95.

15. Genoa Reference Manual. Molecular Design Ltd., Hayward, California,
    1984.

16. TSQ/SSQ 700 Series Systems Operator's Manual. Finnigan MAT, San
    Jose, California, 1990; pp 3-6.

 

CHAPTER 2

Structure Elucidation Methods in Mass Spectrometry

Introduction

The determination of molecular structure has been a fundamental
problem for researchers for the last century. To solve this dilemma, a
researcher may employ one of numerous methodologies. On the modern
research front, these methods may include techniques such as Nuclear
Magnetic Resonance (NMR), spectroscopic methods like Infra-red (IR),
Ultraviolet (UV), or Raman, or Mass Spectrometry (MS). Mass spectrometry
has been used routinely for chemical structure elucidation since the 1960s.

Unfortunately, molecular structure determination from mass spectral
data is often complicated, ambiguous, and incomplete. This situation is
further complicated by the large amounts of data that can be produced by
modern mass spectrometers. For example, liquid chromatographic interfaces
to mass spectrometers are often outﬁtted with multiwavelength ultraviolet
detectors to provide UV information on eluents entering the mass

spectrometer source. For this reason, several significant initiatives have been
undertaken to develop automated and semi-automated computer software to
assist in the elucidation of molecular structures from MS data. These
initiatives generally fall into one of two categories: Pattern Matching
Methods or Interpretive Systems. Small [1] describes these two broad
categories as direct and indirect database methods, respectively. Pattern
Matching, or direct database methods, are those which require the presence
of a spectral database in both the development and implementation of the
interpretation procedure. Interpretive systems, or indirect database methods,
use a database in the development of the method, but not necessarily in the

implementation.

Pattern Matching Methods

All commercial mass spectrometers sold today contain a basic spectral
library search package. Differences exist between the packages with respect
to the number and type of molecules represented by their spectra, the
database indexing and retrieval algorithms, and the relative intensity and/or
mass/charge (m/z) weighting scheme applied to spectral peaks. Nevertheless,
all the library searching approaches have the same objective of trying to
accurately match the mass spectra of unknown compounds or mixtures with
known library spectra. While some pattern matching methods have been
more successful than others, they still share, in many respects, the same

limitations.


There are generally four situations that a successful library search
system must address in order to be useful for identifying unknowns. First, a
reference spectrum for the unknown is included in the library with a pattern
very close to that of the measured unknown. Second, a reference spectrum for
the unknown is included in the library but with a different pattern. Third, the
unknown spectrum is that of a mixture, the components of which are contained in
the library. Fourth, the unknown, if pure, is not included in the
library, or one or more components are not present if the unknown is a
mixture. These four situations are arranged in order of increasing difficulty.
The first is the ideal case and can be realized using small
custom databases intended for use where the components of an analyte are
known. The second case often arises as a result of day-to-day variations in
instrument tuning or from mixing spectra collected on different types of
mass spectrometers that may exhibit different mass-to-charge sensitivities.
The third case can often be alleviated by the use of a chromatographic inlet
system to the mass spectrometer, except where components co-elute and cannot
be readily resolved. The fourth case is perhaps the most limiting for
any library search system intended for general use.

A library search system, no matter how involved the matching
algorithm, can only be as good as the spectral database from which it works.

As a result, the most detrimental limitation of library search systems is that
they can never be complete. At present, there are in excess of ten million
different chemical structures that have been identiﬁed in nature or
synthesized in the laboratory. Current mass spectral libraries contain on the
order of tens of thousands of reference spectra and are therefore severely
lacking in completeness. Since even the list of known compounds is nowhere
close to being complete, no mass spectral library can ever be complete. To
make matters worse, as mass spectral libraries become more comprehensive,
their differentiating ability decreases with the increased likelihood that the
library contains reference spectra with increasing similarity to each other.
Furthermore, the reference spectra may be subject to situation two in the
preceding paragraph as they are often collected on multiple systems or on a
reference system that may be different than the one used to collect the

unknown.

The most common library match system in use today is the Probability
Based Matching (PBM) algorithm developed by McLafferty and coworkers
[2,3]. The PBM algorithm uses a statistical weighting of the peaks in a
spectral database in inverse proportion to their occurrence in the database.
Therefore, peaks that occur often in the database are given a lower
differentiating ability weighting factor than peaks which occur less often.
PBM also uses various methods of peak flagging and abundance scaling to
attempt to compensate for distortions in the database caused by differences
between instruments and tuning conditions. PBM, like most mass spectral
search systems that utilize statistical measures to determine match factors,
assumes that all peaks in a spectrum are independent of each other. Since
many fragments observed in a normal mass spectrum are highly correlated,
the validity of the match factors is dependent on all compounds matched
having relatively the same degree of correlation. The PBM system has been
demonstrated to work quite well where the reference spectrum for an
unknown exists in the database and for identifying components in mixtures.
However, it has proven to be inferior to the SISCOM library search system,

described below, in retrieving homologous compounds [4].
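The idea of weighting peaks in inverse proportion to their occurrence in the database can be illustrated with a minimal sketch. This is not the PBM algorithm itself: the logarithmic form of the weight, the function names, and the four-spectrum library are all assumptions made for illustration, and peak abundances are ignored entirely.

```python
import math
from collections import Counter

def occurrence_weights(library):
    """Weight each m/z value by how rarely it occurs across the library spectra."""
    n_spectra = len(library)
    counts = Counter(mz for peaks in library.values() for mz in set(peaks))
    # Assumed weighting: log2 of the inverse occurrence fraction, so a peak
    # present in every spectrum carries no differentiating weight.
    return {mz: math.log2(n_spectra / c) for mz, c in counts.items()}

def shared_peak_score(unknown, reference, weights):
    """Sum the weights of the m/z values present in both spectra."""
    return sum(weights.get(mz, 0.0) for mz in set(unknown) & set(reference))

# Hypothetical library: compound label -> list of m/z values with peaks.
library = {
    "A": [43, 57, 71],
    "B": [43, 57, 91],
    "C": [43, 77, 105],
    "D": [43, 57, 149],
}
weights = occurrence_weights(library)
unknown = [43, 77, 105]
scores = {name: shared_peak_score(unknown, peaks, weights)
          for name, peaks in library.items()}
best = max(scores, key=scores.get)
```

In this toy library, m/z 43 occurs in every spectrum and therefore contributes nothing to any score, while the rarer peaks at m/z 77 and 105 dominate the match.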

Another library search system that is quite similar to PBM is the
INCOS library search system developed at Finnigan MAT (San Jose, CA).
This system differs from PBM primarily in the way mass spectral peaks are

weighted during matching.

A library search system for mass spectra developed by Henneberg and
coworkers [5], called SISCOM, utilizes a spectral coding scheme based on
selecting the most important peaks within homologous ion series and on a
multiple-factor assessment of the results. The name SISCOM is an acronym
for Search for Identical and Similar COMpounds. This system has been
demonstrated to be suitable for detecting structural similarities, such as common
substructures, even in cases where visual inspection and differentiation of
spectra is difficult. The SISCOM system contains a pre-search algorithm that
retrieves the 150 best-matching reference spectra. A more extensive
matching algorithm is then used to refine the similarity order of the 150
spectra. The matching procedure uses a series of match factors, the results of
which are all displayed for evaluation by the user.

The SISCOM algorithm was extended by Domokos et al. to make the
system more of an identiﬁcation (retrieval) system than a match similarity
system [6]. While this extension has met with some success, the authors state
that a system with a 100% success rate cannot be
expected. Several of the reasons for this limitation are the
same as those that affect all library search systems.

A recent addition to SISCOM has been the implementation of
structure searches [7]. The structure search can be used to either search for
compounds with the presence or absence of deﬁned fragments or for
structures similar to a target structure. The ﬁrst search method is useful for
retrieving spectra for certain classes of compounds. The second search
method is useful for validating a proposed structure for an unknown by

retrieving spectra for structurally similar compounds.

Recently, Stein and Scott [8] conducted a comparison of five spectral
matching algorithms. These algorithms were PBM, dot-product, the Hertz et al.
similarity index, Euclidean distance, and absolute value distance. Each
match algorithm was optimized separately for a test set consisting of 12,592
alternate spectra of about 8000 compounds. The rank of the correct spectrum
in the list of candidate spectra was used as the criterion for match accuracy.
The algorithms were found to perform in the order of dot-product (75%),
Euclidean distance (72%), absolute value distance (68%), PBM (65%), and
Hertz et al. (64%), where each percentage represents a
hit rate for correct identification. This criterion directly measures the key function of
a search algorithm: to place the correct result as high as possible in the
ranked candidate list. Furthermore, it is independent of the
relative ranking systems utilized by the individual search algorithms.
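Three of the compared measures, and the rank criterion used to score them, can be sketched in a few lines. The sketch below uses the plain textbook forms of the dot-product, Euclidean, and absolute value measures; the optimized variants evaluated by Stein and Scott are not reproduced here. The spectrum representation (a dictionary of m/z to intensity), the function names, and the example spectra are assumptions.

```python
import math

def align(a, b):
    """Return intensity vectors for two spectra over their combined m/z values."""
    mzs = sorted(set(a) | set(b))
    return [a.get(mz, 0.0) for mz in mzs], [b.get(mz, 0.0) for mz in mzs]

def dot_product_similarity(a, b):
    """Normalized dot product (cosine) of the two intensity vectors."""
    u, v = align(a, b)
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0

def euclidean_distance(a, b):
    u, v = align(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def absolute_value_distance(a, b):
    u, v = align(a, b)
    return sum(abs(x - y) for x, y in zip(u, v))

def rank_of_correct(unknown, library, correct, similarity):
    """Rank (1 = best) of the correct entry when the library is sorted by similarity."""
    ranked = sorted(library, key=lambda name: similarity(unknown, library[name]),
                    reverse=True)
    return ranked.index(correct) + 1

# Hypothetical spectra: m/z -> relative intensity.
library = {
    "ref_1": {43: 100.0, 58: 35.0, 71: 12.0},
    "ref_2": {43: 20.0, 77: 100.0, 105: 60.0},
}
unknown = {43: 100.0, 58: 40.0, 71: 10.0}
rank = rank_of_correct(unknown, library, "ref_1", dot_product_similarity)
```

To rank with one of the distance measures instead, its negative can be passed as the similarity, for example `lambda a, b: -euclidean_distance(a, b)`.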

Interpretive Systems

Structure elucidation systems that attempt to derive differentiating
rules regarding information contained in reference spectra are considered
interpretive systems. These systems usually perform a distillation step on the
reference spectral database to derive the most differentiating feature rules.
These rules are then used for the classiﬁcation and/or identiﬁcation of

unknowns.

Unlike pattern matching methods, interpretive systems have the
potential to be much more generally applicable to the identification of true
unknowns. Pattern matching algorithms are limited by the direct
reference/unknown spectral correlations they perform. Interpretive
systems, in contrast, generally focus on the identification of substructural components
from the available spectral information. Where enough substructural
information can be determined by an interpretive system, complete structure
elucidation is possible. The substructure identification rules are developed
for known substructures from a known reference database of molecules.

Since interpretive systems start with the same mass spectral
databases as pattern matching methods, they are susceptible to some of the
same limitations. For example, spectral distortion and skew resulting from
different instruments and tuning values can contaminate substructure
identification rules, and the presence of mixture components can result in the
misapplication of a substructure rule.

The DENDRAL project, which began at Stanford University in 1965, is
one of the most well known and publicized attempts at applying artificial
intelligence to the elucidation of chemical structures. The objective of the
DENDRAL project was to determine molecular structures consistent with the
conventional mass spectrum for an unknown compound. To accomplish this
goal, DENDRAL employed three stages: plan, generate, and test, as
diagrammed in Figure 2.1. In the plan stage, also called Heuristic
DENDRAL [9,10], constraints on the unknown structure are derived based
on peaks in the mass spectrum. Empirical fragmentation rules are then used
to determine which molecular fragments are or are not present in the
unknown. The empirical fragmentation rules are inferred from the mass
spectra of known compounds by Meta-DENDRAL [11]. The determined
molecular fragments are then used in the generate stage of DENDRAL to
generate all possible molecular structures that are consistent with the
supplied constraints. The third stage, test, involves the simulation of a mass
spectrum for each of the candidate molecular structures generated in stage
two. Spectra are simulated using the fragmentation rules of Heuristic
DENDRAL. Due to the systematic search and test methods utilized by
DENDRAL, the system has been shown to be as good as a human expert in
elucidating structures [12]. In many cases, the empirical fragmentation rules
in Heuristic DENDRAL were not enough to provide the complete structure of
an unknown. As a result, DENDRAL also used NMR and IR data to provide
additional structural constraints. In large part, the inability of Heuristic
DENDRAL to provide enough structural constraints can be attributed to the
ambiguities that can often arise when using normal mass spectra for
structure elucidation.

[Flowchart: MS, IR, NMR data -> Plan (Heuristic DENDRAL) -> substructure
constraints and molecular formula -> Generate (GENOA) -> candidate
structures -> Test (mass spectral simulation and ranking of candidate
structures)]

Figure 2.1 The DENDRAL approach to structure elucidation.

The Self-Training Interpretive and Retrieval System (STIRS),
developed by McLafferty and coworkers [3], attempts to identify the major
substructural feature(s) of an unknown by correlation of the conventional
mass spectrum to the best matching library spectra through the use of class
libraries representing some 589 substructures [13]. The correlation technique
consists of the application of a number of nearest neighbor analyses, each of
which is the result of a particular type of spectral abbreviation. The
abbreviations are based on characteristic ions, ion series, and primary and
secondary neutral losses in various mass ranges that are specific for a given
substructure. At the conclusion of the individual searches, the overall match
factors are calculated as the weighted sum of the individual match factors.
The fifteen compounds with the best overall match factors are then examined
for the presence of the 589 substructures. The result is a list of the possible
substructures contained in the unknown and an estimate of the reliability of
each assignment. It has been recommended by McLafferty et al. [3] that
STIRS be used in cases where PBM fails to provide a reliable match.
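The final two steps described above, a weighted sum of individual match factors followed by an examination of the best-matching compounds for each substructure, can be sketched as follows. This is only an illustration of the bookkeeping, not the STIRS implementation: the match factor names, the weights, the reference records, and the use of a simple occurrence fraction as the reliability estimate are all assumptions.

```python
def overall_match(factors, weights):
    """Weighted sum of the individual match factors for one reference compound."""
    return sum(weights[k] * factors[k] for k in weights)

def substructure_frequencies(candidates, weights, top_n=15):
    """Fraction of the top_n best-matching references that contain each substructure."""
    ranked = sorted(candidates,
                    key=lambda c: overall_match(c["factors"], weights),
                    reverse=True)
    top = ranked[:top_n]
    counts = {}
    for c in top:
        for s in c["substructures"]:
            counts[s] = counts.get(s, 0) + 1
    return {s: n / len(top) for s, n in counts.items()}

# Hypothetical match factors from two spectral abbreviations, with equal weights.
weights = {"characteristic_ions": 0.5, "neutral_losses": 0.5}
candidates = [
    {"factors": {"characteristic_ions": 0.9, "neutral_losses": 0.8},
     "substructures": {"phenyl", "carbonyl"}},
    {"factors": {"characteristic_ions": 0.7, "neutral_losses": 0.9},
     "substructures": {"phenyl"}},
    {"factors": {"characteristic_ions": 0.2, "neutral_losses": 0.1},
     "substructures": {"hydroxyl"}},
]
freqs = substructure_frequencies(candidates, weights, top_n=2)
```

With `top_n=2` in this toy example, the phenyl substructure appears in both retained references and the carbonyl substructure in one of them, while the poorly matching third reference is ignored.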

Sasaki and coworkers [14,15] have developed an integrated rule-based
system for structure elucidation which they have called CHEMICS
(Combined Handling of Elucidation Methods for Interpretable Chemical
Structures). This system was capable of utilizing data from IR, proton NMR,
carbon-13 NMR, and MS. The CHEMICS approach utilized correlation tables
to determine the most likely substructures present in an unknown. By
combining the results from each spectroscopic method, complementary
information and confirmatory evidence can help identify or exclude specific
substructures. The utilization of mass spectrometric information in this
system was primarily for the purpose of predicting the molecular formula of
an unknown with substructures being identified from IR and NMR data.
Later versions of CHEMICS did not make use of MS information [16,17].

An expert system for inferring structures of acyclic organic compounds
from their mass spectra has been developed by Sastry and coworkers [18].
The expert system requires, as input, the mass spectrum, molecular formula,
and presence, if known, of any functional groups. Given this information, the
system generates chemically possible structures for the molecular formula
constrained by information obtained from the mass spectrum and any

speciﬁed functional groups.

Recently, Scott [19] has reported the implementation of an expert
system for the identiﬁcation of a target set of toxic organic compounds from
their mass spectra. The system was designed to accommodate low
concentration spectra and to provide some information for mixtures. A
schematic diagram of the system is provided in Figure 2.2. The target classes
for this expert system are the nonhalobenzenes, chlorobenzenes, bromo- and
bromochloroalkanes / alkenes, mono- and dichloroalkanes / alkenes, and tri-,

tetra-, and pentachloroalkanes / alkenes.

In the system designed by Scott, there is a separate molecular weight
estimator, molecular weight filter, and base peak filter for each of the above
five classes. The classification module is used to assign an unknown
spectrum to one of the above classes. The molecular weight estimator for
that particular class then assigns a molecular weight to the spectrum. Molecular
weight exclusion filters are then applied to ensure that the spectrum was not
misclassified [20]; this filter was found to be effective only for molecular
weights below about 200 amu. Next, the base peak exclusion filter is used to
exclude those compounds which were misclassified but passed the molecular
weight exclusion filter. Finally, if a spectrum passes the base peak exclusion
filter, it is passed to the identification module for compound identification.
The identification module for the 75 target compounds relies heavily upon
the accuracy of the molecular weight estimators and base peak data for
unique compound identification. If a spectrum is rejected by the classification
module or the exclusion filters, it is passed to the unknown molecular weight
predictor.

Figure 2.2 Schematic diagram of the classification and identification
expert system designed by Scott [19].

A computer-assisted interpretation system developed at Michigan
State University by Enke and coworkers was the first system to utilize the
added information provided by tandem mass spectrometry (MS/MS). This
system was called the Automated Chemical Structure Elucidation System, or
ACES for short. A schematic of this system is provided in Figure 2.3. One of
the main goals of this system was to generate candidate molecular structures
for unknown compounds [21]. This objective was based on the ability of the
system to determine the molecular formula and the molecular substructures
present in or absent from the molecular structure of an unknown compound
[22]. By using this information as constraints, the system would then
perform an exhaustive generation of candidate structures. In some ways, this
approach was similar to the DENDRAL approach. Indeed, this system
utilized the GENOA module from DENDRAL to perform the structure
generation. The molecular formula constraint to the structure generator
GENOA was provided by the Molecular Formula Generator (MFG)
component of ACES [23].

Figure 2.3 The ACES approach to structure elucidation.

In order for the ACES system to determine the substructures present
or absent in an unknown, it utilized a rule base of substructure identity rules
[24]. The generation and application of these rules was the duty of the MAPS
(Methods for Analyzing Patterns in Spectra) module. The learning mode of
the ACES system is shown with the dashed arrows in Figure 2.3. In this
mode, MS/MS maps for standard compounds of known structure are used by
MAPS to generate substructure identiﬁcation rules. These rules are of two
types: inclusion rules and exclusion rules. Inclusion rules predict the
presence of a particular substructure while exclusion rules predict the
absence of a particular substructure. The solid arrows in Figure 2.3 indicate

the identiﬁcation mode of operation of ACES.


A goal of the ACES project that was never fully implemented was to
integrate ACES with the tandem mass spectrometer control software as
shown in Figure 2.4. The objective of this was to develop a fourth-generation
system such as that described by Wade and Crouch [25]. Such a system
would be capable of identifying ancillary experiments that would potentially
reduce the number of candidate structures and, if desired, perform those

experiments automatically.

Advantages of MS/MS over MS

What made ACES possible was the advent of MS/MS instrumentation.
All of the above MS structure elucidation techniques are limited by their
ability to interpret the spectral patterns that result from the overlapping
fragmentation pathways, including ion rearrangements and neutral losses,
that occur in the mass spectrometer source region. These overlapping
pathways, if not fully understood and rationalized, can easily lead to

incorrect determination of the molecular structure or substructures.

Tandem mass spectral (MS/MS) instruments can be very useful in
deconvoluting the complicated ion formation and fragmentation processes
that occur in an ion source. By selecting a particular precursor ion m/z that
exits the source region and further fragmenting these ions, information
regarding a portion of the molecule can be obtained without interference from
other ions.

[Figure 2.4: proposed integration of ACES with the tandem mass spectrometer
control software.]

Conclusions

While several structure elucidation systems have been developed and
more continue to be developed, only the more basic library search systems
have gained signiﬁcant interest on commercial mass spectrometers. All
commercial mass spectrometers come equipped with a basic library search
package and a reference database of normal mass spectra, but very few
provide tools beyond this level. With this present state of commercial
instrumentation software and the increasing demands on operators for
greater efﬁciency, there exists an ever increasing need for intelligent
software that can help alleviate, if not remove, the burden of structure

elucidation.

In order for such software to be of general applicability, it must be of
the interpretive nature with primary focus on the identiﬁcation of
substructural components. Since individual compounds are deﬁned and
unique but substructures are not, interpretive systems must by deﬁnition be
generalized in their implementation. Often, there is not enough information
generated in a single MS or MS/MS experiment to unequivocally identify or
reconstruct a complete molecular formula. In these situations, software

which can provide substructural information can still be of use to the
operator. It is the goal of this project, as described in later chapters, to address
just such a situation.


References

1. Small, G. W. Anal. Chem. 1987, 59, 535A-546A.

2. McLafferty, F. W.; Hertel, R. H.; Villwock, R. D. Org. Mass Spectrom.
1974, 9, 690.

3. McLafferty, F. W.; Stauffer, D. B. J. Chem. Inf. Comput. Sci. 1985, 25,
245-252.

4. Henneberg, D. Adv. Mass Spectrom. 1980, 8, 1511.

5. Damen, H.; Henneberg, D.; Weimann, B. Anal. Chim. Acta. 1978, 103,
289-302.

6. Domokos, L.; Henneberg, D.; Weimann, B. Anal. Chim. Acta. 1983,
150, 37-44.

7. Henneberg, D.; Weimann, B.; Zalfen, U. Org. Mass Spectrom. 1993, 28,
198-206.

8. Stein, S. E.; Scott, D. R. J. Am. Soc. Mass Spectrom. 1994, 5, 859-866.

9. Lederberg, J.; Sutherland, G. L.; Buchanan, B. G.; Feigenbaum, E. A.;
Robertson, A. V.; Duffield, A. M.; Djerassi, C. J. Am. Chem. Soc. 1969,
91, 2973-2976.

10. Duffield, A. M.; Robertson, A. V.; Djerassi, C.; Buchanan, B. G.;
Sutherland, G. L.; Feigenbaum, E. A.; Lederberg, J. J. Am. Chem. Soc.
1969, 91, 2977-2981.

11. Buchanan, B. G.; Smith, D. H.; White, W. C.; Gritter, R. J.;
Feigenbaum, E. A.; Lederberg, J.; Djerassi, C. J. Am. Chem. Soc. 1976,
98, 6168.

12. Carhart, R. E.; Smith, D. H.; Gray, N. A. B.; Nourse, J. G.; Djerassi, C.
J. Am. Chem. Soc. 1981, 46, 1708.

13. Haraki, K. S.; Venkataraghavan, R.; McLafferty, F. W. Anal. Chem.
1981, 53, 386-392.

14. Sasaki, S.; Abe, H.; Hirota, Y.; Ishida, Y.; Kudo, S.; Ochiai, S.; Saito,
K.; Yamasaki, T. J. Chem. Inf. Comput. Sci. 1978, 18, 211.

15. CHEMICS Users Manual; Laboratory for Chemical Information
Science, Toyohashi University of Technology: Tempaku, Toyohashi,
Japan, 1982.

16. Oshima, T.; Ishida, Y.; Saito, K.; Sasaki, S. Anal. Chim. Acta. 1980,
122, 95-102.

17. Sasaki, S.; Fujiwara, I.; Abe, H.; Yamasaki, T. Anal. Chim. Acta. 1980,
122, 87-94.

18. Sridhar, M. P.; Menon, A. G.; Sastry, P. S. Rapid Commun. Mass
Spectrom. 1991, 5, 206-213.

19. Scott, D. R. Chemomet. Intell. Lab. Syst. 1994, 23, 351-364.

20. Scott, D. R. Anal. Chim. Acta. 1992, 265, 43-54.

21. Hart, K. J.; Enke, C. G. Chemomet. Intell. Lab. Syst. 1990, 8, 293-302.

22. Hart, K. J.; Enke, C. G. In Computer-Enhanced Analytical
Spectroscopy, Jurs, P. C., Ed.; Plenum: New York, 1992; Vol. 3,
Chapter 6.

23. Palmer, P. T. Ph.D. Thesis, Michigan State University, East Lansing,
1988; Chapter 6.

24. Hart, K. J.; Palmer, P. T.; Diedrich, D. L.; Enke, C. G. J. Am. Soc.
Mass Spectrom. 1992, 3, 159-168.

25. Wade, A. P.; Crouch, S. R. Spectroscopy. 1988, 3, 24-31.

CHAPTER 3

Cluster Analysis as an Analytical Tool

Introduction

The timely interpretation of information produced by today’s fast
integrated instrumentation is almost impossible without the aid of
computers. Consequently, there has developed an increasing interest in the
ﬁeld of computer intensive data analysis generally referred to as
multivariate analysis. This general ﬁeld of analysis is commonly called

Chemometrics when applied to the ﬁeld of chemical analysis.

Multivariate Analysis is not a specific method of analysis but refers to
a group of multivariate analytical techniques that can be used to aid the
researcher in evaluating and extracting useful information from data.
Multivariate analysis techniques share the common goal of assisting in the
understanding of relationships between many variables. Consequently, these
techniques are inherently computer intensive in nature. Examples of
multivariate analytical techniques are Multivariate Analysis of Variance,
Principal Component Analysis, Factor Analysis, and Cluster Analysis.

In this research project, cluster analysis has been utilized as the initial
means of extracting and interpreting information from tandem mass spectral
data. Cluster analysis is an exploratory data analysis technique that seeks to
ﬁnd natural groupings amongst multivariate data. One of the strong features
of cluster analysis is that no preconceived assumptions regarding data

classiﬁcation are necessary for cluster analysis to be utilized.

Background

Clustering is a generic term used to describe procedures that divide
objects or observations into groups or clusters based on proﬁle similarity.
Furthermore, cluster analysis is generally considered to be a rudimentary,
exploratory data analysis technique. It is considered rudimentary in part
because its strength lies in its ability to be used to assist in the discovery of
natural groupings or tendencies amongst multivariate data [1]. This
discovery process is further enhanced by the ability of cluster analysis to be
applied without signiﬁcant preconceived notions about the information
represented by the data. Of course, blindly applying any analysis technique
and assuming the results are useful and correct is usually catastrophic.
Nevertheless, successful applications of cluster analysis can be obtained with

only limited understanding of the true data structure.


In order to utilize cluster analysis techniques, it is ﬁrst necessary to
understand what is a “natural grouping” or “cluster”. In a general sense, a
cluster can be deﬁned as a grouping of two or more objects that have one or
more characteristics in common. Where more than one cluster is formed, it is
also generally accepted that the members of one cluster should have more in
common with each other than with the members of another cluster. This
means that the intra-cluster distance or dispersion should be less than the
inter-cluster distance or separation. Intra-cluster and inter-cluster distances
are also referred to in the literature as within- and between-cluster distances
respectively. While this deﬁnition seems simple enough on the surface, real-
world applications of cluster analysis are often more complicated due to
irregularities in cluster shapes [2]. For example, Figure 3.1 shows two plots
containing clusters. Figure 3.1(a) appears to be two distinct clusters well
separated from each other. Figure 3.1(a) ﬁts well into the simple deﬁnition of
a cluster or clusters. Obviously the members or objects in each cluster have
something in common that they do not share with members of the other
cluster. Figure 3.1(b), on the other hand, shows what appears to be two
clusters that are overlapping each other. The problem now becomes one of
separating the two clusters, if possible, and if in fact they are two distinct

clusters.
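The requirement that intra-cluster dispersion be smaller than inter-cluster separation can be checked numerically. The following is a minimal sketch under stated assumptions: dispersion is taken as the mean pairwise Euclidean distance within a cluster and separation as the distance between cluster centroids, which is only one of several possible choices, and the two small clusters are invented to resemble the well-separated case of Figure 3.1(a).

```python
import math
from itertools import combinations

def mean_intra_cluster_distance(points):
    """Average Euclidean distance over all pairs of points within one cluster."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def centroid(points):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def inter_cluster_distance(a, b):
    """Euclidean distance between the centroids of two clusters."""
    return math.dist(centroid(a), centroid(b))

# Hypothetical two-dimensional observations (x1, x2).
cluster_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

dispersion_a = mean_intra_cluster_distance(cluster_a)
dispersion_b = mean_intra_cluster_distance(cluster_b)
separation = inter_cluster_distance(cluster_a, cluster_b)
well_separated = max(dispersion_a, dispersion_b) < separation
```

For overlapping clusters such as those of Figure 3.1(b), the same comparison would give a separation comparable to the dispersions, which is one way of seeing why such cases are difficult to resolve.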

Figure 3.1 Examples of cluster structures: (a) well-separated clusters; (b)
touching or overlapping clusters. The axes x1 and x2,
representing the two clustering dimensions, are generic in this
example but could represent such things as relative intensities
for two m/z values or infrared intensities at two vibrational
frequencies.


Therefore, it is appropriate to expand the simple definition of a cluster
presented previously to account for the presence of other clusters. A
more exact definition would be that a cluster is a
grouping of two or more objects that have one or more characteristics in
common, although, given the information available, it may not be possible to
distinguish between two or more clusters. The clusters themselves may also have
complex shapes. Figure 3.2 shows some of the possible complex shapes that
clusters may have.

Lance and Williams [3] divide classification problems into the categories
shown in Figure 3.3, which include cluster analysis techniques. A detailed
description of each of these categories with particular reference to cluster
analysis is given by Jain and Dubes [4]. The two broadest categories are
non-exclusive and exclusive. In non-exclusive clustering, groups are allowed to
overlap and every object is assigned a "degree of belongingness" or "degree of
membership" for each cluster. In exclusive clustering techniques, a given
object can belong to one and only one cluster or group. Exclusive clustering
techniques are further divided into extrinsic and intrinsic methods. Intrinsic
methods, also called unsupervised learning, group samples solely on the
basis of a proximity matrix or similarity matrix. In contrast, extrinsic
methods also rely on the researcher to supply category labels for the objects
being clustered. Extrinsic methods then attempt to establish a discriminant
surface that separates the objects according to their category. Intrinsic
methods are further separated into hierarchical and partitional clustering
methods in Figure 3.3. Hierarchical and partitional clustering will be
described in more detail later in this chapter.

Figure 3.2 Four examples of complex cluster shapes: (a) spherical,
(b) elongated, (c) linearly inseparable, and (d) dense clusters.

[Tree diagram: Classifications -> Non-Exclusive (Overlapping) or Exclusive;
Exclusive -> Extrinsic (Supervised) or Intrinsic (Unsupervised);
Intrinsic -> Hierarchical or Partitional]

Figure 3.3 Categories of classification types.

The remainder of this chapter is dedicated to providing a background
on the methods, assumptions, and some chemical applications of cluster
analysis techniques. Since cluster analysis is such a broad area by itself, only
the more common methods of cluster analysis, particularly those that have

been utilized in this research, will be discussed.

Similarity Calculation Methods

In order for cluster analysis to be performed, it is necessary to quantify
the similarity between observations. Usually, this quantiﬁcation is done on
the basis of distance between observations in a multidimensional space [2].
In this section, I will describe the more common similarity (or distance)
metrics. There currently is no clear deﬁnition of similarity in the cluster
analysis literature [5]. However, similarity is often taken to be a scaled form
of the inter-object distance with values in the range of zero, for no similarity,
to one, for identical values. If the distance value is scaled to a value from zero
to one, where unity represents the maximum distance possible between
two objects k and l, then the similarity is calculated as

$$S_{kl} = 1 - d_{kl} \tag{3.1}$$

Euclidean Distance

The Euclidean distance metric is by far the most common method for
determining the distance between two objects. The two-dimensional
Euclidean distance between two points k and l is given by

$$D_{kl} = \sqrt{\sum_{j=1}^{2} (x_{kj} - x_{lj})^2} \tag{3.2}$$

For multivariate observations, this equation can be generalized to n
dimensions:

$$D_{kl} = \sqrt{\sum_{j=1}^{n} (x_{kj} - x_{lj})^2} \tag{3.3}$$

The most important requirement of the Euclidean distance metric is that
the dimensions be orthogonal; otherwise the calculated distance is not
accurate. In many situations, this requirement is not met because of
cross-correlation between variables. A graphical representation of the
Euclidean distance metric is given in Figure 3.4.
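As an illustration of Equations 3.1 and 3.3, the following is a minimal Python sketch rather than code from this research. The function names and the example points are hypothetical. The sketch also assumes that the "maximum distance possible" used for scaling is the largest distance actually observed in the data set, which is only one of several reasonable choices.

```python
import math


def euclidean_distance(x_k, x_l):
    """Equation 3.3: Euclidean distance between two n-dimensional observations."""
    if len(x_k) != len(x_l):
        raise ValueError("observations must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_k, x_l)))


def similarity_matrix(observations):
    """Equation 3.1: S_kl = 1 - d_kl, with each distance scaled to [0, 1]."""
    n = len(observations)
    dist = [[euclidean_distance(observations[k], observations[l]) for l in range(n)]
            for k in range(n)]
    # Assumption: the "maximum distance possible" is taken to be the largest
    # distance actually observed in the data set.
    d_max = max(max(row) for row in dist)
    if d_max == 0.0:
        return [[1.0] * n for _ in range(n)]  # all observations are identical
    return [[1.0 - d / d_max for d in row] for row in dist]


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # made-up observations
print(euclidean_distance(points[0], points[1]))  # 5.0
print(similarity_matrix(points)[0])              # [1.0, 0.5, 0.0]
```

With this scaling, the two most distant observations always receive a similarity of zero, so similarities computed from different data sets are not directly comparable.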

Figure 3.4 A graphical representation of the Euclidean distance metric,
$d = \sqrt{(x_b - x_a)^2 + (y_b - y_a)^2}$.

Minkowski Distance

The Minkowski distance metric between two objects k and l is given by

$$D_r(k,l) = \left( \sum_{j=1}^{n} |x_{kj} - x_{lj}|^r \right)^{1/r} \tag{3.4}$$

where r is larger than or equal to one. By comparison of Equation 3.4 with
Equation 3.3, the Minkowski distance metric is a generalized form of the
Euclidean distance metric. For the case where r = 2, the Minkowski distance
reduces to the Euclidean distance. For r = 1, Equation 3.4 reduces to

$$D_1(k,l) = \sum_{j=1}^{n} |x_{kj} - x_{lj}| \tag{3.5}$$

This resulting distance metric is commonly called the Manhattan, city-block,
or taxi distance. It is mainly used in two-dimensional cases, although the
Euclidean distance is more commonly used [6].
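The relationship between Equations 3.3, 3.4, and 3.5 can be checked numerically. The short Python sketch below is illustrative only; the function name and the two example points are hypothetical.

```python
def minkowski_distance(x_k, x_l, r):
    """Equation 3.4: Minkowski distance of order r (r >= 1)."""
    if r < 1:
        raise ValueError("r must be greater than or equal to one")
    return sum(abs(a - b) ** r for a, b in zip(x_k, x_l)) ** (1.0 / r)


a, b = (1.0, 2.0), (4.0, 6.0)            # made-up observations
manhattan = minkowski_distance(a, b, 1)  # r = 1: |3| + |4| (Equation 3.5)
euclidean = minkowski_distance(a, b, 2)  # r = 2: sqrt(9 + 16) (Equation 3.3)
print(manhattan, euclidean)
```

For these two points the city-block distance is 7 and the Euclidean distance is 5, since the city-block path can never be shorter than the straight line.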

Mahalanobis Distance

The Mahalanobis distance is sometimes used as an alternative to the
Euclidean distance metric. The Mahalanobis distance between two points k

and l is given by the general equation

$$D_{kl} = \sqrt{(x_k - x_l)^T C^{-1} (x_k - x_l)} \tag{3.6}$$

where C is the covariance matrix. The advantage of the Mahalanobis
distance is that it is scale-invariant and takes account of correlations among
the variables. If C is replaced by a diagonal matrix of the squared standard
deviations (variances), one obtains the auto-scaled Euclidean distance [5].
Likewise, if C is replaced by a diagonal matrix of the squared ranges, one
obtains the range-scaled Euclidean distance [5]. If within the clusters the
variables are standardized and uncorrelated, then the Mahalanobis and
Euclidean distances are the same since the covariance matrix term, $C^{-1}$,
drops out.
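Equation 3.6 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in this research: the helper names are hypothetical, the covariance matrix uses the sample (n - 1) denominator, and the product with the inverse covariance matrix is obtained by solving a linear system rather than by forming the inverse explicitly.

```python
import math


def covariance_matrix(data):
    """Sample covariance matrix (n - 1 denominator) of a list of observations."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
             for j in range(p)] for i in range(p)]


def solve(matrix, rhs):
    """Solve matrix * y = rhs by Gaussian elimination with partial pivoting."""
    p = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, p):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, p + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * p
    for i in reversed(range(p)):
        y[i] = (aug[i][p] - sum(aug[i][j] * y[j] for j in range(i + 1, p))) / aug[i][i]
    return y


def mahalanobis_distance(x_k, x_l, cov):
    """Equation 3.6: sqrt((x_k - x_l)^T C^-1 (x_k - x_l))."""
    diff = [a - b for a, b in zip(x_k, x_l)]
    y = solve(cov, diff)  # y = C^-1 (x_k - x_l)
    return math.sqrt(sum(d * v for d, v in zip(diff, y)))


data = [(2.0, 1.0), (4.0, 3.0), (6.0, 2.0), (8.0, 6.0)]  # made-up observations
cov = covariance_matrix(data)
print(mahalanobis_distance(data[0], data[3], cov))
# With an identity covariance matrix the result is the Euclidean distance.
print(mahalanobis_distance((0.0, 0.0), (3.0, 4.0), [[1.0, 0.0], [0.0, 1.0]]))  # 5.0
```

Passing a diagonal matrix of variances as `cov` gives the auto-scaled Euclidean distance, which is a convenient way to check the function against a hand calculation.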

Hierarchical Clustering Methods

Hierarchical methods of cluster analysis are the most commonly used
today. In particular, hierarchical clustering is widely used in the biological,
social, and behavioral sciences because of the need to construct
taxonomies [4]. There are three main types of hierarchical clustering that I
will describe here: single linkage (or nearest neighbor linkage), complete
linkage (or furthest neighbor linkage), and average linkage. The general
approach in hierarchical methods is to link, or join, observations based on
their similarities. This is accomplished using one of the similarity or distance

metrics described previously.

Hierarchical clustering methods, unlike the partitional and density
clustering methods described later, do not directly assign observations to a
particular cluster. Rather, hierarchical clustering methods construct
dendrograms that show the relationships between the observations. It is left
to the researcher to determine which clusters, if any, are represented by the
dendrogram and where they lie. An example of a dendrogram is provided in
Figure 3.5. On the vertical axis of the dendrogram are the objects being
clustered. On the horizontal axis is the similarity index, with a limiting value
of 1.0 or 100% on the left and a limiting value of 0.0 or 0% on the right. The
lengths of the horizontal line segments represent the dissimilarities between
objects. The join points, represented by the vertical line segments, represent
the similarities between objects as measured on the similarity index scale.
The closer to the left two objects join, the more similar are their
representative values.

An advantage of hierarchical clustering is that it provides, in a readily
interpretable form, the most similar observations as well as inter-group
relationships formed by the branch points between dissimilar objects or
groups. A disadvantage of hierarchical clustering techniques is that the
resulting dendrograms can become difficult to interpret cleanly with very
large data sets [7].

Figure 3.5 An example of a dendrogram. The horizontal axis is the similarity
index, scaled from 0.0 to 1.0.

Single Linkage Clustering

Before single linkage joining can be performed, it is first necessary to
calculate a similarity or distance matrix by one of the similarity calculation
methods described previously, or by some other method if desired. Single
linkage joining is then accomplished by searching the distance matrix for the
lowest value. The two observations that match this value are joined at their
respective distance (similarity) level. This step is then repeated for the next
lowest distance value and so on. If an observation for a selected distance
value is already joined to other observations, a branch point is formed and
the new observation is joined onto the existing group. This process continues
until all the observations have been joined.
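The joining procedure described above can be sketched in Python as follows. This is a minimal illustration, not the program used in this research; the function name, the bookkeeping structures, and the small distance matrix are all hypothetical.

```python
from itertools import combinations


def single_linkage(distance):
    """Return the joins [(level, group_a, group_b), ...] in the order they form.

    `distance` is a symmetric matrix (list of lists) of pairwise distances.
    """
    n = len(distance)
    group_of = list(range(n))             # current group label of each observation
    members = {i: [i] for i in range(n)}  # observations belonging to each group
    joins = []
    # Search the distance matrix from the lowest value upward.
    pairs = sorted(combinations(range(n), 2), key=lambda p: distance[p[0]][p[1]])
    for k, l in pairs:
        a, b = group_of[k], group_of[l]
        if a == b:
            continue  # already joined through an earlier, lower distance
        joins.append((distance[k][l], sorted(members[a]), sorted(members[b])))
        for obs in members[b]:
            group_of[obs] = a
        members[a].extend(members.pop(b))
        if len(joins) == n - 1:
            break  # all observations have been joined
    return joins


distance = [  # made-up distances between four observations
    [0.0, 0.2, 0.5, 0.9],
    [0.2, 0.0, 0.4, 0.8],
    [0.5, 0.4, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
]
for level, group_a, group_b in single_linkage(distance):
    print(level, group_a, group_b)
# 0.2 [0] [1]
# 0.3 [2] [3]
# 0.4 [0, 1] [2, 3]
```

Each returned join corresponds to one branch point of the dendrogram, with the join level giving its position on the distance axis.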

An example of single linkage clustering is presented in Figure 3.6 for a
selection of microorganisms analyzed by phospholipid profiling. The details of
the phospholipid profiling technique are presented in Chapter 4. Since
dendrograms are a display technique, it is left to the researcher to interpret
the results. In the case of Figure 3.6, one might choose to consider samples
74, 36, 80, and 78 as one group, samples 64, 51, 66, and 50 as another group,
and the remaining three samples as a third group. The members of each
group then have more in common with each other than with other samples
or groups. However, this particular interpretation is only valid if samples with
a similarity of 0.65 (65%) or more can be considered a group. If the threshold
similarity for a group is set at 0.8, then only samples 66 and 50 form a group
and samples 69, 24, and 94 form another. All other samples form
single-member groups, as no other sample is significantly similar. Likewise, if
the threshold similarity were set at 0.4 or lower, then all the samples would
constitute one group.

Figure 3.6 Single linkage clustering of a selection of microorganisms
analyzed by phospholipid profiling.
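The dependence of the groups on the chosen threshold similarity can be illustrated with a short Python sketch. For single linkage, the groups at a given threshold are simply the sets of observations connected by chains of pairwise similarities at or above that threshold. The function name, the labels, and the similarity values below are hypothetical and are not the phospholipid data of Figure 3.6.

```python
def groups_at_threshold(similarity, labels, threshold):
    """Single linkage groups formed by pairs with similarity >= threshold."""
    n = len(labels)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for k in range(n):
        for l in range(k + 1, n):
            if similarity[k][l] >= threshold:
                parent[find(k)] = find(l)  # join the two groups
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(labels[i])
    return sorted(groups.values())


labels = ["A", "B", "C", "D"]  # made-up sample labels
similarity = [                 # made-up similarity matrix
    [1.0, 0.9, 0.7, 0.3],
    [0.9, 1.0, 0.6, 0.2],
    [0.7, 0.6, 1.0, 0.35],
    [0.3, 0.2, 0.35, 1.0],
]
print(groups_at_threshold(similarity, labels, 0.8))   # [['A', 'B'], ['C'], ['D']]
print(groups_at_threshold(similarity, labels, 0.65))  # [['A', 'B', 'C'], ['D']]
print(groups_at_threshold(similarity, labels, 0.3))   # [['A', 'B', 'C', 'D']]
```

Lowering the threshold can only merge groups, never split them, which is why a sufficiently low threshold always yields a single group.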

Because of the manner by which the distance matrix is searched, and
joins are formed, single linkage clustering tends to overestimate the degree of
similarity between successive observations at branch points. The reason for
this is that new observations are joined to previously joined observations
based on the lowest distance between the new observation and any
observation in the previously joined group. This means that every other
observation in this joined group has a greater distance from and therefore
less in common with the new observation than the branch point indicates.
Nevertheless, single linkage clustering is the most common method of

hierarchical clustering due in large part to its ease of implementation.

Complete Linkage Clustering

Complete linkage clustering is performed in the same manner as
single linkage clustering with one exception. While single linkage clustering
joins new observations onto previously joined observations based on the
lowest distance between the new observation and any of the joined
observations, complete linkage clustering does the opposite: it joins new
observations to previously joined observations at the highest distance
between the new observation and any of the joined observations. As a
result, while single linkage clustering tends to overestimate the similarity
between observations, complete linkage clustering tends to underestimate it.

An example of complete linkage clustering is presented in Figure 3.7
for a selection of microorganisms analyzed by phospholipid profiling, the
details of which are presented in Chapter 4. Figure 3.7 may be interpreted in
much the same manner as Figure 3.6 in the previous section. However, it
becomes readily apparent that the complete linkage algorithm significantly
reduces the similarities at the join positions for clusters of three or more
members in this case. For example, at a similarity of 0.65, sample 74 would
no longer belong in the same group as samples 36, 80, and 78. At a similarity
value of 0.8, only samples 66 and 50 form a group and samples 24 and 94
form another. At a similarity of 0.4, the same three groups are formed that
were formed at a similarity of 0.6 for the single linkage example in
Figure 3.6.

Figure 3.7 Complete linkage clustering of the same microorganisms
presented in Figure 3.6, analyzed by phospholipid profiling.

Average Linkage Clustering

Average linkage clustering represents the mid-point between single
linkage clustering and complete linkage clustering. With average linkage
clustering, the algorithm is the same as single linkage clustering in that the
distance matrix is searched for the lowest distance values when joining
observations. However, once two or more observations are joined, a mean
observation is calculated, which replaces the joined observations at each
cycle. Figure 3.8 demonstrates the differences in distance measures between

single, complete, and average linkage methods.
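The three join distances compared in Figure 3.8 can be computed with the following Python sketch. It is illustrative only, with hypothetical function names and sample coordinates. The "average" value follows the description above, in which a mean observation replaces the joined observations; many software packages instead define average linkage as the mean of all pairwise distances and refer to the mean-observation variant as the centroid method.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two observations of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def join_distances(group, new_obs):
    """Distances at which new_obs would join an existing group (cf. Figure 3.8)."""
    to_members = [euclidean(member, new_obs) for member in group]
    # Mean observation that replaces the joined observations.
    mean_obs = [sum(coords) / len(group) for coords in zip(*group)]
    return {
        "single": min(to_members),    # nearest member of the group
        "complete": max(to_members),  # furthest member of the group
        "average": euclidean(mean_obs, new_obs),
    }


# Made-up coordinates: samples 1 and 2 are already joined, sample 3 is new.
sample_1, sample_2, sample_3 = (0.0, 0.0), (2.0, 0.0), (5.0, 0.0)
d = join_distances([sample_1, sample_2], sample_3)
print(d)  # single = 3.0, complete = 5.0, average = 4.0
```

In this example the average join distance lies between the single and complete linkage values, which is the behavior the figure is meant to convey.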

For comparison purposes, the microorganism phospholipid profiles
clustered by single and complete linkage clustering in Figure 3.6 and
Figure 3.7, respectively, are also clustered by average linkage and presented
in Figure 3.9. By comparison of Figure 3.9 with Figure 3.6 and Figure 3.7, it
becomes readily apparent how the average linkage method represents the
midpoint between single and complete linkage clustering. If the
measurements taken on the samples and used in the clustering are taken to
represent normal distributions, then it is more statistically valid to cluster
using average linkage. Because the single, complete, and average linkage
methods differ only in how three or more samples are joined, they always join
the first two samples in a branch at the same similarity. For example, samples
80 and 78, 64 and 51, 66 and 50, and 24 and 94 are joined to each other at
the same similarity in all three cases. Although rearrangement of the sample
order is not uncommon, none occurred with these example data. As a
result, it is possible to use any of these three linkage methods with these
data and still come to the same conclusions, albeit at different similarity
levels, regarding the data structure.

Figure 3.8 The relative distance measures for single, complete, and average
linkage methods. Samples 1 and 2 would initially be joined at
distance d1. Sample 3 would join at d_single, d_complete, or d_average,
respectively.

Figure 3.9 Average linkage clustering of the same microorganisms
presented in Figure 3.6 and Figure 3.7, analyzed by phospholipid profiling.

Average linkage clustering is more computationally expensive than
single or complete linkage clustering because it is necessary to recalculate
portions of the distance matrix after every join to take into account the join
mean observation. However, average linkage clustering does tend to provide
a more accurate measure of observation similarity on the resulting

dendrogram.

Partitional Clustering Methods

Partitional methods of cluster analysis are used to classify samples
into distinct groups without the intervention of the researcher. Unlike the
hierarchical methods, observations are assigned to speciﬁc clusters. It then
falls to the researcher to evaluate and validate the correctness of the clusters
formed. However, it is not necessary for the researcher to decide where the
separation exists between clusters. As a result, partitional methods of cluster
analysis are generally less susceptible, though certainly not immune, to

researcher bias. Partitional clustering methods are more popular in the
engineering sciences, where single partitions are important, than in the
biological sciences [4].

While there are several types of partitional clustering methods in the
literature, I will describe only the K-means method. As a general
scheme, most partitional methods rely on maximizing some between-group
distance function or minimizing generalized variance [8].

K-Means Clustering

The K-means clustering algorithm is one of the most popular of the
partitional clustering methods. This method relies on maximizing the
between-group Euclidean distance while minimizing the within-group
distances. The K-means method is described in detail with FORTRAN code

by Dr. John Hartigan [8] and by Johnson and Wichern [9].

The K-means method is composed of four main steps. In Step 1, the
observations are partitioned into K initial clusters and initial cluster
centroids (or cluster means) are calculated. In Step 2, the observations are all
reassigned to the cluster whose centroid is nearest. The new cluster
centroids, after observation reassignment, are calculated in Step 3. In Step 4,
Steps 2 and 3 are repeated until no more reassignments of observations
occur.
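The four steps above can be sketched in Python as follows. This is a minimal illustration and not the FORTRAN code of Hartigan [8]. The function name, the round-robin initial partition, the iteration limit, and the example observations are assumptions made only for this sketch.

```python
import math


def k_means(observations, k, max_iter=100):
    """Four-step K-means sketch; returns (assignments, centroids)."""
    n = len(observations)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of observations")
    # Step 1: partition the observations into K initial clusters
    # (assumption: a simple round-robin partition).
    assignments = [i % k for i in range(n)]
    centroids = [None] * k
    for _ in range(max_iter):
        # Steps 1 and 3: calculate the centroid (mean) of each cluster.
        for c in range(k):
            members = [obs for obs, a in zip(observations, assignments) if a == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        # Step 2: reassign every observation to the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(obs, centroids[c]))
            for obs in observations
        ]
        # Step 4: repeat until no more reassignments occur.
        if new_assignments == assignments:
            break
        assignments = new_assignments
    return assignments, centroids


observations = [  # made-up two-dimensional observations
    (1.0, 1.0), (1.5, 2.0), (9.0, 9.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5),
]
assignments, centroids = k_means(observations, 2)
print(assignments)  # [0, 0, 1, 1, 0, 1]
print(centroids)
```

The final partition generally depends on the initial partition chosen in Step 1, so in practice it is common to repeat the procedure from several starting partitions.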


Probability-based and Density-based Clustering

The previously mentioned clustering methods all rely on distance or
similarity as a means of dividing observations into clusters. However, almost
any means by which observations can be divided into groups can be utilized
in cluster analysis [5]. In Figure 3.1, it is readily apparent that a cluster
implies a greater concentration or density of observations in a given region.
In this section, I will describe a few clustering methods that seek
to form groups based on density or distributional assumptions.

Kernel Density Clustering

Massart and Kaufman [10] describe the kernel density clustering
method as CLUPLOT. In this clustering approach, each observation has
associated with it a kernel or potential density function, often Gaussian,
which can be considered to measure the influence exerted by the observation
on the other observations in the data set. The density function need not be
Gaussian but can be any function that measures the density of an object at
another point as a function of the distance between them. At each iteration,
the observation with the greatest average density is chosen as the new
cluster center. Unallocated observations are assigned to a cluster if the
influence exerted on them by the cluster is greater than some threshold.
Kernel density clustering can be modified to permit overlapping clusters and
generally provides more information than standard clustering approaches.
Like the fuzzy clustering method described below, the kernel density
clustering method is computationally expensive.

Fuzzy Clustering

At present, there exist in the literature only a few fuzzy clustering
implementations; it is not yet a commonly used clustering method [5].
However, I will describe fuzzy clustering here because it has significant
advantages over traditional clustering methods and is likely to become
more commonly used in the future, even though it is computationally
expensive.

Fuzzy clustering is fundamentally different from all clustering methods
mentioned previously. In 1965, Dr. Lotfi Zadeh [11] introduced the concept of
fuzzy sets by significantly extending the principles of set theory. Set theory is
based on bimodal assumptions about elements in a set [12]. That is, each
element either belongs to a particular set or does not. Fuzzy set theory is
based on the premise that an element does not belong totally to a set but has
a “degree of belongingness” to each set. Consequently, under fuzzy set theory,
an element X may be described as belonging 70% to set A, 25% to set B, 5% to
set C, and 0% to all other sets. Indeed, it is not even necessary that an
element's memberships in the sets in question sum to exactly 100% [13].

By comparison, traditional cluster analysis, as described previously, is
much like set theory: observations (or elements) either belong to a cluster (or
set) or they do not. This exclusive idea in cluster analysis works well for
cleanly separated clusters. However, exclusive cluster analysis does not
handle overlapped clusters very efficiently. Furthermore, exclusive clustering
methods provide at best a minimal amount of information, through
inter-cluster distances, regarding commonality of features between clusters
[14]. The degree of cluster membership that fuzzy clustering assigns to each
element provides much more information in this regard.

Current implementations of fuzzy clustering operate to some degree
like partitional clustering methods [5]. The number of clusters to be found is
speciﬁed by the researcher and each observation is then assigned a degree of
membership for each competing cluster. Somewhat like K-means clustering,
observations are moved between clusters until the set of memberships for
each observation to each cluster is optimal. That is, they minimize some

criterion deﬁned by the clustering implementation.

Fuzzy clustering can also be useful for identifying outliers or those
observations positioned between clusters, since the membership values
provide some indication of the likelihood with which observations belong to
competing clusters.
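To make the idea of a degree of membership concrete, the Python sketch below computes memberships for one observation given fixed cluster centers. It assumes the membership formula of the widely used fuzzy c-means formulation with a fuzziness exponent m = 2; the text above does not commit to any particular implementation, so the formula, the function name, and the example centers are assumptions. In this formulation the memberships are normalized to sum to one, although, as noted earlier, that is not a requirement of fuzzy set theory in general.

```python
import math


def fuzzy_memberships(obs, centers, m=2.0):
    """Degree of membership of one observation in each cluster (sums to 1)."""
    dists = [math.dist(obs, c) for c in centers]
    if 0.0 in dists:  # the observation coincides with one or more centers
        hits = dists.count(0.0)
        return [1.0 / hits if d == 0.0 else 0.0 for d in dists]
    exponent = 2.0 / (m - 1.0)
    # Assumed fuzzy c-means form: u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1))
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in dists) for d_i in dists]


centers = [(0.0, 0.0), (10.0, 0.0)]             # made-up cluster centers
print(fuzzy_memberships((1.0, 0.0), centers))   # mostly the first cluster
print(fuzzy_memberships((5.0, 0.0), centers))   # [0.5, 0.5]
```

An observation midway between two centers receives equal memberships in both, which is the kind of signal that can flag observations positioned between clusters.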


Application of Cluster Analysis to Problems of Chemical Signiﬁcance

In recent years, interest in multivariate analysis techniques for
interpreting chemical information has increased signiﬁcantly. This is due in
large part to advances in computers and the large quantities of multivariate

information produced by modern chemical analysis instrumentation.

At present, there are several hundred articles referencing the use of
cluster analysis to aid data interpretation in the chemical literature. These
references span the full spectrum of chemical science. The main underlying

theme of all these articles is the multivariate nature of the data investigated.

For example, Frapporti et al. [15] report using fuzzy cluster analysis
in the analysis of the hydrogeochemistry of shallow Dutch groundwater
samples collected from 350 sites in the national groundwater quality
monitoring network. Haswell and Walmsley [16] describe using hierarchical
clustering to assist in the selection of sensors for an array device for the
detection of analytes in the vapor phase. Likewise, Bondarenko et al. [17]
report using hierarchical cluster analysis in the classification of Lake Baikal
(Siberia) aerosol particle samples using electron probe X-ray microanalysis.
Vogt et al. [18] report using cluster analysis to facilitate the interpretation
and the formulation of diagnostic models for the application of chemical tests
in clinical chemical diagnosis. Their approach uses hierarchical clustering by
Ward's method [4], which is based on the minimization of the square error
popularized in analysis-of-variance, to reduce data while increasing
utilizable information.

In the area of biochemistry, Alexandrov and Go [19] report using hierarchical
clustering for the purpose of interpreting the biological meaning, statistical
significance, and classification of local spatial similarities in nonhomologous
proteins. Sowdhamini and Blundell [20] describe a method using cluster
analysis to reduce the complexity of performing protein searches in the
Brookhaven Protein Data Bank by establishing protein domains. These
domains are considered to be clusters of secondary protein structure
elements.

The use of cluster analysis has also been reported by Livingstone [21]
for molecular modeling and drug design and by Chu [22] for drug
pharmacological activity. Likewise, several articles have been written on the
use of cluster analysis to determine chemical structure-biological activity
relationships [23, 24, 25]. Jurs and Lawson [23] describe the use of a modified
Hopkins statistic, which is intended to assess whether or not a given data set
differs from a set of uniform random numbers. They further present an
example using this statistic in a structure-toxicity relationship investigation
of 143 acrylate compounds. Darwish et al. [24] compare principal component
analysis and cluster analysis in quantitative structure-activity relationships
(QSAR). Specifically, they examined the inhibitory effect of some
benzothiazole pesticide derivatives on the bacterium Bacillus subtilis and the
fungi Aspergillus niger, Helminthosporium sativum, and Fusarium
graminearum. They found that cluster analysis required less computational
time but gave results commensurable with principal component analysis.
Novellino et al. [25] present the use of clustering the principal components
generated from comparative molecular field analysis (CoMFA). This yields an
approach to drug series design that produces a relatively small set of series
structures that provides nearly the same amount of information as a much
larger data set of structural analogs.

With a similar objective to that described previously by Sowdhamini
and Blundell [20], Domokos et al. [26] report a procedure to reduce mass
spectrometric library search times by presearching. The presearch is made on
a reduced mass spectrometric library where large groups of compounds
exhibiting similar spectral features are replaced by their respective
prototypes. This approach signiﬁcantly reduces the number of spectra used

in the subsequent search.

While the above list of applications of cluster analysis to chemical
problems is certainly not comprehensive, it hopefully serves to indicate the

diversity of tasks to which cluster analysis can be applied.


Application of Cluster Analysis to Mass Spectral Data

Using cluster analysis with normal mass spectral data as a means of
characterizing compounds generally fails to provide useful information
because the grouping is usually based on an undeﬁned mass-to-charge
correlation. For example, the clustering of compounds whose spectra contain
m/z 18, which could be either H2O+ or NH4+, might place these compounds in
the same group. Since the m/z 18 ions are not identical, any grouping which
places these ions in the same cluster would be incorrect. Consequently, any
assumptions that the two compounds must have something in common would
also be incorrect. Therefore, one cannot make reliable feature-substructure
correlations by clustering normal mass spectra. Perhaps the only exception is
the use of cluster analysis as an inefﬁcient means of library searching. This
search process would be based largely on the assumption that each unique

compound in the library database forms its own group.

However, the difficulty or failure of cluster analysis with normal mass
spectral data does not apply to tandem mass spectral data, provided that the
spectra clustered are derived from the same precursor m/z value. Product
spectra can be directly correlated with the substructure of the precursor from
which they were derived. Hence, the groupings in such a case are based on a
limited set of predefined substructures. Cluster analysis of MS/MS spectra
can, therefore, be used as a reliable means of formulating
feature-substructure correlations. The use of cluster analysis with tandem
mass spectral data will be explained further in Chapter 6.

Conclusions

In this chapter, I have portrayed cluster analysis as a relatively non-
biased method for determining the natural clustering or grouping tendencies
amongst data. While such levels of perfection are desirable, this is not strictly
correct. So long as the researcher has any choice or control over clustering
parameters there exists the possibility for personal bias. Also, of all the
clustering methodologies described in this chapter, none is suitable for all
possible cases [4]. This is due in part to the underlying assumptions each
clustering method makes regarding the data being analyzed. We have found
that this is not a problem with mass spectral data, however. Indeed, the
above concerns are not a signiﬁcant problem in most practical applications of
cluster analysis with “real” data because the researcher usually knows
enough about the data to distinguish “good” groupings from “bad” groupings.
Consequently, it is often necessary to investigate the data space with known
test samples before relying on a particular clustering algorithm for

unknowns.


References

1. Johnson, R. A.; Wichern, D. W. Applied Multivariate Statistical
Analysis; Prentice Hall: Englewood Cliffs, 1992; Chapter 12.

2. Everitt, B. S. Cluster Analysis; Halsted: New York, 1993; Chapter 1.

3. Lance, G. N.; Williams, W. T. Computer Journal 1967, 10, 271-277.

4. Jain, A. K.; Dubes, R. C. Algorithms for Clustering Data; Prentice
Hall: Englewood Cliffs, 1988; Chapter 3.

5. Bratchell, N. Chemom. Intell. Lab. Syst. 1989, 6, 105-125.

6. Massart, D. L.; Kaufman, L. Interpretation of Analytical Chemical
Data by the Use of Cluster Analysis; John Wiley and Sons: New York,
1983; Chapter 2.

7. Hodes, L.; Feldman, A. J. Chem. Inf. Comput. Sci. 1991, 31, 347-350.

8. Hartigan, J. A. Clustering Algorithms; John Wiley and Sons: New
York, 1975; Chapter 4.

9. Johnson, R. A.; Wichern, D. W. op. cit.; 597-602.

10. Massart, D. L.; Kaufman, L. op. cit.; Chapter 4.

11. Zadeh, L. A. Fuzzy Sets, Information and Control 1965, 8, 338-353.

12. Otto, M. Anal. Chim. Acta 1990, 235, 169-175.

13. Bezdek, J. C.; Dunn, J. C. IEEE Transactions on Computers 1975,
August, 835-838.

14. Ruspini, E. H. Information and Control 1969, 15, 22-32.

15. Frapporti, G.; Vriend, S. P.; Van Gaans, P. F. M. Water Resour. Res.
1993, 29, 2993-3004.

16. Haswell, S. J.; Walmsley, A. D. Anal. Proc. 1991, 28(4), 115-117.

17. Bondarenko, I.; Van Malderen, H.; Treiger, B.; Van Espen, P.; Van
Grieken, R. Chemom. Intell. Lab. Syst. 1994, 22, 87-95.

18. Vogt, W.; Sator, H.; Nagel, D. TrAC, Trends Anal. Chem. 1984, 3(7),
166-171.

19. Alexandrov, N. N.; Go, N. Protein Sci. 1994, 3, 866-875.

20. Sowdhamini, R.; Blundell, T. L. Protein Sci. 1995, 4, 506-520.

21. Livingstone, D. J. Anal. Proc. 1991, 28(8), 247-248.

22. Chu, K. C. Anal. Chem. 1974, 46, 1181-1187.

23. Jurs, P. C.; Lawson, R. G. Chemom. Intell. Lab. Syst. 1991, 10, 81-83.

24. Darwish, Y.; Cserhati, T.; Forgacs, E. Chemom. Intell. Lab. Syst. 1994,
24, 169-176.

25. Novellino, E.; Fattorusso, C.; Greco, G. Pharm. Acta Helv. 1995, 70,
149-154.

26. Domokos, L.; Pretsch, E.; Maendli, H.; Koenitzer, H.; Clerc, J. T.
Fresenius' Z. Anal. Chem. 1980, 304, 241-249.

CHAPTER 4

Application of Cluster Analysis to

Microbial Characterization

Introduction

My interest in cluster analysis as a tool for data interpretation began
during my afﬁliation with the National Science Foundation Center for
Microbial Ecology (CME) at Michigan State University. The Community
Diversity Group, of which I was a member, is one of several in the CME and
is focused on investigating and identifying what constitutes a viable
microbial community [1]. In nature, that is outside of the laboratory,
microbes often cannot survive without being part of a microbial community.
Microbes, being single-celled organisms, are not complex enough to provide
for all of their environmental needs. As a result, microbes are generally
members of a community where each member or microbial strain contributes

something to the whole.


The concept of a microbial community is very important if one is to
understand microbes in their natural habitat. In the laboratory, microbes are
usually grown and studied as isolates. However, for every organism that can
be grown as an isolate in the laboratory, there are scores of others that
cannot survive in this environment. It is, therefore, necessary to look at the
larger community picture if one is to begin to understand the interaction of
microbes with nature. However, in order to investigate a microbial
community, it is also necessary to be able to reliably characterize and

hopefully identify the microbial members of the community.

In this chapter, I will present my efforts in furthering the development
of methodologies for microbial characterization with specific application to
the characterization of selected food-borne contaminants by means of
phospholipid profiling and hierarchical cluster analysis.

Background

When I joined the Enke research group, Mark Cole was conducting his
doctoral research in the area of developing biomarkers for microbes using
mass spectrometric methods. The development of reliable biomarkers is very
important as it allows for the rapid screening of microbes as to form and/or
function without necessarily identifying the specific microbe. Mass
spectrometric methods have proven quite useful in biomarker method
development [2].

While the general goal of any analysis procedure of an unknown
microbe is to determine its identity, this is usually not possible or feasible for
several reasons. For example, only a very small percentage of microbes that
occur in nature have ever been fully characterized and identified. Of those
microbes that have been identified, most have been characterized because
they exhibit some pathogenic character to other life forms, be they human,
animal, or plant [3]. Furthermore, the extensive characterization and
identiﬁcation of a single microbe can take several days to weeks to
accomplish. It is obviously impractical to perform such a protracted
procedure on items like food harvests due to the storage requirements unless
there is some previous indication that the food may be contaminated. This is
part of the reason why foods like tainted meat slip through Food and Drug

Administration (FDA) inspections.

The food industry, in particular, is therefore very interested in
establishing a set of biomarker-based screening methods to identify or at
least provide an indicator as to the microbial composition in a food sample.
A more extensive analysis of the food would then be necessary only when the
screen indicates a problem, which would represent a significant cost and
time reduction to the food industry.


The development of reliable biomarkers was a primary objective of the
CME Community Diversity Thrust Group. Towards that end, a general
approach was devised by which current and future biomarker data could be
collected, stored, and analyzed; this approach is diagrammed in Figure 4.1.
These biomarker methods, databases, and analytical tools would then be
made available to others both within and outside the CME. My involvement

with the CME ended not long after this approach was proposed.

One class of biomarkers that Mark Cole investigated was the
glycerophospholipids, which I shall refer to by the more general term
phospholipids. The glycerophospholipid structure, shown in Figure 4.2,
contains four main components: a substituted phosphate head group, two
fatty acid chains, and a glycerol backbone. The phospholipid classes are
determined by the Y substituent attached to the phosphate group. The
phospholipid classes used in this study are provided in Figure 4.3.

Phospholipids are a major component of the lipid bilayer that makes
up the cellular membrane of a microorganism [4]. It is through this bilayer
that all nutrients must travel into the cell and all waste byproducts must
travel out. As a result, the cell membrane serves both as a shield for the
organism and as an interface to the environment. For this reason, the
membrane composition can vary greatly depending on the form and/or
function of an organism. Further variations are also observed within a

[Figure 4.1 diagram: Computer-Assisted Characterization
   Sampling: organisms (isolates or communities)
   Methods: morphology, fatty acids, phospholipids, genetic methods,
      traditional taxonomy, ...
   Data storage and retrieval: databases
   Analytical tools: numerical methods, cluster analysis, rule based
      expert system, ...
   Output: taxonomical info. (i.e., species, strain, etc.), behavioral
      qualities, environmental qualities, community qualities]

 

Figure 4.1 The proposed CME approach to incorporate the research
objectives in the Community Diversity Thrust Group into a more
cohesive package.


    H2C-O-C(=O)-R
     |
     HC-O-C(=O)-R'
     |
    H2C-O-P(=O)(OH)-O-Y

 

Figure 4.2 The general glycerophospholipid structure. The head group
substituent is represented by Y and the two fatty acids by R and R'.

Phospholipid                                Head Group (Y)       Mass of HO-P(=O)(OH)-O-Y
Phosphatidylglycerol (PG)                   CH2-CH(OH)-CH2(OH)   172 u
Phosphatidic acid (PA)                      -H                    98 u
Phosphatidylethanolamine (PE)               CH2-CH2-NH2          141 u
Phosphatidyldimethylethanolamine (PDME)     CH2-CH2-N(CH3)2      169 u
Phosphatidylmonomethylethanolamine (PMME)   CH2-CH2-NH-CH3       155 u

 

Figure 4.3 The phospholipid head groups and their respective neutral
losses utilized in this study.

particular microbial strain as affected by the extracellular environment:
food supply and composition, temperature, moisture content, etc. [5]
Therefore, utilizing components of the cellular membrane, like
phospholipids, as biomarkers should provide for the differentiation between
microorganisms and perhaps also for the correlation with the form and/or
function of a microorganism.

One of the common characterization methods for microorganisms,
shown in Figure 4.1, is to perform a complete lipid extraction, esterify the
fatty acids, and then analyze the resulting esters by gas chromatography.
This method is called FAME, which stands for fatty acid methyl ester. While
relatively simple to perform, this method suffers from a number of
drawbacks. In particular, the extraction process removes all the fatty
acids in the organism. This is accomplished by first saponifying the whole
microorganism cells in a strong base and then acidifying, methylating, and
finally extracting the fatty acids [6]. It is, therefore, not possible by
this method to selectively characterize the cell membrane composition or
the cell contents. Another deficiency of the FAME method is the loss of
channels of information. During the extraction process, the fatty acids are
cleaved from the substituent to which they may have been bonded. In many
cases, the backbone substituent, coupled with the fatty acid composition,
may be very specific to a particular strain of organism [7-9]. While this
information is lost in FAME, it is not lost in the analysis of
phospholipids by tandem mass spectrometry. As a result, phospholipid
analysis, which characterizes intact phospholipids, provides a much more
organism-specific method of characterization.

When phospholipids that have been ionized by fast atom bombardment
(FAB) undergo collision induced dissociation (CID), they fragment at
specific bond positions that are common to all classes of phospholipids
[10-13]. In positive ion scan mode, phospholipids are observed to lose the
polar substituted phosphate head group as a neutral fragment (Figure 4.4).
The mass of this neutral loss is specific to each phospholipid class and is
used to selectively profile each phospholipid class. The neutral loss
masses are shown in Figure 4.3. Therefore, positive ion neutral loss scans
produce individual spectra for each of the phospholipid classes. These
spectra are free of spectral interferences from other components in the
crude microbial extract mixture as a result of the coupled filtering of the
quadrupoles. The detected ion mass-to-charge values correspond to the sum
of the remaining components of the original phospholipid. The measured
signal intensity is recorded against the mass-to-charge setting of the
first filtering quadrupole so that the resulting spectrum is calibrated for
the ion mass-to-charge before the CID process occurred. Sample neutral loss
spectra (mass profiles) for three organisms are provided in Figure 4.5.
When this

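[Figure 4.4 scheme: the positively charged phospholipid (glycerol backbone
carrying the fatty acyl groups R and R' and the phosphate head group) loses
the neutral HO-P(=O)(OH)-O-Y, leaving a positive ion that retains the
glycerol backbone and both fatty acyl groups.]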

Figure 4.4 Low-energy collision induced dissociation neutral loss
fragmentation of phospholipids for positive ions.

[Figure 4.5 panels: Salmonella abaetetuba, Citrobacter freundii, and
Escherichia coli; vertical axis Relative Abundance, horizontal axis
Mass/Charge.]

Figure 4.5 Phosphatidylethanolamine (PE) mass proﬁles for Salmonella
abaetetuba, Citrobacter freundii, and Escherichia coli.

mass spectral information is combined with the information that can be
obtained using negative ion scan mode, a 4-dimensional phospholipid (class)
mass profile can be constructed (Figure 4.6). The negative ion scan mode
information, which provides the mass-to-charge value of each fatty acid
moiety, was not utilized in this study. Even so, the information provided
by the phospholipid characterization method greatly exceeds that provided
by bulk methods such as FAME. The FAME method yields a single 2-dimensional
gas chromatogram per organism, which contains the retention times and
quantities of the bulk extracted and esterified fatty acids.
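The bookkeeping behind a neutral loss scan can be illustrated with a short
sketch. It assumes singly charged ions and uses the neutral loss masses of
Figure 4.3. The names NEUTRAL_LOSS_U, product_mz, and scan_pairs are
hypothetical, the m/z 600-850 range is an assumed reading of the scan
limits, and the precursor m/z 718 in the usage note is only an illustrative
value.

```python
# Neutral loss masses (u) of the substituted phosphate head groups (Figure 4.3).
NEUTRAL_LOSS_U = {"PG": 172, "PA": 98, "PE": 141, "PDME": 169, "PMME": 155}


def product_mz(precursor_mz, phospholipid_class):
    """m/z of the ion remaining after the head group is lost as a neutral.

    Assumes a singly charged precursor, so the m/z shift equals the mass lost.
    """
    return precursor_mz - NEUTRAL_LOSS_U[phospholipid_class]


def scan_pairs(phospholipid_class, first_mz=600, last_mz=850):
    """(first quadrupole, third quadrupole) m/z settings for one scan.

    Both mass filters step together with a fixed offset equal to the neutral
    loss; the signal is recorded against the first quadrupole setting.
    """
    loss = NEUTRAL_LOSS_U[phospholipid_class]
    return [(q1, q1 - loss) for q1 in range(first_mz, last_mz + 1)]
```

For example, product_mz(718, "PE") gives 577: a PE precursor at m/z 718
would be detected only when the third quadrupole passes m/z 577, yet the
peak is plotted at m/z 718.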

Experimental

All phospholipid analyses were performed using a Finnigan MAT (San
Jose, CA) TSQ-70B triple stage quadrupole mass spectrometer that had been
upgraded to the TSQ-700 model. This instrument was equipped with a JEOL
(Boston, MA) MS-009 charge-transfer fast atom bombardment (FAB) gun and
a modiﬁed power supply. Mass spectra were acquired and initially processed

using the Finnigan TSQ data system and software.

The bacteria investigated in this study were Citrobacter freundii
(C. freundii), Escherichia coli (E. coli), Salmonella abaetetuba

(S. abaetetuba), Salmonella enteritidis (S. enteritidis), and Listeria

[Figure 4.6 axes: phospholipid classes, phospholipid masses, and fatty
acid masses.]

Figure 4.6 The FAB/MS/MS phospholipid (class) mass profile obtained in
part by this analysis procedure. The phospholipid classes axis is
determined by the neutral losses monitored and the phospholipid masses
axis is determined from the respective neutral loss scans. The fatty acid
masses axis is determined by operating the instrument in negative ion
mode, which was not utilized for this study.

monocytogenes (L. monocytogenes). The bacteria used in this study, with the
exception of E. coli, were obtained in lyophilized form from Dr. Edward
Richter at Silliker Laboratories of Ohio, Inc. The E. coli was obtained
lyophilized from Sigma Chemical Co. (St. Louis, MO).

Crude lipid extracts were prepared from the bacteria using a modiﬁed
Bligh-Dyer extraction procedure [14]. The extraction procedure used is as

follows:

1. Rinse extraction vials (scintillation vials) twice with
chloroform.

2. Prepare extraction reagent by mixing one part chloroform
with two parts methanol in a clean erlenmeyer ﬂask.

3. Rinse a long (9 inch) pasteur pipette once with the extraction
reagent and discard.

4. Remove cells pooled at the bottom of a broth culture with a
clean pipette, or transfer approximately one milliliter of the
broth culture. Alternatively, scrape the cells off of a culture
plate with a sterile loop and transfer to a clean vial.

5. Add extraction reagent to each vial until it is one-half to two-
thirds full. Layers should not separate at this point if a broth
is used. If the bacterial cells don’t separate put vials in a
sonicator for a few minutes.

6. Place all vials in a sonicator and sonicate for at least 30
minutes.

7. Add distilled water to each vial until the layers separate. The
methanol will go into the water and the chloroform layer will
be on the bottom.

8. Allow the layers to settle overnight, or centrifuge and remove
the chloroform layer. Place the chloroform extracts into clean
small vials.

9. Evaporate the chloroform extracts to dryness under nitrogen.
Reconstitute with a small volume of fresh chloroform.

At the end of the procedure, a water/organic extraction is performed, which
leaves the phospholipids in the chloroform (organic) layer. The chloroform
is then evaporated off under a stream of nitrogen. The remaining extract is
reconstituted with 5-10 µL of fresh chloroform just prior to analysis.

Mass spectral analyses were performed by dissolving 3-5 µL of the
chloroform extract solution in a drop of nitrobenzyl alcohol matrix on a
custom-made FAB probe tip. The FAB gun was operated with a filament
current of 10 µA and a xenon FAB gas beam energy of 8 keV.

Neutral loss spectra were obtained in the positive ion mode of the mass
spectrometer after tuning with cesium iodide. The argon collision gas

pressure was set at 0.5 mtorr and the collision energy was set at 30 eV.

In order to obtain repetitive neutral loss scans for five of the
phospholipid classes known to occur in the organisms studied, an Instrument
Control Language (ICL) procedure was written and utilized. The ICL
procedure used is provided in Figure 4.7. By using the ICL procedure, a
complete phospholipid analysis can be efficiently performed from a single
probe sample within the useful lifetime of the sample droplet. Twenty
neutral loss scans were collected in centroid mode for each of the five
phospholipid classes used in this study. Three or four replicates were run
for each organism investigated.

Results and Discussion

It was necessary to perform some fairly rigorous ﬁltering of the mass
spectral data to extract the useful phospholipid data and to reduce the
number of dimensions (features) used in the cluster analysis. Much of the
data ﬁltering was necessary to reduce or eliminate the high levels of
background interferents resulting from the poor sensitivity of the mass

spectrometer at the time of the analysis.

Filtering of the mass spectral data was accomplished in three steps: (1)
a peak (a mass/intensity pair) must occur in 50% or more of the 20 neutral
loss scans collected for each phospholipid class, (2) the scans thus ﬁltered are
then averaged, and (3) a peak must occur in 100% of the averaged spectra
within each set of replicates to be used as a dimension in the clustering

procedure [15].
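The three filtering steps can be sketched as follows. This is a minimal
illustration rather than the program actually used: each scan is assumed to
be a mapping from integer m/z to intensity, and averaging is assumed to run
over all scans with absent peaks counted as zero, since the text does not
specify how absent peaks were treated. The function names are hypothetical.

```python
from collections import defaultdict


def filter_and_average(scans, min_fraction=0.5):
    """Steps 1 and 2: keep peaks seen in enough scans, then average them.

    scans: list of {integer m/z: intensity} mappings for one phospholipid
    class. A peak is kept if it occurs in at least min_fraction of the scans.
    The average is taken over all scans (assumption: absent peak = 0).
    """
    counts = defaultdict(int)
    totals = defaultdict(float)
    for scan in scans:
        for mz, intensity in scan.items():
            if intensity > 0:
                counts[mz] += 1
                totals[mz] += intensity
    n = len(scans)
    return {mz: totals[mz] / n for mz in counts if counts[mz] >= min_fraction * n}


def common_peaks(replicate_spectra):
    """Step 3: m/z values present in every averaged replicate spectrum."""
    if not replicate_spectra:
        return set()
    return set.intersection(*(set(s) for s in replicate_spectra))
```

With four scans in which m/z 700 occurs four times, m/z 702 twice, and
m/z 705 once, only m/z 700 and 702 survive the first step at the 50% level.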

Hierarchical cluster analysis was then performed on the ﬁltered mass
spectra for each phospholipid class analyzed to yield the dendrograms in
Figures 4.8-4.13 [15]. The development and interpretation of these

dendrograms are explained below.

    PROF
    NEU 172,600,850,1                        # setup
    COFF=-30;ON;CDYN 18;EMULT=1700;SN=1
    APAUSE;ARESUME
    NEU 172,600,850,1                        # scan PG
    REPEAT 20;GO;STOP;END
    NEU 98,600,850,1                         # scan PA
    REPEAT 20;GO;STOP;END
    NEU 141,600,850,1                        # scan PE
    REPEAT 20;GO;STOP;END
    NEU 169,600,850,1                        # scan PDME
    REPEAT 20;GO;STOP;END
    NEU 155,600,850,1                        # scan PMME
    REPEAT 20;GO;STOP;END
    Q1MS;SW 20;ST=0.5                        # standby mode
    ASTOP
    OFF

Figure 4.7 The Instrument Control Procedure used to control the mass
spectrometer in this study.

The organism/mass spectral data were converted into an interorganism
triangular distance matrix using the Euclidean distance metric [16]. The
mass spectral data were translated into a multidimensional data space by
allowing each integral mass-to-charge value, for which at least one organism
had an intensity greater than zero, to be treated as a separate orthogonal
dimension (feature). Since the mass spectra were all normalized to 100%
relative abundance, the percent abundance was used to represent the
distance along each dimensional vector. As a result, no further scaling of
the individual features was necessary. The resulting distance matrix was
normalized by the number of dimensions and scaled to yield distance values
(similarities) in the range of zero to one.
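A minimal sketch of this distance calculation is given below. The exact
normalization is not stated above, so the form used here, the root mean
squared difference over all dimensions divided by 100, is an assumption
chosen because it keeps every value between zero and one. The function name
and data layout are hypothetical.

```python
import math


def distance_matrix(spectra):
    """Upper-triangular Euclidean distance matrix between mass profiles.

    spectra: {organism name: {integer m/z: percent relative abundance}}.
    Every m/z with a nonzero intensity in at least one profile is one
    dimension. Distances are divided by the number of dimensions (inside the
    root) and by 100 so that they fall in [0, 1] (assumed normalization).
    """
    names = sorted(spectra)
    dims = sorted({mz for s in spectra.values() for mz, v in s.items() if v > 0})
    dist = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if not dims:
                dist[(a, b)] = 0.0
                continue
            squared = sum(
                (spectra[a].get(mz, 0.0) - spectra[b].get(mz, 0.0)) ** 2
                for mz in dims
            )
            dist[(a, b)] = math.sqrt(squared / len(dims)) / 100.0
    return dist
```

Under this assumed scaling, a similarity index can be taken as one minus
the distance, so that identical profiles have a similarity of 1.0.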

Dendrograms were then constructed from the normalized distance
matrix using single linkage joining [17]. Single linkage joining was
accomplished by searching the distance matrix for the lowest value. The two
organisms that correspond to this value are then joined at their respective
similarity (distance) level. This step is then repeated for the next lowest
distance value and so on. If an organism for a selected distance value is
already joined to other organisms, a branch point is formed, and the new
organism is joined on to the existing group. This process continues until all

the organisms have been joined. The advantage of hierarchical clustering is
that it provides the most similar organisms as well as any parent-offspring
relationships formed by the branch points in a readily interpretable form.

[Figures 4.8-4.13: dendrograms of the replicate mass profiles for the
individual phospholipid classes and, in Figure 4.13, for the combined
phospholipid neutral loss data. Each dendrogram lists the replicate
analyses of the organisms against the similarity index axis.]
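The single linkage joining described above can be sketched as follows. This
is an illustrative implementation, not the original program: it takes a
dictionary of pairwise distances, walks through the pairs from the smallest
distance upward, and records each join. The union-find bookkeeping and the
function name are choices made for this sketch.

```python
def single_linkage(dist):
    """Single linkage joining from a {(a, b): distance} dictionary.

    Returns the joins in the order they occur as (distance, members) tuples,
    where members is the sorted list of organisms in the newly formed group.
    """
    parent = {}

    def find(x):
        # Follow parent links to the representative of x's group.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    members = {}
    merges = []
    for (a, b), d in sorted(dist.items(), key=lambda item: item[1]):
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # already joined at a smaller distance
        group_a = members.pop(root_a, [root_a])
        group_b = members.pop(root_b, [root_b])
        parent[root_b] = root_a
        members[root_a] = sorted(group_a + group_b)
        merges.append((d, members[root_a]))
    return merges
```

Each recorded distance is the level at which the corresponding branch point
would be drawn in the dendrogram.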

Figures 4.8-4.13 are the dendrograms generated for each phospholipid
class by the above clustering procedure. On the vertical axis of each
dendrogram are the microorganisms that contained the given phospholipid.
On the horizontal axis is the similarity index with a limiting value of 1.0 or
100% on the left and a limiting value of 0.0 or 0% on the right. The lengths of
the horizontal line segments represent the dissimilarities between
organisms. The join points, represented by the vertical line segments,
represent the similarities as measured on the similarity index scale. The

closer two organisms join to the left, the more similar are their mass spectra.

It is interesting to note that while differentiation between several of
the organisms can be made from each of the individual phospholipid class
dendrograms, no single dendrogram cleanly differentiates among all five
organisms. Even the phosphatidylmonomethylethanolamine (PMME)
dendrogram, which provides very good differentiation between E. coli,
C. freundii, and S. abaetetuba, does not differentiate between
S. enteritidis and L. monocytogenes because they do not contain measurable
levels of PMME. However, using a combination of two or more phospholipid
class dendrograms, all five microorganisms can be cleanly differentiated
from each other. For example, the phosphatidylglycerol (PG) dendrogram does
not cleanly differentiate between the two Salmonella species but does
cleanly differentiate the other organisms from one another and from the two
Salmonella species. However, either of the phosphatidyldimethylethanolamine
(PDME) or phosphatidic acid (PA) dendrograms can be used to differentiate
between the Salmonella species.

Since differentiation can be made among all five organisms using just
two of the phospholipid class dendrograms, it might be expected that
combining the phospholipid neutral loss data would yield a single
dendrogram that cleanly differentiates among all five organisms. However,
this test produces the dendrogram in Figure 4.13, which does not cleanly
differentiate among any of the organisms. The reason is that combining the
data removes the level of discrimination provided by the phospholipid
classes themselves. Furthermore, each of the five phospholipid class
dendrograms differentiates among the organisms differently. Combining the
data before performing the cluster analysis therefore results in a loss of
selectivity for any one organism.

The important information to be gained from trying to combine the
phospholipid data is that it is the most discriminating features in the mass
spectra that result in the cleanest differentiations between the
microorganisms. The presence of any nondiscriminating features merely
serves to increase the computational complexity of the clustering and to
decrease the differentiating ability of the resulting dendrogram.

It is therefore advisable to select out of the data set only the most
discriminating features to work with. One technology that can assist in
determining the most discriminating features, particularly for large data
sets, is genetic algorithms, which were introduced by Dr. John Holland at
the University of Michigan [18-20].

The importance of isolating discriminating features will be discussed

further in Chapter 7.

Conclusions

The use of phospholipid biomarker screening and hierarchical cluster
analysis provides an effective means of differentiating between food-borne
microbial contaminants and likely other organisms as well.

The advantage of hierarchical cluster analysis to this study becomes
particularly relevant once there are more than a few mass spectra to
compare, as it becomes difficult to differentiate among them using visual
inspection. For example, even with the small microbial sample set used
here, a total of 67 individual mass spectra were used to construct the
dendrograms. Each mass spectrum may also contain a few to dozens of
mass/intensity pairs. However, by this effective implementation of
hierarchical cluster analysis, this data complexity is reduced to five
easily interpreted dendrograms.

With less noisy data, even better differentiation should be possible.
Likewise, the implementation of a method for determination of the most
discriminating factors should further improve the differentiation between
microorganisms.

References

1. Archibold, E. R.; Bourquin, A. W. J.; F., C.; Chakrabarty, A. M.;
Fletcher, M. M.; Lenski, R. E.; White, D. W. Center for Microbial
Ecology - Science Advisory Panel Report, National Science Foundation:
Washington, DC, 1990.

2. Fenselau, C. In ACS Symposium Series; Comstock, M. J., Ed.;
American Chemical Society: Washington, DC, 1994; Vol. 541.

3. Cole, M. J. Ph.D. Thesis, Michigan State University, East Lansing,
1990; Chapter 3.

4. Brock, T. D.; Madigan, M. T. Biology of Microorganisms, 5th ed.;
Prentice Hall: Englewood Cliffs, 1988; Chapter 19.

5. Cole, M. J. Ph.D. Thesis, Michigan State University, East Lansing,
1990; Chapter 5.

6. Miller, L.; Berger, T. Gas Chromatography Application Note 228-1 -
Bacteria Identification by Gas Chromatography of Whole Cell Fatty
Acids, Hewlett Packard: Palo Alto, CA, 1985.

7. Ratledge, C.; Wilkinson, S. G. Microbial Lipids, Volume 1, Academic
Press: San Diego, 1988.

8. Kates, M. Advances in Lipid Research, Volume 2, Paoletti, R., Ed.,
Academic Press: New York, 1964; Chapter 1.

9. Lechevalier, M. P. CRC Crit. Rev. Microbiol., 1977, 5, 109-210.

10. Heller, D. N.; Cotter, R. J.; Fenselau, C.; Uy, O. M. Anal. Chem., 1987,
59, 2806-2809.

11. Heller, D. N.; Murphy, C. M.; Cotter, R. J.; Fenselau, C.; Uy, O. M.
Anal. Chem., 1988, 60, 2787-2791.

12. Cole, M. J.; Enke, C. G. Anal. Chem., 1991, 63, 1032-1038.

13. Cole, M. J.; Enke, C. G. J. Am. Soc. Mass Spectrom., 1991, 2, 470-475.

14. Bligh, E. G.; Dyer, W. J. Can. J. Biochem. Physiol., 1959, 37, 911-917.

15. Cole, M. J.; Hemenway, E. C.; Enke, C. G. Presented at the 39th
Annual Conference on Mass Spectrometry and Allied Topics, May 19-24,
1991, Nashville, TN.

16. Jain, A. K.; Dubes, R. C. Algorithms for Clustering Data; Prentice
Hall: Englewood Cliffs, 1988; Chapter 2.

17. Bratchell, N. Chemomet. Intell. Lab. Syst., 1989, 6, 105-125.

18. Holland, J. H. Adaptation in Natural and Artificial Systems; The MIT
Press: Cambridge, MA, 1992.

19. Siedlecki, W.; Sklansky, J. Pattern Recognition Letters 1989, 335-347.

20. Goldberg, D. E. Genetic Algorithms in Search, Optimization and
Machine Learning; Addison-Wesley: New York, NY, 1989.

CHAPTER 5

Generating Clusterable MSMS Data

Introduction

As with any spectral matching or data interpretation technique,
assumptions must be made, and indeed allowed, regarding the consistency of
the data being investigated. For the purpose of data collection, this requires
that the data be collected under a consistent set of deﬁned conditions.
Alternatively, the data collection under varying conditions must be adjusted
to match the data that would have been generated under a standard set of
conditions for inter-spectrum interpretation. Unfortunately, the second
option is not practical because it requires a set of adjustment algorithms that
have not been developed and may not be possible. It also requires data about
the sample and analysis conditions that are generally not available. Indeed,
the applicability of such adjustments probably requires a knowledge of the
analyte to a degree that makes the spectra matching or interpretation
unnecessary. Therefore, for the purpose of general sample analysis, it is a
requirement that samples be analyzed under a standard set of operating
conditions. This is particularly true in the field of mass spectrometry,
where sample introduction, ionization, and fragmentation conditions affect
the spectra so significantly.

For mass spectrometric analysis, standard conditions have been
established for certain types of analyses by consensus or by various agencies
such as the Environmental Protection Agency (EPA). For example, electron
impact ionization (EI) is performed using an electron beam energy of 70 eV.
Furthermore, the normal mass spectrum (MS) profile of a standard tuning
compound like perfluorotributylamine (PFTBA) should be consistent among
different types and models of mass spectrometers.

Attempts have been made to develop standardized methods for the
tandem mass spectral (MS/MS) analysis of small organic molecules using EI
ionization and collision induced dissociation (CID). Dawson et al. [1], for
example, conducted a round-robin study involving seven laboratories prior to
establishing a standard set of operating conditions for MS/MS analyses. The
focus of this study was the setting of the collision energy between selected
precursor ions and the argon collision gas and the collision gas target
thickness or pressure. The results demonstrated that spectral proﬁles varied
widely from laboratory to laboratory even when the laboratories were

provided with a specific set of operating conditions to follow. The spectral
differences can be ascribed to differences between instrumental designs and
characteristic behavior, operating conditions, and operator experience.

For the majority of MS/MS analyses performed in tandem quadrupole
mass spectrometers, the instrument is optimized for the specific sample and
the type of information desired. Product spectra are generally much simpler
in nature and provide fewer spectral peaks than normal mass spectra. For
example, isotope peaks may be missing, and low-energy CID spectra do not
have many of the higher energy products typically produced in the EI
source. High-energy CID [2] and 193 nm photo-induced dissociation (PID) [3]
have been found to produce product spectra that are very similar to normal
EI spectra but with MS/MS isotopic characteristics.

Under typical collision energy and single collision conditions, the
product spectral peaks are often of very low intensity (a few percent) relative
to the precursor peak. However, the substructures represented by the
product m/z fragments are generally the result of direct unimolecular
decomposition of the precursor ion. As the target gas density (pressure) is
increased, ions passing through the collision chamber octapole undergo
multiple collisions with the target gas. As a result, more product peaks are
produced which may represent a combination of higher energy
fragmentations of the precursor ion directly and/or secondary fragmentation

of other product ions. The product ion peak intensities are also greater
relative to the precursor ion peak. The first, low pressure, conditions are
useful for direct characterization of ions and for investigating the physical
chemistry of the ion fragmentation process. The second method, though it
produces more complex results, is often used for obtaining better sensitivity

and additional ion characterizing information.

Due to the above complexities, there is neither a set of standard
operating conditions for MS/MS analyses nor any readily available MS/MS
spectral libraries. Those MS/MS libraries that do exist are generally specialty
libraries created in-house by such organizations as pharmaceutical

companies for the characterization of drugs and drug metabolites.

It was, therefore, necessary to develop a set of working conditions for
this project as well as generate a library of MS/MS spectra. The remainder of
this chapter will focus on the process of characterizing the relevant operating
parameters of the mass spectrometer, the development of an MS/MS analysis
protocol, and the development of a library system for spectral storage and

retrieval.

Acquisition Method Development

Instrumental Parameters

For this research, mass spectra were obtained on Finnigan TSQ-700
and Finnigan TSQ-7000 tandem mass spectrometers (referred to later as
simply TSQ) equipped with an electron impact ion source as diagrammed in
Figure 5.1. This instrument contains fifteen electrical elements that must
be optimized or tuned individually. These elements, listed by position from
the source to the detector, are the source EI filament, lenses 1-1, 1-2,
and 1-3, quadrupole 1, lenses 2-1, 2-2, and 2-3, the collision chamber
octapole, lenses 3-1, 3-2, and 3-3, quadrupole 3, the conversion dynode,
and the electron multiplier. With the exception of the EI filament, the
conversion dynode, and the electron multiplier, the potentials at any
moment during a scan are controlled by nineteen different parameters. The
parameters for the normal mass analysis mode, with quadrupole 1 operating
as the filtering quadrupole, are given with typical values in Table 5.1.

[Figure 5.1: diagram of the TSQ tandem quadrupole mass spectrometer.]

Due to the complexity of the TSQ, it is necessary to tune the
instrument in stages. This is accomplished by first tuning the instrument
with PFTBA in normal MS mode for both Q1MS and Q3MS modes. In Q1MS
mode, only the first quadrupole is operating as a mass (m/z) filter while
quadrupoles 2 and 3 are operating in pass-through (rf only) mode. Q3MS
mode is the same as Q1MS mode except that quadrupole 3 is the mass
filtering quadrupole while quadrupole 1 operates in pass-through mode. As a
result, the procedures for tuning in Q1MS and Q3MS modes are essentially
the same.

Table 5.1 Tune parameters for Q1MS mode with typical values in volts.

Parameter   Value    Description
L11         -13.4    lens 1-1
L12         -50.0    lens 1-2
L13         -10.9    lens 1-3
POFF        -10.0    quadrupole 1 offset
L21          -2.3    lens 2-1
L22        -180.0    lens 2-2
L23         -93.1    lens 2-3
COFF        -10.0    collision chamber octapole offset
L31         -92.0    lens 3-1
L32        -180.0    lens 3-2
L33         -35.7    lens 3-3
DOFF        -10.0    quadrupole 3 offset
PRES          4.1    quadrupole 1 resolution
PCAL          2.6    quadrupole 1 calibration
CRFP         27.4    collision chamber octapole rf amplitude
DRFP         -1.1    quadrupole 3 rf amplitude
COLL        135.0    source electron lens

The tuning procedure for normal MS mode involves ﬁrst adjusting the
resolution and offset of the ﬁltering quadrupole to obtain the optimal signal
and peak shape for a single tune mass (m/z) while maintaining unit mass
resolution. Due to the energy spread of the ions leaving the source, the
tuning of the ﬁltering quadrupole to maintain the above criteria results in
approximately 90% of the ions of the selected m/z being discarded. This is
repeated for all tune masses which for PFTBA are nominally 69, 100, 119,
131, 169, 219, 264, 414, 502, and 614. Tuning is also performed using

background water (m/z 18) and air (m/z 28 and 32).

After unit resolution for the tune masses is achieved, the mass range is
calibrated for each tune mass. Then the rf potentials for the collision
chamber octapole and the non-filtering quadrupole are optimized for the
transmission of each tune mass.

The next step in tuning is to set the lens voltages. The center lenses
1-2, 2-2, and 3-2 are manually adjusted ﬁrst for m/z 32 and 502. It is not
necessary to tune the center lenses for each tune mass because these lenses
are not particularly mass sensitive. Next, the ﬁrst and third lens voltages for
each set are adjusted to obtain the optimal signal intensity and stability for

the speciﬁc tune mass.

There are so many ion optic elements that must be tuned on the
tandem quadrupole mass spectrometer that a single pass is not adequate for
optimal tuning. The reason for this is that most of the ion optic elements
have inter-related effects on the ﬁltering and transmission of ions. For
example, the initial tuning step involved the adjustment of the quadrupole
resolution and offset. These values are sensitive to the energies of the ions
leaving the source. Since the ﬁrst set of lenses is located between the source
and the ﬁrst quadrupole, adjustment of the ﬁrst lens set can signiﬁcantly
affect the resolution and offset parameters on the ﬁrst quadrupole. Therefore,
once the above tuning steps are completed, the process is repeated to obtain a

better tune approximation.

After both Q1MS and Q3MS modes have been tuned, it is necessary to tune the
instrument for MS/MS operation. This step is much simpler than the normal MS
tuning procedures. For the first quadrupole and lens sets 1 and 2, values from
the Q1MS tune are used. For the third quadrupole and lens set 3, values from
the Q3MS tune are used. As a result, the only adjustment necessary to complete
MS/MS tuning is the rf potential on the collision chamber octapole. This step
is performed by setting both quadrupoles 1 and 3 to filter on the same tune
mass and then adjusting the collision chamber octapole rf potential to
optimize ion transmission. This is repeated for all tune masses. This
particular mode of operation is called neutral loss/gain mode (NEU). For the
purpose of tuning, the neutral loss/gain value is zero.

Typical tune profiles of PFTBA for Q1MS, Q3MS, and NEU 0 operating modes are
provided in Figure 5.2.

Sample Introduction

Sample introduction for this research was accomplished by two methods. For
moderate to low volatility samples, a solids probe, which is controlled by the
mass spectrometer, was utilized. This probe uses a combination of heaters and
water-ethylene glycol coolant to control the temperature of the probe. Through
the data system software, a heating curve can be defined to control the
temperature of the probe over the course of an analysis. The operating range
of the probe is from the coolant temperature up to 250 °C. The probe tip is
designed to hold sample crucibles, usually aluminum or glass, which have an
internal sample volume of approximately 10 µL.

For highly volatile compounds, a custom designed leak inlet system was
utilized. This device, shown in Figure 5.3, provides a controlled inlet of
sample that cannot be obtained from the solids probe. Highly volatile samples
generally last only a few seconds on the solids probe, with the highest
volatility samples gone before the solids probe has been inserted. The leak
inlet circumvents this problem. The glass sample holders are cleaned and the
glass capillary is replaced between samples. The glass capillary is a section
of gas chromatographic guard column approximately 18 inches long. Using this
device, a complete MS/MS map can be obtained from a few µL of sample. The
volatility range of samples introduced by this means is limited by the lack of
heating of the sample bulb and capillary tubing.

Figure 5.2 Typical Q1MS, Q3MS, and NEU 0 profiles for the tuning
compound PFTBA. [Three panels: Q1MS Mode, Q3MS Mode, and NEU 0 Mode;
relative intensity (%) versus m/z.]

Figure 5.3 [Custom designed leak inlet system; caption and labels illegible
in the source.]

One characteristic problem with the quadrupole design of the TSQ-7000 and
both the solids probe and leak inlet systems is that the mass filtering
ability of the first quadrupole is significantly affected by source pressure.
As a result, care must be taken not to allow too high a sample flow to the
source.

Effects of Collision Energy on CID Spectra

For low-energy collision-induced dissociation, collision energy, target gas
pressure (density), and target gas composition are the key instrumental
parameters. Argon, the most common target gas for low-energy CID, was used
exclusively for this research. Due to the importance of these parameters, it
was necessary to investigate how sensitive product spectral profiles are to
them and to determine what the optimal operating parameters might be. A
similar series of studies was conducted by Hart [4], Palmer [5], and Cross [6]
as part of the ACES project, which also utilized triple quadrupole mass
spectrometers for the generation of product spectra. The results of the
studies for this research are consistent with these previously conducted
studies.

Under low-energy collision conditions (i.e., 1-100 V), the transfer of
kinetic energy of the precursor ion to internal energy occurs through
vibrational excitation of the electronic ground state. The amount of kinetic
energy converted to internal energy from the collision of a polyatomic ion
with a neutral collision gas can be directly correlated with collision energy
[7]. On triple quadrupole mass spectrometers, the maximum lab-centered
collision energy is given by the potential difference between the ion source
region, which is usually at ground potential, and the collision chamber offset
potential. The center of mass collision energy, which depends on the masses of
the precursor ion and the target gas, is given by the following equation [8],
where $E_{CM}$ is the center of mass collision energy, $E_{lab}$ is the
laboratory collision energy, $m_g$ is the mass of the target gas, and $m_p$ is
the mass of the precursor ion.

$$E_{CM} = E_{lab}\left(\frac{m_g}{m_p + m_g}\right)$$

It has been shown for n-butylbenzene that the fraction of the kinetic energy
converted to internal energy increases with collision energy up to 40 V (lab),
with about 50% conversion [7]. This fraction decreases as the collision energy
exceeds 40 V (lab).
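As a brief illustration of the center of mass relation above, the following
Python sketch evaluates it for the two model compounds used in this chapter.
The function name and the nominal argon mass of 40 u are assumptions
introduced for the example; the 30 eV laboratory energy and the precursor
masses (m/z 98 and 114) are taken from the text.

```python
def center_of_mass_energy(e_lab, m_precursor, m_gas):
    """Return E_CM = E_lab * m_g / (m_p + m_g) in the units of e_lab."""
    if m_precursor <= 0 or m_gas <= 0:
        raise ValueError("masses must be positive")
    return e_lab * m_gas / (m_precursor + m_gas)


ARGON_MASS = 40.0  # nominal mass of argon in u (assumed for this sketch)

# Laboratory collision energy of 30 eV, as selected later in the chapter.
e_cm_cyclohexanone = center_of_mass_energy(30.0, 98.0, ARGON_MASS)
e_cm_heptanone = center_of_mass_energy(30.0, 114.0, ARGON_MASS)

print(f"{e_cm_cyclohexanone:.2f} eV")  # 8.70 eV
print(f"{e_cm_heptanone:.2f} eV")      # 7.79 eV
```

Under these assumptions, less than a third of the laboratory collision energy
is available in the center of mass frame for either precursor.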

Figure 5.4 Product spectra of the molecular ion of cyclohexanone with a
collision energy of (a) 10 eV, (b) 40 eV, and (c) 80 eV with a
target gas pressure of 3 × 10⁻⁶ torr (manifold). (* denotes offscale
precursor peak at 100%) [Axes: relative intensity (%) versus m/z.]

Product spectra of the molecular ion of cyclohexanone (m/z 98) showing the
effect of increasing collision energy are provided in Figure 5.4. These
spectra were collected at a fixed target gas pressure of 3 × 10⁻⁶ torr
(manifold, 0.47 mtorr collision chamber) and plotted relative to the
unfragmented precursor ion intensity obtained without target gas present. Few
products are produced at 10 V since a 10 V collision for the most part results
in an internal energy lower than the critical dissociation energy, also known
as the appearance energy, for many fragment ions. At 40 V, more precursor ions
exceed these energy barriers and high intensity product ions are the result.
The product ions of m/z 41, 42, and 55 appear to be more sensitive than the
other product ions to collision energy. The relative intensities of several
product ions of cyclohexanone are plotted versus collision energy in
Figure 5.5.

A similar set of experiments to that described above was conducted on the
molecular ion of 3-heptanone (m/z 114). Product spectra showing the effect of
increasing collision energy are given in Figure 5.6. The relative intensities
of several product ions of 3-heptanone are plotted versus collision energy in
Figure 5.7. The results of a repetition of this experiment are given in
Figure 5.8. While the lower intensity product ions appear consistent, the
intensity of the m/z 72 product ion varies significantly. The m/z 72 ion is
the result of a McLafferty rearrangement reaction from m/z 114. The McLafferty
rearrangement reaction is known to be sensitive to collision energy [9] and
has, therefore, been used as a means of investigating instrument stability.

Figure 5.5 [Relative intensity (%) of several product ions of the molecular
ion of cyclohexanone plotted versus collision energy; caption illegible in the
source.]

Figure 5.6 Product spectra of the molecular ion of 3-heptanone with a
collision energy of (a) 10 eV, (b) 40 eV, and (c) 80 eV with a
target gas pressure of 3 × 10⁻⁶ torr (manifold). [Axes: relative
intensity (%) versus m/z.]

Figure 5.7 [Relative intensity (%) of several product ions of the molecular
ion of 3-heptanone plotted versus collision energy; caption illegible in the
source.]

Figure 5.8 [Repetition of the experiment shown in Figure 5.7: relative
intensity (%) of several product ions of the molecular ion of 3-heptanone
plotted versus collision energy; caption illegible in the source.]

Effects of Target Gas Pressure on CID Spectra

The effect of target gas pressure was also investigated. The average number
of collisions that an ion is likely to undergo is a function of the target gas
pressure. At low target gas pressures, product ions are obtained from a single
collision with the target gas. At higher pressures, precursor ions can collide
with two or more target gas atoms to attain higher internal energy. Product
ions can also undergo collisions before exiting the collision chamber, which
can result in the further fragmentation of product ions. As a result, product
spectra obtained under multiple-collision conditions typically have more
products of higher intensity than spectra obtained under single-collision
conditions. The effect of increasing the target gas pressure at a constant
collision energy (-30 V) is shown in Figure 5.9 for the molecular ion of
cyclohexanone. The expected trend is for a power law increase in product ion
intensities with pressure at a constant collision energy. Figure 5.10 is the
log-log plot of relative intensity versus target gas pressure for the major
product ions of cyclohexanone. The slope of such a log-log plot for a specific
product is indicative of the reaction order. A slope of unity indicates that
the products are being formed by single collision interactions with subsequent
unimolecular decomposition of the precursor ion. A slope less than unity
indicates the occurrence of metastable decomposition as well. Slopes higher
than unity indicate a multiple collision product. In the case of the products
of cyclohexanone, these reaction orders vary from about 1.4 to 1.6, indicating
that some of the products are formed from single collision reactions and some
from two collision reactions.

Figure 5.9 [Relative intensity (%) of the major product ions of the molecular
ion of cyclohexanone plotted versus target gas pressure (torr, manifold) at a
constant collision energy; caption illegible in the source.]

Figure 5.10 [Log-log plot of relative intensity versus target gas pressure
for the data presented in Figure 5.9; caption illegible in the source.]

Figure 5.11 shows the effect of increasing the target gas pressure at a
constant collision energy (-30 V) for the molecular ion of 3-heptanone. The
trends in these data are quite similar to those obtained previously for
cyclohexanone. The equivalent log-log plot for the data presented in
Figure 5.11 is given in Figure 5.12. The slopes of these plots, which are also
in the range of 1.4 to 1.6, indicate, as for the cyclohexanone data, that at a
target gas pressure of 3 × 10⁻⁶ torr (manifold) and -30 V collision energy, a
mixture of single and multiple collisions is occurring in the collision
chamber.
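The reaction order estimate described above amounts to a straight-line fit in
log-log space. The following Python sketch shows one way such a slope could be
computed by ordinary least squares. The pressure and intensity values are
synthetic (generated from an assumed power law with exponent 1.5), and the
function names and the tolerance used for interpretation are hypothetical;
none of them are taken from the measured data.

```python
import math


def loglog_slope(pressures, intensities):
    """Least-squares slope of log10(intensity) versus log10(pressure)."""
    xs = [math.log10(p) for p in pressures]
    ys = [math.log10(i) for i in intensities]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def interpret_order(slope, tolerance=0.1):
    """Map a slope to the qualitative reading given in the text.

    The tolerance around unity is an assumption made for this sketch.
    """
    if slope < 1.0 - tolerance:
        return "metastable decomposition contributes"
    if slope <= 1.0 + tolerance:
        return "single collision"
    return "multiple collisions contribute"


# Synthetic manifold pressures (torr) and intensities following I ~ P**1.5.
pressures = [1e-6, 2e-6, 3e-6, 4e-6, 5e-6]
intensities = [2.0 * (p / 1e-6) ** 1.5 for p in pressures]

slope = loglog_slope(pressures, intensities)
print(round(slope, 3), interpret_order(slope))
```

A slope between 1 and 2, as in this synthetic case, corresponds to the mixture
of single and two collision processes discussed for both ketones.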

Instrumental Conditions Selected for MS/MS Library Generation

Configuring the TSQ-7000 for MS/MS library spectra generation first requires
tuning the instrument to provide consistent PFTBA profiles between tunes.
Source conditions were the same as those used for standard EI analysis (70 eV
electrons, 150 °C) with a filament emission current of 1300 µA.

Figure 5.11 [Relative intensity (%) of the major product ions of the
molecular ion of 3-heptanone plotted versus target gas pressure (torr,
manifold) at a constant collision energy; caption illegible in the source.]

Figure 5.12 [Log-log plot of relative intensity versus target gas pressure
for the data presented in Figure 5.11; caption illegible in the source.]

For MS/MS analysis, argon was selected as the target gas. The collision
energy was chosen to be -30 eV. The argon pressure was set to give a manifold
pressure of 5 × 10⁻⁶ torr. The corresponding collision chamber convectron
gauge reading was 0.6 mtorr. The collision chamber pressure was set using the
manifold pressure rather than the collision chamber convectron gauge, as the
convectron gauge was generally too unstable for reliable reading. The
conversion dynode voltage on the TSQ-700 was set to -15 kV. On the TSQ-7000
the conversion dynode voltage is not adjustable but is fixed at -18 kV. The
electron multiplier voltage was set to provide a maximum signal intensity of
10-20 million counts. This provides reasonable signal-to-noise ratios for even
low intensity ions without undue wear on the electron multiplier. As a result,
the electron multiplier was typically set in the range of 1300 to 2300 volts
depending on the condition of the multiplier and the volatility of the sample.

It was necessary to trade off between single and multiple collision
conditions for the purpose of library generation. The ideal case would be to
operate under single collision conditions, as the products are direct
decomposition fragments from the precursor and are, therefore, the most
characteristic of the precursor. However, one of the objectives of this
research was to develop an analysis method that would be generally applicable.
It was, therefore, necessary to operate under more typical conditions. A
collision energy of -30 eV and a collision chamber pressure of 0.6 mtorr are
typical of the majority of MS/MS analyses of small organic molecules. The
collision pressure was also selected based on the shape of the pressure
dependence curves presented in Figure 5.9 and Figure 5.11. Above 0.6 mtorr,
the intensity of product ions relative to the precursor ions curves upwards
significantly. As a result, product ion intensity in this region would become
significantly more sensitive to collision chamber pressure fluctuations. Below
0.6 mtorr, product ion intensities are simply lower relative to the precursor
and would, therefore, result in a lower signal-to-noise ratio for the product
ions.

Sample introduction for highly volatile samples was accomplished using the
custom made leak inlet system described previously. The majority of samples
were introduced using the solids probe with 1-2 µL of sample placed in the
crucible for each run. The solids probe was programmed to hold the probe
temperature at 25 °C for 2 minutes and then increase the temperature to 150 °C
at 50 °C per minute. The vast majority of small organic molecules, when placed
in the mass spectrometer vacuum, evaporate rapidly. As a result, only in the
case of a few compounds did the analysis occur during the heating stage of the
solids probe.
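The timing of the probe program relative to the acquisition can be checked
with a few lines of arithmetic. The Python sketch below models the heating
curve as an ideal hold followed by a linear ramp, which is an assumption; the
function name is hypothetical, and the 500 scans of 0.5 s each are the nominal
values used in the acquisition procedures given later in this chapter, with no
allowance for inter-scan overhead.

```python
def probe_temperature(t_min, start=25.0, hold_min=2.0, rate=50.0, final=150.0):
    """Nominal probe temperature (deg C) at t_min minutes into the program.

    Idealized model: hold at `start` for `hold_min` minutes, then ramp at
    `rate` deg C per minute until `final` is reached.
    """
    if t_min < 0:
        raise ValueError("time must be non-negative")
    if t_min <= hold_min:
        return start
    return min(final, start + rate * (t_min - hold_min))


# Time at which the ramp reaches its final temperature.
ramp_end_min = 2.0 + (150.0 - 25.0) / 50.0

# Nominal acquisition length: 500 scans at 0.5 s per scan.
acquisition_min = 500 * 0.5 / 60.0

print(probe_temperature(3.0))  # 75.0
print(ramp_end_min)            # 4.5
print(round(acquisition_min, 2))
```

Under these idealized assumptions the acquisition ends slightly before the
ramp reaches 150 °C, so only compounds that survive the two minute hold are
analyzed during heating.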

The Finnigan MAT TSQ-7000 mass spectrometer provides an Instrument Control
Language (ICL) to assist operators in controlling the instrument. Once the
standard analysis conditions were determined, the settings were written into
an ICL procedure to help ensure that the same conditions were used from
acquisition to acquisition. The ICL procedure to collect normal mass spectra
is given in Figure 5.13. The normal mass spectra were used to check sample
quality and to determine which ions to analyze in MS/MS mode. The ICL
procedure to obtain product spectra was then used and is given in Figure 5.14.
The TSQ does not provide a mechanism for adjusting gas pressures (other than
on/off), so the target gas pressure must be set manually before analysis. The
variables echMW, echDAU, and echEM are user defined variables representing the
molecular weight of the analyte, the precursor m/z value to analyze, and the
electron multiplier setting to use, respectively. These variables are set on a
run by run and compound by compound basis.

    Q1MS                          // set instrument in Q1MS scan mode
    PROF                          // set instrument into profile scan mode
    S 10 (echMW + 5) 0.5          // scan from 10 to echMW + 5 in 0.5 sec.
    MSMSC = 0                     // set MS/MS correction factor
    COFF = -10                    // set collision energy
    SN = 0                        // reset scan counter
    ON                            // turn on source filament and detector
    EMULT = 800                   // set electron multiplier voltage
    APAUSE; ARESUME               // begin acquisition
    SPSTART                       // begin solids probe heating program
    WHILE SN < 500                // collect 500 scans (~4.2 minutes)
        GO; STOP
    END
    ASTOP                         // end acquisition
    OFF                           // turn off filament and detector
    Q1MS; SW 20                   // place instrument in stand-by mode

Figure 5.13 Instrument Control Procedure for the collection of normal mass
spectra.

    PROF                          // set instrument into profile scan mode
    DAU echDAU 10 (echDAU + 5)    // scan products of echDAU from 10
                                  // to echDAU + 5
    MSMSC = 70                    // set MS/MS correction factor
    COFF = -30                    // set collision energy
    ST = 0.5                      // set scan time
    SN = 0                        // reset scan counter
    ON                            // turn on source filament and detector
    EMULT = echEM                 // set electron multiplier voltage
    ..#emult = ; .emult           // display electron multiplier setting
    APAUSE; ARESUME               // begin acquisition
    SPSTART                       // begin solids probe heating program
    WHILE SN < 500                // collect 500 scans (~4.2 minutes)
        GO; STOP
    END
    ASTOP                         // end acquisition
    OFF                           // turn off filament and detector
    Q1MS; SW 20; ST = 0.5         // place instrument in stand-by mode

Figure 5.14 Instrument Control Procedure for the collection of product mass
spectra.

The MSMSC parameter, set in the ICL procedure shown in Figure 5.14, is an
empirical MS/MS correction factor applied to quadrupole 3 of the TSQ. The
MSMSC parameter can be set between 0 and 100%. Its purpose is to correct for
excess kinetic energy of ions exiting the collision chamber. This excess
kinetic energy is the result of partial conversion of the initial collision
kinetic energy of 30 V. Since the TSQ is tuned in normal MS mode for ions with
an energy of 10 V (to ensure ions have enough kinetic energy to reach the
detector), ions leaving the collision chamber with energies ranging from 10 V
to 30 V will not be filtered cleanly. This results in a degradation of peak
shape and loss of resolution in the product spectra. To counter this effect,
the offset of quadrupole 3 is adjusted. However, this effect is mass dependent
as well as dependent on the average number of collisions an ion is likely to
undergo in the collision chamber. If the MSMSC parameter is set too high, low
mass ions, which typically have lower energy, do not have enough kinetic
energy to enter quadrupole 3. As a result, there is no simple way to determine
the optimal MSMSC value. Rather, it is adjusted so as to obtain the best
trade-off between product spectral profile and spectral resolution, and this
value is used consistently for all product spectral collection. For this
research, the optimal MSMSC value was found to be 70%.

All spectra were collected in profile mode for the purpose of evaluating
spectrum quality and instrument performance.

Generating Reproducible Spectra

The major objective of the work presented in this chapter was to demonstrate
that the TSQ can be used to obtain reproducible product spectra. It is
imperative that this objective be satisfied or the ability to use relative
intensities in the characterization of product spectra by common pattern
grouping is placed in serious jeopardy. Spectral classification methods that
rely only on the presence or absence of spectral peaks are less susceptible to
irreproducibility provided analysis conditions are chosen such that all peaks
that are likely to occur in the product spectrum do occur above some threshold
level. This latter method is used by the ACES project [4] to alleviate the
need for such careful instrument performance characterization.

Figure 5.15 demonstrates the ability to obtain reproducible product spectra
on the TSQ mass spectrometer for the molecular ion of cyclohexanone. Spectra
(a) and (b) were obtained on the same day. Spectra (c) and (d) were obtained
on a different day with a different tune. The CID conditions for all four
analyses were -30 V collision energy and 3 × 10⁻⁶ torr collision gas pressure
as measured at the manifold ion gauge. These spectra show very good peak
intensity correlation.

Figure 5.16 shows the product spectra of the molecular ion of 3-heptanone.
Spectra (a) and (b) were collected on the same day while spectrum (c) was
collected on a different day with a different tune. Again, these spectra show
good peak intensity correlation. The largest variation occurs for m/z 72,
which is the McLafferty rearrangement product of m/z 114. This larger
variation is expected due to the sensitivity of the McLafferty rearrangement
reaction to CID conditions. Nevertheless, this intensity variation is still
within expected mass spectral reproducibility tolerances.

Figure 5.15 Product spectra for the molecular ion of cyclohexanone. (a) and
(b) were collected on the same day while (c) and (d) were
collected on a different day with a different instrument tune.
[Axes: relative intensity (%) versus m/z.]

Figure 5.16 Product spectra for the molecular ion of 3-heptanone. (a) and (b)
were collected on the same day while (c) was collected on a
different day with a different instrument tune. [Axes: relative
intensity (%) versus m/z.]

Post-Acquisition Data Processing and Storage

Once a standardized set of MS/MS operating conditions was established, it was
necessary to develop a protocol for spectral processing, storage, and
retrieval, as the MS library system provided with the TSQ data system was
inappropriate for handling MS/MS data.

When the collection of profile scans is complete for a compound, the scans
are visually inspected to ensure that they are of acceptable quality. This
entails checking the scans for anomalous noise spikes, unacceptably high
levels of white noise (perhaps as a result of low signal intensity), or
unresolved peaks caused by tune drift, improper sample utilization, or a
fouling of the instrument. Once this inspection is complete, a ten scan region
is selected for post-acquisition processing and inclusion into the data base.

For post-acquisition processing, the software provided with version 8.1 of
the TSQ-7000 data system was utilized. This software contained an extensive
assortment of programs to facilitate data processing and handling of the
proprietary file format used to store the raw data.

The post-acquisition processing procedure proceeds as follows. First, the
selected ten scans are averaged using the AVER program. The AVER program is
also used to clip the mass range upper limit so as to exclude the unfragmented
precursor ion peak. The resulting data file contains the single averaged scan.
This averaging step helps reduce white noise and smooth the spectral profile.
The TSQ-7000 also provides for real-time averaging during acquisition. This
method of averaging was not used as there was no way to prevent anomalous
scans from being included in the average. Since only the averaged scans are
stored to disk, it would be impossible to reliably identify those scans that
were appreciably contaminated by noise.

The next step in the processing procedure involved centroiding the data with
the CENT program to obtain the m/z-intensity pairs for the averaged spectrum.
Visual inspection of the centroid results was necessary as the centroiding
algorithm used in CENT is susceptible to splitting individual profile peaks
that contain certain characteristics of noise near the top of the profile
peak. The CENT program has also been found to occasionally assign incorrect
m/z and/or intensity values to the resulting centroid spectrum. In cases where
this anomalous behavior was observed, it was found that the problem could be
circumvented by first centroiding the ten profile scans and then averaging the
results with AVER. Last, the centroid data were written to disk in text file
format using the LIST program. A five percent threshold, relative to the most
intense product peak, was also applied using LIST to exclude low level noise
peaks.
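The averaging and thresholding steps can be summarized in a short Python
sketch. This is only an illustration of the arithmetic, not the AVER, CENT, or
LIST programs themselves, whose internals are not described here. The function
names, the treatment of a peak exactly at the threshold as retained, and the
example peak list are all assumptions made for the example.

```python
def average_scans(scans):
    """Average equal-length profile scans point by point."""
    if not scans:
        raise ValueError("at least one scan is required")
    length = len(scans[0])
    if any(len(scan) != length for scan in scans):
        raise ValueError("all scans must cover the same points")
    return [sum(points) / len(scans) for points in zip(*scans)]


def apply_relative_threshold(peaks, fraction=0.05):
    """Keep (m/z, intensity) pairs at or above `fraction` of the largest peak."""
    if not peaks:
        return []
    base = max(intensity for _, intensity in peaks)
    return [(mz, i) for mz, i in peaks if i >= fraction * base]


# Two hypothetical three-point profile scans, averaged point by point.
averaged = average_scans([[10.0, 200.0, 12.0], [14.0, 220.0, 8.0]])

# Hypothetical centroided product peaks (m/z, intensity in counts).
centroids = [(41, 350.0), (42, 900.0), (55, 1000.0),
             (69, 120.0), (70, 30.0), (83, 49.0)]
kept = apply_relative_threshold(centroids)

print(averaged)
print(kept)
```

With a base peak of 1000 counts, the five percent threshold is 50 counts, so
the hypothetical peaks at m/z 70 and 83 are excluded.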

The resulting text files are then transferred from the Digital Equipment
Corporation (Boston, MA) OSF (UNIX) based TSQ data system to a Microsoft
(Redmond, WA) Windows NT system using FTP. All further data related operations
were performed using software developed in the Windows NT environment.

Rather than develop a complete data base management package for MS/MS
spectra, the Microsoft JET version 3.0 data base engine was used. The JET data
base engine was developed as part of Microsoft Access, which is a Microsoft
Windows hosted data base environment that has become popular in the last few
years.

The software described in the remaining chapters was developed using
Microsoft Visual C++ versions 2.0 and 4.0. Communication between Visual C++
and the JET engine was accomplished using the Data Access Objects (DAO) and
Object Linking and Embedding (OLE) interface software through a custom
designed C++ class. The C++ class was designed to facilitate MS/MS spectral
manipulation and contains routines to parse the text formatted files produced
by the LIST program. This in turn helps automate the entry of product spectral
information into the data base.

Conclusions

The results of the studies detailed in this chapter indicate that a collision
chamber pressure of 0.6 mtorr (manifold pressure 5 × 10⁻⁶ torr) and a
collision energy of -30 eV are suitable for the generation of an MS/MS library
of small organic molecules. However, it should be noted that these values,
particularly the pressure, might not be appropriate for an instrument by
another manufacturer or even for a different model by the same manufacturer.
Until recently, quality control problems in the manufacture of ion optic
components and pressure gauge behavior would have made it uncertain whether
these same values would be optimal for use on another instrument of the same
model. Nevertheless, the shapes of the breakdown and pressure dependence
curves indicate that suitable spectra can still be obtained even if the
collision chamber pressure or collision energy differs from those utilized
here by a small amount.

Hopefully, further advances in the area of mass spectrometer design will
allow for even better between-instrument correlations. Unfortunately, it is
unlikely that there will ever be a single set of standard conditions for
low-energy CID analysis due to the wide range of conditions considered optimal
by different researchers and for different types of applications.

References

1. Dawson, P. H.; Sun, W. F. Int. J. Mass Spectrom. Ion Proc. 1983/1984,
   55, 155-170.

2. McLuckey, S. A.; Ouwerkerk, C. E. D.; Boerboom, A. J. H.; Kistemaker,
   P. G. Int. J. Mass Spectrom. Ion Proc. 1984, 59, 85-101.

3. Beussman, D. J.; Erickson, T. A.; Enke, C. G. in preparation for
   submission to Anal. Chem.

4. Hart, K. J. Ph.D. Thesis, Michigan State University, East Lansing,
   1989; Chapter 2.

5. Palmer, P. T. Ph.D. Thesis, Michigan State University, East Lansing,
   1989; Chapter 2.

6. Cross, K. P. Ph.D. Thesis, Michigan State University, East Lansing,
   1989; Chapter 5.

7. Busch, K. L.; Glish, G. L.; McLuckey, S. A. Mass Spectrometry / Mass
   Spectrometry: Techniques and Applications of Tandem Mass
   Spectrometry; VCH Publishers, Inc., New York, 1988.

8. Tandem Mass Spectrometry; McLafferty, F. W., Ed.; John Wiley and
   Sons: New York, 1983; p 126.

9. Nystrom, J. A.; Bursey, M. M.; Hass, J. R. Int. J. Mass Spectrom. Ion
   Proc. 1983, 55, 263-274.

CHAPTER 6

Development of Clustering Procedure for MSMS Data

Introduction

Cluster analysis is a powerful multivariate analysis tool that seeks to find
natural groupings in data. Tools of this sort can be particularly useful for
discovering patterns or groupings in data even in cases where very little is
known about the data or what the data represent.

For small data sets, one might be inclined to forgo the use of cluster
analysis in favor of the human eye and mind. Indeed, the mind is far superior
to computational methods in recognizing patterns provided the complexity of
the data is limited, the data set is relatively small, and the data can be
represented graphically. For exploratory analysis this is often the case.
Unfortunately, this situation does not often hold once more extensive analyses
are performed and computational methods become a requirement.

While this research is exploratory in many ways, the amount of data generated
by tandem mass spectrometry has necessitated implementing spectral
classification tools. Cluster analysis was initially selected for this purpose
because it does not require specific information about the ion formation or
fragmentation processes that occur in the mass spectrometer. Cluster analysis
has proven a suitable tool for this purpose. The focus of this chapter will be
the implementation and interpretation of cluster analysis on product mass
spectra.

Similarity Calculations on MS/MS Data

As described in Chapter 3, the first step in cluster analysis is to calculate
a sample inter-distance matrix. This requires selecting an appropriate
distance metric. For this work, it was decided to use the Euclidean distance
metric. The Euclidean distance metric was selected in part for its ease of
implementation and interpretation and for its prevalence in clustering
implementations. The Euclidean distance metric can also be applied to data
sets of one or more samples without concern about generating artifacts.

The general implementation of the Euclidean distance metric is to assign each
observation as a distance along an orthogonal vector in relation to all other
observations on the sample. This orthogonality is a requirement for reliable
calculation of the Euclidean distance. For this research, the product m/z
values with relative intensities above 2% were taken as observations on the
precursor ion. The intensities at each m/z were assumed to be orthogonal.
While this assumption is not strictly accurate, it is acceptable to a close
approximation. This assumption is substantiated by the reaction orders for the
CID process, which ranged from 1.4 to 1.6, presented in Chapter 5. At higher
target gas pressures, significantly above single collision conditions, the
formation of products of products during the CID process would require a
covariance calculation on the peak intensities to extract the orthogonal
components.

Another important consideration for calculating inter-sample distance
is the scale or magnitude of each observation. If observations are not
measured on a consistent scale, there exists the potential for some
observations to unduly inﬂuence the distance value. Therefore, observations
need to be scaled to a common magnitude. For mass spectral peaks or
observations, this can be accomplished by normalizing the intensity of each
peak to the most intense peak in the spectrum (the base peak). For this
research, product peaks were normalized to the most intense product, not the
nonfragmented precursor peak.
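
The preprocessing and distance steps described in this section can be sketched in a few lines of Python. This is only an illustration and not the software used in this work: the function names and the two toy spectra are hypothetical, and the sketch assumes that the 2% threshold is applied after normalizing to the most intense product peak.

```python
import math

def normalize_products(spectrum, precursor_mz, threshold=2.0):
    """Scale product peaks so the most intense product peak equals 100.

    spectrum: dict mapping integer m/z -> raw intensity.
    The unfragmented precursor peak is excluded before scaling, and peaks
    at or below `threshold` percent are dropped (assumed order of steps).
    """
    products = {mz: i for mz, i in spectrum.items() if mz != precursor_mz}
    base = max(products.values())
    scaled = {mz: 100.0 * i / base for mz, i in products.items()}
    return {mz: i for mz, i in scaled.items() if i > threshold}

def euclidean(a, b):
    """Euclidean distance; each m/z is treated as an orthogonal axis."""
    axes = set(a) | set(b)
    return math.sqrt(sum((a.get(mz, 0.0) - b.get(mz, 0.0)) ** 2 for mz in axes))

def distance_matrix(spectra):
    """Symmetric inter-sample distance matrix as a list of lists."""
    return [[euclidean(s, t) for t in spectra] for s in spectra]

# Hypothetical raw product spectra of an m/z 59 precursor.
raw_a = {59: 900.0, 31: 400.0, 43: 160.0, 29: 40.0, 27: 4.0}
raw_b = {59: 500.0, 29: 200.0, 31: 180.0, 41: 20.0}
spec_a = normalize_products(raw_a, precursor_mz=59)
spec_b = normalize_products(raw_b, precursor_mz=59)
matrix = distance_matrix([spec_a, spec_b])
```

A peak that is absent from one spectrum is treated as zero intensity on that axis, so spectra with different peak lists can still be compared.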

Clustering MS/MS Data

With the similarity metric selected and the distance matrix calculated,
the product spectra can be clustered. This was accomplished using a
combination of hierarchical and partitional clustering. Each of these
clustering methodologies has its strengths and weaknesses as described in
Chapter 3.

Hierarchical clustering provides a useful visualization of the
inter-sample similarity but does not provide much information on specific
clusters. In contrast, partitional clustering provides a lot of information on
clusters, but the results are not easily visualized. Furthermore, partitional
clustering methods, in particular the K-Means algorithm used here, require as
input the number of clusters into which the samples should be partitioned.
This number was obtained by examination of the hierarchical clustering
dendrogram and the substructures represented in the samples.

In the development of the MS/MS data base, compounds were selected
for inclusion based on availability and on the substructure(s) present in the
molecule and believed to be represented in the normal mass spectrum. For
this reason, the substructures represented were known beforehand and
limited to those commonly available from chemical suppliers.

The objective of the clustering was to see if the product spectra could
be grouped in accordance with the represented substructures. If this is
possible, rules could be formed relating the clusters with represented
substructures. For the m/z 59 precursor, based on substructures containing
carbon, hydrogen, and oxygen, the more commonly occurring substructures
are shown in Figure 6.1. These substructures are assigned a number to
indicate a particular substructure backbone of carbon and oxygen and two
letters. The ﬁrst letter indicates an atom in the substructure where cleavage
from the rest of the molecule occurred. The second letter is either ‘c’ or ‘o’,
indicating the atom (carbon or oxygen, respectively) that the substructure was
bonded to in the original molecule. Only substructures 1, 2, 3, 7, and 9 are
represented here. The cyclic substructures are not listed, as these
substructures are generally of lower interest and are rarely represented in
the normal mass spectrum. The remaining substructures, though structurally
possible, were not characterized because either representative compounds
were not readily available or the substructures were not represented in the
normal mass spectrum of those compounds that were available.

Figure 6.1 The m/z 59 substructures of interest. Potential bonding
positions are labeled a through d.

The hierarchical clustering dendrogram for m/z 59 by Euclidean
distance and single linkage clustering is shown in Figure 6.2. The compounds
whose m/z 59 product spectra are represented in Figure 6.2 are given in
Table 6.1.
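
The single linkage procedure that underlies a dendrogram of this kind can be sketched as follows. This is a minimal, unoptimized illustration in Python rather than the clustering software used in this work; the function name and the four-spectrum distance matrix are hypothetical.

```python
def single_linkage(dist):
    """Agglomerative single-linkage clustering on a distance matrix.

    Returns the merges in the order performed, each as
    (members_of_cluster_a, members_of_cluster_b, merge_distance).
    The merge distances are the heights at which branches join
    in the dendrogram.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical distances among four spectra: 0 and 1 are similar, as are 2 and 3.
dist = [
    [0.0, 5.0, 60.0, 70.0],
    [5.0, 0.0, 55.0, 65.0],
    [60.0, 55.0, 0.0, 8.0],
    [70.0, 65.0, 8.0, 0.0],
]
merges = single_linkage(dist)
```

In this toy case the two tight pairs join first, and the final merge occurs at the smallest distance between any member of one pair and any member of the other.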

The partitional clustering algorithm used was the K-Means algorithm
developed by Hartigan [1]. The methodology of this algorithm is described in
detail in Chapter 3 and will not be reiterated here.
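
For orientation only, the basic idea of K-Means partitioning is sketched below in Python. This sketch uses the simple alternating assignment and centroid update scheme, which is an assumption made for illustration and may differ in detail from the Hartigan implementation used in this work. The function name, the toy intensity vectors, and the choice of starting centroids are all hypothetical.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Basic K-Means: alternate cluster assignment and centroid update.

    points:  list of equal-length intensity vectors.
    centers: initial centroids; their count fixes the number of clusters.
    Returns (labels, centers).
    """
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the partition has converged
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the previous centroid if a cluster empties
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical relative intensities at m/z 15, 29, and 31 for six spectra.
points = [
    [100.0, 5.0, 1.0], [100.0, 0.0, 2.0],
    [0.0, 99.0, 90.0], [3.0, 100.0, 85.0],
    [5.0, 9.0, 100.0], [0.0, 10.0, 100.0],
]
labels, centers = kmeans(points, centers=[points[0], points[2], points[4]])
```

The number of clusters is an input rather than a result, which is why the dendrogram and the known substructures were needed to choose it.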

Figure 6.2 Hierarchical clustering dendrogram (Euclidean distance, single
linkage) for the m/z 59 product spectra.

Table 6.1 Compounds analyzed for m/z 59 and their representative
substructures. Duplications were used to investigate
reproducibility.

Compound Name                         Label    Analysis Date   Substructure
1,1-dimethoxyethane                   Cmpd35   10/22/95        1bo
1,2-bis(2-methoxyethoxy)ethane        Cmpd14   7/13/95         1ao
2-(2-butoxyethoxy)ethyl acetate       Cmpd24   10/6/95         7bc
2,4-dimethyl-2-pentanol               Cmpd16   7/14/95         2bc
2-methyl-3-pentanol                   Cmpd33   10/18/95        3cc
3-ethoxyethanol                       Cmpd25   10/6/95         1cc
3-ethoxypropionitrile                 Cmpd26   10/6/95         1cc
3-hexanol                             Cmpd22   9/27/95         3cc
3-methoxy-1-butanol                   Cmpd17   7/14/95         1bc
3-pentanol                            Cmpd27   10/6/95         3cc
4-hydroxy-4-methyl-2-pentanone        Cmpd11   7/13/95         2bc
bis(2-ethoxyethyl)ether               Cmpd18   7/14/95         1cc
bis(2-ethoxyethyl)ether               Cmpd23   9/27/95         1cc
bis(2-methoxyethyl)ether              Cmpd15   7/13/95         1ao
bis[2-(2-methoxyethoxy)ethyl]ether    Cmpd21   9/27/95         1ao
diethoxymethane                       Cmpd12   7/13/95         1co
diethoxymethane                       Cmpd20   9/13/95         1co
dimethyl malonate                     Cmpd5    5/30/95         9bc
dimethyl oxalate                      Cmpd7    5/30/95         9bc
carbonic acid dimethyl ester          Cmpd6    5/30/95         9bc
ethyl ether                           Cmpd30   10/9/95         1cc
isopropyl ether                       Cmpd13   7/13/95         2cc
malonic acid diisopropyl ester        Cmpd3    5/30/95         2cc
malonic acid diisopropyl ester        Cmpd8    6/8/95          2cc
malonic acid dimethyl ester           Cmpd9    6/8/95          9bc
methyl butyrate                       Cmpd31   10/9/95         9bc
methyl n-butyl ether                  Cmpd2    5/30/95         1ac
methyl n-hexanoate                    Cmpd1    5/30/95         9bc
methyl n-hexanoate                    Cmpd10   6/8/95          9bc
methyl n-hexanoate                    Cmpd32   10/9/95         9bc
methyl n-valeric acid                 Cmpd4    5/30/95         9bc
methyl octanoate                      Cmpd19   9/13/95         9bc
octanol                               Cmpd29   10/7/95         3ac
t-butanol                             Cmpd34   10/22/95        2bc
tert-amyl alcohol                     Cmpd28   10/6/95         2bc

From Figure 6.2 it can be seen that the product spectra are generally
delineated according to the represented substructure with some exceptions.
As a first approximation, one might be tempted to try to partitionally cluster
the spectra into as many groups as there are represented substructures.
However, from Figure 6.2 it can readily be seen that this would not be
appropriate because not all substructures give uniquely identifying product
spectra. This situation arises in part from the relatively small number of
fragments observed in low-energy CID for most precursor ions. Also, ions of
different substructures formed during the EI ionization process in the source
may rearrange to a common ionic structure before the CID reaction. This may
be due to increased stability of the common ionic structure or the result of
excess energy being imparted to the precursor ion during EI ionization.

As a result, a different approach was used to determine the number of
clusters to be formed by partitional clustering. Rather than start at the high
end, a low initial guess was used. This number was increased until groups
containing the same substructure began to be split apart. For the spectra in
Figure 6.2, this occurred at seven clusters. Therefore, six clusters was
determined to be optimal. Ideally, this process should be part of the
clustering algorithm. However, there are a number of assumptions that must
be satisfied regarding data quality and mixture spectra that could easily lead
such an automated approach astray. Consequently, the examination of the
cluster assignments is better accomplished in a supervised manner.
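
Although the examination was carried out under supervision, the stopping criterion itself can be written down. The Python sketch below is one possible formalization and is not a procedure taken from this work: it assumes that a group counts as "split" when all spectra of a substructure share one cluster at one value of k but not at the next. The function name and the toy labelings are hypothetical.

```python
def largest_k_before_split(labelings, substructures):
    """Return the last cluster count reached before an intact group splits.

    labelings:     dict mapping k -> list of cluster labels, one per spectrum.
    substructures: list of substructure codes, one per spectrum.
    A substructure is "intact" at k when all of its spectra share one
    cluster label. The search stops at the first k that breaks up a
    substructure that was intact at the previous k.
    """
    def intact(labels):
        seen = {}
        for lab, sub in zip(labels, substructures):
            seen.setdefault(sub, set()).add(lab)
        return {sub for sub, labs in seen.items() if len(labs) == 1}

    ks = sorted(labelings)
    for prev, k in zip(ks, ks[1:]):
        if intact(labelings[prev]) - intact(labelings[k]):
            return prev
    return ks[-1]

# Hypothetical substructure codes and partitions for five spectra.
substructures = ["9bc", "9bc", "2bc", "2bc", "1cc"]
labelings = {
    2: [0, 0, 1, 1, 1],
    3: [0, 0, 1, 1, 2],
    4: [0, 3, 1, 1, 2],  # the two 9bc spectra are separated here
}
best_k = largest_k_before_split(labelings, substructures)
```

A substructure that is already divided at the lowest k, as is the case for some substructures in the data discussed here, does not trigger the stop under this definition.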

The results of the partitional clustering are given in Table 6.2. An
interpretation of these results is given in the following section.

Interpretation of the Clusters

The next stage in the cluster analysis is to determine what, if any,
significance can be assigned to the cluster groupings. If the substructures
represented are randomly distributed in the clusters, then obviously no
substructure/cluster correlations can be made. However, the results
presented here indicate that it is possible in most cases to correlate the
clusters formed with the substructures derived from the neutral molecule
structures. The m/z 59 precursor ion structure must, therefore, be different
for several of these substructures. This is an important assumption since one
of the objectives of this work was to bypass the complexities involved in ion
formation, fragmentation, and rearrangement and to correlate the observed
mass spectral pattern directly with the substructures present in the neutral
molecule.

Table 6.2 Results for partitioning m/z 59 into six clusters.

Cluster 1

Members
Case       Distance
Cmpd3        42.408
Cmpd8        11.092
Cmpd11       26.544
Cmpd13       15.249
Cmpd16       11.057
Cmpd22       30.088
Cmpd27       31.302
Cmpd28        8.168
Cmpd33       26.444
Cmpd34        9.963

Statistics
Variable (m/z)    Minimum       Mean    Maximum  Std. Dev.
15                  0.000      5.145     11.470      4.372
27                  0.000      1.365      3.530      1.485
28                  0.000      0.000      0.000      0.000
29                  4.940      9.084     21.950      4.843
30                  0.000      0.000      0.000      0.000
31                100.000    100.000    100.000      0.000
32                  0.000      0.000      0.000      0.000
33                  0.000      0.000      0.000      0.000
39                  0.000      2.078      3.740      1.502
41                  8.490     20.618     25.520      4.718
42                  0.000      0.000      0.000      0.000
43                  8.410     39.454     79.800     23.403
44                 10.230     15.169     21.910      4.453
45                  0.000      0.000      0.000      0.000

Cluster 2

Members
Case       Distance
Cmpd29        0.000

Statistics
Variable (m/z)    Minimum       Mean    Maximum  Std. Dev.
15                 12.050     12.050     12.050      0.000
27                  3.230      3.230      3.230      0.000
28                  0.000      0.000      0.000      0.000
29                 10.640     10.640     10.640      0.000
30                  3.420      3.420      3.420      0.000
31                 76.350     76.350     76.350      0.000
32                  0.000      0.000      0.000      0.000
33                  0.000      0.000      0.000      0.000
39                  9.050      9.050      9.050      0.000
41                100.000    100.000    100.000      0.000
42                  0.000      0.000      0.000      0.000
43                 13.170     13.170     13.170      0.000
44                 12.300     12.300     12.300      0.000
45                  0.000      0.000      0.000      0.000

Cluster 3

Members
Case       Distance
Cmpd1         3.751
Cmpd2        16.234
Cmpd4         4.919
Cmpd5         4.919
Cmpd6         4.919
Cmpd7         4.919
Cmpd9         4.919
Cmpd10       10.114
Cmpd19        5.491
Cmpd31        3.000
Cmpd32        3.836

Statistics
Variable (m/z)    Minimum       Mean    Maximum  Std. Dev.
15                100.000    100.000    100.000      0.000
27                  0.000      0.000      0.000      0.000
28                  0.000      0.000      0.000      0.000
29                  0.000      4.717     20.840      6.582
30                  0.000      0.000      0.000      0.000
31                  0.000      1.025      5.720      1.913
32                  0.000      0.822      9.040      2.726
33                  0.000      0.000      0.000      0.000
39                  0.000      0.000      0.000      0.000
41                  0.000      0.000      0.000      0.000
42                  0.000      0.000      0.000      0.000
43                  0.000      0.463      2.710      1.032
44                  0.000      0.000      0.000      0.000
45                  0.000      0.000      0.000      0.000

Cluster 4

Members
Case       Distance
Cmpd21        6.837
Cmpd24       21.024
Cmpd35       15.752

Statistics
Variable (m/z)    Minimum       Mean    Maximum  Std. Dev.
15                100.000    100.000    100.000      0.000
27                  2.960      4.543      6.820      2.021
28                  0.000      0.000      0.000      0.000
29                 61.920     66.840     74.420      6.661
30                  0.000      0.000      0.000      0.000
31                 32.970     46.020     60.790     13.990
32                  0.000      0.000      0.000      0.000
33                  0.000      0.000      0.000      0.000
39                  0.000      0.000      0.000      0.000
41                  0.000      2.527      7.580      4.376
42                  0.000      0.000      0.000      0.000
43                 10.570     16.557     26.990      9.068
44                  0.000      2.553      7.660      4.423
45                  0.000      2.097      3.330      1.825

Cluster 5

Members
Case       Distance
Cmpd12       11.046
Cmpd14       27.148
Cmpd15       28.431
Cmpd18        2.789
Cmpd20        7.330
Cmpd23       15.061
Cmpd25        8.523
Cmpd26        9.146
Cmpd30       13.449

Statistics
Variable (m/z)    Minimum       Mean    Maximum  Std. Dev.
15                  0.000      0.704      3.540      1.410
27                  3.610      7.952     12.840      2.770
28                  0.000      0.000      0.000      0.000
29                 92.270     99.041    100.000      2.557
30                  0.000      0.000      0.000      0.000
31                 69.470     89.371    100.000     11.997
32                  0.000      0.000      0.000      0.000
33                  0.000      1.849     11.610      4.021
39                  0.000      0.000      0.000      0.000
41                  3.970     12.430     17.260      5.026
42                  0.000      0.000      0.000      0.000
43                  0.000      6.181     21.680      8.944
44                  2.090      2.897      4.020      0.679
45                  0.000      1.080      5.000      2.144

Cluster 6

Members
Case       Distance
Cmpd17        0.000

Statistics
Variable (m/z)    Minimum       Mean    Maximum  Std. Dev.
15                  0.000      0.000      0.000      0.000
27                 29.230     29.230     29.230      0.000
28                  0.000      0.000      0.000      0.000
29                100.000    100.000    100.000      0.000
30                  0.000      0.000      0.000      0.000
31                 53.960     53.960     53.960      0.000
32                  0.000      0.000      0.000      0.000
33                 40.290     40.290     40.290      0.000
39                  0.000      0.000      0.000      0.000
41                  2.850      2.850      2.850      0.000
42                  0.000      0.000      0.000      0.000
43                 23.770     23.770     23.770      0.000
44                  5.980      5.980      5.980      0.000
45                  0.000      0.000      0.000      0.000

By examination of the product spectra clustered here, there are
obvious spectral characteristics that have given rise to the observed
groupings. Sample product spectra for each of the represented substructures
are given in Figure 6.3. Partitional cluster 3 is characterized by the very
intense m/z 15 peak with only a small m/z 29 and 31. This is due to the
ready fragmentation of the precursor ion of members of this cluster to a
methyl ion and a stable neutral loss. This cluster is dominated by ions
expected to express substructure 9bc. This methyl ester precursor structure
would be expected to fragment to a methyl ion and a stable carbon dioxide
neutral loss.

Figure 6.3 Sample product spectra of m/z 59 for selected compounds listed
in Table 6.1, plotted as relative intensity (%) versus m/z. Panels:
Cmpd13 (cluster 1, 2cc); Cmpd27 (cluster 1, 3cc); Cmpd34 (cluster 1, 2bc);
Cmpd29 (cluster 2, 3ac); Cmpd2 (cluster 3, 1ac); Cmpd32 (cluster 3, 9bc);
Cmpd21 (cluster 4, 1ao); Cmpd24 (cluster 4, 7bc); Cmpd35 (cluster 4, 1bo);
Cmpd12 (cluster 5, 1co); Cmpd14 (cluster 5, 1ao); Cmpd26 (cluster 5, 1cc);
Cmpd17 (cluster 6, 1bc).

Partitional cluster 1 is characterized by an intense m/z 31 followed by
a fairly intense m/z 43 with smaller peaks at m/z 29, 41, and 44. Attempting
to isolate eight clusters from these data results in a splitting of this cluster.
This cluster contains members expected to produce precursor ions from
substructures 2bc, 2cc, and 3cc. As can be seen from the sample spectra in
Figure 6.3, compounds expected to express substructures 2bc or 2cc produce
nearly indistinguishable m/z 59 product spectra. From the sample spectrum
of a compound expected to express substructure 3cc, it can be seen that the
only real difference from the 2bc and 2cc spectra is the ratio of the m/z 43
and 44 peaks. Since the 43/44 ratio differences are all that really
differentiates these two substructures and those ratios are not particularly
dominant, it is easy to see why these two substructure types would cluster
together. From these results, it appears that substructures 2bc, 2cc, and 3cc
cannot be separated from each other based on their representative product
spectra by the clustering methodologies utilized here.

Partitional cluster 2 has all the same m/z values as cluster 1.
However, it is significantly different because of the m/z 41 base peak. This is
the only cluster that has such a strong m/z 41 relative intensity. The single
member of this cluster is expected to express the 3ac substructure. This
substructure, being a primary alcohol, could produce an m/z 41 ion by losing
H2O.

Partitional cluster 5 members are characterized by intense peaks at
both m/z 29 and 31 with much lower intensities at all other m/z peaks. This
cluster contains product spectra predominantly from compounds expected to
express substructures 1cc and 1co, which are ethyl ether substructures.

Partitional cluster 4 is characterized by an intense
m/z 15, resulting from methyl ion formation, and intermediate intensity
peaks at m/z 29 and 31. This cluster contains compounds expected to express
different substructures, specifically 7bc, 1bo, and 1ao. Substructure 7 is
significantly different from substructure 1, which likely indicates that there is
some rearrangement of the precursor ion to a common structure or that the
fragmentation patterns are similar.

Partitional cluster 6 is characterized by a base peak at m/z 29, intermediate
peaks at m/z 27, 31, 33, and 43, and the absence of a peak at m/z 15. This is
the only cluster with intermediate intensity peaks at m/z 27 and 33. The
compound contained in this cluster is expected to express substructure 1bc.
This cluster grouped closely with cluster 5.

Compounds expected to express substructure 1a are present in clusters
3, 4, and 5 with the compounds expected to express substructures 9b, 7b/1b,
and 1c, respectively. This would seem to indicate something unique about the
1a substructure. In particular, substructure 1ao is joined in two locations on
the dendrogram that appear to have relatively little in common. There are
several possible reasons for this discrepancy. For example, there may have
been some contamination of the 1ao samples, perhaps due to aging of the
samples. Another possibility is that the fragmentation of the molecular ion in
the EI source resulted in significant rearrangements such that more than one
m/z 59 substructure was analyzed. A third possibility is that substructure
1ao is more sensitive to variations in EI source and/or CID conditions and
this sensitivity is reflected in the product spectrum. For whatever reason, it
is readily apparent that, based on the data base product spectra,
substructure 1ao cannot be readily isolated from some of the other
substructures.

Clusters 2 and 6 only have one member in the group. Obviously, it is
difﬁcult to validate these as true clusters. However, the spectra for these two
clusters are signiﬁcantly different from the other spectra, particularly for
compound 29 (3ac). This difference is enough to support placing these
compounds, and the substructures they represent, into groups other than
those already present.

It is important to ensure that product spectra obtained do not result
from a combination of two or more m/z 59 precursor ion structures as this
can easily lead to blurred or overlapping clusters. In many cases, however,
even though a compound may be capable of producing more than one ion
structure of the same precursor m/z, one ion structure may dominate over
the others. This can be seen in the m/z 59 product spectrum for sec-butyl
alcohol (Figure 6.4), which can potentially produce a mixture of substructures
2ac and 3cc. The spectral pattern matches quite well with the 3cc sample
spectrum given in Figure 6.3. Nevertheless, this is not always the case, so
care must be taken when selecting compounds from which the
representative product spectra will be derived.

Figure 6.4 Mixture product spectrum for m/z 59 of sec-butyl alcohol.

Conclusions

The clustering results presented in this chapter demonstrate that it is
possible to group product spectra with a reasonable degree of delineation
according to substructure. However, these results also demonstrate that
there is enough similarity among product spectra for different substructures
to make full delineation to speciﬁc substructures impossible in some cases.
Nevertheless, the ability of this technique to reduce the number of candidate
substructures for an unknown to two or three substructures still provides
tremendously valuable information for structure elucidation.

References

1. Hartigan, J. A. Clustering Algorithms; John Wiley and Sons: New
York, 1975; Chapter 4.

CHAPTER 7

Product Ion Classification for Standards and Unknowns

Introduction

In Chapter 6, it was demonstrated that low-energy CID product mass
spectra could be classified into groups based on their m/z-intensity patterns.
These classifications further demonstrated that compounds that produced the
same ion in the EI ionization source in general produced similar product
spectra which formed a single cluster. It was also shown that different ions
generally gave different product spectra which formed into different clusters.

The objective of this chapter is to leverage the information obtained in
Chapter 6 to form the basis of a classification system for product spectra for
standards and unknowns. To accomplish this objective, it will be necessary to
determine those spectral features that make a cluster distinct from other
clusters. Also, it will be necessary to represent this information in a manner
suitable for the characterization of other product spectra.

Establishing Representative Descriptors

The establishment of representative descriptors for the cluster
information obtained in Chapter 6 is an important ﬁrst step towards general
product ion classiﬁcation. This process can be accomplished in a variety of
ways. For example, if the intent were to develop a binary or bit ﬁeld
representation for a spectrum, an intensity threshold might be applied. Any
peaks above the threshold would be assigned a value of unity while any
peaks below the threshold would be assigned a value of zero. Bit ﬁeld
encoding of normal mass spectra for the purpose of reduced storage and
matching has been performed by others [1,2]. Such an approach would,
however, be of little value for matching product mass spectra because product
spectra generally have few peaks most of which occur for most possible
precursor ions. In order for a product spectral matching system to work, it
must, therefore, retain the intensity information since peak ratios may be the
only available mechanism for differentiating among spectra.
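
The loss of discriminating information under bit field encoding can be illustrated with the cluster means reported in Table 6.2. The Python sketch below is a hypothetical illustration: the function name, the 5% threshold, and the restriction to five m/z values are assumptions made here, not choices taken from the cited bit field methods.

```python
import math

def bit_field(spectrum, mz_values, threshold):
    """Encode a spectrum as 0/1 flags: 1 where the peak exceeds the threshold."""
    return [1 if spectrum.get(mz, 0.0) > threshold else 0 for mz in mz_values]

MZ = [29, 31, 41, 43, 44]
# Mean relative intensities of clusters 1 and 2, taken from Table 6.2.
cluster1 = {29: 9.084, 31: 100.000, 41: 20.618, 43: 39.454, 44: 15.169}
cluster2 = {29: 10.640, 31: 76.350, 41: 100.000, 43: 13.170, 44: 12.300}

bits1 = bit_field(cluster1, MZ, threshold=5.0)
bits2 = bit_field(cluster2, MZ, threshold=5.0)
intensity_distance = math.sqrt(sum((cluster1[mz] - cluster2[mz]) ** 2 for mz in MZ))
```

On these five m/z values the two bit fields are identical, even though the base peak is m/z 31 in one cluster and m/z 41 in the other, while the intensity-based Euclidean distance between the same two mean spectra remains large.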

For this work, the most obvious descriptor system is to retain the
intensity information by calculating a representative mean spectrum for each
cluster. In fact, this information is already available from the partitional
clustering results shown in Table 6.2. For each cluster found by the
partitional clustering algorithm, the results include a list of the m/z values
present in at least one compound. For each m/z, the minimum, maximum, and
average intensity are listed, along with the standard deviation of the
intensity. The average intensity column, which represents the cluster mean,
was therefore used as the most appropriate descriptor for each cluster. The
standard deviation information is useful as it helps to characterize the
averaged peak intensities. The cluster mean spectra, with standard deviation
error bars, are shown in Figure 7.1.
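
A descriptor of this kind can be computed as sketched below in Python. The function name and the data layout are hypothetical. The use of the sample (n - 1) standard deviation is an inference: it reproduces the m/z 32 entry for cluster 3 in Table 6.2 (mean 0.822, standard deviation 2.726) if exactly one of the eleven members has a 9.040% peak and the rest have none, which is what the tabulated minimum, mean, and maximum imply.

```python
import statistics

def cluster_descriptor(spectra, mz_values):
    """Per-m/z (minimum, mean, maximum, sample standard deviation).

    spectra: list of dicts (m/z -> relative intensity); absent peaks count as 0.
    A single-member cluster is given a standard deviation of 0.
    """
    table = {}
    for mz in mz_values:
        column = [s.get(mz, 0.0) for s in spectra]
        sd = statistics.stdev(column) if len(column) > 1 else 0.0
        table[mz] = (min(column), statistics.mean(column), max(column), sd)
    return table

# Eleven spectra mimicking m/z 15 and m/z 32 of cluster 3 in Table 6.2.
spectra = [{15: 100.0, 32: 9.04}] + [{15: 100.0} for _ in range(10)]
descriptor = cluster_descriptor(spectra, [15, 32])
```

The mean column of such a table is the cluster mean spectrum, and the standard deviation column supplies the error bars.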

Discovery of Non-Discriminating Features

While the cluster representative product spectra discussed in the
previous section are, by themselves, very useful information to have for
product ion classification, these spectra are not necessarily optimal for
differentiation between product spectra. This is because some
peaks occur commonly in product spectra of the same precursor m/z at about
the same intensity. Since the goal is to find features that group together
the spectra that belong together and separate those that do not, common
non-differentiating features simply dilute this process. Therefore, the most
desirable course of action is to either reduce or remove these influences.

Figure 7.1 Cluster mean product spectra for m/z 59, plotted as relative
intensity (%) versus m/z for clusters 1 through 6. The error bars
indicate the standard deviation in relative intensity of the
cluster members. Clusters 2 and 6 are single member clusters
and have no error bars.

For example, m/z 27 is represented in all of the cluster mean spectra
at low intensity except for cluster 6 where it is just under 30% relative
intensity. It was expected that removing m/z 27 might improve the
clustering with the remaining data. After reducing the spectra down to m/z
15, 29, 31, 41, and 43, no reordering was found in either the hierarchical or
partitional clustering results. The reason for this is that the m/z values that
were removed were not the dominant features. As a result, they contributed
much less to the overall distance than the remaining features. The reduced
feature dendrogram is given in Figure 7.2. While the distance between
samples has increased in some cases, the dendrogram still strongly resembles
the full feature dendrogram shown in Figure 6.2.
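
The small contribution of the removed features can be checked on the cluster means in Table 6.2. The Python sketch below splits the squared Euclidean distance between the cluster 1 and cluster 5 mean spectra into per-m/z terms. Using cluster means rather than individual spectra is a simplification made here for illustration, and the function name is hypothetical.

```python
def squared_contributions(a, b):
    """Per-m/z terms of the squared Euclidean distance between two spectra."""
    axes = sorted(set(a) | set(b))
    return {mz: (a.get(mz, 0.0) - b.get(mz, 0.0)) ** 2 for mz in axes}

# Nonzero mean relative intensities of clusters 1 and 5, taken from Table 6.2.
cluster1 = {15: 5.145, 27: 1.365, 29: 9.084, 31: 100.000,
            39: 2.078, 41: 20.618, 43: 39.454, 44: 15.169}
cluster5 = {15: 0.704, 27: 7.952, 29: 99.041, 31: 89.371, 33: 1.849,
            41: 12.430, 43: 6.181, 44: 2.897, 45: 1.080}
REDUCED = {15, 29, 31, 41, 43}  # feature set retained in the text

contrib = squared_contributions(cluster1, cluster5)
total = sum(contrib.values())
kept = sum(value for mz, value in contrib.items() if mz in REDUCED)
fraction_kept = kept / total
```

For this pair of clusters the five retained m/z values account for roughly 98% of the squared distance, which is consistent with the observation that removing the other features left the clustering unchanged.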

While it appears that the non-discriminating features above do not
apply across the entire data base, there may be cases where they are useful
for direct cluster-to-cluster differentiation. For example, cluster 6 has a
significant m/z 33 peak. This peak is useful for differentiating cluster 6 from
the other clusters. However, in some situations, the reliability of these
features may be in doubt unless they have significant intensities. Therefore,
the most discriminating features in the m/z 59 product spectra given here
are those in the reduced set discussed here and in the cluster interpretation
section of Chapter 6.

Classiﬁcation of Unknowns by Spectral Matching

Figure 7.2 Reduced feature hierarchical clustering dendrogram for the
m/z 59 product spectra.

While spectral matching systems have been developed for use with
normal mass spectra [3,4], these approaches are optimized for the
feature-rich spectra typically obtained by EI mass spectrometry. It was,
therefore, decided to develop a spectral matching approach for this work
that is better suited for feature-poor low-energy CID spectra. Because such
product spectra often have the same m/z values represented, this approach
would have to rely less on m/z occurrence and more on intensity. However,
this approach would also need to handle peak intensity variations typical of
normal mass spectral analysis. Obviously no pattern matching or
classification approach is particularly well suited for handling differences
caused by experimental error.

To represent the intensities of the mass spectral peaks a membership
function was developed. The membership function is based on the principles
of fuzzy set theory introduced by Zadeh [5]. Applications of fuzzy sets to
problems of chemical signiﬁcance have been reported in the literature [6,7].
Several of these publications contain overviews of the relevant aspects of
fuzzy set theory [6,8].

Fuzzy set theory is based on the principles of traditional crisp set
theory where membership in a set is designated by the values zero (no
membership) or one (full membership). Fuzzy set theory extends this concept
to allow members to have values in the range of zero to one where zero
represents no membership and one represents full membership. Under crisp
set operations something may or may not be a member of a set as constrained
by the zero and one values. This is generally referred to in terms of A or
NOT A. Conversely, fuzzy set theory is referred to in terms of A and NOT A.
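
The distinction can be written compactly. The notation below (a universe X, a membership function for the set A, and the standard complement for NOT A) is the conventional one from fuzzy set theory and is supplied here for illustration; it does not appear in this form in the present work.

```latex
\mu_A : X \to \{0, 1\} \quad \text{(crisp set)}, \qquad
\mu_A : X \to [0, 1] \quad \text{(fuzzy set)}, \qquad
\mu_{\bar{A}}(x) = 1 - \mu_A(x)
```

For a crisp set, one of the two memberships is always zero, so an element belongs to A or to NOT A. For a fuzzy set with, for example, a membership of 0.7 in A, the same element also belongs to NOT A to degree 0.3.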


Applying fuzzy set theory membership functions to peak
intensities provides a useful mechanism by which the similarities in
intensity among peaks can be expressed. For this purpose, three membership
functions were investigated and are shown in Figure 7.3.

The first membership function is of the standard type used in many fuzzy
set implementations. The vertical scale runs from 0 to +1 indicating degree of
membership. The horizontal axis in this case is the relative intensity of the
library peak being compared. The relative intensity of the library peak is
centered on the plateau region. If the unknown peak is in the range of the
library peak plus or minus an intensity variability value, its membership is
one. This is done so as not to penalize the match quality due to normal
intensity variations. Outside of this region, the membership value falls off
linearly to zero, after which it remains zero.
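
A function of this shape can be sketched in Python as follows. The function and parameter names are hypothetical, and the width of the linear fall-off region (`falloff`) is an assumption, since the slope is not stated in the text. The default plateau half-width of 10% relative intensity is the intensity variability value adopted later in this chapter.

```python
def membership_standard(unknown, library, variability=10.0, falloff=10.0):
    """Plateau-shaped membership on a 0 to +1 scale.

    Returns 1 when the unknown intensity lies within library +/- variability.
    Beyond the plateau the value decreases linearly, reaching 0 after a
    further `falloff` intensity units (assumed width), and stays at 0.
    """
    excess = abs(unknown - library) - variability
    if excess <= 0.0:
        return 1.0
    return max(0.0, 1.0 - excess / falloff)

# A library peak at 40% relative intensity compared with three unknown peaks.
inside = membership_standard(45.0, 40.0)    # within the plateau
partial = membership_standard(55.0, 40.0)   # halfway down the slope
outside = membership_standard(70.0, 40.0)   # beyond the slope
```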

For this work, an intensity variability value was used to accommodate
the spectrum-to-spectrum variations routinely observed among mass spectra.
Another option would have been to use the standard deviation information
provided by the partitional clustering. However, the standard deviation
information can be misleading and is inappropriate in this implementation.

The intensity variability value determines the width of the match
window for a give peak. A wide window makes it easier for peaks to give a
positive match while a narrow window makes this outcome less likely. Using

the standard deviation to set the window width can have an undesirable side

180

 

 

 

 

l--|

 

C)
O
(D
O
p...
0"
O

0 ' 2b ' 4o

 

--O------db--

 

 

 

Membership

 

 

   

J7"

 

 

I ﬁ U

0 . 20 40
Relative Intensity (%)

80 ' 160

05
C

Figure 7 .3 Fuzzy member functions investigated for spectral matching.

181

effect. If the members of a cluster have features that are very close to each
other in intensity (the desirable situation), the average intensities would
have a correspondingly small standard deviation. This in turn will make it
harder for other spectra to positively match with this representative
spectrum. Conversely, the opposite effect can happen with more diffuse
cluster members. In this case, the poorer the quality of the cluster, the more
easily other spectra will match with the representative spectrum. Hence, the
more diffuse the cluster the better the chance of a match. The preferred
situation is for the opposite to occur. In this case, an intensity variability
value of 10% relative intensity was found suitable for all peak intensity
windows to equalize the probability of a match between representative
spectra and an unknown. For the purpose of evaluating the three
membership functions, the same intensity variability value was used in all
cases.

The second member function extends the scale from 0 down to -1 but
otherwise is the same as the first member function. In most fuzzy set
implementations the membership scale ranges from +1 to 0 where +1
indicates full membership and zero indicates no membership. However, this
approach is weighted towards membership because non-membership simply
makes no contribution as opposed to a negative membership influence. For
example, the membership scale can be interpreted as a continuous voting
scale. In most voting systems (albeit these are based on crisp set theory
principles) the +1 to zero membership scale would translate into +1 being a
yes vote and zero a no vote. The result, therefore, is almost always a positive
vote though it may be very small. Furthermore, a no vote is the same as an
abstention or no-influence position and contributes no negative influence. In
terms of pattern matching, the negative half of the membership scale
provides a mechanism for determining and expressing that two observations
are really not the same. Therefore, a +1 to -1 membership scale is
investigated in this work for membership functions two and three.
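
The voting interpretation can be made concrete with a small Python sketch.
It rests on several assumptions that are not taken from this work: the
function name, the falloff width, the choice to let the sloped edge continue
at the same slope until it reaches -1, and the two-peak example spectra.
The sketch only contrasts how a clear mismatch is counted on the two scales.

```python
def membership(unknown, library, variability=10.0, falloff=10.0, floor=0.0):
    """Trapezoidal membership clamped at `floor`.

    floor=0.0 gives the conventional +1 to 0 scale; floor=-1.0 gives a
    +1 to -1 scale. `falloff` and the constant slope are assumptions.
    """
    distance = abs(unknown - library)
    if distance <= variability:
        return 1.0
    return max(floor, 1.0 - (distance - variability) / falloff)


# Hypothetical peak intensities (% relative intensity) at two m/z values.
library = [50.0, 20.0]
unknown = [50.0, 90.0]  # the second peak clearly disagrees

unipolar = sum(membership(u, l) for u, l in zip(unknown, library))
bipolar = sum(membership(u, l, floor=-1.0) for u, l in zip(unknown, library))

print(unipolar)  # the mismatch abstains, so the total stays positive
print(bipolar)   # the mismatch votes against, cancelling the agreement
```

On the +1 to 0 scale the disagreeing peak contributes nothing, whereas on
the +1 to -1 scale it actively lowers the total.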

The third member function has a membership value that increases
from -1 to 0 at low intensities. Since there is no intensity weighting used in
this search, low intensity peaks contribute the same as high intensity peaks
to the final match factor. Due to the possibility that some low intensity peaks
may be noise, this member function seeks to reduce the negative influence of
such peaks if the library peak is intense at the m/z in question. If the library
peak intensity is not intense, the trapezoidal portion of the member function
that is the same as the second investigated member function takes
precedence.

A comparison of the three member functions is given in Table 7.1. This
was accomplished by matching selected unknown spectra against the cluster
mean spectra shown in Figure 7.1. The test unknown spectra were obtained
by removing them from the data base used to create the clusters. In order to
prevent undue influence of the match factors by matching a spectrum used to
determine the clusters, the cluster mean for which the spectrum is a member
was recalculated without the test spectrum. Specifically, compounds 14, 5,
11, and 22 were used to test the match quality of the three member functions.
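
The recalculation of a cluster mean without the test spectrum amounts to a
leave-one-out average. A minimal sketch follows, assuming that each spectrum
is stored as a list of relative intensities on a common m/z axis; the
function name and the three-member example cluster are hypothetical.

```python
def mean_without(spectra, held_out):
    """Mean spectrum of a cluster recomputed without one member.

    `spectra` is a list of equal-length intensity lists sharing one m/z
    axis; `held_out` is the index of the test spectrum to exclude.
    """
    rest = [s for i, s in enumerate(spectra) if i != held_out]
    if not rest:
        raise ValueError("cluster has no members left after removal")
    # Average each m/z position over the remaining members.
    return [sum(column) / len(rest) for column in zip(*rest)]


# Hypothetical cluster of three two-peak spectra.
cluster = [[100.0, 20.0], [80.0, 40.0], [60.0, 0.0]]
print(mean_without(cluster, 2))  # mean of the first two members only
```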

As can be seen from Table 7.1, of the tested member functions, the
second function proved to give the best separation between the expected
match and the incorrect matches. The match factors given in Table 7.1 range
from 0 to 1 indicating no membership to full membership respectively. The
match factors for Cmpd14 for clusters 5 and 6 are similar in value with
membership function two providing a slightly better match for cluster 5. It
was expected that if any of the compounds were to be incorrectly assigned to
a cluster, it would have been Cmpd14. Cmpd5, from cluster 3, matched up
with the adjusted cluster 3 mean perfectly while matching at 0.81 with
cluster 4 using the first and third member functions. Likewise, Cmpd11, from
cluster 1, and Cmpd22, also from cluster 1, matched into the correct cluster
by all three member functions. However, in all cases, member function two
gave the best distinction between the expected cluster match and the
remaining clusters.

Table 7.1 A comparison of the fuzzy matches of the m/z 59 product mass
spectra by the three membership functions shown in Figure 7.3.

Matched: Cmpd14 (expected Cluster 5 but potentially close to Cluster 6)
         Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6
First      0.73       0.84       0.71       0.79       0.88       0.82
Second     0.59       0.70       0.49       0.62       0.86       0.75
Third      0.59       0.77       0.56       0.69       0.86       0.82

Matched: Cmpd5 (expected Cluster 3)
         Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6
First      0.68       0.74       1.00       0.81       0.77       0.57
Second     0.46       0.53       1.00       0.67       0.55       0.19
Third      0.61       0.67       1.00       0.81       0.70       0.50

Matched: Cmpd11 (expected Cluster 1)
         Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6
First      0.93       0.77       0.70       0.64       0.84       0.57
Second     0.87       0.60       0.45       0.35       0.70       0.23
Third      0.87       0.60       0.53       0.46       0.74       0.39

Matched: Cmpd22 (expected Cluster 1)
         Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6
First      0.93       0.86       0.71       0.65       0.85       0.58
Second     0.86       0.76       0.53       0.42       0.78       0.29
Third      0.89       0.76       0.58       0.52       0.82       0.47
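
One way to quantify the separation discussed above is the margin between
the match factor for the expected cluster and the highest match factor for
any other cluster. The sketch below computes that margin from the values in
Table 7.1. The margin itself is an illustrative measure chosen here and the
names are hypothetical; it is not a statistic reported in this work.

```python
# Match factors from Table 7.1: compound -> (expected cluster, rows by function).
TABLE_7_1 = {
    "Cmpd14": (5, {"First":  [0.73, 0.84, 0.71, 0.79, 0.88, 0.82],
                   "Second": [0.59, 0.70, 0.49, 0.62, 0.86, 0.75],
                   "Third":  [0.59, 0.77, 0.56, 0.69, 0.86, 0.82]}),
    "Cmpd5":  (3, {"First":  [0.68, 0.74, 1.00, 0.81, 0.77, 0.57],
                   "Second": [0.46, 0.53, 1.00, 0.67, 0.55, 0.19],
                   "Third":  [0.61, 0.67, 1.00, 0.81, 0.70, 0.50]}),
    "Cmpd11": (1, {"First":  [0.93, 0.77, 0.70, 0.64, 0.84, 0.57],
                   "Second": [0.87, 0.60, 0.45, 0.35, 0.70, 0.23],
                   "Third":  [0.87, 0.60, 0.53, 0.46, 0.74, 0.39]}),
    "Cmpd22": (1, {"First":  [0.93, 0.86, 0.71, 0.65, 0.85, 0.58],
                   "Second": [0.86, 0.76, 0.53, 0.42, 0.78, 0.29],
                   "Third":  [0.89, 0.76, 0.58, 0.52, 0.82, 0.47]}),
}


def margin(scores, expected_cluster):
    """Expected-cluster match factor minus the best competing match factor."""
    expected = scores[expected_cluster - 1]
    best_other = max(s for i, s in enumerate(scores, start=1)
                     if i != expected_cluster)
    return round(expected - best_other, 2)


margins = {cmpd: {name: margin(row, expected) for name, row in rows.items()}
           for cmpd, (expected, rows) in TABLE_7_1.items()}
for cmpd, by_function in margins.items():
    print(cmpd, by_function)
```

For each of the four test compounds the second member function gives the
largest margin, which agrees with the comparison drawn from Table 7.1.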

Classification of Unknowns by Rule Application

One of the simplest tools for structure elucidation is that of basic
spectral pattern matching or library searching as it is commonly called. All
commercial mass spectrometers come equipped with a library search
package. However, such packages have significant deficiencies some of which
can be traced to the semantics of library searching.

In a library search an unknown spectrum is usually matched in a
forward or reverse manner with the spectra in the library data base. The
result of such an operation is one-dimensional with each library spectrum
being given as a distance from the unknown on whatever similarity scale is
used. As the size of the library is increased, the number of points along the
match distance vector also increases making it more difficult to choose one
library spectrum over another. This process contributes to the unreliability of
standard library search methods for mass spectral data which have recently
been examined by Stein and Scott [9].

A further deficiency in library search methodology is that it is
somewhat limited and inappropriate in some cases such as targeted
component analysis. A more advanced mechanism is to represent individual
spectra as rules relating compound structure to spectrum. To simply convert
all spectrum-structure correlations represented in the library to rules does
not provide significant benefit. It would still be necessary to test every rule
during the analysis of an unknown just as it is necessary to match every
spectrum in a library search. This assumes of course that no pre-filtering has
been implemented in either case. The significant benefit of a rule-based (or
expert system) approach to spectral matching is the interaction of rules
under the control of the inference engine. For example, if a spectrum for a
compound is in the data base more than once, each entry acts individually
under spectral matching methodologies. However, a rule-based approach can
combine these replicates to improve the reliability that a positive match has
occurred based on the certainty theory expression A + B - A x B where A and
B denote the confidence factors for each of two replicates.
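
The certainty theory expression can be illustrated with a short Python
sketch. The function names are hypothetical, and extending the two-replicate
expression to more replicates by repeated pairwise application is an
assumption made here for illustration. Confidence factors are taken to lie
between 0 and 1.

```python
from functools import reduce


def combine(a, b):
    """Combine two confidence factors using A + B - A x B."""
    return a + b - a * b


def combine_all(factors):
    """Fold any number of replicate confidence factors pairwise.

    Starting from 0.0 leaves a single factor unchanged.
    """
    return reduce(combine, factors, 0.0)


# Two replicates matched with confidences 0.8 and 0.6.
print(combine(0.8, 0.6))            # about 0.92, higher than either alone
print(combine_all([0.5, 0.5, 0.5])) # each added replicate raises the total
```

The combined value never falls below the larger of the two inputs and never
exceeds 1, so agreeing replicates can only strengthen a positive match.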

Improving Rules with Rule-Trees

In the previous section, it was discussed that rules, unless combined,
yield little benefit over standard library searching for mass spectral data.
However, a distillation of the rule information into decision-trees, referred to
here as rule-trees, can be very useful. Furthermore, the process of
constructing a rule-tree is straightforward in this implementation and can be
updated automatically as simple rules or spectra are added to the data base.
Rule-trees are also more efficient than standard spectral matching and even
than rule-based methods because a rule-tree only has as many decisions as
there are spectral features. In contrast, a spectral match must make the
same number of comparisons for each spectrum tested.

Figure 7.4 is a rule-tree for the m/z 59 product spectra constructed
based on examination of the representative product spectra. This rule-tree
was constructed in the following manner. First, the cluster mean spectra
were summed together to determine which m/z values are the most intense
across the clusters. Then, in order of decreasing summed intensity, the
clusters are compared for regions of separation among the clusters. For
example, cluster 3 is characterized by a base peak at m/z 15. In other
clusters, m/z 15 may be of low intensity or even completely absent. As a
result, there exists a good region of separation about which to form a decision
point. For each cluster that does not overlap another, there exists a region of
separation from which a leaf node is formed which indicates the end of a
decision branch. Where clusters overlap, the next most intense peak is
examined in the above manner. This process continues until all the clusters
have been assigned to leaf nodes or there are no more m/z features to
compare. At each decision or branch point, a confidence interval is assigned
to the decision. The critical decision point is taken to be the median position
between the clusters. The confidence interval is designated by plus or minus
an intensity value on the critical decision point. The confidence interval is
calculated by taking the region between the nearest proximity of the error
bars, defined by the standard deviations in relative intensity among cluster
members. Outside the critical decision point plus or minus the confidence
interval, one can be certain of which direction to branch. However, within the
median region, the decision confidence decreases to zero at the critical
decision point. To traverse the tree once it has been completed, one can begin
at the top if the unknown spectrum is available. However, for targeted
component analysis, one can begin at the cluster of interest and traverse
upwards to assemble a list of identifying peaks and intensities to monitor
during analysis.

[Figure: decision tree for precursor ion m/z 59, limited to commonly
occurring substructures containing only C, H, and O. Decision nodes with
yes/no branches: m/z 31 = 50% ± 20%; m/z 15 = 60% ± 30% (at two branch
points); m/z 29 = 30% ± 15%. Leaf nodes: Clusters 1 through 6.]

Figure 7.4 Rule-tree for m/z 59 product spectra.
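
The placement of a decision point and the confidence attached to a branch
can be sketched as follows. Several details are assumptions made only for
illustration: the critical point is taken as the midpoint of the gap between
the error bars, the confidence is assumed to rise linearly from zero at the
critical point to one at the edge of the interval, and the function names
and example cluster statistics are hypothetical.

```python
def decision_point(low_mean, low_sd, high_mean, high_sd):
    """Return (critical point, half-width) for two clusters at one m/z.

    The region of separation runs between the nearest ends of the error
    bars. Returns None when the error bars overlap.
    """
    lower_edge = low_mean + low_sd    # top of the lower cluster's error bar
    upper_edge = high_mean - high_sd  # bottom of the upper cluster's error bar
    if upper_edge <= lower_edge:
        return None  # clusters overlap; examine the next most intense peak
    critical = (lower_edge + upper_edge) / 2.0
    half_width = (upper_edge - lower_edge) / 2.0
    return critical, half_width


def branch(intensity, critical, half_width):
    """Return the side of the critical point and an assumed linear confidence."""
    side = "above" if intensity >= critical else "below"
    confidence = min(1.0, abs(intensity - critical) / half_width)
    return side, confidence


# Hypothetical cluster statistics that give a 50% +/- 20% decision point.
print(decision_point(20.0, 10.0, 80.0, 10.0))
print(branch(80.0, 50.0, 20.0))  # outside the interval: certain
print(branch(31.0, 50.0, 20.0))  # inside the interval: less than certain
```

A peak measured inside the interval still selects a branch, but with reduced
confidence, so such decisions are the ones most worth checking.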

Validation of the rule-tree was accomplished by testing various m/z 59
product spectra against the rule-tree. The compounds most likely to not be
properly described by the rule-tree are those that are farthest from the
cluster means due to their greater overall relative intensity difference from
the mean. For example, in cluster 1, Cmpd3 is the farthest from the cluster
mean as shown in Table 6.2. However, it is correctly assigned to cluster 1.
Likewise, compounds 11, 22, 27, and 33, which were the next farthest from
the mean, were also assigned correctly. Compounds 2 and 10 were tested and
assigned correctly to cluster 3. Compounds 24 and 35 from cluster 4 were
tested. Cmpd24 was correctly assigned but Cmpd35 was incorrectly assigned
to cluster 3. The reason for this is that the m/z 31 peak for this compound
was measured at 31%. From cluster 5, compounds 14, 15, and 23 were tested
and were assigned correctly to cluster 5. Therefore, with one exception, this
m/z 59 rule-tree successfully classified all tested compounds. Given that
these compounds were the farthest from the cluster mean, it is likely that all
other compounds analyzed here would also be properly assigned. Hence, this
rule-tree can be used with a high degree of reliability for determining likely
candidate substructures for m/z 59 low-energy product spectra.

Rule-trees are potentially very powerful as an on-line tool for targeted
component analysis particularly where sample quantities are low.
Unfortunately, rule-trees are significantly more difficult to generate than
simple or combination rules and may not be possible to generate cleanly in
all cases.

Conclusions

The above methods have proven to be effective for the classification of
product spectra and the recommendation of candidate substructures. The
ideal case would be for every substructure to produce a unique ion that in
turn would yield a unique product spectrum when analyzed by MS/MS
methods. Unfortunately, and obviously, this is not the case. Therefore,
without more clarifying information, the best case scenario is to identify
those substructures that do produce a particular product spectral pattern.
The tools developed here accomplish this goal in a satisfactory manner. The
m/z 59 clusters discovered in this work and the substructures believed to be
expressed in them are given in Figures 7.5 to 7.9.

While rule-trees have distinct advantages over spectral matching,
spectral matching, with fuzzy intensity regions, can also be an important tool
for classifying product mass spectra. In particular, the spectral matching
approach used here can be used to obtain an estimate of the membership of
an unknown product mass spectrum into all the clusters identified so far.

The methodologies developed here do not, by themselves, provide
complete chemical structure elucidation. Rather, the focus of this work was to
develop methods for classifying product spectra and recommending candidate
substructures. Structure generators have been developed by others [10,11] to
generate all possible candidate molecular structures given a series of
substructures and other constraints. As tandem mass spectrometry becomes
more accepted for routine chemical analysis, MS/MS databases are bound to
follow. The tools developed here were conceived to provide assistance to the
mass spectrometric investigator at such a time.

[Figure: structure drawings of Compounds 3, 8, 11, 13, 16, 22, 27, 28, 33,
and 34, with the m/z 59 substructures present.]

Figure 7.5 The members and expected substructures for m/z 59 cluster 1.

[Figure: structure drawings of Compound 29 (cluster 2) and Compound 17
(cluster 6), with the m/z 59 substructure for each cluster.]

Figure 7.6 The members and expected substructures for m/z 59 clusters 2
and 6 respectively.

[Figure: structure drawings of Compounds 1, 2, 4, 5, 6, 7, 9, 10, 19, 31,
and 32, with the m/z 59 substructures present.]

Figure 7.7 The members and expected substructures for m/z 59 cluster 3.

[Figure: structure drawings of Compounds 21, 24, and 35, with the m/z 59
substructures present.]

Figure 7.8 The members and expected substructures for m/z 59 cluster 4.

[Figure: structure drawings of Compounds 12, 14, 15, 18, 20, 23, 25, 26,
and 30, with the m/z 59 substructures present.]

Figure 7.9 The members and expected substructures for m/z 59 cluster 5.

References

1. Varmuza, K. Fresenius’ Z. Anal. Chem. 1976, 282, 129-134.

2. Scott, D. R. Anal. Chim. Acta. 1988, 211, 11-29.

3. McLafferty, F. W.; Hertel, R. H.; Villwock, R. D. Org. Mass Spectrom.
1974, 9, 690.

4. Damen, H.; Henneberg, D.; Weimann, B. Anal. Chim. Acta. 1978, 103,
289-302.

5. Zadeh, L. A. Fuzzy Sets. Information and Control 1965, 8, 338-353.

6. Blaffert, T. Anal. Chim. Acta. 1984, 161, 135-148.

7. Otto, M. Anal. Chim. Acta. 1993, 283, 500-507.

8. Bandemer, H.; Otto, M. Mikrochim. Acta. 1986, 2, 93-124.

9. Stein, S. E.; Scott, D. R. J. Am. Soc. Mass Spectrom. 1994, 5, 859-866.

10. Buchanan, B.; Feigenbaum, E. Artificial Intelligence 1978, 11, 5.

11. Blaffert, T. Anal. Chim. Acta. 1986, 191, 161-168.