EFFECTS OF DATA PRETREATMENT ON THE MULTIVARIATE STATISTICAL
ANALYSIS OF CHEMICALLY COMPLEX SAMPLES
By
John William McIlroy

A THESIS
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Criminal Justice—Master of Science
2014

ABSTRACT
EFFECTS OF DATA PRETREATMENT ON THE MULTIVARIATE STATISTICAL
ANALYSIS OF CHEMICALLY COMPLEX SAMPLES
By
John William McIlroy
Multivariate statistical procedures, such as principal component analysis (PCA),
are often utilized to differentiate and associate a large number of complex samples
consisting of thousands of variables. When samples with similar chemical compositions
are compared, chemical differences between samples are often overshadowed by non-chemical variation. Therefore, in order to provide meaningful statistical comparisons and
differentiate complex and highly similar samples, these non-chemical sources of variation
must be minimized, often accomplished by implementing data pretreatment procedures.
In this work, ten diesel samples from different service stations were analyzed in
triplicate by gas chromatography-mass spectrometry. The resulting chromatograms were
processed with data pretreatment procedures, including baseline correction, smoothing,
retention time alignment, and normalization, to evaluate the enhanced discrimination in
PCA achieved by minimizing non-chemical variation. For each pretreatment procedure,
metrics were developed to evaluate the effect on the chromatogram as well as the PCA
results. Normalization and alignment resulted in the greatest enhancement in association
of replicate samples, while smoothing and baseline correction were shown to have
minimal effect. By applying data pretreatment procedures, replicate samples were
closely associated with one another and differentiated from the other diesel samples,
allowing for differentiation of complex and similar samples.
ACKNOWLEDGEMENTS

I would like to thank everyone who has helped me throughout my years at Michigan
State while completing my Master’s degree. I owe a huge thank you to Ruth Smith for
all of her help, guidance, and training during the years as my Master’s degree advisor.
She has also provided me with numerous opportunities for learning and experience. I
would also like to thank Vicki McGuffin and Dan Jones for their guidance on completion
of this project. As my Ph.D. advisors, they provided many of the suggestions and
guidance throughout the development of this research. They have greatly helped to
develop my skills as a scientist. Thank you to Ruth Smith, Vicki McGuffin, and Steven
Dow for also taking time to serve on my committee.
I owe a debt of gratitude to all of the McGuffin group students, Forensic Master’s
students, and Chemistry Graduate students who have helped me while at Michigan State.
In particular, I would like to thank Lucas Marshall for analyzing the diesel samples that
were utilized in this work and Steve Halpin for helping to write several of the Matlab
algorithms that were utilized for this work.
I would also like to thank my friends and family who have supported me throughout
my education. I would especially like to thank my wife Katie for all of her support.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: INTRODUCTION
Chemical Analysis of Forensically Significant Samples
1.1.1. Forensic Chemistry
1.1.2. Gas Chromatography-Mass Spectrometry
1.1.2.1. Gas Chromatography Theory
1.1.2.2. Mass Spectrometry Theory
1.1.2.3. GC-MS Chromatogram
1.1.3. Data Pretreatment Procedures Applied to Chromatographic Data
Statistical Analysis of Chromatographic Data
1.2.1. Principal Component Analysis
1.2.2. Application of PCA to Differentiate Complex Samples
Evaluation of Data Pretreatment to Enhance Multivariate Statistical Analysis
Research Objective
Impact on the Forensic Science Community
REFERENCES
CHAPTER 2: INITIAL ANALYSIS OF DIESEL SAMPLES
Introduction
Selection of Samples
GC-MS Parameters
Visual Assessment of Diesel Chromatograms
PCA of Diesel Chromatograms
Summary
REFERENCES
CHAPTER 3: NORMALIZATION
Introduction
Methods Tested and Evaluation Metrics
3.2.1. Total Area Normalization
3.2.2. Single Peak Normalization
3.2.3. Evaluation Metrics
Effect of Normalization on Chromatographic Data
3.3.1. Visual Assessment
3.3.2. Quantitative Assessment
Effect of Normalization on PCA Scores Plot
3.4.1. Visual Assessment
3.4.2. Quantitative Assessment
Summary
REFERENCES
CHAPTER 4: BASELINE CORRECTION
Introduction
Methods Tested and Evaluation Metrics
4.2.1. Background Subtracted Baseline (BSB)
4.2.2. Subtraction of Extracted Ion Profiles
4.2.3. Subtraction of the Baseline using a Modeled Function
4.2.4. Evaluation Metrics
Effect of Baseline Correction on Chromatographic Data
4.3.1. Visual Assessment
4.3.1.1. Background Subtracted Baseline
4.3.1.2. Subtraction of Extracted Ion Profiles
4.3.1.3. Subtraction of the Baseline using a Modeled Function
4.3.2. Quantitative Assessment
Effect of Baseline Correction on PCA Scores Plot
4.4.1. Visual Assessment
4.4.2. Quantitative Assessment
Summary
REFERENCES
CHAPTER 5: SMOOTHING
Introduction
Methods Tested and Evaluation Metrics
5.2.1. The Savitzky-Golay Smooth
5.2.2. The Fast Fourier Transform Smooth
5.2.3. Metrics Used for Evaluation
Effect of Smoothing on Chromatographic Data
5.3.1. Visual Assessment
5.3.2. Quantitative Assessment
Effect of Smoothing on PCA Scores Plot
5.4.1. Visual Assessment
5.4.2. Quantitative Assessment
Summary
REFERENCES
CHAPTER 6: ALIGNMENT
Introduction
Methods Tested and Evaluation Metrics
6.2.1. Peak-Matching Alignment Algorithm
6.2.2. Correlation Optimized Warping Algorithm
6.2.3. Target Selection
6.2.4. Evaluation Metrics
Effect of Retention Time Alignment on Chromatographic Data
6.3.1. Visual Assessment
6.3.2. Quantitative Assessment
6.3.3. Target Selection
Effect of Retention Time Alignment on PCA Scores Plot
6.4.1. Visual Assessment
6.4.2. Quantitative Assessment
Summary
REFERENCES
CHAPTER 7: CONCLUSIONS AND FUTURE WORK
Conclusions
Future Work

LIST OF TABLES

Table 2-1. Diesel samples collected for this work, including the service station
and the date of collection.
Table 4-1. The average percent change in the clustering (PCC) of replicates after
the listed pretreatment procedures including baseline correction using the
extracted ion profiles (EIP fit) and normalization using total area (Area) and
single peak (Peak) normalization methods.
Table 5-1. Percent change in each metric for different smoothing parameters.
The parameters are grouped based on the level of smoothing.
Table 5-2. The average percent change in the clustering (PCC) of replicates after
the listed pretreatment procedures including baseline correction using the
extracted ion profiles (EIP fit), smoothing using fast Fourier transform
smooth with 2 points (FFT2) and normalization using total area (Area) and
single peak (Peak) normalization methods.
Table 6-1. Percent change in the standard deviation of the peak maxima of
selected peaks (PC-SDRT) and the sum of the percent change in the PPMC
coefficients (PC-PPMC) for different window sizes using the peak-matching
alignment algorithm. A decrease in the PC-SDRT of the retention time or
an increase in the sum of the PC-PPMC indicates an improvement in
alignment.
Table 6-2. Percent change in the standard deviation of the peak maxima of
selected peaks (PC-SDRT) and the sum of the percent change in the PPMC
coefficients (PC-PPMC) for varying warp and segment sizes using the COW
alignment algorithm. A decrease in the PC-SDRT of the retention time or
an increase in the sum of the PC-PPMC indicates an improvement in
alignment.
Table 6-3. Percent change in the standard deviation of the peak maxima of
selected peaks (PC-SDRT) using the correlation optimized warping
alignment algorithm with a warp of 2 and a segment size of 75, with each
sample chromatogram as well as the average chromatogram serving as the
target. A decrease in the PC-SDRT of the retention time indicates an
improvement in alignment.
Table 6-4. The sum of the percent change of the Pearson product moment
correlation coefficients (PC-PPMC) using the correlation optimized
warping alignment algorithm with a warp of 2 and a segment size of 75,
with each sample chromatogram as well as the average chromatogram
serving as the target. An increase in the sum of the PC-PPMC indicates an
improvement in alignment.
Table 6-5. The average percent change in the clustering of replicates (PCC) after
the listed pretreatment procedures including baseline correction using the
extracted ion profiles (EIP fit), smoothing using fast Fourier transform
smooth with 2 points (FFT2), alignment using the correlation optimized
warping algorithm with a warp of 2 and a segment of 75 (COW 2, 75) and
normalization using total area (Area) and single peak (Peak) normalization
methods.

LIST OF FIGURES

Figure 1-1. Diagram of gas chromatograph-mass spectrometer.
Figure 1-2. Example chromatograms generated for ignitable liquids including
diesel fuel (a), lighter fluid (b), and gasoline (c).
Figure 1-3. A representative chromatogram of diesel fuel with the normal alkane
peaks labeled (a) and an expanded region of the pentadecane (C15) peak
(b). The blue circles indicate points where mass spectra were collected.
The red line indicates the point at which the mass spectrum in Figure 1-4 was
taken.
Figure 1-4. The mass spectrum (scan at retention time = 39.219 min, indicated
by the red line in Figure 1-3b) of pentadecane (molecular weight = 212 amu).
Figure 1-5. PCA scores plot (a) and loadings plot (b) of diesel fuel (yellow),
lighter fluid (blue), and gasoline (red) shown in Figure 1-2.
Figure 2-1. A representative diesel chromatogram of each diesel fuel sample (1
- 10) with the normal alkanes labeled. Octane was detected at low
abundance, but was not labeled. Labels y and z are used to indicate two
clusters of peaks from substituted aromatic compounds observed in diesel
1 and 2.
Figure 2-2. Chromatograms of three replicates of diesel 5 (a) and an expanded
region of the chromatogram on the undecane peak (b). The inset shows
the baseline at the end of the chromatogram.
Figure 2-3. An overlay of one chromatogram from each of the eight diesel
samples (a) and an expanded region of the chromatogram on the undecane
peak (b). The insets in part a show the baseline at the end of the
chromatogram. Each color represents a different diesel sample.
Figure 2-4. PCA scores plot of 10 diesel samples in triplicate. Each diesel
sample is represented by a different color and shape: Diesel 1 (dark red
ovals), Diesel 2 (grey 4-point stars), Diesel 3 (red circles), Diesel 4 (orange
squares), Diesel 5 (yellow diamonds), Diesel 6 (light blue triangles), Diesel
7 (green crosses), Diesel 8 (dark blue inverted triangles), Diesel 9 (purple
pentagons), and Diesel 10 (pink 5-point stars).

Figure 2-5. Loading plots for PC1 (a) and PC2 (b) after PCA analysis of diesels
1 - 10. The labels y and z correspond to compounds that were provisionally
identified as branched alkanes and substituted aromatic compounds.
Figure 2-6. PCA scores plot of diesels 3 - 10 in triplicate. Each diesel sample is
represented by a different color and shape: diesel 3 (red circles), diesel 4
(orange squares), diesel 5 (yellow diamonds), diesel 6 (light blue triangles),
diesel 7 (green crosses), diesel 8 (dark blue inverted triangles), diesel 9
(purple pentagons), and diesel 10 (pink 5-point stars).
Figure 2-7. Loading plots for PC1 (a) and PC2 (b) after PCA analysis of diesels
3 - 10. The inset shows an expanded region of the undecane (C11) peak to
show the derivative-shaped peak, which is characteristic of misalignments.
Figure 3-1. An expanded region of the hexadecane peak (C16) in triplicate
analysis of a diesel sample, before normalization (a), after total area
normalization (b), and after selected peak normalization (c).
Figure 3-2. PCA scores plot of eight diesel chromatograms in triplicate prior to
the application of data pretreatment (a) and after total area normalization
(b). Each diesel is represented by a different shape and color.
Figure 3-3. Loadings plot for PC1 (a) and PC2 (b) after PCA with total area
normalization. The inset in part b shows a derivative-shaped peak,
indicative of misalignments.
Figure 3-4. An expanded region of dodecane in three replicate chromatograms of
diesel 5 before (a) and after (b) area normalization (R2 and R3 are directly on
top of one another). Each replicate is indicated by a different color (R1: red,
R2: blue, R3: green).
Figure 3-5. PCA scores plot of eight diesel chromatograms in triplicate prior to
the application of data pretreatment (a) and after single peak normalization
(b).
Figure 3-6. Loadings plot for PC1 (a) and PC2 (b) after PCA with single peak
normalization.
Figure 3-7. An expanded region of dodecane in three replicate chromatograms
of diesel 5 before (a) and after (b) peak normalization. Each replicate is
indicated by a different color (R1: red, R2: blue, R3: green).
Figure 4-1. Representative mass spectrum from a diesel chromatogram (Diesel
1) at retention time 108.335 minutes, the last scan in the chromatogram.
Figure 4-2. Extracted ion chromatograms for ions present in the last mass
spectral scan (from Figure 4-1) including mass-to-charge (m/z) 73 (a), m/z
96 (b), m/z 133 (c), m/z 191 (d), m/z 207 (e), m/z 208 (f), m/z 209 (g), m/z 281
(h), and m/z 282 (i). The extracted ion profile generated from these ions is
also shown (j).
Figure 4-3. Model generated for baseline of a diesel chromatogram, based on
Equation 4-1. The a term is the initial height of the function, the b term is
the transition height, the c term is the retention time at which the inflection
point of the curve occurred, and the d and e terms control the shape of the
curve.
Figure 4-4. The signal that was subtracted from the TIC using the BSB method
(a), the EIP (b), and the function fit by the EIP (c).
Figure 4-5. The baseline of the TIC before pretreatment (a) and pretreatment
using the BSB method (b), the EIP (c), and the function fit by the EIP (d).
Figure 4-6. Scores plots of eight diesels in triplicate without any pretreatment
(a) and after baseline correction (b).
Figure 4-7. Loadings plot for PC1 (a) and PC2 (b) after baseline correction.
Figure 4-8. Scores plots of eight diesels in triplicate after total area
normalization (a) and after baseline correction followed by total area
normalization (b).
Figure 4-9. Loadings plot for PC1 (a) and PC2 (b) after baseline correction and
area normalization.
Figure 5-1. A representative diesel chromatogram showing the TIC (black) (a)
and EICs (b) of m/z 132 for tetralin (blue) and m/z 148 for pentylbenzene
(red).
Figure 5-2. An expanded region of 1,3,5-trimethylbenzene in a representative
diesel chromatogram after baseline correction (a) and after baseline
correction and smoothing, using FFT 2 (b). The inset on the left is a further
expanded region of the baseline, demonstrating the point-to-point
variation before and after smoothing. The inset on the right shows the
region at the end of the chromatogram, including the region defined as
noise.
Figure 5-3. An expanded region of a diesel chromatogram without smoothing
(black line) and with smoothing (red line) using a Savitzky-Golay
smoothing algorithm. Part a shows a good smooth (polynomial order of 4
and 11 total points) while part b shows the broadening of peaks, decrease
in peak height, and artifacts on the peak edges associated with
oversmoothing (polynomial order of 6 and 31 total points).

Figure 5-4. A log-log plot of the standard deviation of the noise region versus
the total number of points in the smooth. Different smoothing parameters
are represented by each symbol: FFT (), SG 1st order polynomial (), SG
2nd order polynomial (), SG 4th order polynomial (), SG 6th order
polynomial (▼). Groupings were assigned based on the standard deviation
in the noise region after smoothing. The color represents the groups in
Table 5-1.
Figure 5-5. PCA scores plot of eight diesel chromatograms in triplicate after
baseline correction (a) and after smoothing using FFT 2 (b).
Figure 5-6. Loadings plot for PC1 (a) and PC2 (b) after PCA smoothing.
Figure 5-7. PCA scores plot of eight diesel chromatograms in triplicate after
baseline correction and normalization (a) and after baseline correction,
smoothing, and normalization (b).
Figure 5-8. Loadings plot for PC1 (a) and PC2 (b) after PCA smoothing and
normalization.
Figure 6-1. An expanded region of chromatograms of three diesel samples
analyzed in triplicate, each represented by a different color, before
alignment. The peaks correspond to 1,3,5-trimethylbenzene (9.20 min)
and decane (9.48 min).
Figure 6-2. An expanded region of the 1,3,5-trimethylbenzene peak in
chromatograms of three diesel samples analyzed in triplicate, each
represented by a different color, before alignment. The individual data
points are shown as black circles. In this example, peak maxima are
shifted by approximately three data points.
Figure 6-3. The same expanded region of chromatograms of three diesel
samples from Figure 6-1, each represented by a different color, after
alignment using the peak-matching algorithm (a) and the correlation-optimized
warping algorithm (b).
Figure 6-4. An expanded region of the phytane peak in chromatograms of three
diesel samples analyzed in triplicate, each represented by a different color,
before alignment (a) and after alignment using the peak-matching
algorithm with a window size of 10 (b).
Figure 6-5. PCA scores plot of eight diesel chromatograms in triplicate after
baseline correction and smoothing (a) and after baseline correction,
smoothing, and alignment (b).
Figure 6-6. Loadings plot for PC1 (a) and PC2 (b) after baseline correction,
smoothing, and alignment.
Figure 6-7. An expanded region of dodecane in three replicate chromatograms
of Diesel 5 before (a) and after (b) alignment. Each replicate is indicated
by a different color (replicate 1: red, replicate 2: blue, replicate 3: green).
Figure 6-8. PCA scores plot of eight diesel chromatograms in triplicate after
baseline correction, smoothing, and normalization (a) and after baseline
correction, smoothing, alignment and normalization (b).
Figure 6-9. Loadings plot for PC1 (a) and PC2 (b) after baseline correction,
smoothing, alignment, and normalization.
Figure 6-10. An expanded region of dodecane in three replicate chromatograms
of diesel 5 after baseline correction, smoothing, and alignment (a) and after
baseline correction, smoothing, alignment, and normalization (b). Each
replicate is indicated by a different color (R1: red, R2: blue, R3: green).
Figure 7-1. Loadings plot for PC2 prior to applying data pretreatment (a) and for
PC1 after applying baseline correction, smoothing, alignment, and
normalization (b).

CHAPTER 1: INTRODUCTION
Chemical Analysis of Forensically Significant Samples
Evidence found at a crime scene is a crucial aspect of many police investigations
and is often required in court to establish guilt or innocence. Forensic scientists examine
the evidence, draw a conclusion from the results obtained, and provide their conclusion
in court as an expert’s opinion. The examination of the evidence often consists of an
identification or a comparison between a questioned and known sample [1-3]. Often the
comparisons are facilitated by instrumental analyses, which generate a chemical
fingerprint of the samples for comparison [2, 4]. The forensic scientist must then
determine whether the questioned and known samples are consistent with one
another (i.e., a “match”).
Even when the conclusions that forensic scientists draw are based on scientific
tests, all testing has errors and uncertainties associated with the measurement. These
must be taken into account by the forensic scientist when forming their expert opinion. In
addition, the opinions of the forensic scientists are susceptible to outside influence and
human bias [3]. A 2009 report from the National Academy of Sciences (NAS) identified
the need to address the “accuracy, reliability, and validity” of forensic testing to help
reduce testing error and bias [3]. In order to address this concern, forensic research
began to focus on the use of statistical procedures to aid in comparison of chemical
fingerprints, assign a statistical confidence to forensic tests, and help minimize errors
and human bias [5-18].

1.1.1. Forensic Chemistry
The area of forensic science that is of most interest to analytical chemists is the
application of instrumental analysis to forensically relevant samples, either in forensic
toxicology or forensic chemistry [2]. Forensic toxicologists generally examine biological
fluids for the presence of drugs or poisons and relevant metabolites. Forensic chemists
generally utilize analytical techniques such as gas chromatography-mass spectrometry
(GC-MS) and infrared spectroscopy to analyze physical evidence such as fire debris,
explosives, and controlled substances [2].
Because of the expansive range of evidence types that a forensic chemist may analyze,
this work will focus on a single, complex example: the GC-MS analysis of ignitable liquids
(specifically diesel) for the detection of accelerants in fire debris. GC-MS is one of the
most common analytical instruments in a forensic laboratory. It combines a separation
step with an identification step (discussed in Section 1.1.2) and is used to confirm
the identity of a compound [2, 4, 19]. Diesel fuel will provide a forensically relevant sample
that is chemically complex, consisting of hundreds of compounds. Further, diesels from
different sources vary in chemical composition, based on the refinery at which the fuel
was produced and additives from the individual service stations at which the fuel was
obtained. The composition and properties of diesel fuel used for this work will be further
discussed in Chapter 2.

1.1.2. Gas Chromatography-Mass Spectrometry
GC-MS is a hyphenated technique that combines gas chromatography, which
separates compounds in a mixture based on boiling point or polarity, and mass
spectrometry, which breaks compounds into fragments that are characteristic of the
compound. Each compound has a reproducible retention time as well as a unique
and reproducible fragmentation pattern, under specific and controlled conditions. The
retention time and fragmentation pattern are then utilized to determine the identity of the
compound [1, 20]. The GC-MS instrument (Figure 1-1) used in this research is similar to
those found in forensic laboratories. The output from the GC-MS is a chromatogram,
which contains peaks for compounds present in the sample. Example chromatograms of
several ignitable liquids including diesel fuel (a), lighter fluid (b), and gasoline (c) are
shown in Figure 1-2. Characteristic compounds are labeled in each chromatogram.
1.1.2.1. Gas Chromatography Theory
Chromatography is a broad class of analytical techniques that is used to separate
sample mixtures. In all chromatography methods, the mixture is dissolved into a mobile
phase which is moved across a stationary phase. Compounds in the mixture interact
differentially with the stationary phase [19]. Compounds that interact more with the
stationary phase are more retained, while compounds that interact less with the stationary
phase move through the chromatography system quickly and are retained to a lesser
extent [19]. As a result of the different extents of interaction with the stationary phase,
compounds in a sample mixture are separated, resulting in a chromatogram (Figure 1-2).
Compounds with similar chemical properties will generally elute close to one another.

Figure 1-1. Diagram of gas chromatograph-mass spectrometer. (Labeled components:
gas chromatograph with injection port, oven, and column; transfer line; mass
spectrometer with ionization source, quadrupole, and detector.)

Figure 1-2. Example chromatograms generated for ignitable liquids
including diesel fuel (a), lighter fluid (b), and gasoline (c). (Each panel plots
abundance versus retention time from 0 to 25 min. Peak labels in the original figure
include C9 through C20, toluene, C2-alkylbenzenes, and C3-alkylbenzenes.)

One of the most common chromatography systems is the gas chromatograph. In
GC analysis, the sample is volatilized at high temperature in the
injector port of the instrument (Figure 1-1). The gaseous mixture is transferred onto a
column using a carrier gas, typically helium, hydrogen, or nitrogen, which is the mobile
phase in the separation [19-21]. The stationary phase is contained on the inside of the
column. The column is typically inside a temperature-controlled oven, and temperature
can be varied during the analysis to change the speed and efficiency of the separation
[19]. The compounds from the sample interact with the stationary phase in the column,
causing separation. The same compound should have the same retention time from
sample to sample on the same instrument and under the same experimental conditions.
The effluent from the column then travels into a detector.
1.1.2.2. Mass Spectrometry Theory
One of the most common detectors in forensic science is the mass spectrometer
[1, 2]. The MS not only detects the compound as it elutes from the GC, but it can also
help to identify the compound [19]. In GC-MS, the end of the column is positioned directly
into the ion source of the MS, via a transfer line (Figure 1-1). The MS is under high
vacuum, in order to allow molecules from the column to traverse the MS, without colliding
with air molecules. While samples are introduced at atmospheric pressures, the low flow
rates utilized in capillary GC allow the vacuum pumps in the MS to remove the air and
mobile phase molecules, resulting in a high vacuum.
In order to be analyzed by MS, the compounds must first be ionized. In this work,
compounds were ionized using electron ionization (EI). In the ion source, a heated
filament produces high-energy electrons, which are accelerated across the ionization
space towards an anode [19, 22]. The molecules from the sample traverse the ionization
space perpendicular to the electron beam. As the sample molecules pass close to the
beam, some of the energy is transferred from an electron to a molecule, which causes
the molecule to ionize (forming a positive ion through the loss of an electron). However,
more energy is often imparted to the molecule than is required for ionization. The
additional energy causes the molecule to fragment. Each compound fragments in a unique
and reproducible manner under these conditions, allowing for identification of the compound
using a known standard or a reference database [19, 22].
After the molecules have been ionized, the mass of each ion is determined using
a mass analyzer. In this work, a quadrupole mass analyzer was used. The quadrupole
typically has four cylindrical metal rods. Positive ions are directed from the ion source
through a series of electrostatic lenses and focused into the quadrupole [20]. The
quadrupole has a direct current (DC) potential applied to each rod as well as an oscillating
radio frequency (RF) potential. Opposite pairs of rods are electrically connected, with adjacent
rods always having opposite polarity for both the DC and RF potentials [20]. The electric
field caused by the combination of DC and alternating RF potentials results in ions moving
along the quadrupole in a corkscrew trajectory [19]. Only ions with a narrow range of
mass-to-charge (m/z) ratio can pass through the quadrupole at a given set of DC and RF
potentials. The DC and RF potentials can be scanned, which allows a range of m/z values to
pass [22]. Ions that pass through the mass analyzer strike the conductive surface of an
electron multiplier, creating a cascade of electrons which are then detected and converted
into an electronic signal [20].

1.1.2.3. GC-MS Chromatogram
After GC-MS analysis, a chromatogram of the data is generated, consisting of an
array of abundances at discrete retention time points. The chromatogram indicates the
time at which individual compounds elute from the GC. The abundance at each retention
time point in the chromatogram originates from the mass spectrum generated in the MS.
Each spectrum shows the ions resulting from the fragmentation of the compound that
eluted at that point. The fragmentation pattern relates to the structure of that compound,
which allows for identification. The total ion chromatogram is the sum of the abundance
of all m/z at each retention time point [1].
Another example of a GC-MS chromatogram of diesel fuel is shown in Figure 1-3a.
This diesel fuel was analyzed using a slower temperature program than the fuels shown
in Figure 1-2. The major normal alkane peaks are labeled for reference. Figure 1-3b
shows an expanded region of the pentadecane (C15) peak. The points indicate where
mass spectra were collected and summed to create the total ion chromatogram (TIC)
abundance. The red line shows where the mass spectrum in Figure 1-4 was obtained.
The mass spectrum shows the m/z value of the molecular ion and fragment ions resulting
from this compound.
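The relationship between the individual mass spectra and the TIC can be illustrated with a minimal Python sketch. The scan data, the variable names, and the function `total_ion_chromatogram` are hypothetical and are not taken from this work; each scan is assumed to be stored as a mapping of m/z to abundance.

```python
# Each scan is one mass spectrum: a mapping of m/z value -> abundance.
# All numbers are invented for illustration only.
scans = [
    {57: 1200.0, 71: 800.0, 85: 500.0},
    {57: 4000.0, 71: 2600.0, 85: 1700.0, 212: 150.0},
    {57: 1500.0, 71: 900.0, 85: 600.0},
]
retention_times = [39.20, 39.22, 39.24]  # minutes, one value per scan


def total_ion_chromatogram(scans):
    """Return the TIC: the summed abundance of all m/z values in each scan."""
    return [sum(spectrum.values()) for spectrum in scans]


tic = total_ion_chromatogram(scans)
for time, abundance in zip(retention_times, tic):
    print(f"{time:.2f} min  TIC abundance = {abundance:.0f}")
```

Each element of `tic` corresponds to one point in the chromatogram, analogous to the points marked on the expanded pentadecane peak in Figure 1-3b.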

Figure 1-3. A representative chromatogram of diesel fuel with the normal alkane
peaks labeled (a) and an expanded region of the pentadecane (C15) peak (b). The
blue circles indicate points where mass spectra were collected. The red line
indicates the point at which the mass spectrum in Figure 1-4 was taken. (Panel a plots
abundance versus retention time from 0 to 100 min, with peaks labeled C10 through C22;
panel b spans 39.1 to 39.3 min.)

Figure 1-4. The mass spectrum (scan at retention time = 39.219 min, indicated by
the red line in Figure 1-3b) of pentadecane (molecular weight = 212 amu). (The spectrum
plots abundance versus m/z from 50 to 210, with labeled ions at m/z 57, 71, 85, 99, and 212.)

1.1.3. Data Pretreatment Procedures Applied to Chromatographic Data
The signals in the GC-MS chromatogram can vary from one analysis to the next.
Data pretreatment procedures are often employed to correct for these non-chemical
variations in chromatographic data [9, 23, 24]. Differences in overall abundance between
chromatograms analyzed on the same instrument can result from variation in sample
preparation, sample injection, chromatographic conditions, and instrument response [23,
25]. Normalization procedures are commonly applied to correct for these differences.
Another source of variation is a rising baseline toward the end of the chromatogram,
particularly at high temperature, which is caused by the degradation of compounds in the
septum, injection port, and GC stationary phase. Differences in stationary phase age and wear
can lead to a rise and variation in the signal over time [23]. Baseline correction
procedures are often necessary to minimize this variation. Noise, the high-frequency
fluctuation in the signal, is another source of variation, which is hard to identify visually
when there is a high signal-to-noise ratio. Noise is often the result of instrumental and
electronic variation, and can be corrected using a smoothing procedure [23]. Peaks in
the chromatogram can elute at slightly different retention times, due to instrumental
variation in flow rates, column degradation, and manual injection procedures. These
variations can be corrected by applying a retention time alignment algorithm [8, 23].
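As an illustration of how a pretreatment step removes non-chemical variation, the following Python sketch applies total-area normalization to two hypothetical replicate chromatograms that differ only in overall abundance. Total-area normalization is only one common choice and is used here as an assumption; the function name and the data are invented for this example and do not represent the procedures evaluated in this work.

```python
def normalize_to_total_area(chromatogram):
    """Divide each abundance by the summed abundance of the chromatogram."""
    total = sum(chromatogram)
    if total == 0:
        raise ValueError("cannot normalize a chromatogram with zero total abundance")
    return [abundance / total for abundance in chromatogram]


# Two hypothetical replicates with the same peak profile; the second has
# 1.5 times the abundance (for example, from a larger injection volume).
replicate_1 = [0.0, 10.0, 40.0, 10.0, 0.0]
replicate_2 = [0.0, 15.0, 60.0, 15.0, 0.0]

normalized_1 = normalize_to_total_area(replicate_1)
normalized_2 = normalize_to_total_area(replicate_2)

print(normalized_1)
print(normalized_2)
```

After normalization the two replicates have the same profile, so the difference in overall abundance no longer contributes variance to a subsequent statistical comparison.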

1.2. Statistical Analysis of Chromatographic Data
After generating the chemical fingerprint, multivariate statistical procedures can be
utilized to compare many samples simultaneously. Multivariate statistical procedures are
widely used as a research tool and have been applied to a variety of complex samples to
help reveal underlying patterns in the data, including applications in lipidomics,
metabolomics, proteomics, and petroleomics [12, 25-35]. The most commonly applied
multivariate statistical procedure is principal component analysis (PCA). PCA serves as
the basis of many other multivariate statistical procedures [36, 37]. Often, data
pretreatment procedures are applied prior to any multivariate statistical analysis to
minimize non-chemical sources of variation.

1.2.1. Principal Component Analysis
PCA is an unsupervised multivariate statistical procedure that helps to identify
underlying relationships within complex datasets without any prior knowledge about the
data [37]. Often, PCA can identify small differences between samples, which can be
overlooked by simple visual inspection of the data [8, 24]. In PCA, latent variables are
used to reduce the dimensionality of the data, allowing for the visualization of
relationships between samples [9, 36, 38]. The main outputs from PCA are scores plots,
which show relationships between samples, and loadings plots, which show the
importance of each variable.
In PCA, variables that vary together (covariance) are identified and grouped using
eigenanalysis. These groups are identified as the principal components (PCs), which are
orthogonal and uncorrelated linear combinations of the original variables. From the
covariance matrix, the eigenvectors and eigenvalues are calculated. The eigenvector for
each PC contains the weights of each variable that define that PC, while the eigenvalue
is a measure of the amount of variance that a particular eigenvector describes [9]. The
eigenvector with the highest eigenvalue is the first PC. The mean-centered data are then
multiplied by each eigenvector in order to obtain the scores for the samples on that PC
[9, 13, 23, 36, 37]. In this work, the entire chromatogram of each sample was utilized, so
each retention time point serves as a variable.
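The eigenanalysis described above can be sketched in a few lines of Python using NumPy. This is a minimal illustration under stated assumptions, not the software used in this work: the function `pca_eig`, the simulated Gaussian "chromatograms", and the noise level are all hypothetical.

```python
import numpy as np


def pca_eig(data, n_components=2):
    """PCA by eigenanalysis of the covariance matrix.

    data: samples in rows, variables (retention time points) in columns.
    Returns the scores, the loadings (eigenvectors), and the fraction of
    variance described by each retained PC.
    """
    centered = data - data.mean(axis=0)              # mean-center each variable
    covariance = np.cov(centered, rowvar=False)      # variables x variables
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # ascending order
    order = np.argsort(eigenvalues)[::-1]            # largest eigenvalue first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    loadings = eigenvectors[:, :n_components]
    scores = centered @ loadings                     # project samples onto the PCs
    explained = eigenvalues[:n_components] / eigenvalues.sum()
    return scores, loadings, explained


# Simulated data: two "fuels", each with three noisy replicates of one peak.
rng = np.random.default_rng(0)
points = np.arange(50)
profile_a = np.exp(-0.5 * ((points - 15) / 3.0) ** 2)
profile_b = np.exp(-0.5 * ((points - 35) / 3.0) ** 2)
data = np.vstack(
    [profile_a + 0.01 * rng.standard_normal(50) for _ in range(3)]
    + [profile_b + 0.01 * rng.standard_normal(50) for _ in range(3)]
)

scores, loadings, explained = pca_eig(data)
print("Scores on PC1:", np.round(scores[:, 0], 3))
print("Fraction of variance on PC1 and PC2:", np.round(explained, 4))
```

In this simulated case the replicates of each fuel receive similar scores on PC1 while the two fuels receive scores of opposite sign, which is the clustering behavior that a scores plot is used to visualize. The sign of each eigenvector is arbitrary, so the positive and negative directions may be swapped between implementations.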
The scores for each sample on the first few PCs can be plotted, which allows for
the visualization of clustering patterns. These scores plots (Figure 1-5a), which are a
projection of the data onto a lower dimensional space, can then be used to infer
relationships among samples [23]. Samples positioned close together in the new PC
space are more similar and are associated, while samples that are positioned further
apart are different and discriminated from one another [23]. Generally, the chemical
differences between samples provide the greatest sources of variance [37]. However,
when PCA is applied to chemically similar and highly complex samples, non-chemical
sources of variation tend to be the greatest sources of variation.
The eigenvector, or weight, for each PC can also be plotted against each variable
resulting in a loadings plot (Figure 1-5b) [23]. Variables with the largest positive or
negative weightings contribute the most to the positioning of the samples on the scores plot [23,
39]. Loadings plots can be used to infer which variables in the sample are changing or
differing among the samples, as these variables will be given the most weight [6].

Figure 1-5. PCA scores plot (a) and loadings plot (b) of diesel fuel
(yellow), lighter fluid (blue), and gasoline (red) shown in Figure 1-2. (In the scores
plot, PC1 accounts for 63.9% and PC2 for 34.9% of the variance. The loadings plot shows
the PC1 loadings versus retention time from 0 to 25 min.)

The PCA plots shown in Figure 1-5 were generated using diesel fuel, lighter fluid,
and gasoline, each extracted in triplicate with each extract analyzed in triplicate, creating
nine total samples for each fuel (Figure 1-2). On the PCA scores plot (Figure 1-5a),
replicates are positioned close together, while different fuel samples are positioned further
apart from one another. This demonstrates that chemically similar samples (i.e., the
replicates) are clustered, while chemically different samples (i.e., the different fuels) are
positioned further apart.
The loadings plot in Figure 1-5b can be used to identify the variables that are
differentiating the samples on the scores plot. For example, many of the compounds
found in gasoline (Figure 1-2) load negatively on PC1. This
explains why gasoline is positioned negatively on PC1 in the scores plot (Figure 1-5a).
Many of the compounds found in diesel fuel (Figure 1-2) load positively on PC1,
explaining why diesel fuel is positioned positively on PC1 in the scores plot. Similar logic
can be used to explain the position of the lighter fluid and all of the samples on PC2.
1.2.2. Application of PCA to Differentiate Complex Samples
There are many areas of research in which PCA and other multivariate statistical
procedures are used to differentiate complex samples. As an example, an on-going
project in our lab focuses on the use of PCA to associate fire debris with a corresponding
ignitable liquid reference standard for use in fire debris analysis [5, 7, 40-42]. Hupp et al.
demonstrated that TICs and extracted ion chromatograms (EICs) were useful in differentiating
diesel fuels from different service stations using PCA, after applying alignment and
normalization [7]. However, in that study, distinguishing between chemical variation and
non-chemical variation was challenging because no replicates were used in the PCA. This also
made evaluating the improvements from data pretreatment very challenging.
Marshall et al. utilized the TIC and extracted ion profiles (EIPs) for the
differentiation of five diesel fuels, analyzed in triplicate [5]. EIPs are the sum of several
extracted ion chromatograms (EICs) that are characteristic for a compound class. EICs
are plots of the abundance of a single m/z at each retention time. In addition, Marshall et al.
investigated association of a diesel residue, extracted from a cloth matrix, with the neat
diesel using PCA. They showed that replicates of diesel samples could be clustered,
but only after retention time alignment and normalization. However, association of the
diesel residue with the neat liquid was not possible, and the authors suggested that
additional data pretreatment procedures would be necessary to further minimize
non-chemical sources of variation [5]. In addition, this work demonstrated strategies for
identifying retention time misalignments, including a characteristic derivative-shaped
peak in the PCA loadings plot.
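The definitions of EICs and EIPs given above can be expressed as a short Python sketch. The scan data and function names are hypothetical, and the choice of m/z 57 and 71 to represent one compound class is an assumption made only for illustration; it is not the ion selection used in the cited studies.

```python
# Each scan is one mass spectrum: a mapping of m/z value -> abundance.
# All numbers are invented for illustration only.
scans = [
    {57: 1200.0, 71: 800.0, 91: 50.0},
    {57: 4000.0, 71: 2600.0, 91: 75.0},
    {57: 1500.0, 71: 900.0, 91: 400.0},
]


def extracted_ion_chromatogram(scans, mz):
    """Abundance of a single m/z at each retention time point."""
    return [spectrum.get(mz, 0.0) for spectrum in scans]


def extracted_ion_profile(scans, mz_values):
    """Sum of several EICs chosen to represent one compound class."""
    eics = [extracted_ion_chromatogram(scans, mz) for mz in mz_values]
    return [sum(point) for point in zip(*eics)]


eic_57 = extracted_ion_chromatogram(scans, 57)
eip_example = extracted_ion_profile(scans, [57, 71])  # assumed ion selection

print("EIC (m/z 57):", eic_57)
print("EIP (m/z 57 + 71):", eip_example)
```

Because an EIP retains only the ions selected for a compound class, signal from other ions (m/z 91 in this example) does not contribute to the profile.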
Baerncopf et al. used PCA to differentiate replicate GC-MS chromatograms of six
different ignitable liquids (gasoline, diesel, lamp oil, adhesive remover, torch fuel, and
paint thinner). By applying only retention time alignment and normalization,
chromatograms of residues of each liquid were associated with the neat fuel after being
spiked onto carpet and burned, which simulated burning at an arson scene [40]. This
shows promise for associating fire debris with a neat source. However, the fuels used in
this study are very different in chemical composition, making the chemical variation much
greater than the non-chemical variation in the data.

Prather et al. investigated the effects of weathering the fuel in addition to matrix
interferences on association of burned fire debris samples to a neat liquid, again using
PCA [42]. Unevaporated and evaporated samples of gasoline and kerosene were spiked
onto a carpet matrix and burned, then the residues were analyzed by GC-MS. After
alignment and normalization, the simulated fire debris was associated with the correct
ignitable liquid, even in the presence of interference compounds. However, association
with the correct extent of evaporation was not possible. The authors suggested that a larger
dataset with fuels from different chemical classes would be necessary to test the
robustness of these procedures for forensic analyses [42].

1.3. Evaluation of Data Pretreatment to Enhance Multivariate Statistical Analysis
The previous work described here demonstrates the utility of applying multivariate
statistics to forensic analyses. However, the authors all commented on the small size of
the dataset and the need for more thorough investigation of data pretreatment
procedures. When chromatograms are collected over a long period of time (several
months), data pretreatment procedures become crucial because instrumental drift over
time introduces more non-chemical variation. Therefore, it is important to have data
pretreatment procedures that can minimize or eliminate these variations, as well as
metrics that can be used to evaluate the effect of the applied pretreatment procedures.
Often, data pretreatment procedures are applied to data with little discussion of
how the parameters were selected or evaluated [43, 44]. Selection of data pretreatment
procedures is facilitated by understanding the sources of the signals that require
correction. Incorrect selection of data pretreatment procedures can amplify
small variations in the data and lead to erroneous results [45]. Therefore, data
pretreatment cannot be carried out using a “black-box” approach; instead, care must be
taken to understand how, and the extent to which, each data pretreatment procedure
corrects the non-chemical sources of variation in the chromatogram [43]. It is critical that
the original relationships between variables are preserved even after data pretreatment
[46].
Many algorithms have been designed to minimize non-chemical sources of
variation in chemical analyses, too many to discuss here [47, 48]. However,
in many cases, there is little quantitative assessment of the effect of each procedure on
the original chromatographic data. Many of the comparisons are based on a visual
examination of the pretreated chromatograms compared to the original chromatograms
[25]. This is problematic when trying to optimize the data pretreatment procedure
because visual comparison is time-consuming and subjective [49].
Common metrics for monitoring data pretreatment are based on a measure of a
correlation coefficient or variance, either between samples or among replicates [8, 50-54].
The extent of non-chemical variation can be compared by examining replicate
injections of the same sample. As replicates are chemically the same, the only variation
must be non-chemical, arising from fluctuations and variations in the instrument [27].
Chromatograms from different samples are also compared, but differences could arise
from variation in chemical composition.
Pearson product-moment correlation (PPMC) coefficients, which measure
correlation between two samples, have been applied to evaluate the effect of data
pretreatment. PPMC coefficients (r) are calculated by dividing the covariance between
two sets of variables (x and y) by the product of their standard deviations
(Equation 1-1) [55].

\[
r = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i} (x_i - \bar{x})^2 \, \sum_{i} (y_i - \bar{y})^2}}
\]
Equation 1-1

A high positive correlation indicates that the variables increase and decrease together
[55, 56]. For chromatographic data, the variables are the abundances at each retention
time point, so a high correlation coefficient indicates that the abundances in the two
chromatograms rise and fall together [51, 57]. This makes PPMC coefficients an effective
metric for evaluating data pretreatment procedures, especially retention time alignment
procedures [45, 50, 54, 58]. An increase in the PPMC coefficient is observed when
alignment improves. However, due to the large number of data points in chromatographic
data, PPMC coefficients can be insensitive to small changes in the chromatogram. In
addition, correlation coefficients are unaffected by relative changes in magnitude of the
variables and, therefore, cannot be utilized to evaluate normalization [56].
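The behavior of the PPMC coefficient described above can be demonstrated with a short Python sketch. The function names and the simulated Gaussian peaks are hypothetical and serve only to illustrate the two properties discussed: sensitivity to retention time shifts and insensitivity to overall scaling.

```python
import math


def ppmc(x, y):
    """Pearson product-moment correlation coefficient of two equal-length series."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    denominator = math.sqrt(
        sum((xi - mean_x) ** 2 for xi in x) * sum((yi - mean_y) ** 2 for yi in y)
    )
    return numerator / denominator


def gaussian_peak(center, n_points=200, width=4.0, height=1.0):
    """Simulated chromatographic peak (hypothetical data)."""
    return [height * math.exp(-0.5 * ((i - center) / width) ** 2) for i in range(n_points)]


reference = gaussian_peak(100)
shifted = gaussian_peak(103)              # same peak, misaligned by three points
scaled = gaussian_peak(100, height=2.5)   # same peak, higher abundance

print("identical:", round(ppmc(reference, reference), 4))
print("shifted:  ", round(ppmc(reference, shifted), 4))
print("scaled:   ", round(ppmc(reference, scaled), 4))
```

The coefficient decreases for the misaligned peak but remains at unity for the scaled peak, which is why the PPMC coefficient is suited to evaluating alignment but not normalization.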
Another common method for evaluating data pretreatment procedures is to
compare sample or replicate variance before and after pretreatment [8, 50, 53]. The
variance (s²) is calculated as the sum of the squared differences between each value and
the mean, divided by the number of observations (n) minus one (Equation 1-2) [55].

\[
s^2 = \frac{\sum_{i} (x_i - \bar{x})^2}{n - 1}
\]
Equation 1-2

Another metric reported in the literature for evaluating data pretreatment is the standard
deviation, or the square root of the variance, which measures deviation from the mean
[57]. A smaller variance or standard deviation indicates less variation between samples
[55]. In most cases, the variance is calculated based on selected features from the
chromatogram, such as retention time or peak height.
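As a brief illustration of these metrics, the following Python sketch computes the variance and standard deviation of the retention time of a single peak across three replicates. The retention times are invented for this example and are not measurements from this work.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = sum(values) / n
    return sum((value - mean) ** 2 for value in values) / (n - 1)


# Hypothetical retention times (min) of one peak in three replicate injections.
before_alignment = [39.20, 39.23, 39.26]
after_alignment = [39.22, 39.22, 39.23]

for label, times in (("before", before_alignment), ("after", after_alignment)):
    variance = sample_variance(times)
    print(f"{label}: variance = {variance:.6f}, standard deviation = {math.sqrt(variance):.4f}")
```

A smaller variance after pretreatment would indicate that the retention times of the replicates agree more closely.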
In the comparison of three retention time alignment algorithms, van Nederkassel
et al. utilized PPMC coefficients and the standard deviation of selected peaks to optimize
the alignment parameters [50]. Gong et al. utilized a correlation coefficient and similarity
index to compare aligned chromatograms to a target chromatogram in order to evaluate
alignment [54]. Johnson et al. employed the average PPMC coefficient of all
chromatograms and the standard deviations of selected peaks to optimize alignment
parameters [51]. PPMC coefficients are the metric utilized in the correlation optimized
warping (COW) alignment algorithm, which aligns chromatograms by maximizing the
correlation between a sample and reference chromatogram [58]. Malmquist and
Danielsson evaluated a single alignment and normalization using the residual sum of
squares between each chromatogram before and after data pretreatment and the
average chromatogram [8]. While the authors differ on the “best” alignment algorithm to
use, it is generally agreed that alignment is necessary when chromatograms have been
collected over a long period of time.
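The idea of aligning a chromatogram by maximizing its correlation with a reference can be sketched in Python as follows. This is a deliberately simplified stand-in that applies one rigid shift to the entire chromatogram; it is not the COW algorithm, which warps individual segments. All function names, the simulated peaks, and the `max_lag` parameter are hypothetical.

```python
import math


def ppmc(x, y):
    """Pearson product-moment correlation coefficient."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return numerator / denominator


def shift(signal, lag):
    """Shift a signal by `lag` points, padding the exposed end with the edge value."""
    n = len(signal)
    if lag > 0:
        return [signal[0]] * lag + signal[: n - lag]
    if lag < 0:
        return signal[-lag:] + [signal[-1]] * (-lag)
    return list(signal)


def align_by_shift(sample, reference, max_lag=10):
    """Choose the rigid shift that maximizes correlation with the reference."""
    best_lag = max(
        range(-max_lag, max_lag + 1),
        key=lambda lag: ppmc(shift(sample, lag), reference),
    )
    return shift(sample, best_lag), best_lag


def residual_sum_of_squares(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def gaussian_peak(center, n_points=200, width=4.0):
    return [math.exp(-0.5 * ((i - center) / width) ** 2) for i in range(n_points)]


reference = gaussian_peak(100)
sample = gaussian_peak(104)        # hypothetical peak eluting four points late

aligned, lag = align_by_shift(sample, reference)
rss_before = residual_sum_of_squares(sample, reference)
rss_after = residual_sum_of_squares(aligned, reference)
print(f"selected shift = {lag} points")
print(f"residual sum of squares: before = {rss_before:.3f}, after = {rss_after:.3e}")
```

Both the correlation coefficient and the residual sum of squares respond to the correction, illustrating how either could serve as a metric for alignment. A rigid shift cannot correct retention time differences that vary along the chromatogram, which is one motivation for segment-based warping algorithms.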
The ratio of the noise in a smoothed versus unsmoothed peak was used to
compare parameters of the Savitzky-Golay smoothing algorithm [53, 59] and to compare
the result of different smoothing algorithms [52]. In addition, the residual sum of squares
between smoothed and unsmoothed voltammograms was utilized by Jakubowska and
Kubiak to evaluate distortion of the signal [53]. The precision of the peak area and height
as well as the limit of detection have also been used to compare the effect of different
smoothing algorithms, without regard to any peak distortion that may have occurred [60].
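A noise ratio of this kind, together with a simple check for peak distortion, can be sketched in Python. For brevity, a centered moving average is used here as a stand-in for the Savitzky-Golay filter; the simulated signal, the noise level, the window size, and the choice of baseline region are all assumptions made for illustration.

```python
import math
import random
import statistics


def moving_average(signal, window=5):
    """Centered moving average; the window shrinks at the ends of the signal."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        segment = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


random.seed(1)
# Simulated peak centered at point 150 with Gaussian noise added.
clean = [100.0 * math.exp(-0.5 * ((i - 150) / 5.0) ** 2) for i in range(200)]
noisy = [value + random.gauss(0.0, 1.0) for value in clean]
smoothed = moving_average(noisy, window=5)

baseline = slice(0, 100)   # region that contains no peak
noise_before = statistics.stdev(noisy[baseline])
noise_after = statistics.stdev(smoothed[baseline])

noise_ratio = noise_after / noise_before       # below 1 when noise is reduced
height_ratio = max(smoothed) / max(noisy)      # below 1 when the peak is flattened
print(f"noise ratio (smoothed/unsmoothed) = {noise_ratio:.2f}")
print(f"peak height ratio (smoothed/unsmoothed) = {height_ratio:.2f}")
```

Reporting the two ratios together shows the trade-off involved in smoothing: a wider window lowers the noise ratio further but also reduces the peak height, which is the distortion that a noise metric alone does not capture.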
In addition to evaluating the raw data, the scores and loadings plots after PCA can
also be used to evaluate the effect of each pretreatment procedure. Visual assessment
of the clusters on the scores plot is a common method to evaluate data pretreatment
procedures [8, 24, 45, 51]. However, visual assessment provides limited quantifiable
information. Moreda-Pineiro et al. suggested the use of the percent variance accounted
for by the first three PCs as a method for evaluating data pretreatment [44]. This
method is an indirect measure of the association and discrimination of samples and is
highly susceptible to influence by outliers. Degree-of-class-separation has been utilized
for evaluating the clustering of samples on a scores plot based on a distance between
clusters and the distance between samples within each cluster [49].
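A metric of this general type can be sketched in Python as the ratio of the distance between two cluster centroids to the average spread of the samples within each cluster. The function `separation_ratio` is a hypothetical construction for illustration only and is not the published degree-of-class-separation formula; the scores are invented.

```python
import math


def centroid(points):
    """Mean position of a cluster of points in PC space."""
    n = len(points)
    return tuple(sum(point[k] for point in points) / n for k in range(len(points[0])))


def mean_distance_to_centroid(points):
    """Average Euclidean distance of the samples from their cluster centroid."""
    center = centroid(points)
    return sum(math.dist(point, center) for point in points) / len(points)


def separation_ratio(cluster_a, cluster_b):
    """Between-cluster distance divided by the mean within-cluster spread.

    Undefined (division by zero) if all replicates in both clusters coincide.
    """
    between = math.dist(centroid(cluster_a), centroid(cluster_b))
    within = (mean_distance_to_centroid(cluster_a) + mean_distance_to_centroid(cluster_b)) / 2
    return between / within


# Hypothetical (PC1, PC2) scores for three replicates of two diesel samples.
tight_1 = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
tight_2 = [(-1.0, -1.0), (-1.1, -0.8), (-0.9, -1.2)]
loose_1 = [(1.0, 1.0), (2.0, 0.5), (0.0, 1.5)]
loose_2 = [(-1.0, -1.0), (-1.5, 0.0), (-0.5, -2.0)]

print(f"tight replicates: {separation_ratio(tight_1, tight_2):.1f}")
print(f"loose replicates: {separation_ratio(loose_1, loose_2):.1f}")
```

A larger ratio indicates that replicates are closely associated relative to the distance between different samples, which gives a single number that can be compared before and after a pretreatment step.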
Despite the critical need for data pretreatment procedures, there has been no
direct comparison of applying sequential data pretreatment procedures. When data
pretreatment procedures are compared, there typically is not a quantitative comparison
because visual examination of the PCA scores plot is used rather than a metric. The
development of metrics for the comparison of these data pretreatment procedures would
allow for parameter optimization and a means to evaluate the effectiveness of each
pretreatment.

1.4. Research Objective
Multivariate statistical procedures are widely utilized in chemical analyses and
show great promise for applications in forensic science. However, additional research is
still required to develop appropriate methodologies for forensic applications. As
previously demonstrated, data pretreatment is a critical aspect of successfully applying
multivariate statistical procedures capable of identifying minute differences in the
chemical fingerprints of forensic evidence.
The goal in this work is to develop methods for evaluating and optimizing data
pretreatment procedures in order to minimize non-chemical sources of variation, resulting
in enhanced discrimination of complex samples using multivariate statistical analysis.
The goal is not to compare every possible method of data pretreatment, but rather to
provide a general overview of common pretreatment procedures and to provide a uniform
set of metrics for evaluating these procedures, using both the raw data and the resulting
PCA scores and loadings plots. In order to attain this goal, the following aims were
outlined:

• Demonstrate methods for objectively selecting and optimizing different data
pretreatment methods and associated parameters.

• Develop metrics for evaluating the effect of data pretreatment on the
chromatographic data and the PCA results of chemically complex and highly similar
samples.

• Demonstrate, using proper data pretreatment procedures, that non-chemical
variation in chromatograms can be minimized without altering the discriminatory
chemical information.

1.5. Impact on the Forensic Science Community
Multivariate statistics have not been widely applied in a legal setting. In order to
be accepted in court, these statistical procedures must satisfy either the Frye or the
Daubert standard [1]. As part of the basis for meeting these standards, it will be critical to
demonstrate that these statistical procedures, and the data pretreatment that accompanies
them, do not change the fundamental chemical information in the analyses. This research
aids in that
goal by providing a fundamental understanding of the data pretreatments and metrics to
evaluate their effectiveness.

REFERENCES

[1] R. Saferstein, Criminalistics: An Introduction to Forensic Science, Prentice Hall,
Upper Saddle River, NJ, 2004.
[2] S. Bell, Forensic Chemistry, Prentice Hall, Upper Saddle River, NJ, 2006.
[3] National Research Council, Strengthening Forensic Science in the United States: A
Path Forward, The National Academies Press, Washington, DC, 2009.
[4] J.D. DeHaan, Kirk's Fire Investigation, Prentice Hall, Upper Saddle River, NJ, 2002.
[5] L.J. Marshall, J.W. McIlroy, V.L. McGuffin, R. Waddell Smith, Anal. Bioanal. Chem.,
394 (2009) 2049.
[6] R.G. Brereton, Applied Chemometrics for Scientists, John Wiley & Sons, Hoboken,
NJ, 2007.
[7] A.M. Hupp, L.J. Marshall, D.I. Campbell, R.W. Smith, V.L. McGuffin, Anal. Chim.
Acta, 606 (2008) 159.
[8] G. Malmquist, R. Danielsson, J. Chromatogr. A, 687 (1994) 71.
[9] S.L. Morgan, E.G. Bartick, in: R.D. Blackledge (Ed.), Forensic Analysis on the Cutting
Edge: New Methods for Trace Evidence Analysis, John Wiley & Sons, Inc., Hoboken,
NJ, 2007.
[10] M.J. Adams, Chemometrics in Analytical Spectroscopy, Royal Society of Chemistry,
Victoria, Australia, 1995.
[11] D.R. Burgard, J.T. Kuznicki, Chemometrics: Chemical and Sensory Data, CRC
Press, Boca Raton, FL, 1990.
[12] Z. Wang, J.H. Christensen, Crude Oil and Refined Product Fingerprinting:
Applications (Chapter 17), Elsevier, Burlington, MA, 2006.
[13] S.M. Mudge, Environ. Forensics, 8 (2007) 155.
[14] J. Gonzalez-Rodriguez, G. Fowler, Forensic Sci. Int., 231 (2013) 6.
[15] N.V.S. Rodrigues, E.M. Cardoso, M.V.O. Andrade, C.L. Donnici, M.M. Sena, J. Braz.
Chem. Soc., 24 (2013) 507.
[16] C. Muehlethaler, G. Massonnet, P. Esseiva, Forensic Sci. Int., 209 (2011) 173.
[17] M. Monfreda, A. Gregori, J. Forensic Sci., 56 (2011) 372.
[18] R.J.H. Waddell-Smith, J. Forensic Sci., 52 (2007) 1297.
[19] D.A. Skoog, F.J. Holler, S.R. Crouch, Principles of Instrumental Analysis, Thomson
Brooks/Cole, Belmont, CA, 2007.
[20] M.C. McMaster, GC/MS: A Practical User’s Guide, John Wiley & Sons, Inc., Hoboken,
NJ, 2008.
[21] P.J. Marriott, in: E. Heftmann (Ed.), Chromatography, Elsevier, New York, NY, 2004.
[22] E. de Hoffmann, V. Stroobant, Mass Spectrometry Principles and Applications, John
Wiley & Sons, Hoboken, NJ, 2007.
[23] K.M. Pierce, J.S. Nadeau, R.E. Synovec, in: C.F. Poole (Ed.), Gas Chromatography,
Elsevier, Waltham, MA, 2012.
[24] M.E. Pate, N.F. Thornhill, R. Chandwani, M. Hoare, N.J. Titchener-Hooker,
Bioprocess Eng., 19 (1998) 297.
[25] J.H. Christensen, G. Tomasi, J. Chromatogr. A, 1169 (2007) 1.
[26] J.H. Christensen, A.B. Hansen, U. Karlson, J. Mortensen, O. Andersen, J.
Chromatogr. A, 1090 (2005) 133.
[27] J.H. Christensen, J. Mortensen, A.B. Hansen, O. Andersen, J. Chromatogr. A, 1062
(2005) 113.
[28] J.H. Christensen, G. Tomasi, A.B. Hansen, Environ. Sci. Technol., 39 (2005) 255.
[29] L.M.V. Malmquist, R.R. Olsen, A.B. Hansen, O. Andersen, J.H. Christensen, J.
Chromatogr. A, 1164 (2007) 262.
[30] N.J. Nielsen, D. Ballabio, G. Tomasi, R. Todeschini, J.H. Christensen, J. Chromatogr.
A, 1238 (2012) 121.
[31] M. Steinfath, D. Groth, J. Lisec, J. Selbig, Physiol. Plant., 132 (2008) 150.
[32] J. van der Greef, H. van Wietmarschen, B. van Ommen, E. Verheij, Mass Spectrom.
Rev., 32 (2013) 399.
[33] M. Chadeau-Hyam, G. Campanella, T. Jombart, L. Bottolo, L. Portengen, P. Vineis,
B. Liquet, R.C.H. Vermeulen, Environ. Mol. Mutagen., 54 (2013) 542.
[34] J. Trygg, E. Holmes, T. Lundstedt, J. Proteome Res., 6 (2007) 469.
[35] D.I. Ellis, R. Goodacre, Analyst, 131 (2006) 875.
[36] K. Varmuza, P. Filzmoser, Introduction to Multivariate Statistical Analysis in
Chemometrics, CRC Press, New York, NY, 2009.
[37] S. Wold, K. Esbensen, P. Geladi, Chemometrics Intell. Lab. Syst., 2 (1987) 37.
[38] P. Gemperline (Ed.), Practical Guide to Chemometrics, CRC Press, Boca Raton, FL,
2006.
[39] K.R. Beebe, R.J. Pell, M.B. Seasholtz, Chemometrics: A Practical Guide, John Wiley
& Sons, Inc., New York, NY, 1998.
[40] J.M. Baerncopf, V.L. McGuffin, R.W. Smith, J. Forensic Sci., 56 (2011) 70.
[41] J.M. Baerncopf, V.L. McGuffin, R.W. Smith, J. Forensic Sci., 55 (2010) 185.
[42] K.R. Prather, V.L. McGuffin, R.W. Smith, Forensic Sci. Int., 222 (2012) 242.
[43] M. Daszykowski, B. Walczak, Trac-Trends Anal. Chem., 25 (2006) 1081.
[44] A. Moreda-Pineiro, A. Marcos, A. Fisher, S.J. Hill, J. Environ. Monit., 3 (2001) 352.
[45] G. Tomasi, F. van den Berg, C. Andersson, J. Chemometr., 18 (2004) 231.
[46] O.M. Kvalheim, F. Brakstad, Y.Z. Liang, Anal. Chem., 66 (1994) 43.
[47] S.D. Brown, S.T. Sum, F. Despagne, B.K. Lavine, Anal. Chem., 68 (1996) R21.
[48] M. Katajamaa, M. Oresic, J. Chromatogr. A, 1158 (2007) 318.

[49] K.M. Pierce, J.L. Hope, K.J. Johnson, B.W. Wright, R.E. Synovec, J. Chromatogr. A,
1096 (2005) 101.
[50] A.M. van Nederkassel, M. Daszykowski, P.H.C. Eilers, Y.V. Heyden, J. Chromatogr.
A, 1118 (2006) 199.
[51] K.J. Johnson, B.W. Wright, K.H. Jarman, R.E. Synovec, J. Chromatogr. A, 996 (2003)
141.
[52] P. Barak, Anal. Chem., 67 (1995) 2758.
[53] M. Jakubowska, W.W. Kubiak, Anal. Chim. Acta, 512 (2004) 241.
[54] F. Gong, B.T. Wang, F.T. Chau, Y.Z. Liang, Anal. Lett., 38 (2005) 2475.
[55] J.L. Devore, Probability and Statistics for Engineering and the Sciences, Duxbury
Press, Belmont, CA, 1991.
[56] J.H. Zar, Biostatistical Analysis, Prentice-Hall, Upper Saddle River, NJ, 1999.
[57] J.N. Miller, J.C. Miller, Statistics and Chemometrics for Analytical Chemistry,
Pearson, New York, NY, 2000.
[58] N.P.V. Nielsen, J.M. Carstensen, J. Smedsgaard, J. Chromatogr. A, 805 (1998) 17.
[59] C.G. Enke, T.A. Nieman, Anal. Chem., 48 (1976) A705.
[60] C.R. Mittermayr, H. Frischenschlager, E. Rosenberg, M. Grasserbauer, Fresenius J.
Anal. Chem., 358 (1997) 456.


2. CHAPTER 2: INITIAL ANALYSIS OF DIESEL SAMPLES
Introduction
The focus of this work is to evaluate strategies for enhancing the differentiation of
complex and chemically similar samples for multivariate statistical analysis. Many
statistical procedures have been applied to forensically relevant data, with the goal of
implementing those procedures in forensic laboratories to assist in comparisons and to
assign a statistical confidence to the forensic analysis [1-5]. One of the most commonly
utilized multivariate statistical procedures is principal component analysis (PCA), which
discriminates samples based on the greatest sources of variance within the dataset.
Typical data generated in forensic laboratories, including chromatograms of fire
debris from gas chromatography-mass spectrometry (GC-MS), are highly complex,
making differentiation challenging. For example, the chromatograms of different fuel
samples can appear similar when compared visually. Often, minute differences, which
are hard to find by eye, are necessary to distinguish between samples. Statistical
procedures can be introduced to assist in the differentiation. Additionally, by utilizing
statistical approaches to analyze forensic samples, there is less subjectivity and greater
consistency across forensic laboratories. Moreover, the application of statistical analyses
allows for comparisons with statistical confidence rather than simply the analyst’s opinion.
In any chemical investigation, variation due to sample preparation and instrumental
procedures is often introduced prior to the statistical analysis [6, 7]. When applying PCA
to samples with different chemical composition, such as diesel and gasoline, the variation
introduced from sample preparation and analysis is small compared to the chemical
differences between the samples. However, when utilizing PCA to differentiate highly
similar samples, such as diesel samples from different sources that contain many of
the same compounds, non-chemical differences between analyses are often the largest
source of variation, which can mask the chemical differences necessary for differentiation.
Data pretreatment procedures are then applied to minimize these variations.
In this chapter, the challenges in differentiating complex and similar samples are
demonstrated. First, thirty chromatograms of diesel fuel, a very complex sample, were
generated. Then, PCA was applied to highlight the difficulties in discriminating these
complex and similar samples. Subsequent chapters will focus on selecting pretreatment
procedures, methods to evaluate and select appropriate parameters for each procedure,
and the resulting effect of the procedure on the ability to discriminate chemically similar
samples.

Selection of Samples
Diesel fuel was chosen as the model sample due to its complex chemical
composition. Diesel fuel consists of small and relatively non-polar molecules, with boiling
points ranging from approximately 80 - 350 °C, which is well suited for GC-MS analysis.
Diesel fuel consists of hundreds of different chemical compounds, present at varying
concentrations, which adds to the complexity of the sample. Diesel fuel is widely available
from local service stations and can be used as an accelerant in arson, making diesel fuel
forensically relevant. During an arson investigation it is necessary to identify whether an
accelerant is present. With additional research, small differences between fuel samples
may provide statistical confidence in a comparison between an accelerant found at a
crime scene and an accelerant found in the possession of a suspect. However, an
accelerant found at a crime scene cannot be traced to a single service station or brand
because diesel fuel sold at different service stations (even different brands) is often
purchased from the same refinery.
Ten different diesel fuels were collected from service stations in the Lansing,
Michigan area during June 2007 and were stored in acid-washed amber bottles at 3 °C
until analysis (Table 2-1). Prior to analysis, each sample was diluted 200:1 in
dichloromethane (spectrophotometric grade, Sigma-Aldrich, St. Louis, MO), and then
analyzed in triplicate using GC-MS, resulting in 30 chromatograms [1, 2, 8].

GC-MS Parameters
All analyses were performed on an Agilent 6890N gas chromatograph coupled to
an Agilent 5975 mass spectrometer detector (Agilent Technologies, Santa Clara, CA).
The GC was equipped with an HP-5MS capillary column with a 5% phenyl-95%
methylpolysiloxane stationary phase (30 m x 0.25 mm x 0.25 μm, Agilent Technologies).
Ultra-high purity helium was used as the carrier gas with a nominal flow rate of 1 mL/min. A
manual injection with a 10 µL syringe (Hamilton, Reno, NV) was used to deliver 1 µL of
diluted diesel with a split ratio of 50:1. A slow, two-step temperature ramp was used in
order to maximize the chromatographic resolution. The oven temperature program was
as follows: 50 °C to 150 °C at 2 °C/min, then 150 °C to 280 °C at 3 °C/min with a final
hold of 15 min. The inlet and transfer line were maintained at 300 °C. The mass
spectrometer utilized electron ionization (70 eV) with a quadrupole mass analyzer, which
scanned mass-to-charge (m/z) ratios of 40-550 at a scan rate of 2.91 scans/s [1, 2, 8].

Table 2-1. Diesel samples collected for this work, including the service station and
the date of collection.

Sample Identifier   Service Station   Location                                      Date
Diesel 1            Sunoco            2139 Haslett Road, East Lansing, MI           05/31/07
Diesel 2            Sunoco            3000 Dunckel Road, Lansing, MI                06/06/07
Diesel 3            Meijer            550 Hull Road, Mason, MI                      06/07/07
Diesel 4            Meijer            2055 W. Grand River Ave, Okemos, MI           06/11/07
Diesel 5            Mobil             1500 Haslett Road, Haslett, MI                06/12/07
Diesel 6            Speedway          16819 Marsh Road, Bath Township, MI           06/13/07
Diesel 7            Mobil             2704 Lake Lansing Road, Lansing, MI           06/15/07
Diesel 8            Marathon          3010 W. Lake Lansing Road, East Lansing, MI   06/18/07
Diesel 9            Marathon          401 S. Pennsylvania Ave, Lansing, MI          06/19/07
Diesel 10           Speedway          1659 Grand River Avenue, Okemos, MI           06/20/07

Visual Assessment of Diesel Chromatograms
After GC-MS analysis, the resulting chromatograms were exported from
Chemstation (version E.01.01.335, Agilent Technologies, Santa Clara, CA) and
regenerated in Excel (Office 2013, version 15.0, Microsoft Corporation, Redmond, WA).
Representative chromatograms of all ten diesel samples are shown in Figure 2-1. The
normal alkanes are labeled for reference. The same scale is utilized so that differences
in overall abundance can be observed. The overall differences in abundance are likely
not chemical, but rather a result of injecting slightly different volumes of sample.
Visual examination of the chromatograms indicates slight chemical differences
among the diesel samples. There is a lower abundance of short-chain alkanes (C9-C11)
in Diesels 1 and 2, relative to the other samples. A low abundance of short-chain alkanes
is characteristic of summer diesel fuel. In order to lower the cloud point in the winter,
diesel fuel is blended with kerosene or jet fuel, which increases the concentration of
short-chain alkanes. Therefore, Diesels 1 and 2 are likely summer diesel fuels, due to their
lower abundance of short-chain alkanes, while Diesels 3 - 10 are potentially winter diesel
blends. Even though all samples were collected in June, the winter diesel fuel is likely
left over from the winter and being distributed until depleted.

[Figure 2-1: ten panels, one per sample (Diesel 1 through Diesel 10), each plotting
Abundance (0.00E+00 to 4.00E+05) against Retention Time (0 to 100 min), with the
normal alkanes C9 through C22 labeled.]

Figure 2-1. A representative diesel chromatogram of each diesel fuel sample (1
- 10) with the normal alkanes labeled. Octane was detected at low abundance,
but was not labeled. Labels y and z are used to indicate two clusters of peaks
from substituted aromatic compounds observed in Diesels 1 and 2.

Diesels 1 and 2 also have a higher abundance of two clusters of peaks,
provisionally identified as branched and substituted aromatic compounds, labeled as y
and z in Figure 2-1. Diesel 2 also has a higher abundance of more volatile (early eluting)
aromatic compounds, which is not observed in any of the other samples. The origin of
these differences is not known, but may be a result of the starting material or additive, as
these compounds were only at higher abundance in samples from Sunoco service
stations (Table 2-1).
Three characteristic groups were observed within the provisionally identified winter
diesel fuels (Figure 2-1, samples 3 - 10). Diesel samples 3 and 7 have a unimodal
distribution of the normal alkanes, which maximizes at a retention time of approximately
40 minutes (C15). Diesel samples 4, 5, 6, 8, 9, and 10 have a bimodal distribution. Diesel
samples 6 and 8 maximize at retention times of approximately 20 and 40 minutes (C12
and C15), while diesel samples 4, 5, 9, and 10 maximize at retention times of
approximately 28 and 40 minutes (C13 and C15). These differences are likely due to
differences in the crude oil starting material, refining processes, and blending for each
brand as well as the refinery from which the fuel was purchased.
Small chemical differences (such as those described above) that are observed
through visual assessment are often overshadowed by non-chemical sources of variation
when PCA is utilized. Examples of non-chemical variation are highlighted in Figure 2-2
and Figure 2-3, which show an overlay of replicate chromatograms and representative
chromatograms of Diesels 3 – 10, respectively. In the replicate chromatograms,
differences in peak height are observed. In addition, there is also variation in the rise in
the baseline and noise observed at the end of each chromatogram. Figure 2-2b and
Figure 2-3b show an expanded view of the undecane peak (between 14.65 and 14.85
minutes) where misalignments are observed, both between replicates and between
sample chromatograms.

[Figure 2-2: panel a plots Abundance (0.00E+00 to 3.00E+05) against Retention Time
(0 to 100 min), with an inset of the baseline; panel b plots Abundance against
Retention Time from 14.65 to 14.85 min.]

Figure 2-2. Chromatograms of three replicates of Diesel 5 (a) and an expanded
region of the chromatogram on the undecane peak (b). The inset shows the
baseline at the end of the chromatogram.

[Figure 2-3: panel a plots Abundance (0.00E+00 to 4.00E+05) against Retention Time
(0 to 100 min), with insets of the baseline; panel b plots Abundance against
Retention Time from 14.65 to 14.85 min.]

Figure 2-3. An overlay of one chromatogram from each of the eight diesel
samples (a) and an expanded region of the chromatogram on the undecane peak
(b). The insets in part a show the baseline at the end of the chromatogram. Each
color represents a different diesel sample.
PCA of Diesel Chromatograms
In this work, PCA was initially performed using triplicate chromatograms of all ten
diesel samples. Based on the chemical composition of each fuel observed in the chromatograms,
three clusters are expected on the PCA scores plot. One cluster would contain the
summer diesel fuels (Diesels 1 and 2). The second cluster would contain Diesels 3 and
7, which have the unimodal distribution of normal alkanes and the final cluster would
contain the winter diesel samples (Diesels 4, 5, 6, 8, 9, and 10) that contain the bimodal
distribution.
The scores plot obtained from the PCA of ten diesel samples is shown in Figure
2-4. The x-axis is the first principal component (PC1), while the y-axis is the second
principal component (PC2). The number in parentheses indicates the percent of the total
variance explained by each principal component (47.1% by PC1 and 19.1% by PC2, 66.2%
for both PCs). Replicates of each sample are not positioned close together, indicating that
there are non-chemical sources of variation present. The only replicate samples positioned
close together are those of Diesel 2 (grey 4-point stars). Two general clusters are observed,
one with the summer diesels (Diesels 1 and 2), and one with the winter diesels (Diesels
3 - 10). The eight winter diesels are intermingled, even though differences in the distribution
of the normal alkanes were observed in the chromatograms. This demonstrates that there
is as much variation between replicates as there is between samples. This also
demonstrates that when PCA is applied to a dataset of chemically similar samples, the
non-chemical differences are often identified as the greatest sources of variation between
chromatograms.

[Figure 2-4: PCA scores plot with PC1 on the x-axis and PC2 on the y-axis, both
spanning -1.5E+06 to 1.5E+06.]

Figure 2-4. PCA scores plot of 10 diesel samples in triplicate. Each diesel
sample is represented by a different color and shape: Diesel 1 (dark red
ovals), Diesel 2 (grey 4-point stars), Diesel 3 (red circles), Diesel 4 (orange
squares), Diesel 5 (yellow diamonds), Diesel 6 (light blue triangles), Diesel
7 (green crosses), Diesel 8 (dark blue inverted triangles), Diesel 9 (purple
pentagons), and Diesel 10 (pink 5-point stars).
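
The scores and percent variance discussed for Figure 2-4 can be illustrated with a minimal sketch. It assumes PCA is computed by singular value decomposition of a mean-centered data matrix (rows are chromatograms, columns are retention times); the software and centering actually used for this work are not stated here, and the data, the function name `pca`, and all parameter values below are synthetic stand-ins.

```python
import numpy as np

def pca(data, n_components=2):
    """PCA by SVD of the mean-centered data matrix (an assumed implementation).

    Rows of `data` are chromatograms; columns are retention times (variables).
    """
    centered = data - data.mean(axis=0)               # mean-centering is an assumption
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # sample positions on each PC
    loadings = vt[:n_components].T                    # weight of each variable on each PC
    percent_variance = 100.0 * s**2 / np.sum(s**2)    # share of total variance per PC
    return scores, loadings, percent_variance[:n_components]

# Synthetic stand-in data: 3 "samples" x 3 replicates, 200 retention times.
rng = np.random.default_rng(0)
time = np.linspace(0.0, 100.0, 200)
profiles = [np.exp(-0.5 * ((time - c) / 5.0) ** 2) for c in (30.0, 50.0, 70.0)]
# Each replicate is the sample profile times a random "injection volume" factor.
data = np.array([p * rng.uniform(0.8, 1.2) for p in profiles for _ in range(3)])

scores, loadings, percent_variance = pca(data)
print(scores.shape, loadings.shape, np.round(percent_variance, 1))
```

Plotting the first column of `scores` against the second gives a scores plot, and the entries of `percent_variance` correspond to the values shown in parentheses on the axes.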
The loadings plot, which shows the weighting or importance of each variable, can
be used to determine which variables are affecting the positioning of the samples on the
scores plot. In this work, the loadings plots are presented with the loadings on the y-axis
and the variable on the x-axis, which, for chromatographic data, is retention time. The
loadings plots for PC1 and PC2 are shown in Figure 2-5. In the loadings plot for PC1
(Figure 2-5a), the normal alkanes have the largest influence and contribute most to the
positioning of samples on the scores plot. Several short-chain alkanes (C11, C12, and C13)
are loading negatively on PC1, affecting the samples with the highest abundance of
short-chain normal alkanes. Therefore, winter Diesels 3 – 10, which have a higher abundance
of short-chain alkanes, are positioned more negatively on PC1 than the summer diesels.
In the loadings plot for PC2 (Figure 2-5b), several short-chain normal alkanes (C10-C14)
are loading positively. Therefore, samples with a higher abundance of short-chain
normal alkanes (the winter diesels) are positioned more positively on PC2 in the scores
plot. The two small clusters of peaks on either side of C12 (at approximately 17 and 24
min, labeled y and z in Figure 2-1) are present in the loadings plots of both PC1 and PC2
(Figure 2-5). This shows that these peaks were identified as a major source of variation
between samples. However, these peaks only appear in Diesels 1 and 2, which also
explains why these samples are separated from Diesels 3 - 10.

[Figure 2-5: two panels plotting PC1 Loadings (a) and PC2 Loadings (b) against
Retention Time (0 to 100 min), with the normal alkanes and the peak clusters y and z
labeled.]

Figure 2-5. Loadings plots for PC1 (a) and PC2 (b) after PCA of
Diesels 1 - 10. The labels y and z correspond to compounds that were
provisionally identified as branched alkanes and substituted aromatic compounds.
The replicates from Diesels 3 - 10 cluster together, except for one replicate of
Diesel 7 (green cross on the left side of Figure 2-4). When the chromatogram of this
replicate is compared to all other chromatograms, it has the lowest overall abundance.
As most variables in the loadings plots of PC1 and PC2 are loading positively, and this
replicate has the lowest abundance, it is positioned most negatively on PC1 in the
scores plot. This replicate’s low abundance is likely due to a lower volume of sample
injected into the GC-MS during analysis.
The goal of this work is to investigate different data pretreatment procedures to
minimize non-chemical sources of variation, and thereby enhance the discrimination of
chemically similar samples, using PCA. To make the differentiation as challenging as
possible for this work, samples with no clustering on the scores plot were selected.
Therefore, as Diesels 1 and 2 contained chemical differences that were identified by PCA
prior to pretreatment, these two samples were omitted from the dataset, and PCA was
performed using only replicates of Diesels 3 - 10. Diesel samples 1 and 2 were included
in some pretreatment parameter optimization in subsequent chapters, but were omitted
from all subsequent PCA scores plots.
The scores plot resulting from PCA of Diesels 3 - 10 is shown in Figure 2-6. No
clustering of diesels is observed and most of the replicates are spread along PC1. The
loadings plots for PC1 and PC2 are shown in Figure 2-7a and Figure 2-7b, respectively.
All compounds are positioned positively in the loadings plot for PC1 and negatively in the
loadings plot for PC2. When the overall abundance in the chromatogram varies between
samples, loadings plots that are mostly positive or mostly negative are common. Further,
replicates of each diesel are spread mostly across PC1 in the scores plot, indicating that
there are likely differences in abundance among replicates. Hence, the scores and
loadings plots indicate that the greatest sources of variation in this dataset are from overall
abundance, a non-chemical source of variation, rather than chemical differences.

[Figure 2-6: PCA scores plot with PC1 (50.2%) on the x-axis and PC2 (27.2%) on the
y-axis, both spanning -1.5E+06 to 1.5E+06.]

Figure 2-6. PCA scores plot of Diesels 3 - 10 in triplicate. Each diesel sample
is represented by a different color and shape: Diesel 3 (red circles), Diesel 4
(orange squares), Diesel 5 (yellow diamonds), Diesel 6 (light blue triangles),
Diesel 7 (green crosses), Diesel 8 (dark blue inverted triangles), Diesel 9
(purple pentagons), and Diesel 10 (pink 5-point stars).

[Figure 2-7: two panels plotting PC1 Loadings (a) and PC2 Loadings (b) against
Retention Time (0 to 100 min), with the normal alkanes C10 through C21 labeled;
panel a includes an inset on the C11 peak.]

Figure 2-7. Loadings plots for PC1 (a) and PC2 (b) after PCA of
Diesels 3 - 10. The inset shows an expanded region of the undecane (C11)
peak to show the derivative-shaped peak, which is characteristic of
misalignments.
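
The link between overall abundance differences and single-signed loadings can be sketched with synthetic data. The example below assumes replicates that differ only by a multiplicative factor (a stand-in for injection volume) and applies mean-centered PCA by SVD; the peak positions, heights, and scale factors are hypothetical and are not taken from the diesel dataset.

```python
import numpy as np

time = np.linspace(0.0, 100.0, 500)
# Hypothetical chromatogram: three Gaussian peaks on a small constant baseline.
profile = 0.05 + sum(h * np.exp(-0.5 * ((time - c) / 1.5) ** 2)
                     for h, c in [(1.0, 20.0), (2.0, 40.0), (1.5, 60.0)])

# Replicates of one sample that differ only in overall abundance (assumed factors).
scale = np.array([0.8, 0.9, 1.0, 1.1, 1.2])
replicates = scale[:, None] * profile

# Mean-centered PCA by SVD.
centered = replicates - replicates.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
pc1_loadings = vt[0]
pc1_scores = u[:, 0] * s[0]
pc1_percent = 100.0 * s[0] ** 2 / np.sum(s ** 2)

# Pure scaling makes every variable load with the same sign on PC1
# (the overall sign of a principal component is arbitrary).
same_sign = bool(np.all(pc1_loadings > 0) or np.all(pc1_loadings < 0))
print(same_sign, round(float(pc1_percent), 1))
```

In this idealized case essentially all of the variance falls on PC1 and the replicates are ordered along PC1 by their scale factor, which mirrors the spread of replicates along PC1 described above.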
The loadings plots can provide insight into other non-chemical sources of variation.
Derivative-shaped peaks are observed for C11, C12, and C13 in PC1 and PC2 (inset Figure
2-7a). The derivative-shaped peaks in the loadings plots result from the peaks in the
chromatograms maximizing at slightly different retention times in each sample, indicating
retention time misalignment [1, 9]. Therefore, the loadings plots indicate that abundance
and alignment are the major sources of variation between chromatograms. The baseline
and noise are not prevalent in the loadings plots, indicating that these are not major
sources of variation. However, as the largest non-chemical variations are minimized,
these lesser sources of variation may become more prominent.
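
The origin of derivative-shaped peaks in a loadings plot can be sketched in the same way. The example assumes a single Gaussian peak that is identical in every replicate except for a small retention time shift; the peak width and shifts are hypothetical values chosen only for illustration.

```python
import numpy as np

time = np.linspace(14.65, 14.85, 201)
center, width = 14.75, 0.02                 # hypothetical peak position and width (min)
shifts = np.array([-0.01, 0.0, 0.01])       # hypothetical retention time shifts (min)
peaks = np.array([np.exp(-0.5 * ((time - center - d) / width) ** 2) for d in shifts])

# Mean-centered PCA by SVD; only the PC1 loadings are needed here.
centered = peaks - peaks.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1_loadings = vt[0]

# A derivative-shaped loading has one positive and one negative lobe,
# located on opposite sides of the nominal peak position.
i_max = int(np.argmax(pc1_loadings))
i_min = int(np.argmin(pc1_loadings))
i_center = int(np.argmin(np.abs(time - center)))
print(time[i_max], time[i_min])
```

Because the shifted peaks differ mainly on their leading and trailing edges, PC1 weights those edges with opposite signs and is close to zero at the peak center.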
Summary
Diesel fuel was selected to evaluate the effect of data pretreatment on the PCA of
highly similar and chemically complex samples. Eight diesel samples were selected, as
they were chemically indistinguishable using PCA prior to application of data
pretreatment. This indicates that the variation from non-chemical sources, such as
sample preparation and instrumental analysis, is greater than the variation from the
chemical differences between the fuel samples. By minimizing these non-chemical sources
of variation, the chemical differences can be utilized to differentiate the diesel samples.


REFERENCES


[1] L.J. Marshall, J.W. McIlroy, V.L. McGuffin, R. Waddell Smith, Anal. Bioanal. Chem.,
394 (2009) 2049.
[2] A.M. Hupp, L.J. Marshall, D.I. Campbell, R.W. Smith, V.L. McGuffin, Anal. Chim.
Acta, 606 (2008) 159.
[3] J.M. Baerncopf, V.L. McGuffin, R.W. Smith, J. Forensic Sci., 55 (2010) 185.
[4] J.M. Baerncopf, V.L. McGuffin, R.W. Smith, J. Forensic Sci., 56 (2011) 70.
[5] K.R. Prather, V.L. McGuffin, R.W. Smith, Forensic Sci. Int., 222 (2012) 242.
[6] R.G. Brereton, Applied Chemometrics for Scientists, John Wiley & Sons, Hoboken,
NJ, 2007.
[7] K.M. Pierce, J.S. Nadeau, R.E. Synovec, in: C.F. Poole (Ed.), Gas Chromatography,
Elsevier, Waltham, MA, 2012.
[8] L.J. Marshall, Association and Discrimination of Diesel Fuels using Chemometric
Procedures for Forensic Arson Investigations (Masters Thesis), Michigan State
University, Ann Arbor, MI, 2008.
[9] G. Malmquist, R. Danielsson, J. Chromatogr. A, 687 (1994) 71.


3. CHAPTER 3: NORMALIZATION
Introduction
Differences in sample abundance are the most common variation observed
between chromatograms and can arise for many reasons [1]. In chromatographic data,
instrumental fluctuations in flow and injection port temperature can result in abundance
differences among samples and even among replicates. The method of injection,
including the speed of injection, the amount of time the syringe remains in the injection
port, and syringe volume, can also result in differences from analysis to analysis. Manual
injection is particularly problematic as the injection method (the volume injected, the
speed at which it was injected, etc.) can vary widely for each analysis [2].
Normalization is the most widely applied data pretreatment procedure, even when
not utilizing multivariate statistics. Normalization procedures are often used to correct
systematic variations in abundance between samples [1, 3]. For chromatographic data,
this can be done using some part of the chromatogram, often the height or area of a single
peak, or the total area of the chromatogram [4]. However, normalization is often
challenging for chromatograms of complex mixtures. Care must be taken when choosing
a normalization procedure to ensure that important differences in relative peak
abundance among the samples are not lost. In addition, accurate integration to determine
peak area for normalization is challenging in complex samples due to co-eluting peaks.
Normalization is generally applied after other data pretreatment procedures; however, in
this work it is discussed first as this procedure was found to have the greatest effect on
the clustering of replicates and discrimination among samples in the scores plot.

Methods Tested and Evaluation Metrics
In normalization, each data point in a sample (in this work, the abundance at each
retention time in each diesel chromatogram) is divided by a unique factor, derived from
the sample. There are many methods for determining the normalization factor(s) that are
applied: the two most common are the total area and the height from a specific peak in
the chromatogram [1, 2, 5-8]. For this work, manual injections were used and no internal
standard was included in order to simulate the most challenging normalization scenario.

3.2.1. Total Area Normalization
Total area normalization (also called unit area normalization or constant sum
normalization) is performed by dividing the abundance at each retention time (A_t) by the
total sum of the abundances in the chromatogram, resulting in the normalized abundance
(A_t').

$$A_t' = \frac{A_t}{\sum_{t} A_t} \qquad \text{(Equation 3-1)}$$

In this work, the total area of each chromatogram was approximated by summing the
abundance at each retention time in the total ion chromatogram. The abundance at each
retention time was then divided by this sum to normalize the chromatogram. Each point
in every chromatogram was multiplied by the average total area across all
chromatograms in the dataset to return them to the original order of magnitude.
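
The procedure described above (Equation 3-1 followed by rescaling to the average total area) can be sketched as follows. The function name `total_area_normalize` and the small example matrix are hypothetical; rows are chromatograms and columns are retention times.

```python
import numpy as np

def total_area_normalize(chromatograms):
    """Divide each chromatogram by its summed abundance (Equation 3-1), then
    multiply by the average total area to restore the original magnitude."""
    totals = chromatograms.sum(axis=1, keepdims=True)   # approximate total area per row
    return chromatograms / totals * totals.mean()

# Hypothetical replicates that differ only in injected amount.
raw = np.array([[1.0, 4.0, 2.0, 1.0],
                [2.0, 8.0, 4.0, 2.0],
                [1.5, 6.0, 3.0, 1.5]])
normalized = total_area_normalize(raw)
print(normalized)
```

In this idealized case, where replicates differ only by a scale factor, the normalized rows become identical; real replicates would retain residual differences from noise, baseline, and misalignment.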
The major assumption using total area normalization is that the total signal
response from one sample is equivalent to the total signal from another. In other words,
this method assumes that the same volume of each sample has the same instrument
response [2]. While this is rarely true, because response factors differ between
compounds, when a large number of compounds are present, this is often a reasonable
approximation. A major drawback of this method is that, because the normalized
abundances sum to a constant, when one peak decreases in size, another peak necessarily
increases, which can result in misleading correlations between samples [6]. Area
normalization is a good initial method for normalization because it is generally fast and
easy to apply and often results in adequate normalization. Additionally, for complex
samples, where there are unresolved peaks or a high baseline, area normalization often
results in the best minimization of the variation introduced from injection [9].
3.2.2. Single Peak Normalization
In single peak normalization (also called maximum peak or internal standard
normalization), each data point is divided by the amplitude of a specific peak of interest
(A_I) in the data.

$$A_t' = \frac{A_t}{A_I} \qquad \text{(Equation 3-2)}$$

For chromatographic data, the peak height or peak area of a selected peak (which is
assumed to be constant across all samples) is used as the normalization factor.
The most common single peak normalization utilizes an internal standard. Prior to
analysis, the same concentration of a non-native compound is spiked into each sample.
An ideal internal standard for chromatographic data should have similar physical
properties to the compound being analyzed and elute close to the compound of interest. In
addition, the internal standard should be completely resolved in the chromatogram
(avoiding increased signal from co-elution), and be at a similar concentration to the
compound(s) of interest [7, 8]. Selection of an appropriate internal standard is
challenging, especially for a complex mixture where there are compounds with many
different properties and at different concentrations [2]. In many cases, a deuterated
analogue of each compound is utilized. However, this type of internal standard can be
expensive and very challenging to obtain for all compounds in a complex sample, as there
are a large number of compounds present. If the internal standard is not properly
selected, is present at an inappropriate abundance, or co-elutes with another compound,
the internal standard itself can become a major source of variance in PCA. This highlights
the need to think about proper data pretreatment procedures, even before sample
collection.
When an internal standard is not added, a compound within the sample can be
used for single peak normalization. This peak could be the highest abundance peak in
each sample, or could be a peak that is common to each sample. However, single peak
normalization can skew the relative abundance between samples because the
abundance of the selected peak may not truly be the same in all samples. Therefore,
normalization to a peak in the sample is often problematic if the abundance of that peak
changes between samples.
For this work, each point was divided by the peak height of heptadecane (C17) (at
approximately 50.3 min) and multiplied by the average heptadecane peak height across
all samples [2]. Heptadecane was chosen because it is a large, retained peak and is less
affected by evaporation or variation introduced during injection, due to its lower volatility.
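
The heptadecane normalization described above can be sketched as follows. The sketch assumes the peak height is taken as the maximum abundance within a retention time window around the expected peak position; the window half-width, the function name `single_peak_normalize`, and the synthetic two-peak chromatograms are hypothetical choices for illustration.

```python
import numpy as np

def single_peak_normalize(time, chromatograms, peak_time, half_window):
    """Divide each chromatogram by the height of a selected peak (Equation 3-2),
    then multiply by the average height of that peak across all chromatograms."""
    in_window = np.abs(time - peak_time) <= half_window
    heights = chromatograms[:, in_window].max(axis=1, keepdims=True)
    return chromatograms / heights * heights.mean(), heights.ravel()

def gaussian_peak(time, center, height, width=0.1):
    return height * np.exp(-0.5 * ((time - center) / width) ** 2)

# Synthetic replicates: a peak near 45.0 min and a reference peak near 50.3 min,
# scaled by assumed injection-volume factors.
time = np.linspace(0.0, 100.0, 2001)
base = gaussian_peak(time, 45.0, 3.0) + gaussian_peak(time, 50.3, 2.0)
replicates = np.array([0.9, 1.0, 1.2])[:, None] * base

normalized, heights = single_peak_normalize(time, replicates, peak_time=50.3, half_window=0.2)
window = np.abs(time - 50.3) <= 0.2
print(heights, normalized[:, window].max(axis=1))
```

After normalization the reference peak has the same height in every chromatogram, so any true difference in the abundance of that compound between samples would be transferred to the other peaks, as cautioned above.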

3.2.3. Evaluation Metrics
In order to evaluate the effect of normalization, all 24 diesel chromatograms
(Diesels 3 - 10 analyzed in triplicate) were normalized using both total area and single
peak normalization methods. Initially, to assess the effect of normalization, a visual
inspection of overlaid chromatograms before and after normalization was used. However,
comparing overlaid chromatograms is subjective and time consuming as it requires
observing only small regions of the chromatograms at one time. In order to quantitatively
compare normalization methods, the percent change in total sum of squares of the
residuals for replicates (SSR) was developed. To calculate the SSR, an average
chromatogram of the triplicates for each diesel sample was calculated. The residual was
calculated by subtracting each replicate chromatogram from the corresponding average
chromatogram. The residuals were squared and then summed for all 24 chromatograms.
The percent change in the SSR between the untreated and the normalized data was then
calculated. Using the residuals of replicates allows for monitoring both the peaks and the
baseline. Theoretically, for instrument replicates, which are chemically the same, the
SSR should be zero. When any variation among replicates is present, the differences
must arise from injection and instrumental analysis. Ideally, normalization would remove
all differences in peak height, making the replicate samples have the same height at each
retention time.
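
The SSR calculation described above can be sketched as follows. The function names, the sample labels, and the two-point "chromatograms" are hypothetical, and the stand-in pretreatment is a simple division by the row sum rather than the exact procedure used in this work.

```python
import numpy as np

def replicate_ssr(chromatograms, sample_ids):
    """Sum of squared residuals between each replicate and the average
    chromatogram of its sample, accumulated over all samples."""
    ssr = 0.0
    for sample in np.unique(sample_ids):
        replicates = chromatograms[sample_ids == sample]
        residuals = replicates.mean(axis=0) - replicates   # average minus replicate
        ssr += float(np.sum(residuals ** 2))
    return ssr

def percent_change(before, after):
    return 100.0 * (after - before) / before

# Hypothetical data: two samples in triplicate, two retention times each.
sample_ids = np.array([3, 3, 3, 4, 4, 4])
raw = np.array([[1.0, 4.0], [2.0, 8.0], [3.0, 12.0],
                [2.0, 2.0], [4.0, 4.0], [6.0, 6.0]])
treated = raw / raw.sum(axis=1, keepdims=True)   # stand-in pretreatment

ssr_raw = replicate_ssr(raw, sample_ids)
ssr_treated = replicate_ssr(treated, sample_ids)
change = percent_change(ssr_raw, ssr_treated)
print(ssr_raw, ssr_treated, change)
```

A negative percent change indicates that the pretreatment brought the replicates closer to their average chromatogram.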
Evaluation of how normalization affected the association of replicates on the PCA
scores plot was performed using a visual inspection of clustering patterns. Additionally,
the effect of normalization on the clustering or grouping of replicates was quantitatively
assessed using the average percent change in the clustering of replicates (PCC). The
PCC was calculated by summing the variance in PC1 and PC2 for replicates of each
diesel. The standard deviation was then calculated by taking the square root of this
summed variance. The standard deviations were averaged across the diesels and the PCC
was calculated as the percent change in the average standard deviation between the scores
plot generated from the chromatograms after data pretreatment and the scores plot
generated from chromatograms prior to pretreatment.
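
The PCC calculation described above can be sketched as follows. The sketch assumes the sample variance (ddof=1) is used for the replicate scores, which is not specified in the text; the function names and the synthetic scores are hypothetical.

```python
import numpy as np

def replicate_spread(scores, sample_ids):
    """Average, over samples, of the square root of the summed PC1 and PC2
    variance of the replicate scores (sample variance, ddof=1, is an assumption)."""
    spreads = []
    for sample in np.unique(sample_ids):
        replicates = scores[sample_ids == sample, :2]
        spreads.append(np.sqrt(replicates.var(axis=0, ddof=1).sum()))
    return float(np.mean(spreads))

def pcc(scores_before, scores_after, sample_ids):
    """Percent change in the average replicate spread after pretreatment."""
    before = replicate_spread(scores_before, sample_ids)
    after = replicate_spread(scores_after, sample_ids)
    return 100.0 * (after - before) / before

# Synthetic scores: two samples in triplicate on PC1 and PC2.
rng = np.random.default_rng(1)
sample_ids = np.array([3, 3, 3, 4, 4, 4])
scores_before = rng.normal(size=(6, 2))
scores_after = 0.5 * scores_before   # replicates drawn half as far apart

result = pcc(scores_before, scores_after, sample_ids)
print(round(result, 1))
```

A negative PCC indicates that replicates are positioned closer together on the scores plot after pretreatment.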
Effect of Normalization on Chromatographic Data

3.3.1. Visual Assessment
An expanded region near the hexadecane (C16) peak of three representative diesel
chromatograms is shown in Figure 3-1 before (a) and after each normalization method
(b and c). The inset in each figure shows the further expanded baseline, just after the
hexadecane peak. Without normalization (Figure 3-1a) there is spread in the abundance
along the baseline as well as at the peak maxima. Because these are replicates of the
same sample, these differences are due to small differences in injection volume and
instrumental variation. After total area normalization (Figure 3-1b), there is less spread
in the baseline of the three samples; however, spread in the abundance is still observed
at the peak maxima. Using single peak normalization (Figure 3-1c), the peak maxima are
close together, while spread is still observed along the baseline. This demonstrates that
the normalizations investigated in this work could not correct all of the variation observed.
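The two normalizations compared here can be illustrated with a short sketch. It assumes that total area normalization divides each chromatogram by the sum of its abundances and that single peak normalization divides by the maximum abundance inside a retention-time window around the chosen peak. The function names and the window argument are hypothetical, and any rescaling applied in this work is not reproduced.

```python
import numpy as np


def total_area_normalize(chromatogram):
    """Divide every point by the summed abundance of the chromatogram."""
    y = np.asarray(chromatogram, dtype=float)
    return y / y.sum()


def single_peak_normalize(chromatogram, times, window):
    """Divide every point by the peak maximum found inside `window` (min)."""
    y = np.asarray(chromatogram, dtype=float)
    t = np.asarray(times, dtype=float)
    in_window = (t >= window[0]) & (t <= window[1])
    return y / y[in_window].max()
```

Both functions remove a uniform scale factor, such as a difference in injection volume. Neither can remove variation that affects the peaks and the baseline differently, which is consistent with the residual spread described above.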

[Figure 3-1: three panels (a-c); x-axis: Retention Time (min), 44.6 to 45.6; y-axis: Abundance, 0.00E+00 to 5.00E+05]

Figure 3-1. An expanded region of the hexadecane peak (C16) in triplicate analysis
of a diesel sample, before normalization (a), after total area normalization (b), and
after selected peak normalization (c).

3.3.2. Quantitative Assessment
Both normalization methods resulted in a reduction in the spread in the abundance
between replicate chromatograms (Figure 3-1). Using the quantitative metric, there is a
92% decrease in the SSR using the area normalization procedure and an 87% decrease
using peak normalization. These percent decreases in the SSR are very similar,
indicating that both methods result in a large reduction in the variation. However, in this
work, the majority of the chromatogram consists of an unresolved baseline; therefore,
area normalization resulted in a larger improvement than single peak normalization.
Effect of Normalization on PCA Scores Plot
Because normalization was determined to be the most important pretreatment
procedure for this dataset, both total area and single peak normalization were utilized
prior to PCA. Triplicate chromatograms of the eight diesels (Diesels 3 - 10) were
normalized using each method, then PCA was performed.
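For readers who wish to reproduce this step, PCA of a set of chromatograms can be sketched with a singular value decomposition. The software used for PCA is not stated in this section, so the sketch below is an assumption-laden illustration: it mean-centers each retention-time variable, applies no variance scaling, and uses the hypothetical function name `pca_scores_loadings`.

```python
import numpy as np


def pca_scores_loadings(data, n_components=2):
    """PCA of a (samples x retention times) array by SVD of centered data."""
    x = np.asarray(data, dtype=float)
    centered = x - x.mean(axis=0)                    # mean-center each variable
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]  # sample coordinates
    loadings = vt[:n_components].T                   # variable weights per PC
    explained = (s ** 2) / np.sum(s ** 2)            # fraction of variance
    return scores, loadings, explained[:n_components]
```

The scores give the sample positions on the scores plot, each column of the loadings gives a loadings plot against retention time, and the explained fractions correspond to the percentages quoted for PC1 and PC2.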

3.4.1. Visual Assessment
Both normalization methods resulted in enhanced clustering of replicates when
compared to the PCA scores plot prior to data pretreatment (Figure 3-2a). Using total
area normalization (Figure 3-2b), PC1 accounts for 57.9% of the variation and PC2
accounts for 21.4%. After area normalization, there is still spread among replicates,
mostly along PC2 on the scores plot, indicating additional sources of non-chemical
variation (Figure 3-2b). The loadings plot of PC1 after total area normalization (Figure
3-3a) shows mostly the normal alkane peaks positioned positively, likely due to the variation
in peak height shown in Figure 3-1b. However, the largest peaks in the PC1 loadings plot

[Figure 3-2: PCA scores plots. Panel a: PC1 (50.2%) vs. PC2 (27.2%); panel b: PC1 (57.9%) vs. PC2 (21.4%). Replicates R1, R2, and R3 are labeled.]

Figure 3-2. PCA scores plot of eight diesel chromatograms in triplicate prior to
the application of data pretreatment (a) and after total area normalization (b).
Each diesel is represented by a different shape and color.

[Figure 3-3: Panel a: PC1 Loadings vs. Retention Time (min), 0 to 100, with normal alkane peaks C10 to C20 labeled; panel b: PC2 Loadings vs. Retention Time (min), 0 to 100, with an inset from 20.5 to 21.1 min.]

Figure 3-3. Loadings plot for PC1 (a) and PC2 (b) after PCA with total area
normalization. The inset in part b shows a derivative shaped peak,
indicative of misalignments.

are the short-chain alkanes, C11 - C13, with the largest peak at C12, corresponding to the
peaks that maximize in the bimodal distribution (see Chapter 2, Figure 2-1). Therefore,
some of the information contained in PC1 also reflects chemical differences between samples. The
loadings plot of PC2 (Figure 3-3b) shows the most dominant peaks as derivative-shaped
curves (see inset), indicating that the greatest source of variation on PC2 is retention time
misalignments [10]. Most of the spread among replicates occurs on PC2 on the scores
plot because misalignments are dominating PC2.
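The link between a retention time shift and a derivative-shaped loading can be illustrated with two synthetic peaks. The sketch below is not taken from the data in this work; the Gaussian peak shape, the 0.01 min shift, and the peak width are assumed values chosen only for illustration.

```python
import numpy as np

t = np.linspace(20.5, 21.1, 601)                  # retention time axis (min)


def gaussian_peak(times, center, width=0.02, height=1.0):
    """Idealized chromatographic peak."""
    return height * np.exp(-0.5 * ((times - center) / width) ** 2)


reference = gaussian_peak(t, 20.80)
shifted = gaussian_peak(t, 20.79)                 # replicate eluting slightly early

# Equal heights, different positions: the difference has a positive lobe
# followed by a negative lobe, like the first derivative of a peak.
difference = shifted - reference
```

For two mean-centered chromatograms, the only principal component is proportional to this difference, so a misalignment between otherwise identical replicates appears in the loadings as a positive lobe next to a negative lobe.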
To demonstrate the correction of non-chemical sources of variation, Diesel 5
(yellow diamonds) was chosen and the change in clustering will be highlighted throughout
the subsequent chapters. In order to observe the changes in the chromatogram, three
replicates of Diesel 5 (labeled R1, R2, and R3) were overlaid and the region around the
dodecane peak (C12) was expanded (Figure 3-4). Prior to any pretreatment (Figure 3-4a),
R1 is shifted to the left of the other two replicates and R2 is at a higher abundance than
the other replicates. After total area normalization, all replicates were at approximately
the same height, while R1 was still shifted to the left of the other replicates. This results
in the spread in the replicates observed in Figure 3-2b. In the scores plot prior to data
pretreatment (Figure 3-2a), the three replicates of Diesel 5 were spread along PC1. From
the loadings plot shown in Chapter 2 (Figure 2-7), PC1 included differences in height and
misalignments. However, after normalization, peaks were at approximately equal heights
(Figure 3-4b). Therefore when PCA was performed, the variation in height was minimized
(Figure 3-2b), and R2 and R3 were positioned close together. R1 was still separated
along PC2, due to the misalignments. This is supported in the PC2 loadings plot

[Figure 3-4: two panels (a, b); x-axis: Retention Time (min), 20.6 to 21.0; y-axis: Abundance, 0.00E+00 to 3.00E+05]

Figure 3-4. An expanded region of dodecane in three replicate chromatograms of
diesel 5 before (a) and after (b) area normalization (R2 and R3 are directly on top
of one another). Each replicate is indicated by a different color (R1: red, R2: blue,
R3: green).

(Figure 3-3b), which shows derivative-shaped curves for many of the normal alkanes that
are present.
After application of single peak normalization (using hexadecane), PC1 accounted
for 60.2% of the variance and PC2 accounted for 21.4%. Peak normalization results in
the spread among replicates occurring along both PC1 and PC2 (Figure 3-5b), indicating
that non-chemical variation is present in both principal components. This is supported by
the loadings plot for the peak normalized chromatograms (Figure 3-6). Derivative-shaped
peaks are observed for many of the alkanes in PC1, indicating that misalignments are still
a major source of variation. Additionally, the loadings plot for PC1 shows a large portion
of the unresolved region of the chromatogram (20 - 65 min) contributing to the loadings,
indicating that the differences in the unresolved baseline region are a major contribution
to the variance. This agrees with the visual assessment of the chromatograms, which
shows that after peak normalization, variation in the baseline between replicates was still
present (Figure 3-1). The loadings plot of PC2 (Figure 3-6b) is very similar to the PC1
loadings plot after area normalization (Figure 3-3). As with the area normalization, this
pattern is likely due to the chemical differences between diesel samples with the unimodal
and bimodal distribution of normal alkanes.
The dodecane peak (C12) in replicates of Diesel 5 can again be utilized to explain
the spread on the PCA scores plot present in the samples and replicates (Figure 3-7).
After peak normalization (Figure 3-7b), there are still some differences in the height
between R1, R2, and R3. In the PC1 loadings plot (Figure 3-6a), many of the peaks from
the normal alkanes, including the dodecane peak, are derivative-shaped peaks, indicating

[Figure 3-5: PCA scores plots. Panel a: PC1 (50.2%) vs. PC2 (27.2%); panel b: PC1 (60.2%) vs. PC2 (21.4%). Replicates R1, R2, and R3 are labeled.]

Figure 3-5. PCA scores plot of eight diesel chromatograms in triplicate prior to
the application of data pretreatment (a) and after single peak normalization (b).

[Figure 3-6: Panel a: PC1 Loadings vs. Retention Time (min), 0 to 100; panel b: PC2 Loadings vs. Retention Time (min), 0 to 100. Normal alkane peaks C10 to C20 are labeled.]

Figure 3-6. Loadings plot for PC1 (a) and PC2 (b) after PCA with single peak
normalization.

[Figure 3-7: two panels (a, b); x-axis: Retention Time (min), 20.6 to 21.0; y-axis: Abundance, 0.00E+00 to 3.00E+05]

Figure 3-7. An expanded region of dodecane in three replicate
chromatograms of diesel 5 before (a) and after (b) peak normalization. Each
replicate is indicated by a different color (R1: red, R2: blue, R3: green).

retention time misalignment. After normalization, Diesel 5 R1 (shown in red in Figure 3-7)
is still misaligned to the other two replicates, explaining some of the spread observed in
PC1. The dodecane peak and several other normal alkane peaks are heavily weighted
in the loadings plots (before and after each normalization), indicating that dodecane is
greatly influencing the positioning of samples on the scores plot. When large peaks are
not well aligned and normalized, there is variation in the resulting PCA scores plot. This
explains why R1, R2, and R3 are spread even after normalization, in both PC1 and PC2.
After application of area normalization, chemical differences between samples are
beginning to become useful in differentiating samples, based on the loadings plot of PC1.
Non-chemical sources of variation are still observed on PC2, resulting in some spread
among replicates. Using peak normalization, misalignments and variation in the baseline
are observed on the loadings plot of PC1 and chemical differences are observed on PC2.
This demonstrates that area normalization minimizes more of the non-chemical
variations, thus allowing chemical differences between samples to become the greatest
source of variation. However, with peak normalization, additional pretreatment
procedures are required to minimize the non-chemical sources of variation.
While differences in the baseline may be small (generally less than 3% of the
signal), they contribute to the variation because the differences occur over the entire
chromatogram (Figure 2-2). In this work, both a large number of relatively small peaks
(compared to the baseline) and a large, unresolved region were present, complicating
normalization (Figure 2-1). The baseline must be well normalized, because the baseline
accounts for most of the points in the chromatogram. However, the peaks also need to
be well normalized because most of the chemical differences in complex samples will
come from differences in peak heights. After utilizing area normalization, real chemical
differences were identified as the greatest source of variation, followed by other
non-chemical sources of variation. After single peak normalization, the greatest source of
variation was still non-chemical, followed by chemical differences between the samples.
This demonstrates that area normalization is more effective at correcting the major
sources of variation.

3.4.2. Quantitative Assessment
The average percent change in the clustering of replicates (PCC) between the
unnormalized and normalized chromatograms was calculated to assess improvements in
clustering attained using each normalization procedure. After applying total area
normalization, the PCC was 45.1%, indicating that replicates on the scores plot were
closer together after normalization than without normalization. Similar results were
observed for the single peak normalization, where the PCC was 58.9%. The higher PCC
using the single peak is likely due to Diesel 5 (yellow diamonds, Figure 3-2 and Figure
3-5). After area normalization, one replicate of this sample became further separated due
to misalignments, resulting in a negative PCC for that sample. If this sample is removed,
the average PCC for the area normalization increases to 52.9%, demonstrating similar
clustering of replicates as observed using peak normalization.
Summary
Proper normalization of complex chromatographic data can be challenging;
however, normalization is a critical data pretreatment procedure to minimize
non-chemical sources of variation. Small differences in abundance, caused by variation in
injection, are often the greatest source of variation for complex and similar samples. In
this work, both normalization methods resulted in a reduction in the non-chemical
variation. Based on the chromatographic metric, the percent change in the sum of
squares of the residual, total area normalization resulted in a larger reduction in variation.
However, using the metric for the PCA scores plot, the percent change in the clustering
of replicates showed that single peak normalization resulted in better clustering of
replicates for this data. This demonstrates that selection of the proper normalization
method is dependent on the data and non-chemical variation that is present. The most
important point demonstrated in this work is that selection of the particular normalization
method is not critical; however, the application of normalization procedures drastically
improves discrimination of highly complex samples using PCA. Even though the
performance of each normalization method is similar for these data, it is still important
that analysts consider selection of an appropriate normalization method, to ensure that
the method that is chosen does not skew the data. In some cases, one or more of the
assumptions that are made may not always be valid and can lead to erroneous results.

REFERENCES


[1] K.R. Beebe, R.J. Pell, M.B. Seasholtz, Chemometrics: A Practical Guide, John Wiley
& Sons, Inc., New York, NY, 1998.
[2] K.M. Pierce, J.S. Nadeau, R.E. Synovec, in: C.F. Poole (Ed.), Gas Chromatography,
Elsevier, Waltham, MA, 2012.
[3] B.K. Lavine, in: S.J. Haswell (Ed.), Practical Guide to Chemometrics, Marcel Dekker,
Inc., New York, NY, 1992, p. 211.
[4] K. Varmuza, P. Filzmoser, Introduction to Multivariate Statistical Analysis in
Chemometrics, CRC Press, New York, NY, 2009.
[5] A. Moreda-Pineiro, A. Marcos, A. Fisher, S.J. Hill, J. Environ. Monit., 3 (2001) 352.
[6] G. Malmquist, R. Danielsson, J. Chromatogr. A, 687 (1994) 71.
[7] J.D. Ingle Jr., S.R. Crouch, Spectrochemical Analysis, Prentice-Hall, Englewood
Cliffs, NJ, 1988.
[8] D.A. Skoog, F.J. Holler, S.R. Crouch, Principles of Instrumental Analysis, Thomson
Brooks/Cole, Belmont, CA, 2007.
[9] S.L. Morgan, E.G. Bartick, in: R.D. Blackledge (Ed.), Forensic Analysis on the Cutting
Edge: New Methods for Trace Evidence Analysis, John Wiley & Sons, Inc., Hoboken,
NJ, 2007.
[10] L.J. Marshall, J.W. McIlroy, V.L. McGuffin, R. Waddell Smith, Anal. Bioanal. Chem.,
394 (2009) 2049.

4. CHAPTER 4: BASELINE CORRECTION
Introduction
In temperature-programmed gas chromatography (GC), the rise of the baseline at
the end of the chromatogram (at high temperature) can vary widely between analyses.
This rise is due to the degradation of the polysiloxane stationary phase in the column as
well as breakdown of the silicone septum [1]. As these baseline differences are not from
the sample, minimizing the baseline should not change the chemical information
contained in the chromatogram [1, 2]. In the chromatograms in this work, the unresolved
portion in the middle of the chromatogram between 15 and 65 minutes also has an
elevated baseline which can introduce variation into the chromatograms. However, for
the data considered here, the unresolved portion is sample dependent, originating from
the large number of compounds with similar boiling points, which were not well separated
by gas chromatography-mass spectrometry (GC-MS). Hence, these variations are
chemically important and should not be removed.
There are a number of different baseline correction methods that can be applied.
These methods generally fall into one of four categories: (1) transformation of the signal,
(2) subtraction of a chromatogram, (3) subtraction of a modeled function, or (4) removal
of specific signals in the sample chromatogram [1-4]. Each of these correction methods
is discussed in more detail below.
Signal transformation is an indirect method of baseline correction, in which the
baseline is not removed, but rather transformed, so that it is no longer meaningful.
Although many transforms are available, the most common is the first derivative of
the chromatogram [2, 4]. However, this method can often enhance the noise in the
chromatogram [4].
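Both properties of the first-derivative transform can be seen with synthetic data. The sketch below is illustrative only: the peak shape, the constant offset, the noise level, and the sampling interval are assumed values, and `numpy.gradient` stands in for whichever derivative filter an analyst might actually use.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(0.0, 2.0, 0.001)                   # retention time (min)
peak = np.exp(-0.5 * ((t - 1.0) / 0.02) ** 2)    # unit-height Gaussian peak
offset = 0.5                                      # constant baseline offset
noise = rng.normal(scale=0.01, size=t.size)       # point-to-point noise

# A constant offset vanishes on differentiation.
d_clean = np.gradient(peak + offset, t)
d_noisy = np.gradient(peak + offset + noise, t)

# Signal-to-noise ratio before and after the transform.
snr_before = peak.max() / noise.std()
snr_after = np.abs(d_clean).max() / (d_noisy - d_clean).std()
```

Under these assumptions the derivative of the offset chromatogram equals the derivative of the peak alone, while `snr_after` is far smaller than `snr_before`, which is the noise enhancement noted above.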
In chromatogram subtraction, a chromatogram that does not contain the sample
(generally the chromatogram of a solvent blank) is subtracted from each sample
chromatogram. This is a fast and simple correction method [1]. However, when the
baseline varies between injections, specifically between the blank and sample
chromatogram, this method cannot be used because subtraction of a single blank
chromatogram will not correct for variations in baseline among all samples and may result
in additional variations. This method would then require running a blank after every
sample, which would greatly increase analysis time.
Another baseline correction method utilizes a mathematical function to model the
baseline in each chromatogram, which is then subtracted from each sample
chromatogram [1-4]. The age of the stationary phase, the solvent, and the temperature
program can all affect the shape of the baseline and hence, functions used for fitting can
range from a simple linear fit to more complex high-order polynomials. As this method is
specifically modeled to fit each sample chromatogram, it will more accurately correct for
variations in the baseline among samples. However, care must be taken to ensure that
the model only accounts for non-chemical signals of the baseline. This is often difficult to
achieve, particularly in chromatograms of complex samples where resolution is poor or in
cases where peaks of relevance elute during the rise in the baseline.
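As a simple illustration of a modeled-function correction, the sketch below fits a polynomial to points that the analyst has marked as baseline and subtracts the fitted curve from the whole chromatogram. The function name, the boolean `baseline_mask` argument, and the default linear fit are assumptions made for this example and do not describe a method used in this work.

```python
import numpy as np


def subtract_polynomial_baseline(times, signal, baseline_mask, degree=1):
    """Fit a polynomial to baseline-only points and subtract it everywhere."""
    t = np.asarray(times, dtype=float)
    y = np.asarray(signal, dtype=float)
    mask = np.asarray(baseline_mask, dtype=bool)
    coefficients = np.polyfit(t[mask], y[mask], degree)  # fit peak-free points
    baseline = np.polyval(coefficients, t)               # model over full run
    return y - baseline, baseline
```

The quality of the correction depends entirely on the mask. If peaks or an unresolved region are included in the fit, chemical signal is modeled as baseline and removed, which is the difficulty described above for complex samples.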
Removal of individual signals in the chromatogram is specific for data generated
by mass spectrometry, where the total signal is the sum of individual ions. In this method,
specific ions are subtracted from the total ion chromatogram (TIC); typically these ions
correspond to column degradation and septum bleed. Common mass-to-charge (m/z)
ratios for ions resulting from column and septum degradation include m/z 73, m/z 147,
m/z 207, and m/z 281 [5]. However, these ions could also result from fragmentation of
chemical compounds in the sample. Therefore, removal of these ions can alter the
chemical signal.
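Removal of selected ions can be sketched as follows, assuming the GC-MS data are available as a scans-by-m/z intensity array with nominal (integer) m/z values. The array layout and the function name `tic_without_ions` are assumptions for illustration; the ion list is the one given above.

```python
import numpy as np

BLEED_IONS = (73, 147, 207, 281)   # column and septum bleed ions cited above


def tic_without_ions(intensities, mz_values, ions=BLEED_IONS):
    """Rebuild the TIC after dropping the listed m/z channels."""
    data = np.asarray(intensities, dtype=float)   # shape: (scans, m/z channels)
    keep = ~np.isin(np.asarray(mz_values), ions)  # True for retained channels
    return data[:, keep].sum(axis=1)              # one TIC value per scan
```

Because a whole m/z channel is dropped, any contribution of that ion from sample compounds is removed along with the bleed.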
The selection of appropriate baseline correction methods is highly sample
dependent. Considerations for selecting a baseline subtraction method include the
complexity of the sample, the source of the baseline signal, and the compounds present
in the sample. Care must be taken to ensure that only signal from the baseline is being
removed, and not signal from compounds in the sample.
Methods Tested and Evaluation Metrics
Three different baseline correction methods were compared in this research.
Preliminary results indicated that transformations and subtraction of a solvent blank
chromatogram were not appropriate for these data. Using a first-derivative transform led
to challenges in applying other data pretreatment procedures, particularly alignment as
the peaks were no longer the traditional Gaussian-shaped peaks that are typically
observed in chromatography. Subtracting the background from a solvent blank
chromatogram resulted in an incomplete reduction of the baseline and was therefore
discounted. The methods selected for this work all utilized extracted ion chromatograms
(EICs) to remove specific background signals from the chromatogram and included the
background subtracted baseline, subtraction of extracted ion profiles of the background
signal, and subtraction of the baseline using a modeled function.
4.2.1. Background Subtracted Baseline (BSB)
The background subtracted baseline (BSB) method for correcting the baseline
involves removing individual ions from the TIC and is included as a function in the
instrument software. Chemstation (Version E.02.01.1177, Agilent Technologies, Santa
Clara, CA) was used to select a specific mass spectrum, which was then subtracted from
each individual mass spectral scan in the chromatogram [6]. For this work, the last scan
in the TIC of each chromatogram was used for subtraction (Figure 4-1). Generally, scans
at the end of the chromatogram contain ions from only column degradation and septum
bleed, making this region useful for evaluating the baseline. Column degradation occurs
more readily at high temperature. The GC oven is at the highest temperature at the end
of the analysis, resulting in the end of the chromatogram containing most of the ions
resulting from degradation. Multiple scans can also be subtracted by repeating this
procedure, if additional reduction is required. The BSB function does not allow for
negative ion intensities, so any ion intensity that would become negative after
subtraction is set to zero. The function can also be used to remove other background
interferences in the chromatogram, such as impurities, by selecting a scan containing the
ions characteristic of the interference.
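The behavior described for the BSB function can be approximated with the sketch below. It is not the instrument software's implementation; it assumes a scans-by-m/z intensity array, takes the last scan as the background by default, and uses the hypothetical function name `background_subtract`.

```python
import numpy as np


def background_subtract(intensities, background_scan=-1):
    """Subtract one scan's spectrum from every scan, flooring at zero."""
    data = np.asarray(intensities, dtype=float)   # shape: (scans, m/z channels)
    background = data[background_scan]            # spectrum chosen as background
    corrected = np.clip(data - background, 0.0, None)  # no negative intensities
    return corrected, corrected.sum(axis=1)       # corrected scans and new TIC
```

Every ion present in the background scan is subtracted from every scan, so an ion that also arises from a sample compound is reduced throughout the chromatogram.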

[Figure 4-1: mass spectrum; x-axis: Mass-to-Charge Ratio, 50 to 300; y-axis: Abundance, 0 to 2500; labeled ions: m/z 73, 96, 133, 191, 207, and 281]

Figure 4-1. Representative mass spectrum from a diesel chromatogram (Diesel 1) at
retention time 108.335 minutes, the last scan in the chromatogram.

4.2.2. Subtraction of Extracted Ion Profiles
Some ions in the spectra used for BSB subtraction may also be fragment ions of
compounds in the sample, so the subtraction can also reduce the peaks that contain
those ions. Therefore, other subtraction methods may be
necessary to prevent the loss of chemical information.
In this work, rather than removing all ions in a given spectrum, the removal of
selected ions was also investigated. The EICs of the ions of interest from the baseline
were generated in the Chemstation software. The m/z that were investigated were the
ions present in the last scan of the chromatogram: m/z 73, m/z 96, m/z 133, m/z 191, m/z
207, m/z 208, m/z 209, m/z 281, and m/z 282, all of which are characteristic polysiloxane
fragments [5]. The EIC of each m/z of interest from Diesel 1 is shown in Figure 4-2. Each
EIC was examined to determine if any ions of that m/z were also generated from
compounds in the sample. As shown in Figure 4-2, m/z 96 and m/z 133 had high
abundances in the peak region of the chromatogram (prior to 70 minutes), which were
not present in the same EIC of a solvent chromatogram, demonstrating that these ions
also resulted from fragmentation of compounds that are in the sample. Therefore, m/z
96 and m/z 133 should not be removed from the TIC and were eliminated from further
consideration.
The six most abundant EICs were chosen to create a baseline extracted ion profile
(EIP) that could be subtracted from the TIC, in order to minimize the baseline. As m/z
73 had a relatively low abundance, it was also excluded from the EIP. The selected EICs

[Figure 4-2: ten panels (a-j); x-axis: Retention Time (min), 0 to 100; y-axis: Abundance]

Figure 4-2. Extracted ion chromatograms for ions present in the last mass
spectral scan (from Figure 4-1) including mass-to-charge (m/z) 73 (a), m/z 96 (b),
m/z 133 (c), m/z 191 (d), m/z 207 (e), m/z 208 (f), m/z 209 (g), m/z 281 (h), and m/z
282 (i). The extracted ion profile generated from these ions is also shown (j).

(m/z 191, m/z 207, m/z 208, m/z 209, m/z 281, and m/z 282) were exported from the
Chemstation software and imported to Excel (Office 2013, version 15.0, Microsoft
Corporation, Redmond, WA). Once in Excel, the EICs were summed at each retention
time to generate an EIP (Figure 4-2j). The EIP was then subtracted from the original TIC
for each chromatogram of diesel fuel.
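The spreadsheet steps described above amount to a point-by-point sum followed by a subtraction, which can be sketched as shown here. The sketch assumes each exported EIC is an array on the same retention-time axis as the TIC and that the EICs are held in a dictionary keyed by m/z; the function name `subtract_eip` is hypothetical.

```python
import numpy as np

EIP_IONS = (191, 207, 208, 209, 281, 282)   # ions selected for the EIP


def subtract_eip(tic, eics, ions=EIP_IONS):
    """Sum the selected EICs into an EIP and subtract it from the TIC."""
    eip = np.sum([np.asarray(eics[mz], dtype=float) for mz in ions], axis=0)
    return np.asarray(tic, dtype=float) - eip, eip
```

EICs for ions outside the selected list, such as m/z 73, are simply ignored.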
4.2.3. Subtraction of the Baseline using a Modeled Function
A third method for baseline correction was developed and tested in which a model
was generated to fit the baseline, using the EIP of the baseline described in Section 4.2.2.
When the EIP was generated for the previous method, some ions were not included
because those ions also resulted from the fragmentation of compounds in the sample.
This resulted in incomplete removal of the baseline. To model the EIP, the EIP
for each diesel chromatogram was fit in TableCurve 2D (Version 5.01, Jandel Scientific,
San Rafael, CA) to determine an appropriate equation for the model. An
asymmetrical sigmoid function (Equation 4-1) was selected based on the highest
coefficient of determination (r²) value when used to fit a solvent blank. As the solvent
blank is the shape of the baseline that is being removed without any other signal present,
this can be used to identify the appropriate function to model the EIP of each
chromatogram.

$$y = a + \frac{b}{\left[1 + \exp\left(-\dfrac{x - d\,\ln\left(2^{1/e} - 1\right) - c}{d}\right)\right]^{e}}$$

Equation 4-1

The y term is the resulting abundance at each retention time, x. The a term is the
initial height of the function and the b term is the transition height (the initial height
subtracted from the final height of the function). The c term is the retention time at which the
inflection point of the curve occurred, while the d and e terms control the shape of the
curved portion of the function. An asymmetric sigmoid allows for different curvature at
the top and the bottom of the function, allowing for more flexibility when fitting the
baseline. An example of the modeled baseline is shown in Figure 4-3.
The baseline EIP for each chromatogram was imported into TableCurve 2D and fit
to generate appropriate c, d, and e terms. The a term was selected as zero so that no signal
was removed from the beginning of the chromatogram. The b term was the average value
of the last eight minutes of the chromatogram. This region of the chromatogram is where
the baseline is at highest abundance and generally at a constant value. The b term was
determined from the TIC, rather than the EIP, in order to remove as much of the baseline
as possible. The asymmetrical sigmoid function was then regenerated in Excel using
Equation 4-1 and subtracted from the original chromatogram, generating a baseline
corrected chromatogram. This was repeated for each chromatogram.
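The regeneration and subtraction of the modeled baseline can be sketched as follows. The sketch assumes that Equation 4-1 has the five-parameter asymmetric sigmoid form y = a + b / [1 + exp(-(x - d ln(2^(1/e) - 1) - c) / d)]^e, for which y equals a + b/2 at x = c. The function names are hypothetical, and the c, d, and e values must be supplied from a separate fit.

```python
import numpy as np


def asymmetric_sigmoid(x, a, b, c, d, e):
    """Five-parameter asymmetric sigmoid (assumed form of Equation 4-1)."""
    x = np.asarray(x, dtype=float)
    shift = d * np.log(2.0 ** (1.0 / e) - 1.0)   # keeps the half-height at x = c
    return a + b / (1.0 + np.exp(-(x - shift - c) / d)) ** e


def subtract_modeled_baseline(times, tic, c, d, e, plateau_min=8.0):
    """Subtract the modeled baseline, with a = 0 and b taken from the TIC."""
    t = np.asarray(times, dtype=float)
    y = np.asarray(tic, dtype=float)
    b = y[t >= t[-1] - plateau_min].mean()       # mean of the last eight minutes
    return y - asymmetric_sigmoid(t, 0.0, b, c, d, e)
```

Setting e to 1 reduces the function to a symmetric sigmoid, which shows that the e term is what allows different curvature at the top and bottom of the rise.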

[Figure 4-3: modeled baseline with the a, b, c, and d/e terms marked on the curve; x-axis: Retention Time (min), 0 to 100; y-axis: Abundance, 0.00E+00 to 7.00E+03]

Figure 4-3. Model generated for baseline of a diesel chromatogram, based
on Equation 4-1. The a term is the initial height of the function, the b term
is the transition height, the c term is the retention time at which the
inflection point of the curve occurred, and the d and e terms control the
shape of the curve.

4.2.4. Evaluation Metrics
Chromatograms were visually examined to assess reduction in baseline as a result
of correction. The reduction in the baseline was also quantitatively evaluated using the
last 40 minutes of the chromatogram, where the rise in the baseline occurs. Ideally, the
baseline in this region should be zero. The sum of squares of the abundance for all points
in this baseline region was used to measure the magnitude of the baseline. The percent
change in the magnitude of the baseline before and after baseline correction was then
used to quantitatively compare each baseline reduction method. Visual inspection of the
PCA scores plot and average percent change in the clustering of replicates (PCC) was
also applied as previously discussed.
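The baseline metric can be sketched as shown here. It assumes the baseline region is selected by retention time as the final 40 minutes of the run and that a reduction in baseline appears as a negative percent change; the function names are hypothetical.

```python
import numpy as np


def baseline_magnitude(times, signal, window_min=40.0):
    """Sum of squared abundances over the last `window_min` minutes."""
    t = np.asarray(times, dtype=float)
    y = np.asarray(signal, dtype=float)
    region = t >= t[-1] - window_min        # points in the baseline region
    return float(np.sum(y[region] ** 2))


def percent_change(before, after):
    """Percent change relative to the uncorrected value (negative = decrease)."""
    return 100.0 * (after - before) / before
```

Because the abundances are squared, halving the baseline everywhere in the region gives a 75% decrease in the metric rather than a 50% decrease.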
Effect of Baseline Correction on Chromatographic Data

4.3.1. Visual Assessment
Prior to applying any baseline correction method, the chromatograms were
overlaid and only very slight differences were observed in the baseline among the sample
chromatograms. The percent difference between the average signals in the baseline region was
less than 10%, indicating that there were only small differences in the baseline. While it
is important to minimize non-chemical sources of variation, this variation is not likely to
have a major impact on the resulting PCA.
4.3.1.1. Background Subtracted Baseline
The EICs of each m/z present in the baseline of a diesel chromatogram (Figure
4-2) demonstrate the largest drawback of the BSB subtraction. Because the operator
cannot select which m/z to remove, all m/z in a single scan will be removed. A
reconstructed chromatogram, showing the signals that are subtracted using the BSB
method is shown in Figure 4-4a. This method resulted in a reduction of the signal in the
TIC within the peak region, due to the signals between 30 and 60 minutes in Figure 4-4a.
Figure 4-5a shows a representative diesel chromatogram prior to any pretreatment, with
an inset showing an expanded region of the chromatogram from approximately 70 to 108
minutes, where the rise in baseline occurs. Figure 4-5b shows the same chromatogram
after applying the BSB method (removing the signals shown in Figure 4-4a). As shown
in Figure 4-5b, after subtraction with the BSB method, not all of the baseline is removed.
Therefore, this method removes signals arising from chemical differences in the samples,
while not completely removing the baseline.
4.3.1.2. Subtraction of Extracted Ion Profiles
The subtraction of an extracted ion profile allows for the selection of ions to include
in the subtraction, permitting the analyst to tailor the subtraction to the sample, thereby
overcoming the main limitation in the BSB method described above. In this work, the ions
chosen for the baseline EIP (from Section 4.2.2) were selected based on having a large
contribution to the rise in the baseline in the noise region but low abundance in the peak
region (Figure 4-2). The EIP subtracted from the TIC using this method is shown in Figure
4-4b. Even when selecting specific ions to remove, there is still a reduction in the peak

[Figure 4-4: three panels (a-c); x-axis: Retention Time (min), 0 to 100; y-axis: Abundance, 0.00E+00 to 7.00E+03]

Figure 4-4. The signal that was subtracted from the TIC using the BSB
method (a), the EIP (b), and the function fit by the EIP (c)

[Figure 4-5: four panels (a-d), each with an inset of the baseline region; x-axis: Retention Time (min), 0 to 100; y-axis: Abundance, 0.00E+00 to 4.00E+05]

Figure 4-5. The baseline of the TIC before pretreatment (a) and pretreatment
using the BSB method (b), the EIP (c), and the function fit by the EIP (d).

region in the middle of the chromatogram. Figure 4-5c shows a diesel chromatogram
after subtracting the EIP (removing the signals shown in Figure 4-4b). The subtraction of
the EIP also did not completely remove the baseline (Figure 4-5c). More of the baseline
remains than when using the BSB method, because several m/z were excluded from the
EIP. Ions formed from the fragmentation of compounds in the sample will also be
removed from the chromatogram, resulting in the observed signal reduction in the peak
region. Therefore, for these particular data, subtraction of the EIP is also not effective at
baseline correction.
4.3.1.3. Subtraction of the Baseline Using a Modeled Function
The modeling of the baseline also provides the analyst with control over the
removal of signals from different regions of the chromatogram. However, the parameters
for fitting or modeling the baseline must be determined, which is more labor intensive than
other methods. Figure 4-4c shows an example of the modeled baseline that was removed
from each chromatogram. Using this method, there is no reduction of signal in the peak
region. Figure 4-5d shows a representative diesel chromatogram after subtraction of the
modeled baseline. This method results in a more complete removal of the baseline.
However, there is a small artifact at approximately 80 minutes (circled in Figure 4-5d) due
to improper modeling of the baseline. The c, d, and e terms in the asymmetric sigmoidal
fit could be further optimized to reduce this artifact, but doing so would be labor intensive. For
these data, this artifact will have little influence in future analyses because of its low
abundance.

This method is the only one of the methods investigated that results in both positive
and negative baseline signal. The point-to-point fluctuation observed in signal is noise.
This is important for evaluation of smoothing, discussed in Chapter 5. In addition, this
method did not result in a reduction of signal in the peak region and resulted in the most
complete reduction in the baseline.
4.3.2. Quantitative Assessment
As expected from the visual assessment of chromatograms, similar percent
changes in the sum of squares of the baseline regions were observed for each of the
baseline correction methods that were tested. The BSB method resulted in a 90%
reduction in the magnitude of the baseline, and the subtraction of the EIP resulted in an
88% reduction in the baseline compared to the non-corrected chromatograms. The
subtraction of the baseline using a modeled baseline EIP resulted in a 92% reduction in
the baseline. Because the BSB method contained more ions than the subtracted EIP,
the BSB method resulted in a larger reduction in the baseline than the subtraction of the
EIP. More ions could have been included in the EIP to increase the reduction of the
baseline; however, this could also result in a larger reduction of the signal in the peak
region. Employing the fitted EIP allowed for the greatest reduction because the overall
abundance (b term) was determined from each TIC. Additionally, this method reduces the chance
of removing chemical signal as a result of the baseline subtractions.
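
The percent change in the sum of squares of a baseline region can be sketched as follows. This is a minimal illustration only: the function name `percent_change_sum_squares`, the index bounds of the region, and the synthetic data are assumptions made for the example, not the values or code used in this work.

```python
import numpy as np

def percent_change_sum_squares(before, after, start, stop):
    """Percent change in the sum of squared abundances over points [start, stop)."""
    region_before = np.asarray(before, dtype=float)[start:stop]
    region_after = np.asarray(after, dtype=float)[start:stop]
    ss_before = np.sum(region_before ** 2)
    ss_after = np.sum(region_after ** 2)
    return 100.0 * (ss_after - ss_before) / ss_before

# Synthetic example (hypothetical data): a flat baseline of 10 counts
# that is reduced to 1 count by a baseline correction.
raw = np.full(100, 10.0)
corrected = np.full(100, 1.0)
change = percent_change_sum_squares(raw, corrected, 20, 80)
print(f"Percent change in baseline sum of squares: {change:.1f}%")
```

A negative value indicates a reduction in the baseline. Because the abundances are squared, the metric weights large baseline excursions more heavily than small ones.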

4.4. Effect of Baseline Correction on PCA Scores Plot

4.4.1. Visual Assessment
Figure 4-6 shows the scores plot for replicates of the eight diesel samples prior to
any pretreatment (a), and after baseline correction only (b). There are no differences
observed in the positioning of samples on the scores plot after baseline correction. The
loadings plots for PC1 and PC2 (Figure 4-7a and Figure 4-7b, respectively) also show no
difference before (Figure 2-7) and after baseline correction. In this work, even though the
baseline was elevated across a large portion of the chromatogram, this elevation was
consistent between samples, and was not a major source of variation. This is likely due
to the short period of time over which the samples were analyzed. As expected, because
there was no change in the loadings plot, there was also no change in the positioning of
samples on the scores plot.

Figure 4-6. Scores plots of eight diesels in triplicate without any pretreatment
(a) and after baseline correction (b). [Plot axes: PC1 (50.2%) versus PC2 (27.2%) in both panels.]

Figure 4-7. Loadings plot for PC1 (a) and PC2 (b) after baseline correction. [Plot axes: loadings versus retention time (min); peaks labeled C10 to C20.]

PCA was then performed after baseline correction and total area normalization of
the data. Figure 4-8 shows the scores plot after area normalization only (a) and after
baseline correction followed by area normalization (b). All of the samples were slightly
shifted on the scores plot after baseline correction and normalization. This resulted in a
few samples appearing to be better clustered (e.g., Diesel 10, pink stars), while other
samples appeared to be less clustered (e.g., Diesel 4, orange squares). The loadings plots
for PC1 and PC2 after baseline correction followed by normalization are shown in Figure
4-9a and Figure 4-9b, respectively. Compared to the raw data (Figure 2-7), there is a
slight decrease in the baseline in the loadings plots of both PC1 and PC2. The small
differences in the loadings plots after baseline correction explain why only small shifts in
the samples were observed. After applying baseline correction, the percent variance for
PC1 increased from 57.9% to 58.9%, while the percent variance for PC2 decreased from
21.4% to 21.1%. The total percent variance accounted for using the first two PCs
increased from 79.3% to 80.0%. In general, the increase in percent variance accounted
for using PC1 and PC2 is important because the more variance accounted for on the first
two PCs, the more random error is being removed. The few changes in the PCA results
after baseline correction indicate that the baseline is not a major source of variation in
these data.
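
The percent variance accounted for by each PC, as quoted above, can be illustrated with a short sketch. This is a minimal example under stated assumptions: each row is one chromatogram, total area normalization is taken to mean dividing each row by its summed abundance, and PCA is performed by singular value decomposition of the mean-centered data. The function names and the random test data are hypothetical, and the exact normalization and PCA settings used in this work may differ.

```python
import numpy as np

def area_normalize(X):
    """Scale each row (one chromatogram) so that its abundances sum to 1."""
    X = np.asarray(X, dtype=float)
    return X / X.sum(axis=1, keepdims=True)

def pca_percent_variance(X):
    """Percent of the total variance carried by each PC of the mean-centered rows."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    return 100.0 * variance / variance.sum()

# Hypothetical data: 6 "chromatograms" with 50 points each.
rng = np.random.default_rng(0)
X = rng.random((6, 50)) + 1.0
pct = pca_percent_variance(area_normalize(X))
first_two = pct[:2].sum()
print(f"PC1 + PC2 account for {first_two:.1f}% of the variance")
```

The singular values are returned in descending order, so the first two entries of `pct` correspond to PC1 and PC2.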

Figure 4-8. Scores plots of eight diesels in triplicate after total area normalization
(a) and after baseline correction followed by total area normalization (b). [Plot axes: (a) PC1 (57.9%) versus PC2 (21.4%); (b) PC1 (58.9%) versus PC2 (21.1%).]

Figure 4-9. Loadings plot for PC1 (a) and PC2 (b) after baseline correction
and area normalization. [Plot axes: loadings versus retention time (min); peaks labeled C10 to C20.]

4.4.2. Quantitative Assessment
Based on the percent change in the clustering of replicates (PCC), baseline
correction alone had no measurable effect on the variance of replicate samples in the
PCA scores plot (Table 4-1). However, when used in conjunction with either
normalization method, there was a slight decrease in the PCC, indicating that baseline
correction resulted in slightly poorer clustering of replicates. The PCC was 45.1% using
only area normalization and 59.0% using only peak normalization. After baseline correction,
the PCC decreased slightly to 44.6% and 58.7%, respectively. As previously discussed,
baseline correction may not be necessary for this work. However, this is not always the
case. If different chromatographic columns were used or the samples were analyzed
over a longer time period, there may be greater variability in the baseline. In
this work, the samples were collected over a few weeks and only small, insignificant
variation was observed in the baseline. Baseline correction may also have been
necessary if peaks eluted during the rise in the baseline.

Table 4-1. The average percent change in the clustering (PCC) of replicates after
the listed pretreatment procedures including baseline correction using the
extracted ion profiles (EIP fit) and normalization using total area (Area) and single
peak (Peak) normalization methods.

Baseline Correction    Normalization    PCC (%)
-                      Area             45.1
-                      Peak             59.0
EIP fit                -                0.0
EIP fit                Area             44.6
EIP fit                Peak             58.7

4.5. Summary
The baseline extends over a large number of data points in the chromatogram and
therefore could be a major source of variation in PCA, even though the baseline is often
small compared to the peaks. In this work, there was little quantifiable difference in the
baseline correction methods examined. Additionally, baseline correction was shown to
have little effect on the clustering of replicates when analyzed using PCA. This is likely
due to the similarity of the rise in the baseline for the different diesel samples tested.
Baseline correction may be useful when applied to other datasets. Of the three
methods evaluated in this work, using the fitted EIP to remove the baseline provides the analyst
with the most control when selecting the signals to remove.

REFERENCES


[1] K.M. Pierce, J.S. Nadeau, R.E. Synovec, in: C.F. Poole (Ed.), Gas Chromatography,
Elsevier, Waltham, MA, 2012.
[2] K.R. Beebe, R.J. Pell, M.B. Seasholtz, Chemometrics: A Practical Guide, John Wiley
& Sons, Inc., New York, NY, 1998.
[3] M. Daszykowski, B. Walczak, Trac-Trends Anal. Chem., 25 (2006) 1081.
[4] S.L. Morgan, E.G. Bartick, in: R.D. Blackledge (Ed.), Forensic Analysis on the Cutting
Edge: New Methods for Trace Evidence Analysis, John Wiley & Sons, Inc., Hoboken,
NJ, 2007.
[5] M.C. McMaster, GC/MS: A Practical User’s Guide, John Wiley & Sons, Inc., Hoboken,
NJ, 2008.
[6] Agilent Technologies Inc., Agilent Technologies, Inc, 2011.

5. CHAPTER 5: SMOOTHING
5.1. Introduction
Noise, or point-to-point fluctuations in signal, is another source of non-chemical
variation. Like the baseline, the noise must also be minimized to allow for comparison of
chemical variations between samples [1]. The goal of smoothing is to minimize the
random fluctuations in the chromatogram without distorting the chemical signal.
Smoothing methods can be classified into two general categories: running and filtering
smoothers [1-4]. Running smoothers remove point-to-point fluctuation using the data
points around a central point (called a window). The new value of the central point is
calculated using an average of the data points in the window (called a boxcar smooth) or
by fitting the data points in the window using a polynomial function (called the
Savitzky-Golay smooth) [3, 4]. The center point is incremented along the chromatogram and the
process is repeated for each point, resulting in a smoothed chromatogram. Filtering
smoothers remove specific signals from the chromatogram. The most common example
is the fast Fourier transform smoother, which filters the high-frequency signals from the
chromatogram. Noise consists of rapid changes in signal that occur from point to point and
therefore is high frequency in nature [3].
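
The running-smoother idea described above can be sketched with a boxcar smooth, the simplest case. This is a minimal illustration: the function name `boxcar_smooth` is hypothetical, and repeating the end values to handle the edges of the chromatogram is an assumption made here, not a choice documented in this work.

```python
import numpy as np

def boxcar_smooth(y, window):
    """Replace each point with the mean of the `window` points centered on it."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central point")
    y = np.asarray(y, dtype=float)
    half = window // 2
    # Assumption: repeat the first and last values so the output keeps its length.
    padded = np.pad(y, half, mode="edge")
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")

# A single sharp spike is spread over the 3-point window and lowered in height.
signal = np.array([0.0, 0.0, 3.0, 0.0, 0.0])
smoothed = boxcar_smooth(signal, 3)
print(smoothed)
```

The example also shows the trade-off discussed later in this chapter: the same averaging that suppresses point-to-point noise lowers and broadens a narrow peak.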
5.2. Methods Tested and Evaluation Metrics
One smoothing algorithm from each of the general types of smoothers was
compared. The Savitzky-Golay (SG) smooth was utilized as the running smoother and
the fast Fourier transform (FFT) smooth served as the signal filtering smoother. These
smoothing algorithms were selected due to their popularity and wide availability [1, 3, 4].
Many commercially available data analysis and chemometric software packages contain
one or more of these smoothing algorithms. Origin Pro (version 7.5, OriginLab
Corporation, Northampton, MA) contains both the Savitzky-Golay and the fast Fourier
transform smooths and was used to compare both algorithms using a single diesel
chromatogram. TICs were exported from Excel after baseline correction and imported
into Origin for smoothing.
5.2.1. The Savitzky-Golay Smooth
The Savitzky-Golay algorithm fits a least-squares polynomial to the data points in a
moving window and replaces the central point with the fitted value [5]. The order of the
polynomial and the number of data points in the smooth can be varied. For this work,
different combinations of polynomial order and number of points were investigated. The order
of the polynomial ranged from 1 to 6 and the total number of data points varied from 3 to 25,
with an equal number of points on each side of the central point. Only even-order polynomials
(after the first order) were considered, because a central-point smooth gives equivalent
smoothing for an even-order polynomial and the following odd-order polynomial [6]. The SG
smoothing algorithms were applied in Origin and the results were exported back to Excel for
further investigation.
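
The least-squares fit in a moving window, and the equivalence of an even order with the following odd order, can be sketched as below. This is an illustrative implementation written for this example, not the Origin routine used in this work; the function names and the edge handling (repeating the end values) are assumptions.

```python
import numpy as np

def savgol_weights(order, n_points):
    """Weights that return the fitted polynomial's value at the window center."""
    if n_points % 2 == 0 or n_points <= order:
        raise ValueError("n_points must be odd and greater than the order")
    half = n_points // 2
    x = np.arange(-half, half + 1, dtype=float)
    design = np.vander(x, order + 1, increasing=True)  # columns x^0 ... x^order
    # Row 0 of the pseudo-inverse maps the window data to the constant term of
    # the least-squares polynomial, which is its value at x = 0 (the center).
    return np.linalg.pinv(design)[0]

def savgol_smooth(y, order, n_points):
    """Apply the central-point Savitzky-Golay smooth to every point of y."""
    weights = savgol_weights(order, n_points)
    half = n_points // 2
    padded = np.pad(np.asarray(y, dtype=float), half, mode="edge")  # assumed edge rule
    return np.convolve(padded, weights[::-1], mode="valid")

w2 = savgol_weights(2, 7)   # 2nd-order polynomial, 7 points
w3 = savgol_weights(3, 7)   # 3rd-order polynomial, 7 points
print(np.allclose(w2, w3))  # the two sets of weights coincide
```

The weights coincide because the odd-power terms of the polynomial contribute nothing at the center of a symmetric window, which is why only even orders need to be tested.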
5.2.2. The Fast Fourier Transform Smooth
To apply an FFT smooth, the data are transformed from the time domain to the
frequency domain using a fast Fourier transform. Then, a low-pass filter is applied to the
data in the frequency domain to remove the high-frequency noise component. The point
at which the filter is applied is called the cutoff frequency. The cutoff frequency for the
low-pass filter is inversely related to the number of points in the chromatogram [7], at a
fixed scan rate. This means that a change in the number of points in the chromatogram, which is
affected by the temperature program, the solvent delay, and the scan rate, will result in
a different degree of smoothing. Many software packages favor the running smoothers,
because their performance is not dependent on the number of points or scan rate. For
this work, FFT smooths using 1 to 10 points were applied in Origin, corresponding to cutoff
frequencies between approximately 2.91 and 0.29 Hz.
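
The transform, filter, and inverse-transform sequence can be sketched as follows. This is a minimal example that uses an ideal (brick-wall) low-pass filter; the filter shape applied by Origin is not described here and may differ. The function name `fft_lowpass`, the sampling rate, and the test frequencies are hypothetical values chosen for illustration.

```python
import numpy as np

def fft_lowpass(y, sample_rate_hz, cutoff_hz):
    """Zero every frequency component above cutoff_hz, then transform back."""
    y = np.asarray(y, dtype=float)
    spectrum = np.fft.rfft(y)                                 # time -> frequency
    freqs = np.fft.rfftfreq(y.size, d=1.0 / sample_rate_hz)   # frequency of each bin
    spectrum[freqs > cutoff_hz] = 0.0                         # ideal low-pass filter
    return np.fft.irfft(spectrum, n=y.size)                   # frequency -> time

rate = 8.0                                   # hypothetical scans per second
t = np.arange(64) / rate
slow = np.sin(2 * np.pi * 0.5 * t)           # 0.5 Hz component, below the cutoff
fast = 0.3 * np.sin(2 * np.pi * 3.0 * t)     # 3.0 Hz component, above the cutoff
filtered = fft_lowpass(slow + fast, rate, cutoff_hz=1.5)
print(np.max(np.abs(filtered - slow)))       # residual after filtering
```

Lowering `cutoff_hz` removes more of the spectrum and therefore gives a higher degree of smoothing, at the cost of removing the high-frequency content of narrow peaks as well.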
5.2.3. Metrics Used for Evaluation
The performance of each smoothing algorithm was evaluated based on the signal
enhancement and extent of peak distortion. The signal enhancement was quantitatively
measured by calculating the percent change in the noise and the signal-to-noise ratio in
the TIC before and after smoothing. The standard deviation of the last 13 minutes of the
TIC was defined as the noise (s_noise), because only noise is present in this region of the
chromatogram. The signal-to-noise ratio (S/N) was calculated by dividing the maximum
abundance of the pentadecane peak (A_C15) in each TIC by the previously defined noise.

\[ \frac{S}{N} = \frac{A_{\mathrm{C15}}}{s_{\mathrm{noise}}} \]    Equation 5-1

The percent change in the noise was also used to determine the degree of smoothing.
A larger reduction in the noise corresponded to a higher degree of smoothing.
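
The noise and signal-to-noise metrics above can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the use of the sample standard deviation (`ddof=1`) is an assumption, and the numbers are synthetic rather than taken from the diesel chromatograms.

```python
import numpy as np

def noise_std(time_min, abundance, last_minutes=13.0):
    """Standard deviation of the abundance over the final `last_minutes` of the run."""
    time_min = np.asarray(time_min, dtype=float)
    abundance = np.asarray(abundance, dtype=float)
    in_noise_region = time_min >= time_min[-1] - last_minutes
    # Assumption: sample standard deviation (ddof=1).
    return float(np.std(abundance[in_noise_region], ddof=1))

def signal_to_noise(peak_height, s_noise):
    """Equation 5-1: maximum peak abundance divided by the noise."""
    return peak_height / s_noise

def percent_change(before, after):
    """Percent change of a metric after smoothing relative to before smoothing."""
    return 100.0 * (after - before) / before

# Synthetic example: only the last three points fall in the final 13 minutes.
time = [0.0, 50.0, 96.0, 100.0, 108.0]
tic = [100.0, 100.0, 1.0, 3.0, 5.0]
s_noise = noise_std(time, tic)
print(s_noise, signal_to_noise(1000.0, s_noise), percent_change(200.0, 150.0))
```

With these definitions, a smooth that lowers the noise gives a negative percent change in noise and, provided the peak height is preserved, a positive percent change in the signal-to-noise ratio.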
Smoothing can result in peak broadening, which can be observed as a widening
of the peak and a reduction in the peak height. EICs were used to determine peak
distortion, as EICs provide better resolution and enhanced signal-to-noise, allowing for a
more sensitive evaluation of peak distortion. In order to monitor peak distortion, the
percent changes in the peak height, peak width, and resolution before and after smoothing
were calculated. The maximum peak height of the pentadecane peak in the EIC of m/z
71 was utilized for the peak height. The pentadecane peak was selected because it was
the largest peak in the chromatogram, resulting in a high signal. The ion with m/z 71 was
selected because it was the highest abundance ion with baseline resolution for
pentadecane. The peak width for pentadecane was determined using the second
statistical moment (M2), which measures the variance of the peak, using the abundance
(A) at each retention time (t) in the peak [8, 9].
\[ M_2 = \frac{\int_0^{\infty} t^2 A_t \, dt}{\int_0^{\infty} A_t \, dt} \]    Equation 5-2
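
A numerical form of the second moment can be sketched as below. As printed, Equation 5-2 is the second moment about the time origin; the sketch also reports the second moment about the peak centroid, which is the quantity usually identified with the variance of a peak. Which of the two was computed in this work is not stated here, so reporting both is an assumption, as are the function names, the trapezoidal integration, and the synthetic Gaussian peak.

```python
import numpy as np

def _trapezoid(values, t):
    """Trapezoidal integral of `values` over the (possibly uneven) time axis t."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(t)))

def peak_moments(t, abundance):
    """Return the centroid, the raw second moment, and the central second moment."""
    t = np.asarray(t, dtype=float)
    a = np.asarray(abundance, dtype=float)
    area = _trapezoid(a, t)
    centroid = _trapezoid(t * a, t) / area
    raw_m2 = _trapezoid(t ** 2 * a, t) / area                    # Equation 5-2 as printed
    central_m2 = _trapezoid((t - centroid) ** 2 * a, t) / area   # variance about the centroid
    return centroid, raw_m2, central_m2

# Synthetic Gaussian peak (hypothetical): centered at 18.0 min, sigma = 0.02 min.
t = np.linspace(17.8, 18.2, 2001)
peak = np.exp(-0.5 * ((t - 18.0) / 0.02) ** 2)
centroid, raw_m2, central_m2 = peak_moments(t, peak)
print(centroid, central_m2)
```

For a Gaussian peak the central second moment equals the square of the standard deviation, so peak broadening caused by smoothing appears directly as an increase in this value.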

When peaks become broader, there is also a reduction in the resolution between peaks.
The resolution (Rs), or separation between peaks, [10] was calculated between tetralin
and pentylbenzene using the retention time (t) and width (w) of each peak [11].

\[ R_s = \frac{2 (t_2 - t_1)}{w_1 + w_2} \]    Equation 5-3

In the TIC, these two peaks overlap; however, by utilizing EICs of m/z 132 (for tetralin)
and m/z 148 (for pentylbenzene), the peaks can be separated, allowing an accurate
calculation of resolution (Figure 5-1). The resolution of these peaks in Diesel 1a prior to

Figure 5-1. A representative diesel chromatogram showing the TIC (black) (a)
and EICs (b) of m/z 132 for tetralin (blue) and m/z 148 for pentylbenzene (red). [Plot axes: abundance versus retention time (min), 17.9 to 18.4 min.]

smoothing is 0.9. Baseline resolution corresponds to a value of 1.5. Ideally, smoothing should
enhance signal-to-noise while only causing minimal peak distortion (i.e., minimal reduction in height and
broadening of peaks, resulting in loss of resolution).
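
The resolution calculation in Equation 5-3 is simple enough to show as a worked example. The retention times and widths below are illustrative values chosen to give baseline resolution; they are not measurements from the diesel chromatograms, and the function name is hypothetical.

```python
def resolution(t1, t2, w1, w2):
    """Equation 5-3: Rs = 2 (t2 - t1) / (w1 + w2), with all inputs in the same time units."""
    return 2.0 * (t2 - t1) / (w1 + w2)

# Illustrative inputs: two peaks 0.15 min apart, each 0.10 min wide.
rs = resolution(t1=18.00, t2=18.15, w1=0.10, w2=0.10)
print(f"Rs = {rs:.2f}")
```

Because the widths appear in the denominator, any broadening introduced by smoothing lowers the resolution even when the retention times are unchanged.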
5.3. Effect of Smoothing on Chromatographic Data

5.3.1. Visual Assessment
The small fluctuations that are due to noise are often difficult to visually identify in
the chromatogram. Figure 5-2 shows an expanded region of a diesel chromatogram
before (a) and after (b) smoothing using a 2-point FFT smooth. The inset on the left
shows an expanded view of the baseline. Prior to smoothing, the peak and baseline are
jagged, showing point-to-point variation in the signal. After smoothing, the random
fluctuations have been reduced. The inset on the right shows the end of the
chromatogram where the signal is approximately constant and the point-to-point
fluctuations are due to noise. After smoothing, the noise at the end of the chromatogram
is also reduced. The higher the degree of smoothing applied, the more the noise is
reduced. However, the small differences resulting from different smoothing parameters
were challenging to identify using visual assessment.

Figure 5-2. An expanded region of 1,3,5-trimethylbenzene in a representative
diesel chromatogram after baseline correction (a) and after baseline correction
and smoothing, using FFT 2 (b). The inset on the left is a further expanded
region of the baseline, demonstrating the point-to-point variation before and
after smoothing. The inset on the right shows the region at the end of the
chromatogram, including the region defined as noise. [Plot axes: abundance versus retention time (min), 8.75 to 9.50 min.]

For some of the diesel samples, after smoothing had been applied, undesirable
changes were observed in the chromatogram. At higher degrees of smoothing,
reductions in peak heights and increases in peak widths were observed. These changes
were often small, but could be identified after the chromatograms were overlaid.
Additionally, artifacts were often observed on the edges of the peaks. Figure 5-3 shows
an expanded region of Diesel 1a overlaid before (black line) and after smoothing (red line)
with a Savitzky-Golay, 4th-order polynomial with 11 total points (a) and a 6th-order
polynomial with 31 points (b). In Figure 5-3a, there were only slight differences observed
between the unsmoothed and smoothed chromatograms, using a moderate level of
smoothing. When a high degree of smoothing was applied, the peaks became broader
and peak heights decreased. In addition, artifacts were usually observed near the edges
of large peaks and appeared as valleys in the negative direction on either side of the
peak. In Figure 5-3b, the artifacts are shown on the red trace between 4.40 and 4.90
minutes. These artifacts, which are characteristic of over-smoothing, are not easily
identified using the metrics, so visual inspection may still be necessary to ensure that
over-smoothing is not occurring. These artifacts are not usually observed at low degrees
of smoothing.

Figure 5-3. An expanded region of a diesel chromatogram without smoothing
(black line) and with smoothing (red line) using a Savitzky-Golay smoothing
algorithm. Part a shows a good smooth (polynomial order of 4 and 11 total
points) while part b shows the broadening of peaks, decrease in peak height,
and artifacts on the peak edges associated with over-smoothing (polynomial
order of 6 and 31 total points). [Plot axes: abundance versus retention time (min), 3.8 to 5.4 min.]

5.3.2. Quantitative Assessment
The specific smoothing parameters that were tested and the results of the
quantitative assessment are shown in Table 5-1. The percent change in the noise and
signal-to-noise were used to evaluate the degree of smoothing, while the percent
reduction in the peak height, peak variance, and resolution were used to indicate the
extent of peak distortion. Similar percent reductions in noise were observed for different
combinations of the smoothing parameters for both the fast Fourier transform and
Savitzky-Golay smoothing algorithms. For example, a fast Fourier transform smooth with
1 point had a similar reduction in noise and increase in signal-to-noise to a Savitzky-Golay
smooth with a 4th-order polynomial with 7 points (Table 5-1). Parameters were grouped,
based on the degree of smoothing. Different parameters that resulted in the same
reduction in noise (degree of smoothing) were assigned to the same group (groups 1 - 5,
Table 5-1). Each group has a similar reduction in noise but contains both smoothing
algorithms and several different combinations of parameters for the SG smoothing
algorithm. These relationships are more apparent in Figure 5-4, where the standard
deviation in the noise is plotted versus the number of points in the smooth, on a log-log
scale. Each group (Figure 5-4, represented by color) has approximately the same noise,
but a different number of points included in the smooth. Having several combinations of
parameters at each smoothing level provides the analyst with more control over the
smoothing and the ability to minimize some peak distortion that may be observed.

Table 5-1. Percent change in each metric for different smoothing parameters.
The parameters are grouped based on the level of smoothing.

Group  Applied Smooth  Noise (a)  Signal-to-Noise  Peak        Peak          Resolution (d)
                                  Ratio (b)        Height (c)  Variance (c)
1      FFT 1 (e)       -20        24               -1.7        0.6           -0.2
       SG 2,5 (f)      -25        34               -2.8        0.2           2.1
       SG 4,7          -21        26               -2.1        0.2           2.1
2      FFT 2           -38        58               -5.1        4.9           1.7
       SG 1,3          -35        53               -5.0        4.5           -0.3
       SG 2,7          -35        54               -4.7        1.8           2.8
       SG 4,11         -35        54               -4.6        2.7           2.6
       SG 6,15         -35        54               -5.0        2.9           2.6
3      FFT 3           -45        79               -7.0        7.8           -0.7
       SG 1,5          -45        78               -8.5        10            -1.5
       SG 2,11         -45        80               -5.5        3.1           -0.7
       SG 4,17         -45        80               -4.5        2.5           1.6
       SG 6,23         -43        75               -4.2        2.1           1.6
4      FFT 4           -50        92               -9.0        12            -3.5
       SG 1,7          -50        91               -11         18            -7.3
       SG 2,15         -50        94               -7.5        3.7           0.5
       SG 4,25         -50        98               -7.8        2.7           0.1
5      FFT 10          -58        91               -31         58            -22
       SG 1,15         -57        88               -31         74            -25

a. Calculated as the standard deviation of the last 13 minutes of the chromatogram.
b. The maximum height of the C15 peak divided by the noise.
c. Using the EIC of m/z 132 for Tetralin.
d. Between EIC of Tetralin (m/z 132) and EIC of pentylbenzene (m/z 148). Baseline
   resolution corresponds to a value of 1.5.
e. FFT: Fast Fourier transform smoothing. The number indicates how many points
   were used for the smooth.
f. SG: Savitzky-Golay smoothing. The first number indicates the order of the
   polynomial, while the second number indicates the total number of points in the
   smooth.

Figure 5-4. A log-log plot of the standard deviation of the noise region versus
the total number of points in the smooth. Different smoothing parameters are
represented by each symbol: FFT (), SG 1st order polynomial (), SG 2nd
order polynomial (), SG 4th order polynomial (), SG 6th order polynomial
(▼). Groupings were assigned based on the standard deviation in the noise
region after smoothing. The color represents the groups in Table 5-1.

Across all five groups, the percent reduction in noise ranged from 20 - 58%, with
similar reductions in noise observed within each group. Both the FFT and SG algorithms
performed similarly within each group (Table 5-1). This demonstrates the wide range of
smoothing options available. Group 1 had the smallest reduction in noise (20 - 25%)
while group 5 had the greatest reduction (57 - 58%). Group 1 also had the smallest
improvement in the signal-to-noise ratio (24 - 34%) while groups 4 and 5 had similar
improvement in the signal-to-noise ratio (88 - 98%). This shows that, for a given
polynomial order, fewer points result in a lower degree of smoothing and more points
result in a higher degree of smoothing.
Peak distortion was also considered when evaluating the smoothing algorithms.
For all levels of smoothing, there is a reduction in the peak height, ranging from 2% to
30%. Variations in peak height of 5% were observed for replicate analyses prior to
smoothing; therefore, reductions in peak heights greater than 5% were considered
significant and detrimental. This only occurs in groups 3, 4, and 5. The percent change
in the peak variance ranged from 0.2% to 74% for groups 1 to 5; however, significant and
detrimental changes (>5%) were also only observed in groups 3, 4, and 5. In most cases,
groups 1 - 4 had only small changes in resolution (generally less than ± 3%) and were
likely not significant. Larger and detrimental changes in resolution were observed when
the change in peak variance exceeded 18%, which is observed in groups 4 and 5. These
metrics were not able to identify the artifacts that were visually observed on the edges of
peaks (Figure 5-3) when the percent change in the noise exceeded approximately 40%
(group 3).

In general, the FFT smoothing algorithm resulted in a monotonic decrease in noise
when more points were considered in the smooth (Figure 5-4, diamonds). When a higher
degree of smoothing was used (i.e., more points, corresponding to a lower cutoff
frequency), there was also more peak distortion observed. The SG smoothing algorithm
resulted in a decrease in noise when the order of the polynomial was decreased and the
number of points was constant (Figure 5-4). A decrease in noise was also observed when
the number of points included in the smooth was increased while the order of the
polynomial was constant (Figure 5-4, squares, triangles, and circles). As with FFT
smoothing, SG smoothing resulted in more peak distortion at higher degrees of
smoothing. However, within a group, there was less distortion using an SG smooth with a
higher order polynomial and more points. The performance of the corresponding FFT
within that group usually fell in the middle of the SG parameters. The smoothing
parameters in group 2 result in the largest reduction of noise (35%) that introduces
only minimal (5%) peak distortion. For this work, the FFT smoothing algorithm with 2
points (FFT 2) was selected and used for all subsequent pretreatments.
Using the FFT smoothing algorithm, the ideal cutoff frequency was approximately
half of the scan rate (in this work corresponding to FFT 2). For the SG smoothing
algorithm, a polynomial order of 2 or 4 is sufficient for smoothing most chromatographic
peaks. For a given order polynomial, an approximately 25% increase in signal-to-noise
ratio is obtained by increasing the number of points in the smooth by 2n, where n is the
order of the polynomial. To decrease the peak distortion and maintain the same degree
of smoothing, the order of the polynomial can be increased and the number of points in
the smooth increased by 2n. Beyond group 4, there is no further increase in the
signal-to-noise ratio, and there is substantial peak distortion.
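
One way to see why different SG settings can give the same degree of smoothing is to compute the factor by which each smooth scales idealized noise. For uncorrelated noise of equal variance, a linear smooth scales the noise standard deviation by the square root of the sum of its squared weights. The sketch below applies this to three settings from group 2 of Table 5-1. It is an idealized calculation with hypothetical function names: real chromatographic noise need not be uncorrelated, so these factors are not expected to reproduce the measured reductions in Table 5-1.

```python
import numpy as np

def sg_center_weights(order, n_points):
    """Central-point Savitzky-Golay weights from a least-squares polynomial fit."""
    half = n_points // 2
    x = np.arange(-half, half + 1, dtype=float)
    design = np.vander(x, order + 1, increasing=True)
    return np.linalg.pinv(design)[0]

def white_noise_factor(weights):
    """Output/input noise standard deviation ratio for uncorrelated noise."""
    return float(np.sqrt(np.sum(np.asarray(weights) ** 2)))

factors = {}
for order, n_points in [(1, 3), (2, 7), (4, 11)]:
    factors[(order, n_points)] = white_noise_factor(sg_center_weights(order, n_points))
    print(f"SG {order},{n_points}: noise scaled by {factors[(order, n_points)]:.3f}")
```

Under this idealization the three settings give the same factor, which is consistent with their assignment to a single group, while the wider, higher-order windows follow the peak shape more closely.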

5.4. Effect of Smoothing on PCA Scores Plot

5.4.1. Visual Assessment
The PCA scores plots (Figure 5-5) after baseline correction only (a) and after
baseline correction and smoothing (b) show little visual difference in the positioning of
samples. This indicates that the noise is not a major source of variation in the
chromatograms of these samples. After baseline correction and smoothing, PC1 and PC2
account for 78.3% of the variation, only slightly increased from 77.4% for baseline
correction alone.
The small change in the positioning of samples is also reflected in the loadings
plots (Figure 5-6) of PC1 (a) and PC2 (b). There is little visual difference between the
loadings plots after only baseline correction (Figure 4-7) and after baseline correction and
smoothing. The only notable difference is that the point-to-point fluctuations observed in
the noise region of the loadings plots (95 to 108 minutes) have been reduced. The small
effect that smoothing has on clustering of replicates is expected, given the small
contribution of this region in both the chromatogram and the loadings plots. However, if
a lower signal-to-noise ratio was observed in the chromatogram, the noise would become
a more significant source of variation.

Figure 5-5. PCA scores plot of eight diesel chromatograms in triplicate
after baseline correction (a) and after smoothing using FFT 2 (b). [Plot axes: (a) PC1 (50.2%) versus PC2 (27.2%); (b) PC1 (50.9%) versus PC2 (27.4%).]

Figure 5-6. Loadings plot for PC1 (a) and PC2 (b) after smoothing. [Plot axes: loadings versus retention time (min); peaks labeled C10 to C20.]

There are also only slight differences in the positioning of samples on the scores
plot (Figure 5-7) after baseline correction and normalization (a) and after baseline
correction, smoothing, and normalization (b). The variance accounted for by PC1 and
PC2 increased from 79.3% to 81% with the inclusion of smoothing as a pretreatment.
Additionally, the loadings plots after baseline correction, smoothing, and normalization
(Figure 5-8) for PC1 (a) and PC2 (b), appear similar to the loadings plots after only
baseline correction and normalization (Figure 4-9). This again demonstrates that the
noise is not a major contribution to the variance in this dataset.
5.4.2. Quantitative Assessment
To quantify the clustering of replicates on the scores plot, the percent change in
the clustering of replicates (PCC) was again employed (Table 5-2). After baseline
correction using the fitting of the extracted ion profiles (EIP fit) and smoothing using the
FFT with 2 points, only a very small improvement in replicate clustering was observed
(0.5%), further confirming that noise was not a major source of variation in this dataset.
When baseline correction, smoothing, and normalization were applied, there was an
increase in the PCC over normalization alone and over baseline correction and
normalization (Table 5-2). The improvements that are observed are greater than would be
expected from simply adding the effect of smoothing alone and are likely due to the decreased variation in
the noise at the end of the chromatogram. While these improvements were small, this
highlights the power of combining data pretreatment procedures to enhance
discrimination of highly similar samples.

Figure 5-7. PCA scores plot of eight diesel chromatograms in triplicate after
baseline correction and normalization (a) and after baseline correction,
smoothing, and normalization (b). [Plot axes: (a) PC1 (57.9%) versus PC2 (21.4%); (b) PC1 (59.9%) versus PC2 (21.1%).]

Figure 5-8. Loadings plot for PC1 (a) and PC2 (b) after smoothing and
normalization. [Plot axes: loadings versus retention time (min); peaks labeled C10 to C20.]

Table 5-2. The average percent change in the clustering (PCC) of replicates after
the listed pretreatment procedures including baseline correction using the
extracted ion profiles (EIP fit), smoothing using fast Fourier transform smooth with
2 points (FFT 2) and normalization using total area (Area) and single peak (Peak)
normalization methods.

Baseline Correction    Smoothing    Normalization    PCC (%)
-                      -            Area             45.1
-                      -            Peak             59.0
EIP fit                -            -                0.0
EIP fit                -            Area             44.6
EIP fit                -            Peak             58.7
EIP fit                FFT 2        -                0.5
EIP fit                FFT 2        Area             46.5
EIP fit                FFT 2        Peak             62.1

Summary
There is a wide array of smoothing methods, each with many adjustable parameters, and
the appropriate choice depends on the enhancement that is required. However, care must be
taken not to introduce peak distortion, particularly negative-going artifacts on the edges
of peaks, which, when present, can be identified as a major source of variation between
samples. All smoothing parameter groupings led to a reduction in noise; however,
different smoothing parameters resulted in different degrees of peak distortion. Peak
distortion was first identified when there was a greater than 35% reduction in the noise.
In this work, there was little difference in the chromatogram or scores plot after
applying a moderate degree of smoothing. For this dataset, the contribution of the noise
was small compared to the signal. Therefore, noise generally had a minimal effect on
statistical comparisons. For datasets where the signal-to-noise ratio is lower, application
of smoothing would be more critical. The enhanced discrimination after applying baseline
correction, smoothing, and normalization demonstrates the improvement that can be
achieved by combining several pretreatment procedures.

REFERENCES

[1] K.M. Pierce, J.S. Nadeau, R.E. Synovec, in: C.F. Poole (Ed.), Gas Chromatography,
Elsevier, Waltham, MA, 2012.
[2] F.-T. Chau, Y.-Z. Liang, J. Gao, X.-G. Shao, Chemometrics: From Basics to Wavelet
Transform, John Wiley & Sons, Inc., Hoboken, NJ, 2004.
[3] K.R. Beebe, R.J. Pell, M.B. Seasholtz, Chemometrics: A Practical Guide, John Wiley
& Sons, Inc., New York, NY, 1998.
[4] S.L. Morgan, E.G. Bartick, in: R.D. Blackledge (Ed.), Forensic Analysis on the Cutting
Edge: New Methods for Trace Evidence Analysis, John Wiley & Sons, Inc., Hoboken,
NJ, 2007.
[5] A. Savitzky, M.J.E. Golay, Anal. Chem., 36 (1964) 1627.
[6] P. Barak, Anal. Chem., 67 (1995) 2758.
[7] M.J. Adams, Chemometrics in Analytical Spectroscopy, Royal Society of Chemistry,
Victoria, Australia, 1995.
[8] J.P. Foley, J.G. Dorsey, Anal. Chem., 55 (1983) 730.
[9] D.W. Morton, C.L. Young, J. Chromatogr. Sci., 33 (1995) 514.
[10] J.C. Giddings, Unified Separation Science, John Wiley & Sons, Inc., New York, 1991.
[11] V.L. McGuffin, in: E. Heftmann (Ed.), Chromatography, Elsevier, New York, NY, 2004.


6. CHAPTER 6: ALIGNMENT
Introduction
After appropriate minimization of the baseline and the noise in a chromatogram,
ideally, only the analytical signal remains. However, drift in the retention time of peaks in
the chromatogram between analyses can remain, particularly in datasets that were
collected over a relatively long time period (usually months or longer). This drift can arise
from variation in injection mode, fluctuation in mobile phase pressure and flow rates,
degradation of the stationary phase, and variation in the oven temperature of the gas
chromatograph, among other sources. All of these sources of drift affect how the analytes move through
the column. When performing principal component analysis (PCA) on these data, each
retention time serves as a variable. Therefore, peaks from the same compounds in
different chromatograms must be well aligned so that the variables rise and maximize at
the same retention time. Any retention time misalignments will be identified as sources
of variation between samples, which will be highlighted in the statistical analysis [1-4].
Many factors can affect the severity of the misalignments. Chromatograms analyzed on
the same instrument immediately after one another will have smaller misalignments than
chromatograms analyzed months apart or from different instruments [4]. Also, variation
in alignment can be reduced by optimizing the injection parameters and by utilizing an
auto-sampler.
In order to correct misalignments, retention time alignment algorithms are
employed. All alignment algorithms utilize interpolation or extrapolation of points in the
chromatogram to shift the peak in a sample chromatogram to the corresponding peak in
a target chromatogram. The target chromatogram is considered to have the true retention
times, and all sample chromatograms are then aligned so that the peaks from the same
compounds in different chromatograms maximize at the same retention time. Alignment
algorithms can be generally classified into four types according to their mode of operation:
scalar shifts, selected peak alignment, local alignment, and global optimized alignment
algorithms [1, 5]. Considerations for choosing a target for alignment will be discussed
later in this chapter.
Alignment algorithms based on scalar shift apply a shift to the entire sample
chromatogram, to maximize the similarity between the sample and target chromatogram.
This simple approach is fast, but crude [1]: all of the peaks in the sample
chromatogram are shifted in one direction and by the same number of points.
This type of coarse alignment is sometimes performed to correct for large shifts in
retention times prior to more robust alignment methods [6, 7].
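
The scalar shift described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited implementations: the function name `scalar_shift_align`, the `max_shift` search range, and the synthetic Gaussian peak are all assumptions made for illustration, and correlation is used here as one possible similarity measure.

```python
import numpy as np

def scalar_shift_align(sample, target, max_shift=10):
    """Shift the whole sample by the integer lag that best matches the target."""
    best_shift, best_r = 0, -np.inf
    for shift in range(-max_shift, max_shift + 1):
        # np.roll wraps points around the ends; acceptable here because the
        # synthetic baseline is flat, but real data would need padding instead.
        r = np.corrcoef(np.roll(sample, shift), target)[0, 1]
        if r > best_r:
            best_shift, best_r = shift, r
    return np.roll(sample, best_shift), best_shift

t = np.arange(200)
target = np.exp(-0.5 * ((t - 100) / 5.0) ** 2)
sample = np.exp(-0.5 * ((t - 104) / 5.0) ** 2)  # same peak, eluting 4 points late
aligned, shift = scalar_shift_align(sample, target)
print(shift, int(np.argmax(aligned)))
```

Because every point moves by the same amount, a shift that corrects one peak cannot correct another peak that drifted by a different amount, which is why this approach is usually only a coarse first step.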
Selected peak alignment is performed by assigning a specific value for the
retention time of known peaks in the chromatogram. The retention times of the other
peaks are then scaled between the known peaks. This method is similar to scaling using
the Kováts retention index [8]. However, in complex samples it is difficult for algorithms
to identify known peaks and manual intervention is often necessary [1]. A target
chromatogram is not needed for this type of alignment; instead, specific peaks that are
present in all samples serve as the basis for alignment.
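
As a hedged illustration of selected peak alignment, the sketch below rescales the retention axis piecewise-linearly between known peaks. The function name `selected_peak_align`, the anchor retention times, and the synthetic two-peak signal are hypothetical; pinning the first and last points of the run is an added assumption that keeps the mapping monotonic.

```python
import numpy as np

def selected_peak_align(times, signal, observed_rt, assigned_rt):
    """Piecewise-linear rescaling so known peaks land on their assigned times."""
    # Pin both ends of the run so the mapping stays monotonic over the whole axis.
    observed = np.concatenate(([times[0]], observed_rt, [times[-1]]))
    assigned = np.concatenate(([times[0]], assigned_rt, [times[-1]]))
    corrected = np.interp(times, observed, assigned)   # corrected time of each point
    return np.interp(times, corrected, signal)         # resample onto original grid

def gaussian(t, center, width=0.05):
    return np.exp(-0.5 * ((t - center) / width) ** 2)

times = np.linspace(0.0, 10.0, 1001)
# Hypothetical sample: two known peaks observed slightly late at 3.05 and 7.02 min
signal = gaussian(times, 3.05) + gaussian(times, 7.02)
aligned = selected_peak_align(times, signal, [3.05, 7.02], [3.00, 7.00])
first = times[np.argmax(aligned[:500])]
second = times[500 + np.argmax(aligned[500:])]
print(round(first, 2), round(second, 2))
```

Peaks between the two anchors are moved by an interpolated amount, which is the sense in which their retention times are "scaled between the known peaks".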
Local alignment algorithms are applied iteratively to regions of the sample
chromatogram to maximize similarity to the target within each region [1]. Generally, this
method requires peak detection or another method of defining the local regions of interest
in both the target and sample chromatograms [3, 5, 9]. This method requires no prior
knowledge about the sample; however, it only aligns small regions of the chromatogram,
such as selected peaks, rather than maximizing similarity between all peaks in the
chromatogram [1, 10].
The global optimized alignment algorithms are the most dynamic and robust as
they maximize a local as well as a global measure of similarity [1-3, 7, 11-14]. This
method allows for alignment of chromatograms with different numbers of data points and
with severe shifts in retention time. These methods are often computationally intensive
and require optimization of several parameters [15].

Methods Tested and Evaluation Metrics
The performance of two common retention time alignment algorithms was
compared: a local alignment algorithm, the peak-matching (PM) algorithm [10], and a global
alignment algorithm, the correlation-optimized warping (COW) algorithm [11]. These two
algorithms were selected as they are more robust and require less manual intervention
than the other types of algorithms. Many commercially available chemometric software
packages include a COW alignment algorithm. The PM algorithm was applied in Matlab
(version 7.12 R2011a, MathWorks, Natick, MA) and the COW alignment was applied in
LineUp (version 3.5, Infometrix, Inc., Bothell, WA). The performance of the alignment
algorithms was evaluated using Diesels 1 - 3, which were analyzed in triplicate, after
baseline correction and smoothing.


6.2.1. Peak-Matching Alignment Algorithm
The PM algorithm identifies and matches individual peaks in a sample
chromatogram to peaks found in a target chromatogram [10]. Peaks are detected in each
chromatogram by identifying zero-crossings after an estimation of the first derivative of
each chromatogram. The algorithm considers points starting at the beginning of the
chromatogram and moves to the end. The leading edge of a peak is identified when the
point-to-point difference exceeds five times the standard deviation of the baseline. When
this threshold is met, the first zero-crossing that is encountered is considered the peak
maximum. The time point closest to the zero-crossing is then added to a list for each
chromatogram (the target and all sample chromatograms). The algorithm continues to
locate peaks until the end of the chromatogram is reached, creating a list of time points
closest to each zero-crossing identified. The peaks found in each sample chromatogram
are then compared to those found in the target chromatogram. If a peak is present in
both the target and sample chromatograms, within a user-defined window, then the peaks
are considered a match. The retention time axis is interpolated, so that the point closest
to the zero-crossing in the sample chromatogram occurs at the same retention time as
the point closest to the zero-crossing in the target chromatogram [10].
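
The steps above can be sketched in Python as follows. This is a simplified illustration of the described logic, not the Matlab implementation of Johnson et al. [10]: the function names, the nearest-peak matching rule, the anchoring of the first and last points, and the assumed baseline standard deviation of 0.001 are all assumptions made for this example.

```python
import numpy as np

def find_peak_maxima(signal, threshold):
    """Indices of peak maxima located from zero-crossings of the first difference."""
    maxima, rising = [], False
    for i, d in enumerate(np.diff(signal)):
        if not rising and d > threshold:
            rising = True        # leading edge: point-to-point rise exceeds threshold
        elif rising and d <= 0:
            maxima.append(i)     # first zero-crossing after the leading edge
            rising = False
    return np.array(maxima, dtype=int)

def match_peaks(sample_maxima, target_maxima, window):
    """Pair each sample peak with the nearest target peak within +/- window points."""
    pairs = []
    for s in sample_maxima:
        if target_maxima.size == 0:
            break
        nearest = target_maxima[np.argmin(np.abs(target_maxima - s))]
        if abs(nearest - s) <= window:
            pairs.append((int(s), int(nearest)))
    return pairs

def peak_match_align(sample, target, threshold, window):
    """Interpolate the sample so matched peak maxima fall on the target's maxima."""
    n = len(sample)
    pairs = match_peaks(find_peak_maxima(sample, threshold),
                        find_peak_maxima(target, threshold), window)
    src, dst = [0], [0]          # anchor the first and last points of the run
    for s, t in pairs:
        if src[-1] < s < n - 1 and dst[-1] < t < n - 1:   # keep mapping monotonic
            src.append(s)
            dst.append(t)
    src.append(n - 1)
    dst.append(n - 1)
    idx = np.arange(n)
    source_pos = np.interp(idx, dst, src)      # where each output point comes from
    return np.interp(source_pos, idx, sample)

idx = np.arange(400)

def two_peaks(c1, c2):
    return (np.exp(-0.5 * ((idx - c1) / 6.0) ** 2)
            + np.exp(-0.5 * ((idx - c2) / 6.0) ** 2))

target = two_peaks(100, 300)
sample = two_peaks(103, 298)     # peaks shifted by +3 and -2 points
threshold = 5 * 0.001            # 5 x an assumed baseline standard deviation
aligned = peak_match_align(sample, target, threshold, window=5)
```

With a window of 5, both sample peaks are matched and moved onto the target maxima; with a window of 2, the first peak (3 points away) would be left unmatched, which mirrors the role of the window size discussed in this section.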
In this work, the algorithm was used as described by Johnson et al. [10], except
that the baseline subtraction step was omitted as baseline correction was performed as
a separate pretreatment method, prior to alignment. The threshold was calculated as five
times the standard deviation of the noise, which was defined as the region in the
chromatograms between 79.5 and 80.5 minutes. This particular region was selected as
there were no peaks present and the region was only minimally affected by baseline
correction. The window size is the only user-defined parameter in the algorithm and
window sizes ranging from 2 to 20 data points were evaluated.
6.2.2. Correlation Optimized Warping Algorithm
The COW algorithm optimizes the correlation coefficient (Equation 1-1) between a
sample and target chromatogram [12]. As with the PM algorithm, each sample
chromatogram is compared to a target chromatogram. In order to align the
chromatograms, both the target and sample chromatograms are divided into segments,
based on a user-defined parameter of segment size. The segment size is the number of
data points in each segment. Beginning at the end of the chromatogram and moving
towards the beginning, each segment of the sample chromatogram is stretched or
compressed by adding or removing points, using interpolation, in order to better align the
peaks in the sample chromatogram to those in the target. The maximum number of points
added or removed is determined from the warp, which is also a user-defined parameter.
The Pearson product-moment correlation (PPMC) coefficient is used to assess the
similarity between data points in the segment of the sample chromatogram and the
corresponding points in the segment of the target chromatogram. The PPMC is
calculated for each permutation of adding or removing up to the number of data points
specified by the warp. This process is repeated for each segment. The alignment is
based on the highest global correlation coefficient for all segments [11].
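
A compact Python sketch of this procedure is given below. It is a simplified illustration written for this discussion, not the LineUp implementation used in this work: it assumes the sample and target have the same number of points, forces the first and last points to coincide, uses linear interpolation, and scores a warping path by the sum of the per-segment PPMC coefficients. The names `ppmc`, `warp_segment`, `cow_align`, `segment`, and `slack` are hypothetical.

```python
import numpy as np

def ppmc(a, b):
    """Pearson product-moment correlation; returns 0 for a flat segment."""
    a, b = a - a.mean(), b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def warp_segment(seg, n_out):
    """Stretch or compress a segment to n_out points by linear interpolation."""
    return np.interp(np.linspace(0, len(seg) - 1, n_out), np.arange(len(seg)), seg)

def cow_align(sample, target, segment, slack):
    n = len(target)
    n_seg = (n - 1) // segment
    bounds = [i * segment for i in range(n_seg + 1)]
    bounds[-1] = n - 1                        # last segment absorbs any remainder
    # score[i][x]: best summed correlation from boundary i (at sample index x) to the end
    score = [dict() for _ in range(n_seg + 1)]
    move = [dict() for _ in range(n_seg + 1)]
    score[n_seg][n - 1] = 0.0                 # end points are forced to coincide
    for i in range(n_seg - 1, -1, -1):        # work from the end toward the beginning
        tgt = target[bounds[i]:bounds[i + 1] + 1]
        length = bounds[i + 1] - bounds[i]
        lo = max(0, bounds[i] - i * slack)
        hi = min(n - 1, bounds[i] + i * slack)
        for x in range(lo, hi + 1):
            best, best_end = -np.inf, None
            for u in range(-slack, slack + 1):   # points added or removed
                end = x + length + u
                if end not in score[i + 1] or end - x < 2:
                    continue
                total = ppmc(warp_segment(sample[x:end + 1], len(tgt)), tgt) + score[i + 1][end]
                if total > best:
                    best, best_end = total, end
            if best_end is not None:
                score[i][x], move[i][x] = best, best_end
    aligned, x = np.empty(n), 0
    for i in range(n_seg):                    # follow the best path forward
        end = move[i][x]
        n_out = bounds[i + 1] - bounds[i] + 1
        aligned[bounds[i]:bounds[i + 1] + 1] = warp_segment(sample[x:end + 1], n_out)
        x = end
    return aligned

t = np.arange(201)
centers, heights = [25, 75, 125, 175], [1.0, 0.6, 0.8, 0.4]
target = sum(h * np.exp(-0.5 * ((t - c) / 5.0) ** 2) for c, h in zip(centers, heights))
sample = sum(h * np.exp(-0.5 * ((t - c - 2) / 5.0) ** 2) for c, h in zip(centers, heights))
aligned = cow_align(sample, target, segment=50, slack=2)
print(ppmc(sample, target), ppmc(aligned, target))
```

In this sketch, `segment` and `slack` play the roles of the segment size and warp parameters described above; the synthetic chromatogram and its four peaks are invented solely to exercise the code.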
In this work, the COW algorithm was tested using varying warps (1 - 4 data points)
and segment sizes (20 - 120 data points). For the COW alignment algorithm, the initial
starting point recommended for the segment size is the approximate number of points
across a peak, with the warp typically being just a few points. Peaks in this dataset were
approximately 45 points across and generally peaks were shifted less than two points
from the target chromatogram. Hence, the starting point for the alignment was chosen to
be a segment size of 45 and a warp of 2. During investigation of warp and segment size,
one parameter at a time was varied while the other was held constant.
6.2.3. Target Selection
As discussed above, each alignment algorithm compares sample chromatograms
to a target chromatogram. Therefore, selection of the target is a critical and often
challenging aspect of alignment. The ideal target chromatogram has well resolved peaks
and is representative of the sample chromatograms [6, 10]. There are generally three
targets that can be selected: one of the sample chromatograms, an average target, and
a consensus target [16-20]. Generally, if a sample chromatogram is used as the target,
the chromatogram is chosen at random from the dataset. Using a sample chromatogram
can be problematic if all of the compounds in the dataset are not present in the selected
sample chromatogram. An average target is generated mathematically from all
chromatograms in the dataset. To do this, the abundance at each retention time is summed
across all of the sample chromatograms, then divided by the total number of
chromatograms to yield the average. The average target is advantageous because it
includes peaks that may not be present in every sample. However, averaging leads to
peak broadening and a reduction in signal, which make alignment more challenging. A
consensus target is a separate sample that contains a mixture of all compounds of interest
that are in the sample chromatograms. This type of target is challenging to create
because trial and error is often required to create a mixture with all of the compounds at
the correct abundance.
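
The calculation of an average target, and the peak broadening it introduces, can be shown with a short Python sketch. The three synthetic chromatograms, with a single peak misaligned by two points in either direction, are assumptions made for illustration only.

```python
import numpy as np

t = np.arange(200)

def peak(center):
    return np.exp(-0.5 * ((t - center) / 5.0) ** 2)

# Rows are chromatograms, columns are retention time points (synthetic data)
chromatograms = np.array([peak(98), peak(100), peak(102)])

# Sum the abundance at each retention time, then divide by the number of chromatograms
average_target = chromatograms.sum(axis=0) / chromatograms.shape[0]

# Each individual peak has a maximum of 1.0; the averaged peak is lower and wider
print(chromatograms.max(axis=1), average_target.max())
```

The lower maximum of the averaged peak is the reduction in signal noted above, and it grows as the misalignment among the averaged chromatograms increases.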
In this work, a random target was used to investigate the alignment parameters for
the COW and PM alignment algorithms using replicates (n=3) of Diesels 1 - 3. The
second replicate from Diesel 2 was randomly selected (using a random number generator
in Excel) to serve as the target chromatogram. The optimized parameters from this
preliminary study were then used to align the dataset to evaluate different target
chromatograms. Each diesel chromatogram (from samples 1 - 10, analyzed in triplicate)
and the average chromatogram were used to determine the most appropriate type of
target chromatogram for alignment of these data. The success of alignment was
evaluated for each target.
6.2.4. Evaluation Metrics
Two metrics were used to quantitatively evaluate retention time alignment. These
metrics were the percent change in the average of the standard deviation of the retention
time of selected peak maxima (PC-SDRT) and the sum of the percent change in the
PPMC coefficient for each chromatogram before and after alignment (PC-PPMC).
Calculation of the PC-SDRT is performed using peak maxima selected using a peak-finding
algorithm, based on the peak-matching algorithm described previously [10]. In
general, there were 100 - 200 peaks identified per chromatogram using this algorithm.
The standard deviation in the retention time was calculated for each selected peak across
all chromatograms in the dataset, and then averaged across all selected peaks, both
before ($s_U$) and after ($s_A$) alignment.

$$\text{PC-SDRT} = \frac{s_A - s_U}{s_U} \times 100$$

Equation 6-1

To determine the PC-PPMC, PPMC coefficients were first calculated in Excel, in a
pair-wise fashion, between all chromatograms, both before and after alignment. The
percent change in the PPMC coefficient before ($\mathrm{PPMC}_U$) and after ($\mathrm{PPMC}_A$) alignment
was calculated and summed.

$$\text{PC-PPMC} = \sum \left[ \frac{\mathrm{PPMC}_A - \mathrm{PPMC}_U}{\mathrm{PPMC}_U} \times 100 \right]$$

Equation 6-2

After proper retention time alignment, chromatograms become more similar, which
results in a lower standard deviation of peak maxima and higher PPMC coefficient. Each
metric is similar to one of the methods used for alignment: the PC-SDRT is based on the
peak finding algorithm while the PC-PPMC utilizes PPMC coefficients, similar to the COW
algorithm.
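
Both metrics can be sketched in Python as shown below. The function names `pc_sdrt` and `pc_ppmc` are hypothetical, the use of the sample standard deviation (`ddof=1`) is an assumption, and the sum in Equation 6-2 is taken here over all unique pairs of chromatograms, which is one reading of the pair-wise calculation described above. The retention times and chromatograms are invented for illustration.

```python
import numpy as np
from itertools import combinations

def pc_sdrt(rt_before, rt_after):
    """Equation 6-1. Rows are chromatograms, columns are selected peak maxima (min)."""
    s_u = np.std(rt_before, axis=0, ddof=1).mean()   # averaged over peaks, unaligned
    s_a = np.std(rt_after, axis=0, ddof=1).mean()    # averaged over peaks, aligned
    return (s_a - s_u) / s_u * 100.0

def pc_ppmc(before, after):
    """Equation 6-2, summed over all unique pairs (an assumed reading of the sum)."""
    r_u, r_a = np.corrcoef(before), np.corrcoef(after)
    total = 0.0
    for i, j in combinations(range(before.shape[0]), 2):
        total += (r_a[i, j] - r_u[i, j]) / r_u[i, j] * 100.0
    return total

# Hypothetical retention times of two peaks in three chromatograms
rt_before = np.array([[9.20, 9.48], [9.22, 9.50], [9.18, 9.46]])
rt_after = np.array([[9.20, 9.48], [9.21, 9.49], [9.19, 9.47]])
sdrt_change = pc_sdrt(rt_before, rt_after)

# Hypothetical chromatograms: one peak, misaligned before and coincident after
t = np.arange(100)
before = np.array([np.exp(-0.5 * ((t - c) / 4.0) ** 2) for c in (49, 50, 52)])
after = np.array([np.exp(-0.5 * ((t - c) / 4.0) ** 2) for c in (50, 50, 50)])
ppmc_change = pc_ppmc(before, after)
print(sdrt_change, ppmc_change)
```

In this example the spread of the retention times is halved, so the PC-SDRT is negative, and the pair-wise correlations increase, so the PC-PPMC is positive; both signs indicate improved alignment.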

Effect of Retention Time Alignment on Chromatographic Data

6.3.1. Visual Assessment
Visual assessment of alignment is challenging because multiple chromatograms
must be overlaid and compared. In this work, misalignments were considered small, often
within 5 points (generally ± 0.02 min), and were difficult to visualize. Figure 6-1 shows the
chromatograms of three diesel samples, each analyzed in triplicate, overlaid. In this
example, the peak maxima, as well as the leading and tailing edges of the peak, can vary
slightly, even among replicates. When multivariate statistical procedures are applied,
these differences can be identified as sources of variation and hence, it is important to
minimize or eliminate such differences.

[Figure 6-1 axes: abundance (0 to 2.00E+05) vs. retention time (9.0-9.6 min).]
Figure 6-1. An expanded region of chromatograms of three diesel samples
analyzed in triplicate, each represented by a different color, before alignment.
The peaks correspond to 1,3,5-trimethylbenzene (9.20 min) and decane (9.48
min).

Selection of an appropriate window size (for the PM algorithm) or warp (for the
COW algorithm) is facilitated by careful inspection of the chromatograms, as shown in
Figure 6-2. When the chromatograms are overlaid, the peaks resulting from the same
compound in all chromatograms should be inspected. In Figure 6-2, the peak shown
corresponds to 1,3,5-trimethylbenzene from Diesels 1 - 3, each analyzed in triplicate.
The minimum window size or warp is the number of points that a peak would need to be
shifted to align with peaks from the same compound in the other chromatograms. In this
example, the window size or warp would need to be at least 2. The maximum window
size or warp is the number of points a peak could be shifted before being aligned to the
peak from another compound. If the window size or warp is too small, the peaks cannot
be aligned; if it is too large, then peaks from different compounds could be aligned.
Figure 6-3 shows the same region as Figure 6-1, after alignment using the PM (a)
and COW (b) alignment algorithms. In order to compare the alignment algorithms, all
diesel samples were aligned. The PM alignment was performed using a window size of
5 and the COW alignment was performed using a warp of 2 and a segment size of 75. In
both cases, an average chromatogram was utilized as the target. After alignment, peak
maxima and edges are generally more similar across all chromatograms, generally only
varying by 1 or 2 points (approximately ± 0.005 min). However, there are anomalies that
are observed after some alignments. Using the COW algorithm, the most commonly
encountered anomaly is that peaks are often aligned to one of the edges of the peak,
rather than to the peak maxima. This can be seen in both peaks in Figure 6-3b. Peaks
that are approximately the same height and width are well aligned. However, when
comparing peaks across several chromatograms, most peaks are aligned to either the
leading or tailing edge of the peak. The first peak in all chromatograms
(1,3,5-trimethylbenzene) is aligned to the tailing edge of the peak. The leading edge of
the peak and the peak maxima are not well aligned. In the second peak (decane), most of
the replicates are also aligned to the tailing edge of the peak. However, one of the
replicates shown in red is aligned to the leading edge of a peak in another diesel sample.
Even though these variations are often only 1 or 2 points, they can still be identified as
a non-chemical source of variation in PCA.

[Figure 6-2 axes: abundance (0 to 2.00E+05) vs. retention time (9.14-9.26 min).]
Figure 6-2. An expanded region of the 1,3,5-trimethylbenzene peak in
chromatograms of three diesel samples analyzed in triplicate, each
represented by a different color, before alignment. The individual data points
are shown as black circles. In this example, peak maxima are shifted by
approximately three data points.

[Figure 6-3 panels (a) and (b): abundance (0 to 2.00E+05) vs. retention time (9.0-9.6 min).]
Figure 6-3. The same expanded region of chromatograms of three diesel
samples from Figure 6-1, each represented by a different color, after
alignment using the peak-matching algorithm (a) and the correlation-optimized
warping algorithm (b).

The resulting alignment of the peak edges using the COW alignment algorithm is
not surprising because optimization is based on maximizing correlation coefficients
between the sample and target chromatograms. The correlation is higher when the
abundance from point to point increases and decreases at the same retention times
across all chromatograms. When the peaks being aligned differ in width or height,
alignment to the front or tail of the peak is common, so that correlation is optimized.
In the PM algorithm, peak maxima are identified and aligned (Figure 6-3a).
Therefore, differences in peak size do not affect the alignment. However, only peaks that
have been identified by the algorithm in both the sample and target chromatogram are
aligned. Therefore, some low abundance or co-eluting peaks are often not aligned.
Additionally, this can result in alignment of peaks that do not correspond to the same
compound, often making alignment worse. The sensitivity of the peak identification is
the major drawback of this method. Figure 6-4 shows the phytane peak and
another co-eluting peak in three diesel chromatograms, each analyzed in triplicate, before
alignment (a) and after alignment using the PM algorithm with a window size of 10 (b).
Using a large window size causes the peak maximum of one replicate from each sample to
shift, creating retention time misalignments. This problem can be minimized by selecting
an appropriate window size, which requires manual optimization.
6.3.2. Quantitative Assessment
The PC-SDRT and the PC-PPMC were utilized as metrics to evaluate the
alignment. A decrease in the PC-SDRT or an increase in the PC-PPMC indicates an
improvement in alignment. The PM algorithm resulted in improved alignment for most of
the window sizes that were investigated (Table 6-1). The PC-SDRT ranged from -62 to
49% and the PC-PPMC ranged from -2.8% to 9.3%. Similar improvement in the quality
of alignment was observed for window sizes 3 - 7 using both the PC-SDRT (-61% to -62%)
and the PC-PPMC (9.2% to 9.3%). This indicates that windows of 3 - 7 resulted in
similar alignment. Window sizes of less than 3 were not able to shift peaks far enough to
align them. When the window size was greater than 7, peaks were shifted too far,
resulting in improper alignment.

[Figure 6-4 panels (a) and (b): abundance (0 to 1.50E+05) vs. retention time (55.1-55.6 min).]
Figure 6-4. An expanded region of the phytane peak in chromatograms of three
diesel samples analyzed in triplicate, each represented by a different color,
before alignment (a) and after alignment using the peak-matching algorithm with
a window size of 10 (b).

Table 6-1. Percent change in the standard deviation of the peak maxima of selected
peaks (PC-SDRT) and the sum of the percent change in the PPMC coefficients
(PC-PPMC) for different window sizes using the peak-matching alignment algorithm. A
decrease in the PC-SDRT of the retention time or an increase in the sum of the
PC-PPMC indicates an improvement in alignment.

Window Size   PC-SDRT (%)   PC-PPMC (%)
2             -55            9.2
3             -61            9.2
4             -62            9.2
5             -62            9.2
6             -62            9.3
7             -61            9.2
8             -61            8.4
10            -46            8.5
15            -19            5.8
20             49           -2.8

There was a decrease in the quality of the alignment based on the PC-SDRT metric
for window sizes below 3 and above 8 and a decrease based on the PC-PPMC using a
window size greater than 7. This reduction in quality of alignment is likely a result of a
mixture of improved alignment for some peaks and a worsening in alignment for other
peaks in the same chromatogram. At a window size of 20, there is a positive PC-SDRT
and a negative PC-PPMC, indicating that this window size actually resulted in worse
alignment than prior to application of the alignment algorithm. This is likely a result of
additional misalignments caused by aligning peaks in the sample chromatogram to the
wrong peak in the target chromatogram.
To evaluate the COW alignment algorithm, a range of segment sizes (20 – 120)
and warps (1 – 4) were investigated. The recommended segment size for this algorithm
is the average number of points across a peak [6]. In this work, the number of points
across a peak ranged from approximately 25 - 45 points. The largest peaks were
expected to be the most problematic, so an initial segment size of 45 data points was
selected. A warp of 2 was used to compare the various segment sizes. This warp was
chosen based on visual assessment of the unaligned chromatograms, which showed that
most peaks were only misaligned by 1 or 2 data points.
The COW alignment algorithm resulted in improved alignment over the range of
segment sizes (20 – 120) that were investigated (Table 6-2). The PC-SDRT ranged from
-21 to -30% and the PC-PPMC ranged from 13 to 21%. Both the decrease in the standard
deviation in retention time and the increase in PPMC coefficient indicate an increase in
the quality of the alignment for all combinations. All segment sizes of 45 and larger
resulted in a 20% to 21% increase in the PC-PPMC while segment sizes 45, 75, 90 and
120 showed the highest decrease in the PC-SDRT (28% to 30%).

Table 6-2. Percent change in the standard deviation of the peak maxima of selected
peaks (PC-SDRT) and the sum of the percent change in the PPMC coefficients
(PC-PPMC) for varying warp and segment sizes using the COW alignment algorithm. A
decrease in the PC-SDRT of the retention time or an increase in the sum of the
PC-PPMC indicates an improvement in alignment.

Segment   Warp   PC-SDRT (%)   PC-PPMC (%)
20        2      -26           14
25        2      -21           13
37        2      -27           19
45        1      -29           21
45        2      -28           21
45        3      -24           20
45        4      -23           20
50        2      -23           20
60        2      -25           20
75        2      -28           21
90        2      -30           21
100       2      -26           21
120       2      -29           21

Using the segment size of 45, the warp was varied between 1 and 4 data points.
The greatest improvements in alignment were observed using a warp of 1 or 2 points.
There was a 29% and 28% reduction in the PC-SDRT for warp sizes of 1 and 2,
respectively. There was also a 21% increase in the PC-PPMC for warps of both 1 and 2
points. Using a warp of 3 and 4 points, there was a 24% and 23% reduction in the
PC-SDRT, respectively, and a 20% increase in the PC-PPMC. The smaller warps resulted in
better alignment, as the chromatograms were collected over a short period of time, and only
small differences in alignment were observed. When larger warps were applied,
chromatograms were shifted more, resulting in poorer alignment, due to more peaks
aligning to the leading or tailing edge of the peak.
Rather than selecting the ideal alignment parameters (which can only be obtained
through optimization), adequate alignment parameters were selected for further analysis.
For COW, a warp of 2 data points and a segment size of 75 data points were selected.
Most peaks were misaligned by 1 or 2 points, making 2 a reasonable choice for warp. A
segment size of 75 corresponds to 1.5 to 2 times the number of points across a peak.
For the PM algorithm, a window size of 5 was selected, to allow for slightly larger shifts
that might be present when applying the pretreatment to the larger dataset.
Each metric is based on one of the methods used to align the chromatograms, resulting in
a potential bias when trying to compare the alignment algorithms. When comparing
the PM parameters (window size 5) and COW parameters (warp of 2, segment size 75),
the PC-SDRT indicates greater improvement for the PM algorithm, while the PC-PPMC
indicates greater improvement for the COW alignment algorithm. Because each
evaluation method is similar to one of the alignment algorithms, the evaluation favors the
alignment algorithm from which the metric is derived. In PCA, variation between samples
must be minimized. Therefore, the change in PPMC coefficients would be a more reliable
indicator as coefficients account for more of a global change, rather than just selected
peaks. Additionally, when a peak is not correctly aligned using the COW algorithm, the
misalignment is generally less severe than with the PM algorithm. Lastly, the COW
algorithm is widely available in a number of commercial software packages. Therefore,
the COW alignment algorithm with a warp of 2 data points and a segment size of 75 data
points was utilized for the rest of this work.
6.3.3. Target Selection
After choosing the alignment algorithm and parameters, a method for choosing a
target chromatogram was investigated. Each chromatogram contained all of the
compounds, making each chromatogram a suitable target for consideration. In addition,
the average chromatogram was also utilized as the target. Both metrics were again
applied to evaluate the selection of a target.
The PC-SDRT using each of the selected targets ranged from 12.1 to -14.2%
(Table 6-3), where negative values indicate an improvement in the alignment and positive
values indicate a worsening in the alignment. The greatest improvement in alignment
was observed when the average chromatogram was used as the target. Additionally,
most of the possible target chromatograms resulted in an improvement in the quality of
the alignment. However, three chromatograms, when used as the target, resulted in a
worsening in the quality of the alignment. Upon visual inspection, this decrease in
alignment quality was due to misalignments of several low abundance, co-eluting peaks.
Using the PC-PPMC, all chromatograms when used as a target resulted in an
improvement in the quality of alignment, ranging from 211% to 221% (Table 6-4). This
small range indicates that all chromatograms were more similar after alignment,
regardless of which sample was selected as the target. In addition, the similarity of the
PC-PPMC values demonstrates the insensitivity of this metric when evaluating the COW
alignment algorithm. It is not clear why diesel samples 3 and 8 resulted in the best
alignment. However, the average chromatogram was still among the best choices for a
target. The use of the average chromatogram as the target is advantageous because it
has been shown to result in good alignment, without requiring testing of all possible
chromatograms.

Table 6-3. Percent change in the standard deviation of the peak maxima of selected
peaks (PC-SDRT) using the correlation optimized warping alignment algorithm with
a warp of 2 and a segment size of 75, with each sample chromatogram as well as
the average chromatogram serving as the target. A decrease in the PC-SDRT of the
retention time indicates an improvement in alignment.

Target Chromatogram   PC-SDRT (%)
Average               -14.2
D5B                   -11.8
D6A                   -11.0
D10A                  -10.9
D9B                    -9.7
D8B                    -9.4
D3B                    -7.2
D4C                    -6.8
D7B                    -6.7
D10C                   -6.7
D9C                    -6.4
D9A                    -6.2
D8C                    -5.9
D6B                    -5.7
D5A                    -5.5
D3A                    -4.7
D8A                    -4.0
D3C                    -4.0
D4A                    -3.2
D10B                   -2.2
D7A                    -0.1
D4B                    -0.0
D5C                     0.7
D6C                     5.8
D7C                    12.1

Table 6-4. The sum of the percent change of the Pearson product moment
correlation coefficients (PC-PPMC) using the correlation optimized warping
alignment algorithm with a warp of 2 and a segment size of 75, with each sample
chromatogram as well as the average chromatogram serving as the target. An
increase in the sum of the PC-PPMC indicates an improvement in alignment.

Target Chromatogram   PC-PPMC (%)
D3B                   221
D8A                   221
D8B                   220
D3C                   220
D3A                   219
Average               219
D4A                   219
D6A                   219
D7A                   219
D6C                   218
D4C                   218
D9A                   217
D5A                   217
D7B                   216
D4B                   216
D6B                   216
D7C                   215
D5C                   215
D10C                  215
D5B                   214
D10B                  214
D9B                   214
D10A                  214
D9C                   212
D8C                   211

Effect of Retention Time Alignment on PCA Scores Plot

6.4.1. Visual Assessment
PCA was performed after baseline correction, smoothing, and alignment. In
comparing the scores plot (Figure 6-5) with baseline correction and smoothing (a) and
with baseline correction, smoothing, and alignment (b), only a small enhancement in
clustering of replicates is observed, specifically in replicates of Diesel 4 (orange squares)
and Diesel 5 (yellow diamonds). The total variance accounted for in PC1 and PC2
increased from 78.3% to 84.9% after alignment. Similarly, there were only small changes
observed in the loadings plots (Figure 6-6) of PC1 (a) and PC2 (b). Prior to alignment
(Figure 5-6 and Figure 5-8), the loadings plot for PC1 contained derivative-shape peaks,
indicative of misalignments [18] for C10-C14. After alignment, the negative portions are no
longer observed, indicating that there is no longer misalignment of these peaks. The
loadings plot for PC2 remained largely unchanged after alignment.
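
The link between a retention time shift and a derivative-shaped feature can be illustrated with a simplified example. For a data set of only two chromatograms, the first principal component of the mean-centered data is proportional to the difference between the two traces, so the difference is used below as a stand-in for a loading. The Gaussian peak shape, the two-point shift, and all names are assumptions made for illustration.

```python
from math import exp


def gaussian_peak(n_points, center, width=2.0, height=1.0):
    """Return a Gaussian-shaped peak sampled at integer data points."""
    return [
        height * exp(-((i - center) ** 2) / (2.0 * width ** 2))
        for i in range(n_points)
    ]


reference = gaussian_peak(41, center=20)
shifted = gaussian_peak(41, center=22)  # same peak, eluting two points later
aligned = gaussian_peak(41, center=20)  # same peak after a perfect alignment

# Difference traces: a negative lobe followed by a positive lobe when the peak
# is misaligned, and a flat line at zero when the peak is aligned.
diff_misaligned = [s - r for s, r in zip(shifted, reference)]
diff_aligned = [a - r for a, r in zip(aligned, reference)]
```

The misaligned difference resembles the first derivative of the peak, whereas the aligned difference contains no signal, which is consistent with the loss of the negative portions of the PC1 loadings after alignment.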

Figure 6-5. PCA scores plot of eight diesel chromatograms in triplicate after
baseline correction and smoothing (a) and after baseline correction,
smoothing, and alignment (b). Axes: PC1 (50.9%) versus PC2 (27.4%) in (a);
PC1 (54.3%) versus PC2 (30.6%) in (b).

Figure 6-6. Loadings plot for PC1 (a) and PC2 (b) after baseline correction,
smoothing, and alignment. Axes: loadings versus retention time (0-100 min);
the normal alkanes C10 through C21 are labeled.

Diesel 5, represented by the yellow diamonds, was previously discussed in Chapter
3 and will again be used to highlight the applied data pretreatment and resulting
positioning on the scores plot. Prior to alignment, the replicates of Diesel 5 are spread
along PC1 (Figure 6-5a). After alignment using the COW algorithm with a warp
of 2 and a segment size of 75, replicates 1 and 3 are closely clustered while replicate 2
is not. This change in the clustering of replicates 1 and 3 is due in large part to the
correction of misalignments of several normal alkanes. In Figure 6-7a, the dodecane peak in replicate
1 (red) of Diesel 5 is misaligned from replicates 2 (blue) and 3 (green). After alignment
(Figure 6-7b), the three replicates are well aligned. However, replicate 2 is still at a higher
abundance than replicates 1 and 3. This is reflected in the scores plot (Figure 6-5b), which
shows that after alignment, replicates 1 and 3 are clustered together, while replicate 2 is
still not clustered.
Figure 6-8 shows the scores plot after baseline correction, smoothing, and total
area normalization (a) and after baseline correction, smoothing, alignment, and total area
normalization (b). The total percent variance accounted for on the first 2 PCs increased
from 77.4% with no data pretreatment (Figure 2-6) to 88.7% after baseline correction,
smoothing, alignment, and normalization. After normalization, replicates are clustered
and several groupings of the samples can now be observed. Diesels 3 and 7 are
positioned closely, as are Diesels 4, 9, and 10, as well as Diesels 6 and 8. Diesel 5,
however, is discriminated from the other diesels. Based on the loadings plots (Figure
6-9), the positioning of samples on PC1 is based mostly on the abundance of the normal
alkanes, particularly the short-chain normal alkanes, which load positively on PC1 (Figure
6-9a). The variation in the short-chain alkanes could arise from differences in the
distillation of the fuel or from the mixing of different summer and winter diesel blends in
the storage tanks at each service station. Also on PC1, a portion of the baseline between
40 and 60 minutes is loading negatively. This is the retention time region that shows an
increase in the baseline for Diesel 5 (Figure 2-1). This corresponds to the negative
positioning of Diesel 5 on the PCA scores plot (Figure 6-8a). The positioning of samples
on PC2 is influenced most by the long-chain normal alkanes loading positively as well as
a few of the most volatile compounds. Most of the short-chain normal alkanes as well as
some branched alkane and aromatic compounds are loading negatively on PC2 (Figure
6-9b). PC2 therefore differentiates samples according to whether the normal alkanes
follow a unimodal or a bimodal distribution.

Figure 6-7. An expanded region of dodecane in three replicate chromatograms
of Diesel 5 before (a) and after (b) alignment. Each replicate is indicated by a
different color (replicate 1: red, replicate 2: blue, replicate 3: green). Axes:
abundance versus retention time (20.6-21 min).

Figure 6-8. PCA scores plot of eight diesel chromatograms in triplicate after
baseline correction, smoothing, and normalization (a) and after baseline
correction, smoothing, alignment and normalization (b). Axes: PC1 (59.9%) versus
PC2 (21.1%) in (a); PC1 (71.5%) versus PC2 (17.2%) in (b).

Using the chromatograms (Figure 2-1), the scores plot (Figure 6-8b), and the
loadings plots (Figure 6-9), the positioning of each diesel sample can be explained. As
mentioned previously, Diesels 3 and 7 have a unimodal distribution of the normal alkanes
(rather than the bimodal distribution observed in the other diesel samples) and should be
positioned close together. Diesels 3 and 7 are closely associated on the scores plot and
are positioned negatively on PC1 and positively on PC2. Diesels 3 and 7 have a lower
abundance of short-chain alkanes (which load positively on PC1), and a higher
abundance of long-chain normal alkanes (which load positively on PC2). Diesels 6 and
8 are positioned positively on PC1, while Diesel 5 is positioned negatively on PC1. For
these samples, the distribution of peaks appears similar; however, Diesels 6 and 8 have
an overall higher abundance while Diesel 5 has a lower abundance than other samples.
Diesels 4, 9, and 10 are all positioned near the origin, which shows that they are not
well differentiated using PC1 and PC2 and are not well described by the variance on PC1
or PC2.

Figure 6-9. Loadings plot for PC1 (a) and PC2 (b) after baseline correction,
smoothing, alignment, and normalization. Axes: loadings versus retention time
(0-100 min); the normal alkanes C10 through C18 are labeled in (a) and C10
through C17 in (b).

As discussed at the beginning of this section, alignment resulted in replicates 1
and 3 of Diesel 5 becoming more closely clustered. Also, after applying normalization
(Chapter 3), replicates 2 and 3 became more closely clustered. When alignment and
normalization are both applied, all three replicates become closely clustered. After
alignment and normalization, non-chemical variation from shifts in retention time and
differences in abundance has been minimized (Figure 6-10), resulting in replicates that
are more similar and therefore clustered closer together.
6.4.2. Quantitative Assessment
The percent change in the clustering of replicates (PCC) on PC1 and PC2 was
again used to assess the effect of data pretreatment on the samples in the PCA scores
plot. After baseline correction, smoothing, and alignment, there was a 5.9% increase in
the clustering (Table 6-5), while there was only a 0.5% increase when smoothing and
baseline correction were applied and no change in the clustering when baseline
correction alone was applied. The largest increase in the PCC was observed after
normalization was also applied. The PCC increased 85.1% when total area normalization
was applied and 71.8% when single peak normalization was applied. This shows that
after normalization, alignment is the next most important data pretreatment.
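
The PCC metric is defined earlier in this work and its formula is not restated in this section. The sketch below is therefore only one plausible form, given as an assumption: the spread of each sample is taken as the mean Euclidean distance of its replicates from their centroid on PC1 and PC2, and the PCC is the percent decrease in that spread, so that a positive value corresponds to tighter clustering. All names and scores are hypothetical.

```python
from math import dist
from statistics import fmean


def replicate_spread(scores):
    """Mean distance of replicates from their sample centroid, averaged over samples.

    scores maps a sample name to a list of (PC1, PC2) score pairs, one per replicate.
    """
    spreads = []
    for replicates in scores.values():
        centroid = tuple(fmean(axis) for axis in zip(*replicates))
        spreads.append(fmean(dist(point, centroid) for point in replicates))
    return fmean(spreads)


def percent_change_in_clustering(scores_before, scores_after):
    """Percent decrease in replicate spread; positive values mean tighter clusters."""
    before = replicate_spread(scores_before)
    after = replicate_spread(scores_after)
    return 100.0 * (before - after) / before


# Hypothetical scores for three replicates of one sample.
scores_before = {"D5": [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]}
scores_after = {"D5": [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]}
pcc = percent_change_in_clustering(scores_before, scores_after)
```

Because pretreatments such as normalization change the overall scale of the scores, a comparison of this kind is only meaningful if the scores are placed on a common scale first; that step is omitted here.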

Figure 6-10. An expanded region of dodecane in three replicate
chromatograms of diesel 5 after baseline correction, smoothing, and alignment
(a) and after baseline correction, smoothing, alignment, and normalization (b).
Each replicate is indicated by a different color (R1: red, R2: blue, R3: green).
Axes: abundance versus retention time (20.6-21 min).

Table 6-5. The average percent change in the clustering of replicates (PCC) after
the listed pretreatment procedures including baseline correction using the
extracted ion profiles (EIP fit), smoothing using fast Fourier transform smooth with
2 points (FFT2), alignment using the correlation optimized warping algorithm with
a warp of 2 and a segment of 75 (COW 2, 75) and normalization using total area
(Area) and single peak (Peak) normalization methods.

Baseline Correction    Smoothing    Alignment    Normalization    PCC (%)
-                      -            -            Area             45.1
-                      -            -            Peak             59.0
EIP fit                -            -            -                 0.0
EIP fit                -            -            Area             44.6
EIP fit                -            -            Peak             58.7
EIP fit                FFT 2        -            -                 0.5
EIP fit                FFT 2        -            Area             46.5
EIP fit                FFT 2        -            Peak             62.1
EIP fit                FFT 2        COW 2, 75    -                 5.9
EIP fit                FFT 2        COW 2, 75    Area             85.1
EIP fit                FFT 2        COW 2, 75    Peak             71.8

Summary
Retention time misalignments were observed in overlaid chromatograms of diesel
samples that were analyzed over the course of approximately two weeks. After
processing each chromatogram with a PM or a COW retention time alignment algorithm,
these misalignments were reduced. Both alignment algorithms minimized
the non-chemical sources of variation when appropriate parameters were selected.
For the PM algorithm, window sizes of 3-8 resulted in a similar quality of alignment. For
the COW algorithm, many combinations of the warp and segment size resulted in
improved alignment. For both alignment algorithms, the selection of an appropriate target
is critical. Rather than testing every possible chromatogram as the target, the average
chromatogram proved to be a fast and effective choice for the target. This demonstrates that
different alignment algorithms can result in a similar quality of alignment, even over a
range of different parameters.
These results indicate that the optimization of alignment is not necessary, at least
for this work. Because samples were collected over a short period of time using
temperature-programmed GC, there is a reduced need for retention time alignment.
However, when samples are collected over a long period of time or collected using other
thermal modes, such as isothermal GC, there would likely be an increased need for
optimization of the retention time alignment.

REFERENCES

[1] K.M. Pierce, J.S. Nadeau, R.E. Synovec, in: C.F. Poole (Ed.), Gas Chromatography,
Elsevier, Waltham, MA, 2012.
[2] M. Daszykowski, B. Walczak, TrAC, Trends Anal. Chem., 25 (2006) 1081.
[3] A.M. van Nederkassel, M. Daszykowski, P.H.C. Eilers, Y.V. Heyden, J. Chromatogr.
A, 1118 (2006) 199.
[4] G. Malmquist, R. Danielsson, J. Chromatogr. A, 687 (1994) 71.
[5] K.M. Pierce, J.L. Hope, K.J. Johnson, B.W. Wright, R.E. Synovec, J. Chromatogr. A,
1096 (2005) 101.
[6] Infometrix Inc., LineUp: Chromatographic Alignment Tool, Version 3.0, 2010.
[7] J.S. Nadeau, B.W. Wright, R.E. Synovec, Talanta, 81 (2010) 120.
[8] B.K. Lavine, D. Brzozowski, A.J. Moores, C.E. Davidson, H.T. Mayfield, Anal. Chim.
Acta, 437 (2001) 233.
[9] R.J.O. Torgrip, M. Aberg, B. Karlberg, S.P. Jacobsson, J. Chemometr., 17 (2003)
573.
[10] K.J. Johnson, B.W. Wright, K.H. Jarman, R.E. Synovec, J. Chromatogr. A, 996 (2003)
141.
[11] N.P.V. Nielsen, J.M. Carstensen, J. Smedsgaard, J. Chromatogr. A, 805 (1998) 17.
[12] G. Tomasi, F. van den Berg, C. Andersson, J. Chemometr., 18 (2004) 231.
[13] V. Pravdova, B. Walczak, D.L. Massart, Anal. Chim. Acta, 456 (2002) 77.
[14] T. Skov, F. van den Berg, G. Tomasi, R. Bro, J. Chemometr., 20 (2006) 484.
[15] S. Peters, E. van Velzen, H.G. Janssen, Anal. Bioanal. Chem., 394 (2009) 1273.
[16] K. Prather, Using Multivariate Statistical Procedures to Identify Ignitable Liquid
Residues in the Presence of Interferences, Michigan State University, Ann Arbor,
2011.
[17] J.M. Baerncopf, Association and Discrimination of Ignitable Liquids from Matrix
Interferences using Chemometric Procedures, Michigan State University, Ann Arbor,
2009.
[18] L.J. Marshall, J.W. McIlroy, V.L. McGuffin, R. Waddell Smith, Anal. Bioanal. Chem.,
394 (2009) 2049.
[19] L.J. Marshall, Association and Discrimination of Diesel Fuels using Chemometric
Procedures for Forensic Arson Investigations (Masters Thesis), Michigan State
University, Ann Arbor, MI, 2008.
[20] A.M. Hupp, L.J. Marshall, D.I. Campbell, R.W. Smith, V.L. McGuffin, Anal. Chim.
Acta, 606 (2008) 159.

7. CHAPTER 7: CONCLUSIONS AND FUTURE WORK
Conclusions
Multivariate statistical analyses are being applied to complex data in a growing
number of fields, including forensic science. Discriminating complex profiles is critical in
many areas of research, from arson investigation to proteomics. As shown in this work,
complex and similar samples are not well discriminated using principal component
analysis (PCA) without data pretreatment. Identifying small differences in complex samples is often complicated by
non-chemical sources of variation, such as differences in abundance, shifts in retention
time, noise, and signals from background compounds, which can often be identified as
the greatest sources of variance among samples. This limitation can be overcome by
utilizing appropriate data pretreatment methods to minimize the non-chemical sources of
variation in a dataset. However, data pretreatment procedures cannot be treated with a “black box”
approach; visual examination and metrics to monitor the application of data
pretreatment are required. The analyst must take care to ensure that proper pretreatment
procedures are being applied and that the assumptions made prior to applying the
pretreatments are valid.
In this work, eight different diesel samples were each analyzed in triplicate by gas
chromatography-mass spectrometry (GC-MS). Four data pretreatment procedures (i.e.,
baseline correction, smoothing, retention time alignment, and normalization) were applied
to the chromatograms to minimize the non-chemical variation. For each type of data
pretreatment applied, several different procedures were tested. For baseline correction,
the background subtracted baseline function, the subtraction of an extracted ion profile,
and the subtraction of the baseline using a modeled function were compared. The
Savitzky-Golay and the fast Fourier transform smoothing algorithms were compared for
their ability to reduce noise in the chromatogram. Misalignments were corrected using a
correlation-optimized warping algorithm and a peak-matching algorithm. Normalization
was compared using a single peak normalization and a total area normalization. After
each pretreatment, chromatograms were compared using the developed metrics and
PCA was performed.
The metrics that were developed provide a rapid method for evaluating the effect
of each pretreatment on the chromatograms. Each metric was designed to evaluate the
increase in similarity obtained by minimizing the non-chemical sources of variation among
the replicates. These metrics also allow for a comparison and optimization of parameters
associated with each data pretreatment procedure. The evaluations that were utilized
include a visual examination of the chromatogram, a metric measuring the change in the
chromatogram after pretreatment, a visual examination of the PCA scores plot, and the
percent change in the clustering of replicate samples in the PCA scores plot.
The minimization of non-chemical sources of variation improves the multivariate
statistical analysis in two ways. First, after data pretreatment, replicate chromatograms
are more similar; therefore, replicates will cluster more closely using PCA. Second, and
more importantly, when PCA is applied, more chemical (rather than non-chemical)
differences will be identified as the greatest sources of variance. Therefore, the loadings
plots will contain variables that reflect chemical differences between samples, rather than
non-chemical differences due to instrumental variation. As replicate chromatograms are
chemically identical, replicates should cluster better and samples should be well
discriminated from one another following appropriate pretreatment.
In baseline subtraction, the background subtracted baseline, the subtraction of the
extracted ion profiles, and the subtraction of the baseline using a modeled function all
resulted in a reduction in the baseline. The subtraction of the modeled function allowed
for a reduction in the baseline, without any reduction in signal from the peaks in the
chromatogram, ensuring that chemical information was not inadvertently removed from
the chromatogram. Therefore, it was selected as the most appropriate option for baseline
correction. The baseline of the chromatograms in this work was not a major source of
variation, because the chromatograms were generated over a relatively short period of
time. However, when chromatograms that have been generated over a long period of
time are compared, the difference in baseline may become more significant. Additionally,
when the GC is operated at high temperature or the column is old, stationary phase
degradation is more prevalent and baseline correction will become more critical.
The Savitzky-Golay and fast Fourier transform smoothing algorithms both resulted
in noise reduction in the chromatogram. By adjusting the number of points in the smooth
and the polynomial order (only for the Savitzky-Golay smoothing algorithm), a similar degree of
smoothing could be obtained with either algorithm. A similar reduction in noise will result in
similar clustering of replicates on the scores plot after PCA. Therefore, optimization and
careful selection of the smoothing algorithm are not necessary; however, care must be
taken to ensure that peak distortion does not occur. Based on this work, significant peak
distortion was not observed until there was more than a 45% reduction in noise. For this
work, the fast Fourier transform smoothing algorithm with 2 points was selected, but this
algorithm has similar performance to several parameter combinations of the Savitzky-Golay
smoothing algorithm.
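
To make the roles of the number of points and the polynomial order concrete, the following sketch applies a Savitzky-Golay smooth with one fixed setting: a 5-point window and a quadratic polynomial, for which the standard convolution weights are (-3, 12, 17, 12, -3)/35. This setting is chosen only for illustration and is not necessarily one of the combinations tested in this work; the function name and the test signals are hypothetical.

```python
# Standard Savitzky-Golay weights for a 5-point window and a quadratic fit.
SG_5_POINT_QUADRATIC = (-3.0, 12.0, 17.0, 12.0, -3.0)
SG_NORMALIZATION = 35.0


def savitzky_golay_5(signal):
    """Smooth a trace with a 5-point quadratic Savitzky-Golay filter.

    The first and last two points are returned unchanged because a full
    window is not available there.
    """
    half = 2
    smoothed = list(signal)
    for i in range(half, len(signal) - half):
        window = signal[i - half:i + half + 1]
        weighted = sum(w * v for w, v in zip(SG_5_POINT_QUADRATIC, window))
        smoothed[i] = weighted / SG_NORMALIZATION
    return smoothed


# A quadratic trace is reproduced exactly, so peak-like curvature is preserved.
parabola = [float(i * i) for i in range(10)]
smoothed_parabola = savitzky_golay_5(parabola)

# Point-to-point noise of amplitude 1 around a baseline of 10 is reduced.
noisy_baseline = [10.0 + (1.0 if i % 2 else -1.0) for i in range(9)]
smoothed_baseline = savitzky_golay_5(noisy_baseline)
```

A wider window or a lower polynomial order would remove more noise, at the cost of a greater risk of distorting narrow peaks.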
A peak-matching and a correlation optimized warping algorithm were compared
for peak alignment. Nearly all combinations of window size (for the peak-matching
algorithm) or warp and segment size (for the correlation optimized warping algorithm)
resulted in an improvement in alignment. Most parameters resulted in similar quality
alignment. The number of points that the peaks could be shifted (called the window for the
peak-matching algorithm and the warp for the correlation optimized warping algorithm) is an
important consideration. Ideally, the chromatograms should be overlaid and visually
inspected. The window or warp should then be selected based on the number of points
that each peak needs to shift in order to be aligned. If the window or warp is too small,
the peaks cannot be aligned; if it is too large, peaks may be shifted too far and aligned to
the wrong compound. Target selection can also influence the alignment. The target must
include all of the compounds that require alignment. For this work, the average chromatogram
was selected as the target and resulted in good alignment, without the need for optimization. The
correlation optimized warping algorithm with a warp of 2 and segment size of 75 was
selected.
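
The effect of the allowed shift can be shown with a deliberately simplified search. The sketch below is neither the peak-matching nor the correlation optimized warping algorithm: it only tries rigid integer shifts of a whole trace within a limit (a stand-in for the window or warp) and keeps the shift that gives the largest overlap with the target. The function name and traces are hypothetical.

```python
def best_shift(target, sample, max_shift):
    """Return the integer shift within +/- max_shift that best matches the target.

    A negative shift moves the sample to earlier data points. The match is
    scored as the sum of products of overlapping target and sample points.
    """
    def overlap_score(shift):
        total = 0.0
        for i, target_value in enumerate(target):
            j = i - shift  # sample point that lands on target point i
            if 0 <= j < len(sample):
                total += target_value * sample[j]
        return total

    return max(range(-max_shift, max_shift + 1), key=overlap_score)


# Hypothetical traces: the sample peak elutes three points later than the target.
target = [0.0, 0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 0.0, 0.0]
sample = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 4.0, 1.0, 0.0]

too_small = best_shift(target, sample, max_shift=2)
large_enough = best_shift(target, sample, max_shift=4)
```

With a limit of 2 points the search can only move the peak part of the way, whereas a limit of 4 points allows the full 3-point correction. In a real chromatogram with many neighboring peaks, a limit that is too generous would also allow a peak to be matched to the wrong compound.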
Two different normalizations were compared, a total area normalization and a
single peak normalization. Both normalization procedures resulted in a larger
improvement in the clustering of replicates than the other pretreatment procedures,
but each normalization is based on a different assumption. The total area normalization
assumes that the total area is the same for all chromatograms. The single peak
normalization assumes that there is a single compound within each chromatogram that is
at the same concentration. When choosing a normalization procedure, an analyst must
decide which of these assumptions is valid for the specific application and data. In
this work, the total area normalization was selected. Because each diesel sample is from
a different source, there is no reason to assume that any single compound has the same
concentration across all of the samples. However, because each sample contains so many different
compounds, the total area would likely be equivalent.
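
The two normalizations can be sketched as follows. The example assumes uniformly sampled chromatograms, so that the total area is proportional to the sum of the abundances, and it scales each trace to unit area or unit peak height; the scaling constant used in this work may differ. The function names and abundances are hypothetical.

```python
def total_area_normalize(chromatogram):
    """Divide every point by the summed abundance of the chromatogram."""
    total = sum(chromatogram)
    if total == 0:
        raise ValueError("chromatogram has zero total area")
    return [point / total for point in chromatogram]


def single_peak_normalize(chromatogram, peak_index):
    """Divide every point by the abundance at the maximum of one chosen peak."""
    reference = chromatogram[peak_index]
    if reference == 0:
        raise ValueError("reference peak has zero abundance")
    return [point / reference for point in chromatogram]


# Hypothetical replicates: the second injection delivered 1.5 times more sample.
replicate_a = [0.0, 2.0, 8.0, 2.0, 0.0, 1.0, 3.0, 1.0]
replicate_b = [1.5 * point for point in replicate_a]

area_a = total_area_normalize(replicate_a)
area_b = total_area_normalize(replicate_b)
peak_a = single_peak_normalize(replicate_a, peak_index=2)
peak_b = single_peak_normalize(replicate_b, peak_index=2)
```

For replicates that differ only in the amount injected, both normalizations make the traces identical. The two procedures give different results only when the underlying assumption of one of them does not hold, for example when the reference peak is not at the same concentration in every sample.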
After applying data pretreatment procedures, replicates on the PCA scores plot
were shown to cluster more closely. This is due to the removal of non-chemical sources
of variation. The scores plot prior to data pretreatment (Figure 2-6) shows that the
replicates are spread along PC1 and the samples are separated along PC2. Replicates
are chemically the same, so any spread along PC1 is due to non-chemical sources of
variation introduced during the analysis. This means that the loadings plot of PC1 prior
to applying data pretreatment (Figure 2-7a) contains only non-chemical sources of
variation. Because the samples were separated along PC2, chemical
differences were identified in PC2 prior to application of data pretreatment. Therefore,
the loadings plot of PC2 showed chemical differences between samples (Figure 2-7b and
Figure 7-1a). After applying data pretreatment procedures, the non-chemical sources of
variation were minimized, which was reflected in the PCA scores plot (Figure 6-8).
Replicates were positioned close together and samples were differentiated from one
another, indicating that PC1 and PC2 contain chemical differences. The loadings plot for
PC1 after applying the data pretreatment procedures contained the same
chemical differences that were identified on PC2 prior to data pretreatment (Figure 6-9a
and Figure 7-1b). This demonstrates that prior to the application of data pretreatment,
PC1, which accounts for the most variance, was only accounting for non-chemical
sources of variation. After applying data pretreatment, the non-chemical sources of
variation had been minimized and PC1 accounted for chemical differences between
samples.

Figure 7-1. Loadings plot for PC2 prior to applying data pretreatment (a) and for
PC1 after applying baseline correction, smoothing, alignment, and normalization
(b). Axes: loadings versus retention time (0-100 min).

Overall, this work has demonstrated that application of data pretreatment
procedures can significantly enhance the discriminatory ability of PCA. For this work,
normalization was shown to provide the largest improvement in the clustering of replicates,
followed by retention time alignment. Smoothing and baseline correction had relatively
little effect on the clustering of replicates. In total, there was an 85% improvement in the
clustering of replicates after applying all of the data pretreatment procedures.
Additionally, this work has shown that when multiple pretreatments are applied to
the same chromatogram, there is a larger increase in clustering than with a single
pretreatment. For example, total area normalization alone resulted in a 45% increase in clustering;
however, when baseline correction, smoothing, alignment, and normalization were applied,
there was an 85% increase in the clustering of replicates.
This work has also shown that optimization of the data pretreatment procedures is
not necessary to obtain enhanced clustering of replicates. Most of the parameters tested
resulted in a reduction of non-chemical sources of variation, based on the metrics used
to evaluate each pretreatment procedure. As previously discussed, not all non-chemical
sources of variation were equally prevalent (e.g., normalization had a larger effect than
baseline correction, indicating that differences between injections were more variable
than the baseline). Therefore, pretreatment selection is more important for non-chemical
sources of variation that are more prevalent in the chromatograms.
The application of data pretreatment procedures results in an enhancement of the
discrimination of complex and chemically similar mixtures by minimizing the non-chemical
sources of variation. In forensic science and many other fields, the comparison of
complex samples is becoming more common. Prior to applying multivariate statistical
procedures, data pretreatment is commonly utilized. However, for the data pretreatment
procedures to be accepted in court, the procedures would have to be shown not to alter
the chemical information contained in the chromatograms. In addition, it is critical to
demonstrate that samples can be differentiated using chemical information once the
non-chemical differences have been minimized. This work provides methodologies for
comparing and selecting appropriate pretreatment procedures. It is critical to ensure that
the chemical information is not being altered by the pretreatment procedures and to
understand the effect of each pretreatment on the chromatographic data.

Future Work
There are several areas presented in this research that could be further expanded.
First, additional research could focus on novel data pretreatment procedures. As shown
in this work, the data pretreatment procedures did not remove all of the
non-chemical sources of variation. Retention time alignment and normalization were shown
to result in the largest reductions in non-chemical variation, and would benefit the most
from additional investigation. For alignment, developing an algorithm that is capable of
finding co-eluting and low abundance peaks would result in better alignment because
more peaks would be identified and aligned. Developing a normalization that is able to
normalize the baseline and peak maxima would allow for a more complete normalization.
As part of this work, the precision between replicates was evaluated using each
metric. An additional metric could be developed that is capable of demonstrating that
the chemical compounds in replicates are not changing as a result of the data
pretreatment procedures. The metric could be based on an average chromatogram
created from multiple replicates. If chemical information is not being lost, the average
chromatogram should remain relatively unchanged after applying each data pretreatment
procedure. If the average remains unchanged, this will demonstrate that chemical
information is retained, even after applying data pretreatment.
Another area in which this work could be expanded would be to evaluate the effect
of data pretreatment on the output from other types of samples (e.g., drugs and paints)
and instrumentation (e.g., infrared spectroscopy and scanning electron microscopy with
energy dispersive spectroscopy). Demonstrating that non-chemical variation can be
minimized without altering the underlying chemical information would be a critical step in
the use of multivariate statistics for forensic cases.
The ultimate goal is to apply multivariate statistics to forensic data. However, work
is still required to develop statistical methods and ways to apply those methods to forensic
data. This will help to limit bias and assign statistical confidence to forensic comparisons,
addressing concerns outlined by the National Academy of Sciences.
