HARNESSING THE POWER OF GRAPHICS PROCESSING UNITS TO ACCELERATE COMPUTATIONAL CHEMISTRY

By

Yipu Miao

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Chemistry - Doctor of Philosophy

2015

ABSTRACT

HARNESSING THE POWER OF GRAPHICS PROCESSING UNITS TO ACCELERATE COMPUTATIONAL CHEMISTRY

By

Yipu Miao

Electron repulsion integral (ERI) and ERI derivative evaluation is the limiting factor for self-consistent-field (SCF) and Density Functional Theory (DFT) calculations. Therefore, calculating these quantities on graphics processing units (GPUs) can significantly accelerate quantum chemical calculations. Recurrence relations, one of the fastest ERI evaluation algorithms currently available, are used to compute ERIs. A direct-SCF scheme to assemble the Fock matrix and gradient efficiently is presented, wherein ERIs are evaluated on-the-fly to avoid CPU-GPU data transfer, a well-known architectural bottleneck in GPU-specific computation. Machine-generated code is utilized to calculate different ERI types efficiently. However, only s, p and d ERIs and s and p derivatives can be executed on GPUs using the current version of CUDA and NVIDIA GPUs. Hence, we developed an algorithm to compute f-type ERIs and d-type ERI derivatives on GPUs. Our benchmarks show that GPU-enabled ERI and ERI derivative computation yields speedups of 10-100 times relative to traditional CPU execution. An accuracy analysis using double-precision calculations demonstrates that the accuracy is satisfactory for most applications.

Besides ab initio quantum chemistry methods, GPU programming can be applied to a number of computational chemistry applications, for example the Weighted Histogram Analysis Method (WHAM), a technique to compute potentials of mean force. We present an implementation of multidimensional WHAM on graphics processing units (GPUs), which significantly accelerates its computational performance. Our test cases, which simulate two-dimensional free energy surfaces, yielded speedups of up to 1000 times in double precision. Moreover, speedups of 2100 times can be achieved when single precision is used, whose use introduces errors of less than 0.2 kcal/mol. These applications of GPU computing in computational chemistry can significantly benefit the whole computational chemistry community.

Copyright by
YIPU MIAO
2015

This dissertation is dedicated to my parents, girlfriend, and all those whose support, encouragement, and personal sacrifice have made this research possible.

ACKNOWLEDGMENTS

First of all I want to thank my adviser, Professor Kenneth Merz, Jr., from Michigan State University and previously the University of Florida, for providing me the opportunity to pursue my Ph.D. in the United States and for supporting and guiding me as my advisor during the five and a half years of my Ph.D. research. Special thanks to my committee members, Professor Metin Aktulga, Professor Katharine Hunt and Dr. Dirk Colbry. Moreover, I want to thank Professor Rodney Bartlett, Professor Beverly Sanders, Dr. Erik Deumens and Dr. Benjamin Smith, who were part of my original committee at the University of Florida, for their help before my transfer to MSU with Professor Merz. I sincerely want to thank Dr. Mark Burger and the many engineers from the NVIDIA Corporation who supported me and my research group with state-of-the-art facilities for software development and research use. Their help was a key to our success.
After my Ph.D., I will be taking up a position with the well-respected company Amazon.com, Inc. Finally, I want to thank Professor Shuhua Li from Nanjing University, China, who helped me understand the foundations of quantum chemistry and opened the door to scientific research. I want to give my heartfelt thanks to my parents, Zhiguo Miao and Xiulan Ou, who emotionally and financially supported me during my Ph.D. career. I also want to thank my girlfriend, Jiamin Liang, for her tireless support of me during my Ph.D.

Many group members of the Merz group, especially Dr. Yue Yang, Dr. Nihan Ucisik, Dr. Dhruva Chakravorty, Dr. Alex Fu, to-be Dr. Li-Li Pan, Mr. Pengfei Li, Dr. Zheng Zheng, and Dr. Xue Li, provided fruitful discussions during my research career, and Mona Minkara, Ting Wang, Nupur Bansal, Zhuoqing Yu and Lin Song also provided support to me during my Ph.D. The UFHPC and MSU iCER supported our group with facilities and HPC courses, without which my research would have been less successful, so I want to thank Dr. Erik Deumens, Dr. Charles Taylor, Dr. Craig Prescott, and Dr. Ying Zhang of the UF HPC and Dr. Kenneth Merz, Dr. Dirk Colbry and Dr. Benjamin Ong of the MSU iCER team. The Department of Chemistry of the University of Florida and the Department of Chemistry at Michigan State University provided me with teaching assistant positions during my Ph.D. career and guided me towards the completion of the Ph.D. degree in Chemistry. The Department of Computer and Information Science and Engineering of the University of Florida bestowed a Master of Science degree in Computer Engineering, thereby strengthening my knowledge of the computer science field. I want to thank all faculty and staff of these departments, who had an influence on my life and career. Last but not least, I also want to thank all of my friends in Gainesville, FL (the Gator Nation) and East Lansing, MI for all their help and support.

PREFACE

I was fortunate enough to work on a multitude of projects in computational chemistry during my five-year studies towards my Ph.D. degree with Professor Kenneth Merz, Jr. I focused on graphics processing unit (GPU) acceleration of computational chemistry and computational biology methods, but also had great opportunities to explore the properties of molecules, especially biochemical molecules with potential applications for drug design. These projects, which were both interesting and challenging, allowed me to gain experience in several fields, especially at the interface of computer science and computational chemistry. The fundamental idea of my research projects was to speed up traditional computational chemistry calculations using current-generation computing technology, with or without modification of the computational methods employed. It is my strong belief that computational chemistry will be a powerful tool for a broad range of chemistry disciplines, but many of these applications of computational chemistry methods continue to be limited by the speed of the computations rather than by their accuracy and theoretical justification. Therefore, I focused my efforts on the power of high-performance computing, especially GPU programming, and its role in improving computational chemistry. My dissertation focuses on three computational chemistry problems that are relevant to scientific research in computational chemistry:
I. Acceleration of electron repulsion integral (ERI) evaluation on GPUs.
II. Acceleration of the first-order derivatives of ERIs (including high-angular-momentum integrals) on GPUs.
III. Acceleration of molecular dynamics trajectory analysis using WHAM on GPUs.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO ABBREVIATIONS
CHAPTER 1. INTRODUCTION TO GPU PROGRAMMING, AND ITS APPLICATION TO QUANTUM CHEMISTRY
1.1. INTRODUCTION TO GPU PROGRAMMING
1.2. GPU APPLICATIONS IN QUANTUM CHEMISTRY
REFERENCES
CHAPTER 2. PROJECT OBJECTIVES
CHAPTER 3. GPU ACCELERATION ON ERI EVALUATION
3.1. INTRODUCTION
3.2. ELECTRON REPULSION INTEGRALS
3.2.1. RECURRENCE RELATIONS FOR ERI EVALUATION
3.3. IMPLEMENTATION
3.3.1. COMPUTE UNIFIED DEVICE ARCHITECTURE
3.3.2. RELATED WORK
3.3.3. IMPLEMENTATION DESIGN
3.4. RESULTS AND DISCUSSION
3.5. CONCLUSIONS
APPENDIX
REFERENCES
CHAPTER 4. GPU ACCELERATION ON ERI DERIVATIVE EVALUATION
4.1. INTRODUCTION
4.2. THEORY
4.2.1. ELECTRON REPULSION INTEGRALS AND RECURRENCE RELATIONS
4.2.2. CALCULATION OF DERIVATIVE ERIS
4.3. IMPLEMENTATION
4.3.1. CUDA
4.3.2. MACHINE GENERATED ERI CODE
4.3.3. ERI EVALUATION FOR LOW-ANGULAR MOMENTUM VALUES
4.3.4. ERI EVALUATION FOR HIGH-ANGULAR MOMENTUM VALUES
4.3.5. IMPLEMENTATION OF ERI DERIVATIVE EVALUATION
4.4. BENCHMARK RESULTS AND DISCUSSION
4.5. CONCLUSIONS
APPENDIX
REFERENCES
CHAPTER 5. GPU ACCELERATION ON HISTOGRAM ANALYSIS
5.1. INTRODUCTION
5.2. THEORY AND METHODS
5.3. RESULTS AND DISCUSSION
5.4. CONCLUSIONS
APPENDIX
REFERENCES
CHAPTER 6. SUMMARY AND CONCLUSION REMARKS

LIST OF TABLES

Table 1.1. Summary of Capability and Performance of GPU-based HF and DFT Implementations
Table 3.1. Timing for Two SCF Schemes for Hydrogen Atom System ERI Evaluation
Table 3.2. Time Comparisons Between QUICK and GAMESS
Table 3.3. Accuracy and Performance Comparison Between CPU and GPU Calculations
Table 3.4. Timing for GPU Version Calculation Using Different Devices
Table 3.5. Peak Memory Usage Comparison
Table 4.1. Performance Comparison between CPU and GPU SCF Calculation
Table 4.2. Performance Comparison between CPU and GPU HF-Gradient Calculation

LIST OF FIGURES

Figure 1.1. The FLOPS counts of CPU and GPU architectures as a function of time. Data is sourced from reference 1.
Figure 1.2. The bandwidth of CPU and GPU architectures as a function of time. Data is sourced from reference 1.
Figure 1.3. Simplified model of CPU and GPU architectures, illustrating that GPUs are designed for massive data processing while CPUs depend more on flow control and data caching.
Figure 1.4. Number of computational chemistry GPU applications published in the last few years. Numbers provided by NVIDIA.
Figure 1.5. Illustration of the automatic scalability of CUDA. A block, a logical unit, can be executed in any available multiprocessor, which is a physical concept, by the CUDA scheduler.
Figure 1.6. Illustration of the CUDA thread hierarchies. One-, two- or three-dimensional threads are bundled as a block, while one-, two- or three-dimensional blocks form a grid. Two-dimensional threads and two-dimensional blocks are presented.
Figure 1.7. CUDA memory hierarchies. Local memory, shared memory and global memory are presented, while the texture and constant memory are not presented but are described in the text.
Figure 3.1. Pre-sorting scheme for half ERIs. Row and column indices correspond to bra and ket pairs of contracted integrals, and the colors reflect the estimated magnitude of the ERI value. Sorting order is described in the text.
Figure 3.2. Flowchart for GPU implementation of ERI evaluation within an SCF cycle.
Figure 3.3. Speedup comparisons between different basis sets on the alanine chain series. Timings for CPU and GPU are their first-iteration Fock matrix formation times including data transfer time. Platform and software details are described in the text.
Figure 3.4. Energy deviation comparisons between different basis sets on the alanine chain series. A logarithmic scale is used for the Y-axis.
Figure 4.1. Class and function description for the ERI code. (i,j)m represents an ERI class for [i0|j0](m), and class Ci,j is built to express its relation with other classes with lower i, j, the same m and higher m values. The goal of function Fi,j is to generate [i0|j0](0) from [00|00](m), where the m value varies from 0 to i+j, by using classes to express its path.
Figure 4.2. Optimized subroutine partition pattern for (a) CUDA 3.0 and (b) CUDA 2.0 architectures. All Fij functions that are within one zone will be wrapped in one subroutine and called in a kernel. (c)(d) Kernel call pattern for ERI type [ij|kl] with the partition strategy (a) for CUDA 3.0 and (b) for CUDA 2.0. See text for details.
Figure 4.3. Pre-sorting and thread walking illustration. The test case is a water cluster with 4 water molecules. The color scheme indicates the upper bound of the ERIs; dense*dense, sparse*dense and sparse*sparse areas can be visually identified in the ERI matrix. The table on the right gives the boundaries for the different ERI types. See details in the text.
Figure 4.4. Thread walk strategy comparison. Three different thread walk strategies are presented for GPU calculations. Times used to calculate the 2nd iteration for four examples of alanine chains of different lengths using 6-31G** are reported. The unit is seconds. The "Improvement %" row indicates the performance improvement of the circular strategy over the linear strategy.
Figure 4.5. Optimized subroutine partition for ERI derivatives. This partition strategy works for both CUDA 3.0 and CUDA 2.0. The kernel call mapping is similar to Figure 4.2 but simplified, so it is not shown here.
Figure 4.6. Calculation time profiles for the CPU and GPU calculations on alanine chains. Timings for the CPU and GPU kernel time (top two curves) are based on the Fock matrix formation excluding the one-electron contribution. Timings for kernel 0 to kernel 8 are given as well, and the kernel call mapping corresponds to Figure 4.2. Platform and software details are described in the text. A logarithmic scale is used for the Y-axis.
Figure 4.7. (a) Speedup comparison and (b) gradient deviation comparison between CPU and GPU calculations with different basis sets for alanine chain lengths from 1 to 30. Timings for CPU and GPU are based on the gradient generation time including data transfer but excluding the one-electron integral contribution. Platform and software details are described in the text. For (b), the RMSD and maximum difference (bolded) between the GPU and CPU results are presented. The first 6 points are not presented because the difference is smaller than 10^-12. A logarithmic scale is used for the Y-axis.
Figure 4.8. Accuracy of geometry optimization using a GPU versus a CPU. The test molecule was (Glycine)12 with the 6-31G and 6-31G** basis sets. Three sets of computations were executed, including CPU+CPU, GPU+CPU, and GPU+GPU. The curves indicate the geometry RMSD relative to the CPU+CPU results at the same step.
Figure 5.1. Pseudo-code for implementation of part 1.
Figure 5.2. Pseudo-code for implementation of part 2.
Figure 5.3. Configuration used for the first step in a WHAM calculation. Blue rectangles indicate a block unit, while orange rectangles represent threads.
Figure 5.4. Configuration used for the second step in a WHAM calculation. Blue rectangles indicate a block unit, while orange rectangles represent threads.
Figure 5.5. Benchmark reaction studied: the CusF metallochaperone. Open (top) and closed (bottom) states exist for the CusF protein. Umbrella sampling was carried out in two dimensions, with the x-axis as one distance (Cu+ --Trp44@CZ3 distance) and the y-axis as one dihedral angle (Glu36-Met47 Phi angle, highlighted in black). See text for details.
Figure 5.6. Profiling of the GPU-WHAM code for step 1 with different thread and block choices.
Figure 5.7. Profiling of the GPU-WHAM code for step 2 with different thread and block choices.
Figure 5.8. Benchmark results for two sets of data with different numbers of windows (1024 for top, 1533 for bottom) and different grid points (from 32*128 to 768*1024). Profiling with CPU, GPU single-precision (SP) and GPU double-precision (DP) calculations is presented. A logarithmic scale is used for the y-axis. Speedup by GPU SP and GPU DP is also plotted.
The benchmark reaction is described in the text.
Figure 5.9. The PMF surface and contour map produced using the GPU-WHAM code with double precision (top), single precision (middle) and the absolute energy difference (bottom). The energy unit is kcal/mol. The x- and y-axes are the Cu+ --Trp44@CZ3 distance and the Glu36-Met47 Phi angle; the units are Å and rad. Full computational details are given in the text.
Figure 5.10. The scheme of step 1 (left) and step 2 (right) in the CusF 2D PMF simulations. The arrows indicate the direction of the reaction coordinate. The hollow circles in the right figure are the windows that were finished in step 1. (For illustration, the figure shows 5 windows on the X axis and 7 windows on the Y axis, but there are actually 39 windows on the X axis and 44 windows on the Y axis.)

KEY TO ABBREVIATIONS

AMBER Assisted Model Building with Energy Refinement
CPU Central Processing Unit
CUBLAS the NVIDIA CUDA Basic Linear Algebra Subroutines library
CUDA Compute Unified Device Architecture
CUFFT the NVIDIA CUDA Fast Fourier Transform library
DFT Density Functional Theory
DP Double Precision
DRAM Dynamic Random-Access Memory
ERI Electron Repulsion Integral
FLOPS Floating-point Operations Per Second
GPU Graphics Processing Unit
HF Hartree-Fock Theory
HRR Horizontal Recurrence Relation
PDB Protein Data Bank
PMEMD Molecular Dynamics using Particle Mesh Ewald
PMF Potential of Mean Force
MD Molecular Dynamics
MM Molecular Mechanics
QM Quantum Mechanics
QM/MM Mixed Quantum Mechanics and Molecular Mechanics
RHF Restricted Hartree-Fock Theory
RMSD Root-Mean-Square Deviation
SCF Self-Consistent Field method
SIMD Single-Instruction, Multiple Data
SP Single Precision
VRR Vertical Recurrence Relation
WHAM The Weighted Histogram Analysis Method

CHAPTER 1. INTRODUCTION TO GPU PROGRAMMING, AND ITS APPLICATION TO QUANTUM CHEMISTRY

1.1. INTRODUCTION TO GPU PROGRAMMING

In the past few years, the computing capability of graphics processing units, or GPUs, has increased dramatically, as Figure 1.1 shows. From this figure we see that the GPU FLOPS (floating-point operations per second) counts in both DP (double precision) and SP (single precision) have increased exponentially over the past 14 years, following Moore's Law, while the corresponding CPU FLOPS counts have not. Meanwhile, as Figure 1.2 shows, the bandwidths of both GPUs and CPUs have increased dramatically, largely driven by the insatiable demand for graphics and high-performance computing, which calls for highly parallel multithreaded processors with tremendous computational power and high memory bandwidth. The Tesla, Fermi and Kepler architectures were released by NVIDIA between 2008 and 2013, and we were fortunate enough to have had early access to them for benchmarking purposes. They feature fast double-precision computing and dynamic parallelism, with the former being critical to scientific computing and the latter maximizing the potential of GPU computing. The next two generations of NVIDIA GPUs, the Maxwell and Volta microarchitectures, have been announced by NVIDIA and are expected to be released in the near future. They will feature Unified Virtual Memory and stacked DRAM, which will make development easier for the CUDA developer and overcome some known drawbacks that have limited performance.
Figure 1.1. The FLOPS counts of CPU and GPU architectures as a function of time. Data is sourced from reference 1.

Figure 1.2. The bandwidth of CPU and GPU architectures as a function of time. Data is sourced from reference 1.

The main reason for the discrepancy in FLOPS counts between GPUs and CPUs is that the former are specialized for compute-intensive, highly parallel computing, driven by the needs of graphical rendering. Hence, GPUs are designed to have more transistors devoted to massive data processing rather than to flow control, as shown in Figure 1.3. Therefore, GPUs are better suited to address problems that can be split into data-parallel and data-independent computations; in that case, massively parallel arithmetic threads, instead of large data caches, can mask memory access latency. CPUs, on the other hand, with sophisticated flow control and data caching, are more suitable for calculations that are data dependent or less arithmetically demanding.

Figure 1.3. Simplified model of CPU and GPU architectures, illustrating that GPUs are designed for massive data processing while CPUs depend more on flow control and data caching.

With the growth of GPU computing power, computational chemists have ported a number of applications to GPU architectures, which are summarized in Figure 1.4 per NVIDIA. Computational chemists have used GPUs extensively to treat a wide range of problems 4, including AMBER PMEMD 5,6 (molecular dynamics simulation), quantum Monte Carlo 7, DFT (density functional theory), SCF (self-consistent-field) 8-11 and post-HF (Hartree-Fock) theory 12. In this way, computationally intensive scientific applications, which previously required expensive supercomputing facilities, are now within reach of average users using relatively low-cost GPU cards. We will go into more detail regarding the current state of quantum chemistry on GPUs below.

Figure 1.4. Number of computational chemistry GPU applications published in the last few years. Numbers provided by NVIDIA.

A GPU is an example of the single-instruction, multiple data (SIMD) paradigm, which, unlike CPUs that are designed for rapid sequential code execution using a single thread, has a parallel architecture that executes many concurrent threads relatively slowly. Therefore, GPUs are well suited for high-performance computation with dense levels of data parallelism where the threads are data-independent of each other. CUDA is currently the most mature and widespread GPU computing platform for scientific applications. It provides developers direct access to parallel computational elements (GPUs) and enables code to run concurrently with the CPU. The assumption is that the most numerically intensive components of a program will be executed on the GPU hardware, with the remaining steps carried out on the CPU. The challenge in using GPUs lies in adapting to the specialized hardware in order to take advantage of the expected performance increase. Moreover, memory allocation should be carefully handled to avoid memory latency issues. Single precision should be employed carefully because its accuracy may be insufficient for the task at hand; generally, single precision is about 2-8 times faster than double precision (depending on the device), so double-precision code will give relatively poor performance.
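As a concrete but purely illustrative sketch of the precision trade-off just described, a CUDA kernel can be templated on the floating-point type so that the same code path can be compiled for single or double precision and the choice made per problem. The kernel and helper names below are hypothetical and are not part of any package discussed in this work.

```cuda
#include <cuda_runtime.h>

// Illustrative only: one kernel body, two precisions. The double-precision
// instantiation trades the 2-8x single-precision speed advantage for accuracy;
// the float instantiation does the opposite.
template <typename Real>
__global__ void axpy(int n, Real alpha, const Real* x, Real* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n) y[i] += alpha * x[i];
}

// Hypothetical launch helpers: the caller selects the precision.
void axpy_sp(int n, float a, const float* x, float* y) {
    axpy<float><<<(n + 255) / 256, 256>>>(n, a, x, y);
}
void axpy_dp(int n, double a, const double* x, double* y) {
    axpy<double><<<(n + 255) / 256, 256>>>(n, a, x, y);
}
```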
Another important consideration is the fact that most existing computational chemistry software is written in Fortran, and creating GPU code (particularly C/C++ code, although commercial Fortran and other programming-language CUDA compilers are available) requires the creation of new software or the incremental inclusion of GPU kernels into the code base.

GPU programming, especially the CUDA parallel programming model, is designed to overcome the difficulty of developing application software that transparently scales its parallelism in order to leverage the increasing number of processor cores. The core of CUDA involves three key abstractions - a hierarchy of thread groups, memory types and barrier synchronization - that are exposed to the developer as a minimal set of C language extensions. With these core abstractions, a CUDA developer can partition a problem into several sub-problems that can be solved independently in parallel by the threads in a block, and the overall problem can be solved cooperatively, by all threads in all blocks, once the sub-problems have each been solved. Each block of threads, which is a logical concept, can be scheduled on any of the available multiprocessors within a GPU, providing automatic scalability; only the runtime system needs to be aware of the executing multiprocessor, as Figure 1.5 illustrates. This automatic scalability allows the GPU architecture to span a range of markets by simply scaling the number of multiprocessors, from high-performance gaming platforms such as the GeForce cards, to the professional-level Quadro, to the high-performance computing platform Tesla.

Figure 1.5. Illustration of the automatic scalability of CUDA. A block, a logical unit, can be executed in any available multiprocessor, which is a physical concept, by the CUDA scheduler.

For computational chemistry, electron repulsion integral (ERI) calculations for ab initio theory, together with basic linear algebra operations, typically define the computational cost of the application. ERI calculations formally scale as N^4 for the HF method, where N is the number of basis functions, and typically this is reduced to N^x with x < 3 through careful pre-screening, cutoffs and efficient computational schemes. So for most calculations with N less than several thousand basis functions, ERI evaluation is the most time-consuming part, while linear algebra, with its cubic scaling, will eventually dominate for very large numbers of basis functions. However, ERI evaluations are not as general as linear algebraic manipulations, and are less optimized than the basic linear algebra subroutines (BLAS). In addition, unlike linear algebra and some other common subroutines with predictable and well-defined memory access patterns, ERI evaluations involve many different types of integral classes, and thread divergence could jeopardize computing efficiency. ERI evaluations also require large amounts of both registers and memory. In GPU computation, slow memory access and shortages of register space are two factors that affect computational speed and need to be considered carefully during software development. Hence, careful software design and special procedures are necessary to accelerate ERI evaluation using a GPU.

For the thread hierarchy, as illustrated in Figure 1.6, the most basic unit is the thread, and threads are identified by the built-in variable threadIdx.
This variable is a three-component vector, so that threads can be identified using one-, two- or three-dimensional indices, forming one-, two- or three-dimensional blocks depending on the developer's specification. This indexing scheme is especially suitable for addressing the elements of a domain such as a vector or matrix. There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same multiprocessor and share its limited memory. Blocks are further organized into one-, two- or three-dimensional grids of thread blocks. The number of thread blocks in a grid is usually decided by the size of the data being processed or the number of processors. Similar to the thread variable, blockIdx is provided to identify a block within the grid. It is worth noting that the thread, block and grid are all logical concepts that are straightforward to map to physical concepts, such as a multiprocessor in the GPU, to maximize performance.

In terms of the memory hierarchy, CUDA threads may access data from multiple memory spaces during execution, as shown in Figure 1.7. First, each thread has fast, readable and writable local memory. Each thread block has shared memory that is visible to all threads within the block, which is also readable and writable with medium speed. Moreover, all threads have access to the large, readable and writable but comparatively slow global memory. Two additional read-only memory spaces that are accessible by all threads (and are not displayed in Figure 1.7) are constant and texture memory. Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats. The global, constant, and texture memory are persistent across kernel launches for the same application.

GPU programming, unlike CPU parallel programming, has several basic principles and tricks to maximize performance that are unique to the architecture of the GPU. For example, in CUDA programming, the best performance is obtained when the number of threads in one block is a multiple of 32. Every 32 threads are bundled as a warp, and threads in a warp execute the same instructions; branches can therefore cause the code to execute differently than it would sequentially, a phenomenon termed thread divergence. This divergence results in performance degradation and should be avoided in order to maximize parallel performance. Moreover, coalescing memory access ensures that consecutive threads access consecutive memory addresses so that memory requests can be handled simultaneously, and multiple active threads can hide memory latency by overlapping their computations. We will discuss these tuning procedures and show how to take full advantage of the GPU architecture later in this document.

Figure 1.6. Illustration of the CUDA thread hierarchies. One-, two- or three-dimensional threads are bundled as a block, while one-, two- or three-dimensional blocks form a grid. Two-dimensional threads and two-dimensional blocks are presented.

Figure 1.7. CUDA memory hierarchies. Local memory, shared memory and global memory are presented, while the texture and constant memory are not presented but are described in the text.
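The minimal kernel below is a hedged illustration of the ideas summarized in this section - thread and block identification through threadIdx and blockIdx, block sizes that are multiples of 32, and coalesced global-memory access. It is a generic example, not code from our ERI implementation.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: a two-dimensional grid of two-dimensional blocks adds
// two row-major matrices. Consecutive threads in a warp (varying threadIdx.x)
// touch consecutive addresses, so global-memory accesses coalesce.
__global__ void add_matrices(const double* a, const double* b, double* c,
                             int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // fastest-varying index
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) {
        int idx = row * cols + col;     // adjacent threads -> adjacent doubles
        c[idx] = a[idx] + b[idx];
    }
}

void launch_add(const double* a, const double* b, double* c, int rows, int cols) {
    dim3 block(32, 8);                  // 256 threads per block, a multiple of 32
    dim3 grid((cols + block.x - 1) / block.x,
              (rows + block.y - 1) / block.y);
    add_matrices<<<grid, block>>>(a, b, c, rows, cols);
}
```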
1.2. GPU APPLICATIONS IN QUANTUM CHEMISTRY

Quantum chemistry and solid-state physics codes implement relatively complex algorithms 13. The challenge in using GPUs and other parallel platforms lies in adapting these complex algorithms to take advantage of the specialized hardware. For example, CUDA provides a swift way to utilize the power of GPUs by incorporating a CUDA library such as CUBLAS 3 or CUFFT 2, even without coding directly in CUDA; however, to maximize performance, careful consideration of the memory hierarchy, thoughtful design of the parallelism, and careful organization of memory access in order to minimize memory access latency 14 are required. When using single-precision GPUs, for instance, numerical accuracy is a central issue because it is often insufficient to match the accuracy of the underlying theoretical model. For example, using single-precision computation in ab initio calculations can introduce errors as large as 1 kcal mol-1 9; however, for some other applications, such as GPU-WHAM, described in Chapter 5, an error of only 0.2 kcal mol-1 is realized using single precision, while the resultant code is 2-8 times faster than double precision depending on the GPU device chosen. It is important to realize that many of these considerations are important not only for GPU programming but also for other platforms, including standard CPU clusters. Thus, many of the techniques employed to improve the parallel efficiency of quantum chemistry codes are also useful in GPU programming.

Hartree-Fock (HF) derived wavefunctions are the starting point for ab initio electron correlation methods, while Kohn-Sham Density Functional Theory (KS-DFT) 15,16 is usually used to calculate electronic ground states and their properties in chemistry due to its outstanding balance between accuracy and computing time. However, both KS-DFT and HF have two bottlenecks: 1) evaluation of the HF or DFT matrix elements, and 2) solving the SCF equations. The latter relies on diagonalization of the Fock or KS matrix, which eventually dominates the computational cost because of its high-order scaling (N^3), but GPU approaches to this bottleneck have not been explored extensively. It could potentially be addressed via alternative electronic structure methods 17 or linear algebra libraries 18. The former bottleneck is the major focus for quantum chemistry GPU software developers, and it is dominated by the evaluation of the two-electron repulsion integrals (ERIs). ERIs are required by HF and by the exchange-correlation (XC) contribution of DFT. Much work has been reported on the acceleration of ERI evaluation, and these efforts are summarized in Table 1.1, including our own work.

Table 1.1. Summary of Capability and Performance of GPU-based HF and DFT Implementations

Authors                    | Software | lmax | ERIs | J   | K   | XC  | Gradient | Speedup(a)
Ufimtsev and Martinez 9-11 | TeraChem | d    | Yes  | Yes | Yes | Yes | Yes      | 100-1000
Yasuda 8                   |          | p    | Yes  | Yes | No  | No  | No       | 10
Asadchev et al. 19,31      | GAMESS   | g    | Yes  | Yes | Yes | Yes | No       | 17.5
Miao and Merz              | QUICK    | f    | Yes  | Yes | Yes | Yes | Yes(b)   | 10-100

a. Compared to a single-core CPU. b. Only supports up to d orbitals.
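To illustrate how the linear-algebra portion of an SCF calculation can be offloaded through vendor libraries rather than hand-written kernels, the fragment below calls cuBLAS to form a closed-shell density-matrix-like product P = 2 C_occ C_occ^T on the GPU. This is a minimal sketch under the assumption that the coefficient matrix already resides in device memory; the function name and the surrounding SCF driver are hypothetical and are not taken from any of the codes in Table 1.1.

```cuda
#include <cublas_v2.h>

// Sketch: offload one dense product of the SCF procedure to cuBLAS.
// d_C is an n x nocc matrix of occupied MO coefficients (column-major),
// already resident on the device; d_P receives P = 2 * C_occ * C_occ^T.
void form_density_gpu(cublasHandle_t handle, const double* d_C, double* d_P,
                      int n, int nocc) {
    const double alpha = 2.0;   // closed-shell occupation factor
    const double beta  = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                n, n, nocc,
                &alpha, d_C, n,
                        d_C, n,
                &beta,  d_P, n);
}
```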
Yasuda 8 was the pioneer in porting ERI calculations to GPUs in 2008. In his work, he addressed the major problems for ERI evaluation on GPUs and presented results for the Coulomb contribution to the KS matrix and the HF matrix with s and p type basis functions. Even though the algorithm is not the most efficient, because of the low-memory Rys quadrature scheme 20 he chose, his program maximized the load balance of the GPU's SMs. At that moment in time, GPUs only supported single precision, so a mixed-precision scheme combining CPUs and GPUs was introduced: large ERIs, prescreened by the Schwarz upper bound, were calculated on the CPU in double precision, while relatively small ERIs were evaluated on the GPU in single precision. This algorithm resulted in accurate DFT and SCF energies, with errors within 10^-3 kcal mol-1, while a full-GPU calculation produced an error of 1 kcal mol-1. The contribution to the Coulomb matrix was directly computed from uncontracted ERIs to avoid one of the major bottlenecks of GPU computing, namely data transfer between the CPU and GPU. If all ERIs are handled by the GPU, the resultant speedup using NVIDIA's GeForce 8800 GTX is around one order of magnitude for the formation of the Coulomb matrix for a system of over one hundred atoms (such as valinomycin, 168 atoms) with the 6-31G basis set when compared to a single-core CPU. However, if mixed precision was employed, the speedup dropped to three-fold. Yasuda's method only supported the Coulomb matrix and low-angular-momentum functions, and precision issues due to the hardware available at the time limited what he could do. However, he opened the door for researchers to explore the potential of GPUs for the acceleration of ERI calculation.
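The mixed-precision partitioning used by Yasuda can be summarized with the Schwarz inequality, |[pq|rs]| <= sqrt([pq|pq]) sqrt([rs|rs]): the product of two precomputed shell-pair bounds decides whether a batch of ERIs is neglected, evaluated on the GPU in single precision, or kept on the CPU in double precision. The routine below is only a schematic host-side sketch of that decision; the thresholds and names are illustrative and do not reproduce Yasuda's actual implementation.

```cuda
#include <cmath>

// Schematic classification of a contracted ERI batch by its Schwarz bound.
// q_pq = sqrt((pq|pq)) and q_rs = sqrt((rs|rs)) are precomputed per shell pair.
enum class EriRoute { Skip, GpuSinglePrecision, CpuDoublePrecision };

EriRoute classify_batch(double q_pq, double q_rs,
                        double neglect_cutoff   = 1.0e-10,
                        double precision_cutoff = 1.0e-4) {
    double bound = q_pq * q_rs;   // upper bound on every ERI in the batch
    if (bound < neglect_cutoff)   return EriRoute::Skip;                // negligible
    if (bound < precision_cutoff) return EriRoute::GpuSinglePrecision;  // SP is safe
    return EriRoute::CpuDoublePrecision;                                // keep full precision
}
```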
Ufimtsev and Martinez 9-11 later published a series of papers developing CUDA kernels for ERI evaluation and Fock matrix assembly. They wrapped and commercialized their code into a software package called TeraChem, and GPU computing is the biggest selling point of this program. Their early papers supported s and p type basis functions and gradients, and support for d type functions and their gradients was reported later 29. Both the HF and KS-DFT methods are able to execute on a GPU, and the code was coupled with AMBER for ab initio QM/MM calculations on small systems. They implemented the McMurchie-Davidson 21 scheme for its relatively small footprint throughout ERI evaluation, resulting in low memory requirements, similar to the Rys quadrature scheme. Three different mappings were tested, and the results showed that the optimal strategy is to calculate each primitive ERI batch on one thread. In order to maximize the load balance and reduce thread divergence, a pre-sorting strategy was introduced, which treated the N^4 integrals as an N^2 x N^2 matrix that was sorted along the N^2 dimension using different criteria. As in Yasuda's work, the Fock matrix elements are directly computed on the GPU. HF calculations with the 3-21G and 6-31G basis sets running on an NVIDIA GTX 280 card realized more than a 100-fold speedup compared with the quantum chemistry program package GAMESS. For small and medium-size molecules, the Fock matrix formation is the most time-consuming part, but for large molecules, such as Olestra (453 atoms), the linear algebra required for the SCF solution emerged as a bottleneck, since the Fock matrix formation time on the GPU is close to that of the linear algebra part. This algorithm places the s and p type functions in small integral blocks that can be treated entirely in shared memory to achieve high performance; however, this situation changes for basis functions with higher angular momentum quantum numbers, such as d-type functions. Moreover, the Rys quadrature used by GAMESS in their comparisons is not an optimal ERI evaluation algorithm; much more efficient algorithms do exist, against which a less favorable GPU speedup should be expected. Moreover, the speedup observed is mostly based on the comparison between single-precision GPU and double-precision CPU results, so the errors in the SCF energies quickly exceeded 10^-3 au for mid-size molecules. For larger molecules, computation of the larger ERIs in DP is required. Similar to Yasuda's procedure, large ERIs were computed in double precision, while smaller ones were carried out in SP. This mixed-precision 22 model is a compromise between computational speed and accuracy, providing a 2-4-fold speedup over full double-precision computing while reducing the error to about 10^-6 au on Tesla C1060 and C2050 cards.

TeraChem also implemented the calculation of the analytical HF energy gradient with s and p type basis functions 9-11, and later supported d type basis functions 29. Using the 3-21G basis set, GPUs achieved speedups of 6-100 for small to larger molecules on NVIDIA 295 cards. Using the mixed-precision model described above, the RMS (root mean square) error in the forces is about 10^-5 au, which is close to the typical convergence threshold for geometry optimizations. With the aid of energy and gradient calculations on GPUs, small ab initio molecular dynamics simulations became feasible. For example, an MD simulation of the H3O+(H2O)30 cluster with the 6-31G basis set in the microcanonical ensemble with a time step of 0.5 fs was realized. An energy drift of 0.022 kcal mol-1 ps-1 was observed over a simulation time of 20 ps. It is worth mentioning that TeraChem can run over multiple GPUs with a parallel efficiency of over 60% on three NVIDIA GeForce 8800 GTX cards.

To support d orbitals for both the energy and the gradient 29, as mentioned above, a meta-programming strategy that leverages a computer algebra system to generate correct and efficient code was employed. Since the capability of a compiler to identify the best code transformations and permutations of data accesses is limited by the need to carry out source-code-to-machine-code translation efficiently, full code optimization needs to be dealt with at the source code level. Therefore, their code generator not only translates the mathematical formalism of the Coulomb and exchange integral computation into a modern programming language such as Fortran or C, but also simplifies the algebraic expressions by factoring out and eliminating common sub-expressions and by dividing expressions into groups of intermediates. By using meta-programming, not only were d-type basis sets supported, but the s- and p-type code was rewritten and further optimized. Moreover, the resultant CPU code also took advantage of the meta-programming optimization. Similar to previous publications, the realized speedup on GPUs involving d-type basis functions was a factor of 10-100 using an NVIDIA GTX 580 card, with an error of 10^-3 to 10^-2 kcal mol-1 when compared to the corresponding CPU calculations.

Asadchev et al. presented algorithms and CUDA implementations for uncontracted ERI evaluation up to g-type functions based on the GAMESS package 19,31. The Rys quadrature was chosen for its low memory requirement and its efficiency for higher-order angular momentum. The Rys quadrature has complex memory access patterns, especially for high angular momentum.
(ff|ff), for example, requires 5376 FLOPS for intermediate quantities, which are reused, and 10^4 FLOPS for the final ERIs. With DP, the memory requirement is then larger than the fast on-chip memory, so that slow global memory access is mandatory. Therefore, the authors re-arranged the parallel calculation of the ERIs to minimize memory loads. The code also processes the ERIs, for example in the Fock matrix formation, on the GPU device to avoid CPU-GPU communication. Moreover, similar to TeraChem, the complex code must be machine-generated to ensure correctness and efficiency. The authors therefore selected the Python Cheetah template engine 30 to adopt a template-based approach for automated code generation. As was seen for the TeraChem implementation, machine-generated code also boosts the performance of the CPU code. The improved CPU and GPU code was tested on NVIDIA GTX 275 and Tesla T10 cards and compared with the performance of ERI evaluation with the original Rys quadrature implemented in GAMESS. More than a 30% improvement was gained using the new C++ Rys quadrature code on the CPU, even though some of the test cases showed that other ERI evaluation algorithms in GAMESS outperformed this approach. The GPU implementation, on the other hand, showed a speedup of up to 17.5-fold when compared with the new C++ Rys quadrature code on a CPU in double precision. Their code can execute in parallel not only on multiple CPUs, but also on a multi-CPU, single-GPU hybrid platform.

REFERENCES

(1) NVIDIA. Compute Unified Device Architecture (CUDA). http://docs.nvidia.com/cuda/index.html (accessed July, 2014).
(2) NVIDIA. The NVIDIA CUDA Fast Fourier Transform library. http://developer.nvidia.com/cufft (accessed July, 2014).
(3) NVIDIA. The NVIDIA CUDA Basic Linear Algebra Subroutines (cuBLAS) library. http://developer.nvidia.com/cublas (accessed July, 2014).
(4) Götz, A. W.; Wölfle, T.; Walker, R. C. In Annual Reports in Computational Chemistry; Ralph, A. W., Ed.; 2010; Vol. 6, pp 21-35.
(5) Case, D. A.; Darden, T. A.; Cheatham, T. E., III; Simmerling, C. L.; Wang, J.; Duke, R. E.; Luo, R.; Walker, R. C.; Zhang, W.; Merz, K. M.; Roberts, B. P.; Wang, B.; Hayik, S.; Roitberg, A.; Seabra, G.; Kolossváry, I.; Wong, K. F.; Paesani, F.; Vanicek, J.; Liu, J.; Wu, X.; Brozell, S. R.; Steinbrecher, T.; Gohlke, H.; Cai, Q.; Ye, X.; Wang, J.; Hsieh, M.-J.; Cui, G.; Roe, D. R.; Mathews, D. H.; Seetin, M. G.; Sagui, C.; Babin, V.; Luchko, T.; Gusarov, S.; Kovalenko, A.; Kollman, P. A. AMBER 11; University of California, San Francisco, 2010.
(6) Case, D. A.; Cheatham, T. E.; Darden, T.; Gohlke, H.; Luo, R.; Merz, K. M.; Onufriev, A.; Simmerling, C.; Wang, B.; Woods, R. J. J. Comput. Chem. 2005, 26, 1668-1688.
(7) Anderson, J. A.; Lorenz, C. D.; Travesset, A. J. Comput. Phys. 2008, 227, 5342-5359.
(8) Yasuda, K. J. Comput. Chem. 2008, 29, 334-342.
(9) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2008, 4, 222-231.
(10) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2009, 5, 1004-1015.
(11) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2009, 5, 2619-2628.
(12) Vogt, L.; Olivares-Amaya, R.; Kermes, S.; Shao, Y.; Amador-Bedolla, C.; Aspuru-Guzik, A. J. Phys. Chem. A 2008, 112, 2049-2057.
(13) Helgaker, T.; Jørgensen, P.; Olsen, J. Molecular Electronic-Structure Theory; Wiley: West Sussex, England, 2000.
(14) Kirk, D. B.; Hwu, W. W. Programming Massively Parallel Processors; Morgan Kaufmann Publishers: Burlington, MA, 2010.
(15) Kohn, W.; Sham, L. Phys. Rev. 1965, 140, A1133-8.
(16) Parr, R. G.; Yang, W.
Density-Functional Theory of Atoms and Molecules; Oxford University Press: Oxford, 1989.
(17) Salek, P.; Høst, S.; Thøgersen, L.; Jørgensen, P.; Manninen, P.; Olsen, J.; Jansik, B. J. Chem. Phys. 2007, 126, 114110.
(18) NVIDIA. The NVIDIA CUDA Sparse Matrix (cuSPARSE) library. https://developer.nvidia.com/cuSPARSE (accessed July, 2014).
(19) Asadchev, A.; Allada, V.; Felder, J.; Bode, B. M.; Gordon, M. S.; Windus, T. L. J. Chem. Theory Comput. 2010, 6, 696-704.
(20) Rys, J.; Dupuis, M.; King, H. F. J. Comput. Chem. 1983, 4, 154-157.
(21) McMurchie, L. E.; Davidson, E. R. J. Comput. Phys. 1978, 26, 218-231.
(22) Luehr, N.; Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2011, 7, 949-954.
(23) Kulik, H. J.; Martinez, T. J. Abstr. Pap. Am. Chem. Soc. 2012, 244.
(24) Kulik, H. J.; Luehr, N.; Ufimtsev, I. S.; Martinez, T. J. J. Phys. Chem. B 2012, 116, 12501-12509.
(25) Isborn, C. M.; Götz, A. W.; Clark, M. A.; Walker, R. C.; Martinez, T. J. J. Chem. Theory Comput. 2012, 8, 5092-5106.
(26) Ufimtsev, I. S.; Luehr, N.; Titov, A.; Martinez, T. Abstr. Pap. Am. Chem. Soc. 2011, 242.
(27) Ufimtsev, I. S.; Luehr, N.; Martinez, T. J. J. Phys. Chem. Lett. 2011, 2, 1789-1793.
(28) Kulik, H. J.; Isborn, C. M.; Luehr, N.; Ufimtsev, I.; Martinez, T. J. Abstr. Pap. Am. Chem. Soc. 2011, 242.
(29) Titov, A. V.; Ufimtsev, I. S.; Luehr, N.; Martinez, T. J. J. Chem. Theory Comput. 2013, 9, 213-221.
(30) Cheetah - the Python-Powered template engine. http://www.cheetahtemplate.org/ (accessed Jan, 2015).
(31) Asadchev, A.; Gordon, M. S. J. Chem. Theory Comput. 2012, 8, 4166-4176.

CHAPTER 2. PROJECT OBJECTIVES

The main objectives of this work were:

A. Develop a novel algorithm to execute ab initio quantum chemistry methods, including energy calculations using the Hartree-Fock method and Density Functional Theory, on GPUs using the CUDA platform. As described above, the peak FLOPS count and bandwidth of GPUs are about 10x those of a CPU, so we expected that a significant speedup could be realized. For most of these calculations, s, p and d orbitals are sufficient for the second-row elements that are essential in organic chemistry and biochemistry; hence, support for s, p and d orbital calculations was the initial focus.

B. Develop novel algorithms to execute the gradient calculation of ab initio quantum chemistry methods, including the HF and DFT methods. Support for s, p and d orbitals is the target, but to support the d orbital gradient, we needed f orbital energies as well. Hence, we also aimed to support f orbital energy calculations.

C. Integrate the above-mentioned main features into our quantum chemistry software package, QUICK, and further GPU-ize the software by introducing CUDA library calls into the linear algebra calculations to further speed up HF and DFT calculations.

D. Benchmark the software for a series of representative molecules and calculations, including energy, gradient, and geometry optimization calculations, on GPUs and CPUs, and then compare and discuss the results.

E. Develop a GPU-based version of WHAM (the Weighted Histogram Analysis Method), a molecular dynamics trajectory analysis program. One-dimensional WHAM is very fast, but multi-dimensional WHAM is time-consuming for some real tasks in our group; therefore, a GPU-ized WHAM accelerated our research.

CHAPTER 3. GPU ACCELERATION ON ERI EVALUATION

3.1. INTRODUCTION

Quantum theory has been utilized in many roles, including interpreting chemical phenomena and predicting new molecules with novel functions.
To achieve this, computational and theoretical chemists constantly make compromises between accuracy and computational expense. However, we are witnessing a new era in computational quantum chemistry, sparked by an interest in harnessing the capabilities of heterogeneous computing, especially modern graphics processing units (GPUs), which afford impressive price-versus-performance characteristics. Early GPUs were not widely accepted by the computational chemistry community because of limited precision and programming difficulties. Recently, though, these difficulties have been largely remedied by the development of the latest generation of GPU cards from NVIDIA, which support up to 64-bit floating-point arithmetic, and through the introduction of the Compute Unified Device Architecture (CUDA) 1, which is a simple interface extension based on the standard C/C++ language. So in this chapter, we will focus on the acceleration of ERI evaluation using GPUs so that the calculation of the HF energy can be sped up.

This chapter is organized as follows. First, we briefly describe the ERI evaluation algorithm employed, which uses vertical and horizontal recurrence relations and is one of the most efficient methods for ERI evaluation. Next, we describe a general approach for GPU evaluation of ERIs using recurrence relations, and describe the details regarding Fock matrix assembly in our direct SCF implementation. To study the efficiency of our direct SCF scheme, we compare it to a conventional SCF implementation on a GPU. In the next section, we provide more detailed benchmarks in order to compare our GPU and CPU implementations. With double-precision ERI evaluation, the accuracy of the SCF calculation is 10^-7 a.u. or better for moderate systems with thousands of basis functions, suitable for most applications. We have applied these ideas to a series of typical proteins (up to 4000 basis functions) with high-angular-momentum Gaussian basis sets and show performance increases of up to 100-fold, with more typical speedups of approximately 10 to 20-fold. Memory usage is also reported in this section, where we show that systems with up to ~10000 basis functions are feasible with a typical GPU. Finally, we conclude the chapter with a brief discussion and conclusions.

3.2. ELECTRON REPULSION INTEGRALS

In the computational chemistry field, the Hartree-Fock (HF) method is one of the most popular ab initio quantum chemistry methods; it is based on an approximation for the determination of the wave function and energy of a stationary state. By assuming that the N-body wave function of a system can be approximately expressed by a single Slater determinant, the HF method invokes the variational principle to minimize the N-body HF energy so that it approaches the true energy. The solution, yielding the HF wave function and energy, can be derived from a set of equations, called the HF equations, as an approximate solution of the Schrödinger equation, and the solution of the HF equations is required to be a self-consistent mean charge field.
For most modern DFT method, iteratively solution is necessary based on Kohn -Sham equation in a self-consistent fashion. For ab initio including HF and DFT methods, one of the major time -consuming steps is the evaluation of a large number of two -electron repulsion integrals that have the form: !"!"!!!!!!!!!!!!!!!!!!!!!!!!! ( 1 ) where the !! are one -electron basis functions and r refers to the coordinates of the electron. In practice, linearly combined contracted Gaussian functions made up of primitive atom -centered Cartesian Gaussian functions of the form: !!!!!!!!!!!!!! ( 2 ) are used to represent the one -electron basis functions. The contribution of the primitive p to the contracted function is denoted by the coefficient !!!". An unnormalized primitive Cartesian Gaussian function centere d at !!!!!!!!!!!! with exponent ! is given by !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! ( 3 ) where 32 !!!!!!!!!!!! ( 4 ) and p is a set of integers indicating angular momentum and the direction of the Gaussian function. Therefore, the contracted two -electron integral is constructed from integrals over primitive functions and coefficients. A primitive ERI will be denoted by [pq|rs], and a contracted ERI will be distinguished fro m a primitive ERI by: !"!"!!!"!!"!!"!!"!!"!!"!!"#$ ( 5 ) ERIs possess eight -fold symmetry ( !"!"!!"!"!!"!", etc.), meaning one can be computed and used for the remainder, providing a reduction in computational effort by roughly 6 -fold. It would be more efficient to compute all of the primitive ERIs involving four shells for which ! has the same !!. For example, 3 [ps|ss] type integrals can be computed at the same time where !!!!!!!!!!!!!!and !!!!!. The ERI bottleneck formally has N 4 scaling, but many approaches have been devised to reduce scaling to N <3, however, even with these algorithms, ERI computation can still consume the majority of the computational time (typically 80 -90% of the overall CPU time for systems with less than a thousand basis functions). Once contracted ERIs are generated as given in equation (1), it is straightforward to form the Fock matri x with !!"!!!"!"#$ !!!"!!"!"!!!!"!"!"!!!!"!"#$ !!!"!!!!!" ( 6 ) where P is density matrix and J and K are the Coulomb and exchange matrices respectively. 3.2.1. RECURRENCE RELATIONS FOR ERI EVALUATION The primitive integrals can be evaluated in many ways and most of them are based on BoysÕ seminal work 13. Recurrence relation is one of the most efficient and most widely 33 used methods. Obara and Saika(OS) 14 first derived a recurrence relation for ERI evaluation which related a given ERI to other integr als. Head -Gordon and Pople (HGP) 15 optimized this method by storing some common ERIs to reduce floating point operation counts. In their derivation, the verti cal recurrence relation (VRR) is given as: !!!!!!"!!!!!!!!"!"!!!!!!!!"!"!!! !!!!!!!!!!!"!!!!!!!!!!!!"!!! !!!!!!!!!!!"!!!!!!!!!!!!"!!! !!!!!!!!!!"!!!!!!!!!!!!!!!!!!"!!!!!!!! ( 7 ) where i is x, y or z, and !!!!!!"!!!"!!!"!, ( 8 ) !!!!!, !!!!!, ( 9 ) and !!!!!!! are exponents of a, b, c and d respectively. !!!!!!!!!!!!!,!!!!!!!!!!!!!!,!!!!!!!!!!!!!!. ( 10 ) The superscript index (m) in eq (7) is an auxiliary index, with m=0 yielding true ERIs. The VRR shows that primitive ERIs of higher angular momentum are linear combinations of lower angular momentum ERIs ultimately depending on s type ERIs. The quantities requiring evaluation are [ss|ss] (0) to [ss|ss] (n) , where the max n value is: !!!!!!!!!!!!!!!!!!! 
We can evaluate [ss|ss]^{(m)} analytically:

[ss|ss]^{(m)} = \frac{K_{AB}\, K_{CD}}{\sqrt{\zeta+\eta}}\, F_m(T)   (12)

where

T = \rho\, |\mathbf{P} - \mathbf{Q}|^2,   (13)

F_m(T) = \int_0^1 t^{2m}\, e^{-T t^2}\, dt,   (14)

K_{AB} = \frac{\sqrt{2}\,\pi^{5/4}}{\zeta}\, \exp\!\left(-\frac{\alpha\beta}{\zeta}|\mathbf{A}-\mathbf{B}|^2\right),   (15)

and K_CD is analogous to K_AB. HGP noted further that

[a(b+1_i)|cd]^{(m)} = [(a+1_i)b|cd]^{(m)} + (A_i - B_i)[ab|cd]^{(m)}.   (16)

This equation relates one integral to another of the same total angular momentum but with angular momentum shifted from the first position to the second, and is termed a horizontal recurrence relation (HRR). Since this relation does not involve the exponents, it may be applied directly to contracted ERIs:

(a(b+1_i)|cd) = ((a+1_i)b|cd) + (A_i - B_i)(ab|cd).   (17)

For example, to evaluate (ab|cd), integrals from (a0|c0) to ((a+b)0|(c+d)0) are constructed using the VRR and then (ab|cd) is assembled using the HRR. This algorithm greatly reduces floating-point operation counts, especially for basis functions with high angular momentum. In practice, intermediate integrals with auxiliary index m = 0 are reused in other ERI evaluations and should be stored in memory temporarily for higher efficiency. For double-zeta Gaussian basis sets, hybrid functions such as sp shells are used; to treat this type of function, primitive ERIs from [a0|c0] to [(a+b)0|(c+d)0] can be stored instead of contracted ERIs, because the s and p functions share the same primitive exponents. This treatment speeds up the calculation but requires large memory resources in exchange. Because of their different architectures, the required memory is attainable for CPUs but is too large to handle efficiently on GPUs. We will return to this issue in the coming sections.

3.3. IMPLEMENTATION

3.3.1. COMPUTE UNIFIED DEVICE ARCHITECTURE

GPU technology, typically used to handle computation for computer graphics and video gaming, has been adapted to perform computations for applications that are traditionally handled by CPUs. This is termed general-purpose computing on graphics processing units, or GPGPU. The GPU is a good example of a massively parallel stream-processing architecture that uses the single-instruction multiple-data (SIMD) model. Currently, as described above, the most widely used language environment for GPGPU technology is the NVIDIA Compute Unified Device Architecture (CUDA). CUDA is a programming model for graphics as well as general-purpose computation that uses a relatively simple extension of the standard C language to develop scalable and efficient parallel programs. The CUDA device architecture has a scalable array of streaming multiprocessors, each of which consists of 8 scalar processors. Note that all technical specifications referred to in this paragraph are for CUDA 2.x unless otherwise noted. With this design, GPUs are especially well suited to compute-intensive, highly parallel computations. As introduced in chapter 1, within the CUDA framework a batch of threads is hierarchically arranged into a one-, two- or three-dimensional grid of up to 65535 blocks, and each block further consists of a one-, two-, or three-dimensional arrangement of threads (shown in Figure 1.6). The number of threads in a block cannot exceed 512 and must be specified explicitly in the code. The CUDA framework introduces the built-in variables threadIdx and blockIdx to identify a thread within a block and a block within the grid. The best performance is obtained when the number of threads in one block is a multiple of 32.
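To make the grid/block/thread indexing concrete, the following minimal CUDA sketch (illustrative only, not taken from our ERI code; the kernel name and its arguments are assumed for the example) shows how a global thread index is typically derived from blockIdx and threadIdx, with a block size chosen as a multiple of 32:

// Illustrative sketch only: one thread handles one element of an array.
__global__ void scaleArray(double *data, double factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (idx < n)
        data[idx] *= factor;                           // each thread touches one element
}

// Host-side launch with a block size that is a multiple of the 32-thread warp:
//   int threadsPerBlock = 128;
//   int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
//   scaleArray<<<blocks, threadsPerBlock>>>(d_data, 2.0, n);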
Threads are bundled into warps of 32, and all threads in a warp execute the same instruction, which can result in divergence when branches occur. Divergence results in performance degradation and should be avoided in order to maximize parallel performance, an issue that will be touched on later. The thread scheduler switches among active warps to balance the load so that overall performance is maximized. Although only one thread block can be executed at any given time on an SM (two for the Fermi architecture), multiple thread blocks can be active.

Regarding the memory hierarchy (as shown in Figure 1.7), each thread has access to its local registers on the processor, and data shared within a block is visible to all threads of that block via a parallel data cache with moderate latency but limited capacity. Furthermore, all threads have access to the relatively large global memory (also known as dynamic random-access memory, DRAM), but with high access latencies. GPUs also provide fast but limited, read-only constant and texture memories. The global, constant, and texture memory spaces are persistent across kernel launches by the same application. Because of the architecture of the CUDA platform, main memory has high latency, on the order of hundreds of cycles. Therefore, in order to achieve high bandwidth, coalesced memory access to DRAM is recommended: coalescing ensures that consecutive threads access consecutive memory addresses so that memory requests can be served simultaneously. Multiple active threads can also hide memory latency by overlapping their computations.

When a GPU kernel function is called in the CUDA implementation, the CPU waits until the kernel function completes and returns. Most GPU kernel functions access only GPU memory, so the CPU must copy the required data from CPU memory to GPU memory. Because of the relatively slow 2.0 GB/s transfer speed between the CPU and GPU memories, it is important to avoid large amounts of CPU-GPU data transfer; we discuss a scheme to reduce such transfers below. In addition, subroutine libraries are available that provide common solutions for problems in quantum chemistry and solid-state physics, such as the Fourier transform (CUFFT) 16 and linear algebra (CUBLAS) 17 operations that are used in our implementation. One of the most notable accelerations these libraries afford is in matrix multiplication, providing a six- to ten-fold speedup for large GEMMs over the Intel MKL library.

3.3.2. RELATED WORK

As mentioned in the introduction chapter, Yasuda 6 was the first to evaluate ERIs on GPUs, for s and p functions in single precision. He chose the Rys quadrature scheme 18 for its low memory requirements even though it is not the most efficient algorithm for low angular momentum quantum numbers. A mixed-precision (MP) scheme was introduced to calculate the largest ERIs in double precision (DP) on the CPU and the remainder in single precision (SP) on the GPU. This scheme led to SCF energy errors of the order of 10^-3 a.u. If all ERIs are computed on the GPU, speedups of about one order of magnitude have been realized for molecules as large as valinomycin with the 6-31G basis set (168 atoms, 882 basis functions). Asadchev 19,20 further adopted the Rys quadrature scheme on GPUs for ERI evaluation in GAMESS 21 and extended it up to g functions.
Their implementation on NVIDIA Tesla T10 cards showed speedups of around 25 times for DP and 50 times for SP compared with a single CPU; however, no timings were given for data transfer between the GPU DRAM and CPU main memory, which may take several times longer than the actual execution time of the ERI kernel. Ufimtsev and Martinez 7-9 have developed a CUDA-based program for ERI evaluation and Fock matrix formation involving s and p basis functions on GPUs, using the McMurchie-Davidson 22 scheme for its relatively few intermediate steps per integral and its memory requirements, which are similar to those of the Rys quadrature 18. HF SCF calculations with the 3-21G and 6-31G basis sets using their implementation on an NVIDIA GTX 280 card 7-9 were, in some cases, more than 100 times faster than the quantum chemistry software GAMESS 21 on a CPU. However, most of their comparisons were between single-precision GPU results and double-precision CPU results, and the SCF energy error observed with their code quickly exceeded 10^-3 a.u. Later, a mixed SP/DP approach was described as a compromise between accuracy and speed 23.

3.3.3. IMPLEMENTATION DESIGN

In our work, in order to implement ERI evaluation on the GPU, code was created using our code generator to express the recurrence relation given in eq (7). The current GPU code uses registers to store the auxiliary integrals and automatically flags them as available once they are no longer needed in subsequent steps, increasing register usage efficiency by keeping as many registers busy as possible. As discussed above, GPU architecture is very different from that of a CPU, and one of the most important concerns is thread divergence; that is, every thread in a warp executes the same instruction stream, so threads may sit idle when they do not take the same branch or conditional statement as the other threads in the warp. It is therefore necessary to classify ERIs by type and subsequently reorder them so that the threads in one warp are likely to work on the same ERI type or loop count (or all of them bypass ERI evaluation because of the Schwarz upper-bound cutoff 24,25).

We treat the ERIs in the direct SCF procedure as an N^2 * N^2 integral matrix problem with row and column indices given by half ERIs in the form of a bra (ab| and a ket |cd), where an ERI corresponds to an element of this matrix. Evaluation of (ab|cd) can be skipped if index a is greater than index c, due to ERI symmetry. We find that by sorting the bra and ket indices (including tiebreakers) a well-ordered integral grid can be obtained that leads to generally optimal performance for most systems. Figure 3.1 illustrates our approach using a cluster of 4 water molecules employing the cc-pVDZ 26 basis set with 100 basis functions; coloration denotes the magnitude of the estimated ERI value. First, half ERIs are sorted by angular momentum type: the highest angular momentum in this system with the cc-pVDZ basis set is a d orbital, so there are nine combinations, as shown in Figure 3.1. We identify the angular momentum criterion as the most important. Second, half ERIs are sorted by the number of primitive Gaussian functions. The cc-pVDZ basis set uses a maximum of seven primitive Gaussian functions for oxygen to construct the contracted basis set.
Contracted ERIs of type (ss|ss), for example, require anywhere from 2401 down to 1 primitive ERIs [ss|ss] (from 7^4 to 1^4), so without sorting it is quite possible that one thread takes 2401 loop iterations while others in the warp take significantly fewer and wait in an idle state. Therefore, sorting by the number of primitive Gaussian functions within the same ERI type greatly improves thread utilization. Correlation-consistent basis sets benefit most from this sorting criterion, as our benchmark results show below. The third sorting criterion is the Schwarz cutoff, or upper bound. The size of an individual ERI can be estimated using the Cauchy-Schwarz inequality 24,25,

|(ab|cd)| \le (ab|ab)^{1/2}\,(cd|cd)^{1/2}.   (18)

With sorting on this cutoff bound, sparsely populated regions are gathered into certain areas of the grid, so that the threads of a warp responsible for such an area can skip ERI evaluation simultaneously and become available for future computation. These three sorting orders create a well-balanced grid, as shown in Figure 3.1, and guarantee that threads are kept busy at or near maximum usage.

Most streaming-type architectures, such as CUDA, do not provide efficient tools for inter-thread data communication except for threads that are in the same block. We find that the most efficient mapping strategy is to have one thread work on one contracted ERI. This mapping strategy is efficient thanks to our well-sorted scheduling combined with thread-divergence avoidance, which keeps threads under dense computation. The variables in eqs. (9), (10) and (15) (\zeta = \alpha + \beta, \mathbf{P} = (\alpha\mathbf{A} + \beta\mathbf{B})/\zeta and K_{AB}) are all pre-calculated and stored in GPU memory. They occupy most of the DRAM on the GPU, with a usage scaling of O(N^2), trading memory usage for speed. With these pre-calculated values, both the DRAM access count and the floating-point operation count are reduced significantly.

As in the CPU implementation, once a thread generates a series of integrals, two strategies are available. One is to store the integrals to disk so that they are not recalculated in later iterations, which is traditionally called conventional SCF. The drawback of this strategy is the slow CPU-GPU data-transfer speed and its steep scaling: it has been shown for large systems with a huge number of integrals that the data-transfer time is greater than the GPU kernel time 7, and a similar validation will be shown here. The other algorithm is direct SCF, which consumes calculated integrals immediately and recalculates them on the fly. It is clearly desirable to implement the SCF entirely on the GPU. In our implementation, once an integral is available, it is used to assemble the Fock matrix. Traditionally, a matrix-element-to-block mapping is adopted for Fock matrix formation; however, that implementation produces 2-8 times more redundant integrals, leading to low efficiency. In our code, atomic functions are introduced to assemble the Fock matrix. The atomic function is a feature offered by NVIDIA since CUDA 1.1; it performs a lock-and-set atomic operation on words residing in global memory, or in shared memory since CUDA 1.2. These operations are named atomic in the sense that they are guaranteed to complete without interference from any other thread: after one thread locks a memory address, no other thread has access until the operation is completed.
This feature is used in our implementation: once a unique integral is evaluated in a thread, it is added to the corresponding Fock matrix elements, forming the Coulomb and exchange matrices simultaneously. A unique integral is usually required by several Fock matrix elements, depending on ERI symmetry. For example, because of the symmetry relations (ij|kl) = (ji|kl) = (kl|ij), etc., (ij|kl) can be reused 8 times if i, j, k and l are all different.

One concern with the atomic function is that, in the current version, it only supports long-integer operations, while a double-precision atomic operator is not available (due to hardware limitations). The underlying reason is that there is no stipulation about the order in which threads perform their operations; for integer arithmetic the order does not affect the result, but floating-point arithmetic is not associative because intermediate results are rounded. For instance, (A+B)+C equals A+(B+C) if A, B and C are integers, but this is generally not true for floating-point values. Therefore, early hardware does not support floating-point atomic addition, and the newest version only supports single precision. A compromise approach is to mask the double-precision value as a long-long integer and eventually unmask the assembled Fock matrix once all ERIs have been evaluated. However, this step reduces the accuracy of the Fock matrix even though the ERIs themselves are evaluated in double precision. Hence, this step is an accuracy bottleneck and the main source of error. Another concern beyond accuracy is efficiency.

Figure 3.1. Pre-sorting scheme for half ERIs. Row and column indices correspond to bra and ket pairs of contracted integrals, and the colors reflect the estimated magnitude of the ERI value. The sorting order is described in the text.

Figure 3.2. Flowchart for the GPU implementation of ERI evaluation within an SCF cycle.

If a thread needs to access an address that is locked by another thread writing to the same memory location, it must wait until the occupying thread unlocks the address. Consequently, the more threads that sit idle waiting to write to one identical memory address in sequence, the lower the efficiency. This is one factor preventing better efficiency for small systems with crowded memory access. However, in our implementation, a typical launch bound for a CUDA 2.0 card is 16 blocks of 512 threads; if each thread generates one integral and each integral is used at most 8 times, then 8192 integrals are calculated in one pass and at most 65536 memory addresses are requested for modification. The Fock matrix has N*N elements, where N is the number of basis functions, so for a system with 1000 basis functions only 6.6% of the memory addresses are candidates for update. It is therefore unlikely that two threads will want to write to the same address, especially for larger systems; in practice only a few threads are active at any moment, while others wait on global memory access because of memory latency or are inactive because of register shortage. In conclusion, atomic operations do not lead to significant inefficiency beyond what is expected for normal memory access. To demonstrate this, we compared two ways of dealing with ERIs: (1) form the Fock matrix on the CPU after transferring the ERIs from the GPU (labeled conventional SCF) and (2) form the Fock matrix on the GPU using atomic operations (labeled direct SCF). The model used is an N*N hydrogen mesh with a neighboring distance of 1 Å.
The systems used here are ÒtoyÓ models because there are only s -orbitals required which exclude the overhead brought by thread convergence. Table 3 .1 shows the timing results produced by the two methods, 46 conventional SCF and direct SCF. All those results are based on 64 -bit double -preci sion computation on an NVIDIA M2090 graphic processing unit card. For our conventional SCF code, we have modified it to avoid any disk I/O operations that may seriously reduce computation efficiency so that the two methods can be compared fairly. Kernel ti me includes floating -point calculation and data -fetch latency, and the majority of CPU -GPU transfer time is ERI transfer for conventional SCF and the Fock matrix for direct SCF. The first notable result in table 3 .1 is the impressive speedup by GPU computa tion, and both ways provide significant improvement compared to a CPU no matter the kernel or wall time. If only kernel time is considered, direct SCF provides about a 40 -fold speedup while the conventional method provides a 150 -fold increase. The 150 -fold speedup can be ruled as the maximum speedup that GPU can achieve in an ideal situation, without any overhead such as data latency and thread convergence. However, the data transfer time of conventional SCF is as much as the kernel time while it is less th an 0.5% wall time and negligible for direct SCF. The kernel time of direct SCF is about 3 times more than conventional SCF, except it includes the Fock matrix assembly time, one of the reasons is the penalty from the atomic operators. However, even the ove rall time of conventional SCF is close to that of direct SCF in this case, data -transfer time and memory requirement scale N 4 for conventional SCF but N 2 for direct SCF, therefore, memory will be quickly exhausted for conventional SCF (the maximum number o f basis functions is <300 with a 6 GB GPU Memory) and takes more time for data transfer (via a PCI Express slot, 2.0GB/s) which makes calculations on larger systems impossible. So, as noted before, it is necessary to assemble the Fock matrix in the GPU and abandon sophisticated code optimized for CPUs over years to avoid expensive copy operations. 47 Table 3.1 . Timing for two SCF schemes for Hydrogen Atom System ERI Evaluation a Hydrogen Number Direct SCF Conventional SCF CPU Time/s Kernel Time/s GPU -CPU transfer/s Kernel Time/s GPU -CPU transfer/s 8*8 STO -3G 0.183 0.062E -3 0.051 0.008 6.30 6-31G 0.443 0.229E -3 0.127 0.086 14.97 10*10 STO -3G 0.454 0.049E -3 0.121 0.032 19.00 6-31G 1.178 0.289E -3 0.340 0.447 47.40 12*12 STO -3G 0.918 0.101E -3 0.274 0.139 46.18 6-31G 2.630 0.789E -3 0.770 1.942 121.83 a GPU timing refer to kernel time and GPU -CPU data transfer time. CPU time refers to the Fock matrix formation time. Unit is second. 3.4. RESULTS AND DISCUSSION Before we began tests on our GPU code, we implemented the same recurrence relations for a CPU on QUICK 27. Our test indicates i t has competitive efficiency with GAMESS as shown in Table 3 .2. The GAMESS version is August 11, 2011 R1 64 -bit under Linux with Intel FORTRAN Compiler 10.1.15(shown as GAMESS/intel) and the GNU FORTRAN compiler 4.1.2(shown as GAMESS/gnu) using the default configuration, while QUICK is compiled using the same Intel compiler with optimization option level 3 (-O3). 
The test cases are Hartree -Fock SCF computation on sets of representative molecules of small and medium size, such as a water cluster with 16 wate r molecules, an (Alanine) 3 chain, and a (Glycine) 12 chain using the 6 -31G, 6 -31G* and 6 -31G** basis sets. In the calculations most setups are default except the direct option is selected for all 48 systems and the integral cutoff is set as 10 -9 ensuring a fai r comparison. The benchmark is performed on the same machine with an AMD Opteron Processor 2427 CPU, and we list their first iteration times in Table 3 .2. QUICK performs competitively in most categories, therefore, we conclude that QUICKÕs implementation i s as efficient as a standard quantum chemistry package. In the GPU version, all primitive ERIs are calculated on -the -fly due to register shortage as previously described. In contrast, in the CPU version, because of the considerably larger amount of memory provided, primitive ERIs are saved in memory temporarily to form contracted ERIs, which leads to about a two - to four -fold speedup for split Gaussian basis sets. In the following benchmark test, both CPU and GPU codes use the direct SCF procedure with th e 2 -electron integral cutoff set as starting from 10 -9 that will be changed to tighter criteria after reach convergence threshold. A Schwarz upper bound cutoff is used to pre -screen small ERI blocks. 49 Table 3.2 . Time Comparisons Between QUICK and GAMESS a QUICK GAMESS/gnu GAMESS/intel 6-31G (H2O)16 4.7 7.5 6.4 (Alanine) 3 6.2 5.2 4.4 (Glycine) 12 52.4 51.0 43.3 6-31G* (H2O)16 11.3 17.4 14.9 (Alanine) 3 21.4 17.3 15.0 (Glycine) 12 175.6 183.2 154.1 6-31G** (H2O)16 21.5 33.4 29.4 (Alanine) 3 26.3 27.7 23.8 (Glycine) 12 225.0 264.3 224.9 a. The number in the table indicates CPU time in the first iteration for the Fock matrix formation time. All values are reported in seconds. Compiler and platform information is described in the text. The QUICK GPU code is rewritten in C++, using the Intel C++ Compiler 10.1.15 and CUDA compiler 4.0 v0.2.1221 with optimization option level 3( -O3). In the CUDA compiler, the fast math library option is on ( -use_fast_math), CUBLAS 4.0 is used for linear algebra especially matrix -matrix multiplication. Most efficient comparison -based sorting algorithms have a complexity of !!!!!!"# !!, and in our case, n=N2, where N is the number of unique shells, therefore, the scale of sorting is !!!!!!"# !!, which is smaller than ERI scale and linear algebra; therefore, sorting will not be a bottleneck in terms of scale. In the benchmark, we compare performance on one typical CPU processor used in high -performance computation (AMD Opteronª Processor 2427) and one o f the state-of-the -art GPU(TESLA M2090, with ECC off). An M2090 card has a specification 50 of 6 Gigabytes memory size, with 16 Streaming Multiprocessor and 512 CUDA cores. We set the block number equal to the number of stream multiprocessors in a device, 16 in this case. Timing is set for the Fock matrix formation time including CPU -GPU data transfer except where mentioned otherwise. The first set of test cases were linear (Alanine) n chains. We chose different Gaussian split type basis sets varied from 3 -21G to 6 -311G**, and the chain length is from n = 1 to n = 30 (C 90N30H152O30) with up to 3762 basis functions. The speedup comparison for GPU and CPU to form the Fock Matrix in the first iter ation is presented in Figure 3.3 , while we compare the energy deviat ion of the GPU relative to the CPU in t he first iteration in Figure 3.4. 
From Figure 3.3 , an obvious trend is that the speed up increase is directly related to the system size, for example, for the 6 -31G* basis set, GPU calculation ranges from 7.47 times f aster for a single Alanine to as much as 27.40 times faster for 30 Alanine residues. This tendency can be explained by three factors. First, with larger systems and larger basis sets, the threads in the GPU are better balanced, and therefore less thread di vergence occurs, which is important to avoid in CUDA programming. Second, the density matrix is quadratic with respect to the number of basis functions while the thread number is fixed in GPU, so if a larger number of basis functions are used, there is les s chance that two threads modify the same address when they finish ERI evaluation. In contrast, threads will take atomic operations ÒcrowdedlyÓ if the density matrix size is small, which leads some threads to adopt idle states that slows calculation speed . The third reason is because of the better pre -screen strategy including the Schwarz cutoff used in GPU and skipping a large amount of memory read requests or atomic write 51 requests, which are the bottleneck in GPU computation. The CPU also uses a similar cutoff scheme, and its memory access is fast compared with the GPU memory access, so the increased speedup is observed if more ERIs are treated as negligible by the cutoff. An interesting observation is double -polarized basis sets are generally most accel erated and their speedup increases fastest, while non -polarized basis sets are the least accelerated but the speedup increase is second fastest. There is less of a speedup for small molecules but a greater speedup for large molecules compared with single -polarized basis sets. This is because polarized basis sets will have fewer integrals cutoff but possess a large number of basis functions, which take advantage of reason 2 above but goes against reasons 1 and 3. So for double -polarized basis sets, reason 2 dominates reasons 1 and 3, while for single -polarized basis set this is switched. In Figure 3.4 , energy differences are shown. Since all of the calculations are double precision except for atomic operations, the only error beyond numeric error is brought on by atomic operations. We found all the absolute errors are within 10 -10 Hartrees for a system with a -105 Hartree electronic energy, so the relative error is on the order of 10 -16, which is approximately double -precision accuracy magnitude. This error increases with the increase of system size and is due to error accumulation as expected. We analyze the error growth and find the growth scale is about n 2.5~3.0 , particularly for 6 -311G* and 6 -311G**, two big basis sets with significant enough error to pro vide a meaningful fit, the scale is n 3.01±0.05 and n 2.77±0.05 (both ignoring the first ten points). The error growth scale is quite close to the ERI growth scale. Therefore, in conclusion, the ERI error is essentially 52 zero because they are calculated as dou ble precision, and the main error is from the atomic error. This is because to work with atomic operations, ERIs are masked as long -long integers leading to approximate 10 decimal digits of accuracy which is less than double -precision (~15.9), but more tha n single -precision (~7.2). These calculations are far better than single -precision GPU ERI evaluation, which only has a 10 -3 a.u. accuracy because of error accumulation. 
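To illustrate the masking of double-precision values as long-long integers described above, the following CUDA sketch shows one common fixed-point realization of the idea. It is an illustration under assumptions rather than the QUICK implementation: the helper name atomicAddMasked and the scale factor FIXED_POINT_SCALE are invented for this example, and the actual scaling used in our code may differ.

// Illustrative sketch only: accumulate double-precision contributions through
// 64-bit integer atomics by fixed-point scaling, then unmask at the end.
#define FIXED_POINT_SCALE 1.0e10   // retains roughly 10 decimal digits, as discussed in the text

__device__ void atomicAddMasked(unsigned long long *target, double value)
{
    // Scale the contribution and truncate it to a signed 64-bit integer.
    long long scaled = (long long)(value * FIXED_POINT_SCALE);
    // Two's-complement wrap-around makes the modular unsigned addition give the
    // correct signed sum when the accumulated value is cast back to signed.
    atomicAdd(target, (unsigned long long)scaled);
}

// After all ERIs are processed, the Fock matrix is unmasked, e.g. on the host:
//   fock[i] = (double)((long long)fockMasked[i]) / FIXED_POINT_SCALE;

The choice of scale factor sets the trade-off between representable range and retained digits, which is consistent with the roughly 10-digit accuracy quoted above.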
The error order we achieve, 10 -10 Hartrees, is accurate enough for most chemistry calc ulations especially considering the accompanying speedup. Figure 3.3. Speedup comparisons between different basis sets on Alanine chain series. Timing for CPU and GPU are their first iteration Fock matrix formation time including data transfer time. Platform and software details are described in text. 53 Figure 3.4 . Energy deviation comparisons between different basis sets on Alanine chain series. Logarithmic scale Y -axis is used. Further tests are full SCF calculations on some prototypical systems such as the 3 10-helix acetyl(ala) 18NH2, the "-helix acetyl(ala) 18NH2, the #-strand acetyl(ala) 18NH2, an ice crystal structure and a water -cluster together with some mid -size systems like taxol, valinomycin, olestra and proteins with PDB 28 ID of 1M2C( " -conotoxin mii), 1OMG( $-conotoxin mviia), 1VTP(vacuolar targeting peptide). Th e 6 -31G*, 6 -31G** and 6 -311G basis sets were used for the large systems and for some of the small systems the 6-311, 6-311G** and cc-pVDZ basis sets were used. The system sizes range from 110 atoms to 453 atoms with up to 4015 basis functions. The geometri es of the above -mentioned molecules were taken from the literature or constructed and optimized in -house. In these calculations, in addition to the calculation settings used in the Alanine chain series, we turned on the DIIS (direct inversion in the iterat ive subspace) SCF 29 option in order to accelerate SCF convergence. The convergence criteria were set to a density matrix RMS (root mean square) difference within 10 -7, and an energy change of less than 10 -9 Hartrees. The Fock matrix will be calculated usin g differences from the previous 54 iteration, which minimizes the number of Fock matrix elements that need to be updated, especially for late iterations. A staged integral cutoff strategy was also used to save time in the early stages of the SCF. The resul ts are presented in Table 3 .3. We list first iteration time, last iteration time and compare the converged total energy deviation between the CPU and GPU result. From the table, we observe that a single GPU can speedup a SCF calculation by up to 130 -fold compared to a single CPU. For relatively small systems, GPU direct SCF calculations are about 10 times faster than CPU calculations. The speedup, as discussed above, depends on the basis set type and the number of basis functions: large basis sets and weakl y-interacting basis sets which lead to more integrals being cutoff will provide a higher speedup. Moreover, if we compare the three different helix conformations, we note that the speedup of the #-strand calculation outperforms that of the "-helix and 3 10-helix calculations when same basis sets are used. This is because the "-helix structure is more compact, while the #-strand structure is less compact benefitting from a higher amount of integrals being cutoff. We also find that the olestra molecule and wat er cluster (H 2O)100 yields impressive speedups, about 100 time faster for 6 -31G, due to the large amount of hydrogen atoms, where a large portion of the integrals are of the (ss|ss) type, which requires the least computational cost and memory access. Accor ding to Table 3 .1, for a hydrogen mesh with the 6 -31G basis set, in the conventional SCF scheme, the kernel time in GPU versus CPU can reaches a 150 times speedup, which is the maximum speedup a GPU of the type employed can achieve. 
Therefore, the speedup achieved for olestra and water clusters are close to the ideal kernel efficiency with negligible atomic operation 55 penalties. The energy error including ERI error and CUBLAS error will accumulate with iterations, therefore, for a full SCF calculation, the m agnitude of energy error increases to 10-7 after error accumulation for a large system with more than 3000 basis functions. However, since time used in the linear algebra subroutines, especially matrix -matrix multiplication (we use the DGEMM subroutine f rom NVIDIA CUBLAS) and the diagonalization routine (we do this calculation on the CPU), scale cubically with the number of the basis functions (these calculations scale larger than does ERI computation) for large systems, indicates that linear algebra play s a significant role in the total calculation time. For example, in the SCF calculation for olestra with the 6 -31G** basis set, in the first iteration, the time for ERI evaluation is 371.51s (37.7% of Fock matrix formation time) while diagonalization is 32 3.68 seconds (32.9% of Fock matrix formation time) and the DIIS time (excluding the diagonalization time) is 289.45 seconds (29.38% of Fock matrix formation time), while the ERI evaluation and DIIS times in CPU were 50009.94 seconds and 16079.17 seconds. H ence, the GPU achieves a 134.6 and a 55.6 times speedup respectively, while the diagonalization time is almost unchanged, and, indeed, it is almost as much as the ERI evaluation time on a GPU. Thus, because of the diagonalization routine the maximum achiev able speedup will not be realized. In this series of benchmark computations, the largest peak DRAM usage is 1.25 Gigabytes, which is for olestra using 6 -31G**. Peak memory usages for some larger systems were tested as well, and the results are listed in Table 3 .5. Memory usage is in a 56 trade off with speed as described above by storing pre -calculated values and they, together with necessary molecule basis set and electron structure information such as one-electron operator matrix and density matrix, repres ent most of the memory usage. Ideally, memory usage is quadratic with the number of unique shells, so for an M2090 card with 6.0 Gigabytes, the maximum systems are about 10000 basis sets because of well-designed pre -screening strategies. We could avoid the pre-calculation step to reduce memory usage but this will sacrifice speed for large systems. In addition, we did more test cases using different devices, and we find the speed up is not significantly different (~10 -20%) when lower level cards were used (M2070, GTX580 for example) , as shown in Table 3 .4. For some small basis sets, 6 -31G for instance, about a 20% advantage over GTX 580 is realized using a M2090, but for some larger basis sets, such as 6 -311G**, the advantage is less obvious. This is becaus e the bottleneck for GPU calculation is memory bandwidth rather than floating -point calculation, and in terms of registers and shared memory, the Tesla M2090 card does not have a significant advantage over GTX GPUs, and so the test result matches our expec tations. But we do recommend Telsa or better cards for their stability and fault tolerance. 57 Table 3.3. Accuracy and Performance Comparison Between CPU and GPU calculation a Molecule (atom number) Basis sets (function number) 1st iteration/s Last iteration/s Energy CPU GPU speedup CPU GPU speedup CPU/a.u. GPU/*10 -9 a.u. 
Taxol(110) 6-311G(940) 513.26 70.74 7.26 823.86 76.53 10.77 -2909.149312868 -2 6-311G**(1453) 1789.80 203.97 8.77 3611.05 277.49 13.01 -2910.445038007 -1 cc-pvDZ (1160) 1201.54 116.87 10.28 2694.33 174.99 15.40 -2910.042418906 1 Valinomycin(168) 6-311G(1284) 1343.53 153.67 8.74 2031.90 182.15 11.16 -3770.243737052 -4 6-311G**(2022) 4694.71 393.55 11.93 8601.67 612.14 14.05 -3772.058214363 -5 cc-pvDZ (1620) 3451.95 340.87 10.13 6089.49 458.61 13.28 -3771.439946125 6 Ice -like (H 2O)80(240) 6-311G(1520) 3073.54 109.57 28.05 3153.74 109.48 28.81 -6082.305528935 5 6-311G**(2480) 9953.00 246.84 40.32 10241.20 300.18 34.12 -6084.520235828 3 cc-pvDZ (2000) 6776.11 93.51 72.46 10684.27 295.63 36.14 -6083.023206911 -1 Ice -like (H 2O)96(288) 6-311G(1824) 6639.24 165.18 40.19 9518.88 325.39 29.25 -7298.891219499 7 6-311G**(2976) 20754.41 366.43 56.64 29402.37 925.64 31.76 -7301.528543939 6 cc-pvDZ (2400) 14534.58 136.77 106.27 20173.25 449.68 44.86 -7299.732249179 -2 (H2O)32(96) 6-311G(608) 134.77 14.40 9.36 128.77 10.78 11.95 -2431.716629896 2 6-311G**(992) 379.50 34.05 11.15 957.65 66.14 14.48 -2432.703288503 1 cc-pvDZ (800) 254.58 14.21 17.92 674.14 32.37 20.83 -2432.055391595 2 (H2O)100(300) 6-311G(1900) 7157.31 104.32 68.61 8728.88 141.65 61.62 -7602.094350092 -1 6-311G**(3100) 25236.82 252.80 99.83 31204.21 486.81 64.10 -7605.175710929 2 cc-pvDZ (2500) 16049.72 100.04 160.43 20052.53 200.23 100.15 -7603.361888329 22 58 Table 3.3. (contÕd) 310-helix acetyl(ala) 18NH2(189) 6-311G(1507) 1923.62 173.75 11.07 2842.48 227.14 12.51 -4632.003799593 -17 6-311G**(2356) 6639.98 472.74 14.05 11533.48 799.10 14.43 -4634.169869157 -7 cc-pvDZ(1885) 5800.82 541.90 10.70 11243.90 806.39 13.96 -4633.502073111 11 !-helix acetyl(ala) 18NH2(189) 6-311G(1507) 2053.30 194.58 10.55 3164.16 233.21 13.57 -4632.315977523 -7 6-311G**(2356) 7083.52 507.06 13.97 11889.02 822.49 14.45 -4634.476339611 -13 cc-pvDZ(1885) 6004.15 541.96 11.08 10230.45 726.92 14.07 -4633.816772135 3 "-strand acetyl(ala) 18NH2 (189) 6-311G(1507) 1499.20 89.95 16.67 2101.61 123.06 17.08 -4632.231275219 -6 6-311G**(2356) 5572.93 263.14 21.18 9102.65 499.50 18.22 -4634.420919368 2 cc-pvDZ(1885) 4970.84 440.33 11.29 8287.50 575.00 14.41 -4633.750359062 147 ! -conotoxin mii (PDBID: 1M2C, 3+1 -)(220) 6-31G*(1964) 3978.10 345.70 11.51 13309.37 973.75 13.67 -7106.869678943 15 6-31G**(2276) 5453.77 409.36 13.32 18046.87 1299.04 13.89 -7107.108358427 27 6-311G(1852) 4485.98 359.74 12.47 11311.33 902.68 12.53 -7105.597849172 3 #-conotoxin mviia (PDB ID: 1OMG, 7+2 -) (353) 6-31G*(3035) 17473.01 730.71 23.91 23381.07 1081.33 21.62 -11122.046450828 5 6-31G**(3563) 26247.84 869.72 30.18 33451.28 1439.35 23.24 -11122.460728457 16 6-311G(2885) 28709.47 812.89 35.32 30734.31 1171.02 26.25 -11120.201936663 -13 Olestra(453) 6-31G*(3181) 22626.95 306.09 73.92 25706.94 539.59 47.64 -7491.143229057 -165 6-31G**(4015) 53992.79 402.68 134.08 58037.78 853.27 68.02 -7491.573825553 -250 6-311G(3109) 51047.04 358.01 142.59 59152.48 1035.48 57.13 -7489.506716913 1 59 Table 3.3. (contÕd) vacuolar targeting peptide (PDB ID: 1VTP, 4+7 -) (396) 6-31G*(3418) 24196.66 1041.92 23.22 31079.79 1382.36 22.48 -10014.756057438 16 6-31G**(4000) 38460.44 941.06 40.87 51952.78 1849.08 28.10 -10015.181384177 23 6-311G(3208) 34865.17 895.09 38.95 51292.98 1305.71 39.28 -10012.680303881 -15 a GPU energy column lists relative energies with respect to corresponding CPU calculation. Platform and software details are described in text. 
b PDB ID and number of positive and negative charge center included in parentheses. 60 Table 3.4. Timing for GPU version calculation using different devices a (Glycine)12 Chain Taxol 6-31G 6-31G* 6-31G** 6-311G 6-311G* 6-311G** 6-31G 6-31G* 6-31G** 6-311G 6-311G* 6-311G** M2090(6.0 GB) b 7.92 24.70 29.48 18.30 47.93 61.48 28.63 81.22 96.91 70.73 171.37 215.64 M2070(6.0 GB) 9.16 28.50 33.89 21.03 55.19 65.55 32.82 94.09 112.22 80.99 194.59 232.30 GTX580(1.5 GB) 10.00 26.06 30.82 22.25 51.08 60.48 35.10 87.00 102.61 85.54 183.57 217.39 CPU c 73.33 275.22 342.79 175.16 495.83 628.43 257.80 828.51 987.13 582.52 1622.83 1389.97 a. Number in the table shows the Fock Matrix formation time in GPU in seconds. b. global memory size(also know as DRAM) included in parentheses. c. CPU calculation is on AMD Opteron tm Processor 2427. Table 3.5. Peak Memory Usa ge Comparison a Molecule Number of Atoms 6-31G 6-31G* 6-31G** 6-311G 6-311G* 6-311G** Number of Basis Functions Peak Memory Usage Number of Basis Functions Peak Memory Usage Number of Basis Functions Peak Memory Usage Number of Basis Functions Peak Memory Usage Number of Basis Functions Peak Memory Usage Number of Basis Functions Peak Memory Usage 1VTP 396 2206 674MB 3418 880MB 4000 1049MB 4420 963MB 5002 1224MB 4000 1427MB Crambin 642 3597 1796MB 5559 2338MB 6509 2785MB 5244 2589MB 7206 3279MB 8151 3818MB 1BQ9 765 4318 2570MB 6694 3357MB 7801 3985MB 6287 3687MB 8663 4686MB 9770 5441MB a. All calculations are on Telsa M2090 6.0 GB GPU. 61 3.5. CONCLUSIONS In this chapter , we evaluate ERIs on a GPU using recurrence relations and form the Fock matrix entirely in the GPU. Our full SCF benchmark calculations (along with earlier work described in the introduction) demonstrate that GPU ERI implementations achieve impressive speedups compared to traditional CPU architectures. The energy error associated with double -precision calculations in our implementation is on the order of 10 -7 a.u. compared to the CPU result, which meets most chemical calculation accuracy requirements. A well -sorted integral grid that reduces thread divergence and provides an optimized memory access pattern boosts the performance in terms of accuracy and efficiency. This speedup is also achieved by optimizing the Fock matrix formation scheme by introducing the atomic -operation to significantly reduce data transfer f rom N 4 to N 2, which is one of the most time -consuming steps in conventional GPU SCF programming. Moreover, this approach also reduces redundant ERI calculation by reusing ERI data. Our benchmarks show the speedup increases with increasing system size, and our code now is applicable to s, p and d orbital functions which are in most most organic or biochemistry calculations. Even with the observed performance increases there are several avenues for improvement. In full SCF calculations, diagonalization is e xecuted on the CPU largely because it is hard to efficiently implement on a GPU. In some systems, this dominates computation time, so it is necessary to introduce highly efficient GPU -based diagonalization routines. Moreover, like most other GPU ERI evalua tion schemes, 62 register shortage is a crucial factor that limits GPU speedup, and this can be improved by even more aggressive memory caching and shared memory usage. As shown in our benchmarks, atomic operators bring considerable penalties especially for s mall molecules. Also it brings along a major energy error because of the limited function support. Our future work will focus on two aspects. 
First is the ERI derivative with which geometry optimization or molecular dynamics can be implemented and we are working to integrate this work with the AMBER MD package so that large -scale ab initio QM/MM is readily available. Another direction we are working on is post -HF methods such as MP2 and Coupled -Cluster methods. However, most post -HF methods reuse ERIs in different stages so a proper strategy is to store ERIs in external files, but in GPU implementation this treatment of ERIs will inevitably transfer calculated ERIs from the GPU to CPU, which as mentioned above, is very slow. So how to create efficient post -HF methods entirely in GPUs is still an open question. In addition, higher angular momentum function ERI evaluation is under development. We will refer to these questions in the next chapter. 63 APPENDIX 64 MANUAL AND SAMPLE INPUT FILE OF QUICK The input file of QUICK follows the philosophy ÒSimple is betterÓ, so software designer tried to minimize userÕs learning cost. Installation To install QUICK, first you may need to configure make.in file. Intel Fortran compiler is recommanded. ! Copy corresponding make.in file from makein folder and rename it to make.in. Then type make quick If you want to install GPU QUICK, NVIDIA CUDA COMPILER is required, which uses make.in.gnu.cuda. You may test 'nvcc --version'. Copy make.in.gnu.cuda to from makein folder to your directory and remane it to make.in cp ./makein/make.in.gnu.cuda ./make.in Before compiler, you must set CUDA_HOME in make.in file if not set in bashfile, otherwise, CUBLAS and other libraries can not been compiled or linked. No tice, if you want to support f orbital, set CUDA_SPDF=y in make.in file you copied from make.in.gpu.cuda, which will take a little bit long time and memory to compile, otherwise, set CUDA_SPDF=n for s, p and d orbital. (Default is not) ! Modify CUDA_HOME= (your cuda home) in make.in file ! Modify CUDA_SPDF=y or n in make.in file if you want or do not want to support gpu f function Then type 65 make quick.cuda in ./bin directory, you can find executable files. Usage and input file We suggest you to export install directory into PATH. Edit ~/.bashrc, then add this line export PATH=(YOUR QUICK DIR)/bin:$PATH and add basis set path export QUICK_BASIS=(YOUR QUICK DIR)/basis where (YOUR QUICK DIR) is your directory. In the input file, the first line is calculation card. For example HF ncyc=3 energy BASIS=6 -31gs denserms=1.0e -7 This means HF calculation is wanted, and after 3 cycle, w e only calculate difference of Fock matrix, basis set is 6 -31g**, and the convergence criteria is 1.0E -7. If you don't know anything about quantum chemistry, this card is recommended. After first line, you input your molecule geometry after a empty line. For example Ò HF ncyc=3 energy BASIS=6 -31gs denserms=1.0e -7 C 4.5916 2.6127 4.3145 H 6.8049 4.5138 -2.5775 H 8.3134 7.7325 -1.0688 H 4.3547 2.9911 3.3201 H 4.5176 3.5187 3.7129 66 Ó which is the element and x,y,z coordinates respectively. Save the file as CH4.in for example, n ext, type quick CH4 .in or GPU version quick.cuda CH4.in or MPI version quick.MPI CH4.in These will generate CH4 .out as output file (to master node if MPI version is used) . More comments about GPU version Currently, QUICK support CUDA version 2.0 and up GPU , that supporting double -precision. We tested on GTX580, M2090, K20 and K40 on minimum Linux system . We can't guarantee our program is flawless, and it is experimental now. If you have any question feel free to email the author. 
67 REFERENCES 68 REFERENCES (1) NVIDIA. Compute Unified Device Architecture(CUDA). http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Pro gramming_Gui de.pdf (accessed October 31, 2012) (2) Gıtz, A. W.; Wılfle, T.; Walker, R. C. In Annual Reports in Computational Chemistry ; Ralph, A. W., Ed. 2010; Vol. 6, p 21 -35. (3) D.A. Case; T.A. Darden; T.E. Cheatham, I.; C.L. Simmerling; J. Wang; R.E. Duke; R. Luo; R.C. Walker; W. Zhang; K.M. Merz; B.P. Roberts; B. Wang; S. Hayik; A. Roitberg; G. Seabra; I. Kolossv⁄ry; K.F. Wong; F. Paesani; J. Vanicek; J. Liu; X. Wu; S.R. Brozell; T. Steinbrecher; H. Gohlke; Q. Cai; X. Ye; J. Wang; M. -J. Hsieh; G. Cui; D.R. Roe; D. H. Mathews; M.G. Seetin; C. Sagui; V. Babin; T. Luchko; S. Gusarov; A. Kovalenko; Kollman, P. A. AMBER11,University of California, San Francisco. 2010. (4) Case, D. A.; Cheatham, T. E.; Darden, T.; Gohlke, H.; Luo, R.; Merz, K. M.; Onufriev, A.; Simmerling , C.; Wang, B.; Woods, R. J. J. Comput. Chem. 2005, 26, 1668 -1688. (5) Anderson, J. A.; Lorenz, C. D.; Travesset, A. J. Comput. Phys. 2008, 227, 5342 -5359. (6) Yasuda, K. J. Comput. Chem. 2008, 29, 334-342. (7) Ufimtsev, I. S.; Martinez, T. J. J. Chem. The ory Comput. 2008, 4, 222-231. (8) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2009, 5, 1004-1015. (9) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2009, 5, 2619-2628. (10) Vogt, L.; Olivares -Amaya, R.; Kermes, S.; Shao, Y.; Amador -Bedolla, C.; Aspuru -Guzik, A. J. Phys. Chem. A 2008, 112, 2049-2057. (11) Olivares -Amaya, R.; Watson, M. A.; Edgar, R. G.; Vogt, L.; Shao, Y.; Aspuru -Guzik, A. J. Chem. Theory Comput. 2010, 6, 135-144. (12) Ma, W.; Krishnamoorthy, S.; Villa, O.; Kowalski, K. J. Chem. Theory Comput. 2011, 7, 1316-1327. (13) Boys, S. F. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences 1950, 200, 542-554. (14) Obara, S.; Saika, A . J. Chem. Phys 1988, 89, 1540-1559. 69 (15) Head-Gordon, M.; Pople, J. A. J. Chem. Phys 1988, 89, 5777-5786. (16) NVIDIA. The NVIDIA CUDA Fast Fourier Transform library. http://developer.nvidia.com/cufft (acc essed October 31, 2012) (17) NVIDIA. The NVIDIA CUDA Basic Linear Algebra Subroutines (cuBLAS) library. http://developer.nvidia.com/cublas (accessed October 31, 2012) (18) Rys, J.; Dupuis, M.; King, H. F. J. Comput. Chem. 1983, 4, 154-157. (19) Asadchev, A.; Allada, V.; Felder, J.; Bode, B. M.; Gordon, M. S.; Windus, T. L. J. Chem. Theory Comput. 2010, 6, 696-704. (20) Wilkinson, K. A.; Sherwood, P.; Guest, M. F.; Naidoo, K. J. J. Comput. Chem. 2011, 32, 2313-2318. (21) Schmidt, M. W.; Baldridge, K. K.; Boatz, J. A.; Elbert, S. T.; Gordon, M. S.; Jensen, J. H.; Koseki, S.; Matsunaga, N.; Nguyen, K. A.; Su, S.; Windus, T. L.; Dupuis, M.; Montgomery, J. A. J. Comput. Chem. 1993, 14, 1347-1363. (22) McMurchie, L. E.; Davidson, E. R. J. Comput. Phys. 1978, 26, 218-231. (23) Luehr, N.; Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2011, 7, 949 -954. (24) Almlof, J. Lecture Notes in Quantum Chemistry II, European Summer School in Quantum Chemis try 1994, 1Ð90. (25) Strout, D. L.; Scuseria, G. E. J. Chem. Phys 1995, 102, 8448-8452. (26) Dunning, T. H. J. Chem. Phys. 1989, 90, 1007-1023. (27) Y. Miao, X. He, K. Ayers, Ed. Brothers, K.Merz, QUICK(version 0.09_120306), University of Florida, Gainesvi lle, FL, 2011 (28) Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. Nucleic Acids Res 2000, 28, 235-242. (29) Pulay, P. 
Chem Phys Lett 1980, 73, 393-398. 70 CHAPTER 4 . GPU ACCELERATION ON ERI DERIVATIVE EVALUATION 71 4.1. INTRODUCTION As we shown last chapter, Graphical Processing Units (GPUs) provide fast and accurate computational performance for a wide range of problems at a reasonable cost. Environments for developing general purpose computing on graphical processing units (GPGPU), such as the Compute Unified Device Architecture (CUDA), facilitate the creation of high -performance software for a wide range of applications even though GPUs were originally designed to rend er images on a monitor. Meanwhile, traditional quantum chemistry methods, such as the Hartree -Fock Self -Consistent field method (HF -SCF) and Density Functional Theory method (DFT) are widely used to rationalize the behavior of molecular systems. However, t he broad application of these methods, to large systems containing 1,000s of atoms, has been limited by their high computational requirements using traditional CPUs. GPUs, on the other hand, have recently become widely available as a general purpose -proces sing platform. With GPU technology, it is possible to bring supercomputing power to the desktop and achieve trillions of peak floating point operations per second (FLOPS) outperforming desktop CPUs by over an order of a magnitude. Another key factor in GPUÕs is becoming widely used in scientific areas is the release of NVIDIAÕs Computer Unified Device Architecture (CUDA) platform that eases the coding burden for GPUs with an extension of the standard C/C++ language. Several papers describing the use of GPU Computing, especially with CUDA, have demonstrated impressive speedups for a range of computational chemistry applications, including ab intio quantum chemistry 1-12 72 and its application to the simulation of biochemical reactions 13-18, post -HF methods 19-23 as well as empirical force -field and ab initio molecular dynamics 24-26 simulations. The applications of ab initio computational chemistry methods are not limited to the study of the single point energetics or electronic structure of large molecular systems, but can also be used in molecular geometry optimization and molecular dynamics simulations which help scientist s to simulate chemical reactions in silico rather than in the laboratory . However, for HF and DFT methods, not only is the single point energy required but also the computation of the energy gradients requiring significant computational resources especiall y in the study of large biochemical systems. Gradient calculation requires approximately the same computational effort as the SCF calculation, hence, GPU acceleration of this step will further benefit to computational chemistry community in its ongoing eff ort to study larger chemical systems. In last chapter , we reported the GPU acceleration of the Electron Repulsion Integral (ERI) evaluation for s, p and d orbitals 8 to compute the HF and DFT single point energy, which further demonstrated t he potential of GPU use in ab initio quantum chemistry methods. The realized speedup was 10~20 -fold (with excellent accuracy) compared with a single core CPU for moderately sized molecular systems including proteins. However, two more challenging problems still need to be tackled for broader application of GPUs in quantum chemistry: efficient implementation of electron -correlation methods and efficient gradient computation for geometry optimization and MD simulation. 
GPU application to address electron -corr elation has already been reported using MP2 22 and 73 Coupled -Cluster method 19,20,27-29 with impressive overall performance. Herein, we focus on the latter issue. For energy computation the evaluation of ERIs with higher -angular momentum functions such as f orbitals (and beyond) becomes necessary. Due to the complexity of high -angular momentum ERIs, GPU code to compute them is especially difficult to create and fine tune because of the difference in the architectures of GPUs relative to CPUs. Another challenge is the calculation of ERI derivatives, the bottleneck of HF gradient computing, that also depend on high -angular momentum ERIs, which is the problem faced for large basis set computations, but with a memory access pattern different from that of normal ERI computations. Therefore, these challenges and the ever increasing demands from computational chemists provide the main motivation for the present work, which i s an implementation and calculation of high -angular momentum ERIs and ERI derivatives on GPUs. Efforts aimed at computing high -angular momentum functions and their associated gradients on GPUs have been reported using different ERI algorithms 6,30. In this chapter , we describe a new algorithm to overcome issues surrounding the computation of higher -angular momentum functions and their gradients using recurrence relationships carried out on current -generation GPUs with negligible loss of efficiency . In this chapter , we first describe the algorithm we applied to evaluate ERIs, which is based on recurrence relations, one of the fastest ERI evaluation algorithms avai lable especially for ERIs with high -angular momentum functions, and then its extension to analytical gradient computation will be introduced. In the next section, we will briefly 74 describe how to implement ERIs with high -angular momentum functions on GPUs a nd apply this to direct HF and DFT calculations. Moreover, a new strategy to assemble the gradient within the HF framework is introduced. In the last section, detailed benchmarks will be presented. We performed a series of small, medium and large molecule calculations to profile the speed and accuracy performance for problems requiring thousands of Gaussian -type basis functions. Finally we conclude the chapter with a brief discussion and conclusions. 4.2 . THEORY Within the framework of Hartree -Fock theory , the total energy of a closed -shell system can be expressed as the sum of electron -nucleus interactions (the first term), electron -electron interactions (the second term) and nucleus -nucleus interactions (the third term) within the Born ÐOppenheimer approximation !!"! !!!"!!"!"#$ !!!!!"!!"!"!!"!!!!!"#$ !" (1) Here we use the notation !"!!"!!"!"!!"!" to describe the electron repulsion integrals (ERIs), which represent the most expensive part in equation 1 contributing to both the Coulomb and Exchange ERI terms. Four -centered ERIs are given by: !"!"!!!!!!!!!!!!!!!!!!!!"!!!!!"!!!!!" (2) and the derivative of the total energy with respect to nuclear coordinate X A can be written as: 75 !!!"! !!!!!!"!!!"!"#$ !!!!"!!!!!"!!"!!"!!"!!!!!!"!!!"!!!!"!!!!!!!!!"#$ (3) Where the density matrix is !!"!!!!"!!"!!!! and !!"!!!!!!"!!"!!!!. As in electronic energy calculations, the major bott leneck in equation 3 is the evaluation of the ERI derivative. These can also be efficiently computed using recurrence relations in an analogous way to ERI computation as the following sections describe. 4.2.1. 
ELECTRON REPULSION INTEGRALS AND RECURRENCE RELATIONS To evaluate ERIs, many different and efficient algorithms have been developed 31,32, for example, Dupuis, Rys and King (DRK) 33 developed the first algorithm for integral evaluation involving high -angular momentum functions, McMurchie and Davidson(MD) 34 later on used Hermite polynomials to efficiently evaluate integrals over Gaussians -type basis functions, Obara and Saika (OS) 35 and Head -Gordon and Pople (HGP) 36 developed new recursion relationships with fewer terms by focusing on shifting work outside of the contractions loops. In this work, we employed an adapted OS and HGP algorithm to generate a general algorithm that is applicable to a wide range of integral types and offered an efficient implementation on GPUs. We represent ERIs using a linear combination of contracted Gaussian functions because of their well -known mathematical a dvantages. !!!!!!"!!!!!!! (4) 76 A primitive Cartesian Gaussian function centered at !!!!!!!!!!!! with exponent ! is given by !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! (5) and the contracted two -electron integrals are constructed from primitive ERIs by: !"!"!!!"!!"!!"!!"!!"!!"!!"#$ (6) in equation 5, !!!!!!!!!!!!, and a x, a y and a z are a set of integers indicating angular momentum and the direction of the Gaussian functi on. These sums are restricted to functions with the same quantum number. It w ould be more efficient to compute all of the primitive ERIs involving four shells for which ! has the same !!. For example, 3 [ps|ss] type integrals can be computed at the same time where !!!!!!!!!!!!!!and !!!!!. This will lead to complex conditions for high -angular momentum ERIs, for example, [dd|dd] will have 1296 kinds of different ERIs because each index can have 6 different integer combinations to satisfy the pre -conditio n that the sum of these three direction integers equals 2. Therefore, the calculation complexity grows dramatically with the introduction of high -angular momentum function s such as f orbital s. To efficiently evaluate the value of ERIs, Head -Gordon and Pop le (HGP) 36 optimized the recurrence relation algorithm described by Obara and Saika (OS) 35 to reduced the floating point operation count. It is based on the recurre nce equation: !!!!!!"!!!!!!!!"!"!!!!!!!!"!"!!! !!!!!!!!!!!"!!!!!!!!!!!!"!!! !!!!!!!!!!!"!!!!!!!!!!!!"!!! !!!!!!!!!!"!!!!!!!!!!!!!!!!!!"!!!!!!!! (7) 77 Where i is x, y or z, and !!!!!!"!!!"!!!"!, (8) !!!!!, !!!!!, (9) !!!!!!!!!!!!!,!!!!!!!!!!!!!!,!!!!!!!!!!!!!!. (10) and !!!!!!! are exponents of a, b, c and d respectively. Technically, all the integrals c an be ultimately computed from [ss|ss] integrals, which can be analytically evaluated efficiently 35,37. !!!!!!!!!!!!!!!!!"!!"!!!!! (11 ) where !!!"!!!!!!!!! (12 ) !!!!!!!!!!!!!"!! (13 ) !!"!!!!!!!!!"# !!!!"!!!!!!!! (14) In, equation 7, the summation of the four indices decreases as ERIs with higher angular momentum functions are constructed from lower ones, so it is termed a vertical recurrence relation (VRR). Moreover, a horizontal recurrence relation (HRR) is also applicable for Gaussian -type ERIs !!!!!!!!!!"!!!!!!!!!!!!!!!"!!!!!!!!!!!!!!"!!"!!!! (15) and can be applied to contracted ERIs, !!!!!!"!!!!!!!!"!!!!!!!!"!"! =!!"!!!!!!!!!!!!!!!!!!!!!! (16) Which hints that an ERI can always be expressed as a linear combination of (k0|l0) type ERIs. Therefore, a general strategy for evaluation of (ab|cd) is to calculate them from a set of integrals from (a0|b0) to ((a+b)0|(c+d)0). 
Following this strategy, for an ERI with four d-orbital indices, (dd|dd), for example, we first compute (ds|ds) (36 in total), (fs|fs) (100 in total) and (gs|gs) (225 in total), together with the mixed classes between them, via the VRR, and then assemble (dd|dd) using the HRR. Therefore, for the VRR step only the b = 0 and d = 0 situation needs to be considered in equation (7), and the (half) coefficients can be obtained analytically from equation (16) as well. With b = d = 0 the recurrence reduces to

[(a+1_i)0|c0]^{(m)} = (P_i - A_i)[a0|c0]^{(m)} + (W_i - P_i)[a0|c0]^{(m+1)} + \frac{a_i}{2\zeta}\Big([(a-1_i)0|c0]^{(m)} - \frac{\rho}{\zeta}[(a-1_i)0|c0]^{(m+1)}\Big) + \frac{c_i}{2(\zeta+\eta)}\,[a0|(c-1_i)0]^{(m+1)}   (17)

together with the analogous relation in which the roles of a and c are interchanged.

4.2.2. CALCULATION OF DERIVATIVE ERIS

The derivative of a primitive Gaussian function \chi_a with respect to its center is a linear combination of a higher and a lower angular-momentum Gaussian function:

\frac{\partial \chi_a}{\partial A_i} = 2\alpha\,\chi_{a+1_i} - a_i\,\chi_{a-1_i}   (18)

Here A is the function center and i can be x, y or z. Similarly, the first derivative of a primitive ERI is a combination of a higher and a lower primitive ERI:

\frac{\partial [ab|cd]}{\partial A_i} = 2\alpha\,[(a+1_i)b|cd] - a_i\,[(a-1_i)b|cd]   (19)

The recurrence algorithm described in the last section can be adapted to this equation, except that higher-angular-momentum ERIs are needed to compute the first derivative. For example, the first derivative of [dd|ss] requires [ps|ss] and [hs|ss] along with [ds|ss], [fs|ss] and [gs|ss], where the latter three are needed in the ERI evaluation as well. It is worth noting that, to evaluate [dd|dd]-type derivatives, [hs|hs], the integral with the most expensive computing cost, does not need to be evaluated, because the first and third indices never require the higher ERIs simultaneously. Also, because of translational invariance 38, the gradients with respect to the four centers sum to zero:

\Big(\frac{\partial}{\partial A_i} + \frac{\partial}{\partial B_i} + \frac{\partial}{\partial C_i} + \frac{\partial}{\partial D_i}\Big)[ab|cd] = 0   (20)

So, in the worst case, only three rather than four centers must be evaluated, and we can skip the one with the largest estimated computational cost. In the most favorable case, the contraction step can be applied to equation (19) using

\frac{\partial (ab|cd)}{\partial A_i} = \big((a+1_i)b|cd\big)_{2\alpha} - a_i\,\big((a-1_i)b|cd\big)   (21)

where the subscript 2\alpha in the first term on the right-hand side indicates that it has been formed with contraction coefficients scaled by 2\alpha. However, this strategy may be limited by the fact that the GPU may not have sufficient registers and/or memory to hold the auxiliary integrals for high-angular-momentum ERIs. For equation 19, if three centers are evaluated simultaneously, only one set of temporary ERIs, of [ab|cd] type, is needed, whereas equation 21 requires four sets ((ab|cd)_a, (ab|cd)_b, (ab|cd)_c and (ab|cd)). Therefore, efficiency has to be sacrificed in this situation; we return to this issue and discuss the details in the next section.

4.3. IMPLEMENTATION

4.3.1. CUDA

GPUs offer a tremendous amount of computing power in terms of FLOPS while being relatively efficient in terms of heat produced and overall energy cost. However, increased complexity and reduced flexibility are the shortcomings of GPUs when compared with traditional CPU platforms. A GPU is an example of a massively parallel stream-processing architecture using a single-instruction multiple-data (SIMD) model. GPUs process threads in blocks, with 16 to 1024 threads per block, which the programmer specifies in code, and the GPU executes the threads in warps. In the current generation of GPUs the warp size is 32. Threads in one warp must execute the same instruction in the same clock cycle, because each streaming multiprocessor executes in a single-instruction multiple-thread (SIMT) fashion.
Therefore, branching must be addressed to avoid instruction divergence, which significantly affects overall performance. These thread blocks logically map to streaming multiprocessors (SM) in the GPUs; for example, 16 SMs for a NVIDIA M2090 (Fermi architecture) and 15 SMs for a NVIDIA K40 (Kepler architecture). These specifications are useful for optimum performance for code vectorization. Moreover, the Fermi architecture has compute capability of CUDA 2.x, which features fast double -precision computing and atomic operations, while Kepler has CUDA 3.x, which features even faster double -precision computing , in -warp communication and dynamic parallelism. The memory hierarchy of GPUs originates from its graphics lineage. Main memory, known as dynamic random -access memory (DRAM) or global memory, is visible to all threads in all multiprocessors, is relativel y large (for example, 6GB for a M2090, 12 GB for a K40c) but comparatively slow such that that frequent data retrieval from DRAM should be avoided. Shared memory is accessible to all threads within a multiprocessor, but it is small (typically 48kB), but fa ster than DRAM, hence, it plays a critical role for thread communication or memory buffering when frequently fetching from DRAM is 81 necessary. Besides this type of memory, all threads have a small amount of private and fast registers, and in the recent Kepl er platform , threads can gain access to registers of other threads in a same warp and avoid thread communication via shared memory . To maximize the potential of GPU computing, some further suggestions include (1) coalesced memory to access DRAM is recommen ded, (2) the chance of thread divergence should be minimized utilizing shared memory to store intermediate results, and (3) threads can hide memory latency by overlapping c omputations. Moreover, the CPU and GPU are in different physical address spaces a nd DRAM is the only way to synchronize the data between the host (mostly CPU) and the attached devices (GPUs), but this comes with a significant performance penalty due to the slow PCIe bus. Thus, CPU and GPU communication should be avoid if not necessary. In the following sections, we will describe an implementation based on the GPU programming philosophy described above. 4.3.2. MACHINE GENERATED ERI CODE The code to evaluate ERIs, especially with high -angular momentum values, is extremely complicated with many different scenarios and possible coding tricks for optimization, so it is very difficult or even impossible to write ERI code by hand efficiently. In this case automated code generation is almost a requirement 6,9. Our major focus is on the VRR step, which is based on equation 17, while the HRR step, which is given in equation 16, is r elatively easy to write because the branching condition only depends on one index although it requires different subroutines for ERI evaluation and ERI gradient evaluation. 82 First of all, meta -classes were written by a code -generator that describes how an ERI class is related to up to 5 other classes via the determination of the necessary coefficients (using equation 17). The class description is presented in Figure 4.1. The class, named C ij , or relation class, represents the relation between the ERI class [i0|j0] (m) and ERI classes with lower i and j values and higher m values with explicit i and j values but an implicit m value. 
The function F_ij evaluates the integral [i0|j0]^(0) starting from a set of starting integrals ranging from [00|00]^(0) to [00|00]^(i+j); the [00|00]-type integrals, or C_00 class, can be evaluated analytically from equation 11 (the C_00 class and the F_00 function are actually assigned values in our code). The path from the starting integrals to the integral [i0|j0]^(0) can be laid out by a breadth-first search over the tree-like vertical recurrence relation expressed in equation 17. Our code traces the required relation classes starting from F_ij: the queue initially holds C_ij and the pool is empty; C_ij is popped from the queue while its dependent relation classes are added to the pool; then, repeatedly, all relation classes in the pool are pushed into the queue and popped again while their dependent relation classes are added to the pool, until only C_00 classes remain in the pool. As noted, in class C_ij of Figure 4.1 the j index is downgraded while the i value is kept constant, except for [(i-1)0|(j-1)0]^(m+1); however, we can equally well downgrade the i index and produce the same ERI value, so the final selection depends on which choice has the fewest FLOPs, and the same selection is made for F_ij. These selections are made empirically. We hide some parameters, such as P, Q, W and the Gaussian function information, in Figure 4.1 for brevity.

The last, but quite important, step is the final optimization. We simplified the code by eliminating variables that are used only once, replacing such variables with their defining expressions, and by factoring out common sub-expressions. For CUDA, eliminating variables offers a greater benefit because it reduces register usage, which otherwise could spill to slow global memory and jeopardize performance; factoring, by contrast, may have a deleterious effect according to our tests. Using the procedure outlined above we generate efficient ERI evaluation code for [i0|j0] with i and j values up to 6. This is sufficient for [ff|ff] ERIs and [dd|dd] gradient ERIs. These subroutines are long; for example, F_44, which calculates [g0|g0], has 344 lines and F_66 has 8568 lines, which illustrates the necessity of machine-generated code.

Figure 4.1. Class and function description for the ERI code. (i,j)^m represents an ERI class [i0|j0]^(m), and the class C_i,j, or relation class, expresses its relation to classes with lower i and j, the same m, and higher m values. The goal of the function F_i,j is to generate [i0|j0]^(0) from [00|00]^(m), with m ranging from 0 to i+j, by using the classes to express its path. The members of class C_i,j are the individual integrals [a0|b0]^(m) whose indices satisfy L_a = i and L_b = j. Its constructor takes as input the classes C_i,(j-1) at orders (i,j-1)^(m+1) and (i,j-1)^(m), C_i,(j-2) at orders (i,j-2)^(m+1) and (i,j-2)^(m), and C_(i-1),(j-1) at order (i-1,j-1)^(m+1), together with \zeta, \eta and the required prefactors, and uses these inputs to express each member of C_i,j. The function F_i,j is then a list of such constructions, for example: C_0,1 (0,1)^0 is constructed from C_0,0 (0,0)^0 and C_0,0 (0,0)^1; C_0,1 (0,1)^1 from C_0,0 (0,0)^1 and C_0,0 (0,0)^2; and so on (the exponent and center quantities are also inputs but are not listed here).

4.3.3. ERI EVALUATION FOR LOW-ANGULAR-MOMENTUM VALUES

It is relatively straightforward to evaluate ERIs with angular momentum quantum numbers up to 2 (s, p and d).
Technically, F_00 to F_44 (F_00 is not a subroutine but a value) are all that is needed to calculate the most complex such integral, [dd|dd]; it is possible to combine all of these 24 subroutines into one large subroutine, and NVIDIA's CUDA compiler (NVCC) can successfully compile and execute it. In the last chapter 8, we showed the speedup and accuracy of this unoptimized subroutine, and below we illustrate the performance of the optimized code. Moreover, this subroutine can calculate the gradients of s and p orbital integrals. We provide benchmark details below.

4.3.4. ERI EVALUATION FOR HIGH-ANGULAR-MOMENTUM VALUES

For ERIs with higher angular momentum, for example ERIs with f orbitals and ERI gradients of d orbitals, large numbers of registers are required, and NVCC can neither compile nor execute the code if subroutines beyond F_44 are included. However, because of equation 16, which shows that ERIs can be expressed as a linear combination of [k0|l0] or (k0|l0) integrals, we can divide this combination into several parts, for example

[ab|cd] = [ab|cd]_{Z_1} + [ab|cd]_{Z_2} + \cdots   (22)

which expresses the additivity of ERIs. The subscript Z indicates that the corresponding term collects only the contributions of VRR ERIs within zone Z. For example, we can manually set up two zones for the [a0|c0] classes, Z_1 = {a < 3 and c < 3} and Z_2 = {a >= 3 or c >= 3}; Z_1 then covers the ERIs over s and p orbitals only, while Z_2 contains the ERIs involving d and higher angular momentum values. Moreover, the zone decomposition does not involve the exponents, so the contraction step can still be applied:

(ab|cd) = (ab|cd)_{Z_1} + (ab|cd)_{Z_2} + \cdots   (23)

These equations show that if we separate the VRR step into several parts and accumulate the contribution from each part, rather than computing all VRR ERIs as one batch, we still obtain the same result without introducing extra FLOPs into the VRR and with only a few extra FLOPs in the HRR. In addition, the derivatives of ERIs can be computed in this fashion as well, which we will use later:

\frac{\partial [ab|cd]}{\partial A_i} = \sum_Z \frac{\partial\,[ab|cd]_Z}{\partial A_i}   (24)

and, if a contraction step is feasible on the GPU,

\frac{\partial (ab|cd)}{\partial A_i} = \sum_Z \frac{\partial\,(ab|cd)_Z}{\partial A_i}   (25)

With these equations, we can consider a partition scheme. For now, we only consider F_00 to F_66, which covers ERIs up to [ff|ff]. All of these 48 subroutines (again, F_00 is a value rather than a subroutine) could in principle be merged into one subroutine, but, as described above, this fails at the compilation stage. The basic principle of our partition scheme is to bind as many subroutines into one subroutine as possible. We therefore developed the partition design presented in Figure 4.2. This scheme is the best fit for devices of CUDA compute capability 3.x (3.0 and above) and 2.x (2.0 and above but below 3.0), for which 9 and 5 subroutines or zones, respectively, are separated from the global subroutine; functions that fall within the same zone are wrapped into the same subroutine, corresponding to one kernel during the VRR step. Adding further subroutines to a zone (combining zone 3 and zone 1 in Figure 4.2(a), for example) can be neither compiled nor executed on current-generation GPUs. This series of subroutines can be further optimized at the HRR step by pre-excluding selected ERIs.
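The partition of equations 22-25 relies only on linearity: each zone kernel adds its share of the VRR/HRR work into the same accumulator, and the per-zone contributions sum to the full contracted ERI batch. The schematic below illustrates that accumulation pattern with our own names; the per-zone callables stand in for the real CUDA zone kernels of Figure 4.2 and are not the QUICK implementation.

    #include <cstddef>
    #include <functional>
    #include <vector>

    using EriBatch = std::vector<double>;

    // Additivity of equations 22/23: the contracted ERI batch is accumulated
    // zone by zone; no zone needs the auxiliary classes held by the others.
    EriBatch assemble_contracted_eri(
        const std::vector<std::function<EriBatch()>>& zoneKernels,
        std::size_t batchSize)
    {
        EriBatch total(batchSize, 0.0);
        for (const auto& zone : zoneKernels) {
            EriBatch partial = zone();      // VRR + HRR restricted to one zone
            for (std::size_t n = 0; n < batchSize; ++n)
                total[n] += partial[n];     // sum over zones (eq 22)
        }
        return total;
    }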
As we described in the last chapter and as is done, indeed, in most GPU-based SCF implementations, a pre-sorting algorithm developed by Ufimtsev and Martinez 1,2 is used prior to ERI evaluation to ensure that threads execute the same or similar instructions as their neighboring threads within a warp. We again employed pre-sorting in our implementation; we only outline the treatment here, with full details given in our previous publication. Since it is impossible to sort or vectorize the N^4 ERIs, but it is possible to do so for the N^2 half ERIs, a feasible strategy is to replace the four-index ERI problem with an N^2 * N^2 matrix whose two dimensions are a bra (such as [ij|) and a ket (such as |kl]). Once we have this ERI matrix, we can "thread walk" - search from one element of the matrix to the next, calculate the assigned ERI, and continue until all elements have been evaluated. A thread evaluates a contracted ERI rather than a primitive ERI, since this strategy has been shown to be very effective 3. Each dimension of the ERI matrix, bra or ket, is rearranged according to three criteria: the ERI type has the highest priority, with the number of primitive Gaussian functions and the Schwarz cutoff as the second and third criteria, respectively. These criteria were chosen to minimize thread divergence because (1) different ERI types call different ERI subroutines, which take the majority of the calculation time and form the innermost loop of the ERI subroutine; (2) the number of primitive Gaussian functions determines the number of loop iterations calling the ERI subroutine, and is the second innermost loop; and (3) sorting by the Schwarz cutoff maximizes the chance of an all-cutoff or all-pass scenario for the ERIs evaluated in a warp, which is the outermost loop. The Schwarz cutoff estimates an upper bound of an ERI value by means of the Cauchy-Schwarz inequality 39

|[ij|kl]| \le [ij|ij]^{1/2}\,[kl|kl]^{1/2}   (26)

Herein, we improve this treatment in two ways. First, before sorting by ERI type, we split the bra or ket list into two parts, one with high Schwarz values (the dense region, expecting large integral values) and another with low values (the sparse region, expecting small integral values). The splitting threshold can be chosen manually; we selected a value of 10^-4 because we found it gives the best balance between efficiency and accuracy. In the dense region, the ERI values are typically too large to be ignored, and the dominant sorting factors are the ERI types and the primitive Gaussian function numbers if a contraction step is feasible on the GPU. A hidden filter arises from the symmetry of the ERI, [ij|kl] = [ji|kl] = [ij|lk] and so on, so only ERIs with i <= j, i <= k and k <= l need to be calculated. If the dense region were additionally sorted by the Schwarz upper bound, the ordering of the half matrix would effectively be randomized, producing instances with i > j that must be skipped by symmetry (only i <= j is retained); making this conditional check introduces divergence. Without the Schwarz sort, i <= j always holds, which eliminates the need for the conditional check, and the cutoff has little impact in the dense region of the ERI matrix anyway (see Figure 4.3). So in the dense region, the Schwarz cutoff is not needed.
On the other hand, for the sparse region ERIs will have a greater chance to be small enough to encounter the Schwar z cutoff so that once threads in one warp fall 89 into this regi on, it is very likely that all threads in the same warp will skip the ERI evaluation. Hence, in the sparse region, after sorting by ERI types, sorting the half ERI by the Schwarz cutoff upper bound is an efficient strategy. This strategy is illustrated i n Figure 4.3, using an example of a water cluster with 4 molecules 6 † from each other such that both the inter and intra molecular interactions are considerable but neither very strong nor very weak. This clearly splits the half ERIs into dense and spars e regions. This system includes 212 basis functions, 64 shells and 1187 eligible half ERIs. We used a color scheme to illustrate the magnitude of the ERI upper bound. It is comparatively easy to discern the dense*dense and sparse*sparse region and the othe r two areas (dense*sparse ) in between in Figure 4.3. The sparse*sparse area, mostly colored with purple and black regions are crowded with ERIs with a small upper bound while the dense*dense area is filled with orange that represents ERIs expected to have large values. However, the boundary for ERI types is hard to tell for the dense area because the Schwarz cutoff is not introduced in that area, however, the boundary for ERIs in the sparse area can be easily identified. The table next to the ERI matrix in Figure 4.3 indicates the boundaries for the half ERIs. 90 (a) (b) (c) (d) Figure 4.2 . Optimized subroutine partition pattern for (a) CUDA 3.0 and (b) CUDA 2.0 architectures. All F ij functions that are within one zone will be warped in one subroutine and called in a kernel. (c) (d) Kernel call pattern for ERI type [ij|kl] with the partition strategy (a) for CUDA 3.0 and (b) for CUDA 2.0. See text for details. 91 Figure 4.3. Pre -sorti ng and thread walking illustration. The test case is a water cluster with 4 water molecules. The color scheme indicates the upper bound of ERIs, dense*dense, sparse*dense and sparse*sparse areas can be visually identify in the ERI matrix. The table on the right is the boundary for different ERI types. See details in the text. Besides this change, a second change is the thread walk was altered from a linear search to a circular search as shown in Figure 4.3. This adaptation is especially suitable to the first change because the large -valued and small -valued regions are most likely distributed circularly at the origin or at the edge of the ERI matrix. So at the beginning, the thread walk ensures almost every ERI is not cutoff and at the last steps of the wal k, almost every visited ERI will be smaller than the cutoff criteria. Both changes yield a ~30% improvement for small systems and ~15% for large systems (typically for system with 92 more than 1000 basis functions) because of the smaller possibility of thread divergence. We compare other walk strategies, like linear and snake, on alanine chains with lengths of 1, 5, 9 and 13 using the 6 -31g** basis set in Figure 4. The circular is faster for the smaller systems, but for larger all methods explored performed similarly. Other thread walk strategies might offer significant improvements as well but generally are not as good as the circular search and suffer more significant degradations when applied to large systems as well. 
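Returning to the pre-sorting step described above, the sort keys (ERI type first, then the number of primitive Gaussians, then - in the sparse region only - the Schwarz upper bound, after splitting the list at the 10^-4 threshold) can be expressed compactly. The sketch below uses our own struct and field names, not QUICK's actual data layout, and is meant only to make the ordering explicit.

    #include <algorithm>
    #include <vector>

    // One bra or ket half-pair [ij| or |kl] of the N^2 x N^2 ERI matrix.
    struct HalfPair {
        int    eriType;        // angular-momentum class of the shell pair
        int    numPrimitives;  // contraction length of the pair
        double schwarzBound;   // sqrt([ij|ij]), cf. eq 26
    };

    void presort(std::vector<HalfPair>& pairs, double denseThreshold = 1.0e-4)
    {
        // Split into dense (large Schwarz bound) and sparse regions.
        auto isDense = [&](const HalfPair& p) { return p.schwarzBound >= denseThreshold; };
        auto split = std::stable_partition(pairs.begin(), pairs.end(), isDense);

        // Dense region: ERI type, then primitive count (no Schwarz sort).
        std::sort(pairs.begin(), split, [](const HalfPair& a, const HalfPair& b) {
            if (a.eriType != b.eriType) return a.eriType < b.eriType;
            return a.numPrimitives < b.numPrimitives;
        });

        // Sparse region: ERI type, primitive count, then Schwarz bound.
        std::sort(split, pairs.end(), [](const HalfPair& a, const HalfPair& b) {
            if (a.eriType != b.eriType) return a.eriType < b.eriType;
            if (a.numPrimitives != b.numPrimitives) return a.numPrimitives < b.numPrimitives;
            return a.schwarzBound > b.schwarzBound;   // large bounds first
        });
    }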
Another note is that most SCF calculations evaluate only the difference between iterations, so in the final iterations the ERI matrix is much sparser than in the initial ones. When the thread-walking changes are applied, the time needed for these later iterations therefore increases slightly compared with the original thread-walking strategy; nonetheless, the overall time with these modifications is still better than without them.

Figure 4.4. Thread walk strategy comparison. Three thread walk strategies, (a) linear, (b) snake and (c) circular, are compared for GPU calculations. The time used to calculate the 2nd SCF iteration is reported (in seconds) for four test molecules with the 6-31g and 6-31g** basis sets. The "Improvement %" row gives the performance improvement of the circular strategy over the linear strategy.

Molecule (atom number)   α-helix acetyl(ala)18NH2 (189)   β-strand acetyl(ala)18NH2 (189)   Taxol (110)           Valinomycin (168)
Basis set                6-31g      6-31g**               6-31g      6-31g**                6-31g      6-31g**    6-31g      6-31g**
(a) Linear               144.7      435.1                 64.7       203.4                  49.3       149.9      106.8      312.1
(b) Snake                142.0      433.2                 63.1       201.2                  49.0       148.2      105.8      310.7
(c) Circular             135.2      424.1                 57.0       191.8                  45.2       145.2      98.7       301.4
Improvement %            6.6        2.5                   11.9       5.7                    8.2        3.1        7.6        3.4

We visualize the kernel call pattern in Figure 4.2(c) and (d) for an ERI type [ij|kl], with the two axes corresponding to the i+j and k+l values, respectively. The partition strategies for CUDA compute capability 3.x and 2.x are shown in Figure 4.2(a) and (b), respectively. Each cube corresponds to a kernel run for that type; for example, if i+j equals 5 and k+l equals 6, kernels 0, 1, 2, 4, 5 and 7 are necessary to compute its value. From a kernel perspective, a kernel does not contribute to every ERI type but only to certain types. For example, kernel 1 in Figure 4.2(a) (designed for devices of CUDA compute capability 3.x), which wraps the zone-1 subroutines F_(5~6)(0~6), is needed only by ERI types [ij|kl] that fulfill the condition i+j >= 5; similarly, for zone 3, threads only need to search the regions of the ERI matrix whose i+j and k+l values require that zone. Therefore, walking only through the regions that can produce the relevant ERI types minimizes redundant thread walks and further improves performance. These kernel calls are determined statically at the compilation stage to avoid unnecessary branching at runtime.

4.3.5. IMPLEMENTATION OF ERI DERIVATIVE EVALUATION

Besides the single-point energy, the Hartree-Fock method provides an analytical expression for the energy gradient. As for the single-point energy, the bottleneck, as shown in equation 3, is the evaluation of the ERI derivatives, which accounts for 80-90% of the CPU time depending upon system size. To implement the HF gradient calculation on GPUs, two general steps are necessary: the ERI derivative evaluation and the final gradient assembly.

For ERI derivative evaluation, as for normal ERI evaluation, a recurrence relation is used in our implementation. The difference between normal ERI evaluation and derivative evaluation, as discussed above (see equation 19), is the presence of two extra classes of ERIs, one with a higher index and another with a lower one. If, for example, we want the d-orbital gradients, the derivative of [dd|dd] requires the series of subroutines F_(0~5)(0~5), but F_55 is not needed.
However, as descri bed in the last section, the compiler cannot handle all subroutines at once, therefore, a similar, but simpler, partition strategy was developed as shown in Figure 4.5, which represents a subset of the normal ERI evaluation process. As Figure 4.5 shows, th ree zones are indicated, and this partition is suitable for both the CUDA compute capability 2.x and 3.x architectures. The kernel call pattern is quite similar to that of the CUDA compute capability 2.x ERI evaluation but simpler so we do not present it h erein. The thread walk is similar to that used in the normal ERI evaluation. Moreover, the HRR step for zone 1 and 2 are simpler than zone 0. For zone 0, the most efficient algorithm is to evaluate twelve derivatives classes (four centers and three directi ons, but nine classes need to be computed (see equation 20) simultaneously once the value of the auxiliary ERIs from [00|00] to [(i+j+1)0|(k+l)0] and [(i+j)0|(k+l+1)0] are obtained. For the derivative of a primitive ERI, the HRR is applied as much as 18 ti mes because there are 9 classes and each class has two ERIs (a higher and lower function). However, if the contraction step is possible on the GPU significant computational savings can be achieved. Zone 1, for example, is called when (1) (i+j) equals 4, (2 ) for the derivative of center i or j and (3) only when an ERI with a higher function is involved , hence, only 6 and 3 HRR computes are required at most for zone 1 and zone 2. 96 Figure 4.5. Optimized Subroutine Partition for ERI Derivatives. This partition strategy works for both CUDA 3.0 and CUDA 2.0. The kernel call mapping is similar to Figure 4.2 but simplified so it is not shown here. However, the contraction step , sacrificing space usage for reduced FLOP counts , given by equation 25 cannot be applied to zone s 1 and 2 which are wrapped as two separate kernels. This is caused by insufficient register and memory resources available on the GPU. If the contraction step i s applied, as giv en in equation 21, an efficient way to implement this is to store 4 sets of auxiliary classes whose class size increases exponentially with the growth in quantum number: For example, because the degeneracy number of s,p,d,f,g,h,i orbitals are 1,3,6,10,15,2 1,28, so for [dd|, L=4, we have 1+3+6+10+15 = 35, therefore, [dd|dd] requires 35 #35 for each class and 56 #56 for the derivative; [ff|ff] requires 84 #84 if each thread computes one contracted ERI, which is 97 infeasible for ERIs with higher -angular momentum fu nctions even though one set is possible as discussed in the last section regarding [ff|ff] evaluation. So for zone 1 and zone 2, the contraction step is skipped because it is computationally infeasible. In this way CUDA executes the subroutines and each th read will work on a primitive ERI instead of a contracted one. We show below that with or without the contraction step we can achieve impressive performance. Another stage is assembling the gradient. In the HF energy implementation, we used atomic operat ions, a set of lock -set-unlock operations to void thread conflicts and assemble the Fock matrix in global memory. However, the drawback of atomic operations in our hands was their speed and accuracy. The clock cycle of atomic operations was generally two -fold less than that of registers and one -fold with respect to shared memory, even though this penalty could be reduced by coalescing memory access and thread access. 
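Because double-precision atomicAdd is not provided natively on devices of compute capability below 6.0 (Fermi and Kepler), a common workaround is the compare-and-swap loop documented in the CUDA C Programming Guide, combined with thread-local accumulation so that each output element receives a single atomic update. The sketch below illustrates that pattern with our own names; it is not the QUICK Fock or gradient kernel itself, and the index mapping to atoms is omitted.

    // Double-precision atomicAdd via atomicCAS, as documented for devices of
    // compute capability < 6.0.
    __device__ double atomicAddDP(double* address, double val)
    {
        unsigned long long int* addr = (unsigned long long int*)address;
        unsigned long long int old = *addr, assumed;
        do {
            assumed = old;
            old = atomicCAS(addr, assumed,
                            __double_as_longlong(val + __longlong_as_double(assumed)));
        } while (assumed != old);
        return __longlong_as_double(old);
    }

    // Pattern for gradient assembly: accumulate a thread's contributions to
    // the (at most) twelve components locally, then issue one atomic update
    // per component. (Hypothetical kernel body; the real ERI-derivative
    // evaluation is elided.)
    __global__ void accumulateGradient(double* grad)
    {
        double local[12] = {0.0};   // four centers x three Cartesian directions
        // ... evaluate the assigned ERI derivatives and add them into local[] ...
        for (int c = 0; c < 12; ++c)
            atomicAddDP(&grad[c], local[c]);   // one atomic per component
    }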
Atomic operations are not particularly fast on Fermi GPUs but for Kepler GPUs they are ~3x faster than the previous generation of cards. Therefore, we expect a similar accuracy using atomic operations for the SCF energy and gradient but with increased performance from the Kepler GPUs. Compared to the HF energy calculation, the number of atomic operations is significantly less for the HF gradient. This is because for the HF energy, an ERI class contains many individual ERIs ( e.g., (pp|pp) contains 81 ERIs) and each ERI contributes to up to six elements to the density matrix via atomic operations (e.g., the (pp|pp) class, in the worst case scenario, requests up to 486 atomic operator calls). But for 98 assembling the gradient, an ERI class shares common centers which means they contribute to the same gradient elements, thus, the cont ribution from the ERI derivative can be assembled locally at the thread level. The gradient is then updated using a single atomic operator call regardless if the contraction step is introduced, so a thread calls at most 12 atomic operators at once for zon e 0, 1 and 2. We use a presorting step for the energy computation and reutilize this for the gradient computation. We could have created a pre -sorting strategy for the gradient computation as well, but this added overhead would eliminate the realized compu tational gains. 4.4. BENCHMARK RESULTS AND DISCUSSION We implemented and benchmarked our ERI and ERI derivative code in our quantum chemistry package QUICK 40 for both CPUs and GPUs. The QUICK CPU code was originally written in FORTRAN and the GPU code is rewritten in C++ together with the machine -generated code described above. We built the CPU version with the INTEL FORTRAN Compile r (version 12.1.5 20120612 ) with optimization option level 3 ( -O3). For the GPU version, the Intel C++ Compiler (version 12.1.5 20120612 ) and CUDA compiler 4.2 V0.2.1221 with optimization option level 3( -O3) was used. Moreover, the fast math library option was used ( -use_fast_math) and all calculations used double precision unless otherwise indicated. In our benchmarks, the CPU we used was a single core 3.07GHz Intel Xeon CPU X5675. The GPU test platform was a NVIDIA TESLA K40. This GPU card features 12 Gb of global memory, 15 streaming multiprocessors and 2880 CUDA cores built based on the KEPLER architecture. In addition, we turned the 99 ECC off and clocked to 875MHz to maximize the performance. The K40 card has compute capability 3.5, so the kernel partitio n and kernel call pattern will follow Figures 2(a) and (c). First, we present the tests we carried out on SCF energy calculations involving f orbital ERI contributions. In the SCF energy GPU implementation, all primitive ERIs were calculated on -the -fly an d the Fock matrix was assembled in global memory, while for the CPU version, the primitive ERIs can be saved in memory to form contracted ERIs for higher efficiency which cannot be done on GPUs due to a shortage of register space. Both the CPU and GPU vers ions used the direct SCF procedure, which has been shown 1,2,7,8 to be suitable for GPU -based SCF calculations. Any integrals smaller than 10 -9 were neglected both in the CPU and GPU benchmark studies. ERI evaluation with f orbital contributions on GPUs will be treated with several kernel runs as discussed above, while for the CPU we ran in a traditional manner. We studied linear alanine chains with lengths ranging from 1 to 26 with the 6 -31G(df, pd) basis set (with 280 to 4702 basis functions). 
The results are summarized in Figure 4.6. We plotted the time used for the second iteration to form the Fock Matrix with the superposition of atomic den sities (SAD) initial guess. 41,42 We exclude the one -electron contribution in the plot , which is relatively small and does not need to be calculated every iteration. Besides the time to form the Fock Matrix, diagnalization is another critical step for each iteration, and while it is relatively small for small systems it will ultimately dominate for very large systems because it scales as O(N 3) which is large r than the 100 scaling dependence of direct SCF calculations. We use the second iteration for our benchmark studies because comparing with first iteration timings can lead to misstating the overall GPU performance. Meanwhile, besides the time to run a single i teration, the time cost associated with different kernels is plotted as well. As observed in Figure 4.6, a GPU yields an 8 -18 times speedup when compared to a single core of a CPU with the speedup increasing with an increase in system size. This tendency was observed in last chapter 8 and in the work of other groups 9. This tendency is due to three reasons: (a) larger systems means larger ERI matrices that lead to less thread divergence; (b) larger systems produce more ERIs that are small enough to be cutoff and (c) reduction in chance collisions of thread requests to the sa me density matrix element in larger systems, which reduces the atomic operation penalty. T he most time consuming kernel is kernel 0 (see Figure 4.6), which is not surprising since every ERI including ERIs with f -type Gaussian function contribution s calls t his kernel. Besides kernel 0, kernels 1 -8 are called when the ERI has an f Gau ssian function contribution with the accumulated time taking roughly 30% of the total time. 101 Figure 4.6. Calculation Time profiles for the CPU and GPU calculations on alanine chains. Timings for the CPU and GPU kernel time (top two curves) are based on the Fock matrix formation excluding the one -electron contribution. Timings for kernel 0 to kernel 8 are given as well and the kernel call mapping corresponds to Figure 4.2. Platform and software details are described in the text. A logarithmic scale is used for the Y axis. The SCF energy error between the CPU and GPU calculation using double -precision is within 10 -7 a.u. for the alanine chain test set which is quite accu rate given the different execution order of the floating point operations for the GPU and CPU implementations. To further analyze our implementation, we examined several systems with different basis sets with (6 -31G(df,pd) and 6 -31G(2df,2pd)) and without f orbital contributions (6 -102 31G**). The systems we tested include relatively small molecules such as taxol, valinomycin, and small proteins containing around 200 a toms including the PDB 43 IDÕs of 1AKG, 1CNL, 1M2C and 1PEN. The results are presented in Table 4.1. The number of basis functions varies from more than a thousand (taxol at the 6 -31G** level) to more than five thousand (1M2C at the 6 -31G(2df, 2pd) level). The realized speedup is11 -13 times and varies slightly with the basis set employed. The listed times used for kernel 1 -8 accounts for approximately 9.6%, 12.0%, 1.2%, 1.1%, 1.7%, 0.6%, 0.6%, and 1.0% of the total time and ~27% overall. According to Table 4.1, the performance of f orbital computing with a GPU is no t significantly impacted by our partition treatment. 
Actually, 6-31G(df,pd) and 6 -31G(2df,2pd) calculations benefit from GPU usage because of the Òlarge basis set effectÓ described above and along with our partition treatment (which goes against some conv entional thinking on this topic 44). The partition scheme might also be applied to kernel 0, where we separate kernel 0 into se veral smaller kernels, to potentially gain better performance. However, we found this idea did not work because these subroutines are relatively small so the overhead penalty of GPU kernel initialization, repeated ERI screening calculations and extra calcu lations to select the corresponding subroutine reduces the benefit brought about by our partitioning scheme. In sum, to evaluate ERIs with f Gaussian functions, we need slightly more time for the calculations, but the overall speedup, even considering the extra time used, is still impressive, demonstrating the accuracy, feasibility and efficiency of our treatment of ERIs with contributions from f functions. 103 The next benchmark is focused on SCF gradient calculation using the same test systems (polyalanines) as before. The GPU is used only to accelerate the most time -consuming part, which is the derivative of 4 -center ERIs while other contributions such as the one -electron integral derivatives are calculated on the CPU since they only take a minor portion of the overall computation time. In this test, for both the CPU and GPU gradient calculations, the energy calculation itself was performed on the GPU to ensure we are using the same density matrix. The speedup of the GPU over the CPU calculation is illustrate d in Figure 4.7(a) by comparing the time used to evaluate the derivative ERI part of the gradient calculation. As Figure 4.7(b) indicates, the accuracy of the GPU gradient compared to the CPU as indicated by the root mean squared deviation (RMSD) of calcul ated gradients and the max difference (bolded) is quite good. The first 6 points for 6-31G are not presented because the difference is smaller than 10 -12. In terms of accuracy, the RMSD of the gradient difference between the CPU and GPU is less than 10 -10 and 10 -11 a.u. for the 6 -31G** and 6 -31G basis sets while the maximum difference is within 10 -9 and 10 -10 a.u. respectively. According to Figure 4.7(a), we find that our GPU implementation speeds up the gradient calculation by as much as 19 times ((alanin e)30 with the 6 -31G** basis set and 3010 basis functions). When compared to the GPU speedup of the SCF energy, the observed speedup here is slightly better due to the reduced number of atomic operations. To further analyze the differences in kernel time usage, we used the same set of systems used in the analysis of the energy calculation. The results are summarized in Table 4.2. Generally, the extra calculation (kernel 1 and kernel2) arising from the partition strategy 104 represents about 12% of the total ti me (previously it was ~30% on average). Overall the speedup of the gradient calculation on the GPU relative to the CPU was between 12.9 -16.1 times, which, as expected due to the reduction in atomic operations, is slightly better than what we observed for t he realized speedups in the HF energy calculation on a GPU. One important application of the energy gradients is geometry optimization. To test our implementation we optimized the (glycine) 12 chain with the 6 -31G and 6 -31G** basis set using the L -BFGS algorithm and profiled the performance and measured the accuracy of the geometries obtained using the CPU or GPU code base. 
These calculations involve 87 atoms and 514 and 925 basis functions, respectively. The starting geometry was initially optimized with the STO -3G basis set using the GPU code. In order to compare the accuracy of the GPU gradient along with the GPU energy, three sets of computations were carried out. These include CPU (energy) + CPU (gradient), GPU (energy) + CPU (gradient), and GPU (energ y) + GPU (gradient). We present the geometry RMSDs of the GPU+CPU and GPU+GPU relative to the CPU+CPU result from the same step in Figure 4.8. As expected the error observed using the GPU energy and gradient calculation increases as the number of steps inc reases. The error in the 6 -31G** result is slightly larger than that of 6 -31G due to the larger number of basis functions employed. At geometry convergence, the RMSD is mostly within the 10 -4 magnitude for both the GPU+CPU a nd GPU+GPU calculations with 6 -31G and 6 -31G** basis sets. The energy differences obtained for the GPU+CPU and GPU+GPU optimizations relative to the CPU+CPU result (in atomic units) are 1.80 #10-6 and 3.21 #10-6 for 6 -31G and 7.46 #10-6 and 4.28 #10-7 for 6 -31G**. 105 (a) (b) Figure 4 .7. (a) Speedup comparison and (b) gradient deviation comparison between CPU and GPU calculations with different basis sets for alanine chain lengths from 1 to 30. Timing for CPU and GPU are based on the gradient generation time including data transfer but e xcludes the one -electron integral contribution. Platform and software details are described in text. For (b) RMSD and maximum difference (bolded) between the GPU and CPU result are presented. First 6 points are not presented because the difference is smaller than 10 -12. A logarithmic scale is used for the Y -axis. 106 We also explored the memory usage of the GPU calculations. ERI evaluation with high -angular momentum functions does not allocate extra memory, and therefore, the peak memory usage is the same as in our earlier efforts 8. For the gradient calculation on the GPU, a relatively small amount of memory is required to store the gradient value and it scales linearly with the number of atoms. For the K40 card with 12 gigabytes, for example, calculation of t he (alanine) 28 chain using the 6 -311G(2df, 2pd) basis set with 6780 basis functions uses almost every bit of global memory, which is one of the largest systems that the K40 is able to handle in our hands. Hence, with Quick, GPU global memory is the limitin g factor for larger systems currently. This limit, together with bandwidth and the number of registers that limit calculation speed and size will be mitigated with incoming GPU technologies such as NVLink which will boost bandwidth and available memory. 107 Figure 4.8. Accuracy of geometry optimization using a GPU versus a CPU. The test molecule was (Glycine) 12 with the 6 -31G and 6 -31G** basis sets. Three sets of computations were executed including CPU+CPU, GPU+CPU, GPU+GPU. The curves indicate geome try RMSD relative to the CPU+CPU results at the same step. 108 Table 4.1. Performance Comparison between CPU and GPU SCF Calculation a,b Molecule/ Atom Number Basis Sets/ No. 
of Basis Function Kernel Time/s GPU total/s CPU total/s Speedu p 0 1 2 3 4 5 6 7 8 taxol (110) 6-31G**/1160 132.11 - - - - - - - - 132.11 1735.13 13.1 6-31G(df,dp)/2064 607.57 86.23 114.41 10.79 10.63 17.22 4.78 5.25 4.68 861.55 13178.28 15.3 6-31G(2df,2dp)/2577 1488.60 221.32 283.74 20.05 20.22 38.49 8.95 9.28 9.05 2099.70 28856.43 13.8 valinomyc in (168) 6-31G**/1620 276.29 - - - - - - - - 276.29 3397.98 12.3 6-31G(df,dp)/2940 1206.36 158.77 205.08 21.29 19.24 29.98 10.56 11.00 16.77 1679.05 24905.01 14.9 6-31G(2df,2dp)/3678 3014.18 411.14 513.33 41.63 37.88 68.08 21.46 21.73 36.03 4165.46 56135.19 13.4 1AKG (209) 6-31G**/2171 616.04 - - - - - - - - 616.04 7290.80 11.8 6-31G(df,dp)/3839 2874.78 370.15 469.47 53.39 47.25 68.28 28.20 28.47 47.67 3987.67 49457.74 12.4 6-31G(2df,2dp)/4829 6220.36 841.31 1055.04 90.57 81.01 141.26 48.86 48.28 86.58 8613.27 107771.61 12.5 1CNL (169) 6-31G**/1771 424.36 - - - - - - - - 424.36 5054.76 11.9 6-31G(df,dp)/3149 2033.50 263.78 335.74 35.90 32.47 48.86 17.67 18.57 26.28 2812.77 34725.38 12.4 6-31G(2df,2dp)/3929 4405.82 597.59 743.98 62.04 56.40 97.97 29.60 30.72 49.02 6073.13 75738.14 12.5 1M2C (220) 6-31G**/2276 704.04 - - - - - - - - 704.04 7969.27 11.3 6-31G(df,dp)/4060 3245.16 410.62 511.75 61.66 52.59 76.42 33.30 34.26 60.80 4486.57 53311.56 11.9 6-31G(2df,2dp)/5068 7055.35 937.31 1158.67 107.66 95.43 156.81 57.60 57.60 110.93 9737.35 119672.64 12.3 1PEN (203) 6-31G**/2131 607.46 - - - - - - - - 607.46 7537.26 12.4 6-31G(df,dp)/3789 2812.73 365.32 464.97 54.55 48.75 69.25 28.03 28.10 46.13 3917.83 48023.12 12.4 6-31G(2df,2dp)/4728 6101.39 829.10 1034.71 88.97 80.81 138.65 46.94 46.85 82.34 8449.76 106282.32 12.6 a. Time listed in the table indicates the time used to form the Fock Matrix in the 2 nd iteration of the SCF calculation excluding the one -electron contribution and the diagonalization time. b. Platform and software details are described in the text. 109 Table 4.2. Performance Comparison between CPU and GPU HF -Gradient Calculation a,b Molecule/ Atom Number Basis Sets/ No. of Basis Function Kernel Time/s GPU total/s CPU total/s Speedup 0 1 2 taxol (110) 6-31G/647 98.00 - - 98.00 1345.82 13.7 6-31G**/1160 379.70 18.65 33.78 432.13 6939.49 16.1 valinomycin (168) 6-31G/882 183.47 - - 183.47 2584.33 14.1 6-31G**/1620 691.03 35.75 57.62 784.40 12538.55 16.0 1AKG (209) 6-31G/1211 444.66 - - 444.66 5729.57 12.9 6-31G**/2171 1522.28 75.22 132.96 1730.46 25422.03 14.7 1CNL (169) 6-31G/991 303.96 - - 303.96 3914.50 12.9 6-31G**/1771 1078.59 49.77 90.32 1218.67 17861.00 14.7 1M2C (220) 6-31G/1268 477.74 - - 477.74 6238.65 13.1 6-31G**/2276 1677.16 86.20 146.89 1910.26 27997.59 14.7 1PEN (203) 6-31G/1192 423.68 - - 423.68 5616.09 13.3 6-31G**/2131 1500.19 74.27 132.99 1707.45 25548.42 15.0 a. Time listed in the table indicates the time used to calculate the gradient using the HF method excluding the one -electron contribution b. Platform and software details are described in the text. 110 4.5. CONCLUSIONS In this chapter , we implemented the evaluation of ERIs up to f orbitals and ERI derivatives up to d orbitals using GPU technology and ERI recurrence relations. Our SCF and gradient calculations demon strate the efficiency of GPUs where speedups of 10~20 times faster are expected over modern CPUs. A partition strategy is introduced to solve the difficulties encountered in computing ERIs including f orbitals and was further applied to d orbital gradient calculation. 
Importantly, we observed a very limited efficiency decrease after employing this strategy. A well -sorted pre -sorting strategy and several other improvements boost overall GPU performance and atomic operations are used to reduce data transfer w hich is particularly effective on a Kepler GPU. Moreover, like other GPU -enabled quantum chemistry software 4,14, GPU-based DFT calculation is also available in QUICK although we did n ot touch on this topic. For pure DFT methods, such as BLYP and LDA, the difficulty of GPU coding is significantly easier than encountered in HF while the speedup that a GPU can bring is at the same level seen for HF. Now with the help of ERI acceleration, hybrid DFT methods, such as B3LYP, is available by simply calculating the HF and BLYP contributions to exchange -correlation part, respectively, using the GPU. Even with the observed performance increase using GPUs, there is still room to improve. Our lon g-term goal is to efficiently and accurately investigate complex biological systems with GPUs and most calculations will involve s, p and d orbitals and f orbitals when large basis sets or calculations involving metal ions are employed. Hence, f orbital 111 gradients is currently the next piece of the puzzle . However, even though we have generated the F (0~7)(0~7) code with an automatic code generator, which is sufficient for the f orbital gradient calculation, the difficulty is that even if the partition strate gy is used as well, a kernel that only calls the subroutine F 76, for example, will fail at the compilation stage because the memory requirements are beyond the current generation of GPU cards. We developed a compromise approach to deal with this dilemma th at split a given subroutine into several kernels but find that the overall performance is quite poor. So we currently run this part of the calculation on the CPU. However, GPU cards and software continues to improve so solutions to this problem are likely to be available in the near future. Our future work will focus on following aspects. First we are integrating our source code with the AMBER MD package to further enable ab initio QM/MM simulations 15. Second is to develop GPU enabled correlated ab initio methods such as MP2 and coupled -cluster methods. Moreover, GPU enabled second order derivatives would also be helpful for scientists who want to calculate frequencies or other properties. Additionally, a multi -GPU implementation and implementation of the code on INTEL PHI platform is on going in order to have another option to accelerate ab initio calculation 1,9. 112 APPENDIX 113 SAMPLE OF MACHINE GENERATED CODE The goal of the machine -generated code is to generate a function such as void h_2_2(QUICKDouble* store, arguments) where store is an double -precision float(type name QUICKDouble, a redefined double type) array that saved ERIs. The purpose of above function is to evaluate (20|20) or (ds|ds) integrals and save them to the ERI container store . Arguments are some constant parameters that describes molecular and basis sets information. Within h_2_2, __device__ __inline__ void h_2_2(QUICKDouble* store,arguments) { // call for L = 0 B = 1 f_0_1_t f_0_1_0 ( VY(0), VY(1), arguments); // call for L = 0 B = 1 f_0_1_t f_0_1_1 ( VY(1), VY(2), arguments); // call for L = 0 B = 2 f_0_2_t f_0_2_0 ( f_0_1_0, f_0_1_1, VY(0), VY(1), arguments); // call for L = 0 B = 1 f_0_1_t f_0_1_2 ( VY(2), VY(3), arguments); É. 
// call for L = 2 B = 2 f_2_2_t f_2_2_0 ( f_1_2_0, f_1_2_1, f_0_2_0, f_0_2_1, f_1_1_1, arguments); 114 // WRITE LAST FOR I = 2 J= 2 LOC2(store, 4, 4, STOREDIM, STOREDIM) += f_2_2_0.x_4_4 ; LOC2(store, 4, 5, STOREDIM, STOREDIM) += f_2_2_0.x_4_5 ; É } Here, VY(i ) stands for F00 classes with m value equals to i because they are a set of value rather than classes. f_[i]_[j]_t class is a ERI (i0|j0) type with each member of the class is one comb ination of the ERI, for example, // Class for L = 0 B = 2 class f_0_2_t { // (ss|sd) ERI type public: QUICKDouble x_0_4 ; // (s=0, s = 0, s=0, d = xx) QUICKDouble x_0_5 ; // (s=0, s = 0, s=0, d = yy) QUICKDouble x_0_6 ; // (s=0, s = 0, s=0, d = zz ) QUICKDouble x_0_7 ; // (s=0, s = 0, s=0, d = xy) QUICKDouble x_0_8 ; // (s=0, s = 0, s=0, d = xz) QUICKDouble x_0_9 ; // (s=0, s = 0, s=0, d = yz) __device__ __inline__ f_0_2_t( f_0_1_t t_0_1_0, f_0_1_t t_0_1_1, QUICKDouble t_0_0_0, QUICKDouble t_0_0_ 1, arguments); }; 115 Size of this class is enormous for ERIs with higher angular momentum value. And the inline function, expresses the relation between one ERI type and ERIs with lower angular momentum value. For example, __device__ __inline__ f_0_2_t :: f_0_2_t ( f_0_1_t t_0_1_0, f_0_1_t t_0_1_1, QUICKDouble t_0_0_0, QUICKDouble t_0_0_1, arguments ) { x_0_4 = Qtempx * t_0_1_0.x_0_2 + WQtempx * t_0_1_1.x_0_2 ; x_0_5 = Qtempy * t_0_1_0.x_0_3 + WQtempy * t_0_1_1.x_0_3 ; x_0_6 = Qtempx * t_0_1_0.x_0_3 + WQt empx * t_0_1_1.x_0_3 ; x_0_7 = Qtempx * t_0_1_0.x_0_1 + WQtempx * t_0_1_1.x_0_1 + CDtemp * ( t_0_0_0 - ABcom * t_0_0_1 ) ; x_0_8 = Qtempy * t_0_1_0.x_0_2 + WQtempy * t_0_1_1.x_0_2 + CDtemp * ( t_0_0_0 - ABcom * t_0_0_1 ) ; x_0_9 = Qtempz * t_0_1_0.x_0_3 + WQtempz * t_0_1_1.x_0_3 + CDtemp * ( t_0_0_0 - ABcom * t_0_0_1 ) ; } In sum, function h_[i]_[j] is to calculate ERI type (i0|j0) by the assistant of sets of f_[i]_[j]_t classes to evaluate the value from VY(i), a set of F 00 or (00|00) (m) values. The code is extremely long and complicated, for example, 788 lines for f_6_6_t, and 124 f_[i]_[j]_t is constructed and computed as auxiliary ERIs for h_6_6 to calculate (60|60) ERI type containing 28*28 = 784 elements. 116 REFERENCES 117 REFERENCES (1) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2009, 5, 1004-1015. (2) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2009, 5, 2619-2628. (3) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2008, 4, 222-231. (4) Yasuda, K. J. Chem. Theory Comput. 2008, 4, 1230-1236. (5) Yasuda, K. J. Comput. Chem. 2008, 29, 334-342. (6) Asadchev, A.; Allada, V.; Felder, J.; Bode, B. M.; Gordon, M. S.; Windus, T. L. J. Chem. Theory Comput. 2010, 6, 696-704. (7) Asadchev, A.; Gordon, M. S. J. Chem. Theory Comput. 2012, 8, 4166-4176. (8) Miao, Y. P.; Merz, K. M. J. Chem. Theory Comput. 2013, 9, 965-976. (9) Titov, A. V.; Ufimtsev, I. S.; Luehr, N.; Martinez, T. J. J. Chem. Theory Comput. 2013, 9, 213-221. (10) Luehr, N.; Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2011, 7, 949 -954. (11) Wilkinson, K. A.; Sherwood, P.; Guest, M. F.; Naidoo, K. J. J. Comput. Chem. 2011, 32, 2313-2318. (12) Gıtz, A. W.; Wılfle, T.; Walker, R. C. In Annual Reports in Comp utational Chemistry ; Ralph, A. W., Ed. 2010; Vol. 6, p 21 -35. (13) Kulik, H. J.; Martinez, T. J. Abstr Pap Am Chem S 2012, 244. (14) Kulik, H. J.; Luehr, N.; Ufimtsev, I. S.; Martinez, T. J. J Phys Chem B 2012, 116, 12501-12509. (15) Isborn, C. M.; Gotz, A. W.; Clark, M. A.; Walker, R. C.; Martinez, T. J. J. Chem. Theory Comput. 
2012, 8, 5092-5106. (16) Ufimtsev, I. S.; Luehr, N.; Titov, A.; Martinez, T. Abstr Pap Am Chem S 2011, 242. (17) Ufimtsev, I. S.; Luehr, N.; Martinez, T. J. J Phys Chem Lett 2011, 2, 1789-1793. (18) Kulik, H. J.; Isborn, C. M.; Luehr, N.; Ufimtsev, I.; Martinez, T. J. Abstr Pap Am Chem S 2011, 242. 118 (19) Bhaskaran -Nair, K.; Ma, W. J.; Krishnamoorthy, S.; Villa, O.; van Dam, H. J. J.; Apra, E.; Kowalski, K. J. Che m. Theory Comput. 2013, 9, 1949-1957. (20) Ma, W. J.; Krishnamoorthy, S.; Villa, O.; Kowalski, K. J. Chem. Theory Comput. 2011, 7, 1316-1327. (21) Isborn, C. M.; Luehr, N.; Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theory Comput. 2011, 7, 1814-1823. (22) Vogt, L.; Olivares -Amaya, R.; Kermes, S.; Shao, Y.; Amador -Bedolla, C.; Aspuru -Guzik, A. J. Phys. Chem. A 2008, 112, 2049-2057. (23) Isborn, C. M.; Luehr, N.; Gour, J. R.; Ufimtsev, I. S.; Martinez, T. J. Abstr Pap Am Chem S 2011, 242. (24) Gotz, A. W.; Wil liamson, M. J.; Xu, D.; Poole, D.; Le Grand, S.; Walker, R. C. J. Chem. Theory Comput. 2012, 8, 1542-1555. (25) Salomon-Ferrer, R.; Gıtz, A. W.; Poole, D.; Le Grand, S.; Walker, R. C. J. Chem. Theory Comput. 2013, 9, 3878-3888. (26) D.A. Case; T.A. Darden; T.E. Cheatham, I.; C.L. Simmerling; J. Wang; R.E. Duke; R. Luo; R.C. Walker; W. Zhang; K.M. Merz; B.P. Roberts; B. Wang; S. Hayik; A. Roitberg; G. Seabra; I. Kolossv⁄ry; K.F. Wong; F. Paesani; J. Vanicek; J. Liu; X. Wu; S.R. Brozell ; T. Steinbrecher; H. Gohlke; Q. Cai; X. Ye; J. Wang; M. -J. Hsieh; G. Cui; D.R. Roe; D.H. Mathews; M.G. Seetin; C. Sagui; V. Babin; T. Luchko; S. Gusarov; A. Kovalenko; Kollman, P. A. AMBER11,University of California, San Francisco. 2010. (27) DePrince, A. E.; Hammond, J. R. J. Chem. Theory Comput. 2011, 7, 1287-1295. (28) DePrince, A. E.; Kennedy, M. R.; Sumpter, B. G.; Sherrill, C. D. Mol Phys 2014, 112, 844-852. (29) Asadchev, A.; Gordon, M. S. J. Chem. Theory Comput. 2013, 9, 3385-3392. (30) Titov, A. V .; Ufimtsev, I.; Martinez, T.; Dunning, T. H. Abstr Pap Am Chem S 2010, 240. (31) Gill, P. M. W. Adv Quantum Chem 1994, 25, 141-205. (32) Fletcher, G. D. Int J Quantum Chem 2006, 106, 355-360. (33) Rys, J.; Dupuis, M.; King, H. F. J. Comput. Chem. 1983, 4, 154-157. (34) McMurchie, L. E.; Davidson, E. R. J. Comput. Phys. 1978, 26, 218-231. (35) Obara, S.; Saika, A. J. Chem. Phys 1988, 89, 1540-1559. (36) Head-Gordon, M.; Pople, J. A. J. Chem. Phys 1988, 89, 5777-5786. 119 (37) Boys, S. F. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences 1950, 200, 542-554. (38) Komornicki, A.; Ishida, K.; Morokuma, K.; Ditchfield, R.; Conrad, M. Chem Phys Lett 1977, 45, 595-602. (39) Strout, D. L.; Scuseria, G. E. J. Chem. Phys 1995, 102, 8448-8452. (40) Miao, Y; He, X; Ayers, K.; Brothers, E.; Merz, K.M. QUICK, version 2.0.140304; University of Florida: Gainesville, FL, 2013, Download free at: http://www.merzgroup.org/quick.html (41) He, X.; Merz, K. M. J. Chem. Theory Comput. 2010, 6, 405-411. (42) Van Lenthe, J. H.; Zwaans, R.; Van Dam, H. J. J.; Guest, M. F. J. Comput. Chem. 2006, 27, 926-932. (43) Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shind yalov, I. N.; Bourne, P. E. Nucleic Acids Res 2000, 28, 235-242. (44) "Megakernels Considered Harmful: Wavefront Path Tracing on GPUs" Samuli Laine (NVIDIA), Tero Karras (NVIDIA), Timo Aila (NVIDIA), in High -Performance Graphics 2013, July 2013 120 CHAPTER 5 . GPU ACCELERATION ON HISTOGRAM ANALYSIS 121 5.1. INTRODUCTION The concept of the potential of mean force (PMF) [1] !!!! 
with coordinate \xi, was originally introduced by Kirkwood and is frequently used to understand the mechanism of "rare" transitions in solids, fluids or complex biochemical systems. To obtain accurate estimates of a free energy barrier along a reaction coordinate that lies substantially above the minima, a standard canonical (NVT) simulation provides little or no sampling in the barrier region. To address this issue, a set of N separate "umbrella" sampling [2,3] windows is created using a biasing potential w_i(\xi) that confines the system in the neighborhood of a position \xi_i in order to enhance sampling. A typical umbrella biasing potential uses the harmonic form

w_i(\xi) = k_i\,(\xi - \xi_i)^2   (1)

which restrains the system at position \xi_i with a force constant k_i. The windows created along a reaction path are then "unbiased" and combined to obtain the PMF over the whole region of interest. The weighted histogram analysis method (WHAM) described by Kumar et al. [4-6] is an algorithm that unbiases a set of simulations that have sampled different regions of \xi and then assembles the free energy profile. The method is based on the histogram method proposed by Ferrenberg and Swendsen [7], whose central idea is an optimal estimate of the unbiased distribution function as a weighted sum over the data extracted from the simulations, obtained by maximizing overlap and minimizing the statistical error. Not only single-coordinate [8,9] but also multidimensional free energy surfaces [5] can be determined with this method, and many successful applications have been reported [10-12]. However, one of the difficulties of multidimensional WHAM is its cost, which is O(Nm), where N is the number of simulations and m is the number of grid points. Here the grid points are defined by the WHAM calculation itself and represent a further parsing of the region between umbrella sampling points. Hence, depending on the data set size, the computational cost can run into days on a single processor for high-resolution WHAM calculations.

The use of graphics processing units (GPUs) in computational chemistry and biology has expanded rapidly over the past decade, owing to the availability of GPU software development tools and because GPUs offer massively parallel computing at a reasonable cost. Compared with traditional parallel approaches such as OpenMP or MPI, which distribute work over a comparatively small number of CPU cores or processes, GPU programming is especially suitable for calculations that require massive data parallelism, where the computation of each thread is independent of the others. The most widely used GPU programming interface, introduced by NVIDIA, is the Compute Unified Device Architecture (CUDA) [13]. Within the CUDA framework, the most basic units, called threads, are arranged in one, two or three dimensions to form a block, and blocks in turn form one-, two- or three-dimensional grids. The configuration of blocks and threads is a logical concept, so it can be tuned to maximize performance. CUDA provides threads with "private" and fast, albeit limited, registers, and threads in a block can communicate with each other via shared memory, which has intermediate access speed. Global memory (also known as DRAM) and constant memory are two types of memory that CUDA provides that are visible to every thread. Global memory has a large capacity (e.g., 6.0 GB for the NVIDIA FERMI M2090) but large access latency, while constant memory is fast but read-only. In this chapter, we implement Grossfield's WHAM program [14] to run on a GPU.
We will briefly describe the theory of WHAM and then introduce the algorithm and strategy used in the GPU implementation. We then demonstrate the performance improvement by calculating the energy surface for a typical biochemical process. Finally, we draw conclusions in the last section of this chapter. Our GPU tests were performed on the widely used NVIDIA K40 with 12.0 GB of DRAM.

5.2. THEORY AND METHODS

In this section, we first briefly review the theoretical basis of WHAM. WHAM deals with a set of histograms, constructed on the same grid, obtained from N independent window simulations. These independent simulations, generally called windows, are collected via molecular dynamics simulations biased with an umbrella potential like that given in eq (1). As with most statistical methods, the accuracy of the calculation and its cost depend on the number of windows and on the amount of sampling done in each window. From the ith biased ensemble, if the distribution function along $\xi$ is assumed to be a Boltzmann distribution, then the unbiased distribution function obtained is

$$P_i^{\mathrm{unbiased}}(\xi) = e^{\beta\left[w_i(\xi) - f_i\right]}\, P_i^{\mathrm{biased}}(\xi) \qquad (2)$$

where $f_i$ is the umbrella potential perturbation constant (the free-energy shift of window i) and $\beta = 1/k_{\mathrm{B}}T$ is the inverse temperature. It has been shown [4,7] that the optimal estimate of $P(\xi)$, the unbiased probability distribution at position $\xi$, is

$$P(\xi) = \sum_{i=1}^{N} \frac{N_i\, e^{-\beta\left[w_i(\xi) - f_i\right]}}{\sum_{j=1}^{N} N_j\, e^{-\beta\left[w_j(\xi) - f_j\right]}}\, P_i^{\mathrm{unbiased}}(\xi) \qquad (3)$$

where $N_i$ is the number of data points obtained from the ith umbrella simulation and used to construct the biased distribution function. We assume the probability distribution is normalized; if it is not, we can simply normalize it at the end. Combining eqs (2) and (3),

$$P(\xi) = \frac{\sum_{i=1}^{N} N_i\, P_i^{\mathrm{biased}}(\xi)}{\sum_{j=1}^{N} N_j\, e^{-\beta\left[w_j(\xi) - f_j\right]}} \qquad (4)$$

For the constants $f_i$, we have

$$e^{-\beta f_i} = \int P(\xi)\, e^{-\beta w_i(\xi)}\, d\xi \qquad (5)$$

Eqs (4) and (5) are the WHAM equations. They depend on each other, so in practice the unbiased potential is obtained through an iterative procedure: one starts with an initial guess for the $f_i$ (usually set to 0) and then generates new estimates of $P(\xi)$ and $f_i$ until convergence is achieved. When implementing this algorithm, we divide the coordinate space into very small grid cells of equal size (not to be confused with the points used in the umbrella simulations). We can rewrite eqs (4) and (5) as

$$P(\xi_k) = \frac{\sum_{i=1}^{N} n_i(\xi_k)}{\sum_{i=1}^{N} N_i\, e^{-\beta\left[w_i(\xi_k) - f_i\right]}} \qquad (6)$$

$$e^{-\beta f_i} = \sum_{k} P(\xi_k)\, e^{-\beta w_i(\xi_k)} \qquad (7)$$

where $n_i(\xi_k)$ is the number of sample points from window i that fall on the grid point $\xi_k$, and $f_i$ is the average additional free energy of window i. Up to this point, eqs (6) and (7) are applicable to a general N-dimensional space, but we have only implemented the two-dimensional case, because 1-D WHAM is relatively fast and 2-D analyses are the most widely used. Extensions to higher dimensions involve similar algorithms and can be readily implemented if needed. For two dimensions, eq (6) yields N_grid_x*N_grid_y equations, where N_grid_x stands for the number of x-axis grid points in each WHAM histogram. Each equation requires two summations, and each of those sums has N terms to be added up. Therefore, for eq (6), the computational cost is O(N*N_grid_x*N_grid_y). Eq (7) represents N independent equations, each of which gathers N_grid_x*N_grid_y elements, yielding a computational cost of O(N*N_grid_x*N_grid_y) as well.
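To make the iterative procedure and its cost concrete, the following serial C++ sketch performs one self-consistent pass over eqs (6) and (7). It is a minimal illustration under assumed data layouts: the WhamData structure and its field names are hypothetical and do not reproduce the data structures of Grossfield's program or of our GPU port.

#include <algorithm>
#include <cmath>
#include <vector>

// Minimal serial reference for one WHAM iteration over eqs (6) and (7).
// All names below (WhamData and its fields) are illustrative assumptions.
struct WhamData {
    int n_windows;                  // N, number of umbrella windows
    int nx, ny;                     // N_grid_x, N_grid_y
    double beta;                    // 1 / (kB * T)
    std::vector<double> histogram;  // n_i(xi_jk), size n_windows*nx*ny
    std::vector<double> n_points;   // N_i, size n_windows
    std::vector<double> bias;       // w_i(xi_jk), size n_windows*nx*ny
};

// Performs eq (6) for every grid point, then eq (7) for every window.
// Returns the largest change in the free-energy constants f_i.
double wham_iteration(const WhamData& d, std::vector<double>& f,
                      std::vector<double>& prob)
{
    // Eq (6): unbiased probability on each grid point (i runs over windows).
    for (int j = 0; j < d.nx; ++j) {
        for (int k = 0; k < d.ny; ++k) {
            double num = 0.0, den = 0.0;
            for (int i = 0; i < d.n_windows; ++i) {
                const int idx = (i * d.nx + j) * d.ny + k;
                num += d.histogram[idx];
                den += d.n_points[i]
                     * std::exp(-d.beta * (d.bias[idx] - f[i]));
            }
            prob[j * d.ny + k] = num / den;
        }
    }
    // Eq (7): update each f_i from the new distribution.
    double max_change = 0.0;
    for (int i = 0; i < d.n_windows; ++i) {
        double sum = 0.0;
        for (int j = 0; j < d.nx; ++j)
            for (int k = 0; k < d.ny; ++k)
                sum += prob[j * d.ny + k]
                     * std::exp(-d.beta * d.bias[(i * d.nx + j) * d.ny + k]);
        const double f_new = -std::log(sum) / d.beta;
        max_change = std::max(max_change, std::fabs(f_new - f[i]));
        f[i] = f_new;
    }
    return max_change;
}

Repeating wham_iteration until the returned change in the $f_i$ values falls below a tolerance yields the converged unbiased distribution; the nested loops make the O(N*N_grid_x*N_grid_y) scaling of each pass explicit.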
It is worth noting that the computational cost depends only on the number of umbrella windows and the number of WHAM grid points, and not on the number of sample points obtained in the constrained MD trajectories; the accuracy, however, depends on all three factors.

Figure 5.1. Pseudo-code for the implementation of part 1.

Below, we describe the details of the present implementation on the CUDA platform. The description is presented using C++-style pseudocode (see Figure 5.1). The text following "//" in the pseudocode is a comment.

__global__ void calc_rou() {
    // The numerator and denominator arrays are stored in shared memory
    // so that every thread in this block can access them.
    __shared__ double numerator[THREAD_PER_BLOCK];
    __shared__ double denominator[THREAD_PER_BLOCK];
    for (int i = blockIdx.x; i < N_grid_x; i += gridDim.x) {
        for (int j = blockIdx.y; j < N_grid_y; j += gridDim.y) {
            int k = threadIdx.x;  // k is the thread ID
            // The numerator for one thread is the histogram value
            // fetched from global memory.
            numerator[k] = get_histogram_val(i, j, k);
            // The denominator requires a potential calculation.
            denominator[k] = get_number_points(k)
                           * exp(-beta * (calc_potential(i, j, k) - f[k]));
            __syncthreads();
            // Thread 0 summarizes the partial results.
            if (k == 0) {
                // summarize the numerator array into numerator[0];
                // summarize the denominator array into denominator[0];
                // rou is the two-dimensional array, stored in global
                // memory, that eq (6) calculates.
                rou[i][j] = numerator[0] / denominator[0];
            }
        }
    }
}

Figure 5.2. Pseudo-code for the implementation of part 2.

__global__ void calc_F() {
    // The array F_cache is two-dimensional and stored in shared memory
    // so that every thread in this block can access it.
    __shared__ double F_cache[THREAD_PER_BLOCK][THREAD_PER_BLOCK];
    for (int i = blockIdx.x; i < N_windows; i += gridDim.x) {
        for (int j = threadIdx.x; j < N_grid_x; j += blockDim.x) {
            for (int k = threadIdx.y; k < N_grid_y; k += blockDim.y) {
                F_cache[j][k] = get_rou(j, k)
                              * exp(-beta * calc_potential(j, k, i));
            }
        }
        __syncthreads();
        if (threadIdx.x == 0 && threadIdx.y == 0) {
            // summarize F_cache into F_cache[0][0];
            // F is the one-dimensional array, stored in global memory,
            // that eq (7) calculates.
            F[i] = F_cache[0][0];
        }
    }
}

To begin with, we refer to the implementations of eq (6) and eq (7) as Part 1 and Part 2 in the following discussion (see Figures 5.1 and 5.2). For eq (6), as the pseudocode in Figure 5.1 indicates, our strategy is to assign one block to each equation; every thread in the block then calculates $N_k\, e^{-\beta\left[w_k(\xi_{ij}) - f_k\right]}$ and fetches $n_k(\xi_{ij})$ from global memory, where k is the thread ID. Therefore, within the two-dimensional WHAM framework, we configure a two-dimensional grid of blocks to map one block to each equation, with a one-dimensional arrangement of threads inside each block. For example, block (0,0) works on $P(\xi_{00})$, and its very first thread, with thread ID equal to 0, fetches $n_0(\xi_{00})$ and calculates $N_0\, e^{-\beta\left[w_0(\xi_{00}) - f_0\right]}$. After that, a reduction within the block is executed to sum the denominator and the numerator. As a result, $P(\xi_{ij})$ can be computed within one block and stored in global memory so that it is visible to all threads on the GPU. Similarly, for eq (7), as the pseudocode listed in Figure 5.2 describes, each block works on one equation, and the threads in this block generate $P(\xi_{jk})\, e^{-\beta w_i(\xi_{jk})}$, where $\xi_{jk}$ is the assigned grid point. The most suitable configuration in this case is a one-dimensional grid of blocks with two-dimensional thread blocks. The configurations utilized are illustrated in Figures 5.3 and 5.4.
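The pseudocode above abbreviates the in-block summations as "summarize ... into element 0". One common way to carry out that step is a shared-memory tree reduction, sketched below; this sketch assumes the number of threads per block is a power of two and is only an illustration, not necessarily the exact reduction used in the released GPU-WHAM code.

// Sums the blockDim.x entries of a shared-memory array into data[0].
// Assumes blockDim.x is a power of two and that data lives in shared memory.
__device__ void block_sum(double* data)
{
    const int tid = threadIdx.x;
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            data[tid] += data[tid + stride];  // pairwise partial sums
        __syncthreads();                      // all threads finish this level
    }
    // data[0] now holds the block-wide sum.
}

With such a helper, the "if (k == 0)" branch in Figure 5.1 only needs to read numerator[0] and denominator[0]; the reduction itself takes O(log T) steps for T threads per block instead of a serial O(T) loop executed by thread 0.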
It is worth noting that the data transfer from CPU to GPU, which includes the histogram and the force constant(s), is carried out once before the first iteration starts rather than at every iteration. These variables reside on the GPU until the probability distribution reaches convergence, at which point the distribution $P(\xi)$ and the constants $f_i$ are transferred back from the GPU to the CPU. Therefore, the data transfer time is negligible compared to the full calculation.

Figure 5.3. Configuration used for the first step in a WHAM calculation. Blue rectangles indicate a block unit, while orange rectangles represent threads.

Figure 5.4. Configuration used for the second step in a WHAM calculation. Blue rectangles indicate a block unit, while orange rectangles represent threads.

5.3. RESULTS AND DISCUSSION

In this section, we benchmarked our GPU implementation by evaluating the energy surface of the CusF metallochaperone. CusF is a periplasmic metallochaperone found in Escherichia coli, encoded by the cusCFBA operon [30]. It is hypothesized to capture Cu+/Ag+ ions and transfer them to the CusCBA pump, which expels them from the cell [31]. It is related to metal-ion and antibiotic resistance, which makes it a potential pharmaceutical target.

Figure 5.5. Benchmark system studied: the CusF metallochaperone. Open (top) and closed (bottom) states exist for the CusF protein. Umbrella sampling was carried out in two dimensions, with one distance (the Cu+–Trp44@CZ3 distance) as the x axis and one dihedral angle (the Glu36–Met47 Phi angle, highlighted in black) as the y axis. See text for details.

In recent simulations carried out by our group, we found that an open-state structure exists for the CusF protein, and the transition between the closed and open states may play a significant role in the metal-ion transfer process between CusF and the CusCBA complex (unpublished results). Following the earlier work [32], we modeled the system based on PDB entry 2VB2. The AMBER ff99SB force field was employed for the protein, while the SPC/E water model was used to solvate the system. A hybrid metal model was utilized to describe the metal binding site: the interaction of the Cu+ ion with the Trp44 residue was treated with a nonbonded model, while the interactions between the Cu+ ion and the other three ligating residues (His36, Met47 and Met49) were represented by a bonded model. The bonded model was parameterized with the Metal Center Parameter Builder (MCPB) software in Modeling ToolKit++ (MTK++) in AmberTools [33]. During this work we found that one distance (the Cu+–Trp44@CZ3 distance) and one dihedral angle (the Glu36–Met47 Phi angle) in the metal binding site, as shown in Figure 5.5, distinguish the closed and open states remarkably well. For example, in a closed-state structure the distance is 2.33 Å and the dihedral angle is -147.53°, while in an open-state structure the corresponding distance is 17.42 Å and the dihedral angle is 47.61°. To further clarify the free energy differences between the closed and open states, we performed 2D potential of mean force (PMF) simulations, treating the Cu+–Trp44@CZ3 distance as the X-axis reaction coordinate (RC) and the Glu46–Met47 Phi dihedral angle as the Y-axis RC. We began the umbrella sampling from a closed-state structure of metal-loaded CusF. The harmonic force constants for all of the simulations were set to 10 kcal mol-1 Å-2 for the distance and 500 kcal mol-1 rad-2 for the dihedral angle.
The initial closed-state structure has an RC of (X = 2.33 Å, Y = -147.52°), while the PMF terminates at an RC of (X = 17.83 Å, Y = 98.72°). The X axis increases in increments of 0.50 Å and the Y axis in increments of 6.48°, so in total there are 32 * 44 = 1408 windows for the whole umbrella sampling map. There is 6 ns of sampling for each window with a step size of 1 fs. Data points were stored every 1 ps during the simulation so that the sample data are uncorrelated. As mentioned before, our implementation is based on Grossfield's WHAM program [14], so the original code was used for the CPU calculations and we directly ported this code to run on GPUs. Our code can be downloaded under a GNU General Public License. The GPU benchmarks were executed on an NVIDIA K40 card. This card has 12 gigabytes of memory, 15 streaming multiprocessors and 2880 CUDA cores. For comparison, the CPU version was run on a single core of an Intel Xeon X5675 CPU with a 3.07 GHz clock frequency. Both codes were compiled using the Intel C++ compiler 10.1.15 at optimization level 3 (-O3), and the CUDA compiler 4.0 V0.2.1221 was used for the GPU code. We used the fast math library option (-use_fast_math) for better performance on the GPUs. Using a CPU, a single iteration takes 90 s on average, and each calculation takes more than two thousand iterations to reach convergence, so more than 50 hours in total are required to reach convergence on this data set. In most of our calculations the grid was fine enough to obtain accurate results; the effect of using coarser grids is examined below as well. For GPU computations, single precision (SP) should be used carefully because it may produce significantly less accurate results than double precision (DP), although it offers a two-fold speedup. In contrast, on a CPU, single and double precision have similar performance on 64-bit hardware. Therefore, in our benchmarks we performed identical calculations in both single and double precision in order to compare their speed and accuracy fairly. The calculation efficiency depends strongly on the block and thread configuration. To find the most suitable configuration for the first step, eq (6), we tested block counts from (32*32) to (256*256) blocks per grid and thread counts from 64 to 256 threads per block. Figure 5.6 shows the time spent in one iteration for different threads-per-block parameters for step 1. As it indicates, for the Tesla platform the most suitable configuration is (256*256) blocks per grid and 128 threads per block for both precision options. For the second step, eq (7), we tested the same example in a similar comparison and find that the configuration with 192 blocks per grid and (32*32) threads per block performs best (see Figure 5.7). It is worth noting that the total time cost for a single iteration decreased to 86.4 ms, which represents a speedup of more than 1000 times relative to the CPU calculation. In order to further examine the computational efficiency of GPU-WHAM, we tested another data set using various WHAM grid resolutions, from 32*44 to 2048*2816. The profiling results are summarized in Figure 5.8. According to the benchmark results, the speedup achieved by a GPU compared to a single CPU reaches about 1200 times for double precision and 2020 times for single precision. The speedup is dependent on system size once the number of grid points is large enough. The grid used for this case is 256*352, and this resolution was used for plotting the PMF surface below.
However, an artificially high-resolution grid choice will produce unreliable results if the sampling is insufficient. We nevertheless profiled the speedup for very fine grids, since the computation time is independent of how thoroughly the data points sample the surface. This exercise, while not physically justified given the amount of data we have, does serve to show that even finer grids can be used with our GPU code. The largest grid choices in this test require about 1.39 seconds per iteration (single precision) and about half an hour to complete the full calculation (3760 seconds of GPU time and 3784 seconds of total wall time), which covers the 2061 iterations needed to reach convergence. The CPU implementation of WHAM is projected to finish the same task in about 814 hours. Beyond the influence of the number of WHAM grid points, when we compare the speedups between calculations with different numbers of umbrella windows but the same number of grid points, the speedup increases slightly with more umbrella windows. In practice, after the MD simulations have been run, the number of windows and the number of sample points are fixed, but the grid resolution can be specified in WHAM, and the grid resolution affects the accuracy of the calculation. Because of the speed improvements realized with GPU-WHAM, inaccuracies arising from low-resolution grids can be reduced if sufficient sampling data are collected; in most cases, however, data collection is the most expensive part of generating an accurate result. This type of analysis is beyond the scope of the present chapter, which focuses on the creation of GPU-WHAM. For the benchmark cases studied, the 160*640 grid scheme is fine enough that the grid-point density is not the factor that limits the precision of the calculation, but higher resolution does continue to offer some modest improvements.

Figure 5.6. Profiling of the GPU-WHAM code for step 1 with different thread and block choices.

Figure 5.7. Profiling of the GPU-WHAM code for step 2 with different thread and block choices.

Figure 5.8. Benchmark results for two sets of data with different numbers of windows (1024 for top, 1533 for bottom) and different numbers of grid points (from 32*128 to 768*1024). Profiling with CPU, GPU single-precision (SP) and GPU double-precision (DP) calculations is presented. A logarithmic scale on the y axis is used. The speedups obtained with GPU SP and GPU DP are also plotted. The benchmark system is described in the text.

In order to evaluate single versus double precision, we present the PMF surface computed with both single precision and double precision, along with their difference, in Figure 5.9. As the figure suggests, the two PMF surfaces show only minor differences arising from the use of single precision. Our test results indicate that the average absolute energy difference and the root-mean-square (RMS) deviation, defined as

$$\mathrm{RMS} = \sqrt{\frac{\sum_{\mathrm{grid}}\left(E^{\mathrm{DP}}_{\mathrm{grid}} - E^{\mathrm{SP}}_{\mathrm{grid}}\right)^{2}}{N_{\mathrm{grid}}}} \qquad (8)$$

between double and single precision are $9.5 \times 10^{-7}$ kcal/mol and $1.5 \times 10^{-4}$ kcal/mol, respectively. The maximum absolute difference between grid points is less than 0.4 kcal/mol, with most of the differences well under 0.01 kcal/mol. Most of the largest deviations appear in regions near the edge of the PMF map and in areas of little relevance to the reactive process. These differences are acceptable for most applications of WHAM, especially considering that single precision provides a 2-fold speedup in return.
For example, the energy difference between the two minima of Figure 5.9 is 2.74 kcal/mol in SP and 2.73 kcal/mol in DP, which is accurate enough for most chemical applications.

5.4. CONCLUSIONS

In this chapter, we accelerated WHAM using GPU technology and selected a suitable computational configuration to maximize efficiency. GPU-WHAM is as much as 1200 times faster in double precision on the Kepler card when compared with a single CPU. Single precision is two times faster than double precision, with up to a 2020-fold speedup compared with a single CPU, at a slight cost in accuracy; in most cases this will not be an issue because the inaccuracy introduced is generally smaller than the inaccuracy of the method utilized. Nonetheless, even very high-resolution grids are now affordable in a computationally reasonable amount of time.

Figure 5.9. The PMF surface and contour map produced using the GPU-WHAM code with double precision (top), single precision (middle) and the absolute energy difference (bottom). The energy unit is kcal/mol. The x and y axes are the Cu+–Trp44@CZ3 distance and the Glu36–Met47 Phi angle; the units are Å and rad, respectively. Full computational details are given in the text.

APPENDIX

MOLECULAR DYNAMICS DETAILS OF CUSF

CusF is a periplasmic metallochaperone found in Escherichia coli, encoded by the cusCFBA operon [36]. It is hypothesized to capture Cu+/Ag+ ions and transfer them to the CusCBA pump, which expels them from the cell [37]. It is related to metal-ion and antibiotic resistance, which makes it a potential pharmaceutical target. In recent simulations carried out by our group, we found that an open-state structure exists for the CusF protein, and the transition between the closed and open states may play a significant role in the metal-ion transfer process between CusF and the CusCBA complex. Following the earlier work [38], we modeled the system based on PDB entry 2VB2. The AMBER ff99SB force field [39] was employed for the protein, while the SPC/E water model [40] was used to solvate the system. A hybrid metal model was utilized to describe the metal binding site: the interaction of the Cu+ ion with the Trp44 residue was treated with a nonbonded model, while the interactions between the Cu+ ion and the other three ligating residues (His36, Met47 and Met49) were represented by a bonded model. The bonded model was parameterized with the Metal Center Parameter Builder (MCPB) [41] software in Modeling ToolKit++ (MTK++) in AmberTools. We found that one distance (the Cu+–Trp44@CZ3 distance) and one dihedral angle (the Glu36–Met47 Phi angle) in the metal binding site distinguish the closed and open states remarkably well. For example, in a closed-state structure the distance is 2.33 Å and the dihedral angle is -147.53°, while in an open-state structure the corresponding distance is 17.42 Å and the dihedral angle is 47.61°. To further clarify the free energy differences between the closed and open states, we performed 2D potential of mean force (PMF) simulations, treating the Cu+–Trp44@CZ3 distance as the X-axis reaction coordinate (RC) and the Glu46–Met47 Phi dihedral angle as the Y-axis RC. We began the umbrella sampling from a closed-state structure of metal-loaded CusF. Generally, there are three steps in the 2D PMF umbrella sampling process; the scheme of the first two steps is shown in Figure 5.10.
The harmonic force constants for all of the simulations were set to 10 kcal mol-1 Å-2 for the distance and 500 kcal mol-1 rad-2 for the dihedral angle. There is 1 ns of sampling for each window with a step size of 1 fs, and data points were stored every 10 fs during the simulation.

Step 1: First, we performed PMF simulations in which the distance and dihedral variables were changed in a concerted fashion to "push" the structure from the closed state to the open state. The initial closed-state structure has an RC of (X = 2.33 Å, Y = -147.52°), while the PMF terminates at an RC of (X = 21.33 Å, Y = 98.72°). The total RC pathway was divided into 39 windows, with each successive window increased by 0.50 Å along the X axis and by 6.48° along the Y axis relative to the previous window. The final snapshot of each window was used as the initial structure of the next window.

Step 2: Starting from the RC windows finished in the previous step, we scanned across the Y axis for each X value individually. The scans proceeded in both directions at the same time, with end points of -173.44° and 105.20° in the negative and positive directions, respectively. To be consistent with Step 1, the windows are spaced 6.48° apart along the Y axis as well. In total there are 44 windows across the Y axis for each X value. Multiplying by the number of windows along the X axis, there are 44 * 39 = 1716 windows for the whole umbrella sampling map. At this point, we had finished the first 1 ns of sampling for all of the windows on the whole PMF profile map.

Step 3: The final umbrella samplings were performed independently for each window. Treating the final structure from the first 1 ns of sampling as the initial structure, another 5 ns of sampling was performed for each window. In total, there are 1716 windows, each with 6 ns of umbrella sampling. More than 1 billion correlated data points and more than 1 million uncorrelated data points (treating 1 ps as the uncorrelated timescale) were collected from the umbrella sampling simulations, and from these data the free energy profile was generated using the weighted histogram analysis method (WHAM) software. It is worth mentioning that, although the MD simulations provide 44 samples along the x axis, we used only the first 32 of them to optimize and benchmark the GPU-WHAM program because of the architecture of GPUs.

Figure 5.10. The scheme of step 1 (left) and step 2 (right) in the CusF 2D PMF simulations. The arrows indicate the direction of the reaction coordinate. The hollow circles in the right panel are the windows completed in step 1. (For illustration only, the figure shows 5 windows on the X axis and 7 windows on the Y axis, but there are actually 39 windows on the X axis and 44 windows on the Y axis.)

REFERENCES

(1) J. G. Kirkwood, J. Chem. Phys. 1935, 3, 300-313. (2) G. M. Torrie, J. P. Valleau, Chem. Phys. Lett. 1974, 28, 578-581. (3) G. M. Torrie, J. P. Valleau, J. Comput. Phys. 1977, 23, 187-199. (4) S. Kumar, J. M. Rosenberg, D. Bouzida, R. H. Swendsen, P. A. Kollman, J. Comput. Chem. 1992, 13, 1021. (5) S. Kumar, J. M. Rosenberg, D. Bouzida, R. H. Swendsen, P. A. Kollman, J. Comput. Chem. 1995, 16, 1350. (6) S. Kumar, P. W. Payne, M. Vásquez, J. Comput. Chem. 1996, 17, 1269-1275. (7) A. M. Ferrenberg, R. H. Swendsen, Phys. Rev. Lett. 1989, 63, 1195-1198. (8) C. H. Bennett, J. Comput. Phys. 1976, 22, 245-268. (9) E. M. Boczko, C. L. Brooks, J. Phys. Chem. 1993, 97, 4509-4513. (10) G. Cui, B. Wang, K. M. Merz, Biochemistry 2005, 44, 16513-16523. (11) G.
Cui, X. Li, K. M. Merz, Biochemistry 2007, 46, 1303-1311. (12) E. Rosta, M. Nowotny, W. Yang, G. Hummer, J. Am. Chem. Soc. 2011, 133, 8934-8941. (13) NVIDIA. Compute Unified Device Architecture (CUDA). http://www.nvidia.com/object/cuda_home_new.html (14) Alan Grossfield, "WHAM: the weighted histogram analysis method", version 2.06, http://membrane.urmc.rochester.edu/content/wham (15) D. Porezag, T. Frauenheim, T. Köhler, G. Seifert, R. Kaschner, Phys. Rev. B 1995, 51, 12947-12957. (16) G. Seifert, D. Porezag, T. Frauenheim, Int. J. Quant. Chem. 1996, 58, 185-192. (17) D.A. Case, T.A. Darden, T.E. Cheatham, III, C.L. Simmerling, J. Wang, R.E. Duke, R. Luo, R.C. Walker, W. Zhang, K.M. Merz, B.P. Roberts, B. Wang, S. Hayik, A. Roitberg, G. Seabra, I. Kolossváry, K.F. Wong, F. Paesani, J. Vanicek, J. Liu, X. Wu, S.R. Brozell, T. Steinbrecher, H. Gohlke, Q. Cai, X. Ye, J. Wang, M.-J. Hsieh, G. Cui, D.R. Roe, D.H. Mathews, M.G. Seetin, C. Sagui, V. Babin, T. Luchko, S. Gusarov, A. Kovalenko, P. A. Kollman, AMBER 11, University of California, San Francisco, 2010. (18) P. Kollman, Chem. Rev. 1993, 93, 2395-2417. (19) T. Bereau, R. H. Swendsen, J. Comput. Phys. 2009, 228, 6119-6129. (20) B. Roux, Comput. Phys. Commun. 1995, 91, 275-282. (21) D. Bouzida, P. A. Rejto, G. M. Verkhivker, Int. J. Quant. Chem. 1999, 73, 113-121. (22) M. Souaille, B. Roux, Comput. Phys. Commun. 2001, 135, 40-57. (23) E. Gallicchio, M. Andrec, A. K. Felts, R. M. Levy, J. Phys. Chem. B 2005, 109, 6722-6731. (24) J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, K. A. Dill, J. Chem. Theory Comput. 2006, 3, 26-41. (25) V. Hornak, R. Abel, A. Okur, B. Strockbine, A. Roitberg, C. Simmerling, Proteins: Struct., Funct., Bioinf. 2006, 65, 712-725. (26) R.C. Walker, M.F. Crowley, D.A. Case, J. Comput. Chem. 2008, 29, 1019-1031. (27) G.M. Seabra, R.C. Walker, M. Elstner, D.A. Case, A. Roitberg, J. Phys. Chem. A 2007, 111, 5655-5664. (28) J. D. Chodera, W. C. Swope, J. W. Pitera, C. Seok, K. A. Dill, J. Chem. Theory Comput. 2007, 3, 26-41. (29) H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, P. E. Bourne, Nucleic Acids Res. 2000, 28, 235-242. (30) S. Franke, G. Grass, C. Rensing, D.H. Nies, J. Bacteriol. 2003, 185, 3804-3812. (31) I. R. Loftin, S. Franke, S. A. Roberts, A. Weichsel, A. Héroux, W. R. Montfort, C. Rensing, M. M. McEvoy, Biochemistry 2005, 44, 10533-10540. (32) D. K. Chakravorty, B. Wang, M. N. Ucisik, K. M. Merz, J. Am. Chem. Soc. 2011, 133, 19330-19333. (33) D. A. Case, T. E. Cheatham, T. Darden, H. Gohlke, R. Luo, K. M. Merz, A. Onufriev, C. Simmerling, B. Wang, R. J. Woods, J. Comput. Chem. 2005, 26, 1668-1688. (34) H. J. C. Berendsen, J. R. Grigera, T. P. Straatsma, J. Phys. Chem. 1987, 91, 6269-6271. (35) M. B. Peters, Y. Yang, B. Wang, L. S. Füsti-Molnár, M. N. Weaver, K. M. Merz, J. Chem. Theory Comput. 2010, 6, 2935-2947. (36) Franke, S.; Grass, G.; Rensing, C.; Nies, D. H. J. Bacteriol. 2003, 185, 3804-3812. (37) Loftin, I. R.; Franke, S.; Roberts, S. A.; Weichsel, A.; Héroux, A.; Montfort, W. R.; Rensing, C.; McEvoy, M. M. Biochemistry 2005, 44, 10533-10540. (38) Chakravorty, D. K.; Wang, B.; Ucisik, M. N.; Merz, K. M. J. Am. Chem. Soc. 2011, 133, 19330-19333. (39) Case, D. A.; Cheatham, T. E.; Darden, T.; Gohlke, H.; Luo, R.; Merz, K. M.; Onufriev, A.; Simmerling, C.; Wang, B.; Woods, R. J. J. Comput. Chem. 2005, 26, 1668-1688. (40) Berendsen, H. J. C.; Grigera, J. R.; Straatsma, T. P. J. Phys. Chem. 1987, 91, 6269-6271. (41) Peters, M.
B.; Yang, Y.; Wang, B.; Füsti-Molnár, L. S.; Weaver, M. N.; Merz, K. M. J. Chem. Theory Comput. 2010, 6, 2935-2947.

CHAPTER 6. SUMMARY AND CONCLUDING REMARKS

In chapter 1, we briefly introduced GPU and CUDA programming by describing the GPU architecture, the thread hierarchy and the memory hierarchy. We also provided some details about the philosophy of GPU programming and tuning tricks, which informed the subsequent research projects that apply GPU programming to quantum chemistry.

In chapter 3, we evaluated ERIs on a GPU using recurrence relations and formed the Fock matrix entirely in GPU memory. Full SCF benchmark calculations demonstrate the impressive speedups achieved by the GPU ERI implementation, and the energy error of the double-precision GPU calculations meets the accuracy requirements of most computational chemistry applications. To realize the GPU implementation and achieve this speedup, a well-sorted integral grid that reduces thread divergence and provides an optimized memory access pattern boosts the efficiency. The speedup is also achieved by optimizing the Fock matrix formation scheme: introducing atomic operations significantly reduces the data transfer from O(N4) to O(N2), which was one of the most time-consuming steps in conventional GPU SCF programming because it is limited by the CPU-GPU bandwidth. Moreover, this approach also reduces redundant and unnecessary ERI re-calculation by reusing ERI data. Our benchmarks show that the speedup increases with increasing system size, and our code is now applicable to s, p and d basis functions, which cover most organic and biochemical calculations.

In chapter 4, based on the work completed in the previous chapter, we implemented the evaluation of ERIs up to f orbitals and ERI derivatives up to d orbitals using GPU technology and ERI recurrence relations. Our SCF and gradient calculations demonstrate similar speedups for the GPU implementation compared with modern CPUs. A partition strategy is introduced to overcome the difficulties encountered in computing ERIs that include f orbitals, and it was further applied to d-orbital gradient calculations. Importantly, we observed only a very limited decrease in efficiency after employing this strategy. The code is produced by our machine-generated code framework, which would be almost impossible for a human to write by hand because of its complexity. As with ERI evaluation, a pre-sorting strategy and several other improvements boost the overall GPU performance, and atomic operations are again used to reduce data transfer. Moreover, GPU-based DFT calculations are also available in QUICK, although we did not touch on this topic.

In chapter 5, to accelerate the analysis of molecular dynamics simulations, we accelerated WHAM using GPU technology by selecting a suitable computational configuration to maximize efficiency. GPU-WHAM is as much as 1200 times faster in double precision compared with a single CPU, while single precision is two times faster than double precision, with up to a 2020-fold speedup, at a slight cost in accuracy. According to our tests and observations, in most cases this will not be an issue because the inaccuracy introduced is generally smaller than the inaccuracy of the method utilized. Nonetheless, even very high-resolution grids are now affordable in a computationally reasonable amount of time.

These projects utilized GPUs as economical accelerators for multiple areas of computational chemistry calculations.
Meanwhile, there is still plenty of room to apply GPU technology. Therefore, our future work will focus on the following aspects. First, we are integrating our source code with the AMBER MD package to further enable ab initio QM/MM simulations. Second, we are developing a GPU-based ERI and ERI derivative generator so that computational chemists can use our tool to develop their own GPU-accelerated ERI algorithms, a very common step in most advanced ab initio computational chemistry methods. Moreover, we plan to develop GPU-enabled correlated ab initio methods such as MP2 and coupled-cluster methods. In addition, GPU-enabled second-order derivatives would also be helpful for calculating frequencies and other physical and chemical properties. In terms of technology, a multi-GPU implementation and an implementation of the code on the Intel Phi platform are on our timeline, in order to provide another option for accelerating ab initio calculations.