PROBING INTERACTION MOTIFS FOR LIGAND BINDING
PREDICTION FROM THREE PERSPECTIVES: ASSESSING
PROTEIN SIMILARITY, LIGAND SIMILARITY AND
COMPONENTS OF PROTEIN-LIGAND INTERACTIONS
By
Nan Liu

A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Chemistry – Doctor of Philosophy
2015

ABSTRACT
PROBING INTERACTION MOTIFS FOR LIGAND BINDING PREDICTION FROM THREE
PERSPECTIVES: ASSESSING PROTEIN SIMILARITY, LIGAND SIMILARITY AND
COMPONENTS OF PROTEIN-LIGAND INTERACTIONS
By
Nan Liu
The interactions between small molecules and diverse enzyme, membrane receptor and
channel proteins are associated with important biological processes and diseases. This makes the
study of binding motifs between proteins and ligands appealing to scientists. We use multiple
computational techniques to unveil the protein-ligand interaction motifs from three perspectives.
Firstly, from the perspective of proteins, by comparing the structure differences and common
features of different binding sites for the same ligand, 3-dimensional motifs that represent the
favorable interactions of the same ligands can be extracted. The goal is for such a motif to
represent the shared features for binding a certain ligand in unrelated proteins, while
discriminating from other ligands. The 3-dimensional motifs for cholesterol and cholate binding
to non-homologous protein sites have been extracted, using SimSite3D alignment and analysis of
the conserved interactions between these sites. The 3-dimensional protein motif for cholesterol
binding can give about 80% accuracy of true positive sites with a low false positive rate.
Furthermore, an online server CholMine was established so that the users can use this approach
to predict cholesterol and cholate binding sites in proteins of interest. These motifs can help
annotation of protein functions, drug discovery and the design of mutations.
Secondly, from the perspective of ligands, interaction motifs can be represented as
	  

molecular features important for biological activities of ligands. Searching and summary of
shared motifs from pretested series of ligand candidates can provide rational guidance to further
drug improvement and screening. Here, we report a series of potential sea lamprey olfactory
receptor 1 antagonists discovered from databases we designed of molecules that are similar to the
native ligand, 3kPZS. Compounds with overall electrostatic and shape similarity to 3kPZS were
assessed by using ROCS software, and their initial important feature matches to 3kPZS were
analyzed, to prioritize compounds for biological testing. Then, the molecular features important
to biological activities were summarized using SALI analysis and functional group matchprint
analysis. By combining theses approaches, 12 compounds were discovered that suppress the
detection of 3kPZS by at least 45%, and the most active compounds have entered field testing.
Thirdly, dissecting the components of protein-ligand binding energies is also important to
define the key determinants of ligand interaction with a protein site. Through analyzing the
correlation coefficient of interaction energies between a series of alpha-phenylalanine substitutes
and PaPAM and biological activities of these compounds, the dominant factor that determine the
activities of the compounds was revealed, which was steric effect between the binding site and
these compounds. From the analysis, mutations at the residues of the binding site were suggested
to change or improve the catalytic efficiency of the enzyme.
Given these three approaches, we envision a more integrated approach in the future that
combines the analysis of shared protein-ligand interactions, shared interaction features from
active ligands and shared features of protein binding sites to identify even more selective and
tight-binding ligands.
	  

Copyright by
NAN LIU
2015

	  

This dissertation is dedicated to my beloved parents, younger brother and my husband.

	  

v	  

ACKNOWLEDGEMENTS

Here I sincerely acknowledge all the people who offer me their kindness and helps during
my graduate study. Without their support and guidance, I can’t finish my graduate study and
dissertation.
First of all, with most gratitude, I thank to my advisor, Leslie Kuhn. She gives me a lot of
support and technical guidance during my graduate study. She cultivates me with good research
habits, such as writing weekly reports in scientific formats, always backing up the research data
and so on. Furthermore, she encourages and inspires me to solve problems independently and
innovatively. She is always approachable, patient and resourceful to me. She makes my stay at
Michigan State University delightful, memorable and productive.
Secondly, I wound like to appreciate the helps from my committee members. I am grateful
to Dr. Robert Cukier, Dr. Shelagh Ferguson-Miller and Dr. Kevin Walker. They give me many
valuable feedbacks and suggestions on my research, based on which I can modify and improve
my research skills. In addition, I learn a lot from the collaborations with Dr. Shelagh
Ferguson-Miller on cholesterol prediction and the CholMine server building. Without Dr. Kevin
Walker’s help, the PaPAM project cannot come to a cheerful end.
Thirdly, I owe my thanks to all of my collaborators. I need to thank to Dr. Fei Li and Dr.
Jian Liu for their suggestions on CholMine software features to support experimental follow-up,
and their feedbacks on this manuscript. I am grateful to Dr. Nishanka Dilini Ratnayake on the
	  

vi	  

collaboration of the PaPAM project. I also appreciate Dr. Mar Huertas, Anne Scott (Graduate
Research Assistant) and Dr. Weiming Li for their contributions to sea lamprey olfactory receptor
1 antagonist discovery project. Without their help and encouragement, I cannot produce the
useful and instructive results for these projects.
Furthermore, I would like to appreciate my group members. John Johnson, the technique
specialist, helped me to solve a lot of problems, such as implementation of CholMine server,
various software environment setting problems and so on. Dr. Jeffrey Van Voorst, the developer
of SimSite3D, gave me a lot of helps on how to use SimSite3D properly. Dr. Leann Buhrow
helped me a lot when I started to work in the lab. Santosh Gunturu, a former undergraduate
researcher in our lab, collaborated with me at the beginning of the CholMine and sea lamprey
olfactory receptor 1 antagonist discovery projects. I also need to thank to Sebastian Raschka, Joe
Bemister and Alex Wolf for their helps and suggestions on my research and presentations.
Lastly, I want to thank to my husband. He gave me unlimited love and support in my life
and graduate study. Only under his support, understanding, and love, I can finish my graduate
study and research. He gives me strength to overcome various problems in my life. I want to
thanks to my parents. Without their unconditional love and support, I can’t overcome so many
problems in my life. I owe my appreciations to all my families members.

	  

vii	  

TABLE OF CONTENTS

LIST OF TABLES ......................................................................................................................... xi
LIST OF FIGURES ..................................................................................................................... xiii
KEY TO ABBREVIATIONS ...................................................................................................... xix
Chapter 1 Introduction .................................................................................................................... 1
1.1 Introduction ........................................................................................................................... 2
1.1.1 Computer technologies used in biochemistry ................................................................. 2
1.2 Representations of protein and small molecule structures .................................................... 3
1.2.1 Protein structure representations and applications ......................................................... 4
1.2.2 Molecular structure representations and applications .................................................... 6
1.3 Predicting ligand binding only given protein information .................................................... 9
1.4 Combination of multiple techniques in drug discovery ........................................................ 9
1.5 Objectives of this dissertation ............................................................................................. 11
REFERENCES ......................................................................................................................... 12
Chapter 2 Decoding protein structural motifs for ligand binding prediction................................ 16
2.1 Introduction ......................................................................................................................... 17
2.1.1 Conserved lipid binding sites in membrane proteins.................................................... 17
2.1.2 Cholesterol, cholate and related sequence-based binding motifs ................................. 18
2.1.3 Determinants of lipid-membrane protein binding ........................................................ 22
2.1.4 Previous prediction of lipid binding sites ..................................................................... 23
2.2 Methods............................................................................................................................... 25
2.2.1 SimSite3D and site maps for aligning and comparing protein sites ............................. 25
2.2.2 Extraction of an interaction motif for binding the same ligand in non-homologous sites
............................................................................................................................................... 27
2.2.3 Establishing a cholate site predictor ............................................................................. 31
2.2.4 Summary of the steps for establishing a cholesterol (or cholate) site predictor ........... 32
2.2.4.1 Step 1: Preparing the training and testing databases .............................................. 32
2.2.4.2 Step 2: Choosing the most representative cholesterol (or cholate) binding site .... 34
2.2.4.3 Step 3: Extracting a fingerprint of conserved interactions from known cholesterol
(or cholate) sites and applying it to predict on the test set ................................................. 35
2.2.5 Bacterial membrane proteins for evaluating false positive prediction rate .................. 36
	  

viii	  

2.2.6 CholMine server ........................................................................................................... 37
2.3 Results ................................................................................................................................. 39
2.3.1 Cholesterol binding site training and testing ................................................................ 39
2.3.2 Cholate site training and testing ................................................................................... 42
2.3.3 Evaluating the statistical significance of the cholesterol and cholate site predictors ... 45
2.3.4 GPCR cholesterol binding site prediction .................................................................... 46
2.3.5 Comparison of CholMine structure-based predictions with sequence-based predictions
using the CCM, CRAC, and GXXXG motifs ....................................................................... 46
2.3.6 Deciphering the determinants of cholesterol binding ................................................... 49
2.3.7 CholMine distinguishes cholesterol sites from sites occupied by acyl chain lipids ..... 50
2.3.8 Discriminating cholesterol and cholate sites from other steroid sites .......................... 53
2.3.9 Bacterial membrane proteins for evaluating false positive predictions ........................ 54
2.3.10 Cholate binding determinants ..................................................................................... 54
2.3.11 Comparison of cholesterol and cholate binding site conservation ............................. 55
2.3.12 Computational efficiency of the CholMine server ..................................................... 57
2.4 Concluding discussion ........................................................................................................ 58
APPENDIX ............................................................................................................................... 60
REFERENCES ......................................................................................................................... 72
Chapter 3 Deciphering Substituent Effects of Ring-substituted α-Arylalanines on the
Isomerization Reaction Catalyzed by an Aminomutase ............................................................... 78
3.1 Introduction ......................................................................................................................... 79
3.2 Materials and methods ........................................................................................................ 83
3.2.1 Experiments .................................................................................................................. 83
3.2.2 Modeling substrate-PaPAM structural interactions to understand selectivity ............. 83
3.2.3 Calculating substrate-PaPAM interaction energies ...................................................... 84
3.2.4 Structure-activity landscape index analysis .................................................................. 86
3.3 Results and discussion ........................................................................................................ 87
3.3.1 Overview of the PaPAM mechanism ........................................................................... 87
3.3.2 Comparing the effects of regioisomeric substituents on PaPAM catalysis and substrate
affinity ................................................................................................................................... 92
3.3.3 Relationship between PaPAM-substrate interaction energies, flexibility, and KM ...... 95
3.3.4 Activity cliff analysis.................................................................................................. 100
APPENDIX ............................................................................................................................. 104
REFERENCES ....................................................................................................................... 113
Chapter 4 Using multiple virtual screening techniques to bootstrap pheromone antagonist
discovery ..................................................................................................................................... 117
	  

ix	  

4.1 Introduction ....................................................................................................................... 118
4.1.1 Motivation .................................................................................................................. 118
4.1.2 Hypothesis .................................................................................................................. 119
4.1.3 Significance ................................................................................................................ 120
4.2 Materials and Methods ...................................................................................................... 121
4.2.1 Virtual screening......................................................................................................... 123
4.2.1.1 3kPZS and SLOR1 structural model.................................................................... 123
4.2.1.2 Preparation of the screening libraries based on hypothesis ................................. 124
4.2.1.3 Sampling flexible compounds.............................................................................. 128
4.2.1.4 Overlays of molecular structures using ROCS .................................................... 129
4.2.1.5 Matching functional groups in 3kPZS ................................................................. 130
4.2.1.6 Incorrect steroids .................................................................................................. 130
4.2.1.7 Molecular docking ............................................................................................... 131
4.2.1.8 Ranking and prioritization based on hypothesis testing ...................................... 131
4.2.1.9 Activity cliff analysis ........................................................................................... 135
4.2.1.10 Functional group match fingerprint analysis ..................................................... 135
4.2.2 Experimental validation .............................................................................................. 137
4.2.2.1 EOG assays .......................................................................................................... 138
4.3 Results and discussion ...................................................................................................... 140
4.3.1 Binding mode of 3kPZS in SLOR1 structural model ................................................. 140
4.3.2 Electro-olfactograms (EOGs) assays identify antagonists for 3kPZS detection based on
candidates from high-throughput computational screening ................................................ 142
4.3.3 Structure-activity relationships analysis ..................................................................... 144
4.3.3.1 SAR analysis based on SALI and functional group matchprint .......................... 144
4.3.3.2 Other structure-relationship analysis ................................................................... 148
4.4 Conclusion ........................................................................................................................ 150
REFERENCES ....................................................................................................................... 152
Chapter 5 Conclusions and future directions .............................................................................. 156

	  

x	  

LIST OF TABLES

Table 2.1 Cholesterol binding proteins in the training and test sets. ............................................ 30
Table 2.2 Cholate binding proteins in the training and test sets. .................................................. 34
Table 2.3 Prediction results for using cholesterol sites in 3KDP_CLR3001D (a membrane
protein) and 1ZHY_CLR1001A (a soluble protein) for detecting cholesterol sites in other
proteins, plus assessment of false positives in a set of 139 non-cholesterol ligand sites.
When 1ZHY_CLR1001A was used as the query in the results below, the training and test
sets were inverted relative to those listed in Table 2.1. Query self-matches were excluded
from the statistics. ................................................................................................................. 42
Table 2.4 Prediction results from using cholate sites 2DYR_CHD525C (best representative from
a membrane protein) and 2QO4_CHD130A (best representative from a soluble protein in
the second set) for alignment and scoring to predict cholate binding sites in other proteins
and assess false positive rate in a set of 140 non-cholate sites. Query self-matches were
excluded from the results. The training and test sets were inverted relative to Table 2.2
when the 2QO4 query was used. .......................................................................................... 44
Table 2.5 Comparison of cholesterol site prediction in true versus non-cholesterol binding sites
by the CholMine conserved spatial motif versus sequence motif matching......................... 48
Table A.1.1 140 non-homologous protein sites binding diverse ligands, containing one
cholesterol binding site (in PDB entry 1LRI) and no cholate sites. ..................................... 61
Table A.1.2 Putative cholesterol binding sites in class A GPCRs1. ............................................. 66
Table A.1.3 Diverse non-cholesterol, non-cholate lipid binding sites. ......................................... 67
Table A.1.4 Sites in 109 low-homology bacterial membrane protein sites analyzed as potential
false positive cases for cholesterol (CLR) or cholate (CHD) binding. Sites predicted to
match the CholMine cholesterol or cholate site conserved interactions are noted in the third
column. The last column indicates whether the crystallographic ligand at the prediction site
(second column) was of lipid or lipid-like (L), drug-like (D), polar (P), or intermediate
character (e.g., P/L for a polar lipid group). 73% of the sites contained lipids or partly
lipidic molecules. .................................................................................................................. 68

	  

xi	  

Table 3.1 Kinetic Parametersa of PaPAM for Various Substituted Aryl and Heteroaromatic
Substrates. ............................................................................................................................. 91
Table A.2.1 Comparison of the experimental KM and predicted energetic order of each
substituent at ortho-, meta-, para-positions. ....................................................................... 105
Table A.2.2 Comparison of the experimental KM and predicted energetic order of each
substituent at ortho-, meta-, para-positions. This data is the same as presented in Table
A.2.1; here, it is organized according to substituent position rather than type. .................. 106
Table A.2.3 Evaluation of protein-ligand and ligand internal energy values and preference for
NH2-cis versus NH2-trans configuration............................................................................. 107
Table 4.1 The matchprints of the top 6 compounds that suppressed EOG response of sea lamprey
to 3kPZS. ............................................................................................................................ 137

	  

xii	  

LIST OF FIGURES

Figure 1.1 Representations of protein, ligand and protein-ligand interactions for structural and
chemical similarity mining. .................................................................................................... 4
Figure 1.2 SMARTS and 2D sketching of steroid ring using SMARTSViewer. ........................... 8
Figure 2.1 2D and 3D chemical structures of (A) cholesterol (blue) and (B) cholate (yellow),
with the flexible tails from C21 to C24/C25 shown in arbitrary favorable conformations. . 19
Figure 2.2 Determining conserved site map points. Aligned site map points with matching
chemical labels from the training set of cholesterol (CLR) sites are shown following
SimSite3D spatial alignment. Hydrophobic (H) or hydrogen-bond donor (D) site map
points are shown on lines 2-6 if they fall within 1.5 Å of a site map point of the same
chemical type in the query site, 3KDP_CLR3001D, where the number and letter after the
CLR residue code indicate its residue number and chain identifier in the PDB file.
Hydrogen-bond acceptor (A) and donor and/or acceptor (N) points (e.g., hydroxyl
interaction sites) also occur in cholesterol sites but are not found to be conserved between
the sites. The 3KDP query site was chosen as the representative query site for cholesterol
binding because it has the highest degree of site map point conservation with the other
cholesterol sites. Highly conserved points (green backgrounds) comprising the conserved
motif for cholesterol interation were identified based on occurring in at least 70% of these
training cases aligned to the 3KDP query site. ..................................................................... 31
Figure 2.3 Steps in CholMine cholesterol and cholate site prediction. ........................................ 37
Figure 2.4 Pairwise alignment and similarity scoring. (A) All-against-all SimSite3D comparison
for membrane protein cholesterol binding sites. (B) All-against-all comparison for soluble
protein cholesterol binding sites. For the top-scoring alignment of each site pair, the
SimSite3D similarity score values are colored from red (most similar) to dark blue
(marginally similar) with corresponding score values ranging from -5 to 0 (in standard
deviations above the mean score when the same query site is compared to the set of 140
diverse ligand binding sites, where more negative is more significant). Black indicates
failure to meet the normalized score threshold of 0. Numbers reported in the grid are the
RMSD values (Å) between cholesterol rings following SimSite3D site alignment. Lower
RMSD indicates better alignment between sites. The “# norm. hits” column on the right
side of each matrix reports the number of sites meeting the scoring threshold for similarity
to the query site (labeled to the left in each row) when searching against the 140 sites in the
	  

xiii	  

diverse dataset (Table A.1.1), which includes one true positive cholesterol site. The high
number of false positives is based on SimSite3D alignment score only, before the conserved
interaction points for cholesterol sites have been considered. .............................................. 41
Figure 2.5 Pairwise alignment and similarity scoring. (A) All-against-all SimSite3D similarity
comparison for the first dataset, which includes 4 membrane cholate binding sites and 6
soluble cholate binding sites. (B) All-against-all comparison for the second dataset, which
includes another 10 soluble cholate binding sites unrelated to the first set. (See Figure 2.4
legend for additional details.) ............................................................................................... 44
Figure 2.6 (A) Sodium/potassium-transporting ATPase cholesterol site (PDB entry 3KDP,
residue D3001) used as the representative query for CholMine predictions. Purple spheres
represent conserved interaction points in the membrane proteins binding cholesterol (from
Figure 2.2), displayed in the context of the representative site from 3KDP. The green
dashed lines connect the conserved interaction points to corresponding protein atoms.
Cholesterol atoms colored in green contact a protein atom in 60% of the training set sites,
atoms colored yellow have a 30-60% frequency of contact, and atoms colored in red contact
the protein in <30% of the sites. (B) For comparison, LigPlot+ 3-dimensional view (shown
with PyMOL; Schrödinger, New York, NY; http://pymol.org) of key
sodium/potassium-transporting ATPase cholesterol interactions identified in just the single
structure of 3KDP. (C) Alternative LigPlot 2-dimensional view of these interactions. ....... 52
Figure 2.7 Conserved interaction points for CholMine cholate site prediction (purple spheres) are
shown in the context of the interactions between the representative membrane protein query
site 2DYR_CHD525C from cytochrome c oxidase, and its bound cholate molecule (white
tubes with oxygen atoms in red). Essential residues contributing to the conserved interaction
are labeled. ............................................................................................................................ 56
Figure 2.8 SimSite3D-identified conserved interactions for cholate (yellow) and cholesterol (blue)
recognition abound along the groove formed between the row of C18, C19, and C21 methyl
groups on the beta (lower) face of the steroid and the edge of the steroid ring system. The
view on the right is rotated roughly 90 degrees about a vertical axis through the center of
each molecule. Cholate sites are distinguished from cholesterol primarily based on
interactions with the relatively conserved C22-C23 tail orientation in cholate, and numerous
conserved interactions associated with the strongly bent (5-beta configuration) joint
between the A and B rings of the cholate steroid ring system. Because the tail
configurations are conformationally diverse in different binding sites, conserved
interactions are absent in the C24-C25 region. ..................................................................... 57
Figure 3.1 Partial andrimid biosynthetic pathway starting from (S)-β-phenylalanine via
(S)-α-phenylalanine. a) Several steps. .................................................................................. 80
	  

xiv	  

Figure 3.2 Mechanism of the MIO-dependent isomerization catalyzed by PaPAM. MIO:
4-methylidene-1H-imidazol-5(4H)-one; kcatcinn : the rate at which the cinnamate
by-product is released; kcatβ: the rate at which the β-amino acid product is released. ...... 81
Figure 3.3 (A) Proposed elimination mechanisms for displacement of the NH2-MIO adduct. E1:
unimolecular, E2: bimolecular and E1cB: conjugate-base eliminations. (B) Concerted
hydroamination of the acrylate intermediate. Shown is a transition state intermediate (right)
highlighting the polarization of the π-bond in which the nucleophilic NH2-MIO and the
electrophilic H+ approach Cβ and Cα, respectively. .............................................................. 88
Figure 3.4 Route a) A stepwise Michael-addition pathway. Shown is an intermediate adduct (top
right) with the π-electrons delocalized into the carboxylate group forming a repelling
dianion prior to Cα-protonation. Route b) Concerted hydroamination of the acrylate π-bond.
Shown is an intermediate (middle right) with maximal charge separation between repelling
negative charges in the carboxylate group and the cation and anion. Route c) A stepwise
hydroamination sequence. Shown is a proposed intermediate (bottom right) resulting from
Cα-protonation as the first step, which places a positive charge at Cβ. Cβ is now primed for
nucleophilic attack by the NH2-MIO adduct. ....................................................................... 90
Figure 3.5 An overlay of the NH2-cis and NH2-trans configurations is illustrated, using the
m-methyl-(S)-α-phenylalanine substrate (atoms are C, green; N, blue; O, red). The methyl
group can be positioned on the same side (NH2-cis) or the opposite side (NH2-trans) as the
reactive amino group of the chiral substrate (left). An overlay of the NH2-cis and NH2-trans
active configurations of m-methyl-(S)-α-phenylalanine is modeled in the crystallographic
position of α-phenylalanine in PaPAM (PDB 3UNV). A partial MIO and the active site
residues that cause van der Waals overlap with the ligands are shown (C, light blue; N, dark
blue; O, red). SLIDE and other docking tools cannot model covalently bound ligands,
which are interpreted as disallowed steric overlap (right). Thus, the alkene carbon atoms of
the MIO were removed to dock the substrate. ...................................................................... 94
Figure 3.6 Plot of experimental KM and Etot = E(p-l) (protein-ligand interaction energy) + E(l) (the
intra-ligand energy) calculated with Szybki. The substrates were modeled statically,
according to the trajectory of α-phenylalanine in the PaPAM crystal structure, without
energy minimization. Substrates are labeled according to Table 3.1 and the lower energy of
the two configurations [NH2-cis (red ♦, underlined) and or NH2-trans (blue ▲, arrowed)] is
plotted for the substrates. Substrates with no significant difference in energy between the
NH2-cis and NH2-trans (ΔE < 25 kcal/mol) are shown as filled dots (●). Substrates with
para-substituents (except p-methoxy) without an NH2-cis or NH2-trans preference are
open-circles (○). Non-productive substrates 20 – 22 (not shown) were predicted to prefer
the NH2-trans orientation in the PaPAM active site............................................................. 95

	  

xv	  

Figure 3.7 Structure-activity landscape index (SALI) analysis showing the subset of substrate
pairs exhibiting a large change in KM value upon a small change in structure. Substrate pairs
with SALI scores near 200 (approaching red) indicate the most significant activity cliffs.
Asterisks (*) indicate substrates in NH2-cis configuration; all others are NH2-trans. ........ 103
Figure A.2.1 H-bonding interaction of ortho-methoxy-α-phenylalanine (19) and active site
Tyr320. o-Methoxy-α-phenylalanine atoms are colored as C, green; N, blue; O, red and
Tyr320 atoms are colored as C, light blue; O, red; H, white. ............................................. 109
Figure A.2.2 Relationship between protein-ligand interaction energy E(p-l) and experimental KM.
Substrates were placed in the active site in NH2-cis and NH2-trans orientations overlaid with
the crystallographic orientation of α-phenylalanine from PDB entry 3UNV, and the lower
energy orientation was kept. Left panel: (●) Binding site residues of PaPAM were
maintained in their crystallographic orientation, yielding a linear correlation coefficient of
0.48 between E(p-l) and experimental KM. Right panel: (○) Energy minimization was used to
reduce any repulsive interactions, leading to lower correlation between the resulting
protein-ligand interaction energy and KM value (correlation coefficient = 0.35). .............. 110
Figure A.2.3 Relationship between the electrostatic (Coulombic) component of the
protein-ligand interaction energy EC(p-l) and experimental KM. Substrates were placed in the
active site in NH2-cis and NH2-trans configurations overlaid with the crystallographic
orientation of α-phenylalanine, and the lower energy orientation was kept. Left panel: (●)
Binding site of PaPAM was kept in the crystallographic orientation (correlation coefficient
= 0.33). Right panel: (○) Energy minimization was used to reduce any protein-ligand
repulsive interactions (correlation coefficient = 0.011). ..................................................... 111
Figure A.2.4 Relationship between the van der Waals energy component of the protein-ligand
energy EV(p-l) and experimental KM. Substrates were again placed in NH2-cis and NH2-trans
orientations overlaid with the crystallographic orientation of α-phenylalanine from PDB
entry 3UNV, and the lower energy orientation was kept. Left panel: (●) Binding site
residues of PaPAM were kept in the crystallographic orientation (correlation coefficient =
0.54). Right panel: (○) Energy minimization was used to reduce any protein-ligand
repulsive interactions (correlation coefficient = 0.42). These results indicate that the van der
Waals interaction energy between the protein and each substrate overlaid with the
α-phenylalanine-bound crystal structure is most predictive of the relative KM values of the
substrates. ............................................................................................................................ 112
Figure 4.1 Structure of 3kPZS. ................................................................................................... 119
Figure 4.2 Flowchart describing the pipeline for 3kPZS antagonist discovery. In step 2, an
example of the potential hypothesis is that known GPCR ligands are likely to mimic 3kPZS
	  

xvi	  

and block SLOR1................................................................................................................ 122
Figure 4.3 Interactions between 3kPZS and SLOR1 predicted by homology modeling and SLIDE
docking performed by Dr. Leslie Kuhn and Qinghui Yuan. SLOR1 side-chain atoms and the
binding site surface are colored green for carbon atoms, blue for nitrogen, red for oxygen,
and yellow for sulfur. Carbon atoms of 3kPZS are shown in white tubes (center), with
hydrogen bonds and salt bridges to the receptor shown as yellow dashed lines. The sulfate
ester moiety is predicted to bind deep in the SLOR1 cleft (left), forming salt bridges with
His110. The methyl-group face of the steroid ring (bottom-center) interacts with an entirely
hydrophobic face of the cleft in SLOR1. ............................................................................ 141
Figure 4.4 Bile acid binding motif in SLOR1 identified based on conserved features of cholate
binding in a set of unrelated proteins (yellow), relative to the SLIDE docking orientation of
3kPZS (blue). The predicted binding orientation for cholate (horizontal molecule at center,
with carbon atoms in yellow tubes) substantially overlays with the docked 3kPZS molecule
(blue horizontal molecule), despite the bent (5-beta) cholate steroid ring in place of the
relatively planar (5-alpha) 3kPZS steroid. Their negatively charged sulfate tail groups are
predicted in highly similar positions (center-right). Side chains making key interactions
with cholate in cytochrome C oxidase (PDB entry: 2DYR) are shown below in yellow (Tyr,
Phe, Trp, and His), and SLOR1 side chains interacting with 3kPZS (Tyr, Leu, His) are
shown in blue. ..................................................................................................................... 142
Figure 4.5 Histogram of the first 143 compounds according to their percent reduction in 3kPZS
olfaction by sea lampreys. Chemical structures and names are shown for the eight most
active compounds, which exhibit >45% reduction of 3kPZS response. ............................. 144
Figure 4.6 (A) ROCS TanimotoCombo scores for the pairwise compounds with significant
activity cliffs with SALI score ≥ 70. (B) SALI scores for the pairwise compounds with
significant activity cliffs with SALI score ≥ 70. ................................................................. 146
Figure 4.7 (A) Compound without 7-OH group (in green) is twice as active (70% reduction in
3kPZS response) as compound with 7-OH (in blue; 35% reduction). Tail structure is same
in both. (B) Butane sulfate is 16% more active than butane phosphate. ............................ 146
Figure 4.8 Assayed compounds with aliphatic tails. Shown in purple is the 3 carbon compound
(ZINC01587861) with 38% inhibition of EOG response of 3kPZS; Shown in red is the 4
carbon compound (ZINC01845398) with 50% inhibition of EOG response; Shown in gray
is the 5 carbon compound (ZINC01587862) with 32% inhibition of EOG response; Shown
in cyan is the 6 carbon compound (ZINC01841381) with 31% inhibition of EOG response;
Shown in orange is 6 carbon compound (ZINC01680379) with ethyl group, which inhibits
EOG response by 18%; Shown in green in the 8 carbon compound (ZINC14591952) with
	  

xvii	  

48% inhibition of EOG response. 0.52; Shown in yellow is the 12 carbon compound
(ZINC01532179) with 46% inhibition of EOG response. .................................................. 147

	  

xviii	  

KEY TO ABBREVIATIONS

PDB: Protein Data Bank
vHTS: virtual high throughput screening
GPCRs: G-protein coupled receptors
SMILES: Simplified molecular-input line-entry system
SMARTS: SMiles ARbitrary Target Specification
RMSD: Root Mean Square Deviation
CcO: cytochrome c oxidase
CCM: cholesterol consensus motif
CRAC: cholesterol recognition amino acid consensus
TSPO: translocator protein
FXR: farnesoid X receptor
CLR: cholesterol
CHD: cholate
ATP: Adenosine triphosphate
PaPAM: Phenylalanine aminomutases from the bacterium Pantoea agglomerans
MIO: 4-methylidene-1H-imidazol-5(4H)-one
vdW: van der Waals
SALI: structure-activity landscape index

	  

xix	  

ccoef: correlation coefficient
3kPZS: 7а,12а,24-trihydroxy-3-one-5а-cholan-24-sulfate
SLOR1: sea lamprey olfactory receptor 1
GLL: GPCR ligand library
TAAR: trace amine-associated receptors
EOG: electro-olfactogram
TLC: taurolithocholic acid

	  

xx	  

Chapter 1 Introduction

	  

1	  

1.1 Introduction

Understanding protein-ligand interaction motifs and the determinants for specific
protein-ligand binding is the first step for scientists to uncover the secrets of protein-ligand
recognition, and understand how small molecules regulate biological processes. There are about
68  000 protein–ligand complexes and 2 million ligand-binding sites found in all the
protein-ligand 3-dimensional structures of the current Protein Data Bank (PDB, www.rcsb.org).1
This enormous amount of structural data gives us the opportunity to mine protein-ligand binding
motifs across different protein families to understand protein structures and their functions,2 to
modify enzyme functions,3 and to discover novel drugs and pharmaceutical targets.4

1.1.1 Computer technologies used in biochemistry

Computer technologies have wide usage in biochemistry, including the study of
protein-ligand interactions. In recent years, fast developments in computers, programming
languages, and algorithms enable scientists to solve biochemical problems quantitatively and
explain experimental data in sophisticated ways.

There are many resources on the Internet to retrieve protein data. For example, the Protein
Data Bank (PDB, www.rcsb.org)1 contains all the current protein and ligand 3D coordinates
from X-ray crystal structures and NMR structures. PDBsum (http://www.ebi.ac.uk/pdbsum/)5
allows users to analyze protein-ligand interactions using 2D LigPlot6,7 figures. It also provides
	  

2	  

the binding cleft information for most protein structures8.

In addition to providing protein data, there are also many online resources that can translate
and store information regarding the small molecules that bind, or potentially can bind to proteins.
ZINC 12 is an online accessible database (http://zinc.docking.org9) that provides structures of
millions of chemical compounds, including many that are commercially available, which can be
used for virtual high throughput screening (vHTS) in ligand discovery. Furthermore, computer
graphic techniques also have broad usage in molecular modeling and structure comparisons. A
graphic tool such as PyMOL (The PyMOL Molecular Graphics System, Version 1.5.0.4,
Schrödinger, LLC), allows scientists to visualize and manipulate 3D models of molecules on a
graphical display device.

1.2 Representations of protein and small molecule structures

Learning about the different representations of protein and ligand structures is the beginning
of utilizing that information to study protein-ligand interactions. In general, the information of
protein and ligand structures can be stored in one-dimensional notation (1D) like SMILES
strings or fingerprints, two-dimensional drawings (2D) and three-dimensional structures (3D),
respectively, as shown in Figure 1.1. Different from 3D modeling, there is no quantitative data
such as spatial coordinates of the structures in one-dimensional or two-dimensional
representations.
	  

3	  

Protein	  

1D	  notation	  (amino	  
acid	  sequence	  in	  e.g.,	  
FASTA	  format)	  

Sequence	  motifs	  for	  
ligand	  binding	  
prediction	  

2D	  drawing
(protein	  topology	  e.g.,	  
Protlog)	  

Topological	  motifs	  for	  
ligand	  binding	  
prediction	  

3D	  modeling
(coordinates	  in	  PDB	  
format)	  

Structural	  motifs	  for	  
ligand	  binding	  
prediction	  

1D	  notation
(e.g.,	  SMILES)	  

ligand	  structure	  
comparison	  for	  ligand	  
searching

2D	  drawing	  
(e.g.,	  ISIS	  draw)	  

ligand	  structure	  
comparison	  for	  ligand	  
searching

3D	  modeling
(coordinates	  in	  MOL2	  
or	  SDF	  format)	  

ligand	  structure	  
comparison	  for	  ligand	  
searching

1D	  notation
(List	  of	  interactions	  
and	  their	  energy)	  

Dominating	  
interactions	  

Protein-­‐ligand	  
recognition	  motifs	  
Ligand	  	  

Protein-­‐ligand	  	  
complex	  

Figure 1.1 Representations of protein, ligand and protein-ligand interactions for structural and
chemical similarity mining.

1.2.1 Protein structure representations and applications

An amino acid sequence or primary structure is a one-dimensional (1D) representation of
protein structure. It can be written as a string of amino acids in one letter abbreviation, stored in
various formats such as FASTA and is one of the most commonly used data types in
bioinformatics. Through alignment between two or more amino acid sequences from different
	  

4	  

proteins, similarity scores between these sequences can be obtained, from which the structural,
functional and evolutionary relationships of these proteins can be deduced. In addition, a
commonly used drug discovery technique is to discover new ligands based on the structures of
proteins using various docking tools in structure-based virtual screening.10-13 However,
sometimes the target protein structures are not available, especially for membrane proteins such
as G-protein coupled receptors (GPCRs), which make up 40% of pharmaceutical targets in the
pharmaceutical industry.14,15 Under these circumstances, 3-dimensional models of these proteins
can be built first if the degree of sequence similarity is high enough with an existing 3D structure,
to enable structure-based virtual screening. Amino acid sequence alignment is the critical step of
homology modeling.16 The threshold of sequence identities to build a reliable homology model
for the unknown protein must be above 25% over at least 80 aligned residues.17

Protein topology can be used to represent a protein’s structure in 2D diagrams, describing
the orientation and connection information of secondary structure elements (SSEs) of that
protein.18 Even though protein topology neglects the atomic information as shown in 3D
structures, it can show SSEs in a way that helps scientists to analyze protein folds. This aids the
annotation of protein families, domains, and functions, and the study of evolutionary
relationships. However, there are limited applications of protein topology in drug discovery, due
to the relatively a few types of protein topology found related to specific ligand binding.

3D structures of proteins allow the shape and chemical features of binding sites to be studied
	  

5	  

thoroughly. Unlike 1D and 2D representations, the advantage of 3D models of proteins is that
they can be analyzed independent of specific protein residues or connections. Instead, the 3D
models supply a spatial perspective to study the chemical interactions between proteins and
ligands, from which the interaction details such as hydrogen-bonding can be analyzed. The
details of the interactions can help us understand the determinants for specific ligand binding. In
drug discovery, if the binding site residues that are important for ligand binding are known, it is
possible to improve protein activities through modifying these residues.

1.2.2 Molecular structure representations and applications

Simplified molecular-input line-entry system (SMILES)19,20 is a commonly used 1D
representation of chemical compound structures in cheminformatics. It is a string of characters to
describe molecular formulas, atomic connections, bond types and chiral atoms in molecules. A
unique SMILES using a canonicalization algorithm is called canonical SMILES. SMILES
containing isotopic and stereochemical information are called isomeric SMILES. SMILES
strings are similar to condensed structural formulas while still having some differences. In
SMILES, the lower-case letters are for aromatic atoms, and the capital letters are for aliphatic
atoms. “@” is for anticlockwise chirality and “@@” is for clockwise chirality. SMiles ARbitrary
Target Specification (SMARTS),21, 22 an extension of generic SMILES, allows the representation
of broader structural patterns for searching chemical compounds. Any valid SMILES expressions
are valid SMARTS string, not vice versa. In SMARTS, “*” indicates that any atom can match,
	  

6	  

“~” indicates that any bond can match. For example, SMARTS for a steroid ring are shown in
Figure 1.2.

SMILES and SMARTS have broad usages in structure retrieving, substructure and similarity
searching that enable identifying relevant sets of compounds to analyze. For example, based on
isomeric SMILES, 3D structures of molecules with stereochemistry information can be
generated. Based on SMARTS, Root Mean Square Deviation (RMSD) of substructures
representing the closeness with which the molecules can be overlaid can be calculated. Online
chemical resources, such as PubChem,23 ZINC12,9 and SCIFINDER (https://scifinder.cas.org/),
provide interfaces to allow the user to enter SMILES strings for exact, substructure and
similarity searches.

	  

7	  

Figure 1.2 SMARTS and 2D sketching of steroid ring using SMARTSViewer.

2D drawings of chemical structures, such as ISIS drawing, provide direct views of molecular
2D structures, including atom types, bond connections and stereochemistry properties. 2D
drawings, just like 1D notations, are often used for searching molecules with exact, similar or
sub- structures. Current online resources, such as PubChem,14 ZINC12 database,9 provide the
interface to let user draw the structures of molecules they are interested in, from which structure
exact, substructure and similarity search can be performed.

In 3D models of molecules, not only the atom types, connectivity and stereochemistry
information can be viewed directly, but also the molecule shape and electrostatic distribution can
be depicted too. This allows molecular comparisons not only as ID strings and 2D connectivity
diagrams, but also as 3-dimensional structures reflecting bioactive or other conformations. 3D
modeling can go further by providing opportunity to calculate molecular similarity based on
	  

8	  

entire molecular shape and charge distribution, regardless of specific atom types and connections.
After we know the presentations and applications of protein and ligand structures, how to
translate the information into more general and applicable information to help study of
protein-ligand interaction and further ligand discovery is still challenging.

1.3 Predicting ligand binding only given protein information

Starting from tools such as those described above, given only protein information, can we
predict which ligands can bind in the given site of a protein? What is the relationship between
protein motifs and specific ligand binding? To answer these questions, there is generally a
two-step methodology. The first step is to obtain motifs through comparisons of protein
information. Comparison of protein sequence information (1D) can provide potential protein
interaction motifs, while comparison of protein structures in 2D can define topological motifs.
Protein structural motifs in 3D can be defined by similar main-chain motifs and as we show in
Chapter 2, can be generalized beyond residue correspondingly. Once a potential ligand binding
motif has been defined, its predictive value can be evaluated statistically.

1.4 Combination of multiple techniques in drug discovery

All of the techniques to utilize the structures and properties of protein and small molecules
can be integrated into virtual high throughput screening techniques from the ligand comparison,
protein comparison or protein-ligand interaction perspectives. Virtual screening is a powerful
	  

9	  

and successful tool at the initial stage of drug or inhibitor discovery. It can aid scientists to
discover lead compounds in one of the most efficient ways, not only because it integrates
currently accessible information of structures, properties and functions of protein and small
molecules, but also because of the fast and efficient screening speed and low cost.13, 24, 25 There
are two major common techniques in virtual screening techniques, one being structure-based
virtual screening and the other being ligand-based virtual screening. In structure-based virtual
screening, the structure of the target protein can be used for small ligand docking. In small ligand
docking, millions of compounds are docked at the binding site of the target protein and evaluated
by different scoring function.10-13 Compounds with higher docking scores should have a higher
probability to interact with the target protein. Secondly, by comparing a given potential binding
cleft on a protein to all ligand-bound clefts in the Protein Data Bank, two kinds of information
can be gained: which ligand(s) bind to similar sites, and which other proteins might be off-target
hits, presenting specificity issues for a given designed inhibitor or agonist. However, docking
results and site comparisons are influenced by the quality of protein structures, especially when
target proteins have low-resolution crystal structures or homology models are used. Under these
circumstances, ligand based virtual screening techniques can find compounds with similar
structures to the native substrates or known ligands of the target proteins.11, 26 The hypothesis of
ligand-based screening is that the compounds with similar structures are likely to have similar
biological activities.

	  

10	  

1.5 Objectives of this dissertation

Given a binding pocket on membrane or soluble proteins, the goal of this research is to
predict the most likely ligand or native lipid by comparing the site with established predictors,
only using protein information. Because the native ligands or lipids binding to most
membrane-exposed sites are undefined, due to the low resolution of structure determination or
displacement by detergent or lack of crystal structures, a ligand binding predictor can provide
good hypotheses as to the native ligand(s) that can be validated by experimental results. Our
group has collaborated with three experimental groups working on bile acid and cholesterol
binding: Professors Ferguson-Miller and Atshaves in the Biochemistry & Molecular Biology
Department and Professor Li in the Fisheries & Wildlife Department. The initial focus lies on
prediction of cholesterol (CLR) or cholate binding sites, and characterization of determinants
that distinguish CLR or cholate binding from sites binding other molecules. This approach can
elucidate whether the determinants of binding for cholesterol or cholate are the same in
membrane proteins as in soluble proteins and the difference between cholesterol and cholate
binding. This project is described in Chapter 2.

	  

11	  

REFERENCES

	  

12	  

REFERENCES

(1) Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov,
I. N.; Bourne, P.E., The Protein Data Bank Nucleic Acids Research, 2000, 28, 235-242.
(2) Grishin, N. V. Fold Change in Evolution of Protein Structures. Journal of Structural Biology,
2001, 134, 167–185.
(3) Gutteridge, A., Thornton, J.M. Understanding nature’s catalytic toolkit. Trends Biochem.
Sci. 2005, 30, 622–629.
(4) Rognan, D. Chemogenomic approaches to rational drug design. Br. J. Pharmacol. 2007,
152, 38–52.
(5) de Beer, T. A. P; Berka, K.; Thornton, J. M.; Laskowski, R. A. PDBsum additions. Nucleic
Acids Res. 2014, 42, 292-296.
(6) Wallace, A. C.; Laskowski, R. A.; Thornton, J. M. LIGPLOT: a Program to Generate
Schematic Diagrams of Protein-Ligand Interactions. Protein Eng. 1996, 8, 127-134.
(7) Laskowski, R. A.; Swindells, M. B. LigPlot+: Multiple Ligand−Protein Interaction
Diagrams for Drug Discovery. J. Chem. Inf. Model. 2011, 51, 2778−2786.
(8) Laskowski, R. A. SURFNET: A program for visualizing molecular surfaces, cavities, and
intermolecular interactions. Journal of Molecular Graphics 1995, 13, 323-330.
(9) Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G. ZINC: A Free
Tool to Discover Chemistry for Biology. J. Chem. Inf. Model. 2012, 52, 1757−1768.
(10) Kroemer, R. T. Structure-Based Drug Design: Docking and Scoring. Current Protein and
Peptide Science, 2007, 8, 312-328.
(11) Reddy, A. S.; Pati, S. P.; Kumar, P. P.; Pradeep, H. N.; Sastry, G. N. Virtual Screening in
Drug Discovery – A Computational Perspective. Current Protein and Peptide Science, 2007, 8,
329-351.
(12) Perola, E.; Walters, W. P.; Charifson, P. S. A Detailed Comparison of Current Docking and
Scoring Methods on Systems of Pharmaceutical Relevance. Proteins. 2004, 56, 235–249.
	  

13	  

(13) Kitchen, D. B.; Decornez, H.; Furr, J. R.; Bajorath, J. Docking and Scoring in Virtual
Screening for Drug Discovery: Methods and Applications. Nat Rev Drug Discov. 2004, 3,
935-949.
(14) Flower, D. R. Modelling G-protein-coupled receptors for drug design. Biochim Biophys
Acta 1999, 1422, 207–234.
(15) Robas, N.; O’Reilly, M.; Katugampola, S.; Fidock, M. Maximizing serendipity: strategies
for identifying ligands for orphan G-protein-coupled receptors. Curr Opin Pharmacol. 2003, 3,
121–126.
(16) Chothia, C.; Lesk, A. M. The relation between the divergence of sequence and structure in
proteins. EMBO J., 1986, 5, 823–826.
(17) Sander C.; Schneider, R. Database of homology-derived protein structures and the
structural meaning of sequence alignment, Proteins, 1991, 9, 56-68.
(18) Rawlings, C. J.; Taylor, W. R.; Nyakairu, J.; Fox, J.; Sternberg, M. J.E. Reasoning about
protein topology using the logic programming language PROLOG. Journal of Molecular
Graphics, 1985, 3, 151-157.
(19) Weininger, D. SMILES 1. Introduction and Encoding Rules", J. Chem. Inf. Comput. Sci.
1988, 28, 31.
(20) James, C. A.; Weininger, D. Daylight Theory Manual. Daylight Chemical Information
Systems, Inc: 27401 Los Altos, 2006.
(21) Schomburg, K.; Ehrlich, H.-C.; Stierand, K.; Rarey, M. From Structure Diagrams to Visual
Chemical Patterns, J. Chem. Inf. Model., 2010, 50, 1529-1535.
(22) Schomburg, K.;Ehrlich, H.; Stierand, K.. Chemical pattern visualization in 2D–the
SMARTSviewer. Journal of Cheminformatics, 2011, 3, O12.
(23) Bolton, E.; Wang, Y.; Thiessen, P. A.; Bryant, S. H. PubChem: Integrated Platform of
Small Molecules and Biological Activities. Chapter 12 IN Annual Reports in Computational
Chemistry, Volume 4, American Chemical Society, Washington, DC, 2008 Apr.
(24) Doman, T. N.; McGovern SL; Witherbee, B. J.; Kasten, T. P.; Kurumbail, R.; Stallings, W.
C.; Connolly, D.T.; Shoichet, B. K. Molecular docking and high-throughput screening for novel
inhibitors of protein tyrosine phosphatase-1B. J. Med. Chem. 2002, 45, 2213–2221.

	  

14	  

(25) Zarzycka, B.; Seijkens, T.; Nabuurs, S. B.; Ritschel, T.; Grommes, J.; Soehnlein, O.;
Schrijver, R.; van Tiel, C. M.; Hackeng, T. M.; Weber, C.; Giehler, F.; Kieser, A.; Lutgens, E.;
Vriend, G.; Nicolaes, G. A. F. Discovery of Small Molecule CD40−TRAF6 Inhibitors. J. Chem.
Inf. Model., 2015, 55, 294–307.
(26) Krüger, D. M.; Evers, A. Comparison of structure- and ligand-based virtual screening
protocols considering hit list complementarity and enrichment factors. ChemMedChem. 2010, 5,
148-58.

	  

15	  

Chapter 2 Decoding protein structural motifs for ligand binding prediction

Reprint (adapted) with permission from CholMine: Determinants and Prediction of Cholesterol
and Cholate Binding Across Nonhomologous Protein Structures. Nan Liu, Jeffrey Van Voorst,
John B. Johnston and Leslie A. Kuhn. J. Chem. Inf. Model. 2015, 55, 747–759.

	  

16	  

2.1 Introduction

Given a series of non-homologous proteins binding the same ligands, we show that binding
motifs can be extracted from protein information alone. The lack of generalized binding motifs
for certain ligands inspires us to use our local structure alignment tool SimSite3D to extract
abstract and generalized motifs for the ligands we are particularly interested in, for example,
cholesterol and cholate. By deciphering the determinants of binding for these important steroids,
the CholMine tool we developed (which incorporates SimSite3D site alignment) may also aid in
the design of selective inhibitors and detergents for targets such as G protein coupled receptors
and bile acid receptors.

2.1.1 Conserved lipid binding sites in membrane proteins

Membrane proteins are surrounded by a complex mixture of lipids, including phospholipids,
cholesterol and some bile salts (bile acids and alcohols). One of the bile salts, cholate, is often
used as a detergent to solubilize membrane proteins.1,2 Different types of lipids influence
biological functions of membrane proteins in direct or indirect ways.3,4,5 Conserved binding sites
for certain lipids have been characterized on membrane proteins,4,6,7 and these lipids can play an
important role in structural stabilization and biological processes. For example, in bovine heart
cytochrome c oxidase (CcO), the tails of two phosphatidylglycerol lipids regulate oxygen
transfer to the active site, and phosphatidylethanolamine, cardiolipin, and phosphatidylglycerol
are all associated with the dimer interface.4,6 Detergents can occupy natural lipid sites under
	  

17	  

different experimental conditions.7 For example, phosphatidylcholine in bovine CcO and the
detergents decyl maltoside in Rhodobacter sphaeroides and lauryldimethylamine oxide in
Paracoccus denitrificans CcO occupy the same crevices of the proteins in different crystal
structures7. Defining the determinants of lipid binding can help scientists understand the
structural basis for the specificity of these sites, and aid in the design of site-selective ligands and
detergents for protein purification and structure determination.

2.1.2 Cholesterol, cholate and related sequence-based binding motifs

Cholesterol (Figure 1(A)) plays an important role in the function of many biological systems,
including eukaryotic, viral and prokaryotic proteins. While cholesterol is often considered
important because of its role in membrane organization, including lipid rafts,8 cholesterol also
exerts important regulatory effects via direct, specific binding to proteins. Through binding to the
nicotinic acetylcholine receptor and many G protein-coupled receptors (GPCRs), cholesterol
modifies the receptors’ affinity for agonists.9 Additionally, mutations in the cholesterol-binding
sites of virus envelope proteins, such as the HIV protein gp41 and Semliki Forest virus E1
protein, inhibit virus invasion at the fusion and budding stages.10 In addition, cholesterol binding
by podocin and MEC-2, members of the prohibitin domain family, is essential for regulating the
activity of their ion channel partners.11

	  

18	  

(A)

(B)

Figure 2.1 2D and 3D chemical structures of (A) cholesterol (blue) and (B) cholate (yellow),
with the flexible tails from C21 to C24/C25 shown in arbitrary favorable conformations.

A recent proteomic study mapped cholesterol-protein interactions in mammalian cells with
photoreactive sterol probes, followed by quantitative mass spectrometry.12 Their work identified
over 250 cholesterol binding proteins, including some known to biosynthesize, transport and
regulate cholesterol, as well as others known to regulate sugars and glycerolipids or participate in
	  

19	  

vesicular transport and protein glycosylation and degradation.

Cholesterol-binding sequence motifs have been proposed for several protein families. For
instance, a cholesterol consensus motif (CCM) has been identified in class A GPCRs as
matching the amino acid sequence R/K-(X)1-7-I/V/L-(X)1-3-W/Y on one transmembrane alpha
helix.

The “strict CCM” also contains F/Y on a neighboring helix, based on residue

conservation analysis between known cholesterol sites.13 An expanded version of the CCM
includes serine/glycine in one helix that forms an interhelical hydrogen bond with the CCM W/Y
residue on an adjacent helix. The additional hydrogen bond is proposed to adjust the orientation
of the aromatic side chain to enhance its stacking interactions with the steroid ring system.14 A
similar motif, the cholesterol recognition amino acid consensus or CRAC motif, has been
defined in the outer mitochondrial membrane translocator protein (TSPO; also known as the
peripheral benzodiazepine receptor). This consensus motif is L/V- (X) 1-5-Y- (X) 1-5-R/K, based
on the loss of cholesterol uptake in TSPO Y153 and R156 mutants and alignment of this
sequence region with other cholesterol binding proteins.15,16 Recently, an enhanced version of the
CRAC motif, LAF-CRAC, has been shown to be associated with nanomolar affinity for
cholesterol in TSPO.17 CARC, a cholesterol binding motif in the nicotinic acetylcholine
receptor,18 and a tilted peptide cholesterol binding motif have also been described.19,20

However, sequence motifs derived from one protein family often do not generalize well to
predicting cholesterol-binding sites in other families, and these sequence motifs also match sites
	  

20	  

that do not bind cholesterol. For instance, analysis of 2,100 proteins in a bacterium that does not
contain cholesterol found 5,000 matches to the CRAC motif.21 Additional cholesterol binding
sites are known that do not match any previously known motifs, for instance, the additional
cholesterol sites known in some class A GPCRs. A GXXXG motif has been found to be critical
for cholesterol binding to the β-amyloid precursor protein, as characterized by cholesterol
titration and mutagenesis.22 Cholesterol binding to this protein has been proposed to promote
amyloidogenesis in Alzheimer’s disease.23 For cytolytic toxin recognition of cholesterol, a
simple motif composed of a threonine-leucine pair in loop L1 has been identified by mutation
analysis.24 Thus, cholesterol binding sequence motifs appear to be fairly specific to protein
families. Our aim is to uncover general features of cholesterol recognition that are shared by
different protein families, and which discriminate cholesterol binding sites from other ligand
sites. These features can then be tested for their ability to capture a broader range of
cholesterol-binding sites via application of the resulting predictor, CholMine.

Prediction of cholate binding sites also attracts our attention for several reasons. Cholate
(Figure 2.1(B)) is used extensively as a membrane protein solubilizing detergent.1,2 Crystal
structures show cholate occupying binding pockets on membrane proteins, and this molecule
shares significant similarity with cholesterol in shape and steroidal chemistry, aside from its
dissimilar polar tail. Cholate, a bile acid, functions in some cells as a steroid hormone that binds
to nuclear receptors to modulate gene expression.25 Several soluble nuclear receptors have been
reported to bind bile acids, including farnesoid X receptor (FXR), liver X receptor alpha, and
	  

21	  

cyclopentyladenosine receptor. The resulting complexes stimulate or suppress gene transcription
by binding to promoter regions.25 Cholate is also one of the two major bile acids synthesized
from cholesterol and plays an essential role in the absorption of fat and lipidic vitamins, by
forming micelles to solubilize fat.26,27 Cholate has been shown to be an agonist for the human
bile acid G protein coupled receptor TGR5, involved in suppression of macrophage function.28,29
Lastly, a relative of cholate, 3-keto petromyzonol sulfate, acts as a vertebrate pheromone through
interaction with two other GPCRs.30 Thus, understanding the determinants of cholate binding and
identifying features that distinguish between cholate and cholesterol sites will be useful for
designing site-selective ligands and detergents for stabilizing and purifying membrane proteins,
and for interpreting ambiguous electron density in crystallography.

2.1.3 Determinants of lipid-membrane protein binding

What is known about the determinants for protein interaction with lipids, in general? Four
important factors can be summarized from the literature. The first is the presence of aromatic
residues such as tryptophan (W), tyrosine (Y) and phenylalanine (F). Tryptophan and tyrosine
are preferred at membrane interfaces.31 In the Ballesteros-Weinstein numbering scheme to
facilitate comparison of G-protein coupled receptors (GPCRs), residues are labeled by two
indices, X.Y, the first indexing the transmembrane helix number in which the residue occurs, and
the second indicating the position within the helix. The position number 50 is assigned to the
most highly conserved position in each helix, with numbers increasing towards the C-terminus.32
	  

22	  

The Trp residue at position 4.50 in class A GPCRs, involved in cholesterol binding, is highly
conserved (94%).13 Aromatic residues contribute to cholesterol binding through favorable π and
hydrophobic interactions with the steroid ring system of cholesterol.13 The second class of
residues contributing to lipid binding includes the positively charged residues lysine (K),
arginine (R), and histidine (H), which form electrostatic interactions with the polar or negatively
charged head groups of lipids.3,33 Uncharged polar residues such as serine (S), threonine (T), and
cysteine (C) also contribute by forming hydrogen bonds with lipids (where cysteine acts as a
weak hydrogen-bond acceptor).3,33 The last class of residues involved in lipid binding includes
the moderately bulky hydrophobic residues isoleucine, leucine, and valine (I, L, V), as found in
the CCM and CRAC motifs. Position 6.57 in GPCRs is conserved with isoleucine and valine in
adenosine receptors.34 These residues form van der Waals interactions with the hydrophobic part
of lipids, participate in stacking interactions, and form hydrophobic grooves for binding.3,31,34

2.1.4 Previous prediction of lipid binding sites

Regions of lipid interaction have also been predicted using entire amino acid sequences,
rather than motifs, along the lines of the transmembrane protein segment predictors that became
popular in the 1980s. However, this type of prediction typically focuses on annotating membrane
spanning regions of the protein sequence and does not provide information about pockets
comprised of discontiguous parts of the protein that bind lipids tightly, the kind of lipid
occupying each pocket, or the chemical and spatial determinants of lipid specificity. For
	  

23	  

example, different categories of lipid-interacting proteins have been predicted, according to lipid
degradation, metabolism, synthesis, transport, and other functions, by using amino acid sequence
information from the SwissProt database.35 In addition, residues involved in lipid binding have
been predicted based on amino acid sequence and residue conservation using a support vector
machine.36 However, this approach does not provide spatial or lipid-specificity information that
extends to new protein classes. Lipid-binding sites in several key cytoskeletal proteins have been
predicted using a matrix-based algorithm to identify highly hydrophobic or amphipathic amino
acid segments,37 again predicting transmembrane secondary structure segments rather than
pockets where lipids bind tightly and specifically. The goal of the work presented here is to
identify the shared chemical determinants of cholesterol and cholate binding across
non-homologous protein sites, and develop a sensitive and specific predictor for these sites.

	  

24	  

2.2 Methods

Our identification of the determinants for cholesterol and cholate binding employs
SimSite3D to align and quantify the similarity between pairs of binding sites.38 The predictive
accuracy is enhanced by incorporating knowledge of conserved interaction hotspots shared by
cholesterol or cholate binding sites. In developing the CholMine predictor, we test the hypothesis
that cholesterol (or cholate) binding in different proteins involves a characteristic set of
interactions that distinguish cholesterol/cholate binding from other ligands.

2.2.1 SimSite3D and site maps for aligning and comparing protein sites

To align pairs of non-homologous protein sites and find the relative orientation with
maximum shape and chemical similarity in the absence of ligand information, we use
SimSite3D.38,41 This method aligns two protein sites based on their similarity in surface shape
and chemical features, without requiring underlying sequence or structural similarity. For a given
query site, the similarity to another site is measured in standard deviations relative to the query’s
mean score when aligned to all cases in a set of 140 ligand-binding sites (including one
cholesterol site) chosen from proteins with undetectable sequence and structural homology to
one another, representing a highly diverse set of ligand sites (Table A.1.1). This Z-score
measures the statistical significance of a match.

An alignment between two sites with a

SimSite3D score less than -1.5 (in standard deviation units, where more negative values indicate
greater similarity) results in 2 Å RMSD or better site alignment in 80% of cases, based on tests
	  

25	  

across pterin, adenine, peptide and xenobotic binding sites from which the ligand has been
removed.38,41 SimSite3D alignment and scoring can also discriminate binding sites with similar
chemical features that do not bind the same ligand. By contrast, other ligand site prediction
methods either use information for both the ligand and receptor,39 or they only predict binding
sites with high sequence similarity within certain protein families such as GPCRs.40

The site map representation used by SimSite3D is a set of chemically labeled points in
3-dimensional space derived from residues in a user-defined or known ligand binding site. The
site map represents a negative chemical image of the protein, indicating ideal positions for ligand
atoms of a given chemistry to interact favorably with the protein. Each site map point can be
related back to the corresponding protein atom(s). Hydrophobic site map points are set down
discretely in a hemispherical array around hydrophobic protein atoms based on internal protein
coordinates, such that two perfectly overlaid identical side chains will have exactly matching
hydrophobic points, regardless of their initial Cartesian coordinates. Similarly, polar points are
generated according to the favored geometry of hydrogen bonds relative to donor or acceptor
groups in the protein (as is done for SLIDE docking templates42), with hydrogen-bond
donor-acceptor atom interactions in the range of 2.5-3.5 Å, and the angle between the donor,
hydrogen and acceptor atoms falling between 120° and 180°. In SimSite3D, the matches of
hydrogen-bonding groups are scaled according to the extent to which their hydrogen bonding
vectors point in the same direction, based on the colinearity of (cosine of the angle between)
their donor-acceptor vectors. Exact overlap (angle of 0°) yields a weight of 1 for the hydrogen
	  

26	  

bond match, and an angle of 90° yields a weight of 0. In the CholMine implementation, the
boundaries of a site map are determined either by user specification of a set of residues
comprising the cleft to be analyzed, or by a set of ligand atom coordinates (which can be based
on an experimentally determined or hypothesized ligand position that the user would like to
assess). The ligand coordinates are then used to define a volume for site map generation, by
selecting the set of protein residues containing at least one atom within 4.5 Å of one or more
ligand atoms. SimSite3D reads ligand coordinates in Tripos mol2 format for site map generation.
Ligand coordinates are converted from PDB format to mol2 format, as needed, by using the
molcharge utility in QuacPac v. 1.3.1, utilizing OEChem toolkit v. 1.6.1 (OpenEye Scientific
Software, Santa Fe, NM; http://www.eyesopen.com).

2.2.2 Extraction of an interaction motif for binding the same ligand in non-homologous sites

The goal of this work is to identify a motif that characterizes the binding of cholesterol (or
cholate) across non-homologous proteins. For moderately to highly polar ligand sites, the
SimSite3D score, which calculates the degree of chemical match between two sets of aligned site
map points and their degree of molecular surface shape similarity, is usually sufficient to filter
out false positive site matches while aligning and detecting most of the true positive sites.
However, cholesterol sites are unusually hydrophobic, and the degree of conservation of polar
interactions between non-homologous cholesterol sites is low, particularly because crystal
structures show that the cholesterol hydroxyl moiety is often exposed to bulk water rather than
	  

27	  

interacting directly with protein atoms. As a result, CholMine employs SimSite3D to align and
score a pair of site maps, and then determines whether this alignment matches a majority of
conserved points of hydrophobic interaction identified from known cholesterol binding sites.
Table 2.1 lists protein structures containing the twelve low-homology cholesterol sites, which
were divided into two sets: the first set for training to detect conserved points of cholesterol
interaction, and the second set for unbiased testing of cholesterol site predictions on a series of
unrelated proteins. The cholesterol sites from dogfish and pig sodium-potassium pump proteins
(PDB entries 2ZXE and 3KDP) were both included in the training set because their cholesterol
binding residues were in different conformations. The number of independently determined,
well-resolved, non-homologous cholesterol binding sites in the Protein Data Bank is limited,
likely due to the extreme difficulty in handling this ligand, which has extremely low aqueous
solubility. However, including several cholesterol sites from the same protein family would bias
towards identifying a family-specific motif, whereas the goal here is to discover the chemical
determinants of cholesterol binding sites in general. Therefore, we tested the extent to which the
cholesterol binding motif determined from the training set cases can predict cholesterol sites well
in other proteins, including: the non-homologous cholesterol binding sites in the test set, a series
of cholesterol-binding class A GPCR structures showing sequence and conformational diversity,
a set of non-cholesterol steroid binding sites, a set of aliphatic lipid binding sites, a set of 109
bacterial membrane proteins that do not contain cholesterol binding sites, and 139 soluble protein
sites known to bind ligands other than cholesterol. Including only membrane protein cholesterol
binding sites in the training set and only soluble sites in the training set (and then inverting the
	  

28	  

sets) allowed us to further test whether cholesterol binding motifs are similar in these different
cellular environments.

To determine the conserved cholesterol contacts shared by diverse binding sites, CholMine
employs the binary string output of SimSite3D (Figure 2.2), representing spatially aligned
SimSite3D interaction points. Once a set of known cholesterol or cholate training sites has been
aligned by SimSite3D based on matching the 3-dimensional site map points and the surface
shape derived from protein atom coordinates alone, the software determines which site map
points overlay in 3-dimensional space and have the same chemical interaction type (are
conserved between the sites). The most highly conserved interaction points can then serve as a
fingerprint, or filter, that aids in recognizing cholesterol sites.

The determination of conserved interaction points can be conceptualized as a matrix of
SimSite3D-aligned site map points (Figure 2.2) indexed relative to the points they match
spatially in the representative site, which is the site with the highest degree of interaction point
conservation with the other cholesterol sites. This procedure results in the unbiased detection of a
3-dimensional binding motif corresponding to shared interactions in non-homologous sites
binding cholesterol, as indicated by the highlighted vertical green bars showing points of
interaction common to 70% or more of the sites (Figure 2.2).

	  

29	  

Table 2.1 Cholesterol binding proteins in the training and test sets.
Training set: membrane proteins
PDB code

Ligand

Source

Res.(Å)

R-factor

Protein Name

2RH1

Cholesterol

H. sapiens

2.4 Å

0.198

β2-adrenergic G protein-coupled
receptor

3AM6

Cholesterol

A. acetabulum

3.2 Å

0.290

Proton-pumping rhodopsin II

2ZXE

Cholesterol

S. acanthias

2.4 Å

0.248

Sodium-potassium pump

3KDP

Cholesterol

S. scrofa

3.5 Å

0.243

Sodium-potassium pump

4DKL

Cholesterol

M. musculus

2.8 Å

0.235

µ-Opioid receptor

Test set: soluble proteins
PDB code

Ligand

Source

Res.(Å)

R-factor

Protein Name

1LRI

Cholesterol

P. cryptogea

1.45 Å

0.161

Beta-elicitin cryptogein

1N83

Cholesterol

H. sapiens

1.63 Å

0.202

Retinoic

acid-related

orphan

receptor alpha
1ZHY

Cholesterol

S. cerevisiae

1.60 Å

0.216

KES1 protein

3GKI

Cholesterol

H. sapiens

1.80 Å

0.176

Niemann-pick c1 protein

3N9Y

Cholesterol

H. sapiens

2.10 Å

0.207

Cholesterol side-chain cleavage
enzyme (Cyp11A1)

	  

30	  

Figure 2.2 Determining conserved site map points. Aligned site map points with matching
chemical labels from the training set of cholesterol (CLR) sites are shown following SimSite3D
spatial alignment. Hydrophobic (H) or hydrogen-bond donor (D) site map points are shown on
lines 2-6 if they fall within 1.5 Å of a site map point of the same chemical type in the query site,
3KDP_CLR3001D, where the number and letter after the CLR residue code indicate its residue
number and chain identifier in the PDB file. Hydrogen-bond acceptor (A) and donor and/or
acceptor (N) points (e.g., hydroxyl interaction sites) also occur in cholesterol sites but are not
found to be conserved between the sites. The 3KDP query site was chosen as the representative
query site for cholesterol binding because it has the highest degree of site map point conservation
with the other cholesterol sites. Highly conserved points (green backgrounds) comprising the
conserved motif for cholesterol interation were identified based on occurring in at least 70% of
these training cases aligned to the 3KDP query site.

2.2.3 Establishing a cholate site predictor

Creating a cholate site predictor for the CholMine software followed the same process as for
cholesterol prediction. The first step was to set up the training and test databases. 20 cholate
(PDB residue name CHD) binding sites in 12 non-redundant proteins were used to generate
SimSite3D site maps representing points of favorable hydrophobic or hydrogen-bond
interactions with cholate (Table 2.2). These 20 cholate binding sites were divided into two
datasets of equal size.

There were just four non-homologous membrane protein-bound cholate

sites in the PDB, representing limited training power, with the 16 other cholate sites coming
from soluble proteins. The training set thus included the 4 membrane protein cholate sites and 6
	  

31	  

of the soluble cholate sites. There were no instances of cholate sites repeated (even with low
homology) between the training and test sets, to guarantee that the test predictions would be
unbiased. Due to the limited availability of unrelated cholate sites in the PDB, four bile acid
binding proteins with moderate pairwise sequence identity (~60%) were included in the test set.
Inverting the two sets in testing and training then allowed testing whether a more diverse set of
cholate sites (the first set, with a mixture of unrelated membrane and soluble sites) or a set of
sites with some similarity (from four diverse bile acid binding proteins and two unrelated
proteins) provided greater cholate site detection power.

2.2.4 Summary of the steps for establishing a cholesterol (or cholate) site predictor

2.2.4.1 Step 1: Preparing the training and testing databases

The binding sites divided into training and test sets were processed by SimSite3D to create
site maps.

Sets of soluble and membrane proteins containing diverse or lipid ligands (as

described in the section above, “SimSite3D and site maps for aligning and comparing protein
sites” and in “Bacterial membrane proteins for evaluating false positive prediction rate”, below)
were also prepared as site maps for alignment and comparison as negative controls, to assess the
rate of false positive predictions.

	  

32	  

	  

33	  

Table 2.2 Cholate binding proteins in the training and test sets.
Training set: mixture of membrane and soluble proteins
PDB ID α

Ligand

Source

Δ 1EE2

Cholate

E. caballus

Δ 1S9Q

Cholate

Δ 2AZY

Res.(Å)

R-factor

Protein Name

1.5Å

0.148

Alcohol dehydrogenase

M.musculus

2.2Å

0.220

Estrogen-related receptor gamma

Cholate

S. scrofa

1.9Å

0.167

Phospholipase A2

Δ 2DQY

Cholate

H. sapiens

3.0Å

0.226

Liver carboxylesterase 1

^ 2DYR

Cholate

B. taurus

1.8Å

0.202

Cytochrome c oxidase

Δ 2HRC

Cholate

H. sapiens

1.7Å

0.221

Ferrochelatase

Res.(Å)

R-factor

Protein Name

Test set: soluble proteins
PDB ID

Ligand

Source

Δ 1TW4

Cholate

G. gallus

2.0Å

0.216

Liver bile acid binding protein

Δ 2FT9

Cholate

A. mexicanum

2.5Å

0.260

Liver bile acid-binding protein

Δ 2QO4

Cholate

D. rerio

1.5Å

0.188

Liver bile acid-binding protein

Δ 2RLC

Cholate

C. perfringens

1.8Å

0.195

Choloylglycine hydrolase

Δ 3ELZ

Cholate

D. rerio

2.2Å

0.224

Ileal bile acid-binding protein

Δ 3QPS

Cholate

C. jejuni

2.4Å

0.204

CmeR

α

Membrane proteins are indicated by ^ and soluble proteins by Δ. In PDB structures 2DYR,
2HRC, 1TW4, 2FT9, and 3ELZ, two or more independent cholate binding sites were included in
training or testing.

2.2.4.2 Step 2: Choosing the most representative cholesterol (or cholate) binding site

The goal of this step was to select the known site with the best SimSite3D scoring detection
and quality of alignment with other cholesterol (or cholate) binding sites (as described for the
site from PDB entry 3KDP in Figure 2.2). For cholesterol sites, the membrane set was initially
assigned as the training set, the soluble set as a true positive test set, and the diverse ligand sites
as a dataset with one true positive buried in many false positive cases. The SimSite3D
normalized score threshold was set to 0.0 (keeping the best scoring orientation of any site that
	  

34	  

aligns favorably with the query site), and each of the 12 cholesterol sites was compared against
all the others, and to the diverse set of 140 binding sites. The RMSD value representing the
closeness of alignment (with 0 Å representing a perfect alignment) between the query site
cholesterol atom positions and those in the aligned ligand sites was calculated by using the
RMSD function in the OEchem toolkit v.1.6.1 (http://www.eyesopen.com; OpenEye Scientific
Software, Santa Fe, NM). Assigning one query site from the training set and a separate query site
from the test site allowed the two sets to be inverted for training and testing. The same procedure
was followed for cholate sites.

2.2.4.3 Step 3: Extracting a fingerprint of conserved interactions from known cholesterol (or
cholate) sites and applying it to predict on the test set

A high false positive rate results when SimSite3D alone is used to align hydrophobic sites
with a generous scoring threshold, due to significant hydrophobic contact scores and the absence
of directional hydrogen-bonding group matches (which are strong discriminants for polar sites
binding the same ligand). This motivated our developing a way to pinpoint additional conserved
features of cholesterol or cholate binding sites. Conserved hydrophobic interactions were
identified between the cholesterol sites, based on site map points that overlaid in 3-dimensional
space, as shown in Figure 2.2, for both the training and test sets. These points represent
hydrophobic positions in the cholesterol sites that are ≥70% conserved with respect to the query
site for the membrane (3KDP_CLR3001D) or soluble set (1ZHY_CLR1001A). The conserved
points and their relative positions in space provide a shared recognition motif or fingerprint for
	  

35	  

cholesterol interaction that is implemented as a filter (following SimSite3D alignment) in the
CholMine predictor. A test site is predicted to bind cholesterol or cholate if, upon 3-dimensional
site map alignment with the query site, it matches at least 70% of the conserved points. The same
procedure was followed for identifying and applying a conserved recognition motif for the
cholate training and test sites.

2.2.5 Bacterial membrane proteins for evaluating false positive prediction rate

Bacteria contain no cholate or cholesterol, and are thus likely to provide a rigorous set of
ligand sites to test for the rate of false positive cholesterol predictions because their
membrane-exposed surfaces are hydrophobic and interact with other lipids. PDB codes of
bacterial membrane proteins were extracted from the Membrane Proteins of Known 3D Structure
Database (http://blanco.biomol.uci.edu/mpstruc/) and then entered in the Pisces server43
(http://dunbrack.fccc.edu/Guoli/PISCES_InputB.php) to select a low-homology set of bacterial
membrane proteins using default criteria: crystal structures with ≤ 25% pairwise sequence
identity, ≤ 3.0 Å resolution, R-value ≤ 0.3, and chain length between 40 and 10,000 residues.

	  

36	  

Figure 2.3 Steps in CholMine cholesterol and cholate site prediction.

2.2.6 CholMine server

The overall steps in cholesterol/cholate site prediction by CholMine are summarized in
Figure 2.3. A web server implementation has been established to support automated prediction of
cholesterol and cholate binding sites by users for their own protein structures
(http://cholmine.bmb.msu.edu). Given a Protein Data Bank file and a ligand residue number and
ligand chain ID for a placemarker ligand in the site, the server will provide the following
information: a prediction of whether the site binds cholesterol or cholate; the predicted binding
mode of the corresponding steroid; and the residues in the binding site forming conserved
interactions with cholesterol or cholate. A prediction summary plus PDB files containing the
	  

37	  

ligand orientation and essential residues are e-mailed to the user, with an option to also provide a
pre-formatted PyMOL molecular graphics file (Schrödinger, New York, NY; http://pymol.org)
showing the predicted interactions. The set of key protein interactions can be used to design
experiments that probe ligand binding, for instance by site-directed mutagenesis.

As well as supporting the use of a placeholder ligand (e.g., a crystallographic lipid or
user-defined dummy residue) to define the binding site volume to analyze, the server also
supports user uploading of a mini PDB file that contains up to 25 residues defining the protein
region the user would like to assess for cholesterol or cholate binding. This set of residues is used
to define the potential ligand binding site volume as a box bounded by the minimum and
maximum x, y, and z coordinates of the residues provided. The volume for site map generation is
then refined by placing probes on a 1.0 Å grid in the box and removing any probes within 3.5 Å
(van der Waals contact distance) of protein atoms. The site map for CholMine analysis is
generated within this volume for comparison to the conserved interaction points characteristic of
cholesterol or cholate binding. 10,000 Å3 was set as the maximum box volume in the server
implementation.

	  

38	  

2.3 Results

2.3.1 Cholesterol binding site training and testing

Of all the membrane cholesterol sites, 3KDP_CLR3001D gave the lowest average RMSD of
alignment against the other membrane sites in the training set when used as the query (Figure
2.4(A)), so the site map and positions and chemistry of conserved interactions in this site were
used as the basis to align and score the test cases. As shown in Figure 2.4(B), 1ZHY_CLR1001A
gave the lowest average RMSD when used as the query for alignment of the set of soluble
cholesterol sites. Thus, this site was chosen as the soluble site representative query when the
training and test sets were inverted to determine which query had the greatest predictive power
and lowest false positive rate.

As shown in Table 2.3, using the 3KDP_CLR3001D site as the query (where CLR is the
residue name for cholesterol and 3001D is the ligand residue number), combined with requiring
at least 70% of its conserved interactions to be matched for a site to be predicted as cholesterol
binding resulted in prediction of 83% of the membrane protein cholesterol sites (training set) and
80% of the soluble protein cholesterol sites (true positives in the unbiased test set), with a
relatively low rate (5%) of false positives in the 140-site diverse dataset. Self-prediction of a site
(when used as both the query site and as a dataset entry) is not included in the calculation of the
true positive rate, since self-prediction is guaranteed. In contrast, although the soluble cholesterol
site 1ZHY_CLR1001A has a low false positive rate when at least 70% of its conserved
	  

39	  

interactions are matched, it fails to find any of the membrane protein cholesterol binding sites,
while predicting 75% of the soluble sites. These results suggest that the membrane cholesterol
sites share a conserved motif that is also part of the soluble site recognition of cholesterol.
However, additional shared interactions within the soluble sites are not well-matched by the
membrane sites, likely due to the fact that soluble proteins more fully surround and sequester
cholesterol. Based on its superior performance on soluble as well as membrane cholesterol
binding sites, the 3KDP query site and its conserved set of interactions were implemented in the
CholMine server for cholesterol site detection.

	  

40	  

(A)

(B)

Figure 2.4 Pairwise alignment and similarity scoring. (A) All-against-all SimSite3D comparison
for membrane protein cholesterol binding sites. (B) All-against-all comparison for soluble
protein cholesterol binding sites. For the top-scoring alignment of each site pair, the SimSite3D
similarity score values are colored from red (most similar) to dark blue (marginally similar) with
corresponding score values ranging from -5 to 0 (in standard deviations above the mean score
when the same query site is compared to the set of 140 diverse ligand binding sites, where more
negative is more significant). Black indicates failure to meet the normalized score threshold of 0.
Numbers reported in the grid are the RMSD values (Å) between cholesterol rings following
SimSite3D site alignment. Lower RMSD indicates better alignment between sites. The “# norm.
hits” column on the right side of each matrix reports the number of sites meeting the scoring
threshold for similarity to the query site (labeled to the left in each row) when searching against
the 140 sites in the diverse dataset (Table A.1.1), which includes one true positive cholesterol
site. The high number of false positives is based on SimSite3D alignment score only, before the
conserved interaction points for cholesterol sites have been considered.

	  

41	  

Table 2.3 Prediction results for using cholesterol sites in 3KDP_CLR3001D (a membrane
protein) and 1ZHY_CLR1001A (a soluble protein) for detecting cholesterol sites in other
proteins, plus assessment of false positives in a set of 139 non-cholesterol ligand sites. When
1ZHY_CLR1001A was used as the query in the results below, the training and test sets were
inverted relative to those listed in Table 2.1. Query self-matches were excluded from the
statistics.
Query ID

True Positive Rate for

Unbiased True Positive Rate

False Positive Rate for

Training Dataset

for Test Dataset

Diverse Dataset

3KDP_CLR3001D

5/6 (83%)

4/5(80%)

7/139 (5%)

1ZHY_CLR1001A

3/4 (75%)

0

2/139 (1.4%)

2.3.2 Cholate site training and testing

SimSite3D pairwise comparison of the cholate sites for the two datasets is shown in Figure
2.5, allowing the identification of the query site within each set that could best detect other
cholate sites based on the lowest average RMSD of alignment over the most sites. The
membrane protein site representative (2DYR_CHD525C) provided better predictive ability
overall (Table 2.4). Predicting cholate sites as those matching at least 70% of the conserved
interactions in this query site gave a true positive rate of 67% for cholate sites in the training set,
a true positive rate of 70% for cholates in the unbiased test set, and a false positive rate of 12%
on the set of 140 diverse ligand binding sites. 2QO4_CHD130A was identified as the best
representative of the second, entirely soluble cholate site dataset. When this site was used as the
query to find cholate sites matching its conserved interactions, a true positive rate of 67% was
observed in the entirely soluble cholate site set, a true positive rate of only 10% in the mixed
membrane/soluble protein set, and a false positive rate of 1.4% when applied to the set of 140
	  

42	  

diverse cholate sites. The decreased generalization of the soluble site query and conserved points
for predicting other cholate sites was expected, since a substantial number of sites in this set
came from two sites in diverse members of the β-clamshell bile acid binding protein family.
Similarly, by being a more family-specific motif, this query’s lower false positive rate was
expected on the diverse set of 140 non-cholate binding sites. The membrane cholate site query
performed better as a cholate site predictor that generalizes across protein families, with almost
twice the unbiased true positive rate (Table 2.4). Therefore, cholate site prediction in CholMine
uses 2DYR_CHD525C as the query, combined with conserved interactions derived from the first
dataset of mixed membrane and soluble protein cholate sites.

	  

43	  

(A)

(B)

Figure 2.5 Pairwise alignment and similarity scoring. (A) All-against-all SimSite3D similarity
comparison for the first dataset, which includes 4 membrane cholate binding sites and 6 soluble
cholate binding sites. (B) All-against-all comparison for the second dataset, which includes
another 10 soluble cholate binding sites unrelated to the first set. (See Figure 2.4 legend for
additional details.)

Table 2.4 Prediction results from using cholate sites 2DYR_CHD525C (best representative from
a membrane protein) and 2QO4_CHD130A (best representative from a soluble protein in the
second set) for alignment and scoring to predict cholate binding sites in other proteins and assess
false positive rate in a set of 140 non-cholate sites. Query self-matches were excluded from the
results. The training and test sets were inverted relative to Table 2.2 when the 2QO4 query was
used.
Query ID

True Positive Rate for

Unbiased True Positive

False Positive Rate

Training Dataset

Rate for Test Dataset

for Diverse Dataset

2DYR_CHD525C

6/9 (67%)

7/10 (70%)

17/140 (12%)

2QO4_CHD130A

6/9 (67%)

1/10(10%)

2/140 (1.4%)

	  

44	  

2.3.3 Evaluating the statistical significance of the cholesterol and cholate site predictors

The lift value is a common way to evaluate models in data mining, reflecting the
enhancement in predictivity relative to random selection.44 Suppose the predictor rule is that A
implies B (e.g., a positive prediction by CholMine implies that the site binds cholesterol). The
lift value for CholMine predictions can be calculated as:

Lift(A ⇒ B) =

P(B | A) P(A ∩ B)
=
P(B)
P(A)P(B)

Lift(A ⇒ B) > 1 means A and B have a positive relationship, and the numeric value reflects
the n-fold enhancement of predictive rate (how many times higher?) relative to random
prediction. Lift(A ⇒ B) = 1 indicates that A and B are independent, and Lift(A ⇒ B) < 1
means A and B have an inverse relationship. The chi-squared test can also be used to evaluate
whether the correlation between A and B is statistically significant, by measuring the probability
of there being a significant difference between the predicted versus actual result (e.g., the
presence of a cholesterol binding site). For CholMine cholesterol site prediction, the lift value
was 7.7, indicating CholMine is almost 8 times as effective as random prediction of cholate sites.
The very small chi-squared P-value of 1.05e-13 indicates significant correlation between
CholMine prediction and cholesterol binding. For CholMine prediction of cholate sites, the lift
value is also significant (3.6), with a very small chi-squared P-value of 2.53e-08.

	  

45	  

2.3.4 GPCR cholesterol binding site prediction

Putative cholesterol sites in class A GPCRs were analyzed as one way of testing the
predictive ability of CholMine on additional cholesterol sites. The consensus motif (CCM) found
in the cholesterol-binding site of human β2-adrenergic receptor (labeled as residue 412 in PDB
code: 2RH1) is matched by the sequences in 44% of human class A G protein coupled
receptors.13 To assess the ability of CholMine to find sites matching the sequence-based
consensus motif, prediction was performed on the structures available for 11 of these receptors
(PDB codes: 3EML, 3PBL, 2KS9, 2Y00, 3RZE, 1U19, 2Z73, 3ODU, 3V2W, 3UON, and 4DJH;
Table A.1.2). 82% of these proteins were predicted by CholMine to bind cholesterol in the region
corresponding to cholesterol 412 in PDB entry 2RH1, in PDB entries 3EML, 2KS9, 2Y00,
3RZE, 1U19, 3ODU, 3V2W, and 3UON. In addition, for the 1.8Å resolution crystal structure of
the human A2a adenosine receptor (PDB entry: 4EIY), which contains 3 cholesterol-bound sites
unrelated to each other by symmetry or amino acid sequence, two of the three sites were
predicted by CholMine (labeled as residues 404 and 405 in PDB entry 4EIY).

2.3.5 Comparison of CholMine structure-based predictions with sequence-based predictions
using the CCM, CRAC, and GXXXG motifs

To compare the predictive ability of previously published cholesterol binding sequence
motifs with that of CholMine, Sequery45 was applied to identify sequences matching each motif
in crystal structures of the same proteins used for CholMine prediction (Table 2.1 and Tables
	  

46	  

A.2.1 and A.2.2). Matching the CCM, CRAC and GXXXG sequence motifs predicted the
membrane protein cholesterol binding sites well (80-100% of these sites were predicted),
predicted soluble sites less well (40-80%), and resulted in an unacceptable rate of false positives
in the diverse dataset: 100 or more cholesterol sites were predicted in 139 sites known to bind a
different ligand (Table 2.5).

	  

47	  

Table 2.5 Comparison of cholesterol site prediction in true versus non-cholesterol binding sites
by the CholMine conserved spatial motif versus sequence motif matching.
CCMα

Relaxed
CCM
Membrane set

α

CCM + Surface

CRACα

GXXXGα

accessibility

CholMine
predictor

5/5

4/5

2/5

5/5

4/5

5/6

(100%)

(80%)

(40%)

(100%)

(80%)

(83%)

4/5

2/5

1/5

3/5

3/5

4/5

(80%)

(40%)

(20%)

(60%)

(60%)

(80%)

11/11

10/11

6/11

11/11

5/11

9/11

(100%)

(91%)

(54%)

(100%)

(45%)

(82%)

Diverse dataset

130 /139

105/139

33/139

116/139

100/139

7/139

(false positives)

(94%)

(75%)

(24%)

(83%)

(72%)

Soluble Set
GPCRs

α

Relaxed CCM: R/K- (X)1-7-I/V/L- (X)1-3-W/Y;

1-5-Y-

(X) 1-5-R/K;

15,16,17

G(X)3G.

3,13

CCM: R/K -(X)2-6-I/V/L-(X)3-W/Y;

(5%)
13

CRAC: L/V- (X)

22

One of the problems with sequence motif based prediction is that it does not assess the
surface accessibility of the motif, which is required for cholesterol to access the site. To test
whether including solvent accessibility as an additional criterion for sequence motif-based
cholesterol site prediction can solve the overprediction problem, a solvent accessible surface
threshold was set at 29 Å2 for matching each residue in the CCM motif, corresponding to the
minimum exposed surface area per residue in the cholesterol site of human β2 adrenergic receptor
(PDB entry: 2RH1). The results show that the true positive rate for membrane protein cholesterol
sites decreased from 80% to 40%, for soluble protein sites from 40% to 20%, and for GPCRs
from 91% to 54% (Table 2.5, CCM + Surface Accessibility column). The false positive rate
decreased from 75% to 24%, while still resulting in 33 false positives in 139 proteins. Overall,
even when surface accessibility is considered, sequence motif prediction has an unacceptably
high false positive rate for cholesterol prediction (24%) and a moderate rate of true positive
	  

48	  

prediction (20-40%), whereas CholMine structure-based prediction results in few false positives
(5%) and a high true positive rate (80-83%).

2.3.6 Deciphering the determinants of cholesterol binding

For cholesterol binding site prediction in membrane proteins, all the conserved site map
points representing favorable cholesterol contacts derive from hydrophobic groups, more
specifically, Ile D35, Leu D36, Tyr D39, Tyr D43, Glu C840, Ile C843, Tyr C847, and Met C852
in the representative query site, 3KDP_CLR3001D (Figures 2.2 and 2.6(A)). A smaller but
similar set of interactions with cholesterol at this site is identified when the single 3KDP crystal
structure is analyzed by LigPlot and LigPlot+46,47 (Figure 2.6(B,C)). Compared with the CCM
(R/K-(X)

1-7-I/V/L-(X) 1-3-W/Y)

and CRAC (L/V-(X)1-5-Y-(X)1-5-R/K) motifs, the CholMine

spatially conserved binding motif exemplified by this site contains an I-L-(X) 2-Y motif, which
matches the residues at the end of the CCM and the beginning of the CRAC motif. CholMine’s
conserved interaction points surround atoms on the steroid ring observed to have the highest
frequency of protein interaction (Figure 2.6(A)). There may be several reasons for the observed
lack of conserved polar interactions with cholesterol. First, there is only a single polar group, the
A-ring hydroxyl substituent, in cholesterol. In seven cholesterol sites evaluated (two sites in
2RH1 and 3AM6, and one each in 2ZXE, 3KDP, and 4DKL), there was only a single direct
protein hydrogen bond to the cholesterol hydroxyl group, with water-mediated interactions to
cholesterol in another structure, and no protein hydrogen bonds to the cholesterol hydroxyl group
	  

49	  

observed in any of the other cases. This suggests that the hydroxyl group may help position
cholesterol correctly at the interface between the lipid bilayer and bulk solvent, rather than being
a recognition determinant for binding to proteins. Also supportive of a lesser role for polar group
recognition is the observation that the arginine or lysine residue in the CCM is only 22%
conserved in class A GPCRs; thus interactions of this residue with cholesterol are only mildly
conserved.13

In soluble protein cholesterol binding sites, both faces of cholesterol are surrounded in the
pocket, forming additional interactions with the protein. However, the conserved interaction
points from soluble protein cholesterol binding sites perform less well than those from
membrane proteins in predicting cholesterol sites in general (Table 2.3). The conserved
membrane protein cholesterol interactions (Figure 2.6A) can predict and are characteristic of
both membrane and soluble sites in unrelated proteins and are the basis for CholMine cholesterol
site prediction.

2.3.7 CholMine distinguishes cholesterol sites from sites occupied by acyl chain lipids

CholMine was also applied to diverse lipid binding sites: the 22 independent acyl lipid sites
in the adenosine receptor (PDB code: 4EIY) and five phosphatidylethanolamine and analog sites
in PDB entries 3DDL, 2Z73, 3UTW, 3UTV (Table A.1.3). CholMine correctly predicted that 21

	  

50	  

out of 22 sites in the adenosine receptor do not bind cholesterol, and the same for all five of the
phosphatidylethanolamine sites.

	  

51	  

(A)

(B)

(C)

Figure 2.6 (A) Sodium/potassium-transporting ATPase cholesterol site (PDB entry 3KDP,
residue D3001) used as the representative query for CholMine predictions. Purple spheres
	  

52	  

Figure 2.6 (cont’d)
represent conserved interaction points in the membrane proteins binding cholesterol (from Figure
2.2), displayed in the context of the representative site from 3KDP. The green dashed lines
connect the conserved interaction points to corresponding protein atoms. Cholesterol atoms
colored in green contact a protein atom in 60% of the training set sites, atoms colored yellow
have a 30-60% frequency of contact, and atoms colored in red contact the protein in <30% of the
sites. (B) For comparison, LigPlot+ 3-dimensional view (shown with PyMOL; Schrödinger, New
York, NY; http://pymol.org) of key sodium/potassium-transporting ATPase cholesterol
interactions identified in just the single structure of 3KDP. (C) Alternative LigPlot 2-dimensional
view of these interactions.

2.3.8 Discriminating cholesterol and cholate sites from other steroid sites

To test whether CholMine can distinguish cholesterol sites from steroid binding sites in
general, a variety of non-homologous crystal structures were tested: the progesterone sites in
PDB entries 1A28, 2AA6, 2BAB, and 2HZQ, the estradiol sites in 1AQU, 1E6W, 1JGL, 1LHU,
and 3OLL, and the testosterone sites in 2AM9, 1J96, and 3KDM (Table A.1.3). 10 out of the 12
sites were predicted as non-cholesterol sites, with two false positives, in 1AQU and 1J96. The
cholesterol site predictor was also applied to the cholate training and test sets (Table 2.2) and
vice versa (Table 2.1). The cholesterol site predictor predicts 30% of the training and 30% of the
test set of cholate sites. The cholate site predictor predicts 57% of the membrane cholesterol sites
and 80% of the soluble sites. Thus, cholesterol and cholate sites are harder to discriminate than
cholesterol and steroid sites in general, and again we see a higher level of discrimination of
cholesterol relative to cholate sites. Reasons for this are discussed below in the section below,
“Comparison of cholesterol and cholate binding site conservation”.

	  

53	  

2.3.9 Bacterial membrane proteins for evaluating false positive predictions

Bacteria contain no cholate or cholesterol. Thus, known ligand sites, mostly lipid-binding,
were analyzed in 109 low-homology bacterial membrane protein structures (Table A.1.4) as an
additional stringent test of the false positive rate for cholesterol and cholate site prediction.
Eleven of the 109 sites, or 10%, were falsely predicted as potential cholesterol sites. When
analyzed as potential cholate sites, 14 (13%) sites were predicted. Though nominally these are
false positives, eubacteria are known to contain sterol-like molecules including cyclic hopanoids,
tetrahymanol, and squalene.48,49 Thus, it remains possible that some sites that were occupied by
unnatural molecules in the bacterial crystal structures may natively bind sterol-like molecules.

2.3.10 Cholate binding determinants

Cholate is an important detergent for membrane proteins and also a representative of bile
acids that act as hormones, pheromones, and important metabolites of cholesterol. CholMine was
trained for cholate site prediction similarly to the protocol for cholesterol, and the determinants
for cholate binding in membrane proteins were found to differ somewhat from those in soluble
proteins. For membrane protein cholate binding sites, the conserved interaction points were all
hydrophobic. In the representative 2DYR_CHD525C (cytochrome c oxidase) site used for
CholMine prediction, these interactions arise from TrpC99, HisA233, TrpA288, TyrA304A, and
PheA305 (Figure 2.7). The latter trio of residues serve to anchor cholate in the binding pocket.
Out of the 10 training set cholate molecules, half of the O3 hydroxyl groups (on the A ring of
	  

54	  

cholate) formed water-mediated and two formed direct hydrogen bonds to the protein. The O7
and O12 hydroxyls (on the B and C rings) formed fewer hydrogen bonds to protein: two O7 and
four O12 water-mediated hydrogen bonds were observed, and 1 direct hydrogen bond was found
in the 10 sites, with a low degree of conservation. The tail carboxylate oxygens formed 7 direct
H-bonds overall, which were spatially varied in position.

2.3.11 Comparison of cholesterol and cholate binding site conservation

To understand why the number of conserved interaction points is greater for cholate sites
(Figure 2.7) compared with cholesterol (Figure 2.6), the crystallographic mobility of atoms in
these ligands was compared. In the training set of 10 cholate sites, the crystallographic B-factor
average for cholate atoms was 48 Å2, whereas in the training set of 7 cholesterol sites, the
B-factor average for cholesterol atoms was 1.5 times as high (74 Å2), reflecting significant
mobility. Higher atomic mobility is thus likely the reason for fewer spatially conserved
interactions in cholesterol sites.

	  

55	  

Figure 2.7 Conserved interaction points for CholMine cholate site prediction (purple spheres)
are shown in the context of the interactions between the representative membrane protein query
site 2DYR_CHD525C from cytochrome c oxidase, and its bound cholate molecule (white tubes
with oxygen atoms in red). Essential residues contributing to the conserved interaction are
labeled.

A generally similar pattern is seen in the edges and faces of cholate and cholesterol that
predominate in forming conserved interactions with protein sites (Figure 2.8). Discrimination
between cholesterol and cholate binding is not via polar interactions (which are not conserved
across cholate or cholesterol sites), but by conserved interactions at the bend between the steroid
A and B rings and near the center of the tail in cholate, versus a paucity of conserved interactions
at the A-B ring junction or hydrophobic tail region in cholesterol. The conformational diversity
of the tails when cholate and cholesterol bind to different sites results in their termini not being
well conserved spatially whereas they still experience different chemical environments.
Detecting differences in the general protein environments of the alpha face of the steroid ring
(upper face in Figure 2.8) and the tail termini in cholate (polar) versus cholesterol (hydrophobic)
sites will be a focus for enhancements in CholMine, as well as expanding the training data sets.
	  

56	  

Figure 2.8 SimSite3D-identified conserved interactions for cholate (yellow) and cholesterol
(blue) recognition abound along the groove formed between the row of C18, C19, and C21
methyl groups on the beta (lower) face of the steroid and the edge of the steroid ring system.
The view on the right is rotated roughly 90 degrees about a vertical axis through the center of
each molecule. Cholate sites are distinguished from cholesterol primarily based on interactions
with the relatively conserved C22-C23 tail orientation in cholate, and numerous conserved
interactions associated with the strongly bent (5-beta configuration) joint between the A and B
rings of the cholate steroid ring system. Because the tail configurations are conformationally
diverse in different binding sites, conserved interactions are absent in the C24-C25 region.

2.3.12 Computational efficiency of the CholMine server

For the 261 cholesterol, cholate, and other ligand sites analyzed here, the maximum protein
volume for site map generation was <10,000 Å3 (a box with edges of ~21 Å), and each
prediction completed in less than 5 minutes (the time to exhaustively check and score all
orientations of the user-defined cleft versus the representative site, then filter for conserved
interaction matches). For the majority of cases, the server elapsed time was < 3 minutes per site.
	  

57	  

2.4 Concluding discussion

CholMine, a predictor for cholesterol and cholate binding in protein 3-dimensional
structures, has been established as a free web server at http://cholmine.bmb.msu.edu. This
approach is based on the determination of conserved interactions for cholesterol and cholate
binding to non-homologous membrane and soluble protein sites in PDB structures. SimSite3D
alignment and scoring of site similarity serves as the first layer of prediction, considering the
chemical interactions that can be made with the protein and their degree of surface match,
independent of ligand information or protein structural conservation. This approach allows
CholMine to focus on spatial conservation of chemical interactions rather than residue
conservation. Requiring 70% match of the conserved spatial interactions of known cholesterol or
cholate sites serves as the second layer of prediction, ruling out the vast majority of false
positives in a dataset of diverse soluble ligand sites (resulting in a 5% false positive rate for
cholesterol and 12% for cholate sites) and a slightly higher rate when applied to a dataset of
diverse membrane proteins (10% for cholesterol and 13% for cholate sites). CholMine can
predict 80% of known cholesterol and 70% of known cholate binding sites in diverse protein
families including soluble and membrane proteins from different species, when applied to sites
unrelated to those used in training. CholMine can discriminate ~75% of sites containing other
steroids from cholesterol binding sites. Cholate site prediction is less steroid-selective; it also
predicts two-thirds of the known cholesterol sites, likely due to the limited availability of
non-homologous cholate sites for training the predictor. This problem can be addressed by
	  

58	  

periodic updating of the training set. However, the false positive rate of cholate site prediction on
non-steroid sites is 5-fold lower, even for diverse lipid sites in membrane proteins.

Hydrophobic interactions focused along the groove between the steroid methyl group
substituents and the ring system itself are found to be the major conserved determinants for the
recognition of both cholesterol and cholate, with their polar groups not contributing to conserved
interactions. Classical motifs for cholesterol site prediction have focused on amino acid residue
conservation, and tend not to generalize well to other protein families, with particularly limited
performance for predicting known binding sites in soluble proteins. Sequence motif-based
prediction also results in many false positives (with 70% or more of 139 diverse non-cholesterol,
non-cholate binding sites falsely predicted), which overwhelms the number of true positive
predictions. The enhanced predictive specificity and selectivity of CholMine is based on
inferring shared 3-dimensional shape and chemical information from non-homologous sites.
This approach is now being generalized to create a LigPattern server that discovers the shared
interaction determinants of other important regulatory ligands and substrates, including polar
molecules such as adenosine.

	  

59	  

APPENDIX

	  

60	  

Table A.1.1 140 non-homologous protein sites binding diverse ligands, containing one
cholesterol binding site (in PDB entry 1LRI) and no cholate sites.
PDB
code

Ligand

Source

Res.
(Å)

R-factor

Protein name

1R8S

GDP

B. taurus

1.46

0.159

ADP-ribosylation factor 1

1QXY

M2C

S. aureus

1.04

0.144

Methionyl aminopeptidase
Endo-oxabicyclic transition state

1ECM

TSA

E. coli

2.2

0.192

analogue

1KYF

AAchain

M. musculus

1.22

0.154

Alpha-adaptin c
Sulfolipid

1I24

UPG

biosynthesis

A. thaliana

1.2

0.192

sqd1

H. sapiens

1.58

0.343

Cyclophilin A

protein

His-Ala-G
ly-Pro-Ile1AWQ

Ala

Conserved hypothetical

protein

1PUJ

GNP

B. subtilis

2.0

0.216

ylqf

4UBP

HAE

S. pasteurii

1.55

0.151

Urease, chain A

1CHM

CMS

P. putida

1.9

0.177

Creatine amidinohydrolase

1KEK

HTL

D. africanus

1.9

0.178

Pyruvate-ferredoxin oxidoreductase

1EFY

BZC

G. gallus

2.2

0.194

Poly (ADP-ribose) polymerase
Cobalamin-dependent

methionine

1MSK

SAM

E. coli k12

1.8

0.198

synthase

1EVL

TSB

E. coli

1.55

0.215

Threonyl-trna synthetase

1JC9

NAG

T. tridentatus

2.01

0.183

Techylectin-5a
Protein-l-isoaspartate

1DL5

SAH

T. maritima

1.8

0.182

o-methyltransferase

1.7

0.193

RNA-directed RNA polymerase

H. c virus (isolate
1GX5

GTP

bk)

Ribulose-1,5

bisphosphate

1GK8

CAP

C. reinhardtii

1.4

0.149

carboxylase larg

1FK5

OLA

Z. mays

1.3

0.135

Nonspecific lipid-transfer protein
Naphthalene 1,2-dioxygenase alpha

1O7N

IND

P. putida

1.4

0.19

subunit

1M15

ARG

L. polyphemus

1.2

0.125

Arginine kinase

1KMV

LII

H. sapiens

1.05

0.13

Dihydrofolate reductase

1F20

NAP

R. norvegicus

1.9

0.186

Nitric-oxide synthase

1MXT

FAE

S. sp.

0.95

0.11

Cholesterol oxidase

1GS5

NLG

E. coli

1.5

0.2088

Acetylglutamate kinase

	  

61	  

Table A.1.1 (cont’d)
Formiminotransferase-cyclodeamin
1QD1

FON

S. scrofa

1.7

0.191

ase

1C96

FLC

B. taurus

1.81

0.225

Mitochondrial aconitase

1K3Y

GTX

H. sapiens

1.3

0.148

Glutathione s-transferase a1

1T2D

NAD

P. falciparum

1.1

0.143

L-lactate dehydrogenase

S. typhimurium

1.2

0.229

Oligopeptide binding protein

1.95

0.197

Malate synthase G

Lys-Ala-L
1JET

ys

E. coli str. k12
1P7T

ACO

substr.

Tetrahydrodipicolinate
1KGQ

NPI

M. bovis

2.0

0.179

N-Succinyltransferase

1DMH

LIO

A. sp.

1.7

0.185

Catechol 1,2-dioxygenase

1XVA

SAM

E. coli

2.2

0.196

Glycine N-methyltransferase

1B37

FAD

Z. mays

1.9

0.199

Polyamine oxidase

1B5E

DCM

E. phage t4

1.6

0.189

Deoxycytidylate hydroxymethylase

1LTZ

HBL

C. violaceum

1.4

0.159

Phenylalanine-4-hydroxylase
Major histocompatibility complex

1K5N

AAchain

H. sapiens

1.09

0.123

HLA-B*2709

1H16

DTL

E. coli

1.53

0.145

Formate acetyltransferase 1
Probable

fosfomycin

1NKI

PPF

P. aeruginosa

0.95

0.148

protein

1G6S

S3P

E. coli

1.5

0.149

EPSP synthase

1LRI

CLR

P. cryptogea

1.45

0.161

Beta-elicitin cryptogein

1R1H

BIR

H. sapiens

1.95

0.211

Neprilysin

1AMU

PHE

B. brevis

1.9

0.213

Gramicidin synthetase 1
Eukaryotic

translation

resistance

initiation

1L8B

MGP

M. musculus

1.8

0.224

factor 4E

1PFV

2FM

E. coli

1.7

0.186

Methionyl-tRNA synthetase

1M0K

RET

H. salinarum

1.43

0.134

Bacteriorhodopsin

1UZE

EAL

H. sapiens

1.82

0.188

Angiotensin converting enzyme
Chemotaxis

1AF7

SAH

S. typhimurium

2.0

0.2

M.

receptor

methyltransferase CheR
Methanol

dehydrogenase

1G72

PQQ

methylotrophus

1.9

0.161

subunit

1QZ5

KAB

O. cuniculus

1.45

0.17

Actin, alpha skeletal muscle

1DTD

Glu

H. sapiens

1.65

0.187

Carboxypeptidase A2

1JHG

TRP

E. coli

1.3

0.127

Trp operon repressor

1CCW

TAR

C. cochlearium

1.6

0.137

Glutamate mutase

1MQO

CIT

B. cereus

1.35

0.222

Beta-lactamase II

	  

62	  

heavy

Table A.1.1 (cont’d)
Acetohydroxy-acid
1QMG

DMV

S. oleracea

1.6

0.196

isomeroreductase

1UFY

MLI

T. thermophilus

0.96

0.11

Chorismate mutase
Phosphoribosylglycinamide

1KJQ

ADP

E. coli

1.05

0.19

formyltransferase 2

1CIP

GNP

R. norvegicus

1.5

0.213

GI-alpha-1 subunit
Phosphoenolpyruvate

1AYL

OXL

E. coli

1.8

0.195

carboxykinase

1GTE

IUR

S. scrofa

1.65

0.181

Dihydropyrimidine dehydrogenase

1MRJ

ADN

T. kirilowii

1.6

0.173

Alpha-trichosanthin

1PZ4

PLM

A. aegypti

1.35

0.187

Sterol carrier protein 2

1R4U

OXC

A. flavus

1.65

0.157

Uricase

1RQW

TAR

T. daniellii

1.05

0.127

Thaumatin I

2TCT

CTC

E. coli

2.1

0.18

Tetracycline repressor

1VJJ

GDP

H. sapiens

1.9

0.205

Glutamine glutamyltransferase

1PQ7

ARG

F. oxysporum

0.8

0.109

Trypsin

1CZA

G6P

H. sapiens

1.9

0.213

Hexokinase type I
Alcohol

dehydrogenase,

1O2D

NAP

T. maritima

1.3

0.139

iron-containing

1F0L

APU

C. diphtheriae

1.55

0.188

Diphtheria toxin
Baculoviral IAP repeat-containing

1TW6

AAchain

H. sapiens

1.71

0.156

protein 7

2DPM

SAM

S. pneumoniae

1.8

0.238

Adenine-specific methyltransferase

1KA1

A3P

S. cerevisiae

1.3

0.134

Halotolerance protein Hal2
Interferon-induced

1F5N

GNP

H. sapiens

1.7

0.226

guanylate-binding protein 1

1HQS

CIT

B. subtilis

1.55

0.202

Isocitrate dehydrogenase

1NVV

GNP

H. sapiens

2.18

0.208

Transforming protein p21/h-ras-1

1UNQ

ITS

H. sapiens

0.98

0.154

Rac-alpha serine/threonine kinase
Benzoate

1,2-dioxygenase

1KRH

FAD

A. sp.

1.5

0.242

reductase

1M0W

3GC

S. cerevisiae

1.8

0.172

Glutathione synthetase

1UCD

URA

M. charantia

1.3

0.2

Ribonuclease MC

1HYO

HBU

M. musculus

1.3

0.181

Fumarylacetoacetate hydrolase
Substrate

binding

domain

1DKX

AAchain

E. coli

2.0

0.206

DNAK

1SOX

MTE

G. gallus

1.9

0.175

Sulfite oxidase

1LB6

AAchain

H. sapiens

1.8

0.203

TNF receptor-associated factor

1I1Q

TRP

S. typhimurium

1.9

0.219

Anthranilate synthase comp. I

	  

63	  

of

Table A.1.1 (cont’d)
Aminoglycoside
1ND4

KAN

K. pneumoniae

2.1

0.206

3'-phosphotransferase

1EU1

MGD

R. sphaeroides

1.3

0.121

Dimethyl sulfoxide reductase

1BX4

ADN

H. sapiens

1.5

0.192

Protein (adenosine kinase)

1NOX

FMN

T. thermophilus

1.59

0.19

NADH oxidase

1HP1

ATP

E. coli

1.7

0.176

5'-nucleotidase

1LKK

AAchain

H. sapiens

1.0

0.133

Human p56 tyrosine kinase

1B4U

DHB

S. paucimobilis

2.2

0.161

Protocatechuate 4,5-dioxygenase

1GZ8

MBP

H. sapiens

1.3

0.153

Cell division protein kinase 2

1EYQ

NAR

M. sativa

1.85

0.237

Chalcone-flavonone isomerase

1TX4

GDP

H. sapiens

1.65

0.169

P50-rhogap

1US0

LDT

H. sapiens

0.66

0.0938

Aldose reductase

1UXY

EPU

E. coli

1.8

0.202

MURB

1J09

ATP

T. thermophilus

1.8

0.199

Glutamyl-tRNA synthetase

1D3V

ABH

R. norvegicus

1.7

0.157

Arginase

1KPF

AMP

H. sapiens

1.5

0.209

Protein kinase C interacting protein

1UUY

PPI

A. thaliana

1.45

0.163

Molybdopterin biosynthesis CNX1

1OUW

MLT

C. sepium

1.37

0.153

Lectin

1HFE

FCY

D. vulgaris

1.6

0.158

Fe-only hydrogenase

1JAK

IFG

S. plicatus

1.75

0.176

Beta-N-acetylhexosaminidase

1UIO

HPR

M. musculus

2.4

0.203

Adenosine deaminase

1P6O

HPY

S. cerevisiae

1.14

0.112

Cytosine deaminase

1KOL

NAD

P. putida

1.65

0.171

Formaldehyde dehydrogenase

1OAI

AAchain

H. sapiens

1.0

0.149

Nuclear RNA export factor

1FCY

564

H. sapiens

1.3

0.134

Retinoic acid receptor
Protein arginine methyltransferase

1F3L

SAH

R. norvegicus

2.03

0.209

O.
1N62

MCN

1QJA

AAchain

1G2L
2SLI

carboxidovorans

PRMT3
Carbon monoxide dehydrogenase

1.09

0.144

small chain

H. sapiens

2.0

0.214

14-3-3 Protein zeta

T87

H. sapiens

1.9

0.237

Coagulation factor X

SKD

M. decora

1.8

0.185

Intramolecular trans-sialidase
Carbamoyl phosphate synthetase

1A9X

ORN

E. coli

1.8

0.191

(large chain)
CAMP-specific

1TBB

ROL

H. sapiens

1.6

0.187

phosphodiesterase 4D

1O7Q

UDP

B. taurus

1.3

0.1155

N-acetyllactosaminide

1RLZ

NAD

H. sapiens

2.15

0.199

Deoxyhypusine synthase

1U4G

HPI

P. aeruginosa

1.4

0.18

Elastase

	  

64	  

3',5'-cyclic

Table A.1.1 (cont’d)
1TL2

NAG

T. tridentatus

2.0

0.162

Tachylectin-2

1RKD

RIB

E. coli

1.84

0.221

Ribokinase

1Q79

3AT

B. taurus

2.15

0.205

Poly(a) polymerase alpha
Ubiquinol-cytochrome c reductase

1PP9

SMA

B. taurus

2.1

0.25

complex core protein

1E8G

FCR

P. simplicissim.

2.1

0.218

Vanillyl-alcohol oxidase

1L5O

2MP

S. enterica

1.6

0.174

CobT

1OEW

Ser-Thr

C. parasitica

0.9

0.121

Endothiapepsin

1H8E

ALF

B. taurus

2.0

0.201

Bovine mitochondrial F1-ATPase

1BGV

GLU

C. symbiosum

1.9

0.173

Glutamate dehydrogenase

1USC

FMN

T. thermophilus

1.24

0.203

small comp.

1MGP

PLM

T. maritima

2.0

0.202

Hypothetical protein tm841

1QNF

HDF

S. elongatus

1.8

0.197

Photolyase

1C1D

NAD

R. sp.

1.25

0.195

L-phenylalanine dehydrogenase

1UW6

NCT

L. stagnalis

2.2

0.22386

Acetylcholine-binding protein

1G55

SAH

H. sapiens

1.8

0.21

DNMT2

1LUG

SUA

H. sapiens

0.95

0.119

Carbonic anhydrase II

Putative styrene monooxygenase

DNA cytosine methyltransferase

N-carbamyl-d-amino
1UF5

CDT

A. sp.

1.6

0.178

amidohydrolase

1V7R

CIT

P. horikoshii

1.4

0.202

Hypothetical protein ph1917

acid

Bovine endothelial nitric oxide
1D0C

INE

B. taurus

1.65

0.213

synthase heme domain

5CSM

TRP

S. cerevisiae

2.0

0.186

Chorismate mutase

1P5D

G1P

P. aeruginosa

1.6

0.157

Phosphomannomutase

	  

65	  

Table A.1.2 Putative cholesterol binding sites in class A GPCRs1.
PDB code

Source

(motif matched)
2RH1

Res.

H. sapiens

2.4 Å

(strict-CCM )
H. sapiens

2.6 Å

H. sapiens

2.9Å

(strict-CCM )
H. sapiens

NMR

(strict-CCM )

Beta

adrenoceptor

type

2

0.00 Å

Adenosine type 2A receptor

0.43 Å

Dopamine vertebrate type 3

0.48 Å

Vertebrate tachykinin receptor

0.32 Å

(TACR1)
M.

α

respect to PDB 2RH1

receptor (DRD3)

α

2Y00

with

(ADORA2A)

α

2KS9

RMSDβ

(ADRB2)

(strict-CCM α)
3PBL

Alignment

(Å)

α

3EML

Protein name

2.5 Å

Beta

adrenoceptor

type

1

0.29 Å

(CCM )

gallopavo

(ADRB1)

3RZE

H. sapiens

3.1 Å

Histamine type 1 receptor

0.59 Å

1U19

B. taurus

2.2 Å

Rhodopsin

0.63 Å

2Z73

T.pacificus

2.5 Å

Rhodopsin

0.71 Å

3ODU

H. sapiens

2.5 Å

C-X-C chemokine receptor type

2.48 Å

α

(CCM )

4 (CXCR4)
3V2W

H. sapiens

3.35 Å

Sphingosine-1-phosphate

0.63 Å

receptor (EDG)
3UON

H. sapiens

3.0 Å

M2

Human

muscarinic

0.44 Å

acetylcholine receptor
4DJH

H. sapiens

2.9 Å

κ-opioid receptor

α

0.67 Å

strict-CCM: R/K -(X)2-6-I/V/L-(X)3-W/Y on one helix and F/Y on the neighboring helix13; CCM: R/K
-(X)2-6-I/V/L-(X)3-W/Y13. Entries without motif notations belong to class A GPCRs but were not included
in reference 12 or Table 2 predictions in the present manuscript.
β
The alignment RMSD is based on relative positions of backbone atoms (N, Cα, C and O) of residues
within 9 Å of cholesterol.
1
Hanson, M. A.; Cherezov, V.; Griffith, M. T.; Roth, C. B.; Jaakola, V. P.; Chien, E. Y.; Velasquez, J.;
Kuhn, P.; Stevens, R. C. A Specific Cholesterol Binding Site is Established by the 2.8 Å Structure of the
Human Β2-Adrenergic Receptor. Structure 2008, 16, 897–905.

	  

66	  

Table A.1.3 Diverse non-cholesterol, non-cholate lipid binding sites.
PDB code

Ligand

Source

Res. (Å)

R-factor

Protein Name

1A28

Progesterone

H. sapiens

1.8 Å

0.191

Progesterone receptor

2AA6

Progesterone

H. sapiens

2.0 Å

0.197

Mineralocorticoid receptor

2ABA

Progesterone

E. cloacae

1.0 Å

0.129

Pentaerythritol

tetranitrate

reductase
2HZQ

Progesterone

H. sapiens

1.8 Å

0.189

Apolipoprotein D

1AQU

Estradiol

M. musculus

1.6 Å

0.218

Estrogen sulfotransferase

1E6W

Estradiol

R. norvegicus

1.7 Å

0.184

Short

chain

3-hydroxyacyl-CoA
dehydrogenase
1JGL

Estradiol

M. musculus

2.2 Å

0.199

Ig kappa-chain

1LHU

Estradiol

H. sapiens

1.8 Å

0.204

Sex

hormone-binding

globulin
3OLL

Estradiol

H. sapiens

1.5 Å

0.177

Estrogen receptor beta

2AM9

Testosterone

H. sapiens

1.6 Å

0.191

Androgen receptor

1J96

Testosterone

H. sapiens

1.2 Å

0.181

3-Alpha-hydroxysteroid
dehydrogenase type 3

3KDM

Testosterone

H. sapiens

1.5 Å

0.181

Immunoglobulin light chain

4EIY

Oleic acid

H. sapiens

1.8 Å

0.176

Adenosine receptor A2a

3DDL

PX4

S. ruber

1.90 Å

0.247

Xanthorhodopsin

3DDL

PCW

S. ruber

1.90 Å

0.247

Xanthorhodopsin

2Z73

PC1

T. pacificus

2.50 Å

0.188

Rhodopsin

3UTW

MC3

H. sp.

2.40Å

0.206

Bacteriorhodopsin

3UTV

MC3

H. sp.

2.06Å

0.197

Bacteriorhodopsin

	  

67	  

Table A.1.4 Sites in 109 low-homology bacterial membrane protein sites analyzed as potential
false positive cases for cholesterol (CLR) or cholate (CHD) binding. Sites predicted to match the
CholMine cholesterol or cholate site conserved interactions are noted in the third column. The
last column indicates whether the crystallographic ligand at the prediction site (second column)
was of lipid or lipid-like (L), drug-like (D), polar (P), or intermediate character (e.g., P/L for a
polar lipid group). 73% of the sites contained lipids or partly lipidic molecules.
PDB
entry
1LGH
1M56
1QD5
1U7G
1YC9
2ERV
2YEV
3GP6
3RKO
4H44
4IL6
1B12
1CWV
1EHK
1J79
1JB0
1K4C
1KMO
1KQF
1LDF
1NKZ
1Q16
1QFG
1QJP
1UJW
1UYN
1XEZ
1XIO
1XKW
1Y4Z
	  

Ligand site Prediction
analyzed
(CLR, CHD,
or neither)
LYC A97
CLR,CHD
PEH A2009 CLR,CHD
BOG A500
CLR
BOG A400
CLR
BOG A1001 CLR
CXE A300
CLR
5PL A900
CLR
SDS A163
CLR
LFA L614
CLR,CHD
7PH C303
CLR
DGD C515
CLR,CHD
1PN B1001
--CIT A994
--BNG A901
--NCD A950
--BCR A4001 --F09 A2001
--HTO A759
--MGD A1018 --GOL A476
--RG1 A404
--MD1 A1300 --DDQ A1100 --C8E A1172
--GP1 A801
--CXE X2085 CHD
BOG A999
--RET A301
--LDA A2001 --MD1 A1800 --68	  

Crystal
structure
ligand type
L
L
L
L
L
L
L
L
L
L
L
D
P
L
P
L
L
L
D/P
P
L
D/P
L
L
P
L
L
L
L
D/P

Table A.1.4 (cont’d)
2A65
2BL2
2BS2
2GSK
2GSM
2GUF
2HDI
2IWV
2J58
2NS1
2O4V
2OQO
2POR
2QCU
2QI9
2SQC
2VDF
2VPZ
2VQG
2WDQ
2WIE
2WJN
2WJR
2WSW
2X27
2X2V
2X55
2XCI
2XOV
2YHC
2YNK
2ZFG
3B9W
3BS0
3CSL
3DDL
3DWN
3DWO
3DZM
	  

LEU A601
UMQ A1162
FAD A1656
LDA A800
DMU A5001
MPG A701
LDA A664
TAM B1289
OCT A600
BOG A601
C8E A1295
EPE A244
C8E A545
TAM A805
1PE C800
C8E A632
OCT A1254
MGD A1765
MRD B1097
CBE C1130
CVM A102
MQ7 M1328
EPE A1217
CM5 A1505
C8E X1216
DPV A200
C8E A1293
PG4 A1353
BNG A503
URE A1234
OCT A1001
C8E A342
BOG A408
C8E A501
GOL A867
UNL A1402
LDA A502
C8E X453
C8E A209

------------------------CHD
------------CHD
--------------------------------CHD
CHD
--69	  

D
L
D/P
L
L
L
L
D
L
L
L
D
L
D
L
L
L
D/P
D
D
L/D
L
D
L/D
L
L
L
D/P
L
P
L
L
L
L
P
L
L
L
L

Table A.1.4 (cont’d)
3FID
3HB3
3HYW
3JQO
3KDS
3KLY
3L1L
3L7I
3M71
3OUF
3QE7
3QRA
3RLB
3RLF
3RQW
3RVY
3SZV
3TIJ
3USE
3V8X
3WO6
4AFK
4DVE
4E1S
4EHW
4GBY
4GEY
4IKV
4JR9
4MT4
4N7W
4NHR
4NM9
4NV5
4P1X
4PR7
4Q35
4QNC
2J7A
	  

CXE A304
LMT A568
DCQ A500
MPD D1
NHX E998
BOG A281
BNG A447
EDO B731
BOG A315
MPD A501
URA A430
C8E A1
VIB A191
UMQ E5004
ACH A323
PX4 A4001
C8E A385
URI A419
GOL L605
C8E A1001
OLC A302
78M A1510
BTN A201
OLB A502
MPD A402
BNG A505
DMU A510
PG4 A613
GYP A501
3PK A1008
MPG A402
PEG A301
FAD A2001
U10 A501
MPD A401
OCT A301
LDA A2004
MYS A104
LMT C1005

----CHD
--CHD
--------------------CHD
--------------CHD
------------------CHD
----------70	  

L
L
L
D
D
L
L
L/P
L
L/P
D/P
L
D
L
P/D
L
L
D/P
G
L
L
L
D/L
L
D/L
L
L
L/P
P
L
L
P/L
P
L
P/L
L
L
L
L

Table A.1.4 (cont’d)
3WU2

	  

SQD A412

---

71	  

L

REFERENCES

	  

72	  

REFERENCES

(1) Lund, S.; Orlowski, S.; Foresta, B. de; Champeil, P.; Maire M. Le; Møbller, J.V. Detergent
Structure and Associated Lipid as Determinants in the Stabilization of Solubilized Ca2+-Atpase
from Sarcoplasmic Reticulum. J. Biol. Chem. 1989, 264, 4907-4915.
(2) Seddon, A. M.; P. Curnow; Booth, P. J. Membrane Proteins, Lipids and Detergents: Not Just
a Soap Opera. Biochim. Biophys. Acta 2004, 1666, 105–117.
(3) Contreras, F.-X.; Ernst, A. M.; Wieland, F.; Brügger, B. Specificity of Intramembrane
Protein-Lipid Interaction. Cold Spring Harb. Perspect. Biol. 2011, 3, 1-18.
(4) Ernst. A. M.; Contreras. F.-X.; Brügger, B.; Wieland, F. Determinants of Specificity at the
Protein–Lipid Interface in Membranes. FEBS Lett. 2010, 584, 1713–1720.
(5) Hite, R. K.; Li, Z.; Walz, T. Principles of Membrane Protein Interactions with Annular
Lipids Deduced from Aquaporin-0 2D Crystals. EMBO J. 2010, 29, 1652-1658.
(6) Shinzawa-Itoh, K.; Aoyama, H.; Muramoto, K.; Terada, H.; Kurauchi, T.; Tadehara, Y.;
Yamasaki, A.; Sugimura, T.; Kurono, S.; Tsujimoto, K.; Mizushima, T.; Yamashita, E.;
Tsukihara, T.; Yoshikawa, S. Structures and Physiological Roles of 13 Integral Lipids of Bovine
Heart Cytochrome C Oxidase. EMBO J. 2007, 26, 1713–1725.
(7) Qin, L.; Hiser, C.; Mulichak, A.; Garavito, R. M.; Ferguson-Miller, S. Identification of
Conserved Lipid Detergent-Binding Sites in a High-Resolution Structure of the Membrane
Protein Cytochrome C Oxidase. Proc. Natl. Acad. Sci. USA 2006, 103, 16117–16122.
(8) Munro, S. Lipid Rafts: Elusive or Illusive? Cell 2003, 115, 377–388.
(9) Burger, K.; Gimpl, G; Fahrenholz, F. Regulation of Receptor Function by Cholesterol. Cell.
Mol. Life Sci. 2000, 57, 1577-1592.
(10) Schroeder, C. Cholesterol-Binding Viral Proteins in Virus Entry and Morphogenesis. In
Cholesterol Binding and Cholesterol Transport Proteins: Structure and Function in Health and
Disease; Harris, J. R., Ed.; Springer: Dordrecht, 2010; Vol. 51, pp 77-108.
(11) Huber, T. B.; Schermer, B.; Müeller, R. U.; Höhne, M.; Bartram, M.; Calixto, A.;
Hagmann, H.; Reinhardt, C.; Koos, F.; Kunzelmann, K.; Shirokova, E.; Krautwurst, D.;
	  

73	  

Harteneck, C.; Simons, M.; Pavenstädt, H.; Kerjaschki, D.; Thiele, C.; Walz, G.; Chalfie, M.;
Benzing, T. Podocin and MEC-2 Bind Cholesterol to Regulate the Activity of Associated Ion
Channels. Proc. Natl. Acad. Sci. USA 2006, 103, 17079–17086.
(12) Hulce, J. J; Cognetta, A. B.; Niphakis, M. J.; Tully, S. E.; Cravatt, B. F. Proteome-Wide
Mapping of Cholesterol-Interacting Proteins in Mammalian Cells. Nat. Methods 2013, 10,
259-64.
(13) Hanson, M. A.; Cherezov, V.; Griffith, M. T.; Roth, C. B.; Jaakola, V. P.; Chien, E. Y.;
Velasquez, J.; Kuhn, P.; Stevens, R. C. A Specific Cholesterol Binding Site is Established by the
2.8 Å Structure of the Human Β2-Adrenergic Receptor. Structure 2008, 16, 897–905.
(14) Adamian, L.; Naveed, H.; Liang, J. Lipid-Binding Surface of Membrane Proteins:
Evidence from Evolutionary and Structure Analysis. Biochim. Biophys. Acta 2011, 1808, 1092–
1102.
(15) Li, H; Papadopoulos, V. Peripheral-Type Benzodiazepine Receptor Function in Cholesterol
Transport. Identification of a Putative Cholesterol Recognition/Interaction Amino Acid Sequence
and Consensus Pattern. Endocrinology 1998, 139, 4991-4997.
(16) Takeda, K.; Tonthat, N. K.; Glover, T.; Xu, W.; Koonin, E. V.; Yanagida, M.; Schumacher,
M. A. Implications for Proteasome Nuclear Localization Revealed by the Structure of the
Nuclear Proteasome Tether Protein Cut8. Proc. Natl. Acad. Sci. USA 2011, 108, 16950–16955.
(17) Li, F.; Liu, J.; Valls, L.; Ferguson-Miller, S. Identification of a Key Cholesterol Binding
Enhancement Motif in Translocator Protein 18 Kda (TSPO). Biochemistry 2015 54, 1441-1443.
(18) Baier, C. J.; Fantini, J.; Barrantes, F. J. Disclosure of Cholesterol Recognition Motifs in
Transmembrane Domains of the Human Nicoticin Acetylcholine Receptor. Sci. Rep. 2011, 1,
1-7.
(19) Fantini, J.; Yahi, N. Molecular Basis for the Glycosphingolipid-Binding Specificity of
α-Synuclein: Key Role of Tyrosine 39 in Membrane Insertion. J. Mol.Biol. 2011, 408, 654–669.
(20) Fantini, J; Barrantes, F. J.; How Cholesterol Interacts With Membrane Proteins: an
Exploration of Cholesterol-Binding Sites Including CRAC, CARC, and Tilted Domains. Front
Physiol. 2013, 4, 1-9.
(21) Palmer, M. Cholesterol and the Activity of Bacterial Toxins. FEMS Microbiol. Lett. 2004,
238, 281–289.

	  

74	  

(22) Barrett, P. J.; Song, Y.; Van Horn, W. D.; Hustedt, E. J.; Schafer, J. M.; Hadziselimovic,
A.; Beel, A. J.; Sanders, C. R. The Amyloid Precursor Protein Has a Flexible Transmembrane
Domain and Binds Cholesterol. Science 2012, 336, 1168-1171.
(23) Song, Y; Kenworthy, A. K.; Sanders, C. R. Cholesterol as a Co-Solvent and a Ligand for
Membrane Proteins. Protein Sci. 2014, 23, 1-22.
(24) Farrand, A. J.; LaChapelle, S.; Hotze, E. M.; Johnson, A. E.; Tweten, R. K. Only Two
Amino Acids are Essential for Cytolytic Toxin Recognition of Cholesterol at the Membrane
Surface. Proc. Natl. Acad. Sci. USA 2010, 107, 4341-4346.
(25) Chiang, J. Y. Bile Acid Regulation of Gene Expression: Roles of Nuclear Hormone
Receptors. Endocr. Rev. 2001, 23, 443-463.
(26) Russell, D.W.; Setchell, K. D. R. Bile Acid Biosynthesis. Biochemistry 1992, 31, 4737–
4749.
(27) Hofmann, A.F. The Enterohepatic Circulation of Bile Acids in Man. Clin. Gastroenterol.
1977, 6, 3–24.
(28) Maruyama, T.; Miyamoto, Y.; Nakamura, T.; Tamai, Y.; Okada, H.; Sugiyama, E.;
Nakamura, T.; Itadani, H.; Tanaka, K. Identification of Membrane-Type Receptor for Bile Acids
(M-BAR). Biochem. Biophys. Res. Commun. 2002, 298, 714–719.
(29) Kawamata, Y.; Fujii, R.; Hosoya, M.; Harada, M.; Yoshida, H.; Miwa, M.; Fukusumi, S.;
Habata, Y.; Itoh, T.; Shintani, Y.; Hinuma, S.; Fujisawa, Y.; Fujino, M. A G Protein-Coupled
Receptor Responsive to Bile Acids. J. Biol. Chem. 2003, 278, 9435-9440.
(30) Lischka, F.; Kuhn, L. A.; Libants, S.; Wu, H.; Yuan, Q.; Teeter, J.; Li, W.
De-Orphanization of Two Vertebrate Pheromone Receptors. In preparation, 2015.
(31) Yau, W.-M.; Wimley, W. C.; Gawrisch, K.; White, S. H. The Preference of Tryptophan for
Membrane Interfaces. Biochemistry 1998, 37, 14713-14718.
(32) Ballesteros, J. A.; Weinstein, H. Integrated Methods for the Construction of
Three-Dimensional Models and Computational Probing of Structure-Function Relations in G
Protein-Coupled Receptors. Methods Neurosci. 1995, 25, 366-428.
(33) Hunte, C. Specific Protein-Lipid Interactions in Membrane Proteins. Biochem. Soc. Trans.
2005, 33, 938-942.

	  

75	  

(34) Liu, W.; Chun, E.; Thompson, A. A.; Chubukov, P.; Xu, F.; Katritch, V.; Han, G. W.;
Roth, C. B.; Heitman, L. H.; IJzerman, A. P.; Cherezov, V.; Stevens, R. C. Structural Basis for
Allosteric Regulation of Gpcrs by Sodium Ions. Science 2012, 337, 232-236.
(35) Lin, H. H.; Han, L. Y.; Zhang, H. L.; Zheng, C. J.; Xie, B.; Chen, Y. Z. Prediction of the
Functional Class of Lipid Binding Proteins from Sequence-Derived Properties Irrespective of
Sequence Similarity. J. Lipid Res. 2006, 47, 824-831.
(36) Xiong, W.; Guo, Y.; Li, M. Prediction of Lipid-Binding Sites Based on Support Vector
Machine and Position Specific Scoring Matrix. Protein J. 2010, 29, 427-431.
(37) Scott, D. L.; Diez, G.; Goldmann, W. H. Prediction-Lipid Interactions: Correlation of a
Predictive Algorithm for Lipid-Binding Sites with Three-Dimensional Structural Data. Theor.
Biol. Med. Model. 2006, 3, 1-14.
(38) Van Voorst, J. R.; Finzel, B. C.; Tonero, M. E.; Rai, B.; Narasimhan, L.; Howe, W. J.;
Kuhn. L. A. Screening to Identify Similar Ligand-Binding Pockets in Diverse Proteins. In
preparation, 2015.
(39) Weill, N.; Rognan, D. Development and Validation of a Novel Protein-Ligand Fingerprint
to Mine Chemogenomic Space: Application to G Protein-Coupled Receptors and Their Ligands.
J. Chem. Inf. Model. 2009, 49, 1049-1062.
(40) Madala, P. K.; Fairlie, D. P.; Boden, M. Matching Cavities in G Protein-Coupled Receptors
to Infer Liand-Binding Sites. J. Chem. Inf. Model. 2012, 52, 1401-1410.
(41) Van Voorst, J. R. Surface Matching and Chemical Scoring to Detect Unrelated Proteins
Binding Similar Small Molecules. Ph.D. Thesis, Michigan State University, December 2011.
(42) Zavodszky, M. I.; Sanschagrin, P. C.; Korde, R. S.; Kuhn, L. A. Distilling the Essential
Features of a Protein Surface for Improving Protein-Ligand Docking, Scoring, and Virtual
Screening. J. Comp.-Aided Molecular Design 2002, 16, 883-902.
(43) Wang, G.; Dunbrack, R. L. Jr. PISCES: A Protein Sequence Culling Server. Bioinformatics
2003, 19, 1589-1591.
(44) Tan, P.N.; Steinbach, M.; Kumar, V. Introduction to Data Mining; Addison-Wesley:
Boston, 2006; pp 370-386.

	  

76	  

(45) Craig, L.; Sanschagrin, P. C.; Rozek, A.; Lackie, S.; Kuhn, L. A.; Scott, J. K. The Role of
Structure in Antibody Cross-Reactivity between Peptides and Folded Proteins. J. Mol. Biol.
1998, 281, 183-201.
(46) Wallace, A. C.; Laskowski, R. A.; Thornton, J. M. LIGPLOT: a Program to Generate
Schematic Diagrams of Protein-Ligand Interactions. Protein Eng. 1996, 8, 127-134.
(47) Laskowski, R. A.; Swindells, M. B. LigPlot+: Multiple Ligand−Protein Interaction
Diagrams for Drug Discovery. J. Chem. Inf. Model. 2011, 51, 2778−2786.
(48) Barenholz, Y. Cholesterol and Other Membrane Active Sterols: From Membrane Evolution
to Rafts. Prog. Lipid Res. 2002, 41, 1-5.
(49) Majewska, M. D. Steroids and Ion Channels in Evolution: From Bacteria to Synapses and
Mind. Acta Neurobiol. Exp. 2007, 67, 219-233.

	  

77	  

Chapter 3 Deciphering Substituent Effects of Ring-substituted α-Arylalanines on the
Isomerization Reaction Catalyzed by an Aminomutase

Reprint (adapted) with permission from Ring-Substituted α-Arylalanines for Probing Substituent
Effects on the Isomerization Reaction Catalyzed by an Aminomutase. Nishanka Dilini Ratnayake,
Nan Liu, Leslie A. Kuhn, and Kevin D. Walker. ACS Catal., 2014, 4, 3077–3090.

	  

78	  

Here, we analyzed the substituent effects of ring-substituted α-arylalanines on the
isomerization reaction catalyzed by an aminomutase. The goal was to determine how the
protein-ligand interaction components of the binding energy dictate the relative biological
activities of substrates.

3.1 Introduction

β-Amino acids are gaining use as building blocks for synthetic β-peptide oligomers that are
used as biologically active antibiotics.1 These β-peptides form ordered secondary structures
similar to α-peptides, yet are less prone to cleavage than their α-peptide counterparts by most
peptidases in vivo. In addition, biosynthesizing novel (S)-β-amino arylalanines, such as
o-methyl-β-phenylalanine, has potential application in the synthesis of a pyrazole heterocycle
compound that inhibits the function of a lysosomal serine protease cathepsin A (CatA). This
inhibition of CatA was shown to prevent the development of salt-induced hypertension.2
m-Fluoro-β-phenylalanine has also been used as an intermediate in the synthesis of potent
chemokine receptor CCR5 antagonist.3

Enzymatic resolution and catalysis are described as elegant approaches to access enantiopure
β-amino acids. Phenylalanine aminomutases from the bacterium Pantoea agglomerans (PaPAM,
EC 5.4.3.11) and an isozyme from Taxus plants (TcPAM, EC 5.4.3.10) use a
4-methylidene-1H-imidazol-5(4H)-one
	  

(MIO)
79	  

prosthetic

group

to

isomerize

(2S)-α-phenylalanine to β-phenylalanine. TcPAM makes the (3R)-β-amino acid, a precursor of
the phenylisoserine side chain on the pathway to the antimitotic compound paclitaxel.4 In an
earlier study, TcPAM was shown to convert several variously modified α-arylalanines to their
cognate β-isomers.5 In contrast, PaPAM makes the (3S)-β-phenylalanine antipode on the
biosynthetic pathway to the antibiotic andrimid (Figure 3.1).6 Knowing the substrate scope of
PaPAM could increase the range of novel enantiopure β-arylalanines obtained biocatalytically.

Figure 3.1 Partial andrimid biosynthetic pathway starting from (S)-β-phenylalanine via
(S)-α-phenylalanine. a) Several steps.

Both PAMs belong to a class I lyase-like superfamily of catalysts,6-9 along with other
MIO-dependent aminomutases. Tyrosine aminomutases (CcTAM and SgTAM, respectively) are
used on the biosynthetic pathways to the cytotoxic chondramides in Chondromyces crocatus and
to the enediyne antitumor antibiotic C-1027, of the neocarzinostatin family, made by
Streptomyces globisporus. A phenylalanine aminomutase from Streptomyces maritimus
(SmPAM) described earlier as a lysase was recently characterized.7 A recently characterized

	  

80	  

aminomutase biosynthesizes (R)-2-aza-β-tyrosine from 2-aza-α-tyrosine found on the
biosynthetic pathway to the enediyne kedarcidin in Streptoalloteichus.10

Recent structural characterization of PaPAM supports the formation of an NH2-MIO adduct,
where the amino group of the substrate is covalently attached to the enzyme during the α/β
isomerization (Figure 3.2).11 A proton and the NH2-MIO group are eliminated from the substrate
to form a cinnamate intermediate (released occasionally as a minor by-product), followed by
hydroamination of the intermediate from NH2-MIO to form the β-amino acid.

Figure 3.2 Mechanism of the MIO-dependent isomerization catalyzed by PaPAM. MIO:
4-methylidene-1H-imidazol-5(4H)-one; 𝒌𝐜𝐢𝐧𝐧
𝐜𝐚𝐭 : the rate at which the cinnamate by-product is
𝜷

released; 𝒌𝐜𝐚𝐭 : the rate at which the β-amino acid product is released.
	  

81	  

The broad substrate specificity of TcPAM encouraged us to investigate, herein, the substrate
specificity of the related MIO phenylalanine aminomutase. In addition, structural and
mechanistic studies on MIO-based aminomutases are increasing our understanding of the
reaction chemistry of the enzymes in this family.9,13,15-19 Here, to gain further insights on these
enzymes, we used computational chemistry to analyze how structural interaction energies relate
to the PaPAM isomerization kinetics of substrates with different aryl rings. We propose the
PaPAM reaction chemistry is influenced by different properties of the substrate, including
sterics, and the magnitude and direction of electronic effects of the substituents on the aryl ring.

	  

82	  

3.2 Materials and methods

3.2.1 Experiments

The experimental part of this work is done by Dr. Dilini, including expression and
purification

of

paPAM,

assessment

of

the

substrate

specificity

of

PaPAM

for

(2S)-α-phenylalanine analogs were accessed and measurement of kinetic parameters (KM and
!"!#$
𝑘!"#
) of PaPAM for (2S)-α-phenylalanine analogs and inhibition assays for non-productive

substrates. The experimental kinetic data is summarized in Table 3.1.

3.2.2 Modeling substrate-PaPAM structural interactions to understand selectivity

To understand the differences in catalytic efficiency, which are largely dictated by
differences in KM, the substrates were modeled in the PaPAM active site. Active configurations
of the substrates were generated by overlaying their aryl rings onto the active conformation of
α-phenylalanine in the crystal structure by using molecular editing in PyMOL 1.5.0.4
(Schrödinger, Inc., New York, NY) and fixed reference coordinates in OMEGA 2.4.6 (OpenEye
Scientific Software).12,13 Since the substrates form covalent bonds with binding site residues of
PaPAM, their orientation is highly restricted.

The position of the ortho- or meta-substituent breaks the C2 axis of symmetry in the phenyl
ring of the substrates. Thus, the ring can adopt two configurations that are consistent with the
	  

83	  

orientation of α-phenylalanine in the crystal structure. In one configuration, called the "NH2-cis,"
the substituent on the aryl ring is on the same side as the NH2 group of the phenylalanine
substrate. In the other configuration, the "NH2-trans," obtained by a 180° rotation about the
Cβ-Cipso bond, the substituent is oriented on the side opposite the NH2 group. Alternative
low-energy conformations of the substrates, in which the substrate orientation deviated from that
of α-phenylalanine in the crystal structure, were sampled using OMEGA 2.4.6 (OpenEye
Scientific Software, Santa Fe, NM; http://www.eyesopen.com) and analyzed with respect to
experimental KM values. For energy calculations, AM1BCC charges were assigned to the
substrates using molcharge 1.3.1 (Open Eye Scientific Software).14

3.2.3 Calculating substrate-PaPAM interaction energies

The sum of protein-ligand interaction energy [E(p-l)] and ligand internal energy [E(l)] values
for the 22 substrates was calculated using Szybki15-17 1.7.0 (OpenEye Scientific Software). The
electrostatic Coulombic [EC(p-l)] and steric van der Waals (vdW) interaction energy [EV(p-l)] terms
were extracted from the E(p-l) term for each conformer. Steric collisions between the substrates
and the binding site residues were visualized pairwise by using a PyMOL script, show_bumps.py
(created by Thomas Holder of Schrödinger, Inc.) showing vdW radius overlaps of 0.1 Å or more.
The residues were then grouped according to which overlaps impacted the o-, m-, and ppositions of substrates. The component energy terms [E(p-l)], [EC(p-l)], [EV(p-l)] and [E(l)] were
calculated with two protocols to evaluate which approach led to interaction energies that best
	  

84	  

correlated with the KM values. First, a single-point energy calculation protocol employing a
Poisson-Boltzmann electrostatics model was used when the substrate was placed in the NH2-cis
or NH2-trans configuration. The NH2-cis and NH2-trans conformers were evaluated without
energy minimization. The binding site of the protein was kept in its crystallographic
conformation, to test the hypothesis that the active complex of the protein and substrate matches
the crystallographic conformation observed with α-phenylalanine (PDB entry 3UNV). Second, a
two-step protocol recommended by the OpenEye Scientific Software was used to explore
whether energy minimization could improve the modeling of PaPAM-substrate interactions by
reducing any repulsive interactions. The backbone residues of PaPAM were fixed, with the
substrates in either the NH2-cis or NH2-trans configuration. Protein side chains within 4 Å of the
substrates were then allowed to move towards an energy minimum, using the exact Coulomb
electrostatics model. Because vdW clashes lead to large, unfavorable interaction energies, this
energy minimization protocol reduces vdW overlap by small shifts in active site residues when
possible. The energy estimate of each minimized configuration was then refined using the above
single-point energy calculation with the Poisson-Boltzmann electrostatics model.

As an alternative approach, SLIDE (version 3.4) docking18,19 was used to model potential
conformational changes of the protein and substrate upon binding. SLIDE rotated active site
residues to remove or reduce vdW overlap, while the phenylalanine ligands were fixed to
maintain their initial NH2-cis or NH2-trans configuration.

	  

85	  

3.2.4 Structure-activity landscape index analysis

To identify any additional steric or electrostatic factors important for the activity of PaPAM
substrates, structure-activity landscape index (SALI) analysis was used to identify "activity
cliffs".

These

cliffs

represent

large

changes

in

PaPAM

binding

affinity

among

structurally-similar substrates.20 For identifying activity cliffs, pairwise comparisons between
substrates to measure structural similarity scores were performed using ROCS 2.4.2 software
(OpenEye Scientific Software).21 The SALI score was measured as SALI(i,j) = |(KMi – KMj)|/(2 –
sim(i,j)), in which the sim(i,j) value (structural similarity between molecules i and j) was
measured by the ROCS Tanimoto Combo score (with a maximum value of 2, reflecting equal
contributions from shape and electrostatic match terms), and KMi and KMj were the experimental
KM values of molecules i and j.

	  

86	  

3.3 Results and discussion

3.3.1 Overview of the PaPAM mechanism

The PaPAM reaction goes through a cinnamate intermediate after elimination of the amino
group and benzylic hydrogen from the α-amino acid substrate. Earlier deuterium isotope studies
(kH/kD > 2) on a related aminomutase TcPAM suggest the deprotonation step of the elimination
reaction is rate-determining.22 The coupling between the amine group of the substrate and the
MIO is proposed to make a good alkyl ammonium leaving group. α,β-Elimination of the
β-hydrogen and α-alkyl ammonium can advance through different routes. The concerted,
one-step E2 (bimolecular elimination) mechanism proceeds through base-catalyzed removal of
an acidic proton and a leaving group. By comparison, the two-step E1cB (unimolecular
conjugate-base elimination) uses base-catalysis to remove a proton vicinal to a poor leaving
group, yielding a carbanion intermediate. MIO-dependent aminomutase reactions likely follow
an E2 or E1cB mechanism, where both depend on the rate of deprotonation of Cβ, as proposed in
an earlier work.23 Thus, electron-withdrawing substituents on the aryl ring of the substrate that
stabilize a δ– charge on Cβ should therefore increase the rate of the elimination step. In contrast,
the two-step E1 (unimolecular elimination) reaction is not likely for MIO-dependent reactions.
The attached, electron-withdrawing carboxylate of the substrate would destabilize the Cα
carbocation formed after displacement of the NH2-MIO adduct (Figure 3.3A).

	  

87	  

The final reaction sequence of the MIO-dependent aminomutases involves an α,β-addition
reaction, where the NH2-MIO and a proton (H+) add across the double bond of the acrylate
intermediate. To obtain the β-amino acid in a concerted hydroamination, the polarity of the Cβ
(δ+) needs to be opposite of that in the earlier elimination sequence. Here, the nucleophilic
NH2-MIO binds to Cβ and the electrophilic H+ attaches to Cα (Figure 3.3B).

A

B

Figure 3.3 (A) Proposed elimination mechanisms for displacement of the NH2-MIO adduct. E1:
unimolecular, E2: bimolecular and E1cB: conjugate-base eliminations. (B) Concerted
hydroamination of the acrylate intermediate. Shown is a transition state intermediate (right)
highlighting the polarization of the π-bond in which the nucleophilic NH2-MIO and the
electrophilic H+ approach Cβ and Cα, respectively.
	  

88	  

Alternatively, PaPAM could use a stepwise addition sequence where the nucleophile
(NH2-MIO) couples to form a 1,4-Michael adduct. This conjugate addition route benefits from
an electropositive (δ+) Cβ by delocalizing the π-electrons towards the carboxylate of the
substrate. Theoretically, a substituent that places negative charge inductively within the ring or
mesomerically the on Cipso of the β-arylacrylate intermediate should also strengthen the
formation of a δ+ on Cβ. These types of electrostatic considerations, along with binding affinity,
were considered to explain the hydroamination reaction of TcPAM for aryl acrylate
substrates.24,25

In earlier accounts, the Michael addition mechanism was proposed,26,27 but a presumed
resonance structure has two repelling oxyanions on the carboxylate of the reactant that normally
forms a monodentate salt bridge (Figure 3.4a), as evidenced in the PaPAM crystal structure.11 To
alleviate build-up of this electrostatic repulsion, we propose that near-concerted protonation and
amination of the π-bond likely minimizes formation of the unfavorable dianion (Figure 3.4b). A
contrasting pathway is envisioned to first add a proton at Cα of the acrylate intermediate. The
resulting intermediate has a positive charge (δ+) on the benzylic Cβ, which is resonance stabilized
by the aryl ring and further stabilized by electron-releasing substituents (Figure 3.4c). Rapid,
nucleophilic attack by the NH2-MIO on the carbocation would ensue to complete the β-amino
acid catalysis.

	  

89	  

Figure 3.4 Route a) A stepwise Michael-addition pathway. Shown is an intermediate adduct (top
right) with the π-electrons delocalized into the carboxylate group forming a repelling dianion
prior to Cα-protonation. Route b) Concerted hydroamination of the acrylate π-bond. Shown is an
intermediate (middle right) with maximal charge separation between repelling negative charges
in the carboxylate group and the cation and anion. Route c) A stepwise hydroamination
sequence. Shown is a proposed intermediate (bottom right) resulting from Cα-protonation as the
first step, which places a positive charge at Cβ. Cβ is now primed for nucleophilic attack by the
NH2-MIO adduct.

	  

90	  

Table 3.1 Kinetic Parametersa of PaPAM for Various Substituted Aryl and Heteroaromatic
Substrates.

KM
R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
	  

168
(7)
339
(15)
27
(5)
432
(26)
29
(1)
88
(6)
415
(79)
337
(27)
430
(15)
73
(6)
990
(124)
132
(5)
204
(4)
491
(82)
525
(44)
163
(9)
752
(39)

!

!"##
𝑘!"#

𝑘!"#
0.301
92.8%
0.396
93.9%
0.027
85.2%
0.462
95.2%
0.020
85.7%
0.055
83.6%
0.143
34.8%
0.139
97.2%
0.136
92.6%
0.021
95.5%
0.201
99.0%
0.024
90.9%
0.048
78.3%
0.050
94.1%
0.043
95.6%
0.010
63.6%
0.025
48.0%

0.022
7.2%
0.024
6.1%
0.004
14.8%
0.022
4.8%
0.003
14.3%
0.009
16.4%
0.093
65.2%
0.004
2.8%
0.01
7.4%
0.001
4.5%
0.002
1.0%
0.002
9.1%
0.010
21.7%
0.003
5.9%
0.002
4.4%
0.003
36.4%
0.013
52.0%
91	  

!"!#$
𝑘!"#

0.323
(0.013)
0.420
(0.014)
0.031
(0.002)
0.484
(0.02)
0.023
(0.001)
0.064
(0.002)
0.236
(0.01)
0.143
(0.004)
0.146
(0.003)
0.022
(0.001)
0.203
(0.012)
0.026
(0.001)
0.058
(0.001)
0.053
(0.003)
0.045
(0.001)
0.013
(0.001)
0.038
(<10–3)

!"!#$
𝑘!"#
/KM

1.93
(0.20)
1.24
(0.12)
1.2
(0.4)
1.12
(0.14)
0.79
(0.06)
0.73
(0.09)
0.588
(0.066)
0.428
(0.063)
0.340
(0.025)
0.31
(0.04)
0.209
(0.050)
0.19
(0.02)
0.19
(0.01)
0.11
(0.03)
0.09
(0.01)
0.082
(0.010)
0.050
(0.005)

Table 3.1 (cont’d)
18
19

1187
(76)
164
(7)

0.022
97.7%
0.002
70.0%

0.0005
2.3%
0.0007
30.0%

0.022
(<10–3)
0.003
(<10–3)

0.019
(0.002)
0.02
(<10–2)

20
21
22
a
-1
!"!#$
Standard error in parenthesis. Units: s for kcat, µM for KM, and s-1•M-1 × 103 for 𝑘!"#
/KM.
20 – 22, not productive.

3.3.2 Comparing the effects of regioisomeric substituents on PaPAM catalysis and substrate
affinity

The kinetic parameters of the meta/para/ortho-regioisomers (bromo-2/15/20; fluoro-3/5/10;
chloro-4/14/21; nitro-9/17/22; methoxy-11/18/19; methyl-13/16/6) were compared. The binding
affinities (estimated by KM) for the fluoro- and methyl-substrate trifecta were approximately of
the same order. However, the KM of PaPAM for the o-methoxy substrate 19 was nearly 10-times
smaller than for its meta- and para-isomers (Table 3.1). The KI values (µM) for o-bromo- (20),
o-chloro- (21) and o-nitro- (21) substrates were 25-times smaller than the KM values of PaPAM
for the corresponding meta- and para-isomers. This supported that the ortho-substituted
substrates generally bound PaPAM better than the meta- and para-isomers.

The relative binding affinity of each substrate was assessed as a function of the six
substituents (of varying electronic and steric effects) in the ortho-, meta-, or para-position. The
relative binding affinities predicted from the calculated energies of protein-ligand interactions

	  

92	  

and the internal energy of the ligand [E(p-l) + E(l)] in the absence of energy minimization matched
the trend (m~p>o) in the experimental KM values for substrate isomers with halogens or nitro
substituents (Tables A.3.1 and A.3.2). This supports the predictive value of the model in which
the binding site residues and substrate maintain the positions found in the crystal structure with
α-phenylalanine. The calculated vdW interaction energies (EV(p-l)) also follow the "m~p>o" trend,
except for chloro compounds, which bound less tightly to PaPAM (i.e., had higher KM) than
predicted by EV(p-l) for chloro series, compared to other halogenated substrates (Tables A.3.1 and
A.3.2). The chloro series will be discussed further in the activity cliff analysis section below.

Importantly, the binding affinity order for all substrates approximately corresponded to the
vdW radii of the substituents. PaPAM bound substrates with a fluoro group (~1.5 Å) the best,
followed by methyl (~1.9 Å), then bromo and chloro groups (~1.8 Å). The least favorable
substrate for binding to PaPAM contained the bulkiest substituents: nitro (~3.1 Å; from the vdW
radii of the Car–N bond length and the terminal O–N=O) and methoxy (~3.4 Å; from the vdW
radii of the Car–O bond and the methyl C–H bonds of the methoxy).29, 30 In general, PaPAM was
predicted by EV(p-l) to disfavor binding substrates with bulky groups at the ortho-position, which
correlated well with the experimental KM values. Surprisingly, substrates with o-methyl (6) (KM
= 88 µM) and o-methoxy (19) (KM = 164 µM) groups bound PaPAM better than expected from
their calculated EV(p-l) (55 and 108 kcal/mol, respectively) (Tables A.3.1 and A.3.2). Binding of
the o-methoxy group could become more energetically favorable if it rotated slightly from its
crystallographic position to form hydrogen bonds with Tyr320 in PaPAM (Figure A.2.1).
	  

93	  

Figure 3.5 An overlay of the NH2-cis and NH2-trans configurations is illustrated, using the
m-methyl-(S)-α-phenylalanine substrate (atoms are C, green; N, blue; O, red). The methyl group
can be positioned on the same side (NH2-cis) or the opposite side (NH2-trans) as the reactive
amino group of the chiral substrate (left). An overlay of the NH2-cis and NH2-trans active
configurations of m-methyl-(S)-α-phenylalanine is modeled in the crystallographic position of
α-phenylalanine in PaPAM (PDB 3UNV). A partial MIO and the active site residues that cause
van der Waals overlap with the ligands are shown (C, light blue; N, dark blue; O, red). SLIDE
and other docking tools cannot model covalently bound ligands, which are interpreted as
disallowed steric overlap (right). Thus, the alkene carbon atoms of the MIO were removed to
dock the substrate.

	  

94	  

1250

18

Experimental KM (µM)

1000

11

750

17

500

14
4

7
8

250
12

0
100

15
9
2

13
1 16
10
6
5 3

150

200

19

250

300

350

400

Etot (kcal/mol)

Figure 3.6 Plot of experimental KM and Etot = E(p-l) (protein-ligand interaction energy) + E(l) (the
intra-ligand energy) calculated with Szybki. The substrates were modeled statically, according to
the trajectory of α-phenylalanine in the PaPAM crystal structure, without energy minimization.
Substrates are labeled according to Table 3.1 and the lower energy of the two configurations
[NH2-cis (red ♦, underlined) and or NH2-trans (blue ▲, arrowed)] is plotted for the substrates.
Substrates with no significant difference in energy between the NH2-cis and NH2-trans
(ΔE < 25 kcal/mol) are shown as filled dots (●). Substrates with para-substituents (except
p-methoxy) without an NH2-cis or NH2-trans preference are open-circles (○). Non-productive
substrates 20 – 22 (not shown) were predicted to prefer the NH2-trans orientation in the PaPAM
active site.

3.3.3 Relationship between PaPAM-substrate interaction energies, flexibility, and KM

The calculated interaction energies obtained from modeling provided insight into which
energy terms correlated best with the KM values of PaPAM for each substrate. They also helped
elucidate which substrate-docking model correlated best with experimental KM. The static model
placed the substrates identical to the trajectory of α-phenylalanine in the crystal structure. The
flexible model, however, allowed bond-rotational motion for the protein side chains to relieve
unfavorable interactions. The static modeling showed that the experimental KM for each substrate
(except for three unreactive o-bromo, o-chloro, and o-nitro substrates 20 – 22) increased with the
	  

95	  

total energy [E(p-l) + E(l)], which approximated ΔGbinding and reflected unfavorable interactions
(Figure 3.6). The linear correlation coefficient (ccoef) between [E(p-l) + E(l)] and KM was 0.48
(Figure 3.6), while the ccoef between EV(p-l) and KM was 0.54 (Figure A.2.4). Incidentally, the
ccoef between the Coulombic energy [EC(p-l)], a component of E(p-l), and KM was lower (0.33;
Figure A.2.3). These results suggested that the steric effects in the protein-ligand adduct and
within the ligand are dominant over electrostatic interactions upon substrate binding. Moreover,
when energy minimization was used to relieve vdW overlap between each substrate and the
active site residues of PaPAM (see Figure A.2.2), the ccoef between [E(p-l) + E(l)] and KM
decreased from 0.48 to 0.35. This result emphasizes the importance of vdW overlap-induced
strain in affecting the binding affinity of PaPAM for its substrates.

Another reason why energy minimization of the protein-ligand interaction likely affected the
correlation between [E(p-l) + E(l)] and KM is that, in some cases, groups were rotated that should
have remained rigid. This may be due to inaccuracies in energy-minimization force field
parameters for some functional groups, due to the prodigious challenge in deriving correct
torsional energy barrier profiles for all bonds between all types of functional group that occur in
organic molecules. For instance, the nitro substituent was rotated out-of-plane relative to the
phenyl ring during energy minimization. However, our analysis of 200 nitrophenyl groups in
small-molecule

crystal

structures

in

the

Cambridge

Structural

Database

1.1.1

(http://webcsd.ccdc.cam.ac.uk) indicated that 87.5% of the nitrophenyl groups are entirely
co-planar, regardless of other features in the structure.31 The energy minimization-free protocol
	  

96	  

provided intermolecular energy values that correlated better with KM. This observation suggests
that the crystallographic placement of the substrates and PaPAM was ideal for most substrates,
and that modeling alternative, energy-minimized side group positions may reflect catalytically
unproductive conformations.

Substrates were identified as either in the NH2-cis or NH2-trans configuration (Figure 3.5) if
the difference (ΔEtot) in the [E(p-l) + E(l)] term for models of the two orientations was >25
kcal/mol (Tables A.3.3). Using this limit, o-methoxy- (19), m-methyl (13), m-bromo- (2),
m-nitro- (9), m-chloro- (4) substrates were predicted to conform to the NH2-cis configuration,
while p-methoxy- (18), o-methyl- (6), o-chloro- (21), o-bromo- (20), and o-nitro- (22) substrates
were predicted to favor the NH2-trans configuration (Figure 3.6 and Table A.2.3). In substrate
18, the methyl of the methoxy group was predicted to adopt a quasi NH2-cis configuration.

For meta-substituted substrates, the NH2-cis is the preferred configuration because Leu104,
Val108, and Leu421 sterically hinder the NH2-trans conformers more than Gln456, Phe428,
Gly85, Phe455, and Tyr320 hinder the NH2-cis conformers (Figure 3.5). However, m-methoxy
substrate 18 has no preference for the NH2-cis or NH2-trans configuration, as energy calculations
suggest that the methoxy group interacts similarly with active sites resides on either side. It
should be noted that Phe428, Val108, and Leu421 also sterically hinder substrates with
para-substituted substrates. The ortho-substituted substrates (except for the o-methoxy substrate
19) are energetically more likely to adopt the NH2-trans configuration. The ortho-substituted
	  

97	  

substrates have steric barriers created by residues Phe428, Gln456, and Tyr320 on the NH2-cis
side of PaPAM (Figure 3.5). In addition, the NH2-trans conformers of the ortho-substituted
substrates encounter lower EV(p-l) between Leu216, Leu104 than between Tyr320, Gln456 of the
NH2-cis conformers (Figure 3.5). As mentioned previously, the o-methoxy substrate 19 bound to
PaPAM better than expected from its calculated vdW energy (EV(p-l)) (Tables A.3.1 and A.3.2).
The energy calculations predict that 19 favors the NH2-cis conformer. This orientation is
consistent with the hypothesis that the o-methoxy of 19 is near Tyr320 of PaPAM and can
potentially form an energetically favorable hydrogen bond. Of the nine substrates (1, 3, 5, 6, 10,
12, 13, 16, and 19) that bound PaPAM the best (KM ≲ 200 µM, i.e., not >20% over the KM of
PaPAM for 1), all except the o-methoxy substrate 19 (EV(p-l) = 108 kcal/mol ) had EV(p-l) ≤ 55
kcal/mol (designated as the energy threshold with low vdW overlap). On the other hand, the
majority of poorest binding substrates, with KM > 500 µM, and non-productive substrates had
EV(p-l) ≥ 80 kcal/mol, with the p-nitro- (17), o-bromo- (20), and o-nitro- (22) substrates predicted
to have comparatively higher vdW energy at ≳190 kcal/mol (Table A.2.3). Relative binding
energy, based on EV(p-l), is thus highly predictive of PaPAM having a potentially high or low
affinity for a substrate.

Generally, for productive substrates where the KM of PaPAM was ≤500 µM, the relative
energy [E(p-l) + E(l)] of the NH2-cis and NH2-trans configurations tended to be ≤200 kcal/mol (see
Table A.2.3). It was intriguing to find substrates that bind PaPAM with the least affinity (highest
KM) (compound 18) or were non-productive (21, 20, 22) had differences of ≳150 kcal/mol
	  

98	  

between the two orientations (see Table A.2.3). These results suggest that either the substituent
on the substrate causes the enzyme to preferentially bind the substrate in one orientation over the
other, or that low vdW barriers in the pocket enable the substrate to rotate to an active
conformation for turnover.

The computational analyses identified residues that will help guide future mutational studies.
Proposed mutations are envisioned to increase the binding affinity of PaPAM for various
substrates. The KM of PaPAM was higher for several substrates with meta- and para-substituents
(except fluoro and methyl) than for 1. The presumed lower binding affinity was likely due to
steric interactions between the substituents and the active site residues of PaPAM. As mentioned
herein, meta-substituted substrates were shown by modeling to prefer the NH2-cis configuration
to avoid steric clashes with branched hydrophobic residues. Mutation of Leu104, Val108, and
Leu421 to alanines may improve the binding of meta-substituted substrates by providing
flexibility to bind in the NH2-cis or NH2-trans configuration. Further, computational models
predicted that para-substituents sterically clash with Phe428, Val108, and Leu421. Therefore,
exchange of these residues for alanine may facilitate the binding of para-substituted substrates.
Surprisingly, the computational analysis predicted that all ortho-substituted α-arylalanines bound
well to PaPAM ; however, relief of the active site sterics may enable these ortho-substituted
α-arylalanines to better access a catalytically competent conformation and improve the turnover
number for these substrates.

	  

99	  

The flexible docking feature of SLIDE provided another approach to reduce vdW collisions
between the crystallographic conformation of PaPAM side chains and substituents on the
arylalanine subtrates oriented in the NH2-cis and NH2-trans configuration. After application of
the SLIDE flexibility modeling in the site, no significant correlation was found for
SLIDE-calculated interaction energies and KM values except for the unsatisfied polar interaction
term: E(p-l) (ccoef = 0.13), hydrophobic interaction energy, EH(p-l) (ccoef = –0.19), and
unfavorable energy of interaction due to unpaired or repulsive polar interactions, EUP(p-l) (ccoef =
0.44). SLIDE also assessed the sum of unresolvable vdW overlaps in each complex, in Å,
following flexibility modeling. The correlation of this value with KM, ccoef = 0.27, was positive
but somewhat lower than the correlation found between the Szybki intermolecular vdW energy
and KM in the absence of substrate or protein motion relative to the crystal structure (ccoef of
0.54). This is consistent with the decrease in correlation between Szybki intermolecular vdW
energy and KM (from 0.54 to 0.42) upon energy minimization, reflecting changes in the
conformation of the complex. These results indicate that the favorability of vdW interactions and
the absence of unsatisfied polar interactions when the substrate and protein are in their
crystallographic conformation are the strongest predictors for favorable substrate KM.

3.3.4 Activity cliff analysis

SALI values were used to identify "activity cliffs" that represent large changes in PaPAM
binding affinity among structurally-similar substrates.20 The most obvious activity cliffs were
	  

100	  

found for substrates with fluoro-, methyl-, and chloro-substituents at the same positions (Figure
3.7). The chloro- and methyl-groups share similar vdW radii. When chloro is attached to an aryl
ring carbon, its electron density delocalizes through resonance, placing a partial positive charge
at the pole of the chloro atom furthest from the ring carbon.32 The polarizability of the halogen
atoms increases with atomic orbital size; therefore, the trend to form a halogen bond is in the
order fluoro < chloro < bromo < iodo, where iodo normally forms the strongest interactions.
Thus, the chloro- and bromo-substituents of substrates used in this study can act as electrophiles
and can potentially form halogen bonds with nearby electron donor atoms, such as oxygen.

Favorable halogen-bonds between the halogen acceptor (X) and donor (O) have a C–X••••O
angle of ~165° or a C–O••••X angle of ~120°, with a distance between X and O of ~3 Å.32
However, the structure calculations and modeling revealed no evidence for chloro- or
bromo-bonding between PaPAM and the active orientation of the o-, m-, or p-chloroor -bromo-substrates, based on searching for appropriate halogen-bond donors within 4 Å of the
halogen. It is worth noting that the incompatibility between charged chloro groups and
surrounding neutral carbon atoms in the binding pocket of PaPAM may contribute to the higher
KM values for compounds with chloro-substituents relative to those with isosteric
methyl-substituents. The o-, m-, p-fluoro substrates bound PaPAM (KM values between 27 and
73 µM) better than the natural substrate 1 (KM = 168 µM), indicating a more favorable
interaction between the fluoro group and surrounding hydrocarbon side chains.

	  

101	  

In summary, vdW overlaps, estimated by the EV(p-l) in Szybki, and as the total sum (in Å) of
vdW overlaps remaining following SLIDE docking, are most significant between the substrates
and residues Phe428, Val108, Leu421, Leu104, Gln456 and Tyr320 of PaPAM (Figure 3.5),
which largely influence the binding affinity. Substrates without substituents on the aryl rings, the
natural substrate 1, 2-furyl- (7), 2-thienyl- (12) and 3-thienyl- (8) alanine have no steric
collisions with the binding site residues. This substrate specificity study was not exhaustive;
there remain several arylalanine analogs to be tested in PaPAM kinetics studies.

In the present study, the dependence of the reaction rate on the PaPAM-catalyzed
α/β-isomerization was probed with several arylalanine analogs. The influence of the substituents
on the kcat of PaPAM revealed a concave-down or a downward break in correlations with
Hammett substituent constants (σ). The trend of these correlations28 suggests that the
rate-determining step changes from the elimination to the hydroamination step based on the
direction and magnitude of the electronic properties of the substituent. In addition, the
computational analyses provided a means to predict the docking conformation of substituted 22
arylalanine substrates. This information will guide future targeted amino acid mutagenesis of
PaPAM to increase the catalytic efficiency by improving the binding affinity for various other
non-natural substrates.

	  

102	  

o-­‐Methyl	  (6)
o-­‐Fluoro	  (10)
o-­‐Methyl	  (6)*
o-­‐Fluoro	  (10)*

o-­‐Chloro	  (21)
o-­‐Chloro	  (21)*
m-­‐Methyl	  (13)

m-­‐Fluoro	  (3)
m-­‐Methyl	  (13)*
m-­‐Fluoro	  (3)*
m-­‐Chloro	  (4)
m-­‐Chloro	  (4)*

p-­‐Fluoro	  (6)
p-­‐Methyl	  (16)

p-­‐Chloro	  (6)

p-­‐Fluoro	  (6)

p-­‐Methyl	  (16)

m-­‐Chloro	  (4)*

m-­‐Fluoro	  (3)*

m-­‐Chloro	  (4)

m-­‐Fluoro	  (3)

m-­‐Methyl	  (13)*

o-­‐Chloro	  (21)*

m-­‐Methyl	  (13)

o-­‐Chloro	  (21)

o-­‐Methyl	  (6)*

o-­‐Fluoro	  (10)*

o-­‐Methyl	  (6)

o-­‐Fluoro	  (10)

p-­‐Chloro	  (6)

Figure 3.7 Structure-activity landscape index (SALI) analysis showing the subset of substrate
pairs exhibiting a large change in KM value upon a small change in structure. Substrate pairs with
SALI scores near 200 (approaching red) indicate the most significant activity cliffs. Asterisks (*)
indicate substrates in NH2-cis configuration; all others are NH2-trans.

	  

103	  

APPENDIX

	  

104	  

Table A.2.1 Comparison of the experimental KM and predicted energetic order of each
substituent at ortho-, meta-, para-positions.
Fluoro-Substituentsa

Chloro-Substituentsa

Bromo-Substituentsa

meta(3)

para(5)

ortho(10)

meta(4)

para(14)

ortho(21)

meta(2)

para(15)

ortho(20)

KM (µM)

27

29

73

432

491

-c

339

525

-

EV(p-l)
(kcal/mol)

19

19

21

33

37

93

55

60

204

(E(p-l) + E(l))
(kcal/mol)

148

150

149

166

170

226

188

193

338

Nitro-Substituentsa

Methyl-Substituentsb

Methoxy-Substituentsb

meta(9)

para(17)

ortho(22)

ortho(6)

para(16)

meta(13)

ortho(19)

meta(11)

para(18)

KM (µM)

430

752

-

88 (I)

163 (II)

204 (III)

164 (I)

990 (II)

1187 (III)

EV(p-l)
(kcal/mol)

48

186

205

55 (III) 46 (II)

(E(p-l) + E(l))
(kcal/mol)

236

360

393

190 (III) 179 (II) 174 (I)

a

40 (I)

108 (III)

86 (II)

81 (I)

292 (III)

240 (II)

219 (I)

Computational approach correctly explained the trends in KM values of substrate analogs.
Trends in KM did not correlate well with computationally predicted energy values, which fell within a
relatively narrow range. Trends from most (I) to least (III) favorable are shown in (Roman numerals).
c
Hyphens indicate non-productive substrates.
b

	  

105	  

Table A.2.2 Comparison of the experimental KM and predicted energetic order of each
substituent at ortho-, meta-, para-positions. This data is the same as presented in Table A.2.1;
here, it is organized according to substituent position rather than type.
ortho-Substituents
KM (µM)

Fluoro

Methyl

Methoxy

Bromo
a

Chloro

Nitro

-

-

73

88

164

-

EV(p-l)
(kcal/mol)

Fluoro

Methyl

Chloro

Methoxy

Bromo

Nitro

21

55

93

108

204

205

(E(p-l) + E(l))
(kcal/mol)

Fluoro

Methyl

Chloro

Methoxy

Bromo

Nitro

149

190

226

292

338

393

meta-Substituents
Fluoro

Methyl

Bromo

Nitro

Chloro

Methoxy

27

204

339

430

432

990

EV(p-l)
(kcal/mol)

Fluoro

Chloro

Methyl

Nitro

Bromo

Methoxy

19

33

40

48

55

86

(E(p-l) + E(l))
(kcal/mol)

Fluoro

Chloro

Methyl

Bromo

Nitro

Methoxy

148

166

174

188

236

240

KM (µM)

para-Substituents
Fluoro

Methyl

Chloro

Bromo

Nitro

Methoxy

29

163

491

525

752

1187

EV(p-l)
(kcal/mol)

Fluoro

Chloro

Methyl

Bromo

Methoxy

Nitro

19

37

46

60

81

186

(E(p-l) + E(l))
(kcal/mol)

Fluoro

Chloro

Methyl

Bromo

Methoxy

Nitro

150

170

179

193

219

360

KM (µM)

a

	  

Non-productive substrates are indicated by hyphens.

106	  

Table A.2.3 Evaluation of protein-ligand and ligand internal energy values and preference for
NH2-cis versus NH2-trans configuration.

Substrate

	  

NH2-trans
(E(p-l)
+
a
E(l))
(kcal/mol)

NH2-cis
(E(p-l)
+ EV(p-l)b
KM
Preferred
a
E(l))
(kcal/mol) (µM) Orientationc
(kcal/mol)

1

149

149

19

168

Symmetricald

2

429

188

55

339

NH2-cis

3

153

148

19

27

NSDe

4

273

166

33

432

NH2-cis

5

150

150

19

29

Symmetrical

6

190

489

55

88

NH2-trans

7

133

115

21

415

NSD

8

156

154

21

337

NSD

9

1640

236

48

430

NH2-cis

10

149

165

21

73

NSD

11

265

240

86

990

NSD

12

132

139

20

132

NSD

13

245

174

40

204

NH2-cis

14

170

170

37

491

Symmetrical

15

193

193

60

525

Symmetrical

16

179

179

46

163

Symmetrical

107	  

Table A.2.3 (cont’d)
17

360

360

186

752

Symmetrical

18

219

947

81

1187

NH2-trans

19

409

292

108

164

NH2-cis

20

338

525

204

-f

NH2-trans

21

226

401

93

-

NH2-trans

22

393

2065

205

-

NH2-trans

a

(E(p-l) + E(l)) is the sum of protein-ligand and ligand internal energy, where E(p-l) is the
protein-ligand interaction energy and E(l) is the ligand internal energy. bEV(p-l) is the vdW energy
of protein-ligand interaction, one of the terms contributing to E(p-l). The vdW energy is given for
whichever orientation (NH2-cis or NH2-trans) had the lower, more favorable (E(p-l) + E(l)) value.
c
Substrates were categorized as preferring an NH2-cis or NH2-trans configuration if the given
orientation was at least 25 kcal/mol lower in (E(p-l) + E(l)) value. dα-Phenylalanine and
para-substituted substrates have symmetrical aryl rings with equal interaction energies for the
NH2-cis and NH2-trans configurations. eSubstrates observed to have no significant difference
(NSD) in energy for the NH2-cis or NH2-trans configuration. fNon-productive substrates are
indicated by hyphens. Note, all energies reported should be considered relative rather than
absolute.

	  

108	  

Tyr 320

Figure A.2.1 H-bonding interaction of ortho-methoxy-α-phenylalanine (19) and active site
Tyr320. o-Methoxy-α-phenylalanine atoms are colored as C, green; N, blue; O, red and Tyr320
atoms are colored as C, light blue; O, red; H, white.

	  

109	  

1000

1000

KM (µM)

1500

KM (µM)

1500

500

0

500

0

100

200

300

E(p–l) (kcal/mol)

400

10

20

30

40

50

60

70

80

90

100

E(p–l) (kcal/mol)

Figure A.2.2 Relationship between protein-ligand interaction energy E(p-l) and experimental KM.
Substrates were placed in the active site in NH2-cis and NH2-trans orientations overlaid with the
crystallographic orientation of α-phenylalanine from PDB entry 3UNV, and the lower energy
orientation was kept. Left panel: (●) Binding site residues of PaPAM were maintained in their
crystallographic orientation, yielding a linear correlation coefficient of 0.48 between E(p-l) and
experimental KM. Right panel: (○) Energy minimization was used to reduce any repulsive
interactions, leading to lower correlation between the resulting protein-ligand interaction energy
and KM value (correlation coefficient = 0.35).

	  

110	  

1500

1500

1000

KM (µM)

KM (µM)

1000

500

500

0

0

20

-30

25

EC(p–l) (kcal/mol)

-25

-20

EC(p–l) (kcal/mol)

Figure A.2.3 Relationship between the electrostatic (Coulombic) component of the
protein-ligand interaction energy EC(p-l) and experimental KM. Substrates were placed in the
active site in NH2-cis and NH2-trans configurations overlaid with the crystallographic orientation
of α-phenylalanine, and the lower energy orientation was kept. Left panel: (●) Binding site of
PaPAM was kept in the crystallographic orientation (correlation coefficient = 0.33). Right panel:
(○) Energy minimization was used to reduce any protein-ligand repulsive interactions
(correlation coefficient = 0.011).

	  

111	  

1500

1500

1000

KM (µM)

KM (µM)

1000

500

500

0

0

0

50

100

EV(p–l) (kcal/mol)

150

-10

200

-5

mEV(p–l) (kcal/mol)

0

Figure A.2.4 Relationship between the van der Waals energy component of the protein-ligand
energy EV(p-l) and experimental KM. Substrates were again placed in NH2-cis and NH2-trans
orientations overlaid with the crystallographic orientation of α-phenylalanine from PDB entry
3UNV, and the lower energy orientation was kept. Left panel: (●) Binding site residues of
PaPAM were kept in the crystallographic orientation (correlation coefficient = 0.54). Right
panel: (○) Energy minimization was used to reduce any protein-ligand repulsive interactions
(correlation coefficient = 0.42). These results indicate that the van der Waals interaction energy
between the protein and each substrate overlaid with the α-phenylalanine-bound crystal structure
is most predictive of the relative KM values of the substrates.

	  

112	  

REFERENCES

	  

113	  

REFERENCES

(1) Horne, W. S. Peptide and peptoid foldamers in medicinal chemistry. Expert Opin. Drug
Discovery 2011, 6, 1247-1262.
(2) Ruf, S.; Buning, C.; Schreuder, H.; Horstick, G.; Linz, W.; Olpp, T.; Pernerstorfer, J.; Hiss,
K.; Kroll, K.; Kannt, A.; Kohlmann, M.; Linz, D.; Hubschle, T.; Rutten, H.; Wirth, K.; Schmidt,
T.; Sadowski, T. Novel β-Amino Acid Derivatives as Inhibitors of Cathepsin A. J. Med. Chem.
2012, 55, 7636-7649.
(3) Huang, X.; O’Brien, E.; Thai, F.; Cooper, G. Practical Asymmetric Synthesis of
RO5114436, a CCR5 Receptor Antagonist. Org. Process Res. Dev. 2010, 14, 592-599.
(4) Jennewein, S.; Wildung, M. R.; Chau, M. D.; Walker, K.; Croteau, R. Random sequencing
of an induced Taxus cell cDNA library for identification of clones involved in Taxol
biosynthesis. Proc. Natl. Acad. Sci. U.S.A. 2004, 101, 9149-9154.
(5) Klettke, K. L.; Sanyal, S.; Mutatu, W.; Walker, K. D. β-Styryl- and β-Aryl-β-alanine
Products of Phenylalanine Aminomutase Catalysis. J. Am. Chem. Soc. 2007, 129, 6988-6989.
(6) Magarvey, N. A.; Fortin, P. D.; Thomas, P. M.; Kelleher, N. L.; Walsh, C. T. Gatekeeping
versus Promiscuity in the Early Stages of the Andrimid Biosynthetic Assembly Line. ACS Chem.
Biol. 2008, 3, 542-554.
(7) Chesters, C.; Wilding, M.; Goodall, M.; Micklefield, J. Thermal Bifunctionality of
Bacterial Phenylalanine Aminomutase and Ammonia Lyase Enzymes. Angew. Chem. Int. Ed.
2012, 51, 4344-4348.
(8) Feng, L.; Wanninayake, U.; Strom, S.; Geiger, J.; Walker, K. D. Mechanistic, Mutational,
and Structural Evaluation of a Taxus Phenylalanine Aminomutase. Biochemistry 2011, 50,
2919-2930.
(9) Röther, D.; Poppe, L.; Morlock, G.; Viergutz, S.; Rétey, An active site homology model of
phenylalanine ammonia-lyase from P. crispum. J. Eur. J. Biochem. 2002, 269, 3065-3075.
(10)
Huang, S. X.; Lohman, J. R.; Huang, T.; Shen, B. A new member of the
4-methylideneimidazole-5-one–containing aminomutase family from the enediyne kedarcidin
biosynthetic pathway. Proc. Natl. Acad. Sci. U.S.A. 2013, 110, 8069-8074.
	  

114	  

(11) Strom, S.; Wanninayake, U.; Ratnayake, N. D.; Walker, K. D.; Geiger, J. H. Insights into
the Mechanistic Pathway of the Pantoea agglomerans Phenylalanine Aminomutase. Angew.
Chem. Int. Ed. 2012, 51, 2898–2902.
(12) Hawkins, P. C.; Skillman, A. G.; Warren, G. L.; Ellingson, B. A.; Stahl, M. T. Conformer
Generation with OMEGA: Algorithm and Validation Using High Quality Structures from the
Protein Databank and Cambridge Structural Database. J. Chem. Inf. Model. 2010, 50, 572-584.
(13) Hawkins, P. C.; Nicholls, A. Conformer Generation with OMEGA: Learning from the
Data Set and the Analysis of Failures. J. Chem. Inf. Model. 2012, 52, 2919-2936.
(14) Jakalian, A.; Jack, D. B.; Bayly, C. I. Fast, efficient generation of high-quality atomic
charges. AM1-BCC model: II. Parameterization and validation. J. Comput. Chem. 2002, 23,
1623-1641.
(15) Nicholls, A.; Wlodek, S.; Grant, J. A. J. Comput. Aided Mol. Des. SAMPL2 and
continuum modeling. 2010, 24, 293-306.
(16) Wlodek, S.; Skillman, A. G.; Nicholls, A. Ligand Entropy in Gas-Phase, Upon Solvation
and Protein Complexation. Fast Estimation with Quasi-Newton Hessian. J. Chem. Theory
Comput. 2010, 6, 2140-2152.
(17) Halgren, T. A. Merck molecular force field. I. Basis, form, scope, parameterization, and
performance of MMFF94. J. Comput. Chem. 1996, 17, 490-519.
(18) Zavodszky, M. I.; Rohatgi, A.; Van Voorst, J. R.; Yan, H.; Kuhn, L. A. Scoring ligand
similarity in structure-based virtual screening. J. Mol. Recognit. 2009, 22, 280-292.
(19) Zavodszky, M. I.; Sanschagrin, P. C.; Korde, R. S.; Kuhn, L. A. Distilling the essential
features of a protein surface for improving protein-ligand docking, scoring, and virtual screening.
J. Comput. Aided Mol. Des. 2002, 16, 883-902.
(20) Guha, R.; Van Drie, J. H. Structure−Activity Landscape Index:   Identifying and
Quantifying Activity Cliffs. J. Chem. Inf. Model. 2008, 48, 646-658.
(21) Hawkins, P. C.; Skillman, A. G.; Nicholls, A. Comparison of Shape-Matching and
Docking as Virtual Screening Tools. J. Med. Chem. 2007, 50, 74-82.
(22) Mutatu, W.; Klettke, K. L.; Foster, C.; Walker, K. D. Unusual Mechanism for an
Aminomutase Rearrangement:   Retention of Configuration at the Migration Termini.
Biochemistry 2007, 46, 9785-9794.
	  

115	  

(23) Schuster, B.; Rétey, J. The mechanism of action of phenylalanine ammonia-lyase: The
role of prosthetic dehydroalanine. Proc. Natl. Acad. Sci. U.S.A. 1995, 92, 8433-8437.
(24) Wanninayake, U.; DePorre, Y.; Ondari, M.; Walker, K. D. (S)-Styryl-α-alanine Used To
Probe the Intermolecular Mechanism of an Intramolecular MIO-Aminomutase. Biochemistry
2011, 50, 10082–10090.
(25) Weiner, B.; Szymanski, W.; Janssen, D. B.; Minnaard, A. J.; Feringa, B. L. Recent
advances in the catalytic asymmetric synthesis of β-amino acids. Chem. Soc. Rev. 2010, 39,
1656-1691.
(26) Szymanski, W.; Wu, B.; Weiner, B.; de Wildeman, S.; Feringa, B. L.; Janssen, D. B.
Phenylalanine Aminomutase-Catalyzed Addition of Ammonia to Substituted Cinnamic Acids: a
Route to Enantiopure α- and β-Amino Acids. J. Org. Chem. 2009, 74, 9152-9157.
(27) Ratnayake, N. D.; Wanninayake, U.; Geiger, J. H.; Walker, K. D. Stereochemistry and
Mechanism of a Microbial Phenylalanine Aminomutase. J. Am. Chem. Soc. 2011, 133,
8531-8533.
(28) Hoffmann, J.; Klicnar, J.; Štěrba, V.; Večeřa, M. Collect. Czech. Kinetics of hydrolysis of
substituted salicylideneanilines. Chem. Commun. 1970, 35, 1387-1398.
(29) Batsanov, S. S. Van der Waals radii of elements from the data of structural inorganic
chemistry. Russ. Chem. Bull. 1995, 44, 18-23.
(30) Li, A. J.; Nussinov, R. A set of van der Waals and coulombic radii of protein atoms for
molecular and solvent-accessible surface calculation, packing evaluation, and docking. Proteins
1998, 32, 111-127.
(31) Carpy, A. J. M.; Haasbroek, P. P.; Ouhabi, J.; Oliver, D. W. Keto/enol tautomerism in
phenylpyruvic acids: structure of the o-nitrophenylpyruvic acid. J. Mol. Struct. 2000, 520,
191-198.
(32) Metrangolo, P.; Meyer, F.; Pilati, T.; Resnati, G.; Terraneo, G. Halogen Bonding in
Supramolecular Chemistry. Angew. Chem. Int. Ed. 2008, 47, 6114-6127.

	  

116	  

Chapter 4 Using multiple virtual screening techniques to bootstrap pheromone antagonist
discovery

	  

117	  

4.1 Introduction

4.1.1 Motivation

Virtual screening techniques have been used for human drug discovery successfully.1-3 It is
novel to apply these techniques in other fields such as agriculture and aquatic species. Here, we
integrated these techniques into a screening pipeline to find antagonists to control an aquatic
invasive species, the sea lamprey, in the Great Lakes (in collaboration with Prof. Weiming Li’s
lab in the Department of Fishery and Wildlife). We aimed to find an effective as well as
environmentally friendly solution to control the population of sea lamprey in the Great Lakes.
Sea lamprey is a well-known invasive species, which causes billions of dollars lost to the
commercial fishery and threatens the survivals of large and medium fish.4,5 Current strategies to
control the sea lamprey population in the Great Lakes either trap the sea lamprey at a significant
yet not a large rate 10% over the years6 or employ chemical pesticides that unfortunately threaten
the lake sturgeon, an endangered species in 19 out of the 20 states.7

The in-silico screening part of this pipeline focuses on hypothesis-driven ligand-based
virtual screening due to its computational efficiency, assisted by structure-based virtual screening
(docking ligand candidates into a protein structure). To manage and record data systematically
and improve the screening efficiency to avoid repeatedly retrieving molecular data from the file
containing giant (12 million molecules) small molecule dataset, a fellow graduate student in the
Kuhn lab, Sebastian Raschka, designed a SQLite tool to store and retrieve small molecule data
	  

118	  

matching our criteria for inhibitor candidates.

4.1.2 Hypothesis

7а,12а,24-trihydroxy-3-one-5а-cholan-24-sulfate (3kPZS, Figure 4.1) is the pheromone
released by male sea lamprey,8 to attract the sexually mature females to the nesting grounds. Our
hypothesis is that blocking the female detection of 3kPZS via blocking SLOR1 receptor is
expected to halt the propagation of this invasive vertebrate species. 3kPZS specifically binds to
the SLOR1 receptor at a concentration of 10-12 M.8 Therefore, environmentally friendly
inhibitors that mimic 3kPZS to compete for binding to the ligand binding site in SLOR1 and
inhibit 3kPZS detection at small concentrations are a desirable approach for sea lamprey control.
Success in such an approach has been shown for invasive insect control.9

Figure 4.1 Structure of 3kPZS.

	  

119	  

4.1.3 Significance

Discovery of antagonists of SLOR1 to mimic 3kPZS binding is based discovering mimics of
the structure of this bile acid-like compound, which binds specifically to olfactory receptors,
which are G-protein coupled receptors (GPCRs). Approximately 50%-60% of modern medicinal
drugs are targeted at GPCRs.10 GPCRs can interact with compounds like odor molecules,
pheromones, hormones, and neurotransmitters, and also with other proteins and peptides.11 Sea
lamprey olfactory receptors are homologous to the rhodopsin or class A branch of the human
GPCR family tree.12 This provides a great case to study specificity of ligand recognition in the
GPCRs, including our bile acid ligand, 3kPZS. Furthermore, it is a brand new application of
virtual screening drug discovery techniques to control aquatic invasive species, which can be
applied to other projects to discover agonists or antagonists based on known ligand structures.

	  

120	  

4.2 Materials and Methods

Here, we design a complete pipeline (Figure 4.2) to integrate multiple screening techniques
and experimental assays, to successfully discover inhibitors for the sea lamprey olfactory
receptor 1 (SLOR1) by focusing on discovering mimics of its native ligand, 3kPZS. In the
pipeline, we designed a database tool that annotates structure information of molecules from
multiple compound libraries, which allows efficient identification of compounds matching
functional groups we hypothesize are important to 3kPZS detection. The selected compounds are
evaluated based on shape and electrostatic similarities with 3kPZS. The compounds after scoring
for shape and electrostatic similarity are secondarily selected according to functional group
matches. These selected compounds are then evaluated based on their docking into the SLOR1
receptor homology model. The candidates that have favorable interactions with the receptor
binding site were then prioritized for experimental assays. Experimental results act as feedback
to strengthen or refine the initial hypothesis and modify the screening strategy if required.

	  

121	  

Database	  of	  12	  million	  
drug-­‐like,	  commercially	  
available	  compounds	  

Selection	  of	  compounds	  
based	  on	  hypothesis	  

Conformational	  sampling	  
of	  each	  ligand	  candidate	  

Optimal	  overlay	  of	  ligand	  
candidate	  conformers	  with	  
3kPZS	  
Matching	  functional	  
groups	  
	  
	  
	  

Checking	  for	  any	  incorrect	  
steroid	  stereoisomers	  
Molecular	  docking	  into	  
receptor	  

Ranking	  and	  prioritization	  
based	  on	  hypothesis	  testing	  

Structure-­‐activity	  analysis	  
• Activity	  cliff	  analysis	  
• Functional	  group	  match	  
Zingerprint	  analysis	  

Experimental	  validation	  
• Electro-­‐olfactogram	  assays	  
• Behavorial	  tests	  
• Two-­‐choice	  maze	  
• In-­‐stream	  assay	  

Figure 4.2 Flowchart describing the pipeline for 3kPZS antagonist discovery. In step 2, an
example of the potential hypothesis is that known GPCR ligands are likely to mimic 3kPZS and
block SLOR1.
	  

122	  

4.2.1 Virtual screening

4.2.1.1 3kPZS and SLOR1 structural model

The homology model of SLOR1 was constructed by Dr. Kuhn through the ModWeb
implementation of Modeller13 (version SVN.r972), based on the alignment of SLOR1 with the
avian β1-adrenergic receptor crystal structure (Protein Data Bank entry 2vt4).14 The sequence
identity between the sequences of SLOR1 and avian β1-adrenergic receptor is 25.5% with an
e-value of 6.4e-11, which indicates that there is a very low probability to obtain an alignment with
this level of amino acid similarity at random. According to a previous statistical study, if two
sequences have at least 24.8% identity over no fewer than 80 residues, then their corresponding
main-chain structures are closely related.15 This criterion is satisfied by the alignments between
sequences of SLOR1 and β1-adrenergic receptor. The region corresponding to the orthosteric
binding site in SLOR1 model was obtained by overlaying the volume of ligand binding in the
main-chain structures of other related class A GPCRs such as rhodopsin, A2A and β1-adrenergic
receptor.

The set of all favorable-energy conformations of 3kPZS was docked into the SLOR1 ligand
binding cavity using SLIDE software with default settings16 to predict the 3kPZS-SLOR1 mode
of interaction.17

	  

123	  

4.2.1.2 Preparation of the screening libraries based on hypothesis

The “Drug-Like" subset of the ZINC12 database containing 13.2 million compounds, is the
largest screening library we used and processed by Santosh Gunturu and Sebastian Raschka.18
All of the compounds in this subset satisfy "Lipinski's rule of 5",19 as listed below:

•

molecular mass ranging from 150 to 500D,

•

octanol-water partition coefficient no greater than 5,

•

no more than 5 hydrogen bond donors and no more than 10 hydrogen bond acceptors,

•

no more than 7 rotatable bonds

•

polar surface area less than 150 Å2.

There are 7.4 million compounds labeled as “in stock” in the 13.2 million compound subset,
which were the compounds of focus. The information on these compounds such as ZINC IDs,
purchasability, number of rotatable bonds and functional groups were stored in an SQLite
database for screening. The advantage to store the information in an SQLite database is that it
allows fast selections of molecules based on an initial hypothesis (e.g., inhibitors must include a
terminal sulfate group), as well as easy manipulation of data such as insertion, deletion, editing
and storage of data. It enhanced the safety of record keeping, and reproducibility of screening
results.

	  

124	  

According to former results from Dr. Li’s lab, a hypothesis has been proposed that an
effective antagonist should contain a sulfate group matching the 3kPZS 24-sulfate and at least
one oxygen (hydroxyl or keto) group matching the 3-oxygen position in 3kPZS. Based on this
hypothesis, a subset of compounds containing a sulfate group and at least one oxygen (hydroxyl
or keto) group were selected in the SQlite database. This results in fewer than 100,000
compounds, a much smaller database than the initial 7.4 million available compounds. This step
reduces the cost dramatically in the following, computationally expensive steps.

I prepared the following databases that contain additional analogs of 3kPZS and known
ligands for G protein-coupled receptors for screening, based on different hypotheses.

The combinatorial analog data set contains 332 variants of 3kPZS, sampling different
combinations of functional groups at 3, 7 and 12 positions and steroid ring configurations that
were designed as 2D structures by our collaborators Dr. Mar Huertus and Anne Scott in Dr. Li’s
lab. For example, hydroxyl groups and keto groups at the 3, 7 and 12 position (Figure 4.1) are
substituted by keto oxygen, hydroxyl oxygen or hydrogen. The 5-beta configuration is replaced
by a 5-alpha steroid ring, the configuration in 3kPZS. The carboxylate, sulfate or phosphate
group were allowed at the C-24 position. According to the SMILES strings of the 332 variants, I
generated 3-dimensional structures of these compounds by using OMEGA 2.4.6 (OpenEye
Scientific Software, Santa Fe, NM; http://www.eyesopen.com) and partial charges were assigned
using molcharge 1.3.1 (Open Eye Scientific Software). The SMILES strings of the 332
	  

125	  

compounds served as input to SciFinder (https://scifinder.cas.org) to find commercially available
compounds using a similarity search with a threshold of 99%, corresponding to the subset of
these compounds that have been synthesized and can be purchased. The corresponding CAS
numbers (molecular database identifier) were exported and any duplicates were deleted. In the
end, 84 unique compounds were found to be available.

To test the effect of configurations of steroid ring systems on biological activities, 5-beta
steroids with bent configuration, different from the 5-alpha steroid with planar configuration as
in 3kPZS, were identified in the ZINC12 database by substructure searching using the 5-beta
steroid SMILES string “C1CCC3[C@@H(C1)CC[C@H]4[C@@H]2CCCC2CC[C@H]34”.
Because 5-beta steroids are easier to synthesize, active compounds with this configuration would
reduce experimental costs. In the end, a set of 690 compounds with 5-beta steroid configuration
that had not been previously included in the ZINC12 screening library were extracted, with a
subset of 200 compounds that were drug-like.

A final set of 2,995 steroid structures that are commercially available through other vendors
while not already in the ZINC12 database was established, by searching steroid analogs in the
CAS Registry database using SciFinder. Because the ZINC12 database does not cover all
vendors of small organic molecules, the CAS Registry database containing 91 million
compounds can serve as a complementary database to search steroid molecules with commercial
vendors that are not present in the ZINC12 database. Using SciFinder, ~8000 CAS Registry
	  

126	  

steroids were exported. The SMILES strings of 2995 steroids could be determined from their
CAS registry numbers by CACTUS (http://cactus.nci.nih.gov), which is a server to translate the
compounds’ information in different formats. Then, 3-dimensional structures of the 2995
steroids were built based on their SMILES strings by using OMEGA 2.4.6 (OpenEye Scientific
Software, Santa Fe, NM; http://www.eyesopen.com). Partial charges were assigned using
molcharge 1.3.1 (Open Eye Scientific Software).

The GPCR ligand library (GLL) database, of ~25,000 known ligands for 147 GPCRs
(http://cavasotto-lab.net/Databases/GDD/)20 was also prepared for screening. The sdf files in the
GLL data package downloaded from http://cavasotto-lab.net/Databases/GDD/Download/ were
converted to 3D structures using OMEGA 2.4.6 (OpenEye Scientific Software, Santa Fe, NM;
http://www.eyesopen.com). In addition, the 3D structures of the ligands of the trace
amine-associated receptors (TAAR) were generated by using their isomer SMILES strings by
OMEGA 2.4.6 (OpenEye Scientific Software, Santa Fe, NM; http://www.eyesopen.com),
because these compounds are not included in the GLL database. Then partial charges were
assigned to the atoms in all compounds by using molcharge 1.3.1 (Open Eye Scientific
Software).

Using EOG assays, we identified 12 compounds from the above databases that suppress at
least 45% of response to 3kPZS, based on measuring the blockage of the olfactory neurological
response to 3kPZS by using a technique called an electro-olfactogram (EOG; assayed by Dr. Mar
	  

127	  

Huertas & Anne Scott in the Li lab). We then identified ZINC compounds that share high
similarity to the identified 12 compounds following the hypothesis that they are likely to have
similar activity. Therefore, the compounds with >90% molecular similarities to the identified 12
compounds that are commercially available were extracted from the ZINC12 database for
screening.

4.2.1.3 Sampling flexible compounds

Before virtual screening based on the 3D structures of molecules, multiple 3 dimensional
conformations for each molecule need to be sampled to access all the possible low-energy
conformations that the molecules can adopt in nature. The remaining compounds from the above
databases were used to generate low energy conformers by using OMEGA 2.4.6 (OpenEye
Scientific Software, Santa Fe, NM; http://www.eyesopen.com). This is a tool that uses a
knowledge-based approach to generate hundreds of low-energy conformers for each molecule. It
can sample conformers with validated quality at a high speed of 2-2.5 sec/molecule on a machine
with a standard computer configuration of 2.4 Ghz CPU, 4GB RAM. Overall, there are three
steps to generate low energy conformers in OMEGA 2.4.6 (OpenEye Scientific Software, Santa
Fe, NM; http://www.eyesopen.com)’s algorithm. The first step is to assemble an initial set of
conformations of molecules from the pre-calculated chemical fragment library. The second step
is to generate a large ensemble of conformations by applying all torsions in the molecules based
on a pre-built torsion sampling dictionary. Lastly, a scoring function based on a modified
	  

128	  

MMFF94 force field is used to eliminate conformers with internal clashes. To guarantee the
uniqueness of sampled conformers, conformers with low pairwise RMSD threshold are
eliminated as well.21 There are on average 200 low energy conformers generated for each
molecule in the above screening libraries. The distance between the 3-hydroxyl group and the
24-sulfate group in the potential active conformers of 3kPZS ranges from 13 to 20 Å. Therefore,
conformers of database molecules with corresponding functional groups within this distance
were selected to improve the efficiency and efficacy of screening.

4.2.1.4 Overlays of molecular structures using ROCS

The compounds in the above compound libraries after sampling and selection were then
overlaid, one by one, with the 48 low-energy conformations of 3kPZS by using ROCS (OpenEye
Scientific Software, Santa Fe, NM).22 ROCS (OpenEye Scientific Software, Santa Fe, NM)22 is a
ligand-based approach to calculate the degree of similarity in shape and chemical properties of
compounds compared with target ligands. It overlays the structures of molecules quickly and
also supports multiprocessing, which makes it a feasible tool for ligand-based virtual screening.
According to OpenEye reports, ROCS can overlay 20-40 molecules per second using a single
CPU of a standard computer (2.4 Ghz CPU, 4 GB RAM). It uses a Gaussian function to
represent the volume of each atom and a partial charge model to calculate chemical matches.22
The ROCS structure similarity score, TanimotoCombo score, ranks the database compounds
with value ranging from 0.0 for no match to 2.0 for a perfect match, equally weighting shape
	  

129	  

match and partial charge match. The distribution of ROCS scores across the ZINC12 database
has a mean value of 0.64, with a standard deviation of 0.08. We considered compounds with
scores greater than 2 standard deviations above the mean as significant matches. Compounds in
this region of the score distribution were kept from all the 124 partitions of ZINC and evaluated
for functional group matches to 3kPZS.

4.2.1.5 Matching functional groups in 3kPZS

A subset of the substituent groups in 3kZPS, including 3-keto, 7-OH, 12-OH, ester O
conjugated to C-24, two methyl groups in the steroid ring and the terminal organosulfate group
are hypothesized to be essential to the biological activity of 3kZPS. Tabulation of these
functional groups for each compound in the screening libraries was performed and stored in the
SQLite database (using code developed by Santosh Gunturu and Sebastian Raschka), according
to suitable atomic charge threshold and hybridization state. If compounds in the screening
libraries have corresponding atoms with proper charge and hybridization state within 1.0Å of
these functional groups in the best-matching 3kPZS conformer, then these groups are labeled as
matches.

4.2.1.6 Incorrect steroids

In order to check whether the stereochemistry of the compounds in the above database
satisfy the natural stereochemistry of steroids, steroid checking by using SMILES representation
	  

130	  

of the canonical steroid ring system is performed using OpenEye toolkit. This was necessary
because ZINC12 samples and includes unnatural isomers in cases where the vender did not
provide complete stereo-chemical information for compounds. Also, duplicate compounds with
the same chemical structures and vendors but different ZINC IDs are deleted based on the
properties of their SMILE strings.

4.2.1.7 Molecular docking

Compounds with high ROCS TanimotoCombo score and functional group matches were
docked into the binding site in the SLOR1 homology model by Santosh Gunturu and me using
SLIDE. Then the compounds were evaluated by the ability to form a salt bridge with His 110 in
the binding site, believed to be a crucial interaction for the pheromone, as well as the degree of
isostericity of the steroid ring substructure. In addition, molecules with obvious steric clashes
with the binding site were of lower priority experimental tests and molecules with favorable
ΔGbinding values based on SLIDE scores were increased in priority.

4.2.1.8 Ranking and prioritization based on hypothesis testing

Using the datasets and screening toolkit described above, a series of hypotheses were
defined to select subsets of compounds for EOG assays.

Hypothesis 1: Compounds containing the 3-keto and one or more sulfate oxygens in the
	  

131	  

functional group matching with 3kPZS, and TanimotoCombo score above 0.85 will mimic 3kPZS.
This criterion tests the hypothesis that compounds that are highly similar to 3kZPS in overall
shape and electrostatic properties, and contain the 3-keto and sulfate groups, can compete with
and block detection of 3kPZS.

Hypothesis 2: Compounds containing a 3-hydroxy group, a sulfate oxygen, and at least one
of the other functional groups present in 3kPZS, such as sulfate oxygen, hydroxyl or methyl, in
the functional group matching with 3kPZS, and a TanimotoCombo score no less than 0.8 and
ROCS electrostatic score no less than 0.25 will mimic 3kPZS. This hypothesis tests whether
compounds that are highly similar to 3kZPS overall and match its electrostatic properties, and
contain the 3-hydroxyl and sulfate groups, can inhibit detection of 3kPZS. In addition, as a
secondary consideration, compounds that have a sulfate group that can dock close to His110 in
the SLOR1 binding site with a docking energy of -7kcal/mol or less are more favorable and
prioritized.

Hypothesis 3: Compounds that interact with the β1-adrenergic receptor will be active
against the SLOR1 receptor. Because β1-adrenergic receptor is the known structure that has the
highest overall and binding site sequence identity to SLOR1, the compounds that interact with
β1-adrenergic receptor are selected for testing, including carvedilol (agonist; ZINC01530579),
atenolol

(selective

antagonist;

ZINC00014007),

ZINC00003911).
	  

132	  

and

dobutamine

(partial

agonist;

Hypothesis 4: Compounds with a 5-alpha steroid ring configuration and functional groups
matching the 3-keto and sulfate oxygen in 3kPZS, and a ROCS TanimotoCombo score greater
than 0.65 will mimic 3kPZS. We selected analogs with the same steroid ring configuration and
which matched the oxygen-containing groups in 3kPZS, to test whether they mimic 3kPZS in
activity

Hypothesis 5: Compounds with phosphate tail. Compounds with phosphate instead of sulfate
tails at the terminus (C24) were selected to test whether phosphate can be a potential replacement
to the sulfate moiety of 3kPZS and block detection of 3kPZS.

Hypothesis 6: Compounds with a 5-β steroid ring configuration and at least 2 sulfate oxygen
matches or at least 5 functional group matches all together mimic 3kPZS. This tests whether bent
steroids can simulate the interaction between planar steroids and SLOR1.

Hypothesis 7: Compounds with more negative charged sulfate group (with charges 0.3 units
more negative than the sulfate oxygen charge in 3kPZS) will bind more tightly. It is postulated
that compounds with a more negatively charged tail can form stronger salt bridges with His110
at the binding site of SLOR1 and more strongly compete with 3kZPS for binding.

Hypothesis 8: Compounds containing negatively charged, non-sulfate oxygen atoms
matching at least one sulfate oxygen in 3kPZS would also compete with 3kPZS for binding
	  

133	  

SLOR1. Additionally, compounds need to contain atoms matching the 3-keto and at least one of
the other functional groups in 3kPZS, and have a ROCS TanimotoCombo score value of 0.8 or
above, and a ROCS electrostatics complementarity value 0.25 or above. The compounds were
further evaluated according to the distance between the sulfate tail group and His110 for the
ability to make the salt bridge believed to confer SLOR1-3kPZS specificity and a docking score
(<-7kcal/mol) assessing overall favorability of interaction. This test the hypothesis that
compounds with negative, non-oxygen atoms at the tail can mimic the function of sulfate
oxygen.

Hypothesis 9: Steroids containing epoxide. Epoxide containing compounds are reported to
be able to form a covalent interaction with histidine in the binding site of at least one protein.23
Therefore, steroids containing epoxide at the tail position or 3-O position were selected to test
whether epoxide at these positions can form covalent bond with histidine or other residues
nearby, to generate specific and permanent inhibition of SLOR1.

Hypothesis 10: Steroids with taurine tail. Because taurolithocholic acid has shown strong
inhibition of 3kPZS detection in the EOG assays, we selected its analogs with taurine tail and
high ROCS TanimotoCombo score with 3kZPS, to test the hypothesis that the steroids
containing a taurine tail can block 3kPZS detection.

	  

134	  

4.2.1.9 Activity cliff analysis

As mentioned in Chapter 3, activity cliff analysis24 enables us to find the essential functional
groups in molecules, in which slight chemical changes cause dramatic changes in the biological
activities. Activity cliff analysis involves 2 matrices: structural similarity (which we can measure
with ROCS and activity similarity (measured by EOG). To measure structural similarity for each
pair of compounds that was tested by EOG, they were first overlaid with 3kPZS (using the best
matching conformer of each relative to the docked conformer of 3kPZS), and the
TanimotoCombo score of that ROCS overlay was reported. The SALI activity cliff score was
then calculated using the following equation. The pairs with SALI score above 70 were analyzed.

SALI =

100 ∗ |activities  difference|
(2 − RocsCombo)

4.2.1.10 Functional group match fingerprint analysis

In order to find the relationship between structure and activity of the assayed 143
compounds as potential SLOR1 antagonists, functional group matchprint analysis was performed.
There are 3 steps to perform this analysis. Step 1: multiple conformers of the 143 compounds
were overlaid with 48 potentially active conformers of 3kPZS using ROCS. The top scored
conformers of the 143 compounds were selected. Then functional group matchprints were
generated by comparing the positions of the sulfate oxygens, sulfate ester oxygen, 3keto oxygen,
3-OH, 7-OH, 12-OH, 18-methyl and 19-methyl groups of the compounds with 3kPZS.
	  

135	  

Generation of the matchprints was based on the code developed by Santosh Gunturu. Step 2: the
matchprints of the top 6 compounds that suppressed EOG response of sea lamprey to 3kPZS
were extracted as references (Table 4.1).

	  

136	  

Table 4.1 The matchprints of the top 6 compounds that suppressed EOG response of sea lamprey
to 3kPZS.
Zinc ID

(0-3)

(0-1)

(0-1)

(0-1)

(0-1)

(0-1)

18-

19-

Sulfate

Sulfate

3-Keto

3-OH

7-OH

12-OH

Methyl

Methyl

Oxy

Ester

Oxy

ZINC72400307_28

3

0

0

1

1

1

1

1

ZINC35044325_22

3

0

0

1

0

0

1

1

ZINC04095893_61

2

0

0

0

0

0

2

1

ZINC72400309_95

3

1

0

0

0

0

1

1

ZINC12494532_16

3

1

0

0

0

0

0

0

ZINC01845398_1

3

1

0

0

0

0

0

0

Step 3: All matchprints were compared with the six matchprints, following the rationale that
compounds that can match the presence/absence of functional groups in the six most active
compounds are also likely to be active. By comparing the differences in the matchprints of other
compounds with the top 6 compounds, we can determine the functional groups whose
presence/absence results in enhance/reduction of biological activities. If an assayed compound
differed in the presence/absence of these functional groups by at most one position, relative to
the six most active compounds, then it was extracted for structure-activity analysis.

4.2.2 Experimental validation

Proposed antagonists were tested in Dr. Li’s lab by three assays: (1) electro-olfactograms,
which test the ability of sea lampreys to sense a compound in their olfactory epithelia, and
whether this compound competes with 3kPZS sensing25 and (2) behavorial tests including the
	  

137	  

two-choice maze with one channel containing the compound and the other channel containing a
blank and (3) in-stream assay, which is set up like the two-choice maze but in an actual flowing
stream with a more elaborate set of criteria for tracking sea lampreys’ responses to a compound.

4.2.2.1 EOG assays

The protocol of EOG assays25 was developed in Dr. Li’s lab and carried out by Dr. Mar
Huertas and Anne Scott. In EOG assays, female sea lampreys were anesthetized with 100 mg/L
MS-222 and placed in a Plexiglas V-shaped stand. Continuous aerated water containing 50 mg/L
MS-222 kept the gills irrigated and the fish anesthetized throughout the experiments. Then the
surface skin in the nose was removed to expose the olfactory lamellae. A small capillary tube
was used to deliver the chemical stimuli to the epithelial cells of olfactory rosette. The electrical
potential changes upon detection of the tested stimulus were recorded through two Ag/AgCl
electrodes (type EH-1S, World Precision Instruments, Sarasota, Florida, USA), which were
placed between two lamellae and adjusted to maximize the ratio of signal to noise. Then, the
signal was amplified and digitalized to analyze the ability of each stimulus to reduce the
detection of 3kPZS.

To test whether the prioritized compounds affect the detection of 3kPZS, a mixture of 10-6 M
3kPZS and 10-6 M concentration of a potential antagonist was exposed to the olfactory
epithelium for 4s. The EOG value of the charcoal filtered water was used as a blank control and
	  

138	  

subtracted from the EOG value upon exposure to the mixture, to normalize the response.
Between different stimuli, there was a 2 min interval, in which the olfactory epithelium was
flushed with charcoal filtered water. After measurement of three potential antagonists, the EOG
value of the blank control (charcoal filtered water), 10-6 M 3kPZS, and 10-5 M L-arginine were
recorded as well.

L-arginine was used as control to test whether the reduction of EOG response was caused by
blocking of 3kPZS detection or by a general suppression of olfactory detection. L-arginine is a
strong stimulus to sea lamprey,26 and is known to interact with the olfactory epithelium through
other mechanisms rather than by competing with 3kPZS.27 Pre-exposure to one of the two
stimuli will not influence the EOG response of the other.

	  

139	  

4.3 Results and discussion

4.3.1 Binding mode of 3kPZS in SLOR1 structural model

Based on structural modeling of the SLOR1 receptor for 3kPZS done by Prof. Kuhn, the
critical interactions between 3kPZS and residues at the binding sites are expected to include salt
bridges, H-bond and hydrophobic interactions as shown in Figure 4.3. Salt bridges are formed
between the sulfate tail of 3kPZS and protonated nitrogen atoms on the side chain of His110 in
SLOR1. Tyr203 also forms a hydrogen bond with the sulfate tail, and there is an additional
hydrogen bond between the Cys194 main chain and the 12-hydroxyl group in 3kPZS. The
hydrophobic steroid ring system of 3kPZS forms favourable hydrophobic interactions with the
hydrocarbon side chain groups from Phe87, Met106, Leu109, His110, Asp196, Pro277, Tyr280,
and Thr284 in the binding site. The docked binding mode is consistent with the cholate binding
mode

predicted

for

this

site

in

SLOR1

http://cholmine.bmb.msu.edu, see Chapter 2).28

	  

140	  

by

using

CholMine

(Figure

4.4,

Figure 4.3 Interactions between 3kPZS and SLOR1 predicted by homology modeling and
SLIDE docking performed by Dr. Leslie Kuhn and Qinghui Yuan. SLOR1 side-chain atoms and
the binding site surface are colored green for carbon atoms, blue for nitrogen, red for oxygen,
and yellow for sulfur. Carbon atoms of 3kPZS are shown in white tubes (center), with hydrogen
bonds and salt bridges to the receptor shown as yellow dashed lines. The sulfate ester moiety is
predicted to bind deep in the SLOR1 cleft (left), forming salt bridges with His110. The
methyl-group face of the steroid ring (bottom-center) interacts with an entirely hydrophobic face
of the cleft in SLOR1.

	  

141	  

Figure 4.4 Bile acid binding motif in SLOR1 identified based on conserved features of cholate
binding in a set of unrelated proteins (yellow), relative to the SLIDE docking orientation of
3kPZS (blue). The predicted binding orientation for cholate (horizontal molecule at center, with
carbon atoms in yellow tubes) substantially overlays with the docked 3kPZS molecule (blue
horizontal molecule), despite the bent (5-beta) cholate steroid ring in place of the relatively
planar (5-alpha) 3kPZS steroid. Their negatively charged sulfate tail groups are predicted in
highly similar positions (center-right). Side chains making key interactions with cholate in
cytochrome C oxidase (PDB entry: 2DYR) are shown below in yellow (Tyr, Phe, Trp, and His),
and SLOR1 side chains interacting with 3kPZS (Tyr, Leu, His) are shown in blue.

4.3.2 Electro-olfactograms (EOGs) assays identify antagonists for 3kPZS detection based on
candidates from high-throughput computational screening

To provide feedback regarding our hypotheses on features important for small molecules to
block the detection of 3kPZS, a histogram was created of the 143 assayed compounds as a
function of their percentage reduction in sea lampreys’ olfactory detection of 3kPZS (Figure 4.5).
The structures of 8 out of the 11 most effective compounds that inhibit 3kPZS detections by at
least 45% are noted on the histogram. The four most active compounds are sulfonated and have
steroid backbones. In addition, there are two drug-like molecules without steroid ring structures
	  

142	  

and two alkyl tail analogs that apparently mimic the sulfate group in 3kPZS. In addition to the 8
most effective compounds in Figure 4.5, there are additional 3 compounds that inhibit detection
of 3kPZS by at least 45%, including two steroid compounds, that is ZINC70666191 and
52205-73-9(CAS registry number), and one long alkyl chain compound ZINC1532179 with 12
carbon tail. However sulfonamides, such as the two non-steroidal compounds with ~0.5 activity,
are known to be pain-assay interference compounds.29

Similar to the two drug-like compounds on the upper-right corner of Figure 4.5, three more
recently assayed compounds, ZINC03531326 reduced 3kPZS detection by 43%, ZINC13790354
by 42%, and ZINC09227487 by 41%. These types of compounds all have alternative
heterocyclic and hydrophobic rings with different linkers instead of steroid structures.

	  

143	  

Figure 4.5 Histogram of the first 143 compounds according to their percent reduction in 3kPZS
olfaction by sea lampreys. Chemical structures and names are shown for the eight most active
compounds, which exhibit >45% reduction of 3kPZS response.

4.3.3 Structure-activity relationships analysis

4.3.3.1 SAR analysis based on SALI and functional group matchprint

We analyzed the structure-activity relationships for the EOG-assayed 143 compounds using
structure-activity landscape index (SALI), as mentioned in Chapter 3. The higher the SALI score,
the more significant an activity cliff there is. In the above equation, ROCS TanimotoCombo
	  

144	  

score is used to evaluate the similarities between the assayed compound, and to generate a
heatmap of a SALI landscape. The pairs of compounds with SALI scores above 70 were selected
(Figure 4.6), in which the pairs with only one functional group difference were analyzed (Figure
4.7). As shown in Figure 4.7(A), a pair of taurolithocholate analogs only differs in the presence
or absence of the 7-OH group. The compound without 7-OH was twice as active as the
compound with 7-OH. This phenomenon is consistent with 3kPZS docking results, in which
there are no obvious favorable interactions between SLOR1 and the 7-OH of 3kPZS (Figure 4.3,
with -OH group appearing near the Asn90 label). As shown in Figure 4.6 (B), a pair sulfate of
tail analogs with the same carbon chain length but different tail functional groups show that the
sulfate tail compound is 17% more active than the phosphate tail compounds. The more
negatively charged sulfate group may have stronger interactions with HIS 110 in the binding site
than the less negative phosphate group.

	  

145	  

Figure 4.6 (A) ROCS TanimotoCombo scores for the pairwise compounds with significant
activity cliffs with SALI score ≥ 70. (B) SALI scores for the pairwise compounds with
significant activity cliffs with SALI score ≥ 70.

(A)

(B)

Figure 4.7 (A) Compound without 7-OH group (in green) is twice as active (70% reduction in
3kPZS response) as compound with 7-OH (in blue; 35% reduction). Tail structure is same in
both. (B) Butane sulfate is 16% more active than butane phosphate.

In functional group matchprint analysis, we analyzed the pairs with only one functional
group difference at the positions of the sulfate oxygens, sulfate ester oxygen, 3-keto, 3-OH,
	  

146	  

7-OH, 12-OH, 18-methyl and 19-methyl groups in 3kPZS. A series of tail analogs with carbon
chains of various lengths (Figure 4.8) that differ by one functional group relative to another
compound were assayed by EOG and their structure-activity relationships are analyzed. The tail
analog with 4 carbons in the chain has the highest activity and the analogs with 8 and 12 carbons
have similar activities to the analogs with 4 carbons. The analogs with 5 and 6 carbons have low
activities, in which the one with branches has the worst performance. It is possible that aliphatic
chain can be used to substitute the steroid backbone. The angatonist’ activities fluctuate
according to the length of the carbon tail.

Figure 4.8 Assayed compounds with aliphatic tails. Shown in purple is the 3 carbon compound
(ZINC01587861) with 38% inhibition of EOG response of 3kPZS; Shown in red is the 4 carbon
compound (ZINC01845398) with 50% inhibition of EOG response; Shown in gray is the 5
carbon compound (ZINC01587862) with 32% inhibition of EOG response; Shown in cyan is the
6 carbon compound (ZINC01841381) with 31% inhibition of EOG response; Shown in orange is
6 carbon compound (ZINC01680379) with ethyl group, which inhibits EOG response by 18%;
Shown in green in the 8 carbon compound (ZINC14591952) with 48% inhibition of EOG
response. 0.52; Shown in yellow is the 12 carbon compound (ZINC01532179) with 46%
inhibition of EOG response.
	  

147	  

4.3.3.2 Other structure-relationship analysis

Six of the 11 most active compounds, which reduced the response to 3kPZS by 45-100%,
had steroidal substructures. Both 3kPZS and the antagonist candidates were tested at 10-6 M
concentration in the initial EOG assays. Surprisingly, several of these antagonists had none of the
canonical hydroxyl groups on the steroid ring system. However, the three most active
compounds, including PZS (the 3-OH analog of 3kPZS), which nullified the response to 3kPZS
by 92%, all had 3-hydroxyl groups in place of the 3-keto group present in 3kPZS. This was a
valuable discovery, because previous data from the Li lab17 indicated that only 3kPZS could
activate the SLOR1 receptor, not PZS, suggesting that PZS did not bind to SLOR1. Our
hypothesis is that PZS successfully competes with 3kPZS for binding to SLOR1, which leads to
its antagonist activity. We aim to test this by developing a receptor-based ligand-binding assay in
collaboration with Prof. Rick Neubig (Chair, Pharmacology & Toxicology, MSU). Such an assay
will also facilitate structure-activity relationship analysis (how antagonist side groups influence
SLOR1 activation or inhibition) and structure-based antagonist optimization for this pheromone
receptor.

Six of the 11 most active compounds mimic 3kPZS by matching the C and D steroid rings
and C24-sulfate group. This result shows that the 3-keto oxygen and steroid ring system in
3kPZS can be removed or substituted by other functional groups. Because sulfate tails exist in
both steroidal and non-steroidal compounds, it is considered as an indispensable functional group
	  

148	  

in effective antagonists. As shown in Figure 4.8, one of the simplest compounds with sulfate tail,
1-butane sulfonate can reduces EOG response of 3kPZS by 51%.

	  

149	  

4.4 Conclusion

Antagonists that inhibit 3kPZS detection can potentially hinder the mating process of sea
lamprey by blocking the ability of female sea lamprey to detect this pheromone, and aid in
controlling lamprey population. Based on this rationale, we developed an effective and efficient
antagonist discovery pipeline based on the hypothesis of overall volumetric and electrostatic
mimicry of 3kPZS and its important functional groups for binding to SLOR1. Through this
pipeline, ~300 potential antagonists were prioritized from a screening library of compounds, of
which 143 compounds were tested in EOG assays. Of the 143 compounds, 11 compounds that
inhibit 3kPZS at least 45% were identified. Three compounds, including PZS, taurolithocholic
acid (TLC) and tetrasulfonated-PZ (tetra-PZS), were shown to be behavioral antagonists in the
two-choice maze and PZS was shown to be behavioral antagonist in both maze and stream tests.
It is most interesting that PZS, whose structure differs from 3kPZS only at the 3-position in the
steroid backbone, acts as an effective antagonist to neutralize or repel the attraction of 3kPZS to
female sea lamprey in the behavioral tests. The other two steroid compounds including TLC and
tetra-PZS, are shown to repel female sea lamprey significantly at low concentration as well. The
three repellents or neutralizers are being considered in combination with other strategies for
effective sea lamprey control.

The importance of the sulfate tail, 3-keto group and 7-OH group were revealed through
structure-activity analysis. The compounds with more negatively charged tail groups have higher
	  

150	  

activities than less negatively charged compounds. Substitution of the 3-keto group with a
3-hydroxyl group switches the activities of compounds from agonist to antagonist. The 7-OH
group attenuates the inhibition activities of compounds, which is consistent with the fact that
7-OH is predicted to have no direct favorable interactions with the binding site of SLOR1 shown
in the docking results. In the 11 most active compounds, almost half of the compounds are
non-steroidal hydrophobic structures with sulfate group, which block detection of 3kPZS by at
least 45%. The presence of a terminal sulfate group in all the active compounds suggests that it
is an important determinant for activity. The non-steroidal hydrophobic backbones can replace
the steroid ring while still keeping the antagonist activities of these compounds; however,
optimization of these compounds to attain activity similar to steroids would be needed.

In addition to 3kPZS, there are additional two male sea lamprey mating pheromones
discovered, including DKPES and PAMS-24. DKPES has a similar behavioral effect as 3kPZS,
which can guide females at close range to the nesting area. PAMS-24 serves as a male territorial
pheromone, which repels mature males from nest boundaries. In the future, we will apply the
antagonist discovery-screening pipeline to identify potential antagonists that mimic DKPES and
PAMS-24. The identified compounds can be combined with 3kPZS antagonist to reach the
highest effect of repelling or causing sea lamprey to not locate spawning grounds.

	  

151	  

REFERENCES

	  

152	  

REFERENCES

(1) Kitchen, D. B.; Decornez, H.; Furr, J. R.; Bajorath, J. Docking and Scoring in Virtual
Screening for Drug Discovery: Methods and Applications. Nat Rev Drug Discov. 2004, 3,
935-949.
(2) Doman, T. N.; McGovern SL; Witherbee, B. J.; Kasten, T. P.; Kurumbail, R.; Stallings, W.
C.; Connolly, D.T.; Shoichet, B. K. Molecular docking and high-throughput screening for novel
inhibitors of protein tyrosine phosphatase-1B. J. Med. Chem. 2002, 45, 2213–2221.
(3) Zarzycka, B.; Seijkens, T.; Nabuurs, S. B.; Ritschel, T.; Grommes, J.; Soehnlein, O.;
Schrijver, R.; van Tiel, C. M.; Hackeng, T. M.; Weber, C.; Giehler, F.; Kieser, A.; Lutgens, E.;
Vriend, G.; Nicolaes, G. A. F. Discovery of Small Molecule CD40−TRAF6 Inhibitors. J. Chem.
Inf. Model., 2015, 55, 294–307.
(4) Kitchell, J. F. The Scope for Mortality Caused by Sea Lamprey. Transactions of the
American Fisheries Society 1990, 119, 642-648.
(5) Bergstedt, R. A.; Schneider, C.P. Assessment of Sea Lamprey (Petromyzon-Marinus)
Predation by Recovery of Dead Lake Trout (Salvelinus-Namaycush) from Lake-Ontario,
1982-85. Canadian Journal of Fisheries and Aquatic Sciences 1988, 45, 1406-1410.
(6) Buchinger, T.J.; Wang, H.; Li, W.; Johnson, N.S. Evidence for a receiver bias underlying
female preference for a male mating pheromone in sea lamprey. Proceedings of the Royal
Society B 2013, 280, 1771.
(7) Boogaard, M. A.; Bills, T. D.; Johnson, D. A. Acute toxicity of TFM and a
TFM/niclosamide mixture to selected species of fish, including lake sturgeon (Acipenser
fulvescens) and mudpuppies (Necturus maculosus), in laboratory and field exposures. J. Great
Lakes Research 2003, 29, 529-541.
(8) Li, W.; Scott, A.P.; Siefkes, M.J.; Yan, H.G.; Liu, Q.; Yun, S.S.; Gage, D.A. Bile acid
secreted by male sea lamprey that acts as a sex pheromone. Science 2002, 296, 138-141.
(9) Kain, P.; Boyle, S.M.; Tharadra, S.K.; Guda, T.; Pham, C.; Dahanukar, A.; Ray, A. Odour
receptors and neurons for DEET and new insect repellents. Nature 2013, 502, 507–512.
(10) Lundstrom, K. An overview on GPCRs and drug discovery: structure-based drug design
	  

153	  

and structural biology on GPCRs. Methods Mol Biol. 2009, 552, 51-66.
(11) Lin, S. H.; Civelli, O. Orphan G protein-coupled receptors: Targets for new therapeutic
interventions. Ann. Med. 2004, 36, 204-214.
(12) Katritch, V.; Cherezov, V.; Stevens, R. C. Diversity and modularity of G protein-coupled
receptor structures. Trends Pharmaco. Sci. 2012, 33, 17-27.
(13) Eswar, N., John, B.; Mirkovic, N.; Fiser, A.; Ilyin, V. A.; Pieper, U.; Stuart, A. C.;
Marti-Renom, M. A.; Madhusudhan, M. S.; Yerkovich, B.; Sali, A. Tools for Comparative
Protein Structure Modeling and Analysis, Nucleic Acids Research, 2003, 31, 3375–3380.
(14) Warne, T.; Serrano-Vega, M. J.; Baker, J. G.; Moukhametzianov, R.; Edwards, P. C.;
Henderson, R.; Leslie, A. G.; Tate, C. G.; Schertler, G. F. Structure of the Beta1-Adrenergic G
Protein-Coupled Receptor, Nature 2008, 454, 486-491.
(15) Sander, C.; Schneider, R. Database of homology-derived protein structures and the
structural meaning of sequence alignment, Proteins 1991, 9, 56-68.
(16) Zavodszky, M. I.; Sanschagrin, P. C.; Korde, R. S.; Kuhn, L. A. Distilling the essential
features of a protein surface for improving protein-ligand docking, scoring, and virtual screening.
J. Comput. Aided Mol. Des. 2002, 16, 883-902.
(17) Lischka, F.; Kuhn, L. A.; Libants, S.; Wu, H.; Yuan, Q.; Teeter, J.; Li, W. Deorphanization
of Olfactory and Vomeronasal Receptors that Respond Potently to a Vertebrate Pheromone,
2014, submitted.
(18) Irwin, J. J.; Shoichet, B.K. ZINC - A free database of commercially available compounds
for virtual screening, J. Chem. Inf. Model. 2005, 45, 177-182.
(19) Lipinski, C.A. Drug-like properties and the causes of poor solubility and poor permeability.
J Pharmacol Toxicol Methods, 2000, 44, 235-249.
(20) Gatica, E.A.; Cavasotto, C.N. Ligand and Decoy Sets for Docking to G Protein-Coupled
Receptors, J. Chem Inf. Model. 2012, 52, 1-6.
(21) Hawkins, P. C.; Skillman, A. G.; Warren, G. L.; Ellingson, B. A.; Stahl, M. T. Conformer
Generation with OMEGA: Algorithm and Validation Using High Quality Structures from the
Protein Databank and Cambridge Structural Database. J. Chem. Inf. Model. 2010, 50, 572-584.
(22) Hawkins, P. C.; Skillman, A. G.; Nicholls, A. Comparison of Shape-Matching and Docking
	  

154	  

as Virtual Screening Tools. J. Med. Chem. 2007, 50, 74-82.
(23) Chen, G.; Heim, A.; Riether, D.; Yee, D.; Milgrom, Y.; Gawinowicz, M.A.; Sames, D.
Reactivity of Functional Groups on the Protein Surface: Development of Epoxide Probes for
Protein Labeling, J. Am. Chem. Soc. 2003, 125, 8130-8133.
(24) Guha, R.; Van Drie, J. H. Structure−Activity Landscape Index:   Identifying and
Quantifying Activity Cliffs. J. Chem. Inf. Model. 2008, 48, 646-658.
(25) Siefkes, M. J.; Scott, A. P.; Zielinski, B.; Yun, S. S.; Li, W. M., Male sea lampreys,
Petromyzon marinus L., excrete a sex pheromone from gill epithelia. Biology of Reproduction,
2003, 69, 125-132.
(26) Li, W.; Sorensen, P. W.; Gallaher, D. D. The Olfactory System of Migratory Adult Sea
Lamprey (Petromyzon-Marinus) Is Specifically and Acutely Sensitive to Unique Bile-Acids
Released by conspecific larvae. J Gen Physiol. 1995, 105, 569-587.
(27) Li, W.; Sorensen, P. W. Highly independent olfactory receptor sites for naturally occurring
bile acids in the sea lamprey, Petromyzon marinus. Journal of Comparative Physiology
a-Sensory Neural and Behavioral Physiology. 1997, 180, 429-438.
(28) Liu, N; Van Voorst, J; Johnston, J. B.; Kuhn, L. A. CholMine: Determinants and Prediction
of Cholesterol and Cholate Binding Across Nonhomologous Protein Structures. J. Chem. Inf.
Model. 2015, 55, 747–759.
(29) Baell, J.B.; Holloway, G.A. New substructure filters for removal of pan assay interference
compounds (PAINs) from screening libraries and for their exclusion in bioassays, J. Med. Chem.
2010, 53, 2719–2740.

	  

155	  

Chapter 5 Conclusions and future directions

	  

156	  

In this thesis, three aspects to predict ligand binding were presented, including from aspects
of protein similarity, ligand similarity and protein-ligand interaction energy.

Given only protein information, three-dimensional ligand binding motifs, particularly for
cholesterol and cholate, were extracted from non-homologous proteins and CholMine, an online
server was built for public usage purpose. Three-dimensional motifs generalize the characteristic
of specific ligand binding across diverse protein families and show stronger prediction ability
than sequence motifs. This method deciphers the determinants of specific ligand binding only
from protein information across different protein families, which has advantages over the other
3-dimensional ligand binding prediction methods which need to incorporate ligand information.
This method can be used to find off-target proteins that are likely to bind to cholesterol/cholate
and provide guidance on the design of compounds that mimic the biological activities of
cholesterol/cholate. Since this method has shown good performance on the prediction of sites for
hydrophobic ligand such as cholesterol and cholate, in the future we can apply this method to
prediction of binding sites of hydrophilic compounds such as compounds containing adenine and
pteridine and show its generality. Preliminary results already suggest that this method can
automatically detect interaction submotifs for a ligand with distinct binding motifs to different
proteins. From the submotifs detected, the evolutionary relationship of the proteins binding to the
same ligand could be analyzed.

Given the protein structure and a series of substituted compounds, the differences in
	  

157	  

biological activities of the compounds with same backbone but different substituents can be
explained partially through protein-ligand interaction energy analysis, as we have shown in the
analysis of PaPAM interactions with α-arylalanines. The residues at the binding site that
contribute to the differences in biological activities were identified, and mutations at these sites
were suggested to improve catalytic efficiency of the enzyme.

Finally, given only protein sequence and a native ligand structure, compounds that mimic
native ligands for inhibition of a target protein can be identified using a hypothesis-driven
inhibitor discovery screening pipeline. The hypothesis of the pipeline is that compounds that
mimic the volumetric and electrostatic properties of the native ligand and match the functional
group side chains of the native ligands that are important for the specific binding can compete
with the native ligand for binding. In the future, this pipeline can be used for inhibitor discovery
based on additional known pheromone ligand structures such as DKPES and PAMS-24 and
facilitate the inhibitor discovery process.

	  

158