'- LIBRARY
Michigan State
L University

This is to certify that the
thesis entitled

TWILIGHT ZONE.
PROTEIN SEQUENCE SEARCH AND CLASSIFICATION

presented by
Jiarong Guo

has been accepted towards fulﬁllment
of the requirements for the

MS. degree in Fisheries and Wildlife

 

 

\\

ﬂ.“ \

., \ < "I,” - b

’ ‘\ VMajor Professor’s Signaturé’J>

O %c\éx& SD) TRQUQ

Date

 

MSU is an Afﬁn'native Action/Equal Opportunity Employer

 

PLACE IN RETURN BOX to remove this checkout from your record.
To AVOID FINES return on or before date due.
MAY BE RECALLED with earlier due date if requested.

 

DATE DUE DATE DUE DATE DUE

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

5108 K'IProilecaPres/ClRC/DateDue.indd

 

TWILIGHT ZONE:
PROTEIN SEQUENCE SEARCH AND CLASSIFICATION

By

Jiarong Guo

A THESIS

Submitted to
Michigan State University
in partial fulfillment of requirements
for degree of

MASTER OF SCIENCE

Fisheries and Wildlife

2010

ABSTRACT

TWILIGHT ZONE: PROTEIN SEQUENCE SEARCH AND CLASSIFICATION
By
Jiarong Guo

Homology search is important for gene annotation and classiﬁcation. The emergence of
Next Generation sequencing techniques and metagenomics give homology search new
challenges. In this study, I evaluate the commonly used homology search tools: BLAST
and HMMER to ﬁnd new genes (nirK) in our metagenomic data and classify closely
related genes (amoA and pmoA in UniProtKB). BLAST false positive problem is found
when comparing BLAST and HMMER searching results in metagenomic data.
Furthermore, I also describe methods which use phylogenetic trees to reﬁne the results of
common homology search methods. We evaluated these methods with the above genes,
and ﬁnd narrow phylogenetic sampling of genes (nirK) will limit sequence annotation

and some genes (amoA and pmoA) are probably misclassiﬁed.

Keywords: homology search; BLAST false positive; phylogenetic tree; sequence

misclassiﬁcation

ACKNOWLEDGEMENT

Through the past two years, I have got support from many people. Without their help,

none of this work would have happened.

First, I want to thank my advisor C. Titus Brown. As a young professor, he is
exceptionally good at advising students. His intellectual insights and scientiﬁc training

helped me greatly throughout my research.

Second, I want to thank Adina Howe. She helped me a lot revising my thesis and also

gave me great support as a ﬁ'iend.

My committee members, Joan B. Rose and Weiming Li, gave me many advices in the

committee meeting. I am very grateful to them.

Everyone in C. Titus Brown’s Lab helped me revising my thesis or presentation. Thank

you all.

I also want to thank Jiaguo Qi and R. Jan Stevenson, two professors in ZHEJIANG

UNIVERSITY and MSU program. They helped me a lot during my ﬁrst year in MSU.

At last but by no means least, I want to thank my parents. They are the two always

supporting me.

iii

TABLE OF CONTENT

LIST OF TABLES ............................................................................................................. vi
LIST OF FIGURES .......................................................................................................... Vii
Chapter 1 ............................................................................................................................. 1
Background ......................................................................................................................... 1
1.2 Pairwise comparison: BLAST .............................................................................................. 2
1.3 Proﬁle method: PSI-BLAST, HMMER, and Pfam database ................................................ S
1.4 Objective of research ............................................................................................................ 8
Chapter 2 ............................................................................................................................. 9
Comparing results of BLAST and HMMER nitrite reductase genes (nirK) search ........... 9
2.1 Introduction: ......................................................................................................................... 9
2.2 Methods: ............................................................................................................................. 10
2.2.1 Samples/Metagenomic Datasets: ................................................................................. 10
2.2.2 Data Analysis: BLAST and HMMER comparison ..................................................... 11
2.3 Results and conclusion: ...................................................................................................... 14
Chapter 3 ........................................................................................................................... 20
Using phylogenetic structure for reﬁning genes identiﬁcations ....................................... 20
3.1 Introduction: ....................................................................................................................... 20
3.2 Methods: ............................................................................................................................. 26
Using the tree squeezing method to ﬁnd more m’rK genes: ...................................................... 26
3.3 Results and Conclusion: ...................................................................................................... 28
Chapter 4 ........................................................................................................................... 31
Classiﬁcation of amoA and pmoA ..................................................................................... 31
4.1 Introduction: ....................................................................................................................... 31
4.2 Data: ................................................................................................................................... 32
4.3 Methods: ............................................................................................................................. 33
4.3.1 BLAST and HMMER comparison (Figure 19) ........................................................... 33
4.3.2 Building the tree .......................................................................................................... 34
4.4 Results: ............................................................................................................................... 35
4.4.1 The use of BLAST and HMMER to differentiate between amoAs and pmoAs in
UniprotKB ............................................................................................................................ 35
4.4.2 Using the tree clustering method to differentiate amoAs and pmoAs ........................... 39
4.4.2 Using low nodes and high nodes for tree clustering .................................................... 42
4.4.3 Investigating sequences identiﬁed by tree clustering that disagreed with UniProt
Annotations ........................................................................................................................... 46
4.5 Conclusion: ......................................................................................................................... 50
Chapter 5 ........................................................................................................................... 52
Conclusions and future work ............................................................................................ 52
5.1 Conclusions: ....................................................................................................................... 52
5.1.1 Biology is increasingly reliant on sequence similarity with the use of new sequencing
technology, and sequence similarity may not give accurate gene annotations. .................... 52

iv

5.1.2 Annotations are sometimes incorrect. .......................................................................... 52

5.1.3 Narrow phylogenetic sampling is a problem. .............................................................. 53

5.2 Novel approaches: ........ , ..................................................................................................... 53
5.3 Future directions: ................................................................................................................ 53
5.4 Parting Thoughts: ................................................................................................................ 54
Bibliography ..................................................................................................................... 55

LIST OF TABLES

Table l. BLAST F P (False Positive Rate) at different E-value cutoffs. ........................... 17

Table 2. Numbers of sequences classiﬁed by various methods. ....................................... 45

vi

LIST OF FIGURES

Figure 1. Simple examples of global and local alignment. ................................................. 3
Figure 2. The BLAST algorithm ......................................................................................... 4
Figure 3. Simple illustration of proﬁle search. ................................................................... 6
Figure 4. Nitrogen cycling ................................................................................................ 10
Figure 5. Eight datasets sequenced by roche 454 pyrosequencing. .................................. 11
Figure 6. BLAST and HMMER comparison processes in metagenomic dataset. ............ 12
Figure 7. BLAST hits and domain mapping. .................................................................... 13
Figure 8. BLAST and HMMER search comparison ......................................................... 14
Figure 9. Distribution of identities in BLAST search result of six well known nirKs
against datasetl and dataset2. ........................................................................................... 15
Figure 10. All BLAST hits matched back to a query nirK. .............................................. 16
Figure 11. E-value distribution of BLAST hits matched to domain region (blue) and a
subset that were also found by HMMER domain search (green). .................................... 18
Figure 12. Simple illustration of neighbor-joining process. ............................................. 22

Figure 13. A hypothetical sub-tree of tree of whole protein space with cluster F1, A and
F2 in a speciﬁc arrangement. ............................................................................................ 24

Figure 14. A hypothetical sub-tree of tree of whole protein space with cluster F l , F2 and
A in a speciﬁc arrangement. ............................................................................................. 25

Figure 15. E-value distribution of 3055 nirKs from F unGene (blue) and 32 sequences
from tree squeezing method (green) ................................................................................. 27

Figure 16. Deﬁnition of tree squeezing method. .............................................................. 28

Figure 17. A hypothesis explaining why only small number of sequences is squeezed. .29

Figure 18. Phylogenetic tree of all 3055 F unGene nirK sequences .................................. 30
Figure 19. BLAST and HMMER comparison flow chart ................................................. 34
Figure 20. Flow chart of constructing phylogenetic tree of amoAs and pmoAs. .............. 35
Figure 21. E-value distribution of amoA BLAST hits. ..................................................... 36
Figure 22. E-value distribution of amoA HMMER hits .................................................... 37
Figure 23. E-value distribution of pmoA BLAST hits. ..................................................... 38

vii

Figure 24. E-value distribution of pmoA HMMER hits .................................................... 39

Figure 25. Tree of known amoAs and pmoAs. .................................................................. 41
Figure 26. Trees of all amoAs and pmoAs. ....................................................................... 42
Figure 27Simple example trees for super node analysis ................................................... 44

Figure 28. Alignments of two correctly classiﬁed amoA 5, two wrongly classiﬁed pmoAs
found by low node method, two well known amoAs and two well known pmoAs .......... 47

Figure 29. Closeups of two possibly misclassiﬁed pmoAs in tree of all amoA and pmoA
(Al and B1) and two known amoAs (arrow) that help classify the two possibly
misclassiﬁed pmoAs in tree of all known amoAs and pmoAs. .......................................... 49

NOTE: Images in the thesis are presented in color.

viii

Chapter 1

Background

1.1 Importance of homology search

Homologous proteins are proteins that are derived from a common ancestor. They often
have similar sequences, structures, and functions. Because homology is often inferred
from sequence similarity, accurate sequence annotations based on homology search is an
important area in biology. Most protein sequences are annotated by sequence similarity,
and only a small number of sequences are biologically veriﬁed by molecular experiments.
UniProtKB is a large database of protein sequences. Within the UniProtKB database, the
biologically veriﬁed sequences and those sequences that can be annotated signiﬁcantly
by homology searches are stored in the SwissProt database, which is maintained
manually. The other sequences, which are annotated by automatic sequence homology
searches, are stored in the TrEMBL database (Wu, preiler et al. 2006). The size of the
SwissProt database (518,415 protein sequences) is much smaller than the TrEMBL
database (11,397,958 protein sequences), indicating that the majority of current proteins
in the UniProtKB database are annotated by automatic homology searches. Problems
with homology searching algorithms may result in incorrect protein annotations

throughout the database.

The emergence of metagenomics presents a new challenge to using homology searches to
annotate and/or classify sequences. Metagenomics is the study of DNA recovered directly
from environmental samples, and it investigates the diversity and functional pathways

represented by microbial communities (Riesenfeld, Schloss et a1. 2004). Given the fact

that most microbes cannot be cultured in standard lab conditions, metagenomics is an
alternative approach to studying the microbial world. With the introduction of next
generation sequencing (N GS) techniques, metagenomic data have rapidly accumulated in
the last ﬁve years. The resulting large number of sequences and their characteristic short

lengths present challenges to sequence annotation.

The most popular homology searching methods use pairwise comparisons or proﬁle

models (also known as position speciﬁc scoring matrices).

1.2 Pairwise comparison: BLAST

Pairwise comparison is used to ﬁnd the best matching local or global alignment of two
sequences. Global alignment tries to align every position in both sequences and is more
effective for similar sequences of roughly equal size. It also has the advantage of being
able to align multiple sequences. Local alignment, on the other hand, compares all
segments of all possible lengths and attempts to align the similar sequence regions
together (Figure 1). These sequence regions could be sequence motifs, which are deﬁned
as amino-acid or nucleotide patterns that might have a biological signiﬁcance. It is more
useful to detect these similar sequence motifs within their larger dissimilar sequence
context. The Needleman-Wunsch algorithm is a well-known method for global sequence
alignment, while the Smith-Waterman algorithm is famous for performing local
alignment. The Smith-Waterman algorithm is more sensitive at ﬁnding remote
homologous sequences, but is not fast enough for large database homology search

(Needleman and Wunsch 1970; Smith and Waterman 1981; Altschul, Gish et al. 1990).

Global i r 1 ' . . .
Seq2

 

Seql
Local

 

Seq2

Figure 1. Simple examples of global and local alignment. Global alignment aligns
sequences that have similar size and are generally similar over entire length. Local
alignment is good at ﬁnding small similar regions (including inversions) within whole

sequences that global alignment is not able to detect.

BLAST (Altschul, Madden et al. 1997) is the most popular bioinformatics tool for
searching homology of DNA or protein sequence in big databases. It performs a local
alignment of a query sequence against a sequence in database. BLAST, by using a
heuristic method called the word method (Figure 2), is able to perform pairwise
alignments with signiﬁcantly improved speed and efﬁciency without losing much
accuracy (Altschul, Gish et al. 1990). BLAST efﬁciently searches for matches between a
query and a reference by ﬁnding the hot spots of high scoring words in database
sequences. It avoids aligning the query sequence with a large portion of database
sequences that have no signiﬁcant matches. In the word method, non-overlapping k-letter
subsequences (words) in the query sequences are listed. By default k, k is 3 for protein
and 11 for DNA. All k-letter words (20k for protein and 4k for DNA in total) are aligned

with words in the list and scored with a substitution matrix. The high scoring words

(neighborhood words) that score above a threshold T are collected. These neighborhood
words are the ‘queries’ actually searched in the database. When an exact match of one
word is found, the alignment is extended in both directions to ﬁnd the ungapped
alignment with the highest score, higher than the bit score threshold. These ungapped

alignments, called HSP (high scoring pair), are reported by BLAST.

The BLAST Search Algorithm

query word (W = 3)
Step1 Query: TGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFV

P00 18
PEG 15
Step2 noig'iborhood :28 '1:
Egg :3 neighborhood
PMG 13 scorottroshoid
P80 13 (7" 13)

 

 

 

 

PQA l 2
PQN 12
etc...
Step3 e - :

Query: BZSSLAALLNKCKTI’QGQRLVNQWIKQPLMDKNRIEERLNLVEABGS
+L.-\++L~v- TP ('r R-+ +\\'+ P+ I) + ER + A
Subject: 29OTL.»\SVLDCTVTI’MUSRMLKRWLIIMPVRDTRVLLERQQTIGA330

High-scoring Segment Pair (HSP)
Figure 2. The BLAST algorithm. BLAST applies a heuristic search method which ﬁnds
k-letter words (default = 3 in blastp) scoring at T (neighborhood score threshold) when
aligned to a speciﬁc k-letter word in the query and scored with a substitution matrix.
Words scoring above T (neighborhood words) are searched in database and then extended

in both directions until the score starts to decrease. The resulting locally optimal scoring

alignment is called HSP (high scoring pair) and later reported by BLAST if it has a score

higher than S (BLAST score cutoﬂ) or E-value lower than a speciﬁed cutoff.

When pairwise sequence identity is high, putative homology and non-homology can be
distinguished by pairwise comparison methods including BLAST. However, pairwise
comparisons perform poorly on sequence alignments when the pairwise sequence identity
goes down to 20-3 5% (for protein sequence), the twilight zone of protein sequence
alignment (Rost 1999). The best scoring pairwise alignment becomes uncertain when
aligning two remote sequences with low identity. Thus, pairwise comparison performs
poorly at ﬁnding remote homologies in the twilight zone. It is not clear whether the
difficulty of using alignment methods in this zone is caused by a technical problem (i.e.,

detection of statistical signiﬁcance) or is a special feature of evolution (Rost 1999).

1.3 Profile method: PSI-BLAST, HMMER, and Pfam database

The proﬁle method, also called position speciﬁc scoring method, is another method
popularly used for homology searches in sequence databases. A proﬁle is a probability
model, made from a multiple sequence alignment of several homologous sequences
(usually from the same protein family). In the proﬁle, a position-speciﬁc scoring system
for insertions, deletions and substitutions is assigned, which shows that some columns of
the multiple sequence alignment are more conserved for several amino acids (or
nucleotides) while some are more open to gaps or indels (Figure 3). The proﬁle
represents conserved or non-conserved sequence information of the multiple sequence
alignment. Compared to the Smith-Waterman algorithm, BLAST, or other traditional

pairwise alignment methods that use position-independent scoring parameters (like

BLOSUM62), proﬁle methods are more precise and more sensitive to detect remote
homology (distantly related sequences). The use of proﬁles is more effective at capturing
important domains or motifs that are conserved in all query sequences (Gribskov,

McLachlan et al. 1987).

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

MSA Profile _ P . .

ﬂ ,__........_..__._.__._. ____ A: 081110118
Sl:AGCCTAT L 1234507 Profilewith
SZ=CAC CAGT §A 2 2 2 0 2 1 1 transform relative frequency
S3:TCACTGT—_'§T 110 0 3 0 4 _—’ matrlxorlog
S-izATACTGA 1c 1135000 Iikelihoodmatrix
SS:GACCAGT 301100040 /

Query
Profile

Figure 3. Simple illustration of proﬁle search. Proﬁle with absolute frequency at each
position (column) is made from MSA (multiple sequence alignment). The proﬁle is
transformed into a new one with relative frequency matrix or log likelihood matrix. The
new proﬁle is aligned with every 7 base pair region of query sequence. The regions with

scores above cutoff are reported.

Proﬁle methods were introduced in the late 19805. Similar methods like “ﬂexible patterns”
(Barton 1990), “templates” (Bashford, Chothia et al. 1987) were introduced at the same
time. One such example is PSI-BLAST which implements the proﬁle method in the
BLAST2 package. This program combines pairwise local alignment and the proﬁle

method. The ﬁrst round BLAST results are integrated into a proﬁle sequence, then the

proﬁle sequence is used as query in a second round search, and new hits are added to the
proﬁle for the next round search. PSI-BLAST is much more sensitive than standard
protein-protein BLAST in picking up evolutionarily distant homology (Altschul, Madden
et al. 1997). Recently, PSI-BLAST has also been reported to have a homologous over-
extension problem (where it picks up many non-homologies) because the noise
introduced by one non-homology can be ampliﬁed through iteration (Gonzalez and

Pearson 2010).

In this study, I choose HMMER (Eddy 1998) as a representative tool implementing the
proﬁle method. The basic advantage of HMMER is that it uses a formal probabilistic
basis called the “hidden Markov model” to guide how all the parameters in a position-
speciﬁc scoring system should be set. The HMMER proﬁles constructed are called
“Proﬁle Hidden Markov Models” (proﬁle HMMs), which are statistical models of

multiple sequence alignment or even single sequences.

HMMER has a consistent theory for setting parameters in proﬁle HMMS so it is
applicable to a large database of proﬁle HMMs and large scale sequence analysis. The

' Pfam database is a large collection of protein domains represented by proﬁle HMMs. It is
an important part of the Interpro annotation system and is one of the most popular

databases for protein sequence annotation and analysis.

In the past, proﬁle HMM methods were computationally expensive. HMMER2 is about
100x slower than comparable BLAST searches. A new version of HMMER, HMMERB,
combines the power of using a probabilistic model with high computational speed, is now

essentially the same speed as BLAST (Eddy 2009).

HMMER also has the function (hmmalign) to align multiple sequences to a proﬁle HMM.
The hmmalign gives highly reproducible and high quality alignments when proﬁle seeds
are well selected because all the sequences are aligned to the proﬁle HMM. It is also
faster than pair-wise alignment methods like MUSCLE and CLUSTALW because
sequences are only aligned once each to the proﬁle HMM (Wu and Eisen 2008). Thus,
this is a good alignment option for phylogenetic analysis of sequences from the same

family.

The shortcoming of the proﬁle method is that the choice of seed sequence for proﬁle is
critical (garbage in garbage out) and the method is affected by the quality of underlying

MSA (Madera and Gough 2002).

1.4 Objective of research

In this study, I evaluate the above described protein annotation methods (pairwise
comparison implemented by BLAST and proﬁle method implemented by HMMER) to
ﬁnd and classify proteins in metagenomic datasets. Our evaluations included: (1)
comparing the usage of BLAST and HMMER to ﬁnd new environmentally-relevant
genes (nirK) in a soil metagenomic data and (2) the ability of BLAST and HMMER
packages to classify closely related genes (pmoA and amoA) in the UniProt protein
database. F urtherrnore, I also describe a method which uses phylogenetic trees to identify
sequence homology. I evaluated this method with the above genes and compared our

results to other protein annotation methods (HMMER).

Chapter 2

Comparing results of BLAST and HMMER nitrite reductase

genes (nirK) search

2.1 Introduction:

Nitrite reductase genes are a crucial part of global nitrogen cycle. NirK is such a gene and
is involved in denitriﬁcation. Denitriﬁcation is a microbe-facilitated process that uses
oxidized nitrogen as an alternative electron acceptor to produce energy in environments
where oxygen is limited. The end product of denitriﬁcation is molecular nitrogen (N2),
which may be released back to the atmosphere. Denitriﬁcation is also useful in
wastewater treatment and bioremediation, though it also causes the emission of a
greenhouse gas which may damage the ozone layer and/or cause nutrient loss in
agriculture (Tiedje 1988). Nitrite reductase catalyzes the reduction of nitrite to nitric

oxide (N O). NirK is a copper based metalloprotein.

Nitrification

  
        

 

 

  
 
  

N0,"
N2
tilt-1‘2 groups A
of protein <,\ " ”hi, Nitrogen
00“ "I 1' 1'
"0/3 _ “1““ o no Ion
~:-_3> {5" 0!“
(M TET Anoxk
groups /\\ation:
“02- oTprotein C- 6‘“.
‘mm‘m‘ Nitrogen
. fixation

I‘III'

Donitriticotion

Figure 4. Nitrogen cycling (Madigan, Martinko et al. 2006).

2.2 Methods:

2.2.1 Samples/Metagenomic Datasets:

For our studies, I used 454 (Metzker 2010) sequenced metagenomic data of various soil
samples. A total of four soil samples were studied: two agricultural soil samples
(biological replicates) and two forest soil samples (biological replicates) from Kalamazoo,
MI (KBS LTER — Kellog Biological Station Long Term Ecological Research). For each
sample, DNA was extracted directly from the soil sample and sequenced with two 454
pyrosequencing runs (Figure 5). The average resulting sequence read length was about
250 base pairs, much shorter than read lengths generated by traditional Sanger

sequencing. I evaluated the ability to identify nirK genes using BLAST and HMMER.

10

, .7 Run1(datasetl)

 

7 Sample] T ‘ ‘ > Run2 (dataset2)

l: ] 7 Run1(dataset3)

 

3‘ Samplez g
> Run2 (dataset4)
_ 7 Run] (datasetS)
.7 Samplel _
‘ ‘ " > Run2 (dataset6)
. > . 7 Run] (dataset7)
Sample2 '_

> Run2 (dataset8)
Figure 5. Eight datasets sequenced by roche 454 pyrosequencing.

2.2.2 Data Analysis: BLAST and HMMER comparison

The DNA reads in the metagenomic data were translated into amino acid reads. Six well
known nirK sequences from the F unGene (functional gene pipeline & repository) website
were used to blast against the various soil metagenomic datasets using default parameters
(E-value cutoff was 10). Additionally, the six nirKs were also used to generate a proﬁle
HMM using HMMER3 and searched against our metagenomic data using default
parameters (E-value cutoff was 10) (Figure 6). I chose to use BLAST and HMMER
default parameters with a relatively relaxed E-value cutoff to include all possible real
nirK sequences. Subsequent ﬁltering steps could be used to ﬁlter out more false positive

hits (see Chapter3)

11

 

download

 

6 well
known
nirKs

 

    
 
 

clustalw

Multiple
alignment
format

   

hmmbuild
BLAST Against soil data

Profile
“M M

hmmsearch Against soil data

BLAST
Potential compare .
‘ a ntrK

nirKs
result

Figure 6. BLAST and HMMER comparison processes in metagenomic dataset.

To investigate the difference between BLAST and HMMER search results, I evaluated
which part of the query nirK sequence, domain region or non-domain region, blast hits
were matched to. Functional domains of m'rK (cu-oxidase, cu-oxidase_2, and cu-

oxidase_3) are shown on the FunGene website, and the multiple sequence alignment

12

(MSA) of these domains were downloaded from the Pfam database (stockholm format for
use with HMMER) and then built into HMMs of the domains. Domain regions of the
query seed sequence were located by searching the domain HMMs against the query seed
sequences at an E-value cutoff of 0.01. Then, BLAST hits were matched back to the
query seeds based on the start and end position of matches that are shown in the BLAST
output. If a BLAST hit had more than 20 amino acids matched to domain region of query,
I say it is a match to the domain region. If a query was not matched to the domain region,
I removed it from the BLAST and HMMER comparison (Figure 7). HMMER is
supposed to ﬁnd hits matched to the conserved region, and thus, by removing non-
domain BLAST hits, I can make the results from BLAST and HMMER comparable. I
assume that if both BLAST and HMMER identify a sequence as nirK, the annotation is
likely to be correct. If a sequence is matched by BLAST but not by HMMER domain

search, I consider it is a false positive.

Domain
Query ———————[ " “ 1

 

 

 

 

BLAST;
hits

 

 

 

Figure 7. BLAST hits and domain mapping. HMMER is good at ﬁnding conserved
regions (domain), so I are interested in BLAST hits matched to domain region (red lines)

in order to compare results from two tools.

13

2.3 Results and conclusion:

mOnly Only

 

datasetl dataset2

Figure 8. BLAST and HMMER search comparison. “BlastOnly” are the hits obtained
only by BLAST; “hmmOnly” are the hits obtained by HMMER only; “shared” are
overlap of BLAST and HMMER hits. Search results from both datasets show BLAST

gets more hits than HMMER search.

14

 

 

 

 

 

 

 

 

8*- ..................................................... ..; .................. .4

7g .................... 1

2 6 ................................................ ....................
= 3 :
E- : s

(b 5.. ............................................. .................... : ................... n
H ; '
G g .

W 4 .......................................... .1 .................... ; ....................
=3 i i
5' 3 s

3,. ......................................... 1 ........ . .......... - ................... 4

2 .. .. .................... g .................... 1

1 ........................... . . .lw. ................ ................... #
ll 7 L

0.6 0.8 10

.2 .
l__$5_o/‘U Identity
Twilight zone

Figure 9. Distribution of identities in BLAST search result of six well known nirKs

against datasetl and dataset2.

In BLAST and HMMER search comparison (Figure 9), BLAST has 29 hits while
HMMER has 8 hits from dataset, and BLAST has 36 hits while HMMER has 12 hits
from dataset2. I can see BLAST gets more hits than HMMER, and many BLAST hits are
in the twilight zone (20%-3 5% identity) (Figure 9), suggesting that there may be false
positive hits in the BLAST result. By matching the BLAST hits back to the query
sequence, I were able to see whether the hits are matched to a domain (functional) region
or non-domain (non-functional) region (Figure 10). The total number of false positive

hits using this method was then calculated (Table 1).

15

Query 1 r #—
_ _
—
—
— _
_ _
_

" BLAST hits —

 

—
_
—
_
_
—
_
—
— -

Figure 10. All BLAST hits matched back to a query nirK. Query] is a well known nirK

seed. The red bands show the domain region of queryl. BLAST hits are shown as black
(not found by HMMER domain search) and green (also found by HMMER domain

search).

For nirK searches in our metagenomic data, the F P rate does not change much when the

E-value cutoff gets lower (more stringent) (Table 1, Figure 11).

16

Table l. BLAST F P (False Positive Rate) at different E-value cutoffs.

 

 

 

 

 

 

 

 

 

 

 

Cutoff A B FP
10 I9 27 0.587
l 13 15 0.536
0.1 9 14 0.609
0.01 6 10 0.625
0.001 6 9 0.600
0.0001 5 9 0.643

 

A is BLAST hits matched to domain region and also found by HMMER domain search;

B is BLAST hits matched to domain region but not found by HMMER domain search. FP

= B/ (A+B).

17

 

BLAST E-value distribution for nirK
SO 1 T T x x 1

 

40.. ............. .............. .............. ............

 

Number of Sequences

 

 

 

 

 

 

 

 

logtE-value)

Figure 11. E—value distribution of BLAST hits matched to domain region (blue, A+B in
Table 1) and a subset that were also found by HMMER domain search (green, A in Table
1). BLAST hits matched to domain region (blue) has similar E-value distribution shape as
its subset that are also found by HMMER domain search, which indicates the BLAST

false positive rate (FP) does not change much when E-value cutoff is changed.

For this part of study, I found BLAST might give false positives by comparing BLAST
and HMMER nirK search result in metagenomic data. Many BLAST hits had their
alignment identity in the twilight zone of sequence alignment (20%-3 5%). When I treated
the overlap of BLAST hits and HMMER hits as true positive hits, the false positive rate

did not change much when E-value cutoff was changed. This indicated that two different

18

tools, BLAST and HMMER, might share some same characteristics, which needs further

study.

It is important to note that the signiﬁcance of hits, E-value, is not considered here, and
this should be studied more in the future. For example, if I have a BLAST hit matched to
the query’s domain region but the E-value is very high (>10), the hit is probably a false
positive. However, if a low (stringent) cutoff were chosen to reduce false positives,
remote homologs may be overlooked. When searching for novel genes or remote
homologs, picking a high E-value cutoff and then using other methods to ﬁlter for better

ones may be a good strategy.

19

Chapter 3

Using phylogenetic structure for refining genes identifications

3.1 Introduction:

Pairwise comparison and proﬁle methods are commonly used for homology evaluation,
but they have their weaknesses, i.e., BLAST has poor performance for sequences within
the twilight zone (20-3 5% sequence identity) (Rost 1999) and proﬁle methods rely
heavily on seed sequences for making a statistical proﬁle (Loewenstein and Linial 2008).
Moreover, homology searching annotation methods may give inaccurate functional
annotations for paralogous sequences (homology through gene duplication) or
orthologous sequences (homology through speciation) which may have different
functions (homolog with different functions). Comparative genomics studies have given
some useful methods to address these challenges, such as gene neighboring methods,
protein domain architectures, and tree clustering methods (Singh, Doerks et al. 2009).
Here, I focus on the tree clustering method to reﬁne gene identiﬁcation methods for more

accuracy.

Phylogenetic trees show evolutionary relationships among different species or molecular
sequences based on their phenotype or sequence similarity. Here I focus on molecular
sequence phylogenetic trees. There are two types of trees: rooted and unrooted. In a
rooted tree, a unique node represents the most common ancestor of all species or
molecular sequences at the leaves under this node. In an unrooted tree, the relatedness of

branch nodes and leaves is illustrated without assumptions about ancestry. The root can

20

be set at any part of the unrooted tree if no other information is provided. Using an
uncontroversial out-group is the most common approach to make a rooted tree. The out-
group should be distantly related to the sequence of interest. If too distant, it will add

noise to the phylogenetic analysis.

Three types of methods are commonly used to construct trees: distance matrix, maximum

parsimony, and maximum likelihood.

The distance matrix method calculates all possible pairwise distances from a multiple
sequence alignment. It includes neighbor-joining and UPGMA (Unweighted Pair Group
Method with Arithmetic mean). The former produces unrooted trees and does not assume
constant rate of evolution, while the latter produces rooted trees and does assume
constant rate of evolution. The distance matrix methods are generally computationally

fast and thus are good for large sequence data analysis (Felsenstein 2004; Mount 2004).

Neighbor-joining tree is the one used my thesis. Figure 12 shows the basic idea of

neighbor-joining process.

21

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

A
A 1 a c I D A
1. B 90% 1 j E
c 50% 50% 1 ;
D 60% 40% 70%l 1 D
‘1!
AB c D J A
2 AB 1 ‘ { B
c 50% 1 C
D 40% 70% 1 ‘2 D
v
‘ AB CD { A
3 3
AB 1 C
CD 40% 1 I D
W A
4. ABCD 3
ABCD 1 C
‘ 0

Figure 12. Simple illustration of neighbor-joining process. A, B, C and D stand for 4
sequences. Their pair wise sequence similarities are shown in tables. In every step, two

sequences with highest similarity are ﬁrst joined.

The maximum likelihood method is a more advanced method, which applies an explicit
model of evolution, such as Jukes—Cantor or generalized time-reversible (GTR) models of
nucleotide evolution and the JTT (Jones-Taylor-Thornton) model of amino acid evolution,
to tree estimation. It is statistically well founded, but computationally much more
expensive than the distance matrix method, and so is not fast enough for phylogenetic

analysis of a large number of sequences (Felsenstein 2004).

22

The Tree Clustering Method

The tree clustering method is a systematic approach to ﬁnd undetected relations among
homologous sequence families or subfamilies based on sequence clusters in a

phylogenetic tree (Loewenstein and Linial 2008; Singh, Doerks et al. 2009).

The ProtoNet database (Kaplan, F riedlich et al. 2004) is a successful application of the
tree clustering method. It provides protein classiﬁcation based on a phylogenetic tree that
shows hierarchical organization of the whole protein sequence space and evolutionary
relatedness among protein families. The tree is constructed by a modiﬁed UPGMA
algorithm, which can handle all the UniProtKB sequences. High correspondence is
shown between clusters in the tree and protein families classiﬁed in other databases, such
as Pfam and SCOP. Further, overlooked and new functional connections between
families and sequences can be discovered in the tree, because it provides a global view of
the protein space hierarchy from close subfamily proteins to distantly related superfamily

proteins (Loewenstein and Linial 2008).

There are several reasons why the tree clustering method reveals relations undetectable
by proﬁle methods. First, tree clustering is based on pairwise comparisons of all family
members (in a distance matrix tree), while in the proﬁle method, families are represented .
as a single statistical model (Gribskov, McLachlan et al. 1987). Similarities expressed by
only some remote family members can be detected in tree but are overlooked by proﬁle
method if the remote members are not included in the proﬁle seeds (Loewenstein and
Linial 2008). Second, remote homologs picked up in the tree can help group family

members. Third, the proﬁle method relies too much on the quality of the underlying MSA

23

(multiple sequence alignment), which is decided by choice of seed sequences and MSA
algorithm. The MSA tools also have the MSA uncertainty problem (different tools give
different alignments) (Wong, Suchard et al. 2008) when aligned sequences are distantly
related. Examples of hypothetical tree clustering methods are shown in Figures 13 and

14.

 

 

 

 

 

 

 

 

 

l I

 

 

I _1

Figure 13. A hypothetical sub-tree of tree of whole protein space with cluster F1, A and

F2 in a speciﬁc arrangement.

24

F1

 

 

 

 

H H‘Tglg

 

 

 

 

 

 

 

Figure 14. A hypothetical sub-tree of tree of whole protein space with cluster Fl , F2 and

A in a speciﬁc arrangement.

In Figure 13, hypothetical subfamilies F1 and F2 are known to belong to the protein
family, F. Sequences in subfamily A are sequences from a metagenomic dataset with no
signiﬁcant matches to known sequences in a reference database. Because of its position
in the tree between known subfamilies, I can infer that A is a subfamily of F. This
example demonstrates the advantage of the tree clustering method (a global view of all
sequence similarities). If using HMM proﬁles, and only some well known sequences are

picked to make a proﬁle HMM, the sequences in A may not be detected by that proﬁle.

In Figure 14, F1, F2 are the same hypothetical subfamilies as described in Figure 13.
The position of subfamily A is now outside the known families. I cannot infer
conﬁdently A is a new subfamily of F due to the relative position of A, F1 and F2 in the

tree though there is still a possibility.

Although using tree clustering can identify functions that are undetectable by other
methods, this method also has it shortcomings. Like proﬁle comparison methods, the tree
is constructed based on MSA and thus errors or uncertainties used in MSA construction

may result in inaccurate tree building.

In this chapter, “tree squeezing method”, an application of the general tree clustering idea,
was used to reﬁne the possible 3055 nirKs down from F unGene website. I found the

phylogenetically narrow known nirKs limited the application of tree squeezing method.

3.2 Methods:

Using the tree squeezing method to find more nirK genes:

I evaluated the use of the tree clustering method, also called the tree squeezing method, to
identify true nirK genes from those listed on the FunGene website. The nirK genes listed
on FunGene are based on a proﬁle matching of sequences in the NCBI non-redundant

database which match a well-annotated seed model of 6 well-known nirK genes.
Building the tree

In order to build a tree for this method, I initially needed to identify additional “known”
sequences (in addition to the six seed sequences). To do this, I searched the six seed nirK
proﬁle HMM (see Chapter 1) against all 3,055 nirKs listed on the F unGene website.
From the distribution of E-values of all hits, I selected a cutoff (arrow in Figure16) and
treated hits with E-values lower than the cutoff as known nirKs and hits with E-values
higher than the cutoff as potential nirKs. As a result, 124 nirKs were picked as known,

26

and the other were marked as unknown (Figure 15). All 3,055 sequences were aligned to
the 6 seed nirK proﬁle using the hmmalign function in the HMMER package. The MSA
was then used to construct a neighbor-joining tree by QuickTree (Howe, Bateman et al.
2002). QuickTree, an implementation of the neighbor-joining algorithm, can quickly

construct phylogenetic tree of thousands of sequences.

 

350 . , r r

I

300

250 E

N

O

O
1

150 ~-

 

Number of Sequences

100 e

50»

O l —150 —1‘oo T :50 0
' log(E-value)l P
Assume known (124)

 

 

 

 

Figure 15. E-value distribution of 3055 nirKs from F unGene (blue) and 32 sequences
from tree squeezing method (green). Sequences on the left of P (stringent cutoff) are

treated as known nirKs.

In the tree shown in Figure 16, if an unknown sequence L or a branch node N is

neighbored with a known nirK or a node of known nirKs, and there is another known

27

nirK outside the neighborhood, I can infer that L or the sequences in N are more likely to
be real nirKs. A python script was written to automatically ﬁnd such sequences whose

function could be “tree squeezed” out of sequences with known functions.

 

~ Known nirK]

 

Unknown Sequence
(potential nirK)

 

 

 

~ Known nirKZ

 

 

Two requirement:
1) Find a neighboring known nirK
2) Find one more known nirK outside the neighborhood

 

 

 

Unknown sequence is squeezed and more likely to be real nirK

 

 

Figure 16. Deﬁnition of tree squeezing method.

3.3 Results and Conclusion:

Using the 124 “known” nirKs in our tree, I was able to squeeze out another 32 sequences
as potential nirK genes. A closer look at their E-value in HMMER 6 seed proﬁle search
output showed that there was variations in E-values, which indicate the tree squeezing

method gave results different from HMMER search.

28

 

 

 

 

 

. , known] ‘
‘ unknown
known2
~ All known 124 nirKs are clustered
in a small part of the tree
known124 -‘ l

 

Only small number of sequences
can be squeezed

 

 

 

Figure 17. A hypothesis explaining why only small number of sequences is squeezed.

29

Green: known 124 nirKs

 

Figure 18. Phylogenetic tree of all 3055 FunGene nirK sequences. The 124 known nirKs
are marked green. The tree shows the 124 nirKs treated as known are not

phylogenetically diverse.

An explanation for the small number of squeezed sequences (32) is that the 124 known
nirKs are not phylogenetically diverse and clustered in a small part of the whole tree
(Figure 17). The position of 124 known nirKs (green) in tree in Figure 18 supports this

conclusion (Figure 18).

Though I can use the tree squeezing method to add conﬁdence to the conclusion that the
32 squeezed potential nirK genes are valid, further validation with biological experiments
is required.

30

Chapter 4

Classification of amoA and pmoA

4.1 Introduction:

Methanotrophs are a special group of bacteria utilizing methane as the only carbon and
energy source. They are commonly found at the interface of aerobic and anaerobic
environments (i.e., hot spring), and are a very important part of the global methane cycle
(Oremland and Culbertson 1992). Methane monooxygenase (MMO) is the enzyme that
catalyzes methane oxidation. Two different types of MMOs have been found, the soluble
form (sMMO) in cytoplasm and the particular form (pMMO) bound to the membrane.
Despite the functional similarity, they do not show any sequence similarity. The switch
between their expressions is regulated by a copper ion (Cu2+) (Stanley, Prior et al. 1983).

Only pMMO is present in all methanotrophs.

Ammonia monooxygenase (AMO) is the enzyme that catalyzes oxidation of ammonia,
which is crucial for the global nitrogen cycle. AMO is found only in ammonia oxidizing

nitrifying microbes (Holmes, Costello et al. 1995).

pMMO and AMO are both integral membrane proteins, having similar sequences and
structures and thought to be evolutionarily related, despite their different physiological
functions. PmoA and amoA are genes coding the alpha subunit of pMMO and AMO
respectively. They are around 450 bp long and highly conserved (McTavish, Fuchs et al.
1993; Semrau, Chistoserdov et al. 1995). Their protein products are grouped in the same

Pfam domain family called AMO (PF02461) and the same family (IPR003393) in

31

InterPro database. PmoAs are found in 01- and y-Proteobacteria while amoAs have been
detected in B-, y-Proteobacteria and archea. Some amoAs in y-Proteobacteria are more
similar to pmoAs from y-Proteobacteria than other amoAs (Holmes, Costello et al. 1995)

suggesting that these two genes share a common ancestry.

Given their sequence similarity, BLAST and HMMER3 fail to separate these two very
closely related genes. I evaluate the resolution of the tree clustering methods to classify

pmoAs and amoAs.

4.2 Data:

Initially a collection of well-known sequences for BLAST and HMMER searches was

created:

Eight well-known pmoA seed sequences were downloaded from F unGene, and six amoA
seed sequences from well studied ammonia oxidizing nitrifying bacteria were
downloaded from UniProtKB database. These genes were from species including
Nitrosomonas europaea ATCC 19718, Nitrosomonas eutropha C-7I, Nitrosospira
briensis C-128, Nitrosovibrio t'enuis Nvl , Nitrosococus oceam' ATCC 19707, and
Nitrosospira multiformis ATCC 25196. The whole UniProtKB database release 2010__05

(including Swiss-Prot and TrEMBL) was also downloaded.

Sequences are downloaded from UniProtKB to build a tree for tree clustering:

Sequences annotated as amoA (n = 176) from cultured were downloaded from

UniProtKB. The keywords for the search were: (amoA AND (namezammonia OR

32

name:amoA) NOT gene:amoB NOT gene:amoC AND gene:amoA NOT

taxonomyzenvironmental NOT organisszali).

The 10,470 sequences annotated as amoAs that were sequenced directly from
environmental samples (not cultured) were downloaded from UniProtKB. The keywords
for the search were: (amoA AND (namezammonia OR namezamoA) NOT gene:amoB

NOT gene:amoC AND gene:amoA AND taxonomyzenvironmental).

Another 233 sequences annotated as pmoAs from other research using the culturing
methods were downloaded from UniProtKB. Keywords for the search were: (pmoA NOT

taxonomyzenvironmental AND (namezmethane OR nameszoA) AND genezpmoA).

Finally 3062 sequences annotated as pmoAs that were sequenced directly from
environmental samples (not cultured) were downloaded from UniProtKB using keywords:

(pmoA AND taxonomyzenvironmental AND (namezmethane OR nameszoA ))

4.3 Methods:

4.3.] BLAST and HMMER comparison (Figure 19)

The 6 amoA seed sequences and 8 pmoA seed sequences were separately searched against
UniProtKB database using NCBI-BLAST blastall with parameters: -p blastp -v 100000 -

b 100000 -e 10.

Also, the 6 amoA seed sequences and 8 pmoA seed sequences were separately made into
proﬁle HMMs by hmmbuild in HMMER3. The two resulting HMMs were searched

using HMMER against UniProtKB database using hmmsearch with default parameters.

33

hmmbuild

     
  
 
 
 
     
    

Multiple - hmmsearch
ta w . Profile
clus l alignment

“ C“ format I |\l.\l UniProtKB
kmm n

 

 

‘ IIL‘IICL'S

compare

 

UniProtKB BLAST

search
result

 

 

 

Figure 19. BLAST and HMMER comparison ﬂow chart.

4.3.2 Building the tree

First, the 176 sequences annotated by UniProtKB as amoAs and 233 sequences annotated
as pmoAs from cultured bacteria were combined and aligned by hmmalign in HMMER3
using the AMO Pfam domain family (PF02461) proﬁle HMM. The alignment was made

into a neighbor-joining by QuickTree.

Second, all 10,646 sequences annotated by UniProtKB as amoAs and 3,295 sequences
annotated by UniProtKB as pmoAs from both culturing method and environmental
samples were combined and aligned by hmmalign in the HMMER3 package using AMO
Pfam domain family proﬁle HMM. The alignment was made into a neighbor-joining tree

by QuickTree.

34

 

QuickTree

  

hmmalign

   

 

 

AMO domain

 

 

Figure 20. Flow chart of constructing phylogenetic tree of amoAs and pmoAs.

4.4 Results:

4.4.] The use of BLAST and HMMER to differentiate between amoAs and pmoAs in

UniprotKB

BLAST and HMMER search results against the UniProtKB database showed that
sequences of amoA and pmoA could not be separated effectively (Figure 21-24). For
example, in Figure 21, BLAST hits for amoA consisted of both sequences annotated as
amoA and pmoA in the UniProtKB database. If a very low E-value was used, amoAs
identiﬁed by BLAST were largely also annotated as amoAs in UniProtKB; however,
using very stringent E-value cutoffs resulted in less sequences identiﬁed in general. Due
to the high similarity of these genes, BLAST and HMMER did not have enough

resolution to separate these amoA and pmoA sequences in UniProtKB.

35

E-value distribution of amoA BLAST hits

1200
Combined
Number of amoA
1000 Number of pmoA

800

600

Number of Hits

400

200

 

-150 -100 -50 0
log(E-value)

Figure 21. E-value distribution of amoA BLAST hits. Blue is overall E-value distribution
of amoA BLAST hits. Green is a subset of overall BLAST hits that are annotated as

amoA in UniProtKB. Red is a subset of overall BLAST hits that are annotated as pmoA in

UniProtKB.

36

E-value distribution of amoA HMMER hits
1200 1 1 r

[2 Combined
2 Number of amoA
E Number of pmoA

looor .............. .: ........................... E ........................... in.......................1é. .....

 

800

600

Number of Hits

400

200

 

 

 

 

 

 

logtE-value)

Figure 22. E-value distribution of amoA HMMER hits. Blue is overall E-value
distribution of amoA HMMER hits. Green is a subset of overall HMMER hits that are
annotated as amoA in UniProtKB. Red is a subset of overall BLAST hits that are

annotated as pmoA in UniProtKB

37

E-value distribution of pmoA BLAST hits
900 - 1 . .

Combined
Number of amoA
j Number of pmoA

 

  

800»

700 ~

Number of Hits

W
O
O

200 F

100 ~

 

 

 

 

430 A -foo
log(E-value)

Figure 23. E-value distribution of pmoA BLAST hits. Blue is overall E—value distribution

of pmoA BLAST hits. Green is a subset of overall BLAST hits that are annotated as

amoA in UniProtKB. Red is a subset of overall BLAST hits that are annotated as pmoA in

UniProtKB.

38

E-value distribution of amoA HMMER hits

 

 

 

 

 

  

 

 

 

 

 

900 r , ‘r
1: Combined 3
800 Number ofamoA .. E. E ................. i .....
WEI? Number ofpmoA .......................... . . .
3 s E 1 i
700 .....- ........................... ENHI ...................... L..”;
600,. ........................................................................ 1‘ .............................
+2
35 500.. ........................................................................ t ...................... , .....
o
h
S
E 400_ .....................................................................................................
:1
Z I 2 2 I
300 .. ................ ........................... ........................... ... ...................... ......
f f f 1 f
200 .... ........................... .1. .....
100 ......... . ...............
2 . A. . g a
0 1 .n 1* i r wuﬁ '
-150 -100 -50 0
log(E-value)

Figure 24. E-value distribution of pmoA HMMER hits. Blue is overall E-value
distribution of pmoA HMMER hits. Green is a subset of overall HMMER hits that are
annotated as amoA in UniProtKB. Red is a subset of overall HMMER hits that are

annotated as pmoA in UniProtKB.

4.4.2 Using the tree clustering method to differentiate amoAs and pmoAs

To effectively use the tree clustering method, I needed to identify a signiﬁcant number of
“known” amoA and pmoA sequences. Thus, I treated sequences from culturing methods
as known sequences and those sequenced directly from environmental sample as
unknown. The reasons supporting this assumption were: 1) most well known amoAs or

pmoAs are obtained by culturing ammonia oxidizing species or methanotrophic species

39

and the culturing process is actually a screening step for ammonia oxidizing or
methanotrophic species and thus increased the conﬁdence that these annotations are real;
and 2) environmental sequences, especially those sequenced by next generation
sequencing techniques, are more likely to be incorrectly annotated because they are

sequenced directly from environmental samples, without culturing ﬁrst.

The tree built from “known” amoAs and pmoAs (sequences from culturing method)
showed amoAs and pmoAs separated very well (Figure 25). There are also segregations
for different taxonomy groups within amoAs or pmoAs. This separation is promising

evidence that I can apply the tree clustering method for amoA and pmoA classiﬁcation.

40

Tree of known amoAs and pmoAs

Green : amoA \
white: pmoA

 

Figure 25. Tree of known amoAs and pmoAs. The green colored sequences are amoAs.
The others are pmoAs. Tree is made from known amoAs (from culture) and known
pmoAs (from culture). AmoAs and pmoAs separate well on tree though a small cluster of
amoAs from y-proteobacteria are closer to pmoAs from y-proteobacteria than other amoAs

(referenced by an arrow in Tree).

Using all amoAs and pmoAs (sequences from both cultures and environmental samples)
to build a tree, I observed that there was, in general, good separation of amoAs and
pmoAs (Figure 26). However, some pmoAs were located on amoA tree clusters, and
some amoAs were located on pmoA tree clusters suggesting possible incorrectly

annotated sequences.

41

Tree of all amoAs and pmoAs

Green : amoA
white: pmoA

 

Figure 26. Trees of all amoAs and pmoA 5. Tree is made from all amoAs and pmoAs
(including those from culture and environmental samples). The majority of amoAs and
pmoAs separate from each other though some pmoAs are scattered in amoA clusters and

some amoAs are scattered in pmoA clusters.

4.4.2 Using low nodes and high nodes for tree clustering

To further classify unknown sequences (sequenced directly from environmental samples)
based on the tree clusters, I implemented a low node method and a high nodes method to
attempt to group together amoA and unknown sequences or pmoA and unknown

sequences in neighboring tree branches.

42

1) Using the low node method, I identiﬁed the low node as the rightmost (closest to
leaves) branch nodes that only contain a list of sequence types. The list can be any
combination of amoA, pmoA, and unknown sequences. To identify amoA and pmoA,
respectively, I was interested in amoA low nodes that contain only known amoAs and
unknown sequences, and pmoA low nodes that contain known pmoAs and unknown

sequences.

2) The high node method uses high nodes which are deﬁned as the leftmost (furthest
from leaves) branch nodes that only contain a list of sequence types. For identify amoA
and pmoA using the high node method, I was interested in amoA nodes containing only
known amoAs and unknown sequences, and pmoA high nodes containing known pmoAs
and unknown sequences. Examples of using both the low node and high node methods to

identify an unknown gene are shown in Figure 27.

The low node method gave more accurate classiﬁcation, while the high node classiﬁed
more unknown sequences but with less accuracy. Since unknown sequences outside the
low node can also be included in high node (unknown2 of Tree A in Figure 27), the high
node method is able to classify more unknown sequences and the sequences classiﬁed by

low node method is a subset of those classiﬁed by high node method.

43

:1 1110A]

 

 

 

 

NI

amoA_envl (unknown)

__ ---- amo.-\1 N3 N2 -- amoAZ
N2 unknown] _ ' pmoA_env1 (unknown)
unknown2 . 7 amoA_envZ (unknown)

N4 meAI
pmo.\l -— pmoA_env2 (unknown)
* *— pmoA_env3 (unknown)

Tree A Tree B

Figure 27. Simple example trees for super node analysis. Known amoAs are marked as
amoA 1, amoA 2, Known pmoAs are marked as pmoA 1, pmoA 2, Environmental
sequences are treated as unknown sequences for classiﬁcation. Those environmental
sequences annotated as amoAs in UniProtKB are marked as amoA_envl, amoA_env2,
Those annotated as pmoAs in UniProtKB are marked as pm0A_env1 , pmoA_env2, In
Tree A, the low node of amoA and ‘env’ (environmental) sequence is N]; the high node
of amoA and ‘env’ sequence is N2, which include one more ‘env’ sequence (amoA_env2)
than low node method. In Tree B, the low node of amoA and ‘env’ sequence is N] and
N2, which shows pmoA_env1 may be incorrectly annotated; the high node of amoA and
‘env’ sequence is N3, which includes one more ‘env’ sequence than low node method.
The low node of pmoA and ‘env’ sequence is N4; the high node of pmoA and ‘env’

sequence is still N4.

As Table 2 shows both low node and high node methods have sequences classiﬁed
differently from UniProtKB annotation. High node method was able to classify the most
unknown sequences (12,980/ 13,532), while low node method classiﬁed only a small

number of sequences (528/ 13,532).

44

Using the low node, a total of 291 and 23 7, respectively, could be classiﬁed as potential

amoAs or pmoAs (Table 2). Using the high node, a total of 10,242 and 2,738, respectively,

could be classiﬁed as potential amoAs or pmoAs (Table 2). Several sequences identiﬁed

by the tree squeeze method disagreed with annotations in UniProtKB and require further

investigation.

Table 2. Numbers of sequences classiﬁed by various methods.

 

 

 

 

 

Low Node High Node
amoA pmoA unclassiﬁed sum amoA pmoA unclassiﬁed sum
Total 291 237 13004 13532 10242 2738 552 13532
UniProt
289 7 10174 10159 3i 280
amoA
UniProt
2 230 2830 83 2707 272
pmoA

 

 

 

 

 

 

 

 

 

 

45

 

4.4.3 Investigating sequences identified by tree clustering that disagreed with

UniProt Annotations

Comparing sequence alignments

When checking the alignment of some UniProtKB amoAs that agree with low node
method (correctly classiﬁed) and UniProtKB pmoAs (incorrectly classiﬁed), I found that
they are very similar and I could not recognize amoAs and pmoAs from each other based

on the alignment alone (Figure 28)

46

........................................................... l
........................................................ 2
L- I II- A- TE- m- ct- [VIIIUIF- Ill. ------ I ----------- WI 'i3
............................................................ 4
o-r5tI-A-Tt-tvt LT - I ----------- - ‘5
L-I-Fa-A-Tl tlv LT ----------- -- 6
L-r-i A V! AAL — IN rs! ------ I ----------- F 7
L-r'-rA-l 11. un-cr-u M l ----------- L E s

 
   

- -‘~-!- — - —I—I-t- -I-l-IPII- - I]
-- — --t- — — -t-t-t— -A-t-t—Ed-Arr- l2
- -t.--I-- - ~ 111— J-I-t- ,I-IM— t3
- —'--I— ~ -S- -I—t— ~I-I—I- S-LII- 4
- - --I- ~ -l—l—t-I— -I-L-!~:- tF« t5
-- -1--l~ ~ - -i-I-t- —n-t-I—E- AP- 1:6
-- -1.--1- - -s- -r-1- -I—t-I'IS-Lﬂ- ~17
-- -I--L- - ~I-t—I~le —I-t-I—Kﬂ-IAI-A- I11

--—I v-- --II--Ft

--1— H- ~11:--IL

-1—- ’v-- I --r1r

---I.-- _ -- lI--FNE

---l-- ,T—-' I3--ILV

---; v-- Inf”

_--L -— -tI-'FNF

---l—- -~ -tI-—

     

 

  
 

    
 
  

- - 1; 1:15; urrrsrrmtmcrI—mﬂﬁ all ivlvx [ng '
- K-a t :66 IAAFIASi Lvrtvl-ulﬁ A- Ft- --ka 2
- - t is .lAA ------------------------------- I .-- 3
- IR lAAFISAF mm III. B: A rt” :11 5
- :CG IAIFBSAF LMFTVN— VII; r AtYlVK- 6
- taxovvlvnArrslr rtu wr r x110 11- 7
-ﬂE-IvmrAAttcuivv11I-Il1s nu ------- llym 8

Figure 28. A part of the alignment of two correctly classiﬁed amoAs (land 2), two
incorrectly classiﬁed pmoAs (3 and 4) found by low node method, two well known
amoAs (5 and 6) and two well known pmoAs (7 and 8). Alignments shows all the two

sequences are very similar and I cannot recognize amoAs from pmoAs on the alignment.

47

I evaluated the position of the possibly misclassiﬁed genes in the tree. In general, I
expected that either (1) a possibly misclassiﬁed pmoA was in a cluster of known or
unknown amoAs or (2) a possibly misclassiﬁed amoA was in a cluster of pmoA 3. By
investigating the nodes surrounding the sequence in question, I hoped to better

understand the misclassiﬁcation.

1) For example, in Figure 29.A1, the known neighboring amoA, amoA_bacBlO2, helped
the classiﬁcation of pmoA_X1 805 and the surrounding amoAs (known or unknown)
strengthened the possibility that this may be a misclassiﬁcation of pm0A__X1 805 in the
UniProt database. Note that the position of this known amoA was also evaluated in
relation to other known amoAs in the tree of all known amoAs and pmoAs to further

support this hypothesis (Figure 29.A2).

48

Evaluating sequence positions in tree

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

pmoA_Xl974

meA—XI476 ’ —— amoA_bacB62

amoA X8934

— —- A b B50
_ amoA_X6505 am - 3“

amOA xs544 —" am0A_baCB99

amo A:X8791 amoA_bacB72
__ amoA_X7715 amoA_bacBlUZ

amoA_X6967 amoA_bacB96

amoA_bacB128 “m A hale 40

pmoA_X1806 ’

[ pmoA_X1805 4‘— —- ‘_ amoA_bacBlll
amoA_bacBlOZ — —— amoA-bacBl3
pmoA_X] 172 amoA_bacBSZ
amoA_X7695

A1 A2

 

[- pmoA_X2865 —]__{—_— amoA_bacB120
amoA_bacBl35

L
pmoA_X1400 amoA_bacG-I

 

 

 

 

 

 

 

pmoA_X2574 amoA_arc7
amoA_arc4
pmoA_Xl676 A———j: “mm-ml
pm 0 A_.\’237 4 <-— amoA_arc3
amoA arc2

o111ur\__um.u-r

____: a moA_a rc6
pmoA_XZ 031 a moAﬂa rcS
pmoA_bacCQ

B1 B2

Figure 29. Closeups of two possibly misclassiﬁed pmoAs in tree of all amoA and pmoA
(A1 and B1) and two known amoAs (green and grey shaded) that help classify the two

possibly misclassiﬁed pmoAs in tree of all known amoAs and pmoAs.

49

2) Another example of a possible misclassiﬁed gene was pmoA_X23 74 which was in a
cluster of pmoAs (known or unknown) (Figure 29.81). Looking more closely at its
neighboring amoA, amoA_bacG4, I see that it was located in a big cluster of both amoAs
and pmoAs of the tree of known amoAs and pmoAs. In this situation, I had to be skeptical
about whether amoA_bacG4 was correctly annotated. This example highlighted the need
for more reference sequences before more conclusions could be drawn (here the known

amoAs or pmoAs still may be incorrectly annotated).

4.5 Conclusion:

In general, this study found that tree clustering method gave higher resolution for
separating two genes with similar sequences but different functions, such as amoA and
pmoA. The low node method gave better accuracy in classiﬁcation, while the high node
method classiﬁed more unknown sequences with lower but still reasonably good
accuracy provided by the tree hierarchy. The high node gave more aggressive assertions
that the unknown sequences outside lode nodes are also classiﬁed with lower conﬁdence.
Meanwhile, the difference between tree clustering method classiﬁcation and UniProtKB
annotation showed that some of the annotations of amoAs and pmoAs in UniProtKB were
probably wrong, especially those of genes sequenced directly from environmental

samples. These sequences need further computational or biological veriﬁcation.

The stronghold of the tree clustering method is also its shortcoming—it is based on an

underlying tree. Problems in multiple sequence alignments and tree constructing tools

50

will results in incorrect classiﬁcation with the tree clustering method. An example of this

would be sequences containing multiple domains which are difﬁcult to cluster on a tree.

I used the tree clustering method to differentiate amoA and pmoA mainly for family
classiﬁcation. However, I could also study the evolutionary relationship of amoA and
pmoA. Signiﬁcant efforts have focused on quantiﬁcation of these genes in samples due to
their ecological importance (Webster, Embley et al. 2005; De Corte, Yokokawa et al.
2009). A close evolutionary relationship has been inferred based on their high sequence
similarity, but their evolutionary mechanism still needs further study. The multiple amoA
or pmoA copies within individual strains have resulted from gene duplication events
(paralogous evolution) not horizontal gene transfer (Klotz and Norton 1998). But the
reason that initiated the paralogous evolution and the time when paralogous evolution
started with respect to speciation is still unknown. What is the evolutionary relationship
between amoA and pmoA, convergent or divergent? I may use comparative genomics,
phylogenetics, or other methods to get a better understanding of their evolution

mechanism.

51

Chapter 5

Conclusions and future work

5.1 Conclusions:

5.1.1 Biology is increasingly reliant on sequence similarity with the use of new
sequencing technology, and sequence similarity may not give accurate gene

annotations.

When using BLAST and HMMER to identify nirK genes in a metagenomics dataset, I
found that BLAST hits could be unreliable, with several false positives within the
BLAST results. Using these tools to classify amoA and pmoA, these genes could not be
differentiated due to sequence similarity. The examples shown in this study exemplify
that computational methods infer homology from similarity, but similarity does not
necessarily mean homology. A major challenge to using these methods is a lack of a
standards or controls to assess the “truth”. Biological experiments can be used but are
expensive and take long time. For most genes, there are only a few sequences which are
biologically veriﬁed. I should be cautious when using annotation tools and consider the

possibility of incorrect annotation.

5.1.2 Annotations are sometimes incorrect.

The classiﬁcations of amoA and pmoA genes in our tree which disagree with UniProt
annotations highlight the possibility of incorrect annotations in reference databases.

Homology searches based on incorrect annotations will result in future incorrect

52

annotations. Tools which could evaluate the accuracy of annotations would be very

useful for making sound biological conclusions from homology searches.

5.1.3 Narrow phylogenetic sampling is a problem.

Our ability to detect novel nirK and reliably classify pmoA/amoA is strongly affected by
the lack of diverse nirK/pmoA/amoA. Tools which could highlight important “sequencing
gaps” would be useful in providing more resolution to maximize gene discovery using

homology searches to annotate sequences.

5.2 Novel approaches:

Here I used the tree clustering method as a ﬁltration step downstream of initial homology
search. The general tree clustering idea, which assumes more similar sequences are
closer in a tree, is already widely used (Loewenstein andLiniaI 2008; Singh, Doerks et al.
2009). In this study, I presented and evaluated some new “tricks” for discovering
“sequence closeness” including “tree squeezing”, “low node and “high node” methods to
complement other methods such as BLAST or HMMER. These methods helped reﬁne
the result of BLAST and HMMER search and found possible misclassiﬁcation of current

reference database.

5.3 Future directions:

A phylogenetically diverse set of well known sequences of a gene is very important for
annotation and classiﬁcation of new sequences. Sequences that could contribute
substantially to annotation and classiﬁcation can be picked out from the phylogenetic tree

for further biology experiment veriﬁcation.

53

I present here tools evaluated with speciﬁc examples. A natural extension of this work
would be to use these tools for other datasets, metagenomic and otherwise. For example,
the low/high node methods used here could be applied to the classiﬁcation of other
protein families with closely related subfamilies such as those that can be found in the

ProtoNet database.

The tree clustering method may also be a method to identify novel groups of genes within
a family. For example, big nodes close to the root with all unknown or new sequences

are possibly new subgroup of the family.

5.4 Parting Thoughts:

A repetitive theme throughout this study was that tools that are currently used, whether
BLAST, HMMER, or even a reference database, can often give the incorrect results
which could result in wrong conclusions. Understanding the limits of the tools and their
underlying methodology is important in their effective use. Much like experimental

methods, using computational tools requires preliminary thought and planning.

54

Bibliography

Retrieved July 21, 2010, from
http://www.ncbi.nlm.nih.ng/Education/BLASTinfo/BLAST_algorithm.html.

 

Retrieved August 15, 2009, from http://fungene.cme.msu.edu/index.spr.

Altschul, S. F., W. Gish, et al. (1990). "Basic local alignment search tool." .1 Mol Biol 215(3):
403-410.

Altschul, S. F., T. L. Madden, et al. (1997). "Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs." Nucleic Acids Res 25(17): 3389-3402.

Barton, G. J. (1990). "Protein multiple sequence alignment and ﬂexible pattern matching."
Methods Enzymol 183: 403-428.

Bashford, D., C. Chothia, et al. (1987). "Determinants of a protein fold. Unique features of the
globin amino acid sequences." .1 Mol Biol 196(1): 199-216.

De Corte, D., T. Yokokawa, et al. (2009). "Spatial distribution of Bacteria and Archaea and
amoA gene copy numbers throughout the water column of the Eastern Mediterranean
Sea." ISME 13(2): 147-158.

Eddy, S. R. (1998). "Proﬁle hidden Markov models." Bioinformatics 14(9): 755-763.

Eddy, S. R. (2009). "A new generation of homology search tools based on probabilistic
inference. " Genome Inform 23(1): 205- 211.

Felsenstein, J. (2004). Inferring phylogenies. Sunderland, Mass, Sinauer Associates.

Gonzalez, M. W. and W. R. Pearson (2010). "Homologous over-extension: a challenge for
iterative homology searches." Nucleic Acids Res 38(7): 2177-2189.

Gribskov, M., A. D. McLachlan, et al. (1987). "Proﬁle analysis: detection of distantly related
proteins." Proc Natl Acad Sci U S A 84(13): 4355-4358.

Holmes, A. 1., A. Costello, et al. (1995). " Evidence That Particulate Methane Monooxygenase

and Ammonia Monooxygenase May Be Evolutionarily Related." Fems Microbiology
Letters 132(3): 203-208.

Howe, K., A. Bateman, et al. (2002). "QuickTree: building huge Neighbour-Joining trees of
protein sequences." Bioinformatics 18(11): 1546-1547.

Kaplan, N., M. Friedlich, et al. (2004). "A functional hierarchical organization of the protein
sequence space." BMC Bioinformatics 5: 196.

Klotz, M. G. and .I. M. Norton (1998). "Multiple copies of ammonia monooxygenase (amo)
Operons have evolved under biased AT/GC mutational pressure in ammonia-oxidizing
autotrophic bacteria." Fems Microbiology Letters 168(2): 303-311.

Loewenstein, Y. and M. Linial (2008). "Connect the dots: exposing hidden protein family
connections from the entire sequence tree." Bioinformatics 24(16): 1193-199.

55

Madera, M. and J. Gough (2002). "A comparison of proﬁle hidden Markov model procedures for
remote homology detection." Nucleic Acids Res 30(19): 4321-4328.

Madigan, M. T., J. M. Martinko, et al. (2006). Brock biology ofmicroorganisms. Upper Saddle
River, NJ, Pearson Prentice Hall.

McTavish, H., J. A. Fuchs, et al. (1993). "Sequence of the gene coding for ammonia
monooxygenase in Nitrosomonas europaea." J Bacteriol 175(8): 2436-2444.

Metzker, M. L. (2010). "Sequencing technologies - the next generation." Nat Rev Genet 11(1):
31-46.

Mount, D. W. (2004). Bioinformatics : sequence and genome analysis. Cold Spring Harbor, N.Y.,
Cold Spring Harbor Laboratory Press.

Needleman, S. B. and C. D. Wunsch (1970). "A general method applicable to the search for
similarities in the amino acid sequence of two proteins." J Mol Biol 48(3): 443-453.

Oremland, R. S. and C. W. Culbertson (I992). "Importance of Methane-Oxidizing Bacteria in the
Methane Budget as Revealed by the Use of a Speciﬁc Inhibitor." Nature 356(6368): 421-
423.

 

Riesenfeld, C. S., P. D. Schloss, et al. (2004). "Metagenomics: genomic analysis of microbial
communities." Annu Rev Genet 38: 525-552.

Rost, B. (1999). "Twilight zone of protein sequence alignments." Protein Eng 12(2): 85-94.

Semrau, J. D., A. Chistoserdov, et al. (1995). "Particulate methane monooxygenase genes in
methanotrophs." J Bacteriol 177(11): 3071-3079.

Singh, A. H., T. Doerks, et al. (2009). "Discovering functional novelty in metagenomes:
examples from light-mediated processes." J Bacteriol 191(1): 32-41.

Smith, T. F. and M. S. Waterman (1981). "Identiﬁcation of common molecular subsequences." 1
Mol Biol 147(1): 195-197.

Stanley, S. H., S. D. Prior, et al. (1983). "Copper Stress Underlies the Fundamental Change in
Intracellular Location of Methane Mono-Oxygenase in Methane-Oxidizing Organisms -
Studies in Batch and Continuous Cultures." Biotechnology Letters 5(7): 487-492.

Tiedje, J. M. (1988). Ecology of denitriﬁcation and dissimilgtorv nitrate reduction to ammonium.
New York, John Wiley & Sons.

Webster, G., T. M. Embley, et al. (2005). "Links between ammonia oxidizer species composition,
functional diversity and nitriﬁcation kinetics in grassland soils." Environ Microbiol 7(5):
676-684.

Wong, K. M., M. A. Suchard, et al. (2008). "Alignment uncertainty and genomic analysis."
Science 319(5862): 473-476.

 

Wu, C. H., R. preiler, et al. (2006). "The Universal Protein Resource (UniProt): an expanding
universe of protein information." Nucleic Acids Res 34(Database issue): D187-l91.

56

Wu, M. and J. A. Eisen (2008). "A simple, fast, and accurate method of phylogenomic inference."
Genome Biol 9(10): R151.

57