FAST EDIT DISTANCE CALCULATION METHODS FOR NGS SEQUENCE SIMILARITY

By

A K M Tauhidul Islam

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Doctor of Philosophy

2020

ABSTRACT

FAST EDIT DISTANCE CALCULATION METHODS FOR NGS SEQUENCE SIMILARITY

By

A K M Tauhidul Islam

Sequence fragments generated from targeted regions of phylogenetic marker genes provide valuable insights into identifying and classifying organisms and inferring taxonomic hierarchies. In recent years, significant development in targeted gene fragment sequencing through Next Generation Sequencing (NGS) technologies has increased the necessity of efficient sequence similarity computation methods for very large numbers of pairs of NGS sequences. The edit distance has been widely used to determine the dissimilarity between pairs of strings. All the known methods for edit distance calculation run in near quadratic time with respect to string lengths, and it may take days or weeks to compute distances between such large numbers of pairs of NGS sequences. To solve this performance bottleneck problem, faster edit distance approximation and bounded edit distance calculation methods have been proposed. Despite these efforts, the existing edit distance calculation methods are not fast enough when computing large numbers of pairs of NGS sequences. In order to further reduce the computation time, many NGS sequence similarity methods have been proposed using matching k-mers. These methods extract all possible k-mers from NGS sequences and compare pairs of sequences based on the shared k-mers. However, these methods reduce the computation time at the cost of accuracy. In this dissertation, our goal is to compute NGS sequence similarity using edit distance based methods while reducing the computation time.
We propose a few edit distance prediction methods using dataset independent reference sequences that are distant from each other. These reference sequences convert sequences in datasets into feature vectors by computing edit distances between each sequence and each of the reference sequences. Given sequences A, B and a reference sequence r, the edit distance satisfies ed(A, B) ≥ |ed(A, r) − ed(B, r)|. Since the reference sequences are significantly different from each other, with a sufficiently large number of reference sequences and a high similarity threshold, the differences of the edit distances of A and B with respect to the reference sequences are close to ed(A, B). Using this property, we predict edit distances in the vector space based on the Euclidean distances and the Chebyshev distances. Further, we develop a small set of deterministically generated reference sequences with maximum distance between each pair of them to predict higher edit distances more efficiently. This method predicts edit distances between corresponding sub-sequences separately and then merges the partial distances to predict the edit distances between the entire sequences. The computational complexity of this method is linear with respect to sequence length. The proposed edit distance prediction methods are very fast while achieving very good accuracy for high similarity thresholds. We have also shown the effectiveness of these methods on agglomerative hierarchical clustering. We also propose an efficient bounded exact edit distance calculation method using the trace [1]. For a given edit distance threshold d, only letters up to d positions apart can be part of an edit operation. Hence, we generate pairs of sub-sequences with length difference at most d so that no edit operation spills over to the adjacent pairs of sub-sequences. Then we compute the trace cost in such a way that the number of matching letters between the sub-sequences is maximized.
This technique does not guarantee locally optimal edit distances between the sub-sequences; however, it guarantees the globally optimal edit distance between the entire sequences for distances up to d. The bounded exact edit distance calculation method is an order of magnitude faster than the dynamic programming edit distance calculation method.

To my family

ACKNOWLEDGMENTS

I would like to thank my PhD supervisor Professor Sakti Pramanik for his continuous support and guidance during the entire time. His philosophy as a researcher and appreciation for academia are truly inspiring. He is a true mentor and he helped me to grow in many ways over the years. He consistently pushed me to be a better researcher and shaped my thought process about solving problems in interesting and non-trivial ways. I would also like to thank the esteemed committee members, Professor James R. Cole, Professor Qiang Zhu and Professor Sandeep Kulkarni. I am grateful for their valuable suggestions toward the successful completion of the dissertation. Dr. Cole always guided our work by providing suggestions based on domain knowledge and the best practices in the Bioinformatics community. Dr. Zhu, as a collaborating professor, guided me continuously throughout the duration of my program. Like Dr. Pramanik, Dr. Zhu greatly helped me in developing algorithms, technical writing as well as experimental evaluations. Dr. Kulkarni provided valuable suggestions on different key concepts of the dissertation. In addition, Dr. Benli Chai helped on multiple occasions to find suitable NGS sequence datasets for our experiments. The Bangladeshi community in Lansing was also very helpful in making our life enjoyable. The constant support from the seniors in the community as well as various social activities with the fellow students and their families helped us to successfully complete the PhD study. Family is the most important thing that allowed me to follow this dream.
My parents are my constant source of love and support in everything. I am deeply thankful to my father-in-law, mother-in-law, and brothers-in-law for their continuous encouragement and affection. The successful completion of my PhD study was only possible because of the constant motivation, support, and love of my wife Afroza. This journey was equally hard for her as it was for me, and I am really grateful to her for the sacrifices and the patience she has shown. Lastly, our son Zavyan Rushal Tauhid was born in the fourth year of my PhD program. He is the greatest gift of our life and he brightens each moment with his activities and laughs. He may not remember anything about this journey, but he was a very important part of it.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO SYMBOLS AND ABBREVIATIONS
Chapter 1 Introduction
1.1 Classification of Similarity Measures for NGS Sequences
1.1.1 NGS Sequence Similarity based on the Edit Distance
1.1.2 NGS Sequence Similarity based on the Matching k-mer Counts
1.1.3 A Novel Class of NGS Sequence Similarity Measures based on Reference Sequence based Predicted Edit Distances
1.2 An Overview of our Novel Edit Distance Prediction Scheme
1.3 An Overview of the Proposed Bounded Exact Edit Distance Calculation
1.4 Thesis Outline
Chapter 2 Related Works
2.1 Edit Distance based NGS Sequence Similarity
2.1.1 Edit Distance Approximation
2.1.2 Bounded Exact Edit Distance Calculation
2.2 NGS Sequence Similarity based on Matching k-mers
2.3 Reference Sequence based Genome Sequence Analysis Methods
2.4 Clustering Methods
2.4.1 Agglomerative Hierarchical Clustering
2.4.2 Heuristics based Clustering
Chapter 3 The Euclidean Distance based Edit Distance Prediction
3.1 Space Transformation
3.2 Selection of Reference Sequences
3.2.1 Length of Reference Sequences
3.2.2 Distances among Reference Sequences
3.2.3 Number of Reference Sequences
3.2.4 Edit Distance to Euclidean Distance Mapping
3.3 Hierarchical Clustering
3.3.1 Cost Analysis
3.3.2 Quality Assessment of Clustering Solutions
3.4 Experimental Results
3.4.1 Datasets
3.4.2 Distance among Reference Sequences
3.4.3 Mapping between the Edit and the Euclidean Distances
3.4.4 Number of Reference Sequences
3.4.5 Computation Time
3.4.6 Cluster Similarity
3.4.7 Cluster Quality
3.4.8 Comparison with Local Sensitive Hashing based Clusters
3.5 Summary
Chapter 4 MVD based Edit Distance Prediction
4.1 Differences of Distances
4.1.1 Effect of Increasing Number of Reference Sequences on MVD
4.2 Edit Distance Prediction in the Vector Space
4.2.1 Heuristics for the Edit Distance Prediction
4.2.2 Optimal Number of Reference Sequences
4.3 Hierarchical Clustering
4.3.1 Memory Efficient Clustering Technique
4.3.2 Cost Analysis
4.3.3 Predicted Edit Distance Thresholds in the Vector Space
4.4 Experimental Results
4.4.1 Clustering Time
4.4.2 Cluster Similarity
4.4.3 Cluster Quality
4.5 Summary
Chapter 5 Edit Distance Prediction from Distances between the Sub-sequences
5.1 Basic Concepts
5.1.1 Notations
5.1.2 Key Ideas
5.2 Reference Sequence based Transformation
5.2.1 Edit Distances among the Reference Sequences
5.2.2 Predicting Edit Distance
5.2.2.1 Edit Distance One
5.2.2.2 Edit Distance Greater than One
5.3 Edit Distance Prediction from Partial Edit Distances of the Corresponding Sub-sequences
5.3.1 Merging Partial Edit Distances of the Sub-sequences
5.3.2 Finding Primary and Secondary Sub-sequences
5.3.3 Multiple Adjacent Letters as Distinct Elements
5.3.4 Effect of Sub-sequence Length
5.3.5 Computational Complexity
5.4 Experimental Results
5.4.1 Effect of the Proposed Strategies on Edit Distance Prediction Accuracy
5.4.2 Edit Distance Prediction Accuracy
5.4.3 Edit Distance Calculation Time
5.4.4 Clustering
5.4.4.1 Clustering Time
5.4.4.2 Cluster Similarity
5.5 Summary
Chapter 6 Bounded Exact Edit Distance
6.1 Length of the Sub-sequences
6.2 Minimum Cost Trace between the Sub-sequences
6.2.1 Computational Complexity
6.3 Experimental Results
6.3.1 Edit Distance Calculation Time
6.3.2 Hierarchical Clustering Time
6.4 Summary
Chapter 7 Conclusion
7.1 Key Contributions
7.2 Discussions
7.3 Possible Extensions
7.3.1 Extending the Proposed Methods
7.3.2 Similarity Comparison of other Ribosomal Regions
7.3.3 Applying Proposed Methods to Relevant Problems
APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: 16S rRNA Next Generation Sequencing (NGS) datasets
Table 3.2: Mapping of Euclidean distance thresholds for given edit distance thresholds in hierarchical clustering by comparing similar clusters in the edit space and the vector space
Table 3.3: Comparison of hierarchical clustering time (in hours) of the reference sequence based technique in the vector space with that of the actual edit distance based technique
Table 3.4: Evaluating accuracy of single linkage and average linkage hierarchical clusters generated using the proposed predicted edit distances in the vector space. We compute Normalized Mutual Information (NMI) for relative similarity of the generated clusters to those based on actual edit distances.
Table 3.5: Comparison of clustering time and accuracy based on predicted edit distances in the vector space to those based on actual edit distances. All the single linkage and average linkage clusters are compared for 97% similarity threshold.
Table 3.6: Evaluating quality of average linkage hierarchical clusters using the silhouette coefficient. The clusters are based on the edit distances and the Euclidean distances in the vector space.
Table 3.7: Comparison of clusters based on the edit distances, the predicted distances in the vector space and local sensitive hashing based similarity with respect to clustering time, relative similarity (NMI) to ground truth and the silhouette coefficient.
Table 4.1: Number of candidate pairs selected for the edit distance calculation. These pairs are selected based on the top two differences of VD and their corresponding frequencies.
Table 4.2: Comparison of similarity thresholds for hierarchical clustering based on the edit distances and those of the MVD
Table 4.3: Hierarchical clustering time (in hours) breakdown based on the edit distance and the MVD in the vector space for 97% similarity
Table 4.4: For 99% similarity threshold, comparison of clustering time and accuracy using the MVD based method to those of the actual edit distances as well as k-mer matching based VSEARCH and LSH-Div clustering
Table 4.5: For 98% similarity threshold, comparison of clustering time and accuracy using the MVD based method to those of the actual edit distances as well as k-mer matching based VSEARCH and LSH-Div clustering
Table 4.6: For 97% similarity threshold, comparison of clustering time and accuracy using the MVD based method to those of the actual edit distances as well as k-mer matching based VSEARCH and LSH-Div clustering
Table 4.7: Comparing average linkage hierarchical clusters using the silhouette coefficient. The clusters are generated based on the edit distances, the MVD based predicted edit distances as well as k-mer matching techniques such as VSEARCH and LSH-Div.
Table 5.1: Effect of sub-sequence length on edit distance prediction accuracy (%). For a predefined distance threshold, the experimental result shows that the prediction accuracy drops with increasing sub-sequence lengths.
Table 5.2: Comparison of cumulative edit distance prediction accuracy between the MVD and the differences of distances based methods
Table 5.3: For 99% similarity threshold, comparison of clustering time (in hours) and accuracy of the differences of distances based method to those of the actual edit distances, the MVD as well as k-mer matching based VSEARCH and LSH-Div clustering
Table 5.4: For 98% similarity threshold, comparison of clustering time (in hours) and accuracy of the differences of distances based method to those of the actual edit distances, the MVD as well as k-mer matching based VSEARCH and LSH-Div clustering
Table 5.5: For 97% similarity threshold, comparison of clustering time (in hours) and accuracy of the differences of distances based method to those of the actual edit distances, the MVD as well as k-mer matching based VSEARCH and LSH-Div clustering
Table 6.1: Comparison of all pairs distance calculation time (in hours) of the proposed bounded exact edit distance calculation method with that of the existing dynamic programming based method and Slidesort
Table 6.2: Comparison of hierarchical clustering time (in hours) of the proposed bounded exact edit distance calculation method with that of the existing dynamic programming based method for a given distance threshold

LIST OF FIGURES

Figure 1.1: Example of edit operations and the corresponding edit distance between a pair of strings
Figure 1.2: NGS sequence similarity thresholds at genus and species levels have been revised over the years. With few exceptions, the species and genus similarities remain at 97% and 95% respectively. This justifies the usefulness of more accurate edit distance prediction and calculation methods for smaller edit distance bounds.
Figure 1.3: All possible k-mers in a sequence
Figure 1.4: Comparison of sequence similarity based on (a) the edit distance, and (b) Local Sensitive Hashing (LSH) based methods
Figure 1.5: Comparison of sequence similarity based on shared k-mers. For a given edit distance, the sequence similarity may vary depending on the uniqueness of the k-mers.
Figure 1.6: Schematic representation of general space transformation and distances among the dataset sequences with respect to reference sequences.
Figure 1.7: The Chebyshev distance between a pair of 2D feature vectors
Figure 3.1: Theoretical probability distribution of L1 distances in the vector space formed by the Hamming distances with varying number of dimensions, i.e., reference sequences.
Figure 3.2: Theoretical expectation of the Euclidean distances and their 99.9% intervals in the vector space with 100 reference sequences by the Hamming distance (upper panel) and the edit distance (lower panel) as a function of the actual Hamming and edit distances between a random pair of sequences. Compared with experimental lower and upper bounds of observed distances from a random sampling simulation.
Figure 3.3: Correlation among features after transformation of random sequences as a function of the distance among reference sequences. [The dashed line shows lref/2].
Figure 3.4: Cumulative frequency distribution of the edit and the Euclidean distances for different threshold settings. (a) The edit distance distribution and (b) the Euclidean distance distribution. (c) Relative similarity of the resulting clusters between the edit space and the vector space for varying cumulative frequencies of the Euclidean distances.
Figure 3.5: Comparing relative similarity (NMI) between clusters in the edit space and the vector space with growing number of reference sequences
Figure 4.1: Distance between sequences A and B in the vector space based on hypothetical alignment of the reference sequences.
Figure 4.2: Distribution of the edit distances in the vector space grouped by MVD with varying number of dimensions
Figure 4.3: For a given MVD, the range of FVDs highlights the distribution of the edit distances. We consider the change in distribution for our prediction. For example, predicted edit distances are marked based on intervals of FVD values.
Figure 4.4: Predicting edit distances using the top two vector component differences (VD) and their corresponding frequencies (FVD). The differences of pairwise edit distance distributions for the given <VD, FVD> show the effectiveness of such strategies.
Figure 4.5: Total number of pairwise edit distance computations including the number of pairs needed for space transformation and the number of candidate pairs determined by heuristic H3
Figure 5.1: (a) A minimum cost trace with maximum number of change operations and (b) an arbitrary trace with fewer change operations
Figure 5.2: Difference of edit distances between sequences A and B by the reference sequence gl cannot predict the correct edit distance because the same letter is inserted and subsequently deleted, resulting in no change in overall frequency.
Figure 5.3: Sequences A and B are divided into fixed length sub-sequences. Each pair of sub-sequences has only one edit operation. The edit operations are predicted correctly and separately with respect to the given reference sequences using Theorem 3. Partial edit distances of the sub-sequences are merged subsequently.
Figure 5.4: The edit distance between sequences A and B is predicted from the differences of distances of primary sub-sequences. The 2nd pair of sub-sequences contains only redundant edit operations and hence is not considered for the distance prediction.
Figure 5.5: The insertion operation between the first pair of fixed length sub-sequences is not reflected in the differences of distances between them because of the small DNA alphabet.
Considering multiple adjacent letters as single elements would avoid such by-chance reduction in differences of distances.
Figure 5.6: Comparing the effect of each differences of distances calculation strategy on edit distance prediction accuracy.
Figure 5.7: Comparison of edit distance prediction time and accuracy between the MVD and the differences of distances based methods
Figure 5.8: Comparison of average edit distance prediction time between the dataset sequences using the MVD and the differences of distances based methods
Figure 6.1: (a) Minimum cost trace of a pair of sequences (strings) and (b) dividing the sequences into corresponding sub-sequences based on the minimum cost trace

KEY TO SYMBOLS AND ABBREVIATIONS

S - A sequence dataset
Σ - Alphabet of the sequences
Λ - Empty letter/element
s, A, B - Sequences in a dataset S
dist(A, B) - Generic distance between A and B
ed(A, B) - Edit distance between A and B
dh(A, B) - Hamming distance between A and B
l - Length of the largest sequence in a dataset
R - Set of reference sequences
Rl - A set of reference sequences of length l with edit distance l between them
lref - Length of a reference sequence
Ω(A) - Space transformation function for a sequence A
E - A sequence of edit operations
Em - A minimum sequence of edit operations
Emp - A physically ordered minimum sequence of edit operations
λ(i) - Function to update positions of physically ordered edit operations
k - Length of a sub-sequence
(T, A, B) - A triple representing a trace T between sequences A and B
cost(T) - Cost of a trace T
A(i, j) - Sub-string of A of length j starting from the ith position
A[i, j] - Sub-string of A from the ith position to the jth position
A[i] - ith element of string A
γ(x → y) - Cost function to change letter x into letter y
r - A reference sequence in R
ρ - A sub-sequence of a reference sequence r
AMI - Adjusted Mutual Information
NMI - Normalized Mutual Information
DNA - Deoxyribonucleic Acid
GST - Generalized Suffix Tree
RNA - Ribonucleic Acid
rRNA - Ribosomal Ribonucleic Acid
NGS - Next Generation Sequencing
NCBI - National Center for Biotechnology Information
SRA - Sequence Read Archive
VD - Vector component Difference
MVD - Maximum Vector component Difference
FVD - Frequency of the MVD
LSH - Local Sensitive Hashing
OTU - Operational Taxonomic Unit
LCS - Longest Common Sub-sequences
MST - Minimum Spanning Tree
NN - Nearest Neighbor
HIN - Heterogeneous Information Networks
MCMC - Markov Chain Monte Carlo
RBSA - Reference-Based String Alignment
RCSI - Referentially Compressed Search Index
SETH - Strong Exponential Time Hypothesis
RESTRAC - Reference Sequence based Space TRAnsformation for Clustering

Chapter 1 Introduction

The rapidly increasing number of DNA and RNA sequence fragments obtained from Next Generation Sequencing (NGS) technologies in recent years has provided valuable insights such as discovering, identifying, and classifying novel microorganisms and inferring taxonomic hierarchies. Yet, processing such data-intensive applications demands efficient techniques to handle large numbers of NGS sequences. One of the important aspects of NGS sequence analysis is partitioning sequences based on their level of similarity. Phylogenetic relationships among sequences are inferred if a set of sequences are highly similar to each other. In many cases, the sequences generated from metagenomic samples may not have annotations, and thus the distance between the letters of the sequences is the only way to find relationships between them. While a number of effective methods have been developed [1, 2, 3] for sequence similarity analysis, they are not suitable for handling large datasets due to the high computational time complexity of these methods.
For instance, the edit distance between two strings of arbitrary lengths represents their dissimilarity. Given strings A and B, the minimum number of edit operations (Definition 1) required to convert A into B is their edit distance (Definition 3) [1]. The distance measure is critical for analyzing non-ordered and unstructured datasets in diverse fields such as NGS sequences, text and graph datasets. Despite wide applicability in numerous sequence and string analysis methods, the near quadratic run time with respect to sequence lengths to compute the edit distance has been a performance bottleneck problem. For instance, a gene sequence length can vary from several thousands to several hundreds of thousands of base pairs [4]. The edit distance calculation for such large sequences is not practical. Moreover, edit distance calculation between entire genes is not necessary because only a small fraction of a genome contains meaningful gene sequences. On the other hand, in recent years, NGS technology has improved significantly, which has generated massive amounts of sequence data from targeted regions of genomes in the form of short reads. The reads are usually a few hundred base pairs long. The edit distance calculation methods are suitable for computing similarity between pairs of NGS sequences. However, for large NGS sequence datasets, we need to calculate the edit distances between very large numbers of pairs of sequences. This reinforces the necessity of faster edit distance calculation methods for large numbers of pairs of sequences. The cost of an edit operation is determined by γ(x → y).

Definition 1. Given a string A, an alphabet Σ and a null element Λ, an edit operation is a triple (i, x, y) replacing a letter x at the ith position of A with a letter y, where x, y ∈ (Σ ∪ {Λ}). An edit operation (i, x, y) is
• a substitution if x ≠ y and x, y ∈ Σ.
• an insertion if x = Λ and y ∈ Σ.
• a deletion if x ∈ Σ and y = Λ.

Definition 2.
Given string A, a temporally ordered sequence of edit operations on A is defined as E = <(i1, x1, y1), (i2, x2, y2), ...> where, for 1 ≤ j ≤ |E|, ij is the position of the edit operation and xj is replaced by yj at the ij-th position of A. For instance, let A = JUNE and a temporally ordered sequence of edit operations E = <(4, Λ, L), (3, N, Λ), (4, E, Y)>. As the edit operations are applied on A, the string is changed gradually as follows: JUNE → JUNLE → JULE → JULY.

Definition 3. Given strings A, B and assuming the cost of each edit operation is the same, the edit distance ed(A, B) is the minimum number of edit operations to convert A into B.

Figure 1.1: Example of edit operations and the corresponding edit distance between a pair of strings

For example, the minimum number of edit operations between "HONDA" and "HYUNDAI" is shown in Figure 1.1. In the first example, (2, Λ, Y) and (7, Λ, I) are the insertions and (3, O, U) is the substitution. As shown in the figure, there might be multiple representations of the minimum number of edit operations to convert one string into another. For instance, both E1 = <(2, Λ, Y), (3, O, U), (7, Λ, I)> and E2 = <(2, O, Y), (3, Λ, U), (7, Λ, I)> convert "HONDA" into "HYUNDAI" with three edit operations. It should be noted that the edit distance may not capture every modification made while converting string A into string B; only the edit operations that are not lost in the conversion from A to B are reflected in the edit distance. For example, the sequence of edit operations E2 is applied on "HONDA" to convert the string to "HYUNDAI". On the other hand, E3 = <(2, O, Λ), (2, Λ, Y), (3, Λ, U), (7, Λ, I)> can also convert "HONDA" to "HYUNDAI", using four operations whose first two cancel out. This shows that the edit distance calculation can recover only a minimal set of edit operations between a pair of sequences.
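The edit distance of Definition 3 is conventionally computed with the standard dynamic programming recurrence. The following sketch is illustrative only (it is not part of the dissertation's proposed methods) and assumes unit cost for every edit operation, as in Definition 3; it reproduces the Figure 1.1 result for "HONDA" and "HYUNDAI":

```python
def edit_distance(a, b):
    """Unit-cost edit distance (Definition 3) via the standard
    O(|a| * |b|) dynamic programming table, kept as two rows."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances from the empty prefix of a to every prefix of b
    for i in range(1, m + 1):
        curr = [i] + [0] * n   # distance from a[:i] to the empty prefix of b
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1  # match or substitution
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + sub)  # substitution / match
        prev = curr
    return prev[n]

print(edit_distance("HONDA", "HYUNDAI"))  # 3, as in Figure 1.1
```

Note that a temporally ordered sequence of edit operations need not be minimal: the JUNE example of Definition 2 uses three operations, while ed(JUNE, JULY) = 2 (two substitutions).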
1.1 Classification of Similarity Measures for NGS Sequences

Various popular distance measures, such as the edit distance, the Hamming distance, and others, are used to measure the similarities between NGS sequences. The edit distance is commonly used as the basis for measuring evolutionary relationships between sequences. For NGS sequences, the edit distance is more effective as a similarity measure for many applications because it provides the appropriate global similarity. One such important application is to discover novel OTUs in NGS datasets, which requires all-pair similarity distance calculations. Because of the very large size of NGS datasets and the high computation time of edit distance calculation, the all-pair similarity distance calculation time becomes a serious performance bottleneck for these important applications. To reduce the computation time, other similarity measures such as matching k-mer counts, reference sequence based edit distance and Euclidean distance have been proposed. In the following sections we classify the existing similarity measures, specifically for NGS sequences, based on various distance metrics.

1.1.1 NGS Sequence Similarity based on the Edit Distance

The edit distance is widely used to measure the similarities between strings. The edit distance calculation method using dynamic programming [1] is of complexity O(l²) where l is the string length. Due to the wide applicability of the edit distance, reducing the computation time complexity of the edit distance has been an active research problem. Despite these efforts, the well known algorithms for edit distance calculation run in nearly quadratic time [1, 2, 5, 3], and such techniques cannot be used for calculating edit distances of large numbers of pairs of strings. In order to reduce the time complexity of edit distance calculations, several faster methods have been proposed.
For instance, a number of edit distance approximation methods have been proposed [6, 7, 8, 9, 10, 11, 12] with computational complexity ranging from sub-quadratic to near linear with respect to sequence lengths. Further, a number of methods have been proposed that guarantee finding the exact edit distance up to a given bound [5, 13]. For example, the state-of-the-art bounded exact edit distance calculation method runs in strictly sub-quadratic time [13] with respect to sequence lengths.

Figure 1.2: NGS sequence similarity thresholds at genus and species levels have been revised over the years. With few exceptions, the species and genus similarities remain at 97% and 95% respectively. This justifies the usefulness of more accurate edit distance prediction and calculation methods for smaller edit distance bounds.

The bounded exact and the approximate edit distance calculation methods are more appropriate than the standard edit distance calculation methods for NGS sequence similarity because sequence similarity is evaluated primarily up to the genus and species levels [14, 15, 16, 17]. As shown in Figure 1.2, species and genus level similarities of 16S rRNA in bacterial samples are 98.7% and 95% respectively, with few exceptions [18, 19, 20]. Further, 97% similarity [16, 18] is often used for creating Operational Taxonomic Units (OTUs) from metagenomic samples. Despite the improved time complexity of the bounded exact and approximate edit distance calculation methods, they may not be suitable for NGS sequence similarity methods when the numbers of sequence pairs are very large.

1.1.2 NGS Sequence Similarity based on the Matching k-mer Counts

In order to further reduce the sequence similarity computation time between large numbers of pairs of NGS sequences, similarities of short k-mers (Definition 4) are used to evaluate the similarities of the NGS sequences. Here, the percentage of matching k-mers is used as a measure of similarity between the sequences.
These approaches give close results when the edit distance is small, but the accuracy deteriorates with increasing edit distances.

Definition 4. k-mers are all possible sub-sequences of length k within a sequence.

For example, in Figure 1.3, all the 7-mers of a sample sequence are shown.

Figure 1.3: All possible k-mers in a sequence

The exact matching of the k-mers in their corresponding positions is the basis of Locality Sensitive Hashing (LSH) [21, 22]. These methods assume that exact matching between the randomly chosen k-mers represents the edit distance between the sequences. This can be true only if the edit operations are substitutions and the positions are valid with respect to the smaller sequence length. For example, in Figure 1.4(a), the edit distance between the sequences is one. As shown in Figure 1.4(b), two pairs of sub-sequences are generated by randomly choosing the k-mers of length one and subsequently merging them. The distances between both pairs of sub-sequences are greater than the actual edit distance.

Figure 1.4: Comparison of sequence similarity based on (a) the edit distance, and (b) Locality Sensitive Hashing (LSH) based methods

Figure 1.5: Comparison of sequence similarity based on shared k-mers. For a given edit distance, the sequence similarity may vary depending on the uniqueness of the k-mers.

On the other hand, many k-mer based exact matching sequence similarity methods do not consider the positions of the k-mers [23, 24]. Instead they take the set of k-mers generated from each sequence and compute their similarity based on the exactly matched subset of k-mers. These methods determine sequence similarity nearly correctly when the sequences are very close to each other. However, when the sequences are far apart with respect to the edit distance, they cannot determine the dissimilarity between the sequences satisfactorily.
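Definition 4 and the position-free shared k-mer similarity can be sketched as follows (function names are ours; sequences are assumed to be at least k letters long):

```python
def kmers(seq: str, k: int) -> set:
    """All distinct k-mers (Definition 4) of a sequence; positions are ignored."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmer_similarity(a: str, b: str, k: int) -> float:
    """Jaccard-style similarity over the distinct k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb)

assert kmers("abcde", 2) == {"ab", "bc", "cd", "de"}
# A single substitution destroys up to k overlapping k-mers, so one edit
# operation can change the similarity score disproportionately:
assert shared_kmer_similarity("abcdefgh", "abcdXfgh", 3) == 3 / 9
```

The last assertion shows why shared k-mer counts track small edit distances only loosely: one substitution removed three of the six 3-mers.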
For instance, in Figure 1.5(a), an example sequence and the corresponding set of distinct 2-mers are shown. Assuming one edit operation in this sequence, Figures 1.5(b)-(e) show all possible updated sequences and their corresponding sets of 2-mers. The 2-mer based similarity between these sequences varies between 0 and 2. Hence, k-mer based sequence similarity alone is not sufficient for determining the edit distance. Existing k-mer based methods improve the sequence similarity accuracy by applying costly edit distance based methods when stringent accuracy is required [25, 23]. They become very slow with the increase of sequence lengths and the number of edit distance calculations.

1.1.3 A Novel Class of NGS Sequence Similarity Measures based on Reference Sequence based Predicted Edit Distances

In this dissertation, a new class of NGS sequence similarity measures is proposed that provides significantly better performance both in terms of quality and time over existing methods. An overview of our proposed methods is given in the following section.

1.2 An Overview of our Novel Edit Distance Prediction Scheme

We propose a novel class of similarity measures that improve performance in both quality and time. Our approaches to improve edit distance calculation of sequences in NGS datasets are based on the following goals. The first goal is to reduce the edit distance prediction or computation time without using the matching k-mer count based sequence similarity. Since the majority of NGS sequence analysis problems can be solved by finding highly similar pairs of sequences, our second goal is to improve prediction accuracy when the edit distances between pairs of NGS sequences are small. In order to achieve these goals, we propose a few reference sequence based edit distance prediction methods in the vector space, followed by a bounded exact edit distance calculation method.
For the proposed edit distance prediction methods, we convert each sequence in a dataset into a feature vector using a set of reference sequences. Suppose S refers to a dataset of N sequences, {s1, s2, . . . , sN}, on alphabet Σ.

Definition 5. Given a set of n reference sequences, R = {r1, r2, . . . , rn}, a sequence s in the dataset is mapped to an n-dimensional feature vector using the transformation function Ω(s):

∀s ∈ S : Ω(s) = x where x = (x1, x2, . . . , xn) and ∀j ∈ {1, . . . , n}, rj ∈ R : xj = dist(s, rj)    (1.1)

where the function dist(s, rj) is a dissimilarity metric, for example, the edit distance between sequence s and reference sequence rj.

Figure 1.6: Schematic representation of general space transformation and distances among the dataset sequences with respect to reference sequences.

According to Definition 5, each sequence in a dataset is represented by an n-dimensional feature vector x. Figure 1.6 shows the schematic representation of such a transformation, in which sequence A and sequence B are mapped into feature vectors x^A and x^B respectively using two reference sequences r1 and r2.

We explored the desired properties of the reference sequences because reference sequences are key to the edit distance prediction accuracy and low computation cost. We evaluated the effectiveness of the reference sequences based on their lengths, the distance among the sequences, the number of sequences, and the alphabet size. We also examined various methods of creating the reference sequences, such as randomly generated versus deterministically generated, and selected from the dataset versus generated independently of the dataset. More detail about the properties of reference sequences is presented in the later chapters.

Our first edit distance prediction method is based on the Euclidean distance (Definition 6) between Ω(A) and Ω(B) for A, B ∈ S. Suppose Ω(A) = (x^A_1, x^A_2, . . . , x^A_n) and Ω(B) = (x^B_1, x^B_2, . . . , x^B_n).
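The transformation Ω of Definition 5 can be sketched directly. The helper `edit_distance` repeats the standard dynamic program, and the example reference sequences are ours:

```python
def edit_distance(a, b):
    """Standard Wagner-Fischer dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def omega(s: str, refs: list) -> tuple:
    """Definition 5: map s to (dist(s, r1), ..., dist(s, rn))."""
    return tuple(edit_distance(s, r) for r in refs)

# Two illustrative reference sequences; each dataset sequence becomes a 2D point.
refs = ["acgtacgt", "ttttcccc"]
xA = omega("acgaacgt", refs)
xB = omega("acgtacct", refs)
```

All subsequent distance computations then operate on the numeric vectors xA and xB instead of the sequences themselves.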
We generate a lookup table between the Euclidean distances and the edit distances. This lookup table depends on the number of dimensions in the feature vectors.

Definition 6. The Euclidean distance is the straight-line distance between two points in Euclidean space. With this distance, Euclidean space becomes a metric space. The associated norm is called the Euclidean norm or L2 norm. If x^A = (x^A_1, x^A_2, . . . , x^A_n) and x^B = (x^B_1, x^B_2, . . . , x^B_n) are two points in Euclidean n-space, then the distance from x^A to x^B is given by:

dist(x^A, x^B) = √((x^A_1 − x^B_1)² + (x^A_2 − x^B_2)² + · · · + (x^A_n − x^B_n)²) = √(Σ_{i=1}^{n} (x^A_i − x^B_i)²).

Definition 7. The Chebyshev distance is a metric defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension [26]. For example, in Figure 1.7, the Chebyshev distance between the 2D vectors (4, 2) and (13, 10) is 9.

Figure 1.7: The Chebyshev distance between a pair of 2D feature vectors

Our second edit distance prediction method in the vector space is based on the Maximum Vector component Difference (MVD), i.e., the Chebyshev distance. Differences between the corresponding dimensions of the feature vectors are bounded by the edit distance between each pair of sequences. Given the feature vectors (x^A_1, x^A_2, . . . , x^A_n) and (x^B_1, x^B_2, . . . , x^B_n), the vector component differences form the bag of (x^A_j ∼ x^B_j) where 1 ≤ j ≤ n. The maximum difference between any x^A_j and x^B_j is the Chebyshev distance. As we increase the number of reference sequences, the Chebyshev distance is likely to get closer to the actual edit distance. The frequency of any (x^A_j ∼ x^B_j) is also helpful for improving the edit distance prediction accuracy in the vector space.
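Definitions 6 and 7 in code (function names are ours):

```python
import math

def euclidean(x, y):
    """Definition 6: L2 distance between feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def chebyshev(x, y):
    """Definition 7: maximum vector component difference (MVD)."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

# The Figure 1.7 example: max(|4 - 13|, |2 - 10|) = 9.
assert chebyshev((4, 2), (13, 10)) == 9
```

Either function is evaluated on the Ω-transformed vectors, which costs O(n) per pair instead of a quadratic-time edit distance computation.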
Using these strategies, we are able to predict pairwise edit distances in the vector space more efficiently than with the Euclidean distance based method. The MVD based method predicts small edit distances with near perfect accuracy, but the accuracy decreases as the edit distances become larger. In order to correctly predict the edit distance, a reference sequence must find all the edit operations with respect to one dataset sequence of the pair and none with respect to the other. Given alphabet Σ and sequence length l, the number of all possible random reference sequences is |Σ|^l. Since the edit operations between pairs of sequences in a dataset can appear arbitrarily, when only a few reference sequences are used, the probability of correctly predicting larger edit distances decreases with increasing sequence length and increasing number of edit operations between pairs of sequences in a dataset. On the other hand, a large number of reference sequences improves the prediction accuracy at the cost of higher computation time. For even larger edit distances, the benefit of increasing the number of reference sequences for better edit distance prediction diminishes, as it becomes increasingly unlikely to find all the edit operations from the difference of edit distances of sequences A and B with respect to a reference sequence.

Hence, our third edit distance prediction method is based on a small set of simple reference sequences. Each reference sequence is made up of a single letter of the alphabet. For example, given an alphabet Σ = {a, c, g, t}, the reference sequences are {aaaa..., cccc..., gggg..., tttt...}. Even though the number of sequences is small and, therefore, the computation time is low, the prediction accuracy of this method degrades rapidly, particularly when predicting higher edit distances. Suppose A = gcatacg and B = gcctaag, with substitution operations at the 3rd and 6th positions. Then ed(A, ccccccc) = 5 and ed(B, ccccccc) = 5.
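This example can be verified with a short computation; every homopolymer reference sees the same distance to both sequences (the helper `edit_distance` repeats the standard dynamic program):

```python
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

A, B = "gcatacg", "gcctaag"          # ed(A, B) = 2 (substitutions at positions 3 and 6)
assert edit_distance(A, B) == 2
for ch in "acgt":
    r = ch * 7                        # simple homopolymer reference sequence
    # the per-reference difference ed(A, r) - ed(B, r) is 0 for every letter,
    # so whole-sequence homopolymer references cannot see the two substitutions
    assert edit_distance(A, r) == edit_distance(B, r)
```

For a homopolymer reference of the same length, the edit distance is simply the number of positions not holding that letter, which is why A and B are indistinguishable here.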
Thus the difference of the edit distances of A and B with respect to ccccccc is 0. Similarly, the differences of the edit distances of A and B with respect to aaaaaaa, ggggggg and ttttttt are 0. In order to minimize this problem, we compute the edit distances between corresponding non-overlapping sub-sequences of A and B separately. We then develop merging strategies to obtain the edit distance for the entire sequences from the partial edit distances. This significantly improves the edit distance prediction accuracy when using the simple reference sequence set.

1.3 An Overview of the Proposed Bounded Exact Edit Distance Calculation

In the previous sections we have discussed a classification of existing similarity measures and our proposed new class of similarity measures for NGS sequences. In this section we give an overview of our new approach to compute the bounded exact edit distance, which uses the similarity measure described in Section 1.1.1. This method calculates the edit distance of NGS sequences up to a predefined distance bound d. Although the proposed edit distance prediction methods described in Section 1.2 produce high accuracy in several scenarios, they cannot guarantee finding the exact edit distances. Since many genome sequence analysis methods require edit distances only within a small bound, we propose an efficient bounded edit distance calculation method. Assuming d << l, the proposed method splits the sequences into sub-sequences of length up to 2d, because the corresponding letters in the sequences can be up to d positions apart. Then the method computes edit distances between the sub-sequences. For example, given sequences A and B, we split the sequences into A = A1.A2 and B = B1.B2 such that there is no edit operation between A1 and B2, and vice versa. Then ed(A1, B1) finds the minimum number of edit operations with respect to ed(A, B). We merge the partial edit distances of the sub-sequences to find ed(A, B).
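For context, the observation that corresponding letters can be at most d positions apart also underlies the classical banded ("Ukkonen-style") dynamic program: only cells within d of the diagonal can lie on an optimal path when ed(A, B) ≤ d. The sketch below illustrates that standard technique; it is not the splitting-and-merging method proposed in this dissertation:

```python
def bounded_edit_distance(a, b, d):
    """Banded DP: fill only cells with |i - j| <= d.
    Returns ed(a, b) if it is <= d, otherwise None."""
    if abs(len(a) - len(b)) > d:
        return None                     # length difference alone exceeds the bound
    INF = d + 1
    prev = {j: j for j in range(min(len(b), d) + 1)}   # row 0 inside the band
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - d), min(len(b), i + d) + 1):
            if j == 0:
                cur[j] = i
            else:
                cost = 0 if a[i - 1] == b[j - 1] else 1
                cur[j] = min(prev.get(j - 1, INF) + cost,   # substitution/match
                             cur.get(j - 1, INF) + 1,       # insertion
                             prev.get(j, INF) + 1)          # deletion
        prev = cur
    res = prev.get(len(b), INF)
    return res if res <= d else None

assert bounded_edit_distance("HONDA", "HYUNDAI", 3) == 3
assert bounded_edit_distance("HONDA", "HYUNDAI", 2) is None
```

The run time is O(l · d) instead of O(l²), which is the kind of saving the d << l assumption buys.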
1.4 Thesis Outline

The structure of this dissertation is as follows:
• In chapter 2, we present a literature review on edit distance calculation and approximation techniques as well as heuristic based alternatives developed to predict the edit distances.
• In chapter 3, we present the edit distance prediction technique based on the Euclidean distance between the feature vectors in the vector space.
• In chapter 4, we present the edit distance prediction technique in the vector space based on MVD (the Chebyshev distance) between corresponding components of the feature vectors and their frequencies.
• In chapter 5, instead of computing the edit distance between the entire sequences, we compute the distance between fixed length sub-sequences separately and subsequently merge them.
• The bounded exact edit distance calculation technique (Section 1.3) is presented in chapter 6.
• We summarize the dissertation contributions, provide a general discussion of this work, and propose possible extensions in chapter 7.

Chapter 2

Related Works

In this chapter, we present a literature review on existing NGS sequence similarity methods. We present the existing edit distance calculation methods that are frequently used to determine the minimum number of changes required to convert one sequence into another. Then we review the faster NGS sequence similarity computation methods based on matching k-mers. A majority of the exact matching k-mer based methods are used as intermediate filtering steps to reduce the number of pairs of sequences passed to the costly edit distance calculation. We also review the existing clustering methods for NGS sequences because they are frequently used for metagenomic NGS sequence analysis. In addition, we review the existing reference sequence based methods used for longer genome sequences to show their potential usage in shorter NGS sequence similarity.
2.1 Edit Distance based NGS Sequence Similarity

The edit distance measures the dissimilarity between a pair of strings by determining the minimum number of changes to convert one string to another. The standard dynamic programming [1] based edit distance computation complexity is O(l²), l being the sequence length. There have been many efforts to reduce the quadratic computational complexity because of its wide applicability to string comparison. Despite these efforts, the well known algorithms for edit distance calculation run in nearly quadratic computational complexity [2, 3, 27]. Masek and Paterson [2] proposed an edit distance calculation method of time complexity O(l²/log² l) when the alphabet is constant. Bille and Farach-Colton [27] removed the constraint of the constant alphabet imposed in [2] at the cost of a factor α = log² log l, resulting in a total complexity of O(l² log² α/α²). Recently, Backurs and Indyk [3] indicated that the near quadratic run time complexity of edit distance calculation might be tight. With the help of the Strong Exponential Time Hypothesis (SETH) [28], they showed that no edit distance calculation method of O(l^(2−α)) with α > 0 exists. In order to reduce the time complexity significantly, many other edit distance calculation methods have been proposed by relaxing the accuracy requirement [6, 7, 8] or by guaranteeing correct distances only up to a threshold [5, 13].

2.1.1 Edit Distance Approximation

Nearly quadratic run time based edit distance calculation methods do not reduce the computation time enough to be practical for computing NGS sequence similarity for larger numbers of pairs of sequences. On the other hand, the computational complexity of the edit distance calculation can be reduced by relaxing the accuracy requirement [6, 7, 8, 9, 10, 11, 12]. Landau et al. [9] proposed a linear run time incremental edit distance approximation of factor √l where l is the string length.
Despite the low time complexity, the approximation factor was too large for high similarity threshold NGS sequence similarity search. Later research efforts on this problem focused on reducing the approximation factor without increasing the time complexity. For instance, Bar-Yossef et al. [10] reduced the approximation factor to l^(3/7) with a quasi-linear edit distance approximation method. Batu et al. [11] further reduced the approximation factor to O(l^(1/3+o(1))). Recently, Andoni et al. [12] proposed a near linear run time edit distance approximation method between strings of length l and reduced the approximation factor significantly over the previously proposed methods, down to a factor of 2^O(√log l). On the other hand, the most recent edit distance approximation algorithm, proposed by Chakraborty et al. [8], improved the approximation bound to a constant factor at the cost of higher computation complexity, O(n^(2−2/7)). Hence, the existing edit distance approximation algorithms trade off between the approximation factor and the computation complexity. Yet, the approximation factor may be high for smaller edit distance thresholds.

2.1.2 Bounded Exact Edit Distance Calculation

Before presenting the existing bounded edit distance calculation methods, we justify their applicability to NGS sequence similarity. As shown in Figure 1.2, the species and genus level similarities of NGS sequences primarily fall within 97% and 95% respectively. As a result, many metagenomic NGS sequence similarity based analysis methods focus on high similarity thresholds, such as OTUs. For high similarity thresholds, the bounded exact edit distance calculation methods are a less costly alternative to the standard edit distance calculation methods without losing accuracy. Abboud et al. [5] proposed a d-edit operation based LCS method of time complexity O(l²d^1.5 / 2^Ω(√(log n/d))).
This time complexity is only a marginal improvement over that of the fastest unbounded edit distance calculation method [2]. Thankachan et al. [13] proposed a strictly sub-quadratic algorithmic framework for the approximate matching problem with a limited number of edit operations. The framework is claimed to be the first strictly sub-quadratic bounded edit distance calculation method for a fixed number d of edit operations. The algorithmic framework works for any arbitrary edit distance, although smaller edit distances are desired for optimized performance. The framework relies on converting an approximate matching problem into an exact matching problem. In order to do so, they initially create a Generalized Suffix Tree (GST) from the input sequences and gradually modify the suffix tree for all possible suffixes of d edit operations. The computational complexity of this algorithm is bounded by the creation and modification of the suffix tree, which is O((Nl) log^d(Nl)) where N is the number of sequences in a dataset. The all-pairwise distance calculation of this method is O(((Nl) + OCC) log^d(Nl)) where d is a constant and OCC is the output size. Hence, the edit distance calculation time of this algorithm is better than the quadratic comparison of the existing methods regardless of the cost of calculating the distance between a pair of sequences. However, for an increasing edit distance bound, the polylogarithmic factor log^d(Nl) could be larger than (Nl) when d > |Σ|, which would be the usual case for sequence similarity of NGS datasets. The time requirement of gradually updating the suffix tree for all possible d edit operations increases rapidly with the growth of d.

2.2 NGS Sequence Similarity based on Matching k-mers

In order to further reduce the sequence similarity computation time between large numbers of pairs of NGS sequences, many matching k-mer based sequence similarity methods have been proposed.
Locality Sensitive Hashing (LSH) [22] assumes that global dissimilarity between two sequences is preserved in randomly generated fixed length sub-sequences. These methods use two-level hashing, for instance, multiple random subsets of dimensions and a hash table of the sub-sequences, to narrow down the number of candidate pairs of sequences in a dataset. While the strategy reduces the number of candidate pairs, it has two limitations. First, the similarity of a pair of sub-sequences is measured using either the Hamming, Euclidean or Jaccard index distance of each dimension. The method cannot calculate edit distances of randomly generated subsets because such distances would not represent the global edit distance between two sequences. It is worth mentioning that the edit distance can address both insertions and deletions in a sequence while the Hamming distance cannot. Second, sequence lengths are required to be exactly the same for the hashing, although it is very common for the sequences to vary in length. The minHash [29] and the shared k-mers [23] based methods ignore the position of a k-mer and compute NGS sequence similarity based on the k-mers only. For instance, the minHash based methods extract the k-mers to create a signature based on min-wise hashing from a subset of k-mers. Although they can handle variable length sequences, unlike the LSH based methods, they are not very effective at separating highly similar pairs of sequences from dissimilar ones. On the other hand, the shared k-mer based methods consider all the distinct k-mers instead of a subset of them. This improves the sequence similarity over that of the minHash based methods. However, they still fail to satisfactorily separate the highly similar sequences from the dissimilar ones when the k-mer length is small. Slidesort [25] used a pattern growth strategy for the k-mers to handle the problem of randomly matched k-mers and to avoid using only a subset of them.
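A minimal illustrative sketch of min-wise hashing over k-mer sets follows. It is not the exact scheme of [29]: the seeded hash functions are simulated here by salting a cryptographic hash, and all names are ours:

```python
import hashlib

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(seq, k=5, num_hashes=32):
    """For each simulated hash function, keep the minimum hash over all k-mers."""
    kms = kmers(seq, k)
    sig = []
    for seed in range(num_hashes):
        sig.append(min(int(hashlib.sha1(f"{seed}:{km}".encode()).hexdigest()[:8], 16)
                       for km in kms))
    return sig

def signature_similarity(sa, sb):
    """Fraction of matching signature slots; estimates the k-mer Jaccard index."""
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

Because the signature length is fixed, variable length sequences can be compared, but positional information (and hence the edit distance) is not recoverable from the signatures.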
The method reduces the number of candidate pairs for a given distance threshold. The pattern growth strategy ensures that no true positive pairs of sequences are missed. Although this method is very fast for short reads (≤ 100 bp), it becomes very slow for longer reads and larger edit distance bounds. It should be noted that these sequence similarity based methods use the existing edit distance calculation methods for verifying the similarity between the pairs of sequences. In other words, these methods primarily provide filtering to reduce the sequence similarity computation time rather than being standalone methods. As presented in Section 2.1, there is room for improving the pairwise sequence similarity computation when considering very large numbers of pairs of sequences in a dataset.

2.3 Reference Sequence based Genome Sequence Analysis Methods

Reference sequences have been frequently used to represent genomic sequences in datasets [30, 31, 32, 33] to improve computation time with nominal loss in accuracy. For instance, RBSA [31] used a set of reference sequences to compute alignments between segments of genome sequences. For the query search, they did the same with segments of query sequences and compared the differences to reduce the candidate segments of sequences in datasets. However, between the candidate pairs, the exact alignment scores are computed using existing computationally intensive methods. Referentially Compressed Search Index (RCSI) [32] presented a compression method where large genome sequences are compressed with respect to reference sequences. Initially the query search is performed on the reference sequences, followed by evaluating individual differences of the genome datasets with respect to the reference sequences. The contributions of this method, such as reduced space requirement and speedup, significantly inspired us to explore the reference sequence based method on NGS sequence datasets.
In addition, a reference sequence based indexing method is proposed in [30] to reduce the number of edit distance comparisons. They also proposed different strategies for the selection of reference sequences, as the properties of the reference sequences are critical for the outcome of any such indexing method. However, for the candidate pairs they used the existing computationally expensive edit distance calculation. Although the reference sequence based representation of database sequences has been used in many applications [30, 31, 32, 33], until recently this approach had not been used for edit distance prediction. We have proposed edit distance prediction methods [34, 35] based on reference sequences for NGS sequence similarity. This approach worked well for smaller edit distances; however, the prediction accuracy fell rapidly for larger edit distances.

2.4 Clustering Methods

Sequence similarity based clustering is a suitable application for grouping targeted gene fragments (amplicons) from metagenomic samples based on their phylogenetic similarity. It also evaluates the effectiveness of the NGS sequence similarity of the aforementioned methods with respect to required time, quality of the similarity measures, and the ability to provide a gradual similarity hierarchy efficiently. It should be noted that our goal is not to improve the clustering methods, but rather to improve the sequence similarity comparison, which may be beneficial for clustering quality and time.

2.4.1 Agglomerative Hierarchical Clustering

Agglomerative hierarchical clustering algorithms using single linkage and average linkage aggregation methods are frequently used for clustering NGS sequence datasets generated from phylogenetic marker genes. Hierarchical clustering presents the gradual dissimilarity hierarchy between the sequences and does not require a pre-defined number of clusters, which is fundamentally similar to the phylogenetic hierarchy.
However, they require comparing all pairs of sequences in a dataset, which is costly with respect to space and computational complexity [16, 36, 37, 38, 39, 40]. For instance, Mothur [37] is a widely used hierarchical clustering method. It incorporated multiple sequence alignment using MUSCLE [41], which reduced the computational cost of dissimilarity measurements and resulted in significant speedup in clustering [16]. However, multiple sequence alignment with MUSCLE has time complexity O(N³l) and space complexity O(N² + Nl + l²) [41].

2.4.2 Heuristics based Clustering

In order to avoid costly hierarchical clustering, several hashing based clustering methods have been proposed using the k-mer similarity of the sequences (Section 2.2). For instance, greedy heuristic clustering methods such as UCLUST and VSearch [24, 42, 23] merge sequences in an online-learning paradigm. Initially they start with a seed sequence assigned to a cluster; then each new sequence is either assigned to one of the currently existing clusters whose representative sequence satisfies a distance threshold, or it forms a new cluster. Similarly, LSH based NGS sequence clustering methods determine the candidate pairs of sequences based on the similarity of randomly generated fixed length sub-sequences of pairs of sequences. However, many false positive candidate pairs are generated from the LSH based hashing. Without calculating the edit distances between the candidate pairs of sequences, it is difficult to improve the clustering quality. Hence, the heuristic based methods significantly speed up the clustering at the cost of accuracy. Furthermore, they cannot provide a hierarchical interpretation of the resulting clusters. Other existing clustering methods have also been used for NGS sequence clustering.
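The greedy seed-and-assign loop described above can be sketched as follows. This is illustrative only: UCLUST and VSearch select candidates via optimized k-mer indexes rather than a direct distance call, and all names here are ours:

```python
def greedy_cluster(seqs, threshold, dist):
    """Greedy seed-based clustering: each sequence joins the first cluster whose
    representative (seed) is within the distance threshold; otherwise it seeds
    a new cluster. Result depends on the input order, unlike hierarchical methods."""
    reps, clusters = [], []
    for s in seqs:
        for i, r in enumerate(reps):
            if dist(s, r) <= threshold:
                clusters[i].append(s)
                break
        else:                      # no existing representative is close enough
            reps.append(s)
            clusters.append([s])
    return clusters

# Toy example with a Hamming distance on equal-length sequences:
hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
assert greedy_cluster(["aaaa", "aaat", "gggg"], 1, hamming) == [["aaaa", "aaat"], ["gggg"]]
```

The order dependence and the single-representative comparison are exactly what makes these methods fast but unable to produce a gradual similarity hierarchy.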
For instance, an unsupervised Bayesian clustering method [43] was implemented that uses Markov Chain Monte Carlo (MCMC) sampling to obtain an optimal clustering solution. The probabilistic Bayesian approach resolves the sensitivity and hard-cutoff issues associated with hierarchical clustering algorithms. However, it is not a practical method for large datasets.

Chapter 3

The Euclidean Distance based Edit Distance Prediction

In this chapter, we present a reference sequence based efficient edit distance prediction technique in the vector space. Each sequence in a dataset is converted into a feature vector by computing the edit distances between the sequence and each of the reference sequences. The Euclidean distance is computed between the corresponding feature vectors to predict their edit distance. The distance computation in the vector space significantly reduces the required time compared to that of the costly edit distance calculation between the pairs of sequences. In order to show that the relative distances between the sequences are likely to be preserved in the feature vectors, we developed probabilistic bounding models for the Euclidean distances in the vector space with respect to the edit distances of pairs of sequences. Further, we developed a mapping between the edit distances and the Euclidean distances in the vector space. This mapping is used for predicting the edit distances. Using this transformation, numerous statistical analysis techniques and faster nearest neighbor search algorithms that require numeric data types become computationally feasible. For instance, this technique is very effective for agglomerative hierarchical clustering techniques [38, 36] because they require edit distance calculations between each pair of sequences. Assuming large numbers of sequences in a dataset, the cost of computing edit distances between each pair of sequences becomes a performance bottleneck.
As a result, faster methods at the expense of lower accuracy have been explored in recent years. The emergence of very large datasets [44, 45] and domain specific problems such as OTU-clustering [39, 40] have made edit distance prediction techniques even more desirable.

3.1 Space Transformation

In this section, we present the reference sequence based space transformation in more detail. As defined in Section 1.2, the space transformation function converts a sequence in a dataset into an n-dimensional feature vector. For the sake of simplicity, we use the Hamming distance in the vector space and then extend the findings to the edit distance based technique. Given A, B ∈ S, suppose the corresponding feature vectors are x^A and x^B based on the reference sequences r_1 and r_2. Thus, x^A = (x^A_1, x^A_2) and x^B = (x^B_1, x^B_2), and the dissimilarity between A and B can be computed using the Euclidean distance in the vector space as follows:

∀A, B ∈ S : dist(Ω(A), Ω(B)) = ‖x^A − x^B‖ = ‖∆x‖    (3.1)

where ∆x = (∆x_1, . . . , ∆x_n). For each dimension i, ∆x_i ≤ d where d = dist(A, B), as the following theorem shows.

Theorem 1. For each dimension i and any pair of sequences A and B, ∆x_i ≤ d, where d = dist(A, B) is any distance in the metric space.

Proof. According to the triangle inequality property of distance in a metric space:

x^A_i ≤ d + x^B_i and x^B_i ≤ d + x^A_i → |x^B_i − x^A_i| ≤ d    (3.2)

As a result of Theorem 1, we know that for each dimension i, the ∆x_i of any given pair of sequences is bounded by their Hamming distance d_h. Then, using the multinomial distribution, the probability Pr(|∆x| = δ) can be written as

Pr(|∆x| = δ) = Σ_{k=0}^{(d_h − δ)/2} α × d_h! × 2^{−(d_h + 2k + δ)} / ((d_h − 2k − δ)! × (k + δ)! × k!)    (3.3)

where α = 1 for δ = 0, and α = 2 otherwise.
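Before turning to the expected distances, the transformation and the per-dimension bound of Theorem 1 can be checked with a short sketch. This is a toy illustration, not the dissertation's implementation; the sequence lengths, alphabet and counts are arbitrary choices:

```python
import random

def edit_distance(a: str, b: str) -> int:
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))      # substitution/match
        prev = cur
    return prev[-1]

def transform(seq: str, refs: list) -> list:
    # Feature vector: edit distance from seq to each reference sequence.
    return [edit_distance(seq, r) for r in refs]

random.seed(0)
alphabet = "ACGT"
refs = ["".join(random.choice(alphabet) for _ in range(30)) for _ in range(5)]
A = "".join(random.choice(alphabet) for _ in range(30))
B = "".join(random.choice(alphabet) for _ in range(30))

d = edit_distance(A, B)
xA, xB = transform(A, refs), transform(B, refs)
# Theorem 1: every per-dimension difference is bounded by ed(A, B).
assert all(abs(a - b) <= d for a, b in zip(xA, xB))
```

The assertion holds for any choice of sequences and references, since it is exactly the triangle inequality of Theorem 1.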
Using the probability for |∆x|, the expected value of the Euclidean distance for any pair of sequences A and B, after their transformation into feature vectors with independent random reference sequences, is given by

√( n × Σ_{δ=0}^{d_h} δ² Pr(|∆x| = δ) )    (3.4)

Using the bounds we defined in Theorem 1 for ∆x_i, i.e., 0 ≤ |∆x_i| ≤ d_h for the Hamming distance, combined with the probability distribution of the ∆x_i for each individual dimension, we get the probability distribution for the Euclidean distances in the vector space by applying the convolution over the individual distributions.

Figure 3.1 shows the probability distribution of the Euclidean distance in the vector space by varying the number of dimensions n = <1, 2, 10, 20, 50, 100> for the Hamming distances d_h = <5, 10, 15, 20>. When n = 1, the most probable distance in the vector space is 1, irrespective of the actual Hamming distance. Therefore, there is a high probability that sequences that are far with respect to the Hamming distances (e.g., d_h = 20) will appear to be as close as sequences with smaller Hamming distances (e.g., d_h = 5). This implies that the distinction between the Hamming distances and the Euclidean distances between pairs of sequences will be lost.

Figure 3.1: Theoretical probability distribution of L1 distances in vector space formed by the Hamming distances with varying number of dimensions, i.e., reference sequences.

However, as the number of reference sequences increases, the overlap between probability distributions for different Hamming distances decreases, which increases the distinction between largely dissimilar sequences and similar ones. Figure 3.2 shows the 99.9% intervals obtained theoretically from the convolution of distributions and the expected values of the Euclidean distances by Equation 3.4 in the vector space as functions of the Hamming distance between pairs of sequences in a dataset.
The results are compared against the mean, minimum, and maximum observed Euclidean distances from a simulation where 10⁵ random pairs of sequences of length 20 were generated and transformed using 100 random reference sequences. In this experiment, different transformation functions (i.e., by the Hamming distance or by the edit distance) are used, and the predicted Hamming and edit distances are compared with the corresponding actual distances between pairs of sequences.

Figure 3.2: Theoretical expectation of the Euclidean distances and their 99.9% intervals in the vector space with 100 reference sequences by the Hamming distance (upper panel) and the edit distance (lower panel), as a function of the actual Hamming and edit distances between a random pair of sequences, compared with experimental lower and upper bounds of observed distances from a random sampling simulation.

Since the theorems are based on the Hamming distance, the theoretical expected values of the Euclidean distances under the Hamming distance based transformation, as a function of the actual Hamming distance between sequences, match very closely with the experiment. When the space transformation is performed using the edit distances, the minimum observed Euclidean distances from the experiment fall below the 99.9% interval, but the experimental maximum still matches the theoretical upper bound (Figure 3.2, top panel). On the other hand, when the theoretical bounds for the Euclidean distances are compared against the actual edit distances between pairs of sequences, the maximum observed Euclidean distance under the Hamming distance based transformation exceeds the theoretical upper bound while its minimum matches the theory (Figure 3.2, bottom panel). This observation supports that the Euclidean distances between pairs of feature vectors with high dimensions will be large when the actual Hamming or edit distances are large.
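The random-sampling simulation described above can be mimicked in a few lines. The sketch below is illustrative only (smaller trial counts than the original experiment) and uses the Hamming distance based transformation; it checks that pairs at Hamming distance 20 map, on average, farther apart than pairs at Hamming distance 5:

```python
import random

def hamming(a, b):
    # Hamming distance between equal-length strings.
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def transform(seq, refs):
    # Feature vector: Hamming distance to each reference sequence.
    return [hamming(seq, r) for r in refs]

def euclidean(x, y):
    return sum((p - q) ** 2 for p, q in zip(x, y)) ** 0.5

random.seed(1)
alphabet = "ACGT"
L, n = 20, 100
refs = ["".join(random.choice(alphabet) for _ in range(L)) for _ in range(n)]

def random_pair(dh):
    # Build a pair of length-L sequences at exact Hamming distance dh.
    a = [random.choice(alphabet) for _ in range(L)]
    b = a[:]
    for pos in random.sample(range(L), dh):
        b[pos] = random.choice([c for c in alphabet if c != a[pos]])
    return "".join(a), "".join(b)

def mean_euclidean(dh, trials=100):
    total = 0.0
    for _ in range(trials):
        a, b = random_pair(dh)
        total += euclidean(transform(a, refs), transform(b, refs))
    return total / trials

# With many reference sequences, far pairs remain distinguishable from near pairs.
assert mean_euclidean(5) < mean_euclidean(20)
```

With n = 100 dimensions the two distributions separate clearly, echoing the behavior shown in Figures 3.1 and 3.2.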
Similarly, the Euclidean distances will be smaller when the actual Hamming or edit distances are smaller.

Proposition 1. Distance conservation: With high probability, a pair of dissimilar sequences will be mapped far from each other, and a pair of similar sequences will be mapped close to each other in an n-dimensional vector space.

According to this proposition, if the Hamming distance for a pair of sequences is large, with high probability they will be mapped far from each other in the n-dimensional vector space, provided that n is large. At the same time, according to Theorem 1, the space transformation guarantees that if a pair of sequences are close to each other with respect to the Hamming distance, their distance will be bounded by d_h. As a result, the space transformation will conserve the ordering among pairwise distances based on the edit distances and the Euclidean distances in the vector space. While the proposition is developed according to the Hamming distance, we generalize the idea to the edit distance, provided that the transformation to a high dimensional space is performed by the edit distance.

3.2 Selection of Reference Sequences

We create a set of random reference sequences with the property that they are independent from any sequence in any dataset. The advantage of such a property is that the reference sequences are transferable to other datasets as well. Yet, there are a few things to consider when choosing the reference sequences.

3.2.1 Length of Reference Sequences

Assuming the length of a reference sequence is smaller than the length of a sequence in a dataset, the transformation function only considers the edit distance for a portion of the sequence plus a number of insertions (according to the definition of the edit distance). Suppose x^A_i is the edit distance between A and r_i where |r_i| = |A|. Then, using a sub-sequence of r_i such as ρ_i, where length(ρ_i) ≤ length(r_i), results in x′^A_i, where x′^A_i ≤ x^A_i − (length(r_i) − length(ρ_i)). This causes more overlap between the mapped feature vectors in the vector space. If the edit distance between sequence A and sequence B is denoted by d′ where d′ ≤ d, the probability of having similar feature values increases; i.e., Pr(δ = 0 | d′) ≥ Pr(δ = 0 | d). Hence, the length of the reference sequences must be at least as large as the longest sequence in the dataset.

3.2.2 Distances among Reference Sequences

Feature vectors resulting from the space transformation are correlated if the chosen reference sequences are close to each other. We want to find the proper distance between the reference sequences that minimizes correlation among feature vectors. Correlations among the features in the vectors decrease monotonically with the increase of edit distances among pairs of reference sequences. Reference sequences of length l_ref have the minimum correlation with respect to features in the vector space when the distance among each of them is l_ref. However, the number of reference sequences satisfying such a distance is limited by the alphabet size. On the other hand, as shown in Figure 3.1, having only a few reference sequences is not sufficient for satisfactory prediction accuracy in the vector space. As a compromise, we impose a minimum distance threshold between reference sequences, by which a newly generated random reference sequence has to satisfy the distance threshold from the currently selected reference sequences. Initially the minimum distance threshold is set to (3/4)l_ref because higher edit distances between each pair of reference sequences ensure minimized correlation among features in the vector space. However, if a sufficient number of reference sequences cannot be generated satisfying the threshold, the threshold is decremented gradually. This process is repeated until the desired number of reference sequences is generated.
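The generate-and-relax procedure above can be sketched as follows. This is a minimal illustration with small, arbitrary parameters (n = 5 references of length 20), not the generator used for the experiments:

```python
import random

def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def generate_references(n, lref, alphabet="ACGT", seed=0, max_tries=200):
    rng = random.Random(seed)
    threshold = (3 * lref) // 4          # initial separation: 3/4 of l_ref
    refs = []
    while len(refs) < n:
        placed = False
        for _ in range(max_tries):
            cand = "".join(rng.choice(alphabet) for _ in range(lref))
            if all(edit_distance(cand, r) >= threshold for r in refs):
                refs.append(cand)
                placed = True
                break
        if not placed:
            threshold -= 1               # relax the requirement and retry
    return refs, threshold

refs, final_threshold = generate_references(n=5, lref=20)
assert len(refs) == 5
# Every selected pair respects at least the final (possibly relaxed) threshold.
assert all(edit_distance(a, b) >= final_threshold
           for i, a in enumerate(refs) for b in refs[i + 1:])
```

Because random sequences rarely reach (3/4)l_ref apart, the threshold is typically relaxed a few steps before all references are placed, mirroring the decrement step described in the text.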
3.2.3 Number of Reference Sequences

The number of reference sequences determines the dimensionality of the feature vectors. As our theoretical analysis shows, higher dimensions result in better discrimination between far and close pairs of sequences in a dataset. On the other hand, the computational cost of NGS sequence clustering in the vector space increases with the number of dimensions. Therefore, in order to find an optimal choice of n, we vary n among 10, 25, 50 and 100, and we evaluate the performance of NGS sequence clustering as a function of n.

3.2.4 Edit Distance to Euclidean Distance Mapping

We propose a mapping strategy between the edit distances and the Euclidean distances in order to use the predicted Euclidean distance based thresholds for hierarchical clustering in the vector space. For a given threshold, the distances among sequences of a cluster based on the edit distances and the Euclidean distances can be drawn as a distribution around the threshold. Thus, we choose a cumulative frequency in the distribution so that the corresponding Euclidean distance threshold based clusters have optimal similarity to those based on the edit distances. It is worth mentioning that the Euclidean distance increases with the number of reference sequences (Equation 3.4). Hence, for the same cumulative frequency, the actual Euclidean distance might be different with varying numbers of reference sequences.

3.3 Hierarchical Clustering

We create hierarchical clusters of NGS sequences to evaluate the effectiveness of our proposed edit distance prediction techniques. Key components of the agglomerative hierarchical clustering technique are the all-pairwise distance calculations and the iterative merging of the nearest neighbors until all the elements belong to one cluster. The technique provides a hierarchy of clusters, which is more informative [46] than a final set of clusters. Moreover, the number of clusters is dependent on a distance threshold.
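The dependence of the number of clusters on a distance threshold is easy to see for single linkage: cutting the single linkage hierarchy at threshold t is equivalent to taking the connected components of the graph linking all pairs within distance t. The toy sketch below (synthetic one-dimensional "distances", not NGS data) illustrates this with a small union-find:

```python
# Flat single-linkage clusters at a distance threshold, via connected
# components of the "within-threshold" graph.
def single_linkage_components(points, dist, threshold):
    parent = list(range(len(points)))

    def find(i):
        # Find the component root with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= threshold:
                parent[find(i)] = find(j)      # merge the two components
    return [find(i) for i in range(len(points))]

pts = [0.0, 0.4, 0.9, 5.0, 5.3]
labels = single_linkage_components(pts, lambda a, b: abs(a - b), 1.0)
# Points 0-2 chain together; points 3-4 form a second cluster.
assert labels[0] == labels[1] == labels[2]
assert labels[3] == labels[4] and labels[0] != labels[3]
```

Raising or lowering the threshold directly changes how many components, and hence clusters, remain.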
In this section, we briefly present the adopted technique, followed by a cost analysis to justify the speed-up of clustering in the vector space with the adopted technique. We also briefly present the quality assessment measures used for evaluating the clusters.

3.3.1 Cost Analysis

The preprocessing steps include the generation of reference sequences and the transformation of sequences into an n-dimensional feature space. Selection of reference sequences of length l_ref with an imposed minimum distance of l_ref/2 is performed only once, and the generated references can be reused for different datasets. The edit distance computation complexity between a reference sequence and a sequence in the dataset is O(l_ref²). Thus, the space transformation time complexity is O(n × N × l_ref²).

The computational complexity of agglomerative hierarchical clustering is O(N²), and computing an edit distance is O(l²), with l being the length of the two sequences. Therefore, the overall computational complexity is O((Nl)²). The proposed technique can benefit from existing algorithms such as minimum spanning tree (MST) or NN-chain, since in the vector space we benefit from numeric data types and standard distance calculations. Note that a few algorithms exist for neighborhood search with the edit distance; however, the purpose of our work is not to make clustering based on the edit distances faster. Yet another advantage of the proposed technique is that the distance calculation is independent of the length of the sequences.

3.3.2 Quality Assessment of Clustering Solutions

We utilize the silhouette coefficient to evaluate the quality of the clusters. The coefficient is calculated using the average intra-cluster distance (a) and the average nearest-cluster distance (b) for each sample.
S_i = (b_i − a_i) / max(a_i, b_i)    (3.5)

Here, a_i is the average distance between the ith sequence and the remaining entries within its cluster, and b_i is the minimum average distance between the ith sequence and any other cluster that it is not a part of. In addition, the Normalized Mutual Information (NMI) score is used to compare the similarity between the edit space and the vector space clusters. NMI is useful to measure the relative similarity between two independent sets of cluster labels when the ground truth is not known.

3.4 Experimental Results

In this section, we determine the distance among the reference sequences that minimizes correlation, the effective number of reference sequences for the transformation, and the edit distance to Euclidean distance mapping for the vector space. When these choices are fixed, the final parameters are used to create clusters from the experiment datasets. We compare the effectiveness of the space transformation by evaluating single linkage and average linkage hierarchical clusters based on the edit distances and the Euclidean distances in the vector space. We also compare the cluster quality of our method with the Hamming distance based LSH-Div. The approximate length of the sequences in the experiment datasets is 252; hence, the maximum edit distance is 252. As the clusters are created for 97% and 95% similarity thresholds, we chose the equivalent edit distances 8 and 13 respectively (rounded up to the nearest integer). All the experiments are performed on an Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz with 24GB physical memory. It is worth mentioning that in case the length of the reference sequences is larger than the length of the dataset sequences, the reference sequences are trimmed for consistency.

3.4.1 Datasets

A diverse set of 16S rRNA sequence fragment datasets is used to assess the performance of our proposed techniques. The sequence fragments are generated using high throughput NGS techniques. The datasets used here are obtained from NCBI databases.
For these datasets, the lengths of most of the sequences are roughly similar because specific regions of the 16S rRNA are targeted for NGS datasets. However, there are variations in the sequence lengths because of insertions, deletions and sequencing errors. Table 3.1 briefly describes the datasets, which are obtained from various metagenomic environments. The datasets are metagenomic samples taken from mouse, human, rat, soil, marine plankton (ERX2155923 [47]) and zoo compost (SRX1537393 [48]). It should be noted that we consider only the distinct sequences in each dataset. Due to the quadratic space complexity of the main memory based hierarchical clustering method, in this chapter we experiment with the datasets Dm, Dh, Dr and a subset of Ds (40000 sequences).

Table 3.1: 16S rRNA Next Generation Sequencing (NGS) datasets

Dataset          | Id | # Sequences | Mean Length | St. Dev.
Mouse            | Dm | 17993       | 252.6       | 1.233
Human            | Dh | 21575       | 252.9       | 1.203
Rat              | Dr | 30079       | 252.87      | 0.626
Soil             | Ds | 99742       | 253.1       | 1.123
Marine Plankton  | Dp | 496314      | 252.1       | 2.67
Zoo Compost      | Dz | 625178      | 250.3       | 3.6

3.4.2 Distance among Reference Sequences

In this experiment, we evaluate the effect of the distance between the reference sequences on the correlation among the feature vectors. Two random reference sequences are generated along with 1000 pairs of sequences; then the correlation between the edit distances of the pairs and their Euclidean distances is evaluated. This experiment is repeated 2000 times with different random pairs of reference sequences and random sequences. The resulting correlations for sequences of length 100, 200, and 300, as a function of the distance between each pair of reference points, are shown in Figure 3.3.

Figure 3.3: Correlation among features after transformation of random sequences as a function of the distance among reference sequences. [The dashed line shows l_ref/2.]

When the pair of reference sequences are close and the correlation is very high, the two corresponding features become redundant.
As the distance between reference sequences increases to l_ref/2, the correlation approaches zero. As a result, it is important to make sure that the selected reference sequences are far from each other.

3.4.3 Mapping between the Edit and the Euclidean Distances

We set the aggregation technique to average linkage, the distance threshold to 8 and the number of reference sequences to 100 for the experimental results shown in Figure 3.4. Figure 3.4(a) presents the cumulative frequency distribution of intra-cluster pairwise edit distances ≥ 8 for the mouse dataset. Since the average linkage technique considers the mean of all possible pairwise distances, a significant number of distances within a cluster may be higher than the threshold. We consider this property while selecting an equivalent threshold in the vector space. Figure 3.4(b) shows the Euclidean distance cumulative frequency curves for a set of edit distance thresholds for the clusters generated from the mouse dataset. The goal is to choose a Euclidean distance based on the cumulative frequency of edit distance 8 (97% similarity corresponds to 8 for the experiment datasets) such that optimum cluster label similarity in the edit and in the vector space can be achieved. Hence, we determine the Euclidean distances for the cumulative frequencies between 0.7 and 0.95 with interval 0.05 and create clusters in the vector space. We measure the similarity of the resulting clusters using the NMI score. We find that the Euclidean distance for cumulative frequency 0.75 produces as much as 97% similarity, while the similarity degrades slowly with higher thresholds. It can be seen from Figure 3.4(b) that a significant number of larger pairwise distances overlap with the optimum cumulative frequency area, likewise the distribution shown in the edit space clusters (Figure 3.4(a)).
Table 3.2: Mapping of Euclidean distance thresholds for given edit distance thresholds in hierarchical clustering, obtained by comparing similar clusters in the edit space and the vector space

Similarity threshold | Edit distance | Euclidean distance | Standard deviation
97%                  | 8             | 17.8               | 0.2
95%                  | 13            | 21.7               | 0.28

Figure 3.4(c) shows the relationship between the varying cumulative frequencies and the NMI scores of hierarchical clusters generated based on the edit distances and the Euclidean distances in the vector space.

Figure 3.4: Cumulative frequency distribution of the edit and the Euclidean distances for different threshold settings. (a) The edit distance distribution and (b) the Euclidean distance distribution. (c) Relative similarity of the resulting clusters between the edit space and the vector space for varying cumulative frequencies of the Euclidean distances.

Cumulative frequency 0.75 produces the maximum NMI for all the datasets. Hence, we take the average Euclidean distance for that cumulative frequency as the threshold in the vector space. Table 3.2 shows the mapping between the edit and the Euclidean distances for the experiment datasets. The predicted Euclidean distances are based on 100 reference sequences. These values can be modified for varying numbers of reference sequences using Equation 3.4. Our study for smaller sequence lengths shows an approximately similar mapping between the edit and the Euclidean distances, though the findings are not presented here.
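The cumulative-frequency selection amounts to reading a quantile off the observed intra-cluster Euclidean distances. A minimal sketch follows; the distance values are synthetic stand-ins (arranged so that the 0.75 quantile lands at 17.8, the threshold reported in Table 3.2), not measurements from the experiment datasets:

```python
def threshold_at(distances, cum_freq):
    # Euclidean threshold at a given cumulative frequency: the value below
    # which the requested fraction of observed distances falls.
    ordered = sorted(distances)
    idx = int(cum_freq * (len(ordered) - 1))
    return ordered[idx]

# Hypothetical intra-cluster Euclidean distances at edit distance threshold 8.
observed = [14.2, 15.1, 15.8, 16.4, 16.9, 17.3, 17.8, 18.5, 19.2, 21.0]
assert threshold_at(observed, 0.75) == 17.8
```

In the actual method, this quantile is computed per dataset and the average over datasets is used as the vector-space threshold.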
Figure 3.5: Comparing relative similarity (NMI) between clusters in the edit space and the vector space with growing number of reference sequences. (a) Single linkage. (b) Average linkage.

3.4.4 Number of Reference Sequences

Figure 3.5 shows the relationship between NMI and the number of reference sequences for the agglomerative hierarchical clustering methods. Increasing the number of reference sequences improves the NMI scores. The relative similarity between single linkage clusters (Figure 3.5(a)) in the edit and the vector space degrades rapidly with decreasing numbers of references compared to that of average linkage clusters (Figure 3.5(b)); the single linkage distance criterion and the variance in the Euclidean distance distributions are primarily responsible for such poor quality. For our experiments, we choose the number of references n = 100 for the vector space to ensure consistently good quality clusters for all of the linkage techniques.

3.4.5 Computation Time

Space Transformation: The pre-processing step is performed only once on the datasets.
The time complexity of the transformation is O(n × N × l²), where n is the number of reference sequences, N is the number of sequences in a dataset and l is the length of a sequence. The transformation time is included in the total clustering time.

Clustering: The clustering time depends on several factors such as the number of sequences in a dataset, the length of the sequences, the number of reference sequences and the clustering method. Table 3.3 compares the hierarchical clustering time based on the edit distances and the Euclidean distances in the vector space.

Table 3.3: Comparison of hierarchical clustering time (in hours) of the reference sequence based technique in the vector space with that of the actual edit distance based technique

        | Single Linkage                             | Average Linkage
Dataset | Edit Distance | Euclidean Distance         | Edit Distance | Euclidean Distance
        |               | n = 50    | n = 100        |               | n = 50    | n = 100
Dm      | 0.62          | 0.001     | 0.003          | 1.07          | 0.23      | 0.46
Dh      | 0.94          | 0.002     | 0.004          | 1.72          | 0.38      | 0.80
Dr      | 1.81          | 0.004     | 0.009          | 3.98          | 1.06      | 2.24
Ds      | 3.34          | 0.008     | 0.014          | 8.36          | 2.42      | 5.13

Clustering in the vector space is significantly faster than that based on the edit distances. The speed-up depends, to a great degree, on the linkage technique. For instance, single linkage vector space clustering is up to 500 times faster for varying numbers of reference sequences, while average linkage vector space clustering is approximately up to 5 times faster than their edit space equivalents in main memory based hierarchical clustering techniques. The reason is as follows: single linkage clusters are formed based on the nearest sequences of the neighboring clusters. Such clusters can be built very efficiently using a minimum spanning tree (MST) of time complexity O(‖E‖ log V).
In this case, the distance calculation time between pairs of sequences dominates the total time; hence, single linkage clustering in the vector space achieves such speed-up. On the other hand, average linkage clustering measures the mean distance between all pairs of two clusters to form a larger cluster. Such clustering time complexity is O(N²), which requires a lot of time. Yet, the time difference between calculating the edit distances and the Euclidean distances in the vector space is the same for all pairs in a sequence dataset for all the linkage types. Hence, the average linkage clustering speed-up is not as significant in the vector space compared to that of the single linkage clustering time.

3.4.6 Cluster Similarity

We quantify the relative similarity between the agglomerative hierarchical clusters based on the edit distances and the Euclidean distances using the Normalized Mutual Information (NMI) scores. An NMI score compares the agreement between two sets of cluster labels without considering either of them as the ground truth. Table 3.4 presents the NMI scores for the Euclidean distance based single and average linkage agglomerative hierarchical clusters with respect to those of the edit distance based clusters. We set the similarity thresholds to 97% and 95% and the number of references in the vector space to 100.

Table 3.4: Evaluating accuracy of single linkage and average linkage hierarchical clusters generated using the proposed predicted edit distances in the vector space. We compute the Normalized Mutual Information (NMI) for the relative similarity of the generated clusters to those based on actual edit distances.

        | Single Linkage                        | Average Linkage
Dataset | Similarity = 97% | Similarity = 95%   | Similarity = 97% | Similarity = 95%
Dm      | 0.96             | 0.76               | 0.97             | 0.97
Dh      | 0.89             | 0.61               | 0.95             | 0.95
Dr      | 0.93             | 0.95               | 0.95             | 0.96
Ds      | 0.96             | 0.82               | 0.97             | 0.96

It can be seen that the average linkage clusters in the vector space are 90% or more similar for most of the similarity thresholds.
However, the single linkage hierarchical clusters based on the Euclidean distances for the 95% similarity threshold do not show satisfactory similarity to those of the edit distance based clusters. The single linkage aggregation technique, along with the increasing variance in the Euclidean distance distribution, creates faster convergence of the clusters in the vector space and hence the dissimilarity. For a high similarity threshold such as 97%, as shown in Table 3.5, the predicted edit distance based clusters are computed significantly faster with only nominal differences in the relative similarity (NMI) scores.

Table 3.5: Comparison of clustering time and accuracy based on predicted edit distances in the vector space to those based on actual edit distances. All the single linkage and average linkage clusters are compared for the 97% similarity threshold.

        | Single Linkage                            | Average Linkage
Dataset | Edit Distance | Euclidean Distance        | Edit Distance | Euclidean Distance
        | Time          | Time     | NMI            | Time          | Time     | NMI
Dm      | 0.62          | 0.033    | 0.96           | 1.07          | 0.46     | 0.97
Dh      | 0.94          | 0.004    | 0.89           | 1.72          | 0.8      | 0.95
Dr      | 1.81          | 0.009    | 0.93           | 3.98          | 2.24     | 0.95
Ds      | 3.34          | 0.014    | 0.96           | 8.36          | 5.13     | 0.97

3.4.7 Cluster Quality

We evaluate the quality of the clusters created based on the edit distances and the Euclidean distances in the vector space with respect to the silhouette coefficient. The similarity thresholds are set to 97% and 95% for the average linkage hierarchical clusters. Table 3.6 presents the silhouette coefficients for the clusters of the experiment datasets based on the edit distances and the Euclidean distances. As shown in the table, the clusters generated by both distance measures produce silhouette coefficients between 0.2 and 0.6, which indicates that the clusters are fairly separable despite some noisy area. We measure the coefficients using the edit distances of the sequences. Since the edit distances are approximated in the vector space, the silhouette coefficients of the clusters generated in the vector space are slightly lower compared to those of the edit distance based clusters.

Table 3.6: Evaluating quality of average linkage hierarchical clusters using the silhouette coefficient. The clusters are based on the edit distances and the Euclidean distances in the vector space.

        | 97% Similarity                         | 95% Similarity
Dataset | Edit Distance | Euclidean Distance     | Edit Distance | Euclidean Distance
Dm      | 0.56          | 0.50                   | 0.61          | 0.55
Dh      | 0.47          | 0.42                   | 0.50          | 0.44
Dr      | 0.36          | 0.30                   | 0.43          | 0.38
Ds      | 0.41          | 0.35                   | 0.45          | 0.38

It is worth mentioning that the differences in the silhouette scores between the two distance measures in otherwise similar settings are less than 0.1, which underscores the effectiveness of the proposed method. However, the quality of the clusters in the vector space degrades gradually with lower similarity thresholds due to the increasing overlap in the Euclidean distance distributions for the larger edit distances.

3.4.8 Comparison with Locality Sensitive Hashing based Clusters

We compare the clusters based on the actual edit distances as well as the predicted distances with those of the LSH-Div [22] technique with respect to relative similarity, cluster quality and required time. Since the hash based technique requires equal length sequences, we filter the datasets as follows. First, sequences whose lengths deviate from the mean by more than one standard deviation are removed. Second, the remaining sequences are cut to the smallest sequence length. For this experiment, we set the similarity threshold to 97%, the number of reference sequences to 100 and the aggregation technique to average linkage. Table 3.7 shows the comparison on the NMI score relative to the edit distance based average linkage hierarchical cluster labels, the silhouette coefficient and the clustering time.

Table 3.7: Comparison of clusters based on the edit distances, the predicted distances in the vector space and locality sensitive hashing based similarity with respect to clustering time, relative similarity (NMI) to ground truth and the silhouette coefficient.
        | # Equal Length | Time (hours)                      | NMI                    | Silhouette Coefficient
Dataset | Sequences      | Edit  | Euclidean | LSH-Div       | Euclidean | LSH-Div    | Edit  | Euclidean | LSH-Div
Dm      | 17917          | 1.06  | 0.465     | 0.043         | 0.97      | 0.87       | 0.57  | 0.51      | 0.12
Dh      | 21571          | 1.71  | 0.8       | 0.08          | 0.95      | 0.87       | 0.47  | 0.42      | -0.084
Dr      | 30079          | 3.98  | 2.24      | 0.47          | 0.95      | 0.83       | 0.36  | 0.30      | 0.02
Ds      | 39994          | 8.35  | 5.12      | 0.86          | 0.97      | 0.91       | 0.38  | 0.34      | 0.10

It can be seen that despite the modification of the datasets, the relative similarities of the LSH-Div clusters with respect to the ground truth are lower compared to those of the vector space clusters. The main reason for the difference is that the Hamming distance based LSH-Div technique cannot efficiently identify insert and delete operations like the edit distance. Similarly, the silhouette coefficients suggest that the edit and the vector space clusters are better in terms of intra- and inter-cluster distances than those of the LSH based technique. Although LSH-Div is significantly faster than the edit distance and the Euclidean distance based clustering techniques, there is a lot of room to improve the hierarchical clustering time for this application. Since OTUs are generally determined based on high similarity thresholds, we do not need to generate the entire distance based hierarchy of the sequence datasets. Furthermore, reducing the number of reference sequences would lower the required time for the vector space clusters. It is worth mentioning that in this chapter we focus on the effectiveness of the space transformation technique.

3.5 Summary

In this chapter, we presented a novel reference sequence based space transformation technique for clustering sequences in NGS datasets. We transform each sequence into a feature vector by computing the edit distances between the sequence and a set of reference sequences.
The space transformation technique works well for the hierarchical clustering techniques, more specifically with the single linkage aggregation strategy. Pairs of sequences that are close with respect to the edit distances are also close in the vector space. Clustering in the vector space provides significant speed-up because the Euclidean distance computation is multiple orders of magnitude faster than computing the edit distances. This speed-up is achieved with a nominal sacrifice in cluster quality. We performed extensive analysis to determine the effective number of reference sequences and their lengths. We show through experimental results that the distance threshold for computing clusters in the vector space is about the same for the experiment datasets and depends primarily on the number and length of the reference sequences.

Chapter 4
MVD based Edit Distance Prediction

In the previous chapter, we presented a reference sequence based space transformation technique for the sequences in NGS datasets, followed by an edit distance prediction technique in the vector space using the Euclidean distance. Although the proposed technique showed better performance for the single linkage hierarchical clustering technique, the Euclidean distance based technique could not directly predict the pairwise edit distances. Instead, we developed a mapping between the edit distances and the Euclidean distances in the vector space. However, the mapping was dependent on the number of reference sequences as well as the characteristics of the sequences in the datasets and the reference sequences.

In this chapter, we develop a technique to predict edit distances between pairs of sequences in datasets by computing the differences of the edit distances of these sequences from the reference sequences. In the later text, we use the term difference of distances to refer to such distances between pairs of sequences in datasets.
Given sequences A, B and the corresponding feature vectors (xA_1, . . . , xA_n) and (xB_1, . . . , xB_n), the vector component differences would be the bag of (xA_i ∼ xB_i) where 1 ≤ i ≤ n. The differences of distances between A and B are bounded by ed(A, B), and the maximum of such distances is known as the Chebyshev distance. As we increase the number of reference sequences, the Chebyshev distance is likely to get closer to the actual edit distance. The frequency of any (xA_i ∼ xB_i) is also helpful for improving the edit distance prediction accuracy in the vector space. Using these strategies, we are able to predict the edit distances in the vector space more efficiently than with the Euclidean distance based technique.

4.1 Differences of Distances

Let us consider sequences A and B in a dataset that are transformed into xA and xB by a set of reference sequences R.

Definition 8. Vector Difference (VD): The vector representing the component wise differences of two feature vectors xA and xB is denoted by ∆x = < (∆x_j : (xA_j ∼ xB_j)), 1 ≤ j ≤ n >.

Definition 9. The Maximum Component Value in a VD for a pair of feature vectors is defined as MVD:

MVD = max( < (∆x_j : (xA_j ∼ xB_j)), 1 ≤ j ≤ n > ).

Definition 10. The number of occurrences of the MVD in a VD for a pair of vectors is defined as FVD.

Let us assume xA = <10, 8, 12, 6, 7, 11, 9, 6, 8, 12> and xB = <8, 9, 15, 5, 9, 13, 6, 6, 9, 13>. For xA and xB, VD = <2, 1, 3, 1, 2, 2, 3, 0, 1, 1>, MVD = 3 and FVD = 2. We use the MVD and the FVD to predict the edit distance between A and B. The following theorem is the basis for our proposed heuristics to predict the edit distances. As shown in Theorem 1, each component value in a VD is bounded by the edit distance between the pair of sequences. For example, in Figure 4.1, (xA_1 ∼ xB_1) ≤ d.
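Definitions 8-10 applied to the worked example above can be checked with a short sketch; the function names are ours, not from the dissertation:

```python
def vector_difference(xA, xB):
    """VD: component wise absolute differences of two feature vectors."""
    return [abs(p - q) for p, q in zip(xA, xB)]

def mvd_fvd(xA, xB):
    """MVD: maximum component value in the VD; FVD: its number of occurrences."""
    vd = vector_difference(xA, xB)
    return max(vd), vd.count(max(vd))

xA = [10, 8, 12, 6, 7, 11, 9, 6, 8, 12]
xB = [8, 9, 15, 5, 9, 13, 6, 6, 9, 13]
```

Running `mvd_fvd(xA, xB)` reproduces the MVD of 3 and FVD of 2 from the example.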
A corollary of Theorem 1 is as follows:

Figure 4.1: Distance between sequences A and B in the vector space based on hypothetical alignment of the reference sequences.

Corollary 1. Given sequences A and B, the MVD of xA and xB is bounded by the edit distance between A and B.

It should be noted that each component value in a VD greatly depends on the relative position of r_j to the pair of database sequences. Since each r_j is quite distant from the others, adding more reference sequences increases the likelihood of a higher MVD. If sequence A and sequence B are similar to a reference sequence r_j in most of the positions, the MVD is close to ed(A, B). For example, in Figure 4.1, the feature vector component difference related to r1, i.e., (xA_1 ∼ xB_1), fails to capture the actual edit distance ed(A, B) = d, shown by the dotted line, because the values of xA_1 and xB_1 are approximately the same. On the other hand, the component difference related to r2, i.e., (xA_2 ∼ xB_2), is closer to the actual edit distance only because r2, A and B are similar in most of the positions.

Proposition 2. Increasing the number of reference sequences enhances the probability of achieving a higher MVD.

Proof. Let us consider an arbitrary edit operation between the ith letter of A, A[i], and the jth letter of B, B[j], where 1 ≤ i ≤ |A| and 1 ≤ j ≤ |B|. Given an r ∈ R and 1 ≤ m ≤ |r|, Pr(A[i] = r[m]) = 1/|Σ| and Pr(B[j] ≠ r[m]) = (1 − 1/|Σ|). Assuming ed(A, B) = d, the probability of finding all the edit operations by ed(A, r) and none by ed(B, r) is [(1/|Σ|)(1 − 1/|Σ|)]^d. Since the edit distance between each pair of reference sequences is at least l_ref/2, it is likely that the reference sequences have different letters at any mth position. Given a set of n reference sequences, the probability that at least one reference sequence finds all the edit operations becomes [(n/|Σ|)(1 − 1/|Σ|)]^d. Therefore, increasing n results in a higher MVD ≤ d.
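Corollary 1 follows from the triangle inequality and can be checked empirically. The sketch below uses toy random sequences and a compact edit distance helper of our own; the bound must hold for every random draw:

```python
import random

def ed(a, b):
    """Compact Wagner-Fischer edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

random.seed(0)
rand_seq = lambda n: "".join(random.choice("acgt") for _ in range(n))
refs = [rand_seq(20) for _ in range(50)]   # toy random reference sequences
A, B = rand_seq(20), rand_seq(20)

# |ed(A, r) - ed(B, r)| <= ed(A, B) for every r, hence MVD <= ed(A, B)
mvd = max(abs(ed(A, r) - ed(B, r)) for r in refs)
```

Whatever the sequences, `mvd` never exceeds `ed(A, B)`.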
According to Proposition 2, the MVD is likely to be higher with an increasing number of reference sequences. Based on Corollary 1, the MVD is bounded by the edit distance of the sequences in a dataset. Thus, with an increasing number of reference sequences, the MVD will either reach the edit distance or get close to it. Therefore, with a sufficiently large number of reference sequences, the MVD can provide a good estimation of the edit distance in the vector space.

4.1.1 Effect of Increasing Number of Reference Sequences on MVD

Figure 4.2 shows the distribution of MVDs with an increasing number of reference sequences. The pairs of sequences that are close to each other with respect to the edit distances are also close in the vector space. For example, Figure 4.2(a) shows the edit distance prediction using n = 5 reference sequences. 100% of the pairs with edit distance 1 are predicted as MVD ≤ 1. However, only a tiny fraction of the pairs of sequences predicted as MVD = 1 are actually of edit distance one. Most of the remaining pairs that are predicted as MVD = 1 correspond to larger edit distances. The wrong prediction of larger edit distances is primarily due to the small number of reference sequences. As shown in Proposition 2, only a few reference sequences may not have the desired combination of letters to capture the edit operations between pairs of sequences in a dataset.

Figure 4.2: Distribution of the edit distances in the vector space grouped by MVD with varying number of dimensions. Panels (a) through (f) use n = 5, 10, 50, 300, 500 and 1000 reference sequences respectively.

Thus, the pairs of sequences with larger edit distances may appear closer to each other in the vector space. Therefore, for a small number of reference sequences, a given MVD for pairs of sequences corresponds to a wider range of edit distances.
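Because the MVD is a maximum over the reference set, enlarging the set can only leave it unchanged or push it upward toward ed(A, B), which is the effect Figure 4.2 illustrates. A small simulation with toy random sequences (helper names are ours):

```python
import random

def ed(a, b):
    """Compact Wagner-Fischer edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def mvd(A, B, refs):
    """MVD of the pair (A, B) with respect to a reference set."""
    return max(abs(ed(A, r) - ed(B, r)) for r in refs)

random.seed(1)
rand_seq = lambda n: "".join(random.choice("acgt") for _ in range(n))
refs = [rand_seq(30) for _ in range(300)]
A, B = rand_seq(30), rand_seq(30)

# MVD with 5 references vs. all 300: monotone in n, still bounded by ed(A, B)
small, large = mvd(A, B, refs[:5]), mvd(A, B, refs)
```

The inequality `small <= large <= ed(A, B)` holds for any draw, since the larger reference set is a superset of the smaller one.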
Hence, the MVD is not very predictive when a small number of reference sequences is used. As the number of reference sequences increases, the predicted edit distances become more distinguishable. For example, in Figure 4.2(d), for n = 300 reference sequences, if MVD = 1, 99.9% of the pairs are actually of edit distance 1, and the remaining 0.1% of the pairs are of edit distance 2. As more sequences are added to the reference sequence set, the MVD based edit distance prediction gets more accurate for larger edit distances. For example, in Figure 4.2(f), given n = 1000, when MVD = 3, 95.7% of the predictions are correct. By Corollary 1, for a given MVD = d′, the edit distance has to be greater than or equal to d′. All the edit distances d where d < d′ should be covered by MVD < d′. Thus, the MVD also provides a partial ordering of the edit distances in the vector space.

4.2 Edit Distance Prediction in the Vector Space

In this section, we present the heuristics for the proposed MVD and FVD based edit distance prediction technique.

4.2.1 Heuristics for the Edit Distance Prediction

In the preceding section we showed that increasing the number of reference sequences improves the MVD based edit distance prediction accuracy, especially for the smaller edit distances. Proposition 2 showed that for a sufficiently large number of reference sequences, the smaller edit distances are likely to be predicted correctly. For instance, in Figure 4.2, for n ≥ 300, the edit distance predictions based on MVD ≤ 2 are accurate in more than 99% of the cases, which leads to our first heuristic:

H1: For n ≥ 300, if MVD ≤ 2, the edit distance in the vector space is predicted to be the same as the MVD.

We performed experiments with different datasets and found that in all these datasets the predictions were correct in more than 99% of the cases. In less than 1% of the cases they were not correct, and the predicted distances differed by 1.
As we look into the higher MVDs, more edit distances greater than or equal to the MVD map into it. Since the reference sequences are far from the sequences in the datasets, each difference of distances in a VD increases slowly as the corresponding edit distance becomes larger. For the larger edit distances, increasing the number of dimensions may not produce an MVD close to the actual edit distances of pairs of sequences in the datasets. The predicted MVD represents a range of edit distances despite a large number of dimensions (Figure 4.2(d)). Therefore, the MVD alone is not sufficient to satisfactorily predict the relatively larger edit distances.

However, the frequency of an MVD is likely to be higher when the actual edit distance is larger. For example, in Figure 4.3, for a given MVD, the cumulative frequencies of the edit distances are shown with the growth of the FVD. It also shows the FVD based cutoffs for the predicted edit distances. We select the FVD cutoffs based on the maximum true positive percentage for a given <MVD, FVD>. In Figure 4.3(a), for MVD = 3 and FVD ≤ 13, the percentage of correct predictions (sensitivity) and the true positive rate are 97% and 84% respectively. Similarly, we can find FVD cutoffs for predicting higher edit distances too.

Figure 4.3: For a given MVD, the range of FVDs highlights the distribution of the edit distances. We consider the change in distribution for our prediction. For example, predicted edit distances are marked based on intervals of FVD values. Panels: (a) frequency of MVD = 3, (b) frequency of MVD = 4.

Based on this observation, we propose our second heuristic.

H2: For 3 ≤ MVD ≤ 5, the edit distances in the vector space are predicted based on the MVD and the corresponding FVD.

Figure 4.3(b) presents the predicted edit distance cutoffs for MVD = 4. For MVD = 4 and FVD ≤ 3, the sensitivity and true positive rates of the predicted edit distance 4 are 87% and 49% respectively.
The accuracy is reduced because with a higher MVD, a wider range of edit distances produces exactly the same MVD and FVD. However, it can be improved by using the second maximum difference in a VD and the corresponding frequency. For example, Figure 4.4 presents the frequencies of the second maximum difference in a VD for <MVD, FVD> = <4, 2> and <MVD, FVD> = <4, 3>. It is evident from Figure 4.3(b) and Figure 4.4 that for MVD = 4, the second maximum difference in a VD and the corresponding frequency helped improve the accuracy of the predicted edit distances.

H2.1: For 4 ≤ MVD ≤ 5, the second maximum difference and the corresponding frequency in a VD further improve the prediction accuracy.

Figure 4.4: Predicting edit distances using the top two vector component differences (VD) and their corresponding frequencies (FVD). The differences of the pairwise edit distance distributions for the given <VD, FVD> show the effectiveness of such strategies. Panels: (a) MVD = 4, FVD = 2, (b) MVD = 4, FVD = 3.

Although the predicted edit distances based on H2 and H2.1 are more accurate, their error rates are within certain lower bounds. In many cases, the erroneously predicted distances differ by 1-2 points from the actual edit distances.

However, prediction accuracy declines sharply for even larger MVDs. To solve this problem, we determine candidate pairs based on the maximum two differences and their corresponding frequencies in a VD. Then, we compute the edit distances of these candidate pairs using existing edit distance calculation techniques. Since we are only interested in NGS sequence similarity within a small distance threshold, the number of candidate pairs is a small fraction of all the pairs with larger edit distances.
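The cascade of heuristics described above can be sketched as follows. The FVD cutoff table here is hypothetical: in the dissertation the cutoffs are chosen per dataset from the observed <MVD, FVD> distributions (Figure 4.3), and the H2.1 refinement using the second maximum difference is omitted for brevity.

```python
FVD_CUTOFFS = {3: 13, 4: 3, 5: 2}  # hypothetical, dataset-dependent in practice

def predict_edit_distance(vd):
    """Return a predicted edit distance for a vector difference VD, or None
    when the pair must be handed to an exact edit distance computation."""
    mvd = max(vd)
    fvd = vd.count(mvd)
    if mvd <= 2:
        return mvd                                    # H1: predict MVD directly
    if mvd <= 5:                                      # H2: refine with the FVD
        return mvd if fvd <= FVD_CUTOFFS[mvd] else mvd + 1
    return None                                       # H3 regime: candidate pair
```

Pairs that fall through to `None` are the candidate pairs whose edit distances are computed exactly.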
Utilizing the frequencies of the second maximum differences in a VD further reduces the number of candidate pairs.

H3: For MVD > 5, the candidate pairs are selected based on the two maximum differences in a VD and their corresponding frequencies. Then, we calculate the edit distances of those pairs and keep the pairs which are within a given distance threshold.

Table 4.1 shows the effect of using the top two differences in a VD and their corresponding frequencies to reduce the number of candidate pairs where MVD > 5 for the datasets presented in Table 3.1. The number of candidate pairs of sequences is reduced by more than 50% when the top two differences in a VD and their corresponding frequencies are considered instead of only the MVD and FVD. Because of the quadratic time complexity of the edit distance calculation with respect to sequence lengths, it is important to reduce the number of false positive candidate pairs.

Table 4.1: Number of candidate pairs selected for the edit distance calculation. These pairs are selected based on the top two differences of a VD and their corresponding frequencies.

Dataset   Maximum difference in a VD    Top two differences in a VD and
          and the frequencies           the corresponding frequencies
Dm        17920424                      7574093
Dh        13149464                      4350262
Dr        14826484                      5740473
Ds        61901271                      28026329

4.2.2 Optimal Number of Reference Sequences

Increasing the number of reference sequences improves the edit distance prediction accuracy in the vector space. At the same time, the computational cost between pairs of feature vectors increases with the number of dimensions, i.e., the number of reference sequences. We determine the optimal number of reference sequences by minimizing the number of costly edit distance calculations.
In the proposed MVD based prediction technique, the edit distances are calculated on two occasions. First, we create the feature vectors by computing the edit distances between the sequences in the datasets and the reference sequences. Given N sequences in a dataset, the number of edit distance calculations grows linearly with an increasing number of reference sequences. Second, the heuristic H3 selects candidate pairs for MVD > 5 to compute the edit distances between them. With an increasing number of reference sequences, the number of candidate pairs based on the heuristic decreases gradually until leveling off for a large number of reference sequences such as n = 1000.

Figure 4.5: Total number of pairwise edit distance computations, including the number of pairs needed for space transformation and the number of candidate pairs determined by heuristic H3.

Figure 4.5 shows the relationship between the required number of edit distance calculations using the proposed technique and increasing n for the datasets presented in Table 3.1. It can be seen from the figure that the number of pairs for each dataset reaches its lowest range when the number of reference sequences varies between 200 and 300. Therefore, we choose n = 300 as the optimal number of reference sequences for the experiment datasets. It should be noted that the optimal n may vary depending on the properties of the datasets.

4.3 Hierarchical Clustering

In this section, we briefly present the adopted hierarchical clustering technique for larger datasets, followed by a cost analysis to justify the speed-up of clustering in the vector space with the adopted technique.

4.3.1 Memory Efficient Clustering Technique

In the previous chapter, we adopted a main memory hierarchical clustering technique of O(N^2) space complexity. However, such a space requirement prohibits evaluating our proposed technique on larger datasets.
In this chapter, we adopt a graph based external memory hierarchical clustering technique, SparseHC [49], that can handle a large number of sequences. Key steps of the SparseHC technique are:

• It assumes that the pairwise distances are pre-calculated and sorted in ascending order of distance.

• The sequences are considered as singleton clusters, i.e., vertices, and the pairs along with their distances are considered as edges.

• The edit distances of pairs of sequences are stored in the form <i, j, dist(i, j)> where i, j ∈ N and i < j. The triples are fetched sequentially. If there is an edge between the ith and jth sequences or their ancestor clusters, the edge is updated with the new pairwise distance. Otherwise a new edge is formed between the ith and jth sequences.

• Once the edit distances between all pairs of sequences of two clusters are accounted for, the edge is considered complete and the two clusters are merged into one. The incomplete edges of both clusters are updated accordingly.

• The algorithm stops when only one cluster remains. We modified the algorithm so that the clustering stops if no two clusters can be merged while satisfying the given threshold.

4.3.2 Cost Analysis

The time complexity of edit distance calculation between all pairs of sequences is O(N^2 l^2), where N is the number of sequences and l is the length of a sequence. On the other hand, the preprocessing steps of the proposed technique include the generation of reference sequences and the transformation of sequences in a dataset into n-dimensional feature vectors, where n is the number of reference sequences. Since the randomly generated reference sequences are dataset independent, their generation time is not considered in the cost analysis. The time complexity of the space transformation is O(nN l^2). The time complexity of pairwise distance calculation in the vector space is O(nN^2). Equation 4.1 shows the ratio of the pairwise distance calculation time complexity between the edit and the vector space.
O(N^2 l^2) / (O(nN l^2) + O(nN^2))    (4.1)

For the clustering, we need to consider only the fraction of all possible pairs of sequences that are within the distance threshold. Suppose the number of such pairs is cN^2 where c << 1. These pairs need to be sorted in order to create clusters using the SparseHC technique. The disk based merge sort time complexity is O(cN^2 log(cN^2)). In addition, the SparseHC clustering time complexity is O(N + cN^2). The sorting and the clustering times differ slightly between the edit distance and the MVD based methods due to the difference in the number of pairs for the predicted edit distances. The ratio of the time complexity in the edit and the vector space is given below. It is understandable that with an increasing number of sequences in the datasets, the all pairs edit distance calculation time increases more rapidly than the other components of the used techniques. Hence, the clustering performance in the vector space improves with an increasing number of sequences in the datasets.

[O(N^2 l^2) + O(cN^2 log(cN^2)) + O(N + cN^2)] / [O(nN l^2) + O(nN^2) + O(cN^2 log(cN^2)) + O(N + cN^2)]
≈ O(N^2 l^2) / (O(nN l^2) + O(nN^2))
≈ c1 N^2 l^2 / (c1 nN l^2 + c2 nN^2)
≈ c1 N l^2 / (c1 n l^2 + c2 nN)

4.3.3 Predicted Edit Distance Thresholds in the Vector Space

Our proposed heuristics predict the edit distances in the vector space. Given n ≥ 300 and l ≤ 252, heuristics H1 and H3 predict the edit distances nearly accurately. Although the prediction accuracy of heuristic H2 is comparatively lower than that of the other heuristics, the majority of the erroneous predictions differ by 1 to 2. Hence, we are able to predict similar average linkage cluster thresholds in the vector space. Table 4.2 shows the predicted thresholds in the vector space. The predicted thresholds are slightly smaller for the 97% and 98% similarity thresholds due to the erroneous predictions of heuristic H2. k-mer, locality sensitive hashing and other heuristics based clustering methods also have similarity thresholds.
However, these thresholds are not similar to the edit distance based thresholds.

Table 4.2: Comparison of similarity thresholds for hierarchical clustering based on the edit distances and those of the MVD

Similarity Threshold   Edit Distance   MVD
99%                    2.5             2.5
98%                    5               4.7
97%                    7.5             7.3

4.4 Experimental Results

We create clusters based on the average linkage hierarchical clustering technique in the vector space and compare the clusters with those created based on the edit distances. We also compare the performance of the clusters based on the proposed edit distance prediction technique with VSEARCH [23] and the locality sensitive hashing based LSH-Div [22] clusters. The similarity thresholds are set to 97%, 98% and 99% for the respective clustering methods.

Table 4.3: Hierarchical clustering time (in hours) breakdown based on the edit distance and the MVD based vector space for 97% similarity

          Pairwise Distance          Sorting & Clustering
          Calculation time (hr)      time (hr)
Dataset   Edit Distance   MVD        Edit Distance   MVD
Dm        0.614           0.083      0.02            0.021
Dh        0.931           0.068      0.018           0.018
Dr        1.771           0.104      0.025           0.026
Ds        17.91           0.633      0.104           0.106
Dp        411.1           11.13      0.32            0.33

4.4.1 Clustering Time

In this section, we evaluate the clustering time in the vector space. First, we justify the speed-up of the MVD based hierarchical clustering in the vector space over the edit distance based clustering by presenting the component wise required times, shown in Table 4.3. It can be seen that for larger datasets such as N = 5 × 10^5, the edit distance based hierarchical clustering requires more than 15 days while the proposed technique requires only about 11 hours with nominal loss in accuracy. The pairwise distance calculation based on the MVD in the vector space is more than one order of magnitude faster than that based on the edit distance. We also evaluate the clustering time for different similarity thresholds.
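The per-dataset speed-up of the MVD based pairwise distance step can be read directly off the times in Table 4.3; a quick check that the gain grows with dataset size, as the cost analysis in Section 4.3.2 predicts:

```python
# Pairwise distance calculation times from Table 4.3, in hours
edit_hours = {"Dm": 0.614, "Dh": 0.931, "Dr": 1.771, "Ds": 17.91, "Dp": 411.1}
mvd_hours = {"Dm": 0.083, "Dh": 0.068, "Dr": 0.104, "Ds": 0.633, "Dp": 11.13}

# speed-up of the MVD based step over exact edit distances, per dataset
speedup = {d: edit_hours[d] / mvd_hours[d] for d in edit_hours}
```

For the largest dataset Dp the ratio is roughly 37x, versus roughly 7x for the smallest dataset Dm.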
Among the compared techniques, only VSEARCH is faster than the proposed space transformation based hierarchical clustering, although its cluster accuracy is significantly lower than that of the proposed technique. The clustering times for both the edit distance based hierarchical method and LSH-Div increase more rapidly than that of the proposed technique with an increasing number of sequences in the datasets.

Table 4.4: For the 99% similarity threshold, comparison of clustering time and accuracy using the MVD based method to those of the actual edit distances as well as the k-mer matching based VSEARCH and LSH-Div clustering

          Baseline    MVD                    VSEARCH [23]           LSH-Div [22]
Dataset   Time(hr)    Time(hr)/NMI/AMI       Time(hr)/NMI/AMI       Time(hr)/NMI/AMI
Dm        0.61        0.054/0.998/0.993      0.007/0.903/0.75       0.0253/0.89/0.68
Dh        0.94        0.067/0.997/0.9923     0.01/0.89/0.77         0.0493/0.9/0.7
Dr        1.72        0.095/0.9976/0.9916    0.03/0.9/0.71          0.218/0.85/0.67
Ds        17.95       0.052/0.9971/0.9905    0.147/0.91/0.78        1.416/0.92/0.78
Dp        411.28      7.1/0.994/0.9905       0.352/0.901/0.802      9.6/0.903/0.713

4.4.2 Cluster Similarity

We quantify the relative similarity of the clusters generated by the proposed technique, VSEARCH and LSH-Div with those of the edit distance based hierarchical clustering by using the NMI

Table 4.5: For the 98% similarity threshold, comparison of clustering time and accuracy using the MVD based method to those of the actual edit distances as well as the k-mer matching based VSEARCH and LSH-Div clustering

          Baseline    MVD                    VSEARCH [23]           LSH-Div [22]
Dataset   Time(hr)    Time(hr)/NMI/AMI       Time(hr)/NMI/AMI       Time(hr)/NMI/AMI
Dm        0.63        0.072/0.982/0.9725     0.0043/0.91/0.767      0.0258/0.88/0.702
Dh        0.943       0.094/0.98/0.9745      0.0041/0.91/0.78       0.0508/0.884/0.71
Dr        1.79        0.122/0.978/0.9702     0.0125/0.91/0.716      0.219/0.84/0.68
Ds        17.98       0.614/0.979/0.9691     0.058/0.92/0.79        1.43/0.91/0.784
Dp        411.4       7.88/0.979/0.967       0.19/0.895/0.798       9.52/0.891/0.708

Table 4.6: For 97% similarity threshold, comparison of
clustering time and accuracy using the MVD based method to those of the actual edit distances as well as the k-mer matching based VSEARCH and LSH-Div clustering

          Baseline    MVD                    VSEARCH [23]           LSH-Div [22]
Dataset   Time(hr)    Time(hr)/NMI/AMI       Time(hr)/NMI/AMI       Time(hr)/NMI/AMI
Dm        0.634       0.104/0.9956/0.989     0.003/0.915/0.79       0.0261/0.87/0.732
Dh        0.95        0.086/0.9938/0.9918    0.0054/0.905/0.803     0.0528/0.87/0.76
Dr        1.796       0.13/0.994/0.9882      0.008/0.89/0.73        0.22/0.83/0.686
Ds        18.01       0.74/0.992/0.987       0.043/0.93/0.82        1.44/0.91/0.8
Dp        411.5       11.36/0.992/0.9856     0.162/0.905/0.798      9.83/0.89/0.72

and the AMI scores. As shown in Table 4.4, for 99% similarity (edit distance threshold = 2.5), the clusters based on the edit distances and the MVD are nearly identical, because most of the predicted edit distances in the vector space are based on heuristic H1, and the heuristic H1 based predictions are accurate in more than 99% of the cases. For 98% similarity (edit distance threshold = 5), the majority of the predicted edit distances are based on heuristics H1, H2 and H2.1. Because the threshold corresponds to an edit distance of 5, many of the predictions at 98% similarity use heuristics H2 and H2.1, which are less accurate. Hence, the NMI and AMI scores are relatively low for the clusters at 98% similarity (Table 4.5). For 97% similarity (edit distance threshold = 7.5), the clusters also include the heuristic H3 based edit distances, which are correct. Because of the high accuracy of heuristics H1 and H3, the impact of the less accurate heuristic H2 and H2.1 based predictions is minimized (Table 4.6). Hence, the relative similarity scores improve for the clusters at 97% similarity over those at 98%. However, the relative similarity of both the VSEARCH and the LSH-Div techniques is significantly lower.
Although the NMI scores are around 0.9 for both techniques, the AMI scores are quite low, which further shows that the k-mer and locality sensitive hashing based techniques fail to effectively capture the edit operations between pairs of sequences.

Table 4.7: Comparing average linkage hierarchical clusters using the silhouette coefficient. The clusters are generated based on the edit distances, the MVD based predicted edit distances as well as the k-mer matching techniques VSEARCH and LSH-Div.

Similarity   Dataset   Edit Distance   MVD     VSEARCH   LSH-Div
99%          Dm        0.314           0.31    0.18      0.15
             Dh        0.415           0.41    0.29      0.02
             Dr        0.32            0.312   0.196     0.05
98%          Dm        0.426           0.407   0.26      0.14
             Dh        0.443           0.42    0.30      0.01
             Dr        0.378           0.36    0.247     0.04
97%          Dm        0.57            0.56    0.4       0.12
             Dh        0.47            0.46    0.31      -0.084
             Dr        0.36            0.344   0.24      0.02

4.4.3 Cluster Quality

We evaluate the quality of the resulting clusters using the silhouette coefficient, which is measured using the pairwise edit distances of the sequences. Table 4.7 presents the silhouette coefficients for the experiment datasets Dm, Dh and Dr. The silhouette coefficients of the larger datasets are not computed due to memory constraints. As shown in the table, the quality of the hierarchical clusters generated in both the edit and the vector space is significantly better than that of the other methods. Since the predicted edit distances in the vector space are not entirely accurate, the silhouette coefficients of the vector space clusters are slightly lower than those in the edit space. Between the other two methods, the clusters generated by LSH-Div are more noisy and less separable. Their silhouette coefficients vary between 0 and 0.2, which indicates that the clusters are not well separated and that the distances between randomly generated subsets of the sequences cannot effectively capture the corresponding edit distances.
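The silhouette coefficient used above can be computed directly from a precomputed pairwise distance matrix; a minimal sketch with a toy matrix (helper names are ours):

```python
def silhouette(dist, labels):
    """Mean silhouette over all points, from a precomputed distance matrix."""
    n = len(labels)
    scores = []
    for i in range(n):
        same = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not same:
            continue  # silhouette is undefined for singleton clusters
        a = sum(same) / len(same)  # mean intra-cluster distance
        b = min(                   # mean distance to the nearest other cluster
            sum(dist[i][j] for j in range(n) if labels[j] == lab)
            / sum(1 for j in range(n) if labels[j] == lab)
            for lab in set(labels) if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# two tight, well separated toy clusters give a coefficient close to 1
toy_dist = [[0, 1, 9, 9],
            [1, 0, 9, 9],
            [9, 9, 0, 1],
            [9, 9, 1, 0]]
toy_labels = [0, 0, 1, 1]
```

Values near 1 indicate compact, well separated clusters; values near 0 (as observed for LSH-Div) indicate overlapping ones.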
4.5 Summary

In this chapter we proposed a novel edit distance prediction technique and used the predicted edit distances to create clusters from sequences in NGS datasets. We used edit distance based average linkage hierarchical clustering and showed that our method was an order of magnitude faster than the actual edit distance based clustering while maintaining more than 99% accuracy at the given thresholds for most of the datasets. We compared performance against a k-mer based and an LSH based method, namely VSEARCH and LSH-Div. We showed that the accuracy of our method was significantly better than that of VSEARCH and LSH-Div. The better quality of our clusters was due to the fact that the predicted edit distances within the given similarity thresholds were mostly accurate, and in the remaining cases they were within 1-2 points of the actual distances.

Chapter 5

Edit Distance Prediction from Distances between the Sub-sequences

In the previous chapters, we proposed edit distance prediction methods using randomly generated dataset independent reference sequences [35, 34]. For instance, the MVD based edit distance prediction [34] used a set of n randomly generated dataset independent reference sequences, R, to convert dataset sequences into feature vectors where each dimension of the vector represents the edit distance between the dataset sequence and a reference sequence. Given dataset sequences A and B, a reference sequence r, and an edit distance function ed(A, B), we have |ed(A, r) − ed(B, r)| ≤ ed(A, B) because of the triangle inequality property of metric distance measures. Further, the MVD, i.e., max over r ∈ R of |ed(A, r) − ed(B, r)|, satisfies MVD ≤ ed(A, B). The MVD based method predicts edit distances nearly accurately in several scenarios. Since the edit operations between the NGS sequences can appear in random positions, as the number of reference sequences increases, the MVD gets closer to the actual edit distances at the cost of higher computation time.
Despite the decline in prediction accuracy for larger edit distances, this method shows superior accuracy for applications such as hierarchical clustering over that of the existing k-mer based sequence similarity methods.

In this chapter, instead of the randomly generated reference sequences, we use a set of simple reference sequences to develop an improved edit distance prediction method. Each reference sequence of this set consists of a single repeated letter of the alphabet. For example, given the alphabet Σ = {a, c, g, t}, the reference sequences are {aaaa..., cccc..., gggg..., tttt...}. Even though the number of reference sequences is small and therefore the computation time is low, the prediction accuracy based on these reference sequences degrades rapidly for higher edit distances. Let us assume a pair of dataset sequences, A = gcatacg and B = gcctaag, with two substitution operations at the 3rd and the 6th positions. For the reference sequence ccccccc, ed(A, ccccccc) = 5 and ed(B, ccccccc) = 5. Hence the difference of distances between A and B with respect to ccccccc is 0, which does not correctly predict the edit distance. Similarly, the differences of distances between A and B with respect to aaaaaaa, ggggggg and ttttttt are 0. In order to minimize such erroneous distance predictions, we compute the edit distances for sub-sequences separately. We develop merging strategies to predict the edit distance between a pair of sequences from the partial edit distances between their sub-sequences. This method significantly improves the edit distance prediction accuracy while running in linear time with respect to sequence length.

5.1 Basic Concepts

In this section, we briefly present the notations relevant to the proposed reference sequence based edit distance prediction method. We also outline the key ideas of the fixed length sub-sequence based edit distance prediction method.

5.1.1 Notations

Given an alphabet Σ, S refers to a dataset of N sequences, {s1, s2, . . . , sN}.
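The cancellation example above can be reproduced directly; the compact edit distance helper is our own:

```python
def ed(a, b):
    """Compact Wagner-Fischer edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

A, B = "gcatacg", "gcctaag"          # differ by two substitutions
refs = [c * 7 for c in "acgt"]       # single-letter reference sequences

# every single-letter reference yields a difference of distances of 0,
# even though ed(A, B) = 2
diffs = [abs(ed(A, r) - ed(B, r)) for r in refs]
```

The two substitutions cancel in every difference of distances, which is exactly the failure mode the sub-sequence splitting is designed to avoid.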
Assuming A is a sequence in the dataset, A[i] represents the ith letter of the sequence. A[i : j] represents the sub-sequence starting at the ith position and ending at the jth position. A(i, j) represents the sub-sequence of length j starting at the ith position. The null element Λ is used to represent insertions and deletions between a pair of sequences (Definition 1). For instance, (A[i] → Λ) represents a deletion of the ith element of sequence A and (Λ → A[i]) represents an insertion at the ith position of sequence A. γ(x → y) is the cost function for replacing x with y where x, y ∈ (Σ ∪ Λ). Assuming the maximum length of the sequences in S is l, the reference sequences must be at least l letters long. R refers to a set of n reference sequences, {r1, r2, . . . , rn} where ∀j ∈ {1, . . . , n}, rj ∈ R and |rj| ≥ l. As discussed in Section 3.2.2, the reference sequences must be far from each other so that they have nominal correlation. The maximum distance between a pair of reference sequences could be l. A set of such reference sequences is Rl where ∀rj, rk ∈ Rl and j ≠ k, ed(rj, rk) = l. Although different sets of reference sequences can satisfy the requirement of Rl, we choose Rl such that ∀x, y ∈ Σ where x ≠ y, both x^l and y^l are in Rl. For example, given Σ = {a, c, g, t} and l = 5, we choose the reference sequences R5 = {aaaaa, ccccc, ttttt, ggggg}. In other words, each reference sequence in Rl consists of exactly one repeated letter of Σ.

5.1.2 Key Ideas

In this section, we present the key ideas of the edit distance prediction method based on the reference sequences in Rl. First, the prediction accuracy based on these reference sequences is very good for small edit distances. However, the accuracy drops sharply for larger edit distances due to cancelled edit operations while computing the differences of distances with respect to the reference sequences.
In order to improve the accuracy, we divide the dataset sequences into sub-sequences of predefined length and compute edit distances between the corresponding sub-sequences of pairs of sequences.

Second, we merge the partial edit distances of pairs of fixed length sub-sequences to predict the edit distance between the entire sequences. While merging, we need to consider that insertion and deletion operations may be propagated to subsequent fixed length sub-sequences. Therefore, it is important to distinguish the pairs of sub-sequences where edit operations are captured redundantly. We apply different strategies to discard redundant edit operations from the partial edit distances of those pairs of sub-sequences.

Third, given multiple edit operations between a pair of sub-sequences, the differences of distances are less likely to predict the edit distances correctly for a small alphabet. For example, given an alphabet Σ, the probability of correct edit distance prediction for two substitution operations is ((|Σ|−1) + (|Σ|−2)²)/|Σ|². This ratio grows with the alphabet size for an increasing number of edit operations. Since the DNA alphabet consists of only four letters (nucleotides), we consider pairs of letters (di-nucleotides) as single elements to increase the number of all possible elements in the sequences. Then, we compute the differences of distances with respect to the reference sequences over these elements. This strategy improves the edit distance prediction accuracy between the NGS sequences significantly compared to that of the single letter based strategy, with a nominal increase in computation time.

Fourth, for a pair of sub-sequences with multiple edit operations, the differences of distances based edit distance prediction is likely to be less accurate due to the lack of information about the positions of the edit operations. In these scenarios, we improve the prediction accuracy by considering the positions of the edit operations.
Although this strategy does not guarantee accurate edit distance prediction between the entire sequences, it improves the accuracy significantly when there are several edit operations in close proximity.

5.2 Reference Sequence based Transformation

Selecting an effective set of reference sequences is critical for our proposed edit distance prediction methods. As discussed in Section 3.2, there are a few key properties for such a set of reference sequences. For example, the reference sequences must be far from each other, i.e., the edit distance between each pair of them must be significantly large. The reference sequences proposed in this chapter are significantly different from those of the previous chapters. Hence, we revisit some of the key properties of the reference sequences.

5.2.1 Edit Distances among the Reference Sequences

Given A ∈ S, if the reference sequences are correlated, the distances ed(A, r) for r ∈ R are close to each other. As discussed in Section 3.2.2, our goal is to minimize the correlation among the reference sequences. Correlation among the reference sequences decreases monotonically with an increasing distance threshold among them. The proposed single letter reference sequences Rl have the maximum edit distances among them. Further, they possess a unique property which is useful for predicting edit distances between the sequences in S. We prove this property in Theorem 2 based on the concept of the trace defined in [1]. For the sake of completeness, the definition of the trace [1] is presented here.

Definition 11. A trace from sequence A to sequence B is a triple (T, A, B), where T is a set of ordered pairs of integers (i, j) satisfying:
1. 1 ≤ i ≤ |A| and 1 ≤ j ≤ |B|;
2. for any two distinct pairs (i1, j1) and (i2, j2) in (T, A, B),
   (a) i1 ≠ i2 and j1 ≠ j2,
   (b) i1 < i2 iff j1 < j2.

When A and B are understood, (T, A, B) is referred to as T. A pair (i, j) in T describes a line between the letters A[i] and B[j].
Each letter can be connected by at most one line and the lines do not cross. The ordered pairs indicate the change operations between the sequences. A change operation represents either an exact match or a substitution between the corresponding letters. The following is an example of a trace between sequences A and B where T = <(1, 1), (2, 2), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10), (10, 11), (12, 12)>. A trace may not connect all the letters in A and B. In that case, the untouched letters in A are deleted from the sequence, denoted as (A[i] → Λ), and the untouched letters in B are inserted into A, denoted as (Λ → B[i]). Therefore, the cost of the given trace T, denoted cost(T), includes the cost of the change operations as well as those of the insertions and deletions. Assuming the cost of each edit operation is exactly one, the costs of the change, deletion and insertion operations in T are 0, 1 and 1 respectively. Hence, cost(T) = 2. The minimum trace cost is the edit distance between a pair of sequences. More detail about the trace and its cost function is available in Wagner et al. [1].

Theorem 2. Given a reference sequence r = x^l ∈ Rl and a dataset sequence A of length l′, ed(r, A) = l − k, where l ≥ l′ and k = Σ_{i=1}^{l′} [A[i] = x] is the number of positions of A holding the letter x.

Proof. Let us assume x ∈ Σ appears k times in A. The maximum size of a trace between r and A is l′. Suppose the size of (T0, r, A) is l′. Thus, the number of change operations in T0 is l′. Since x ∈ Σ appears k times in A, the cost of the change operations for those k pairs (i, j) ∈ T0 is zero. In addition, (l − l′) deletion operations occur in r. Since l ≥ l′, there is no insertion operation.

cost(T0) = Σ γ(r[i] → A[j]) + Σ γ(Λ → A[j]) + Σ γ(r[i] → Λ)
cost(T0) = (l′ − k) + 0 + (l − l′) = l − k

Assume an arbitrary trace (Ti, r, A) of size l′′ where l′′ ≤ l′.
Therefore, (l − l′′) letters in r and (l′ − l′′) letters in A do not participate in change operations, and the corresponding positions of r and A are not part of Ti. Letters at such positions in r are deleted and those in A are inserted into r.

cost(Ti) = (l′′ − k) + (l′ − l′′) + [(l′ − l′′) + (l − l′)]
cost(Ti) = (l − k) + (l′ − l′′) = cost(T0) + (l′ − l′′)

Hence, cost(T0) ≤ cost(Ti).

For example, in Figure 5.1(a), the trace (T0, A, r) has the maximum number of change operations. Here, the numbers of change, insertion and deletion operations are 11, 1 and 0 respectively. The reference sequence r ∈ Rl is based on the letter t ∈ Σ. Further, t appears k = 2 times in sequence A. Assuming the cost of each insertion, deletion and substitution operation is exactly 1, cost(T0) = (11 − 2) + 1 + 0 = 10. On the other hand, the cost of the arbitrary trace (T1, A, r) in Figure 5.1(b) is 12, which is larger than that of T0.

Figure 5.1: (a) A minimum cost trace with the maximum number of change operations and (b) an arbitrary trace with fewer change operations.

However, ed(A, r) represents the edit distance with respect to the letter t ∈ Σ only. In order to predict the edit distance between sequences A and B, we need to compute the differences of the edit distances of A and B with respect to each r ∈ Rl. The following holds for a given reference sequence because the edit distance satisfies the metric property.

ed(A, B) ≥ (ed(A, r) ∼ ed(B, r)) (5.1)

5.2.2 Predicting Edit Distance

In this section, we apply Theorem 2 and Inequality 5.1 to develop strategies for edit distance prediction between pairs of dataset sequences.
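The identity in Theorem 2 can be checked with a short sketch (a minimal Python illustration under unit edit costs; `edit_distance` is a textbook dynamic-programming implementation in the style of [1], and `homopolymer_distance` is a hypothetical helper name, not code from this dissertation):

```python
def edit_distance(a, b):
    """Unit-cost edit distance via the standard dynamic program [1]."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # change operation
        prev = cur
    return prev[-1]

def homopolymer_distance(a, x, l):
    """Theorem 2: ed(x^l, A) = l - k, where k counts occurrences of x in A."""
    return l - a.count(x)

# The chapter's opening example: ed(gcatacg, c^7) = ed(gcctaag, c^7) = 5, so the
# difference of distances is 0, although ed(gcatacg, gcctaag) = 2.
for seq in ("gcatacg", "gcctaag"):
    assert edit_distance(seq, "c" * 7) == homopolymer_distance(seq, "c", 7) == 5
```

The same identity holds for every x ∈ Σ and any A with |A| ≤ l, which is what allows the transformation step to build feature vectors without running the quadratic dynamic program.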
5.2.2.1 Edit Distance One

For a pair of dataset sequences with edit distance one, the distance is correctly predicted from the differences of distances with respect to the reference sequences in Rl. We prove this in Theorem 3.

Theorem 3. If ed(A, B) = 1, then ∃ri ∈ Rl s.t. (ed(A, ri) ∼ ed(B, ri)) = 1, where 1 ≤ i ≤ |Σ|.

Proof. Let us consider a dataset sequence A of length l′ and reference sequences ri ∈ Rl, where 1 ≤ i ≤ |Σ| and l > l′. Thus, by Theorem 2,

ed(A, ri) = (l′ − ki) + (l − l′) = l − ki, where ki = Σ_{j=1}^{l′} [A[j] = ri[j]] (5.2)

Suppose the letter x ∈ Σ is inserted at an arbitrary mth position of A to convert it into a dataset sequence B of length (l′ + 1). Hence,

ed(B, ri) = l − (ki + 1), if B[m] = ri[m] = x;
            l − ki, otherwise. (5.3)

Based on Equations 5.2 and 5.3, the difference of the edit distances of A and B with respect to ri ∈ Rl, where ri = x^l, is 1. Hence,

∃ri ∈ Rl, (ed(A, ri) ∼ ed(B, ri)) = 1. (5.4)

A deletion operation converting A into B is reciprocal to the insertion operation and also satisfies Equation 5.4. Now assume a substitution operation at A[m] that converts the sequence into B, where |A| = l′ and l ≥ l′. Then

ed(A, ri) = l − (ki + 1), if A[m] = ri[m];
            l − ki, otherwise. (5.5)

Based on Equation 5.5, (ed(A, ri) ∼ ed(B, ri)) = 1 if A[m] = ri[m] and B[m] ≠ ri[m], or vice versa. Otherwise, the difference of distances is 0. Hence, Equation 5.4 holds for a substitution operation between A and B.

For example, let us consider A = tctgta and B = tcttgta, where the letter t ∈ Σ is inserted at A[4] to convert A into B. Assuming Rl = {aaaaaaa, ccccccc, ggggggg, ttttttt}, (ed(A, ri) ∼ ed(B, ri)) = 1 for ri = ttttttt and 0 with respect to the other reference sequences.
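The example can be reproduced in a few lines (an illustrative Python sketch; `feature_vector` is a hypothetical helper name, and the distances use the Theorem 2 identity ed(A, x^l) = l − (occurrences of x in A)):

```python
def feature_vector(seq, l, alphabet="acgt"):
    """Distances of seq to each homopolymer reference x^l (Theorem 2, l >= len(seq))."""
    return {x: l - seq.count(x) for x in alphabet}

A, B = "tctgta", "tcttgta"          # B is A with the letter t inserted at position 4
va, vb = feature_vector(A, 7), feature_vector(B, 7)
diffs = {x: abs(va[x] - vb[x]) for x in "acgt"}
# Only the reference ttttttt registers the insertion:
assert diffs == {"a": 0, "c": 0, "g": 0, "t": 1}
```

For a substitution, two references register a difference of 1 each (the replaced letter and its replacement), consistent with the case analysis of Equation 5.5.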
5.2.2.2 Edit Distance Greater than One

In Theorem 3, we showed the correctness of our proposed method for edit distance one. In general, it is more likely to have more than one edit operation (Definition 2) between a pair of sequences. If ed(A, B) > 1, (ed(A, ri) ∼ ed(B, ri)) may not predict the correct edit distance. However, edit distance prediction based on Expression 5.6 is bounded by the actual edit distance for ed(A, B) ≥ 1, because at least two of the reference sequences in Rl have a non-zero (ed(A, ri) ∼ ed(B, ri)) for any edit operation between A and B. In many cases, the following expression predicts the edit distance correctly.

(Σ_{i=1}^{|Σ|} (ed(A, ri) ∼ ed(B, ri))) / 2 (5.6)

As the number of edit operations grows, the prediction accuracy based on Rl drops rapidly because an increasing number of edit operations are canceled when comparing the differences of the edit distances of dataset sequences with respect to the reference sequences in Rl. For example, in Figure 5.2, the sequence of edit operations is E = <(3, Λ, g), (11, g, Λ)>. In both cases, g ∈ Σ is involved in the edit operations. However, the difference of distances between the sequence pair with respect to the reference sequence r = gggggggggggg is zero, because the effects of the edit operations cancel each other when comparing the frequency of g in A and B. In order to minimize this problem, we apply the differences of distances based prediction to the sub-sequences separately rather than to the entire sequences at once. This strategy in most cases reduces the number of edit operations per pair of sub-sequences. For example, in Figure 5.3, the sequences are divided into non-overlapping fixed length sub-sequences. Each of the edit operations falls into a separate sub-sequence.
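As a concrete sketch of this splitting idea (illustrative Python; substitution-only, equal-length blocks, ignoring the indel propagation handled by the merging strategies of this chapter), the block-wise form of Expression 5.6 recovers the two substitutions of the chapter's opening example, which cancel when the sequences are compared as a whole:

```python
def predicted_ed(a, b, k, alphabet="acgt"):
    """Apply Expression 5.6 per fixed-length sub-sequence and sum the parts.
    For equal-length blocks, |ed(block, x^k) - ed(block', x^k)| reduces to a
    difference of letter counts by Theorem 2 (substitution-only sketch)."""
    total = 0
    for i in range(0, len(a), k):
        ba, bb = a[i:i + k], b[i:i + k]
        total += sum(abs(ba.count(x) - bb.count(x)) for x in alphabet) // 2
    return total

A, B = "gcatacg", "gcctaag"                # two substitutions, ed(A, B) = 2
assert predicted_ed(A, B, k=len(A)) == 0   # whole sequences: the edits cancel
assert predicted_ed(A, B, k=4) == 2        # length-4 sub-sequences: predicted correctly
```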
Hence, according to Theorem 3, we are able to compute them correctly.

Figure 5.2: The difference of edit distances between sequences A and B with respect to the reference sequence g^l cannot predict the correct edit distance because the same letter is inserted and subsequently deleted, resulting in no change in the overall frequency.

However, merging the partial edit distances into the final edit distance is a challenging task. We discuss the merging strategies in the following sections.

Figure 5.3: Sequences A and B are divided into fixed length sub-sequences. Each pair of sub-sequences has only one edit operation. The edit operations are predicted correctly and separately with respect to the given reference sequences using Theorem 3. The partial edit distances of the sub-sequences are merged subsequently.

5.3 Edit Distance Prediction from Partial Edit Distances of the Corresponding Sub-sequences

We extend the differences of distances based edit distance prediction between pairs of dataset sequences by applying Theorem 3 to the corresponding fixed length sub-sequences instead of the entire sequences. An example of such sub-sequence based edit distance prediction is shown in Figure 5.3. The example also highlights the challenge of merging the partial edit distances between the sub-sequences so that no edit operation is counted redundantly. For instance, in Figure 5.3, the edit distance for the 2nd pair of sub-sequences is 1, which is redundant due to the insertion operation in the preceding pair of sub-sequences. We develop the following strategies for efficient merging of partial edit distances between the corresponding sub-sequences.

• PS: Primary/Secondary sub-sequences: We compare the differences of distances between adjacent pairs of sub-sequences to identify whether the partial edit distance of the following sub-sequences is redundant or not. A pair of corresponding sub-sequences with only redundant differences of distances is denoted as secondary sub-sequences.
The edit distance of such sub-sequences is discarded during the merging process. Each remaining pair of corresponding sub-sequences with non-zero differences of distances is denoted as primary sub-sequences and is therefore considered while merging partial edit distances.

• ML: Differences of Distances Comparison using Multiple Letters: Due to the small DNA sequence alphabet, the differences of distances between elements of corresponding sub-sequences are likely to be smaller than the actual edit distance. As a result, identifying primary sub-sequences becomes more difficult. To minimize this problem, we consider multiple adjacent letters as single elements. This strategy increases the number of all possible elements significantly and reduces the risk of matching elements by chance. Hence, it improves the edit distance prediction accuracy based on the differences of distances of elements in corresponding sub-sequences.

5.3.1 Merging Partial Edit Distances of the Sub-sequences

The differences of distances between any corresponding sub-sequences are calculated independently of their location. However, in order to determine the edit distance between the entire sequences, we merge the partial predicted edit distances of the sub-sequences in ascending order, because edit operations have cascading effects on the following sub-sequences. Although edit operations occur in temporal order, for a given pair of sequences there is an equivalent physically ordered sequence (series) of edit operations.

Definition 12. Given a sequence of edit operations E, Em ⊂ E is the minimum sequence of edit operations if ∀(ij, xj, yj), (ik, xk, yk) ∈ Em, 1 ≤ j < k ≤ |Em| and ij ≠ ik.

Proposition 3. Given a sequence of edit operations E, there exists a corresponding minimal sequence of edit operations Em.

Proof. In the trivial case, if |E| ≤ 1, Em = E.
For |E| > 1, if (ij, xj, yj) ∈ E is an insertion or a deletion operation, it changes the positions of the edit operations (ij+k, xj+k, yj+k) ∈ E, where k > 0 and ij ≤ ij+k. Assume the function λ(ij) updates the positions of all the edit operations starting from (ij, xj, yj) in a sequence that is also physically ordered. For any pair of edit operations (ij, xj, yj), (ij+k, xj+k, yj+k) ∈ E, the following cases minimize the number of edit operations in E.

Case 1: Both are substitution operations and λ(ij) = ij+k. Thus, Em = E − (ij, xj, yj). There is no change in the positions of the remaining edit operations.

Case 2: (ij, xj, yj) is an insertion operation, (ij+k, xj+k, yj+k) is a substitution operation, and λ(ij) = ij+k. Thus, yj = yj+k and Em = E − (ij+k, xj+k, yj+k).

Case 3: (ij, xj, yj) is a substitution operation, (ij+k, xj+k, yj+k) is a deletion operation, and λ(ij) = ij+k. Then Em = E − (ij, xj, yj).

Case 4: (ij, xj, yj) is an insertion operation, (ij+k, xj+k, yj+k) is a deletion operation, and λ(ij) = ij+k. Then Em = E − (ij, xj, yj) − (ij+k, xj+k, yj+k).

In addition, for any (in, xn, yn) ∈ E where j < n < (j + k) and ij > in, λ(in) is decreased by 1. Hence, Em ⊆ E.

For example, let us consider E = <(10, Λ, a), (7, g, Λ), (21, t, Λ), (9, c, Λ), (15, Λ, g)>. The edit operation (9, c, Λ) nullifies the effect of the edit operation (10, Λ, a). Thus, Em = E − <(9, c, Λ)>.

Merging the partial edit distances of the corresponding pairs of sub-sequences increases the likelihood of correct edit distance prediction between a pair of sequences only if, for a given Em, there is an equivalent physically ordered sequence (series) of edit operations. In Proposition 4, we show that a temporally ordered minimum sequence of edit operations Em can be represented by a physically ordered one.

Proposition 4. For a given Em, there exists a physically ordered sequence of edit operations Em_p s.t. for any (ij, xj, yj), (ij+1, xj+1, yj+1) ∈ Em_p, 1 ≤ j < |Em_p|, ij < ij+1.

Proof.
Let us consider ∀(ij, xj, yj), (ik, xk, yk) ∈ Em, 1 ≤ j, k ≤ |Em|. The following cases can occur while converting the sequence into a physically ordered one.

Case 1: j < k and ij < ik; the temporally ordered edit operations are also physically ordered.

Case 2: j < k and ij > ik; ik may change the value of ij.

i″j = ij + 1, if ij > ik and xk = Λ;
      ij − 1, if ij > ik and yk = Λ;
      ij, otherwise. (5.7)

For the sake of contradiction, suppose the edit operation (ij, xj, yj) ∈ Em is converted to (i″j, xj, yj) ∈ Em_p but the positions ij and i″j are not equivalent. This cannot be true because of the following scenarios.

• There is no ij = ik because Em is a minimal sequence of edit operations.
• If ij > ik, Equation 5.7 updates ij.
• For all j < n < k with ij > in, the position ij will be updated by λ(in).

Hence, the edit operations in Em_p are equivalent to those in Em.

For example, let Em = <(7, c, Λ), (20, a, Λ), (9, t, Λ), (15, Λ, g), (12, Λ, t)>. The edit operation (20, a, Λ) will be updated to (21, a, Λ). Similarly, (15, Λ, g) will be updated to (16, Λ, g). Thus, Em_p = <(7, c, Λ), (9, t, Λ), (12, Λ, t), (16, Λ, g), (21, a, Λ)>.

5.3.2 Finding Primary and Secondary Sub-sequences

If each edit operation belongs to only one pair of corresponding sub-sequences, we can accumulate the partial predicted edit distances of the sub-sequences to predict the edit distance between the entire sequences. However, insertion and deletion operations shift the positions of the letters and are therefore likely to create, in fixed length sub-sequences, edit operations that are not present between the actual sequences. Hence, it is critical to discard the partial distances of the secondary sub-sequences, which contain only edit operations generated by the segmentation of the sequences. We determine secondary sub-sequences as follows. Let us consider A = <A1.A2. . . . Ai · · ·> and B = <B1.B2. . . .
Bi · · ·>, where Ai and Bi are a pair of primary sub-sequences, and there are non-zero differences of distances between Ai+1 and Bi+1. We need to verify whether the edit operation(s) in Ai and Bi are causing the differences between Ai+1 and Bi+1. We add the distances of Ai and Ai+1 with respect to each reference sequence and do the same for Bi and Bi+1. Then we compute the differences of distances between Ai.Ai+1 and Bi.Bi+1 with respect to the reference sequences. If the sum of the differences of distances is greater than that of Ai and Bi, then Ai+1 and Bi+1 are also primary sub-sequences. Otherwise, we need to examine each difference of distances in the sub-sequences to determine their type. In case the effect of letters in Ai is canceled by letters in Bi+1 or vice versa, Ai+1 and Bi+1 are more likely secondary sub-sequences. If not, they are more likely primary sub-sequences.

For example, in Figure 5.4, sequences A and B are divided into sub-sequences of length 4. The differences of distances between each pair of corresponding sub-sequences are non-zero for multiple reference sequences. Among them, A(0, 4) and B(0, 4) are inherently primary sub-sequences. In order to determine whether A(4, 4) and B(4, 4) are primary sub-sequences, we compute the difference of distances between A(0, 8) and B(0, 8). The sum of the differences of distances between A(0, 4) and B(0, 4), and between A(4, 4) and B(4, 4), is equal to that of A(0, 8) and B(0, 8). In addition, the letters responsible for the differences of distances also indicate a high likelihood of A(4, 4) and B(4, 4) being secondary sub-sequences. Hence, we discard the differences of distances found between A(4, 4) and B(4, 4).

Figure 5.4: The edit distance between sequences A and B is predicted from the differences of distances of the primary sub-sequences. The 2nd pair of sub-sequences contains only redundant edit operations and hence is not considered for the distance prediction.
Similarly, for A(8, 4) and B(8, 4), the sum of the differences of distances for A(0, 12) and B(0, 12) is zero. However, this cannot be correct because an edit operation was recorded between A(0, 4) and B(0, 4). It suggests that at least one insertion or deletion operation occurred between A(8, 4) and B(8, 4). Hence, we predict A(8, 4) and B(8, 4) as primary sub-sequences. We add the edit operation to that of A(0, 4) and B(0, 4) and update the latest primary sub-sequence position to the 3rd pair of sub-sequences. The merged edit distance correctly predicts the edit distance between sequences A and B.

Based on the proposed strategies, we present the following edit distance prediction function that computes the differences of distances between the corresponding sub-sequences of a pair of dataset sequences.

Function 1. Edit-Distance-Prediction-using-Sub-sequences(A, B, k)
Input: A, B, sub-sequence length k
Output: Predicted distance edpr(A, B)
1. At = Reference-Sequence-based-Transformation(A, k)
2. Bt = Reference-Sequence-based-Transformation(B, k)
3. edpr(A, B) = 0
4. foreach corresponding pair of sub-sequences in A and B do
5.   diffdist = 0
6.   foreach c ∈ Σ do
7.     diffdist += (ed(A(i ∗ k, k), c^k) ∼ ed(B(j ∗ k, k), c^k))
8.   end for
9.   if diffdist > 0 do
10.    Verify if the pair of sub-sequences is primary or secondary
11.    if (primary sub-sequence) do
12.      if (diffdist > 2) do
13.        diffdist = Edit-Distance-Prediction-using-Sub-sequences(A(i ∗ k, k), B(j ∗ k, k), k/2)
14.      end if
15.      Add the partial edit distance to edpr(A, B)
16.      Update the index of the latest primary sub-sequence
17.    end if
18.  end if
19. end for
20. Return edpr(A, B)

Function 2. Reference-Sequence-based-Transformation(A, k)
Input: Sequence A, sub-sequence length k
Output: At /* The edit distances between the sub-sequences of A and ri ∈ Rl */
1. At = [[]]
2. foreach A(j ∗ k, k) in A do
3.   foreach ri ∈ Rl do
4.     At[i][j] = Σ_{m=j∗k}^{(j+1)∗k} [A[m] ≠ ri[m]]
5.   end for
6. end for
7.
return At

Steps 1 and 2 of Function 1 transform a pair of dataset sequences into feature vectors by computing the edit distances of the sub-sequences with respect to the reference sequences. The differences of distances between each pair of corresponding sub-sequences are computed in steps 5-14. In case more than one edit operation is predicted, the sub-sequences are divided further and the differences of distances are computed recursively. Partial edit distances are gradually merged in step 15 and the primary sub-sequence index is updated in step 16.

5.3.3 Multiple Adjacent Letters as Distinct Elements

The differences of distances based method relies on the number of appearances of each letter in a sub-sequence. Due to the small size of the DNA alphabet, it is likely that the differences of distances between corresponding sub-sequences will be minimized despite one or more edit operations between them. For example, in Figure 5.5, the difference of distances between A(0, 4) and C(0, 4) is zero despite an insertion operation. The difference of distances with respect to the letter a ∈ Σ is minimized because A[3] = a cancels the effect of the insertion operation at C[2].

Figure 5.5: The insertion operation between the first pair of fixed length sub-sequences is not reflected in the differences of distances between them because of the small DNA alphabet. Considering multiple adjacent letters as single elements avoids such by-chance reduction in the differences of distances.

To minimize this problem, we consider multiple adjacent letters of a sub-sequence as distinct elements. This strategy significantly increases the number of all possible elements and therefore reduces the risk of accidental matching of elements between corresponding sub-sequences. Hence, the larger number of distinct elements significantly improves the edit distance prediction accuracy.
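The element-based counting can be sketched in a few lines (illustrative Python with a hypothetical example pair, not the blocks of Figure 5.5): two blocks that are anagrams of each other are indistinguishable by single-letter counts, but overlapping di-nucleotide elements expose the difference:

```python
from collections import Counter

def element_counts(seq, w):
    """Counts of overlapping w-letter elements (w=1: letters, w=2: di-nucleotides)."""
    return Counter(seq[i:i + w] for i in range(len(seq) - w + 1))

def count_difference(a, b, w):
    """Sum of absolute count differences over all elements of width w."""
    ca, cb = element_counts(a, w), element_counts(b, w)
    return sum(abs(ca[e] - cb[e]) for e in set(ca) | set(cb))

A, B = "gata", "gaat"                    # same letters, different order
assert count_difference(A, B, 1) == 0    # single letters: the edit is invisible
assert count_difference(A, B, 2) == 2    # di-nucleotides: ta vs. aa differ
```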
It should be noted that increasing the number of letters in an element rapidly increases the number of all possible elements as well as the prediction accuracy, with added computation cost and memory requirements. Since we are focused on small pseudorandom DNA sub-sequences, elements containing two adjacent letters are sufficient for our experimental datasets. For example, for the nucleotides (letters) of DNA sequences, we consider all possible pairs of adjacent letters as distinct elements. Thus, ∀x, y ∈ Σ, xy ∈ Σ′ where |Σ′| = |Σ|². In Figure 5.5, the difference of distances between A(0, 4) and B(0, 4) with respect to the elements {ga, gt, at, ag} is 4. The non-zero differences of distances indicate an edit operation. Further examination of the positions of the elements also identifies the type of the edit operation. Hence, the increased number of distinct elements significantly improves the edit distance prediction accuracy. It should be noted that Function 1 and Function 2 need to be marginally modified to accommodate the increased alphabet based difference of distances calculation.

5.3.4 Effect of Sub-sequence Length

The length of a sub-sequence is critical for our proposed differences of distances based edit distance prediction method. Smaller sub-sequences are likely to have fewer edit operations between them. As a result, the possibility of canceled edit operations is also reduced. We conduct an experiment to show the effect of sub-sequence length on edit distance prediction accuracy (Table 5.1). Pairs of sequences up to edit distance threshold 8 from the dataset Dm (Table 3.1) are used for this experiment. The prediction accuracy is significantly higher for smaller sub-sequences than for larger sub-sequences.
When combined with the multiple letters based strategy (Section 5.3.3), the prediction accuracy is very high for smaller sub-sequences because the increased number of distinct elements reduces the number of canceled edit operations significantly.

Table 5.1: Effect of sub-sequence length on edit distance prediction accuracy (%). For a predefined distance threshold, the experimental result shows that the prediction accuracy drops with increasing sub-sequence length.

                                    Sub-sequence length
Number of distinct elements      4       6       8       10
4                              99.1    96.4    92.1    88.5
16                             99.8    99.06   98.3    97.8

5.3.5 Computational Complexity

Given sub-sequence length k and sequence length l, the initial computational complexity of the differences of distances between the sub-sequences is O((l/k) · |Σ|). Sub-sequences with multiple edit operations are computed recursively until each pair of smaller sub-sequences contains only one edit operation. Hence, for edit distance d, there can be at most d ≤ l partitions on top of the initially created sub-sequences. The overall computational complexity is O((l/k + d) · |Σ|). Assuming constant sub-sequence length and alphabet size, the complexity becomes O(l).

5.4 Experimental Results

We evaluate the proposed edit distance prediction method with respect to computation time as well as accuracy. As shown in the following experiments, the prediction accuracy of our proposed method declines very slowly. Since our focus is sequence similarity at high similarity thresholds, the experimental results are primarily shown for up to edit distance 10. The computation time for each pair of sequences varies only marginally because only a small fraction of the sub-sequences contain edit operations that require more computation time. We also evaluate the effectiveness of our proposed method on an important NGS dataset application, average linkage hierarchical clustering at high similarity thresholds.
Despite errors in prediction for larger edit distances, the overall clustering accuracy is significantly better than that of VSEARCH [23] and the locality sensitive hashing based LSH-Div [22] clustering method. The quadratic run time edit distance calculation method [1] is set as the baseline method. The edit distances and the clusters generated from the baseline method are used as the ground truth for the experiments.

5.4.1 Effect of the Proposed Strategies on Edit Distance Prediction Accuracy

We evaluate the effectiveness of each of the proposed differences of distances calculation strategies on edit distance prediction accuracy. Figure 5.6 shows the improvement in prediction accuracy when the strategies are applied collectively. The experiment is conducted on the dataset
In the following sections, we predict edit distances based on the differ- ences of distances, the increasing number of distinct elements and the minimum cost traces between the sub-sequences. 88 Table 5.2: Comparison of cumulative edit distance prediction accuracy between the MVD and the differences of distances based methods Data sets Dm Dh Dr Ds Dp Dz Data sets ed ≤ 1 ed ≤ 2 ed ≤ 3 ed ≤ 4 ed ≤ 5 Pp/M V D Pp/M V D Pp/M V D Pp/M V D Pp/M V D 1/1 1/1 1/1 1/1 1/1 1/1 ed ≤ 6 1/1 1/1 1/1 0.999/1 0.999/1 0.999/1 ed ≤ 7 0.999/0.975 0.997/0.931 0.996/0.895 0.999/0.961 0.997/0.912 0.995/0.867 0.998/0.963 0.994/0.845 0.99/0.766 0.999/0.979 0.997/0.924 0.994/0.852 0.999/0.963 0.996/0.879 0.991/0.811 0.998/0.954 0.995/0.865 ed ≤ 8 ed ≤ 9 0.991/0.789 ed ≤ 10 Pp/M V D Pp/M V D Pp/M V D Pp/M V D Pp/M V D 0.987/0.921 0.986/0.926 0.994/0.904 0.992/0.911 0.991/0.917 0.989/0.922 0.987/0.841 0.989/0.914 0.983/0.876 Dm Dh Dr Ds Dp Dz Pp = Differences of distances based edit distance prediction 0.986/0.931 0.984/0.912 0.983/0.903 0.981/0.928 0.979/0.939 0.98/0.922 0.977/0.934 0.981/0.95 0.976/0.962 0.98/0.897 0.977/0.91 0.992/0.899 0.99/0.901 0.987/0.88 0.987/0.886 0.987/0.927 0.984/0.931 0.974/0.921 0.972/0.969 0.976/0.947 0.974/ 0.942 5.4.2 Edit Distance Prediction Accuracy We evaluate edit distance prediction accuracy between sequences of the experimental datasets (Table 3.1) using the MVD and the differences of distances based methods. Prediction accuracy of the differences of distances based method declines very slowly for increasing edit distances compared to sharp decline in accuracy with the MVD based method [34]. As a result, for M V D > 5, the predicted distances need to be verified using the existing quadratic run time based method [1]. Despite the costly computation, it is shown in Table 5.2 that the cumulative prediction accuracy based on the proposed method is better than that of the M V D based method. 
The superior performance of the proposed method is demonstrated in Figure 5.7, which compares the two methods with respect to both prediction time and accuracy for increasing pairwise edit distances. The proposed method is significantly faster with nominal loss in accuracy.

Figure 5.7: Comparison of edit distance prediction time and accuracy between the MVD and the differences of distances based methods. (a) Dataset Dh, (b) Dataset Dr, (c) Dataset Ds, (d) Dataset Dp.

In the proposed method, edit operations between the corresponding sub-sequences are determined separately. For small sub-sequences, the likelihood of an edit operation being canceled when computing the difference of distances with respect to an equal length reference sequence is very low. However, the likelihood of canceled edit operations increases as the sub-sequences become longer. On the other hand, the MVD based method predicts edit distances between the entire sequences. Since the randomly generated reference sequences are far from each other, dataset sequences that are far from each other may still have similar edit distances with respect to such reference sequences. Therefore, for larger edit distances, in the MVD based method, no reference sequence may be able to correctly predict the edit distance between a pair of dataset sequences. Increasing the number of reference sequences improves the prediction accuracy only marginally, with a rapid increase in computation time.

5.4.3 Edit Distance Calculation Time

Figure 5.8 shows a comparison of pairwise edit distance calculation time between the proposed and the MVD based methods. We include the transformation time in the total time for both methods while calculating the average pairwise edit distance prediction time. The figure also shows the variances (dashed line) in computation time for varying edit distances. It should be noted that the effect of the transformation time becomes smaller with increasing dataset size.
The average pairwise distance prediction time of the proposed method is significantly better than that of the MVD based method because the latter requires more reference sequences and therefore more pairwise edit distance calculations between dataset sequences and reference sequences.

Figure 5.8: Comparison of average edit distance prediction time between the dataset sequences using the MVD and the differences of distances based methods

5.4.4 Clustering

We create average linkage agglomerative hierarchical clusters using the differences of distances based method, and compare the results with those of the MVD [34], VSEARCH [23] and LSH-Div [22]. The lower levels of the dendrogram of hierarchical clusters are useful for many NGS sequence analysis methods. Since we focus on large NGS datasets, a graph based external memory (disk) hierarchical clustering method, SparseHC [49], is adopted for creating the clusters. It should be noted that the space complexity of main memory based agglomerative hierarchical clustering is O(N^2), where N is the number of sequences in the dataset. The performance comparison is based on the clustering time and the relative similarity (i.e., NMI and AMI) of the generated clusters with respect to those based on the dynamic programming based edit distances.

5.4.4.1 Clustering Time

We compare the clustering time of the proposed method with that of the MVD as well as the k-mer similarity based methods. Hierarchical clustering using the quadratic run time edit distance calculation requires days as the dataset size grows. The proposed differences of distances based clustering is considerably faster than the MVD based method [34] because the number of reference sequences for the proposed edit distance prediction method is very small compared to that of the MVD based method. As a result, the proposed method benefits from low transformation time, fewer dimensions for pairwise comparison and the absence of costly post processing.
Among the matching k-mer based methods, VSEARCH requires the least amount of time at the cost of accuracy. On the other hand, the clustering time of LSH-Div increases more rapidly than that of the differences of distances based method with an increasing number of sequences in a dataset.

Table 5.3: For the 99% similarity threshold, comparison of clustering time (in hours) and accuracy of the differences of distances based method to those of the actual edit distances, the MVD, as well as the k-mer matching based VSEARCH and LSH-Div clustering. The Baseline column gives time; the remaining columns give Time/NMI/AMI.

Data sets  Baseline  Proposed             MVD                  VSEARCH [23]        LSH-Div [22]
Dm         0.61      0.017/0.998/0.993    0.054/0.998/0.993    0.007/0.903/0.75    0.025/0.89/0.68
Dh         0.94      0.029/0.9976/0.992   0.067/0.997/0.9923   0.01/0.89/0.77      0.049/0.90/0.70
Dr         1.72      0.067/0.997/0.9916   0.095/0.9976/0.991   0.03/0.90/0.71      0.218/0.85/0.67
Ds         17.95     0.24/0.994/0.99      0.47/0.997/0.9905    0.147/0.91/0.78     1.416/0.92/0.78
Dp         411.28    4.02/0.9999/0.998    7.1/0.994/0.9905     0.352/0.90/0.802    9.6/0.903/0.713
Dz         641.62    5.9/0.9956/0.9902    9.9/0.9956/0.99      0.52/0.889/0.776    13.1/0.91/0.706

5.4.4.2 Cluster Similarity

We quantify the relative similarity of the clusters generated using the proposed edit distance prediction method with respect to the ground truth. We also experiment with heuristics based clustering methods such as VSEARCH and LSH-Div. The NMI and AMI scores of both the VSEARCH and LSH-Div based clusters are significantly lower compared to those of the proposed method. Although the NMI scores are around 0.9 for both methods, their AMI scores are quite low. On the other hand, the clusters based on the proposed edit distance prediction method and those of the MVD based method show close to 99% relative similarity for edit distance similarity thresholds of 99% (Table 5.3) and 97% (Table 5.5).
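The NMI score used in this comparison can be computed directly from two flat cluster labelings. The following is a minimal sketch using the common geometric-mean normalization; the exact NMI/AMI variant used by the evaluation tooling may differ.

```python
from collections import Counter
from math import log, sqrt

def nmi(labels_a, labels_b):
    # Mutual information between two flat clusterings, normalized by the
    # geometric mean of the two cluster entropies (natural log throughout).
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum((c / n) * log(c * n / (ca[a] * cb[b]))
             for (a, b), c in joint.items())
    ha = -sum((c / n) * log(c / n) for c in ca.values())
    hb = -sum((c / n) * log(c / n) for c in cb.values())
    return mi / sqrt(ha * hb) if ha > 0 and hb > 0 else 1.0
```

Identical partitions score 1 regardless of label names. AMI additionally corrects the score for chance agreement, which is one reason the AMI scores in Table 5.3 fall well below the NMI scores for the k-mer based methods.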
However, as shown in Table 5.4, for the 98% similarity threshold, the clusters based on the proposed predicted edit distances show better NMI and AMI scores than those of the MVD because of the higher accuracy of the predicted edit distances.

Table 5.4: For the 98% similarity threshold, comparison of clustering time (in hours) and accuracy of the differences of distances based method to those of the actual edit distances, the MVD, as well as the k-mer matching based VSEARCH and LSH-Div clustering. The Baseline column gives time; the remaining columns give Time/NMI/AMI.

Data sets  Baseline  Proposed             MVD                   VSEARCH [23]         LSH-Div [22]
Dm         0.63      0.02/0.995/0.991     0.072/0.982/0.9725    0.0043/0.91/0.767    0.026/0.88/0.702
Dh         0.943     0.034/0.994/0.989    0.094/0.98/0.9745     0.0041/0.91/0.78     0.051/0.884/0.71
Dr         1.79      0.077/0.993/0.987    0.122/0.978/0.9702    0.0125/0.91/0.716    0.219/0.84/0.68
Ds         17.98     0.28/0.992/0.986     0.614/0.979/0.969     0.058/0.92/0.79      1.43/0.91/0.784
Dp         411.4     4.7/0.997/0.991      7.88/0.979/0.967      0.19/0.895/0.798     9.52/0.891/0.708
Dz         641.8     6.72/0.99/0.98       11.1/0.981/0.969      0.29/0.89/0.78       13.3/0.896/0.706

5.5 Summary

In this chapter, we propose an edit distance prediction method using a set of deterministically generated reference sequences. The distances among the reference sequences are maximum, i.e., equal to their lengths. Each such sequence is based on a specific letter of the alphabet. As a result, the edit distance between a reference sequence and a dataset sequence can be computed using only substitution operations. We compare the differences of distances of the dataset sequences with respect to the reference sequences to predict the edit distances between them. Although this technique correctly predicts an edit distance of one, the prediction accuracy drops sharply for larger edit distances.
In order to improve the prediction accuracy, we develop several strategies, such as calculating differences of distances between the sub-sequences separately, discarding redundant edit operations, increasing the number of distinct elements and computing minimum cost traces of the sub-sequences. Subsequently, we merge the differences of distances of the sub-sequences to predict the edit distances between the entire sequences. The computational complexity of the proposed method is linear with respect to the sequence lengths.

Table 5.5: For the 97% similarity threshold, comparison of clustering time (in hours) and accuracy of the differences of distances based method to those of the actual edit distances, the MVD, as well as the k-mer matching based VSEARCH and LSH-Div clustering. The Baseline column gives time; the remaining columns give Time/NMI/AMI.

Data sets  Baseline  Proposed             MVD                   VSEARCH [23]         LSH-Div [22]
Dm         0.634     0.022/0.99/0.98      0.104/0.9956/0.989    0.003/0.915/0.79     0.026/0.87/0.732
Dh         0.95      0.038/0.991/0.978    0.086/0.9938/0.992    0.005/0.905/0.803    0.053/0.87/0.76
Dr         1.796     0.085/0.99/0.976     0.13/0.994/0.9882     0.008/0.89/0.73      0.22/0.83/0.686
Ds         18.01     0.31/0.992/0.98      0.74/0.992/0.987      0.043/0.93/0.82      1.44/0.91/0.80
Dp         411.5     5.44/0.988/0.977     11.36/0.992/0.9856    0.162/0.905/0.798    9.83/0.89/0.72
Dz         642.3     7.6/0.981/0.972      12.66/0.991/0.985     0.24/0.901/0.794     13.78/0.887/0.716

We evaluate the proposed edit distance prediction method with respect to required time and prediction accuracy to show its superiority over the MVD based edit distance prediction method. The effectiveness of the proposed predicted edit distance method is also shown on average linkage agglomerative hierarchical clustering of the experimental 16s rRNA metagenomic NGS sequence datasets for high similarity thresholds.
Chapter 6

Bounded Exact Edit Distance

The edit distance prediction using differences of distances between the corresponding sub-sequences of pairs of dataset sequences cannot guarantee correctness because Theorem 3 is applicable only if the edit distance is one. Further, there is no way to know the exact location of an edit operation from the differences of distances. As a result, the method proposed in Chapter 5 is error prone in determining the type and the number of edit operations. Although the differences of distances based method provides high accuracy when the number of edit operations is small or the edit operations are distributed over the entire sequences, its performance degrades sharply when several edit operations appear in close proximity. On the other hand, many genome sequence analysis methods require edit distances only within a small bound. For those applications, minor modifications of the standard quadratic method greatly reduce the computation time. However, they still require computing all possible combinations of sub-sequences within the given bound.

6.1 Length of the Sub-sequences

In this chapter, we develop an efficient bounded exact edit distance calculation method using the concept of trace (Definition 11) [1]. As shown by Wagner et al. [1], given a minimum cost trace (T, A, B), if A = A1.A2, B = B1.B2, and there is no trace element between A1 and B2 or between A2 and B1, then

cost(T, A, B) = cost(T1, A1, B1) + cost(T2, A2, B2)    (6.1)

For example, given A = honda and B = hyundai, a minimum cost trace is shown in Figure 6.1(a). We divide the strings into A1 = hon and A2 = da as well as B1 = hyun and B2 = dai so that there is no trace element between A1 and B2 or between A2 and B1. In Figure 6.1(b), (T1, A1, B1) and (T2, A2, B2) are also minimum cost traces and thereby satisfy Equation 6.1.
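Equation 6.1 can be checked numerically with the standard dynamic program of Wagner and Fischer [1]. The following is a minimal sketch assuming unit cost per insertion, deletion and substitution:

```python
def edit_distance(a, b):
    # Standard dynamic program [1], kept to a single rolling row of the table.
    m, n = len(a), len(b)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i          # 'prev' holds the diagonal cell
        for j in range(1, n + 1):
            prev, row[j] = row[j], min(row[j] + 1,      # delete a[i-1]
                                       row[j - 1] + 1,  # insert b[j-1]
                                       prev + (a[i - 1] != b[j - 1]))  # substitute
    return row[n]
```

Splitting at a point with no crossing trace elements preserves the total cost: edit_distance("honda", "hyundai") equals edit_distance("hon", "hyun") + edit_distance("da", "dai"), i.e., 2 + 1 = 3.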
Figure 6.1: (a) Minimum cost trace of a pair of sequences (strings) and (b) dividing the sequences into corresponding sub-sequences based on the minimum cost trace

However, we do not know a minimum cost trace without comparing all possible combinations of the sub-sequences. On the other hand, for a given edit distance bound d, only a subset of all possible sub-sequences needs to be compared because, for every (i, j) ∈ (T, A, B), i and j cannot be more than d letters apart. We break A and B into sub-sequences based on the potential (i, j) ∈ (T, A, B) and compute the exact edit distance up to the bound.

Theorem 4. Given ed(A, B) ≤ d, 1 ≤ i ≤ |A|, 1 ≤ j ≤ |B| and a minimum cost trace (T, A, B), ∀(i, j) ∈ (T, A, B): (i − d) ≤ j ≤ (i + d).

Proof. Let A[0 : m] be the prefix of sequence A of length m, let A[m] be the m-th letter of A, and let γ(A[m] → B[n]) be the cost function to convert A[m] into B[n]. The recursive function for calculating the edit distance between A[0 : m] and B[0 : n] is:

ed(A[0 : m], B[0 : n]) = min of
  Case 1: ed(A[0 : m], B[0 : n − 1]) + γ(Λ → B[n])
  Case 2: ed(A[0 : m − 1], B[0 : n]) + γ(A[m] → Λ)
  Case 3: ed(A[0 : m − 1], B[0 : n − 1]) + γ(A[m] → B[n])

At each level of the recursion, we divide A[0 : m] and B[0 : n] into two pairs of corresponding sub-sequences according to one of the aforementioned cases such that no (i, j) ∈ (T′, A[0 : m], B[0 : n]) exists between neighboring pairs of sub-sequences. Thus T′ can be split into T1 and T2, and cost(T′) ≤ cost(T1) + cost(T2). If T′ is a minimum cost trace, T1 and T2 are also minimum cost traces and cost(T′) = cost(T1) + cost(T2). Each path from the root to a leaf node of the recursion tree represents an arbitrary trace. Given a minimum cost trace T and at most d edit operations, we prove that the maximum difference of index positions for any (i, j) ∈ (T, A, B) is bounded by d.
Since the maximum difference between i and j comes from either insertions only or deletions only, we focus on those two scenarios. Case 1 occurs whenever there is an insertion operation. Assuming d insertion operations, if T is a minimum cost trace, Case 1 occurs at most d times. For any T′ with more than d insertions, T′ is not a minimum cost trace and cost(T′, A, B) > d. Thus, ∀(i, j) ∈ (T, A, B): i ≤ j ≤ (i + d) when the edit operations are insertions only. Similarly, for deletion operations only, Case 2 occurs at most d times for a minimum cost trace (T, A, B). Thus, ∀(i, j) ∈ (T, A, B): (i − d) ≤ j ≤ i. Considering both scenarios, ∀(i, j) ∈ (T, A, B): (i − d) ≤ j ≤ (i + d) where ed(A, B) ≤ d.

For example, in Figure 6.1(a), ed(A, B) = 3 and a minimum cost trace is (T, A, B) = <(1, 1), (2, 2), (3, 4), (4, 5), (5, 6)>. From the figure, we see that ∀(i, j) ∈ (T, A, B) where 1 ≤ i ≤ |A| and 1 ≤ j ≤ |B|: |i − j| ≤ 3.

6.2 Minimum Cost Trace between the Sub-sequences

For any pair of corresponding sub-sequences, we compute a trace that requires the minimum number of changes for the individual (i, j) pairs within the sub-sequences. Such a trace cost may not be locally minimum; however, it is more likely to be part of a globally minimum trace for a given threshold.

Theorem 5. Let ed(A, B) ≤ d, let (T, A, B) be the minimum cost trace, and let A″ = A(m, d), B″ = B(m, 2d). Also assume (i, j) ∈ P, where i ∈ {m, . . . , (m + d)}, j, k ∈ {m, . . . , (m + 2d)}, j ≠ k, A″[i] = B″[j] = B″[k] and (i ∼ j) ≤ (i ∼ k). If ∀(i′, j′) ∈ (T″, A″, B″) represents the minimum total difference from ∀(i, j) ∈ P, then (T″, A″, B″) ⊂ (T, A, B).

Proof. Any (i, j) ∈ (T″, A″, B″) with m ≤ i ≤ (m + d) and m ≤ j ≤ (m + 2d) refers to a change operation with the cases (a) A″[i] = B″[j] and (b) A″[i] ≠ B″[j]. However, an (i, j) ∈ (T″, A″, B″) with A″[i] ≠ B″[j] exists only to minimize the trace cost, i.e., the number of edit operations.
For each (i, j) ∈ P, where i ∈ {m, . . . , (m + d)}, j, k ∈ {m, . . . , (m + 2d)}, j ≠ k, if A″[i] = B″[j] = B″[k] and (i ∼ j) ≤ (i ∼ k), then, considered in isolation, (i, j) is the candidate pair for the minimum cost trace. Since the minimum cost trace is globally optimal, the candidate pairs (i, j) ∈ P must not cross each other, as stated in Definition 11. Hence, one of the following cases holds.

Case 1: ∀(i1, j1), (i2, j2) ∈ P where i1 ≠ i2 and j1 ≠ j2: i1 < i2 iff j1 < j2. In this case, all the candidate pairs satisfy the requirements of being trace elements. For any untouched pairs between any distinct (m1, n1) and (m2, n2), we add more pairs arbitrarily to minimize the trace cost.

Case 2: ∃(i1, j1), (i2, j2) ∈ P where i1 ≠ i2 and j1 ≠ j2: i1 < i2 yet j1 > j2. Both of (i1, j1) and (i2, j2) cannot belong to T″. Thus, the conflicting pairs have to be adjusted so that the increase in cost is minimum. Suppose discarding/modifying (i1, j1) increases the cost less than discarding/modifying (i2, j2); we then replace (i1, j1), and we find the minimum cost increment for all such conflicts. It should be noted that (T″, A″, B″) may not be the minimum cost trace locally. However, it will be a candidate subset of the global minimum cost trace (T, A, B).

Similarly, we find (T″″, A(m, 2d), B(m, d)) and compare it with (T″, A(m, d), B(m, 2d)). In case the costs are tied yet the trace elements are not equivalent, we need to look into the remaining sub-strings to break the tie. Otherwise, the minimum cost trace between them will be a subset of the globally optimal trace.

For example, given A = "acgtagctag" and B = "gggggacgtagctag", ed(A, B) = 5. Suppose we compute the trace (T″, A(1, 5), B(1, 10)) ⊂ (T, A, B). According to Theorem 5, P = {(1, 6), (2, 7), (3, 3), (4, 9), (5, 10)}. However, (3, 3) is in conflict with (2, 7) and (4, 9) when considered as a trace pair for (T″, A(1, 5), B(1, 10)).
The least costly change in P is to replace (3, 3) with (3, 8). Hence, (T″, A(1, 5), B(1, 10)) = {(1, 6), (2, 7), (3, 8), (4, 9), (5, 10)}.

It should be noted that not all the sub-strings need to be of length 2d, where d is the upper bound of the edit distance, because the maximum difference satisfies i ∼ j ≤ d and the trace pairs are physically ordered. If there are insertions and/or deletions in some corresponding sub-strings, the lengths of the subsequent ones will be less than 2d. This property further reduces the cost of the proposed bounded edit distance calculation method.

6.2.1 Computational Complexity

Given an edit distance bound d, we compute ed(A, B) by comparing corresponding pairs of sub-sequences of lengths d and 2d of A and B. The comparison has quadratic complexity with respect to d; hence, the computational complexity of a trace for a pair of sub-sequences is O(d^2). The maximum number of sub-sequence pairs is (l/d). Hence, the overall computational complexity is O(ld).

6.3 Experimental Results

We evaluate the proposed bounded exact edit distance calculation method with respect to computation time. For a given distance threshold, we measure the pairwise edit distance calculation time as well as the hierarchical clustering time. The pairwise edit distance calculation time is compared with that of the dynamic programming based method and Slidesort. However, we do not compare with the state-of-the-art bounded exact edit distance method proposed by Thankachan et al. because, for larger edit distance thresholds, the suffix tree construction complexity of this method is not strictly sub-quadratic.

6.3.1 Edit Distance Calculation Time

In Table 6.1, we present the all pairs distance calculation time of the experimental datasets for increasing distance thresholds. The quadratic run time edit distance calculation method is used as the baseline method.
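The band property of Theorem 4 is what makes the bounded computation cheap: table cells with |i − j| > d can never contribute to a minimum cost trace and are skipped. A minimal banded dynamic program illustrates this; it is an illustrative stand-in for the sub-sequence based method above, not the author's implementation.

```python
def bounded_edit_distance(a, b, d):
    # By Theorem 4, every (i, j) in a minimum cost trace satisfies
    # |i - j| <= d, so only the diagonal band of the table is computed.
    # Returns the exact distance when it is <= d, otherwise None.
    if abs(len(a) - len(b)) > d:
        return None
    INF = float("inf")
    prev = {j: j for j in range(min(len(b), d) + 1)}  # row i = 0
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - d), min(len(b), i + d) + 1):
            if j == 0:
                cur[j] = i
                continue
            cur[j] = min(prev.get(j, INF) + 1,       # deletion
                         cur.get(j - 1, INF) + 1,    # insertion
                         prev.get(j - 1, INF) + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    result = prev.get(len(b), INF)
    return result if result <= d else None
```

Each row touches at most 2d + 1 cells, so the work is O(l * d) for sequences of length l, matching the complexity derived in Section 6.2.1.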
We compare the required time of the proposed bounded exact edit distance calculation method with that of the baseline method and the k-mer chain based Slidesort method. The proposed method is an order of magnitude faster than the baseline method for all the distance thresholds.

Table 6.1: Comparison of all pairs distance calculation time (in hours) of the proposed bounded exact edit distance calculation method with that of the existing dynamic programming based method and Slidesort (Ep = bounded exact edit distance, S = Slidesort)

Datasets  Baseline Method  ed ≤ 3 (Ep/S)  ed ≤ 5 (Ep/S)  ed ≤ 8 (Ep/S)
Dm        0.614            0.021/0.044    0.023/0.184    0.024/0.38
Dh        0.931            0.034/0.041    0.038/0.21     0.041/0.43
Dr        1.771            0.083/0.036    0.092/0.16     0.097/0.41
Ds        17.91            0.28/0.138     0.32/0.61      0.35/1.72
Dp        411.17           4.8/0.543      5.62/5.26      5.95/7.22
Dz        641.4            6.63/0.71      7.85/6.8       8.46/10.3

The Slidesort method is faster than the proposed method for smaller distance thresholds because the k-mer based filtering of Slidesort can discard most of the pairs for such thresholds. However, for larger distance thresholds, the number of candidate pairs increases rapidly with Slidesort. Since Slidesort uses the existing quadratic time complexity based method for computing the edit distances of the candidate pairs, its required time increases rapidly with increasing thresholds. On the other hand, for larger distance thresholds, the required time of the proposed bounded exact edit distance calculation method increases very slowly. Hence, our proposed method is more scalable for increasing numbers of pairs and higher distance thresholds.

6.3.2 Hierarchical Clustering Time

We also compare the average linkage hierarchical clustering time based on the proposed bounded exact edit distance method with that of the existing dynamic programming based baseline method. The similarity threshold is set to 97% for this experiment.
As shown in Table 6.2, hierarchical clustering based on the proposed method is an order of magnitude faster than the baseline method.

Table 6.2: Comparison of hierarchical clustering time (in hours) of the proposed bounded exact edit distance calculation method with that of the existing dynamic programming based method for a given distance threshold

Datasets  The Edit Distance  The Proposed Bounded Edit Distance
Dm        0.634              0.027
Dh        0.95               0.048
Dr        1.796              0.103
Ds        18.01              0.39
Dp        411.5              6.5
Dz        642.3              8.5

The graph based clustering time for high similarity thresholds is a small fraction of the all pairs edit distance calculation time. As a result, the speedup of the proposed method is primarily due to the near linear edit distance calculation time for the given high similarity threshold.

6.4 Summary

In this chapter, we propose a bounded exact edit distance calculation method. For a given distance bound, we divide the sequences into corresponding sub-sequences in such a way that no edit operation is shared between neighboring sub-sequences. Then, we compute partial edit distances for each pair of corresponding sub-sequences and accumulate them to determine the edit distance between the entire sequences. We show that the proposed method is significantly faster than the existing exact edit distance calculation methods. The effectiveness of the proposed method is evaluated with respect to the all pairs edit distance calculation time as well as the hierarchical clustering time based on the computed pairwise edit distances.

Chapter 7

Conclusion

7.1 Key Contributions

The primary contributions of this dissertation are as follows:

• We have proposed a number of fast edit distance calculation methods to determine the dissimilarity between NGS sequences. In doing so, we have introduced a new class of edit distance prediction approaches based on dataset independent reference sequences.
• Initially, we used randomly generated reference sequences for computing the dissimilarity between NGS sequences. For this approach, we convert the sequences of a dataset into feature vectors and use metric distance measures, such as the Euclidean distance and the Chebyshev distance, for edit distance prediction. The proposed methods are an order of magnitude faster than the existing quadratic run time edit distance calculation method. For high similarity thresholds, the distance prediction accuracy is very high across different scenarios.

• We have also developed a small set of deterministic reference sequences with maximum distances among them. This method runs in linear time with respect to the sequence lengths. Further, its edit distance prediction accuracy for larger edit distances is significantly higher than that of the random reference sequence based methods.

• We have conducted extensive experiments and theoretical analysis to justify the superior performance of the proposed reference sequence based edit distance prediction methods. For high similarity thresholds, the predicted distances are nearly accurate, and when wrong they are very close to the correct distances. Because of this, hierarchical clustering using the predicted distances produces clusters nearly identical to those based on the actual edit distances for high similarity thresholds.

• We have also proposed an efficient bounded exact edit distance calculation method. For small edit distance thresholds, the method runs in near linear time.

• The edit distance calculation time using the proposed bounded exact method is an order of magnitude faster than that of the existing quadratic run time based methods.

7.2 Discussions

Efficient edit distance calculation methods are very important for comparing strings and sequences because this distance measure is the most appropriate for determining the dissimilarity between a pair of strings.
For metagenomic short gene sequences generated using NGS technologies, sequence similarity is the only way to determine the phylogenetic relation between them. However, the standard edit distance computation complexity is O(l^2), where l is the length of the sequences. This computational complexity becomes a performance bottleneck when computing the similarity between large numbers of pairs of sequences, which is increasingly common due to the advancement of NGS technologies. As a result, reducing the edit distance computation time without significant loss in accuracy has been an active research problem.

In this dissertation, we have proposed a few edit distance prediction methods and a bounded exact edit distance calculation method. For the edit distance prediction methods, we have developed a set of dataset independent reference sequences that are minimally correlated with each other. The reference sequences convert sequences in datasets into feature vectors. Using these feature vectors, we have proposed edit distance prediction methods in the metric space. Initially, we used the Euclidean distances between the feature vectors to predict the distances between the corresponding sequences. Since there is no direct relationship between the edit distance and the Euclidean distance in the metric space for a given pair of sequences and their corresponding feature vectors, we developed a mapping between these two distance measures based on empirical knowledge. This prediction technique is effective for clustering NGS sequences, especially for high similarity thresholds such as creating OTUs. However, the accuracy of the resulting hierarchical clusters drops rapidly for predicted edit distances of pairs with larger edit distances. Our proposed MVD (the Chebyshev distance) based edit distance prediction method has addressed this lack of relationship between the actual edit distances and the predicted distances.
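The vector-space prediction described above can be sketched as follows. This is illustrative only: random reference sequences, unit-cost edit distance, and the Chebyshev distance of the feature vectors as the predictor; the dissertation's empirical mapping and post-processing are omitted.

```python
import random

def edit_distance(a, b):
    # Standard quadratic dynamic program, single rolling row.
    m, n = len(a), len(b)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1,
                                       prev + (a[i - 1] != b[j - 1]))
    return row[n]

def to_feature_vector(seq, refs):
    # One coordinate per reference sequence.
    return [edit_distance(seq, r) for r in refs]

def chebyshev(u, v):
    # Maximum coordinate-wise difference; by the triangle inequality each
    # coordinate difference is a lower bound on the true edit distance.
    return max(abs(x - y) for x, y in zip(u, v))

rng = random.Random(0)
refs = ["".join(rng.choice("ACGT") for _ in range(8)) for _ in range(16)]
a, b = "ACGTACGT", "ACGAACGT"
predicted = chebyshev(to_feature_vector(a, refs), to_feature_vector(b, refs))
```

The prediction never exceeds the true edit distance, so it is usable as a fast filter; the empirical mapping mentioned above is what turns the lower bound into a calibrated estimate.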
Further, this method is able to predict lower edit distances nearly accurately, thereby making the predicted distances useful for applications other than clustering of NGS sequences. However, for larger edit distances, the MVD based prediction cannot provide satisfactory accuracy despite using a large number of reference sequences.

In order to address the poor prediction accuracy for higher edit distances and to reduce the computational overhead of creating feature vectors using hundreds of reference sequences, we have also developed an edit distance prediction method for NGS sequences based on a small set of special reference sequences. The distance among the reference sequences is the maximum, i.e., exactly the length of the largest dataset sequence. The computational complexity of this method is linear with respect to the sequence lengths. We have developed several strategies to provide increased prediction accuracy for relatively higher edit distances compared to our previously proposed methods. Despite its good prediction accuracy in many cases, this prediction method performs poorly when multiple edit operations appear in close proximity.

On the other hand, we have proposed an efficient bounded exact edit distance calculation method using the trace cost [1] of the sub-sequences. For a given edit distance threshold d where d^2 ≤ l, the proposed method runs in near linear time. This method can be used separately for efficient edit distance calculation as well as in conjunction with the other proposed distance prediction methods to avoid costly quadratic time computation for NGS sequence analysis.

We have conducted extensive experiments on the 16s rRNA metagenomic NGS sequences using the proposed edit distance calculation methods to evaluate computation time and accuracy.
All of our proposed methods show high accuracy in creating highly similar clusters from these sequences while achieving an order of magnitude speedup compared to the existing near quadratic edit distance calculation methods. For smaller edit distances, our proposed MVD and sub-sequence based methods are highly accurate, and the predicted distances are close to the actual ones when they are not exact. We have also shown the effectiveness of our proposed methods compared to that of k-mer based similarity by comparing both the required time and accuracy.

7.3 Possible Extensions

In this section, we briefly present possible extensions of the proposed edit distance calculation methods with respect to optimization of these methods, application to NGS sequences other than 16s rRNA, and potential usability of the proposed methods in relevant fields and data types.

7.3.1 Extending the Proposed Methods

In the near future, we plan to extend our proposed edit distance calculation methods using the following strategies to further improve their effectiveness.

First, for the MVD based edit distance prediction method, we plan to incorporate domain knowledge by choosing reference sequences from the datasets. Like the dataset independent random references, these reference sequences from the datasets will also be far from each other. However, the distances are expected to be much smaller than those of dataset independent reference sequences because of the conserved regions among sequences in the datasets. We expect that this strategy will improve prediction for larger edit distances because it will focus only on particular regions of the sequences. Further, the MVD and FVD based heuristics are expected to be based on domain knowledge.

Second, for the sub-sequence based edit distance prediction method, we plan to use k-mer based strategies instead of the primary and secondary sub-sequences as well as the merging strategies.
While these strategies provide good prediction accuracy, they also depend on sequential processing of the sub-sequences. Our objective with k-mer based strategies is to develop a more robust method without sacrificing computation time or accuracy.

7.3.2 Similarity Comparison of Other Ribosomal Regions

We intend to optimize our edit distance calculation methods for other regions (short sequences) within ribosomal transcripts, such as Internal Transcribed Spacer (ITS) datasets and 28s rDNA datasets. Recently, ITS sequences have been used more frequently for phylogenetic analysis because of their increased variability. However, this creates a challenge for our proposed methods because our experimental analyses are based on the more conserved 16s rRNA sequence regions.

7.3.3 Applying Proposed Methods to Relevant Problems

The proposed edit distance calculation methods are also applicable to similar data types such as text datasets, network connectivity and graph datasets. These differ from NGS sequences with respect to length, alphabet size, length difference and sentiment information. In some cases, they may require non-uniform costs for different edit operations [50]. Utilizing the key ideas of the proposed distance calculation methods for these datasets might be interesting. Another potential application of the proposed reference sequence based edit distance calculation method is computing the graph edit distance [51, 52], such as in similarity search in Heterogeneous Information Networks (HIN) [52]. As with the dataset independent reference sequences, we can create core sub-graphs and compute the dissimilarity between them for efficient computation of the dissimilarity between graph fragments.

APPENDIX

Probability Distribution of δ for Individual Dimensions

For a given Hamming distance dh, from Theorem 1, we know that 0 ≤ δ ≤ dh.
According to the definition of the Hamming distance, there are dh positions at which the two sequences differ. Given a random reference sequence Ri, for all j = 1, 2, ..., dh we have

    ∆x_ij = +1 if x^A_ij = R_ij and x^B_ij ≠ R_ij
            −1 if x^A_ij ≠ R_ij and x^B_ij = R_ij
             0 if x^A_ij ≠ R_ij and x^B_ij ≠ R_ij

where j indexes the character positions at which the two sequences differ from each other, and Ri is a random reference point. Given that there are only four nucleotides, T, A, C, and G, there are two cases where x^A_ij ≠ R_ij and x^B_ij ≠ R_ij, and one case for each of the other two conditions above. Therefore, the probability of occurrence of each case is given by:

    Pr(∆x_ij) = 1/4 if ∆x_ij = +1
                1/4 if ∆x_ij = −1
                1/2 if ∆x_ij = 0

Then, ∆x_i = | Σ_{j=1..dh} ∆x_ij |, which can be modeled as the absolute value of a sum of finitely many discrete random variables with a multinomial distribution, leading to Eq. 4.

Probability Distribution of the Euclidean Distance in Feature Space for a Given Hamming Distance

If S_{n−1} is the squared Euclidean distance between the transformed sequences A and B, i.e. ||Ω(A) − Ω(B)||², with n − 1 features, and its probability distribution Pr_{n−1}(S_{n−1}) is given by the convolution of the single-feature distributions for 0 ≤ S_{n−1} ≤ (n − 1) × (dh)², then adding a new feature gives

    S_n = S_{n−1} + (∆x_n)²

Then, the probability distribution of S_n for every z in [0, ..., n × (dh)²] can be written as

    Pr_n(S_n = z) = Σ_{k=0..dh} Pr_1(S_1 = k²) × Pr_{n−1}(S_{n−1} = z − k²)

where Pr_1(S_1) = Pr(∆x = S_1) as given in Eq. 3.3. Note: in order to reduce the number of computations for large values of n and dh, we used a cutoff of 1e−10 to remove small probabilities.

BIBLIOGRAPHY

[1] R. A. Wagner and M. J. Fischer, “The string-to-string correction problem,” Journal of the ACM (JACM), vol. 21, no. 1, pp. 168–173, 1974.

[2] W.
J. Masek and M. S. Paterson, “A faster algorithm computing string edit distances,” Journal of Computer and System Sciences, vol. 20, no. 1, pp. 18–31, 1980.

[3] A. Backurs and P. Indyk, “Edit distance cannot be computed in strongly subquadratic time (unless SETH is false),” in Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing. ACM, 2015, pp. 51–58.

[4] H. Ochman and I. B. Jones, “Evolutionary dynamics of full genome content in Escherichia coli,” The EMBO Journal, vol. 19, no. 24, pp. 6637–6643, 2000.

[5] A. Abboud, R. Williams, and H. Yu, “More applications of the polynomial method to algorithm design,” in Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2015, pp. 218–230.

[6] A. Andoni, R. Krauthgamer, and K. Onak, “Polylogarithmic approximation for edit distance and the asymmetric query complexity,” in 2010 IEEE 51st Annual Symposium on Foundations of Computer Science. IEEE, 2010, pp. 377–386.

[7] A. Abboud and A. Backurs, “Towards hardness of approximation for polynomial time problems,” in 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2017.

[8] D. Chakraborty, D. Das, E. Goldenberg, M. Koucky, and M. Saks, “Approximating edit distance within constant factor in truly sub-quadratic time,” in 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2018, pp. 979–990.

[9] G. M. Landau, E. W. Myers, and J. P. Schmidt, “Incremental string comparison,” SIAM Journal on Computing, vol. 27, no. 2, pp. 557–582, 1998.

[10] Z. Bar-Yossef, T. Jayram, R. Krauthgamer, and R. Kumar, “Approximating edit distance efficiently,” in 45th Annual IEEE Symposium on Foundations of Computer Science. IEEE, 2004, pp. 550–559.

[11] T. Batu, F. Ergun, and C.
Sahinalp, “Oblivious string embeddings and edit distance approximations,” in Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2006, pp. 792–801.

[12] A. Andoni and K. Onak, “Approximating edit distance in near-linear time,” SIAM Journal on Computing, vol. 41, no. 6, pp. 1635–1648, 2012.

[13] S. V. Thankachan, C. Aluru, S. P. Chockalingam, and S. Aluru, “Algorithmic framework for approximate matching under bounded edits with applications to sequence analysis,” in International Conference on Research in Computational Molecular Biology. Springer, 2018, pp. 211–224.

[14] M. E. Carew, V. J. Pettigrove, L. Metzeling, and A. A. Hoffmann, “Environmental monitoring using next generation sequencing: rapid identification of macroinvertebrate bioindicator species,” Frontiers in Zoology, vol. 10, no. 1, p. 45, 2013.

[15] P. K. Lindeque, H. E. Parry, R. A. Harmer, P. J. Somerfield, and A. Atkinson, “Next generation sequencing reveals the hidden diversity of zooplankton assemblages,” PLoS ONE, vol. 8, no. 11, p. e81327, 2013.

[16] W. Chen, C. K. Zhang, Y. Cheng, S. Zhang, and H. Zhao, “A comparison of methods for clustering 16s rRNA sequences into OTUs,” PLoS ONE, vol. 8, no. 8, pp. 36–44, 2013.

[17] M. Kim, K.-H. Lee, S.-W. Yoon, B.-S. Kim, J. Chun, and H. Yi, “Analytical tools and databases for metagenomics in the next-generation sequencing era,” Genomics & Informatics, vol. 11, no. 3, p. 102, 2013.

[18] E. Stackebrandt, “Taxonomic parameters revisited: tarnished gold standards,” Microbiol. Today, vol. 33, pp. 152–155, 2006.

[19] M. Rossi-Tamisier, S. Benamar, D. Raoult, and P.-E. Fournier, “Cautionary tale of using 16s rRNA gene sequence similarity values in identification of human-associated bacterial species,” International Journal of Systematic and Evolutionary Microbiology, vol. 65, no. 6, pp. 1929–1934, 2015.

[20] E. Stackebrandt and B. M.
Goebel, “Taxonomic note: a place for DNA-DNA reassociation and 16s rRNA sequence analysis in the present species definition in bacteriology,” International Journal of Systematic and Evolutionary Microbiology, vol. 44, no. 4, pp. 846–849, 1994.

[21] B. D. Ondov, T. J. Treangen, P. Melsted, A. B. Mallonee, N. H. Bergman, S. Koren, and A. M. Phillippy, “Mash: fast genome and metagenome distance estimation using MinHash,” Genome Biology, vol. 17, no. 1, p. 132, 2016.

[22] Z. Rasheed, H. Rangwala, and D. Barbará, “16s rRNA metagenome clustering and diversity estimation using locality sensitive hashing,” BMC Systems Biology, vol. 7, no. 4, p. S11, 2013.

[23] T. Rognes, T. Flouri, B. Nichols, C. Quince, and F. Mahé, “VSEARCH: a versatile open source tool for metagenomics,” PeerJ, vol. 4, p. e2584, 2016.

[24] R. C. Edgar, “Search and clustering orders of magnitude faster than BLAST,” Bioinformatics, vol. 26, no. 19, pp. 2460–2461, 2010.

[25] K. Shimizu and K. Tsuda, “SlideSort: all pairs similarity search for short reads,” Bioinformatics, vol. 27, no. 4, pp. 464–470, 2010.

[26] J. Abello, P. M. Pardalos, and M. G. Resende, Handbook of Massive Data Sets. Springer, 2013, vol. 4.

[27] P. Bille and M. Farach-Colton, “Fast and compact regular expression matching,” Theoretical Computer Science, vol. 409, no. 3, pp. 486–496, 2008.

[28] R. Impagliazzo, R. Paturi, and F. Zane, “Which problems have strongly exponential complexity?” Journal of Computer and System Sciences, vol. 63, no. 4, pp. 512–530, 2001.

[29] Z. Rasheed and H. Rangwala, “MC-MinH: metagenome clustering using minwise based hashing,” in Proceedings of the 2013 SIAM International Conference on Data Mining. SIAM, 2013, pp. 677–685.

[30] J. Venkateswaran, D. Lachwani, T. Kahveci, and C. Jermaine, “Reference-based indexing of sequence databases,” in Proceedings of the 32nd International Conference on Very Large Data Bases. VLDB Endowment, 2006, pp. 906–917.

[31] P. Papapetrou, V. Athitsos, G.
Kollios, and D. Gunopulos, “Reference-based alignment in large sequence databases,” Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 205–216, 2009.

[32] S. Wandelt, J. Starlinger, M. Bux, and U. Leser, “RCSI: scalable similarity search in thousand(s) of genomes,” Proceedings of the VLDB Endowment, vol. 6, no. 13, pp. 1534–1545, 2013.

[33] A. M. Benjamin, M. Nichols, T. W. Burke, G. S. Ginsburg, and J. E. Lucas, “Comparing reference-based RNA-seq mapping methods for non-human primate data,” BMC Genomics, vol. 15, no. 1, p. 570, 2014.

[34] S. Pramanik, A. T. Islam, and S. Sural, “Predicted edit distance based clustering of gene sequences,” in 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 2018, pp. 1206–1211.

[35] A. T. Islam, S. Pramanik, V. Mirjalili, and S. Sural, “Restrac: Reference sequence based space transformation for clustering,” in 2017 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 2017, pp. 462–469.

[36] F. Murtagh and P. Contreras, “Algorithms for hierarchical clustering: an overview,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 86–97, 2012.

[37] P. D. Schloss, S. L. Westcott, T. Ryabin, J. R. Hall, M. Hartmann, E. B. Hollister, R. A. Lesniewski, B. B. Oakley, D. H. Parks, C. J. Robinson et al., “Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities,” Applied and Environmental Microbiology, vol. 75, no. 23, pp. 7537–7541, 2009.

[38] S. M. Huse, D. M. Welch, H. G. Morrison, and M. L. Sogin, “Ironing out the wrinkles in the rare biosphere through improved OTU clustering,” Environmental Microbiology, vol. 12, no. 7, pp. 1889–1898, 2010.

[39] T. S. Schmidt, J. F. Matias Rodrigues, and C. von Mering, “Limits to robustness and reproducibility in the demarcation of operational taxonomic units,” Environmental Microbiology, vol. 17, no. 5, pp. 1689–1706, 2015.

[40] S. L. Westcott and P. D.
Schloss, “OptiClust, an improved method for assigning amplicon-based sequence data to operational taxonomic units,” mSphere, vol. 2, no. 2, p. e00073-17, 2017.

[41] R. C. Edgar, “MUSCLE: multiple sequence alignment with high accuracy and high throughput,” Nucleic Acids Research, vol. 32, no. 5, pp. 1792–1797, 2004.

[42] W. Li and A. Godzik, “Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences,” Bioinformatics, vol. 22, no. 13, pp. 1658–1659, 2006.

[43] X. Hao, R. Jiang, and T. Chen, “Clustering 16s rRNA for OTU prediction: a method of unsupervised Bayesian clustering,” Bioinformatics, vol. 27, no. 5, pp. 611–618, 2011.

[44] R. Leinonen, H. Sugawara, and M. Shumway, “International Nucleotide Sequence Database Collaboration (2011). The Sequence Read Archive,” Nucleic Acids Research, vol. 39 (Database issue), pp. 19–21, 2011.

[45] N. Siva, “1000 Genomes project,” 2008.

[46] K. Kameshwaran and K. Malarvizhi, “Survey on clustering techniques in data mining,” International Journal of Computer Science and Information Technologies, vol. 5, no. 2, pp. 2272–2276, 2014.

[47] “Inter-comparison of marine plankton metagenome analysis methods,” 2017. [Online]. Available: https://www.ncbi.nlm.nih.gov/sra/ERX2155923[accn]

[48] “16s rRNA sequencing of Sao Paulo Zoo compost,” 2016. [Online]. Available: https://www.ncbi.nlm.nih.gov/sra/SRX1537393[accn]

[49] T.-D. Nguyen, B. Schmidt, and C.-K. Kwoh, “SparseHC: a memory-efficient online hierarchical clustering algorithm,” Procedia Computer Science, vol. 29, pp. 8–19, 2014.

[50] S. V. Rice, H. Bunke, and T. A. Nartker, “Classes of cost functions for string edit distance,” Algorithmica, vol. 18, no. 2, pp. 271–280, 1997.

[51] R. Dijkman, M. Dumas, and L. García-Bañuelos, “Graph matching algorithms for business process model similarity search,” in International Conference on Business Process Management. Springer, 2009, pp. 48–63.

[52] J. Lu, N. Lu, S. Ma, and B.
Zhang, “Edit distance based similarity search of heterogeneous information networks,” in International Conference on Database and Expert Systems Applications. Springer, 2018, pp. 195–202.
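SUPPLEMENTARY CODE SKETCH

The appendix derivation above can be made concrete with a short Python sketch: it builds the single-feature distribution Pr(∆x) (each of the dh differing positions contributes +1 or −1 with probability 1/4 and 0 with probability 1/2, and ∆x is the absolute value of their sum), then applies the convolution recurrence Pr_n(S_n = z) = Σ_k Pr_1(S_1 = k²) · Pr_{n−1}(S_{n−1} = z − k²) with the 1e−10 pruning cutoff mentioned in the text. This is a minimal illustration of the equations, not code from the dissertation; the function names are our own.

```python
def single_delta_pmf(dh):
    """Pr(dx = s) for one feature: dx = |sum of dh per-position terms|,
    each term being +1 w.p. 1/4, -1 w.p. 1/4, and 0 w.p. 1/2."""
    pmf = {0: 1.0}
    for _ in range(dh):
        nxt = {}
        for v, p in pmf.items():
            for step, q in ((1, 0.25), (-1, 0.25), (0, 0.5)):
                nxt[v + step] = nxt.get(v + step, 0.0) + p * q
        pmf = nxt
    folded = {}
    for v, p in pmf.items():            # fold the signed sum by |.|
        folded[abs(v)] = folded.get(abs(v), 0.0) + p
    return folded                       # support: 0 .. dh


def squared_distance_pmf(n, dh, cutoff=1e-10):
    """Pr_n(S_n = z): convolve n copies of the distribution of (dx)^2,
    pruning probabilities below `cutoff` as the appendix note suggests."""
    sq = {k * k: p for k, p in single_delta_pmf(dh).items()}
    dist = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for z, p in dist.items():
            for s, q in sq.items():     # one convolution step per feature
                nxt[z + s] = nxt.get(z + s, 0.0) + p * q
        dist = {z: p for z, p in nxt.items() if p > cutoff}
    return dist                         # support: 0 .. n * dh**2
```

For example, `single_delta_pmf(1)` yields Pr(∆x = 0) = Pr(∆x = 1) = 1/2, matching the three-case probabilities above, and the values of `squared_distance_pmf(n, dh)` sum to 1 up to the pruned mass.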