FAST EDIT DISTANCE CALCULATION METHODS FOR NGS SEQUENCE SIMILARITY

By

A K M Tauhidul Islam

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Doctor of Philosophy

2020

ABSTRACT

FAST EDIT DISTANCE CALCULATION METHODS FOR NGS SEQUENCE SIMILARITY

By

A K M Tauhidul Islam

Sequence fragments generated from targeted regions of phylogenetic marker genes provide valuable insights into identifying and classifying organisms and inferring taxonomic hierarchies. In recent years, significant development in targeted gene fragment sequencing through Next Generation Sequencing (NGS) technologies has increased the necessity of efficient sequence similarity computation methods for very large numbers of pairs of NGS sequences. The edit distance has been widely used to determine the dissimilarity between pairs of strings. All the known methods for edit distance calculation run in near quadratic time with respect to string lengths, and it may take days or weeks to compute distances between such large numbers of pairs of NGS sequences. To solve this performance bottleneck problem, faster edit distance approximation and bounded edit distance calculation methods have been proposed. Despite these efforts, the existing edit distance calculation methods are not fast enough when computing large numbers of pairs of NGS sequences. In order to further reduce the computation time, many NGS sequence similarity methods have been proposed using matching k-mers. These methods extract all possible k-mers from NGS sequences and compare pairs of sequences based on the shared k-mers. However, these methods reduce the computation time at the cost of accuracy. In this dissertation, our goal is to compute NGS sequence similarity using edit distance based methods while reducing the computation time.
We propose a few edit distance prediction methods using dataset independent reference sequences that are distant from each other. These reference sequences convert sequences in datasets into feature vectors by computing edit distances between each sequence and each of the reference sequences. Given sequences A, B and a reference sequence r, the edit distance satisfies ed(A, B) ≥ |ed(A, r) − ed(B, r)|. Since the reference sequences are significantly different from each other, with a sufficiently large number of reference sequences and a high similarity threshold, the differences of the edit distances of A and B with respect to the reference sequences are close to ed(A, B). Using this property, we predict edit distances in the vector space based on the Euclidean distances and the Chebyshev distances. Further, we develop a small set of deterministically generated reference sequences with maximum distance between each pair of them to predict higher edit distances more efficiently. This method predicts edit distances between corresponding sub-sequences separately and then merges the partial distances to predict the edit distances between the entire sequences. The computational complexity of this method is linear with respect to sequence length. The proposed edit distance prediction methods are very fast while achieving very good accuracy for high similarity thresholds. We have also shown the effectiveness of these methods on agglomerative hierarchical clustering. We also propose an efficient bounded exact edit distance calculation method using the trace [1]. For a given edit distance threshold d, only letters up to d positions apart can be part of an edit operation. Hence, we generate pairs of sub-sequences with length difference at most d so that no edit operation spills over to the adjacent pairs of sub-sequences. Then we compute the trace cost in such a way that the number of matching letters between the sub-sequences is maximized.
This technique does not guarantee locally optimal edit distances between the sub-sequences; however, it guarantees the globally optimal edit distance between the entire sequences for distances up to d. The bounded exact edit distance calculation method is an order of magnitude faster than the dynamic programming edit distance calculation method.

To my family

ACKNOWLEDGMENTS

I would like to thank my PhD supervisor Professor Sakti Pramanik for his continuous support and guidance during the entire time. His philosophy as a researcher and appreciation for academia are truly inspiring. He is a true mentor and he helped me to grow in many ways over the years. He consistently pushed me to be a better researcher and shaped my thought process about solving problems in interesting and non-trivial ways. I would also like to thank the esteemed committee members, Professor James R. Cole, Professor Qiang Zhu and Professor Sandeep Kulkarni. I am grateful for their valuable suggestions toward the successful completion of the dissertation. Dr. Cole always guided our work by providing suggestions based on domain knowledge and the best practices in the Bioinformatics community. Dr. Zhu, as a collaborating professor, guided me continuously throughout the duration of my program. Like Dr. Pramanik, Dr. Zhu greatly helped me in developing algorithms, technical writing as well as experimental evaluations. Dr. Kulkarni provided valuable suggestions on different key concepts of the dissertation. In addition, Dr. Benli Chai helped on multiple occasions to find suitable NGS sequence datasets for our experiments. The Bangladeshi community in Lansing was also very helpful in making our life enjoyable. The constant support from the seniors in the community as well as various social activities with the fellow students and their families helped us to successfully complete the PhD study. Family is the most important thing that allowed me to follow this dream.
My parents are my constant source of love and support in everything. I am deeply thankful to my father-in-law, mother-in-law, and brothers-in-law for their continuous encouragement and affection. The successful completion of my PhD study was only possible because of the constant motivation, support, and love of my wife Afroza. This journey was equally hard for her as it was for me, and I am really grateful to her for the sacrifices and the patience she has shown. Lastly, our son Zavyan Rushal Tauhid was born in the fourth year of my PhD program. He is the greatest gift of our life and he brightens each moment with his activities and laughs. He may not remember anything about this journey, but he was a very important part of it.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO SYMBOLS AND ABBREVIATIONS
Chapter 1 Introduction
1.1 Classification of Similarity Measures for NGS Sequences
1.1.1 NGS Sequence Similarity based on the Edit Distance
1.1.2 NGS Sequence Similarity based on the Matching k-mer Counts
1.1.3 A Novel Class of NGS Sequence Similarity Measures based on Reference Sequence based Predicted Edit Distances
1.2 An Overview of our Novel Edit Distance Prediction Scheme
1.3 An Overview of the Proposed Bounded Exact Edit Distance Calculation
1.4 Thesis Outline
Chapter 2 Related Works
2.1 Edit Distance based NGS Sequence Similarity
2.1.1 Edit Distance Approximation
2.1.2 Bounded Exact Edit Distance Calculation
2.2 NGS Sequence Similarity based on Matching k-mers
2.3 Reference Sequence based Genome Sequence Analysis Methods
2.4 Clustering Methods
2.4.1 Agglomerative Hierarchical Clustering
2.4.2 Heuristics based Clustering
Chapter 3 The Euclidean Distance based Edit Distance Prediction
3.1 Space Transformation
3.2 Selection of Reference Sequences
3.2.1 Length of Reference Sequences
3.2.2 Distances among Reference Sequences
3.2.3 Number of Reference Sequences
3.2.4 Edit Distance to Euclidean Distance Mapping
3.3 Hierarchical Clustering
3.3.1 Cost Analysis
3.3.2 Quality Assessment of Clustering Solutions
3.4 Experimental Results
3.4.1 Datasets
3.4.2 Distance among Reference Sequences
3.4.3 Mapping between the Edit and the Euclidean Distances
3.4.4 Number of Reference Sequences
3.4.5 Computation Time
3.4.6 Cluster Similarity
3.4.7 Cluster Quality
3.4.8 Comparison with Local Sensitive Hashing based Clusters
3.5 Summary
Chapter 4 MVD based Edit Distance Prediction
4.1 Differences of Distances
4.1.1 Effect of Increasing Number of Reference Sequences on MVD
4.2 Edit Distance Prediction in the Vector Space
4.2.1 Heuristics for the Edit Distance Prediction
4.2.2 Optimal Number of Reference Sequences
4.3 Hierarchical Clustering
4.3.1 Memory Efficient Clustering Technique
4.3.2 Cost Analysis
4.3.3 Predicted Edit Distance Thresholds in the Vector Space
4.4 Experimental Results
4.4.1 Clustering Time
4.4.2 Cluster Similarity
4.4.3 Cluster Quality
4.5 Summary
Chapter 5 Edit Distance Prediction from Distances between the Sub-sequences
5.1 Basic Concepts
5.1.1 Notations
5.1.2 Key Ideas
5.2 Reference Sequence based Transformation
5.2.1 Edit Distances among the Reference Sequences
5.2.2 Predicting Edit Distance
5.2.2.1 Edit Distance One
5.2.2.2 Edit Distance Greater than One
5.3 Edit Distance Prediction from Partial Edit Distances of the Corresponding Sub-sequences
5.3.1 Merging Partial Edit Distances of the Sub-sequences
5.3.2 Finding Primary and Secondary Sub-sequences
5.3.3 Multiple Adjacent Letters as Distinct Elements
5.3.4 Effect of Sub-sequence Length
5.3.5 Computational Complexity
5.4 Experimental Results
5.4.1 Effect of the Proposed Strategies on Edit Distance Prediction Accuracy
5.4.2 Edit Distance Prediction Accuracy
5.4.3 Edit Distance Calculation Time
5.4.4 Clustering
5.4.4.1 Clustering Time
5.4.4.2 Cluster Similarity
5.5 Summary
Chapter 6 Bounded Exact Edit Distance
6.1 Length of the Sub-sequences
6.2 Minimum Cost Trace between the Sub-sequences
6.2.1 Computational Complexity
6.3 Experimental Results
6.3.1 Edit Distance Calculation Time
6.3.2 Hierarchical Clustering Time
6.4 Summary
Chapter 7 Conclusion
7.1 Key Contributions
7.2 Discussions
7.3 Possible Extensions
7.3.1 Extending the Proposed Methods
7.3.2 Similarity Comparison of other Ribosomal Regions
7.3.3 Applying Proposed Methods to Relevant Problems
APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: 16S rRNA Next Generation Sequencing (NGS) datasets
Table 3.2: Mapping of Euclidean distance thresholds for given edit distance thresholds in hierarchical clustering by comparing similar clusters in the edit space and the vector space
Table 3.3: Comparison of hierarchical clustering time (in hours) of the reference sequence based technique in the vector space with that of the actual edit distance based technique
Table 3.4: Evaluating accuracy of single linkage and average linkage hierarchical clusters generated using the proposed predicted edit distances in the vector space. We compute Normalized Mutual Information (NMI) for relative similarity of the generated clusters to those based on actual edit distances.
Table 3.5: Comparison of clustering time and accuracy based on predicted edit distances in the vector space to those based on actual edit distances. All the single linkage and average linkage clusters are compared for 97% similarity threshold.
Table 3.6: Evaluating quality of average linkage hierarchical clusters using the silhouette coefficient. The clusters are based on the edit distances and the Euclidean distances in the vector space.
Table 3.7: Comparison of clusters based on the edit distances, the predicted distances in the vector space and local sensitive hashing based similarity with respect to clustering time, relative similarity (NMI) to ground truth and the silhouette coefficient.
Table 4.1: Number of candidate pairs selected for the edit distance calculation. These pairs are selected based on the top two differences of VD and their corresponding frequencies.
Table 4.2: Comparison of similarity thresholds for hierarchical clustering based on the edit distances and those of the MVD
Table 4.3: Hierarchical clustering time (in hours) breakdown based on the edit distance and the MVD in the vector space for 97% similarity
Table 4.4: For 99% similarity threshold, comparison of clustering time and accuracy using the MVD based method to those of the actual edit distances as well as k-mer matching based VSEARCH and LSH-Div clustering
Table 4.5: For 98% similarity threshold, comparison of clustering time and accuracy using the MVD based method to those of the actual edit distances as well as k-mer matching based VSEARCH and LSH-Div clustering
Table 4.6: For 97% similarity threshold, comparison of clustering time and accuracy using the MVD based method to those of the actual edit distances as well as k-mer matching based VSEARCH and LSH-Div clustering
Table 4.7: Comparing average linkage hierarchical clusters using the silhouette coefficient. The clusters are generated based on the edit distances, the MVD based predicted edit distances as well as k-mer matching techniques such as VSEARCH and LSH-Div.
Table 5.1: Effect of sub-sequence length on edit distance prediction accuracy (%). For a predefined distance threshold, the experimental result shows that the prediction accuracy drops with increasing sub-sequence lengths.
Table 5.2: Comparison of cumulative edit distance prediction accuracy between the MVD and the differences of distances based methods
Table 5.3: For 99% similarity threshold, comparison of clustering time (in hours) and accuracy of the differences of distances based method to those of the actual edit distances, the MVD as well as k-mer matching based VSEARCH and LSH-Div clustering
Table 5.4: For 98% similarity threshold, comparison of clustering time (in hours) and accuracy of the differences of distances based method to those of the actual edit distances, the MVD as well as k-mer matching based VSEARCH and LSH-Div clustering
Table 5.5: For 97% similarity threshold, comparison of clustering time (in hours) and accuracy of the differences of distances based method to those of the actual edit distances, the MVD as well as k-mer matching based VSEARCH and LSH-Div clustering
Table 6.1: Comparison of all pairs distance calculation time (in hours) of the proposed bounded exact edit distance calculation method with that of the existing dynamic programming based method and Slidesort
Table 6.2: Comparison of hierarchical clustering time (in hours) of the proposed bounded exact edit distance calculation method with that of the existing dynamic programming based method for a given distance threshold

LIST OF FIGURES

Figure 1.1: Example of edit operations and the corresponding edit distance between a pair of strings
Figure 1.2: NGS sequence similarity thresholds at genus and species levels have been revised over the years. With few exceptions, the species and genus similarities remain at 97% and 95% respectively. This justifies the usefulness of more accurate edit distance prediction and calculation methods for smaller edit distance bounds.
Figure 1.3: All possible k-mers in a sequence
Figure 1.4: Comparison of sequence similarity based on (a) the edit distance, and (b) Local Sensitive Hashing (LSH) based methods
Figure 1.5: Comparison of sequence similarity based on shared k-mers. For a given edit distance, the sequence similarity may vary depending on the uniqueness of the k-mers.
Figure 1.6: Schematic representation of general space transformation and distances among the dataset sequences with respect to reference sequences.
Figure 1.7: The Chebyshev distance between a pair of 2D feature vectors
Figure 3.1: Theoretical probability distribution of L1 distances in the vector space formed by the Hamming distances with varying number of dimensions, i.e., reference sequences.
Figure 3.2: Theoretical expectation of the Euclidean distances and their 99.9% intervals in the vector space with 100 reference sequences by the Hamming distance (upper panel) and the edit distance (lower panel) as a function of the actual Hamming and edit distances between a random pair of sequences. Compared with experimental lower and upper bounds of observed distances from a random sampling simulation.
Figure 3.3: Correlation among features after transformation of random sequences as a function of the distance among reference sequences. [The dashed line shows lref/2].
Figure 3.4: Cumulative frequency distribution of the edit and the Euclidean distances for different threshold settings. (a) The edit distance distribution and (b) the Euclidean distance distribution. (c) Relative similarity of the resulting clusters between the edit space and the vector space for varying cumulative frequencies of the Euclidean distances.
Figure 3.5: Comparing relative similarity (NMI) between clusters in the edit space and the vector space with growing number of reference sequences
Figure 4.1: Distance between sequences A and B in the vector space based on hypothetical alignment of the reference sequences.
Figure 4.2: Distribution of the edit distances in the vector space grouped by MVD with varying number of dimensions
Figure 4.3: For a given MVD, the range of FVDs highlights the distribution of the edit distances. We consider the change in distribution for our prediction. For example, predicted edit distances are marked based on intervals of FVD values.
Figure 4.4: Predicting edit distances using the top two vector component differences (VD) and their corresponding frequencies (FVD). The differences of pairwise edit distance distributions for the given <VD, FVD> show the effectiveness of such strategies.
Figure 4.5: Total number of pairwise edit distance computations including the number of pairs needed for space transformation and the number of candidate pairs determined by heuristic H3
Figure 5.1: (a) A minimum cost trace with maximum number of change operations and (b) an arbitrary trace with fewer change operations
Figure 5.2: Difference of edit distances between sequences A and B by the reference sequence gl cannot predict the correct edit distance because the same letter is inserted and subsequently deleted, resulting in no change in overall frequency.
Figure 5.3: Sequences A and B are divided into fixed length sub-sequences. Each pair of sub-sequences has only one edit operation. The edit operations are predicted correctly and separately with respect to the given reference sequences using Theorem 3. Partial edit distances of the sub-sequences are merged subsequently.
Figure 5.4: The edit distance between sequences A and B is predicted from the differences of distances of primary sub-sequences. The 2nd pair of sub-sequences contains only redundant edit operations and hence is not considered for the distance prediction.
Figure 5.5: The insertion operation between the first pair of fixed length sub-sequences is not reflected in the differences of distances between them because of the small DNA alphabet.
Considering multiple adjacent letters as single elements would avoid such by-chance reduction in differences of distances.
Figure 5.6: Comparing the effect of each differences of distances calculation strategy on edit distance prediction accuracy.
Figure 5.7: Comparison of edit distance prediction time and accuracy between the MVD and the differences of distances based methods
Figure 5.8: Comparison of average edit distance prediction time between the dataset sequences using the MVD and the differences of distances based methods
Figure 6.1: (a) Minimum cost trace of a pair of sequences (strings) and (b) dividing the sequences into corresponding sub-sequences based on the minimum cost trace

KEY TO SYMBOLS AND ABBREVIATIONS

S - A sequence dataset
Σ - Alphabet of the sequences
Λ - Empty letter/element
s, A, B - Sequences in a dataset S
dist(A, B) - Generic distance between A and B
ed(A, B) - Edit distance between A and B
dh(A, B) - Hamming distance between A and B
l - Length of the largest sequence in a dataset
R - Set of reference sequences
Rl - A set of reference sequences of length l with edit distance l between them
lref - Length of a reference sequence
Ω(A) - Space transformation function for a sequence A
E - A sequence of edit operations
Em - A minimum sequence of edit operations
Emp - A physically ordered minimum sequence of edit operations
λ(i) - Function to update positions of physically ordered edit operations
k - Length of a sub-sequence
(T, A, B) - A triple representing a trace T between sequences A and B
cost(T) - Cost of a trace T
A(i, j) - Sub-string of A of length j starting from the ith position
A[i, j] - Sub-string of A from the ith position to the jth position
A[i] - ith element of string A
γ(x → y) - Cost function to change letter x into letter y
r - A reference sequence in R
ρ - A sub-sequence of a reference sequence r
AMI - Adjusted Mutual Information
NMI - Normalized Mutual Information
DNA - Deoxyribonucleic Acid
GST - Generalized Suffix Tree
RNA - Ribonucleic Acid
rRNA - Ribosomal Ribonucleic Acid
NGS - Next Generation Sequencing
NCBI - National Center for Biotechnology Information
SRA - Sequence Read Archive
VD - Vector component Difference
MVD - Maximum Vector component Difference
FVD - Frequency of the MVD
LSH - Local Sensitive Hashing
OTU - Operational Taxonomic Unit
LCS - Longest Common Sub-sequences
MST - Minimum Spanning Tree
NN - Nearest Neighbor
HIN - Heterogeneous Information Networks
MCMC - Markov Chain Monte Carlo
RBSA - Reference-Based String Alignment
RCSI - Referentially Compressed Search Index
SETH - Strong Exponential Time Hypothesis
RESTRAC - Reference Sequence based Space TRAnsformation for Clustering

Chapter 1 Introduction

The rapidly increasing number of DNA and RNA sequence fragments obtained from Next Generation Sequencing (NGS) technologies in recent years has provided valuable insights such as discovering, identifying, and classifying novel microorganisms and inferring taxonomic hierarchies. Yet, processing such data-intensive applications demands efficient techniques to handle large numbers of NGS sequences. One of the important aspects of NGS sequence analysis is partitioning sequences based on their level of similarity. Phylogenetic relationships among sequences are inferred if a set of sequences are highly similar to each other. In many cases, the sequences generated from metagenomic samples may not have annotations, and thus the distance between the letters of the sequences is the only way to find relationships between them. While a number of effective methods have been developed [1, 2, 3] for sequence similarity analysis, they are not suitable for handling large datasets due to the high computational time complexity of these methods.
For instance, the edit distance between two strings of arbitrary lengths represents their dissimilarity. Given strings A and B, the minimum number of edit operations (Definition 1) required to convert A into B is their edit distance (Definition 3) [1]. The distance measure is critical for analyzing non-ordered and unstructured datasets in diverse fields such as NGS sequences, text and graph datasets. Despite wide applicability in numerous sequence and string analysis methods, the near quadratic run time with respect to sequence lengths to compute the edit distance has been a performance bottleneck problem. For instance, a gene sequence length can vary from several thousands to several hundreds of thousands of base pairs [4]. The edit distance calculation for such large sequences is not practical. Moreover, edit distance calculation between entire genes is not necessary because only a small fraction of a genome contains meaningful gene sequences. On the other hand, in recent years, NGS technology has improved significantly, which has generated massive amounts of sequence data from targeted regions of genomes in the form of short reads. The reads are usually a few hundred base pairs long. The edit distance calculation methods are suitable for computing similarity between pairs of NGS sequences. However, for large NGS sequence datasets, we need to calculate the edit distances between very large numbers of pairs of sequences. This reinforces the necessity of faster edit distance calculation methods for large numbers of pairs of sequences. The cost of an edit operation is determined by γ(x → y).

Definition 1. Given a string A, an alphabet Σ and a null element Λ, an edit operation is a triple (i, x, y) replacing a letter x at the ith position of A with a letter y, where x, y ∈ (Σ ∪ {Λ}). An edit operation (i, x, y) is
• a substitution if x ≠ y and x, y ∈ Σ.
• an insertion if x = Λ and y ∈ Σ.
• a deletion if x ∈ Σ and y = Λ.

Definition 2.
Given string A, a temporally ordered sequence of edit operations on A is defined as E = <(i1, x1, y1), (i2, x2, y2), ...> where, for 1 ≤ j ≤ |E|, ij is the position of the edit operation and xj is replaced by yj at the ij-th position of A. For instance, let A = JUNE and a temporally ordered sequence of edit operations E = <(4, Λ, L), (3, N, Λ), (4, E, Y)>. As the edit operations are applied on A, the string is changed gradually as follows: JUNE → JUNLE → JULE → JULY.

Definition 3. Given strings A, B and assuming the cost of each edit operation is the same, the edit distance ed(A, B) is the minimum number of edit operations to convert A into B.

Figure 1.1: Example of edit operations and the corresponding edit distance between a pair of strings

For example, the minimum number of edit operations between "HONDA" and "HYUNDAI" is shown in Figure 1.1. In the first example, (2, Λ, Y) and (7, Λ, I) are the insertions and (3, O, U) is the substitution. As shown in the figure, there might be multiple representations of the minimum number of edit operations to convert one string into another. For instance, both E1 = <(2, Λ, Y), (3, O, U), (7, Λ, I)> and E2 = <(2, O, Y), (3, Λ, U), (7, Λ, I)> convert "HONDA" into "HYUNDAI" with three edit operations. It should be noted that the edit distance may not capture every modification made while converting string A into string B; only the edit operations that are not lost in the conversion from A to B are reflected in the edit distance. For example, the sequence of edit operations E2 is applied on "HONDA" to convert the string to "HYUNDAI". On the other hand, E3 = <(2, O, Λ), (2, Λ, Y), (3, Λ, U), (7, Λ, I)> can also convert "HONDA" to "HYUNDAI", using four operations whose first two cancel out. This shows that the edit distance calculation can recover only a minimal set of edit operations between a pair of sequences.
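The edit distance of Definition 3 is conventionally computed with the standard dynamic programming recurrence. The following sketch is illustrative only (it is not part of the dissertation's proposed methods) and assumes unit cost for every edit operation, as in Definition 3; it reproduces the Figure 1.1 result for "HONDA" and "HYUNDAI":

```python
def edit_distance(a, b):
    """Unit-cost edit distance (Definition 3) via the standard
    O(|a| * |b|) dynamic programming table, kept as two rows."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances from the empty prefix of a to every prefix of b
    for i in range(1, m + 1):
        curr = [i] + [0] * n   # distance from a[:i] to the empty prefix of b
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1  # match or substitution
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + sub)  # substitution / match
        prev = curr
    return prev[n]

print(edit_distance("HONDA", "HYUNDAI"))  # 3, as in Figure 1.1
```

Note that a temporally ordered sequence of edit operations need not be minimal: the JUNE example of Definition 2 uses three operations, while ed(JUNE, JULY) = 2 (two substitutions).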
1.1 Classification of Similarity Measures for NGS Sequences

Various popular distance measures, such as the edit distance, the Hamming distance, and others, are used to measure the similarities between NGS sequences. The edit distance is commonly used as the basis for measuring evolutionary relationships between sequences. For NGS sequences, the edit distance is more effective as a similarity measure for many applications because it provides the appropriate global similarity. One such important application is to discover novel OTUs in NGS datasets, which requires all-pair similarity distance calculations. Because of the very large size of NGS datasets and the high computation time of edit distance calculation, the all-pair similarity distance calculation time becomes a serious performance bottleneck for these important applications. To reduce the computation time, other similarity measures such as matching k-mer counts, reference sequence based edit distance and Euclidean distance have been proposed. In the following sections we classify the existing similarity measures, specifically for NGS sequences, based on various distance metrics.

1.1.1 NGS Sequence Similarity based on the Edit Distance

The edit distance is widely used to measure the similarities between strings. The edit distance calculation method using dynamic programming [1] is of complexity O(l²) where l is the string length. Due to the wide applicability of the edit distance, reducing the computation time complexity of the edit distance has been an active research problem. Despite these efforts, the well known algorithms for edit distance calculation run in nearly quadratic time [1, 2, 5, 3], and such techniques cannot be used for calculating edit distances of large numbers of pairs of strings. In order to reduce the time complexity of edit distance calculations, several faster methods have been proposed.
For instance, a number of edit distance approximation methods have been proposed [6, 7, 8, 9, 10, 11, 12] with computational complexity ranging from sub-quadratic to near linear with respect to sequence lengths. Further, a number of methods have been proposed that guarantee finding the exact edit distance up to a given bound [5, 13]. For example, the state-of-the-art bounded exact edit distance calculation method runs in strictly sub-quadratic time [13] with respect to sequence lengths.

Figure 1.2: NGS sequence similarity thresholds at genus and species levels have been revised over the years. With few exceptions, the species and genus similarities remain at 97% and 95% respectively. This justifies the usefulness of more accurate edit distance prediction and calculation methods for smaller edit distance bounds.

The bounded exact and the approximate edit distance calculation methods are more appropriate than the standard edit distance calculation methods for NGS sequence similarity because sequence similarity is evaluated primarily up to the genus and species levels [14, 15, 16, 17]. As shown in Figure 1.2, species and genus level similarities of 16S rRNA in bacterial samples are 98.7% and 95% respectively, with few exceptions [18, 19, 20]. Further, 97% similarity [16, 18] is often used for creating Operational Taxonomic Units (OTUs) from metagenomic samples. Despite the improved time complexity of the bounded exact and approximate edit distance calculation methods, they may not be suitable for NGS sequence similarity methods when the numbers of sequence pairs are very large.

1.1.2 NGS Sequence Similarity based on the Matching k-mer Counts

In order to further reduce the sequence similarity computation time between large numbers of pairs of NGS sequences, similarities of short k-mers (Definition 4) are used to evaluate the similarities of the NGS sequences. Here, the percentage of matching k-mers is used as a measure of similarity between the sequences.
These approaches give close results when the edit distance is small, but the accuracy deteriorates with increasing edit distances.

Definition 4. k-mers are all possible sub-sequences of length k within a sequence.

For example, in Figure 1.3, all the 7-mers of a sample sequence are shown.

Figure 1.3: All possible k-mers in a sequence

The exact matching of the k-mers in their corresponding positions is the basis of Locality Sensitive Hashing (LSH) [21, 22]. These methods assume that exact matching between the randomly chosen k-mers represents the edit distance between the sequences. This can be true only if the edit operations are substitutions and the positions are valid with respect to the smaller sequence length. For example, in Figure 1.4(a), the edit distance between the sequences is one. As shown in Figure 1.4(b), two pairs of sub-sequences are generated by randomly choosing the k-mers of length one and subsequently merging them. The distances between both pairs of sub-sequences are greater than the actual edit distance.

Figure 1.4: Comparison of sequence similarity based on (a) the edit distance, and (b) Locality Sensitive Hashing (LSH) based methods

Figure 1.5: Comparison of sequence similarity based on shared k-mers. For a given edit distance, the sequence similarity may vary depending on the uniqueness of the k-mers.

On the other hand, many k-mer based exact matching sequence similarity methods do not consider the positions of the k-mers [23, 24]. Instead they take the set of k-mers generated from each sequence and compute their similarity based on the exactly matched subset of k-mers. These methods determine sequence similarity nearly correctly when the sequences are very close to each other. However, when the sequences are far apart with respect to the edit distance, they cannot determine the dissimilarity between the sequences satisfactorily.
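Definition 4 and the position-free shared k-mer similarity can be sketched as follows (function names are ours; sequences are assumed to be at least k letters long):

```python
def kmers(seq: str, k: int) -> set:
    """All distinct k-mers (Definition 4) of a sequence; positions are ignored."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmer_similarity(a: str, b: str, k: int) -> float:
    """Jaccard-style similarity over the distinct k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb)

assert kmers("abcde", 2) == {"ab", "bc", "cd", "de"}
# A single substitution destroys up to k overlapping k-mers, so one edit
# operation can change the similarity score disproportionately:
assert shared_kmer_similarity("abcdefgh", "abcdXfgh", 3) == 3 / 9
```

The last assertion shows why shared k-mer counts track small edit distances only loosely: one substitution removed three of the six 3-mers.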
For instance, in Figure 1.5(a), an example sequence and the corresponding set of distinct 2-mers are shown. Assuming one edit operation in this sequence, Figures 1.5(b)-(e) show all possible updated sequences and their corresponding sets of 2-mers. The 2-mer based similarity between these sequences varies between 0 and 2. Hence, k-mer based sequence similarity alone is not sufficient for determining the edit distance. Existing k-mer based methods improve the sequence similarity accuracy by applying costly edit distance based methods when stringent accuracy is required [25, 23]. They become very slow with the increase of sequence lengths and the number of edit distance calculations.

1.1.3 A Novel Class of NGS Sequence Similarity Measures based on Reference Sequence based Predicted Edit Distances

In this dissertation, a new class of NGS sequence similarity measures is proposed that provides significantly better performance both in terms of quality and time over existing methods. An overview of our proposed methods is given in the following section.

1.2 An Overview of our Novel Edit Distance Prediction Scheme

We propose a novel class of similarity measures that improve performance in both quality and time. Our approaches to improve edit distance calculation of sequences in NGS datasets are based on the following goals. The first goal is to reduce the edit distance prediction or computation time without using the matching k-mer count based sequence similarity. Since the majority of NGS sequence analysis problems can be solved by finding highly similar pairs of sequences, our second goal is to improve prediction accuracy when the edit distances between pairs of NGS sequences are small. In order to achieve these goals, we propose a few reference sequence based edit distance prediction methods in the vector space, followed by a bounded exact edit distance calculation method.
For the proposed edit distance prediction methods, we convert each sequence in a dataset into a feature vector using a set of reference sequences. Suppose S refers to a dataset of N sequences, {s1, s2, . . . , sN}, on alphabet Σ.

Definition 5. Given a set of n reference sequences, R = {r1, r2, . . . , rn}, a sequence s in the dataset is mapped to an n-dimensional feature vector using the transformation function Ω(s):

∀s ∈ S : Ω(s) = x where x = (x1, x2, . . . , xn) and ∀j ∈ {1, . . . , n}, rj ∈ R : xj = dist(s, rj)    (1.1)

where the function dist(s, rj) is a dissimilarity metric, for example, the edit distance between sequence s and reference sequence rj.

Figure 1.6: Schematic representation of general space transformation and distances among the dataset sequences with respect to reference sequences.

According to Definition 5, each sequence in a dataset is represented by an n-dimensional feature vector x. Figure 1.6 shows the schematic representation of such a transformation, in which sequence A and sequence B are mapped into feature vectors x^A and x^B respectively using two reference sequences r1 and r2.

We explored the desired properties of the reference sequences because reference sequences are key to the edit distance prediction accuracy and low computation cost. We evaluated the effectiveness of the reference sequences based on their lengths, the distance among the sequences, the number of sequences, and the alphabet size. We also examined various methods of creating the reference sequences, such as randomly generated versus deterministically generated, and selected from the dataset versus generated independently of the dataset. More detail about the properties of reference sequences is presented in the later chapters.

Our first edit distance prediction method is based on the Euclidean distance (Definition 6) between Ω(A) and Ω(B) for A, B ∈ S. Suppose Ω(A) = (x^A_1, x^A_2, . . . , x^A_n) and Ω(B) = (x^B_1, x^B_2, . . . , x^B_n).
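The transformation Ω of Definition 5 can be sketched directly. The helper `edit_distance` repeats the standard dynamic program, and the example reference sequences are ours:

```python
def edit_distance(a, b):
    """Standard Wagner-Fischer dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def omega(s: str, refs: list) -> tuple:
    """Definition 5: map s to (dist(s, r1), ..., dist(s, rn))."""
    return tuple(edit_distance(s, r) for r in refs)

# Two illustrative reference sequences; each dataset sequence becomes a 2D point.
refs = ["acgtacgt", "ttttcccc"]
xA = omega("acgaacgt", refs)
xB = omega("acgtacct", refs)
```

All subsequent distance computations then operate on the numeric vectors xA and xB instead of the sequences themselves.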
We generate a lookup table between the Euclidean distances and the edit distances. This lookup table depends on the number of dimensions in the feature vectors.

Definition 6. The Euclidean distance is the straight-line distance between two points in Euclidean space. With this distance, Euclidean space becomes a metric space. The associated norm is called the Euclidean norm or L2 norm. If x^A = (x^A_1, x^A_2, . . . , x^A_n) and x^B = (x^B_1, x^B_2, . . . , x^B_n) are two points in Euclidean n-space, then the distance from x^A to x^B is given by:

dist(x^A, x^B) = √((x^A_1 − x^B_1)² + (x^A_2 − x^B_2)² + · · · + (x^A_n − x^B_n)²) = √(Σ_{i=1}^{n} (x^A_i − x^B_i)²).

Definition 7. The Chebyshev distance is a metric defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension [26]. For example, in Figure 1.7, the Chebyshev distance between the 2D vectors (4, 2) and (13, 10) is 9.

Figure 1.7: The Chebyshev distance between a pair of 2D feature vectors

Our second edit distance prediction method in the vector space is based on the Maximum Vector component Difference (MVD), i.e., the Chebyshev distance. Differences between the corresponding dimensions of the feature vectors are bounded by the edit distance between each pair of sequences. Given the feature vectors (x^A_1, x^A_2, . . . , x^A_n) and (x^B_1, x^B_2, . . . , x^B_n), the vector component differences form the bag of (x^A_j ∼ x^B_j) where 1 ≤ j ≤ n. The maximum difference between any x^A_j and x^B_j is the Chebyshev distance. As we increase the number of reference sequences, the Chebyshev distance is likely to get closer to the actual edit distance. The frequency of any (x^A_j ∼ x^B_j) is also helpful for improving the edit distance prediction accuracy in the vector space.
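Definitions 6 and 7 in code (function names are ours):

```python
import math

def euclidean(x, y):
    """Definition 6: L2 distance between feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def chebyshev(x, y):
    """Definition 7: maximum vector component difference (MVD)."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

# The Figure 1.7 example: max(|4 - 13|, |2 - 10|) = 9.
assert chebyshev((4, 2), (13, 10)) == 9
```

Either function is evaluated on the Ω-transformed vectors, which costs O(n) per pair instead of a quadratic-time edit distance computation.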
Using these strategies, we are able to predict pairwise edit distances in the vector space more efficiently than with the Euclidean distance based method. The MVD based method predicts small edit distances with near perfect accuracy, but the accuracy decreases as the edit distances become larger. In order to correctly predict the edit distance, a reference sequence must find all the edit operations with respect to one dataset sequence of the pair and none with respect to the other. Given alphabet Σ and sequence length l, the number of all possible random reference sequences is |Σ|^l. Since the edit operations between pairs of sequences in a dataset can appear arbitrarily, when only a few reference sequences are used, the probability of correctly predicting larger edit distances decreases with increasing sequence length and increasing number of edit operations between pairs of sequences in a dataset. On the other hand, a large number of reference sequences improves the prediction accuracy at the cost of higher computation time. For even larger edit distances, the benefit of increasing the number of reference sequences for better edit distance prediction diminishes, as it becomes increasingly unlikely to find all the edit operations from the difference of edit distances of sequences A and B with respect to a reference sequence.

Hence, our third edit distance prediction method is based on a small set of simple reference sequences. Each reference sequence is made up of a single letter of the alphabet. For example, given an alphabet Σ = {a, c, g, t}, the reference sequences are {aaaa..., cccc..., gggg..., tttt...}. Even though the number of sequences is small and, therefore, the computation time is low, the prediction accuracy of this method degrades rapidly, particularly when predicting higher edit distances. Suppose A = gcatacg and B = gcctaag, with substitution operations at the 3rd and 6th positions. Then ed(A, ccccccc) = 5 and ed(B, ccccccc) = 5.
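This example can be verified with a short computation; every homopolymer reference sees the same distance to both sequences (the helper `edit_distance` repeats the standard dynamic program):

```python
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

A, B = "gcatacg", "gcctaag"          # ed(A, B) = 2 (substitutions at positions 3 and 6)
assert edit_distance(A, B) == 2
for ch in "acgt":
    r = ch * 7                        # simple homopolymer reference sequence
    # the per-reference difference ed(A, r) - ed(B, r) is 0 for every letter,
    # so whole-sequence homopolymer references cannot see the two substitutions
    assert edit_distance(A, r) == edit_distance(B, r)
```

For a homopolymer reference of the same length, the edit distance is simply the number of positions not holding that letter, which is why A and B are indistinguishable here.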
Thus the difference of the edit distances of A and B with respect to ccccccc is 0. Similarly, the differences of the edit distances of A and B with respect to aaaaaaa, ggggggg and ttttttt are 0. In order to minimize this problem, we compute the edit distances between corresponding non-overlapping sub-sequences of A and B separately. We then develop merging strategies to obtain the edit distance for the entire sequences from the partial edit distances. This significantly improves the edit distance prediction accuracy when using the simple reference sequence set.

1.3 An Overview of the Proposed Bounded Exact Edit Distance Calculation

In the previous sections we have discussed a classification of existing similarity measures and our proposed new class of similarity measures for NGS sequences. In this section we give an overview of our new approach to compute the bounded exact edit distance, which uses the similarity measure described in Section 1.1.1. This method calculates the edit distance of NGS sequences up to a predefined distance bound d. Although the proposed edit distance prediction methods described in Section 1.2 produce high accuracy in several scenarios, they cannot guarantee finding the exact edit distances. Since many genome sequence analysis methods require edit distances only within a small bound, we propose an efficient bounded edit distance calculation method. Assuming d << l, the proposed method splits the sequences into sub-sequences of length up to 2d, because the corresponding letters in the sequences can be up to d positions apart. Then the method computes edit distances between the sub-sequences. For example, given sequences A and B, we split the sequences into A = A1.A2 and B = B1.B2 such that there is no edit operation between A1 and B2, and vice versa. Then ed(A1, B1) finds the minimum number of edit operations with respect to ed(A, B). We merge the partial edit distances of the sub-sequences to find ed(A, B).
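For context, the observation that corresponding letters can be at most d positions apart also underlies the classical banded ("Ukkonen-style") dynamic program: only cells within d of the diagonal can lie on an optimal path when ed(A, B) ≤ d. The sketch below illustrates that standard technique; it is not the splitting-and-merging method proposed in this dissertation:

```python
def bounded_edit_distance(a, b, d):
    """Banded DP: fill only cells with |i - j| <= d.
    Returns ed(a, b) if it is <= d, otherwise None."""
    if abs(len(a) - len(b)) > d:
        return None                     # length difference alone exceeds the bound
    INF = d + 1
    prev = {j: j for j in range(min(len(b), d) + 1)}   # row 0 inside the band
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - d), min(len(b), i + d) + 1):
            if j == 0:
                cur[j] = i
            else:
                cost = 0 if a[i - 1] == b[j - 1] else 1
                cur[j] = min(prev.get(j - 1, INF) + cost,   # substitution/match
                             cur.get(j - 1, INF) + 1,       # insertion
                             prev.get(j, INF) + 1)          # deletion
        prev = cur
    res = prev.get(len(b), INF)
    return res if res <= d else None

assert bounded_edit_distance("HONDA", "HYUNDAI", 3) == 3
assert bounded_edit_distance("HONDA", "HYUNDAI", 2) is None
```

The run time is O(l · d) instead of O(l²), which is the kind of saving the d << l assumption buys.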
1.4 Thesis Outline

The structure of this dissertation is as follows:
• In chapter 2, we present a literature review on edit distance calculation and approximation techniques as well as heuristic based alternatives developed to predict the edit distances.
• In chapter 3, we present the edit distance prediction technique based on the Euclidean distance between the feature vectors in the vector space.
• In chapter 4, we present the edit distance prediction technique in the vector space based on MVD (the Chebyshev distance) between corresponding components of the feature vectors and their frequencies.
• In chapter 5, instead of computing the edit distance between the entire sequences, we compute the distance between fixed length sub-sequences separately and subsequently merge them.
• The bounded exact edit distance calculation technique (Section 1.3) is presented in chapter 6.
• We summarize the dissertation contributions, provide a general discussion of this work, and propose possible extensions in chapter 7.

Chapter 2

Related Works

In this chapter, we present a literature review on existing NGS sequence similarity methods. We present the existing edit distance calculation methods that are frequently used to determine the minimum number of changes required to convert one sequence into another. Then we review the faster NGS sequence similarity computation methods based on matching k-mers. A majority of the exact matching k-mer based methods are used as intermediate filtering steps to reduce the number of pairs of sequences passed to the costly edit distance calculation. We also review the existing clustering methods for NGS sequences because they are frequently used for metagenomic NGS sequence analysis. In addition, we review the existing reference sequence based methods used for longer genome sequences to show their potential usage in shorter NGS sequence similarity.
2.1 Edit Distance based NGS Sequence Similarity

The edit distance measures the dissimilarity between a pair of strings by determining the minimum number of changes to convert one string to another. The standard dynamic programming [1] based edit distance computation complexity is O(l²), l being the sequence length. There have been many efforts to reduce the quadratic computational complexity because of its wide applicability to string comparison. Despite these efforts, the well known algorithms for edit distance calculation run in nearly quadratic computational complexity [2, 3, 27]. Masek and Paterson [2] proposed an edit distance calculation method of time complexity O(l²/log² l) when the alphabet is constant. Bille and Farach-Colton [27] removed the constraint of the constant alphabet imposed in [2] at the cost of a factor α = log² log l, resulting in a total complexity of O(l² log² α/α²). Recently, Backurs and Indyk [3] indicated that the near quadratic run time complexity of edit distance calculation might be tight. With the help of the Strong Exponential Time Hypothesis (SETH) [28], they showed that no edit distance calculation method of O(l^(2−α)) with α > 0 exists. In order to reduce the time complexity significantly, many other edit distance calculation methods have been proposed by relaxing the accuracy requirement [6, 7, 8] or by guaranteeing correct distances only up to a threshold [5, 13].

2.1.1 Edit Distance Approximation

Nearly quadratic run time based edit distance calculation methods do not reduce the computation time enough to be practical for computing NGS sequence similarity for larger numbers of pairs of sequences. On the other hand, the computational complexity of the edit distance calculation can be reduced by relaxing the accuracy requirement [6, 7, 8, 9, 10, 11, 12]. Landau et al. [9] proposed a linear run time incremental edit distance approximation of factor √l where l is the string length.
Despite the low time complexity, the approximation factor was too large for high similarity threshold NGS sequence similarity search. Later research efforts on this problem focused on reducing the approximation factor without increasing the time complexity. For instance, Bar-Yossef et al. [10] reduced the approximation factor to l^(3/7) with a quasi-linear edit distance approximation method. Batu et al. [11] further reduced the approximation factor to O(l^(1/3+o(1))). Recently, Andoni et al. [12] proposed a near linear run time edit distance approximation method between strings of length l and reduced the approximation factor significantly over the previously proposed methods, down to a factor of 2^O(√log l). On the other hand, the most recent edit distance approximation algorithm, proposed by Chakraborty et al. [8], improved the approximation bound to a constant factor at the cost of higher computation complexity, O(n^(2−2/7)). Hence, the existing edit distance approximation algorithms trade off between the approximation factor and the computation complexity. Yet, the approximation factor may be high for smaller edit distance thresholds.

2.1.2 Bounded Exact Edit Distance Calculation

Before presenting the existing bounded edit distance calculation methods, we justify their applicability to NGS sequence similarity. As shown in Figure 1.2, the species and genus level similarities of NGS sequences primarily fall within 97% and 95% respectively. As a result, many metagenomic NGS sequence similarity based analysis methods focus on high similarity thresholds, such as OTUs. For high similarity thresholds, the bounded exact edit distance calculation methods are a less costly alternative to the standard edit distance calculation methods without losing accuracy. Abboud et al. [5] proposed a d-edit operation based LCS method of time complexity O(l²d^1.5 / 2^Ω(√(log n/d))).
This time complexity is only a marginal improvement over that of the fastest unbounded edit distance calculation method [2]. Thankachan et al. [13] proposed a strictly sub-quadratic algorithmic framework for the approximate matching problem with a limited number of edit operations. The framework is claimed to be the first strictly sub-quadratic bounded edit distance calculation method for a fixed number d of edit operations. The algorithmic framework works for any arbitrary edit distance, although smaller edit distances are desired for optimized performance. The framework relies on converting an approximate matching problem into an exact matching problem. In order to do so, they initially create a Generalized Suffix Tree (GST) from the input sequences and gradually modify the suffix tree for all possible suffixes of d edit operations. The computational complexity of this algorithm is bounded by the creation and modification of the suffix tree, which is O((Nl) log^d(Nl)) where N is the number of sequences in a dataset. The all-pairwise distance calculation of this method is O(((Nl) + OCC) log^d(Nl)) where d is a constant and OCC is the output size. Hence, the edit distance calculation time of this algorithm is better than the quadratic comparison of the existing methods regardless of the cost of calculating the distance between a pair of sequences. However, for an increasing edit distance bound, the polylogarithmic factor log^d(Nl) could be larger than (Nl) when d > |Σ|, which would be the usual case for sequence similarity of NGS datasets. The time requirement of gradually updating the suffix tree for all possible d edit operations increases rapidly with the growth of d.

2.2 NGS Sequence Similarity based on Matching k-mers

In order to further reduce the sequence similarity computation time between large numbers of pairs of NGS sequences, many matching k-mer based sequence similarity methods have been proposed.
Locality Sensitive Hashing (LSH) [22] assumes that global dissimilarity between two sequences is preserved in randomly generated fixed length sub-sequences. These methods use two-level hashing, for instance, multiple random subsets of dimensions and a hash table of the sub-sequences, to narrow down the number of candidate pairs of sequences in a dataset. While the strategy reduces the number of candidate pairs, it has two limitations. First, the similarity of a pair of sub-sequences is measured using either the Hamming, Euclidean or Jaccard index distance of each dimension. The method cannot calculate edit distances of randomly generated subsets because such distances would not represent the global edit distance between two sequences. It is worth mentioning that the edit distance can address both insertions and deletions in a sequence while the Hamming distance cannot. Second, sequence lengths are required to be exactly the same for the hashing, although it is very common for the sequences to vary in length. The minHash [29] and the shared k-mers [23] based methods ignore the position of a k-mer and compute NGS sequence similarity based on the k-mers only. For instance, the minHash based methods extract the k-mers to create a signature based on min-wise hashing from a subset of k-mers. Although they can handle variable length sequences, unlike the LSH based methods, they are not very effective at separating highly similar pairs of sequences from dissimilar ones. On the other hand, the shared k-mer based methods consider all the distinct k-mers instead of a subset of them. This improves the sequence similarity over that of the minHash based methods. However, they still fail to satisfactorily separate the highly similar sequences from the dissimilar ones when the k-mer length is small. Slidesort [25] used a pattern growth strategy for the k-mers to handle the problem of randomly matched k-mers and to avoid using only a subset of them.
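A minimal illustrative sketch of min-wise hashing over k-mer sets follows. It is not the exact scheme of [29]: the seeded hash functions are simulated here by salting a cryptographic hash, and all names are ours:

```python
import hashlib

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(seq, k=5, num_hashes=32):
    """For each simulated hash function, keep the minimum hash over all k-mers."""
    kms = kmers(seq, k)
    sig = []
    for seed in range(num_hashes):
        sig.append(min(int(hashlib.sha1(f"{seed}:{km}".encode()).hexdigest()[:8], 16)
                       for km in kms))
    return sig

def signature_similarity(sa, sb):
    """Fraction of matching signature slots; estimates the k-mer Jaccard index."""
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

Because the signature length is fixed, variable length sequences can be compared, but positional information (and hence the edit distance) is not recoverable from the signatures.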
The method reduces the number of candidate pairs for a given distance threshold. The pattern growth strategy ensures that no true positive pairs of sequences are missed. Although this method is very fast for short reads (≤ 100 bp), it becomes very slow for longer reads and larger edit distance bounds. It should be noted that these sequence similarity based methods use the existing edit distance calculation methods for verifying the similarity between the pairs of sequences. In other words, these methods primarily provide filtering to reduce the sequence similarity computation time rather than being standalone methods. As presented in Section 2.1, there is room for improving the pairwise sequence similarity computation when considering very large numbers of pairs of sequences in a dataset.

2.3 Reference Sequence based Genome Sequence Analysis Methods

Reference sequences have been frequently used to represent genomic sequences in datasets [30, 31, 32, 33] to improve computation time with nominal loss in accuracy. For instance, RBSA [31] used a set of reference sequences to compute alignments between segments of genome sequences. For the query search, they did the same with segments of query sequences and compared the differences to reduce the candidate segments of sequences in datasets. However, between the candidate pairs, the exact alignment scores are computed using existing computationally intensive methods. Referentially Compressed Search Index (RCSI) [32] presented a compression method where large genome sequences are compressed with respect to reference sequences. Initially the query search is performed on the reference sequences, followed by evaluating individual differences of the genome datasets with respect to the reference sequences. The contributions of this method, such as reduced space requirement and speedup, significantly inspired us to explore the reference sequence based method on NGS sequence datasets.
In addition, a reference sequence based indexing method is proposed in [30] to reduce the number of edit distance comparisons. They also proposed different strategies for the selection of reference sequences, as the properties of the reference sequences are critical for the outcome of any such indexing method. However, for the candidate pairs they used the existing computationally expensive edit distance calculation. Although the reference sequence based representation of database sequences has been used in many applications [30, 31, 32, 33], until recently this approach had not been used for edit distance prediction. We have proposed edit distance prediction methods [34, 35] based on reference sequences for NGS sequence similarity. This approach worked well for smaller edit distances; however, the prediction accuracy fell rapidly for larger edit distances.

2.4 Clustering Methods

Sequence similarity based clustering is a suitable application for grouping targeted gene fragments (amplicons) from metagenomic samples based on their phylogenetic similarity. It also evaluates the effectiveness of the NGS sequence similarity of the aforementioned methods with respect to required time, quality of the similarity measures, and the ability to provide a gradual similarity hierarchy efficiently. It should be noted that our goal is not to improve the clustering methods, but rather to improve the sequence similarity comparison, which may be beneficial for clustering quality and time.

2.4.1 Agglomerative Hierarchical Clustering

Agglomerative hierarchical clustering algorithms using single linkage and average linkage aggregation methods are frequently used for clustering NGS sequence datasets generated from phylogenetic marker genes. Hierarchical clustering presents the gradual dissimilarity hierarchy between the sequences and does not require a pre-defined number of clusters, which is fundamentally similar to the phylogenetic hierarchy.
However, they require comparing all pairs of sequences in a dataset, which is costly with respect to space and computational complexity [16, 36, 37, 38, 39, 40]. For instance, Mothur [37] is a widely used hierarchical clustering method. It incorporated multiple sequence alignment using MUSCLE [41], which reduced the computational cost of dissimilarity measurements and resulted in significant speedup in clustering [16]. However, multiple sequence alignment with MUSCLE has time complexity O(N³l) and space complexity O(N² + Nl + l²) [41].

2.4.2 Heuristics based Clustering

In order to avoid costly hierarchical clustering, several hashing based clustering methods have been proposed using the k-mer similarity of the sequences (Section 2.2). For instance, greedy heuristic clustering methods such as UCLUST and VSearch [24, 42, 23] merge sequences in an online-learning paradigm. Initially they start with a seed sequence assigned to a cluster; then each new sequence is either assigned to one of the currently existing clusters whose representative sequence satisfies a distance threshold, or it forms a new cluster. Similarly, LSH based NGS sequence clustering methods determine the candidate pairs of sequences based on the similarity of randomly generated fixed length sub-sequences of pairs of sequences. However, many false positive candidate pairs are generated from the LSH based hashing. Without calculating the edit distances between the candidate pairs of sequences, it is difficult to improve the clustering quality. Hence, the heuristic based methods significantly speed up the clustering at the cost of accuracy. Furthermore, they cannot provide a hierarchical interpretation of the resulting clusters. Other existing clustering methods have also been used for NGS sequence clustering.
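The greedy seed-and-assign loop described above can be sketched as follows. This is illustrative only: UCLUST and VSearch select candidates via optimized k-mer indexes rather than a direct distance call, and all names here are ours:

```python
def greedy_cluster(seqs, threshold, dist):
    """Greedy seed-based clustering: each sequence joins the first cluster whose
    representative (seed) is within the distance threshold; otherwise it seeds
    a new cluster. Result depends on the input order, unlike hierarchical methods."""
    reps, clusters = [], []
    for s in seqs:
        for i, r in enumerate(reps):
            if dist(s, r) <= threshold:
                clusters[i].append(s)
                break
        else:                      # no existing representative is close enough
            reps.append(s)
            clusters.append([s])
    return clusters

# Toy example with a Hamming distance on equal-length sequences:
hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
assert greedy_cluster(["aaaa", "aaat", "gggg"], 1, hamming) == [["aaaa", "aaat"], ["gggg"]]
```

The order dependence and the single-representative comparison are exactly what makes these methods fast but unable to produce a gradual similarity hierarchy.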
For instance, an unsupervised Bayesian clustering method [43] was implemented that uses Markov Chain Monte Carlo (MCMC) sampling to obtain an optimal clustering solution. The probabilistic Bayesian approach resolves the sensitivity and hard-cutoff issues associated with hierarchical clustering algorithms. However, it is not a practical method for large datasets.

Chapter 3

The Euclidean Distance based Edit Distance Prediction

In this chapter, we present a reference sequence based efficient edit distance prediction technique in the vector space. Each sequence in a dataset is converted into a feature vector by computing the edit distances between the sequence and each of the reference sequences. The Euclidean distance is computed between the corresponding feature vectors to predict their edit distance. The distance computation in the vector space significantly reduces the required time compared to that of the costly edit distance calculation between the pairs of sequences. In order to show that the relative distances between the sequences are likely to be preserved in the feature vectors, we developed probabilistic bounding models for the Euclidean distances in the vector space with respect to the edit distances of pairs of sequences. Further, we developed a mapping between the edit distances and the Euclidean distances in the vector space. This mapping is used for predicting the edit distances. Using this transformation, numerous statistical analysis techniques and faster nearest neighbor search algorithms that require numeric data types become computationally feasible. For instance, this technique is very effective for agglomerative hierarchical clustering techniques [38, 36] because they require edit distance calculations between each pair of sequences. Assuming large numbers of sequences in a dataset, the cost of computing edit distances between each pair of sequences becomes a performance bottleneck.
As a result, faster methods at the expense of lower accuracy have been explored in recent years. The emergence of very large datasets [44, 45] and domain specific problems such as OTU-clustering [39, 40] have made edit distance prediction techniques even more desirable.

3.1 Space Transformation

In this section, we present the reference sequence based space transformation in more detail. As defined in Section 1.2, the space transformation function converts a sequence in a dataset into an n-dimensional feature vector. For the sake of simplicity, we use the Hamming distance in the vector space and then extend the findings to the edit distance based technique. Given A, B ∈ S, suppose the corresponding feature vectors are x^A and x^B based on the reference sequences r_1 and r_2. Thus, x^A = (x^A_1, x^A_2) and x^B = (x^B_1, x^B_2), and the dissimilarity between A and B can be computed using the Euclidean distance in the vector space as follows:

∀A, B ∈ S : dist(Ω(A), Ω(B)) = ‖x^A − x^B‖ = ‖∆x‖    (3.1)

where ∆x = (∆x_1, . . . , ∆x_n). For each dimension i, ∆x_i ≤ d where d = dist(A, B), as the following theorem shows.

Theorem 1. For each dimension i and any pair of sequences A and B, ∆x_i ≤ d, where d = dist(A, B) is any distance in the metric space.

Proof. According to the triangle inequality property of distance in a metric space:

x^A_i ≤ d + x^B_i and x^B_i ≤ d + x^A_i → |x^B_i − x^A_i| ≤ d    (3.2)

As a result of Theorem 1, we know that for each dimension i, the ∆x_i of any given pair of sequences is bounded by their Hamming distance d_h. Then, using the multinomial distribution, the probability Pr(|∆x| = δ) can be written as

Pr(|∆x| = δ) = Σ_{k=0}^{(d_h − δ)/2} α × d_h! × 2^{−(d_h + 2k + δ)} / ((d_h − 2k − δ)! × (k + δ)! × k!)    (3.3)

where α = 1 for δ = 0, and α = 2 otherwise.
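Before turning to the expected distances, the transformation and the per-dimension bound of Theorem 1 can be checked with a short sketch. This is a toy illustration, not the dissertation's implementation; the sequence lengths, alphabet and counts are arbitrary choices:

```python
import random

def edit_distance(a: str, b: str) -> int:
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))      # substitution/match
        prev = cur
    return prev[-1]

def transform(seq: str, refs: list) -> list:
    # Feature vector: edit distance from seq to each reference sequence.
    return [edit_distance(seq, r) for r in refs]

random.seed(0)
alphabet = "ACGT"
refs = ["".join(random.choice(alphabet) for _ in range(30)) for _ in range(5)]
A = "".join(random.choice(alphabet) for _ in range(30))
B = "".join(random.choice(alphabet) for _ in range(30))

d = edit_distance(A, B)
xA, xB = transform(A, refs), transform(B, refs)
# Theorem 1: every per-dimension difference is bounded by ed(A, B).
assert all(abs(a - b) <= d for a, b in zip(xA, xB))
```

The assertion holds for any choice of sequences and references, since it is exactly the triangle inequality of Theorem 1.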
Using the probability for |∆x|, the expected value of the Euclidean distance for any pair of sequences A and B, after their transformation into feature vectors with independent random reference sequences, is given by

√( n × Σ_{δ=0}^{d_h} δ² Pr(|∆x| = δ) )    (3.4)

Using the bounds we defined in Theorem 1 for ∆x_i, i.e., 0 ≤ |∆x_i| ≤ d_h for the Hamming distance, combined with the probability distribution of the ∆x_i for each individual dimension, we get the probability distribution for the Euclidean distances in the vector space by applying the convolution over the individual distributions.

Figure 3.1 shows the probability distribution of the Euclidean distance in the vector space by varying the number of dimensions n = <1, 2, 10, 20, 50, 100> for the Hamming distances d_h = <5, 10, 15, 20>. When n = 1, the most probable distance in the vector space is 1, irrespective of the actual Hamming distance. Therefore, there is a high probability that sequences that are far with respect to the Hamming distances (e.g., d_h = 20) will appear to be as close as sequences with smaller Hamming distances (e.g., d_h = 5). This implies that the distinction between the Hamming distances and the Euclidean distances between pairs of sequences will be lost.

Figure 3.1: Theoretical probability distribution of L1 distances in vector space formed by the Hamming distances with varying number of dimensions, i.e., reference sequences.

However, as the number of reference sequences increases, the overlap between probability distributions for different Hamming distances decreases, which increases the distinction between largely dissimilar sequences and similar ones. Figure 3.2 shows the 99.9% intervals obtained theoretically from the convolution of distributions and the expected values of the Euclidean distances by Equation 3.4 in the vector space as functions of the Hamming distance between pairs of sequences in a dataset.
The results are compared against the mean, minimum, and maximum observed Euclidean distances from a simulation where 10⁵ random pairs of sequences of length 20 were generated and transformed using 100 random reference sequences. In this experiment, different transformation functions (i.e., by the Hamming distance or by the edit distance) are used, and the predicted Hamming and edit distances are compared with the corresponding actual distances between pairs of sequences.

Figure 3.2: Theoretical expectation of the Euclidean distances and their 99.9% intervals in the vector space with 100 reference sequences by the Hamming distance (upper panel) and the edit distance (lower panel), as a function of the actual Hamming and edit distances between a random pair of sequences, compared with experimental lower and upper bounds of observed distances from a random sampling simulation.

Since the theorems are based on the Hamming distance, the theoretical expected values of the Euclidean distances under the Hamming distance based transformation, as a function of the actual Hamming distance between sequences, match very closely with the experiment. When the space transformation is performed using the edit distances, the minimum observed Euclidean distances from the experiment fall below the 99.9% interval, but the experimental maximum still matches the theoretical upper bound (Figure 3.2, top panel). On the other hand, when the theoretical bounds for the Euclidean distances are compared against the actual edit distances between pairs of sequences, the maximum observed Euclidean distance under the Hamming distance based transformation exceeds the theoretical upper bound while its minimum matches the theory (Figure 3.2, bottom panel). This observation supports that the Euclidean distances between pairs of feature vectors with high dimensions will be large when the actual Hamming or edit distances are large.
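The random-sampling simulation described above can be mimicked in a few lines. The sketch below is illustrative only (smaller trial counts than the original experiment) and uses the Hamming distance based transformation; it checks that pairs at Hamming distance 20 map, on average, farther apart than pairs at Hamming distance 5:

```python
import random

def hamming(a, b):
    # Hamming distance between equal-length strings.
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def transform(seq, refs):
    # Feature vector: Hamming distance to each reference sequence.
    return [hamming(seq, r) for r in refs]

def euclidean(x, y):
    return sum((p - q) ** 2 for p, q in zip(x, y)) ** 0.5

random.seed(1)
alphabet = "ACGT"
L, n = 20, 100
refs = ["".join(random.choice(alphabet) for _ in range(L)) for _ in range(n)]

def random_pair(dh):
    # Build a pair of length-L sequences at exact Hamming distance dh.
    a = [random.choice(alphabet) for _ in range(L)]
    b = a[:]
    for pos in random.sample(range(L), dh):
        b[pos] = random.choice([c for c in alphabet if c != a[pos]])
    return "".join(a), "".join(b)

def mean_euclidean(dh, trials=100):
    total = 0.0
    for _ in range(trials):
        a, b = random_pair(dh)
        total += euclidean(transform(a, refs), transform(b, refs))
    return total / trials

# With many reference sequences, far pairs remain distinguishable from near pairs.
assert mean_euclidean(5) < mean_euclidean(20)
```

With n = 100 dimensions the two distributions separate clearly, echoing the behavior shown in Figures 3.1 and 3.2.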
Similarly, the Euclidean distances will be smaller when the actual Hamming or edit distances are smaller.

Proposition 1. Distance conservation: With high probability, a pair of dissimilar sequences will be mapped far from each other, and a pair of similar sequences will be mapped close to each other in an n-dimensional vector space.

According to this proposition, if the Hamming distance for a pair of sequences is large, with high probability they will be mapped far from each other in the n-dimensional vector space, provided that n is large. At the same time, according to Theorem 1, the space transformation guarantees that if a pair of sequences are close to each other with respect to the Hamming distance, their distance will be bounded by d_h. As a result, the space transformation will conserve the ordering among pairwise distances based on the edit distances and the Euclidean distances in the vector space. While the proposition is developed according to the Hamming distance, we generalize the idea to the edit distance, provided that the transformation to a high dimensional space is performed by the edit distance.

3.2 Selection of Reference Sequences

We create a set of random reference sequences with the property that they are independent from any sequence in any dataset. The advantage of such a property is that the reference sequences are transferable to other datasets as well. Yet, there are a few things to consider when choosing the reference sequences.

3.2.1 Length of Reference Sequences

Assuming the length of a reference sequence is smaller than the length of a sequence in a dataset, the transformation function only considers the edit distance for a portion of the sequence plus a number of insertions (according to the definition of the edit distance). Suppose x^A_i is the edit distance between A and r_i where |r_i| = |A|. Then, using a sub-sequence of r_i such as ρ_i, where length(ρ_i) ≤ length(r_i), results in x′^A_i, where x′^A_i ≤ x^A_i − (length(r_i) − length(ρ_i)). This causes more overlap between the mapped feature vectors in the vector space. If the edit distance between sequence A and sequence B is denoted by d′ where d′ ≤ d, the probability of having similar feature values increases; i.e., Pr(δ = 0 | d′) ≥ Pr(δ = 0 | d). Hence, the length of the reference sequences must be at least as large as the longest sequence in the dataset.

3.2.2 Distances among Reference Sequences

Feature vectors resulting from the space transformation are correlated if the chosen reference sequences are close to each other. We want to find the proper distance between the reference sequences that minimizes correlation among feature vectors. Correlations among the features in the vectors decrease monotonically with the increase of edit distances among pairs of reference sequences. Reference sequences of length l_ref have the minimum correlation with respect to features in the vector space when the distance among each of them is l_ref. However, the number of reference sequences satisfying such a distance is limited by the alphabet size. On the other hand, as shown in Figure 3.1, having only a few reference sequences is not sufficient for satisfactory prediction accuracy in the vector space. As a compromise, we impose a minimum distance threshold between reference sequences, by which a newly generated random reference sequence has to satisfy the distance threshold from the currently selected reference sequences. Initially the minimum distance threshold is set to (3/4)l_ref because higher edit distances between each pair of reference sequences ensure minimized correlation among features in the vector space. However, if a sufficient number of reference sequences cannot be generated satisfying the threshold, the threshold is decremented gradually. This process is repeated until the desired number of reference sequences is generated.
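The generate-and-relax procedure above can be sketched as follows. This is a minimal illustration with small, arbitrary parameters (n = 5 references of length 20), not the generator used for the experiments:

```python
import random

def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def generate_references(n, lref, alphabet="ACGT", seed=0, max_tries=200):
    rng = random.Random(seed)
    threshold = (3 * lref) // 4          # initial separation: 3/4 of l_ref
    refs = []
    while len(refs) < n:
        placed = False
        for _ in range(max_tries):
            cand = "".join(rng.choice(alphabet) for _ in range(lref))
            if all(edit_distance(cand, r) >= threshold for r in refs):
                refs.append(cand)
                placed = True
                break
        if not placed:
            threshold -= 1               # relax the requirement and retry
    return refs, threshold

refs, final_threshold = generate_references(n=5, lref=20)
assert len(refs) == 5
# Every selected pair respects at least the final (possibly relaxed) threshold.
assert all(edit_distance(a, b) >= final_threshold
           for i, a in enumerate(refs) for b in refs[i + 1:])
```

Because random sequences rarely reach (3/4)l_ref apart, the threshold is typically relaxed a few steps before all references are placed, mirroring the decrement step described in the text.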
3.2.3 Number of Reference Sequences

The number of reference sequences determines the dimensionality of the feature vectors. As our theoretical analysis shows, higher dimensions result in better discrimination between far and close pairs of sequences in a dataset. On the other hand, the computational cost of NGS sequence clustering in the vector space increases with the number of dimensions. Therefore, in order to find an optimal choice of n, we vary n among 10, 25, 50 and 100, and we evaluate the performance of NGS sequence clustering as a function of n.

3.2.4 Edit Distance to Euclidean Distance Mapping

We propose a mapping strategy between the edit distances and the Euclidean distances in order to use the predicted Euclidean distance based thresholds for hierarchical clustering in the vector space. For a given threshold, the distances among sequences of a cluster based on the edit distances and the Euclidean distances can be drawn as a distribution around the threshold. Thus, we choose a cumulative frequency in the distribution so that the corresponding Euclidean distance threshold based clusters have optimal similarity to those based on the edit distances. It is worth mentioning that the Euclidean distance increases with the number of reference sequences (Equation 3.4). Hence, for the same cumulative frequency, the actual Euclidean distance might be different with varying numbers of reference sequences.

3.3 Hierarchical Clustering

We create hierarchical clusters of NGS sequences to evaluate the effectiveness of our proposed edit distance prediction techniques. Key components of the agglomerative hierarchical clustering technique are the all-pairwise distance calculations and the iterative merging of the nearest neighbors until all the elements belong to one cluster. The technique provides a hierarchy of clusters, which is more informative [46] than a final set of clusters. Moreover, the number of clusters is dependent on a distance threshold.
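The dependence of the number of clusters on a distance threshold is easy to see for single linkage: cutting the single linkage hierarchy at threshold t is equivalent to taking the connected components of the graph linking all pairs within distance t. The toy sketch below (synthetic one-dimensional "distances", not NGS data) illustrates this with a small union-find:

```python
# Flat single-linkage clusters at a distance threshold, via connected
# components of the "within-threshold" graph.
def single_linkage_components(points, dist, threshold):
    parent = list(range(len(points)))

    def find(i):
        # Find the component root with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= threshold:
                parent[find(i)] = find(j)      # merge the two components
    return [find(i) for i in range(len(points))]

pts = [0.0, 0.4, 0.9, 5.0, 5.3]
labels = single_linkage_components(pts, lambda a, b: abs(a - b), 1.0)
# Points 0-2 chain together; points 3-4 form a second cluster.
assert labels[0] == labels[1] == labels[2]
assert labels[3] == labels[4] and labels[0] != labels[3]
```

Raising or lowering the threshold directly changes how many components, and hence clusters, remain.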
In this section, we briefly present the adopted technique, followed by a cost analysis to justify the speed-up of clustering in the vector space with the adopted technique. We also briefly present the quality assessment measures used for evaluating the clusters.

3.3.1 Cost Analysis

The preprocessing steps include the generation of reference sequences and the transformation of sequences into an n-dimensional feature space. Selection of reference sequences of length l_ref with an imposed minimum distance of l_ref/2 is performed only once, and the generated references can be reused for different datasets. The edit distance computation complexity between a reference sequence and a sequence in the dataset is O(l_ref²). Thus, the space transformation time complexity is O(n × N × l_ref²).

The computational complexity of agglomerative hierarchical clustering is O(N²), and computing an edit distance is O(l²), with l being the length of the two sequences. Therefore, the overall computational complexity is O((Nl)²). The proposed technique can benefit from existing algorithms such as minimum spanning tree (MST) or NN-chain, since in the vector space we benefit from numeric data types and standard distance calculations. Note that a few algorithms exist for neighborhood search with the edit distance; however, the purpose of our work is not to make clustering based on the edit distances faster. Yet another advantage of the proposed technique is that the distance calculation is independent of the length of the sequences.

3.3.2 Quality Assessment of Clustering Solutions

We utilize the silhouette coefficient to evaluate the quality of the clusters. The coefficient is calculated using the average intra-cluster distance (a) and the average nearest-cluster distance (b) for each sample.
S_i = (b_i − a_i) / max(a_i, b_i)    (3.5)

Here, a_i is the average distance between the ith sequence and the remaining entries within its cluster, and b_i is the minimum average distance between the ith sequence and any other cluster that it is not a part of. In addition, the Normalized Mutual Information (NMI) score is used to compare the similarity between the edit space and the vector space clusters. NMI is useful to measure the relative similarity between two independent sets of cluster labels when the ground truth is not known.

3.4 Experimental Results

In this section, we determine the distance among the reference sequences that minimizes correlation, the effective number of reference sequences for the transformation, and the edit distance to Euclidean distance mapping for the vector space. When these choices are fixed, the final parameters are used to create clusters from the experiment datasets. We compare the effectiveness of the space transformation by evaluating single linkage and average linkage hierarchical clusters based on the edit distances and the Euclidean distances in the vector space. We also compare the cluster quality of our method with the Hamming distance based LSH-Div. The approximate length of the sequences in the experiment datasets is 252; hence, the maximum edit distance is 252. As the clusters are created for 97% and 95% similarity thresholds, we chose the equivalent edit distances 8 and 13 respectively (rounded up to the nearest integer). All the experiments are performed on an Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz with 24GB physical memory. It is worth mentioning that in case the length of the reference sequences is larger than the length of the dataset sequences, the reference sequences are trimmed for consistency.

3.4.1 Datasets

A diverse set of 16S rRNA sequence fragment datasets is used to assess the performance of our proposed techniques. The sequence fragments are generated using high throughput NGS techniques. The datasets used here are obtained from NCBI databases.
For these datasets, the lengths of most of the sequences are roughly similar because specific regions of the 16S rRNA are targeted for NGS datasets. However, there are variations in the sequence lengths because of insertions, deletions and sequencing errors. Table 3.1 briefly describes the datasets, which are obtained from various metagenomic environments. The datasets are metagenomic samples taken from mouse, human, rat, soil, marine plankton (ERX2155923 [47]) and zoo compost (SRX1537393 [48]). It should be noted that we consider only the distinct sequences in each dataset. Due to the quadratic space complexity of the main memory based hierarchical clustering method, in this chapter we experiment with the datasets Dm, Dh, Dr and a subset of Ds (40000 sequences).

Table 3.1: 16S rRNA Next Generation Sequencing (NGS) datasets

Dataset          | Id | # Sequences | Mean Length | St. Dev.
Mouse            | Dm | 17993       | 252.6       | 1.233
Human            | Dh | 21575       | 252.9       | 1.203
Rat              | Dr | 30079       | 252.87      | 0.626
Soil             | Ds | 99742       | 253.1       | 1.123
Marine Plankton  | Dp | 496314      | 252.1       | 2.67
Zoo Compost      | Dz | 625178      | 250.3       | 3.6

3.4.2 Distance among Reference Sequences

In this experiment, we evaluate the effect of the distance between the reference sequences on the correlation among the feature vectors. Two random reference sequences are generated along with 1000 pairs of sequences; then the correlation between the edit distances of the pairs and their Euclidean distances is evaluated. This experiment is repeated 2000 times with different random pairs of reference sequences and random sequences. The resulting correlations for sequences of length 100, 200, and 300, as a function of the distance between each pair of reference points, are shown in Figure 3.3.

Figure 3.3: Correlation among features after transformation of random sequences as a function of the distance among reference sequences. [The dashed line shows l_ref/2.]

When the pair of reference sequences are close and the correlation is very high, the two corresponding features become redundant.
As the distance between reference sequences increases to l_ref/2, the correlation approaches zero. As a result, it is important to make sure that the selected reference sequences are far from each other.

3.4.3 Mapping between the Edit and the Euclidean Distances

We set the aggregation technique to average linkage, the distance threshold to 8 and the number of reference sequences to 100 for the experimental results shown in Figure 3.4. Figure 3.4(a) presents the cumulative frequency distribution of intra-cluster pairwise edit distances ≥ 8 for the mouse dataset. Since the average linkage technique considers the mean of all possible pairwise distances, a significant number of distances within a cluster may be higher than the threshold. We consider this property while selecting an equivalent threshold in the vector space. Figure 3.4(b) shows the Euclidean distance cumulative frequency curves for a set of edit distance thresholds for the clusters generated from the mouse dataset. The goal is to choose a Euclidean distance based on the cumulative frequency of edit distance 8 (97% similarity corresponds to 8 for the experiment datasets) such that optimum cluster label similarity in the edit and in the vector space can be achieved. Hence, we determine the Euclidean distances for the cumulative frequencies between 0.7 and 0.95 with interval 0.05 and create clusters in the vector space. We measure the similarity of the resulting clusters using the NMI score. We find that the Euclidean distance for cumulative frequency 0.75 produces as much as 97% similarity, while the similarity degrades slowly with higher thresholds. It can be seen from Figure 3.4(b) that a significant number of larger pairwise distances overlap with the optimum cumulative frequency area, likewise the distribution shown in the edit space clusters (Figure 3.4(a)).
Table 3.2: Mapping of Euclidean distance thresholds for given edit distance thresholds in hierarchical clustering, obtained by comparing similar clusters in the edit space and the vector space

Similarity threshold | Edit distance | Euclidean distance | Standard deviation
97%                  | 8             | 17.8               | 0.2
95%                  | 13            | 21.7               | 0.28

Figure 3.4(c) shows the relationship between the varying cumulative frequencies and the NMI scores of hierarchical clusters generated based on the edit distances and the Euclidean distances in the vector space.

Figure 3.4: Cumulative frequency distribution of the edit and the Euclidean distances for different threshold settings. (a) The edit distance distribution and (b) the Euclidean distance distribution. (c) Relative similarity of the resulting clusters between the edit space and the vector space for varying cumulative frequencies of the Euclidean distances.

Cumulative frequency 0.75 produces the maximum NMI for all the datasets. Hence, we take the average Euclidean distance for that cumulative frequency as the threshold in the vector space. Table 3.2 shows the mapping between the edit and the Euclidean distances for the experiment datasets. The predicted Euclidean distances are based on 100 reference sequences. These values can be modified for varying numbers of reference sequences using Equation 3.4. Our study for smaller sequence lengths shows an approximately similar mapping between the edit and the Euclidean distances, though the findings are not presented here.
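The cumulative-frequency selection amounts to reading a quantile off the observed intra-cluster Euclidean distances. A minimal sketch follows; the distance values are synthetic stand-ins (arranged so that the 0.75 quantile lands at 17.8, the threshold reported in Table 3.2), not measurements from the experiment datasets:

```python
def threshold_at(distances, cum_freq):
    # Euclidean threshold at a given cumulative frequency: the value below
    # which the requested fraction of observed distances falls.
    ordered = sorted(distances)
    idx = int(cum_freq * (len(ordered) - 1))
    return ordered[idx]

# Hypothetical intra-cluster Euclidean distances at edit distance threshold 8.
observed = [14.2, 15.1, 15.8, 16.4, 16.9, 17.3, 17.8, 18.5, 19.2, 21.0]
assert threshold_at(observed, 0.75) == 17.8
```

In the actual method, this quantile is computed per dataset and the average over datasets is used as the vector-space threshold.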
Figure 3.5: Comparing relative similarity (NMI) between clusters in the edit space and the vector space with growing number of reference sequences. (a) Single linkage. (b) Average linkage.

3.4.4 Number of Reference Sequences

Figure 3.5 shows the relationship between NMI and the number of reference sequences for the agglomerative hierarchical clustering methods. Increasing the number of reference sequences improves the NMI scores. The relative similarity between single linkage clusters (Figure 3.5(a)) in the edit and the vector space degrades rapidly with decreasing numbers of references compared to that of average linkage clusters (Figure 3.5(b)); the single linkage distance criterion and the variance in the Euclidean distance distributions are primarily responsible for such poor quality. For our experiments, we choose the number of references n = 100 for the vector space to ensure consistently good quality clusters for all of the linkage techniques.

3.4.5 Computation Time

Space Transformation: The pre-processing step is performed only once on the datasets.
The time complexity of the transformation is O(n × N × l²), where n is the number of reference sequences, N is the number of sequences in a dataset and l is the length of a sequence. The transformation time is included in the total clustering time.

Clustering: The clustering time depends on several factors such as the number of sequences in a dataset, the length of the sequences, the number of reference sequences and the clustering method. Table 3.3 compares the hierarchical clustering time based on the edit distances and the Euclidean distances in the vector space.

Table 3.3: Comparison of hierarchical clustering time (in hours) of the reference sequence based technique in the vector space with that of the actual edit distance based technique

        | Single Linkage                             | Average Linkage
Dataset | Edit Distance | Euclidean Distance         | Edit Distance | Euclidean Distance
        |               | n = 50    | n = 100        |               | n = 50    | n = 100
Dm      | 0.62          | 0.001     | 0.003          | 1.07          | 0.23      | 0.46
Dh      | 0.94          | 0.002     | 0.004          | 1.72          | 0.38      | 0.80
Dr      | 1.81          | 0.004     | 0.009          | 3.98          | 1.06      | 2.24
Ds      | 3.34          | 0.008     | 0.014          | 8.36          | 2.42      | 5.13

Clustering in the vector space is significantly faster than that based on the edit distances. The speed-up depends, to a great degree, on the linkage technique. For instance, single linkage vector space clustering is up to 500 times faster for varying numbers of reference sequences, while average linkage vector space clustering is approximately up to 5 times faster than their edit space equivalents in main memory based hierarchical clustering techniques. The reason is as follows: single linkage clusters are formed based on the nearest sequences of the neighboring clusters. Such clusters can be built very efficiently using a minimum spanning tree (MST) of time complexity O(‖E‖ log V).
In this case, the distance calculation time between pairs of sequences dominates the total time; hence, single linkage clustering in the vector space achieves such speed-up. On the other hand, average linkage clustering measures the mean distance between all pairs of two clusters to form a larger cluster. Such clustering time complexity is O(N²), which requires a lot of time. Yet, the time difference between calculating the edit distances and the Euclidean distances in the vector space is the same for all pairs in a sequence dataset for all the linkage types. Hence, the average linkage clustering speed-up is not as significant in the vector space compared to that of the single linkage clustering time.

3.4.6 Cluster Similarity

We quantify the relative similarity between the agglomerative hierarchical clusters based on the edit distances and the Euclidean distances using the Normalized Mutual Information (NMI) scores. An NMI score compares the agreement between two sets of cluster labels without considering either of them as the ground truth. Table 3.4 presents the NMI scores for the Euclidean distance based single and average linkage agglomerative hierarchical clusters with respect to those of the edit distance based clusters. We set the similarity thresholds to 97% and 95% and the number of references in the vector space to 100.

Table 3.4: Evaluating accuracy of single linkage and average linkage hierarchical clusters generated using the proposed predicted edit distances in the vector space. We compute the Normalized Mutual Information (NMI) for the relative similarity of the generated clusters to those based on actual edit distances.

        | Single Linkage                        | Average Linkage
Dataset | Similarity = 97% | Similarity = 95%   | Similarity = 97% | Similarity = 95%
Dm      | 0.96             | 0.76               | 0.97             | 0.97
Dh      | 0.89             | 0.61               | 0.95             | 0.95
Dr      | 0.93             | 0.95               | 0.95             | 0.96
Ds      | 0.96             | 0.82               | 0.97             | 0.96

It can be seen that the average linkage clusters in the vector space are 90% or more similar for most of the similarity thresholds.
However, the single linkage hierarchical clusters based on the Euclidean distances for the 95% similarity threshold do not show satisfactory similarity to those of the edit distance based clusters. The single linkage aggregation technique, along with the increasing variance in the Euclidean distance distribution, creates faster convergence of the clusters in the vector space and hence the dissimilarity. For a high similarity threshold such as 97%, as shown in Table 3.5, the predicted edit distance based clusters are computed significantly faster with only nominal differences in the relative similarity (NMI) scores.

Table 3.5: Comparison of clustering time and accuracy based on predicted edit distances in the vector space to those based on actual edit distances. All the single linkage and average linkage clusters are compared for the 97% similarity threshold.

        | Single Linkage                            | Average Linkage
Dataset | Edit Distance | Euclidean Distance        | Edit Distance | Euclidean Distance
        | Time          | Time     | NMI            | Time          | Time     | NMI
Dm      | 0.62          | 0.033    | 0.96           | 1.07          | 0.46     | 0.97
Dh      | 0.94          | 0.004    | 0.89           | 1.72          | 0.8      | 0.95
Dr      | 1.81          | 0.009    | 0.93           | 3.98          | 2.24     | 0.95
Ds      | 3.34          | 0.014    | 0.96           | 8.36          | 5.13     | 0.97

3.4.7 Cluster Quality

We evaluate the quality of the clusters created based on the edit distances and the Euclidean distances in the vector space with respect to the silhouette coefficient. The similarity thresholds are set to 97% and 95% for the average linkage hierarchical clusters. Table 3.6 presents the silhouette coefficients for the clusters of the experiment datasets based on the edit distances and the Euclidean distances. As shown in the table, the clusters generated by both distance measures produce silhouette coefficients between 0.2 and 0.6, which indicates that the clusters are fairly separable despite some noisy area. We measure the coefficients using the edit distances of the sequences. Since the edit distances are approximated in the vector space, the silhouette coefficients of the clusters generated in the vector space are slightly lower compared to those of the edit distance based clusters.

Table 3.6: Evaluating quality of average linkage hierarchical clusters using the silhouette coefficient. The clusters are based on the edit distances and the Euclidean distances in the vector space.

        | 97% Similarity                         | 95% Similarity
Dataset | Edit Distance | Euclidean Distance     | Edit Distance | Euclidean Distance
Dm      | 0.56          | 0.50                   | 0.61          | 0.55
Dh      | 0.47          | 0.42                   | 0.50          | 0.44
Dr      | 0.36          | 0.30                   | 0.43          | 0.38
Ds      | 0.41          | 0.35                   | 0.45          | 0.38

It is worth mentioning that the differences in the silhouette scores between the two distance measures in otherwise similar settings are less than 0.1, which underscores the effectiveness of the proposed method. However, the quality of the clusters in the vector space degrades gradually with lower similarity thresholds due to the increasing overlap in the Euclidean distance distributions for the larger edit distances.

3.4.8 Comparison with Locality Sensitive Hashing based Clusters

We compare the clusters based on the actual edit distances as well as the predicted distances with those of the LSH-Div [22] technique with respect to relative similarity, cluster quality and required time. Since the hash based technique requires equal length sequences, we filter the datasets as follows. First, sequences whose lengths deviate from the mean by more than one standard deviation are removed. Second, the remaining sequences are cut to the smallest sequence length. For this experiment, we set the similarity threshold to 97%, the number of reference sequences to 100 and the aggregation technique to average linkage. Table 3.7 shows the comparison on the NMI score relative to the edit distance based average linkage hierarchical cluster labels, the silhouette coefficient and the clustering time.

Table 3.7: Comparison of clusters based on the edit distances, the predicted distances in the vector space and locality sensitive hashing based similarity with respect to clustering time, relative similarity (NMI) to ground truth and the silhouette coefficient.
        | # Equal Length | Time (hours)                      | NMI                    | Silhouette Coefficient
Dataset | Sequences      | Edit  | Euclidean | LSH-Div       | Euclidean | LSH-Div    | Edit  | Euclidean | LSH-Div
Dm      | 17917          | 1.06  | 0.465     | 0.043         | 0.97      | 0.87       | 0.57  | 0.51      | 0.12
Dh      | 21571          | 1.71  | 0.8       | 0.08          | 0.95      | 0.87       | 0.47  | 0.42      | -0.084
Dr      | 30079          | 3.98  | 2.24      | 0.47          | 0.95      | 0.83       | 0.36  | 0.30      | 0.02
Ds      | 39994          | 8.35  | 5.12      | 0.86          | 0.97      | 0.91       | 0.38  | 0.34      | 0.10

It can be seen that despite the modification of the datasets, the relative similarities of the LSH-Div clusters with respect to the ground truth are lower compared to those of the vector space clusters. The main reason for the difference is that the Hamming distance based LSH-Div technique cannot efficiently identify insert and delete operations like the edit distance. Similarly, the silhouette coefficients suggest that the edit and the vector space clusters are better in terms of intra- and inter-cluster distances than those of the LSH based technique. Although LSH-Div is significantly faster than the edit distance and the Euclidean distance based clustering techniques, there is a lot of room to improve the hierarchical clustering time for this application. Since OTUs are generally determined based on high similarity thresholds, we do not need to generate the entire distance based hierarchy of the sequence datasets. Furthermore, reducing the number of reference sequences would lower the required time for the vector space clusters. It is worth mentioning that in this chapter we focus on the effectiveness of the space transformation technique.

3.5 Summary

In this chapter, we presented a novel reference sequence based space transformation technique for clustering sequences in NGS datasets. We transform each sequence into a feature vector by computing the edit distances between the sequence and a set of reference sequences.
The space transformation technique works well for the hierarchical clustering techniques, more specifically with the single linkage aggregation strategy. Pairs of sequences that are close with respect to the edit distances are also close in the vector space. Clustering in the vector space provides significant speed-up because the Euclidean distance computation is multiple orders of magnitude faster than computing the edit distances. This speed-up is achieved with a nominal sacrifice in cluster quality. We performed extensive analysis to determine the effective number of reference sequences and their lengths. We show through experimental results that the distance threshold for computing clusters in the vector space is about the same for the experiment datasets and depends primarily on the number and length of the reference sequences.

Chapter 4
MVD based Edit Distance Prediction

In the previous chapter, we presented a reference sequence based space transformation technique for the sequences in NGS datasets, followed by an edit distance prediction technique in the vector space using the Euclidean distance. Although the proposed technique showed better performance for the single linkage hierarchical clustering technique, the Euclidean distance based technique could not directly predict the pairwise edit distances. Instead, we developed a mapping between the edit distances and the Euclidean distances in the vector space. However, the mapping was dependent on the number of reference sequences as well as the characteristics of the sequences in the datasets and the reference sequences.

In this chapter, we develop a technique to predict edit distances between pairs of sequences in datasets by computing the differences of the edit distances of these sequences from the reference sequences. In the later text, we use the term difference of distances to refer to such distances between pairs of sequences in datasets.
Given sequences A, B and the corresponding feature vectors (xA_1, . . . , xA_n) and (xB_1, . . . , xB_n), the vector component differences would be the bag of (xA_i ∼ xB_i) where 1 ≤ i ≤ n. The differences of distances between A and B are bounded by ed(A, B), and the maximum of such distances is known as the Chebyshev distance. As we increase the number of reference sequences, the Chebyshev distance is likely to get closer to the actual edit distance. The frequency of any (xA_i ∼ xB_i) is also helpful for improving the edit distance prediction accuracy in the vector space. Using these strategies, we are able to predict the edit distances in the vector space more efficiently than with the Euclidean distance based technique.

4.1 Differences of Distances

Let us consider sequences A and B in a dataset that are transformed into xA and xB by a set of reference sequences R.

Definition 8. Vector Difference (VD): The vector representing the component wise differences of two feature vectors xA and xB is denoted by ∆x = < (∆x_j : (xA_j ∼ xB_j)), 1 ≤ j ≤ n >.

Definition 9. The Maximum Component Value in a VD for a pair of feature vectors is defined as MVD:

MVD = max( < (∆x_j : (xA_j ∼ xB_j)), 1 ≤ j ≤ n > ).

Definition 10. The number of occurrences of the MVD in a VD for a pair of vectors is defined as FVD.

Let us assume xA = <10, 8, 12, 6, 7, 11, 9, 6, 8, 12> and xB = <8, 9, 15, 5, 9, 13, 6, 6, 9, 13>. For xA and xB, VD = <2, 1, 3, 1, 2, 2, 3, 0, 1, 1>, MVD = 3 and FVD = 2. We use the MVD and the FVD to predict the edit distance between A and B. The following theorem is the basis for our proposed heuristics to predict the edit distances. As shown in Theorem 1, each component value in a VD is bounded by the edit distance between the pair of sequences. For example, in Figure 4.1, (xA_1 ∼ xB_1) ≤ d.
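Definitions 8-10 applied to the worked example above can be checked with a short sketch; the function names are ours, not from the dissertation:

```python
def vector_difference(xA, xB):
    """VD: component wise absolute differences of two feature vectors."""
    return [abs(p - q) for p, q in zip(xA, xB)]

def mvd_fvd(xA, xB):
    """MVD: maximum component value in the VD; FVD: its number of occurrences."""
    vd = vector_difference(xA, xB)
    return max(vd), vd.count(max(vd))

xA = [10, 8, 12, 6, 7, 11, 9, 6, 8, 12]
xB = [8, 9, 15, 5, 9, 13, 6, 6, 9, 13]
```

Running `mvd_fvd(xA, xB)` reproduces the MVD of 3 and FVD of 2 from the example.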
A corollary of Theorem 1 is as follows:

Figure 4.1: Distance between sequences A and B in the vector space based on hypothetical alignment of the reference sequences.

Corollary 1. Given sequences A and B, the MVD of xA and xB is bounded by the edit distance between A and B.

It should be noted that each component value in a VD greatly depends on the relative position of r_j to the pair of database sequences. Since each r_j is quite distant from the others, adding more reference sequences increases the likelihood of a higher MVD. If sequence A and sequence B are similar to a reference sequence r_j in most of the positions, the MVD is close to ed(A, B). For example, in Figure 4.1, the feature vector component difference related to r1, i.e., (xA_1 ∼ xB_1), fails to capture the actual edit distance ed(A, B) = d, shown by the dotted line, because the values of xA_1 and xB_1 are approximately the same. On the other hand, the component difference related to r2, i.e., (xA_2 ∼ xB_2), is closer to the actual edit distance only because r2, A and B are similar in most of the positions.

Proposition 2. Increasing the number of reference sequences enhances the probability of achieving a higher MVD.

Proof. Let us consider an arbitrary edit operation between the ith letter of A, A[i], and the jth letter of B, B[j], where 1 ≤ i ≤ |A| and 1 ≤ j ≤ |B|. Given an r ∈ R and 1 ≤ m ≤ |r|, Pr(A[i] = r[m]) = 1/|Σ| and Pr(B[j] ≠ r[m]) = (1 − 1/|Σ|). Assuming ed(A, B) = d, the probability of finding all the edit operations by ed(A, r) and none by ed(B, r) is [(1/|Σ|)(1 − 1/|Σ|)]^d. Since the edit distance between each pair of reference sequences is at least l_ref/2, it is likely that the reference sequences have different letters at any mth position. Given a set of n reference sequences, the probability that at least one reference sequence finds all the edit operations becomes [(n/|Σ|)(1 − 1/|Σ|)]^d. Therefore, increasing n results in a higher MVD ≤ d.
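Corollary 1 follows from the triangle inequality and can be checked empirically. The sketch below uses toy random sequences and a compact edit distance helper of our own; the bound must hold for every random draw:

```python
import random

def ed(a, b):
    """Compact Wagner-Fischer edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

random.seed(0)
rand_seq = lambda n: "".join(random.choice("acgt") for _ in range(n))
refs = [rand_seq(20) for _ in range(50)]   # toy random reference sequences
A, B = rand_seq(20), rand_seq(20)

# |ed(A, r) - ed(B, r)| <= ed(A, B) for every r, hence MVD <= ed(A, B)
mvd = max(abs(ed(A, r) - ed(B, r)) for r in refs)
```

Whatever the sequences, `mvd` never exceeds `ed(A, B)`.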
According to Proposition 2, the MVD is likely to be higher with an increasing number of reference sequences. Based on Corollary 1, the MVD is bounded by the edit distance of the sequences in a dataset. Thus, with an increasing number of reference sequences, the MVD will either reach the edit distance or get close to it. Therefore, with a sufficiently large number of reference sequences, the MVD can provide a good estimation of the edit distance in the vector space.

4.1.1 Effect of Increasing Number of Reference Sequences on MVD

Figure 4.2 shows the distribution of MVDs with an increasing number of reference sequences. The pairs of sequences that are close to each other with respect to the edit distances are also close in the vector space. For example, Figure 4.2(a) shows the edit distance prediction using n = 5 reference sequences. 100% of the pairs with edit distance 1 are predicted as MVD ≤ 1. However, only a tiny fraction of the pairs of sequences predicted as MVD = 1 are actually of edit distance one. Most of the remaining pairs that are predicted as MVD = 1 correspond to larger edit distances. The wrong prediction of larger edit distances is primarily due to the small number of reference sequences. As shown in Proposition 2, only a few reference sequences may not have the desired combination of letters to capture the edit operations between pairs of sequences in a dataset.

Figure 4.2: Distribution of the edit distances in the vector space grouped by MVD with varying number of dimensions. Panels (a) through (f) use n = 5, 10, 50, 300, 500 and 1000 reference sequences respectively.

Thus, the pairs of sequences with larger edit distances may appear closer to each other in the vector space. Therefore, for a small number of reference sequences, a given MVD for pairs of sequences corresponds to a wider range of edit distances.
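Because the MVD is a maximum over the reference set, enlarging the set can only leave it unchanged or push it upward toward ed(A, B), which is the effect Figure 4.2 illustrates. A small simulation with toy random sequences (helper names are ours):

```python
import random

def ed(a, b):
    """Compact Wagner-Fischer edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def mvd(A, B, refs):
    """MVD of the pair (A, B) with respect to a reference set."""
    return max(abs(ed(A, r) - ed(B, r)) for r in refs)

random.seed(1)
rand_seq = lambda n: "".join(random.choice("acgt") for _ in range(n))
refs = [rand_seq(30) for _ in range(300)]
A, B = rand_seq(30), rand_seq(30)

# MVD with 5 references vs. all 300: monotone in n, still bounded by ed(A, B)
small, large = mvd(A, B, refs[:5]), mvd(A, B, refs)
```

The inequality `small <= large <= ed(A, B)` holds for any draw, since the larger reference set is a superset of the smaller one.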
Hence, the MVD is not very predictive when a small number of reference sequences is used. As the number of reference sequences increases, the predicted edit distances become more distinguishable. For example, in Figure 4.2(d), for n = 300 reference sequences, if MVD = 1, 99.9% of the pairs are actually of edit distance 1, and the remaining 0.1% of the pairs are of edit distance 2. As more sequences are added to the reference sequence set, the MVD based edit distance prediction gets more accurate for larger edit distances. For example, in Figure 4.2(f), given n = 1000, when MVD = 3, 95.7% of the predictions are correct. By Corollary 1, for a given MVD = d′, the edit distance has to be greater than or equal to d′. All the edit distances d where d < d′ should be covered by MVD < d′. Thus, the MVD also provides a partial ordering of the edit distances in the vector space.

4.2 Edit Distance Prediction in the Vector Space

In this section, we present the heuristics for the proposed MVD and FVD based edit distance prediction technique.

4.2.1 Heuristics for the Edit Distance Prediction

In the preceding section we showed that increasing the number of reference sequences improves the MVD based edit distance prediction accuracy, especially for the smaller edit distances. Proposition 2 showed that for a sufficiently large number of reference sequences, the smaller edit distances are likely to be predicted correctly. For instance, in Figure 4.2, for n ≥ 300, the edit distance predictions based on MVD ≤ 2 are accurate in more than 99% of the cases, which leads to our first heuristic:

H1: For n ≥ 300, if MVD ≤ 2, the edit distance in the vector space is predicted to be the same as the MVD.

We performed experiments with different datasets and found that in all these datasets the predictions were correct in more than 99% of the cases. In less than 1% of the cases they were not correct, and the predicted distances differed by 1.
As we look into the higher MVDs, more edit distances greater than or equal to the MVD map into it. Since the reference sequences are far from the sequences in the datasets, each difference of distances in a VD increases slowly as the corresponding edit distance becomes larger. For the larger edit distances, increasing the number of dimensions may not produce an MVD close to the actual edit distances of pairs of sequences in the datasets. The predicted MVD represents a range of edit distances despite a large number of dimensions (Figure 4.2(d)). Therefore, the MVD alone is not sufficient to satisfactorily predict the relatively larger edit distances.

However, the frequency of an MVD is likely to be higher when the actual edit distance is larger. For example, in Figure 4.3, for a given MVD, the cumulative frequencies of the edit distances are shown with the growth of the FVD. It also shows the FVD based cutoffs for the predicted edit distances. We select the FVD cutoffs based on the maximum true positive percentage for a given <MVD, FVD>. In Figure 4.3(a), for MVD = 3 and FVD ≤ 13, the percentage of correct predictions (sensitivity) and the true positive rate are 97% and 84% respectively. Similarly, we can find FVD cutoffs for predicting higher edit distances too.

Figure 4.3: For a given MVD, the range of FVDs highlights the distribution of the edit distances. We consider the change in distribution for our prediction. For example, predicted edit distances are marked based on intervals of FVD values. Panels: (a) frequency of MVD = 3, (b) frequency of MVD = 4.

Based on this observation, we propose our second heuristic.

H2: For 3 ≤ MVD ≤ 5, the edit distances in the vector space are predicted based on the MVD and the corresponding FVD.

Figure 4.3(b) presents the predicted edit distance cutoffs for MVD = 4. For MVD = 4 and FVD ≤ 3, the sensitivity and true positive rates of the predicted edit distance 4 are 87% and 49% respectively.
The accuracy is reduced because with a higher MVD, a wider range of edit distances produces exactly the same MVD and FVD. However, it can be improved by using the second maximum difference in a VD and the corresponding frequency. For example, Figure 4.4 presents the frequencies of the second maximum difference in a VD for <MVD, FVD> = <4, 2> and <MVD, FVD> = <4, 3>. It is evident from Figure 4.3(b) and Figure 4.4 that for MVD = 4, the second maximum difference in a VD and the corresponding frequency helped improve the accuracy of the predicted edit distances.

H2.1: For 4 ≤ MVD ≤ 5, the second maximum difference and the corresponding frequency in a VD further improve the prediction accuracy.

Figure 4.4: Predicting edit distances using the top two vector component differences (VD) and their corresponding frequencies (FVD). The differences of the pairwise edit distance distributions for the given <VD, FVD> show the effectiveness of such strategies. Panels: (a) MVD = 4, FVD = 2, (b) MVD = 4, FVD = 3.

Although the predicted edit distances based on H2 and H2.1 are more accurate, their error rates are within certain lower bounds. In many cases, the erroneously predicted distances differ by 1-2 points from the actual edit distances.

However, prediction accuracy declines sharply for even larger MVDs. To solve this problem, we determine candidate pairs based on the maximum two differences and their corresponding frequencies in a VD. Then, we compute the edit distances of these candidate pairs using existing edit distance calculation techniques. Since we are only interested in NGS sequence similarity within a small distance threshold, the number of candidate pairs is a small fraction of all the pairs with larger edit distances.
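The cascade of heuristics described above can be sketched as follows. The FVD cutoff table here is hypothetical: in the dissertation the cutoffs are chosen per dataset from the observed <MVD, FVD> distributions (Figure 4.3), and the H2.1 refinement using the second maximum difference is omitted for brevity.

```python
FVD_CUTOFFS = {3: 13, 4: 3, 5: 2}  # hypothetical, dataset-dependent in practice

def predict_edit_distance(vd):
    """Return a predicted edit distance for a vector difference VD, or None
    when the pair must be handed to an exact edit distance computation."""
    mvd = max(vd)
    fvd = vd.count(mvd)
    if mvd <= 2:
        return mvd                                    # H1: predict MVD directly
    if mvd <= 5:                                      # H2: refine with the FVD
        return mvd if fvd <= FVD_CUTOFFS[mvd] else mvd + 1
    return None                                       # H3 regime: candidate pair
```

Pairs that fall through to `None` are the candidate pairs whose edit distances are computed exactly.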
Utilizing the frequencies of the second maximum differences in a VD further reduces the number of candidate pairs.

H3: For MVD > 5, the candidate pairs are selected based on the two maximum differences in a VD and their corresponding frequencies. Then, we calculate the edit distances of those pairs and keep the pairs which are within a given distance threshold.

Table 4.1 shows the effect of using the top two differences in a VD and their corresponding frequencies to reduce the number of candidate pairs where MVD > 5 for the datasets presented in Table 3.1. The number of candidate pairs of sequences is reduced by more than 50% when the top two differences in a VD and their corresponding frequencies are considered instead of only the MVD and FVD. Because of the quadratic time complexity of the edit distance calculation with respect to sequence lengths, it is important to reduce the number of false positive candidate pairs.

Table 4.1: Number of candidate pairs selected for the edit distance calculation. These pairs are selected based on the top two differences of a VD and their corresponding frequencies.

Dataset   Maximum difference in a VD    Top two differences in a VD and
          and the frequencies           the corresponding frequencies
Dm        17920424                      7574093
Dh        13149464                      4350262
Dr        14826484                      5740473
Ds        61901271                      28026329

4.2.2 Optimal Number of Reference Sequences

Increasing the number of reference sequences improves the edit distance prediction accuracy in the vector space. At the same time, the computational cost between pairs of feature vectors increases with the number of dimensions, i.e., the number of reference sequences. We determine the optimal number of reference sequences by minimizing the number of costly edit distance calculations.
In the proposed MVD based prediction technique, the edit distances are calculated on two occasions. First, we create the feature vectors by computing the edit distances between the sequences in the datasets and the reference sequences. Given N sequences in a dataset, the number of edit distance calculations grows linearly with an increasing number of reference sequences. Second, the heuristic H3 selects candidate pairs for MVD > 5 to compute the edit distances between them. With an increasing number of reference sequences, the number of candidate pairs based on the heuristic decreases gradually until leveling off for a large number of reference sequences such as n = 1000.

Figure 4.5: Total number of pairwise edit distance computations, including the number of pairs needed for space transformation and the number of candidate pairs determined by heuristic H3.

Figure 4.5 shows the relationship between the required number of edit distance calculations using the proposed technique and increasing n for the datasets presented in Table 3.1. It can be seen from the figure that the number of pairs for each dataset reaches its lowest range when the number of reference sequences varies between 200 and 300. Therefore, we choose n = 300 as the optimal number of reference sequences for the experiment datasets. It should be noted that the optimal n may vary depending on the properties of the datasets.

4.3 Hierarchical Clustering

In this section, we briefly present the adopted hierarchical clustering technique for larger datasets, followed by a cost analysis to justify the speed-up of clustering in the vector space with the adopted technique.

4.3.1 Memory Efficient Clustering Technique

In the previous chapter, we adopted a main memory hierarchical clustering technique of O(N^2) space complexity. However, such a space requirement prohibits evaluating our proposed technique on larger datasets.
In this chapter, we adopt a graph based external memory hierarchical clustering technique, SparseHC [49], that can handle a large number of sequences. Key steps of the SparseHC technique are:

• It assumes that the pairwise distances are pre-calculated and sorted in ascending order of distance.

• The sequences are considered as singleton clusters, i.e., vertices, and the pairs along with their distances are considered as edges.

• The edit distances of pairs of sequences are stored in the form <i, j, dist(i, j)> where i, j ∈ N and i < j. The triples are fetched sequentially. If there is an edge between the ith and jth sequences or their ancestor clusters, the edge is updated with the new pairwise distance. Otherwise a new edge is formed between the ith and jth sequences.

• Once the edit distances between all pairs of sequences of two clusters are accounted for, the edge is considered complete and the two clusters are merged into one. The incomplete edges of both clusters are updated accordingly.

• The algorithm stops when only one cluster remains. We modified the algorithm so that the clustering stops if no two clusters can be merged while satisfying the given threshold.

4.3.2 Cost Analysis

The time complexity of edit distance calculation between all pairs of sequences is O(N^2 l^2), where N is the number of sequences and l is the length of a sequence. On the other hand, the preprocessing steps of the proposed technique include the generation of reference sequences and the transformation of sequences in a dataset into n-dimensional feature vectors, where n is the number of reference sequences. Since the randomly generated reference sequences are dataset independent, their generation time is not considered in the cost analysis. The time complexity of the space transformation is O(nN l^2). The time complexity of pairwise distance calculation in the vector space is O(nN^2). Equation 4.1 shows the ratio of the pairwise distance calculation time complexity between the edit and the vector space.
O(N^2 l^2) / (O(nN l^2) + O(nN^2))    (4.1)

For the clustering, we need to consider only the fraction of all possible pairs of sequences that are within the distance threshold. Suppose the number of such pairs is cN^2 where c << 1. These pairs need to be sorted in order to create clusters using the SparseHC technique. The disk based merge sort time complexity is O(cN^2 log(cN^2)). In addition, the SparseHC clustering time complexity is O(N + cN^2). The sorting and the clustering times differ slightly between the edit distance and the MVD based methods due to the difference in the number of pairs for the predicted edit distances. The ratio of the time complexity in the edit and the vector space is given below. It is understandable that with an increasing number of sequences in the datasets, the all pairs edit distance calculation time increases more rapidly than the other components of the used techniques. Hence, the clustering performance in the vector space improves with an increasing number of sequences in the datasets.

[O(N^2 l^2) + O(cN^2 log(cN^2)) + O(N + cN^2)] / [O(nN l^2) + O(nN^2) + O(cN^2 log(cN^2)) + O(N + cN^2)]
≈ O(N^2 l^2) / (O(nN l^2) + O(nN^2))
≈ c1 N^2 l^2 / (c1 nN l^2 + c2 nN^2)
≈ c1 N l^2 / (c1 n l^2 + c2 nN)

4.3.3 Predicted Edit Distance Thresholds in the Vector Space

Our proposed heuristics predict the edit distances in the vector space. Given n ≥ 300 and l ≤ 252, heuristics H1 and H3 predict the edit distances nearly accurately. Although the prediction accuracy of heuristic H2 is comparatively lower than that of the other heuristics, the majority of the erroneous predictions differ by 1 to 2. Hence, we are able to predict similar average linkage cluster thresholds in the vector space. Table 4.2 shows the predicted thresholds in the vector space. The predicted thresholds are slightly smaller for the 97% and 98% similarity thresholds due to the erroneous predictions of heuristic H2. k-mer, locality sensitive hashing and other heuristics based clustering methods also have similarity thresholds.
However, these thresholds are not similar to the edit distance based thresholds.

Table 4.2: Comparison of similarity thresholds for hierarchical clustering based on the edit distances and those of the MVD

Similarity Threshold   Edit Distance   MVD
99%                    2.5             2.5
98%                    5               4.7
97%                    7.5             7.3

4.4 Experimental Results

We create clusters based on the average linkage hierarchical clustering technique in the vector space and compare the clusters with those created based on the edit distances. We also compare the performance of the clusters based on the proposed edit distance prediction technique with VSEARCH [23] and the locality sensitive hashing based LSH-Div [22] clusters. The similarity thresholds are set to 97%, 98% and 99% for the respective clustering methods.

Table 4.3: Hierarchical clustering time (in hours) breakdown based on the edit distance and the MVD based vector space for 97% similarity

          Pairwise Distance          Sorting & Clustering
          Calculation time (hr)      time (hr)
Dataset   Edit Distance   MVD        Edit Distance   MVD
Dm        0.614           0.083      0.02            0.021
Dh        0.931           0.068      0.018           0.018
Dr        1.771           0.104      0.025           0.026
Ds        17.91           0.633      0.104           0.106
Dp        411.1           11.13      0.32            0.33

4.4.1 Clustering Time

In this section, we evaluate the clustering time in the vector space. First, we justify the speed-up of the MVD based hierarchical clustering in the vector space over the edit distance based clustering by presenting the component wise required times, shown in Table 4.3. It can be seen that for larger datasets such as N = 5 × 10^5, the edit distance based hierarchical clustering requires more than 15 days while the proposed technique requires only about 11 hours with nominal loss in accuracy. The pairwise distance calculation based on the MVD in the vector space is more than one order of magnitude faster than that based on the edit distance. We also evaluate the clustering time for different similarity thresholds.
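The per-dataset speed-up of the MVD based pairwise distance step can be read directly off the times in Table 4.3; a quick check that the gain grows with dataset size, as the cost analysis in Section 4.3.2 predicts:

```python
# Pairwise distance calculation times from Table 4.3, in hours
edit_hours = {"Dm": 0.614, "Dh": 0.931, "Dr": 1.771, "Ds": 17.91, "Dp": 411.1}
mvd_hours = {"Dm": 0.083, "Dh": 0.068, "Dr": 0.104, "Ds": 0.633, "Dp": 11.13}

# speed-up of the MVD based step over exact edit distances, per dataset
speedup = {d: edit_hours[d] / mvd_hours[d] for d in edit_hours}
```

For the largest dataset Dp the ratio is roughly 37x, versus roughly 7x for the smallest dataset Dm.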
Among the compared techniques, only VSEARCH is faster than the proposed space transformation based hierarchical clustering, although its cluster accuracy is significantly lower than that of the proposed technique. The clustering times for both the edit distance based hierarchical method and LSH-Div increase more rapidly than that of the proposed technique with an increasing number of sequences in the datasets.

Table 4.4: For the 99% similarity threshold, comparison of clustering time and accuracy using the MVD based method to those of the actual edit distances as well as the k-mer matching based VSEARCH and LSH-Div clustering

          Baseline    MVD                    VSEARCH [23]           LSH-Div [22]
Dataset   Time(hr)    Time(hr)/NMI/AMI       Time(hr)/NMI/AMI       Time(hr)/NMI/AMI
Dm        0.61        0.054/0.998/0.993      0.007/0.903/0.75       0.0253/0.89/0.68
Dh        0.94        0.067/0.997/0.9923     0.01/0.89/0.77         0.0493/0.9/0.7
Dr        1.72        0.095/0.9976/0.9916    0.03/0.9/0.71          0.218/0.85/0.67
Ds        17.95       0.052/0.9971/0.9905    0.147/0.91/0.78        1.416/0.92/0.78
Dp        411.28      7.1/0.994/0.9905       0.352/0.901/0.802      9.6/0.903/0.713

4.4.2 Cluster Similarity

We quantify the relative similarity of the clusters generated by the proposed technique, VSEARCH and LSH-Div with those of the edit distance based hierarchical clustering by using the NMI

Table 4.5: For the 98% similarity threshold, comparison of clustering time and accuracy using the MVD based method to those of the actual edit distances as well as the k-mer matching based VSEARCH and LSH-Div clustering

          Baseline    MVD                    VSEARCH [23]           LSH-Div [22]
Dataset   Time(hr)    Time(hr)/NMI/AMI       Time(hr)/NMI/AMI       Time(hr)/NMI/AMI
Dm        0.63        0.072/0.982/0.9725     0.0043/0.91/0.767      0.0258/0.88/0.702
Dh        0.943       0.094/0.98/0.9745      0.0041/0.91/0.78       0.0508/0.884/0.71
Dr        1.79        0.122/0.978/0.9702     0.0125/0.91/0.716      0.219/0.84/0.68
Ds        17.98       0.614/0.979/0.9691     0.058/0.92/0.79        1.43/0.91/0.784
Dp        411.4       7.88/0.979/0.967       0.19/0.895/0.798       9.52/0.891/0.708

Table 4.6: For 97% similarity threshold, comparison of
clustering time and accuracy using the MVD based method to those of the actual edit distances as well as the k-mer matching based VSEARCH and LSH-Div clustering

          Baseline    MVD                    VSEARCH [23]           LSH-Div [22]
Dataset   Time(hr)    Time(hr)/NMI/AMI       Time(hr)/NMI/AMI       Time(hr)/NMI/AMI
Dm        0.634       0.104/0.9956/0.989     0.003/0.915/0.79       0.0261/0.87/0.732
Dh        0.95        0.086/0.9938/0.9918    0.0054/0.905/0.803     0.0528/0.87/0.76
Dr        1.796       0.13/0.994/0.9882      0.008/0.89/0.73        0.22/0.83/0.686
Ds        18.01       0.74/0.992/0.987       0.043/0.93/0.82        1.44/0.91/0.8
Dp        411.5       11.36/0.992/0.9856     0.162/0.905/0.798      9.83/0.89/0.72

and the AMI scores. As shown in Table 4.4, for 99% similarity (edit distance threshold = 2.5), the clusters based on the edit distances and the MVD are nearly identical, because most of the predicted edit distances in the vector space are based on heuristic H1, and the heuristic H1 based predictions are accurate in more than 99% of the cases. For 98% similarity (edit distance threshold = 5), the majority of the predicted edit distances are based on heuristics H1, H2 and H2.1. Because the threshold corresponds to an edit distance of 5, many of the predictions at 98% similarity use heuristics H2 and H2.1, which are less accurate. Hence, the NMI and AMI scores are relatively low for the clusters at 98% similarity (Table 4.5). For 97% similarity (edit distance threshold = 7.5), the clusters also include the heuristic H3 based edit distances, which are correct. Because of the high accuracy of heuristics H1 and H3, the impact of the less accurate heuristic H2 and H2.1 based predictions is minimized (Table 4.6). Hence, the relative similarity scores improve for the clusters at 97% similarity over those at 98%. However, the relative similarity of both the VSEARCH and the LSH-Div techniques is significantly lower.
Although the NMI scores are around 0.9 for both techniques, the AMI scores are quite low, which further shows that the k-mer and locality sensitive hashing based techniques fail to effectively capture the edit operations between pairs of sequences.

Table 4.7: Comparing average linkage hierarchical clusters using the silhouette coefficient. The clusters are generated based on the edit distances, the MVD based predicted edit distances as well as the k-mer matching techniques VSEARCH and LSH-Div.

Similarity   Dataset   Edit Distance   MVD     VSEARCH   LSH-Div
99%          Dm        0.314           0.31    0.18      0.15
             Dh        0.415           0.41    0.29      0.02
             Dr        0.32            0.312   0.196     0.05
98%          Dm        0.426           0.407   0.26      0.14
             Dh        0.443           0.42    0.30      0.01
             Dr        0.378           0.36    0.247     0.04
97%          Dm        0.57            0.56    0.4       0.12
             Dh        0.47            0.46    0.31      -0.084
             Dr        0.36            0.344   0.24      0.02

4.4.3 Cluster Quality

We evaluate the quality of the resulting clusters using the silhouette coefficient, which is measured using the pairwise edit distances of the sequences. Table 4.7 presents the silhouette coefficients for the experiment datasets Dm, Dh and Dr. The silhouette coefficients of the larger datasets are not computed due to memory constraints. As shown in the table, the quality of the hierarchical clusters generated in both the edit and the vector space is significantly better than that of the other methods. Since the predicted edit distances in the vector space are not entirely accurate, the silhouette coefficients of the vector space clusters are slightly lower than those in the edit space. Between the other two methods, the clusters generated by LSH-Div are more noisy and less separable. Their silhouette coefficients vary between 0 and 0.2, which indicates that the clusters are not well separated and that the distances between randomly generated subsets of the sequences cannot effectively capture the corresponding edit distances.
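The silhouette coefficient used above can be computed directly from a precomputed pairwise distance matrix; a minimal sketch with a toy matrix (helper names are ours):

```python
def silhouette(dist, labels):
    """Mean silhouette over all points, from a precomputed distance matrix."""
    n = len(labels)
    scores = []
    for i in range(n):
        same = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not same:
            continue  # silhouette is undefined for singleton clusters
        a = sum(same) / len(same)  # mean intra-cluster distance
        b = min(                   # mean distance to the nearest other cluster
            sum(dist[i][j] for j in range(n) if labels[j] == lab)
            / sum(1 for j in range(n) if labels[j] == lab)
            for lab in set(labels) if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# two tight, well separated toy clusters give a coefficient close to 1
toy_dist = [[0, 1, 9, 9],
            [1, 0, 9, 9],
            [9, 9, 0, 1],
            [9, 9, 1, 0]]
toy_labels = [0, 0, 1, 1]
```

Values near 1 indicate compact, well separated clusters; values near 0 (as observed for LSH-Div) indicate overlapping ones.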
4.5 Summary

In this chapter we proposed a novel edit distance prediction technique and used the predicted edit distances to create clusters from sequences in NGS datasets. We used edit distance based average linkage hierarchical clustering and showed that our method was an order of magnitude faster than the actual edit distance based clustering while maintaining more than 99% accuracy at the given thresholds for most of the datasets. We compared performance against a k-mer based and an LSH based method, namely VSEARCH and LSH-Div. We showed that the accuracy of our method was significantly better than that of VSEARCH and LSH-Div. The better quality of our clusters was due to the fact that the predicted edit distances within the given similarity thresholds were mostly accurate, and in the remaining cases they were within 1-2 points of the actual distances.

Chapter 5

Edit Distance Prediction from Distances between the Sub-sequences

In the previous chapters, we proposed edit distance prediction methods using randomly generated dataset independent reference sequences [35, 34]. For instance, the MVD based edit distance prediction [34] used a set of n randomly generated dataset independent reference sequences, R, to convert dataset sequences into feature vectors where each dimension of the vector represents the edit distance between the dataset sequence and a reference sequence. Given dataset sequences A and B, a reference sequence r, and an edit distance function ed(A, B), we have |ed(A, r) − ed(B, r)| ≤ ed(A, B) because of the triangle inequality property of metric distance measures. Further, the MVD, i.e., max over r ∈ R of |ed(A, r) − ed(B, r)|, satisfies MVD ≤ ed(A, B). The MVD based method predicts edit distances nearly accurately in several scenarios. Since the edit operations between the NGS sequences can appear in random positions, as the number of reference sequences increases, the MVD gets closer to the actual edit distances at the cost of higher computation time.
Despite the decline in prediction accuracy for larger edit distances, this method shows superior accuracy for applications such as hierarchical clustering over that of the existing k-mer based sequence similarity methods.

In this chapter, instead of the randomly generated reference sequences, we use a set of simple reference sequences to develop an improved edit distance prediction method. Each reference sequence of this set consists of a single repeated letter of the alphabet. For example, given the alphabet Σ = {a, c, g, t}, the reference sequences are {aaaa..., cccc..., gggg..., tttt...}. Even though the number of reference sequences is small and therefore the computation time is low, the prediction accuracy based on these reference sequences degrades rapidly for higher edit distances. Let us assume a pair of dataset sequences, A = gcatacg and B = gcctaag, with two substitution operations at the 3rd and the 6th positions. For the reference sequence ccccccc, ed(A, ccccccc) = 5 and ed(B, ccccccc) = 5. Hence the difference of distances between A and B with respect to ccccccc is 0, which does not correctly predict the edit distance. Similarly, the differences of distances between A and B with respect to aaaaaaa, ggggggg and ttttttt are 0. In order to minimize such erroneous distance predictions, we compute the edit distances for sub-sequences separately. We develop merging strategies to predict the edit distance between a pair of sequences from the partial edit distances between their sub-sequences. This method significantly improves the edit distance prediction accuracy while running in linear time with respect to sequence length.

5.1 Basic Concepts

In this section, we briefly present the notations relevant to the proposed reference sequence based edit distance prediction method. We also outline the key ideas of the fixed length sub-sequence based edit distance prediction method.

5.1.1 Notations

Given an alphabet Σ, S refers to a dataset of N sequences, {s1, s2, . . . , sN}.
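The cancellation example above can be reproduced directly; the compact edit distance helper is our own:

```python
def ed(a, b):
    """Compact Wagner-Fischer edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

A, B = "gcatacg", "gcctaag"          # differ by two substitutions
refs = [c * 7 for c in "acgt"]       # single-letter reference sequences

# every single-letter reference yields a difference of distances of 0,
# even though ed(A, B) = 2
diffs = [abs(ed(A, r) - ed(B, r)) for r in refs]
```

The two substitutions cancel in every difference of distances, which is exactly the failure mode the sub-sequence splitting is designed to avoid.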
Assuming A is a sequence in the dataset, A[i] represents the ith letter of the sequence. A[i : j] represents the sub-sequence starting at the ith position and ending at the jth position. A(i, j) represents the sub-sequence of length j starting at the ith position. The null element Λ is used to represent insertions and deletions between a pair of sequences (Definition 1). For instance, (A[i] → Λ) represents a deletion of the ith element of sequence A and (Λ → A[i]) represents an insertion at the ith position of sequence A. γ(x → y) is the cost function for replacing x with y where x, y ∈ (Σ ∪ Λ). Assuming the maximum length of the sequences in S is l, the reference sequences must be at least l letters long. R refers to a set of n reference sequences, {r1, r2, . . . , rn} where ∀j ∈ {1, . . . , n}, rj ∈ R and |rj| ≥ l. As discussed in Section 3.2.2, the reference sequences must be far from each other so that they have nominal correlation. The maximum distance between a pair of reference sequences could be l. A set of such reference sequences is Rl where ∀rj, rk ∈ Rl and j ≠ k, ed(rj, rk) = l. Although different sets of reference sequences can satisfy the requirement of Rl, we choose Rl such that ∀x, y ∈ Σ where x ≠ y, both x^l and y^l are in Rl. For example, given Σ = {a, c, g, t} and l = 5, we choose the reference sequences R5 = {aaaaa, ccccc, ttttt, ggggg}. In other words, each reference sequence in Rl consists of exactly one repeated letter of Σ.

5.1.2 Key Ideas

In this section, we present the key ideas of the edit distance prediction method based on the reference sequences in Rl. First, the prediction accuracy based on these reference sequences is very good for small edit distances. However, the accuracy drops sharply for larger edit distances due to cancelled edit operations while computing the differences of distances with respect to the reference sequences.
In order to improve the accuracy, we divide the dataset sequences into sub-sequences of predefined length and compute edit distances between the corresponding sub-sequences of pairs of sequences.

Second, we merge the partial edit distances of pairs of fixed length sub-sequences to predict the edit distance between the entire sequences. While merging, we need to consider that insertion and deletion operations may be propagated to subsequent fixed length sub-sequences. Therefore, it is important to distinguish the pairs of sub-sequences where edit operations are captured redundantly. We apply different strategies to discard redundant edit operations from the partial edit distances of those pairs of sub-sequences.

Third, given multiple edit operations between a pair of sub-sequences, the differences of distances are less likely to predict the edit distances correctly for a small alphabet. For example, given an alphabet Σ, the probability of correct edit distance prediction for two substitution operations is ((|Σ|−1) + (|Σ|−2)²)/|Σ|². This ratio grows with the alphabet size for an increasing number of edit operations. Since the DNA alphabet consists of only four letters (nucleotides), we consider pairs of letters (di-nucleotides) as single elements to increase the number of all possible elements in the sequences. Then, we compute the differences of distances with respect to the reference sequences over these elements. This strategy improves the edit distance prediction accuracy between the NGS sequences significantly compared to that of the single letter based strategy, with a nominal increase in computation time.

Fourth, for a pair of sub-sequences with multiple edit operations, the differences of distances based edit distance prediction is likely to be less accurate due to the lack of information about the positions of the edit operations. In these scenarios, we improve the prediction accuracy by considering the positions of the edit operations.
Although this strategy does not guarantee accurate edit distance prediction between the entire sequences, it improves the accuracy significantly when there are several edit operations in close proximity.

5.2 Reference Sequence based Transformation

Selecting an effective set of reference sequences is critical for our proposed edit distance prediction methods. As discussed in Section 3.2, there are a few key properties for such a set of reference sequences. For example, the reference sequences must be far from each other, i.e., the edit distance between each pair of them must be significantly large. The reference sequences proposed in this chapter are significantly different from those of the previous chapters. Hence, we revisit some of the key properties of the reference sequences.

5.2.1 Edit Distances among the Reference Sequences

Given A ∈ S, if the reference sequences are correlated, the distances ed(A, r) for r ∈ R are close to each other. As discussed in Section 3.2.2, our goal is to minimize the correlation among the reference sequences. Correlation among the reference sequences decreases monotonically with an increasing distance threshold among them. The proposed single letter reference sequences Rl have the maximum edit distances among them. Further, they possess a unique property which is useful for predicting edit distances between the sequences in S. We prove this property in Theorem 2 based on the concept of the trace defined in [1]. For the sake of completeness, the definition of the trace [1] is presented here.

Definition 11. A trace from sequence A to sequence B is a triple (T, A, B), where T is a set of ordered pairs of integers (i, j) satisfying:
1. 1 ≤ i ≤ |A| and 1 ≤ j ≤ |B|;
2. for any two distinct pairs (i1, j1) and (i2, j2) in (T, A, B),
   (a) i1 ≠ i2 and j1 ≠ j2,
   (b) i1 < i2 iff j1 < j2.

When A and B are understood, (T, A, B) is referred to as T. A pair (i, j) in T describes a line between the letters A[i] and B[j].
Each letter can be connected by at most one line and the lines do not cross. The ordered pairs indicate the change operations between the sequences. A change operation represents either an exact match or a substitution between the corresponding letters. The following is an example of a trace between sequences A and B where T = <(1, 1), (2, 2), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10), (10, 11), (12, 12)>. A trace may not connect all the letters in A and B. In that case, the untouched letters in A are deleted from the sequence, denoted as (A[i] → Λ), and the untouched letters in B are inserted into A, denoted as (Λ → B[i]). Therefore, the cost of the given trace T, denoted cost(T), includes the cost of the change operations as well as those of the insertions and deletions. Assuming the cost of each edit operation is exactly one, the costs of the change, deletion and insertion operations in T are 0, 1 and 1 respectively. Hence, cost(T) = 2. The minimum trace cost is the edit distance between a pair of sequences. More detail about the trace and its cost function is available in Wagner et al. [1].

Theorem 2. Given a reference sequence r = x^l ∈ Rl and a dataset sequence A of length l′, ed(r, A) = l − k, where l ≥ l′ and k = Σ_{i=1}^{l′} [A[i] = x] is the number of positions of A holding the letter x.

Proof. Let us assume x ∈ Σ appears k times in A. The maximum size of a trace between r and A is l′. Suppose the size of (T0, r, A) is l′. Thus, the number of change operations in T0 is l′. Since x ∈ Σ appears k times in A, the cost of the change operations for those k pairs (i, j) ∈ T0 is zero. In addition, (l − l′) deletion operations occur in r. Since l ≥ l′, there is no insertion operation.

cost(T0) = Σ γ(r[i] → A[j]) + Σ γ(Λ → A[j]) + Σ γ(r[i] → Λ)
cost(T0) = (l′ − k) + 0 + (l − l′) = l − k

Assume an arbitrary trace (Ti, r, A) of size l′′ where l′′ ≤ l′.
Therefore, (l − l′′) letters in r and (l′ − l′′) letters in A do not participate in change operations, and the corresponding positions of r and A are not part of Ti. Letters at such positions in r are deleted and those in A are inserted into r.

cost(Ti) = (l′′ − k) + (l′ − l′′) + [(l′ − l′′) + (l − l′)]
cost(Ti) = (l − k) + (l′ − l′′) = cost(T0) + (l′ − l′′)

Hence, cost(T0) ≤ cost(Ti).

For example, in Figure 5.1(a), the trace (T0, A, r) has the maximum number of change operations. Here, the numbers of change, insertion and deletion operations are 11, 1 and 0 respectively. The reference sequence r ∈ Rl is based on the letter t ∈ Σ. Further, t appears k = 2 times in sequence A. Assuming the cost of each insertion, deletion and substitution operation is exactly 1, cost(T0) = (11 − 2) + 1 + 0 = 10. On the other hand, the cost of the arbitrary trace (T1, A, r) in Figure 5.1(b) is 12, which is larger than that of T0.

Figure 5.1: (a) A minimum cost trace with the maximum number of change operations and (b) an arbitrary trace with fewer change operations.

However, ed(A, r) represents the edit distance with respect to the letter t ∈ Σ only. In order to predict the edit distance between sequences A and B, we need to compute the differences of the edit distances of A and B with respect to each r ∈ Rl. The following holds for a given reference sequence because the edit distance satisfies the metric property.

ed(A, B) ≥ (ed(A, r) ∼ ed(B, r)) (5.1)

5.2.2 Predicting Edit Distance

In this section, we apply Theorem 2 and Inequality 5.1 to develop strategies for edit distance prediction between pairs of dataset sequences.
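The identity in Theorem 2 can be checked with a short sketch (a minimal Python illustration under unit edit costs; `edit_distance` is a textbook dynamic-programming implementation in the style of [1], and `homopolymer_distance` is a hypothetical helper name, not code from this dissertation):

```python
def edit_distance(a, b):
    """Unit-cost edit distance via the standard dynamic program [1]."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # change operation
        prev = cur
    return prev[-1]

def homopolymer_distance(a, x, l):
    """Theorem 2: ed(x^l, A) = l - k, where k counts occurrences of x in A."""
    return l - a.count(x)

# The chapter's opening example: ed(gcatacg, c^7) = ed(gcctaag, c^7) = 5, so the
# difference of distances is 0, although ed(gcatacg, gcctaag) = 2.
for seq in ("gcatacg", "gcctaag"):
    assert edit_distance(seq, "c" * 7) == homopolymer_distance(seq, "c", 7) == 5
```

The same identity holds for every x ∈ Σ and any A with |A| ≤ l, which is what allows the transformation step to build feature vectors without running the quadratic dynamic program.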
5.2.2.1 Edit Distance One

For a pair of dataset sequences with edit distance one, the distance is correctly predicted from the differences of distances with respect to the reference sequences in Rl. We prove this in Theorem 3.

Theorem 3. If ed(A, B) = 1, then ∃ri ∈ Rl s.t. (ed(A, ri) ∼ ed(B, ri)) = 1, where 1 ≤ i ≤ |Σ|.

Proof. Let us consider a dataset sequence A of length l′ and reference sequences ri ∈ Rl, where 1 ≤ i ≤ |Σ| and l > l′. Thus, by Theorem 2,

ed(A, ri) = (l′ − ki) + (l − l′) = l − ki, where ki = Σ_{j=1}^{l′} [A[j] = ri[j]] (5.2)

Suppose the letter x ∈ Σ is inserted at an arbitrary mth position of A to convert it into a dataset sequence B of length (l′ + 1). Hence,

ed(B, ri) = l − (ki + 1), if B[m] = ri[m] = x;
            l − ki, otherwise. (5.3)

Based on Equations 5.2 and 5.3, the difference of the edit distances of A and B with respect to ri ∈ Rl, where ri = x^l, is 1. Hence,

∃ri ∈ Rl, (ed(A, ri) ∼ ed(B, ri)) = 1. (5.4)

A deletion operation converting A into B is reciprocal to the insertion operation and also satisfies Equation 5.4. Now assume a substitution operation at A[m] that converts the sequence into B, where |A| = l′ and l ≥ l′. Then

ed(A, ri) = l − (ki + 1), if A[m] = ri[m];
            l − ki, otherwise. (5.5)

Based on Equation 5.5, (ed(A, ri) ∼ ed(B, ri)) = 1 if A[m] = ri[m] and B[m] ≠ ri[m], or vice versa. Otherwise, the difference of distances is 0. Hence, Equation 5.4 holds for a substitution operation between A and B.

For example, let us consider A = tctgta and B = tcttgta, where the letter t ∈ Σ is inserted at A[4] to convert A into B. Assuming Rl = {aaaaaaa, ccccccc, ggggggg, ttttttt}, (ed(A, ri) ∼ ed(B, ri)) = 1 for ri = ttttttt and 0 with respect to the other reference sequences.
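The example can be reproduced in a few lines (an illustrative Python sketch; `feature_vector` is a hypothetical helper name, and the distances use the Theorem 2 identity ed(A, x^l) = l − (occurrences of x in A)):

```python
def feature_vector(seq, l, alphabet="acgt"):
    """Distances of seq to each homopolymer reference x^l (Theorem 2, l >= len(seq))."""
    return {x: l - seq.count(x) for x in alphabet}

A, B = "tctgta", "tcttgta"          # B is A with the letter t inserted at position 4
va, vb = feature_vector(A, 7), feature_vector(B, 7)
diffs = {x: abs(va[x] - vb[x]) for x in "acgt"}
# Only the reference ttttttt registers the insertion:
assert diffs == {"a": 0, "c": 0, "g": 0, "t": 1}
```

For a substitution, two references register a difference of 1 each (the replaced letter and its replacement), consistent with the case analysis of Equation 5.5.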
5.2.2.2 Edit Distance Greater than One

In Theorem 3, we showed the correctness of our proposed method for edit distance one. In general, it is more likely to have more than one edit operation (Definition 2) between a pair of sequences. If ed(A, B) > 1, (ed(A, ri) ∼ ed(B, ri)) may not predict the correct edit distance. However, edit distance prediction based on Expression 5.6 is bounded by the actual edit distance for ed(A, B) ≥ 1, because at least two of the reference sequences in Rl have a non-zero (ed(A, ri) ∼ ed(B, ri)) for any edit operation between A and B. In many cases, the following expression predicts the edit distance correctly.

(Σ_{i=1}^{|Σ|} (ed(A, ri) ∼ ed(B, ri))) / 2 (5.6)

As the number of edit operations grows, the prediction accuracy based on Rl drops rapidly because an increasing number of edit operations are canceled when comparing the differences of the edit distances of dataset sequences with respect to the reference sequences in Rl. For example, in Figure 5.2, the sequence of edit operations is E = <(3, Λ, g), (11, g, Λ)>. In both cases, g ∈ Σ is involved in the edit operations. However, the difference of distances between the sequence pair with respect to the reference sequence r = gggggggggggg is zero, because the effects of the edit operations cancel each other when comparing the frequency of g in A and B. In order to minimize this problem, we apply the differences of distances based prediction to the sub-sequences separately rather than to the entire sequences at once. This strategy in most cases reduces the number of edit operations per pair of sub-sequences. For example, in Figure 5.3, the sequences are divided into non-overlapping fixed length sub-sequences. Each of the edit operations falls into a separate sub-sequence.
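As a concrete sketch of this splitting idea (illustrative Python; substitution-only, equal-length blocks, ignoring the indel propagation handled by the merging strategies of this chapter), the block-wise form of Expression 5.6 recovers the two substitutions of the chapter's opening example, which cancel when the sequences are compared as a whole:

```python
def predicted_ed(a, b, k, alphabet="acgt"):
    """Apply Expression 5.6 per fixed-length sub-sequence and sum the parts.
    For equal-length blocks, |ed(block, x^k) - ed(block', x^k)| reduces to a
    difference of letter counts by Theorem 2 (substitution-only sketch)."""
    total = 0
    for i in range(0, len(a), k):
        ba, bb = a[i:i + k], b[i:i + k]
        total += sum(abs(ba.count(x) - bb.count(x)) for x in alphabet) // 2
    return total

A, B = "gcatacg", "gcctaag"                # two substitutions, ed(A, B) = 2
assert predicted_ed(A, B, k=len(A)) == 0   # whole sequences: the edits cancel
assert predicted_ed(A, B, k=4) == 2        # length-4 sub-sequences: predicted correctly
```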
Hence, according to Theorem 3, we are able to compute them correctly.

Figure 5.2: The difference of edit distances between sequences A and B with respect to the reference sequence g^l cannot predict the correct edit distance because the same letter is inserted and subsequently deleted, resulting in no change in the overall frequency.

However, merging the partial edit distances into the final edit distance is a challenging task. We discuss the merging strategies in the following sections.

Figure 5.3: Sequences A and B are divided into fixed length sub-sequences. Each pair of sub-sequences has only one edit operation. The edit operations are predicted correctly and separately with respect to the given reference sequences using Theorem 3. The partial edit distances of the sub-sequences are merged subsequently.

5.3 Edit Distance Prediction from Partial Edit Distances of the Corresponding Sub-sequences

We extend the differences of distances based edit distance prediction between pairs of dataset sequences by applying Theorem 3 to the corresponding fixed length sub-sequences instead of the entire sequences. An example of such sub-sequence based edit distance prediction is shown in Figure 5.3. The example also highlights the challenge of merging the partial edit distances between the sub-sequences so that no edit operation is counted redundantly. For instance, in Figure 5.3, the edit distance for the 2nd pair of sub-sequences is 1, which is redundant due to the insertion operation in the preceding pair of sub-sequences. We develop the following strategies for efficient merging of partial edit distances between the corresponding sub-sequences.

• PS: Primary/Secondary sub-sequences: We compare the differences of distances between adjacent pairs of sub-sequences to identify whether the partial edit distance of the following sub-sequences is redundant or not. A pair of corresponding sub-sequences with only redundant differences of distances is denoted as secondary sub-sequences.
The edit distance of such sub-sequences is discarded during the merging process. Each remaining pair of corresponding sub-sequences with non-zero differences of distances is denoted as primary sub-sequences and is therefore considered while merging partial edit distances.

• ML: Differences of Distances Comparison using Multiple Letters: Due to the small DNA sequence alphabet, the differences of distances between elements of corresponding sub-sequences are likely to be smaller than the actual edit distance. As a result, identifying primary sub-sequences becomes more difficult. To minimize this problem, we consider multiple adjacent letters as single elements. This strategy increases the number of all possible elements significantly and reduces the risk of matching elements by chance. Hence, it improves the edit distance prediction accuracy based on the differences of distances of elements in corresponding sub-sequences.

5.3.1 Merging Partial Edit Distances of the Sub-sequences

The differences of distances between any corresponding sub-sequences are calculated independently of their location. However, in order to determine the edit distance between the entire sequences, we merge the partial predicted edit distances of the sub-sequences in ascending order, because edit operations have cascading effects on the following sub-sequences. Although edit operations occur in temporal order, for a given pair of sequences there is an equivalent physically ordered sequence (series) of edit operations.

Definition 12. Given a sequence of edit operations E, Em ⊂ E is the minimum sequence of edit operations if ∀(ij, xj, yj), (ik, xk, yk) ∈ Em, 1 ≤ j < k ≤ |Em| and ij ≠ ik.

Proposition 3. Given a sequence of edit operations E, there exists a corresponding minimal sequence of edit operations Em.

Proof. In the trivial case, if |E| ≤ 1, Em = E.
For |E| > 1, if (ij, xj, yj) ∈ E is an insertion or a deletion operation, it changes the positions of the edit operations (ij+k, xj+k, yj+k) ∈ E, where k > 0 and ij ≤ ij+k. Assume the function λ(ij) updates the positions of all the edit operations starting from (ij, xj, yj) in a sequence that is also physically ordered. For any pair of edit operations (ij, xj, yj), (ij+k, xj+k, yj+k) ∈ E, the following cases minimize the number of edit operations in E.

Case 1: Both are substitution operations and λ(ij) = ij+k. Thus, Em = E − (ij, xj, yj). There is no change in the positions of the remaining edit operations.

Case 2: (ij, xj, yj) is an insertion operation, (ij+k, xj+k, yj+k) is a substitution operation, and λ(ij) = ij+k. Thus, yj = yj+k and Em = E − (ij+k, xj+k, yj+k).

Case 3: (ij, xj, yj) is a substitution operation, (ij+k, xj+k, yj+k) is a deletion operation, and λ(ij) = ij+k. Then Em = E − (ij, xj, yj).

Case 4: (ij, xj, yj) is an insertion operation, (ij+k, xj+k, yj+k) is a deletion operation, and λ(ij) = ij+k. Then Em = E − (ij, xj, yj) − (ij+k, xj+k, yj+k).

In addition, for any (in, xn, yn) ∈ E where j < n < (j + k) and ij > in, λ(in) is decreased by 1. Hence, Em ⊆ E.

For example, let us consider E = <(10, Λ, a), (7, g, Λ), (21, t, Λ), (9, c, Λ), (15, Λ, g)>. The edit operation (9, c, Λ) nullifies the effect of the edit operation (10, Λ, a). Thus, Em = E − <(9, c, Λ)>.

Merging the partial edit distances of the corresponding pairs of sub-sequences increases the likelihood of correct edit distance prediction between a pair of sequences only if, for a given Em, there is an equivalent physically ordered sequence (series) of edit operations. In Proposition 4, we show that a temporally ordered minimum sequence of edit operations Em can be represented by a physically ordered one.

Proposition 4. For a given Em, there exists a physically ordered sequence of edit operations Em_p s.t. for any (ij, xj, yj), (ij+1, xj+1, yj+1) ∈ Em_p, 1 ≤ j < |Em_p|, ij < ij+1.

Proof.
Let us consider ∀(ij, xj, yj), (ik, xk, yk) ∈ Em, 1 ≤ j, k ≤ |Em|. The following cases can occur while converting the sequence into a physically ordered one.

Case 1: j < k and ij < ik; the temporally ordered edit operations are also physically ordered.

Case 2: j < k and ij > ik; ik may change the value of ij.

i″j = ij + 1, if ij > ik and xk = Λ;
      ij − 1, if ij > ik and yk = Λ;
      ij, otherwise. (5.7)

For the sake of contradiction, suppose the edit operation (ij, xj, yj) ∈ Em is converted to (i″j, xj, yj) ∈ Em_p but the positions ij and i″j are not equivalent. This cannot be true because of the following scenarios.

• There is no ij = ik because Em is a minimal sequence of edit operations.
• If ij > ik, Equation 5.7 updates ij.
• For all j < n < k with ij > in, the position ij will be updated by λ(in).

Hence, the edit operations in Em_p are equivalent to those in Em.

For example, let Em = <(7, c, Λ), (20, a, Λ), (9, t, Λ), (15, Λ, g), (12, Λ, t)>. The edit operation (20, a, Λ) will be updated to (21, a, Λ). Similarly, (15, Λ, g) will be updated to (16, Λ, g). Thus, Em_p = <(7, c, Λ), (9, t, Λ), (12, Λ, t), (16, Λ, g), (21, a, Λ)>.

5.3.2 Finding Primary and Secondary Sub-sequences

If each edit operation belongs to only one pair of corresponding sub-sequences, we can accumulate the partial predicted edit distances of the sub-sequences to predict the edit distance between the entire sequences. However, insertion and deletion operations shift the positions of the letters and are therefore likely to create, in fixed length sub-sequences, edit operations that are not present between the actual sequences. Hence, it is critical to discard the partial distances of the secondary sub-sequences, which contain only edit operations generated by the segmentation of the sequences. We determine secondary sub-sequences as follows. Let us consider A = <A1.A2. . . . Ai · · ·> and B = <B1.B2. . . .
Bi · · ·>, where Ai and Bi are a pair of primary sub-sequences, and there are non-zero differences of distances between Ai+1 and Bi+1. We need to verify whether the edit operation(s) in Ai and Bi are causing the differences between Ai+1 and Bi+1. We add the distances of Ai and Ai+1 with respect to each reference sequence and do the same for Bi and Bi+1. Then we compute the differences of distances between Ai.Ai+1 and Bi.Bi+1 with respect to the reference sequences. If the sum of the differences of distances is greater than that of Ai and Bi, then Ai+1 and Bi+1 are also primary sub-sequences. Otherwise, we need to examine each difference of distances in the sub-sequences to determine their type. In case the effect of letters in Ai is canceled by letters in Bi+1 or vice versa, Ai+1 and Bi+1 are more likely secondary sub-sequences. If not, they are more likely primary sub-sequences.

For example, in Figure 5.4, sequences A and B are divided into sub-sequences of length 4. The differences of distances between each pair of corresponding sub-sequences are non-zero for multiple reference sequences. Among them, A(0, 4) and B(0, 4) are inherently primary sub-sequences. In order to determine whether A(4, 4) and B(4, 4) are primary sub-sequences, we compute the difference of distances between A(0, 8) and B(0, 8). The sum of the differences of distances between A(0, 4) and B(0, 4), and between A(4, 4) and B(4, 4), is equal to that of A(0, 8) and B(0, 8). In addition, the letters responsible for the differences of distances also indicate a high likelihood of A(4, 4) and B(4, 4) being secondary sub-sequences. Hence, we discard the differences of distances found between A(4, 4) and B(4, 4).

Figure 5.4: The edit distance between sequences A and B is predicted from the differences of distances of the primary sub-sequences. The 2nd pair of sub-sequences contains only redundant edit operations and hence is not considered for the distance prediction.
Similarly, for A(8, 4) and B(8, 4), the sum of the differences of distances for A(0, 12) and B(0, 12) is zero. However, this cannot be correct because an edit operation was recorded between A(0, 4) and B(0, 4). It suggests that at least one insertion or deletion operation occurred between A(8, 4) and B(8, 4). Hence, we predict A(8, 4) and B(8, 4) as primary sub-sequences. We add the edit operation to that of A(0, 4) and B(0, 4) and update the latest primary sub-sequence position to the 3rd pair of sub-sequences. The merged edit distance correctly predicts the edit distance between sequences A and B.

Based on the proposed strategies, we present the following edit distance prediction function that computes the differences of distances between the corresponding sub-sequences of a pair of dataset sequences.

Function 1. Edit-Distance-Prediction-using-Sub-sequences(A, B, k)
Input: A, B, sub-sequence length k
Output: Predicted distance edpr(A, B)
1. At = Reference-Sequence-based-Transformation(A, k)
2. Bt = Reference-Sequence-based-Transformation(B, k)
3. edpr(A, B) = 0
4. foreach corresponding pair of sub-sequences in A and B do
5.   diffdist = 0
6.   foreach c ∈ Σ do
7.     diffdist += (ed(A(i ∗ k, k), c^k) ∼ ed(B(j ∗ k, k), c^k))
8.   end for
9.   if diffdist > 0 do
10.    Verify if the pair of sub-sequences is primary or secondary
11.    if (primary sub-sequence) do
12.      if (diffdist > 2) do
13.        diffdist = Edit-Distance-Prediction-using-Sub-sequences(A(i ∗ k, k), B(j ∗ k, k), k/2)
14.      end if
15.      Add the partial edit distance to edpr(A, B)
16.      Update the index of the latest primary sub-sequence
17.    end if
18.  end if
19. end for
20. Return edpr(A, B)

Function 2. Reference-Sequence-based-Transformation(A, k)
Input: Sequence A, sub-sequence length k
Output: At /* The edit distances between the sub-sequences of A and ri ∈ Rl */
1. At = [[]]
2. foreach A(j ∗ k, k) in A do
3.   foreach ri ∈ Rl do
4.     At[i][j] = Σ_{m=j∗k}^{(j+1)∗k} [A[m] ≠ ri[m]]
5.   end for
6. end for
7.
return At

Steps 1 and 2 of Function 1 transform a pair of dataset sequences into feature vectors by computing the edit distances of the sub-sequences with respect to the reference sequences. The differences of distances between each pair of corresponding sub-sequences are computed in steps 5-14. In case more than one edit operation is predicted, the sub-sequences are divided further and the differences of distances are computed recursively. Partial edit distances are gradually merged in step 15 and the primary sub-sequence index is updated in step 16.

5.3.3 Multiple Adjacent Letters as Distinct Elements

The differences of distances based method relies on the number of appearances of each letter in a sub-sequence. Due to the small size of the DNA alphabet, it is likely that the differences of distances between corresponding sub-sequences will be minimized despite one or more edit operations between them. For example, in Figure 5.5, the difference of distances between A(0, 4) and C(0, 4) is zero despite an insertion operation. The difference of distances with respect to the letter a ∈ Σ is minimized because A[3] = a cancels the effect of the insertion operation at C[2].

Figure 5.5: The insertion operation between the first pair of fixed length sub-sequences is not reflected in the differences of distances between them because of the small DNA alphabet. Considering multiple adjacent letters as single elements avoids such by-chance reduction in the differences of distances.

To minimize this problem, we consider multiple adjacent letters of a sub-sequence as distinct elements. This strategy significantly increases the number of all possible elements and therefore reduces the risk of accidental matching of elements between corresponding sub-sequences. Hence, the larger number of distinct elements significantly improves the edit distance prediction accuracy.
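The element-based counting can be sketched in a few lines (illustrative Python with a hypothetical example pair, not the blocks of Figure 5.5): two blocks that are anagrams of each other are indistinguishable by single-letter counts, but overlapping di-nucleotide elements expose the difference:

```python
from collections import Counter

def element_counts(seq, w):
    """Counts of overlapping w-letter elements (w=1: letters, w=2: di-nucleotides)."""
    return Counter(seq[i:i + w] for i in range(len(seq) - w + 1))

def count_difference(a, b, w):
    """Sum of absolute count differences over all elements of width w."""
    ca, cb = element_counts(a, w), element_counts(b, w)
    return sum(abs(ca[e] - cb[e]) for e in set(ca) | set(cb))

A, B = "gata", "gaat"                    # same letters, different order
assert count_difference(A, B, 1) == 0    # single letters: the edit is invisible
assert count_difference(A, B, 2) == 2    # di-nucleotides: ta vs. aa differ
```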
It should be noted that increasing the number of letters in an element rapidly increases the number of all possible elements as well as the prediction accuracy, with added computation cost and memory requirements. Since we are focused on small pseudorandom DNA sub-sequences, elements containing two adjacent letters are sufficient for our experimental datasets. For example, for the nucleotides (letters) of DNA sequences, we consider all possible pairs of adjacent letters as distinct elements. Thus, ∀x, y ∈ Σ, xy ∈ Σ′ where |Σ′| = |Σ|². In Figure 5.5, the difference of distances between A(0, 4) and B(0, 4) with respect to the elements {ga, gt, at, ag} is 4. The non-zero differences of distances indicate an edit operation. Further examination of the positions of the elements also identifies the type of the edit operation. Hence, the increased number of distinct elements significantly improves the edit distance prediction accuracy. It should be noted that Function 1 and Function 2 need to be marginally modified to accommodate the increased alphabet based difference of distances calculation.

5.3.4 Effect of Sub-sequence Length

The length of a sub-sequence is critical for our proposed differences of distances based edit distance prediction method. Smaller sub-sequences are likely to have fewer edit operations between them. As a result, the possibility of canceled edit operations is also reduced. We conduct an experiment to show the effect of sub-sequence length on edit distance prediction accuracy (Table 5.1). Pairs of sequences up to edit distance threshold 8 from the dataset Dm (Table 3.1) are used for this experiment. The prediction accuracy is significantly higher for smaller sub-sequences than for larger sub-sequences.
When combined with the multiple letters based strategy (Section 5.3.3), the prediction accuracy is very high for smaller sub-sequences because the increased number of distinct elements reduces the number of canceled edit operations significantly.

Table 5.1: Effect of sub-sequence length on edit distance prediction accuracy (%). For a predefined distance threshold, the experimental result shows that the prediction accuracy drops with increasing sub-sequence length.

                                    Sub-sequence length
Number of distinct elements      4       6       8       10
4                              99.1    96.4    92.1    88.5
16                             99.8    99.06   98.3    97.8

5.3.5 Computational Complexity

Given sub-sequence length k and sequence length l, the initial computational complexity of the differences of distances between the sub-sequences is O((l/k) · |Σ|). Sub-sequences with multiple edit operations are computed recursively until each pair of smaller sub-sequences contains only one edit operation. Hence, for edit distance d, there can be at most d ≤ l partitions on top of the initially created sub-sequences. The overall computational complexity is O((l/k + d) · |Σ|). Assuming constant sub-sequence length and alphabet size, the complexity becomes O(l).

5.4 Experimental Results

We evaluate the proposed edit distance prediction method with respect to computation time as well as accuracy. As shown in the following experiments, the prediction accuracy of our proposed method declines very slowly. Since our focus is sequence similarity at high similarity thresholds, the experimental results are primarily shown for up to edit distance 10. The computation time for each pair of sequences varies only marginally because only a small fraction of the sub-sequences contain edit operations that require more computation time. We also evaluate the effectiveness of our proposed method on an important NGS dataset application, average linkage hierarchical clustering at high similarity thresholds.
Despite errors in prediction for larger edit distances, the overall clustering accuracy is significantly better than that of VSEARCH [23] and the locality sensitive hashing based LSH-Div [22] clustering method. The quadratic run time edit distance calculation method [1] is set as the baseline method. The edit distances and the clusters generated from the baseline method are used as the ground truth for the experiments.

5.4.1 Effect of the Proposed Strategies on Edit Distance Prediction Accuracy

We evaluate the effectiveness of each of the proposed differences of distances calculation strategies on edit distance prediction accuracy. Figure 5.6 shows the improvement in prediction accuracy when the strategies are applied collectively. The experiment is conducted on the dataset
In the following sections, we predict edit distances based on the differ- ences of distances, the increasing number of distinct elements and the minimum cost traces between the sub-sequences. 88 Table 5.2: Comparison of cumulative edit distance prediction accuracy between the MVD and the differences of distances based methods Data sets Dm Dh Dr Ds Dp Dz Data sets ed ≤ 1 ed ≤ 2 ed ≤ 3 ed ≤ 4 ed ≤ 5 Pp/M V D Pp/M V D Pp/M V D Pp/M V D Pp/M V D 1/1 1/1 1/1 1/1 1/1 1/1 ed ≤ 6 1/1 1/1 1/1 0.999/1 0.999/1 0.999/1 ed ≤ 7 0.999/0.975 0.997/0.931 0.996/0.895 0.999/0.961 0.997/0.912 0.995/0.867 0.998/0.963 0.994/0.845 0.99/0.766 0.999/0.979 0.997/0.924 0.994/0.852 0.999/0.963 0.996/0.879 0.991/0.811 0.998/0.954 0.995/0.865 ed ≤ 8 ed ≤ 9 0.991/0.789 ed ≤ 10 Pp/M V D Pp/M V D Pp/M V D Pp/M V D Pp/M V D 0.987/0.921 0.986/0.926 0.994/0.904 0.992/0.911 0.991/0.917 0.989/0.922 0.987/0.841 0.989/0.914 0.983/0.876 Dm Dh Dr Ds Dp Dz Pp = Differences of distances based edit distance prediction 0.986/0.931 0.984/0.912 0.983/0.903 0.981/0.928 0.979/0.939 0.98/0.922 0.977/0.934 0.981/0.95 0.976/0.962 0.98/0.897 0.977/0.91 0.992/0.899 0.99/0.901 0.987/0.88 0.987/0.886 0.987/0.927 0.984/0.931 0.974/0.921 0.972/0.969 0.976/0.947 0.974/ 0.942 5.4.2 Edit Distance Prediction Accuracy We evaluate edit distance prediction accuracy between sequences of the experimental datasets (Table 3.1) using the MVD and the differences of distances based methods. Prediction accuracy of the differences of distances based method declines very slowly for increasing edit distances compared to sharp decline in accuracy with the MVD based method [34]. As a result, for M V D > 5, the predicted distances need to be verified using the existing quadratic run time based method [1]. Despite the costly computation, it is shown in Table 5.2 that the cumulative prediction accuracy based on the proposed method is better than that of the M V D based method. 
The superior performance of the proposed method is demonstrated in Figure 5.7, which compares the two methods with respect to both prediction time and accuracy for increasing pairwise edit distances. The proposed method is significantly faster with nominal loss in accuracy.

Figure 5.7: Comparison of edit distance prediction time and accuracy between the MVD and the differences of distances based methods. (a) Dataset Dh, (b) Dataset Dr, (c) Dataset Ds, (d) Dataset Dp.

In the proposed method, edit operations between the corresponding sub-sequences are determined separately. For small sub-sequences, the likelihood of an edit operation being canceled when computing the difference of distances with respect to an equal length reference sequence is very low. However, the likelihood of canceled edit operations increases as the sub-sequences become longer. On the other hand, the MVD based method predicts edit distances between the entire sequences. Since the randomly generated reference sequences are far from each other, dataset sequences that are far from each other may still have similar edit distances with respect to such reference sequences. Therefore, for larger edit distances, in the MVD based method, no reference sequence may be able to correctly predict the edit distance between a pair of dataset sequences. Increasing the number of reference sequences improves the prediction accuracy only marginally, with a rapid increase in computation time.

5.4.3 Edit Distance Calculation Time

Figure 5.8 shows a comparison of pairwise edit distance calculation time between the proposed and the MVD based methods. We include the transformation time in the total time for both methods while calculating the average pairwise edit distance prediction time. The figure also shows the variances (dashed line) in computation time for varying edit distances. It should be noted that the effect of the transformation time becomes smaller with increasing dataset size.
The average pairwise distance prediction time of the proposed method is significantly better than that of the MVD based method because the latter requires more reference sequences and therefore more pairwise edit distance calculations between dataset sequences and reference sequences.

Figure 5.8: Comparison of average edit distance prediction time between the dataset sequences using the MVD and the differences of distances based methods

5.4.4 Clustering

We create average linkage agglomerative hierarchical clusters using the differences of distances based method, and compare the results with those of the MVD [34], VSEARCH [23] and LSH-Div [22]. The lower levels of the dendrogram of hierarchical clusters are useful for many NGS sequence analysis methods. Since we focus on large NGS datasets, a graph based external memory (disk) hierarchical clustering method, SparseHC [49], is adopted for creating the clusters. It should be noted that the space complexity of main memory based agglomerative hierarchical clustering is O(N^2), where N is the number of sequences in the dataset. The performance comparison is based on the clustering time and the relative similarity (i.e., NMI and AMI) of the generated clusters with respect to those based on the dynamic programming based edit distances.

5.4.4.1 Clustering Time

We compare the clustering time of the proposed method with that of the MVD as well as the k-mer similarity based methods. Hierarchical clustering using the quadratic run time edit distance calculation requires days as the dataset size grows. The proposed differences of distances based clustering is considerably faster than the MVD based method [34] because the number of reference sequences for the proposed edit distance prediction method is very small compared to that of the MVD based method. As a result, the proposed method benefits from low transformation time, fewer dimensions for pairwise comparison and the absence of costly post processing.
Among the matching k-mer based methods, VSEARCH requires the least amount of time at the cost of accuracy. On the other hand, the clustering time of LSH-Div increases more rapidly than that of the differences of distances based method with an increasing number of sequences in a dataset.

Table 5.3: For the 99% similarity threshold, comparison of clustering time (in hours) and accuracy of the differences of distances based method to those of the actual edit distances, the MVD, as well as the k-mer matching based VSEARCH and LSH-Div clustering. The Baseline column gives time; the remaining columns give Time/NMI/AMI.

Data sets  Baseline  Proposed             MVD                  VSEARCH [23]        LSH-Div [22]
Dm         0.61      0.017/0.998/0.993    0.054/0.998/0.993    0.007/0.903/0.75    0.025/0.89/0.68
Dh         0.94      0.029/0.9976/0.992   0.067/0.997/0.9923   0.01/0.89/0.77      0.049/0.90/0.70
Dr         1.72      0.067/0.997/0.9916   0.095/0.9976/0.991   0.03/0.90/0.71      0.218/0.85/0.67
Ds         17.95     0.24/0.994/0.99      0.47/0.997/0.9905    0.147/0.91/0.78     1.416/0.92/0.78
Dp         411.28    4.02/0.9999/0.998    7.1/0.994/0.9905     0.352/0.90/0.802    9.6/0.903/0.713
Dz         641.62    5.9/0.9956/0.9902    9.9/0.9956/0.99      0.52/0.889/0.776    13.1/0.91/0.706

5.4.4.2 Cluster Similarity

We quantify the relative similarity of the clusters generated using the proposed edit distance prediction method with respect to the ground truth. We also experiment with heuristics based clustering methods such as VSEARCH and LSH-Div. The NMI and AMI scores of both the VSEARCH and LSH-Div based clusters are significantly lower compared to those of the proposed method. Although the NMI scores are around 0.9 for both methods, their AMI scores are quite low. On the other hand, the clusters based on the proposed edit distance prediction method and those of the MVD based method show close to 99% relative similarity for edit distance similarity thresholds of 99% (Table 5.3) and 97% (Table 5.5).
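The NMI score used in this comparison can be computed directly from two flat cluster labelings. The following is a minimal sketch using the common geometric-mean normalization; the exact NMI/AMI variant used by the evaluation tooling may differ.

```python
from collections import Counter
from math import log, sqrt

def nmi(labels_a, labels_b):
    # Mutual information between two flat clusterings, normalized by the
    # geometric mean of the two cluster entropies (natural log throughout).
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum((c / n) * log(c * n / (ca[a] * cb[b]))
             for (a, b), c in joint.items())
    ha = -sum((c / n) * log(c / n) for c in ca.values())
    hb = -sum((c / n) * log(c / n) for c in cb.values())
    return mi / sqrt(ha * hb) if ha > 0 and hb > 0 else 1.0
```

Identical partitions score 1 regardless of label names. AMI additionally corrects the score for chance agreement, which is one reason the AMI scores in Table 5.3 fall well below the NMI scores for the k-mer based methods.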
However, as shown in Table 5.4, for the 98% similarity threshold, the clusters based on the proposed predicted edit distances show better NMI and AMI scores than those of the MVD because of the higher accuracy of the predicted edit distances.

Table 5.4: For the 98% similarity threshold, comparison of clustering time (in hours) and accuracy of the differences of distances based method to those of the actual edit distances, the MVD, as well as the k-mer matching based VSEARCH and LSH-Div clustering. The Baseline column gives time; the remaining columns give Time/NMI/AMI.

Data sets  Baseline  Proposed             MVD                   VSEARCH [23]         LSH-Div [22]
Dm         0.63      0.02/0.995/0.991     0.072/0.982/0.9725    0.0043/0.91/0.767    0.026/0.88/0.702
Dh         0.943     0.034/0.994/0.989    0.094/0.98/0.9745     0.0041/0.91/0.78     0.051/0.884/0.71
Dr         1.79      0.077/0.993/0.987    0.122/0.978/0.9702    0.0125/0.91/0.716    0.219/0.84/0.68
Ds         17.98     0.28/0.992/0.986     0.614/0.979/0.969     0.058/0.92/0.79      1.43/0.91/0.784
Dp         411.4     4.7/0.997/0.991      7.88/0.979/0.967      0.19/0.895/0.798     9.52/0.891/0.708
Dz         641.8     6.72/0.99/0.98       11.1/0.981/0.969      0.29/0.89/0.78       13.3/0.896/0.706

5.5 Summary

In this chapter, we propose an edit distance prediction method using a set of deterministically generated reference sequences. The distances among the reference sequences are maximum, i.e., equal to their lengths. Each such sequence is based on a specific letter of the alphabet. As a result, the edit distance between a reference sequence and a dataset sequence can be computed using only substitution operations. We compare the differences of distances of the dataset sequences with respect to the reference sequences to predict the edit distances between them. Although this technique correctly predicts an edit distance of one, the prediction accuracy drops sharply for larger edit distances.
In order to improve the prediction accuracy, we develop several strategies, such as calculating differences of distances between the sub-sequences separately, discarding redundant edit operations, increasing the number of distinct elements and computing minimum cost traces of the sub-sequences. Subsequently, we merge the differences of distances of the sub-sequences to predict the edit distances between the entire sequences. The computational complexity of the proposed method is linear with respect to the sequence lengths.

Table 5.5: For the 97% similarity threshold, comparison of clustering time (in hours) and accuracy of the differences of distances based method to those of the actual edit distances, the MVD, as well as the k-mer matching based VSEARCH and LSH-Div clustering. The Baseline column gives time; the remaining columns give Time/NMI/AMI.

Data sets  Baseline  Proposed             MVD                   VSEARCH [23]         LSH-Div [22]
Dm         0.634     0.022/0.99/0.98      0.104/0.9956/0.989    0.003/0.915/0.79     0.026/0.87/0.732
Dh         0.95      0.038/0.991/0.978    0.086/0.9938/0.992    0.005/0.905/0.803    0.053/0.87/0.76
Dr         1.796     0.085/0.99/0.976     0.13/0.994/0.9882     0.008/0.89/0.73      0.22/0.83/0.686
Ds         18.01     0.31/0.992/0.98      0.74/0.992/0.987      0.043/0.93/0.82      1.44/0.91/0.80
Dp         411.5     5.44/0.988/0.977     11.36/0.992/0.9856    0.162/0.905/0.798    9.83/0.89/0.72
Dz         642.3     7.6/0.981/0.972      12.66/0.991/0.985     0.24/0.901/0.794     13.78/0.887/0.716

We evaluate the proposed edit distance prediction method with respect to required time and prediction accuracy to show its superiority over the MVD based edit distance prediction method. The effectiveness of the proposed predicted edit distance method is also shown on average linkage agglomerative hierarchical clustering of the experimental 16s rRNA metagenomic NGS sequence datasets for high similarity thresholds.
Chapter 6

Bounded Exact Edit Distance

The edit distance prediction using differences of distances between the corresponding sub-sequences of pairs of dataset sequences cannot guarantee correctness because Theorem 3 is applicable only if the edit distance is one. Further, there is no way to know the exact location of an edit operation from the differences of distances. As a result, the method proposed in Chapter 5 is error prone in determining the type and the number of edit operations. Although the differences of distances based method provides high accuracy when the number of edit operations is small or the edit operations are distributed over the entire sequences, its performance degrades sharply when several edit operations appear in close proximity. On the other hand, many genome sequence analysis methods require edit distances only within a small bound. For those applications, minor modifications of the standard quadratic method greatly reduce the computation time. However, they still require computing all possible combinations of sub-sequences within the given bound.

6.1 Length of the Sub-sequences

In this chapter, we develop an efficient bounded exact edit distance calculation method using the concept of trace (Definition 11) [1]. As shown by Wagner et al. [1], given a minimum cost trace (T, A, B), if A = A1.A2, B = B1.B2, and there is no trace element between A1 and B2 or between A2 and B1, then

cost(T, A, B) = cost(T1, A1, B1) + cost(T2, A2, B2)    (6.1)

For example, given A = honda and B = hyundai, a minimum cost trace is shown in Figure 6.1(a). We divide the strings into A1 = hon and A2 = da as well as B1 = hyun and B2 = dai so that there is no trace element between A1 and B2 or between A2 and B1. In Figure 6.1(b), (T1, A1, B1) and (T2, A2, B2) are also minimum cost traces and thereby satisfy Equation 6.1.
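Equation 6.1 can be checked numerically with the standard dynamic program of Wagner and Fischer [1]. The following is a minimal sketch assuming unit cost per insertion, deletion and substitution:

```python
def edit_distance(a, b):
    # Standard dynamic program [1], kept to a single rolling row of the table.
    m, n = len(a), len(b)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i          # 'prev' holds the diagonal cell
        for j in range(1, n + 1):
            prev, row[j] = row[j], min(row[j] + 1,      # delete a[i-1]
                                       row[j - 1] + 1,  # insert b[j-1]
                                       prev + (a[i - 1] != b[j - 1]))  # substitute
    return row[n]
```

Splitting at a point with no crossing trace elements preserves the total cost: edit_distance("honda", "hyundai") equals edit_distance("hon", "hyun") + edit_distance("da", "dai"), i.e., 2 + 1 = 3.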
Figure 6.1: (a) Minimum cost trace of a pair of sequences (strings) and (b) dividing the sequences into corresponding sub-sequences based on the minimum cost trace

However, we do not know a minimum cost trace without comparing all possible combinations of the sub-sequences. On the other hand, for a given edit distance bound d, only a subset of all possible sub-sequences needs to be compared because, for every (i, j) ∈ (T, A, B), i and j cannot be more than d letters apart. We break A and B into sub-sequences based on the potential (i, j) ∈ (T, A, B) and compute the exact edit distance up to the bound.

Theorem 4. Given ed(A, B) ≤ d, 1 ≤ i ≤ |A|, 1 ≤ j ≤ |B| and a minimum cost trace (T, A, B), ∀(i, j) ∈ (T, A, B): (i − d) ≤ j ≤ (i + d).

Proof. Let A[0 : m] be the prefix of sequence A of length m, let A[m] be the m-th letter of A, and let γ(A[m] → B[n]) be the cost function to convert A[m] into B[n]. The recursive function for calculating the edit distance between A[0 : m] and B[0 : n] is:

ed(A[0 : m], B[0 : n]) = min of
  Case 1: ed(A[0 : m], B[0 : n − 1]) + γ(Λ → B[n])
  Case 2: ed(A[0 : m − 1], B[0 : n]) + γ(A[m] → Λ)
  Case 3: ed(A[0 : m − 1], B[0 : n − 1]) + γ(A[m] → B[n])

At each level of the recursion, we divide A[0 : m] and B[0 : n] into two pairs of corresponding sub-sequences according to one of the aforementioned cases such that no (i, j) ∈ (T′, A[0 : m], B[0 : n]) exists between neighboring pairs of sub-sequences. Thus T′ can be split into T1 and T2, and cost(T′) ≤ cost(T1) + cost(T2). If T′ is a minimum cost trace, T1 and T2 are also minimum cost traces and cost(T′) = cost(T1) + cost(T2). Each path from the root to a leaf node of the recursion tree represents an arbitrary trace. Given a minimum cost trace T and at most d edit operations, we prove that the maximum difference of index positions for any (i, j) ∈ (T, A, B) is bounded by d.
Since the maximum difference between i and j comes from either insertions only or deletions only, we focus on those two scenarios. Case 1 occurs whenever there is an insertion operation. Assuming d insertion operations, if T is a minimum cost trace, Case 1 occurs at most d times. For any T′ with more than d insertions, T′ is not a minimum cost trace and cost(T′, A, B) > d. Thus, ∀(i, j) ∈ (T, A, B): i ≤ j ≤ (i + d) when the edit operations are insertions only. Similarly, for deletion operations only, Case 2 occurs at most d times for a minimum cost trace (T, A, B). Thus, ∀(i, j) ∈ (T, A, B): (i − d) ≤ j ≤ i. Considering both scenarios, ∀(i, j) ∈ (T, A, B): (i − d) ≤ j ≤ (i + d) where ed(A, B) ≤ d.

For example, in Figure 6.1(a), ed(A, B) = 3 and a minimum cost trace is (T, A, B) = <(1, 1), (2, 2), (3, 4), (4, 5), (5, 6)>. From the figure, we see that ∀(i, j) ∈ (T, A, B) where 1 ≤ i ≤ |A| and 1 ≤ j ≤ |B|: |i − j| ≤ 3.

6.2 Minimum Cost Trace between the Sub-sequences

For any pair of corresponding sub-sequences, we compute a trace that requires the minimum number of changes for the individual (i, j) pairs within the sub-sequences. Such a trace cost may not be locally minimum; however, it is more likely to be part of a globally minimum trace for a given threshold.

Theorem 5. Let ed(A, B) ≤ d, let (T, A, B) be the minimum cost trace, and let A″ = A(m, d), B″ = B(m, 2d). Also assume (i, j) ∈ P, where i ∈ {m, . . . , (m + d)}, j, k ∈ {m, . . . , (m + 2d)}, j ≠ k, A″[i] = B″[j] = B″[k] and (i ∼ j) ≤ (i ∼ k). If ∀(i′, j′) ∈ (T″, A″, B″) represents the minimum total difference from ∀(i, j) ∈ P, then (T″, A″, B″) ⊂ (T, A, B).

Proof. Any (i, j) ∈ (T″, A″, B″) with m ≤ i ≤ (m + d) and m ≤ j ≤ (m + 2d) refers to a change operation with the cases (a) A″[i] = B″[j] and (b) A″[i] ≠ B″[j]. However, an (i, j) ∈ (T″, A″, B″) with A″[i] ≠ B″[j] exists only to minimize the trace cost, i.e., the number of edit operations.
For each (i, j) ∈ P, where i ∈ {m, . . . , (m + d)}, j, k ∈ {m, . . . , (m + 2d)}, j ≠ k, if A″[i] = B″[j] = B″[k] and (i ∼ j) ≤ (i ∼ k), then, considered in isolation, (i, j) is the candidate pair for the minimum cost trace. Since the minimum cost trace is globally optimal, the candidate pairs (i, j) ∈ P must not cross each other, as stated in Definition 11. Hence, one of the following cases holds.

Case 1: ∀(i1, j1), (i2, j2) ∈ P where i1 ≠ i2 and j1 ≠ j2: i1 < i2 iff j1 < j2. In this case, all the candidate pairs satisfy the requirements of being trace elements. For any untouched pairs between any distinct (m1, n1) and (m2, n2), we add more pairs arbitrarily to minimize the trace cost.

Case 2: ∃(i1, j1), (i2, j2) ∈ P where i1 ≠ i2 and j1 ≠ j2: i1 < i2 yet j1 > j2. Both of (i1, j1) and (i2, j2) cannot belong to T″. Thus, the conflicting pairs have to be adjusted so that the increase in cost is minimum. Suppose discarding/modifying (i1, j1) increases the cost less than discarding/modifying (i2, j2); we then replace (i1, j1), and we find the minimum cost increment for all such conflicts. It should be noted that (T″, A″, B″) may not be the minimum cost trace locally. However, it will be a candidate subset of the global minimum cost trace (T, A, B).

Similarly, we find (T″″, A(m, 2d), B(m, d)) and compare it with (T″, A(m, d), B(m, 2d)). In case the costs are tied yet the trace elements are not equivalent, we need to look into the remaining sub-strings to break the tie. Otherwise, the minimum cost trace between them will be a subset of the globally optimal trace.

For example, given A = "acgtagctag" and B = "gggggacgtagctag", ed(A, B) = 5. Suppose we compute the trace (T″, A(1, 5), B(1, 10)) ⊂ (T, A, B). According to Theorem 5, P = {(1, 6), (2, 7), (3, 3), (4, 9), (5, 10)}. However, (3, 3) is in conflict with (2, 7) and (4, 9) when considered as a trace pair for (T″, A(1, 5), B(1, 10)).
The least costly change in P is to replace (3, 3) with (3, 8). Hence, (T″, A(1, 5), B(1, 10)) = {(1, 6), (2, 7), (3, 8), (4, 9), (5, 10)}.

It should be noted that not all the sub-strings need to be of length 2d, where d is the upper bound of the edit distance, because the maximum difference satisfies i ∼ j ≤ d and the trace pairs are physically ordered. If there are insertions and/or deletions in some corresponding sub-strings, the lengths of the subsequent ones will be less than 2d. This property further reduces the cost of the proposed bounded edit distance calculation method.

6.2.1 Computational Complexity

Given an edit distance bound d, we compute ed(A, B) by comparing corresponding pairs of sub-sequences of lengths d and 2d of A and B. The comparison has quadratic complexity with respect to d; hence, the computational complexity of a trace for a pair of sub-sequences is O(d^2). The maximum number of sub-sequence pairs is (l/d). Hence, the overall computational complexity is O(ld).

6.3 Experimental Results

We evaluate the proposed bounded exact edit distance calculation method with respect to computation time. For a given distance threshold, we measure the pairwise edit distance calculation time as well as the hierarchical clustering time. The pairwise edit distance calculation time is compared with that of the dynamic programming based method and Slidesort. However, we do not compare with the state-of-the-art bounded exact edit distance method proposed by Thankachan et al. because, for larger edit distance thresholds, the suffix tree construction complexity of this method is not strictly sub-quadratic.

6.3.1 Edit Distance Calculation Time

In Table 6.1, we present the all pairs distance calculation time of the experimental datasets for increasing distance thresholds. The quadratic run time edit distance calculation method is used as the baseline method.
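The band property of Theorem 4 is what makes the bounded computation cheap: table cells with |i − j| > d can never contribute to a minimum cost trace and are skipped. A minimal banded dynamic program illustrates this; it is an illustrative stand-in for the sub-sequence based method above, not the author's implementation.

```python
def bounded_edit_distance(a, b, d):
    # By Theorem 4, every (i, j) in a minimum cost trace satisfies
    # |i - j| <= d, so only the diagonal band of the table is computed.
    # Returns the exact distance when it is <= d, otherwise None.
    if abs(len(a) - len(b)) > d:
        return None
    INF = float("inf")
    prev = {j: j for j in range(min(len(b), d) + 1)}  # row i = 0
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - d), min(len(b), i + d) + 1):
            if j == 0:
                cur[j] = i
                continue
            cur[j] = min(prev.get(j, INF) + 1,       # deletion
                         cur.get(j - 1, INF) + 1,    # insertion
                         prev.get(j - 1, INF) + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    result = prev.get(len(b), INF)
    return result if result <= d else None
```

Each row touches at most 2d + 1 cells, so the work is O(l * d) for sequences of length l, matching the complexity derived in Section 6.2.1.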
We compare the required time of the proposed bounded exact edit distance calculation method with that of the baseline method and the k-mer chain based Slidesort method. The proposed method is an order of magnitude faster than the baseline method for all the distance thresholds.

Table 6.1: Comparison of all pairs distance calculation time (in hours) of the proposed bounded exact edit distance calculation method with that of the existing dynamic programming based method and Slidesort (Ep = bounded exact edit distance, S = Slidesort)

Datasets  Baseline Method  ed ≤ 3 (Ep/S)  ed ≤ 5 (Ep/S)  ed ≤ 8 (Ep/S)
Dm        0.614            0.021/0.044    0.023/0.184    0.024/0.38
Dh        0.931            0.034/0.041    0.038/0.21     0.041/0.43
Dr        1.771            0.083/0.036    0.092/0.16     0.097/0.41
Ds        17.91            0.28/0.138     0.32/0.61      0.35/1.72
Dp        411.17           4.8/0.543      5.62/5.26      5.95/7.22
Dz        641.4            6.63/0.71      7.85/6.8       8.46/10.3

The Slidesort method is faster than the proposed method for smaller distance thresholds because the k-mer based filtering of Slidesort can discard most of the pairs for such thresholds. However, for larger distance thresholds, the number of candidate pairs increases rapidly with Slidesort. Since Slidesort uses the existing quadratic time complexity based method for computing the edit distances of the candidate pairs, its required time increases rapidly with increasing thresholds. On the other hand, for larger distance thresholds, the required time of the proposed bounded exact edit distance calculation method increases very slowly. Hence, our proposed method is more scalable for increasing numbers of pairs and higher distance thresholds.

6.3.2 Hierarchical Clustering Time

We also compare the average linkage hierarchical clustering time based on the proposed bounded exact edit distance method with that of the existing dynamic programming based baseline method. The similarity threshold is set to 97% for this experiment.
As shown in Table 6.2, hierarchical clustering based on the proposed method is an order of magnitude faster than the baseline method.

Table 6.2: Comparison of hierarchical clustering time (in hours) of the proposed bounded exact edit distance calculation method with that of the existing dynamic programming based method for a given distance threshold

Datasets  The Edit Distance  The Proposed Bounded Edit Distance
Dm        0.634              0.027
Dh        0.95               0.048
Dr        1.796              0.103
Ds        18.01              0.39
Dp        411.5              6.5
Dz        642.3              8.5

The graph based clustering time for high similarity thresholds is a small fraction of the all pairs edit distance calculation time. As a result, the speedup of the proposed method is primarily due to the near linear edit distance calculation time for the given high similarity threshold.

6.4 Summary

In this chapter, we propose a bounded exact edit distance calculation method. For a given distance bound, we divide the sequences into corresponding sub-sequences in such a way that no edit operation is shared between neighboring sub-sequences. Then, we compute partial edit distances for each pair of corresponding sub-sequences and accumulate them to determine the edit distance between the entire sequences. We show that the proposed method is significantly faster than the existing exact edit distance calculation methods. The effectiveness of the proposed method is evaluated with respect to the all pairs edit distance calculation time as well as the hierarchical clustering time based on the computed pairwise edit distances.

Chapter 7

Conclusion

7.1 Key Contributions

The primary contributions of this dissertation are as follows:

• We have proposed a number of fast edit distance calculation methods to determine the dissimilarity between NGS sequences. In doing so, we have introduced a new class of edit distance prediction approaches based on dataset independent reference sequences.
• Initially, we used randomly generated reference sequences for computing the dissimilarity between NGS sequences. For this approach, we convert the sequences of a dataset into feature vectors and use metric distance measures, such as the Euclidean distance and the Chebyshev distance, for edit distance prediction. The proposed methods are an order of magnitude faster than the existing quadratic run time edit distance calculation method. For high similarity thresholds, the distance prediction accuracy is very high across different scenarios.

• We have also developed a small set of deterministic reference sequences with maximum distances among them. This method runs in linear time with respect to the sequence lengths. Further, its edit distance prediction accuracy for larger edit distances is significantly higher than that of the random reference sequence based methods.

• We have conducted extensive experiments and theoretical analysis to justify the superior performance of the proposed reference sequence based edit distance prediction methods. For high similarity thresholds, the predicted distances are nearly accurate, and when wrong they are very close to the correct distances. Because of this, hierarchical clustering using the predicted distances produces clusters nearly identical to those based on the actual edit distances for high similarity thresholds.

• We have also proposed an efficient bounded exact edit distance calculation method. For small edit distance thresholds, the method runs in near linear time.

• The edit distance calculation time using the proposed bounded exact method is an order of magnitude faster than that of the existing quadratic run time based methods.

7.2 Discussions

Efficient edit distance calculation methods are very important for comparing strings and sequences because this distance measure is the most appropriate for determining the dissimilarity between a pair of strings.
For metagenomic short gene sequences generated using NGS technologies, sequence similarity is the only way to determine the phylogenetic relation between them. However, the standard edit distance computation complexity is O(l^2), where l is the length of the sequences. This computational complexity becomes a performance bottleneck when computing the similarity between large numbers of pairs of sequences, which is increasingly common due to the advancement of NGS technologies. As a result, reducing the edit distance computation time without significant loss in accuracy has been an active research problem.

In this dissertation, we have proposed a few edit distance prediction methods and a bounded exact edit distance calculation method. For the edit distance prediction methods, we have developed a set of dataset independent reference sequences that are minimally correlated with each other. The reference sequences convert sequences in datasets into feature vectors. Using these feature vectors, we have proposed edit distance prediction methods in the metric space. Initially, we used the Euclidean distances between the feature vectors to predict the distances between the corresponding sequences. Since there is no direct relationship between the edit distance and the Euclidean distance in the metric space for a given pair of sequences and their corresponding feature vectors, we developed a mapping between these two distance measures based on empirical knowledge. This prediction technique is effective for clustering NGS sequences, especially for high similarity thresholds such as creating OTUs. However, the accuracy of the resulting hierarchical clusters drops rapidly for predicted edit distances of pairs with larger edit distances. Our proposed MVD (the Chebyshev distance) based edit distance prediction method has addressed this lack of relationship between the actual edit distances and the predicted distances.
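The vector-space prediction described above can be sketched as follows. This is illustrative only: random reference sequences, unit-cost edit distance, and the Chebyshev distance of the feature vectors as the predictor; the dissertation's empirical mapping and post-processing are omitted.

```python
import random

def edit_distance(a, b):
    # Standard quadratic dynamic program, single rolling row.
    m, n = len(a), len(b)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1,
                                       prev + (a[i - 1] != b[j - 1]))
    return row[n]

def to_feature_vector(seq, refs):
    # One coordinate per reference sequence.
    return [edit_distance(seq, r) for r in refs]

def chebyshev(u, v):
    # Maximum coordinate-wise difference; by the triangle inequality each
    # coordinate difference is a lower bound on the true edit distance.
    return max(abs(x - y) for x, y in zip(u, v))

rng = random.Random(0)
refs = ["".join(rng.choice("ACGT") for _ in range(8)) for _ in range(16)]
a, b = "ACGTACGT", "ACGAACGT"
predicted = chebyshev(to_feature_vector(a, refs), to_feature_vector(b, refs))
```

The prediction never exceeds the true edit distance, so it is usable as a fast filter; the empirical mapping mentioned above is what turns the lower bound into a calibrated estimate.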
Further, this method is able to predict lower edit distances nearly accurately, thereby making the predicted distances useful for applications other than clustering of NGS sequences. However, for larger edit distances, the MVD based prediction cannot provide satisfactory accuracy despite using a large number of reference sequences.

In order to address the poor prediction accuracy for higher edit distances and to reduce the computational overhead of creating feature vectors using hundreds of reference sequences, we have also developed an edit distance prediction method for NGS sequences based on a small set of special reference sequences. The distance among the reference sequences is the maximum, i.e., exactly the length of the largest dataset sequence. The computational complexity of this method is linear with respect to the sequence lengths. We have developed several strategies to provide increased prediction accuracy for relatively higher edit distances compared to our previously proposed methods. Despite its good prediction accuracy in many cases, this prediction method performs poorly when multiple edit operations appear in close proximity.

On the other hand, we have proposed an efficient bounded exact edit distance calculation method using the trace cost [1] of the sub-sequences. For a given edit distance threshold d where d^2 ≤ l, the proposed method runs in near linear time. This method can be used separately for efficient edit distance calculation as well as in conjunction with the other proposed distance prediction methods to avoid costly quadratic time computation for NGS sequence analysis.

We have conducted extensive experiments on the 16s rRNA metagenomic NGS sequences using the proposed edit distance calculation methods to evaluate computation time and accuracy.
All of our proposed methods show high accuracy in creating highly similar clusters from these sequences while achieving an order of magnitude speedup compared to the existing near quadratic edit distance calculation methods. For smaller edit distances, our proposed MVD and sub-sequence based methods are highly accurate, and the predicted distances are close to the actual ones when they are not exact. We have also shown the effectiveness of our proposed methods compared to that of k-mer based similarity by comparing both the required time and accuracy.

7.3 Possible Extensions

In this section, we briefly present possible extensions of the proposed edit distance calculation methods with respect to optimization of these methods, application to NGS sequences other than 16s rRNA, and potential usability of the proposed methods in relevant fields and data types.

7.3.1 Extending the Proposed Methods

In the near future, we plan to extend our proposed edit distance calculation methods using the following strategies to further improve their effectiveness.

First, for the MVD based edit distance prediction method, we plan to incorporate domain knowledge by choosing reference sequences from the datasets. Like the dataset independent random references, these reference sequences from the datasets will also be far from each other. However, the distances are expected to be much smaller than those of dataset independent reference sequences because of the conserved regions among sequences in the datasets. We expect that this strategy will improve prediction for larger edit distances because it will focus only on particular regions of the sequences. Further, the MVD and FVD based heuristics are expected to be based on domain knowledge.

Second, for the sub-sequence based edit distance prediction method, we plan to use k-mer based strategies instead of the primary and secondary sub-sequences as well as the merging strategies.
While these strategies provide good prediction accuracy, they also depend on sequential processing of the sub-sequences. Our objective with k-mer based strategies is to develop a more robust method without sacrificing computation time or accuracy.

7.3.2 Similarity Comparison of Other Ribosomal Regions

We intend to optimize our edit distance calculation methods for other regions (short sequences) within ribosomal transcripts, such as Internal Transcribed Spacer (ITS) datasets and 28s rDNA datasets. Recently, ITS sequences have been used more frequently for phylogenetic analysis because of their increased variability. However, this creates a challenge for our proposed methods because our experimental analyses are based on the more conserved 16s rRNA sequence regions.

7.3.3 Applying Proposed Methods to Relevant Problems

The proposed edit distance calculation methods are also applicable to similar data types such as text datasets, network connectivity and graph datasets. These differ from NGS sequences with respect to length, alphabet size, length difference and sentiment information. In some cases, they may require non-uniform costs for different edit operations [50]. Utilizing the key ideas of the proposed distance calculation methods for these datasets might be interesting. Another potential application of the proposed reference sequence based edit distance calculation method is computing the graph edit distance [51, 52], such as in similarity search in Heterogeneous Information Networks (HIN) [52]. As with the dataset independent reference sequences, we can create core sub-graphs and compute the dissimilarity between them for efficient computation of the dissimilarity between graph fragments.

APPENDIX

Probability Distribution of δ for Individual Dimensions

For a given Hamming distance dh, from Theorem 1, we know that 0 ≤ δ ≤ dh.
According to the definition of the Hamming distance, there are dh positions at which the two sequences differ. Given a random reference sequence Ri, for all j = 1, 2, ..., dh we have

    ∆x_ij = +1 if x^A_ij = R_ij and x^B_ij ≠ R_ij
            −1 if x^A_ij ≠ R_ij and x^B_ij = R_ij
             0 if x^A_ij ≠ R_ij and x^B_ij ≠ R_ij

where j indexes the character positions at which the two sequences differ from each other, and Ri is a random reference point. Given that there are only four nucleotides, T, A, C, and G, there are two cases where x^A_ij ≠ R_ij and x^B_ij ≠ R_ij, and one case for each of the other two conditions above. Therefore, the probability of occurrence of each case is given by:

    Pr(∆x_ij) = 1/4 if ∆x_ij = +1
                1/4 if ∆x_ij = −1
                1/2 if ∆x_ij = 0

Then, ∆x_i = | Σ_{j=1..dh} ∆x_ij |, which can be modeled as the absolute value of a sum of finitely many discrete random variables with a multinomial distribution, leading to Eq. 4.

Probability Distribution of the Euclidean Distance in Feature Space for a Given Hamming Distance

If S_{n−1} is the squared Euclidean distance between the transformed sequences A and B, i.e. ||Ω(A) − Ω(B)||², with n − 1 features, and its probability distribution Pr_{n−1}(S_{n−1}) is given by the convolution of the single-feature distributions for 0 ≤ S_{n−1} ≤ (n − 1) × (dh)², then adding a new feature gives

    S_n = S_{n−1} + (∆x_n)²

Then, the probability distribution of S_n for every z in [0, ..., n × (dh)²] can be written as

    Pr_n(S_n = z) = Σ_{k=0..dh} Pr_1(S_1 = k²) × Pr_{n−1}(S_{n−1} = z − k²)

where Pr_1(S_1) = Pr(∆x = S_1) as given in Eq. 3.3. Note: in order to reduce the number of computations for large values of n and dh, we used a cutoff of 1e−10 to remove small probabilities.

BIBLIOGRAPHY

[1] R. A. Wagner and M. J. Fischer, “The string-to-string correction problem,” Journal of the ACM (JACM), vol. 21, no. 1, pp. 168–173, 1974.

[2] W.
J. Masek and M. S. Paterson, “A faster algorithm computing string edit distances,” Journal of Computer and System Sciences, vol. 20, no. 1, pp. 18–31, 1980.

[3] A. Backurs and P. Indyk, “Edit distance cannot be computed in strongly subquadratic time (unless SETH is false),” in Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing. ACM, 2015, pp. 51–58.

[4] H. Ochman and I. B. Jones, “Evolutionary dynamics of full genome content in Escherichia coli,” The EMBO Journal, vol. 19, no. 24, pp. 6637–6643, 2000.

[5] A. Abboud, R. Williams, and H. Yu, “More applications of the polynomial method to algorithm design,” in Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2015, pp. 218–230.

[6] A. Andoni, R. Krauthgamer, and K. Onak, “Polylogarithmic approximation for edit distance and the asymmetric query complexity,” in 2010 IEEE 51st Annual Symposium on Foundations of Computer Science. IEEE, 2010, pp. 377–386.

[7] A. Abboud and A. Backurs, “Towards hardness of approximation for polynomial time problems,” in 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2017.

[8] D. Chakraborty, D. Das, E. Goldenberg, M. Koucky, and M. Saks, “Approximating edit distance within constant factor in truly sub-quadratic time,” in 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2018, pp. 979–990.

[9] G. M. Landau, E. W. Myers, and J. P. Schmidt, “Incremental string comparison,” SIAM Journal on Computing, vol. 27, no. 2, pp. 557–582, 1998.

[10] Z. Bar-Yossef, T. Jayram, R. Krauthgamer, and R. Kumar, “Approximating edit distance efficiently,” in 45th Annual IEEE Symposium on Foundations of Computer Science. IEEE, 2004, pp. 550–559.

[11] T. Batu, F. Ergun, and C.
Sahinalp, “Oblivious string embeddings and edit distance approximations,” in Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2006, pp. 792–801.

[12] A. Andoni and K. Onak, “Approximating edit distance in near-linear time,” SIAM Journal on Computing, vol. 41, no. 6, pp. 1635–1648, 2012.

[13] S. V. Thankachan, C. Aluru, S. P. Chockalingam, and S. Aluru, “Algorithmic framework for approximate matching under bounded edits with applications to sequence analysis,” in International Conference on Research in Computational Molecular Biology. Springer, 2018, pp. 211–224.

[14] M. E. Carew, V. J. Pettigrove, L. Metzeling, and A. A. Hoffmann, “Environmental monitoring using next generation sequencing: rapid identification of macroinvertebrate bioindicator species,” Frontiers in Zoology, vol. 10, no. 1, p. 45, 2013.

[15] P. K. Lindeque, H. E. Parry, R. A. Harmer, P. J. Somerfield, and A. Atkinson, “Next generation sequencing reveals the hidden diversity of zooplankton assemblages,” PLoS ONE, vol. 8, no. 11, p. e81327, 2013.

[16] W. Chen, C. K. Zhang, Y. Cheng, S. Zhang, and H. Zhao, “A comparison of methods for clustering 16s rRNA sequences into OTUs,” PLoS ONE, vol. 8, no. 8, pp. 36–44, 2013.

[17] M. Kim, K.-H. Lee, S.-W. Yoon, B.-S. Kim, J. Chun, and H. Yi, “Analytical tools and databases for metagenomics in the next-generation sequencing era,” Genomics & Informatics, vol. 11, no. 3, p. 102, 2013.

[18] E. Stackebrandt, “Taxonomic parameters revisited: tarnished gold standards,” Microbiol. Today, vol. 33, pp. 152–155, 2006.

[19] M. Rossi-Tamisier, S. Benamar, D. Raoult, and P.-E. Fournier, “Cautionary tale of using 16s rRNA gene sequence similarity values in identification of human-associated bacterial species,” International Journal of Systematic and Evolutionary Microbiology, vol. 65, no. 6, pp. 1929–1934, 2015.

[20] E. Stackebrandt and B. M.
Goebel, “Taxonomic note: a place for DNA-DNA reassociation and 16s rRNA sequence analysis in the present species definition in bacteriology,” International Journal of Systematic and Evolutionary Microbiology, vol. 44, no. 4, pp. 846–849, 1994.

[21] B. D. Ondov, T. J. Treangen, P. Melsted, A. B. Mallonee, N. H. Bergman, S. Koren, and A. M. Phillippy, “Mash: fast genome and metagenome distance estimation using MinHash,” Genome Biology, vol. 17, no. 1, p. 132, 2016.

[22] Z. Rasheed, H. Rangwala, and D. Barbará, “16s rRNA metagenome clustering and diversity estimation using locality sensitive hashing,” BMC Systems Biology, vol. 7, no. 4, p. S11, 2013.

[23] T. Rognes, T. Flouri, B. Nichols, C. Quince, and F. Mahé, “VSEARCH: a versatile open source tool for metagenomics,” PeerJ, vol. 4, p. e2584, 2016.

[24] R. C. Edgar, “Search and clustering orders of magnitude faster than BLAST,” Bioinformatics, vol. 26, no. 19, pp. 2460–2461, 2010.

[25] K. Shimizu and K. Tsuda, “SlideSort: all pairs similarity search for short reads,” Bioinformatics, vol. 27, no. 4, pp. 464–470, 2010.

[26] J. Abello, P. M. Pardalos, and M. G. Resende, Handbook of Massive Data Sets. Springer, 2013, vol. 4.

[27] P. Bille and M. Farach-Colton, “Fast and compact regular expression matching,” Theoretical Computer Science, vol. 409, no. 3, pp. 486–496, 2008.

[28] R. Impagliazzo, R. Paturi, and F. Zane, “Which problems have strongly exponential complexity?” Journal of Computer and System Sciences, vol. 63, no. 4, pp. 512–530, 2001.

[29] Z. Rasheed and H. Rangwala, “MC-MinH: metagenome clustering using minwise based hashing,” in Proceedings of the 2013 SIAM International Conference on Data Mining. SIAM, 2013, pp. 677–685.

[30] J. Venkateswaran, D. Lachwani, T. Kahveci, and C. Jermaine, “Reference-based indexing of sequence databases,” in Proceedings of the 32nd International Conference on Very Large Data Bases. VLDB Endowment, 2006, pp. 906–917.

[31] P. Papapetrou, V. Athitsos, G.
Kollios, and D. Gunopulos, “Reference-based alignment in large sequence databases,” Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 205–216, 2009.

[32] S. Wandelt, J. Starlinger, M. Bux, and U. Leser, “RCSI: scalable similarity search in thousand(s) of genomes,” Proceedings of the VLDB Endowment, vol. 6, no. 13, pp. 1534–1545, 2013.

[33] A. M. Benjamin, M. Nichols, T. W. Burke, G. S. Ginsburg, and J. E. Lucas, “Comparing reference-based RNA-seq mapping methods for non-human primate data,” BMC Genomics, vol. 15, no. 1, p. 570, 2014.

[34] S. Pramanik, A. T. Islam, and S. Sural, “Predicted edit distance based clustering of gene sequences,” in 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 2018, pp. 1206–1211.

[35] A. T. Islam, S. Pramanik, V. Mirjalili, and S. Sural, “Restrac: Reference sequence based space transformation for clustering,” in 2017 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 2017, pp. 462–469.

[36] F. Murtagh and P. Contreras, “Algorithms for hierarchical clustering: an overview,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 86–97, 2012.

[37] P. D. Schloss, S. L. Westcott, T. Ryabin, J. R. Hall, M. Hartmann, E. B. Hollister, R. A. Lesniewski, B. B. Oakley, D. H. Parks, C. J. Robinson et al., “Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities,” Applied and Environmental Microbiology, vol. 75, no. 23, pp. 7537–7541, 2009.

[38] S. M. Huse, D. M. Welch, H. G. Morrison, and M. L. Sogin, “Ironing out the wrinkles in the rare biosphere through improved OTU clustering,” Environmental Microbiology, vol. 12, no. 7, pp. 1889–1898, 2010.

[39] T. S. Schmidt, J. F. Matias Rodrigues, and C. von Mering, “Limits to robustness and reproducibility in the demarcation of operational taxonomic units,” Environmental Microbiology, vol. 17, no. 5, pp. 1689–1706, 2015.

[40] S. L. Westcott and P. D.
Schloss, “OptiClust, an improved method for assigning amplicon-based sequence data to operational taxonomic units,” mSphere, vol. 2, no. 2, p. e00073-17, 2017.

[41] R. C. Edgar, “MUSCLE: multiple sequence alignment with high accuracy and high throughput,” Nucleic Acids Research, vol. 32, no. 5, pp. 1792–1797, 2004.

[42] W. Li and A. Godzik, “Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences,” Bioinformatics, vol. 22, no. 13, pp. 1658–1659, 2006.

[43] X. Hao, R. Jiang, and T. Chen, “Clustering 16s rRNA for OTU prediction: a method of unsupervised Bayesian clustering,” Bioinformatics, vol. 27, no. 5, pp. 611–618, 2011.

[44] R. Leinonen, H. Sugawara, and M. Shumway, “International Nucleotide Sequence Database Collaboration (2011). The Sequence Read Archive,” Nucleic Acids Research, vol. 39 (Database issue), pp. 19–21, 2011.

[45] N. Siva, “1000 Genomes project,” 2008.

[46] K. Kameshwaran and K. Malarvizhi, “Survey on clustering techniques in data mining,” International Journal of Computer Science and Information Technologies, vol. 5, no. 2, pp. 2272–2276, 2014.

[47] “Inter-comparison of marine plankton metagenome analysis methods,” 2017. [Online]. Available: https://www.ncbi.nlm.nih.gov/sra/ERX2155923[accn]

[48] “16s rRNA sequencing of Sao Paulo Zoo compost,” 2016. [Online]. Available: https://www.ncbi.nlm.nih.gov/sra/SRX1537393[accn]

[49] T.-D. Nguyen, B. Schmidt, and C.-K. Kwoh, “SparseHC: a memory-efficient online hierarchical clustering algorithm,” Procedia Computer Science, vol. 29, pp. 8–19, 2014.

[50] S. V. Rice, H. Bunke, and T. A. Nartker, “Classes of cost functions for string edit distance,” Algorithmica, vol. 18, no. 2, pp. 271–280, 1997.

[51] R. Dijkman, M. Dumas, and L. García-Bañuelos, “Graph matching algorithms for business process model similarity search,” in International Conference on Business Process Management. Springer, 2009, pp. 48–63.

[52] J. Lu, N. Lu, S. Ma, and B.
Zhang, “Edit distance based similarity search of heterogeneous information networks,” in International Conference on Database and Expert Systems Applications. Springer, 2018, pp. 195–202.
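SUPPLEMENTARY CODE SKETCH

The appendix derivation above can be made concrete with a short Python sketch: it builds the single-feature distribution Pr(∆x) (each of the dh differing positions contributes +1 or −1 with probability 1/4 and 0 with probability 1/2, and ∆x is the absolute value of their sum), then applies the convolution recurrence Pr_n(S_n = z) = Σ_k Pr_1(S_1 = k²) · Pr_{n−1}(S_{n−1} = z − k²) with the 1e−10 pruning cutoff mentioned in the text. This is a minimal illustration of the equations, not code from the dissertation; the function names are our own.

```python
def single_delta_pmf(dh):
    """Pr(dx = s) for one feature: dx = |sum of dh per-position terms|,
    each term being +1 w.p. 1/4, -1 w.p. 1/4, and 0 w.p. 1/2."""
    pmf = {0: 1.0}
    for _ in range(dh):
        nxt = {}
        for v, p in pmf.items():
            for step, q in ((1, 0.25), (-1, 0.25), (0, 0.5)):
                nxt[v + step] = nxt.get(v + step, 0.0) + p * q
        pmf = nxt
    folded = {}
    for v, p in pmf.items():            # fold the signed sum by |.|
        folded[abs(v)] = folded.get(abs(v), 0.0) + p
    return folded                       # support: 0 .. dh


def squared_distance_pmf(n, dh, cutoff=1e-10):
    """Pr_n(S_n = z): convolve n copies of the distribution of (dx)^2,
    pruning probabilities below `cutoff` as the appendix note suggests."""
    sq = {k * k: p for k, p in single_delta_pmf(dh).items()}
    dist = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for z, p in dist.items():
            for s, q in sq.items():     # one convolution step per feature
                nxt[z + s] = nxt.get(z + s, 0.0) + p * q
        dist = {z: p for z, p in nxt.items() if p > cutoff}
    return dist                         # support: 0 .. n * dh**2
```

For example, `single_delta_pmf(1)` yields Pr(∆x = 0) = Pr(∆x = 1) = 1/2, matching the three-case probabilities above, and the values of `squared_distance_pmf(n, dh)` sum to 1 up to the pruned mass.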