Data-driven and task-specific scoring functions for predicting ligand binding poses and affinity and for screening enrichment

Molecular modeling has become an essential tool in the early stages of drug discovery and development. Molecular docking, scoring, and virtual screening are three modeling tasks of particular importance in computer-aided drug discovery. They are used to computationally simulate the interaction between small drug-like molecules, known as ligands, and a target protein whose activity is to be altered. Scoring functions (SFs) are typically employed to predict the binding conformation (docking task), binary activity label (screening task), and binding affinity (scoring task) of ligands against a critical protein in the disease's pathway. In most molecular docking software packages available today, a single generic binding affinity-based (BA-based) SF is invoked for all three tasks, even though they are three different, but related, prediction problems. The vast majority of these predictive models are knowledge-based, empirical, or force-field scoring functions. A fourth family of SFs, which has gained popularity recently and has shown potential for improved accuracy, is based on machine-learning (ML) approaches. Despite intense efforts in developing conventional and current ML SFs, their limited predictive accuracy in these three tasks has been a major roadblock toward cost-effective drug discovery. Therefore, in this work we present (i) novel task-specific and multi-task SFs employing large ensembles of deep neural networks (NNs) and other state-of-the-art ML algorithms in conjunction with (ii) data-driven multi-perspective descriptors (features) for accurate characterization of protein-ligand complexes (PLCs), extracted using our Descriptor Data Bank (DDB) platform. We assess the docking, screening, scoring, and ranking accuracies of the proposed task-specific SFs with DDB descriptors, as well as several conventional approaches, in the context of the 2007 and 2014 PDBbind benchmarks, which encompass a diverse set of high-quality PLCs.
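The accuracy measures behind this kind of assessment can be sketched in a few lines. This is a minimal illustration only, not the evaluation code used in the thesis: it assumes scoring accuracy is the Pearson correlation between predicted and measured binding affinities, docking accuracy is the fraction of complexes whose top-scored pose lies within an RMSD cutoff (2 Å is the cutoff quoted in the abstract) of the native pose, and screening accuracy is an enrichment factor over the top-ranked fraction of a library. The function names and the `top_fraction` parameter are hypothetical.

```python
import math

def pearson(predicted, measured):
    """Pearson correlation between predicted and measured binding affinities."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    std_m = math.sqrt(sum((m - mean_m) ** 2 for m in measured))
    return cov / (std_p * std_m)

def docking_success_rate(top_pose_rmsds, cutoff=2.0):
    """Fraction of complexes whose top-scored pose is within `cutoff` angstroms
    RMSD of the native pose (one RMSD value per complex)."""
    return sum(rmsd <= cutoff for rmsd in top_pose_rmsds) / len(top_pose_rmsds)

def enrichment_factor(scores, is_active, top_fraction=0.01):
    """Ratio of the active rate in the top-ranked fraction to the active rate
    in the whole library (higher scores are assumed to mean 'more likely active')."""
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(top_fraction * len(ranked)))
    actives_in_top = sum(active for _, active in ranked[:n_top])
    total_actives = sum(is_active)
    return (actives_in_top / n_top) / (total_actives / len(ranked))
```

For example, if one of two actives in a ten-molecule library is ranked in the top 20%, the top set has an active rate of 1/2 against a library rate of 2/10, giving an enrichment factor of 2.5.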
Our approaches substantially outperform conventional SFs based on BA and single-perspective descriptors in all tests. In terms of scoring accuracy, we find that the ensemble NN SFs, BsN-Score and BgN-Score, have more than 34% better correlation (0.844 and 0.840 vs. 0.627) between predicted and measured BAs than X-Score, a top-performing conventional SF. We further find that ensemble NN models surpass SFs based on other state-of-the-art ML algorithms. Similar results have been obtained for the ranking task. Within clusters of PLCs with different ligands bound to the same target protein, we find that the best ensemble NN SF ranks the ligands correctly 64.6% of the time, compared to 57.8% for X-Score. A substantial improvement in the docking task has also been achieved by our proposed docking-specific SFs. We find that the docking NN SF, BsN-Dock, has a success rate of 95% in identifying poses that are within 2 Å RMSD of the native poses across 65 different protein families. This compares to a success rate of only 82% achieved by the best conventional SF, ChemPLP, employed in the commercial docking software GOLD. As for the ability to distinguish active molecules from inactive ones, our screening-specific SFs showed marked improvements over the conventional approaches. The proposed SF BsN-Screen achieved a screening enrichment factor of 33.90, as opposed to 19.54 obtained by the best conventional SF, GlideScore, employed in the docking software Glide. For all tasks, we observed that the proposed task-specific SFs benefit more than their conventional counterparts from increases in the number of descriptors and training PLCs. They also perform better on novel proteins that were absent from their training data. In addition to the three task-specific SFs, we propose a novel multi-task deep neural network (MT-Net) that is trained on data from the three tasks to simultaneously predict binding poses, affinities, and activity labels.
MT-Net is composed of hidden layers shared by the three tasks to learn common features, task-specific hidden layers for higher-level feature representation, and three outputs, one per task. We show that the performance of MT-Net is superior to conventional SFs and competitive with other ML approaches. Based on current results and potential improvements, we believe our proposed ideas will have a transformative impact on the accuracy and outcomes of molecular docking and virtual screening.
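The layer arrangement described for MT-Net (shared hidden layers, then task-specific hidden layers, then one output per task) can be illustrated with a small forward-pass sketch. Everything concrete here is an assumption made for illustration and does not come from the thesis: the layer sizes, the ReLU and sigmoid activations, the random untrained weights, and the names `MultiTaskNet`, `make_layer`, and `apply_layer` are all hypothetical. Training, which the abstract says draws on data from all three tasks, is not shown.

```python
import math
import random

TASKS = ("docking", "scoring", "screening")

def make_layer(n_in, n_out, rng):
    """Fully connected layer as (weights, biases) with small random weights."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def apply_layer(layer, inputs, activation):
    """Compute activation(W @ inputs + b) for a single input vector."""
    weights, biases = layer
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class MultiTaskNet:
    """Hypothetical multi-task network: a shared trunk feeding three task heads."""

    def __init__(self, n_descriptors, n_shared=16, n_task=8, seed=0):
        rng = random.Random(seed)
        # Shared hidden layers: used by all three tasks to learn common features.
        self.shared = [make_layer(n_descriptors, n_shared, rng),
                       make_layer(n_shared, n_shared, rng)]
        # Task-specific hidden layer and single-unit output layer per task.
        self.task_hidden = {t: make_layer(n_shared, n_task, rng) for t in TASKS}
        self.task_output = {t: make_layer(n_task, 1, rng) for t in TASKS}

    def forward(self, descriptors):
        hidden = descriptors
        for layer in self.shared:
            hidden = apply_layer(layer, hidden, relu)
        outputs = {}
        for task in TASKS:
            task_features = apply_layer(self.task_hidden[task], hidden, relu)
            # Assumption: screening yields a probability, the other heads a raw score.
            out_act = sigmoid if task == "screening" else identity
            outputs[task] = apply_layer(self.task_output[task], task_features, out_act)[0]
        return outputs
```

Calling `MultiTaskNet(n_descriptors=5).forward([0.1, 0.2, 0.3, 0.4, 0.5])` returns a dictionary with one value per task. The point of the arrangement is that all three heads read the same shared representation, so descriptors that are informative for one task can also inform the others.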

In Collections: Electronic Theses & Dissertations

Copyright Status: In Copyright

Material Type: Theses

Authors: Ashtawy, Hossam Mohamed Farg

Thesis Advisors: Mahapatra, Nihar

Committee Members: Salem, Fathi
Chen, Jin
Sun, Yanni

Date Published: 2017

Subjects: Protein binding--Computer simulation
Molecules--Models
Machine learning
Ligands
Computer simulation

Program of Study: Electrical Engineering - Doctor of Philosophy

Degree Level: Doctoral

Language: English

Pages: xvii, 188 pages

ISBN: 9780355086249
0355086247

Permalink: https://doi.org/doi:10.25335/n8tz-5444
