Hypotheses for a New Generation: Leveraging natural language processing to bridge gaps and generate novel hypotheses for desiccation tolerance research

Scientific hypotheses, which are explanations of natural phenomena that can be tested and falsified, are at the core of empirical biology research. Hypotheses about genes involved in biological processes or interactions between species in an ecological setting are used to design research studies and make discoveries about the natural world. However, generating a novel hypothesis requires a great deal of manual labor, including sifting through and reading numerous previously published research articles. Due to the explosion of scientific literature in the last century, there is too much material in any given field for scientists to read and process while generating new hypotheses, leading to a sensation of information overload. Information overload is the state in which the information inputs to a system overwhelm its information processing capacity. It is not a new phenomenon; since the advent of the written word, academics have bemoaned the deluge of written resources. One possible way to ameliorate the sensation of information overload is automated hypothesis generation, whereby literature is automatically processed to propose new connections between biological entities. In particular, this dissertation focuses on the use of knowledge graphs, which are networks in which nodes are entities of interest, like genes or proteins, and edges are the biological relationships between them. While methods for automated hypothesis generation from the literature using knowledge graphs have been used in the biomedical domain to generate hypotheses for phenomena like adverse drug reactions or drug-disease interactions, limited work has been done to translate these methods into the plant science domain. This dissertation uses natural language processing techniques both to perform automated hypothesis generation in the field of desiccation tolerance biology and to explore the research landscape of that field.
Desiccation tolerance is the ability of an organism to revive from the loss of nearly all internal water, and exists across the kingdoms of life. Nearly all land plants exhibit desiccation tolerance in seeds; however, whole-plant vegetative desiccation tolerance is much rarer, and whole-organism desiccation tolerance in other kingdoms of life is also rare. As a result, the field of desiccation tolerance research is much smaller than related fields such as drought tolerance, and has far fewer resources, both experimental, like transformation systems for desiccation-tolerant organisms, and informational, as manually curated databases focus on model and crop species, which do not exhibit whole-plant desiccation tolerance. Many current knowledge graphs in the plant sciences are built from manually curated databases such as Planteome and UniProt, and therefore lack rich information on desiccation tolerance from which to generate hypotheses. Automatic, high-throughput information extraction from the scientific literature to identify new entities and relationships in an understudied group of organisms is therefore a promising approach to filling the data gaps in databases that affect knowledge graph-based hypothesis generation. The first chapter of this dissertation reviews the history of information overload and hypothesis generation, and briefly introduces desiccation tolerance as a research system. Chapter two presents a dataset for the molecular plant sciences labeled with biological entities and relationships that can be used to train information extraction models, and evaluates several existing methods on this dataset. In chapter two, I find that models from other scientific disciplines are insufficient for high-quality information extraction in plant science, and that training a new model yields improved performance.
In chapter three of this dissertation, I use bibliometric methods and topic modeling to explore the research landscape of desiccation tolerance, and find that the various study systems (animal, plant, fungal, and microbial) are highly siloed, or isolated, from one another, even though mechanisms for desiccation tolerance are shared across the kingdoms of life. Additionally, I design a rule-based algorithm that uses bibliometric data to recommend new attendees for a specialized desiccation tolerance conference. Finally, in the fourth chapter, I explore the possibilities for constructing a knowledge graph of desiccation and drought tolerance research, and for using the resulting graph to predict novel hypotheses about the mechanisms of desiccation tolerance. My work shows that, with the chosen data sources and methods, information extraction and knowledge graph-based hypothesis generation are not adequate for producing high-quality hypotheses. In the final chapter, I reflect on the limitations and potential future directions of automated hypothesis generation for biology. I hope that this research provides insight into information management and hypothesis generation in the plant sciences.
