HYPOTHESES FOR A NEW GENERATION: LEVERAGING NATURAL LANGUAGE PROCESSING TO BRIDGE GAPS AND GENERATE NOVEL HYPOTHESES FOR DESICCATION TOLERANCE RESEARCH By Serena Ghantous Lotreck A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Plant Biology—Doctor of Philosophy Molecular Plant Sciences—Dual Major 2024 ABSTRACT Scientific hypotheses, which are explanations of natural phenomena that can be tested and falsified, are at the core of empirical biology research. Hypotheses about genes involved in biological processes or interactions between species in an ecological setting are used to design research studies and make discoveries about the natural world. However, the act of generating a novel hypothesis requires a high level of manual labor, including sifting through and reading numerous previously published research articles. Due to the explosion of scientific literature in the last century, there are too many materials in any given field for scientists to read and process while generating new hypotheses, leading to a sensation of information overload. Information overload is the state in which information inputs to a system overwhelm its information processing capacities, and it is not a new phenomenon; since the advent of the written word, academics have bemoaned the deluge of written resources. One possible method for ameliorating the sensation of information overload is to implement methods for automated hypothesis generation, whereby literature is automatically processed to propose new connections between biological entities. In particular, this dissertation focuses on the use of knowledge graphs, which are networks in which nodes are entities of interest, like genes or proteins, and edges are the biological relationships between them. While methods for automated hypothesis generation from the literature using knowledge graphs have been used in the biomedical domain to generate hypotheses for phenomena like adverse drug reactions or drug-disease interactions, limited work has been done to translate these methods into the plant science domain. This dissertation focuses on the use of natural language processing techniques to perform automated hypothesis generation in, and to explore the research landscape of, the field of desiccation tolerance biology. Desiccation tolerance is the ability of an organism to revive from the loss of nearly all internal water, and it exists across the kingdoms of life. Nearly all land plants exhibit desiccation tolerance in seeds; however, whole-plant vegetative desiccation tolerance is much rarer, and whole-organism desiccation tolerance in other kingdoms of life is also rare. As a result, the field of desiccation tolerance research is much smaller than related fields such as drought tolerance, and possesses far fewer curated resources, both experimental, like transformation systems for desiccation tolerant organisms, and informational, as manually curated databases focus on model and crop species which do not exhibit whole-plant desiccation tolerance. Many current knowledge graphs in the plant sciences are built from manually curated databases such as Planteome and UniProt, and are therefore lacking rich information on desiccation tolerance from which to generate hypotheses.
Automatic information extraction from the scientific literature to identify new entities and relationships in an understudied group of organisms in a high-throughput manner is therefore a promising approach to ameliorate the data gaps in databases that affect knowledge graph-based hypothesis generation. The first chapter of this dissertation reviews the history of information overload and hypothesis generation, and briefly introduces desiccation tolerance as a research system. Chapter two presents a dataset for the molecular plant sciences labeled with biological entities and relationships that can be used to train information extraction models, and evaluates several existing methods on this dataset. In chapter two, I find that models from other scientific disciplines are insufficient for high-quality information extraction in plant science, and that training a new model yields improved performance. In chapter three of this thesis, I use bibliometric methods and topic modeling to explore the research landscape of desiccation tolerance, and find that the various study systems (animal, plant, fungi and microbe) are very siloed, or isolated, from one another, even though mechanisms for desiccation tolerance are shared across the kingdoms of life. Additionally, I design a rule-based algorithm to use bibliometric data to recommend new attendees to a specialized desiccation tolerance conference. Finally, in the fourth chapter, I explore the possibilities for constructing a knowledge graph of desiccation and drought tolerance research, and of using the resulting graph to predict novel hypotheses about the mechanisms of desiccation tolerance. My work shows that, using the chosen data sources and methods, information extraction and hypothesis generation from knowledge graphs are inadequate to generate high-quality hypotheses. In the final chapter, I reflect on the limitations and potential future directions of automated hypothesis generation for biology. This research will hopefully provide insight into information management and hypothesis generation in the plant sciences. Copyright by SERENA GHANTOUS LOTRECK 2024 ACKNOWLEDGEMENTS The last five years have been nothing at all like what I expected them to be, in so many ways. While I’ve wanted to earn a PhD since my senior year of high school, nothing could have prepared me for the highs and lows of the experience itself, and I have so many people to thank for helping me see this crazy venture through. As a freshman in undergrad, I toured the Jander Lab through the Cornell Undergraduate Research Board’s peer mentorship program. Cairo, my peer mentor who ran the tour, asked one of the lab’s postdocs to accompany us, and I will never forget what this postdoc replied when one of the students asked her why she chose to work in plant biology: "I like working on plants because they’re interesting, but mostly, I stay because the people are so nice." A year later, I joined the Jander Lab as an undergraduate research intern, and that postdoc’s words have rung true for me every day since. The plant scientists that I have been fortunate enough to meet and work with over the last 9 years of my education have shaped me and helped form my career trajectory as a researcher, and the kindness and mentorship that they’ve shown me has been formative in how I operate as a professional and a researcher.
In addition to the plant scientists, there are so many other people that were instrumental in my development as a scientist and a person over the last five years, and they all certainly deserve more thanks than I am able to put into words here, but I will do my utmost! From my undergraduate, I’d like to thank Dr. Tom Silva, the lecturer in my freshman physiology course who first made me excited about plants, and Cairo, my peer mentor, for opening the doors to undergraduate research in plant science. To Dr. Georg Jander for accepting me into his lab, and Kevin Ahern for his considerate and personable mentorship style as he taught me the basics of being a scientist, and for his continued friendship as we’ve both pursued our PhDs. I also owe a large thanks to Dr. Suman Seth, for always pushing me to take a productively critical view of science through the lens of history, and to Andrew, Mark and Karel and all the other people I worked with at Cornell Outdoor Education for giving me opportunities to be a leader and push myself to become a more capable person. At MSU, I’d like to extend a huge thank you to my advisors, Drs. Bob VanBuren and Mohammad Ghassemi. A special thanks to Bob for his role as my REU program mentor during my junior year of undergrad, as it piqued my interest in computational science, setting the stage for the career transition to data science that I have made as a graduate student. Thank you to the both of you for your continued mentorship and guidance as I’ve worked towards finishing this dissertation, and thank you for supporting my research interests, even though they didn’t fit neatly into either of your research programs. Thanks as well to my committee, Drs. Tammy Long, Emily Josephs, and Dan Chitwood, for your feedback and guidance. To everyone in the VanBuren lab, thank you all for embracing me as your labmate. Your encouragement, friendship, support, and of course, the tea, have brightened so many of my days while finishing this degree. You have made my daily work environment such a happy and healthy one, and I am privileged to call you all my coworkers and friends. To Sara Lira at Corteva, thank you for having faith in me and taking me on as your intern. Working with you has been a privilege and a joy, and your mentorship over the last several years has been instrumental in my success in my degree and beyond. To my Plant Biology program cohort, past labmates and other MSU friends, thank you all for the brunches, escape rooms, hikes, and porch beers over the last five years. Even during Deep Covid, your willingness to participate in virtual game nights or outdoor meetups was such a blessing. As we’ve all started to graduate and move away, I am profoundly grateful to all of you for making my time here so wonderful. To the office staff in Plant Biology, especially Sara Kraeuter; without your support I would have missed so many logistical milestones in this program. Thank you for always being so responsive and helpful, even for the silliest of questions! To the developers I’ve interacted with in the last five years, especially Dave Wadden, Harry Caufield and Max Berrendorf, thank you for your technical support and willingness to answer questions and help me implement your codebases for my research; I’m a better programmer and scientist for our interactions. To my roommates, past and present. To Cass, for our adventures while you were here and the weekly calls since you left, in which you always share the best gossip and the soundest advice with me.
To Nick and Jacob (and honorary 4th roommate Julia!) for putting up with me during this last stretch of thesis work, and for always being there with a, "CoDe GeAHsS??" when I needed a distraction. A special thanks to Nick, my longest-standing roommate, for your unwavering support during some of the most difficult times in my PhD. Our tea-and-Ted Lasso nights were the light of my third year, and I am so eternally grateful for your friendship. To everyone at the Greater Lansing Academy of Dance for welcoming me with open arms and making me a part of your community. As I like to call it, my "enforced fun time" of classes, rehearsals, and subsequent parking lot hangouts was many times the only thing keeping me from losing my mind over my dissertation work in the last few years. Having such a wonderful, fun community outside of school was essential to my well-being, and I so very much wish I could bring you all with me wherever I end up next! A special thanks to my instructor Jim, for pushing me to believe that I am capable of improvement even as an adult, and giving me an outlet to work towards something productive outside of research. Your thoughtful observations and explanations have made me so much stronger a dancer, and I have so enjoyed working with you. To Migue, Alberto, Angel, and David, for welcoming me as part of the group from the very beginning, from the Sunday calls during the pandemic to this year’s get-togethers; you are the kings of Martes Santo! To Barto Miranda, for your years of friendship and all of our chats and walks; spending time with you always brings me so much joy. And finally to Martín, for being such a good friend to me for so many years after we met by chance at the climbing gym. Having you always there to talk about everything that happens in our lives, both personal and research-related, and continuing to have adventures with you makes me happier than I can explain. I would like to extend a heartfelt thanks to all of the artists and production professionals that are responsible for making the music, movies, and shows that helped get me through the toughest times of my doctoral degree. I would not have been able to accomplish meaningful scientific work without the background of movie soundtracks and pop songs that accompanied my programming, reading, writing, and analysis, to say nothing of the movies, books, shows, and podcasts that enriched my world in my free time. Thanks as well to the therapists and medical professionals that have supported my journey; graduate school is hard on the mind and the body, and without the support of the medical and mental health professionals I’ve been fortunate to work with over the years, I wouldn’t be here today. To my best friend Galen, I can’t express in words how much your support and friendship has meant to me. You push me to be a better person every day, and our phone calls and road trips and general shenanigans have made my world so much brighter since we met that fateful day in Okenshields so long ago. Finally, to my family. To my Grandma, for being my best friend and confidante on all of our phone calls, and for sending me newspaper clippings and notes to brighten my spirits. To my siblings Robert and Jake and my parents: while we all hope there are no more world-halting pandemics in our lifetimes, I am so profoundly grateful that I got locked in with you all in 2020.
Living with you while all being adults was such a rare and wonderful opportunity, and the board games and shared meals, spontaneous kitchen dance parties and outdoor activities are all such cherished memories to me. Your support and understanding of the trials and tribulations of this degree and everything that happens in my life, both in person and over distance, have been so important to me over the last, well, my whole life. I love you all so much. PREFACE "What if I told you you’d never have to read a scientific paper again?" As an undergraduate student, the proposal for a dissertation project on automated hypothesis generation sounded like a proposal for the promised land. I had already been burned scientifically by not reading enough papers; while writing up my undergraduate honors thesis research, I found a paper that, had I read it earlier, would have drastically changed my experimental design. Ironically, I have of course read several orders of magnitude more papers to complete the project described here than for any other project I have worked on, and have continued to experience the same phenomenon of being unfortunately surprised by relevant papers appearing at the wrong moments. However, the research and writing of this thesis has assured me that this is not the result of some deficiency of mine as a scientist, but is rather an eternal struggle that has existed since the advent of the written word.
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
CHAPTER 2 PLANT SCIENCE KNOWLEDGE GRAPH CORPUS: A GOLD STANDARD ENTITY AND RELATION CORPUS FOR THE MOLECULAR PLANT SCIENCES
CHAPTER 3 DRYING TO CONNECT: UNIFYING THE RESEARCH LANDSCAPE OF DESICCATION TOLERANCE TO IDENTIFY TRENDS, GAPS, AND OPPORTUNITIES
CHAPTER 4 AN EVALUATION OF KNOWLEDGE GRAPH CONSTRUCTION AND AUTOMATED HYPOTHESIS GENERATION FOR WHOLE-PLANT DESICCATION TOLERANCE
CHAPTER 5 CONCLUSION
BIBLIOGRAPHY
APPENDIX
LIST OF ABBREVIATIONS
TLDR too long; didn’t read
GO Gene Ontology
KEGG Kyoto Encyclopedia of Genes and Genomes
PO Plant Ontology
PECO Plant Experimental Conditions Ontology
KG Knowledge Graph
NLP Natural Language Processing
NER Named Entity Recognition
RE Relation Extraction
PICKLE Plant Science Knowledge Graph (Corpus)
LP Link prediction
TLP Temporal link prediction
AUROC Area under the receiver-operator curve
ROC Receiver-operator characteristic
AUPRC Area under the precision-recall curve
CHAPTER 1 INTRODUCTION Information overload Information overload is the state in which information inputs to a system overwhelm the system’s information processing capacities [Bawden and Robinson, 2020]. One can experience an intuitive example of information overload simply by using an academic search engine to look up a concept in one’s research field; the hundreds of thousands of results serve as a testament to the sheer amount of information available even in a relatively narrow scope.
Information overload is often perceived to be associated only with the modern digital age, as a result of the advent of information technologies like the Internet, but humankind has been complaining of information overload, and developing strategies to deal with it, for nearly as long as we have had written text. In the first century A.D., the Roman philosopher Seneca griped that "the abundance of books is distraction" [Bawden and Robinson, 2020]. Vincent of Beauvais, a Christian academic who wrote compendiums of available knowledge in the mid-13th century (a strategy for the management of information overload even in an era before the advent of the printing press), bemoaned “the multitude of books, the shortness of time, and slipperiness of memory" [Bawden and Robinson, 2020] – a complaint that, when replacing “books" with “journal articles", I have found wholly relatable in the writing of this dissertation! Information overload exists across all spheres of life where written information dominates; however, concern over the effect of information overload on the future progress of science is acute [Raymond, 2019]. Scientists rely upon previous information to generate hypotheses and design experiments to make scientific discoveries, but we are unable to keep up with the flow of information even within very specific domains, often relying on information management approaches that involve manually parsing, reading, and digesting the resulting papers to formulate new hypotheses [Landhuis, 2016]. Given that a perception of overload has existed since humankind started writing things down, it is unlikely that we will ever manage to design a “silver bullet" tool or set of tools that so effectively manages our information workflows that the perception of being overloaded recedes substantially. However, while they have not necessarily lessened our perception of overload, previous strategies for information management, like Vincent of Beauvais’ encyclopedic compendium Speculum Maius or the Dewey Decimal system, have still proved fruitful. Without attempting to manage overflow, our ability to navigate available information in any period of time would be deeply hampered, and it is therefore imperative to continue to develop new strategies to maintain pace with humanity’s ever-expanding body of knowledge. Indeed, some authors argue that our contemporary perception of information overload is more related to the lack of technological solutions for managing digital information than it is to the existence of large quantities of information itself [Klerings et al., 2015]. Existing tools to manage information overload in the plant sciences A range of digital tools exist to help manage information overload in the sciences, ranging from familiar approaches like search engines to newer approaches like knowledge graphs. One excellent example of a domain-agnostic search engine-based tool for information management is Semantic Scholar [Raymond, 2019], which has incorporated various machine learning approaches to search retrieval and information management. For example, in 2020 Semantic Scholar incorporated TLDRs ("too long; didn’t read"), which are one- to two-sentence summaries of scientific abstracts, using a machine learning model for extreme summarization [Cachola et al., 2020]. The goal of TLDRs is to assist scientists in identifying relevant papers from a search more rapidly than is possible by reading entire abstracts.
In addition to domain-agnostic tools, plant science researchers have access to a relatively large number of high-quality, manually-curated databases. Some are specific to plant science, like Planteome [Cooper et al., 2024], while others are generalizable to all areas of biology, like the Gene Ontology (GO) [The Gene Ontology Consortium, 2019] and the Kyoto Encyclopedia of Genes and Genomes (KEGG) [Kanehisa, 2002]. Planteome in particular provides a valuable service to plant scientists, specializing in developing ontologies specific to plant science like the Plant Ontology (PO) and the Plant Experimental Conditions Ontology (PECO) and mapping information from the plant science literature onto these ontologies to make databases, as well as linking other ontology projects. While databases and ontologies provide extremely high-quality information, their scope is limited by the labor that needs to be invested in manual curation. A more recent approach to information management in the sciences is knowledge graphs. A knowledge graph (KG) is a network that contains knowledge of the real world, where nodes in the graph are entities of interest, and edges are relations between the entities [Peng et al., 2023]. In the biological sphere, the capacity of KG to contain heterogeneous information, or information from various sources and of various types, has made them attractive candidates for combining various ontologies and databases for tasks such as predicting new links between biological entities [Unni et al., 2022]. Knowledge graphs can include information from both structured and unstructured sources, meaning they can integrate information from across structured databases with information extracted directly from the scientific literature, patents, or other forms of unstructured natural language text. KG themselves cannot solve the issue of information overload; a KG that represents some large amount of information is also intractably large. However, unlike search engine-based approaches, KG can be used in downstream methods that aim to manage information in a very specific way: by automating or semi-automating the creation of scientific hypotheses. What is a hypothesis? Before we can discuss the automation of hypothesis generation, we need to establish a definition for a hypothesis. In this work, we will define a hypothesis, following [Alger, 2019], as a proposed explanation of a natural phenomenon that can be tested and potentially falsified. A hypothesis is a "putative explanation for actual observations" [Alger, 2019] from which predictions about the behavior of a system can be derived. For example, we might observe that our houseplants have been turning yellow, and hypothesize that the reason is a lack of nutrients in the soil. One prediction resulting from this hypothesis is that if we added nutrients to the soil by adding fertilizer, our plants would become green again. We can evaluate this prediction by performing the experiment of adding fertilizer, and if our plants do not turn green, we can reject the hypothesis of nutrient deficiency as a cause for plant yellowing. [Alger, 2019] notes that this framing of the scientific process as the falsification of hypotheses can be controversial among scientists, some of whom argue for open discovery- or question-based science as opposed to hypothesis-driven science.
The full nuance of this debate is beyond the scope of this dissertation; however, I would like to point out that discovery- and hypothesis-based approaches to science appear to be synergistically integrated in the pursuit of managing information overload. This dissertation is framed around the pursuit of a hypothesis generation system for the plant sciences. Such a system can function as an open discovery- or question-based system, broadly searching within the discipline of plant science for explanations of natural phenomena; those explanations are hypotheses, which can then be experimentally tested with falsification. [Alger, 2019] defines 6 characteristics of a good hypothesis: (1) Significance/Generality, (2) Riskiness, (3) Simplicity, (4) Specificity, (5) Constraint, and (6) Falsifiability in Practice. In order, these principles require a hypothesis to (1) tackle a scientifically meaningful issue, (2) make non-obvious predictions, (3) provide the simplest explanation for the observed facts (i.e. follow Occam’s Razor), (4) rule out other explanations, such as hypothesizing "necessary and sufficient" conditions in biochemistry, (5) be sufficiently detailed such that changing any detail of the hypothesis means it no longer explains what it was intended to explain, and (6) be falsifiable by experiments that are practical to perform. A good automated hypothesis generation system will generate good hypotheses; we can use these 6 characteristics to define what we mean by good hypotheses. History of automated hypothesis generation Now that we have a definition of hypothesis from which to work, we can define the practice of automated hypothesis generation. Here, we will define automated hypothesis generation as the practice of using an algorithmic system to propose falsifiable hypotheses for a given scientific domain. This is in contrast to manual hypothesis generation, like the process of generating a hypothesis about our dying houseplant: in that case, we used our previous knowledge about plants, which we could have gained from interpersonal interactions (talking to our houseplant-mom friend) or from reading (literature or the general Web), to come up with a plausible explanation for what we observed. In a scientific domain, manual hypothesis generation often involves reading multitudes of journal articles in the target domain to acquire sufficient intuition to generate a hypothesis [Akujuobi, 2021]. Most literature on automated hypothesis generation from scientific papers credits the development of the field to Don R. Swanson. Swanson’s major contribution to the field of automated hypothesis generation from literature (what he called literature-based discovery, or LBD) was the idea of "undiscovered public knowledge", and the ABC model approach to discovering this knowledge. Undiscovered public knowledge is information that is implicitly present in sources like scientific papers, but whose parts have never been brought together to state that knowledge explicitly. Undiscovered public knowledge can best be explained with the example that Swanson used in his seminal paper on using fish oil to treat Raynaud’s disease [Swanson, 1986], where he used the ABC model to make implicit knowledge explicit. In the ABC model, the user provides two terms, A and C, that they think may be connected, and the system, whether it be automated or manual, searches for terms B that bridge the gap between A and C [Smalheiser and Swanson, 1998].
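To make the closed-discovery ABC search concrete, the sketch below runs the idea on toy data. It is an illustration only, not Swanson's original implementation: the term-to-document index, the candidate terms, and the simple product score are all assumptions made for the example.

```python
# Minimal sketch of a closed-discovery ABC search (illustrative only).
# Assumes a precomputed term -> set-of-document-IDs index; in practice this
# would come from indexing the titles/abstracts of a literature corpus.

from collections import Counter

term_docs = {
    "fish oil": {1, 2, 3},
    "blood viscosity": {2, 3, 7, 8},
    "platelet aggregation": {3, 9},
    "raynaud disease": {7, 8, 9, 10},
    "maize yield": {11, 12},
}

def abc_bridge_terms(a_term, c_term, index):
    """Rank candidate B terms that co-occur with both A and C; in a true
    undiscovered-public-knowledge setting, A and C rarely co-occur directly."""
    a_docs, c_docs = index[a_term], index[c_term]
    scores = Counter()
    for b_term, b_docs in index.items():
        if b_term in (a_term, c_term):
            continue
        ab = len(a_docs & b_docs)   # documents linking A and B
        bc = len(b_docs & c_docs)   # documents linking B and C
        if ab and bc:
            scores[b_term] = ab * bc  # simple illustrative score
    return scores.most_common()

print(abc_bridge_terms("fish oil", "raynaud disease", term_docs))
# [('blood viscosity', 4), ('platelet aggregation', 1)]
```

The ranked B terms are the candidate bridging mechanisms; an open-discovery variant would drop the fixed C term and enumerate all B-C pairs, which is exactly the explosion of candidates discussed below.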
In [Swanson, 1986], Swanson used the ABC technique to demonstrate that, while the scientific community knew that fish oil helped improve blood flow, and that Raynaud’s disease was caused by poor blood flow, no one had postulated whether, or how, consuming fish oil could contribute to easing the symptoms of Raynaud’s disease. In the ABC model, A in this case would be "fish oil", and C would be "amelioration of Raynaud’s disease", while the B terms are the mechanisms by which fish oil could contribute to the amelioration of Raynaud’s disease. Fish oil being a treatment for Raynaud’s disease is a prime example of undiscovered public knowledge; all of the information necessary to make the conclusion was present in the literature, but due to disciplinary siloing of research, had never explicitly been brought together. Other examples of undiscovered public knowledge include information on genetics that lies hidden in public databases [Smalheiser and Swanson, 1998], as well as information that is implicit in the literature. While powerful, the ABC approach still requires some initial level of hypothesis, as discussed in [Smalheiser, 2012]. In Swanson’s research, he relied on a closed paradigm, having chosen the A and C terms in advance, and looked for B terms that connected them; the selection of both an A and C term requires knowledge of the research field and an initial hypothesis about which A’s and C’s may be connected. While open-discovery ABC models, where only an A term needs to be specified, exist from a technical standpoint, they result in a deluge of potential connected B and C terms, which induces further information overload that needs to be managed by ranking the resulting hypothesis candidates [Wren, 2008]. One approach to open discovery-based hypothesis generation that is somewhat more constrained than an open ABC approach is link prediction on KG. In the biomedical sphere, KG have been combined with prediction techniques to predict adverse drug interactions, new targets for drug repurposing, and drug discovery [Abu-Salih et al., 2023]. There exist a fair number of plant science KG centered on model species. AgroLD [Larmande and Todorov, 2021] is a plant science knowledge graph built from other biological databases such as UniProtKB, GO and genetic database resources for several plant species. KnetMiner [Hassani-Pak et al., 2021] is a commercialized KG platform that integrates information from genome annotations for various model species, as well as single nucleotide polymorphism variation, quantitative trait loci, and protein domains [Hassani-Pak et al., 2016]. KnetMiner does include information derived from PubMed abstracts; however, it is unclear how this information was extracted for inclusion into KnetMiner [Hassani-Pak et al., 2016]. In 2023, there was a small burst of new papers published on plant science-specific KG, partially aided by the journal Frontiers’ special edition, Knowledge Graph Technologies: the Next Frontier of the Food, Agriculture, and Water Domains. One graph from the Frontiers edition is GenoPhenoEnvo, a graph integrating data from Planteome [Cooper et al., 2024], including both ontology information and gene expression data for several model and crop species [Thessen et al., 2023]. C3P0 is another KG from the Frontiers special edition, designed to provide decision support to vegetable farmers, that is built on existing databases as well as informational input from domain experts [Darnala et al., 2023].
Finally from the Frontiers group is OrthoLegKB, which directly uses genomic resources to compute and include orthology and synteny, QTLs, and RNA-sequencing datasets [Imbert et al., 2023]. Other plant science KG published in 2023 are: PlantConnectome, which used a GPT-based approach to turn 100,000 plant biology abstracts into a KG with a navigable GUI component, allowing users to explore various subsets of the graph [Fo et al., 2023]; and The Comprehensive Knowledge Network (part of the Stress Knowledge Map, [Bleker et al., 2023]), which is a manually-curated network of Arabidopsis thaliana genes, proteins, RNA, and metabolites derived from literature [Ramšak et al., 2018]. However, it appears that, in comparison with the biomedical domain, very little work has been done on downstream methods for using any of these plant science KG to predict new hypotheses. Most of the better-developed KG include some kind of browser that allows users to interact with information within the graph in response to some search query; however, link prediction to generate novel hypotheses does not appear to have been investigated in the plant sciences. KG clearly provide an advantage when accessing omic-scale information to identify gene targets, as demonstrated by the search queries implemented in [Thessen et al., 2023] and [Imbert et al., 2023]; however, the same searches and results could likely have been performed, albeit with greater difficulty, using biological datasets directly, and are not unique to the KG. This is in contrast to work in the biomedical sphere that has directly utilized predictions of new graph connections to answer questions that were not answerable with other kinds of data [Abu-Salih et al., 2023]. In contrast, C3PO provides a decision-support framework, which allows farmers to input their farm details and receive a tailored technical itinerary for their planting season [Darnala et al., 2023]. While this is much closer to the kind of hypothesis generation we are interested in, it is targeted at a practical use case and not at basic biological discovery. Desiccation tolerance as a biological system The biological system on which this dissertation focuses is whole-plant desiccation tolerance. Desiccation tolerance (DT) is defined as the ability to revive from the "air-dry state", where all available water in the organism has been lost to the surrounding air [Bewley, 1979]. As water is the primary ingredient of life, the fact that any organism can survive near-complete drying is astounding, and understanding the mechanisms by which this phenomenon is possible is of great scientific interest [Hibshman et al., 2020]. Many land plants have desiccation tolerant seeds (also known as orthodox seeds), but whole-organism DT is much rarer. It is thought that the earliest land plants had whole-organism DT, which was then lost as plants evolved vasculature (xylem and phloem, which allow the long-distance internal transport of water and sugars), and that certain plants re-evolved the trait by repurposing seed DT mechanisms under certain evolutionary pressures [Marks et al., 2021]. While DT in whole organisms is a relatively rare phenomenon, it exists across all kingdoms of life [Alpert, 2005]; animals such as tardigrades are desiccation tolerant [Hibshman et al., 2020], as well as many microbes [Grzyb and Skłodowska, 2022].
As such, the biology of DT has applications across many fields, including medical cryopreservation and crop improvements [Alpert, 2005], space biology [Persson et al., 2011], and restoration ecology [León-Lobos et al., 2012]. Because DT in whole organisms is rather rare, the field of DT research is relatively small compared to related disciplines like drought tolerance. In plant science, no model or crop organisms exhibit vegetative DT, and experimentally validated information about the mechanisms of DT is scarce. As a result, using search terms like "desiccation" or "desiccation tolerance" in large KG like AgroLD or KnetMiner returns very few results, none of which I have found to go beyond established knowledge about DT mechanisms. Therefore, the overarching goal of this dissertation is to explore the potential of using the DT literature to construct knowledge graphs and generate hypotheses about whole-plant DT. Content roadmap In this dissertation, I explore the application of natural language processing to KG construction from and characterization of the DT literature. In Chapter 2, I describe the creation of a molecular plant science dataset of 250 abstracts labeled with biological entities like genes, organisms and proteins, and the relations between them, and use it to evaluate the performance of existing methods for entity and relation extraction in the plant sciences. In Chapter 3, I explore the research themes present in the DT literature, characterize the extent of siloing between plant, animal, microbial and fungal DT research, and address these citation gaps by designing an algorithm to increase research integration through recommending new attendees to a specialized DT conference. In Chapter 4, I explore the potential of literature-derived KG to predict novel hypotheses in plant vegetative DT. Finally, in Chapter 5, I reflect on the limitations of literature-derived KG and propose future directions. CHAPTER 2 PLANT SCIENCE KNOWLEDGE GRAPH CORPUS: A GOLD STANDARD ENTITY AND RELATION CORPUS FOR THE MOLECULAR PLANT SCIENCES The work in this chapter is presented in the final publication: Lotreck, S., Segura Abá, K., Lehti-Shiu, M. D., Seeger, A., Brown, B. N. I., Ranaweera, T., Schumacher, A., Ghassemi, M., and Shiu, S.-H. (2023). Plant Science Knowledge Graph Corpus: a gold standard entity and relation corpus for the molecular plant sciences. in silico Plants, 6(1):diad021 Author contributions: S.L. and S.H.S. developed the project idea. S.L. designed the ontologies and annotation guidelines; wrote code to collect abstracts, unify annotations, apply and evaluate models, and create figures; manually reviewed and unified all abstracts; and wrote the initial draft and figure legends. K.S.A. contributed to analyses of unexpected model performance. K.S.A., M.L.S., A.S., B.B., T.R. and A.S. annotated abstracts and provided feedback for improvements to annotation guidelines. M.G. provided ideas for several analyses. S.H.S. and M.G. oversaw the project progress and provided feedback on the design of the study. All authors participated in the drafting and revision of the manuscript. Abstract Natural language processing (NLP) techniques can enhance our ability to interpret plant science literature. Many state-of-the-art algorithms for NLP tasks require high-quality labelled data in the target domain, in which entities like genes and proteins, as well as the relationships between entities, are labelled according to a set of annotation guidelines.
While such datasets exist for other domains, these resources need development in the plant sciences. Here, we present the Plant ScIenCe KnowLedgE Graph (PICKLE) corpus, a collection of 250 plant science abstracts annotated with entities and relations, along with its annotation guidelines. The annotation guidelines were refined by iterative rounds of overlapping annotations, in which inter-annotator agreement was leveraged to improve the guidelines. To demonstrate PICKLE’s utility, we evaluated the performance of pretrained models from other domains and trained a new, PICKLE-based model for entity and relation extraction (RE). The PICKLE-trained models exhibit the second-highest in-domain entity performance of all models evaluated, as well as an RE performance that is on par with other models. Additionally, we found that computer science-domain models outperformed models trained on a biomedical corpus (GENIA) in entity extraction, which was unexpected given the intuition that biomedical literature is more similar to PICKLE than computer science. Upon further exploration, we established that the inclusion of new types on which the models were not trained substantially impacts performance. The PICKLE corpus is, therefore, an important contribution to training resources for entity and RE in the plant sciences. Summary In this chapter, I developed a high-quality labeled training dataset for NER and RE in the plant sciences. To the best of my knowledge, it is the first dataset of its kind specifically tailored to molecular plant biology, and consists of 250 documents labeled with biological entities and relationships between them. The development of a dataset for plant biology allowed us to evaluate the performance of existing NER and RE models in the plant sciences, as well as to train a joint NER/RE model specific to the plant sciences that improved information extraction on molecular plant science abstracts. CHAPTER 3 DRYING TO CONNECT: UNIFYING THE RESEARCH LANDSCAPE OF DESICCATION TOLERANCE TO IDENTIFY TRENDS, GAPS, AND OPPORTUNITIES The work in this chapter is presented in the pre-print: Lotreck, S. G., Ghassemi, M., and VanBuren, R. T. (2024). Unifying the research landscape of desiccation tolerance to identify trends, gaps, and opportunities. bioRxiv Author contributions: R.V. and S.L. developed the initial project idea, and S.L. developed the idea for the conference recommendation algorithm. M.G. contributed ideas and discussion to the final implementation of the conference recommendation algorithm. S.L. implemented all analyses, made the raw versions of all figures and drafted the full text. R.V. provided input on figure organization and performed all final edits on the figures. R.V. and M.G. provided oversight on project progress and reviewed and edited the manuscript. Abstract Desiccation tolerance, or the ability to survive extreme dehydration, has evolved recurrently across the tree of life. While our understanding of the mechanisms underlying desiccation tolerance continues to expand, the compartmentalization of findings by study system impedes progress. Here, we analyzed 5,963 papers related to desiccation and examined model systems, research topics, citation networks, and disciplinary siloing over time. Our results show significant siloing, with plant science dominating the field, and relatively isolated clustering of plant, animal, microbial, and fungal systems.
Topic modeling identified 46 distinct research topics, highlighting both commonalities and divergences across the knowledge of desiccation tolerance in different systems. We observed a rich diversity of model desiccation tolerant species within the community, contrasting with the single-species model of most biology research areas. To address citation gaps, we developed a rule-based algorithm to recommend new invitees to a niche conference, DesWorks, enhancing the integration of diverse research areas. The algorithm, which considers co-citation, co-authorship, research topics, and geographic data, successfully identified candidates with novel expertise that was unrepresented in previous conferences. Our findings underscore the importance of interdisciplinary collaboration in advancing desiccation tolerance research and provide a framework for using bibliometric tools to foster scientific integration. Summary Bob and I were particularly interested in performing analyses that would be of interest to other desiccation tolerance researchers by providing novel insights into the historical trends of citation and research topics in the field. I presented an initial version of this work and solicited community feedback at the DesWorks conference in January of 2024. Group debrief sessions during the conference inspired the design of an algorithm that could turn descriptive bibliometric analyses into a predictive tool that could provide actionable suggestions to improve research integration. To the best of my knowledge, the conference recommendation algorithm presented in this chapter is the first of its kind, and I have made the codebase with documentation publicly available so that it can be re-used and extended for other conferences. CHAPTER 4 AN EVALUATION OF KNOWLEDGE GRAPH CONSTRUCTION AND AUTOMATED HYPOTHESIS GENERATION FOR WHOLE-PLANT DESICCATION TOLERANCE Abstract The proliferation of scientific information impedes the ability of scientists to keep up with new discoveries, especially in complex disciplines such as desiccation tolerance research. Desiccation tolerance, or the ability of an organism to revive from near-complete dehydration, is present across the kingdoms of life, but we lack an integrated understanding of the mechanisms of the phenotype. In this work, we aim to integrate information from across the drought and desiccation tolerance literature by constructing a large knowledge graph representing biological entities and their relationships. We evaluated several methods for knowledge graph construction, and found that neural network-based entity extraction, combined with co-occurrence-based relationships, provides the highest-quality network. The resulting knowledge graph contains 334,327 biological entities and 1,288,387 relationships. Using two database-derived knowledge graphs and one other literature-derived graph, we provide preliminary evidence that literature abstracts may not be sufficiently information-dense to produce a high-quality connected network of biological entities, as database-derived networks had a consistently higher ratio of edges to nodes in our analysis. Using the co-occurrence network, we demonstrated that crop species are the most prevalent in the literature about drought and desiccation tolerance, and that, while organism entities are the most common type of entity, chemical compound entities are consistently the most well-connected across the literature.
Finally, we applied knowledge graph embedding to build two kinds of static link prediction models to evaluate the possibilities for hypothesis generation from the co-occurrence network. We found that static link prediction, where the entire network is considered as a single snapshot, is insufficient to provide high-quality predicted hypotheses. We also explored a preliminary implementation of a temporal link prediction model, where the evolution of the network over time is considered during the link prediction task. While the static and temporal methods are not directly comparable to one another, we saw evidence that temporal link prediction may improve upon the prediction capabilities of static link prediction. Our findings indicate future directions for improvement of hypothesis generation from knowledge graphs for biological literature. Introduction The scale and scope of biological knowledge are expanding exponentially, driven by both the increasing volume of published papers each year and the growing content within individual papers. This proliferation of information poses significant challenges for keeping up with new discoveries, even within narrowly defined fields, and it becomes even more daunting within large, multiscale, or complex disciplines. Knowledge integration across different disciplines is usually inadequate, leading to potentially important findings or connections between discoveries remaining unnoticed. This issue is particularly acute in the field of desiccation tolerance, a trait that enables organisms to withstand extreme dehydration. Desiccation tolerance is a widespread adaptation found across all kingdoms of life, prevalent in diverse organisms ranging from fungi and microbes to plants and animals. However, the knowledge spanning molecules to ecosystems remains fragmented and poorly synthesized. This work addresses this gap by attempting to leverage the extensive, yet disparate, body of literature to generate biological hypotheses concerning the genetic basis of desiccation tolerance in plants. This is achieved through the development and utilization of a knowledge graph to map out and connect information to identify underlying patterns and insights that might not be immediately apparent. By structuring data in this way, this research aims to enhance our understanding of desiccation tolerance, facilitating a more integrated approach to studying this crucial biological phenomenon. A knowledge graph (KG) is a graph that contains data representing the real world, where the nodes are entities of interest, and edges are relations between them [Peng et al., 2023]. In biology, entities include proteins, genes, and organisms, and edges are relationships between entities, representing molecular interactions or regulations. The nodes and edges in a biological graph can be drawn from existing manually-curated databases, or they can be derived from unstructured text through natural language processing techniques [Nicholson and Greene, 2020]. Most graphs in the biomedical domain are constructed from existing databases [Nicholson and Greene, 2020], and graphs in the plant science domain follow this trend. The large resource AgroLD is entirely derived from existing databases [Larmande and Todorov, 2021], and KnetMiner is primarily derived from databases, with some unknown level of supplementation from unstructured PubMed abstracts [Hassani-Pak et al., 2016, Hassani-Pak et al., 2021].
The more recent GenoPhenoEnvo graph [Thessen et al., 2023] is also constructed entirely from database sources. However, there are two examples of plant science KG constructed from literature-derived data: the Comprehensive Knowledge Network uses a manual approach [Ramšak et al., 2018, Bleker et al., 2023], and PlantConnectome uses an automated approach [Fo et al., 2023]. To build a graph from the literature, we rely on information extraction methods, which include named entity recognition (NER) and relation extraction (RE). There are many approaches to NER and RE, including rule-based methods and neural network-based methods. Rule-based methods that use syntactic (grammatical) rules, like OpenIE [Angeli et al., 2015], are domain-agnostic, while other rule-based approaches can incorporate domain-specific knowledge and be specific to a given subject area [Milošević and Thielemann, 2023]. Neural network methods tend to be domain specific, especially because they often assign entity and relation types to extracted objects, which are semantically relevant to a given domain, as seen in [Lotreck et al., 2023]. However, neural network methods can achieve higher performance on RE than domain-agnostic rule-based methods [Milošević and Thielemann, 2023]. To build a KG from literature, NER and RE are applied to a set of documents to obtain entities and relationship triples. A “good” KG is information-rich, characterized by a detailed ontology that accurately represents real-world phenomena. According to Seo et al., “A good knowledge graph should have a fine-grained ontology structure that can precisely express information in the real world, and instances and triples should make full use of the ontology’s classes and properties” [Seo et al., 2022]. For the biological sciences, a KG should contain genes, proteins, and organisms, with the relations between them indicating belonging or interaction, and it should have as much real-world information as possible in each of those categories. Since our knowledge about biological life is far from complete, there will be plenty of missing information; however, that missing information should be the result of true knowledge gaps, and not a failure to adequately capture or summarize the literature. It is therefore important to consider the data sources we use in constructing a KG. The proliferation of database-derived graphs as opposed to literature-derived KG could indicate either (1) that existing NER and RE tools are insufficient to extract information from text in the biological domain, but that sufficient information for high-quality graph construction is present in the literature; or (2) that there is insufficient information present in the literature to construct a high-quality graph. The first part of the present work aims to determine which, if either, of these suppositions regarding the lack of literature-derived graphs is true. Here, we first constructed a large dataset (> 80,000 abstracts) of drought and desiccation tolerance literature, and examined four KG construction methods applied to this dataset. We then sought to determine, given the best possible KG from a literature source, how well we can generate novel hypotheses via link prediction. Link prediction is the act of predicting, based on the current structure of the graph, what information might be true but is missing from the graph. Specifically, this takes the form of predicting the edges (or links) that are missing between entities in the graph [Rossi et al., 2021].
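To illustrate how link prediction can be framed once entities have been embedded, the sketch below treats it as binary classification over candidate entity pairs with a Random Forest. This is a toy under stated assumptions, not the exact pipeline used in this chapter: the embeddings are random placeholders (so the resulting scores are meaningless), and the negative sampling and feature construction are illustrative choices.

```python
# Illustrative sketch: link prediction as binary classification on node embeddings.
# Assumes each entity already has a low-dimensional embedding (e.g. from a KG
# embedding model); edge features concatenate the two endpoint embeddings.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_entities, dim = 500, 32
embeddings = rng.normal(size=(n_entities, dim))      # placeholder for learned embeddings

# Hypothetical positive edges (known triples) and randomly sampled negative pairs.
pos_pairs = rng.integers(0, n_entities, size=(2000, 2))
neg_pairs = rng.integers(0, n_entities, size=(2000, 2))

def pair_features(pairs):
    """Concatenate head and tail embeddings into one feature vector per pair."""
    return np.hstack([embeddings[pairs[:, 0]], embeddings[pairs[:, 1]]])

X = np.vstack([pair_features(pos_pairs), pair_features(neg_pairs)])
y = np.concatenate([np.ones(len(pos_pairs)), np.zeros(len(neg_pairs))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Score an unseen candidate pair; a high probability is a predicted (hypothesized) link.
candidate = pair_features(np.array([[10, 42]]))
print(clf.predict_proba(candidate)[0, 1])
```

The ranked-prediction alternative mentioned below scores candidate triples directly with the embedding model's own scoring function rather than a separate classifier; the classification framing shown here is simply the easier one to sketch.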
KG link prediction has traditionally been formulated as a static problem, where the entire graph is considered as a single snapshot, and edges are predicted on that snapshot; however, since KG reflect real-world information, which naturally evolves over time, treating the graph as a static item can lead to poor prediction performance [Cai et al., 2023]. Temporal link prediction (TLP) is the practice of predicting new connections between nodes at future timepoints for a given graph [Qin and Yeung, 2024], and can be used to improve link prediction on KG [Cai et al., 2023]. In general, graphs fall into two categories: homogeneous graphs, like social networks, where there is one type of edge and one type of node; and heterogeneous graphs, where there are multiple edge and node types. The power of KG is their ability to represent data from multiple sources with multiple kinds of relationships, which means they are an instance of a heterogeneous graph [Cai et al., 2018]. Unfortunately, while TLP is relatively well-developed for homogeneous graphs [Qin and Yeung, 2024], TLP for heterogeneous networks is limited to only a handful of methods, and does not have a comprehensive literature survey to describe the field as a whole. Meanwhile, static link prediction is also well-developed for heterogeneous graphs. In this work, we first leveraged the existing computational resources for static link prediction to generate graph embeddings for a desiccation KG, and used both a simple Random Forest approach and an embedding-based ranked link prediction approach to predict new triples in the graph. We used the best static link prediction model to predict novel links between biological entities in our dataset, and performed a literature search to investigate the biology of a subset of the predicted links. Finally, we performed a brief survey of the literature on heterogeneous TLP, and implemented a method called STHN on our data to determine if TLP offers a performance advantage over static link prediction. Results and Discussion Characterizing the drought and desiccation tolerance literature We built a combined dataset of drought and desiccation tolerance from Web of Science using two searches: “desiccation OR anhydrobiosis” and “(water deficit AND plants) OR (drought AND plants)”. After post-processing the search results, the final dataset spans from 1985 to the present (Figure 4.1A), and contains mostly drought literature, with a small subset of the drought and desiccation literature overlapping with one another (Figure 4.1B). Importantly, while our drought search on Web of Science specified that the papers should be about plant drought stress, the desiccation tolerance dataset includes papers from all kingdoms. We kept the non-plant papers in the desiccation tolerance literature in our combined dataset because there is already very limited information on desiccation tolerance, and we did not want to further restrict our data since we know that many mechanisms are shared across kingdoms. There is, however, an enormous amount of literature on plant drought tolerance, so we only included plant science papers in the drought portion of the dataset to keep the combined dataset at a computationally tractable size. Defining a quality measure for a plant science knowledge graph The goal of the present work is to build a knowledge graph of the desiccation and drought tolerance literature in order to make predictions about genes involved in the regulation of desiccation tolerance. Figure 4.1 Dataset statistics.
(A) Cumulative publications per year for drought, desiccation, and shared papers. (B) Number of papers in each category of the dataset. In biology, a knowledge graph is a network where the nodes are biological entities, such as genes, proteins, and organisms, and the edges are relationships between those entities. We aim to extract biological entities and their relations from scientific abstracts, which are an unstructured source of data, using named entity recognition (NER) and relation extraction (RE) methods. Before we begin constructing a knowledge graph, we must define how we will evaluate the quality of the graph. KG quality evaluation is a non-trivial task, as different aspects of the KG are important in quality evaluation, depending on which downstream tasks the KG will be used for. Chen et al. calls this evaluating whether the KG is “fit for purpose” [Chen et al., 2019]. Differing requirements for KG quality in different scenarios mean that evaluating a KG is not as simple as computing an accuracy or F1 metric as we could for a classification algorithm. However, there exist many proposed metrics and frameworks for KG quality evaluation [Chen et al., 2019, Issa et al., 2021, Seo et al., 2022, Wang et al., 2021b]. In particular, we are concerned with the quality of the KG related to the KG construction approaches that we use. Wang et al. specifically discusses quality control and evaluation during KG construction steps, breaking it down into three parts: (1) knowledge source selection, (2) knowledge extraction, and (3) knowledge fusion [Wang et al., 2021b]. In this work, we have selected scientific abstracts as our knowledge source. Wang et al. considers knowledge source selection principally from the perspective of credibility and relevance, including potential sources such as websites, crowdsourced information, and databases. We used a Web of Science search to choose abstracts for our dataset, which implies both credibility and relevance. In terms of knowledge extraction, we employ several NER and RE methods to build our graphs, and Wang et al. emphasizes the importance of limiting errors during the information extraction process [Wang et al., 2021b]. We can therefore consider the performance of our NER and RE methods to be a metric of the quality of the constructed KG. However, traditional evaluations for NER and RE necessitate labor-intensive gold standard datasets labeled with entities and relations. We do not possess a labeled dataset for the domains of drought and desiccation tolerance, so we need to leverage the existing plant science dataset created in [Lotreck et al., 2023] (the PICKLE dataset). While we cannot directly determine if the entities and relations extracted from the drought + desiccation dataset are correct, we can create a proxy metric. Anecdotally, we notice that while NER seems to perform as expected across several of the methods we tried, exhibiting a relatively high estimated recall, barely any relations are extracted from any abstract with any method. Therefore, we will use the ratio of edges (relations) to nodes (entities) in the final extracted knowledge graph to determine if the NER and RE quality is in the general ballpark that we would expect for a dataset in the molecular plant sciences, using the PICKLE dataset to define our expectation for the ratio in a perfect NER/RE scenario.
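As a concrete illustration of this proxy metric, the short sketch below computes per-document and overall edge-to-node ratios. The input format is a made-up stand-in for extraction output, not the actual data structures used in this work.

```python
# Illustrative sketch: edge-to-node ratio as a proxy for extraction quality.
# Assumes each document is summarized as lists of extracted entities and relations.

docs = {
    "doc1": {"entities": ["ABA", "stomata", "drought"],
             "relations": [("ABA", "regulates", "stomata")]},
    "doc2": {"entities": ["LEA protein", "desiccation"], "relations": []},
}

def edge_node_ratios(docs):
    per_doc = {}
    total_edges = total_nodes = 0
    for doc_id, doc in docs.items():
        n_nodes, n_edges = len(doc["entities"]), len(doc["relations"])
        per_doc[doc_id] = n_edges / n_nodes if n_nodes else 0.0
        total_nodes += n_nodes
        total_edges += n_edges
    overall = total_edges / total_nodes if total_nodes else 0.0
    return per_doc, overall

per_doc, overall = edge_node_ratios(docs)
print(per_doc)   # {'doc1': 0.333..., 'doc2': 0.0}
print(overall)   # 0.2 (1 relation over 5 entities)
```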
To support the validity of using PICKLE to generate a baseline expectation for an edge to node ratio for the drought + desiccation dataset, we first compared some basic dataset statistics, such as the distributions of the number of sentences per abstract, the number of words per sentence, and word length for each dataset (Figure 4.2A). We see that both datasets exhibit nearly identical distributions, but that the drought + desiccation dataset, which is several orders of magnitude larger than the PICKLE dataset, has a very small number of outliers with larger values for each statistic. The statistical similarity of the two datasets, combined with the semantic similarity of the datasets (both molecular biology datasets with slightly different foci in terms of biological phenomena), indicates that we can expect similar quantities of entities and relations to be extracted per document. In Figure 4.2B, we see the distribution of the edge to node ratio per abstract in the dataset. There are no abstracts with more relations than entities, and many documents have no relations, resulting in an overall edge to node ratio of 0.34. We will use both the per-abstract distribution and the overall ratio to perform a heuristic assessment of the graphs we build in the next section. An important consideration when evaluating these networks is that scientific abstracts with sentence-level relation extraction may not be sufficient for constructing high-quality KG, even if NER and RE are performing perfectly to extract the available information in each abstract. After evaluating the approximate NER and RE performance of each graph construction algorithm in the following section, we will examine indicators of knowledge source incompleteness.
Knowledge graph construction methods struggle to identify semantic biological relations in text
We employed four graph construction approaches on our dataset: (1) DyGIE++, which is a joint NER/RE model [Wadden et al., 2019], (2) a co-occurrence approach using the entities derived from DyGIE++, (3) OpenIE, which is a rule-based method that uses syntactic information to determine relations [Angeli et al., 2015], and (4) OntoGPT, which passes a predefined schema to GPT-3.5 for entity and relation extraction [Caufield et al., 2024]. The basic statistics of each resulting graph can be found in Table 4.1. Due to the proliferation of meaningless or unusable triples in the OpenIE results (see Figure S4.1 for examples), we filtered out any triple whose entities did not appear in the DyGIE++ entities. Filtering brought the OpenIE results down from 323,233 entities and 644,175 relations to the values in Table 4.1.
Figure 4.2 Comparative dataset statistics and quality evaluation baseline. (A) Histograms of basic dataset statistics for the PICKLE and drought/desiccation dataset. X-limits for each row are determined by the automatic x-limits for the drought/desiccation dataset, as it has larger outliers in each category. The default number of bins was used for PICKLE, and 5x the number of PICKLE bins was used for drought/desiccation in each row to allow a similar level of granularity for comparison. Orange arrows indicate the maximum value in each plot. (B) Distribution of the edge to node ratio per abstract in the PICKLE dataset.
The overall edge to node ratio is 0.34 for the dataset as a whole.
Table 4.1 Basic graph statistics. Figures reported are after any cleaning performed on the raw constructed graphs.
Method | # Nodes | # Edges | # Isolates | Median degree
DyGIE++ | 336,120 | 124,408 | 268,851 | 0
Co-occurrence | 334,327 | 1,288,387 | 35,055 | 3
OpenIE | 6,195 | 8,156 | 0 | 1
OntoGPT | 12,488 | 3,023 | 9,981 | 0
Figure 4.3A shows the overall edge to node ratios for each of the graph construction methods. We see that DyGIE++ on its own, which attempts to extract semantic relationships, has a lower edge to node ratio than PICKLE. Given the statistical similarity between the two datasets, and the fact that PICKLE was designed specifically for use with DyGIE++, the lower edge to node ratio indicates that the DyGIE++ model is likely performing poorly on semantic relation extraction on the drought + desiccation dataset. In contrast, using sentence-level co-occurrence with the DyGIE++-derived entities yields a much higher ratio. This is expected, as co-occurrence cannot identify semantic relationships, and instead relies on the assumption that two entities that appear together in a sentence are related to one another. Without a gold standard, there is no way for us to quantify what proportion of co-occurrence relationships represent actual biological relationships. We can hypothesize, however, that since the ratio is substantially higher than that of PICKLE, there are likely many false positive relationships in the co-occurrence dataset. OpenIE is the only other construction method with an edge to node ratio greater than 1. However, OpenIE only extracts triples, and is not capable of directly extracting entities, which means that there are no isolate nodes, and a ratio of higher than 1 is guaranteed. OntoGPT displays an edge to node ratio relatively similar to that of the DyGIE++ method; however, it extracted an order of magnitude fewer entities than DyGIE++, which indicates that it is likely not suitable as a construction method. Figure 4.3B shows the distribution of edge to node ratios when calculated on a per-document basis, and there is a dramatic difference between the distribution of the per-document ratio of PICKLE and of all other methods, with all methods having a substantial right skew in their ratio distributions. The heavy skew of all methods indicates that relation extraction performance is particularly poor, as we would expect a more even distribution of ratios with more documents having non-zero ratios (i.e., having relations). OntoGPT was a particularly promising method, as it grounds entities to databases in addition to using GPT-3.5. Schema grounding is intended to limit model hallucinations, and performance is drastically improved when using grounding [Caufield et al., 2024]. However, grounding is extremely slow, and even after optimizing grounding speed by using slimmed versions of databases, it would have taken 55 days to run OntoGPT on our whole dataset. We therefore only ran OntoGPT on the 5,237-document desiccation tolerance subset, using the slim NCBI Taxonomy to optimize computational performance.
Figure 4.3 Edge to node ratios for KG construction methods. (A) Overall edge to node ratios for each construction method. (B) Distribution of edge to node ratios calculated on a per-document level for each construction approach. Note that there are only 5,237 documents in the OntoGPT graph, because of issues with computational complexity.
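As a concrete illustration of the OpenIE filtering step described above, the following sketch keeps only triples whose head and tail both appear in the DyGIE++ entity set; the (head, relation, tail) tuple format and the case-insensitive matching are assumptions made for illustration, not the exact implementation.

def filter_openie_triples(openie_triples, dygiepp_entities):
    # Keep only triples whose arguments were also extracted by DyGIE++
    allowed = {entity.lower() for entity in dygiepp_entities}
    return [(head, rel, tail) for head, rel, tail in openie_triples
            if head.lower() in allowed and tail.lower() in allowed]

# Hypothetical example: the second triple is dropped because "we" and "that"
# are not DyGIE++ entities
triples = [("ABA", "induces", "stomatal closure"), ("we", "show", "that")]
print(filter_openie_triples(triples, ["ABA", "stomatal closure"]))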
We found that schema grounding did not completely limit model hallucinations, especially when it came to relation extraction. The model hallucinated relations between non-existent entities like “NaN” and “Not provided”. While hallucinated entities only made up 0.17% of the total extracted entities, relations that included one or more hallucinated non-entities comprised 48.34% of all extracted relations. An additional 5.34% of extracted relations were trivial relations between an entity and itself, and were also dropped. While entity extraction was largely free of hallucinated entities, only 20.99% of entities were grounded back to one of the requested databases, with the remaining 79.01% simply receiving auto-generated unique identifiers that do not pertain to any database. None of our other methods included a grounding component, so even ~21% grounding is an advantage. That being said, by using TaxoNERD to ground just the Multicellular_organism DyGIE++ entities (see the next section for further details), we achieved 15.36% overall grounding, which indicates that OntoGPT does not achieve especially good performance over other methods for grounding entities externally to the KG construction method. Given the order of magnitude discrepancy in entities extracted by DyGIE++ and OntoGPT, we characterized the differences in NER between the two methods to get a sense of which is performing better. One important difference between the two methods is that while DyGIE++ extracts entities on a per-sentence basis, OntoGPT extracts them on a per-document basis, meaning that DyGIE++ can extract the same entity multiple times. To account for this, we resolved all entities with identical lowercase strings from each DyGIE++ document in this analysis, to avoid over-crediting extracted entity counts to DyGIE++. For each document in the desiccation tolerance subset, we quantified the proportion of the DyGIE++ and OntoGPT entities that were also identified by the other method (“shared”, Figure 4.4A). We see that the distribution of the proportion of DyGIE++ entities is right-skewed, indicating that most documents have entities that were not identified by OntoGPT, while the OntoGPT distribution is left-skewed, indicating that almost all entities identified by OntoGPT were also identified by DyGIE++. When we look at abstracts randomly selected from the dataset (Figure 4.4B), we see that DyGIE++ identified many more entities than OntoGPT. One consideration to keep in mind is that the OntoGPT model was only tasked with extracting gene, protein, molecule, and organism entities, while DyGIE++ is capable of extracting some other types, like Biochemical_process, Biochemical_pathway, or Plant_region. However, the extra types only account for a small portion of the DyGIE++-identified entities, and OntoGPT identified almost no entities of the types shared by both models. In the first abstract in Figure 4.4B, OntoGPT didn’t identify any entities, and in the second, while it successfully identified two of the mosquito species, it hallucinated a third. Aedes flavopictus is a real species of mosquito, but it is not the same as Aedes albopictus, which is the species actually mentioned in the text.
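The per-document comparison described above reduces to a set intersection once DyGIE++ mentions are lowercased and deduplicated; a minimal sketch, with hypothetical entity lists, is shown below.

def shared_fractions(dygiepp_entities, ontogpt_entities):
    # Resolve repeated DyGIE++ mentions by lowercasing, then compare entity sets
    dygiepp = {entity.lower() for entity in dygiepp_entities}
    ontogpt = {entity.lower() for entity in ontogpt_entities}
    shared = dygiepp & ontogpt
    frac_dygiepp = len(shared) / len(dygiepp) if dygiepp else 0.0
    frac_ontogpt = len(shared) / len(ontogpt) if ontogpt else 0.0
    return frac_dygiepp, frac_ontogpt

# Hypothetical document: one of three DyGIE++ entities is shared, and the single
# OntoGPT entity is shared
print(shared_fractions(["LEA proteins", "trehalose", "Aedes albopictus"],
                       ["Aedes albopictus"]))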
After performing the above analysis, we decided that our best graph from these construction options is the DyGIE++-based co-occurrence graph. We immediately eliminated OpenIE on the basis of its proliferation of nonsensical/unusable triples, because when we filtered based on the relatively reliable entity set from DyGIE++, there were two orders of magnitude fewer entities and relations when compared to the DyGIE++ graph.
Figure 4.4 Comparison of NER between DyGIE++ and OntoGPT. (A) For each method, distribution of the fraction of entities in each document that are shared by the other method. (B) Example abstracts from the dataset with DyGIE++ entities annotated in colored boxes, where each color corresponds to the entity type. OntoGPT-identified entities are outlined in black boxes.
The choice to eliminate OntoGPT as an option was more complex, and involved both semantic and computational performance considerations. Firstly, the computational complexity of the grounding component of OntoGPT was prohibitive to running the method on our entire dataset. To confirm that grounding was necessary, we ran OntoGPT with a schema that contained no databases for grounding. While it did run extremely fast, most abstracts had no entities, and those that were extracted were nonsensical, which demonstrated that grounding is necessary. To make running OntoGPT on just the desiccation subset feasible, we had to substitute a slimmed version of the NCBI Taxonomy, which also anecdotally affected performance when we manually observed the output, both in terms of entities identified as well as their groundings. OntoGPT schemas can have prompts for each entity and relation type that are passed to GPT-3.5, and we provided prompts for all relation types, and for the Organism and Gene entity types. While we potentially could have further refined the prompts for entities and relations to tune performance, we chose not to move ahead with OntoGPT as our construction method. Even if good performance could be obtained, which seemed unlikely given our initial results, it would have been prohibitively costly to run on the entire dataset. The elimination of OpenIE and OntoGPT as graph construction methods meant our choice was between the DyGIE++ and co-occurrence construction methods. We chose the co-occurrence method for two reasons. First, the DyGIE++ method on its own struggled to extract semantic relations from text, producing an edge to node ratio that was only slightly more than half of the ratio produced by the PICKLE gold standard. Secondly, co-occurrence has been used to excellent effect in many previous works, most famously in identifying a causal link between fish oil and the treatment of Raynaud’s disease by Don Swanson in 1986 [Bekhuis, 2006, Swanson, 1986]. Additionally, while co-occurrence networks have the tendency to overestimate the presence of meaningful semantic relationships between entities, lowering specificity, it has been demonstrated that they have higher sensitivity in a biomedical use-case [Wang et al., 2021a]. High sensitivity indicates that a co-occurrence network likely contains a greater quantity of correct semantic relationships, even while it contains a larger volume of noisy links that don’t reflect true semantics.
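The co-occurrence construction itself is straightforward; a minimal sketch, assuming per-sentence DyGIE++ entity lists and omitting the bookkeeping of edge sources and first-appearance dates described in the Methods, is shown below.

from itertools import combinations
import networkx as nx

def build_cooccurrence_graph(sentence_entities):
    # sentence_entities: iterable of per-sentence lists of entity strings
    graph = nx.Graph()
    for entities in sentence_entities:
        unique_entities = set(entities)
        graph.add_nodes_from(unique_entities)
        # connect every pair of entities that appear in the same sentence
        graph.add_edges_from(combinations(unique_entities, 2))
    return graph

# Hypothetical example: 4 nodes, 3 edges from the first sentence, one isolate
g = build_cooccurrence_graph([["ABA", "RD29A", "drought"], ["trehalose"]])
print(g.number_of_nodes(), g.number_of_edges())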
Therefore, in the following sections, we will use the co-occurrence network in our analyses.
Scientific abstracts may not be a sufficient data source for a well-connected plant science knowledge graph
The difficulty of semantic relation extraction is clearly a limitation to using literature as a knowledge source in KG construction. However, it is important to consider the possibility that literature alone makes an insufficiently information-rich starting source for KG construction. To examine this possibility, we used two database-derived graphs, KnetMiner and GenoPhenoEnvo, and one literature-derived graph, PlantConnectome, to compute the edge to node ratio that we’ve been using as a proxy for information-richness thus far. Figure 4.5A shows that both KnetMiner and GenoPhenoEnvo have higher edge to node ratios than either PICKLE or PlantConnectome does, which indicates that the database-derived graphs are more information-dense than a literature-derived graph. However, it is important to note that the schemas for PICKLE, KnetMiner, and GenoPhenoEnvo are not equivalent, meaning that they contain different entity and relation types. It is possible that the difference in schema is responsible for the difference in connectivity density between the literature-derived graphs and the database-derived graphs, as the database-derived graphs have more entity and relation types. If this were the case, we would expect the actual edge to node ratio (the “data-derived” ratio) to scale with the ratio of relation types to entity types in the schema (the “schema-derived” ratio); however, this is not what we observe. While KnetMiner has the most relation types (Figure 4.5B) and, as a result, the highest schema-derived ratio, GenoPhenoEnvo far outstrips KnetMiner in the data-derived ratio. Additionally, just because a schema has more types does not mean it can represent more information, as both entity and relation types can have varying levels of semantic granularity. For example, the term “regulates” can encompass both “upregulates” and “downregulates”. Figure 4.5C demonstrates this concept for the KnetMiner and PICKLE relation schemas. The five PICKLE relations map loosely to about 15 of the KnetMiner relation types, meaning that the PICKLE relation schema semantically covers almost half of the KnetMiner schema, despite only having a seventh of the relation types by number. Therefore, the drastically lower edge to node ratio of PICKLE is likely more related to the data source as opposed to the schema. In contrast to PICKLE, KnetMiner, and GenoPhenoEnvo, PlantConnectome uses GPT in a schema-free extraction approach, and the types in the resulting network are freehand phrases chosen arbitrarily by GPT. This results in a proliferation of unique “types”, and relation types in particular are subject to rambling type descriptions, such as “had greater levels of resistance than” or “indicated variation in”, which are reminiscent of the predicates extracted by the rule-based method OpenIE (Figure S4.1). Because there is no schema, we could not calculate a schema-derived edge to node ratio for PlantConnectome. However, the data-derived ratio for PlantConnectome is higher than that of PICKLE (Figure 4.5A). The prompts used in PlantConnectome’s GPT-based approach allow the extraction of document-level relationships, which could potentially be responsible for the increased edge to node ratio.
Figure 4.5 Literature-derived versus database-derived graphs. (A) Edge to node ratios for each graph. Solid bars are calculated from all nodes and entities in the graph, while hatched bars are the ratio calculated from the number of relation and entity types in the schema. (B) The number of entity and relation types in each graph schema. (C) A loose mapping of the KnetMiner relation types (grey) to the PICKLE relation types (green). While an exact mapping is impossible to establish, we can loosely map one schema to the other, which shows that the PICKLE relation types have a wider semantic range than the KnetMiner types.
These data support the hypothesis that sentence-level relation extraction from the literature is an information-limiting condition, and that document-level extraction could potentially aid in better literature-derived graphs. However, even in a case where relations were extracted freehand from the entire abstract, unlimited by a schema, PlantConnectome still has a much lower edge to node ratio than the database-derived graphs. Without a performance metric for the extraction that created PlantConnectome, it’s not possible to untangle whether this is due to data source or method performance. However, it seems likely that both method performance and data source are at play in the resulting lower ratio. Without being able to compare a literature and database graph that were built on the same schema, we cannot decisively conclude whether literature is capable of building a sufficiently information-rich biological KG. However, given the indications that literature may not be sufficient for a high-quality biological KG, it is worth reflecting on why. In principle, literature contains all of the necessary information to build a densely-connected, information-rich KG, as the database sources used by graphs like KnetMiner or GenoPhenoEnvo are manually curated from the literature. However, even in an ideal scenario where our information extraction methods were perfect, manual curators differ from most automated methods in two important ways: (1) manual curators have access to full text, while our methods above rely solely on abstracts, and (2) manual curators are naturally performing document-level information extraction, as opposed to the sentence-level relation extraction to which most current methods are limited. Intuitively, abstracts contain a summary of the most important points of a given paper, and should in theory be sufficiently information-rich. However, it is unlikely that a manual curator would only use abstracts to find information, as there is much more detailed information available in the full text of an article. Additionally, not all biological relationships are stated in single sentences, and it may take a relatively high level of reasoning over a whole paragraph or set of paragraphs in the full text of a paper to identify the relevant relationships.
Manual curation is undeniably superior in these regards, but it cannot keep up with the flood of new publications. Therefore, research to more comprehensively identify the weaknesses of different literature data sources, as well as research on the best ways to balance the up-to-date nature of the literature with the more robust nature of databases for KG construction, is necessary.
Crop species dominate the drought tolerance research landscape
In an ideal world, we could analyze the properties of the constructed KG to gain insight into research trends over time. Because the weaknesses in our graph construction outlined above lead to a lower-quality graph, we must be cautious in the interpretation of graph properties; however, we can still gain valuable insights into research trends. In Figure 4.6A, the visualization of the entire graph shows that there are no sub-structures or neighborhoods in the graph, just one large grouping of nodes. This is consistent with the construction method of co-occurrence, as using sentence-level co-occurrence makes any entity nearly as likely to be connected to any other. One of the weaknesses of the DyGIE++ method as trained on PICKLE is that there are no coreference capabilities, and we are unable to ground entity mentions back to database entries. However, we can partially address this by using external grounding methods, such as TaxoNERD [Le Guillarme and Thuiller, 2022].
Figure 4.6 Characterization of the drought-desiccation tolerance co-occurrence network. (A) Overview of the entire network, with a zoomed-in detail. Nodes are colored by entity type, and edges are colored by their source node. (B) Grounded species prevalence in the graph over time; see Methods for details on the data pre-processing considerations for this analysis. (C) Prevalence of entity types over time. There are three groups in terms of growth trajectories, which are outlined in navy blue boxes. (D) Mean degree over time for all entity types.
To determine the extent to which we could ameliorate the lack of grounding with a single tool, we used TaxoNERD to ground all Multicellular_organism entities (see Methods for details). We found that 32.93% of Multicellular_organism nodes received a grounding; as nearly half of the nodes in the graph are made up of Multicellular_organism nodes, this means that 15.36% of the entire graph could be grounded with a single tool. We visualized the top 16 species and genus mentions in Figure 4.6B, and found that the most frequently mentioned species are crop species like wheat, maize, and rice. Combined, Arabidopsis and Arabidopsis thaliana are only mentioned about 6500 times, which, assuming (as we likely can do safely) that most Arabidopsis genus mentions refer to A. thaliana,
means that research on drought tolerance of wheat, maize and rice is more common than on A. thaliana as a model. The prevalence of crop species in this dataset is likely related to the agronomic importance of drought tolerance as a trait. In particular, mentions of the four major crops in the dataset are climbing at a faster rate than those of other species, indicating that research on drought tolerance in wheat, maize, rice and soy is increasing. Another way to characterize the graph is to look at the growth of each type of entity over time. Figure 4.6C shows the growth of each entity type over time in the graph. We find three groupings of entity types by prevalence in the graph; Multicellular_organism entities are far and away the most prevalent, followed by a group that includes Biochemical_process, DNA, Protein, and Organic_compound_other, and a third group containing the rest of the entity types. Interestingly, there was a leap in the number of Multicellular_organism entities in the early 1990s, followed by a period of relative stasis, followed by a second period of growth beginning around 2005. In contrast, other entity types have grown at a relatively constant rate. From 2010 onwards, as the middle group of entity types begins to take off, the growth of Multicellular_organism entities remains at a similar rate, indicating that while new organisms are still being added, more detailed investigation into the mechanisms of drought and desiccation tolerance in the organisms already in the graph is being undertaken. In contrast to which entities have the most nodes in the network, an entirely different set of entities is the most highly connected (Figure 4.6D): Element, Inorganic_compound_other, and Plant_hormone are the most highly connected node types in the network. Chemicals being the most well-connected makes sense given that many organisms and processes share connections to the same types of compounds. However, we also see that the mean degree of all node types decreases over time, indicating that there are more nodes in all types that have a very low degree.
Because we don’t have access to coreference resolution, this is likely due to a proliferation of unique string representations of semantically similar or identical entities as more and more new nodes are added to the graph, and not a reflection of any particular research trend.
Link prediction models are unable to generate biologically relevant hypotheses
To assess the possibilities of using link prediction (LP) methods on a literature-derived plant science KG, we applied the KG embedding model RESCAL to our co-occurrence network. RESCAL is a tensor factorization-based KG embedding method that yields embeddings for all nodes and edges in a given network [Nickel et al., 2011]. To provide a simple baseline for link prediction performance, we trained a multi-class Random Forest (RF) model that took the embeddings of both nodes in a pair concatenated together as the feature vector, and predicted whether a given node pair should have a desiccation edge, drought edge, both desiccation and drought edge, or no edge. We tested two RESCAL loss functions and three negative sampling methods to optimize model performance (Figures S4.2 and S4.3; see Methods for details), but found that a random negative sampling strategy yielded the best performance (F1 = 0.30, AUROC = 0.64, Figure 4.7A). There was no meaningful difference between the RESCAL loss functions in the performance of the RF models, so we selected the RESCAL model trained with BCEWithLogitsLoss to be compatible with the RESCAL models used directly for prediction (Figure 4.8).
Figure 4.7 Performance of Random Forest baseline models. ROC curves for (A) the drought + desiccation model and (B) the GenoPhenoEnvo model. Note that in (B), classes 5 and 12 have true perfect performance, while classes 1 and 9 are just so high that they are rounded to 1.0. Confusion matrices for (C) the drought + desiccation model and (D) the GenoPhenoEnvo model. Note that for (D), the test set is imbalanced, so the color map doesn’t visibly reflect the perfect performance of class 5, as it is very small.
There are several interesting aspects to the RF model’s prediction capabilities. Notably, the RF model is unable to predict edges that only appear in the drought dataset, in fact achieving an AUROC score that is worse than random guessing, but is much better at predicting negative triples or triples that appear in both the drought and desiccation dataset (Figure 4.7B). The ability to predict negative samples is expected, as our training set contained as many negative instances as the total sum of positive instances (see Methods for justification); however, the model has clearly overcompensated and assigns the negative label to many instances that should have positive labels. This is an acceptable starting point for a model that is designed to generate testable hypotheses, as false positives are harmful in a scenario where a false positive means wasting resources and years of effort pursuing something that has no basis in reality. However, the model also assigns instances to the both class with relative frequency, which is not explained by the presence of the both class in the training set, as there are an identical number of both, drought, and desiccation instances. Further investigation is needed to provide more comprehensive explanations of model behavior; however, the method as it stands is unsuitable for hypothesis generation due to low performance.
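For reference, the RF baseline described in this section amounts to concatenating the RESCAL embeddings of a candidate head and tail into a feature vector and training a multi-class classifier on edge labels. The sketch below uses randomly generated embeddings and labels purely as stand-ins; in practice the embeddings would come from the trained PyKEEN RESCAL model and the labels from the co-occurrence graph.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1234)
embedding_dim = 64
# Stand-in for RESCAL node embeddings keyed by entity name
node_embeddings = {f"node_{i}": rng.normal(size=embedding_dim) for i in range(100)}

def featurize(pairs):
    # Concatenate head and tail embeddings for each (head, tail) pair
    return np.vstack([np.concatenate([node_embeddings[h], node_embeddings[t]])
                      for h, t in pairs])

# Hypothetical pairs and labels (0 = no edge, 1 = drought, 2 = desiccation, 3 = both)
train_pairs = [(f"node_{i}", f"node_{i + 1}") for i in range(80)]
train_labels = rng.integers(0, 4, size=len(train_pairs))

clf = RandomForestClassifier(n_estimators=100, random_state=1234)
clf.fit(featurize(train_pairs), train_labels)
print(clf.predict(featurize([("node_0", "node_5")])))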
Most KG embedding models are trained with a loss function that evaluates the model’s capabilities for link prediction. Therefore, we wanted to see what the native prediction capabilities were for the RESCAL model we had trained. We employed two approaches to evaluating RESCAL’s performance in link prediction. The first, to provide partial comparability with the RF model, was to ask the RESCAL model to generate a plausibility score for each of the 4,000 triples in the test set that we used for the RF model. We’ll refer to this as the “predict triples” approach, after the function in PyKEEN used to perform this kind of prediction. KG embedding link prediction functions somewhat differently from the RF model we designed – rather than acting as a multiclass model that predicts both the presence/absence of an edge as well as its label, a KG embedding model provides a plausibility score that represents the model’s confidence that the triple is true. The plausibility score can be leveraged with a threshold to generate a binary classifier by choosing a threshold plausibility score to make the cutoff between true and false for a given triple. When using the BCEWithLogitsLoss, the model is optimized to score triples around a threshold of 0, so triples with a positive score are considered to be true, while triples with a negative score are considered to be false. Using 0 as a classifier threshold, the RESCAL model achieved an F1 score of 0.60, which is substantially better than the RF triple classification model scored. However, when we look at where positive and negative triples appear in the ranking, we see that negative triples appear at the top and bottom of the ranking, while positive triples tend to receive middling ranks (Figure 4.8A). Ideally, we would see that the bottom half of the ranking is predominantly negative triples, while the top is predominantly positive triples. To contextualize this finding, we can look at the distribution of triple scores for positive and negative triples (Figure 4.8B). Both the means and distributions of positive and negative triple scores are significantly different from one another (t-test, p-value = 0.004; KS test, p-value = 1.16e-43). In particular, the negative triples have a wider distribution of scores, with more triples at the high and low ends of the scoring range when compared to the positive triples. A wider distribution explains why negative triples appear both highly and lowly ranked, while positive triples tend to rank towards the middle. Because the BCEWithLogitsLoss is optimized specifically around a threshold of 0, we did not generate an ROC curve for this model; however, while the F1 score is substantially better than the F1 score of the best RF model (F1 = 0.30), looking at the rankings and distributions makes it clear that the predict triples approach is also insufficient to generate high quality assessments of link plausibility.
Figure 4.8 RESCAL link prediction results. (A) Rank distribution and (B) score distributions for the RESCAL model trained on the drought + DT dataset.
The second approach we took to using RESCAL for predictions is in line with what we would do to perform hypothesis generation for a new graph. We used the model to calculate plausibility scores for all possible triples in the drought + DT dataset, saving the top 100 scores. We’ll refer to this as the “predict all” approach, after the function in PyKEEN used to perform this prediction.
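A rough sketch of the triple-scoring workflow described above is given below, using PyKEEN's pipeline and predict interface on a small built-in dataset as a stand-in for the co-occurrence KG. Exact function signatures and return types vary between PyKEEN versions, so the keyword arguments and the score attribute access shown here are assumptions rather than a verified recipe.

from pykeen.pipeline import pipeline
from pykeen.predict import predict_triples

# Train RESCAL with BCEWithLogitsLoss on a toy dataset (stand-in for our KG)
result = pipeline(dataset="Nations", model="RESCAL", loss="BCEWithLogitsLoss",
                  training_kwargs=dict(num_epochs=5))

# Score a set of triples; with BCEWithLogitsLoss, scores above 0 are treated as
# "true" and scores below 0 as "false"
pack = predict_triples(model=result.model, triples=result.training.mapped_triples)
scores = pack.scores  # assumed attribute holding one plausibility score per triple
predicted_positive = scores > 0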
We manually investigated the links of the top 10 most plausible triples with a Web of Science search (Table 4.2). For each pair of terms, we performed an AND search to find papers where both terms co-occur. If no papers were returned and either of the two entities contained potentially superfluous terms that might confound the search, we simplified the search to increase the likelihood of obtaining results; for example, "wheat germ systems" was changed to "wheat germ". Final search queries are detailed in Table S4.1.
Table 4.2 PyKEEN top 10 prediction results.
Head entity | Edge type | Tail entity | # papers | Preliminary literature search reveals
wheat germ system | both | deciduous seasonal forests | 1 | Entomology report; "germ" in abstract is not wheat germ, appears separately from the term "wheat"
opuntia fragilis | both | deciduous seasonal forests | 0 |
mwsp | both | c. stelligera | 0 |
pshsfa7a1_2595 | both | salt stress-induced calcium signal | 3 | None of the results mention calcium in the abstract
bzip23 transcription factor activity | both | deciduous seasonal forests | 4 | 2 of the 4 papers identify bZIP transcription factors in trees (one deciduous species, one conifer species), 2 papers are about transcriptional studies in trees but don't specifically mention bZIP
beech-fir stand | both | deciduous seasonal forests | 23 | Ecology studies in forests
t. fluminensis | both | deciduous seasonal forests | 2 | One paper mentioning ecological impact of T. fluminensis on forests, one that does not mention T. fluminensis
crocus sativus l | both | reds/breaker tomatoes | 22 | Search results are predominantly studies on the impacts of Crocus sativus L. (saffron) extracts on tomato growth
drought-responsive and jasmonic acid biosynthesis genes | both | b. amphitrite | 0 |
mandarin water | both | deciduous seasonal forests | 5 | Two studies on mandarin ducks, three that don't mention the term "Mandarin"
Generally, there are three categories of entity pairs. The first category are those relationships that hint at the presence of interesting biological relationships, but that lack the specificity of a good hypothesis. Triples in this category are ("Crocus sativus L.", "both", "tomatoes") and ("bZIP", "both", "deciduous forests"). The Web of Science results present a small subset of papers that hint at a mechanistic relationship between the head entities (Crocus sativus L. [saffron], bZIP) and some aspect of the tail entities. However, they lack the specificity to provide testable hypotheses; for example, how does saffron extract improve tomato resilience to stress on a mechanistic level? The second category of triples in the top 10 predictions are those that are trivial but true. For example, the triple ("beech-fir stand", "both", "deciduous forests") returns papers studying the ecology of forests; we know that tree stands occur in forests, and there is no implication of a further mechanistic relationship. Finally, there are those triples that are either irrelevant or incorrect, which return no papers when searched together, such as ("jasmonic acid", "both", "Amphibalanus amphitrite"). Taken together, these results indicate that link prediction on KG as executed through an algorithm like RESCAL on a co-occurrence KG is currently insufficient to provide testable hypotheses at scale. Importantly, our results demonstrate that blindly trusting performance metrics such as F1 score or AUROC does not guarantee a model that performs well in context.
Specifically, the RESCAL model achieved an F1 score of 0.60, which, when compared to the RF model’s F1 score of 0.30, seems like a large improvement. However, we demonstrated that many negative triples are incorrectly classified as positive, indicating that the model is not useful in a practical context. Additionally, the AUROC scores of the RF models are misleadingly much higher than the corresponding F1 scores. Our findings highlight the importance of validating performance metrics with common-sense checks such as examining the most probable predictions to ensure that methods are providing practically valuable results. As the performance of static link prediction on our dataset is exceedingly poor, we wanted to see what kind of prediction capabilities we could achieve using temporal link prediction models. We performed a brief literature review for heterogeneous TLP models, and identified 11 methods (Table S4.2). As a result of examining the available code bases for the identified methods and testing for functioning implementations, we selected STHN [Li et al., 2023], which embeds various aspects of the graph and summarizes information from a temporal interaction sequence to predict links. We ran STHN on the co-occurrence graph, and achieved an AUROC of 0.7685 after 74 epochs of training, though performance did not increase substantially across training epochs (Figure ??). While the AUROCs of STHN and our previous models are not directly comparable across model architectures, the higher AUROC of STHN indicates that incorporating temporal information into the link prediction task adds valuable information that can help improve predictions. Unfortunately, as the TLP field is relatively new and understudied, there are no robust codebases for any TLP method that we could identify. STHN was the most user-friendly, but while it was easy to run, it only outputs the AUROC scores, and does not save any predictions or allow re-use of the pre-trained model. Future work is required to delve deeper into the prediction capabilities of TLP for hypothesis generation on a plant science KG.
Conclusion
In this work, we have examined several KG construction methods and found that a combination of poor information extraction with low information richness in scientific abstracts results in poor-quality KG. We provide preliminary evidence that literature abstracts may not be sufficient for high-quality KG construction when compared to database-derived sources. Using the best-constructed graph for the drought + DT dataset, we showed that crop species dominate the literature on drought and desiccation tolerance, and that while species names are the most common entities in the dataset, chemical compounds are the most well-connected entities. Finally, we explored the capabilities for hypothesis generation on a co-occurrence network derived from literature. We found that prediction capabilities for static link prediction based on a RESCAL KG embedding model are exceedingly poor, regardless of whether the built-in prediction capabilities or a downstream external method (Random Forest) were used. We additionally performed temporal link prediction on the co-occurrence KG, but due to limitations of the implementation are unable to further explore the results. The largest limitation to high-quality KG construction in this study was relation extraction.
Here, we showed that semantic relation extraction methods such as DyGIE++ and OntoGPT were insufficient to recover relations from natural language text. In this work, we chose to use a co-occurrence method to improve relation recall, as all other methods resulted in such sparsely connected KG that we were unable to implement downstream link prediction methods. However, there are clear limitations for co-occurrence as a method for constructing KG; principal among them is the over-representation of false positive triples in the resulting KG, which could be partially responsible for the poor performance of link prediction models. Further work on improving the quality of semantic relation extraction, either through the creation of more plant science-specific training datasets or through improved prompt engineering of large language models, will likely result in higher-quality KG. Another important future direction of this work is to explore the potential benefits of using full text documents rather than abstracts. In particular, quantifying the richness of biological relationships in full text versus abstracts will be important for determining the optimal data sources for KG construction. Potentially, document-level relation extraction, made possible by tools like GPT, will also benefit information retrieval from literature, as relations may not be stated in single sentences. While the literature is the most current and complete source of information, leveraging the data quality of manually-curated databases by integrating literature-derived information with database information through entity and relation grounding is also an important area for future work. Finally, a deeper investigation into the performance of KG embedding models and link prediction on biological datasets is called for, as their performance here was exceedingly poor. Biological KG often have different underlying properties than the ideally-distributed graphs on which KG embedding models are often evaluated, and this may impact the ability of KG embedding models to effectively embed biological networks. The performance of KG embedding models is also likely tied to the performance of upstream relation extraction methods used to construct the KG on which prediction is performed. Specific to the analyses presented in this work, the use of co-occurrence, which likely results in a large number of false positive triples, could exert a confusing effect on the KG embedding model, as the model will generate embeddings that conflate the characteristics of true negative triples with those of positive ones. Additionally, biological KG exhibit a large skew in the distribution of degree, often approximating scale-free behavior, which can differ greatly from the datasets on which embedding models are evaluated in their original publications. Quantification of the impact of network topologies and of upstream relation extraction methods on KG embedding algorithms will be a valuable next step in determining whether, and how well, KG embedding models can perform in the link prediction task. In this work, adding a temporal element to link prediction seems to improve prediction capabilities on a preliminary basis; however, the same pitfalls that static link prediction methods experience on biological datasets could also have an impact on the ability of temporal models to produce high quality predictions.
Acknowledgements
We thank Max Berrendorf from the PyKEEN development team for his invaluable help in understanding the implementation of RESCAL models, particularly with hyperparameter selection. We also thank Harry Caufield of OntoGPT for his help in implementing and optimizing the schema for OntoGPT.
Methods
Dataset construction
We used the methods from the previous chapter to obtain and pre-process a dataset on drought, and combined it with the desiccation tolerance dataset from the previous chapter to create the dataset used here. We used the query “(TS=(water deficit) AND TS=(plants)) OR (TS=(drought) AND TS=(plants))” on Web of Science to obtain the drought dataset. We downloaded the first 100,000 results (of a total of 134,510 query results). In the end, only 99,598 entries were downloaded from Web of Science, as some of the Fast 5000 results were incomplete. We don’t know why this is; however, since the dataset is still substantial even with ~400 missing results, we chose to move ahead. A further 9,024 entries were dropped because they were outside our version of the XML dataset, and a total of 88,433 were recovered. When combined with our previously constructed desiccation dataset (5,963 documents), we obtained a total of 93,348 documents (note that the discrepancy in addition is due to there being some documents in common between the two datasets). We then extracted abstracts to text documents, which resulted in the loss of an additional 11,169 documents because their XML entries did not contain abstracts. This gave us a final dataset of 81,886 documents; 76,260 drought abstracts, 4,622 desiccation abstracts, and 1,004 abstracts that appeared in the searches for both drought and desiccation.
Knowledge graph quality measure determination
To compare the two datasets for statistical similarity, we calculated the number of sentences per abstract, the number of words per sentence, and word length for each dataset, and plotted their distributions. We built a networkx graph for the PICKLE dataset, which removes duplicate nodes and edges. We calculated the ratio of edges to nodes for the overall PICKLE dataset based on the networkx graph, and then calculated the edge to node ratio on a per-document basis for each method. The per-document calculation allows repeated entities and nodes across the whole dataset, as the same entities can be extracted across multiple documents. We also allowed repeated entities within the same document, as sentence-level relation extraction methods like DyGIE++ rely on every instance of each entity being present.
Knowledge graph construction
We tested four avenues for KG construction from the combined drought and desiccation dataset (Table 4.1). The first two approaches rely on the DyGIE++ architecture, which is a joint entity and relation extraction model based on the idea that the properties of entities contribute information to the process of relation extraction and vice versa (see [Wadden et al., 2019] for architecture details). We used a DyGIE++ model that we had previously trained on the PICKLE dataset and applied it to the entire drought + desiccation dataset [Lotreck et al., 2023]. In the first construction approach, we used the output of DyGIE++ as-is, with no modifications. In the second approach, we kept the entities extracted from DyGIE++, but derived relations from sentence-level co-occurrence; if two entities appeared together in a sentence, we put an undirected relation between them in the resulting graph.
We kept track of each entity and relation in the dataset, recording the total number of times each appeared, their first date of appearance, and whether the relations were derived from a drought article, a desiccation article, or both. Our third construction approach was OntoGPT, a GPT-3.5-based approach to extract entities and relations to a predefined schema of entity and relation types (see [Caufield et al., 2024] for implementation details). OntoGPT uses entity grounding to databases specified in each schema to prevent GPT-derived hallucinations and to improve the recall of extraction. We built a schema to extract genes, proteins, molecules, and organisms, as well as the relationships between each of those types. To improve the likelihood of good extractions, we provided prompts for each relation type. Unfortunately, in the current OntoGPT implementation, grounding is extremely computationally complex and does not scale to larger datasets or larger schema databases. While it is possible to substitute slim databases or use no databases whatsoever, this severely impacts the quality of the extraction (see [Appendix] for a characterization of the information extraction and computational performance impacts of the various options), and still does not result in enough of a speedup to make implementation on a dataset larger than a few thousand documents practical. As a result of our analysis, we applied OntoGPT with a schema using the slim version of NCBI Taxonomy on our desiccation dataset only, as applying even the slimmed schema to the whole dataset was prohibitively costly, and our initial performance evaluations did not provide sufficient justification for investing resources into further application. Upon manual inspection of the subset results, the OntoGPT graph contained a large number of hallucinated relations based on hallucinated entities such as “NaN” and “Not provided”. We trawled the dataset for such entities and removed them, as well as any relations that depended on them. Finally, as a common-sense baseline, we applied the rule-based approach OpenIE, which extracts triples from text in a domain-agnostic manner using syntactic information (see [Angeli et al., 2015] for implementation details). As OpenIE has no knowledge of domain-specific considerations for writing style, it opts for a high-recall approach by extracting every possible triple, resulting in a large proportion of extracted triples being nonsensical or unusable (see Figure S4.1 for examples). To combat this issue, we decided to keep only triples whose entities matched DyGIE++-extracted entities, as we had a high degree of confidence in our DyGIE++ results based on the performance evaluations in plant science presented in [Lotreck et al., 2023], as well as manual examination of a small subset of our results on the drought + desiccation dataset. We calculated the whole-dataset and per-document edge to node ratios in the same way that we did for PICKLE; the whole-dataset ratio was calculated using the networkx graph, where duplicate entities and relations are resolved, and the per-document ratios were calculated allowing duplicate entities and relations. For documents that have 0 nodes (which does not occur in PICKLE but can occur when an automated method is applied), which would result in a ZeroDivisionError on computation of the ratio, we substituted a 0 for the ratio to represent that no information was extracted.
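The hallucination cleanup described above can be sketched as a simple filter over the extracted entities and relations; the placeholder-string list and the (head, predicate, tail) structure are illustrative assumptions rather than the exact implementation.

PLACEHOLDERS = {"nan", "not provided"}  # placeholder strings observed in the output

def clean_ontogpt_output(entities, relations):
    # Drop hallucinated placeholder entities, relations that depend on them,
    # and trivial self-relations
    kept_entities = [e for e in entities if e.strip().lower() not in PLACEHOLDERS]
    valid = {e.lower() for e in kept_entities}
    kept_relations = []
    for head, predicate, tail in relations:
        if head.lower() not in valid or tail.lower() not in valid:
            continue  # relation involves a hallucinated non-entity
        if head.lower() == tail.lower():
            continue  # trivial relation between an entity and itself
        kept_relations.append((head, predicate, tail))
    return kept_entities, kept_relations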
To compare the NER capabilities of DyGIE++ and OntoGPT, we mapped the document IDs to the randomly-generated OntoGPT document IDs to pair up the entities extracted from each document in the desiccation tolerance subset. For each document of DyGIE++ entities, any lowercase entity strings identical to one another were resolved into a single entity before calculating the proportion shared. For each of DyGIE++ and OntoGPT, we calculated on a per-document basis the proportion of their entities that appeared in the intersection of the OntoGPT and DyGIE++ entities for each document, and plotted the distributions of those values for each method. We randomly selected 5 abstracts from the desiccation tolerance subset to visualize DyGIE++ and OntoGPT entities, and manually selected two to appear in the figure.
Table 4.3 Summary of KG construction methods.
Construction Method | NER approach | RE approach | Refinements
DyGIE++ | Joint neural method | Joint neural method | None
DyGIE++ co-occurrence | Joint neural method | If two entities appear in a sentence together, a relation is placed between them | None
OntoGPT | GPT-3.5 extraction to predefined schema | GPT-3.5 extraction to predefined schema | Prompts added to relations in the schema, only ran on the desiccation tolerance subset
OpenIE | Rule-based, domain/schema agnostic | Rule-based, domain/schema agnostic | Filter initial output to only keep entities (and correspondingly their relations) that are included in the DyGIE++-extracted entities
Comparison of graph connectivity
We selected two predominantly database-derived graphs for comparison to our networks: KnetMiner [Hassani-Pak et al., 2021], and GenoPhenoEnvo [Thessen et al., 2023]. To obtain a relevant edge to node ratio for KnetMiner, we used the KnetMiner Neo4j browser (http://knetminer-wheat.cyverseuk.org:7474/) for the Poaceae network, which contains wheat and Arabidopsis. We used sample Neo4j Cypher commands provided in the browser to get the entity and relation types present in the network. The KnetMiner network differs from other networks examined here because it also includes the data sources as nodes with relationships in the network. To avoid artificially altering the computed edge to node ratio, we sought to remove any entity and relation types that dealt with data sources, as opposed to biological entities or concepts. While we were unable to locate documentation explaining each entity and relation type, most names were semantically sensical, and so we manually created a subset of both types that we believed to be biological in nature. To confirm that the less explainable entity types were in fact biological, we used the Cypher command “MATCH (n) WHERE n:<EntityType> RETURN n LIMIT 5” to examine the names and properties of the first five entities of each ambiguous type and determine whether the type was relevant. While this resulted in a relatively high-confidence list of entity types, the semantics of the relations were much less clear. Our goal was to include only relation types that are restricted to connecting entity types that were in the identified list of relevant entity types. While the database schema should have provided the necessary information, there was no information about what entity types were valid subjects/objects for the various relations. We therefore constructed the following Cypher command to get the entity and relation types for all relations in our proposed relation list: “MATCH (n1)-[r:<RelationType>]-(n2) RETURN labels(n1), TYPE(r), labels(n2)”.
While it is possible to run this for all relation types at once with an “or” operator, the network is large enough that running one combined command crashed the web server for the browser, so we ran this command for each relation type in our proposed list separately. We then asserted that all entity types in all triples for the given predicate were in our list of biological entity types, and kept only relations for which this was true. The one exception was the “part_of” relation: two of its relations contained CoExpStudy (co-expression study) entities; however, there are only two CoExpStudy entities in the entire graph, so we kept “part_of”, as the entities it connects are predominantly biological. We then used the following two queries to count the number of entities and relations across all the biological types:
MATCH (n) WHERE n:Gene OR n:ProtDomain OR n:Path OR n:CelComp OR n:BioProc OR n:MolFunc OR n:EC OR n:Comp OR n:Protein OR n:Protcmplx OR n:Enzyme OR n:Reaction OR n:CoExpCluster OR n:SNP OR n:Transport OR n:Phenotype OR n:PlantOntologyTerm OR n:SNPEffect OR n:Trait RETURN count(*)
MATCH ()-[r:cs_by | in_by | participates_in | enriched_for | has_phenotype | ortho | is_a | not_located_in | ca_by | pd_by | enc | leads_to | homoeolog | located_in | has_function | ac_by | physical | has_variation | pos_reg | participates_not | has_domain | associated_with | is_part_of | h_s_s | para | neg_reg | cat_c | equ | regulates | has_mutant | genetic | cooc_wi | not_function | part_of]->() RETURN count(*)
We then used the two resulting values to calculate the edge to node ratio. The GenoPhenoEnvo graph is available for direct download as two dataframes, a nodelist and an edgelist. We downloaded these and used the Python package networkx to create a graph object for analysis. All types in GenoPhenoEnvo are biological, so we didn’t perform any further pre-processing. For all graphs, we also used the number of entity and relation types to calculate a “schema-derived” (as opposed to “data-derived”) edge to node ratio.
Graph characterization
We visualized the co-occurrence network in Gephi using the OpenOrd visualization algorithm, coloring nodes by their entity type and edges by their source node. As the PICKLE dataset doesn’t allow us to use the coreference option in DyGIE++ (which attempts to improve predictions by mapping different mentions of the same real-world object to a single entity), we implemented a partial coreference resolution approach by grounding predicted entities in the Multicellular_organism class back to NCBI Taxonomy using the TaxoNERD model [Le Guillarme and Thuiller, 2022]. Rather than using the full text of the originating abstracts and allowing TaxoNERD to also perform the NER step as it would in its full pipeline, we used the DyGIE++-derived entities in isolation, applying the TaxoNERD entity linker by itself. We combined entities into spaCy documents up to the maximum allowed number of characters, specified entity span boundaries, and then applied the TaxoNERD linker. Using this approach, 32.93% of Multicellular_organism entities received a grounding. Once grounded, we mapped entity names to their Taxonomy groundings, and summed the number of entities in each year that corresponded to each Taxonomy grounding to obtain growth trajectories over time for the top 20 most frequently mentioned taxa. For this analysis, we ignored any entity that was not grounded.
We examined a subset of the original entity names for groundings in the top twenty that seemed suspicious, and identified several weaknesses in the TaxoNERD groundings. First, any entity that contained the word "transgenic" was mapped to Mus musculus, the house mouse, likely because "transgenic mice" is a common entity. Many entities containing the phrase "___ plants", like "rice plants" or "olive plants", were mapped back to Embryophyta due to the presence of the word "plants", and phrases containing "grapevine" were mapped to "Grapevine virus A". Additionally, the DyGIE++ model frequently mislabels country names as Multicellular_organism entities, so China appeared in our top twenty. We therefore removed these entries from the top 20, leaving 16 top species. To examine the growth of each entity type category over time, we used the full graph for all types (without grounding for Multicellular_organism entities). For each year in the dataset, we summed the number of entities with that year as their first mention for each entity category. We removed 2023 from the final visualization, because it is a partial year and therefore brings all entity type values near 0. To examine degree over time for each entity type, we sliced the graph at each year, removing all nodes (and, as a result, their edges) past the cut year, and calculated the degree for all nodes present in the graph.

Link prediction problem setup

In order to predict hypotheses, we need to develop a framework for how to use the KG in a prediction setup. Our co-occurrence graph contains biological entities, with undirected relations that have three possible labels: desiccation, meaning the link was derived from a paper in the desiccation tolerance portion of the dataset; drought, meaning it was derived from the drought portion of the dataset; and both, meaning the paper that provided the relation is found in both portions of the dataset. At a high level, our goal is to predict which node pairs should have a desiccation or both designation. The theoretical grounding for this problem setup is that we want to leverage the much greater quantity of literature available in drought tolerance research to determine the genetic basis of desiccation tolerance. Potentially, the genetic elements identified as important in drought tolerance could also have a role in desiccation tolerance, as drought tolerance precedes desiccation tolerance when a plant begins to dry down. Therefore, we need a model that can predict new edges of varying types on an undirected graph. We are interested in both static and temporal predictions: in the static case, we want to predict node pairs that should have desiccation edges based on a static snapshot of the graph over all time, while in the temporal case, we want to predict which node pairs should have an edge at the next time point based on the evolution of the graph over time.

Static link prediction

We tested two approaches to static link prediction on the DyGIE++ co-occurrence network. Both approaches are based on the KG embedding model RESCAL, as implemented in the PyKEEN package [Ali et al., 2021, Nickel et al., 2011]. We first wanted to see if we could design a model that correctly predicts the type of relation (or the absence of a relation) between a given pair of nodes. As a simple baseline, we trained a Random Forest (RF) classifier, using the node embeddings derived from RESCAL as features.
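As a concrete illustration of how such features can be derived, the following is a minimal sketch assuming recent PyKEEN versions; the triples file path is hypothetical, and the concatenation of head and tail embeddings is one possible feature construction rather than the exact setup used here.

import numpy as np
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

# Load the co-occurrence triples and split them 0.8/0.1/0.1 (path is illustrative)
tf = TriplesFactory.from_path("cooccurrence_triples.tsv")
training, testing, validation = tf.split([0.8, 0.1, 0.1], random_state=1234)

# Train RESCAL with early stopping, 25 epochs, and seed 5678 (settings as listed in Table 4.4)
result = pipeline(model="RESCAL", training=training, testing=testing, validation=validation,
                  stopper="early", training_kwargs=dict(num_epochs=25), random_seed=5678)

# All entity embeddings as a (num_entities, embedding_dim) array
entity_emb = result.model.entity_representations[0](indices=None).detach().cpu().numpy()

def pair_features(head_label, tail_label):
    # One possible feature construction: concatenate the head and tail embeddings
    e2id = result.training.entity_to_id
    return np.concatenate([entity_emb[e2id[head_label]], entity_emb[e2id[tail_label]]])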
We tested two versions of the RESCAL model for generating the embeddings for RF features: one using the default MarginRankingLoss with default entity and relation initializers, and one using BCEWithLogitsLoss with the "normal" entity and relation initializers (Table 4.4). We split the data into train/validation/test sets with the ratio 0.8/0.1/0.1 using the random seed 1234, and the same training/validation/testing splits were used for each RESCAL model. In addition to evaluating the impacts of the RESCAL loss function on the downstream RF model, we also tested three negative sampling strategies for the RF method: random, corrupted tail, and embedding-based.

Random: The random sampling approach aims to choose a random subset of the possible combinations of head and tail entities in the dataset. In practice, random selection from all possible combinations is computationally intractable for a dataset of this size; even just calculating the number of possible combinations causes an OverflowError. Therefore, we implemented a computational shortcut that approximates true random sampling of combinations. We randomly sample head and tail entities separately, and then pair the corresponding indices of the two lists together to create pairs, removing any pairs that either appear in the positive set or that have identical head and tail entities.

Corrupted Tail: The corrupted tail sampling method is modeled after the default sampling method implemented in PyKEEN. In our implementation, each negative instance is created by taking one of the positive instances and replacing the tail entity with a randomly sampled other entity, creating a triple that is not found in the positive set.

Embedding: In a KG, the number of possible negative triples far outweighs the number of positive triples, as any two nodes in the network not already connected by an edge can form a negative triple. However, not all negatives are created equal; it is more difficult for a model to identify a negative triple that is semantically plausible than one that is clearly false. Therefore, random sampling to generate the negative instances for the training set is likely to result in a model that is not able to successfully distinguish between positive and negative samples in a realistic case where the negatives are not obviously incorrect. To ameliorate this, we tried a corruption approach, where negative triples are generated by taking a positive triple and randomly replacing the tail entity, and an embedding-based approach. The idea of the embedding-based approach is to generate negative triples that are semantically plausible, to force the RF model to learn to distinguish between all kinds of negative triples and true positive triples. We modeled our embedding-based negative sampling on the method presented in [Islam et al., 2021]. For each positive triple in the dataset, we randomly sample 50 possible new tails, and eliminate any that would make a true positive triple. We then use the embeddings from the RESCAL model to calculate the Euclidean distance between the original tail and each possible new tail. We then calculate a softmax probability on the Euclidean distances, which generates a score that is higher for tails that are closer to the original triple (have smaller distance scores).
Following [Islam et al., 2021], rather than taking the highest-scoring triple directly, which can lead to accidentally sampling false negative triples (since the KG is known to be incomplete, negative triples may actually be true, unrecorded triples), we sample randomly from the top 5 highest-scoring triples to choose the new tail. If the resulting triple is already in the negative set, we sample again until we obtain a triple that has not already been chosen.

For each upstream RESCAL model (MarginRankingLoss vs. BCEWithLogitsLoss), we trained three RF models. Each model's training set had the same positive instances, but each model used one of the three negative sampling methods to generate the training instances in the negative class. We sampled the training instances out of the RESCAL training set, and the test instances out of the RESCAL testing set. All models were tested on the same test set, which was generated using the random negative sampling method and contained 1,000 instances of each positive class and 1,000 negative instances. The training set for each RF model contained 2,000 instances of each positive class and 2,000 × (number of positive classes) negative instances. While in principle a balanced training set would contain only 2,000 negative instances if the negative class were treated as uniform, negative triples in fact correspond to one positive class or another. For example, if a positive triple is (Barack Obama, is, Democrat), the negative triple (Barack Obama, is, Republican) is a negative triple with similar semantics, while the negative triple (Photosynthesis, produces, Expo markers) belongs to an entirely different semantic grouping. Additionally, we would prefer our model to predict more false negatives than false positives, because any hypothesis predicted by this model could require years of labor on the part of an experimental scientist, so being cautious in how we treat the negative class is prudent. In total, we trained and evaluated six models for the drought + desiccation dataset. For each RF model, we used sklearn and performed a random search hyperparameter optimization with the following parameter distribution settings: "n_estimators": randint(100, 500), "max_depth": randint(1, 50), "criterion": ["gini", "entropy", "log_loss"]. We used the sklearn functions f1_score (average="macro"), roc_auc_score (average="macro" and multi_class="ovo"), and confusion_matrix to evaluate the output.
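To make the hyperparameter search and evaluation described above concrete, the following is a minimal sketch under stated assumptions: the feature matrices and labels (X_train, y_train, X_test, y_test) are synthetic stand-ins for the RESCAL-embedding features and the four classes (negative, desiccation, drought, both), and the number of search iterations and cross-validation folds are illustrative choices, as they are not specified here.

import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import f1_score, roc_auc_score, confusion_matrix

# Stand-in data; in practice these come from the RESCAL embeddings and sampled triples
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 100)), np.arange(200) % 4
X_test, y_test = rng.normal(size=(50, 100)), np.arange(50) % 4

# Parameter distributions as described in the text
param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(1, 50),
    "criterion": ["gini", "entropy", "log_loss"],
}

# Random search over RF hyperparameters (n_iter and cv are illustrative)
search = RandomizedSearchCV(RandomForestClassifier(random_state=1234),
                            param_distributions, n_iter=25, cv=5, n_jobs=-1)
search.fit(X_train, y_train)

# Evaluation with the metrics named in the text
preds = search.predict(X_test)
probs = search.predict_proba(X_test)
print("Macro F1:", f1_score(y_test, preds, average="macro"))
print("Macro OVO AUROC:", roc_auc_score(y_test, probs, average="macro", multi_class="ovo"))
print(confusion_matrix(y_test, preds))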
The second prediction approach we used was the built-in prediction functionality of PyKEEN's RESCAL, which calculates a score for the probability of a given triple being true. The PyKEEN implementation provides three basic prediction functionalities, which all rely on the calculation of a plausibility score: (1) calculating the score for every possible triple ("predict all"), which the developers do not recommend because scoring every possible triple is computationally intensive; (2) calculating scores for a specific list of triples ("predict triples"); and (3) given a head entity and a relation type, returning a list of the most likely tail entities to complete the relation, ordered by plausibility score ("predict target").

Parameter | Margin Ranking Loss Model | BCE Loss Model
stopper | "early" | "early"
model | "RESCAL" | "RESCAL"
model_kwargs | (defaults) | dict(entity_initializer="normal", relation_initializer="normal")
loss | (default MarginRankingLoss) | BCEWithLogitsLoss
training_kwargs | dict(num_epochs=25, checkpoint_name="checkpoint_name.pt", checkpoint_frequency=0) | dict(num_epochs=25, checkpoint_name="checkpoint_name.pt", checkpoint_frequency=0)
random_seed | 5678 | 5678
Table 4.4 Keyword arguments provided to PyKEEN's pipeline object during model training.

To compare with the RF implementation, we used the same test set triples with option (2), the predict triples method. The PyKEEN implementation requires a relation type to be specified for all triples with this method, so for the negative triples, we randomly sampled the relation type out of the available types (desiccation, drought, or both). The RESCAL authors state that "link prediction can be done by comparing [the plausibility score] to some given threshold" [Nickel et al., 2011]. Practically, RESCAL is a tensor factorization graph embedding model, where a loss function is used to optimize the model during training. The choice of loss function determines how the triple scores produced by RESCAL can be interpreted. The default loss function for PyKEEN's implementation is a pairwise loss function, which takes one negative and one positive triple at a time, and optimizes such that the positive triple should always receive a higher score than the negative triple. The result of this optimization method is that there is no global threshold around which positive and negative triple scores are optimized: the score of a positive triple from one pair could be less than the score of a negative triple from another pair. Therefore, classification of triples by defining a score threshold as suggested in [Nickel et al., 2011] is not possible when the algorithm is optimized with a pairwise loss. On the other hand, pointwise losses do optimize around a global threshold of 0; triples with scores greater than 0 should be positive, while those with scores less than 0 should be negative. In our case, the default loss function for RESCAL in PyKEEN (MarginRankingLoss) is a pairwise loss function, which means that we could not use the RESCAL models trained with MarginRankingLoss to perform triple classification. Instead, we needed to use a pointwise loss function, of which BCEWithLogitsLoss is one. Using the same RESCAL model trained with BCEWithLogitsLoss as we did for the RF models above, we asked RESCAL to provide plausibility scores for the 4,000 test set triples. Using 0 as a classification threshold, we calculated an F1 score for this method as a classifier. In addition to evaluating the ability of the model to correctly classify a set of triples, we also used the "predict all" functionality to assess the model's capability to identify high-probability triples, keeping the top 100 triples.
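To illustrate how the triple classification and top-100 extraction described above can be run, the following minimal sketch uses the pykeen.predict module of recent PyKEEN versions (exact function names, keyword arguments, and result attributes may differ between versions); "result" is assumed to be the pipeline result for the BCEWithLogitsLoss model, and "test_triples" an existing list of (head, relation, tail) label triples.

from pykeen.predict import predict_all, predict_triples

# Score the labeled test triples and classify around the pointwise threshold of 0
pack = predict_triples(model=result.model, triples=test_triples,
                       triples_factory=result.training)  # factory needed to map labels to IDs
scored = pack.process(factory=result.training).df
scored["predicted_positive"] = scored["score"] > 0

# "Predict all": keep only the 100 highest-scoring triples over the whole graph
top_100 = predict_all(model=result.model, k=100).process(factory=result.training).df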
We manually assessed the validity of the top 10 relations using a Web of Science search. For each pair, we performed a search for the query "<head entity> AND <tail entity>" (the final queries are listed in Table S4.1). For entities with highly specific terms, we simplified the entity and performed an additional search; an example is "wheat germ systems", which was changed to "wheat germ".

Temporal link prediction

To determine a suitable algorithm for TLP on our network, we first performed a literature search on available model architectures for TLP on heterogeneous networks (Table S4.2). While we were able to identify several unique algorithms, only six of these had code associated with the paper; of these six, only two have code that has been updated within the last four years, and of those two, only one had code documented well enough to use without major modification. We therefore chose STHN, the algorithm with recent and serviceably reusable code, for our experiments. After communicating with the developer to clarify details regarding input data formatting, we ran the STHN algorithm with --max_edges set to 50, and with the --predict_class option.

Code availability

All code for this project is available at https://github.com/serenalotreck/literature-genes.

CHAPTER 5
CONCLUSION

Link prediction on literature-derived KG for hypothesis generation leaves room to grow

In this work, I demonstrated that using literature-derived KG with static link prediction methods is insufficient to provide high-quality automatic hypothesis generation. However, any study involving the implementation of specific methods to solve a problem is limited by the time and imagination of the study designer, and there are always more options and methods to try. Therefore, although this thesis does not present promising results regarding hypothesis generation on KG, the results are limited to a specific subset of methods, and there is much room for future improvement. Having intended to work on both KG construction and link prediction in this thesis, it is my opinion that the two tasks each deserve their own separate investigations. KG construction and evaluation is a labor-intensive process, and, as shown in this work, link prediction doesn't offer a guarantee of useful predictions. For a potential future PhD student, I would therefore suggest focusing on either KG construction or link prediction/hypothesis generation on an existing, high-quality KG, as the course of action most likely to maximize success. For those interested in pursuing the KG construction avenue of future work, there are three areas that I believe would benefit from special focus. The first is the problem setup for KG construction. The motivation for using literature as a data source for KG construction is that existing database-derived graphs are limited in scope due to the manual curation requirements of the databases from which they are built. However, using literature alone to build a KG also results in an incomplete graph, due to the limitations of information extraction as well as the difficulty of resolving entity mentions with varied spellings or synonyms to create a graph that is robust and easily usable. Therefore, I would suggest focusing construction efforts on using the literature to complete an existing database-derived graph. There are many stellar examples of KG in both the plant sciences and beyond, based predominantly on database sources, as discussed in the Introduction.
Many of the larger, better-established KG benefit from entity grounding, where each node and edge is linked to a unique identifier, and mentions of the same underlying, real-world object can be resolved. Completing a database graph will require careful thought about the study system in question as it relates to the content of the graph, as most existing KG are built around model or crop species. It will also require careful thought about how to successfully integrate literature-derived data into the KG. Entity resolution was completely excluded from this thesis due to logistical constraints; however, I believe that this work would have benefited greatly from investing time and resources into building entity resolution into our pipeline. By focusing on using literature to complete an existing high-quality KG, the final product will likely be more useful for future efforts in hypothesis generation than either a database-only or literature-only KG. The second area for focus in a KG construction project should be the information extraction methods used to draw entities and relations from the literature. As seen in this thesis, current methods struggle especially to identify relations in text, which are a fundamental component of creating a useful KG. Creation of a gold standard dataset, like that described in the first chapter of this thesis, specifically for the domain in which the KG is being created, will be very important for evaluating entity/relation extraction methods. Many improvements can be made on the annotation procedures presented in the first chapter of this thesis (the PICKLE dataset) given greater time and labor inputs. My two principal recommendations would be to (a) use defined ontologies for entity annotations and (b) include entity resolution in annotations. The use of defined ontologies, as demonstrated in the CRAFT corpus's annotation guidelines [Bada et al., 2012], goes a long way toward improving inter-annotator agreement, and has the added benefit of improving the ease with which entities can be integrated into a KG drawn from databases using the same ontologies. Entity resolution, as mentioned above, will likely improve the quality of the resulting KG, as well as potentially improving the performance of entity and relation extraction algorithms, as shown in [Wadden et al., 2019]. In addition to these changes in annotation guidelines, I would also recommend annotating as large a volume of documents as possible. While we found that the performance of the DyGIE++ algorithm did not improve on relation and entity extraction after ~150 documents in the training set for the PICKLE dataset, a larger volume of documents in the annotation set will likely make evaluations of performance and KG quality more robust. In addition to designing a larger and higher-quality annotated training and evaluation dataset, it will be important to consider approaches that can bridge sentence boundaries to extract relations. One likely weakness of DyGIE++ for our use case is that, like many other relation extraction methods, it is limited to extracting relations at the sentence level for reasons of computational complexity. Important biological relations may not be directly stated in a single sentence; therefore, methods that can perform document-level relation extraction will likely help build better KG. That being said, poor relation extraction performance extended to the document-level GPT-based method that we employed.
Luckily, there is always room for improvement with prompt engineering, and this is a ripe area for future research. If the future researcher is able to successfully improve relation extraction to the point where almost all relevant relations are being recovered from scientific abstracts, my third recommendation would be to consider the literature data source being used to build the KG. One of the most common questions I've received about my work on KG is why we are using abstracts instead of full-text articles. There are two reasons. The first is ease of access: there is no guarantee of access to the full text of an article in a workable (non-PDF) format. Abstracts are much more accessible than full text; as of June 2024, there were 37 million citations in PubMed, but as of fiscal year 2023, PubMed Central only contained 9,407,149 articles [PubMed, 2024]. While many institutions have access to paywalled articles, high-throughput collection and processing of these articles is non-trivial. The second justification for using abstracts is that, in principle, the most important findings of a paper should be described succinctly in an abstract. However, based on our findings in the previous chapter, the abstract alone does not appear to be sufficiently information-rich to build a high-quality KG. It is possible that full text can provide better information for KG construction: one paper quantified the difference in biomedical entities in PubMed abstracts versus free full text available on PubMed Central [Müller et al., 2010]. The team found that, on average, about 10% of entities are only found in abstracts, while 75–86% of entities are only found in the full text. Another past study showed that while information density was higher in abstracts (more unique entities per length of text), the information coverage of the full text was much greater [Schuemie et al., 2004]. While it is clear that the full text is more information-rich for entities, I was unable to find any similar work on the richness of relationships; this would be an excellent area for future study. However, since there are so many more entities in the full text, we can likely assume that there are more relations in full text as well. Therefore, if the performance of relation extraction methods is relatively assured, finding ways to incorporate full text into KG construction for biological domains will likely improve the quality of the resulting KG. For a researcher more interested in extending the hypothesis generation aspect of this work, my principal recommendation is to evaluate the landscape of project motivation via a more human-centered approach, evaluating the needs of researchers in order to inform a more appropriate hypothesis generation solution. Rather than assuming that the desired result of a hypothesis generation system is a fully automated process that removes the human researcher from the loop, as was the case in this work, further studies involving human participants could provide insight into more nuanced and potentially more effective forms of hypothesis generation. Domain expertise is still necessary in a world with automated hypothesis generation, and we often gain our domain expertise by reading the literature during the process of manual hypothesis generation.
In addition, we already possess a great number of tools designed to augment researchers' efficiency in searching the literature – are there other modes of augmentation, in other parts of the manual hypothesis generation pipeline, that researchers might want? Are KG the right tool to accomplish the observed goals of real people, or is there some other entirely different avenue down which this work could or should progress? In the Introduction of this thesis, I discuss tools such as AgroLD and KnetMiner, which are large KG designed to help scientists explore the known landscape of plant biology. As essentially an information scientist, I view such tools as well-developed and very useful to biologists. However, asking around in my own community, I have yet to encounter a biologist who was aware of such KG-based tools before I described them. An area of future research that I think is very important is the use of surveys to engage with potential stakeholders to investigate potential synergies between how they currently generate hypotheses and KG-based or other tools for automated hypothesis generation. Given the connection between domain expertise and hypothesis generation, it is unlikely that a fully automated system would make sense for scientists; therefore, further work is required to determine what the needs of the scientific community are with regard to hypothesis generation. Additionally, involvement of domain experts once the hypothesis generation method has been determined is extremely important. The test set for a link prediction model, for example, while useful for evaluating model performance in a vacuum, cannot tell you whether or not the links you are predicting are meaningful. If a domain expert doesn't find the connections between organisms and chemical elements being predicted by the model relevant, it doesn't matter that the system is good at predicting them. In terms of the technical aspect of hypothesis generation, I would express caution with regard to continuing to use KG embedding methods for link prediction as the primary mode of hypothesis generation. My intuition says that, while improvement in the quality of the underlying KG and incorporation of a temporal component will likely yield some improvement in link prediction capabilities, the low baseline performance of the static link prediction methods on a middle-quality graph seen here causes me to suspect that any performance increases may still not be sufficient to generate actionable hypotheses. My experience of working with KG embedding models was one of spending a lot of time trying different hyperparameters, inputs, and models, for very little corresponding improvement in performance. While I have certainly not exhaustively tried every option that even just the PyKEEN package implements, and the graphs on which I was working were not of the highest quality, I can envision a scenario where such an exhaustive search doesn't provide any meaningful improvement over the initial implementation. As discussed briefly in the conclusion of the previous chapter, investigation of the impacts of network structure on embedding algorithms could potentially help illuminate why link prediction models fail on biological datasets. I would recommend a thorough evaluation of the impact of degree skew (scale-free or approximately scale-free behavior) on embedding models, especially as compared to datasets like the Kinships dataset on which models like RESCAL are often evaluated.
Additionally, on a higher level, thorough exploration of the interests of the research community in automated or augmented hypothesis generation methods will provide insight into potential alternative methods. KG-based or otherwise, there likely exist methods that could have a much greater degree of success than link prediction. It seems of the utmost importance to me to explore other possible avenues for hypothesis generation before resuming work on the link prediction trajectory presented in this thesis.

A brief note on the role of information overload in the writing of this dissertation and its implications

As noted by [Bawden and Robinson, 2020], even when writing on information overload, information overload remains a problem, and selectivity in citations is necessary to maintain focus. When I first began this research in 2019, my launching point into the field of hypothesis generation was through knowledge graph completion. As a result of only using search terms related to knowledge graphs while reading the literature to propose my dissertation research, I developed a kind of literature myopia, where the selectivity of my citations was biased away from the broader field in which my research was situated; this became even more problematic upon what is referred to in systematic literature reviews as "backwards search", where the researcher follows the citations in their initial search results [Foo et al., 2021, Xiao and Watson, 2019]. The extent of this myopia only became clear when the methods I had identified as promising candidates for hypothesis generation started failing; I returned to the literature with the sense that I had maybe missed something important. I decided to relax the specificity of my search terms and, instead of knowledge graphs, simply search the phrase "automated hypothesis generation". Since then, I have luckily stumbled upon several other terms that seem to encompass the body of literature in this field, and would like to explicitly state them here: automated hypothesis generation, automatic hypothesis generation, and literature-based discovery. While using a diversity of search terms may seem obvious, the knowledge of exactly which terms to search to gain a comprehensive understanding of the state of this field, which I will predominantly refer to as automated hypothesis generation, took me several years to come to. In addition, even the seemingly trivial difference between "automated" and "automatic" in a search engine drastically changed the papers that turned up in the results, which is why I feel it is important to point out just how dramatically a bias in search terms can affect the process of science for a given individual. Existing work has demonstrated the effect of word choice in titles and abstracts on the visibility of papers in search engine results, indicating that papers that use more jargon are less often cited [Martínez and Mammola, 2020], and there exists a body of literature containing recommendations for search engine optimization through considered formulation of titles and abstracts to change the overall visibility of an academic paper [Shahzad et al., 2017, Pottier et al., 2023]. There is an additional body of literature on how to formulate systematic reviews, including recommendations on how to choose search terms.
However, these papers predominantly assume that researchers are aware of the possible search terms they would need, and the advice is catered towards choosing search terms out of a known set, including advice like "The keywords for the search should be derived from the research question(s)" [Xiao and Watson, 2019], "A major consideration in systematic searching is balancing the principles of sensitivity and specificity" [Purssell and McCrae, 2020], or even just using an example search term without explaining how it was chosen [Foo et al., 2021]. These recommendations contain useful advice, such as expanding search terms to include abbreviations, and in the case of [Xiao and Watson, 2019], even address the issue of alternative terms:

"Second, researchers doing cross-country studies should pay attention to the cultural difference in terminology. For instance, "eminent domain" is called "compulsory acquisition" and "parking lot" called "car park" in Australia and New Zealand. "Urban revitalization" is typically called "urban regeneration" in the United Kingdom. The search can only be successful if we use the correct vocabulary from the culture of study. Third, Bayliss and Beyer (2015) brought up the issue of the evolving vocabulary. For example, the interstate highway system was originally called "interstate and defense highways" because it was constructed for defense purposes in the cold war era (Weingroff 1996). The term "defense" was then dropped from the name. Therefore, researchers should be conscious of the vocabulary changes over time. In the search of literature dated back in history, one should use the correct vocabulary from that period of time."

However, note that the phrasing of this advice implies a pre-existing knowledge of alternative terms in a field. In scientific fields that do not necessitate a geographic focus in the same way urban planning does, nearly all literature searches are international by default. Disciplines can be divided terminologically along invisible boundaries that don't correspond to something evident like geography or time, and to the best of my knowledge, there is no body of work that has quantified the effect of this invisible prerequisite knowledge on systematic literature reviews or citation metrics. As a result of my experiences in writing this dissertation, my personal definition of information overload has now expanded to include the process by which important information is effectively hidden from an individual because they do not already possess some invisible prerequisite knowledge. My personal experience was that performing a research project outside my lab's expertise meant that I didn't have an inside source who was aware of the various terminology that I needed to search; however, given the lack of quantification of this phenomenon, who is to say that experienced researchers are not unknowingly missing important or novel literature in their field as a result of terminology differences? Because search terms determine so much of how we process and share information as scientists, I would be extremely interested to see the results of future work exploring the impact of invisible prerequisite knowledge on bibliometrics like those explored in the third chapter of this thesis.
Additionally, because the hypothesis generation system explored in this thesis, as well as many other potential approaches, relies on the output of a scientific literature search, quantification of the impact of missing search terms will be important to the efficacy of potential future literature-based hypothesis generation systems.

BIBLIOGRAPHY

[Abu-Salih et al., 2023] Abu-Salih, B., AL-Qurishi, M., Alweshah, M., AL-Smadi, M., Alfayez, R., and Saadeh, H. (2023). Healthcare knowledge graph construction: A systematic review of the state-of-the-art, open issues, and opportunities. Journal of Big Data, 10(1):81.

[Akujuobi, 2021] Akujuobi, U. (2021). Revolutionizing Hypothesis Generation.

[Alger, 2019] Alger, B. E. (2019). The Scientific Hypothesis Today. In Alger, B. E., editor, Defense of the Scientific Hypothesis: From Reproducibility Crisis to Big Data. Oxford University Press.

[Ali et al., 2021] Ali, M., Berrendorf, M., Hoyt, C. T., Vermue, L., Sharifzadeh, S., Tresp, V., and Lehmann, J. (2021). PyKEEN 1.0: A Python Library for Training and Evaluating Knowledge Graph Embeddings.

[Alpert, 2005] Alpert, P. (2005). The Limits and Frontiers of Desiccation-Tolerant Life. Integrative and Comparative Biology, 45(5):685–695.

[Angeli et al., 2015] Angeli, G., Johnson Premkumar, M. J., and Manning, C. D. (2015). Leveraging Linguistic Structure For Open Domain Information Extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 344–354, Beijing, China. Association for Computational Linguistics.

[Bada et al., 2012] Bada, M., Eckert, M., Evans, D., Garcia, K., Shipley, K., Sitnikov, D., Baumgartner, W. A., Cohen, K. B., Verspoor, K., Blake, J. A., and Hunter, L. E. (2012). Concept annotation in the CRAFT corpus. BMC Bioinformatics, 13(1):161.

[Bawden and Robinson, 2020] Bawden, D. and Robinson, L. (2020). Information Overload: An Introduction. In Oxford Research Encyclopedia of Politics. Oxford University Press.

[Bekhuis, 2006] Bekhuis, T. (2006). Conceptual biology, hypothesis discovery, and text mining: Swanson's legacy. Biomedical Digital Libraries, 3(1):2.

[Bewley, 1979] Bewley, J. D. (1979). Physiological Aspects of Desiccation Tolerance. Annual Review of Plant Physiology, 30(1):195–238. https://doi.org/10.1146/annurev.pp.30.060179.001211.

[Bian et al., 2019] Bian, R., Koh, Y. S., Dobbie, G., and Divoli, A. (2019). Network Embedding and Change Modeling in Dynamic Heterogeneous Networks. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 861–864, Paris, France. ACM.

[Bleker et al., 2023] Bleker, C., Ramšak, Z., Bittner, A., Podpečan, V., Zagorščak, M., Wurzinger, B., Baebler, S., Petek, M., Križnik, M., Dieren, A. v., Gruber, J., Afjehi-Sadat, L., Županič, A., Teige, M., Vothknecht, U. C., and Gruden, K. (2023). Stress Knowledge Map: A knowledge graph resource for systems biology analysis of plant stress responses. bioRxiv, 2023.11.28.568332.

[Cachola et al., 2020] Cachola, I., Lo, K., Cohan, A., and Weld, D. (2020). TLDR: Extreme Summarization of Scientific Documents. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4766–4777, Online. Association for Computational Linguistics.
[Cai et al., 2023] Cai, B., Xiang, Y., Gao, L., Zhang, H., Li, Y., and Li, J. (2023). Temporal Knowledge Graph Completion: A Survey. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, pages 6545–6553. arXiv:2201.08236 [cs].

[Cai et al., 2018] Cai, H., Zheng, V. W., and Chang, K. C.-C. (2018). A Comprehensive Survey of Graph Embedding: Problems, Techniques, and Applications. IEEE Transactions on Knowledge and Data Engineering, 30(9):1616–1637.

[Caufield et al., 2024] Caufield, J. H., Hegde, H., Emonet, V., Harris, N. L., Joachimiak, M. P., Matentzoglu, N., Kim, H., Moxon, S., Reese, J. T., Haendel, M. A., Robinson, P. N., and Mungall, C. J. (2024). Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES): a method for populating knowledge bases using zero-shot learning. Bioinformatics, 40(3):btae104.

[Chen et al., 2019] Chen, H., Cao, G., Chen, J., and Ding, J. (2019). A Practical Framework for Evaluating the Quality of Knowledge Graph. In Zhu, X., Qin, B., Zhu, X., Liu, M., and Qian, L., editors, Knowledge Graph and Semantic Computing: Knowledge Computing and Language Understanding, Communications in Computer and Information Science, pages 111–122, Singapore. Springer.

[Cooper et al., 2024] Cooper, L., Elser, J., Laporte, M.-A., Arnaud, E., and Jaiswal, P. (2024). Planteome 2024 Update: Reference Ontologies and Knowledgebase for Plant Biology. Nucleic Acids Research, 52(D1):D1548–D1555.

[Darnala et al., 2023] Darnala, B., Amardeilh, F., Roussey, C., Todorov, K., and Jonquet, C. (2023). C3PO: a crop planning and production process ontology and knowledge graph. Frontiers in Artificial Intelligence, 6.

[Dileo et al., 2023] Dileo, M., Zignani, M., and Gaito, S. (2023). DURENDAL: Graph deep learning framework for temporal heterogeneous networks. arXiv:2310.00336 [cs].

[Fo et al., 2023] Fo, K., Chuah, Y. S., Fyh, H., Davey, E. E., Fullwood, M., Thibault, G., and Mutwil, M. (2023). PlantConnectome: knowledge networks encompassing >100,000 plant article abstracts. bioRxiv, 2023.07.11.548541.

[Foo et al., 2021] Foo, Y. Z., O'Dea, R. E., Koricheva, J., Nakagawa, S., and Lagisz, M. (2021). A practical guide to question formation, systematic searching and study screening for literature reviews in ecology and evolution. Methods in Ecology and Evolution, 12(9):1705–1720.

[Grzyb and Skłodowska, 2022] Grzyb, T. and Skłodowska, A. (2022). Introduction to Bacterial Anhydrobiosis: A General Perspective and the Mechanisms of Desiccation-Associated Damage. Microorganisms, 10(2):432.

[Hassani-Pak et al., 2016] Hassani-Pak, K., Castellote, M., Esch, M., Hindle, M., Lysenko, A., Taubert, J., and Rawlings, C. (2016). Developing integrated crop knowledge networks to advance candidate gene discovery. Applied & Translational Genomics, 11:18–26.

[Hassani-Pak et al., 2021] Hassani-Pak, K., Singh, A., Brandizi, M., Hearnshaw, J., Parsons, J. D., Amberkar, S., Phillips, A. L., Doonan, J. H., and Rawlings, C. (2021). KnetMiner: a comprehensive approach for supporting evidence-based gene discovery and complex trait analysis across species. Plant Biotechnology Journal, 19(8):1670–1678.

[Hibshman et al., 2020] Hibshman, J. D., Clegg, J. S., and Goldstein, B. (2020).
Mechanisms of Desiccation Tolerance: Themes and Variations in Brine Shrimp, Roundworms, and Tardigrades. Frontiers in Physiology, 11.

[Imbert et al., 2023] Imbert, B., Kreplak, J., Flores, R.-G., Aubert, G., Burstin, J., and Tayeh, N. (2023). Development of a knowledge graph framework to ease and empower translational approaches in plant research: a use-case on grain legumes. Frontiers in Artificial Intelligence, 6.

[Islam et al., 2021] Islam, M. K., Aridhi, S., and Smaïl-Tabbone, M. (2021). Simple negative sampling for link prediction in knowledge graphs. In The 10th International Conference on Complex Networks and their Applications, volume 1016, pages 549–562, Madrid, Spain. Springer International Publishing.

[Issa et al., 2021] Issa, S., Adekunle, O., Hamdi, F., Cherfi, S. S.-S., Dumontier, M., and Zaveri, A. (2021). Knowledge Graph Completeness: A Systematic Literature Review. IEEE Access, 9:31322–31339.

[Kanehisa, 2002] Kanehisa, M. (2002). The KEGG Database. In 'In Silico' Simulation of Biological Processes, pages 91–103. John Wiley & Sons, Ltd.

[Klerings et al., 2015] Klerings, I., Weinhandl, A. S., and Thaler, K. J. (2015). Information overload in healthcare: too much of a good thing? Zeitschrift für Evidenz, Fortbildung und Qualität im Gesundheitswesen, 109(4):285–290.

[Kong et al., 2019] Kong, C., Li, H., Zhang, L., Zhu, H., and Liu, T. (2019). Link Prediction on Dynamic Heterogeneous Information Networks. In Tagarelli, A. and Tong, H., editors, Computational Data and Social Networks, pages 339–350, Cham. Springer International Publishing.

[Landhuis, 2016] Landhuis, E. (2016). Scientific literature: Information overload. Nature, 535(7612):457–458.

[Larmande and Todorov, 2021] Larmande, P. and Todorov, K. (2021). AgroLD: A Knowledge Graph for the Plant Sciences. In Hotho, A., Blomqvist, E., Dietze, S., Fokoue, A., Ding, Y., Barnaghi, P., Haller, A., Dragoni, M., and Alani, H., editors, The Semantic Web – ISWC 2021, Lecture Notes in Computer Science, volume 12922, pages 496–510. Springer International Publishing, Cham.

[Le Guillarme and Thuiller, 2022] Le Guillarme, N. and Thuiller, W. (2022). TaxoNERD: Deep neural models for the recognition of taxonomic entities in the ecological and evolutionary literature. Methods in Ecology and Evolution, 13(3):625–641.

[León-Lobos et al., 2012] León-Lobos, P., Way, M., Aranda, P. D., and Lima-Junior, M. (2012). The role of ex situ seed banks in the conservation of plant diversity and in ecological restoration in Latin America. Plant Ecology & Diversity, 5(2):245–258.

[Li et al., 2023] Li, C., Hong, R., Xu, X., Trajcevski, G., and Zhou, F. (2023). Simplifying Temporal Heterogeneous Network for Continuous-Time Link Prediction. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 1288–1297, Birmingham, United Kingdom. ACM.

[Lotreck et al., 2023] Lotreck, S., Segura Abá, K., Lehti-Shiu, M. D., Seeger, A., Brown, B. N. I., Ranaweera, T., Schumacher, A., Ghassemi, M., and Shiu, S.-H. (2023). Plant Science Knowledge Graph Corpus: a gold standard entity and relation corpus for the molecular plant sciences. in silico Plants, 6(1):diad021.

[Lotreck et al., 2024] Lotreck, S. G., Ghassemi, M., and VanBuren, R. T. (2024).
Unifying the research landscape of desiccation tolerance to identify trends, gaps, and opportunities. bioRxiv.

[Marks et al., 2021] Marks, R. A., Farrant, J. M., Nicholas McLetchie, D., and VanBuren, R. (2021). Unexplored dimensions of variability in vegetative desiccation tolerance. American Journal of Botany, 108(2):346–358.

[Martínez and Mammola, 2020] Martínez, A. and Mammola, S. (2020). Specialized terminology limits the reach of new scientific knowledge. bioRxiv, 2020.08.20.258996.

[Milošević and Thielemann, 2023] Milošević, N. and Thielemann, W. (2023). Comparison of biomedical relationship extraction methods and models for knowledge graph creation. Journal of Web Semantics, 75:100756.

[Müller et al., 2010] Müller, B., Klinger, R., Gurulingappa, H., Mevissen, H.-T., Hofmann-Apitius, M., Fluck, J., and Friedrich, C. M. (2010). Abstracts versus Full Texts and Patents: A Quantitative Analysis of Biomedical Entities. In Cunningham, H., Hanbury, A., and Rüger, S., editors, Advances in Multidisciplinary Retrieval, pages 152–165, Berlin, Heidelberg. Springer.

[Ni et al., 2023] Ni, X., Zhao, Y., and Yao, Y. (2023). Dynamic Heterogeneous Link Prediction Based on Hierarchical Attention Model. In Proceedings of the 8th International Conference on Cyber Security and Information Engineering, pages 111–115, Putrajaya, Malaysia. ACM.

[Nicholson and Greene, 2020] Nicholson, D. N. and Greene, C. S. (2020). Constructing knowledge graphs and their biomedical applications. Computational and Structural Biotechnology Journal, 18:1414–1428.

[Nickel et al., 2011] Nickel, M., Tresp, V., and Kriegel, H.-P. (2011). A Three-Way Model for Collective Learning on Multi-Relational Data.

[Peng et al., 2023] Peng, C., Xia, F., Naseriparsa, M., and Osborne, F. (2023). Knowledge Graphs: Opportunities and Challenges. Artificial Intelligence Review, 56(11):13071–13102.

[Persson et al., 2011] Persson, D., Halberg, K. A., Jørgensen, A., Ricci, C., Møbjerg, N., and Kristensen, R. M. (2011). Extreme stress tolerance in tardigrades: surviving space conditions in low earth orbit. Journal of Zoological Systematics and Evolutionary Research, 49(s1):90–97.

[Pottier et al., 2023] Pottier, P., Lagisz, M., Burke, S., Drobniak, S. M., Downing, P. A., Macartney, E. L., Martinig, A. R., Mizuno, A., Morrison, K., Pollo, P., Ricolfi, L., Tam, J., Williams, C., Yang, Y., and Nakagawa, S. (2023). Keywords to success: a practical guide to maximise the visibility and impact of academic papers.

[PubMed, 2024] PubMed (2024). About PMC.

[Purssell and McCrae, 2020] Purssell, E. and McCrae, N. (2020). Searching the Literature. In Purssell, E. and McCrae, N., editors, How to Perform a Systematic Literature Review: A Guide for Healthcare Researchers, Practitioners and Students, pages 31–44. Springer International Publishing, Cham.

[Qin and Yeung, 2024] Qin, M. and Yeung, D.-Y. (2024). Temporal Link Prediction: A Unified Framework, Taxonomy, and Review. ACM Computing Surveys, 56(4):1–40.

[Ramšak et al., 2018] Ramšak, Z., Coll, A., Stare, T., Tzfadia, O., Baebler, S., Van De Peer, Y., and Gruden, K. (2018). Network Modeling Unravels Mechanisms of Crosstalk between Ethylene and Salicylate Signaling in Potato. Plant Physiology, 178(1):488–499.

[Raymond, 2019] Raymond, D. (2019). Using Artificial Intelligence to Combat Information Overload in Research.
IEEE Pulse, 10(1):18–21.

[Rossi et al., 2021] Rossi, A., Firmani, D., Matinata, A., Merialdo, P., and Barbosa, D. (2021). Knowledge Graph Embedding for Link Prediction: A Comparative Analysis. ACM Transactions on Knowledge Discovery from Data, 15(2):1–49. arXiv:2002.00819 [cs, stat].

[Sajadmanesh et al., 2019] Sajadmanesh, S., Bazargani, S., Zhang, J., and Rabiee, H. R. (2019). Continuous-Time Relationship Prediction in Dynamic Heterogeneous Information Networks. ACM Transactions on Knowledge Discovery from Data, 13(4):1–31.

[Schuemie et al., 2004] Schuemie, M. J., Weeber, M., Schijvenaars, B. J. A., Van Mulligen, E. M., Van Der Eijk, C. C., Jelier, R., Mons, B., and Kors, J. A. (2004). Distribution of information in biomedical abstracts and full-text publications. Bioinformatics, 20(16):2597–2604.

[Seo et al., 2022] Seo, S., Cheon, H., Kim, H., and Hyun, D. (2022). Structural Quality Metrics to Evaluate Knowledge Graphs. arXiv:2211.10011 [cs].

[Sett et al., 2018] Sett, N., Basu, S., Nandi, S., and Singh, S. R. (2018). Temporal link prediction in multi-relational network. World Wide Web, 21(2):395–419.

[Shahzad et al., 2017] Shahzad, A., Mohd Nawi, N., Abd Hamid, N., Khan, S. N., Aamir, M., Ullah, A., and Abdullah, S. (2017). The Impact of Search Engine Optimization on The Visibility of Research Paper and Citations. JOIV: International Journal on Informatics Visualization, 1(4-2):195–198.

[Smalheiser, 2012] Smalheiser, N. R. (2012). Literature-based discovery: Beyond the ABCs. Journal of the American Society for Information Science and Technology, 63(2):218–224.

[Smalheiser and Swanson, 1998] Smalheiser, N. R. and Swanson, D. R. (1998). Using ARROWSMITH: a computer-assisted approach to formulating and assessing scientific hypotheses. Computer Methods and Programs in Biomedicine, 57(3):149–153.

[Swanson, 1986] Swanson, D. R. (1986). Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 30(1):7–18. Johns Hopkins University Press.

[The Gene Ontology Consortium, 2019] The Gene Ontology Consortium (2019). The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Research, 47(D1):D330–D338.

[Thessen et al., 2023] Thessen, A. E., Cooper, L., Swetnam, T. L., Hegde, H., Reese, J., Elser, J., and Jaiswal, P. (2023). Using knowledge graphs to infer gene expression in plants. Frontiers in Artificial Intelligence, 6.

[Unni et al., 2022] Unni, D. R., Moxon, S. A. T., Bada, M., Brush, M., Bruskiewich, R., Caufield, J. H., Clemons, P. A., Dancik, V., Dumontier, M., Fecho, K., Glusman, G., Hadlock, J. J., Harris, N. L., Joshi, A., Putman, T., Qin, G., Ramsey, S. A., Shefchek, K. A., Solbrig, H., Soman, K., Thessen, A. E., Haendel, M. A., Bizon, C., Mungall, C. J., and The Biomedical Data Translator Consortium (2022). Biolink Model: A universal schema for knowledge graphs in clinical, biomedical, and translational science. Clinical and Translational Science, 15(8):1848–1855.

[Wadden et al., 2019] Wadden, D., Wennberg, U., Luan, Y., and Hajishirzi, H. (2019). Entity, Relation, and Event Extraction with Contextualized Span Representations. arXiv:1909.03546 [cs].

[Wang et al., 2021a] Wang, M., Ma, X., Si, J., Tang, H., Wang, H., Li, T., Ouyang, W., Gong, L., Tang, Y., He, X., Huang, W., and Liu, X. (2021a). Adverse Drug Reaction Discovery Using a Tumor-Biomarker Knowledge Graph. Frontiers in Genetics, 11.
[Wang et al., 2021b] Wang, X., Chen, L., Ban, T., Usman, M., Guan, Y., Liu, S., Wu, T., and Chen, H. (2021b). Knowledge graph quality control: A survey. Fundamental Research, 1(5):607–626.

[Wren, 2008] Wren, J. (2008). The 'Open Discovery' Challenge. pages 39–55.

[Xiao and Watson, 2019] Xiao, Y. and Watson, M. (2019). Guidance on Conducting a Systematic Literature Review. Journal of Planning Education and Research, 39(1):93–112.

[Xue et al., 2020] Xue, H., Yang, L., Jiang, W., Wei, Y., Hu, Y., and Lin, Y. (2020). Modeling Dynamic Heterogeneous Network for Link Prediction using Hierarchical Attention with Temporal RNN. arXiv:2004.01024 [cs, stat].

[Yin et al., 2019] Yin, Y., Ji, L.-X., Zhang, J.-P., and Pei, Y.-L. (2019). DHNE: Network Representation Learning Method for Dynamic Heterogeneous Networks. IEEE Access, 7:134782–134792.

[Yue et al., 2022] Yue, C., Du, L., Fu, Q., Bi, W., Liu, H., Gu, Y., and Yao, D. (2022). HTGN-BTW: Heterogeneous Temporal Graph Network with Bi-Time-Window Training Strategy for Temporal Link Prediction. ArXiv.

[Zhou et al., 2018] Zhou, L., Yang, Y., Ren, X., Wu, F., and Zhuang, Y. (2018). Dynamic Network Embedding by Modeling Triadic Closure Process. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).

APPENDIX

Figure S4.1 Examples of low-quality OpenIE triples. Real triples extracted from the dataset using OpenIE. Because OpenIE is schema-free and domain-agnostic, it can only rely on syntactic (grammatical) rules, and therefore extracts extremely long clauses as entities, leading to uninformative relations.

Head entity | Edge type | Tail entity | AND search
wheat germ system | both | deciduous seasonal forests | (TS=(wheat germ)) AND TS=(deciduous forests)
opuntia fragilis | both | deciduous seasonal forests | (TS=(opuntia fragilis)) AND TS=(deciduous forests)
pshsfa7a1_2595 | both | mwsp | (TS=(pshsfa7a1)) AND TS=(mwsp)
salt stress-induced calcium signal | both | c. stelligera | (TS=(calcium)) AND TS=(c. stelligera)
bzip23 transcription factor activity | both | deciduous seasonal forests | (TS=(bzip)) AND TS=(deciduous forests)
beech-fir stand | both | deciduous seasonal forests | (TS=(beech-fir stand)) AND TS=(deciduous forests)
t. fluminensis | both | deciduous seasonal forests | (TS=(Tradescantia fluminensis)) AND TS=(deciduous forests)
crocus sativus l | both | reds/breaker tomatoes | (TS=(crocus sativus l)) AND TS=(tomatoes)
drought-responsive and jasmonic acid biosynthesis genes | both | b. amphitrite | (TS=(jasmonic acid)) AND TS=(Amphibalanus amphitrite)
mandarin water | both | deciduous seasonal forests | (TS=(mandarin water)) AND TS=(deciduous forests)
Table S4.1 Final search queries for top ten predicted triples.

Method | Reference | Year | Code available?
DynamicTriad | [Zhou et al., 2018] | 2018 | Yes
TMLP | [Sett et al., 2018] | 2018 | No
NP-GLM | [Sajadmanesh et al., 2019] | 2019 | No
DHNE | [Yin et al., 2019] | 2019 | Yes
HA-LSTM | [Kong et al., 2019] | 2019 | No
Change2vec | [Bian et al., 2019] | 2019 | Yes
DyHATR | [Xue et al., 2020] | 2020 | Yes
HTGN-BTW | [Yue et al., 2022] | 2022 | No
Att-ConvLSTM | [Ni et al., 2023] | 2023 | No
STHN | [Li et al., 2023] | 2023 | Yes
DURENDAL | [Dileo et al., 2023] | 2023 | Yes
Table S4.2 Summary of literature search for heterogeneous TLP methods.

Figure S4.2 AUROC scores for RF models with different sampling strategies. AUROC curves for models for each upstream RESCAL loss function and negative sampling strategy.
No meaningful difference was found between the two RESCAL losses in the RF models, and there was no meaningful difference between the random and corrupt sampling strategies.

[Figure not reproduced: ROC curve panels for each combination of RESCAL loss (MarginRankingLoss, BCEWithLogitsLoss) and negative sampling strategy (Random, Corrupt, Embedding), showing per-class and macro-average ROC curves against the chance level.]

Figure S4.3 Confusion matrices for RF models with different sampling strategies. Confusion matrices for models for each upstream RESCAL loss function and negative sampling strategy. No meaningful difference was found between the two RESCAL losses in the RF models, and there was no meaningful difference between the random and corrupt sampling strategies.

[Figure not reproduced: confusion matrix panels for each combination of RESCAL loss and negative sampling strategy over the negative, desiccation, drought, and both classes.]

Figure S4.4 STHN performance. AUROC and loss plotted for the training epochs of STHN.