EXPLORING THE MATHEMATICAL SHAPE OF PLANTS By Erik J. Amézquita A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computational Mathematics, Science, and Engineering—Doctor of Philosophy 2023 ABSTRACT Shape plays a fundamental role across all organisms at all observable levels. Molecules and proteins constantly fold and wrap into intricate designs inside cells. Cells arrange into elaborate motifs to form sophisticated tissues. Layers of different tissues come together to form delicate vascular systems that sustain leaves. Each of these tissues evolved as part of a distinct branch of the ever-growing tree of life. From micro-biology to macro-evolutionary scales, shape and its patterns are foundational to biology. Measuring and understanding the shape is key to extracting valuable information from data, and push further our insights. Shape is too complex to be comprehensively tackled with traditional methods. Landmark-based morphometrics fail if there are not enough homologous points shared across all sampled individuals. Elliptical Fourier Descriptors are not suitable for 3D data. These limitations are especially pressing when we combine our plant visualizations with X-ray Computed Tomography (CT) technology to also record the sophisticated internal structure of stems, seeds, and fruits. Here, I study the potential of Topological Data Analysis (TDA) for plant shape quantifica- tion. TDA is a combination of different mathematical and computational disciplines that seeks to describe concisely and comprehensively the shape of data in a general setting. In very succinct terms, TDA consists of two basic ingredients. First we think the data as a collection of points. Second, we define a notion of distance between every pair of points. The points could be atoms, biomolecultes, cells’ nuclei, image pixels, or an organism itself. Distances between points could be the Euclidean, geodesic, genetic, or correlation-based. Once we have data points and distances, we merge systematically the points, starting with those that are closer to each other. The key idea is to keep track of distinct blobs, loops, and voids that form and disappear as we merge several points. This versatile idea is not constrained to a particular dimension or set of landmarks, which makes it ideal to compare a vast array of possible different shapes. In this work, we will explore new techniques for mathematical plant phenotyping by studying three concrete cases. For the first case, we digitally extract the totality of shape information from X-ray CT scans of tens of thousands of barley seeds. With the Euler Characteristic Transform, topological and traditional morphological descriptors of the seeds are then used to successfully characterize different barley varieties based solely on the grain shape. This result later enables us to deduce potential genes that contribute to distinct morphology, bridging the phenotype with its genotype. A future goal is to link these genes to climate adaptability to breed better crops for an ever changing weather. For the second case, we use directional statistics and persistence homology to model the shape and distribution of citrus and their oil glands. This leads us to a novel path to explore developmental constraints that govern novel relationships between fruit dimensions from both evolutionary and breeding perspectives. For the third case, we comprehensively measure the shape of walnut shells and kernel. Combining novel size- and shape-specific descriptors, we explore the relationship between shell morphology and traits of commercial interest such as the easiness to remove the kernel intact or the integrity of the shell after being cracked. From the perspective that all data, whether phenotypic or genotypic, has shape, TDA can extract the totality of morphological information. We have interest applying this approach to more crops, to more plant biology inspired datasets, and to large-scale gene expression and population genetic data. Copyright by ERIK J. AMÉZQUITA 2023 TABLE OF CONTENTS CHAPTER 1: INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 CHAPTER 2: BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 CHAPTER 3: THE SHAPE OF BARLEY . . . . . . . . . . . . . . . . . . . . . . . . 32 CHAPTER 4: EXPLORING THE SHAPE OF AROMA AND CITRUS OIL GLANDS 52 CHAPTER 5: THE SHAPE OF KERNELS AND CRACKS, IN A NUTSHELL . . . . 70 CHAPTER 6: CONCLUDING REMARKS . . . . . . . . . . . . . . . . . . . . . . . . 89 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 APPENDIX A: BARLEY APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . 115 APPENDIX B: CITRUS APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 APPENDIX C: WALNUT APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . 136 v CHAPTER 1 INTRODUCTION Un viento que no deja crecer ni a las dulcamaras: esas plantitas tristes que apenas si pueden vivir un poco untadas en la tierra, agarradas con todas sus manos al despeñadero de los montes. —from El Llano en Llamas Juan Rulfo Demeter was the Greek goddess of the harvest, agriculture, fertility and religious law. She, as the rest of the Greek deities, was an embodiment of multitasking and multidisciplinarity. In fact, most of the Olympus had to supervise multiple tasks each, as it seems that economic trouble and budget constraints have plagued Greece since mythological times. Demeter must have had a precise knowledge of yields and cycles of all the different crops, along with fine-grained details on how each variety of cereal reacts to different soils, climates and farming practices. On top of that, she had to supervise and assist the proper following of sacred rules. With such a packed schedule, she needed to identify quickly the important traits of numerous cultivated plants. Perhaps a thorough, informed glance at the shape of each grain and spike revealed to her all the information required to understand the yield and resistance properties of different plant species. Maybe Demeter used mathematics, especially algebraic topology, and some directional statistics to spice things up, to comprehensively quantify and compare shape. Shape plays a fundamental role across all organisms at all observable levels. Molecules and proteins constantly fold and wrap into intricate designs inside our cells. Cells arrange into elaborate motifs to form sophisticated tissues. Layers of different tissues come together to form delicate vascular and nervous systems that sustain hands, wings, or fins. Each of these limbs evolved as part of a distinct branch of the ever-growing tree of life. From micro-biology to macro-evolutionary scales, shape and its patterns are foundational to biology. Measuring and understanding the shape is key to extracting valuable information from data, and push further our insights. Even if we limit our scope to plant biology, a simple glance outdoors reveals a large diversity 1 of shape among flowers, leaves, fruits, and branches. A first attempt to characterize the shape of a plant could use traditional morphometrics, describing the shape of the plant in terms of height, stem thickness, or number of branches. With these measures at hand, we could look for allometry —the relative growth of parts of an organism to the whole— and thus linearly transform biological shapes between each other. However, any given plant shape is too complex to measure it simply in terms of length, width, and branching angles. A more careful attempt, as suggested by Bookstein (1997), could use geometric modern morphometrics (GMM) instead, where we first define homologous landmark points on every sample and then measure shape similarity by overlapping all these landmarks and computing their Euclidean differences. The computation of differences can be refined by rotating, translating, and scaling appropriately the landmark coordinates prior to the general overlap. This procedure, known as generalized Procrustes analysis, defines a morpho- space, or a space of all possible shapes based on all the possible landmark configurations, which allows us to define overall shape distance (Gower, 1975). The GMM approach can produce distorted results if there are not enough landmarks shared across all sampled individuals, which could occur if we attempt to compare tissues from different families. In the absence of corresponding sets of coordinates, we may attempt to describe the outline of the shape using Fourier analysis (Lestrel, 1997), by considering the outline as a harmonic series, or the sum of wave-like curves (Kuhl and Giardina, 1982). Both morphometrics or Fourier analysis have proven to be extremely insightful to uncover hidden patterns that mold diverse organism shapes at genetic, developmental, evolutionary, and environmental levels (Chitwood and Sinha, 2016). The methods described above, face enormous challenges when we combine our plant visualiza- tions with X-ray CT (computed tomography) technology to also record the sophisticated internal structure of stems, seeds, and roots. Even with high-resolution images, it is hard to comprehen- sively and simultaneously measure the vast array of external patterns and internal structures across all possible tissues. Complex branching architectures multiply tenfold in front of our eyes with CT technology, as we now have access to vascular and nervous systems, hyphae, and a better insight of the overall evolutionary tree. X-ray CT imaging has provided new insights into root architecture 2 (Atkinson et al., 2019; Booth et al., 2020; Griffiths et al., 2022; Zhou et al., 2020), xylem vessel mechanics (Choat et al., 2018; Gauthey et al., 2020), and internal voids in flowers and berries (Xiao et al., 2021). But at the same time in this CT setting, shared coordinates and common outlines are simply not enough to capture the rich morphology we observe both with our eyes and with X-rays (Li et al., 2020). A new strategy is needed. We focus thus on topology and as famously stated by David Kendall (1984), As topologists already have a theory of shape, I must apologize for using the word again with an entirely different meaning. In this paper ‘shape’ is used in the vulgar sense, and means what one would normally expect it to mean. Topological Data Analysis (TDA) is a combination of different mathematical and computational disciplines that seeks to describe concisely and comprehensively the shape of data in a general setting (Amézquita et al., 2020; Lum et al., 2013; Munch, 2017). In extremely succinct terms, TDA consists of two basic ingredients and a key idea. The first ingredient is to think of the data as a collection of points, and the second is to define a notion of distance between every pair of points. The points could be atoms, biomolecultes, cells’ nuclei, image pixels, or an organism itself. Distances between points could be the Euclidean, geodesic, genetic, or correlation-based. Once we have data points and distances, known formally as a metric space, we can connect these points starting with those that are closer to each other first. The key idea is to keep track of distinct shape features that form and disappear as we connect and merge several points. These ingredients and idea, albeit simple, are extremely versatile and can be adapted to a myriad of contexts and data collections. Moreover, the notion of shape presented by TDA is limited solely by the data itself, unleashing it from possible selection biases. This very adaptability and impartiality makes TDA a powerful data analysis tool that can further our insights in a variety of plant biology scenarios. Plant biology faces mountains of genetic and phenotypic data that must be efficiently analyzed and summarized. The solutions to grand challenges we face in the plant sciences, including predicting phenotype from molecular profiles, lie in mathematics (Autran et al., 2021; Bucksch et al., 2017). Despite emerging interdisciplinary research networks, the active collaboration between 3 Figure 1.1: An example of geometric morphometrics. A) 24 landmarks (orange dots) and pseudo- landmarks (6,000 evenly spaced vertices between landmarks, magenta dots) on grapevine leaves of Cabernet sauvignon (orange), Chardonnay (blue), and Chasselas cioutat (green) varieties. Every grapevine leaf has five major veins, allowing corresponding landmarks to be placed throughout every leaf. B) Corresponding vertices allows replicates to be superimposed on each other and C) mean leaves calculated using Procrustean methods that translate, rotate, reflect, and scale. D) A Principal Component Analysis (PCA) and other statistics can be performed on the Procrustes- adjusted vertices (95% confidence ellipses for each variety are shown). these two domains remains limited. We propose the mathematical study of shape as one of the many potential bridges between mathematics and plant biology. In recent years, TDA has produced promising results in diverse biological problems, like histological image analysis (Kovacev-Nikolic et al., 2016), viral phylogenetic trees description (Chan et al., 2013), and active-binding sites identification in proteins (Qaiser et al., 2019). In plant biology specifically, it has been used to characterize the morphospace of all possible leafs as in Figure 1.1 (Li et al., 2018), the 3D structure of grapevine inflorescence (Li et al., 2019), the shape of different apple accessions (Migicovsky et al., 2019). The Euler characteristic has also been used to study the genetic basis of cranberry shape (Diaz-Gárcia et al., 2018), the hairiness and shape 4 of spikelets —arrangements of grass flowers— (McAllister et al., 2019) and patterns of vegetation from satellite imagery (Mander et al., 2017). The power of TDA lies on its versatility to produce different “topological signatures” for different shapes. In Chapter 2 we present the main mathematical concepts behind TDA at an intuitive level. We provide a middle ground between mathematics and biology: for mathematicians, a review of ways TDA has been successfully used to study biology and for biologists, an accessible introduction to topological thinking. We begin with examples from structural biology, evolution, and cellular architecture that lend themselves to simple but powerful TDA representations. We then examine shapes, focusing on the outlines of leaves, and the use of Euler characteristic curves as convenient topological signatures that enable statistical analyses. Next, we highlight ways that TDA can measure branching architecture and the use of bottleneck distance to calculate the overall topological similarity between objects. We end with a discussion about future trends in TDA: measuring dynamic shapes and time series as well as using topology to convert data to graphs representing its structure. In Chapter 3, as a proof of concept, we quantify the morphology of X-ray CT scans of barley spikes and seeds using both traditional and topological shape descriptors. By combining both sets of descriptors, we can successfully train a support vector machine to distinguish and classify 28 different varieties of barley based solely on the 3D shape of their grains. Rather than being an alternative, we propose TDA as a powerful complement to traditional shape analysis, where the topological descriptors pick up morphological information that is usually missed. We propose then that this shape characterization will allow us later to link genotype with phenotype, furthering our understanding on how the physical shape is genetically specified in DNA. In Chapter 4, we investigate the shape of citrus fruit. We analyze 3D X-ray CT scan recon- structions corresponding to 51 citrus accessions. Based on the centers of the oil glands, overall fruit shape is approximated with an ellipsoid. Possible oil gland distributions on this ellipsoid surface are explored using directional statistics. Our observations point to the existence of biophys- ical developmental constraints that govern novel relationships between fruit dimensions from both 5 evolutionary and breeding perspectives. Understanding these biophysical interactions prompt an exciting research path on fruit development and breeding. In Chapter 5, we explore the shape of walnut shells and kernels. We analyze almost 1300 individual 3D X-ray CT scan reconstructions of walnuts, corresponding to 171 walnut accessions. Like in the previous chapter, we exploit the nondestructiveness of X-rays to digitally segment and measure the 4 main tissues of interest for each walnut, namely shell, kernel, packing tissue, and sealed air. From these we extract a total of 38 size- and shape-specific descriptors, many of them unexplored in the current literature. We focus on several allometric relationships of interest, from which we draw theoretical upper and lower bounds of possible walnut and kernel sizes. We then study correlations and variations of these morphological descriptors with qualitative data corresponding to traits of commercial interest like ease of kernel removal and shell strength. Finally, Chapter 6 poses several exciting potential future directions we could take for each of the previous three chapters. 6 CHAPTER 2 BACKGROUND En qué momento habían resucitado, cómo había sido la sensación de pasar del polvo a la forma y de la forma a la vida y se pellizcaban para ver si les salía sangre —from El Tiempo Principia en Xibalbá Luis de Lión Biologists are accustomed to thinking about how the shape of biomolecules, cells, tissues, and organisms arise from the effects of genetics, development, and the environment. Less often do we consider that data itself has shape and structure, or that it is possible to measure the shape of data and analyze it. Here, we review applications of Topological Data Analysis (TDA) to biology in a way accessible to biologists and applied mathematicians alike. TDA uses principles from algebraic topology to comprehensively measure shape in datasets. Using a function that relates the similarity of data points to each other, we can monitor the evolution of topological features –—connected components, loops, and voids. This evolution, a topological signature, concisely summarizes large, complex datasets. This chapter is derived from the review paper • E.J. Amézquita, M.Y. Quigley, T. Ophelders, E. Munch, D.H. Chitwood (2020). The shape of things to come: Topological data analysis and biology, from molecules to organisms. Developmental Dynamics 249(7): 816–833. 2.1 Topological Data Analysis (TDA): a primer 2.1.1 The Vietoris-Rips Complex Topology is the branch of mathematics concerned with mathematical properties that are pre- served under continuous transformations. With some mathematical framework, described below, topology offers powerful tools that can precisely describe the overall shape and structure of the data encoded by a given network. Informally, we can think the topology of a network as the collection 7 Figure 2.1: An example of a simplicial complex. It has two connected components, one loop, and one void. of its features that remain unchanged whenever the data “varies smoothly”. For example, scaling, centering, translation, and rotation are all smooth operations that do not alter the topology (i.e., the core shape) of our data. However, partitioning, merging, and attaching are not smooth operations and may significantly alter topology. In a mathematical context, networks are referred to as graphs. Nodes or points are referred to as vertices, while links between nodes as edges. We can generalize the idea of graphs by adding triangles that link edges or even tetrahedrons that link triangles. More formally, we can think of our data as composed of different building blocks, called simplices. Vertices, edges and triangles are 0-, 1- and 2-dimensional simplices, respectively. A collection of multiple simplices makes a simplicial complex, or complex, for short. For example, in Figure 2.1 we have a complex made of vertices, edges, and triangles. We can describe the topology of a complex based on the number of its connected components, loops, and voids. For example, in Figure 2.1 we can see two distinct, separate pieces, each of them being a connected component. We see that five edges in the left component form the frame of a pentagon. We say then that these five edges form a loop. Also on the left, we see a collection of four triangular faces that form a tetrahedron. We can assume that this tetrahedron is hollow, so that the complex contains a void. Many times, our data or network cannot be thought immediately as a complex. However, we can generate a complex based on a collection of data points and a notion of similarity or distance between these points. Formally, a collection of individual points and positive distances between every pair of points is referred to as a metric space. The Vietoris-Rips (VR) complex is a versatile method to define a complex from a network. The VR complex starts with data in a metric space and a fixed nonnegative parameter r, often referred to as the radius. If two vertices are close enough, 8 that is, the distance between them is less than r, then the VR complex will have an edge between those two vertices. Similarly, if there are three vertices close enough, that is, the distance between every pair of them is less than r, then the VR complex will have a triangle between those three vertices. Following these two rules, every time we have a triangle, we also have the three edges that make the frame or border of such triangle. Conversely, every time a trio of vertices form a triangular frame, the VR complex will also contain the corresponding triangle. 2.1.1.1 Walking through an example Notice that the same data and metric can produce different VR complexes by using a different parameter r each time. We can consider a sequence of increasing radii and its corresponding sequence of VR complexes. First observe that the distance between any two different data points is always a positive quantity. If we start with r = 0, then the corresponding VR complex will consist solely of separate vertices, one for each data point. As the radius r increases, the corresponding VR complex will now have edges that link the pairs of vertices that are close to each other. If r keeps increasing, we may then have triangles that link trios of close vertices. For example, consider the five data points in Figure 2.2.A, which we can take as vertices of a complex. The distance between the points will be simply the Euclidean distance. Consider seven different positive radii. For the first three radii, the shape remains the same: just five separate components. Suddenly, as soon as we increase the radius a fourth time, four pairs of vertices are finally close to each other so we draw edges between them to form a square. There is also a fifth vertex that remains isolated, as it is still distant from the rest. When the radius increases a fifth time, the isolated vertex is finally close enough to one of the square vertices. We draw one more edge at this point. The radius increases a sixth time, so that the pair of diagonal vertices in the square are close enough. We then draw the diagonals of the square, which also draws the four possible triangles in the square. The radius finally increases a seventh time, so that the fifth vertex is closer to another vertex in the square. We then add an edge and a triangle including this fifth vertex. As the radius keeps increasing beyond this, the overall shape of the VR complex will not have any significant changes: it will always remain a single component with no holes. 9 A B connected holes components V7 V7 V7 V2 V1 V2 V1 V2 V1 V3 V4 r0 V3 V4 r1 V3 V4 r2 2 2 connected holes C components V7 V7 V7 V2 V1 V2 V1 V2 V1 V3 V4 r3 V3 V4 r4 V3 V4 r5 2 2 2 connected holes V7 V7 D components V2 V1 V2 V1 V3 V4 r6 V3 V4 r7 2 2 r1 r2 r3 r4 r5 r6 r7 radius E F connected holes components V2 V1 V2 V1 V2 V1 V7 V7 V7 V3 V4 r0 V3 V4 r1 V3 V4 r2 2 2 connected holes G components V2 V1 V2 V1 V2 V1 V7 V7 V7 V3 V4 r3 V3 V4 r4 V3 V4 r5 2 2 2 connected holes H components V2 V1 V2 V1 V7 V7 V3 V4 r6 V3 V4 r7 2 2 r1 r2 r3 r4 r5 r6 r7 radius Figure 2.2: An example of two different Vietoris-Rips complexes with resulting persistence bar- codes. (A) Evolution of a VR complex with 7 vertices as Euclidean distance increases. (B) Persistence barcode corresponding to topological changes in the previous VR complex. (C) Alter- native visualization of the persistence barcode B as a dendrogram. (D) Alternative visualization of the persistence barcode B as a tree. (E) Moving one vertex in A yields a different VR complex as Euclidean distance increases. (F) Persistence barcode corresponding to topological changes in the previous complex E. (G) Alternative visualization of the persistence barcode F as a dendrogram. (H) Alternative visualization of the persistence barcode F as a tree. 10 2.1.2 Representing persistent features All the observations described above can be summarized using two topological features: con- nected components and holes. For connected components we need to keep track of which snapshot each connected component appeared (was born) and in which snapshot two separate components merged (died). Similarly, we can keep track of when each hole is formed (born), and when it is filled (dies). These life spans of topological birth and death can be drawn as life bars, the length of which indicates for how long a component persisted before it merged or how long a hole persisted before being filled. Putting all the bars together, we obtain a persistence barcode, in which each bar corresponds to a topological feature and the horizontal axis indicates at which radius value these features are born and die. Note that the vertical order of these life bars is irrelevant. For the persistence barcode in Figure 2.2.B, we observe that we start with five different vertices, all of which remain separate (the components persist) until the fourth radius. By the fourth radius, we only have two connected components: one square and one distant vertex. We also observe the birth of the hole in the square (indicated in blue). By the fifth radius, the distant vertex has merged with the square so we have only one connected component. By the sixth radius, we observe that the square hole has been filled with triangles. From this point onwards, as radius keeps increasing, our VR complex will be essentially a single connected component with no holes. We say then this components dies at infinity and it is represented by the continuing red arrow. We can alternatively display the persistence of components and holes as a dendrogram (Figure 2.2.C) or a tree (Figure 2.2.D), keeping track of which components merge. A particularly useful display of persistence barcodes are persistence diagrams, as illustrated in Figure 2.3. Simply, the birth start point and death end point of a bar in a persistence barcode are transformed as (x, y) coordinates in a death-vs-birth plane in a persistence diagram. Persistence diagrams have a convenient visual and mathematical representation which has allowed further theoretical developments in TDA. Barcodes are a useful way to illustrate and summarize prominent topological features, such as the distant fifth vertex or the hole enclosed by a square. Consider now “obstructing” this hole by moving the distant fifth vertex inside the square, as in Figure 2.2.E. We observe in Figures 2.2.F–H 11 connected components ∞ death birth Figure 2.3: Translating a persistence barcode into a persistence diagram. Birth and death times in the persistence barcode are interpreted as (x, y) coordinates on a death-vs-birth plane. This planar display is referred to as a persistence diagram. a different persistence barcode, dendrogram, and tree, respectively. The barcode now shows that all five vertices merge into a single connected component at earlier stages compared to Figure 2.2.A. Also notice that we have now filled the square’s hole, so that the barcode in Figure 2.2.F registers no holes, unlike Figure 2.2.B. 2.1.3 Filters: Beyond spatial distances As mentioned before, the Vietoris-Rips complex is constructed from a set of vertices and a sense of distance or similarity between these. Usually we refer to such a measure of similarity as a filter function. This function can determine when the vertices and their edges in between are observed. For example, in 2.2.A and 2.2.E, our filter function was the Euclidean distance between vertices. Given a filter function, we can consider a series of snapshots, where in each snapshot, we consider larger and larger filter values, called thresholds. Going back to 2.2.A and 2.2.E, each snapshot considers increasing radius lengths around each vertex. In this case, we say that our collection of data points have been filtered by Euclidean distance with six thresholds. Notice that if we increase 12 Figure 2.4: An example of a persistence barcode. (A) Snapshots of an X-ray CT image of an orange. Only the pixels with intensity lower than indicated are displayed. (B) Persistence barcode of connected components of such image. Observe that the barcode distinguishes the existence of exocarp, rind, and pith as separate components at lower intensities. the number of thresholds, we may be able to capture finer topological changes which may in turn produce richer persistence barcodes. Filter functions are extremely flexible and we can use more than a spatial distance. For example, consider a grayscale image. We will consider each pixel a separate vertex and use an intensity filter, resulting in the distance between two pixels simply being the difference of their intensities. We then consider each possible intensity value as a threshold. Figure 2.4 shows the persistence barcode of connected components from an X-ray Computed Tomography (CT) scan of an orange where we consider more than 50,000 threshold values (Figure 2.4.A). In each snapshot, we only display the voxels (vertices) whose intensity value is less than the value of the threshold. At 30,000, we only observe the contour of the exocarp with some separate bits of rind. At 35,000, more bits (connected components) of rind appear, and some of these rind bits merge into each other. Additionally, we observe the appearance of the pith. By 40,000, we have 3 clear separate connected components, namely exocarp, rind and pith. By 45,000, the rind and the exocarp have merged while numerous bits of endocarp have appeared. By 50,000, the appearance of the endocarp has merged the pith to the exocarp, yielding a single connected orange. 13 2.2 Applied topology: examples from structural biology, evolution, cellular architecture, and brain networks The Vietoris-Rips complex framework introduced above, filtering on the Euclidean distance between data points, can be used to study a wide range of complex phenomena in biology. A metric space might be the 3D coordinates of amino acids in a protein, could represent species or virus variants separated from each other by genetic distance, or might be defined by the nuclei of cells in a cross-section of tissue. Below, we provide examples where the Vietoris-Rips complex has successfully been applied to structural biology, evolution, and cellular architecture (Figure 2.5). 2.2.1 Structural biology This prototypical example of TDA —a metric space consisting of points where the filter is Euclidean distance—can be extended to biomolecules, where the points are atoms or residues (Figure 2.5.A). Proteins are comprised of a linear polymer of amino acids. The primary structure of a protein is the sequence of amino acids. The polypeptide chain of a protein folds upon itself, stabilized by interactions between amino acids, first forming a secondary structure (local structures such as alpha helices and beta sheets formed by hydrogen bonds in the peptide backbone) followed by the tertiary structure. The overall 3D structure of a protein is determined by the interactions of amino acid side chains within the protein. This overall structure, or conformation, of a protein is the basis of protein function: metabolism, transport, signaling, structure, and movement, among many others. The conformation of a protein can change depending upon binding ligands, signaling, or the chemical environment. Kovacev-Nikolic et al. (2016) use TDA to distinguish the open and closed conformations of maltose-binding protein (MBP). Each of the 370 amino acids of MBP is treated as a vertex, its 3D coordinates reflecting the spatial location of the residue. Euclidean distance is used to create a filtered Vietoris-Rips complex for each protein studied. As the Vietoris-Rips complex evolves, persistence barcodes record the birth and death of connected components, loops, and voids, which within the context of the tortuous folding of a protein backbone, yield complex topological signatures unique to distinct conformations. These persistence barcodes are transformed 14 Figure 2.5: Applications of Topological Data Analysis (TDA) to biology, from protein structure, to individual cell patterns, and phylogenetic tree analyses. 15 Figure 2.5 (cont’d): Applications of Topological Data Analysis (TDA) to biology. (A) Structural biology. A diagram of RNA secondary structure (left; solid lines covalent bonds, dashed lines hydrogen bonds). Increasing radii of vertices (middle, right; blue points) is used to visualize filtration on Euclidean distance. As radii merge, connected components die. Purple lines indicate the formation of loops that eventually fill in as the radius threshold increases. (B) Evolution. A plot showing the genetic distance of samples (left). As radius threshold value increases (middle, right) the birth and death of connected components (blue) represents vertical evolution (a tree) while that of loops (purple) horizontal evolution events (such as hybridization, gene transfer, or recombination; modified from Chan et al. (2013)). (C) Cellular architecture. Modification of a part of the original Gleason guide to prostate cancer changes in cellular architecture (left). Nuclei (blue) increase in radius (middle, right) and connected components (blue) and loops (purple) are born and die. (D) Branching architecture. A theoretical tree where the filter is geodesic distance to the base (blue). Branching tips are separate connected components that merge as the filter progresses to the base of the tree (left to right). (E) Mapper. Point cloud of a hand where the filter is the axes from the wrist to fingertips (left). Cover intervals (bars on top of color scale) and their overlap (gray bars) divide points into bins (middle). Points that cluster together over each cluster are assigned to a vertex, and if the points are shared between clusters in an overlap, then they are assigned to an edge connecting the corresponding vertices (modified Lum et al. (2013)). into persistence landscapes (Bubenik, 2015) that allow statistics, hypothesis testing, and machine learning to be applied to differentiate the shapes captured by topological signatures. The authors successfully differentiate open- and closed-conformation states of MBP. They also note that the active site residues (the amino acids responsible for ligand binding) lie at the edge of the most persistent loop of the Vietoris-Rips complex, indicating that TDA is sensitive to the relationship between structure and function. Beyond structure, electrostatic and other chemical properties of atoms can be incorporated into topological signatures that, when analyzed using machine learning methods, can predict protein-ligand binding affinities (Cang et al., 2018; Cang and Wei, 2018). 2.2.2 Evolution Evolution is typically depicted as a tree, which in mathematical terms is an acyclic graph (a graph with no loops). Each node and its descendant branches represent a common ancestor of a taxonomic group and its members as a hierarchy of similarity or relatedness. Evolutionary trees depict vertical evolution, random mutations that accumulate within a specific lineage that lead to phenotypic changes. But genetic material can be exchanged between lineages as well. This 16 process, known as horizontal evolution, is depicted as a reticulate graph (with loops), in which genetic information is exchanged by recombination, hybridization, horizontal gene transfer, or viral reassortment. Extensive phylogenetic theory models vertical evolutionary processes using trees, but the study of horizontal evolution is often limited to detecting reticulate phylogenetic events, and a theory unifying vertical and horizontal evolution had remained elusive. Chan et al. (2013) reconsider evolution from the perspective of topology. Using influenza as an example, they begin by considering that every sample has a genetic distance to every other, a metric space. From this genetic space, they construct a Vietoris-Rips complex, just as in the previous examples (Figure 2.5.B). The resulting persistence barcode for connected components can be converted into a dendrogram, which is the phylogenetic tree that biologists are accustomed to. Influenza viruses extensively exchange genetic material in a process known as reassortment. If the persistence barcode highlights loops, then this represents a topological signature of horizontal evolution. For example, a lower bound of recombination rate can be calculated from the number of loops (recombination or reassortment events) for a given time frame (in this example, the filter of genetic distance which can be calibrated to time). In higher dimensional spaces, voids can be detected, and the authors show that persistence barcodes for voids detect more complicated reassortment events, such as the triple reassortment that gave rise to the 2013 avian influenza outbreak and complex reassortment events within HIV. Using loops as an estimator of recombination rate has been extended to large-scale genomic analysis of human biology (Cámara, 2017; Cámara et al., 2016), expanded upon by evaluating different topological features (Humphreys et al., 2019), applied to coalescent theory to estimate ancestral recombination events (Emmett et al., 2014), and used to study lateral gene transfer of protein families and its implications for the evolution of antibiotic resistance (Emmett and Rabadán, 2014). 2.2.3 Cellular architecture Tissues are comprised of cells, the organization of which is determined by cell division, dif- ferentiation, growth, movement, migration, and death. Within a tissue each cell takes up a finite volume, often in close contact with neighbors. When a tissue is finely cross-sectioned and micro- 17 scopically examined, a tessellated array of cells emerges: an aggregated mixture of parenchyma, stroma, and glands (Figure 2.5.C). Staining can differentiate nuclei, cytoplasm, and extracellular matrix. To a trained eye, these complex patterns can indicate disease or abnormalities, but the process takes time and is subjective. The emergent organization of cells reflects developmental processes as well. TDA provides an objective way to classify these patterns, potentially removing the subjectivity of histopathological diagnosis enabling a rigorous way to define cellular anatomy. Lawson et al. (2019) explore the cellular architecture of prostate cancer. The Gleason grading system is a one to five scale that is a powerful prognostic indicator based on increasingly neoplastic tissue organization of the prostate: a uniform cellular architecture becomes disrupted forming glands that eventually form solid cell types. Tissue sections are stained with blue-purple hematoxylin and pink eosin which indicate nuclei and cytoplasm/extracellular matrix, respectively. The authors use these stains to isolate cell nuclei from surrounding structures (Figure 2.5.C). They then use thresholding as a filter on histological images of prostate cancer to create binary images, where connected components and loops are recorded as persistence barcodes. Creating vectors of the most persistent features, they use a variety of statistical techniques including Principal Component Analysis (PCA), hierarchical clustering, and t-distributed Stochastic Neighbor Embedding (t-SNE) to successfully classify images according to the Gleason grading system (Lawson et al., 2019). The strategy of reducing cells to data points of a Vietoris-Rip complex to classify cellular architecture works for predicting epithelial organization from cell centroids (Atienza et al., 2019) and in other cancers as well (Chung et al., 2018; Singh et al., 2014). 2.2.4 Brain networks The complex architecture of neurons and their numerous connections in the brain, formally referred to as the connectome, is of particular interest. The brain architecture, its activity and connectivity, is usually presented as square pair-wise correlation matrix each row (and column) represents different neurons or encoding brain regions when the subject is performing a fixed task. Usually negative correlations are treated as zero, and the rest of matrix entries are thresholded so that only neurons or regions with strong connectivity are considered. We can then consider a metric 18 space where the points are different neurons, anatomical regions of interest, or imaging voxels. The distance between these points is given by the correlation between them (or 1 minus correlation to be mathematically consistent). With this setup, it is possible to produce Vietoris-Rips complexes and persistence diagrams that summarize the brain network model. Observing the change in number of connected components, we can differentiate the abnormal glucose metabolism associated to neuronal activity between attention-deficit hyperactivity disorder (ADHD) children, autism spectrum disorder (ASD) children, and pediatric control subjects (Lee et al., 2012). By keeping track of persistent loops, we can distinguish the effects of psilocybin in the human brain functional patterns (Petri et al., 2014). Persistent loops also highlight that mental imagery shares the same neurophysiological bases with perceptual and motor experience (Ibáñez-Marcelo et al., 2019). TDA has also revealed previously ignored anatomical loops and voids in the connectome, which might explain both spatial and nonspatial behaviors both in mice (Giusti et al., 2015) and humans (Sizemore et al., 2018). 2.3 Shape, texture, and the Euler characteristic curve The examples above rely on point-based representations of biological data to which a filtered Vietoris-Rips complex on Euclidean (or genetic) distance produces zero-dimensional (connected components) and one-dimensional (loops) persistence barcodes. The persistence of the barcodes, representing prominent topological features, is the focus of analysis. However, the concept of filter can be extended to any real values that can be associated with structures: for instance, different choices of metric space between data points, or even using filter functions based on the data points instead of the edges between the data points. For the same data, a number of filters might be applied, yielding a new lens to reveal different facets of shape. Persistence barcodes can always be calculated, but there are other ways to record topological signatures as well. Below, we first describe the Euler characteristic curve (ECC) as a convenient complement to persistence barcodes to capture topological signatures that can be used with traditional statistical methods. We then describe TDA frameworks to measure shape (in the traditional sense of a closed contour) focusing on leaf outlines and the usefulness of ECCs to measure genetic and environmental 19 effects that determine phenotype. 2.3.1 Euler characteristic curve (ECC) The Euler characteristic, often denoted by the Greek letter χ, was originally defined by the equation: χ = #(Vertices) − #(Edges) + #(Faces). The Euler characteristic is the first example of a topological invariant; that is, a quantity that can be calculated and returns the same value on many different representations of the same topological shape. For convex polyhedra (e.g., the Platonic solids), the Euler characteristic always equals two since all platonic solids are topologically spheres. For example, a tetrahedron has four vertices, six edges, and four faces (4 − 6 + 4 = 2); a cube has eight vertices, 12 edges, and six faces (8 − 12 + 6 = 2). What is even more surprising is that this quantity can also be obtained by counting some intrinsic properties of a given shape. The Euler-Poincaré formula establishes that the formula above is the same as: χ = #(Connected Components) − #(Loops) + #(Voids). Since all convex polyhedra have one connected component and one void, the Euler characteristic is indeed 1 − 0 + 1 = 2. On the other hand, a doughnut, mathematically known as a solid torus, has one connected component, one loop, and no voids so its Euler characteristic is 1 − 1 + 0 = 0. By keeping track of the number of building blocks of our simplicial complex, we can indirectly summarize its topological features. Similar to persistence barcodes, given a data set, for each sample we define vertices, a filter function, and a number of thresholds. For example, consider a 3D (voxel-based) image of a barley seed (Figure 2.6.A). Each voxel is a vertex in our simplicial complex. One type of filter we can apply is the 3D axes of the coordinate system, which is oriented with respect to the depth, width, and height of the seed. For each of these filters, the voxels take the real number value of their coordinate for the particular axis. We then choose a number of thresholds, or equivalently, we choose how many times to “slice” through the seed along the given axis. Each time we take a slice, we compute 20 Figure 2.6: Three different Euler Characteristic Curves (ECCs) from three different filters. (A) X-ray CT scan of a barley seed. The symmetry of the seed encourages a filter by depth, width and height values, i.e., the three main axis directions with respect to the seed scan. Slicing the barley seed in different directions produce (B) different corresponding ECCs. Notice that the three curves end with Euler characteristic equal to 1, which corresponds to the Euler characteristic of a solid sphere. the Euler characteristic of the seed. We continue to add slices one-by-one and recalculate the Euler characteristic each time as we continue through the axis, which is the filter function. Adding all the slices together yields the original seed. Finally, we summarize our computation as an ECC (Figure 2.6.B), where the x-axis is the threshold while the y-axis is the Euler characteristic of the complex at that particular threshold value. Persistence barcodes tend to be notoriously expensive and difficult to compute since they must keep track of all the possible component merges and hole fillings for every threshold value (Otter et al., 2017). Most of the available software to compute persistence barcodes is incapable of handling truly large data sets effectively, especially when each sample consists of millions of vertices. (That being said, there have been recent breakthroughs to efficiently compute persistent diagrams. Ripser is very a promising software capable of handle more and more data with each new release (Bauer, 2021)). Euler characteristic curves are a convenient way to summarize a topological signature of an object as a sequence of numbers, a curve, or numerical vector. Computing and storing these vectors is quite efficient, and it is especially convenient since it allows us to perform standard statistical analysis techniques and test hypotheses about the shape of our data. This simplicity makes it ideal to process objects that may have more than a billion vertices (Heiss and Wagner, 2017; Wang et al., 21 2022). 2.3.2 Shapes and textures Sometimes leaves have corresponding coordinates, as in the case of grapevine where every leaf has five major veins and numerous landmark features (Chitwood and Sinha, 2016). In these instances, geometric morphometrics is a powerful tool. Besides the base of the petiole and tip, though, most leaves do not have coordinates that correspond in a way that analysis by geometric morphometrics is possible. In order to compare the outlines of 182,707 leaves from 141 plant families and 75 sites throughout the world, Li et al. (2018) used TDA. The pixel outline of each leaf is treated as a point cloud. The filter applied to each pixel is a Gaussian density estimator, sensitive to the number of neighboring pixels around each pixel. Straighter edges of the leaf blade will have low density values while pixels in serrations, lobes, or other undulations will have higher values. The number of connected components is monitored and the respective ECCs are computed. For so many leaves, an ECC curve serves as a succinct, computationally feasible topological signature that allows downstream statistical analyses. These signatures can define the morphospace for all leaves, which reveals not only the leaf shapes that exist, but those that do not, either because of developmental constraint or negative selection. This morphospace can be used then to predict plant family and location (Li et al., 2018). Others have used the same filter and ECCs to determine the genetic basis of leaf shape in apple (Migicovsky et al., 2018) and tomato (Li et al., 2018) as well as the genetic basis of cranberry shape (Diaz-Gárcia et al., 2018). ECCs are sensitive enough to complex and subtle changes in shape to measure the effects of rootstock and climate on grapevine leaf shape (Migicovsky et al., 2019). ECCs have also been used to measure the hairiness and shape of spikelets (arrangements of grass flowers) (McAllister et al., 2019), patterns of vegetation from satellite imagery (Mander et al., 2017), and flow cytometry features (Smith and Zavala, 2021). Moreover, by considering all the ECCs corresponding to all directional filters, we can successfully encode important morphological information of barley seeds (Amézquita et al., 2021) and protein structure (Tang et al., 2022). 22 2.4 Branching architectures and bottleneck distances 2.4.1 Persistence diagrams The Euler characteristic allows us to monitor a topological summary as a function of the filter we choose. The resulting curve enables statistical analyses. In some cases, we might not want a summary, though; we may want to keep track of each topological feature separately, as we do in a barcode. The bottleneck distance is a convenient way to determine the overall topological similarity of two barcodes with each other. If we compute the bottleneck distance of all barcodes to all other barcodes, we can determine the overall topological similarity of samples to each other, in which case statistical analyses can be performed. To understand the meaning of bottleneck distance, we need a better display of topological information than persistence barcodes. We thus turn to persistence diagrams. As mentioned previously, each topological feature in the barcode has a birth and death time. Instead of representing a topological feature with a life bar as in persistence barcodes, we can simply represent it with a point in a plane; the x-coordinate of this point is the birth time of the topological feature, while the y-coordinate is its death time. All of our topological information is then displayed in a death-vs-birth plane, referred to as a persistence diagram. Certainly a topological feature cannot die before it is born, so all our points will lie above the diagonal line. We also agree that the top of the plane will represent infinite time, for those features that persist until infinity. Consider a very simple persistence barcode as shown in Figure 2.3. The birth and death times of each life bar are read as (x, y) coordinates on the plane below. Observe that the barcode presents a component that persists until infinity. Thus, we define an “infinite death time” at the top of our diagram below 2.4.2 Bottleneck distance For ease of exposition, we will describe the bottleneck distance in terms of the persistence diagrams rather than the persistence barcodes as they are equivalent. Intuitively, the bottleneck distance between two diagrams measures how much change the first sample must undergo so that its resulting persistence diagram matches the diagram of the second sample. More formally, think 23 A B ∞ ∞ death death birth birth Figure 2.7: Computing the bottleneck distance between two persistence diagrams. (A) A possible pairing of points is suggested. Observe that it produces a large maximum distance between pairs. (B) An alternate pairing that yields a considerably smaller maximum distance between pairs. of bottleneck distance as follows: we overlap the persistence diagrams of two samples, so both diagrams are actually on the same plane. Next, we are tasked to pair topological features between the diagrams. Every point from the first diagram must be either paired to an unmatched point from the second diagram, or matched with the diagonal. Given a pairing, we define its score as the maximum distance between pairs. The bottleneck distance is then defined as the minimum score when considering all possible pairings. For example, consider the two different persistence diagrams drawn on top of each other in Figure 2.7, the first one being represented with red circles while the second with blue triangles. In Figure 2.7.A we pair each triangle with another circle, taking care to match the infinite triangle with the infinite circle. Observe that one circle is matched to the diagonal. The score of this pairing is the length of the longest green, dashed line. A different pairing is considered in Figure 2.7.B, which in turn produces a considerably smaller score as the green lines are all considerably shorter. After considering all possible pairings between diagrams, we realize that Figure 2.7.B is optimal in the sense that it produces the smallest score. The bottleneck distance between these barcodes is then this minimum score. 24 2.4.3 Branching architectures Branching architecture is one example where bottleneck distance is useful. Traditional mor- phometric approaches fail to measure branching, despite it being a common architectural motif throughout life (Li et al., 2017). Li et al. (2019) measure the branching architecture of X-ray Com- puted Tomography (CT) scans of grapevine rachises, the branching stem structure that remains after removing the berries from the cluster. The filter they choose is geodesic distance of each voxel to the rachis base (Figure 2.5.D). The geodesic distance is the shortest distance between two vertices on a surface, in this case, the grapevine rachis itself. Starting with those voxels with the furthest geodesic distance from the base and filtering towards those closest, if zero-dimensional features are monitored, connected components at the tip of the branching structure are first born and then die as they merge at their parent node. Connected components continue to arise at branch tips and die at parent nodes in a hierarchical fashion. Each topological feature corresponds to a bar, and the record of merging can create a dendrogram that recapitulates the branching. There are a number of filters sensitive to branching that have been used in both plants and other organisms (Belchi et al., 2018; Bendich et al., 2010; Bendich et al., 2016; Kanari et al., 2018; Stolz et al., 2022). Branching is an instance where calculating bottleneck distance might be preferred to Euler characteristic curves, because the topological features more directly correspond to the feature of interest (branches). The bottleneck distances of each grapevine rachis to the other create a metric space from which the rachises hierarchically cluster based on morphology (Li et al., 2019). Comparing morphological similarity to evolutionary history, rates of evolution along branches of the phylogenetic tree can be modeled. The morphological similarity matrix calculated from bottleneck distances can also be compared to traditional measurements (such as number of branches, median branch length and width, convex hull) and the ability to classify rachises from different species. 2.5 The structure of data: Mapper and biological networks 2.5.1 Mapper Topological signatures and TDA outputs —barcodes, Euler characteristic curves, and bottleneck distances— measure the shape of data comprehensively but lack a correspondence to the original 25 data. This is known as the inverse problem: from data we can calculate a topological signature, but from a topological signature we cannot reproduce the original data. Biological data is noisy, and if the shape of the underlying structure in data could be visualized, individual data points that contribute to the overall shape of data could be isolated and studied in detail. For this reason, we now turn our attention to the mapper graph, which does provide some information in the reverse direction. By delimiting an underlying structure to our data and assigning correspondence of data points to this structure, complex and noisy datasets are simplified in a way similar to data reduction techniques. Mapper is a tool from TDA that skeletonizes and summarizes the shape and structure of data as a graph Singh et al. (2007). The mapper algorithm is comprised of three main steps. 1) We choose a filter function on the data (Figure 2.8.A), this time associated to vertices rather than edges, and project all data points onto a line according to their filter function values (Figure 2.8.B). 2) Next, we split the real line into a fixed number of bins called covers. Each cover is an interval over the filter and, additionally, there is overlap between the covers. 3) Finally, we cluster the original data points in each of these bins to form graph vertices. Edges are drawn between nodes in the mapper graph if two clusters share some data points (Figure 2.8.C). Imagine a 3D point cloud of data shaped like your hand, where the filter function is the distance of each data point to your wrist Lum et al. (2013) (Figure 2.5). If we created overlapping intervals, or covers, along this axis, then points at the fingertips would each form a vertex, and points towards the base of the fingers would form their own vertices as well. Because there is overlap between the covers, then vertices along each finger, but not between fingers, would share points, and we would draw in edges between these groups of points that would recapitulate the structure of fingers. The most proximal finger vertices would converge with vertices representing the palm, as well as vertices of the thumb. In the case of a hand, it is easy to see how a mapper graph summarizes and recapitulates the structure of the actual data. When applied to real world data, such as volumetric images like an X-ray CT scan, mapper graphs recapitulate shape in intuitive ways Chitwood et al. (2019). 26 Figure 2.8: An example of mapper graphs. (A) X-Ray CT scan of a gall filtered by distance from the center. (B) These filter values are projected to a real line. The real line is then covered by a collection of overlapping intervals. For each interval, we then form different clusters of voxels whose filter value is in such interval. These clusters then yield the vertices and edges of (C) a mapper graph. Formally, the vertices are connected components within a certain range of radius from the center and edges correspond to overlap. Size of vertices and edges corresponds to the size of the component or overlap. Let us consider a voxel-based X-ray CT scan of a gall, a swollen plant growth induced by an insect for its own benefit (Figure 2.8.A). Each voxel is a data point that takes on the value of the filter function, which in this case is its distance from the center of the gall. The Mapper algorithm clusters the data into vertices based on their filter function value (Figure 2.8.B), and if two vertices share some voxels between them (based on the cover intervals assigned and the physical location of the voxels), then they are connected by an edge. Bigger vertices in the mapper graph (Figure 2.8.C) correspond to a larger number of clustered voxels. Thicker edges correspond to a larger number of voxels in the bin overlap. The color of the vertices corresponds to the average filter function values of its voxel members. At the bottom of the graph, we can see a purple cluster corresponding to the core of the gall. As we move up, we eventually find larger turquoise vertices corresponding to the outer layers of the gall. Notice the small vertices that stem from these large turquoise vertices which represent the vasculature of the gall. Finally, as we reach the top of the mapper graph, we find green and yellow vertices that represent the leaf. From this example, two important features of mapper can be seen: its ability to serve as a data reduction technique that summarizes structure and the correspondence of the graph to the original data. 27 2.5.2 Biological networks We have focused on Euclidean distances up until this point. But just like genetic distances can be used to create metric spaces to study evolution, other distance metrics can be used to create graphs that can be studied with TDA as well. Nicolau et al. (2011) used mapper to identify breast cancer subtypes using gene expression microarray data. The filter they use decomposes their data into separate normal and disease components. The resulting mapper graph reveals three distinct arms that, upon subsequent analysis, reveal a distinct genetic subtype of tumors. The architecture of the mapper graph corresponds to disease progression and its vertices to the expression of genes linked to breast cancer subtypes. By choosing an appropriate filter, the mapper graph reveals a structure of the data that might have been missed otherwise and is linked to prognosis. A recent study by Jeitziner et al. (2019) presented a new two-tiered version of the Mapper algorithm, which is particularly useful for small genomic sample sizes. Topological approaches to RNA-sequencing (RNA-seq) data have also been applied to study the in vitro differentiation of murine embryonic stem cells into neurons (Rizvi et al., 2017), specific genomic differences across various lung cancer patients (Amézquita et al., 2022), and to link gene expression to morphological outcomes across all flowering plants (Palande et al., 2022). All this suggests the translatability of Mapper-based topological analysis to many biological contexts. 2.6 A word on statistical caution Most of the time, our data is subject to different kind of errors and we must address the statistical robustness of our topological signals. One foundational result by Cohen-Steiner et al. (2007) is the stability of persistence diagrams with respect to the bottleneck distance. Intuitively, this stability result implies that if all our data points wiggle only a little bit (possibly due to noise), then the resulting points in the persistence diagram will only wiggle a little bit as well. We must be careful with outliers though, since a single outlier can significantly alter our persistence diagram (Figure 2.2). Nonetheless, there has been a number of ideas to address this lack of robustness with respect to outliers, such as using a distance to measure (Chazal et al., 2017) or multi-parameter persistence (Lesnick and Wright, 2015, 2022). Intuitively, since an outlier is distant from every other point, it 28 will lie in a low density region, so we then proceed to discard such regions. Alternatives exploit bootstrap-like ideas, where the original point cloud is resampled and its topological signature computed numerous times (Reani and Bobrowski, 2022). Intuively, a single, isolated, outlying point will very rarely appear in resamples, so its distorting effect will be a rare event in the long run. It is worth to warn that the space of all possible persistence diagrams is a mathematically complicated space to work with. For instance, given a collection of persistence diagrams, there might not be a unique “mean diagram” (Mileyko et al., 2011). The space of persistence diagrams presents a number of difficulties to define p-values, or confidence intervals, which are crucial in any statistical analysis. However, there has been a growing number of ideas and research to address such pitfalls, such as modifying the bottleneck distance to explicitly construct “mean diagrams” (Munch et al., 2015; Turner et al., 2014), adapting randomized null hypothesis tests (Robinson and Turner, 2017), or defining a confidence interval line along the diagonal of the diagrams (Fasy et al., 2014). Other alternatives include transforming diagrams to a simpler and more sound space, where the usual statistics, parameter estimation and hypothesis testing can be carried out as usual. This can be done with persistence landscapes (Bubenik, 2015), persistence images (Adams et al., 2017), or tent functions (Perea et al., 2022; Tymochko et al., 2019) to name a few examples. Another caution to make is the interpretability of topological signatures. While summaries as persistence landscapes and ECCs are powerful when combined with machine learning techniques, it is hard to directly identify phenotypes from them. For instance, it is difficult to deduce seed’s length, height and width based solely on the ECCs from Figure 2.6. Turner et al. (2014) mathematically prove that the collection of all ECCs corresponding to all possible directions effectively summarizes all the morphological information for 3D and 2D shapes. Moreover, with such collection we would be able to reconstruct the original object. Nonetheless, in practice we cannot consider an infinite number of directions. A finite bound on the number of necessary directions for general 3D shapes has been proven (Curry et al., 2022), although the idea of efficiently reconstructing large objects solely from ECCs remains elusive (Betthauser, 2018; Fasy et al., 2019; Micka, 2020). 29 2.7 Conclusion We have seen how given data, a summary of the topological shape and structure of the space can be computed. For instance, data could come as a metric space of any distance—whether Euclidean, geodesic, genetic, functional, or correlative—and we can return a Vietoris-Rips complex and corresponding persistence barcode, which measure the shape of our data. By monitoring connected components, loops, or higher dimensional features, the barcode captures shape comprehensively, by monitoring the evolution of these features as a function of the filter. Such a framework has been used to measure the shapes of proteins, model evolution, and classify tissue architecture. The filter that we choose is arbitrary: it is merely a lens through which we can view relationships between our data points. The ability to choose a filter tailored to the hypothesis at hand is what confers the versatility of TDA to measure the shape of nearly any dataset, often in multiple ways. Gaussian density estimators applied to the pixels defining leaf outlines measures shape, allowing the genetic basis of the plant form to be studied. Geodesic distance captures the branching patterns of grapevine clusters, permitting the analysis of their evolution and modeling of berry development. We can analyze and compare the most persistent features in our barcodes, summarize them using the Euler characteristic, or truly calculate the overall topological similarity between barcodes using bottleneck distance. Using mapper, we can summarize the structure of data as a graph, and upon visualizing nodes of interest, identify the data points —whether voxels of an X-ray CT scan or nodes corresponding to gene expression— for further study and interpretation. The promise of the application of TDA to biology is still in its infancy. Unlike any other method in biology, TDA provides a way to measure topological features and shape in a comprehensive way. The versatility of filter function selection allows TDA to be applied to any number of datasets across sub-disciplines: structural biology, evolution, molecular biology, medicine, neuroscience, and developmental biology. The methods described here can be applied to higher dimensional datasets that are dynamic or evolve over time (Myers et al., 2019; Perea, 2019; Topaz et al., 2015; Tymochko et al., 2020), easily accommodating biological complexity. Regardless of data size, complexity, or dimensionality, TDA provides concise summaries of the information content of any 30 Figure 2.9: Endless forms most beautiful. X-ray Computed Tomography (CT) scans of biological specimens showing the diversity of morphology in the natural world. (A) Magnolia bud, (B) bean flowers, (C) grapevine leaf with phylloxera galls, (D) the fasciated meristem of a velvet flower, (E) side view of a sunflower disc, (F) bell pepper, (G) treerings, (H) marigold flower, (I) vasculature within an apple, (J) Haworthia, (K) Echeveria, (L) Agave hybrid, (M) citrus fruit, (N) monkeyflower, (O) archaeological sunflower disc specimen. dataset from the perspective of shape and structure. Given the spectacular diversity of form across biology (Figure 2.9), a method like TDA, that can be customized to measure shape using a tailored filter function, will allow previously unstudied phenomena to be analyzed from the perspective of shape. The vision of TDA, that data is shape and shape is data, will be relevant as biology transitions into a data-driven era where meaningful interpretation of large datasets is a limiting factor. 31 CHAPTER 3 THE SHAPE OF BARLEY Go and measure for me the barley which is in the storehouse, that which remains from last year’s barley.” Then he set out for her six measures of barley. Then the peasant said to his wife, “Behold, there are twenty measures of barley as food for you and your children. Now make these six measures of barley into bread and beer for me as daily rations, that I may live on them. —from The Tale of the Eloquent Peasant Anonymous tale from the Egyptian Middle Kingdom (1991–1786 BC) There is a discrepancy between the information embedded in biological forms that we can discern with our senses versus that which we can quantify. Methods to comprehensively quantify phenotype are not commensurate with the thoroughness and speed with which genomes can be sequenced. High-throughput phenotyping has enabled us to collect large amounts of phenotyping data (Andrade-Sanchez et al., 2013; Araus and Cairns, 2014; Tanabata et al., 2012); nonetheless, we are not maximizing the information extracted from the data we collect. To extract, compare, and analyze this information embedded in a robust and concise way, we turn to Topological Data Analysis (TDA), specifically the Euler Characteristic Transform (ECT). Here we show the use of ECTs to correctly summarize the shape of barley seeds as a proof of concept. We scanned a collection of barley panicles comprising 28 different accessions with X-ray CT technology at 127 micron resolution. These scans were later digitally processed to isolate 3121 individual grains, and their morphology was quantified using both traditional and topological shape descriptors. We then explored both qualitatively and quantitatively the descriptiveness of these measurements. To aid both assessments, we used KPCA and UMAP separately to aggressively reduce the dimension of the traditional and ECT vectors. We observe that traditional shape descriptors tend to cluster seeds based on their accession, while KPCA-reduced topological shape 32 descriptors tend to cluster them based on panicles. UMAP-reduced topological descriptors balance both approaches and draw shape distinctions at both accession and spike level. This in turn shows that KPCA and UMAP draw from different pieces of ECT information. This observation suggests that the ECT effectively summarizes both spike-specific and accession-specific morphological information which can be then highlighted with an appropriate dimension reduction technique. To quantify the descriptor correctness, we trained a support vector machine (SVM) to determine the accession of individual grains based on their shape alone. Our experiments show that SVMs perform better whenever topological information is taken into account, which suggests that the ECT measures shape that is “hidden” from traditional shape descriptors. This chapter is derived from the original research paper • E.J. Amézquita, M.Y. Quigley, T. Ophelders, J.B. Landis, D. Koenig, E. Munch, D.H. Chitwood (2021). Measuring hidden phenotype: Quantifying the shape of barley seeds using the Euler Characteristic Transform. in Silico Plants 4(1): diab033. 3.1 Introduction Topological Data Analysis (TDA) is a set of tools that arise from the perspective that all data has shape and that shape is data (Amézquita et al., 2020; Lum et al., 2013; Munch, 2017). TDA treats the data as if made of elementary building blocks: points, edges, squares, and cubes, referred to as 0-, 1-, 2-, and 3-dimensional cells respectively (Figure 3.1.A). A collection of cells is referred to as a cubical complex, or complex, for short. Cubical complexes are both a natural and consistent way to represent image data (Kovalevsky, 1989). Given a grayscale image, we follow a strategy similar to Wagner et al. (2012) to construct a cubical complex: a nonzero pixel will correspond to a vertex in our complex. If two pixels are adjacent —in the 4-neighborhood sense— we say that there is an edge between the corresponding vertices in the complex. If 4 pixels in the image form a 2 × 2 square, we will consider a square in our complex between the corresponding 4 vertices (Figure3.1.A). Additionally, for the 3D image case, if 8 voxels —the 3D equivalent of pixels— make a 2 × 2 × 2 cube, we will draw a cube in our 33 complex between the corresponding 8 vertices. TDA seeks to describe the shape of our data based on the number of relevant topological features found in the corresponding complex. For instance, the complex in Figure 3.1.A has two distinct, separate pieces colored in blue and red respectively, formally referred to as connected components. This complex also has 8 edges forming the outline of a square without an actual red block filling it —edges thickened for emphasis— this is referred to as a loop. In higher dimensions, we could also consider hollow blocks containing voids. We can even go a step further and summarize these topological features with a single value known as the Euler characteristic, represented by the Greek letter χ, defined for voxel-based images as χ = #(connected components) − #(loops) + #(voids). The Euler characteristic is a topological invariant; that is, it will remain unchanged under any smooth transformation applied to our shape. The well-known but surprising Euler-Poincaré formula states that χ can be computed easily as χ = #(Vertices) − #(Edges) + #(Faces) − #(Cubes). This equivalence can be seen in the cubical complex in Figure 3.1.A, where χ = 20 vertices − 22 edges + 3 faces = 2 connected components − 1 loop + 0 voids = 1. The Euler characteristic by itself might be too simple. Nonetheless, we can extract more information out of our data-based complex if we think of it as a dynamic object that grows in number of vertices, edges, and faces across time. As our complex grows, we may observe significant changes in χ. The changes in χ can be thought as a topological signature of the shape, referred to as an Euler characteristic curve (ECC). The growth of the complex is defined by a filter function which assigns a real number value to each voxel. For reasons discussed later, we will focus on directional filters which assign to each voxel its height as if measured from a fixed direction. 34 A B C 20 ECC (Z−axis direction , 32 thresholds) ● 15 t2 t4 t6 χ 10 5 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● t2 t4 t6 t8 t14 t32 t8 t14 t32 threshold D Euler Characteristic Transform (3 directions, 32 thresholds each) 20 ● 15 χ 10 5 ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ●●●● ●●●●●●●●●●●●●●●●●●●●●●●● ●●●●● ● ●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●● 0 ● ● ●●● ● ● ● ● ● ● ● 0 8 16 24 32 40 48 56 64 72 80 88 96 threshold Direction X Y Z Figure 3.1: Extracting topological shape signatures from barley seeds. (A) A binary image (left) is treated as a cubical complex (right). This cubical complex has 2 connected components, 1 loop, 0 voids. The distinct connected components are colored in blue and red respectively. The loop is emphasized with thicker edges. (B) The barley seeds were aligned so that their proximal-distal, medial-lateral, and adaxial-abaxial axes corresponds to the X, Y, Z-axes in space. (C) Example of an Euler Characteristic Curve (ECC) as we filter the barley seed across the adaxial-abaxial axis (depicted as a solid, green line) through 32 equispaced thresholds. (D) The Euler Characteristic Transform (ECT) consists of concatenating all the ECCs corresponding to all possible directions. In this example, we concatenate 3 ECCs corresponding to the X, Y, Z directions, represented by the solid lines respectively. 35 As an example, consider the cubical complex of a barley seed and the direction corresponding to the adaxial-abaxial axis (Figure 3.1.B). Voxels at the top of the seed will be assigned the lowest values, while voxels at the bottom will obtain the highest values. We then consider 32 equispaced, increasing thresholds t1 < t2 < . . . < t32 which define 32 different slices of equal thickness along the adaxial-abaxial axis. We start by computing the Euler characteristic of the first slice, that is, all the voxels with filter value less than t1 . Next we aggregate the second slice, which are all the voxels with filter value less than t2 , and recompute the Euler characteristic. We repeat the procedure for the 32 slices. For instance in Figure 3.1.C, we observe that we started with scattered voxels which are thought of as many connected components which may explain the high Euler characteristic values. As we keep adding slices, we connect most of the stray voxels into fewer but larger connected components, and simultaneously, we might have created loops as seen in t4 and t6 . This merging of connected components, and formation and closing of loops might explain the fluctuation of the Euler characteristic between positive and negative values. Finally, after more than half of the slices have been considered, at t14 , we observe that no new loops are formed, and every new voxel will simply be part of the single connected component. Thus, the Euler characteristic remains constant at 1. The ECC is precisely the sequence of different Euler characteristic values as we add systematically individual slices along the chosen direction. To get a better sense of how the Euler characteristic changes overall, we can compute several ECCs corresponding to different directional filters. For example, in Figure 3.1.D we choose three directions in total corresponding to the proximal-distal, medial-lateral, and adaxial-abaxial axes respectively. Each filter produces an individual ECC, which we later concatenate into a unique large signal known as the Euler Characteristic Transform (ECT). There are two important reasons to use ECT over other TDA techniques. First, the ECT is computationally inexpensive, since it is based on successive computations of the Euler character- istic, which is simply an alternating sum of counts of cells. This inexpensiveness is especially relevant as we are dealing with thousands of extremely high-resolution 3D images. Assuming that we have already treated the image as a cubical complex, we can compute a single ECC in linear 36 time with respect to the number of voxels in the image (Richardson and Werman, 2014). We can thus compute the ECT of a 50,000-voxel seed scan with 150 directions in less than two seconds on a traditional PC. The second reason to use the ECT is its provable invertibility and statistical sufficiency. As first proved by Turner et al. (2014), and later extended separately by Curry et al. (2022) and Ghrist et al. (2018), if we compute all possible directional filters we would have sufficient information to reconstruct the original shape. Moreover, this ECT is a sufficient statistic that effectively summarizes all information regarding shape. Although there are infinite possible directional filters, there is ongoing research into defining a sufficient finite number of directions such that we can effectively reconstruct shapes based solely on their finite ECT (Belton et al., 2020; Betthauser, 2018; Curry et al., 2022; Fasy et al., 2019). Nonetheless, a computationally efficient reconstruction procedure for large 3D images remains elusive. Another computational consideration is the fact that the ECT produces a vector of topological information of #(directions) × #(thresholds) dimensions, which is usually above 2000 dimensions. In general, high-dimensional vectors tend to produce distorted prediction and regression results (Köppen, 2000), and it is advised to denoise and summarize these vectors by using different dimension reduction techniques. One such standard technique is principal component analysis (PCA), which seeks to project the high-dimensional vectors unto the orthogonal directions that capture the greatest variability of the data. These linear directions are referred to as the principal components of the data. Sometimes, the data cannot be properly summarized as a collection of lines. A more flexible approach is to consider kernel PCA (KPCA) (Schölkopf et al., 1998), a nonlinear alternative. By specifying a kernel function, we can instead project the high-dimensional samples unto the polynomial, trigonometric, or circular curves that capture the most variance of the data. A completely different dimension reduction strategy is the uniform manifold approximation and projection (UMAP) (McInnes et al., 2020), which also draws several ideas from TDA. Intuitively, UMAP seeks to project the high-dimensional data unto a low-dimensional space while preserving the most prominent topological local features. That is, if the original data contains large connected components, wide loops, and ample voids, its low-dimensional UMAP projection should also 37 exhibit several connected components, loops, and voids. If two sample points are in the same connected component in the high-dimensional space, these two should remain in the same cluster when projected to the low-dimensional space. 3.2 Materials and Methods We selected 28 barley accessions with diverse spike morphologies and geographical origins for our analysis (Harlan and Martini, 1929, 1936, 1940). In November of 2016, seeds from each accession were stratified at 4C on wet paper towels for a week, and germinated on the bench at room temperature. Four day old seedlings were transferred into pots in triplicate and arranged in a completely randomized design in a greenhouse. Day length was extended throughout the experiment using artificial lighting —minimum 16h light / 8h dark. After the plants reached maturity and dried, a single spike was collected from each replicate for scanning at Michigan State University. The scans were produced using the North Star Imaging X3000 system and the included efX software, with 720 projections per scan, with 3 frames averaged per projection. The data was obtained in continuous mode. The X-ray source was set to a voltage of 75 kV, current of 100 µA, and focal spot size of 7.5 µm. The 3D reconstruction of the spikes was computed with the efX-CT software, obtaining a final voxel size of 127 µm. The intensity values for all raw reconstructions was standardized as a first step to guarantee that the air and the barley material had the same density values across all scans. Next, the air and debris were thresholded out, and awns digitally pruned (Figures 3.2.A-D). Finally, the seed coat of the caryopses was digitally removed, leaving only the embryo and endosperm due to their high water content (Figure 3.2.E). We did not have enough resolution in the raw scans to distinguish clearly the endosperm from the embryo. Hereafter, we will refer to these embryo-endosperm unions simply as seeds. Thus, we digitally isolated all the seeds and obtained a collection of 3438 seeds in total. Due to the large volume of data, we used an in-house scipy-based python script to automate the image processing pipeline for all panicles and grains. To make the collection of different directional filters comparable across seeds, all the seeds were aligned with respect to their first three principal components. Since all the seeds are oblong in 38 A B C D E F G Transverse plane Longitudinal plane Lateral plane H height B width length height A I J Parental seed distribution Image Clean Defective 200 150 No. of seeds 100 50 0 Algerian Alpha Arequipa Atlas CA Mariout Club Mariout Everest Flynn Glabron Golden Pheasant Good Delta Han River Hannchen Horn Lion Lyallpur Maison Carree Manchuria Meloy Minia Multan Oderbrucker Orel Palmella Blue Sandrel Trebi White Smyrna WI Winter Barley Genotype Figure 3.2: Barley image processing. The morphology measurements were extracted from 3D voxel-based images of the barley panicles. Before any analysis was done, the (A) raw X-ray CT scans of the panicles had their (B) densities normalized, (C) air and other debris removed, and awns pruned. (D) After automating these image processing steps, we could finally work with a large collection of clean, 3D panicles. (E) An extra digital step segmented the individual seeds —embryo and endosperm— for each barley spike. The left shows the original raw scan, the center shows the isolated seed, while the right side shows part of the coat that was removed while segmenting. (F) The seeds were aligned according to their principal components, which allowed us to (G) measure a number of traditional shape descriptors. (H) Incomplete or broken seeds were later removed from the data set. (I) These defective seeds were identified by manually examining the outliers of different allometry plots. Outliers depicted as red triangles. (J) The total number of clean and defective seeds measured from each accession. Defective seeds were not concentrated in a particular accession. 39 9 Parallels 9 Parallels 13 Parallels 19 Parallels 8 Meridians: 11 Meridians: 12 Meridians: 12 Meridians: 74 directions 101 directions 158 directions 230 directions Figure 3.3: Directions chosen to compute the ECT. The sphere was split into a equispaced fixed number of parallels and meridians in each case. The directions were the taken from the intersections. Table 3.1: Sample size of seed scans used for each individual accession. N equals the number of panicles from which seeds are derived. Accession N seeds Accession N seeds Accession N seeds Algerian 3 144 Golden Pheasant 3 89 Minia 3 112 Alpha 3 90 Good Delta 3 126 Multan 1 50 Arequipa 3 110 Han River 2 71 Oderbrucker 3 194 Atlas 3 132 Hannchen 3 89 Orel 3 74 California Mariout 3 189 Horn 3 98 Palmella Blue 3 59 Club Mariout 3 173 Lion 3 116 Sandrel 2 96 Everest 3 128 Lyallpur 3 115 Trebi 2 119 Flynn 3 78 Maison Carree 3 146 White Smyrna 3 58 Glabron 3 114 Manchuria 3 167 Wisconsin Winter 1 25 Meloy 3 159 TOTAL 83 3121 mean 111.5 standard dev. 42.2 shape, this PCA-based alignment corresponds to the proximal-distal, medial-lateral, and adaxial- abaxial axes respectively (Figures 3.1.B, 3.2.F). The orientation of the principal components is arbitrary with every run, so we did keep track of the crease and the tip of seed and flipped the axes accordingly so that the tip would always be located as the rightmost point of the image and the crease would always point north. With this uniform alignment we were able to measure the length, width, heights, surface area and volume of each seed (Figure 3.2.G). We also computed the convex hull for each seed and measured its surface area and volume, as well as the ratios with respect to seed surface area and volume. In total, 11 different traditional shape descriptors were measured. Damaged and incomplete seeds (Figure 3.2.H) were removed by evaluating allometry 40 plots along their best linear fits and residuals (Figure 3.2.I). Points with residuals 4 times larger than the standard deviation were deemed as outliers and the associated seed was manually examined further. Outliers usually corresponded to either defective seeds —which were discarded— or to a cluster of seeds that failed to be individually segmented. In the latter case, we repeated our image processing scripts with more aggressive parameters to segment the seeds and re-examined the result. A final visual assessment of the remaining images was conducted to ensure the removal of all damaged seeds. These outliers did not represent a significant portion of the seeds of any accession (Figure 3.2.J). In total we obtained 3121 cleanly segmented seeds. Every accession is represented on average by 111 seeds, with ±42 seeds as standard deviation. All the accession numbers are within 2 standard deviations from this empirical mean (Figure A.1; Table 3.1). As a proof of concept, we explored how topological descriptors varied as we varied both the number of different directions and the number of uniformly spaced thresholds. In total, for every seed we computed the ECT considering 74, 101, 158, and 230 different directions. We emphasized directions toward the seed’s crease, which correspond to directions close to both north and south poles (Figure 3.1.B; Figure 3.3). For each direction, we produced ECCs with 4, 8, 16, 32, and 64 thresholds. Recall that the ECT is a record of how topology changes at every single slice taken at every direction (Figure 3.1.C). We performed Kruskal-Wallis one-way analyses (Kruskal and Wallis, 1952) to determine if the Euler characteristic inter-accession variance was significantly different from the intra-accession variance at a particular slice and direction. This way, we observed which parts of the seed anatomy were of particular relevance to the ECT. Accessions and individual spikes were both considered as possible classes when performing the Kruskal-Wallis tests. These results follow a conservative 10−10 false discovery rate after considering a multiple test Benjamini- Hochberg correction (Benjamini and Hochberg, 1995). For every seed we computed a very high-dimensional vector of topological information, usually above 2000 dimensions, which were later reduced in dimension independently with KPCA and UMAP to prevent high-dimensionality distortions. A non-linear KPCA with a σ = 1 Laplacian 41 kernel reduced the ECT dimension based on its largest source of variance. UMAP on the other hand was used to preserve the prominent, high-dimensional topological features of the ECT in an unsupervised fashion. We fixed the use of 50 nearest neighbors, 0.1 minimum distance, and Manhattan distance as the rest of key UMAP hyperparameters. For all dimension reduction techniques, the ECT dimension was reduced to just 2, 3, 6, 12, and 24 dimensions. We focused on an aggressive 2-dimensional reduction for visualization purposes both with KPCA and UMAP. To evaluate the descriptiveness, we trained three non-linear support vector machines (SVM) with radial kernel σ = 0.1 (Burges, 1998) to characterize and predict the seeds from 28 different accessions based on three different collections of descriptors: traditional, topological, and com- bining both traditional and topological descriptors. In every case, the descriptors were centered and scaled to variance 1 prior to classification. Given that SVM is a supervised learning method, we partitioned our data into training and testing sets. In our case, we randomly sampled 75% of the seeds from every accession as our training data set, labeled according to their accession. The remaining 25% was used to test the accuracy of our prediction model. We repeated this SVM setup 100 times and considered the average accuracy and confusion matrices as final results. This was done for all possible combinations of directions, thresholds, and dimensionality reductions mentioned above. The SVM was our classifier of choice since it is quick to train and it does not require vast amounts of training data to produce reasonable results. 3.3 Results Topological and combined shape descriptors tend to produce more accurate shape-based clas- sification results, provided that the ECT is computed with sensible parameters and an adequate dimension reduction technique. The best SVM classification results were yielded by topologi- cal and combined shape descriptors based on a 2568-dimensional ECT —158 directions and 16 thresholds (Figure A.3). Based on the highest F1 classification scores, these high-dimensional vectors were best parsed after being reduced to just 2 dimensions with KPCA, or to 12 dimensions with UMAP. Hereafter, the rest of topology-related results are based on these specific choice of directions, thresholds, and dimensionality reduction. 42 A Kruskal-Wallis summary (accession based) C H-statistic -log10(p-val) 40 BH correction for 1.0e-10 FDR 20 0 H-statistic c/ to BH critical p-val 200 0 0 500 1000 1500 2000 2500 Combined shape descriptors (sorted) B Discerning directions accession-wise (Adjusted 1e-10 FDR) Figure 3.4: Relevant ECT directions and slices. (A) We examine the inter-accession and intra- accession variance differences of the Euler characteristic for each direction and threshold. A Kruskal-Wallis analysis combined with a Benjamini-Hochberg multiple test correction suggests a handful of particularly discerning slices across accessions. (B) These directions and thresholds are mostly concentrated around the poles, and (C) correspond to the seed’s crease and bottom morphology. Colors bear no special meaning. A Kruskal-Wallis one-way analysis of the ECT vectors, combined with a Benjamini-Hochberg correction admitting a 10−10 FDR, reveals 55 features that explain the most of inter-accession variance (Figure 3.4.A). The most accession-discerning slices and directions correspond to the north and south poles (Figure 3.4.B). As discussed in the seed alignment heuristics in the Methods section, these pole directions in turn correspond to the morphology of the crease and the bottom of the seed (Figure 3.4.C). Similar results were observed when analyzing for the most spike-discerning directions (Figure A.4). In other words, the topological shape descriptors do measure the crease and bottom shape of the seed, a morphological feature not explicitly measured by our traditional setting. Turning back to the traditional shape descriptors, these share similar distributions across the 28 43 A Distribution of traditional traits across all parental seeds Length Area ConvexArea ConvexVol Vol Width Algerian Alpha Arequipa Atlas CA Mariout Club Mariout Everest Flynn Glabron Golden Pheasant Good Delta Han River Hannchen Founder Horn Lion Lyallpur Maison Carree Manchuria Meloy Minia Multan Oderbrucker Orel Palmella Blue Sandrel Trebi White Smyrna WI Winter −4 −2 0 2 −4 −2 0 2 −4 −2 0 2 −4 −2 0 2 −4 −2 0 2 −4 −2 0 2 Traditional measure (centered and scaled) B Figure 3.5: Distribution of traditional shape descriptors. (A) Distribution of six of the 11 traditional seed shape descriptors across the 3121 seeds. These measurements were first centered at 0 and scaled to have variance 1. (B) Plot of the first 2 principal components of the 11 shape descriptors. The first PC describes more than 70% of the total variance. Different marker and color indicate seeds from different spikes. 44 accessions, provided they are all centered and scaled to variance 1 (Figure 3.5.A). Kruskal-Wallis analyses suggest that the seed length, surface area, and volume related measures explain the most inter-accession variance (Figure A.5.A–B). Reducing the descriptors to a 2D representation with PCA suggests that these traditional descriptors tend to group the seeds based on their accession (Figure 3.5.B). These two components explain 84.0% of the total variance, with the first principal component explaining a considerable 72.2% alone. A similar grouping-by-accession behavior was observed whenever we reduced the traditional shape descriptors to 2 dimensions with UMAP instead. KPCA dimension reduction did not yield insightful results. Topological shape descriptors on the other hand can provide a more spike-specific morphology encoding, depending on the dimension reduction technique used to parse the ECT. KPCA summa- rizes the topological information as a loop, with sharply defined clusters corresponding to seeds from individual spikes (Figure 3.6.A). On the other hand, the UMAP projection produces a large, round cluster. Notice that seeds of different spikes tend to lie on different locations, while these locations overlap partially for spikes of the same accession (Figure 3.6.B). This behavior suggests that UMAP dimension reduction tries to balance both spike-specific and accession-specific shape features. Another round of Kruskal-Wallis analyses on the combined shape descriptors reinforce the idea that traditional descriptors cluster based on accession, KPCA-reduced topological descriptors do so based on spike, while UMAP-reduced ones provide a balanced clustering. The most inter- accession variance is explained predominantly by the traditional shape descriptors, with just a few topological features as complement (Figures A.5.A–B). However, most of the inter-spike variance is predominantly captured by the dimension-reduced topological descriptors. The first two KPCA components do explain most of this inter-spike variance, which agrees with the tight panicle clusters seen before (Figure 3.6; Figure A.5.C). On the other hand, UMAP distributes regularly the spike variance across most of its components, complemented by a few traditional shape descriptors (Figure A.5.D). In other words, traditional shape descriptors capture accession-specific shape features, KPCA highlights spike-specific features, and UMAP provides a balance between both of 45 A B Figure 3.6: Dimension reduction of the ECT vectors. The ECT can produce a high-dimensional topological signature for each seed. To better visualize this topological information, we can reduce it to just two dimensions with (A) kernel PCA or (B) unsupervised UMAP. The seeds of individual accessions are highlighted in every frame. Different marker and color indicate seeds from different spikes. 46 them. When evaluating quantitatively the descriptiveness of these cluster differences, we observed that topological shape descriptors are able to produce much better SVM classification results than traditional shape descriptors (Table 3.2). Using exclusively traditional descriptors, the machine is able to correctly determine the grain variety roughly 57% of the time. For comparison, by simply randomly guessing the variety, we would expect to be correct just 1/28 × 100 ≈ 4% of the time. The classification could not be improved by reducing the dimension of the traditional vector (Figure A.2). If we use exclusively topological shape descriptors instead, the machine can classify different accessions with more than 75% accuracy. These results depend on the dimension reduction technique of choice (Figure A.3.A). We observe that KPCA provides a powerful 2-dimensional summary of the ECT, which later can be used to predict grain accession with 85% classification accuracy. This accuracy diminishes considerably as more nonlinear principal components are considered. This drop of classification performance can be offset by combining the KPCA summary with traditional shape descriptors, which keep the classification accuracy above 70% (Figure A.3.B). The 2-dimensional UMAP summary (Figure 3.6.B) exhibits difficulties and discerning acces- sions, where classification accession does not go above 25%. Nonetheless, if a 12-dimension UMAP summary is considered, it is possible to classify accessions with 75% accuracy using exclusively topological information. Moreover, these UMAP-summary classification results can be further improved by combining them with traditional shape descriptors, where classification accuracy goes beyond 88%. The ECT thus captures important morphological patterns that can be complemented by size features which are provided by the traditional shape descriptors. Additionally, for both KPCA and UMAP cases, a small p-value produced by Friedman tests (Friedman, 1937) suggests that the three SVM classifiers, corresponding to the three sets of shape descriptors, are statistically different. Since we are comparing only three classifiers at a time, we can rely better on a Quade post-hoc pairwise test (Quade, 1979) as suggested in (Conover, 1998) (Table 3.3). 47 Table 3.2: SVM classification accuracy of barley seeds from 28 different founding lines after 100 randomized training and testing sets. Since we are in a multi-class classification setting we first computed the precision, recall, and F1 scores for each founding line. Later, we computed the weighted average for each score, where the weight depended on the number of test seeds for each of the barley lines. Observe that the use of either topological or combined descriptors outperforms the use of exclusively traditional descriptors. Shape Dimension No. of Scores (weighted average ± standard deviation) descriptors reduction dims Precision Recall F1 Traditional * 11 0.58 ± 0.050 0.58 ± 0.016 0.57 ± 0.016 Topological KPCA 2 0.88 ± 0.031 0.87 ± 0.010 0.87 ± 0.011 Topological UMAP 12 0.75 ± 0.047 0.75 ± 0.016 0.74 ± 0.016 Combined KPCA 13 0.73 ± 0.052 0.72 ± 0.017 0.71 ± 0.017 Combined UMAP 23 0.89 ± 0.028 0.89 ± 0.010 0.89 ± 0.010 Table 3.3: Small Friedman and Quade post-hoc p-values (using t-distribution approximation with Bonferroni correction) suggest that different descriptors produce statistically different SVM results. ECT + KPCA ECT + UMAP Friedman p-val 1.4 × 10−5 Friedman p-val 4.4 × 10−10 Traditional Topological Traditional Topological Topological 1.8 × 10−11 ∗ Topological 8.0 × 10−4 ∗ Combined 5.9 × 10−5 4.4 × 10−4 Combined 4.4 × 10−13 7.8 × 10−7 3.4 Discussion Traditional morphometrics has been used to reveal fundamental trends in morphological changes across space and time in ancient cereal grains (Bouby, 2001; Tanno and Willcox, 2012). Historical evidence shows that barley seeds became smaller as the crop moved from Mediterranean climates to Northwest Europe due to colder temperatures and higher sunlight variance, shedding insight on the timeline of barley domestication in Central Asia (Motuzaite Matuzeviciute et al., 2018). Similarly, grains became rounder and the spikes more compact as they moved to higher altitude sites in Nepal (Fuller and Weisskopf, 2014). Differences are more subtle if we compare accessions originating from similar regions and time periods. Geometric Morphometrics (GMM) has provided a more quantitative characterization of the grains. For example, GMM can successfully tell apart barley grains from einkorn (Triticum monococcum) and emmer (T. dicoccum) accessions (Bonhomme et al., 2017); it can be used to distinguish two-row vs six-row barley seeds (Ros et al., 2014); and it 48 can establish unique morphological characteristics of land races to deduce their possible historical origins (Wallace et al., 2019). The PCA of the traditional shape descriptors tends to group seeds based on accession as the largest source of variance. This observation is further supported by the Kruskal-Wallis analyses of variance (Figure A.5). The Euler characteristic however encodes additional important shape information missed by traditional descriptors. We observe that the topological shape descriptors provide better classification than the traditional shape descriptors (Table 3.2). Recall that we can mathematically prove that the ECT captures all the shape information, to the point that a finite topological signature can be used to reconstruct the original object (Curry et al., 2022; Fasy et al., 2019). This vast amount of information is best parsed with dimension reduction techniques, which highlight different morphology features encoded by the ECT. The biggest source of variation encoded by the ECT, rendered through KPCA, are individual panicles. This high degree of spike distinction may ignore underlying shape similarities between panicles of the same accession. In contrast, with UMAP we reduce the ECT’s dimensionality taking into account overall topology and geometry, and produce a clustering that balances both panicle-specific nuances with more general accession-based traits. This accession-vs-panicle balance is further aided by combining traditional and UMAP-reduced descriptors. In other words, the ECT is capable of capturing both panicle- and accession-specific morphological descriptors, but different dimension reduction techniques emphasize some nuances over others. The addition of traditional shape descriptors aids accession-based clustering, by supplying size-related measurements. The majority of the accessions studied are more easily distinguished with the topological lens but not with traditional measures, with few exceptions (Figure 3.7). Exceptions like Hannchen, Han River and Palmella Blue have slightly distinctive traditional trait distributions, so seed size does matter and it is important to take it into account (Figure 3.5.A). At the same time, we observe accessions like Alpha, Glabron, Minia, and Wisconsin Winter, that are poorly differentiated with traditional information but report considerably higher classification accuracies whenever using topological information. When looking at a more robust dimension reduction technique like 49 A Classification Average F1 score (158 directions, 16 thresholds, 2 KPCA dimensions) 1.0 0.8 Descriptor F1 score Traditional 0.6 Topological Combined 0.4 0.2 Maison Carree Golden Pheasant Palmella Blue White Smyrna Club Mariout Oderbrucker Manchuria CA Mariout WI Winter Han River Good Delta Arequipa Hannchen Multan Trebi Lyallpur Glabron Meloy Sandrel Atlas Alpha Algerian Orel Flynn Everest Lion Horn Minia Barley Accession B Classification Average F1 score (158 directions, 16 thresholds, 12 UMAP dimensions) 1.0 0.8 Descriptor F1 score Traditional 0.6 Topological Combined 0.4 0.2 White Smyrna Maison Carree Palmella Blue Golden Pheasant CA Mariout Han River Good Delta Club Mariout Manchuria Arequipa Hannchen WI Winter Oderbrucker Everest Trebi Glabron Lion Sandrel Algerian Orel Flynn Meloy Lyallpur Atlas Multan Horn Minia Alpha Barley Accession Figure 3.7: SVM classification results for individual accessions. (A) Results when using a KPCA 2-dimension reduced topological vector. Accessions ordered according to their classification accuracy determined by the topological shape descriptors. (B) Results when using a UMAP 12-dimension reduced topological vector. Accessions ordered according to their classification accuracy determined by the combined shape descriptors. UMAP, classification accuracy is increased when combined with size-related information. An exploration on the directions used to compute the ECT reveals that the shape of the crease and bottom discriminate accessions the most (Figure 3.4). These features are not directly measured with our traditional setting. By analyzing inter- vs. intra-accession variance of a large number of ECT axes and thresholds, we effectively isolate complex morphological features responsible for distinguishing selected groups. Although the ECT comprehensively measures the information content of an object, different dimension reduction techniques highlight different aspects of that shape information (Figure 3.6). A more systematic exploration of other dimension reduction algorithms, and classification techniques afterward is warranted moving forward. 50 3.5 Software and data availability The processed and cleaned barley panicles and barley seeds X-ray CT 3D reconstructions can be found in the Dryad repository https://doi.org/10.5061/dryad.rxwdbrv93. All of our code is available at the https://github.com/amezqui3/demeter/ repository. This includes the image processing pipeline to clean the raw scans and segment the seeds (python), the computation of the ECTs (python), and the SVM classification and analysis (R). A collection of Jupyter notebook tutorials is also provided in order to ease the usage and understanding of the different components of the data processing and data analyzing pipelines. 51 CHAPTER 4 EXPLORING THE SHAPE OF AROMA AND CITRUS OIL GLANDS Fairest of all God’s trees, the orange came and settled here, Its leaves of green and pure white blossoms delight the eye of the beholder, And the thick branches and spines so sharp, and the fine round fruits, Green ones with yellow intermingling to make a pattern of gleaming brightness, —from In Praise of the Orange-Tree (Ju song) Qu Yuan (340–278 BC?) Citrus come in diverse sizes and shapes, and play a key role in world culture and economy. Citrus oil glands in particular contain essential oils which include plant secondary metabolites associated with flavor and aroma. Capturing and analyzing nuanced information behind the citrus fruit shape and its oil gland distribution provides a morphology-driven path to further our insight into phenotype-genotype interactions. We study the shape of citrus fruits and fruits from close citrus relatives based on 3D X-ray CT (computed tomography) scan reconstruction of 166 different samples comprising 51 different accessions, including samples of the three fundamental citrus species (C. medica, C. reticulata, and C. maxima), accessions from related genera (P. trifoliata and F. margarita), and several interspecific hybrids. First, using the power of X-rays and image processing, we compared volume ratios between different tissues, including exocarp, endocarp, and oil gland tissue. Second, since citrus oil glands contain essential oils which include plant secondary metabolites associated with flavor and aroma, we examined the number of individual oil glands, their density, and their overall distribution across all fruits. We determine that the average distance between neighboring oil glands follows a square root model, which indicates that gland distribution might follow normal diffusion dynamics (Vlahos et al., 2008). Finally, based off a point cloud defined by the center of all individual oil glands, we model the fruit shape as an ellipsoidal surface, a sphere with its three main axes shrunk or 52 stretched accordingly. Once the glands are considered points on an ellipsoid, we are able to apply multiple tools from directional statistics (Ley and Verdebout, 2017; Mardia and Jupp, 1999; Pewsey and García-Portugués, 2021), which allows us to study and infer possible statistical distributions on spherical surfaces. As an example of this mathematical machinery, we test whether the oil glands either follow a uniform or symmetric distribution across the fruit surface. To the best of our knowledge, the shape of citrus has not been explored with similar scanning technologies, nor has it been analyzed with ellipsoidal and directional approximations. This morphological modeling will allow us to set a new exciting path to explore further the phenotype-genotype relationship in citrus. This chapter is derived from the original research paper • E.J. Amézquita, M.Y. Quigley, T. Ophelders, D. Seymour, E. Munch, D.H. Chitwood (2022). The shape of aroma: Measuring and modeling citrus oil gland distribution. Plants, People, Planet. 4.1 Introduction Citrus fruits and leaves have played a fundamental role across multiple aspects of human history including the development of modern nutrition and medical sciences. The aromatic and medicinal properties of mandarins and oranges have inspired delicate poetry since ancient times (Tseng, 1999; Vovin, 2016). Etrog citrons represent “the fruit of a goodly tree” during the Sukkot celebrations in the Jewish community (Isaac, 1959). The bael tree is considered sacred and it is generally grown near Hindu temples (Sharma et al., 2007). The fruits, peels, and leaves of diverse citrus have been used as traditional medicine for millennia for a diverse array of maladies (Mahomoodally and Mooroteea, 2021; Shrestha and Dangol, 2019). Sour oranges and lemons inspired the first modern clinical trials in the 18th and 19th centuries to determine the causes and cure of scurvy —“the plague of the sea, and the spoyle of mariners” (Hawkins, 1986)— thus paving the way to the eventual isolation and synthesis of the first vitamin, vitamin C. (Baron, 2009; Magiorkinis et al., 2011). Currently there is a rising trend in global citrus production, with more than 143 million tonnes 53 produced in 2019 alone (FAO, 2021). Citrus production is valued for more than 3.3 billion US dollars in the US alone. (NASS, 2021). Citrus derived products are vital for other multi-billion dollar industries as well, from orange juice in the food industry, to essential oils in the perfume and cosmetics industry (Spreen et al., 2020). Essential oils in particular are extracted for their aromatic, flavoring, medicinal, and preservation properties useful in a variety of contexts (Mahato et al., 2019). Before any human intervention, current paleobotanical evidence suggests that the common ancestor of citrus species originated more than 8 million years ago in the triangle defined by modern day northeastern India, northern Myanmar, and northwestern Yunnan (Talon et al., 2020). As monsoons weakened in southeastern Asia and climate transitioned to drier conditions, citrus radiated and diversified over the next 5 million years across the southeast Asian peninsula, Australia, New Caledonia, the western Indian coast, and even Japan (Wu et al., 2018). Early civilizations in India and China domesticated some of these species and their interspecific hybrids, even as early as during the Xia Dynasty (2100-1600 BC) in Southern China (Deng et al., 2020). Through tribute, trade, and invasion, different cultures contributed to spread many of these citrus across the rest of the world over the next 3000 years (Langgut, 2017). Citrus species are sexually compatible and their ability to hybridize, combined with constant displacement and cultivation in multiple environments, produced a diversity of admixed accessions with a vast array of phenotypic traits (Gmitter et al., 2020; Luro et al., 2017; Wu et al., 2021). Asexual propagation is common in citrus and the interaction between grafted individuals has led to novel phenotypes, including through the formation of graft chimeras, conglomerations of cells that originated from separate zygotes (Caruso et al., 2020). The first reported plant chimera, known as Bizarria, arose from a fortuitous graft junction of a Florentine citron and a sour orange in 1674 (Nati, 1674). Since then, chimeras have proved to be more common than originally thought, transforming our perception of the genetic heterogeneity of individuals and its impact on plant development and phenotype (Frank and Chitwood, 2016). A phenotype of particular interest is shape. Specific combinations of shape features are used to 54 distinguish diverse citrus varieties, and have motivated various citrus taxonomic systems (Ollitrault et al., 2020). Leaf shape has been used to distinguish pummelos from sweet oranges among other different citrus genotypes and their respective environment interactions (Iwata et al., 2002). Root architecture is indicative of soil deficiencies for sour orange rootstocks (Mei et al., 2011). Morphological traits, such as fruit size and oil gland density, are used to infer genetic similarities between various mandarin cultivars (Pal et al., 2013). Oil gland size, structure and distribution are associated with the fruit development of navel oranges (Knight et al., 2001) and grapefruits (Voo et al., 2012). 4.2 Materials and Methods 4.2.1 Plant material and scanning We selected 51 different accessions of citrus and citrus relatives with diverse fruit morphologies and geographical origins for our analysis. Fruits were sampled from a single tree for each selected accession maintained in the University of California Riverside Givaudan Citrus Variety Collection. 166 different individuals in total were sent for scanning at Michigan State University in December 2018 (Figure 4.1.A; Table B.1.) These 166 samples were arranged into 63 raw scans, one scan per citrus variety containing all the replicates. Pummelos and citrons samples were scanned individually due to the fruit size. The scans were produced using the North Star Imaging X3000 system and the included efX software, with 720 projections per scan, at 3 frames per second and with 3 frames averaged per projection. The data was obtained in continuous mode. The X-ray source was set to a current of 70 µA, voltage ranging from 70 to 90 kV, and focal spot sizes ranging from 4.9 to 6.3 µm. The 3D reconstruction of the fruit was computed with the efX-CT software, obtaining final voxel sizes ranging from 18.6 to 110.1 µm for different scans (Figure 4.1.B; Table B.2.) The air and debris were thresholded out of each raw scan, and individual replicates segmented into separate images. Based on density and location, for each fruit we further segmented 3D voxel-based reconstructions of its central column, endocarp, mesocarp, exocarp, and oil glands (Figure 4.1(c)–(g)). The center of each oil gland was calculated as the center of mass of the voxels 55 A B C D E F G H I Figure 4.1: Citrus scanning and image processing. (A) A diverse collection was scanned using X-ray CT technology. (B) Slices of a raw scan. The image processing steps involved segmenting individual fruits and removing air and other debris. Then, individual tissues for each fruit were segmented such as the (C) central spine, (D) endocarp, (E) mesocarp, (F) exocarp, and (G) oil glands. (H) Close-up of some X-ray slices of the exocarp. (I) Same figure as above, with the segmented oil gland tissue darkened for emphasis. A Willowleaf sour orange is used as an example for Figures (B)–(I). All the figures are for illustrative purposes only. 56 composing such gland. An in-house scipy.ndimage-based python script was used to process the images for all fruits and their tissues. These were later visually inspected to verify their correctness. All the Chinese box oranges (Severinia buxifolia) scans were discarded due to their poor quality. To highlight nuanced differences among certain citrus groups, scans were partly split into sensible clusters of morphological interest (Table 4.1). 4.2.2 Allometric relationships The total volume of fruits and their separate tissues was measured from the scans, as well as the number of individual oil glands. We studied the allometric relationships between these measurements; that is, the relative size of different tissues with respect to each other. These relationships in plants often follow a power law, so all the measurements were first log-transformed (Niklas, 2004) and a reduced major axis linear regression (Smith, 2009) was fitted considering all fruits. The slope, intercept, and R2 correlation coefficient were recorded (Figures 4.2, B.3). The distribution of the residuals was compared against a normal distribution to determine the adequacy of the linear fit (Figures B.2, B.4). 4.2.3 Oil gland distribution For each fruit, a point cloud, a collection of (x, y, z) coordinates in the space, was defined by the centers of all its individual oil glands. The 25 nearest neighbors, based on Euclidean distance, were computed for each point, so that distances are not affected by the fruit skin curvature. The average distance between each gland and its nearest neighbor, its second nearest neighbor, and so on were computed. The oil gland density was determined both in terms of volume and surface area, by dividing the number of glands by the volume of the whole fruit, and by the surface area of the best-fit ellipsoid (discussed later in Section 4.2.4) respectively. As in the previous section, all these measurements were log-transformed, linear regressions fitted, parameters recorded, and residuals compared to a normal distribution (Figures 4.3.A, B.5). A root square model was fitted between the average nearest neighbor distance and the nearest neighbor index to describe how far apart glands spread from each other (Figure 4.3.B). 57 Table 4.1: Selected citrus groups and varieties. N equals the number of pseudo-replicates. The full list of citrus fruits and relatives scanned is found in Table B.1. Names according to the University of California Givaudan Citrus Variety Collection (CVC). Asterisk denotes not available. CVC Name Scientific Name N - CVC Name Scientific Name N Kumquats Nagami F. margarita 3 Lemons and lemon hybrids Limon Real C. excelsa 4 Lamas C. limon 3 Interdonato C. limon 3 Volckamer C. volkameriana 3 Eureka C. limon 3 Mandarins and mandarin hybrids Emperor C. reticulata 3 * C. reticulata 3 Lee C. reticulata 3 Som Keowan C. reticulata 3 Koster C. reticulata 3 Cleopatra C. reshni 3 Beledy C. reticulata 4 Fremont C. reticulata 3 USDA 88 C. reticulata 3 Kinkoji C. neo-aurantium 3 Microcitrus Finger lime M. australasica 4 Papedas * C. hanayu 3 Kalpi C. webberii 2 Makrut C. hystrix 4 Pummelos and pummelo hybrids Star Ruby C. paradisi 3 Pomelit C. maxima 3 Kao Pan C. maxima 3 Hassaku C. hassaku 3 Egami Buntan C. maxima 3 Sour oranges and sour orange hybrids Willowleaf C. aurantium 3 Standard C. aurantium 3 Konejime C. neoaurantium 4 Olivelands C. aurantium 3 Sweet oranges Valencia C. sinensis 3 Shamouti C. sinensis 3 Navel C. sinensis 3 Argentina C. sinensis 3 Cara Cara C. sinensis 3 Trifoliates and trifoliate hybrids Little-leaf P. trifoliata 3 Rubidoux P. trifoliata 4 C-35 X Citroncirus 3 Carrizo X Citroncirus 5 Swingle X Citroncirus 5 58 4.2.4 Modeling the whole fruit as an ellipsoid and computing its sphericity The surface of most of citrus fruits and their relatives can be approximated by an ellipsoid, a sphere with its three main axes possibly shrunk or stretched.The three axes of symmetry of an ellipsoid delimit three line segments from the center of the ellipsoid to its surface. These are referred to as the ellipsoid semi-axes. Notice that a sphere is an ellipsoid with its three semi-axes of the same length. We will consider triaxial ellipsoids, where the length of each semi-axes can be different. An ellipsoid can also be represented as a quadratic equation surface which is both mathematically simple to manipulate (Harris and Stöcker, 1998, Ch. 8.12), and versatile enough to represent both the shapes of nearly-spherical Valencia oranges and elongated finger limes given the right semi-axes lengths. Each fruit is defined by a point cloud made by the centers of all its individual oil glands. The parameters of the best-fit ellipsoid for this point cloud are computed following the algorithm by Li and Griffiths (2004), from which the semi-axes lengths, rotations, and center are determined (Panou et al., 2020). The fruit point cloud is then rotated and translated such that the best-fit ellipsoid is centered at the origin and its semi-major axes coincide with the proximal-distal axis of the fruit. Finally, the centers of the oil glands are projected to this ellipsoid via geocentric projection, where a ray from the center of the ellipsoid to the gland is drawn and its intersection with the ellipsoid is considered (Figure 4.4.A–F). This ellipsoid model summarizes important information of the overall shape of the fruit. As an example, we measure how sphere-like different citrus are. There is no unique way to measure sphericity, however, most of the commonly used formulas are based on the semi-axes lengths of the object (Blott and Pye, 2008; Clayton et al., 2009). We measured the sphericity of the resulting fruit-based ellipsoids using 6 different sphericity indices, all of them taking values between 0 (planes and lines) and 1 (perfect spheres) (Figure 4.4.G; Table 4.2). 4.2.5 Revisiting the distribution of the oil glands The projected gland center locations on the ellipsoid were described in terms of longitude and latitude coordinates with respect to the ellipsoid (Diaz-Toca et al., 2020). We tested whether the 59 Table 4.2: Common sphericity indices based off best-fit ellipsoid. The indices values are bounded between 0 (line or plane) and 1 (perfect sphere). The surface area, volume, and the largest, inter- mediate, and smallest semi-axes lengths of the ellipsoid are denoted by Ae, Ve, a, b, c respectively. Also, As denotes the surface area of a sphere of volume Ve . Name Formula Reference p True sphericity = 3 36πVe2 /Ae As /Ae p (Wadell, 1932) 3 2 Intercept sphericity bc/a √ (Krumbein, 1941) Corey shape factor pc/ ab (Corey, 1949) 3 2 Maximum projection sphericity q c /ab (Sneed and Folk, 1958) Janke form factor c/ 13 (a2 + b2 + c2 ) (Janke, 1966) Degree of equancy c/a (Blott and Pye, 2008) gland point cloud follows a uniform distribution, where every unit area of the skin has the same probability of containing oil glands; or if the underlying distribution is rotationally symmetric, where the oil glands pattern is symmetrical around a fixed direction. Uniformity was tested with Projected Anderson-Darling (PAD) test (García-Portugués et al., 2023) with the R package sphunif (García-Portugués and Verdebout, 2021). The rotational symmetry was tested with a scatter- location hybrid test with an unspecified direction of symmetry (García-Portugués et al., 2020) with the R package rotasym (García-Portugués et al., 2021). Additionally, we visually examined the distribution of oil glands for most fruits and compared them to simulated uniform distributions by projecting them to 2D via Lambert azimuthal equal-area projections (Mardia and Jupp, 1999, Ch. 9.1) from the North and South poles. Intuitively, these two projections flatten the sphere on a plane by pushing it from the North pole and South pole while minimizing the distortion seen on the north and south hemisphere respectively (Figure 4.5). 4.3 Results 4.3.1 Allometric relationships The estimated volume of each tissue type and fruit follows the expected average fruit size of each genetic group, with the smallest fruit in the bottom left corners (microcitrus, kumquats) and large fruit in the top right corners (pummelos). Strong linear trends are observed when comparing most of the volume-related features of all the fruits, indicated by high R2 correlation coefficient 60 log Vol(Mesocarp) [cm3] log Vol(Oil Glands) [cm3] log Vol(Exocarp) [cm3] 2 4 1 3 0 2 1 1 m = 0.87 2 m = 0.85 b 2 = -1.16 3 b 2 = -4.88 0 R = 0.92 R = 0.76 4 1 2 3 4 5 6 7 1 2 3 4 5 6 7 log Vol(Whole) [cm3] log Vol(Whole) [cm3] 3 6 log Vol(Spine) [cm3] 2 5 4 1 3 0 2 1 m = 1.14 1 m = 1.15 2 b 2 = -4.28 0 b 2 = -2.06 R = 0.58 R = 0.94 3 1 1 2 3 4 5 6 7 1 2 3 4 5 6 7 log Vol(Whole) [cm3] log Vol(Whole) [cm3] log Vol(Endocarp) [cm3] 6 10 log Num of Glands 5 4 3 9 2 m = 1.02 m = 0.40 b 2 = -0.61 b 2 = 7.37 1 R = 0.96 8 R = -0.24 1 2 3 4 5 6 7 1 2 3 4 5 6 7 log Vol(Whole) [cm3] log Vol(Whole) [cm3] 10 10 log Num of Glands log Num of Glands 9 9 m = 0.46 m = 0.39 b 2 = 7.90 b 2 = 7.61 8 R = -0.13 8 R = -0.38 0 1 2 3 4 1 2 3 4 5 6 log Vol(Exocarp) [cm3] log Vol(Endocarp) [cm3] 10 10 log Num of Glands log Num of Glands 9 9 m = 0.35 m = 0.35 b 2 = 8.08 b 2 = 8.86 8 R = -0.16 8 R = 0.02 1 0 1 2 3 4 5 6 3 2 1 0 1 2 3 log Vol(Mesocarp) [cm3] log Vol(Spine) [cm3] 10 Kumquats Pummelos and log Num of Glands Pummelo Hybrids Lemons and Lemon Sour Oranges and Hybrids their hybrids 9 Mandarins and Sweet Oranges Mandarin Hybrids Trifoliates and m = 0.46 Microcitrus b 2 = 9.64 Trifoliate Hybrids 8 R = 0.29 Papedas Other 4 3 2 1 0 1 2 log Vol(Oil Glands) [cm3] Allometric relationships between fruit phenotypes Figure 4.2: Various allometry plots between different tissue volumes compared to the total fruit volume. 61 Figure 4.2 (cont’d): Various allometry plots between different tissue volumes compared to the total fruit volume across all fruits. The best fit line is depicted by a dashed line in blue. For each plot, the slope, intercept, and correlation coefficient are recorded as m, b, and R2 respectively. The linear relationship in the log-log plots suggests that fruit tissues may grow following a power law. values, usually above 0.75 except for the central column tissue (Figures 4.2, B.3). Due to their thin size and scanning quality, these columns were difficult to identify and isolate, especially in some trifoliates, which might explain lower R2 values. The residuals of the fitted linear regression tend to follow a normal distribution for the majority of measurement pairs, suggesting that the linear fit is adequate (Figures B.2, B.4). This linearity indicates that the tissues across all citrus fruits grow relative to each other following a power rule. For example, looking at the slope m values, both the exocarp and the oil glands grow in volume at the same relative rate with respect to the volume of the whole fruit (m = 0.85). On the other hand, the total number of oil glands appears to be decoupled from all the measured size-related traits, as shown by much lower R2 values (Figure 4.2). In this case, a power law may not be an adequate model to describe the oil gland number with respect to tissue volume. 4.3.2 Oil gland distribution There is a strong positive linear relationship between the volume of the fruit, and the average distance between an oil gland and its nearest neighbor, with R2 correlation coefficients above 0.65. There is a stronger negative linear relationship when considering the overall oil gland density, reflected by R2 coefficients above 0.85 (Figure 4.3.A). The residuals follow normal distributions, indicating that the linear model is adequate (Figure B.5). These allometric relationships suggest that for all citrus and relative fruits, the average distance between nearest oil glands follows a power law with respect to fruit volume and gland density. When considering fruit size, as expected, the samples distribute in a similar pattern as with most of the previous allometry plots. However, an inverse pattern is observed when considering oil gland density. In this case, the smallest fruits tend to report the highest number of oil glands per unit volume or unit area. Other than microcitrus and kumquats, the rest of highlighted citrus groups form a tighter cluster. 62 Mandarins and Sour Oranges and Trifoliates and A Kumquats Lemons and Lemon Mandarin Hybrids Microcitrus Papedas Pummelos and their hybrids Sweet Oranges Trifoliate Hybrids Other Hybrids Pummelo Hybrids m = 0.23 m = 0.27 0.0 b 2 = -1.37 b 2 = -1.06 R = 0.73 R = 0.66 log (Avg dist to nearest neighbor) [mm] 0.5 1.0 1 2 3 4 5 6 7 0 1 2 3 4 log Vol(Whole) [cm3] log Vol(Exocarp) [cm2] m = -0.25 m = -0.37 0.0 b 2 = 0.83 b 2 = 1.37 R = 0.91 R = 0.87 0.5 1.0 2 3 4 5 6 7 8 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 log (Oil gland density per volume) [glands/cm3] log (Oil gland density per area) [glands/cm2] B 6 Kumquats Avg distance to 25-nearest neighbors Lemons and Lemon Hybrids Mandarins and Mandarin Hybrids 0.09x + 0.19 0.41x + 0.23 0.35x + 0.19 4 2 Average distance to neighbor [mm] 0 6 Microcitrus Papedas Pummelos and Pummelo Hybrids 0.04x + 0.05 0.27x + 0.23 0.42x + 0.20 4 2 0 6 Sour Oranges and their hybrids Sweet Oranges Trifoliates and Trifoliate Hybrids 0.31x + 0.18 0.46x + 0.26 0.55x + 0.06 4 2 0 0 5 10 15 20 25 0 5 10 15 20 25 0 5 10 15 20 25 k-th nearest neighbor C Carrizo citranges are Washington Sweet Orange × Trifoliate hybrids 6 Washington Sweet Oranges Trifoliates Carrizo Citranges Average distance 5 0.68x + 0.21 0.42x + 0.06 0.77x + 0.04 4 to neighbor [mm] 3 2 1 0 0 5 10 15 20 25 0 5 10 15 20 25 0 5 10 15 20 25 k-th nearest neighbor Figure 4.3: Studying the average distance from each gland to its first 25 nearest neighbors, as measured by the Euclidean distance. 63 Figure 4.3 (cont’d): Studying the average distance from each gland to its nearest neighbors. (A) Allometric relation- ships are observed across all fruits when comparing the average distance between each gland to its closest neighbor with the overall size of the fruit. The overall linear trend is depicted by a dashed blue line. The slope, intercept, and correlation coefficient are denoted by m, b, and R2 respectively. (B) For each oil gland, the average distance to its nearest neighbors follows a square root relationship. The average for each group is plotted as black, thick line. These square root models follow different parameters depending on the citrus group. The fruits that deviate the most from the average usually correspond to hybrids. (C) Carrizo citranges are Washington sweet orange × Trifoliate hybrids. The average distance between oil glands increases at a faster rate for citranges than for their parents, which suggests hybrid vigor. The average distance between an oil gland and its k-th nearest neighbor is modeled as √ Average distance(k) = M k + B, where M is the rate of distance growth and B the line intercept (Figure 4.3.B). As expected from their higher oil gland density, the average distance to the gland’s nearest neighbors increases the slowest for microcitrus, followed distinctly by kumquats. On the other hand, sweet oranges and trifoliates report the largest average distances between neighboring oil glands. This higher gland density could be partly affected by differences in scanning resolutions (Figure B.1). In general, all the samples of every accession follow the same growth model. Outliers in growth models are typically associated with hybrid accessions. Increased growth rates are found when the second parent is a large fruited accession. For example, consider the Carrizo citrange, a trifoliate x Washington sweet orange hybrid where sweet orange fruits are much larger than trifoliate fruits. We observed that the oil glands in the citrange grow on average farther apart from each other than in any of the parents (Figure 4.3.C), which suggests that hybrid vigor might be at play. Similarly, hybrids derived from crosses with small-fruited accessions have reduced growth rates. 4.3.3 Ellipsoid modeling and sphericity of fruits The best-fit ellipsoid successfully captures the overall shape of the citrus and relatives, with a negligible portion of gland centers differing by more than 0.2cm from their ellipsoidal approx- imation (Figure 4.4.G). This ellipsoidal model is flexible enough to capture both spherical and elongated fruit shapes, from sweet oranges to Australian finger limes (Figures 4.4.A–F). 64 A B C D E F Distribution of residues of the ellipsoid fit for all scanned samples 60 G Density 40 20 0 0.3 0.2 0.1 0.0 0.1 0.2 0.3 Residue [cm] H Kumquats Wadell Krumbein Corey Sneed & Folk Janke Blot & Pye Lemons and Lemon Hybrids Mandarins and Mandarin Hybrids Microcitrus Papedas Pummelos and Pummelo Hybrids Sour Oranges and their hybrids Sweet Oranges Trifoliates and Trifoliate Hybrids Other 0.4 0.6 0.8 1.0 0.4 0.6 0.8 1.0 0.4 0.6 0.8 1.0 0.4 0.6 0.8 1.0 0.4 0.6 0.8 1.0 0.4 0.6 0.8 1.0 Sphericity index Figure 4.4: Modeling citrus fruit surface as tri-axial ellipsoids. The glands were centered at the origin, and the ellipsoid aligned with the proximal-distal, medial-lateral, and adaxial-abaxial axes. Then the oil glands were projected to this best-fit ellipsoid. Examples of a (A) Valencia orange, (B) Nagami kumquat, (C) Willowleaf sour orange, (D) Australian finger lime, (E) South Coast Field Station citron, and a (F) Cleopatra mandarin. (G) Distribution of the residuals of the centers of the oil glands to the best-fit ellipsoid. The distributions for all the fruit scans are overlaid. (H) Various sphericity indices are computed and compared across different citrus groups. The indices are named according to their original reference in Table 4.2. Figures (A)–(F) are not scaled. 65 Most of the fruits report highly spherical indices, more than 0.9 for every sphericity index. Unsurprisingly, the elongated Australian finger limes are the least spherical and their shape is very distinct from the rest of the samples. The kumquats are less elongated than the finger limes, but they also remain highly distinguishable for most of the sphericity indices. Mandarins and their hybrids also tend to be slightly less spherical than the remaining groups of interest (Figure 4.4.H). 4.3.4 Oil glands revisited Although the tools from directional statistics assume that the data points lie on a sphere rather than an ellipsoid, most of scanned fruits were very sphere-shaped according to a variety of sphericity indices (Figure 4.4.H). Thus shape information is not significantly altered when translating longitude and latitude coordinates from the best-fit ellipsoid to a sphere. The uniform oil gland distribution hypothesis was strongly rejected for all scans, with all p-values below 0.015 for the PAD test, and below 2.5 × 10−7 for 95% of all the point clouds (Figure B.6.A). The scatter-location hybrid test strongly rejected the rotationally symmetric oil gland distribution hypothesis for most of the point clouds as well. More than 90% of all the scans reported p-values smaller 0.02. The 10 samples for which the rotationally symmetric hypothesis was not rejected were not concentrated in any citrus groups (Table B.3; Figure B.6.B). Upon closer visual examination, differences arise between the uniform distribution on a sphere and oil gland distributions. The oil gland distributions tend to have defined clusters and empty spots, which are not seen in typical uniform distributions. The northern and southern hemispheres are noticeably different from each other for the oil gland distributions, while these look roughly the same in uniform distributions (Figure 4.5). 4.4 Discussion Measuring and understanding the shape is fundamental to extracting valuable information from data, and push further our insights. A vast number of biological-inspired shapes are intrinsically 3 dimensional, like citrus fruit, and capturing their shape as 3D voxel-based provides a faithful shape representation that allows accurate measurement of tissue volumes and modeling of gland distributions in space. Better fruit modeling is key to provide more accurate descriptions of fruit 66 A B Figure 4.5: Distribution of oil glands is not uniform nor rotationally symmetric across the citrus exocarp. After modeling the fruits as ellipsoids and projecting their oil glands onto the ellipsoidal surface, longitude and latitude coordinates are computed as in Figure 4.4. These coordinates can be better visualized using two Lambert azimuthal equal-area projections, from the north and south poles which represent the northern and southern hemispheres respectively with minimal distortion. A battery of statistical tests strongly rejects the hypothesis of these glands being uniformly or symmetrically distributed over the ellipsoid surface. (A) Examples of oil gland distribution of a little leaf trifoliate, a parent Washington navel orange, a Willowleaf sour orange, and a Som Keowan mandarin. (B) For comparison, a similar number of points are simulated following uniform, low-concentration von Mises-Fisher, and low-concentration Bingham distributions. These three distributions are rotationally symmetric. 67 shape and oil gland content, as both are important traits for citrus scion improvement (Barry et al., 2020). Citrus shape impacts oil gland abundance and distribution, as the shape of the rind, the exocarp, and other tissues, along the distribution of the oil glands affects the physics of citrus essential oil extraction and aroma dispersion (Smith et al., 2018). When observing overall fruit tissue size trends, these correspond to known citrus genealogy. For example, when comparing the size of exocarp against size of endocarp, most of the sour and sweet oranges tend to lie between mandarins and pummelos, with sour oranges lying closer to mandarins, while the sweet oranges are closer to pummelos (Figure 4.2). A similar arrangement of citrus groups is observed when comparing the average distance between neighboring oil glands to either fruit volume or gland density. Moreover, it is observed that oil glands distance themselves from each other following a square root rate in general. The exact degree to which they push each other apart depends on the oil gland density, which in turn is partly affected by the citrus genealogy. For example, the average distances between oil glands in Carrizo citranges increase at higher rates than in either the Washington sweet oranges or the trifoliates, the citrange parents (Figure 4.3). The square root suggests that the mechanics of oil gland displacement across the fruit could be partly governed by based on Brownian motion and normal diffusion interactions (Vlahos et al., 2008). However, the hypothesis of oil gland locations following either a uniform or symmetric distribution on the fruit surface is strongly rejected for all scans by the PAD and location-scatter hybrid tests respectively. This discrepancy between our gland distribution observations could be explained by the fact that uniform distributions and diffusion processes assume that the data consists of point particles with no volume that can stand arbitrarily close to each other. For oil glands this is obviously not the case, as they have volume and there are physical limitations on the proximity between glands, which requires a more complex diffusion modeling. Higher resolution scans might be able to capture better individual oil gland shape, rather than just its center. Individual oil glands then could be approximated by individual minimum volume enclosing ellipsoids (Todd and Yıldırım, 2007), which could then pose more advanced distribution and diffusion models. All the studied citrus and related accessions exhibit allometric behavior in general across 68 both their tissue volumes, and average distances between neighboring glands. This relative growth relationships suggest that tissue sizes are deeply linked, as the size of oil gland tissue in general may not be able to change without changing volume of both the endocarp and mesocarp. Moreover, there might be biophysical principles at play that govern different tissue development across all citrus fruits in general, just like normal diffusion might govern oil gland distribution. The determination of such biophysical constraints prompt future lines of exciting research. The quality of input data for our models is equally important. Through X-ray CT scanning technology we have a novel way to observe, quantify, and analyze all the shape of citrus and their tissues in a comprehensive, automated, non-invasive, and non-destructive manner. With the right voltage and current, the 3D X-ray CT reconstructions can discern small, individual tissues, like oil glands, which enables us to analyze tissue shape and distribution at very granular levels. 4.5 Software and data availability The processed and cleaned citrus X-ray CT 3D reconstructions can be found in the Dryad repos- itory https://doi.org/10.5061/dryad.34tmpg4n6, along with their separated tissues and associated point clouds and ellipsoidal approximations. All our code is available at the https://github.com/amezqui3/vitaminC_morphology repository. This includes the image processing pipeline to clean the raw scans and segment the fruit tissues, the computation of tissue volume, the best-fit ellipsoid, and the hypothesis testing of uniform and symmetric distributions on a unit sphere. A collection of Jupyter notebook tutorials is also provided to ease the usage and understanding of the different components of the data processing and data analyzing pipelines. All the image-related scripts are available in python, while the statistical analyses are in R. 69 CHAPTER 5 THE SHAPE OF KERNELS AND CRACKS, IN A NUTSHELL Don’t depend on the world’s friendship, For friends can turn into foes. Although walnut has a round shape, Not every round object is a walnut Flower and Soil Parvin E’tesami There is more than meets the eye whenever we look at walnuts (Juglans regia). Civilizations originary from modern day Iran have used and traded walnut tree products since the 7000 BC (Vahdati, 2014). The walnut fruit offers plenty of nutrients and health benefits (Chudhary et al., 2020), the wood is strong and lustrous (Voulgaridis and Vassiliou, 2005), the essential oils of the leaves are moisturizing (Verma et al., 2013). Walnuts traveled far and wide as they were actively traded through the Silk Road, reconquering the Eurasian continent (Pollegioni et al., 2014). Moreover, spatial genetic partitions among walnut populations coincide with large differences in human language; similarly, areas with similar human languages coincide with areas were walnut populations have been homogenized (Pollegioni et al., 2015). The trade of walnuts remains an important part of the global economy. In 2021, the California and the US produced more than 725,000 tons of walnuts valued in more than $1.0B, following a historically increasing trend of both bearing acreage and bearing trees per acre (NASS, 2022). World demand for walnut keeps increasing and it is estimated that the world will consume a record 2.5M tons of walnuts for 2023, and the US is forecasted to satisfy 25% of the global demand (FAS, 2022). The trade is not limited to the food industry, as there is also growing research on additional uses for walnut shell material for more durable batteries (Wahid et al., 2017), lower-cost concrete (Hilal et al., 2020), and stronger epoxy composites (Lala et al., 2018). As climate change alters weather patterns, and the demand for walnut and its byproducts increases, we must breed walnuts with more suitable traits (Bernard et al., 2017). Quantitative analyses and comprehensive pheontyping can 70 accelerate current breeding programs by quickly identifying varieties and individuals with desirable characteristics (Fiorani and Schurr, 2013; Rahaman et al., 2015). The rapid selection of potentially desirable progenitors for breeding programs is especially crucial for walnuts, as seedlings are hard to propagate, it takes at least 2 years for trees to bear fruit for the first time, and at least 5 more to yield fruit at a commercial scale (Lopez, 2004; Popa et al., 2023; Verma, 2014). Most of the current walnut phenotyping follows the measuring guidelines set by the International Plant Genetic Resources Institute (IPGRI, 1994). The morphological phenotyping of the fruit is mainly done using calipers to measure length, width, and height, combined with visual assessments to describe more complicated traits such as texture and curvature. These simple measurements have proved to be insightful to evaluate and identify promising genotypes. Moderate correlations have been reported between these traditional morphological traits of the walnut tree and fruit with commercial and horticultural traits of interest such as pollen release strategy, yield, shell thickness, kernel weight, and pathogen resistance (Akca and Şen, 1995; Kelc et al., 2007; Khadivi-Khub et al., 2015; Rezaei et al., 2018; Shah et al., 2021; Solar et al., 2003). However, this caliper- and eye-based approach is time consuming, prone to human error and subjectivity, and fails to capture richer shape nuance observed in the shells and kernels. As next generation sequencing (NCS) technology advances, we observe an explosion in genomics data collection that must be matched by equally powerful and encompassing phenomics (Andrade- Sanchez et al., 2013; Araus and Cairns, 2014; Tanabata et al., 2012). We have to look deeper than just nut lengths and widths. To that end, X-ray computed tomography (CT) scanning has proved to be a powerful tool to accurately capture intricate, internal features of a vast array of plant data in a nondestrutive manner. High-resolution, X-ray CT 3D reconstruction have been successfully used to capture and quantify the complex branching architecture of inflorescence in grapevines (Li et al., 2019) and sorghum panicles (Li et al., 2020), digitally segment and phenotype all the individual seeds in a barley panicle (Amézquita et al., 2021), determine nuances in soil porosity for diverse wheat root-soil interactions (Zhou et al., 2020), and measure exact volumes and distribution of oil glands across multiple citrus exocarps (Amézquita et al., 2022). To the best of our knowledge, 71 Bernard et al. (2020) is the first study that exploits X-ray CT 3D reconstructions to automatically, accurately, and systematically quantify multiple nut shape phenotypes from a germplasm diversity panel maintained by INRIA. Given the nature of X-rays, Bernard et al. are able to fully measure the volume and percentage of different nut tissues, namely shell, kernel, and air contained, as well as shape descriptors related to the whole nut, like total volume and surface area. Based on these results, they are able to propose the selection of genotypes with higher kernel filling ratio and thinner shells. In particular, they observe that larger fruits are correlated with rougher shell shape and smaller kernel filling ratio. We also highlight that micro-CT imaging has been recently used to document morphological changes of flower bud development (Gao, 2022), and to explore the puzzling diversity and structure of the cell tesselations that conform the hard shell tissue for multiple nuts (Huss et al., 2020). In both of these cases, the micro-CT imaging plays a more exploratory role rather than a quantifying one. Here, we study the shape of walnut fruits based on the X-ray CT 3D reconstruction of 1256 different samples comprising 173 accessions maintained by the Walnut Improvement Program at the University of California Davis. We exploit the nondestructiveness of X-rays to isolate individual walnuts and segment out shell, kernel, and packing tissues, as well as the air contained inside every walnut. We then compute 38 different shape- and size-related traits for each walnut. This includes side lengths, surface areas, and volumes of the whole nut and individual tissues, filling ratios, and sphericity and convexity indices. This image processing task was done with an in-house, python-based, open-source script inspired by the procedure described by Bernard et al. (2020). We include the computation of the 14 traits used by them. Second, we look for allometric relationships of interest across the whole population —the growth rate of a tissue relative to another. Third, we perform Kruskal-Wallis analyses of variance (Kruskal and Wallis, 1952) to determine which morphological traits contribute the most to qualitative traits of interest, such as kernel ease of removal, kernel plumpness, and shell strength and integrity. Finally, we singled out the only Himalayan, heterozygous accession and studied more carefully its morphology, as it is reportedly extremely hard to crack open and extract its kernel. We noticed that this particular accession is 72 no different in size or tissue distribution compared to other easier to crack accessions, except for subtle differences in the kernel’s main cavity at the proximal end. This morphological modeling will allow us to set a new exciting path to explore further the phenotype-genotype relationship in walnuts. 5.1 Materials and methods 5.1.1 Plant material and scanning All plant materials represent walnut breeding lines, germplasm, and cultivars maintained by the Walnut Improvement Program at the University of California, Davis. A total of 149 walnuts accessions were harvested into mesh bags at hull split, oven-dried overnight at 95F, and then air- dried for several weeks before moving into cold storage at 35F. 5 to 16 individuals were selected for each accession, for a total of 1301 individual walnuts to be scanned at Michigan State University (Table C.1). The walnuts were scanned in 173 batches. The scans were produced using the the North Star X3000 system and the included efX-DR software, with 720 projections per scan, at 3 frames per second and with 3 frames averaged per projection. The data was obtained in continuous mode. The 3D X-ray CT reconstruction was computed with the efX-CT software, obtaining voxel-based images with voxel size of 75.9 µm. All the individual walnuts were manually separated with the efX-CT software (Figure 5.1.A). Densities were rescaled so that all scans share similar air, kernel, and shell densities. Once densities were comparable across samples, the external air and other debris was removed through thresholding and mathematical morphology operations (Figure 5.1.B). Rough estimates for the location of shell, air, kernel, and packing tissues were obtained based on density and object thickness information. These tissues were fully segmented using the watershed segmentation algorithm (Falcao et al., 2004) (Figure 5.1.C, E-H). We took particular care of tissue labeled as shell, where we distinguished voxels on or close to the walnut surface, to voxels protruding into the internal cavity. We also labeled apart voxels that were of similar exterior shell density and that comprised extraneous bulges (Figure 5.1.D). Some of the scanned walnuts contained incomplete or no kernel at all. These were discarded from further morphological analysis, leaving us with a total of 1264 individual walnuts. All the 73 Figure 5.1: Walnut scanning, image processing, and phenotyping. (A) Raw scans of individual walnuts. (B) Densities were standardized across all samples and the external air removed. (C) Shell, air, kernel, and packing tissue were automatically labeled with a combination of basic image morphology operations and watershed segmentation. (D) The tissue labeled as shell was further broken down into external shell, bulging, and protruding tissue. (E) 3D renders of shell, (F) air, (G) packing tissue, and (H) kernel. (I) All the walnuts were centered on their center of mass and aligned. (J) The same centering and alignment was applied to the kernels. All the figures above are for illustration purposes only and are not scaled. 74 image processing above was done automatically with in-house, scipy-based, python scripts. To make some measurements comparable, all the walnuts were centered on their centers of mass and rotated such that the lateral plane goes through the walnut seal, and the shell tip is the rightmost point of the longitudinal plane (Figure 5.1.I.) The same center and rotation was immediately applied to the kernel (Figure 5.1.J). 5.1.2 Walnut morphological traits and evaluation For each individual we computed the same 14 morphological traits as in Bernard et al. (2020): nut length, height, width, total surface area, total volume, rugosity, sphericity, shape VA3D, equancy, shell volume, shell thickness, kernel volume, kernel filling ratio, and the empty space volume. We computed an additional collection of 24 morphological traits for a total of 38 mea- surements per sample. We computed the length, width, and height of the kernel, the volume of protruding shell, inner bulging shell, packing tissue, the percentage of each tissue volume, and Krumbein and Sneed sphericity indices (Blott and Pye, 2008). We also computed surface area and volume of the nut’s convex hull, and their ratio between actual nut surface area and volume respectively as a proxy for lobeyness. This computation was repeated for the kernel. (Table 5.1) Since all the 38 morphological traits are nonnegative, we used the quartile coefficient of variation (QCD) to measure the numerical variability of each of them across the 1301 scans. We preferred the QCD as it only depends on the 25th and 75th quartiles, making it robust against outliers compared to the coefficient of variation (CV) (Bonett, 2006). We studied allometric relationships between all size-specific morphological traits, that is, the relative growth of one feature with respect to another one. It is a well-documented phenomenon that different tissues grow relative to each other following a power law rather than a simple linear relationship (Niklas, 2004; West et al., 1999), so we plotted our data in log-log plots (Figure 5.2). Due to this nonlinear relationships between traits, we favored the computation of Spearman rather Pearson correlation coefficients between different morphological phenotypes (Figure 5.3). Ten walnuts were cracked open from each sample using a hammer. Ease of removal was scored onto that each sample on an ordinal scale (1-9) as the ease with which intact kernel halves could be 75 Table 5.1: Morphological traits measured. Traits organized by unit and walnut tissue involved. Exp indicates formula or expression. Asterisk denotes not applicable. Trait Exp Description Unit Whole walnut Length Lw Distance from base to tip mm Width Ww Longest distance across the seal mm Height Hw Longest distance perpendicular to the seal mm Surface Aw Surface area of the actual nut mm2 Convex surface Acw Surface area of the convex hull of the nut mm2 Volume Vw Total volume of the actual nut, including air mm3 Convex volume Vcw Volume of the convex hull of the nut mm3 VA3D A3w /(36πVw2 ) Shape factor * Feret Ratio Hw /Lw p Inverse index of roundness * 3 2 Krumbein p3 Ww Hw /Lw Index of roundness * 2 Sneed p3 Hw /Ww Lw Index of roundness * Sphericity 36πVw2 /Aw Wadell’s index of roundness * Rugosity 1/sphericity Index of surface roughness * Convex area ratio Acw /Aw Index of nonconvexity * Convex vol ratio Vw /Vcw Index of nonconvexity * Shell Total Volume Vs Total volume of the shell mm3 External Volume Ve Volume of shell without protrusions or bulges mm3 Bulging Volume Vb Volume of shell-like tissue bulging into the walnut mm3 Protruding Vol Vp Volume of shell-like bits that protrude into cavity mm3 Thickness Ts Average thickness of the external section of the shell mm Volume ratio Vs /Vw Percentage of shell with respect to the whole walnut % External ratio Ve /Vs Percentage of the external section of the shell % Bulging ratio Vb /Vs Percentage of shell-like tissue bulging into the walnut % Protruding ratio Vp /Vs Percentage of tissue protruding into the cavity % Kernel Length Lk Distance perpendicular to the transverse plane mm Width Wk Distance perpendicular to the longitudinal plane mm Height Hk Distance perpendicular to the lateral plane mm Surface Ak Surface area of the actual kernel mm2 Convex surface Ack Surface area of the convex hull of the kernel mm2 Volume Vk Total volume of the actual kernel mm3 Convex volume Vck Volume of the convex hull of the kernel mm3 Volume ratio Vk /Vw Percentage of kernel with respect to the whole walnut % Convex area ratio Ack /Ak Index of nonconvexity * Convex vol ratio Vk /Vck Index of nonconvexity * 76 Table 5.1 (cont’d) Trait Exp Description Unit Packing tissue Volume Vt Total volume of the packing tissue mm3 Volume ratio Vt /Vw Percentage of packing tissue % Air Volume Va Total volume of the air contained by the nut mm3 Volume ratio Va /Vw Percentage of air with respect to the whole walnut % extracted, with lower numbers representing easier removal. At the same time, the physical walnuts were cracked open and qualitative score was assigned to the ease of removal, shell integrity, shell strength, and shell texture. We computed Kruskal-Wallis one-way analyses (Kruskal and Wallis, 1952) to determine which morphological phenotypes contribute the most to differentiate ease of removal and shell strength scores. These results follow a conservative 10−10 false discovery rate after performing a multiple test Benjamini-Hochberg correction (Benjamini and Hochberg, 1995). A similar analysis of variation was performed to determine the morphological traits that are the most different between one walnut accession and the rest of the scanned collection. 5.2 Results Our scanned walnut collection reported overall stable values of sphericity indices, and surface area and volume ratios with respect to their convex hull. This suggests that walnuts by and large have similar overall shell shape, rugosity, and lobeyness. More specifically, we observe a very high ratio of total nut volume to convex hull volume (0.95±0.01), so overall nut is very close to being convex. However, nuts also exhibit a much lower ratio of total surface area to convex hull area (0.63±0.01), which indicates that their shell surface is covered by numerous, thin grooves. At the same time, walnuts reveal a especially large variability of shell tissue. In particular, the amount of shell tissue that protrudes into the walnut inner cavity ranges from 0 to 206 mm3 . This corresponds to a QCD of almost 0.5, where the 75th quartile (41.5 mm3 ) is almost 3 times as large as the 25th one (15.1 mm3 ). There is a similar large variation of reported values of shell tissue that bulges inwards (Table 5.2). 77 Table 5.2: Morphological trait values. Standard deviation, 25th quartile, 75th quartile, quartile coefficient of dispersion, and coefficient of variance are indicated by SD, Q25 , Q75 , QCD, and CV respectively. Traits sorted by QCD. Asterisk denotes not applicable. Trait Units Mean + SD Range Q25 Q75 QCD CV Nut Cvex Vol Ratio * 0.95 ± 0.01 0.86 — 0.98 0.95 0.97 0.01 0.02 Nut Cvex Area Ratio * 0.63 ± 0.01 0.49 — 0.66 0.63 0.64 0.01 0.02 External Shell Ratio % 0.94 ± 0.03 0.81 — 0.99 0.93 0.96 0.02 0.03 Shell Rugosity * 1.66 ± 0.06 1.56 — 2.30 1.62 1.69 0.02 0.03 Nut Sphericity * 0.60 ± 0.02 0.44 — 0.64 0.59 0.62 0.02 0.03 Sneed Index * 0.92 ± 0.03 0.83 — 1.00 0.90 0.94 0.02 0.03 Krumbein Index * 0.90 ± 0.05 0.75 — 1.00 0.87 0.93 0.04 0.05 Kernel Cvex A Ratio * 0.38 ± 0.02 0.30 — 0.53 0.36 0.40 0.04 0.06 Nut Feret Ratio * 1.21 ± 0.10 1.00 — 1.56 1.14 1.27 0.05 0.08 Kernel Cvex V Ratio * 0.56 ± 0.05 0.38 — 0.69 0.54 0.60 0.05 0.09 Nut Width mm 32.1 ± 2.87 23.1 — 45.4 30.2 33.7 0.06 0.09 Nut Height mm 33.4 ± 2.94 25.6 — 44.6 31.4 35.2 0.06 0.09 Nut VA3D * 4.59 ± 0.50 3.78 — 12.2 4.27 4.83 0.06 0.11 Kernel Height mm 28.2 ± 2.72 20.8 — 39.9 26.4 29.9 0.06 0.10 Kernel Length mm 30.4 ± 2.97 18.2 — 40.4 28.4 32.4 0.06 0.10 Kernel Width mm 24.8 ± 2.8 15.7 — 39.8 23.0 26.4 0.07 0.11 Nut Length mm 38.5 ± 4.10 26.3 — 53.2 35.7 41.2 0.07 0.11 Packing Vol Ratio % 0.13 ± 0.02 0.07 — 0.23 0.11 0.14 0.08 0.15 Kernel Vol Ratio % 0.34 ± 0.05 0.18 — 0.48 0.31 0.37 0.09 0.15 Nut Cvex Area mm2 3675 ± 609 2057 — 6067 3257 4044 0.11 0.17 Kernel Cvex Area mm2 2572 ± 427 1277 — 3934 2281 2847 0.11 0.17 Nut Area mm2 5799 ± 1011 3316 — 9978 5116 6401 0.11 0.17 Air Vol Ratio % 0.37 ± 0.07 0.14 — 0.61 0.32 0.41 0.12 0.18 Shell Thickness mm 0.88 ± 0.16 0.51 — 1.57 0.77 0.99 0.12 0.18 Kernel Area mm2 6773 ± 1276 2827 — 10991 5905 7595 0.13 0.19 Shell Vol Ratio % 0.16 ± 0.03 0.09 — 0.33 0.14 0.18 0.13 0.20 Nut Volume mm3 19560 ± 4757 7905 — 41132 16300 22235 0.15 0.24 Kernel Vol mm3 6565 ± 1589 2087 — 12232 5542 7650 0.16 0.24 Nut Cvex Vol mm3 20511 ± 5102 8470 — 43346 16932 23441 0.16 0.25 Packing Vol mm3 2476 ± 676 954 — 6321 2032 2815 0.16 0.27 Kernel Cvex Vol mm3 11680 ± 2900 4007 — 21761 9710 13501 0.16 0.25 External Shell Vol mm3 3001 ± 811 1020 — 6369 2418 3501 0.18 0.27 Shell Volume mm3 3188 ± 886 1139 — 7446 2568 3724 0.18 0.28 Air Volume mm3 7332 ± 2628 2126 — 21812 5520 8681 0.22 0.36 Bulge Shell Ratio % 0.05 ± 0.02 0.00 — 0.18 0.03 0.06 0.32 0.52 Prot Shell Ratio % 0.01 ± 0.01 0.00 — 0.07 0.01 0.01 0.41 0.81 Bulge Shell Vol mm3 155 ± 108 9.94 — 1073 81.2 199 0.42 0.70 Prot Shell Vol mm3 32.5 ± 26.8 0.00 — 206 15.1 41.5 0.47 0.82 78 Figure 5.2: Various allometry plots between different logarithmic values of tissue volumes, areas, and lengths compared to the total walnut volume. An ordinary least squares linear model was computed for each case. The slope, intercept, and coefficient of determination for each linear model is indicated by m, b, and R2 respectively. We observe that most of the size-specific traits follow power laws with respect to the total nut volume Vw , as our allometric log-log plots exhibit large R2 coefficients of determination (Figure 5.2). The size-related measurement that exhibits the most superlinear growth rate is the total air volume contained inside the nut Va . Our data suggests that these two volumes follow the power law Va ≈ exp(−3.17)Vw1.22 . That is, as the nut volume increases, biophysical constraints require that the air volume increases by a larger factor. However, the air volume must always be lower than the total nut volume. Evaluating the extreme case of a hypothetical walnut consisting entirely of air, we find that Vw = exp(−3.17)Vw1.22 when Vw ≈ 2.3 × 106 mm3 . This is the same volume of a 16cm diameter sphere. We also highlight that with a R2 coefficient of determination very close to 1, the volume 79 of the convex hull of the nut Vcw follows a superlinear growth rate with respect to Vw , the power law Vcw = exp(−0.16)Vw1.02 . Do notice that the constant factor, exp(−0.16) = 0.85 is less than 1, so that if Vw < 2063mm3 , then Vcw < Vw , which is impossible. In other words, our allometric power law only holds for walnuts whose total volume is comparable to that of a 1.6cm diameter sphere. This in turn suggests interesting biophysical growth patterns at the early developmental stages of the walnut (Pinney and Polito, 1983; Zhao et al., 2016). Of important note is the fact that kernel volume grows at a slightly sublinear rate with respect to total nut volume, with a power law Vk = exp(0.19)Vw0.87 . For instance, if the total nut volume is duplicated, then the kernel volume will only increase by a factor of 20.87 ≈ 1.8, which already indicates that larger walnuts tend to have smaller kernel percentages, while they also tend to contain higher air percentages. The last observation is also supported by a high, negative Spearman correlation index (-0.78) between the kernel and air volume ratios , and a smaller, negative index (-0.25) with nut volume. We also observe that kernel ratio is positively correlated with its convex volume inverse ratio (0.84) while negatively correlated with the convex area ratio (-0.65). There is also a small positive correlation with walnut sphericity (0.25) and walnut convex volume inverse ratio (0.23) (Figure 5.4). This implies that walnuts that are smaller in volume, with smoother shells, and less grooves on their surface tend to have a larger percentage of kernel with respect to its total size. This observation agrees with Bernard et al. (2020). At the same time, kernel percentage is higher when its overall shape is more convex and it present numerous deep but thin grooves. (Figure 5.4). For shell thickness, we observe unsurprising high correlations with shell volume percentage (0.91), shell total volume (0.78), and other shell-related measurements. There are also negative correlations with air percentage (0.47) and nut sphericity (-0.20), which implies that nuts with thinner shells tend to have smoother shells but a higher content of air. (Figure 5.4). The morphological traits that explained the most variance for ease of removal were shell thickness and shell, air, and packing tissue volume percentage. The amount of shell-like tissue that bulges into the walnut cavity also appears to influence the ease of removal, as well as the volume and surface area ratio of the kernel with respect to its convex hull. In general, unbroken kernel 80 Spearman correlation 1.0 0.5 0.0 0.5 1.0 Nut Convex Area Ratio Nut Sphericity Nut Convex Volume Inverse Ratio Packing Volume Ratio Kernel Volume Ratio Kernel Convex Volume Inverse Ratio Krumbein Index Sneed Index External Shell Ratio Air Volume Ratio Kernel Convex Area Ratio Shell Volume External Shell Volume Air Volume Nut Length Kernel Length Nut Surface Area Nut Volume Nut Convex Surface Area Nut Convex Volume Nut Height Nut Width Kernel Height Kernel Width Packing Volume Kernel Convex Volume Kernel Convex Surface Area Kernel Volume Kernel Surface Area Nut VA3D Shell Rugosity Nut Feret Ratio Protruding Shell Ratio Protruding Shell Volume Shell Volume Ratio Shell Thickness Bulge Shell Ratio Bulge Shell Volume Nut Convex Area Ratio Nut Sphericity Nut Convex Volume Inverse Ratio Packing Volume Ratio Kernel Volume Ratio Kernel Convex Volume Inverse Ratio Krumbein Index Sneed Index External Shell Ratio Air Volume Ratio Kernel Convex Area Ratio Shell Volume External Shell Volume Air Volume Nut Length Kernel Length Nut Surface Area Nut Volume Nut Convex Surface Area Nut Convex Volume Nut Height Nut Width Kernel Height Kernel Width Packing Volume Kernel Convex Volume Kernel Convex Surface Area Kernel Volume Kernel Surface Area Nut VA3D Shell Rugosity Nut Feret Ratio Protruding Shell Ratio Protruding Shell Volume Shell Volume Ratio Shell Thickness Bulge Shell Ratio Bulge Shell Volume Figure 5.3: Spearman correlation for all phenotypes. Most of the overall nut size-related traits are only positively correlated with kernel size-related traits as expected. All the shell-specific traits are only positively correlated between themselves. All the sphericity, aspect ratio, and rugosity are highly correlated only among themselves. 81 Kernel Volume Ratio Spearman correlation Shell Thickness Spearman correlation Kernel Convex Volume Inverse Ratio Shell Volume Ratio Air Volume Ratio Shell Volume Kernel Convex Area Ratio External Shell Volume Air Volume Bulge Shell Volume Air Volume Ratio Kernel Volume Bulge Shell Ratio Shell Rugosity External Shell Ratio Nut VA3D Protruding Shell Volume Nut Sphericity Nut Length Kernel Convex Volume Inverse Ratio Nut Surface Area Shell Rugosity Nut Convex Volume Inverse Ratio Nut Sphericity Nut Convex Surface Area Nut VA3D Nut Convex Area Ratio Kernel Surface Area Nut Feret Ratio Nut Convex Volume Packing Volume Nut Volume Krumbein Index Nut Convex Area Ratio Sneed Index Nut Surface Area Nut Length Nut Convex Surface Area Nut Height Kernel Length Nut Width Nut Convex Volume Packing Volume Ratio Kernel Volume Nut Volume External Shell Volume Air Volume Shell Volume Nut Convex Volume Inverse Ratio Packing Volume Nut Width Sneed Index Nut Height Kernel Surface Area 0.5 0.0 0.5 0.5 0.0 0.5 Figure 5.4: Highest Spearman correlation traits for kernel volume ratio and shell thickness. (A) All the correlation coefficients highlighted in the barplots are statistically significant (p-value > 10−3 ). (B) Different traits versus kernel filling ratio and (C) shell thickness. The Pearson and Spearman correlation coefficients are denoted by r and s respectively. 82 Figure 5.5: Morphological traits that explain the most variance across different qualitative groups according to Kruskal-Wallis analyses. (A) Boxplots for different trait values according to the ease of kernel removal and (B) shell strength of the walnut. Both qualitative traits are scored in an increasing scale 3–8, where 3 indicates the easiest walnut to remove both kernel halves intact and the least strong shell. All the highlighted traits are statistically significant after a 10−10 false discovery rate correction. halves are easier to remove for walnuts with thinner shells, with relatively little external and internal shell and packing tissue content. The kernels are easier to extract when the nut contains a higher percentage of air, and when they have deeper and wider grooves (Figure 5.5). For shell strength, the morphological traits that explained it the most were unsurprisingly related to shell thickness, volume, or relative percentage. We also noticed that shell is stronger for walnuts with lower relative content of air and higher content of packing tissue. Up to a point, shell strength also seems to be affected by kernel volume and walnut length (Figure 5.5). 83 For both ease of removal and shell strength, we notice that in general the Himalayan Earliest accession (the only one scoring 8 for both evaluations) consistently breaks a visual trend followed by the rest of accessions and scores. Comparing the morphological features distributions of this Himalayan accession with the rest of the collection reveals that this Himalayan accession has on average a larger percentage of air content and a lower percentage of kernel and packing tissue volume. This accession on average also has a higher kernel area ratio and lower kernel volume inverse ratio with respect to its convex hull. This indicates that kernels of Himalayan accession tend to have wider, deeper cavities, which most likely are filled with air. We did not find significant distribution differences for the rest of morphological traits between the Himalayan accession and the rest of the collection. 5.3 Discussion The diversity, propagation, and diffusion of walnut populations offer an important window to past climates and civilizations. Current genome sequencing data suggests that the common walnut originated as an ancient hybrid in the late Pliocene, 3.45 million years ago, as a cross between American and Asian Juglans lineages (Zhang et al., 2019). However, the Last Glacial Maximum extinguished most of the walnut populations except for some located in select glacial refugia 18 thousand years ago (Aradhya et al., 2017). This cataclismic event imposed a severe bottleneck effect on the walnut germplasm, reducing dramatically its effective population size (Ding et al., 2022). These refugia were mostly separate pockets surrounded by mountainous terrain between Southwestern China, the Qinghia-Tibet Plateu, and the Himalayas, regions that represent the core of walnut genetic diversity (Luo et al., 2022). From there, humans propagated walnut populations through trade following the Silk Road, dispersing the walnut all the way from the Iberian peninsula to Southeastern China (Beer et al., 2008). There is plenty of observed phenotypic diversity within a fixed walnut population, but surprisingly these variations are usually not sufficient to distinguish geographically distant populations (Roor et al., 2017). It could be that morphological differences across regions are too subtle to be captured by simple caliper measurements. This difficulty to comprehensively measure walnut morphology is more pressing when trying 84 Figure 5.6: Main morphological differences between the Earliest Himalayan accession (1) and the rest of the collection (0). (A) Boxplots of morphologhical traits that are the most different for the Earliest accession. (B) Longitudinal plane view of kernel halves. The top two rows depict Earliest kernels, while the bottom two rows depict other accessions. Notice that the bottom cavity tends to be deeper and wider for the Earliest accession. All the highlighted traits are statistically significant according to Kruskal-Wallis analyses of variance after a 10−5 false discovery rate correction. 85 to understand the fine-grained details that determine important traits of commercial concern (Du and Tan, 2021). One such trait is the ease of removal of the kernel, how easy is to remove the main two halves of walnut intact. A related trait is the shell strength and the fracture patterns suffered by shells under pressure (Gülsoy et al., 2019). Multiple-pronged strategies have been proposed and developed to unravel the underlying mechanisms that regulate shell thickness, shape, and strength. Research on these traits has been approached from a genetics standpoint, such as quantitative trait loci (QTL) and genome-wide association studies (GWAS) to identify transcription factors and pathways that affect the shell and seal formation (Sideli et al., 2020; Wang et al., 2022). Walnut shell behavior has been explored with numerical simulations, where walnuts are modeled as thin spheres and biophysical mechanical properties are tested under unidirectional loads based on finite-element analyses (Bao et al., 2022; Koyuncu et al., 2004). Recent exciting work has focused on the shape of the polylobate sclerid individual cells that tesselate and conform the walnut shell while forming intricate puzzles that confer remarkable toughness and strength (Antreich et al., 2019; Zhang et al., 2014). This approach is then complemented with biochemical techniques that seek to unravel the structural and compositional changes during walnut shell development, as individual cells go from a soft to hard state (Antreich et al., 2021; Xiao et al., 2020). X-ray CT scans allows us to accurately extract more nuanced shape and size features from our sampled walnuts, providing new avenues to explore subtle morphological changes and implications. For example, with more size-related features, we can compute better allometric relationships that point to biophysical constraints in walnut growth development. In particular, we are able to draw theoretical upper and lower bounds on walnut size. The growth of empty space within a walnut outpaces the overall nut growth rate, which indicates that larger walnuts tend to contain a higher proportion of air. At the same time, we observed that for small nuts, their total volume was almost identical to the volume of their convex hulls. This implies that smooth, groove-free nuts must be small. Moreover, this allometric relationship only holds for nut that are larger than a certain size, which indicates that the growth dynamics of the nut undergo a regime change as the nut develops. We also observe that the kernel volume grows at a slower rate than the total nut volume, which 86 suggests that larger walnuts tend to contain less kernel tissue (Figure 5.2). Walnuts with more relative kernel content tend to be smoother but the kernels themselves present more grooves that are deeper and narrower. Walnuts with thinner shells tend to have smoother shells but a higher content of air (Figure 5.4). All of the allometric observations and correlations above suggest that walnut and kernel sizes and smoothness are not just dependent on genes and environment, but there are also unavoidable biophysical constraints at play that should be explored further and considered by breeding programs (Niklas and Hammond, 2019). Our extended list of measured phenotypes also offers new insight to qualitatively assessed traits. We highlight traits related to packing tissue, contained air, or convexity ratios, as they are difficult or impossible to measure with traditional tools. To the best of our knowledge, this is the first time that such traits are completely quantified. Packing tissue is commercially relevant, as its filling ratio seems to affect the ease of kernel removal and shell strength. Nuts with high relative content of packing tissue tend to have stronger shells and present more difficulties to remove kernels (Figure 5.5). This confirms previous reports on moderate correlations between packing tissue thickness and ease of kernel removal (Fallah et al., 2022; Kouhi et al., 2020). The convexity of kernel also appears to play an important role in the cracking mechanics of the nut. The hardest walnut to crack in the collection reported very average size and shape measurements except for subtle differences in the kernel’s main cavity at the proximal end. This cavity might act as a clamp under pressure, which could explain the harness to crack this particular accession open. This might suggest that the easiness of kernel removal and shell integrity might not be solely dependent on shell thickness and volume, but also on shape characteristics of the rest of nut tissues. Moreover, this uniquely clamp-shaped kernels are closely related to wild type accessions from the Himalayas, one of the areas with the highest diversity of walnut germplasm (Shah et al., 2021). Walnuts offer a especially unique opportunity to analyze domestication in perennial crops. De- spite their long history with humans, current research suggests that walnut domestication happened less than 100 years ago (Mapelli et al., 2018). Even today, due to economical and horticultural reasons, walnut is propagated through via seeds and not grafting throughout most of Southwest 87 Asia (Rezaee et al., 2008; Thapa et al., 2021). This makes walnut an exciting organism to study the immediate effects of domestication and breeding in real time across multiple populations. A care- ful, nuanced study of kernel morphology might provide us key insights into domestication-induced morphological changes, and accelerate the selection of progenitors in breeding programs. 88 CHAPTER 6 CONCLUDING REMARKS All the plants out here are malevolent, heavy and sharp. The parts of the palms above the fronds are tufted in sick stuff like coconut-hair. Roaches and other things live in the trees. Rats, maybe. Loathsome high-altitude critters of all kinds. All the plants either spiny or meaty. Cacti in queer tortured shapes. The tops of the palms like Rod Stewart’s hair, from days gone by. —from Infinite Jest David Foster Wallace 6.1 Conclusion In this dissertation we presented three different project, each with a novel way to compre- hensively encode and compare the diverse morphology found in the plant biology domain. The three of our applications focus on the quantification of shape based from high-resolution 3D X-ray CT scan reconstructions. The non-destructive and thorough nature of X-rays allowed us to fully extract barley seeds from their panicles without worrying about occlusion; it allowed us to examine different fruit tissues from citrus; and it was key to measure the walnut kernels before cracking the shell open. Traditional and modern morphometrics has a number of drawbacks with respect to X-ray CT images. Landmark-based morphometrics requires homologous points and, although 3D and higher dimensional analysis is possible, it is usually applied to 2D images (Dryden and Mardia, 2016). Further, a geometric framework is limited to the relationship of data points to each other. Landmark-based approaches reduce the shape information to a relatively small and possibly subjective collection of points, which can be further restricted if there are no obvious homologous landmarks across all samples. Fourier-based outlines are limited to the analysis of 2D images and are not suitable for inputs in higher dimensions. We thus turned to algebraic topology and directional statistics. 89 Our results on the quantification of barley seed morphology shows that the Euler characteristic is a simple yet powerful way to reveal features not readily visible to the naked eye. There is “hidden” morphological information that traditional and geometric morphometric methods are missing. The Euler characteristic, and Topological Data Analysis (TDA) in general, can be readily computed from any given image data, which makes it a versatile tool to use in a vast number of biology-related applications. TDA provides a comprehensive framework to detect and compare morphological nuances, nuances that traditional measures fail to capture and that remain unexplored using simple geometric methods. In the specific case of barley seeds presented here, these “hidden” shape nuances provide enough information to not only characterize specific accessions, but the individual spikes from which seeds are derived. Our results suggest a new exciting path, driven by morphological information alone, to explore further the phenotype-genotype relationship. TDA is just one of the novel mathematical domains that can be used to further our biological insight. There is rich shape information in the natural world to be captured, analyzed, and linked to biophysical developmental and evolutionary principles. Sound mathematical models are key to uncover these biophysical interactions at work. For example, even with a limited number of points, overall fruit shapes can be approximated with various quadratic surfaces like ellipsoids. Given the appropriate parameters, an ellipsoid can represent both nearly spherical navel oranges, and elongated finger limes. This quadratic surface approximation is mathematically versatile and computationally simple, and can be applied to other round-shaped biologically-motivated data. Moreover, ellipsoid coordinates can be translated naturally to longitudes and latitudes on a sphere, which opens the door to a wide array of mathematical tools from directional statistics. Some of those tools, like density estimations, hypothesis testing, and distribution fitting, allow us to quantify shape in a mathematically rigorous and comprehensive way. Our results for each of the three projects presented suggest extremely exciting future research, in both mathematical and biological lines. 90 6.1.1 Further exploration with ECT The ECT has proved to be an extremely powerful morphological descriptor of grain shape. There are a number of topics to explore with respect to the implementation of this novel approach, both theoretically and empirically. For instance, exploring how our predictions might change if we pick uniformly randomly distributed directions —or according to any other probability distribution— instead of polar-biased ones. There is also more research to be done into alternatives to determine which 3D shape features are the most relevant to distinguish inter-accession characteristics. One possibility is running our results through established statisitical pipelines like SINATRA to uncover such features (Wang et al., 2021). We can also explore how our barley classification results might change with variants of ECT, such as the Smooth ECT (Crawford et al., 2020), Weighted ECT (Jiang et al., 2020), or Euler Characteristic Surfaces (Beltramo et al., 2022). 6.1.2 Brewing barley genomics into the topological party There is extensive literature to understand the underlying genetic mechanisms that allow bar- ley’s tremendous versatility (Hockett and Nilan, 1985; Mascher et al., 2017; Sato, 2020). The combination of genomics with archaeology has revealed important patterns of its domestication and its intimate relationship with ancient civilizations (Mascher et al., 2016; Russell et al., 2016). The historical and geographical diversity makes barley an ideal organism to understand the genetic adaptations to tillering (Komatsuda et al., 2007), UV intensity, changes in sunlight availability, and flowering time (Dawson et al., 2015), and more importantly, to understand how these adaptations relate to total grain weight, which is a crucial trait to develop better cereal crops (Liller et al., 2015). For our historical composite cross barley population, we have access to genome-wide genotypic data for 850 progeny using a RADseq reduced representation approach (Baird et al., 2008), allowing us to score over 200,000 polymorphic sites in each individual. RADseq has proved to be a powerful approach to sequence organisms with large genomes with limited sequence data. Combined with high-density linkage maps, RADseq has been able to map QTLs responsible for chlorophyl-related traits in soybean (Wang et al., 2020), fiber quality in hemp (Petit et al., 2020), and agronomic traits (such as panicle length and average grain weight) in grasses such as foxtail millet (Wang et al., 91 2017). RADseq has been especially powerful when it comes to study how a particular trait has evolved among large and diverse populations (Davey and Blaxter, 2011). RADseq combined with genome-wide association studies (GWAS) has unveiled important polymorphic regions responsible for caffeine content during the domestication of multiple tea plants (Yamashita et al., 2020), seed size during domestication of soybean (Zhou et al., 2015), and has further our knowledge of the population structure of sesame accessions across four continents (Basak et al., 2019). There is a fine relationship between barley genome and barley morphology, especially about the specifics on the domestication of barley shape. Important ongoing work explores patterns and relationships between our computed topological signatures and different genetic loci across multiple filial generations using RADseq reads, heritability analyses, and GWAS. Current preliminary results have already yielded candidate loci that can be further explored for synteny among other barley accessions and even among other related grasses like wheat or rye. 6.1.3 If life gives you lemons, determine distances between their oil gland distributions As discussed at the end of Chapter 4, directional statistical tests strongly reject the hypothesis that citrus oil glands follow a uniform, rotationally symmetric, or any other well-studied distribution. An alternative to mathematically characterize oil gland distributions is to turn to non-parametric approaches based on spherical kernel density estimators (Di Marzio et al., 2019; Vuollo and Holmström, 2018). Then we can measure numerically the distance between two nonparametric spherical density functions (Boente et al., 2014) and compute a pairwise distance matrix. Another alternative to comprehensively describe oil gland distributions is to compute persistent homology directly on the point cloud defined by the oil gland centers (Figure 4.4). This way we do not have to resort to deformations induced by ellipsoidal approximations. Once every fruit oil gland distribution is summarized as a persistence diagram, there are many well established pipelines to compute a matrix of pairwise distances. Specifically, we can compute bottleneck distances between diagrams, or transform such diagrams into more mathematically amenable objects like persistence landscapes (Bubenik, 2015), persistence images (Adams et al., 2017), or tent functions (Perea et al., 2022; Tymochko et al., 2019) to name a few examples. In either case, we can then compare 92 citrus phylogenetic distances with oil gland distribution distances. This would ultimately provide us with novel insight into fruit development. There is also future work to be done in terms of citrus breeding. We can link our morphological characterization of citrus oil gland distribution with commercially important traits, such as acidity, sweetness, skin response to mechanical injuries, or amount of oil extracted to name a few examples. 6.1.4 The morphology of domestication: a hard nut to crack As discussed in Chapter 5, we were initially puzzled by the fact that Earliest Himalayan accession initial results. It was the hardest accession to crack open and it is essentially impossible to extract their kernel intact. However, their morphological trait values are close to the population average, which clearly breaks visual trends when relating shape phenotypes with qualitative data (Figure 5.5). Moreover, current literature highlights that individual phenotype variation of walnuts is high within a given population, but not so much between physically distant populations (Mapelli et al., 2018; Roor et al., 2017). At the same time, the Himalayas are reported to be one of the main hotspots for walnut germplasm diversity (Luo et al., 2022; Shah et al., 2021), while the rest of the collection is mainly homozygous (Aradhya et al., 2010). A closer look at trait distributions reveals that this Earliest Himalaya accession can be distinguished from the rest of the collection by relatively lower content of packing tissue, a relatively higher content of air, and distinct convexity indices (Figure 5.6). Preliminary results suggest that the Earliest accession kernel has a distinctive wide, deep cavity at its proximal end. To the best of our understanding, this specific traits, packing tissue volume and convexity indices, have never been measured. The cavity at the proximal end offers new research directions aimed at unraveling its specific development mechanics from both genetical and biophysical points of view. This new cavity also suggests that it is possible to characterize the morphology of physically separate populations. We just need more comprehensive and fine-tuned morphological descriptors. Walnut kernels have tortuous, polylobed morphologies that no set of traditional shape descriptors will every fully capture. An exciting future direction is to analyze these intricate shapes with TDA, specifically with the ECT like in Chapter 3 or a similar technique. This in turn might shed new light 93 into broader morphological changes for perennial crops in general when they are domesticated. In general, capturing and analyzing this nuanced shape information for a wide array of data sets provides a morphology-driven path to further our insight into phenotype-genotype relationships. As stated by D’Arcy Thompson in his seminal biomathematical treatise On Growth and Form (1942), An organism is so complex a thing, and growth so complex a phenomenon, that for growth to be so uniform and constant in all the parts as to keep the whole shape unchanged would indeed be an unlikely and an unusual circumstance. Rates vary, proportions change, and the whole configuration alters accordingly. 94 BIBLIOGRAPHY Adams H, Emerson T, Kirby M, Neville R, Peterson C, Shipman P, Chepushtanova S, Hanson E, Motta F, Ziegelmeier L (2017). Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research 18(8): 1–35. Akca Y, Şen SM (1995). The relationship between dichogamy and yield–nut characteristics in Juglans regia L. In Acta Horticulturae, Volume 442, pp. 215–216. Leuven, Belgium: Interna- tional Society for Horticultural Science (ISHS). Amézquita EJ, Nasrin F, Storey KM, Yoshizawa M (2022). Genomics data analysis via spectral shape and topology. Preprint. Amézquita EJ, Quigley MY, Ophelders T, Landis JB, Koenig D, Munch E, Chitwood DH (2021). Measuring hidden phenotype: quantifying the shape of barley seeds using the Euler characteristic transform. in silico Plants 4(1): diab033. Amézquita EJ, Quigley MY, Ophelders T, Munch E, Chitwood DH (2020). The shape of things to come: Topological data analysis and biology, from molecules to organisms. Developmental Dynamics 249(7): 816–833. Amézquita EJ, Quigley MY, Ophelders T, Seymour D, Munch E, Chitwood DH (2022). The shape of aroma: Measuring and modeling citrus oil gland distribution. Plants, People, Planet 0: 1–14. Andrade-Sanchez P, Gore MA, Heun JT, Thorp KR, Carmo-Silva AE, French AN, Salvucci ME, White JW (2013). Development and evaluation of a field-based high-throughput pheno- typing platform. Functional Plant Biology 41(1): 68–79. Antreich SJ, Xiao N, Huss JC, Gierlinger N (2021). A belt for the cell: cellulosic wall thickenings and their role in morphogenesis of the 3D puzzle cells in walnut shells. Journal of Experimental Botany 72(13): 4744–4756. Antreich SJ, Xiao N, Huss JC, Horbelt N, Eder M, Weinkamer R, Gierlinger N (2019). The puzzle of the walnut shell: A novel cell type with interlocked packing. Advanced Science 6(16): 1900644. Aradhya M, Velasco D, Ibrahimov Z, Toktoraliev B, Maghradze D, Musayev M, Bobokashvili Z, Preece JE (2017). Genetic and ecological insights into glacial refugia of walnut (Juglans regia L.). PLOS ONE 12(10): 1–27. Aradhya M, Woeste K, Velasco D (2010). Genetic diversity structure and differentiation in culti- vated walnut (Juglans Regia L.). In Acta Horticulturae, Number 861, pp. 127–132. International Society for Horticultural Science (ISHS), Leuven, Belgium. 95 Araus JL, Cairns JE (2014). Field high-throughput phenotyping: the new crop breeding frontier. Trends in Plant Science 19(1): 52–61. Atienza N, Escudero LM, Jimenez MJ, Soriano-Trigueros M (2019). Characterising epithelial tissues using persistent entropy. In Marfil R, Calderón M, Díaz del Río F, Real P, Bandera A (Eds.), Computational Topology in Image Context, pp. 179–190. Cham: Springer International Publishing. Atkinson JA, Pound MP, Bennett MJ, Wells DM (2019). Uncovering the hidden half of plants using new advances in root phenotyping. Current Opinion in Biotechnology 55: 1–8. Analytical Biotechnology. Autran D, Bassel GW, Chae E, Ezer D, Ferjani A, Fleck C, Hamant O, Hartmann FP, Jiao Y, Johnston IG et al. (2021). What is quantitative plant biology? Quantitative Plant Biology 2: e10. Baird NA, Etter PD, Atwood TS, Currey MC, Shiver AL, Lewis ZA, Selker EU, Cresko WA, Johnson EA (2008). Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLOS ONE 3(10): 1–7. Bao X, Chen B, Dai P, Li Y, Mao J (2022). Construction and verification of spherical thin shell model for revealing walnut shell crack initiation and expansion mechanism. Agriculture 12(9). Baron JH (2009). Sailors’ scurvy before and after James Lind — a reassessment. Nutrition Reviews 67(6): 315–332. Barry GH, Caruso M, Gmitter FG (2020). Chapter 5 - Commercial scion varieties. In Talon M, Caruso M, Gmitter FG (Eds.), The Genus Citrus, pp. 83–104. Woodhead Publishing. Basak M, Uzun B, Yol E (2019). Genetic diversity and population structure of the Mediterranean sesame core collection with use of genome-wide SNPs developed by double digest RAD-Seq. PLOS ONE 14(10): e0223757. Bauer U (2021). Ripser: efficient computation of Vietoris–Rips persistence barcodes. Journal of Applied and Computational Topology 5(3): 391–423. Beer R, Kaiser F, Schmidt K, Ammann B, Carraro G, Grisa E, Tinner W (2008). Vegetation history of the walnut forests in Kyrgyzstan (Central Asia): natural or anthropogenic origin? Quaternary Science Reviews 27(5): 621–632. Belchi F, Pirashvili M, Conway J, Bennett M, Djukanovic R, Brodzki J (2018). Lung topology characteristics in patients with chronic obstructive pulmonary disease. Scientific Reports 8(5341). Belton RL, Fasy BT, Mertz R, Micka S, Millman DL, Salinas D, Schenfisch A, Schupbach J, Williams L (2020). Reconstructing embedded graphs from persistence diagrams. Computational 96 Geometry 90: 101658. Beltramo G, Skraba P, Andreeva R, Sarkar R, Giarratano Y, Bernabeu MO (2022). Euler characteristic surfaces. Foundations of Data Science 4(4): 505–536. Bendich P, Edelsbrunner H, Kerber M (2010). Computing robustness and persistence for images. IEEE Transactions on Visualization and Computer Graphics 16(6): 1251–1260. Bendich P, Marron JS, Miller E, Pieloch A, Skwerer S (2016). Persistent homology analysis of brain artery trees. Ann. Appl. Stat. 10(1): 198–218. Benjamini Y, Hochberg Y (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodologi- cal) 57(1): 289–300. Bernard A, Hamdy S, Le Corre L, Dirlewanger E, Lheureux F (2020). 3D characterization of walnut morphological traits using X-ray computed tomography. Plant Methods 16(1): 115. Bernard A, Lheureux F, Dirlewanger E (2017). Walnut: past and future of genetic improvement. Tree Genetics & Genomes 14(1): 1. Betthauser LM (2018). Topological reconstruction of grayscale images. Ph. D. thesis, University of Florida, Gainesville, Florida. Blott SJ, Pye K (2008). Particle shape: a review and new methods of characterization and classification. Sedimentology 55(1): 31–63. Boente G, Rodriguez D, Manteiga WG (2014). Goodness-of-fit test for directional data. Scan- dinavian Journal of Statistics 41(1): 259–275. Bonett DG (2006). Confidence interval for a coefficient of quartile variation. Computational Statistics & Data Analysis 50(11): 2953–2957. Bonhomme V, Forster E, Wallace M, Stillman E, Charles M, Jones G (2017). Identification of inter- and intra-species variation in cereal grains through geometric morphometric analysis, and its resilience under experimental charring. Journal of Archaeological Science 86: 60–67. Bookstein FL (1997). Morphometric Tools for Landmark Data: Geometry and Biology. Geometry and Biology. Cambridge: Cambridge University Press. Booth S, Kurtz B, de Heer MI, Mooney SJ, Sturrock CJ (2020). Tracking wireworm bur- rowing behaviour in soil over time using 3D X-ray computed tomography. Pest Management Science 76(8): 2653–2662. Bouby L (2001). L’orge à deux rangs (Hordeum distichum) dans l’agriculture gallo-romaine : 97 données archéobotaniques. ArchéoSciences, revue d’Archéométrie 25: 35–44. Bubenik P (2015). Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research 16(3): 77–102. Bucksch A, Atta-Boateng A, Azihou AF, Battogtokh D, Baumgartner A, Binder BM, Bray- brook SA, Chang C, Coneva V, DeWitt TJ et al. (2017). Morphological plant modeling: Unleashing geometric and topological potential within the plant sciences. Frontiers in Plant Science 8. Burges CJ (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2): 121–167. Cámara PG (2017). Topological methods for genomics: Present and future directions. Current Opinion in Systems Biology 1: 95–101. Future of Systems Biology • Genomics and epigenomics. Cámara PG, Rosenbloom DI, Emmett KJ, Levine AJ, Rabadán R (2016). Topological data analysis generates high-resolution, genome-wide maps of human recombination. Cell Systems 3(1): 83–94. Cang Z, Mu L, Wei GW (2018). Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLOS Computational Biology 14(1): 1–44. Cang Z, Wei GW (2018). Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction. International Journal for Numerical Methods in Biomedical Engineering 34(2): e2914. e2914 cnm.2914. Caruso M, Smith MW, Froelicher Y, Russo G, Gmitter FG (2020). Chapter 7 - Traditional breeding. In Talon M, Caruso M, Gmitter FG (Eds.), The Genus Citrus, pp. 129–148. Woodhead Publishing. Chan JM, Carlsson G, Rabadán R (2013). Topology of viral evolution. Proceedings of the National Academy of Sciences 110(46): 18566–18571. Chazal F, Fasy B, Lecci F, Michel B, Rinaldo A, Rinaldo A, Wasserman L (2017). Robust topological inference: Distance to a measure and kernel distance. J. Mach. Learn. Res. 18(1): 5845–5884. Chitwood D, Sinha N (2016). Evolutionary and environmental forces sculpting leaf development. Current Biology 26(7): R297–R306. Chitwood DH, Eithun M, Munch E, Ophelders T (2019). Topological mapper for 3D volumetric images. In Burgeth B, Kleefeld A, Naegel B, Passat N, Perret B (Eds.), Mathematical Morphology and Its Applications to Signal and Image Processing, pp. 84–95. Cham: Springer International 98 Publishing. Choat B, Nolf M, Lopez R, Peters JMR, Carins-Murphy MR, Creek D, Brodribb TJ (2018). Non-invasive imaging shows no evidence of embolism repair after drought in tree species of two genera. Tree Physiology 39(1): 113–121. Chudhary Z, Khera RA, Hanif MA, Ayub MA, Hamrouni L (2020). Chapter 49 – Walnut. In Hanif MA, Nawaz H, Khan MM, Byrne HJ (Eds.), Medicinal Plants of South Asia, pp. 671–684. Elsevier. Chung Y, Hu C, Lawson A, Smyth C (2018). Topological approaches to skin disease image analysis. In 2018 IEEE International Conference on Big Data (Big Data), pp. 100–105. Seattle, WA: IEEE. Clayton CRI, Abbireddy COR, Schiebel R (2009). A method of estimating the form of coarse particulates. Géotechnique 59(6): 493–501. Cohen-Steiner D, Edelsbrunner H, Harer J (2007). Stability of persistence diagrams. Discrete & Computational Geometry 37(1): 103–120. Conover WJ (1998). Practical Nonparametric Statistics (3rd ed.). Wiley Series in Probability and Statistics. New York: Wiley. Corey AT (1949). Influence of shape on fall velocity of sandgrains. Master’s thesis, Colorado Agricultural and Mechanical College, Fort Collins, CO. Crawford L, Monod A, Chen AX, Mukherjee S, Rabadán R (2020). Predicting clinical outcomes in glioblastoma: An application of topological and functional data analysis. Journal of the American Statistical Association 115(531): 1139–1150. Curry J, Mukherjee S, Turner K (2022). How many directions determine a shape and other sufficiency results for two topological transforms. Transactions of the American Mathematical Society Series B 9: 1006–1043. Davey JW, Blaxter ML (2011). RADSeq: next-generation population genetics. Briefings in Functional Genomics 9(5-6): 416–423. Dawson IK, Russell J, Powell W, Steffenson B, Thomas WTB, Waugh R (2015). Barley: a translational model for adaptation to climate change. New Phytologist 206(3): 913–931. Deng X, Yang X, Yamamoto M, Biswas MK (2020). Chapter 3 - Domestication and history. In Talon M, Caruso M, Gmitter FG (Eds.), The Genus Citrus, pp. 33–55. Woodhead Publishing. Di Marzio M, Fensore S, Panzera A, Taylor CC (2019). Kernel density classification for spherical data. Statistics & Probability Letters 144: 23–29. 99 Diaz-Gárcia L, Covarrubias-Pazaran G, Schlautman B, Grygleski E, Zalapa J (2018). Image- based phenotyping for identification of QTL determining fruit shape and size in american cranberry (Vaccinium macrocarpon L.). PeerJ 6(e5461): e5461. Diaz-Toca GM, Marin L, Necula I (2020). Direct transformation from Cartesian into geodetic coordinates on a triaxial ellipsoid. Computers & Geosciences 142: 104551. Ding YM, Cao Y, Zhang WP, Chen J, Liu J, Li P, Renner SS, Zhang DY, Bai WN (2022). Population-genomic analyses reveal bottlenecks and asymmetric introgression from Persian into iron walnut during domestication. Genome Biology 23(1): 145. Dryden IL, Mardia KV (2016). Statistical Shape Analysis with Applications in R (2 ed.). Chichester, West Sussex, England: John Wiley & Sons Ltd. Du F, Tan T (2021). Recent studies in mechanical properties of selected hard-shelled seeds: A review. JOM 73(6): 1723–1735. Emmett K, Rosenbloom D, Cámara P, Rabadán R (2014). Parametric inference using persis- tence diagrams: A case study in population genetics. In Proceedings of the 31st International Conference on Machine Learning, Volume 32. Beijing, China: W&CP. Emmett KJ, Rabadán R (2014). Characterizing scales of genetic recombination and antibiotic resistance in pathogenic bacteria using topological data analysis. In Ślȩzak D, Tan AH, Peters JF, Schwabe L (Eds.), Brain Informatics and Health, pp. 540–551. Cham: Springer International Publishing. Falcao AX, Stolfi J, de Alencar Lotufo R (2004). The image foresting transform: theory, algo- rithms, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(1): 19–29. Fallah M, Vahdati K, Hasani D, Rasouli M, Sarikhani S (2022). Breeding of Persian walnut: Aiming to introduce late-leafing and early-harvesting varieties by targeted hybridization. Scientia Horticulturae 295: 110885. FAO (2021). Citrus fruit. Fresh and processed. Statistical bulletin 2020. Technical report, Food and Agriculture Organization of the United Nations, Rome, Italy. FAS (2022). Tree nuts: World markets and trade. Technical report, Foreign Agricultural Service. US Department of Agriculture, Washington, DC. Fasy BT, Lecci F, Rinaldo A, Wasserman L, Balakrishnan S, Singh A (2014). Confidence sets for persistence diagrams. Ann. Statist. 42(6): 2301–2339. Fasy BT, Micka S, Millman DL, Schenfisch A, Williams L (2019). The first algorithm for reconstructing simplicial complexes of arbitrary dimension from persistence diagrams. Preprint. 100 Fiorani F, Schurr U (2013). Future scenarios for plant phenotyping. Annual Review of Plant Biology 64(1): 267–291. Frank MH, Chitwood DH (2016). Plant chimeras: The good, the bad, and the ’Bizzaria’. Developmental Biology 419(1): 41–53. Plant Development. Friedman M (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200): 675–701. Fuller DQ, Weisskopf A (2014). Barley: Origins and development. In Smith C, Smith C (Eds.), Encyclopedia of Global Archaeology, pp. 763–766. New York, NY: Springer New York. Gao Z (2022). Three-dimensional virtual flower bud of walnut tree based on micro-computed tomography. Agronomy Journal 114(4): 1935–1943. García-Portugués E, Navarro-Esteban P, Cuesta-Albertos JA (2023). On a projection-based class of uniformity tests on the hypersphere. Bernoulli 29(1): 181–204. García-Portugués E, Paindaveine D, Verdebout T (2020). On optimal tests for rotational symmetry against new classes of hyperspherical distributions. Journal of the American Statistical Association 115(532): 1873–1887. García-Portugués E, Paindaveine D, Verdebout T (2021). rotasym: Tests for rotational sym- metry on the hypersphere. R package version 1.1.3. García-Portugués E, Verdebout T (2021). sphunif : Uniformity tests on the circle, sphere, and hypersphere. R package version 1.0.1. Gauthey A, Peters JMR, Carins-Murphy MR, Rodriguez-Dominguez CM, Li X, Delzon S, King A, López R, Medlyn BE, Tissue DT et al. (2020). Visual and hydraulic techniques produce similar estimates of cavitation resistance in woody species. New Phytologist 228(3): 884–897. Ghrist R, Levanger R, Mai H (2018). Persistent homology and Euler integral transforms. Journal of Applied and Computational Topology 2(1): 55–60. Giusti C, Pastalkova E, Curto C, Itskov V (2015). Clique topology reveals intrinsic geometric structure in neural correlations. Proceedings of the National Academy of Sciences 112(44): 13455–13460. Gmitter FG, Wu GA, Rokhsar DS, Talon M (2020). Chapter 1 - The citrus genome. In Talon M, Caruso M, Gmitter FG (Eds.), The Genus Citrus, pp. 1–8. Woodhead Publishing. Gower JC (1975). Generalized procrustes analysis. Psychometrika 40: 33–51. 101 Griffiths M, Mellor N, Sturrock CJ, Atkinson BS, Johnson J, Mairhofer S, York LM, Atkinson JA, Soltaninejad M, Foulkes JF et al. (2022). X-ray CT reveals 4D root system development and lateral root responses to nitrate in soil. The Plant Phenome Journal 5(1): e20036. Gülsoy E, Kuş E, Altıkat S (2019). Determination of physico-mechanical properties of some domestic and foreign walnut (Juglans regia L.) varieties. Acta Scientiarum Polonorum Hortorum Cultus 18(6): 67–74. Harlan HV, Martini ML (1929). A composite hybrid mixture. Agronomy Journal 21(4): 487–490. Harlan HV, Martini ML (1936). Problems and results in barley breeding. Washington, DC: US Department of Agriculture. Harlan HV, Martini ML (1940). A study of methods in barley breeding. Technical Report 720, US Department of Agriculture, Washington, DC. Harris JW, Stöcker H (1998). The Handbook of Mathematics and Computational Science (1st ed.). Berlin, Heidelberg: Springer-Verlag. Hawkins R (1986). The Observations of Sir Richard Hawkins, Knight, in his Voyage into the South Sea, Annodomini, 1593. Nutrition Reviews 44(11): 370–371. Facsimile reproduction of excerpts. Heiss T, Wagner H (2017). Streaming Algorithm for Euler Characteristic Curves of Multidimen- sional Images. In Felsberg M, Heyden A, Krüger N (Eds.), Computer Analysis of Images and Patterns, pp. 397–409. Cham: Springer International Publishing. Hilal N, Mohammed Ali TK, Tayeh BA (2020). Properties of environmental concrete that contains crushed walnut shell as partial replacement for aggregates. Arabian Journal of Geo- sciences 13(16): 812. Hockett E, Nilan R (1985). Genetics. In Barley, Chapter 8: pp. 187–230. John Wiley & Sons, Ltd. Humphreys DP, McGuirl MR, Miyagi M, Blumberg AJ (2019). Fast estimation of recombination rates using topological data analysis. Genetics 211(4): 1191–1204. Huss JC, Antreich SJ, Bachmayr J, Xiao N, Eder M, Konnerth J, Gierlinger N (2020). Topological interlocking and geometric stiffening as complementary strategies for strong plant shells. Advanced Materials 32(48): 2004519. Ibáñez-Marcelo E, Campioni L, Phinyomark A, Petri G, Santarcangelo EL (2019). Topology highlights mesoscopic functional equivalence between imagery and perception: The case of hypnotizability. NeuroImage 200: 437–449. 102 IPGRI (1994). Descriptors for Walnut (Juglans Spp.). Rome, Italy: International Plant Genetic Resources Institute. Isaac E (1959). Influence of religion on the spread of citrus. Science 129(3343): 179–186. Iwata H, Nesumi H, Ninomiya S, Takano Y, Ukai Y (2002). The Evaluation of Genotype × Environment Interactions of Citrus Leaf Morphology Using Image Analysis and Elliptic Fourier Descriptors. Breeding Science 52(4): 243–251. Janke NC (1966). Effect of shape upon the settling vellocity of regular convex geometric particles. Journal of Sedimentary Research 36(2): 370–376. Jeitziner R, Carrière M, Rougemont J, Oudot S, Hess K, Brisken C (2019). Two-Tier Mapper, an unbiased topology-based clustering method for enhanced global gene expression analysis. Bioinformatics 35(18): 3339–3347. Jiang Q, Kurtek S, Needham T (2020). The weighted Euler curve transform for shape and image analysis. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3685–3694. Kanari L, Dłotko P, Scolamiero M, Levi R, Shillcock J, Hess K, Markram H (2018). A topological representation of branching neuronal morphologies. Neuroinformatics 16(1): 3–13. Kelc D, Štampar F, Solar A (2007). Fruiting behaviour of walnut trees influences the relationship between the morphometric traits of the parent wood and nut weight. The Journal of Horticultural Science and Biotechnology 82(3): 439–445. Kendall DG (1984). Shape Manifolds, Procrustean Metrics, and Complex Projective Spaces. Bulletin of the London Mathematical Society 16(2): 81–121. Khadivi-Khub A, Ebrahimi A, Sheibani F, Esmaeili A (2015). Phenological and pomological characterization of Persian walnut to select promising trees. Euphytica 205(2): 557–567. Knight TG, Klieber A, Sedgley M (2001). The Relationship Between Oil Gland and Fruit Development in Washington Navel Orange (Citrus sinensis L. Osbeck). Annals of Botany 88(6): 1039–1047. Komatsuda T, Pourkheirandish M, He C, Azhaguvel P, Kanamori H, Perovic D, Stein N, Graner A, Wicker T, Tagiri A et al. (2007). Six-rowed barley originated from a mutation in a homeodomain-leucine zipper I-class homeobox gene. Proceedings of the National Academy of Sciences 104(4): 1424–1429. Köppen M (2000). The curse of dimensionality. In 5th Online World Conference on Soft Computing in Industrial Applications, Volume 1, pp. 4–8. WSC5. 103 Kouhi M, Rezaei A, Hassani D, Sarikhani S, Vahdati K (2020). Phenotypic Evaluation and Identification of Superior Persian Walnut (Juglans regia L.) Genotypes in Mazandaran Province, Iran. Journal of Nuts 11(4): 315–326. Kovacev-Nikolic V, Bubenik P, Nikolić D, Heo G (2016). Using persistent homology and dy- namical distances to analyze protein binding. Statistical Applications in Genetics and Molecular Biology 15(1): 19–38. Kovalevsky V (1989). Finite topology as applied to image analysis. Computer Vision, Graphics, and Image Processing 46(2): 141–161. Koyuncu MA, Ekinci K, Savran E (2004). Cracking characteristics of walnut. Biosystems Engineering 87(3): 305–311. Krumbein WC (1941). Measurement and geological significance of shape and roundness of sedimentary particles. Journal of Sedimentary Research 11(2): 64–72. Kruskal WH, Wallis WA (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association 47(260): 583–621. Kuhl FP, Giardina CR (1982). Elliptic Fourier features of a closed contour. Computer Graphics and Image Processing 18(3): 236 – 258. Lala SD, Deoghare AB, Chatterjee S (2018). Mechanical and morphological characterization of walnut shell reinforced epoxy composite. IOP Conference Series: Materials Science and Engineering 377(1): 012011. Langgut D (2017). The citrus route revealed: From Southeast Asia into the Mediterranean. HortScience 52(6): 814–822. Lawson P, Schupbach J, Fasy BT, Sheppard JW (2019). Persistent homology for the automatic classification of prostate cancer aggressiveness in histopathology images. In Tomaszewski JE, Ward AD (Eds.), Medical Imaging 2019: Digital Pathology, Volume 10956, pp. 109560G. SPIE. Lawson P, Sholl AB, Brown JQ, Fasy BT, Wenk C (2019). Persistent homology for the quantita- tive evaluation of architectural features in prostate cancer histology. Scientific Reports 9(1139). Lee H, Kang H, Chung MK, Kim B, Lee DS (2012). Persistent brain network homology from the perspective of dendrogram. IEEE Transactions on Medical Imaging 31(12): 2267–2277. Lesnick M, Wright M (2015). Interactive visualization of 2-D persistence modules. Preprint. Lesnick M, Wright M (2022). Computing minimal presentations and bigraded Betti numbers of 2-parameter persistent homology. SIAM Journal on Applied Algebra and Geometry 6(2): 267–298. 104 Lestrel PE (Ed.) (1997). Fourier Descriptors and their Applications in Biology. Cambridge: Cambridge University Press. Ley C, Verdebout T (2017). Modern Directional Statistics (1st ed.). Interdisciplinary Statistics. Boca Raton: Chapman and Hall/CRC. Li M, An H, Angelovici R, Bagaza C, Batushansky A, Clark L, Coneva V, Donoghue MJ, Edwards E, Fajardo D et al. (2018). Topological data analysis as a morphometric method: Using persistent homology to demarcate a leaf morphospace. Frontiers in Plant Science 9: 553. Li M, Duncan K, Topp CN, Chitwood DH (2017). Persistent homology and the branching topologies of plants. American Journal of Botany 104(3): 349–353. Li M, Frank MH, Coneva V, Mio W, Chitwood DH, Topp CN (2018). The persistent homol- ogy mathematical framework provides enhanced genotype-to-phenotype associations for plant morphology. Plant Physiology 177(4): 1382–1395. Li M, Klein LL, Duncan KE, Jiang N, Chitwood DH, Londo JP, Miller AJ, Topp CN (2019). Characterizing 3D inflorescence architecture in grapevine using X-ray imaging and ad- vanced morphometrics: implications for understanding cluster density. Journal of Experimental Botany 70(21): 6261–6276. Li M, Shao MR, Zeng D, Ju T, Kellogg EA, Topp CN (2020). Comprehensive 3D phenotyping reveals continuous morphological variation across genetically diverse sorghum inflorescences. New Phytologist 226(6): 1873–1885. Li Q, Griffiths J (2004). Least squares ellipsoid specific fitting. In Geometric Modeling and Processing, 2004. Proceedings, pp. 335–340. IEEE. Liller CB, Neuhaus R, von Korff M, Koornneef M, van Esse W (2015). Mutations in barley row type genes have pleiotropic effects on shoot branching. PLOS ONE 10(10): 1–20. Lopez JM (2004). Walnut tissue culture: research and field applications. In Michler CH, Pijut PM, Van Sambeek JW, Coggeshall MV, Seifert J, Woeste K, Overton R, Ponder FJ (Eds.), Black walnut in a new century, 6th Walnut Council Research Symposium, pp. 146–152. St. Paul, MN: USDA, North Central Research Station. Lum PY, Singh G, Lehman A, Ishkanov T, Vejdemo-Johansson M, Alagappan M, Carlsson J, Carlsson G (2013). Extracting insights from the shape of complex data using topology. Scientific Reports 3(1236). Luo X, Zhou H, Cao D, Yan F, Chen P, Wang J, Woeste K, Chen X, Fei Z, An H et al. (2022). Domestication and selection footprints in Persian walnuts (Juglans regia). PLOS Genetics 18(12): 1–19. 105 Luro F, Curk F, Froelicher Y, Ollitrault P (2017). Recent insights on Citrus diversity and phylogeny. In Zech-Matterne V, Fiorentino G (Eds.), AGRUMED: Archaeology and history of citrus fruit in the Mediterranean: Acclimatization, diversification, uses. Naples: Publications du Centre Jean Bérard. Magiorkinis E, Beloukas A, Diamantis A (2011). Scurvy: Past, present and future. European Journal of Internal Medicine 22(2): 147–152. Mahato N, Sharma K, Koteswararao R, Sinha M, Baral E, Cho MH (2019). Citrus essential oils: Extraction, authentication and application in food preservation. Critical Reviews in Food Science and Nutrition 59(4): 611–625. Mahomoodally MF, Mooroteea K (2021). A comparative ethno-religious study of traditionally used medicinal plants employed in the management of cardiovascular diseases. Journal of Herbal Medicine 25: 100417. Mander L, Dekker SC, Li M, Mio W, Punyasena SW, Lenton TM (2017). A morphometric analysis of vegetation patterns in dryland ecosystems. Royal Society Open Science 4(2): 160443. Mapelli S, Pollegioni P, Woeste KE, Chiocchini F, Lungo SD, Olimpieri I, Tortolano V, Clark J, Hemery GE, Malvolti ME (2018). Spatial genetic structure of common walnut (Juglans regia L.) in central Asia. In Acta Horticulturae, Number 1190, pp. 27–34. International Society for Horticultural Science (ISHS), Leuven, Belgium. Mardia KV, Jupp PE (1999). Directional Statistics (1st ed.). Probability and Statistics. Chichester, West Sussex: John Wiley & Sons, Ltd. Mascher M, Gundlach H, Himmelbach A, Beier S, Twardziok SO, Wicker T, Radchuk V, Dockter C, Hedley PE, Russell J et al. (2017). A chromosome conformation capture ordered sequence of the barley genome. Nature 544(7651): 427–433. Mascher M, Schuenemann VJ, Davidovich U, Marom N, Himmelbach A, Hübner S, Korol A, David M, Reiter E, Riehl S et al. (2016). Genomic analysis of 6,000-year-old cultivated grain illuminates the domestication history of barley. Nature Genetics 48(9): 1089–1093. McAllister CA, McKain MR, Li M, Bookout B, Kellogg EA (2019). Specimen-based analysis of morphology and the environment in ecologically dominant grasses: the power of the herbarium. Philosophical Transactions of the Royal Society B: Biological Sciences 374(1763): 20170403. McInnes L, Healy J, Melville J (2020). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. Preprint. Mei L, Sheng O, Peng S, Zhou G, Wei Q, Li Q (2011). Growth, root morphology and boron uptake by citrus rootstock seedlings differing in boron-deficiency responses. Scientia Horticul- turae 129(3): 426–432. 106 Micka SA (2020). Searching and Reconstruction: Algorithms with Topological Descriptors. Ph. D. thesis, Montana State University, Bozeman, Montana. Migicovsky Z, Harris ZN, Klein LL, Li M, McDermaid A, Chitwood DH, Fennell A, Kovacs LG, Kwasniewski M, Londo JP et al. (2019). Rootstock effects on scion phenotypes in a ‘Chambourcin’ experimental vineyard. Horticulture Research 6(64). Migicovsky Z, Li M, Chitwood DH, Myles S (2018). Morphometrics reveals complex and heritable apple leaf shapes. Frontiers in Plant Science 8: 2185. Mileyko Y, Mukherjee S, Harer J (2011). Probability measures on the space of persistence diagrams. Inverse Problems 27(12): 124007. Motuzaite Matuzeviciute G, Abdykanova A, Kume S, Nishiaki Y, Tabaldiev K (2018). The effect of geographical margins on cereal grain size variation: Case study for highlands of Kyrgyzstan. Journal of Archaeological Science: Reports 20: 400–410. Munch E (2017). A User’s Guide to Topological Data Analysis. Journal of Learning Analytics 4: 47–61. Munch E, Turner K, Bendich P, Mukherjee S, Mattingly J, Harer J (2015). Probabilistic Fréchet means for time varying persistence diagrams. Electron. J. Statist. 9(1): 1173–1204. Myers A, Munch E, Khasawneh FA (2019). Persistent homology of complex networks for dynamic state detection. Phys. Rev. E 100: 022314. NASS (2021). Citrus fruits. 2021 Summary. Technical report, National Agricultural Statistics Service. United States Department of Agriculture., Washington, DC. NASS (2022). 2022 California Walnut. Objective Measurement Report. Technical report, National Agricultural Statistics Service. US Department of Agriculture, Sacramento, CA. Nati P (1674). Florentina phytologica observatio de malo Limonia citrata-aurantia, Florentiae vulgo la bizzaria. Florentia, Italy: Hippolyti de Naue. Nicolau M, Levine AJ, Carlsson G (2011). Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proceedings of the National Academy of Sciences 108(17): 7265–7270. Niklas KJ (2004). Plant allometry: is there a grand unifying theory? Biological Reviews 79(4): 871–889. Niklas KJ, Hammond ST (2019). Biophysical effects on the scaling of plant growth, form, and ecology. Integrative and Comparative Biology 59(5): 1312–1323. 107 Ollitrault P, Curk F, Krueger R (2020). Chapter 4 - Citrus taxonomy. In Talon M, Caruso M, Gmitter FG (Eds.), The Genus Citrus, pp. 57–81. Woodhead Publishing. Otter N, Porter MA, Tillmann U, Grindrod P, Harrington HA (2017). A roadmap for the computation of persistent homology. EPJ Data Science 6(17). Pal D, Malik SK, Kumar S, Choudhary R, Sharma KC, Chaudhury R (2013). Genetic variability and relationship studies of mandarin (Citrus reticulata Blanco) using morphological and molecular markers. Agricultural Research 2(3): 236–245. Palande S, Kaste JAM, Roberts MD, Abá KS, Claucherty C, Dacon J, Doko R, Jayakody TB, Jeffery HR, Kelly N et al. (2022). The topological shape of gene expression across the evolution of flowering plants. bioRxiv 0. Panou G, Korakitis R, Pantazis G (2020). Fitting a triaxial ellipsoid to a geoid model. Journal of Geodetic Science 10(1): 69–82. Perea JA (2019). Topological time series analysis. Notices of the American Mathematical Society 66(5): 686–694. Perea JA, Munch E, Khasawneh FA (2022). Approximating continuous functions on persistence diagrams using template functions. Foundations of Computational Mathematics. Petit J, Salentijn EMJ, Paulo MJ, Denneboom C, van Loo EN, Trindade LM (2020). Elucidat- ing the genetic architecture of fiber quality in hemp (Cannabis sativa L.) using a genome-wide association study. Frontiers in Genetics 11: 1101. Petri G, Expert P, Turkheimer F, Carhart-Harris R, Nutt D, Hellyer PJ, Vaccarino F (2014). Homological scaffolds of brain functional networks. Journal of The Royal Society Interface 11(101): 20140873. Pewsey A, García-Portugués E (2021). Recent advances in directional statistics. TEST 30(1): 1–58. Pinney K, Polito VS (1983). English walnut fruit growth and development. Scientia Horticul- turae 21(1): 19–28. Pollegioni P, Woeste KE, Chiocchini F, Del Lungo S, Olimpieri I, Tortolano V, Clark J, Hemery GE, Mapelli S, Malvolti ME (2015). Ancient humans influenced the current spatial genetic structure of common walnut populations in Asia. PLOS ONE 10(9): 1–16. Pollegioni P, Woeste KE, Chiocchini F, Olimpieri I, Tortolano V, Clark J, Hemery GE, Mapelli S, Malvolti ME (2014). Landscape genetics of Persian walnut (Juglans regia L.) across its Asian range. Tree Genetics & Genomes 10(4): 1027–1043. 108 Popa RG, Bălăcescu A, Popescu LG (2023). Organic Walnut Cultivation in Intensive and Super-Intensive System—Sustainable Investment. Case Study: Gorj County, Romania. Sustain- ability 15(2). Qaiser T, Tsang YW, Taniyama D, Sakamoto N, Nakane K, Epstein D, Rajpoot N (2019). Fast and accurate tumor segmentation of histology images using persistent homology and deep convolutional features. Medical Image Analysis 55: 1 – 14. Quade D (1979). Using weighted rankings in the analysis of complete blocks with additive block effects. Journal of the American Statistical Association 74(367): 680–683. Rahaman MM, Chen D, Gillani Z, Klukas C, Chen M (2015). Advanced phenotyping and phenotype data analysis for the study of plant growth and development. Frontiers in Plant Science 6. Reani Y, Bobrowski O (2022). Cycle registration in persistent homology with applications in topological bootstrap. IEEE Transactions on Pattern Analysis and Machine Intelligence: 1–15. Rezaee R, Vahdati K, Grigoorian V, Valizadeh M (2008). Walnut grafting success and bleeding rate as affected by different grafting methods and seedling vigour. The Journal of Horticultural Science and Biotechnology 83(1): 94–99. Rezaei Z, Khadivi A, ValizadehKaji B, Abbasifar A (2018). The selection of superior walnut (Juglans regia L.) genotypes as revealed by morphological characterization. Euphytica 214(4): 69. Richardson E, Werman M (2014). Efficient classification using the Euler characteristic. Pattern Recognition Letters 49: 99–106. Rizvi AH, Cámara PG, Kandror EK, Roberts TJ, Schieren I, Maniatis T, Rabadán R (2017). Single-cell topological RNA-seq analysis reveals insights into cellular differentiation and development. Nature Biotechnology 35(551). Robinson A, Turner K (2017). Hypothesis testing for topological data analysis. J Appl. and Comput. Topology 1: 241–261. Roor W, Konrad H, Mamadjanov D, Geburek T (2017). Population Differentiation in Common Walnut (Juglans regia L.) across Major Parts of Its Native Range—Insights from Molecular and Morphometric Data. Journal of Heredity 108(4): 391–404. Ros J, Evin A, Bouby L, Ruas MP (2014). Geometric morphometric analysis of grain shape and the identification of two-rowed barley (Hordeum vulgare subsp. distichum L.) in southern France. Journal of Archaeological Science 41: 568–575. Russell J, Mascher M, Dawson IK, Kyriakidis S, Calixto C, Freund F, Bayer M, Milne I, 109 Marshall-Griffiths T, Heinen S et al. (2016). Exome sequencing of geographically diverse barley landraces and wild relatives gives insights into environmental adaptation. Nature Genetics 48(9): 1024–1030. Sato K (2020). History and future perspectives of barley genomics. DNA Research 27(4). Schölkopf B, Smola A, Müller KR (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10(5): 1299–1319. Shah RA, Bakshi P, Sharma N, Jasrotia A, Itoo H, Gupta R, Singh A (2021). Diversity assessment and selection of superior Persian walnut (Juglans regia L.) trees of seedling origin from North-Western Himalayan region. Resources, Environment and Sustainability 3: 100015. Sharma PC, Bhatia V, Bansal N, Sharma A (2007). A review on Bael tree. Natural Product Radiance 6(2): 171–178. Shrestha KB, Dangol DR (2019). Crops with medicinal, religious and cultural values. In Joshi BK, Shrestha R (Eds.), Working Groups of Agricultural Plant Genetic Resources (APGRs) in Nepal, pp. 198–204. Kathmandu, Nepal: NAGRC, Khumaltar. Proceedings of National Workshop, 21-22 June 2018. Sideli GM, Marrano A, Montanari S, Leslie CA, Allen BJ, Cheng H, Brown PJ, Neale DB (2020). Quantitative phenotyping of shell suture strength in walnut (Juglans regia L.) enhances precision for detection of QTL and genome-wide association mapping. PLOS ONE 15(4): 1–21. Singh G, Mémoli F, Carlsson G (2007). Topological Methods for the Analysis of High Dimen- sional Data Sets and 3D Object Recognition. In Botsch M, Pajarola R, Chen B, Zwicker M (Eds.), Eurographics Symposium on Point-Based Graphics. The Eurographics Association. Singh N, Couture HD, Marron JS, Perou C, Niethammer M (2014). Topological descriptors of histology images. In Wu G, Zhang D, Zhou L (Eds.), Machine Learning in Medical Imaging, pp. 231–239. Cham: Springer International Publishing. Sizemore AE, Giusti C, Kahn A, Vettel JM, Betzel RF, Bassett DS (2018). Cliques and cavities in the human connectome. Comput Neurosci 44: 115–145. Smith A, Zavala VM (2021). The Euler characteristic: A general topological descriptor for complex data. Computers & Chemical Engineering 154: 107463. Smith NM, Ebrahimi H, Ghosh R, Dickerson AK (2018). High-speed microjets issue from burst- ing oil gland reservoirs of citrus fruit. Proceedings of the National Academy of Sciences 115(26): E5887–E5895. Smith RJ (2009). Use and misuse of the reduced major axis for line-fitting. American Journal of Physical Anthropology 140(3): 476–486. 110 Sneed ED, Folk RL (1958). Pebbles in the Lower Colorado River, Texas a Study in Particle Morphogenesis. The Journal of Geology 66(2): 114–150. Solar A, Ivančič A, Štampar F (2003). Morphometric characteristics of fruit bearing shoots in persian walnut (Juglans regia l.) — potential selection criteria for breeding. European Journal of Horticultural Science 2. Spreen TH, Gao Z, Fernandes W, Zansler ML (2020). Chapter 23 - Global economics and marketing of citrus products. In Talon M, Caruso M, Gmitter FG (Eds.), The Genus Citrus, pp. 471–493. Woodhead Publishing. Stolz BJ, Kaeppler J, Markelc B, Braun F, Lipsmeier F, Muschel RJ, Byrne HM, Harrington HA (2022). Multiscale topology characterizes dynamic tumor vascular networks. Science Advances 8(23): eabm2456. Talon M, Wu GA, Gmitter FG, Rokhsar DS (2020). Chapter 2 - The origin of citrus. In Talon M, Caruso M, Gmitter FG (Eds.), The Genus Citrus, pp. 9–31. Woodhead Publishing. Tanabata T, Shibaya T, Hori K, Ebana K, Yano M (2012). SmartGrain: High-throughput phenotyping software for measuring seed shape through image analysis. Plant Physiology 160(4): 1871–1880. Tang WS, da Silva GM, Kirveslahti H, Skeens E, Feng B, Sudijono T, Yang KK, Mukherjee S, Rubenstein B, Crawford L (2022). A topological data analytic approach for discovering biophysical signatures in protein dynamics. PLOS Computational Biology 18(5): 1–42. Tanno Ki, Willcox G (2012). Distinguishing wild and domestic wheat and barley spikelets from early Holocene sites in the Near East. Vegetation History and Archaeobotany 21(2): 107–115. Thapa R, Thapa P, Ahamad K, Vahdati K (2021). Effect of grafting methods and dates on the graft take rate of Persian walnut in open field condition. International Journal of Horticultural Science and Technology 8(2): 133–147. Thompson DW (1942). On Growth and Form (2nd ed.). Cambridge, UK: Cambridge University Press. Todd MJ, Yıldırım EA (2007). On Khachiyan’s algorithm for the computation of minimum- volume enclosing ellipsoids. Discrete Applied Mathematics 155(13): 1731–1744. Topaz CM, Ziegelmeier L, Halverson T (2015). Topological data analysis of biological aggre- gation models. PLOS ONE 10(5): e0126383. Tseng CC (1999). An Allegory on Allegory: Reading “Ju song” as Qu Yuan’s Ars Poetica. Dong Hwa Journal of Humanistic Studies 1: 69–101. 111 Turner K, Mileyko Y, Mukherjee S, Harer J (2014). Fréchet means for distributions of persistence diagrams. Discrete & Computational Geometry 52: 44–70. Turner K, Mukherjee S, Boyer DM (2014). Persistent homology transform for modeling shapes and surfaces. Information and Inference 3(4): 310–344. Tymochko S, Munch E, Dunion J, Corbosiero K, Torn R (2020). Using persistent homology to quantify a diurnal cycle in hurricanes. Pattern Recognition Letters 133: 137–143. Tymochko S, Munch E, Khasawneh FA (2019). Adaptive partitioning for template functions on persistence diagrams. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pp. 1227–1234. Vahdati K (2014). Traditions and folks for walnut growing around the Silk Road. In Acta Hor- ticulturae, Volume 1032, pp. 19–24. Leuven, Belgium: International Society for Horticultural Science (ISHS). Verma MK (2014). Walnut production technology. In Training manual on teaching of post- graduate courses in horticulture (Fruit Science), pp. 281–287. New Delhi, India: Indian Agri- cultural Research Institute. Verma RS, Padalia RC, Chauhan A, Thul ST (2013). Phytochemical analysis of the leaf volatile oil of walnut tree (Juglans regia L.) from western Himalaya. Industrial Crops and Products 42: 195–201. Vlahos L, Isliker H, Kominis Y, Hizanidis K (2008). Normal and anomalous diffusion: A tutorial. Preprint. Voo SS, Grimes HD, Lange BM (2012). Assessing the biosynthetic capabilities of secretory glands in Citrus peel. Plant physiology 159(1): 81–94. Voulgaridis V, Vassiliou VG (2005). The walnut wood and its utilisation to high value products. In Acta Horticulturae, Volume 705, pp. 69–81. Leuven, Belgium: International Society for Horticultural Science (ISHS). Vovin A (Ed.) (2016). Man’yōshū (Book 18): A New English Translation Containing the Original Text, Kana Transliteration, Romanization, Glossing and Commentary. Leiden, The Netherlands: Brill. Vuollo V, Holmström L (2018). A scale space approach for exploring structure in spherical data. Computational Statistics & Data Analysis 125: 57–69. Wadell H (1932). Volume, shape, and roundness of rock particles. The Journal of Geology 40(5): 443–451. 112 Wagner H, Chen C, Vuçini E (2012). Efficient computation of persistent homology for cubical data. In Peikert R, Hauser H, Carr H, Fuchs R (Eds.), Topological Methods in Data Analysis and Visualization II: Theory, Algorithms, and Applications, pp. 91–106. Berlin, Heidelberg: Springer Berlin Heidelberg. Wahid M, Gawli Y, Puthusseri D, Kumar A, Shelke MV, Ogale S (2017). Nutty Carbon: Morphology Replicating Hard Carbon from Walnut Shell for Na Ion Battery Anode. ACS Omega 2(7): 3601–3609. Wallace M, Bonhomme V, Russell J, Stillman E, George TS, Ramsay L, Wishart J, Timpany S, Bull H, Booth A et al. (2019). Searching for the origins of Bere barley: a geometric morphometric approach to cereal landrace recognition in archaeology. Journal of Archaeological Method and Theory 26(3): 1125–1142. Wang B, Sudijono T, Kirveslahti H, Gao T, Boyer DM, Mukherjee S, Crawford L (2021). A statistical pipeline for identifying physical features that differentiate classes of 3D shapes. The Annals of Applied Statistics 15(2): 638–661. Wang F, Wagner H, Chen C (2022). GPU Computation of the Euler Characteristic Curve for Imaging Data. In Goaoc X, Kerber M (Eds.), 38th International Symposium on Computational Geometry (SoCG 2022), Volume 224, pp. 64:1–64:16. Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik. Wang J, Wang Z, Du X, Yang H, Han F, Han Y, Yuan F, Zhang L, Peng S, Guo E (2017). A high-density genetic map and QTL analysis of agronomic traits in foxtail millet [Setaria italica (L.) P. Beauv.] using RAD-seq. PLOS ONE 12(6): 1–15. Wang J, Ye H, Zhou H, Chen P, Liu H, Xi R, Wang G, Hou N, Zhao P (2022). Genome-wide association analysis of 101 accessions dissects the genetic basis of shell thickness for genetic improvement in Persian walnut (Juglans regia L.). BMC Plant Biology 22(1): 436. Wang L, Conteh B, Fang L, Xia Q, Nian H (2020). QTL mapping for soybean (Glycine max L.) leaf chlorophyll-content traits in a genotyped RIL population by using RAD-seq based high-density linkage map. BMC Genomics 21(1): 739. West GB, Brown JH, Enquist BJ (1999). A general model for the structure and allometry of plant vascular systems. Nature 400(6745): 664–667. Wu GA, Sugimoto C, Kinjo H, Azama C, Mitsube F, Talon M, Gmitter FG, Rokhsar DS (2021). Diversification of mandarin citrus by hybrid speciation and apomixis. Nature Commu- nications 12(1): 4377. Wu GA, Terol J, Ibanez V, López-García A, Pérez-Román E, Borredá C, Domingo C, Tadeo FR, Carbonell-Caballero J, Alonso R et al. (2018). Genomics of the origin and evolution of citrus. Nature 554(7692): 311–316. 113 Xiao N, Bock P, Antreich SJ, Staedler YM, Schönenberger J, Gierlinger N (2020). From the soft to the hard: Changes in microchemistry during cell wall maturation of walnut shells. Frontiers in Plant Science 11. Xiao Z, Stait-Gardner T, Willis SA, Price WS, Moroni FJ, Pagay V, Tyerman SD, Schmidtke LM, Rogiers SY (2021). 3D visualisation of voids in grapevine flowers and berries using X-ray micro computed tomography. Australian Journal of Grape and Wine Research 27(2): 141–148. Yamashita H, Uchida T, Tanaka Y, Katai H, Nagano AJ, Morita A, Ikka T (2020). Ge- nomic predictions and genome-wide association studies based on RAD-seq of quality-related metabolites for the genomics-assisted breeding of tea plants. Scientific Reports 10(1): 17480. Zhang BW, Xu LL, Li N, Yan PC, Jiang XH, Woeste KE, Lin K, Renner SS, Zhang DY, Bai WN (2019). Phylogenomics Reveals an Ancient Hybrid Origin of the Persian Walnut. Molecular Biology and Evolution 36(11): 2451–2461. Zhang H, Lan H, Tang Y, Li Y (2014). Study on fracture mechanism of walnut shell according to brittle fracture area. In 2014 Fifth International Conference on Intelligent Systems Design and Engineering Applications, pp. 954–957. Zhao S, Wen J, Wang H, Zhang Z, Li X (2016). Changes in lignin content and activity of related enzymes in the endocarp during the walnut shell development period. Horticultural Plant Journal 2(3): 141–146. Zhou H, Whalley WR, Hawkesford MJ, Ashton RW, Atkinson B, Atkinson JA, Sturrock CJ, Bennett MJ, Mooney SJ (2020). The interaction between wheat roots and soil pores in structured field soil. Journal of Experimental Botany 72(2): 747–756. Zhou L, Wang SB, Jian J, Geng QC, Wen J, Song Q, Wu Z, Li GJ, Liu YQ, Dunwell JM et al. (2015). Identification of domestication-related loci associated with flowering time and seed size in soybean with the RAD-seq genotyping method. Scientific Reports 5(1): 9350. 114 APPENDIX A BARLEY APPENDIX Distribution of seeds across accessions 0.015 mean = 111.4 sd = 42.2 0.010 Density 0.005 0.000 0 50 100 150 200 Number of seeds Figure A.1: Distribution of the 3121 seeds according to their accession. The seed number values as in Table 3.1 have empirical mean µ̄ = 111.46 and empirical standard deviation σ̄ = 42.21. A normal distribution with these parameters is drawn on top of the histogram. Observe that all the accession seed numbers are within two standard deviations. SVM classification with traditional descriptors (PCA dimension reduced) 0.6 PCA 0.5 2 3 4 F1 Score 0.4 5 6 7 8 0.3 9 10 0.2 2 3 4 5 6 7 8 9 10 PCA reduced dimension Figure A.2: Classification results for traditional shape descriptors. After centering and scaling the traditional shape descriptors, we used PCA to reduce their dimension and then performed an SVM classification with these dimension-reduced vectors. We observe that the highest classification F1 scores correspond to the use of almost all the traditional dimensions. 115 A SVM classification with topological shape descriptors Directions 74 101 158 230 Dim. Red. (02) Dim. Red. (03) Dim. Red. (06) Dim. Red. (12) Dim. Red. (24) 0.75 KPCA 0.50 0.25 F1 Score 0.00 0.75 UMAP 0.50 0.25 0.00 4 8 16 32 64 4 8 16 32 64 4 8 16 32 64 4 8 16 32 64 4 8 16 32 64 Thresholds B SVM classification with combined shape descriptors Directions 74 101 158 230 Dim. Red. (02) Dim. Red. (03) Dim. Red. (06) Dim. Red. (12) Dim. Red. (24) 0.8 KPCA 0.6 0.4 F1 Score 0.2 0.8 UMAP 0.6 0.4 0.2 4 8 16 32 64 4 8 16 32 64 4 8 16 32 64 4 8 16 32 64 4 8 16 32 64 Thresholds Figure A.3: Classification results for combined and topological shape descriptors computed for different choices of parameters. 116 Figure A.3 (cont’d): Classification results for combined and topological shape descriptors computed for different choices of parameters. To evaluate the ECT descriptiveness, we sought to use these ECT vectors to classify 28 different barley accessions based solely on seed morphology. The ECT was computed for different number of directions and thresholds. These high-dimensional vectors were later reduced to different number of dimensions using both KPCA and UMAP. Observe that both dimension reduction techniques summarize the ECT information in very different ways, as evidenced by the different SVM classification F1 scores when using (A) exclusively topological information or (B) combining both topological and traditional seed shape descriptors. A Kruskal-Wallis summary (spike based) B Discerning directions BH correction for 1.0e-10 FDR spike-wise 40 (Adjusted 1e-10 FDR) -log10(p-val) 20 0 400 H-statistic c/ to BH critical p-val 300 H-statistic 200 100 0 0 500 1000 1500 2000 2500 Topological shape descriptors (sorted) Figure A.4: Relevant ECT directions and slices. (A) We examine the inter-spike and intra-spike variance differences of the Euler characteristic for each direction and threshold. A Kruskal- Wallis analysis combined with a Benjamini-Hochberg multiple test correction suggests a number of discerning slices across accessions. (B) These directions and thresholds are mostly concentrated around the poles, similar to the case of inter- and intra-accession variance case (Figure 3.4). 117 A Kruskal-Wallis summary (accession based) B Kruskal-Wallis summary (accession based) H-statistic -log10(p-value) H-statistic -log10(p-value) BH correction for 1.0e-10 FDR BH correction for 1.0e-10 FDR 200 200 100 0 0 H-statistic corresponding to BH critical p-value 1500 H-statistic corresponding to BH critical p-value 1000 1000 500 0 Length KPCA 2 Length Area Area ConvexArea Shell ConvexVol ConvexArea Shell ConvexVol Vol Vol KPCA 23 KPCA 1 Width Width UMAP 9 UMAP 14 UMAP 17 HeightB HeightA ConvexVolRatio KPCA 17 UMAP 22 HeightB HeightA UMAP 5 KPCA 5 KPCA 21 KPCA 7 ConvexAreaRatio UMAP 8 ConvexVolRatio UMAP 4 UMAP 10 KPCA 20 KPCA 19 KPCA 15 KPCA 9 UMAP 1 UMAP 7 UMAP 6 UMAP 3 KPCA 3 KPCA 13 KPCA 11 KPCA 16 UMAP 11 UMAP 24 UMAP 2 UMAP 23 KPCA 4 KPCA 10 KPCA 22 KPCA 18 UMAP 18 UMAP 12 UMAP 13 UMAP 16 KPCA 6 KPCA 14 KPCA 24 KPCA 12 UMAP 19 UMAP 15 UMAP 20 UMAP 21 Combined shape descriptors (sorted) Combined shape descriptors (sorted) KPCA 8 ConvexAreaRatio C Kruskal-Wallis summary (spike based) D Kruskal-Wallis summary (spike based) H-statistic -log10(p-value) H-statistic -log10(p-value) BH correction for 1.0e-10 FDR BH correction for 1.0e-10 FDR 200 200 0 0 H-statistic corresponding to BH critical p-value H-statistic corresponding to BH critical p-value 2000 2000 1000 0 KPCA 1 KPCA 2 UMAP 9 UMAP 14 KPCA 23 HeightA HeightB Length UMAP 22 UMAP 5 UMAP 7 UMAP 12 Area Vol ConvexVol ConvexArea UMAP 23 UMAP 17 UMAP 4 UMAP 3 Width Shell ConvexVolRatio KPCA 17 UMAP 10 UMAP 8 UMAP 6 HeightA KPCA 5 KPCA 7 KPCA 21 ConvexAreaRatio UMAP 1 HeightB UMAP 13 UMAP 19 KPCA 19 KPCA 20 KPCA 9 KPCA 3 Length UMAP 2 UMAP 15 UMAP 18 KPCA 15 KPCA 22 KPCA 16 KPCA 13 UMAP 11 UMAP 21 UMAP 24 Area KPCA 4 KPCA 24 KPCA 11 KPCA 18 Vol ConvexVol ConvexArea Width KPCA 14 KPCA 10 KPCA 12 KPCA 6 Shell UMAP 16 UMAP 20 ConvexVolRatio Combined shape descriptors (sorted) Combined shape descriptors (sorted) KPCA 8 ConvexAreaRatio Figure A.5: Relevant combined descriptors. Dimension-reduced topological vectors were concate- nated with tradtional shape descriptors to produce combined descriptors. Kruskal-Wallis analyses reveal which descriptors explain the most inter-accession variance when the ECT was reduced in dimension with (A) KPCA, and (B) UMAP. Similar analyses also reveal which features contribute the most to inter-spike variance when the ECT vector was reduced with (C) KPCA, and (D) UMAP. 118 APPENDIX B CITRUS APPENDIX B.1 Supporting figures Kumquats Mandarins and Papedas Sour Oranges and Trifoliates and Mandarin Hybrids their hybrids Trifoliate Hybrids Lemons and Lemon Microcitrus Pummelos and Sweet Oranges Other Hybrids Pummelo Hybrids R 2 = 0.61 R 2 = 0.60 Voxel size resolution [ m] 100 80 60 40 R 2 = -0.86 8.0 8.5 9.0 9.5 10.0 2 3 4 5 6 7 8 3 4 5 6 log (Number of glands) log (Gland density per volume) [glands/cm3] log (Gland density per area) [glands/cm2] Figure B.1: Comparing the effect of different scan resolutions. A negative R2 value suggests no correlation between the scan resolution and the number of oil glands extracted. Nonetheless, there is a possible correlation between scan resolution and oil gland density. 119 log Vol(Exocarp) vs log Vol(Whole) [cm3] log Vol(Oil Glands) vs log Vol(Whole) [cm3] 0.5 1 Residuals 0.0 0 0.5 N(0, 0.302) 1 N(0, 0.502) log Vol(Spine) vs log Vol(Whole) [cm3] log Vol(Mesocarp) vs log Vol(Whole) [cm3] 2 N(0, 0.892) 0.5 Residuals 0 0.0 2 0.5 N(0, 0.352) log Vol(Endocarp) vs log Vol(Whole) [cm3] log Num vs log Vol(Whole) [cm3] N(0, 0.532) 0.0 1 0 Residuals 0.5 1 N(0, 0.242) 1.0 1 2 3 4 5 6 7 0 1 2 1 2 3 4 5 6 7 0.0 0.5 1.0 log Vol(Whole) [cm3] Density log Vol(Whole) [cm3] Density Analyses of residuals after linear fitting of log-log allometric plots Figure B.2: Analysis of the distribution of residuals. The left side of each column are the residuals of the fitted linear regression from Figure 4.2. The right side of each column shows the distribution of these residuals. For some of these measurement pairs of traits, the residuals follow a normal distribution, suggesting that the linear fit in the log-log plots is adequate. 120 Kumquats Mandarins and Papedas Sour Oranges and Trifoliates and Mandarin Hybrids their hybrids Trifoliate Hybrids Lemons and Lemon Microcitrus Pummelos and Sweet Oranges Other Hybrids Pummelo Hybrids log Vol(Mesocarp) [cm3] 6 6 log Vol(Exocarp) [cm3] log Vol(Endocarp) [cm3] 4 5 5 3 4 4 3 2 3 2 1 m = 0.87 2 m = 1.02 1 m = 1.15 0 b 2 = -1.16 b 2 = -0.61 0 b 2 = -2.06 R = 0.92 1 R = 0.96 R = 0.94 1 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 log Vol(Whole) [cm3] log Vol(Whole) [cm3] log Vol(Whole) [cm3] log Vol(Oil Glands) [cm3] 3 2 log Vol(Spine) [cm3] 10 log Num of Glands 2 1 1 0 0 1 9 1 2 m = 1.14 m = 0.85 m = 0.40 2 b 2 = -4.28 3 b 2 = -4.88 b 2 = 7.37 R = 0.58 4 R = 0.76 8 R = -0.24 3 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 log Vol(Whole) [cm3] log Vol(Whole) [cm3] log Vol(Whole) [cm3] log Vol(Mesocarp) [cm3] 6 6 log Vol(Endocarp) [cm3] 3 log Vol(Spine) [cm3] 5 5 2 4 4 1 3 3 2 0 2 m = 1.17 1 m = 1.33 1 m = 1.32 b 2 = 0.74 0 b 2 = -0.53 2 b 2 = -2.76 1 R = 0.83 R = 0.89 R = 0.58 1 3 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 log Vol(Exocarp) [cm3] log Vol(Exocarp) [cm3] log Vol(Exocarp) [cm3] log Vol(Oil Glands) [cm3] log Vol(Mesocarp) [cm3] 2 6 10 log Num of Glands 1 5 0 4 3 1 9 2 2 m = 0.99 m = 0.46 1 m = 1.13 3 b 2 = -3.74 b 2 = 7.90 0 b 2 = -1.37 4 R = 0.78 8 R = -0.13 R = 0.82 1 0 1 2 3 4 0 1 2 3 4 1 2 3 4 5 6 log Vol(Exocarp) [cm3] log Vol(Exocarp) [cm3] log Vol(Endocarp) [cm3] log Vol(Oil Glands) [cm3] 3 2 log Vol(Spine) [cm3] 10 log Num of Glands 2 1 1 0 0 1 9 1 2 m = 1.12 m = 0.84 m = 0.39 2 b 2 = -3.60 3 b 2 = -4.37 b 2 = 7.61 R = 0.48 4 R = 0.68 8 R = -0.38 3 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 log Vol(Endocarp) [cm3] log Vol(Endocarp) [cm3] log Vol(Endocarp) [cm3] log Vol(Oil Glands) [cm3] 3 2 log Vol(Spine) [cm3] 10 log Num of Glands 2 1 1 0 0 1 9 1 2 m = 0.99 m = 0.75 m = 0.35 2 b 2 = -2.23 3 b 2 = -3.34 b 2 = 8.08 R = 0.54 4 R = 0.77 8 R = -0.16 3 1 0 1 2 3 4 5 6 1 0 1 2 3 4 5 6 1 0 1 2 3 4 5 6 log Vol(Mesocarp) [cm3] log Vol(Mesocarp) [cm3] log Vol(Mesocarp) [cm3] log Vol(Oil Glands) [cm3] 2 10 10 log Num of Glands log Num of Glands 1 0 1 9 9 2 m = 0.75 m = 0.35 m = 0.46 3 b 2 = -1.67 b 2 = 8.86 b 2 = 9.64 4 R = 0.41 8 R = 0.02 8 R = 0.29 3 2 1 0 1 2 3 3 2 1 0 1 2 3 4 3 2 1 0 1 2 log Vol(Spine) [cm3] log Vol(Spine) [cm3] log Vol(Oil Glands) [cm3] Allometric relationships between phenotypes Figure B.3: Allometry plots for all possible pairs of measured phenotypes. Different citrus species are denoted with different markers. 121 Figure B.3 (cont’d): Allometry plots for all possible pairs of measured phenotypes. Different citrus species are denoted with different markers.We observe a strong allometric relationship between different tissue volumes. However, this relationship is missing when comparing the total number of oil glands to all tissue volumes, suggesting that the number of glands is decoupled from these volume traits. The best fit line is depicted by a dashed line in blue. For each plot, the slope, intercept, and correlation coefficient are recorded as m, b, and R2 respectively. The linear relationship in the log-log plots suggests that fruit tissue volumes may grow following a power law. log Vol(Endocarp) vs log Vol(Exocarp) [cm3] log Vol(Mesocarp) vs log Vol(Endocarp) [cm3] 1 N(0, 0.592) 1 Residuals 0 0 1 N(0, 0.502) 1 log Vol(Spine) vs log Vol(Exocarp) [cm3] 2log Vol(Oil Glands) vs log Vol(Endocarp) [cm ] 3 2 N(0, 0.582) 1 Residuals 0 0 2 N(0, 0.882) 1 log Vol(Oil Glands) vs log Vol(Exocarp) [cm3] log Vol(Spine) vs log Vol(Mesocarp) [cm3] 2 1 0 0 Residuals 2 1 N(0, 0.482) N(0, 0.922) 1 0 1 2 3 4 5 0.0 0.5 1.0 0 2 4 6 0.0 0.5 1.0 log Vol(Exocarp) [cm3] Density log Vol(Mesocarp) [cm3] Density Analyses of residuals after linear fitting of log-log allometric plots Figure B.4: Analysis of the distribution of residuals as in Figure B.2. The left side of each column depicts residuals of the linear fit as in Figure B.3. The right side of each column shows the distribution of these residuals. For some of these measurement pairs of traits, the residuals follow a normal distribution, suggesting that the linear fit in the log-log plots is adequate. 122 log Vol(Whole) vs log(Avg dist to nearest neigh) 0.2 log(Density per vol) vs log(Avg dist to NN) 0.2 0.0 Residuals 0.0 0.2 0.2 N(0, 0.142) 0.4 N(0, 0.082) 0.4 log Vol(Whole) [cm3] log (Oil gland density per volume) [glands/cm3] log Vol(Exocarp) vs log(Avg dist to nearest neigh) log(Density per area) vs log(Avg dist to NN) 0.50 N(0, 0.162) 0.2 Residuals 0.25 0.0 0.00 0.2 0.25 0.4 N(0, 0.102) 0 2 4 6 0 2 2 3 4 5 6 7 8 0.0 2.5 5.0 log Vol(Exocarp) [cm2] Density log (Oil gland density per area) [glands/cm2] Density Analyses of residuals after linear fitting of log-log allometric plots Figure B.5: Analysis of the distribution of residuals. The left side of each column depicts residuals of the linear fit as in Figure 4.3.A. The right side of each column shows the distribution of these residuals. For some of these measurement pairs of traits, the residuals follow a normal distribution, suggesting that the linear fit in the log-log plots is adequate. A p-values against a uniform distribution on the unit sphere = 0.05 0.4 Kumquats PAD test Lemons and Lemon 0.2 Hybrids Mandarins and Mandarin Hybrids 0.0 Microcitrus Papedas B p-values against a rotationally symmetric distribution on the unit sphere Pummelos and = 0.05 Pummelo Hybrids 0.4 Sour Oranges and Hybrid test their hybrids Sweet Oranges 0.2 Other 0.0 Scan index Figure B.6: Testing whether the oil glands are distributed uniformly or rotationally symmetric on the surface of the fruits. (A) p-values after testing if the underlying oil gland distribution is not uniform according to projected Anderson-Darling (PAD) test. (B) p-values after testing if the underlying oil gland distribution is not rotationally symmetric according to the scatter-location hybrid test. Red line at α = 0.05 in all plots. 123 B.2 Supporting tables Table B.1: A total of 166 different individual citrus fruits were scanned comprising 51 citrus accessions to represent modern cultivated citrus types as well as the species contributing ancestry to each group. All the shape and size analyses focused on a subset of these accessions, as indicated in Table 4.1. Names, identifiers, locations, and notes according to the University of California Givaudan Citrus Variety Collection (CVC). The N column indicates the number of pseudoreplicate fruits scanned per accession. Asterisk denotes not available. ID CVC_Name Scientific Name Type Location N 2317 Limon Real Citrus excelsa Wester Lemon 18B-18-9 4 3919 Lamas lemon Citrus limon L. Burm.f. Lemon 18A-24-1 3 3593 Interdonato lemon o.p. seedling Citrus limon L. Burm.f. Lemon-Hybrid 18A-8-3 3 3005 Frost nucellar Eureka lemon Citrus limon L. Burm.f. Eureka-type Lemon 18B-37-1 3 3050 Volckamer lemon o.p. seedling Citrus volkameriana Rough Lemon 18B-29-1 3 1482 Palestine sweet lime (Indian sweet lime) Citrus limettiodes Tan. Lime-Sweet 18B-27-7 3 661 Indian citron o.p. seedling (Zamburi) Citrus medica L. Citron-Hybrid 18B-13-7 3 3546 South Coast Field Station citron Citrus medica L. Citron 18A-5-3 3 3226 Scarlet Emperor mandarin o.p. seedling Citrus reticulata Blanco Mandarin 12B-27-1 3 Pankan 3812 * Citrus reticulata o.p. seedling Mandarin 12B-29-1 3 3851 Lee mandarin (Clementine X Orlando) Citrus reticulata Blanco RU- Mandarin 12B-25-9 3 TACEAE 3752 Som Keowan o.p. seedling Citrus reticulata Blanco Mandarin 12B-28-7 3 3958 Koster tangor Citrus reticulata Blanco Tangor Mandarines 12A-23-13 3 3844 Cleopatra mandarin o.p. seedling Citrus reshni hort. ex Tanaka RU- Mandarin 12B-31-1 3 TACEAE 3363 Beledy mandarin o.p. seedling Citrus reticulata Blanco Mandarin 12B-23-1 4 3991 USDA 88-2 (Lee X Nova) Citrus reticulata Blanco Mandarin-Hybrid 12A-22-13 3 3816 Kinkoji Unshiu (graft hybrid of C. obovoidea Citrus neo-aurantium Mandarin-Hybrid 12B-24-11 3 + Satsuma) 3558 Fremont mandarin Citrus reticulata Blanco RU- Mandarin 12B-23-15 3 TACEAE 3781 Tahitian pummelo X Star Ruby grapefruit Citrus paradisi Macfadyen Pummelo-Hybrid 12A-31-9 3 124 Table B.1 (cont’d) ID CVC_Name Scientific Name Type Location N 4026 Pomelit pummelo hybrid (Djeroek Deleema Citrus maxima (Burm.) Merr Pummelo-Hybrid 12A-19-7 3 Kopjor) 3907 Hassaku (Citrus hassaku, Beni Hassaku) Citrus hassaku hort. ex Tanaka RU- Pummelo-Hybrid 12B-41-3 3 TACEAE 3959 Egami Buntan (Egami, Ogami) pummelo Citrus maxima (Burm.) Merr. RU- Pummelo 12B-48-13 3 TACEAE 2242 Kao Panne pummelo (Kao Pan) Citrus maxima (Burm.) Merr. RU- Pummelo 12A-35-3 3 TACEAE 3289 Willowleaf sour orange Citrus aurantium var. salicifolia Sour Orange 12B-19-9 3 628 Standard sour orange Citrus aurantium L. Sour Orange 12B-18-5 3 3611 Konejime sour orange o.p. seedling Citrus neo-aurantium Sour Orange-Hybrid 12B-20-11 4 2717 Olivelands Sour orange Citrus aurantium L. Sour Orange 12B-40-1 3 3030 Cutter Valencia nucellar seedling Citrus sinensis L. Osbeck Valencia Sweet Or- 12B-8-5 3 ange 3746 Shamouti orange, Israeli seedling #1 Citrus sinensis L. Osbeck Sweet Orange 12B-17-7 3 1241- Parent Washington navel Citrus sinensis (L.) Osbeck Early/mid-season 12B-4-3 3 B Navel Orange 2802 Argentina sweet orange o.p. seedling Citrus sinensis (L.) Osbeck RU- Sweet Orange 12B-11-5 3 TACEAE 3994 Cara Cara pink fleshed navel Citrus sinensis (L.) Osbeck Early/mid-season 12A-25-9 3 Navel Orange 3163 Indian wild orange Citrus indica Tanaka Citrus species 12B-30-1 3 2485 Nasnaran Citrus amblycarpa (Hassk.) Ochse Citrus species 12B-29-9 4 3228 Korai tachibana Citrus nippokoreana hort ex. Citrus species 12B-30-9 3 Tanaka 2320 Winged lime (Talimasan) Citrus longispina Wester Citrus species 18B-18-3 3 3877 Nagami kumquat Fortunella margarita (Lour.) Kumquat 12B-44-13 3 Swingle 3661 Australian finger lime Microcitrus australasica Microcitrus 18B-16-5 4 3605 Samuyao Citrus micrantha var. microcarpa Papeda 18B-19-5 6 125 Table B.1 (cont’d) ID CVC_Name Scientific Name Type Location N 2327 Ichang papeda Citrus cavaleriei H. Lév. ex Cava- Papeda 18B-9-5 1 lerie 3469 * Citrus hanayu Siebold ex Shirai Papeda 12B-36-9 3 1455 Kalpi, Nogapog, Malayan lemmon Citrus webberii Papeda 18B-17-6 2 2454 Makrut lime Citrus hystrix DC. Papeda 18B-20-5 4 3842 Alemow (Colo) Citrus macrophylla Wester Papeda 18B-18-5 3 1491 Chinese box orange Severinia buxifolia Relative 12D-26-15 3 3140 Indian bael fruit Aegle marmelos (L.) Corrêa Relative 18B-9-1 3 4008 Little-leaf trifoliate Poncirus trifoliata (L.) Raf. RU- Trifoliate 12A-24-5 3 TACEAE 838 Rubidoux trifoliate Poncirus trifoliata (L.) Raf. RU- Trifoliate 12A-5-5 4 TACEAE 3912 C-35 citrange (Ruby orange x Webber- X Citroncirus spp Trifoliate-Hybrid 12B-41-11 3 Fawcett trifoliate) 2863 Carrizo citrange o.p. seedling (’Washington’ X Citroncirus sp. RUTACEAE Trifoliate-Hybrid 12A-39-9 5 sweet orange X Poncirus trifoliata) 3771 Swingle citrumelo (’Duncan’ grapefruit x X Citroncirus spp. RUTACEAE Trifoliate-Hybrid 12A-42-5 4 Poncirus trifoliata) 126 Table B.2: Technical details for all the 63 citrus fruit scans produced comprising 166 individual fruits representing 51 citrus accessions (see Table B.1 for details). The scans were produced using the North Star Imaging X3000 system and the included efX software, with 720 projections per scan, at 3 frames per second and with 3 frames averaged per projection. The data was obtained in continuous mode. The fruits were placed as close as possible to the X-ray detector, provided all fruits of the same accession could be scanned completely at once, which resulted in varying voxel size resolutions after reconstruction. Pummelo and citron fruits were scanned individually due to their large size. The X-ray source was set to different current and voltages for different fruits, resulting in varying focal spot sizes. Voxel size, voltage, current, and focal size denoted by vs, V, C, and fs respectively. scan_ID CVC_name vs [µm] V [kV] C [µÅ] fs [µm] C01_CRC2317_18B-18-9 Limon Real 83.6 75 70 5.25 C02_CRC3919_18A-24-1 Lamas lemon 81.2 75 70 5.25 C03_CRC3593_18A-8-3 Interdonato lemon o.p. seedling 104.2 75 70 5.25 C04_CRC3005_18B_37_1 Frost nucellar Eureka lemon 95.3 75 70 5.25 C05_CRC3050_18B-29-1 Volckamer lemon o.p. seedling 87.4 75 70 5.25 C06_CRC1482_18B-27-7 Palestine sweet lime (Indian sweet lime) 88.4 75 70 5.25 C07_CRC0661_18B-13-7 Indian citron o.p. seedling (Zamburi) 105.5 90 70 6.3 C08A_CRC3546_18A-5-3 South Coast Field Station citron 78.7 90 70 6.3 C08B_CRC3546_18A-5-3 South Coast Field Station citron 91.2 90 70 6.3 C08C_CRC3546_18A-5-3 South Coast Field Station citron 92.6 90 70 6.3 M01_CRC3226_12B-27-1 Scarlet Emperor mandarin o.p. seedling Pankan 78.6 70 70 4.9 M02_CRC3812_12B-29-1 Citrus reticulata Blanco 94.7 70 70 4.9 M03_CRC3851_12B-25-9 Lee mandarin (Clementine X Orlando) 87.2 70 70 4.9 M04_CRC3752_12B-28-7 Som Keowan o.p. seedling 78.5 70 70 4.9 M05_CRC3958_12A-23-13 Koster tangor 78.5 70 70 4.9 M06_CRC3844_12B-31-1 Cleopatra mandarin o.p. seedling 46.7 70 70 4.9 M07_CRC3363_12B-23-1 Beledy mandarin o.p. seedling 77.1 70 70 4.9 M08_CRC3991_12A-22-13 USDA 88-2 (Lee X Nova) 97.6 70 70 4.9 M09_CRC3816_12B-24-11 Kinkoji Unshiu (graft hybrid of C. obovoidea + Sat- 106.8 70 70 4.9 suma) M10_CRC3558_12B-23-15 Fremont mandarin 76.2 70 70 4.9 P01A_CRC3781_12A-31-9 Tahitian pummelo X Star Ruby grapefruit 60.3 90 70 6.3 P01B_CRC3781_12A-31-9 Tahitian pummelo X Star Ruby grapefruit 58.8 90 70 6.3 127 Table B.2 (cont’d) scan_ID CVC_name vs [µm] V [kV] C [µÅ] fs [µm] P01C_CRC3781_12A-31-9 Tahitian pummelo X Star Ruby grapefruit 66.1 90 70 6.3 P02A_CRC4026_12A-19-7 Pomelit pummelo hybrid (Djeroek Deleema Kopjor) 64.7 90 70 6.3 P02B_CRC4026_12A-19-7 Pomelit pummelo hybrid (Djeroek Deleema Kopjor) 65.1 90 70 6.3 P02C_CRC4026_12A-19-7 Pomelit pummelo hybrid (Djeroek Deleema Kopjor) 69.4 90 70 6.3 P03A_CRC3907_12B-41-3 Hassaku (Citrus hassaku, Beni Hassaku) 64.7 90 70 6.3 P03B_CRC3907_12B-41-3 Hassaku (Citrus hassaku, Beni Hassaku) 69.6 90 70 6.3 P03C_CRC3907_12B-41-3 Hassaku (Citrus hassaku, Beni Hassaku) 69.4 90 70 6.3 P04A_CRC3959_12B-48-13 Egami Buntan (Egami, Ogami) pummelo 85.2 90 70 6.3 P04B_CRC3959_12B-48-13 Egami Buntan (Egami, Ogami) pummelo 81.8 90 70 6.3 P04C_CRC3959_12B-48-13 Egami Buntan (Egami, Ogami) pummelo 81.1 90 70 6.3 P05A_CRC2242_12A-35-3 Kao Panne pummelo (Kao Pan) 85.2 90 70 6.3 P05B_CRC2242_12A-35-3 Kao Panne pummelo (Kao Pan) 87.5 90 70 6.3 P05C_CRC2242_12A-35-3 Kao Panne pummelo (Kao Pan) 81.1 90 70 6.3 SR01_CRC3289_12B-19-9 Willowleaf sour orange 72.4 75 70 5.25 SR02_CRC0628_12B-18-5 Standard sour orange 88.8 75 70 5.25 SR03_CRC3611_12B-20-11 Konejime sour orange o.p. seedling 77.3 75 70 5.25 SR04_CRC2717_12B-40-1 Olivelands Sour orange 83.6 75 70 5.25 SW01_CRC3030_12B-8-5 Cutter Valencia nucellar seedling 103.8 75 70 5.25 SW02_CRC3746_12B-17-7 Shamouti orange, Israeli seedling #1 110.1 75 70 5.25 SW03_CRC1241-B_12B-4-3 Parent Washington navel 110.4 75 70 5.25 SW04_CRC2802_12B-11-5 Argentina sweet orange o.p. seedling 86.5 75 70 5.25 SW05_CRC3994_12A-25-9 Cara Cara pink fleshed navel 97.5 75 70 5.25 WR01_CRC1491_12D-26-15 Chinese box orange 18.6 70 70 4.9 WR02_CRC3877_12B-44-13 Nagami kumquat 46 70 70 4.9 WR03_CRC3163_12B-30-1 Indian wild orange 46 70 70 4.9 WR04_CRC4008_12A-24-5 Little-leaf trifoliate 67.3 70 70 4.9 WR05_CRC3605_18B-19-5 Samuyao 57.5 70 70 4.9 WR06_CRC2485_12B-29-9 Nasnaran 57.5 70 70 4.9 WR07_CRC2327_18B-9-5 Ichang papeda 29.5 70 70 4.9 WR08_CRC3661_18B-16-5 Australian finger lime 35 70 70 4.9 128 Table B.2 (cont’d) scan_ID CVC_name vs [µm] V [kV] C [µÅ] fs [µm] WR09_CRC3469_12B-36-9 Citrus hanayu Siebold ex Shirai 84.4 70 70 4.9 WR10_CRC0838_12A-5-5 Rubidoux trifoliate 84.7 70 70 4.9 WR11_CR3228_12B-30-9 Korai tachibana 84.7 70 70 4.9 WR12_CRC1455_18B-17-6 Kalpi, Nogapog, Malayan lemmon 86 70 70 4.9 WR13_CRC3912_12B-41-11 C-35 citrange (Ruby orange x Webber-Fawcett trifoli- 79.2 70 70 4.9 ate) WR14_CRC2320_18B-18-3 Winged lime (Talimasan) 95.9 70 70 4.9 WR15_CRC3140_18B-9-1 Indian bael fruit 77.5 70 70 4.9 WR16_CRC3771_12A-42-5 Swingle citrumelo (’Duncan’ grapefruit x Poncirus 87.3 70 70 4.9 trifoliata) WR17_CRC2454_18B-20-5 Makrut lime 88.6 70 70 4.9 WR18_CRC2863_12A-39-9 Carrizo citrange o.p. seedling (’Washington’ sweet 102.6 70 70 4.9 orange X Poncirus trifoliata) WR19_CRC3842_18B-18-5 Alemow (Colo) 107.7 70 70 4.9 129 Table B.3: We tested whether the oil glands on the citrus fruit skins are either uniformly or rotationally symmetrically distributed (see main text for details). Uniformity was tested with Projected Anderson-Darling (PAD) test. Rotational symmetry was tested with a scatter-location hybrid von Mises-Fisher (vMF) test with an unspecified direction of symmetry. The resulting statistics and p-values are reported below. Individual fruits are identified by their original scan_ID (as in Table B.2). Individual pseudoreplicates are identified with different labels. scan_ID label kind s_PAD p_PAD s_vMF p_vMF C01_CRC2317_18B-18-9 L00 Lemon 9.80E+01 0.00E+00 3.41E+02 1.70E-72 C01_CRC2317_18B-18-9 L01 Lemon 8.73E+01 0.00E+00 1.47E+02 8.00E-31 C01_CRC2317_18B-18-9 L02 Lemon 4.62E+01 1.20E-09 2.09E+02 5.12E-44 C01_CRC2317_18B-18-9 L03 Lemon 2.96E+01 4.12E-09 1.20E+02 6.29E-25 C02_CRC3919_18A-24-1 L00 Lemon 1.12E+01 1.28E-08 1.21E+01 1.66E-02 C02_CRC3919_18A-24-1 L01 Lemon 1.14E+01 0.00E+00 3.82E+00 4.30E-01 C02_CRC3919_18A-24-1 L02 Lemon 1.24E+01 2.93E-08 3.61E+00 4.62E-01 C03_CRC3593_18A-8-3 L00 Lemon 6.68E+01 2.73E-09 7.93E+02 2.47E-170 C03_CRC3593_18A-8-3 L01 Lemons 6.49E+01 1.69E-08 8.66E+02 3.06E-186 C03_CRC3593_18A-8-3 L02 Lemons 4.07E+01 0.00E+00 4.80E+02 1.48E-102 C04_CRC3005_18B_37_1 L00 Lemons 4.07E+01 1.01E-08 1.88E+01 8.52E-04 C04_CRC3005_18B_37_1 L01 Lemons 1.85E+01 1.94E-09 9.35E+01 2.41E-19 C04_CRC3005_18B_37_1 L02 Lemons 3.86E+01 0.00E+00 2.26E+01 1.55E-04 C05_CRC3050_18B-29-1 L00 Lemons 1.06E+02 0.00E+00 1.60E+02 1.41E-33 C05_CRC3050_18B-29-1 L01 Lemons 1.63E+01 4.43E-08 9.65E+01 5.38E-20 C05_CRC3050_18B-29-1 L02 Lemons 1.63E+01 3.47E-08 1.67E+01 2.22E-03 C06_CRC1482_18B-27-7 L00 Other 4.91E+01 0.00E+00 2.99E+02 1.54E-63 C06_CRC1482_18B-27-7 L01 Other 2.19E+01 0.00E+00 3.38E+01 8.33E-07 C06_CRC1482_18B-27-7 L02 Other 1.29E+01 2.29E-08 8.51E+01 1.47E-17 C07_CRC0661_18B-13-7 L00 Other 2.09E+01 0.00E+00 5.41E+01 4.92E-11 C07_CRC0661_18B-13-7 L01 Other 3.73E+01 0.00E+00 3.99E+01 4.60E-08 C07_CRC0661_18B-13-7 L02 Other 1.63E+01 1.37E-08 6.06E+01 2.14E-12 C08A_CRC3546_18A-5-3 L00 Other 3.98E+02 2.52E-07 9.76E+02 7.25E-210 C08B_CRC3546_18A-5-3 L00 Other 2.12E+02 9.15E-09 5.92E+01 4.37E-12 C08C_CRC3546_18A-5-3 L00 Other 5.22E+02 7.24E-09 1.38E+03 9.37E-298 130 Table B.3 (cont’d) scan_ID label kind s_PAD p_PAD s_vMF p_vMF M01_CRC3226_12B-27-1 L00 Mandarins 2.44E+01 0.00E+00 1.12E+02 3.40E-23 M01_CRC3226_12B-27-1 L01 Mandarins 2.48E+01 0.00E+00 4.75E+01 1.18E-09 M01_CRC3226_12B-27-1 L02 Mandarins 2.19E+01 0.00E+00 5.77E+00 2.17E-01 M02_CRC3812_12B-29-1 L00 Mandarins 5.45E+01 1.97E-09 6.60E+02 1.30E-141 M02_CRC3812_12B-29-1 L01 Mandarins 6.41E+01 1.22E-08 3.01E+02 6.71E-64 M02_CRC3812_12B-29-1 L02 Mandarins 2.58E+01 0.00E+00 1.54E+02 2.52E-32 M03_CRC3851_12B-25-9 L00 Mandarins 3.18E+01 0.00E+00 9.94E+01 1.29E-20 M03_CRC3851_12B-25-9 L01 Mandarins 1.72E+01 0.00E+00 4.52E+01 3.66E-09 M03_CRC3851_12B-25-9 L02 Mandarins 1.55E+01 4.30E-08 3.12E+01 2.74E-06 M04_CRC3752_12B-28-7 L00 Mandarins 8.15E+00 0.00E+00 1.50E+01 4.72E-03 M04_CRC3752_12B-28-7 L01 Mandarins 8.76E+00 1.50E-08 2.34E+01 1.04E-04 M04_CRC3752_12B-28-7 L02 Mandarins 7.81E+00 0.00E+00 2.45E+01 6.30E-05 M05_CRC3958_12A-23-13 L00 Mandarins 4.18E+01 1.38E-08 1.82E+02 2.56E-38 M05_CRC3958_12A-23-13 L01 Mandarins 1.68E+01 0.00E+00 6.30E+01 6.73E-13 M05_CRC3958_12A-23-13 L02 Mandarins 5.42E+01 1.40E-08 1.33E+02 8.60E-28 M06_CRC3844_12B-31-1 L00 Other 4.26E+01 1.32E-08 6.50E+02 2.60E-139 M06_CRC3844_12B-31-1 L01 Other 3.56E+01 3.15E-08 5.69E+02 8.75E-122 M06_CRC3844_12B-31-1 L02 Other 2.01E+01 0.00E+00 9.50E+00 4.98E-02 M07_CRC3363_12B-23-1 L00 Mandarins 2.19E+01 3.22E-10 2.67E+02 1.74E-56 M07_CRC3363_12B-23-1 L01 Mandarins 2.58E+01 0.00E+00 1.50E+02 2.27E-31 M07_CRC3363_12B-23-1 L02 Mandarins 2.01E+01 0.00E+00 1.98E+02 8.31E-42 M07_CRC3363_12B-23-1 L03 Mandarins 1.41E+01 8.69E-09 9.92E+00 4.18E-02 M08_CRC3991_12A-22-13 L00 Mandarins 3.53E+01 0.00E+00 1.33E+02 9.16E-28 M08_CRC3991_12A-22-13 L01 Mandarins 2.02E+01 0.00E+00 1.16E+02 3.65E-24 M08_CRC3991_12A-22-13 L02 Mandarins 3.28E+01 0.00E+00 4.83E+01 8.17E-10 M09_CRC3816_12B-24-11 L00 Mandarins 6.19E+00 2.19E-07 3.09E+01 3.19E-06 M09_CRC3816_12B-24-11 L01 Mandarins 6.58E+00 1.04E-07 3.30E+01 1.18E-06 M09_CRC3816_12B-24-11 L02 Mandarins 8.70E+00 2.54E-08 1.50E+01 4.70E-03 M10_CRC3558_12B-23-15 L00 Mandarins 4.92E+01 1.70E-09 3.14E+02 9.37E-67 M10_CRC3558_12B-23-15 L01 Mandarins 8.32E+00 1.70E-08 6.89E+00 1.42E-01 131 Table B.3 (cont’d) scan_ID label kind s_PAD p_PAD s_vMF p_vMF M10_CRC3558_12B-23-15 L02 Mandarins 1.44E+01 0.00E+00 2.75E+01 1.54E-05 P01A_CRC3781_12A-31-9 L00 Pummelos 3.32E+01 3.12E-08 1.01E+02 5.17E-21 P01B_CRC3781_12A-31-9 L00 Pummelos 1.64E+02 0.00E+00 3.68E+02 1.94E-78 P01C_CRC3781_12A-31-9 L00 Pummelos 2.67E+01 0.00E+00 8.44E+01 2.04E-17 P02A_CRC4026_12A-19-7 L00 Pummelos 5.88E+01 1.68E-08 4.76E+01 1.15E-09 P02B_CRC4026_12A-19-7 L00 Pummelos 1.50E+02 5.20E-06 1.54E+02 2.78E-32 P02C_CRC4026_12A-19-7 L00 Pummelos 7.30E+01 3.60E-09 3.16E+01 2.28E-06 P03A_CRC3907_12B-41-3 L00 Pummelos 6.88E+01 3.47E-08 3.42E+02 9.83E-73 P03B_CRC3907_12B-41-3 L00 Pummelos 7.32E+01 1.33E-08 7.47E+01 2.29E-15 P03C_CRC3907_12B-41-3 L00 Pummelos 2.78E+01 0.00E+00 1.46E+02 1.26E-30 P04A_CRC3959_12B-48-13 L00 Pummelos 1.14E+02 0.00E+00 5.19E+01 1.46E-10 P04B_CRC3959_12B-48-13 L00 Pummelos 2.27E+02 2.37E-09 7.96E+02 4.59E-171 P04C_CRC3959_12B-48-13 L00 Pummelos 1.37E+02 0.00E+00 2.68E+02 9.57E-57 P05A_CRC2242_12A-35-3 L00 Pummelos 1.16E+02 0.00E+00 5.92E+01 4.26E-12 P05B_CRC2242_12A-35-3 L00 Pummelos 3.83E+01 0.00E+00 6.26E+01 8.41E-13 P05C_CRC2242_12A-35-3 L00 Pummelos 8.52E+01 5.46E-09 8.52E+01 1.36E-17 SR01_CRC3289_12B-19-9 L00 Sour Oranges 2.65E+01 0.00E+00 2.72E+01 1.78E-05 SR01_CRC3289_12B-19-9 L01 Sour Oranges 1.85E+01 3.78E-08 2.00E+01 5.00E-04 SR01_CRC3289_12B-19-9 L02 Sour Oranges 1.55E+01 4.12E-08 3.58E+01 3.18E-07 SR02_CRC0628_12B-18-5 L00 Sour Oranges 3.72E+01 0.00E+00 1.31E+01 1.08E-02 SR02_CRC0628_12B-18-5 L01 Sour Oranges 2.31E+01 8.41E-10 3.51E+00 4.77E-01 SR02_CRC0628_12B-18-5 L02 Sour Oranges 1.59E+01 4.80E-08 9.22E+01 4.42E-19 SR03_CRC3611_12B-20-11 L00 Sour Oranges 1.67E+01 2.91E-08 5.73E+01 1.09E-11 SR03_CRC3611_12B-20-11 L01 Sour Oranges 1.21E+01 1.45E-08 4.14E+01 2.20E-08 SR03_CRC3611_12B-20-11 L02 Sour Oranges 6.64E+00 8.48E-08 2.51E+01 4.70E-05 SR03_CRC3611_12B-20-11 L03 Sour Oranges 9.85E+00 2.44E-08 4.83E+00 3.06E-01 SR04_CRC2717_12B-40-1 L00 Sour Oranges 2.36E+01 0.00E+00 3.54E+01 3.86E-07 SR04_CRC2717_12B-40-1 L01 Sour Oranges 2.28E+00 1.46E-02 6.56E+00 1.61E-01 SR04_CRC2717_12B-40-1 L02 Sour Oranges 9.70E+00 0.00E+00 1.07E+02 3.28E-22 SW01_CRC3030_12B-8-5 L00 Sweet Oranges 9.69E+01 0.00E+00 2.19E+02 3.07E-46 132 Table B.3 (cont’d) scan_ID label kind s_PAD p_PAD s_vMF p_vMF SW01_CRC3030_12B-8-5 L01 Sweet Oranges 1.38E+01 4.17E-08 1.76E+02 6.51E-37 SW01_CRC3030_12B-8-5 L02 Sweet Oranges 2.34E+01 1.47E-08 5.11E+01 2.12E-10 SW02_CRC3746_12B-17-7 L00 Sweet Oranges 2.19E+01 0.00E+00 1.65E+02 1.48E-34 SW02_CRC3746_12B-17-7 L01 Sweet Oranges 3.02E+01 5.90E-08 1.43E+02 5.54E-30 SW02_CRC3746_12B-17-7 L02 Sweet Oranges 2.10E+01 1.21E-08 1.42E+02 9.91E-30 SW03_CRC1241-B_12B-4-3 L00 Sweet Oranges 4.90E+01 1.46E-07 3.91E+01 6.80E-08 SW03_CRC1241-B_12B-4-3 L01 Sweet Oranges 2.45E+01 1.23E-08 1.16E+01 2.03E-02 SW03_CRC1241-B_12B-4-3 L02 Sweet Oranges 1.35E+01 0.00E+00 1.25E+02 4.22E-26 SW04_CRC2802_12B-11-5 L00 Sweet Oranges 9.34E+00 1.08E-08 3.02E+01 4.46E-06 SW04_CRC2802_12B-11-5 L01 Sweet Oranges 3.80E+00 1.99E-04 5.62E+01 1.86E-11 SW04_CRC2802_12B-11-5 L02 Sweet Oranges 3.17E+00 1.20E-03 1.02E+01 3.75E-02 SW05_CRC3994_12A-25-9 L00 Sweet Oranges 8.29E+01 0.00E+00 2.48E+02 1.95E-52 SW05_CRC3994_12A-25-9 L01 Sweet Oranges 3.29E+01 6.03E-09 1.01E+02 4.83E-21 SW05_CRC3994_12A-25-9 L02 Sweet Oranges 6.17E+01 6.62E-08 2.12E+01 2.83E-04 WR02_CRC3877_12B-44-13 L00 Kumquats 9.03E+01 0.00E+00 1.35E+02 2.99E-28 WR02_CRC3877_12B-44-13 L01 Kumquats 1.22E+02 1.62E-08 2.16E+02 1.59E-45 WR02_CRC3877_12B-44-13 L02 Kumquats 8.95E+01 0.00E+00 6.04E+02 1.67E-129 WR03_CRC3163_12B-30-1 L00 Other 1.14E+02 0.00E+00 5.09E+02 7.08E-109 WR03_CRC3163_12B-30-1 L01 Other 2.96E+01 9.10E-09 3.62E+01 2.65E-07 WR03_CRC3163_12B-30-1 L02 Other 1.82E+01 0.00E+00 2.03E+01 4.39E-04 WR04_CRC4008_12A-24-5 L00 Trifoliates 3.84E+01 5.98E-09 2.15E+01 2.54E-04 WR04_CRC4008_12A-24-5 L01 Trifoliates 2.06E+01 5.37E-09 8.45E+01 1.98E-17 WR04_CRC4008_12A-24-5 L02 Trifoliates 1.43E+01 0.00E+00 5.48E+01 3.59E-11 WR05_CRC3605_18B-19-5 L00 Other 1.40E+01 1.26E-08 2.54E+01 4.19E-05 WR05_CRC3605_18B-19-5 L01 Other 1.66E+01 0.00E+00 1.22E+01 1.62E-02 WR05_CRC3605_18B-19-5 L02 Other 7.77E+00 9.73E-09 3.67E+01 2.06E-07 WR05_CRC3605_18B-19-5 L03 Other 9.30E+00 1.48E-08 3.58E+01 3.17E-07 WR05_CRC3605_18B-19-5 L04 Other 7.41E+00 1.36E-08 9.91E+00 4.19E-02 WR05_CRC3605_18B-19-5 L05 Other 6.95E+00 3.44E-08 8.51E-01 9.32E-01 WR06_CRC2485_12B-29-9 L00 Other 2.26E+01 5.32E-09 1.90E+01 7.90E-04 133 Table B.3 (cont’d) scan_ID label kind s_PAD p_PAD s_vMF p_vMF WR06_CRC2485_12B-29-9 L01 Other 3.20E+01 0.00E+00 8.80E-01 9.27E-01 WR06_CRC2485_12B-29-9 L02 Other 2.21E+01 0.00E+00 2.28E+01 1.39E-04 WR06_CRC2485_12B-29-9 L03 Other 1.52E+01 3.36E-08 2.01E+01 4.83E-04 WR07_CRC2327_18B-9-5 L00 Other 9.28E+01 1.04E-08 1.31E+03 2.39E-281 WR08_CRC3661_18B-16-5 L00 Microcitrus 2.92E+02 0.00E+00 5.38E+03 0.00E+00 WR08_CRC3661_18B-16-5 L01 Microcitrus 3.66E+02 0.00E+00 6.30E+03 0.00E+00 WR08_CRC3661_18B-16-5 L02 Microcitrus 3.32E+02 9.20E-09 4.99E+03 0.00E+00 WR08_CRC3661_18B-16-5 L03 Microcitrus 2.60E+02 5.81E-10 6.40E+03 0.00E+00 WR09_CRC3469_12B-36-9 L00 Papedas 5.61E+01 0.00E+00 1.70E+02 9.47E-36 WR09_CRC3469_12B-36-9 L01 Papedas 2.34E+01 0.00E+00 2.26E+02 1.16E-47 WR09_CRC3469_12B-36-9 L02 Papedas 6.12E+00 2.67E-07 2.84E+01 1.02E-05 WR10_CRC0838_12A-5-5 L00 Trifoliates 2.28E+01 0.00E+00 1.56E+02 1.17E-32 WR10_CRC0838_12A-5-5 L01 Trifoliates 1.22E+01 1.58E-08 1.05E+02 7.15E-22 WR10_CRC0838_12A-5-5 L02 Trifoliates 2.76E+01 2.87E-08 2.27E+01 1.45E-04 WR10_CRC0838_12A-5-5 L03 Trifoliates 1.70E+01 0.00E+00 4.86E+01 6.94E-10 WR11_CR3228_12B-30-9 L00 Other 3.19E+01 6.61E-08 3.13E+01 2.69E-06 WR11_CR3228_12B-30-9 L01 Other 1.15E+01 8.61E-09 2.15E+01 2.54E-04 WR11_CR3228_12B-30-9 L02 Other 2.06E+01 2.24E-08 1.70E+02 1.25E-35 WR12_CRC1455_18B-17-6 L00 Papedas 5.62E+01 4.94E-09 4.69E+01 1.57E-09 WR12_CRC1455_18B-17-6 L01 Papedas 9.54E+01 0.00E+00 1.46E+02 1.72E-30 WR13_CRC3912_12B-41-11 L00 Trifoliates 6.16E+00 1.92E-07 6.81E+01 5.71E-14 WR13_CRC3912_12B-41-11 L01 Trifoliates 2.55E+00 6.76E-03 3.27E+01 1.35E-06 WR13_CRC3912_12B-41-11 L02 Trifoliates 2.57E+01 4.25E-07 2.58E+02 1.49E-54 WR14_CRC2320_18B-18-3 L00 Other 2.07E+01 0.00E+00 2.38E+01 8.62E-05 WR14_CRC2320_18B-18-3 L01 Other 1.28E+01 0.00E+00 2.99E+01 5.02E-06 WR14_CRC2320_18B-18-3 L02 Other 1.92E+01 1.60E-08 1.24E+01 1.47E-02 WR14_CRC2320_18B-18-3 L03 Other 1.00E+01 0.00E+00 3.38E+01 8.02E-07 WR15_CRC3140_18B-9-1 L00 Other 2.87E+01 3.52E-08 2.26E+01 1.54E-04 WR15_CRC3140_18B-9-1 L01 Other 2.23E+01 5.74E-09 1.43E+02 6.77E-30 WR15_CRC3140_18B-9-1 L02 Other 2.51E+01 1.90E-08 4.54E+01 3.21E-09 134 Table B.3 (cont’d) scan_ID label kind s_PAD p_PAD s_vMF p_vMF WR16_CRC3771_12A-42-5 L00 Other 1.80E+02 0.00E+00 1.09E+03 3.41E-235 WR16_CRC3771_12A-42-5 L01 Other 8.21E+01 3.55E-09 1.08E+02 2.04E-22 WR16_CRC3771_12A-42-5 L02 Other 1.16E+02 2.14E-08 9.25E+01 3.78E-19 WR16_CRC3771_12A-42-5 L03 Other 4.09E+01 0.00E+00 2.69E+02 6.05E-57 WR17_CRC2454_18B-20-5 L00 Papedas 2.22E+01 1.27E-09 8.35E+01 3.16E-17 WR17_CRC2454_18B-20-5 L01 Papedas 1.05E+01 2.96E-08 2.13E+01 2.76E-04 WR17_CRC2454_18B-20-5 L02 Papedas 7.45E+00 0.00E+00 7.22E+00 1.24E-01 WR17_CRC2454_18B-20-5 L03 Papedas 7.81E+00 5.49E-09 9.49E+01 1.19E-19 WR18_CRC2863_12A-39-9 L00 Trifoliates 3.07E+01 1.12E-09 3.97E+01 5.10E-08 WR18_CRC2863_12A-39-9 L01 Trifoliates 1.00E+01 5.03E-09 3.77E+01 1.32E-07 WR18_CRC2863_12A-39-9 L02 Trifoliates 2.00E+01 6.44E-09 1.40E+01 7.16E-03 WR18_CRC2863_12A-39-9 L03 Trifoliates 1.02E+01 3.62E-09 2.08E+01 3.50E-04 WR18_CRC2863_12A-39-9 L04 Trifoliates 3.38E+00 6.62E-04 2.17E+01 2.28E-04 WR19_CRC3842_18B-18-5 L00 Other 6.27E+01 0.00E+00 5.98E+02 4.79E-128 WR19_CRC3842_18B-18-5 L01 Other 3.48E+01 6.07E-09 1.05E+02 7.07E-22 WR19_CRC3842_18B-18-5 L02 Other 3.34E+01 1.33E-08 2.96E+02 7.33E-63 135 APPENDIX C WALNUT APPENDIX Table C.1: A total of 1301 individual walnuts were scanned, comprising 147 different accessions. The accessions are identified by their UCACCSD code, which is the identifying system used by the Walnut Improvement Program at the University of California, Davis. The number of pseudoreplicates scanned per accession are denoted by N. Asterisk denotes not available. UCACCSD N UCACCSD N 03-001-3395 7 12-042-3 9 06-004-4 7 12-042-4 7 06-005-27 9 12-042-5 7 06-030-18 5 12-042-7 8 08-001-28 6 12-042-8 9 08-002-4 7 12-042-9 8 08-006-11 8 12-045-11 9 08-019-11 9 12-045-12 4 09-003-20 7 12-045-13 7 09-025-107 9 12-045-14 7 09-025-117 9 12-045-15 5 09-025-123 8 12-045-3 7 09-025-13 9 12-045-7 7 09-025-24 8 12-045-8 9 09-025-60 8 12-045-9 9 09-025-62 8 12-048-11 6 09-025-64 8 12-048-13 7 09-025-69 7 12-048-14 7 09-025-72 9 12-048-3 8 09-025-78 9 12-048-4 7 09-025-99 9 12-048-5 4 1 15 12-048-6 8 10-001-6 9 12-048-7 8 10-005-3 9 12-048-8 8 10-008-16 9 12-048-9 5 10-008-23 9 12-053-19 9 10-008-37 7 12-053-20 9 10-008-63 9 12-053-22 8 10-008-71 5 12-053-23 8 10-008-73 6 12-053-24 9 10-008-76 6 12-053-3 9 10-016-34 9 12-053-4 9 10-018-1 8 12-053-5 7 10-019-84 7 12-054-10 9 10-020-10 9 12-054-11 8 136 Table C.1 (cont’d) UCACCSD N UCACCSD N 10-020-12 9 12-054-12 5 10-020-17 4 12-054-14 9 10-020-3 9 12-054-17 8 10-020-59 9 12-054-2 8 10-020-9 7 12-054-21 9 10-024-18 6 12-054-22 6 11-003-4 6 12-054-23 9 11-020-2 9 12-054-3 8 11-029-9 9 12-054-4 5 11-030-3 9 12-054-8 7 12-002-10 8 12-054-9 5 12-002-11 7 12-059-21 7 12-002-15 9 12-059-22 7 12-002-17 6 12-059-23 5 12-002-18 8 12-059-24 6 12-002-20 7 12-059-25 7 12-002-21 6 12-059-27 7 12-002-23 6 12-059-28 5 12-002-25 8 12-059-29 7 12-002-26 9 12-059-30 9 12-002-27 8 12-059-32 8 12-002-28 8 2 15 12-002-3 8 3 13 12-002-5 9 48 16 12-002-6 7 49-049 15 12-002-7 9 53-113 8 12-004-1 5 54 15 12-005-1 6 59-129 17 12-005-2 9 6 18 12-005-4 8 64-172 26 12-005-5 4 64-182 9 12-005-7 5 67-011 16 12-005-8 8 85-023-2 8 12-037-27 7 85-043-1 14 12-037-29 8 91-136 7 12-037-30 9 95-011-16 7 12-040-8 8 95-022-26 7 12-042-1 8 95-026-37 6 12-042-2 8 * 96 137