DEVELOPMENT AND ASSESSMENT OF PREDICTIVE MODELS FOR IMPROVED SWINE FARMING

By

Junjie Han

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Animal Science––Doctor of Philosophy
Computational Mathematics, Science and Engineering––Dual Major

2022

ABSTRACT

DEVELOPMENT AND ASSESSMENT OF PREDICTIVE MODELS FOR IMPROVED SWINE FARMING

By

Junjie Han

Prediction of outcomes is critical in both swine breeding and management. This necessitates the development of predictive models that address challenges in swine farming. For predictive modeling, there have been significant advances in deep learning. Nevertheless, deep learning-based models still need to be adapted to specific swine farming problems, including genomic prediction and behavior analysis. Furthermore, there is not yet a clear guideline on how to validate a model in this field. The overarching goal of this dissertation was to validate a collection of predictive models for improved swine farming, with applications to precision management, phenotyping, and breeding.

The first study addressed the pig genomic prediction problem. Differential evolution was used to optimize deep learning (DL) hyperparameters that affect the predictive performance of DL models. The performance of optimized DL models was compared with that of “best practice” DL architectures selected from the literature and of baseline DL models with randomly specified hyperparameters. Optimized models showed clear improvement. Further, differential evolution saved considerable time compared with traditional optimization approaches, e.g., grid search.

Despite the success of genomic prediction, phenotyping has become a bottleneck in breeding programs, as it is still time-consuming and labor-intensive. Computer vision (CV) can be used to automate the phenotyping process. Nonetheless, there is a limited amount of public data for CV development in livestock farming. Most published CV applications in livestock farming were developed using rather small datasets, and their broader validity remains unknown. Therefore, the second study aimed at reviewing publicly available image datasets that have been used for CV algorithms in livestock farming and the validation methods in the related work. Through the review, we could not find public datasets that addressed pigs’ agonistic behaviors (negative social behaviors), which is an important topic in swine farming. Given this, the third study aimed at collecting a video dataset to study pigs’ agonistic behavior and adapting a state-of-the-art DL pipeline to classify pigs’ agonistic behaviors through video analysis. The pipeline was validated through various training-validation data partitions, where the training data were used for model development and the validation data were used for model evaluation. Results showed that splitting the training and validation sets at random led to overoptimistic estimates of model performance.

The last study focused on developing and validating a statistical model for the analysis of pigs’ social interactions. Generalized linear mixed models were fitted, and a Bayesian framework was used for parameter estimation and posterior predictive model checking. The predictive performance of the models varied depending on the validation strategy, where three strategies were defined: random cross-validation, block-by-social-group cross-validation, and block-by-focal-animal validation.
In conclusion, this dissertation provides information about how state-of-the-art models can be adapted for and validated in swine farming applications. Future directions of this research could aim at creating reference imagery datasets in swine farming that provide a platform for CV applications and at developing integrated computer vision systems, which would eventually assist in prediction tasks for improved pig management and breeding.

Copyright by
JUNJIE HAN
2022

To my wife, my mother, and my father.

ACKNOWLEDGEMENTS

I sincerely appreciate my family, my dissertation committee members, MSU faculty and staff, and friends. Without their support, it would not have been possible for me to accomplish my doctoral dissertation. Furthermore, my interactions with colleagues, professors, and students, and my personal positive experience of being mentored, gave me the opportunity to understand and appreciate the beauty of both science and humankind.

To my major advisor, Dr. Juan P. Steibel. Juan has patiently and selflessly guided me to become a qualified PhD student in many aspects: critical thinking, scientific writing, presentation skills, and attitude. He has always supported my decisions and given his wholehearted advice on every single question I asked. He was always approachable and has spent so much time mentoring me. In my personal life, I am grateful that he treated me as a friend and gave me advice beyond the academic scope. Great thanks to Juan!

To my dissertation committee members: Dr. Janice Siegford, Dr. Cedric Gondro, Dr. Robert Tempelman, Dr. Tami Brown-Brandl, and Dr. Dirk Colbry. The high standard they have set and expected of me motivated me to push my limits. Their encouragement and constructive suggestions guided me to become a better student. They are academic role models for me, and they have influenced me in setting my career goals as a scientist. It was my honor to be mentored by these great people.

I would not have been able to finish my PhD program without the support of my family. My father, Hongming Han, and my mother, Zhenai Piao, have always had my back no matter what happened. Special thanks to my wife, Jinhua Qian. She has been incredibly supportive and inspired me to overcome challenges. She has always been there to share my happiness and my low moments. She was understanding and considerate throughout my PhD journey, and she always cheered me up.

I thank my collaborators and fellow researchers, from whom I learned different ways of thinking about problems. Dr. Kenneth Reid, Dr. Joao Dorea, Dr. Tomas Norton, Andrea Parmiggiani, Dr. Daniel Morris, Raymond Lesiyon, Anna Bosgraaf, Chen Chen, and Dr. Gustavo de los Campos have provided valuable suggestions and help in my research projects. It was my pleasure to work with my collaborators. Also, I thank Kevin Turner and Chris Rozeboom for their assistance with my experiments at the MSU Swine Teaching & Research Center.

Finally, thanks to my fellow students from the Departments of Animal Science and Computational Mathematics, Science and Engineering at MSU. Special thanks to my great friends Lingkun Li, Xing Lu, Daoyang Chen, Mingzhe Li, Fei Zhang, Kangxu Wang, and Zinan Wang for their company during my PhD journey. They shared valuable thoughts with me and comforted me when I was in low spirits.

“I’m a million miles ahead of where I’m from, but I still have another million miles to go.”
Tim Bergling

PREFACE

Chapter 2 was formatted for publication in G3: Genes|Genomes|Genetics and corresponds to the peer-reviewed version of Han, J., C. Gondro, K.
Reid, and J. P. Steibel. 2021. Heuristic hyperparameter optimization of deep learning models for genomic prediction. G3 Genes|Genomes|Genetics. 11. doi:10.1093/g3journal/jkab032. Chapter 3 was formatted for the preprint version of Han, J., J. R. Dorea, T. Norton, A. Parmiggiani, D. Morris, J. Siegford, and J. P. Steibel. Publicly Available Datasets for Computer Vision in Precision Livestock Farming: A Review. Chapter 4 was formatted for publication in Computers and Electronics in Agriculture and corresponds to the pending review version of Han, J., J. Siegford, D. Colbry, R. Lesiyon, A. Bosgraaf, C. Chen, T. Norton, and J. P. Steibel. 2022. Under review. Evaluation of Computer Vision for Detecting Agonistic Behavior of Pigs in a Single-Space Feeding Stall Through Blocked Cross-Validation Strategies. Chapter 5 was formatted for publication in Applied Animal Behaviour Science. viii TABLE OF CONTENTS LIST OF TABLES………………………………………………………………………….……xii LIST OF FIGURES………………………………………………………………………...……xvi KEY TO ABBREVIATIONS………………………………………………………………..….xxi CHAPTER 1: GENERAL INTRODUCTION………………………………………………….…1 1. INTRODUCTION………………………………………………………………...………1 1.1 Deep learning for genomic prediction……………………………………….………3 1.2 Deep learning for phenotyping………………………………………………………4 1.3 Deep learning and statistical learning for animal-animal interaction………………..5 2. OBJECTIVES …………………………………………………………...……………….6 REFERENCES …………………………………………………………………………………....8 CHAPTER 2: HEURISTIC HYPERPARAMETER OPTIMIZATION OF DEEP LEARNING MODELS FOR GENOMIC PREDICTION……………………………………………….…..…13 1. ABSTRACT……………………………………………….………………………….…13 2. INTRODUCTION……………..……………………….…………….……………….…14 3. MATERIAL AND METHODS……….………………………………………...…….…17 3.1 Datasets……….…...……………………………………………….……...…….…17 3.1.1 Simulated datasets………………………………………….……...…….…17 3.1.2 Real dataset………………………………………….………….....…….…17 3.2 Deep learning and genomic prediction………………………………………….…18 3.2.1 Multilayer perceptron………………………………………….…………...18 3.2.2 Convolutional neural network……………………………………………...20 3.2.3 DL model training…………………………………………….....................22 3.2.4 Hyperparameter optimization………………..……………….....................22 3.3 Differential evolution algorithm for deep learning……………...............................23 3.3.1 Random key………………..………………................................................24 3.3.2 Initialization………………..………………................................................25 3.3.3 Mutation………………..………………......................................................26 3.3.4 Crossover………………..………………....................................................27 3.3.5 Selection………………..………………......................................................27 3.3.6 Top model selection………………..………………....................................28 3.4 Optimized model assessment through external validation……………...……….....29 3.5 Hardware and software………………..………………...........................................29 3.6 Data availability………………..………………......................................................30 4. RESULTS AND DISCUSSION………………..……………….....................................30 4.1 Optimization runtime profiles………………..……………….................................31 4.2 Characteristics of selected hyperparameters………………..………………...........36 4.3 Performance of optimized models under validation………………..………….......40 5. CONCLUSIONS………………..…………...………....………....………....…….….....44 ix 6. 
ACKNOWLEDGEMENTS ………………..…………...……....…………...……….....44 APPENDICES ………………..…………...………....…………....………………..……............46 APPENDIX A: SUPPLEMENTAL MATERIAL…...………………..……………..…......47 APPENDIX B: FILE S2.1………...………..…………...………....………………...…......55 REFERENCES ………………..…………...………....…………....……………...…..................63 CHAPTER 3: PUBLICLY AVAILABLE DATASETS FOR COMPUTER VISION IN PRECISION LIVESTOCK FARMING: A REVIEW ……………………………...…….…..…68 1. ABSTRACT………...………....………...………...………...………...……...................68 2. INTRODUCTION………...………....…………………...………...…….......................69 3. METHODOLOGY………...………....…………………...………...……......................71 3.1 Literature search parameters………...………....…………………...………….......71 3.2 Eligibility criteria………...………....…………………...………...…….................71 3.3 Data extraction………...………....…………………...………...…….....................71 4. RESULTS………...………....…………………...………...……....................................72 4.1 Animal subjects………...………....…………………...………...……....................73 4.2 Recording setup………...………....…………………...………...……...................75 4.3 Review of selected datasets by computer vision task………...………....................77 4.3.1 Entire body detection………...………..........………………………...........78 4.3.2 Body part detection………...………..........………………………..............80 4.3.3 Segmentation………...………..........………………………........................81 4.3.4 Behavior recognition………...………..........………………………............81 4.3.5 Identification………...………..........………………………........................83 4.3.6 Tracking………...………..........………………………...............................85 4.4 Validation strategy………...………..........………………………...........................86 5. DISCUSSION………...………..........………………………...………………...............90 6. CONCLUSION ……...………..........………………………...………………................94 APPENDIX………...………..........…………………………….………………..........................96 REFERENCES ………...………..........………………………...………………..........................99 CHAPTER 4: EVALUATION OF COMPUTER VISION FOR DETECTING AGONISTIC BEHAVIOR OF PIGS IN A SINGLE-SPACE FEEDING STALL THROUGH BLOCKED CROSS-VALIDATION STRATEGIES………...………..........………………………...……..105 1. ABSTRACT………...………..........…………………..........………………….……..105 2. INTRODUCTION………...………..........…………………..........……………….…..106 3. MATERIAL AND METHODS………...………..........…………………..........….…..108 3.1 Experimental design………...………..........…………………......................….….108 3.1.1 Recording schedule and specifications………...……….................….…..108 3.1.2 Behavior ethogram and dataset………...………............................…..…..110 3.1.3 Validation strategies………...………............................……………...…..113 3.2 Computer vision algorithm………...………...........................…...……...….……..114 3.2.1 Deep learning pipeline for video classification…………………………..114 3.2.2 Feature extraction with convolutional neural network……………….…..115 3.2.3 Long short-term memory………………………………………………....116 3.2.4 Hyperparameters………………………………………………….……....117 3.2.5 Region of interest………………………………………………………....118 x 3.2.6 Deep learning training accounting for class-imbalance…………………..119 3.2.7 Evaluation matrices…………………………………………………….....120 4. RESULTS AND DISCUSSION…………………………………………………….....120 5. CONCLUSION…………………………………………………………………….......129 6. DECLARATION OF COMPETING INTEREST……………………...………………130 7. ACKNOWLEDGEMENTS…………………………………………………………....130 8. 
DATA AVAILABILITY……………………………………………………………....130 APPENDIX…...…………………………………………………..……………….………........131 REFERENCES …………………………………………………..……………….….................139 CHAPTER 5: ANALYSIS OF SOCIAL INTERACTIONS IN GROUP-HOUSED ANIMALS USING DYADIC LINEAR MODELS…………………………………………………..…......145 1. ABSTRACT……………………………………………………………………….....145 2. INTRODUCTION……………………………………………………………..…........146 3. METHODS AND MATERIALS………………………………………………………148 3.1 Data from social interactions should be analyzed as dyadic data……………...…148 3.1.1 Social interaction data……………...…………...………...………...…….149 3.1.2 Analysis of dyadic data from directional social interactions…………..…149 3.2. Experimental data analysis: attacking time in group-housed pigs…………..…...150 3.2.1 Experiment setup…………..…………..…………..…………..…….…...150 3.2.2 Analysis model…………..…………..…………..…………..…….……...151 3.2.3 Modeling of (co)variances…………..…………..…………..…………....153 3.2.4 Estimation…………..…………..……..…………………...…..………....154 3.2.5 Validation strategies and posterior predictive checks: how well does the model fit the data? …………..…………..……..………………….....………....155 3.3 Ethical approval…………..…………..……..………………………....………....156 4. RESULTS…………..…………..……..………………………....………………….....157 4.1 Estimation of animal-specific effects, dyad-specific effects, and (co)variance components…………..…………..……..………………………....……………….....157 4.2 Predictive performance in different validation strategies………….…………......158 5. DISCUSSION………….…………….………….………….………….…………........161 6. CONCLUSION………….…………….………….………….………….…………......168 7. ACKNOWLEDGEMENTS………….…………….………….………….…………....168 APPENDIX…..………….…………….………….………….…………....................................169 REFERENCES ………….…………….………….………….…………...................................175 CHAPTER 6: GENERAL DISCUSSION ………….…………….………………….................179 1. DISCUSSION………….…………….…………………...............................................179 2. FUTURE DIRECTIONS………….…………….………………..................................182 REFERENCES ………….…………….………………..............................................................184 xi LIST OF TABLES Table 2.1 Parameter space for optimized hyperparameters. Hyperparameter space and range (see details in File S2.1). N represents sample size. 𝛼1 =0.001 for the simulated datasets and 𝛼1 =0.01 for the real pig dataset. 𝛼2 =0.01 for the simulated datasets and 𝛼2 =0.1 for the real pig dataset ….23 Table 2.2 Runtime profile for the DE approach. MLP, multilayer perceptron; CNN, convolutional neural network; Avg. runtime, average runtime for one DE iteration (each iteration fits two models); Num. iterations, the total number of iterations used in DE; min, minutes; hr, hours ……………………………………………………………………………………………..31 Table 2.3 Hyperparameters of selected MLP models from each population. SP, simulated pig dataset; SC, simulated cattle dataset; RP, real pig dataset; DE No., differential evolution of different data partition; No. layer(s), number of hidden layers; No. neurons, number of neurons according to the number of hidden layers …...……………….…………………………………..38 Table 2.4 Hyperparameters of selected CNN models from each population. SP, simulated pig dataset; SC, simulated cattle dataset; RP, real pig dataset; DE No., differential evolution of different data partitions; No. layers, number of convolutional layers; No. filters, number of filters applied based on No. layers; FCL, size (number of neurons) of the fully connected layer after flatten layer ...……………………………………………………………………………….……39 Table S2.1 Adaptive hyperparameter space for the number of neurons. 
Number of neurons (nodes) given the depth of network (number of hidden layers, HL) in multilayer perceptron models …………………………………………………………………………………………...47 Table S2.2 Adaptive hyperparameter space for number of filters. Number of filters (kernels) given the depth (number of convolutional layers) of convolutional neural network …………………………………………………………………………………………..47 Table S2.3 Minimum length of feature maps applied to each layer of convolutional neural network. Conv: Convolutional layer ……………………………………………………………………….47 Table S2.4 Distributions of optimized hyperparameters related to multilayer perceptron architectures for simulated pig data. Pop 1-5: MLP solutions to five data partitions (five differential evolution runs) …………………………………………………………………………………...48 Table S2.5 Distributions of optimized hyperparameters related to CNN architectures for simulated pig data. Pop 1-5: CNN solution populations of five differential evolutions runs. Size of fully connected layer: the number of neurons applied in the fully connected layer (after flatten layer). Q0.05, 5% quantile; Q0.95, 95% quantile ………………………………………................……..48 Table S2.6 Distributions of optimized hyperparameters related to multilayer perceptron architectures for the simulated cattle data. Pop 1-5: MLP solutions to five data partitions (five differential evolution runs) ………………………………………………………………………48 xii Table S2.7 Distributions of optimized hyperparameters related to CNN architectures for simulated cattle data. Pop 1-5: CNN populations of five differential evolutions runs. Size of fully connected layer: the number of neurons applied in the fully connected layer (after flatten layer). Q0.05, 5% quantile; Q0.95, 95% quantile …………………………………………………………………...49 Table S2.8 Distributions of optimized hyperparameters related to multilayer perceptron architectures for the real pig data. Pop 1-5: MLP solutions to five data partitions (five differential evolution runs) …………………………………………………………………………………...49 Table S2.9 Distributions of optimized hyperparameters related to CNN architectures for real pig data. Pop 1-5: CNN populations of five differential evolutions runs. Size of fully connected layer: the number of neurons applied in the fully connected layer (after flatten layer). Q0.05, 5% quantile; Q0.95, 95% quantile ……………………………………………………………………………..49 Table S2.10 Distributions of optimized hyperparameters related to MLP model compilation and fitting for simulated pig data. Pop 1-5: MLP solution populations of five differential evolution runs. Q0.05, 5% quantile; Q0.95, 95% quantile ………………………………………………………..50 Table S2.11 Distributions of optimized hyperparameters related to CNN model compilation and fitting for simulated pig data. Pop 1-5: CNN solution populations of five differential evolution runs. Q0.05, 5% quantile; Q0.95, 95% quantile ………………………………………………….50 Table S2.12 Distributions of optimized hyperparameters related to MLP model compilation and fitting for simulated cattle data. Pop 1-5: MLP solution populations of five differential evolution runs. Q0.05, 5% quantile; Q0.95, 95% quantile ………………………………………………….50 Table S2.13 Distributions of optimized hyperparameters related to CNN model compilation and fitting for simulated cattle data. Pop 1-5: CNN solution populations of five differential evolution runs. Q0.05, 5% quantile; Q0.95, 95% quantile ………………………………………………….51 Table S2.14 Distributions of optimized hyperparameters related to MLP model compilation and fitting for real pig data. 
Pop 1-5: MLP solution populations of five differential evolution runs. Q0.05, 5% quantile; Q0.95, 95% quantile …………………………………………………..……51 Table S2.15 Distributions of optimized hyperparameters related to CNN model compilation and fitting for real pig data. Pop 1-5: CNN solution populations of five differential evolution runs. Q0.05, 5% quantile; Q0.95, 95% quantile ………………………………………………………..51 Table S2.16 Selected MLP and CNN architecture derived from other studies. No. layers, the number of fully connected layers or convolutional layers; No. neurons (filters), the number of neurons or filters adaptive based on the number of layers. In the No. layers column, 1+1 means one convolutional layer plus one fully connected layer …………………………………………..52 Table 3.1 Description of extracted data ...……………………………………………………..….72 Table 3.2 Overview of public pig and cattle datasets utilized for computer vision tasks in precision livestock farming ………………………………………………………………………………...73 xiii Table 3.3 Characteristics of animal subjects. *: to specify an exhaustive list of units/ranges if applicable. Multiple pens mean that the number of pens is more than two while the exact number remains unknown ………………………………………………………………………………..75 Table 3.4 Recording setup and schedule of publicly available datasets for computer vision in livestock farming. RGB, red-green-blue; RGB-D, RGB and depth. Multiple weeks/days mean that the experiment lasted more than two weeks/days while the exact number remained unknown. Varying resolutions represent that more than two resolutions are involved ……………………...77 Table 3.5 Identified public datasets for animal entire body detection via computer vision. Code availability: whether computer code is available for entire body detection. *: an annotated image is considered as an image paired with an external file that includes manually annotated bounding box coordinates. Varying resolutions mean that there are more than two resolutions in the dataset ……………………………………………………………………………………………79 Table 3.6 Identified public datasets for animal body part detection via computer vision. *: an annotated image is considered as an image paired with an external annotation file. Code availability: whether computer code is available for body part detection ………………………...80 Table 3.7 Identified public datasets for animal behavior recognition via computer vision. i: an annotated file is considered as an imagery file paired with an external annotation file. ii: the classes were not explicitly defined. Code availability: whether computer code is available for behavior recognition ……………………………………………………………………………………….83 Table 3.8 Public datasets for animal identification via computer vision. i: if an individual as the ROI means that the individual animal is first localized and then identified; otherwise, an ID class label is assigned to the entire image. ii: an annotated image is considered as an image assigned with an ID label. Code availability: whether computer code is available for identification ……………84 Table 3.9 Validation strategies used for reviewed datasets in their original applications ……..….87 Table 3.10 Evaluation metrics by computer vision tasks and validation strategies. 
A range is provided if more than one point estimate were reported for the specific validation strategy …..…89 Table S3.1 Website or URLs of publicly available animal datasets for computer vision ………...97 Table S3.2 Metrics for performance evaluation in different validation strategies ………………..98 Table 4.1 Rotation schedule of social groups for the two experimental pens. SG, social group ...109 Table 4.2 Ethogram for the agonistic behaviors in pigs. *: ear-to-body was merged into head-to- body …………………………………………………………………………………………….111 Table 4.3 Explored hyperparameters and related work. CNN, convolutional neural network; LSTM, long short-term memory ……………………………………………………………………..…118 xiv Table S4.1 Hyperparameters configuration …………………………………………………….132 Table S4.2 Overall accuracy for different regions of interest …………………………………...132 Table S4.3 Available sample size by validation strategies. *: the training set and the testing set were interchangeable depending on which feeder was used for training/testing. #: once a training set size N1 was determined, the remaining N2=15,679 - N1 samples were considered as the testing set ……………………………………………………………………………………………....132 Table 5.1 Estimated posterior statistics for fixed effects and (co)variance components explained on total attacking duration between the giver animal and the receiver animal. Q: quantile ..…....158 Table 5.2 Metrics for evaluating predictive performance of the model under different validation strategies. AUC, area under ROC (Receiver Operating Characteristics) curve; RMSE, root mean square error; CV, cross-validation ………………………………………………………….......160 Table S5.1 Summary of MCMC samples. Q, quantile; n_eff, effective sample size ……………174 xv LIST OF FIGURES Figure 2.1 Multilayer Perceptron (MLP) for genomic prediction of a single trait with M SNP markers. The network has an input layer, two fully connected hidden layers and an output layer. Each node’s input in the hidden layers is a transformation of the weighted sum of the output from the previous layer. The number of nodes in hidden layers decrease as the depth of the MLP increases, to facilitate representation learning …………………………………………………...19 Figure 2.2 1-d Convolutional neural network (CNN) for genomic prediction of a single trait with M SNP markers. The network has an input layer, two convolutional layers with their corresponding pooling layers, a fully connected hidden layer and an output layer. Each convolutional layer applies a number of filters to the output of the previous layer and its output is subsequently summarized by a pooling layer ……………………………………………….………………………………..20 Figure 2.3 Pseudocode for differential evolution algorithm ……………………………………...24 Figure 2.4 Summary of the random key (mapping function) used to transform numeric vectors into discrete levels of hyperparameters. The numeric vector can be subject to mutation and recombination. The mapping is used to transform the result into a meaningful set of hyperparameters that can be used to fit a model and obtain a fitness to select numeric vectors ......26 Figure 2.5 History of differential evolution by algorithm and data partition in the simulated pig dataset over 2,000 iterations. Mean and standard deviation of the fitness (correlation between the predicted and true phenotype) were computed given each population. (A) Mean fitness of five populations by fitting multilayer perceptron (MLP) models. (B) Standard deviation of fitness within each population (MLPs). 
(C) Mean fitness of five populations by fitting convolutional neural network (CNN) models. (D) Standard deviation of fitness within each population (CNNs) …………………………………………………………………………………………..33 Figure 2.6 History of differential evolution by algorithm and data partition in the simulated cattle dataset over 2,000 iterations. Mean and standard deviation were computed given each population. (A) Mean fitness of five populations by fitting multilayer perceptron (MLP) models. (B) Standard deviation of fitness within each population (MLPs). (C) Mean fitness of five populations by fitting convolutional neural network (CNN). (D) Standard deviation of fitness within each population (CNNs) …………………………………………………………………………………………..35 Figure 2.7 History of differential evolution by algorithm and data partition in the real pig dataset over 10,000 iterations. Mean and standard deviation were computed given each population. (A) Mean fitness of five populations by fitting multilayer perceptron (MLP) models. (B) Standard deviation of fitness within each population (MLPs). (C) Mean fitness of five populations by fitting convolutional neural network (CNN) models. (D) Standard deviation of fitness within each population (CNNs) ………………………………………………………………………………36 Figure 2.8 Boxplots for the predictive performance of MLPs and CNNs using different hyperparameters (simulated pig dataset). Models were tested on five data partitions of the simulated pig dataset. Statistics represent external (cross) validations by fitting the same model 30 xvi times. The left three boxes are for MLP models and the right three boxes are for CNN models. Null box means the model did not converge. Random, random hyperparameters; Perez, hyperparameters recommended by Pérez-Enciso and Zingaretti (2019); Opt, optimized hyperparameters using DE. Abbreviations stand for the same meaning in Figure 2.9 and Figure 2.1 ………………………………………………………………………………………………..42 Figure 2.9 Boxplots for the predictive performance of MLPs and CNNs using different hyperparameters (simulated cattle dataset). Models were tested on five data partitions of the simulated cattle dataset. Statistics represent external (cross) validations by fitting the same model 30 times. The left three boxes are for MLP models and the right three boxes are for CNN models …………………………………………………………………………………………...43 Figure 2.10 Boxplots for the predictive performance of MLPs and CNNs using different hyperparameters (real pig dataset). Models were tested on five data partitions of the real pig dataset. Statistics represent external (cross) validations by fitting the same model 30 times. The left three boxes are for MLP models and the right three boxes are for CNN models. Null box means the model did not converge …………………………………………………………………………..45 Figure S2.1 Pseudocode for adaptive filter size. Conv, convolutional layer; int(x), convert x into the nearest integer; floor(x), get the largest integer that is smaller or equal to x ………………..53 Figure S2.2 Mean predictive performance and error bars across datasets and data partitions. The error bar represents the mean ± standard deviation of cross validation by fitting the same model 30 times. Pink, green, and blue bars correspond to GBLUP, MLP, and CNN models, respectively. MLP, multilayer perceptron; CNN, convolutional neural network; GBLUP, genomic best linear unbiased prediction ………………………………………………………………………………54 Figure 3.1 Examples of image data and key annotations for different computer vision tasks. 
Panel a) shows an example for entire body detection where each pig is enclosed in a bounding box. Panel b) presents an instance for body part detection, where heads of pigs are marked in red and rear parts of pigs are marked in blue. Panel c) shows an example of segmentation where each pig has a polygon mask. Panel d) presents an example of behavior recognition through an individual image, where lying pigs are enclosed in red bounding boxes and blue bounding boxes indicate pigs that are not lying. Panel e) is an example of behavior recognition by assigning a label to an image sequence. Panel f) shows an example of animal identification where each individual is assigned with a bounding box and a unique ID label. Panel g) displays an example of a tracklet across three consecutive frames ………………………………………………………………………………79 Figure 4.1 Top-down views of pens and feeding stalls. Panels A and B (infrared images) show the center views of Pen 1 and Pen 2, respectively. Panels C and D are top-down views of the feeding stalls for Pen 1 and Pen 2, respectively …………………………………………….……………110 Figure 4.2 Examples for generating episodes for no-contact, head-to-body, levering, and mounting events ………………………………………………………………………………...................113 Figure 4.3 Deep learning pipeline for pig’s aggressive behavior detection based on videos. Graph for ResNet-50 Architecture was obtained from Talo (2019) ………………………………....…115 xvii Figure 4.4 Diagram of long short-term memory. The figure was redrawn, and the original figure was obtained from https://colah.github.io/posts/2015-08-Understanding-LSTMs/ ………...….117 Figure 4.5 Explored regions of interest …………………………………………………...…….119 Figure 4.6 Bar plots of recall and precision for random, block-by-time, and block-by-feeder cross- validations. HB, head-to-body; L, levering; M, mounting; NC, no-contact ……...……………..122 Figure 4.7 Confusion tables of three validation strategies. Tables were based on the result by merging statistics over 5 replicates. Each validation strategy has five reps. Prediction means classified result from our model and Target means ground-truth labels. Panel A, random validation; Panel B, block-by-time validation; Panel C, block-by-feeder validation (Feeder 1 as testing set); Panel D, block-by-feeder validation (Feeder 2 as testing set). NC, no-contact; M, mounting; L, levering; HB, head-to-body …………………………………………….………………………126 Figure 4.8 Error patterns for misclassification. The 1st, 10th, 20th, and 30th frames of example episodes were selected for display purpose. a), head-to-body and no-contact confused by no- contact and head-to-body, respectively; b), head-to-body misclassified as levering; c-d), head-to- body false predicted as mounting; e), levering confused by mounting. Panels a) and b) were common misclassification patterns across all three validation scenarios. Panels c-e) only represent block-by-time validation ……………………………………………………………………….127 Figure 4.9 Error patterns for misclassification in block-by-feeder validation. The 1st, 10th, 20th, and 30th frames of example episodes were selected for display purpose. a), head-to-body misclassified as mounting, respectively; b-c), levering misclassified as head-to-body; d), levering false predicted as mounting; e), mounting confused by levering; f), mounting confused by head- to-body; g-h), levering false classified as mounting ……………………………………….……128 Figure S4.1 Average accuracy of different hyperparameter sets. 
Units, number of hidden units in long short-term memory module; bi-direct, bi-directional long short-term memory; standard, standard one-way long short-term memory …………………………………………….………133 Figure S4.2 Training history of three validation strategies. Solid lines are for training curves and dashed lines show the testing curve. Each validation strategy has five reps. Panel A, random validation; Panel B, block-by-time validation; Panel C, block-by-feeder validation (Feeder 1 as testing set); Panel D, block-by-feeder validation (Feeder 2 as testing set) ……………………...134 Figure S4.3 Scatter plots of individual score for each episode given the first six principal components (grouped by feeder). Blue dots represent Feeder 1 and green dots are for Feeder 2. Principle component analysis was done using feature vectors extracted from ResNet-50 ……...135 Figure S4.4 Testing sets breakdown (on the left of panels) and misclassification breakdown (on the right of panels) by social group (A), week (B), feeder (C), mark (D), and behavior category (E) in random cross-validation (five replicates). Marked pigs meant back-marked pigs with Arabic numerals; Unmarked pigs were pigs without artifactual marks on their backs ………………….136 Figure S4.5 Testing set breakdown (on the left of panels) and misclassification breakdown (on the right of panels) by social group (A), week (B), feeder (C), mark (D), and behavior category (E) in xviii block-by-time validation. Plots on the left of panels show the proportion/count for a single dataset, while plots on the right of panels stand for the statistics across 5 replicates. Marked pigs meant back-marked pigs with Arabic numerals; Unmarked pigs were pigs without artifactual marks on their backs ………………………………………………………………………………………136 Figure S4.6 Testing set breakdown (on the left of panels) and misclassification breakdown (on the right of panels) by social group (A), week (B), mark (C), and behavior category (D) in block-by- feeder validation, whereas Feeder 1 was the testing set. Plots on the left of panels show the proportion/count for a single dataset, while plots on the right of panels stand for the statistics across 5 replicates. Marked pigs meant back-marked pigs with Arabic numerals; Unmarked pigs were pigs without artifactual marks on their backs …………………………………………………...137 Figure S4.7 Testing set breakdown (on the left of panels) and misclassification breakdown (on the right of panels) by social group (A), week (B), mark (C), and behavior category (D) in block-by- feeder validation, whereas Feeder 2 was the testing set. Plots on the left of panels show the proportion/count for a single dataset, while plots on the right of panels stand for the statistics across 5 replicates. Marked pigs meant back-marked pigs with Arabic numerals; Unmarked pigs were pigs without artifactual marks on their backs …………………………………………………...138 Figure 5.1 Panel a), directional dyadic interaction intensity matrix (elements in the matrix represent attacking duration); row sums and column sums are shown in the margins of the matrix. Panel b), a truncated long-format table that is re-arranged from the interaction matrix; each row represents a record that is the attacking duration from a giver animal to a receiver animal. 
0.00 means observed zero while 0 means structural zero that we do not consider as an actual interaction …………………………………………………………………………………..…..147 Figure 5.2 Illustration of a dyadic interaction model as an example that partitions the response into giver effects, receiver effects, and dyad-specific effects. Blue lines/arrows mean fixed effects, and red represents random effects, e stands for the residual term ...………………………………....153 Figure 5.3 Proportion of zeros of validation set y (dark lines), with proportions of zeros for 500 simulated datasets 𝑦̃ drawn from the posterior predictive distribution (lighter bins). A), the model that used all data points for model training to predict the same dataset; B) 5-fold cross-validation; C) Block-by-social-group cross-validation; D) 5 replicates of block-by-focal-animal validation ……………………………………………………………………………………….159 Figure 5.4 Distribution for the mean value of all observations across replicates. Mean of the validation set y (dark solid line) is compared with the means of 500 simulated datasets ỹ drawn from the posterior predictive distribution (lighter bins). We compared the logarithm of the observed and the simulated variables i.e. log(y+1) and log(𝑦̃ +1). A), the model that used all data points for model training to predict the same dataset; B) 5-fold cross-validation; C) Block-by- social-group cross-validation; D) 5 replicates of block-by-focal-animal validation .…………...160 Figure S5.1 Trace plots of posterior estimates of effects and variance components …………….171 Figure S5.2. Autocorrelation plots by chain and by parameters ………………………………...172 xix Figure S5.3 Trace plots of variance components when the random dyad effect is included in the model …………………………………………………………………………………………...172 Figure S5.4 Autocorrelation plots by chain and by variance components when the random dyad effect is included in the model ………………………………………………………………….173 xx KEY TO ABBREVIATIONS adam = Adaptive moment estimation adj = Adjusted AUC = Area under receiver operating characteristics curve CNN = Convolutional neural network CPU = Central processing unit CV = Computer vision DE = Differential evolution DL = Deep learning DOI = Digital object identifier EB = Ear-to-body elu = Exponential linear unit Eq = Equation GBLUP = Genomic best linear unbiased prediction GLMM = Generalized linear mixed models GPU = Graphics processing unit h2 = Heritability HB = Head-to-body hr = Hour ID = Identification L = Levering LSTM = Long short-term memory M = Mounting mAP = mean average precision MCMC = Markov chain Monte Carlo MLP = Multilayer perceptron MOTA = Multiple objects tracking accuracy xxi MSE/mse = Mean squared error N = Number of samples nadam = Nesterov-accelerated adaptive moment estimation NC = No-contact obs = Observed Opt = Optimized OSF = Open science framework P = P-value PLF = Precision livestock farming QTL = Quantitative trait loci R2 = R squared relu = Rectified linear unit RGB = Red-green-blue RGB-D = Red-green-blue and depth RMSE = Root mean square error rmsprop = Root mean square propagation ROC = Receiver operating characteristics ROI = Region of interest SD = Standard deviation selu = Scaled exponential linear unit SG = Social group sgd = Stochastic gradient descent SNP = Single nucleotide polymorphism softmax = Normalized exponential function US = United States US = United States of America USDA = United States Department of Agriculture xxii CHAPTER 1: GENERAL INTRODUCTION 1. 
INTRODUCTION

Currently, the United States (US) is the world’s third-largest producer and consumer of pork products, exporting over 20% of the commercial pork produced (USDA, 2019). Such success is partially due to structural and organizational changes to the US swine industry (Pairis-Garcia et al., 2016). In the past decades, the US swine industry has transitioned from small, private farms into larger, integrated systems (Pairis-Garcia et al., 2016). Swine farmers are interested in stable and predictable conditions for pig production and marketing, which helps achieve acceptable margins between costs and profits (Commandeur, 2006). To improve profitability in swine farming, a common strategy is to scale up production (Öhlund et al., 2017). Nevertheless, current internal migration trends from rural to urban areas are leading to a decreasing number of farmers (Berckmans, 2017), resulting in an increased workload per farmer. Furthermore, as both population and income increase globally, there is an ever-growing demand for pork products (Alexandratos and Bruinsma, 2012). Production systems need to continue to adapt to meet these challenges.

Breeding is a powerful tool to improve swine farming efficiency (Oliviero et al., 2019). In livestock farming, breeding refers to the selection of individuals as parents to produce better-performing offspring (Hill, 2001). Livestock breeding has enabled a great increase in the efficiency and economic benefit of swine farming (Hayes et al., 2013). An important prediction problem in livestock breeding is how to estimate animals’ breeding values. Traditional breeding programs relied simply on the animal pedigree and on phenotypic values, i.e., quantifiable traits such as weight in pounds and feed intake in grams. Modern breeding programs use genomic prediction, which refers to the prediction of individual phenotypic/breeding values using genomic information. To implement genomic prediction, advanced prediction models are needed that relate animal genomics to phenotypic values.

Management is another powerful tool for improved swine farming efficiency (Peltoniemi et al., 2019). Daily management contributes to the optimization of costs and benefits, leading to increased profitability in swine farming (Agostini et al., 2014). An important fact is that, to manage pigs, farmers need to predict the status of the pigs (e.g., whether the pigs are fed well and what their health condition is). However, especially in large-scale pig farms, it is challenging to predict and manage pigs at the individual level. Therefore, prediction models for individual animals will positively assist in pig management.

Prediction of outcomes is critical in both swine breeding and management. As a result, researchers, swine breeders, and swine farmers are increasingly interested in predictive modeling that assists on-farm decision making (Tzanidakis et al., 2021). Predictive modeling refers to the process of developing a statistical or computational tool that forecasts future outcomes with the aid of existing data (Kuhn and Johnson, 2013). Although the goal of predictive modeling is to predict future outcomes, relatively few models are developed and validated in enough depth to match that goal. Validating a prediction model essentially means comparing predicted values to the actual observed outcomes in a population (Ramspek et al., 2021).
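To make this comparison concrete, the sketch below is a hedged, generic illustration (simulated records with a deliberately built-in grouping structure, a plain ridge regression, and scikit-learn’s KFold and GroupKFold; it is not the data or code used in the later chapters). It fits the same model under a random training-validation split and under a split that blocks on a grouping factor such as social group, and then correlates predictions with the held-out observations; the paragraphs that follow discuss why the two strategies can give very different answers.

```python
# Minimal sketch: random vs. blocked training-validation splits on simulated,
# group-structured records (hypothetical data; not the dissertation's datasets).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, KFold

rng = np.random.default_rng(0)
n_groups, n_per_group, n_features = 10, 30, 50
n = n_groups * n_per_group
groups = np.repeat(np.arange(n_groups), n_per_group)  # e.g., social group of each record

# Predictors carry group-level structure that is unrelated to the outcome...
X = rng.normal(size=(n_groups, n_features))[groups] + rng.normal(size=(n, n_features))
# ...while the outcome is a separate group-level effect plus noise.
y = rng.normal(size=n_groups)[groups] + rng.normal(size=n)

def mean_cv_correlation(splitter):
    """Fit on the training folds; correlate predictions with held-out observations."""
    cors = []
    for train_idx, val_idx in splitter.split(X, y, groups):
        model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
        cors.append(np.corrcoef(model.predict(X[val_idx]), y[val_idx])[0, 1])
    return round(float(np.mean(cors)), 2)

# Random folds mix records from the same group across training and validation sets.
print("random 5-fold CV: ", mean_cv_correlation(KFold(n_splits=5, shuffle=True, random_state=1)))
# Blocked folds hold out whole groups, so validation records share no group with training.
print("block-by-group CV:", mean_cv_correlation(GroupKFold(n_splits=5)))
```

Because the group-level signal in the predictors is unrelated to the outcome, any apparent accuracy under the random split comes from memorizing group identity rather than from signal that generalizes to new groups, which is precisely the kind of overoptimism discussed next.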
For validation purposes, data are split into two parts: a training set that is used for model development and a validation set that is used for model assessment. The common practice is to split the data at random, i.e., random validation. The underlying assumption of random validation is that the training set and validation set are independent. However, dependence structures may exist in the data (e.g., temporal and spatial structures), which violates the assumption of independence between the training set and validation set and leads to overoptimistic results (Roberts et al., 2017). Unfortunately, data structures are often overlooked when splitting the data for validation purposes. Therefore, a proper design of the validation strategy, i.e., the training-validation data split, is necessary to determine a prediction model’s generalizability and reproducibility to new farming environments and pigs (Ramspek et al., 2021).

Despite the substantial advancement in predictive modeling methods (Putka et al., 2018), few attempts have been made to introduce these advanced predictive modeling methods to practical swine farming. There is a need to adapt state-of-the-art predictive models for swine farming, as the predictive ability of advanced methods, e.g., deep learning, is less well studied on swine breeding and behavior recognition problems. Further, there is not yet a clear guideline on how to properly split the data when developing and validating a novel predictive model in this field. Therefore, collectively, this study aims at filling this gap by investigating and assessing several state-of-the-art predictive models that fit into the swine farming context.

1.1 Deep learning for genomic prediction

Deep learning, a part of the statistical learning toolbox, has dramatically improved state-of-the-art applications in computer vision, speech recognition, and genomics (Lecun et al., 2015). Deep learning (DL) is a set of representation learning methods in which a machine can be fed raw data and automatically discovers the representations needed for prediction or classification, with multiple levels of simple but non-linear modules transforming the representation at one level into a representation at a higher, slightly more abstract level (Lecun et al., 2015). Its flexibility allows researchers to apply DL in several contexts of prediction problems. For instance, DL has already been applied to genomic and phenotypic prediction in plants (Crossa et al., 2019; Montesinos-López et al., 2018), human traits (Bellot et al., 2018), and breeding values of bulls (Abdollahi-Arpanahi et al., 2020). Nevertheless, the potential of DL-based genomic prediction remains unknown in swine farming. Thus, a better understanding of DL applications in swine genomic prediction will bring valuable information to the swine industry.

1.2 Deep learning for phenotyping

Genomic prediction tools have made great contributions to swine breeding programs. Thanks to the technological developments in genomic studies, genotyping costs have decreased significantly in the past years. However, phenotyping, the process of determining an individual’s observable trait(s), has become a bottleneck in breeding programs, as it is still time-consuming and labor-intensive, and it may be even more costly than genotyping (Watanabe et al., 2017). Thus, there is a need for phenotyping tools that accelerate swine breeding programs. Deep learning is also suitable for high-throughput phenotyping by automating the analysis of video recordings.
As DL is predominant in computer vision (CV), which refers to the use of artificial intelligence to automatically extract useful information from digital imagery, DL can be used to predict traits of animals through digital images. Compared to the traditional approach that requires manual observation, CV has the advantages of being low-cost, objective, and non-interventional, and of generating information on a continuous scale (Chen et al., 2021; Li et al., 2021). Researchers have already discovered this promising tool and started to develop DL models for pig posture/behavior analysis in their experiments (Chen et al., 2020; Liu et al., 2020; Nasirahmadi et al., 2019; Zhang et al., 2020).

A limitation of DL centers around the data that are used to develop DL models. Deep learning is known as a data-hungry approach, and it typically requires a large number of samples in order to train models with acceptable performance (Lu et al., 2017; Marcus, 2018). For DL-based CV in general, there exist reference datasets, e.g., COCO (Lin et al., 2014) and ImageNet (Jia Deng et al., 2009), which consist of thousands of images with millions of instances annotated by experts and allow developers to test algorithms. To the best of our knowledge, we do not have such datasets available in livestock farming. Therefore, significant effort is needed to create public image datasets in livestock farming, which is vital for developing DL-based high-throughput phenotyping. To achieve this, a survey of public resources for existing animal CV datasets is a necessary first step to create reference data in the field.

1.3 Deep learning and statistical learning for animal-animal interaction

Agonistic behaviors are part of the standard repertoire of behaviors in pigs and include aggressive, submissive, and defensive behaviors in competitive situations (Gasser et al., 2009; Machado et al., 2017). In swine farming, pigs present agonistic behaviors in the form of fighting over the control of resources or space and performing shows of force to establish a social hierarchy in the group (Machado et al., 2017). Consequently, pigs receive injuries and suffer stress that affect their welfare and productivity (Ekkel et al., 1995). Therefore, it is necessary to target traits that are related to pigs’ agonistic behaviors. Understanding these traits will bring valuable information to swine welfare, management, and breeding.

There are two key steps in the analysis of pigs’ agonistic behaviors: 1) measuring the traits of interest and 2) analyzing the measures. It is noteworthy that animal-animal interaction is one of the intrinsic components of agonistic behavior. Measuring or recognizing interactive behaviors requires observing the motions of at least two animals simultaneously, which is more complicated than measuring behaviors that pertain to an individual. To observe and measure traits related to pigs’ agonistic behaviors, the standard approach is manual annotation of videos, which is laborious and time-consuming. A DL-based CV model seems to be helpful to measure these traits, as it automates behavior recognition. To date, most DL-based CV applications concentrate on individual pig behaviors, and publicly available imagery datasets in this field were designed and annotated for studying behaviors of individual pigs. However, none of the publicly available datasets so far are specifically about pigs’ agonistic interactions.
This gap necessitates creating image datasets that are publicly available for the study of pigs’ agonistic behaviors and developing a CV algorithm that assists in measuring traits from images/videos.

Once the traits are measured, another challenging task is to analyze the traits that are related to pigs’ interactive behaviors. Researchers are interested in quantifying the interaction intensity as well as the factors affecting the interactions between animals. Interactive behavior of pigs usually involves two pigs, namely a dyad. Given data on pairwise social interactions recorded from an experimental or observational study, it is necessary to quantify the effects of various individual- and group-level factors on social interactions. However, there is limited research on models for analyzing pairwise social interactions in pigs. A proper way to model dyadic data is to fit generalized linear mixed models (GLMM) that include fixed and random effects to account for means and covariances depending on the actual design of the experiment (Kenny et al., 2020). The potential of this approach remains unknown in the analysis of pigs’ social behaviors. A conceptualization and implementation of the GLMM framework will contribute to a better understanding of pigs’ social interactions.

2. OBJECTIVES

The overarching goal of this study is to validate state-of-the-art prediction models and techniques for improved swine farming. The specific objectives for this research are:

Objective 1: to develop deep learning models for genomic prediction and assess model performance through different training-validation data splits.

Objective 2: to investigate public datasets for computer vision applications in livestock farming that are useful for phenotyping and behavior studies, and to review the validation strategies in the related work that utilized the public datasets.

Objective 3: to assess the performance of deep learning-based computer vision models through different data splitting strategies by articulating temporal and spatial independence between training and validation sets.

Objective 4: to conceptualize, implement, and validate a novel statistical model for the analysis of social interaction data.

REFERENCES

Abdollahi-Arpanahi, R., Gianola, D., Peñagaricano, F., 2020. Deep learning versus parametric and ensemble methods for genomic prediction of complex phenotypes. Genet. Sel. Evol. 52, 1–15. https://doi.org/10.1186/s12711-020-00531-z

Agostini, P.S., Fahey, A.G., Manzanilla, E.G., O’Doherty, J.V., De Blas, C., Gasa, J., 2014. Management factors affecting mortality, feed intake and feed conversion ratio of grow-finishing pigs. Animal 8, 1312–1318. https://doi.org/10.1017/S1751731113001912

Alexandratos, N., Bruinsma, J., 2012. World agriculture towards 2030/2050: the 2012 revision.

Bellot, P., de los Campos, G., Pérez-Enciso, M., 2018. Can deep learning improve genomic prediction of complex human traits? Genetics 210, 809–819. https://doi.org/10.1534/genetics.118.301298

Berckmans, D., 2017. General introduction to precision livestock farming. Anim. Front. 7, 6–11. https://doi.org/10.2527/af.2017.0102

Chen, C., Zhu, W., Norton, T., 2021. Behaviour recognition of pigs and cattle: Journey from computer vision to deep learning. Comput. Electron. Agric. 187, 106255. https://doi.org/10.1016/j.compag.2021.106255

Chen, C., Zhu, W., Steibel, J., Siegford, J., Wurtz, K., Han, J., Norton, T., 2020.
Recognition of aggressive episodes of pigs based on convolutional neural network and long short-term memory. Comput. Electron. Agric. 169, 105166. https://doi.org/10.1016/j.compag.2019.105166

Commandeur, M.A.M., 2006. Diversity of pig farming styles: Understanding how it is structured. NJAS - Wageningen J. Life Sci. 54, 111–127. https://doi.org/10.1016/S1573-5214(06)80007-2

Crossa, J., Martini, J.W.R., Gianola, D., Pérez-Rodríguez, P., Jarquin, D., Juliana, P., Montesinos-López, O., Cuevas, J., 2019. Deep kernel and deep learning for genome-based prediction of single traits in multienvironment breeding trials. Front. Genet. 10, 1–13. https://doi.org/10.3389/fgene.2019.01168

Ekkel, E.D., Van Doorn, C.E.A., Hessing, M.J.C., Tielen, M.J.M., 1995. The specific-stress-free housing system has positive effects on productivity, health, and welfare of pigs. J. Anim. Sci. 73, 1544–1551.

Gasser, P.J., Lowry, C.A., Orchinik, M., 2009. Rapid corticosteroid actions on behavior: Mechanisms and implications, in: Pfaff, D.W., Arnold, A.P., Etgen, A.M., Fahrbach, S.E., Rubin, R.T. (Eds.), Hormones, Brain and Behavior (Second Edition). Academic Press, San Diego, pp. 1365–1397. https://doi.org/10.1016/B978-008088783-8.00041-3

Hayes, B.J., Lewin, H.A., Goddard, M.E., 2013. The future of livestock breeding: Genomic selection for efficiency, reduced emissions intensity, and adaptation. Trends Genet. 29, 206–214. https://doi.org/10.1016/j.tig.2012.11.009

Hill, W.G., 2001. Selective breeding, in: Brenner, S., Miller, J.H. (Eds.), Encyclopedia of Genetics. Academic Press, New York, pp. 1796–1799. https://doi.org/10.1006/rwgn.2001.1167

James, G., Witten, D., Hastie, T., Tibshirani, R., 2013. An introduction to statistical learning. Springer.

Jia Deng, Wei Dong, Socher, R., Li-Jia Li, Kai Li, Li Fei-Fei, 2009. ImageNet: A large-scale hierarchical image database, 248–255. https://doi.org/10.1109/cvprw.2009.5206848

Kenny, D.A., Kashy, D.A., Cook, W.L., 2020. Dyadic data analysis. Guilford Publications.

Kuhn, M., Johnson, K., 2013. Applied predictive modeling. Springer.

Lecun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature. https://doi.org/10.1038/nature14539

Li, G., Huang, Y., Chen, Z., Chesser, G.D., Purswell, J.L., Linhoss, J., Zhao, Y., 2021. Practices and applications of convolutional neural network-based computer vision systems in animal farming: A review. Sensors 21, 1–42. https://doi.org/10.3390/s21041492

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft COCO: Common objects in context, in: European Conference on Computer Vision. Springer, pp. 740–755.

Liu, D., Oczak, M., Maschat, K., Baumgartner, J., Pletzer, B., He, D., Norton, T., 2020. A computer vision-based method for spatial-temporal action recognition of tail-biting behaviour in group-housed pigs. Biosyst. Eng. 195, 27–41. https://doi.org/10.1016/j.biosystemseng.2020.04.007

Lu, Z., Jiang, X., Kot, A., 2017. Enhance deep learning performance in face recognition. 2017 2nd Int. Conf. Image, Vis. Comput. ICIVC 2017, 244–248. https://doi.org/10.1109/ICIVC.2017.7984554

Machado, S.P., Caldara, F.R., Foppa, L., De Moura, R., Gonçalves, L.M.P., Garcia, R.G., De Alencar Nääs, I., Dos Santos Nieto, V.M.O., De Oliveira, G.F., 2017. Behavior of pigs reared in enriched environment: Alternatives to extend pigs attention. PLoS One 12, 1–18. https://doi.org/10.1371/journal.pone.0168427

Marcus, G., 2018. Deep learning: A critical appraisal. arXiv preprint, 1–27.
Montesinos-López, A., Montesinos-López, O.A., Gianola, D., Crossa, J., Hernández-Suárez, C.M., 2018. Multi-environment genomic prediction of plant traits using deep learners with dense architecture. G3 Genes, Genomes, Genet. 8, 3813–3828. 10 https://doi.org/10.1534/g3.118.200740 Nasirahmadi, A., Sturm, B., Edwards, S., Jeppsson, K.H., Olsson, A.C., Müller, S., Hensel, O., 2019. Deep learning and machine vision approaches for posture detection of individual pigs. Sensors (Switzerland) 19, 1–16. https://doi.org/10.3390/s19173738 Öhlund, E., Hammer, M., Björklund, J., 2017. Managing conflicting goals in pig farming: farmers’ strategies and perspectives on sustainable pig farming in Sweden. Int. J. Agric. Sustain. 15, 693–707. https://doi.org/10.1080/14735903.2017.1399514 Oliviero, C., Junnikkala, S., Peltoniemi, O., 2019. The challenge of large litters on the immune system of the sow and the piglets. Reprod. Domest. Anim. 54, 12–21. https://doi.org/10.1111/rda.13463 Pairis-Garcia, M.D., Johnson, A.K., Azarpajouh, S., Colpoys, J.D., Rademacher, C.J., Millman, S.T., Webb, S.R., 2016. The U.S. swine industry: Historical milestones and the future of on- farm swine welfare assessments. CAB Rev. Perspect. Agric. Vet. Sci. Nutr. Nat. Resour. https://doi.org/10.1079/PAVSNNR201611025 Peltoniemi, O., Björkman, S., Oropeza-Moe, M., Oliviero, C., 2019. Developments of reproductive management and biotechnology in the pig. Anim. Reprod. 16, 524–538. https://doi.org/10.21451/1984-3143-AR2019-0055 Putka, D.J., Beatty, A.S., Reeder, M.C., 2018. Modern Prediction Methods: New Perspectives on a Common Problem. Organ. Res. Methods 21, 689–732. https://doi.org/10.1177/1094428117697041 Ramspek, C.L., Jager, K.J., Dekker, F.W., Zoccali, C., Van DIepen, M., 2021. External validation of prognostic models: What, why, how, when and where? Clin. Kidney J. 14, 49– 58. https://doi.org/10.1093/ckj/sfaa188 Roberts, D.R., Bahn, V., Ciuti, S., Boyce, M.S., Elith, J., Guillera-Arroita, G., Hauenstein, S., Lahoz-Monfort, J.J., Schröder, B., Thuiller, W., Warton, D.I., Wintle, B.A., Hartig, F., Dormann, C.F., 2017. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography (Cop.). 40, 913–929. https://doi.org/10.1111/ecog.02881 Tzanidakis, C., Simitzis, P., Arvanitis, K., Panagakis, P., 2021. An overview of the current trends in precision pig farming technologies. Livest. Sci. 249, 104530. https://doi.org/10.1016/j.livsci.2021.104530 USDA, 2019. United States Department of Agriculture. Hogs & Pork. Watanabe, K., Guo, W., Arai, K., Takanashi, H., Kajiya-Kanegae, H., Kobayashi, M., Yano, K., Tokunaga, T., Fujiwara, T., Tsutsumi, N., Iwat, H., 2017. High-throughput phenotyping of sorghum plant height using an unmanned aerial vehicle and its application to genomic prediction modeling. Front. Plant Sci. 8, 1–11. https://doi.org/10.3389/fpls.2017.00421 11 Zhang, K., Li, D., Huang, J., Chen, Y., 2020. Automated video behavior recognition of pigs using two-stream convolutional networks. Sensors (Switzerland) 20. https://doi.org/10.3390/s20041085 12 CHAPTER 2: HEURISTIC HYPERPARAMETER OPTIMIZATION OF DEEP LEARNING MODELS FOR GENOMIC PREDICTION Junjie Han, Cedric Gondro, Kenneth Reid, and Juan P. Steibel 1. ABSTRACT There is a growing interest among quantitative geneticists and animal breeders in the use of deep learning (DL) for genomic prediction. However, the performance of DL is affected by hyperparameters that are typically manually set by users. 
These hyperparameters do not simply specify the architecture of the model, they are also critical for the efficacy of the optimization and model fitting process. To date, most DL approaches used for genomic prediction have concentrated on identifying suitable hyperparameters by exploring discrete options from a subset of the hyperparameter space. Enlarging the hyperparameter optimization search space with continuous hyperparameters is a daunting combinatorial problem. To deal with this problem, we propose using differential evolution (DE) to perform an efficient search of arbitrarily complex hyperparameter spaces in DL models and we apply this to the specific case of genomic prediction of livestock phenotypes. This approach was evaluated on two pig and cattle datasets with real genotypes and simulated phenotypes (N=7,539 animals and M=48,541 markers) and one real dataset (N=910 individuals and M=28,916 markers). Hyperparameters were evaluated using cross validation. We compared the predictive performance of DL models using hyperparameters optimized by DE against DL models with “best practice” hyperparameters selected from published studies and baseline DL models with randomly specified hyperparameters. Optimized models using DE showed clear improvement in predictive performance across all three datasets. DE optimized hyperparameters also resulted in DL models 13 with less overfitting and less variation in predictive performance over repeated retraining compared to non-optimized DL models. 2. INTRODUCTION Over the past decades, there have been enormous gains in the productivity of livestock, much of which was due to the rapid genetic improvement of quantitative traits e.g. growth rates, reproductive traits, and feed conversion rates (Hill, 2016). In recent years, with the rise of DNA sequencing and high throughput genotyping technology as well as with the inception of genomic prediction models (Meuwissen et al., 2001), single nucleotide polymorphisms (SNP) became widely used for genomic prediction and genomic selection. Genomic prediction refers to the use of statistical models to estimate the genetic component of a phenotype by using data from SNP markers (Meuwissen et al., 2001; VanRaden, 2008). The same models can also be used for phenotypic prediction by associating an individual’s genotype to its phenotypes which is commonly used to predict complex traits in humans (Yang et al., 2010). For animal production, both genomic prediction and phenotypic prediction have resulted in more accurate selection while genomic prediction has been useful for management decisions (e.g. market allocation). The technology has also provided a platform for the adoption of novel breeding approaches and has led to new biological insights into the underpinnings of complex quantitative traits (Hickey et al., 2017). For simplicity we will use only the term genomic prediction throughout the text. Several models have been proposed for genomic prediction (Corvin et al., 2010; Gianola, 2013; Habier et al., 2011; VanRaden, 2008), and GBLUP is one of the most commonly used models (Fragomeni et al., 2017). A common assumption across these models is that genomic 14 effects are strictly additive, i.e. most models do not explicitly consider interactions between alleles within markers (dominance), nor between markers (epistasis) (Crossa et al., 2019). 
More recently, deep learning (Lecun et al., 2015) has been proposed as an alternative to genomic prediction models that does not depend on the typical assumptions of traditional genomic prediction methods. Deep learning (DL) has dramatically improved state-of-the-art applications in computer vision, speech recognition and genomics (Eraslan et al., 2019; Koumakis, 2020; Lecun et al., 2015). DL methods are flexible and can potentially learn very cryptic data structures – even interactions between predictors (Crossa et al., 2019). DL has already been applied to genomic prediction in plants (Crossa et al., 2019; A. Montesinos-López et al., 2018; Montesinos-López et al., 2019), human traits (Bellot et al., 2018), and estimation of breeding values in cattle (Abdollahi-Arpanahi et al., 2020). DL models in genomic prediction are promising tools (Bellot et al., 2018). However, one of the critical challenges of implementing DL is selection of appropriate hyperparameters since they significantly affect the performance of the prediction algorithm. Hyperparameter features are values or options typically set by users before the model is fitted that impact the algorithm’s predictive performance by avoiding overfitting and underfitting (Luo, 2016). Each feature that is part of the hyperparameter set can take a range of values or options and they can interact with each other to determine the properties of the final fitted model; a properly specified hyperparameter set is fundamental for a DL model to achieve a high prediction accuracy. But, unfortunately, there is no one-size-fits-all best way to optimize these hyperparameters. Several procedures have been used to select DL hyperparameters for genomic prediction applications; e.g. grid search (Crossa et al., 2019; Pérez-Enciso and Zingaretti, 2019) and genetic 15 algorithms (Bellot et al., 2018). Grid search is only feasible for a limited number of parameters and levels, which is not the case for most DL applications. On the other side, genetic algorithms are better suited for optimizing large and complex parametric spaces, but currently available implementations of genetic algorithms to tune DL hyperparameters for genomic prediction require that the options of each hyperparameter are either already discrete or discretized before the optimization process (Bellot et al., 2018). An alternative to genetic algorithms is differential evolution (DE) which is a population based evolutionary heuristic well suited for optimization of discrete and continuous search spaces (Das et al., 2016; Storn and Price, 1997). Differential Evolution lies on the intersection between real-valued genetic algorithms and evolution strategies. DE uses the conventional population structure of genetic algorithms and the self-adapting mutation of evolution strategies; in a sense DE can be loosely viewed as a population based simulated annealing algorithm in which the mutation rate decreases as the population converges on a solution. In this study, we propose to adapt DE to optimize the DL hyperparameter set for genomic prediction and evaluate its effectiveness to improve prediction accuracies in simulated and real datasets for two classes of DL models: multilayer perceptron (MLP) and convolutional neural network (CNN). We emphasize that the focus of this paper is on optimization of DL hyperparameters to identify a set suitable for a given specific genomic prediction problem, rather than a comparison of DL with GBLUP or other genomic prediction methods. 
As the predictive performance depends on the genetic architecture of the trait and the population structure, we demonstrate the importance and the impact of proper hyperparameter specification on genomic prediction with DL.

3. MATERIAL AND METHODS

3.1 Datasets

3.1.1 Simulated datasets
Real genotypes from two livestock populations (pigs and cattle) were used to create simulated datasets for testing purposes. Genotypes from both species were edited to be of the same dimensions, comprising a total of 48,541 SNP genotypes for 7,539 individuals, from which 6,031 (80%) and 1,508 (20%) were randomly assigned to the discovery and validation populations, respectively. Phenotypes were simulated for both species by randomly assigning 1,000 SNP as quantitative trait loci (QTL) with additive effects for a heritability of 0.4, using the R simulation package GenEval (Cuyabano, 2020).

3.1.2 Real dataset
The real data came from an experimental F2 cross of Duroc and Pietrain pigs previously described (Edwards et al., 2008). Briefly, four Duroc sires were mated to 15 Pietrain dams to produce 56 F1 individuals (50 females and 6 males). F1 animals were mated to produce a total of 954 F2 pigs that were phenotyped for 38 meat quality and carcass quality traits. For this study, meat pH records measured 24 hours post-mortem from 910 F2 pigs were used. We purposely selected this trait because it is moderately heritable (h2 = 0.19 ± 0.05) and we have previously mapped putative QTL for it (Casiró et al., 2017). Two different SNP chips were used to genotype the F2 pigs, but all SNP were imputed to a common set of approximately 62,000 SNP (Gualdrón Duarte et al., 2013) with high accuracy (R2 > 0.97). SNP were pruned by filtering out SNP with: 1) low genotyping rates (less than 90%), 2) lack of segregation, 3) inconsistent Mendelian inheritance with the pedigree information, 4) low imputation accuracy (R2 < 0.64; Casiró et al., 2017), and 5) high correlation between markers (larger than 0.99). A final set of 28,916 SNP was used for this study. Phenotypic records were pre-adjusted for fixed effects: y_adj = y_obs − Xβ, where y_adj is the adjusted response, y_obs is pH measured 24 hours post-mortem, X is the incidence matrix with the fixed effects of sex, slaughter group and carcass weight, and β represents the coefficients of the fixed effects.

3.2 Deep learning and genomic prediction
Deep learning (DL) methods are a set of representation learning methods, where a machine can be fed with raw data and automatically discover the representations needed for prediction or classification, with multiple levels of simple but non-linear modules that transform the representation at one level into a representation at a higher, slightly more abstract level (Lecun et al., 2015). In the context of genomic prediction, we used DL to build a system that predicts an animal's phenotypic value given its genotype. DL computes and minimizes a loss function that measures the error of prediction. In this study, we used the mean squared error (mse) as the loss function:

mse = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²,

where N represents the number of individuals in the training dataset, y_i represents the observed response of individual i and ŷ_i is the predicted response of individual i. Two types of DL models were used in this study: multilayer perceptron and convolutional neural network.
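As a concrete illustration of the two steps just described, the following R sketch pre-adjusts the observed pH for fixed effects and defines the mse loss. This is only a minimal sketch, not the study's code; the data frame and column names (pheno, ph24, sex, slaughter_grp, carcass_wt) are hypothetical.

```r
# Minimal sketch (hypothetical names): y_adj = y_obs - X*beta, adjusting for the
# fixed effects of sex, slaughter group and carcass weight.
fit   <- lm(ph24 ~ sex + slaughter_grp + carcass_wt, data = pheno)
y_adj <- residuals(fit)   # adjusted response used as the target for the DL models

# mse = (1/N) * sum_i (y_i - yhat_i)^2
mse <- function(y, yhat) mean((y - yhat)^2)
```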
3.2.1 Multilayer perceptron
This model is also known as a feed-forward artificial neural network. In this paper, the MLP (Figure 2.1) has an input layer with as many nodes as SNP markers, a variable number of hidden layer(s) with a certain number of nodes, and an output layer representing the response. Since nodes between layers are fully connected, the MLP can potentially model complex and higher order interactions between predictor variables (Abdollahi-Arpanahi et al., 2020). A detailed explanation of how MLP models work is presented in File S2.1 and is also available at GitHub alongside the source code (https://github.com/jun-jieh/DE_DL).

Figure 2.1 Multilayer Perceptron (MLP) for genomic prediction of a single trait with M SNP markers. The network has an input layer, two fully connected hidden layers and an output layer. Each node's input in the hidden layers is a transformation of the weighted sum of the output from the previous layer. The number of nodes in the hidden layers decreases as the depth of the MLP increases, to facilitate representation learning.

As deep learning consists of transforming representations at a previous layer into its next (more abstract) layer (Lecun et al., 2015), we opted to adaptively set the number of nodes for each hidden layer based on the depth of the network, so that the next hidden layer always has fewer nodes than the previous one. For instance, in an MLP with two hidden layers (Figure 2.1), the first layer can only have between 259 and 512 neurons while the second layer can have any number between 4 and 258. Table S2.1 summarizes the search space for the number of nodes in MLPs with one, two, and up to five hidden layers. Other researchers may choose different adaptive rules to impose restrictions on the possible number of neurons per layer, or may even simply choose to use the same number of nodes for all layers.
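To make the architecture in Figure 2.1 concrete, the following R sketch builds an MLP with a decreasing number of nodes per hidden layer using the keras package. It is a simplified illustration rather than the optimized models reported below; the unit counts, dropout rate and L2 weight are placeholders within the ranges of Table 2.1.

```r
library(keras)

# Minimal sketch of an MLP for genomic prediction of a single trait; the
# hyperparameter values are placeholders, not the DE-optimized ones.
build_mlp <- function(n_snp, units = c(446, 87), act = "elu",
                      drop = 0.01, l2 = 0.06) {
  model <- keras_model_sequential()
  model %>% layer_dense(units = units[1], activation = act,
                        input_shape = c(n_snp),
                        kernel_regularizer = regularizer_l2(l2)) %>%
    layer_dropout(rate = drop)
  for (u in units[-1]) {                      # remaining (smaller) hidden layers
    model %>% layer_dense(units = u, activation = act,
                          kernel_regularizer = regularizer_l2(l2)) %>%
      layer_dropout(rate = drop)
  }
  model %>% layer_dense(units = 1)            # single-trait output
  model %>% compile(optimizer = "adam", loss = "mse")
  model
}
# e.g. mlp <- build_mlp(n_snp = 28916, units = c(446, 87))
```

A 1-d CNN (Figure 2.2) can be assembled analogously with layer_conv_1d() and pooling layers, after reshaping the genotype matrix into an N × M × 1 array.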
3.2.2 Convolutional neural network
CNNs are designed to process data that come in a multiple-array format (Lecun et al., 2015), e.g. 1-d for an animal's genotype, 2-d for images and 3-d for videos. Typical CNN models consist of an input layer, convolutional layer(s), pooling layer(s), a flatten layer, and an output layer (Figure 2.2). In the context of genomic prediction (Figure 2.2), the input layer for a single observation in a CNN is a one-dimensional array that contains an animal's genotype, and the number of units in the layer is equal to the number of markers. The output layer ŷ_n represents the predicted response value for the phenotype or breeding value of the nth individual.

Figure 2.2 1-d Convolutional neural network (CNN) for genomic prediction of a single trait with M SNP markers. The network has an input layer, two convolutional layers with their corresponding pooling layers, a fully connected hidden layer and an output layer. Each convolutional layer applies a number of filters to the output of the previous layer and its output is subsequently summarized by a pooling layer.

Between the input and output layers, a CNN contains a variable number of convolutional layer(s) followed by pooling layer(s). Full details on the CNN architecture are given in File S2.1 and at GitHub along with the source code (https://github.com/jun-jieh/DE_DL). In this study, each convolutional layer applied filters of size f (a hyperparameter to be optimized) with the stride equal to the filter size (non-overlapping convolutions of the input). In CNNs, several restrictions are typically assumed regarding the model architecture. When learning from a global level to a local level, more details are required to obtain the pattern at the local level (Lecun et al., 2015). Therefore, the number of filters increases as the depth of the CNN increases, to detect local motifs. To reflect this expectation, we adaptively set the number of filters applied in each convolutional layer as a function of the depth of the network. Specifically, we limited the number of filters in any convolutional layer to be between 4 and 128, but this range is partitioned for each convolutional layer to make sure that the next convolutional layer always has a larger number of filters than the previous layer. For example, in a CNN with two convolutional layers (Figure 2.2), the first convolutional layer can only have between 4 and 65 filters while the second convolutional layer can have between 66 and 128 filters. Examples of the adaptive number of filters as a function of the depth of the CNN are presented in Table S2.2. The hyperparameter space for the filter size was set as an integer between 2 and 20. Although the filter size is specified by the user, the output feature (feature map in Figure 2.2) has to conform to the minimum length of the feature maps in each convolutional layer and pooling layer (the length of a feature map needs to be equal to or larger than the filter size), which is illustrated in Table S2.3. If this condition is not satisfied, instead of fixing the kernel size through all convolutional layers, we set an adaptive kernel size in order to successfully execute the model fitting (see details in Figure S2.1). The adaptive kernel size ensures that the CNN generates a valid output.

3.2.3 DL model training
TensorFlow (Abadi et al., 2015) was used to train DL models. At each iteration (epoch; a detailed description of epochs is presented in File S2.1) of the training process, TensorFlow randomly partitioned the training data into an actual training set (80% of the data), which was used for updating the model weights, and a testing set (20% of the data), which was used to evaluate the updated model. The data partition was performed by TensorFlow and we did not have control over the random partitions. At the end of each epoch, an internal validation was performed by evaluating the correlation between the predicted and observed response in the testing set. A DL training procedure typically requires multiple epochs, and the number of epochs was one of the hyperparameters optimized in our DE procedure (see below). We also introduced early stopping when the correlation did not change by more than 0.1 over ten consecutive epochs, as it was assumed that the fitness (correlation) could not be improved further and additional training was unnecessary.

3.2.4 Hyperparameter optimization
Table 2.1 presents the hyperparameters optimized in this study. A plausible range of values for each hyperparameter was defined based on ranges suggested by the literature for DL applied to genomic prediction (additional details of each hyperparameter can be found in File S2.1). These ranges were then used as constraints for the differential evolution algorithm.
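For reference, the MLP search space in Table 2.1 can be written down as a simple R list. This is only a sketch of how the constraints might be encoded; the object and element names are illustrative and do not correspond to the actual objects in the repository (https://github.com/jun-jieh/DE_DL).

```r
# Illustrative encoding of the Table 2.1 search space for the MLP.
mlp_space <- list(
  n_layers   = 1:5,
  n_neurons  = c(8, 512),        # integer range
  activation = c("relu", "elu", "sigmoid", "selu", "softplus", "linear", "tanh"),
  optimizer  = c("sgd", "adam", "adagrad", "rmsprop", "adadelta", "adamax", "nadam"),
  dropout    = c(0, 1),          # continuous range
  l2         = c(0, 1),          # continuous range
  batch_size = c(0.001, 0.01),   # expressed as a fraction of N (alpha1-alpha2)
  epochs     = c(21, 50)
)
```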
Users can extend, reduce, or modify the hyperparameter space described in Table 2.1 as needed.

Table 2.1 Parameter space for optimized hyperparameters. Hyperparameter space and range (see details in File S2.1). N represents sample size. α1 = 0.001 for the simulated datasets and α1 = 0.01 for the real pig dataset. α2 = 0.01 for the simulated datasets and α2 = 0.1 for the real pig dataset.

Hyperparameter | Parameter space (MLP) | Parameter space (CNN) | Value type
Number of layers | [1, 2, 3, 4, 5] | [1, 2, 3, 4, 5] | Integer
Number of neurons | [8-512] | [8-512] | Integer
Activation | ['relu', 'elu', 'sigmoid', 'selu', 'softplus', 'linear', 'tanh'] | ['relu', 'elu', 'sigmoid', 'selu', 'softplus', 'linear', 'tanh'] | Categorical
Optimizer | ['sgd', 'adam', 'adagrad', 'rmsprop', 'adadelta', 'adamax', 'nadam'] | ['sgd', 'adam', 'adagrad', 'rmsprop', 'adadelta', 'adamax', 'nadam'] | Categorical
Dropout rate | [0-1] | [0-1] | Continuous
L2 penalty | [0-1] | [0-1] | Continuous
Batch size | [N×α1 - N×α2] | 32 | Integer
Epoch | [21-50] | [21-50] | Integer
Number of filters | NA | [2-128] | Integer
Filter size | NA | [2-20] | Integer
Pooling | NA | ['max', 'average'] | Categorical

3.3 Differential evolution algorithm for deep learning
Differential evolution (DE) is an evolutionary algorithm that includes four steps: 1) initialization, 2) mutation, 3) crossover and 4) selection (Storn and Price, 1997). A generic version of this algorithm is described in pseudocode format (Figure 2.3). DE was used to evolve a population of numeric vectors that can be recoded to represent hyperparameter combinations through random keys. A toy example of the DE approach is provided at GitHub (https://github.com/jun-jieh/DE_DL).

Figure 2.3 Pseudocode for the differential evolution algorithm.

3.3.1 Random key
The random key is an encoding mechanism originally used in genetic algorithms by Bean (1994). The core of this algorithm is a set of d H-dimensional numeric vectors pop1, ..., popd that form a population. Each numeric vector represents a solution that is linked or mapped to a set of model hyperparameters through a mapping function (the random key). Suppose that there are K hyperparameters to optimize, where K = 8 for the MLP and K = 10 for the CNN (Table 2.1). Within each hyperparameter k ∈ 1, ..., K there are Hk loci, and if the hyperparameter takes continuous values, then Hk = 1. So, the total size is H = Σ_{k=1}^{K} Hk. Each vector popi is partitioned into K sub-blocks that contain Hk loci, where each single locus in H represents a hyperparameter option or value. For categorical hyperparameters, a mapping is performed from the Hk-dimensional block of the numeric vector to an Hk-dimensional vector MAPk containing the names of the categories for the kth hyperparameter, as follows: the Hk elements are ranked according to their values and the rank of the first element is used as an index for the MAPk vector to select the corresponding categorical value. In this way, the evolutionary operators (mutation, crossover, and selection) can be applied directly on the numeric vector popi, but the results can always be translated back into a set of categorical (and continuous) hyperparameter values. An example of this with the hyperparameter number of layers (Hk = 5) is presented in Figure 2.4. So, in a nutshell, a random key is a vector of real numbers whose ranking, once sorted, can be used to map against a set of statically ordered features. The idea is that better features will evolve to higher values in the key while worse features will evolve to lower values; the ranking of the sorted key allows sorting the features from best to worst and provides a smoother fitness surface for the DE to explore.
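The following R sketch illustrates the random-key decoding just described for one categorical hyperparameter. The rescaling shown for continuous hyperparameters is an assumption about how a single-locus key could be mapped into its allowed range; it is not a statement of the repository's exact implementation.

```r
# Decode one categorical hyperparameter: rank the H_k-dimensional block of the
# key and use the rank of its first element as an index into MAP_k (Figure 2.4).
decode_categorical <- function(key_block, map_k) {
  stopifnot(length(key_block) == length(map_k))
  map_k[rank(key_block)[1]]
}
decode_categorical(runif(5), c("1 layer", "2 layers", "3 layers", "4 layers", "5 layers"))

# Assumed mapping for a continuous hyperparameter (H_k = 1): rescale the key
# into its allowed range, e.g. a dropout rate in [0, 1].
decode_continuous <- function(key, lower, upper) lower + key * (upper - lower)
decode_continuous(0.3, 0, 1)
```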
The main steps of the differential evolution algorithm are:

3.3.2 Initialization
We initialized d = 50 H-dimensional parameter vectors pop1, ..., pop50 as a population pop (line 5 of Figure 2.3) from a uniform [0, 1] distribution, and we mapped each numeric vector to a set of hyperparameter values as described before (Table 2.1) to obtain 50 hyperparameter sets. Then we fitted 50 models using each set of hyperparameters and recorded their correlations between predicted and observed response values. An individual of the population refers to one of the H-dimensional vectors in pop and its encoded hyperparameter set. From now on, we use the term individual to refer to a hyperparameter solution in DE.

3.3.3 Mutation
To generate a mutation, indices of two random individuals are selected from the population, r1, r2 ∈ {1, ..., d}, and the corresponding H-dimensional vectors pop_r1 and pop_r2 are extracted. Then pop_r1 and pop_r2 are combined into a mutant vector using mu = μ · (pop_r1 − pop_r2), where mu is the mutant vector and μ is the mutation parameter (μ ∈ [0, 2]). Storn and Price (1997) recommended that 0.5 is usually a good initial choice for the mutation parameter. In this study, we set μ = 0.5.

Figure 2.4 Summary of the random key (mapping function) used to transform numeric vectors into discrete levels of hyperparameters. The numeric vector can be subject to mutation and recombination. The mapping is used to transform the result into a meaningful set of hyperparameters that can be used to fit a model and obtain a fitness to select numeric vectors.

3.3.4 Crossover
To increase the diversity of the hyperparameter combinations represented in the population, a crossover function is used to combine the mutant vector mu with other individual vectors. First, an H-dimensional vector RN of uniformly distributed random numbers in [0, 1] is generated. The crossover rate is defined by the parameter α (α ∈ [0, 1]). Gämperle et al. (2002) suggested that a good choice for the crossover constant is a value between 0.3 and 0.9. In this study, we set α = 0.5. Another H-dimensional vector (CR) of logical variables (True/False or 1/0) is then generated according to

CR_i = 1 if RN_i < α, and CR_i = 0 if RN_i ≥ α, for i = 1, ..., H.

Then, two more individual vectors pop_r3 and pop_r4 are selected and crossover generates a new individual according to

Challenger_i = pop_r3,i if CR_i = 0, and Challenger_i = pop_r4,i − mu_i if CR_i = 1, for i = 1, ..., H,

where Challenger is the newly generated individual and i indexes the ith element of pop_r3, pop_r4, mu, and Challenger. For convenience, we name pop_r3 the Titleholder.

3.3.5 Selection
To decide whether or not the Challenger should replace the Titleholder in the population pop, both vectors of numeric values, Challenger and Titleholder, were mapped into hyperparameter sets using the random key. The models were fitted based on the mapped hyperparameters and the Pearson correlation coefficient γ between the predicted and the observed values was computed. We averaged the correlations over epochs, as the testing sets varied across epochs. The averaged correlation was defined as the fitness of the DL model (given a hyperparameter set).
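Before turning to the penalized fitness, the mutation and crossover steps above can be summarized in a short R sketch. This is a sketch under the reconstruction given above (the Challenger takes the perturbed pop_r4 element where CR_i = 1 and the Titleholder element otherwise), not the repository's exact code.

```r
# One DE proposal step on the d x H matrix of random keys (mu = 0.5, alpha = 0.5).
de_challenger <- function(pop, mu = 0.5, alpha = 0.5) {
  H <- ncol(pop)
  r <- sample(nrow(pop), 4)                  # indices r1, r2, r3 (Titleholder), r4
  mut <- mu * (pop[r[1], ] - pop[r[2], ])    # mutant (difference) vector
  cr  <- runif(H) < alpha                    # crossover indicator CR
  challenger <- ifelse(cr, pop[r[4], ] - mut, pop[r[3], ])
  list(titleholder_index = r[3], challenger = challenger)
}
# The Challenger replaces the Titleholder in pop only if its fitness
# (averaged correlation, or -1 if penalized) exceeds the Titleholder's.
```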
Additionally, we applied a penalized fitness if any of the following three scenarios happened during model fitting: exhausted memory when fitting a specified model, a constant value generated for all predicted responses, or exploding/vanishing gradients that led to an unstable model-fitting procedure (convergence issues). A penalized individual had its fitness set to -1. If γ_Titleholder < γ_Challenger, we replaced the Titleholder by the Challenger in pop; otherwise, we retained the Titleholder in pop. Finally, steps 2)-4) are repeated for δ iterations. For the simulated pig dataset and the simulated cattle dataset, δ was 2,000, while δ = 10,000 was used for the real pig dataset (after δ iterations there was no further significant improvement). It is worth noting that the initial population does not need to be random; it can be based on prior information or can even be the result of a previous run. In effect, the DE can continue evolving a population that has already been optimized for some iterations if, for example, the run did not converge.

As shown in the results section, if a DL model is run multiple times with the same dataset and hyperparameters, the predictive performance differs slightly from run to run. This means that a model trained once can get a slightly higher/lower prediction accuracy compared to the average prediction accuracy that would be obtained over multiple re-trainings. This effect is more pronounced in more complex models, which are more prone to overfitting. To mitigate this problem, we introduced a variation to the traditional DE algorithm by refitting the Titleholder each time and updating its fitness value. Specifically, in each iteration, the Titleholder was refitted, and if the Titleholder won the contest, the updated fitness was retained.

3.3.6 Top model selection
At the end of the DE run, each individual solution in the population was refitted 30 times to evaluate model stability through repeated training. The best model was selected based on two measures obtained from this repeated training: the mean fitness and the standard deviation (SD) of the fitness across the 30 refits. This is necessary because, as explained above, refitting the selected models resulted in slightly different predictive performance. The details on how this bivariate selection criterion was applied can be found at GitHub (https://github.com/jun-jieh/DE_DL).

3.4 Optimized model assessment through external validation
Each dataset was partitioned into five training sets and five validation sets (80% and 20% for training and validation, respectively). The DE was applied to each of the training sets to optimize hyperparameter sets for both MLPs and CNNs. Note that the training data used for optimization were not part of the validation set. The final MLP and CNN models (2 × 5) from the DE runs were then refitted 30 times (with the training sets only), and each refitted model was evaluated by predicting the corresponding validation set and computing the correlation between the predicted and the observed response. The average correlation over the 30 refits (the external validation accuracy), as well as the SD of the correlations, was then calculated. This external validation is distinct from the internal validation (described in the DL model training) utilized by the DE to optimize the fitness and should be differentiated.
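A minimal sketch of this external validation loop is given below; fit_model() stands in for the refitting of an optimized MLP or CNN and is a hypothetical wrapper, not a function from the repository.

```r
# Refit the selected model 30 times on the training set and summarize the
# correlation with the observed response in the held-out validation set.
external_validation <- function(x_train, y_train, x_val, y_val, n_refit = 30) {
  cors <- replicate(n_refit, {
    model <- fit_model(x_train, y_train)              # hypothetical refitting wrapper
    cor(as.numeric(predict(model, x_val)), y_val)
  })
  c(mean = mean(cors), sd = sd(cors))                 # external validation accuracy and SD
}
```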
In short, the external validation was performed using validation sets while the internal validation was performed utilizing testing sets (described in DL model training). GBLUP was used to estimate the response variable and its prediction accuracy as a comparison reference to the optimized MLPs and CNNs (GBLUP details can be found in File S2.1). 3.5 Hardware and software The computer processor used in this study was Intel(R) Core i7-8750H CPU @ 2.20 GHz with 16GB of RAM memory and Microsoft(R) Windows 10 operating system. The GPU 29 (graphic card) was NVIDIA(R) GeForce GTX 1070 with 8 GB GDDR5 memory. All the analyses were implemented in R (R Core Team, 2020). For GBLUP we used the gwaR R package (Steibel, 2015) and for DL the R Keras package (Chollet et al., 2017), which is a high- level neural networks API on top of TensorFlow (Abadi et al., 2015) with GPU computing enabled. 3.6 Data availability The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article. Animal protocols were approved by the Michigan State University All University Committee on Animal Use and Care (AUF# 09/03-114-00). Custom R code used to fit MLP/CNN, implement DE, and evaluate models are available at GitHub (https://github.com/jun-jieh/DE_DL). Genotypes and phenotypes for the animals in the real pig dataset are available at GitHub (https://github.com/jun-jieh/RealPigData). Supplemental material contains the constraints for DL architecture, a summary of predictive performance for GBLUP/MLP/CNN models, the distributions of selected hyperparameters, and the architectures of DL models derived from other studies. 4. RESULTS AND DISCUSSION As reported in many published applications of deep learning in genomic predictions (Abdollahi-Arpanahi et al., 2020; Bellot et al., 2018; Zingaretti et al., 2020), we observed that retraining of a certain DL model with the same hyperparameter configuration and the same dataset produced slightly different predictions. This forced us to consider the variation in the predictive performance under the retraining in DE and post-DE model selection (see methods section). It also had an impact in the results presented below. 30 4.1 Optimization runtime profiles The DE’s optimization runtime profiles (mean fitness and SD of fitness) for the three datasets (simulated cattle, simulated pig, and real pig) and the two DL models (MLP and CNN) are shown in Figures 2.5-2.7 and Table 2.2. The mean fitness increased during the DE run, but it is important to note that it can – and did – also decrease at some points due to the stochastic sampling of individual subsets that we used for model testing to avoid overfitting (panels A and C of Figures 2.5-2.7). A similar short-term decrease in fitness was observed when using DE to optimize model hyperparameters in the context of emotion recognition (Nakisa et al., 2018). In our case, the occasional drop in the mean fitness was due to the retraining of the models as the refitting of the same model could yield a lower fitness. Thus, sometimes, even if a current 𝑇𝑖𝑡𝑙𝑒ℎ𝑜𝑙𝑑𝑒𝑟 won the challenge, its new fitness could be lower than before due to the re-fitting. Alternatively, when a new Challenger won the contest, its fitness could have been higher that the refitted fitness of the Titleholder but still lower than the previously estimated fitness for that Titleholder, which resulted in a new candidate solution in the population but also in a lower fitness. 
Table 2.2 Runtime profile for the DE approach. MLP, multilayer perceptron; CNN, convolutional neural network; Avg. runtime, average runtime for one DE iteration (each iteration fits two models); Num. iterations, the total number of iterations used in DE; min, minutes; hr, hours.

Model type | Dataset | Avg. runtime | Num. iterations | Total runtime
MLP | Simulated Pig | 3.95 min | 2,000 | 131.78 hr
MLP | Simulated Cattle | 4.01 min | 2,000 | 133.67 hr
MLP | Real Pig | 0.30 min | 10,000 | 49.81 hr
CNN | Simulated Pig | 2.36 min | 2,000 | 77.67 hr
CNN | Simulated Cattle | 2.73 min | 2,000 | 87.73 hr
CNN | Real Pig | 0.24 min | 10,000 | 40.15 hr

In general, DE for CNNs converged faster (reached the maximum possible average fitness) compared to DE for MLPs (panels A/C of Figures 2.5-2.7). For CNNs, DE converged after approximately 600, 700, and 1,500 iterations for the simulated pig, simulated cattle, and real pig datasets, respectively. For MLPs, DE converged in approximately 1,000, 1,000, and 2,500 iterations for the three datasets, respectively. One possible explanation is that MLPs disregard spatial information and use each neuron as an independent predictor, while CNNs tend to learn from a global pattern at the beginning and then summarize the features at a local level (Lecun et al., 2015). In genomic data, linkage disequilibrium is a nonrandom relationship between alleles at different physical locations and is a sensitive indicator of genome structure (Slatkin, 2008). Also, Tang and Sun (2019) argued that CNNs could be utilized to extract motifs from homologous sequences, where motifs are essential features for distinguishing different sequence families. Given a dataset with spatial structure, a CNN therefore potentially has an advantage over an MLP in that it can exploit local connectivity.

For each dataset, evolved MLPs and CNNs converged to similar mean fitness but varied across data partitions (panels A/C of Figures 2.5-2.7). The mean fitness in the simulated pig dataset ranged from 0.27 to 0.31 and the range was (0.31, 0.33) for the simulated cattle dataset, while the real pig dataset had a range of (0.19, 0.29). Mitchell et al. (2015) trained networks with permuted datasets and also reported varying predictive performance given different data partitions.

Figure 2.5 History of differential evolution by algorithm and data partition in the simulated pig dataset over 2,000 iterations. Mean and standard deviation of the fitness (correlation between the predicted and true phenotype) were computed given each population. (A) Mean fitness of five populations by fitting multilayer perceptron (MLP) models. (B) Standard deviation of fitness within each population (MLPs). (C) Mean fitness of five populations by fitting convolutional neural network (CNN) models. (D) Standard deviation of fitness within each population (CNNs).

Most evolved populations had a fitness SD smaller than 0.05. However, one exception was population 5 with CNNs in the simulated cattle dataset (panel D of Figure 2.6). Only this population had a large SD of 0.19, and the population contained a CNN hyperparameter set with penalized fitness, indicating a failure to remove a penalized individual within 2,000 iterations. DE performance is sensitive to the number of iterations set by the user, and generally solutions can evolve further when the iteration number is increased (Gämperle et al., 2002; Kok and Rajendran, 2016). Thus, the solution with penalized fitness would likely be removed by running more DE iterations.
On the other hand, the post-DE refitting (described in top model selection) would further exclude this solution. Overall, the within population SDs for both MLP and CNN 33 models were reduced over DE iterations (panels B/D of Figures 2.5-2.7), suggesting evolved models in each population had similar performance. Kim and Lee (2019) reported that deep learning models with different hyperparameters could have the same predictive performance, which indicated that the best solution may not be unique. Zhang et al. (2020) also indicated that superior solutions would prefer the closest candidates in evolutionary optimization algorithms. Therefore, we argue that DE evolves a population to where candidate solutions are increasingly similar to each other. Furthermore, distributions of evolved models showed similarities in hyperparameter options e.g. activation function, number of layers, filter size, optimizer, dropout, and pooling, while the hyperparameters were less similar in number of nodes (filters), fully connected layer in CNN, batch size, and L2 regularization (Tables S2.4-S2.15). Yu and Zhu (2020) have mentioned that in the process of optimization, hyperparameters with greater importance received preferential treatment, whereas it was difficult to quantitatively determine the significance of the hyperparameters. We argue that the heterogeneity/homogeneity in the hyperparameters resort to the less/more important hyperparameters. 34 Figure 2.6 History of differential evolution by algorithm and data partition in the simulated cattle dataset over 2,000 iterations. Mean and standard deviation were computed given each population. (A) Mean fitness of five populations by fitting multilayer perceptron (MLP) models. (B) Standard deviation of fitness within each population (MLPs). (C) Mean fitness of five populations by fitting convolutional neural network (CNN). (D) Standard deviation of fitness within each population (CNNs). Table 2.2 presents the runtime of DE approach in the three datasets. We observed that the models fitted with the real pig dataset were approximately 10 times faster compared to the two simulated datasets as there were fewer SNP loci in the real pig dataset. We also observed that CNNs fitted faster than MLPs. Depending on the dataset, the average runtime for one iteration (two DL models) ranged from 0.30-4.01 minutes for MLPs and 0.24-2.73 minutes for CNNs. As a reference, Montesinos-López et al. (2018) reported that training a DL model with their dataset required 3.60 hours. It is important to point out that we utilized the GPU computing that parallelized the computation in DL with thousands of graphic computing units, and this will be the major impact on computational speedup compared to the default setup (CPU computing). For comparison, we fitted GBLUP models with gwaR (Steibel, 2015) package, and it required 1.77 35 hours, 1.90 hours, and 19.08 seconds to fit the simulated pig, simulated cattle, and real pig datasets, respectively. It is important to point out that the runtime increase quadratically with samples and linearly with markers using gwaR package. Furthermore, we estimated the time budget for grid searches with GPU computing enabled. An exhaustive search is estimated to cost 9,352,875-404,278,022 hours according to the defined hyperparameter space (Table 2.1) and dataset, which results in up to 4,594,067 times more computing resource compared to DE approach used in this study. 
Figure 2.7 History of differential evolution by algorithm and data partition in the real pig dataset over 10,000 iterations. Mean and standard deviation were computed given each population. (A) Mean fitness of five populations by fitting multilayer perceptron (MLP) models. (B) Standard deviation of fitness within each population (MLPs). (C) Mean fitness of five populations by fitting convolutional neural network (CNN) models. (D) Standard deviation of fitness within each population (CNNs).

4.2 Characteristics of selected hyperparameters
Table 2.3 shows the top MLPs from each population (one hyperparameter solution from each population, 15 in total). Activation functions of MLPs optimized for the simulated datasets varied among "elu", "selu", "relu", "softplus" and "linear", while in optimized MLPs for the real pig dataset, the "sigmoid" function was fixed across all selected individuals. Noteworthy, in this study the input of the DL model was the allelic count of one of the alleles (coded as 0, 1 and 2); thus, all the input nodes were non-negative values. Interestingly, "elu", "selu" and "relu" are almost identical when the input is a non-negative value, and the "linear" activation is very similar to those functions too (differing only in the slope). Moreover, "softplus" and "sigmoid" are the activation functions most different from the elu-linear family. The activation functions of the top models are described by Goodfellow et al. (2016). Our finding agrees with Bellot et al. (2018), who suggested "elu", "softplus" and "linear", and also with the "relu" recommended by Pérez-Enciso and Zingaretti (2019). Moreover, as the simulated datasets were generated by considering only additive genetic effects, we speculate that the optimized DL models for the simulated datasets unveiled the additive nature of the trait effect by selecting predominantly linear-like activation functions. For the real dataset, optimized MLPs fixed the non-linear activation function "sigmoid". We argue that the selected non-linear activation reflects the increased complexity of polygenic inheritance in real datasets. Regarding this perspective, Zingaretti et al. (2020) indicated that in a real dataset, DL could model complex relationships by employing non-linear functions, and they also observed that the sigmoid-like hyperbolic tangent ("tanh") was a safer choice overall. In line with these assumptions, our models for the simulated datasets selected one-layer, two-layer, and three-layer MLPs, while all MLPs for the real pig data were three-layer models. The optimizers of selected MLPs for the simulated datasets focused on "adam" and "adamax", while for the real dataset "sgd" was also included. Dropout rates of MLPs were between 0 and 0.034 for the simulated datasets and between 0.182 and 0.617 for the real pig dataset. Compared to the model architectures selected by Bellot et al. (2018) and Pérez-Enciso and Zingaretti (2019), we had similar hyperparameter options for the number of layers and the activation function. However, we selected different optimizers, and the dropout in our case tended to be larger in the real pig dataset. Penalty weights for L2 regularization of MLPs had a range of (0.01, 0.16) for the simulated datasets and a range of (0.03, 0.85) for the real pig dataset. We did not find any suggested L2 penalty weights in published genomic prediction studies.

Table 2.3 Hyperparameters of selected MLP models from each population.
SP, simulated pig dataset; SC, simulated cattle dataset; RP, real pig dataset; DE No., differential evolution run on a different data partition; No. layer(s), number of hidden layers; No. neurons, number of neurons according to the number of hidden layers.

Dataset | DE No. | Activation | No. layer(s) | No. neurons | Batch | Epoch | Optimizer | Dropout | L2
SP | 1 | elu | 2 | [446,87] | 51 | 37 | adam | 0.006 | 0.06
SP | 2 | elu | 2 | [412,150] | 41 | 45 | adam | 0.020 | 0.16
SP | 3 | elu | 2 | [470,155] | 46 | 44 | adam | 0.015 | 0.06
SP | 4 | selu | 2 | [474,145] | 54 | 45 | adam | 0.032 | 0.13
SP | 5 | softplus | 2 | [397,87] | 54 | 45 | adam | 0 | 0.13
SC | 1 | elu | 3 | [429,330,57] | 44 | 28 | adam | 0.030 | 0.04
SC | 2 | relu | 2 | [411,106] | 48 | 41 | adamax | 0.002 | 0.06
SC | 3 | elu | 3 | [401,269,93] | 11 | 27 | adamax | 0.001 | 0.01
SC | 4 | relu | 1 | 409 | 56 | 21 | adam | 0.034 | 0.14
SC | 5 | relu | 1 | 444 | 47 | 33 | adam | 0.020 | 0.16
RP | 1 | sigmoid | 3 | [374,192,25] | 10 | 40 | sgd | 0.352 | 0.85
RP | 2 | sigmoid | 3 | [476,193,69] | 54 | 42 | adam | 0.480 | 0.52
RP | 3 | sigmoid | 3 | [483,291,8] | 44 | 46 | adamax | 0.182 | 0.12
RP | 4 | sigmoid | 3 | [457,234,79] | 31 | 41 | adamax | 0.465 | 0.03
RP | 5 | sigmoid | 3 | [386,251,148] | 8 | 40 | sgd | 0.617 | 0.75

Table 2.4 shows the top CNNs. Optimized CNNs had three options for the activation function: "linear", "elu", and "relu" (Goodfellow et al., 2016). Similar to our results, top CNNs selected by Bellot et al. (2018) also included "linear" and "elu", while Pérez-Enciso and Zingaretti (2019) used "relu". The number of convolutional layers varied from one to three, while all CNNs for the simulated datasets used one convolutional layer. Notably, Bellot et al. (2018) also selected one-layer and three-layer CNNs. The filter sizes tended to be larger in the selected models. The large filter sizes differed from other studies that suggested two or three (Bellot et al., 2018; Pérez-Enciso and Zingaretti, 2019). Optimizers of selected CNNs were "adamax", "rmsprop" and "adam", while the CNNs for the real pig data always used "adam"; this finding is different from the "nadam" obtained by Pérez-Enciso and Zingaretti (2019). Most CNNs across the three datasets used average pooling for the pooling layer. For genomic prediction studies, we did not find a suggested pooling option in the literature. Dropout rates of CNNs ranged from 0.008 to 0.827 and the range was smaller (0.021, 0.277) for the real pig dataset. However, our findings for dropout differed from the small dropout (5-10%) recommended by Pérez-Enciso and Zingaretti (2019). Most L2 penalty weights were smaller than 0.16, while there were three exceptions (0.52, 0.75 and 0.85).

Table 2.4 Hyperparameters of selected CNN models from each population. SP, simulated pig dataset; SC, simulated cattle dataset; RP, real pig dataset; DE No., differential evolution run on a different data partition; No. layers, number of convolutional layers; No. filters, number of filters applied based on No. layers; FCL, size (number of neurons) of the fully connected layer after the flatten layer.
Dataset | DE No. | Activation | No. layers | No. filters | Filter size | Epoch | FCL | Optimizer | Dropout | L2 | Pooling
SP | 1 | linear | 1 | 110 | 19 | 25 | 17 | adamax | 0.197 | 0.21 | average
SP | 2 | elu | 1 | 16 | 15 | 32 | 110 | rmsprop | 0.146 | 0.03 | average
SP | 3 | elu | 1 | 15 | 8 | 44 | 79 | rmsprop | 0.692 | 0.02 | average
SP | 4 | linear | 1 | 59 | 20 | 24 | 49 | adamax | 0.496 | 0.23 | max
SP | 5 | linear | 1 | 109 | 13 | 27 | 109 | adam | 0.827 | 0.01 | average
SC | 1 | linear | 1 | 116 | 20 | 30 | 16 | adam | 0.370 | 0.10 | average
SC | 2 | linear | 1 | 87 | 12 | 25 | 12 | adam | 0.086 | 0.13 | average
SC | 3 | linear | 1 | 32 | 8 | 42 | 24 | adam | 0.250 | 0.19 | average
SC | 4 | linear | 1 | 79 | 20 | 44 | 27 | adamax | 0.666 | 0.06 | max
SC | 5 | linear | 1 | 98 | 16 | 40 | 153 | adam | 0.151 | 0.17 | average
RP | 1 | elu | 2 | [51,113] | 18 | 22 | 50 | adam | 0.277 | 0.67 | average
RP | 2 | relu | 3 | [24,81,121] | 12 | 27 | 268 | adam | 0.067 | 0.11 | average
RP | 3 | elu | 2 | [64,112] | 13 | 45 | 278 | adam | 0.021 | 0.87 | average
RP | 4 | relu | 3 | [44,73,106] | 13 | 47 | 326 | adam | 0.008 | 0.18 | average
RP | 5 | elu | 3 | [41,71,128] | 5 | 41 | 238 | adam | 0.051 | 0.35 | average

Despite our evolved hyperparameter sets being similar to those described in the literature (Bellot et al., 2018; Pérez-Enciso and Zingaretti, 2019), some parts of the hyperparameter configurations, e.g. the number of nodes (filters), the optimizer, and the dropout rate, differed from those described in the existing studies. This is likely because the optimal hyperparameter configuration depends on the specific genomic dataset, and a hyperparameter's relevance may depend on another hyperparameter's value (Luo, 2016). As Bellot et al. (2018) worked on a human dataset and Pérez-Enciso and Zingaretti (2019) investigated a wheat dataset, we attribute the variation among optimized hyperparameters to the specific dataset. It is also possible that our extended hyperparameter space allowed more candidate configurations to be explored, which led to differences in some hyperparameters compared to other studies. While other researchers optimized hyperparameters by discretizing the parameter space, we regarded the number of neurons (filters), dropout, L2 regularization, batch size, epoch, and filter size as continuous values, which considerably expanded the hyperparameter search space.

4.3 Performance of optimized models under validation
The objective of this paper is to provide a framework to optimize DL hyperparameters for genomic prediction, not to compare the optimized DL with GBLUP. It is, however, still relevant to use GBLUP as a baseline reference prediction method to contextualize our results (see Figure S2.2). For the simulated datasets, GBLUP was slightly better than the rest of the models. A similar result was presented by Abdollahi-Arpanahi et al. (2020). This is not surprising in our study because GBLUP (described in File S2.1) is a model well suited for the simulated data, which are entirely additive and composed of a large number of very small effects that approximate the infinitesimal model. However, for the real pig dataset, the pattern was somewhat different, and the best performing model depended on the data partition. As explained later in this section, we attribute this phenomenon to the small sample size of the real pig dataset. D'souza et al. (2020) argued that for a small dataset (e.g., N < 5,000), the presence of substructure or even a few outliers may have a profound influence on the predictive performance under a specific data partition, skewing the overall estimate of the predictive performance and affecting the outcome of any optimization method that is used. As DL is a methodology that relies on a learning process conditioned on the problem that it is solving (A.
Montesinos-López et al., 2018), it is less likely that a DL model can achieve its best possible prediction accuracy using a hyperparameter set optimized from other independent studies. To investigate this, we trained MLPs and CNNs with hyperparameters selected for predicting human traits (Bellot et al., 2018) and for a wheat dataset (Pérez-Enciso and Zingaretti, 2019), across the three datasets in this study. Table S2.16 shows hyperparameters of MLPs and CNNs obtained from the two studies. Figures 2.8-2.10 shows the predictive performance of random DL models, optimized DL models, and top DL models selected by Pérez-Enciso and Zingaretti (2019). These models were applied to all three datasets. Randomly selected DL models and optimized DL models differed in training data partitions due to independent DE optimizations performed within each partition, while the models suggested by the two previous studies (Bellot et al., 2018; Pérez-Enciso and Zingaretti, 2019) were fixed in all partitions. Prediction accuracy of external (cross) validations was obtained by refitting each model 30 times. The panels in Figures 2.8-2.10 represent the predictive performance of competing DL models for each data partition within each dataset. Noteworthy, the optimized models using DE were consistently the best when compared to randomly chosen models or to models taken from the literature, that have been optimized for other datasets. Models with hyperparameters chosen by Bellot et al. (2018) did not converge in any data partition and so they are not shown in Figures 2.8-2.10. This was likely due to exploding gradients or vanishing gradients (as previously discussed). Another observed case was that the model predicted every individual with the same value, making it impossible to compute the correlation between the predicted and observed response. This also confirmed the observation by Bellot et al. (2018) that convergence problems persisted after reinitializations of the algorithm. 41 Figure 2.8 Boxplots for the predictive performance of MLPs and CNNs using different hyperparameters (simulated pig dataset). Models were tested on five data partitions of the simulated pig dataset. Statistics represent external (cross) validations by fitting the same model 30 times. The left three boxes are for MLP models and the right three boxes are for CNN models. Null box means the model did not converge. Random, random hyperparameters; Perez, hyperparameters recommended by Pérez-Enciso and Zingaretti (2019); Opt, optimized hyperparameters using DE. Abbreviations stand for the same meaning in Figure 2.9 and Figure 2.10. For the simulated pig dataset, the MLP and the CNN suggested by Pérez-Enciso and Zingaretti (2019) was slightly worse than the optimized MLPs and CNNs that we obtained with DE. However, their performance was much worse in the simulated cattle and the real pig datasets. Again, the optimal hyperparameter configuration is problem-dependent and thus, it is important to search for the proper hyperparameters in DL genomic prediction applications given a specific dataset. 42 Figure 2.9 Boxplots for the predictive performance of MLPs and CNNs using different hyperparameters (simulated cattle dataset). Models were tested on five data partitions of the simulated cattle dataset. Statistics represent external (cross) validations by fitting the same model 30 times. The left three boxes are for MLP models and the right three boxes are for CNN models. 
The variations in the predictive performance under re-training observed in all models indicated that DL models were likely overfitting the data. Abdollahi-Arpanahi et al. (2020) showed variance in predictive performance (in terms of accuracy and mean squared error) of MLPs and CNNs over 10 replicates of cross validation which is in agreement with our results. In general, the SD of the correlation between predicted and observed phenotypes for the optimized MLPs/CNNs and those proposed by Pérez-Enciso and Zingaretti (2019) were smaller in the simulated datasets, while the SD in the real pig dataset was larger (Figure 2.10 compared to Figures 2.8 and 2.9). We speculate that there are two possible reasons for the variation: DL models are initialized with random weights at starting points and a relatively small sample size for training. For the random weights at baseline, Bellot et al. (2018) explained that the 43 performance of MLPs and CNNs depended on initialization values. For the training sample size, Abdollahi-Arpanahi et al. (2020) indicated that larger sample sizes improved the predictive ability of DL methods. Furthermore, in the field of image classification, Shahinfar et al. (2020) showed increased prediction accuracy and reduced variation in the performance of DL models as the sample size grew. Based on the results in Figures 2.8-2.10, the merit in terms of less variation over replicates of external (cross) validations was clearer in the simulated datasets that had larger sample sizes (N=7,539 for both the simulated pig and the simulated cattle datasets). In the real pig dataset that had a smaller sample size (N=910), SD was larger compared to those in the simulated datasets. Therefore, we argue that both the predictive ability and variation in the same DL models are associated with training sample size. Montesinos-López et al. (2018) also mentioned that DL method may fail to learn a proper generalization of the knowledge contained in the data, given small datasets. 5. CONCLUSIONS Overall, DL can be adapted to perform genomic prediction of complex traits, but it requires some effort to select appropriate hyperparameters. Any hyperparameter optimization will likely be dataset-specific and characteristics such as population structure and genetic architecture of the predicted trait may well require different DL model hyperparameters. In this study, we implemented differential evolution (DE) as a method to simultaneously identify optimal combinations of multiple hyperparameters. Compared to randomly selected models, our optimized MLPs and CNNs showed significant improvement in the predictive performance. In comparison to DL models with hyperparameters selected from other studies, optimized MLPs and CNNs also yielded better predictive accuracy. DE is an efficient and semi-automatic 44 algorithm that can be used to select an optimal hyperparameter set that leads to a better predictive performance. Moreover, overparameterization of DL can be mitigated by refitting models and selecting those that produce more consistent (less variable) prediction accuracies. We showed that this is more important when working with small datasets. Figure 2.10 Boxplots for the predictive performance of MLPs and CNNs using different hyperparameters (real pig dataset). Models were tested on five data partitions of the real pig dataset. Statistics represent external (cross) validations by fitting the same model 30 times. The left three boxes are for MLP models and the right three boxes are for CNN models. 
Null box means the model did not converge. 6. ACKNOWLEDGEMENTS This work was supported by Agriculture and Food Research Initiative Awards No. 2017- 67007-26176, No. 2010-65205-20342, the National Institute of Food and Agriculture (AFRI Project No. 2019-67015-29323), and by funding from the National Pork Board Grant No. 11– 042. Partial funding was also provided by the US Pig Genome Coordinator. 45 APPENDICES 46 APPENDIX A: SUPPLEMENTAL MATERIAL Table S2.1 Adaptive hyperparameter space for the number of neurons. Number of neurons (nodes) given the depth of network (number of hidden layers, HL) in multilayer perceptron models. Layer One HL Two HLs Three HLs Four HLs Five HLs 1 [4-512] [259-512] [344-512] [386-512] [412-512] 2 [4-258] [175-343] [259-385] [311-411] 3 [4-174] [132-258] [210-310] 4 [4-131] [109-209] 5 [4-108] Table S2.2 Adaptive hyperparameter space for number of filters. Number of filters (kernels) given the depth (number of convolutional layers) of convolutional neural network. Layer One layer Two layers Three layers Four layers Five layers 1 [4-128] [4-65] [4-44] [4-34] [4-28] 2 -- [66-128] [45-85] [35-65] [29-53] 3 -- -- [86-128] [66-96] [54-78] 4 -- -- -- [97-128] [79-103] 5 -- -- -- -- [104-128] Table S2.3 Minimum length of feature maps applied to each layer of convolutional neural network. Conv: Convolutional layer. Layer One layer Two layers Three layers Four layers Five layers Conv 1 4 16 64 256 1024 Pooling 1 2 8 32 128 512 Conv 2 -- 4 16 64 256 Pooling 2 -- 2 8 32 128 Conv 3 -- -- 4 16 64 Pooling 3 -- -- 2 8 32 Conv 4 -- -- -- 4 16 Pooling 4 -- -- -- 2 8 Conv 5 -- -- -- -- 4 Pooling 5 -- -- -- -- 2 47 Table S2.4 Distributions of optimized hyperparameters related to multilayer perceptron architectures for simulated pig data. Pop 1-5: MLP solutions to five data partitions (five differential evolution runs). Activation function Number of layers elu linear selu relu softplus One Two Three Other Pop1 6 38 4 0 2 3 18 29 0 Pop2 7 11 10 9 13 3 37 6 4 Pop3 11 20 10 6 3 8 33 7 2 Pop4 13 20 8 5 4 8 15 23 4 Pop5 11 1 13 15 10 8 30 7 5 Table S2.5 Distributions of optimized hyperparameters related to CNN architectures for simulated pig data. Pop 1-5: CNN solution populations of five differential evolutions runs. Size of fully connected layer: the number of neurons applied in the fully connected layer (after flatten layer). Q0.05, 5% quantile; Q0.95, 95% quantile. Activation function Number of layers Filter size Size of fully connected layer elu linear selu tanh One Two Three Other Q0.05 Median Q0.95 Q0.05 Median Q0.95 Pop1 13 18 14 5 19 26 5 0 10 16 19 19 73 380 Pop2 20 14 16 0 31 19 0 0 5 10 20 22 149 477 Pop3 13 19 18 0 8 37 4 1 2 8 20 12 53 452 Pop4 16 27 7 0 35 15 0 0 6 11 20 9 36 308 Pop5 8 26 15 1 32 10 8 0 8 13 20 20 39 481 Table S2.6 Distributions of optimized hyperparameters related to multilayer perceptron architectures for the simulated cattle data. Pop 1-5: MLP solutions to five data partitions (five differential evolution runs). Activation function Number of layers linear relu elu selu softplus One Two Three Four Pop1 43 0 4 2 1 17 17 13 3 Pop2 30 9 1 4 6 25 10 12 3 Pop3 41 1 6 2 0 6 28 14 2 Pop4 44 2 1 1 2 7 33 10 0 Pop5 49 1 0 0 0 19 20 10 1 48 Table S2.7 Distributions of optimized hyperparameters related to CNN architectures for simulated cattle data. Pop 1-5: CNN populations of five differential evolutions runs. Size of fully connected layer: the number of neurons applied in the fully connected layer (after flatten layer). Q0.05, 5% quantile; Q0.95, 95% quantile. 
Activation function Number of layers Filter size Size of fully connected layer elu linear selu tanh other One Two Three Other Q0.05 Median Q0.95 Q0.05 Median Q0.95 Pop1 3 45 0 2 0 34 10 6 0 8 18 20 16 46 354 Pop2 11 27 5 7 0 8 9 33 0 10 17 20 12 195 386 Pop3 0 49 1 0 0 40 8 2 0 6 15 20 24 26 396 Pop4 13 17 19 0 1 9 9 32 0 10 18 18 27 221 485 Pop5 7 37 2 4 0 12 16 21 1 10 18 20 26 148 416 Table S2.8 Distributions of optimized hyperparameters related to multilayer perceptron architectures for the real pig data. Pop 1-5: MLP solutions to five data partitions (five differential evolution runs). Activation function Number of layers sigmoid Other Two Three Four Pop1 47 3 6 44 0 Pop2 50 0 0 46 4 Pop3 50 0 0 44 6 Pop4 50 0 2 38 10 Pop5 50 0 1 46 3 Table S2.9 Distributions of optimized hyperparameters related to CNN architectures for real pig data. Pop 1-5: CNN populations of five differential evolutions runs. Size of fully connected layer: the number of neurons applied in the fully connected layer (after flatten layer). Q0.05, 5% quantile; Q0.95, 95% quantile. Activation function Number of layers Filter size Size of fully connected layer elu linear tanh other Two Three Four Other Q0.05 Median Q0.95 Q0.05 Median Q0.95 Pop1 19 6 25 0 16 24 10 0 12 13 18 22 367 506 Pop2 38 1 5 6 7 30 12 1 12 16 17 50 197 463 Pop3 3 4 40 3 16 26 8 0 8 13 18 60 150 463 Pop4 33 4 1 12 26 11 10 3 9 15 17 50 158 426 Pop5 20 14 16 0 2 25 22 1 5 11 18 12 195 416 49 Table S2.10 Distributions of optimized hyperparameters related to MLP model compilation and fitting for simulated pig data. Pop 1-5: MLP solution populations of five differential evolution runs. Q0.05, 5% quantile; Q0.95, 95% quantile. Optimizer Epochs Batch size Dropout rate L2 adam adamax other Q0.05 median Q0.95 Q0.05 median Q0.95 Q0.05 median Q0.95 Q0.05 median Q0.95 Pop1 48 0 2 27 35 47 14 30 53 0.03 0.16 0.45 0.03 0.18 0.67 Pop2 43 6 1 23 43 49 12 36 56 0.01 0.05 0.41 0.01 0.16 0.75 Pop3 43 6 1 23 32 48 6 28 52 0.01 0.05 0.57 0.01 0.17 0.69 Pop4 41 8 1 23 36 48 15 40 58 0.01 0.12 0.50 0.02 0.23 0.76 Pop5 35 15 0 22 35 50 19 38 57 0 0.04 0.14 0.01 0.14 0.87 Table S2.11 Distributions of optimized hyperparameters related to CNN model compilation and fitting for simulated pig data. Pop 1-5: CNN solution populations of five differential evolution runs. Q0.05, 5% quantile; Q0.95, 95% quantile. Optimizer Epochs Dropout rate L2 Pooling adam adamax adadelta nadam rmsprop Q0.05 median Q0.95 Q0.05 median Q0.95 Q0.05 median Q0.95 Max Average Pop1 4 11 8 8 19 24 31 48 0.02 0.36 0.79 0.01 0.10 0.49 27 23 Pop2 10 10 20 0 10 21 34 49 0.04 0.42 0.79 <0.01 0.04 0.25 22 28 Pop3 13 13 11 7 6 22 28 44 0.11 0.56 0.77 0.01 0.11 0.49 42 8 Pop4 10 29 5 3 3 23 31 46 0.05 0.39 0.78 0.01 0.12 0.64 22 28 Pop5 22 1 4 10 13 22 31 49 0.02 0.39 0.79 0.01 0.07 0.44 29 21 Table S2.12 Distributions of optimized hyperparameters related to MLP model compilation and fitting for simulated cattle data. Pop 1-5: MLP solution populations of five differential evolution runs. Q0.05, 5% quantile; Q0.95, 95% quantile. 
Optimizer Epochs Batch size Dropout rate L2 adam adamax nadam Q0.05 median Q0.95 Q0.05 median Q0.95 Q0.05 median Q0.95 Q0.05 median Q0.95 Pop1 47 1 2 23 41 47 11 31 57 0.01 0.15 0.52 0.04 0.27 0.81 Pop2 39 10 1 26 38 48 18 40 57 <0.01 0.09 0.45 0.06 0.26 0.85 Pop3 44 4 2 26 33 44 12 26 52 0.01 0.18 0.53 0.03 0.22 0.87 Pop4 40 0 10 21 39 50 10 23 47 0.03 0.21 0.54 0.03 0.39 0.86 Pop5 41 2 7 22 29 48 9 24 52 0.02 0.26 0.72 0.06 0.27 0.91 50 Table S2.13 Distributions of optimized hyperparameters related to CNN model compilation and fitting for simulated cattle data. Pop 1-5: CNN solution populations of five differential evolution runs. Q0.05, 5% quantile; Q0.95, 95% quantile. Optimizer Epochs Dropout rate L2 Pooling adadelta adam adamax nadam rmsprop Q0.05 median Q0.95 Q0.05 median Q0.95 Q0.05 median Q0.95 Max Average Pop1 3 28 11 1 7 23 35 50 0.01 0.30 0.65 0.03 0.24 0.72 20 30 Pop2 17 16 7 2 8 22 33 50 0.06 0.28 0.68 0.02 0.29 0.59 10 40 Pop3 3 21 22 0 4 22 40 44 0.04 0.29 0.78 0.02 0.16 0.50 12 38 Pop4 11 3 29 2 5 23 29 46 0.05 0.35 0.67 0.01 0.23 0.75 17 33 Pop5 4 19 5 1 21 21 34 46 0.02 0.24 0.60 <0.01 0.11 0.52 27 23 Table S2.14 Distributions of optimized hyperparameters related to MLP model compilation and fitting for real pig data. Pop 1-5: MLP solution populations of five differential evolution runs. Q0.05, 5% quantile; Q0.95, 95% quantile. Optimizer Epochs Batch size Dropout rate L2 adam adamax sgd Q0.05 median Q0.95 Q0.05 median Q0.95 Q0.05 median Q0.95 Q0.05 median Q0.95 Pop1 1 9 40 25 40 45 7 18 64 0.03 0.38 0.85 0.12 0.66 0.95 Pop2 19 30 1 22 37 45 28 55 64 0.04 0.41 0.82 0.09 0.60 0.92 Pop3 0 47 3 30 44 49 21 44 68 0.05 0.33 0.79 0.05 0.55 0.94 Pop4 2 47 1 31 33 47 30 31 32 0.15 0.56 0.86 0.04 0.37 0.86 Pop5 1 40 9 21 24 41 8 47 63 0.05 0.38 0.84 0.05 0.48 0.90 Table S2.15 Distributions of optimized hyperparameters related to CNN model compilation and fitting for real pig data. Pop 1-5: CNN solution populations of five differential evolution runs. Q0.05, 5% quantile; Q0.95, 95% quantile. Optimizer Epochs Dropout rate L2 Pooling adam adamax other Q0.05 median Q0.95 Q0.05 median Q0.95 Q0.05 median Q0.95 Max Average Pop1 36 9 5 22 31 41 0.02 0.32 0.71 0.03 0.54 0.96 1 49 Pop2 34 13 3 25 36 50 0.08 0.41 0.78 0.04 0.51 0.97 2 48 Pop3 45 4 1 22 34 45 0.02 0.41 0.88 0.16 0.49 0.93 13 37 Pop4 40 7 3 29 31 49 0.03 0.33 0.75 0.05 0.60 0.98 8 42 Pop5 44 5 1 24 44 46 0.03 0.38 0.77 0.05 0.55 0.95 0 50 51 Table S2.16 Selected MLP and CNN architecture derived from other studies. No. layers, the number of fully connected layers or convolutional layers; No. neurons (filters), the number of neurons or filters adaptive based on the number of layers. In the No. layers column, 1+1 means one convolutional layer plus one fully connected layer. No. No. neurons Filter Study Model Activation layers (filters) Dropout size Bellot et al. (2018) MLP elu 1 32 0.0100 NA Pérez-Enciso and Zingaretti (2019) MLP relu 4 [64,64,64,64] 0.0005 NA Bellot et al. (2018) CNN linear 1+1 [16,32] 0.0100 3 Pérez-Enciso and Zingaretti (2019) CNN relu 4 [64,64,64,64] 0.0005 3 52 Figure S2.1 Pseudocode for adaptive filter size. Conv, convolutional layer; int(x), convert x into the nearest integer; floor(x), get the largest integer that is smaller or equal to x. 53 Figure S2.2 Mean predictive performance and error bars across datasets and data partitions. The error bar represents the mean ± standard deviation of cross validation by fitting the same model 30 times. 
Pink, green, and blue bars correspond to GBLUP, MLP, and CNN models, respectively. MLP, multilayer perceptron; CNN, convolutional neural network; GBLUP, genomic best linear unbiased prediction.
APPENDIX B: FILE S2.1
Genomic best linear unbiased prediction (GBLUP) model
We used a GBLUP model to predict the response variable and its prediction accuracy as a benchmark for the MLPs and CNNs: $y = \mu + g + e$, where $y$ is the response variable with phenotypes ($y = y_{adj}$ in the real dataset), $\mu$ is the population mean, $g \sim N(0, G\sigma_g^2)$ is a vector of random genomic effects, $\sigma_g^2$ is the genomic variance, and $e \sim N(0, I\sigma_e^2)$ is a vector of residuals, where $I$ is an identity matrix with 1s in the diagonal. The matrix $G = ZZ'$ is the genomic relationship matrix and $Z$ is a matrix of standardized allelic dosages (VanRaden, 2008): $Z_{ij} = \frac{M_{ij} - 2p_j}{\sqrt{m \cdot 2p_j(1 - p_j)}}$, where $M$ is a matrix of allelic dosages, $p_j$ is the allelic frequency at SNP marker $j$, $i$ indexes the $i$th animal, and $m$ is the number of markers.
Multilayer perceptron
Typical MLP models (Figure 2.1) consist of an input layer, a variable number of hidden layer(s), and an output layer. Each layer contains several neurons (also known as nodes). Depending on the type of layer, the nature of the nodes will change. For instance, the number of nodes in the input layer is equal to the number of predictor features. In this study, the input layer represents an individual's genotype, and thus, the input layer will have as many nodes as SNP markers. In Figure 2.1, there are M nodes in the input layer, and its kth node receives as input the allelic count at the kth SNP for the nth individual ($x_n^k$). The output layer represents the prediction of the response variable produced by the MLP. In this case, the output will contain the prediction of an individual's phenotypic value ($\hat{y}_n$), which can be a continuous, an ordinal, or a categorical outcome. The nodes in one layer are connected to the nodes in the previous layer by a weighted sum operator. For instance, in Figure 2.1, the input of the jth node in hidden layer 1 is $z_j^1 = f\left(\sum_{k=1}^{M} w_{jk}^0 x_k^0\right)$, where $x_k^0$ represents the kth node from the previous layer, the weights $w_{jk}^0$ are unknown, are attached to $x_k^0$ (SNP k), and need to be determined through a learning process, and $f()$ is the activation function specified by the user. It is worth noting that non-linear functions can be used as $f()$ and there is no need to assume linearity as with classic genomic prediction models. Activation functions are detailed in the Hyperparameters section further on. Likewise, nodes between layers are fully connected, which means that the input sum of each node in a layer will contain as many terms as there are nodes in the previous layer: $z_{j'}^i = f\left(\sum_{j=1}^{nneuron_{i-1}} w_{jj'}^{i-1} z_j^{i-1}\right)$, where $j$ indexes the nodes of layer $i-1$, $nneuron_{i-1}$ is the number of nodes in hidden layer $i-1$, $i$ is the index of hidden layer $i$, and $w_{jj'}^{i-1}$ is the weight connecting the jth node ($z_j^{i-1}$) in hidden layer $i-1$ to the j'th node in hidden layer $i$ ($z_{j'}^i$).
Convolutional neural network
In the context of genomic prediction, the input layer for a single observation in a CNN is a one-dimensional array. Similar to the MLP, the input layer contains an animal's (nth individual's) genotype, and the number of units is equal to the number of SNP markers. In Figure 2.2, there are M units in the input layer and the kth unit represents the allelic count at the kth SNP for the nth individual ($x_n^k$).
The output layer represents the predicted response value $\hat{y}_n$ for the phenotype or breeding value of the nth individual. After the input layer, a CNN contains a variable number of convolutional layers, each followed by a pooling layer. For instance, in Convolutional Layer 1 of Figure 2.2, several filters are applied to the nodes of the input layer, where filters are arrays containing a certain number of weights used to convolve the input. In this case, each filter has three weights $w_{i1}^1$, $w_{i2}^1$, and $w_{i3}^1$, where $i$ represents the $i$th filter defined by the user. These filters are applied to every three consecutive units of the input layer (filter size equal to three). Also, the stride of the filter is equal to its length, which means that the filter is applied to non-overlapping sets of three contiguous SNP. The length of the filter (kernel) is defined by the number of weights it includes, i.e., the number of units convolved by the filter in the input data. An arbitrary number of filters $i = 1, \dots, I$ is applied in each convolution. The output of this process will be $I$ feature maps with length equal to $\frac{M - F}{S} + 1$, where $M$ represents the number of SNP markers, $F$ is the length of the filter, and $S$ is the stride. In our case, because the stride equals the filter size, the length is simply $M/3$. Moreover, the input of the $j$th unit in feature map 1 is $c_j^1 = f\left(w_{i1}^1 x_n^k + w_{i2}^1 x_n^{k+1} + w_{i3}^1 x_n^{k+2}\right)$, where $x_n^k$, $x_n^{k+1}$, and $x_n^{k+2}$ are the allelic dosages of individual $n$ at three consecutive SNP markers. The weights in the filters are unknown and need to be determined through a DL optimization process, and $f()$ is the activation function. In Convolutional Layer 1, the output of each convolution is saved in feature map 1, where the length of each feature map is $a_1 = M/3$ and the number of feature maps ($b_1$) is equal to the number of filters (kernels) applied to the input layer (in this case $b_1 = 5$ in Figure 2.2). A convolutional layer is followed by a pooling layer for the purpose of dimensionality reduction. In pooling layer 1 of Figure 2.2, $p^1 = (p_1^1, p_2^1, \dots, p_{a_1/2}^1)$ are elements obtained by summarizing every two consecutive units generated by the previous convolutional layer, and the output will be $b_1$ feature maps with a reduced length equal to $a_1/2$. Likewise, feature map 2 is followed by convolutional layer 2, where filters with three weights $w_{i'1}^2$, $w_{i'2}^2$, and $w_{i'3}^2$ are applied. In feature map 3, $b_2$ features with a length of $a_2$ are summarized into feature map 4, which has $b_2$ features with length $a_2/2$. If any of $M/3$, $a_1/2$, $(a_1/2)/3$, or $a_2/2$ has a remainder, the deficit unit(s) in the input data will be padded with zero(s). The last feature map (feature map 4) is re-arranged into a single vector that has $(b_2 \times a_2)/2$ elements. Each element in the re-arranged vector $z^1 = (z_1^1, z_2^1, \dots, z_l^1)$, with $l = (b_2 \times a_2)/2$, is fully connected to a hidden layer (like the ones described in the MLP section of this paper) with $nneuron$ nodes, which are the predictors for the output layer.
Hyperparameters
The number of hidden layers describes the depth of the network; DL requires at least one hidden layer. Several studies in the deep learning literature found that, for problems of similar size, the number of hidden layers can often provide better results with an upper bound of five (Arifin et al., 2019; Bellot et al., 2018). Thus, we optimized the number of hidden layers by selecting an integer ranging from one through five.
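To make the role of these architecture hyperparameters concrete, the following is a minimal sketch using the keras R interface (Chollet et al., 2017) of an MLP whose depth and width are treated as tunable quantities. It is not the code used in this study: the objects n_snp, x_train, and y_train are assumed to exist, and the specific values shown (number of hidden layers, layer widths, activation, dropout rate, and L2 penalty) are placeholders of the kind a single DE candidate solution would propose.

library(keras)
# Placeholder hyperparameters (one hypothetical DE candidate solution).
n_hidden <- 3                               # depth: number of hidden layers (sampled in 1-5)
n_units  <- c(300, 150, 60)                 # width of each hidden layer
act      <- "elu"                           # activation function
drop     <- 0.10                            # dropout rate
l2_pen   <- 0.01                            # L2 regularization parameter
model <- keras_model_sequential()
model %>% layer_dense(units = n_units[1], activation = act,
                      kernel_regularizer = regularizer_l2(l2_pen),
                      input_shape = n_snp) %>%   # one input unit per SNP marker
  layer_dropout(rate = drop)
if (n_hidden > 1) {
  for (i in 2:n_hidden) {
    model %>% layer_dense(units = n_units[i], activation = act,
                          kernel_regularizer = regularizer_l2(l2_pen)) %>%
      layer_dropout(rate = drop)
  }
}
model %>% layer_dense(units = 1)            # single output node: predicted phenotype
model %>% compile(optimizer = "adam", loss = "mse")
history <- model %>% fit(x_train, y_train, epochs = 40, batch_size = 32,
                         validation_split = 0.2,
                         callbacks = list(callback_early_stopping(patience = 10)))

Roughly speaking, in the DE optimization each candidate hyperparameter vector is decoded into a model of this form, and the predictive performance of the fitted model is used to judge the candidate.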
The number of neurons determines the number of units in a fully connected hidden layer, and it is known as the width of the network. Bellot et al. (2018) investigated the influence of the number of neurons on neural networks by varying the neurons per layer across four scenarios (16, 32, 64, and 128), and Pérez-Enciso and Zingaretti (2019) estimated the effect of the number of neurons in the first layer (8, 24, 32, 64, and 128). Both studies used discrete values for the number of neurons. In this study, we optimized the width of the neural network by selecting integers between 8 and 512. The number of neurons is optimized by the DE algorithm for every hidden layer of the MLP and the last hidden layer of the CNN.
The activation function transforms the weighted sums from the previous layer. Pérez-Enciso and Zingaretti (2019) recommended "tanh", "relu", "selu", and "sigmoid". In addition to their recommendation, we further included "elu", "softplus", and "linear" as possible activation functions.
In deep learning, an optimizer is an algorithm used to alter the attributes of the model, e.g., weights and learning rates, where learning rates are coefficients applied to the weight updates. Optimizer options included were "sgd", "adam", "adagrad", "rmsprop", "adadelta", "adamax", and "nadam".
Dropout is used to avoid overfitting due to the large number of weights that need to be estimated. Dropout randomly selects a proportion (the dropout rate) of the neurons in a layer whose weights are not updated in the current iteration. The dropout rate may affect the predictive performance of a model, and we included it in the hyperparameter optimization as a continuous parameter in the range (0,1). Another way to ease overfitting is to use weight regularization, which adds constraints on the weights to the loss function. For instance, in L2 regularization a squared penalty on the values of the weights is added to the loss function. This parameter may also have an effect on the model's predictive performance. We optimized the L2 regularization as a parameter in the range (0,1). L2 regularization is defined as: $L(\hat{\theta}; X, y) = \frac{1}{n}\sum_i \left(y_i - f_{\hat{\theta}}(X_i)\right)^2 + \lambda \sum_{j=1}^{p} \hat{\theta}_j^2$, where $L()$ represents the loss function, $X$ represents the input data, $y$ is the observed response variable, $\hat{\theta}$ consists of the weights in the deep learning model, $f_{\hat{\theta}}()$ represents the deep learning model, and $\lambda$ is the L2 regularization parameter.
An epoch refers to one iteration in which the entire training dataset is passed through the DL model to iteratively adjust the weights. Within an epoch, the training dataset is further divided into an actual training set for weight adjustment and a testing set that is used for performance evaluation. The number of epochs to be optimized was an integer between 21 and 50. We introduced an early-stopping rule that halted training when there was no improvement for ten consecutive epochs.
Batch size determines the number of randomly partitioned training samples (within an epoch) utilized to update the weights. For the simulated datasets, we first optimized a continuous value α (Table 2.1) in the range [0.001, 0.01], while the range for the real pig dataset was [0.01, 0.1]. Then, the batch size was defined as the product of the training sample size N and α. The number of samples utilized in each DL batch therefore varied according to the training size (N=6,539 for the simulated datasets and N=728 for the real pig dataset).
This hyperparameter has a profound influence on the computing time and memory required by TensorFlow (Abadi et al. 2015, https://www.tensorflow.org/) to fit the model. We optimized the batch size only for the MLP, while the batch size was fixed at 32 for the CNN because larger batch sizes became computationally too onerous for fitting CNNs to the larger datasets (N=6,539 and M=48,541).
The number of filters, filter (also known as kernel) size, and pooling function are hyperparameters exclusive to convolutional neural networks (CNN). A filter is an array of weights used to convolve the input, and multiple filters are typically utilized in each layer. Pérez-Enciso and Zingaretti (2019) explored CNN architectures with 16, 32, and 64 filters, while Bellot et al. (2018) varied the number of filters among 16, 32, 64, and 128. We optimized the number of filters in the CNN by selecting an integer between 4 and 128. The filter size of a 1d CNN is the number of weights in the filter. Pérez-Enciso and Zingaretti (2019) compared the predictive performance using kernel (filter) sizes of three, five, and seven, while Bellot et al. (2018) used filter sizes of two, three, five, and ten. In this study, we defined the sample space for filter size as an integer between two and 20.
A pooling layer is used to downsize the feature map that comes from the convolution operation by computing a summary statistical measure of several elements. The typical options for a pooling layer are average, minimum, and maximum. Bellot et al. (2018) applied a 1×2 pooling to the feature maps. Similarly, we fixed the size of the pooling to two units in the feature map and optimized the pooling function by selecting between the average value and the maximum value of the two units. In other words, our models were optimized by selecting one of the pooling rules (average pooling or maximum pooling) to downsize the feature map that comes out of the convolutional operation by half (the two units were summarized into one).
It is necessary to point out that the hyperparameter space for sampling values or options varies across the literature, and it is up to the user to set up the adaptive architecture of the network. With differential evolution (DE), users can optimize not only the subset of hyperparameters used in this study but also any other additional hyperparameters they deem relevant.
Predictive performance of DL models and GBLUP
Figure S2.2 shows the predictive performance (correlation between the predicted and the observed response variables in the external validation sets) for each method (optimized MLPs, optimized CNNs, and GBLUP). For the simulated pig dataset, all methods performed similarly, although GBLUP models were slightly better and CNN models were the worst. For the simulated cattle dataset, GBLUP models were better in partitions 1 and 5, while optimized MLPs and CNNs were slightly better in partitions 2, 3, and 4 (with tied performance of MLPs and CNNs). For the real pig dataset, the pattern was completely different, and the best model depended on the data partition. We did not notice a clear improvement in prediction accuracy for any of the models. Since deep learning model fitting results differ even with the same hyperparameters, we ran 30 external (cross) validations using the same hyperparameters and validation sets for all partitions across the three datasets. Each MLP and CNN was trained 30 times independently. Models in both the simulated pig and cattle datasets showed little variation across repetitions.
However, we observed more variation in the prediction performance for the real pig dataset. 62 REFERENCES 63 REFERENCES Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jozefowicz, R., Jia, Y., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Schuster, M., Monga, R., Moore, S., Murray, D., Olah, C., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X., 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Abdollahi-Arpanahi, R., Gianola, D., Peñagaricano, F., 2020. Deep learning versus parametric and ensemble methods for genomic prediction of complex phenotypes. Genet. Sel. Evol. 52, 1–15. https://doi.org/10.1186/s12711-020-00531-z Arifin, F., Robbani, H., Annisa, T., Ma’Arof, . .M.I., 2019. Variations in the umber of Layers and the Number of Neurons in Artificial Neural Networks: Case Study of Pattern Recognition. J. Phys. Conf. Ser. 1413, 0–6. https://doi.org/10.1088/1742- 6596/1413/1/012016 Bean, J.C., 1994. Genetic Algorithms and Random Keys for Sequencing and Optimization. ORSA J. Comput. https://doi.org/10.1287/ijoc.6.2.154 Bellot, P., de los Campos, G., Pérez-Enciso, M., 2018. Can deep learning improve genomic prediction of complex human traits? Genetics 210, 809–819. https://doi.org/10.1534/genetics.118.301298 Casiró, S., Velez-Irizarry, D., Ernst, C.W., Raney, N.E., Bates, R.O., Charles, M.G., Steibel, J.P., 2017. Genome-Wide association study in an F2 duroc x pietrain resource population for economically important meat quality and carcass traits. J. Anim. Sci. 95, 545–558. https://doi.org/10.2527/jas2016.1003 Chollet, F., Allaire, J., Falbel, D., 2017. R Interface to Keras. Corvin, A., Craddock, N., Sullivan, P.F., 2010. Genome-Wide Association Studies: A Primer. Psychol Med. 40, 1063–1077. https://doi.org/10.1017/S0033291709991723 Crossa, J., Martini, J.W.R., Gianola, D., Pérez-Rodríguez, P., Jarquin, D., Juliana, P., Montesinos-López, O., Cuevas, J., 2019. Deep Kernel and Deep Learning for Genome- Based Prediction of Single Traits in Multienvironment Breeding Trials. Front. Genet. 10, 1– 13. https://doi.org/10.3389/fgene.2019.01168 Cuyabano, B., 2020. GenEval. D’souza, R. ., Huang, P.Y., Yeh, F.C., 2020. Structural Analysis and Optimization of Convolutional Neural Networks with a Small Sample Size. Sci. Rep. 10, 1–13. https://doi.org/10.1038/s41598-020-57866-2 64 Das, S., Mullick, S.S., Suganthan, P.N., 2016. Recent advances in differential evolution-An updated survey. Swarm Evol. Comput. 27, 1–30. https://doi.org/10.1016/j.swevo.2016.01.004 Edwards, D.B., Ernst, C.W., Raney, N.E., Doumit, M.E., Hoge, M.D., Bates, R.O., 2008. Quantitative trait locus mapping in an F2 Duroc x Pietrain resource population: II. Carcass and meat quality traits. J. Anim. Sci. 86, 254–266. https://doi.org/10.2527/jas.2006-626 Eraslan, G., Avsec, Ž., Gagneur, J., Theis, F.J., 2019. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403. https://doi.org/10.1038/s41576-019-0122-6 Fragomeni, B.O., Lourenco, D.A.L., Masuda, Y., Legarra, A., Misztal, I., 2017. Incorporation of causative quantitative trait nucleotides in single-step GBLUP. Genet. Sel. Evol. 49, 1–11. https://doi.org/10.1186/s12711-017-0335-0 Gämperle, R., Müller, S.D., Koumoutsakos, P., 2002. 
A Parameter Study for Differential Evolution, in: Advances in Intelligent Systems, Fuzzy Systems, Evolutionary Computation. Press, pp. 293–298. Gianola, D., 2013. Priors in whole-genome regression: The Bayesian alphabet returns. Genetics 194, 573–596. https://doi.org/10.1534/genetics.113.151753 Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. Cambridge, Massachusetts : The MIT Press. Gualdrón Duarte, J.L., Bates, R.O., Ernst, C.W., Raney, N.E., Cantet, R.J.C., Steibel, J.P., 2013. Genotype imputation accuracy in a F2 pig population using high density and low density SNP panels. BMC Genet. 14. https://doi.org/10.1186/1471-2156-14-38 Habier, D., Fernando, R.L., Kizilkaya, K., Garrick, D.J., 2011. Extension of the bayesian alphabet for genomic selection. BMC Bioinformatics 12. https://doi.org/10.1186/1471- 2105-12-186 Hickey, J.M., Chiurugwi, T., Mackay, I., Powell, W., 2017. Genomic prediction unifies animal and plant breeding programs to form platforms for biological discovery. Nat. Genet. 49, 1297–1303. https://doi.org/10.1038/ng.3920 Hill, W.G., 2016. Is continued denetic improvement of livestock sustainable? Genetics 202, 877– 881. https://doi.org/10.1534/genetics.115.186650 Kim, T., Lee, J.H., 2019. Effects of Hyper-Parameters for Deep Reinforcement Learning in Robotic Motion Mimicry: A Preliminary Study. 2019 16th Int. Conf. Ubiquitous Robot. UR 2019 228–235. https://doi.org/10.1109/URAI.2019.8768564 Kok, K.Y., Rajendran, P., 2016. Differential-evolution control parameter optimization for unmanned aerial vehicle path planning. PLoS One 11, 1–12. https://doi.org/10.1371/journal.pone.0150558 65 Koumakis, L., 2020. Deep learning models in genomics; are we there yet? Comput. Struct. Biotechnol. J. https://doi.org/10.1016/j.csbj.2020.06.017 Lecun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature. https://doi.org/10.1038/nature14539 Luo, G., 2016. A review of automatic selection methods for machine learning algorithms and hyper-parameter values. Netw. Model. Anal. Heal. Informatics Bioinforma. 5, 1–15. https://doi.org/10.1007/s13721-016-0125-6 Meuwissen, T.H.E., Hayes, B.J., Goddard, M.E., 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829. Mitchell, B., Tosun, H., Sheppard, J., 2015. Deep learning using partitioned data vectors. Proc. Int. Jt. Conf. Neural Networks 2015-Septe. https://doi.org/10.1109/IJCNN.2015.7280484 Montesinos-López, A., Montesinos-López, O.A., Gianola, D., Crossa, J., Hernández-Suárez, C.M., 2018. Multi-environment genomic prediction of plant traits using deep learners with dense architecture. G3 Genes, Genomes, Genet. 8, 3813–3828. https://doi.org/10.1534/g3.118.200740 Montesinos-López, O.A., Martín-Vallejo, J., Crossa, J., Gianola, D., Hernández-Suárez, C.M., Montesinos-López, A., Juliana, P., Singh, R., 2019. New deep learning genomic-based prediction model for multiple traits with binary, ordinal, and continuous phenotypes. G3 Genes, Genomes, Genet. 9, 1545–1556. https://doi.org/10.1534/g3.119.300585 Montesinos-López, O.A., Montesinos-López, A., Crossa, J., Gianola, D., Hernández-Suárez, C.M., Martín-Vallejo, J., 2018. Multi-trait, multi-environment deep learning modeling for genomic-enabled prediction of plant traits. G3 Genes, Genomes, Genet. https://doi.org/10.1534/g3.118.200728 Nakisa, B., Rastgoo, M.N., Rakotonirainy, A., Maire, F., Chandran, V., 2018. Long short term memory hyperparameter optimization for a neural network based emotion recognition framework. IEEE Access 6, 49325–49338. 
https://doi.org/10.1109/ACCESS.2018.2868361 Pérez-Enciso, M., Zingaretti, L.M., 2019. A Guide on Deep Learning for Complex Trait Genomic Prediction. Genes (Basel). 10, 19. R Core Team, 2020. R: A Language and Environment for Statistical Computing. Shahinfar, S., Meek, P., Falzon, G., 2020. "How many images do I need?" Understanding how sample size per class affects deep learning model performance metrics for balanced designs in autonomous wildlife monitoring. Ecol. Inform. 57, 101085. https://doi.org/10.1016/j.ecoinf.2020.101085 Slatkin, M., 2008. Linkage disequilibrium: understanding the genetic past and mapping the medical future. Nat. Rev. Genet. 9, 477–485. https://doi.org/10.1038/nrg2361 Steibel, J.P., 2015. gwaR: Functions for performing GWA from GBLUP. Storn, R., Price, K., 1997. Differential Evolution - A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces. J. Glob. Optim. 11, 341–359. https://doi.org/10.1023/A:1008202821328 Tang, X., Sun, Y., 2019. Fast and accurate microRNA search using CNN. BMC Bioinformatics 20, 1–14. https://doi.org/10.1186/s12859-019-3279-2 VanRaden, P.M., 2008. Efficient methods to compute genomic predictions. J. Dairy Sci. 91, 4414–4423. https://doi.org/10.3168/jds.2007-0980 Yang, J., Benyamin, B., McEvoy, B.P., Gordon, S., Henders, A.K., Nyholt, D.R., Madden, P.A., Heath, A.C., Martin, N.G., Montgomery, G.W., Goddard, M.E., Visscher, P.M., 2010. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42, 565–569. https://doi.org/10.1038/ng.608 Yu, T., Zhu, H., 2020. Hyper-Parameter Optimization: A Review of Algorithms and Applications 1–56. Zhang, S.X., Chan, W.S., Peng, Z.K., Zheng, S.Y., Tang, K.S., 2020. Selective-candidate framework with similarity selection rule for evolutionary optimization. Swarm Evol. Comput. 56, 2–28. https://doi.org/10.1016/j.swevo.2020.100696 Zingaretti, L.M., Gezan, S.A., Ferrão, L.F. V., Osorio, L.F., Monfort, A., Muñoz, P.R., Whitaker, V.M., Pérez-Enciso, M., 2020. Exploring Deep Learning for Complex Trait Genomic Prediction in Polyploid Outcrossing Species. Front. Plant Sci. 11, 1–14. https://doi.org/10.3389/fpls.2020.00025
CHAPTER 3: PUBLICLY AVAILABLE DATASETS FOR COMPUTER VISION IN PRECISION LIVESTOCK FARMING: A REVIEW
Junjie Han, Joao R. Dorea, Tomas Norton, Andrea Parmiggiani, Daniel Morris, Janice Siegford, and Juan P. Steibel
1. ABSTRACT
The livestock sector is increasingly using precision livestock farming (PLF) to assist in automated and real-time decision making for management purposes. Among the tools used in PLF, computer vision (CV) is a predominant approach that allows automatic feature extraction from digital images/videos. Thus, CV is useful for monitoring animals and measuring phenotypes. A key to developing CV is training models with annotated imagery data. Unlike general CV, there is a limited amount of publicly available PLF imagery data. Furthermore, despite the potential of CV in PLF, most published CV applications in PLF are developed using rather small datasets, and their broader validity remains unknown. The goal of this study was to review public datasets for PLF-CV applications and the validation strategies used in the related work, which is a necessary step toward creating reference PLF datasets and developing standard evaluation metrics that can be informative in practical animal farming. We focused on pig and cattle datasets as well as their CV tasks.
We identified 20 public datasets, nine of which focused on pigs and 11 on cattle. The reviewed datasets spanned a wide range of CV tasks, e.g., detection, behavior recognition, identification, and tracking of animals, and they are useful for developing CV algorithms without having to record and annotate new videos. Finally, we provide suggestions to improve CV applications in PLF, perspectives on data reuse, and suggestions for broader validation of results.
2. INTRODUCTION
Livestock farmers face ever-increasing farming pressure due to the growing population and demands for food. As a result, the livestock sector has become increasingly interested in precision livestock farming technology to improve the production efficiency of the livestock industry (Norton et al., 2019). Precision livestock farming (PLF) refers to automated, continuous, and real-time monitoring technologies within the animal space that assist in decision making at the individual animal level, and PLF brings economic value to farmers as well as to animal breeders (Berckmans, 2017). Among PLF technologies, computer vision is predominant (Li et al., 2021). Computer vision (CV) has advantages in animal farming as it is non-invasive and can operate continuously and at a large scale (Chen et al., 2021). In the past decade, CV has been revolutionized by deep learning (Ponti et al., 2017), which is a set of flexible representation learning methods where a machine can be fed with raw data to automatically discover the representations needed for prediction or classification (Lecun et al., 2015). Deep learning (DL) has made great contributions to CV tasks including image classification, object detection, pose estimation, behavior recognition, semantic/instance segmentation, and tracking (Ponti et al., 2017; Voulodimos et al., 2018). These CV tasks are also related to certain PLF applications. For instance, the sex of chickens can be identified through image classification (Yao et al., 2020). Moreover, for animal behavior studies, several CV models have been proposed to recognize aggressive behaviors of pigs through video analysis (Chen et al., 2020b; Li et al., 2019). A key to developing a reliable DL-based CV model is to "feed" a large number of high-quality training samples to the model, i.e., the size and quality of data are vital (Lu et al., 2017; Marcus, 2018). For general CV applications, there are reference datasets such as COCO (Lin et al., 2014) and ImageNet (Jia Deng et al., 2009), which consist of thousands of images with millions of instances annotated by experts and allow CV developers to benchmark their state-of-the-art algorithms. Collecting and annotating these datasets was extremely laborious. For instance, over 70,000 worker hours were utilized to complete the COCO Dataset (Lin et al., 2014). Unfortunately, we do not have such a dataset available for use in PLF that is specifically designed for animal farming problems. As PLF is still an emerging technology, CV applications to animal farming are mostly at the performance-evaluation phase with relatively small datasets. Furthermore, a common practice in recently developed animal CV applications (Chen et al., 2020a; D. Li et al., 2019; Liu et al., 2020; Nasirahmadi et al., 2019; Zhang et al., 2020) is to use a random validation approach to evaluate model performance, where the training set (used for model development) and validation set (used for model assessment) are randomly split from the whole dataset.
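As a minimal illustration of the splitting strategies discussed here, the following R sketch contrasts a random split of annotated images with a split blocked by farming unit (here, pen). The data frame frames and its pen column are hypothetical placeholders standing in for any annotated image set, and the split sizes are arbitrary.

# 'frames' is an assumed data frame with one row per annotated image and a 'pen' column.
set.seed(1)
n <- nrow(frames)
# Random validation: images assigned to the validation set at random.
val_random <- sample(n, size = round(0.2 * n))
# Blocked validation: whole pens are held out, so no pen appears in both sets.
val_pens    <- sample(unique(frames$pen), size = 2)
val_blocked <- which(frames$pen %in% val_pens)

Under the blocked split, every image from a held-out pen is excluded from training, which mimics deploying the model on a farming unit it has never seen.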
However, such a random split might ignore the underlying temporal, spatial, or hierarchical structures of the data, leading to over-optimistic results (Roberts et al., 2017). In practical animal farming, there are underlying structures within the data, e.g., time-evolving effects (growing animals) and environmental factors (e.g., illumination and a changing background environment), that may not be reflected during model development under a random validation approach. Thus, how training-validation data are split can significantly affect the predictive performance of DL (Han et al., 2021). The goals of this paper are to: 1) review publicly available datasets for PLF and their practical applications, focusing on pig and cattle datasets that can be reused for future studies, 2) review the validation strategies in the mentioned applications, and 3) provide suggestions for improved CV applications in PLF focusing on data perspectives.
3. METHODOLOGY
3.1 Literature search parameters
Publications in the following databases were searched: Web of Science, Google Scholar, and Google Datasets. Search keywords were the following term combinations: "species term" + computer vision; "species term" + deep learning; "species term" + image analysis, where "species term" included pig, swine, and cattle.
3.2 Eligibility criteria
To be included in this review, a publication had to fulfill the following criteria: 1) it had to belong to one of the following categories: conference proceedings, peer-reviewed journal articles, or datasets assigned a Digital Object Identifier (DOI); 2) the publication had to be written in the English language; 3) it had to address a CV application in PLF; and 4) it had to contain a working link to an accessible dataset.
3.3 Data extraction
Literature was collected in March 2022. From the retrieved datasets and their related publications, data on several features were extracted and summarized. The full description of the extracted information is given in Table 3.1. Briefly, the features included the objective of the study, a detailed description of the imagery dataset, camera specifications and settings, software and code for implementation, a detailed description of the annotations tied to the imagery dataset, metadata of the observed animals, the data sampling protocol, and validation strategies.
Table 3.1 Description of extracted data.
Information Type of input Description Study attributes Author, year, and title Text Citation of the paper Computer vision tasks Categories Choice(s) of: entire body detection, body part detection, segmentation, behavior recognition, identification, and tracking Dataset name Text Name of the publicly available dataset for CV in PLF Database link URL Link to the website or data repository to download data Species Categories Animal species: cattle and pig Image and video attributes Modality Categories RGB, Depth, grayscale, and thermal images/videos Resolution Number Number of pixels identified by the height and width of image Number of files Number Number of files included in the public dataset Camera attributes Camera perspective Categories Choice(s) of: Angled-down view, top-down view, side view, and frontal view Software and code Code availability Yes/no Whether computer code is available for the CV task Annotation attributes Annotation types Categories Choice(s) of: class labels, bounding box, polygon mask, ID, and key point coordinates Annotation labels Text Detailed content of annotations e.g., bounding box coordinates and list of class labels Analysis unit Text The base analysis unit of the CV model e.g., a single image or a video clip Bounding box area Number Area in pixel for bounding boxes (this is only available for objected detection studies) ROI Text Region of interest Biological subject attributes Number of animals Number Number of observed animals in the dataset Number of instances per image Number A fixed number or a range for the count of visible animals in a single image Coat color pattern Text Characteristics of the animal coat color and marks Number of farming units List An exhaustive list of farming units from lower to higher levels e.g., 5 pens from 1 farm Age or production stage Text Production phase or age of the animal Sampling protocol Span of experiment Text The span of experiment for long-term image/video collection Recording schedule during the day Time range The time when images/videos were collected (this is typically found in the published paper) Validation strategies Validation method Categories The way to split the entire dataset into training, validation, and/or testing sets. Choices include: random validation, stratified random validation, blocked validation, and in-sample validation Evaluation metric Text Brief explanation of the metric and the estimated value 4. RESULTS We identified 20 public datasets, nine of which focused on pigs and 11 on cattle. For a clear overview, the authors, dataset name, species, and addressed CV task(s) of the traces are shown in Table 3.2. The full URLs to access the datasets are available in Table S3.1. 72 Table 3.2 Overview of public pig and cattle datasets utilized for computer vision tasks in precision livestock farming. Authors Dataset name Shortcut Species Computer vision task(s) Alameer et al. (2020) Newcastle Pig Posture D1 Pig Entire body detection, behavior recognition, and tracking Bergamini et al. (2021) Edinburgh Pig Behavior D2 Pig Entire body detection, behavior recognition, and tracking Riekert et al. (2020) Pig Position and Posture D3 Pig Entire body detection and behavior recognition Shirke et al. (2021b) ISRL Multi-Camera Tracking D4 Pig Entire body detection and tracking Psota et al. (2019) Pig Detection D5 Pig Body part detection Psota et al. (2020) Pig tracking D6 Pig Body part detection, identification, and tracking Shirke et al. 
(2021a) Pig Novelty Preference D7 Pig Body part detection and behavior recognition Wutke et al. (2021) Pig Detection and Tracking D8 Pig Body part detection, tracking and behavior recognition Tangirala et al. (2021) PigTrace D9 Pig Segmentation, tracking, behavior recognition, and identification Andrew et al. (2017) FriesianCattle2017 D10 Cattle Entire body detection and identification Andrew et al. (2017) AerialCattle2017 D11 Cattle Entire body detection and identification Andrew et al. (2021) OpenCows2020 D12 Cattle Entire body detection and identification Shao et al. (2020) Aerial Pasture D13 Cattle Entire body detection Gao et al. (2021) Cows2021 D14 Cattle Entire body detection and identification Han et al. (2019) Aerial Livestock D15 Cattle Entire body detection Li et al. (2019) NWAFU-Cattle D16 Cattle Body part detection and behavior recognition Shojaeipour et al. (2021) 300 Cattle D17 Cattle Body part detection and identification Andrew et al. (2016) FriesianCattle2015 D18 Cattle Identification Bhole et al. (2019) Holstein Cattle Recognition D19 Cattle Identification Pereiet et al. (2020) Cow Behavior D20 Cattle Behavior recognition In Table 3.2, the first column indicates the references for the original studies that analyzed and published the datasets. In addition to the full name of each dataset, we created a unique shortcut for naming convention that is later used in this section. Noteworthy, some datasets claimed to address multiple CV tasks in the original studies. The reviewed public animal datasets for CV focused on six tasks that are covered in Section 3.3. 4.1 Animal subjects Table 3.3 presents the number of observed animals, number of housing units, age or production stage, and coat color pattern for the different datasets. These biological characteristics are important in terms of defining the use of the data for various CV tasks. Deep learning-based 73 CV algorithms are known to be data-hungry, requiring very large numbers of training samples (Marcus, 2018). Thus, explicitly stating how many images are available in a CV dataset is extremely important. However, the total image count is not enough to characterize a CV dataset in PLF. Knowing the number of animals is essential too, as a thousand images from one animal is different from one image from each of a thousand animals. Broadly valid CV applications need to be trained on a large number of images collected from many animals. Likewise, identifying the number of farming units (pens, farms, etc.) available in a CV dataset for PLF is as important as counting individual animals, as datasets comprising several farming units will support CV applications with a broader scope of validity. Specifying the age and physiological stage of the animals in the dataset are also important as there may be some ages/stages at which animals vary more in size and shape, thus introducing extra variation into the datasets (e.g.: growing pigs vs. gestating sows or milking heifers vs. mature dry cows). Finally, the performance of CV may be influenced by the coat color of animals. For example, identifying animals may be easier in breeds exhibiting natural variation in color patterns compared to animals that show a uniform coat color. The biological sample size was not available in some datasets, while for those papers that reported it, the number of observed animals ranged from eight to 430 (Table 3.3). Most studies specified the farming units e.g., the number of experimental pens and the number of farms. 
Six pig datasets (D3, D4, D5, D6, D8, and D9) involved multiple pens, while all cattle datasets focused on single farms. Information about ages of animals was not available in the cattle datasets. Among the three pig datasets that have age information (D1, D5, and D6), ages of animals varied from three weeks to six months. Four studies reported production stages of the pigs (D2, D3, D4, and D8), which covered farrowing, nursery, and finisher. Most pig datasets 74 contained white pigs, and three datasets had back marks on white pigs. Animals with heterogeneous coat colors were presented across all cattle datasets. Table 3.3 Characteristics of animal subjects. *: to specify an exhaustive list of units/ranges if applicable. Multiple pens mean that the number of pens is more than two while the exact number remains unknown. Dataset Species # Animals Farming Unit(s)* Age or production Coat Color of Animals stage D1 Pig 15 1 pen 9-14 weeks Heterogeneous coat colors D2 Pig 8 1 pen Finisher pigs White pigs with back marks D3 Pig 430 18 pens from 5 Fattening and rearing White pigs compartments pigs D4 Pig 33 2 pens from the same Finisher pigs White pigs facility D5 Pig NA 17 pens 1.5-5.5 months White pigs D6 Pig NA Multiple pens 3-10 weeks; 11-18 White pigs weeks; 19-26 weeks D7 Pig NA 1 pen NA White pigs with/without back marks D8 Pig NA Multiple pens Farrowing and White pigs with/without back rearing pigs marks D9 Pig NA Multiple pens from 5 NA White pigs with/without back farms marks D10 Cattle 89 1 farm NA Heterogeneous coat colors D11 Cattle 23 1 farm Nursery Heterogeneous coat colors D12 Cattle 46 1 farm NA Heterogeneous coat colors D13 Cattle 218 1 farm NA Heterogeneous coat colors D14 Cattle 186 1 farm NA Heterogeneous coat colors D15 Cattle NA NA NA Heterogeneous coat colors D16 Cattle 63 1 farm NA Heterogeneous coat colors D17 Cattle 300 1 farm NA Heterogeneous coat colors D18 Cattle 40 1 farm NA Heterogeneous coat colors D19 Cattle 136 1 farm NA Heterogeneous coat colors D20 Cattle NA 1 farm NA Heterogeneous coat colors 4.2 Recording setup Camera setups and recording schedules are also known to impact data variability and system development (Li et al., 2021). Several characteristics of the recording setup were selected, extracted from all papers and the results are summarized in Table 3.4. The camera perspective is an important recording characteristic as it determines which visual component(s) of animals can be observed and used to develop CV. For instance, if an image dataset was collected using a top-down view, then a CV application would focus on extracting features from the back part of animals. Image modality indicates the type of image. Common image types 75 include RGB (red-green-blue), grayscale, depth, and thermal. An RGB image refers to color image and is representative of human vision. A grayscale image is a special type of digital image, which refers to gray monochrome representing light intensity. RGB images are prevalent and frequently used for artificial intelligence. Depth images consist of pixels that record the distance from the object to the camera and are useful for separating objects from background and for estimating objects’ size and volume. Thermal images allow researchers to observe variations in temperature of objects. Depending on the CV task, researchers may choose the image modality that fits better in the particular context. Resolution is typically described as the number of pixels of an image and is specified as the product of width and height of the image. 
Higher resolution provides more details of the objects in the image but requires larger storage space. In addition, the recording schedule during the day (i.e., the time when images were recorded) is reviewed for each dataset as it reflects the illumination condition during data collection, and illumination can greatly affect image quality (Wu and Sun, 2013). Span of the experiment (long- term recording schedule) is also important, as collecting a hundred images from the same day is different from obtaining a hundred images across ten days (the latter covers a large temporal variation). Top-down view and angled-down view (also known as tilted top-down view) are predominant camera views (Table 3.4). For datasets collected during daylight hours, RGB cameras were utilized except for two studies that introduced a depth camera and a thermal camera in addition to the RGB camera, respectively. The majority of the data were collected during daylight hours. Five datasets included night recordings (D4, D5, D6, D8, and D20), and grayscale images were introduced for night recordings. Resolution varied in the datasets. The span of experiment was unknown for nine datasets. Among the 13 datasets that have a long-term 76 recording schedule available, three were collected within one day, while the remaining ten datasets were collected from multiple days or even months. Table 3.4 Recording setup and schedule of publicly available datasets for computer vision in livestock farming. RGB, red-green-blue; RGB-D, RGB and depth. Multiple weeks/days mean that the experiment lasted more than two weeks/days while the exact number remained unknown. Varying resolutions represent that more than two resolutions are involved. Dataset Species Camera Modality Resolution(s) Recording schedule Span of perspective(s) during the day experiment D1 Pig Top-down view RGB 640×360 11AM-3PM 8 days D2 Pig Angled-down RGB-D 1280×720 7AM-7PM 6 weeks view D3 Pig Angled-down and RGB 1280×720 and Daylight hours 8 days between top-down views 640×480 2017 and 2018 D4 Pig Angled-down and RGB and 3840×2160 Daylight and night NA top-down views grayscale hours D5 Pig Top-down view RGB and 1920×1080 and Daylight and night Multiple weeks grayscale 2688×1520 hours D6 Pig Top-down view RGB and 2688×1520 Daylight and night Multiple weeks grayscale hours D7 Pig Top-down view RGB 1920×1080 NA NA D8 Pig Top-down view RGB and 1280×800 Daylight and night 3 months grayscale hours D9 Pig Angled-down and RGB 1280×720 and Daylight hours NA top-down views 1280×960 D10 Cattle Top-down view RGB 1486×1230 Daylight hours NA D11 Cattle Top-down view RGB Varying resolutions Daylight hours 1 day D12 Cattle Top-down view RGB Varying resolutions Daylight hours NA D13 Cattle Top-down view RGB 3000×4000 Daylight hours Approximately 3 months D14 Cattle Top-down view RGB Varying resolutions Daylight hours Multiple days D15 Cattle Top-down view RGB 3000×4000 and Daylight hours NA 3840×2160 D16 Cattle Side view RGB 1920×1080 9AM-4PM 1 day D17 Cattle Frontal view RGB 4000×6000 8AM-4PM 1 day D18 Cattle Top-down view RGB 1486×1230 Daylight hours NA D19 Cattle Side view RGB and 640×480 and Daylight hours 9 days thermal 320×240 D20 Cattle Angled-down RGB and 1920×1080 Daylight and night 3 months view grayscale hours 4.3 Review of selected datasets by computer vision task In this section, we organized and summarized the identified publicly available datasets into six CV tasks. 
A dataset was considered suitable for a certain CV task if both images and annotations were available to accomplish the task at hand. Some datasets were suitable for 77 accomplishing more than one CV task. Datasets with missing components (either images or key annotations) are not listed in the corresponding subsections. 4.3.1 Entire body detection The role of object detection is to estimate concepts and locations of objects in each image (Zhao et al., 2019). Object detection can be further divided into subdomains including entire body detection and detection of body parts or key points. Entire body detection is to provide spatial location of individual animals relative to the image (Figure 3.1a). To develop models for entire body detection, at minimum, a data point includes an image displaying the object(s) of interest and the coordinates of a rectangular bounding box enclosing each object. Developers should be aware of the number of instances per image (the count of visible animals in a single image), as a large number implies a broad view, and the scene can be complex. The bounding box area (in pixels) implies the size of object relative to the image resolution, and it is informative especially to anchor-based algorithms in object detection (Liu et al., 2016). The number of annotated images is important for model development as most DL-based CV algorithms are data-hungry, and this fact applies to all CV tasks. Eight datasets (four pig datasets and four cattle datasets) were identified to address the entire body detection problem (Table 3.5). Most datasets had varying numbers of instances (animals) per image, and there were up to 181 instances presented in a single image (in D15) while the minimum number was zero (no instance in the image; D13). For bounding box area (referred to the size of bounding boxes in pixels), the size varied in datasets. Given a fixed resolution, a small bounding box area implies that individuals/objects are small relative to the image, while a larger area means that the animal portion of the image is larger. Bounding box areas are relatively large in D1, while in D15 objects are extremely small relative to the entire 78 image. Interestingly, D1 has the largest number of annotated images, while D15 has the least number of images annotated. Computer code for implementation is available for three datasets (D1, D4, and D12). Figure 3.1 Examples of image data and key annotations for different computer vision tasks. Panel a) shows an example for entire body detection where each pig is enclosed in a bounding box. Panel b) presents an instance for body part detection, where heads of pigs are marked in red and rear parts of pigs are marked in blue. Panel c) shows an example of segmentation where each pig has a polygon mask. Panel d) presents an example of behavior recognition through an individual image, where lying pigs are enclosed in red bounding boxes and blue bounding boxes indicate pigs that are not lying. Panel e) is an example of behavior recognition by assigning a label to an image sequence. Panel f) shows an example of animal identification where each individual is assigned with a bounding box and a unique ID label. Panel g) displays an example of a tracklet across three consecutive frames. Table 3.5 Identified public datasets for animal entire body detection via computer vision. Code availability: whether computer code is available for entire body detection. 
*: an annotated image is considered as an image paired with an external file that includes manually annotated bounding box coordinates. Varying resolutions mean that there are more than two resolutions in the dataset.

Dataset | Species | # Instances per image | Bounding box area (in pixels) | Image resolution | # Annotated images* | Code availability (yes/no)
D1 | Pig | 1-11 | Approximately 0-10,000 | 640×360 | 113,079 | Yes
D2 | Pig | 8 | Approximately 60,000-90,000 | 1280×720 | 7,200 | No
D3 | Pig | 1-48 | Approximately 1-86,400 | 1280×720 and 640×480 | 305 | No
D4 | Pig | 1-30 | Approximately 2,200-508,000 | 3840×2160 | 380 | Yes
D12 | Cattle | 1-8 | Approximately 0-300,000 | Varying resolutions | 7,043 | Yes
D13 | Cattle | 0-16 | Approximately 5,959-7,830 | 3000×4000 | 670 | No
D14 | Cattle | 1-5 | Approximately 100,000 | 1280×720 | 10,402 | No
D15 | Cattle | 10-181 | Approximately 400-1,600 | 3000×4000 and 3840×2160 | 89 | No

4.3.2 Body part detection

Another subdomain of object detection is body part detection, where key points of animals, e.g., heads and hips, are detected and their coordinates estimated (Figure 3.1b). This CV task is also referred to as landmark detection (Wu and Ji, 2019). Training a model for body part detection requires that each data point contain an image with the object(s) of interest together with annotations specifying the coordinates of all possible key points or landmarks for each visible object. In general, researchers define repeatable and distinctive key points to extract reliable features across images and thus, it is important to explicitly list the body parts in a detection problem. Four public animal datasets (three for pigs and one for cattle) were identified for body part detection (Table 3.6). All three pig datasets (D5, D6, and D7) were collected using a top-down view and thus, body parts on the back of the pig were annotated (Table 3.6). The annotated key points of pigs include head, tail, shoulder, ears, and snout. Images for the cattle dataset (D16) were collected in a side view, and 16 key points were annotated for each instance (Table 3.6). The number of annotated images among the four datasets ranged from 668 to 135,000. In addition, D5 and D6 have multiple pigs per image, and all visible instances were annotated, while in D7 and D16 only one animal was annotated per image.

Table 3.6 Identified public datasets for animal body part detection via computer vision. *: an annotated image is considered as an image paired with an external annotation file. Code availability: whether computer code is available for body part detection.

Dataset | Species | Camera view | Key point(s) | # Annotated images* | Code availability (yes/no)
D5 | Pig | Top-down view | 1. Tail, 2. Shoulder, 3. Left ear, and 4. Right ear | 2,000 | No
D6 | Pig | Top-down view | 1. Shoulder and 2. Tail | 135,000 | No
D7 | Pig | Top-down view | 1. Tip of nose, 2. Head, and 3. Tail | 668 | Yes
D16 | Cattle | Side view | 1. Head, 2. Neck, 3. Spine, 4. Right front thigh root, 5. Right front knee, 6. Right front hoof, 7. Left front thigh root, 8. Left front knee, 9. Left front hoof, 10. Coccyx, 11. Right hind thigh root, 12. Right hind knee, 13. Right hind hoof, 14. Left hind thigh root, 15. Left hind knee, and 16. Left hind hoof | 2,134 | No

4.3.3 Segmentation

As a natural next step to object detection, segmentation predicts labels (objects) at the pixel level to achieve fine-grained inference (Garcia-Garcia et al., 2017), i.e., the segmentation distinguishes animals from the background. Development of a segmentation model requires both images and polygon masks for all instances presented in the image set (a minimal sketch of these annotation structures follows).
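To make the annotation structures reviewed in Sections 4.3.1-4.3.3 concrete, the following minimal sketch shows how a single annotated data point could be represented and how the bounding box area discussed above might be computed. The JSON-like layout, field names, and helper functions are illustrative assumptions and do not reproduce the format of any specific reviewed dataset.

# A hypothetical annotation record for one image; field names are assumptions,
# not the schema of any of the reviewed datasets (D1-D20).
annotation = {
    "image_file": "pen01_frame_000123.jpg",
    "image_size": [640, 360],                      # width, height in pixels
    "instances": [
        {
            "id": 1,
            "bbox": [120, 40, 210, 150],           # [x_min, y_min, x_max, y_max]
            "keypoints": {"shoulder": [150, 60], "tail": [200, 140]},
            "polygon": [[121, 45], [205, 48], [208, 147], [125, 149]],
        },
    ],
}

def bbox_area(bbox):
    """Bounding box area in pixels (informative for anchor-based detectors)."""
    x_min, y_min, x_max, y_max = bbox
    return max(0, x_max - x_min) * max(0, y_max - y_min)

def relative_box_size(bbox, image_size):
    """Fraction of the image covered by the bounding box."""
    width, height = image_size
    return bbox_area(bbox) / (width * height)

for inst in annotation["instances"]:
    print(inst["id"], bbox_area(inst["bbox"]),
          round(relative_box_size(inst["bbox"], annotation["image_size"]), 3))

The same image record can carry bounding boxes, key points, and polygon masks simultaneously, which is why one dataset can often serve more than one CV task.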
Compared to bounding boxes, polygon masks are generally more precise (Figure 3.1c). Currently, there is only one public animal dataset available for implementing segmentation (D9). For D9, Tangirala et al. (2021) added polygon mask annotations to RGB images, where the polygon was used to define the shape and edges of each instance (pig). Instead of annotating single image files, Tangirala et al. (2021) selected a set of frames from videos with the frame indices specified. A total of 540 annotated images with multiple pigs across complex scenes were provided. The count of pigs per image in D9 ranges from 13 to 37, and each instance has its unique polygon mask. In addition, Tangirala et al. (2021) made the computer code to automatically segment instances publicly available.

4.3.4 Behavior recognition

In CV, visual components can be used to detect and recognize objects in dynamic scenes, in order to learn and describe the behavior of the objects (Popoola and Wang, 2012). Some basic behaviors (e.g., standing and lying) can be recognized through a single image (Figure 3.1d), while more complex behaviors (e.g., mounting) require analyzing a set of images or an image sequence (Figure 3.1e). A particularly complex behavior recognition task is the recognition of animal interactions, which are behavioral actions involving at least two animals. Further, regions of interest (ROI) are necessary when multiple objects occur in the same image and the action recognition must focus on a specific part of the image. In some cases, the ROI can be the whole image. Collectively, the ROI, the base analysis unit (single image or image sequence), and the behavior class/category for each instance are necessary components of a behavior recognition dataset. We identified six public datasets for animal behavior recognition (five for pigs and one for cattle), which cover a wide range of behaviors in pigs and cattle (Table 3.7). For the original studies that utilized these datasets, analysis units included both single images and short video clips. The D2, D7, D9, and D20 datasets consist of videos, and for each dataset a fraction of video frames was selected and annotated. For the D1, D2, D3, and D9 datasets, the annotated behaviors include basic behaviors (e.g., standing, lying, and sitting). Further, D2 and D7 contain several complex behaviors such as moving, investigating, and exploring. For the cattle dataset (D20), images were annotated focusing on cows' activities near a feeding station. In addition, D20 is provided in video format, where each frame was assigned a label that indicated the cow's behavior near the feeder. In D1, D2, D3, D7, and D9, the ROIs were specified for each instance (i.e., the instance ROI rather than the entire image was used to analyze animal behavior). However, D20 specified the entire image as the ROI. Among the behavior recognition datasets, the number of annotated images ranges widely, from 305 to 1,526,473. Four datasets (D1, D7, D9, and D20) were published along with computer code to implement behavior recognition. No datasets are available for recognition of animal-animal interactions.

Table 3.7 Identified public datasets for animal behavior recognition via computer vision. i: an annotated file is considered as an imagery file paired with an external annotation file. ii: the classes were not explicitly defined. Code availability: whether computer code is available for behavior recognition.
Dataset | Species | Behavior types | ROI | Analysis unit | # Annotated images/videos(i) | Code availability (yes/no)
D1 | Pig | 1. Standing, 2. Lateral lying, 3. Sternal lying, 4. Sitting, and 5. Drinking | Individual | Single image | 113,079 images | Yes
D2 | Pig | 1. Eating, 2. Drinking, 3. Lying, 4. Standing, and 5. Moving | Individual | Single image + image sequence | 7,200 images from 12 videos | No
D3 | Pig | 1. Lying and 2. Not lying | Individual | Single image | 305 images | No
D7 | Pig | 1. Investigating and 2. Exploring | Individual | Image sequence | 20 videos | Yes
D9 | Pig | 1. Sitting and 2. Standing | Individual | Single image | 540 images from 29 videos | Yes
D20 | Cattle | 1. Frontal interaction, 2. Lateral interaction, 3. Vertical interaction, 4. Crowding, 5. Drinking, 6. Exploring, 7. Queueing, and 8. Normal | Entire image | Single image | 1,526,473 images from 253 videos | Yes

4.3.5 Identification

Animal identification (ID) is an important research topic, as PLF aims to monitor animals at the individual level. Identification can be considered as a classification problem where each individual/instance is assigned an ID label (Figure 3.1f). To develop CV for ID classification, a data point should at minimum include an image containing the relevant object and the ID label for the object. In more complex scenes, an image may include multiple individuals. In that case, an ROI and an ID label are required for every visible individual in a given image. Bounding boxes, body parts, or polygon masks can be used to indicate the ROI of the individual. In general, a large number of ID classes poses challenges to the predictive ability of ID models, as identifying an individual animal in a group of two is easier than identifying that animal in a group of ten. Therefore, we recorded the number of ID classes in the reviewed datasets. We found nine public datasets (two for pigs and seven for cattle) for animal identification (Table 3.8). The two pig datasets (D6 and D9) consist of videos, and selected frames were annotated for animal ID. Different from the pig ID datasets, all cattle ID datasets provided single images rather than video frames. We need to point out that the ID applications used a single image as the base analysis unit and thus, we reviewed the number of annotated images rather than the number of videos (Table 3.8). Both D6 and D9 contain multiple pigs per image (i.e., ROIs of individuals needed to be determined), and each individual was assigned an ID label. However, the numbers of ID classes in D6 and D9 do not represent the total number of observed pigs in the two datasets. For instance, 16 ID classes in D6 were annotated for all videos, but the ID classes were repeatedly used across different pig social groups (i.e., the same ID label was reused for different individuals). Furthermore, in D9, instance ID labels were used rather than unique pig IDs (i.e., the same individuals might have been assigned new instance IDs when annotating a different video). Images in the cattle ID datasets contained one animal per image (i.e., the entire image was considered as the ROI), and each image was assigned an animal ID. For the cattle ID datasets, the number of ID classes equaled the number of observed animals. Overall, the image sample size of the animal ID datasets ranged from 294 to 135,000. Only two of the studies (Andrew et al., 2021; Tangirala et al., 2021) published the computer code needed to reproduce the animal identification model.

Table 3.8 Public datasets for animal identification via computer vision.
i: "Individual" as the ROI means that the individual animal is first localized and then identified; otherwise, an ID class label is assigned to the entire image. ii: an annotated image is considered as an image assigned with an ID label. Code availability: whether computer code is available for identification.

Dataset | Species | # Classes | Unique ID for each animal? (yes/no) | ROI(i) | # Annotated images(ii) | Code availability (yes/no)
D6 | Pig | 16 | No | Individual | 135,000 | No
D9 | Pig | NA | No | Individual | 540 | Yes
D10 | Cattle | 89 | Yes | Entire image | 940 | No
D11 | Cattle | 23 | Yes | Entire image | 46,340 | No
D12 | Cattle | 46 | Yes | Entire image | 4,736 | Yes
D14 | Cattle | 182 | Yes | Entire image | 32,020 | No
D17 | Cattle | 300 | Yes | Entire image | 2,899 | No
D18 | Cattle | 40 | Yes | Entire image | 294 | No
D19 | Cattle | 136 | Yes | Entire image | 2,474 | No

4.3.6 Tracking

In CV, tracking aims to detect and follow objects in image sequences, where a detector distinguishes the tracked object from the local background (Soleimanitaleb et al., 2019; Stalder et al., 2009). Notably, among all the studies that published animal tracking datasets, researchers employed tracking-by-detection methods, i.e., their published datasets were prepared and annotated to develop tracking-by-detection models. Thus, in this subsection, we focus on data that are suitable for tracking-by-detection problems. In a tracking-by-detection approach, objects are first located in each frame and then linked across consecutive frames (Özuysal et al., 2006). As the tracking-by-detection problem resorts to the object detection or segmentation problems that are based on single-image analysis (see Sections 4.3.1-4.3.3), the ROI and instance label (or ID) of each object are annotated across frames in a video clip. The ROI can be in the form of a bounding box, body parts, or a polygon mask for each instance across the images. No inter-frame annotation is required to develop a tracking-by-detection model, but frame indices need to be explicitly specified. In short, an image sequence and the annotated tracklet make up a data point (Figure 3.1g). We identified five public datasets (D1, D2, D4, D6, and D9) that were used for animal tracking purposes. For D1, D2, and D6, the ROI and the corresponding class label were assigned to each instance, and all video frames were annotated using pre-trained object detection/segmentation models. In D1, there are 4,718 videos, where each video is approximately 9 minutes long. In D2, a total of 3,429,000 frames from 1,891 video clips are available. D6 contains 135,000 annotated frames from 15 videos. D4 and D9 are the two datasets that used manual annotations. Note that Shirke et al. (2021b) annotated every 15th frame in D4 instead of annotating consecutive frames, resulting in a total of 1,200 manually annotated frames from 10 videos available for tracking. In D9, 540 annotated frames from 30 video clips are available. Two studies (Shirke et al., 2021b; Tangirala et al., 2021) published computer code along with the datasets (D4 and D9) to implement animal tracking.

4.4 Validation strategy

In CV development, the whole dataset is generally split into a training set and a validation set. Once the data are split, the training set is used for model fitting and the validation set is used to evaluate the performance of the trained model. Different strategies can be used to split the entire dataset. In most CV applications to animal farming, the training set and its corresponding validation set are split at random.
However, dependence structures may exist in the data, e.g., temporal and spatial structures, which violate the assumption of independence between the training set and validation set and lead to overoptimistic results (Roberts et al., 2017). Therefore, it is reasonable to assume that how data are split can influence algorithm evaluation (Li et al., 2021). In this section, we review the validation strategies of the studies that originally used the public animal datasets to develop CV. Here, we define the validation strategy as the way of splitting the training and validation sets. Overall, there are four types of validation strategies used to split data: random validation, stratified random validation, blocked validation, and in-sample validation (a minimal sketch of the first three is given after Table 3.9). Random validation means that the training and validation sets are split at random to achieve a given ratio. An 80-20 split is common practice (i.e., 80% of the data are used for model training and the remaining 20% are used for model evaluation/validation). Stratified random validation means that researchers randomly select training and validation samples in the same (or similar) proportion as the samples appear in the population. For instance, if a pig dataset includes three production stages (weaning, nursery, and finishing), there are three strata based on the production stages. Within each stratum, a proportion of the data are sampled for the training set and the remaining data are sampled for the validation set. Stratified random validation is different from random validation in that the data are stratified in the training and validation sets. Another type of validation, namely blocked validation, is less common but important to evaluate model robustness. Blocked validation means that the training and validation sets are split according to a blocking factor such as time or location. For instance, a model can be trained using data collected from one farm and then tested on data from another farm. In this case, the blocking factor is the farming unit. The last type of validation is in-sample validation, where the same data used for model training are also used for model validation.

Table 3.9 shows validation strategies used in the studies that originally analyzed and published the datasets reviewed in this paper. The dataset column uses shortcuts defined in Table 3.2. Note that some datasets may correspond to more than one CV task and more than one validation strategy. Most applications evaluated their models using random validation and stratified random validation. A few datasets (D1, D3, D5, D13, D14) were used for blocked validation. In-sample validation was only used in tracking tasks.

Table 3.9 Validation strategies used for reviewed datasets in their original applications.

Validation strategy | Computer vision task | Dataset
Random validation | Entire body detection | D3, D4, D10, D11, D12, D13, D15
Random validation | Body part detection | D8, D16, D17
Random validation | Segmentation | D9
Random validation | Behavior recognition | D3, D7, D9
Random validation | Identification | D9, D10, D12, D18
Random validation | Tracking | D9
Stratified random validation | Entire body detection | D1, D2, D14
Stratified random validation | Behavior recognition | D1, D2
Stratified random validation | Identification | D11, D17, D19
Blocked validation | Entire body detection | D1, D3, D13
Blocked validation | Body part detection | D5
Blocked validation | Behavior recognition | D1, D3
Blocked validation | Identification | D14
In-sample validation | Tracking | D1, D2, D4, D6, D8

Evaluation metrics reported by the relevant publications are summarized in Table 3.10.
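Before turning to the reported metrics, the following minimal sketch illustrates the difference between random, stratified random, and blocked splits. It assumes scikit-learn and pandas and a hypothetical table of image-level metadata (file name, production stage, and farm); the column names and proportions are illustrative only and do not correspond to any of the reviewed studies.

import pandas as pd
from sklearn.model_selection import train_test_split, GroupShuffleSplit

# Hypothetical image-level metadata; columns are assumptions for illustration.
meta = pd.DataFrame({
    "image": [f"img_{i:04d}.jpg" for i in range(1000)],
    "stage": ["weaning", "nursery", "finishing"] * 333 + ["weaning"],
    "farm":  ["farm_A"] * 500 + ["farm_B"] * 500,
})

# 1) Random validation: an 80-20 split drawn completely at random.
train_rand, val_rand = train_test_split(meta, test_size=0.2, random_state=42)

# 2) Stratified random validation: the 80-20 split preserves the proportion
#    of each production stage in both subsets.
train_strat, val_strat = train_test_split(
    meta, test_size=0.2, stratify=meta["stage"], random_state=42)

# 3) Blocked validation: all images from the same farm stay together, so the
#    model is validated on a farm it has never seen during training.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
train_idx, val_idx = next(splitter.split(meta, groups=meta["farm"]))
train_block, val_block = meta.iloc[train_idx], meta.iloc[val_idx]

print(len(train_rand), len(val_rand))
print(val_strat["stage"].value_counts(normalize=True).round(2))
print(train_block["farm"].unique(), val_block["farm"].unique())

The last two printed lines make the contrast visible: the stratified split keeps the class proportions, while the blocked split places each farm entirely in either the training or the validation set.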
A few studies developed object detection models and simultaneously assigned a behavior class to each detected object, leading to similar evaluation metrics for both CV tasks. Depending on the CV task, the authors reported different metrics to evaluate model performance. In Table S3.2, we provide brief explanations of the evaluation metrics that are addressed in Table 3.10. Notably, in three studies, metrics obtained from the blocked validation strategy were compared with random validation or stratified random validation. In a behavior recognition application that used D1, Alameer et al. (2020) obtained a mean average precision (mAP; see Table S3.2) of 98.0% from stratified random validation, whereas the mAP decreased by 1.0% under blocked-by-replicate validation, in which the training and validation sets were collected in different experimental replicates at different times. Furthermore, Riekert et al. (2020) collected their data (D3) to develop an entire body detection model, and they used both blocked-by-pen validation and random validation. The reported mAPs for blocked-by-pen validations ranged from 76.8% to 87.4%, while mAPs obtained from random validation were between 67.7% and 87.2%. Using the same dataset, Riekert et al. (2020) also developed behavior recognition models using the two validation strategies, and the mAPs for blocked-by-pen validation were between 44.8% and 80.2%, while mAPs for random validation ranged from 49.2% to 80.9%. Moreover, Shao et al. (2020) collected D13 for cattle detection and used both blocked validation and random validation. In the blocked validation of that study (Shao et al., 2020), the validation set contained data that were collected on a different day and in a different area compared to the training set. Compared to random validation, the precision, recall, and F1 obtained from blocked-by-time-and-location validation decreased by 16.7%, 28.3%, and 23.0%, respectively (Shao et al., 2020).

Table 3.10 Evaluation metrics by computer vision tasks and validation strategies. A range is provided if more than one point estimate was reported for the specific validation strategy.
Computer vision task Validation strategy Dataset Evaluation metrics Random validation D3 mAP: 67.7-87.2% D4 mAP: 99.5%, average IoU: 80.52% D10 mAP: 99.0-99.6% D11 mAP: 99.0-99.6% D12 mAP: 96.6-99.9% D13 precision: 94.1-95.7%, recall: 94.4-94.6%, F1: 94.3-95.2%, Entire body detection AUC: 86.9-92.2% D15 mAP: 83.0-89.3% Stratified random D1 mAP: 98.0% validation D2 mAP: 84.6-100%, TP: 89.2-100%, FP: 0-10.8%, missing rate: 0-2.2% D14 mAP: 97.3-98.0% D1 mAP: 97.0% (blocked by replicate and time) Blocked validation D3 mAP: 76.8-87.4% (blocked by pen) D13 precision: 77.4%, recall: 66.1%, F1: 71.3% (blocked by time and location) Body part detection Random validation D8 recall: 94.2%, precision: 95.4%, F1: 95.1% D16 PCKh@0.5: 83.9-97.2% D17 TP: 99.1%, FP: 0, TN: 100%, FN: 89.0%, accuracy: 99.13% Blocked validation D5 recall: 96.0%, precision: 100%, F1: 98.0% (blocked by time); recall: 66.7%, precision: 91.1%, F1: 77.1% (blocked by pen) Segmentation Random validation D9 mAP: 69.0-92.0% Random validation D3 mAP: 49.2-80.9% D7 accuracy: 93.0-95.0%, mAP: 89.0-96.0% D9 AUC: 98.5% Behavior recognition Stratified random D1 mAP: 98.0% validation D2 accuracy: 63.0-89.0% Blocked validation D1 mAP: 97.0% (blocked by replicate and time) D3 mAP: 44.8-80.2% (blocked by pen) Random validation D9 CMC-1: 77.1%, CMC-5: 89.5%, CMC-10: 93.9% D10 accuracy: 84.9-87.2% D12 accuracy: 90.5%-95.6% D18 accuracy: 97.0% Identification Stratified random D11 accuracy: 98.1% validation D17 accuracy: 97.3-99.1% D19 precision: 98.1%, recall: 97.7%, F1: 97.9% Blocked validation D14 Top-1: 57.0%, Top-2: 71.8%, Top-4: 76.9%, Top-8: 79.7%, Top-16: 81.8% (blocked by time) Random validation D9 cMOTSA: 75.6-77.8% Tracking D1 MOTA: 94.0%, MOTP: 80.0% D2 MOTA: 76.8-100%, IDF1: 55.1-100% In-sample validation D4 IDF1: 53.2-66.1%, IDP: 49.9-61.8%, IDR: 56.9-71.0%, MOTA: 55.2-80.6%, MOTP: 44.5-61.3% D6 precision: 82.5%-97.2% D8 MOTA: 94.4% For entire body detection, body part detection, segmentation, and behavior recognition applications, mAP is the commonly used metric, making it comparable between studies. For the 89 studies that reported mAP, the metric ranged from 44.8% to 99.9%. Notably, studies that used blocked validations tended to report lower mAPs, while high mAPs concentrate on those used random validations and stratified random validations. Most identification applications reported accuracies that ranged from 57.0% to 99.1%. Again, the lowest accuracy was yielded in blocked validation. For the tracking task, multiple objects tracking accuracy (MOTA) is the most frequently reported metric, and it ranged from 55.2% to 94.4% in the examined studies. 5. DISCUSSION Deep learning-based CV algorithms have made significant progress in PLF-relevant applications. However, the scarcity of public image data for livestock is still a bottleneck, as DL applications require a large number of training samples. To the best of our knowledge, this is the first review that comprehensively investigates publicly available imagery datasets that could be used in PLF. We believe that this review contributes to the PLF community by presenting a compilation of public resources. To date, there are several reviews for state-of-the-art CV applications in PLF (Borges Oliveira et al., 2021; Chen et al., 2021; Li et al., 2021). Their reviews focused on the perspectives of algorthims and DL methodology i.e., the literature search logic was algorithm- oriented and application-oriented. 
Two of the reviews (Borges Oliveira et al., 2021; Li et al., 2021) reported public imagery datasets for pigs and cattle, which were subsets of the identified datasets in this study (except one dataset that did not satisfy the inclusion criteria of this study). However, this review complement literature as we reviewed a larger collection of public datasets for pigs and cattle, compared to the studies of Li et al. (2021) and Borges Oliveira et al. (2021). Li et al. (2021) specifically reviewed convolutional neural network-based CV systems in 90 livestock farming and listed five public datasets for cattle and three for pigs. They specified CV tasks, resolution, number of images, and annotations for each dataset. Furthermore, Borges Oliveira et al. (2021) reviewed DL algorithms applied to CV systems in livestock farming and found seven datasets for cattle and one for pig, among which the CV tasks and image types were specified. Nevertheless, the details provided in those two reviews are limited, and there are still gaps about how data can be used in different validation strategies through data split. In this review, we consider data as the fuel of DL-based CV algorithms. Instead of algorithms, we deliberately searched public image datasets in PLF and investigated the data structure and how predictive performance of CV varies in different validation strategies using the available data. In livestock farming, animals can be categorized based on their functional characteristics e.g., weaning pigs, growing pigs, and reproductive sows in swine farming (Puppe et al., 2008). However, few datasets are targeted at addressing the diversity of production stages of animals. Many PLF tasks such as abnormality detection generally take a time span of several weeks or months throughout a production cycle, which would require collecting images/videos over multiple production cycles to fully capture morphological features of the animals. Further, the reviewed datasets are rather small-scale in terms of environmental factors e.g., different farm conditions and sites. Therefore, more attention is needed to fill the gap when new datasets are created to account for the diversity and variation across different farms and production stages. All the identified 20 public datasets involve RGB images/videos, which indicates the prevalence of the RGB images in CV for PLF applications. Most datasets contain images in top- down views or angled-down views, limiting the visual components or feature space to the upper body part of the animals. However, this prevalence does not necessarily mean that the top-down view or angled-down view is favorable. Some CV tasks e.g., landmark detection and behavior 91 recognition would require other camera views. For instance, the frontal view is the most useful view for facial landmark recognition (Shojaeipour et al., 2021). However, there is only one public dataset available for facial recognition in cattle (D17) and none that are suitable for developing a pig facial recognition model (as images in all pig datasets are of top-down view and/or angled-down view). Across most of the focused CV tasks, ROI is the essential annotation process. In entire body detection datasets, the ratio of bounding box area relative to the entire image varies significantly between datasets i.e., the relative size of instances in the image differs. 
This is informative to researchers, especially when designing the architecture of anchor-based detectors e.g., YOLO and Faster R-CNN that are state-of-the-art algorithms for object detection (Liu et al., 2016; Redmon and Farhadi, 2018; Ren et al., 2015). For other CV tasks that require detecting objects e.g., simple behavior recognition through individual images and tracking-by-detection, the object size (relative to the entire image) also matters. The reviewed datasets contribute to a wide range of CV tasks. Although a dataset might be originally collected for a given CV task, the same imagery data can be used for other purposes. As shown in Figure 3.1 (panels a, b, c, and d), the same footage can be annotated in different ways depending on the purposes. Similarly, if images from a public dataset are found to be valuable, researchers can create new annotations of the images, regardless of the CV task for which the dataset was originally collected. Reannotated images will bring additional value to the public dataset. Most CV applications in PLF use random validation or stratified random validation for model assessment. But results from random validations can be overoptimistic, and random validation is less representative of real-life validation scenarios, as environments for capturing 92 images are quite complex in animal farming (Li et al., 2021). In practice, developers or researchers are interested in how CV is validated broadly in a way that examines how well the model can be generalized to other contexts (e.g., across different seasons and different farms), which is closer to blocked validation. Therefore, when there exist blocking factors e.g., time and farm unit in an imagery dataset, blocked validation can be utilized. We expect that block validation will yield a lower, but more realistic estimate, of the predictive performance of the CV application in practical animal farming contexts. Lastly, some recommendations are provided for creating/sharing public CV datasets in PLF and the use/reuse of the datasets. To create new public image datasets for the development of PLF, we recommend that both images and ground-truth annotation files be accessible to the community. In addition, we encourage researchers to share the rubrics or ethograms they used for the annotation process. Specifying the metadata e.g., recording schedule, long-term temporal design, farming units, and animal subject attributes (age, weight, and coat color) etc. will bring additional value to the dataset. To share data, we recommend separating the raw data from the annotated data through organizing them into different data repositories, as most researchers are more interested in the annotated set. Furthermore, timestamps or watermarks should be avoided in the shared data. Although not required, we encourage researchers to share the computer code for implementation and ethical use and approval obtained for the original experiment. Current public datasets for animal behavior recognition only involve individual behaviors, while image datasets dedicated to animal social interactions (along with annotations) are not yet available. Researchers may revisit the existing public image datasets where groups of 93 animals were recorded (e.g., D2 and D6) and reannotate for animal social behaviors. Alternatively, more efforts could be made to create new datasets for animal-animal interactions. 
Most importantly, blocked validation is recommended as an alternative or additional strategy to random validation when developing CV applications, as the results obtained from blocked validation tended to be lower compared to the random validation or stratified random validation. Blocking factors, that could be utilized to split the training and validation data, may include but are not are limited to: housing units or pens (Psota et al., 2019), locations (Riekert et al., 2020), replicates at different time points (Alameer et al., 2020), and/or combinations of multiple blocking factors (Shao et al., 2020). This will be helpful for the evaluation of generalizability and reproducibility of the model. Finally, this review will help researchers combine public datasets if the datasets address the same problem. Furthermore, researchers can combine multiple datasets for CV model development and perform blocked validation, where the dataset is the blocking factor. 6. CONCLUSION In PLF, publicly available image data are valuable, and the reuse of the public datasets is important as it reduces the effort required to collect and annotate images/videos. This review fills a gap in PLF literature, as it is the first review that comprehensively investigates publicly available imagery datasets for CV development in PLF. We identified 20 public datasets, nine of which focused on pigs and 11 on cattle. The reviewed datasets are related to six CV tasks including entire body detection, body part detection, segmentation, behavior recognition, identification, and tracking. Moreover, we reviewed and classified the related CV applications by validation strategies. We observed a general trend that blocked validation yields lower (but more 94 realistic) performance than the commonly used validation strategies of random validation and stratified random validation. 95 APPENDIX 96 Table S3.1 Website or URLs of publicly available animal datasets for computer vision. Dataset name URL (Accessed on May 2022) Newcastle Pig Posture https://figshare.com/articles/dataset/Automated_recognition_of_postures_and_drinking_behaviour_for_t he_detection_of_compromised_health_in_pigs/13042619/1 Edinburgh Pig Behavior https://homepages.inf.ed.ac.uk/rbf/PIGDATA/#:~:text=The%20pig%20behavior%20dataset%20consistin g,Most%20frames%20show%208%20pigs. 
Pig Position and Posture https://wi2.uni-hohenheim.de/analytics ISRL Multi-Camera Tracking https://drive.google.com/drive/folders/1E2wW2aRENgy_TqlzfICn58ahbTHVIaK6 Pig Detection https://uofnelincoln- my.sharepoint.com/personal/epsota2_unl_edu/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fepsota2% 5Funl%5Fedu%2FDocuments%2FMDPIdatasets%2FPigDetectionDataset2019%2Ezip&parent=%2Fpers onal%2Fepsota2%5Funl%5Fedu%2FDocuments%2FMDPIdatasets&ga=1 Pig tracking https://uofnelincoln- my.sharepoint.com/personal/epsota2_unl_edu/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fepsota2% 5Funl%5Fedu%2FDocuments%2FAnnotatedVideos%2Ezip&parent=%2Fpersonal%2Fepsota2%5Funl% 5Fedu%2FDocuments&ga=1 Pig Novelty Preference https://drive.google.com/drive/folders/14XUYxM15NAI-zBrntrmQofhLv5otAw5b Pig Detection and Tracking https://github.com/MartinWut/Supp_DetAnIn PigTrace https://drive.google.com/file/d/1s-bCnABh2Hef5l5OxydcY-tkPbrUGSjj/view FriesianCattle2017 https://research-information.bris.ac.uk/en/datasets/friesiancattle2017 AerialCattle2017 https://research-information.bris.ac.uk/en/datasets/aerialcattle2017 OpenCows2020 https://data.bris.ac.uk/data/dataset/10m32xl88x2b61zlkkgz3fml17 Aerial Pasture http://bird.nae-lab.org/cattle/ Cows2021 https://github.com/Wormgit/Cows2021 Aerial Livestock https://github.com/hanl2010/Aerial-livestock-dataset/releases NWAFU-Cattle https://github.com/MicaleLee/Database 300 Cattle https://cloud.une.edu.au/index.php/s/eMwaHAPK08dCDru FriesianCattle2015 https://data.bris.ac.uk/data/dataset/eac634de-4b97-4dcc-ab78-66e3c9d09294 Holstein Cattle Recognition https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/O1ZBSA Cow Behavior https://zenodo.org/record/3981400#.Yq-ChHbMJD9 97 Table S3.2 Metrics for performance evaluation in different validation strategies. Metric Full name of the metric Concise explanation Accuracy - Proportion of correct predictions TP True positive An outcome where the model correctly predicts the positive class (in binary classification) TN True negative An outcome where the model correctly predicts the negative class (in binary classification) FP False positive An outcome that a negative class is incorrectly predicted as positive FN False negative An outcome that a positive class is incorrectly predicted as negative precision (p) - A fraction of relevant instances among the retrieved instances recall (r) - A fraction of relevant instances that were retrieved F1 - × 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 × 𝑟𝑒𝑐𝑎𝑙𝑙 𝐹 = 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑟𝑒𝑐𝑎𝑙𝑙 AP Average Precision 1 Area under the precision-recall curve. 𝐴𝑃 = ∫0 𝑝(𝑟)𝑑𝑟 mAP mean Average Precision over N classes 𝑁 𝑚𝐴𝑃 = ∑ 𝐴𝑃𝑖 𝑁 𝑖=1 IoU Intersection over Union Given two areas, 𝐼𝑜𝑈 = 𝐴𝑟𝑒𝑎 𝑜𝑓 𝑜𝑣𝑒𝑟𝑙𝑎𝑝 𝐴𝑟𝑒𝑎 𝑜𝑓 𝑢𝑛𝑖𝑜𝑛 AUC Area under ROC curve AUC measures the 2d area under neath the receiver operating characteristic curve, which is a comprehensive evaluation of classification performance missing Missing rate A fraction of missing detections PCKh@0.5 Percentage of correct key-points when An evaluation metric for pose estimation that a detected joint threshold=0.5 is considered correct at the distance between the predicted and true joint is within a threshold (e.g., 0.5). CMC-N Cumulative Matching Characteristics A measure of 1:N identification system performance. Detailed description is provided by Bolle et al. (2005). Top-N - Model predictions with N highest probabilities. If one of N accuracy labels is a true label, it classifies the prediction as correct. 
cMOTSA Constrained multi-object tracking and 𝑐𝑀𝑂𝑇𝑆𝐴 = 𝑇𝑃 ̃ ⁄(|𝑇𝑃| + |𝐹𝑃|), where 𝑇𝑃 ̃ denotes soft TP. segmentation accuracy See the study of Tangirala et al. (2021) for details. MOTA Multiple objects tracking accuracy A metric that measures the overall accuracy of both the tracker and detection. See the study of Alameer et al. (2020) for details. MOTP Multiple objects tracking precision A measure to evaluate multiple object tracking. It is defined in the study of Alameer et al. (2020). IDF1 Multi-object identification F1 score Ratio of correctly identified detections over the average number of ground-truth and computed detections IDP Multi-object identification precision Fraction of computed detections that are correct. IDR Multi-object identification recall Correctly identified ground truth detections. 98 REFERENCES 99 REFERENCES Alameer, A., Kyriazakis, I., Bacardit, J., 2020. Automated recognition of postures and drinking behaviour for the detection of compromised health in pigs. Sci. Rep. 10, 1–15. https://doi.org/10.1038/s41598-020-70688-6 Andrew, W., Gao, J., Mullan, S., Campbell, N., Dowsey, A.W., Burghardt, T., 2021. Visual identification of individual Holstein-Friesian cattle via deep metric learning. Comput. Electron. Agric. 185, 106133. https://doi.org/10.1016/j.compag.2021.106133 Andrew, W., Greatwood, C., Burghardt, T., 2017. Visual Localisation and Individual Identification of Holstein Friesian Cattle via Deep Learning. Proc. - 2017 IEEE Int. Conf. Comput. Vis. Work. ICCVW 2017 2018-Janua, 2850–2859. https://doi.org/10.1109/ICCVW.2017.336 Andrew, W., Hannuna, S., Campbell, N., Burghardt, T., 2016. Automatic individual holstein friesian cattle identification via selective local coat pattern matching in RGB-D imagery. Proc. - Int. Conf. Image Process. ICIP 2016-Augus, 484–488. https://doi.org/10.1109/ICIP.2016.7532404 Benitez Pereira, L.S., Koskela, O., Pölönen, I., Kunttu, I., 2020. Data set of labeled scenes in a barn in front of automatic milking system [WWW Document]. Zenodo. https://doi.org/10.5281/zenodo.3981400 Berckmans, D., 2017. General introduction to precision livestock farming. Anim. Front. 7, 6–11. https://doi.org/10.2527/af.2017.0102 Bergamini, L., Pini, S., Simoni, A., Vezzani, R., Calderara, S., Eath, R.B.D., Fisher, R.B., 2021. Extracting accurate long-term behavior changes from a large pig dataset. VISIGRAPP 2021 - Proc. 16th Int. Jt. Conf. Comput. Vision, Imaging Comput. Graph. Theory Appl. 5, 524– 533. https://doi.org/10.5220/0010288405240533 Bhole, A., Falzon, O., Biehl, M., Azzopardi, G., 2019. A Computer Vision Pipeline that Uses Thermal and RGB Images for the Recognition of Holstein Cattle, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer International Publishing. https://doi.org/10.1007/978-3- 030-29891-3_10 Bolle, R.M., Connell, J.H., Pankanti, S., Ratha, N.K., Senior, A.W., 2005. The relation between the ROC curve and the CMC. Proc. - Fourth IEEE Work. Autom. Identif. Adv. Technol. AUTO ID 2005 2005, 15–20. https://doi.org/10.1109/AUTOID.2005.48 Borges Oliveira, D.A., Ribeiro Pereira, L.G., Bresolin, T., Pontes Ferreira, R.E., Reboucas Dorea, J.R., 2021. A review of deep learning algorithms for computer vision systems in livestock. Livest. Sci. 253, 104700. https://doi.org/10.1016/j.livsci.2021.104700 100 Chen, C., Zhu, W., Norton, T., 2021. Behaviour recognition of pigs and cattle: Journey from computer vision to deep learning. Comput. Electron. Agric. 
187, 106255. https://doi.org/10.1016/j.compag.2021.106255 Chen, C., Zhu, W., Steibel, J., Siegford, J., Han, J., Norton, T., 2020a. Recognition of feeding behaviour of pigs and determination of feeding time of each pig by a video-based deep learning method. Comput. Electron. Agric. 176, 105642. https://doi.org/10.1016/j.compag.2020.105642 Chen, C., Zhu, W., Steibel, J., Siegford, J., Wurtz, K., Han, J., Norton, T., 2020b. Recognition of aggressive episodes of pigs based on convolutional neural network and long short-term memory. Comput. Electron. Agric. 169, 105166. https://doi.org/10.1016/j.compag.2019.105166 Gao, J., Burghardt, T., Andrew, W., Dowsey, A.W., Campbell, N.W., 2021. Towards Self- Supervision for Video Identification of Individual Holstein-Friesian Cattle: The Cows2021 Dataset. Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., Garcia-Rodriguez, J., 2017. A Review on Deep Learning Techniques Applied to Semantic Segmentation 1–23. Han, J., Gondro, C., Reid, K., Steibel, J.P., 2021. Heuristic hyperparameter optimization of deep learning models for genomic prediction. G3 Genes|Genomes|Genetics 11, jkab032. https://doi.org/10.1093/g3journal/jkab032 Han, L., Tao, P., Martin, R.R., 2019. Livestock detection in aerial images using a fully convolutional network. Comput. Vis. Media 5, 221–228. https://doi.org/10.1007/s41095- 019-0132-5 Jia Deng, Wei Dong, Socher, R., Li-Jia Li, Kai Li, Li Fei-Fei, 2009. ImageNet: A large-scale hierarchical image database 248–255. https://doi.org/10.1109/cvprw.2009.5206848 Lecun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature. https://doi.org/10.1038/nature14539 Li, D., Chen, Y., Zhang, K., Li, Z., 2019. Mounting behaviour recognition for pigs based on deep learning. Sensors (Switzerland) 19. https://doi.org/10.3390/s19224924 Li, G., Huang, Y., Chen, Z., Chesser, G.D., Purswell, J.L., Linhoss, J., Zhao, Y., 2021. Practices and applications of convolutional neural network-based computer vision systems in animal farming: A review. Sensors 21, 1–42. https://doi.org/10.3390/s21041492 Li, X., Cai, C., Zhang, R., Ju, L., He, J., 2019. Deep cascaded convolutional models for cattle pose estimation. Comput. Electron. Agric. 164, 104885. https://doi.org/10.1016/j.compag.2019.104885 Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft coco: Common objects in context, in: European Conference on Computer 101 Vision. Springer, pp. 740–755. Liu, D., Oczak, M., Maschat, K., Baumgartner, J., Pletzer, B., He, D., Norton, T., 2020. A computer vision-based method for spatial-temporal action recognition of tail-biting behaviour in group-housed pigs. Biosyst. Eng. 195, 27–41. https://doi.org/10.1016/j.biosystemseng.2020.04.007 Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C., 2016. Ssd: Single shot multibox detector, in: European Conference on Computer Vision. Springer, pp. 21–37. Lu, Z., Jiang, X., Kot, A., 2017. Enhance deep learning performance in face recognition. 2017 2nd Int. Conf. Image, Vis. Comput. ICIVC 2017 244–248. https://doi.org/10.1109/ICIVC.2017.7984554 Marcus, G., 2018. Deep Learning: A Critical Appraisal 1–27. Nasirahmadi, A., Sturm, B., Edwards, S., Jeppsson, K.H., Olsson, A.C., Müller, S., Hensel, O., 2019. Deep learning and machine vision approaches for posture detection of individual pigs. Sensors (Switzerland) 19, 1–16. https://doi.org/10.3390/s19173738 Norton, T., Chen, C., Larsen, M.L.V., Berckmans, D., 2019. 
Review: Precision livestock farming: Building “digital representations” to bring the animals closer to the farmer. Animal 13, 3009–3017. https://doi.org/10.1017/S175173111900199X Özuysal, M., Lepetit, V., Fleuret, F., Fua, P., 2006. Feature harvesting for tracking-by-detection. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics) 3953 LNCS, 592–605. https://doi.org/10.1007/11744078_46 Ponti, M.A., Ribeiro, L.S.F., Nazare, T.S., Bui, T., Collomosse, J., 2017. Everything You Wanted to Know about Deep Learning for Computer Vision but Were Afraid to Ask. Proc. - 2017 30th SIBGRAPI Conf. Graph. Patterns Images Tutorials SIBGRAPI-T 2017 2018- Janua, 17–41. https://doi.org/10.1109/SIBGRAPI-T.2017.12 Popoola, O.P., Wang, K., 2012. Video-based abnormal human behavior recognition—A review. IEEE Trans. Syst. Man, Cybern. Part C (Applications Rev. 42, 865–878. Psota, E.T., Mittek, M., Pérez, L.C., Schmidt, T., Mote, B., 2019. Multi-pig part detection and association with a fully-convolutional network. Sensors (Switzerland) 19, 1–24. https://doi.org/10.3390/s19040852 Psota, E.T., Schmidt, T., Mote, B., Pérez, L.C., 2020. Long-term tracking of group-housed livestock using keypoint detection and map estimation for individual animal identification. Sensors (Switzerland) 20, 1–25. https://doi.org/10.3390/s20133670 Puppe, B., Langbein, J., Bauer, J., Hoy, S., 2008. A comparative view on social hierarchy formation at different stages of pig production using sociometric measures. Livest. Sci. 113, 155–162. https://doi.org/10.1016/j.livsci.2007.03.004 102 Redmon, J., Farhadi, A., 2018. Yolov3: An incremental improvement. arXiv Prepr. arXiv1804.02767. Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 28. Riekert, M., Klein, A., Adrion, F., Hoffmann, C., Gallmann, E., 2020. Automatically detecting pig position and posture by 2D camera imaging and deep learning. Comput. Electron. Agric. 174. https://doi.org/10.1016/j.compag.2020.105391 Roberts, D.R., Bahn, V., Ciuti, S., Boyce, M.S., Elith, J., Guillera-Arroita, G., Hauenstein, S., Lahoz-Monfort, J.J., Schröder, B., Thuiller, W., Warton, D.I., Wintle, B.A., Hartig, F., Dormann, C.F., 2017. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography (Cop.). 40, 913–929. https://doi.org/10.1111/ecog.02881 Shao, W., Kawakami, R., Yoshihashi, R., You, S., Kawase, H., Naemura, T., 2020. Cattle detection and counting in UAV images based on convolutional neural networks. Int. J. Remote Sens. 41, 31–52. https://doi.org/10.1080/01431161.2019.1624858 Shirke, A., Golden, R., Gautam, M., Green-Miller, A., Caesar, M., Dilger, R.N., 2021a. Vision- based Behavioral Recognition of Novelty Preference in Pigs 1–5. Shirke, A., Saifuddin, A., Luthra, A., Li, J., Williams, T., Hu, X., Kotnana, A., Kocabalkanli, O., Ahuja, N., Green-Miller, A., Condotta, I., Dilger, R.N., Caesar, M., 2021b. Tracking Grow- Finish Pigs Across Large Pens Using Multiple Cameras. Shojaeipour, A., Falzon, G., Kwan, P., Hadavi, N., Cowley, F.C., Paul, D., 2021. Automated muzzle detection and biometric identification via few-shot deep transfer learning of mixed breed cattle. Agronomy 11. https://doi.org/10.3390/agronomy11112365 Soleimanitaleb, Z., Keyvanrad, M.A., Jafari, A., 2019. Object tracking methods: A review. 2019 9th Int. Conf. Comput. Knowl. Eng. ICCKE 2019 282–288. 
https://doi.org/10.1109/ICCKE48569.2019.8964761 Stalder, S., Grabner, H., Van Gool, L., 2009. Beyond semi-supervised tracking: Tracking should be as simple as detection, but not simpler than recognition. 2009 IEEE 12th Int. Conf. Comput. Vis. Work. ICCV Work. 2009 1409–1416. https://doi.org/10.1109/ICCVW.2009.5457445 Tangirala, B., Bhandari, I., Laszlo, D., Gupta, D.K., Thomas, R.M., Arya, D., 2021. Livestock Monitoring with Transformer. Voulodimos, A., Doulamis, N., Doulamis, A., Protopapadakis, E., 2018. Deep Learning for Computer Vision: A Brief Review. Comput. Intell. Neurosci. 2018. https://doi.org/10.1155/2018/7068349 Wu, D., Sun, D.W., 2013. Colour measurements by computer vision for food quality control - A 103 review. Trends Food Sci. Technol. 29, 5–20. https://doi.org/10.1016/j.tifs.2012.08.004 Wu, Y., Ji, Q., 2019. Facial Landmark Detection: A Literature Survey. Int. J. Comput. Vis. 127, 115–142. https://doi.org/10.1007/s11263-018-1097-z Wutke, M., Heinrich, F., Das, P.P., Lange, A., Gentz, M., Traulsen, I., Warns, F.K., Schmitt, A.O., Gültas, M., 2021. Detecting animal contacts—A deep learning-based pig detection and tracking approach for the quantification of social contacts. Sensors 21, 1–16. https://doi.org/10.3390/s21227512 Yao, Y., Yu, H., Mu, J., Li, J., Pu, H., 2020. Estimation of the gender ratio of chickens based on computer vision: Dataset and exploration. Entropy 22. https://doi.org/10.3390/e22070719 Zhang, K., Li, D., Huang, J., Chen, Y., 2020. Automated video behavior recognition of pigs using two-stream convolutional networks. Sensors (Switzerland) 20. https://doi.org/10.3390/s20041085 Zhao, Z.Q., Zheng, P., Xu, S.T., Wu, X., 2019. Object Detection with Deep Learning: A Review. IEEE Trans. Neural Networks Learn. Syst. 30, 3212–3232. https://doi.org/10.1109/TNNLS.2018.2876865 104 CHAPTER 4: EVALUATION OF COMPUTER VISION FOR DETECTING AGONISTIC BEHAVIOR OF PIGS IN A SINGLE-SPACE FEEDING STALL THROUGH BLOCKED CROSS-VALIDATION STRATEGIES Junjie Han, Janice Siegford, Dirk Colbry, Raymond Lesiyon, Anna Bosgraaf, Chen Chen, Tomas Norton, and Juan P. Steibel 1. ABSTRACT Agonistic behavior at feeding spaces is associated with both welfare and feed intake issues in swine farming. Studying interactive social behaviors of group-housed pigs provides valuable information to improve their production and welfare. The aims of this study were to 1) develop a deep learning pipeline based on convolutional neural network (CNN) and long short-term memory (LSTM) to classify videos depicting four types of interactive behavior between pigs in a single-space feeding stall and 2) validate the pipeline through various blocked validation strategies. Four categories of behaviors were classified in this study: head-to-body contact (including gentle nosing, casual contact between head/ears of a pig with a feeding pig, head knocking, tail biting, and pushing); levering where the feeding pig was lifted from behind by another pig; mounting in which the feeding pig was mounted by another pig; and no-contact when a second pig entered the feeding stall without physical contact with the feeding pig. Behavior at the feeding stall was filmed twice, three weeks apart, for two consecutive days each week using six groups of grow-finish pigs (10 per group) housed in pens equipped with FIRE® feeders. This resulted in a total of 15,679 30-frame video episodes for classification. The dataset presented a class-imbalance problem, and our deep learning pipeline addressed the problem by incorporating focal loss. 
Random cross-validation, blocking-by-time validation, and blocking-by-feeder validation were utilized for the training-testing data split. The size of the training sets was held constant (N=7,500) through all validation scenarios. The average testing accuracies were 0.968(±0.001), 0.860(±0.033), 0.766(±0.026), and 0.860(±0.010) for random cross-validation, blocking-by-time validation, and blocking-by-feeder validation (at Feeder 1 and Feeder 2), respectively. The results indicate that the proposed pipeline yielded acceptable predictive performance in random cross-validation. However, performance was substantially worse in blocking-by-time and blocking-by-feeder validations. More work is needed on algorithm generalization to improve robustness across a variety of application scenarios. We provide public access to the dataset and the code.

2. INTRODUCTION

Understanding patterns of feeding behavior can be useful for pig management (Brown-Brandl et al., 2013), breeding (Ding et al., 2018) and research (Brown-Brandl et al., 2018; Salgado et al., 2021). In pig farming, animals are typically housed in groups and animals often have to compete for access to feeder space (Georgsson and Svendsen, 2002). Competition for feeder space may be especially intense with the single-space automatic feeders that are typically used in pig feed efficiency studies in grow-finish pigs. Moreover, the way pigs interact at the feeder with their group mates may affect growth and feed intake due to differential competition for feeder access (Georgsson and Svendsen, 2002; Nielsen et al., 1995). We have demonstrated previously that accounting for interactions between pigs during feeding events brings important information into pig research and breeding, because it allows more accurate estimation of social genetic effects of competition for feeder space (Angarita et al., 2021). Also, quantifying interactions at the feeder may eventually be used to improve pigs' feeding performance as well as their welfare (Angarita et al., 2021; Rodenburg and Turner, 2012).

The traditional method of analyzing animal behavior is through direct observation or by filming and later manual decoding of videos (Agha et al., 2020; Csermely and Wood-Gush, 1990; Machado et al., 2017; Nielsen et al., 1995). Direct observation by a human of many pigs simultaneously and for the length of time needed to generate useful data is not possible in a commercial farm environment (Martínez-Avilés et al., 2017). On the other hand, manual decoding of video footage can be laborious, time-consuming, and subject to annotator error (Chen et al., 2021). Computer vision (Forsyth and Ponce, 2011) applications, where artificial intelligence is used to process images, are now being developed to detect animal behaviors. Compared to the traditional approach that involves human effort, computer vision (CV) has the advantages of being low-cost, objective, and non-interventional, and of generating information continuously (Chen et al., 2021; Li et al., 2021). In most animal farming applications, CV for behavioral phenotyping is at the performance-evaluation phase. Most studies have primarily concentrated on the predictive ability of CV, while less attention has been paid to validation of CV algorithms. Livestock farms continue to produce a growing amount of CV datasets, reflecting a variety of information (Bahlo et al., 2019). However, validation studies on the predictive ability of CV are lacking for an important percentage of these applications (Gómez et al., 2021).
An assessment of model generalization is still needed in practical animal farming contexts (Li et al., 2021). In CV applications for detecting pig posture and behavior where the training set and its corresponding testing set were randomly split from the whole dataset, accurate results have been obtained (Chen et al., 2020a; Li et al., 2019; Liu et al., 2020; Nasirahmadi et al., 2019; Zhang et al., 2020). Most of these studies trained CV models using balanced datasets, manually constructed to obtain an equal sample size within each category of the classification problem (Chen et al., 2020a; Liu et al., 2020; Nasirahmadi et al., 2019; Zhang et al., 2020). However, a 107 strategy using balanced training sets sometimes overlooks the long-tail distribution (Zhou et al., 2018) of the categories under real-world conditions, where the sample sizes within categories of behavior vary. The aims of this study were to 1) develop a CV approach to classify pigs’ interactive behaviors in single-space feeding stalls, and 2) test the algorithm through random cross- validation and two blocked cross-validation strategies (Roberts et al., 2017), where the data are split temporally and spatially. We also present the importance of algorithm evaluation as well as diagnostics through multiple training-testing scenarios that are more practical in animal farming. 3. MATERIAL AND METHODS 3.1 Experimental design 3.1.1 Recording schedule and specifications The behavior of grow-finish pigs in a single-space feeding stall was observed through video recordings. Videos were collected from the Swine Teaching and Research Center at Michigan State University (East Lansing, MI 48824, USA). All animal protocols were approved by the Michigan State Institutional Animal Care and Use Committee (Animal Use Form number 01/17-007-00). A total of six social groups (SGs) with ten crossbred pigs per group were used for this study. These groups were rotated through two test pens (Table 4.1). Pig weight at the start of data collection was 32±3.57 kg and final weight was 72.6±6.6 kg. No remixing of pigs was performed during the experiment, and thus all pigs remained in the same social group during the study. We observed six SGs for six consecutive weeks beginning immediately after the pigs were introduced to grow-finish pens. Each group was observed for a total of four days (7 hours per 108 day) and this took place on two different weeks (three weeks apart) with two consecutive days of observation being carried out during each selected week. Table 4.1 Rotation schedule of social groups for the two experimental pens. SG, social group. Week 1 Week 2 Week 3 Week 4 Week 5 Week 6 Pen 1 SG 3 SG 4 SG 6 SG 3 SG 4 SG 6 Pen 2 SG 5 SG 2 SG 1 SG 5 SG 2 SG 1 Each SG spent seven days in a test pen before being replaced by another group. Within each seven-day period, the first five days included no recordings to allow for pigs to (re)acclimate to the pen. On the fifth day, test recordings were used to check and calibrate equipment as needed. On the sixth day and seventh day, video recording occurred from 9AM- 4PM (this was typically the time that the barn lights remained with some minor variation). Both pens were .88 m × . m, and each pen was equipped with a nipple drinker and a single-space automatic feeder (FIRE® Osborne Industries, KS, USA) with a dimension of .78 m × .7 m. Figure 4.1 shows the center views of the pens and the views of the single- space feeding stalls, respectively. 
We used Intel® RealSenseTM D435 cameras for RGB video recording, which were installed on top of the feeders at a height of 2.44 m relative to the floor. Each pen was equipped with one camera to collect top-down-view videos of the feeding stall. Cameras were managed through MATLAB (R2018b, The MathWorks Inc., MA), and video recordings were saved in MP4 format at 30 frames per second. Raw videos were cropped to create a fixed top-down view of the feeding stall region (Figure 4.1, Panels C and D). In Pen 1, the cropped videos had a resolution of 98 × 8 pixels, while the resolution for Pen 2 was 96 × 8 pixels. Each camera was connected to an on-site microcomputer that had an Intel® CoreTM i5-7500T CPU @ 2.7 GHz with 16GB DDR4 RAM and with Microsoft Windows 10 Pro operating system. 109 Figure 4.1 Top-down views of pens and feeding stalls. Panels A and B (infrared images) show the center views of Pen 1 and Pen 2, respectively. Panels C and D are top-down views of the feeding stalls for Pen 1 and Pen 2, respectively. 3.1.2 Behavior ethogram and dataset Five classes of agonistic behaviors in the feeding stall were observed and defined, including no-contact (NC), ear-to-body (EB), head-to-body (HB), levering (L), and mounting (M) between pigs. Behaviors were annotated by trained observers according to the ethogram described in Table 4.2. After a first analysis, HB and EB were merged into a single category, HB, for two reasons. First, the two classes shared considerable visual and dynamic similarity. Second, preliminary results indicated that our prototype CV model could not distinguish EB and HB. 110 Table 4.2 Ethogram for the agonistic behaviors in pigs. *: ear-to-body was merged into head-to-body. Behavior Description Code No contact Two pigs were in view at the feeding stall. The behind pig NC had at least both ears in the feeding stall but there was no physical contact between the behind pig and the body of front pig. Ear-to-body* The behind pig had at least both ears in the feeding stall and EB* unintentional contact was made. The behind pig might be nosing the floor or eating displaced feed and making slight, non-forceful contact with the front pig. This often appeared as the behind pig’s ears grazing the front pig or the behind pig’s nose bumping the rear legs of the front pig while investigating the floor. Head-to-body* The behind pig used its head to make intentional contact HB* (greater than 1 second) with the body of the front pig. Quick (less than 1 second) bumps/run-ins by the behind pig were not recorded. Levering The behind pig’s snout was under the body of the front pig L and the front pig was lifted from the ground vertically. Any lifting of the front pig that involved a behind pig was considered levering. Typically, only the back half of the front pig was lifted. This often manifested as the behind pig pushing forward under the front pig, but it could also appear as the front pig backing up and over the head of the behind pig. Mounting The behind pig lifted its two front legs and put the two legs M or its breast on the rear part of the front pig. The mounting pig may sit down during the mounting. Mounting commenced when the two front legs of or the breast of the behind pig contacted the front pig and terminated as soon as the mounting pig was no longer on top of the front pig even some contact was still maintained. We only focused on video segments when there were at least two pigs present in the feeding stall. Such events with two or more pigs were passed to the observers. 
After a preliminary review of the videos, observers indicated that videos shorter than 30 frames (approximately 1 second) tended to lack sufficient information (not enough frames) to make a classification decision, while longer videos might include more than one behavior. Therefore, 30 consecutive frames were set as a video episode (the base processing unit) to classify agonistic interactions of pigs at the feeding stall. Prior to further processing, each segment of video (when there were two or more pigs in the feeding stall) was cut into 30-frame video episodes labelled with one of the four behavior classes (NC, HB, L, or M) following annotation by a trained human observer. Some behavioral classes (i.e., L and M) were less common than others (i.e., HB and NC) as they occurred less often or for shorter periods of time, thus yielding fewer episodes of these behavioral classes. Temporal perturbation, e.g., sub-sampling short video clips from a whole event sequence, has been applied to video data augmentation in several studies (Ji et al., 2019; Kim et al., 2020; Yun et al., 2020). To augment the instances of minority classes, specifically L and M, we up-sampled episodes by overlapping 25 frames when generating consecutive episodes from a whole video segment (Figure 4.2). Episodes labelled as HB and NC were cut from the whole video segments without overlapping frames in consecutive episodes (Figure 4.2). Episodes were created in this way to mitigate the class imbalance in our dataset. We obtained a total of 15,679 30-frame episodes. Among them, NC, HB, L, and M activities made up 3,398 (22%), 10,114 (66%), 925 (6%), and 1,242 (8%) of the dataset, respectively. The video dataset was then ready to be split into a training set and a testing set according to the validation strategies explained in Section 3.1.3.
Figure 4.2 Examples for generating episodes for no-contact, head-to-body, levering, and mounting events.
3.1.3 Validation strategies
In modelling, dependence structures in the data, such as underlying temporal, spatial, and hierarchical structures, violate the assumption of independence between the training set and the testing set and lead to overoptimistic results (Roberts et al., 2017). To tackle the temporal structure (e.g., the growing size of pigs) and the spatial structure (e.g., varying conditions of the two pens), we used blocked cross-validation, as proposed by Roberts et al. (2017), to split the whole dataset into a training set and a testing set given different blocking factors. Specifically, we split the entire dataset according to temporal characteristics (blocking by time) and spatial characteristics (blocking by feeder). In addition, we used random cross-validation as a reference for comparison. The three validation strategies are as follows:
1. Five replicates of random cross-validation were used to evaluate predictive performance of the CV model. In each replicate, a random subset of the data was used for model training, while the remaining instances were used for model testing.
2. A blocking-by-time dataset was created to study whether a model could be trained using footage of younger pigs and then applied to older pigs with acceptable predictive performance. In this scenario, episodes from the first three weeks were defined as the training set, while episodes from the last three weeks were used for testing.
3. A blocking-by-feeder dataset was generated, where episodes from one of the two feeders were used for training and episodes from the other feeder were used for testing. This was done to study whether slight changes in experimental setup, including different illumination, camera position/angle, and social groups, affected the predictive performance of DL.
3.2 Computer vision algorithm
3.2.1 Deep learning pipeline for video classification
Deep learning (DL), a predominant analytical tool used in CV, is a set of representation learning methods in which a machine can be trained with raw data to discover the representations needed for prediction or classification without requiring extensive background knowledge (Lecun et al., 2015). Such advantages have made DL the preferred tool for behavior recognition applications using videos and images from different animal farming contexts (Chen et al., 2021; Li et al., 2021). A commonly used pipeline for video segment classification to detect behavior of pigs consists of coupling a convolutional neural network (CNN) (LeCun and Bengio, 1995) with a long short-term memory (LSTM) model (Hochreiter, 1997). This allows the CNN to extract relevant spatial features from each individual frame and the LSTM to classify the whole set of frames while accounting for the temporal dependence in the video. For example, a CNN + LSTM pipeline has been successfully applied to recognize aggressive/non-aggressive episodes and tail-biting behavior in group-housed pigs (Chen et al., 2020b; Liu et al., 2020). In this study, we employed a CNN + LSTM pipeline (Figure 4.3) to classify the 30-frame short videos.
Figure 4.3 Deep learning pipeline for pig's aggressive behavior detection based on videos. Graph for ResNet-50 architecture was obtained from Talo (2019).
3.2.2 Feature extraction with convolutional neural network
In our work, a CNN served as a feature extractor that received an individual frame as input and generated numerical features as output that were considered spatial representations. CNNs were designed to process data that have a spatial structure, for example using 2-D images for object detection (Lecun et al., 2015). We used transfer learning in Stage 1 (Figure 4.3) for spatial feature extraction, where existing knowledge from a related CV application is transferred to a new context (Torrey and Shavlik, 2010). This is a common practice with CV in livestock applications (Chen et al., 2020b; Wu et al., 2021; Yin et al., 2020). We compared three pre-trained CNN models that were well established for computer vision tasks: ResNet-50 (He et al., 2016), GoogleNet (Szegedy et al., 2015), and VGG-16 (Simonyan and Zisserman, 2014). All three models required an image input size of 224 × 224 pixels and thus, we resized the raw episodes before passing them to the CNNs. Through transfer learning, for each video frame we obtained 1 × 1 × 2048, 1 × 1 × 1024, and 7 × 7 × 512 feature matrices from ResNet-50, GoogleNet, and VGG-16, respectively. Note that CNNs deal with individual frames, so each episode resulted in 30 feature sets extracted by the CNN.
3.2.3 Long short-term memory
LSTM, one of the recurrent neural network architectures, was designed to handle data presenting a sequential or temporal structure, such as text and videos (Lecun et al., 2015). In this study, the LSTM consisted of 30 modules, the same as the number of frames in each episode.
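Before describing the internal structure of each module, the overall two-stage computation (frame-wise CNN feature extraction followed by sequence classification) can be sketched in a few lines. The sketch below is a minimal illustration in Python/PyTorch rather than the MATLAB implementation used in this study; the 2048-dimensional frame feature corresponds to ResNet-50's pooled output, the bidirectional LSTM with 50 hidden units follows the selected configuration (Table S4.1), and the four output classes are NC, HB, L, and M.

```python
import torch
import torch.nn as nn
from torchvision import models

# One 30-frame episode, resized to the 224 x 224 input expected by the pre-trained CNNs
frames = torch.randn(30, 3, 224, 224)        # placeholder pixel data for illustration

# Stage 1: frame-wise spatial feature extraction with a CNN (ResNet-50 shown here)
cnn = models.resnet50()                       # in practice, ImageNet weights would be loaded
cnn.fc = nn.Identity()                        # drop the classifier head -> one 2048-d vector per frame

# Stage 2: sequence classification with a bidirectional LSTM and a linear output layer
lstm = nn.LSTM(input_size=2048, hidden_size=50, batch_first=True, bidirectional=True)
head = nn.Linear(2 * 50, 4)                   # four behavior classes: NC, HB, L, M

with torch.no_grad():
    feats = cnn(frames)                       # (30, 2048): one feature vector per frame
    hidden, _ = lstm(feats.unsqueeze(0))      # (1, 30, 100): hidden state for every frame
    logits = head(hidden[:, -1, :])           # classify from the last frame's hidden state
    probs = logits.softmax(dim=-1)            # episode-level class probabilities
```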
Each module (except the first module) received two 1-D vectors, $C_{t-1}$ and $h_{t-1}$, as its input and generated two vectors, $C_t$ and $h_t$, as its output, where $t$ denotes the $t$-th frame of the episode ($1 < t \le 30$), $C_t$ is the cell state of the $t$-th LSTM module, and $h_{t-1}$ represents the hidden state of Frame $t-1$ (Figure 4.4). $h_{t-1}$ was concatenated with $x_{t-1}$, the 1-D feature vector extracted from the CNN for Frame $t-1$, and the concatenated vector had four copies of length $d_h + d_x$, where $d_h$ is the number of hidden units in an LSTM module and $d_x$ represents the output length from the CNN. Four sets of weights (to be optimized through DL model fitting) were applied to the four copies, and the results were then activated by sigmoid (Han and Moraga, 1995), sigmoid, hyperbolic tangent (Lecun et al., 2015), and sigmoid functions, respectively. Structurally, an LSTM module includes a forget gate, an input gate, and an output gate (Figure 4.4). Furthermore, an LSTM model could be specified as a unidirectional LSTM or a bidirectional LSTM (Schuster and Paliwal, 1997). Each module produced an output $h_t$ ($1 \le t \le 30$) as the hidden state, and $h_{30}$ (the hidden state of the last frame of an episode) was used for classification of the episode. We trained all the LSTM parameters ourselves as there was no pre-trained LSTM available for this application.
Figure 4.4 Diagram of long short-term memory. The figure was redrawn; the original figure was obtained from https://colah.github.io/posts/2015-08-Understanding-LSTMs/
3.2.4 Hyperparameters
Hyperparameters are a set of values and options of the DL algorithm that are typically specified by data analysts before training a DL model. Selected hyperparameter values have an impact on the performance of DL (Han et al., 2021; Luo, 2016), and a search for proper hyperparameter set(s) is recommended (Wu et al., 2021). We explored 3 × 3 × 2 hyperparameter combinations by considering different CNNs for transfer learning, different dimensions of the LSTM hidden state, and different LSTM architectures. The search space was based on existing literature and is described in Table 4.3. The remaining hyperparameters were fixed following suggestions from the literature and are described in Table S4.1. The average accuracy (Eq. 1) of five replicates of random cross-validation was used to select the best hyperparameter solution:

$$\mathrm{Accuracy} = \frac{\text{Number of correctly classified episodes}}{\text{Total number of episodes in the testing set}} \qquad \text{(Eq. 1)}$$

Table 4.3 Explored hyperparameters and related work. CNN, convolutional neural network; LSTM, long short-term memory.
Hyperparameter                      Reference                                                                     Options
CNN for transfer learning           He et al. (2016), Simonyan and Zisserman (2014), and Szegedy et al. (2015)   [VGG-16, ResNet-50, GoogleNet]
Dimension of hidden state (LSTM)    Saurabh (2021), Wu et al. (2021), Xiao et al. (2020), and Yin et al. (2020)  [50, 256, 512]
LSTM architecture                   Hochreiter (1997) and Ullah et al. (2017)                                     [Single LSTM, Bidirectional LSTM]

The solution with the highest average accuracy was used to train CNN-LSTM models for all validation strategies described in Section 3.1.3. Predictive performance of all hyperparameter combinations is presented in Figure S4.1. The selected model configuration is described in Table S4.1.
3.2.5 Region of interest
Once hyperparameters were selected, three different regions of interest (ROI) were further investigated to improve performance of the classification algorithm.
Raw videos were cropped given the three ROIs: 1) the extended feeding stall, which included the whole feeding stall and the area immediately next to its entrance; 2) the feeding stall region only; and 3) a truncated feeding stall region consisting of the half of the stall closer to the entrance, where most interactive behaviors were initiated (Figure 4.5). It is worth mentioning that 1) was the default ROI when searching for hyperparameters. Table S4.2 lists the average accuracy for the three ROIs, which suggested that the model using the truncated feeding stall region (3) yielded the best performance. Therefore, we selected the truncated feeding stall region as the ROI in this study.
Figure 4.5 Explored regions of interest.
3.2.6 Deep learning training accounting for class-imbalance
Training a DL model is an optimization process that calculates and minimizes an objective function, also known as a loss function, which measures prediction/classification errors. Categorical cross-entropy is a commonly used loss function for classification problems with DL (Zhang and Sabuncu, 2018). Related work has implied that tuning the loss function helps address the class-imbalance problem during DL model fitting (Hossain et al., 2021). Lin et al. (2017) proposed the focal loss function to deal with the imbalance problem in binary classification, and Liu et al. (2018) extended the focal loss to multi-class classification. In this study, we used the multi-class focal loss as the objective function (Eq. 2):

$$loss_{FL} = -\sum_{i=1}^{c} \alpha (1 - y_i)^{\gamma} \, t_i \log(y_i), \qquad \text{(Eq. 2)}$$

where $c$ is the number of behavior categories, $t_i$ represents the true probability distribution, $y_i$ denotes the probability of behavior class $i$ from the Softmax activation (Goodfellow et al., 2016), and $\alpha$ and $\gamma$ are the balancing parameter and the focusing parameter, respectively. Lin et al. (2017) reported that the best-performing model was obtained when $\alpha = 0.25$ and $\gamma = 2$, and Oksuz et al. (2019) further indicated that these values performed well in practice. Therefore, we adopted the recommended values. Further, $t_i$ is defined as:

$$t_i = \begin{cases} 1 & i = \text{true label} \\ 0 & i \neq \text{true label} \end{cases}$$

The performance of DL is sensitive to the size of the training set. In this study, the number of available training episodes differed depending on the blocking strategy (Table S4.3). A constant training set size was suggested for DL when comparing predictive performance under different validation scenarios (Fernandes et al., 2020). Thus, we restricted the training set size to 7,500 30-frame episodes across all three validation strategies, and for those scenarios with more training samples available, we randomly selected which 7,500 episodes to include in the training set. For testing, all available episodes were utilized for evaluation. The training process was executed in MATLAB (R2021a, The MathWorks Inc., MA) with GPU computing activated. The computer used for DL model training had an Intel® Core™ i7-8750H CPU @ 2.2 GHz with 16GB RAM, an NVIDIA® GTX 1070 GPU with 8GB GDDR5 memory, and a Microsoft Windows 10 operating system.
3.2.7 Evaluation metrics
Predictive performance of the DL model for behavior class $i$ was evaluated through three measurements: overall accuracy (Eq. 1), recall (Eq. 3), and precision (Eq. 4):

$$recall_i = \frac{\text{Number of correctly predicted episodes for class } i}{\text{Total number of annotated episodes for class } i} \qquad \text{(Eq. 3)}$$

$$precision_i = \frac{\text{Number of correctly predicted episodes for class } i}{\text{Total number of episodes predicted for class } i} \qquad \text{(Eq. 4)}$$

4.
RESULTS AND DISCUSSION
The model training took 1.8 hours (the training set size was N=7,500 episodes), and the average computing time to classify an episode in the testing set was 1.9 seconds. Figure S4.2 presents the model training history of the four validation scenarios in terms of prediction accuracy over time (five sets of curves per scenario). After 100 epochs of DL model fitting, final testing accuracy was 0.968 (±0.001), 0.860 (±0.033), 0.766 (±0.026), and 0.860 (±0.010) for random cross-validation, blocking-by-time validation, and blocking-by-feeder validation (with Feeder 1 and Feeder 2 as testing sets), respectively. In random cross-validation, the testing accuracy almost agreed with the training accuracy. The remaining three validation scenarios showed different levels of overfitting. Compared to training accuracy, testing accuracy decreased by 0.140, 0.214, and 0.140 in blocking-by-time, blocking-by-feeder (Feeder 1 for testing), and blocking-by-feeder (Feeder 2 for testing) validations, respectively. The result of random cross-validation indicated that a CNN + LSTM pipeline could be used to accurately classify the four types of agonistic behaviors. However, significant decreases in testing accuracy as well as overfitting were observed in blocking-by-time and blocking-by-feeder validations. In a previous study, Li et al. (2020) utilized DL to classify five categories of pigs' behaviors: feeding, lying, motoring, scratching, and mounting. They reported an accuracy of 96.35% in random cross-validation, while the prediction accuracy on an independent testing set (a different pigsty) was 84.47%. In another study, Fernandes et al. (2020) applied DL to predict pig body composition traits and indicated that, when testing the trained model on independent genetic lines of pigs, the accuracy was systematically lower compared to 5-fold and 3-fold random cross-validations. Our results further confirmed the finding that the predictive performance of DL in blocked/independent testing sets was worse compared to random cross-validation.
Figure 4.6 shows the average recall and precision of each behavioral category in the four validation scenarios. As a reference, random CV yielded recall of 0.983, 0.849, 0.964, and 0.956 for HB, L, M, and NC behaviors, respectively. On the other hand, precisions of random CV were 0.971, 0.932, 0.968, and 0.969 for head-to-body, levering, mounting, and no-contact behaviors, respectively. The encouraging result of random CV implies that the DL pipeline was suitable for multi-class classification of pigs' aggressive interactions. This may be especially useful in retrospective studies, such as research applications, where a whole dataset is collected but annotated and analyzed after all video recordings have taken place. In a study conducted by Li et al. (2020), a DL model trained for multi-behavior recognition of pigs had precisions ranging from 0.946 to 1 for five categories: feeding, scratching, mounting, lying, and motoring. In a more recent study, Wu et al. (2021) fitted a CNN-LSTM model with recalls ranging from 0.950 to 0.985 and precisions ranging from 0.958 to 0.995 for drinking, ruminating, walking, standing, and lying behaviors of a single dairy cow. Both studies utilized random cross-validation, and we obtained similar results in random cross-validation in the present study.
A unique characteristic of our work is the classification of interactive behaviors in pigs (rather than behaviors of a single animal), which is more complicated than the recognition of basic behaviors that do not necessarily involve two animals.
Figure 4.6 Bar plots of recall and precision for random, block-by-time, and block-by-feeder cross-validations. HB, head-to-body; L, levering; M, mounting; NC, no-contact.
In blocking-by-time validation, a decrease in both recall and precision was observed. Recall for blocking-by-time validation was 0.855 (HB), 0.875 (L), 0.963 (M), and 0.821 (NC), while precision was 0.968, 0.362, 0.719, and 0.884, respectively (Figure 4.6). Using episodes collected from the first three weeks to classify episodes recorded during the final three weeks resulted in a significant decrease in predictive performance, especially in terms of precision for levering and mounting. Regarding blocking-by-time validation, Bergmeir and Benítez (2012) raised concerns about data containing time-evolving effects. They indicated that last-block validation tended to yield less robust predictive performance, where the last block is a subset taken from the end of a time series. Thus, the decreased predictive performance in our blocking-by-time validation was possibly the result of time-evolving features, i.e., the growing pigs, whose later appearance could not be picked up because it was not present in the training set (episodes from the first three weeks). This type of validation is potentially useful for time-sensitive applications where model training is done after minimal or limited initial data collection and the trained model is then used on the same individuals, but at later time points.
Predictive performance also dropped in blocking-by-feeder validation. Notably, the model trained with episodes from Feeder 1 showed better performance than the model trained with Feeder 2. When training with Feeder 1 data and testing on Feeder 2, recall for the four categories was 0.921 (HB), 0.785 (L), 0.348 (M), and 0.900 (NC), whereas precision for the four classes was 0.930, 0.452, 0.987, and 0.863, respectively. Meanwhile, training with Feeder 2 and testing on Feeder 1 led to different results. In this scenario, recall for the four categories was 0.889 (HB), 0.456 (L), 0.955 (M), and 0.411 (NC), while precision was 0.804, 0.381, 0.628, and 0.999, respectively. Additionally, patterns of misclassification differed between the two scenarios of blocking-by-feeder validation. Compared to random cross-validation, the model trained with Feeder 1 resulted in significantly lower recall for mounting and lower precision for levering, while worse recall for levering and no-contact and worse precision for levering and mounting were observed when validating the model trained with Feeder 2. Roberts et al. (2017) argued that ignoring data structures, e.g., spatial and grouping structures, may lead to over-optimistic estimation of model performance. Furthermore, they reported notably worse predictive performance when the testing sets were blocked by spatial group and by space, compared to random cross-validation. In another study that employed DL for classification (Lopez-Del Rio et al., 2019), the authors reported worse performance when the training/testing sets were split by a clustering factor.
As episodes recorded from Feeder 1 were composed of different social groups (Table 4.1) and a slightly different experimental setup (e.g., camera angle and illumination) compared to episodes recorded from Feeder 2, we speculate that the nested structures in the two feeders led to divergent variability between the two datasets. To further investigate this, we performed a principal component analysis (PCA) of the feature vectors extracted from each episode that were inputs to the LSTM (Figure 4.3). Results of the PCA implied that frames from Feeder 1 were more variable than frames from Feeder 2. We included the relationships among the first six principal components in Figure S4.3.
Figure 4.7 shows the cumulative confusion matrix over five replicates of each validation scenario. Validation sets and their corresponding misclassified episodes were further clustered by social group, week, feeder, pig's back mark (Arabic numerals from 0 to 9), and behavior category. For random cross-validation, misclassified episodes maintained the same clustering patterns as in the testing sets, but the pattern was different when accounting for the behavior category (Figure S4.4). For blocking-by-time validation, misclassified episodes presented a different clustering structure in terms of social group, feeder, mark, and week compared to the testing set, while the breakdown by behavior category was similar to the testing set (Figure S4.5). Misclassification in both blocking-by-feeder validation scenarios showed similar patterns as in the testing set when divided by social group, mark, and week, but the misclassification breakdown by behavior category disagreed with the corresponding testing sets (Figures S4.6 and S4.7).
Moreover, when studying blocking-by-time validation and blocking-by-feeder validation, we ranked episodes based on the frequency with which they occurred in the misclassifications and selected the top 50 misclassified episodes for each off-diagonal element of the confusion table. If the total number of episodes for a misclassification category was fewer than 50, we selected all episodes for diagnosis. Both blocking-by-time and blocking-by-feeder validations showed similar patterns in three misclassification categories: HB misclassified as NC, NC misclassified as HB, and HB confused with L. For HB misclassified as NC and NC episodes confused with HB, the following pig presented mild contact or almost no contact with the front pig (Figure 4.8 a.). For HB misclassified as L, almost all episodes presented a pattern where the following pig buried its head underneath the rear part of the front pig and attempted to push/lift the front pig but did not succeed in doing so (Figure 4.8 b.).
Figure 4.7 Confusion tables of three validation strategies. Tables were based on merged statistics over five replicates. Each validation strategy has five reps. Prediction means the classified result from our model and Target means the ground-truth labels. Panel A, random validation; Panel B, block-by-time validation; Panel C, block-by-feeder validation (Feeder 1 as testing set); Panel D, block-by-feeder validation (Feeder 2 as testing set). NC, no-contact; M, mounting; L, levering; HB, head-to-body.
In blocking-by-time validation, the major sources of misclassification were HB predicted as NC, HB predicted as M, HB predicted as L, and L predicted as M.
For HB classified as M, it was frequently observed that the following pig pushed/head-knocked the front pig, and we also found that the front pig tended to retreat from the feeder, which caused contact with the following pig (Figure 4.8 c. and d.). For L misclassified as M, a common structure across all episodes was that the two pigs overlapped considerably and their heads were barely visible (Figure 4.8 e.).
Figure 4.8 Error patterns for misclassification. The 1st, 10th, 20th, and 30th frames of example episodes were selected for display purposes. a), head-to-body and no-contact confused with no-contact and head-to-body, respectively; b), head-to-body misclassified as levering; c-d), head-to-body falsely predicted as mounting; e), levering confused with mounting. Panels a) and b) were common misclassification patterns across all three validation scenarios. Panels c-e) only represent block-by-time validation.
When testing on Feeder 1 (Feeder 2 as the training set), the main misclassified categories were: HB predicted as M, HB predicted as L, L predicted as HB, and L predicted as M. For HB confused with M, no specific behavioral patterns were observed, but most misclassified episodes involved back-marked pigs (with Arabic numerals) or dirty pigs (Figure 4.9 a.). Episodes that were labelled as L but classified as HB showed two types of features. In one case, the two pigs were in the middle of levering behavior and were relatively motionless (Figure 4.9 b.). In the second case, some episodes contained more than one behavior category; for example, the following pig was in the transition period between levering and performing another behavior (Figure 4.9 c.). For L misclassified as M, only a small proportion of the front pig's body (mostly the rear part of the pig) was included in the episodes (Figure 4.9 d.), as we set the truncated feeder area as the ROI. In the last validation scenario, we trained our model with the data from Feeder 1 and tested on video episodes from Feeder 2. We then watched the misclassified episodes in the testing set (Feeder 2). Most misclassifications fell into the following error categories: HB predicted as L, M predicted as L, M predicted as HB, and L predicted as HB. When M was erroneously confused with L, the view was commonly almost entirely filled with the body of the mounting pig and only a very small proportion of the front pig was visible (Figure 4.9 e.). For M misclassified as HB, all episodes presented clear views of the following pig, and the following pig was at a very early stage of mounting (minor overlap with the front pig; Figure 4.9 f.). For L misclassified as HB, there were two patterns. On the one hand, many episodes included three pigs, and both L and HB were occurring at the same time among the animals. Although we manually labeled the episodes by prioritizing the interaction related to the front pig (nearest the food trough), there were rare events that involved a secondary interaction between the two follower pigs at the other end of the feeding stall near the entrance (Figure 4.9 g.). On the other hand, over half of the L-misclassified-as-HB errors included drastic levering with the follower pig's head/ears visible (Figure 4.9 h.).
Figure 4.9 Error patterns for misclassification in block-by-feeder validation. The 1st, 10th, 20th, and 30th frames of example episodes were selected for display purposes.
a), head-to-body misclassified as mounting; b-c), levering misclassified as head-to-body; d), levering falsely predicted as mounting; e), mounting confused with levering; f), mounting confused with head-to-body; g-h), levering falsely classified as head-to-body.
This paper classified multiple interactive behaviors of grow-finish pigs in single-space feeding stalls, which is an original application of computer vision to the study of livestock systems. In addition, we employed a state-of-the-art CNN+LSTM pipeline that explicitly accounted for a class-imbalance problem in the training set. Furthermore, our results suggest that the data structure matters for the predictive performance of the proposed DL pipeline. Compared to previous studies that classified aggressive/non-aggressive behaviors of pigs (Chen et al., 2020a, 2019), our study took a further step to distinguish among different interactive agonistic behaviors of grow-finish pigs in the confines of a single-space feeding stall. Agonistic behavior in pigs involves complicated motion structure, and thus it is difficult to handle challenging instances, especially those occurring during the transition between two different activity types. The results from random cross-validation suggest that the proposed model could be used for classification of multiple types of pigs' agonistic behavior. However, such a validation strategy may overlook the effect of confounds within the dataset itself. Blocked validation strategies are closer to real-world applications, and results of these validation strategies should be considered when evaluating models for use on farm. As we observed more errors in blocked validation, it is essential to identify challenging cases through model diagnosis, which has also been proposed by many other researchers utilizing DL to study animal behavior (Chen et al., 2020b; Liu et al., 2020; Wu et al., 2021). Detailed diagnostics, such as those we described for the various misclassifications, together with advanced CV algorithms, will be helpful to improve the model as well as to predict its potential robustness in real-world situations.
5. CONCLUSION
Our results illustrate the importance of matching validation with application when evaluating DL models for behavioral classification of videos. We used a state-of-the-art CNN+LSTM pipeline trained with an imbalanced video dataset to classify four interactive behaviors in grow-finish pigs. While random cross-validation produced an acceptable accuracy of 96.8%, validation strategies that blocked data over time or by pen/feeder location yielded poorer performance. In the future, more datasets with known structures should be added to existing datasets to train video classification models under various real-world conditions that will be relevant to animal phenomics and precision livestock farming uses.
6. DECLARATION OF COMPETING INTEREST
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work. There is no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.
7. ACKNOWLEDGEMENTS
This work was funded by NIFA Awards 2017-67007-26176 and 2021-67021-34150 and the National Natural Science Foundation of China (32102598).
8.
DATA AVAILABILITY
Custom MATLAB code used for CNN transfer learning, LSTM model fitting, and the blocked validation strategies is available at GitHub: https://github.com/jun-jieh/AgonisticPigBehav/. Raw video episodes, video metadata, and extracted features from transfer learning are available at the OSF data repository: https://osf.io/wa732/.
APPENDIX
Table S4.1 Hyperparameter configuration.
Hyperparameter              Option/Value                    Reference
CNN for transfer learning   ResNet-50                       Hyperparameter search
LSTM architecture           Bi-directional with 50 units    Hyperparameter search
α (Balancing parameter)     0.25                            Lin et al. (2017)
γ (Focusing parameter)      2                               Lin et al. (2017)
Optimizer                   Adam                            Wu et al. (2021)
Initial learning rate       0.0001                          Wu et al. (2021)
Batch size                  20                              Wu et al. (2021)
Dropout rate                0.5                             Wu et al. (2021)
Epochs                      100                             Chen et al. (2020b)

Table S4.2 Overall accuracy for different regions of interest.
Region of interest   Mean accuracy (5 reps)   Standard deviation (5 reps)
Extended feeder      0.868                    0.004
Feeder only          0.875                    0.001
Truncated feeder     0.887                    0.004

Table S4.3 Available sample size by validation strategy. *: the training set and the testing set were interchangeable depending on which feeder was used for training/testing. #: once a training set size N1 was determined, the remaining N2 = 15,679 - N1 samples were considered the testing set.
                         Random cross-validation#   Blocking-by-time      Blocking-by-feeder*
Available for training   15,679                     9,907 (Weeks 1-3)     7,700 or 7,979
Available for testing    15,679                     5,772 (Weeks 4-6)     7,979 or 7,700

Figure S4.1 Average accuracy of different hyperparameter sets. Units, number of hidden units in the long short-term memory module; bi-direct, bi-directional long short-term memory; standard, standard one-way long short-term memory.
Figure S4.2 Training history of three validation strategies. Solid lines show training curves and dashed lines show testing curves. Each validation strategy has five reps. Panel A, random validation; Panel B, block-by-time validation; Panel C, block-by-feeder validation (Feeder 1 as testing set); Panel D, block-by-feeder validation (Feeder 2 as testing set).
Figure S4.3 Scatter plots of individual scores for each episode given the first six principal components (grouped by feeder). Blue dots represent Feeder 1 and green dots represent Feeder 2. Principal component analysis was done using feature vectors extracted from ResNet-50.
Figure S4.4 Testing set breakdown (on the left of panels) and misclassification breakdown (on the right of panels) by social group (A), week (B), feeder (C), mark (D), and behavior category (E) in random cross-validation (five replicates). Marked pigs were back-marked pigs with Arabic numerals; unmarked pigs were pigs without artificial marks on their backs.
Figure S4.5 Testing set breakdown (on the left of panels) and misclassification breakdown (on the right of panels) by social group (A), week (B), feeder (C), mark (D), and behavior category (E) in block-by-time validation. Plots on the left of panels show the proportion/count for a single dataset, while plots on the right of panels show the statistics across 5 replicates. Marked pigs were back-marked pigs with Arabic numerals; unmarked pigs were pigs without artificial marks on their backs.
Figure S4.6 Testing set breakdown (on the left of panels) and misclassification breakdown (on the right of panels) by social group (A), week (B), mark (C), and behavior category (D) in block-by-feeder validation, where Feeder 1 was the testing set.
Plots on the left of panels show the proportion/count for a single dataset, while plots on the right of panels show the statistics across 5 replicates. Marked pigs were back-marked pigs with Arabic numerals; unmarked pigs were pigs without artificial marks on their backs.
Figure S4.7 Testing set breakdown (on the left of panels) and misclassification breakdown (on the right of panels) by social group (A), week (B), mark (C), and behavior category (D) in block-by-feeder validation, where Feeder 2 was the testing set. Plots on the left of panels show the proportion/count for a single dataset, while plots on the right of panels show the statistics across 5 replicates. Marked pigs were back-marked pigs with Arabic numerals; unmarked pigs were pigs without artificial marks on their backs.
REFERENCES
Agha, S., Fàbrega, E., Quintanilla, R., Sánchez, J.P., 2020. Social network analysis of agonistic behaviour and its association with economically important traits in pigs. Animals 10, 1–13. https://doi.org/10.3390/ani10112123
Angarita, Belcy K., Han, J., Cantet, R.J.C., Chewning, S.K., Wurtz, K.E., Siegford, J.M., Ernst, C.W., Steibel, J.P., 2021. Estimation of direct and social effects of feeding duration in growing pigs using records from automatic feeding stations. J. Anim. Sci. 99, 1–8. https://doi.org/10.1093/jas/skab042
Bahlo, C., Dahlhaus, P., Thompson, H., Trotter, M., 2019. The role of interoperable data standards in precision livestock farming in extensive livestock systems: A review. Comput. Electron. Agric. 156, 459–466. https://doi.org/10.1016/j.compag.2018.12.007
Bergmeir, C., Benítez, J.M., 2012. On the use of cross-validation for time series predictor evaluation. Inf. Sci. (Ny). 191, 192–213. https://doi.org/10.1016/j.ins.2011.12.028
Brown-Brandl, T.M., Adrion, F., Gallmann, E., Eigenberg, R., 2018. Development and Validation of a Low-Frequency RFID System for Monitoring Grow-Finish Pig Feeding and Drinking Behavior. 1–9. https://doi.org/10.13031/iles.18-041
Brown-Brandl, T.M., Rohrer, G.A., Eigenberg, R.A., 2013. Analysis of feeding behavior of group housed growing-finishing pigs. Comput. Electron. Agric. 96, 246–252. https://doi.org/10.1016/j.compag.2013.06.002
Chen, C., Zhu, W., Liu, D., Steibel, J., Siegford, J., Wurtz, K., Han, J., Norton, T., 2019. Detection of aggressive behaviours in pigs using a RealSence depth sensor. Comput. Electron. Agric. 166, 105003. https://doi.org/10.1016/j.compag.2019.105003
Chen, C., Zhu, W., Norton, T., 2021. Behaviour recognition of pigs and cattle: Journey from computer vision to deep learning. Comput. Electron. Agric. 187, 106255. https://doi.org/10.1016/j.compag.2021.106255
Chen, C., Zhu, W., Steibel, J., Siegford, J., Han, J., Norton, T., 2020a. Recognition of feeding behaviour of pigs and determination of feeding time of each pig by a video-based deep learning method. Comput. Electron. Agric. 176, 105642. https://doi.org/10.1016/j.compag.2020.105642
Chen, C., Zhu, W., Steibel, J., Siegford, J., Wurtz, K., Han, J., Norton, T., 2020b. Recognition of aggressive episodes of pigs based on convolutional neural network and long short-term memory. Comput. Electron. Agric. 169, 105166. https://doi.org/10.1016/j.compag.2019.105166
Csermely, D., Wood-Gush, D.G.M., 1990. Agonistic behaviour in grouped sows. II. How social rank affects feeding and drinking behaviour. Bolletino di Zool. 57, 55–58.
https://doi.org/10.1080/11250009009355674
Ding, R., Yang, M., Wang, X., Quan, J., Zhuang, Z., Zhou, S., Li, S., Xu, Z., Zheng, E., Cai, G., Liu, D., Huang, W., Yang, J., Wu, Z., 2018. Genetic architecture of feeding behavior and feed efficiency in a Duroc pig population. Front. Genet. 9, 1–11. https://doi.org/10.3389/fgene.2018.00220
Fernandes, A.F.A., Dórea, J.R.R., Valente, B.D., Fitzgerald, R., Herring, W., Rosa, G.J.M., 2020. Comparison of data analytics strategies in computer vision systems to predict pig body composition traits from 3D images. J. Anim. Sci. 98, skaa250. https://doi.org/10.1093/jas/skaa250
Forsyth, D., Ponce, J., 2011. Computer vision: A modern approach. Prentice Hall.
Georgsson, L., Svendsen, J., 2002. Degree of competition at feeding differentially affects behavior and performance of group-housed growing-finishing pigs of different relative weights. J. Anim. Sci. 80, 376–383. https://doi.org/10.2527/2002.802376x
Gómez, Y., Stygar, A.H., Boumans, I.J.M.M., Bokkers, E.A.M., Pedersen, L.J., Niemi, J.K., Pastell, M., Manteca, X., Llonch, P., 2021. A Systematic Review on Validated Precision Livestock Farming Technologies for Pig Production and Its Potential to Assess Animal Welfare. Front. Vet. Sci. 8, 1–20. https://doi.org/10.3389/fvets.2021.660565
Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. Cambridge, Massachusetts: The MIT Press.
Han, J., Gondro, C., Reid, K., Steibel, J.P., 2021. Heuristic hyperparameter optimization of deep learning models for genomic prediction. G3 Genes|Genomes|Genetics 11. https://doi.org/10.1093/g3journal/jkab032
Han, J., Moraga, C., 1995. The influence of the sigmoid function parameters on the speed of backpropagation learning, in: International Workshop on Artificial Neural Networks. Springer, pp. 195–201.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778.
Hochreiter, S., 1997. Long short-term memory. Neural Comput. 9, 1735–1780.
Hossain, M.S., Betts, J.M., Paplinski, A.P., 2021. Dual Focal Loss to address class imbalance in semantic segmentation. Neurocomputing 462, 69–87. https://doi.org/10.1016/j.neucom.2021.07.055
Ji, J., Cao, K., Niebles, J.C., 2019. Learning temporal action proposals with fewer labels. Proc. IEEE Int. Conf. Comput. Vis. 2019-Octob, 7072–7081. https://doi.org/10.1109/ICCV.2019.00717
Kim, T., Lee, H., Cho, M.A., Lee, H.S., Cho, D.H., Lee, S., 2020. Learning Temporally Invariant and Localizable Features via Data Augmentation for Video Recognition. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics) 12536 LNCS, 386–403. https://doi.org/10.1007/978-3-030-66096-3_27
LeCun, Y., Bengio, Y., 1995. Convolutional networks for images, speech, and time series. Handb. Brain Theory Neural Networks 3361, 1995.
Lecun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature. https://doi.org/10.1038/nature14539
Li, D., Chen, Y., Zhang, K., Li, Z., 2019. Mounting behaviour recognition for pigs based on deep learning. Sensors (Switzerland) 19. https://doi.org/10.3390/s19224924
Li, D., Zhang, K., Li, Z., Chen, Y., 2020. A spatiotemporal convolutional network for multi-behavior recognition of pigs. Sensors (Switzerland) 20. https://doi.org/10.3390/s20082381
Li, G., Huang, Y., Chen, Z., Chesser, G.D., Purswell, J.L., Linhoss, J., Zhao, Y., 2021.
Practices and applications of convolutional neural network-based computer vision systems in animal farming: A review. Sensors 21, 1–42. https://doi.org/10.3390/s21041492
Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P., 2017. Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision. pp. 2980–2988.
Liu, D., Oczak, M., Maschat, K., Baumgartner, J., Pletzer, B., He, D., Norton, T., 2020. A computer vision-based method for spatial-temporal action recognition of tail-biting behaviour in group-housed pigs. Biosyst. Eng. 195, 27–41. https://doi.org/10.1016/j.biosystemseng.2020.04.007
Liu, W., Chen, L., Chen, Y., 2018. Age Classification Using Convolutional Neural Networks with the Multi-class Focal Loss. IOP Conf. Ser. Mater. Sci. Eng. 428. https://doi.org/10.1088/1757-899X/428/1/012043
Lopez-Del Rio, A., Nonell-Canals, A., Vidal, D., Perera-Lluna, A., 2019. Evaluation of Cross-Validation Strategies in Sequence-Based Binding Prediction Using Deep Learning. J. Chem. Inf. Model. 59, 1645–1657. https://doi.org/10.1021/acs.jcim.8b00663
Luo, G., 2016. A review of automatic selection methods for machine learning algorithms and hyper-parameter values. Netw. Model. Anal. Heal. Informatics Bioinforma. 5, 1–15. https://doi.org/10.1007/s13721-016-0125-6
Machado, S.P., Caldara, F.R., Foppa, L., De Moura, R., Gonçalves, L.M.P., Garcia, R.G., De Alencar Nääs, I., Dos Santos Nieto, V.M.O., De Oliveira, G.F., 2017. Behavior of pigs reared in enriched environment: Alternatives to extend pigs attention. PLoS One 12, 1–18. https://doi.org/10.1371/journal.pone.0168427
Martínez-Avilés, M., Fernández-Carrión, E., López García-Baones, J.M., Sánchez-Vizcaíno, J.M., 2017. Early Detection of Infection in Pigs through an Online Monitoring System. Transbound. Emerg. Dis. 64, 364–373. https://doi.org/10.1111/tbed.12372
Nasirahmadi, A., Sturm, B., Edwards, S., Jeppsson, K.H., Olsson, A.C., Müller, S., Hensel, O., 2019. Deep learning and machine vision approaches for posture detection of individual pigs. Sensors (Switzerland) 19, 1–16. https://doi.org/10.3390/s19173738
Nielsen, B.L., Lawrence, A.B., Whittemore, C.T., 1995. Effect of group size on feeding behaviour, social behaviour, and performance of growing pigs using single-space feeders. Anim. Sci. 61, 575–579. https://doi.org/10.1017/S1357729800014168
Oksuz, I., Clough, J.R., Schnabel, J.A., 2019. Artefact detection in video endoscopy using retinanet and focal loss function, in: CEUR Workshop Proceedings. CEUR-WS.
Roberts, D.R., Bahn, V., Ciuti, S., Boyce, M.S., Elith, J., Guillera-Arroita, G., Hauenstein, S., Lahoz-Monfort, J.J., Schröder, B., Thuiller, W., Warton, D.I., Wintle, B.A., Hartig, F., Dormann, C.F., 2017. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography (Cop.). 40, 913–929. https://doi.org/10.1111/ecog.02881
Rodenburg, T.B., Turner, S.P., 2012. The role of breeding and genetics in the welfare of farm animals. Anim. Front. 2, 16–21. https://doi.org/10.2527/af.2012-0044
Salgado, H.H., Méthot, S., Remus, A., Létourneau-Montminy, M.P., Pomar, C., 2021. A novel feeding behavior index integrating several components of the feeding behavior of finishing pigs. Animal 15, 100251. https://doi.org/10.1016/j.animal.2021.100251
Saurabh, N., 2021. LSTM-RNN Model to Predict Future Stock Prices using an Efficient Optimizer. pp. 672–677.
Schuster, M., Paliwal, K.K., 1997.
Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45, 2673–2681. https://doi.org/10.1109/78.650093
Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv Prepr. arXiv1409.1556.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1–9.
Talo, M., 2019. Convolutional Neural Networks for Multi-class Histopathology Image Classification.
Torrey, L., Shavlik, J., 2010. Transfer learning, in: Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. IGI Global, pp. 242–264.
Ullah, A., Ahmad, J., Muhammad, K., Sajjad, M., Baik, S.W., 2017. Action Recognition in Video Sequences using Deep Bi-Directional LSTM with CNN Features. IEEE Access 6, 1155–1166. https://doi.org/10.1109/ACCESS.2017.2778011
Wu, D., Wang, Y., Han, M., Song, L., Shang, Y., Zhang, X., Song, H., 2021. Using a CNN-LSTM for basic behaviors detection of a single dairy cow in a complex environment. Comput. Electron. Agric. 182, 106016. https://doi.org/10.1016/j.compag.2021.106016
Xiao, H., Wang, C., Li, Z., Wang, R., Bo, C., Sotelo, M.A., Xu, Y., 2020. UB-LSTM: A Trajectory Prediction Method Combined with Vehicle Behavior Recognition. J. Adv. Transp. 2020. https://doi.org/10.1155/2020/8859689
Yin, X., Wu, D., Shang, Y., Jiang, B., Song, H., 2020. Using an EfficientNet-LSTM for the recognition of single Cow's motion behaviours in a complicated environment. Comput. Electron. Agric. 177, 105707. https://doi.org/10.1016/j.compag.2020.105707
Yun, S., Oh, S.J., Heo, B., Han, D., Kim, J., 2020. VideoMix: Rethinking Data Augmentation for Video Classification.
Zhang, K., Li, D., Huang, J., Chen, Y., 2020. Automated video behavior recognition of pigs using two-stream convolutional networks. Sensors (Switzerland) 20. https://doi.org/10.3390/s20041085
Zhang, Z., Sabuncu, M.R., 2018. Generalized cross entropy loss for training deep neural networks with noisy labels. Adv. Neural Inf. Process. Syst. 2018-Decem, 8778–8788.
Zhou, Y., Hu, Q., Wang, Y., 2018. Deep super-class learning for long-tail distributed image classification. Pattern Recognit. 80, 118–128. https://doi.org/10.1016/j.patcog.2018.03.003
CHAPTER 5: ANALYSIS OF SOCIAL INTERACTIONS IN GROUP-HOUSED ANIMALS USING DYADIC LINEAR MODELS
Junjie Han, Janice Siegford, Gustavo de los Campos, Robert J. Tempelman, Cedric Gondro, and Juan P. Steibel
1. ABSTRACT
Understanding factors affecting social interactions among animals is important for applied animal behavior research. Thus, there is a need to elicit statistical models to analyze data collected from pairwise behavioral interactions. In this study, we treat social interaction data as dyadic observations and propose a statistical model for their analysis. We performed posterior predictive checks of the model through different validation strategies: stratified 5-fold random cross-validation, block-by-social-group cross-validation, and block-by-focal-animals validation. The proposed model was applied to a pig behavior dataset collected from 797 growing pigs freshly remixed into 59 social groups, which resulted in 10,032 records of directional dyadic interactions. The response variable was the duration in seconds that each animal spent delivering attacks on another group mate. Generalized linear mixed models were fitted.
Fixed effects included sex, individual weight, prior nursery mate experience, and prior littermate experience of the two pigs in the dyad. Random effects included aggression giver, aggression receiver, dyad, and social group. A Bayesian framework was utilized for parameter estimation and posterior predictive model checking. Prior nursery mate experience was the only significant fixed effect. In addition, a weak but significant correlation between the random giver effect and the random receiver effect was obtained when analyzing the attacking duration. The predictive performance of the model varied depending on the validation strategy, with substantially lower performance from the block-by-social-group strategy than from the other validation strategies. Collectively, this paper demonstrates a statistical model to analyze interactive animal behaviors, particularly dyadic interactions.
2. INTRODUCTION
The study of social interactions is of paramount importance in applied animal behavior research (Rodenburg et al., 2010; Silk et al., 2018). Researchers are interested in elucidating the basis for the observed variation in the intensity and frequency of interactions among pairs of individuals that are part of a social group. Applications of such studies include mate choice (Andersson and Simmons, 2006; Bierbach et al., 2013), aggression and other damaging behaviors (Angarita et al., 2019; Oczak et al., 2013; Peden et al., 2018), and competition for access to feeding space (Angarita et al., 2021; Lu et al., 2017). Thus, given data on pairwise behavioral interactions recorded from an experimental or observational study, it is necessary to quantify the effects of various individual- and group-level factors on social interactions.
Data from pairwise social interactions are considered dyadic (Kenny et al., 2020). That is, the unit of observation is not the individual but a pair of individuals. In general, dyadic interaction data can be arranged in square matrices. They can be further re-arranged in the form of a response vector, which is generally accomplished in two different ways (Figure 5.1): a) the data are summed row-wise/column-wise to represent an individual-level observation (the total duration that each animal is engaged in a particular behavior regardless of whom the animal interacted with), or b) the matrix elements are stacked, keeping their dyadic nature intact. In the first case, there is a loss of information, and this approach should be avoided if the aim is to study the dyadic nature of social interactions. In the second case, it is of utmost importance that all sources of variation are modeled to properly account for group means, variances, and covariances between subsets of the data; otherwise, if important factors are ignored, estimates and predictions can be adversely affected.
Figure 5.1 Panel a), directional dyadic interaction intensity matrix (elements in the matrix represent attacking duration); row sums and column sums are shown in the margins of the matrix. Panel b), a truncated long-format table that is re-arranged from the interaction matrix; each row represents a record that is the attacking duration in seconds from a giver animal to a receiver animal. 0.00 means an observed zero, while 0 means a structural zero that we do not consider an actual interaction.
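To make the stacking in Figure 5.1b concrete, the following minimal Python sketch (with made-up pig identities and durations) converts a directional interaction matrix into long-format dyadic records; the diagonal cells are the structural zeros mentioned in the caption.

```python
import numpy as np
import pandas as pd

# A toy 4-pig social group; entry [i, j] is the seconds pig i spent attacking pig j
pigs = ["A", "B", "C", "D"]
attack = np.array([[0.0, 12.0,  0.0, 3.5],
                   [0.0,  0.0,  7.2, 0.0],
                   [1.1,  0.0,  0.0, 0.0],
                   [0.0,  0.0,  4.8, 0.0]])

records = []
for i, giver in enumerate(pigs):
    for j, receiver in enumerate(pigs):
        if i == j:
            continue                                   # diagonal = structural zero, not an interaction
        records.append({"giver": giver,
                        "receiver": receiver,
                        "dyad": "-".join(sorted([giver, receiver])),  # unordered pair label
                        "duration": attack[i, j]})                    # 0.00 = observed zero

dyadic = pd.DataFrame(records)                         # n * (n - 1) = 12 directional records
print(dyadic)
```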
A proper way to model dyadic data is to fit generalized linear mixed models (GLMM) that include fixed and random effects to account for means and covariances depending on the actual design of the experiment (Kenny et al., 2020). In this study we describe how GLMM can be used to analyze dyadic data and illustrate how to use this approach to analyze a pig behavior dataset. First, we define the type of social interaction data and how to properly model variation in the response using GLMM. Second, we apply the proposed GLMM to the experimental data and illustrate how to elicit, fit, and check the models and how to interpret the results. Finally, we perform posterior predictive checks of the models through several validation strategies. The GLMM presented in this paper can be used by applied animal behaviorists to analyze other pairwise social interaction data to obtain statistically valid and biologically meaningful results, which can be helpful for understanding interactive behaviors of animals for practical purposes of management or improved welfare.
3. METHODS AND MATERIALS
3.1 Data from social interactions should be analyzed as dyadic data
For a social interaction to occur, at least two animals need to be involved. Although behavioral interactions may involve more than two animals at a time, in this paper we assume that the data on social interactions are obtained through observations of pairwise/dyadic behaviors and that they can be arranged in an interaction matrix (Figure 5.1a). The data may be obtained within a single large social group, in which all the potential pairwise interactions have been monitored and quantified. Alternatively, the dyadic data can be collected from several social groups of variable sizes, within which all potential pairwise interactions have been monitored and quantified, but no between-group relations are possible.
We also assume that, in addition to the social interactions per se, other variables have been observed. These variables may be individual-specific or dyad-specific. Examples of individual-specific variables are those related to each individual's age, sex, size, and past life experiences (e.g., early-life social or nutritional stress), and they can be continuous, discrete, or categorical in nature. Dyad-level variables are those that only pertain to the pair of individuals. For instance, a dyad-level variable can describe whether the two animals have met each other before the interaction is observed. It is important to notice that sometimes individual-level variables may be coded as dyad-specific; for instance, the difference in live weight between two animals can be viewed as a dyadic-level observation, but in fact it arises from a linear combination of two individual-level variables. In that case, we prefer to keep individual-level observations separate.
3.1.1 Social interaction data
Social interaction data can be of different types. From a mathematical point of view, the social interactions could be represented by a binary outcome (0/1 = it occurred/it did not occur), by a discrete outcome (frequency of occurrence of an interaction), by an ordinal outcome (intensity or severity of the interaction on an arbitrary scale), or by a continuous outcome (intensity of the interaction on a continuous scale, duration of the interaction, etc.). The practical implications of the different types of responses pertain to the statistical distribution that is used to model the stochasticity in social interaction data.
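The distinction between individual-level and dyad-level variables described above can be made concrete with a short sketch (hypothetical pigs, weights, and nursery pens): giver and receiver weights are kept as separate individual-level covariates, while prior acquaintance through a shared nursery pen enters as a dyad-level indicator.

```python
import itertools
import pandas as pd

# Hypothetical individual-level records for a four-pig group
pig_info = pd.DataFrame(
    {"pig": ["A", "B", "C", "D"],
     "weight": [27.5, 30.1, 24.8, 26.0],
     "nursery_pen": [1, 1, 2, 2]}).set_index("pig")

rows = []
for giver, receiver in itertools.permutations(pig_info.index, 2):
    rows.append({
        "giver": giver,
        "receiver": receiver,
        # individual-level covariates kept separate rather than collapsed into a difference
        "giver_weight": pig_info.loc[giver, "weight"],
        "receiver_weight": pig_info.loc[receiver, "weight"],
        # dyad-level covariate: prior acquaintance through a shared nursery pen
        "same_nursery": int(pig_info.loc[giver, "nursery_pen"]
                            == pig_info.loc[receiver, "nursery_pen"]),
    })
covariates = pd.DataFrame(rows)
```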
From the point of view of the directionality of the behavior, in most cases we can assume that the behavior is directional, i.e., there is a giver and a receiver. For instance, in the study of animal aggression, in many cases there is a clear attacker and a victim. In studies of feather pecking in group-caged chickens (Savory and Mann, 1997) and tail-biting in group-housed pigs (Angarita et al., 2019; Wurtz et al., 2017), there was one animal that was delivering the behavior (we will call this animal the giver) and another one which was clearly receiving the behavior (the receiver). In the following subsections we lay out these concepts with the directional interaction data, and we summarize the model parameterization in a generalized form.
3.1.2 Analysis of dyadic data from directional social interactions
When the social interactions are directional, the data collected from each social group can be arranged in a matrix as represented in Figure 5.1a. If there are n animals in a certain group, n(n-1) interactions will be observed within the group. We assume that there is no measurement error, implying that if an interaction was recorded then this interaction did indeed happen as recorded and, perhaps more relevant, that a zero entry in the matrix implies that no interaction occurred for this specific dyad. In the analysis of dyadic interactions, the model is necessarily componential, where the interaction consists of three major components: a main effect of the giver (giver effect), a main effect of the receiver (receiver effect), and the relation of the two individuals that is independent of the giver and receiver effects, referred to as the dyad (Back and Kenny, 2010; Kenny et al., 2006).
3.2 Experimental data analysis: attacking time in group-housed pigs
3.2.1 Experiment setup
In this study, the experimental data were collected from 797 Yorkshire pigs (409 gilts and 388 barrows) that were strategically mixed into 59 single-sex social groups and housed in grow-finish pens with 10-15 pigs per pen. In terms of prior social acquaintances, each social group included pairs or trios of animals that had shared a common nursery pen for seven weeks immediately before moving into the grow-finish pens. Prior social acquaintances also existed for some animals that had shared the same litter after farrowing (10 weeks before being mixed into the grow-finish pens; these pigs were previously housed together as a litter before weaning). No prior social acquaintance was assumed to exist for animals that were housed together for the first time after being mixed into the grow-finish pens. At the beginning of the experiment the average weight of the animals was 27.09 kg (SD±4.07). The experiment has been described in detail in previous studies (Angarita et al., 2019; Wurtz et al., 2017).
Pigs were video recorded for five hours after mixing and for four hours on the following morning (no overnight recording was performed). Videos were decoded manually by trained observers who recorded all attacks, their duration, and the identity of the giver and receiver. After decoding, the total amount of time for each dyadic interaction was computed as described by Angarita et al. (2019). The directional aggression duration $y_{ijk}$ was defined as the total time in seconds that animal $i$ spent attacking another group mate, animal $j$, within social group $k$ during the 9-hour post-mixing period. The final dataset contained 10,032 records consisting of total attacking duration for all possible dyads.
Among those records, 1,100 pairs of animals (2,200 records) shared the same nursery pen prior to being remixed into the grow-finish pens, and 367 pairs of animals (734 records) were from the same litter.

3.2.2 Analysis model

After extensive model assessment and comparison, a hurdle Bernoulli-lognormal model was adopted. To keep things simple, in this paper it was assumed that a positive continuous response could be adequately modeled using a lognormal distribution and that a Bernoulli distribution could model whether the response is zero. However, the general principles presented here can be easily extended to other types of distributions, as mentioned in the discussion. Thus, in this application, there are two sub-models (Equation 5.1). One sub-model estimates the probability of observing a zero (no attacks) while the other sub-model represents the duration of attacks conditional on its occurrence:

$$
\begin{cases}
y_{ijk} \sim \mathrm{Bernoulli}(\theta_{ijk}), & \text{if } y_{ijk} = 0 \\
y_{ijk} \sim \mathrm{Lognormal}(\mu_{ijk},\, \sigma^2), & \text{if } y_{ijk} > 0
\end{cases}
\qquad [5.1]
$$

where y_ijk is the total duration of the behavioral interactions between animal i and animal j in social group k (in y_ijk, the first subindex corresponds to the aggression giver, the second subindex corresponds to the aggression receiver, and the third subindex indicates the social group), and θ_ijk is the expected probability of the total attacking duration being zero for animals i and j in social group k. Further, μ_ijk is the expected value (mean) of the natural logarithm of y_ijk, and σ² is the variance of the natural logarithm of y_ijk.

The transformed θ_ijk and μ_ijk have linear relationships with the explanatory variables (Equation 5.2):

$$
\begin{cases}
\log\!\left(\dfrac{\theta_{ijk}}{1-\theta_{ijk}}\right) = \mu'_{ijk} = b'_0 + FE'_{ijk} + g'_i + r'_j + d'_{ij} + sg'_k, & \text{if } y_{ijk} = 0 \\
\mu_{ijk} = b_0 + FE_{ijk} + g_i + r_j + d_{ij} + sg_k, & \text{if } y_{ijk} > 0
\end{cases}
\qquad [5.2]
$$

where μ'_ijk and μ_ijk are expected values (means) on an underlying linked scale that can be modeled as linear combinations of individual-level and dyad-level systematic effects (described below), and b'_0 and b_0 are overall intercepts. Note that b'_0, FE'_ijk, g'_i, r'_j, d'_ij, and sg'_k (notations with the prime superscript) represent effects used to model the probability of presenting no attacks, while b_0, FE_ijk, g_i, r_j, d_ij, and sg_k (effects without the superscript) model the attacking duration if the attack occurred. d'_ij ~ N(0, σ'²_d) and d_ij ~ N(0, σ²_d) represent random dyad effects, and sg'_k ~ N(0, σ'²_sg) and sg_k ~ N(0, σ²_sg) are random social group effects. The parameters g'_i, g_i, r'_j, and r_j are explained in the following section. Further, the fixed effects (overall means) FE'_ijk and FE_ijk in Equation 5.2 are defined as:

$$
\begin{cases}
FE'_{ijk} = sex'_k + \alpha' x_{jk} + \beta' w_{ik} + \delta'_1 z^{1}_{ijk} + \delta'_2 z^{2}_{ijk}, & \text{if } y_{ijk} = 0 \\
FE_{ijk} = sex_k + \alpha x_{jk} + \beta w_{ik} + \delta_1 z^{1}_{ijk} + \delta_2 z^{2}_{ijk}, & \text{if } y_{ijk} > 0
\end{cases}
\qquad [5.3]
$$

where sex'_k and sex_k are sex effects in social group k, x_jk is the (within-group) centered weight of the receiver animal j from social group k, w_ik indicates the (within-group) centered weight of the giver animal i from social group k, z¹_ijk represents whether animal i and animal j from social group k previously shared the same nursery group (z¹_ijk = 0 if they did not; otherwise, z¹_ijk = 1), and z²_ijk indicates whether animal i and animal j from social group k were previously housed together in the same litter before weaning (z²_ijk = 0 if they were not; otherwise, z²_ijk = 1). Finally, α', α, β', β, δ'_1, δ_1, δ'_2, and δ_2 denote the corresponding coefficients of the explanatory variables.
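A minimal sketch of fitting a model of this form in R is given below. It assumes the dyadic data frame from the earlier sketch, with additional assumed covariate columns sex, giver_wt, receiver_wt, nursery_mate, and litter_mate, and it uses the brms package with the hurdle_lognormal family; the published analysis used custom Stan code via rstan (see the companion GitHub repository). For simplicity, this version treats giver and receiver effects as independent random effects, so it does not include the giver-receiver covariance described in Section 3.2.3, nor the priors listed in the Appendix.

library(brms)

# Hurdle Bernoulli-lognormal GLMM: 'hu' models the probability of a zero (no attack),
# while the main formula models the log-duration of attacks given that they occurred.
fit <- brm(
  bf(y  ~ sex + receiver_wt + giver_wt + nursery_mate + litter_mate +
          (1 | giver_id) + (1 | receiver_id) + (1 | group_id),
     hu ~ sex + receiver_wt + giver_wt + nursery_mate + litter_mate +
          (1 | giver_id) + (1 | receiver_id) + (1 | group_id)),
  family = hurdle_lognormal(),
  data   = dyads,
  chains = 4, iter = 15000, warmup = 5000, thin = 10, cores = 4   # 4,000 retained draws
)
summary(fit)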
Without losing generality, we illustrate a linear model where the response can be simply decomposed into giver effects, receiver effects, and dyad-specific effects in Figure 5.2.

Figure 5.2 Illustration of a dyadic interaction model as an example that partitions the response into giver effects, receiver effects, and dyad-specific effects. Blue lines/arrows represent fixed effects, red represents random effects, and e stands for the residual term.

3.2.3 Modeling of (co)variances

Under the model in Equation 5.2, effects of the giver and the receiver are modeled for each animal. Those two effects covary, assuming:

$$
\begin{pmatrix} g'_j \\ r'_j \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma'^2_g & \sigma'_{gr} \\ \sigma'_{gr} & \sigma'^2_r \end{pmatrix} \right)
\qquad [5.4]
$$

$$
\begin{pmatrix} g_j \\ r_j \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2_g & \sigma_{gr} \\ \sigma_{gr} & \sigma^2_r \end{pmatrix} \right)
\qquad [5.5]
$$

where σ'_gr and σ_gr represent the covariance between the receiver and the giver effects of the same animal. Moreover, σ'_gr or σ_gr could take negative values. For example, σ_gr < 0 if an animal spends more time attacking other animals but receives less aggression (in terms of duration) from other animals. Conversely, σ_gr will be positive in those cases where animals that deliver more aggression also receive more aggression from other animals. A similar analysis could be done for σ'_gr but related to the probability of not delivering attacks. The magnitude and sign of these parameters are of importance to behaviorists. We also derive the estimated giver-receiver correlation, as this is easily interpretable to applied scientists:

$$
\rho'_{gr} = \frac{\sigma'_{gr}}{\sigma'_g \sigma'_r}, \qquad \rho_{gr} = \frac{\sigma_{gr}}{\sigma_g \sigma_r}
\qquad [5.6]
$$

The relative magnitudes of σ'²_g, σ'²_r, σ²_g, and σ²_r are also important. A relatively small value for a specific source of variation means that the process is mostly driven by other random sources.

3.2.4 Estimation

For statistical analysis, the model represented in Equations [5.1-5.3] could be fitted using restricted maximum likelihood or using Bayesian methods. We chose to use a Bayesian approach (Box and Tiao, 2011). Details of the implementation of model fitting are provided in the Appendix. In the companion GitHub repository (https://github.com/jun-jieh/DyadAnalysis) we provide examples of the implementation. A total of 4,000 Markov chain Monte Carlo (MCMC) samples were generated for parameter estimation. The parameter θ for a fixed effect given the observed data y was considered significant (P<0.05) if

$$
1 - \max\!\big(p(\theta < 0 \mid y),\; p(\theta > 0 \mid y)\big) < 0.05
\qquad [5.7]
$$

where p(θ < 0 | y) is the posterior probability that the parameter θ is smaller than zero, p(θ > 0 | y) is the posterior probability of θ being larger than zero, and the function max() returns the maximum of the two. In practice, these probabilities are estimated based on the relative frequencies obtained from the MCMC samples.

3.2.5 Validation strategies and posterior predictive checks: how well does the model fit the data?

Posterior predictive checking is an important part of model evaluation. For this checking, new data are simulated conditional on the fitted model and their distribution is compared to the observed data (Gabry et al., 2019). Moreover, this posterior predictive checking can be done with internal validation (all the data are used for model fitting and for validation) or using external validation (also known as out-of-sample or hold-out validation) (Vehtari et al., 2017; Vehtari and Ojanen, 2012), where the data are split into a training set (used for model fitting) and a validation set (used for the validation/checking). In the case of external validation, the way data are split is very important.
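Before turning to the data-splitting strategies, the sketch below illustrates, under the same assumptions as the earlier brms example, the decision rule of Equation [5.7] applied to the posterior draws of one coefficient and a graphical posterior predictive check of the proportion of zeros (the statistic displayed in Figure 5.3). The parameter name b_nursery_mate is an assumption tied to that illustrative fit, and older brms versions use nsamples instead of ndraws.

library(brms)

# Equation [5.7]: a fixed effect is called significant (P < 0.05) when at most 5% of
# its posterior draws fall on the minority side of zero.
post  <- as.data.frame(fit)                        # one column of posterior draws per parameter
draws <- post$b_nursery_mate                       # assumed name of the delta_1 coefficient
p_val <- 1 - max(mean(draws < 0), mean(draws > 0))
p_val < 0.05                                       # TRUE -> "significant"

# Posterior predictive check: proportion of zeros in 500 simulated datasets vs. observed.
pp_check(fit, type = "stat", stat = function(y) mean(y == 0), ndraws = 500)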
Specifically, we split the entire dataset into a training set and a validation set using three different strategies:

1. A stratified 5-fold cross-validation (Vehtari and Ojanen, 2012) was used, where in each fold a random subset of each social group (80% of the data) was utilized for model training, while the remaining records were used for testing purposes. This maintains the same proportion of records per social group across the five folds as in the original (entire) dataset.

2. A block-by-social-group (5-fold) cross-validation was performed. In each fold, all records from randomly selected social groups that made up approximately 80% of the entire data were pooled and used for training purposes, while the remaining set (validation data comprising 20% of the observations) came from the left-out social groups that were not part of the training data.

3. A block-by-focal-animals validation was proposed and run for five replicates. In this validation scenario, we selected seven animals from each group and used all their aggression records, as both aggression givers and aggression receivers, for the training set. The validation set contained only interactions between non-focal animals. This resembles a common way in which videos could be decoded; only some animals are followed and all their interactions with everyone else are decoded. Furthermore, by selecting seven focal animals per group, the resulting training set size was 78% of the entire data, while the remaining data (approximately 22% of all records) were used for testing. This led to a set size similar to those of the other two model checking strategies.

In each validation strategy, the model was fitted five times (for the five folds or replicates), and as part of the Bayesian model fitting procedure, 500 MCMC samples were generated from the posterior predictive density. The posterior distribution of the generated samples was compared to the distribution of the observed dataset, where the response variables were transformed onto a logarithmic scale. To evaluate the predictive performance, the Pearson correlation and root mean square error (RMSE) (Chai and Draxler, 2014) between the log-transformed (observed) response, i.e., log(response + 1), and the mean of the log-transformed predicted response (linear predictors in Equation 5.2) in the validation set were computed across all validation scenarios. In addition, the area under the ROC (Receiver Operating Characteristic) curve (Ling et al., 2003), also known as AUC, was computed to evaluate performance on the prediction of attack presence/absence.

3.3 Ethical approval

All animal protocols were approved by the Institutional Animal Care and Use Committee (Animal Use Form number 01/14-003-00).

4. RESULTS

4.1 Estimation of animal-specific effects, dyad-specific effects, and (co)variance components

Table 5.1 shows the posterior distributions of the individual animal effects and dyadic effects. The random dyad effect was not estimable (the model including the dyad effect did not converge; see Appendix for details). The nursery mate experience (animals in a dyad knew each other from sharing the same nursery pen prior to being remixed into grow-finish pens) exhibited significant effects, reducing the probability of presenting attacks and, if attacks happened, reducing their duration (δ'_1 = 0.505, P < 0.05; and δ_1 = −0.501, P < 0.05; Table 5.1).
The estimates indicated that, for dyads in which the pigs had nursery mate experience, we would expect a 65.7% increase in the odds of presenting no attacks (exp(0.505) ≈ 1.657); on the other hand, if the dyad did present attacks, pigs with nursery mate experience exhibited a 39.4% decrease in attacking duration (exp(−0.501) ≈ 0.606). This means that if animal i and animal j were housed in the same nursery pen previously and then are remixed into a grow-finish pen, as might be expected under production conditions, they are less likely to attack each other and, if they do attack each other, the attacking duration will be significantly shorter than the average attacking time of two animals who had not recently been housed together. The remaining animal-specific properties (weight of giver, weight of receiver, and sex) and dyad-specific attributes (whether the giver and the receiver were from the same litter) were not significant.

The giver-receiver correlation was not significant for the Bernoulli sub-model (when estimating the probability of animal i not presenting attacks to animal j; Table 5.1). On the other hand, a weak but significant correlation was obtained in the lognormal sub-model used to analyze the attacking duration (ρ_gr = 0.203, P < 0.05). This means that when one pig spends more time delivering aggression, this same pig will also receive longer attacks.

Table 5.1 Estimated posterior statistics for fixed effects and (co)variance components for the total attacking duration between the giver animal and the receiver animal. Q: quantile.

Sub-model for y_ijk = 0 (probability of no attack)
Parameter   Effect                        Mean     Q 2.5%   Q 50%    Q 97.5%
sex'        Sex                           -0.083   -0.483   -0.085    0.312
α'          Receiver weight                0.001   -0.011    0.001    0.013
β'          Giver weight                  -0.008   -0.027   -0.008    0.012
δ'_1        Nursery mate                   0.505    0.392    0.504    0.618
δ'_2        Litter mate                   -0.170   -0.357   -0.171    0.015
σ'²_g       Giver variance                 1.050    0.737    1.035    1.443
σ'²_r       Receiver variance              0.023    0.008    0.021    0.045
ρ'_gr       Giver-receiver correlation     0.155   -0.015    0.155    0.328
σ'²_sg      Social group variance          0.473    0.282    0.460    0.735

Sub-model for y_ijk > 0 (attacking duration)
Parameter   Effect                        Mean     Q 2.5%   Q 50%    Q 97.5%
sex         Sex                           -0.039   -0.162   -0.038    0.085
α           Receiver weight               -0.001   -0.008   -0.001    0.006
β           Giver weight                  -0.007   -0.017   -0.007    0.003
δ_1         Nursery mate                  -0.501   -0.573   -0.501   -0.424
δ_2         Litter mate                   -0.023   -0.139   -0.023    0.087
σ²_g        Giver variance                 0.063    0.044    0.062    0.089
σ²_r        Receiver variance              0.002    0.001    0.002    0.005
ρ_gr        Giver-receiver correlation     0.203    0.009    0.202    0.397
σ²_sg       Social group variance          0.020    0.001    0.019    0.052
σ²          Error variance                 1.076    1.031    1.075    1.123

4.2 Predictive performance in different validation strategies

We assessed the fitted model through posterior predictive model checks by inspecting two important aspects: 1) how well it predicted the probability of not having an attack, and 2) how well it predicted the mean duration of the attacks when they occurred. Figure 5.3 presents the posterior predictive distribution of the probability of observing no attacks between animals (relative frequency of zeros). The distribution of the proportion of predicted zeros across multiple replicates of simulated data (lighter bins in Figure 5.3) was centered around the proportion of zeros in the observed response (dark solid lines in Figure 5.3). That is, regardless of the training-testing data partition, the estimated validation proportion fell well within the posterior predictive density in all validation strategies.

Figure 5.3 Proportion of zeros of the validation set y (dark lines), with proportions of zeros for 500 simulated datasets ỹ drawn from the posterior predictive distribution (lighter bins).
A) the model that used all data points for model training to predict the same dataset; B) 5-fold cross-validation; C) block-by-social-group cross-validation; D) 5 replicates of block-by-focal-animal validation.

For each of the validation strategies, Figure 5.4 shows the distribution of the means of the simulated data (light bins) and the observed data (dark solid line). The response variables were transformed onto a logarithmic scale, i.e., log(response + 1). In general, the simulated data were consistent with the observed data (no systematic lack of fit was observed). However, in the internal validation and block-by-focal-animals validation, the mean duration of attacks was better approximated than when the stratified 5-fold and block-by-social-group cross-validation approaches were used.

Figure 5.4 Distribution of the mean value of all observations across replicates. The mean of the validation set y (dark solid line) is compared with the means of 500 simulated datasets ỹ drawn from the posterior predictive distribution (lighter bins). We compared the logarithm of the observed and simulated variables, i.e., log(y + 1) and log(ỹ + 1). A) the model that used all data points for model training to predict the same dataset; B) 5-fold cross-validation; C) block-by-social-group cross-validation; D) 5 replicates of block-by-focal-animal validation.

The correlation and RMSE of the log-transformed response and the AUC for the prediction of presence/absence of attacks in the validation set were computed across all validation scenarios (Table 5.2). These metrics allow for comparison between the different validation strategies. Compared to the internal validation (i.e., all the data were used for model fitting and for validation), the stratified 5-fold cross-validation, block-by-social-group cross-validation, and block-by-focal-animals validation showed much lower correlation and AUC, and larger RMSE (Table 5.2). Notably, the predictive performance in the block-by-social-group validation was consistently worse than that of the other validation strategies.

Table 5.2 Metrics for evaluating predictive performance of the model under different validation strategies. AUC, area under the ROC (Receiver Operating Characteristic) curve; RMSE, root mean square error; CV, cross-validation.

Validation strategy                   Pearson correlation    AUC                 RMSE
In-sample validation                  0.595                  0.826               1.173
Stratified 5-fold CV                  0.227 (SD±0.014)       0.653 (SD±0.017)    1.391 (SD±0.020)
Block-by-social-group CV              0.115 (SD±0.023)       0.523 (SD±0.017)    1.422 (SD±0.023)
Block-by-focal-animals validation     0.286 (SD±0.020)       0.532 (SD±0.013)    1.362 (SD±0.014)

5. DISCUSSION

In this study, we have illustrated how to use GLMMs to analyze dyadic data from animal behavior studies that record interactions between animals in social groups. Through changing distributional assumptions and link functions, this approach can be easily adapted to analyses of categorical, ordinal, count, and continuous response types. Instead of modeling an individual animal's response, the proposed model has the advantage of analyzing interactive behaviors of pairs of animals with flexibility and interpretability. Furthermore, the inclusion of random and fixed effects specific to each giver, receiver, and dyad (when possible) contributes to partitioning the observed variance into interpretable components. Several approaches have been used in the analysis of animal behavioral interactions.
A commonly used approach ignores the dyadic nature of the data and sums over rows or columns of an interaction matrix to simply obtain the total time spent by each individual engaged in the behavior of interest (Figure 5.1a). Following this summation, linear models are used to study several sources of individual-level effects on the behavior of interest. We call this a 'marginal analysis' as it operates on the margin of the interaction matrix. For example, Savory and Mann (1997) studied the effects of genetic strain, age, and feeding pattern on aggressive pecking behavior of pullets, where for each individual the proportion of the aggressive behavior was computed (i.e., the total time of aggression was summed and divided by the length of the observation period). Similarly, in two other studies (Turner et al., 2009, 2008), the authors recorded and treated the total duration of nonreciprocal aggression delivered and received by each individual pig as response variables, and they fitted linear mixed models to estimate additive genetic effects on the marginal response. In addition, Verdon et al. (2018) studied aggressive behavior of sows where the unit of analysis was a group of sows. They counted the frequency of aggressive interactions from all possible pairs of animals within each group and fitted generalized mixed models to analyze the marginal response. A shortcoming of these analyses is that the effect of dyadic factors cannot be investigated. The approach proposed in this study allows the inclusion of animal-level effects (marginal effects) as well as dyadic effects relevant to that particular pair of animals. For instance, we can add previous group- or littermate experience into the model for each dyad. In addition, genetic/genomic information on the giver and the receiver, as well as their genetic relationship, can be further included as an extension of our proposed model.

Another common approach analyzes dyadic interactions as independent observations. Oldham et al. (2020) fitted a linear mixed model to investigate effects of characteristics of both pigs on the initiating pig's latency to initiate agonistic behavior in a dyadic contest. That study was carefully designed and analyzed such that only one observation per animal and per contest (dyad) was available. This allowed the use of a simple linear model for the analysis. However, more precisely, dyads refer to the relation of two individuals embedded in a social context (Kenny et al., 2020), whereas Oldham et al. (2020) manually selected paired pigs for contests instead of selecting dyads from a social context. In dyadic data extracted from multiple social groups with more than two individuals per group, where each pig is exposed to multiple group mates, the assumption of independence between observations does not hold; instead, the dyad is the fundamental unit of analysis (Kenny et al., 2020), and the proposed approach allows modeling the variances and covariances of social groups, dyads, and individuals in a very straightforward way. For instance, we include the giver and receiver effects and account for their correlation. Thus, our proposed model is particularly useful for studying social interactions where animals are housed in multiple social groups over time.

In addition to introducing a model for the analysis of dyadic data in studies of social animal behavior, this study yields valuable results for understanding factors that affect post-mixing aggression in growing pigs.
The results indicate that the giver explained more of the variation in the dyadic interaction than the receiver (Table 5.1). To the best of our knowledge, only one previous publication has used a GLMM to dissect the giver and receiver effects in animal behavior data (Wang et al., 2022); however, they did not consider the inclusion of dyadic fixed effects or the inclusion of a dyad-level random effect. A related line of research used bivariate marginal models to study delivery and reception of non-reciprocal aggression (Turner et al., 2009, 2008). Interestingly, both studies found that delivery of aggression was more heritable than reception of aggression. This encourages further analyses with the dyadic model to tease apart genetic effects from environmental effects. One application in human behavioral ecology (Koster et al., 2015) proposed using GLMMs to perform dyadic analysis of food sharing between households and reported that the meal giver explained 75% of the variance while the variance ratio of the meal receiver was 6%, results quantitatively similar to ours. Given more complete datasets (with more observed variables of the interacting individuals), behaviorists could further use the proposed dyadic model to dissect factors that may influence delivery of the behavior as well as the characteristics of the receiver that attract the behavior.

In the context of post-mixing aggression in finishing pigs, we estimated the correlation between the random giver effect and the random receiver effect of the same individuals (Table 5.1). In a previously published marginal analysis of post-mixing aggression in pigs, the correlation between delivering and receiving non-reciprocal aggression did not differ significantly from zero (Turner et al., 2008). However, our model revealed a weak but significant correlation between the giver and receiver effects on the duration of the attacks. This means that animals which attack for a longer duration also receive longer attacks themselves, and this could be a result of receiver animals defending themselves and striking back (i.e., receivers may use attacks as a form of defense) (Oldham et al., 2020). The aggregated data used in this study did not allow investigation of the sequences of attacks (as our dyadic data were defined as the total aggression duration from a giver to a receiver); thus, further work is needed to analyze heterogeneous and repeated measures of dyadic interactions over time.

Interestingly, our model did not yield a significant effect of the bodyweight of the giver or receiver on the occurrence or duration of attacks. It is worth mentioning that the goal of this study was not to investigate bodyweight effects of the giver and receiver on attacking duration. Our result for the bodyweight effect might be due to the limited variation in body size within the social groups of our study, as we had deliberately mixed together animals of similar body size in the finishing groups, which could result in a non-significant effect of animal weight given the limited variation in those covariates. The literature does not offer a definitive conclusion regarding the effect of bodyweight on aggressive behaviors of pigs. In one study of aggressive contests between pigs (Oldham et al., 2020), neither the weight of the contest-initiating pig nor the weight difference between the contestants significantly influenced the latency to initiate the aggression. This agrees with our findings on bodyweight effects.
However, in another study of dyadic contests in pigs, the winner pigs were significantly heavier than the loser pigs (Camerlink et al., 2019). We need to point out that initiating an attack and winning a contest are different. Our result suggests that pigs might not be good at telling whether they were going to win or not when they decided to attack. Camerlink et al. (2015) also showed that, between pairs of size-matched pigs, pigs which were more likely to be attackers were not more likely to be winners.

The variance component of the random dyad effect (the effect of the giver-receiver relation; see Sections 3.1.2 and 3.2.2) was not estimable in this study; however, we found that having shared nursery pens immediately before being mixed into grow-finish pens (a dyad-level covariate) showed a significant effect (P<0.05; Table 5.1). This finding is confirmed in the literature; for instance, Li and Wang (2011) reported that unfamiliar pigs fought for longer durations and fought more frequently than familiar pigs when pigs were remixed into new social groups. However, another dyad-level predictor (whether the two pigs were previously housed together as a litter before weaning; the pigs were housed as a litter for approximately three weeks before being introduced to nursery pens) was not significantly associated with delivery or duration of aggression. This hints at the fact that animals who once shared a social group several weeks prior to the mixing, even if they are related, are unlikely to remember each other. It is unclear how long pigs remember each other, though a possible time range could be three to six weeks (Mendl et al., 2010). Since pigs in this study spent approximately seven weeks in nursery pens immediately before being remixed into grow-finish pens, it is possible that pigs did not recognize their initial littermates when re-introduced to them in grow-finish pens.

In addition to the models presented in this study, we also evaluated other GLMMs, including log-Poisson, zero-inflated log-Poisson, Gaussian, and zero-inflated Gaussian. The posterior predictive checks conducted (results not shown) indicated that the hurdle Bernoulli-lognormal model was the one that fitted the data best. The hurdle model did so by dissecting the trait into two components: the tendency of not delivering attacks and a second component of the attacking duration. The complexity of the model may limit its practical use, as there are two correlated traits rather than one per animal that can be used for decision making. However, it is worth mentioning that other GLMMs may be adequate for other settings. In the companion GitHub repository (https://github.com/jun-jieh/DyadAnalysis), we provide simpler GLMMs that are more general and can be easily adapted. In addition, to check model fit, in-sample posterior predictive checking (predicting the data used for model fitting) suffices, but for studying the model's ability to predict future data, out-of-sample validation (predicting observations left out of the model fitting process) should be used.

Social interaction data have recently been used as predictors of other traits. For example, Turner et al. (2020) constructed play fighting social networks of pigs using dyadic interaction data and extracted individual-level and network-level traits to build prediction models for lesion score counts. In a different application, Angarita et al. (2019) proposed using the dyadic matrices of aggression duration between pigs to parametrize social genetic effects of lesion scores.
In these cases, the dyadic data (or their derived social network features) were used as a predictor rather than as a response variable. Nevertheless, the proposed predictive modeling of dyadic data could be used to incorporate uncertainty into these applications. For instance, the internal and external validation (see Section 3.2.5 and Figures 5.3 and 5.4) used for model checking provides a natural way to resample plausible social interaction matrices that could then be subjected to social network analysis or included in social genetic effects modeling. Moreover, obtaining the sum of all dyadic interactions of an animal as a giver (or receiver) allows for predicting individual-level aggressiveness (or vulnerability), for instance, the marginal intensity as shown in Figure 5.1a. Such individual-level phenotypes can be used for management and as traits in genetic evaluations. Further experiments could be designed to validate the early prediction of animal social behavior that can be further related to animal welfare and production traits.

In this study, we considered several ways of splitting data for training (model fitting) and validation. Different validation scenarios (see Section 3.2.5) could be related to possible situations in real-life applications or relevant prediction problems (Burgueño et al., 2012). The stratified 5-fold cross-validation was designed to evaluate the model when it is used for predicting unobserved (directional) social behaviors between two animals. The block-by-social-group cross-validation mimicked a situation where the effects of giver, receiver, and dyad were evaluated in some social groups but not in others. Similarly, the block-by-focal-animals validation mimicked a situation where the giver, receiver, and dyad effects were modeled given records related to the focal animals but not to non-focal animals.

The predictive performance of the fitted models varied depending on the validation strategy (Table 5.2). The block-by-social-group cross-validation yielded the lowest correlation. This could be the result of not accounting for factors affecting social group composition. In fact, Samarakone and Gonyou (2009) have suggested that pigs may shift their aggressive behaviors according to the composition of their social groups. Consequently, this could be revisited in further analyses and experimental setups where more group-specific variables are recorded and included in the dyadic model. In short, animal behavioral studies may consider introducing group-specific effects into the proposed model and exploring how these effects influence interactive behaviors.

The block-by-focal-animal validation yielded a slightly higher correlation and a smaller RMSE compared to the stratified 5-fold cross-validation (Table 5.2). This result suggests that selecting focal animals and decoding their interactions with all other animals in the group may be a more efficient way to build predictive models of dyadic interactions than randomly selecting snippets of video for decoding. This idea has also been suggested by ethologists (Bosholn and Anciães, 2018). Furthermore, a dyadic model could be fitted using preliminary data to determine which factors better predict animal interactions, and then focal animals could be selected based on the significant factors to cover a large variation in responses. Such sampling strategies could be useful for improving the efficiency of manual video decoding.

6. CONCLUSION

We proposed an approach for the analysis of animals' social interactions based on modeling dyadic data.
We illustrated its use by fitting a generalized linear mixed model to the total attacking time post-mixing between pairs of grow-finish pigs. Taking advantage of the flexibility and interpretability of the proposed model, we found that if two pigs had shared a common nursery pen immediately before being remixed into new social groups, they tended to spend less time engaging in agonistic behavior. In addition, the positive correlation between the giver and receiver effects suggested that a pig that spent more time attacking also tended to be attacked for more time. The proposed model can be easily extended to incorporate additional giver-specific, receiver-specific, and dyad-specific effects. Moreover, we pursued alternative cross-validations and found that overlooking group-specific factors worsened the predictive performance of the proposed model. We also demonstrated that focusing on a fraction of all animals and decoding all their interactions with the remaining animals in the group is an effective way to perform inference and predictions on social interactions in the group while limiting the amount of time and effort dedicated to decoding video.

7. ACKNOWLEDGEMENTS

This work was funded by NIFA Awards 2017-67007-26176 and 2021-67021-34150.

APPENDIX

1. Implementation detail of the Bayesian approach

For parameter estimation, marginal posterior distributions of the parameters were obtained using the Markov chain Monte Carlo method implemented in the Stan program. We ran four chains of 15,000 iterations, where we set the burn-in (warmup) to 5,000 iterations, and every 10th sample was saved in each chain. Convergence diagnostics and graphical posterior predictive checks were performed using the rstan and bayesplot packages in R (R Core Team, 2020). In the following parts of the Appendix, we first present the prior distributions that were used in this study. We then show trace plots and autocorrelation plots of the fitted model. In addition, a summary of convergence diagnostics for the model with the random dyad effect is included at the end. In the companion GitHub repository (https://github.com/jun-jieh/DyadAnalysis) we provide example implementations of multiple GLMMs and their model fitting using a Bayesian method through the rstan package in R (Carpenter et al., 2017; R Core Team, 2020).

2. Prior distributions of model parameters described in Section 3.2

Wishart prior distributions were assigned to the 2×2 giver-receiver covariance matrices of Equations [5.4] and [5.5] (for both the Bernoulli and the lognormal sub-models). Uniform prior distributions were assigned to the social group effects sg'_k and sg_k, to the residual standard deviation σ, and to the fixed effects sex'_k, sex_k, α', α, β', β, δ'_1, δ_1, δ'_2, and δ_2.

Figure S5.1 Trace plots of posterior estimates of effects and variance components.

Figure S5.2 Autocorrelation plots by chain and by parameter.

Figure S5.3 Trace plots of variance components when the random dyad effect is included in the model.

Figure S5.4 Autocorrelation plots by chain and by variance components when the random dyad effect is included in the model.

Table S5.1 Summary of MCMC samples. Q, quantile; n_eff, effective sample size.
Parameter   Mean        sd         Q2.5%      Q50%       Q97.5%     n_eff      Rhat
sex         -0.038      0.063      -0.159     -0.038     0.088      3711.37    1.002
δ_1         -0.502      0.038      -0.574     -0.501     -0.427     4127.503   1
δ_2         -0.023      0.059      -0.14      -0.023     0.095      3887.644   1
α           -0.001      0.004      -0.009     -0.001     0.006      3316.333   1
β           -0.007      0.005      -0.018     -0.007     0.004      3820.173   1
σ           1.006       0.018      0.973      1.005      1.042      35.164     1.102
σ_sg        0.134       0.048      0.030      0.135      0.223      954.933    1
σ_d         0.245       0.066      0.076      0.258      0.340      17.19      1.232
sex'        -0.096      0.218      -0.521     -0.094     0.337      3921.573   1
δ'_1        0.552       0.068      0.420      0.552      0.684      3572.37    1
δ'_2        -0.189      0.109      -0.402     -0.19      0.027      3438.372   1.001
α'          0.001       0.007      -0.012     0.001      0.014      4097.341   1
β'          -0.009      0.011      -0.029     -0.009     0.012      3935.423   1.002
σ'_sg       0.749       0.094      0.584      0.744      0.954      3759.123   1
σ'_d        0.697       0.079      0.548      0.697      0.855      266.248    1.004
σ_r         0.047       0.011      0.026      0.047      0.071      1281.134   1.001
σ'_r        0.173       0.039      0.103      0.172      0.253      1341.962   0.999
σ_g         0.251       0.022      0.212      0.250      0.295      3503.677   0.999
σ'_g        1.216       0.113      1.011      1.209      1.450      1264.305   1.001
ρ_gr        0.147       0.107      -0.066     0.145      0.360      1322.434   1.008
ρ'_gr       0.068       0.093      -0.12      0.071      0.244      2482.093   0.999
lp__        -17317.6    1792.383   -19481.7   -17766.3   -11893.2   15.225     1.27

REFERENCES

Andersson, M., Simmons, L.W., 2006. Sexual selection and mate choice. Trends Ecol. Evol. 21, 296–302. https://doi.org/10.1016/j.tree.2006.03.015

Angarita, B.K., Cantet, R.J.C., Wurtz, K.E., O'Malley, C.I., Siegford, J.M., Ernst, C.W., Turner, S.P., Steibel, J.P., 2019. Estimation of indirect social genetic effects for skin lesion count in group-housed pigs by quantifying behavioral interactions. J. Anim. Sci. 97, 3658–3668. https://doi.org/10.1093/jas/skz244

Angarita, B.K., Han, J., Cantet, R.J.C., Chewning, S.K., Wurtz, K.E., Siegford, J.M., Ernst, C.W., Steibel, J.P., 2021. Estimation of direct and social effects of feeding duration in growing pigs using records from automatic feeding stations. J. Anim. Sci. 99, 1–8. https://doi.org/10.1093/jas/skab042

Back, M.D., Kenny, D.A., 2010. The Social Relations Model: How to Understand Dyadic Processes. Soc. Personal. Psychol. Compass 4, 855–870. https://doi.org/10.1111/j.1751-9004.2010.00303.x

Bierbach, D., Sassmannshausen, V., Streit, B., Arias-Rodriguez, L., Plath, M., 2013. Females prefer males with superior fighting abilities but avoid sexually harassing winners when eavesdropping on male fights. Behav. Ecol. Sociobiol. 67, 675–683. https://doi.org/10.1007/s00265-013-1487-8

Bosholn, M., Anciães, M., 2018. Focal Animal Sampling. Encycl. Anim. Cogn. Behav. 1–3. https://doi.org/10.1007/978-3-319-47829-6_262-1

Box, G.E.P., Tiao, G.C., 2011. Bayesian inference in statistical analysis. John Wiley & Sons.

Burgueño, J., de los Campos, G., Weigel, K., Crossa, J., 2012. Genomic prediction of breeding values when modeling genotype × environment interaction using pedigree and dense molecular markers. Crop Sci. 52, 707–719. https://doi.org/10.2135/cropsci2011.06.0299

Camerlink, I., Turner, S.P., Farish, M., Arnott, G., 2019. Advantages of social skills for contest resolution. R. Soc. Open Sci. 6, 1–8. https://doi.org/10.1098/rsos.181456

Camerlink, I., Turner, S.P., Farish, M., Arnott, G., 2015. Aggressiveness as a component of fighting ability in pigs using a game-theoretical framework. Anim. Behav. 108, 183–191. https://doi.org/10.1016/j.anbehav.2015.07.032

Carpenter, B., Gelman, A., Hoffman, M.D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., Riddell, A., 2017. Stan: A probabilistic programming language. J. Stat. Softw. 76.

Chai, T., Draxler, R.R., 2014.
Root mean square error (RMSE) or mean absolute error (MAE)? - Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 7, 1247–1250. https://doi.org/10.5194/gmd-7-1247-2014

Gabry, J., Simpson, D., Vehtari, A., Betancourt, M., Gelman, A., 2019. Visualization in Bayesian workflow. J. R. Stat. Soc. Ser. A Stat. Soc. 182, 389–402. https://doi.org/10.1111/rssa.12378

Kenny, D.A., Kashy, D.A., Cook, W.L., 2020. Dyadic data analysis. Guilford Publications.

Kenny, D.A., West, T.V., Malloy, T.E., Albright, L., 2006. Componential analysis of interpersonal perception data. Personal. Soc. Psychol. Rev. 10, 282–294. https://doi.org/10.1207/s15327957pspr1004_1

Koster, J., Leckie, G., Miller, A., Hames, R., 2015. Multilevel modeling analysis of dyadic network data with an application to Ye'kwana food sharing. Am. J. Phys. Anthropol. 157, 507–512. https://doi.org/10.1002/ajpa.22721

Li, Y., Wang, L., 2011. Effects of previous housing system on agonistic behaviors of growing pigs at mixing. Appl. Anim. Behav. Sci. 132, 20–26. https://doi.org/10.1016/j.applanim.2011.03.009

Ling, C.X., Huang, J., Zhang, H., 2003. AUC: A statistically consistent and more discriminating measure than accuracy. IJCAI Int. Jt. Conf. Artif. Intell. 519–524.

Lu, D., Jiao, S., Tiezzi, F., Knauer, M., Huang, Y., Gray, K.A., Maltecca, C., 2017. The relationship between different measures of feed efficiency and feeding behavior traits in Duroc pigs. J. Anim. Sci. 95, 3370. https://doi.org/10.2527/jas2017.1509

Mendl, M., Held, S., Byrne, R.W., 2010. Pig cognition. Curr. Biol. 20, 796–798. https://doi.org/10.1016/j.cub.2010.07.018

Oczak, M., Ismayilova, G., Costa, A., Viazzi, S., Sonoda, L.T., Fels, M., Bahr, C., Hartung, J., Guarino, M., Berckmans, D., Vranken, E., 2013. Analysis of aggressive behaviours of pigs by automatic video recordings. Comput. Electron. Agric. 99, 209–217. https://doi.org/10.1016/j.compag.2013.09.015

Oldham, L., Camerlink, I., Arnott, G., Doeschl-Wilson, A., Farish, M., Turner, S.P., 2020. Winner–loser effects overrule aggressiveness during the early stages of contests between pigs. Sci. Rep. 10, 1–13. https://doi.org/10.1038/s41598-020-69664-x

Peden, R.S.E., Turner, S.P., Boyle, L.A., Camerlink, I., 2018. The translation of animal welfare research into practice: The case of mixing aggression between pigs. Appl. Anim. Behav. Sci. 204, 1–9. https://doi.org/10.1016/j.applanim.2018.03.003

R Core Team, 2020. R: A Language and Environment for Statistical Computing.

Rodenburg, T.B., Bijma, P., Ellen, E.D., Bergsma, R., De Vries, S., Bolhuis, J.E., Kemp, B., Van Arendonk, J.A.M., 2010. Breeding amiable animals? Improving farm animal welfare by including social effects in breeding programmes. Anim. Welf. 19, 77–82.

Samarakone, T.S., Gonyou, H.W., 2009. Domestic pigs alter their social strategy in response to social group size. Appl. Anim. Behav. Sci. 121, 8–15. https://doi.org/10.1016/j.applanim.2009.08.006

Savory, C.J., Mann, J.S., 1997. Behavioural development in groups of pen-housed pullets in relation to genetic strain, age and food form. Br. Poult. Sci. 38, 38–47. https://doi.org/10.1080/00071669708417938

Silk, M.J., Finn, K.R., Porter, M.A., Pinter-Wollman, N., 2018. Can Multilayer Networks Advance Animal Behavior Research? Trends Ecol. Evol. 33, 376–378. https://doi.org/10.1016/j.tree.2018.03.008

Turner, S.P., Roehe, R., D'Eath, R.B., Ison, S.H., Farish, M., Jack, M.C., Lundeheim, N., Rydhmer, L., Lawrence, A.B., 2009.
Genetic validation of postmixing skin injuries in pigs as an indicator of aggressiveness and the relationship with injuries under more stable social conditions. J. Anim. Sci. 87, 3076–3082. https://doi.org/10.2527/jas.2008-1558

Turner, S.P., Roehe, R., Mekkawy, W., Farnworth, M.J., Knap, P.W., Lawrence, A.B., 2008. Bayesian analysis of genetic associations of skin lesions and behavioural traits to identify genetic components of individual aggressiveness in pigs. Behav. Genet. 38, 67–75. https://doi.org/10.1007/s10519-007-9171-2

Turner, S.P., Weller, J.E., Camerlink, I., Arnott, G., Choi, T., Doeschl-Wilson, A., Farish, M., Foister, S., 2020. Play fighting social networks do not predict injuries from later aggression. Sci. Rep. 10, 1–16. https://doi.org/10.1038/s41598-020-72477-7

Vehtari, A., Gelman, A., Gabry, J., 2017. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Stat. Comput. 27, 1413–1432. https://doi.org/10.1007/s11222-016-9696-4

Vehtari, A., Ojanen, J., 2012. A survey of Bayesian predictive methods for model assessment, selection and comparison. Stat. Surv. 6, 142–228. https://doi.org/10.1214/12-ss102

Verdon, M., Morrison, R.S., Hemsworth, P.H., 2018. Forming groups of aggressive sows based on a predictive test of aggression does not affect overall sow aggression or welfare. Behav. Processes 150, 17–24. https://doi.org/10.1016/j.beproc.2018.02.016

Wang, Z., Doekes, H.P., Bijma, P., 2022. Analysis of social behaviors in large groups: simulation and genetic evaluation. Proceedings of the WCGALP. Rotterdam, July 2022.

Wurtz, K.E., Siegford, J.M., Bates, R.O., Ernst, C.W., Steibel, J.P., 2017. Estimation of genetic parameters for lesion scores and growth traits in group-housed pigs. J. Anim. Sci. 95, 4310–4317. https://doi.org/10.2527/jas2017.1757

CHAPTER 6: GENERAL DISCUSSION

1. DISCUSSION

Predictive modeling has great potential to improve swine farming efficiency in various contexts. Despite the success of predictive modeling methods (Putka et al., 2018), many of them have not yet been applied to swine farming. Furthermore, the validity of those state-of-the-art prediction models remains unknown in pig genomic prediction and behavior recognition. Genomic prediction refers to the prediction of an animal's measurable trait or genetic value, and it has the potential for improved animal selection and reduced costs (Hickey et al., 2017). However, measuring animals' traits is costly in both time and labor, and thus predictive models for automated phenotyping (through video analysis) are helpful to obtain more rapid results. In this dissertation, I explored and adapted deep learning (DL) and generalized linear mixed models (GLMM) for studies of animal breeding and behavior. To validate the models, several strategies were investigated to split data into training data and validation data, where the training data were used for model development and the validation data were used for model evaluation.

Hyperparameter tuning is non-trivial and is a bottleneck for adapting DL to pig genomic prediction. The common practice is to optimize hyperparameters through grid search or exhaustive search, but these are costly in terms of time and computational power. In Chapter 2, I utilized differential evolution to search for hyperparameters, which saved considerable time compared to the traditional approaches. During the model development, I found that hyperparameter tuning was not the only factor that influenced the predictive performance of DL.
Different training-validation data splits, as well as training dataset size, also led to varying performance. Compared to random hyperparameter configurations and the hyperparameters recommended in the literature, the optimized DL models in this study showed significant performance gains. In the comparison of different genomic prediction methods, the prediction accuracy of the optimized DL was comparable to that of the standard genomic prediction method, genomic best linear unbiased prediction (VanRaden, 2008), suggesting that DL can be used as an alternative method for genomic prediction.

As the livestock sector is undergoing a data revolution, computer vision (CV) is emerging as a powerful solution for phenotyping and behavioral studies (Borges Oliveira et al., 2021). However, to date, there is not yet a reference CV dataset in livestock farming, which poses a challenge for developing video-based automatic phenotyping systems for animals. In Chapter 3, I investigated the small number of public imagery datasets that have been used to develop CV systems in livestock farming, and I reviewed the validation strategies utilized in the related work. In this review, I considered data as the fuel of DL-based CV algorithms. Most CV applications in livestock farming used random validation for model assessment, in which the training and validation sets were split at random. However, results from random validations could be overoptimistic, and random validation is less representative of real-life validation scenarios, as environments for capturing images are quite complex in animal farming (Li et al., 2021). I also found that in the studies which fitted the same model through different validation strategies (Alameer et al., 2020; Riekert et al., 2020; Shao et al., 2020), the evaluation metrics obtained from blocked validation strategies (where the training and validation sets are split by a blocking factor) were lower than the metrics computed for random validations. These results are relevant because researchers are ultimately interested in validating CV models in a way that examines how well they generalize to other contexts (e.g., across different seasons and different farms), which is closer to blocked validation strategies.

The traditional method for analyzing pigs' activities at feeders is through direct observation or by filming and later manual decoding of videos (Agha et al., 2020; Csermely and Wood-Gush, 1990; Machado et al., 2017; Nielsen et al., 1995). However, such approaches are not feasible in a commercial farming setup (Martínez-Avilés et al., 2017). In Chapter 4, I employed a state-of-the-art DL model for behavior recognition that learned both spatial and temporal features of pigs from videos. Hyperparameter tuning was performed, but little improvement was achieved in the optimization process. However, I found a major factor that greatly influenced the predictive performance of the model, namely the validation strategy used to split the dataset for training and validation purposes. For random validation (the standard validation approach of CV applications to animal farming), the proposed model yielded encouraging results. In addition to the random validation, I proposed blocked-by-time validation and blocked-by-feeder validation to evaluate the same model. As a result, the blocked validations yielded much lower performance compared to random validation.
Through this finding, I demonstrated that the random validation strategy might neglect temporal and spatial structure, a concern also raised by other researchers (Bergmeir and Benítez, 2012; Roberts et al., 2017). These results suggest that blocked-by-time and blocked-by-feeder validation yields much lower yet more reliable estimates of DL model performance.

In Chapter 5, I emphasized that the unit of observation in dyadic interaction is not an individual, but a pair of individuals. I showed the conceptualization, parameterization, and implementation of GLMMs that can be used to analyze dyadic data. The proposed model exhibited the advantages of analyzing dyadic interactions of pairs of animals in terms of flexibility (as it can be easily adapted to analyze different types of responses) and interpretability (as it decomposes the dyadic interaction into the giver animal effect, receiver animal effect, and dyad effect). As expected, the predictive performance of the model varied across validation strategies. In this study, different validation strategies could be related to possible situations in real-life applications or relevant prediction problems (Burgueño et al., 2012). The stratified 5-fold cross-validation was designed for evaluation when the model was used for predicting unobserved social behaviors between two animals. The block-by-social-group validation mimicked a situation where the effects of giver, receiver, and dyad were evaluated in some social groups but not in others. Similarly, the block-by-focal-animals validation mimicked a situation where the giver, receiver, and dyad effects were modeled given records related to the focal animals but not to non-focal animals. Interestingly, the block-by-focal-animal validation yielded slightly better performance than random validation. This result suggested that focusing on a fraction of all animals and decoding all their interactions with the remaining animals in the group is an effective way to perform inference and predictions on social interactions in the group while limiting the amount of time and effort dedicated to decoding video. This is informative to ethologists and breeders.

2. FUTURE DIRECTIONS

Deep learning-based CV models are powerful predictive tools in swine farming. Nevertheless, most published CV applications to animal farming are developed using rather small datasets, and their broader validity remains unknown. There is a need to create reference image datasets and standard validation methods depending on the livestock species and prediction problems, which would allow CV developers to benchmark their state-of-the-art algorithms.

To date, CV applications have been shown to be promising tools to assist in animal behavioral studies for many challenging tasks, e.g., detecting anomalous behaviors and phenotyping complex traits. However, those applications are currently "technical islands", as each CV model was developed for a very specific problem, while in livestock farming we are more interested in a versatile model that addresses multiple problems simultaneously, e.g., behavior recognition and bodyweight estimation through a single CV system. Thus, there may be a need for developing integrated systems that pool information for multiple purposes. Lastly, a key step in livestock farming is animal identification, which is useful for both management and phenotyping.
In commercial pig farming, most pigs have a white coat color, and their graphical morphologies vary in images as they move, which poses a challenge for CV systems to identify individuals. To address this problem, interdisciplinary work might be required between animal scientists and computer scientists to extract reliable visual components of pigs that contribute to animal identification through CV. Future work could be focused on robust CV models for pig identification.

REFERENCES

Agha, S., Fàbrega, E., Quintanilla, R., Sánchez, J.P., 2020. Social network analysis of agonistic behaviour and its association with economically important traits in pigs. Animals 10, 1–13. https://doi.org/10.3390/ani10112123

Alameer, A., Kyriazakis, I., Bacardit, J., 2020. Automated recognition of postures and drinking behaviour for the detection of compromised health in pigs. Sci. Rep. 10, 1–15. https://doi.org/10.1038/s41598-020-70688-6

Bergmeir, C., Benítez, J.M., 2012. On the use of cross-validation for time series predictor evaluation. Inf. Sci. (Ny). 191, 192–213. https://doi.org/10.1016/j.ins.2011.12.028

Borges Oliveira, D.A., Ribeiro Pereira, L.G., Bresolin, T., Pontes Ferreira, R.E., Reboucas Dorea, J.R., 2021. A review of deep learning algorithms for computer vision systems in livestock. Livest. Sci. 253, 104700. https://doi.org/10.1016/j.livsci.2021.104700

Burgueño, J., de los Campos, G., Weigel, K., Crossa, J., 2012. Genomic prediction of breeding values when modeling genotype × environment interaction using pedigree and dense molecular markers. Crop Sci. 52, 707–719. https://doi.org/10.2135/cropsci2011.06.0299

Csermely, D., Wood-Gush, D.G.M., 1990. Agonistic behaviour in grouped sows. II. How social rank affects feeding and drinking behaviour. Bolletino di Zool. 57, 55–58. https://doi.org/10.1080/11250009009355674

Hickey, J.M., Chiurugwi, T., Mackay, I., Powell, W., 2017. Genomic prediction unifies animal and plant breeding programs to form platforms for biological discovery. Nat. Genet. 49, 1297–1303. https://doi.org/10.1038/ng.3920

Li, G., Huang, Y., Chen, Z., Chesser, G.D., Purswell, J.L., Linhoss, J., Zhao, Y., 2021. Practices and applications of convolutional neural network-based computer vision systems in animal farming: A review. Sensors 21, 1–42. https://doi.org/10.3390/s21041492

Machado, S.P., Caldara, F.R., Foppa, L., De Moura, R., Gonçalves, L.M.P., Garcia, R.G., De Alencar Nääs, I., Dos Santos Nieto, V.M.O., De Oliveira, G.F., 2017. Behavior of pigs reared in enriched environment: Alternatives to extend pigs attention. PLoS One 12, 1–18. https://doi.org/10.1371/journal.pone.0168427

Martínez-Avilés, M., Fernández-Carrión, E., López García-Baones, J.M., Sánchez-Vizcaíno, J.M., 2017. Early Detection of Infection in Pigs through an Online Monitoring System. Transbound. Emerg. Dis. 64, 364–373. https://doi.org/10.1111/tbed.12372

Nielsen, B.L., Lawrence, A.B., Whittemore, C.T., 1995. Effect of group size on feeding behaviour, social behaviour, and performance of growing pigs using single-space feeders. Anim. Sci. 61, 575–579. https://doi.org/10.1017/S1357729800014168

Psota, E.T., Mittek, M., Pérez, L.C., Schmidt, T., Mote, B., 2019. Multi-pig part detection and association with a fully-convolutional network. Sensors (Switzerland) 19, 1–24. https://doi.org/10.3390/s19040852

Putka, D.J., Beatty, A.S., Reeder, M.C., 2018. Modern Prediction Methods: New Perspectives on a Common Problem. Organ. Res. Methods 21, 689–732.
https://doi.org/10.1177/1094428117697041

Riekert, M., Klein, A., Adrion, F., Hoffmann, C., Gallmann, E., 2020. Automatically detecting pig position and posture by 2D camera imaging and deep learning. Comput. Electron. Agric. 174. https://doi.org/10.1016/j.compag.2020.105391

Roberts, D.R., Bahn, V., Ciuti, S., Boyce, M.S., Elith, J., Guillera-Arroita, G., Hauenstein, S., Lahoz-Monfort, J.J., Schröder, B., Thuiller, W., Warton, D.I., Wintle, B.A., Hartig, F., Dormann, C.F., 2017. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography (Cop.). 40, 913–929. https://doi.org/10.1111/ecog.02881

Shao, W., Kawakami, R., Yoshihashi, R., You, S., Kawase, H., Naemura, T., 2020. Cattle detection and counting in UAV images based on convolutional neural networks. Int. J. Remote Sens. 41, 31–52. https://doi.org/10.1080/01431161.2019.1624858

VanRaden, P.M., 2008. Efficient methods to compute genomic predictions. J. Dairy Sci. 91, 4414–4423. https://doi.org/10.3168/jds.2007-0980