ROBUST FRUIT DETECTION AND LOCALIZATION FOR ROBOTIC HARVESTING By Pengyu Chu A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical Engineering—Doctor of Philosophy 2023 ABSTRACT Automated apple harvesting has attracted significant research interest in recent years due to its potential to revolutionize the apple industry, addressing the issues of shortage and high costs in labor. One key enabling technology towards automated harvesting is robust ap- ple detection and localization, which poses great challenges because of the complex orchard environment that involves varying lighting conditions and foliage/branch occlusions. In this dissertation, I first propose a suppression Mask RCNN to generally improve the accuracy for apple detection. The developed feature suppression network significantly reduces false detec- tion by filtering non-apple features learned from the feature learning backbone. In addition, I propose a novel deep learning-based object detection method Occluder-Occludee Relational Network (O2RNet), which addresses the challenge of detecting and isolating clustered ap- ples in apple orchards. This was motivated by the observation that previous object detection techniques have exhibited limited success in handling fruit occlusion and clustering, which are common issues in agricultural settings. To overcome these challenges, O2RNet employs a two-stage approach, where in the first stage, I use a customized deep Feature Pyramid Network (FPN) architecture to generate candidate regions of interest (ROIs) for potential fruit objects. The second stage feeds these candidate ROIs into the occluder branch and occludee branch respectively using a feature expansion structure (FES). By leveraging this two-stage approach, O2RNet can effectively isolate individual apples from clustered regions, thereby facilitating accurate apple detection. Furthermore, I focus on developing an Ac- tive Laser-Camera Scanning (ALACS) scheme to achieve a high-precision 3D localization of detected apples and overcome existing localization challenges like varying illumination con- ditions, complex occlusion scenarios, and limited geometric information. The hardware of ALACS includes a red line laser, an RGB camera, and a linear motion slide. All these compo- nents are seamlessly integrated for fruit localization by using an active scanning scheme and laser-triangulation technique. The technique integrates semantic information from O2RNet’s detection results with bounding boxes to generate accurate 3D coordinates for each detected apple. Last but not least, I propose a Skeleton-lead Segmentation Network (SkeSegNet) and integrated it with the Panoptic-Deeplab. SkeSegNet is developed to address the challenges of segmenting complex branches by treating branches as a set of skeletons. Combined with depth map, SkeSegNet generates 3D branches locations for efficient obstacle avoidance. I evaluated each approach in the comprehensive experiments and superior experimental results demonstrated the effectiveness of the proposed approaches. Copyright by PENGYU CHU 2023 ACKNOWLEDGEMENTS I would like to thank my advisor, Dr. Zhaojian Li and my co-advisor Dr. Renfu Lu, for offering me the opportunity to join the Apple Harvesting Robot project at Michigan State University. I am very honored to have Dr. Li as my primary advisor. Dr. Li’s mentorship and encouragement have made me accomplish goals that I thought impossible for myself. 
I truly learned a lot from his knowledge, experience and vision, not only in research but also in many other perspectives. It is my great pleasure to have Dr. Lu as my co-advisor. The time I spent brainstorming, discussing research ideas, and polishing papers with him has significantly improved my skills in critical thinking, academic writing and presentation. I would like to express my gratitude to my committee members Dr. Daniel Morris and Dr. Vaibhav Srivastava for their advice, insight and mentorship. I am deeply thankful for the opportunity to collaborate with an exceptional group of colleagues, faculty, and researchers throughout my Ph.D. program. I would like to thank Dr. Xiaoming Liu for offering valuable advice at the beginning of my work, and Dr. Jiayu Zhou for enriching my learning experience. I am also grateful to have worked with Dr. Kaixiang Zhang, Kyle Lammers and Keyi Zhu. The collaborative work presented in this dissertation would not have been possible without their valuable contributions. I am grateful for my labmates from RIVAL Lab, and I will never forget the good memories we shared there. Last but certainly not least, I want to thank my friend Dr. Liang Han for his valuable advice and help at the early stage of my PhD. I want to thank my parents for their endless and unconditional love, faith and support that they have always given me.

TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION AND MOTIVATION
CHAPTER 2 BACKGROUND AND RELATED WORK
CHAPTER 3 SUPPRESSION MASK R-CNN FOR APPLE DETECTION
CHAPTER 4 O2RNET: OCCLUDER-OCCLUDEE RELATIONAL NETWORK FOR CLUSTERED APPLE DETECTION
CHAPTER 5 ALACS: ACTIVE LASER-CAMERA SCANNING FOR 3D APPLE LOCALIZATION
CHAPTER 6 SKESEGNET: SKELETON-LEAD SEGMENTATION NETWORK FOR BRANCH SEGMENTATION
CHAPTER 7 CONCLUSION AND FUTURE WORK
BIBLIOGRAPHY

CHAPTER 1 INTRODUCTION AND MOTIVATION
In this chapter, I first introduce the motivation of this thesis and the challenges in robotic fruit harvesting. Then, I introduce the specific research objectives of this thesis and provide a summary of the key contributions.

1.1 Motivation
Fruit harvesting is highly labor-intensive and costly; it is estimated that the labor needed for apple harvesting alone exceeds 10 million worker hours annually, accounting for approximately 15% of the total production cost in the U.S. [37]. The growing labor shortage and rising labor costs have steadily eroded the profitability and sustainability of the fruit industry. Furthermore, manual picking activities pose great risks of back strain and musculoskeletal pain to fruit pickers due to repetitive hand motions, awkward postures when picking fruits at high locations or deep in the canopy, and ascending and descending ladders with heavy loads [33]. Therefore, there is an imperative need for the development of robotic mass harvesting systems to tackle the labor shortage, lower human injury risks, and improve the productivity and profitability of the fruit industry. The first and foremost task in robotic harvesting is fruit detection and localization, which identifies fruit in the area of interest and provides targets for the robot to perform subsequent actions.
Due to the low cost of cameras and the tremendous advances in computer vision [87], image-based fruit detection systems have gained great popularity in robotic fruit harvesting since the late 1980s. Although robust apple detection in the presence of complex tree structures and varying lighting conditions is a challenging task, deep learning-based perception techniques [63, 49, 48, 114, 70] have made robotic fruit harvesting a practical possibility.

1.2 Challenges in Robotic Fruit Harvesting
Fruit harvesting presents a myriad of challenges, encompassing various aspects that significantly impact the efficiency and precision of the process [23]. Despite progress [55, 116, 89, 25, 132], several important challenges in developing a fully functional robotic harvesting system remain, and no commercially viable systems are yet available in the market. One key challenge is posed by fruit clusters, where multiple fruits grow closely together, making it difficult to isolate and identify individual fruit. Occlusion, another formidable challenge, arises when fruits are hidden or partially obscured by foliage or branches, and often leads to failed detections. Additionally, the 3D localization of fruits under occlusion is often imprecise and may lead to harvesting errors. Branch obstacles further compound these difficulties, as the harvesting robot must navigate through the complex three-dimensional structure of the plant while avoiding damage to both the fruits and the tree itself [3]. Together, these challenges make robotic fruit harvesting substantially harder in complex orchard environments.

1.3 Summary of Research Contributions
This dissertation studies the detection and 3D localization of fruits in complex orchards for robotic apple harvesting applications. In particular, I propose a suppression Mask R-CNN and an Occluder-Occludee Relational Network to enhance fruit detection precision. I also explore using a novel laser-based device to improve the accuracy of 3D fruit localization. I then study applying panoptic perception algorithms to orchard segmentation to help our harvesting robot avoid branch obstacles. This dissertation delivers the following contributions:
1. A new deep network, the suppression Mask R-CNN, is proposed to remove false detections due to occlusion and to increase the accuracy and robustness of apple detection.
2. A comprehensive apple dataset of 1246 images for two varieties of apple under different lighting conditions and occlusion levels is collected from two orchards during two harvesting seasons.
3. A novel Occluder-Occludee Relational Network (O2RNet) is proposed for enhanced apple detection in the presence of occlusions due to apple clusters.
4. A novel Active Laser-Camera Scanning system, consisting of a red line laser, an RGB-D camera, and a linear motion slide, is designed and developed for accurate 3D fruit localization.
5. A Laser Line Extraction (LLE) algorithm is proposed and implemented for robust feature matching to enable stable 2D-3D transformation for ALACS.
6. An image annotation tool, PicA, is developed and open-sourced to alleviate the burden of manual annotation by leveraging the concept of superpixels and pre-trained models associated with panoptic segmentation annotation.
7. A Skeleton-lead Segmentation Network (SkeSegNet) is proposed to address the challenges of segmenting complex branches, and it generates 3D branch locations for efficient obstacle avoidance using the depth map.
1.4 Thesis Organization The remainder of this dissertation is organized as follows: Chapter 2: Background and Related Work This chapter presents a comprehensive introduction to fruit harvesting robots and our developed RIVAL’s apple harvesting robot. Besides, I give an overview of existing works for fruit perception in harvesting robots, which give supports to our research work. Chapter 3: Suppression Mask R-CNN for Apple Detection This chapter shows my methods for collecting a comprehensive apple dataset for two varieties of apples with distinct yellow and red colors under different lighting conditions from the real orchard environment. A novel suppression Mask R-CNN was developed to robustly detect apples from the dataset. Our developed feature suppression network significantly reduced false detection by filtering non-apple features learned from the feature learning backbone. Our suppression Mask R-CNN demonstrated superior performance, compared to state-of-the-art models in experimental evaluations. Chapter 4: O2RNet: Occluder-occludee Relational Network for Clustered Apple Detection In this chapter, I discuss the challenges associated with the detection of clustered ap- ples, a task that demands heightened precision and adaptability due to the complex spatial 3 arrangements of fruit clusters within orchards. I propose a novel solution, the Occluder- occludee Relational Network (O2RNet), designed to address the specific challenge posed by occlusion and cluster proximity. Subsequently, I provide a comprehensive evaluation of the OR2Net’s performance, assessing its efficacy in real orchard environments and under varying conditions. Chapter 5: ALACS: Active Laser-Camera Scanning for 3D Apple Localization In this chapter, I review several consumer RGB-D cameras and depict our proposed localization technique, called Active Laser-Camera Scanner (ALACS). I propose a feature- matching algorithm, called Laser Line Extraction (LLE), to help ALACS transform the 2D fruit positions to 3D fruit positions, thus achieving accurate mapping even under complex fruit morphology, variable lighting conditions and occlusions. Chapter 6: SkeSegNet: Skeleton-lead Segmentation Network for Branch Seg- mentation In this chapter, I delve into the realm of panoptic segmentation, presenting a compre- hensive overview of the field. I introduce our novel approach, SkeSegNet, aimed at ad- vancing branch segmentation by leveraging skeletal information for enhanced accuracy and robustness. Furthermore, I explore the by-product of SkeSegNet to generate 3D branch representations. Chapter 7: Conclusion and Future Work This chapter concludes the thesis and discusses the future work. 4 CHAPTER 2 BACKGROUND AND RELATED WORK In this chapter, I provide a comprehensive introduction to fruit harvesting robots and our developed RIVAL’s apple harvesting robot. Besides, I give an overview of existing works for fruit perception in harvesting robots, which give supports to our research work. 2.1 Introduction The apple industry relies heavily on manual labor. For instance, in the United States alone, it is estimated that the seasonal labor force needed for apple harvesting is more than 10 million worker hours each year, attributing to about 15% of the total production costs [38]. The growing labor shortage and increased labor cost have thus become major concerns for the long-term sustainability and profitability of the apple industry. 
In the meantime, the past decade has seen great transitions in apple production systems; traditional unstructured orchards have been replaced with high-density orchard systems where trees are smaller and more uniformly structured (i.e., v-trellis, vertical fruiting wall, etc.). These modern tree structures can greatly facilitate orchard automation, and thus there has been a renewed interest in pursuing robotic harvesting as a promising solution to reduce the harvesting cost and dependence on manual labor. Over the past few years, several robotic systems have been designed to autonomously har- vest different horticultural crops, including sweet pepper [64], strawberry [123], apple [100], and kiwifruit [119]. For apple harvesting, the automation system designs can be mainly grouped into two categories. The first category is the shake-and-catch harvesting [135], where vibrations are applied to the tree trunk and/or branches to detach the fruits. Al- though the shake-and-catch harvesting systems are efficient in detaching fruits from trees, they often result in a high rate of apple bruising that is not acceptable for fresh market. The other category is the fruit-by-fruit harvesting where manipulators are used to pick fruits in a controlled manner, and thus can substantially reduce fruit damage. However, designing such systems with high picking efficiency and practical viability presents a great challenge. 5 So far, several fruit-by-fruit robotic apple harvesting systems have been developed [4, 100, 50, 133, 15]. For instance, Baeton et al. combines a 7 degree-of-freedom (DOF) industrial manipulator with a vacuum activated, funnel shaped gripper for apple detachment, and the harvesting cycle time is 8-10 s/fruit [4]. In [100], both hardware and software designs of an apple harvester are presented. Field tests conducted on a v-trellis orchard show that this system is able to pick 84% of 150 apples attempted with the overall harvesting time being 7.6 s/fruit. In [50], Hohimer et al. developed a harvesting robot based on a pneumatic soft-robotic end-effector, and the average time that the system takes from apple detachment to transported to storage bin is 7.3 s/fruit. Despite the aforementioned progresses, the low picking efficiencies of existing systems are still unsatisfactory for their practical use in the real orchard environment [73]. 2.2 RIVAL’s Apple Harvesting Robot Despite the significant progress, the existing robotic apple harvesting systems are still far from being commercially viable mainly because they are unreliable in performance and inefficient or too slow in picking fruit in the real orchard environment and are too complicated or expensive to be economically sound [43]. Hence, I designed a high-performance robot platform [134] for automated fruit-by-fruit apple harvesting. The principal objective of the hardware system design is to build a fully self-contained, modular harvesting platform that can support various payloads, endure rough terrains, and be easily modifiable and extensible. To facilitate the movement in the orchard environment, the whole system is built on a trailer base which can be hauled by a farm vehicle or robotic moving platform. The developed robotic apple harvesting system is shown in Figure 2.1, which consists of three major modules: the support hardware module, the computer & operation module, and the robotic harvester. The support hardware module includes a 5.5 kW Honda gas-powered electric generator and a Delfin industrial vacuum. 
The generator provides a 240 V power source and can continuously support the whole system running for approximately 5.5 hours if fully refilled. 6 Figure 2.1 Hardware modules of the developed robotic apple harvesting system. The Delfin industrial vacuum has two powerful bypass motors with independent cooling and can generate a peak horsepower of 5.5 HP. During fruit harvesting, this vacuum machine runs continuously to provide vacuum flow, generating suction forces to enable the soft effector to detach fruits (see Section 2.2.3 for descriptions on the end-effector). The computer & operation module is comprised of a high-performance industrial com- puter and a workstation where users can monitor and control the robotic system. The industrial computer has an Intel®Xeon E2176G processor, 64 GB of RAM, and a NVIDIA GeForce RTX 2080 Ti graphic processing unit. This computer hosts all software algorithms and the communication connections to all components. The main components of the robotic apple harvesting system are shown in Figure 2.2, including a perception component, a 4-DOF manipulator, a vacuum-based soft end-effector, and a dropping component. More detailed descriptions on these four components are given in the following sections. 7 Figure 2.2 Main components of the robotic harvester for apple picking. 2.2.1 Perception Component For robotic apple harvesting, the first and foremost task is to detect and localize the fruits. As shown in Figure 2.2, an Intel RealSense D435i RGB-D camera and a custom-built laser-camera unit are integrated as the sensor set to achieve apple detection and localiza- tion. The RGB-D camera is mounted on a horizontal frame that is above the manipulator. Different from the other robotic harvesting systems (e.g., [? ]) that attach the camera to the manipulator or the end-effector, our installation scheme ensures that the RGB-D camera can provide a global view of the scene, which facilitates the use of multiple manipulators planned in our future versions. Since the depth measurement of consumer RGB-D camera is not stable under leaf/branch occlusions and/or challenging lighting conditions (see [79, 35] for an in-depth review on localization performance of commercial RGB-D sensors), I design a new laser-camera unit to address this issue. The specially designed laser-camera unit is comprised of a red line laser (635 nm), a Flir RGB camera, and a linear motion slide. The 8 line laser is mounted on top of the linear motion slide which enables the laser to move back and forth horizontally with a full stroke of 20 cm. Meanwhile, the Flir RGB camera is installed at the rear end of the linear motion slide with a relative angle to the laser scan- ner. The RGB-D camera and the laser-camera unit are fused synergistically to achieve high perception accuracy and robustness, where the fusion scheme will be detailed in Chapter 5. 2.2.2 Manipulator A 4-DOF manipulator is designed and assembled with compact mechanical structure for efficient manipulation in the workspace. As illustrated in Figure 2.3, the manipulator consists of three revolute joints and one prismatic joint. The first and second revolute joints are connected by an L-shaped aluminum plate to form a pan-and-tilt mechanism. The prismatic joint is used as the base of the pan-and-tilt mechanism to extend the manipulator’s workspace. 
A hollow aluminum link is installed on the pan-and-tilt mechanism to enable the end-effector to reach the apples and serve as a vacuum tube for grasping the fruits during the harvesting process. The third revolute joint is assembled at the rear end of the aluminum tube to create a rotation mechanism. After the end-effector has grasped the fruit, this rotation mechanism is triggered to rotate the aluminum tube to detach the fruit from the tree. In addition, all the joints are driven by servo motors instead of using a hybrid pneumatic/motor actuation mechanism as in our previous design [133], which simplifies the actuation and facilitates an integrated control scheme design. Different from most existing studies [? ? ? ? ] that rely on high DOF industrial manipulators, the developed 4-DOF manipulator is simple and compact in structure and highly efficient in picking fruit. This hardware design allows convenient development of adaptable control algorithms such that agile manipulation can be achieved. It also facilitates future extensions of multiple robotic arms for coordinated multi-arm apple harvesting. 2.2.3 End-Effector In our robotic system, a vacuum-based soft end-effector is used to grasp and detach fruits. The end-effector is a vacuum cup made of silicone rubber and is attached to the 9 Figure 2.3 CAD model of the 4-DOF manipulator. front end of the aluminum tube (i.e., the front end of the manipulator). Through laboratory experiments, a silicone material with a hardness of 40 Shore A and a cup shape geometric design shown in Figure 2.4 are selected for the end-effector. Compared with our previous designs [72], the current cup shape geometric design has a smaller inner diameter and a larger outer-lip. The silicone material and the improved cup’s geometry allow for conformity to the fruit contours to generate a sufficient suction force needed for holding and detaching the fruit, while also being flexible or deformable to minimize or eliminate fruit bruising. The end-effector is securely connected to the aluminum tube through an adaptor and the rear end of the aluminum tube is connected to the Delfin industrial vacuum via a flexible and expandable tube. One major advantage of using the vacuum-based end-effector is that it can tolerate some approaching inaccuracy; it is able to attract fruits within a distance of about 1.5 cm when operated under the current vacuum flow. This allows the manipulator to grasp the fruit even if it does not approach the fruit accurately. 2.2.4 Dropping/Catching Component A dropping/catching component is assembled and attached to the robot platform to enable faster release or dropping of the picked fruit. The base of the dropping component is made up of a rectangular aluminum plate covered with a soft foam cushion of 50 mm in 10 Figure 2.4 CAD model of the vacuum-based soft end-effector. thickness. Laboratory tests showed that the foam cushion allows apples to drop from the highest position of the end-effector (approximately 80 cm) without causing fruit bruising, while keeping fruit bouncing to minimum. The manipulator can drop the picked apples at any spots from above the dropping component without fully returning to its home position, thus reducing the overall fruit picking cycle time. After an apple has fallen onto the sloped surface of the dropping component, it rolls down to a screw-driver conveyor, which transports the apple to the destination or a bin [136]. 
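To make the kinematic chain of Section 2.2.2 concrete, the following is a minimal forward-kinematics sketch of the 4-DOF arm (a prismatic base joint, the pan and tilt revolute joints, and a distal joint that rotates the aluminum tube). The joint-axis assignments, link length, and example joint values below are illustrative assumptions for this sketch only, not the actual CAD parameters of the manipulator.

```python
"""Illustrative forward kinematics for a prismatic + pan-tilt + roll arm."""
import numpy as np

def rot_x(q):
    c, s = np.cos(q), np.sin(q)
    return np.array([[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]])

def rot_y(q):
    c, s = np.cos(q), np.sin(q)
    return np.array([[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]])

def rot_z(q):
    c, s = np.cos(q), np.sin(q)
    return np.array([[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])

def trans(x, y, z):
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T

def end_effector_pose(slide, pan, tilt, roll, tube_len=0.8):
    """Homogeneous pose of the vacuum cup for given joint values (m, rad)."""
    T = trans(slide, 0.0, 0.0)          # prismatic joint extends the workspace
    T = T @ rot_z(pan) @ rot_y(tilt)    # pan-and-tilt formed by the two revolute joints
    T = T @ trans(tube_len, 0.0, 0.0)   # hollow aluminum tube to the end-effector
    T = T @ rot_x(roll)                 # distal joint rotates the tube about its own axis
    return T

# Example: slide out 0.1 m, pan 20 deg, tilt -10 deg, no roll
pose = end_effector_pose(0.1, np.deg2rad(20), np.deg2rad(-10), 0.0)
print(pose[:3, 3])  # Cartesian position of the end-effector
```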
2.2.5 Software Design
The software suite is designed within the robot operating system (ROS) framework. Different software components communicate primarily via custom messages sent through ROS actions and services. Figure 2.5 shows the main logic/algorithm flow of the software system during apple harvesting. It is apparent that the software design of our robotic system requires multi-disciplinary advances to enable various synergistic functionalities and coordination for achieving reliable automated apple harvesting. The logic flow of the apple harvesting cycle is detailed in the following. At the beginning of each harvesting cycle, the RGB-D camera is triggered to acquire images at 30 fps. Based on the obtained image information, deep learning and active laser-camera scanning are exploited to detect and localize the fruits within the manipulator's workspace. A list of 3-dimensional (3D) apple locations is generated, and the one at the top of the list, following the pre-defined criteria, is selected as the target fruit. Since the location results provided by the RGB-D camera might not be sufficiently accurate, the developed laser-camera unit and corresponding perception scheme are triggered to scan the target fruit and calculate its 3D position.

Figure 2.5 Logic flowchart in apple harvesting.

Given the refined target apple location, the planning algorithm is used to generate a reference trajectory, and the control module actuates the manipulator to follow this reference trajectory to reach the fruit. Once the fruit is successfully attached to the end-effector (detected by a pressure sensor mounted inside the tube), the rotation mechanism is triggered to rotate the whole aluminum tube by a certain angle, and the manipulator then retracts to pull and detach the apple (if the rotation action has not resulted in complete detachment of the fruit). After the manipulator reaches a pre-determined dropping spot, a vacuum control valve installed between the outlet of the vacuum machine and the inlet of the flexible vacuum hose is actuated, which causes a rapid loss of vacuum pressure in the tube, thus enabling the fruit to fall off the end-effector by gravity onto the dropping component. The fruit finally rolls down the slope of the dropping component to the screw conveyor, from which the apple is transported to the destination.

2.2.6 Field Test Environment
To fully evaluate the performance of the developed apple detection (see Chapters 3 and 4) and localization (see Chapter 5) algorithms, field tests were conducted in two Michigan State University research orchards in East Lansing and Holt, Michigan, USA, respectively, during the 2021, 2022, and 2023 harvest seasons (Figure 2.6). The first orchard was planted with two-year-old 'Gala' apple trees, a popular bicolored variety with red coloration over a yellow background. There were fewer fruits grown on these young trees, with fewer occlusions by branches and foliage, but many of the fruits grew in clusters (Figure 2.6(b)). The second orchard contained seven-year-old 'Ida Red' apple trees (Figure 2.6(c)). Since the trees had not been pruned and thinned during the winter and early spring seasons, there were dense and unstructured foliage and branches. A high percentage of apples grew in clusters and were occluded by leaves and branches, which presented a significantly more challenging environment for our robot.
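The harvesting-cycle logic of Section 2.2.5 can also be summarized compactly in code form. The sketch below is hypothetical: the objects and method names stand in for the actual ROS actions and services and are not the real interfaces of the software suite.

```python
"""Condensed, hypothetical sketch of one harvesting cycle (Section 2.2.5)."""

def harvest_one_apple(perception, laser_unit, planner, controller, end_effector):
    # 1. Detect and coarsely localize fruits from the RGB-D stream
    targets = perception.detect_and_localize()   # ranked list of 3D apple locations
    if not targets:
        return False
    target = targets[0]                           # top of the list per pre-defined criteria

    # 2. Refine the 3D position with the active laser-camera unit
    refined_xyz = laser_unit.scan(target)

    # 3. Plan and track a reference trajectory to the fruit
    trajectory = planner.plan(refined_xyz)
    controller.follow(trajectory)

    # 4. Verify attachment via the in-tube pressure sensor, then detach
    if not end_effector.is_attached():
        return False
    controller.rotate_tube()                      # rotation mechanism aids detachment
    controller.retract()                          # pull back if rotation alone is not enough

    # 5. Move over the dropping component and release the vacuum
    controller.move_to_drop_spot()
    end_effector.release_vacuum()                 # valve opens; fruit falls onto the cushion
    return True
```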
It should be mentioned that most reported studies tested their robotic apple harvesting systems in high-density orchards with well-trained tree architectures, few clustered apples, and less dense foliage (either naturally or after removal).

2.3 Existing Works in Perception of Fruit Harvesting
Fruit perception is one of the key functionalities in robotic harvesting. Several research groups have been developing robotic harvesting systems [55, 116, 89, 25]. Despite progress, several important challenges in developing a fully functional robotic harvesting system remain, and no commercially viable systems are yet available in the market. One key challenge pointed out by the existing works is efficient and robust fruit detection in the presence of varying light conditions and fruit/foliage occlusions. Indeed, the perception system provides the robot with information on target fruits, which is essential for subsequent planning and control tasks. In addition, fruit perception techniques have also been used in other applications of interest, including yield estimation and crop health status monitoring [86]. Perception in unstructured orchard environments, however, is a daunting task as a result of variations in illumination and appearance, noisy backgrounds, and cluttered environments with occlusions [23]. The goal of this work is thus to present a novel deep learning-based detection algorithm to address the aforementioned challenges. I show that the developed algorithm is able to achieve state-of-the-art performance. Before describing the technical details, I review relevant background and state-of-the-art approaches to put our algorithm in better context.

Figure 2.6 Field trials of the robotic apple harvesting system. (a) Image of the platform operating in the orchard environment; (b) Example images of young and well-pruned trees in the first orchard; and (c) Example images of older trees with dense foliage in the second orchard.

2.3.1 Image Sensing Techniques
Vision-based perception schemes can be classified into four categories based on the sensor used: the monocular camera scheme, the binocular stereovision scheme, the laser active visual scheme, and the thermal imaging scheme, covering both two-dimensional and three-dimensional imaging [138]. Specifically, the monocular scheme uses a single camera to acquire image data, and it is widely used in fruit harvesting due to its low cost and the rich information provided by RGB images. For instance, [109] developed an improved YOLOv3 [93] model based on a single camera to detect apples with an accuracy of 85.0%. In [56], the authors proposed a new LedNet model for apple detection that achieves an accuracy of 85.3%. The main disadvantage of the monocular scheme is that the color images are sensitive to fluctuating illumination. Different from the monocular camera schemes, binocular stereovision schemes exploit two cameras separated by a certain distance/angle to obtain two images of the same scene. The point cloud of the fruit can then be constructed through triangulation on extracted features [106]. For instance, [99] used a stereo camera to detect and localize mature apples in tree canopies, and achieved an accuracy of 89.5%. In [122], the authors developed a clustered tomato detection method based on a stereo camera, and the recognition accuracy was 87.9%.
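To make the triangulation step mentioned above concrete, a minimal sketch of depth recovery from a rectified stereo pair is given below. The focal length, baseline, and principal point are made-up example values, not the parameters of any sensor used in this work.

```python
"""Minimal rectified-stereo triangulation sketch."""
import numpy as np

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Z = f * B / d for a rectified stereo pair; invalid (zero) disparity -> inf."""
    d = np.asarray(disparity_px, dtype=float)
    return np.where(d > 0.0, focal_px * baseline_m / d, np.inf)

def back_project(u, v, z, fx, fy, cx, cy):
    """Back-project pixel (u, v) at depth z to 3D camera coordinates."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

# A matched fruit-center feature with a 24-pixel disparity
z = depth_from_disparity(24.0, focal_px=900.0, baseline_m=0.12)   # about 4.5 m
point = back_project(700.0, 380.0, z, fx=900.0, fy=900.0, cx=640.0, cy=360.0)
```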
Although the stereovision scheme tends to render better results, it suffers from high complexity, long computation time, and uncertainties in stereo matching [46]. On the other hand, the laser active visual schemes obtain three-dimensional features using laser scans, where laser beam reflections are exploited to generate a 3D point cloud based on the time-of-flight principle. The 3D point cloud can then be used to reconstruct the scene. For example, [108] utilized infrared laser scanning devices to recognize cherry on the tree. [129] acquired a total of 200 images for independent ‘Fuji’ apples and developed an apple recognition method using the near-infrared linear-array structured light for 3D reconstruction. [113] proposed a point cloud based apple detection method using a LiDAR laser scanner and reached a 88.2% overall accuracy on the defoliated tree dataset [113]. Note the defoliated scene is significantly less challenging than the real orchard conditions during the harvest season. Furthermore, the laser point cloud is generally sparse and it is challenging to be used in real-world orchards with dense backgrounds. The high cost and complexity also limit its practical application in agricultural applications. Finally, the thermal imaging schemes make use of the distinct thermal characteristics of fruit and leaves (e.g., the different temperature distributions) to obtain the visualization of 15 infrared radiation [71]. In [13], citruses are successfully segmented using a thermal infrared camera according to the largest temperature difference in both day and night conditions. An enhanced approach for fruit detection [14] was developed using the combination of the thermal image and the color image. The results showed a promising performance under weak lighting environments. However, in the thermal imaging scheme, the accuracy of recognition is largely affected by the shadow of the tree canopy [104]. Considering the cost, performance, and real-time constraints, our work focuses on the monocular camera scheme, the state-of-art of which will be discussed next. 2.3.2 Recognition Approaches Image-based fruit recognition approaches can be classified into feature analysis approaches and deep learning-based approaches, depending on how features are obtained. In feature anal- ysis approaches, hand-crafted features are first extracted based on the fruit characteristics, and classification approaches are then developed to recognize fruit. [103, 102] developed thresholding methods to classify fruit from other background objects using smoothing filters that remove irrelevant noises. The large segmented regions are then recognized as fruits. This method is capable of segmenting fruit regions in simple backgrounds but it is suscepti- ble to varying lighting conditions and complex canopies. [118, 10] proposed a circular Hough Transform approach to obtain binary edge images and then used a voting matrix to identify fruits. This approach is sensitive to complex structured environments and it generally fails in a dense scene. In [90, 16, 65, 137], they combined the shape and texture of the fruit to obtain a richer set of feature representations. Then, extracted features between fruit and leaves are compared and contrasted to identify the fruits. However, this method is also sensitive to lighting conditions and occlusions. On the other hand, deep learning-based approaches have found great successes in object detection and semantic image segmentation [97, 9]. 
They can learn feature representations automatically without the need for manual feature engineering. Compared to conventional methods, Convolutional Neural Networks (CNNs) have shown great advantages in the field of object detection in recent years. CNNs make it possible to recognize fruits in complex situations due to their deep extraction of high-dimensional object features. R-CNN and its variants Fast R-CNN and Faster R-CNN [42, 41, 94] have enjoyed particular success. Their key idea is to first obtain regions of interest and then perform classification within each region. The region proposal network (RPN) is employed to reduce the high computational cost so that the model can simultaneously predict and classify object boundaries at each location. The parameters of the two networks are shared, which results in much faster inference and is thus well suited for real-time use. Faster Region-Based CNN, proposed by [97], employed transfer learning using ImageNet, and used both early fusion and late fusion to integrate RGB and NIR (near infrared) inputs. Modified Inception-ResNet (MI-ResNet) [91] used deep simulated learning for yield estimation. The model was developed to address challenges including the varying degree of fruit sizes and overlap, natural lighting, and foliage occlusions. The overhead for object detection and localization is reduced by utilizing synthetic data for training, reaching an accuracy of 91% on their fruit dataset. You Only Look Once (YOLO) [93], a representative one-stage object detector, detects fruit over the entire image and classifies the fruit variety, even under uncertain conditions, without the help of an RPN. Specifically, YOLO uses logistic regression to predict an objectness score for each bounding box. Due to its simple optimization pipeline, YOLO enjoys much faster inference than the aforementioned region-based methods. EfficientDet [107], another efficient one-stage detector, exploits a feature pyramid network to enable the detection of targets at different scales.

2.4 Summary of the Chapter
In this chapter, I first introduce the background of fruit harvesting robots. Building on existing systems, I present our own solution, RIVAL's apple harvesting robot, to address fruit harvesting challenges in apple orchards. I present the hardware and software development of the robotic apple harvesting system, which is not only compact in structural design but also effective in utilizing multi-disciplinary advances to enable synergistic harvesting functionalities. The perception strategy design is the basis of the subsequent chapters (Chapters 3, 4, 5, and 6). Additionally, I discuss the related works on perception techniques and approaches in fruit harvesting, including different sensors and vision-based methods. RGB-D cameras have been a popular sensor choice for fruit detection and localization. Although consumer RGB-D cameras are compact and can provide dense image and depth information, the depth measurements by these RGB-D cameras are not robust under varying lighting conditions or when apples are partially occluded by foliage, which results in inaccurate fruit localization and eventually degrades the harvesting performance of the robotic system.

CHAPTER 3 SUPPRESSION MASK R-CNN FOR APPLE DETECTION
In this chapter, I present my methods for collecting a comprehensive apple dataset for two varieties of apples with distinct yellow and red colors under different lighting conditions in a real orchard environment.
A novel suppression Mask R-CNN [23] was developed to robustly detect apples from the dataset. Our developed feature suppression network signifi- cantly reduced false detection by filtering non-apple features learned from the feature learning backbone. Our suppression Mask R-CNN demonstrated superior performance, compared to state-of-the-art models in experimental evaluations. 3.1 Introduction The accurate detection of fruits is of paramount importance in the realm of fruit har- vesting, serving as a foundational element for the efficiency, precision, and overall success of automated harvesting systems [8]. Fruit detection plays a pivotal role in overcoming the inherent challenges [44] associated with the diverse and complex environments within or- chards and fields. Several state-of-the-art deep learning-based apple detection approaches have been developed. In particular, DaSNet [55], a deep convolutional neural network that exploits the techniques of spatial pyramid pooling and gate feature pyramid network, is pro- posed for apple detection. It uses a lightweight residual network as its backbone to achieve improved computational efficiency. Although DaSNet has a decent performance (0.832 F1- score) and a lightweight overhead, the algorithm is only trained and validated on a dataset that contains a single apple variety with good lighting. YOLOv3 [93], another lightweight network that combines Region Proposal Network (RPN) and classification network into a single architecture, is applied in [109] for apple detection. While the network offers a fast detection rate, it has a relatively low F1-score of 0.817. Mask R-CNN [48], a popular ob- ject detection algorithm, is also deployed for apple detection [52]. The Mask R-CNN is a two-stage detector that involves a RPN and a classification network. The former searches the location of region of interest (ROI), whereas the latter predicts the class of ROI and 19 regresses the bounding box of the ROI candidates. The Mask R-CNN is successfully applied to apple detection in [52] with promising performance demonstrated. However, the dataset they use only has one apple variety with good lighting conditions, making the results less compelling. Despite the aforementioned developments, accurate apple perception to support robotic harvesting in real orchard environments remains a great challenge. Existing methods either provide insufficient accuracy [93, 55] or are based on simple structured orchards with little occlusion and stable lighting conditions [28, 12]. In this chapter, I use a comprehensive orchard database that contains multiple apple varieties under various lighting conditions. Further, I develop a novel Suppression Mask R-CNN that has superior performance as compared to the aforementioned approaches. The contributions of this work are summarized as follows: 1. I collect and process a comprehensive orchard dataset with multiple apple varieties under various lighting conditions in real orchard environment. 2. I develop a new deep network, suppression Mask R-CNN, to remove false detections due to occlusion and thus increase the accuracy and robustness of apple detection. 3. Extensive evaluations show that the proposed suppression Mask R-CNN achieves state- of-the-art performance. 3.2 Data Collection and Processing In this study, apple images of ‘Gala’ and ‘Blondee’ varieties were taken in two commercial orchards in Sparta, Michigan, USA during the 2019 harvest season. 
The two apple varieties have distinct color characteristics; 'Gala' apples are red over a yellow background, while 'Blondee' apples have a smooth yellow skin (see Figure 3.1). An RGB camera with a resolution of 1,280x720 was used to take images of apples at a distance of 1 to 2 meters from the tree trunk, which is the typical operating range of harvesting robots [25]. The images were collected across multiple days to cover both cloudy and sunny weather conditions. Within a single day, the data were also collected at different times, including 9:00 am, noon, and 3:00 pm, to cover different lighting angles: front-lighting, back-lighting, side-lighting, and scattered lighting. When capturing images, the camera was placed parallel to the ground and directly facing the trees to mimic the harvesting scenario. A total of 1,500 images were captured; sample images are shown in Figure 3.1. I next processed the collected raw orchard images into formats that can be used to train and evaluate deep networks. Specifically, apples in the images were annotated with rectangles using the VGG Image Annotator [32], and the annotations were then compiled into a human-readable format. Compared to polygon and mask annotations, the rectangular annotation used here accelerates data preparation, particularly for dense images like those in our dataset. The annotated dataset was then split into training, validation, and test subsets with apple counts of 10,530, 4,203, and 4,795, respectively.

Figure 3.1 Six sample images from the collected dataset: (a)-(c) apples on older trees under overcast, back-lighting, and direct lighting conditions, respectively; and (d)-(f) apples on younger trees under overcast, back-lighting, and direct lighting conditions, respectively.

Figure 3.2 Structure of the suppression Mask R-CNN. It consists of a feature learning backbone and a feature suppression end. The feature learning backbone is a deep network to learn apple features, while the feature suppression end, consisting of a weighting component and a shallow ConvNet, is used to filter non-apple regions.

3.3 Suppression Mask R-CNN
This section describes the development of a new deep learning-based apple detection approach that systematically combines a DNN backbone and an RGB feature-based suppression network. As shown in Figure 3.2, the proposed suppression Mask R-CNN consists of two parts: a feature learning backbone from Mask R-CNN [48] and a feature suppression end. The former is used to learn apple features and generate region proposals. In the meantime, due to foliage and branch occlusions, it will also learn foliage and branch features that can cause false detections. As such, I introduce a suppression network to filter non-apple features and improve detection performance by exploiting a combination of clustered features and convolutional features. These two networks are trained separately to avoid generating similar feature maps. I next discuss the two networks in more detail.

3.3.1 Feature Learning Backbone
The feature learning network uses the Mask R-CNN backbone [48] and follows Mask R-CNN's two-stage learning procedure with two modifications.
First, the convolutional backbone in Mask R-CNN is used for feature extraction over an entire image and serves as the network backbone for bounding-box recognition. In this study, I instantiate the feature learning backbone with ResNet-101-FPN [48]. ResNet101 outperforms other single ConvNets mainly because it maintains strong semantic features at various resolution scales. Even though ResNet101 is a deep network, its residual blocks and dropout help it avoid vanishing and exploding gradients. Then, similar to [48], I use a Region Proposal Network (RPN) [94] to generate object regions. The RPN is a small convolutional network which converts feature maps into scored region proposals around where the object lies. These proposals are generated from anchors, a set of predefined bounding boxes with certain heights and widths. The anchors are designed to capture the scale and aspect ratio of specific object classes and are typically determined based on the object sizes in the dataset. In the second stage, the class and box offsets are predicted following Faster R-CNN [94], which applies bounding-box classification and regression in parallel. As shown in Figure 3.2, another network is employed to take the proposed regions from the first stage and assign them to specific areas of a feature map obtained at the second stage. After scanning these areas, the network generates object classes and bounding boxes simultaneously [48]. Second, to improve the recall (true detections) of our algorithm, I introduce a convolutional structure (shown in Figure 3.2) in the class branch to learn additional feature representations. The features condensed from the Mask R-CNN backbone and fully connected layers may have lost considerable apple detail. Since the images in our dataset contain many occlusions, the deep network can treat some partial foliage features as apple features. These additional feature representations enable the identification of certain regions in an image as an occluded apple or as foliage. Furthermore, I freeze the layers in the ResNet101 backbone and train this class branch independently so that it does not learn feature maps that largely overlap with those of the main network.

3.3.2 Feature Suppression End
After the feature learning step, bounding boxes of apple candidates are obtained. The image patches inside the bounding boxes are then fed into a feature suppression end to remove mis-labeled candidates. Since the feature learning backbone may have learned misleading inference features, such as leaves with apple-like shapes, the purpose of this suppression network is to prevent non-apple regions from flowing into the last decision layer. Specifically, the suppression network consists of a weighting component and a shallow ConvNet. The weighting component is a 2x2 grid clustering layer that aims to determine apple regions in terms of apple pixel counts. The motivation is that in our annotated dataset, each apple is annotated in the center of a bounding box and occupies the major area of that bounding box. Even though the canopy often partially occludes the apple, the pixels corresponding to the apple are still in the majority. Based on our observations of the dataset, the four regions (a, b, c, d, as shown in Figure 3.3-(3)) generally contain most of the apple pixels.
Therefore, as shown in Figure 3.3, I divide each bounding box in the training dataset into four regions, a, b, c, d, as a 2x2 grid. The four regions a, b, c, d are located near the top left, top right, bottom left, and bottom right, respectively, with a margin of 5% of the pixels to the box edges. Furthermore, I use K-means clustering [60] to group similar pixels into several clusters. After clustering, I label each pixel with its cluster number i, i = 1, 2, ..., n, with n being the pre-specified number of clusters (in our experiments, I use n = 3). Since the cluster associated with the most pixels corresponds to the apple region, I select the "apple" cluster in each of the four grids and define its pixel counts as N_a, N_b, N_c, and N_d, respectively. I then set the apple-region pixels to 1, whereas all other pixels are assigned to zero. A sample output is shown in Figure 3.3. The weighting component keeps the object information and generates an output containing only apple pixels, which makes it more efficient to train the feature suppression network discussed below. The other merit of the weighting component is that if the previous network recognizes a leaf as an apple, only the leaf pixels are treated as the object and flow to the next ConvNets, which makes it easier for the suppression network to discriminate between apple and non-apple objects.

Figure 3.3 Illustration of the proposed weighting scheme: (1) sliced image inside the bounding box of a detected apple; (2) pixel clustering using K-means with k = 3, where each cluster is shown in one of the three colors; (3) image partitioning into 4 regions and counting the pixel numbers of each cluster in the 4 grids; and (4) apple pixel determination by assigning the pixels corresponding to the cluster with the most pixel counts in the 4 grids as apple pixels.

The second component is a shallow convolutional network that is used to learn apple features based on the filtered patches generated by the weighting component. Compared to the feature learning backbone, there are fewer features to learn in this shallow network. Only three convolution layers (3x3x32, 3x3x32, 3x3x64), associated pooling layers (17x17x32, 7x7x32, 2x2x64), and ReLU activations are used to fit the discrimination function. Two additional dense layers are employed to flatten the feature maps and produce the decision. This network has a total of 45,153 trainable parameters. The detailed architecture is described in Figure 3.2. With the help of the feature suppression end, non-apple candidates are suppressed before the decision layer, and inference time does not increase significantly since the feature suppression end is shallow. The proposed feature suppression end can be viewed as a filter that efficiently reduces false alarms.

3.3.3 Loss Functions
Since I train the feature learning backbone and the suppression network separately, I define two loss functions as follows.
For the feature learning backbone, I use the same loss function as Mask R-CNN [48], which defines a multi-task loss on each sampled region of interest as $L_{backbone} = L_{cls} + L_{box}$, where $L_{cls}$ and $L_{box}$ are, respectively, the classification loss and the bounding box loss defined as:

$$L_{backbone} = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \frac{\lambda}{N_{box}} \sum_i p_i^* \cdot L_{box}(t_i, t_i^*), \qquad (3.1)$$

$$L_{cls}(p_i, p_i^*) = -p_i^* \log p_i - (1 - p_i^*) \log(1 - p_i), \qquad (3.2)$$

where $p_i$ and $p_i^*$ are, respectively, the predicted probability and the ground truth of anchor $i$; $t_i$ and $t_i^*$ are, respectively, the predicted coordinates and the ground-truth coordinates; $N_{cls}$ and $N_{box}$ are normalization terms given by the batch size and the number of anchor locations; the loss function $L_{box}$ is the smooth-L1 function [41]; and $\lambda$ is a parameter that controls the balance between the classification loss and the bounding box loss [117]. In our network, I use $\lambda = 1$ as I assign equal weights to the two losses. For the feature suppression end, I define $L_{end}$ as the average binary cross-entropy loss. For a patch with an associated ground-truth class, $L_{end}$ is defined as:

$$L_{end} = -[y \log \hat{y} + (1 - y) \log(1 - \hat{y})], \qquad (3.3)$$

where $y$ is the ground truth and $\hat{y}$ is the prediction.

3.4 Experimental Results
3.4.1 Implementation
In this section, I evaluate the efficacy of the suppression Mask R-CNN with the processed data discussed in Section 3.2. The network hyper-parameters, including the momentum, learning rate, decay factor, training steps, and batch size, are set to 0.9, 0.001, 0.0005, 934, and 1, respectively, through cross-validation. The input image size is 1,280x720, which is aligned with the camera resolution. To better analyze the training process, I set up 100 epochs for training. I exploit a model pre-trained on the COCO dataset [67] to warm-start the training process, and it generally needs only 50 epochs to converge. A detection example is shown in Figure 3.4, where green boxes represent correctly identified apples and red boxes represent missed detections.

Figure 3.4 An example of Gala apple detection using our suppression Mask R-CNN. It shows that the majority of apples are detected (green bounding boxes) but there are still 3 apples missed (red bounding boxes) due to heavy occlusion.

To quantitatively evaluate the detection performance, I use performance metrics including precision, recall and F1-score. All detection outcomes are divided into four types: true positive (TP), false positive (FP), true negative (TN), and false negative (FN), based on the relation between the true class and the predicted class. Precision (P) and recall (R) are then defined as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad (3.4)$$

and the F1-score is defined based on precision and recall as:

$$F1 = \frac{2 \cdot P \cdot R}{P + R}. \qquad (3.5)$$

Note that the suppression network offers a tradeoff between recall and precision; that is, aggressive suppression will lead to higher precision but a lower recall rate. This tradeoff can be controlled by adjusting two confidence thresholds: th1 in the class branch network and th2 in the feature suppression end. I then tune both confidence thresholds during the inference process to obtain the best recall and precision of the entire model.

Figure 3.5 The Pareto plot of recall-precision on different combinations of th1 and th2. The Pareto front is shown in blue solid lines and the two configurations used to compare with the state-of-the-art networks (see Table 3.1) are shown in red stars.
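A minimal sketch of how such a two-threshold sweep could be organized is given below; `run_detector` is a hypothetical stand-in for suppression Mask R-CNN inference with class-branch threshold th1 and suppression-end threshold th2, assumed to return detection counts against the ground-truth annotations.

```python
"""Hypothetical sketch of the two-threshold sweep behind Figure 3.5."""
from itertools import product

def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def threshold_sweep(run_detector, images, grid=None):
    grid = grid or [i / 10 for i in range(1, 10)]
    results = []
    for th1, th2 in product(grid, grid):
        tp, fp, fn = run_detector(images, th1, th2)   # counts vs. ground truth
        p, r, f1 = precision_recall_f1(tp, fp, fn)
        results.append({"th1": th1, "th2": th2, "P": p, "R": r, "F1": f1})
    # keep only the Pareto-optimal (recall, precision) operating points
    pareto = [a for a in results
              if not any(b["P"] >= a["P"] and b["R"] >= a["R"]
                         and (b["P"] > a["P"] or b["R"] > a["R"]) for b in results)]
    return results, pareto
```

Each (th1, th2) pair then corresponds to one point in the recall-precision plane, and the non-dominated points trace the Pareto front.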
Figure 3.5 shows the Pareto plot, where each point represents the performance of a combination of th1 and th2. From the Pareto front (blue solid lines) in Figure 3.5, I choose two "best" configurations, C1 and C2, among which C1 achieves a better F1-score of 0.905 whereas C2 achieves a better recall rate of 0.939. With configuration C1, precision increases by 10% and recall by 0.4%, whereas a 1.6% increase in precision and a 1.3% increase in recall are achieved with configuration C2. These results demonstrate that in both cases our integrated class branch and suppression end improve true detections and reduce false fruit detections.

3.4.2 Comparison with the state of the art
In order to fully evaluate the performance of the proposed approach, I compare it with state-of-the-art apple detection algorithms on our comprehensive image dataset. The algorithms that I compare with include YOLOv3 [109], DaSNet [55], Faster R-CNN [116], and Mask R-CNN [52]. These approaches are trained and evaluated on the same training and test data. For Mask R-CNN, I consider two configurations: a ResNet101 backbone and a ResNet152 backbone. The recall-precision curves of these approaches are shown in Figure 3.6. Furthermore, the precision, recall and F1-score are shown in Table 3.1. It can be seen in Figure 3.6 and Table 3.1 that the proposed Suppression Mask R-CNN has superior performance compared to the existing approaches.

Table 3.1 Performance comparison between the state-of-the-art networks and our proposed Suppression Mask R-CNN with two parameter configurations (C1 and C2).

Model                          Precision  Recall  F1-score
YOLOv3                         0.703      0.860   0.773
DaSNet                         0.693      0.821   0.751
Faster R-CNN                   0.761      0.889   0.820
Mask R-CNN (ResNet101)         0.789      0.927   0.852
Mask R-CNN (ResNet152)         0.798      0.928   0.858
Suppression Mask R-CNN (C1)    0.880      0.931   0.905
Suppression Mask R-CNN (C2)    0.801      0.939   0.864

3.4.3 Evaluation on different apple varieties and lighting conditions
In addition, I also evaluate my model on different sub-datasets. Specifically, I separate the whole dataset into several sub-datasets based on apple variety and lighting conditions. The evaluations are summarized in Table 3.2 and example results are shown in Figure 3.7. The results show that my model has a better performance for Blondee apples than for Gala. Compared to back lighting conditions, my model reaches a higher precision under overcast or direct lighting conditions, which indicates that artificial lighting may be helpful for further improving the performance; this will be investigated in our future work.

Table 3.2 Performance evaluation on subsets of the data with different apple varieties as well as different lighting conditions. Similar performance is obtained for Gala and Blondee apples, while back lighting slightly decreases the performance.

Category            Subset           Number  Precision  Recall  F1-score
Dataset             Gala             3,357   .87        .93     .90
Dataset             Blondee          1,438   .89        .93     .91
Lighting Condition  Overcast         3,356   .89        .93     .91
Lighting Condition  Direct Lighting  959     .89        .93     .91
Lighting Condition  Back Lighting    480     .84        .93     .88
                    Total            4,795   .88        .93     .91

Figure 3.6 The plot of recall-precision curves for the different approaches. Our proposed Suppression Mask R-CNN (in both configurations) outperforms the state-of-the-art algorithms.
Figure 3.7 Detection results on different apple varieties under various lighting conditions: (a)-(c) detection on Gala apples under overcast, back lighting, and direct lighting conditions, respectively; and (d)-(f) detection on Blondee apples under overcast, back lighting, and direct lighting conditions, respectively.

3.5 Summary of the Chapter

In this chapter, I collected a comprehensive apple dataset for two varieties of apples with distinct yellow and red colors under different lighting conditions from the real orchard environment. A novel suppression Mask R-CNN was developed to robustly detect apples from the dataset. Our developed feature suppression network significantly reduced false detections by filtering non-apple features learned from the feature learning backbone. Our suppression Mask R-CNN demonstrated superior performance compared to state-of-the-art models in experimental evaluations.

CHAPTER 4
O2RNET: OCCLUDER-OCCLUDEE RELATIONAL NETWORK FOR CLUSTERED APPLE DETECTION

In this chapter, I discuss the challenges associated with the detection of clustered apples, a task that demands heightened precision and adaptability due to the complex spatial arrangements of fruit clusters within orchards. I propose a novel solution, the Occluder-Occludee Relational Network (O2RNet) [24], designed to address the specific challenge posed by occlusion and cluster proximity. Subsequently, I provide a comprehensive evaluation of O2RNet's performance, assessing its efficacy in real orchard environments and under varying conditions.

4.1 Introduction

In the previous chapter, I discussed detection methods for improving apple detection precision, and a number of research works have been developing deep learning-based approaches [109, 120, 75, 115]. However, the aforementioned deep CNN approaches do not address the challenge of overlapping/clustered fruits in real-world orchards. Towards that end, the Compositional Convolutional Neural Network (CompNet) [59] was proposed to detect partially occluded objects. The framework exploits a differentiable fully compositional model that uses occluder kernels to localize occluders (the occluding objects). The Bilayer Convolutional Network (BCNet) [57], another model addressing the occlusion challenge, applies two Graph Convolutional Network (GCN) layers to separately infer the occluding objects (occluder) and the partially occluded instance (occludee). Superior performance was reported in occluded scenarios. In apple detection, various approaches have been developed to enhance the performance of deep learning-based models in complex orchards. [47] introduced a CBL (Convolutional layers, Batch normalization, Leaky-ReLU activation function [31]) module and a CA (coordinate attention) module into YOLOv5 [54], and increased the precision by 4.41% compared to the base model. [124] utilized a customized YOLOv3 to reach a recall of 93.4% for overlapped apples. These two approaches try to extract higher-level features by modifying the models to improve the performance. Differently from the above, [36] took advantage of a depth filter to remove background trees with an RGB-D camera and improved apple detection precision by 2.5% on overlapped apples.
This paper will model the relationships among overlapped apples and enhance the apple edge features to improve the precision for clustered apples. In this chapter, I develop a novel Occluder-Occludee Relational Network (O2RNet) to enhance apple detection in the presence of occlusions in clustered apples that are frequently present in real-world orchards. Specifically, I employ ResNet [49] and RPN [94] to extract features of targets and utilize occluder-occludee layers to split candidates into occluder and occludee. Compared to other occlusion models, I only use bounding boxes as labels instead of pixel-level masks that contain more texture and shape information. In addition, I present a new apple dataset1 collected in two Michigan apple orchards in multiple harvesting seasons. I evaluate the performance against state-of-the-art object detection models and demonstrate superior performances. The contributions of this work are highlighted as follows: 1. A comprehensive apple dataset of 1246 images for two varieties of apple under different lighting conditions and occlusion levels were collected from two orchards during two harvesting seasons. 2. A novel Occluder-Occludee Relational Network (O2RNet) was developed for enhanced apple detection in the presence of occlusions due to apple clusters. 3. The O2RNet outperformed 12 state-of-the-art deep learning-based models for apple detection. 4.2 Data Augmentation After the same method for data collection and preparation, I apply data augmentation on our dataset. Data augmentation is a method that can be adopted to increase data diversity for achieving robust training and enhanced performance of computer vision models. For example, transformations and rotations are frequently employed to increase the number 1The database is open-sourced at https://github.com/pengyuchu/MSUAppleDatasetv2.git. 34 of images from a single source. It has been shown to be a powerful tool in agriculture applications [120, 105, 29] as it generates additional data from existing orchard data. This is especially useful for applications with a limited dataset by detecting anomalies in images with different transformations and making it possible to generate new training examples without actually acquiring new data. Specifically, in the considered application of apple detection in orchards, the collected dataset can only cover a limited set of scenarios. Therefore, I applied several data augmen- tation techniques [22] on the collected and processed data to enhance the data diversity for improving the inference performance of my models. Specifically, besides geometric trans- formations including scaling, translating, rotating, reflecting, and shearing, I also applied color space augmentations such as modifying the brightness and contrast to fit different intensities. In addition, I injected Gaussian noises on the collected images by randomly modifying the pixel intensities based on a Gaussian distribution. Furthermore, I applied Mixup by randomly selecting two images from the dataset and blending the intensities of the corresponding voxels of the two images [74]. Filtering is another augmentation approach I applied where I modify the intensities of each pixel using convolution [98]. Specifically, I exploited sharpening [98] to detect and intensify the edges of objects found within the image. I applied these additional augmentation techniques on our dataset and the benefits of data augmentation will be demonstrated in the experiment section. 
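As a rough illustration of these augmentation operations (not the exact pipeline of [22] used in this work), the snippet below sketches the color-space shift, Gaussian noise injection, Mixup blending, and a simple geometric flip on images stored as float arrays in [0, 1]. In practice, the bounding-box annotations must be transformed consistently with any geometric operation.

import numpy as np

rng = np.random.default_rng(0)

def adjust_brightness_contrast(img, brightness=0.0, contrast=1.0):
    # Color-space augmentation on a float image in [0, 1].
    return np.clip(contrast * img + brightness, 0.0, 1.0)

def add_gaussian_noise(img, sigma=0.02):
    # Inject zero-mean Gaussian noise into the pixel intensities.
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def mixup(img_a, img_b, alpha=0.2):
    # Blend two images (and, in practice, their labels) with a Beta-sampled weight.
    lam = rng.beta(alpha, alpha)
    return lam * img_a + (1.0 - lam) * img_b, lam

def horizontal_flip(img):
    # A basic geometric transformation; boxes must be mirrored accordingly.
    return img[:, ::-1, :].copy()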
4.3 O2RNet for Apple Detection In this section, I first present the key challenges of object detection in cluttered envi- ronments and an overview of the general object detection framework. Based on those, I de- scribe the proposed Occluder-Occludee Relational Network (O2RNet) with explicit occluder- occludee relation modeling. Finally, I specify the objective functions for the entire network optimization, followed by details on the training and inference processes. 35 4.3.1 Challenge and Main Idea Figure 4.1 Eight sample images from the collected dataset show cascaded apples in different occlusion levels: (a)-(d) apples are in the normal occlusion and can be identified in most models; (e)-(h) apples are highly cascaded and usually detected as one apple. For images with heavy occlusions, multiple overlapping objects captured in the same bounding box can result in confusing object outlines from both front objects and occlusion boundaries. In apple orchards, the apple clusters are very common (see Figure 4.1 for a few examples). However, the prediction head design of Faster R-CNN directly regresses the occludee with a fully convolutional network, which neglects both the occluding instances and the overlapping relations between objects. With this limitation, Faster R-CNNs will in- evitably omit some occludes due to Non-maximum Suppression (NMS). On the other hand, with a properly tuned threshold, the RPN can propose many candidates after feeding the target features from CNN (see Figure 4.2), but the NMS will suppress the nearby bounding boxes and neglect occludees. Motivated by this observation, the proposed O2RNet aims at extending the existing two-stage object detection methods by adding an occlusion perception branch parallel to the original object prediction pipeline. By explicitly modeling the rela- 36 (a)(b)(c)(d)(e)(f)(g)(h) tionship between occluder and occludee, the interactions between objects within the Region of Interest (RoI) region can be well incorporated during the bounding box regression stage. Figure 4.2 Illustration of how RPN works: The RPN selects anchor points on the feature map and generates anchor boxes for each anchor point. The anchor boxes are generated based on two parameters — scales and aspect ratios. 4.3.2 O2RNet Workflow In this subsection, I describe our proposed O2RNet. As illustrated in Figure 4.3, the O2RNet follows the two-stage architecture used in Faster R-CNN [94] and consists of three main parts. First, I use a Residual Network (ResNet) [49] as the backbone for feature learning/extraction over the entire image. Specifically, I instantiate ResNet-101-FPN [48] as its backbone for feature extraction, as it outperforms other single ConvNets mainly due to its capability of maintaining strong semantic features at various resolution scales. Even though ResNet101 is a deep network, the residual blocks and dropouts function help it avoid gradient vanishing and exploding problems. Second, I employ an RPN [94] to generate object regions, which is a small convolutional network to convert feature maps into scored region proposals around where the object lies. The generated proposals with a certain height and width are 37 anchor pointfeature mapanchor boxes Figure 4.3 Network structure of the proposed Occluder-Occludee Relational Network (O2RNet). It consists of a feature learning backbone, RoI feature extraction, and object detection heads with occluder and occludee branches. 
The Feature Expansion Structure (FES) provides expanded RoI features along with features from the occluder branch to fa- cilitate the detection of occludee. called anchors, which are a set of predefined bounding boxes. The anchors are designed to capture the scale and aspect ratio of specific object classes and are typically chosen to be consistent with object sizes in the dataset. RPN is mainly used for predicting bounding boxes in Faster R-CNN but it can also provide enough anchors with different scales that will be exploited in our network as explained in the sequel. Third, I build an occlusion-aware modeling head with a structure of two classification and regression branches for occluder and occludee for decoupling overlapping relations and segments the instance proposals obtained from the RPN. Compared to the traditional class-agnostic classification, I divide this task into two complementary tasks: occluder prediction using the original classification head and occludee modeling with an additional Feature Expansion Structure (FES), where the occluder predictions provide rich foreground cues like textures and the FES predicts the positions of occluding regions to guide occludee object regression. More specifically, an input image is first processed by the ResNet backbone to extract intermediate convolutional features for downstream processing. The object detection head (i.e., RPN) then predicts bounding box proposals, which are then consumed by the occlusion perception branches into the occluder branch and the occluee branch. For the occluder 38 kOccluder BranchOccludee Branch...+bboxclass+bboxclassBackboneRegion Proposal NetworkFeature ExpansionROI FeatureFeature MapConvFCNConvFCNbboxclass branch, I adopt the object detection head in Faster R-CNN [94] to output positions as well as categories for instance candidates and prepare the cropped RoI features for the occludee branch. In the occludee branch, the input consists of both cropped RoI features from the occluder branch and expanded features from FES, which is targeted for modeling occluded regions by jointly detecting boundaries. Essentially, the distilled occlusion features are added to the original input RoI features and passed to the next module. Finally, the occludee branch, which has a similar structure to the occluder branch, predicts the occludee guided by these expanded features and outputs classes and bounding boxes for the partially occluded instances. I next describe the occluder-occludee relational modeling in more details. 4.3.3 Occluder-Occludee Relationship Modeling For highly-overlapped apples, in typical Faster-RCNN-based models, the generated re- gion proposals corresponding to the partially occluded ones may be separated into disjoint subregions by the occluder. As such, I employ the FES to obtain boundary features from the occludee, where expansion in each direction extends the potential proposals for the occludee. In our implementation, I expand t steps in k (k = 8 in this study) directions from the original RoI proposals, and the expanded RoI proposals will contain additional boundary features. The rationale is that irregular occlusion boundaries unrelated to the occludee can cause con- fusion to the network, which in turn provides essential cues for decoupling occludees from occluders. 
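As a rough sketch of this expansion step (the direction set and pixel stride below are illustrative assumptions rather than the exact values used in the implementation), one way to generate the k = 8 expanded proposals from an RoI box is:

def expand_roi(box, t=1, stride=8, img_w=1280, img_h=720):
    # box = (x1, y1, x2, y2); each of the 8 compass directions grows the box
    # outward by t * stride pixels, so the expanded proposal carries extra
    # boundary context for the occludee branch.
    dirs = [(-1, 0), (1, 0), (0, -1), (0, 1),
            (-1, -1), (-1, 1), (1, -1), (1, 1)]
    x1, y1, x2, y2 = box
    proposals = []
    for dx, dy in dirs:
        ox, oy = dx * t * stride, dy * t * stride
        nx1 = max(0, min(x1, x1 + ox))
        ny1 = max(0, min(y1, y1 + oy))
        nx2 = min(img_w, max(x2, x2 + ox))
        ny2 = min(img_h, max(y2, y2 + oy))
        proposals.append((nx1, ny1, nx2, ny2))
    return proposals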
Therefore, I explicitly model occlusion patterns by detecting bounding boxes of the occluders using the occluder detection branch; since the occludee detection branch jointly predicts bounding boxes for the occludee, the overlap between the two layers can be directly identified as the occlusion boundary and can thus be distinguished from the real object bounding boxes. To reach this goal, the occluder modeling module is designed as a simple 3 × 3 convolutional layer followed by one FCN layer, the output of which is fed to an up-sampling layer and one 1 × 1 convolutional layer to obtain a one-channel feature map for the occludee branch.

This O2RNet is particularly adept at discerning edge features on apples, a crucial step in differentiating between occluders and occludees. As illustrated in Figure 4.4, O2RNet's capability becomes evident when dealing with a cluster of apples. Here, the algorithm identifies an occludee and subsequently refines the bounding boxes for both the occluder and the occludee in a clustered apple. This refinement is pivotal in accurately representing the spatial relationships and physical boundaries of each apple in the cluster. In contrast, when the algorithm encounters an individual apple, devoid of any overlapping or obscuring elements, it classifies the apple solely as an occluder. In this scenario, the bounding box remains unaltered, as there is no need for refinement in the absence of an occludee. This dual functionality of O2RNet underscores its versatility and precision in handling varied scenarios within object detection tasks.

Figure 4.4 How O2RNet works on clustered apples and an individual apple. The first steps in the occluder branch are the same as in the baseline model (Faster R-CNN), which produces similar feature maps and does not isolate clustered apples. In the occludee branch, O2RNet learns occludee features and successfully splits the input into an occluder and an occludee.

4.3.4 End-to-end Learning

As I have two separate detection heads in the occluder and the occludee branches, I define two loss functions in the following way. For the occluder branch, I adopt the loss function used in Faster R-CNN [94], which defines a multi-task loss on each sampled region of interest as
$$L_{Occluder} = L_{cls} + L_{bbox}, \tag{4.1}$$
where $L_{cls}$ and $L_{bbox}$ are, respectively, the classification loss and bounding box loss defined in Faster R-CNN [94]. The final loss $L$ is a weighted sum of the loss from the occluder branch and the loss from the occludee branch, defined as:
$$L = \lambda_1 L_{Occluder} + \lambda_2 L_{Occludee}. \tag{4.2}$$
Here $L_{Occludee}$ is the occludee branch loss, which is the sum of the $k$ expanded proposal losses, i.e.,
$$L_{Occludee} = \sum_{i=0}^{k} \left(L_{cls}^i + L_{bbox}^i\right). \tag{4.3}$$
Here $\lambda_1$ and $\lambda_2$ are two positive linear weights with $\lambda_1 + \lambda_2 = 1$, which are tuned to balance the two loss functions. In my study, $\lambda_1$ was set to {1.0, 0.75, 0.5, 0.25, 0} in different trials for cross-validation.

4.3.5 Training and Inference

During the training process, I filter out part of the non-occluded RoI proposals so that occlusion cases take up 50% of the samples for balanced sampling. SGD with momentum is employed to train the model for 60K iterations, starting with 1K constant warm-up iterations. The batch size is set to 2 and the initial learning rate is 0.01 with a weight decay of 0.95.
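To make the weighting in (4.1)-(4.3) concrete, the following is a minimal sketch of how the two branch losses could be combined during training; the dictionary structure of the branch outputs is an illustrative assumption rather than the exact interface of the O2RNet heads.

def combined_loss(occluder_out, occludee_outs, lam1=0.75, lam2=0.25):
    # occluder_out: dict with 'cls_loss' and 'bbox_loss' tensors from the occluder head.
    # occludee_outs: list of per-expanded-proposal dicts with the same keys.
    # lam1 + lam2 should equal 1, as required for the weights in Eq. (4.2).
    l_occluder = occluder_out["cls_loss"] + occluder_out["bbox_loss"]          # Eq. (4.1)
    l_occludee = sum(o["cls_loss"] + o["bbox_loss"] for o in occludee_outs)    # Eq. (4.3)
    return lam1 * l_occluder + lam2 * l_occludee                               # Eq. (4.2)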
In my study, ResNet-101-FPN is used as the backbone and the input images are resized without changing the aspect ratio, i.e., the longer side is kept at no more than 1200 pixels. For inference, the occludee branch predicts bounding boxes for the occluded target object in the high-score box proposals generated by the RPN, while the occluder branch produces occlusion-aware features as input for the occludee branch. The one with the highest score is then chosen as the output.

Table 4.1 Performance of O2RNet on the customized apple dataset. The step t comes from the FES and represents how far the features are expanded. The evaluation uses AP, AR, and F1-score at different IoUs.

Step    AP      AP50    AP75    AR      AR50    AR75    F1-Score
t=1     0.511   0.945   0.935   0.351   0.938   0.803   0.864
t=2     0.490   0.920   0.900   0.330   0.900   0.770   0.820
t=3     0.490   0.920   0.904   0.328   0.900   0.770   0.820

Table 4.2 Numbers of model parameters and inference speed (FPS) for the state-of-the-art networks and our proposed Occluder-Occludee Relational Network (O2RNet). "M" stands for million.

Models                      Parameters   FPS
FCOS                        2.0M         14
YOLOv4                      0.6M         63
Faster R-CNN (ResNet50)     2.0M         19
Faster R-CNN (ResNet101)    3.6M         10
EfficientDet-b0             0.1M         48
EfficientDet-b1             0.3M         45
EfficientDet-b2             1.2M         25
EfficientDet-b3             1.6M         24
EfficientDet-b4             2.4M         16
EfficientDet-b5             3.6M          8
CompNet via BBV             0.8M         18
CompNet via RPN             1.4M         15
O2RNet (ResNet50)           2.0M         18
O2RNet (ResNet101)          3.6M         10

4.4 Experimental Results

4.4.1 Performance Metrics

For model development and evaluation, the apple dataset is conventionally partitioned at random into training, validation, and test sets. To quantitatively evaluate the detection performance, I use performance metrics including precision, recall, and F1-score for algorithm evaluation. All detection outcomes are divided into four types: true positive (TP), false positive (FP), true negative (TN), and false negative (FN), based on the relation between the true class and predicted class. The precision (P) and recall (R) are defined as follows:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}. \tag{4.4}$$
The F1-score is then defined as:
$$F1 = \frac{2 \cdot P \cdot R}{P + R}. \tag{4.5}$$
To better evaluate the precision between the prediction and the ground truth, I also employ the Microsoft Common Objects in Context (COCO) dataset [67] evaluation metrics. Specifically, after the calculation of precision and recall, I calculate the average precision (AP) and average recall (AR) based on different Intersection over Union (IoU) thresholds between the prediction and the ground truth. For example, AP at IoU = 0.50 (denoted AP50) corresponds to the PASCAL VOC metric [95]. I also use AP at IoU = 0.75 (denoted AP75), which is a stricter metric for model evaluation. In my study, I use a spectrum of 10 IoU thresholds ranging over 0.50:0.05:0.95 and average over these multiple IoUs to obtain a comprehensive set of results.

4.4.2 Experimental Setup

In this section, I evaluate the efficacy of the proposed O2RNet on the processed data as discussed in Section 3.2. The network hyper-parameters, including the momentum, learning rate, decay factor, training steps, and batch size, are set as 0.9, 0.001, 0.0005, 934, and 1, respectively, through cross-validation. The input image size is 1280 × 720, which is aligned with the resolution of the camera used in our data collection. To better analyze the training process, I set up 80 epochs for training.
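For reference, the sketch below shows one way to compute the IoU between boxes and a simplified average of precision/recall over the 0.50:0.05:0.95 threshold sweep described in Section 4.4.1. It is a greedy matcher for illustration only; it averages metrics at a single operating point rather than integrating full precision-recall curves as the official COCO evaluation does.

import numpy as np

def iou(a, b):
    # IoU of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall_at(dets, gts, thr):
    # Greedily match detections (sorted by score) to unmatched ground truths.
    matched, tp = set(), 0
    for d in sorted(dets, key=lambda x: -x["score"]):
        best, best_iou = None, thr
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(d["box"], g)
            if v >= best_iou:
                best, best_iou = j, v
        if best is not None:
            matched.add(best)
            tp += 1
    fp, fn = len(dets) - tp, len(gts) - tp
    return tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)

def averaged_metrics(dets, gts, thrs=np.arange(0.50, 0.951, 0.05)):
    # Average precision/recall over the COCO-style IoU threshold sweep.
    prs = [precision_recall_at(dets, gts, t) for t in thrs]
    return float(np.mean([p for p, _ in prs])), float(np.mean([r for _, r in prs]))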
I exploit a pre-trained model on the COCO dataset [67], where I train on 2017train (115k images) and evaluate results on both 2017val and 2017test-dev to pre-train model parameters. This pre-trained model generally only takes 50 epochs to converge. By tuning the steps t in FES, different results are obtained and listed in Table 4.1, which shows that O2RNet with t = 1 leads to the best performance. 4.4.3 Performance Comparison and Analysis To accelerate the model training on our customized dataset, I initialize parameters by transfer learning from ImageNet [26]. ImageNet provides large-scale images in different fields (including apples) and large-scale ground truth annotation. During the transfer learning process, my model learns specific characteristics with an effective transfer of features from ImageNet. Compared to randomized parameters, the results (see Figure 4.5) shows that my model converges faster as benefited from the pretraining on a large-scale database. 43 Table 4.3 Performance of O2RNet on the augmented dataset. The geometric transformations consist of rotation, flipping and scaling. The color space transformations consist of brightness and contrast shifting. Finally, all of the augmentation methods are integrated to evaluate the O2RNet. Augmentation AP AP50 AP75 AR AR50 AR75 F1-Score Base Geometric transformations (GTs) Color space transformations (CSTs) Gausian noise Mixup Sharpening GTs+CSTs+Mixup All 0.92 0.93 0.93 0.91 0.93 0.92 0.91 0.90 0.51 0.91 0.91 0.52 0.91 0.91 0.52 0.91 0.90 0.48 0.92 0.92 0.52 0.91 0.90 0.52 0.52 0.96 0.94 0.36 0.94 0.92 0.92 0.52 0.35 0.35 0.35 0.34 0.35 0.35 0.94 0.36 0.80 0.80 0.81 0.80 0.81 0.80 0.83 0.83 0.84 0.85 0.85 0.83 0.85 0.84 0.88 0.86 Table 4.4 Performance comparison of our own models and other 12 state-of-the-art deep learning models on the customized apple dataset. Models FCOS [1] YOLOv4 [84] Faster R-CNN EfficientDet ResNet50 [82] ResNet101 [82] EfficientDet-b0 [83] EfficientDet-b1 [83] EfficientDet-b2 [83] EfficientDet-b3 [83] EfficientDet-b4 [83] EfficientDet-b5 [83] CompNet O2RNet CompNet via BBV [127] CompNet via RPN [34] O2RNet-ResNet50 O2RNet-ResNet101 AP AP50 AP75 AR AR50 AR75 F1-score 0.48 0.45 0.48 0.49 0.45 0.45 0.46 0.49 0.50 0.50 0.50 0.51 0.89 0.87 0.89 0.94 0.89 0.89 0.89 0.93 0.94 0.95 0.94 0.95 0.87 0.84 0.87 0.93 0.85 0.86 0.87 0.91 0.92 0.93 0.92 0.94 0.34 0.29 0.32 0.31 0.30 0.30 0.30 0.32 0.34 0.34 0.36 0.35 0.87 0.84 0.87 0.84 0.82 0.82 0.82 0.84 0.88 0.88 0.94 0.94 0.78 0.73 0.78 0.75 0.71 0.72 0.73 0.75 0.78 0.78 0.80 0.80 0.80 0.76 0.81 0.82 0.77 0.77 0.78 0.81 0.82 0.83 0.85 0.86 0.93 0.50 0.91 0.91 0.52 0.96 0.94 0.36 0.94 0.35 0.80 0.83 0.84 0.88 Furthermore, data augmentation is another useful technique to optimize detection per- formance without increasing inference complexity. I applied five augmentation strategies, including geometric transformations (GTs), color space transformations (CSTs), Gaussian noise injection, mixup and sharpening data augmentation, to extend our dataset. The re- sults are summarized in Table 4.3. It shows that GTs such as rotation, flipping and scaling – by changing the pixel position of the image and reordering apples in the image – improve the accuracy performance by around 1%. Through changing color illumination and intensity of an image, CSTs also roughly increases the performance by 1%. Due to the sparsity of 44 Figure 4.5 Training loss comparison between transfer learning and training from scratch on my model (O2RNet). 
The training loss with transfer learning from ImageNet apparently decreases and converges faster as compared with training from scratch. apples on some images, mixup helps enlarge apple density on the image and enhances the accuracy by 2%. It turns out that Gausian noise and sharpening do not help much, as they try to change textures and increase complexities on the dataset, which generate confusing data and is not suitable for my model. Finally, the augmentation combination of GTs, CSTs and Mixup offers the best enhancement by increasing the accuracy of 4% on our dataset. To better evaluate the performance of my model, I compare our O2RNet with the-state- of-art object detection methods on our customized apple dataset (see Table 4.2 for a list of benchmark models and their number of parameters). In particular, FCOS and YOLOv4 are representatives of one-stage detectors, achieving consistent improvement and demonstrating their effectiveness by outperforming the SSD method [69] on several public datasets [110, 11]. I also evaluate Faster R-CNN and EfficientDet since they are state-of-the-art models with promising performance demonstrated in fruit harvesting-related works [78, 125]. I also 45 020406080Epoch0.250.500.751.001.251.501.752.00LossTransfer learningFrom scratch compare O2RNet with the state-of-the-art occlusion-aware network CompNet [34]. I then use the same experimental setup to train each model and evaluate them on the same apple test dataset. The results are shown in Table 4.4, which compares the detection precision and recall over different IoUs among the 14 selected models (including our O2RNet). Notably, in addition to FCOS, EfficientDet-b5 and Faster R-CNN achieved decent F1-scores of 0.83 and 0.82, respectively. Two occlusion-aware networks, CompNet and our O2RNet clearly outperform all traditional models with F1-scores of 0.86 and 0.88, respectively, and O2RNet clearly shows superior performance over CompNet. Some representative inference results are shown in Figure 4.6. It can be seen that our O2RNet can effectively separate clustered apples and thereby improves the precision and recall and subsequently the F1-score. Figure 4.6 Results from six models on the various lighting conditions and occlusions. 4.5 Summary of the Chapter I collected a comprehensive apple dataset under different lighting conditions and at vari- ous occlusion levels from two real orchards. A novel Occluder-Occludee Relational Network (O2RNet) was developed to robustly detect clustered apples from the dataset. Our de- 46 FCOSYOLOv4Faster R-CNNEfficientDet-b5CompNetO2RNetGround TruthA.1A.2A.3A.4A.5A.6AB.1B.2B.3B.4B.5B.6BC.1C.2C.3C.4C.5C.6C veloped O2RNet significantly reduced false detection and improved the detection rate by embedding relationships between the occluder and the occludee. State-of-art performance was demonstrated in comprehensive experiments. I also found that transfer learning and data augmentation techniques were useful tools to enhance learning efficiency and model performance. 47 CHAPTER 5 ALACS: ACTIVE LASER-CAMERA SCANNING FOR 3D APPLE LOCALIZATION In this chapter, I review several consumer RGB-D cameras and depict our proposed localization technique, called Active Laser-Camera Scanner (ALACS). I propose a feature- matching algorithm, called Laser Line Extraction (LLE), to help ALACS transform the 2D fruit positions to 3D fruit positions, thus achieving accurate mapping even under complex fruit morphology, variable lighting conditions and occlusions. 
5.1 Introduction Three-dimensional localization is the other crucial aspect of fruit perception. Accurate and reliable localization of objects such as fruits in the 3D space is essential for automated agricultural systems to optimize crop management and harvesting strategies, ultimately improving productivity, efficiency, and sustainability [40]. Several types of 3D sensors are currently available, including Time-of-Flight (ToF) cameras, LiDAR (Light Detection and Ranging), stereo-vision cameras, and structure light systems. Specifically, ToF cameras [61] measure depth by emitting a light signal (usually IR) and measuring the time it takes for the signal to bounce back to the sensor. This time difference allows the camera to calculate the distance to objects in the scene. ToF cameras, such as the PMD CamBoard pico [5] and Microsoft Kinectv2 [80], are known for their accuracy and high-resolution depth data. LiDAR systems [92] use lasers to send out light pulses and measure the time it takes for the pulses to return after reflecting off objects. This time difference is used to calculate distances and generate a 3D point cloud of the scene. LiDAR systems can provide accurate depth data but are often more expensive than other methods and often have limited spatial resolution. Stereo vision systems, on the other hand, use two cameras, typically placed side-by-side at a known distance apart, to capture images of the same scene. By comparing the images and identifying corresponding points in each image, depth information can be estimated using triangulation. Stereo vision systems [62] often require more computational power and 48 can be sensitive to lighting conditions, but they do not rely on active IR illumination. In contrast, structured light systems [112] project a known pattern of light (usually infrared) onto the scene and then capture the deformed pattern with a camera. The deformation of the pattern allows the system to reconstruct the 3D geometry of the scene. It usually works well in low-light conditions (since it uses active illumination) but it is sensitive to ambient light and surface properties (e.g., reflectivity, transparency). Over the years, numerous techniques have been attempted for fruit 3D localization based on the aforementioned sensors and advanced computer vision methods [77, 45, 128, 68]. Specifically, [76] employed a stereo camera system in conjunction with a tailored fruit- matching algorithm, reporting a localization error of approximately 11 mm in the simplified indoor environment. However, outdoor agricultural environments are significantly more chal- lenging, with exposure to a wide spectrum of lighting conditions and complex tree and fruit structures, which can cause major issues to its fruit matching techniques as they can sig- nificantly influence the visual characteristics and discernibility of fruits captured within the images or point clouds. [2] utilized a commercial device, RealSense RGB-D camera, to ob- tain the positions of apples and reported a localization error of around 9.5 mm in an ideal indoor environment. However, due to the low resolution of the projector in the depth cam- era, RGB-D cameras have to interpolate based on partial depth measurements. Unlike plane surfaces, fruits can exhibit a wide range of shapes, sizes, colors, and textures, making it difficult to develop a one-size-fits-all approach to 3D fruit localization. 
To overcome complex fruit morphology, [101] utilized a global time-of-flight camera to obtain fruit point cloud by removing the background, in order to accurately localize each point on the fruit. However, in real-world orchard scenarios, fruits are often surrounded by leaves, branches, and other fruits, which can create occlusions and clusters in the images or point clouds. These issues make it difficult to accurately identify and localize the fruits, particularly when they are partially or fully occluded. In this chapter, I address the above challenges by designing and developing a novel Active 49 Laser-Camera Scanner (ALACS) system. Specifically, an RGB camera is integrated with a line laser to achieve robust and accurate localization using the triangulation principle. I propose a feature-matching algorithm, called Laser Line Extraction (LLE), to help ALACS transform the 2D fruit positions to 3D fruit positions, thus achieving accurate mapping even under complex fruit morphology, variable lighting conditions and occlusions. This research is expected to provide a valuable reference for future research on developing fruit 3D localization systems in fruit harvesting. The main contributions of this chapter are highlighted as follows: 1. A novel Active Laser-Camera Scanning system, consisting of a red line laser, an RGB- D camera, and a linear motion slide, is designed and developed for accurate fruit 3D localization. 2. A Laser Line Extraction (LLE) algorithm is proposed and implemented for robust feature matching to enable stable 2D-3D transformation for ALACS. 3. System evaluation and validation of the ALACS are performed indoors and outdoors in comparison with a conventional 3D sensing technique, i.e., Intel RealSense D435i RGB-Depth camera. 5.2 Active Laser-Camera Scanning (ALACS) The ALACS is designed to provide accurate 3D localization of apples in orchards by combining the advantages of a depth camera and laser scanning. The major hardware components include a red line laser (Laserglow Technologies, North York, ON, Canada), a FLIR RGB camera (Teledyne FLIR, Wilsonville, OR, USA), a linear motion slide, and an Intel RealSense D435i RGB-D camera, which is used for providing rough initial global estimates of fruits. As shown in Figure 5.1, the RGB-D camera is mounted on a horizontal frame that is above the manipulator to provide a global view of the scene and initialize the laser position. The line laser is mounted on the linear motion slide that enables the laser to move horizontally with a full stroke of 20 cm. Meanwhile, the FLIR RGB camera is installed at the left end of the linear motion slide with a relative angle to the laser to 50 capture laser patterns on apples. The hardware configuration of the ALACS is designed to facilitate depth measurements based on the principle of laser triangulation [30]. Specifically, the laser triangulation-based technique is a classical high-precision localization scheme that captures depth measurements by pairing a laser illumination source with a camera. It is worth noting that with the conventional laser triangulation sensors, the relative position and pose between the laser and the camera is fixed (i.e., both of them are static or moving simultaneously), whereas in ALACS the camera is fixed while the laser position is actively adjusted with the linear motion slide to seek the target fruit (see subsequent discussions for more details). 
Specifically, ALACS performs fruit 3D localization in three steps: laser scanning, target position determination, and 2D-3D position transformation.

Figure 5.1 CAD model of the ALACS, consisting of a red line laser, a linear motion slide, an RGB camera, and an RGB-D camera.

As shown in Figure 5.2, the first step in the ALACS workflow is to capture a global view of the scene using the RGB-D camera. This camera provides both color and depth information, which can be used to segment and identify bounding boxes containing apples based on an apple detection approach (see [24]). Then, a rough 3D position of each detected apple is estimated. Based on a planning strategy [134], ALACS selects a target apple and uses the rough 3D position provided by the RealSense D435i to guide the laser to the initialized position related to the target apple. The laser is then moved horizontally from the left to the right side of the apple in five 2-cm increments; it illuminates the target apple and creates visible laser lines on the surface of the fruit (see Figure 5.3).

Figure 5.2 Schematics of the ALACS workflow: In step A, the RGB-D camera provides an initial global view of detected apples. In step B, a target apple is determined based on a planning strategy and its rough 3D location is sent to ALACS. In step C, the fruit is scanned and the high-precision position is obtained. Finally, in step D, the 3D localization information is sent to the manipulator for fruit picking.

Figure 5.3 Schematic of laser scanning on a target fruit in ALACS to obtain one laser line from the fruit.

During the laser scanning process, an image is obtained at each stop with the illuminated apple being captured by the high-resolution FLIR camera. ALACS uses a laser line extraction algorithm (see Section 5.3) to extract the laser pattern on the target apple at each stop. The candidates resulting from the five different laser projections cover the target apple. The most reliable candidate is then selected to determine the apple's centroid position based on a confidence evaluation for each candidate, which is calculated using two key factors: the distance to the estimated center and the number of extracted laser line pixels.

Distance to the Estimated Center: The distance factor in the confidence calculation quantifies how close the candidate laser line is to the apple's estimated center (obtained from our apple detection algorithm [24]). Candidates that are closer to the center are considered more reliable and are assigned higher confidence scores. This distance factor ∆d is calculated as the Euclidean distance between the candidate's position and the estimated center.

Number of Extracted Laser Line Pixels: The second factor contributing to the confidence calculation is the number of extracted laser line pixels, N, for each candidate. Candidates with more extracted laser line pixels are considered to provide a more complete representation of the apple's surface geometry and are thus deemed more reliable. Consequently, these candidates are assigned higher confidence scores.

Then the confidence P is calculated by P = ω1 · N − ω2 · ∆d, where ω1 and ω2 are weights for these two factors and are obtained through cross-validation.
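A minimal sketch of this candidate-selection rule is given below; the weight values and the candidate data structure are illustrative assumptions, not the tuned values used on the robot.

import math

def candidate_confidence(num_pixels, center_xy, est_center_xy, w1=1.0, w2=0.5):
    # Confidence P = w1 * N - w2 * dist, as described above.
    # num_pixels    -- N, number of extracted laser-line pixels for the candidate
    # center_xy     -- midpoint of the extracted laser line (pixels)
    # est_center_xy -- apple center estimated by the detector (pixels)
    dd = math.dist(center_xy, est_center_xy)  # Euclidean distance factor
    return w1 * num_pixels - w2 * dd

def select_best_candidate(candidates, est_center_xy):
    # Pick the laser-line candidate with the highest confidence score.
    # Each candidate is assumed to be a dict: {"num_pixels": int, "center": (u, v)}.
    return max(candidates,
               key=lambda c: candidate_confidence(c["num_pixels"], c["center"], est_center_xy))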
After selecting the most reliable candidate based on the calculated confidence scores, the 2D position of the center of this laser line is obtained and the apple's center position is determined using the laser triangulation scheme [30]. The basic idea of this technique is to capture depth measurements by pairing a laser illumination source with a camera. Both the laser beam and the camera are aimed at the target object, and based on the extrinsic parameters between the laser source and the camera sensor, the depth information can be computed using trigonometry. The transformation from the 2D position $(u_i, v_i)$ to the 3D position $(x_i, y_i, z_i)$ follows
$$z_i = \frac{L}{\sin(\alpha) - u_i\cos(\alpha) - v_i\tan(\beta)}, \quad x_i = \frac{L\,u_i}{\sin(\alpha) - u_i\cos(\alpha) - v_i\tan(\beta)}, \quad y_i = \frac{L\,v_i}{\sin(\alpha) - u_i\cos(\alpha) - v_i\tan(\beta)}, \tag{5.1}$$
where the extrinsic parameters $L$, $\alpha$, and $\beta$ are, respectively, the baseline (i.e., the distance between the camera and the line laser), the horizontal angle, and the vertical angle between the laser illumination source and the camera. The details of parameter estimation are discussed in [130]. Therefore, ALACS finally localizes the target apple. After the apple has been picked, the process repeats for the next target fruit.

5.3 Laser Line Extraction (LLE)

In this section, I present more details on the laser line extraction steps, which are of paramount importance in the ALACS system since they serve as the crucial link between the laser scanning process and the final triangulation-based localization. In ALACS, images of the illuminated apple are captured using a high-resolution camera. These images are then processed to extract the laser lines on the apple's surface. Extracting accurate and well-defined laser lines from the captured images provides essential geometric information about the apple's surface. Furthermore, effective laser line extraction techniques can help mitigate the impact of noise, occlusions, and illumination variations, which can significantly improve the overall performance and reliability of the ALACS method. I next discuss the specific algorithms and techniques employed for laser line extraction and their role in enhancing the accuracy of the ALACS-based apple localization process.

I have chosen red as the laser color, as opposed to blue or green lasers. This choice was made based on our preliminary evaluations, which found that red laser lines appear more intense and distinct in the captured images (see Figure 5.4 for comparisons among three lasers of different colors), thus facilitating more effective extraction of laser lines. More specifically, LLE is designed with 4 steps: laser pattern detection, noise removal, line focus, and curve fitting, as illustrated in Figure 5.5.

Laser Pattern Detection: Based on the utilized laser line pattern, various image processing techniques, such as edge detection, filtering, and thresholding, can be employed to identify and isolate the laser lines in the captured images. Thresholding is the simplest way to extract highlighted patterns, but it is usually affected by strong external lighting like sunlight. Edge detection using a Sobel or Roberts kernel [27] can find line patterns accurately, but it requires more computation time than thresholding. Hence, I designed a novel algorithm, called bidirectional Relative Color Enhancement (bRCE), to detect the laser line pattern from the selected apple image I efficiently.
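Before detailing bRCE, the 2D-to-3D mapping in (5.1) can be summarized in a few lines of code. The calibration values below are placeholders for illustration, and the pixel coordinates are assumed (as the form of (5.1) suggests) to be normalized image-plane coordinates obtained after camera calibration.

import math

def laser_triangulate(u, v, L, alpha, beta):
    # 2D-to-3D transformation of Eq. (5.1).
    # u, v   -- laser-line center in (assumed) normalized image coordinates
    # L      -- baseline between the camera and the line laser
    # alpha  -- horizontal angle between laser and camera (radians)
    # beta   -- vertical angle between laser and camera (radians)
    denom = math.sin(alpha) - u * math.cos(alpha) - v * math.tan(beta)
    z = L / denom
    return u * z, v * z, z   # x = L*u/denom, y = L*v/denom, z = L/denom

# Example with placeholder calibration values (not the real ALACS parameters):
x, y, z = laser_triangulate(u=0.05, v=-0.02, L=0.30,
                            alpha=math.radians(60), beta=math.radians(5))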
The bRCE (see Algorithm 5.1) computes a bi-directional horizontal gradient matrix G, which is calculated through shifted differences of the image in both the left-to-right and right-to-left directions. The bRCE can effectively highlight the boundaries of the laser lines, enhance the contrast of the laser lines, and make them more distinguishable from the background. The bRCE outperforms the thresholding-based method especially in over-exposure situations, as shown by an example in Figure 5.6.

Figure 5.4 Comparison of laser projections on a red apple by a red laser (635 nm), green laser (515 nm) and blue laser (447 nm). Laser intensity is obtained using a threshold of 140.

Figure 5.5 Schematic of the laser line extraction (LLE) workflow.

Algorithm 5.1 bRCE laser pattern detection
Input: I: n × m × 3 RGB image matrix of the selected apple.
Output: G: n × m laser prediction matrix.
Params: step: gradient step, th: gradient threshold.
  R = red channel of I
  G1 = R[:, : −2·step] − R[:, step : −step]
  G2 = R[:, 2·step :] − R[:, step : −step]      /* G1, G2 are two horizontal gradients of I. */
  for Gi in {G1, G2} do
      Gi[Gi ≤ th] = 0
  end
  G = G1 ◦ G2, G[G > 0] = 1      ▷ ◦ is the element-wise product

Figure 5.6 Laser pattern detection comparison between bidirectional Relative Color Enhancement (bRCE) with step = 8, th = 40 and thresholding with th = 220 under over-exposure situations.

Noise removal: The laser line patterns are usually highlighted together with small outliers, as shown by the examples in Figure 5.7, since the over-exposed regions caused by strong sunlight can interfere with the laser line. To address the discontinuity of the outliers, I employ a sliding-window counting method to remove these outliers. As shown in Algorithm 5.2, a predefined window of size W is slid horizontally across the image row by row based on the partition size γ, and the number of strong curve pixels within the window is counted. If the count exceeds a pre-specified threshold θ, the pixels within the window are considered part of the laser line; otherwise, they are categorized as noise and discarded. This filtering process helps retain only the most prominent laser lines C in the image while eliminating undesired artifacts and noise.

Algorithm 5.2 Noise removal
Input: G: n × m laser prediction matrix (from Algorithm 5.1).
Output: C: n × m noise-free laser matrix.
Params: wG, hG: width and height of G; θ: noise threshold; γ: partition size of G.
  /* W(x, y, w, h) is the sliding window with top-left position (x, y), width w and height h. */
  w, h = (1/4)·wG, (1/2)·hG,  s.t. 0 < θ ≤ w × h, w/γ > 1
  C = G
  u = (wG − w)/γ, v = hG/h
  for i = 1, 2, 3, ..., u do
      for j = 1, 2, 3, ..., v do
          x = (i − 1)·γ, y = (j − 1)·h
          if sum of W(x, y, w, h) ≤ θ then
              C[x : x + w, y : y + h] = 0
          end
      end
  end

Figure 5.7 Extracted laser patterns with outliers (noise) (first row) and the resultant laser patterns after noise removal using a threshold of θ = 8 and a partition size of γ = 3 (second row).

Line focusing: To enable point-to-point feature matching for laser triangulation, LLE proceeds to extract a centerline for each laser pattern (generally wider than 2 pixels) by computing the centroids of the remaining strong laser pattern pixels on a row-by-row basis.
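For readers who prefer runnable code over pseudocode, the NumPy sketch below mirrors the arithmetic of Algorithms 5.1 and 5.2; the output alignment/padding and the fixed window size in the noise filter are my own simplifying assumptions rather than the released implementation.

import numpy as np

def brce(image_rgb, step=4, th=40):
    # bRCE laser pattern detection, following Algorithm 5.1 as printed:
    # two shifted horizontal differences of the red channel are thresholded
    # and multiplied element-wise; surviving pixels are marked as laser.
    r = image_rgb[:, :, 0].astype(np.int32)
    g1 = r[:, : -2 * step] - r[:, step:-step]
    g2 = r[:, 2 * step:] - r[:, step:-step]
    g1 = np.where(g1 > th, g1, 0)
    g2 = np.where(g2 > th, g2, 0)
    g = (g1 * g2 > 0).astype(np.uint8)
    out = np.zeros_like(r, dtype=np.uint8)
    out[:, step:-step] = g          # pad back to the full width (assumption)
    return out

def remove_noise(g, win_w=40, win_h=40, theta=8, gamma=3):
    # Sliding-window outlier removal in the spirit of Algorithm 5.2:
    # windows containing at most `theta` laser pixels are cleared.
    c = g.copy()
    h_img, w_img = g.shape
    for y in range(0, h_img - win_h + 1, win_h):
        for x in range(0, w_img - win_w + 1, gamma):
            if g[y:y + win_h, x:x + win_w].sum() <= theta:
                c[y:y + win_h, x:x + win_w] = 0
    return c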
By averaging the column indices of the strong curve pixels, the algorithm determines the centroid of the laser line in each row, which are then connected to form a 58 Laser patternsNoiseremoving continuous centerline. This focused centerline effectively represents the central path of the laser line in the image, which will be used as the basis for the final continuous curve fitting step discussed next. Curve fitting: The focused laser line obtained from the last step is not always smooth and continuous. As such, I use a polynomial to fit the rough focused line and thus generate a smooth and continuous curve that accurately represents the laser line in the image. The choice of polynomial order depends on the expected curvature of the laser line on the apple’s surface, with higher-order polynomials offering greater flexibility to fit complex shapes. To avoid overfitting problem and through cross-validation, the 4th-order polynomial is used in this study to fit curves, which strikes a good tradeoff between accuracy and simplicity based on our preliminary evaluations. 5.4 Experimental Results In ALACS, the 3D localization performance is affected by both feature matching accuracy and 3D positions estimation. I thus separately evaluate the performance of LLE and the final apple 3D localization performance. 5.4.1 LLE Evaluation To evaluate the performance of LLE, I conducted a series of experiments under varying lighting conditions (from 1000 to 6500 lux) in the outdoors environment, specifically overcast and direct lighting scenarios.These tests were intended to assess the robustness and effective- ness of the LLE algorithm in extracting laser lines in challenging environments. To evaluate the laser line extraction performance using bRCE, I first varied the bRCE parameters step and th to observe how they affect laser extraction performance (see Figure 5.8). Based on tests at different distances of 0.8 m, 1.0 m, 1.2 m, 1.4 m, and 1.6 m, it was found that the laser lines on the surface of apple are around 4-pixel wide. When I tuned step from 2 to 14 with a fixed th, the best results are generated with around step= 4. With step increases, the shape of laser pattern becomes slacking. A similar approach was used to tune th from 20 to 70, and the number of laser pixels tends to decline as th increases. After calculating 59 the extracted line’s mean and standard deviation from different combination of parameters on 400 images, the best parameters are chosen to be step= 4 and th= 40. Figure 5.8 An example of the laser pattern detection performance with different bRCE parameters: (a) with a threshlod value th= 60 and the step varying from 2 to 14; (b) with th= 30 and the step varying from 2 to 14; and (c) with step= 4 and th varying from 20 to 70. The best result is obtained using step= 4 and th= 40 for this input. Furthermore, to perform a quantitative evaluation of the LLE algorithm, I calculate the laser line displacements between the LLE predictions and the ground truth, which are ob- tained through manual labeling. This evaluation provides a quantitative measure of the accuracy and reliability of the LLE algorithm in extracting laser lines under various condi- tions. For this evaluation, a set of images with visible laser lines were manually annotated by trained persons, who carefully trace the laser lines and mark their positions as ground truth. These ground-truth annotations serve as a reference for comparing the performance of the LLE algorithm against an ideal extraction. 
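Returning briefly to the line-focusing and curve-fitting steps described above, a compact sketch of both operations is given below; the guard for empty or very short lines is my own addition, not part of the original method.

import numpy as np

def focus_and_fit(mask, order=4):
    # mask: binary laser matrix after noise removal (rows x cols).
    # Line focusing: the centroid (mean column) of the laser pixels in each row.
    # Curve fitting: a 4th-order polynomial smooths the centerline, as discussed above.
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return np.array([]), np.array([])
    row_ids = np.unique(rows)
    centroids = np.array([cols[rows == r].mean() for r in row_ids])
    if row_ids.size <= order:           # too few rows to fit the polynomial
        return row_ids, centroids
    coeffs = np.polyfit(row_ids, centroids, deg=order)
    return row_ids, np.polyval(coeffs, row_ids)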
LLE was then applied to the same set of images, and the resulting laser line predictions were compared with the ground truth annotations. The laser line displacements were calculated as the average pixel-wise Euclidean distances between the points on each row of the LLE-predicted laser lines and the ground-truth laser lines. Smaller displacements indicate a higher degree of agreement between the LLE predictions and the ground truth, reflecting a more accurate and reliable extraction performance. Since I only used the central results for localization, I calculated the displacements based on different central segments of each laser line. The results are summarized in Table 5.1. When the displacements are calculated over the 10% central segment of the laser line, the LLE generates an average displacement of 1 pixel.

Table 5.1 Performance of the laser line extraction (LLE) algorithm, as measured by average (Avg), minimum (Min) and maximum (Max) displacements (Disp.) in pixels for various central segment ratios (from 10% to 80%) on 300 cases.

Central segment   10%   20%   30%   40%   50%   60%   70%   80%
Avg Disp.         1.0   1.2   1.3   1.4   1.5   1.7   1.7   1.9
Min Disp.         0     0.2   0.3   0.4   0.4   0.5   0.5   0.6
Max Disp.         2.9   3.2   3.3   3.5   3.8   3.9   4.0   4.2

By analyzing the laser line displacements across various images and conditions (see Figure 5.9), I can gain insights into the performance and robustness of the LLE algorithm. Because of image saturation caused by direct lighting, the line extracted by LLE is not always complete and continuous. In the meantime, occlusions also divide laser lines into different parts. These discontinuity issues are fixed by the polynomial curve fitting (see Figure 5.9), which makes the final laser line adhere closely to the ground truth. This quantitative evaluation shows how accurately LLE identifies laser patterns even under direct sunlight, ultimately contributing to the overall performance of the ALACS-based apple localization system.

Figure 5.9 Visualization of the LLE process under different lighting conditions and occlusions (step = 4, th = 40): (a)-(c) show apples with different laser line positions under overcast conditions; (d)-(f) show apples under direct sunlight; (g) and (h) show occluded apple cases.

5.4.2 ALACS Localization Evaluation

To assess the performance of the ALACS-based apple localization, I conducted a series of evaluation experiments including occlusion and cluster cases, in both indoor and outdoor environments. In the indoor environment, I compared the localization results obtained by the ALACS method against the ground truth data acquired using a high-precision Qualisys localization system (Qualisys, Sweden) with an accuracy of 0.11 mm [17]. To facilitate the acquisition of ground truth data, markers were placed on the apples in the orchard, and the Qualisys system was employed to accurately determine their 3D positions. These ground truth positions served as a reference for evaluating the performance of the ALACS-based localization method.

In the first evaluation experiment, I allowed the ALACS system to project a single laser line to the center of the marker placed on the apple, under both occlusion-free and occlusion-present situations. To mimic occlusion situations, I used artificial foliage to partially cover the apples.
The ALACS system then estimated the apple's position using the extracted laser line and the target position estimation process. The resulting ALACS localization results were compared against the ground truth positions acquired using the Qualisys system. Results in Figure 5.10 show that ALACS achieved superior localization performance in occlusion-free situations; the average distance error ranges from 2.5 mm to 5.8 mm at distances from 1.0 m to 1.6 m. When the apples were occluded by leaves, the average localization errors were significantly larger than those without occlusions. The average distance errors under the occlusion situations are 6.9 mm, 7.2 mm, 9.0 mm and 11.2 mm, respectively, at distances of 1.0 m, 1.2 m, 1.4 m and 1.6 m. Our harvesting robot uses a vacuum-based end effector to pick fruits, which can tolerate localization errors within 20 mm [131]. Hence, the ALACS system can still meet the localization accuracy requirements when apples are occluded by leaves.

Figure 5.10 Indoor performance evaluation of the active laser-camera scanning (ALACS) system using single laser projections to the Qualisys marker center from different distances for occlusion-free (120 cases) and occlusion-present (120 cases) situations.

In the second indoor evaluation experiment, I tested the ALACS system's ability to estimate the apple's position using multiple laser projections, again under occlusion-free and occlusion-present situations. The ALACS system acquired the laser lines in 2-cm increments five times over the apple, projecting different laser lines at various positions on the apple's surface. The LLE algorithm was then used to extract the laser lines and determine the apple's center position based on these five laser projections. As shown in Figure 5.11, under the occlusion-free conditions, the average distance errors range from 5.5 mm to 9.1 mm for distances from 1.0 m to 1.6 m, which are larger than those obtained in the first experiment with a single laser line under the occlusion-free condition. Under the occlusion situations, the average distance errors were 9.2 mm, 11.6 mm, 13.9 mm, and 17.5 mm at 1.0 m, 1.2 m, 1.4 m, and 1.6 m, respectively. While these errors are significantly larger compared to those obtained for single laser lines in the first experiment (see Figure 5.10), the ALACS is still expected to meet our robot localization accuracy requirement of 20 mm when the distance is less than 1.6 m, the maximum working distance designed for our harvesting robot [131].

Figure 5.11 Indoor performance evaluation of the active laser-camera scanning (ALACS) system using multiple laser line projections to the Qualisys marker from different distances under the occlusion-free (120 cases) and occlusion-present (120 cases) situations.

Furthermore, I also performed a 3D localization comparison between ALACS and the commercial RGB-D camera RealSense D435i, as the latter is commonly used in the harvesting robots developed by other researchers. In the indoor environment, I again used the Qualisys Motion Tracking System to benchmark the results and tested the localization from ALACS with multiple laser projections and from the RealSense D435i with occlusions at different distances.
Table 5.2 shows that the ALACS system significantly outperformed the RealSense benchmark at different distances over 120 cases.

Since the Qualisys system cannot provide accurate measurements in the outdoor environment due to the varying light conditions, I used the positions generated from ALACS and the D435i to operate our robotic system [134] to determine whether fruits would be attached to the vacuum-based end-effector. The attachment rate was used as an indirect metric for evaluating the localization results of both ALACS and the RealSense D435i. Since our vacuum-based end-effector can tolerate localization errors up to 20 mm, a measured localization error larger than 20 mm would be considered a failed detachment. I tested 100 apples under cloudy conditions and a 50% occlusion rate in a research orchard at Michigan State University's Horticultural Teaching and Research Center in Holt, MI. Our results showed that ALACS achieved a 95% detachment rate, whereas the RealSense D435i only had a 71% success rate.

Table 5.2 Comparison of average localization errors (mm) between ALACS and RealSense D435i at different distances.

Sensor \ Range   1.0 m           1.2 m           1.4 m           1.6 m
RS D435i         16.0 (±7.1)     17.3 (±8.0)     19.5 (±9.3)     21.5 (±11.2)
ALACS             6.9 (±5.2)      7.2 (±5.8)      9.0 (±6.5)     11.2 (±7.8)

Results from the three indoor and outdoor experiments have demonstrated that the ALACS system has significantly enhanced performance for apple localization compared to the RealSense D435i.

5.5 Summary of the Chapter

In this chapter, a novel Active Laser-Camera Scanning (ALACS) system was developed for robust apple 3D localization. The proposed LLE method provided precise laser line pattern extraction with an average displacement of 1 pixel under complex fruit morphology, over-exposure, and occlusion conditions. For apple 3D localization, ALACS was able to achieve average errors of 9.2–17.5 mm at distances ranging from 1.0 m to 1.6 m, which are significantly better than those of the widely adopted commercial RGB-D camera RealSense D435i. ALACS also demonstrated superior performance for fruit detachment in an apple orchard when it was tested with our harvesting robot equipped with a vacuum-based end effector.

CHAPTER 6
SKESEGNET: SKELETON-LEAD SEGMENTATION NETWORK FOR BRANCH SEGMENTATION

In this chapter, I delve into the realm of panoptic segmentation, presenting a comprehensive overview of the field. I introduce our novel approach, SkeSegNet, aimed at advancing branch segmentation by leveraging skeletal information for enhanced accuracy and robustness. Furthermore, I explore the by-product of SkeSegNet to generate 3D branch representations.

6.1 Introduction

In fruit detection and robotic harvesting applications, it is crucial to have a comprehensive understanding of the orchard environment, including not only the fruits but also other elements, such as branches, leaves, and the overall tree structure. Panoptic segmentation, a computer vision technique that combines instance segmentation and semantic segmentation, can provide valuable information about the orchard scene by simultaneously segmenting and classifying individual objects and background regions. By integrating panoptic segmentation into fruit detection systems, I can obtain richer visual details to guide robotic systems more effectively, improving the efficiency and safety of the harvesting process.
Panoptic segmentation aims to provide a holistic understanding of the scene by segmenting and classifying every pixel in the image into various categories, such as fruits, branches, leaves, and background. This level of detail can be particularly useful in fruit detection applications for several reasons. 1) Improved fruit detection: panoptic segmentation can help distinguish fruits from other scene elements, such as leaves and branches, reducing false positives and improving overall detection accuracy; 2) Robotic navigation and manipulation: by providing detailed information about the orchard structure, panoptic segmentation can enable robots to navigate and manipulate their environment more effectively, avoiding obstacles and minimizing damage to the trees and fruits; 3) Scene understanding for fruit status: panoptic segmentation can contribute to a better understanding of the entire orchard scene, facilitating more accurate fruit ripeness checking [66].

Several techniques and algorithms have been proposed for panoptic segmentation in computer vision, leveraging advances in deep learning and convolutional neural networks (CNNs). Panoptic Feature Pyramid Networks (Panoptic-FPN) [58] is a unified, end-to-end trainable architecture that combines instance segmentation and semantic segmentation tasks, producing panoptic segmentation results in a single forward pass through the network. Panoptic-DeepLab [20] is another end-to-end trainable architecture for panoptic segmentation, which employs an efficient dual-path architecture to handle both instance and semantic segmentation tasks. Mask R-CNN is a popular instance segmentation method that can be extended to perform panoptic segmentation by combining it with semantic segmentation techniques. Adapting these methods for fruit detection applications in orchards may involve fine-tuning the segmentation models on annotated datasets containing fruit, branch, leaf, and background classes, ensuring that the models can accurately segment and classify these elements in real-world orchard environments.

I adapt the Panoptic-DeepLab architecture for orchard environments and train it on our customized dataset, which consists of images with five distinct labeled classes: apples, branches, foliage, sky, and ground. I propose a Skeleton-lead Segmentation Network to improve the accuracy of branch segmentation, since branches are the main obstacles blocking access to apples. Using the skeleton branches, 3D branches are quickly generated in combination with the depth map, which provides obstacle data for our robot's planning algorithm. Then, I evaluate the performance against state-of-the-art panoptic segmentation models and demonstrate superior performance. The contributions of this work are highlighted as follows:

1. An image annotation tool, PicA, is developed and open-sourced to alleviate the burden of manual annotation associated with panoptic segmentation by leveraging the concept of superpixels and pre-trained models.

2. The Skeleton-lead Segmentation Network (SkeSegNet) is proposed to address the challenges of segmenting complex branches. SkeSegNet is evaluated against and outperforms four state-of-the-art panoptic segmentation models.

3. SkeSegNet generates 3D branches for efficient obstacle avoidance using the depth map.

6.2 Data Annotation

Based on the previous work (see Chapter 3), I select 167 images to compose an orchard segmentation dataset.
I next processed these collected raw orchard images into formats that can be used to train and evaluate deep networks. Specifically, there are five categories in total, including branch, foliage, ground, sky, and apples, to be annotated at the pixel level. In panoptic segmentation datasets, instance annotation and semantic annotation are split into different tasks, which are labor-intensive and time-consuming. To address the challenges associated with panoptic segmentation annotation, particularly when dealing with complex scenes, a novel image annotation tool named PicA1 has been developed. PicA has been designed to alleviate the burden of manual annotation by leveraging the concept of superpixels and pre-trained models to expedite the annotation process.

Superpixels, which are compact and coherent image regions, serve as a fundamental component in PicA's annotation strategy. By segmenting the input images into superpixels, PicA simplifies the annotation process by allowing annotators to work on smaller, more manageable regions of the image at a time. This approach not only enhances the efficiency of the annotation process but also reduces the potential for human error by focusing annotators' attention on local regions. Additionally, PicA harnesses the power of pre-trained models to further streamline the annotation workflow. These models, trained on large and diverse datasets, possess the capability to generate preliminary segmentation results with a high degree of accuracy. PicA integrates these pre-trained results as a starting point for annotators, reducing the amount of manual labor required. This not only accelerates the annotation process but also ensures that annotators are provided with a reliable foundation for further refinement and correction. Figure 6.1 shows the annotation results based on PicA.

1The image annotation tool PicA is open-sourced at https://github.com/pengyuchu/picA.

Figure 6.1 PicA: An image annotation tool. In this workspace, a panoptic segmentation project is created and an image is annotated with superpixels. After checking BBoxes, all of the instances are shown using white boxes.

6.3 Panoptic-DeepLab: Panoptic Segmentation

Panoptic-DeepLab [20] is conceptually simple, adopting dual-ASPP and dual-decoder modules specific to semantic segmentation and instance segmentation, respectively. Panoptic-DeepLab uses a fast bottom-up baseline (see Figure 6.2), requires only three loss functions during training, and introduces only marginal extra parameters and a slight additional computation overhead when built on top of a modern semantic segmentation model. In the apple orchard segmentation, the output will be the fusion of scenes (i.e., branch, foliage, ground, and sky) and apples.

6.3.1 Architecture of Panoptic-DeepLab

Panoptic-DeepLab (see Figure 6.3) consists of four components: (1) an encoder backbone shared for both semantic segmentation and instance segmentation, (2) decoupled ASPP modules and (3) decoupled decoder modules specific to each task, and (4) task-specific prediction heads.

Figure 6.2 Panoptic-DeepLab [20] predicts three outputs: semantic segmentation, instance center prediction, and instance center regression. Class-agnostic instance segmentation, obtained by grouping predicted foreground pixels to their closest predicted instance centers, is then fused with semantic segmentation by a majority-vote rule to generate the final panoptic segmentation.
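To make the overall layout concrete, below is a minimal PyTorch-style sketch (my own illustration, not the released Panoptic-DeepLab code) of how a shared encoder, decoupled ASPP/decoder modules, and task-specific heads could be wired together; the placeholder modules, channel sizes, and head names are assumptions for illustration only.

```python
import torch.nn as nn

class DualDecoderPanopticSketch(nn.Module):
    """Minimal sketch of the shared-encoder, dual-ASPP, dual-decoder layout.

    The backbone/ASPP/decoder modules are passed in as placeholders; the real
    Panoptic-DeepLab uses an ImageNet-pretrained backbone with atrous
    convolution, ASPP context modules, and light-weight decoders.
    """

    def __init__(self, backbone, sem_aspp, ins_aspp, sem_decoder, ins_decoder,
                 num_classes, decoder_channels=256):
        super().__init__()
        self.backbone = backbone          # (1) shared encoder
        self.sem_aspp = sem_aspp          # (2) decoupled ASPP (semantic path)
        self.ins_aspp = ins_aspp          #     decoupled ASPP (instance path)
        self.sem_decoder = sem_decoder    # (3) decoupled decoders
        self.ins_decoder = ins_decoder
        # (4) task-specific prediction heads (channel sizes are assumptions)
        self.sem_head = nn.Conv2d(decoder_channels, num_classes, kernel_size=1)
        self.center_head = nn.Conv2d(decoder_channels, 1, kernel_size=1)  # center heatmap
        self.offset_head = nn.Conv2d(decoder_channels, 2, kernel_size=1)  # (dy, dx) offsets

    def forward(self, x):
        feats = self.backbone(x)                        # shared features
        sem = self.sem_decoder(self.sem_aspp(feats))    # semantic path
        ins = self.ins_decoder(self.ins_aspp(feats))    # instance path
        return {
            "semantic": self.sem_head(sem),
            "center": self.center_head(ins),
            "offset": self.offset_head(ins),
        }
```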
Basic architecture: The encoder backbone is adapted from an ImageNet-pretrained neural network paired with atrous convolution for extracting denser feature maps in its last block. Motivated by [21, 81], Panoptic-DeepLab employs separate ASPP and decoder modules for semantic segmentation and instance segmentation, respectively, based on the hypothesis that the two branches require different contextual and decoding information, which is empirically verified in the following section. The light-weight decoder module follows DeepLabV3+ [18] with two modifications: (1) Panoptic-DeepLab introduces an additional low-level feature with output stride 8 to the decoder, so that the spatial resolution is gradually recovered by a factor of 2, and (2) in each upsampling stage Panoptic-DeepLab applies a single 5 × 5 depthwise-separable convolution [51].

Semantic segmentation head: Panoptic-DeepLab employs the weighted bootstrapped cross entropy loss, proposed in [126], for semantic segmentation, predicting both 'thing' and 'stuff' classes. The loss improves over the bootstrapped cross entropy loss [121, 96, 88] by weighting each pixel differently.

Figure 6.3 Panoptic-DeepLab [20] adopts dual-context and dual-decoder modules for semantic segmentation and instance segmentation predictions. Atrous convolution is applied in the last block of the network backbone to extract denser feature maps. Atrous Spatial Pyramid Pooling (ASPP) is employed in the context module, together with a light-weight decoder module consisting of a single convolution during each upsampling stage. The instance segmentation prediction is obtained by predicting the object centers and regressing every foreground pixel (i.e., pixels with a predicted 'thing' class) to its corresponding center. The predicted semantic segmentation and class-agnostic instance segmentation are then fused to generate the final panoptic segmentation result by the "majority vote" proposed by DeeperLab.

Class-agnostic instance segmentation head: Motivated by Hough Voting [7], Panoptic-DeepLab represents each object instance by its center of mass. For every foreground pixel (i.e., a pixel whose class is a 'thing'), it further predicts the offset to its corresponding mass center. During training, groundtruth instance centers are encoded by a 2-D Gaussian with a standard deviation of 8 pixels. In particular, Panoptic-DeepLab adopts the Mean Squared Error (MSE) loss to minimize the distance between the predicted heatmaps and the 2-D Gaussian-encoded groundtruth heatmaps. It uses an L1 loss for the offset prediction, which is only activated at pixels belonging to object instances. During inference, predicted foreground pixels (obtained by filtering out background 'stuff' regions from the semantic segmentation prediction) are grouped to their closest predicted mass center, forming the class-agnostic instance segmentation results, as detailed below.
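As a concrete illustration of the instance segmentation head just described, the following minimal sketch (my own code, not the authors' implementation) builds the Gaussian-encoded center heatmap and offset targets and computes the corresponding MSE and L1 losses; the function names, the (y, x) center convention, and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def encode_instance_targets(centers, height, width, sigma=8.0):
    """Build a Gaussian center heatmap and per-pixel offset targets.

    centers: (N, 2) tensor of instance mass centers given as (y, x).
    Returns heatmap (H, W) and offsets (2, H, W).
    """
    ys = torch.arange(height).view(-1, 1).expand(height, width).float()
    xs = torch.arange(width).view(1, -1).expand(height, width).float()
    heatmap = torch.zeros(height, width)
    offsets = torch.zeros(2, height, width)
    for cy, cx in centers:
        # Gaussian bump around each groundtruth center (sigma = 8 px).
        g = torch.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        heatmap = torch.maximum(heatmap, g)
    if len(centers) > 0:
        # Offset from every pixel to its nearest center (used only on 'thing' pixels).
        d = (ys.unsqueeze(0) - centers[:, 0].view(-1, 1, 1)) ** 2 \
            + (xs.unsqueeze(0) - centers[:, 1].view(-1, 1, 1)) ** 2
        nearest = d.argmin(dim=0)
        offsets[0] = centers[nearest, 0] - ys
        offsets[1] = centers[nearest, 1] - xs
    return heatmap, offsets

def instance_head_losses(pred_heatmap, pred_offset, gt_heatmap, gt_offset, fg_mask):
    """MSE loss on the center heatmap; L1 loss on offsets at foreground pixels only."""
    heatmap_loss = F.mse_loss(pred_heatmap, gt_heatmap)
    offset_loss = F.l1_loss(pred_offset[:, fg_mask], gt_offset[:, fg_mask])
    return heatmap_loss, offset_loss
```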
6.3.2 Panoptic Segmentation

During inference, Panoptic-DeepLab uses an extremely simple grouping operation to obtain instance masks, and a highly efficient majority-voting algorithm to merge semantic and instance segmentation into the final panoptic segmentation.

Simple instance representation: Panoptic-DeepLab simply represents each object by its center of mass, {C_n : (i_n, j_n)}. To obtain the center point prediction, it first performs a keypoint-based non-maximum suppression (NMS) on the instance center heatmap prediction, essentially equivalent to applying max pooling on the heatmap prediction and keeping locations whose values do not change before and after max pooling. Finally, a hard threshold is used to filter out predictions with low confidence, and only the locations with the top-k highest confidence scores are kept. In the experiments, Panoptic-DeepLab uses max pooling with kernel size 7, a threshold of 0.1, and k = 200.

Simple instance grouping: A simple instance center regression is used to obtain the instance id for each pixel. For example, consider a predicted 'thing' pixel at location (i, j); an offset vector O(i, j) is predicted to its instance center. O(i, j) is a vector with two elements, representing the offsets in the horizontal and vertical directions, respectively. The instance id for the pixel is thus the index of the closest instance center after moving the pixel location (i, j) by the offset O(i, j). That is,

k̂_{i,j} = arg min_k ||C_k − ((i, j) + O(i, j))||,   (6.1)

where k̂_{i,j} is the predicted instance id for the pixel at (i, j). I use the semantic segmentation prediction to filter out 'stuff' pixels, whose instance id is always set to 0.

Efficient merging: Given the predicted semantic segmentation and class-agnostic instance segmentation results, I adopt a fast and parallelizable method to merge the results, following the "majority vote" principle proposed in DeeperLab [126]. In particular, the semantic label of a predicted instance mask is inferred by the majority vote of the corresponding predicted semantic labels. This operation essentially accumulates class label histograms, and thus is efficiently implemented on a GPU, taking only 3 ms when operating on a 1025 × 2049 input.

6.3.3 Instance Segmentation

Panoptic-DeepLab can also generate instance segmentation predictions as a by-product. To properly evaluate the instance segmentation results, one needs to associate a confidence score with each predicted instance mask. Previous bottom-up instance segmentation methods use some heuristics to obtain the confidence scores. For example, DWT [6] and SSAP [39] use an average of semantic segmentation scores for some easy classes and use random scores for other harder classes. Additionally, they remove masks whose areas are below a certain threshold for each class. In contrast, Panoptic-DeepLab does not adopt any heuristic or post-processing for instance segmentation.
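To make the grouping and merging steps of Section 6.3.2 concrete, the sketch below (my own illustrative code, not the Panoptic-DeepLab implementation) assigns every 'thing' pixel to its nearest predicted center via the offset map, as in Equation 6.1, and then labels each instance mask by majority vote over the semantic prediction; the function name, tensor shapes, and the way 'stuff' classes are passed in are assumptions.

```python
import torch

def group_and_merge(semantic, offset, centers, stuff_ids):
    """Group 'thing' pixels to centers (Eq. 6.1) and majority-vote their class.

    semantic:  (H, W) long tensor of predicted class ids.
    offset:    (2, H, W) tensor of predicted (dy, dx) offsets to instance centers.
    centers:   (K, 2) tensor of predicted instance centers (y, x).
    stuff_ids: iterable of class ids treated as 'stuff' (instance id fixed to 0).
    """
    h, w = semantic.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # Move every pixel by its predicted offset, then find the closest center.
    moved = torch.stack([ys + offset[0], xs + offset[1]], dim=-1)      # (H, W, 2)
    dist = torch.cdist(moved.reshape(-1, 2).float(), centers.float())  # (H*W, K)
    instance_id = dist.argmin(dim=1).reshape(h, w) + 1                 # ids start at 1

    # 'Stuff' pixels always get instance id 0.
    thing_mask = ~torch.isin(semantic, torch.tensor(sorted(stuff_ids)))
    instance_id = torch.where(thing_mask, instance_id, torch.zeros_like(instance_id))

    # Majority vote: each instance takes the most frequent semantic label inside it.
    panoptic_class = semantic.clone()
    for k in range(1, int(instance_id.max()) + 1):
        mask = instance_id == k
        if mask.any():
            panoptic_class[mask] = semantic[mask].mode().values
    return panoptic_class, instance_id
```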
Motivated by YOLO [93], Panoptic-DeepLab computes the class-specific confidence score for each instance mask as Score_Object × Score_Class, where Score_Object is the unnormalized objectness score obtained from the class-agnostic center point heatmap, and Score_Class is obtained from the average of the semantic segmentation predictions within the predicted mask region.

6.4 Skeleton-lead Segmentation Network for Branch Detection

The segmentation of complex structures such as branches poses a significant challenge due to their inherent characteristic of spanning a large spatial extent. Traditional segmentation approaches may struggle to delineate such structures effectively, as they tend to rely on strict boundaries and may overlook the interconnected nature of branches. Therefore, I introduce an alternative strategy: treating these branches as a combination of segments and skeletonizing them. This approach leverages the idea that branches can be represented as a network of interconnected segments and a central skeleton. To prepare the data for annotation, I only add skeleton annotations on top of the existing branch annotations. Annotators only draw skeletons represented by their two ends V^s and V^e (see Figure 6.4). This preparatory step is essential for training and evaluating segmentation models effectively.

Figure 6.4 A sample of skeleton annotation based on branch annotation. Each line is drawn based on a pair of the skeleton's ends (V^s, V^e).

6.4.1 Architecture

To facilitate the orchard segmentation process, I propose the Skeleton-lead Segmentation Network (SkeSegNet) and introduce it into Panoptic-DeepLab, in order to address the challenges of segmenting complex branches. SkeSegNet (see Figure 6.5) comprises two key components: the skeleton decoder and the skeleton predictor. The skeleton decoder is responsible for extracting the central skeleton of the branch structure, which serves as a fundamental representation of its topology. The skeleton predictor, on the other hand, generates segment centers and boundaries based on the extracted skeleton, effectively dividing the branch into individual segments. Together, these components work in parallel to achieve a comprehensive and accurate segmentation of complex branching structures, overcoming the limitations of traditional segmentation approaches.

Figure 6.5 The Skeleton-lead Segmentation Network (SkeSegNet) consists of a skeleton decoder and a skeleton predictor. The decoder uses a light-weight decoder module consisting of a single convolution during each upsampling stage. The skeleton prediction is obtained by predicting the skeleton centers, skeleton masks, and skeleton orientations. The SkeGen operation converts them into skeletons and uses our designed fusion method to improve the branch class probability map.

Skeleton Decoder: Taking the output of Panoptic-DeepLab's encoder backbone, which is an ImageNet-pretrained neural network paired with atrous convolution for extracting denser feature maps, I apply a light-weight decoder module [18] as the skeleton decoder to extract skeleton features. The skeleton decoder introduces an additional low-level feature with output stride 8 to the decoder, so that the spatial resolution is gradually recovered by a factor of 2. Additionally, in each upsampling stage the skeleton decoder applies a single 5 × 5 depthwise-separable convolution [51].

Skeleton Predictor: In deep convolutional networks, graph data is hard to represent in feature maps.
Hence, I split each branch segment into three elements, branch center, branch orientation, and branch pixellation, associated with three predictors. The skeleton mask predictor follows the semantic predictor head, using a stack of 5 × 5 and 1 × 1 depthwise-separable convolutions to predict the skeleton mask. The skeleton center predictor and the orientation predictor give the center and the orientation (normalized into (0, 1]). SkeSegNet represents each skeleton by its center, its orientation, and its pixellation (with a fixed width of 5 pixels). For every pixel, the network further predicts the orientation to its corresponding skeleton center. During training, groundtruth line-object centers ((V^s_x + V^e_x)/2, (V^s_y + V^e_y)/2) are encoded by a 2-D Gaussian with a standard deviation of 5 pixels, and the groundtruth line-object orientation is calculated as arctan((V^s_y − V^e_y)/(V^s_x − V^e_x)) for each skeleton pixel. In particular, SkeSegNet adopts the Mean Squared Error (MSE) loss to minimize the distance between the predicted heatmaps and the 2-D Gaussian-encoded groundtruth heatmaps. It uses an L1 loss for the orientation prediction, which is only activated at pixels belonging to skeleton instances.

SkeGen: SkeGen employs a unique approach to transform each skeleton center into a line segment. This transformation process relies on two key components: the skeleton orientations and the skeleton mask associated with each skeleton. By combining the information from the orientation map and the skeleton mask, SkeGen is able to systematically convert each skeleton center into a precisely defined line segment. For each center, SkeGen follows its orientation in both directions to grow a line until there is no mask pixel within a 10-pixel offset of the current end. Overall, SkeGen transforms centers into skeletons represented by pairs (V^s, V^e). Once the skeleton branches are obtained, I can fuse the skeleton map into the semantic prediction map from Panoptic-DeepLab to enhance the branch probability. I fuse the semantic mask of the branch class P with the skeleton map S drawn at width t. The fused result P̂ is computed as follows:

P̂ = λ_1 · P · S + λ_2,   (6.2)

where λ_1, λ_2, and t are fusion hyperparameters, which will be discussed in Section 6.6.

6.4.2 SkeSegNet Training

In the training process, I convert the skeleton ground truth into three parts, mask, centers, and orientations, to feed SkeSegNet (see Figure 6.6). The skeleton mask is generated by drawing each skeleton with a width of 5 pixels. The skeleton orientation map assigns each calculated orientation to its corresponding mask pixels. The skeleton center map assigns 1 to the position of each skeleton's center.

Figure 6.6 During training, the skeleton ground truth is converted to the skeleton mask, skeleton centers, and skeleton orientations for calculating losses with the SkeSegNet predictors. The skeleton mask is generated by drawing each skeleton with t (t = 5) pixels. The skeleton orientations are composed by assigning each calculated orientation to its corresponding mask. The skeleton centers are composed by assigning the position of each skeleton's center as 1.
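The conversion of the skeleton ground truth into the three training targets can be sketched as follows; this is a minimal illustration under my own assumptions (endpoints given as pixel coordinates, OpenCV used only to rasterize the t-pixel-wide segments), not the exact annotation pipeline.

```python
import math
import numpy as np
import cv2  # used here only for line rasterization

def encode_skeleton_targets(endpoints, height, width, t=5, sigma=5.0):
    """Convert skeleton endpoint pairs (V^s, V^e) into the three training targets.

    endpoints: list of ((xs, ys), (xe, ye)) pixel coordinates.
    Returns (mask, centers, orientations) as float arrays of shape (H, W).
    """
    mask = np.zeros((height, width), np.float32)
    centers = np.zeros((height, width), np.float32)
    orientations = np.zeros((height, width), np.float32)

    yy, xx = np.mgrid[0:height, 0:width].astype(np.float32)
    for (xs, ys), (xe, ye) in endpoints:
        # Skeleton mask: draw the segment with a fixed width of t pixels.
        seg = np.zeros((height, width), np.uint8)
        cv2.line(seg, (int(xs), int(ys)), (int(xe), int(ye)), 1, thickness=t)
        mask = np.maximum(mask, seg.astype(np.float32))

        # Orientation arctan((Vs_y - Ve_y)/(Vs_x - Ve_x)), normalized into (0, 1].
        theta = math.atan2(ys - ye, xs - xe)
        norm_theta = (theta % math.pi) / math.pi
        if norm_theta == 0.0:
            norm_theta = 1.0
        orientations[seg > 0] = norm_theta

        # Center heatmap: 2-D Gaussian (sigma = 5 px) around the segment midpoint.
        cx, cy = (xs + xe) / 2.0, (ys + ye) / 2.0
        g = np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * sigma ** 2))
        centers = np.maximum(centers, g)
    return mask, centers, orientations
```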
To avoid affecting the panoptic segmentation results, I use light weights on SkeSegNet when training it together with Panoptic-DeepLab. Panoptic-DeepLab is trained with three loss functions: a weighted bootstrapped cross entropy loss for the semantic segmentation head (L_sem) [126], an MSE loss for the center heatmap head (L_heatmap) [111], and an L1 loss for the center offset head (L_offset) [85]. The final loss L_pan is computed as follows:

L_pan = λ_sem L_sem + λ_heatmap L_heatmap + λ_offset L_offset.   (6.3)

Specifically, I set λ_sem = 1, λ_heatmap = 200, and λ_offset = 0.01 to make sure the losses are of similar magnitude. On the other hand, SkeSegNet is trained with three loss functions, with an L1 sum loss for each skeleton map: L_mask, L_center, and L_orient. The total loss L_ske is computed as follows:

L_ske = λ_mask L_mask + λ_center L_center + λ_orient L_orient,   (6.4)

where λ_mask = 0.1, λ_center = 50, and λ_orient = 1 are set to make sure the losses are of similar magnitude. The final loss function is as follows:

L = λ_ske L_ske + L_pan,   (6.5)

where λ_ske = 0.2 is a light weight that eases the skeletons' effect on the other categories.

6.5 Three-dimensional Structured Branch Generation

The utilization of the by-product skeletons generated by SkeSegNet offers a valuable advantage in the process of constructing 3D branches. These by-product skeletons represent significant structural information within the image. Leveraging this information, I can efficiently create 3D representations of branches in combination with depth maps. To construct 3D branches, the length and width of the branches are obtained first. The length can be calculated from the Euclidean distance between the two ends of a skeleton. To estimate the width, I select 10 points on each skeleton and follow the direction perpendicular to the skeleton to find the branch pixels; the maximum of these is treated as the skeleton's width. With the related depth map, I transform the 2D branch into the 3D branch using the following equations,

z = d,   x = u · d/f,   y = v · d/f,   w = p · d/f,   (6.6)

where (u, v) and (x, y, z) are respectively the 2D and 3D positions of the branch's ends, d is the depth at that pixel, and f is the focal length. Here p is the branch's width in pixels and w is its estimated width in world coordinates (a code sketch of this conversion is given at the beginning of Section 6.6). An example of the 3D branch estimation is shown in Figure 6.7.

Figure 6.7 Three-dimensional representations of apples and branches. I utilize the by-product skeletons from SkeSegNet to construct 3D branches and apples using the related depth map. (a) The predicted skeletons are visualized using blue lines with white circle ends, and the branch mask is visualized in green. (b) The generated 3D branches are shown in green in the front view, and the apples are presented using cuboids with different colors.

The integration of by-product skeletons simplifies and expedites the 3D reconstruction process. Rather than relying solely on depth maps, which may have limitations in accurately capturing fine object boundaries and complex structures, the by-product skeletons act as complementary information sources. The 3D representations of branches will ease the obstacle avoidance planning.

6.6 Experimental Results

In this section, I present the experimental results of our proposed method, focusing on two main tasks: branch segmentation and panoptic segmentation.
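Before turning to these comparisons, the following minimal sketch illustrates the 2D-to-3D branch conversion of Equation 6.6, which the 3D branch evaluation in Section 6.6.3 relies on. It is my own illustrative code, not the system's implementation: it assumes a simple pinhole model with focal length f expressed in pixels, pixel coordinates (u, v) measured from the principal point, and a depth map indexable as depth_map[v, u]; the Branch3D container and function name are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Branch3D:
    start: tuple   # (x, y, z) in meters
    end: tuple
    width: float   # estimated world width in meters

def branch_2d_to_3d(u1, v1, u2, v2, p, depth_map, f):
    """Back-project a 2D skeleton (endpoints in pixels, width p in pixels) to 3D.

    Follows Eq. 6.6: z = d, x = u * d / f, y = v * d / f, w = p * d / f,
    where d is the depth read from the depth map at the pixel.
    """
    def back_project(u, v):
        d = float(depth_map[int(v), int(u)])
        return (u * d / f, v * d / f, d)

    start, end = back_project(u1, v1), back_project(u2, v2)
    # Use the depth at the first endpoint to scale the pixel width to meters.
    d0 = float(depth_map[int(v1), int(u1)])
    return Branch3D(start=start, end=end, width=p * d0 / f)
```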
Additionally, I compare our approach with state-of-the-art methods in these domains. Furthermore, I look into the effectiveness of the generated 3D branches for branch avoidance.

I report the Precision, Recall, and F1-score to evaluate the branch and panoptic segmentation results. All my models are trained using PyTorch on an RTX 2080Ti. I use an initial learning rate of 0.001, fine-tune the batch normalization parameters, perform random-scale data augmentation during training, and optimize with Adam without weight decay. I resize the images to 1025 pixels on the longest side and train my models with a crop size of 1025 × 1025 and a batch size of 64. I set the number of training iterations to 200K.

6.6.1 Branch Segmentation Evaluation

In the evaluation of branch segmentation, I only train the branch class, so the task reduces to semantic segmentation. To obtain the best results, I sweep over different fusion combinations to determine the best fusion parameters (see Equation 6.2). I select t from 0 to 10 at a step of 1, and λ_1 and λ_2 from 0 to 5.0 at a step of 0.1. The best parameters are t = 4, λ_1 = 2.5, and λ_2 = 0.6, and the results are shown in Table 6.1. Compared with Panoptic-Deeplab, the added SkeSegNet improves precision by 8.4, recall by 8.3, and F1-score by 8.5 points.

Table 6.1 Performance comparison between the baseline Panoptic-Deeplab and our proposed SkeSegNet for branch-only segmentation.

                          Precision   Recall   F1-score
Panoptic-Deeplab [20]     62.0        51.2     56.0
SkeSegNet (Ours)          70.4        59.5     64.5

Additionally, I also evaluate other SOTA methods on our own dataset for branch segmentation. Table 6.2 summarizes the performance of my method compared to the SOTA methods on the dataset.

Table 6.2 Performance comparison between the state-of-the-art methods and our proposed SkeSegNet for branch-only segmentation.

                            Precision   Recall   F1-score   FPS
Panoptic-Deeplab [20]       62.0        51.2     56.0       2
Yolov8 [53]                 41.2        34.0     37.5       20
MaskFormer (ViT) [19]       55.0        48.6     51.8       0.3
MaskFormer (Swin-L) [70]    57.0        50.5     53.7       0.3
SkeSegNet (Ours)            70.4        59.5     64.5       2

Our approach consistently outperforms existing methods in branch segmentation, as demonstrated by the superior F1-scores. This indicates the effectiveness of our proposed method in accurately segmenting branches.

6.6.2 Panoptic Segmentation Evaluation

Since our final task is to obtain the whole orchard segmentation, I evaluate my method on panoptic segmentation to ensure that SkeSegNet does not degrade the segmentation results for apples or other stuff classes (e.g., foliage). I evaluate other SOTA methods on our own dataset for panoptic segmentation. Table 6.3 summarizes the performance of my method compared to the SOTA methods on the dataset. Since apple and branch segmentation are essential for our work, I focus on the comparison between the baseline model and SkeSegNet. My method still outperforms the baseline model, with increases of 5.4, 3.8, and 4.6 in branch precision, recall, and F1-score, respectively. Besides, the introduced SkeSegNet does not affect the other categories' segmentation results.

6.6.3 By-product Skeletons Evaluation

To evaluate the effectiveness of the generated skeleton branches, I conduct experiments using four metrics: center error E_c, length error E_l, orientation error E_o, and width error E_w. The metrics are calculated as follows.

Table 6.3 Performance comparison in Precision (P), Recall (R), and F1-score (F1) between the state-of-the-art methods and our proposed SkeSegNet for five-category panoptic segmentation.
                           branch               apple                all
                           P     R     F1       P     R     F1       P     R     F1      FPS
Panoptic-Deeplab [20]      62.0  51.2  56.0     67.0  60.0  63.3     82.4  75.4  80.9    2
Yolov8 [53]                41.6  34.2  37.5     63.5  59.2  61.3     65.0  74.9  69.6    20
MaskFormer (ViT) [19]      55.0  48.6  51.8     67.0  61.4  64.0     79.1  73.0  75.9    0.3
MaskFormer (Swin-L) [70]   57.0  50.5  53.7     66.8  61.5  63.9     79.9  73.2  76.3    0.3
SkeSegNet (Ours)           67.4  55.0  60.6     66.8  60.2  63.3     82.7  75.1  80.8    2

E_c = ||V̂_c − V_c||,  E_l = |L̂ − L|,  E_o = |Ô − O|,  E_w = |Ŵ − W|,   (6.7)

where V_c, L, O, and W are the center, length, orientation, and width of the groundtruth branches, while V̂_c, L̂, Ô, and Ŵ are the ones predicted by SkeSegNet. The average error AE is calculated as (1/n) Σ E over the n test samples. The results are shown in Table 6.4.

Table 6.4 Average errors (AE) for by-product skeleton branches from SkeSegNet. The 2D errors are calculated in pixels. For the 3D errors, each point is converted into a 3D position (cm) using the corresponding depth map (from a RealSense D435i) over 120 images.

       AE_c      AE_l       AE_o       AE_w
2D     2.6 px    13.2 px    4.0 deg    1.0 px
3D     3.5 cm    18.6 cm    5.2 deg    1.4 cm

This evaluation shows that the predicted skeleton branches have small center, orientation, and width errors of 2.6, 4.0, and 1.2, respectively. The error in length is larger than the other metrics, with an error of 13.2, which is caused by the disagreement between the prediction and the ground truth. Due to the randomness of skeleton generation, the predicted skeletons do not always coincide with the ground-truth ones and can be split into different skeletons.

6.7 Summary of the Chapter

SkeSegNet, a novel method, is proposed for improving branch segmentation based on the existing panoptic segmentation work Panoptic-Deeplab. Through rigorous experiments, I demonstrate its effectiveness in accurately segmenting branches, outperforming state-of-the-art methods with a 64.5 F1-score. I also applied this design to a panoptic segmentation task including five categories (i.e., branch, foliage, ground, sky, and apples). The results show that our proposed method still improves the branch F1-score by nearly 5 points over the baseline in the panoptic segmentation task. Additionally, the skeleton branches generated by SkeSegNet show average errors of 2.6 px, 13.2 px, 4.0 degrees, and 1.2 px in center, length, orientation, and width, respectively. The generated 3D skeletons, combined with the depth map, serve as a valuable resource for the design of planning algorithms, opening up new possibilities in the realm of branch avoidance and navigation. However, the skeleton branch generation still has drawbacks, including that the average precision of the 2D skeleton branches is still low and that the generated 3D branches have large errors along the z axis compared with ideal cases. In future studies, I will optimize my methods, e.g., using branch segmentation to improve skeleton prediction, and enhance 2D skeleton generation. Besides, I will utilize a high-precision depth sensor to localize branches and evaluate our 3D skeleton branches with more comprehensive experiments.

CHAPTER 7
CONCLUSION AND FUTURE WORK

The importance of accurate fruit detection and localization algorithms in challenging environments cannot be overstated. As we continue to face evolving agricultural and environmental challenges, the need for precise and efficient methods for harvesting fruits becomes increasingly vital. These algorithms not only enhance productivity but also contribute to sustainable farming practices by reducing waste and resource consumption.
Throughout this dissertation, I presented two novel techniques, suppression Mask RCNN and O2RNet, aimed at addressing the challenges of fruit detection. While the suppression Mask RCNN improves the precision of apple detection, O2RNet demonstrates superior performance in detecting and isolating apples in complex orchard environments, effectively handling occlusions and varying lighting conditions. By utilizing a robust deep learning architecture, O2RNet can adapt to different apple morphologies and improve the overall accuracy of fruit detection. ALACS, on the other hand, employs active laser scanning to generate high-resolution 3D data of the apple's surface, overcoming the limitations of passive imaging techniques. With the combination of the bidirectional relative color enhancement (bRCE) algorithm for laser line extraction and the target position estimation process, ALACS achieves accurate and reliable apple localization. The evaluation results further validate the effectiveness of the ALACS system, showcasing its potential for practical applications in fruit orchards. Lastly, a panoptic segmentation work tailored to branch segmentation, called SkeSegNet, represents a significant advancement. This innovative approach not only enhances the accuracy of branch detection but also offers the valuable by-product of generating 3D branch representations for effective branch avoidance strategies.

Our future work will focus on enhancing 3D skeleton branch generation, fruit tracking, and the classification of disease and ripeness. A high-precision depth sensor will be employed to accurately localize branches, significantly reducing errors in 3D branch generation. This advanced technology promises enhanced accuracy and reliability in our branch modeling processes. The introduction of a fruit tracking method will prevent repetitive picking, thereby accelerating the harvesting time of our robot. Additionally, the enhanced disease and ripeness classification system will not only aid in the early detection of plant diseases but also ensure that fruits are harvested at their peak quality. Through these advancements, I hope to contribute substantially to sustainable agricultural practices, reducing waste and increasing productivity in orchards. Our team is committed to continuous innovation, seeking to integrate the latest technologies to revolutionize fruit farming and set new industry standards.

BIBLIOGRAPHY

[1] Aanis Ahmad, Dharmendra Saraswat, Varun Aggarwal, Aaron Etienne, and Benjamin Hancock. Performance of deep learning models for classifying and detecting common weeds in corn and soybean production systems. Computers and Electronics in Agriculture, 184:106081, 2021.

[2] Nikita Andriyanov, Ilshat Khasanshin, Daniil Utkin, Timur Gataullin, Stefan Ignar, Vyacheslav Shumaev, and Vladimir Soloviev. Intelligent system for estimation of the spatial position of apples based on yolov3 and real sense depth camera d415. Symmetry, 14(1):148, 2022.

[3] Rajkishan Arikapudi and Stavros G Vougioukas. Robotic tree-fruit harvesting with telescoping arms: a study of linear fruit reachability under geometric constraints. IEEE Access, 9:17114–17126, 2021.

[4] Johan Baeten, Kevin Donné, Sven Boedrij, Wim Beckers, and Eric Claesen. Autonomous fruit picking machine: A robotic apple harvester. In Field and service robotics, pages 531–539. Springer, 2008.

[5] Chris H Bahnsen, Anders S Johansen, Mark P Philipsen, Jesper W Henriksen, Kamal Nasrollahi, and Thomas B Moeslund.
3d sensors for sewer inspection: A quantitative review and analysis. Sensors, 21(7):2553, 2021. [6] Min Bai and Raquel Urtasun. Deep watershed transform for instance segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5221–5229, 2017. [7] Dana H Ballard. Generalizing the hough transform to detect arbitrary shapes. Pattern recognition, 13(2):111–122, 1981. [8] [9] Suchet Bargoti and James Underwood. Deep fruit detection in orchards. In 2017 IEEE international conference on robotics and automation (ICRA), pages 3626–3633. IEEE, 2017. Suchet Bargoti and James P Underwood. Image segmentation for fruit detection and yield estimation in apple orchards. Journal of Field Robotics, 34(6):1039–1060, 2017. [10] Meny Benady and Gaines E Miles. Locating melons for robotic harvesting using struc- tured light. Paper-American Society of Agricultural Engineers (USA), 1992. [11] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020. [12] D Bulanon and T Kataoka. Fruit detection system and an end effector for robotic 87 harvesting of fuji apples. Agricultural Engineering International: CIGR Journal, 12(1), 2010. [13] DM Bulanon, TF Burks, and V Alchanatis. Study on temporal variation in citrus canopy using thermal imaging for citrus fruit detection. Biosystems Engineering, 101(2):161–171, 2008. [14] DM Bulanon, TF Burks, and V Alchanatis. Image fusion of visible and thermal images for fruit detection. Biosystems engineering, 103(1):12–22, 2009. [15] Duke M Bulanon, Colton Burr, Marina DeVlieg, Trevor Braddock, and Brice Allen. Development of a visual servo system for robotic fruit harvesting. AgriEngineering, 3(4):840–852, 2021. [16] Martha Cardenas-Weber, Amots Hetzroni, and Gaines E Miles. Machine vision to locate melons and guide robotic harvesting. Paper-American Society of Agricultural Engineers (USA), 1991. [17] Arianna Carnevale, Ilaria Mannocchi, Mohamed Saifeddine Hadj Sassi, Marco Carli, Giovanna De Luca, Umile Giuseppe Longo, Vincenzo Denaro, and Emiliano Schena. Virtual reality for shoulder rehabilitation: Accuracy evaluation of oculus quest 2. Sen- sors, 22(15):5511, 2022. [18] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image seg- mentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018. [19] Zhengsu Chen, Lingxi Xie, Jianwei Niu, Xuefeng Liu, Longhui Wei, and Qi Tian. Visformer: The vision-friendly transformer. In Proceedings of the IEEE/CVF interna- tional conference on computer vision, pages 589–598, 2021. [20] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu, Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12475–12485, 2020. [21] Bowen Cheng, Yunchao Wei, Honghui Shi, Rogerio Feris, Jinjun Xiong, and Thomas In Huang. Revisiting rcnn: On awakening the classification power of faster rcnn. Proceedings of the European conference on computer vision (ECCV), pages 453–468, 2018. [22] Phillip Chlap, Hang Min, Nym Vandenberg, Jason Dowling, Lois Holloway, and An- nette Haworth. A review of medical image data augmentation techniques for deep learning applications. 
Journal of Medical Imaging and Radiation Oncology, 65(5):545– 88 563, 2021. [23] Pengyu Chu, Zhaojian Li, Kyle Lammers, Renfu Lu, and Xiaoming Liu. Deep learning- based apple detection using a suppression mask r-cnn. Pattern Recognition Letters, 147:206–211, 2021. [24] Pengyu Chu, Zhaojian Li, Kaixiang Zhang, Dong Chen, Kyle Lammers, and Renfu Lu. O2rnet: Occluder-occludee relational network for robust apple detection in clustered orchard environments. Smart Agricultural Technology, 5:100284, 2023. https://doi.or g/10.1016/j.atech.2023.100284. [25] Zhao De-An, Lv Jidong, Ji Wei, Zhang Ying, and Chen Yu. Design and control of an apple harvesting robot. Biosystems engineering, 110(2):112–122, 2011. [26] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. [27] Vikram Mutneja Dharampal. Methods of image edge detection: A review. J. Electr. Electron. Syst, 4(2):2332–0796, 2015. [28] Philipe A Dias, Amy Tabb, and Henry Medeiros. Apple flower detection using deep convolutional networks. Computers in Industry, 99:17–28, 2018. [29] LG Divyanth, DS Guru, Peeyush Soni, Rajendra Machavaram, Mohammad Nadimi, and Jitendra Paliwal. Image-to-image translation-based data augmentation for improv- ing crop/weed classification models for precision agriculture applications. Algorithms, 15(11):401, 2022. [30] Rainer G Dorsch, Gerd Häusler, and Jürgen M Herrmann. Laser triangulation: funda- mental uncertainty in distance measurement. Applied optics, 33(7):1306–1314, 1994. [31] Arun Kumar Dubey and Vanita Jain. Comparative study of convolution neural net- work’s relu and leaky-relu activation functions. In Applications of Computing, Au- tomation and Wireless Systems in Electrical Engineering: Proceedings of MARC 2018, pages 873–880. Springer, 2019. [32] Abhishek Dutta and Andrew Zisserman. The VIA annotation software for images, audio and video. In Proceedings of the 27th ACM International Conference on Multi- media, MM ’19, New York, NY, USA, 2019. ACM. [33] Fadi A. Fathallah. Musculoskeletal disorders in labor-intensive agriculture. Applied Ergonomics, 41(6):738 – 743, 2010. Special Section: Selection of papers from IEA 2009. 89 [34] Steven A Fennimore and Matthew Cutulle. Robotic weeders can improve weed control options for specialty crops. Pest management science, 75(7):1767–1774, 2019. [35] Longsheng Fu, Fangfang Gao, Jingzhu Wu, Rui Li, Manoj Karkee, and Qin Zhang. Application of consumer RGB-D cameras for fruit detection and localization in field: A critical review. Computers and Electronics in Agriculture, 177:105687, 2020. [36] Longsheng Fu, Yaqoob Majeed, Xin Zhang, Manoj Karkee, and Qin Zhang. Faster r–cnn–based apple detection in dense-foliage fruiting-wall trees using rgb and depth features for robotic harvesting. Biosystems Engineering, 197:245–256, 2020. [37] Karina Gallardo and P. Galinato. 2012 cost estimates of establishing, producing, and FS099E, Washington State University packing red delicious apples in washington. Extension Fact Sheet, 2012. [38] R Karina Gallardo and Suzette P Galinato. 2012 Cost Estimates of Establishing, Pro- ducing, and Packing Red Delicious Apples in Washington. Washington State University Extension, 2012. [39] Naiyu Gao, Yanhu Shan, Yupei Wang, Xin Zhao, Yinan Yu, Ming Yang, and Kaiqi Huang. Ssap: Single-shot instance segmentation with affinity pyramid. 
In Proceedings of the IEEE/CVF international conference on computer vision, pages 642–651, 2019. [40] Yuanyue Ge, Ya Xiong, Gabriel Lins Tenorio, and Pål Johan From. Fruit localization and environment perception for strawberry harvesting robots. IEEE Access, 7:147642– 147652, 2019. [41] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015. [42] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierar- chies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014. [43] A Gongal, Suraj Amatya, Manoj Karkee, Q Zhang, and Karen Lewis. Sensors and systems for fruit detection and localization: A review. Comput. Electron. Agric., 116:8–19, Aug. 2015. [44] A Gongal, Suraj Amatya, Manoj Karkee, Q Zhang, and Karen Lewis. Sensors and systems for fruit detection and localization: A review. Computers and Electronics in Agriculture, 116:8–19, 2015. [45] Novian Habibie, Aditya Murda Nugraha, Ahmad Zaki Anshori, M Anwar Ma’sum, and Wisnu Jatmiko. Fruit mapping mobile robot on simulated agricultural area in gazebo simulator using simultaneous localization and mapping (slam). In 2017 International 90 Symposium on Micro-NanoMechatronics and Human Science (MHS), pages 1–7. IEEE, 2017. [46] Michael W Hannan and Thomas F Burks. Current developments in automated citrus harvesting. In 2004 ASAE annual meeting, page 1. American Society of Agricultural and Biological Engineers, 2004. [47] Qian Hao, Xin Guo, Feng Yang, et al. Fast recognition method for multiple apple tar- gets in complex occlusion environment based on improved yolov5. Journal of Sensors, 2023, 2023. [48] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017. [49] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [50] Cameron J Hohimer, Heng Wang, Santosh Bhusal, John Miller, Changki Mo, and Manoj Karkee. Design and field evaluation of a robotic apple harvesting system with a 3D-printed soft-robotic end-effector. Trans. ASABE, 62(2):405–414, 2019. [51] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolu- tional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. [52] Yuhua Jiao, Rong Luo, Qianwen Li, Xiaobo Deng, Xiang Yin, Chengzhi Ruan, and Weikuan Jia. Detection and localization of overlapped fruits application in an apple harvesting robot. Electronics, 9(6):1023, 2020. [53] Glenn Jocher, Ayush Chaurasia, and Jing Qiu. YOLO by Ultralytics, January 2023. [54] Glenn Jocher, Ayush Chaurasia, Alex Stoken, Jirka Borovec, Yonghye Kwon, Kalen Michael, Jiacong Fang, Colin Wong, Zeng Yifu, Diego Montes, et al. ultralytics/yolov5: v6. 2-yolov5 classification models, apple m1, reproducibility, clearml and deci. ai inte- grations. Zenodo, 2022. [55] Hanwen Kang and Chao Chen. Fruit detection and segmentation for apple harvesting using visual sensor in orchards. Sensors, 19(20):4599, 2019. [56] Hanwen Kang and Chao Chen. Fast implementation of real-time fruit detection in apple orchards using deep learning. Computers and Electronics in Agriculture, 168:105108, 2020. 
91 [57] Lei Ke, Yu-Wing Tai, and Chi-Keung Tang. Deep occlusion-aware instance segmenta- tion with overlapping bilayers. In Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 4019–4028, 2021. [58] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6399–6408, 2019. [59] Adam Kortylewski, Ju He, Qing Liu, and Alan L Yuille. Compositional convolutional neural networks: A deep architecture with innate robustness to partial occlusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pages 8940–8949, 2020. [60] K Krishna and M Narasimha Murty. Genetic k-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 29(3):433–439, 1999. [61] Steven Lanzisera, David Zats, and Kristofer SJ Pister. Radio frequency time-of-flight distance measurement for low-cost wireless sensor localization. IEEE Sensors Journal, 11(3):837–845, 2011. [62] Nalpantidis Lazaros, Georgios Christou Sirakoulis, and Antonios Gasteratos. Review of stereo vision algorithms: from software to hardware. International Journal of Op- tomechatronics, 2(4):435–462, 2008. [63] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015. [64] Christopher Lehnert, Andrew English, Christopher McCool, Adam W. Tow, and Tris- tan Perez. Autonomous sweet pepper harvesting for protected cropping systems. 2(2):872–879, 2017. [65] P Levi, A Falla, and R Pappalardo. Image controlled robotics applied to citrus fruit harvesting. In 7th International Conference on Robot Vision and Sensory Controls, Zurich (Switzerland), 2-4 Feb 1988. IFS Publications, 1988. [66] Xinye Li and Ding Chen. A survey on deep learning-based panoptic segmentation. Digital Signal Processing, 120:103283, 2022. [67] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ra- manan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. [68] Tianhao Liu, Hanwen Kang, and Chao Chen. Orb-livox: A real-time dynamic sys- tem for fruit detection and localization. Computers and Electronics in Agriculture, 209:107834, 2023. 92 [69] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng- Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, Octo- ber 11–14, 2016, Proceedings, Part I 14, pages 21–37. Springer, 2016. [70] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. [71] Jun Lu, Nong Sang, Yang Hu, and Huini Fu. Detecting citrus fruits with highlight on tree based on fusion of multi-map. Optik, 125(8):1903–1907, 2014. [72] Renfu Lu, Nathan Dickinson, Kyle Lammers, Kaixiang Zhang, Pengyu Chu, and Zhao- jian Li. Design and evaluation of end effectors for a vacuum-based robotic apple har- vester. Journal of the ASABE, 65(5):963–974, 2022. [73] Renfu Lu, Zhao Zhang, and Anand K Pothula. Innovative technology for apple harvest and in-field sorting. Fruit Quarterly, 25(2), 2017. [74] Yuzhen Lu, Dong Chen, Ebenezer Olaniyi, and Yanbo Huang. 
Generative adversarial networks (gans) for image augmentation in agriculture: A systematic review. Comput- ers and Electronics in Agriculture, 200:107208, 2022. [75] Vittorio Mazzia, Aleem Khaliq, Francesco Salvetti, and Marcello Chiaberge. Real-time apple detection system using embedded systems with hardware accelerators: An edge ai application. IEEE Access, 8:9102–9114, 2020. [76] Siddhartha S Mehta, Chau Ton, S Asundi, and Thomas F Burks. Multiple camera fruit localization using a particle filter. Computers and Electronics in Agriculture, 142:139–154, 2017. [77] SS Mehta and TF Burks. Multi-camera fruit localization in robotic harvesting. IFAC- PapersOnLine, 49(16):90–95, 2016. [78] Mohamed Lamine Mekhalfi, Carlo Nicolò, Yakoub Bazi, Mohamad Mahmoud Al Rah- hal, Norah A Alsharif, and Eslam Al Maghayreh. Contrasting yolov5, transformer, and efficientdet detectors for crop circle detection in desert. IEEE Geoscience and Remote Sensing Letters, 19:1–5, 2021. [79] Chiranjivi Neupane, Anand Koirala, Zhenglin Wang, and Kerry Brian Walsh. Evalu- ation of depth cameras for use in fruit localization and sizing: Finding a successor to kinect v2. Agronomy, 11(9):1780, 2021. [80] Chiranjivi Neupane, Anand Koirala, Zhenglin Wang, and Kerry Brian Walsh. Evalu- 93 ation of depth cameras for use in fruit localization and sizing: Finding a successor to kinect v2. Agronomy, 11(9):1780, 2021. [81] Davy Neven, Bert De Brabandere, Marc Proesmans, and Luc Van Gool. Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 8837–8845, 2019. [82] Jason K Norsworthy, Sarah M Ward, David R Shaw, Rick S Llewellyn, Robert L Nichols, Theodore M Webster, Kevin W Bradley, George Frisvold, Stephen B Powles, Nilda R Burgos, et al. Reducing the risks of herbicide resistance: best management practices and recommendations. Weed science, 60(SP1):31–62, 2012. [83] E-C Oerke. Crop losses to pests. The Journal of Agricultural Science, 144(1):31–43, 2006. [84] Piyush Pandey, Hemanth Narayan Dakshinamurthy, and Sierra N Young. Autonomy in detection, actuation, and planning for robotic weeding systems. Transactions of the ASABE, page 0, 2021. [85] George Papandreou, Tyler Zhu, Liang-Chieh Chen, Spyros Gidaris, Jonathan Tomp- son, and Kevin Murphy. Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In Proceedings of the Eu- ropean conference on computer vision (ECCV), pages 269–286, 2018. [86] Hetal N Patel, RK Jain, Manjunath V Joshi, et al. Fruit detection using improved International journal of computer applications, multiple features based algorithm. 13(2):1–5, 2011. [87] Diego Inácio Patrício and Rafael Rieder. Computer vision and artificial intelligence in precision agriculture for grain crops: A systematic review. Computers and electronics in agriculture, 153:69–81, 2018. [88] Tobias Pohlen, Alexander Hermans, Markus Mathias, and Bastian Leibe. Full- resolution residual networks for semantic segmentation in street scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4151–4160, 2017. [89] Feng Qingchun, Zheng Wengang, Qiu Quan, Jiang Kai, and Guo Rui. Study on strawberry robotic harvesting system. In 2012 IEEE International Conference on Computer Science and Automation Engineering (CSAE), volume 1, pages 320–324. IEEE, 2012. [90] W Qiu and SA Shearer. 
Maturity assessment of broccoli using the discrete fourier transform. Transactions of the ASAE, 35(6):2057–2062, 1992. 94 [91] Maryam Rahnemoonfar and Clay Sheppard. Deep count: fruit counting based on deep simulated learning. Sensors, 17(4):905, 2017. [92] Thinal Raj, Fazida Hanim Hashim, Aqilah Baseri Huddin, Mohd Faisal Ibrahim, and Aini Hussain. A survey on lidar scanning mechanisms. Electronics, 9(5):741, 2020. [93] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018. [94] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real- time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015. [95] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 658–666, 2019. [96] Samuel Rota Bulo, Gerhard Neuhold, and Peter Kontschieder. Loss max-pooling for In Proceedings of the IEEE conference on computer semantic image segmentation. vision and pattern recognition, pages 2126–2135, 2017. [97] Inkyu Sa, Zongyuan Ge, Feras Dayoub, Ben Upcroft, Tristan Perez, and Chris McCool. Deepfruits: A fruit detection system using deep neural networks. Sensors, 16(8):1222, 2016. [98] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(60):1–48, 2019. [99] Yongsheng Si, Gang Liu, and Juan Feng. Location of apples in trees using stereoscopic vision. Computers and Electronics in Agriculture, 112:68–74, 2015. [100] Abhisesh Silwal, Joseph R Davidson, Manoj Karkee, Changki Mo, Qin Zhang, and Karen Lewis. Design, integration, and field evaluation of a robotic apple harvester. J. Field Robot., 34(6):1140–1159, 2017. [101] Abhisesh Silwal, Joseph R Davidson, Manoj Karkee, Changki Mo, Qin Zhang, and Karen Lewis. Design, integration, and field evaluation of a robotic apple harvester. Journal of Field Robotics, 34(6):1140–1159, 2017. [102] Peter W Sites and Michael J Delwiche. Computer vision to locate fruit on a tree. Transactions of the ASAE, 31(1):257–0265, 1988. [103] David C Slaughter and Roy C Harrell. Color vision in robotic fruit harvesting. Trans- actions of the ASAE, 30(4):1144–1148, 1987. 95 [104] Denis Stajnko, Miran Lakota, and Marko Hočevar. Estimation of number and diameter of apple fruits in an orchard during the growing season by thermal imaging. Computers and Electronics in Agriculture, 42(1):31–42, 2004. [105] Daobilige Su, He Kong, Yongliang Qiao, and Salah Sukkarieh. Data augmentation for deep learning based semantic segmentation and crop-weed classification in agricultural robotics. Computers and Electronics in Agriculture, 190:106418, 2021. [106] Jun Sun, Bing Lu, HanPing Mao, et al. Fruits recognition in complex background using binocular stereovision. Journal of Jiangsu University-Natural Science Edition, 32(4):423–427, 2011. [107] Mingxing Tan, Ruoming Pang, and Quoc V Le. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10781–10790, 2020. [108] Kanae Tanigaki, Tateshi Fujiura, Akira Akase, and Junichi Imagawa. Cherry- harvesting robot. Computers and electronics in agriculture, 63(1):65–72, 2008. [109] Yunong Tian, Guodong Yang, Zhe Wang, Hao Wang, En Li, and Zize Liang. 
Apple detection during different growth stages in orchards using the improved yolo-v3 model. Computers and electronics in agriculture, 157:417–426, 2019. [110] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one- stage object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9627–9636, 2019. [111] Jonathan J Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. Advances in neural information processing systems, 27, 2014. [112] T Trtík, R Chylík, J Fládr, J Štoller, and I Broukalová. Methods of lighting of concrete structures for high-speed camera measurement. In IOP Conference Series: Materials Science and Engineering, volume 596, page 012041. IOP Publishing, 2019. [113] Nikos Tsoulias, Dimitrios S Paraforos, George Xanthopoulos, and Manuela Zude-Sasse. Apple shape detection based on geometric and radiometric features using a lidar laser scanner. Remote Sensing, 12(15):2481, 2020. [114] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. [115] Juan P Wachs, H I Stern, T Burks, and V Alchanatis. Low and high-level visual feature-based apple detection from multi-modal images. Precision Agriculture, 11:717– 96 735, 2010. [116] Shaohua Wan and Sotirios Goudos. Faster r-cnn for multi-class fruit detection using a robotic vision system. Computer Networks, 168:107036, 2020. [117] Xiaolong Wang, Abhinav Shrivastava, and Abhinav Gupta. A-fast-rcnn: Hard positive generation via adversary for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2606–2615, 2017. [118] Dale Whittaker, GE Miles, OR Mitchell, and LD Gaultney. Fruit location in a partially occluded image. Transactions of the ASAE, 30(3):591–0596, 1987. [119] Henry Williams, Canaan Ting, Mahla Nejati, Mark Hedley Jones, Nicky Penhall, JongYoon Lim, Matthew Seabright, Jamie Bell, Ho Seok Ahn, Alistair Scarfe, et al. Improvements to and large-scale evaluation of a robotic kiwifruit harvester. J. Field Robot., 37(2):187–201, 2020. [120] Qiufeng Wu, Yiping Chen, and Jun Meng. Dcgan-based data augmentation for tomato leaf disease identification. IEEE Access, 8:98716–98728, 2020. [121] Zifeng Wu, Chunhua Shen, and Anton van den Hengel. Bridging category-level and instance-level semantic image segmentation. arXiv preprint arXiv:1605.06885, 2016. [122] Rong Xiang, Huanyu Jiang, and Yibin Ying. Recognition of clustered tomatoes based on binocular stereo vision. Computers and Electronics in Agriculture, 106:75–90, 2014. [123] Ya Xiong, Yuanyue Ge, Lars Grimstad, and Pål J. From. An autonomous strawberry- harvesting robot: Design, development, integration, and field evaluation. J. Field Robot., 37(2):202–224, 2020. [124] Guantao Xuan, Chong Gao, Yuanyuan Shao, Meng Zhang, Yongxian Wang, Jingrun Zhong, Qingguo Li, and Hongxing Peng. Apple detection in natural environment using deep learning algorithms. IEEE Access, 8:216772–216780, 2020. [125] Bin Yan, Pan Fan, Xiaoyan Lei, Zhijie Liu, and Fuzeng Yang. A real-time apple targets detection method for picking robot based on improved yolov5. Remote Sensing, 13(9):1619, 2021. [126] Tien-Ju Yang, Maxwell D Collins, Yukun Zhu, Jyh-Jing Hwang, Ting Liu, Xiao Zhang, Vivienne Sze, George Papandreou, and Liang-Chieh Chen. 
Deeperlab: Single-shot image parser. arXiv preprint arXiv:1902.05093, 2019. [127] Stephen L Young, George E Meyer, and Wayne E Woldt. Future directions for auto- mated weed management in precision agriculture. In Automation: The future of weed control in cropping systems, pages 249–259. Springer, 2014. 97 [128] Tao Yu, Chunhua Hu, Yuning Xie, Jizhan Liu, and Pingping Li. Mature pomegranate fruit detection and location combining improved f-pointnet with 3d point cloud clus- tering in orchard. Computers and Electronics in Agriculture, 200:107233, 2022. [129] Baohua Zhang, Wenqian Huang, Chaopeng Wang, Liang Gong, Chunjiang Zhao, Chengliang Liu, and Danfeng Huang. Computer vision recognition of stem and ca- lyx in apples using near-infrared linear-array structured light and 3d reconstruction. Biosystems Engineering, 139:25–34, 2015. [130] Kaixiang Zhang, Pengyu Chu, Kyle Lammers, Zhaojian Li, and Renfu Lu. Active laser-camera scanning for high-precision fruit localization in robotic harvesting: System design and calibration. arXiv preprint arXiv:2311.02500, 2023. [131] Kaixiang Zhang, Kyle Lammers, Pengyu Chu, Nathan Dickinson, Zhaojian Li, and Renfu Lu. Algorithm design and integration for a robotic apple harvesting system. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 9217– 9224, 2022. https://doi.org/10.1109/IROS47612.2022.9981417. [132] Kaixiang Zhang, Kyle Lammers, Pengyu Chu, Zhaojian Li, and Renfu Lu. System design and control of an apple harvesting robot. Mechatronics, 79:102644, 2021. [133] Kaixiang Zhang, Kyle Lammers, Pengyu Chu, Zhaojian Li, and Renfu Lu. System design and control of an apple harvesting robot. Mechatronics, 79:102644, 2021. [134] Kaixiang Zhang, Kyle Lammers, Pengyu Chu, Zhaojian Li, and Renfu Lu. An au- tomated apple harvesting robot – from system design to field evaluation. Journal of Field Robotics, in press, 2023. [135] Xin Zhang, Long He, Manoj Karkee, Matthew David Whiting, and Qin Zhang. Field evaluation of targeted shake-and-catch harvesting technologies for fresh market apple. Trans. ASABE, 63(6):1759–1771, 2020. [136] Zhao Zhang, Yuzhen Lu, and Renfu Lu. Development and evaluation of an apple infield grading and sorting system. Postharvest Biol. Technol., 180:111588, 2021. [137] Jun Zhao, Joel Tow, and Jayantha Katupitiya. On-tree fruit recognition using texture properties and color data. In 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 263–268. IEEE, 2005. [138] Yuanshen Zhao, Liang Gong, Yixiang Huang, and Chengliang Liu. A review of key techniques of vision-based control for harvesting robot. Computers and Electronics in Agriculture, 127:311–323, 2016. 98