NATURAL LANGUAGE BASED CONTROL AND PROGRAMMING OF ROBOTIC BEHAVIORS

By

Yu Cheng

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical Engineering – Doctor of Philosophy

2019

ABSTRACT

NATURAL LANGUAGE BASED CONTROL AND PROGRAMMING OF ROBOTIC BEHAVIORS

By

Yu Cheng

Robots have been transforming our daily lives by moving from controlled industrial lines to unstructured and dynamic environments such as homes, offices, or outdoors, working closely with human co-workers. Accordingly, there is an emerging and urgent need for human users to communicate with robots through natural language (NL) due to its convenience and expressiveness, especially for technically untrained people. Nevertheless, two fundamental problems remain unsolved for robots working in such environments. On one hand, controlling robot behaviors in environments that are dynamic due to the presence of people is still a daunting task. On the other hand, robot skills are usually preprogrammed, while an application scenario may require a robot to perform new tasks. Programming a new skill to a robot using NL on the fly also requires tremendous effort.

This dissertation tackles these two problems in the framework of supervisory control. On the control aspect, it will be shown that ideas drawn from discrete event dynamic systems can be used to model environmental dynamics and to guarantee safety and stability of robot behaviors. Specifically, the procedures to build the robot behavioral model and the criteria for model property checking will be presented. As there is an enormous variety of utterances in language at different abstraction levels, a hierarchical framework is proposed to handle tasks of different logic depth. Behavior consistency and stability under the hierarchy are discussed. On the programming aspect, a novel online programming via NL approach that formulates the problem in state space is presented. This method can be applied on the fly without terminating robot execution. The advantage of such a method is that there is no need to laboriously label data for skill training, which is required by traditional offline training methods. In addition, integrated with the developed control framework, the newly programmed skills can also be applied in dynamic environments.

In addition to the developed robot control approach that translates language instructions into symbolic representations to guide robot behaviors, a novel approach that transforms NL instructions into scene representations is presented for robot behavior guidance, such as robotic drawing, painting, etc. Instead of using a local object library or direct text-to-pixel mappings, the proposed approach utilizes knowledge retrieved from Internet image search engines, which helps to generate diverse and creative scenes. The proposed approach allows interactive tuning of the synthesized scene via NL. This helps to generate more complex and semantically meaningful scenes, and to correct training errors or bias.

The success of robot behavior control and programming relies on correct estimation of task implementation status, which is comprised of robotic status and environmental status. Besides vision information to estimate environmental status, tactile information is heavily used to estimate robotic status.
In this dissertation, correlation based approaches have been developed to detect slippage occurrence and slipping velocity, which provide the grasp status to the higher symbolic level and are used to control the grasp force at the lower continuous level. The proposed approaches can be used with different sensor signal types and are not limited to customized designs.

The proposed NL based robot control and programming approaches in this dissertation can be applied to other robotic applications, and help to pave the way for flexible and safe human-robot collaboration.

Copyright by
YU CHENG
2019

ACKNOWLEDGEMENTS

First of all, I want to express my grateful acknowledgment to my advisors, Dr. Ning Xi and Dr. Lixin Dong, for their encouragement, guidance, kindness, and support. Their insightful vision and high standards have made me a qualified researcher.

I would like to thank my committee members: Dr. Joyce Chai, Dr. Fathi Salem, and Dr. Xiaobo Tan. I highly appreciate their valuable feedback and discussions throughout my study at MSU.

Many thanks go to my lab members: Dr. Yunyi Jia, Dr. Jianguo Zhao, Dr. Liangliang Chen, Dr. Zhiyong Sun, Dr. Bo Song, Dr. Hongzhi Chen, Dr. Erick Nieves, Dr. Yongliang Yang, Xiao Zeng, Lai Wei, etc., for their help and the happiness we have enjoyed together. I also want to thank Dr. Zhenxue Chen, Dr. Daoxiong Gong, Dr. Zhanxin Zhou, Dr. Jiatong Bao, Dr. Zhihui Deng, Dr. Haichu Chen, Dr. Sheng Bi, etc., for their support to my study and life.

I would like to thank my friends at MSU: Yan Shi, Dr. Zhe Wang, Yuan Liang, Dr. Guangwei Sun, Dr. Jiankun Liu, Dr. Yiqun Yang, Qianwei Jiang, Dr. Jie Li, Biyi Fang, Dr. Mingquan Yuan, Dezhi Feng, etc., for all the help they gave me and the joy we had together. I also would like to thank Dr. Qinghu Meng and Dr. Jun Zhang for their continuous support and encouragement.

Last but not least, I want to thank my parents for their unconditional support and for showing me what love truly means.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
   1.1 Background
   1.2 Challenges and Objectives
   1.3 Literature Review
      1.3.1 Natural Language based Robot Control
      1.3.2 Natural Language based Robot Programming
   1.4 Contributions
   1.5 Outline of This Dissertation
CHAPTER 2 NATURAL LANGUAGE BASED ROBOTIC BEHAVIOR CONTROL
   2.1 Introduction
   2.2 Preliminary on Supervisory Control
   2.3 Proposed Framework
   2.4 Natural Language Control Approach
      2.4.1 Natural Language Processing
      2.4.2 Discrete Controller
      2.4.3 Task Planner
      2.4.4 Lyapunov Stability Analysis of Hierarchical DES
   2.5 Experiment Validation
      2.5.1 Experimental Setup
      2.5.2 Experimental Results
         2.5.2.1 Manipulation in Dynamic Environments
         2.5.2.2 DRM Guided Assembly
   2.6 Summary
CHAPTER 3 NATURAL LANGUAGE BASED ROBOTIC BEHAVIOR PROGRAMMING
   3.1 Introduction
   3.2 Overview of System Framework
   3.3 New Skill Acquisition
   3.4 Experiment Evaluation
      3.4.1 Experiment Setup
      3.4.2 New Behavior Programming
      3.4.3 New Behavior Testing under Exception
   3.5 Summary
CHAPTER 4 ROBOTIC DRAWING CONDITIONED ON NATURAL LANGUAGE DESCRIPTIONS
   4.1 Introduction
   4.2 Related Work
   4.3 Overall Pipeline
   4.4 Robotic Drawing Approach
      4.4.1 Natural Language Processing
      4.4.2 Spatial Layout Generator
      4.4.3 Scene Generator
      4.4.4 Motion Planner
   4.5 Experiments Evaluation
      4.5.1 Experimental Setup
      4.5.2 Scene Generation
      4.5.3 Scene Drawing
      4.5.4 Discussion
   4.6 Summary
CHAPTER 5 TACTILE FEEDBACK FOR NL BASED BEHAVIOR CONTROL AND PROGRAMMING
   5.1 Introduction
   5.2 Data Correlation Approach
      5.2.1 1-D Rank Correlation for Slippage Detection
      5.2.2 2-D Cross Correlation for Slippage Velocity Detection
   5.3 Robot Experimental System
   5.4 Experimental Results
      5.4.1 Translational and Rotational Slippage Detection
      5.4.2 Slippage Detection in Dynamic Environments
      5.4.3 Slippage Velocity Detection
   5.5 Summary
CHAPTER 6 CONCLUSIONS AND FUTURE WORK
   6.1 Conclusions
   6.2 Future Research Work
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: List of primitive events.

Table 2.2: List of states used to represent the task and robot status.

Table 3.1: Pros and cons of existing robot programming approaches.

Table 3.2: Working Mode of Interaction Manager.

Table 4.1: List of basic actions for scene generation.

Table 4.2: Quantitative evaluation results.

LIST OF FIGURES

Figure 2.1: Illustration examples of possible environment dynamics when robots coexist with human co-workers in a shared working area. The upper right subfigure shows an example where the position of the brown block (target) is changed as the robot is approaching it. The lower right subfigure shows an example in which the robot is blocked by a human partner.

Figure 2.2: Overall pipeline of the proposed approach.

Figure 2.3: The deterministic automata models corresponding to each functional module of the arm. (a) Model of the arm body to control position and orientation (Garm). (b) Model of the gripper to control gripper open-close status (Ggripper). The states in green with an entry arrow denote initial states. The states with double circles are marker states. Yellow states represent intermediate states. m represents move, and g represents gripper.

Figure 2.4: Discrete controller modeling pipeline.

Figure 2.5: Synthesized discrete controller.

Figure 2.6: A two-level hierarchical structure.

Figure 2.7: Experimental setup.

Figure 2.8: Snapshots of setup and implementation on scenario 1, "pick up the big brown block on the table". While the robot is approaching the target big brown block, its position and that of the orange block are switched (as shown in subfigure 3).

Figure 2.9: Snapshots of setup and implementation on scenario 2, "Grasp the orange block". During the approach to the target block, the arm body is blocked by its co-worker (as shown in subfigure 3).

Figure 2.10: Initial setup and desired configuration of the task.

Figure 2.11: Description of perceived workpieces through the visual recognition system.

Figure 2.12: Snapshots of robot implementation on the assembly task.

Figure 2.13: ||DRMerror||2 during the task execution.
Figure 3.1: Existing robot programming approaches.

Figure 3.2: Overview of the proposed NL based robot programming framework.

Figure 3.3: New skill learning. (left) A schematic representation of the initial robot knowledge base that comprises pre-programmed primitive actions and their cause and effect. (right) The user has programmed two new skills with already learned knowledge.

Figure 3.4: Experiment platform and setup.

Figure 3.5: Programming the skill of "Sorting by color" with step by step instructions. (a) Natural language commands given by the instructor. (b) Initial system state and temporal states after each implementation. (c) Snapshots of robot execution corresponding to the system states in column (b). (d) The representation of the learned new behavior.

Figure 3.6: Test of the learned skill "Sort by color" in a more complex scenario. After the robot put the red bottle into the box (subfigure 3), two more bottles were added into the working environment (subfigure 4).

Figure 4.1: Robotic painter setting where an instructor describes a scene in NL. The painter infers a scene matching the description and draws the scene on a paper using a marker pen mounted on the robot's end-effector.

Figure 4.2: The overall pipeline of the proposed approach. The relation frames shown in this figure omit the Verb and Property information for simplicity.

Figure 4.3: Conversion of the sentence "a large and long blue bus parked next to an old man" into a relation frame through parsing, tagging, and semantic interpretation.

Figure 4.4: Decentralized plant model of scene generation. Subfigures (a) to (e) are the models of scene generation using objects extracted from retrieved images. Subfigures (f) to (h) are the models for scene tuning. States in green with an entry arrow represent initial states. Yellow colored states are intermediate states. The states in pink with double circles are marked states and also represent the target state of each behavior.

Figure 4.5: Partial models of the shuffled plant model. For simplicity, the two models both start with q0. In fact, the states in the two partial plant models are different states.

Figure 4.6: Partial supervisors for (a) image retrieval and (b) scene synthesis.

Figure 4.7: Overview of the framework of feature extraction and motion planning.

Figure 4.8: Turkers are asked to score how well the scenes match the description on a scale of 1 (very poorly) to 5 (very well). "GT" denotes "Ground Truth". The subjects find our scenes better represent the input sentences than other baseline approaches. In fact, our approach wins over or ties with the ground truth scenes frequently.

Figure 4.9: Qualitative examples of generated scenes conditioned on text descriptions from the MS-COCO dataset, using the proposed approach and baseline methods. The input description and ground truth scenes are shown in the first row.
Figure 4.10: The results of performance comparison of our full model against models without controllability guarantee and nonblocking guarantee, respectively. Blue bars represent scene generation success rate, and brown bars represent average scene generation time.

Figure 4.11: Qualitative examples of simulated robotic drawings of generated scenes with the same parameters.

Figure 4.12: Turkers are asked to score how well the scenes match the description on a scale of 1 (very poorly) to 5 (very well). "GT" denotes "Ground Truth". We achieve absolute scores slightly worse than the ground truth, but better than the baselines.

Figure 4.13: Generation results on scenes that are not likely to happen in the real world. The first column shows the unmodified synthesized scenes. The second column scenes are modified by the instructor with language instructions on the first column scenes. The third column shows scene drawings by the robotic painter.

Figure 5.1: Illustration of data correlation based approaches. The left subfigures are two tactile images (black for taxels in contact with the grasped object, white for no contact).

Figure 5.2: Illustration of 2-D correlation. O1 is the initial object position and O2 the object position after shift.

Figure 5.3: Mobile manipulator used as experimental platform.

Figure 5.4: The robot gripper. The pressure sensor arrays are attached to the gripper's fingers under the silicone rubber.

Figure 5.5: This system is used to evaluate the performance of the second approach. White markers are stuck to the cylinder's surface and face the camera lens.

Figure 5.6: Experimental results of the static case.

Figure 5.7: Experimental results of the translational slippage case.

Figure 5.8: Experimental results of the rotational slippage case.

Figure 5.9: Experimental results of the combined slippage case.

Figure 5.10: Experimental results of the translational slippage case, with the gripper set horizontally.

Figure 5.11: Four basic motions of the robotic arm. Rotation type I is to rotate the gripper when it is perpendicular to the ground. Rotation type II means to rotate the arm rather than the gripper. Rotation type III is almost the same as type I, but with the gripper placed horizontally. The fourth motion is only a translational movement of the arm without any rotation.

Figure 5.12: Experimental results of rotation type I in the static case.

Figure 5.13: Experimental results of rotation type I in the slippage case.

Figure 5.14: Experimental results of rotation type II in the slippage case.

Figure 5.15: Experimental results of rotation type III in the slippage case.
Figure 5.16: Experimental results of translational movement in the slippage case.

Figure 5.17: Comparison of experimental results. The top plots represent sliding distance calculated from 2-D cross correlation, algorithm1 [1], modified algorithm1, and ground truth. The bottom plots are the sliding velocities computed from the sliding distance of 2-D cross correlation, modified algorithm1, and ground truth.

CHAPTER 1

INTRODUCTION

1.1 Background

Robots have been widely used in controlled environments to replace human workers in 3D (dangerous, dull, and dirty) jobs. Robots are known for their higher accuracy and efficiency, increasing product/service quality and thus profitability, and longer working hours. On the other hand, there are still some limitations in terms of the type of tasks they can perform, which is mainly due to limited knowledge and cognitive capability. To make use of the strengths of robots in more applications, there has been an emerging and increasingly urgent need for robots and people to collaborate, since humans can compensate for inadequate knowledge and intelligence.

Successful collaboration between robots and human users requires effective and informative interaction means for information exchange and behavior control. To take full advantage of human skills, it is important that intuitive user interfaces are properly designed, so that human users can easily program and interact with robots. A variety of methods has been used, such as teach pendants, guiding, programming languages, graphical interfaces, body language, etc. Among them, using NL to control and program robots is very attractive because it has several advantages compared with other approaches. First, due to the expressiveness of NL, it is capable of representing domain-general tasks with concepts of different logic depth. Second, NL is intuitive and friendly, which makes it convenient for technically untrained users. Third, using NL frees the users' hands so that they can focus on the task itself.

With these advantages, NL based robots can be useful for many applications. For industrial or domestic applications where robots and humans have to collaborate closely in a common area, robots may not behave correctly due to limited sensing ability or insufficient domain knowledge. Human users can tell their robot co-workers the missing information via NL for correct decision making, or guide the robots' behavior directly through language instructions. This helps to improve the robots' adaptivity and safety when coexisting with people in dynamic and complex environments, and to enhance the robots' cognitive ability. In addition, users can teach a robot new skills/concepts based on its already learned knowledge through NL without terminating the robot and reprogramming it by hand-coding. This helps to reduce robot development cost and facilitates wider applications. Furthermore, with NL understanding and interaction ability, the robots can access the Internet for information queries. This allows robots to learn knowledge from this multi-disciplinary knowledge base intended for human use when a human instructor is not available, and helps a robot to become a self-learning assistant.

1.2 Challenges and Objectives

NL based robots as discussed in the previous section will work alongside people and communicate through NL.
Since the environments are complex and dynamic due to the presence of people, and robots cannot understand NL directly, these robots should satisfy three basic requirements for successful task execution in such environments. First, as robots execute tasks specified in NL, they should be able to transfer linguistic input into behavioral output. Second, practical environments are unstructured, so the robots should be able to deliver the task goal under dynamic and uncertain situations. Third, the robots may encounter new task requirements that have not been programmed in advance. In these cases, it is expected that robots can learn new knowledge through language interaction, similar to the way in which an adult teaches a child.

To address these three requirements, three challenges exist. First, NL instructions intended for human use may seem vague and underspecified to robots. It is nontrivial to transfer ambiguous language input into deterministic behavior output. Second, the robots are working in dynamic environments and thus may encounter different situations. It is not practical for human users to specify all the possible specifications to accomplish the assigned task, and it is difficult to tackle environmental uncertainties with underspecified instructions. Third, the physical structure and cognitive capability of robots and humans are completely different. It is difficult to teach a robot new skills through NL in a similar way as teaching a human.

The objectives of the research presented in this dissertation are to investigate how to address the previous three challenges using discrete event system (DES) principles, i.e., how DES theory can be employed for the control and programming of robots so that they can successfully perform tasks assigned in NL (especially in dynamic environments) and be programmed via NL instructions. The specific focuses on the control and programming sides are briefly described in the following.

On the control side, we first focus on how robots can accomplish tasks in dynamic environments, especially when environmental uncertainties affect task execution. Towards this goal, the robot should take environmental dynamics into consideration during modeling. In addition, a correct-by-design mechanism is desired to ensure behavioral safety and stability of robots for practical deployment. The second focus on the control side is to deal with tasks of different logic depth. Although a robot that satisfies the previous objective can perform in dynamic scenarios, it is difficult and unrealistic to process all the possible actions within a single control layer. This can lead to the state explosion problem [2] as the number of actions and states grows. It is critical that the robot can process commands in a hierarchical manner, since some concepts build up from others. The third focus on the control side is to transform linguistic input into a scene representation for behavior control. Existing approaches to NL based robot control transfer NL instructions into symbolic representations to guide robots' behavior. Compared with a symbolic representation, representing the task goal as a scene can contain more details and accomplish tasks that are not suitable for symbolic representations.

On the programming side, we focus on programming new skills to robots in a hierarchical way, since a complex movement can be decomposed into an ordered sequence of primitive actions.
For example, a stack action is comprised of an ordered sequence of move, pick up, and drop actions, and can be taught using these three basic actions [3].

In addition, successful robot control and programming rely on correct estimation of the current robot status in order to properly enable the action of the next step. Besides the encoders used to estimate the robot's positional status, we use tactile information to estimate the gripper status, which aids correct decision making in the subsequent action selection. For instance, if an object is grasped successfully, the robot can continue to the next move; otherwise, the robot has to replan a grasp. We propose correlation based slippage detection, which is able to detect slippage occurrence and slippage velocity, providing perceptive feedback for both high-level planning and lower-level control.

1.3 Literature Review

1.3.1 Natural Language based Robot Control

Starting with SHRDLU [4], using NL to control robotic behavior has received increasing attention from the robotics community. A variety of approaches has been proposed to achieve this goal. We can classify these methods by their assumptions about the robot working environment.

In the first category of approaches, robots are assumed to work in static environments where no dynamics is considered. Based on how the language commands are processed into executable action plans, these approaches can be divided into two sub-categories: logic-based and statistics-based. Here, an action means a preprogrammed or pre-trained action schema.

Logic-based methods translate linguistic commands into executable action plans based on a set of rules extracted from available prior knowledge. The earliest effort in NL based robot control, SHRDLU, developed at MIT by Terry Winograd, uses a rule-based semantic parser to translate NL commands into actions to manipulate blocks in a simulated world [4]. A similar work is also presented in [5]. Lauria et al. use hand-coded grammars to map navigational commands into action templates [6][7]. MacMahon et al. define a set of rules to process navigational commands into a predicate-argument structure [8]. Brenner et al. use rules to map linguistic components to an action and its precondition and effect [9]. Dzifcak et al. use a heuristic-based parser that supports several combinatorial rules to translate NL commands into temporal logic expressions, which specify task goals for action planning based on first-order dynamic logic [10]. Bollini et al. translate plain-text recipes into states in a state-action space, and search for an action sequence which maximizes the reward function as the cooking plan [11].

Statistics-based methods employ data-driven techniques to learn the implicit mapping rules from data instead of using hand-designed explicit ones. They differ in their probabilistic models (e.g., naive Bayes [12], support vector machine (SVM) [13], hidden Markov model (HMM) [14], conditional random field (CRF) [15], etc.), formal representations (e.g., predicate-argument structure [16], λ-calculus [17], graphical representation [18], customized templates [19], etc.), and the features employed for model training. Huang et al. train a semantic parser based on naive Bayes to map NL commands into spatial description clauses consisting of a figure, a verb, a landmark, and a spatial relation for navigational guidance [12][19]. Chen and Mooney train SVM classifiers to generate formal navigational plans according to observations of current states and language input [13].
Takano et al. use HMMs to model robot body movements with relevant words [14]. Misra et al. build a CRF model which is able to infer implicit steps that are not specified explicitly in NL [15]. She et al. employ a general-purpose CCG semantic parser to process NL and generate action frames with predicate-argument structure for manipulation tasks [16]. Matuszek et al. train a probabilistic CCG parser to translate NL navigational commands into a subset of λ-calculus representation which can include control loop structures similar to those in generic programming languages. Kollar et al. propose Generalized Grounding Graphs for object grounding and action selection according to the next language instruction [18]. In addition, there is a tendency to use hybrid approaches for NL based control. Tenorth et al. employ the Stanford parser [20] for syntactic parsing of NL commands retrieved from the Internet and predefined rules to map the parsed results into customized action frames [21]. Lisca et al. use a Markov logic network to capture the relation between language and predefined action schemas, which combines first-order logic and probability theory together in one formalism [22][23].

The second category of approaches assumes robots work in a structured environment, where constrained dynamics is considered and status evolution is totally predictable. Kress-Gazit et al. translate structured English commands into linear temporal logic formulas and then synthesize nondeterministic automata to control robot behaviors [24][25][26]. In their following work, the constraint on structured English is relaxed to NL [27]. The proposed approach is able to recognize failures in controller synthesis due to incomplete or conflicting information from NL commands [28][29]. We propose a framework to translate NL instructions into a target state set, which corresponds to a subset of states in a finite state machine built in advance. This framework can ensure safety and stability of robot behaviors [30]. However, the above approaches are task-dependent: the control models have to be rebuilt when applied to a new environment or task.

1.3.2 Natural Language based Robot Programming

Natural language programming has been the subject of much discussion and conjecture for a long time [31][32]. Some researchers asserted that NL is the most desirable programming language because of its versatility and ease of use for humans, the absence of professional training requirements on users, and the possibility of eventually using speech recognizers for input. We classify NL based robot programming into offline and online programming.

On the offline programming side, the work done by the Naval Postgraduate School could be regarded as the earliest attempt at NL programming [33][34][35]. The goal of this project was to develop a system that would generate a simulation program after a simple English dialogue with a user about a simple queuing problem. IBM's Thomas Watson Research Center developed an NL automatic programming system for business applications with dialogues [36]. Later, in 1977, Lieberman and Wesley from the same center developed a system which uses structured language as input to program mechanical assembly tasks [37]. In 1981, Miller, also from the IBM Thomas Watson Research Center, studied the possibilities of natural language programming based on the state of the art technologies of the time [38]. He summarized three major obstacles to natural language programming.
The first one is the vast difference between NL and programming languages. NL is much richer in grammar and vocabulary than programming languages, which makes it harder to transfer NL into programming language representations. The second is language grounding, i.e., deciding the meaning of words in NL instructions, especially to what extent the word meaning should be determined based on immediate and previous linguistic input. The third one is that the conversion from NL to programming languages relies heavily on shared experience and world knowledge. At the end, he provided two possible solutions: one is to implement an NL interface subject to several constraints, the other is to modify programming languages to include more NL features. These early works focused on translating language commands into robot understandable forms, and they used structured language with a limited vocabulary.

In recent years, the constraint on structured language has been relaxed. Vogel and Jurafsky use reinforcement learning to train a policy for navigational action enablement in accordance with language input and the robot's current state [39]. Branavan et al. present a similar work [40]. Shimizu and Haas develop a Markov model for navigational guidance with NL instructions [41]. Branavan et al. train a CRF model to learn the preconditions and postconditions of actions. Matuszek et al. treat robot programming as a machine translation problem and learn a log-linear model to translate NL commands into a formal representation named Robot Control Language [17]. Artzi and Zettlemoyer train a weighted linear CCG to transform linguistic input into typed λ-calculus [42]. Stenmark and Nugues translate NL instructions into a sequence of predicate-argument frames and encode the frame sequence as a new skill [43][44]. Quirk et al. use log-linear models to translate If-This-Then-That recipes into executable code [45]. Campagna and Ramesh train a long short-term memory (LSTM) recurrent neural network for trigger-action programming.

Online programming allows users to program new skills/tasks to a robot without terminating it. Compared with offline robot programming using NL, less work has been proposed for online robot programming. Lauria et al. propose to represent online programmed behaviors as combinations of learned actions [7]. Rybski et al. encode new skills by recognizing the cause and effect of a navigational task and mapping them into an action template [46][47][48]. Cantrell et al. develop a similar approach [49]. Kollar et al. propose to use dialogues and demonstration to learn new landmark knowledge on the fly [50].

1.4 Contributions

The contributions of this dissertation can be summarized into two aspects: control and programming. For the control aspect, the developed control approach can also be applied to other DES based systems. On the programming aspect, the online programming approach provides a friendly interface and can contribute to lifelong learning.

Based on the DES inspirations, the proposed control approach achieves the following two merits, which distinguish it from other existing NL based robot control methods. First, it is a correct-by-design approach and takes environmental dynamics into consideration, which ensures safety and stability of robot behaviors in dynamic environments. Second, the hierarchical structure alleviates the state explosion problem for large and complex scenarios and can deal with requests at different levels of abstraction.
We derive the conditions for stability in the sense of Lyapunov for the hierarchical structure. Experimental results show that the robot is able to accomplish assigned tasks under uncertainties from its environment.

In addition, we propose to generate scene representations from NL instructions to control robot behaviors, which can be applied to scenarios where symbolic representations may fail, such as NL based robotic drawing, painting, polishing, etc. The robot uses knowledge retrieved from the Internet to generate scenes conditioned on NL, and is capable of dealing with possible uncertainties that can cause failures of scene synthesis, such as contaminated data from the web or detection errors. This is achieved through the proposed scene generation framework that integrates with the control approach developed in the previous sections, and experimental results show that it helps to increase the success rate of scene generation and accelerate the process. Human evaluations show that the generated scenes have higher recognizability and better alignment with the input language descriptions than the state-of-the-art approaches.

The success of robot applications in dynamic environments relies on correct robot state estimation, and gripper status is an important component of the robot state. We propose a correlation based approach for slippage detection to estimate and monitor grasp status. The proposed approach is able to detect slippage velocity without using customized sensor designs and is not limited to a specific sensor signal type.

On the programming side, this dissertation presents a hierarchical programming approach that allows a robot to learn new skills using acquired skills through dialogues and demonstrations. This approach mimics the teaching and learning process in which an adult teaches his/her children, and thus is intuitive and friendly to technically untrained users. Moreover, it only requires one-shot training, which is time-efficient and labor-saving.

1.5 Outline of This Dissertation

The dissertation is divided into three parts: natural language based control, natural language based programming, and an experimental study on slippage detection. Chapter 2 and Chapter 4 present the NL based robot control approach, Chapter 3 discusses the NL based robot programming method, and Chapter 5 presents the data correlation based approach for slippage detection. Chapter 6 concludes the dissertation and outlines future research work. The specific contents of each chapter are as follows.

Chapter 2 presents the NL based control approach. The mathematical model of the controller is first discussed. After that, property analysis for the model is elaborated. Then a hierarchical control framework is presented to process commands of different logic depth. With the hierarchical controller, the stability analysis is performed based on behavior consistency. Finally, we present the implementation details and experimental results.

Chapter 3 introduces the hierarchical online programming approach. First, a comparison of existing programming approaches is reviewed. After that, the interactive programming process is discussed. Finally, we present the testing results on an industrial application.

Chapter 4 discusses the proposed control framework that uses scene representations to guide robot behaviors. First, the overall framework is presented. Then, the modeling to deal with possible uncertainties is described and the problem is formulated in DES form.
Finally, we present experimental results on system performance comparison and on the evaluation of generated as well as drawn scenes using standard metrics and human studies.

Chapter 5 introduces the data correlation based slippage detection method. First, the data correlation approach is introduced. Then the experiment implementation details are presented. Finally, experimental results under static and dynamic environments are presented to validate the proposed approach.

Chapter 6 summarizes the dissertation and outlines future work.

CHAPTER 2

NATURAL LANGUAGE BASED ROBOTIC BEHAVIOR CONTROL

2.1 Introduction

Using natural language (NL) to control robot behaviors makes it easier for people to employ robots, especially for novice users. One of the key challenges is to make robots follow NL instructions successfully. To pave the way toward this goal, research has been focusing on translating instructions in natural language into formal representations [17], controller synthesis [26], task planning [51], requesting information or help from a human partner [52][53], NL programming [54], and knowledge representation [55], etc. However, existing works assume a static working environment, while in practical applications, robots instructed by natural language work in a dynamic common area shared with people. Modeling environmental dynamics into NL-based robot control has received little attention. Figure 2.1 depicts a robot working in highly dynamic environments due to the presence of people. In addition, NL instructions are usually vague and underspecified, in that only the intent of the instructor is provided and details of implementation are ignored. To successfully accomplish the assigned task in a dynamic scenario, a robot must possess the capability to deal with unforeseen events from its surroundings.

In this chapter, we propose a method to model environment dynamics into a high-level robot behavior model. The robot is modeled as a discrete event dynamic system (DEDS), where the system behaves in response to external events that usually happen at possibly unknown and irregular times. Both NL instructions and sensory information are projected into the same state space for the formulation of an action plan that can lead the robot to the desired target state set. The environment dynamics is considered as the unforeseen events occurring in the working environment that cause abrupt changes in the task execution status. The synthesized robot behavior model captures both the robot dynamics and the environment dynamics. In addition, three metrics are used to evaluate and refine robot behaviors to guarantee that the assigned tasks can finally be accomplished.

Figure 2.1: Illustration examples of possible environment dynamics when robots coexist with human co-workers in a shared working area. The upper right subfigure shows an example where the position of the brown block (target) is changed as the robot is approaching it. The lower right subfigure shows an example in which the robot is blocked by a human partner.

The main contributions of this chapter are as follows:

• We propose a novel approach to capture environmental dynamics in a robot model controlled by natural language descriptions. The modeling method improves system adaptivity to dynamic environments. It allows robots to cope with unexpected events that cause abrupt changes on the task progress.

• Our approach can complement the underspecification in natural language instructions. It allows the robot to deliver the task goal without requiring the instructor to enumerate all the possible scenarios and details in the language commands.
• The proposed approach guarantees that the generated behaviors can finally achieve the desired task goal. Our method synthesizes an analyzable robot model satisfying controllability, nonblockingness, and stability to guarantee its dynamic performance. Controllability makes the robot able to process unforeseen environmental events; nonblockingness guarantees that the task goals are accessible; stability guarantees convergence to the final task goals in the presence of finite unforeseen dynamics.

• The hierarchical structure of the approach helps to process concepts of different logic depth and alleviates the state explosion problem.

The rest of this chapter is organized as follows. Section 2.2 presents the basics of supervisory control theory. Section 2.3 illustrates the overall pipeline of the developed control framework. Section 2.4 elaborates the design of both the plant and the controller. Section 2.5 provides experimental configurations and results.

2.2 Preliminary on Supervisory Control

Different from continuous variable dynamic systems (CVDS), where the system behaviors are governed by physical laws and modeled by differential equations, the robotic system is modeled as a discrete event system (DES), where the system behaviors evolve in accordance with rules of operation or algorithms. Supervisory control is one of the modeling theories of DEDS, which focuses on maintaining the system behavior described by a formal language [56]. In the context of supervisory control, a DES is modeled using a five-tuple deterministic finite automaton:

G = {Q, Σ, δ, Q0, Qm}

G denotes the plant to be controlled. Q is a finite nonempty set of states abstracted from the robotic system, denoting the status of the plant. Σ represents the set of events, whose elements drive the system to evolve from one state to another. The events of Σ can be classified into two parts: Σc, in which events can either be enabled or disabled, and Σuc, where events are always enabled by default. By continuously enabling actions from Σ, actions are implemented one by one, and thus a state trajectory as well as an action sequence is produced. Since the language generated by an automaton is comprised of elements from the event set Σ, Σ is also called an alphabet. δ : Σ × Q → Q is the transition function that captures conditional state changes. Q0: the elements of Q0 denote the initial states from which a language or a system starts. Qm ⊂ Q: a subset of the state set Q, called the set of marker states. Usually, the marker states represent successfully completed tasks.

The behaviors of an automaton G can be described by the set of its output event trajectories, also called strings or languages; G is called a language generator. Let Σ* represent the set of all finite strings s comprised of elements from Σ, including the empty string ε. The language L(G) is the set of all event trajectories that are physically possible for the plant:

L(G) = {s : s ∈ Σ*, δ(q0, s) is defined, q0 ∈ Q0}

The marker language Lm(G) describes all the event sequences that can reach the marker states:

Lm(G) = {s : s ∈ Σ*, δ(q0, s) ∈ Qm, q0 ∈ Q0}

The supervisory control scheme separates the open loop dynamics (plant) from the feedback control. This separation is reminiscent of the feedback control scheme adopted in CVDS.
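As a concrete point of reference, the sketch below shows one way the plant G and its languages L(G) and Lm(G) can be represented programmatically. It is only an illustrative Python sketch of the definitions above; the state and event names (a simplified three-state, gripper-like model) are hypothetical and are not the models used later in this chapter.

```python
from dataclasses import dataclass

@dataclass
class Automaton:
    """Deterministic plant G = {Q, Sigma, delta, Q0, Qm} as defined above."""
    states: set        # Q
    events: set        # Sigma
    delta: dict        # partial transition function: (state, event) -> next state
    initial: str       # a single initial state drawn from Q0
    marked: set        # Qm

    def run(self, string):
        """Follow an event string from the initial state.
        Returns the reached state, or None if the string is not in L(G)."""
        q = self.initial
        for sigma in string:
            if (q, sigma) not in self.delta:
                return None
            q = self.delta[(q, sigma)]
        return q

    def in_language(self, string):
        return self.run(string) is not None            # string in L(G)

    def in_marked_language(self, string):
        return self.run(string) in self.marked         # string in Lm(G)

# Hypothetical, simplified gripper-like model with three states.
gripper = Automaton(
    states={"g1", "g2", "g3"},
    events={"close", "touched", "open"},
    delta={("g1", "close"): "g2", ("g2", "touched"): "g3", ("g3", "open"): "g1"},
    initial="g1",
    marked={"g3"},
)
assert gripper.in_marked_language(["close", "touched"])   # reaches marker state g3
```

The later sketches in this chapter reuse the same (state, event) → state dictionary form for transition functions.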
The controller is modeled as a pair (S, ψ), where

S = {X, Σs, ξ, x0, Xm}

is a deterministic automaton with state set X, event set Σs, transition function ξ, initial state x0, and marker state set Xm ⊂ X. ψ is a total function that maps a supervisor state x into a control pattern, defined as:

ψ(x)(σ) = 1, for each σ ∈ Σuc or σ allowed at x
          0, for σ not allowed at x

The supervisor is synthesized by regulating the plant behaviors with physical constraints and task specifications. The sets of its behaviors, represented by formal languages, are denoted as L(S/G) and Lm(S/G), respectively.

2.3 Proposed Framework

The proposed framework is shown in Figure 2.2.

Figure 2.2: Overall pipeline of the proposed approach.

Given NL commands, the robot generates behaviors to accomplish the assigned task through the following sequence of modules:

• Natural language processing (NLP) module: processes input NL instructions into grounded semantic representations, which carry the necessary task specifications. First, the NL input is parsed into formal representations. The linguistic components are then mapped to physical items using graph-based grounding with sensory information dispatched through perceptive feedback.

• Task planner: takes the grounded goal configuration as input and generates an ordered sequence of subtasks. In this work, the operations on each object are considered as a subtask. The manipulation priority of each subtask is determined by their dependency relations [51].

• Discrete controller: takes each subgoal configuration as input and generates action sequences in accordance with the current task status. The introduced event feedback allows the robot to react to environmental uncertainties during task implementation.

The system framework has three loops. The innermost loop is the action level, which monitors and controls the behaviors of the robot and responds to unforeseen events. The intermediate loop is the task level: the status of the system is projected into the state space of the task planner through high-level projection, and completion of the current subtask stimulates the task planner to issue the next subtask. The outermost loop is perceptive feedback, through which the NL processing module grounds linguistic input with physical instances.
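To make the interplay of the three loops concrete, the following Python skeleton sketches one possible execution cycle. All interfaces (nlp, task_planner, controller, robot) are hypothetical placeholders for the modules described above, not an actual implementation.

```python
def execute_instruction(nl_command, nlp, task_planner, controller, robot):
    """Illustrative skeleton of the three feedback loops in Figure 2.2.

    The outermost loop is perceptive feedback used for grounding, the
    intermediate loop issues subtasks one by one, and the innermost loop
    enables actions and reacts to event feedback (including 'exception').
    All objects passed in are assumed, hypothetical interfaces.
    """
    goal = nlp.ground(nl_command, robot.perceive())            # perceptive feedback
    for subgoal in task_planner.plan(goal, robot.perceive()):  # task-level loop
        state = controller.project(robot.current_status())
        while state != subgoal:                                # action-level loop
            action = controller.next_action(state, subgoal)
            event = robot.execute(action)                      # may return 'exception'
            state = controller.update(state, event)            # triggers replanning
```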
Both the dialogue graph and the visual graph can have er- rors due to insufficiencies of the NLP and vision recognition module. An inexact graph matching algorithm is employed to find the best match [59]. This grounding process is in essence to ground action parameters. Finally, the action and its grounded parameters are transformed into a state representation as the grounded semantic representation. This state represents the goal status of the robot and its environment when completing the task. 2.4.2 Discrete Controller In practical application scenarios, a robot may fail due to uncertainties which can be caused by sensor noises, model errors, and environmental dynamics. This section focuses on solving envi- ronmental uncertainties that haven’t been considered in existing NL based control approaches. 16 To tackle dynamics from the environment, it is necessary but nontrivial to model the environ- mental dynamics into the robot behavior model. The robot should be able to respond to language commands and environmental uncertainties, which can both be viewed as discrete events. The basic behaviors of the robot are modeled as automata as introduced in Section 2.2 with domain knowledge. For the rest of this section, We use an example in robotic manipulation for illustration of the discrete controller modeling. Consider a robotic arm performing grasp-move-place tasks. The arm is comprised of two functional modules, the arm body and the gripper. The arm moves the gripper to the target position with appropriate orientation, and the gripper performs open and close operations to grip and release objects. Various combinations of these primitive actions form diverse manipulations. Figure 2.3 shows the model for the arm and the gripper represented as deterministic automata, respectively. Figure 2.3: The deterministic automata models corresponding to each functional module of the arm. (a) model of the arm body to control position and orientation (Garm). (b) model of the gripper to control gripper open-close status (Ggripper). The states in green with an entry arrow denote initial states. The states with double circles are marker states. Yellow states represent intermediate states. m represents move, and g represents gripper. Shuffle the two automata using parallel composition, i.e., Grobot = Garm||Ggripper the shuffled product captures nominal behavior of the robot, which can work under static environ- ment without environmental uncertainties involved. In a dynamic environment, a robot has to cope 17 (b) Gripper plant modelm1m2m3movereachg1g2g3closetouchedg4g5g6fully_closedopenopenfully_openobject_in(a) Arm plant model with events that cause abrupt changes of the task progress state. This abrupt change can be cap- tured using an uncontrollable event exception attached to each pair of state transitions. This event represents the unforeseen and uncontrollable dynamics of the environment due to other partners’ activities that alter the task execution status and are usually unpredictable. However, the effect of these events can be characterized: they cause abrupt state jump of the task execution, and this change can be represented using robot state change. Table 2.1 shows a complete list of primitive actions from the event set. Primitive Events move (pos, ort) reach open close touched Table 2.1: List of primitive events. 
Controllability Description move the arm to the specified pose controllable Target pose has achieved uncontrollable Open the gripper controllable controllable Close the gripper The inner sides of gripper fingers have uncontrollable touched something Object grasped in gripper object_in (obj) Gripper fully closed fully_closed Gripper fully open fully_open exception Denote unforeseen events Actions are parameterized according to the grounding of object and their attributes. uncontrollable uncontrollable uncontrollable uncontrollable After adding the event exception to the plant model, we denote the model as G(cid:48) robot, which has 18 states and 561 transitions. It represents unsupervised behaviors of the robot and some of which are unlikely or not expected to happen. To guarantee reasonable and legal behaviors of the robot, a discrete controller (e.g., supervisor) should be synthesized to remove the impossible or illegal state transitions in accordance with physical constraints and domain application knowledge. To this end, we propose the following specifications that should be satisfied by the robot: 1. The gripper’s open and close status can only be changed by environmental uncertainties. 2. The movement of the arm and the gripper are asynchronous, i.e., the grippers status remains unchanged when the arm is moving, and vice versa. 18 3. To guarantee that the target state set can finally be achieved, the generated supervisor should be refined to be controllable, nonblocking, and asymptotically stable. The checking criteria for the three properties will be illustrated in the next following. Controllability characterizes a system’s capability to satisfy specified requirements under un- controllable events (e.g., the position of a workpiece changed when the arm is approaching it; the arm is blocked by a person walking by, etc.). The following gives the definition of controllability in terms of formal language [56]. Controllability: Let K and L = L be two arbitrary languages over event set Σ. The overbar of L denotes a language set of all the prefix of L, including L itself. K is said to be controllable with respect to L and Σuc if KΣuc ∩ L ⊆ K. If a supervisor is controllable, it is able to drive the system to its target state even encountering unforeseen events from the environment. Algorithms to synthesize a controllable supervisor can be found in [60]. Nonblocking characterizes the property of a system to avoid either being stuck at a state (dead- lock) or being trapped in a subset of states (livelock). The following gives the definition and criteria of nonblocking property [56]. Nonblocking: Let L(S /G) and Lm(S /G) represent the controlled behavior and the marked behavior by the supervisor, respectively. A supervisor satisfying Lm(S /G) = L(S /G) is said to be nonblocking. Nonblockingness of a supervisor guarantees the task state set is accessible. If a supervisor fails to be nonblocking, the criteria can help the designer to identify the blockingness and remove them in advance. Stability. Nonblockingness only guarantees the accessibility of the state set. To check whether the task state can finally be achieved or not, we further analyze the stability of the supervisor, which captures the ability of a system to converge to a steady state. The work by Passino et al. [61] provides the foundation for analyzing the stability as well as asymptotical stability in the sense of 19 Lyapunov for system modelled by automata. 
In this chapter, we choose the Lyapunov function

V(q) = ρ(q, q_m),                                                      (2.1)

where

ρ(q, q_m) = inf{ n : ξ(s, q) = q_m, s = σ_1 σ_2 ... σ_n, σ_i ∈ Σ },    (2.2)

with q ∈ Q and q_m ∈ Q_m, for the system stability analysis.

A discrete event controller can be synthesized by applying the specifications to refine the behaviors of the plant model, which results in a deterministic automaton with 15 states and 27 transitions (shown in Figure 2.5). It can be verified that the synthesized supervisor satisfies controllability and nonblockingness in the presence of uncontrollable events. At each state, the supervisor allows all permissive legal behaviors to happen. In addition, it is straightforward to prove Lyapunov asymptotic stability of the synthesized controller under finite occurrences of the event exception. The operations on automata are implemented using DESUMA [62].

Each state transition is a triplet q_i σ_ij q_j, where q_i, q_j ∈ Q and σ_ij ∈ Σ. The state q_i is the initial state in which the action σ_ij can take place; it is the precondition of action σ_ij. The state q_j is the final state resulting from the action σ_ij applied to the initial state. For a sequence of actions to be realizable, the final state of one action must be compatible with the precondition of the next one. A backward search algorithm is employed to find an action sequence that drives the system from its initial state to the desired target state with the least number of actions [63]. When the task execution is interrupted by an unexpected event that causes an abrupt state change, the robot updates its internal state and replans a new action sequence. Figure 2.4 summarizes the modeling pipeline presented above.

Figure 2.4: Discrete controller modeling pipeline: the plant model and the specifications are used for supervisor synthesis; the synthesized supervisor is checked for the required properties and the behavior is refined until the check passes, at which point the modeling is finished.

2.4.3 Task Planner

The task planner resides at a higher level to deal with tasks expressed in more abstract utterances. It is modeled as a Moore automaton [64], a finite state machine whose outputs depend only on the present state. A Moore machine can be described by a seven-tuple (Q, Σ, δ, q_0, O, X, Q_m), where O is a finite set of symbols denoting output events and X : Q → O is the output transition function. The rest of the components have the same interpretation as presented in the preliminary on supervisory control.

Rather than denoting a single robot state, each state in the task planner represents one task configuration, which aims to capture the whole picture of the task. Each state is represented as an n × n matrix, where n is the number of items/events appearing in the natural language commands, and each element in the matrix captures the dependency relation between two items or events. The elements of the matrix are defined based on logical dependency, i.e., spatial dependency or temporal dependency. The matrix is named the dependency relation matrix (DRM). The value of a relation reflects the relative priority in manipulation or execution.

For a DRM based on spatial dependency, the element at the intersection of the ith row and jth column of the DRM is defined as:

[DRM]_ij =
    0,        relation with the object itself,
    1,        the ith and jth objects are not directly dependent,
    2 (−2),   the ith object is directly on (under) the jth object,
    3 (−3),   the ith object is directly in (surrounding) the jth object,
    ...,      other relations.                                          (2.3)
For a DRM based on temporal dependency, the element at the intersection of the ith row and jth column of the DRM is defined as:

[DRM]_ij =
    0,        relation with the event itself,
    1,        the ith and jth events have no direct logical precedence,
    2 (−2),   the ith event is directly logically behind (ahead of) the jth event,
    ...,      other relations.                                          (2.4)

Figure 2.5: Synthesized discrete controller.

Each object/event is viewed as a subtask. In this work, the NLP module translates NL instructions into a set of states corresponding to the desired final states and organizes them in matrix form. This target state set only captures the relations that are explicitly stated in the language commands.

The DRM plays two important roles. First, the instructions may be issued in a mixed order, and simply following that order can cause the task to fail. To figure out a correct order of subtasks, we derive a complete goal configuration represented as a DRM (denoted DRM_goal) that captures all the dependency relations between each pair of items/events using the following transitivity rules:

[DRM]_ij = relation  ⇒
    [DRM]_ji = −relation,       if relation ≠ 1,
    [DRM]_ji = relation,        if relation = 1 and [DRM]_ji = 1,
    [DRM]_ij = −[DRM]_ji,       if relation = 1 and [DRM]_ji ≠ 1,       (2.5)

where relation ∈ R, and R = {..., −2, −1, 0, 1, 2, ...} depends on the relation definition, and

[DRM]_ij = relation  ∧  [DRM]_jk = relation  ⇒  [DRM]_ik = relation.    (2.6)

In the meanwhile, we form the current configuration in DRM form according to the sensory information, denoted DRM_temp, which keeps updating itself with the sensory feedback. Subtracting the initial DRM_temp from DRM_goal gives the difference between the target and initial configurations, denoted DRM_error:

DRM_error = DRM_goal − DRM_temp.                                        (2.7)

The smaller the relation value, the higher the priority of the corresponding subtask; subtasks with higher priorities should be implemented earlier than those with lower priorities. By summing up all the relation values of each item/event against the other items/events, as represented by Equation 2.8, we obtain a vector whose entries are used as the criteria for subtask planning:

f(DRM_error) = [ Σ_{j=1}^{n} a_{1j},  Σ_{j=1}^{n} a_{2j},  ···,  Σ_{j=1}^{n} a_{nj} ]^T,   (2.8)

where a_{ij} = [DRM_error]_ij. The order of the subtasks is obtained by sorting the elements of this vector in ascending order.

The second role of the DRM is to monitor the task execution. After completion of each subtask, DRM_temp should converge to DRM_goal. We use the matrix 2-norm to measure the similarity between DRM_goal and DRM_temp, as shown in Equation 2.9:

V(k) = ||DRM_goal − DRM_temp||_2^2 = Σ_{i=1}^{n} Σ_{j=1}^{n} |g_ij − t_ij|^2,              (2.9)

where g_ij = [DRM_goal]_ij, t_ij = [DRM_temp]_ij, and k ∈ N+. If an abrupt event occurs during the task implementation and causes ΔV = V(k) − V(k−1) > 0, the system stops for task replanning or for further diagnosis.
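As an illustration of how the planner uses the DRM, the following sketch implements the subtask ordering of Equations 2.7 and 2.8 and the progress measure of Equation 2.9. The three-object example and its matrix values are illustrative and are not taken from a particular experiment.

```python
# Sketch of DRM-based subtask ordering (Eq. 2.7-2.8) and execution
# monitoring (Eq. 2.9). Matrix values below are illustrative only.
import numpy as np

def subtask_order(drm_goal, drm_temp, labels):
    """Return subtask labels sorted by ascending row sums of DRM_error."""
    drm_error = drm_goal - drm_temp                  # Eq. 2.7
    priority = drm_error.sum(axis=1)                 # Eq. 2.8
    return [labels[i] for i in np.argsort(priority, kind="stable")]

def monitor(drm_goal, drm_temp):
    """Progress measure V(k) of Eq. 2.9: squared entrywise 2-norm."""
    return float(np.sum((drm_goal - drm_temp) ** 2))

# Illustrative 3-object spatial example (0: self, 1: independent,
# 2/-2: directly on / under), with W1 on W2 and W2 on W3:
labels = ["W1", "W2", "W3"]
goal = np.array([[ 0,  2,  1],
                 [-2,  0,  2],
                 [ 1, -2,  0]])
temp = np.ones((3, 3)) - np.eye(3)    # initially all objects independent
print(subtask_order(goal, temp, labels))   # ['W3', 'W2', 'W1']: base first

V = monitor(goal, temp)   # V(k); recompute after each subtask.
# If V increases between two consecutive checks (dV > 0), an abrupt event
# occurred and the planner stops for replanning or further diagnosis.
```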
Each event from the planner output event set Σoutput has the form operation(object, state), where object denotes the item to manipulate and state represents its target state. This event passes the goal state information to the discrete controller at the next level down. The environmental dynamics modeling and behavior refining can be implemented in a similar way as presented in Section 2.4.2.

2.4.4 Lyapunov Stability Analysis of Hierarchical DES

The task planner and the discrete controller form a hierarchical DES structure. Even though each module is guaranteed to be controllable, nonblocking, and stable at the design phase, this does not ensure the stability of the entire behavior. In this section, we explore the conditions that guarantee the stability of the robot's behaviors. First, the hierarchical structure and some concepts relating to hierarchical behavior are introduced. Then, we derive the conditions that ensure Lyapunov stability of hierarchical DESs.

Figure 2.6: A two-level hierarchical structure.

Figure 2.6 shows a two-level hierarchical system structure [2], consisting of a low-level plant Glow and controller Clow, along with a high-level plant Ghi and controller Chi. Glow is the actual plant to be controlled by Clow, while Ghi is an abstract, simplified model of Glow that is employed for decision-making in the idealized world of Chi. The model Ghi is updated every so often via the information channel labeled Info_low-hi (information low-to-high) from Glow to Ghi. Alternatively, one can interpret Info_low-hi as carrying information sent up by Glow to Ghi. Another information channel, Info_low (low-level information), provides conventional feedback from Glow to its controller Clow, which in turn applies conventional control to Glow via the control channel labeled Ctrl_low (low-level control). Returning to the high level, Ghi is endowed with a control structure, according to which Chi attempts to exercise control over the behavior of Ghi via the control channel Ctrl_hi (high-level control), on the basis of feedback received from Ghi via the information channel Info_hi (high-level information). In actuality, the control exercised by Chi in this way is only virtual, in that the behavior of Ghi is determined entirely by the behavior of Glow through the updating process mediated by Info_low-hi. The structure is completed by the command channel Info_hi-low linking Chi to Clow. The function of Info_hi-low is to convey Chi's high-level control signals as commands to Clow, which must translate these commands into corresponding low-level control signals that actuate Glow via Ctrl_low. State changes in Glow are eventually conveyed in summary form to Ghi via Info_low-hi; Ghi is updated accordingly and then provides the appropriate feedback to Chi via Info_hi. In this way the hierarchical loop is closed. The forward path sequence Info_hi-low → Ctrl_low is conventionally designated 'command and control', while the feedback path sequence Info_low-hi → Info_hi can be interpreted as 'report'.

The high level and the low level represent robot behavior models at different abstraction levels. The behavior consistency of the hierarchical structure plays a critical role in successful task execution and system stability. Next we introduce a few concepts relating to behavior consistency and derive the system stability in the sense of Lyapunov.

Output-control-consistent (OCC): consider a generator G that is a Moore automaton.
It is said to be output-control-consistent if there is no ambiguity in the controllability of its output events.

Strict-output-control-consistent (SOCC): a generator G is said to be strict-output-control-consistent if it is output-control-consistent and it can control the enablement and disablement of its output events independently.

Hierarchical consistency: a pair (Glow, Ghi) is said to possess hierarchical consistency if Glow is SOCC and Ehi ⊆ L(Ghi) is nonempty, closed, and controllable. When hierarchical consistency holds for the pair (Glow, Ghi), every high-level task (represented by a choice of Ehi) will be successfully decomposed and executed in the hierarchical control loop.

Based upon the work on discrete Lyapunov stability [61], we obtain the following results.

Proposition 2.4.1 If the pair (Glow, Ghi) possesses hierarchical consistency and the higher-level system Ghi is stable in the sense of Lyapunov, then the lower-level system Glow is stable in the sense of Lyapunov.

Proof: To prove the proposition, we first prove the existence of an invariant set in Glow, denoted X_m_low. For x_0_hi ∈ X_m_hi, since Ghi is stable,

⇒ X(x_0_hi, E_k, k) ∈ X_m_hi for all E_k such that E_k E ∈ E_v(x_0_hi) and k ∈ N,

where E is an infinite event sequence. For all s ∈ E_v(x_0_hi), according to hierarchical consistency,

⇒ ∃ s_low ⊆ L_low, s_low = {t ∈ L_low | θ(t) = s}
⇒ ∃ X_0_low_s = {x_0_low_s | ∃ t such that δ(x_0, t)!, t ∈ s_low, x_0 ∈ X_low}
⇒ X_m_low = ∪_{i=1}^{n} X_m_low_i,

i.e., there exists an invariant set X_m_low such that for x_0 ∈ X_m_low, X(x_0_low, E_k_low, k) ∈ X_m_low, where E_k_low is such that E_k_low E_low ∈ E_v(x_0_low).

Next we prove the existence of V(x) and of a neighborhood S(X_m_low, r_low), together with the nonincreasing property of V. Select V(x) = ρ_d(x, X), which denotes the distance between the state x and the set X. For the higher-level controller Ghi and the neighborhood S(X_m_hi, r_hi), there exist c_1 > 0 and c_2 > 0 such that V(x) > c_2 ⇒ ρ_d(x, X_m_hi) > c_1 (in this case c_1 = c_2). Similarly, there exist c_3 and c_4 (in this case c_3 = c_4) such that V(x) < c_4 ⇒ ρ_d(x, X_m_hi) < c_3, where V(X(x_0, E_k, k)) is nonincreasing for x_0 ∈ S(X_m_hi, r_hi), 0 < r_hi < ρ_hi.

For x_0_hi ∈ X_0_hi, there exists s such that ξ(x_0, s) ∈ X_m_hi. Assume s = s′s′′ with s′ ∈ E_v_hi(x_0), s′′ ∈ T*, ξ(x_0, s′) ∉ X_m_hi, and ξ(x_0, s′) ∈ S(X_m_hi, r_hi). Since Ghi and Glow are hierarchically consistent,

⇒ ∃ E_t′ ⊆ L_low, E_t′ = {t′ | δ(x_0, t′)!, θ(t′) = s′}
⇒ ∃ r_low such that 0 < r_low < ρ_low.

Set r_low = min{ max{ρ_d(x_01, δ(x_01, t′_1)), ...}, max{ρ_d(x_02, δ(x_02, t′_2)), ...}, ..., max{ρ_d(x_0m, δ(x_0m, t′_m)), ...} }.

⇒ There exists a neighborhood S(X_m_low, r_low) of the invariant set X_m_low such that we can find a V(x) satisfying the Lyapunov stability criteria, i.e., Glow is stable in the sense of Lyapunov.

Finally, it should be noted that although the analysis is carried out on a two-level hierarchy of DESs, this restriction is inessential. Similarly, the following proposition also holds.

Proposition 2.4.2 If the pair (Glow, Ghi) possesses hierarchical consistency and the higher-level system Ghi is asymptotically stable in the sense of Lyapunov, then the lower-level system Glow is asymptotically stable in the sense of Lyapunov.

2.5 Experiment Validation

2.5.1 Experimental Setup

Figure 2.7: Experimental setup.

The experimental setup is configured as shown in Figure 2.7. The speech recognition is done by the Dragon Speech Recognition software.
A flat table is placed in front of the mobile base on which a 7 DOF robot arm is mounted, and a camera is mounted right under the arm to perform object recognition. We use MOPED for object recognition [65].

2.5.2 Experimental Results

2.5.2.1 Manipulation in Dynamic Environments

This section presents two examples that demonstrate the robot's capability of dealing with unforeseen events from its working environment. In these two scenarios, the robot acts as a co-worker that manipulates parts.

In the first scenario, the robot receives the instruction "pick up the big brown block on the table", which is translated into the target state representation

In(block(big, brown), gripper) ∧ At(block(big, brown), gripper)

The goal specification is used to search for a shortest event path driving the robot from the initial state to the specified target state. While the robot is approaching the big brown block, the positions of the orange block and the target block are switched, which causes the current action plan to fail. The robot detects the state change of the task progress through sensory information, and the controller then replans a shortest action path from the current state to the target state. The snapshots in Figure 2.8 show the implementation process.

Figure 2.8: Snapshots of the setup and implementation of scenario 1, "pick up the big brown block on the table". While the robot is approaching the target big brown block, its position and that of the orange block are switched (as shown in subfigure 3).

In the second scenario, the robot is tasked to grasp the orange block. During the task execution, the arm body is blocked by a human co-worker. This situation may happen quite often when robots and people collaborate or live together in a shared common area. The robot detects that its body is blocked and is forced back to the stationary state. After the co-worker leaves, the robot resumes and completes the task. Figure 2.9 shows the snapshots of this scenario.

Figure 2.9: Snapshots of the setup and implementation of scenario 2, "Grasp the orange block". While the robot is approaching the target block, the arm body is blocked by its co-worker (as shown in subfigure 3).

2.5.2.2 DRM Guided Assembly

In this experiment we test the proposed DRM framework on a simulated assembly task. Instead of using MOPED for object recognition, the robotic system uses an RGBD sensor (Kinect) mounted on its left shoulder to identify objects in the working environment. Compared with MOPED, which uses a single camera for object recognition, using an RGBD sensor relaxes the constraint that the object dimensions have to be known a priori. The visual recognition system can detect object features such as color, 3-D position, and dimensions (height, width, and length), as shown in Figure 2.11 [66].

Figure 2.10: Initial setup and desired configuration of the task.

Figure 2.11: Description of the perceived workpieces through the visual recognition system.

Initially, five parts of a product are placed on the worktable without overlaps. The robot is tasked to assemble a block castle as shown in the right subfigure of Figure 2.10.
The list of states used to represent the task and the robot status is shown in Table 2.2; these states denote the robot status and the environment status for high-level planning and control.

Table 2.2: List of states used to represent the task and robot status.

Environmental State | Description
In(a, b) | Object a is in object b
On(a, b) | Object a is on object b
Around(a, b) | Object a is surrounding object b
Under(a, b) | Object a is under object b
InAir(a) | Object a is held in the air
Adjacent(a, b) | Objects a and b are not in any of the above states

Robotic State | Description
In(a, gripper) | The end effector is gripping object a
At(a, gripper) | The end effector is at the position of object a
Idle | Nothing is in the gripper and the gripper is not at any object position

One example of a natural language task description from an untrained user is as follows: "A blue workpiece is on top of the green one. The grey and yellow parts are on the red workpiece. The green workpiece is above the grey and yellow parts." Instead of giving step-by-step guidance, the utterance describes the desired configuration, which is more abstract. In addition, the language instructions are issued in a mixed order. It is impossible for most existing NL based control approaches to finish the task by simply following the given order. By adding the layer of task organization, the system can figure out a reasonable subtask plan based on information from language instructions given in an incorrect order.

As shown in the left subfigure of Fig. 2.10, the state representation of the initial setup is

{Adjacent(W_i, W_j) | i ∈ I, j ∈ I, i ≠ j}

where W_i denotes the ith workpiece and I = {1, 2, 3, 4, 5}. Following the DRM element definition and the rules presented in Equations 2.3, 2.5, and 2.6, the initial DRM for this task is obtained:

DRM_init =
    [ 0  1  1  1  1 ]
    [ 1  0  1  1  1 ]
    [ 1  1  0  1  1 ]
    [ 1  1  1  0  1 ]
    [ 1  1  1  1  0 ]

where the rows from top to bottom correspond to the five objects denoted W1 to W5, following the object indices assigned by the visual recognition system; each row shows one object's dependency relations with the others. From the natural language instructions, a partial set of goal states can be obtained:

On(W4, W1) ∧ On(W3, W5) ∧ On(W2, W5) ∧ On(W1, W3) ∧ On(W1, W2)

Here, "partial" means that direct translation of the natural language only leads to a subset of the DRM_goal representation. The corresponding DRM_goal representation is:

DRM_goal =
    [  0   2   2  −2   2 ]
    [ −2   0   1  −2   2 ]
    [ −2   1   0  −2   2 ]
    [  2   2   2   0   2 ]
    [ −2  −2  −2  −2   0 ]

Subtracting DRM_init from DRM_goal, we have DRM_error:

DRM_error = DRM_goal − DRM_init =
    [  0   1   1  −3   1 ]
    [ −3   0   0  −3   1 ]
    [ −3   0   0  −3   1 ]
    [  1   1   1   0   1 ]
    [ −3  −3  −3  −3   0 ]

Applying the function f(X) defined in Equation 2.8 to DRM_error, we obtain the vector containing the quantified priorities of the subtasks:

f(DRM_error) = [0, −5, −5, 4, −12]^T

corresponding to (W1, W2, W3, W4, W5). Sorting the elements in ascending order gives the manipulation order of the blocks:

W5 ⇒ W2 ⇒ W3 ⇒ W1 ⇒ W4

The assembly task is then implemented with this parts manipulation order. Fig. 2.12 shows the snapshots of the assembly execution.

Figure 2.12: Snapshots of the robot implementation of the assembly task.

The task execution is monitored using the matrix 2-norm of DRM_error. Figure 2.13 shows the plot of ||DRM_error|| during the task execution.

Figure 2.13: ||DRM_error||_2 during the task execution.

We also notice that the workpieces W2 and W3 have the same priority. This means the order of operating these two blocks does not impact the completion of the task.
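For concreteness, the ordering above can be reproduced numerically from the DRM_error just derived; the equal row sums of W2 and W3 make the tie explicit. The snippet below is only a check of this arithmetic.

```python
# Reproducing the subtask ordering of the assembly experiment from DRM_error.
import numpy as np

drm_error = np.array([[ 0,  1,  1, -3,  1],
                      [-3,  0,  0, -3,  1],
                      [-3,  0,  0, -3,  1],
                      [ 1,  1,  1,  0,  1],
                      [-3, -3, -3, -3,  0]])
priority = drm_error.sum(axis=1)                    # Eq. 2.8
print(priority)                                     # [  0  -5  -5   4 -12]
order = [f"W{i + 1}" for i in np.argsort(priority, kind="stable")]
print(" => ".join(order))                           # W5 => W2 => W3 => W1 => W4
```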
Since only one arm is equipped, we randomly select one of the two blocks to manipulate first. This experiment assumes that initially all the parts are scattered without overlaps; otherwise, extra operations are required. If overlaps exist initially, the whole task is split into two phases: phase I removes the overlaps, and phase II generates the task plan and implements the job as illustrated above. This task planner framework can also be integrated with other NL based robot control frameworks to figure out a correct subtask order. In addition, the DRM is hierarchically compositional: it enables complex tasks to be represented from simple ones, which helps to alleviate the state and action explosion as tasks become more complicated and to organize task components in a consistent way.

2.6 Summary

To improve the robustness of robots to environmental uncertainties, this chapter presents a correct-by-design control framework that takes NL as input. Different from existing NL based control approaches, the robot can work in dynamic environments. Moreover, a hierarchical control structure is employed to deal with commands at different logic levels and to compact the state space. Experimental results show that the robot performs well under uncertainties. Furthermore, the control approach presented in this chapter may also be applied to other DESs.

CHAPTER 3
NATURAL LANGUAGE BASED ROBOTIC BEHAVIOR PROGRAMMING

3.1 Introduction

The goal of robot programming is to pass knowledge from a human to the robot, such that the robot is able to execute tasks that it was not able to perform before. In essence, robots are reconfigurable and multifunctional machines. To deploy robots and harness their power, multiple approaches have been proposed to ease the process of programming robots, such as guiding [67], controller-specific languages [68], generic procedural languages [69], graphical programming [70], and programming by demonstration (PbD) [71]. Figure 3.1 shows existing robot programming approaches, and Table 3.1 briefly summarizes their pros and cons.

Figure 3.1: Existing robot programming approaches.

Table 3.1: Pros and cons of existing robot programming approaches.

Robot Programming | Advantages | Disadvantages
Guiding | Immediacy; simple to implement | Does not support complex control structures (e.g., loops, conditions)
Controller-specific language | Simple to use; supports loops and reacting to sensory input | Lack of a universal standard; not suitable for novice users
Generic language | Flexible and powerful | Specific to the robot; not suitable for novice users
Graphical programming | Easy to use | Not as flexible as text-based languages; not suitable for complicated programming
PbD | Allows encoding of very different types of signals | Does not allow reproduction of complicated high-level skills

Guiding is the earliest and most widespread method of programming robots. It involves manually moving the robot to each desired position and recording the internal joint coordinates corresponding to that position. In addition, operations such as closing the gripper or activating a welding gun are specified at some of these positions. The resulting "program" is a sequence of vectors of joint coordinates plus activation signals for external equipment.
Such a program is executed by moving the robot through the specified sequence of joint coordinates and issuing the indicated signals. Robot guiding is simple to use and to implement. Because guiding can be implemented without a general-purpose computer, it was in widespread use for many years before it became cost-effective to incorporate computers into industrial robots. Programming by guiding has some important limitations, particularly regarding the use of sensors. During guiding, the programmer specifies a single execution sequence for the robot; there are no loops, conditionals, or computations. This is adequate for some applications, such as spot welding, painting, and simple materials handling. In other applications, however, such as mechanical assembly and inspection, one needs to specify the desired action of the robot in response to sensory input, data retrieval, or computation. In these cases, robot programming requires the capabilities of a general-purpose computer programming language.

Controller-specific languages were also one of the original methods of controlling industrial robots, and they remain the most common approach today. Every robot controller has some form of machine language, and there is usually a programming language to go with it that can be used to create programs for the robot. These programming languages are usually very simple, with a BASIC-like syntax and simple commands for controlling the robot and the program flow, such as the languages used by KUKA [72] and ABB [73]. Controller-specific languages have some drawbacks [69]. The biggest problem is the lack of a universal standard among languages from different robot manufacturers. If a factory employs robots from different manufacturers, then it has to either train its programmers for each type of robot or pay the manufacturers to develop the desired programs. Either way significantly increases the time and cost of developing new programs.

Generic languages provide an alternative to controller-specific languages for programming robots. Generic here means a high-level, multi-purpose language, such as C++ or Java. These languages have been extended in some way to provide robot-specific functionality, which is particularly common in research environments. The most common extension to a multi-purpose language is a robot abstraction, which is a set of classes, methods, or similar constructs that provides access to common robot functions in a simple way. This removes the need to handle low-level functionality such as setting output ports high to turn on motors or translating raw sensor data. However, these abstractions suffer from the same weakness as the controller-specific languages for industrial robots: they are still specific to the robots they are designed for.

Graphical programming (or visual programming) systems provide an alternative to text-based methods for manual programming [74][75]. They are one small step closer to automatic programming, as they provide a simplified graphical medium for rapid prototyping and implementation of certain systems. They require manual input to specify actions and program flow. Graphical systems typically use a graph, flow-chart, or diagram view of the robot system. Graphical programming is also known to ease the process of providing end-user programming, where the user of an application is able to modify the behavior of the application in some way.
This programming approach has been successfully adopted both by experienced programmers and by non-technical computer users. One advantage of graphical systems is their ease of use, which is achieved at the cost of the flexibility of text-based programming.

At the beginning of the 1980s, PbD started attracting attention in the field of manufacturing robotics. PbD appeared as a promising route to automate the tedious manual programming of robots and as a way to reduce the costs involved in the development and maintenance of robots in a factory. A large body of work uses a symbolic representation for both the learning and the encoding of skills and tasks. This symbolic way of encoding skills may take several forms. One common way is to segment and encode the task according to sequences of predefined actions, described symbolically. Encoding and regenerating the sequences of these actions can be done using classical machine learning techniques, such as HMMs [76]. Often, these actions are encoded in a hierarchical manner. In [77], the authors employ a hierarchical and incremental approach to encode various household tasks, where learning consists in extracting symbolic rules that govern the way each object must be handled. The main advantage of symbolic approaches is that high-level skills can be learned efficiently through an interactive process. However, because of the symbolic nature of their encoding, these methods rely on a large amount of prior knowledge to predefine the important cues and to segment them efficiently.

For learning and encoding a skill at the trajectory level, choosing the variables that encode a particular movement well is crucial, as it already gives part of the solution to the problem of defining what is important to imitate. Work in PbD encodes human movements in joint space, task space, or torque space. The encoding often encompasses the use of dimensionality reduction techniques that project the recorded signals into a latent motion space of reduced dimensionality. These techniques may either perform locally linear transformations [78] or exploit global non-linear methods [79]. The advantage of learning at the trajectory level is that it allows encoding of very different types of signals/gestures. However, it does not allow reproduction of complicated high-level skills.

The robot programming approaches presented above usually require expertise in robotics and coding, and are therefore not suitable for novice users. In addition, robot programming usually happens before robot execution. However, there is an increasing demand for online programming to cope with uncertainties in tasks and working environments. Humans not only need to program the robots before task execution but also, in many instances, need to work cooperatively with the systems to program the robots online in order to accomplish tasks. For example, in automated assembly and material handling operations, multiple robots have to function coordinately or synchronously. A local disturbance on an individual robot can easily affect the overall task synchronization and coordination, resulting in a global catastrophe, since the current state-of-the-art approaches follow the general design method of fixed automation. Typically, a robotic system is only programmed to deal with a small subset of geometric, dynamic, and time-duration uncertainties that can be reliably pre-estimated, and anything that falls outside the pre-estimated local uncertainty bound will cause a global disturbance.
This requires a halt of the overall operation and a reprogramming of the operation. If a robot could be programmed using NL, it would not only make robot programming easier and more convenient, but also significantly improve the scalability and safety of robot operations, especially for unstructured tasks and dynamic environments.

Natural language programming has been researched along either a logic-rule-based methodology or a probability-based methodology [80]. Although many promising results on natural language based robot programming have been achieved, the trained system models suffer from the following problems. First, the training process is usually offline. When the robot encounters a new and unseen scenario, the performance of the robot degrades as the similarity between the test data and the training data decreases. Second, a natural language controlled robot is supposed to successfully complete the given task in a dynamic environment, yet the correctness of the programmed behaviors and the stability of the system are unknown to both the robot itself and the programmer. This creates a major vulnerability of programmed robot operations.

In this chapter, we propose a programming approach that allows non-technically trained users to teach a robot new skills through NL on the fly. New skills can be taught hierarchically using already learned skills and are represented as a combination of states. The proposed programming approach can be integrated with the control approach developed in the last chapter, which allows the robot to learn a skill in a static environment but still perform it in a dynamic environment.

The remainder of this chapter is organized as follows. Section 3.2 introduces the overall pipeline of the proposed framework. Section 3.3 illustrates the proposed programming approach. Finally, Section 3.4 presents the experimental results on an industrial scenario.

3.2 Overview of System Framework

Figure 3.2: Overview of the proposed NL based robot programming framework.

Figure 3.2 presents the proposed interactive robot programming framework. The NL instructions given by a human user are processed through a semantic parser into action frames with a predicate-argument structure [3]. The knowledge base uses action frames as query items and searches for a compatible goal state representation. If there exists a goal state set with respect to the input commands, this symbolic state representation is dispatched to the referential grounding module. The discrete controller takes the grounded state as input and performs the task as illustrated in the last chapter.

If there does not exist a goal state representation corresponding to the input command, the knowledge base notifies the interaction manager to send the human instructor a warning of an undefined mapping and waits for either the activation of the teaching and learning procedure to encode the new command, or another command. Details of the proposed interactive robot programming are illustrated in Section 3.3. After learning the new command, the learned command-state mapping is added to the knowledge base.
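The dispatch logic just described can be summarized in a few lines. The sketch below is a simplified, hypothetical rendering of the knowledge-base lookup: a parsed action frame either maps to a stored goal-state template or switches the interaction manager to the teaching-and-learning mode. All function and field names are illustrative, not part of the actual implementation.

```python
# Hypothetical sketch of the knowledge-base dispatch described above.

knowledge_base = {
    "pick up": {"pre": {"On(x, table)", "At(x, gripper)"},
                "post": {"InAir(x)", "In(x, gripper)"}},
}

def dispatch(action_frame):
    skill = knowledge_base.get(action_frame["action"])
    if skill is None:
        # Undefined mapping: warn the instructor and wait for teaching mode
        return {"mode": "teaching_and_learning", "query": action_frame}
    # Ground the stored template with the frame's argument before execution
    obj = action_frame["object"]
    goal = {s.replace("x", obj) for s in skill["post"]}
    return {"mode": "action_execution", "goal_states": goal}

print(dispatch({"action": "pick up", "object": "bottle"}))
print(dispatch({"action": "sort by color", "object": "bottles"}))  # unknown skill
```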
3.3 New Skill Acquisition

When the robot receives an instruction for which its knowledge base fails to generate a goal state, the interaction manager asks the human instructor for advice on new skill programming. If the response is positive, the interaction manager switches to the teaching and learning mode. Table 3.2 presents the full list of working modes of the interaction manager.

Table 3.2: Working modes of the interaction manager.

Working Mode | Description
Action execution | The system is performing tasks commanded by NL
Teaching and learning | The system is being programmed using NL
Query | State transformation fails

In the teaching and learning mode, the instructor decomposes the skill into a sequence of skills that the robot has already learned, and verbally guides the robot to perform the skill step by step. Before the robot implements the guiding instructions, its own state and the environmental state are recorded and denoted as the initial configuration. After the robot finishes all the guiding instructions, its own state and the environmental state are recorded as the final configuration. The two configurations are compared and the common states are eliminated; the new skill is considered to be the cause of the difference between the initial and final configurations. The two remaining sets of states are considered the preconditions and postconditions of this new skill, respectively. The new skill is represented using the following template and added to the knowledge base:

Precondition : initial_status
Action : new command
Postcondition : final_status

where the initial_status as well as the final_status is a conjunction of states. Also, attributes of objects within the same predicate-argument state, such as color and shape, are compared; these attributes can be extracted from sensory information. Attributes are considered part of the states. Because the attributes of objects play no role in the system model analysis, they are not represented explicitly. However, it should be noted that the same state representation with different attributes is considered a different state. The state definition depends on the robot type and on domain application knowledge. In this work, we use the state definitions presented in Table 2.2.

A complex behavior can be decomposed into a combination of several atomic actions implemented in a specific order. During the teaching phase, the instructor can utilize already learned high-level skills instead of only the primitive actions. This eases the teaching process and makes it time-efficient and user-friendly. A learned skill is represented as state transitions rather than as an ordered sequence of actions demonstrated by the instructor, as in the work presented in [7][46]. The state-based representation has two advantages over the action-based representation. First, it is easier to generalize: the state-based representation focuses on the results caused by the behavior, so the realization process can be flexible and compliant with different environment setups, whereas an action-based representation can only be applied to scenarios with a similar environment setup, or the behavior may fail to deliver the task. Second, the state-based representation can be integrated with our developed control framework. This allows a behavior to be programmed in a static environment but executed in a dynamic environment; the instructor does not have to enumerate all the possible situations to the robot in order to improve the behavior's robustness to uncertainties. Figure 3.3 presents the skill learning paradigm.
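The state-differencing step described above admits a very small implementation. The sketch below is an assumed, simplified version: states common to the initial and final configurations are discarded, and the remaining sets become the pre- and postconditions of the new skill. The state strings mirror the pick up example used later in this chapter.

```python
# Sketch of new-skill template extraction by differencing configurations.

def learn_skill(command, initial_config, final_config):
    common = initial_config & final_config
    return {"action": command,
            "pre": initial_config - common,    # states consumed by the skill
            "post": final_config - common}     # states produced by the skill

initial = {"On(bottle, table)", "At(bottle, gripper)", "On(box, table)"}
final   = {"InAir(bottle)", "In(bottle, gripper)", "On(box, table)"}

skill = learn_skill("pick up", initial, final)
# pre  -> {On(bottle, table), At(bottle, gripper)}
# post -> {InAir(bottle), In(bottle, gripper)}

knowledge_base = {}
knowledge_base[skill["action"]] = skill        # stored for later reuse
```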
For practical applications, the scenarios will usually not be as simple as the one used for teaching and learning. The environmental setup can be more complex, or there may be more workpieces to manipulate with the same command. The first difficulty can be overcome by integrating with the proposed control framework presented in the last chapter. To solve the second problem, the learned skill is represented as:

Precondition : ∧_{i=1}^{n} (initial_status_i)
Action : new command
Postcondition : ∧_{i=1}^{n} (final_status_i)

where n denotes the total number of workpieces to be manipulated. It can be determined from the natural language instructions, from the vision system, or by combining information from both. Considering that the robot may work in an industrial environment where the workpieces are delivered by a conveyor belt, their number keeps changing; accordingly, the total workpiece number n should keep updating.

Figure 3.3: New skill learning. (left) A schematic representation of the initial robot knowledge base, which comprises pre-programmed primitive actions and their causes and effects. (right) The user has programmed two new skills with already learned knowledge.

A behavior causes a change from the initial configuration to the final configuration. Both configurations have the same number of states. Two states with the same parameters (one from the initial configuration and one from the final configuration) are grouped together as a pair:

(initial_status_i, final_status_i)

Each pair of states is regarded as a subtask.

3.4 Experiment Evaluation

3.4.1 Experiment Setup

Natural language processing. NL commands are processed by a CCG semantic parser [57] and translated into action frames with a predicate-argument structure [3]. Object grounding is performed by a graph-based matching algorithm [58]. In the experiments we use typed commands instead of speech and skip the voice capture and speech recognition modules. The NLP modules run on a laptop and communicate with the robot wirelessly.

Robotic platform. We use a 7 degree-of-freedom (DOF) SCHUNK robot arm to test the proposed programming approach. It is mounted on a four-wheeled Segway RMP 440 mobile base to provide mobility when necessary, as shown in Figure 3.4.

Figure 3.4: Experiment platform and setup.

Vision recognition. An RGBD sensor (Kinect) is mounted on the left shoulder of the robotic platform. It performs object recognition and identifies the object type and object attributes (color,
shape, dimensions) [66]. The vision recognition module assigns each detected object an index and frames the object information into an object schema, as shown in the lower left subfigure of Figure 3.4.

3.4.2 New Behavior Programming

In this experiment, we assume an industrial robot that is capable of executing open-move-close manipulations. First, an instructor uses NL to program the pick up and put behaviors to the robot. For pick up, initially the gripper is at the position of the object to be manipulated; the instructor verbally guides the robot to close the gripper until the bottle is firmly grasped, and then the robot lifts the arm. The learned pick up is represented as:

Precondition : On(bottle, table) ∧ At(bottle, gripper)
Action : pick up
Postcondition : InAir(bottle) ∧ In(bottle, gripper)

Similarly, the behavior put can be taught, starting with a bottle grasped in the gripper and ending with the bottle released on the table and the arm lifted back to Idle, which can be represented as:

Precondition : InAir(bottle) ∧ In(bottle, gripper)
Action : put
Postcondition : On(bottle, table) ∧ Idle

A complex behavior can be verbally taught by referring to previously programmed behaviors. Assume a sorting task is expected to be performed by the robot. The instructor first tasks the robot to sort the beer bottles by color, in which each beer bottle is expected to be packaged into a box of the same color. The robot provides a negative response showing that it does not know how to implement the behavior. The instructor then gives step-by-step instructions to the robot employing the newly learned pick up and put behaviors. The initial condition before the teaching is shown in the right subfigure of Figure 3.4: there are two bottles and two cylinder boxes, with colors red and green for each bottle and cylinder box. After the teaching phase, the robot compares the final configuration with the initial configuration. The common states of the two state sets are eliminated, and the remaining two sets of states are assumed to be caused by the behavior sort by color. Figure 3.5 shows the programming procedure.

Figure 3.5: Programming the skill "sort by color" with step-by-step instructions. (a) Natural language commands given by the instructor. (b) Initial system state and temporal states after each implementation. (c) Snapshots of the robot execution corresponding to the system states in column (b). (d) The representation of the learned new behavior.

Accordingly, the general form of the programmed skill sort by color is represented as:

Precondition : ∧_{i=1}^{n} (Adjacent(a_i, b_i) ∧ Idle)
Action : sort by color
Postcondition : ∧_{i=1}^{n} (In(a_i, b_i) ∧ Idle)

where n denotes the number of items to be sorted and is instantiated conditioned on the working environment. In addition, the color attributes of the objects that form the states are ignored in the representation.

3.4.3 New Behavior Testing under Exception

In this experiment, we test the learned sorting behavior under exception. The robot is tasked to perform a bottle sorting task in a new experimental setup, as shown in subfigure 1 of Figure 3.6. Two beer bottles with different colors are placed in the working environment, and three cylinder boxes of different colors are placed in the working environment as well. After receiving the instruction "Sort the beer bottles by color", the robot generates the goal state representation:

∧_{i=1}^{2} (In(bottle_i, box_i) ∧ Idle)

The pairing between bottles and cylinder boxes is based on their color attributes extracted from sensory information through perceptive feedback. Each pair of initial states and goal states is viewed as a subtask. Because there is no compatible yellow bottle to pair with the yellow cylinder box, that box is not included in the goal states. Just as the robot finishes packaging the red beer bottle, one yellow bottle and one red bottle are added to the working environment. This abrupt change of the working environment is updated via perceptive feedback to the NLP module. In the meanwhile, the red bottle and the red cylinder box have already been paired together. As a result, the goal state representation
changes to:

∧_{i=2}^{3} (In(bottle_i, box_i) ∧ Idle)

Because there is no cylinder box matching the newly added red bottle in color, no goal state is generated for the new red bottle. The sorting task is considered completed when there are no more color-consistent but unsorted pairs of bottles and boxes. Figure 3.6 shows a few snapshots of the task implementation.

Figure 3.6: Test of the learned skill "sort by color" in a more complex scenario. After the robot put the red bottle into the box (subfigure 3), two more bottles were added into the working environment (subfigure 4).

3.5 Summary

In this chapter, a novel approach to programming robotic behaviors via NL instructions is presented. It allows human users to program robots on the fly without terminating them, which is beneficial and convenient for human-robot collaboration. Users can verbally teach the robots in a hierarchical manner by referring to previously learned skills, which is time-efficient and allows teaching with concepts at different abstraction levels. The data collected during the training process can be used to improve the capability of natural language processing. In addition, this online programming framework can be integrated with the proposed control framework, which helps to improve the robustness of the learned skills in dynamic environments.

CHAPTER 4
ROBOTIC DRAWING CONDITIONED ON NATURAL LANGUAGE DESCRIPTIONS

4.1 Introduction

Controlling robotic behaviors via NL utilizes the rich expressibility of NL to deliver task requirements, complementary knowledge, and information from a human instructor, which can benefit both the robots and the human users. Existing approaches to NL based robotic control focus on modeling the mapping relationship between language and primitive robotic actions [15][81]. Symbolic representations are generated to represent the task goal and specifications, and are then parameterized for the reasoning of action selection. The success of existing approaches relies on a well-trained language-action mapping and correct language grounding using environmental information from sensors and knowledge from an on-board local knowledge base. However, if the tasks do not depend on environmental information, or there does not exist a well-built knowledge base for the robot to use, it is difficult for the robot to perform the task, such as instructing a robot with NL to draw a scene on paper. As illustrated in Figure 4.1, a robotic painter is tasked to draw a scene with the NL description "A lovely dog sits in front of a sofa in a room. A chair is at the left side of the dog". The robot has to not only infer the context (room), the object types (sofa, dog, chair), the object properties (lovely), and the spatial layout among the objects (in, left, front), but also generate motion plans to draw the scene on paper.
Scene generation and drawing conditioned on NL has many applications. One important application is literacy development. For children who are learning to read and for second language learners, seeing scenes together with language can enhance learning [82]. Another application is as a reading helper for people with learning disabilities or brain damage: the robot can convert textual menus, signs, and operating instructions into graphical representations. In addition, robotic drawing conditioned on NL can extend to similar application scenarios, such as painting and polishing.

Figure 4.1: Robotic painter setting in which an instructor describes a scene in NL. The painter infers a scene matching the description and draws the scene on paper using a marker pen mounted on the robot's end-effector.

There can be a huge number of choices for object types and object properties (e.g., color, shape, texture, etc.), and it is nontrivial to construct a local knowledge base or to train direct text-to-pixel mappings for the robot to use. The success of existing scene generation approaches is limited to clipart scenes [83][84] or single objects [85]; generating realistic and complex scenes is still challenging. Instead of generating scenes via a local knowledge base or by training mappings from data, we propose to directly use information from the Internet for complex scene generation. The Internet can be viewed as a multi-discipline knowledge base collected from diverse distributed sources. However, the Internet is "dirty" because it often contains erroneous or corrupted data [86], which makes it difficult for the robot to use information directly retrieved from the Internet. By introducing a mechanism for dealing with unexpected events during scene generation using the retrieved information, the proposed method can generate complex and semantically meaningful scenes.

The proposed framework for hierarchical scene generation and drawing has three parts: a spatial layout generator, a scene generator, and a motion planner. The spatial layout generator infers the spatial configuration among objects. The scene generator first retrieves images through queries with object types and object properties from an online image source; then the object instances are segmented from the retrieved images and assembled together in accordance with the spatial layout. In addition, the scene generator deals with unexpected events in image retrieval and scene generation, and allows users to modify scenes by replacing objects and changing the location and size of objects via NL before drawing the scene on paper. The motion planner takes the generated scene as input and plans the end-effector trajectories for scene drawing.

The main contributions of this chapter are as follows:

• We propose a novel approach for robots to generate and draw scenes conditioned on NL instructions. Instead of maintaining a knowledge base in limited on-board memory, the proposed method leverages the power of the Internet, which helps to increase the diversity and creativity of the generated scenes and to reduce the burden of knowledge learning and storage.

• A mechanism is developed to tackle unexpected events to guarantee successful knowledge retrieval and scene generation.
• The proposed approach offers an interactive framework and interface to control the scene generation process: users are allowed to modify the synthesized scene by replacing objects and changing the size and location of objects through NL. This helps to correct possible training errors or bias using human experience. As a result, the generated scenes can be more recognizable and better aligned with the descriptions.

• We conduct quantitative evaluations on the MS-COCO dataset as well as qualitative evaluations, and demonstrate improvement in the quality of both the scene generation and the scene drawing over baseline works.

The rest of this chapter is organized as follows. Section 4.2 briefly reviews related work. Section 4.3 provides an overview of the developed framework, and Section 4.4 illustrates the proposed approach in detail. Section 4.5 discusses the experimental results. Finally, Section 4.6 summarizes this chapter.

4.2 Related Work

Text to Scene. Scene generation conditioned on text has been researched along two directions: object level and pixel level. Object-level scene generation synthesizes scenes by placing objects in accordance with an estimated spatial layout, while pixel-level approaches map text to the most likely object pixel distributions trained from data.

For object-level scene synthesis, Zhu et al. [87] use nouns appearing in the text as keywords to search for images of objects and then place the images together to represent the meaning of the input text. Only object type information is employed; visual attributes of objects and the spatial configuration are ignored. Zitnick et al. [83] train a conditional random field model to generate clipart scenes using clipart characters. Chang et al. [84] employ Bayesian probability to generate 3D office scenes using objects from a 3D model database. These two works utilize spatial relations between objects, but the visual attributes of objects are still not considered in scene generation. The proposed approach takes into consideration both the object properties and the specified spatial configuration in scene generation, which helps to generate more diverse and complex scenes.

More recent works focus on using generative models for pixel-level scene generation. Yan et al. [88] use a variational auto-encoder to generate scenes conditioned on visual attributes. Reed et al. [89][90] train generative adversarial networks (GANs) for scene generation. These approaches are limited to single objects, such as birds [85] and flowers [91]. Mansimov et al. [92] synthesize scenes from more complicated scene descriptions (MS-COCO) using a variational recurrent autoencoder with an attention mechanism. Han et al. [93][94] propose stacked GAN structures to generate and refine scenes for better recognizability than single GANs. Dong et al. [95] use a cascaded convolutional neural network and recurrent neural network with a GAN to augment the visual features.
In comparison, the proposed approach allows users to change objects and to modify the position and the size of objects in synthesized scenes through NL in an interactive manner, which helps to correct possible training errors or bias using human experience, and the data can be recorded for later training augmentation. Furthermore, the proposed approach uses information from the Internet, which helps to increase the diversity of the generated scenes.

Natural Language based Robot Control. Using NL to control robot behaviors makes it easier for people to employ robots, especially for novice users. One key step of NL based robot control is to translate the NL commands into formal and unambiguous representations such that robots are able to understand and implement the instructions. Based on how the language commands are processed into executable action plans, existing approaches can be divided into two categories: logic-based and statistics-based.

Logic-based approaches translate linguistic commands into executable action plans based on a set of rules designed from human experience, such as using keywords [48], hand-coded grammars [6], user-defined templates [39], target state sets [30], and formal language representations [17][81].

Statistics-based methods use data-driven algorithms to learn the implicit mapping rules or action selection policies from data instead of hand-designed explicit rules. These methods differ in their probabilistic models (e.g., linear models [41], hidden Markov models [14], Markov logic networks [23], conditional random fields [15], etc.), formal representations (e.g., predicate-argument structure [16], lambda calculus [17], graphical representations [18], etc.), and the features used for training the models.

In comparison to existing approaches that focus on forming a symbolic representation for robotic behavior generation, the proposed approach generates a recognizable scene as the intermediate representation of the instructor's intent to bridge the language and the robot movements. In this way, it can convey more details of the assigned task than pure symbolic representations.

4.3 Overall Pipeline

The overall pipeline of the proposed framework is illustrated in Figure 4.2. Given a language description of a scene, the robotic painter draws the generated scene on paper through the following sequence of modules:

• The natural language processing (NLP) module takes language descriptions as input and generates relation frames (as shown in Figure 4.3). This process is realized through two subsequent steps: first, semantic roles are labeled to identify verbs and their arguments; second, the syntactic structure of each argument is parsed to identify the properties of objects and the spatial relations (Section 4.4.1).

• The spatial layout generator infers a spatial layout of objects using the relation frames generated by the NLP module. A spatial layout is an organized position distribution of objects. An object's position is assumed to depend on its surrounding objects; the position of each object is calculated based on the positional and dimensional information of its dependent objects (Section 4.4.2).

• The scene generator synthesizes a scene according to the estimated spatial layout. It retrieves images from Internet image search engines using the object type and property information, segments objects from the images, and merges each object into a proper context. The robot uses images from a diverse cloud resource and an imperfect object detector.
Unexpected events may happen and cause failure of scene generation. The scene generator deals with these unexpected events and ensures completion of scene generation (Section 4.4.3).

• The motion planner takes the generated scene as input and outputs trajectories for the robotic painter to draw the scene on paper. The motion planner first extracts visual features that are semantically meaningful and then plans the trajectory to draw the feature areas (Section 4.4.4).

Figure 4.2: The overall pipeline of the proposed approach. The relation frames shown in this figure omit the Verb and Property information for simplicity.

4.4 Robotic Drawing Approach

4.4.1 Natural Language Processing

Figure 4.3: Conversion of the sentence "a large and long blue bus parked next to an old man" into a relation frame through parsing, tagging, and semantic interpretation.

To convert the users' instructions into a formal specification, the system must identify the underlying linguistic structure of the description and convert it into a logical representation. This section describes the process of this conversion and its implementation. The users' instructions are processed through a pipeline of natural language components which identify the semantic structure of sentences and their syntactic roles, and create formal representations to be used in spatial layout inference and image query. Unlike many previous natural language systems for robot control, which have relied on per-scenario grammars that combine semantic and syntactic information [10], this work uses a combination of robust, general-purpose components for parsing and tagging the input. An advantage of this approach compared to per-scenario grammars is that the core language models need not be modified across scenarios. This reduces the role of the fragile process of grammar engineering and minimizes the cost of adapting the system to handle commands in new domains.

The NLP module uses a pipeline of domain-general natural language processing components. The input is parsed using the AllenNLP semantic role labeling (SRL) parser to identify verbs and arguments [96]. Each argument is classified into one type of PropBank modifier, which indicates the argument's semantic function in the sentence [97]. For example, as shown in Figure 4.3, the chunk "next to an old man" is identified as the modifier ARGM-LOC to indicate the location of the bus. Then each argument is tagged using the Stanford Log-linear POS Tagger to identify the subject and corresponding properties as well as possible relations with neighboring items appearing in the sentence [20].
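To make this conversion concrete, the following Python sketch assembles a relation frame from the output of an external SRL parser and POS tagger. It is a simplified illustration rather than the actual implementation: the tag-based rules and the split_locative helper are assumptions introduced here, and the tagged input mimics the example of Figure 4.3.

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class RelationFrame:
    agent: Optional[str] = None
    agent_property: List[str] = field(default_factory=list)
    verb: Optional[str] = None
    relation: Optional[str] = None
    theme: Optional[str] = None
    theme_property: List[str] = field(default_factory=list)

def split_locative(tokens: List[Tuple[str, str]]):
    """Split a locative argument such as 'next to an old man' into the spatial
    relation part and the embedded noun phrase (everything after the first DT)."""
    for i, (_, tag) in enumerate(tokens):
        if tag == "DT":
            return tokens[:i], tokens[i + 1:]
    return [], tokens

def build_relation_frame(verb: str, args: Dict[str, List[Tuple[str, str]]]) -> RelationFrame:
    frame = RelationFrame(verb=verb)
    # ARG1 describes the agent object: nouns -> agent, adjectives -> properties.
    for word, tag in args.get("ARG1", []):
        if tag.startswith("NN"):
            frame.agent = word
        elif tag.startswith("JJ"):
            frame.agent_property.append(word)
    # ARGM-LOC encodes the spatial relation and the reference object (theme).
    rel_part, np_part = split_locative(args.get("ARGM-LOC", []))
    content = [w for w, t in rel_part if t.startswith(("NN", "JJ", "RB"))]
    frame.relation = content[-1] if content else None
    for word, tag in np_part:
        if tag.startswith("NN"):
            frame.theme = word
        elif tag.startswith("JJ"):
            frame.theme_property.append(word)
    return frame

# Example: tagged SRL output for the sentence in Figure 4.3.
args = {
    "ARG1": [("a", "DT"), ("large", "JJ"), ("and", "CC"), ("long", "JJ"),
             ("blue", "JJ"), ("bus", "NN")],
    "ARGM-LOC": [("next", "JJ"), ("to", "IN"), ("an", "DT"),
                 ("old", "JJ"), ("man", "NN")],
}
print(build_relation_frame("parked", args))

Running the example prints a frame with Agent: bus, Agent property: large, long, blue, Verb: parked, Relation: next, Theme: man, and Theme property: old, matching Figure 4.3(c).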
The identified information is matched to a relation frame (Agent, Agent property, Verb, Relation, Theme, Theme property). The expected relation (Relation) between the two items (Agent and Theme), the specification of the items (Agent property and Theme property), and the pairwise position dependency (the Agent depends on the Theme) are identified using their tags and syntactic positions.

4.4.2 Spatial Layout Generator

A spatial configuration is required in order to generate scenes using objects segmented from real images. The pairwise spatial relations acquired through the first step of language processing contain partial information about the layout. The right subfigure of the spatial layout generator in Figure 4.2 shows an example of the inferred spatial layout corresponding to the input description "A lovely dog sits in front of a sofa in a room. A chair is at the left side of the dog." The position of each object is determined by the surrounding objects on which it depends, as described in the language. For example, the chair's position is directly dependent on the dog, and the dog's position depends on the sofa. As a result, only after the placement of the sofa and the dog can the chair's position be determined using the positional and dimensional information of the dog and the sofa.

In this work we use a dependency relation matrix (DRM) to determine dependencies among objects [51]. An agent is considered to be dependent on its theme. Using the matched relation frames from language parsing, a DRM can be built as follows:

$$\mathrm{DRM} = \begin{array}{c|cccc}
 & O_1 & O_2 & \cdots & O_n \\ \hline
O_1 & [\mathrm{DRM}]_{11} & [\mathrm{DRM}]_{12} & \cdots & [\mathrm{DRM}]_{1n} \\
O_2 & [\mathrm{DRM}]_{21} & [\mathrm{DRM}]_{22} & \cdots & [\mathrm{DRM}]_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
O_n & [\mathrm{DRM}]_{n1} & [\mathrm{DRM}]_{n2} & \cdots & [\mathrm{DRM}]_{nn}
\end{array}$$

It is an n × n matrix, where n is the number of objects to appear in the scene and O denotes an object. The row and column indices of the DRM enumerate the objects in the same order. Each element of the DRM captures the dependency relation of the row object with respect to the column object, and its value reflects the relative dependency between the two objects. The element at the intersection of the ith row and jth column of the matrix is defined as

$$[\mathrm{DRM}]_{ij} = \begin{cases}
0, & \text{relation with the object itself,} \\
1, & \text{the } i\text{th and } j\text{th objects are not dependent,} \\
2\ (-2), & \text{the } i\text{th object supports (is dependent on) the } j\text{th object.}
\end{cases} \tag{4.1}$$

Initially, all elements of the DRM are set to 0. The relation frames only indicate pairwise dependency relations. To find the dependencies of each object on all others, after parameterizing the DRM using the relation frames, we apply the rules represented in Equations 4.2 and 4.3 repeatedly on the DRM until the element values stop changing:

$$[\mathrm{DRM}]_{ij} = relation \;\Rightarrow\;
\begin{cases}
[\mathrm{DRM}]_{ji} = -relation, & \text{if } relation \neq 1, \\
[\mathrm{DRM}]_{ji} = relation, & \text{if } relation = 1 \text{ and } [\mathrm{DRM}]_{ji} = 1, \\
[\mathrm{DRM}]_{ij} = -[\mathrm{DRM}]_{ji}, & \text{if } relation = 1 \text{ and } [\mathrm{DRM}]_{ji} \neq 1,
\end{cases} \tag{4.2}$$

$$[\mathrm{DRM}]_{ij} = relation,\;\; [\mathrm{DRM}]_{jk} = relation \;\Rightarrow\; [\mathrm{DRM}]_{ik} = relation \tag{4.3}$$

where $relation \in R$, $R = \{-2, 0, 1, 2\}$, in accordance with the relation definition in Equation 4.1. The more items on which an object's position is dependent, the more positive numbers appear in the corresponding row of the DRM.
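A minimal sketch of this fixpoint propagation is given below. It is an illustration under assumptions rather than the actual implementation: the sign handling follows Equation 4.1 literally, the independence cases (value 1) of Equation 4.2 are omitted for brevity, and the object indices correspond to the example of Figure 4.2.

import numpy as np

def propagate_drm(drm, max_iters=100):
    """Apply the antisymmetry rule (Eq. 4.2) and the transitivity rule (Eq. 4.3)
    repeatedly until the matrix stops changing."""
    n = drm.shape[0]
    for _ in range(max_iters):
        before = drm.copy()
        for i in range(n):
            for j in range(n):
                r = drm[i, j]
                if i == j or r == 0 or r == 1:
                    continue
                drm[j, i] = -r                      # Eq. 4.2: antisymmetry
                for k in range(n):
                    if k != i and drm[j, k] == r:
                        drm[i, k] = r               # Eq. 4.3: transitivity
        if np.array_equal(drm, before):
            break
    return drm

# Objects: 0 = room, 1 = sofa, 2 = dog, 3 = chair (the example of Figure 4.2).
# An agent is dependent on its theme: [DRM]_agent,theme = -2, [DRM]_theme,agent = 2 (Eq. 4.1).
drm = np.zeros((4, 4), dtype=int)
for agent, theme in [(2, 1), (1, 0), (3, 2)]:       # dog->sofa, sofa->room, chair->dog
    drm[agent, theme], drm[theme, agent] = -2, 2
print(propagate_drm(drm))

After propagation, every indirect dependency (for example, between the chair and the room) is also recorded in the matrix.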
If we sum the elements of each row of the DRM, as shown in Equation 4.4, we obtain a vector. A larger value in this vector means that the corresponding object depends on more object instances; if objects are manipulated in a sequential order, a larger value therefore indicates a lower priority in manipulation.

$$f(\mathrm{DRM}) = \left[\, \sum_{j=1}^{n} a_{1j},\; \sum_{j=1}^{n} a_{2j},\; \cdots,\; \sum_{j=1}^{n} a_{nj} \,\right]^{T} \tag{4.4}$$

Usually, the first object in the sequence is the context, because all the other objects are dependent on it. If the context is not explicitly specified in the language, it is inferred from the objects appearing in the scene description:

$$P(context \mid object_{1:m}) = \prod_{i=1}^{m} P(context \mid object_i) \tag{4.5}$$

and the context type that maximizes this probability is selected. In this work we consider five context types: indoor, road, city plaza, rural field, and sea shore (if a context is explicitly specified in the description, the scene generator can retrieve candidate context images from cloud resources; otherwise, the context is inferred from these five categories). The priors for object occurrence in different context types are trained using 1708 scenes from the MS-COCO training dataset (the object detector used in this work can recognize 20 categories of objects, so we filtered the dataset to obtain scenes comprised of objects that lie within this category set):

$$P(context \mid object_i) = \frac{count(object_i \text{ in } context)}{count(context)} \tag{4.6}$$

After determining the context, each time an object is added into the scene, its position is calculated based on the other objects on which it depends. The spatial knowledge is trained using relative location data from [83], which consists of 10020 scenes created using 58 clipart objects. The centroid position of an object is randomly chosen from the intersection area of its relative locations for each spatial relation with a dependent object (theme). As shown in the subfigure of Figure 4.2 circled by the red dashed rectangle, the dark yellow cloud-shaped area represents "left" locations in relation to the dog, and the dark green cloud-shaped area denotes "left" locations in relation to the sofa. The centroid position of the chair is randomly selected from the intersection area (marked in pink). The dimensional ratio between each pair of object types is set a priori.
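The context inference of Equations 4.5 and 4.6 can be sketched as follows. The prior values here are invented for illustration (the actual priors are counted from the filtered MS-COCO training scenes), and the log-space accumulation and the small smoothing constant are assumptions added to keep the toy example numerically stable.

import numpy as np

CONTEXTS = ["indoor", "road", "city plaza", "rural field", "sea shore"]

def infer_context(objects, priors):
    """Pick the context that maximizes the product of per-object priors
    P(context | object), as in Equations 4.5 and 4.6."""
    best, best_score = None, -np.inf
    for ctx in CONTEXTS:
        # Sum of logs instead of a raw product, to avoid underflow.
        score = sum(np.log(priors[ctx].get(obj, 1e-6)) for obj in objects)
        if score > best_score:
            best, best_score = ctx, score
    return best

# Illustrative (made-up) priors for three object categories.
priors = {
    "indoor":      {"dog": 0.4, "sofa": 0.7, "chair": 0.8},
    "road":        {"dog": 0.2, "sofa": 0.01, "chair": 0.05},
    "city plaza":  {"dog": 0.3, "sofa": 0.02, "chair": 0.3},
    "rural field": {"dog": 0.5, "sofa": 0.01, "chair": 0.1},
    "sea shore":   {"dog": 0.3, "sofa": 0.01, "chair": 0.1},
}
print(infer_context(["dog", "sofa", "chair"], priors))   # -> "indoor"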
4.4.3 Scene Generator

In this step, the robot retrieves images containing the specified objects from an image cloud (in this work, the images are retrieved using the Google image search engine; other image search engines would work similarly) and synthesizes a scene using objects segmented from the retrieved images. The robot uses object instances from a "dirty" image cloud that is intended for human use, rather than from manually labeled datasets [83][84]. Unexpected events may happen and cause failures of scene generation, such as failure to detect objects in the retrieved images due to contaminated data. In addition, it is helpful for generating semantically meaningful scenes if a human instructor can tune the synthesized scene through NL. To this end, a dynamic discrete event model is developed to tackle unexpected events and to modify generated scenes interactively. First, the modeling process using supervisory control theory is illustrated in detail. Then, property analysis of the developed model is conducted for behavior refinement to guarantee successful scene generation.

Receiving the matched frames and the estimated spatial layout, the scene generator retrieves images that probably contain the desired objects from the image cloud, segments the objects from their original images, and assembles them together in accordance with the specified spatial layout. In the interactive scene tuning phase, the scene generator replaces objects and modifies the position and size of objects following the instructor's commands. Figure 4.4 shows the decentralized plant models of the robotic painter. Table 4.1 presents the event set and the controllability of each event. The plant model is the shuffle product of the decentralized plant models:

$$G = \|_{i=1}^{8} G_i \tag{4.7}$$

where $\|$ represents the parallel composition operation. The shuffled plant model has 896 states and 4096 transitions in total.

Table 4.1: List of basic actions for scene generation.

(α) retrieve_img (obj) — controllable — Retrieve an image from the local image library or the remote image cloud
(β) resize (obj) — controllable — Adjust the size of the object instance
(γ) detect (obj) — controllable — Detect the object category
(δ) compare (obj1, obj2) — controllable — Compare category labels between objects
(ϑ) segment (obj) — controllable — Extract object instances from images
(ζ) merge (obj, context) — controllable — Add the object into the current scene
(η) affirmative — uncontrollable — Confirm the previous operation
(θ) negative — uncontrollable — Deny the previous operation
(κ) change_img (obj) — controllable — Change to another image from the image base
(λ) zoom_in (obj) — controllable — Enlarge the size of an object
(µ) zoom_out (obj) — controllable — Reduce the size of an object
(ν) move (obj) — controllable — Change the position of an object
(ρ) save (obj, img) — controllable — Save the image that contains the object instance
(ξ) stop — controllable — Stop the action currently being implemented

Each action is parameterized by the required objects or images. The Greek letters are used to encode the primitive actions for simplicity of representation in the planner property analysis later in this section.

When the robotic painter receives a language description of a scene, it retrieves images that rank high in the query results using keyword combinations of (Agent + Agent property) or (Theme + Theme property). Then the robot uses an object detector to detect the desired object instances in the retrieved images. If an image contains an object that meets the specification, the object is segmented from the image and resized to fit into the new scene. Otherwise, the robot switches to the next image in the ranking and repeats the process until all the required objects have been found and arranged according to the specified spatial layout.

The plant model captures all the physically possible behaviors, including desired behaviors (legal behaviors, i.e., behaviors that lead to the marker states) and behaviors that should be forbidden (illegal behaviors, i.e., behaviors that cause the system to fail to reach the marker states).

Figure 4.4: Decentralized plant model of scene generation. Subfigures (a) to (e) are the models of scene generation using objects extracted from retrieved images. Subfigures (f) to (h) are the models for scene tuning. States in green with an entry arrow represent initial states, yellow states are intermediate states, and the pink states with double circles are marked states, which also represent the target state of each behavior.
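To illustrate how this procedure maps onto the event set of Table 4.1, the following toy encoding replays an event string on the main scene-synthesis path of the plant. The state names and the exact transition structure are illustrative simplifications introduced here, not the actual automata G1 to G5 of Figure 4.4.

# Toy encoding of the main scene-synthesis path, using the event names of Table 4.1.
PLANT = {
    ("idle", "retrieve_img"): "retrieved",
    ("retrieved", "detect"): "detected",
    ("detected", "compare"): "compared",
    ("compared", "affirmative"): "matched",     # uncontrollable outcome
    ("compared", "negative"): "rejected",       # uncontrollable outcome
    ("rejected", "change_img"): "retrieved",    # try the next ranked image
    ("matched", "segment"): "segmented",
    ("segmented", "resize"): "resized",
    ("resized", "merge"): "merged",             # marked: object added to the scene
}
MARKED = {"merged"}

def run(events, state="idle"):
    """Replay an event string on the plant; return the final state (or None if illegal)."""
    for e in events:
        state = PLANT.get((state, e))
        if state is None:
            return None
    return state

trace = ["retrieve_img", "detect", "compare", "negative", "change_img",
         "detect", "compare", "affirmative", "segment", "resize", "merge"]
final = run(trace)
print(final, final in MARKED)   # -> merged True

The trace shows one rejected image (the uncontrollable negative outcome followed by change_img) before a matching image is segmented, resized, and merged into the scene.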
To guarantee the legal behaviors, a supervisor is required to regulate the plant's behaviors. In addition to following the scene synthesis procedure, the following properties should also hold for the supervisor:

1. Controllability: Controllability characterizes the capability of a system to accomplish the assigned task (reach the target state) under abrupt and unexpected occurrences of uncontrollable events.

Definition of Controllability [56]: Let $K$ and $L = \overline{L}$ be two arbitrary languages over the event set $\Sigma$. $K$ is said to be controllable with respect to $L$ and $\Sigma_{uc}$ if $\overline{K}\Sigma_{uc} \cap L \subseteq \overline{K}$.

If a supervisor is controllable, it is able to drive the system to its target state even when encountering uncontrollable events during the implementation. The behavior shown in Figure 4.5 (a) is a desired one. However, this behavior is uncontrollable: at state q2, the uncontrollable event negative can happen and cause the failure of image generation, which falls outside the legal range. To ensure successful image generation, uncontrollable behaviors should be removed from the supervisor (i.e., the image generator).

Controllability guarantees that the marker states are achievable. However, sometimes the robot may be trapped in a state or a subset of states such that it takes a longer time to accomplish the task. To avoid this issue and make the robotic implementation smoother, the second property used to analyze behaviors is introduced next.

2. Nonblocking: Nonblocking characterizes the property of a system to avoid being stuck at a state or trapped in a subset of states. The following gives the definition and criterion of the nonblocking property.

Definition of Nonblocking [56]: Let $L(G)$ and $L_m(G)$ represent the legal behaviors and the marked behaviors (in this context, marked behaviors represent successful scene generations) of the supervisor, respectively. A formal language satisfying $\overline{L_m(G)} = L(G)$ is said to be nonblocking. An overline over a formal language represents all the prefixes of the language, including the language itself.

Figure 4.5: Partial models of the shuffled plant model: (a) an uncontrollable part and (b) a blocking part. For simplicity, both models are drawn starting from q0; in fact, the states in the two partial plant models are different states.

The supervisor refers to the image generator in the proposed framework. The robot retrieves images from a noisy image cloud and may encounter problems that hamper scene generation. For example, the retrieved images that rank high in the results may not contain the desired objects. The robot then has to keep changing images until the target object instance has been detected (as shown in Figure 4.5 (b)). This is time-consuming and may trap the robot in a livelock. Performing a nonblocking check on the synthesized supervisor helps to detect and eliminate these blocking situations and ensures the accessibility of the marker states.
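In practice, the nonblocking condition can be checked by verifying that every state reachable from the initial state can also reach a marked state. The sketch below illustrates this generic check on a toy automaton; it is not the DESUMA implementation used in this work, and the transition encoding is an assumption made for illustration.

def reachable(transitions, start):
    """States reachable from the initial state."""
    seen, stack = {start}, [start]
    while stack:
        s = stack.pop()
        for (src, _), dst in transitions.items():
            if src == s and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def coreachable(transitions, marked):
    """States from which some marked state is reachable."""
    seen, changed = set(marked), True
    while changed:
        changed = False
        for (src, _), dst in transitions.items():
            if dst in seen and src not in seen:
                seen.add(src)
                changed = True
    return seen

def is_nonblocking(transitions, start, marked):
    return reachable(transitions, start) <= coreachable(transitions, marked)

# A tiny blocking example: after a negative outcome the system can only loop.
T = {("s0", "retrieve_img"): "s1", ("s1", "negative"): "stuck",
     ("s1", "affirmative"): "done", ("stuck", "change_img"): "stuck"}
print(is_nonblocking(T, "s0", {"done"}))   # False: "stuck" is reachable but not coreachable

Here the state reached after the negative outcome can never reach the marked state, so the check reports blocking; the refinement described next removes exactly this kind of trap.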
If we use formal language to represent the behavior of the partial plant model shown in Figure 4.5 (b), and use the Greek letters defined in Table 4.1 for simplicity of representation, its behavior can be represented as

$$L(G_p) = \big(\alpha\gamma\delta(\theta\kappa)^{*}\big)^{*}\,(1 + \eta\vartheta\zeta)$$

where $G_p$ denotes the partial plant, while the marked behavior is

$$L_m(G_p) = \big(\alpha\gamma\delta(\theta\kappa)^{*}\big)^{*}\,\eta\vartheta\zeta$$

from which it can be seen that $L_m(G_p) \subset L(G_p)$. This means there is a possibility that the system will be trapped in an infinite loop searching for images that contain the target objects. To avoid blocking, the retrieved-image evaluation procedure and the scene generation procedure are separated into two independent processes, as shown in Figure 4.6 (a) and (b), respectively. The retrieved images are evaluated first to detect the desired objects, and only the qualified images are saved to a local image library (Figure 4.6 (a)). Then the robot randomly retrieves an image from the local image library and extracts the object instance for scene synthesis (Figure 4.6 (b)). The behaviors generated by these two refined models are nonblocking and controllable, and they serve as partial supervisors for scene generation. In addition, the plant models G6 to G8 are both controllable and nonblocking. The supervisors for interactive scene tuning are designed with the same event and state sets, respectively. The operations on automata (plant model synthesis, property check, supervisory synthesis) are implemented using DESUMA [62].

Figure 4.6: Partial supervisors for (a) S1: retrieved image evaluation and (b) S2: scene synthesis.

4.4.4 Motion Planner

In this work, we employ the feature extraction and path planning algorithm developed in [98]. Figure 4.7 shows the pipeline of its framework. Local binarization and global binarization are performed to extract visual features under different illumination conditions. Local binarization binarizes the input scene using adaptive thresholds for each pixel, while global binarization uses a single threshold to binarize the scene; this parameter varies by scene and can be determined automatically. The two feature graphs are then added together for robot path planning with minimum drawing time.

Figure 4.7: Overview of the framework of feature extraction and motion planning.

4.5 Experimental Evaluation

We have evaluated the proposed approach on two tasks: scene generation and scene drawing.

4.5.1 Experimental Setup

Robotic painter. As shown in Figure 4.1, the robotic painter is a six-axis ABB IRB 120 robot with a specially designed container mounted at the end-effector to hold a marker pen. It is equipped with a spring inside the chamber to make the fixed marker pen more adaptive when drawing on paper.

Object detection. The object detection performs twofold roles: detection and segmentation. The object detector developed in [99] is used in this work, which is able to detect 20 categories of objects. Object detection is implemented as a pixel-level labelling task. The output is a 2D matrix that has the same dimensions as the input image.
Each value of the matrix denotes the object category of the pixel at the same location in the input image.

Dataset. We use the MS-COCO evaluation dataset [100] to evaluate our approach. Since the object detection algorithm we used can recognize 20 categories of objects, we filtered the MS-COCO dataset to find text descriptions that only contain objects lying within the range of processable object categories. In total we test our approach on 128 descriptions for scene generation and drawing.

Evaluation metrics. We evaluated the proposed approach using caption generation and human evaluation.

Baseline works. We compare our approach with the following baselines.

• Ground Truth: The ground truth uses original scenes from the MS-COCO evaluation dataset paired with the scene description.

• Random: A random scene is selected from the ground truth.

• Reed et al. [89]: A GAN developed by Reed et al. to generate images according to the input text.

• AttnGAN [94]: Attentional GAN with attention-driven, multi-stage refinement for text-to-scene generation.

• Ours: The scenes are generated using the proposed method without further tuning by a human instructor.

• Ours-Human: The generated scenes have been tuned by human instructors using natural language.

4.5.2 Scene Generation

Caption generation: We evaluate text-conditional scene generation performance using caption generation. We generate sentences from the synthesized scenes and measure the similarity between the input text and the predicted descriptions. The underlying intuition is that if the generated image is relevant to the input text and its contents are recognizable, one should be able to guess the original text from the synthesized scene. We use an image caption generator [101] trained on MS-COCO to generate sentences, where one sentence is generated per scene. We report four standard language similarity metrics: BLEU [102], METEOR [103], CIDEr [104], and ROUGE-L [105].

Table 4.2: Quantitative evaluation results based on caption generation.

Method             BLEU-1   BLEU-2   BLEU-3   BLEU-4   CIDEr   ROUGE-L   METEOR
Random             0.272    0.105    0.039    0.019    0.121   0.256     0.072
Reed et al. [89]   0.277    0.109    0.044    0.023    0.201   0.269     0.092
AttnGAN [94]       0.307    0.136    0.068    0.037    0.220   0.296     0.097
Ours               0.352    0.174    0.091    0.053    0.416   0.335     0.134
Ours-Human         0.378    0.219    0.140    0.094    0.654   0.362     0.151
Ground Truth       0.502    0.339    0.240    0.180    1.298   0.476     0.221

BLEU, CIDEr, ROUGE-L, and METEOR metrics based on caption generation are presented. The last row gives the caption generation performance on the MS-COCO dataset images. Higher is better in all columns.

Table 4.2 summarizes the quantitative evaluation results based on caption generation performance. The Random baseline shows the worst performance, which demonstrates that random scenes from the ground truth rarely convey the same semantic meaning. It can be seen that the proposed method significantly outperforms the baseline approaches. Tuning from human instructors also helps to increase the semantic quality of the generated scenes dramatically. The caption generation performance shows that captions generated from our synthesized scenes are more correlated with the input text than the baselines, which means that the scenes synthesized by our method are better aligned with the input descriptions and their semantic contents are easier to recognize.

Human evaluation: Using a caption generator for evaluation is beneficial for large-scale data. However, it may introduce unintended bias from the caption generator.
To validate the caption generation evaluation, we have also conducted a human evaluation on Amazon Mechanical Turk. For each text description from the evaluation dataset, six scenes generated by the different approaches are presented to the Turkers. They are asked to score how well the scenes match the description on a scale of 1 (very poorly) to 5 (very well). The results shown in Figures 4.12 and 4.8 demonstrate consistency with the caption evaluation performance. Figure 4.9 shows some randomly chosen qualitative results. In addition, the proposed approach is able to generate scenes that are not likely to happen in the real world, as shown in Figure 4.13.

Besides the comparison between the proposed approach and other baseline works, we have also conducted comparison experiments of our full model against models without the controllability guarantee (None-Cont) and without the nonblocking guarantee (None-Nonb), respectively. The three models each generate scenes conditioned on the 128 descriptions. The results are shown in Figure 4.10. Without guaranteeing controllability, the model has a lower success rate in scene generation due to failures to detect the desired objects in retrieved images. This can be caused either by misalignment between an image and its description on the Internet or by object detector errors. Without guaranteeing nonblocking, it takes the robot a longer time on average to generate a scene.

4.5.3 Scene Drawing

We perform feature extraction and path planning on all the generated scenes. The parameters used for feature extraction and motion planning are the same as presented in [98] for all the scenes produced with each baseline work.

Human evaluation: Similar to the human evaluation of the generated scenes, we also conduct human studies on scene drawings using Amazon Mechanical Turk. Six scene drawings are presented to the Turkers with a text description, and they are asked to score how well the drawings match the description on a scale of 1 (very poorly) to 5 (very well). Figure 4.12 shows the results. The proposed approaches substantially outperform the other baseline works (Random, Reed et al., and AttnGAN) and are close to the performance of Ground Truth. Figure 4.11 shows scene drawings of the randomly chosen scenes in Figure 4.9.

Figure 4.8: Turkers are asked to score how well the scenes match the description on a scale of 1 (very poorly) to 5 (very well). "GT" denotes "Ground Truth". The subjects find our scenes better represent the input sentences than other baseline approaches. In fact, our approach frequently wins over or ties with the ground truth scenes.

Figure 4.9: Qualitative examples of generated scenes conditioned on text descriptions from the MS-COCO dataset, using the proposed approach and baseline methods. The input description and ground truth scenes are shown in the first row.

4.5.4 Discussion

One crucial aspect for generated scenes to be well aligned with language descriptions is the correct grounding of spatial requirements. Sometimes, the spatial descriptions can be ambiguous. For instance, "A man sitting in a chair" and "A man sitting on a chair" express the same meaning, while the robot may initialize the location of "a man" close to the centroid of the chair or on top of the chair back, respectively. Similarly, "A boy standing over a bed" means the boy is standing next to a bed and leaning over it rather than standing on the bed.
Failures to resolve this ambiguity may be due to the fact that the dataset we used for training the spatial priors does not consider such spatial ambiguity. One direction for future work is to use the data recorded from scene adjustments to augment the spatial priors.

Figure 4.10: Performance comparison of our full model against the models without the controllability guarantee and without the nonblocking guarantee, respectively. Blue bars represent the scene generation success rate, and brown bars represent the average scene generation time.

Another aspect of generating recognizable, realistic scenes is the completeness of the segmented objects. The object detector used in this work sometimes fails to segment complete objects, which can cause difficulty in recognizing them in generated scenes. This may also be one of the reasons that decrease the performance of our approach without human adjustment. In our future work, we plan to tackle this issue from two perspectives. The first is to improve the performance of the object detector in combination with other visual processing techniques. The second is to develop an object completeness detection algorithm and apply it in the image retrieval process.

4.6 Summary

In conclusion, we propose a framework for robotic drawing conditioned on NL descriptions. The proposed approach decomposes scene generation conditioned on NL into several manageable steps. Instead of learning a direct text-to-pixel mapping or retrieving from a constrained local image material library, the proposed approach utilizes knowledge retrieved from the Internet for more diverse and creative scene generation and drawing. The developed mechanism for handling unexpected events makes scene generation more stable and reliable. The interactive modification of scenes helps to achieve better alignment with scene descriptions and to correct training errors.

Figure 4.11: Qualitative examples of simulated robotic drawings of generated scenes with the same parameters.

Figure 4.12: Turkers are asked to score how well the scenes match the description on a scale of 1 (very poorly) to 5 (very well). "GT" denotes "Ground Truth". We achieve absolute scores slightly worse than the ground truth, but better than the baselines.

Figure 4.13: Generation results on scenes that are not likely to happen in the real world (the input descriptions are "An airplane is flying in the sky. A bird is flying over the airplane.", "A car is racing with a boat. The boat is on the road.", and "A horse is running over a cliff, and a boy is standing on the horse."). The first column shows the unmodified synthesized scenes, the second column shows scenes modified by the instructor with language instructions on the first-column scenes, and the third column shows scene drawings by the robotic painter.

CHAPTER 5

TACTILE FEEDBACK FOR NL BASED BEHAVIOR CONTROL AND PROGRAMMING

5.1 Introduction

The success of robot control and programming relies heavily on correct state estimation. Besides vision and encoder information, tactile information plays a critical role in a robot's interaction with its surroundings. Tactile sensing is an essential survival skill for living creatures and has been a component of robots almost as long as vision.
It is extremely useful and widely applied in three types of tasks: manipulation, exploration, and response [71]. Slippage plays an important role in the completion of the above-mentioned tasks. Applications include grasp force control [106][107], frictional property estimation [108], surface texture perception [109], wearable devices [110], prosthetics [111], etc., all of which motivate the need for stable and general slippage detection methods.

An early slippage detection approach was based on detecting displacements of a moving item on the gripper surface [112]. Various approaches have since been developed for slippage detection. Regarding signal type, researchers have used vibrations [108], acoustic emissions [113], and variations of normal force and shear force [107], etc., which are generated during slippage, to analyze and determine the slip condition. Approaches to classification and decision on slippage occurrence have employed neural networks [114], frequency analysis [115][116], the Coulomb friction model [117], principal component analysis [118], etc. However, these methods suffer from several deficiencies:

1. They are dependent on either custom hardware or a specific sensor signal type.

2. In real applications, slippage can occur in both static and dynamic environments. The above methods only demonstrate their ability to detect slippage in static environments.

3. They can only detect the occurrence of slippage rather than the sliding velocity. The main reason for this issue is that these methods are unable to detect or estimate the sliding displacement of a slippage. Recent research has been conducted into developing methods to detect the sliding distance [111][1]. The method proposed in [111], however, heavily relies on custom hardware, while in [1] the authors design a slip vector to measure the sliding distance based on optical flow, which is less accurate compared with our proposed approach.

To provide partial remedies for the above issues, we propose two correlation based approaches for slippage detection using a tactile sensor array, which is one of the oldest and most common tactile sensor types. The first approach is able to distinguish between slip and non-slip, independent of the sensor signal type, slip type (translational or rotational), and sliding direction. In addition, it is shown to work well in both static and dynamic environments. The second approach emphasizes the ability to detect slippage velocity from tactile sensor array information.

The chapter is organized as follows. Section 5.2 illustrates the proposed methods in detail. Section 5.3 provides an overview of the experimental setup and implementation details. Section 5.4 validates our approaches with experimental data. Section 5.5 summarizes this chapter.

5.2 Data Correlation Approach

Generally, the common setup for testing and evaluating a slippage detection algorithm is to use either an electric motor or gravity as the driving force to create a sliding motion between the object and the tactile sensor array. Assuming the tactile sensor array has dimensions of M × N, the output of the sensor usually has the same dimensions as the sensor array. The output signal can be normal pressure, shear force, voltage, etc. In the following sections, we illustrate in detail two data correlation approaches that are independent of the sensor signal type, taking only the spatial distribution and values of the sensor signal into consideration. Figure 5.1 briefly illustrates the proposed data correlation approaches.
Figure 5.1: Illustration of the data correlation based approaches. The left subfigures are two tactile images (black for taxels in contact with the grasped object, white for no contact).

5.2.1 1-D Rank Correlation for Slippage Detection

During a slippage, the larger the displacement, the less similarity there is between the initial tactile sensor array data and the current tactile sensor array data. Therefore, the similarity between sensor array data sampled at different instants in time can be used as a criterion to determine the occurrence of slippage. To this end, we employ 1-D rank correlation to characterize the similarity and determine the slip condition. The correlation coefficient between two sensor array data samples is calculated as follows:

$$r_{XY} = \frac{(X - \bar{X})(Y - \bar{Y})^{T}}{\sqrt{(X - \bar{X})(X - \bar{X})^{T}}\;\sqrt{(Y - \bar{Y})(Y - \bar{Y})^{T}}} \tag{5.1}$$

where both X and Y represent M × N tactile sensor array data rearranged as 1 × MN (or MN × 1) vectors. X is the initial tactile sensor array data that is not all zero, while Y denotes tactile sensor readings sampled at later instants. $\bar{X}$ and $\bar{Y}$ are the mean value vectors of X and Y, respectively. The result of Equation 5.1 is a correlation coefficient that reflects the similarity between the two vector-form sensor data X and Y.

Then, we apply the Fast Fourier Transform (FFT) over every K calculated correlation coefficients and compare the amplitude of the first frequency component (F[1]) with an experimentally tuned threshold ε. If the amplitude is greater than the threshold, the grasped object is considered to be slipping; otherwise, the object is considered to be static.

$$F[n] = \sum_{i=0}^{K-1} f[i]\, e^{-j\frac{2\pi}{K} n i}, \qquad n = 0, 1, 2, \ldots, K-1 \tag{5.2}$$
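A minimal NumPy sketch of this 1-D measure is given below, under assumptions: the tactile frames are synthetic, and the threshold and window size reuse the values ε = 5 × 10−3 and K = 4 reported later in Section 5.4.1.

import numpy as np

def correlation_coefficient(X, Y):
    """Equation 5.1: similarity between two tactile frames, each flattened to a vector."""
    x, y = X.ravel().astype(float), Y.ravel().astype(float)
    x -= x.mean()
    y -= y.mean()
    return float(x @ y / (np.sqrt(x @ x) * np.sqrt(y @ y)))

def slip_detected(coeffs, eps=5e-3):
    """Equation 5.2: FFT over the last K correlation coefficients; slip is declared
    when the first frequency component exceeds the experimentally tuned threshold."""
    F = np.fft.fft(np.asarray(coeffs, dtype=float))
    return abs(F[1]) > eps

# Toy example on a 14 x 6 array: a contact patch that shifts by one row per sample.
rng = np.random.default_rng(0)
X0 = np.zeros((14, 6)); X0[4:8, 1:5] = 1.0           # initial tactile image
frames = [np.roll(X0, shift, axis=0) + 0.01 * rng.standard_normal((14, 6))
          for shift in range(4)]                      # object sliding along the pad
coeffs = [correlation_coefficient(X0, Y) for Y in frames]
print(slip_detected(coeffs))                          # True: the similarity decays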
5.2.2 2-D Cross Correlation for Slippage Velocity Detection

For any M × N tactile sensor array, the instantaneous values constitute a tactile image of that array. The shift of the object in the tactile image reflects its displacement on the contact surface between the gripper finger and the object, which can be calculated via 2-D cross correlation. The 2-D cross correlation of an M × N matrix X and a P × Q matrix Y is an (M + P − 1) × (N + Q − 1) matrix C given by

$$C(k, l) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} X(m, n)\, \overline{Y}(m - k, n - l), \qquad -(P - 1) \le k \le (M - 1),\;\; -(Q - 1) \le l \le (N - 1) \tag{5.3}$$

where $\overline{Y}$ denotes the complex conjugate of Y.

The following example illustrates the working principle of the 2-D correlation. As shown in Figure 5.2, O1 and O2 represent the positions of a grasped object before and after a shift, respectively. Without loss of generality, we assume the object has a higher pixel value than the background. As shown in Equation 5.3, each element of the output matrix, C(k, l), is calculated by shifting Y by certain displacements. C(k, l) has both negative and positive row and column indices. A negative row index denotes an upward shift of the rows of Y, and a negative column index represents a leftward shift of the columns of Y. Similarly, positive row and column indices denote shifts in the opposite directions. The position of the element with the maximum value means that the original object image and the shifted object image completely overlap with each other after certain displacements along the horizontal and vertical directions. If the background has higher values than the object, then the position of the minimum element indicates the shifts along the horizontal and vertical directions.

Figure 5.2: Illustration of 2-D correlation. O1 is the initial object position and O2 is the object position after the shift.

Assume the maximum value element is C(k_max, l_max). With the tactile sensor spatial resolution d_taxel known a priori, the two-dimensional displacements can be obtained as

$$d_v = k_{max} \cdot d_{taxel}, \qquad d_h = l_{max} \cdot d_{taxel} \tag{5.4}$$

where the subscript v denotes vertical and h denotes horizontal. With the number of samples $N_{sample}$ between X and Y, and the sampling rate $f_{sample}$, the time interval of the shift can be calculated as

$$\Delta t = \frac{N_{sample}}{f_{sample}} \tag{5.5}$$

Then the vertical and horizontal slippage velocities are

$$s_v = \frac{d_v}{\Delta t}, \qquad s_h = \frac{d_h}{\Delta t} \tag{5.6}$$

Here, by vertical we mean the direction along the long side of the tactile sensor pad, and correspondingly, horizontal represents the direction along the short side.
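The displacement-to-velocity computation can be sketched as follows. This is an illustrative implementation under assumptions: SciPy's correlate2d stands in for Equation 5.3 (real-valued signals), the default parameters mirror the 3.4 mm taxel pitch and 47 Hz sampling rate reported in Sections 5.3 and 5.4.3, and the tactile images are synthetic.

import numpy as np
from scipy.signal import correlate2d

def slip_velocity(X, Y, d_taxel=3.4e-3, n_sample=1, f_sample=47.0):
    """Estimate vertical/horizontal slip velocity from two tactile images
    (Equations 5.3 to 5.6)."""
    C = correlate2d(X, Y, mode="full")           # Eq. 5.3 for real-valued arrays
    k, l = np.unravel_index(np.argmax(C), C.shape)
    # Convert indices of the full correlation into signed shifts.
    k -= Y.shape[0] - 1
    l -= Y.shape[1] - 1
    dv, dh = k * d_taxel, l * d_taxel            # Eq. 5.4
    dt = n_sample / f_sample                     # Eq. 5.5
    return dv / dt, dh / dt                      # Eq. 5.6

# Toy 14 x 6 tactile images: the contact patch shifts by two rows between samples.
X = np.zeros((14, 6)); X[3:7, 1:5] = 1.0
Y = np.roll(X, 2, axis=0)
# About (-0.32, 0.0) m/s; the sign encodes the shift direction under the Eq. 5.3 convention.
print(slip_velocity(X, Y))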
5.3 Robot Experimental System

Figure 5.3: Mobile manipulator used as the experimental platform.

Experiments were conducted using the robot experimental platform shown in Figure 5.3. It has a four-wheeled base, a seven-degree-of-freedom (DOF) arm, and a one-DOF parallel-jaw gripper. Figure 5.4 shows the parallel-jaw gripper mounted on the arm. Its actuator is a brushless DC servo-motor with a gear mechanism, a spindle, and an encoder. Each finger of the gripper is equipped with a tactile pressure sensor array with dimensions of 24.4 × 51.4 × 5.4 mm. Each sensor array consists of 14 × 6 sensor cells with a spatial resolution of 3.4 mm. These resistive sensors (manufactured by Weiss Robotics GmbH & Co. KG, Germany) measure the normal pressure applied in each sensed region [119]. As shown in Figure 5.4, the entire sensor pad surface is covered by a silicone rubber layer for protection and to enhance compliance for a successful grasp.

Figure 5.4: The robot gripper. The pressure sensor arrays are attached to the gripper's fingers under the silicone rubber.

In order to evaluate the performance of the second approach, another setup has been designed, as shown in Figure 5.5. A high-speed camera (CASIO EXILIM EX-H5) is used to record the sliding motion of the grasped object at a frame rate of 480 frames per second (fps). Each frame has dimensions of 160 × 224 pixels. The start and end time points of the slip, the displacement at any instant, and the corresponding slippage velocity can be recorded and calculated accurately from the frames. For the purpose of tracking the sliding movement of the grasped object, white markers have been stuck to the object surface.

Figure 5.5: This system is used to evaluate the performance of the second approach. White markers are stuck to the cylinder's surface, facing the camera lens.

5.4 Experimental Results

This section presents experimental results for the proposed data correlation approaches. For the first method, we show its ability to detect slippage, both translational and rotational, in static as well as dynamic environments. For the second approach, the ability to detect the displacement and velocity of the sliding motion is presented. We also compare the velocities calculated by the proposed approach, by the method proposed in [1], and from the video clip recorded by the high-speed camera.

5.4.1 Translational and Rotational Slippage Detection

As described in Equation 5.1, the similarity between two tactile sensor array data samples is calculated as a correlation coefficient. During either translational or rotational slippage, variations in both the value and the distribution of the normal pressure cause changes in the calculated correlation coefficient sequence. All sensor cells were sampled simultaneously at a rate of 20 Hz in this setup.

Figure 5.6 shows the variations of the amplitude of the first frequency component, the correlation coefficient, and the grasp force in a static configuration. The gripper is gripping a metal cylinder with a grasp force large enough to hold it. The cylinder weighs 2 kg. The gripper posture is perpendicular to the ground. The grasp force fluctuates slightly due to mechanical vibrations. Applying the FFT to the correlation coefficient sequence, all the amplitudes of the first frequency component were less than the threshold (ε = 5 × 10−3 with K = 4 in our setup). Amplitudes greater than the threshold are marked red; otherwise, they are marked blue.

Figure 5.6: Experimental results of the static case.

Figures 5.7 through 5.10 show the results of slippage detection under different conditions. Figure 5.7 concerns translational slippage. The initial configuration is the same as in the static case. Then the gripper releases slowly to a smaller force which is not able to hold the metal cylinder. The releasing velocity of the gripper fingers was 1 mm/s. The cylinder began to slip downward. As soon as the grasped object begins to slip, the correlation coefficient starts to decrease and the amplitude of the first frequency component goes up.

Figure 5.7: Experimental results of the translational slippage case.

In Figure 5.8, the difference in the experimental setup compared with the translational one is that the grasped object is not released gradually but is kept in the gripper and made to rotate manually. In the middle plot of the rotational slippage case we can observe that the correlation coefficient can still return to a relatively high value. This is due to the symmetry of the cylinder used in the experiments.

In Figure 5.9, both translational and rotational slippage occurred simultaneously. Similarly, at first the cylinder is gripped with a large force. The initial posture is neither perpendicular nor parallel to the ground, but at an angle with the horizontal plane. During the release process, the cylinder slides and rotates under gravity at the same time.

In the above setups, the gripper posture is perpendicular to the ground, while in Figure 5.10 the gripper is set to be horizontal. All other settings remain the same as in the translational slippage setup. During a slippage, the force distribution and values do not change significantly, and neither does the correlation coefficient. This is because the cylinder is longer than the short side of the gripper. However, the fluctuation of the correlation coefficient during a slippage is still larger than that in the static case.

Figure 5.8: Experimental results of the rotational slippage case.

Figure 5.9: Experimental results of the combined slippage case.

Figure 5.10: Experimental results of the translational slippage case with the gripper set horizontally.

The 1-D correlation based method employs the sensor signal distribution and values. During the slippage process, variations of these two features cause heavier fluctuations in the correlation coefficient sequence than in the static case. By the static case we mean that both the arm and the gripper do not move, while in real applications this is not always the case. A slippage usually occurs when the arm/gripper is moving or rotating. In the following section we investigate this issue.
5.4.2 Slippage Detection in Dynamic Environments

In order to evaluate the performance of the proposed method under dynamic situations, we conducted experiments in which the robotic arm is moving while the cylinder stays relatively static with respect to the gripper. Four common movements, which can constitute most complex behaviors, are considered here, as shown in Figure 5.11. Figure 5.12 shows that the 1-D correlation based slippage detection approach works normally under the impact of gripper rotation. Similar results are also observed under the other three conditions. Figures 5.13 through 5.16 display the results of slippage detection under the four common movements. In Figure 5.13, a slip occurs when releasing the gripper. It demonstrates that slippage under this condition can still be detected.

Figure 5.11: Four basic motions of the robotic arm. Rotation type I is to rotate the gripper when it is perpendicular to the ground. Rotation type II means to rotate the arm rather than the gripper. Rotation type III is almost the same as type I, but with the gripper placed horizontally. The fourth motion is only a translational movement of the arm without any rotation.

Figure 5.12: Experimental results of rotation type I in the static case.

Figure 5.13: Experimental results of rotation type I in the slippage case.

Figures 5.14 and 5.15 show results in which the cylinder slips when the arm is in type II and type III rotation modes, respectively. The item is grasped in the gripper with a smaller grasp force that is insufficient to hold the object.

Figure 5.14: Experimental results of rotation type II in the slippage case.

Figure 5.16 describes the slip process during a translational motion of the robotic arm. In this experiment, the arm moves from a point at the lower left corner to another point at the top right corner of its workspace. At the same time, the gripper slowly releases its fingers to allow a slippage to happen.

Figure 5.15: Experimental results of rotation type III in the slippage case.

Figure 5.16: Experimental results of translational movement in the slippage case.

The above results show that the proposed 1-D correlation based approach is able to detect slippage under multiple conditions, especially in dynamic environments with mechanical vibrations caused by arm/gripper movements.

5.4.3 Slippage Velocity Detection

In this section, we evaluate the performance of the proposed slippage velocity detection approach with tactile data obtained from the experimental evaluation setup shown in Figure 5.5. The cylinder is gripped firmly at first, and then the gripper releases at a rate of 1 mm/s to allow a slippage to occur. All sensor cells are sampled simultaneously at a rate of 47 Hz, and the sliding motion is recorded by both the tactile sensor array and the high-speed camera. In this experiment, only the sliding motion along the slippage direction is considered. The spatial resolution of the video clip is 1.23 mm/pixel. The results are shown in Figure 5.17.

Figure 5.17: Comparison of experimental results. The top plots represent the sliding distance calculated from 2-D cross correlation, algorithm 1 [1], modified algorithm 1, and the ground truth. The bottom plots are the sliding velocities computed from the sliding distances of the 2-D cross correlation, modified algorithm 1, and the ground truth.

The top plot of Figure 5.17 shows the sliding distances detected by the 2-D correlation approach, by the method in [1], and as measured from the video clip. The actual sliding distance is 36.9 mm, while the detected distance is 37.4 mm.
The difference comes from the limited spatial resolution of the tactile sensor array, since the detected displacement can only be an integral multiple of the spatial resolution. Also, this difference between the detection and the ground truth depends on the initial start position. The largest difference should be no greater than the spatial resolution. This also explains why the proposed algorithm detects slippage earlier than the results from the video clip.

The green plot represents the displacement calculated by algorithm 1 [1]. Due to the vibration caused by slippage and the non-ideal object surface, the calculated sliding displacement fluctuates. To smooth the plot, we first assume that later displacements should be no less than earlier displacements. Additionally, we assume that higher-row taxels in the tactile image have larger values. With these two assumptions, algorithm 1 [1] is modified and the results become more reasonable. Overall, however, the method presented in this chapter has better accuracy.

The trends of the slippage displacement and velocity detected by the 2-D correlation based algorithm are consistent with the results from the high-speed camera video clip. The differences between these two results are due to the limited spatial resolution of the tactile sensor array and the sampling rate. The performance could be improved by increasing either the spatial resolution or the sampling rate, or both. Compared with the slippage velocity detection method presented in [111], which depends on a custom-designed tactile sensor array to detect slippage velocity, the proposed approach has wider applicability; theoretically, it is suitable for any tactile sensor array.

5.5 Summary

In this chapter, two correlation based approaches are proposed for slippage detection. The proposed data correlation based slippage detection approaches are able to detect not only the occurrence of a slippage but also the slippage velocity. The estimated status is sent back to the high-level controller for task planning and supervised implementation. In addition, the slippage status information can be used for grasp status estimation and grasp force control. The proposed methods are independent of the sensor signal type and can be applied to other tactile arrays.

CHAPTER 6

CONCLUSIONS AND FUTURE WORK

6.1 Conclusions

In this dissertation, a DES inspired approach for NL based robot control and programming is investigated. The aim of such an approach is to achieve reliable robotic behavior management through NL in unstructured environments. However, with the underspecification of language intended for human use and the complexity of environments, three challenges exist. First, it is nontrivial to transform underspecified language input into deterministic behavior output. Second, it is difficult to achieve stable movement in unstructured environments, especially with underspecified instructions. Third, it is difficult to teach a robot new skills through NL, since the physical structures and cognitive capabilities of robots and humans are completely different. This dissertation tries to address these three challenges by using DES inspired control and programming methods.

For the control part, a DES control framework is proposed that can achieve reliable behaviors in dynamic environments with underspecified language instructions. The reliability of the behavior model is achieved by behavior refinement using three properties: controllability, nonblocking, and stability.
It is a correct-by-design approach that detects design flaws in advance and avoids the costly redesign and rework that occur late in the development cycle. NL can contain concepts and commands at different levels of abstraction. Accordingly, a hierarchical control approach is proposed to tackle commands of different logic depth, which helps to alleviate state and action explosion. The behavior consistency and stability of the hierarchical system are proved. Experimental results on prevalent manipulation scenarios clearly show that the developed control framework can achieve satisfactory performance in dynamic environments.

Existing NL based control approaches transform linguistic input into a symbolic representation to guide robot movement, which may fail for some NL-controlled applications such as robotic drawing, painting, polishing, etc. We proposed an approach to generate a scene representation in accordance with the language input and to draw the scene on paper. A scene representation contains more details than a symbolic representation. In addition, this approach employs knowledge retrieved from web resources instead of laboriously labeled data. Results of standard metrics and human studies show that the generated scenes and drawn scenes have higher recognizability and better alignment with the input language descriptions.

The success of robot applications in dynamic environments relies on correct robot state estimation. Gripper status is an important component of the robot state. We propose correlation based approaches for slippage detection to estimate and monitor grasp status. The proposed approaches are able to detect slippage occurrence and velocity without using customized sensor designs and without being limited to specific sensor signal types.

For the programming part, this dissertation presents a hierarchical programming approach that allows a robot to learn new skills from acquired skills through dialogues and demonstrations. This approach mimics the teaching and learning process in human society, and thus is intuitive and friendly to technically untrained users. Moreover, it only requires one-shot training, which is time-efficient and requires little labor.

6.2 Future Research Work

Based on the research presented in this dissertation, there are many possible future research directions, among which only a few are outlined in the following.

For using scene representation to control robot behaviors, the current approach is only able to synthesize complex scenes using existing resources rather than creating new individual instances. For better creation and alignment with input language descriptions, we should design approaches that combine the advantages of the generative approach (such as GAN based methods) and the compositional approach using web resources. In addition, the proposed approach has only been applied to robotic drawing and similar applications. Since a scene representation can include more details, we should extend this approach to other robotic applications, such as navigation, exploration, household chores, etc.

For robot programming through NL, the proposed approach depends on the assumption that new behaviors can be represented using atomic states in the state set. What if a behavior cannot be represented using existing states? This presents a knowledge representation challenge. We need to explore knowledge representations that relate to both the high discrete level and the low continuous level. In addition, the proposed approach assumes that the robot is capable of detecting the entire environmental configuration correctly.
What if there are detection errors or part of the environment is unobservable? This challenge may be solved through human-robot interaction via NL. First, such interaction helps the user to know the robot's true understanding of the environment, and thus to identify the problems that the robot is encountering. Second, the user can employ NL to share information with the robot to compensate for the robot's limitations.

For tactile sensing that provides grasp status to the robot, in this dissertation it is used for robot state estimation in the discrete event level control and for grasp force regulation in the continuous level control. Environmental information, such as object layout and object attributes, is detected by visual sensors. Tactile features, such as texture, softness, and roughness, have not yet been used for object grounding. Recognizing these features helps the robot to have a richer lexicon and better cognition. As a result, it improves object grounding performance by combining the visual and tactile attributes of objects.

BIBLIOGRAPHY

[1] Javier Alcazar, Leandro G Barajas, et al. Estimating object grasp sliding via pressure array sensing. In Proc. IEEE International Conference on Robotics and Automation, pages 1740–1746, 2012.

[2] Hao Zhong and W Murray Wonham. On the consistency of hierarchical supervision in discrete-event systems. IEEE Transactions on Automatic Control, 35(10):1125–1134, 1990.

[3] Lanbo She, Shaohua Yang, Yu Cheng, Yunyi Jia, Joyce Y Chai, and Ning Xi. Back to the blocks world: Learning new actions through situated human-robot dialogue. In 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue, volume 89, 2014.

[4] Terry Winograd. Understanding natural language. Cognitive Psychology, 3(1):1–191, 1972.

[5] Hossein Motallebipour and August Bering. A spoken dialogue system to control robots. 2002.

[6] Stanislao Lauria, Guido Bugmann, Theocharis Kyriacou, Johan Bos, and Ewan Klein. Personal robot training via natural-language instructions. IEEE Intelligent Systems, 16(3):38–45, 2001.

[7] Stanislao Lauria, Guido Bugmann, Theocharis Kyriacou, and Ewan Klein. Mobile robot programming using natural language. Robotics and Autonomous Systems, 38(3):171–181, 2002.

[8] Matt MacMahon, Brian Stankiewicz, and Benjamin Kuipers. Walk the talk: Connecting language, knowledge, and action in route instructions. Def, 2(6):4, 2006.

[9] Michael Brenner, Nick Hawes, John D Kelleher, and Jeremy L Wyatt. Mediating between qualitative and quantitative representations for task-orientated human-robot interaction. In IJCAI, pages 2072–2077, 2007.

[10] Juraj Dzifcak, Matthias Scheutz, Chitta Baral, and Paul Schermerhorn. What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In Proc. of IEEE International Conference on Robotics and Automation (ICRA), pages 4163–4168, 2009.

[11] Mario Bollini, Stefanie Tellex, Tyler Thompson, Nicholas Roy, and Daniela Rus. Interpreting and executing recipes with a cooking robot. In Experimental Robotics, pages 481–495. Springer, 2013.

[12] Albert S Huang, Stefanie Tellex, Abraham Bachrach, Thomas Kollar, Deb Roy, and Nicholas Roy. Natural language command of an autonomous micro-air vehicle. In Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2663–2669, 2010.

[13] David L Chen and Raymond J Mooney.