Toward Open World Visual Understanding

Visual data such as images and videos are the most prominent media for recording, transmitting, and exchanging information today. Though we have witnessed waves of success in visual intelligence, teaching machines to understand visual content at the level of human intelligence remains a fundamental challenge. In past decades, visual understanding has been extensively explored through computer vision tasks such as object (or activity) recognition, segmentation, and detection. However, existing methods can hardly be deployed in real open-world applications, where unseen environments, objects, and activities inevitably appear at test time. This limitation stems from the closed-world assumption, which ignores the unknown in model design, learning, and evaluation. In this dissertation, I present work that goes beyond traditional closed-world visual understanding and tackles several challenging open-world problems. The ultimate goal is to endow machines with visual perception capabilities in an open world, where unseen environments, image objects, and video activities can be handled. First, I investigate open-world visual forecasting problems in unseen perception environments. Specifically, I explore how early observed videos can be leveraged to promptly forecast traffic accident risk for safe self-driving (Chapters 2 and 3) and to forecast 3D hand motion trajectories in an unseen first-person view (Chapter 4). Second, I cover open-world visual recognition problems that aim to identify unseen visual concepts. In this part, I focus on identifying and localizing unseen human actions in general videos (Chapters 5 and 6). Lastly, I delve into open-world visual language understanding problems that further recognize unseen visual concepts from language queries, including the recognition of unseen compositional objects in images (Chapter 7) and the spatiotemporal detection of unseen human actions (Chapter 8). In Chapter 9, I summarize the main contributions of this dissertation, discuss unsolved challenges in real-world practice, and briefly outline future directions for open-world visual understanding that follow from this line of research.
