LIBRARY
Michigan State University

This is to certify that the thesis entitled VIRTUAL ENVIRONMENT FOR VIDEO BROWSING: DESIGN AND COMPARATIVE STUDY OF EFFECTIVENESS presented by Zubin John Abraham has been accepted towards fulfillment of the requirements for the Master of Science degree in Computer Science.

Major Professor's Signature
12-05-08
Date

VIRTUAL ENVIRONMENT FOR VIDEO BROWSING: DESIGN AND COMPARATIVE STUDY OF EFFECTIVENESS

By
Zubin John Abraham

A THESIS
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
MASTER OF SCIENCE
Computer Science
2008

ABSTRACT

VIRTUAL ENVIRONMENT FOR VIDEO BROWSING: DESIGN AND COMPARATIVE STUDY OF EFFECTIVENESS

By Zubin John Abraham

The thesis presents the Virtual Video Gallery, a virtual video browsing environment that facilitates browsing and retrieval of video clips from a video repository by using content-based video retrieval techniques. The browsing structure avoids the need to submit a query in the form of a video or an image, while still providing retrieval capabilities that are sensitive to the underlying media content. The browsing environment provides alternative trails of related content that users can browse by visually portraying inter-video relationships derived from a composite of content and metadata similarity. The Virtual Gallery is presented as a semi-realistic representation of a real art gallery, as opposed to an artificial computer-simulated environment, and has been designed with the intention of exploiting human spatial memory and the ability of humans to recognize visual patterns. The concept is developed in the context of previous research done on content-based image retrieval and browsing environments. This thesis presents a prototype Virtual Video Gallery and the results of user studies that examined the effectiveness of the system. It is shown that there was a 12% improvement in successfully finding a video of choice and a 26% improvement in finding a video similar to a given video. There was a 60% reduction in the number of queries entered, as well as a 70% increase in the number of video clips viewed per user. This translates to approximately a 300% improvement in the success of retrieving a video of choice per query entered by the user.

TABLE OF CONTENTS

LIST OF TABLES ............................................................... v
LIST OF FIGURES .............................................................. vi
1. Introduction ............................................................... 1
2. Related Work ............................................................... 6
3. Research Methodology ....................................................... 16
3.1 Video Retrieval Engine .................................................... 17
3.1.1 Phase I ................................................................. 18
3.1.2 Phase II ................................................................ 26
3.2 Browsing Environment ...................................................... 32
4. Browsing Environment Concept ............................................... 35
4.1 Scenario .................................................................. 37
5. Experimental setup ......................................................... 41
5.1 Experiment Participants and Tasks ......................................... 46
5.2 Results ................................................................... 49
5.3 Discussion and Conclusion ................................................. 59
6. Conclusion and Future Work ................................................. 60
7. Appendix ................................................................... 63
8. Bibliography ............................................................... 64

LIST OF TABLES

Table 5.2.1 - Comparison of success rate in terms of percentage for the two environments ..... 51
Table 5.2.2 - Comparison of query related statistics for the two environments ................ 52
Table 5.2.3 - Summary of statistical improvement of the CBVR environment in comparison with the control environment ..... 55
Table 5.2.4 - Summary of percentage of SCBVR video clips that were viewed by the user in the CBVR environment ..... 57
Table 5.2.5 - Post-survey feedback ............................................................ 58

LIST OF FIGURES

Images in this thesis are presented in color.

Figure 1.1 - Layout of the Data Mountain that uses a 3D desktop virtual environment [4] ...... 4
Figure 2.1 - A screen shot of MediaMetro [2] .................................................. 7
Figure 2.2 - An example of a color histogram having 25 bins .................................. 13
Figure 3.1 - Virtual video gallery architecture ............................................... 17
Figure 3.2 - Phase I architecture ............................................................. 19
Figure 3.3 - Phase II architecture ............................................................ 27
Figure 3.4 - A screen shot of the CBVR environment showing a cluster center .................. 31
Figure 3.5 - A screen shot of the CBVR environment showing props in the environment .......... 32
Figure 5.1 - A screen shot of the CBVR environment ........................................... 44
Figure 5.2 - A screen shot of the control environment ........................................ 45
Figure 5.3 - A screen shot of a video being played in the CBVR environment ................... 48
Figure 5.4 - A screen shot of a video being played in the control environment ................ 48
Figure 5.5 - Video retrieval success rate .................................................... 50
Figure 5.6 - Query and video related statistics .............................................. 53
Figure 5.7 - Video retrieval success rate per query entered .................................. 54

Chapter 1
Introduction

The growth of high-speed Internet connections and video-sharing websites such as YouTube and Google Video has spurred a massive increase in the number of multimedia files easily available over the World Wide Web. For this wealth of information to be useful to the general public, it is important to develop better ways to find interesting and relevant content. This is particularly the case for sites that accumulate video content, which remains difficult to search and browse. In the current state of the industry, the content producer is prompted to append relevant metadata in the form of textual tags and written commentary to each video to facilitate categorization and organization of the various video clips on the website. As a consequence of the growth in the number of video clips and the reliance on these user-supplied ad hoc tags as a primary means of organization and searching, online web-based video libraries increasingly grapple with the problem of organizing and showcasing their large compilations of video clips in a coherent manner. Labeling with three to five simple tags does not adequately describe video content in collections that can number in the millions.

This thesis presents one approach to the problem of locating content of interest in a large video collection: the Virtual Video Gallery. The Virtual Video Gallery is a virtual browsing environment that presents video clips in a setting analogous to that of a physical museum, allowing users to rapidly browse the collection by moving among the exhibits. The layout of content within the museum is determined in response to a user-provided query merged with a self-organization based on image content. Visually related content is placed in close proximity, with the physical distance separating content serving as an indicator of similarity.

A fundamental element of this approach is the concept of browsing. A search moves a user to the approximate neighborhood of the desired result. From this point forward, the user explores content in a physically related area with a natural and appealing appearance. This physical representation/layout is a metaphor for actual content similarity measures. In the prototype built for this research, that similarity is based on color histogram comparisons of representative key frames. The goal of this thesis is to examine the value of a video browsing environment with a museum metaphor as a means to promote exploration within content and the process of locating relevant retrieval results. Consequently, a relatively simple, yet commonly accepted, means is used to compare the video content. Clearly, more advanced similarity measures exist and would be the subject of future work.

Effective management of media data on the web and in computers, particularly temporal content such as audio and video, has been extensively studied. Many approaches exist, but there is no general solution to the problem of locating desired content in large collections. Historically, textual document retrieval has attracted the most attention [1-3]. Document retrieval and browsing approaches often employ 2D icons [4, 5] that represent documents so as to take advantage of human spatial recognition [6, 7].
Due to the inherent limitations on the amount of information that can be conveyed using 2D spatial layouts, the use of 3D spatial layouts for exploiting human spatial cognition abilities has also been explored [2, 6-9]. As shown in Figure 1.1, Data Mountain, a document management technique, uses a 3D desktop virtual environment with 2D interaction techniques to improve management of documents in an information space by exploiting spatial memory [4, 10]. This thesis extends concepts incorporated in Data Mountain and other similar digital file and textual content management techniques as a means to improve video selection and retrieval using a 3D Virtual Video Gallery [2, 8], wherein illustrative snapshots of video clips were strategically placed on props and virtual real-estate space within the spatial confines of a gallery appearing as a visual metaphor. Video clips were clustered and arranged within different "rooms" in the gallery based on a combination of similarity measures, including metadata similarity and similarity of the underlying visual content. The user navigates in the 3D space using interaction techniques similar to computer games and other traditional 3D applications. The Virtual Video Gallery attempted to approximate the human trait of arranging documents spatially such that the most recent/relevant video clips were placed closest to the user, while also ensuring that similar video clips remain within close proximity of each other [4].

Figure 1.1 - Layout of the Data Mountain that uses a 3D desktop virtual environment [4].

Content-based image retrieval (CBIR) [11] techniques, including color histograms [11-15] and their various variants [13, 16], were used for evaluating the similarity of images and video. A variety of content similarity measures exist for video. For this work, color histograms were chosen as a common and accepted basic method. The color histogram of an image remains invariant to translation and rotation about the viewing axis, as well as other affine transformations. It is well suited to recognizing an object of unknown position and rotation within a scene and is, hence, a good candidate for a basic content similarity approach. Indeed, the focus of this work is less on the best way to compare video clips to each other and more on the use and utility of browsing as a means of rapidly exploring libraries of video content. The most important question is the effectiveness of a virtual browsing environment on locating content in a large video database; specifically, the advantages this method has as a new user interface.

The remainder of this thesis examines the design, implementation, and evaluation of the Virtual Gallery. Chapter 2 describes related work that has been published by various authors. Chapter 3, on research methodology, describes both content-based video retrieval and the browsing environment. Chapter 4 describes the concept of the Virtual Gallery, followed by a description of a scenario of a hypothetical envisaged system based on the concept of the Virtual Gallery. Chapter 5 details the design of the Virtual Gallery and includes the experimental setup, which describes the prototype of the system that was built, along with the results and conclusions from the experiments. This is followed by future work, the appendix, and the bibliography (Chapters 6 through 8).

Chapter 2
Related Work

There has been significant progress in the progression from 2D virtual environments [4] to 3D environments [6, 8]. Ann Smith et al.
describe Vista, a 3D virtual environment presented as a web service for exhibition of visual data [17]. The environment dynamically accommodated deletion or addition of information. The environment used "virtual archives" and "space warp trams." The space warp trams allowed the user to navigate quickly. Vista presents four domains: a gallery, studio, cinema, and library. Each level of the gallery stores at most 40 images and 8 sculptures that can be viewed. Though Vista arranged the different types of multimedia in a structured manner such that there is a clear classification between the types of multimedia and a cognitive organization of the data, no emphasis had been given to intra-organization of video content. Rather, the approach assumes self-organization of content.

There has been significant work done in 3D virtual environments for visualization of multimedia data. As seen in Figure 2.1, MediaMetro is a 3D multimedia visualization aid using a city metaphor [2]. MediaMetro utilizes key frames that are generated using a content analysis algorithm for each video and provide a visual summary. These key frames are then applied as facets to the buildings of a cityscape. MediaMetro maintains a directory tree structure for the collection of video clips.

Figure 2.1 - A screen shot of MediaMetro [2]

Unlike Vista, which uses a space warp tram to navigate, MediaMetro uses a novel concept called 'Swoop'. MediaMetro's swooping navigation technique helps users navigate between high-altitude overviews and ground-level detail views of the visual summaries rendered on the buildings in the cityscape. Unlike the swoop navigation technique used by MediaMetro, the Virtual Video Gallery system uses a natural navigation paradigm similar to human movement in the real world so as to help sustain a more realistic impression of the environment from the user's perspective.

The various stages in the development of a virtual art gallery and the importance of creating an atmospheric environment that will stimulate/engage the user are discussed by Cavazza and Mend [18]. Their work seeks to produce online analogues to physical art galleries as structured environments for browsing modern art. There are two types of organization design principles in a virtual environment that need to be considered. The first influences the structure/layout of the environment and the second influences the use of the environment. The impact that navigational aids and design have on the ease of performing tasks in the environment has been found to be considerable. In certain cases, complex 3D environments with limited navigational aids fared worse than their 2D counterparts [19]. "Skilled way-finding" within a virtual environment is greatly dependent upon and enhanced by real-world way-finding and environment design principles described in cognitive psychology literature and urban architectural design principles [3]. Similar results were also shown by Robertson et al., who described the impact spatial memory has on 3D navigation and on the improvement in reaction time [4]. Human spatial memory is the ability to remember where an item is in terms of physical coordinates or orientation. Users of large-scale virtual worlds tend to require structure to effectively navigate within the environment.
Way-finding augmentations such as direction indicators, maps, and path restriction can all greatly improve both way-finding performance and overall user satisfaction. Darken et al. discuss three critical factors of spatial knowledge, which include landmark association, route mapping (procedural knowledge), and a combination of geocentric and egocentric mapping [3]. The virtual environment in this thesis uses corridors and passageways to symbolize 'Paths', walls for 'Edges', floors to symbolize 'Districts', a reception area to act as a 'Node', and a lift to serve as a 'Landmark' in addition to being a Node. These urban metaphors are used to create a spatial hierarchy within the virtual environment for way-finding purposes. Experimental studies have demonstrated that a combination of a grid and maps as an absolute orientation frame of reference had the greatest impact in aiding user navigation, typically reducing navigation errors by 15% [3]. Space warp trams and swoops, while novel, do not fit within the category of urban metaphors. Hence, this thesis focuses on navigation methods that are grounded within traditional mechanisms of hallways, doors, and rooms.

The virtual gallery does diverge from reality in some subtle and non-obvious ways, however. In particular, the virtual rooms are larger physically than is geometrically possible. This divergence allows for the creation of rooms that are more closely spaced than would otherwise be possible in reality, allowing users to be closer to the content. The unreal aspects of the geometry are hidden from the user, since they are never able to see more than one room at a time.

As each room within the virtual environment presents a collection of video clips that are exhibited at locations progressively farther from the user, occlusion of video clips further back in the room is a potential problem. The Cutting Plane interface introduced by Tanaka et al. attempts to mitigate occlusion of content in a room through the use of a near clipping plane that allows users the ability to see beyond occlusions on demand [8]. Tanaka et al. (2004) described a solution to occlusion using the 'Cutting Plane' concept and a means to easily identify the hierarchical structure of the nodes by using brightness and thickness as cues on a visualization tool similar to Treecube [8]. However, these virtual concepts again go beyond the physical gallery metaphor and must be selectively enabled. This thesis proposes a tiered gallery that allows occlusion to be avoided in a natural way.

Data Mountain was envisaged as an alternative to the bookmark option of Internet Explorer [4]. In Data Mountain, links to web pages are shown as snapshots standing upright on a hillside. The user arranges the links to his or her liking anywhere in the environment. The underlying idea is the exploitation of human spatial memory, the inherent human ability to associate items with a spatial coordinate. Losh et al. also exploit spatial memory through the 'Method of Loci', a variation of the mnemonic link system techniques commonly used to aid memory. The Method of Loci is a memorizing technique that uses places and locations (spatial memory) to remember the order and locations of items on a list. The program structured the snapshots in a way that minimized occlusion so as to ensure that as many documents as possible are visible. The hillside metaphor for managing occlusion is similar to the tiered-gallery structure presented in this thesis.
Due to the dynamic nature of a virtual environment, efficient indexing, querying, and browsing are important. The MUVIS [20] project explored real-time video indexing and supported indexing irrespective of the format. The MUVIS system is intended to support real-time video (with or without audio) indexing. For efficiency in video recording and indexing, latest-generation compression techniques such as MPEG-4 and H.263+ are used, which also help in indexing. MUVIS uses intra frames as key frames for feature extraction.

The database used in this study is a large collection of video clips. Directly indexing or comparing video content is impractical due to the massive amount of data involved and the extremely large amount of redundancy in that data. Most systems that seek to index or compare digital video reduce the content to some representation that captures salient information from the original content in a compact form. In this work, video summaries are constructed so as to generate relevant key frames for the video, which are then the subject of content-based image retrieval techniques.

A video summary is constructed using the process of segmentation, key frame extraction, and summarization. Many algorithms have been proposed for video segmentation, often using the same basic concepts as image similarity measures (a scene break is a sequential pair of dissimilar images). Key frame extraction then selects a frame from a sequence that is representative of the sequence. This can be as simple as selecting a frame at a fixed point in the sequence (10 seconds in or the middle are common examples) or as complex as creating a composite image from motion analysis of the video sequence [21-24]. Key frames mostly serve two purposes in video retrieval: a visual summary of the video and a visual representative for image-based video retrieval [25]. Apart from the popular automated key frame selection technique of Middle Image, most automated key frame selection techniques, such as 'Within Event' and 'Cross Event', use event segmentation of the video. 'Within Event' selects as a key frame an image that has the closest value to the rest of the images in the event, while 'Cross Event' additionally requires the image to be dissimilar to the images from other events [26]. The summarization process filters the collected key frames to remove redundancy. The summary comparison requires efficient algorithms specific to the comparison criteria [12, 27, 28].

A video clip consists of a large number of images representing a temporal sequence. This massive quantity of images is highly redundant. Summarization reduces this bulk to a smaller set of images that are representative of the video sequence by removing as much redundancy as possible. Images selected or synthesized by the video summarization process then need to be compared so as to determine a content-based clip similarity. Many methods to achieve this result exist, of which the simplest and, in general, most robust mechanism is the color histogram. Color histograms are computed by discretizing the color of each pixel within the image and counting the number of pixels in the image for each color. This information is then represented in the form of a vector. Color histogram results vary only slowly with the angle of view.
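To make the histogram construction and comparison just described concrete, the short Python sketch below quantizes each RGB pixel into a fixed number of bins per channel, counts pixels, and compares two histograms with cosine similarity. The 5-per-channel (125-bin) quantization anticipates the prototype described in Chapter 5; the use of NumPy and the function names are illustrative assumptions rather than details taken from any system discussed here.

```python
import numpy as np

def color_histogram(frame, bins_per_channel=5):
    """Build a flattened RGB color histogram for one image or key frame.

    frame: H x W x 3 uint8 array (RGB). Each channel's range [0, 255] is
    uniformly subdivided into `bins_per_channel` regions, giving
    bins_per_channel**3 bins (125 for the default of 5).
    """
    # Map each 8-bit channel value to a bin index 0 .. bins_per_channel-1.
    idx = (frame.astype(np.int64) * bins_per_channel) // 256
    # Combine the three per-channel indices into a single bin index.
    flat = (idx[..., 0] * bins_per_channel + idx[..., 1]) * bins_per_channel + idx[..., 2]
    hist = np.bincount(flat.ravel(), minlength=bins_per_channel ** 3)
    return hist.astype(np.float64)

def cosine_similarity(h1, h2):
    """Cosine similarity between two histograms; 1.0 means identical direction."""
    return float(np.dot(h1, h2) / (np.linalg.norm(h1) * np.linalg.norm(h2)))
```

Because histogram entries are never negative, the cosine similarity of two histograms always falls between zero and one, a property that the clustering described in Chapter 3 relies on.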
Translation of an RGB image into the illumination-invariant rg-chromaticity space allows the histogram to operate well in varying light levels. A color histogram characterizes an image (a key frame from a video sequence) by its color distribution without using difficult-to-quantify information about object location, shape, and texture. The images are scaled to contain the same number of pixels before the process of converting the image into its equivalent color histogram, so as to be invariant to image size. Colors within the image are then mapped into a discrete color space containing a specific subset of colors. This method has its shortcomings, especially if object mapping is required.

Figure 2.2 - An example of a color histogram having 25 bins

Color histograms in the HSV (hue, saturation, value) color space were preferred for this application over the RGB color space, as the hue and saturation components are intimately related to the way the human visual system perceives color. Empirical studies have also shown greater accuracy by human subjects when using the HSV color space compared to RGB [29]. Some of the other popular features used for the mapping of key frame data are the color or intensity gradient, edge density, texturedness, the number of pixels that differ beyond a particular threshold, and gradient magnitude [12]. Most of these variants of color histograms work well for limited types of images, as most other implementations are object-association-specific or are too general to classify based on object occurrence. A hybrid approach of using the prominent object, and the relation between objects in the image, to classify was proposed by Tseng et al. [30].

Color coherence vectors constitute another intriguing approach, which attempts to use the coherence of pixel colors to identify similarity [31]. Color histograms are unable to differentiate between an image having a large red object and an image having many scattered red dots. Color coherence vectors measure not only the colors in an image, but also color proximity. Pass et al. used a modification of the color histogram called the joint histogram and performed studies using over 210,000 images [12]. They demonstrated considerable improvement over traditional color histograms. One of the more prominent features of the new algorithm was its scene break detection ability, which is useful in the selection of key frames for the video clips utilized in their project.

Chapter 3
Research Methodology

This thesis presents a new approach to video retrieval and browsing, an approach designed to enhance the user experience and the user's perceived comfort while navigating through a video archive. The user's engagement and perceived feeling of familiarity are crucial factors in the ability to effectively navigate through the video archive. As shown in Figure 3.1, a twofold approach was taken: utilizing content-based video retrieval methods to cluster and group video clips based on dominant hues (the Video Retrieval Engine), and then leveraging spatial and visual cognition (the Browsing Environment). The Video Retrieval Engine and the Browsing Environment are detailed in the following subsections 3.1 and 3.2, respectively.
"N W‘IOnentatron of ““t—n videos within the s environment .» N... ‘ M s' \..F - Human Spatial and Visual ‘ LCognitive cues, NV? Video Retrieval Engine Evaluation Figure 3.1 - Virtual video gallery architecture 3.1 Video Retrieval Engine The first step in the Video indexing process is to create a framework for video retrieval and grouping. There are numerous flavors of image retrieval [32-35] 17 available, many of which use histograms of the image as a means for comparison. Unlike some CBVR approaches that focus on the retrieval of the behavior of objects across the collection, this project implemented a simplified content-based retrieval method that is analogous to Content-Based Image retrieval (CBIR) to cluster video clips in the manner shown in Figure 3.2, such that the density and distribution of various colors in the selected key frames prominently contributed as a factor in grouping video clips. Using hue as a predominant influence in clustering allowed a synergic relation between the browsing environment and the grouping of video clips. The Video Retrieval Engine has two phases. The first phase, groups video clips into clusters so as to pick one cluster (C5559) that best matches the user provided textual query in phase II. This cluster serves as the starting point of reference within the browsing environment. Metaphorically speaking, this is analogous to identifying which floor within the Museum, the user is sent to. The second phase sub-clusters video clips that belong to cluster C5551) into smaller clusters that will be assigned to separate rooms. Details on Phases I and II follow in section 3.2 and 3.3, respectively. 3.1.1 Phase I All steps in Phase I are done off-line. Phase I starts by first computing the hash table HL to help with the retrieval of video clips based on the textual Meta Tags associated with the Video clip. The video database contained metadata that is associated with each video. This metadata included a simple description of the Video provided by the submitter and user—chosen keywords meant to categorize the content. These textual 18 meta-tags are used across multiple stages before resulting in the eventual clustering and grouping of video clips, most of which is computed offline. The system creates a hash table (H) of the list of Video clips. The keys (KHL) to the hash table are the collection of stemmed strings from textual meta-tags. Video Visual Component QLIQS Metadata Key Frame Extraction ' i I , Textual I K‘ ‘ irrduxmg~ ' metadata Multipe Vey'frame selection Color Histogram . ... .. Composite .»--:_-. “VT Vc ' Vector 7 I VP . Clustering .. . M Cluster Centers Visual component Spatial mm Figure 3.2 - Phase I architecture 19 For each video clip a key frame is extracted that is representative of the content of the video. As illustrated in Figure 3.2, key frames are extracted for each video, using conventional key frame extraction techniques based on scene break detection [36]. Key frames mostly serve two purposes in video retrieval - a visual summary of the video and a visual representative for image-based video retrieval [26]. Once the collection of key frames for a video is identified, a set of color histograms Vc is generated for the key frames. A scoring mechanism was derived for identifying weights for each key frame to indicate the relative representative nature of the key frame among the key frames of the same video.. The less the variance in weight of the key frames of a clip the greater the intra-similarity of a video clip. 
Weight (W K) for the Kth key frame of a video clip is computed using Equation 3.1 DK — (3.1) D g l WK: DK refers to the dissimilarity or distance between a key frame and all the other key frames that belong to the same video clip. DK is computed by summing the Cosine similarity of the normalized color histogram vector of Kth key frame (Vx) 20 with all other ‘1’ key frames (V1) that belong to the same video clip as show in Equation 3.2 and Equation 3.3. DK = 26K] (3.2) 1 Since all the tuples in the vector V K are positive, the Cosine similarity of any two vectors V K results in a value between zero (90 degrees) and one (0 degrees), where one represents maximum similarity and zero represents minimum similarity (actually complete orthogonality). —> —-) _ V 0V II vK Illl VK, ll Normalization is done by dividing Vx by IIVK II where IIVK II is the length of the vector (VK) that has ‘J’ tuples as shown in Equation 3.4. —. . i? V VK_ K K: __. _ IIVK ll 2"? / VKJ J N ow that weights have been assigned to each key frame, we cluster the Key (3.4) Frames. The Key Frames are then clustered using K-means clustering [37-39] to return M clusters, where each video clip may belong to more than one cluster as a 21 video clip may have more than one Key Frame which have been assigned different clusters. From the M clusters, the M cluster centers are identified. M is computed using techniques similar to Hartigan index that aims to make the sum of squared distance of data points in a cluster to its cluster center not monotone. E -E ”(10: “10.14211. (35) E[(+1 Where, 7(K ) = n — k —1 and E K is the sum of squared distance of data points in a cluster to its cluster center, for all the clusters. For the computation of the M clusters, each data point-(VP) is represented as a combination of the content based component of the key frame(Vc) and a second vector representing the textual description of the Video the key frame belongs to (VT). i.e., VP =(VCJ7T) (3.6) The color histogram (Vc), which is a vector is then normalized (VC) to a length equal to 1 unit resulting in the terms of the vector having a range between 0 22 and l. Normalization is done by dividing VC by ll VC II where ll l7c II is the length of the vector (VC). ” fi . VC VC chuii Ilzf 2 C 2‘7 C] I Since, all the terms in (Vc) are positive on account of the properties of the (3.7) histogram vector, normalization of (Vc ) results in all the terms being positive and less than one. Normalization is performed so as to be invariant to video length or resolution, as a larger video with higher resolution would result in a larger number of frames in the clip which consequently would increase the range of values of the bins. For the same purpose, the Video clips are converted to a common video file format (.WMV) and common screen resolution before the color histogram is computed. VT is a binary vector with each component of the vector indicating the presence or absence of a key (KHL) to the lexicographical hash table (H) from the video clips textual meta-tags. K-Means algorithm works by minimizing the sum of distances/costs of each cluster (DTotal ), where cost refers to the sum of the distances of each Key Frame to its respective cluster-center (Ccp). The distance between Ccp and a video t7 p is computed in two parts — the distance function Dc and DT. 23 Dc refers to the similarity of the visual component of the two data points, and was computed based on the color histogram values. 
The K-means algorithm works by minimizing the sum of distances/costs over each cluster (D_Total), where cost refers to the sum of the distances of each key frame to its respective cluster center (C_CP). The distance between C_CP and a data point V_P is computed in two parts: the distance functions D_C and D_T. D_C refers to the similarity of the visual component of the two data points and is computed from the color histogram values. D_C was found using the cosine similarity of the V_C components of the two data points V_P1 and V_P2. Since all the tuples in the vector V_C are positive, the cosine similarity of V_P1 and V_P2 results in a value between one (0 degrees), representing maximum similarity, and zero (90 degrees), representing minimum similarity (actually complete orthogonality).

\theta_C = \cos^{-1}\left( \frac{V_{P1} \cdot V_{P2}}{\| V_{P1} \| \, \| V_{P2} \|} \right) \qquad (3.8)

D_C is normalized so that it is comparable with D_T. Normalization of D_C is done by dividing the angle by its range, i.e., 90 degrees.

D_C = \frac{\theta_C}{90} \qquad (3.9)

Similarly, for computing the distance function D_T, the Jaccard similarity coefficient was used. The value of D_T is in the range 0 to 1, where 0 represents least similarity and 1 represents maximum similarity.

D_T = \frac{| V_{T1} \cap V_{T2} |}{| V_{T1} \cup V_{T2} |} \qquad (3.10)

The objective function of the K-means algorithm is to minimize D_Total while learning the weights W_C and W_T. The algorithm minimizes D_Total by minimizing D_C and maximizing D_T, i.e.,

\min D_{Total} = \sum_{p} \left( W_C \cdot \min D_C + W_T \cdot \max D_T \right) \qquad (3.11)

where D_Total is the sum of distances of all data points to their respective cluster centers, D_C is the cosine similarity of (V_C)_p to the cluster center, D_T is the Jaccard similarity coefficient of (V_T)_p to the cluster center, W_C is the weight of the content vector (V_C), and W_T is the weight assigned to the meta-tag vector (V_T).

Since a video may have multiple key frames associated with it, it is possible that a video may belong to multiple clusters simultaneously. The cluster centers serve the purpose of reducing scalability problems by making it efficient to assign a new video to an existing cluster based on the similarity of the new video to the cluster centers. A batch process can occasionally re-cluster the collection off-line using KNN (K-nearest neighbor) [40]. For each cluster, an intra-cluster similarity matrix is computed. These matrices are of varying sizes, on the order of N x N, where N is the number of video clips belonging to that particular cluster. Issues of quadratic growth were avoided by limiting cluster sizes. This intra-cluster similarity matrix is fed into Phase II along with the M clusters created in Phase I.
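Before moving on to Phase II, the two distance components and their weighted combination (Equations 3.8 through 3.11) can be sketched as follows. Treating the Jaccard term as a similarity that is subtracted from one is one plausible way to realize "minimize D_C while maximizing D_T" as a single minimized quantity; that interpretation, along with the function names and the equal default weights, is an assumption of this sketch rather than a detail specified in the thesis.

```python
import numpy as np

def content_distance(vc1, vc2):
    """D_C: angle between two non-negative histogram vectors, scaled to [0, 1]."""
    cos_sim = np.dot(vc1, vc2) / (np.linalg.norm(vc1) * np.linalg.norm(vc2))
    theta = np.degrees(np.arccos(np.clip(cos_sim, 0.0, 1.0)))  # Equation 3.8
    return theta / 90.0                                        # Equation 3.9

def tag_similarity(vt1, vt2):
    """D_T: Jaccard coefficient of two binary meta-tag vectors (Equation 3.10)."""
    intersection = np.sum(np.logical_and(vt1, vt2))
    union = np.sum(np.logical_or(vt1, vt2))
    return intersection / union if union > 0 else 0.0

def combined_distance(point, center, w_c=0.5, w_t=0.5):
    """Weighted cost used when assigning a data point to a cluster center.

    Smaller is better: the content term is already a distance, while the
    Jaccard term is a similarity and is therefore subtracted from one.
    """
    vc1, vt1 = point
    vc2, vt2 = center
    return w_c * content_distance(vc1, vc2) + w_t * (1.0 - tag_similarity(vt1, vt2))
```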
The results of the textual query search is based on textual search of the video keywords (Meta Tags), a well understood and scalable process. However, this search is not meant to locate the desired content, but rather intended to put the user in the right neighborhood or floor of the museum (i.e., video clips that belong to C5559), such that a content-based method can then allow browsing of similar content. Since it is not reasonable for a human user to specify a query as a video or image, this two-phase approach uses a texture query method as a starting point and a means to select one or more sample images that serve as the seed for visual browsing, which is followed by sub clustering of only the video clips present in the cluster chosen (CSEED), so as to assign sub-clusters to the various rooms in the browsing environment for spatial ordering. Using the same textual query provided by the user all the videos from within the cluster (C5559) that match the keywords in the textual query are retrieved. The retrieval of Video was done solely based on the VT Component of the Key Frame. 28 This subset of video’s retrieved (STExr) are then again clustered using K-means clustering to form K clusters, where K symbolically represented the number of rooms in the browsing environment. The equation used is M" Z DTOTAL = 2 ((WC *MIN DC ) +(WT *MAX Dr ))PTEXT (3-12) M M Where, DTotal = Sum of distances of all datapoints to their respective Cluster Centers. DC = CosineSimilarity(VC)p to clustercenter. DT =Jaccard SimilarityCoefficiert (VT)p to clustercenter. WC = Weight of Content vector (VC) computedin Phase I WT = Weight assigned to Meta Tags Vector (VT) computed in Phase I PTEXT. e STEXT This subset of video clips retrieved (STEXT) formed the basis for a starting point from which additional video clips (SCBVR) are retrieved before being eventually displayed in the environment. The selection of the room to which each cluster belongs is done such that the cluster that has the cluster center video with the highest score based on the textual search query is placed in the room closest to the user. The cluster center of the retrieval result (Srexr) is placed precisely in the middle of the room at an elevation 29 higher than any other video clip (e.g. ceiling) as seen in the Figure 3.4. Also, way finding augmentations like additional props that complemented the scene of the museum like a sofa, staircases, etc, are added to the scene to provide points of reference and markers to help the user navigate the virtual environment as seen in Figure 3.5. These way finding augmentations were added to improve both way finding performance and overall user satisfaction as discussed by Darken et a1, [3] The video clips that have been assigned to a room are oriented from the middle of the room keeping in mind the orientation of the entrance to the room, such that the centroid Video is placed precisely in the middle of the room at an elevated position. The video closest to the centroid video belonging to the sub cluster assigned to the room is placed at the far end of the room so that it directly faces the entrance and is in the line of sight of the user as the user enters the room. This is done so that the video is the first Video that the user sees on entering the room. Apart from the centroid, the first Video directly in the line of sight of the user as the user enters the room is the video closest to the centroid in the cluster center. 
Subsequent video clips belonging to the cluster are oriented along the circumference of the circle that is formed with the centroid at the center, such that farther the distance of the video from the cluster center, the farther from the first video on the circumference they were placed. 30 Figure 3.4 — A screen short of the CBVR environment showing a cluster center The system next computed a similarity matrix (SC) for the video clips in the database using both component (textual and contextual) of the composite vector. This step may be done offline. For each video oriented in the room, the two video clips closest according to the similarity matrix SC were identified and oriented right above the respective video in the room, to form a second tier of video clips in the room. Similarly, for each of the video clips that were oriented on the second tier, their respective closest similar two video clips according to SC were oriented above them to form three tiers, which ended up looking similar to the design of an amphitheatre. The video clips that were retrieved to fill the second and third tier of the room (SCBVR) are the video clips that were retrieved based on the visual content, i.e., CBVR. The 31 videos were oriented such that each higher tier had a larger diameter so as to accommodate the larger number of videos that belonged to the tier. mutt-r ‘ vat-ohm! Visas-aim Figure 3.5 — A screen shot of the CBVR environment showing props in environment 3.2 Browsing Environment Preliminarily, this work envisioned the browsing environment [41—44] as a modern day real world museum where each room within the museum is shaped like a small amphitheatre having retrieval results adorning the tiered, gradually ascending, wall Cognitive science has extensively studied the impact of environment layout on 32 human perception [45-47] . This work as used as a basis on which to design a browsing environment layout cognizant of human spatial abilities and likely to be an effective and useful visual representation of information. Some of the key factors influencing the design of the browsing environment include a perception of familiarity; the impact the real world metaphor design has on human cognition and its resulting ease of use. The representation and orientation of the retrieval results should be such that they seamlessly merge with the environment. The effect should be that of museum exhibits and paintings on the wall [48, 49]. The model this thesis proposes is aimed at creating a virtual environment that enhances the user's ability to navigate towards a relevant ‘Video from the users perspective by using techniques incorporated in CyberMap [l] and MediaMetro [2]. Results from psychological and cognitive research were used to help identify a layout of the virtual environment that will increase the users 'ease of use' and familiarity. For example, identifying the impact of using real world metaphors on the usability [2] would be feasible . Any proposed method is only an idea and the human-computer interface improvements due to that method are only theory until improved performance is demonstrated with a user evaluation. A study was conducted to determine any impact the combination of content based and metadata based retrieval of video clips has on user efficiency and estimate the extent of the enhancement on efficiency. A 33 comparison was made between conventional keyword search methods and the Virtual Video Gallery approach. 
The rubric for examination is the ability to locate specific video results and the time it takes to achieve that result.

Chapter 4
Browsing Environment Concept

The browsing environment begins when the user enters the environment. The entry point is determined through the use of a textual query that guides the user to the relevant floor of the virtual gallery building where all the video clips returned as results from the textual query are present, i.e., only video clips that belong to the cluster that was chosen as the seed (C_SEED). Alternative means of entry would include random placement in the world for exploration or a return to a previous room the user found interesting or relevant. The video clips that are returned by the textual query are placed within close proximity of the central hallway of the floor, though they may be distributed over the various rooms on the floor.

Color histograms [14, 15, 31] of key frames, along with textual metadata, form a composite feature vector that is used to compare and cluster retrieval results. Multiple similarity matrices representing each cluster were maintained for browsing organization based on relevance. The video clips are assigned to rooms based on the composite distance measure returned as a function of the color histogram vector component of the composite vector. The video clips selected by the textual query were arranged such that they are closest to the entrance of their respective room. In addition to these video clips, the room has other clips that are determined based on content similarity to the retrieval results returned from the textual query. These were placed at locations farther into the room and behind the video clips that share similar content.

Since each room is a collection of video clips that have similar color histograms, the rooms are presented using a color theme that is a contrasting color to the predominant hues of the retrieved content. This concept of grouping video clips based on the most prominent hue is based on Gestalt's law of similarity, where the human mind groups similar elements into an entity based on the similarity relation of color, form, and size [50]. Since users tend to limit their search to the first few returned results, the number of video clips returned based on contextual similarity is limited to a relatively small number of at most 18 per room. Based on the number of times a video is retrieved and the resulting/corresponding contextual similarity, the video is assigned a larger showcase and provided more space around it. The color of the frame encasing the video is an indicator of the luminance of the video as calculated from the color histograms. The richness in color of the frame or showcase that encases a video is representative of the size of the video. Similarly, other facets of the video are embedded as visual cues around the video.

Darken and Silbert found that the addition of real-world landmarks, such as paths, boundaries, and directional cues, can greatly benefit navigation performance in a virtual environment; hence the emphasis on creating an environment that closely approximates the real world [3]. The goal of the Virtual Video Gallery design is to achieve this real-world appearance, while providing a large number of cues to the contents in the room. These cues include the location of contents relative to the center of the room and to other contents in the room, the room theme itself, and the appearance of the frame.
4.1 Scenario

This section presents an example scenario, describing the process a user would expect to see in the Virtual Gallery. A typical retrieval scenario begins with the user entering the gate/entrance of the Virtual Gallery, the starting point of the museum tour. Once the user enters the Gallery, the user is presented with a snapshot of the foyer with a dialog box that allows a query to be entered, as well as buttons for alternative general browsing with no initial query. The buttons either take a user to the central hallway of the gallery or to a random location in the gallery ("I'm feeling lucky"). The dialog box serves as a means for the user to enter a query, which will refine the search results and provide a starting point for browsing. The snapshot serves as a landmark [3] and helps the user maintain navigational cues for efficient navigation within the environment.

The general browsing button gives the user access to all the floors within the Gallery, allowing free exploration of the gallery in its entirety. Each room on a floor is a collection of video clips that comprise a sub-cluster. Though the grouping of clusters may be predetermined, the layout of the rooms and the floors is determined at runtime based upon a predefined threshold that limits the number of rooms per floor. Similarly, the orientation and position of a video clip within a room is determined by its similarity relationship to the cluster center. The size of each room is governed by the size of the associated cluster and hence may vary from room to room. The rooms have a hemispherical structure with a flat roof so as to provide a panoramic view to the user without occluding the video clips adorning the gradually ascending wall. The ascending curved walls provide a tiered surface upon which the video clips are mounted. The illusion that this provides is similar to the seating arrangement in a small amphitheatre where the rows of seats are replaced by video clips. To provide a more realistic and practical design, each tier has a glass railing and a means of approach using one of the aisles. The video representing the cluster center is hung precisely in the middle of the room. The more distant a video is from the center of the room, the less similar it is to the cluster center. The texture of the
As an example, if the user supplied a textual query such as ‘Mountaineering’, it would be most likely that a majority of rooms of the floor the user is sent to would be color coded with hues representing snow, rock and mountains, representing the video clips from snow climbing, rock climbing and general mountain climbing. Thus, based on the color code of the rooms, the user may choose which room to enter. This process is assisted by representative images on the doors. 39 The room closest to the user in the hallway is the room containing the video returned with the highest similarity score in response to the query. The rooms are similar in design to the rooms Observed in the general browsing environment with the exception that there were a few video clips placed upon pedestals in the center of the room. These are video clips that belong to the cluster and those that are among the top search results returned by the users query. The clips on the various tiers just behind the clips on the pedestals were video clips with high similarity scores within the cluster to the video clips on the pedestals. 40 Chapter 5 Experiment Setup The experiment setup provided to the volunteers of the study involved two environments- a test environment and a control environment as shown in Figure 5.1 and Figure 5 .2. The two environments were run on the same computer so as not introduce any bias on account of the setup. The content returned by the control environment for a particular query were solely based on the textual meta-data unlike the video returned in the test environment (CBVR), which used the proposed Virtual Gallery approach for the purpose of returning video clips and orienting them in the environment. The prototype is meant to be a proof of concept and hence a relaxed design and a simplified version of the design is used for the CBVR environment. The prototype uses weights that helped collapse one or more stages of the design specifications into single stage for the purpose of simplicity. For example, the prototype assigned M=l that effectively created a cluster C5551) that contained all the 41 video of the repository. This decision was motivated by the limited no. of video clips (570) the system used. Similarly, in computing the color histogram of a video clip, instead of using key frames generated using video segmentation, all the frames of the Video clip were used with equal weights assigned. Also, since the prototype wasn’t going to be a 'live system' that would constantly have its database of Videos updated, clustering related to cluster centers for the purpose of scalability were omitted in the design of the prototype. The prototype also assumes the use of only one key frame for representing each video clip. This was done, as the primary reason to have more than one Key Frame based on Scene Break Detection was to help cluster video clips in Phase I while still allowing video clips to belong to more than one cluster at a time, which is not needed as we already have the Cluster C5559. The second motivation to have multiple Key Frames per video was to for the purpose of representing the video clip in the Browsing environments. For the purposes of this study, the video is not segmented during the video summarization process, Since the collection represented all short-form content [51-55]. Instead, a single frame from a time point half way through the video clip (middle image) is selected as representative. 
This is a naive method for key frame selection and does not guarantee that the key frame will be a good representation, but it is actually common practice in many systems due to the complexity of automatically deciding a representative key frame from among the frames of a clip, as noted by Smeaton et al. [25]. Selecting the middle frame provides a baseline performance for the prototype evaluation; it can only be assumed that better-chosen key frames would improve performance, especially since the system uses the key frame for the dual purpose of providing a visual summary as well as a means to assist image-based video retrieval.

A color histogram is computed for each video clip. The histogram is computed using all the frames of the video. The color histogram (V_C) is a summation of the color histograms of all the frames, such that it represents the distribution of the colors of each pixel across every frame. During the computation of the color histogram, the RGB color space is subdivided into 125 regions using a uniform subdivision of the intensity of each component into five discrete values. Each region is mapped to one of 125 histogram bins. The histogram is constructed by counting each pixel from every frame. A total of 125 histogram bins is used as a result of choosing 5 bins each for red, green, and blue. A 125-bin histogram is a common choice for a three-dimensional histogram [56-59]. Computing the histogram over whole clips was done for the purpose of expediency; the vast majority of the video content, on account of the nature of the source of the collection (www.youtube.com), consisted of short-duration video clips with consequently fewer event segments, which otherwise would have dramatically skewed the color histograms computed over the whole clip.

For this thesis, the prototype assigned equal weights to W_C and W_T for the two distance functions D_C and D_T in the computation of D_Total. Also, for the experiment, the prototype used only a limited amount of stemming (e.g., eliminating plurals, 'ing', etc.) of the meta-tags.

Figure 5.1 - A screen shot of the CBVR environment

The two graphical environments were rendered using the Java3D API. The video clips used during the study were retrieved from YouTube.com. As the experiment needed to physically have the database entirely available for analysis and clustering, it was not practical to have the model retrieve video clips on the fly from the net, nor could the reliable availability of the content be assured during the experiment. Hence, the video clips used in the study were retrieved and stored in a local repository. A collection of 570 video clips was constructed for use during the study. The clips were retrieved from YouTube.com based on categories (at the time of retrieval) including 'Top Rated', 'Top Favorites', 'Most Discussed', 'Most Viewed', etc. To include a more realistic sample of the content actually available on popular video browsing websites, the experiment included video clips from categories such as 'Recently Featured' that were likely to have fewer textual meta-tags associated with them on account of the videos being recently added.

Figure 5.2 - A screen shot of the control environment

Among the 570 video clips available, there were a number of redundant video clips on account of belonging to multiple categories or on account of being uploaded multiple times.
The experiment excluded a considerable number of these redundant video clips. The approach of grouping and orienting video clips in the CBVR environment inevitably clusters clips with similar visual content even when they do not share the textual keywords entered by the user or the same meta-tags. This is exactly the intention of the CBVR environment, but having too many duplicate clips would have partially defeated one of the purposes of the study, which was to gauge whether participants could identify a pattern in the orientation of the video clips in the environment; duplicates would have made it extremely obvious that video clips of similar visual content were grouped together in the same room. Eventually, 354 of the 570 video clips were used after removing redundant videos. The redundant clips were not necessarily 100% identical, as most of them were essentially the same video uploaded by different people with different meta-tags and not necessarily the same length. The similarity matrix of Dc values computed for the video clips was used to identify the redundant videos; this in itself showed that similarity of visual content based on the color histogram distance Dc was working as expected. Along with each video that was used, its associated metadata, especially the meta-tags, was saved and processed for use by the textual search capability.

5.1 Experiment Participants and Tasks

20 volunteers participated in the study. Each participant was asked to perform six tasks, three in each environment. In the interest of fairness, the subjects alternated environments between tasks, and the environment used for the first task was chosen at random for each subject. No explanation was provided to the participants regarding the basis of the ordering and orientation of video clips in the environments, so as to help gauge whether the users were able, explicitly or implicitly, to identify by themselves any patterns governing the grouping and orientation of the video clips in the two environments.

Each of the six tasks required the participant to find a video matching a provided description. Once the video was found, the participant was asked to find another video that the subject believed was similar in visual content; each task therefore required the identification of two video clips. Since 'similarity of a video based upon visual content' is subjective, it was made clear to the participants that what counted as similar visual content was up to his or her own perception. Screen shots of a participant having selected a video are shown in Figure 5.3 and Figure 5.4.

Figure 5.3 - A screen shot of a video being played in the CBVR environment

Figure 5.4 - A screen shot of a video being played in the control environment

Among the tasks provided for each environment, a clear description of the video to search for was provided for two of the three tasks. The third task was open-ended: the participant was asked to search for a video that was of interest to the user.
This was done to introduce a more realistic real-world scenario in which the user searches for a video of his or her own volition, while also reducing any bias that might creep in from an exhaustive list of search descriptions used during the study.

The participant confirmed successful completion of a task by pressing a button indicating that success was achieved. If the participant was unable to accomplish the task, did not believe it was possible to complete it (because the subject believed the video being looked for did not exist), or was bored or fed up with the task at hand and wished to move on, the subject signaled that the task was not completed, again by pressing a button, this time indicating that success was not achieved.

5.2 Results

Based on the results from the 20 subjects who participated in the user study, it was found that 95% of the subjects had better or at least equal success in retrieving video clips. The average percentage improvement in success rate over all tasks using the CBVR environment was found to be 23.98%. As shown in Figure 5.5, this improvement in overall task completion success rate worked out to a mean of difference (MD) of 9.68 points (on a scale of 100), with t = 3.009, df = 18 and a confidence of 99.62% for the directional test. The estimated standard deviation of the sampling distribution of the mean of the difference between the two environments was est. σ_MD = 3.218.

Figure 5.5 - Video retrieval success rate

The MD in successfully finding a video with similar visual content, according to the criteria set by the subjects, was an increase of 18.36 points (on a scale of 100), with a confidence of 99.24% and est. σ_MD = 6.845 for the directional test. Here also, all but one of the study participants had a better or equal success rate; the only participant with a lower success rate in the CBVR environment was the same participant who had a lower overall success rate. The percentage improvement in finding a video with similar visual content was found to be 30.17%. This statistic does not include one participant who had a 100% success rate in the test environment and a 0% success rate in the control environment, as the percentage improvement would result in a division by zero, and it is not valid to compare successful results to complete failure. Further task completion success rate statistics are provided in Table 5.2.1.

                                                      Control environment        Content-based video
                                                      (textual based video       retrieval environment
                                                      retrieval)
  Overall video retrieval success rate                80.6 %
  Similar video retrieval success rate                68.5 %
  Overall video retrieval success rate per query       4.5 %
  Similar video retrieval success rate per query

Table 5.2.1: Comparison of success rate in terms of percentage for the two environments

There was no statistical difference between the two environments as far as general categorical search for a video was concerned, which supports the argument that the search capabilities of the two environments were comparable. One of the more prominent trends was that the average number of queries used to accomplish a task was considerably lower in the test environment (the CBVR environment) than in the control environment.
The mean of the difference (MD) in the number of queries entered per successful video found was 4.25 queries per video found, with a confidence of over 98.5% and est. σ_MD = 2.774. This difference is more than twice the average number of queries needed to successfully find a video in the CBVR environment, which was 1.69 queries per video found, as shown in Figure 5.6. Further statistics comparing the queries entered are shown in Table 5.2.2.

                                                      Control environment        Content-based video
                                                      (textual based video       retrieval environment
                                                      retrieval)
  Average number of queries entered
  Average number of queries entered per successful retrieval of a video
  Average of users' average number of queries entered per successful retrieval of a video
  Average number of video clips viewed per user

Table 5.2.2: Comparison of query related statistics for the two environments

Figure 5.6 - Query and video related statistics

In terms of success rate per query entered, the MD in success rate per query showed an increase of 13.46 points (on a scale of 100) for the test environment, with a confidence of 98.87% and est. σ_MD = 4.769 for a non-directional test. A similar trend was noticed in the success rate of finding a similar video per query, with the CBVR environment showing an increased MD of 13.45 points, with est. σ_MD = 4.225 and a level of significance of 0.005 for a non-directional test, as shown in Figure 5.7.

Figure 5.7 - Video retrieval success rate per query entered

During the study, 89% of the participants entered fewer queries per task in the CBVR environment, as shown in Table 5.2.3. The participants entered on average 14.1 fewer queries in the CBVR environment than in the control environment, with est. σ_MD = 2.475; the test statistic for this directional test was 5.69, with a confidence of 99.99%. This worked out to an average reduction of 50.79% in the number of queries entered by the user to accomplish a task in the test environment as compared with the control environment.

  % improvement in the CBVR environment over the control environment
  95%                           Users had higher success in video retrieval in the CBVR environment.
  89%                           Users entered fewer queries per task in the CBVR environment.
  89%                           Improvement in reduction of queries entered per task in the CBVR environment over the control environment.
  23.98%                        Improvement in task completion success rate in the CBVR environment over the control environment.
  Statistically insignificant   Difference in categorical search between the two environments.
  4.25 fewer                    Queries entered per successful retrieval of a video (in mean of difference, MD, terms).
  13.5%                         Improvement in MD of overall success rate per query entered.
  13.5%                         Improvement in MD of success rate of retrieving a similar video, per query entered.
Table 5.2.3 - Summary of statistical improvement of the CBVR environment in comparison with the control environment

The post-experiment survey showed that participants felt they viewed more video clips in the test environment than in the control environment. This perceived increase averaged 2.4 points (on a scale of 10) and was statistically significant, with a confidence of 99.96% in a non-directional test. When this result was compared with the actual data collected from the experiment, it was found that the users' estimation was accurate: there was in fact an 88.10% increase in the average number of video clips viewed per task in the CBVR environment. This can partially be attributed to the observation that the participants enjoyed exploring and browsing the CBVR environment more than the control environment, which was reflected in the post-experiment survey, where the participants reported a 65% increase in enjoyment level, with 99.99% confidence for the non-directional test.

While studying the participants' propensity for choosing a video from the list of returned video clips to check whether it matched their selection criteria, it was found that 33.82% of the video clips viewed were clips that had been returned as a result of content-based video retrieval (the set SCBVR). Even more interesting, when it came to the task of searching for a video, 44.84% of successful attempts at selecting the video of choice were SCBVR video clips. The same held when the participant eventually identified a video that matched the content the user was looking for, with 45.47% of the chosen video clips belonging to SCBVR (Table 5.2.4). The video in each room that represented the centroid of the cluster was chosen as the video of final choice in only 10% of the participants' completed actions, with a low level of significance for the degrees of freedom.

  Attributes of viewing, by percentage of the SCBVR video clip set                                         Percentage
  Percentage of video clips viewed by the user that belonged to the set SCBVR                               33.82 %
  Percentage of video clips that marked a successful attempt, that belonged to the set SCBVR                44.84 %
  Percentage of video clips that marked a successful attempt of finding a similar video, that belonged
  to the set SCBVR                                                                                          45.47 %

Table 5.2.4: Summary of percentage of SCBVR video clips that were viewed by the user in the CBVR environment

In the post-experiment survey, participants were asked how fast they felt they could perform a particular task in the two environments; the participants felt that they were 19.84% slower in the test environment. This was mostly attributed to the CBVR environment being considerably slower at rendering the video clips, a characteristic of the prototype implementation and of the general observation that rendering multiple video images in a 3D environment is computationally more complex than presenting a simple list. This observable increase in the time the CBVR environment took to return video clips after a query was entered was therefore a result of the graphical rendering implementation, not of the concept itself.
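The paired statistics reported throughout this section (the mean of difference MD, est. σ_MD, the test statistic, and the confidence of the directional test) can be reproduced from per-participant scores with a short computation. The sketch below is illustrative only: it assumes SciPy is available, the function and variable names are invented for this example, and the sample data are not the study's actual measurements.

    import numpy as np
    from scipy import stats

    def paired_difference_stats(cbvr_scores, control_scores):
        # Per-participant paired comparison between the two environments.
        d = np.asarray(cbvr_scores, dtype=float) - np.asarray(control_scores, dtype=float)
        n = len(d)
        md = d.mean()                              # mean of difference (MD)
        est_sigma_md = d.std(ddof=1) / np.sqrt(n)  # est. sigma_MD (standard error of MD)
        t_stat = md / est_sigma_md                 # paired t statistic
        df = n - 1
        confidence = stats.t.cdf(t_stat, df)       # directional (one-tailed) confidence
        return md, est_sigma_md, t_stat, df, confidence

    # Illustrative data only -- not the study's measurements.
    cbvr = [90, 85, 100, 80, 95]
    control = [80, 80, 90, 75, 85]
    print(paired_difference_stats(cbvr, control))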
For the post-experiment survey questions asking how 'efficient' the environments were at retrieving a particular video, the results showed that the participants favored the CBVR environment over the control environment by 30.68%. Similarly, for the question of how 'confident' the participants were that they had found the video that best matched the search, the survey showed a 13.51% preference for the CBVR environment, as shown in Table 5.2.5.

  Survey result                                                   Control environment             Content-based video
                                                                  (text-based video retrieval)    retrieval environment
  Total number of video clips viewed
  Time spent in the environment
  'Efficiency' of the environment in retrieving a video
  'Confidence' of the user of having found the best video
  that matched the task
  'Prominence of a pattern' that grouped video clips              Statistically insignificant     Statistically insignificant

Table 5.2.5 - Post-survey feedback

As for how prominent the pattern that grouped the videos was in the test environment compared to the control environment, there was no statistically significant difference. Correlating the post-experiment survey results about the grouping of video clips with the experimental results for finding a similar video, it stands to reason that the patterns that grouped video clips based on CBVR were more implicit than explicit. An interesting statistic related to the perception of a grouping pattern is that 100% of the participants who had a lower success rate when searching for a video in the test environment than in the control environment claimed that the control environment had a more prominent pattern in the grouping of video clips, when in fact the orientation of the video clips in the control environment was purely random.

5.3 Discussion and Conclusion

The results from the study show a strong relationship supporting the use of CBVR in combination with other video metadata retrieval methods to improve the user's ability to retrieve video clips, especially when the search is based on visual content. The enjoyment factor of browsing video clips was also found to be substantially higher in a more realistic browsing environment (a museum metaphor). This also contributed to the implicit identification of patterns in the grouping of the video clips.

Chapter 6
Conclusion and Future Work

The enhanced video browsing environment presented in this thesis primarily aims at providing a means for the user to search video databases using content-based video retrieval in a manner that does not require the user to provide a video clip as an example for query-by-content. Most content-based image retrieval (CBIR) techniques require the user to provide an image in order to return results. Since it is not always feasible for users to have access to video clips similar in visual content to the video they intend to retrieve, alternative means of enabling retrieval, such as accepting textual queries and displaying similarity relationships between video clips, are proposed in this thesis. The other advantage of the enhanced video browsing environment is the ease of use attributed to the user's perceived sense of familiarity when navigating within the environment. This sense of familiarity is attributed to the realistic modeling of the environment, whose design has been extended from real-world metaphors.
The browser can run as a Java applet in a conventional web browser and appear to be a portal into a video browsing universe.

The next stage in this work would be to examine means of further improvement, and to explore the balance between the weights of content-based scores and conventional textual-based video retrieval scores, both in returning video clips and in orienting them in the environment, so as to further exploit human cognition for pattern discovery. One potential area of improvement involves the selectivity of the extraction of key frames from each video, such that the selected key frames are a better indication of the video; this can be realized with the help of event segmentation. Also, variants of key frame selection such as the combinational approaches of 'Within Event and Image Quality Fusion' and 'Cross Event and Image Quality Fusion' may be employed [26]. Another area of improvement with regard to similarity of video clips based on visual content is the use of individual key frames during the comparison of video clips, as opposed to the present method, in which the frames are combined before the similarity of video clips is computed.

The present model uses the basic content-based image retrieval approach of color histograms for CBVR, but, as discussed in the related work section, there are various alternative approaches to CBIR that could be tried to see how they perform in a virtual environment. The textual query search engine used in the current model can be vastly improved using contemporary string-matching algorithms. Another parameter with significant potential for improvement is the weight set used to balance Vc and VT during the computation of similarity of video clips.

There is tremendous potential for improvement in creating a more realistic virtual environment to further exploit the user's ability to find patterns in the environment and so aid the browsing of video clips. This first experiment was rather crude visually, focusing on the layout and organization of the space rather than the details of a finely crafted virtual gallery. Many of the proposed concepts of a virtual gallery, including contrasting color rooms, luminance-keyed frames, and the impact of a more realistic virtual world, are yet to be examined. Additional variations of the use of physical metaphors to enhance the impact of spatial memory on 3D navigation for the video gallery remain to be explored. Similarly, innovative methods of using video metadata to describe content in the browsing environment can be tested.

Chapter 7
Appendix

Pre-Experiment Survey

1) What is your age?
2) Gender (Male, Female, Do not wish to disclose)
3) Do you visit websites that store videos, like 'youtube.com' or Google Videos?
4) Do you play 3D computer games?
5) How often do you use a computer? (1-Never, 5-Very often)
6) How often do you play 3D computer games? (1-Never, 5-Very often)
10) How often do you visit websites like 'youtube.com' and Google Videos? (1-Never, 5-Very often)
11) Do you like trying new technologies? (1-Don't like, 5-Very much)
12) When you hear a website has become popular, how likely are you to visit that site? (1-Not likely at all, 5-Absolutely certain)

Post-Test Survey

The following questions are related to the "Conventional Text-based Video Search Environment".

1) How favorable an impression did the environment make? (1-Not favorable, 10-Very favorable)
2) How fast can you perform a task of searching and retrieval of a video in this environment? (1-Very slow, 10-Very fast)
3) How easy was it to search for a video using this environment? (1-Very hard, 10-Very easy)
4) How enjoyable was it to search for a video using this environment? (1-No fun, 10-Lot of fun)
5) How good was this environment in grouping similar videos? (1-Very bad, 10-Very good)
6) How prominent was the pattern that grouped videos together in this environment? (1-Not obvious, 10-Very obvious)
7) For a large collection of similar videos, how efficient do you think this environment would be in retrieving a particular video? (1-Not efficient, 10-Very efficient)
8) Using this environment, did you end up watching a lot of videos before selecting the video of choice? (1-Not many, 10-A lot)
9) How confident are you that you found the video that best matched the search criteria you were looking for? (1-Did not find, 10-Found)

The following questions are related to the "3D Video Browsing Environment".

10) How favorable an impression did the environment make? (1-Not favorable, 10-Very favorable)
11) How fast can you perform a task of searching and retrieval of a video in this environment? (1-Very slow, 10-Very fast)
12) How easy was it to search for a video using this environment? (1-Very hard, 10-Very easy)
13) How enjoyable was it to search for a video using this environment? (1-No fun, 10-Lot of fun)
14) How good was this environment in grouping similar videos? (1-Very bad, 10-Very good)
15) How prominent was the pattern that grouped videos together in this environment? (1-Not obvious, 10-Very obvious)
16) For a large collection of similar videos, how efficient do you think this environment would be in retrieving a particular video? (1-Not efficient, 10-Very efficient)
17) Using this environment, did you end up watching a lot of videos before selecting the video of choice? (1-Not many, 10-A lot)
18) How confident are you that you found the video that best matched the search criteria you were looking for? (1-Did not find, 10-Found)

Chapter 8
Bibliography

1. Gloor, P.A. CYBERMAP: yet another way of navigating in hyperspace. in Proceedings of the Third Annual ACM Conference on Hypertext. 1991. San Antonio, Texas, United States: ACM Press.
2. Chiu, P., et al. MediaMetro: browsing multimedia document collections with a 3D city metaphor. in Proceedings of the 13th Annual ACM International Conference on Multimedia. 2005. Hilton, Singapore: ACM Press.
3. Darken, R.P. and J.L. Sibert, Navigating large virtual spaces. Int. J. Hum.-Comput. Interact., 1996. 8(1): p. 49-71.
4. Robertson, G., et al. Data mountain: using spatial memory for document management. in Proceedings of the 11th Annual ACM Symposium on User Interface Software and Technology. 1998. San Francisco, California, United States: ACM Press.
5. Erol, B., K. Berkner, and S. Joshi. Multimedia thumbnails for documents. in Proceedings of the 14th Annual ACM International Conference on Multimedia (MULTIMEDIA '06). 2006. Santa Barbara, CA: ACM Press.
6. Dengel, A., et al. Human-centered interaction with documents. in Proceedings of the 14th Annual ACM International Conference on Multimedia (MULTIMEDIA '06). 2006. Santa Barbara, CA: ACM Press.
7. Huotari, J., K. Lyytinen, and M. Niemelä, Improving graphical information system model use with elision and connecting lines. ACM Trans. Comput.-Hum. Interact., 2004. 11(1): p. 26-58.
8. Tanaka, Y., Y. Okada, and K. Niijima. Interactive interfaces of Treecube for browsing 3D multimedia data. in Proceedings of the Working Conference on Advanced Visual Interfaces (AVI 2004). 2004. Gallipoli, Italy: ACM Press.
9. Cockburn, A. and B. McKenzie. 3D or not 3D?: evaluating the effect of the third dimension in a document management system. in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2001. Seattle, Washington, United States: ACM Press.
10. Rautiainen, M., T. Seppanen, and T. Ojala. On the significance of cluster-temporal browsing for generic video retrieval: a statistical analysis. in Proceedings of the 14th Annual ACM International Conference on Multimedia (MULTIMEDIA '06). 2006. Santa Barbara, CA: ACM Press.
11. Datta, R., J. Li, and J.Z. Wang. Content-based image retrieval: approaches and trends of the new age. in Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval. 2005. Hilton, Singapore: ACM Press.
12. Pass, G. and R. Zabih, Comparing images using joint histograms. Multimedia Systems, 1999. 7(3): p. 234-240.
13. Stehling, R.O., M.A. Nascimento, and A.X. Falcão. A compact and efficient image retrieval approach based on border/interior pixel classification. in Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM '02). 2002. McLean, Virginia, USA: ACM Press.
14. Brown, M.G., et al. Automatic content-based retrieval of broadcast news. in Proceedings of the Third ACM International Conference on Multimedia. 1995. San Francisco, California, United States: ACM Press.
15. Flickner, M., et al., Query by image and video content: the QBIC system. IEEE Computer, 1995. 28(9): p. 23-32.
16. Wada, N., S. Kaneko, and T. Takeguchi, Using color reach histogram for object search in colour and/or depth scene. Pattern Recognition, 2006. 39(5): p. 881-888.
17. Smith, A. and M. Webster, Developing a Global Repository and Showplace for Imagery Data. in Proceedings of the Theory and Practice of Computer Graphics 2003. 2003. p. 178.
18. Cavazza, M. and S.J. Mead, Virtual art galleries: a new kind of cultural objects? 2001: p. 590-593.
19. Hendricks, Z., J. Tangkuampien, and K. Malan. Virtual galleries: is 3D better? in Proceedings of the 2nd International Conference on Computer Graphics, Virtual Reality, Visualisation and Interaction in Africa. 2003. Cape Town, South Africa: ACM Press.
20. Gabbouj, M., et al., MUVIS: A Multimedia Browsing, Indexing and Retrieval System.
21. Aoki, H., et al. A shot classification method of selecting effective key-frames for video browsing. in Proceedings of the Fourth ACM International Conference on Multimedia. 1997. p. 1-10.
22. Bezerra, F.N. and E. Lima. Low cost soccer video summaries based on visual rhythm. in Proceedings of the 14th Annual ACM International Conference on Multimedia (MULTIMEDIA '06). 2006. Santa Barbara, CA: ACM Press.
23. Smeaton, A.F., et al. Automatically selecting shots for action movie trailers. in Proceedings of the 14th Annual ACM International Conference on Multimedia (MULTIMEDIA '06). 2006. Santa Barbara, CA: ACM Press.
24. Tse, T., et al., Dynamic key frame presentation techniques for augmenting video browsing. in Proceedings of the Working Conference on Advanced Visual Interfaces (AVI 1998). 1998. p. 185-194.
25. Smeaton, A.F. and P. Browne, A usage study of retrieval modalities for video shot retrieval. Information Processing and Management, 2005.
26. Doherty, A.R., et al. Investigating keyframe selection methods in the novel domain of passively captured visual lifelogs. in Proceedings of the 2008 International Conference on Content-based Image and Video Retrieval. 2008. Niagara Falls, Canada.
27. Fleischman, M., P. Decamp, and D. Roy. Mining temporal patterns of movement for video content classification. in Proceedings of the 14th Annual ACM International Conference on Multimedia (MULTIMEDIA '06). 2006. Santa Barbara, CA: ACM Press.
28. Setia, L., et al. Image classification using cluster cooccurrence matrices of local relational features. in Proceedings of the 14th Annual ACM International Conference on Multimedia (MULTIMEDIA '06). 2006. Santa Barbara, CA: ACM Press.
29. Schwarz, M.W., W.B. Cowan, and J.C. Beatty, An experimental comparison of RGB, YIQ, LAB, HSV, and opponent color models. ACM Transactions on Graphics, 1987. 6(2): p. 123-158.
30. Tseng, V.S., C.-J. Lee, and J.-H. Su, Classify By Representative Or Associations (CBROA): a hybrid approach for image classification. in Proceedings of the 6th International Workshop on Multimedia Data Mining: Mining Integrated Media and Complex Data. 2005. Chicago, Illinois: ACM Press. p. 61-69.
31. Pass, G., R. Zabih, and J. Miller. Comparing images using color coherence vectors. in Proceedings of the Fourth ACM International Conference on Multimedia. 1996. Boston, Massachusetts, United States: ACM Press.
32. Jing, F., et al. IGroup: a web image search engine with semantic clustering of search results. in Proceedings of the 14th Annual ACM International Conference on Multimedia (MULTIMEDIA '06). 2006. Santa Barbara, CA: ACM Press.
33. Kherfi, M.L., D. Ziou, and A. Bernardi, Image Retrieval from the World Wide Web: Issues, Techniques, and Systems. ACM Computing Surveys, 2004: p. 35-67.
34. Luo, H., et al. Large-scale news video retrieval via visualization. in Proceedings of the 14th Annual ACM International Conference on Multimedia (MULTIMEDIA '06). 2006. Santa Barbara, CA: ACM Press.
35. Gao, Y., H. Luo, and J. Pan. Searching and browsing large scale image database using keywords and ontology. in Proceedings of the 14th Annual ACM International Conference on Multimedia (MULTIMEDIA '06). 2006. Santa Barbara, CA: ACM Press.
36. Owen, C.B. and J.K. Dixon, Adaptive video segmentation and summarization, in Handbook of Video Databases. 2003, CRC Press LLC. p. 299-318.
37. Redmond, S.J. and C. Heneghan, A method for initialising the K-means clustering algorithm using kd-trees. Pattern Recognition Letters, 2007. 28(8): p. 965-973.
38. Ding, C. and X. He. K-means clustering via principal component analysis. in Proceedings of the Twenty-first International Conference on Machine Learning. 2004. Banff, Alberta, Canada: ACM Press.
39. Ahmad, A. and L. Dey, A k-mean clustering algorithm for mixed numeric and categorical data. Data & Knowledge Engineering, 2007. 63(2): p. 503-527.
40. Soucy, P. and G.W. Mineau. A Simple KNN Algorithm for Text Categorization. in Proceedings of the 2001 IEEE International Conference on Data Mining. 2001.
41. Adams, B., S. Greenhill, and S. Venkatesh. Browsing personal media archives with spatial context using panoramas. in Proceedings of the 14th Annual ACM International Conference on Multimedia (MULTIMEDIA '06). 2006. Santa Barbara, CA: ACM Press.
42. Hurst, W., T. Lauer, and R. Kaschuba. Interfaces for interactive audio-visual media browsing. in Proceedings of the 14th Annual ACM International Conference on Multimedia (MULTIMEDIA '06). 2006. Santa Barbara, CA: ACM Press.
43. Schmandt, C. Audio hallway: a virtual acoustic environment for browsing. in Proceedings of the 11th Annual ACM Symposium on User Interface Software and Technology (UIST '98). 1998: ACM Press.
44. Welch, G., et al., Projected Imagery in Your "Office of the Future". IEEE Computer Graphics and Applications, 2000: p. 62-67.
45. Colavita, F.B., Human sensory dominance (auditory versus visual stimulus). Perception and Psychophysics, 1974. 16: p. 409-412.
46. Bruce, V., P.R. Green, and M.A. Georgeson, Visual Perception: Physiology, Psychology and Ecology. 1996: Psychology Press (UK). 416.
47. Gibson, J.J., The Ecological Approach to Visual Perception. 1986: Lawrence Erlbaum Associates.
48. Vinayagamoorthy, V., A. Brogni, M. Gillies, M. Slater, and A. Steed. An Investigation of Presence Response across Variations in Visual Realism. in Presence 2004: The 7th Annual International Workshop on Presence. 2004.
49. Peter, L., L. Patrick, and C. Alan, Psychophysically based artistic techniques for increased perceived realism of virtual environments. in Proceedings of the 2nd International Conference on Computer Graphics, Virtual Reality, Visualisation and Interaction in Africa. 2003. Cape Town, South Africa: ACM Press.
50. Dempsey, C., D. Laurence, and E.T. Juhani, Gestalt theory in visual screen design: a new look at an old subject. in Proceedings of the Seventh World Conference on Computers in Education: Australian Topics - Volume 8. 2002. Copenhagen, Denmark: Australian Computer Society, Inc.
51. He, L., et al. Auto-Summarization of Audio-Video Presentation. in ACM Multimedia. 1999. Orlando, FL, USA: ACM.
52. Hanjalic, A. and R.L. Lagendijk, Automated High-Level Movie Segmentation for Advanced Video-Retrieval Systems. IEEE Transactions on Circuits and Systems for Video Technology, 1999. 9(4).
53. Vasconcelos, N. and A. Lippman, A Spatiotemporal Motion Model for Video Summarization. IEEE, 1998.
54. Toklu, C., S.-P. Liou, and M. Das. VideoAbstract: A Hybrid Approach to Generate Semantically Meaningful Video Summaries. IEEE, 2000.
55. Zhang, H.J., A. Kankanhalli, and S.W. Smoliar, Automatic Partitioning of Full-Motion Video. Multimedia Systems, 1993. 1: p. 10-28.
56. Kerminen, P. and M. Gabbouj. Image Retrieval Based on Color Matching. in Proceedings of FINSIG. 1999.
57. Benitez, A.B. and S.-F. Chang. Image classification using multimedia knowledge networks. in International Conference on Image Processing (ICIP 2003). 2003. New York, NY, USA.
58. Gong, Y. and X. Liu. Generating optimal video summaries. in IEEE International Conference on Multimedia and Expo (ICME). 2000. New York, NY, USA.
59. Lee, H.Y., H.K. Lee, and Y.B. Ha, Spatial color descriptor for image retrieval and video segmentation. IEEE Transactions on Multimedia, 2003. 5(3): p. 358-367.