A NOVEL FRAMEWORK AND DESIGN METHODOLOGIES FOR OPTIMAL ANIMATION PRODUCTION USING DEEP LEARNING By Zixiao Yu A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical Engineering – Doctor of Philosophy 2023 ABSTRACT In this dissertation, we introduce an innovative automatic animation production framework and its modules aimed at overcoming the inherent challenges in traditional animation produc- tion, including the requirement for a substantial investment of time, resources, and expertise. We have structured the production continuum into four distinct but interrelated segments: Action List Generation, Stage Performance, Auto Cinematography, and Video Generation. In the realm of Stage Performance, we propose an innovative Real-time Phoneme Recogni- tion Network (RealPRNet) for real-time lip-sync animation generation. Designed to correlate incoming audio input with the corresponding viseme (visual representation of a phoneme) in real-time, RealPRNet employs a multifaceted approach incorporating spatial, discrete temporal, and cohesive temporal features of the audio data. Significantly, the architecture of RealPRNet incorporates a stacked long short-term memory (LSTM) block and exploits long short-term features to optimize phoneme recognition. RealPRNet offers substantial improvements in reducing phoneme error rate and increasing in visual performance. Next, we delve into the text-to-animation framework (T2A), a pivotal construct in our research. While extant auto cinematography paradigms largely hinge on rudimentary cine- matographic principles (aesthetics), T2A offers a more nuanced approach. It amalgamates the input script’s visual fidelity with the resultant video’s aesthetic compliance, by concur- rently harnessing both fidelity and aesthetic models. Our evaluative studies, conducted with animators, attest to the superior quality of the resulting animations vis-`a-vis methodolo- gies solely reliant on aesthetic models, while reducing the workload of manual animation production by approximately 74%. However, it is germane to note that Auto Cinematography, when solely premised on canonical rule-based optimization, often falls short of audience expectations. we introduce RT2A (Reinforcement learning-based Text to Animation), a pioneering architecture that synergistically merges reinforcement learning techniques into the realm of automated film direction. Within the RT2A framework, each choice concerning camera orientation and posi- tioning is systematically documented and integrated into subsequent training cycles for the reinforcement learning agent. Utilizing a carefully constructed reward function, the algo- rithm is guided toward the optimization of camera configurations that emulate the stylistic decisions commonly employed by human directors. Quantitative analysis reveals that RT2A can achieve 50% improvement in audience-approved camera placements and an 80% increase in fidelity over benchmark camera placements. Furthermore, due to the recent trend of personalized animation content influenced by user-generated interactions, the unpredictability and complexity of open-world settings pose a substantial challenge for automated cinematography techniques. To tackle this, we extend our framework along with an adaptive automated cinematography approach based on Gen- erative Adversarial Networks (AACOGAN). 
According to empirical analyses, AACOGAN significantly outperforms existing methods in capturing the dynamism of open-world inter- actions. The efficacy of AACOGAN is underscored by a 73% improvement in the correlation between user behaviors and camera trajectory adaptations, as well as a marked elevation—up to 32.9%—in the quality of multi-focus scenes. These metrics offer compelling evidence that our methodology not only refines cinematographic outputs but also considerably augments user immersion within these complex virtual environments. Copyright by ZIXIAO YU 2023 ACKNOWLEDGMENTS First and foremost, I wish to convey my profound gratitude to my advisor, Dr. Jian Ren, for his invaluable guidance and unwavering support throughout my academic journey and personal endeavors. The opportunity to have Dr. Ren as a mentor has been an unparalleled privilege. My heartfelt appreciation also extends to Dr. Haohong Wang, whose expertise and counsel have been instrumental both academically and personally. I am further indebted to TCL Research America for their generous support, without which this project would not have come to fruition. I am also obliged to acknowledge my esteemed colleagues and cohort members—Chen Mi, Lin Sun, Duo Lv, Xinyi Wu, Chao Wang, and Kai—for their indispensable contributions to the project’s development. Their assistance in feedback sessions and unwavering moral support have been invaluable to the completion of this work. I extend my sincere appreciation to Dr. Tongtong Li, Dr. Sijia Liu, and Dr. Wen Li for graciously agreeing to serve on my committee. Their constructive feedback and insights have greatly enriched this dissertation. A special note of gratitude is reserved for Dr. Tongtong Li, whose guidance and concern for my well-being have been particularly meaningful. Finally, it would be remiss not to acknowledge the unfaltering love and support from my family, particularly my parents. Their enduring belief in my capabilities and their emotional sustenance have been the bedrock upon which I have built my academic and personal life. v TABLE OF CONTENTS Chapter 1 Introduction .................................................................................. Chapter 2 Animation Production Framework for Automation ................................ 1 9 Chapter 3 RealPRNet: A Real-time Phoneme Recognized Network for “Believable” Speech Animation ........................................................................... 16 Chapter 4 Auto Cinematography with Fidelity.................................................... 44 Chapter 5 Enabling auto cinematography with Reinforcement Learning ................... 83 Chapter 6 Automated Adaptive Cinematography in Open World ............................ 99 Chapter 7 Conclusion and Future Work............................................................. 134 BIBLIOGRAPHY.......................................................................................... 138 vi Chapter 1 Introduction 1.1 Overview After centuries of extensive and rapid evolution, film, often referred to as the “seventh art”, has firmly established itself as an integral component of contemporary entertainment. With the advancement of computer graphics (CG) technology, the creation of film art is no longer limited to reality. The use of computers to create animated films in virtual space has become an essential part of the film industry. 
In recent years, with the rapid development of the game industry, most games also contain a significant amount of CG animations to introduce the characters and background of the game and promote the development of the story. These animations provide a more immersive experience to the viewers and players. However, the production of animated films is an extremely complex and time-consuming endeavor that requires a significant amount of effort, expertise, and cooperation from experts in different fields to complete the entire production process. Over the past few decades, a considerable amount of research has been dedicated to games and movies to streamline this process, either conserve resources or reduce production periods. Particularly noteworthy in these efforts are the use of game engines to build virtual scenes, quickly generate a real-time CG animation, or create an animation video that relies on the game’s existing scenes and resources (i.e. Animations based on Minecraft). Based on these concepts, various types of 1 software have been developed for different stages of animated film production (i.e. iClone and Wolf3D), which significantly reduce the associated staff and resource requirements for high-quality content creation. The default users of these software programs are experts in the respective fields of animation production with sufficient knowledge about the empirical rules and established conventions. It takes a significant amount of time and effort for the user to learn and master the software. Therefore, there is still an insurmountable barrier for amateurs to create their own animations even with these tools. However, with the rapid expansion of the internet industry in recent decades, the creation and distribution of content are no longer monopolized by a few large corporations. Among the forms of content creation, video is an easier way to capture the audience’s attention than text. Anyone can publish their newly created video clips on YouTube with a few button clicks. Of course, as one of the best ways to tell a story, there is no exception for animated videos. Even though most of them are not comparable to the high-quality productions produced in the visual performance by companies such as Pixar and Netflix for various reasons, many of them demonstrate the creations and impressive plots that can compete with these famous animations. We believe there is a significant advantage in helping textual content creators by visualizing their stories by animating their works. Motivated by these insights, we have developed the 3D automatic animation production framework, and it ideally operates by automatically creating the corresponding 3D animation with just a standard script input. Users are no longer required to have extensive knowledge of each stage in the production of the animation, which can further reduce the consumption of time and resources. We initiated our research by collaborating with animators and directors to deconstruct the entire animation production process. This collaborative effort led to the segmentation of the process into several discrete modules: Action List Generation, Stage Performance, 2 Auto Cinematography, and Video Generation. Each module is underpinned by distinct tech- nologies, with outputs that modify the original script format. This modular architecture allows users to pause and manually adjust the output of any stage, should they find it un- satisfactory. 
Given that the evolution of plots in animation is typically character-centric, the Action List Generation module transforms the script into a chronological list of actions. These range from movements and dialogues to interactions. Each entry in this list carries comprehensive information, enabling characters to perform the specified action in the virtual environment. The Stage Performance ingests this action list and directs virtual characters to execute actions in a selected scene sequentially. The Auto Cinematography module has the responsibility of capturing these performances in three dimensions and translating them into two-dimensional frames through optimal virtual camera placements. Finally, the Video Generation module crafts the ultimate animation video using the optimized camera config- urations provided by the Auto Cinematography module. Two modules, in particular, significantly influence the animation’s final quality: the lip-sync speech animation within the Stage Performance and Auto Cinematography. Pre- cise lip-syncing is paramount as viewers instinctively focus on the speaker’s mouth in ani- mated conversations. Any inconsistencies here can drastically reduce immersion. Meanwhile, optimal camera placement necessitates considerable expertise in cinematography. Exist- ing auto-cinematography solutions either produce subpar results or struggle to adapt to diverse scenarios. In modern virtual environments, especially those with open-world de- signs, users have expressed a growing interest in crafting and recording bespoke narratives. The unpredictability of these scenarios makes it challenging to employ conventional auto- cinematography techniques. To address this, we propose an innovative methodology based on Generative Adversarial Networks (GAN) for auto-cinematography. Alongside this, we 3 introduce novel evaluation criteria, seeking to enhance adaptability in line with evolving technological demands. This dissertation delves deep into these challenges, offering insights by amalgamating techniques from various disciplines. 1.2 Summary of Contributions This section elucidates the methodologies we propose for advancing towards ideal automated animation production. The foundational motivations and the architecture of our framework are highlighted herein. The core contributions of our research can be dissected as follows: 1.2.1 Basic Automatic Animation Production Framework Our paramount objective was to integrate automation within the animation production workflow. To our understanding, our work stands as a pioneering attempt to forge a com- prehensive 3D animation production framework that encompasses the entire journey from script conceptualization to the culminating animation video. Through iterative discussions with industry experts, we distilled the animation produc- tion procedure into four cardinal modules: Action List Generation, Stage Performance, Auto Cinematography, and Video Generation. Each of these modules is tasked with specific duties, utilizing disparate AI technologies for automation. Given the current technological constraints, there’s a necessity for manual intervention by animators to refine module outputs. Nevertheless, we are optimistic that as associated tech- nologies evolve, our framework could seamlessly generate animations straight from scripts. 
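As a concrete illustration of this pipeline, the sketch below models one action-list entry and the hand-off between the four modules in Python. Every name here (classes, fields, functions) is a hypothetical interface invented for illustration, not the framework's actual API; the entry's fields mirror the attributes described in Chapter 2, and the stub bodies stand in for the real modules.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class ActionEntry:
    """Hypothetical shape of one action-list entry (field names are illustrative)."""
    action_type: str                                    # e.g. "walk", "speak", "interact"
    start_time: float                                   # seconds from scene start
    end_time: float
    actor: str                                          # initiating character
    target: Optional[str] = None                        # receiving character/object, if any
    start_position: Tuple[float, float, float] = (0.0, 0.0, 0.0)
    end_position: Tuple[float, float, float] = (0.0, 0.0, 0.0)

    @property
    def duration(self) -> float:
        return self.end_time - self.start_time

# Hypothetical module interfaces mirroring the four-stage pipeline.
def generate_action_list(script_text: str) -> List[ActionEntry]: ...
def stage_performance(actions: List[ActionEntry]) -> "PerformanceData": ...
def auto_cinematography(performance: "PerformanceData") -> "CameraPlan": ...
def generate_video(performance: "PerformanceData", cameras: "CameraPlan") -> "VideoFile": ...

def script_to_animation(script_text: str) -> "VideoFile":
    """End-to-end flow; users may pause, inspect, and edit each intermediate result."""
    actions = generate_action_list(script_text)
    performance = stage_performance(actions)
    cameras = auto_cinematography(performance)
    return generate_video(performance, cameras)
```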
4 1.2.2 RealPRNet: A Real-time Phoneme Recognized Network for “Believable” Speech Animation The domain of real-time speech animation, particularly driven by singular modalities like audio, remains a challenging frontier. The success of such endeavors largely hinges on robust real-time phoneme recognition. In light of this, we introduce a cutting-edge neural network scheme—RealPRNet—for real-time phoneme recognition and subsequently implement a real-time audio-driven 3D speech animation production system. Comprehensive evaluations reveal that RealPRNet surpasses contemporary algorithms in phoneme recognition accuracy. Empirical analyses denote that, in comparison to leading algorithms [1, 2], RealPRNet achieves a noteworthy 20% PER enhancement and a 4% uptick in animation quality based on subject testing. 1.2.3 Enhancing Auto Cinematography with Fidelity To streamline the process and mitigate the intricacies of animation production, we introduce Text2Animation (T2A). This framework is a derivative of our foundational animation pro- duction framework, designed to transform input scripts into animations while concurrently establishing optimal virtual camera placements. A novel evaluation metric, termed ”fidelity distortion,” is introduced to ascertain the consistency between auto cinematography-produced videos and their source content. This metric offers a unique perspective on gauging the quality of the resultant animation. A computational model, engineered to evaluate fidelity distortion efficiently and expediently, has been established through a thorough analysis of contemporary video comprehension techniques. Our pioneering text-to-animation framework, T2A, bridges the gap between 5 script and visual content, reporting a reduction in 3D animation production time by a staggering 74%. Empirical evidence suggests that T2A can curtail the manual input in animation produc- tion by approximately 74%, with the novel optimization framework enhancing the perceptual video quality by up to 35%. 1.2.4 Enabling Auto Cinematography with Reinforcement Learning Building upon existing auto cinematography algorithms [3–5], we have refined our T2A framework [4] to allow directors to capture their preferred camera configurations. These configurations subsequently serve as training data for reinforcement learning models aimed at understanding lens language paradigms. Our framework’s capability to seamlessly gather data for auto cinematography during an- imation production renders reinforcement learning integration feasible. To this end, we pro- pose RT2A (Reinforcement learning-based Text to Animation), a holistic framework channel- ing reinforcement learning towards auto cinematography. In RT2A, every directorial decision concerning camera configurations is documented, laying the foundation for subsequent agent training iterations. A meticulously crafted reward function guides the algorithm towards the optimal policy, simulating the human director’s decision-making approach in camera selections for specific scenes. Preliminary outcomes underscore RT2A’s prowess in emulating directorial lens language patterns. Bench-marked against reference algorithms, RT2A boasts an uptick of 50% in camera placement approval rate and an impressive 80% surge in mirroring the tempo of 6 camera transitions. 1.2.5 Automated Adaptive Cinematography For User Interaction in Open World Addressing the intricacies of open-world character interactions demands a specialized ap- proach. 
We thus introduce AACOGAN, an innovative auto cinematography framework hinged on Generative Adversarial Networks (GANs). AACOGAN is adept at orchestrat- ing camera maneuvers in harmony with the spontaneous actions of characters in expansive virtual terrains. Customized quality assessment metrics, specifically tailored for automated cinematography, have been formulated to rigorously evaluate generated camera trajectories. To ensure congruence between character actions and camera movements, especially within scenes populated by multiple characters, AACOGAN employs a distinctive input feature coupled with a generator architecture. This strategic design choice ensures the meticulous alignment of camera trajectories with the gamut of character interactions, exhibiting pro- nounced efficacy in multi-character scenarios. Empirical analyses lend credence to AACOGAN’s capability in amplifying auto cine- matography proficiency within open-world scenarios. Notable improvements observed in- clude a 73% enhancement in the correlation between user activities and camera paths and a 32.9% elevation in the rendition quality of scenes demanding multi-focal attention. 1.3 Dissertation Organization The rest of this dissertation is structured as follows. In Chapter 2, we introduce the animation production framework we have developed and used throughout the entire dissertation. All 7 of our efforts are intended to enhance the capabilities of this framework and achieve a truly automatic animation production without human involvement. Following that, in Chapter 3, we present our RealPRNet for the virtual character’s lip-sync animation creation. Next, in Chapter 4, we present a novel auto-cinematography approach that includes the fidelity model in the optimization process. In Chapter 5, we present a reinforcement-learning-based auto cinematography approach that can learn the lens language from the human director and apply it automatically to the new animation production. Furthermore, in chapter 6, we present an auto-cinematography base on Generative Adversarial Networks. This innovative automated photography technique, when deployed in unscripted open-world settings, can synthesize camera movement trajectories in real time, taking into account factors such as emotion, aesthetics, and character actions to meet specific requirements. Finally, in Chapter 7, we conclude this dissertation and present our ongoing and future work. 8 Chapter 2 Animation Production Framework for Automation 2.1 Introduction During the initial phase, one of the most critical steps in developing an automated animation framework is to determine the number of modules within the framework and the function- ality associated with each module. Because each module may involve several different steps in the traditional production process and demands various technologies to support it, the development and research of each module require collaboration with specialists in various fields. On the other hand, because of the limitations of the current technologies, the computa- tional analysis results are not guaranteed to be completely accurate. For example, the Action List Generation results which are analyzed by the Natural Language Processing (NLP) tech- nology, especially the location information of the characters in the scene, still require manual revision to correct the mistakes. Therefore, confirming that the output of each module meets the criteria of the expert is an essential step in the framework development process. 
In or- der to facilitate understanding, comparison, and quick identification of problems, we have designed the output of each module as a variant of the original input script content. The 9 user can directly compare the results generated by each module to the original script, and unsatisfactory parts or errors in the results can be modified conveniently. In this chapter, we introduce a basic automatic animation production framework and the functions of each module. Awareness of each module in this framework would help the reader to better understand our motivation and the importance of our subsequent efforts. In Section 2.2, we will introduce the workflow of the entire animation production framework, the functionality of each basic module, and demonstrate the output of each module. In Section 2.3, a conclusion is drawn to summarize the work done in this part. Figure 2.1: Animation Script Example 2.2 Framework In this section, we illustrate the proposed automatic animation production framework in detail which includes the flow of using the four primary modules and the related technologies 10 involved. 2.2.1 Action List Generation The Action List Generation module is designed to analyze the original input script, and then generate the corresponding action list and related scene information through the related Natural Language Processing (NLP) technologies. Figure 2.2: Analyzing the script and generating the corresponding action list by NLP tech- nologies Using the script illustrated in Figure. 2.1 as a case study, it becomes apparent that the script adheres to a unique format. As delineated in Figure. 2.2, the Action List Genera- tion module initiates its procedure by analyzing the Scene Heading to discern the settings in which the narrative unfolds. It then references our database to select the most fitting scene resources. Subsequent to this, the module collates all script-embedded actions in a chronological fashion, thereby constructing a comprehensive action list. Each entry, or ac- 11 tion object, within this list, is mandated to encompass key attributes: action type, initiation and termination timestamps, duration, initiating agent, receiving entity, and the spatial co- ordinates of characters at both the commencement and conclusion of the action. Notably, the act of speaking warrants special attention due to its real-time generation requirements, as opposed to pre-stored database entries. The operational intricacies of Action List Generation extend to the pairing of audio resources and the subsequent generation of lip-sync animation files. In instances where requisite audio files are absent from the database, users are prompted to contribute by manually dubbing the action. A thorough exposition of the methodology employed for generating lip-sync animations will be articulated in Chapter 3. Upon the action list’s completion, users are afforded the opportunity to scrutinize its integrity, amending any inaccuracies or inconsistencies based on their interpretative understanding of the script. 2.2.2 Stage Performance We have developed the Stage Performance environment (in Figure. 2.3) by using the game engine, Unity []. Figure 2.3: Our Stage Performance virtual environment developed by Unity software As shown in Figure. 2.4, Stage Performance automatically matches the scene model, 12 character model, action model, and other resources, then generates all actions according to the input action list. 
Each character that appears in the story has a separate action track, as different characters may perform various actions in the same period. Each block in the action track represents an action object and contains all its relevant properties, which can be adjusted at the user’s discretion. In this software, the user can observe the entire performance generated according to the input action list from any viewpoint in the virtual scene and decide whether further modifications are needed. All data corresponding to these action performances of the virtual characters in the 3D scene in the Stage Performance environment are collected and used as input for the Auto Cinematography. Figure 2.4: The corresponding action blocks and audio blocks are automatically generated in the appropriate tracks according to the action list 2.2.3 Auto Cinematography The function of the Auto Cinematography is to calculate the optimal lens language usage for the entire performance in the 3D scene based on the performance data and relevant 13 environmental information obtained by the Stage Performance. This module can significantly reduce the user’s knowledge requirement of cinematography and time consumption compared to the traditional workflow. Thus, we have concentrated our efforts on improving the capabilities of this module. In Chapter 4 and Chapter 5, We have described the two different approaches that we proposed in detail, a rule-based dynamic optimization approach and the other based on reinforcement learning which learning the lens language usage from the human directors. Figure 2.5: The corresponding camera blocks audio blocks are automatically generated in the auto camera track according to the result from Auto Cinematography As shown in Figure. 2.5, after the automatic cameras have completed the calculation, the related results are displayed in the interface of our Stage Performance software. If the user is not satisfied with the result, the user can select any camera block in the camera track to adjust its corresponding parameters. 2.2.4 Video Generation After the user confirms the output of Auto Cinematography, the corresponding video file can be generated directly by the Video Generation. This module corresponds to the editing work in the traditional animation production workflow. The movie edit is a complex and tedious endeavor that requires sufficient relevant knowledge. The automated implementation of this module will be part of our future work 14 but is outside the scope of this dissertation. Thus, this module does not edit the results in our current production process but directly outputs video according to the original camera track shown in Stage Performance software. 2.3 Conclusion In this chapter, we have introduced the fundamental component of the automatic animation production framework that we have proposed and developed, which will be used as the foun- dation for our subsequent works to make the animation production process more convenient, more accurate, and more efficient. 15 Chapter 3 RealPRNet: A Real-time Phoneme Recognized Network for “Believable” Speech Animation 3.1 Introduction In the Action List Generation, when a dialogue scene occurs in the input script, the virtual character will have a corresponding particular animation for its mouth movement. These mouth movements cannot be pre-generated in the Stage Performance action library. 
Thus, Action List Generation needs to generate correlative mouth animations for the virtual char- acter’s dialogues in real time during the production process. Human beings are very sensitive to any facial artifacts, or uncoordinated or unsynchronized performance of virtual characters, which makes facial animation, in particular speech animation production, very challenging since animation simultaneously involves voice and mouth movement. Realistic speech animation [1,6] has been always very compelling since it can provide the most immersive experiences with high-fidelity human-like virtual characters. However, the high cost involved in the production process and the substantial data requirements, including audio and video, are likely to create privacy issues. 16 For applications that accept lower realism effects but require good privacy preservation, such as avatar-based online conferences, anonymous alerts, and customer service, audio data could become the only media available during the process. In such scenarios, the virtual characters are required to mimic mouth movements matching the voice input seamlessly, and this is what is called believable speech animation. The “believable” speech animation requires that the algorithm works, under practical resource constraints, for all possible virtual faces with various 3D models and produces synthesized video with sufficient realism for users on the remote end to feel comfortable. The widespread use of NLP technologies, such as speech recognition [7, 8] and speaker identification [9], demonstrates that such a “believable” speech animation is achievable by using only audio input [10–12]. In general, the audio input is fragmented into small pieces called frames from where features are extracted. Phoneme, which is the distinct unit of sound, is predicted by using the extracted features and then mapped into the corresponding static mouth shape called viseme (its counterpart in the visual domain [13]). In such an audio-driven speech animation framework, the accuracy of the phoneme recognition can directly affect the quality of the speech animation. With the latest advances in deep learning, the phoneme recognition topic has been re- visited using completely new methodologies [14]. The advances in phoneme recognition accuracy can lead to better speech animation. Compared with the traditional models, such as Hidden Markov models (HMM), deep-learning neural network (DNN) based approaches generally decrease the phoneme recognition error rate by 10%. For real-time applications, finding the balance between latency and accuracy is critical, as indicated in the design philosophy of RNN and LSTM. The temporal correlation between neighbor audio frames can play a very significant role in recognition accuracy improvement. 17 However, the phoneme recognition accuracy improvement achieved by adding more neighbor frames in the sliding window is at the cost of latency. Therefore, a crucial problem that needs to be addressed to improve the real-time speech animation quality is to find a phoneme recognition solution that can achieve the best accuracy with reasonably low latency for real-time animation. In this work, we propose a novel deep neural network scheme, called RealPRNet. With a carefully designed network architecture that considers both temporal and spatial correlations, RealPRNet predicts phoneme stream for a sliding window of audio frame input. 
A novel concept called LSTM Stack Block (LSB) is introduced to maximize the learning efficiency of the temporal-spatial patterns(more details are covered in the network design section). To build an end-to-end audio-driven real-time believable speech animation system, the RealPRNet is inserted into the typical speech animation process that converts the audio input into a recognized phoneme sequence and drives a facial animation module. Inspired by the JALI model [12] and properties of the blend shape facial model, the animation is achieved by mapping the recognized phoneme label to a set of parameters to control four basic blend shapes that have hidden physical correlation. The major contributions of this work can be summarized as follows: 1. We propose a novel neural network-based real-time phoneme recognition scheme. 2. We develop a real-time audio-driven 3D speech animation production system. 3. We conduct a comprehensive evaluation to show that the proposed RealPRNet scheme can achieve great improvement over the state-of-the-art algorithms. The remaining part of the chapter is organized as follows: In Section II, we describe the details of our animation production system components and introduce the deep-learning- based Realtime Phoneme Recognition Network (RealPRNet) with our insight in Section III. 18 We present an evaluation and experimental results in Section IV. Finally, in Section V, a conclusion is drawn to summarize the work done in this research. Figure 3.1: System overview of the proposed real-time audio-only speech animation system corresponds to the phoneme stream {S, OW, IY} of vocabulary /slowly/ 3.2 Related Work A typical audio-driven speech animation uses viseme as an intermediary. It can be seen as a combination of the phoneme recognition and the phoneme-driven speech animation. 3.2.1 Phoneme Recognition Phoneme recognition using audio’s fundamental frequency [10] and HMM [15–19] has been an active research topic for decades. However, the accuracy of these traditional schemes is not sufficient for speech recognition and speech animation production. The advances in deep learning and neural networks have greatly enhanced the accuracy of phoneme recognition [14, 20–22]. In [20], a feed-forward deep neural network model was proposed and achieved an error rate of 23.71%. In [14], it was reported that a feed-forward deep neural network architecture lowered the error rate to 16.49%. In [22], a network architecture called CLDNN that combines Convolutional Neural Network (CNN) LSTM and fully connected DNN was 19 proposed. It can further improve the performance for 4-6% compared to the LSTM. In particular, around 39 different phonemes can be recognized compared to the fundamental schemes [10], which can recognize less than 10 different phonemes. However, these methods are all designed for offline situations without latency constraints, which enables any feature extraction schemes to be used. In this work, we develop the RealPRNet to achieve accurate real-time phoneme recognition. 3.2.2 Speech Animation In [10], the audio input is fragmented into small pieces called frames, where fundamental frequency features are extracted from. Phonemes are predicted by recognizing the vowels and the basic fricative consonants from the features and then mapped into the corresponding static animation called viseme (its counterpart in the visual domain [13]). In this frame-based processing mechanism, the latency is negligible as it almost equals the processing time of a single frame. 
However, a lack of consideration on neighborhood context information during the process may significantly limit the recognition accuracy and quality of the generated animation as the system can only recognize basic phonemes. In [11] and [12], a word-based processing mechanism is adopted to achieve much higher animation quality. By utilizing force alignment [23], phoneme transcription can be extracted from an audio chunk that contains multiple words with reasonably high accuracy. In [24], the neighboring phoneme pronunciation transition, named co-articulation, was considered. However, the latency of word-level duration becomes unacceptable for real-time applications. In this work, we use a sliding window of frames to obtain the corresponding phoneme at each timestamp. Our designed facial model addresses the co-articulation problem by considering the duration of the current phoneme’s articulation and the corresponding associated phonemes before and 20 after it. 3.3 System Components Design In a typical real-time audio-driven speech animation system, the phoneme stream first is recognized from the audio input and then mapped to the corresponding parameter stream to derive the 3D facial models. Viseme is the visual mouth shape representation that has been widely used in speech recognition and animation, as it can be mapped directly from the phoneme. However, vice versa is not true since multiple phonemes may be mapped to the same viseme if they have similar mouth shapes during the pronunciation, such as /b/ and /p/). It is important to realize that so far there is no common standard to regulate the viseme classes [25], for example, [2] used 20 different visemes, [26] used 26 different visemes, and [27] used 16 visemes, etc. Figure. 3.1 gives a system overview of the proposed real-time audio speech animation systems corresponding to the phoneme stream {S, OW, LY} of vocabulary /slowly/. When the system receives an audio signal, it is transformed into the corresponding MFCC features, which is the input of the RealPRNet. The RealPRNet predicts the phoneme stream and then maps it into the corresponding points (or blocks) in the 2D viseme field. The anima- tion curve shown in Figure. 3.1 connects the points on the 2D viseme field and generates a parameter stream that can drive the 3D facial model smoothly and seamlessly. To support real-time interactivities, the latency between receiving audio input and outputting anima- tion is required to be controlled below certain thresholds(e.g.,200ms [28, 29]) to ensure a believable and audience-comfortable result. A buffer mechanism is adopted in the system to 21 dynamically determine the size of the sliding window and ensure that the latency is within the required threshold. The four buffers are input feature buffer B1, HMM tied-state output buffer B2, output phoneme buffer B3, and phoneme selected buffer B4. 3.3.1 Input Features Extraction Component We employ the general audio process pipeline to extract the input audio features. When the raw audio signals are received in the system, they are transformed into frequency-domain signals called MFCC, which is a data format that has been widely applied in automatic speech recognition (ASR). It is observed that the human voice is a combination of sound waves at different frequencies. The MFCCs can balance the variation of the sound change at different frequency levels. Generally, there are 100 frames per second in audio, and each audio frame is 25ms with a 15ms overlap. 
For each audio frame, the first and the second derivative components of the MFCCs are aggregated to a vector that represents the single audio frame, f . The input feature xt at time t is transformed to the network feature ft at time t surrounded by its forward and backward contextual vectors, denoted as [ft−n, . . . , ft, . . . , ft+m]. The value of n represents the number of audio frames before time t, and the value of m represents the number of future audio frames. The integrated xt is used to predict the phoneme label at time t. Thus, the value selection of m directly impacts the latency, which is 10m ms. The larger the value of m, the better potential recognition accuracy but longer latency. When m = 0, no future frames are buffered to introduce additional latency. However, the potential advantages of context information have been taken into consideration to improve phoneme recognition performance. 22 3.3.2 Real-Time Phoneme Recognition Component Design Real-time application systems need to be able to extract the audio features with high accu- racy and low latency. Our proposed phoneme recognition scheme RealPRNet (described in Section III) can ensure a high-accuracy real-time recognition. The output of the RealPRNet is used as input to the HMM tied-state decoder H to calculate the output phoneme P . For the system to produce a smooth animation, the phoneme recognition system needs to be able to output the predicted phoneme with every A ms. Thus, a buffer mechanism is introduced in the system. An overview of the phoneme recognition system with buffers is shown in Figure. 3.2. All buffer in the figure is fixed size first in first out (FIFO). Figure 3.2: Phoneme recognition System with Buffers In the first step, the input raw audio signal is transformed into the network input feature ft every A ms time interval (equal to the sampling interval), and the ft is stored in the input feature buffer B1. The transformation of the raw audio signal to audio features causes the first delay d1 + e1, where d1 is the median calculation time and e1 is the corresponding fluctuation time. In our experiment, time consumption by this process is very stable and 100% less than A ms (10ms in our experiment) as shown in Figure. 3.3 (blue). All the audio frame features in B1 are then used to construct xt. The size of B1 depends on two things: the selected values m and n, and the RealPRNet time consumption d2 + e2 for each prediction. To guarantee a smooth output, the following inequality needs to be 23 satisfied: d2 + e2 ≤ br · A, br = 1, 2, 3, · · · (3.1) br is the batch size used in the prediction progress. The neural network can parallelly predict br outputs in one run with a minimal increase in the computational overhead if br is a small value (i.e. br = 10). This is because when br is small, the input features data sizes are relatively small compared with the RealPRNet’s parameters. There is no sensible difference to today’s computational power (e.g. in our experiment, one RTX 2080 Ti graphics card has been used) when processing single or br input features. The major time consumption is in data parsing and transmission. The RealPRNet takes the input features from B1 every br · A ms and predicts br outputs [ht, ht−1, · · · ]. These predicted outputs are stored in HMM tied-state buffer B2. There are br sub-buffers in B1 with size m + n + 1 each and contents ft−n−i, · · · , ft−i, · · · , ft+m−i. Because m forward audio frames are used to construct the x, the system latency is increased to m · A + br · A ms. 
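A sketch of this feature-extraction step is shown below. It assumes librosa for the MFCC computation (the dissertation does not name a library) and uses 13 cepstral coefficients as an illustrative choice; the 25 ms frame length and 10 ms hop reproduce the 100 frames-per-second setup described above, and buffering m future frames adds the stated 10m ms of latency.

```python
import numpy as np
import librosa

def frame_features(wav_path, sr=16000, n_mfcc=13):
    """MFCCs plus first and second derivatives per 25 ms frame with a 10 ms hop (100 fps)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    # One row per audio frame f: (num_frames, 3 * n_mfcc)
    return np.concatenate([mfcc, d1, d2], axis=0).T

def build_window(frames, t, n, m):
    """x_t = [f_{t-n}, ..., f_t, ..., f_{t+m}]; using m future frames costs about 10*m ms of latency."""
    idx = np.clip(np.arange(t - n, t + m + 1), 0, len(frames) - 1)  # clamp at the utterance edges
    return frames[idx]                                              # (n + m + 1, 3 * n_mfcc)
```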
In our experiment, the value of br is 4 which can ensure that equation (3.1) is satisfied in 99% of the cases. The network time consumption distribution is shown in Figure. 3.3 (orange). The neural network in our phoneme recognition system does not directly predict phoneme labels but predicts HMM tied-states instead which is because the combination of neural network and HMM decoder can further improve the system performance. Such kind of recognition system structure can be found in many related works [30, 31]. The predicted HMM tied-states in B2 are used as input of the HMM tri-phone decoder. For each time interval br · A, the decoder takes all the values in B2, calculates the corresponding phonemes [Pt, · · · , Pt−br+1] and stores it in B3. The calculation time is d3 + e3 and the size of B3 depends on the number of previous states used to calculate the [Pt, · · · , Pt−br+1]. The HMM 24 decoder improves the results by using the previously predicted tied-state together with the ht to calculate the corresponding phoneme Pt. However, the decoder should only use the tied-state as a reference rather than relying on it as in the speech recognition system because the phoneme-to-phoneme does not have a strong inner logic relation as word-to-word. Thus the decoder is constructed with a bigram phoneme model which focuses on the acoustic part (signal to phoneme) of the system rather than the language part (phoneme-to-phoneme) [14]. In our experiment, the calculation time consumed in decoding is stable and 100% less than br · A. Thus, the overall latency from the raw input audio signal to the corresponding output phoneme is (m + br) · A + D + et, where D = d1 + d2 + d3 and et = e1 + e2 + e3. Figure 3.3: Audio feature transformation time consumption distribution (blue), Network Prediction time consumption distribution (orange) B3 is the buffer that is used to control the final output of P of the phoneme recognition system. This output phoneme with a timestamp is stored into B3. The system continuously takes the first phoneme from B3 and uses it as the input to the animation system every A 25 ms time interval. If a predicted phoneme with a timestamp has negative et, the system waits for A ms before the de-queued output from B3. If a predicted phoneme Pet with a time stamp has positive et and the buffer contains no more phoneme, the system outputs the last phoneme and stores Pet as the last phoneme in the buffer. Pet can be used as an output if the next predicted phoneme also has a positive et. If not, the system drops the Pet and then outputs the next predicted phoneme after a time interval of A ms. With this mechanism, the phoneme recognition system can have a stable output stream with a time interval A ms, and the overall latency from raw input audio signal to the corresponding output phoneme stream is (m + br) · A + D, where D = d1 + d2 + d3. This stable output phoneme stream is stored in B4. The following pseudo-code represents this process. The buffer is B3, and output dP is the output phoneme which is going to be stored in B4. while true : RealPRNet_Thread : if cur_time == pre_time1 + a_t : P = RealPRNet ( x_t ) p_last = P buffer . enqueue ( P ) Output_Thread : p_last = None if cur_time == pre_time2 + d_t : if buffer is empty : pre_time2 = cur_time return p_last else : pre_time2 = cur_time d_P = buffer . 
dequeue () p_last = d_p return d_p Figure 3.4: Python code for RealPRNet processing In this pseudo-code, the buffer is a queue structure, the cur time is the worldwide system 26 time and the d P is the beginning element in the buffer. The animation subsystem takes data in B4 and uses it to produce the corresponding speech animation. For the final speech animation generation, the output phoneme sequence of each time interval A ms from the previous component needs to be further processed. B4 is used to select the appropriate next phoneme frame for the animation curve generation. As shown in Figure. 3.6, the same phoneme can occur in different frames. The size of B4 is the average single phoneme pronunciation time in the audio frame. The output phoneme of the recognition system is stored in this buffer first. The phoneme pronunciation is dynamic progress which means that the viseme phoneme at the beginning of the pronunciation is not the same as the corresponding phoneme viseme as shown in Figure. 3.5. Figure 3.5: A phoneme corresponding viseme at different frames during the pronunciation of phoneme /o/ Figure 3.6: Phoneme selection for 2D viseme field animation curve generation Thus, an appropriate phoneme frame (for example, the phoneme frame can represent the rightmost one in Figure. 3.5 in a sequence of phoneme /o/ frames) should be selected from 27 certain phoneme pronunciation frames to calculate the complete viseme’s transformation time using the animation curve generation. If the minimum recognizable phoneme pronunci- ation time in the data set is prmin audio frames and the length of the continuously predicted phoneme in the buffer is less than prmin, then the corresponding phoneme will be replaced by the previous phoneme. The upcoming phoneme section rules can be described as follows: 1. The same phonemes are continuously appended to the buffer for at least prmin units. 2. If the number of continuous phonemes is more than prmin units and less than the size of B4. The appropriate frame that represents the phoneme will be selected from that part of the buffer. 3. If the number of the continuous phonemes is more than the size of B4, all phonemes in the buffer will be used to select the appropriate frame, and no new frame will be selected until the next different phoneme is appended to the buffer. Figure 3.7: 2D viseme field with block index (left). The combination of the jaw and lip parameters in the extreme (A, B, C, D) and regular cases (R) is visualized by the 3D model viseme (right) 28 3.3.3 Speech Animation Component Design The speech animation component has been designed following the JALI viseme field and the state-of-the-art procedural lip-synchronization system [12]. The implementation in our system is slightly different from the original JALI viseme field. Most of the procedural systems, such as [12, 32, 33], that use the keyframe viseme to produce the final animation have a similar problem, that is the phoneme is mapped to one fixed static viseme without any variation. Through our observation of human speech behaviors, visemes corresponding to a phoneme may be slightly different in different situations. This is true even when the current phoneme has a sufficiently long pronunciation time to erase the co-articulation of the sound caused by the previous phoneme pronunciation. For example, the visemes that represent the phoneme /s/ are pronounced differently in the ′things′ and ′f alse′ under the same speaking style. 
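As an aside before continuing with the viseme field, the buffer-controlled output mechanism of Figure 3.4 above can be written as runnable Python roughly as follows. This is only a sketch of the dequeue-or-repeat-last behavior described in the text; the prediction call, the timer values, and the thread wiring are assumptions.

```python
import time
from collections import deque

def recognizer_loop(b3: deque, predict_batch, next_windows, interval_s, running):
    """Every br*A ms, run RealPRNet on the latest sliding windows and enqueue
    the decoded phonemes (with timestamps) into B3."""
    while running():
        for phoneme in predict_batch(next_windows()):
            b3.append(phoneme)
        time.sleep(interval_s)                  # br * A, in seconds

def output_loop(b3: deque, emit, interval_s, running):
    """Every A ms, emit the next phoneme from B3; if B3 is empty, repeat the
    last phoneme so the animation component receives a stable stream (B4)."""
    p_last = None
    next_tick = time.monotonic()
    while running():
        if time.monotonic() >= next_tick:
            next_tick += interval_s             # A, in seconds
            if b3:
                p_last = b3.popleft()
            if p_last is not None:
                emit(p_last)                    # stored into B4 downstream
        time.sleep(0.001)
```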
Thus, the 2D viseme field has been divided into different blocks (20 blocks in our implementation), as shown in Figure. 3.7. Each phoneme corresponds to a block region rather than a static point in the 2D field. 3.4 Neural Network Design for Real-Time Phoneme Recognition In recent years, neural network methods [34, 35] have dominated the phoneme recognition field and have dramatically improved recognition performance. We design a novel neural net- work architecture for our proposed real-time phoneme recognition task. The overall network architecture is shown in Figure. 3.8. The CNN layer takes xt as input and applies frequency modeling on the input features. The n-dimensional vector output of CNN is passed into 29 an n-layer stack of LSTMs for parallel processing and temporal modeling. The output is combined with the ft to form an input to additional LSTM layers for temporal modeling and then goes through a fully connected layer. In the end, the HMM tri-phone decoder is adopted to predict the phoneme label. Figure 3.8: Network Architecture Overview 3.4.1 CNN Layer CNN has demonstrated outstanding performance in audio-related fields [35, 36]. The CNN layer is mainly used as a frequency modeling layer. An important contribution of the CNN layers is to reduce frequency variation. Based on the observation that voice of different people contains different ranges of frequencies even when they are speaking the same utterance, the traditional GMM/HMM-based speech recognition systems use techniques to reduce the frequency variation in the input feature, such as vocal tract length normalization (VTLN) [37] and feature space maximum likelihood linear regression (fMLLR) [38]. It has been reported recently [22] that CNN can provide the same feature improvement to the audio feature. CNN layers also play an important role in the temporal-spatial domain [22], here spatial refers to frequency-domain audio feature pattern detection and learning. The input feature 30 contains frequency information since each coefficient in Mel-frequency cepstral is generated by passing through different frequency filter banks. Each CNN filter can learn the different frequency patterns from the input features during the training. Our network architecture emphasizes these frequency patterns in the input features by separately connecting the CNN output features (CNNfout) of different CNN filters to different LSTM in the LSTM stack Block module. We use a 9 × 9 frequency (spatial)-temporal filter in the first CNN and a 3 × 3 filter in the second layer. The first layer has 256 output channels and the second layer has 16 output channels. We set no pooling in both CNN layers after observing that neither max nor average pooling helps to improve the result. 3.4.2 LSTM Stack Block After frequency modeling is applied to CNN layers and the CNN filters have learned the acoustic features from the input feature, each CNN output channel produces the intermediate features as shown in Figure. 3.8. These features are applied to a parallel process of the temporal modeling in LSTM stack block (LSB). The CNNfout’s from different CNN channels pass (Equation (3.2)) to different LSTM modules, called LSTM Tube (LT) which is inside the LSB as shown in Figure. 3.8. The number of LT inside the LSB depends on the number of the last CNN layer output channel, one LT for each CNN output channel. An LT is a relatively small and independent LSTM network. 
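A structural sketch of the LSB in PyTorch is given below (the choice of PyTorch is an assumption; the dissertation does not specify an implementation framework). The layer sizes follow the description in the next paragraph and Table 3.1; the input feature dimension is a placeholder for the frequency dimension of the CNN feature map.

```python
import torch
import torch.nn as nn

class LSTMTube(nn.Module):
    """One small, independent LSTM network; one tube per CNN output channel."""
    def __init__(self, in_dim, hidden=512, out_dim=128, layers=2, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=layers,
                            dropout=dropout, batch_first=True)
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, x):                        # x: (batch, time, in_dim)
        h, _ = self.lstm(x)
        return self.fc(h)                        # (batch, time, out_dim)

class LSTMStackBlock(nn.Module):
    """Separate temporal modeling: each CNN output channel goes through its own
    tube, and the tube outputs are concatenated (Equations (3.2)-(3.4))."""
    def __init__(self, n_channels=16, in_dim=39, **tube_kwargs):
        super().__init__()
        self.tubes = nn.ModuleList([LSTMTube(in_dim, **tube_kwargs)
                                    for _ in range(n_channels)])

    def forward(self, cnn_out):                  # cnn_out: (batch, channels, time, freq)
        outs = [tube(cnn_out[:, c]) for c, tube in enumerate(self.tubes)]
        return torch.cat(outs, dim=-1)           # (batch, time, channels * out_dim)
```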
The LSTM has been proven to be advantageous in temporal modeling because it has memory cells that remember information from previous features [21, 39]. Each output from a different CNN channel can be viewed as a derivative feature of the original audio feature with the context within a sliding window. Thus, separate temporal modeling is applied to each CNNfout. The CNNfout is passed to LT0, where the "0" denotes the CNN filter with index 0, inside the LSB for independent temporal modeling (Equation (3.3)), and the output feature is Lt0. These separate output features are aggregated together as the input feature for the next LSTM layer (Equation (3.4)):

CNNoutputs = {Cfout,1, Cfout,2, . . . , Cfout,n},   (3.2)
Ltn = FCL(LSTM(Cfout,n)),   (3.3)
LSB(CNNoutputs) = {Lt1, Lt2, . . . , Ltn}.   (3.4)

In our network architecture, the LSB contains 16 LTs, and each LT includes 2 LSTM layers and 1 fully connected layer. Each LSTM layer has 512 hidden units and a 0.3 dropout rate, and each fully connected layer has 128 output units.

3.4.3 LSTM Layer

In [14], it is demonstrated that the LSTM layer is advantageous for extracting temporal patterns in the input feature space, as we mentioned for the LSB. The gate units in the LSTM control the information flows inside the LSTM: the input gate decides what information can be added to the cell state, the forget gate decides what information needs to be removed from the cell state, and the output gate decides what information can be used in the output. Thus the LSTM can selectively "remember" the temporal features. After the LSB performs the separate temporal modeling, we pass its output to the LSTM layers for unified temporal modeling with the same logic. There are 4 LSTM layers, each with 1024 hidden units, 512 output units, and a 0.2 dropout rate.

3.4.4 Fully Connected Layer

Following the state-of-the-art design of a typical deep learning network, we use a fully connected layer with softmax activation as the last layer: the fully connected layer mixes the output signals from all neurons in the previous layer, and the softmax shapes the output probabilities so that the target class receives a higher probability. The fully connected layer has 1024 hidden units. The output is a 1896-dimensional vector that represents the probabilities of the 1896 HMM-GMM tied-states. The HMM tri-phone decoder takes this as input to predict the final phoneme label.

3.4.5 Multi-Scale Features Addition

The idea of multi-scale feature addition was originally explored in computer vision and has also been used in ASR-related problems [36]. In a neural network, each layer focuses on different input features and varies from general to specific concepts. In ASR tasks, the lower layers (e.g., the CNN layers in RealPRNet) focus more on speaker adaptation, and the higher layers (e.g., the LSTM layers in RealPRNet) focus more on discrimination [40]. Thus, the input features of the different layers are complementary, and performance improvements from these techniques have been observed [22]. In our implementation, we have explored two feature addition strategies, illustrated in Figure 3.8 through line (1) and line (2): line (1) combines the original frame feature ft in xt with the LSB output, and line (2) combines the LSB output with the LSTM output.
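Continuing the sketch above (and reusing LSTMStackBlock from it), the two feature-addition paths amount to concatenations before the unified LSTM layers and before the final fully connected layer. This is an illustrative skeleton under assumed tensor shapes, with sizes taken approximately from Table 3.1 (the LSTM layers' 512 output units are expressed via proj_size); it is not the exact implementation.

```python
class RealPRNet(nn.Module):
    def __init__(self, feat_dim=39, tied_states=1896):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 256, kernel_size=9, padding=4), nn.ReLU(),   # conv0
            nn.Conv2d(256, 16, kernel_size=3, padding=1), nn.ReLU(),  # conv1
        )
        self.lsb = LSTMStackBlock(n_channels=16, in_dim=feat_dim)
        self.lstm = nn.LSTM(16 * 128 + feat_dim, 1024, num_layers=4,
                            dropout=0.2, batch_first=True, proj_size=512)
        self.fc = nn.Linear(512 + 16 * 128, tied_states)

    def forward(self, x_t, f_t):
        # x_t: (batch, 1, n+m+1, feat_dim) sliding window; f_t: (batch, feat_dim) current frame
        c = self.cnn(x_t)                                   # (batch, 16, time, feat_dim)
        lsb_out = self.lsb(c)                               # separate temporal modeling
        f_rep = f_t.unsqueeze(1).expand(-1, lsb_out.size(1), -1)   # align f_t with the time axis (assumed)
        u = torch.cat([lsb_out, f_rep], dim=-1)             # feature addition (1): re-inject f_t
        h, _ = self.lstm(u)                                 # unified temporal modeling
        z = torch.cat([h, lsb_out], dim=-1)                 # feature addition (2): LSB + LSTM outputs
        return self.fc(z)                                   # 1896 HMM tied-state scores
```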
The first feature addition explores the complementary information in the short-term feature ft and the long-term output feature from the LSB (a high-order representation of xt). xt is aggregated by using ft and its context features [ft−n, · · · , ft−1] and [ft+1, · · · , ft+m]. However, the original LSTM does not consider their different values in prediction but takes all f's in xt as consecutive features [41] and weighs all the f's in xt equally. This feature addition emphasizes the importance of ft in xt when predicting pt. The second feature addition combines the LSB and LSTM outputs, i.e., the separated and unified complementary temporal feature information. These features are high-order representations of xt with different, complementary patterns, since the network layers focus on different information in xt. In the evaluation section, the performance improvement from multi-scale feature addition is explored, and the results show a positive effect on network performance.

Table 3.1: Network Parameters

Layer   Hidden units   Output units   Dropout rate   Kernel size
conv0   N/A            256            N/A            9
conv1   N/A            16             N/A            3
lsb0    1024           128            0.3            N/A
lstm0   1024           512            0.2            N/A
lstm1   1024           512            0.2            N/A
lstm2   1024           512            0.2            N/A
lstm3   1024           512            0.2            N/A
fcl0    1024           1896           N/A            N/A

3.5 Experimental Results

In this section, we evaluate the performance of the system and show that our proposed system can generate competitive speech animation results in real time using only audio input, compared with speech animation systems that use multimedia (video and audio) or offline methods. The performance of the proposed system is evaluated in three areas: (1) the RealPRNet phoneme recognition accuracy, (2) the buffer occupancy dynamics that enable the real-time application, and (3) the subjective and objective speech animation quality.

3.5.1 Experiment Setup

Our experiments are conducted on the TIMIT data set, which is widely used for phoneme recognition evaluation. It contains 6300 sentences, consisting of 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. Following the standard TIMIT setup, we use the standard TIMIT training set, 3696 utterances from 462 speakers, to train our network and evaluate it on the TIMIT core test set, which consists of 192 utterances.

We use the TIMIT s5 recipe in Kaldi [42] to calculate the phoneme duration in each utterance through the force-alignment technique and to generate an HMM tied-state tri-phone model, the corresponding 1896 tied-states, and their properties (i.e., state probability, transfer probability, corresponding phoneme, etc.). We then use a tri-phone decoder with a bi-gram phone model at the end of the neural network architecture, which takes the tied-state stream as input and outputs the predicted phoneme stream. The ground truth output y for a given input feature x in the training data set is the index of the HMM tied-state; Kaldi enables the audio-to-tied-state force alignment. For the network training, a minimum of 10 training epochs is set, with early stopping enabled (training stops if the validation loss changes by less than 0.001 between epochs). The Adam optimizer is used in the first epoch and the momentum stochastic gradient descent (MSGD) optimizer for the remaining epochs. The batch size is 256 for the first epoch and 128 for the rest. The learning rates are 0.01, 0.001, 0.0005, and 0.0001 for the first four epochs and 0.0001 thereafter.
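If written out, the epoch-dependent schedule above might look as follows. This is a sketch: the momentum value of 0.9 is an assumption, since the dissertation names only the optimizers, learning rates, batch sizes, and early-stopping tolerance.

```python
import torch

def make_optimizer(model, epoch):
    """Adam for the first epoch, momentum SGD afterwards (momentum 0.9 assumed),
    with the per-epoch learning rates listed above."""
    lrs = [0.01, 0.001, 0.0005, 0.0001]            # epochs 0-3; 0.0001 afterwards
    lr = lrs[epoch] if epoch < len(lrs) else 1e-4
    if epoch == 0:
        return torch.optim.Adam(model.parameters(), lr=lr)
    return torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)

def batch_size_for(epoch):
    return 256 if epoch == 0 else 128

def should_stop(val_losses, min_epochs=10, tol=1e-3):
    """Stop once at least 10 epochs have run and the validation loss changes by
    less than 0.001 between consecutive epochs."""
    return (len(val_losses) >= min_epochs
            and abs(val_losses[-1] - val_losses[-2]) < tol)
```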
Weight initialization was applied to the CNN layers' parameters, and a 0.3 dropout rate was used for all the LSTM layers. The ReLU activation was applied to most of the layers except the final fully connected layer, which used softmax. The performance of the network with different values of m in the input feature was evaluated in this section. In our experiment, the value n in xt was set to m + 1.

3.5.2 Error Metrics

3.5.2.1 Phoneme error rate

We first evaluate our proposed RealPRNet with a standard metric. During the evaluation, we first map the 60 phonemes to 39 and then calculate the Levenshtein distance between the recognized phoneme sequence and the ground truth phoneme sequence. The Levenshtein distance measures the difference between two sequences as the minimum number of single-character edits (insertions, deletions, and substitutions) required to change one sequence into the other. The ratio between this minimum number of edits and the length of the whole phoneme sequence is known as the phoneme error rate (PER). In the real-time setting, a phoneme is predicted for every audio frame; thus, the PER is calculated on the audio frame-level phoneme sequence in our evaluation.

3.5.2.2 Block distance error

The block distance error (BDE) measures the average distance between the trajectory curve in the viseme field produced by the recognized phoneme sequence and the curve produced by the ground truth phoneme sequence. The basic unit used here is the short edge of a 2D viseme field block. For each audio frame, the corresponding points on the two curves are located and the Euclidean distance between them is calculated; the BDE is the average of these distances over time:

BDE = (1/T) Σ_{t=0}^{T} ∥Ppredict(t) − Pgroundtruth(t)∥,   (3.5)

where Ppredict(t) is the predicted phoneme position and Pgroundtruth(t) is the ground truth position in the viseme field at time t, and T is the total number of time intervals.

3.5.3 Phoneme Recognition Performance

In the experiment, RealPRNet was compared with two other phoneme recognition systems: the 4-layer LSTM architecture presented in [14] and the CLDNN [22]. Since most of the existing related work uses offline recognition schemes, we first evaluate the offline phoneme recognition capability of the baseline models. The best offline PERs on the TIMIT data set are 18.63% for the 4-layer LSTM, 18.30% for CLDNN, and 17.20% for RealPRNet; RealPRNet outperforms the other methods by up to 7.7% relative PER. In Figure. 3.9, the real-time performance of these networks is compared. The x-axis represents the number of frames used in a single xt, and the y-axis represents the phoneme error rate. It is interesting to observe that the LSTM performs poorly when the temporal context information is insufficient (i.e., when xt is aggregated from fewer than 10 f 's), but its performance improves continuously as xt grows until a sweet spot (20 audio frame features in xt) is reached. In the figure, RealPRNet outperforms the LSTM and CLDNN baselines by at least 20% and 10%, respectively. The combination of frequency and temporal modeling gives RealPRNet smooth performance across various choices of the number of frames in xt.
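For concreteness, the two metrics defined in Sections 3.5.2.1 and 3.5.2.2 can be computed as in the sketch below. The 60-to-39 phoneme mapping, the viseme-field coordinates, and the block edge length are assumed to be supplied by the caller; this is illustrative code, not the evaluation scripts used in the dissertation.

```python
# Illustrative computation of the frame-level PER and the BDE of Eq. (3.5).
import numpy as np

def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i - 1, j] + 1,                          # deletion
                          d[i, j - 1] + 1,                          # insertion
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1])) # substitution
    return d[-1, -1]

def frame_level_per(pred_phonemes, ref_phonemes):
    """PER = edit distance / reference length, on frame-level phoneme sequences."""
    return levenshtein(pred_phonemes, ref_phonemes) / len(ref_phonemes)

def block_distance_error(pred_xy, ref_xy, block_short_edge=1.0):
    """Eq. (3.5): mean Euclidean distance between the two viseme-field trajectories,
    expressed in units of the block's short edge."""
    pred_xy, ref_xy = np.asarray(pred_xy, float), np.asarray(ref_xy, float)
    return np.linalg.norm(pred_xy - ref_xy, axis=1).mean() / block_short_edge

if __name__ == "__main__":
    print(frame_level_per("aa ae ah".split(), "aa ah ah".split()))   # 0.333...
    print(block_distance_error([[0, 0], [1, 1]], [[0, 1], [1, 0]]))  # 1.0
```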
To further understand the advantage of RealPRNet, we considered two additional variations: RealPRNet without the LSTM layers, and CLDNN and RealPRNet without the long short-term feature addition.

Figure 3.9: Phoneme Error Rate per Number of Frames

RealPRNet applies two different types of temporal modeling to xt: the separate temporal modeling performed by the LSB and the unified temporal modeling performed by the LSTM layers. First, we explore how the difference between these two kinds of temporal modeling affects the performance of the network. The LSTM layers are removed from RealPRNet, and the resulting network is named 'RealPRNet-separate'. Comparing RealPRNet-separate and CLDNN, the result in Figure. 3.10 shows that separate temporal modeling alone cannot outperform unified temporal modeling. Thus, RealPRNet benefits when the two kinds of temporal modeling are used together. The PER of RealPRNet under different latency scenarios (different values of m and n), shown in Figure. 3.10, illustrates that it outperforms both temporal modeling architectures when they are used individually.

Figure 3.10: Various Networks' Performance in Phoneme Error Rate

As illustrated in the previous section, the performance of the network can be further improved by the multi-scale feature addition technique. We explore this technique with RealPRNet and compare its performance with the strongest baseline, the CLDNN model with a feature addition structure, and with the previous networks without feature addition. As shown in Figure. 3.10, the performance of both RealPRNet and CLDNN improves with multi-scale feature addition. The additional short-term feature provides complementary information to the intermediate input features, which forces the networks to focus on the current frame ft. Thus, RealPRNet outperforms the other networks in most latency scenarios. In particular, it achieves an additional 4% relative improvement in the worst scenario of the real-time PER evaluation.

3.5.4 The Buffer Occupancy in Real-time Application

Figure 3.11: Output Phoneme Buffer (B3) Occupancy

The buffer mechanism is used to stabilize the system output and ensure that the output delay stays within the tolerance range. The run-time occupancy status of B3 in our experiment is shown in Figure. 3.11. In the ideal case, [Pt, · · · , Pt−br+1] is queued to B3 every br × A ms, and B3 dequeues the first value every A ms. In our experiment, the B3 occupancy is recorded after every dequeue, and br is 4. Thus the B3 occupancy should be evenly distributed over [0, 1, 2, 3]; 90% of our test cases fall in this range, which shows that our system can work in real-time situations. Most of the outliers in B3 occupancy occur at 0 and 4 occupied units because the program cannot use the exact value of A in each step at run time (e.g., A fluctuates in a range from 10.3 ms to 11 ms). The remaining cases are caused by computational fluctuations in other parts of the phoneme recognition subsystem. This buffering mechanism makes our proposed scheme efficient enough to be implemented in real time.

3.5.5 Animation Quality Assessment

Figure 3.12: Block distance error comparison

In this section, we evaluate the quality of the output speech animation of our system.
First, we compare the trajectory curve in the viseme field produced by the recognized phoneme streams predicted by various reference networks and RealPRNet with the trajectory curve produced by the ground truth model. The block short edge length is used as a distance unit to calculate the BDE between the recognized phoneme-produced curve and the ground truth curve at each time instance. The result is shown in Figure. 3.12, where the x-axis in the figure represents the number of frames used in input features, and the y-axis represents the block distance error in the basic unit. The RealPRNet also achieves the minimum error under different latency scenarios, which is 27% or less than other networks. 41 In addition to the numerical evaluation, we also organized a group consisting of 30 partic- ipants in total to watch our system-produced speech animation and ask them whether they felt the animation was believable or not (is the audio able to synchronize with the 3D model lip movement animation). 29 out of 30 participants believe the audio has synchronized with the lip movement. Figure 3.13: Speech Animation Visual Comparison with Human Performance and Current State-of-the-Art We also compare our result facial model’s viseme with the current state-of-the-art and real human performance. We take the viseme screenshots at the peak of each phoneme pronunciation in an utterance to compare the results visually. As we can see in Figure. 3.13, all results are comparable. 3.6 Conclusions In this chapter, we proposed an audio-driven believable speech animation production system with a new phoneme recognition neural network called RealPRNet. The system can produce 42 a competitive real-time speech animation with only raw audio input. The RealPRNet has performed better than the strong baseline model in phoneme recog- nition, and the speech animation system is easy to implement in most of the existing virtual avatars. The real-human-like speech animation needs a lot of effort on the pre-train model with both video and audio data, even post-edit from the artist. Our framework focuses on producing believable real-time speech animation with simple implementation, which also gives the user high freedom for online post-effect editing. Our proposed animation production framework has implemented this method to generate the viseme animation for all the dialogue in the script. 43 Chapter 4 Auto Cinematography with Fidelity 4.1 Introduction In this chapter, we demonstrate our proposed method for rule-based auto cinematography. In contrast to related works, we introduce a new optimization metric, fidelity, to determine the similarity between the visual content captured by the cameras in the animation and the respective textual content in the input script. Many animation production-related software and tools have been developed to help ani- mators produce better animations in less time and with fewer resources. With the latest de- velopments in AI technology, many jobs in the animation production process can be achieved by a computer. Ref. [3] shows that simple animation can be generated automatically by using an animation movie script with a small amount of human effort in the adjustment. Thus, or- dinary people also have the opportunity to create films on their own with the help of modern technologies by writing their stories into a specific format script [43]. 
The bar for anima- tion movie making has been lowered with the application of these AI advances to replace part of the human labor in the whole filmmaking process. In this process, transforming the cinematography stage into an automatic process, which is called “Auto Cinematography”, is one of the attractive areas because it can require substantial knowledge of the relevant reserves [44] and a considerable amount of time for the manual placement of these cameras 44 in the virtual 3D environment. From the perspective of video editing, using the writing process to establish the video- making process has been investigated in [45–47], which assumes that online video resources can be utilized to put together video montages that can match with the text input, and then the selected shots are assembled by optimizing cinematographic rules to generate the final video output. However, the above approaches cannot be applied directly in the filmmaking process for a screenplay as the combined online shots can only vaguely visually present the screenplay but are hard to meet the standard of an animation. Ref. [3] targeted for 2D animation, binds script-writing with performance capture and post-processing edits, to achieve greater efficiency and flexibility for the production process. However, the resultant video was simply shot with the characters placed in a given 2D plane, without sufficient optimization of what and how to shoot and select the most appropriate shots and visual content for the script. Ref. [48, 49] present the automatically generated camera sequences that follow cinematic rules or conventions that can be utilized in this problem. Ref. [50] shows a more diverse computational cinematographic approach, including not only the information in the frames captured by the cameras but also the parameters of the camera placement and angle. Furthermore, Ref. [5] adds the director hints into the auto cinematography optimization process, and Ref. [51] learns the camera usage and behaviors from the existing films. The above optimizations for camera selection are aimed at the aesthetic aspects of cinematography. However, simply considering the rules of cinematography is not sufficient to ensure that the scenes captured express the content of the script. If achievable, bridging script and scene, visual, and text brings benefits to improve the capacity of auto cinematography. Therefore, in addition to the existing aesthetic model de- scribed in the previous section, we introduce an innovative new assessment model, the fidelity 45 model, into the cinematography optimization process to determine whether the content ex- pressed by the two different media, video and script, is consistent. Apparently, the proposed auto cinematography needs to satisfy the following two conditions: the output animation (1) maintains reasonable fidelity of the script and (2) follows cinematic rules with cinemato- graphic aesthetics. We also designed and developed an animation production framework to better demonstrate our proposed optimization method. Under such a background, we propose T2A, a framework to help speed up the script-to- animation video process and minimize manual adjustments. We introduced a new dimension, fidelity, in the auto cinematography optimization process to evaluate the quality of animated videos instead of using only aesthetic models. 
By embedding the fidelity model generated with the latest video understanding advances [52] in T2A, our cinematography optimization algorithm determines the comprehensiveness of the output video and its fidelity to the script. T2A combines the aesthetics with the fidelity requirements into a unified computational cinematography framework and maps the original problem into an optimization problem that seeks to select the best options of camera settings to achieve the quality expectations for the user requirements. The optimization problem is carefully designed so that it can be solved by dynamic programming most efficiently computationally. While in the extraction of script information, We use the current state-of-the-art NLP technology [53], including subject, predicate, object, adjective, emotion, dialogue, etc. However, since the current NLP technology still has some limitations, We still need to manually fix the errors in the process of extracting information. The contribution of our work can be summarized in the following aspects: 1. We introduce a new evaluation metric, fidelity distortion, that allows auto cinematog- raphy to produce videos that are more consistent with the original content. This metric 46 provides a novel aspect to assess the quality of the resulting animation. 2. We propose a numerical model to calculate fidelity distortion while keeping the delay for the optimization process within reasonable limits. Such a model is developed by analyzing the behavior of state-of-the-art video understanding techniques. 3. We develop a novel text-to-animation framework, T2A. It not only builds a link between the script and video content, but also reduces the production time in 3D animation by 74%. To the best of our knowledge, T2A is the first framework that introduces the fidelity distortion factor as a new perspective in the optimization process of auto cinematography. The rest of the chapter is organized as follows: Section 2, overviews the related work to this study, such as computational cinematography, video editing, and video understanding; Section 3, demonstrates the overall framework, the problem formulation, and the dynamic programming solution; Section 4, discusses the impact of the possible error in video under- standing to the whole framework and the experimental results; Section 5 concludes the work and the last section discusses the limitation of our work and future works. 4.2 Related Work Automatic 3D animation filmmaking can be seen as the combination of two main research topics, 3D avatar animation from the script and computational cinematography in the virtual environment. From more than a decade ago, the former topic was achieved by transforming prepared stories into unique scripting languages and allowing the 3D avatar to automatically gen- erate the corresponding animated performances in the virtual environment based on these 47 scripts [54–56]. The advantage of the script language [3, 57] is that it provides accurate character action, state, position, time, and other information to generate the corresponding animations automatically. The animator can modify the detail setting easily by using such a technique. However, the prep work for this approach requires a knowledgeable animator who spends a considerable amount of time converting the original animation script to such special scripts. The T2A takes advantage of NLP [58] to generate the action list from the original animation script directly without manual labor. 
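To make the action-list-generation step more concrete, the sketch below pulls rough (subject, action, object) triples out of script sentences. The dissertation uses AllenNLP [53] for this step; here spaCy's dependency parse is substituted purely for brevity, and the extraction heuristics are simplistic assumptions rather than the T2A pipeline.

```python
# Illustrative (subject, action, object) extraction from script text using spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def extract_actions(script_text):
    actions = []
    for sent in nlp(script_text).sents:
        for token in sent:
            if token.pos_ != "VERB":
                continue
            subj = [c.text for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            obj = [c.text for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            actions.append({"subject": subj[0] if subj else None,
                            "action": token.lemma_,
                            "object": obj[0] if obj else None})
    return actions

if __name__ == "__main__":
    print(extract_actions("Mary walks into the room. John hugs her."))
    # e.g. [{'subject': 'Mary', 'action': 'walk', 'object': None},
    #       {'subject': 'John', 'action': 'hug', 'object': 'her'}]
```

As the surrounding text notes, such automatic extraction is imperfect, which is why the generated action list remains editable by the animator.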
Since the textual information in the script does not contain all the necessary information to generate the corresponding an- imation (e.g., distance traveled by the character) and the current NLP is not perfect, the animator can manually adjust the action list after it has been generated, which only requires a few minutes to complete. The camera configuration and placement are crucial in the shooting phase of the movie production process. The views captured by the camera directly affect the quality of the production. The computational cinematography [49,51,59] integrates the automatic camera placement and video editing techniques to complete the final video. The automatic camera placement process follows roughly two steps. Firstly, using the cinematography guidelines as constraints the selection of views is found. Different cinematography rules can be applied to the film-editing under various scenarios, such as dialogue scenes [59], first-person [60], and cooking [45]. The advantage of these methods is that the generated video is more closely aligned with the aesthetic standards of the desirable cinematography guidelines. Secondly, directly learning from existing movies these camera’s behaviors [51, 61–63] are mimicked. The advantage of these methods is that the optimal camera solution can be matched to the existing successful camera settings based on the state of the characters and the scene. The problem with this approach is that similar content can be produced under totally different 48 scenes, and it cannot guarantee that obstacles in the new scenes will not block the camera path. Figure 4.1: Existing Framework for Auto Cinematography As with the issues and challenges mentioned in [64], almost all of the computational cinematography works according to our knowledge optimize the camera path from the per- spective of the cinematography guideline. These can also be viewed as a subjective per- spective because most of the great films break these rules. After all, the directors are not following them. We, therefore, propose an evaluation perspective, fidelity distortion, that can objectively assess the quality of the resulting video. The fidelity stands for the consis- tency of the video and original script content, which has been ignored in both optimizations mentioned above. It is inspired by the recent works [3, 47, 56, 59], which achieve this consis- tency by adding notes to all pre-shot video clips and comparing the script with the notes at the editing stage. Thus, we introduce the fidelity model to achieve this. The animation is generated by the action list derived from the original script, and the animation video aims to visually optimize the actions of the characters in the virtual environment. Thus, the fidelity model helps the framework maintain this consistency by using the video caption and action recognition model [65–68] as an objective viewer, evaluating the video content whether it has successfully visualized the corresponding actions in the action list. One of the popular areas of research in recent years, video caption [69–71] can describe more and more detail in the video, and the accuracy and recognition range of the action recognition [72, 73] have 49 been improved. Although none of these technologies are perfect at the moment, we believe that as they continue to improve, our proposed fidelity model will be better able to achieve its goals. 4.3 Framework Design In a typical auto-cinematography system (as shown in Figure. 
4.1), the process from the input script to the output video contains the following steps, namely, action list generation, stage performance, camera optimization, and video generation. The action list generation module analyzes the original animation script to obtain the corresponding characters’ chronological action list {ai|i = 1, 2, ..., N }, where ai is the ith action object in the scene, and N is the total number of action objects. It is important to realize that multiple characters might perform simultaneously (e.g., two persons are fighting with each other, or a mom is hugging her daughter). Thus an action object may contain multiple characters in the same scene. In the stage performance step, the input {ai} is transformed into the corresponding stage performance data {pt|t = 0, 1, .., T }, where pt is the character stage performance at time t and T is the total performance time determined by the action list. Specifically, for each ai, the corresponding performance can be denoted as {ptai , ..pt+1ai , ..pt+lai }, where lai is the action duration of ai. It is important to realize that the different action objects can overlap with each other. For example, when two events occur simultaneously, both of them need to be presented to the audience. This requires that there are multiple virtual cameras available in the 3D scene that can record all the views for every character from various angles. In the camera optimization step, all the available views are considered to calculate the optimized camera {ct} for each time t. The video generation step assembles all the video 50 frames captured by camera {ct} at time t and outputs the final video. Figure 4.2: The updated camera optimization model in T2A (only reflects the camera opti- mization module shown in Figure. 1) In almost all the literature that we can find so far, only aesthetic factors are considered during the camera optimization process. Various types of aesthetic distortion are defined according to the grammar of cinematography or editing, and the purpose of optimization is to find the camera path with minimum aesthetic distortion. Apparently, this solution only meets the second condition (as mentioned in section 1) in order to convert a script to an animation, while it does not satisfy the first condition that requires the output animation to maintain reasonable fidelity of the script. In order to measure the fidelity level of a video compared to the script, a mathematical model is required to be built into the camera optimization framework. In this work, we make a bold assumption that a mathematical model can be found to approximate the fidelity relationship between a video and its associated script (i.e., ai). In other words, with any action ai and any selected camera atai , the fidelity between the action and the video generated from this camera at time t can be obtained from this approximation model. In Figure. 4.2, the camera optimization module has been updated in our proposed framework with three new modules, namely, aesthetic model, fidelity model, and optimiza- tion. The task of the aesthetic model is to provide a quality evaluation from an aesthetic 51 point of view for each admissible virtual camera at time t to the optimization engine, the task of the fidelity model is to provide fidelity evaluation for each admissible virtual camera at time t to the optimization engine, and the optimization engine considers all inputs and makes the optimal choice for the camera selection. 
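Before the three modules are detailed, the data flow just described (action list, stage performance, camera optimization, video generation) can be pictured with a few skeletal structures. The field names and stub functions below are illustrative assumptions, not the T2A implementation.

```python
# A skeletal view of the T2A data flow described above.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ActionObject:            # a_i: one (possibly multi-character) action
    characters: List[str]
    action: str
    start_t: int               # start time, in frames
    duration: int              # l_{a_i}

@dataclass
class FramePerformance:        # p_t: character poses/positions at time t
    t: int
    poses: dict = field(default_factory=dict)

def stage_performance(actions: List[ActionObject]) -> List[FramePerformance]:
    T = max(a.start_t + a.duration for a in actions)
    return [FramePerformance(t=t) for t in range(T)]   # filled in by the engine

def optimize_cameras(perf, aesthetic_model, fidelity_model, optimizer):
    # Each admissible camera is scored per frame by both models, and the
    # optimizer picks the sequence {c_t} minimizing the joint distortion.
    return optimizer(aesthetic_model(perf), fidelity_model(perf))

if __name__ == "__main__":
    acts = [ActionObject(["Mary"], "walk", 0, 48),
            ActionObject(["Mary", "John"], "hug", 48, 24)]
    print(len(stage_performance(acts)))   # 72 frames of stage performance
```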
In the following, these three modules will be demonstrated in detail. It is essential to realize that the purpose of this work is to show- case a framework that combines aesthetic and fidelity modeling in the camera optimization process, which is the first trial in the field. It is not our intention to claim that the proposed aesthetic and fidelity models are the best. Instead, we encourage the readers to develop a better aesthetic or fidelity model to make the framework perform better, as the framework is generic to adapt new models into the optimization process. 4.3.1 Aesthetic Model In this section, we discuss the aesthetic evaluation model, that is, to emphasize the distortion caused by camera planning and measured by the cinematography guidelines. Aesthetic distortion is widely used in existing state-of-the-art automatic camera algorithms [45, 49, 59]. Although there is no single standard in cinematography to evaluate the quality of camera setting and shooting results, we consider several essential factors such as character visibility, character motion, camera setting configuration, screen and motion continuity, and shot duration, to form an approximation of the perceptual aesthetic error following the traditional cinematography guidelines. Again, the model proposed here is just an example as the formation methodology is extensible to include more factors that need to be considered in the cinematography process. 52 4.3.1.1 Camera Placement Positioning cameras in 3D space to shoot frames that meet 2D constraints is a 7-degree of freedom(dof) problem consisting of: the position of the camera(3 dofs), orientation(3 dofs), and focal length (shot size). In Practice, the optimization of the seven dimensions is very challenging due to computational power limitations. To simplify the problem but without Figure 4.3: Default Camera Placement. (a) camera placement for different shot sizes and profile angles. (b) camera placement for different camera heights the loss of generality, we narrow down the 7-of infinite search space to countable discrete camera configurations according to the camera placement of classic movies. Only cameras used with up to two people are considered because shots with more than two people in the field of view can often be replaced by single-person shots. Ref. [74] proposed a two-people camera placement mechanism. We adopt their model of two-people shots. Figure. 4.3 shows the placement of the cameras in our implementation. Each camera maintains the relative position of the character associated with it during the stage performance process except for the point of view (POV) camera. The POV camera follows the head movement of the character. 53 4.3.1.2 Single Frame Distortion Cost In this section, we describe the distortion in the single frame capture by camera ct (selected camera at time t) that follows cinematography guidelines. Character Visibility This cost evaluates the character visibility in the ct and it is determined by two factors: (1) the ratio, rk, of the size of the character k (k = 0, 1, ..., K − 1 for the total of K characters in the story) in the frame to total the frame size. rk represents how easily the audience perceives the character in the frame. When multiple characters appear in the ct view, the ct always considers its bounded character as the more significant one (low value of I(ct, k), where I() is the character priority). 
(2) I(ct, k), which depends on camera ct and character k, represents this correlation by giving different weights to different character-camera combinations during the calculation. Thus the cost function for character visibility, V (ct), can be represented as:

V (ct) = Σ_{k=0}^{K−1} I(ct, k) · rk.   (4.1)

Character Action This value describes whether the character has an action at (an action at time t) or not. In general, the audience is more likely to notice characters in motion. Thus, if character k is acting at time t, there is a greater probability that the cameras bound to k should be selected. Hence the cost function can be represented as:

A(ct) = { 0, if the character k bound to ct has an action at time t;  1, otherwise }.   (4.2)

Camera Configuration Different camera configurations serve different purposes in movie shooting, and their distribution of use varies. For example, the MS (medium shot) is used most often while the character performs a general action. However, when the character performs an action such as looking around, the LS (long shot), S-MS (surround environment camera), and point-of-view camera are often the better options. In addition, different actions prefer cameras shooting from different directions. For example, walking and running actions can be shot from both the front and back of the character without much distortion, whereas a speaking action causes higher distortion when shot from the back of the character than from the front or side. Thus, the camera configuration distortion depends on the action type, which can be derived from the action object (i.e., ai) at time t, and on the camera position p and shooting direction d, which can be derived from ct. We use the function ϕC() to describe this distortion calculation process [48] and use ˜at to represent the action object at time t. Thus the cost function of the camera configuration can be represented as:

C(ct) = ϕC(p_ct, d_ct, ˜at).   (4.3)

4.3.1.3 Frame to Frame Distortion

Ref. [48] describes multiple cut-quality measurement metrics developed from cinematographic guidelines. We have selected the most useful ones, which are modified and implemented here to calculate the quality between the captured views at times t and t + 1.

Screen Continuity Visual-spatial continuity in the video is essential to prevent disorientation for the viewer. Many classic cinematography rules, such as the 180-degree rule [75], describe the importance of maintaining consistent spatial relationships on screen. The screen continuity cost summarizes each character's position change in the frame. The cost function is defined as follows:

S(ct, ct−1) = Σ_{k=0}^{K−1} v(k, ct) · ϕS(p(k, ct) − p(k, ct−1)),   (4.4)

where p(k, ct) and p(k, ct−1) represent character k's position in the frames captured by ct and ct−1, while v(k, ct) determines whether character k is visible in the view of ct or not, that is,

v(k, ct) = { 1, if character k is visible in camera ct;  0, otherwise }.   (4.5)

The minimum penalty for a position change is 0, and the penalty increases as the distance between p(k, ct) and p(k, ct−1) increases. If k appears in only one of the frames, the maximum penalty of 1 is used. The non-linear function ϕS() represents this property [48].

Moving Continuity If an ongoing action changes the direction of a character before or after the view change, this can create disorientation in the audience.
The moving continuity cost quantifies the penalty in this aspect as follows:

M (ct, ct−1) = Σ_{k=0}^{K−1} v(k, ct) · ϕM (m(k, ct) − m(k, ct−1)),   (4.6)

where m(k, ct) and m(k, ct−1) are the motion direction vectors of character k in the frames at t and t − 1, respectively, captured by the associated cameras. The non-linear function ϕM () takes m(k, ct) and m(k, ct−1) as inputs and returns the penalty [48]. This penalty increases as the two directions diverge from each other. If k appears in only one of the frames, the maximum penalty of 1 is applied.

4.3.1.4 Shot Duration

Shot duration is closely related to the concentration of the audience's attention, and it is an important element of film editing style. In general, the shorter the shot duration, the more intense the content on the screen and the easier it is to grab the audience's attention. In [76], the researchers found that the shot duration distribution is associated with the style of the movie and can be described by a log-normal distribution. To simplify the problem, we allow the average shot duration (u) to be set for each scene to control the shot duration distribution, or a default value is used (a general u learned from existing movies [76]). The non-linear function ϕU () has the shape of a standard inverted log-normal distribution [48] with its minimum value (penalty y = 0) at x = u. Let us denote by q the longest allowable shot duration; the shot distortion is then defined to penalize camera changes among the frames in the range [t − q, ..., t], as:

U (u, ct, ct−1, ..., ct−q) = ϕU (u, ct, ct−1, ..., ct−q).   (4.7)

By adding all the factors mentioned above together, the total aesthetic distortion Da can be calculated by the following equation:

Da = Σ_{t=0}^{T} [ω0 · V (ct) + ω1 · C(ct, ˜at) + ω2 · A(ct) + ω3 · S(ct, ct−1) + ω4 · M (ct, ct−1)] + Σ_{t=q}^{T} (1 − ω0 − ω1 − ω2 − ω3 − ω4) · U (u, ct, ct−1, ..., ct−q),   (4.8)

where ω0, ω1, ω2, ω3, ω4 are the weights for each distortion (in the range of 0 to 1).

4.3.2 Fidelity Model

Figure 4.4: Comparison of a script and a video by a human or human-like agent to see whether they match

The fidelity model is the essential element of our proposed auto-cinematography approach: it assures that the generated video matches the input script, an element that has been missing in past efforts in auto-cinematography. In the ideal case, as shown in Figure. 4.4, there is a human-like agent with comprehension intelligence similar to a human's, which is thus able to determine whether a generated video is good enough to reflect the comprehension of the script. Clearly, a lot of the research progress reviewed in section 2 has been made in this direction. However, the current state-of-the-art is still far from being widely usable due to its limited performance and accuracy. To address this challenge, an approximation method, as shown in Figure. 4.5, is considered: each action generated by the action list generation module (shown in Figure. 1) is compared with the action recognized from the corresponding output video, and the accumulated error is used to represent the overall fidelity level considered in the ideal scenario. Therefore, the original text-video matching problem is approximated and converted into a text-text matching problem, under the assumption that the video action recognition engine can achieve a reasonable quality.
Figure 4.5: Using a video action recognition engine to convert a video clip into text, and then comparing the similarity between two actions in text format

Let us denote by mj the jth camera of the admissible camera set, by a′i the word or phrase describing the action obtained from the video action recognition engine (which recognizes the video generated from input action ai), and by di the similarity between the textual descriptions of ai and a′i. Then di can be measured by:

di = { 0, if G(ai) · G(a′i) / (∥G(ai)∥ × ∥G(a′i)∥) ≤ ThG;  1, otherwise },   (4.9)

where ThG is the threshold to admit that ai and a′i refer to the same action, and G is the GloVe word embedding model of [77]. This way, the fidelity level of a generated video compared to the input script can be approximated by the average of all the action similarities obtained:

Fj = (1/N) Σ_{i=1}^{N} di,j.   (4.10)

It is important to emphasize that the equation above is only an approximation of the real fidelity level, as noise is introduced throughout the process by the video action recognition engine and the textual similarity comparison mechanism. In our experiment, the observed agreement with human judgment is 87.3%, obtained by comparing the outcome calculated by Eq. 4.10 against human subjective judgment with a pre-defined acceptance threshold. We asked participants to watch 1000 video clips (randomly selected from the data set) and asked whether they could visually understand the actions of the characters in the clips. Even for human subjective judgment, in some cases it is difficult to accurately distinguish the actions of the characters (127 clips in our experiment), due to the action type and camera angle (Figure. 4.6 shows some examples). The current action recognition engine has not yet reached the level of human subjective judgment. However, given the current pace of AI development, the accuracy of video action recognition has been improving significantly in recent years, and it is reasonable to assume that it will improve further in the near future. Hence, from the modeling point of view, it is feasible to adopt this approximation mechanism to quantify the fidelity level.

On the other hand, another obstacle appears in practical real-time applications. Due to computational constraints, it is not practical to embed the video action recognition engine inside the camera optimization process to calculate Fj for all the admissible camera parameters. Thus a further approximation is required to make the fidelity model practical.

Figure 4.6: The actions 'laugh' and 'cry' captured from the back of the characters (top). The actions 'listen' and 'stand' captured from a front-side close-up (CU) camera (bottom)

When looking closely at the performance of the video action recognition engine, one failure mode is dominant: the full (or partial) occlusion of characters in an action captured from various camera angles may prevent the engine from successfully recognizing the intended activity (see examples of occlusion in Figure. 4.7). Hence it is intuitive to investigate whether the degree of occlusion of characters under all admissible camera parameters is correlated with the di,j obtained in equation 4.9.
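Before turning to the occlusion-based approximation, the text-level fidelity measure of Eqs. (4.9)-(4.10) can be sketched as follows. The toy three-dimensional vectors and the threshold value are stand-ins; in the dissertation the GloVe model of [77] supplies the embeddings and the recognized action words come from the video action recognition engine.

```python
# Minimal sketch of the text-text fidelity approximation in Eqs. (4.9)-(4.10).
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def action_similarity(a_i, a_rec, embed, th_g=0.6):
    """d_i per Eq. (4.9): thresholded cosine similarity of the two action words."""
    return 0.0 if cosine(embed[a_i], embed[a_rec]) <= th_g else 1.0

def fidelity_level(actions, recognized, embed, th_g=0.6):
    """F_j per Eq. (4.10): average of d_i over the N actions shot by camera m_j."""
    return float(np.mean([action_similarity(a, r, embed, th_g)
                          for a, r in zip(actions, recognized)]))

if __name__ == "__main__":
    # Toy "embeddings"; real GloVe vectors are 50- to 300-dimensional, loaded from disk.
    embed = {"walk": np.array([0.9, 0.1, 0.0]),
             "run":  np.array([0.8, 0.2, 0.1]),
             "sit":  np.array([0.0, 0.1, 0.9])}
    print(fidelity_level(["walk", "sit"], ["run", "walk"], embed))
```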
Let us denote by oi,j the occlusion percentage of all characters involved in ai when shot by camera mj, where J is the total number of admissible cameras. Then the average occlusion Oj can be calculated by:

Oj = (1/N) Σ_{i=1}^{N} oi,j.   (4.11)

Figure 4.7: Example of different occlusion degrees of characters

According to the simulations shown in Figure. 4.8, there is a high correlation between F0 and O0 (i.e., the fidelity and occlusion levels measured by the 0th camera of the admissible cameras), and their relationship can be approximated by a linear function. Figure. 4.9 shows the fitting errors (σ) of all admissible camera placements and settings in the simulation; they are all in the range [0.04, 0.11], which is small enough to demonstrate that the high correlation between occlusion and fidelity level holds for all camera settings in the simulation. More details of the fidelity model simulation can be found in section 4.3.4.

Let us denote by Df the fidelity distortion; then Df can be modeled as a function of the selected camera at each time t, since the object occlusion level is determined once the camera is specified at a certain timestamp. Therefore, Df can be calculated by:

Df = (1/T) Σ_{t=0}^{T} [α · O(ct) + β],   (4.12)

where O(ct) is the occlusion measurement function over all subjects and objects for the selected camera ct at time t, and α and β are the parameters derived by fitting the (Oj, Fj) pairs with a linear function.

Figure 4.8: The plot of (O0, F0) pairs (dots) for the 0th default camera and the linear function F0 = αO0 + β fitted to the dots

So far, the fidelity factor has been approximated by a mathematical model that is determined only by the dynamic selection of the camera. Thus, it is possible to consider this factor in the camera optimization process. It is important to emphasize that there could be other ways to approximate the fidelity between a script and a generated video. In this work, the objective is to showcase that such a model is possible, and the effect of having such a model will be demonstrated in the next section with evaluation results.

Figure 4.9: Fitting error for all the default cameras used in our experiment, except the medium shot (MS) side cameras (19, 20) and the POV camera (30)

4.3.3 Joint Optimization

As mentioned above, both the aesthetic and fidelity models are utilized. Although the models can be refined and extended by considering more factors, as long as the factors can be represented as functions of the camera parameters, we can optimize both aspects in a joint optimization framework. By using a weighting factor λ in [0, 1] to bridge the two models, the total distortion can be represented as follows:

D = (1 − λ) · Da + λ · Df.   (4.13)

When λ is set to a value close to 1.0, the fidelity distortion is more important; conversely, for λ close to 0, the aesthetic distortion becomes more important. We leave the determination of λ to users so that they can decide according to their real scenarios and applications. The problem of minimizing the total distortion can then be written as:

min  (1 − λ) · { Σ_{t=0}^{T} [ω0 · V (ct) + ω1 · C(ct) + ω2 · A(ct) + ω3 · S(ct, ct−1) + ω4 · M (ct, ct−1)] + (1 − ω0 − ω1 − ω2 − ω3 − ω4) · Σ_{t=q}^{T} U (u, ct, ct−1, ..., ct−q) } + λ · (1/T) Σ_{t=0}^{T} [α · O(ct) + β].   (4.14)

Our goal is to find the optimal solution {c∗t} such that {c∗t} = argmin_{ct} D.
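As a small worked illustration of how the Eq. (4.12) parameters enter the objective, the sketch below fits a line to simulated (Oj, Fj) pairs, as in Figure 4.8, and reuses it to score a camera path. The sample pairs and the occlusion values of the path are made-up placeholders.

```python
# Least-squares fit of alpha, beta for Eq. (4.12) and evaluation of D_f on a path.
import numpy as np

def fit_fidelity_line(occlusions, fidelities):
    """Fit F ~ alpha * O + beta over the simulated (O_j, F_j) pairs."""
    alpha, beta = np.polyfit(np.asarray(occlusions, float),
                             np.asarray(fidelities, float), deg=1)
    return alpha, beta

def fidelity_distortion(path_occlusions, alpha, beta):
    """D_f per Eq. (4.12): average of alpha*O(c_t)+beta over the selected cameras."""
    o = np.asarray(path_occlusions, float)
    return float(np.mean(alpha * o + beta))

if __name__ == "__main__":
    alpha, beta = fit_fidelity_line([0.0, 0.2, 0.5, 0.8], [0.05, 0.2, 0.5, 0.75])
    print(round(alpha, 3), round(beta, 3))
    print(fidelity_distortion([0.1, 0.3, 0.0], alpha, beta))
```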
To implement the algorithm for solving the optimization problem, we define zk = ck and a cost function Dk(zk−q, ..., zk), which represents the minimum total distortion up to and including the kth frame, given that zk−q, ..., zk are the decision vectors for the (k − q)th to kth frames. Therefore DT (zT −q, ..., zT ) represents the minimum total distortion over all frames, and thus

min_z D(z) = min_{zT −q,...,zT} DT (zT −q, · · · , zT ).   (4.15)

The key observation for deriving an efficient algorithm is that, given the q + 1 decision vectors zk−q−1, · · · , zk−1 for the (k − q − 1)st to (k − 1)st frames and the cost function Dk−1(zk−q−1, · · · , zk−1), the selection of the next decision vector zk is independent of the selection of the earlier decision vectors z1, z2, · · · , zk−q−2. This means that the cost function can be expressed recursively as

Dk(zk−q, ..., zk) = min_{zk−q−1} { Dk−1(zk−q−1, · · · , zk−1) + λ · (1/T) · [α · O(ck) + β] + (1 − λ) · [ω0 · V (ck) + ω1 · C(ck) + ω2 · A(ck) + ω3 · S(ck, ck−1) + ω4 · M (ck, ck−1)] + (1 − λ) · (1 − ω0 − ω1 − ω2 − ω3 − ω4) · U (u, ck, ck−1, · · · , ck−q) }.   (4.16)

The recursive representation of the cost function above makes each future step of the optimization process independent of its past steps, which is the foundation of dynamic programming. The problem can thus be converted into the graph-theory problem of finding the shortest path in a directed acyclic graph (DAG). The computational complexity of the algorithm is O(T × |Z|^(q+1)) (where |Z| is the cardinality of Z), which depends directly on the value of q. In most cases, q is a small number, so the algorithm is much more efficient than an exhaustive search with exponential computational complexity.

4.3.4 Fidelity Model Simulation

A fidelity model was used to simulate the audience's recognition and understanding of a character's action in a movie. Assuming that the character's action is performed well, a typical adult audience is expected to understand the action successfully if the character is completely unobstructed in the video frame. It was discovered that obscuring the action can make it challenging for the audience to recognize and understand the activity in the scene. In addition, the camera placement, range, and angle are all critical aspects that may influence the audience's perception. For example, a close-up (CU) camera captures only the character's head and partial upper body in the frame, and a camera angled from the back of the character cannot convey visual information about front-side actions, such as facial expressions like 'Smile', to the audience. Thus, a fidelity model was proposed to estimate the loss introduced during the film-making process by the occlusion of the character, which is highly relevant to the camera placement. The occlusion is defined as the proportion of the characters' size that the audience cannot see in the frame, caused by obstacles between the camera and the characters, relative to the size of the fully visible characters. Without loss of generality, two types of actions are considered in this experiment: self-actions and interactive actions. A self-action is performed by the character himself/herself without a receiving object, such as a 'Jump', while an interactive action has one or multiple receiving objects, such as 'Speak', which requires the camera to capture both the initiator and the receiver of the action.
Thus, the occlusion is calculated based on the subject (initiator of the action) for the self-action. For the interactive action, the occlusion is calculated based on both the subject and object (receiver of the action). As mentioned in section 3.2, a video understanding and description neural network model[39] was considered as an objective viewer. An in-house data set with 15500 video clips (3-5 seconds each) was used to train and test our network model. The video includes 20 different character models, and each character model performs 10 different actions in 5 67 scenes and 5 different positions for each scene. Each of these action performances was cap- tured by 31 default camera placements. The data set was randomly divided into a training set (80%) and a test set (20%) while ensuring that each camera placement had the same amount of data in the training and test set. Figure 4.10: Examples of fidelity model simulation: it can be observed how the distance and angle of the camera may affect the simulation curve in various occlusions. The MS camera slightly off the front side (cameras 5 and 6) of the character provides the lowest Fidelity level. When the camera (camera 28) is too far away it is difficult to distinguish the action In our experiments, the textual description corresponding to each action clip consists of 3 parts, the initiator of the action, the action itself, and the recipient of the action. Therefore, the video understanding network needs to parse out these 3 parts separately from the input video. The input video is first transformed into 30-50 frames (10 frames per second), and then the frames are feature-extracted by CNN, and we use resnet-152[42] as the feature extraction layer. Since most of the actions have temporal and spatial continuity, a single image is not a good judge of the character’s actions. the LSTM layer is used to explore this temporal and spatial continuity in the input clips. In the whole frame, the character’s action usually takes up only a part of the frame instead of the whole frame, so the attention layer is 68 effective in finding the part of the frame related to the character’s action and discarding the irrelevant part of the frame such as the background, thus improving the accuracy of video understanding. During the training process, no occlusion was applied to characters; during the testing process, different occlusion patterns were randomly generated and applied to mimic the scenarios of various occlusion degrees of characters. Cross-validation was applied to obtain the average fidelity level and occlusion for all these camera placements. In Figure. 4.10, the simulation results of representative camera placements were demon- strated. These results all indicated that there is a strong correlation between calculated occlusion level and fidelity level. Interestingly the accuracy of action recognition decreases when the filming camera is either not at an appropriate distance, too far away, or too close to the subject, or not at a proper angle, from the side or back of the subject. 
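Before moving to the experimental results, the camera-path search of section 4.3.3 can be summarized in a small dynamic-programming sketch. It keeps only a per-frame cost (single-frame aesthetic terms plus the fidelity term αO + β) and a pairwise transition cost (continuity-style penalties), i.e., essentially the q = 1 case; the shot-duration window and the concrete cost functions are omitted and assumed to be supplied by the caller, so this is an illustration of the shortest-path idea rather than the T2A implementation.

```python
# Viterbi-style shortest-path search over per-frame camera choices.
import numpy as np

def optimize_camera_path(frame_cost, trans_cost):
    """frame_cost: (T, C) per-frame cost of each of C admissible cameras.
       trans_cost: (C, C) cost of switching from camera j to camera i.
       Returns the minimum-cost camera index sequence."""
    T, C = frame_cost.shape
    best = frame_cost[0].copy()          # best accumulated cost ending in each camera
    back = np.zeros((T, C), dtype=int)   # back-pointers for path recovery
    for t in range(1, T):
        step = best[None, :] + trans_cost.T      # step[i, j]: end at i coming from j
        back[t] = step.argmin(axis=1)
        best = step.min(axis=1) + frame_cost[t]
    path = [int(best.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_cost = rng.random((30, 5))      # e.g. (1-lambda)*aesthetic + lambda*(a*O+b)
    trans_cost = 0.3 * (1 - np.eye(5))    # flat penalty for every camera cut
    print(optimize_camera_path(frame_cost, trans_cost))
```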
4.4 Experimental Results

In this section, the details of the fidelity model simulation are presented, and the proposed framework is evaluated with the following three groups of experiments: (1) checking whether the automated workflow proposed by T2A adds value to users; (2) comparing the proposed camera optimization framework with a state-of-the-art solution without the fidelity model; and (3) investigating the proposed camera optimization framework in depth with ablation studies.

We implemented the T2A system by putting together several tools and subsystems: first, a web tool was developed to handle the script analysis and action list generation with the support of AllenNLP [53]; second, an auto-staging and animation tool was built using Unity [78] to transform the action list into character performance; third, a camera optimization algorithm was implemented in the back end that can be called by the animation tool; fourth, the video action recognition engine was built following the sequence-to-sequence neural network for video captioning [79] with an additional attention layer. It was first trained and tested with the MSR-VTT public data set [80] and then trained on our own data set, detailed in section 4.3.4, for 2000 epochs. The Adam optimizer [81] is used with a batch size of 128, and the learning rate starts at 0.0004 and is decreased every 200 epochs by a decay factor of 0.8. The whole training process takes around 8 hours on an NVIDIA RTX 2080 Ti GPU. We have also developed several character models for our tool, all of which have a standard action library to support the related animation performances. If an action described in the script, or a similar action, cannot be found in the action library, the animator can manually adjust the action list.

Table 4.1: The time consumption in automation on and off modes (professional users)

Script   Auto-staging Off   Auto-staging On   Auto-camera Off   Auto-camera On   Reduced Time (%)
TJ-1.1   88 min             18 min            28 min            15 min           83 min (71%)
TJ-1.2   107 min            20 min            29 min            16 min           100 min (74%)
TJ-1.3   80 min             18 min            24 min            13 min           73 min (70%)
TJ-1.4   120 min            22 min            34 min            18 min           114 min (74%)
TJ-1.5   115 min            13 min            34 min            19 min           117 min (78%)

4.4.1 Value of Automation

To evaluate T2A, five professional animators were invited to test drive the system; each of them was assigned a separate script with 1-2 minutes of screen time and was asked to make a video using the tool. The purpose of the test is to find out the differences between turning the auto-staging and auto-camera functions on and off. In the staging phase, the user is required to select the relevant scenes and characters in our Unity tool according to the script, place the characters at the appropriate locations in the scene, and then set the characters' actions, dialogue voice, and other relevant data sequentially by following the content of the script. After the staging phase, the characters in the virtual scene are able to perform the story according to the script. In the camera setting phase, the user needs to place several virtual cameras in the scene in chronological order and adjust them to the appropriate positions and angles to capture the desired video.
Table 4.2: The time consumption in automation on and off modes (non-professional users)

Script   Auto-staging Off   Auto-staging On   Auto-camera Off   Auto-camera On   Reduced Time (%)
TJ-1.1   121 min            43 min            48 min            10 min           116 min (68%)
TJ-1.2   143 min            36 min            53 min            3 min            157 min (80%)
TJ-1.3   89 min             42 min            63 min            21 min           89 min (58%)
TJ-1.4   144 min            62 min            55 min            2 min            135 min (67%)
TJ-1.5   135 min            47 min            58 min            11 min           135 min (69%)

As shown in Table 4.1 and Table 4.2, there is a considerable gap in staging-phase production time between professional animators and general users due to the difference in their level of knowledge about the subject. Most of the non-professional users use the camera track generated by the optimization algorithm directly without making further adjustments, and therefore spend less time in this phase. We therefore compare the statistics of the professional animators in detail. The professional users, on average, spent 34.4 minutes adjusting the character staging and performance and the camera parameters (with auto-staging and auto-camera mode on). However, they spent much more time (on average 97.4 minutes more) manually selecting character staging and actions and handling the camera placement. The automation flow sped up the manual process by 73.9% on average. For both experiments, the professional animators produced videos according to similar quality criteria. A sample video pair can be reviewed in the video starting from 00:24 at https://www.youtube.com/watch?v=MMTJbmWL3gs

To evaluate the value of the fidelity model, we compared the proposed framework with an existing auto-cinematography solution [5] that is optimized mainly with an aesthetic model. The experiments indicate that the proposed solution achieves better performance in scenarios where multiple characters conduct interactive activities and where a complete action is expected to be captured in full.

4.4.2 Value of Fidelity Model

The value of the fidelity model is examined in two such scenarios: interactions among multiple characters and the capture of a complete action.

Figure 4.11: Sample frames for the whole action, maintaining chronological order from left to right. The optimization using the fidelity model obtains the camera settings (bottom) that represent the entire action in the frame
4.4.2.1 Multiple Characters Interaction Figure 4.12: Compare the frames taken by the camera selected by the different optimization approaches. The optimization using the fidelity model obtains the camera settings (bottom figures), which capture more characters in the interactive actions For narrative films, the interactions among multiple characters, such as chatting, fight- 73 ing, and collaborating, are indispensable components. When using the aesthetic model, in algorithm [5], the camera is optimized to highlight the acting characters while sometimes ignoring others in passive mode. As shown in Figure. 4.12, the top figures are the frames cap- tured by the aesthetic model optimization for the scenarios of people chatting. As expected, the camera successfully captured the characters who were talking but ignored the other characters who were listening quietly. In contrast, the cameras used by the bottom figures captured all the characters involved in the interaction, which were generated by the proposed algorithm that took into consideration the fidelity model. The key difference between these approaches is whether the produced character performance and scene representation are of high fidelity compared to the original script. In the fidelity model, optimizing Df forces the system to select the best scene representation that can be recognized by the video action recognition engine and thus can fully represent the activity with all characters involved. This is highly correlated to natural human understanding as well as the recent research on dialogue scenarios [59]. That is, all parties of the interaction covered in the same frame can effectively help the audience understand the topological relationship among characters. 4.4.2.2 Full Action Capture In animation, the character’s action can take seconds to complete and thus may cross dozens of frames. With the utilization of the fidelity model, the mismatch between script and recognized action is detected by the video action recognition engine according to equations 4.9 and 4.10, and are simulated by using the model in equation 4.12. Thus the optimization process can select the best representation of the action with a group of video frames captured from selected cameras and associated settings. As shown in Figure. 4.11, the video frames generated without the fidelity model (as shown in the top row) were compared with the ones 74 with the fidelity model (as shown in the bottom row), and the sample frames for the whole action were demonstrated in chronological order from left to right. The bottom row can represent a good abstraction of the entire movement of characters’ actions while the top row cannot because there is a referee built into the system to keep the ones easy to understand and filter out others hard to recognize. Table 4.3: Comparison between the camera shots created by different optimization solutions and professional animators. 
(NoSA denotes the number of shot adjustments required for the auto cinematography results compared to the manual camera setting provided by the animator)

            Full-body   Multi   Single   Angle   Distance   Total shot
D_a         32          148     401      -       -          549
D_a NoSA    74          95      161      15      83         -
T2A         93          219     298      -       -          517
T2A NoSA    12          91      73       13      71         -
Animator    87          193     290      -       -          483

To better demonstrate the performance of the various models, three example videos of comparison results can be found in the video starting from 02:33 https://www.youtube.com/watch?v=MMTJbmWL3gs to experience the differences between the animation videos with and without the fidelity model in the joint optimization.

4.4.3 Numerical Comparison with Existing Metrics

We compared the number of adjustments made by the animators to the shots obtained by the various optimization methods (D_a and D_a + D_f) in 50 different animation scenes in Table 4.3, along with the corresponding justifications. The second row of Table 4.3 shows the results of the optimization using only D_a as the distortion. There are 32 full-body shots, 148 multi-character shots, and 401 single-character shots, for a total of 549 shots (a full-body shot can be either a multi- or a single-character shot). The fourth row shows the results of the optimization using D_a + D_f. The sixth row shows the results after the animator manually adjusted the automatic camera output. The third and fifth rows give the number of adjustments to the camera settings required for each type of shot for the two optimization algorithms before the camera settings were finally approved by the animator. The columns describe the reasons why the animator considered these adjustments necessary.

The results show that our proposed method has considerable advantages in capturing full-body movements and single-character scenes, thus reducing the number of camera adjustments the animator needs to make after the optimization. The following is a detailed description of each type of adjustment.

Full-body In animation, some character movements require the virtual camera to capture the whole body of the character to provide a better viewing experience. Figure 4.13 shows one such example; the "push" action of the character is better captured by the full-body camera.

Figure 4.13: Captured "push" action scene from the script: Lead the witchfinder inside; the optimization was done only for D_a (left), the optimization was done based on D_a + D_f (right)

Multi-character Multi-character interaction in animation is very common. Take dialogue scenes as an example: for the audience to understand the relationship between the characters, it is sometimes necessary (as shown in Figure 4.14) for both sides of the dialogue to appear in the frame at the same time.

Figure 4.14: Captured dialogue scene from the script: interrupt the conversation; the optimization was done only for D_a (left), the optimization was done based on D_a + D_f (right)

Single-character The use of a single-character shot is also very important in animation. For example, to express the emotion of anger, a single close-up of the character's face better conveys this emotion to the audience (as shown in Figure 4.15).
Figure 4.15: Captured "angry" emotion scene from the script: Lead the witchfinder inside; the optimization was done only for D_a (left), the optimization was done based on D_a + D_f (right)

Angle and Distance These two adjustments indicate that the animator believes there is a better choice of angle (yaw, pitch, and roll) or camera distance for the current situation, rather than one of the problems described earlier. For example, the background behind the character may be more appropriate after adjusting the camera angle.

4.4.4 Ablation Studies

In this section, the proposed framework is analyzed in depth. The influence of the distortion weight coefficient (λ) and the main parameters (V, C, S, U) on the optimization results is examined separately for the optimization process.

First, the role of λ is investigated. As shown in Figure 4.16, the normalized value of D_a increases as λ increases, while the normalized value of D_f decreases as λ increases, which is aligned with the definition of D in Equation 4.14. When λ becomes very small, the cost function in the optimization process is dominated by D_a, so the algorithm searches for the path that favors aesthetic factors. On the other hand, when λ becomes very large, the cost function in the optimization process is dominated by D_f, so the fidelity factor becomes very important. Clearly, the flexibility provided by adjusting λ gives users the opportunity to shift the optimization bias for different applications and requirements.

Figure 4.16: The value change of normalized D_a and D_f according to λ

For the ablation studies, the key components, such as visibility, camera configuration, screen continuity, shot duration, and fidelity, are investigated to measure their impact on the overall performance of the system. The shot duration cost is only calculated when the camera switches between different positions during the optimization process, and the total D depends on the previous q in the current optimized path. The average distortion D for the period from t − 2u to t + 2u is considered for evaluation, which can be expressed as:

D = \frac{\sum(D_{t-2u}, D_{t-2u+1}, \ldots, D_{t+2u})}{4u}.   (4.17)

Figure 4.17: Aesthetic component impact on the optimization results: (a) λ = 0.1, (b) λ = 0.5, (c) λ = 0.9

In Figures 4.17a, 4.17b, and 4.17c, the optimized curves are compared against variants in which one selected component is intentionally left unoptimized, and the settings λ = 0.1, 0.5, and 0.9 are compared to demonstrate the impact of fidelity. It can be observed that the overall impact of the unoptimized component at λ = 0.9 is smaller than at λ = 0.1, which is understandable because the former case emphasizes the fidelity factor much more than the aesthetic factor. The figures indicate that camera configuration has a very strong impact on the performance of the system with a gain of 25-32%, shot duration ranks second with a gain of 16-35%, and visibility has the least impact.

Figure 4.18 demonstrates the ablation from another angle, where the optimized solution has around a 60% gain compared to the fully unoptimized reference; the shot duration component has the largest impact, while the remaining two components have a comparable impact.

Figure 4.18: Aesthetic distortion component impact on the optimization results from another perspective (λ = 0.5)
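As a concrete illustration of Equation 4.17, the sketch below computes the windowed average distortion around a frame t; the sequence name and the boundary handling are assumptions made for illustration and do not reflect the actual implementation.

```python
def windowed_average_distortion(distortion, t, u):
    """Average the total distortion D over the window [t - 2u, t + 2u],
    following Equation 4.17; clamping at the sequence ends is an assumption."""
    lo = max(0, t - 2 * u)
    hi = min(len(distortion) - 1, t + 2 * u)
    return sum(distortion[lo:hi + 1]) / (4 * u)

# Example: per-frame distortion values evaluated around frame t = 10 with u = 2
D_per_frame = [0.3, 0.2, 0.4, 0.5, 0.1] * 5
print(windowed_average_distortion(D_per_frame, t=10, u=2))
```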
In Figure 4.19, the video frames generated by the proposed framework and by several references are demonstrated. By comparing the script and the video frames, the advantage of the proposed framework with respect to both the fidelity and aesthetic aspects is quite apparent. Row (1), generated by the reference solution with character visibility unoptimized, is in error at all timestamps, either missing the character or focusing on the wrong character. Row (2), with screen continuity unoptimized, does not work well in the second half of the frames. Row (3), with camera configuration unoptimized, made wrong camera angle choices in 60% of the cases; for example, at 00:06 it selected a long-shot camera although a camera at such a distance is not suggested for shooting this kind of action in the general cinematography guidelines, and at 00:35 and 00:43 the characters block each other due to the wrong camera angle selection. Row (4), with shot duration unoptimized, failed in 60% of the cases; for example, at 00:03 it failed to capture the whole "fall" action due to the lack of a duration constraint, and the camera stayed in the same position for too long.

Figure 4.19: Video results of the proposed framework and several references: (1) V unoptimized. (2) S&M unoptimized. (3) C unoptimized. (4) U unoptimized

4.5 Conclusions

3D animation-making typically requires a professional team, sufficient funding and resources, knowledge of cinematography and film editing, and much more. Thus this field is, in general, not accessible to non-professionals. In this chapter, we presented T2A, an animation production framework that implements our proposed auto cinematography method. To reduce the barrier for non-professional users, we developed an auto cinematography optimization framework that can choose cameras and their associated settings based on a joint fidelity and aesthetic model, in which the comprehensiveness of the visual presentation of the input script and the compliance of the generated video with given cinematography specifications are mapped into a mathematical representation. Although the experimental results indicate both the time-consumption and quality advantages of the proposed framework, we believe further investigation into fidelity and aesthetic modeling is needed to make the solution generally applicable to a broader scope of 3D animation and filmmaking tasks.

Chapter 5

Enabling auto cinematography with Reinforcement Learning

5.1 Introduction

In this chapter, we present our proposed auto cinematography with reinforcement learning, inspired by the idea of mimicking the idiomatic use of lens language by human directors. In this manner, we attempt to address the rigidity of the lens language used in rule-based auto cinematography approaches.

Cinematography, the art of choosing camera shot types and angles when capturing motion pictures, is an effective way to demonstrate artistic charm in the film industry, and applying cinematography to content creation requires extensive training and knowledge. In the computer age, it is natural to explore the possibility of building robot directors and entrusting them with this challenging task, which is also known as auto cinematography. Some successful efforts, such as [48, 49], have shown that this goal can be partially achieved by translating the rules from the standard cinematography guidelines into a number of cost functions and applying them during an optimization process. Such aesthetic-rule-based robot directors can provide valuable references for an entry-level artist without much film experience.
In [4] the aesthetic model is further extended to a hybrid model that also considers the fidelity of the output video by comparing it with the textual content of the original script. It is reported in [64] that the mere use of these rules for expression in film is far from satisfactory. In the practical film industry, human directors are experts at bending these standard rules of cinematography and developing their own unique lens language to express human creativity and imagination [82]. Therefore, although creating robot directors based on common lens language and summarized rules is an interesting idea, it is practically very difficult to meet the expectations of film artists.

As a related exploration, [51] used a neural network to extract lens language from existing films and imitated the human director's behavior in filmmaking. However, this effort suffered from an unavoidable problem: the experimental data is inaccurate and incomplete because it was estimated from two-dimensional frames, with information lost along the third dimension. Therefore, such an approach can only be applied in limited scenarios and is unable to comprehensively imitate the director's lens language. Another work [62] demonstrated that such a behavioral imitation problem with a limited amount of training data can be addressed by reinforcement learning (RL) techniques [83] with rewards generated from human feedback.

In order to develop a robot director that can truly learn human lens language and use it in various scenarios, we need not only precise data collection during cinematography but also the capacity to continuously improve the model based on external feedback. Using the current auto cinematography algorithms [3–5], it is possible to build a framework for directors to use so that their chosen camera settings can be recorded as training data for reinforcement learning algorithms to learn lens language models. This observation inspired us to extend our previous work [4] in the direction of reinforcement learning, because all the data needed for cinematography can be collected during the film production process without extra effort within our framework. We reused the structure of the filmmaking framework T2A proposed in [4], but replaced the camera optimization process in T2A with a reinforcement learning module for auto cinematography; the new framework is therefore called RT2A. A director's feedback module is incorporated to correct the camera placements, and such feedback is valuable for the camera agent to learn and imitate the lens language of the target human director. In practical scenarios, the director's feedback can come from stored training data collected from real human directors during their routine filmmaking work using our tool [4]. With sufficient data and feedback, RT2A can effectively learn to support a robot director that has the potential to become the embodiment of a real director and produce animation with a similar lens language. To the best of our knowledge, this is the first work that applies reinforcement learning to auto cinematography in a way that can effectively imitate the human director's usage of lens language.
The rest of the chapter is organized as follows: Section 5.2 overviews the related work in computational cinematography; Section 5.3 presents the overall framework of RT2A, the problem formulation, and the solution; Section 5.4 discusses the experimental results; and the last section draws conclusions about the work.

5.2 Related Work

The quality of the lens language is a crucial factor that determines the quality of the final film. It requires an acute artistic sense as well as great knowledge of cinematography to use shots that go beyond the basic rules. Lured by the wish to use auto cinematography techniques to reduce film production cost, a considerable number of studies have been conducted in this area, and they can be roughly categorized into two directions.

The first direction is the cinematography guideline rule-based approach. By defining multiple constraints according to various rules and formulating them into corresponding loss functions, the optimal shot setup for the current situation can be calculated by minimizing the total loss. Different aesthetic constraints can be applied in separate scenarios, such as dialogue scenes [59], cooking [45], and outdoor activities [50]. However, as discussed in [64], it is difficult to judge whether the use of lens language is sufficient under aesthetic models alone. In [4], the fidelity model was introduced, which considered the original script content in the optimization process to make sure that the generated video is fully aligned with the script. The advantage of the rule-based approach is that the results are perfectly consistent with the pre-defined constraints, thus guaranteeing that no rule-breaking will occur; however, it also severely limits the freedom and creativity of the artist.

Figure 5.1: Overview of our RT2A animation production framework

The second direction is to learn the behavior model of the cameras directly from existing movies [51]. The advantage of this approach is that it is a data-driven methodology instead of a rule-driven solution, so there is no need to create rules or constraints to guide the algorithm. By using reinforcement learning, such as deep Q-learning (DQN) [84], Trust Region Policy Optimization (TRPO), or Proximal Policy Optimization (PPO) [85], a solution can be found given sufficient training data. In the literature, there are works in the field of drone photography control [62, 86]; however, no work can be found that uses reinforcement learning in auto cinematography for filmmaking. A possible reason is that obtaining sufficiently precise data from existing movies, as in the work of [51], is a non-trivial task.

5.3 Framework Design

The filmmaking process is demonstrated in Figure 5.1, which contains four major modules, namely, Action List Generation, Stage Performance, Auto Cinematography, and Video Generation, and two additional modules (in orange) to support the reinforcement learning workflow. Action List Generation takes the textual script as input, analyzes it with NLP techniques [53], and then generates the corresponding action list, which is a chronological list of action objects a_t and can be considered a special format for representing the content of the original script. Each a_t contains the necessary information for the virtual characters to produce the corresponding performance p_t in the Stage Performance. The entire script's story can be performed in the Stage Performance with a sequence of p_t that follows the order in the action list.
At this phase, all the characters appearing in the scene are bound with multiple cameras that can be distinguished by their unique indices. The frames captured by the cameras also become part of p_t. The Observation Extractor takes p_t and generates the corresponding observation o_t (details can be found in Section 5.3.1) that is defined in the auto cinematography RL environment for the camera agent to compute the currently selected camera c_t^i (where i is the index of the camera). When the initial training data is insufficient, the acceptance rate of the c_t^i computed by the camera agent in the Auto Cinematography module will be low, so manual adjustment of c_t^i by the director is required to assure video quality. This modification process is accomplished in Director Adjustment. The revised c_t^i is then annotated as the ground-truth camera gc_t and added, together with o_t, into the training data. The p_t and gc_t can be used by Video Generation to generate the current output video frame. As the number of desirable videos generated with the RT2A framework grows, the training data grows as well. With sufficient training data, Auto Cinematography is able to properly train the camera agent and update its policy with the RL algorithm, so the workload required for directors to adjust the camera placement is significantly reduced.

In Figure 5.1, the camera agent training process of the auto cinematography module is shown with red lines. An observation generator uses the training data to generate a single observation, o_t, at time t. The RT2A camera agent takes o_t as input to obtain the corresponding selected camera, c_t^i. The reward function calculates the reward, r_t, by comparing c_t^i with gc_t. The r_t is then used to update the RT2A camera agent's policy and parameters via the RL algorithm.

5.3.1 Proposed Auto Cinematography

In the auto cinematography environment, the observation space contains all the information needed by the RT2A camera agent to select the optimal camera setting based on the current policy. The following aspects are included in the observation space.

Character Visibility Character visibility is determined by two factors: (i) the size of the character in the frame compared to the total frame size, and (ii) the weights for the various combinations of character groups and cameras used during the calculation, since multiple cameras are bound to different characters; this reflects the extent to which obstacles obscure the characters from the current camera view.

Camera Configuration When switching cameras, the configuration of the previous camera may also have an impact on the selection of the next camera, such as in the shot-reverse-shot commonly used in dialogue scenes. In such cases, the configuration of the previous camera is included in the observation.

Left to Right Order (LRO) The LRO shows the positional relationship of multiple characters in the shot frame, and the inclusion of this data has a very significant impact on the 180-degree rule, which is one of the most important rules in cinematography. Note that the LROs at time t taken by different cameras may not be the same.

Action Type Some character actions, such as "idle", do not affect the shot selection, while others may have a preference for shot selection. For example, the facial action of "anger" is more inclined to be expressed by a close shot. In our experiments, we divide the movements into the categories of facial, upper limb, lower limb, whole body, and standby.

Action Start Time and Duration The start time of every character's action is crucial, and generally the transition of the shot c^i occurs at the beginning of a certain action. On the other hand, the length of the action is also crucial. In many cases, an action with a long performance time (e.g., more than 10 seconds) requires a combination of different shots to capture it.

Dialogue Start Time and Duration The character's talking action in the dialogue scene is special. During long conversations, the camera angle is switched between the interlocutors, and often an over-the-shoulder shot is used in such a scenario.
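To make the observation space concrete, the sketch below groups the items above into one record per time step; the field names and types are illustrative assumptions rather than the actual data layout used by RT2A, and each record is paired with the director-approved camera index gc_t after the Director Adjustment step.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    """One observation o_t for the RT2A camera agent (illustrative fields)."""
    character_visibility: List[float]  # weighted visibility per bound camera
    prev_camera_index: int             # configuration of the previously selected camera
    left_to_right_order: List[int]     # LRO of the characters in the current frame
    action_type: int                   # facial / upper limb / lower limb / whole body / standby
    action_start: float                # seconds since the scene started
    action_duration: float             # seconds
    dialogue_start: float              # seconds, or -1.0 if no dialogue is active
    dialogue_duration: float           # seconds

@dataclass
class TrainingSample:
    """An (o_t, gc_t) pair recorded after the Director Adjustment step."""
    observation: Observation
    ground_truth_camera: int           # director-approved camera index gc_t
```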
In our auto cinematography environment, the only action in the action space is camera index selection. There are various default cameras that shoot the characters from different distances and angles. The default cameras cover most of the basic camera settings in the cinematography guidelines, and each camera has been given a unique index for the agent to select it. A default camera setting can be described by three parameters:

1. d(c): the distance between the camera and the character being shot, which may be an extreme close-up (ECU), close shot (CU), medium shot (MS), full-body shot (FS), or long shot (LS). By quantizing these distances to a numeric representation from 0 to 4, we can calculate the differences between them.

2. h(c): the pan (horizontal) angle, from 0° to 360°.

3. p(c): the pitch of the camera, from −15° to 15°.

Determining a meaningful reward function is crucial for the RL algorithm. The reward function is constructed only from the attributes of c_t^i and gc_t, where we consider the difference between their settings in detail. The more similar the camera settings, the higher the r_t. The reward function for distance is defined as:

r_t^d = \begin{cases} 1 & \text{if } d(c_t^i) = d(gc_t) \\ 1 - \frac{|d(c_t^i) - d(gc_t)|}{4} & \text{otherwise.} \end{cases}   (5.1)

Similarly, the reward functions for the pan and pitch of the camera are defined as:

r_t^h = \begin{cases} 1 & \text{if } h(c_t^i) = h(gc_t) \\ 1 - \frac{|h(c_t^i) - h(gc_t)|}{30} & \text{otherwise,} \end{cases}   (5.2)

r_t^p = \begin{cases} 1 & \text{if } p(c_t^i) = p(gc_t) \\ 1 - \frac{|p(c_t^i) - p(gc_t)|}{180} & \text{otherwise.} \end{cases}   (5.3)

When the difference between the agent-selected camera and the ground truth is less than a predefined threshold, δ, an extra reward is added to further boost the learning process. The larger the δ, the larger the deviation between c_t^i and gc_t that can be tolerated:

r_t^δ = \begin{cases} 1 & \text{if } \frac{|d(c_t^i) - d(gc_t)|}{4} + \frac{|h(c_t^i) - h(gc_t)|}{30} + \frac{|p(c_t^i) - p(gc_t)|}{180} < δ \\ 0 & \text{otherwise.} \end{cases}   (5.4)

The overall reward function is defined as the sum of all the previous rewards:

r_t = r_t^d + r_t^h + r_t^p + r_t^δ.   (5.5)
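The reward in Equations 5.1-5.5 can be sketched as follows; the dictionary representation of a camera and the example threshold value are assumptions made for illustration.

```python
def camera_reward(cam, gt, delta=0.1):
    """Reward for an agent-selected camera `cam` against the ground truth `gt`
    (Equations 5.1-5.5). `d` is the quantized distance (0-4), `h` the pan angle
    in degrees, and `p` the pitch in degrees; a sketch only."""
    dd = abs(cam["d"] - gt["d"]) / 4
    dh = abs(cam["h"] - gt["h"]) / 30
    dp = abs(cam["p"] - gt["p"]) / 180

    r_d = 1.0 - dd          # equals 1 when the distances match
    r_h = 1.0 - dh          # equals 1 when the pan angles match
    r_p = 1.0 - dp          # equals 1 when the pitches match
    r_delta = 1.0 if dd + dh + dp < delta else 0.0  # bonus for a near match
    return r_d + r_h + r_p + r_delta

# Example: a camera close to, but not identical with, the director's choice
print(camera_reward({"d": 2, "h": 30.0, "p": 5.0}, {"d": 2, "h": 45.0, "p": 0.0}))
```

The extra term r_t^δ acts as a bonus that sharpens the gradient toward near-exact matches, which is consistent with the role of δ described above.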
5.3.2 Reinforcement Learning Algorithm

An agent starts with observation o_0 and selects the next camera index according to its strategy, the policy π(o), which the agent uses in pursuit of its goal, in most cases to maximize the return R. The return R for an episode with T steps is defined as:

R = \sum_{t=0}^{T} \gamma^{t-1} r_t,   (5.6)

where r_t is the immediate reward at time t and γ is a discount factor that defines the importance of r_t versus future rewards, with typically 0 ≤ γ ≤ 1. The higher the γ value, the more important the future rewards. The RL algorithm aims to find the optimal policy π* that maximizes R:

π* = \arg\max_{π} E(R \mid π).   (5.7)

This process is accomplished by iteratively updating the parameters of the policy, π_θ, according to a loss function L(·) that measures the error between the reward estimate calculated by the current policy, π_current, and the previous policy, π_old.

There are several methods to address the problem raised in Equation 5.7; we use PPO, a variant of Advantage Actor-Critic (A2C) [87], which combines policy-based and value-based RL algorithms. The actor neural network takes the state (or observation) and outputs the action according to π(·), and the critic neural network maps each state to its corresponding value (i.e., the expected future cumulative discounted return). The advantage \hat{A} is used to indicate how good a camera selection is compared to the average camera selection for a specific observation. The advantage at time t is defined as:

\hat{A}_t = -V(o_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(o_T),   (5.8)

where V is the learned state-value function and r_t is the immediate reward at step t. During the training process, PPO updates the parameters of both the actor and critic neural networks by back-propagation according to two different loss functions. Every update of π is designed to maximize the overall return (i.e., max[E_t(r_t(θ) \hat{A}_t)]), where r_t(θ) is the probability ratio between π_current and π_old at step t. However, excessively large changes to π_θ in a single update need to be avoided. Thus, L_actor is defined as:

L_{actor}(θ) = \min\big(r_t(θ)\,\hat{A}_t,\ \text{clip}(r_t(θ), 1 - ϵ, 1 + ϵ)\,\hat{A}_t\big).   (5.9)

The clip(·) operator modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving r_t(θ) outside of the interval [1 − ϵ, 1 + ϵ]. Regardless of the amount of positive feedback gained from c_t^i, PPO will only update the policy within this limited range; thus, it can incrementally update π_θ by an appropriate amount. However, the penalty based on a negative reward has no such limitation. The critic loss function L_critic is defined as:

L_{critic} = \frac{1}{2} \hat{A}^2.   (5.10)
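A minimal PyTorch sketch of the two PPO losses in Equations 5.9 and 5.10 is given below; the sign convention (both losses are minimized) and the default clipping value are assumptions, and the tensors are assumed to hold a batch of per-step quantities.

```python
import torch

def ppo_losses(log_prob_new, log_prob_old, advantage, epsilon=0.2):
    """Clipped actor loss (Equation 5.9) and critic loss (Equation 5.10)."""
    ratio = torch.exp(log_prob_new - log_prob_old)       # probability ratio r_t(theta)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage
    actor_loss = -torch.min(unclipped, clipped).mean()   # negated surrogate, to minimize
    critic_loss = 0.5 * advantage.pow(2).mean()          # L_critic = (1/2) * A^2
    return actor_loss, critic_loss
```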
5.4 Experimental Results

Figure 5.2: Sample frames for the scene maintaining chronological order from left to right. As shown, our approach better imitates the lens language used by the animators

In this section, the experimental setup and the performance of our proposed RL auto cinematography method are presented. We developed the auto cinematography environment and implemented the RT2A camera agent using OpenAI Gym [88]. It took around 38 hours to finish the whole training process on an NVIDIA RTX 2080 Ti GPU. To evaluate the advantage of our proposed RL auto cinematography model, we compare RT2A with the reference methods: aesthetic-based [5] and aesthetic + fidelity-based [4]. The performance is evaluated from two aspects: (i) we compare the camera placements generated by the RT2A camera agent and by the reference algorithms; (ii) we compare the visual quality of the videos produced by RT2A and the reference algorithms.

Figure 5.3: Camera placement acceptance rate with different acceptance thresholds

Difference in camera placement In the experimental environment, the cameras generated by the different methods are sampled every second. By comparing the physical distance between the camera placement of each algorithm and the ground truth (placed manually by directors), and defining several acceptance thresholds (that is, a placement is accepted if the physical distance is less than the threshold), the performance of the algorithms can be measured by the percentage of cameras being accepted, which is called the acceptance rate. The calculation of the physical distance is based on Equations 5.1, 5.2, and 5.3. As the results in Figure 5.3 show, the proposed algorithm consistently outperforms the reference algorithms, and the gain in acceptance rate is up to around 50%. The visual comparison of the generated frames from the various algorithms, shown in Figure 5.2, also indicates that the proposed algorithm matches the cameras selected by directors much more closely than the reference methods. This evidence shows that RT2A can effectively imitate the director's camera usage model after training.

Number of shots A complete shot is a continuous view through a single camera without interruption. The number of shots (i.e., the average shot duration) used in a single scene is also an important indicator of the director's shooting style. Table 5.1 shows the number of shots of our proposed auto cinematography approach compared to the reference methods in several different scenes selected from the test data set. The results indicate that the number of shots used in each of the test scenes by our proposed method is much closer to the ground truth than that of the reference methods. Visual results also support this conclusion when the frames generated from the aesthetic model [5] are compared to the frames generated by RT2A. It is important to realize that, although it may be possible for a rule-based approach like the aesthetic model to mimic the director by adding several rules to the optimization, the challenge of adjusting the weights for the various parameters and cost functions is much more complicated than with the proposed RT2A algorithm. In the following, we describe how the reference model could achieve similar visual results by adding the corresponding cost functions or adjusting weights, which convincingly demonstrates the advantage of the proposed RT2A algorithm's data-driven methodology over manually crafting many rules.

Particular shot selection As demonstrated in [89], "the single long shot showing initial spatial relations became one portion of the scene, usually coming at the beginning. Its function then became specifically to establish a whole space which was then cut into segments or juxtaposed with long shots of other spaces." Long shots used as establishing shots set a particular tone and mood for what the audience is about to see. As shown in Figure 5.4, compared to the reference approach, the RT2A camera agent learns this lens language better and uses the establishing shot at the beginning of the scene. If the reference algorithm were to achieve a similar outcome, a new cost function determining whether t represents the first few frames would need to be included in the optimization framework, and the characters captured by the LS shot would be required to occupy the least space in the frame compared to other shot types, which means that the weight of character visibility in the cost function would need to be minimized.

Table 5.1: The difference in the number of shots of our proposed and reference methods compared with the director's selected camera placement.
The results show the number of shots (and the percentage difference compared with the shots selected by the director) used by the different approaches in a single scene.

Script   Aesthetic    Aesthetic + Fidelity   RT2A        Director
1        32 (33%)     35 (46%)               27 (13%)    24
2        50 (150%)    43 (115%)              28 (40%)    20
3        45 (95%)     40 (74%)               31 (40%)    23
4        37 (146%)    39 (160%)              21 (35%)    15
5        15 (150%)    13 (116%)              8 (33%)     6

The over-the-shoulder shot is widely used in dialogue for the audience to understand the relationship between the characters and to convey dramatic tension to the viewers [90]. It is sometimes necessary to use it to assemble the reverse shot in a dialogue scene. As shown in Figure 5.5, the RT2A camera agent learns this lens language successfully and uses the over-the-shoulder shot in dialogue scenarios. If the reference model were to achieve a similar result, a new cost function would need to be added to determine whether there is enough duration in the current "speak" action to switch between shots. In addition, the weight of the camera placement used to take the over-the-shoulder shot would need to be modified during the optimization process, to loosen the requirement of capturing actions from the back of the character.

Figure 5.4: Captured establishing shot frames from the camera generated by the aesthetic model (left) and the camera generated by RT2A (right)

Figure 5.5: Captured dialogue frames from the camera generated by the aesthetic model (left) and the camera generated by RT2A (right)

Figure 5.6: Captured single action frames from the camera generated by the aesthetic model (left) and the camera generated by RT2A (right)

Sometimes the actions of particular characters need to be shot from a close distance and an appropriate angle for the audience to better understand the content. As shown in Figure 5.6, the frames captured by the camera selected by the reference model do not illustrate the action "inspect the item" very well. In order for the aesthetic model to achieve a similar result, the weights of different camera configurations would have to be manually modified for particular actions according to the interpretation of the story.

5.5 Conclusions

The creativity and imagination demonstrated by the use of lens language in 3D animation and filmmaking require tremendous professional knowledge and talent. It would benefit entry-level artists if an auto cinematography algorithm could learn from professional directors and thus mimic the camera language used in successful films. However, a major obstacle is that there is insufficient accurate training data in this area. In this work, we presented the RT2A framework, which produces accurate data for training the auto cinematography system with the support of directors. By learning the lens language from these directors, RT2A can select the right distance and camera angle for the auto cinematography process.

Chapter 6

Automated Adaptive Cinematography in Open World

6.1 Introduction

In this chapter, we extend auto cinematography to the open-world environment. By leveraging Generative Adversarial Networks (GANs) and optimized camera placement spaces (such as toric surfaces), automatic film cinematography techniques can deliver satisfactory results in more open and uncertain environments. Such advancements facilitate cost savings in production and enhance the audience's sense of immersion.
The gaming industry's progression has been persistently propelled by the aspiration to provide users with increased autonomy in gameplay, which serves to elevate their pleasure and absorption within the game experience [91]. This has resulted in a trend toward the creation of open-world games, which provide users with a more expansive and unrestricted environment and, as a result, a more enriching experience. Cinematography is a crucial element in conveying character emotions and advancing the plot in media such as games and films, and it can significantly enhance the immersion of the audience if utilized effectively. Research on user emotions and cognition in video games has demonstrated that virtual cinematography is closely connected to the user experience. The connection between user and camera transcends the norms of conventional cinematography, as virtual cinematography predominantly centers on the character throughout its entire use [92, 93].

In cinematography, the use of lens language techniques is contingent upon several intricate factors, such as character emotions, actions, character count, and the environment. In conventional film and narrative-driven games, both the story and character development adhere to a predetermined script. Film directors can reference the script to strategize and select suitable lenses during production, which simplifies the process of optimizing the viewer's experience. Employing this approach enables film directors to generate a streamlined camera script, which subsequently facilitates the production of aesthetically pleasing shot compositions. Moreover, this method significantly reduces associated costs [3, 56]. In other works [4, 94], classical lens language techniques were condensed into a set of camera rules. Because these techniques are simple to implement, they have facilitated the development of automated cinematography methodologies based on these principles. When supplied with adequate information, these approaches can generate satisfactory outcomes for the average viewer. They also contribute to a significant reduction in production expenses.

Open-world environments are characterized by a high degree of uncertainty (as shown in Table 6.1). As anticipated, the enhanced freedom afforded by these environments introduces novel challenges for the implementation of cinematography. Traditional rule-based automatic cinematography approaches prove unsuitable for such settings due to their unpredictable nature. Users in these environments tend to favor the capacity to craft and generate their own unique narratives via interactive gameplay, rather than being constrained by pre-established storylines.

Simultaneously, artificial intelligence has facilitated the responsiveness and adaptability of in-game characters and objects to alterations within the virtual environment. As a result, the actions and interactions of virtual characters and objects in open-world scenarios are no longer constrained by scripts. This freedom allows users to indulge in unique and personalized experiences. In addition, user-controlled characters or avatars are no longer mandated to adhere to a predetermined script. They can engage with virtual-world objects in real time, through interactions that transcend the limitations of pre-defined choices. These interactions can assume a variety of forms, supported by the employment of portable devices for facial and action capture.
Such novel interaction modalities have considerably amplified the uncertainty surrounding aspects such as narrative development, character relationships, and the context of character interactions. While rule-based automatic cinematography can still be employed to a degree, utilizing limited, predetermined camera angles as evidenced in [95], it is not without limitations. Rule-based methods may repetitively deploy the same perspectives. Coupled with their inability to effectively portray the relationships between characters and objects, these methods can potentially compromise user immersion within the virtual environment [96].

Despite the mounting demand for freedom in interactive experiences, research on automatic cinematography in open-world contexts remains limited. In recent years, AI technology has been applied to the field of automatic cinematography [4, 94]. This development allows film directors to glean lens language usage from data and extend its application to a broader array of scenarios. Therefore, the utilization of AI-based automatic cinematography holds significant potential in open-world environments, as it can effectively bolster the immersive gaming and viewing experience.

One of the key challenges we encountered was determining the camera placement space that allows the algorithm to achieve high levels of shot quality. The toric surface has been suggested as a suitable selection space for automatic camera placement in previous studies [74]. However, incorporating character actions and object states in open-world environments requires further consideration. To tackle this challenge, we extend the toric surface to accommodate character emotions, actions, and the environment, enabling appropriate camera selections for various situations within open-world settings. Additionally, our system framework is designed to be flexible and open to alternative options, not limited to the toric surface for camera placement possibilities.

Moreover, maintaining consistency among character movement, emotions [97], and camera movement is crucial for achieving cohesive and effective results [98–100]. Recent studies have demonstrated that automatic camera movement focusing on individual characters effectively conveys their emotions and movements [101]. This overcomes the limitations of approaches that track a character's head movement without considering their emotional state [102]. In our approach, we further expand upon this work and apply it to open-world scenarios.

To augment the aesthetic value of the frames captured by automatic cinematography, we integrated data from professional film directors' shot selections into the loss function employed in our automated camera generation model [103]. The model's performance has been enhanced, enabling it to learn how professional film directors utilize shot language to capture frames in diverse situations.

To the best of our knowledge, this is the first effort in the automated cinematography field that generates camera movement for multi-character interactions in an open-world environment. The contributions of this chapter can be summarized as follows:

1. To address the challenge of open-world character interactions, we propose a novel auto cinematography framework called AACOGAN based on Generative Adversarial Networks (GANs). AACOGAN is designed to generate camera movements that are consistent with the interactive actions of the characters in the open-world environment.
2. For open-world scenarios, we develop comprehensive quality assessment metrics tailored for automatic cinematography. These metrics effectively aid in evaluating the quality of generated camera movements.

3. To guarantee consistency between character and camera movements and to tackle the challenge of camera movements in multi-character interaction scenes, a unique input feature and generator architecture are employed. This design promotes precise alignment of generated camera movements with diverse character interactions, which is particularly effective in multi-character scenes.

The rest of the chapter is organized as follows: Section 6.2 provides a literature review on computational cinematography, video editing, and video understanding. Section 6.3 presents our proposed framework, problem formulation, and solution. Section 6.4 discusses the experimental results and the potential impact of video understanding errors on the framework. Section 6.5 concludes the work.

Table 6.1: A comparison of three different types of user experiences (traditional movie, purpose-driven interactive movie or game, and open world) in various aspects as perceived by the audience

                       Traditional Movie    Interactive Movie/Game   Open World
Protagonist control    No                   Yes                      Yes
Script form            Single-line          Branching                Free-form
Interactive distance   Fixed                Fixed                    Varies
Interactive choice     None                 Limited                  Any
Interactive objects    Preset               Selectable               Any in scene
Lens usage             Plot-based           Idiom-based              Not preset
Emotions by script     Yes                  Yes                      No
Camera movement        Director's choice    Director's choice        Fixed (view)
User experience        Watch                Script-driven            Freely decide

6.2 Related Work

Cinematography is crucial in fostering immersion and realism for users in open-world games. Dynamic camera movements and carefully crafted shots direct the user's attention toward specific objects or events [100, 104], and can generate a sense of momentum during action sequences. Moreover, cinematography establishes the game world's mood, tone, and atmosphere while conveying information and emotions to users. However, manual shot production demands considerable time and cost. Consequently, automation in cinematography techniques, aimed at expediting production and reducing expenses, has garnered substantial attention in recent decades.

Initially, the film industry attempted to tackle the challenge of generating suitable camera movements using a film director-led approach, wherein a camera script was written based on the story's plot [102, 105]. This method provides precise control over camera movements through multiple constraint elements. However, it necessitates mastering a specialized camera control script format and demands considerable cinematography expertise and manual effort from the creator for each shot.

Recent studies [47, 106] identified cinematic idioms as frequently used camera movements in specific scene types. An automated cinematography program can align content with relevant cinematic idioms for filming, based on the content that needs capturing. This approach and its advanced derivatives [4, 49, 59] have been developed for films or games with clear narrative structures, reducing the need for filmmaker cinematography expertise and enhancing efficiency and convenience. The connection between lens language and cinematography has been scrutinized in [95, 96]. These studies illustrate that employing various idioms based on game content can effectively guide user emotions, leading to increased user immersion.
However, this approach may be unsuitable for open-world scenarios, as camera shot selection depends on character actions, emotions, and object interactions. These variables complicate the determination of idioms. An alternative automated cinematography approach, as described in [94, 107], might be more fitting for open-world scenarios. This method trains the model to mimic human directors' lens language in various contexts, integrating all relevant lens language factors into the training data. This results in more flexible and adaptive outcomes, as lens language can vary based on character actions and object interactions in open-world environments. The GAN-based automated cinematography of [101] shows that it can effectively execute camera shots focused on a single character while considering actions and emotions. However, this system fails to establish relationships between characters and objects, which is crucial for lens language selection in open-world environments.

Figure 6.1: AACOGAN architecture overview, where O is the input feature, C is the generated camera trajectory, Ĉ is the ground truth camera trajectory, and L is the loss

The selection space for camera positions is vital in open-world scenarios, as the numerous camera placement options render most of the space redundant. Restricting camera movement to a widely accepted camera movement space reduces computational complexity, simplifying calculations while ensuring high-quality results [74, 108].

Automated cinematography often lacks objective evaluation criteria, resulting in less widely accepted outcomes. A potential solution is incorporating an aesthetic assessment model that evaluates frames based on commonly accepted aesthetic standards [109–111]. Integrating this model into AI training for automatic cinematography has the potential to significantly improve the general quality of captured frames.

Figure 6.2: AACOGAN generator architecture overview. The green blocks are the inputs to the generator model, including skeletal key point position (A), point speed (AV), point acceleration (AD), initial camera parameters (CAM), initial theta value on the camera moving surface (IniTheta), user-controlled character position (MPOS), and target object positions (TPOS). The block labeled "Latent" is the model that generates the latent representation through feature extraction. A detailed explanation of this overview is given in Section 6.3.2.

6.3 Problem Formulation

Evaluating the quality of lenses generated by automated photography technology presents a complex challenge, given the absence of universally acknowledged objective evaluation standards [64]. Traditional user research, employed for outcome assessment, may be hampered by biases and limitations in addressing dynamic user demands, potentially impeding system enhancements. Therefore, the development of objective evaluation criteria is vital for improving research results.
6.3.1 Quality Measurement

As shown in Table 6.1, open-world environments significantly differ from traditional shooting settings, making it challenging to directly apply automatic photography techniques designed for conventional film productions. These environments involve factors such as user-controlled character exploration, spatial relationships between objects and characters, varying interaction methods, and fluctuating numbers of participating individuals. In these contexts, user-directed character movements are often primarily focused on the character itself rather than the entire scene. To effectively address these unique requirements for automatic cinematography in open-world contexts, we propose an equation with rational metrics to evaluate the quality q of the automatically generated camera, as detailed below:

q = Q(D_corr(C, A), R(C), S_aes(C)),   (6.1)

where Q(·) is the quality function, C is the generated camera trajectory, and A is the position and rotation of the skeletal bones during the character's interactive movement. D_corr(·) is the function that calculates the similarity between the camera and character movement trajectories, R(·) is the function that calculates the ratio of frames that capture all characters during the interaction, and S_aes(·) calculates the aesthetic score of the frames captured by the given C.

Undoubtedly, the employment of cinematic language as an essential component of the artistic domain is contingent upon subjective assessments. Consequently, a preferable approach might be to enable automatic cinematography systems to learn from human film directors in the gaming industry [94], rather than relying exclusively on fixed algorithms for camera movement generation. By leveraging pertinent data to mimic the cinematic language habits of directors and iteratively refining the generated algorithm based on user feedback, neural network technology could effectively replicate directorial expertise. During the system training phase of our experiment, the preceding quality measurement defined in Equation 6.1 can be expanded to Q_ref as follows:

q_ref = Q_ref(D_corr(C, A), R(C), S_aes(C), Dis(C, Ĉ)),   (6.2)

where Dis(·) represents the distance metric that measures the dissimilarity between C and the ground truth camera motion Ĉ obtained from human film directors, who authored the actual in-game camera movements. It should be noted that the approach we use to define Q_ref is not necessarily a standard one. In the following, we provide a detailed explanation of each component that makes up Q_ref.

In gaming scenarios, participants can assert control over the in-game environment and status through the prescribed character's interactive functionalities, a process that can also effect consequential alterations to camera trajectories based on rules [99]. Within open-world contexts, user influence on character control resembles that in gaming, where these interactive actions predominantly change the open-world environment and status. As pointed out in [112], preserving consistency in character control, which supports expected character and camera movements, can mitigate users' sense of disorientation.
Consequently, the congruence between the trajectories of the key skeletal nodes of the characters along each axis during interactions and the camera motion trajectory serves as a metric for evaluating their coherence. The similarity difference can be computed using the correlation distance d_corr for each position and rotation axis between the two trajectories, as follows:

d_corr = D_corr(C, A) = \sqrt{1 - \left( \frac{\sum_{t=1}^{n-1} (f_t - \bar{f})(p_t - \bar{p})}{\sqrt{\sum_{t=1}^{n-1} (f_t - \bar{f})^2 \sum_{t=1}^{n-1} (p_t - \bar{p})^2}} \right)^2},   (6.3)

where f_t and p_t are the frame and position coordinates of the t-th point on the trajectory C, n is the number of frames for the action and camera movement, and \bar{f} and \bar{p} are the mean values of the f and p coordinates, respectively.

The absence of a predefined script in open-world environments poses a significant challenge to automated filming techniques, especially when it comes to focusing on multiple points [74]. In scenarios involving multiple parties, it is crucial for the camera to capture the entire interaction process and all characters comprehensively, not solely the character currently under user control. To address this challenge, a potential strategy is to facilitate the camera's extended capture of all involved characters throughout the camera movement process. An effective metric for evaluating the efficacy of capturing all involved characters, r, throughout the camera movement process is the ratio of the number of frames in which all interactive characters appear to the total number of frames n used for the interaction, which can be calculated by:

r = R(C) = \frac{\sum_{t=0}^{n} R_{frame}(f_t)}{n} \times 100\%,   (6.4)

where R_frame(·) equals 1 when all the interactive characters are present within f_t, and 0 otherwise.

Although objective criteria for evaluating the quality of imagery generated by automated cinematography techniques remain elusive, aesthetic evaluation models for images have been widely acknowledged, as presented in [111, 113]. As a sequence of images, the video captured during camera movement can be evaluated objectively in terms of aesthetics by calculating the aesthetic score of each frame captured during the camera movement process. Integrating aesthetic models into the automated cinematography system can improve the conformity of captured imagery with objective standards. This aesthetic score can be calculated by:

s_aes = S_aes(C) = \frac{\sum_{t=1}^{n} AES(f_t)}{n},   (6.5)

where f_t is the visual content captured by C at the t-th frame, AES is the model for image aesthetic evaluation, and n is the number of frames for the camera movement.
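The three quality components in Equations 6.3-6.5 can be sketched as follows; the argument shapes and the callable aes_model are assumptions made for illustration, and the correlation distance is shown for a single position or rotation axis of the camera and character trajectories.

```python
import numpy as np

def correlation_distance(camera_axis, skeleton_axis):
    """Per-axis correlation distance of Equation 6.3: sqrt(1 - rho^2), where rho
    is the Pearson correlation between the two trajectories (a sketch)."""
    rho = np.corrcoef(np.asarray(camera_axis, float),
                      np.asarray(skeleton_axis, float))[0, 1]
    return float(np.sqrt(max(0.0, 1.0 - rho ** 2)))

def all_character_ratio(frame_has_all_characters):
    """R(C) of Equation 6.4: percentage of frames that contain every interactive
    character; the argument is one boolean per captured frame."""
    return 100.0 * sum(frame_has_all_characters) / len(frame_has_all_characters)

def aesthetic_score(frames, aes_model):
    """S_aes(C) of Equation 6.5: mean aesthetic score of the captured frames,
    where `aes_model` is any image aesthetic assessment callable (assumed)."""
    return sum(aes_model(f) for f in frames) / len(frames)
```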
The emotional state of characters is a crucial factor that significantly influences automated camera movement, as emotions can greatly impact shot selection [100, 101]. Furthermore, previous studies indicate a positive correlation between heightened screen motion intensity and viewer arousal [97]. Consequently, even with identical interactive actions, varying emotional states should yield different camera movement speeds and amplitudes to better convey the characters' current emotional state. The relationship between emotions and camera movement quality cannot be directly evaluated, as each individual possesses distinct standards for camera movement amplification in response to various emotions. This difference is directly reflected in the actual camera trajectory and can be considered a component of Dis(C, Ĉ). The Dis(·) is employed to compute the distance between the generated and real camera drive data, and consists of two parts, the mean squared error (MSE) and the Euclidean distance, represented as follows:

d_dis = Dis(C, Ĉ) = MSE(C, Ĉ) + Euclidean(C, Ĉ).   (6.6)

In response to the aforementioned problems and challenges, a deep-learning-based generative model, AACOGAN, built on GANs, is developed to enable automatic cinematography in the open-world environment. GANs have demonstrated impressive results in generating synthetic data samples that resemble data from a training dataset, and are commonly used for generating image [114] and audio [115] signals. The essence of camera movement in cinematography is the variation of the camera's position and rotation along different axes over time, which is comparable to the time-varying intensity of audio signals.

6.3.2 AACOGAN Architecture

Based on the aforementioned factors that influence the generated camera trajectory in AACOGAN, we have designed the input (green blocks of Figure 6.2) for the generation model. Character skeletal animation, also known as skeletal animation (Figure 6.3), is a widely adopted technique in the animation field that enables the generation of realistic and intricate movement. This technique involves continuously recording A of the skeletal bones during the character's interactive movement, as well as computing and documenting the speed (AV) and acceleration (AD) of these bones. By including the speed and acceleration of the skeletal bones as input features, the generation model can make more accurate predictions about the future position of the camera, particularly in cases of continuous camera movement. The initial camera position is also significant, as it cannot be forecast by the generation model. In addition to the 3D coordinates and rotation (IniCAM) of the camera's initial position within the virtual environment, the relative coordinate on the toric surface (IniTheta) is included as an input parameter, because the movement of the camera is confined to this surface in the experiment.

Figure 6.3: 3D character skeleton illustration. Each tetrahedron corresponds to a bone in the skeletal animation

Character and object positions, specifically the positional relationship between the user-controlled character (MPOS) and the target interactive objects (TPOS), are crucial factors in determining the camera's path. In the experiment, the 3D coordinates of both the character and the interactive objects are recorded. It should be noted that there may be multiple interactive objects, and thus a set of vectors representing their positions is included in the experiment, with a maximum of five vectors. The character's emotional state, Emo, is a crucial factor in the selection of lens language. As currently available consumer-grade motion capture technology is inadequate for capturing the user's emotions, emotions are randomly assigned to each of the character's interactive actions during the experiment. They are represented by numerical values ranging from 0 to 1, indicating the level of impact that the emotion has on the character's movements. Therefore, the input, O, for the generator can be represented as the collection of the above factors. The overall architecture of the AACOGAN model is shown in Figure 6.1.
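The collection O described above can be pictured as a single container per interactive action; a sketch is given below, where the field names follow the labels in Figure 6.2 but the exact array shapes are assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GeneratorInput:
    """Input O for the AACOGAN generator over T samples (shapes are assumed)."""
    A: np.ndarray        # (T, bones, 6) skeletal bone positions and rotations
    AV: np.ndarray       # (T, bones, 6) bone speeds derived from A
    AD: np.ndarray       # (T, bones, 6) bone accelerations derived from A
    IniCAM: np.ndarray   # (6,) initial camera 3D position and rotation
    IniTheta: float      # initial relative coordinate on the toric surface
    MPOS: np.ndarray     # (T, 3) user-controlled character position
    TPOS: np.ndarray     # (T, 5, 3) up to five target interactive object positions
    Emo: float           # emotion intensity in [0, 1]
```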
The generator is designed to learn the pattern of the ground truth camera drive data sequence, Ĉ = [ĉ_0, ĉ_1, · · · , ĉ_T], for each interactive action and to generate a corresponding sequence of camera drive data, C = [c_0, c_1, · · · , c_T], based on the sequence of observed input features, O = [o_0, o_1, · · · , o_T], where T represents the number of samples of the interactive action over its duration. As depicted in Figure 6.2, the AACOGAN generator is a neural network designed to generate a camera drive data sequence from a given input sequence of observed features, utilizing an encoder-decoder GAN structure [116, 117]. The encoder, a function that processes a sequence of partial input features (A), produces a latent representation through feature extraction [118].

Some interactive actions, such as running or walking, involve the entire body and can impact the camera's overall path in the virtual environment. To address potential connections between various body parts and movements, in the proposed AACOGAN architecture (Figure 6.3) the encoder separates the input data into distinct body parts using specialized neural network layers known as Body Part Blocks (BPBs). These BPBs, trained to isolate and encode specific body regions, facilitate the learning of fine-grained representations for each region, establishing a deeper correlation between character and camera movements. This separation offers multiple benefits, including the effective capture of the engagement of various body parts during different actions and enabling the encoder to focus on specific segments of the input data. This is particularly useful for capturing the relevant body parts during intense or vigorous movements.

Subsequently, this collective feature representation undergoes a Linear Block (LB) operation, primarily designed to optimize the input data's structural compatibility with subsequent computations. This process derives a latent representation of the skeletal bones A as per the following equation:

Alatent = LB(BPB(Ahead), · · · , BPB(Afully)).    (6.7)

In particular, the latent representation of A retains the positions and rotations of the skeletal bones, encompassing crucial information for generating camera drive data. The remaining data decodes this latent representation at various stages and from distinct perspectives based on its type. First, AV and AD, which are derived from A, are processed through two independent LBs and concatenated, serving as supplementary information about the skeletal bones' position and rotation during movement. The positional data of the camera, including IniCAM and IniTheta, are processed separately through two independent LBs and concatenated. These features enhance the model's ability to establish a robust correlation between the initial camera position and the camera movements. Features regarding the positions of the character and interactive object(s), MPOS and TPOS, are involved in the decoding of the latent representation after processing through two other independent LBs.

Emo is a critical component of the proposed model, integrated with the other data by the generator through a single LB. This integration enhances the model's capacity to establish a strong relationship between camera movements and character emotions. Since Emo is one-dimensional and remains relatively stable within a single interactive action, it is incorporated as a coefficient on all intermediate outputs during the second fusion for decoding.
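A minimal PyTorch-style sketch of the BPB/LB encoding step in Eq. 6.7 is given below; the layer widths, region names, and class names are illustrative assumptions rather than the dissertation's actual implementation.

```python
import torch
import torch.nn as nn

class BodyPartBlock(nn.Module):
    """Encodes the bone data of one body region (a BPB); sizes are illustrative."""
    def __init__(self, in_dim, out_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

    def forward(self, x):
        return self.net(x)

class SkeletonEncoder(nn.Module):
    """Sketch of Eq. 6.7: region-wise BPBs whose outputs are fused by a Linear Block (LB)."""
    def __init__(self, region_dims, latent_dim=128):
        super().__init__()
        # region_dims: e.g. {"head": 12, "left_arm": 24, ...} -- region split is illustrative
        self.bpbs = nn.ModuleDict({name: BodyPartBlock(d) for name, d in region_dims.items()})
        self.lb = nn.Linear(32 * len(region_dims), latent_dim)

    def forward(self, regions):
        # Encode each body region separately, then fuse into one latent vector.
        encoded = [self.bpbs[name](x) for name, x in regions.items()]
        return self.lb(torch.cat(encoded, dim=-1))
```

Splitting the skeleton into region-specific blocks before a shared fusion layer is what allows the encoder to weight the body parts that matter most for a given action.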
The camera's movement is represented by data that continuously varies over time, requiring two Recurrent Linear Blocks (RLBs) in the final portion of the generator to decode the time-varying characteristics of the latent representation.

AACOGAN employs two discriminators, inspired by [119], to evaluate the performance of the generator from two distinct perspectives, thus evaluating and enhancing the generator's performance against multiple requirements. Accurately capturing subtle variations in the camera's position is vital to ensure the generator produces camera driving parameters closely resembling the ground truth. Given that minor differences in the shooting angle can significantly affect the overall outcome (example details are shown in the experiment about the aesthetic score in Section IV B), properly evaluating the generator's output is essential. Consequently, the pose discriminator evaluates individual data points, c_t, generated from the input features o_t. Conversely, the trajectory discriminator assesses the complete camera drive data, C, for an entire interactive action generated from O. This discriminator is pivotal in ensuring the overall coherence and realism of the generated camera movement. By evaluating the entire sequence of camera driving data, it can determine whether the generated camera movement adheres to a plausible and natural trajectory, rather than consisting of unrelated or jarring movements. This aspect is particularly important for maintaining immersion and ensuring a seamless user experience. As such, it is critical that the trajectory discriminator accurately assesses the quality of the generated camera movement, as inaccuracies could lead to unrealistic or incoherent camera movements.

6.3.3 Loss Functions and Algorithm

Our goal is to maximize q_ref while minimizing the network loss during training. Let θ represent the parameters of the camera drive data generator. Additionally, let ψ_p and ψ_t denote the parameters of the camera pose and trajectory discriminators, respectively. Then the objective function derived from Equation 6.2 for AACOGAN can be expressed as follows:

\max_{\theta} Q_{ref}(\cdot) = \min_{\theta} \max_{\psi_p, \psi_t} \; \omega_0 L_{dist} + \omega_1 L_{corr} + \omega_2 \tilde{L}_{aes} + \omega_3 L_p + \omega_4 L_t,    (6.8)

where ω_0, · · · , ω_4 are weight factors used to balance the loss terms, and the loss functions L_dist and L_corr correspond to the distance function Dis(·) and the correlation distance function Dcorr(·) for the AACOGAN generator model. They are employed to compute the distance and similarity between C and Ĉ. L̃_aes is a loss-function representation of Saes(·) that utilizes an aesthetic assessment model (AES) to evaluate the conformance of the results to general aesthetic standards. This loss function differs from the others in that it requires the resulting captured frames for evaluation, whereas the generator only produces camera drive data. As a result, it is utilized as a separate fine-tuning mechanism for the generator model after training. Following the completion of a training phase, the generator is used to generate a set of parameters C for each O in the training set. Each set of these C values is used to capture the interactive action in the virtual environment as a camera shot, resulting in a corresponding video clip. Each video clip can be represented as a sequence of images, and an AES is employed to evaluate these images.
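To make the training objective concrete, a hedged sketch of the generator-side combination in Eq. 6.8 is shown below; the weight values are placeholders rather than the dissertation's settings, and the aesthetic term is omitted because, as described above, it is applied as a separate fine-tuning pass.

```python
def generator_objective(l_dist, l_corr, l_pose_adv, l_traj_adv,
                        weights=(1.0, 1.0, 0.1, 0.1)):
    """Training-time part of Eq. 6.8: weighted sum of the distance term, the
    correlation term, and the two adversarial terms. The aesthetic term L_aes
    is handled in a separate fine-tuning stage. Weights are illustrative."""
    w0, w1, w3, w4 = weights
    return w0 * l_dist + w1 * l_corr + w3 * l_pose_adv + w4 * l_traj_adv
```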
Because the camera trajectories are pre-designed within the optimization space of a toric surface, as applied in our experiments, we can discern important insights from the analysis of the professional-director data we collected. In the majority of cases, the two most critical elements determining the quality of the shot and the camera trajectory are the starting and ending points. Therefore, the beginning and ending frames of the sequence are given more weight in the calculation of the loss value, as follows:

\tilde{L}_{aes} = \dfrac{1}{T} \sum_{t=1}^{T} \alpha_t \left( AES_{max} - AES(c_t) \right),    (6.9)

where AES_max is the maximum score of the employed AES and α_t is the weight factor for different frames over t. As the number of characters captured by C can only be calculated after the actual video generation, the loss function for the AACOGAN generator contains no term based on R(·).

The discriminator loss function has two parts. The pose discriminator loss function, L_p, is similar to L_dist in that it calculates the pose difference between C and Ĉ at each t. The trajectory discriminator loss function, L_t, is similar to L_corr in that it calculates the trajectory difference between C and Ĉ. The discriminator losses can be defined as follows:

L_p = \sum_{t=1}^{T} \left( -\mathbb{E}[\log D(c_t, \hat{c}_t)] - \mathbb{E}[\log(1 - D(G(o_t)))] \right),    (6.10)

L_t = -\mathbb{E}\left[ \log D(C, \hat{C}) \right] - \mathbb{E}\left[ \log(1 - D(G(O))) \right],    (6.11)

where D is the discriminator and G is the generator.

The pseudo-code for training AACOGAN is given in Algorithm 1. In the first and second loops, ψ_p and ψ_t are updated separately, and θ is updated in both loops. The third loop refines θ based on aesthetic aspects for fine-tuning.

Algorithm 1 Training Procedure of AACOGAN
Input: The extracted features defined in the pre-processing, On = [on1, on2, · · · , onT], and the corresponding ground truth camera drive data, Cn = [cn1, cn2, · · · , cnT], for n = 1, 2, · · · , N.
Output: Generator parameters θ and the two discriminator parameters ψp and ψt.
for epoch = 1 to max_epoch do
    for iterp = 1 to kp do
        Sample a mini-batch of per-frame input feature and camera drive data pairs {(ot, ct)} from the training set.
        Generate a single camera drive data point ĉt.
        Calculate Lkp = Ldist + Lcorr + Lp.
        Update ψp and θ.
    end for
    for itert = 1 to kt do
        Sample a mini-batch of input features and camera drive data for entire interactive actions {(O, C)} from the training set.
        Generate the camera drive data for the entire interactive action, Ĉ.
        Calculate Lkt = Ldist + Lcorr + Lt.
        Update ψt and θ.
    end for
end for
for itert = 1 to kt do
    Sample a mini-batch of input feature and aesthetic score pairs for each interactive action.
    Calculate L̃aes.
    Update θ.
end for

Because precise camera parameters are generated for each frame and noise is present during data generation, the resulting camera drive data may exhibit minor fluctuations between frames. These continuous, randomly oriented fluctuations can disrupt image continuity and reduce user immersion. To address this issue, our system's postprocessing applies smoothing to the final output data to create a more continuous curve by minimizing these fluctuations. The smoothing process, which aims to refine the data points and generate a smoother curve in a 2D coordinate system, is achieved using a Savitzky-Golay filter [120]. To ensure that the smoothed result accurately represents the underlying data, a window size of 5 is employed for the filter.
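A minimal SciPy sketch of this postprocessing step is shown below; it also anticipates the polynomial-degree selection described in the next paragraph. Note that scipy.signal.savgol_filter requires the polynomial order to be smaller than the window length, so with a window of 5 only degrees up to 4 can actually be evaluated.

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_camera_curve(raw: np.ndarray, window: int = 5, degrees=(2, 3, 4, 5)) -> np.ndarray:
    """Savitzky-Golay smoothing of one camera parameter curve: each candidate
    polynomial degree is tried, and the smoothed curve with the lowest MSE
    against the raw data is kept (cf. Eq. 6.12)."""
    best, best_mse = raw, np.inf
    for p in degrees:
        if p >= window:          # savgol_filter requires polyorder < window_length
            continue
        candidate = savgol_filter(raw, window_length=window, polyorder=p)
        mse = float(np.mean((candidate - raw) ** 2))
        if mse < best_mse:
            best, best_mse = candidate, mse
    return best
```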
The outcome is then evaluated using four polynomials of varying degree, ranging from 2 to 5. The smoothed result that most closely resembles the original data points is selected, thus preserving the data's curvature while reducing potential information loss due to the smoothing process. This step can be represented as follows:

C_{smooth} = \min_{p=2}^{5} \mathrm{MSE}\{ \mathrm{savitzkygolay}(G(O), 5, p),\, G(O) \}.    (6.12)

6.4 Experiment

In this section, the evaluation of the proposed AACOGAN is carried out utilizing both objective and subjective metrics. To the best of our knowledge, there is no prior study on automatic cinematography in open-world scenarios, so comparisons are drawn between our results and conventional automatic cinematography techniques [94, 101] commonly employed in general games and films. These two reference works have tackled the problem of automatic cinematography using different approaches. Yu2022 [94] utilizes a rule-based language to train a Recurrent Neural Network (RNN) that captures the essence of cinematography and can be applied to new scenes, while Wu2023 [101] focuses on individual characters and also uses a GAN-based model to generate camera movement according to the characters' emotions and actions. More example video footage from the experiments can be found at https://youtu.be/vhvgvE-DU2Y.

6.4.1 MineStory Dataset

Given the complexity and dynamism of user interactions within open-world environments, we developed a novel interactive action dataset named the MineStory dataset. This dataset, created using motion capture techniques, trains AACOGAN to adapt to the diverse and ever-evolving nature of user actions in such virtual environments. The MineStory dataset encompasses a total of 546 distinct actions, including most actions typically used in animation production and some actions specially tailored for our product. Each action is captured by a standard protocol that involves 25 joint nodes distributed throughout the character's entire body. The positional and rotational data of each node is captured at a rate of 30 frames per second, yielding a comprehensive skeletal animation.

To train AACOGAN, we need a set of camera trajectory data for each action, as directed by a human operator in an open-world environment. We employ the toric surface method, as introduced in [74], to simplify the process of generating this data. This method conceptualizes the toric surface as a two-dimensional plane for creating camera movement trajectories, which can then be projected into a three-dimensional virtual space. This approach allows for the efficient generation of extensive camera trajectory data with a limited number of human operators. In our experiments, we generated 2 to 8 distinct sets of camera trajectory data for each interactive action. This data generation employed a randomized combination of parameters tailored to the unique aspects of the interactive activities, such as the characters' emotional states and the distance of interaction.

6.4.2 Objective Numerical Comparison

The comparisons between AACOGAN and the baseline models have been performed with regard to several important aspects of automatic cinematography technology in open-world scenarios.
Figure 6.4: The accumulated difference between the parameters of the generated camera positions and the ground truth camera positions, calculated over time, with the error distance expressed in unit distance.

Precise Difference in Camera Position and Rotation
The camera's location and orientation in the environment are specified by a set of six parameters, three of which pertain to position and three to rotation. This metric directly compares the difference between the generated camera position and rotation parameters and the ground truth for each approach. Figure 6.4 and Figure 6.5 present, over time, the sum of the differences along the x, y, and z axes between the generated camera parameters and the ground truth, for the position parameters and the rotation parameters, respectively. Compared to the other methods, AACOGAN exhibits the least deviation in position from the ground truth. The results show a reduction in the average positional error of 1.56 units (93.6%) and an average reduction in the rotational error of 1.11 radians (55%), where the 'unit' for distance measurement is the unit distance for object positioning in the virtual environment.

Figure 6.5: The accumulated differences between the generated camera parameters and the ground truth over time for all the rotation parameters, with the error distance expressed in radians.

Figure 6.6 offers a comprehensive depiction of the accumulated distance error for the position parameters along the x, y, and z axes. The upper section of Figure 6.6 displays the accumulated difference between the generated camera parameters and the ground truth over time, for each axis individually. AACOGAN exhibits the smallest accumulated discrepancies across all scenarios in comparison to the other methods. It is crucial to recognize that mere similarity in camera parameter values does not guarantee similarity in camera movement. The camera's trajectory in three-dimensional space throughout the shot is of utmost significance. Consequently, we extended our evaluation by comparing the variation curves of the generated camera parameters for each method along each axis over time. The bottom section of Figure 6.6 demonstrates the variation of the camera rotation parameters over time, with the visual similarity between the AACOGAN-generated camera parameters and the ground truth parameters being more distinct than for the two baseline models.

Figure 6.6: Top: the accumulated differences between the camera parameters generated by different methods and the ground truth ones, displayed over time, with the parameters for the x, y, and z axes shown individually. Bottom: the exact values of the parameters generated by different approaches for each axis at each time point.

Figure 6.7 displays a visual comparison of the camera shots generated by the different methods in terms of shooting position and angle. When contrasted with the ground truth camera shots, the AACOGAN-generated frames exhibit a higher degree of similarity, corroborating the numerical results. In summary, the camera movement generated by the AACOGAN model more closely approximates the ground truth, indicating that our method is more adept at learning the film director's lens language in open-world scenarios.
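For reference, accumulated-error curves of the kind plotted in Figures 6.4-6.6 can be produced with a few lines of NumPy; the exact per-frame error definition used for the plots is not spelled out in the text, so the absolute-difference version below is an assumption.

```python
import numpy as np

def accumulated_parameter_error(generated: np.ndarray, ground_truth: np.ndarray) -> np.ndarray:
    """Accumulated per-frame error between generated and ground-truth camera
    parameters (position or rotation), summed over the x, y, and z axes.
    Both arrays have shape [num_frames, 3]; the result has shape [num_frames]."""
    per_frame = np.abs(generated - ground_truth).sum(axis=1)
    return np.cumsum(per_frame)
```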
Correlation of Trajectories
The correlation distance metric is employed to calculate the similarity between the generated camera motion and the actual motion of the subjects during interactive actions. As previously mentioned in [98, 99], when camera movement closely mirrors the subjects' movements, it can enhance the sense of immersion for the audience. The significance of body parts in real action is often gauged by their range of motion and velocity.

Figure 6.7: Comparison of the frames captured by the virtual cameras generated by different methods. From a visual inspection of these frames, we can intuitively observe the differences between the generated camera shots and the ground truth in terms of position and orientation.

Figure 6.8: Interactive action comparisons of the change curves of the camera position parameters generated by different methods over a certain period with the corresponding skeletal animation motion curves of the interactive action. (a) Change curves of the x-axis parameters. (b) Change curves of the y-axis parameters. (c) Change curves of the z-axis parameters.

In this experiment, the joint exhibiting the largest range of motion and the fastest movement was selected as a reference to assess the correlation between subject and camera movements. Table 6.2 presents the similarity between the various camera parameter curves generated by the different automatic cinematography methods and the ground truth, quantified using correlation distance. The camera movement generated by AACOGAN is more similar to the actual subject movement than that of the other methods, especially near the peaks of subject movement. The similarity between the camera trajectory generated by AACOGAN and the manually created camera exhibits the lowest discrepancy at 27%, while the other methods display a minimum discrepancy of 95.3%. Compared to the two reference methods, the camera trajectory produced by AACOGAN demonstrates a decrease of 0.78 (73%) in the average correlation distance. This synchronized movement between the camera and the action can provide a superior experience for the user.

Table 6.2: Correlation distance between the various generated camera parameters and the actual skeleton animation movement along different axes for a single action

                  Wu2023   Yu2022   Manual   AACOGAN
X axis Dcorr       0.84     0.95     0.43     0.55
Y axis Dcorr       1.25     1.18     0.005    0.066
Z axis Dcorr       1.78     1.1      0.17     0.25
Average Dcorr      1.29     1.07     0.2      0.288

Figures 6.8a, 6.8b, and 6.8c illustrate the comparison between the camera movement trajectories generated by the different methods and the skeletal animation motion trajectory for the x-, y-, and z-axes in the experiment environment. The y-axis in these figures represents the position of the character's skeletal joint (right y-axis) or the camera position (left y-axis) relative to the previous time step, for each time step.
It can be observed that the trajectory generated by AACOGAN has a higher similarity to the motion trajectory. The corresponding example video footage can be found at https://youtu.be/M24bHDvDnqk.

Multi-focus
In open-world settings, interactive actions may not be solely centered on specific characters or objects. In such instances, the lens language employed by automatic cinematography technology should encompass as many related objects as feasible. This metric gauges the proportion of time during which the camera movements generated by the various methods successfully capture the specified objects in multi-focus scenes.

Figure 6.9: Comparison of the r value for different methods under varying numbers of individuals and scenes.

Higher proportions generally imply superior performance and user experience. A comparison of the capture ratio for the designated characters or objects during interactive actions between AACOGAN and the alternative methods is depicted in Figure 6.9. The results reveal that AACOGAN captures more of the relevant characters over extended durations, thereby augmenting users' overall viewing experience. In comparison to the other reference methods, AACOGAN demonstrates an average improvement of 22% and up to 32.9% in multi-focus scene image capturing, contingent upon content.

Figure 6.10 displays frames captured by the distinct automatic cinematography techniques within a given multi-person interactive dance scene. Observing the images, the lens language generated by AACOGAN captures more comprehensive character images within the scene. The corresponding example video footage can be accessed at https://youtu.be/3KImvj9wabg.

Figure 6.10: Frames captured by different automatic cinematography techniques in a dance scene.

Aesthetic Score
The AES [111, 113] offers an objective evaluation of images and assigns scores based on their aesthetic quality, with a higher score indicating a higher level of aesthetic appeal. This type of model is widely utilized in the realm of image and video content generation and provides valuable insights through artistic analysis of the output generated by these technologies. To the best of our knowledge, AACOGAN is the first work to apply this type of model to shot generation in automatic cinematography. By incorporating aesthetic scores as part of the input for fine-tuning the generator of AACOGAN, the resulting shots have higher aesthetic ratings than those of the original model without the aesthetic-related adjustments.

In our experiments, employing the AES led to a 9% average increase in the aesthetic score of images captured by AACOGAN. For actions like the over-wall jump illustrated in Figure 6.11, the fine-tuned camera shot on the right captures richer lighting, background, and environmental information compared to the left-side perspective. This result is achieved by slightly lowering the camera position and raising the shooting angle.

Figure 6.11: Using aesthetic scores as part of the input for the generator of AACOGAN resulted in improved visuals, as seen in the comparison between the original camera shooting direction (left) and the fine-tuned camera shooting direction (right) after incorporating aesthetic considerations.
This distinction can be ascribed to the use of aesthetic scores in the fine-tuning process of AACOGAN's generator, which creates a preference for content-rich camera angles over monotonous ones, such as those oriented towards the sky or the ground; aesthetic assessment models typically favor images with more content. The corresponding example video footage can be accessed at https://youtu.be/ je Gg QQG0.

The benefit of this additional aesthetic fine-tuning for the generator is twofold. Firstly, it enhances the aesthetic quality of the automatically generated camera movement. Secondly, it preserves the quality of the shots produced by AACOGAN. This results in shots that better align with the preferences of a broader audience, without compromising the lens language employed.

Real-time Performance
In Table 6.3, we present a detailed comparison of the performance of the AACOGAN model at various latency levels ranging from 5 to 30 frames. For each latency level, the table provides the frames per second (FPS) and floating point operations per second (FLOPs) metrics. We use the average Dcorr to represent the quality of the generated camera positions. The data clearly demonstrate that an increase in latency, which effectively allows the model to leverage more frame data, improves the model's predictive capability for the subsequent camera position. However, this is achieved at the expense of increased computational complexity, as shown by the higher FLOPs. Thus, there is a trade-off between real-time responsiveness and the quality of camera position prediction, necessitating careful tuning according to the specific demands of the gaming environment.

Table 6.3: Real-time performance metrics for the AACOGAN model with different input latency. The number in parentheses after the model name indicates the number of delayed frames. The FLOPs unit is in millions of flops per second

Model          FPS      FLOPs (M)   Latency   Average Dcorr
AACOGAN(5)     78.9      64.33       0.012        0.393
AACOGAN(10)    39.63    113.93       0.025        0.356
AACOGAN(20)    20.79    217.97       0.048        0.307
AACOGAN(30)    14.2     317.28       0.071        0.288

6.4.3 Subjective Evaluation

Emotional Reinforcement
The lens language also plays an essential role in expressing the atmosphere and the emotions of characters. The emotion of a character can be analyzed through the user's facial expressions [121], speech [122], or body movements [123]. In different scenarios or emotional states of the characters, the camera movement should reflect the characters' emotions by varying to a greater or lesser degree.

Figure 6.12: Comparison of the change curves of the y-axis position parameter of the camera trajectory generated for different emotional states.

This metric assesses the influence of a character's emotional state on the camera trajectory, where stronger emotions should yield a more significant impact on camera movement. The comparison of the effect of the character's emotional state on the camera movements generated for identical interactive actions is depicted in Figure 6.12. The emotion value denotes the stability of the character's emotional state; a value closer to 1 indicates an unstable or extremely unstable emotion (such as anger), while a value closer to 0 signifies a calmer state. The y-axis in Figure 6.12 represents the character's skeletal joint position (right y-axis) or the camera position (left y-axis) relative to the previous time step, for each time step. As the character's emotional state intensifies, a more pronounced impact on camera movement is evident.
Consequently, the camera movement generated by AACOGAN more accurately reflects the characters' actual emotions and creates a superior atmosphere compared to the reference methods. The corresponding example video footage can be found at https://youtu.be/HfmotyfEWHw.

Human Evaluation
It is crucial to acknowledge that the previously mentioned evaluation methods do not entirely capture the superiority of AACOGAN's generated camera shots in open-world contexts, compared to other automatic cinematography techniques, in terms of actual user experience. Consequently, enlisting real users to assess the generated shots is indispensable. In addition to a general ranking-based human evaluation, participants were asked to appraise the shots in four distinct aspects: 1) shot quality (Frames Quality), 2) shot consistency with the actual interaction (Consistency), 3) representation of the characters and objects involved in the interaction (Completeness), and 4) the enhancement of the emotions conveyed by the characters within the shot (Emotional Enhancement).

Table 6.4: Results of the subjective evaluation. The scores for the four aspects of shot quality are on a scale of 1-5, with 5 being the best. The final row shows the ranking of the four shots created by the different methods, with a lower score indicating a better ranking

                   Wu2023   Yu2022   GroundTruth   AACOGAN
Frames Quality      4        4         4.67          4.5
Consistency         3.16     3         4.5           4.67
Completeness        2.3      2.16      3.83          3.83
Emotional           3.83     2         4.6           4.16
Overall Ranking     3.16     3.67      1.33          1.83

As shown in Table 6.4, the results of the subjective evaluation indicate that, compared to the baseline methods, AACOGAN received the highest ranking in the sorting task. In comparison to the other methods across the various aspects, AACOGAN also received higher evaluations.

In summary, our experimental outcomes indicate that AACOGAN effectively generates camera parameters and produces shots more akin to those captured by human operators during interactive actions. This is demonstrated by comparing the AACOGAN-generated parameters to the baseline methods using metrics such as camera pose similarity, motion similarity, and character and object coverage. Moreover, aesthetic assessments and subjective evaluations conducted by human participants corroborate AACOGAN's superiority in creating visually appealing and interaction-aligned shots. Consequently, these findings suggest the potential of the AACOGAN method to enhance the quality of automatic cinematography in open-world interactive scenarios.

6.5 Conclusions

The advent of multimedia technology in the entertainment industry has bestowed upon consumers an unprecedented level of autonomy in their media consumption experiences. Within virtual open-world environments, automatic cinematography has emerged as an instrumental factor in delivering immersive experiences, catering to the users' growing demand for more engaging forms of interaction. In this study, we introduce AACOGAN, a technique for automatic cinematography designed for user-initiated interactions within open-world scenarios, leveraging GANs. This model incorporates various elements such as the skeletal animations of interactive actions, the positional relationships between interactive objects and characters, and character emotions, facilitating the effective generation of suitable camera movements for a wide array of interactive actions. Moreover, the integration of aesthetic scores into the generator's training process significantly enhances the quality of the generated shots.
The results of our experiments substantiate the efficiency of the AACOGAN approach in producing camera shots that rival the quality of human-generated shots, demanding minimal input, thus leading to a more engaging user experience and substantially reducing costs.

Chapter 7

Conclusion and Future Work

7.1 Conclusion

In this dissertation, we have articulated a comprehensive framework for automated animation production, offering an in-depth exploration of two pivotal technologies—lip-sync speech animation and auto cinematography—that constitute this framework.

Initially, we delineated the foundational modules embedded within the automatic animation production framework and elaborated upon their respective functionalities. Although human intervention is still required in the transitional journey from script to animation, our framework has notably ameliorated time consumption and expertise prerequisites. Each module integrates technologies from disparate domains; given the future maturation of these technologies, our framework holds the potential for completely autonomous animation production.

In addressing the challenges of lip-sync speech animation, we propose RealPRNet, an innovative deep-learning-based real-time phoneme recognition network. By leveraging spatial and temporal patterns in raw audio data and incorporating a long short-term memory (LSTM) stack block, RealPRNet facilitates competitive real-time speech animation within the Stage Performance module of our framework. Empirical evaluations confirm that RealPRNet outperforms extant algorithms, registering a 20% improvement in Phoneme Error Rate (PER) and a 4% enhancement in Block Error Distance (BDE) in optimal cases.

Subsequently, our research delves into auto cinematography, presenting two distinct strategies: a rule-based approach (T2A) and a method mimicking human lens language behavior (RT2A). T2A efficiently expedites the script-to-video creation process by capitalizing on advancements in computational cinematography and video understanding. Utilizing fidelity and aesthetic models in tandem, we have formulated camera placement in 3D environments as an optimization problem, solvable through dynamic programming. Our experimental data substantiates that T2A can reduce manual animation production efforts by approximately 74% and enhance the perceptual quality of output videos by up to 35%. Alternatively, RT2A records directorial decisions regarding camera settings for future training of auto cinematography agents. A well-engineered reward function aids the algorithm in mimicking the human director's decision-making, resulting in significant gains in camera placement acceptance rate and the rhythm of camera switching.

Moreover, we have adapted our auto cinematography techniques to comply with the evolving demands of the modern entertainment sector, where users increasingly seek to create unique animation experiences in open-world scenarios. To this end, we introduce AACOGAN, a novel Generative Adversarial Network (GAN)-based methodology designed for these user-driven interactive experiences. This innovative model synthesizes various elements, including skeletal animations, spatial relationships among interactive entities, and expressive character emotions, thereby systematizing the generation of suitable camera movements across a wide range of interactive sequences. Aesthetic metrics are also integrated during the training phase to elevate the quality of the resulting shots.
Experimental validation confirms AACOGAN's efficacy, with generated camera angles and movements rivaling those crafted by human professionals.

In summary, the framework and technologies explored in this dissertation endeavor to democratize animation production, significantly reducing both the time and expertise required. Notably, our innovations in automatic cinematography allow for the archival of current directors' lens language patterns, thereby preserving their unique cinematic contributions for future AI-driven projects.

7.2 Future Work

The quest for full automation in animation production remains a focal point of ongoing research. While each module in our presented framework has already demonstrated substantial utility, there exists a compelling scope for further advancements as technological landscapes evolve. Drawing upon our comprehensive analyses in the realms of lip-sync speech animation and auto cinematography, we propose several promising avenues for future investigations:

1. Influence of Emotional Context on Viseme Dynamics - Emotional states have a non-negligible impact on phoneme articulation. Notably, even identical phonemes can manifest divergent visemes under varying emotional conditions. This observation underscores the importance of incorporating emotional contexts into viseme animation algorithms to enhance the realism and emotional expressiveness of animated characters.

2. Adaptive Auto Cinematography - Much like the impossibility of a single directorial vision universally appealing to all audiences, there exists no monolithic cinematographic language that can satiate diverse viewer preferences. As auto cinematography technologies mature, an intriguing direction for future research lies in the development of adaptive, or personalized, auto cinematography. Here, identical scripts could be rendered through divergent cinematographic lenses, tailored to individual viewer predilections, thereby enhancing audience engagement and re-watchability.

3. Unconstrained Camera Placement in Auto Cinematography - Existing auto cinematography solutions often circumscribe the spectrum of potential camera placements and orientations due to computational limitations. For example, the utilization of toric surfaces in AACOGAN serves as a case in point. While these limitations expedite computational processes and uphold a baseline quality of results, they also stifle the creative potential for novel and inventive camera placements. Thus, devising techniques that transcend these restrictive boundaries, thereby allowing cameras to operate within a 'free space', stands as a fruitful line of inquiry.

BIBLIOGRAPHY

[1] S. Taylor, T. Kim, Y. Yue, M. Mahler, J. Krahe, A. G. Rodriguez, J. Hodgins, and I. Matthews, "A deep learning approach for generalized speech animation," ACM Transactions on Graphics (TOG), vol. 36, no. 4, p. 93, 2017.

[2] Y. Zhou, Z. Xu, C. Landreth, E. Kalogerakis, S. Maji, and K. Singh, "Visemenet: Audio-driven animator-centric speech animation," ACM Transactions on Graphics (TOG), vol. 37, no. 4, p. 161, 2018.

[3] H. Subramonyam, W. Li, E. Adar, and M. Dontcheva, "Taketoons: Script-driven performance animation," in Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology, 2018, pp. 663–674.

[4] Z. Yu, E. Guo, H. Wang, and J. Ren, "Bridging script and animation utilizing — a new automatic cinematography model," in Submitted to MIPR 2022.
[5] L.Sun and H.Wang, “Director-hint based auto-cinematography,” in US Patent 11,120,638, 2021. [6] F. Tao and C. Busso, “End-to-end audiovisual speech recognition system with multitask learning,” IEEE Transactions on Multimedia, 2020. [7] M. Mehrabani, S. Bangalore, and B. Stern, “Personalized speech recognition for internet IEEE, of things,” in 2015 IEEE 2nd World Forum on Internet of Things (WF-IoT). 2015, pp. 369–374. [8] M. Dawodi, J. A. Baktash, T. Wada, N. Alam, and M. Z. Joya, “Dari speech classification using deep convolutional neural network,” in 2020 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS). IEEE, 2020, pp. 1–4. [9] M. La Mura and P. Lamberti, “Human-machine interaction personalization: a review on gender and emotion recognition through speech analysis,” in 2020 IEEE International Workshop on Metrology for Industry 4.0 & IoT. IEEE, 2020, pp. 319–323. [10] G. Llorach, A. Evans, J. Blat, G. Grimm, and V. Hohmann, “Web-based live speech- driven lip-sync,” in 2016 8th International Conference on Games and Virtual Worlds for Serious Applications (VS-GAMES). IEEE, 2016, pp. 1–4. [11] Y. Xu, A. W. Feng, S. Marsella, and A. Shapiro, “A practical and configurable lip sync method for games,” in Proceedings of Motion on Games. ACM, 2013, pp. 131–140. [12] P. Edwards, C. Landreth, E. Fiume, and K. Singh, “Jali: an animator-centric viseme model for expressive lip synchronization,” ACM Transactions on Graphics (TOG), vol. 35, no. 4, p. 127, 2016. 138 [13] C. G. Fisher, “Confusions among visually perceived consonants,” Journal of speech and hearing research, vol. 11, no. 4, pp. 796–804, 1968. [14] J. Michalek and J. Vanˇek, “A survey of recent dnn architectures on the timit phone recognition task,” in International Conference on Text, Speech, and Dialogue. Springer, 2018, pp. 436–444. [15] S. Kapadia, V. Valtchev, and S. J. Young, “Mmi training for continuous phoneme recognition on the timit database,” in 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2. IEEE, 1993, pp. 491–494. [16] I. Bromberg, Q. Qian, J. Hou, J. Li, C. Ma, B. Matthews, A. Moreno-Daniel, J. Morris, S. M. Siniscalchi, Y. Tsao et al., “Detection-based asr in the automatic speech attribute transcription project,” in Eighth Annual Conference of the International Speech Commu- nication Association, 2007. [17] J. Morris and E. Fosler-Lussier, “Combining phonetic attributes using conditional ran- dom fields,” in Ninth International Conference on Spoken Language Processing, 2006. [18] J. Park and H. Ko, “Real-time continuous phoneme recognition system using class- dependent tied-mixture hmm with hbt structure for speech-driven lip-sync,” IEEE Trans- actions on Multimedia, vol. 10, no. 7, pp. 1299–1306, 2008. [19] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” IEEE transactions on acoustics, speech, and signal processing, vol. 37, no. 3, pp. 328–339, 1989. [20] J. Kim, K. Hwang, and W. Sung, “X1000 real-time phoneme recognition vlsi using feed-forward deep neural networks,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 7510–7514. [21] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 2013, pp. 6645–6649. [22] T. N. Sainath, O. Vinyals, A. Senior, and H. 
Sak, “Convolutional, long short-term memory, fully connected deep neural networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4580–4584. [23] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal forced aligner: Trainable text-speech alignment using kaldi.” in Interspeech, 2017, pp. 498–502. [24] P. L. Jackson, “The theoretical minimal unit for visual speech perception: Visemes and coarticulation.” The Volta Review, 1988. 139 [25] H. L. Bear and R. Harvey, “Phoneme-to-viseme mappings: the good, the bad, and the ugly,” Speech Communication, vol. 95, pp. 40–67, 2017. [26] C. Bregler, M. Covell, and M. Slaney, “Video rewrite: driving visual speech with audio.” in Siggraph, vol. 97, 1997, pp. 353–360. [27] E. Bozkurt, C. E. Erdem, E. Erzin, T. Erdem, and M. Ozkan, “Comparison of phoneme and viseme based acoustic units for speech driven realistic lip animation,” in 2007 3DTV Conference. IEEE, 2007, pp. 1–4. [28] P. Waddell, G. Jones, and A. Goldberg, “Audio/video synchronization standards and solutions a status report,” Advanced Television Systems Committee, vol. 21, 1998. [29] A. C. Younkin and P. J. Corriveau, “Determining the amount of audio-video synchro- nization errors perceptible to the average end-user,” IEEE Transactions on Broadcasting, vol. 54, no. 3, pp. 623–627, 2008. [30] A. Graves, N. Jaitly, and A.-r. Mohamed, “Hybrid speech recognition with deep bidirec- tional lstm,” in 2013 IEEE workshop on automatic speech recognition and understanding. IEEE, 2013, pp. 273–278. [31] Y. Fan, Y. Qian, F.-L. Xie, and F. K. Soong, “Tts synthesis with bidirectional lstm based recurrent neural networks,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014. [32] A. Pearce, B. Wyvill, G. Wyvill, and D. Hill, “Speech and expression: A computer solution to face animation,” in Graphics Interface, vol. 86, 1986, pp. 136–140. [33] S. A. King and R. E. Parent, “Creating speech-synchronized animation,” IEEE Trans- actions on visualization and computer graphics, vol. 11, no. 3, pp. 341–352, 2005. [34] K. Chen and Q. Huo, “Training deep bidirectional lstm acoustic model for lvcsr by a context-sensitive-chunk bptt approach,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 7, pp. 1185–1193, 2016. [35] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolu- tional neural networks for speech recognition,” IEEE/ACM Transactions on audio, speech, and language processing, vol. 22, no. 10, pp. 1533–1545, 2014. [36] S. Zhang, S. Zhang, T. Huang, and W. Gao, “Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching,” IEEE Trans- actions on Multimedia, vol. 20, no. 6, pp. 1576–1590, 2017. 140 [37] P. Zhan and A. Waibel, “Vocal tract length normalization for large vocabulary continu- ous speech recognition,” Carnegie-Mellon Univ, School of Computer Science, Tech. Rep., 1997. [38] M. J. Gales, “Maximum likelihood linear transformations for hmm-based speech recog- nition,” Computer speech & language, vol. 12, no. 2, pp. 75–98, 1998. [39] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997. [40] T. N. Sainath, B. Kingsbury, A.-r. Mohamed, G. E. Dahl, G. Saon, H. Soltau, T. Be- ran, A. Y. Aravkin, and B. 
Ramabhadran, “Improvements to deep convolutional neural networks for lvcsr,” in 2013 IEEE Workshop on Automatic Speech Recognition and Un- derstanding. IEEE, 2013, pp. 315–320. [41] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Fifteenth annual conference of the international speech communication association, 2014. [42] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The kaldi speech recognition toolkit,” in IEEE IEEE 2011 workshop on automatic speech recognition and understanding, no. CONF. Signal Processing Society, 2011. [43] H.Wang, “Write-a-movie: Unifying writing and shooting,” in US Patent filed Oct., 2020. [44] Y. Niu and F. Liu, “What makes a professional video? a computational aesthetics approach,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 7, pp. 1037–1049, 2012. [45] A. Truong, F. Berthouzoz, W. Li, and M. Agrawala, “Quickcut: An interactive tool for editing narrated video,” in Proc. 29th Annual Symposium on User Interface Software and Technology, 2016, pp. 497–507. [46] X. Yang, T. Zhang, and C. Xu, “Text2video: An end-to-end learning framework for expressing text with videos,” IEEE Transactions on Multimedia, vol. 20, no. 9, pp. 2360– 2370, 2018. [47] M. Wang, G.-W. Yang, S.-M. Hu, S.-T. Yau, and A. Shamir, “Write-a-video: computa- tional video montage from themed text.” ACM Trans. Graph., vol. 38, no. 6, pp. 177–1, 2019. [48] Q. Galvane, “Automatic cinematography and editing in virtual environments.” Ph.D. dissertation, Universit´e Grenoble Alpes (ComUE), 2015. 141 [49] A. Louarn, M. Christie, and F. Lamarche, “Automated staging for virtual cinematogra- phy,” in Proceedings of the 11th Annual International Conference on Motion, Interaction, and Games, 2018, pp. 1–10. [50] I. Arev, H. S. Park, Y. Sheikh, J. Hodgins, and A. Shamir, “Automatic editing of footage from multiple social cameras,” ACM Transactions on Graphics (TOG), vol. 33, no. 4, pp. 1–11, 2014. [51] H. Jiang, B. Wang, X. Wang, M. Christie, and B. Chen, “Example-driven virtual cin- ematography by learning camera behaviors,” ACM Transactions on Graphics (TOG), vol. 39, no. 4, pp. 45–1, 2020. [52] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen, “Video captioning with attention- based lstm and semantic consistency,” IEEE Transactions on Multimedia, vol. 19, no. 9, pp. 2045–2055, 2017. [53] M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. S. Zettlemoyer, “Allennlp: A deep semantic natural language pro- cessing platform,” 2017. [54] S. Kim, K. Cha, M. Kim, and E. Lee, “3d animation using visual script language,” in The Fifth International Workshop on Distributed Multmedia Systems, Taiwan. Citeseer, 1998, pp. 109–113. [55] D. C. Petriu, X. L. Yang, and T. E. Whalen, “Behavior-based script language for anthropomorphic avatar animation in virtual environments,” in 2002 IEEE International Symposium on Virtual and Intelligent Measurement Systems (IEEE Cat. No. 02EX545). IEEE, 2002, pp. 105–110. [56] C. Liang, C. Xu, J. Cheng, W. Min, and H. Lu, “Script-to-movie: a computational framework for story movie composition,” IEEE transactions on multimedia, vol. 15, no. 2, pp. 401–414, 2012. [57] M. Hayashi, S. Inoue, M. Douke, N. Hamaguchi, H. Kaneko, S. Bachelder, and M. 
Naka- jima, “T2v: New technology of converting text to cg animation,” ITE Transactions on Media Technology and Applications, vol. 2, no. 1, pp. 74–81, 2014. [58] K. Lee, L. He, M. Lewis, and L. Zettlemoyer, “End-to-end neural coreference resolution,” arXiv preprint arXiv:1707.07045, 2017. [59] M. Leake, A. Davis, A. Truong, and M. Agrawala, “Computational video editing for dialogue-driven scenes.” ACM Trans. Graph., vol. 36, no. 4, pp. 130–1, 2017. 142 [60] N. Joshi, W. Kienzle, M. Toelle, M. Uyttendaele, and M. F. Cohen, “Real-time hy- perlapse creation via optimal frame selection,” ACM Transactions on Graphics (TOG), vol. 34, no. 4, pp. 1–9, 2015. [61] G. Abdollahian, C. M. Taskiran, Z. Pizlo, and E. J. Delp, “Camera motion-based anal- ysis of user generated video,” IEEE Transactions on Multimedia, vol. 12, no. 1, pp. 28–41, 2009. [62] M. Gschwindt, E. Camci, R. Bonatti, W. Wang, E. Kayacan, and S. Scherer, “Can a robot become a movie director? learning artistic principles for aerial cinematography,” arXiv preprint arXiv:1904.02579, 2019. [63] J. Wang, M. Xu, L. Jiang, and Y. Song, “Attention-based deep reinforcement learning for virtual cinematography of 360° videos,” IEEE Transactions on Multimedia, 2020. [64] M. Radut, M. Evans, K. To, T. Nooney, and G. Phillipson, “How good is good enough? the challenge of evaluating subjective quality of ai-edited video coverage of live events.” in WICED@ EG/EuroVis, 2020, pp. 17–24. [65] S. Zhao, Y. Liu, Y. Han, R. Hong, Q. Hu, and Q. Tian, “Pooling the convolutional layers in deep convnets for video action recognition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 8, pp. 1839–1849, 2017. [66] P. Wang, Y. Cao, C. Shen, L. Liu, and H. T. Shen, “Temporal pyramid pooling-based convolutional neural network for action recognition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 12, pp. 2613–2622, 2016. [67] H. Wu, X. Ma, and Y. Li, “Spatiotemporal multimodal learning with 3d cnns for video action recognition,” IEEE Transactions on Circuits and Systems for Video Technology, 2021. [68] T. V. Nguyen, Z. Song, and S. Yan, “Stap: Spatial-temporal attention-aware pooling for action recognition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 1, pp. 77–86, 2014. [69] B. Wang, L. Ma, W. Zhang, and W. Liu, “Reconstruction network for video captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7622–7631. [70] S. Chen, Q. Jin, J. Chen, and A. G. Hauptmann, “Generating video descriptions with latent topic guidance,” IEEE Transactions on Multimedia, vol. 21, no. 9, pp. 2407–2418, 2019. [71] W. Xu, J. Yu, Z. Miao, L. Wan, Y. Tian, and Q. Ji, “Deep reinforcement polishing network for video captioning,” IEEE Transactions on Multimedia, 2020. 143 [72] D. Shao, Y. Zhao, B. Dai, and D. Lin, “Finegym: A hierarchical video dataset for fine- grained action understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2616–2625. [73] C. Yang, Y. Xu, J. Shi, B. Dai, and B. Zhou, “Temporal pyramid network for ac- tion recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 591–600. [74] C. Lino and M. Christie, “Efficient composition for virtual camera control,” 2012. [75] R. Thompson and C. Bowen, Grammar of the Shot. Taylor & Francis, 2009. [76] J. E. Cutting, J. E. DeLong, and C. E. 
Nothelfer, “Attention and the evolution of hollywood film,” Psychological science, vol. 21, no. 3, pp. 432–439, 2010. [77] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word repre- sentation,” in Proc. 2014 conf. EMNLP, 2014, pp. 1532–1543. [78] J. K. Haas, “A history of the unity game engine,” 2014. [79] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recog- nition and description,” in Proc. IEEE conf. compu. vision and pattern recog., 2015, pp. 2625–2634. [80] J. Xu, T. Mei, T. Yao, and Y. Rui, “Msr-vtt: A large video description dataset for bridg- ing video and language,” in Proc. IEEE conf. on computer vision and pattern recognition, 2016, pp. 5288–5296. [81] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014. [82] G. Mercado, The filmmaker’s eye: Learning (and breaking) the rules of cinematic com- position. Routledge, 2013. [83] Y. Li, “Deep reinforcement learning: An overview,” arXiv preprint arXiv:1701.07274, 2017. [84] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q- learning,” in Proceedings of the AAAI conference on artificial intelligence, vol. 30, no. 1, 2016. [85] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017. 144 [86] N. Passalis and A. Tefas, “Deep reinforcement learning for controlling frontal person close-up shooting,” Neurocomputing, vol. 335, pp. 37–47, 2019. [87] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in Interna- tional conference on machine learning. PMLR, 2016, pp. 1928–1937. [88] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016. [89] D. Bordwell, J. Staiger, and K. Thompson, The classical Hollywood cinema: Film style & mode of production to 1960. Columbia University Press, 1985. [90] M. Svanera, S. Benini, N. Adami, R. Leonardi, and A. B. Kov´acs, “Over-the-shoulder shot detection in art films,” in 2015 13th International Workshop on Content-Based Mul- timedia Indexing (CBMI). IEEE, 2015, pp. 1–6. [91] P. Sweetser and D. Johnson, “Player-centered game environments: Assessing player opinions, experiences, and issues,” in International Conference on Entertainment Com- puting. Springer, 2004, pp. 321–332. [92] H. P. Mart´ınez, A. Jhala, and G. N. Yannakakis, “Analyzing the impact of camera viewpoint on player psychophysiology,” in 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops. IEEE, 2009, pp. 1–6. [93] P. Burelli and G. N. Yannakakis, “Towards adaptive virtual camera control in computer games,” in International symposium on Smart Graphics. Springer, 2011, pp. 25–36. [94] Z. Yu, H. Wang, A. K. Katsaggelos, and J. Ren, “A novel automatic content generation and optimization framework,” IEEE Internet of Things Journal, 2023. [95] P. Burelli, “Game cinematography: From camera control to player emotions,” in Emo- tion in Games. Springer, 2016, pp. 181–195. [96] ——, “Virtual cinematography in games: investigating the impact on player experi- ence,” in Foundations of Digital Games: The 8th International Conference on the Foun- dations of Digital Games. 
Society for the Advancement of the Science of Digital Games, 2013. [97] R. F. Simons, B. H. Detenber, T. M. Roedema, and J. E. Reiss, “Emotion processing in three systems: The medium and the message,” Psychophysiology, vol. 36, no. 5, pp. 619–627, 1999. [98] M. Haigh-Hutchinson, Real time cameras: A guide for game designers and developers. CRC Press, 2009. 145 [99] M. Christie, P. Olivier, and J.-M. Normand, “Camera control in computer graphics,” in Computer Graphics Forum, vol. 27, no. 8. Wiley Online Library, 2008, pp. 2197–2218. [100] B. Tomlinson, B. Blumberg, and D. Nain, “Expressive autonomous cinematography for interactive virtual environments,” in Proceedings of the fourth international conference on Autonomous agents, 2000, pp. 317–324. [101] X. Wu, H. Wang, and A. K. Katsaggelos, “The secret of immersion: actor driven camera movement generation for auto-cinematography,” 2023. [102] Q. Galvane, R. Ronfard, M. Christie, and N. Szilas, “Narrative-driven camera control for cinematic replay of computer games,” in Proceedings of the Seventh International Conference on motion in games, 2014, pp. 109–117. [103] Y. Deng, C. C. Loy, and X. Tang, “Image aesthetic assessment: An experimental survey,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 80–106, 2017. [104] A. McMahan, “Immersion, engagement, and presence: A method for analyzing 3-d video games,” in The video game theory reader. Routledge, 2013, pp. 67–86. [105] N. Tandon, G. Weikum, G. d. Melo, and A. De, “Lights, camera, action: Knowledge extraction from movie scripts,” in Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 127–128. [106] D. B. Christianson, S. E. Anderson, L.-w. He, D. H. Salesin, D. S. Weld, and M. F. Cohen, “Declarative camera control for automatic cinematography,” in AAAI/IAAI, Vol. 1, 1996, pp. 148–155. [107] Y. Dang, C. Huang, P. Chen, R. Liang, X. Yang, and K.-T. Cheng, “Path-analysis- based reinforcement learning algorithm for imitation filming,” IEEE Transactions on Mul- timedia, 2022. [108] C. Lino and M. Christie, “Intuitive and efficient camera control with the toric space,” ACM Transactions on Graphics (TOG), vol. 34, no. 4, pp. 1–12, 2015. [109] X. Lu, Z. Lin, H. Jin, J. Yang, and J. Z. Wang, “Rating image aesthetics using deep learning,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 2021–2034, 2015. [110] X. Tian, Z. Dong, K. Yang, and T. Mei, “Query-dependent aesthetic model with deep learning for photo quality assessment,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 2035–2048, 2015. [111] H. Talebi and P. Milanfar, “Nima: Neural image assessment,” IEEE transactions on image processing, vol. 27, no. 8, pp. 3998–4011, 2018. 146 [112] M. Haigh-Hutchinson, “Fundamentals of real-time camera design,” in GDC, vol. 5, 2005, p. 20. [113] L. Zhao, M. Shang, F. Gao, R. Li, F. Huang, and J. Yu, “Representation learning of image composition for aesthetic prediction,” Computer Vision and Image Understanding, vol. 199, p. 103024, 2020. [114] A. Brock, J. Donahue, and K. Simonyan, “Large scale gan training for high fidelity natural image synthesis,” arXiv preprint arXiv:1809.11096, 2018. [115] C. Donahue, J. McAuley, and M. Puckette, “Adversarial audio synthesis,” arXiv preprint arXiv:1802.04208, 2018. [116] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with con- ditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134. [117] J.-Y. Zhu, T. Park, P. 
Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232. [118] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020. [119] C. Hardy, E. Le Merrer, and B. Sericola, “Md-gan: Multi-discriminator generative adversarial networks for distributed datasets,” in 2019 IEEE international parallel and distributed processing symposium (IPDPS). IEEE, 2019, pp. 866–877. [120] A. Savitzky and M. J. Golay, “Smoothing and differentiation of data by simplified least squares procedures.” Analytical chemistry, vol. 36, no. 8, pp. 1627–1639, 1964. [121] P. Tarnowski, M. Ko(cid:32)lodziej, A. Majkowski, and R. J. Rak, “Emotion recognition using facial expressions,” Procedia Computer Science, vol. 108, pp. 1175–1184, 2017. [122] S. G. Koolagudi and K. S. Rao, “Emotion recognition from speech: a review,” Inter- national journal of speech technology, vol. 15, pp. 99–117, 2012. [123] F. Ahmed, A. H. Bari, and M. L. Gavrilova, “Emotion recognition from body move- ment,” IEEE Access, vol. 8, pp. 11 761–11 781, 2019. 147