PERCEPTUAL SQUELCH OF ROOM EFFECT IN LISTENING TO SPEECH

By

Aimee Elizabeth Shore

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Physics — Doctor of Philosophy

2018

ABSTRACT

PERCEPTUAL SQUELCH OF ROOM EFFECT IN LISTENING TO SPEECH

By Aimee Elizabeth Shore

Squelch is an effect in which the human auditory system is said to suppress room effects such as reverberation and coloration. Of particular interest is the squelch of room effects in everyday listening conditions: a listener listening to conversational speech in an ordinary room, with the talker and listener separated by a few meters. Traditionally, squelch has been considered a binaural effect– that is, attributable to the ears receiving somewhat different acoustical signals that lead to interaural timing and level differences. Few experiments have attempted to further elucidate the mechanism or mechanisms underlying squelch. A major obstruction to studying squelch is that it is a subjective effect, and as such it is difficult to quantify in absolute terms.

Three pilot experiments (PE1–PE3) were conducted to investigate squelch under everyday listening conditions. In these experiments, parameters thought to affect squelch were varied, sometimes in a multidimensional way, in a series of real room recordings. Listeners reported their perceptions of room effects after listening to the recordings over headphones, either via questionnaire (PE1) or rank-ordering (PE2, PE3). Parameters found to affect perceptions included the distance between the sound source ("talker") and the recording microphones ("listener"), sound presentation level, presence of a spectral tilt, and binaurality. Interestingly, differences in experimental methodology apparently influenced listeners' experiences. Some listeners' responses were consistent with anti-squelch in PE1, but were consistent with binaural squelch in the other pilot experiments. Collectively, results of the pilot experiments suggested that squelch is not a purely binaural effect.

It was hypothesized that the head-related transfer function (HRTF) plays a role in squelch– specifically, that a listener's own HRTF leads to the least amount of room effect being perceived, relative to "other" HRTFs. Two experiments were conducted to investigate the effect of HRTF on listeners' perceptions of room effect. Both used the binaural synthesis technique to deliver psychoacoustically accurate stimuli to listeners. The first experiment presented stimuli to listeners over headphones; stimulus parameters could be varied in a multidimensional way. This experiment revealed significant effects of source distance and binaurality for all listeners. The second experiment used probe microphone recordings in the ear canals to present stimuli over loudspeakers. Results indicate a statistically significant effect of at least some HRTFs on listeners' perceptions of room effect.

Dedicated to my parents, Jim and Beth Shore.

ACKNOWLEDGMENTS

I am very grateful to members of the committee for their guidance in my dissertation work: Profs. Brad Rakerd, Devin McAuley, Norman Birge, and Wolfgang Mittig. I want to further acknowledge Prof. Rakerd for allowing me use of his lab space. Additionally, Profs. Rakerd and McAuley have been extremely helpful with statistical analyses. And I am of course very grateful to Prof. William Hartmann for welcoming me into his lab, for his mentorship, and for his patience. I would like to thank my colleague, Dr.
Eric Macaulay, for his help and useful suggestions in the lab. I also want to acknowledge Prof. Pavel Zahorik and Dr. Greg Ellis of the University of Louisville for their collaboration on the squelch project– specifically, on Preliminary Experiment 3 (Chapter 2). In addition to collecting data for the experiment, they have provided helpful comments and suggestions. I want to thank Profs. Scott Pratt and Kirsten Tollefson for supporting me in their successive roles as Graduate Director. Thank you to Kim Crosslan, Cathy Cords, and the guys in the Physics-Astronomy Machine Shop for always being friendly faces. To my friends and family– there are many of you and I owe you many thanks. In particular: Luke Titus, Nicki Larson, Steve Quinn, Scott Suchyta, Ben Loseth, Bill Martinez, Stephanie Kuhn, Susan Kayser, Yari Rodriguez, and Diana Algra. Thank you all for your kindness, understanding, and friendship during my time at MSU.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1 Introduction
  1.1 Acoustics concepts and terminology
    1.1.1 Sound
    1.1.2 Sound in rooms
    1.1.3 The receiver
  1.2 Perception of room effect
  1.3 Loudspeaker experiments
Chapter 2 Preliminary experiments on squelch
  2.1 Introduction
  2.2 Experiment 1– Questionnaire
    2.2.1 Methods
    2.2.2 Results
  2.3 Experiment 2– Ranking
    2.3.1 Methods
    2.3.2 Results
    2.3.3 Discussion
  2.4 Experiment 3– Ranking physical
    2.4.1 Methods
    2.4.2 Results
    2.4.3 Discussion
Chapter 3 Acoustical representation of a listener's anatomy: The HRTF
  3.1 Introduction
  3.2 Head-related impulse responses
    3.2.1 Maximum length sequences
    3.2.2 MLS technique: validation experiments
  3.3 Reproducing a room's acoustical environment
    3.3.1 Generating stimuli
    3.3.2 Acoustical validation
  3.4 Conclusions
Chapter 4 Room effect perceptual experiment: Listening through other people's ears
  4.1 Binaural synthesis with KEMAR
  4.2 Binaural synthesis with human listeners
    4.2.1 Determining HRIRs and generating stimuli
    4.2.2 Perceptual experiment
    4.2.3 Results
    4.2.4 Discussion
    4.2.5 Conclusions
Chapter 5 Well-controlled stimulus presentation
  5.1 Headphone presentation
  5.2 Headphone equalization
    5.2.1 Experiment setup
    5.2.2 Results
    5.2.3 Discussion
  5.3 Transaural synthesis: an introduction
    5.3.1 Measuring transfer functions
    5.3.2 Calculating loudspeaker signals
  5.4 Two-loudspeaker experiments
    5.4.1 Experiment setup
    5.4.2 Measuring H
    5.4.3 Conducting the synthesis
    5.4.4 Results
    5.4.5 Discussion
  5.5 Three or more synthesis loudspeakers
    5.5.1 Calculating the pseudoinverse
    5.5.2 Experiment with three loudspeakers
    5.5.3 Results
    5.5.4 Discussion
  5.6 Comparison of 2- and 3-loudspeaker spectral amplitudes
    5.6.1 Simulations
    5.6.2 Experiments– setup
    5.6.3 Experiments– results
    5.6.4 Discussion
  5.7 Synthesis accuracy– dichotic, invented signals
    5.7.1 Dichotic, invented signals
    5.7.2 Results
    5.7.3 Discussion
  5.8 Synthesis accuracy– signals from a real source
    5.8.1 Experiment– noise
    5.8.2 Results– noise
    5.8.3 Experiment– speech
    5.8.4 Results– speech
    5.8.5 Discussion– speech
  5.9 Sensitivity to head rotation
    5.9.1 Experiment setup
    5.9.2 Results
    5.9.3 Discussion
  5.10 Conclusions
Chapter 6 Room effect perceptual experiment using well-controlled stimulus presentation
  6.1 Experiment
    6.1.1 Experimental setup
    6.1.2 Experiment– training
    6.1.3 Experiment– calibration
    6.1.4 Experiment– rating
  6.2 Results
    6.2.1 Repeated Measures ANOVA
    6.2.2 Multiple hierarchical regression
  6.3 Discussion
  6.4 Conclusions
APPENDICES
  APPENDIX A: Transaural synthesis reproducibility experiments
  APPENDIX B: Transaural synthesis with probe microphones in the ear canals
BIBLIOGRAPHY

LIST OF TABLES

Table 1.1: List of major thesis experiments, along with a brief description and the chapter in which the experiment appears. This table is for the reader's convenience.

Table 1.2: List of abbreviations that appear in the dissertation. This table is for the reader's convenience.

Table 2.1: Cross-correlation (Pearson product moment) between the left and right channels for the Hamlet soliloquy phrase "To be or not to be, that is the question." The largest difference is between the diotic condition (1.00) and the basement binaural condition (0.77).

Table 2.2: Summary of eight presentation conditions in Experiment 2. A separate audio file represented each unique presentation condition on the iPod.

Table 2.3: Stimuli for Experiment 3. Sentences were recited by a female talker in an anechoic room and recorded. During playback, the recorded sentences were played through a source loudspeaker in Room 10B (RT60 = 0.9 s for speech frequencies) and recorded with cardioid microphones. The cardioid microphone recordings were played over headphones to listeners in Experiment 3.

Table 2.4: Summary of eight presentation conditions in Experiment 3. A ninth condition was anechoic.
Table 3.1: Listeners benefit from listening with their own ears (i.e. individualized HRTFs): fewer localization errors occur, and externalization of sound images is optimal. It is hypothesized that individualized HRTFs may be necessary for room effect squelch.

Table 4.1: The four source loudspeaker positions at which HRIRs were measured in Room 10B. The HRIRs were measured for four heads (H1−H4), and convolved with anechoic speech (Harvard phonetically-balanced sentences). Convolved stimuli were presented to listeners (L1−L4) over headphones during the perceptual part of the experiment (different day).

Table 4.2: Shown here are the thirty-two listening conditions that comprised the "Cats" sentence set. These were presented to the listener over headphones in the perceptual part of the experiment. The listener rated the amount of room effect he perceived in each presentation condition. The order of conditions was randomized for each listener. In the real experiment, the "Glass," "Product," and "Thieves" sentence sets were also presented to a listener. The four sentence sets comprised a pass. Listeners completed two passes (different days).

Table 4.3: Listeners' mean ratings (and standard errors of means) were calculated for each condition: 2 m and 3 m, 0◦ and −30◦, binaural ('y') and diotic ('no'). There were N = 128 values that went into calculating each mean. All listeners rated 2 m lower than 3 m, and binaural lower than diotic. Means for 0◦ and −30◦ were very similar (within a rating point) for all listeners except for L3. This listener rated the −30◦ condition 3.1 points lower than the 0◦ condition. The maximum allowed rating was 40 (strong room effect), and the minimum was 1 (anechoic).

Table 4.4: Results of stage 1 multiple regression analyses for the four listeners. Predictors for stage 1 of the model were distance, binaurality, and angle. Statistical tests indicated that distance and binaurality were significant for all four listeners. (∗p < .05, ∗∗p < .01, ∗∗∗p < .001)

Table 4.5: Results of multiple hierarchical regression analyses. All four listeners indicate a statistically significant effect of sentence and no effect of HRTF (∗p < .05, ∗∗p < .01, ∗∗∗p < .001).

Table 5.1: Percentiles for maximum amplitudes when the distributions of Fig. 5.11 are turned into cumulative distributions. For instance, the upper left entry shows that for the 2 × 2 system, 90% of the maximum amplitudes were less than 3.7. The mean amplitude for the 2 × 2 system was 2.0, which sets the scale for both systems. Therefore, the amplitude of 3.7 is 5.3 dB above the mean.

Table 5.2: Percentiles for maximum amplitudes (experiment) when the distributions of Fig. 5.12 are turned into cumulative distributions. For instance, the upper left entry shows that for the 2 × 2 system, 90% of the maximum amplitudes were less than 2.8.

Table 5.3: RMS errors for synthesis of the 211-component noise from the real source. RMS amplitude errors are in dB re the RMS amplitudes of the target (Probes) or the standard (Internal). Phase errors are in degrees.
Table 5.4: RMS error values for synthesis of "Cats hate dogs." RMS amplitude errors are in dB re the RMS amplitudes of the target (Probes) or the standard (Internal). Phase errors are in degrees. Errors were calculated for the 10202 frequency components between 200 and 4000 Hz – the range of the speech energy.

Table 6.1: Reverberation times were measured using a Larson-Davis sound level meter, with the synthesizer off (left) and on (right). The average RT60 in the five octave bands from 250 Hz to 4000 Hz, which are the most relevant bands for speech, was increased from 0.459 s to 0.659 s with the synthesizer on. This was an increase of 0.200 s.

Table 6.2: These were the twenty stimuli (= 4 sentences × 5 HRTFs) presented to a listener during a single pass of the perceptual experiment. Speech signals were convolved with: the listener's own HRTFs ("own" condition), three other subjects' HRTFs ("other" conditions), and the natural HRTF (that is, anechoic speech was played from the real source loudspeaker– no synthesis was involved). After each stimulus played, the listener gave his rating for the amount of room effect he perceived in that particular stimulus. The order of sentence blocks was randomized, as was the order of HRTF presentation within each sentence block. A listener completed six passes.

Table 6.3: Results of multiple hierarchical regression analyses on listeners' ratings in the room effect perceptual experiment. The four listeners (L1, L2, L3, L4) were analyzed separately. HRTF was highly significant for all listeners. Sentence was also significant (∗p < .05, ∗∗p < .01, ∗∗∗p < .001).

Table 6.4: Standardized regression coefficients, or β-weights, for different HRTFs from multiple hierarchical regression analyses. The primary value of the table lies in pointing out which HRTFs differed significantly from the 'own' condition in listeners' ratings of room effect (∗p < .05, ∗∗p < .01, ∗∗∗p < .001). Ratings for natural and H1 conditions differed from ratings for 'own' conditions for all listeners. Ratings of room effect for the remaining HRTF conditions (H2, H3) differed from 'own' conditions in a mixed manner. The reference group for pairwise comparisons among sentences was the "crate" sentence. Differences among sentences varied in an idiosyncratic manner among listeners.

Table 6.5: List of experiments that compared listeners' experiences using individualized and nonindividualized HRTFs. The listening criteria included localization, externalization, and naturalness, among others. All but two of the experiments used headphones. Listeners preferred their own HRTFs in only 4 out of 14 experiments (29%). Three of these were localization experiments, and the remaining one was an externalization experiment.

LIST OF FIGURES

Figure 2.1: Responses to five questions comparing diotic and binaural presentation of Hamlet's soliloquy in Experiment 1. In cases for which a listener's responses to a question were different for the first and second listening sessions, the response was deemed 'Ambiguous.' Questions 4 and 5 indicate that most listeners' experiences were consistent with anti-squelch.

Figure 2.2: Rankings by nine listeners averaged over the four phrases in Experiment 2.
(a) All listeners displayed binaural squelch. Listeners indicated by arrows displayed anti-squelch in Experiment 1 but had contrary experiences in Experiment 2. (b) Spectral tilt (−3 dB/octave) increased perceived room effect for most listeners. (c) There was a significant effect of level. Results of Wilcoxon signed rank tests (z-scores and p-values) for group differences as a function of listening condition are shown in each panel.

Figure 2.3: During playback of the phonetically-balanced sentences through a loudspeaker (not shown) in Room 10B, two cardioid microphones were located on opposite sides of a plastic "head." The plastic head simulated head diffraction. Recordings from the cardioid microphones were played over headphones to listeners in Experiment 3.

Figure 2.4: Results of Experiment 3: rankings by 21 listeners of phonetically-balanced sentences averaged over four playlists. Results of Wilcoxon signed rank tests (z-scores and p-values) for group differences as a function of listening condition are shown in each panel. (a) Most listeners displayed binaural squelch, but four listeners displayed anti-squelch. (b) All but two listeners ranked the 3-m source distance as having more room effect than the 2-m distance. (c) There was no significant effect of having a plastic head between the two recording microphones.

Figure 3.1: (a) Logical XOR truth table. If either A or B is 1 (but not both), then A XOR B is 1 (i.e. true). Otherwise, A XOR B is 0 (false). (b) A three-stage shift register with taps at stages 1 and 2 in its initial state (111). The output of stage 3 becomes a digit in the MLS. On the next step of the register, the output of stage 3 is fed back into the inputs of stages 1 and 2, and the original values of all stages are passed into the next stage. XOR logic is performed at all taps. (c) Successive values of each stage in the shift register.

Figure 3.2: The first 100 samples of MLS [17:1,15], which corresponds to 0.002 seconds at a sampling frequency of 50 kHz. Values are binary: either 1 or −1, for AC-coupled systems. Successive values are connected to guide the eye.

Figure 3.3: Autocorrelation of MLS [17:1,15] was calculated in Matlab via Eq. 3.1. It was shown to satisfy Eq. 3.2.

Figure 3.4: Detailed schematic diagram of the measurement setup in the PLab for determining KEMAR's HRIRs from electret microphone recordings, xL and xR. This setup, or variants of it, was used for all measurements described in this chapter.

Figure 3.5: Recordings in KEMAR's (a) left and (b) right "ears" when the 131071 samples of the MLS were played from the Mackie loudspeaker at a rate of 48828.125 Hz, giving a duration of 131071 samples / 48828.125 samples s−1 = 2.68433408 s. The recordings are cross-correlated with the MLS to obtain the head related impulse responses, hL(t) and hR(t).

Figure 3.6: (a) Cross-correlation of the MLS and the measured signal at the entrance to KEMAR's right "ear canal," calculated according to Eq. 3.3, yielded the right "ear" HRIR (hR), as shown here.
The measurement was made with electret microphones embedded in EAR plugs blocking the manikin's "ear canal." Fourier transform of the HRIR in (a) yields the HRTF, which is plotted on a linear frequency scale in (b) and a logarithmic scale in (c). Amplitudes were converted to the decibel scale.

Figure 3.7: Comparison of right "ear" HRTFs measured using different taps for the MLS. (a) 0.15 to 1.5 kHz frequency range. (b) 1.5 to 15 kHz frequency range. [17:1,13] is offset by −15 dB and [17:1,12] by −30 dB for visual clarity. Results for the "left" ear are similar and are not shown.

Figure 3.8: Comparison of HRTFs for KEMAR's right "ear" using a sine step (dashed line) vs. MLS (solid line) method. (a) 0.1 − 1.0 kHz frequency range. (b) 1 − 10 kHz frequency range. In both (a) and (b) it is apparent that the sine-step method qualitatively reproduced the HRTF from the MLS method. The main difference was in the greater depth of the valleys in the HRTF from the MLS method.

Figure 3.9: Comparison of HRTFs for KEMAR's right "ear" after smoothing the spectrum from the MLS technique according to Eq. 3.9. The smoothed spectrum is indicated by the solid line, and the spectrum from the sine step method by the dashed line. The spectra are qualitatively very similar, indicating the MLS technique can yield accurate HRTFs.

Figure 3.10: Result of convolving KEMAR's right "ear" HRIR, hR, with s0. (a) The time domain representation of the convolved signal, hR ∗ s0, is shown. It was computed from Eq. 3.10. (b) The Fourier transform of the convolved stimulus is shown for the frequency range 0.1 − 1 kHz. The smoothed spectrum has also been plotted for convenience (offset by 25 dB). (c) Same as panel b, but for frequency range 1 − 10 kHz.

Figure 3.11: (a) Natural condition, in which s0 was played from the source loudspeaker and recorded at KEMAR's "eardrums" with the manikin's internal microphones (xL,n and xR,n). (b) Measurement setup to determine hL and hR. Repeated from Fig. 3.4. (c) Binaural synthesis: headphone delivery of hL ∗ s0 and hR ∗ s0. Recordings xL,s and xR,s were made with KEMAR's internal microphones. For a successful binaural synthesis, xL,s = xL,n and xR,s = xR,n.

Figure 3.12: Recordings from KEMAR's internal microphones for the natural condition (xL,n and xR,n) are indicated by the thin lines. In the natural condition, s0 was played from the source loudspeaker and recorded at the "eardrums" (Fig. 3.11a). Recordings from the synthesized condition (xL,s and xR,s) are indicated by the thick lines. In the synthesized condition, s0 was convolved with left and right "ear" HRIRs (hL and hR). The convolved stimuli were presented to KEMAR over headphones, and recorded at the "eardrums" (Fig. 3.11c). Panels a and b show time domain signals. Remaining panels show spectral amplitudes for frequency ranges (c,d) 0.15 − 1 kHz and (e,f) 1 − 10 kHz. For perfect binaural synthesis, XL,s = XL,n and XR,s = XR,n. Agreement is within a few dB until a steep drop-off of Xs at 9.5 kHz in both "ears."

Figure 4.1: (a) Anechoic stimulus s0 ("Cats and dogs, each hate the other."), recited by a female talker and recorded by an omnidirectional microphone.
This is the test stimulus, s0, in the Room 10B validation experiment. Spectral amplitudes, |S0|, are shown for the (b) 0.15 − 1 kHz range and (c) 1 − 10 kHz range.

Figure 4.2: KEMAR's head-related impulse responses (hL and hR) were measured in Room 10B for (a) left and (b) right "ears." Source position was 3 m and 0◦. The full duration of hL and hR was 2.68 seconds, but they were truncated to 0.9 s. Transfer functions, |HL| and |HR|, are shown in the remaining panels: (c) and (d) show the 0.15 − 1 kHz frequency range, and (e) and (f) show the 1 − 10 kHz range.

Figure 4.3: Convolution of s0 (Fig. 4.1a) with hL and hR (Fig. 4.2, panels a and b). Time domain representation: panel (a) shows the convolved signal for the left "ear" and panel (b) shows the signal for the right "ear." Frequency domain representation: spectral amplitudes of the convolved waveforms, converted to a decibel scale, for the frequency range 0.15 − 1 kHz for the (c) left "ear" and (d) right "ear." Panels (e) and (f) show the 1 − 10 kHz range.

Figure 4.4: Recordings at KEMAR's (a) left (xL,s) and (b) right (xR,s) "eardrums" when the synthesized binaural stimuli (hL ∗ s0 and hR ∗ s0 from Fig. 4.3) were played over headphones. Middle and bottom panels show spectral amplitudes, |Xs(f)| (thick lines) and |Xn(f)| (thin lines), for comparison. For a perfect binaural synthesis, XL,s = XL,n and XR,s = XR,n. In general, there is good agreement between amplitude spectra (within a few dB) below 6 kHz.

Figure 4.5: Subject 1's (H1) binaural HRIRs measured in Room 10B with blocked meatus. There were four source positions.

Figure 4.6: HRTFs (0.15 − 1 kHz) for the four heads (H1−H4). Each panel indicates a different source position. For example, the top panel shows HRTFs measured at the 2 m, 0◦ source position.

Figure 4.7: Same as Fig. 4.6 but for the 1 − 10 kHz frequency range. Differences in spectra are apparent at frequencies as low as 1.5 kHz.

Figure 4.8: Convolved speech waveforms for "Cats and dogs each hate the other," for the four heads (H1, H2, H3, and H4). The HRIRs were for the 3 m, −30◦ source position. These stimuli were later presented to listeners (L1−L4) in the perceptual portion of the experiment.

Figure 4.9: Maximum cross correlation of convolved waveforms for source position 3 m, −30◦. Panels (i) and (j) show the cross correlation averaged across the four Harvard sentences; error bars are the standard deviations. For this source position, cross correlations are smallest for H2's HRIRs.

Figure 4.10: Maximum cross correlation of convolved waveforms for different source positions. Note the values were averaged across the four Harvard sentences, and error bars are the standard deviations. The last panel (e) shows the average cross correlation across all conditions. Waveforms that were generated using H2's HRIRs were least correlated with the other waveforms.

Figure 4.11: Mean ratings (grouped by HRTFs) from the perceptual experiment. A higher rating indicates more perceived room effect. The vertical axis shows the mean rating for each (a) distance (2 m or 3 m), (b) listening condition (diotic or binaural), and (c) angle (0◦ or −30◦). Further, means are shown as functions of H1, H2, H3, and H4.
The horizontal axis indicates which listener was listening. For example, the far-left barplots indicate Listener 1's (L1) ratings, and bars labeled 'H1' indicate when he was listening to his own HRTFs. Each mean was calculated from N = 32 values. For example, L1's mean rating when listening to his own HRTFs (H1) for the 2 m distance was calculated across two listening conditions (binaural and diotic), two angles (0◦ and −30◦), four sentences, and two passes. Error bars are standard errors of the mean. Panel (d) shows listeners' mean ratings for the four HRTFs. For each H, a listener's ratings were averaged across all conditions: two distances (2 m and 3 m), two listening conditions (diotic and binaural), two angles (0◦ and −30◦), four sentences, and two passes (N = 64).

Figure 4.12: Beta-weights (magnitudes) are plotted as a function of model predictors. In the case of non-binary predictors (e.g. sentences and HRTFs), the average β-weight is plotted to simplify the display. Statistical significance is indicated below predictor labels: ratings of perceived room effect in binaural conditions were significantly lower than ratings in diotic conditions at the p < .001 level for all listeners. Likewise, ratings for the 2-m conditions were lower than ratings for the 3-m conditions at the p < .001 level for all listeners. The −30◦ conditions were rated lower than the 0◦ conditions at the p < .001 level for L3 only. Ratings among sentences differed at the p < .01 level for all listeners. Most important to this experiment is that ratings of perceived room effect among HRTF conditions were not significantly different (p > .05).

Figure 5.1: Signal X0 has constant-amplitude (top) and random-phase (bottom) spectra. Each panel shows 211 symbols, one for each spectral component.

Figure 5.2: Signal y (= x0) was played over Sennheiser HD600 headphones and recorded with KEMAR's internal microphones. The recording at the left "eardrum" is shown here. Filled symbols indicate the desired signal (X0) and open symbols indicate the DFT of the measured signal at the "eardrum" (XL). Amplitudes are shown in the top panel and phases in the bottom panel for the 211 spectral components. If XL = X0, open symbols would completely obscure filled symbols.

Figure 5.3: Headphone equalization experiment. (a) To obtain headphone-to-eardrum transfer functions (HL and HR), signal yH was played over the headphones and recordings were made with KEMAR's internal microphones (wL and wR). (b) Signals y′L and y′R, calculated from Eq. 5.5, were played over the headphones and recorded by the internal microphones (xL and xR).

Figure 5.4: Signals measured in KEMAR's left (top) and right (bottom) internal microphones. The desired signal, X0, had equal amplitudes. Signals measured with the original headphone placement, for which HL and HR were measured, are the standard and are indicated by the black line. The black line looks like an axis but it is real data. The largest discrepancy observed in the standard was 0.13 dB and occurred in the left ear at 13387.3 Hz. Measurements at subsequent headphone placements are indicated by open symbols, and each placement is indicated by a different symbol type. The largest amplitude was 13.7 dB above the standard and occurred in the right ear at 9811 Hz.
Figure 5.5: Measurement of the synthesis loudspeaker-to-eardrum transfer functions (H). Signal yH is played from synthesis loudspeaker A and recordings, wL and wR, are made at the eardrums to obtain HAL(f) and HAR(f). Then, yH is played from synthesis loudspeaker B and new recordings are made at the eardrums to obtain HBL(f) and HBR(f). Crosstalk paths, HAR(f) and HBL(f), are indicated by dashed lines.

Figure 5.6: During transaural synthesis, signals y′A and y′B are played from loudspeakers A and B to attain XL = X′L and XR = X′R at the eardrums.

Figure 5.7: Shown here are loudspeakers A (KEMAR's left) and B (KEMAR's right) on the sides (±120◦) and G behind at −140◦, with KEMAR located at the reference position. Acoustical foam wedges, which reduce reflections, are noticeable in the background.

Figure 5.8: TS in the left "ear" using loudspeakers A and B. Filled symbols indicate the desired signal at the eardrum, and open symbols indicate the measured signal at the "eardrum." When a filled symbol is not seen it is because an open symbol obscures it. RMS errors are on the scale of the vertical axis. Loudspeakers A and B were located at −120◦ and 120◦. Loudspeaker G, a reflecting object, was located at −140◦.

Figure 5.9: Synthesis measured at KEMAR's right "eardrum." In this particular measurement, loudspeakers A and B were located at −90◦ and 90◦, and loudspeaker G, a reflecting object, was at 180◦. The spectral component at 10183 Hz exceeded the desired signal amplitude by 5 dB.

Figure 5.10: Synthesis spectra recorded at the right "eardrum." Amplitudes in the (a) 2 × 2 and (b) 2 × 3 system. Phases in the (c) 2 × 2 and (d) 2 × 3 system. Filled symbols indicate the desired signals at the eardrum and open symbols indicate the measured synthesis signals. The 2 × 3 system substantially reduced the very large amplitude at 10183 Hz in the 2 × 2 system.

Figure 5.11: Distributions of maximum amplitudes among (a) 2, or (b) 3 synthesis signals from the random matrix models. The mean amplitude for the 2 × 2 system is 2.0, which sets the scale for both plots. An amplitude of 20 is ten times the mean, or 20 dB higher. The bin on the far right includes all the amplitudes greater than 20. There were 3741 amplitudes out of range in the 2 × 2 system, and 8 in the 2 × 3.

Figure 5.12: Histogram of (experimental) maximum synthesis spectral amplitudes, of (a) 2, or (b) 3 synthesis loudspeakers. Amplitudes were scaled so that the means of the 2 × 3 distributions in Figs. 5.11b and 5.12b coincide. That enables a fair comparison of the figures. Data were combined over 120◦ and 90◦ reference sets, a total of 2532 values per histogram. Fewer large amplitudes occurred in the 2 × 3 system.

Figure 5.13: Left "ear" desired amplitudes (panels a and b) are indicated by filled symbols (X′L). They are straight-line functions of frequency. Measured amplitudes (XL) are indicated by the open symbols. Numbers 1 − 7 track particular component amplitudes of interest. Desired phases were random variables. Desired phases were subtracted from measured phases to find phase errors, which are shown by the diamonds in panels c and d.

Figure 5.14: Same as Fig. 5.13 but for the right "ear." Larger phase errors at high frequencies arise from smaller amplitudes.

Figure 5.15: KEMAR's "head" with probe microphones in the "ear canals." The real-source loudspeaker was located 28◦ to the right of the manikin's forward direction. The three synthesis loudspeakers were located at angles of −120◦, 120◦, and 180◦. All loudspeakers were 1 m from the center of KEMAR's "head." A nearby wall was located on the left and the acoustical foam was removed– this was Room Setup 2. The schematic is not to scale.
Figure 5.16: Amplitudes and phase errors measured by KEMAR's internal microphone in the right "ear" for the 2 × 2 and 2 × 3 systems. The real source was a 211-component white noise. Top two panels: the standard amplitudes are shown by filled symbols. They are the same for the 2 × 2 and 2 × 3 systems. The measured amplitudes are shown by the open symbols. Amplitudes above 8097 Hz are multiplied by five for better viewing. Bottom two panels: differences (in degrees) between measured and standard phases.

Figure 5.17: Same as Fig. 5.16 but the target and standard were female speech ("Cats hate dogs.") instead of white noise. Comparison between standard amplitudes (filled circles) and measured amplitudes (open circles) shows only a 211-component subset of frequency components for a convenient display. The amplitude scale for frequencies above 4 kHz is expanded by a factor of ten. Phase errors (diamonds) for the same set of frequencies are the difference measured − standard. Phase errors outside the ±90◦ range are shown by solid diamonds at ±90◦.

Figure 5.18: Comparison of amplitudes and phases measured at the left "eardrum" for 211 components before the "head" was rotated (filled symbols) and after it was rotated 5◦ to the left (open symbols). Synthesis loudspeakers A and B were at −120◦ and 120◦, and G was at 180◦.

Figure 5.19: Same as Fig. 5.18 but for the right "eardrum."

Figure 5.20: RMS change in amplitude caused by an uncompensated rotation of 5◦, averaged over 211 frequencies. The values are averaged over the azimuths of loudspeaker G. The error bars are two standard deviations in overall length. The data for these histograms came from data sets of which Figs. 5.18 and 5.19 are examples.

Figure 6.1: Setup for the perceptual experiment. Ceramic-tiled panels were located along the wall behind the listener. Enhanced reverberation system (ERS): two studio microphones were positioned in the foreground. Microphone outputs were amplified and fed into the synthesizer (not shown). Two (of four) ERS loudspeakers are visible in the photo. Transaural synthesis: synthesis loudspeakers were located 1 m from the center of the listener's head, at angles of ±120◦ and 180◦. The photograph was taken from the vantage point of the real source loudspeaker (not shown), which was located at 3.8 m and 28◦.

Figure 6.2: To measure HRTFs, a MLS (N = 16) was played from the real source loudspeaker and recorded in the probe microphones in the listener's ear canals. Left panels show |HL|, and right panels show |HR| for the 0.2 − 1 kHz frequency range. Recall that the source was on the right. The top lines in each panel indicate HRTFs of the four listeners ((a,b) L1, (c,d) L2, (e,f) L3, (g,h) L4) who went on to participate in the perceptual experiment.
These HRTFs were used to compute stimuli for the "own" condition. The bottom three lines indicate nonindividualized HRTFs (H1, H2, and H3). These HRTFs were used to compute stimuli for the "other" conditions.

Figure 6.3: Same as Fig. 6.2 but for the 1 − 12 kHz frequency range.

Figure 6.4: Root-mean-square amplitude differences were calculated between a listener's own HRTF and the other HRTFs (H1, H2, H3). Averages were computed across left and right ears. Average differences were larger in the 1 − 12 kHz range, indicating that individual differences in HRTFs were more apparent.

Figure 6.5: Total average powers of each HRTF (own, H1, H2, H3) were calculated and averaged across left and right ears. For convenience, H1, H2, and H3 are repeated in the plot for each listener. A listener's own HRTF is indicated by the shaded bar. Power was relatively constant in the (a) 0.15 − 1 kHz range. In the (b) 1 − 12 kHz range, power in H1 exceeded– in some cases by more than double (3 dB)– the power in own, H2, and H3.

Figure 6.6: Mean ratings of perceived room effect in the perceptual experiment. Higher ratings indicate more perceived room effect (i.e. less room squelch). Listeners are identified along the horizontal axis. Shaded bars indicate when a listener was listening to his own HRTFs (own and natural conditions), and the open bars indicate when a listener was listening to other people's HRTFs. Panels (a)-(d) show ratings for the four different sentences. Ratings were averaged across passes to find the mean rating. L1, L2, and L3 completed six passes, and L4 completed three passes. Error bars are the standard errors of the mean. Panel (e) shows the ratings averaged across the four sentences.

Figure 6.7: HRTF beta-magnitudes are plotted to facilitate visual comparison among listeners. Statistically significant pairwise comparisons between 'own' and the specific HRTF predictor are indicated along the horizontal axis. 'Own' conditions were significantly different from natural and H1 conditions for all listeners. Results of pairwise comparisons between 'own' and H2, H3 were mixed.

Figure A.1: Amplitude spectra recorded in the internal (filled symbols) and probe (open symbols) microphones when the equal-amplitudes, random-phases noise was played from the real source loudspeaker. Discrepancy between filled and open symbols is attributed to dissimilarity in the frequency responses of the microphones.

Figure A.2: Intensity level of the real source had essentially no effect on the synthesis accuracy. This can be seen both visually and by comparing RMS amplitude errors across the three different levels in both "ears." It can thus be concluded that the system was operating in a linear regime, though the experiment was modest since it only spanned 6 dB.

Figure A.3: Amplitude spectra of real source recordings made in the internal microphones with probe tubes placed in the ear canals 1 mm from the "eardrums" (filled symbols). The probe tubes were then removed from the "ear canals" (open symbols). Very close agreement between filled and open symbols indicates that the probe tubes minimally perturbed the sound field at the "eardrums."

Figure A.4: Amplitude spectra of real source recordings made in the internal microphones.
Initial recordings are indicated by filled symbols and the subsequent recordings by open symbols. No changes were made between the two measurements, so any discrepancy is due to random fluctuations. Note that probe tips were present in the "ear canals" during the measurements (1 mm from the "eardrums") but they were not used in the experiment.

Figure A.5: Amplitude spectra recorded at the "eardrums" during synthesis. The first recording of the synthesis is indicated by filled symbols, and the second by open symbols. Variation due to random fluctuations was very small.

Figure A.6: Amplitude spectra recorded at the internal microphones during synthesis. The top panels (a: left ear, b: right ear) depict synthesis in which the probe tip placement in the "ear canals" was the same for the real source and synthesis ('matched' condition). The desired signals in the ears were X′L = X^{m,p}_{0L} and X′R = X^{m,p}_{0R}. The bottom panels (c: left ear, d: right ear) depict synthesis for which the desired signals in the ears (X′L = X^{u,p}_{0L} and X′R = X^{u,p}_{0R}) were measured with a different probe microphone placement than was used during the synthesis ('unmatched' condition).

Chapter 1

Introduction

The purpose of the Introduction is two-fold: 1) to introduce acoustical concepts and terms that enable the reader to engage with the dissertation material, and 2) to establish the two central themes– perception of room effect, and stimulus delivery over loudspeakers– that motivate the research. The Introduction is intentionally brief. Detailed discussion is reserved for subsequent chapters.

1.1 Acoustics concepts and terminology

The following material is intended to aid the reader in comprehension of physical principles that underlie the research. It is by no means a complete treatment of the various topics presented. For further details, the reader is encouraged to consult Hartmann (1998) or Yost (2007).

1.1.1 Sound

Sound is a pressure wave that propagates through a medium (e.g. air), and as such it has a medium-dependent velocity. The speed of sound in air is vs = 344 m/s. Recall that a wave can be described in terms of frequency (f) and wavelength (λ), which are related to the speed of sound through the relationship vs = λf. Sound can be fully described as a function of time, x = x(t), or as a function of frequency, X = X(f). Time and frequency domain representations of sound are equivalent and complementary. One can easily switch from one representation to the other through a Fourier transform or inverse Fourier transform. This is mentioned because, throughout the dissertation, one representation is at times selected over the other; sometimes both representations are presented. The choice depends on the context, but the reader should keep in mind that the representations are equivalent, and, having one, the other can easily be computed.
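The equivalence of the two representations can be made concrete in a few lines of code. The sketch below (Python with NumPy; the tone frequency and sampling rate are illustrative assumptions, not parameters from the experiments) transforms a time-domain signal to the frequency domain and back, recovering the original samples to numerical precision.

```python
import numpy as np

fs = 50000                       # sampling rate in Hz (an assumed, illustrative value)
t = np.arange(0, 0.01, 1 / fs)   # 10 ms of sample times

# Time-domain representation, x = x(t): a 500-Hz tone
x = np.sin(2 * np.pi * 500 * t)

# Frequency-domain representation, X = X(f), via the Fourier transform
X = np.fft.rfft(x)
f = np.fft.rfftfreq(len(x), d=1 / fs)   # frequency of each spectral component

# The inverse Fourier transform recovers the time-domain signal
x_back = np.fft.irfft(X, n=len(x))
assert np.allclose(x, x_back)    # the two representations carry the same information
```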
1.1.2 Sound in rooms

An acoustical scenario minimally requires a stimulus (e.g. white noise or speech), a source to emit the stimulus (e.g. loudspeaker or human talker), and a receiver to pick up the propagated sound (e.g. microphone or ear). Direct sound propagates directly from a source to receiver. Some of the sound from the source, however, propagates in other directions, and when it encounters hard surfaces (e.g. walls and flooring in a room) it is reflected. Some of the reflected sound eventually reaches the receiver. There are early reflections, arriving within 0.02 s of the direct sound, and later-arriving reflections, collectively referred to as reverberant sound or reverberation.1

Reverberation can change the quality of a sound. For an impulsive sound (e.g. a balloon pop), reverberation temporally elongates the sound by imparting a reverberant 'tail,' which arises from the fact that reverberation arrives at the receiver later than direct sound. The situation is more complicated for continuous sounds (e.g. speech), because reverberation temporally overlaps with direct sound.

1Note that there is actually a distinction between discrete reflections that arrive shortly after the direct sound, and reverberation. The former are correlated with the direct sound. Reverberation is uncorrelated. For simplicity, however, the term 'reverberation' is used to collectively reference all reflections, unless otherwise noted.

Rooms are characterized by a reverberation time (RT60), which is the amount of time it takes for the original sound to decay by 60 dB. Reverberation time depends directly on a room's volume, and inversely on its surface area and the absorptive properties of its surfaces.2 As a general rule of thumb, the larger and emptier a room is, the greater its RT60. Anechoic rooms are completely covered with acoustically-absorbent foam, yielding a minimal RT60. Of interest in this dissertation are "ordinary" rooms, which are defined as medium-sized rooms in which people typically talk and interact. Ordinary rooms are neither too dry nor too lively, and the relevant range of RT60 values is about 0.3 − 1.0 seconds. Examples of ordinary rooms are classrooms, offices, and domestic rooms.3

2The Sabine equation, which estimates reverberation time, is RT60 = 0.161 V / (S a), where V is the volume of the room in m³, S is the surface area in m², and a is the average absorption coefficient of room surfaces. Note that reverberation time is frequency-dependent, which enters the Sabine equation through a: a = a(f). In the literature, RT60 values are often reported for different frequency bands. Generally, as frequency increases the RT60 decreases due to greater absorption.

3Domestic rooms are at the lower end of what might be considered ordinary– a study of 602 Canadian homes found an average RT60 of 0.4 s with a standard deviation of 0.1 s for the frequency range 0.8 − 4 kHz (Schuck et al., 1993).
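To make the Sabine estimate of footnote 2 concrete, the sketch below computes RT60 for a hypothetical classroom-sized room. The room dimensions and the average absorption coefficient are assumed values chosen for illustration, not measurements from this work.

```python
def sabine_rt60(V, S, a):
    """Sabine estimate of reverberation time: RT60 = 0.161 V / (S a)."""
    return 0.161 * V / (S * a)

# Hypothetical classroom, 10 m x 7 m x 3 m
V = 10 * 7 * 3                      # volume in m^3
S = 2 * (10 * 7 + 10 * 3 + 7 * 3)   # surface area in m^2
a = 0.20                            # assumed average absorption coefficient

print(f"RT60 = {sabine_rt60(V, S, a):.2f} s")   # about 0.70 s
```

The result, about 0.70 s, falls within the 0.3 − 1.0 s range that defines an "ordinary" room here.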
A room can alter a sound's spectral content– high-frequency spectral components have a stronger tendency to be absorbed by building materials, such as drywall and acoustical ceiling tiles, than low- and mid-frequency components. Thus, a room is effectively a lowpass filter. Additionally, early reflections (that is, reflections arriving about 0.02 s or less after the direct sound) can interfere with the direct sound to create valleys in the spectral amplitudes (Bilsen, 1967). Multiple reflections and standing waves in rooms can also modify the spectrum of the sound (Toole and Olive, 1986). Perceived spectral distortion like this is referred to as coloration.

To summarize, when sound propagates in a room, reflections are introduced. Collectively, reflections can impart a reverberant tail to direct sound. Further, a room lowpass filters sound. This can manifest as an overemphasis of low and mid frequencies (or, equivalently, an absence of high frequencies in the sound), or as an irregular frequency response from a small number of dominant standing waves. This is coloration. These effects– collectively referred to as 'room effect'– are expected to be present in sound that reaches the receiver. The amount of room effect depends on the level and spectrum of the stimulus, the acoustics of the particular room, and properties of the receiver.

As for which type of room effect dominates, it generally depends on the room size and reverberation time. Coloration tends to be more prominent in small rooms with short reverberation times. Conversely, reverberant tails are more prominent in large rooms with long reverberation times (Flanagan and Lummis, 1970). Both effects are expected to be present in medium-sized ordinary rooms.

1.1.3 The receiver

The number of receivers is an important feature of an acoustical scenario. For multiple receivers, the signals reaching each receiver are somewhat different, and depend on the positions of the receivers with respect to a source and on the symmetry of the room. A receiver located far from a source experiences a larger reverberant-to-direct sound ratio than a receiver that is close to the source. Further, the time it takes for the direct sound to reach the near receiver is shorter. Room symmetry can also be important: a receiver close to a wall picks up stronger reflections than a receiver located farther from a wall.

In the context of a human listener, the ears are the receivers. If the same stimulus is delivered to both ears (i.e. the acoustical signals at the eardrums are identical), this is referred to as diotic presentation. Binaural presentation, in which the acoustical signals delivered to the ears are slightly different, is the natural listening condition.4 Further, if a listener is positioned at an angle with respect to a source, interaural (literally, "between ears") differences arise. Direct sound arrives slightly sooner at the ear that is nearer the source, and direct sound has a greater intensity level in the near ear. Interaural timing and level differences, or ITD and ILD for short, are extremely important in human audition– they are largely responsible for the ability to effectively localize sound sources.

4An equivalent term for binaural is 'dichotic.'

Reflection and diffraction from a listener's torso, head, and outer ears (pinnae) affect sound before it reaches the eardrum. Anatomical filtering may become important for spectral components with medium and small wavelengths that are on the order of the dimensions of the head or smaller. Gumerov et al. (2010) offer a helpful rule of thumb: "Roughly speaking, the size of the head is important above 1 kHz, the general characteristics of the torso are important below 3 kHz, and the detailed structure of the head and pinnae becomes significant above 3 kHz, with the details of the pinnae itself becoming important at frequencies over 7 kHz."
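The size of the ITD mentioned above can be estimated from head geometry. A common spherical-head approximation (Woodworth's formula, which is not part of this dissertation's development and is used here only for illustration) gives ITD = (a/c)(θ + sin θ) for a head of radius a, speed of sound c, and source azimuth θ. A minimal sketch, assuming a typical adult head radius:

```python
import numpy as np

def itd_spherical_head(azimuth_deg, head_radius=0.0875, c=344.0):
    """Woodworth spherical-head estimate: ITD = (a/c)(theta + sin(theta))."""
    theta = np.radians(azimuth_deg)
    return (head_radius / c) * (theta + np.sin(theta))

for az in (0, 15, 30, 60, 90):
    print(f"{az:3d} deg -> ITD = {itd_spherical_head(az) * 1e6:5.0f} microseconds")
```

For a source directly to the side (90◦) the estimate is roughly 650 μs, the familiar maximum ITD for an adult head; at the −30◦ source angle used in later experiments, the magnitude is roughly 260 μs.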
1.2 Perception of room effect

Consider an everyday conversation in an ordinary room with the talkers separated by a few meters. An audio recording is made of the conversation through a microphone placed near the talkers. When the recording is played back, the talkers (now the listeners) suddenly become aware of room effect– reverberation and coloration– that was not noticed during the original conversation. The physical character of the sound in the room is correctly conveyed by the audio recording: the reflections that lead to sensations of coloration and reverberation were physically present during the conversation. The psychological effect by which these reflections are suppressed during the real-time conversation is known as "room effect squelch," or simply "squelch" for brevity. It is the same process that makes it so obvious when a talker is using speakerphone– the listener immediately perceives coloration and reverberation in the talker's speech. Yet, a colleague who is listening to the same conversation in person at the office does not notice room effect.

It is important to note that squelch occurs only for sufficiently small physical room effect. A listener notices room effect during conversations taking place in a cathedral or gymnasium, which have long reverberation times, because the prominence of physical reflections overwhelms the auditory system's squelch mechanism. Those rooms fall outside the purview of this dissertation. Only ordinary rooms are considered.

A central theme underlying this dissertation is listeners' perception of room effect. Perception is subjective, and as such it is difficult to investigate. This explains why few psychoacoustical experiments have been done on squelch since the original work by Koenig in 1950 (more on existing works in Chapter 2). It has largely been assumed since then that squelch is a purely binaural effect– that is, attributable to the fact that human listeners have two ears that receive somewhat different signals. Several experiments are presented in Chapters 2 and 4 that attempt to elucidate physical parameters beyond binaural listening that may affect squelch. Chapter 3 is an experimental methods chapter. All of the perceptual experiments in these chapters used headphones to deliver stimuli to a listener.

1.3 Loudspeaker experiments

The second part of the dissertation (Chapters 5 and 6) focuses on loudspeaker delivery of stimuli to a listener. This is the second major theme of the dissertation. Advantages of loudspeakers over headphones for stimulus delivery are discussed, and the particular challenges associated with loudspeaker delivery are addressed. The main point is that headphones form an isolated and closed system– that is, an experimenter is assured that only the signal intended for the left ear reaches a listener's left ear, and only the signal intended for the right ear reaches the right ear.5 Conversely, loudspeakers comprise an open system: a significant amount of a stimulus intended to be delivered only to a listener's left ear is also inadvertently delivered to the right ear, and a stimulus intended only for the right ear is likewise delivered to the left ear. The reader has probably already experienced this when listening to music over a loudspeaker: the music is heard in both ears, not just the ear closer to the loudspeaker. The audio engineering industry refers to this leakage into the other ear as 'crosstalk.'

5In reality, there is a small 'leakage' into the opposite ear, which is actually measured in Chapter 5. For all intents and purposes, however, headphones constitute a closed system when compared to loudspeakers.

If the crosstalk problem is overcome, then loudspeaker systems can potentially provide excellent signal delivery to listeners in psychoacoustical experiments. Certainly this is a very attractive prospect for psychoacousticians, and indeed, methods to cancel out the leakage have existed since the 1960s. However, those crosstalk cancellation methods employ approximations that have a deleterious effect on the accuracy of signal delivery. Chapter 5 describes an extension of existing methods that employs no approximations, and it is shown to deliver signals to the eardrums more accurately than existing methods. The improved method is applied in a perceptual experiment on room effect in Chapter 6.
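As a preview of the transaural synthesis developed in Chapter 5, the sketch below illustrates the core crosstalk-cancellation computation in the frequency domain. The transfer functions here are random complex numbers standing in for measured loudspeaker-to-eardrum responses (Sec. 5.3.1); this is a minimal illustration of the linear algebra, not the calibrated procedure used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_freq = 211   # number of spectral components, as in the Chapter 5 test signals

# Desired complex ear signals X[ear, freq] (left = 0, right = 1), invented for illustration
X = rng.standard_normal((2, n_freq)) + 1j * rng.standard_normal((2, n_freq))

# H[ear, speaker, freq]: transfer function from each loudspeaker to each eardrum.
# In practice H is measured; here it is random, standing in for a measurement.
H = rng.standard_normal((2, 2, n_freq)) + 1j * rng.standard_normal((2, 2, n_freq))

# Loudspeaker signals Y that cancel crosstalk: solve H Y = X at each frequency
Y = np.stack([np.linalg.solve(H[:, :, k], X[:, k]) for k in range(n_freq)], axis=1)

# Check: propagating Y through H reproduces the desired signals at both eardrums
assert np.allclose(np.einsum('ijk,jk->ik', H, Y), X)
```

With three or more synthesis loudspeakers, the square matrix inverse is replaced by a pseudoinverse (e.g. np.linalg.pinv), which is the extension taken up in Sec. 5.5.1.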
The improved method is applied in a perceptual experiment on room effect in Chapter 6. Experiments are summarized in Table 1.1, and important abbreviations are listed in Table 1.2.

LIST OF THESIS EXPERIMENTS

Preliminary Experiment 1 (PE1) [Chapter 2]: Human listeners switched between diotic and binaural listening while listening to Hamlet’s soliloquy.
Preliminary Experiment 2 (PE2) [Chapter 2]: iPod rank-ordering experiment on room effect perception in Hamlet’s soliloquy. Human listeners.
Preliminary Experiment 3 (PE3) [Chapter 2]: iPod rank-ordering experiment on room effect perception in Harvard sentences. Human listeners.
Binaural Synthesis with KEMAR, PLab [Chapter 3]: Validation measurement. Convolved stimulus produced an equivalent spectrum at KEMAR’s “eardrums” as a white-noise real source.
Binaural Synthesis with KEMAR, Room 10B [Chapter 4]: Validation measurement. Convolved stimulus produced an equivalent spectrum at KEMAR’s “eardrums” as a speech real source.
Headphone Perceptual Experiment with Human Listeners [Chapter 4]: Human listeners were presented with HRTF-convolved speech stimuli. Listeners rated the amount of room effect perceived.
Headphone Transfer Function [Chapter 5]: Headphone signal was recorded via KEMAR’s internal microphone.
Headphone Placement: Reproducibility [Chapter 5]: Repeated placements of headphones on KEMAR’s head revealed poor reproducibility.
Transaural Synthesis: Two Loudspeakers [Chapter 5]: A straight-line stimulus was synthesized at KEMAR’s “eardrums” using two synthesis loudspeakers.
Transaural Synthesis: Three Loudspeakers [Chapter 5]: A straight-line stimulus was synthesized at KEMAR’s “eardrums” using three synthesis loudspeakers.
Transaural Synthesis: Challenging Dichotic Signal [Chapter 5]: A challenging dichotic signal was synthesized at KEMAR’s “eardrums” using two and three synthesis loudspeakers.
Transaural Synthesis: Real Source [Chapter 5]: A white noise was played from a real source speaker and synthesized at KEMAR’s “eardrums.” Probe microphones.
Transaural Synthesis: Real Source with Speech [Chapter 5]: A brief sentence was played from a real source speaker and synthesized at KEMAR’s “eardrums.” Probe microphones.
Transaural Synthesis: Head Rotation [Chapter 5]: Head rotations were made after synthesis calibration to determine the effect on spectra measured at KEMAR’s “eardrums.”
Transaural Synthesis Perceptual Experiment with Human Listeners [Chapter 6]: Transaural synthesis with three loudspeakers was used to present HRTF-convolved speech stimuli to human listeners.

Table 1.1: List of major thesis experiments, along with a brief description and the chapter in which the experiment appears. This table is for the reader’s convenience.

Abbreviation   Full-Length Phrase                                    First Appearance (Chapter)
ADC            analog-to-digital converter                           3
CTC            crosstalk cancellation                                5
DAC            digital-to-analog converter                           3
DUT            device under test                                     3
HRIR           head related impulse response                         3
HRTF           head related transfer function                        2
ILD            interaural level difference                           1
ITD            interaural time difference                            1
JND            just noticeable difference                            2
KEMAR          Knowles Electronics Manikin for Acoustic Research     3
LTI            linear time invariant                                 3
MLS            maximum length sequence                               3
NH             normal-hearing                                        2
RIR            room impulse response                                 2
RMS            root mean square                                      2
RT60           reverberation time (decay of 60 dB)                   1
SNR            signal-to-noise ratio                                 3
TS             transaural synthesis                                  5

Table 1.2: List of abbreviations that appear in the dissertation.
This table is for the reader’s convenience.

Chapter 2

Preliminary experiments on squelch

The present chapter describes three perceptual experiments conducted on squelch. Historically, squelch has been considered a purely binaural effect. Results of the experiments presented in this chapter indicate that squelch depends on a multitude of factors, including binaural presentation, source-to-microphone distance, and presentation level, that can enhance or reduce a listener’s perception of room effect.

2.1 Introduction

The initial study of room squelch dates to 1950, when W. Koenig of Bell Labs conducted a qualitative experiment in which subjects wearing headphones listened to speech and other sounds from a separate room. Subjects were able to switch between diotic and binaural presentations. Koenig observed that subjects perceived room reverberation in the diotic condition but reported “no unnatural or objectionable reverberation” when listening binaurally. Despite an absence of data, Koenig’s paper remains an influential support for the opinion that room squelch is a purely binaural effect. A later study concerned with binaural perception of reverberant sound reported that listeners had reduced sensitivity to the presence of coloration when stimuli were presented binaurally, both in terms of absolute thresholds and number of JNDs, or just-noticeable differences (Koenig et al., 1975). Experiments by Zurek (1979) revealed higher thresholds for a simulated reflection when noise was presented dichotically instead of diotically, suggesting that the binaural system can suppress coloration, consistent with Koenig’s (1950) informal observations.

Historically, speech intelligibility studies have been much more prevalent than the types of perceptual experiments described in the preceding paragraph. In these studies, speech is presented under reverberant conditions. The listener’s task is to understand speech in the presence of reverberation, and the performance metric is the percentage of words correctly identified. While speech intelligibility experiments do not directly provide information on listeners’ perceptions of room effect, they demonstrate the superiority of binaural hearing in the presence of reverberation.1 Moncur and Dirks (1967) investigated speech intelligibility in quiet under reverberant conditions for both monaural (single ear) and binaural listening. They observed a binaural advantage that resulted in a 7% improvement in intelligibility scores at a reverberation time of 0.9 s and a 10% improvement in intelligibility at reverberation times of 1.6 and 2.3 s. They suggested that interaural time differences in the binaural listening condition were responsible for the improvement in intelligibility scores. Nábělek and Pickett (1974) observed a binaural advantage equivalent to 3 dB in quiet for normal-hearing (NH) listeners at reverberation times of 0.3 s and 0.6 s, compared to the monaural condition. The data in Figure 1 of the reference, which Nábělek and Pickett compiled from results of various studies including their own, indicate that speech intelligibility is relatively constant in NH listeners for reverberation times up to 1.2 s.

Speech intelligibility experiments have validated binaural superiority when listening to speech in the presence of reverberation, but they have provided no insight into listeners’ perceptions of room effect.

1Present interest is in the speech-in-quiet condition, vs.
speech-in-noise, in which speech is presented in the presence of both reverberation and a masking noise. Some researchers have referred to binaural superiority under speech-in-noise conditions as “binaural squelch” (Olsen and Carhart, 1967; MacKeith and Coles, 1971). However, the effect observed in their studies would more aptly be termed “binaural masking level difference” or “spatial release from masking.” These are distinct from what is meant by binaural squelch in the present context.

Perceptual experiments are inherently more challenging to analyze and interpret. So, while speech intelligibility experiments demonstrate a clear binaural advantage, their usefulness is of limited scope. Nevertheless, it might reasonably be conjectured that if a listener demonstrates improved speech understanding with binaural presentation, it could be attributed to a decrease in effective or perceived reverberation.

There is some indication that perception of room effect may depend on more than just binaural hearing. Haas (1949) investigated the perception of a single reflection in speech. The experiment was conducted in a room in which the RT60 was varied: 0 s, 0.8 s, and 1.6 s. The time delay between the direct sound and the reflection (echo) was also varied. The listener’s task was to indicate when he or she perceived the reflection as “disturbing.” For the delay range 0.01–0.12 s, listeners indicated that the reflection was least disturbing for the largest RT60 condition. Reverberation apparently masked the echo. Presumably, this would reduce a listener’s overall perception of room effect. This reduction mechanism depends on particular physical properties of the room and is independent of binaural hearing.

Brüggen (2001) conducted a perceptual experiment that aimed to elucidate the auditory system’s binaural decoloration mechanism. He suggested that coloration is a multidimensional percept. In the experiment, expert listeners first developed a series of attribute antonym-pairs, e.g. “Dull/Bright,” “Full/Thin,” “Reverberant/Dry”. Listeners then rated speech stimuli according to each antonym pair. The stimuli were computed by convolving anechoic speech with simulated room impulse responses (RIR). The RIRs simulated room environments with “moderate” levels of reverberation, though specific RT60 values were not given. Principal component analysis (PCA) of the listeners’ ratings revealed two significant eigenvalues. It was posited that the first component was related to amplitude spectral variations in the stimuli. The second component was thought to be related to perceived temporal diffusivity. While the first component had some dependence on binaural presentation (vs. diotic), the second component did not. Based on these results, Brüggen suggested that the auditory system’s decoloration mechanism has orthogonal binaural and monaural components.

In 2015, Ellis et al. conducted an experiment in which listeners were instructed to quantify the perceptual similarity between pairs of speech stimuli. The stimuli were computed using simulated RIRs (RT60 = 2.06 s). Listeners were presented with stimuli over headphones in diotic or binaural listening modes and asked to rate perceptual similarity. Using multidimensional scaling (MDS), the researchers identified three perceptual dimensions along which perceived differences lie. They interpreted dimension 1 to be associated with perceived sound source distance.
Dimension 2 was interpreted as a simple binary indicator for the presence or absence of reverberation. They identified the third dimension as binaural squelch and attributed it to interaural cross-correlation. Note that the stimuli were created using virtual auditory space techniques with a long reverberation time that falls outside the range for ordinary rooms. Nevertheless, the result that the cue for perceived source distance (dim. 1) was more salient than the cue for reduced room effect (dim. 3) may offer some insight with regard to the relative importance of binaural squelch in listeners’ perceptions of room effect.

Experiments by Teret et al. (2017) found that listeners’ perceptions of reverberation depended on the stimulus signal type. Speech, music, noise, and clicks were convolved with simulated RIRs with RT60 values ranging from 0.6–1.95 s. Listeners then matched different stimulus types for equal amounts of perceived reverberation. They found that for an identical RT60 value, listeners perceived different amounts of reverberation depending on the signal type. The authors suggested that the perceptual differences may arise from differing amounts of transient vs. ongoing segments within the stimuli. Perceptual differences between RT60 values were found to be smaller for binaural presentation (vs. diotic). Collectively, the results suggest that squelch may operate differentially with respect to stimulus type, RT60 values, and binaurality.

The current chapter continues the focus on room effect with experiments that presented pointed questions to the listeners regarding their perceptions of room effect. Further, many of the above-mentioned experiments computed stimuli using simulated RIRs, but there is some ambiguity regarding the correct way to model late reverberant energy in virtual room acoustics techniques (Pellegrini, 2002). There is a dearth of room effect perceptual experiments conducted in real (vs. simulated) rooms. The experiments described below differentiate themselves because they used speech stimuli recorded in real rooms– thus, the reverberation is physically real and accurate. In the listening portions of the experiments, subjects wore headphones and reported their perceptions of room effect among different presentation conditions (Shore et al., 2016). Experiment 1 was intended to be analogous to Koenig’s experiment, while Experiments 2 and 3 incorporated multiple parameter variations. Experiment 1 elicited listener responses via questionnaire, while Experiments 2 and 3 utilized a rank-ordering paradigm. Diotic and binaural presentations were common to all experiments.

2.2 Experiment 1– Questionnaire

2.2.1 Methods

Fifteen listeners (6 female) aged 20–64 participated in Experiment 1. All had self-reported normal hearing and completed a standard consent form approved by the MSU IRB. Listeners from outside the lab were paid.

Two stereophonic recordings were made of a female talker reciting Hamlet’s soliloquy, one in a sound room with absorbing walls (Acoustic Systems, Austin, TX), referred to as “the dry room,” and the other in a long empty basement with a concrete floor and drywalled ceiling and walls, referred to as “the basement.” Recordings were made with cardioid studio microphones (SHURE KSM32, Shure Inc., Chicago, IL) spaced 10 cm apart and 1.8 m from the talker.
Waveforms were amplified (302 Dual Microphone Preamplifier, Symetrix, Mountlake Terrace, WA) and digitized at a sample rate of 48 kHz with 16-bit precision on a portable recorder (Zoom H4nSP, Zoom, Hauppauge, NY). The dry and basement recordings were spliced line-by-line in sound-editing software to make a single 63-second, continuously-looping file. For example, “To be or not to be” from the dry recording was followed by “that is the question” from the basement recording. A mechanical switch box enabled the listener to switch from diotic to binaural presentation, a reenactment of Koenig’s procedure (1950). In binaural presentation, the left channel was sent to the left headphone and the right channel was sent to the right. In diotic presentation, the left channel was sent to both headphones. The two switch positions were labeled ‘A’ and ‘B’.

Subjects listened to the spliced audio file presented at 65 dBA through on-the-auricle headphones (Sennheiser HD414, Wennebostel, Germany) in the above-mentioned sound room. They were instructed to flip the switch to alternate listening between diotic and binaural presentations as the recording itself alternated between dry-room and basement phrases. While listening, subjects filled out a questionnaire on their perceptions of room effect. The questionnaire directed a listener’s attention to particular room properties, and the five questions are paraphrased along the horizontal axis in Fig. 2.1. Subjects listened to the recording as many times as they wanted and flipped the switch as often as they liked. This resulted in some listeners flipping the switch more rapidly than others. A typical listening session, including a post-interview with the listener to clarify or supplement questionnaire responses, lasted 30 to 45 minutes. Listeners completed two sessions.

2.2.2 Results

Questionnaire results are shown in Fig. 2.1. In cases for which a listener’s responses to a question were different for the first and second listening sessions, the response was deemed ‘Ambiguous.’ Ten of the fifteen (67%) listeners reported that they perceived more room effect in binaural presentation (Questions 4 and 5). Seven of these listeners indicated during the post-interview that they strongly experienced binaural enhancement of room effect, which is termed here “anti-squelch.” It is evident that the results of Experiment 1– notably, the prevalence of anti-squelch– are directly contrary to Koenig’s observations. Further discussion of this opposition appears at the end of section 2.3.2.

Eight listeners (53%) indicated that the binaural presentation was louder than diotic in the basement phrases, though the levels of the two channels were physically the same. Binaural loudness enhancement is consistent with an effect of interaural incoherence reported by Edmonds and Culling (2009), though it should be noted that their stimulus was noise, not speech. All but two listeners noted that in the dry-room condition diotic and binaural presentations were essentially identical.

A cross-correlation (Pearson product-moment) was calculated between the left and right channels for the phrase “To be or not to be, that is the question” in both the dry room and basement recordings. Values are given in Table 2.1. The correlation for diotic stimuli in each room was 1.00, as is to be expected for perfectly coherent left and right channels in a diotic stimulus. Binaural cross-correlation was 0.90 in the dry room and 0.77 in the basement.
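For concreteness, a correlation of this kind can be computed in a few lines of Matlab; this is a minimal sketch, and the two-channel excerpt file name is hypothetical.

    % Pearson product-moment correlation between left and right channels
    % of a recorded phrase (cf. Table 2.1).
    [y, fs] = audioread('toBeOrNotToBe_basement.wav');  % hypothetical excerpt
    r = corrcoef(y(:, 1), y(:, 2));   % 2x2 matrix of pairwise correlations
    interauralCorrelation = r(1, 2)   % e.g., 0.77 for the basement binaural case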
A greater perceptual effect was anticipated in going from 1.00 to 0.90, i.e. dry-room diotic to dry-room binaural, than in going from 0.90 to 0.77, i.e. dry-room binaural to basement binaural. That prediction was based on results of a discrimination experiment conducted by Pollack and Trittipoe (1959), in which the amount of correlation between left- and right-ear signals was varied. Note that the stimulus in their experiment was noise, not speech. In Experiment 1, however, most listeners reported little to no difference between dry-room diotic and binaural presentations. This suggests that cross-correlation is not a particularly illuminating approach to the loudness effect for speech in the absence of reverberation.

Figure 2.1: Responses to five questions comparing diotic and binaural presentation of Hamlet’s soliloquy in Experiment 1. In cases for which a listener’s responses to a question were different for the first and second listening sessions, the response was deemed ‘Ambiguous.’ Questions 4 and 5 indicate that most listeners’ experiences were consistent with anti-squelch.

              Diotic    Binaural
dry room      1.00      0.90
basement      1.00      0.77

Table 2.1: Cross-correlation (Pearson product moment) between the left and right channels for the Hamlet soliloquy phrase “To be or not to be, that is the question.” The largest difference is between the diotic condition (1.00) and the basement binaural condition (0.77).

2.3 Experiment 2– Ranking

Experiment 2 eliminated listener switching and replaced the questionnaire with a rank-order response. This was done to determine whether anti-squelch would persist when the experimental methods were changed.

2.3.1 Methods

A phrase was selected from the Hamlet soliloquy, as recorded in the basement, and edited to modify spectral features, level, or both. The spectral modification entailed a −3 dB/octave spectral tilt for frequencies above 230 Hz. The purpose of including the spectral and level post-processing manipulations was to explore factors other than binaurality that might influence squelch. It was expected that listeners would perceive more coloration in conditions that had the spectral tilt. Further, it was conjectured that more room effect would be perceived at the higher level because the reverberant sound would be boosted in level. It was not expected that listeners would accordingly normalize the direct sound in conditions with spectral tilt or level boost, since the perceptual task directed their attention toward room effect (and not the direct sound).

There were nine audio files, randomly labeled A–I, with each representing a unique presentation condition. Eight conditions are summarized in Table 2.2; the ninth was the dry-room binaural recording, which was not modified in any way. The files constituted a playlist. Three additional soliloquy phrases from the basement were edited in an identical manner, for a total of four playlists. All playlists were imported to an iPod Touch (iOS 4.2.1, Apple, Cupertino, CA). While listening, listeners used the drag-and-drop feature on the iPod to rearrange the nine files in a playlist in order of increasing room effect. Listeners were instructed to listen to the playlists in a particular order, which was randomized for each session.

Four phrases from Hamlet’s soliloquy
Listening mode    Spectral modification    Level (dBA)
binaural          tilt                     65
binaural          tilt                     75
binaural          none                     65
binaural          none                     75
diotic            tilt                     65
diotic            tilt                     75
diotic            none                     65
diotic            none                     75

Table 2.2: Summary of the eight presentation conditions in Experiment 2.
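One way to realize the −3 dB/octave tilt described above is zero-phase shaping in the frequency domain; the following Matlab sketch is illustrative and is not necessarily the exact filter implementation used to prepare the stimuli (the file name is hypothetical).

    % Apply a -3 dB/octave spectral tilt above 230 Hz (zero-phase, FFT domain).
    [x, fs] = audioread('basementPhrase.wav');  % hypothetical source file
    x = x(:, 1);                                % one channel for simplicity
    N = numel(x);
    f = (0:N-1)' * fs / N;                      % DFT bin frequencies
    f = min(f, fs - f);                         % fold bins above Nyquist
    gaindB = min(0, -3 * log2(max(f, eps) / 230));  % 0 dB at and below 230 Hz
    y = real(ifft(fft(x) .* 10.^(gaindB / 20)));    % tilted waveform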
A separate audio file represented each unique presentation condition on the iPod. Seven of the nine listeners in this experiment were also listeners in Experiment 1. Those from outside the lab were paid. During the experiment, listeners could listen to the files in a playlist in any order, as many times as they wanted, with no time constraint. They recorded their final rankings on a paper answer form, which included instructions and explicitly defined room effect as reverberation and coloration. Ranking the four playlists required 0.5 to 1 hour.

2.3.2 Results

For each listening condition, the listener rankings were averaged across the four playlists. The averages are shown in Fig. 2.2. Since seven of the subjects from Experiment 1 returned for this experiment, it could be observed whether the manner of presenting stimuli to subjects and/or the subject response procedure (i.e. questionnaire versus rank order) influenced listeners’ experiences. Because the listeners were all tested individually and given no information about the responses of other listeners, the comparison was fair. In contrast to Experiment 1, all listeners displayed evidence of binaural squelch in Experiment 2. This can be seen in Fig. 2.2, panel a: mean ranks were lower for binaural conditions than for diotic, indicating that listeners perceived less room effect in binaural presentation. Four listeners (Q, S, V, W) who had displayed anti-squelch of room effect in Experiment 1 gave responses consistent with binaural squelch in Experiment 2. Three listeners (O, P, R) displayed binaural squelch in both experiments. The remaining listeners (M and N) also displayed binaural squelch but had not participated in Experiment 1.

The majority of listeners ranked spectral tilt above non-tilt (Fig. 2.2, panel b), indicating that more room effect was perceived in the spectral tilt condition. Five of nine listeners decidedly perceived more room effect with spectral tilt. Level had a large effect on listeners’ rankings (Fig. 2.2, panel c). All but one listener (R) perceived more room effect at the higher level. Nonparametric statistical analyses (Wilcoxon signed ranks tests) found significant effects of binaurality (p = 0.008) and level (p = 0.010), and a marginally significant effect of spectral tilt (p = 0.086).

Figure 2.2: Rankings by nine listeners averaged over the four phrases in Experiment 2. (a) All listeners displayed binaural squelch. Listeners indicated by arrows displayed anti-squelch in Experiment 1 but had contrary experiences in Experiment 2. (b) Spectral tilt (−3 dB/octave) increased perceived room effect for most listeners. (c) There was a significant effect of level. Results of Wilcoxon signed rank tests (z-scores and p-values) for group differences as a function of listening condition are shown in each panel.

2.3.3 Discussion

The conclusion of Experiment 2 agreed with the conclusion reached by Koenig (1950)– binaural presentation tends to reduce room effect. Both of these conclusions disagreed with the results of Experiment 1, which used a switchbox and a questionnaire response and resembled Koenig’s experiment. Experiment 2 used very different methods– namely, a touchscreen and a rank-ordering response. These conflicting results suggest that binaural squelch is a real effect but that perceptual experiments calling attention to binaural/diotic presentation can confound the listening experience.
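Paired nonparametric comparisons of the kind reported above can be run with the Statistics Toolbox function signrank; a minimal sketch with hypothetical placeholder data, one mean rank per listener for each condition:

    % Wilcoxon signed ranks test on paired mean ranks (binaural vs. diotic).
    meanRankBinaural = [2.1 3.0 2.4 2.8 1.9 2.6 2.2 3.1 2.5];  % placeholder
    meanRankDiotic   = [4.0 4.4 3.9 4.6 3.5 4.2 3.8 4.8 4.1];  % placeholder
    [p, h, stats] = signrank(meanRankBinaural, meanRankDiotic)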
Further, the mixed results in Experiment 1– namely, two-thirds of listeners had experiences consistent with anti-squelch while the remaining third had experiences consistent with binaural squelch– indicate that the experience is highly individualistic.

A possible explanation for the collective experimental results is that when a listener switched from diotic to binaural presentation in Experiment 1, spatial effects– lateralization and envelopment2– might have suddenly become prominent. The listener would likely have had in mind the dramatic onset of spatial effects when responding to the questionnaire, and it could explain why most listeners reported anti-squelch. When the manner of presentation (diotic/binaural) was mixed with other stimulus manipulations in Experiment 2, listeners reported binaural presentation to be more natural. The onset of spatial effects was apparently subdued when binaural presentation was interwoven with level and/or spectral tilt modifications. The direct speech could be streamed separately from the reverberation, and squelch then had the opportunity to manifest. In diotic presentation, colocation of the direct and reverberant sounds in the middle of the head3 might have rendered it more difficult for listeners in Experiment 2 to focus attention on only the direct sound. They may have therefore perceived the reverberant sound as being more prominent.

2Lateralization: localization of sound when listening over headphones. Envelopment: sense of being spatially surrounded by sound.
3In diotic headphone presentation there are no interaural differences. This results in sounds being perceived in the middle of the head.

In short, a listener’s attention was likely to be focused on the dramatic onset of spatial effects when switching to binaural presentation in Experiment 1. It is likely that the listener’s attention was further drawn to the diotic/binaural transition since switching was under the listener’s control. In Experiment 2, the listener reorganized various files in a playlist, and sometimes the differences among files could be subtle. The task was multidimensional and inherently more challenging than a binary switching task. Further, in Experiment 2 the listener’s attention was likely to be focused on the direct sound in binaural presentation, which made the reverberation less prominent. These different attention foci might explain why 4 out of 7 listeners gave opposite responses for Experiments 1 and 2.

Comments from the listeners during the post-interviews suggested that the focus of their attention may have been contingent on the amount of reverberation, which depends on room qualities. Comparing Experiment 1 to Koenig’s (1950) experiment, reverberation and spectral distortion (coloration) may have been more physically prominent in the basement than in the room environment in Koenig’s experiment. As such, listeners in Experiment 1 may have perceived more room effect simply because it was physically more prominent. Spatial effects brought on through binaural presentation likely further drew the listener’s attention to room effect. Differing amounts of reverberation in the basement and in Koenig’s room, enhanced through spatial effects in binaural presentation, might therefore explain the contrasting results between Experiment 1 and Koenig’s experiment. Unfortunately, reverberation times were unavailable for either room. It is also worth mentioning that Experiment 1 presented listeners with a comparison dry recording, which Koenig’s experiment did not include.
This could have also contributed to the different results in the two experiments. It remained to test squelch under physically natural acoustic conditions– the main experiment, which is described in the next section.

2.4 Experiment 3– Ranking physical

Experiment 3, termed the ranking-physical experiment, was the main experiment in Chapter 2. It spanned two academic institutions and included twenty-one listeners, yielding greater statistical power than Experiments 1 and 2. A rank-ordering paradigm was used, and the stimuli were Harvard phonetically-balanced sentences. Physical manipulations were made during the recording of the stimuli that were later presented to listeners over headphones. These manipulations– namely, variation of source-to-microphone distance and inclusion of head diffraction– are thought to be important for everyday listening.

2.4.1 Methods

This experiment was done in collaboration with researchers at the University of Louisville (UL). Ten UL listeners (A–J), who received class credit, and eleven listeners at MSU (M–W) participated in this experiment. The MSU listeners were returning listeners from Experiments 1 and 2. Listeners from outside the lab were paid. The human subjects procedures were approved by the IRB at MSU and at UL.

New speech stimuli were created for Experiment 3. A female talker stood 15 cm from a cardioid microphone in an anechoic chamber and recited four Harvard phonetically-balanced sentences (Table 2.3; IEEE, 1969). Waveforms were amplified (+48 VDC phantom power, AudioBuddy Dual Microphone Preamplifier, M-Audio, Cumberland, RI) and digitized (Zoom recorder).

Harvard phonetically-balanced sentences
“Cats and dogs each hate the other.”
“Add the sum to the product of these three.”
“Open the crate but don’t break the glass.”
“Thieves who rob friends deserve jail.”

Table 2.3: Stimuli for Experiment 3. Sentences were recited by a female talker in an anechoic room and recorded. During playback, the recorded sentences were played through a source loudspeaker in Room 10B (RT60 = 0.9 s for speech frequencies) and recorded with cardioid microphones. The cardioid microphone recordings were played over headphones to listeners in Experiment 3.

Each recording was then played through a laptop sound card, amplified (TransNova P1500 H32 power amplifier, Port Coquitlam, Canada), and transduced by a single-driver, 3-inch loudspeaker (Cambridge Soundworks, North Andover, MA) in Room 10B, a large laboratory space, with tiled floor and concrete ceiling and walls, that has been well characterized acoustically (Hartmann et al., 2005; RT60 = 0.9 s at speech frequencies). A small-diameter loudspeaker was used in order to simulate the radiation pattern of a human talker’s mouth. The recording condition was varied: the loudspeaker was equally distant from both microphones, either 2 m or 3 m away, and a hard plastic “head” was either present or absent between the recording microphones to control diffraction (Firestone, 1930). It was expected that listeners would perceive more room effect in the 3-m conditions, because less direct sound from the loudspeaker reaches the microphones compared to the 2-m conditions: going from a loudspeaker-to-microphone distance of 2 m to 3 m corresponds to a 3.5 dB reduction in the direct-to-reverberant power ratio. This is expected to be perceptible to listeners.
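To make the arithmetic explicit: treating the loudspeaker as an ideal point source (an assumed idealization), the direct intensity at the microphones falls as 1/r^2 while the reverberant level is roughly independent of position, so the change in direct-to-reverberant ratio in going from 2 m to 3 m is

    \Delta L_{D/R} = 10 \log_{10}\!\left[\frac{(1/3)^2}{(1/2)^2}\right] = 20 \log_{10}\!\left(\frac{2}{3}\right) \approx -3.5~\mathrm{dB}.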
Further, it was thought that diffraction of sound from a head would offer an advantage– meaning stronger squelch of room effect in this context– compared to no head. The presence of a head leads to enhanced (frequency-dependent) binaural differences and also to (frequency-dependent) spectral filtering.

A photograph of the recording microphones placed at the location of the plastic head’s “ears” is shown in Fig. 2.3. Recorded waveforms were digitized (Zoom recorder) and equalized for overall level according to root-mean-square (RMS) amplitude. Table 2.4 summarizes the presentation conditions in each of the four playlists.

Figure 2.3: During playback of the phonetically-balanced sentences through a loudspeaker (not shown) in Room 10B, two cardioid microphones were located on opposite sides of a plastic “head.” The plastic head simulated head diffraction. Recordings from the cardioid microphones were played over headphones to listeners in Experiment 3.

Four phonetically-balanced sentences
Listening mode    Distance (m)    Head diffraction
binaural          2               head
binaural          2               none
binaural          3               head
binaural          3               none
diotic            2               head
diotic            2               none
diotic            3               head
diotic            3               none

Table 2.4: Summary of the eight presentation conditions in Experiment 3. A ninth condition was anechoic.

Listener instructions for this experiment were identical to those in Experiment 2. At MSU, listeners again used an iPod Touch and listened over the same headphones as used in Experiments 1 and 2. At UL, listeners used a PC to control stimulus presentation and listened over Beyerdynamic DT-990 Pro headphones (Beyerdynamic, Heilbronn, Germany).

2.4.2 Results

No significant difference was found between the UL and MSU listener responses, so the datasets were combined.4 Four of twenty-one (19%) listeners reported anti-squelch, as indicated in Fig. 2.4a. These four listeners (C, D, M, R) gave diotic presentation lower rankings, reporting less room effect, compared to binaural presentation. Two listeners (F and Q) gave very similar rankings for diotic and binaural. The fifteen remaining listeners ranked binaural lower than diotic, indicating binaural squelch. As noted in Fig. 2.4a, the effect was significant (Wilcoxon signed ranks test, p = 0.038). Figure 2.4b reveals a highly significant effect of the distance between the recording microphones and the source loudspeaker. All but two listeners (H and R) ranked the 2-m presentations lower than the 3-m, indicating less perceived room effect at the smaller source distance (p < 0.0001). There was little to no effect of the head, as can be seen from the similarity of the “Head” vs. “No head” ranks in Fig. 2.4c (p = 0.169).5

4The smallest p-value across the eight conditions was p = 0.319.
5The magnitude of the binaural squelch (difference between the binaural and diotic means) was about half a ranking point larger for conditions in which the dummy head was present than for conditions with no head, and about 0.4 ranking points larger for the 2-m distance than for 3 m.

Figure 2.4: Results of Experiment 3: rankings by 21 listeners of phonetically-balanced sentences averaged over four playlists. Results of Wilcoxon signed rank tests (z-scores and p-values) for group differences as a function of listening condition are shown in each panel. (a) Most listeners displayed binaural squelch, but four listeners displayed anti-squelch. (b) All but two listeners ranked the 3-m source distance as having more room effect than the 2-m distance. (c) There was no significant effect of having a plastic head between the two recording microphones.
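The RMS level equalization mentioned in the Methods can be sketched as follows; the file names and the reference amplitude are hypothetical, with the reference chosen low enough to avoid clipping.

    % Equalize recorded conditions for overall level by RMS amplitude.
    files = {'cond_2m_head.wav', 'cond_3m_nohead.wav'};   % hypothetical names
    targetRms = 0.05;                                     % common reference
    for i = 1:numel(files)
        [x, fs] = audioread(files{i});
        g = targetRms / sqrt(mean(x(:).^2));   % gain to reach the common RMS
        audiowrite(files{i}, x * g, fs);       % overwrite with equalized file
    end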
2.4.3 Discussion

From an experimental standpoint, the evaluation of room effect is difficult because it depends on perceptual experiences, which are subjective and highly individual. Unlike sound source localization, there is no correct answer. Unlike pitch or loudness perception, there is no generally recognized quantitative scale. In Experiment 3, fifteen (71%) listeners reported binaural squelch, which is consistent with the results of Experiment 2 and with Koenig’s 1950 observations.

There are additional points to take away from the collective results of Experiments 1–3. First, binaural squelch was not ubiquitous across listeners. In Experiment 3, six listeners (nearly 30%) did not report binaural squelch, as shown in Fig. 2.4a. Two-thirds of the listeners in Experiment 1 also reported experiences inconsistent with binaural squelch, although the results of Experiment 1 are likely confounded with the onset of spatial effects.

It is quite evident that squelch is highly sensitive to experimental methodology– specifically, how an experiment is conducted and how responses are elicited from listeners. The experimental design can change the focal point of a listener’s attention, which can effectively reduce or enhance the perception of room effect. In Experiment 1, it is likely that the act of switching directed listeners’ attention to the onset of spatial percepts, leading to increased reports of room effect for binaural presentation. Increased perceived room effect, including envelopment, is consistent with an increased apparent room size upon binaural presentation, as reported in Question 3 of the questionnaire (Fig. 2.1). Experiment 2, which used a ranking response protocol and incorporated a richer variety of presentations, observed binaural squelch for all listeners– even those listeners who had reported experiences consistent with anti-squelch in Experiment 1. This sensitivity to experimental design was not anticipated, as the binaural advantage in reverberant conditions has historically been touted– both in Koenig’s experiment (1950) and in speech intelligibility experiments, which used different experimental methods and assessments of listeners’ experiences– as being quite robust.

The reverberation time of a room, which depends on the room properties, is also thought to play a role in directing the listener’s attention. Spectral distortion and reverberation may be more physically prominent in some rooms. The exact role of attention in anti-squelch is likely to depend on the ratio of reverberant to direct sound, but this information is not available for the basement in Experiments 1 and 2, or for Koenig’s room environment.

It is worth noting that coloration and reverberation are placed under the collective umbrella of ‘room effect’ here, but it is unknown which is more perceptually important. Allen et al. (1977) posited that their relative importance depends on source and receiver locations, but Flanagan and Lummis (1970) claimed that it depends on room size and reverberation time. Whatever the case may be, it is likely that the basement and Room 10B had different physical levels of coloration– they almost certainly had different reverberation times, though this is difficult to state for certain without a reverberation time for the basement.
Given the dependencies mentioned above, perhaps it should not be surprising that only a modest binaural effect was observed: the difference in ranking for presentation type (diotic − binaural) was smaller than the difference in ranking for distance (3 m − 2 m). These mean differences and their standard deviations were respectively 0.88 (1.57) and 2.15 (1.15). Given that the distance effect on the direct-to-reverberant ratio is only 3.5 dB, it is clear that Experiment 3 revealed only a relatively small binaural effect. This result is consistent with the MDS finding by Ellis et al. (2015) that sound source distance is a more salient cue for listeners’ perceptions of stimulus-pair similarity than interaural cross-correlation, which they identified with binaural squelch.

The collective results of Experiments 1–3 imply that squelch as experienced in everyday life is not a strictly binaural effect. Further, attempts to demonstrate binaural squelch were not entirely successful. An important difference between binaural presentation in these experiments and “everyday” listening is in the head related transfer functions (HRTF). It is plausible that the squelch (or reduction) of room effect depends on the precise scattering of sound from a listener’s individual head, torso, and outer ear anatomy. The failure of the plastic head in Experiment 3 to make a significant difference in listener rankings is consistent with an emphasis on “individual.” Experiment 3, with twenty-one listeners spanning two institutions, found persuasive evidence that binaural squelch is real, but also that it is a highly individual effect. Some listeners may be more sensitive than others to enhanced spatial sensation in binaural presentation, given non-individualized HRTFs. In everyday listening conditions, the squelch of room effect seems to be a universal experience. Most of the listeners in Experiment 3 reported less room effect in binaural listening– the natural listening mode, which tends to reduce room effect in everyday life– but the reduction was smaller than the squelch of everyday life. This may have been due to the absence of a realistic HRTF for each listener in Experiment 3.

Chapter 3

Acoustical representation of a listener’s anatomy: The HRTF

The present chapter describes an experimental technique that enables individualized stimulus delivery to human listeners. The first section (3.1) introduces the head related transfer function (HRTF) and its importance in psychoacoustics. The second section (3.2) describes the measurement technique used to determine the HRTF for a listener. In the final section (3.3), a psychoacoustical validation of the HRTF measurement technique is presented.

3.1 Introduction

The HRTF encodes reflections and diffraction of sound from a listener’s torso, head, and pinnae. These reflections can be highly individualized– that is, they are sensitive to a listener’s unique anatomy. This is simple to understand from a physical perspective: if the wavelength of sound is on the order of, or smaller than, the physical dimensions of the head or pinna, the sound wave is sensitive to the anatomical fine structure, which naturally varies from person to person. Thus, the HRTF is essentially an acoustical fingerprint for a listener. Gumerov et al.
(2010) offer a helpful rule of thumb for relating specific anatomical structures to the HRTF: “Roughly speaking, the size of the head is important above 1 kHz, the general characteristics of the torso are important below 3 kHz, and the detailed structure of the head and pinnae becomes significant above 3 kHz, with the details of the pinnae itself becoming important at frequencies over 7 kHz.”

There are common features among HRTFs. For example, simulations by Cai et al. (2015) showed that torso reflections can cause frequency-dependent ripples in the ITD and ILD functions. Another common feature is an ear canal resonance1 that appears as a fairly broad peak and occurs in the 3–5 kHz frequency range (Mehrgardt and Mellert, 1977). The location of the maximum is listener-dependent. The next highest mode is in the 8–11 kHz frequency range. In general, the fine structure, i.e. the characteristic peaks and valleys in a HRTF, becomes highly individual above 8 kHz (Møller et al., 1995). Wightman and Kistler (1989a) observed maximum intersubject differences in the 7–10 kHz range, with a peak difference of 8 dB and a standard deviation of 7 dB. Further, standing waves in the ear canal can occur when waves reflected from the eardrum interfere with incoming sound waves, creating notches in the high-frequency range of the HRTF which are highly individual (Carlile and Pralong, 1994).

1The ear canal is often modeled as a long pipe with one end open, thus with its fundamental given by f1 = vs/(4L), where vs is the speed of sound in air and L is the length of the ear canal, which can vary among listeners. Subsequent modes are given by fn = n vs/(4L) for n = 3, 5, 7, ... .

Binaural recording techniques using small in-ear microphones have made it possible to measure high-resolution binaural HRTFs. An experimenter can then filter a stimulus with left- and right-ear HRTFs, and deliver the filtered binaural stimuli to a listener. In this way, a listener can listen to stimuli filtered through his or her own HRTFs (individualized condition) or through other listeners’ HRTFs (nonindividualized condition).

The fact that humans listen to sounds with their own ears (i.e. HRTFs) is important for several aspects of hearing. Information about sound source location in the vertical plane is thought to be encoded by the direction-dependent interactions of an incoming sound wave with the folds of the pinna. This acoustical filtering due to the pinna has been shown to be important in front-back localization of sound sources (Zhang and Hartmann, 2010; Blauert, 1969). Incidences of front-back and up-down errors in localization experiments were shown to increase significantly when listeners listened with nonindividualized HRTFs (Wenzel et al., 1993; Morimoto and Ando, 1980). Middlebrooks also showed that individualized HRTFs were important for accurate localization in the vertical and horizontal planes (Middlebrooks, 1999b). He attempted to reduce intersubject HRTF differences through a method of frequency scaling (Middlebrooks, 1999a). In localization experiments, own-ear performance was superior to the nonindividualized and scaled-nonindividualized conditions. Pinna cues are also known to be important in externalization, or perceiving a sound image outside the head (Hartmann and Wittenberg, 1996; Durlach et al., 1992; Wightman and Kistler, 1989b). Listening to stimuli that are filtered with nonindividualized HRTFs can lead to in-head localization, which is unnatural sounding.
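As a quick check of the quarter-wave model in the footnote above, take vs = 344 m/s and an assumed typical canal length L = 2.5 cm:

    f_1 = \frac{v_s}{4L} = \frac{344~\mathrm{m/s}}{4 \times 0.025~\mathrm{m}} \approx 3.4~\mathrm{kHz},

which falls in the 3–5 kHz range quoted above; the next mode, f_3 = 3 f_1 ≈ 10.3 kHz, likewise falls in the 8–11 kHz range.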
Table 3.1 summarizes observed advantages when listening with individualized HRTFs.

Area of acoustics              Observed advantage with individualized HRTFs
Front/back localization        fewer front/back errors
Vertical plane localization    fewer up/down errors
Externalization                accurate, punctate image
Room effect squelch (?)        necessary for squelch (?)

Table 3.1: Listeners benefit from listening with their own ears (i.e. individualized HRTFs): fewer localization errors occur, and externalization of sound images is optimal. It is hypothesized that individualized HRTFs may be necessary for room effect squelch.

It is hypothesized in this dissertation that the advantage provided by individualized HRTFs may extend to room effect squelch: that is, a listener may experience maximum squelch when listening through his or her own HRTFs. A perceptual experiment to test that hypothesis is described in Chapter 4, but first a detailed description is given of the HRTF measurement procedure in section 3.2. The mathematics for filtering a stimulus with an HRTF are also presented. Finally, an acoustical validation experiment using a manikin’s HRTFs is described in section 3.3.

Before moving on to section 3.2, the downside of providing individualized HRTFs is mentioned here. While individualized HRTFs often provide the best listening experience, measuring complete sets of HRTFs (i.e. from many different source directions) for a listener can be quite time-consuming. It is often impractical to measure individualized HRTFs for every listener. Ideally, there would exist a universal set of HRTFs that would provide realistic localization, externalization, and spatialization cues for any listener. Audio engineers have worked toward that goal by measuring HRTFs on a large number of listeners and creating databases which are publicly available (Warusfel, 2003; Algazi et al., 2001). However, comparisons of HRTFs from different databases often show significant deviations, depending on the particular excitation stimulus and measurement points used (Shaw, 1974). Even within a single database, significant deviations can result from inaccurate listener positioning or microphone placement (Møller et al., 1995; Wightman and Kistler, 1989a). Thus, while it is often adequate in audio engineering applications to utilize HRTFs from public databases, psychoacousticians typically must measure contemporaneous HRTFs.

3.2 Head-related impulse responses

The first subsection presents the requisite mathematical formulas for determining the impulse response of a linear time-invariant (LTI) system in the context of the maximum length sequence (MLS) measurement technique. The second subsection applies the MLS technique to measure the head-related impulse response (HRIR) of an anthropomorphic manikin in a moderately reverberant room to test the MLS technique.

3.2.1 Maximum length sequences

A MLS is a periodic binary sequence. It is generated by a linear feedback shift register that produces a series of 0 and 1 digital bits (Hartmann and Candy, 2006; Rife and Vanderkooy, 1989; Davies, 1966). The main idea is to create an N-bit register where each stage value is either a 1 or a 0. At specific stages, called taps, XOR logic is performed (Fig. 3.1, panel a). After each iteration, values move one stage further down the line. The last stage of the register is the register’s output, and it is fed back into the input and any taps before the next iteration. Figure 3.1b demonstrates a simple case for generating a 3-bit MLS.
Panel c shows the successive values of each stage in the shift register. The value in the last stage– namely, stage 3– is the output of the register and becomes a digit in the MLS. Note that starting with step 7, the values in the shift register begin to repeat. The MLS generated from this particular shift register is: 1100101. The length of the 3-bit MLS is therefore seven digits.

Figure 3.1: (a) Logical XOR truth table. If either A or B is 1 (but not both), then A XOR B is 1 (i.e. true). Otherwise, A XOR B is 0 (false). (b) A three-stage shift register with taps at stages 1 and 2 in its initial state (111). The output of stage 3 becomes a digit in the MLS. On the next step of the register, the output of stage 3 is fed back into the inputs of stages 1 and 2, and the original values of all stages are passed into the next stage. XOR logic is performed at all taps. (c) Successive values of each stage in the shift register.

The length, L, of the sequence is 2^N − 1, where N is called the order of the MLS. For a particular order, multiple sets of taps can exist, which in turn produce unique sequences. For example, various sequences of order 17 can be generated using taps at stages [1,12], [1,13], or [1,15]. Tables of taps for a particular order can be found in the literature (Hartmann and Candy, 2006; Vanderkooy, 1994). Typical order numbers range from 2 (2^2 − 1 = 3 samples) to 32 (2^32 − 1 = 4,294,967,295 samples). The temporal duration of a MLS depends on N and on the sampling frequency of the hardware. For a sampling frequency of 50 kHz (which is typical in acoustics experiments), this corresponds to durations of 60 µs (N = 2) and 23.86 hours (N = 32). Since the sampling frequency is usually a fixed parameter in a measurement setup, the order of the MLS should be selected such that the duration of the MLS is longer than the duration of the event under investigation, but short enough to minimize the impact of physical perturbations (e.g. fluctuations in air temperature and humidity, or incidental noises) within the time interval. For example, the reverberation time of most ordinary rooms is on the order of a second, so for a sampling frequency of 50 kHz a MLS of order 17 is an appropriate choice because it has a duration of (2^17 − 1)/(50 kHz) = 2.621 seconds. However, this would not be an appropriate choice for some cathedrals, which can have reverberation times of up to 7 seconds. In these cases, a MLS at least of order 19 is necessary: (2^19 − 1)/(50 kHz) = 10.486 seconds.

Figure 3.2: The first 100 samples of MLS [17:1,15], which corresponds to 0.002 seconds at a sampling frequency of 50 kHz. Values are binary: either 1 or −1, for AC-coupled systems. Successive values are connected to guide the eye.

A MLS can be used to determine the response of any linear time-invariant system. For acoustical transfer function measurements, the 1 state is mapped to a −1 level and the 0 state to a +1 level to produce a sequence that is symmetrical about zero, which is appropriate for AC-coupled systems. Figure 3.2 shows the first 100 samples of MLS [17:1,15], which corresponds to 0.002 seconds at a sampling frequency of 50 kHz. A MLS is a white noise, and as such it has a flat magnitude spectrum and a pseudorandom phase spectrum with uniform probability density over [−π, π]. Thus, a MLS is an apt excitation stimulus when the broadband frequency response of a system is desired. A code sketch of the shift-register generation procedure appears below. Crucial features of a MLS are its periodicity and near-delta-function circular autocorrelation.
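The following Matlab sketch implements the register mechanics described above: stage 1 receives the fed-back output, and each additional tap stage receives its predecessor XORed with the output. With order = 3 and taps = [1 2] it reproduces the sequence 1100101 of Fig. 3.1; tables of taps that yield maximal sequences are in the references cited above.

    % Generate one period of a MLS from a linear feedback shift register.
    order = 17;  taps = [1 15];          % MLS [17:1,15]
    L = 2^order - 1;                     % sequence length: 131071 digits
    reg = ones(1, order);                % initial state: all stages set to 1
    seq = zeros(1, L);
    for n = 1:L
        out = reg(end);                  % last stage is the register output
        seq(n) = out;
        new = [out, reg(1:end-1)];       % shift; output fed into stage 1
        for t = taps(taps > 1)
            new(t) = xor(reg(t-1), out); % XOR feedback at the input of stage t
        end
        reg = new;
    end
    mls = 1 - 2*seq;                     % map 0 -> +1 and 1 -> -1 (AC-coupled)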
The expression for the circular autocorrelation, or periodic impulse response (hyy), of a MLS (y0) is given in Eq. 3.1: hyy[n] = L−1(cid:88) l=0 1 L y0[l] y0[l + n] (3.1) When evaluated, the autocorrelation for the MLS is: 1, − 1 L, 0 < n < L hyy[n] = n = 0 (3.2) Autocorrelation of MLS [17:1,15] is shown in Fig. 3.3. It was calculated in Matlab (version 8.5.0, R2015, The Mathworks Inc., Natick, MA) according to Eq. 3.1. The maximum ampli- tude was 1 and occurred at n = 0, and the DC offset was −1 131071 = 7.63× 10−6, as predicted by Eq. 3.2. Figure 3.3: Autocorrelation of MLS [17:1,15] was calculated in Matlab via Eq. 3.1. It was shown to satisfy Eq. 3.2. 39 To find the periodic impulse response of a device under test (DUT), which is assumed to be a LTI system, the cross-correlation of the device’s output (x) and the MLS excitation signal (y0) is calculated according to Eq. 3.3: L−1(cid:88) l=0 hxy[n] = 1 L x[l] y0[l + n] (3.3) where hxy[n] and x[n] are bolded to emphasize that they are physically-occurring quantities. In contrast, invented or computed quantities like the MLS, y0, appear in plain italic text. This convention which will be applied throughout the chapter. Under the assumption that the true impulse response of the DUT decays to a negligible value over the duration of the MLS, then the periodic impulse response is equivalent to the impulse response. This is why it is important to use a MLS that has a longer duration than the DUT’s response. The discrete Fourier transform (DFT) of the impulse response, Hxy or H for brevity, is the transfer function. It is calculated from h according to Eq. 3.4: L−1(cid:88) H[k] = hxy[n] e2πikn/L (3.4) n=0 where k indicates the spectral component, i.e. frequency. Note that H is a complex vector, with amplitudes and phases given by Eqs. 3.5: (cid:113) (cid:18)Im(H) |H| = (cid:19) Re(H) Re(H)2 + Im(H)2 φ[k] = arg on the interval [−π, π] (3.5a) (3.5b) Spectral amplitudes, |H[k]|, are often converted to a decibel scale (dB) for convenience. 40 Let A[k] = |H[k]| and Amax = max(A[k]) for brevity. Then, the decibel amplitudes are calculated according to Eq. 3.6: dB[k] = 20 × log10 A[k ] Amax , (3.6) therefore all decibel values are non-positive. Advantages of the MLS technique The MLS technique can provide very good noise immunity. Due to periodicity of the MLS, multiple periods of the measured response (x) can be averaged together prior to calculating the cross-correlation. This reduces random noise fluctuations that can be present in x. Further, clicks, pops, and random noise are “transformed into benign noise distributed evenly over the entire periodic impulse response” (Rife and Vanderkooy, 1989). Dunn and Hawksford (1993) later showed through simulations that in fact the effect of distortion may be unevenly spread through the impulse response in the case of odd- ordered nonlinearities. In general, nonlinearities caused a gain change and raggedness in the magnitude response. Second-order nonlinearity resulted in “spikey” or “lumpy” tails in the impulse response. They suggested truncating the impulse response in order to enhance distortion immunity. Vanderkooy (1994) noted that the positions of low-order distortion artifacts in impulse responses depend on the particular MLS sequence used (i.e. constant N with different taps). By comparing measurements with two different MLS sequences, distortion “spikes” could be identified and removed. 
Bradley (1996) suggested using a low output signal level to minimize the effects of loudspeaker distortion on the impulse response (albeit at the cost of reduced noise immunity). The point is that, in general, the MLS technique demonstrates excellent noise and distortion immunity insofar as the effects are evenly distributed across the impulse response. For some cases of distortion this does not hold, and the community has come up with several ways to address that, including truncation of the impulse response, using a MLS with different taps, and reducing output signal levels.

In addition to a MLS, there are several other options for the excitation stimulus that can be used to determine a DUT’s response. One option is a click and another is a sine tone. In principle they should yield equivalent responses, though the attainable signal-to-noise ratio (SNR) or frequency resolution may differ among the stimuli. There are advantages and disadvantages of a MLS as an excitation stimulus compared to a click impulse or sine tone steps. Compared to a click, which is the archetypal excitation stimulus for impulse response measurements, a MLS provides a superior SNR for equal stimulus powers. This is because the signal energy in a MLS is spread evenly over a longer period of time than in a click impulse. So while a click may be an attractive stimulus in the sense that it directly yields a DUT’s impulse response– unlike the MLS, which requires calculation of a cross-correlation– limitations on the attainable SNR impede the click’s practical utility. This is particularly relevant to the noisy environments that are likely to be encountered in ordinary rooms.

An alternative excitation stimulus is a sine tone. A series of sine tones, or steps, can be played sequentially to obtain the DUT’s response at the tones’ frequencies. An advantage of the sine step method is that, similar to the click, it directly yields the transfer function and, further, it allows matched filtering to be done on the response, thus yielding excellent SNR. The disadvantage, however, is the limited frequency resolution– each frequency component requires a step. Thus, due to practical time limitations, one cannot achieve the same frequency resolution as with the MLS technique. As a check on the MLS technique, the responses from the sine step and MLS methods are compared in section 3.2.2.

3.2.2 MLS technique: validation experiments

The MLS technique was discussed generally in subsection 3.2.1 as a means to determine the impulse response of a DUT. The current subsection describes validation experiments that were conducted to demonstrate that the MLS technique could specifically be used to determine HRIRs on an anthropomorphic acoustical manikin in the PLab, which is a room with variable acoustics. The PLab is a rectangular room with dimensions 4.3 × 5.5 × 3.0 meters. The ceiling is acoustical tile and the floor is vinyl tile. The RT60 of the PLab was 0.760 s for the 0.5–4 kHz frequency band (measured using 1/3-octave band noise). All measurements used an MLS of order 17, which has a length of 2^17 − 1 = 131071 samples. At a sampling frequency of 48828.125 Hz, 131071 samples corresponds to 2.684 seconds, which was significantly longer than the reverberation time of the room. In the following sub-subsections, example raw data and their corresponding HRIR and HRTF pairs are presented. Then the effect of different MLS taps on the HRIR and HRTF is examined.
In the following sub-subsections, example raw data and their corresponding HRIR and HRTF pairs are presented. Then the effect of different MLS taps on the HRIR and HRTF is examined. Finally, the HRTF determined from the MLS technique is compared with one determined from the sine step method.

Measurements on a manikin to determine HRIR and HRTF

The test subject was KEMAR, an anthropomorphic manikin (Knowles Electronics Manikin for Acoustic Research, Model 45BC, G.R.A.S. Sound and Vibration, Twinsburg, OH). The manikin consisted of a "head," "torso," and "ears," all of which had average human dimensions. The manikin wore a thick cotton t-shirt for attire. The advantage of using KEMAR was that it had internal microphones at the positions of the "eardrums," which could be used for validation purposes.

Hammershøi and Møller (1996) showed that blocking the auditory meatus during HRIR measurements avoids the ear canal resonances while still capturing complete spatial information in the acoustical waveforms reaching the ears. To that end, a small electret microphone (4 mm diameter, CUI Inc., Tualatin, OR) was snugly inserted into an EAR plug such that the microphone face was flush with the outside of the EAR plug. It was then placed into KEMAR's left "ear canal" until it was flush with the "canal" entrance. The process was repeated for the right "ear."

A schematic of the measurement setup is shown in Fig. 3.4. The procedure was as follows: (i) Play-out of the MLS was done through the digital-to-analog converter (DAC) channels of a TDT System 3 RP2.1 processor (Tucker-Davis Technologies, Alachua, FL) at a sample rate of 48828.125 Hz. (ii) From the RP2.1, the MLS signal was low-pass filtered in a TDT System 2 FT-6 filter with a cutoff frequency of 20 kHz. The differential phase introduced by the FT-6 was less than 0.5 µs, significantly less than the just-noticeable perceptual shift of 20 µs. From the FT-6 the signal was (iii) converted from unbalanced to balanced line (S-Convert, Samson, Hicksville, NY) and (iv) sent to a powered loudspeaker (Mackie HR824 Studio Monitor, LOUD Technologies, Woodinville, WA) located 3.77 m from the center of KEMAR's "head." The signal level was 72 dBA just outside KEMAR's left and right "ears," as measured during calibration using a sound level meter. (v) While the MLS played from the loudspeaker, recordings were made with the electret microphones in the left and right "ears." A homemade circuit worn around KEMAR's "neck" amplified the microphone signals (+60 dB), which were then (vi) converted from balanced to unbalanced line, (vii) low-pass filtered by the FT-6 (fcut = 18 kHz), and (viii) digitized by the RP2.1 analog-to-digital converter (ADC) channels and recorded. Note that two periods of the MLS were played: the first to build up acoustical energy in the PLab, and the second for making the recordings.

Figure 3.5 shows an example of raw data from the electret microphones at the (a) left and (b) right "ears." Six recordings were averaged to reduce random noise fluctuations.

Figure 3.4: Detailed schematic diagram of the measurement setup in the PLab for determining KEMAR's HRIRs from electret microphone recordings, xL and xR. This setup, or variants of it, was used for all measurements described in this chapter.

Figure 3.5: Recordings in KEMAR's (a) left and (b) right "ears" when the 131071 samples of the MLS were played from the Mackie loudspeaker at a rate of 48828.125 Hz, giving a duration of 131071 samples / 48828.125 samples s⁻¹ = 2.68433408 s. The recordings are cross-correlated with the MLS to obtain the head-related impulse responses, hL(t) and hR(t).
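The averaging step can be sketched in one line of Matlab (hypothetical variable names; not the dissertation's code). If the six recorded MLS periods are stored as the columns of an L × 6 matrix xRec, then:

```matlab
% Average the six recorded periods: the MLS-locked signal is unchanged,
% while random noise power drops roughly by a factor of six.
x = mean(xRec, 2).';   % averaged period as a row vector
```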
Cross-correlation of the MLS and the recording at the entrance to KEMAR's right "ear canal" was calculated in Matlab using Eq. 3.3. Shown in Fig. 3.6a is the cross-correlation (hR), hereafter referred to as the HRIR, from 0 to 0.100 s (n = 4883 samples). The large peak in the HRIR indicates the arrival of the direct sound at the "ear." The peak occurred at a lag of 0.013 s in the right "ear." The lag can be estimated from the acoustical delay, which is the time it takes sound waves to travel from the loudspeaker to the microphone. At a distance of 3.8 m, this corresponds to an acoustical delay of

3.8 m / (344 m/s) = 0.0110 s,

where 344 m/s is the speed of sound in air. Further, the RP2.1 processor has a constant delay of 95 samples, corresponding to

95 samples / (48828.125 samples/s) = 0.002 s.

Thus, the acoustical and processor delays account for the lag. The lag in the left "ear" was identical, which was expected because the source loudspeaker was located at 0° azimuth and thus equidistant from the "ears." If the source were placed at a non-zero azimuth, the lags would be expected to differ for the left and right "ears" due to interaural differences.
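The expected lag can be verified in a few lines of Matlab (a minimal sketch; the numbers are those quoted above):

```matlab
fs      = 48828.125;         % sample rate (Hz)
c       = 344;               % speed of sound in air (m/s)
d       = 3.8;               % loudspeaker-to-microphone distance (m)
tAcoust = d / c;             % acoustical delay, ~0.0110 s
tProc   = 95 / fs;           % RP2.1 processor delay of 95 samples, ~0.002 s
tLag    = tAcoust + tProc;   % ~0.013 s, matching the observed peak lag
```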
Subsequent peaks in the HRIR (up to roughly 0.050 s after the direct sound) indicate early reflections from the room and from anatomical structures. The earliest reflections were likely from the floor and nearby walls. Finally, the remaining 'tail' of the HRIR (0.050 < t < 0.100 s after the direct sound) was due to reverberation. Unlike discrete reflections, reverberation has a stochastic structure and is uncorrelated with the direct sound; further, reverberation in the left and right ears is mutually uncorrelated. The decay rate of the reverberant tail depends on the particular reverberation time (RT60) of the room (which in turn depends on frequency). For the PLab, the RT60 was 0.760 s for the 0.5−4 kHz frequency band. Discrete reflections arriving more than 0.005 s (for clicks) or 0.050 s (for speech) after the direct sound are perceived as distinct echoes rather than as reverberation.

The HRIR shown in Fig. 3.6a was converted to the frequency domain using Matlab's Fast Fourier Transform (FFT) function. Spectral amplitudes were converted to the decibel scale according to Eq. 3.6. The result is the HRTF, which is plotted on a linear frequency scale in Fig. 3.6b. HRTFs are traditionally plotted on a logarithmic frequency scale, which is shown in panel c. The left "ear" was similar and is not shown.

Figure 3.6: (a) Cross-correlation of the MLS and the measured signal at the entrance to KEMAR's right "ear canal," calculated according to Eq. 3.3, yielded the right "ear" HRIR (hR), shown here. The measurement was made with electret microphones embedded in EAR plugs blocking the manikin's "ear canals." Fourier transform of the HRIR in (a) yields the HRTF, which is plotted on a linear frequency scale in (b) and a logarithmic scale in (c). Amplitudes were converted to the decibel scale.

Different MLS sequences

A MLS of a particular order can be generated using different sets of taps, resulting in unique sequences that share the same length, L. An experiment was conducted to verify that the choice of taps did not affect the HRIR or HRTF. Three distinct sequences (all of order 17) were used to determine the manikin's HRIR. Taps were located at stages 1 and 12 ([17:1,12], where the number before the colon indicates the MLS order and the numbers after it indicate the locations of the XOR taps), at stages 1 and 13 ([17:1,13]), or at stages 1 and 15 ([17:1,15]).

The HRIR and HRTF for MLS [17:1,15] were measured previously and are shown in Fig. 3.6 for the right "ear." The measurement procedure for MLS [17:1,13] and [17:1,12] was identical to that used for MLS [17:1,15]. To the eye, the HRIRs looked identical and are not shown; details are more easily seen in the HRTFs. Figure 3.7 shows the right "ear" HRTFs for the (a) 0.15−1.5 kHz and (b) 1.5−15.0 kHz frequency ranges. Inspection of the different lines shows that they are essentially identical. Thus, the choice of specific taps was not important for the HRTF. Deep spectral notches were present in all three HRTF spectra. These notches occurred at the same frequencies for each set of taps, suggesting that they were indeed real (presumably due to anatomical reflections and/or the room environment) and not artifacts limited to a particular sequence. The left "ear" was similar and is not shown.

Figure 3.7: Comparison of right "ear" HRTFs measured using different taps for the MLS. (a) 0.15 to 1.5 kHz frequency range; (b) 1.5 to 15 kHz frequency range. [17:1,13] is offset by −15 dB and [17:1,12] by −30 dB for visual clarity. Results for the left "ear" are similar and are not shown.

MLS vs. sine step techniques

Experiments were conducted to verify that the choice of excitation stimulus– sine tone versus MLS– did not affect the HRTF. The advantage of the sine step method was that it directly yielded H, whereas with the MLS technique the cross-correlation had to be calculated (Eq. 3.3) to obtain h, which was then Fourier-transformed to get H. The disadvantage of the sine step method was that it required substantially longer measurement times than the MLS method, because only one frequency component could be measured at a time. Consequently, due to time limitations, the number of frequency components, and thus the resolution, was significantly lower for the sine step method.

In the first part of the sine step experiment, 168 frequencies from 100 Hz (f0) to 1012.6 Hz were scanned in half-semitone intervals.² In the second part of the experiment, an additional 168 frequencies from 1000 Hz (f0) to 10126 Hz, also with half-semitone spacing, were scanned. Thus, in total 336 sine steps (i.e., frequency components) were measured. Each step required approximately 1 second to ensure that an adequate number of cycles was measured for a good SNR. In contrast, with the MLS technique all frequency components were measured in a single period. For a MLS of order 17, with 2¹⁷ − 1 = 131071 components and a RP2.1 sample rate of 48828.125 Hz, the measurement duration was 2.684 seconds, with a constant frequency spacing of δf = 1/(2.684 s) = 0.3725 Hz.

Sine tones were generated using the TDT System 3 RPVdS software. The rest of the measurement procedure was identical to that for the MLS technique (Fig. 3.4): in short, each tone was played out from the RP2.1 DAC and through the source loudspeaker (Mackie HR824 studio monitor). Recordings xL and xR were made using electret microphones in KEMAR's EAR-plugged "ears," amplified, and digitized via the RP2.1 ADC.

²One cent = 1/1200-th of an octave (where an octave is a doubling in frequency: f_octave = 2 × f0). A semitone, or half-tone, is 100 cents; thus, an octave comprises 1200/100 = 12 semitone intervals. A half-semitone, or quarter-tone, is 50 cents and divides an octave into 24 intervals, with spacing given by fn = f0 × 2^((50/1200)×n), where n = 1, 2, 3, ..., 24.

Because noise could have introduced frequencies other than f_tone into the recorded signal, matched filtering was applied to the recordings (xL and xR) during post-processing, according to Eqs. 3.7:

$$ s_L = \sum_{t=0}^{\tau} \mathbf{x}_L(t)\,\sin(2\pi f_{\mathrm{tone}} t) \quad\text{and}\quad s_R = \sum_{t=0}^{\tau} \mathbf{x}_R(t)\,\sin(2\pi f_{\mathrm{tone}} t) \qquad (3.7a) $$

$$ c_L = \sum_{t=0}^{\tau} \mathbf{x}_L(t)\,\cos(2\pi f_{\mathrm{tone}} t) \quad\text{and}\quad c_R = \sum_{t=0}^{\tau} \mathbf{x}_R(t)\,\cos(2\pi f_{\mathrm{tone}} t) \qquad (3.7b) $$

The summation was done over an integer number of periods, corresponding to approximately one second (τ). The calculation was done for each of the 336 frequency components. Orthogonality of the sine (and cosine) functions eliminated the undesired noise components.³ Thus, matched filtering is a very effective way to reduce noise when tones are used as the excitation stimulus. Finally, spectral amplitudes were calculated according to Eq. 3.8:

$$ A = \sqrt{s^2 + c^2} \qquad (3.8) $$

³e.g., $\int_{-\pi}^{\pi} \sin(nx)\sin(mx)\,dx = \begin{cases} 0, & m \neq n \\ \pi, & m = n \end{cases}$
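The matched filtering of Eqs. 3.7 and 3.8 amounts to projecting each recorded segment onto quadrature reference tones. A minimal Matlab sketch for one tone and one ear follows (hypothetical variable names, not the dissertation's code; ftone and the row-vector recording xL are assumed given):

```matlab
fs   = 48828.125;                      % sample rate (Hz)
nCyc = floor(ftone);                   % whole tone cycles spanning ~1 second
tau  = round(nCyc * fs / ftone);       % samples in an integer number of periods
t    = (0:tau-1) / fs;                 % time axis (s)
sL   = sum(xL(1:tau) .* sin(2*pi*ftone*t));   % Eq. 3.7a
cL   = sum(xL(1:tau) .* cos(2*pi*ftone*t));   % Eq. 3.7b
AL   = sqrt(sL^2 + cL^2);                     % Eq. 3.8, amplitude at ftone
```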
The resulting HRTF for the right "ear" is shown as the dashed line in Fig. 3.8. The left "ear" HRTF was similar and is not shown. As is evident in the figure, the HRTFs from the sine step and MLS techniques were qualitatively very similar: when a local minimum or maximum occurred in the MLS HRTF (solid), it was also seen in the sine-step HRTF (dashed). The primary difference between the spectra was in the depths of the spectral valleys. The MLS valleys were 10−15 dB (or more) deeper than their sine-step counterparts. The width of the valleys was on the order of 0.5 Hz. Only the MLS technique, with its 0.37 Hz frequency spacing, was able to fully map these valleys. In contrast, the sine-step frequency spacing (half-semitone intervals) was not fine enough to capture the narrow valleys, particularly in the 1−10 kHz range (panel b).

Figure 3.8: Comparison of HRTFs for KEMAR's right "ear" using the sine step (dashed line) vs. MLS (solid line) methods. (a) 0.1−1.0 kHz frequency range; (b) 1−10 kHz frequency range. In both (a) and (b) it is apparent that the sine-step method qualitatively reproduced the HRTF from the MLS method. The main difference was the greater depth of the valleys in the HRTF from the MLS method.

To facilitate better visual comparison of the two spectra in the 1−10 kHz range, a constant-Q smoothing function was applied to the HRTF from the MLS technique. The function weighted neighboring frequencies f′ with a Gaussian distribution, according to Eq. 3.9:

$$ |\tilde{H}(f)| = \sum_{f'=f_0}^{f_{\max}} |H(f')|\, e^{-(f-f')^2/(C f^2)} \qquad (3.9) $$

where C = 0.02 and f0 = 1000 Hz. The constant C was chosen empirically. Results of the smoothing are shown in Fig. 3.9. The smoothed HRTF shows excellent agreement with the HRTF from the sine step method. Thus, the MLS technique can be used to obtain an accurate HRTF. Further, the HRTF was obtained with a much shorter measurement time and with far superior frequency resolution than was achievable with the sine step method. On a final note, Müller and Massarani (2001) warned against the MLS technique because of enhanced susceptibility to loudspeaker harmonic distortion compared to the sine step method. However, the close agreement of the spectra from the two techniques implies that minimal harmonic distortion was present during the MLS measurement.

Figure 3.9: Comparison of HRTFs for KEMAR's right "ear" after smoothing the spectrum from the MLS technique according to Eq. 3.9. The smoothed spectrum is indicated by the solid line, and the spectrum from the sine step method by the dashed line. The spectra are qualitatively very similar, indicating that the MLS technique can yield accurate HRTFs.
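A minimal sketch of the constant-Q smoothing of Eq. 3.9 follows, with hypothetical variable names (f is the frequency axis and A = |H| the MLS amplitude spectrum, as same-size vectors). Eq. 3.9 leaves any normalization implicit; dividing by the total weight, as done below, is one plausible choice that keeps the overall level unchanged:

```matlab
C  = 0.02;                     % empirical width constant of Eq. 3.9
As = zeros(size(A));           % smoothed amplitude spectrum
for k = 1:numel(f)
    w     = exp(-(f(k) - f).^2 / (C * f(k)^2));  % Gaussian weights about f(k)
    As(k) = sum(A .* w) / sum(w);                % weight-normalized average
end
```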
3.3 Reproducing a room's acoustical environment

The previous section (3.2) compared the MLS to other excitation stimuli for determining a HRTF, namely clicks and sine steps. Recall that a HRTF encodes complete spatial and spectral information due to acoustical filtering by a listener's unique upper-body anatomy. The MLS technique was shown to be capable of yielding a detailed and accurate HRTF. The current section (3.3) uses the HRIR (the time domain equivalent of the HRTF) to generate acoustically accurate binaural stimuli, referred to as 'synthesized' stimuli. When synthesized stimuli are presented to a listener, the waveforms reaching the eardrums ought to be identical to the waveforms that would reach the eardrums from an equivalent real sound source. Equivalence of synthesized and real-source stimuli is the foundation of binaural synthesis, a method of realistic stimulus presentation, typically over headphones, that is used during psychoacoustical experiments (Møller, 1992). Because binaural synthesis incorporates the acoustical filtering encoded in the HRIR, it can be used to present natural, externalized⁴ sound images to a listener. Further, the HRTF encodes pinna cues, which are important for determining source direction (e.g., 0° azimuth and 0° elevation) and thus lead to well-spatialized sound images.

Section 3.3.1 describes in detail how to compute synthesized stimuli using HRIRs. As an example, synthesized stimuli are generated using KEMAR's previously determined HRIRs (Fig. 3.6a shows the right "ear" HRIR, hR). The synthesized stimuli are then presented to KEMAR over headphones, and recordings are made at the "eardrums" using the manikin's internal microphones. The recorded spectra are compared to the spectra determined from an equivalent real sound source. Results indicate that binaural synthesis over headphones can be an accurate means of stimulus presentation.

The extent of realism can vary substantially and depends on the degree of recording detail. At one extreme, large-diaphragm condenser microphones can be placed outside a listener's ears to determine binaural HRIRs. In this case, the HRIRs encode filtering due to the room and the listener's head and torso. At the other extreme, tiny probe tube microphones (diameter < 1 mm) can be placed inside a listener's ear canals; the HRIRs then include filtering due to the room, the listener's head, torso, and outer ear, and even the ear canal. Electret microphones, placed just outside the entrances to the (blocked) ear canals, yield HRIRs nearly equivalent to those from the probe microphones. The difference is that in the blocked meatus condition, filtering due to the ear canal is not included in the HRIR. Hammershøi and Møller (1996) showed that blocking the auditory meatus when determining HRIRs avoids the ear canal resonances while still capturing complete spatial information in the acoustical waveforms reaching the entrances to the ear canals. So, whether one uses electret microphones in the blocked meatus condition or probe microphones in the (unblocked) ear canals depends on whether one wants to include the ear canal in the HRIR.

⁴Meaning the sound image is perceived as being outside the head, in contrast to a sound image perceived in the center of the head, which occurs in typical headphone listening.
Because the present goal is to do binaural synthesis with headphones, the blocked meatus condition is the more appropriate choice for these experiments.⁵

⁵The transfer function from headphone to eardrum includes the ear canal. If a probe microphone were used, the ear canal would be included in the transfer function twice: once from the probe microphone and once from the headphone.

3.3.1 Generating stimuli

The HRIR and HRTF are equivalent representations of all acoustical filtering in a particular room environment and for a listener's unique anatomy. The following discussion is initially framed in terms of the HRTF, because it is more straightforward to think about in a physical sense, and later in terms of the HRIR, because the necessary mathematical calculation is easier to think about in the time domain.

For linear time-invariant systems, the transfer function from the source loudspeaker to the eardrum at a particular frequency is independent of the excitation stimulus. Indeed, this was demonstrated in section 3.2.2 by comparison of the HRTFs from the MLS and sine step methods. This uniformity of response is crucial because it means that whether a listener listens to a MLS or to any other stimulus (e.g., a sine tone or even speech), the transfer function for any common frequency component is the same. Thus, the HRTF for a particular room configuration and listener position can be determined once, via the MLS technique, and subsequently applied to any number of preexisting stimuli, S0. The filtering encoded in the HRTF "roomifies and anatomizes" S0. It should be noted that S0 is often, but is not required to be, anechoic.

Once HRIRs are determined, the time domain signal, s0, is filtered by the binaural HRIRs (hL and hR) through linear convolution. Convolution, expressed as a discrete summation in Eq. 3.10, calculates the amount of temporal overlap between h and s0 as h is shifted, point by point, across s0:

$$ (h_L * s_0)[n] = \sum_{k=-\infty}^{\infty} s_0[k]\, h_L[n-k], \qquad (h_R * s_0)[n] = \sum_{k=-\infty}^{\infty} s_0[k]\, h_R[n-k] \qquad (3.10) $$

In practice the limits depend on the lengths of s0 and h: if s0 has length M and h has length L, then h ∗ s0 has length M + L − 1.

The simple case in which s0 was MLS [17:1,15] was examined; the first one hundred samples of MLS [17:1,15] are shown in Fig. 3.2. Figure 3.10a shows the time domain waveform (hR ∗ s0) when the right "ear" impulse response (Fig. 3.6a) was convolved with s0. The left "ear" was similar and is not shown. The convolved waveform is identical to what would be recorded in a listener's ears if s0 were played from the source loudspeaker in the room, assuming the room, source configuration, microphone placement, and listener's position were the same as when hL and hR were determined. The following subsection shows that the two waveforms are indeed essentially identical.

Figure 3.10: Result of convolving KEMAR's right "ear" HRIR, hR, with s0. (a) The time domain representation of the convolved signal, hR ∗ s0, computed from Eq. 3.10. (b) The Fourier transform of the convolved stimulus for the frequency range 0.1−1 kHz; the smoothed spectrum has also been plotted for convenience (offset by 25 dB). (c) Same as panel (b), but for the frequency range 1−10 kHz.
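In Matlab the synthesis step is essentially two calls to conv (a minimal sketch with hypothetical variable names):

```matlab
% Linear convolution of the anechoic signal with the binaural HRIRs
% (Eq. 3.10). With s0 of length M and HRIRs of length L, each output
% has length M + L - 1.
yL = conv(s0, hL);   % left-ear synthesized stimulus,  hL * s0
yR = conv(s0, hR);   % right-ear synthesized stimulus, hR * s0
```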
3.3.2 Acoustical validation

Binaural recordings were made using KEMAR's internal microphones to show that playing the convolved waveforms, hL ∗ s0 and hR ∗ s0, over headphones (the synthesized condition) yielded waveforms at the "eardrums" equivalent to those recorded when s0 was played from the source loudspeaker (the natural condition).

Figure 3.11: (a) Natural condition, in which s0 was played from the source loudspeaker and recorded at KEMAR's "eardrums" with the manikin's internal microphones (xL,n and xR,n). (b) Measurement setup to determine hL and hR; repeated from Fig. 3.4. (c) Binaural synthesis: headphone delivery of hL ∗ s0 and hR ∗ s0. Recordings xL,s and xR,s were made with KEMAR's internal microphones. For a successful binaural synthesis, xL,s = xL,n and xR,s = xR,n.

Natural condition

Figure 3.11a shows the experimental setup for the natural condition. The source loudspeaker and KEMAR positions were unchanged from previous measurements in the PLab (section 3.2.2): the source was located 3.8 m from the center of KEMAR's "head" at 0° azimuth. Stimulus s0 was played out through the TDT System 3 RP2.1 DAC, low-pass filtered (f_cutoff = 20 kHz), and played out through the source loudspeaker. While s0 sounded, recordings were made at KEMAR's "eardrums" using the manikin's internal microphones (xL,n and xR,n). The microphone signals were low-pass filtered (f_cutoff = 18 kHz), amplified (+48 VDC phantom power, AudioBuddy Dual Mic Preamplifier, M-Audio, Cumberland, RI), and digitized through the RP2.1 ADC. The recordings and their corresponding spectra, XL,n and XR,n, are shown as the thin lines in Fig. 3.12. These waveforms are the standard against which the synthesized condition will be compared.

Synthesized condition

First, hL and hR were determined by playing a MLS through the source loudspeaker. The procedure was described previously in section 3.2.2, and is depicted in Fig. 3.11b. Then, the convolved stimuli hL ∗ s0 and hR ∗ s0 (Fig. 3.10a shows hR ∗ s0) were played over headphones to KEMAR. Play-out was done through the TDT RP2.1, and the signal level was controlled by a headphone amplifier (MicroAMP Model HA400, Behringer, Willich, Germany). Signals were then played through Sennheiser HD600 circumaural headphones (Sennheiser, Wedemark, Germany) at a comfortable level. While the convolved stimuli played through the headphones, recordings were made with KEMAR's internal microphones (xL,s and xR,s). The signals were amplified, then digitized by the RP2.1 ADC. The recordings and their corresponding spectra, XL,s and XR,s, are shown as the thick lines in Fig. 3.12.

Figure 3.12: Recordings from KEMAR's internal microphones for the natural condition (xL,n and xR,n) are indicated by the thin lines. In the natural condition, s0 was played from the source loudspeaker and recorded at the "eardrums" (Fig. 3.11a). Recordings from the synthesized condition (xL,s and xR,s) are indicated by the thick lines. In the synthesized condition, s0 was convolved with the left and right "ear" HRIRs (hL and hR); the convolved stimuli were presented to KEMAR over headphones and recorded at the "eardrums" (Fig. 3.11c). Panels (a) and (b) show time domain signals. The remaining panels show spectral amplitudes for the frequency ranges (c,d) 0.15−1 kHz and (e,f) 1−10 kHz. For perfect binaural synthesis, XL,s = XL,n and XR,s = XR,n. Agreement is within a few dB until a steep drop-off of Xs at 9.5 kHz in both "ears."

Conclusions

In general, there is good agreement between the natural and synthesized signals at the "eardrums." The time domain signals (Fig. 3.12, panels a and b) indicate that the natural condition yielded noisier recordings. That is an anticipated result: in the synthesized condition, six periods of the MLS recording were averaged before the HRIRs were calculated, which led to cancellation of random noise fluctuations.
In contrast, the natural condition had no averaging, so it was noisier. Indeed, improved SNR during stimulus delivery is one of the advantages of the binaural synthesis method used here. This is an especially important advantage when considering conversational levels of speech, which may have peak levels of only about 55 dBA. Binaural synthesis enables delivery of speech stimuli to a listener with less noise than the natural condition.

Spectra of the natural and synthesized conditions largely agree up to 8 kHz. The large dips in the spectra at about 9.5 kHz in the left "ear" are attributed to coupling of the headphones and KEMAR's "ear canals." If the headphones were removed and repositioned, the dips would occur at different frequencies, because the physical coupling of the headphones and "ear canals" would have changed slightly. There is no straightforward or well-established method to deal with variation due to headphone positioning, and Chapter 5 discusses the issue in greater detail. At present, it is sufficient to point out that the binaural synthesis technique will be applied to speech, and most of the energy content of speech lies below 4 kHz. So, while headphones do not accurately reproduce the spectral content of a real source beyond about 8 kHz, the binaural synthesis technique described here is considered sufficiently accurate for investigating perception of room effect in a perceptual experiment (Chapter 4).

3.4 Conclusions

The maximum length sequence (MLS) was shown to be an efficient excitation stimulus for determining HRTFs in a mildly reverberant room, as pertinent to this study. A MLS can yield broadband HRTFs with fine frequency spacing. Measurement time is short, which facilitates averaging of multiple recording periods and leads to excellent SNR. Convolution of binaural HRIRs, which are the time domain equivalents of HRTFs, with any test signal s0 enables an experimenter to present essentially any stimulus to a listener over headphones. This includes stimuli that have been filtered with individualized and nonindividualized HRTFs. The next chapter describes application of the binaural synthesis technique to a perceptual experiment in which individualized and nonindividualized HRTF-filtered stimuli are delivered to human listeners.

Chapter 4

Room effect perceptual experiment: Listening through other people's ears

The current chapter applies the experimental methods from the previous chapter: binaural synthesis is used to conduct a perceptual experiment with human listeners. Necessary for a perceptually persuasive synthesis, that is, a well-externalized and well-spatialized sound image, is correct encoding of the acoustical filtering due to a listener's head, torso, and outer ear anatomy. All of this filtering is encapsulated in the HRTF; the time domain representation of the HRTF is the HRIR. When measured in a room, the HRIR also includes filtering effects due to the room. It was shown in Chapter 3 that binaural HRIRs can be used to generate synthesized stimuli that, when presented over headphones to KEMAR, accurately conveyed spatial and spectral information to the eardrums. Spectra were nearly identical to those observed in a natural room listening situation. Since the synthesized stimuli accurately conveyed spectral and temporal information, then by extension they should also accurately convey room effect, i.e., reverberation and coloration.
The room effect present in the synthesized stimuli should therefore match what would be present during stimulus presentation over a loudspeaker in the natural condition. This has far-reaching implications for an investigation of room effect perception, or squelch. Unfortunately, KEMAR cannot evaluate perceptual properties of stimuli, such as reverberation and coloration. To investigate room effect perception using binaural synthesis, humans must therefore be used as listeners. Binaural HRIRs from a human listener can be determined in a room; the HRIRs would include filtering due to the room and the listener's individualized anatomy. They could then be convolved with anechoic speech, the perceptual effect of the convolution being 'roomification' and 'anatomization' of the speech. As shown in the KEMAR measurements, the synthesized stimuli would be essentially identical to what a listener would hear in the natural listening scenario.¹

Further, different synthesized stimuli could be presented to a listener. Recall that a synthesized stimulus is the result of convolving binaural HRIRs (h) with an anechoic stimulus (s0). Any number of anechoic stimuli could be convolved with the HRIRs, leading to different synthesized stimuli. HRIRs could also be determined for several different source locations, and these could be convolved with s0, resulting in stimuli specific to the different source locations. Thus, the effect of different HRTFs on a listener's perception of room effect could be investigated. The HRTFs could be measured for several listeners at different source locations in a room (e.g., 2 m vs. 3 m, or 0° vs. 30° source azimuth). The different HRTFs could then be used to filter speech stimuli. Ultimately, a library of synthesized stimuli could be calculated and presented to listeners in a perceptual experiment. Listeners would listen to stimuli filtered with their own HRTFs, as well as to stimuli filtered with other people's HRTFs.

The first section of this chapter (4.1) shows results of binaural synthesis with KEMAR using speech as the stimulus (s0). Then the main experiment, a room effect perceptual experiment involving several human listeners, is presented in section 4.2.

¹i.e., listening to anechoic speech played out through a source loudspeaker in a room.

4.1 Binaural synthesis with KEMAR

To ensure sufficient room effect for a perceptual experiment, the measurement setup was moved to Room 10B of the Communications Arts and Sciences Building at MSU. This was a larger and more reverberant room than the PLab, where the Chapter 3 experiments had been conducted. Further, Room 10B has been used in several published experiments, and its acoustical properties have been well characterized (Hartmann et al., 2005; RT60 = 0.9 s at speech frequencies). For this reason, and also because speech was to be used as the anechoic stimulus s0, it was considered worthwhile to conduct another acoustical validation measurement using KEMAR. Details of the measurements are identical to those from Chapter 3 (section 3.3) but are briefly repeated here for convenience.

Natural condition

A diagram of the setup for the natural condition is shown in Fig. 3.11, panel a. Stimulus s0 was an anechoic recording of the sentence "Cats and dogs each hate the other."² The time waveform and frequency domain magnitude spectra of the entire utterance are shown in Fig. 4.1.
Stimulus s0 was played via the RP2.1 DAC (TDT System 3, fs = 48828.125 Hz) through the source loudspeaker, which was located 3 m from the center of KEMAR's "head" at an azimuth of 0°. Recordings were made by KEMAR's internal microphones located at the "eardrums" (AudioBuddy preamplifier, +48 VDC phantom power) and digitized by the RP2.1 ADCs. The recorded waveforms were xL,n and xR,n. Amplitude spectra, XL,n and XR,n, are indicated by the thin lines in Fig. 4.4, panels c and d. These waveforms are the standard against which the synthesized condition will be compared.

²Four phonetically balanced Harvard sentences (see Table 2.3) were recited in an anechoic chamber by a female talker standing 12 inches from a studio-grade microphone with dual 1-inch diaphragms (SHURE KSM44a, 'omnidirectional' setting, SHURE Inc., Niles, IL). The microphone output was boosted by a preamplifier (AudioBuddy Dual Mic Preamp, M-Audio, Cumberland, RI), which also supplied phantom power (48 VDC). The signal was then low-pass filtered (FT-6, fcut = 20 kHz) and digitized with TDT hardware (RP2.1 ADC, fs = 48828.125 Hz).

Synthesized condition

Recall that binaural synthesis involves two steps: the first is determination of HRIRs (Fig. 3.11, panel b), and the second is headphone presentation of the synthesized stimuli (panel c). To determine the HRIRs, a MLS ([17:1,15], with 2¹⁷ − 1 = 131071 samples, corresponding to 2.684 seconds at a sample rate of 48828.125 Hz) was played from the source loudspeaker. Recordings, xL and xR, were made with electret microphones. The microphones were encased in EAR foam plugs and had been inserted such that the microphones were flush with the "ear canal" entrances. Thus, the recordings included filtering due to the room, as well as from the manikin's "head," "torso," and "pinna" (but not the "ear canals"). The HRIRs (hL and hR) were calculated by cross-correlating the recordings with the MLS, and are shown in Fig. 4.2, panels a and b. The corresponding frequency domain representations, or HRTFs, are shown in panels c and d (0.1−1 kHz) and e and f (1−10 kHz). Then, hL and hR were convolved with the anechoic speech signal s0. The resulting time and frequency domain representations of hL ∗ s0 and hR ∗ s0 are shown in Fig. 4.3.

The headphone-listening portion of binaural synthesis is depicted in Fig. 3.11, panel c: the convolved stimuli hL ∗ s0 and hR ∗ s0 were played over the left and right headphone channels (Sennheiser HD600, circumaural). Recordings were made with KEMAR's internal microphones (xL,s and xR,s). Figure 4.4 shows time and frequency domain representations of xL,s and xR,s. Natural condition spectral amplitudes, |XL,n| and |XR,n|, are also plotted for comparison in the middle and bottom panels (thin lines). For a perfect binaural synthesis, XL,s = XL,n and XR,s = XR,n.

Results

There is generally good agreement between the amplitude spectra for the natural and synthesized conditions. Root-mean-square (RMS) discrepancies between the Xn and Xs spectral amplitudes are given for the two frequency ranges 0.15−1 kHz (middle panels) and 1−10 kHz (bottom panels). For the left "ear," the discrepancies were 1.56 dB and 6.56 dB; for the right "ear," they were 0.93 dB and 4.08 dB. The spectra begin to deviate systematically beyond 6 kHz, so the discrepancies for the 1−10 kHz range are inflated. If the cutoff is instead placed at 6 kHz, which is a reasonable cutoff for speech frequencies, then the RMS discrepancy reduces to 2.60 dB (left) and 2.40 dB (right) for the 1−6 kHz frequency range.
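The RMS discrepancy figure can be computed as follows (a minimal sketch with hypothetical variable names; f is the frequency axis, and Xn and Xs are the natural and synthesized decibel spectra on that axis):

```matlab
band = (f >= 1000) & (f <= 6000);                    % e.g., the 1-6 kHz range
rmsDiscrep = sqrt(mean((Xn(band) - Xs(band)).^2));   % RMS discrepancy in dB
```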
As can be seen in the bottom panels of Fig. 4.3, spectral amplitudes have dropped off significantly by 6 kHz, indicating that binaural synthesis can accurately (i.e., within a few dB) simulate Xn up to at least 6 kHz. This may be adequate for the room effect perceptual experiment.

Figure 4.1: (a) Anechoic stimulus s0 ("Cats and dogs each hate the other."), recited by a female talker and recorded by an omnidirectional microphone. This is the test stimulus, s0, in the Room 10B validation experiment. Spectral amplitudes, |S0|, are shown for the (b) 0.15−1 kHz range and (c) 1−10 kHz range.

Figure 4.2: KEMAR's head-related impulse responses (hL and hR) measured in Room 10B for the (a) left and (b) right "ears." The source position was 3 m and 0°. The full duration of hL and hR was 2.68 seconds, but they were truncated to 0.9 s. Transfer functions, |HL| and |HR|, are shown in the remaining panels: (c) and (d) show the 0.15−1 kHz frequency range, and (e) and (f) show the 1−10 kHz range.

Figure 4.3: Convolution of s0 (Fig. 4.1a) with hL and hR (Fig. 4.2, panels a and b). Panels (a) and (b) show the time domain representations of the convolved signals for the left and right "ears." Panels (c) and (d) show spectral amplitudes of the convolved waveforms, converted to a decibel scale, for the frequency range 0.15−1 kHz; panels (e) and (f) show the 1−10 kHz range.

Figure 4.4: Recordings at KEMAR's (a) left (xL,s) and (b) right (xR,s) "eardrums" when the synthesized binaural stimuli (hL ∗ s0 and hR ∗ s0 from Fig. 4.3) were played over headphones. Middle and bottom panels show spectral amplitudes, |Xs(f)| (thick lines) and |Xn(f)| (thin lines), for comparison. For a perfect binaural synthesis, XL,s = XL,n and XR,s = XR,n. In general, there is good agreement between the amplitude spectra (within a few dB) below 6 kHz.

4.2 Binaural synthesis with human listeners

The experiments on KEMAR demonstrated that binaural synthesis could successfully reproduce speech at the eardrums, even for a moderately reverberant room (RT60 = 0.9 s at speech frequencies). The main experiment was to use binaural synthesis on human listeners in a room effect perceptual experiment. It was divided into two phases: the first was measurement of HRIRs on human listeners; the second, which occurred on a different day, was delivery of HRIR-convolved speech stimuli over headphones in the perceptual part of the experiment. The listener's task was to rate the amount of perceived room effect among the various convolved-speech stimuli.

Recall that the goal of the perceptual experiment was to determine the effect certain physical factors, namely source distance, source angle, binaural listening, and HRTFs, had on listeners' perceptions of room effect. Further detail is given on each of these factors in the succeeding paragraphs.

Experiment 3 from Chapter 2 (PE3) was a guide for selecting physical conditions for the current experiment. Results showed that listeners ranked recordings made at a source distance of 3 m higher than those made at a source distance of 2 m. That is to say, (nearly) all listeners perceived more room effect when listening to recordings made at the larger source distance. As previously discussed, this can be understood because the amount of direct sound reaching the ears decreased when the source distance increased; thus, room effect became more prominent in the 3-m recordings. Recall that changing the source distance from 2 m to 3 m corresponds to a change of 3.5 dB in direct-to-reverberant sound power.
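One plausible way to arrive at that figure, assuming the direct sound power follows the inverse-square law while the reverberant power is roughly independent of source distance, is:

$$ \Delta(D/R) = 20\log_{10}\!\left(\frac{2\ \mathrm{m}}{3\ \mathrm{m}}\right) \approx -3.5\ \mathrm{dB}. $$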
The point is that the change in source distance from 2 m to 3 m in PE3 was perceptible to listeners. It was thus decided to use source distances of 2 m and 3 m in the current experiment.

PE3 also showed that most listeners perceived less room effect when listening binaurally. Even though the source azimuth was 0°, small interaural differences may have existed that helped to reduce the amount of perceived room effect for these listeners. If interaural differences were enhanced, say by placing the source at −30°, would the perceived amount of room effect be further reduced? The current experiment attempted to answer this question by using source azimuths of 0° and −30°. Source positions are summarized in Table 4.1.

The primary physical factor of interest in the current experiment was the HRTF. Recall that in PE3, listeners ranked recordings made with and without a plastic head between the microphones as having essentially identical amounts of room effect. A possible explanation was that the plastic head was too crude an approximation of a real human head, pinna, and torso. Further, the large studio microphones were positioned outside the "head"; thus, even if the plastic "head" had possessed outer "ears," any filtering due to them would have been completely missed by the microphones. Binaural synthesis provides the ability to investigate in a precise manner the potential role of individualized anatomy in the perception of room effect. Small electret microphones were inserted at the entrances to human listeners' ear canals, so the HRIRs encoded anatomically accurate acoustical filtering. Note that for most applications of binaural synthesis, particularly in the audio engineering industry, researchers find it sufficient to determine HRIRs on an artificial head. Even in fundamental research, psychoacousticians do not always measure individualized HRIRs when performing binaural synthesis (Wenzel et al., 1992). The present experiment uses binaural synthesis to deliver individualized and nonindividualized HRIR-convolved speech stimuli to listeners in order to determine what effect, if any, the HRTF has on perception of room effect. Such an experiment has not previously been done.

4.2.1 Determining HRIRs and generating stimuli

The measurement chain and associated hardware have been described twice previously (Chapter 3, and section 4.1 of the current chapter). Thus, the focus of the current section is on the modified or additional steps needed to determine HRIRs on a human listener. The first modification was the use of two source loudspeakers (Mackie HR824mk2 Studio Monitors, LOUD Technologies, Woodinville, WA) instead of one. Recall that HRIRs were to be measured at four source locations (Table 4.1). To ease the requirement of repositioning a single loudspeaker several times during a measurement session, a second loudspeaker was incorporated. The loudspeakers were mounted horizontally on movable wooden platforms, and their heights were adjusted to reduce perturbation of the 3-m speaker's direct sound by the 2-m speaker. This meant the 3-m speaker was taller (center of speaker 49 inches from the floor) than the 2-m speaker (center 41 inches from the floor). All input and output signal cables in Room 10B were routed through a porthole in the wall to the control room, where the TDT hardware (RP2.1 and FT-6 modules) was housed.
HRIRs were measured on four human subjects, or "heads" (H1−H4; all male; ages 20−78). Subjects completed a standard consent form approved by the MSU IRB, and subjects from outside the lab were paid. Upon arrival at Room 10B, a subject was instructed to sit on an adjustable stool. The height of the stool was adjusted so that the subject's ear canals were 46 inches from the floor (the same height as the vertical midline of the two loudspeakers). A small metal box, which housed the custom-built electret microphone preamplifier, was secured around the subject's neck such that it rested comfortably on the sternum. Then the subject inserted electret microphones (snugly positioned in EAR plugs) into his ear canals so as to be flush with the canal entrances; that is, none of the EAR plug was permitted to protrude beyond the ear canal volume. Handheld mirrors were available to assist the subject. Sufficient time (2 minutes) was allotted for the EAR plugs to settle. A final visual inspection was done by the experimenter to ensure proper placement. At this point the subject was also instructed to sit as still as possible.

The measurement protocol was as follows: once the microphone positions were checked, the experimenter retreated to the control room. Seven consecutive periods of a MLS ([17:1,15]) were played from the 2-m speaker, which was located at 0° azimuth. This was then repeated from the 3-m speaker. The experimenter then entered Room 10B to position the loudspeakers 30° to the left of the listener and returned to the control room. The MLS was played from the 2-m speaker, followed by the 3-m speaker. The subject was motionless during all measurements, and binaural recordings were made at each position. The subject was then dismissed for the day. The above procedure was repeated for a total of four heads, resulting in: 4 heads × 4 source positions × 2 ears = 32 HRIRs.

Distance (m)   Angle (°)
2              0
2              −30
3              0
3              −30

Table 4.1: The four source loudspeaker positions at which HRIRs were measured in Room 10B. The HRIRs were measured for four heads (H1−H4) and convolved with anechoic speech (Harvard phonetically balanced sentences). Convolved stimuli were presented to listeners (L1−L4) over headphones during the perceptual part of the experiment (on a different day).

Results

HRIRs were calculated for all heads and source positions by cross-correlating the MLS with the recordings from the electret microphones. Results are shown for H1 in Fig. 4.5; HRIRs for the other three heads (H2−H4) were similar and are not shown. Several observations can be made. First, the peak in the cross-correlation for the 0° azimuth source position is larger at 2 m than at 3 m. This can be understood because more direct sound reaches the microphones at the shorter source distance (and thus the correlation is higher). This also explains why, for the −30° azimuth source position, the peak was higher in the left ear: it was closer to the source than the right ear.

Frequency domain amplitude spectra (HRTFs) for all heads are shown in Figs. 4.6 (0.15−1 kHz) and 4.7 (1−10 kHz). Spectra for each subject have been offset by 10 dB for visual clarity. Spectral differences among the HRTFs are apparent at frequencies as low as 250 Hz (cf. panel c). Differences are more apparent in the 1−10 kHz range, starting at 1.5 kHz (cf. panels a, b, e, f). These differences may be perceptible to listeners during the headphone-listening experiment.

Figure 4.5: Subject 1's (H1) binaural HRIRs measured in Room 10B with blocked meatus. There were four source positions.
Figure 4.6: HRTFs (0.15−1 kHz) for the four heads (H1−H4). Each panel indicates a different source position; for example, the top panel shows HRTFs measured at the 2 m, 0° source position.

Figure 4.7: Same as Fig. 4.6 but for the 1−10 kHz frequency range. Differences in the spectra are apparent at frequencies as low as 1.5 kHz.

All HRIRs were convolved with anechoic recordings of four Harvard phonetically balanced sentences (Table 2.3). Examples of convolved waveforms (h ∗ s0) are shown in Fig. 4.8; the HRIRs were for source position 3 m, −30°, and s0 was "Cats and dogs each hate the other."

To determine what effect different HRIRs had on the convolved stimuli, cross-correlations of the stimuli were calculated. For source position 3 m, −30°, the cross-correlations of all convolved waveforms were calculated between a particular head and the remaining three heads. The maximum values of each cross-correlation are plotted in Fig. 4.9. Note that the cross-correlation of a waveform with itself (e.g., H1 with H1) is the autocorrelation and has a peak value of one (not shown). Cross-correlation maxima are shown for all source positions in Fig. 4.10. Since the maximum values varied little among the Harvard sentences in Fig. 4.9, it was reasonable to average the cross-correlation maxima across sentences for Fig. 4.10, panels a−d. Panel e shows the cross-correlation maxima averaged across all conditions. It is clear that convolved waveforms using H2's HRIRs are the least correlated with the other listeners' convolved speech waveforms. Based on these results, one might reasonably conjecture that convolved waveforms using H2's HRIRs will be the most perceptually distinct in the perceptual portion of the experiment (next section).

Figure 4.8: Convolved speech waveforms for "Cats and dogs each hate the other" for the four heads (H1, H2, H3, and H4). The HRIRs were for the 3 m, −30° source position. These stimuli were later presented to listeners (L1−L4) in the perceptual portion of the experiment.

Figure 4.9: Maximum cross-correlation of convolved waveforms for source position 3 m, −30°. Panels (i) and (j) show the cross-correlation averaged across the four Harvard sentences; error bars are standard deviations. For this source position, cross-correlations are smallest for H2's HRIRs.

Figure 4.10: Maximum cross-correlation of convolved waveforms for the different source positions. Values were averaged across the four Harvard sentences; error bars are standard deviations. The last panel (e) shows the average cross-correlation across all conditions. Waveforms generated using H2's HRIRs were least correlated with the other waveforms.
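The comparison metric is a one-liner in Matlab (a minimal sketch with hypothetical variable names, using xcorr from the Signal Processing Toolbox; yH1 and yH2 are equal-length convolved waveforms of the same sentence through two heads' HRIRs):

```matlab
r    = xcorr(yH1, yH2, 'coeff');   % normalized so an autocorrelation peaks at 1
rmax = max(r);                     % values near 1 indicate similar waveforms
```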
4.2.2 Perceptual experiment

Training

The same four subjects who were "heads" (H1−H4) in the HRIR measurements returned to be listeners in the perceptual experiment. They are called 'listeners' (L1−L4) to differentiate their role in the perceptual experiment from their role of simply providing heads in the HRIR measurements. Note that H1 corresponds to L1's HRTFs, H2 to L2's, and so on. Two of the four listeners (L2, L3) had normal hearing; the remaining listeners (L1, L4) had some hearing loss at mid and high frequencies (f > 3 kHz).

At a later date, a listener returned to Room 10B for the perceptual experiment.³ First, the listener received training on how to identify and rate room effect. The listener was instructed to listen to a series of diotic speech recordings over headphones while paying particular attention to room effect.⁴ The listener was asked to guess in which room each recording was made. For example, the living room and bedroom, with soft carpets and fabric window treatments, were expected to sound dead compared to the bathroom and kitchen, which had many hard reflecting surfaces due to vinyl flooring, hard cabinets, and countertops. The purpose of the exercise was to familiarize the listener with the task of listening for room effect.

Next, the listener listened to the room recordings again, and this time was instructed to rate the amount of room effect he perceived in each recording. The rating scale ran from 1 (anechoic; no perceived room effect) to 40 (maximum perceived room effect). The purpose of this part of the training was to get the listener comfortable rating room effect among stimuli thought to be easily distinguishable perceptually (e.g., bathroom vs. living room).

³In principle, listening for the perceptual experiment could be done in any quiet environment; traditionally, this usually means a sound-isolating listening booth. Since the experiment took place over summer break, Room 10B was suitably quiet for a headphone listening experiment.

⁴Recordings had been made in rooms common in domestic settings: a living room, kitchen, bathroom, and bedroom. In each room the female talker counted backwards from five, with a second-long pause between numbers. The talker stood 2 m from the SHURE KSM44a microphone ('omnidirectional' mode) to promote ample room effect. The microphone signal was digitized at a rate of 44.1 kHz via a ZOOM recorder (+48 VDC phantom power).

Experiment– preliminary

In the next stage, the listener rated the amount of room effect in real experiment stimuli, i.e., HRIR-convolved sentences. The motivation for using real experiment stimuli was to familiarize the listener with the range of room effect to be expected during the real experiment. The exact procedure was as follows: the listener was seated at the listening station and instructed on how to input his rating response into the custom-built GUI (Matlab). He listened to stimuli over circumaural headphones (Sennheiser HD600). The listener was only able to input a response after a stimulus finished playing, which ensured that the listener's evaluation of room effect spanned the entire sentence.

The training set consisted of the complete set of convolved speech stimuli for the "Cats and dogs each hate the other" Harvard sentence. The complete list of presentation conditions is given in Table 4.2; it includes HRIRs measured at the four source locations, plus binaural ('y') or diotic ('n') presentation. For binaural presentation, hL ∗ s0 was fed to the left headphone and hR ∗ s0 to the right headphone. For diotic presentation, hL ∗ s0 was fed to both headphones. In total, there were thirty-two listening conditions. These were presented in random order to the listener, who input his rating into the GUI after each condition was played. The GUI paused until a response was entered; in this way the listening task was self-paced and forced-choice. The listener did not know which condition he was listening to, and there was no option to replay a condition. A listener could repeat the rating of the training set as needed. Once the listener indicated that he was comfortable with the task, he moved on to the real experiment.
HRIR   Distance (m)   Angle (°)   Binaural
H1     2, 3           0, −30      n, y
H2     2, 3           0, −30      n, y
H3     2, 3           0, −30      n, y
H4     2, 3           0, −30      n, y

Table 4.2: The thirty-two listening conditions that comprised the "Cats" sentence set ("Cats and dogs each hate the other."): every combination of HRIR (H1−H4), source distance (2 or 3 m), source angle (0° or −30°), and binaural ('y') or diotic ('n') presentation. These were presented to the listener over headphones in the perceptual part of the experiment, and the listener rated the amount of room effect he perceived in each presentation condition. The order of conditions was randomized for each listener. In the real experiment, the "Glass," "Product," and "Thieves" sentence sets were also presented to a listener. The four sentence sets comprised a pass; listeners completed two passes (on different days).

Experiment– data

The format of "Experiment– data" was identical to that described in "Experiment– preliminary," except that there were four phonetically balanced sentence sets (cf. Table 2.3), briefly referred to as "Cats," "Glass," "Product," and "Thieves" for convenience. Each sentence set comprised thirty-two listening conditions (Table 4.2). The presentation order of the sentence sets was randomized for each listener and, further, the order of listening conditions within a set was randomized. After listening to two sentence sets, the listener took a 5−10 minute break before completing the remaining two sentence sets. The four sentence sets comprised a pass; thus, during a pass the listener rated 4 sentences × 32 listening conditions = 128 conditions. The time required to complete a pass was typically 45 minutes. A listener completed a second pass on a later day.

4.2.3 Results

           Distance (m)             Angle (°)               Binaural (y/n)
           means       std. err.    means       std. err.   means       std. err.
Listener   2     3     2     3      0     −30   0     −30   y     n     y     n
L1         10.0  16.3  0.5   0.5    13.7  12.7  0.6   0.6   11.8  14.5  0.6   0.5
L2         22.7  30.6  0.8   0.7    26.2  27.1  0.7   0.9   20.6  32.7  0.6   0.6
L3         23.9  27.9  0.6   0.5    27.4  24.3  0.5   0.5   23.6  28.2  0.5   0.5
L4         17.4  27.8  0.7   0.7    22.9  22.3  0.8   0.8   18.9  26.4  0.7   0.8

Table 4.3: Listeners' mean ratings (and standard errors of the means) calculated for each condition: 2 m and 3 m, 0° and −30°, binaural ('y') and diotic ('n'). There were N = 128 values in each mean. All listeners rated 2 m lower than 3 m, and binaural lower than diotic. Means for 0° and −30° were very similar (within one rating point) for all listeners except L3, who rated the −30° condition 3.1 points lower than the 0° condition. The maximum allowed rating was 40 (strong room effect), and the minimum was 1 (anechoic).

Listeners' mean ratings were calculated for each of the following conditions: 2 m and 3 m, 0° and −30°, and binaural ('y') and diotic ('n'). The means are shown in Table 4.3. There were 128 ratings in the calculation of each mean. For example, the calculation of the mean rating for the 2-m conditions required ratings from: 2 passes × 4 HRTFs × 4 sentences × 2 angles × 2 binaurality conditions = 128. All listeners rated the 3-m source distance as having more room effect than the 2-m source distance. This result is consistent with the results of Experiment 3 from Chapter 2, in which nearly all listeners ranked the 3-m source distance higher than the 2-m source distance. All four listeners also rated the diotic listening condition as having more room effect than the binaural condition. Means for 0° and −30° were very similar (within a rating point) for all listeners except L3, who rated the −30° condition 3.1 points lower than the 0° condition.
Thus, Listener L3 showed sensitivity to source angle, but the other three listeners did not. Figure 4.11 further breaks down the mean ratings by HRTF, but detailed discussion of HRTF effects is postponed until the next section. It is worth commenting on the range of ratings utilized by each listener. Listeners L2, L3, and L4 used the upper range of the scale, while L1's ratings were compressed along the lower half of the scale. This is not necessarily a problem, but differences in how listeners used the scale mean that the data must be normalized before making any between-listener comparisons. The multiple regression analysis in the subsequent sub-subsection does just that.

Figure 4.11: Mean ratings (grouped by HRTF) from the perceptual experiment. A higher rating indicates more perceived room effect. The vertical axis shows the mean rating for each (a) distance (2 m or 3 m), (b) listening condition (diotic or binaural), and (c) angle (0° or −30°); means are shown as functions of H1, H2, H3, and H4. The horizontal axis indicates which listener was listening. For example, the far-left bar plots show Listener 1's (L1) ratings, and bars labeled 'H1' indicate when he was listening to his own HRTFs. Each mean was calculated from N = 32 values; for example, L1's mean rating when listening to his own HRTFs (H1) at the 2 m distance was calculated across two listening conditions (binaural and diotic), two angles (0° and −30°), four sentences, and two passes. Error bars are standard errors of the mean. Panel (d) shows listeners' mean ratings for the four HRTFs; for each H, a listener's ratings were averaged across all conditions: two distances (2 m and 3 m), two listening conditions (diotic and binaural), two angles (0° and −30°), four sentences, and two passes (N = 64).

Multiple hierarchical regression analyses

The four listeners present four independent case studies, which have been fully investigated through multiple regression analyses (Cohen and Cohen, 1983). Independent variables (X) included in stage 1 of the regression model were distance ('dist'), binaurality ('bin'), and angle ('ang'):

$$ \hat{Y}_{\mathrm{rating}} = B_{\mathrm{dist}} X_{\mathrm{dist}} + B_{\mathrm{bin}} X_{\mathrm{bin}} + B_{\mathrm{ang}} X_{\mathrm{ang}} + A \qquad (4.1) $$

where B is a regression coefficient for each variable and A is the intercept. This linear regression equation can be applied to each of the four listeners to perform regression on the raw ratings. Alternatively, a regression equation in terms of z-values ($z_x = (x - \bar{x})/\mathrm{sd}_x$), which normalize the ratings, is given in Eq. 4.2:

$$ \hat{z}_Y = \beta_{\mathrm{bin}} z_{\mathrm{bin}} + \beta_{\mathrm{dist}} z_{\mathrm{dist}} + \beta_{\mathrm{ang}} z_{\mathrm{ang}} \qquad (4.2) $$

The advantage of Eq. 4.2 is that across-listener comparisons of the β-values can be made. Thus, the different weights listeners placed on the distance, binaurality, and angle variables can be investigated in a precise manner. This is desirable because, as seen in the simple comparisons of means in the previous section: (1) Listeners L1 and L4 rated greater perceptual effect in going from 2 m to 3 m than in going from diotic to binaural, whereas (2) Listeners L2 and L3 rated greater perceptual effect in going from diotic to binaural than in going from 2 m to 3 m. Equation 4.2 enables precise, quantitative comparisons of listeners' perceptual weightings of this kind. Further, statistical tests can readily be performed to ascertain significance.
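A minimal Matlab sketch of the stage 1 fit of Eq. 4.2 follows (hypothetical variable names, not the dissertation's analysis code; Yrating, Xdist, Xbin, and Xang are column vectors, with the X's coding the condition of each trial):

```matlab
z    = @(v) (v - mean(v)) ./ std(v);   % the z-transform of Eq. 4.2
Z    = [z(Xdist), z(Xbin), z(Xang)];   % standardized predictors
beta = Z \ z(Yrating);                 % least-squares beta-weights
                                       % (no intercept needed for z-scores)
```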
Complete results of regression analyses with stage 1 predictors (distance, binaurality, angle) for each listener are shown in Table 4.4.[5] Distance and binaurality were highly significant for all four listeners. Positive β-weights indicate that listeners rated 3-m conditions higher than they rated 2-m conditions (the 2-m condition was the reference group). Negative β-weights indicate that listeners rated diotic conditions higher than they rated binaural conditions (diotic presentation was the reference group). Angle was significant only for Listener L3, and the β-weight indicates he rated −30° conditions lower than 0° conditions (the 0° condition was the reference group). Usually when comparing β-weights, it is more meaningful to discuss them in terms of their magnitudes, since the sign is determined by the choice of reference group and is therefore somewhat arbitrary.

[5] Pass was included as a model predictor, and had a significant effect for listeners L3 and L4. However, pass is not a theoretically rich predictor and it is not further discussed.

Listener    R      R²     β_dist     β_ang      β_bin
L1        .541   .293    .489***   −.081     −.214***
L2        .771   .595    .417***    .047     −.644***
L3        .598   .358    .328***   −.260***  −.385***
L4        .705   .497    .560***   −.033     −.404***

Table 4.4: Results of stage 1 multiple regression analyses for the four listeners. Predictors for stage 1 of the model were distance, binaurality, and angle. Statistical tests indicated that distance and binaurality were significant for all four listeners. (*p < .05, **p < .01, ***p < .001)

Data were pooled over sentences in stage 1 of the regression model. If sentence is added as a predictor (stage 2), then one can look at how much variability beyond the stage 1 model can be explained by accounting for the different sentences. Finally, if HRTF is added as a predictor (stage 3), then a listener's sensitivity to HRTFs in his ratings can be probed in a systematic manner. Hierarchical regression analysis is an appropriate method of analysis because preliminary examination of the data indicated that listeners' sensitivity to sentence and HRTF would be less than their sensitivity to distance, binaurality, and possibly angle. Doing the analysis in a hierarchical manner gives greater sensitivity by testing whether adding a particular predictor changes the variability explained by the preceding model in a statistically significant way. That sensitivity may be lost by simply including sentence and HRTF as predictors in a standard multiple regression analysis. Table 4.5 shows the results of hierarchical regression analyses for the four listeners. Figure 4.12 plots the beta-magnitudes and indicates the statistical significance of predictors in order to graphically summarize the results of the analyses. The average β-weights for sentences and HRTFs are plotted to simplify the display. A detailed discussion of analysis results is given for each listener.

Listener   Model     R      R²     ΔR²    F, ΔF
L1        stage 1  .541   .293          25.949***
          stage 2  .570   .324   .032    3.899**
          stage 3  .579   .335   .011    1.339
L2        stage 1  .771   .595          92.161***
          stage 2  .786   .617   .022    4.817**
          stage 3  .790   .624   .006    1.385
L3        stage 1  .598   .358          34.967***
          stage 2  .623   .388   .031    4.123**
          stage 3  .633   .401   .013    1.712
L4        stage 1  .705   .497          61.917***
          stage 2  .724   .525   .028    4.849**
          stage 3  .725   .525   .001     .101

Table 4.5: Results of multiple hierarchical regression analyses. All four listeners indicate a statistically significant effect of sentence and no effect of HRTF (*p < .05, **p < .01, ***p < .001).

Figure 4.12: Beta-weights (magnitudes) are plotted as a function of model predictors. In the case of non-binary predictors (e.g. sentences and HRTFs), the average β-weight is plotted to simplify the display. Statistical significance is indicated below the predictor labels: ratings of perceived room effect in binaural conditions were significantly lower than ratings in diotic conditions at the p < .001 level for all listeners. Likewise, ratings for the 2-m conditions were lower than ratings for the 3-m conditions at the p < .001 level for all listeners. The −30° conditions were rated lower than the 0° conditions at the p < .001 level for L3 only. Ratings among sentences differed at the p < .01 level for all listeners. Most important to this experiment is that ratings of perceived room effect among HRTF conditions were not significantly different (p > .05).
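The stage-wise logic can be sketched with the textbook increment-to-R² test (Cohen and Cohen, 1983). The sketch below is illustrative only; a stage 2 design matrix would append dummy-coded sentence columns to the stage 1 matrix, and stage 3 would append HRTF columns:

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - (resid @ resid) / (((y - y.mean()) ** 2).sum())

def stage_increment(X_reduced, X_full, y):
    """Delta R^2 and Delta F for adding a block of predictors; the Delta F
    value is compared against the F(k_added, n - k_full - 1) distribution."""
    r2_red, r2_full = r_squared(X_reduced, y), r_squared(X_full, y)
    k_added = X_full.shape[1] - X_reduced.shape[1]
    df_resid = len(y) - X_full.shape[1] - 1
    dF = ((r2_full - r2_red) / k_added) / ((1.0 - r2_full) / df_resid)
    return r2_full - r2_red, dF
```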
Listener 1
Of the stage 1 model predictors (distance, binaurality, angle), L1 placed the greatest weight on distance (|β_dist| = 0.489). His weight on binaurality (|β_bin| = 0.214) was less than half of that for |β_dist|. Both distance and binaurality were highly significant (p < 0.001). L1's ratings decreased when going from diotic to binaural listening, while ratings increased when distance went from 2 m to 3 m. Angle had a β-weight (magnitude) of 0.081 and was not significant. The overall variance in L1's responses that was accounted for by stage 1 predictors was R² = 0.293. When sentence was added as a predictor (stage 2), an additional 0.032 of the variance in L1's responses was accounted for (R² = 0.324). This was significant at the p < 0.01 level. Lastly, adding HRTF as a predictor (stage 3) did not yield a statistically significant change in the test statistic (ΔF = 1.339, p > 0.05). Thus, L1 was not sensitive to different HRTFs.

Listener 2
Stage 1 model predictors accounted for R² = 0.595 of the overall variance observed in L2's ratings. Binaurality and distance had regression coefficients (magnitudes) of |β_bin| = 0.644 and |β_dist| = 0.417. Both were significant at the p < .001 level. L2 weighted binaurality more strongly than he weighted distance. Angle did not have a statistically significant effect on L2's responses (|β_ang| = 0.047). Adding sentence as a predictor (stage 2) accounted for an additional 0.022 of the overall variance (R² = 0.617). The change in the test statistic was statistically significant (ΔF = 4.817) at the p < 0.01 level, indicating that L2 was sensitive to sentence. However, adding HRTF as a predictor (stage 3) accounted for little additional variance in L2's ratings (ΔR² = 0.006) and did not significantly change the F-statistic (ΔF = 1.385). L2 was not sensitive to HRTF.

Listener 3
Stage 1 predictors accounted for R² = 0.358 of the overall variance in L3's ratings. He weighted distance and binaurality comparably: |β_dist| = 0.328 and |β_bin| = 0.385. The weight on angle was |β_ang| = 0.260. He rated the −30° source position lower than the 0° position. All three predictors were significant at the p < 0.001 level. Adding sentence as a predictor (stage 2) accounted for an additional 0.031 of the overall variance in L3's responses (R² = 0.388). The change in the F-statistic (ΔF = 4.123) was significant at the p < 0.01 level. Adding HRTF as a predictor (stage 3) did not result in statistically significant changes (ΔR² = 0.013 and ΔF = 1.712). L3 was not sensitive to HRTF.

Listener 4
Distance and binaurality were strongly weighted by L4: |β_dist| = 0.560 and |β_bin| = 0.404. These were significant at the p < 0.001 level. Angle (|β_ang| = 0.033) was not significant. Stage 1 model predictors accounted for R² = 0.497 of the overall variance in L4's ratings.
Adding sentence as a predictor (stage 2) accounted for an additional 0.028 of the variance (R² = 0.525), and the change in the F-statistic (ΔF = 4.849) was significant at the p < 0.01 level. However, L4 was not sensitive to HRTF (ΔR² = 0.001, ΔF = 0.101).

4.2.4 Discussion
Distance and binaurality were highly significant across all listeners, but the relative importance that listeners attached to each varied. Nevertheless, all listeners rated the 3-m distance as having more room effect than the 2-m distance. That is consistent with the result of Experiment 3 in Chapter 2 (PE3). Listeners L1 and L4 weighted distance more strongly than they weighted binaurality when rating room effect. Listener L4 had the largest |β_dist|, which was 0.560. Listeners L2 and L3, on the other hand, weighted binaurality more strongly than distance. The largest β-magnitude observed was for Listener L2: |β_bin| = 0.644. This was three times larger than |β_bin| for Listener L1, which was 0.214. Listener L3 weighted distance and binaurality similarly (|β_dist| = 0.328 and |β_bin| = 0.385), but he was also the only listener to place significant weight on angle (|β_ang| = 0.260).

Results of the stage 1 regression analyses indicate that the listening experience for each listener was somewhat unique. An advantage of the rating paradigm (compared to rank-ordering in Experiment 3 of Chapter 2) was that ratings could be analyzed via multiple regression, which outputs normalized weights for model predictors (i.e. β-values). The β-values enable quantitative comparisons of predictors within a listener, and also allow comparisons among listeners. In rating the amount of perceived room effect, distance and binaurality were important for all listeners, and the effects went in the same direction for all listeners (i.e. binaural ratings were lower than diotic ratings, and 2-m ratings were lower than 3-m ratings), but the relative importance was highly individual. Results for source angle were less uniform: angle was significant for only one of four listeners. For L3, the enhanced binaural differences for the lateral angle (−30°) reduced the amount of perceived room effect.

Going from stage 1 of the model to stage 2, which included sentence as a predictor, resulted in statistically significant (p < 0.01) changes in the F-statistic for all listeners. It is not surprising that the varying spectral content of the different sentences influenced listeners' perceptions. For example, reverberation tended to be more prominent in some syllables, like "thieves," than in others (e.g. "product"). Further, the pauses between words in a particular sentence varied, thus allowing reverberation to fill the pause and become more perceptually prominent. The shortest pause between words in the anechoic recordings was 0.089 s, and it occurred in the "thieves" sentence. The longest pause was 0.191 s and occurred in the "cats" sentence. Thus, the longest pause was 2.15 times longer than the shortest pause. While sentence was significant for all listeners, they responded in an idiosyncratic way. Since the effect of sentence is not the main focus of the experiment, the specific experiences of listeners with each sentence will not be further discussed.

The stage 3 model, which included HRTF as a predictor, showed no significant changes in the F-statistic for any listener. That is to say, listeners' ratings of perceived room effect were not sensitive to variations in HRTF.
A listener's ratings were similar for the different HRTFs, all else being equal (e.g. 2-m source distance, 0° angle, binaural condition). This result is in direct contrast to the hypothesis that a listener would not only be sensitive to HRTF, but that he would perceive the least amount of room effect when listening with his own ears. Further discussion of the null result is given in the next section.

4.2.5 Conclusions
Since Experiment 3 in Chapter 2 (PE3) indicated that distance and binaurality affected listeners' perception of room effect, it is not surprising that results from the current experiment also indicate these are statistically significant effects. The impact of distance can be understood by considering the difference in direct-to-reverberant power ratio (D/R) between 2 m and 3 m, which is 3.5 dB: the direct power falls as the inverse square of distance while the reverberant power is roughly constant, so the D/R difference is 20 log10(3/2) ≈ 3.5 dB. Less direct sound reaches the electret microphones when the MLS is played through the 3-m source during HRIR measurements. The lower D/R propagates through the convolution with the anechoic speech recordings, such that in the headphone experiment listeners perceived more room effect for 3-m stimuli than for 2-m stimuli. This was seen for all listeners. It was thought that enhanced binaural differences at −30° compared to 0° would show enhanced squelch, yet angle had a significant effect on ratings for only one listener (L3).

The lack of any statistically significant effect of HRTF on ratings of perceived room effect was ubiquitous across listeners in the final model analysis. This result runs counter to the initial hypothesis that listening to one's own individualized HRTFs would have a significant impact on a listener's ratings. Specifically, it was thought that listeners would experience maximum squelch when listening to stimuli filtered with their own HRTFs.

The null result is discussed in the context of other experiments that did not find one's own HRTFs to be universally preferred. Seeber and Fastl (2003) conducted a localization experiment comparing performance with listeners' preferred HRTFs. Prior to the localization portion of the experiment, listeners pre-selected and rated five (nonindividualized) HRTFs from a larger database. Pre-selection was based on which HRTFs evoked the greatest spatial perception in the frontal area. Interestingly, listeners generally selected the same small subset of HRTFs. This is similar to what was seen in the current experiment, namely that all listeners rated slightly smaller room effect when listening through L2's HRTFs. Roginska et al. (2010) conducted an experiment similar to Seeber and Fastl's, but it included individualized HRTFs. They found that listeners did not always prefer their own HRTFs. Preference depended on the specific criterion: a listener's preferred HRTF for externalization was not necessarily the preferred HRTF for front/back discrimination. Further, preference depended on stimulus type. An experiment by Katz and Parseihian (2012) found that even when listeners were told which HRTFs were their own, they did not unanimously prefer them. To conclude, it seems that preference for one's own HRTFs is far from ubiquitous: it depends on stimulus type, selection criterion, and specific properties of the other HRTFs included. Simon et al. (2016) pointed out that preference experiments like these give little information about what a listener actually perceives. To that end, they instructed a panel of expert listeners to come up with a list of attributes that described perceptual differences among a set of nonindividualized HRTFs.
These were: coloration, elevation, externalization, immersion, position-front/back, position-lateral, realism, and depth. Interestingly, the expert listeners initially insisted on including reverberation as an attribute, but they ultimately removed it from the list because they could not find examples of large differences in reverberation. That observation is consistent with the results of the current experiment. However, more information about the amount of physical reverberation that was present in the Simon et al. stimuli (i.e. RT60) would be needed to comment further.

The studies above used different criteria upon which listeners evaluated their experiences, and also different stimulus types. It is worth pointing out that all of the experiments delivered stimuli to listeners over headphones (and different headphones, it should be added). These headphone experiments do not show conclusive evidence that listeners prefer their own HRTFs. It is likely that the headphones, even if equalized, influenced listeners' perceptions and HRTF preferences. The next chapter describes an experimental method that enables very precise stimulus delivery over loudspeakers. The ultimate goal is to apply the loudspeaker delivery method to a room-effect perceptual experiment, thereby eliminating the linear distortions caused by headphones.

Chapter 5
Well-controlled stimulus presentation

The present chapter is a temporary excursion away from experiments on perception of room effect. It describes in detail a novel experimental method that will later be used to investigate perception of room effect (Chapter 6), but the present focus is on introducing the method and showing validation measurements for it. Up to this point in the dissertation, all stimuli have been delivered to a listener over headphones. Indeed, the standard method for stimulus delivery in psychoacoustics experiments is with headphones. This is especially the case in binaural experiments. The present chapter begins with a formal treatment of headphones to introduce important notation. It then proceeds with the primary focus of the chapter, which is stimulus presentation over loudspeakers.

5.1 Headphone presentation
Consider a signal, x0, to be presented to a listener. For simplicity, x0 has been designed so that its discrete Fourier transform (DFT), X0, has 211 spectral components. The components have a constant amplitude and random phases, which are calculated according to Eqs. 5.1:

    A(f) = \sqrt{ (X_0^{real}(f))^2 + (X_0^{imag}(f))^2 }    (5.1a)
    \phi(f) = \arg\left( X_0^{imag}(f) / X_0^{real}(f) \right), on the interval [-\pi, \pi]    (5.1b)

The amplitude and phase spectra for X0 are shown in Fig. 5.1.

Figure 5.1: Signal X0 has constant-amplitude (top) and random-phase (bottom) spectra. Each panel shows 211 symbols, one for each spectral component.

Let the signal sent to the headphones be represented by the variable y. Signal y is imagined to be x0. The headphone response, h, is now convolved with the signal y according to Eqs. 5.2:

    x_L(t) = \int_{-\infty}^{\infty} h_L(t') y_L(t - t') dt'    (5.2a)
    x_R(t) = \int_{-\infty}^{\infty} h_R(t') y_R(t - t') dt'    (5.2b)

where x_L is the convolved signal received by the listener's left ear and x_R by the right ear from the headphones. For convenience, physically-occurring quantities have been put in bold while quantities that are invented or computed occur in plain italic text. These notation conventions will be applied henceforth. Complementary frequency domain expressions for Eqs. 5.2 are:
    X_L(f) = H_L(f) Y_L(f)    (5.3a)
    X_R(f) = H_R(f) Y_R(f)    (5.3b)

where X_L, X_R, Y_L, and Y_R are complex-valued vectors. Note that the convolution integral in the time domain is a straightforward multiplication in the frequency domain (Eqs. 5.2 and 5.3). In matrix form, Eqs. 5.3 are:

    \begin{bmatrix} X_L \\ X_R \end{bmatrix} = \begin{bmatrix} H_L & 0 \\ 0 & H_R \end{bmatrix} \times \begin{bmatrix} Y_L \\ Y_R \end{bmatrix}    (5.4a)

or,

    X = HY    (5.4b)

in shorthand notation. If it is assumed that the signal received by the listener's ears (X) is the same as the signal sent to the headphones (Y), then it must be that H = I, a frequency-independent unit matrix. This would imply that the transfer functions from the headphones to the listener's eardrums are ideal, i.e. completely flat.

To test the possibility of ideal headphone-to-eardrum transfer functions, Sennheiser HD414 headphones (Sennheiser, Wedemark, Germany) were positioned onto KEMAR, an anthropomorphic acoustical manikin with internal microphones at the locations of the "eardrums." The desired signal at the "eardrums," X0, was comprised of 211 frequency components, with the specific components determined by the RP2.1 sample rate (48828.125 Hz) and the number of desired samples (2^17 = 131072 samples). One period of the stimulus was 131072 samples / 48828.125 samples s^-1 = 2.68435456 seconds. The fine frequency spacing was δf = 1/2.68435456 s = 0.37252903 Hz. The coarse frequency spacing, Δf, was 200 times the fine frequency spacing: Δf = 200 × δf = 74.50580591 Hz. The base frequency, f1 = 536 × δf = 199.68 Hz, was selected because it is close to the average fundamental of female speech. The remaining frequency components of X0 were shifted harmonics of f1, calculated according to: f_n = f1 + Δf × (n − 1). The largest frequency component was f_211 = f1 + Δf × 210 = 15845.8948 Hz. The advantage of the coarse frequency spacing in X0 is that it reduced (by a factor of 200) the number of frequency components to be included in the DFT calculation and display, while still including frequencies relevant to female speech and to the audible range. The amplitude and phase spectra of X0 are shown in Fig. 5.1.
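For concreteness, the stimulus construction described above can be sketched in a few lines. Python/numpy is used purely for illustration here; the lab's own tools were MATLAB and the TDT software:

```python
import numpy as np

fs = 48828.125            # RP2.1 sample rate (Hz)
N = 2**17                 # 131072 samples; one period = 2.68435456 s
delta_f_fine = fs / N     # 0.37252903 Hz
n_components = 211

# Shifted harmonics of f1 = 536 * delta_f_fine = 199.68 Hz, spaced by
# 200 bins (74.50580591 Hz); the highest component is 15845.8948 Hz.
bins = 536 + 200 * np.arange(n_components)

rng = np.random.default_rng(0)
spectrum = np.zeros(N // 2 + 1, dtype=complex)
spectrum[bins] = np.exp(1j * rng.uniform(-np.pi, np.pi, n_components))

# Constant amplitudes, random phases; irfft enforces the Hermitian
# symmetry that makes x0 a real-valued time signal.
x0 = np.fft.irfft(spectrum, n=N)
```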
The simple case in which y_L = y_R = y = x0 was examined. Left and right channels sent to the headphones (Sennheiser HD600) originated from the two TDT RP2.1 DAC channels. The RP2.1 module was connected to a controller PC (Windows 7 OS) via a USB interface. The sampling rate was 48828.125 Hz. Custom macros were designed to control the RP2.1 processor using the TDT proprietary software RP Visual Design Studio (RPVdS). Macros were saved to .rpx format and executed directly in RPVdS. From the DAC channels, signals were fed into a headphone amplifier with adjustable level control. During calibration, the level in the headphones was adjusted to 72 dBA. The level was determined by pressing the headphone cushion against a flat-plate coupler and sound level meter while y was playing. The headphones were then positioned on KEMAR's head and sufficient time was allotted for the headphones to settle (1 minute). Three periods of signal y were played over the headphones to KEMAR, and recordings, x_L and x_R, were made with the manikin's internal microphones for convenience (instead of probe microphones). Microphone signals were amplified (+48 VDC phantom power) and digitized through the RP2.1 ADC channels. Synchronous triggering of the DAC and ADC channels was attained via a common Zbus trigger executed within the RPVdS software. Recordings were manually downloaded from RP2.1 RAM to the PC by the click of a button in RPVdS. The first and last periods were discarded to avoid edge effects. Thus, only the recording of the middle period was analyzed. This will be the case for all recordings unless otherwise noted.

The discrete Fourier transform (DFT) of the recording at the left "eardrum," X_L, is shown in Fig. 5.2. Clearly, X_L ≠ X0. The transfer function matrix, H, is not a frequency-independent unit matrix. Perhaps this is unsurprising given the imperfect response of nonideal headphones, the outer ear anatomy, and the ear canal resonances that y encountered in its path from headphone transducer to the "eardrum." The next section discusses headphone equalization, which attempts to compensate for these nonideal transfer functions.

5.2 Headphone equalization
It is not standard practice for psychoacoustics researchers to equalize headphones, but nevertheless some do– for example, Wightman and Kistler (1989a, 1989b), Wenzel et al. (1993), and Zahorik (2002). For human listeners, the headphone-to-ear canal transfer functions, H_L and H_R, are measured using probe tube microphones in the ear canals. Headphone equalization is achieved by calculating inverse filters from the transfer functions (H_L^{-1} and H_R^{-1}). The signals that are played over the headphones (Y'_L and Y'_R) during an experiment are then the product of the inverse transfer functions and the desired signals in the ears (X'_L and X'_R, which are imagined to be X0 for both ears). This is shown in matrix form in Eq. 5.5:

    \begin{bmatrix} Y'_L \\ Y'_R \end{bmatrix} = \begin{bmatrix} H_L^{-1} & 0 \\ 0 & H_R^{-1} \end{bmatrix} \times \begin{bmatrix} X'_L \\ X'_R \end{bmatrix}    (5.5)

Figure 5.2: Signal y (= x0) was played over Sennheiser HD600 headphones and recorded with KEMAR's internal microphones. The recording at the left "eardrum" is shown here. Filled symbols indicate the desired signal (X0) and open symbols indicate the DFT of the measured signal at the "eardrum" (X_L). Amplitudes are shown in the top panel and phases in the bottom panel for the 211 spectral components. If X_L = X0, the open symbols would completely obscure the filled symbols.

Headphone-to-ear canal transfer functions are typically measured for a single position of the probe microphones in the ear canals and of the headphones on the head. Researchers who implement equalization therefore implicitly assume that the headphone-to-ear canal transfer functions, and thus the equalization filters, do not change with subsequent headphone fittings. An experiment to test the validity of the assumption is described in section 5.2.1.
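Before turning to that experiment, the computation implied by Eq. 5.5 can be sketched in the frequency domain. This is a hedged one-ear illustration (numpy standing in for the lab's MATLAB scripts); the `bins` argument marks the 211 stimulus components so that bins with no stimulus energy are never divided by:

```python
import numpy as np

def transfer_function(w, y, bins, n):
    """H(f) = W(f)/Y(f), evaluated only at the stimulus bins to avoid
    dividing by near-zero components of the stimulus spectrum."""
    return np.fft.rfft(w, n)[bins] / np.fft.rfft(y, n)[bins]

def equalized_signal(x_desired, H, bins, n):
    """y' such that the eardrum receives x': Y'(f) = X'(f)/H(f) (Eq. 5.5)."""
    Y = np.zeros(n // 2 + 1, dtype=complex)
    Y[bins] = np.fft.rfft(x_desired, n)[bins] / H
    return np.fft.irfft(Y, n=n)
```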
5.2.1 Experiment setup
A headphone equalization experiment was conducted on KEMAR. The discussion in section 5.2 was framed in terms of headphone-to-ear canal transfer functions, but it can easily be reformulated in terms of headphone-to-eardrum transfer functions. This modification enables the use of KEMAR's internal microphones for making recordings, which is more convenient than using probe microphones. Essentially the same measurement procedure described in section 5.1 was used for the headphone equalization experiment. Additionally, the procedure for measuring headphone-to-eardrum transfer functions is described below.

The generalized stimulus used to obtain headphone-to-eardrum transfer functions is indicated by the variable y_H. For convenience, y_H = x0. During calibration, the level in the headphones was adjusted to 72 dBA while signal y_H was playing. To attain H_L, signal y_H was played from the left headphone and recorded (w_L) with KEMAR's internal microphone. Signal y_H was then played from the right headphone to attain w_R. A cartoon of the measurement is shown in Fig. 5.3a. DFTs of the recordings were computed, and transfer functions were calculated according to H_L(f) = W_L(f)/Y_H(f) and H_R(f) = W_R(f)/Y_H(f).

Figure 5.3: Headphone equalization experiment. (a) To obtain headphone-to-eardrum transfer functions (H_L and H_R), signal y_H was played over the headphones and recordings were made with KEMAR's internal microphones (w_L and w_R). (b) Signals y'_L and y'_R, calculated from Eq. 5.5, were played over the headphones and recorded by the internal microphones (x_L and x_R).

Custom MATLAB scripts were written to calculate signals Y'_L and Y'_R according to Eq. 5.5, and to convert them to time domain signals, y'_L and y'_R, through inverse Fourier transforms. Signals y'_L and y'_R were then played simultaneously from the left and right headphones while recordings (x_L and x_R) were made with KEMAR's internal microphones, as depicted in Fig. 5.3b. This measurement is referred to as the 'standard.'

Pencil lines were drawn on KEMAR's head to mark the position of the headphone cushions on the ears. The headphones were then removed and repositioned on the head. Reasonable effort was made to place the headphones back in their original position using the pencil lines as a guide. Sufficient time was allotted for the headphone cushions to settle (1.5 min). Without measuring new transfer functions, signals y'_L and y'_R were played again and new signals were recorded at the eardrums. The headphones were repositioned five times, and new recordings were made for each fitting. These, plus the standard, yielded six total measurements.

5.2.2 Results
Figure 5.4 shows the results of the standard measurement (solid line) and the repositioned measurements (open symbols) in the left (top panel) and right (bottom panel) "ears." The standard reproduced the equal-amplitudes signal quite well– a flat line was expected, and a flat line was observed. The largest discrepancy between the desired signal (X0) and the standard (X) was 0.13 dB and occurred in the left ear at 13387.3 Hz. Each subsequent headphone placement is indicated by a different symbol type. It is evident that repositioning the headphones had a dramatically deleterious effect. This was especially true in the right ear– for two out of five headphone placements, the measured amplitudes near 10 kHz deviated substantially from the standard. The largest amplitude occurred at 9811 Hz and was 13.7 dB above the standard.

Figure 5.4: Signals measured in KEMAR's left (top) and right (bottom) internal microphones. The desired signal, X0, had equal amplitudes. Signals measured with the original headphone placement, for which H_L and H_R were measured, are the standard and are indicated by the black line. The black line looks like an axis, but it is real data. The largest discrepancy observed in the standard was 0.13 dB and occurred in the left ear at 13387.3 Hz. Measurements at subsequent headphone placements are indicated by open symbols, and each placement is indicated by a different symbol type. The largest amplitude was 13.7 dB above the standard and occurred in the right ear at 9811 Hz.

5.2.3 Discussion
Signal delivery works quite well when headphone signals are equalized and there is no subsequent repositioning of the headphones– the largest discrepancy observed in this ideal scenario was 0.13 dB.
However, as soon as the headphones were perturbed, H_L and H_R were no longer accurate representations of the physical transfer functions. It is worth noting that the measurements in Fig. 5.4 represent a best-case scenario because the internal microphones were fixed at the manikin's "eardrums." Human listeners, on the other hand, must wear probe tube microphones underneath the headphones, which is what Hartmann and Wittenberg (1996) did in their headphone externalization experiments. However, wearing probe tube microphones under headphones can be tricky in a practical sense. The microphone casing is in physical contact with the headphone cushions, so perturbation of the headphones could perturb the probe tubes too. That could lead to even greater differences between the physical and measured (H_L and H_R) transfer functions, especially at high frequencies. The equalization filter can thus no longer be expected to accurately deliver the desired signals to the eardrums.

Run-to-run variability is a pervasive problem in headphone experiments. This is the variability that arises from a listener's removal of the headphones during a break between experiment runs and putting them on again for the next run with a fitting that is inevitably different. Further, human listeners do not have pencil lines on their faces, so variability is expected to be even worse than was observed for KEMAR. The difference in headphone placement between one run and the next could lead to drastically different transfer functions, which may be perceptually and/or acoustically relevant. Domnitz (1975) found listener asymmetries for circumaural headphones to be up to 3 dB in amplitude and 20° in phase for a 500 Hz tone at 85 dB SPL, while within-listener run-to-run variability could be up to 1.5 dB and 10°. Pralong and Carlile (1996) and Kulkarni and Colburn (2000) also found considerable variation in the headphone transfer functions in their studies on human listeners (Pralong and Carlile) and a manikin (Kulkarni and Colburn).

It is not common practice for researchers to measure fresh H_L and H_R for each new headphone placement. Indeed, if experimenters equalize headphones at all, it is normally only for a single headphone placement. Figure 5.4 clearly indicates the danger of using an equalization filter that is not matched to the particular headphone placement. An inaccurate headphone-equalization filter could be more detrimental than no equalization filter in some scenarios.

Experimenters who use stimuli with a low-frequency cut-off (f ≤ 4 kHz) could argue that the headphone equalization used here was adequately robust for their purposes. However, more testing is required before making that assertion. The stimulus X0 had fifty-two frequency components below 4 kHz, but to probe in further detail, more frequency components in that range should be included. Further, reasonable care was taken to replace the headphones in their original position by using pencil lines to guide the placement, but such meticulousness may not be practical in a realistic setting. In experiments with human listeners, probe microphones must be worn under the headphones. Even slight movements of the probe tube in the ear canal (without remeasuring H and updating the equalization filter) could lead to acoustical and/or perceptual differences between the desired and measured stimulus in the ear.

5.3 Transaural synthesis: an introduction
The previous section described reproducibility issues with headphones.
An alternative method for stimulus delivery in psychoacoustical experiments is loudspeaker presentation. Loudspeakers enable externalization of the sound image and provide a more natural listening environment. If a real target source loudspeaker is used, comparison of the target with a synthesized stimulus can be easily accomplished. Indeed, loudspeaker delivery of experimental signals has characterized recent work with particular attention to accuracy and comparison with real-world sound sources (Akeroyd et al., 2007; Moore et al., 2010; Zhang and Hartmann, 2010; Majdak et al., 2013; Hartmann et al., 2016).

During loudspeaker presentation, some of the sound from a loudspeaker that is intended only for one ear 'leaks' into the other ear– this is crosstalk, and it is depicted in Fig. 5.5. Crosstalk is problematical because it leads to imprecise signal delivery at the ears. In 1961, Bauer proposed an audio method for replacing two-channel headphone listening by two loudspeakers. The crosstalk would be eliminated by adding filtered versions of the right and left channels to the left and right channels, respectively, in order to cancel the crosstalk at the ears. Schroeder and Atal (1963) and Schroeder (1975) implemented crosstalk cancellation (CTC) in terms of anatomical transfer functions. They applied CTC to evaluate concert hall acoustics. Their experiment involved playing adjusted signals over two loudspeakers to a listener in an anechoic room. Damaske (1971) used dummy-head recordings as the binaural stimuli to be presented through an empirical CTC network to a human listener whose task was to localize the sound source in the horizontal or vertical planes. Early versions of CTC implemented inverse filtering of time domain signals, and none of them used probe tube microphones. Recordings were made with or without a head, using large-diaphragm condenser microphones. If a head was used (real or dummy), the microphones were placed at the sides of the head. In this sense, early CTC was quite different from the modern formulation, which uses probe microphones to make anatomically-correct recordings in the ear canals. It was not until Morimoto and Ando (1980) that CTC was reformulated in terms of matrix inversion in the frequency domain. This is the modern formulation of the CTC mathematics. Further, early versions of CTC assumed symmetry, but Morimoto and Ando's implementation did not. Morimoto and Ando used CTC for loudspeaker presentation of stimuli based on head-related transfer functions measured in the back horizontal plane. Later, Miyoshi and Kaneda (1988) simulated a CTC network in a room environment, which can introduce complexities because of zeros in the transfer functions (Neely and Allen, 1979). CTC is typically conducted in anechoic environments, but in principle it should be able to handle the complicated acoustical environment of an ordinary room. The acoustical qualities of a room, whether anechoic or ordinary, appear only in the transfer functions, H. An ordinary room acts as a linear filter, so that in addition to filtering due to the head, torso, and outer ear there is filtering due to the room as well. Thus, a room will affect the measured elements of H, but the CTC mathematical machinery is unchanged.
Cooper and Bauck (1989) introduced the term 'transaural synthesis' to describe the generalized approach of treating the signals at the ears as the end point in the signal processing chain (instead of the loudspeaker signals being the end point, as is the case in conventional stereo reproduction). They specified that transaural synthesis (TS) encompasses both the binaural recording stage (with or without a dummy head) and the CTC network required to deliver the binaural signal precisely to the ears. TS is thus a broader term than CTC: TS represents a complete signal delivery method, whereas CTC is essentially inverse filtering. It should be noted that, since Bauck and Cooper first introduced the term, TS and CTC have often been used somewhat interchangeably in the literature. However, because of the emphasis on signal delivery, which is the primary concern of this chapter, TS is henceforth used in this dissertation, with the understanding that it includes CTC.

The discussion would be incomplete without addressing the two co-existing communities that utilize TS: psychoacousticians and audio engineers. TS is robust and powerful enough to be applied to fundamental psychoacoustical research experiments, which require very accurate delivery of precise interaural and spectral cues to a listener's ears. Audio engineering applications can often get away with making binaural recordings and inverse filters using a dummy head, or even doing free-field measurements with large-diaphragm condenser microphones that completely neglect head diffraction. The emphasis in audio engineering applications is often on consumer experience. As such, audio engineers may be more willing to use large-diaphragm condenser microphones or imprecise CTC filters to expand generality, as long as the consumer experience is not greatly impaired. Psychoacoustical experiments, on the other hand, require the use of probe tube microphones (tube diameter < 1 mm) in a human listener's ear canals to construct inverse filters with the accuracy and precision necessary for fundamental research. New CTC filters must be measured from one experiment run to the next. Thus, the primary difference between implementations of TS in audio engineering and in psychoacoustical applications is the required level of precision and accuracy.

The following sections describe various TS experiments conducted in an ordinary room. Motivations for each experiment were somewhat different, but common to all experiments were the following two steps: 1) measurement of the loudspeaker-to-eardrum transfer functions, and 2) calculation of the loudspeaker signals necessary to attain the desired signals at the manikin's "eardrums." Details of each step and the necessary mathematics are given in the following subsections.

5.3.1 Measuring transfer functions
Signal y_H is played from loudspeaker A and recordings are made at the "eardrums" (w_L and w_R). A cartoon of the measurement setup is shown in Fig. 5.5. Note that for human listeners, recordings must be made with probe tube microphones in the ear canals, but for simplicity the following discussion is framed in terms of measurements at the eardrums.

Figure 5.5: Measurement of the synthesis loudspeaker-to-eardrum transfer functions (H). Signal y_H is played from synthesis loudspeaker A and recordings, w_L and w_R, are made at the eardrums to obtain H_AL(f) and H_AR(f). Then, y_H is played from synthesis loudspeaker B and new recordings are made at the eardrums to obtain H_BL(f) and H_BR(f).
Crosstalk paths, H_AR(f) and H_BL(f), are indicated by dashed lines.

The loudspeaker A-to-eardrum transfer functions are W_L/Y_H = H_AL(f) and W_R/Y_H = H_AR(f). The procedure is repeated for loudspeaker B to obtain H_BL(f) and H_BR(f). In matrix form this is written as:

    H(f) = \begin{bmatrix} H_{AL} & H_{BL} \\ H_{AR} & H_{BR} \end{bmatrix}    (5.6)

Equation 5.6 looks similar to H in Eq. 5.4a except that now the off-diagonal terms H_BL and H_AR are not constrained to be zero. These are the crosstalk terms. Note that matrix H encapsulates all the physics in the system, which includes crosstalk, scattering from the head, torso, or outer ear, and any scattering from the room.

5.3.2 Calculating loudspeaker signals
With the loudspeaker-to-eardrum transfer functions H in hand, Eq. 5.4a can be rewritten as:

    \begin{bmatrix} X_L \\ X_R \end{bmatrix} = \begin{bmatrix} H_{AL} & H_{BL} \\ H_{AR} & H_{BR} \end{bmatrix} \times \begin{bmatrix} Y_A \\ Y_B \end{bmatrix}    (5.7)

Signals X_L and X_R are the recordings at the left and right eardrums when loudspeaker signals Y_A and Y_B are filtered by H. Consider now the inverse problem– can alternative signals, Y'_A and Y'_B, be played from synthesis loudspeakers A and B such that, when filtered by H, they would give precisely the desired signals in the ears, X'_L and X'_R? In other words, the inverse problem of Eq. 5.7 is

    Y' = H^{-1} X'    (5.8)

where the inverted H matrix is

    H^{-1} = \frac{1}{H_{AL}H_{BR} - H_{AR}H_{BL}} \begin{bmatrix} H_{BR} & -H_{BL} \\ -H_{AR} & H_{AL} \end{bmatrix}    (5.9)

Writing Eq. 5.8 in more explicit form gives

    \begin{bmatrix} Y'_A \\ Y'_B \end{bmatrix} = \frac{1}{H_{AL}H_{BR} - H_{AR}H_{BL}} \begin{bmatrix} H_{BR} & -H_{BL} \\ -H_{AR} & H_{AL} \end{bmatrix} \begin{bmatrix} X'_L \\ X'_R \end{bmatrix}    (5.10)

The significance of Eq. 5.10 cannot be overstated: in principle, it allows one to deliver any desired signals, X'_L and X'_R, to the eardrums. Specifically, Eq. 5.10 enables one to compute the loudspeaker signals, Y'_A and Y'_B, necessary to attain precisely X'_L and X'_R at the eardrums. To show that this is true, let X_L and X_R be the measured signals at the ears when Y'_A and Y'_B are processed by H, as depicted in Fig. 5.6:

    \begin{bmatrix} X_L \\ X_R \end{bmatrix} = H \begin{bmatrix} Y'_A \\ Y'_B \end{bmatrix}    (5.11)

Equation 5.8 can be substituted for Y':

    \begin{bmatrix} X_L \\ X_R \end{bmatrix} = H H^{-1} \begin{bmatrix} X'_L \\ X'_R \end{bmatrix}    (5.12)

Since HH^{-1} = I, the identity matrix, what remains is

    \begin{bmatrix} X_L \\ X_R \end{bmatrix} = \begin{bmatrix} X'_L \\ X'_R \end{bmatrix}    (5.13)

which is precisely what is required– namely, the signals measured at the eardrums, X_L and X_R, are the desired signals, X'_L and X'_R. This means that in theory one only has to measure the transfer function matrix H and invert it to deliver any desired signals to the listener's eardrums. This is truly a powerful technique, and it explains the value of TS for psychoacoustics experiments.

Figure 5.6: During transaural synthesis, signals y'_A and y'_B are played from loudspeakers A and B to attain X_L = X'_L and X_R = X'_R at the eardrums.
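Numerically, Eq. 5.10 translates directly into a per-bin computation. The following is a minimal numpy sketch (illustrative only; the lab's implementation was in MATLAB), with the H array indexed as [ear, loudspeaker, frequency bin]:

```python
import numpy as np

def ctc_2x2(H, X_desired):
    """Two-loudspeaker transaural synthesis, Eqs. 5.9-5.10, bin by bin.

    H         : complex array [2, 2, n_bins]; H[0] = (H_AL, H_BL), H[1] = (H_AR, H_BR)
    X_desired : complex array [2, n_bins], desired left/right eardrum spectra
    returns   : complex array [2, n_bins], loudspeaker spectra Y'_A and Y'_B
    """
    H_AL, H_BL = H[0]
    H_AR, H_BR = H[1]
    det = H_AL * H_BR - H_AR * H_BL          # denominator of Eq. 5.10
    Y_A = ( H_BR * X_desired[0] - H_BL * X_desired[1]) / det
    Y_B = (-H_AR * X_desired[0] + H_AL * X_desired[1]) / det
    return np.stack([Y_A, Y_B])
```

Inverse Fourier transforms of Y'_A and Y'_B then give the time-domain signals to play from the two loudspeakers.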
5.4 Two-loudspeaker experiments
This section describes a proof-of-principle experiment using two synthesis loudspeakers for signal delivery in an ordinary room.

5.4.1 Experiment setup
The experiment was conducted in the PLab, a rectangular room with dimensions 4.3 × 5.5 × 3.0 meters. The ceiling is acoustical tile and the floor is vinyl tile. The walls are plaster, but three of them were treated with absorption (Auralex Sunburst, Auralex, Indianapolis, IN)– a total of 13 square meters of absorption at mid frequencies. The absorbing panels had been removed from the wall nearest the manikin (on its left side) to produce a more challenging room transfer function. This room arrangement was Room Setup 1. The reverberation time averaged 0.239 s in the 250 and 500 Hz octave bands, and averaged 0.144 s in the four octave bands from 1000 to 8000 Hz.

TDT hardware and the controller PC sat on a desk directly outside the PLab. During all measurements, the lab door was closed and the experimenter was seated outside the PLab. All cables connected to the TDT hardware fit comfortably under the door into the lab. Outputs from the TDT RP2.1 DACs (unbalanced line) were connected via long BNC cables to loudspeakers A and B (Mackie HR824mk2 Studio Monitors, LOUD Technologies, Woodinville, WA). Loudspeakers A and B were mounted on portable stands, while a third loudspeaker– loudspeaker G (Mackie HR824 Studio Monitor)– was mounted on a sturdy microphone stand. Loudspeaker G was not used in the experiment, but its bulky shape was a reflecting object in the room. The loudspeakers were secured to their stands via ratchet straps. KEMAR was mounted with its "ears" 117 cm from the floor. All three loudspeakers were pointed directly at the "head" with their centers at the level of the "ear canals." Loudspeakers A and B were positioned at −120° and 120° with respect to the manikin's forward direction, and loudspeaker G at −140°. A photograph of the setup is shown in Fig. 5.7.

Figure 5.7: Shown here are loudspeakers A (KEMAR's left) and B (KEMAR's right) on the sides (±120°) and G behind at −140°, with KEMAR located at the reference position. Acoustical foam wedges, which reduce reflections, are noticeable in the background.

5.4.2 Measuring H
TS does not require loudspeaker gains or signal levels in the ears to be the same, but to ensure a good signal-to-noise ratio for the transfer function measurements, the gains of the loudspeakers were adjusted during calibration to produce a level of 74 dBA at KEMAR's reference position, as measured by a sound level meter while signal y_H was played from the loudspeaker. For convenience, y_H = x0. Signal y_H was played from synthesis loudspeaker A and recordings were made in KEMAR's left and right internal microphones at the "eardrums" to obtain H_AL and H_AR. Signal y_H was then played from synthesis loudspeaker B and recordings were made in the manikin's left and right internal microphones to obtain H_BL and H_BR.

5.4.3 Conducting the synthesis
Custom MATLAB scripts were written to calculate signals Y'_A and Y'_B according to Eq. 5.10. The simple case in which X'_L = X'_R = X0 was examined. The loudspeaker signals were converted to time domain signals, y'_A and y'_B, through inverse Fourier transforms. Signals y'_A and y'_B were then played simultaneously from loudspeakers A and B while recordings (x_L and x_R) were made with KEMAR's internal microphones.

5.4.4 Results
Synthesis at the left "eardrum" is shown in Fig. 5.8. The measured signal (X_L) is indicated by the open symbols and agreed well with the desired signal (X'_L, filled symbols) at nearly all frequencies. The amplitude at 4297.5 Hz did deviate from the desired amplitude (0.658 vs. 1.0); however, suppressed spectral components like this one are generally not perceptible (Bücklein, 1981). The root-mean-square (RMS) error (calculated across all 211 spectral components) for amplitude reproduction was 0.0394 in linear amplitude units, and the RMS phase error was 0.0469 radians (2.69°).

Figure 5.8: TS in the left "ear" using loudspeakers A and B.
Filled symbols indicate the desired signal at the eardrum, and open symbols indicate the measured signal at the "eardrum." When a filled symbol is not seen, it is because an open symbol obscures it. RMS errors are on the scale of the vertical axis. Loudspeakers A and B were located at −120° and 120°. Loudspeaker G, a reflecting object, was located at −140°.

5.4.5 Discussion
TS using two loudspeakers enables precise signal delivery to the listener's ears, as shown by Fig. 5.8. Further, it avoids the run-to-run variability problem associated with headphone experiments because TS requires a new H to be measured at each experiment sitting. Loudspeaker presentation offers the additional advantages of good sound-image externalization and the ability to compare with a real target source. Indeed, the technique has proven to be immensely useful in localization and perceptual experiments (Hartmann et al., 2016; Zhang and Hartmann, 2010; Moore et al., 2010).

It is worth scrutinizing Eq. 5.10 more closely. It is repeated here for convenience:

    \begin{bmatrix} Y'_A \\ Y'_B \end{bmatrix} = \frac{1}{H_{AL}H_{BR} - H_{AR}H_{BL}} \begin{bmatrix} H_{BR} & -H_{BL} \\ -H_{AR} & H_{AL} \end{bmatrix} \begin{bmatrix} X'_L \\ X'_R \end{bmatrix}

If the denominator, i.e. H_AL(f)H_BR(f) − H_AR(f)H_BL(f), is very small, then Y'_A(f) and Y'_B(f) will be very large. It just so happens that the determinant for a particular spectral component may sometimes be quite small. Such potential for very large amplitudes, or gains, in the inverse filter is undesirable because any spuriously large amplitude in Y'_A or Y'_B could manifest as a tone to the listener during synthesis. Large gains could also lead to nonlinear distortion. Discrete tones and/or distortion could compromise what would otherwise be an accurate and perceptually-persuasive synthesis.

Figure 5.9 shows a measurement in which a "problematical" spectral component occurred. The desired signal in the ear was equal-amplitudes, random-phases noise, but with different random phases than the X0 shown in Fig. 5.1. For convenience, X0 will henceforth refer to any equal-amplitudes, random-phases stimulus. The spectral component at 10183 Hz exceeded the desired amplitude by 5 dB (top panel). This component would protrude as a tone to the listener. The discrepancy was also apparent in the phase spectrum (bottom panel). In general, if a problem occurred in the amplitude spectrum, then it was also observed in the phase spectrum (cf. Eq. 5.1).

Figure 5.9: Synthesis measured at KEMAR's right "eardrum." In this particular measurement, loudspeakers A and B were located at −90° and 90°, and loudspeaker G, a reflecting object, was at 180°. The spectral component at 10183 Hz exceeded the desired signal amplitude by 5 dB.

In such cases where large gains exist, the inversion of matrix H is said to be ill-posed, and this is a well-known problem with TS. Zhang and Hartmann (2010) conducted front-back localization experiments with complex tones using TS and eliminated any frequency component that deviated by 50% or more from the desired amplitude. In reality, the number of large amplitudes was small, but listeners noted them as being quite salient unless they were eliminated.
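A related diagnostic, added here by way of illustration (it is not the screening Zhang and Hartmann used, which operated on the synthesized amplitudes themselves), is to flag bins whose determinant is small relative to the typical squared element magnitude before running a synthesis. The threshold below is arbitrary:

```python
import numpy as np

def ill_posed_bins(H, rel_threshold=0.05):
    """Flag frequency bins where |det H(f)| is small relative to the mean
    squared element magnitude, a symptom of large inverse-filter gains.
    H is indexed [ear, loudspeaker, bin] as in the earlier sketch."""
    det = H[0, 0] * H[1, 1] - H[1, 0] * H[0, 1]     # H_AL*H_BR - H_AR*H_BL
    scale = (np.abs(H) ** 2).mean(axis=(0, 1))       # per-bin amplitude scale
    return np.flatnonzero(np.abs(det) < rel_threshold * scale)
```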
Some researchers have sought to improve the invertibility of H and to increase control over the synthesis through precise loudspeaker placement. Kirkeby et al. (1998a, 1998b) introduced the 'stereo dipole,' in which synthesis loudspeakers were spaced close together to enlarge the area of controlled synthesis. Ward and Elko (1998, 1999) did calculations based on the geometry of loudspeakers and receiving points to show how loudspeaker placement could be optimized to maximize the robustness of crosstalk cancellation filters. Takeuchi and Nelson (2001, 2002) proposed the Optimal Source Distribution (OSD), which had increasing angular separation between loudspeakers with decreasing frequency. The subsequent decade saw much effort to optimize loudspeaker placement for maximal robustness to head displacements (Takeuchi, Nelson, and Hamada, 2001; Rose et al., 2002; Nelson and Rose, 2005; Bai and Lee, 2006; Parodi and Rubak, 2010). These experiments and simulations were primarily concerned with maximizing the so-called "sweet spot," the region of space over which the illusion of a virtual sound source holds. They were less concerned with precise signal delivery, and they all utilized matrix regularization, a general method for handling ill-posed matrix inversions.

Kirkeby et al. (1998c, 1999) calculated crosstalk cancellation filters using matrix regularization. This approximation limits the maximum gain allowed in the crosstalk filters, thereby mitigating what would otherwise be problematical components in the matrix inversion. However, there is uncertainty in the correct regularization parameter to use. Further, by introducing an error term to the inversion, the resulting filter H^{-1} becomes an approximation. Regularization may also produce artifacts or distortion when used inappropriately (Norcross et al., 2004). Nevertheless, after the work of Kirkeby et al., regularization became the standard method of dealing with unwanted large gains in crosstalk cancellation filters. It should be noted that Zhang and Hartmann (2010) and Hartmann et al. (2016) did not use matrix regularization.

5.5 Three or more synthesis loudspeakers
A different approach for mitigating the large amplitudes that occasionally plague TS is to add a new degree of freedom to the system– namely, a third loudspeaker. Bauck and Cooper (1996) presented the generalized problem of crosstalk cancellation for any number of loudspeakers and listening points in space. In the current application, points in space are replaced by ear canals. The generalized mathematical problem is given by Eq. 5.4b, which is repeated for convenience:

    X = HY

where X indicates the signals at the listening points, H is the transfer function matrix, and Y indicates the signals sent to the loudspeakers. If the numbers of loudspeakers and listening points are equal, H is a square matrix and calculating its inverse is simple. As discussed for the 2-loudspeaker and 2-ear case, referred to as a 2 × 2 system, there can be times when the determinant of the square matrix is very small, which is troublesome. When the number of listening points exceeds the number of loudspeakers (as is the case with home theater systems), H is non-square and the inverse problem is said to be overdetermined, meaning there is no solution. If the number of loudspeakers exceeds the number of listening points, H is again non-square, but the inverse problem is now underdetermined, meaning multiple solutions exist.
Theoretically an infinite number of solutions exists, but Bauck and Cooper showed that the Moore-Penrose pseudoinverse matrix, H^+, provides an ideal solution:

    Y' = H^+ X'    (5.14)

The pseudoinverse matrix H^+ provides an ideal solution because it allows for a suitable inversion of the non-square transfer function matrix H, and it ensures the least-norm solution (Moore, 1920; Penrose, 1955a,b) to the underdetermined crosstalk cancellation problem. Effectively, this latter property means that the solutions, Y', minimize the total power delivered to the loudspeakers during synthesis. This implies there ought to be very few large amplitudes appearing in Y'. Further, H^+ provides an exact solution, since its calculation involves no approximations.

5.5.1 Calculating the pseudoinverse
When a matrix H has more columns (loudspeakers) than rows (listening points), the pseudoinverse is defined as:

    H^+ = H^* (H H^*)^{-1}    (5.15)

where H^* is the complex conjugate transpose of H. A crucial property of H^+ for crosstalk cancellation purposes is that H H^+ = I, the identity matrix. This gives the ability to write Eq. 5.14 as the generalized solution for the underdetermined crosstalk cancellation problem. Compare with Eq. 5.8 for the 2 × 2 system: Y' = H^{-1} X'. The equations differ only in how the inverse filter is defined, either as H^{-1} or H^+.

The simplest case to consider is the 2-listening-points and 3-loudspeakers, or 2 × 3, system. In analogy to Eq. 5.7 for the 2 × 2 system, the matrix equation can be written as

    \begin{bmatrix} X_L \\ X_R \end{bmatrix} = \begin{bmatrix} H_{AL} & H_{BL} & H_{GL} \\ H_{AR} & H_{BR} & H_{GR} \end{bmatrix} \times \begin{bmatrix} Y_A \\ Y_B \\ Y_G \end{bmatrix}    (5.16)

for the 2 × 3 system. The solution of Eq. 5.16, in analogy to Eq. 5.10, is

    \begin{bmatrix} Y'_A \\ Y'_B \\ Y'_G \end{bmatrix} = \begin{bmatrix} H^+_{AL} & H^+_{AR} \\ H^+_{BL} & H^+_{BR} \\ H^+_{GL} & H^+_{GR} \end{bmatrix} \times \begin{bmatrix} X'_L \\ X'_R \end{bmatrix}    (5.17)

where H^+ = H^* (H H^*)^{-1}. Calculation of H^* (H H^*)^{-1} is shown below:

    (H H^*)^{-1} = \begin{bmatrix} H_{AL}H^*_{AL} + H_{BL}H^*_{BL} + H_{GL}H^*_{GL} & H_{AL}H^*_{AR} + H_{BL}H^*_{BR} + H_{GL}H^*_{GR} \\ H_{AR}H^*_{AL} + H_{BR}H^*_{BL} + H_{GR}H^*_{GL} & H_{AR}H^*_{AR} + H_{BR}H^*_{BR} + H_{GR}H^*_{GR} \end{bmatrix}^{-1}    (5.18)

Therefore, H^+ is given by

    H^+ = \begin{bmatrix} H^*_{AL} & H^*_{AR} \\ H^*_{BL} & H^*_{BR} \\ H^*_{GL} & H^*_{GR} \end{bmatrix} \frac{1}{Det} \begin{bmatrix} H_{AR}H^*_{AR} + H_{BR}H^*_{BR} + H_{GR}H^*_{GR} & -(H_{AL}H^*_{AR} + H_{BL}H^*_{BR} + H_{GL}H^*_{GR}) \\ -(H_{AR}H^*_{AL} + H_{BR}H^*_{BL} + H_{GR}H^*_{GL}) & H_{AL}H^*_{AL} + H_{BL}H^*_{BL} + H_{GL}H^*_{GL} \end{bmatrix}    (5.19)

where

    Det = (H_{AL}H^*_{AL} + H_{BL}H^*_{BL} + H_{GL}H^*_{GL}) × (H_{AR}H^*_{AR} + H_{BR}H^*_{BR} + H_{GR}H^*_{GR})
          − (H_{AL}H^*_{AR} + H_{BL}H^*_{BR} + H_{GL}H^*_{GR}) × (H_{AR}H^*_{AL} + H_{BR}H^*_{BL} + H_{GR}H^*_{GL})
        = (|H_{AL}|² + |H_{BL}|² + |H_{GL}|²) × (|H_{AR}|² + |H_{BR}|² + |H_{GR}|²)
          − (H_{AL}H^*_{AR} + H_{BL}H^*_{BR} + H_{GL}H^*_{GR}) × (H_{AR}H^*_{AL} + H_{BR}H^*_{BL} + H_{GR}H^*_{GL})    (5.20)

The resulting 3 × 2 matrix is

    H^+ = \begin{bmatrix} H^+_{AL} & H^+_{AR} \\ H^+_{BL} & H^+_{BR} \\ H^+_{GL} & H^+_{GR} \end{bmatrix}    (5.21)

Equation 5.22 represents the situation in which the solution Y' is processed by the transfer function H, and recordings X are made in the ears:

    X = H Y'    (5.22)

Substituting Eq. 5.14 for Y' yields

    X = H H^+ X'    (5.23)

Since H H^+ = I, the result is X = X', as required.

It is worth noting that the matrix inverted in Eq. 5.18, HH^*, is a 2 × 2 matrix. This would be true for any number of loudspeakers in an underdetermined crosstalk cancellation scenario. It is mentioned because, no matter how many loudspeakers are used, one need never invert anything larger than a 2 × 2 matrix. Thus, the mathematics can be extended in a straightforward manner to include four or more loudspeakers.
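Numerically, the least-norm solution of Eqs. 5.14-5.21 is obtained bin by bin. A hedged numpy sketch follows; np.linalg.pinv returns the Moore-Penrose pseudoinverse (via the SVD), which for a full-row-rank 2 × 3 matrix coincides with H^*(HH^*)^{-1}:

```python
import numpy as np

def transaural_2x3(H, X_desired):
    """Underdetermined synthesis, Y'(f) = H+(f) X'(f) (Eq. 5.14).

    H         : complex array [2 ears, 3 loudspeakers, n_bins]
    X_desired : complex array [2 ears, n_bins]
    returns   : complex array [3 loudspeakers, n_bins]
    """
    # Move frequency to the leading axis; np.linalg.pinv broadcasts over it,
    # computing the pseudoinverse of each per-bin 2x3 matrix.
    Hk = np.moveaxis(H, -1, 0)                       # [n_bins, 2, 3]
    Xk = np.moveaxis(X_desired, -1, 0)[..., None]    # [n_bins, 2, 1]
    Y = np.linalg.pinv(Hk) @ Xk                      # [n_bins, 3, 1]
    return np.moveaxis(Y[..., 0], 0, -1)             # [3, n_bins]
```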
It is thought that including additional loudspeakers would decrease the power delivered to each loudspeaker still further, and thus further alleviate the problem of spuriously large amplitudes. Simulations by Shore et al. (2018) indeed revealed an enhanced benefit from a 2 × 4 compared to a 2 × 3 system, but the benefit was smaller than the benefit conferred by going from 2 × 2 to 2 × 3. Further, in an experimental implementation of TS one must trade off between improved synthesis and the additional hardware necessary to achieve that improvement. The three-loudspeaker system was pursued here because it was the simplest possible scenario for answering the question of whether additional loudspeakers do indeed provide benefit over the traditional 2 × 2 system. The 2 × 3 system offered the further advantage of requiring less computation time than a 2 × 4 (or more) loudspeaker system.

Some researchers have incorporated more than two loudspeakers into their crosstalk cancellation networks. Takeuchi and Nelson (2001, 2002) and Akeroyd et al. (2007) used two channels that were fed into a crossover network coupled to three loudspeaker pairs spaced at small, mid, and large angles for synthesizing high, mid, and low frequencies in their OSD system. This was essentially an extension of the two-loudspeaker system. Bai, Tung, and Lee (2005) used six independent loudspeakers to deliver signals to a listener and incorporated multiple control points to gain greater control over the sound field. The goal was to widen the sweet spot. In all cases, the researchers used matrix regularization to limit the maximum gains in the crosstalk cancellation filters, resulting in approximate solutions.

5.5.2 Experiment with three loudspeakers
The 2 × 3 loudspeaker experiments used loudspeaker G during synthesis. Since the RP2.1 has only two DAC channels, it was necessary to use a second RP2.1 module for a third DAC channel to connect to loudspeaker G. Synchronous triggering of the three DAC channels was achieved via the common Zbus trigger, which was executed in the RPVdS environment. Recordings with KEMAR's internal microphones were made in the same manner as previously described in section 5.4. In practice, the 2 × 2 measurement shown in Fig. 5.9 was immediately succeeded by the corresponding 2 × 3 measurement. This means that the same transfer functions measured with loudspeakers A and B for the 2 × 2 experiment were used in the 2 × 3 experiment. The 2 × 3 measurement additionally included the loudspeaker G transfer functions, H_GL and H_GR, which were measured in the same way as H_AL, H_AR, H_BL, and H_BR. A custom MATLAB program calculated y'_A, y'_B, and y'_G, which were played simultaneously over the loudspeakers while recordings were made with the internal microphones at KEMAR's "eardrums."

5.5.3 Results
Figure 5.9 is repeated as panel (a) in Fig. 5.10 for convenience. Panel (b) shows the corresponding 2 × 3 synthesis for the right "ear." The very large amplitude that occurred at 10183 Hz in the 2 × 2 system was reduced from 1.739 to 1.064 by the 2 × 3 system. Further, the RMS amplitude error was 61% smaller in the 2 × 3 system. The RMS phase error was 26% smaller.

Figure 5.10: Synthesis spectra recorded at the right "eardrum." Amplitudes in the (a) 2 × 2 and (b) 2 × 3 system. Phases in the (c) 2 × 2 and (d) 2 × 3 system. Filled symbols indicate the desired signals at the eardrum and open symbols indicate the measured synthesis signals.
5.5.4 Discussion

Figure 5.10 illustrates a case in which the 2 × 3 system clearly offered benefit over the 2 × 2 system. The large amplitude protruding at 10183 Hz in the 2 × 2 system was substantially reduced in the corresponding 2 × 3 synthesis. Recall that the calculation of H⁻¹ and H⁺ used the same H_AL, H_AR, H_BL, and H_BR inputs, while H⁺ additionally used H_GL and H_GR. The extra degree of freedom provided by the 2 × 3 system apparently allowed the synthesis to avoid the outsized amplitude. This result supports the idea that when TS encounters complications, the 2 × 3 system more reliably outputs a stable inversion of matrix H. A well-conditioned inverse matrix yields fewer large amplitudes in Y′ and therefore in X. To investigate whether it is a general result that maximum synthesis amplitudes (in Y′) are smaller for the 2 × 3 system, a systematic study was conducted and is described in the next section.

5.6 Comparison of 2- and 3-loudspeaker spectral amplitudes

The simulations and experiments described here are primarily concerned with comparing the maximum synthesis amplitudes (in Y′) that occurred in a 2 × 2 versus 2 × 3 system.

5.6.1 Simulations

Simulations of maximum synthesis amplitudes generated by the 2 × 2 and 2 × 3 systems are briefly described here: desired ear canal signals (X′) were randomly generated– each such signal was a sine function which might be one of the Fourier components of an arbitrary broadband noise. Amplitudes were Rayleigh-distributed with a standard deviation of 1.0, and the phases were randomly distributed over 360°. Randomly generated transfer functions (H) were used to simulate the filtering of signals on their path from synthesis loudspeakers to ear canals. These random transfer functions approximately simulated responses in a room environment with standing waves. The real and imaginary parts of the transfer function matrices were independently normally distributed with unit variance. Therefore, the mean square amplitude of transfer function matrix elements was 2.0. These properties of ear canal signals and matrix elements established the amplitude scale for the tests and ensured a fair comparison of synthesis amplitudes for the different systems.

Computational tests included the 2 × 2 and 2 × 3 systems. Synthesized amplitudes were generated for one million trials for each system. Only the maximum amplitude across the two or three loudspeakers was retained in a given trial. The distributions of synthesis amplitudes for each system are shown in Fig. 5.11. Each histogram has two hundred bins for maximum amplitudes between 0 and 20. The first bin gives the number of trials in which the maximum amplitude was between 0 and 0.1, the second bin for 0.1 to 0.2, and so on. The right-most bin of the histogram enumerates the number of trials where the maximum amplitude was greater than 20, i.e. out of range. For the 2 × 2 system there were 3741 out of range and for the 2 × 3 there were 8. Table 5.1 shows percentiles when the distributions of Fig. 5.11 are turned into cumulative distributions.

Figure 5.11: Distributions of maximum amplitudes among (a) 2, or (b) 3 synthesis signals from the random matrix models. The mean amplitude for the 2 × 2 system is 2.0, which sets the scale for both plots. An amplitude of 20 is ten times the mean, or 20 dB higher. The bin on the far right includes all the amplitudes greater than 20. There were 3741 amplitudes out of range in the 2 × 2 system, and 8 in the 2 × 3.

    Percentile   2 × 2   2 × 3
    90.0          3.7     1.6
    99.0         12.2     3.2
    99.9           –      5.7

Table 5.1: Percentiles for maximum amplitudes when the distributions of Fig. 5.11 are turned into cumulative distributions. For instance, the upper left entry shows that for the 2 × 2 system, 90% of the maximum amplitudes were less than 3.7. The mean amplitude for the 2 × 2 system was 2.0, which sets the scale for both systems. Therefore, the amplitude of 3.7 is 5.3 dB above the mean.
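A minimal sketch of this random-matrix simulation follows (fewer trials than the one million used above, for speed; the Rayleigh scale parameter of 1.0 is taken from the description in the text):

    import numpy as np

    rng = np.random.default_rng(1)

    def max_synthesis_amplitudes(n_spk, n_trials=100_000):
        """Largest |Y'| across the loudspeakers, one value per trial."""
        out = np.empty(n_trials)
        for i in range(n_trials):
            # Desired ear signals: Rayleigh amplitudes, uniform random phases
            amp = rng.rayleigh(scale=1.0, size=2)
            X = amp * np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=2))
            # Random transfer matrix: unit-variance real and imaginary parts,
            # so the mean square amplitude of each element is 2.0
            H = rng.normal(size=(2, n_spk)) + 1j * rng.normal(size=(2, n_spk))
            out[i] = np.abs(np.linalg.pinv(H) @ X).max()  # H^-1 when n_spk == 2
        return out

    for n_spk in (2, 3):
        a = max_synthesis_amplitudes(n_spk)
        print(n_spk, np.percentile(a, [90.0, 99.0, 99.9]).round(1),
              "out of range:", int((a > 20).sum()))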
5.6.2 Experiments– setup

Multiple measurements were performed using two-loudspeaker synthesis immediately followed by the matching three-loudspeaker synthesis. For each measurement, the synthesis loudspeaker positions and/or the manikin's position were changed. Equal-amplitudes, random-phases stimuli were used for all measurements, but the random phases were different from one measurement to the next. The motivation behind making changes in each measurement was to increase the ability to generalize the results.

Six different geometrical configurations were used in order to expand the variation of transfer functions. First was the "120-degree-reference set," in which loudspeakers A and B were placed on opposite sides of the manikin, 1 m away and at approximately −120° and 120° from the manikin's forward direction. Loudspeaker G was located at 180° and was also 1 m away. In a second configuration, loudspeaker G was moved to −140°. Since loudspeaker G was not used in the 2 × 2 synthesis, the effect of moving it was merely to change the position of a reflecting object. The two loudspeaker G configurations were crossed with three positions of the manikin– the standard 1 m distance, a displacement of 0.1 m forward, and a displacement of 0.1 m backward. Then the entire set of six configurations was repeated except that loudspeakers A and B were at −90° and 90°, to make the "90-degree-reference set." In total there were twelve configurations.

The same set of 211 amplitudes was used for the desired signal at the eardrums and for the measurements of the transfer functions– a convenience but not a necessity. However, for each of the six configurations of a reference set, a different set of random phases was used. Transfer functions and synthesis at the eardrums were measured using procedures essentially identical to those described in sections 5.4.1 and 5.5.2. After measurements in one configuration, the loudspeakers and/or manikin were moved to a new configuration.

5.6.3 Experiments– results

The largest of the spectral amplitudes across either the two or three synthesis loudspeakers was identified at each frequency. With twelve different geometrical configurations and 211 frequencies there were 2532 amplitude values for the 2 × 2 system. There were also 2532 values for the 2 × 3 system. In order to compare with the random-matrix computational modeling in section 5.6.1, the measured amplitudes for both systems were multiplied by a single scale factor so that the mean of the measured distribution for the 2 × 3 system was the same as the mean of the model distribution for the 2 × 3 system. The result is that the measured distributions of maximum amplitudes shown in Fig. 5.12 can be directly compared to Fig. 5.11. For the 2 × 2 system there were eight amplitudes off the plot, and for the 2 × 3 there were none. The largest amplitude occurred for the 2 × 2 system when the synthesis loudspeakers were at ±90°.
That is an anticipated result. When a source is located at 90°, there is a bright spot at the ear on the opposite side of the head, tending to enlarge the off-diagonal terms in H (Macaulay et al., 2010). When both sources are at 90°, the effect is doubled. If on- and off-diagonal terms are of comparable size, the risk of small denominators, and thus large synthesis amplitudes, is enhanced. Note that the effect of the bright spot was completely eliminated by adding the third loudspeaker in the 2 × 3 system.

    Percentile   2 × 2   2 × 3
    90.0          2.8     2.0
    99.0          9.6     5.1
    99.9           –     10.2

Table 5.2: Percentiles for maximum amplitudes (experiment) when the distributions of Fig. 5.12 are turned into cumulative distributions. For instance, the upper left entry shows that for the 2 × 2 system, 90% of the maximum amplitudes were less than 2.8.

The mean amplitude for the 2 × 2 system was 1.48. It can fairly be compared with the value 2.02 for the random-matrix calculation. The difference indicates that the manikin measurements revealed physical constraints on the size of the crosstalk and a consequent limitation on the size of the synthesis amplitudes. Table 5.2 shows percentiles for the experiment, similar to Table 5.1, which shows percentiles for the random-matrix calculations. For the 2 × 2 system, the 90% and 99% points occur at smaller values of amplitude for the experiment than for random matrices. This is another indication that the random-matrix calculation was not realistically constrained.

Table 5.2 shows that adding a third synthesis loudspeaker substantially reduced the experimental percentile amplitudes. This is consistent with what is seen in Fig. 5.12. However, comparison with Table 5.1 shows that the experimental amplitudes were larger than the corresponding amplitudes for the random-matrix calculation. Apparently the experimental benefit of the third loudspeaker, though substantial, was less than the theoretical benefit seen in Table 5.1.

5.6.4 Discussion

Comparing Figs. 5.11 and 5.12 for corresponding systems (e.g. 2 × 2) shows that the synthesis amplitudes appear to be similarly distributed for the computational model and the experiment, although there are many fewer points in the experiment. This is interpreted to mean that the random-matrix model is a reasonable model for the stimulus noise components as modified by the transfer functions in a room, though Table 5.2 shows that the experiment encountered less extreme cases than the model did.

Both the computational modeling and the experiment showed that the distributions of maximum amplitudes were progressively skewed toward smaller values as the number of synthesis loudspeakers increased from two to three. The signal in each loudspeaker for the 2 × 3 system can be smaller on average and still achieve the same power at the ears. This argument leads to a reduction in amplitude by a factor of √(3/2) ≈ 1.2 (see the short derivation at the end of this section). However, the average amplitude reduction was larger– about 1.5. The skew in the distribution is attributed to the advantage of the pseudoinverse matrix. More important from an experimental standpoint is that there are fewer instances of very large amplitudes when the number of synthesis speakers is increased. This is presumably due to fewer instances of pathological inverse matrices. Nevertheless, there remain a few large synthesis amplitudes in Fig. 5.12. Those are attributed to standing wave nulls in the room or anti-resonances in the ear canals. The pseudoinverse cannot solve those problems.
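The √(3/2) factor mentioned above follows from a simple power argument. Assuming the same total power P must be delivered to the ears regardless of the number of loudspeakers N, and that the power is shared equally among the loudspeakers:

\[
a_N \propto \sqrt{\frac{P}{N}}
\qquad\Longrightarrow\qquad
\frac{a_2}{a_3} \;=\; \sqrt{\frac{P/2}{P/3}} \;=\; \sqrt{\frac{3}{2}} \;\approx\; 1.22 .
\]

The observed reduction of about 1.5 exceeds this equal-sharing estimate; the excess is attributed to the least-norm property of the pseudoinverse.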
5.7 Synthesis accuracy– dichotic, invented signals

An experimental test of the accuracy of synthesis for the 2 × 2 and 2 × 3 systems was conducted. The test utilized an invented dichotic signal which was intended to be challenging for the synthesis method. The experiment used the setup from section 5.6.2, with minor alterations.

5.7.1 Dichotic, invented signals

The invented signals had frequency-dependent amplitudes, increasing by 20 dB in the left "ear" and decreasing by 20 dB in the right "ear." Both amplitude dependences were straight-line functions of the frequency. There were 211 spectral components from 200 Hz to 15855 Hz. The phases were independently randomized in each ear. Loudspeakers A and B were at −120° and 120°, and 0.8 m from the manikin's head. Loudspeaker G was at 180° and 1 m from the head.

The transfer function measurements again used KEMAR's internal microphones but a different method compared to that in section 5.6.2. In order to explore greater generality, the transfer functions were measured using a maximum length sequence generated by a 17-stage shift register, leading to 2^17 − 1 = 131071 values. At the sample rate used (48828.125 samples per second) the duration was about 2.7 seconds– adequate for synthesis of a brief sentence. Transfer function matrices were correspondingly large, with frequency spacing of about 1/2.7 Hz, but most of the matrix elements were unimportant in this test. Only the elements with frequencies of the 211 components were important.

5.7.2 Results

After the inverse was applied to the desired signals (Eq. 5.8), the resulting signals sent to the loudspeakers (Y′) looked nothing like the desired signals (X′) because the left and right desired spectra were so different. However, the spectra recorded by KEMAR's internal microphones (X) were similar to X′, as shown by Figs. 5.13 and 5.14.

The 211 measured amplitudes appear as open symbols in panels a and b of Figs. 5.13 and 5.14. They are plotted on top of the desired (invented) amplitudes, which are shown by filled symbols. When a filled symbol is not seen, it is because the corresponding open symbol obscures it. The phase differences shown in these figures were obtained by subtracting the desired phases from the measured phases. The differences were then reduced to the range −180° to 180° by adding or subtracting multiples of 360°.

5.7.3 Discussion

Measured spectral amplitudes for both "ears" showed anomalous values between 3 and 4 kHz and near 10 kHz. These were likely caused by the first and second "ear canal" resonances. Also, discrepancies tended to be larger when the amplitudes were smaller.

It is interesting to try to track the discrepant amplitudes through Figs. 5.13 and 5.14. For each of the four amplitude plots there, the largest ten discrepancies were found. Seven of them were given numbered labels. The largest discrepancy ever found was given the label '1', and it appears as one of the ten largest discrepancies in all the amplitude plots except for Fig. 5.14a. For a given "ear," a component with a discrepant amplitude for the 2 × 2 system might be expected to be discrepant for the 2 × 3 system if head/pinnae diffraction leads to a small amplitude on calibration. Figures 5.13 and 5.14 show four such instances [points 1, 2, 5, 7]. The figures make it clear that adding a third loudspeaker to make the 2 × 3 system led to decreased amplitude discrepancies.
Especially important, the largest amplitudes, which become problems for a 2 × 2 synthesis, were significantly reduced with the 2 × 3 synthesis. These outsized amplitudes (signals X) often corresponded to large amplitudes in the synthesis signals (Y′, not shown) and may be attributed to pathological inverse transfer functions.

The phase plots in Figs. 5.13 and 5.14 agreed with the amplitude plots in the sense that discrepancies occurred at, or near, the same frequencies for both kinds of plots. Phase discrepancies tended to increase with increasing frequency, as expected, because phase is the product of delay and frequency. Phase errors in Fig. 5.13 had a decreasing linear component, indicating a simple delay. Similar to the observations on the amplitudes, discrepancies for the phases were smaller and fewer for the 2 × 3 system.

5.8 Synthesis accuracy– signals from a real source

The real-source experiments were practical tests of TS. In practice, an experimenter may want to synthesize signals at a listener's eardrums based on signals from a remote source as measured in the ear canals. In these experiments, the synthesis was based on probe microphone measurements and tested by KEMAR internal microphone recordings. The experiments tested the idea that if a synthesis got it right in the probe microphones then it would also get it right at the eardrums (internal microphones). The relevant mathematics appear in Appendix B.

5.8.1 Experiment– noise

The real-source experiment used a variation on the synthesis described in section 5.6.2. Loudspeakers A and B were at ±120° and at a distance of 1 m from the center of the "head." Loudspeaker G was at 180° and also at 1 m from the "head." The real-source loudspeaker was 28° to the right of the forward direction at 3.8 m, to enhance relative room effects. A diagram of the arrangement is shown in Fig. 5.15. The room was arranged in Room Setup 2, in which the acoustical foam was removed from all walls. In addition, porcelain tile panels (2.7 m²) were placed along the wall behind the synthesis loudspeakers. Setup 2 provided a longer reverberation time and a more challenging test environment for synthesis. The reverberation time averaged 0.463 s in the six octave bands from 250 to 8000 Hz. The probe microphones were Etymotic ER-7s (Etymotic, Elk Grove Village, IL) inserted with their tips close to the "eardrums" of the KEMAR ears. These are the same probe microphones used with human listeners.

In the real-source experiments, the target was the measurement at the probe microphones of the signal from the real-source loudspeaker. The first signal was again a 211-component noise. The target was used to create the synthesized signals. The standard was the measurement at the internal KEMAR microphones of that same signal. The standard was used to evaluate the quality of the ultimate synthesis.

A straightforward approach to the target and standard would be to turn on the real source and make the recordings at the two sets of microphones. However, because the microphones (especially the probe microphones) and the environment were somewhat noisy, the MLS technique was employed. The MLS was used to measure the impulse response between the real source and the probe microphones, and the target was determined by convolving the original signal with the impulse response.
Further, the MLS was used to measure the impulse response between the real source and the internal microphones, and the standard was determined by convolving the original signal with the impulse response.¹ Comparison between desired and measured signals used only a 211-component subset of frequency components so that results could be conveniently displayed.

¹Measurements showed that this latter method, with the impulse response averaged over eight repetitions of the sequence, improved the signal-to-noise ratio by 33 dB over direct recording– an enormous advantage.

5.8.2 Results– noise

Figure 5.16: Amplitudes and phase errors measured by KEMAR's internal microphone in the right "ear" for the 2 × 2 and 2 × 3 systems. The real source was a 211-component white noise. Top two panels: the standard amplitudes are shown by filled symbols. They are the same for the 2 × 2 and 2 × 3 systems. The measured amplitudes are shown by the open symbols. Amplitudes above 8097 Hz are multiplied by five for better viewing. Bottom two panels: differences (in degrees) between measured and standard phases.

                 Left ear              Right ear
                 Probes   Internal     Probes   Internal
    2 × 2 (dB)   −27.4    −23.8        −24.4    −23.8
    2 × 3 (dB)   −27.0    −24.0        −30.0    −26.1
    2 × 2 (°)     7.19    16.24         4.13     7.88
    2 × 3 (°)     5.32    11.48         3.51     7.22

Table 5.3: RMS errors for synthesis of the 211-component noise from the real source. RMS amplitude errors are in dB re the RMS amplitudes of the target (Probes) or the standard (Internal). Phase errors are in degrees.

The results of the experiment are shown in Fig. 5.16 for the right "ear," as measured in the KEMAR internal microphone. The figure compares measured and standard amplitudes for the 2 × 2 system and 2 × 3 system (top two panels). Differences between measured and standard phases are shown in the bottom two panels. Root-mean-square amplitude and phase error data are listed in Table 5.3. The following conclusions can be made:

• Synthesis was somewhat more successful for the right "ear" than for the left. The difference was particularly noticeable for the phases.

• Amplitudes and phases in the probe microphones agreed better with the desired values compared to the internal microphones. This might have been expected because the synthesis was based on the probe microphones.

• For the left "ear" (not shown), adding the third loudspeaker (G) to the synthesis hardly mattered for the amplitudes, either for the probe microphone or for the internal microphone. Phase errors were modestly reduced. In contrast, for the right "ear," adding the third loudspeaker reduced amplitude errors considerably. The difference between "ears" may be attributable to a worse signal-to-noise ratio in the left "ear" because it was farther from the source. This suggests a signal-to-noise threshold below which a third loudspeaker may confer only minimal benefit.

5.8.3 Experiment– speech

The experiment described in section 5.8.1 was repeated, but the target and standard were female speech instead of white noise. The goal was to demonstrate the utility of TS in perceptual experiments. The utterance was the brief sentence, "Cats hate dogs." Its duration was 2.68 s, which corresponds to a frequency spacing of 0.37 Hz. Again, there were 131071 frequency components. All calculations were done in the frequency domain, which means that the order of the three words was determined by the phases in the Fourier transform.
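The MLS-based procedure for producing the target and standard (section 5.8.1) can be sketched as follows. The sketch assumes a recording of repeated MLS periods is already available; scipy's max_len_seq generates the sequence, and the circular cross-correlation is done in the frequency domain:

    import numpy as np
    from scipy.signal import max_len_seq, fftconvolve

    # 17-stage shift register: one period of 2**17 - 1 = 131071 samples
    mls = 2.0 * max_len_seq(17)[0] - 1.0      # map {0, 1} -> {-1, +1}
    L = mls.size

    def impulse_response(recording, n_periods=10):
        """Impulse response from a recording of repeated MLS periods.

        The first and last periods are discarded to avoid edge effects,
        the remainder are averaged, and circular cross-correlation with
        the MLS recovers the impulse response.
        """
        periods = recording[:n_periods * L].reshape(n_periods, L)
        avg = periods[1:-1].mean(axis=0)
        h = np.fft.ifft(np.fft.fft(avg) * np.conj(np.fft.fft(mls))).real
        return h / (L + 1)                    # MLS autocorrelation normalization

    # Target (or standard): the original signal convolved with the measured
    # response. `probe_recording` and `source_signal` are hypothetical arrays.
    # h_probe = impulse_response(probe_recording)
    # target = fftconvolve(source_signal, h_probe)[:source_signal.size]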
5.8.4 Results– speech

                 Left ear              Right ear
                 Probes   Internal     Probes   Internal
    2 × 2 (dB)   −19.8    −23.4        −20.6    −24.5
    2 × 3 (dB)   −21.2    −27.1        −22.3    −30.0
    2 × 2 (°)     29.9     17.7         29.9     17.0
    2 × 3 (°)     29.1     14.0         28.6     11.8

Table 5.4: RMS error values for synthesis of "Cats hate dogs." RMS amplitude errors are in dB re the RMS amplitudes of the target (Probes) or the standard (Internal). Phase errors are in degrees. Errors were calculated for the 10202 frequency components between 200 and 4000 Hz– the range of the speech energy.

Results of the synthesis in the right "ear" are shown in Fig. 5.17 for the internal microphones. Comparison between measured and standard signals again used only the 211-component subset of frequency components so that the results could be conveniently displayed. Further, the amplitude scale for frequencies above 4 kHz was expanded to facilitate visual comparison. Phase errors increased considerably above 4 kHz, most likely because the measured signals were so small at those frequencies. The root-mean-square errors in Table 5.4 were calculated using only frequency components between 200 and 4000 Hz, because almost all the speech energy lies in that range.

Figure 5.17: Same as Fig. 5.16 but the target and standard were female speech ("Cats hate dogs.") instead of white noise. Comparison between standard amplitudes (filled circles) and measured amplitudes (open circles) shows only a 211-component subset of frequency components for a convenient display. The amplitude scale for frequencies above 4 kHz is expanded by a factor of ten. Phase errors (diamonds) for the same set of frequencies are the difference measured − standard. Phase errors outside the ±90° range are shown by solid diamonds at ±90°.

5.8.5 Discussion– speech

The RMS average results are shown in Table 5.4. It is evident that adding the third loudspeaker improved synthesis accuracy. Improvement was most dramatic in the internal microphones. Amplitude errors were as much as 5.5 dB smaller in the 2 × 3 system compared to the 2 × 2 system. Phase errors were reduced by 31% (right) and 27% (left). Compared to the white noise experiment (section 5.8.2), for which improvement was only observed in the right "ear," synthesis accuracy improved in both ears for the speech source. However, the frequency ranges were different for these two tables: the white noise had 211 components spanning 0.2 − 16 kHz and the speech had 10202 components spanning 0.2 − 4 kHz. Enhanced low- and middle-frequency spectral content in the speech (vs. white noise) target is a possible explanation for the observed difference in results.

Table 5.4 shows that for both "ears," both systems, and both amplitude and phase, the RMS errors were smaller for the internal microphones than for the probe microphones. This result is opposite to the corresponding result for the noise source in the previous section, and it is initially surprising. How can the internal microphone results be better than the probe microphone results when the stimuli for the internal microphone recordings were made from the probe microphone signals? The answer lies in the final measurement process. The probe microphones, with very thin probe tubes, were much noisier than the internal microphones. Although the effective noise from the probe microphones could be reduced by the repeated-MLS technique in producing the target and standard signals, the final measurements were simple recordings of the synthesized signals.
The probe microphone measurements were thus contaminated by noise. In addition, the frequency range used for the speech measurements was different from that for the noise source, and the speech signal had intervals of smaller signal level.

5.9 Sensitivity to head rotation

The pseudoinverse represents a minimum in multidimensional space due to the minimum-norm property. Thus, the synthesis loudspeaker signals, Y′, are expected to be less sensitive to a small perturbation in the 2 × 3 system than in the 2 × 2. The present section examines systematic variations caused by a small rotation of the listener's head.

5.9.1 Experiment setup

The rotation experiments used the manikin and the setup described in section 5.6. Loudspeakers A and B were initially placed at −120° and 120°, and then moved to −90° and 90°. Loudspeaker G was at −140°, 180°, or 140°. All loudspeakers were 1 m from the center of the "head." The desired signal at the "eardrums" was equal-amplitudes, random-phases noise, and this signal was also used to measure the transfer functions. Transfer functions were measured and synthesis waveforms were computed, played, and recorded in KEMAR's internal microphones with the "head" facing the forward direction (0° reference condition). Then the "head" was rotated 5° to the left and the (unchanged) synthesis was replayed and recorded again.

5.9.2 Results

The changes caused by rotation for one of the configurations are shown by the synthesis amplitudes and phases in Figs. 5.18 (left "ear") and 5.19 (right "ear"). Reference 0° data are indicated by filled symbols and rotated data by open symbols. The RMS amplitude errors (i.e. the discrepancy between the 0° and −5° data) were smaller in the 2 × 3-system synthesis for both ears– the RMS amplitude error was 17% smaller in the left "ear" and 11% smaller in the right "ear."

Figure 5.18: Comparison of amplitudes and phases measured at the left "eardrum" for 211 components before the "head" was rotated (filled symbols) and after it was rotated 5° to the left (open symbols). Synthesis loudspeakers A and B were at −120° and 120°, and G was at 180°.

Figure 5.19: Same as Fig. 5.18 but for the right "eardrum."

Figure 5.20: RMS change in amplitude caused by an uncompensated rotation of 5°, averaged over 211 frequencies. The values are averaged over the azimuths of loudspeaker G. The error bars are two standard deviations in overall length. The data for these histograms came from data sets of which Figs. 5.18 and 5.19 are examples.

5.9.3 Discussion

An overall deterioration in synthesis occurred for both ears and both angular configurations (figures for the 90° synthesis are not shown), most notably at high frequencies (f ≥ 9.5 kHz). Problematical amplitudes occurred at the typical frequencies– namely, the first (3.5 kHz) and second (10 kHz) ear canal resonances. The 2 × 3 system was less sensitive to rotation, at least for the first "ear canal" resonance, as seen by the reduction of large amplitudes in Figs. 5.18b and 5.19b.

Figure 5.20 is a histogram plot of the RMS changes in amplitude caused by an uncompensated 5° rotation. A smaller percentage change indicates decreased sensitivity to the rotation. The most dramatic effect was the large reduction in sensitivity when loudspeakers A and B were moved from −120° and 120° to −90° and 90°. In a symmetric configuration like ±90°, the transfer function flattens at the point of approximate symmetry. Adding a third loudspeaker yielded no further reduction in sensitivity.
In contrast, when loudspeakers A and B were at −120° and 120°, adding the third loudspeaker led to a substantial reduction in sensitivity for both ears.

Figure 5.20 indicates that the left "ear" was more sensitive to the rotation than the right "ear" when loudspeakers A and B were at −120° and 120°. There was no reason to anticipate that result– the left "ear" was closer to the nearby wall, and the rotation was to the left. Otherwise, the experiment was left-right symmetrical.

5.10 Conclusions

In practice, transfer functions from headphones to the ears deviate quite significantly from an ideal, perfectly flat response. This has clear ramifications for psychoacoustics experiments, which must present very accurate signals to the ears that preserve interaural and spectral cues to the listener. For human listeners, probe tube microphones must be inserted in the listener's ear canals, and headphones are placed over the microphones. It is easy to imagine how probe tubes would be disturbed by displacement of the headphones, since the two are in physical contact. Run-to-run variability in headphone placement can drastically affect interaural cues and the listener's perceptions from one experiment run to the next. Headphone equalization can, in principle, compensate for the run-to-run variability if measurement of fresh transfer functions is embedded in the experimenter's protocol.

Transaural synthesis with loudspeakers was proposed as an alternative method to headphones for precise stimulus delivery. Loudspeakers facilitate a more natural listening environment than headphones can provide. The mathematics of TS eliminate instances of crosstalk that occur during loudspeaker presentation. Further, a well-known requirement of the TS technique is that transfer functions from loudspeakers to the ears must be measured afresh between experiment runs, as well as anytime the listener moves during a run. This ensures the transfer functions are accurate, and consequently that the resulting synthesis of the desired waveforms at the ears is also accurate. Even with a −5° head rotation, the loudspeaker synthesis delivered desired amplitudes and phases to the ears more accurately than the headphones did after a new placement.

There can be an issue with the traditional 2 × 2 system used in TS. Spuriously large amplitudes in the synthesis signals may result from inversion of the transfer function matrix. These large amplitudes are undesirable because they are perceptually salient to the listener, either as distortion or as discrete tones. TS experiments using a 2 × 3 system utilize the Moore-Penrose pseudoinverse matrix to facilitate a suitable inversion of the transfer function matrix. The pseudoinverse matrix also provides the least-norm solution. The reduction in amplitude of a loudspeaker signal when going from two to three synthesis loudspeakers was expected to be a factor of √(3/2) ≈ 1.2, but a reduction by a factor of about 1.5 was observed.

It is apparent that using three loudspeakers can reduce some of the worst problems in traditional 2 × 2 synthesis. However, there remain pathologies caused by anomalous head diffraction, ear canal resonances, and standing waves in rooms. This is likely the explanation for the outsized spectral amplitudes at 10500 Hz and 13500 Hz in Fig. 5.14a. Addressing these problems through regularization or the selective elimination of problematical spectral components may be required.
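As one illustration of the regularization option, a Tikhonov-style modification of the pseudoinverse bounds the filter gains at badly conditioned frequencies. This is a generic sketch, not the specific scheme used in the crosstalk-cancellation studies cited above:

    import numpy as np

    def regularized_pinv(H, beta=1e-2):
        """Tikhonov-regularized pseudoinverse H* (H H* + beta I)^-1.

        beta > 0 limits the gains at frequencies where H H* is nearly
        singular, at the price of an approximate (no longer exact)
        solution; beta = 0 recovers the exact Moore-Penrose pseudoinverse.
        """
        H_star = H.conj().T
        return H_star @ np.linalg.inv(H @ H_star + beta * np.eye(H.shape[0]))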
Alternatively, a modification to the pseudoinverse solution, as suggested for an acoustically transparent head by Yang et al. (2003), may enhance the robustness of the inverse solution.

Nevertheless, the 2 × 3 system demonstrated superior performance over the 2 × 2 system in nearly all experimental cases presented here. It produced fewer very large amplitudes and more accurately reproduced desired spectra in the ears. The advantage was consistent across different stimuli (invented and from a real source), microphones (internal and probe), and even with a small head rotation. Benefits of the 2 × 3 system were manifest even in a challenging room environment. The versatility and robustness of the 2 × 3 system that were demonstrated across the various experiments in this chapter attest to the value of the technique for precision psychoacoustics experiments.

Appendix A describes additional experiments that were conducted to further validate the stability and reproducibility of the 2 × 3 system. Essentially no effect of target sound source level on the quality of synthesis was observed, indicating that the experiments were in a linear response region. Variations in the target sound source spectra and the synthesis spectra measured at the eardrums due to random fluctuations were observed to be negligibly small. Appendix B describes a study of probe microphone placement during target source measurement and synthesis measurement. It revealed a subtle but important result: probe microphone placement must be the same during real-source measurement and synthesis to ensure accurate synthesis at the eardrums. This imposes limitations on experimental designs that incorporate real sources. Nevertheless, the generally robust and accurate performance of TS using three synthesis loudspeakers provides a powerful tool for conducting perceptual experiments on human listeners.

Figure 5.12: Histogram of (experimental) maximum synthesis spectral amplitudes, of (a) 2, or (b) 3 synthesis loudspeakers. Amplitudes were scaled so that the means of the 2 × 3 distributions in Figs. 5.11b and 5.12b coincide. That enables a fair comparison of the figures. Data were combined over the 120° and 90° reference sets, a total of 2532 values per histogram. Fewer large amplitudes occurred in the 2 × 3 system.

Figure 5.13: Left "ear" desired amplitudes (panels a and b) are indicated by filled symbols (X′_L). They are straight-line functions of frequency. Measured amplitudes (X_L) are indicated by the open symbols. Numbers 1 − 7 track particular component amplitudes of interest. Desired phases were random variables. Desired phases were subtracted from measured phases to find phase errors, which are shown by the diamonds in panels c and d.

Figure 5.14: Same as Fig. 5.13 but for the right "ear." Larger phase errors at high frequencies arise from smaller amplitudes.

Figure 5.15: KEMAR's "head" with probe microphones in the "ear canals." The real-source loudspeaker was located 28° to the right of the manikin's forward direction. The three synthesis loudspeakers were located at angles of −120°, 120°, and 180°. All loudspeakers were 1 m from the center of KEMAR's "head." A nearby wall was located on the left and the acoustical foam was removed– this was Room Setup 2. The schematic is not to scale.

Chapter 6

Room effect perceptual experiment using well-controlled stimulus presentation

The current chapter describes the application of transaural synthesis in a perceptual experiment on room squelch.
This is a synergy of Chapters 4 and 5. The goal of the perceptual experiment was the same as in Chapter 4– to investigate the role of HRTF in squelch– but the means of probing was, in principle, more reliable, owing to the transaural synthesis technique described in Chapter 5.

6.1 Experiment

Four listeners from outside the lab (all male; ages 21 − 22) participated in the experiment. All listeners came for at least four listening sessions, most of which were for training. Only data from the final session were included in the analysis. Listeners were paid for their time. None of the listeners had previously participated in the Chapter 4 headphone experiment.

6.1.1 Experimental setup

The experiment was conducted in a medium-sized office room (L × W × H: 5.3 m × 4.3 m × 3.45 m). The floor was hard tile and the ceiling was acoustical tile. A ceramic-tile panel (1.8 m × 1.5 m) and a formica-covered steel panel (1.7 m × 0.9 m) were placed along the back wall to augment reverberation in the room. Preliminary testing indicated that the room effect was insufficient for a perceptual experiment on squelch, so a synthesizer (DSP-3000 Digital Sound Field Processor, Yamaha Corp., Hamamatsu, Japan) was utilized to further enhance room effect. The synthesizer amplified reverberation picked up by two studio microphones (Shure KSM32 cardioid), the outputs of which fed into the synthesizer. The four ambience-processed output channels from the synthesizer were connected to power amplifiers (Servo 120a, Samson Technologies, Hicksville, NY; D75A, Crown Audio, Elkhart, IN) which fed four dual-driver loudspeakers (6.5" woofer and 3/4" tweeter, Model A40, Boston Acoustics, Boston, MA) that were placed in each corner of the room.

Reverberation times for the enhanced-reverberation setup (i.e. synthesizer on) were measured. Excitation stimuli for the measurements were sine tones with octave-band spacing– the lowest frequency was 125 Hz and the highest was 16 kHz. To measure the reverberation time, a 125-Hz tone was played from a loudspeaker (Mackie HR824mk2) in the room. The tone was abruptly turned off, and a Larson-Davis sound level meter¹ (Model 800B, Larson-Davis Laboratories, Depew, NY) directly measured the reverberation time. Eight measurements were made at different locations in the room, and the mean and standard deviation of the eight measurements are shown in Table 6.1. The process was repeated for the remaining seven tones. The average RT60 in the five octave bands from 250 Hz to 4000 Hz, which are the most relevant bands for speech, was 0.659 s.

¹Settings: octave-band filtering, fast response, linear filter weight (i.e. not A-weighted).

    Frequency (Hz)   original RT60   original RT60   enhanced RT60   enhanced RT60
                     mean (s)        std. dev. (s)   mean (s)        std. dev. (s)
    125              0.577           0.112           0.840           0.161
    250              0.537           0.095           0.711           0.116
    500              0.466           0.087           0.658           0.172
    1000             0.390           0.133           0.717           0.210
    2000             0.504           0.084           0.613           0.267
    4000             0.399           0.080           0.597           0.166
    8000             0.480           0.068           0.854           0.260
    16000            0.366           0.080           0.294           0.054

Table 6.1: Reverberation times measured using a Larson-Davis sound level meter, with the synthesizer off (left) and on (right). The average RT60 in the five octave bands from 250 Hz to 4000 Hz, which are the most relevant bands for speech, was increased from 0.459 s to 0.659 s with the synthesizer on. This was an increase of 0.200 s.
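The sound level meter reported RT60 directly. For reference, a minimal sketch of the equivalent decay-slope computation is shown below; the function and its arguments are illustrative, and the sampled level recording is assumed to begin at the tone offset:

    import numpy as np

    def rt60_from_decay(level_db, fs, fit_range=(-5.0, -25.0)):
        """Estimate RT60 from a sampled level decay (dB) after tone offset.

        A straight line is fitted to the part of the decay between
        fit_range dB (re the initial level), and the slope is
        extrapolated to a full 60 dB decay.
        """
        rel = level_db - level_db[0]
        idx = np.where((rel <= fit_range[0]) & (rel >= fit_range[1]))[0]
        t = idx / fs
        slope = np.polyfit(t, rel[idx], 1)[0]   # dB per second, negative
        return -60.0 / slope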
Then, reverberation times were measured for the original setup (i.e. synthesizer off) for reference. The measurement process was identical to that described above, and the means and standard deviations for each frequency band appear in the left half of Table 6.1. The average RT60 in the five octave bands from 250 Hz to 4000 Hz was 0.459 s. Thus, the synthesizer increased the RT60 from the original setup by 0.200 s.

The 2 × 3 transaural synthesis system was used. Synthesis loudspeakers A and B were placed at −120° and 120° with respect to the listener's forward direction, and loudspeaker G at 180°. All were 1 m from the center of the listener's head. The listener sat in a rigid-backed chair, and an adjustable metal ring was lowered onto the listener's head. The ring was an aid to keep the head motionless. Probe microphones were placed in the listener's ear canals. Figure 6.1 shows a photograph of the setup.

Figure 6.1: Setup for the perceptual experiment. Ceramic-tiled panels were located along the wall behind the listener. Enhanced reverberation system (ERS): two studio microphones were positioned in the foreground. Microphone outputs were amplified and fed into the synthesizer (not shown). Two (of four) ERS loudspeakers are visible in the photo. Transaural synthesis: synthesis loudspeakers were located 1 m from the center of the listener's head, at angles of ±120° and 180°. The photograph was taken from the vantage point of the real-source loudspeaker (not shown), which was located at 3.8 m and 28°.

6.1.2 Experiment– training

Part I: Room effect was defined for the listener as reverberation and coloration. The listener was told that his task was to rate the amount of room effect he perceived in a particular stimulus. The scale was from 1 to 40, where 1 indicated that no room effect was perceived. Recordings that had previously been made in different rooms were played to the listener over headphones.² First, an anechoic recording was played. The listener was told that this was a "1" on the room effect rating scale. Then, a recording was played that had been made in a moderately reverberant room (Room 10B; RT60 = 0.9 s at speech frequencies). The listener was told that this was a "40" on the rating scale.

²A female talker counted backwards from five, with one-second pauses between numbers. These were the same recordings used during training for the headphone perceptual experiment (Chapter 4).

A brief exercise was conducted to get the listener comfortable with listening for and rating room effect. Recordings that had been made in different rooms (six ordinary rooms, plus the anechoic room and Room 10B) were played, and the listener was told to give a rating after each recording. The recordings were played in random order. After this, the listener was seated and the calibration commenced.

Part II: The training sessions were procedurally identical to the final session (cf. 6.1.3 and 6.1.4), but with two differences that were physical in nature. The first difference was that during training, reverberation was sometimes more (or less) than what was in the final session. This was because of attempts to optimize the enhanced-reverberation settings: it was found that with insufficient gain (i.e. reverberation) the perceptual task was too difficult for listeners, and with too much gain acoustical feedback became a problem. Note that the reverberation times in Table 6.1 are for the finalized settings. The second difference was that the HRTFs were not constant during the training but were based on what was available at the time of the session.
Nevertheless, the training sessions familiarized the listener with the experiment, and after completing three training sessions listeners were considered "experts." Listener L1 came for four training sessions and the remaining listeners came for three training sessions.

6.1.3 Experiment– calibration

Maximum length sequence

A MLS of order 16 (2^16 − 1 = 65535 samples) was used for all calibration measurements, and the RP2.1 sample rate was 24414.0625 Hz. Thus, the duration of one period of the MLS was 65535 samples / 24414.0625 samples/s = 2.6843136 s. The reader might note that the MLS order was reduced from 17, which had been used in all previous experiments (cf. 5.5.2), to 16. Further, the sample rate was exactly half the sample rate used in all previous experiments, which was 48828.125 Hz. These two changes were made for an entirely practical reason: to reduce the buffer loading time of stimulus files in the RP2.1. By halving both the number of samples (131071 to 65535) and the sample rate (48.9 kHz to 24.4 kHz), the period of the stimulus was unchanged (T = 2.68 s) but the buffer load time was substantially reduced. This allowed the experiment to proceed more smoothly.

Real source calibration (HRTFs)

Probe microphones were placed in the listener's ear canals. The listener was instructed to look straight ahead (0°) and remain motionless during the entire experiment. Ten periods of the MLS were played from the real source loudspeaker and recorded in the listener's ear canals. Recordings of the first and last periods were discarded to avoid edge effects, and recordings of the remaining eight periods were averaged. Cross-correlation of the average recordings and the MLS yielded the HRIRs, h_L(t) and h_R(t). The HRTFs, |H_L(f)| and |H_R(f)|, for the four listeners (L1, L2, L3, and L4) are shown in Figs. 6.2 (0.2 − 1 kHz range) and 6.3 (1 − 12 kHz range). Three 'other' HRTFs also appear in each panel: H1, H2, and H3. These are HRTFs from human subjects who participated in the calibration but were not listeners in the perceptual experiment. They can be thought of as 'heads' instead of listeners, to emphasize that their only role was to provide nonindividualized HRTFs. The three subjects were members of the lab who were selected because they were easily available.

In general, the HRTFs look similar in the 0.15 − 1 kHz frequency range. Transfer functions H1, H2, and H3 were repeated in each panel of the figures for easy comparison with the listeners (L1, L2, L3, or L4). Differences among HRTFs become more apparent in the 1 − 12 kHz range. That was an anticipated result because individual differences due to the head and pinna are relevant in that frequency range. Among the three nonindividualized HRTFs, H1 showed the largest differences from the other two. Specifically, there is a broad peak spanning 3 − 5 kHz in the right ear. H3 has a peak in the same range, but it is smaller by a few dB, and H2 actually shows a dip in that range. The trend was similar for the left ear: H1 shows a broad peak spanning 3 − 6 kHz and an additional peak in the 10 − 12 kHz range.

As a means of quantifying the degree of similarity between HRTFs, root-mean-square (RMS) amplitude differences between a listener's own HRTF and each of the other HRTFs (H1, H2, H3) were calculated. For simplicity, the RMS differences for the left and right ears were averaged. Values are plotted in Fig. 6.4. Panel (a) shows the RMS amplitude differences for the 0.15 − 1 kHz frequency range, and (b) shows the same for the 1 − 12 kHz range. Differences were larger for the 1 − 12 kHz range (on the order of 10 dB), which is consistent with the observations on Figs. 6.2 and 6.3 above.
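The band-limited RMS difference used for Fig. 6.4 can be written compactly. A minimal sketch follows; the array names are illustrative, and the HRTF magnitudes are assumed to be already converted to dB on a common frequency grid:

    import numpy as np

    def rms_hrtf_difference(own_db, other_db, freqs, band):
        """RMS level difference (dB) between two HRTF magnitude spectra.

        band is (f_lo, f_hi) in Hz, e.g. (150, 1000) or (1000, 12000).
        Left- and right-ear values are computed separately and then
        averaged, as in Fig. 6.4.
        """
        sel = (freqs >= band[0]) & (freqs <= band[1])
        return np.sqrt(np.mean((own_db[sel] - other_db[sel]) ** 2))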
As an additional means of quantifying similarities among HRTFs, the total average power of each was calculated. Powers across left and right ears were averaged. Values are plotted in Fig. 6.5. For convenience, the total powers of H1, H2, and H3 were repeated for each listener. The total average power was relatively constant across HRTFs for the 0.15 − 1 kHz range (panel a). Large differences appeared in the 1 − 12 kHz range– in particular, the power in H1 was approximately double that in own, H2, and H3. The greater power in H1 is consistent with the previous observations of a broad peak in H1 in the 3 − 5 kHz range. Based on these considerations, one might expect H1 to be the most perceptually distinct from the other HRTFs.

HRIRs from the real source loudspeaker to the probe microphones were convolved with anechoic speech recordings³, which were shortened versions of the Harvard phonetically-balanced sentences (Table 6.2). These convolved-speech stimuli were the target signals in the ears, X′_L and X′_R, during transaural synthesis.

³Recited by a female talker.

Figure 6.2: To measure HRTFs, a MLS (N = 16) was played from the real source loudspeaker and recorded in the probe microphones in the listener's ear canals. Left panels show |H_L|, and right panels show |H_R| for the 0.2 − 1 kHz frequency range. Recall that the source was on the right. The top lines in each panel indicate HRTFs of the four listeners ((a,b) L1, (c,d) L2, (e,f) L3, (g,h) L4) who went on to participate in the perceptual experiment. These HRTFs were used to compute stimuli for the "own" condition. The bottom three lines indicate nonindividualized HRTFs (H1, H2, and H3). These HRTFs were used to compute stimuli for the "other" conditions.

Figure 6.3: Same as Fig. 6.2 but for the 1 − 12 kHz frequency range.

Figure 6.4: Root-mean-square amplitude differences were calculated between a listener's own HRTF and the other HRTFs (H1, H2, H3). Averages were computed across left and right ears. Average differences were larger in the 1 − 12 kHz range, indicating that individual differences in HRTFs were more apparent.

Figure 6.5: Total average powers of each HRTF (own, H1, H2, H3) were calculated and averaged across left and right ears. For convenience, H1, H2, and H3 are repeated in the plot for each listener. A listener's own HRTF is indicated by the shaded bar. Power was relatively constant in the (a) 0.15 − 1 kHz range. In the (b) 1 − 12 kHz range, power in H1 exceeded– in some cases by more than double (3 dB)– the power in own, H2, and H3.

    Sentence              HRTF conditions
    "Thieves who rob."    own   other 1   other 2   other 3   natural
    "Cats hate dogs."     own   other 1   other 2   other 3   natural
    "Add the product."    own   other 1   other 2   other 3   natural
    "Open the crate."     own   other 1   other 2   other 3   natural

Table 6.2: These were the twenty stimuli (= 4 sentences × 5 HRTFs) presented to a listener during a single pass of the perceptual experiment. Speech signals were convolved with: the listener's own HRTFs ("own" condition), three other subjects' HRTFs ("other" conditions), and the natural HRTF (that is, anechoic speech was played from the real source loudspeaker– no synthesis was involved). After each stimulus played, the listener gave his rating for the amount of room effect he perceived in that particular stimulus. The order of sentence blocks was randomized, as was the order of HRTF presentation within each sentence block. A listener completed six passes.
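Generation of the synthesized-condition stimuli amounts to convolving the measured HRIR pair with the anechoic sentence. A minimal sketch follows (array and function names are illustrative):

    import numpy as np
    from scipy.signal import fftconvolve

    def make_target_signals(speech, h_left, h_right):
        """Target ear signals X'_L, X'_R for transaural synthesis.

        h_left and h_right are the HRIRs from the real-source loudspeaker
        to the two ear-canal probe microphones -- the listener's own pair
        for the "own" condition, or another subject's pair for the
        "other" conditions.
        """
        xL = fftconvolve(speech, h_left)[:speech.size]
        xR = fftconvolve(speech, h_right)[:speech.size]
        return xL, xR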
Synthesis calibration

The calibration procedure for the synthesis loudspeakers was as follows: eight periods of the MLS were played from loudspeaker A. The first and last periods were discarded, and the remaining six were averaged. This was repeated for loudspeakers B and G. In this way, the elements of matrix H (H_AL, H_AR, H_BL, H_BR, H_GL, and H_GR) were determined. Loudspeaker signals y′_A, y′_B, and y′_G were computed via Eq. 5.14, using convolved speech as the target signals in the ears.

6.1.4 Experiment– rating

After calibration, the room effect rating segment of the experiment commenced. The list of stimuli presented to the listener is given in Table 6.2. The listener was presented with speech stimuli that had been convolved with his own HRTFs ("own" condition), as well as the HRTFs from three other subjects ("other 1", "other 2", and "other 3" conditions). The HRTFs from the other subjects had been previously measured. A "natural" trial was also included, in which anechoic speech was played from the real source loudspeaker. It was called natural because it was not synthesized. In principle, the own and natural conditions should yield identical spectra at the eardrums. Listeners were therefore expected to perceive identical amounts of room effect in the own and natural conditions.

To summarize: for a particular sentence, there were four synthesized HRTF presentations ("own," "other 1," "other 2," "other 3", where "other 1" refers to H1, etc.) and one real source presentation ("natural" condition), giving a total of five HRTF conditions. There were four sentences. Thus, there were 5 HRTF conditions × 4 sentences = 20 stimulus presentations, which are shown in Table 6.2.

After a stimulus was presented, the listener could ask for the stimulus to be repeated, or else he gave his rating of room effect to the experimenter. Communication was via an intercom system. A stimulus could be repeated as many times as the listener wanted, but in practice listeners requested relatively few repeats– roughly once per pass. After the listener gave his rating, the next stimulus was played. The 20 stimulus presentations comprised a pass. The order of sentence blocks in a pass was randomized and, further, the order of HRTF conditions within each sentence block was randomized. Three listeners (L1, L2, L3) completed six passes, but the fourth listener (L4) only completed three passes. The duration of the listening session was 1 − 1.5 hours. Listeners L1-L3 took a 10-minute break after the first three passes.

6.2 Results

Listeners' mean ratings of perceived room effect for own, natural, and other HRTF conditions are shown in Fig. 6.6. Each panel indicates means for a different sentence, and the last panel shows means averaged across sentences. The own and natural conditions are shaded to facilitate comparison with the other HRTF conditions. Mean ratings on the vertical axis were found by averaging ratings across passes. If the synthesized-own HRTF accurately conveyed the true physical HRTF, then the own and natural bars should be identical, which would indicate the same perceived amount of room effect.
Figure 6.6: Mean ratings of perceived room effect in the perceptual experiment. Higher ratings indicate more perceived room effect (i.e. less room squelch). Listeners are identified along the horizontal axis. Shaded bars indicate when a listener was listening to his own HRTFs (own and natural conditions), and the open bars indicate when a listener was listening to other people's HRTFs. Panels (a)-(d) show ratings for the four different sentences. Ratings were averaged across passes to find the mean rating. L1, L2, and L3 completed six passes, and L4 completed three passes. Error bars are the standard errors of the mean. Panel (e) shows the ratings averaged across the four sentences.

6.2.1 Repeated Measures ANOVA

An omnibus statistical test was done on listeners' ratings of room effect to gain a global (across listeners) understanding of the results. A Repeated Measures Analysis of Variance (RM-ANOVA) revealed that HRTF (3 levels: 'own,' 'natural,' 'other') had a statistically significant effect on listeners' ratings (p = 0.00068). Neither sentence (4 levels) nor the sentence × HRTF interaction term was significant (p = 0.492 and p = 0.235). Post-hoc pairwise comparison tests were conducted to gain more insight into where rating differences among HRTFs lie. The difference between 'own' and 'natural' room-effect ratings was significant (p = 0.036), and the difference between 'own' and 'other' ratings was marginally significant (p = 0.051).⁴ Since the RM-ANOVA results for HRTF were significant, individual listeners' ratings were analyzed via multiple hierarchical regression in the following subsection.

⁴Reported p-values have been Bonferroni corrected to adjust for multiple comparisons.

6.2.2 Multiple hierarchical regression

Listeners comprised four separate case studies in the regression analyses. The procedure for conducting the regression was essentially identical to what was described in section 4.2.3. Recall that predictors were added to the regression model in a hierarchical manner to increase sensitivity to changes in R². The most important predictor should be included in the first stage, and the least-important predictor should be added in the last stage. Results of the RM-ANOVA (6.2.1) indicated that HRTF would be more important than sentence for predicting a listener's ratings, so stage 1 of the model included HRTFs ('own-natural', 'other 1', 'other 2', 'other 3'; reference group: 'own-synthesized') as predictors. Sentences were added as predictors in stage 2 (reference group: 'Open the crate.'). Results of the regression analysis for each listener are summarized in Table 6.3. HRTF and sentence were significant predictors of room effect rating for all listeners, and they accounted for approximately 50% (R²) of the variation observed in listeners' ratings.

    Listener   Model     R       R²      ΔR²     F, ΔF
    L1         stage 1   0.682   0.465   –       24.968***
               stage 2   0.730   0.533   0.068   5.446**
    L2         stage 1   0.626   0.391   –       18.495***
               stage 2   0.711   0.505   0.114   8.591***
    L3         stage 1   0.672   0.451   –       23.615***
               stage 2   0.727   0.529   0.078   6.152**
    L4         stage 1   0.583   0.340   –       7.069***
               stage 2   0.706   0.498   0.158   5.473**

Table 6.3: Results of multiple hierarchical regression analyses on listeners' ratings in the room effect perceptual experiment. The four listeners (L1, L2, L3, L4) were analyzed separately. HRTF was highly significant for all listeners. Sentence was also significant (*p < .05, **p < .01, ***p < .001).
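A minimal sketch of the two-stage (hierarchical) fit for one listener, using statsmodels, is given below; the column names rating, hrtf, and sentence are illustrative, and the ΔF test for stage 2 comes from comparing the nested models:

    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    def hierarchical_fit(df):
        """df: one listener's trials with columns rating, hrtf, sentence."""
        # Stage 1: HRTF conditions only, with 'own' as the reference group
        m1 = smf.ols("rating ~ C(hrtf, Treatment('own'))", data=df).fit()
        # Stage 2: sentences added (reference group: "Open the crate.")
        m2 = smf.ols("rating ~ C(hrtf, Treatment('own')) + C(sentence)",
                     data=df).fit()
        delta_r2 = m2.rsquared - m1.rsquared    # Delta R^2 of Table 6.3
        f_change = anova_lm(m1, m2)             # F test for the added stage
        return m1, m2, delta_r2, f_change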
Pairwise comparisons probed differences in room effect ratings between the ‘own’ (synthesized) condition and all other HRTF conditions– those that are statistically significant are indicated by asterisks (∗p < .05, ∗∗p < .01, ∗∗∗p < .001). The following observations can be made: Listener L1 L2 L3 L4 βnatural -0.279** -0.372*** -0.402*** -0.406** βH1 0.489*** 0.349*** 0.420*** 0.275* βH2 -0.095 0.182* 0.149 -0.009 βH3 0.279** 0.220* 0.140 0.177 βthieves 0.308*** 0.097 0.018 -0.170 βcats -0.033 0.338*** 0.309*** βproduct -0.104 -0.150 -0.107 -0.104 0.483*** Table 6.4: Standardized regression coefficients, or β-weights, for different HRTFs from mul- tiple hierarchical regression analyses. The primary value of the table lies in pointing out which HRTFs differed significantly from the ‘own’ condition in listeners’ ratings of room effect (∗p < .05, ∗∗p < .01, ∗∗∗p < .001). Ratings for natural and H1 conditions differed from ratings for ‘own’ conditions for all listeners. Ratings of room effect for the remaining HRTF conditions (H2,H3) differed from ‘own’ conditions in a mixed manner. The refer- ence group for pairwise comparisons among sentences was the “crate” sentence. Differences among sentences varied in an idiosyncratic manner among listeners. • Listeners were highly sensitive to the difference between ‘own’ and ‘natural’ conditions. This was an unexpected result– in fact, ratings were expected to be the same for the ‘own’ and natural conditions because the listener was listening “through his own ears” in both 178 conditions, and he was therefore expected to perceive identical amounts of room effect. However, the negative β-weights indicated that listeners perceived less room effect in the natural condition. The difference in ratings between ‘own’ and natural conditions will be further discussed in section 6.3. • Listeners rated H1 conditions significantly higher than ‘own’ conditions. They apparently perceived more room effect when listening to H1 conditions than when listening to ‘own’ HRTFs. This observation is consistent with the hypothesis that listeners perceive the least amount of room effect when listening through their own ears. • Sensitivity to H2 and H3 conditions was mixed. For all instances in which the rating difference between ‘own’ and ‘other’ (H1, H2, or H3) conditions was significant, positive β- weights indicated that they rated the ‘other’ condition higher than ‘own.’ That is to say, they perceived more room effect when listening through those other HRTFs than when listening through their own HRTFs. These observations offer limited support for the hypothesis, because listeners’ ratings of room effect for H2 and H3 conditions were often– in 5 out of 8 cases, to be precise– indistinguishable from their ratings for ‘own’ conditions. When rating differences were statistically significant, however, ratings of room effect for H2 and H3 conditions were always higher than ratings for ‘own.’ • Sensitivity to sentences varied in an idiosyncratic manner among listeners. For example, L1’s ratings of room effect were significantly higher for “Thieves who rob” conditions than for “Open the crate” conditions (βthieves = 0.308), which was the reference condition. Note that there was no reason to select a particular sentence as the reference (unlike for HRTFs), and any other sentence could have instead been selected as the reference. 
Specific observations for each listener are given below. The observations incorporate information from Tables 6.3 and 6.4.

Listener 1 (L1): For this listener, 46.5% of the overall variance in his ratings of room effect could be accounted for by the different HRTFs (including natural). Different sentences accounted for only an additional 6.8% of the overall variance. He rated natural conditions significantly lower than 'own' conditions, and he rated H1 and H3 conditions significantly higher than 'own' conditions. Coefficients β_natural and β_H3 had the same magnitude, but opposite signs.

Listener 2 (L2): 39.1% of the overall variance in his ratings of room effect could be accounted for by HRTFs, and an additional 11.4% by sentences. The largest β-weight (in terms of magnitude) was β_natural (−0.372), indicating that the most salient difference for this listener was natural vs. non-natural (i.e., synthesized) conditions. His ratings were highly sensitive to H1 conditions, and to H2 and H3 conditions to a lesser extent.

Listener 3 (L3): 45.1% of the overall variance in his ratings of room effect could be accounted for by HRTFs, and an additional 7.8% by sentences. The magnitudes of β_natural and β_H1 were comparable (0.402 vs. 0.420), but opposite in sign. Apparently, the difference in ratings between 'own' and natural conditions was nearly as salient as the difference in ratings between 'own' and H1 conditions. His ratings for H2 and H3 conditions were not significantly different from his ratings for 'own' conditions.

Listener 4 (L4): 34.0% of the overall variance in his ratings of room effect could be accounted for by HRTFs, and an additional 15.8% by sentences. This listener was more sensitive than the other listeners to differences among sentence conditions. The largest β-magnitude was β_natural (−0.406), which was highly significant. The difference in ratings between 'own' and H1 conditions was significant, but differences in ratings between 'own' and the other HRTF conditions (H2, H3) were not.

Results of the multiple hierarchical regression analyses are summarized in Fig. 6.7. Statistical significance for pairwise comparisons of 'own' with the remaining HRTF conditions is indicated along the horizontal axis.

Figure 6.7: HRTF β-magnitudes are plotted to facilitate visual comparison among listeners. Statistically significant pairwise comparisons between 'own' and the specific HRTF predictor are indicated along the horizontal axis. 'Own' conditions were significantly different from natural and H1 conditions for all listeners. Results of pairwise comparisons between 'own' and H2, H3 were mixed.

6.3 Discussion

HRTFs

Listeners were sensitive to some but not all HRTFs. By "sensitive" what is meant is that a listener perceived a statistically significant difference in room effect between stimuli that were filtered with his own HRTF and stimuli that were filtered with a nonindividualized ('other') HRTF. All listeners were sensitive to H1, and statistical analyses revealed that they rated H1 conditions higher than 'own' conditions. Analyses revealed mixed sensitivity to H2 and H3 conditions.

Collectively, these results are consistent with the earlier prediction based on analysis of total average power in all HRTFs (Fig. 6.5). Anomalously large and broad boosts in the 3−5 kHz and 10−12 kHz frequency bands in H1 led to a total average power that was about 3 dB greater than the power observed in any other HRTF in the 1−12 kHz frequency range. Based on this, it was thought that H1 would be the most perceptually distinct. (A minimal sketch of this band-power comparison follows.)
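The following sketch computes total average power in a frequency band from a measured head-related impulse response. The array names and sample rate are assumptions for illustration, not values from the measurement chain used here.

```python
import numpy as np

def band_power_db(h_imp, fs, f_lo=1000.0, f_hi=12000.0):
    """Average power of an HRTF, in dB, over the band [f_lo, f_hi].

    h_imp : head-related impulse response (1-D array)
    fs    : sample rate in Hz (assumed value for illustration)
    """
    H = np.fft.rfft(h_imp)
    f = np.fft.rfftfreq(len(h_imp), d=1.0 / fs)
    band = (f >= f_lo) & (f <= f_hi)
    mean_power = np.mean(np.abs(H[band]) ** 2)   # average power in the band
    return 10.0 * np.log10(mean_power)

# Two HRTFs can then be compared directly:
#   delta_db = band_power_db(h1, fs) - band_power_db(h_own, fs)
# A value near +3 dB would correspond to the H1 anomaly described above.
```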
Experimental results indicated that not only was H1 the most distinct (excluding the natural condition), but listeners also perceived significantly more room effect when listening to H1 conditions than to any other HRTF condition. Listeners evidently picked up on the differences in H1, and those differences apparently enhanced their perceptions of room effect. It is possible that listeners perceived H1 conditions as louder, which might account for the enhanced perceptions of room effect; however, no listener commented to that effect, so it is not possible to elaborate.

Sensitivities among the remaining HRTF conditions were mixed: only two listeners (L1, L2) perceived any differences in room effect between 'own' and H3 conditions, and L2 was the only listener who perceived differences in room effect between 'own' and H2 conditions. In all cases of statistical significance (3 out of 8), less room effect was perceived in the 'own' conditions.

Taken together, listeners' collective experiences with H1 and the remaining HRTF conditions ('own', H2, H3) indicate that the hypothesis, that a listener perceives the least amount of room effect (i.e., maximum squelch) when listening through his own ears, garners only limited support. Listeners certainly perceived less room effect when listening through their own ears compared to H1's ears, but their ratings when listening through H2's and H3's ears were often indistinguishable, in a statistical-significance sense, from their ratings when listening through their own ears. Based on these results, one cannot conclude that a listener perceives the least amount of room effect when listening through his own ears, because the experiment did not show that he exclusively perceives the least amount of room effect with his own ears. In some cases, the listener simply did not perceive a difference in room effect among the different HRTF conditions (namely 'own', H2, and H3), so it can only be said that there is limited support for the hypothesis.

Table 6.5 attempts to place results of the current work in the context of other experiments that have been done using individualized and nonindividualized HRTFs. The experiments can be classified into two categories: those for which there is a correct answer (localization experiments), and those for which there is not ('preference' experiments). Among preference experiments, the listening criteria are diverse. Consequently, the word 'preferred,' or 'preference,' has a broad definition in the following text: it means better performance on a task, in addition to the more conventional use of the word in qualitative evaluations. For example, 'preferred' in the context of the current experiment means that a listener perceives less room effect.

Reference                    Headphones or    Stimulus          Listening criteria                Individualized HRTF universally
                             loudspeaker?     type                                                preferred / best performance?
Morimoto and Ando (1980)     loudspeaker      white noise       localization– median plane        yes
Middlebrooks (1999b)         headphones       broadband noise   localization– horizontal plane    no
                                                                localization– median plane        yes
Usher and Martens (2007)     headphones       speech            naturalness                       no
Roginska et al. (2010)       headphones       speech            externalization                   yes
                                                                elevation                         no
                                                                f/b discrimination                no
Katz and Parseihian (2012)   headphones       noise burst       perceived spatial rendering       no
                                                                f/b discrimination                yes
                                                                u/d discrimination                marginal
Schönstein and Katz (2012)   headphones       noise burst       "sense of direction"              no
                                                                "sense of distance"               no
                                                                "front-image quality"             no
current work (2018)          loudspeaker      speech            perceived room effect             marginal

Table 6.5: List of experiments that compared listeners' experiences using individualized and nonindividualized HRTFs. The listening criteria included localization, externalization, and naturalness, among others. All but two of the experiments used headphones. Listeners preferred their own HRTFs in only 4 out of 14 experiments (29%); three of these were localization experiments, and the fourth was an externalization experiment.

Several observations can be made based on Table 6.5:

• Listeners preferred their own HRTFs in only 4 out of 14 experiments (29%). Three of them were localization experiments, and the fourth was an externalization experiment.

• Only 3 out of 14 experiments used speech as the stimulus. Most experiments used noise.

• All but two of the experiments used headphones for stimulus delivery. Experimenters may or may not have equalized the headphones.

More detailed discussion of some of the experiments is given below.
Usher and Martens (2007) did an experiment in which they played speech stimuli to listeners and asked them to evaluate 'naturalness.' Their hypothesis was that listeners would perceive individualized stimuli as being the most natural. Half of the listeners listened to stimuli that had been convolved with their own HRTFs as well as to speech stimuli that had been convolved with eight non-individualized HRTFs; the remaining listeners listened only to non-individualized HRTFs. None of the listeners in the first group selected their own HRTFs as the most natural-sounding. Instead, both groups of listeners selected a small subset of HRTFs as sounding the most natural. Usher and Martens's analysis found that the best predictor of a listener's choices was the frequency-dependent interaural level difference (ILD): subjects chose HRTFs with similar (but boosted) frequency-dependent ILDs. Unfortunately, the experiment used headphones and anechoic stimuli only. Nevertheless, it would perhaps be interesting to do a detailed analysis of ILDs on the current HRTFs, and future experiments on perceptions of room effect could incorporate systematic manipulations of ILDs. (A sketch of the frequency-dependent ILD computation is given below.)

In the experiments of Schönstein and Katz (2012), six listeners were asked to judge six different HRTFs on the basis of three different attributes. The purpose of the experiment was to determine whether reproducible judgments of HRTFs could be made. Five of the HRTFs were taken from the CIPIC database, and the sixth was the listener's own HRTF. The HRTFs were convolved with a white noise signal. The three attributes were "sense of direction," "sense of distance," and "front-image quality." Listeners rated each of these on a spectrum with 'well-defined' at one end and 'not well-defined' at the other, and they completed six replicates. Listeners unanimously reported that they found the task difficult, and their own HRTFs were not always judged as being the most well-defined. Results showed that variability across replicates was significant for all listeners. Since there was a tendency for the variance to be smaller for experienced listeners, Schönstein and Katz recommended that only experienced listeners be used for evaluation of HRTFs.
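To make the ILD predictor concrete, here is a minimal sketch of a frequency-dependent ILD computed from left- and right-ear head-related impulse responses. The FFT length and variable names are illustrative assumptions, not details of Usher and Martens's analysis.

```python
import numpy as np

def ild_db(h_left, h_right, fs, n_fft=4096):
    """Frequency-dependent interaural level difference (ILD) in dB.

    h_left, h_right : left/right head-related impulse responses
    fs              : sample rate in Hz
    Returns (frequencies, ILD in dB); positive values mean left > right.
    """
    HL = np.abs(np.fft.rfft(h_left, n_fft))
    HR = np.abs(np.fft.rfft(h_right, n_fft))
    f = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    eps = 1e-12                      # guard against log of zero
    return f, 20.0 * np.log10((HL + eps) / (HR + eps))
```

Comparing such ILD curves across the present HRTFs (e.g., by the RMS deviation of each 'other' curve from the listener's own) would be one way to carry out the detailed analysis of ILDs suggested above.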
In nearly all of the preference experiments included in Table 6.5 (excepting externalization), listeners did not exclusively prefer their own HRTFs, and in some cases they even preferred another listener's HRTFs over their own. Results of the current experiment are notably different in that they did not show any instance in which a listener preferred another listener's ears over his own. In 7 out of 12 comparisons between 'own' and 'other' HRTF conditions, listeners rated 'own' lower than 'other' conditions at a statistically significant level (cf. Table 6.4). The point is that, while all other preference experiments observed at least one instance in which a listener preferred another HRTF over his own, the current experiment found none.

Synthesis

Listeners' perceptions of room effect were affected by whether stimulus presentation was natural or synthesized. Further, listeners reported that a large number of artifacts (e.g., tones, howling) were present during synthesized trials. As noted previously, artifacts can appear in the synthesis due to difficulty in inverting the room transfer function (Neely and Allen, 1979). Despite having been instructed to ignore artifacts, it is possible that listeners misinterpreted them as room effect. Attempts to identify artifacts in the measured synthesis signals (probe microphone recordings) have been unsuccessful. For example, in a trial in which L2 reported "strong artifacts," nothing was observed in the measured synthesis signals that could be identified as unusual or as deviating substantially from the target. Thus, identification of artifacts appears not to be straightforward, and may necessarily rely on the listener reporting audible artifacts to the experimenter.

A cutoff frequency of 12 kHz was used in the stimuli because it was desirable to maintain accurate pinna cues. However, a lower cutoff frequency (e.g., 4 kHz, which might be adequate for speech) may create fewer problems when inverting the transfer functions, which would presumably result in fewer artifacts during synthesis. The tradeoff would, of course, be elimination of pinna cues from the stimuli. Alternatively (or additionally), matrix regularization may be used to curtail large amplitudes in H⁺, which would also presumably reduce the occurrence of artifacts; a minimal sketch of such regularization is given at the end of this section. The regularized matrix would, however, give only an approximate solution. A more extreme solution would be to measure HRTFs in a lively room and then have the listener immediately move into an anechoic space for the synthesis. This would avoid room resonances in the inverse filters that can lead to artifacts during synthesis, though it would preclude natural trials. This approach would also allow one to avoid repositioning the probe microphones in the ear canals.

Listener motion post-calibration is also known to introduce synthesis artifacts, as was shown in a head-rotation experiment with KEMAR in section 5.9. In the current experiment, a metal ring was placed on the listener's head as an aid to prevent motion, but a more rigorous motion deterrent, such as a bite bar, should probably be used. In that setup, the listener's jaw is clamped onto a rod, and in this way translations and rotations of the head are avoided.
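The regularization idea mentioned above can be sketched as follows. For a wide transfer-function matrix H (2 rows, N loudspeaker columns) at each frequency bin, a Tikhonov-regularized pseudoinverse replaces H⁺ = Hᴴ(HHᴴ)⁻¹ with a damped version. The regularization parameter lam is an assumed tuning knob, not a value used in this work.

```python
import numpy as np

def regularized_pinv(H, lam=1e-2):
    """Tikhonov-regularized right pseudoinverse of a 2 x N matrix H.

    The plain pseudoinverse, H+ = H^H (H H^H)^-1, can blow up when
    H H^H is nearly singular (deep nulls in the room transfer function).
    The regularized version damps those large amplitudes at the cost of
    an approximate solution.
    """
    HH = H @ H.conj().T                       # 2 x 2
    return H.conj().T @ np.linalg.inv(HH + lam * np.eye(HH.shape[0]))

# Applied bin by bin across frequency:
#   for k in range(n_bins):
#       Hplus[k] = regularized_pinv(H[k], lam)
# Larger lam -> smaller loudspeaker drive amplitudes, larger synthesis error.
```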
6.4 Conclusions

A perceptual experiment on room effect was conducted using the transaural synthesis technique. It was thought that, by avoiding issues associated with headphones, a more accurate and precise experiment could be conducted to test the hypothesis that a listener perceives the least amount of room effect when listening through his own ears. Unfortunately, due to the difficulties in inverting the room transfer function, and also due to listener motion, the prevalence of synthesis artifacts may have distracted listeners from all but the most salient differences in room effect among HRTF conditions. The large relative boost in total average power in H1 in the 3−6 kHz frequency range may have led to enhanced perceptions of room effect. Mixed results for rating differences between 'own' and the remaining 'other' HRTF conditions (H2, H3) suggest that synthesis artifacts may have masked, or drawn the listener's attention away from, evaluation of room effect in those conditions. The unexpected but significant difference between listeners' ratings of room effect in 'own' and 'natural' conditions is consistent with the idea that listeners were distracted by synthesis artifacts, and that this impaired their ability to assess room effect in a consistent manner. The experimental design attempted to make the task as simple as possible by blocking on sentence so that, within a block, the only difference from one stimulus to the next was HRTF. This was in contrast to the multiple-parameter variations that occurred in the headphone perceptual experiment (Chapter 4). However, it seems the new approach was not helpful to the listener in the presence of artifacts.

Comparing HRTFs is a difficult task; informal comments from the four listeners in this experiment expressed as much. The differences in perceived room effect for different HRTF conditions could be extremely subtle. Repeatability is a known cause for concern in perceptual evaluation of HRTFs. Experiments by Schönstein and Katz (2012) examined repeatability and found variability across replicates to be statistically significant for all listeners. Andreopoulou and Katz (2016) also examined repeatability. In their experiment, ten expert listeners rated twelve HRTFs from the LISTEN and BiLi databases; no individualized HRTFs were used. The twelve HRTFs were convolved with Gaussian noise, and listeners were instructed to rate perceived spatial quality on a 9-point scale. They completed seven replicates. Analysis revealed that only 50% of listeners consistently provided repeatable ratings across replicates; the evaluations of the remaining listeners were inconsistent and unstable regardless of the number of task repetitions and the specific HRTF corpus. Further, the content and size of the HRTF corpus and the resolution of the rating scale were shown to play a significant role in HRTF rating repeatability. Although both the Schönstein and Katz and the Andreopoulou and Katz experiments used headphones, they nevertheless provide insight into the prevailing difficulty of HRTF perceptual evaluation.

Evaluating perceptual effects of HRTFs is challenging even under ideal circumstances (i.e., expert listeners, an appropriately sized HRTF corpus, adequate listener training). The prevalence of synthesis artifacts rendered the task even more difficult for listeners in the current experiment. These artifacts must be avoided– through a low-frequency cutoff and/or matrix regularization.
The tradeoff would be elimination of pinna cues from experimental stimuli and an approximate solution to the inverse problem. Nevertheless, these may be necessary measures for conducting an improved perceptual experiment on room squelch.

As the experimental technique used to probe room squelch advanced from (unequalized) headphones to transaural synthesis, listeners displayed increased sensitivity to HRTF in their evaluation of room effect. In the headphone experiment, listeners' ratings of perceived room effect were the same for the 'own' and 'other' HRTF conditions. In the current experiment, a highly significant difference in room effect ratings between 'own' and H1 conditions was observed for all listeners (with 'own' always rated lower), and there were mixed results for rating differences between 'own' and the remaining 'other' HRTFs (H2, H3). Where differences were significant, 'own' conditions were always rated lower.

Despite the presence of synthesis artifacts in the current experiment, its results are consistent with findings of other studies on HRTFs– namely, that listeners do not exclusively prefer their own ears for every perceptual evaluation criterion. However, the current results are unique: whereas all preference experiments described in Table 6.5 (excepting externalization) showed counterexamples in which listeners definitively preferred another person's HRTFs, the current experiment showed no instance in which a listener preferred another person's HRTF. With practical modifications to the transaural synthesis technique (e.g., the listener using a bite bar and synthesis done in an anechoic room), the prospects for a more rigorous test of the hypothesis, that a listener perceives the least amount of room effect (maximum squelch) with his own ears, are encouraging.

APPENDICES

APPENDIX A: Transaural synthesis reproducibility experiments

Various studies were conducted to further validate the TS technique. Only the 2 × 3 system was examined in the subsequent experiments.

Probe and internal microphone responses

Filled symbols in Fig. A.1 indicate recordings of the real source (equal amplitudes with 2^11 components) in the manikin's internal microphones, and the open symbols indicate the recordings in the probe microphones. It is evident that recordings of the real source loudspeaker look quite different in the probe and internal microphones. This is attributed to dissimilarity in the frequency responses of the two recording microphone systems. Additionally, displacement of the probe tip from the manikin's "eardrum" may enhance the discrepancy at high frequencies. Despite the obvious differences between the spectra measured at the internal and probe microphones, the synthesis of these signals was largely successful. Appendix B explains why, despite large differences in the frequency responses of the microphones, accurate synthesis can be expected in the internal microphones if the synthesis at the probe microphones is accurate.

Figure A.1: Amplitude spectra recorded in the internal (filled symbols) and probe (open symbols) microphones when the equal-amplitudes, random-phases noise was played from the real source loudspeaker. Discrepancy between filled and open symbols is attributed to dissimilarity in the frequency responses of the microphones.
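For concreteness, an equal-amplitudes, random-phases noise of the kind used throughout this appendix can be generated as sketched below. The sample rate, signal length, band edges, and component placement are assumptions for illustration; the actual stimulus parameters are specified in the main text.

```python
import numpy as np

rng = np.random.default_rng(0)

fs = 44100                     # sample rate in Hz (assumed)
n = 2 ** 16                    # signal length in samples (assumed)
n_comp = 2 ** 11               # number of equal-amplitude components
f_lo, f_hi = 100.0, 12000.0    # stimulus band (assumed)

# Equally spaced component bins within the band (placement is illustrative).
bins = np.unique(np.linspace(f_lo * n / fs, f_hi * n / fs, n_comp).astype(int))

# Equal magnitudes, independent random phases.
X = np.zeros(n // 2 + 1, dtype=complex)
X[bins] = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=bins.size))

x = np.fft.irfft(X, n)         # equal-amplitudes, random-phases noise
x /= np.max(np.abs(x))         # normalize for playback headroom
```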
Reproducibility– real source

The following experiments were primarily concerned with recordings of the real source at the "eardrums." Specifically, it was shown that the intensity level of the real source had minimal effect on spectra measured at the "eardrums." Effects of the probe tube microphones and of random fluctuations were also shown to be minimal.

Intensity level

To test linearity, an experiment was conducted in which the equal-amplitudes, random-phases noise (X0) was played at different intensity levels from the real source loudspeaker to determine whether a level dependency existed in the system performance. The protocol for each measurement was identical to that described previously, but some details are repeated for convenience: recordings of the real source were made with KEMAR's internal microphones and with probe microphones in the "ear canals." Level was measured by placing a sound level meter 6 inches away from the center of the real source loudspeaker. For the filled symbols shown in panels b and e of Fig. A.2, the level was 94 dBA. This is referred to as the "standard" level, and it resulted in a level of 76 dB at the entrance to the near (right) "ear."

Transfer functions (H) were measured by playing an MLS (2^17 − 1 = 131071 samples) through the synthesis loudspeakers and recording with the probe microphones. The level of the MLS was the same for the three synthesis loudspeakers– 95 dBA, measured 6 inches from the loudspeaker. This resulted in approximate levels of 78 dBA at the near ear and 76 dBA at the far ear. Note that H was measured once– the same H was used to calculate synthesis waveforms for the three different target source levels.

For synthesis, probe microphone recordings of the real source were used as the desired signals in the ears (X′_L and X′_R). The results of the synthesis are shown in panels b and e of Fig. A.2 as the open symbols.

After performing the synthesis at the standard level, the gain of the real source loudspeaker was reduced until a level of 91 dBA was measured with the sound level meter ("low" level). New probe microphone and internal microphone recordings were made with the real source sounding at the reduced level. The recordings in the internal microphones are indicated by the filled symbols in panels a and d of Fig. A.2. Synthesis was then performed using the "low"-level recordings as the desired signals in the "ears." Recordings of the synthesis at the manikin's "eardrums" are indicated by the open symbols. The gain of the real source loudspeaker was then increased until a level of 97 dBA was measured with the sound level meter ("high" level); recordings of the real source are indicated by the filled symbols in panels c and f of Fig. A.2. The synthesis was performed using the recordings at the "high" level as the desired signals in the "ears." Recordings of the synthesis are indicated by open symbols.

Synthesis accuracy was very similar for all three real source levels. This is apparent through visual inspection of Fig. A.2 as well as by considering RMS amplitude errors. In the right "ear," errors varied by no more than one percent from the "low" level to the "high," which spanned a 6 dB difference. The left "ear" showed a modestly larger difference of 3.5 percentage points between the "low" and "high" levels, due to the poorer signal-to-noise ratio (since the real source was on the right).

Figure A.2: Intensity level of the real source had essentially no effect on the synthesis accuracy. This can be seen both visually and by comparing RMS amplitude errors across the three different levels in both "ears." It can thus be concluded that the system was operating in a linear regime, though the experiment was modest since it only spanned 6 dB.
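The MLS measurement can be sketched as follows. An order-17 maximum length sequence is generated with a linear feedback shift register, and an impulse response is recovered by circular cross-correlation of the recording with the sequence. The feedback taps correspond to the primitive polynomial x^17 + x^3 + 1, one valid choice for order 17; the code is an illustration, not the laboratory's measurement software.

```python
import numpy as np

def mls(order=17, taps=(17, 3)):
    """Maximum length sequence of length 2**order - 1, values +/-1."""
    n = 2 ** order - 1
    reg = np.ones(order, dtype=int)       # nonzero seed
    seq = np.empty(n)
    for i in range(n):
        seq[i] = 1.0 - 2.0 * reg[-1]      # map bit {0,1} -> {+1,-1}
        fb = 0
        for t in taps:                    # XOR of the tapped bits
            fb ^= reg[t - 1]
        reg = np.roll(reg, 1)             # shift register
        reg[0] = fb
    return seq

def impulse_response(y, x):
    """Impulse response via circular cross-correlation of MLS input y
    with the recorded output x (one full period each).

    The MLS autocorrelation is n at lag 0 and -1 elsewhere, so dividing
    by n + 1 recovers h up to a small DC offset.
    """
    n = len(y)
    h = np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(y))).real
    return h / (n + 1)
```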
Probe tubes

An experiment was conducted to demonstrate that the probe tubes did not disturb the sound field at the "eardrums." The probe tubes were positioned in the "ear canals," with the tips 1 mm from the internal microphones, or "eardrums." The equal-amplitudes, random-phases noise was played from the real source loudspeaker, and recordings were made with the internal microphones. The amplitude spectrum of the recording with the probe microphones in place is indicated by the filled symbols in Fig. A.3. The probe tubes were then removed from the "ear canals" and a new recording was made. The amplitude spectrum with the probe tips removed is indicated by the open symbols. Very close agreement of the symbols indicates that the probe tubes minimally perturbed the sound field at the "eardrums." The RMS amplitude discrepancy between recordings with and without probe tubes was 0.0300 in the left "ear" and 0.0146 in the right (near) "ear." Essentially no difference was found, and none was expected, because the probe tubes were less than 1 mm in diameter. This experiment was largely conducted for the sake of completeness.

Figure A.3: Amplitude spectra of real source recordings made in the internal microphones with probe tubes placed in the ear canals, 1 mm from the "eardrums" (filled symbols). The probe tubes were then removed from the "ear canals" (open symbols). Very close agreement between filled and open symbols indicates that the probe tubes minimally perturbed the sound field at the "eardrums."

Random fluctuations

To quantify variability in the real source spectra measured at the "eardrums" due to random fluctuations, the equal-amplitudes, random-phases noise was played from the real source loudspeaker and recordings were made with the internal microphones. The measurement was then immediately repeated– no changes had been made. Note that the probe tips were placed 1 mm from the "eardrums," but the probe microphones were not used in the experiment. Amplitude spectra for the first recording are indicated by the filled symbols in Fig. A.4; spectra for the second recording are indicated by open symbols. Variability in the spectra recorded at the "eardrums" was quite small– most open symbols are completely occluded by the filled symbols. The RMS discrepancy was 0.0161 in the left ear and 0.0078 in the right ear.

Figure A.4: Amplitude spectra of real source recordings made in the internal microphones. Initial recordings are indicated by filled symbols and the subsequent recordings by open symbols. No changes were made between the two measurements, so any discrepancy is due to random fluctuations. Note that probe tips were present in the "ear canals" during the measurements (1 mm from the "eardrums") but were not used in the experiment.
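The RMS discrepancy values quoted in this appendix compare two amplitude spectra. The following is a minimal sketch under the assumption that each spectrum is normalized to unit RMS before comparison; the exact normalization used in this work is an assumption here, defined properly with the validation experiments.

```python
import numpy as np

def rms_discrepancy(A, B):
    """Dimensionless RMS discrepancy between two amplitude spectra A and B.

    Each spectrum is first normalized to unit RMS (an illustrative
    convention) so that overall gain differences do not contribute.
    """
    A = np.asarray(A, float)
    B = np.asarray(B, float)
    A = A / np.sqrt(np.mean(A ** 2))
    B = B / np.sqrt(np.mean(B ** 2))
    return np.sqrt(np.mean((A - B) ** 2))
```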
Reproducibility– synthesis

The following experiments were concerned with the spectra measured at the "eardrums" (i.e., in the internal microphones) during synthesis.

Random fluctuations

Probe tips were placed 1 mm from the "eardrums." Recordings were made in the left (x_{0L}) and right (x_{0R}) internal microphones while the equal-amplitudes, random-phases noise was played through the real source loudspeaker. Synthesis loudspeaker-to-eardrum transfer functions (H) were then measured in the probe microphones, and y′ waveforms were calculated. The desired signals in the ears, X′_L and X′_R, were set equal to X_{0L} and X_{0R}: X′_L = X_{0L} and X′_R = X_{0R}. The synthesis was performed and recordings were made with the internal microphones. Synthesis was performed again (using the same y′ waveforms) and new recordings were made at the internal microphones. Note that the probe tip positions were never changed during the entire experiment.

Spectra measured at the internal microphones are shown in Fig. A.5. The first recording of the synthesis is indicated by filled symbols and the second recording by open symbols. The filled symbols almost completely occlude the open symbols, indicating that variation in the synthesis measured at the "eardrums" was very small. The RMS error was 0.0106 in the left "ear" and 0.0055 in the right "ear." Synthesis was performed a third time, after allowing some time to elapse. Even after ten minutes, the RMS error in the left "ear" had increased only to 0.0218, and in the right "ear" to 0.0134 (data not shown). This demonstrated long-term stability of the synthesis system.

Figure A.5: Amplitude spectra recorded at the "eardrums" during synthesis. The first recording of the synthesis is indicated by filled symbols, and the second by open symbols. Variation due to random fluctuations was very small.

Different probe placements for target source and synthesis

An experiment was conducted to determine the effect of having a different probe tip placement during the real source measurement and the synthesis. First, probe tips were placed 1 mm from the "eardrums." Recordings were made in the left (x^{u,i}_{0L}) and right (x^{u,i}_{0R}) internal microphones while the equal-amplitudes, random-phases noise was played through the real source loudspeaker. Amplitude spectra of the internal microphone recordings are indicated by the filled symbols in panels c and d of Fig. A.6. Recordings were also made in the probe microphones (x^{u,p}_{0L} and x^{u,p}_{0R}, not shown).

Then the probe tips were removed from the "ear canals." They were reinserted and again positioned 1 mm from the "eardrums." The equal-amplitudes, random-phases noise was played through the real source loudspeaker, and recordings were made in the left (x^{m,i}_{0L}) and right (x^{m,i}_{0R}) internal microphones and in the probe microphones (x^{m,p}_{0L} and x^{m,p}_{0R}). The superscript 'm' indicates that the probe tip placements were the same for the real source and synthesis recordings ('matched' condition). Amplitude spectra of the internal microphone recordings are indicated by the filled symbols in panels a and b of Fig. A.6. Synthesis loudspeaker-to-eardrum transfer functions (H) were then measured in the probe microphones, and y′ waveforms were calculated using X^{m,p}_{0L} and X^{m,p}_{0R} as the desired signals in the "ears." The synthesis was performed and recordings were made with the internal microphones. Amplitude spectra recorded at the internal microphones are indicated by open symbols in panels a and b of Fig. A.6.
Figure A.6: Amplitude spectra recorded at the internal microphones during synthesis. The top panels (a: left ear, b: right ear) depict synthesis in which the probe tip placement in the "ear canals" was the same for the real source and the synthesis ('matched' condition); the desired signals in the ears were X′_L = X^{m,p}_{0L} and X′_R = X^{m,p}_{0R}. The bottom panels (c: left ear, d: right ear) depict synthesis for which the desired signals in the ears (X′_L = X^{u,p}_{0L} and X′_R = X^{u,p}_{0R}) were measured with a different probe microphone placement than was used during the synthesis ('unmatched' condition).

A second synthesis was conducted, but this time the y′ waveforms were calculated using the same H matrix with the old real source recordings as the desired signals in the ears: X′_L = X^{u,p}_{0L} and X′_R = X^{u,p}_{0R}. Recall that these were measured with a different probe tip placement (though still 1 mm from the "eardrum"). The superscript 'u' indicates that the probe tip placements were different for the target source and synthesis recordings ('unmatched' condition). The synthesis was performed using the new y′ waveforms, and recordings were made with the internal microphones. Amplitude spectra recorded at the internal microphones are indicated by the open symbols in panels c and d of Fig. A.6.

It is clear from Fig. A.6 that the unmatched condition, in which the real source was measured with a different probe tip placement than the synthesis, had a deleterious effect on synthesis accuracy, especially in the 3.5−10 kHz frequency range. The effect is most dramatic in the right "ear"– the RMS error increased by 315%. The difference between the top and bottom panels represents the extent of imprecision of probe microphone placement in the "ear canals." The left probe tube was apparently reinserted more closely to its original position in the "ear canal" than the right tube was.

Indeed, it is surprising that the attempt to synthesize a real source signal that was measured with different probe tip positions in the "ear canals" was as successful as it was. To understand this, it is instructive to consider an extreme case in which the probe tips are positioned inside the ear canals when the real source is sounding, but positioned on top of the head to conduct the synthesis. The synthesis will result in correct spectra being recorded in the probe microphones, but the spectra recorded at the eardrums will look nothing like the real source spectra recorded at the eardrums. Thus, the only way to ensure accurate synthesis at a listener's eardrums is to use exactly the same probe microphone placement for recording the real source and for conducting the synthesis. This is a subtle but very important point in regard to experimental design.

APPENDIX B: Transaural synthesis with probe microphones in the ear canals

This appendix describes how a signal sent from a target source loudspeaker and received at the eardrums can be simulated by synthesis loudspeakers using transfer functions measured with probe microphones in the ear canals. The description takes the form of a test wherein the signals at the eardrums can be known because they are measured using an anatomical manikin with internal microphones for eardrums. Recordings made with those internal microphones are given the subscript k; recordings made with the probe microphones have subscript p. As before, quantities that occur physically are indicated with bold symbols.

A target stimulus, called s_0, is played through a real source loudspeaker, and recordings are made using the manikin's internal microphones to represent eardrum recordings x_{k0}:

    x_{k0} = H_{k0} s_0,                (B.1)

where H_{k0} is the transfer function matrix from s_0 to the eardrums. The subscript zero is used to indicate that the signal originated from the real source loudspeaker.
Signals originating from the synthesis loudspeakers do not have this subscript. Recorded signals x_{k0} are the standard for evaluating the subsequent transaural synthesis. In addition, with signal s_0 played through the real source loudspeaker, recordings x_{p0} are made using probe microphones in the ear canals:

    x_{p0} = H_{p0} s_0.                (B.2)

The next step is to determine the transfer functions H_p to the two ear canals for each of the synthesis loudspeakers, using a signal y from each loudspeaker in turn while recording x_p in the probe microphones. Signal y is a long maximum length sequence (MLS). The cross-correlation of x_p and y is calculated to obtain H_p. This matrix has dimensions 2 × N (two rows and N columns), where N is the number of synthesis loudspeakers. It is used to compute the pseudoinverse matrix H_p^+ required for synthesis.

The synthesis proceeds by arranging for signal x_{p0} from Eq. (B.2) to appear at the probe microphones during synthesis, i.e., the desired signal at the probe microphones, x′_{p0}, is set equal to the recorded signal x_{p0}. In order to achieve that, x_{p0} is filtered by the pseudoinverse to obtain y′:

    y′ = H_p^+ x_{p0},                  (B.3)

where y′ are the N signals to be sent to the synthesis loudspeakers. Recordings of the synthesis as made in the probe microphones are

    x_p = H_p y′ = H_p H_p^+ x_{p0},    (B.4)

and x_p ought to equal x_{p0} because H_p H_p^+ equals the identity matrix.

To test the system, the same signals y′ are played through the synthesis loudspeakers, and recordings x_k are made using the internal microphones:

    x_k = H_k y′,                       (B.5)

where H_k is the transfer function that occurs between the synthesis loudspeakers and the eardrums. Neither H_k nor x_k plays any role in the synthesis, but x_k is the final result for comparison with x_{k0}. Equation (B.3) can be substituted for y′, resulting in:

    x_k = H_k H_p^+ x_{p0}.             (B.6)

A further substitution from Eq. (B.2) can be made for x_{p0}, yielding:

    x_k = H_k H_p^+ H_{p0} s_0.         (B.7)

If it could be shown that H_k H_p^+ H_{p0} = H_{k0}, then, according to Eq. (B.1), the signals at the eardrums from the synthesis would be the same as the signals at the eardrums from the original target source, namely x_k = x_{k0}.

We begin by writing an expression to relate the probe-microphone transfer function to the internal-microphone transfer function. Both transfer functions originate at the synthesis loudspeakers:

    H_p = Q_p H_k,                      (B.8)

where Q_p is necessarily a 2 × 2 matrix, whatever the number of synthesis loudspeakers. Further, Q_p is diagonal because the relationship between the probe microphone and the manikin internal microphone that occurs in one ear is unaffected by the relationship in the other ear. An analogous expression relates the probe-microphone transfer function to the internal-microphone transfer function when both transfer functions originate at the target loudspeaker:

    H_{p0} = Q_{p0} H_{k0}.             (B.9)

Because the inverse of Eq. (B.8) is

    H_p^+ = H_k^+ Q_p^+,                (B.10)

it follows that Eq. (B.7) can be rewritten using Eq. (B.10) and Eq. (B.9):

    x_k = H_k H_k^+ Q_p^+ Q_{p0} H_{k0} s_0,    (B.11)

or

    x_k = Q_p^+ Q_{p0} H_{k0} s_0,      (B.12)

because H_k H_k^+ = I. It is a common and reasonable assumption that the relationship between signals as measured at two different points within an ear canal depends only on the signal spectrum and is independent of the direction from which the original signal originates (Hammershøi and Møller, 1996; Middlebrooks et al., 1989). Therefore, Q_p = Q_{p0}, and Q_p^+ Q_{p0} = I. Then

    x_k = H_{k0} s_0,                   (B.13)

or, by substituting Eq. (B.1) for H_{k0} s_0,

    x_k = x_{k0}.                       (B.14)

In the end, the signals at the eardrums resulting from the synthesis are found to be equal to the signals at the eardrums from the original target source.
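The algebra above can be checked numerically. The following sketch builds a random 2 × 3 system at a single frequency bin, forms the pseudoinverse, and verifies both H_p H_p^+ = I and the end-to-end identity x_k = x_{k0} when Q_p = Q_{p0}. The random matrices are stand-ins for measured transfer functions.

```python
import numpy as np

rng = np.random.default_rng(1)

N = 3                                    # number of synthesis loudspeakers
Hk = rng.normal(size=(2, N)) + 1j * rng.normal(size=(2, N))   # speakers -> eardrums
Qp = np.diag(rng.normal(size=2) + 1j * rng.normal(size=2))    # eardrum -> probe, per ear
Hp = Qp @ Hk                             # speakers -> probes (Eq. B.8)

Hk0 = rng.normal(size=(2, 1)) + 1j * rng.normal(size=(2, 1))  # target source -> eardrums
Hp0 = Qp @ Hk0                           # same Q because placement is unchanged (Q_p = Q_p0)

s0 = np.array([[1.0 + 0.5j]])            # target stimulus at this frequency bin
xk0 = Hk0 @ s0                           # eardrum signals from the real source (Eq. B.1)
xp0 = Hp0 @ s0                           # probe signals from the real source (Eq. B.2)

Hp_plus = np.linalg.pinv(Hp)             # N x 2 pseudoinverse
print(np.allclose(Hp @ Hp_plus, np.eye(2)))   # True: Hp Hp+ = I (wide, full-rank Hp)

y = Hp_plus @ xp0                        # loudspeaker signals (Eq. B.3)
xk = Hk @ y                              # eardrum signals from the synthesis (Eq. B.5)
print(np.allclose(xk, xk0))              # True: synthesis reproduces the real source
```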
BIBLIOGRAPHY

1) Akeroyd, M.A., Chambers, J., Bullock, D., Palmer, A.R., Summerfield, A.Q., Nelson, P.A., and Gatehouse, S. (2007) "The binaural performance of a cross-talk cancellation system with matched or mismatched setup and playback acoustics," J. Acoust. Soc. Am. 121, 1056-1069.

2) Andreopoulou, A. and Katz, B.F.G. (2015) "On the use of subjective HRTF evaluations for creating global perceptual similarity metrics of assessors and assessees," 21st ICAD, July 8-10, Graz, Austria, 13-20.

3) Algazi, V.R., Duda, R.O., and Thompson, D.M. (2001) "The CIPIC HRTF database," IEEE Workshop Appl. Signal Process. Audio and Acoust., 99-102.

4) Allen, J.B., Berkley, D.A., and Blauert, J. (1977) "Multimicrophone signal-processing technique to remove room reverberation from speech signals," J. Acoust. Soc. Am. 62, 912-915.

5) Bai, M.R. and Lee, C.C. (2006) "Objective and subjective analysis of effects of listening angle on crosstalk cancellation in spatial sound reproduction," J. Acoust. Soc. Am. 120, 1976-1989.

6) Bai, M.R., Tung, C.W., and Lee, C.C. (2005) "Optimal design of loudspeaker arrays for robust cross-talk cancellation using the Taguchi method and the genetic algorithm," J. Acoust. Soc. Am. 117, 2802-2813.

7) Bauck, J. and Cooper, D.H. (1996) "Generalized transaural stereo and applications," J. Audio Eng. Soc. 44, 683-705.

8) Bauer, B.B. (1961) "Stereophonic earphones and binaural loudspeakers," J. Audio Eng. Soc. 9, 148-151.

9) Bilsen, F.A. (1967) "Thresholds of perception of repetition pitch. Conclusions concerning coloration in room acoustics and correlation in the hearing organ," Acustica 19, 27-32.

10) Blauert, J. (1969) "Sound localization in the median plane," Acustica 22, 205-213.

11) Bradley, J.S. (2001) "Optimizing the decay range in room acoustics measurements using maximum length sequences," J. Audio Eng. Soc. 44, 266-273.

12) Brüggen, M. (2001) "Coloration and binaural decoloration in natural environments," Acust. Acta Acust. 87, 400-406.

13) Bücklein, R. (1981) "The audibility of frequency response irregularities," J. Audio Eng. Soc. 29, 126-131.

14) Cai, T., Rakerd, B., and Hartmann, W.M. (2015) "Computing interaural differences through finite element modeling of idealized human heads," J. Acoust. Soc. Am. 138, 1549-1560.

15) Carlile, S. and Pralong, D. (1994) "The location-dependent nature of perceptually salient features of the human head-related transfer functions," J. Acoust. Soc. Am. 95, 3445-3459.

16) Cohen, J. and Cohen, P. (1983) "Applied multiple regression/correlation analysis for the behavioral sciences," Hillsdale, NJ: Lawrence Erlbaum Associates, Inc., second edition.

17) Cooper, D.H. and Bauck, J.L. (1989) "Prospects for transaural recording," J. Audio Eng. Soc. 37, 3-19.

18) Damaske, P. (1971) "Head-related two-channel stereophony with loudspeaker reproduction," J. Acoust. Soc. Am. 50, 1109-1115.

19) Davies, W.D.T. (1966) "Generation and properties of maximum-length sequences," Control 10, 364-365.

20) Domnitz, R.H. (1975) "Headphone monitoring system for binaural experiments below 1 kHz," J. Acoust. Soc. Am. 58, 510-511.

21) Dunn, C. and Hawksford, M.O. (1993) "Distortion immunity of MLS-derived impulse response measurements," J. Audio Eng. Soc. 41, 314-335.
22) Durlach, N.I., Rigopulos, A., Pang, X.D., Woods, W.S., Kulkarni, A., Colburn, H.S., and Wenzel, E.M. (1992) "On the externalization of auditory images," Presence 1, 251-257.

23) Edmonds, B.A. and Culling, J.F. (2009) "Interaural correlation and the binaural summation of loudness," J. Acoust. Soc. Am. 125, 3865-3870.

24) Ellis, G.M., Zahorik, P., and Hartmann, W.M. (2016) "Using multidimensional scaling techniques to quantify binaural squelch," Proc. Mtgs. Acoust. 23, 1-10.

25) Firestone, F.A. (1930) "The phase difference and amplitude ratio at the ears due to a source of pure tone," J. Acoust. Soc. Am. 2, 260-270.

26) Flanagan, J.L. and Lummis, R.C. (1970) "Signal processing to reduce multipath distortion in small rooms," J. Acoust. Soc. Am. 47, 1475-1481.

27) Gumerov, N.A., O'Donovan, A.E., Duraiswami, R., and Zotkin, D.N. (2010) "Computation of the head-related transfer function via the fast multipole accelerated boundary element method and its spherical harmonics," J. Acoust. Soc. Am. 127, 370-386.

28) Haas, H. (1972) "The influence of a single echo on the audibility of speech," J. Audio Eng. Soc. 20, 146-159. [Translated by K.P.R. Ehrenberg from the original (1949) "Über den Einfluss des Einfachechos auf die Hörsamkeit von Sprache."]

29) Hammershøi, D. and Møller, H. (1996) "Sound transmission to and within the human ear canal," J. Acoust. Soc. Am. 100, 408-427.

30) Hartmann, W.M. (1998) "Signals, Sound, and Sensation," New York, NY: Springer Science+Business Media Inc., 5th edition.

31) Hartmann, W.M. and Candy, J.V. (2006) "Acoustic signal processing," Springer Handbook of Acoustics, 2nd edition, 558-560.

32) Hartmann, W.M., Rakerd, B., and Koller, A. (2005) "Binaural coherence in rooms," Acust. Acta Acust. 91, 451-462.

33) Hartmann, W.M., Rakerd, B., Crawford, Z.D., and Zhang, P.X. (2016) "Transaural experiments and a revised duplex theory for the localization of low-frequency tones," J. Acoust. Soc. Am. 139, 968-985.

34) Hartmann, W.M. and Wittenberg, A. (1996) "On the externalization of sound images," J. Acoust. Soc. Am. 99, 3678-3688.

35) "IEEE recommended practice for speech quality measurements," (1969) IEEE Trans. Audio Electroacoust. AU-17(3), 225-246.

36) Katz, B.F.G. and Parseihian, G. (2012) "Perceptually based head-related transfer function database optimization," J. Acoust. Soc. Am. Exp. Lett. 131, 99-105.

37) Kirkeby, O., Nelson, P.A., and Hamada, H. (1998a) "Local sound field reproduction using two closely spaced loudspeakers," J. Acoust. Soc. Am. 104, 1973-1981.

38) Kirkeby, O., Nelson, P.A., and Hamada, H. (1998b) "The 'stereo dipole'– a virtual source imaging system using two closely spaced loudspeakers," J. Audio Eng. Soc. 46 (5), 387-395.

39) Kirkeby, O., Nelson, P.A., Hamada, H., and Orduna-Bustamante, F. (1998c) "Fast deconvolution of multichannel systems using regularization," IEEE Trans. Speech Audio Process. 6 (2), 189-195.

40) Kirkeby, O. and Nelson, P.A. (1999) "Digital filter design for inversion problems in sound reproduction," J. Audio Eng. Soc. 47 (7/8), 583-595.

41) Koenig, A.H., Berkley, D.A., Curtis, T.H., and Allen, J.B. (1975) "Magnitude of JNDs for diotic and dichotic perception of spectrally colored noise," J. Acoust. Soc. Am. 58, S55.

42) Koenig, W. (1950) "Subjective effects in binaural hearing," J. Acoust. Soc. Am. 22, 61-62.

43) Kulkarni, A. and Colburn, H.S. (2000) "Variability in the characterization of the headphone transfer-function," J. Acoust. Soc. Am. 107, 1071-1074.

44) Macaulay, E.J., Hartmann, W.M., and Rakerd, B. (2010) "The acoustical bright spot and mislocalization of tones by human listeners," J. Acoust. Soc. Am. 127, 1440-1449.
45) MacKeith, N.W. and Coles, R.R. (1971) "Binaural advantages in hearing of speech," J. Laryng. Otolaryng. 85, 213-232.

46) Majdak, P., Masiero, B., and Fels, J. (2013) "Sound localization in individualized and non-individualized crosstalk cancellation systems," J. Acoust. Soc. Am. 133, 2055-2068.

47) Martin, R.L., McAnally, K.I., and Senova, M.A. (2001) "Free-field equivalent localization of virtual audio," J. Audio Eng. Soc. 49 (1/2), 14-22.

48) Mehrgardt, S. and Mellert, V. (1977) "Transformation characteristics of the external human ear," J. Acoust. Soc. Am. 61, 1567-1576.

49) Middlebrooks, J.C. (1999a) "Individual differences in external-ear transfer functions reduced by scaling in frequency," J. Acoust. Soc. Am. 106, 1480-1492.

50) Middlebrooks, J.C. (1999b) "Virtual localization improved by scaling nonindividualized external-ear transfer functions in frequency," J. Acoust. Soc. Am. 106, 1493-1510.

51) Miyoshi, M. and Kaneda, Y. (1988) "Inverse filtering of room acoustics," IEEE Trans. Acoust., Speech, Signal Process. 36, 145-152.

52) Møller, H. (1992) "Fundamentals of binaural technology," Appl. Acoust. 36, 171-218.

53) Møller, H., Sørensen, M.F., Hammershøi, D., and Jensen, C.B. (1995) "Head-related transfer functions of human subjects," J. Audio Eng. Soc. 43, 300-321.

54) Moncur, J.P. and Dirks, D. (1967) "Binaural and monaural speech intelligibility in reverberation," J. Speech Hear. Res. 10, 186-195.

55) Moore, A.H., Tew, A.I., and Nicol, R. (2010) "An initial validation of individualized crosstalk cancellation filters for binaural perceptual experiments," J. Audio Eng. Soc. 58 (1/2), 36-45.

56) Moore, E.H. (1920) "On the reciprocal of the general algebraic matrix," Bull. Amer. Math. Soc. 26, 394-395.

57) Morimoto, M. and Ando, Y. (1980) "On the simulation of sound localization," J. Acoust. Soc. Jpn. (E) 1 (3), 167-174.

58) Müller, S. and Massarani, P. (2001) "Transfer-function measurement with sweeps," J. Audio Eng. Soc. 49, 443-471.

59) Nábělek, A.K. and Pickett, J.M. (1974) "Monaural and binaural speech perception through hearing aids under noise and reverberation with normal and hearing-impaired listeners," J. Speech Hear. Res. 17, 724-739.

60) Neely, S.T. and Allen, J.B. (1979) "Invertibility of a room impulse response," J. Acoust. Soc. Am. 66, 165-169.

61) Nelson, P.A. and Rose, J.F.W. (2005) "Errors in two-point sound reproduction," J. Acoust. Soc. Am. 118, 193-204.

62) Norcross, S.G., Soulodre, G.A., and Lavoie, M.C. (2004) "Distortion audibility in inverse filtering," Audio Eng. Soc. Conv. 117, 1-7.

63) Olsen, W.O. and Carhart, R. (1967) "Development of test procedures for evaluation of binaural hearing aids," Bull. Prosthet. Res. 10, 22-49.

64) Parodi, Y.L. and Rubak, P. (2010) "Objective evaluation of the sweet spot size in spatial sound reproduction using elevated loudspeakers," J. Acoust. Soc. Am. 128, 1045-1055.

65) Pellegrini, R.S. (2002) "Perception-based design of virtual rooms for sound reproduction," Audio Eng. Soc. 22nd Int. Conf. on Virt., Synth., and Enter. Audio, 245-255.

66) Penrose, R. (1955a) "A generalized inverse for matrices," Proc. Cambridge Philos. Soc. 51, 406-413.

67) Penrose, R. (1955b) "On the approximate solution of linear matrix equations," Proc. Cambridge Philos. Soc. 52, 17-19.

68) Pollack, I. and Trittipoe, W.J. (1959) "Binaural listening and interaural noise cross correlation," J. Acoust. Soc. Am. 31, 1250-1252.
69) Pralong, D. and Carlile, S. (1996) "The role of individualized headphone calibration for the generation of high fidelity virtual auditory space," J. Acoust. Soc. Am. 100, 3785-3793.

70) Rife, D.D. and Vanderkooy, J. (1989) "Transfer function measurement with maximum length sequences," J. Audio Eng. Soc. 37, 419-444.

71) Roginska, A., Wakefield, G.H., and Santoro, T.S. (2010) "Stimulus-dependent HRTF preference," Audio Eng. Soc. 129th Convention, San Francisco, CA, paper 8268.

72) Rose, J., Nelson, P.A., Rafaely, B., and Takeuchi, T. (2002) "Sweet spot size of virtual acoustic imaging systems at asymmetric listener locations," J. Acoust. Soc. Am. 112, 1992-2002.

73) Schroeder, M. (1975) "Models of hearing," Proc. IEEE 63, 1332-1354.

74) Schroeder, M.R. and Atal, B.S. (1963) "Computer simulation of sound transmission in rooms," IEEE Intl. Conv. Rec. 11, 150-155.

75) Schuck, P.L., Bonneville, M.E., Momtahan, K.L., and Verreault, E.S. (1993) "Perception of reproduced sound in rooms: some results of the Athena project," Proc. Audio Eng. Soc. 12th Int. Conf., Copenhagen, Denmark, June 28-30, 49-73.

76) Schönstein, D. and Katz, B.F.G. (2012) "Variability in perceptual evaluation of HRTFs," J. Audio Eng. Soc. 60, 783-793.

77) Seeber, B.U. and Fastl, H. (2003) "Subjective selection of non-individual head-related transfer functions," Conf. Aud. Display, Boston, MA, 259-262.

78) Shaw, E.A.G. (1974) "Transformation of sound pressure level from the free-field to the eardrum in the horizontal plane," J. Acoust. Soc. Am. 56, 1848-1861.

79) Shore, A., Hartmann, W.M., Rakerd, B., Ellis, G.M., and Zahorik, P. (2016) "Squelch of room effects in everyday conversation," J. Acoust. Soc. Am. 139, 2212.

80) Shore, A., Tropiano, A.J., and Hartmann, W.M. (2018) "Matched transaural synthesis with probe microphones for psychoacoustical experiments," J. Acoust. Soc. Am., accepted for publication.

81) Simon, L.S.R., Zacharov, N., and Katz, B.F.G. (2016) "Perceptual attributes for the comparison of head-related transfer functions," J. Acoust. Soc. Am. 140, 3623-3632.

82) Takeuchi, T. and Nelson, P.A. (2001) "Optimal source distribution for binaural synthesis over loudspeakers," Acoust. Res. Lett. Online 2, 7-12.

83) Takeuchi, T. and Nelson, P.A. (2002) "Optimal source distribution for binaural synthesis over loudspeakers," J. Acoust. Soc. Am. 112, 2786-2797.

84) Takeuchi, T., Nelson, P.A., and Hamada, H. (2001) "Robustness to head misalignment of sound imaging systems," J. Acoust. Soc. Am. 109, 958-970.

85) Teret, E., Pastore, M.T., and Braasch, J. (2017) "The influence of signal type on perceived reverberance," J. Acoust. Soc. Am. 141, 1675-1682.

86) Toole, F.E. and Olive, S.E. (1986) "The perception of sound coloration due to resonances in loudspeakers and other audio components," 81st Convention of Audio Eng. Soc., Los Angeles, CA, paper 2406, 1-31.

87) Usher, J. and Martens, W.L. (2007) "Perceived naturalness of speech sounds presented using personalized versus non-personalized HRTFs," Proc. 13th ICAD, June 26-29, Montréal, CA, 10-16.

88) Vanderkooy, J. (1994) "Aspects of MLS measuring systems," J. Audio Eng. Soc. 42, 219-231.

89) Ward, D.B. and Elko, G.W. (1998) "Optimal loudspeaker spacing for robust crosstalk cancellation," Proc. ICASSP 98, IEEE, 3541-3544.

90) Ward, D.B. and Elko, G.W. (1999) "Effect of loudspeaker position on the robustness of acoustic crosstalk cancellation," IEEE Signal Process. Lett. 6 (5), 106-108.

91) Warusfel, O. (2003) LISTEN HRTF Database, http://www.recherche.ircam.fr/equipes/salles/listen/.
92) Wenzel, E.M., Arruda, M., Kistler, D.J., and Wightman, F.L. (1993) "Localization using nonindividualized head-related transfer functions," J. Acoust. Soc. Am. 94, 111-123.

93) Wightman, F.L. and Kistler, D.J. (1989a) "Headphone simulation of free-field listening I: stimulus synthesis," J. Acoust. Soc. Am. 85, 858-867.

94) Wightman, F.L. and Kistler, D.J. (1989b) "Headphone simulation of free-field listening II: psychophysical validation," J. Acoust. Soc. Am. 85, 868-878.

95) Yang, J., Gan, W.S., and Tan, S.E. (2003) "Improved sound separation using three loudspeakers," Acoust. Res. Lett. Online 4 (2), 47-52.

96) Yost, W.A. (2007) "Fundamentals of Hearing: an Introduction," San Diego, CA: Elsevier, Inc., 5th edition.

97) Zahorik, P. (2002) "Direct-to-reverberant energy ratio sensitivity," J. Acoust. Soc. Am. 112, 2110-2117.

98) Zhang, P.X. and Hartmann, W.M. (2010) "On the ability of human listeners to distinguish between front and back," Hear. Res. 260, 30-46.

99) Zurek, P.M. (1979) "Measurements of binaural echo suppression," J. Acoust. Soc. Am. 66, 1750-1757.