ALGORITHMS FOR NOISY QUANTUM COMPUTERS AND
TECHNIQUES FOR ERROR MITIGATION
By
Ryan LaRose
A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Computational Mathematics, Science and Engineering – Doctor of Philosophy
Physics – Dual Major
2022
ABSTRACT
Quantum computation will likely provide significant advantages relative to classical architectures for
certain computational problems in number theory and physics, and potentially in other areas such as
optimization and machine learning. While some key theoretical and engineering problems remain
to be solved, experimental advances in recent years have demonstrated the first beyond-classical
quantum computation as well as the first experiments in error-corrected quantum computation. In
this thesis, we focus on quantum computers with around one hundred qubits that can implement
around one thousand operations, the so-called noisy intermediate-scale quantum (NISQ) regime
or kilo-scale quantum (KSQ) regime, and develop algorithms tailored to these devices as well as
techniques for error mitigation that require significantly less overhead than fault-tolerant quantum
computation. In the first part, we develop quantum algorithms for diagonalizing quantum states
(density matrices) and compiling quantum circuits. These algorithms use a quantum computer to
evaluate a cost function which is classically hard to compute and a classical computer to adjust
parameters of an ansatz circuit, similar to the variational principle in quantum mechanics and other
variational quantum algorithms for chemistry and optimization. In the second part, we extend
an error mitigation technique known as zero-noise extrapolation and introduce a new framework
for error mitigation which we call logical shadow tomography. In particular, we adapt zero-noise
extrapolation (ZNE) to the gate model and introduce new methods for noise scaling and (adaptive)
extrapolation. Further, we analyze ZNE in the presence of time-correlated noise and experimentally
show ZNE increases the effective quantum volume of various quantum computers. Finally, we
develop a simple framework for error mitigation that enables (the composition of) several error
mitigation techniques with significantly fewer resources than prior methods, and numerically show
the advantages of our framework.
Copyright by
RYAN LAROSE
2022
To my wonderful, inspiring, encouraging K-12 teachers.
John Barber Jr, Laurie Blom, Jan Boswell, Don Dziuk, Todd Green,
David Kirsten, Rachel Kleinke, Kim Latona, Jessica Norton, Catherine Nutter,
Bradley Porter, Philip Ricci, Pamela Ruggiero, David Stumpf, Janet Nellis-Trubiano,
Mark Van Hecke, Paula VanHeusden, Nate Williams
ACKNOWLEDGEMENTS
This thesis is the result of working with many collaborators at many institutions. It would not
have been possible without the support of my advisor, Matthew Hirn, to whom I owe the most thanks.
At MSU I’d like to thank Dean Lee, Morten Hjorth-Jensen, Johannes Pollanen, Huey-Wen Lin,
Alexei Bazavov, Justin Lane, Ben Hall, Jacob Watkins, Joe Kitzman, Niyaz Beysengulov, and
Camille Mikolas for scientific discussions and help with organizing various quantum computing
conferences, courses, seminars, and events. At LANL I’d like to thank Patrick Coles, Lukasz
Cincio, Andrew Sornborger, and Yigit Subasi for collaborations, and everyone in the inaugural
LANL quantum computing summer school for a fun summer of research and friendship. At IBM
I’d like to thank Jennifer Glick, Antonio Mezzacapo, Travis Scholten, and Sam Slezak. At NASA
I’d like to thank Eleanor Rieffel, Davide Venturelli, Zhihui Wang, Hong-Ye Hu, and the entire
NASA QuAIL team for collaboration throughout my PhD. Although my work at Alphabet X is
outside the scope of this thesis, I’d like to thank Guifre Vidal for the scientific training and many
fun discussions. At Unitary Fund I’d like to thank Will Zeng for helping me multiple times at
various stages throughout my PhD as well as the entire Unitary Fund technical team, especially
Andrea Mari for many discussions. At Google Quantum AI I’d like to thank Alan Ho as well as all
of the people I got the chance to work with including Matthew Harrigan, Wojtek Mruczkiewicz,
Nick Rubin, Zhang Jiang, Jarrod McClean, Tanuj Khattar, Orion Martin, Doug Strain, Pedram
Roushan, and Xiao Mi.
TABLE OF CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxiv
CHAPTER 1 PRELIMINARIES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Quantum algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Open quantum systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Quantum error correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 The Gottesman-Knill theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.6 Outline of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
CHAPTER 2 VARIATIONAL QUANTUM STATE DIAGONALIZATION . . . . . . . . 14
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 The VQSD Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1.1 Overall structure . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1.2 Parameter optimization loop . . . . . . . . . . . . . . . . . . . . 19
2.2.1.3 Eigenvalue readout . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.1.4 Eigenvector preparation . . . . . . . . . . . . . . . . . . . . . . 23
2.2.2 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.2.1 One-qubit state . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.2.2 Heisenberg model ground state . . . . . . . . . . . . . . . . . . 25
2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.1 Comparison to literature . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.2 Future applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.1 Diagonalization test circuits . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.1.1 𝐶1 and the DIP Test . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.1.2 𝐶2 and the PDIP test . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4.1.3 𝐶1 versus 𝐶2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.2 Optimization methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5 Code availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6 Details on VQSD implementations . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6.1 Optimization parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6.2 Additional statistics for the quantum computer implementation . . . . . . . 38
2.7 Alternative ansatz and the Heisenberg model ground state . . . . . . . . . . . . . . 39
2.8 Optimization and local minima . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.9 Optimization runs with various 𝑞 values . . . . . . . . . . . . . . . . . . . . . . . 43
2.10 Comparison of optimization methods . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.11 Complexity for particular examples . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.11.1 General complexity remarks . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.11.2 Example states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.12 Implementation of qPCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.12.1 Overview of qPCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.12.2 Our implementation of qPCA . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.13 Circuit derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.13.1 DIP test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.13.2 PDIP test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.14 Proof of local dephasing channel bound . . . . . . . . . . . . . . . . . . . . . . . 56
CHAPTER 3 QUANTUM-ASSISTED QUANTUM COMPILING . . . . . . . . . . . . . 57
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2 Applications of QAQC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3 The QAQC algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.1 Approximate compiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.2 Discrete and continuous parameters . . . . . . . . . . . . . . . . . . . . . 62
3.3.3 Small problem sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3.4 Large problem sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.3.5 Special case of a fixed input state . . . . . . . . . . . . . . . . . . . . . . . 69
3.4 Cost evaluation circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.4.1 Hilbert-Schmidt Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.4.2 Local Hilbert-Schmidt Test . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.5 Computational complexity of cost evaluation . . . . . . . . . . . . . . . . . . . . . 75
3.5.1 One-clean-qubit model of computation . . . . . . . . . . . . . . . . . . . . 75
3.5.2 Approximating 𝐶HST is DQC1-hard . . . . . . . . . . . . . . . . . . . . . 76
3.5.3 Approximating 𝐶LHST is DQC1-hard . . . . . . . . . . . . . . . . . . . . . 77
3.6 Small-scale implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.6.1 Quantum hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.6.1.1 IBM’s quantum computers . . . . . . . . . . . . . . . . . . . . . 77
3.6.1.2 Rigetti’s quantum computer . . . . . . . . . . . . . . . . . . . . 79
3.6.2 Quantum simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.7 Larger-scale implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.7.1 Noiseless implementations . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.7.2 Noisy implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.8.1 Barren plateaus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.8.2 Effect of hardware noise . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.10 Remark on implementation of 𝑉 ∗ . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.11 Faithfulness of LHST cost function . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.12 Relation between 𝐶LHST and 𝐶HST . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.13 Proofs of complexity theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.14 Gradient-free optimization method . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.14.1 Alternative method for gradient-free optimization . . . . . . . . . . . . . . 103
3.15 Gradient-based optimization method . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.15.1 The Power of Two Qubits . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.15.2 Gradient-based optimization via the POTQ . . . . . . . . . . . . . . . . . 109
3.15.2.1 Implementation on a quantum simulator . . . . . . . . . . . . . . 111
3.15.3 Gradient-based optimization via the HST and LHST . . . . . . . . . . . . 111
CHAPTER 4 ADVANCES IN ZERO-NOISE EXTRAPOLATION . . . . . . . . . . . . . 116
4.1 Digital and adaptive ZNE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.1.2 Noise scaling methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.1.2.1 Unitary folding . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.1.2.2 Circuit folding . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.1.2.3 Gate (or layer) folding . . . . . . . . . . . . . . . . . . . . . . . 120
4.1.2.4 Advantages and limitations of unitary folding . . . . . . . . . . . 121
4.1.2.5 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.1.3 Non-adaptive extrapolation methods: Zero noise extrapolation as statistical inference . . . . . . . . . . . . . . . . 123
4.1.3.1 Polynomial extrapolation . . . . . . . . . . . . . . . . . . . . . . 127
4.1.3.2 Linear extrapolation . . . . . . . . . . . . . . . . . . . . . . . . 127
4.1.3.3 Richardson extrapolation . . . . . . . . . . . . . . . . . . . . . . 128
4.1.3.4 Poly-Exponential extrapolation . . . . . . . . . . . . . . . . . . 129
4.1.3.5 Exponential extrapolation . . . . . . . . . . . . . . . . . . . . . 130
4.1.3.6 Benchmark comparisons of ZNE methods . . . . . . . . . . . . . 131
4.1.4 Adaptive zero noise extrapolation . . . . . . . . . . . . . . . . . . . . . . 132
4.1.4.1 Exponential extrapolation with two scale factors . . . . . . . . . 133
4.1.4.2 An adaptive exponential extrapolation algorithm . . . . . . . . . 136
4.1.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
4.2 Reducing the impact of time-correlated noise on ZNE . . . . . . . . . . . . . . . . 137
4.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
4.2.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.2.2.1 Time-correlated noise: The SchWARMA model . . . . . . . . . 138
4.2.2.2 Zero-noise extrapolation with colored noise . . . . . . . . . . . . 140
4.2.2.3 Noise scaling methods . . . . . . . . . . . . . . . . . . . . . . . 141
4.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.2.3.1 Zero-noise extrapolation with colored noise . . . . . . . . . . . . 145
4.2.3.2 Comparing noise scaling methods . . . . . . . . . . . . . . . . . 146
4.2.4 Discussion and physical interpretation . . . . . . . . . . . . . . . . . . . . 148
4.2.4.1 Frequency response of a circuit . . . . . . . . . . . . . . . . . . 148
4.2.4.2 Spectral analysis of noise scaling methods . . . . . . . . . . . . . 150
4.2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
4.2.6 Consistency between different theories of pulse-stretching . . . . . . . . . 154
4.3 Increasing the effective quantum volume of quantum computers . . . . . . . . . . . 155
4.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
4.3.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
4.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
4.3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
4.3.6 Device specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
4.3.7 Table of quantum volumes . . . . . . . . . . . . . . . . . . . . . . . . . . 161
4.3.8 Statistical uncertainty of error-mitigated volume . . . . . . . . . . . . . . 162
4.3.8.1 Theoretical estimation of error bars . . . . . . . . . . . . . . . . 162
4.3.8.2 Bootstrapping empirical error bars . . . . . . . . . . . . . . . . . 163
CHAPTER 5 LOGICAL SHADOW TOMOGRAPHY . . . . . . . . . . . . . . . . . . . 165
5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
5.1.1 Subspace expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
5.1.2 Virtual distillation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
5.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
5.3 Logical shadow tomography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
5.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
5.3.2 Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
5.3.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
5.3.3.1 Error mitigation capability . . . . . . . . . . . . . . . . . . . . . 173
5.3.3.2 Quantum resources . . . . . . . . . . . . . . . . . . . . . . . . . 175
5.3.3.3 Classical resources . . . . . . . . . . . . . . . . . . . . . . . . . 177
5.4 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
5.4.1 Pseudo-threshold with the [[5, 1, 3]] code . . . . . . . . . . . . . . . . . . 178
5.4.2 Convergence vs. code size . . . . . . . . . . . . . . . . . . . . . . . . . . 180
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
5.6 Stabilizer algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
5.6.1 Evaluating the trace in Eq. (5.21) . . . . . . . . . . . . . . . . . . . . . . . 184
5.6.2 Efficient projection of a stabilizer state . . . . . . . . . . . . . . . . . . . . 185
5.7 Error mitigation capability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
5.8 Mean and variance of a ratio of two random variables . . . . . . . . . . . . . . . . 188
5.9 Proof of sample complexities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
LIST OF TABLES
Table 1.1: Notation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Table 1.2: Some common single- and two-qubit gates. Here, 𝑎, 𝑏, 𝑧 ∈ {0, 1}. . . . . . . . . 2
Table 1.3: How each term in the quantity 𝑉 |𝜓⟩ is updated after application of 𝑈 in the
Schrödinger vs. Heisenberg picture. The answer is always 𝑈𝑉 |𝜓⟩ (the product
of each column). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Table 2.1: Minimum cost and eigenvalues achieved after performing the parameter optimization
loop for seven independent runs of VQSD for the example discussed in Sec. 2.2.2. The final
two rows show average values and standard deviation across all runs. . . . . . . . . . . . . 38
Table 2.2: Relative average run-times (r.r.) and absolute number of function evaluations
(f.ev.) of each optimization algorithm (Alg.) used for the data obtained
in Fig. 2.12. For example, BOBYQA took 2.32 times as long to run on
average as COBYLA, which took the least time to run out of all algorithms.
Absolute run-times depend on a variety of factors and computer performance.
For reference, the COBYLA algorithm takes approximately one minute for this
problem on a laptop computer. The number of cost function evaluations used
(related to run-time but also dependent on the method used by the optimizer)
is shown in the second row. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Table 2.3: Estimated eigenvalues for the 𝜌 = |+⟩⟨+| state using qPCA on both the noiseless
and the noisy QVMs of Rigetti. . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Table 4.1: Different methods for implementing gate (or layer) folding . . . . . . . . . . . . 120
Table 4.2: Average of 20 different two-qubit randomized benchmarking circuits with
mean depth 27. The percent mean absolute error from the exact value of 1 is
reported for depolarizing noise with 𝑝 = 1% and an amplitude damping channel
with 𝛾 = 0.01. For all non-adaptive methods we used 𝜆 = {1, 1.5, 2, 2.5}.
Adaptive extrapolation was iterated up to 4 scale factors. All the results reported
in this table are obtained with exact density matrix simulations. The
best result for each noise model is highlighted with a bold font, while errors
larger than the unmitigated one are italicized. . . . . . . . . . . . . . . . . . . . 131
Table 4.3: Device specifications and error rates for the quantum computers we used in our
experiments. Device connectivities are shown in Fig. 4.14. Parameters $\epsilon_{1Q}$,
$\epsilon_{CX}$, $\epsilon_{M}$ denote, respectively, averages (over all qubits) of single-qubit $\sqrt{X}$ gate
errors, two-qubit CNOT gate errors, and readout errors $(p(0|1) + p(1|0))/2$
accessed from [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Table 4.4: Measured quantum volumes (in increasing order). Values in parentheses show
effective quantum volumes measured in this work. . . . . . . . . . . . . . . . . 161
LIST OF FIGURES
Figure 1.1: An example quantum circuit. . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Figure 1.2: Quantum circuit for the Deutsch-Jozsa algorithm. . . . . . . . . . . . . . . . . 3
Figure 2.1: Schematic diagram showing the steps of the VQSD algorithm. (a) Two copies
of quantum state $\rho$ are provided as an input. These states are sent to the parameter
optimization loop (b) where a hybrid quantum-classical variational algorithm
approximates the diagonalizing unitary $U_p(\vec{\alpha}_{\mathrm{opt}})$. Here, $p$ is a hyperparameter
that dictates the quality of solution found. This optimal unitary is sent to the
eigenvalue readout circuit (c) to obtain bitstrings $\vec{z}$, the frequencies of which
provide estimates of the eigenvalues of $\rho$. Along with the optimal unitary
$U_p(\vec{\alpha}_{\mathrm{opt}})$, these bitstrings are sent to the eigenvector preparation
circuit (c) to prepare the eigenstates of $\rho$ on a quantum computer. Both
the eigenvalues and eigenvectors are the outputs (d) of the VQSD algorithm. . . 16
Figure 2.2: (a) Layered ansatz for the diagonalizing unitary $U_p(\vec{\alpha})$. Each layer $L_i$,
$i = 1, \ldots, p$, consists of a set of optimization parameters $\vec{\alpha}_i$. (b) The two-qubit
gate ansatz for the $i$th layer, shown on four qubits. Here we impose periodic
boundary conditions on the top/bottom edge of the circuit so that $G_3$ wraps
around from top to bottom. Section 2.7 discusses an alternative approach
to the construction of $U_p(\vec{\alpha})$, in which the ansatz is modified during the
optimization process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Figure 2.3: The VQSD algorithm run on Rigetti’s 8Q-Agave quantum computer for 𝜌 =
|+⟩⟨+|. (a) A representative run of the parameter optimization loop, using
the Powell optimization algorithm (see Sec. 2.4.2 for details and Section 2.6
for data from additional runs). Cost versus iteration is shown by the black
solid line. The dotted lines show the two inferred eigenvalues. After four
iterations, the inferred eigenvalues approach {0, 1}, as required for a pure state.
(b) The cost landscape on a noiseless simulator, Rigetti’s noisy simulator, and
Rigetti’s quantum computer. Error bars show the standard deviation (due
to finite sampling) of multiple runs. The local minima occur roughly at the
theoretically predicted values of 𝜋/2 and 3𝜋/2. During data collection for this
plot, the 8Q-Agave quantum computer retuned, after which its cost landscape
closely matched that of the noisy simulator. . . . . . . . . . . . . . . . . . . . . 24
Figure 2.4: Implementing VQSD with a simulator for the ground state of the 1D Heisenberg
model, diagonalizing a 4-spin subsystem of a chain of 8 spins. We chose
$q = 1$ for the cost in (2.10) and employed a gradient-based method to find
$\vec{\alpha}_{\mathrm{opt}}$. (a) Largest inferred eigenvalues $\tilde{\lambda}_j$ versus $1/p$, where $p$ is the number
of layers in our ansatz, which in this example takes half-integer values
corresponding to fractions of layers shown in Fig. 2.2. The exact eigenvalues
are shown on the $y$-axis (along the $1/p = 0$ line) with their degeneracy indicated
in parentheses. One can see the largest eigenvalues converge to their correct
values, including the correct degeneracies. Inset: overall eigenvalue error
$\Delta\lambda$ versus $1/p$. (b) Largest inferred eigenvalues resolved by the inferred $\langle S_z \rangle$
quantum number of their associated eigenvector, for $p = 5$. The inferred data
points (red X’s) roughly agree with the theoretical values (black circles),
particularly for the largest eigenvalues. Section 2.7 discusses a Heisenberg chain
of 12 spins. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Figure 2.5: Diagonalization test circuits used in VQSD. (a) The Destructive Swap Test
computes $\mathrm{Tr}(\sigma\tau)$ via a depth-two circuit. (b) The Diagonalized Inner Product
(DIP) Test computes $\mathrm{Tr}(\mathcal{Z}(\sigma)\mathcal{Z}(\tau))$ via a depth-one circuit. (c) The Partially
Diagonalized Inner Product (PDIP) Test computes $\mathrm{Tr}(\mathcal{Z}_{\vec{j}}(\sigma)\mathcal{Z}_{\vec{j}}(\tau))$ via a
depth-two circuit, for a particular set of qubits $\vec{j}$. While the DIP test requires
no postprocessing, the postprocessing for the Destructive Swap Test and the
Partial DIP Test scales linearly in $n$. . . . . . . . . . . . . . . . . . . . . . . . . 30
Figure 2.6: Circuit used to implement VQSD for 𝜌 = |+⟩⟨+| on Rigetti’s 8Q-Agave
quantum computer. Vertical dashed lines separate the circuit into logical
components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Figure 2.7: Cost vs iteration for all attempts of VQSD on Rigetti’s 8Q-Agave computer for
diagonalizing the plus state 𝜌 = |+⟩⟨+|. Each of the seven curves represents a
different independent run. Each run starts at a random initial angle and uses
the Powell optimization algorithm to minimize the cost. . . . . . . . . . . . . . 38
Figure 2.8: Comparison of two approaches to obtaining the diagonalizing unitary $U(\vec{\alpha})$:
(i) based on a fixed layered ansatz shown in Fig. 2.2 in the main text (black
line) and (ii) based on random updates to the structure of $U(\vec{\alpha})$ (red line).
The plot shows eigenvalue error $\Delta\lambda$ versus $1/D$, where $D$ is the number of
gates in $U(\vec{\alpha})$. For the same $D$, the second approach found a more optimal
gate sequence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Figure 2.9: VQSD applied to the ground state of the Heisenberg model. Here we consider
a 6-qubit reduced state $\rho$ of the 12-qubit ground state. (a) Largest inferred
eigenvalues $\tilde{\lambda}_j$ of $\rho$ as a function of $1/D$, where $D$ is the total number of gates
in the diagonalizing unitary $U(\vec{\alpha})$. The inferred eigenvalues converge to their
exact values shown along the $1/D = 0$ line recovering the correct degeneracy.
Inset: Eigenvalue error $\Delta\lambda$ as a function of $1/D$. (b) The largest inferred
eigenvalues $\tilde{\lambda}_j$ of $\rho$ resolved in the $\langle S_z \rangle$ quantum number. We find very good
agreement between the inferred eigenvalues (red crosses) and the exact ones
(black circles), especially for large eigenvalues. The data was obtained for
$D = 150$ gates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Figure 2.10: Cost function $C$ versus $1/D$ for three independent optimization runs. Here,
$D$ is the total number of gates in the diagonalizing unitary $U_D(\vec{\alpha})$. Every
optimization run got stuck at a local minimum at some point during the
minimization, but thanks to the growth of the ansatz for $U_D(\vec{\alpha})$ described in
the text, the predefined small value of $C$ was eventually attained. The data
was obtained for a 6-qubit reduced state of the 12-qubit ground state of the
Heisenberg model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Figure 2.11: Cost versus iteration for different values of $q$, when $\rho$ is a tensor product
of pure states on $n$ qubits. Here we consider (a) $n = 6$, (b) $n = 8$, and (c)
$n = 10$. We employed the COBYLA optimization method for training (see
Section 2.10 for discussion of this method). For each call to the quantum
simulator (i.e., classical simulator of a quantum computer), we took 500
shots for statistics. The green, red, and blue curves respectively correspond
to directly training the cost with $q = 1$, $q = 0.5$, and $q = 0$. The purple and
yellow curves respectively correspond to evaluating the $q = 1$ cost for the
angles $\vec{\alpha}$ obtained by training the $q = 0.5$ and $q = 0$ costs. . . . . . . . . . . . 43
Figure 2.12: Optimization tests on six-qubit product states in the VQSD algorithm. Each
plot shows a different optimization algorithm (described in main text) and
curves on each plot show optimization attempts with different (random) initial
conditions. Cost refers to the 𝐶1 cost function (𝑞 = 1 in (2.10)), and each
iteration is defined by a decrease in the cost function. As can be seen, the
Powell algorithm is the most robust to initial conditions and provides the
largest number of solved problem instances. . . . . . . . . . . . . . . . . . . . 45
Figure 2.13: Circuit for our qPCA implementation. Here, the eigenvalues of a one-qubit
pure state 𝜌 are estimated to a single digit of precision. We use 𝑘 copies of 𝜌
to approximate 𝐶𝑉 (𝑡) by applying the controlled-exponential-swap operator 𝑘
times for a time period Δ𝑡 = 𝑡/𝑘. The bottom panel shows our compilation
of the controlled-exponential-swap gate into one- and two-qubit gates. . . . . . 50
Figure 2.14: The largest inferred eigenvalue for the one-qubit pure state $\rho = |+\rangle\langle+|$ versus
application time of unitary $e^{-i\rho t}$, for our implementation of qPCA on Rigetti’s
noisy and noiseless QVMs. Curves are shown for $k = 1$ and $k = 2$, where $k$
indicates the number of controlled-exponential-swap operators applied. . . . . . 52
Figure 2.15: Test circuits used to compute the cost function in VQSD. (a) DIP test (b)
PDIP test. (These circuits appear in Fig. 2.5 and are also shown here for the
reader’s convenience.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Figure 3.1: Potential applications of QAQC. Here, one gate symbol denotes the 𝑧-rotation gate 𝑅𝑧(𝜃),
while another represents the 𝜋/2-pulse given by the 𝑥-rotation gate 𝑅𝑥(𝜋/2).
Both gates are natively implemented on commercial hardware [2, 3]. (a)
Compressing the depth of a given gate sequence 𝑈 to a shorter-depth gate
sequence 𝑉 in terms of native hardware gates. (b) Uploading a black-box
unitary. The black box could be an analog unitary $U = e^{-i\mathcal{H}t}$, for an unknown
Hamiltonian $\mathcal{H}$, that one wishes to convert into a gate sequence to be run
on a gate-based quantum computer. (c) Training algorithms in the presence
of noise to learn noise-resilient algorithms (e.g., via gates that counteract the
noise). Here, the unitary 𝑈 is performed on high-quality, pristine qubits and
𝑉 is performed on noisy ones. (d) Benchmarking a quantum computer by
compiling a unitary 𝑈 on noisy qubits and learning the gate sequence 𝑉 on
high-quality qubits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Figure 3.2: Outline of our variational hybrid quantum-classical algorithm, in which we
optimize over gate structures and continuous gate parameters in order to perform
QAQC for a given input unitary 𝑈. We take two approaches towards
structure optimization: (a) For small problem sizes, we allow the gate structure
to vary for a given gate sequence length 𝐿, which in general leads to
an approximate compilation of 𝑈. To obtain a better approximate compilation,
the best structure obtained can be concatenated with a new sequence
of a possibly different length, whose structure can vary. For each iteration
of the continuous parameter optimization, we calculate the cost using the
Hilbert-Schmidt Test (HST); see Sec. 3.4.1. (b) For large problem sizes, we
fix the gate structure using an ansatz consisting of layers of two-qubit gates.
By increasing the number of layers, we can obtain better approximate compilations
of 𝑈. For each iteration of the continuous parameter optimization,
we calculate the cost using the Local Hilbert-Schmidt Test (LHST); see
Sec. 3.4.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Figure 3.3: (a) One layer of the ansatz for the trainable unitary 𝑉 in the case of four qubits.
The gate sequence in the layer consists of a two-qubit gate acting on the first
and second qubits, the third and fourth qubits, the second and third qubits, and
the first and fourth qubits. (b) The full ansatz defining the trainable unitary
𝑉 consists of a particular number ℓ of the layer in (a). Shown is two layers in
the case of four qubits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Figure 3.4: (a) The Hilbert-Schmidt Test. For this circuit, the probability to obtain the
measurement outcome in which all 2𝑛 qubits are in the |0⟩ state is equal to
(1/𝑑 2 )|Tr(𝑉 †𝑈)| 2 . Hence, this circuit computes the magnitude of the Hilbert-
Schmidt inner product, |⟨𝑉, 𝑈⟩|, between 𝑈 and 𝑉. (b) The Local Hilbert-
Schmidt Test, which is the same as the Hilbert-Schmidt Test except that only
two of the 2𝑛 qubits are measured at the end. Shown is the measurement of
the qubits 𝐴1 and 𝐵1 , and the probability that both qubits are in the state |0⟩
is given by (3.25) with 𝑗 = 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Figure 3.5: Compiling the one-qubit gates 𝐼, 𝑋, 𝐻, and 𝑇 using the gradient-free opti-
mization technique described in Section 3.14. The plots show the cost 𝐶HST
as a function of the number of iterations, where an iteration is defined by
an accepted update to the gate structure; see Sec. 3.3.3 for a description of
the procedure. The insets display the minimum cost achieved by optimizing
over gate sequences with a fixed depth, where the depth is defined relative
to the native gate alphabet of the quantum computer used. (a) Compiling
on the IBMQX4 quantum computer, in which we took 8,000 samples to
evaluate the cost for each run of the Hilbert-Schmidt Test. (b) Compiling on
the IBMQX5 quantum computer, in which we again took 8,000 samples to
evaluate the cost for each run of the Hilbert-Schmidt Test. (c) Compiling on
Rigetti’s 8Q-Agave quantum computer. In the plot, each iteration uses 50 cost
function evaluations to perform the continuous optimization. For each run of
the Hilbert-Schmidt Test to evaluate the cost, we took 10,000 samples (calls
to the quantum computer). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Figure 3.6: Compiling one- and two-qubit gates on Rigetti’s quantum virtual machine
with the gate alphabet in (3.35) using the gradient-free optimization technique
described in Algorithm 1 in Section 3.14. (a) The minimum cost achieved by
optimizing over gate sequences with a fixed depth. (b) The cost as a function
of the number of iterations of the full gate structure and continuous param-
eter optimization; see Sec. 3.3.3 for a description of the procedure. Note
that each iteration uses 50 cost function evaluations, and each cost function
evaluation uses 10,000 samples (calls to the quantum computer). (c) Shortest-
depth decompositions of the two-qubit controlled-𝑍, controlled-Hadamard,
and quantum Fourier transform gates as determined by the compilation pro-
cedure. The equalities indicated are true up to a global phase factor. Here,
denotes the rotation gate 𝑅𝑧 (𝜃), while represents the rotation gate
𝑅𝑥 (𝜋/2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Figure 3.7: Results of performing continuous parameter optimization using the HST and
the LHST for the scenario described in Example 1. We make use of the
gradient-based optimization algorithm given by Algorithm 4 in Section 3.15.
The curves “HST via LHST” are given by evaluating 𝐶HST using the angles
obtained during the optimization iterations of 𝐶LHST . For each run of the
HST and LHST, we use 1000 samples to estimate the cost function. . . . . . . . 85
Figure 3.8: Results of performing continuous parameter optimization using the HST and
the LHST for the scenario described in Example 2. We make use of the
gradient-based optimization algorithm given by Algorithm 4 in Section 3.15,
in which each iteration can involve several calls to the quantum computer.
The curves “HST via LHST” are given by evaluating 𝐶HST using the angles
obtained during the optimization iterations of 𝐶LHST . For each run of the
HST and LHST, we use 1000 samples to estimate the cost function. . . . . . . . 86
Figure 3.9: Results of performing continuous parameter optimization using the HST and
the LHST, in the presence of noise, for the scenario described in Example 1.
The noise model used matches that of the IBMQX5 quantum computer. We
make use of the gradient-based optimization algorithm given by Algorithm 4
in Section 3.15. The curves “Noiseless HST via LHST” are given by evalu-
ating 𝐶HST (without noise) using the angles obtained during the optimization
iterations of 𝐶LHST . For each run of the HST and LHST, we use 1000 samples
to estimate the cost function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Figure 3.10: Results of performing continuous parameter optimization using the HST and
the LHST, in the presence of noise, for the scenario described in Example 2.
The noise model used matches that of the IBMQX5 quantum computer. We
make use of the gradient-based optimization algorithm given by Algorithm 4
in Section 3.15. The curves “Noiseless HST via LHST” are given by evalu-
ating 𝐶HST (without noise) using the angles obtained during the optimization
iterations of 𝐶LHST . For each run of the HST and LHST, we use 1000 samples
to estimate the cost function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Figure 3.11: The trace of the unitary 𝑈 ′ defined by the circuit above is equal to the trace
of the non-unitary operator (| 0⟩⟨0 | ⊗ 𝐼)𝑄(| 0⟩⟨0 | ⊗ 𝐼)𝑄 † up to a factor of 4 [4]. 99
Figure 3.12: (a) Any single-qubit gate 𝑈 can be decomposed into three elementary rotations
(up to a global phase). Given appropriate parameters $\vec{\alpha} = (\alpha_{z_1}, \alpha_y, \alpha_{z_2})$, 𝑈
can be written as $V(\vec{\alpha}) = e^{-i\alpha_{z_2}\sigma_z/2}\, e^{-i\alpha_y\sigma_y/2}\, e^{-i\alpha_{z_1}\sigma_z/2}$. (b) Any two-qubit
gate 𝑈 𝐴𝐵 can be decomposed into three CNOT gates as well as 15 elementary
single-qubit gates, where each unitary $U_j(\vec{\alpha}^{(j)})$ can be written as in (a). This
decomposition is known to be optimal [5], i.e., it uses the least number of
continuous parameters and CNOT gates. General universal quantum circuits
for 𝑛-qubit gates are discussed in [6]. . . . . . . . . . . . . . . . . . . . . . . . 106
Figure 3.13: (a) The Power of One Qubit (POOQ) [7]. This can be used to compute the
trace of a unitary 𝑈 acting on a 𝑑-dimensional space. The 𝑅 gate represents
either 𝐻, in which case the circuit computes Re[Tr(𝑈)], or the 𝑆 gate followed
by 𝐻, in which case the circuit computes Im[Tr(𝑈)]. (b) The Power of Two
Qubits (POTQ). This is a generalization of the POOQ, as can be seen by
setting 𝑉 = 𝐼. The POTQ can be used to compute the Hilbert-Schmidt inner
product Tr(𝑉 †𝑈) between two unitaries 𝑈 and 𝑉 acting on a 𝑑-dimensional
space. As with the POOQ, 𝑅 = 𝐻 leads to Re[Tr(𝑉 †𝑈)], while 𝑅 = 𝐻𝑆 leads
to Im[Tr(𝑉 †𝑈)]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Figure 3.14: Compiling one- and two-qubit gates on a simulator with the gate alphabet
in (3.35) using the gradient-based optimization technique described in Algorithm 3,
with $n_{\mathrm{shots}}$ = 10,000. Shown is the cost as a function of the
number of gradient calls of the continuous parameter optimization using the
minimize routine in the SciPy-optimize Python library. The gate structure
for the single-qubit gates is fixed to the one shown in Fig. 3.12(a), while the
gate structure for the two-qubit gates is fixed to the one shown in Fig. 3.12(b). . 111
Figure 4.1: An example of the change of an expectation value, 𝐸 (𝜆), with the underlying
scaling 𝜆 of the depolarizing noise level. Here the simulated base noise value
is 5% (marked by the green dashed vertical line). ZNE increases that noise
and back extrapolates to the 𝜆 = 0 expectation value. In this example, an
accurate extrapolation should be non-linear and take advantage of a known
asymptotic behavior. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Figure 4.2: Comparison of two-qubit randomized benchmarking with and without error
mitigation. Data is taken by density matrix simulation with a 1% depolarizing
noise model. The unmitigated simulation results in a randomized
benchmarking decay of 97.9%. Mitigation is applied using circuit folding
and an order-2 polynomial extrapolation at 𝜆 = 1, 1.5, 2.0. With mitigation
the randomized benchmarking decay improves to 99.0%. Since we do not
impose any constraint on the domain of the extrapolated results, some of the
mitigated expectation values are slightly beyond the physical upper limit of
1. This is an expected effect of the noise introduced by the extrapolation
fit. If necessary, one could enforce the result to be physical by using a more
advanced Bayesian estimator. . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Figure 4.3: A comparison of improvements from ZNE (using quadratic extrapolation with
folding from left) averaged across all output bitstrings from 250 random six-
qubit circuits. Results are from exact density matrix simulations with a base
of 1% depolarizing noise. The horizontal axis shows a ratio of 𝐿 2 distances
from the noiseless probability distribution and the vertical axis shows the
frequency of obtaining this result. ZNE improves on the noisy result by
factors of 1-7X. The average mitigated error is 0.075 ± 0.035, while the
unmitigated errors average 0.114 ± 0.050. Each circuit has 40 moments with
single-qubit gates sampled randomly from {𝐻, 𝑋, 𝑌 , 𝑍, 𝑆, 𝑇 } and two-qubit
gates sampled randomly from {iSWAP, CZ} with arbitrary connectivity. . . . . 125
Figure 4.4: Percent closer to optimal on random MAXCUT executions. 14 Erdős-Rényi
random graphs were generated at each number 𝑛. Each random graph has
𝑛 nodes and 𝑛 edges. QAOA was then run (with 𝑝 = 2 QAOA steps) and
optimized using Nelder-Mead with and without error mitigation. Results are
from exact density matrix simulations with a base of 2% depolarizing noise.
For the mitigated case, we used zero noise extrapolation with global unitary
folding for scaling and linear extrapolation at noise scalings of 1, 1.5 and 2.
The y axis shows the percent closer to the optimal solution that was gained
by ZNE. Here 𝐸 𝑢 is the absolute error in the unmitigated expectation and
𝐸 𝑚 is the absolute error in the mitigated expectation. The violin plot shows
the distribution of percentage improvements over the 14 sampled instances.
Variance is zero for 2- and 3-node graphs as there is only a single valid graph
with 𝑛 nodes and edges for 𝑛 = 2, 3. . . . . . . . . . . . . . . . . . . . . . . . . 126
Figure 4.5: Comparison of extrapolation methods averaged over 50 two-qubit randomized
benchmarking circuits executed on IBMQ’s “London” five-qubit chip. The
circuits had, on average, 97 single-qubit gates and 17 two-qubit gates. The
true zero-noise value is ⟨0|𝜌|0⟩ = 1 and different markers show extrapolated
values from different fitting techniques. . . . . . . . . . . . . . . . . . . . . . 132
Figure 4.6: Comparison of adaptive and non-adaptive exponential zero noise extrapola-
tion, given a fixed budget of samples. The adaptive method generally produces
a more accurate extrapolation with fewer samples. On the other hand, in this
example, the advantage of adaptivity is not particularly large. Likely, this is
due to the fact that the scale factors used for the non-adaptive method are al-
ready quite good and not far from their optimal values. Data was generated by
exact density matrix simulation of 5-qubit randomized benchmarking circuits
of depth 10 under 5% depolarizing noise and measured in the computational
basis. Noise was scaled directly by access to the back-end simulator rather
than with a folding method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Figure 4.7: Noise power spectrum of four different dephasing SchWARMA noise models
corresponding to white noise, low-pass noise, 1/ 𝑓 noise and 1/ 𝑓 2 noise.
These noise models are used in Sec. 4.2.3 to test the effect of time-correlated
noise on zero-noise extrapolation. . . . . . . . . . . . . . . . . . . . . . . . . . 140
Figure 4.8: A sample three-qubit circuit with four gates under the action of three digital
noise scaling methods we consider in this work. (a) Local folding, in which
each gate 𝐺 gets mapped to $G \mapsto G (G^\dagger G)^n$ for scale factor $\lambda = 2n + 1$. (b)
Global folding, in which the entire circuit 𝐶 gets mapped to $C \mapsto C (C^\dagger C)^n$.
In (a) and (b), grey shading shows the “virtual gates” which logically compile
to identity. (c) Gate Trotterization, in which $G \mapsto (G^{1/\lambda})^{\lambda}$ for each gate 𝐺. . . . 141
Figure 4.9: Comparison of different zero-noise extrapolations obtained with different
noise scaling methods. We consider a single-qubit randomized benchmark-
ing circuit affected by dephasing noise of fixed integrated power. The two
subfigures correspond to different noise spectra: (top) white noise, (bottom)
1/𝑓 pink noise. Both spectra are shown in Fig. 4.7. The expectation value
$E(\lambda) = \mathrm{tr}(O \rho(\lambda))$ is associated with the observable 𝑂 = |0⟩⟨0| measured with
respect to the noise-scaled quantum state 𝜌(𝜆). The colored squares represent
the noise-scaled expectation values; the dotted lines represent the associated
exponential fitting curves; the colored stars represent the corresponding zero-
noise extrapolations. The figure shows that the zero-noise limit obtained with
global unitary folding (green star) is relatively close to the ideal result (gray
star) even in the presence of strong time correlations in the noise. . . . . . . . 145
Figure 4.10: Average relative errors in noise scaling two-qubit randomized benchmarking
circuits with (a) white noise, (b) lowpass noise, (c) 1/ 𝑓 noise, and (d) 1/ 𝑓 2
noise. Panel (a) shows no significant difference in scaling methods under
white noise (no time correlations). (Inset shows zoomed vertical scale.)
Panels (b)-(d) show that global scaling is the lowest-error digital scaling
method. The two-qubit randomized benchmarking circuits used here have,
on average, 27 single-qubit gates and five two-qubit gates. For each circuit
execution, 3000 samples were taken to estimate the probability of the ground
state as the observable. Points show the average results over fifty such circuits
and error bars show one standard deviation. . . . . . . . . . . . . . . . . . . . 146
Figure 4.11: Relative errors in noise scaling two-qubit mirror circuits with (a) white noise,
(b) lowpass noise, (c) 1/ 𝑓 noise, and (d) 1/ 𝑓 2 noise. Panel (a) shows no
significant difference in scaling methods under white noise (no time corre-
lations). (Inset shows zoomed vertical scale.) Panels (b), (c) and (d) show
global scaling is optimal with time-correlated noise. The two-qubit mirror
benchmarking circuits used here have, on average, 26 single-qubit gates and
eight two-qubit gates. For each circuit execution, 3000 samples were taken
to estimate the probability of the correct bitstring (defined by the particular
mirror circuit instance) as the observable. Points show the average results
over fifty such circuits and error bars show one standard deviation. . . . . . . . 149
Figure 4.12: Relative errors in noise scaling two-qubit 𝑝 = 2 QAOA circuits with (a)
white noise, (b) lowpass noise, (c) 1/ 𝑓 noise, and (d) 1/ 𝑓 2 noise. Panel
(a) shows no significant difference in scaling methods under white noise (no
time correlations). (Inset shows zoomed vertical scale.) Panels (b), (c) and
(d) show global scaling is optimal with time-correlated noise. The two-qubit
𝑝 = 2 QAOA circuits used here have eight single-qubit gates and four two-
qubit gates. For each circuit execution, 3000 samples were taken to estimate
the probability of the ground state as the observable. (Note that the QAOA
circuit 𝑈 is echoed such that the total circuit is 𝑈𝑈 † = 𝐼 without noise.)
Points show the average results over fifty such circuits and error bars show
one standard deviation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Figure 4.13: Largest magnitude filter function of a two-qubit randomized benchmarking
circuit of Clifford depth 2 (actual depth 24) for different scale factors 𝜆.
All filter functions are normalized by their maximum values (otherwise the
integral of the filter function scales by 𝜆). Different subplots correspond
to different noise scaling methods. All noise scaling methods change the
frequency response of the circuit, however, global folding tends to preserve
the qualitative shape of response function and, for this reason, it gives better
performances for zero-noise extrapolation with colored noise. . . . . . . . . . . 151
Figure 4.14: Results of unmitigated and mitigated quantum volume experiments on three
five-qubit quantum computers (left-to-right: Belem, Lima, and Quito) using
$n_c = 500$ circuits and $n_s = 10^4$ total samples. Each marker shows the estimated
heavy output probability $\hat{h}_d$ on a different qubit configuration defined
in the legend and error bars show 2𝜎 intervals evaluated by bootstrapping.
The connectivity of each device is shown below each legend. Dashed black
lines show the 2/3 threshold and noiseless asymptote (1 + ln 2)/2 [8]. For
the mitigated experiments, $\lambda_i \in \{1, 3, 5, 7, 9\}$ and $n_s = 10^4/5$. Local unitary
folding of two-qubit gates is used to compile the circuits (i.e., scale noise) and
Richardson’s method of extrapolation is used to infer the zero-noise result.
The qubit subsets which achieved the largest quantum volume in the miti-
gated experiments are colored blue in each device diagram. As can be seen,
on Belem error mitigation increases the effective quantum volume from three
to five, on Lima error mitigation increases the effective quantum volume from
three to four, and on Quito error mitigation increases the effective quantum
volume from four to five. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Figure 4.15: The value of 𝜎 for different resampling numbers in bootstrapping. . . . . . . . 163
Figure 5.1: Graphic illustration of logical shadow tomography. (a) Red dots are logical
qubits, and blue dots are physical qubits. Logical information is distributed
to physical qubits by an error correction code, followed by noisy quantum
computation on the physical qubits. To estimate error-mitigated observables,
we perform classical shadow tomography on the noisy physical state.
In particular, we can apply random Clifford gates (green blocks) drawn from
some unitary ensemble U and take computational basis measurements. (b) A
special case using an [[𝑛, 1]] code for each logical qubit. In shadow tomography,
we apply a random unitary from the tensor product of Clifford groups $\mathrm{C}\ell(2^n)^{\otimes k}$.
The additional gate depth does not scale with the number of logical qubits 𝑘, and
the sample complexity for estimating error-mitigated logical Pauli observables is
the same as when using the global Clifford group $\mathrm{C}\ell(2^{nk})$. . . . . . . . . . . . . 169
Figure 5.2: LST with the [[5, 1, 3]] code. Here, |𝜓⟩ is taken to be the logical |0̄⟩ and
we estimate the infidelity 1 − 𝐹 with samples. The dashed black line shows the
physical infidelity, i.e., the noisy expectation value of a single qubit without
any encoding. The green and blue dashed lines show the analytical performance of
logical shadow tomography with $f(\rho_{\mathcal{E}}) = \rho_{\mathcal{E}}$ and $f(\rho_{\mathcal{E}}) = \rho_{\mathcal{E}}^2$, respectively.
The red dots and red shaded area indicate the mean value and standard
deviation of error mitigation with $f(\rho_{\mathcal{E}}) = \rho_{\mathcal{E}}$ by direct implementation
of subspace expansion with 3000 measurements. The green line and green
shaded area indicate the mean value and standard deviation of LST with $f(\rho_{\mathcal{E}}) = \rho_{\mathcal{E}}$
and 3000 measurements. The performance of LST with $f(\rho_{\mathcal{E}}) = \rho_{\mathcal{E}}^2$
is indicated by the blue line and blue shaded area. . . . . . . . . . . . . . . . . 179
Figure 5.3: Scaling study of LST with $f(\rho_\epsilon) = \rho_\epsilon$. In all figures, each physical qubit is
subjected to 1% depolarizing noise. (a) LST estimated fidelity vs. number
of samples from $10^2$ to $10^5$ with various [[𝑛, 1]] code sizes. The noiseless
fidelity value of 1.0 is shown with the dashed black line. For all code sizes
up to 𝑛 = 60 physical qubits, the LST estimate converges to the true noiseless
value. Codes used are the minimum distance constructions from [9]. (b)
Standard deviation vs. number of physical qubits 𝑛. The standard deviation
of the estimate does not scale with the number of encoding physical qubits. (c)
Mean value and standard deviation scaling vs. number of logical qubits 𝑘.
Each logical qubit is encoded with the [[5, 1, 3]] code, and the state is prepared
as the logical GHZ state |0̄ . . . 0̄⟩ + |1̄ . . . 1̄⟩. We see that the standard deviation scales
exponentially with the number of logical qubits 𝑘, as predicted. . . . . . . . . . . . 181
Figure 5.4: Data structure of a stabilizer state. Each Pauli string is represented as a binary
vector. First 𝑁 rows store the stabilizers of the state, and second 𝑁 rows store
the destabilizers of the state. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Figure 5.5: Infidelity in the small error rate region. Theoretically, we have shown that the leading
order correction to the infidelity is $O(p^{md})$ with 𝑚 = 1, 2. Here, we use
the [[5, 1, 3]] code with LST as a demonstration. We prepare random logical
states and calculate the infidelity. We see that the numerical results give leading
order corrections $O(p^{3.07})$ and $O(p^{6.15})$, which are very close to the theoretical
predictions $O(p^3)$ and $O(p^6)$. . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
LIST OF ALGORITHMS
Algorithm 1: Gradient-free Continuous Optimization for QAQC via the HST . . . . . . . 101
Algorithm 2: Gradient-free Optimization using Bisection for QAQC . . . . . . . . . . . . 104
Algorithm 3: Gradient-based Continuous Optimization for QAQC via the POTQ . . . . . 110
Algorithm 4: Gradient-based Continuous Optimization for QAQC via the (L)HST . . . . 114
Algorithm 5: Generic non-adaptive extrapolation . . . . . . . . . . . . . . . . . . . . . . 124
Algorithm 6: Generic adaptive extrapolation . . . . . . . . . . . . . . . . . . . . . . . . 133
Algorithm 7: Adaptive exponential extrapolation . . . . . . . . . . . . . . . . . . . . . . 135
CHAPTER 1
PRELIMINARIES
1.1 Notation
|·⟩            A column vector labeled by ·
†              Conjugate transpose operator (as superscript)
⟨·|            The row vector |·⟩†
⊗              Tensor product
⊕              Addition modulo 2
$\{0, 1\}^n$   Length-𝑛 bitstrings. E.g., $\{0, 1\}^2$ consists of 00, 01, 10, and 11
$U(d^n)$       The unitary group of dimension $d^n$
Table 1.1: Notation.
This thesis follows the Feynman-Twain principle in that no attempt is made at mathematical
rigor and persons attempting to find mathematical rigor will be shot. We are most interested in the
Hilbert space
$$\underbrace{\mathbb{C}^d \otimes \cdots \otimes \mathbb{C}^d}_{n \text{ terms}} \cong \mathbb{C}^{d^n} \tag{1.1}$$
(where 𝑑 = 2 almost always in this thesis), i.e. the space of 𝑛 quantum bits (qubits). Common
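bases for a single qubit include the computational (standard) basis
$$|0\rangle := \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad |1\rangle := \begin{bmatrix} 0 \\ 1 \end{bmatrix} \tag{1.2}$$
and the Hadamard basis $|\pm\rangle := (|0\rangle \pm |1\rangle)/\sqrt{2}$. For 𝑛 qubits and a bitstring $z = z_1 \cdots z_n \in \{0, 1\}^n$
we write basis elements as $|z_1\rangle \otimes \cdots \otimes |z_n\rangle$ or simply $|z_1 \cdots z_n\rangle \equiv |z\rangle$ for short, so a general state
may be written
$$|\psi\rangle = \sum_{z \in \{0,1\}^n} \alpha_z |z\rangle \tag{1.3}$$
with $\alpha_z \in \mathbb{C}$ and $\sum_z |\alpha_z|^2 = 1$.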
𝐼          The identity gate/element. $I|z\rangle = |z\rangle$
𝑋 ≡ 𝜎𝑥    Pauli 𝑋. $X|z\rangle = |z \oplus 1\rangle$
𝑍 ≡ 𝜎𝑧    Pauli 𝑍. $Z|z\rangle = (-1)^z |z\rangle$
𝑌 ≡ 𝜎𝑦    Pauli 𝑌. $Y = iXZ$
𝐻          The Hadamard gate. $H|z\rangle = (|0\rangle + (-1)^z |1\rangle)/\sqrt{2}$
CNOT       $\mathrm{CNOT}|a\rangle|b\rangle = |a\rangle|a \oplus b\rangle$
CZ         $\mathrm{CZ}|a\rangle|b\rangle = (-1)^{ab} |a\rangle|b\rangle$
Table 1.2: Some common single- and two-qubit gates. Here, 𝑎, 𝑏, 𝑧 ∈ {0, 1}.
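The gate definitions in Table 1.2 can be checked numerically. The following is a minimal sketch using NumPy, assuming the usual matrix representations in the computational basis and taking the first qubit as the control of CNOT; the helper names `ket` and `check_table` are hypothetical and introduced only for illustration.

```python
import numpy as np

# Single-qubit gates in the computational basis.
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

# Two-qubit gates in the basis |00>, |01>, |10>, |11> (first qubit is the control).
CNOT = np.array(
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex
)
CZ = np.diag([1, 1, 1, -1]).astype(complex)


def ket(bits):
    """Return the computational basis state |z1...zn> for a bitstring such as '01'."""
    state = np.array([1.0 + 0j])
    for b in bits:
        state = np.kron(state, np.eye(2, dtype=complex)[int(b)])
    return state


def check_table():
    """Check every defining relation listed in Table 1.2."""
    ok = np.allclose(Y, 1j * X @ Z)
    for z in (0, 1):
        ok &= np.allclose(X @ ket(str(z)), ket(str(z ^ 1)))
        ok &= np.allclose(Z @ ket(str(z)), (-1) ** z * ket(str(z)))
        ok &= np.allclose(
            H @ ket(str(z)), (ket("0") + (-1) ** z * ket("1")) / np.sqrt(2)
        )
    for a in (0, 1):
        for b in (0, 1):
            ok &= np.allclose(CNOT @ ket(f"{a}{b}"), ket(f"{a}{a ^ b}"))
            ok &= np.allclose(CZ @ ket(f"{a}{b}"), (-1) ** (a * b) * ket(f"{a}{b}"))
    return bool(ok)
```

Calling `check_table()` evaluates each relation on all basis states, which suffices because the gates are linear.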
Quantum operations (gates) are elements of $U(2^n)$. Some common single-qubit and two-qubit
gates are defined in Table 1.2. A quantum circuit is a series of operations acting on an initial state
with one or more terminal measurements. An example is shown below.
|0⟩ ──H──●──
|0⟩ ─────⊕──
Figure 1.1: An example quantum circuit.
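We read this circuit left-to-right as follows:
$$|0\rangle \otimes |0\rangle \xrightarrow{H} \frac{1}{\sqrt{2}} (|0\rangle + |1\rangle)|0\rangle \xrightarrow{\mathrm{CNOT}} \frac{1}{\sqrt{2}} (|00\rangle + |11\rangle)$$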
This is the final quantum state before measurement. Born’s rule tells us that we measure 00 with
probability 1/2 and 11 with probability 1/2.
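As a sanity check, the circuit can be simulated directly with state vectors. This is a minimal sketch assuming NumPy and the convention that the first qubit is the leftmost tensor factor; the variable names are illustrative only.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
I2 = np.eye(2, dtype=complex)
CNOT = np.array(
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex
)

ket0 = np.array([1, 0], dtype=complex)

# Initial state |0> ⊗ |0>.
psi = np.kron(ket0, ket0)
# Hadamard on the first qubit, identity on the second.
psi = np.kron(H, I2) @ psi
# CNOT with the first qubit as control.
psi = CNOT @ psi

# Born's rule: the probability of bitstring z is |<z|psi>|^2.
probs = np.abs(psi) ** 2
outcomes = {format(i, "02b"): float(p) for i, p in enumerate(probs)}
```

Here `outcomes` maps each two-bit string to its measurement probability, so the entries for `"00"` and `"11"` carry all of the weight.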
1.2 Quantum algorithms
A quantum algorithm is a quantum circuit (potentially with classical pre- and/or post-processing) for
performing some computational task. As an example, consider the following computational task:
Given a single-bit function 𝑓 : {0, 1} → {0, 1} and the ability to query the function, determine
if 𝑓 (0) = 𝑓 (1). There are four such functions, two of which satisfy 𝑓 (0) = 𝑓 (1) and two of
which do not. Since 𝑓 (0) and 𝑓 (1) are independent, any classical algorithm requires
two queries. Interestingly, a quantum algorithm exists which takes only one query. This algorithm
tells us whether 𝑓 (0) = 𝑓 (1) but does not tell us the value of either 𝑓 (0) or 𝑓 (1). The circuit for
performing this algorithm is shown in Fig. 1.2.
|0⟩ H Qf H
Figure 1.2: Quantum circuit for the Deutsch-Jozsa algorithm.
Here, the operation $Q_f$ is a phase query $Q_f |z\rangle = (-1)^{f(z)} |z\rangle$. One can show the final state of this
circuit before the measurement is
$$|\psi\rangle = \frac{1}{2}\left[ (-1)^{f(0)} + (-1)^{f(1)} \right] |0\rangle + \frac{1}{2}\left[ (-1)^{f(0)} - (-1)^{f(1)} \right] |1\rangle. \tag{1.4}$$
Thus, we measure 0 with probability 1 if 𝑓 (0) = 𝑓 (1), otherwise we measure 1. Notice how
constructive and destructive interference are used in preparing the solution. While this problem
is artificial, the same underlying principle is used for more realistic algorithms, e.g. Shor’s
algorithm [10] and others, which have rigorous performance and scaling guarantees but prohibitively
large overhead for (near-term) implementations. In Chapter 2 and Chapter 3, we develop quantum
algorithms with lower overhead which are still, in a well-defined manner, classically hard, but have
heuristic elements and less general performance guarantees.
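The one-query algorithm of Fig. 1.2 is small enough to simulate for all four functions. The sketch below assumes the phase query acts as a diagonal matrix with entries (−1)^f(0) and (−1)^f(1); the names `phase_oracle`, `deutsch`, and `functions` are hypothetical and only serve this illustration.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=float) / np.sqrt(2)


def phase_oracle(f):
    """Diagonal phase query: |z> -> (-1)^f(z) |z>."""
    return np.diag([(-1.0) ** f(0), (-1.0) ** f(1)])


def deutsch(f):
    """Run H, the phase query, then H on |0>; return the certain outcome and probabilities."""
    psi = H @ phase_oracle(f) @ H @ np.array([1.0, 0.0])
    probs = np.abs(psi) ** 2
    return int(np.argmax(probs)), probs


# The four single-bit functions f: {0, 1} -> {0, 1}.
functions = {
    "constant 0": lambda z: 0,
    "constant 1": lambda z: 1,
    "identity": lambda z: z,
    "negation": lambda z: 1 - z,
}

results = {name: deutsch(f)[0] for name, f in functions.items()}
```

In this sketch the measured bit is 0 for the two functions with f(0) = f(1) and 1 for the other two, each with probability one, using a single application of the oracle.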
1.3 Open quantum systems
A closed quantum system — one which is completely isolated from its environment — is primarily
a convenient mathematical abstraction. An open quantum system — one which interacts with its
environment — more accurately describes a quantum computer.
A noisy quantum state is described by an ensemble {𝑝𝑖 , |𝜓𝑖 ⟩}𝑖 where 𝑝𝑖 forms a probability
distribution and each |𝜓𝑖 ⟩ is a wavefunction. Physically, this means we do not know with certainty
which wavefunction |𝜓𝑖 ⟩ we possess. Mathematically, we work with the density operator (matrix)
of an ensemble
$$\rho = \sum_i p_i |\psi_i\rangle\langle\psi_i|, \tag{1.5}$$
a Hermitian, positive semi-definite operator with unit trace which generalizes classical probability
distributions (diagonal 𝜌).
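The properties listed for (1.5) are easy to verify on a small example. This is a minimal sketch with an arbitrarily chosen two-state ensemble (|0⟩ with probability 3/4 and |+⟩ with probability 1/4); the function name `density_matrix` is hypothetical.

```python
import numpy as np


def density_matrix(probs, states):
    """Build rho = sum_i p_i |psi_i><psi_i| from an ensemble."""
    return sum(p * np.outer(psi, psi.conj()) for p, psi in zip(probs, states))


ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)
ket_plus = (ket0 + ket1) / np.sqrt(2)

# Illustrative ensemble: |0> with probability 3/4 and |+> with probability 1/4.
rho = density_matrix([0.75, 0.25], [ket0, ket_plus])

is_hermitian = np.allclose(rho, rho.conj().T)
eigenvalues = np.linalg.eigvalsh(rho)
trace = np.trace(rho).real
purity = np.trace(rho @ rho).real  # equals 1 only for a pure state

# An ensemble of computational basis states gives a diagonal rho,
# i.e., an ordinary classical probability distribution.
rho_classical = density_matrix([0.75, 0.25], [ket0, ket1])
```

The purity Tr(ρ²) is included as an extra diagnostic not discussed above: it drops below one as soon as the ensemble contains more than one distinct state.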
Letting 𝜌 denote the quantum state of interest and $\rho_{\mathrm{env}}$ the environment, noise can be characterized
physically by the process
$$\rho \mapsto \mathrm{Tr}_{\mathrm{env}} \left[ U (\rho \otimes \rho_{\mathrm{env}}) U^\dagger \right]$$
where 𝑈 is a unitary on the composite Hilbert space and Trenv denotes the partial trace over the
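environment. (Given a state $\rho_{AB}$ on $\mathcal{H}_A \otimes \mathcal{H}_B$, the partial trace over $\mathcal{H}_B$ is defined by
$$\mathrm{Tr}_B\, \rho_{AB} := \sum_{j=1}^{\dim(\mathcal{H}_B)} (I \otimes \langle j|)\, \rho_{AB}\, (I \otimes |j\rangle) \tag{1.6}$$
where $|j\rangle$ form an orthonormal basis for $\mathcal{H}_B$. Similarly for the partial trace over $\mathcal{H}_A$.) This can be written [11] in
the equivalent, often more convenient, operator-sum representation
$$\rho \mapsto \sum_{k=1}^{K} E_k \rho E_k^\dagger \tag{1.7}$$
where the Kraus operators $E_k$ satisfy the completeness relation
$$\sum_{k=1}^{K} E_k^\dagger E_k = I.$$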
Equation (1.7) is known as a quantum operation or quantum channel. Physically, it can be
interpreted as randomly replacing the state 𝜌 by the (properly normalized) state $E_k \rho E_k^\dagger$ with
probability $\mathrm{Tr}[E_k \rho E_k^\dagger]$. Mathematically, it is a completely positive, trace-preserving map. Coherent
errors are noisy channels defined by a single ($K = 1$) unitary Kraus operator, whereas incoherent errors are
defined by $K > 1$ Kraus operators.
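The equivalence between the system-plus-environment picture and the operator-sum representation can be checked numerically. The sketch below makes several assumptions not fixed by the text: the environment is a single qubit starting in |0⟩, the joint unitary is drawn at random with a fixed seed, and the system is the first tensor factor. Under these assumptions the Kraus operators are the blocks E_k = (I ⊗ ⟨k|) U (I ⊗ |0⟩). All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def random_unitary(dim):
    """Unitary from the QR decomposition of a random complex matrix."""
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, _ = np.linalg.qr(z)
    return q


def partial_trace_env(rho_se, d_sys, d_env):
    """Trace out the second (environment) factor of an operator on sys ⊗ env."""
    rho = rho_se.reshape(d_sys, d_env, d_sys, d_env)
    return np.einsum("ajbj->ab", rho)


def kraus_from_unitary(U, d_sys, d_env):
    """E_k = (I ⊗ <k|) U (I ⊗ |0>) for an environment initialized in |0>."""
    U4 = U.reshape(d_sys, d_env, d_sys, d_env)
    return [U4[:, k, :, 0] for k in range(d_env)]


d_sys, d_env = 2, 2
U = random_unitary(d_sys * d_env)
rho = np.array([[0.7, 0.2], [0.2, 0.3]], dtype=complex)
rho_env = np.zeros((d_env, d_env), dtype=complex)
rho_env[0, 0] = 1.0  # environment in |0><0|

# Physical picture: evolve jointly, then discard the environment.
lhs = partial_trace_env(U @ np.kron(rho, rho_env) @ U.conj().T, d_sys, d_env)

# Operator-sum picture: apply the Kraus operators directly to rho.
kraus = kraus_from_unitary(U, d_sys, d_env)
rhs = sum(E @ rho @ E.conj().T for E in kraus)
completeness = sum(E.conj().T @ E for E in kraus)
```

With these choices `lhs` and `rhs` agree, and `completeness` is the identity, which is what guarantees that the output still has unit trace.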
We often model noise in devices with channels used in theoretical work. One commonly used
noise model is the Pauli channel.
Definition 1. The Pauli channel maps a single-qubit state 𝜌 to $\mathcal{E}^{\mathrm{P}}_{p}(\rho)$ defined by
$$\mathcal{E}^{\mathrm{P}}_{p}(\rho) := p_I \rho + p_X X \rho X + p_Y Y \rho Y + p_Z Z \rho Z \tag{1.8}$$
where $p_I + p_X + p_Y + p_Z = 1$.
While the Pauli channel acts on a single qubit, it can be generalized to a 𝑑-dimensional Hilbert
space via the Weyl channel
$$\mathcal{E}^{\mathrm{W}}_{p}(\rho) := \sum_{k,l=0}^{d-1} p_{kl} W_{kl} \rho W_{kl}^\dagger \tag{1.9}$$
where $p_{kl}$ are probabilities and the Weyl operators are
$$W_{kl} := \sum_{m=0}^{d-1} e^{2\pi i m k/d} |m\rangle\langle m + l|,$$
with addition taken modulo 𝑑.
For 𝑑 = 2, Eqn. (1.9) reduces to Eqn. (1.8).
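The reduction to the Pauli channel for 𝑑 = 2 can be made explicit. The sketch below assumes the Weyl operators carry a shift index l, i.e., terms of the form |m⟩⟨m + l| with addition modulo d, so that the four operators for d = 2 are I, X, Z, and iY; the function names and the sample probabilities are illustrative only.

```python
import numpy as np


def weyl(k, l, d):
    """W_kl = sum_m exp(2*pi*i*m*k/d) |m><m + l|, with m + l taken modulo d."""
    w = np.zeros((d, d), dtype=complex)
    for m in range(d):
        w[m, (m + l) % d] = np.exp(2j * np.pi * m * k / d)
    return w


def weyl_channel(rho, probs):
    """Apply sum_kl p_kl W_kl rho W_kl^dagger for a (d, d) array of probabilities."""
    d = rho.shape[0]
    return sum(
        probs[k, l] * weyl(k, l, d) @ rho @ weyl(k, l, d).conj().T
        for k in range(d)
        for l in range(d)
    )


I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)


def pauli_channel(rho, p_i, p_x, p_y, p_z):
    """Single-qubit Pauli channel with the given weights."""
    return p_i * rho + p_x * X @ rho @ X + p_y * Y @ rho @ Y + p_z * Z @ rho @ Z


# Illustrative probabilities p_kl, laid out as [[p_00, p_01], [p_10, p_11]].
probs = np.array([[0.7, 0.1], [0.15, 0.05]])
rho = np.array([[0.6, 0.3 - 0.1j], [0.3 + 0.1j, 0.4]], dtype=complex)

weyl_out = weyl_channel(rho, probs)
pauli_out = pauli_channel(rho, probs[0, 0], probs[0, 1], probs[1, 1], probs[1, 0])
```

The phase in W_11 = iY cancels in W ρ W†, so p_11 plays the role of the weight on 𝑌, while p_01 and p_10 are the weights on 𝑋 and 𝑍.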
Two special cases of the Pauli channel are the bit-flip and phase-flip (dephasing) channels.
Definition 2. The bit-flip channel maps a single-qubit state 𝜌 to $\mathcal{E}^{\mathrm{BF}}_{p}(\rho)$ defined by
$$\mathcal{E}^{\mathrm{BF}}_{p}(\rho) := (1 - p) \rho + p X \rho X \tag{1.10}$$
where 0 ≤ 𝑝 ≤ 1.
While a bit-flip channel flips the computational basis state with probability 𝑝, the phase-flip channel
introduces a relative phase with probability 𝑝.
Definition 3. The phase-flip (dephasing) channel maps a single-qubit state 𝜌 to $\mathcal{E}^{\mathrm{deph}}_{p}(\rho)$ defined
by
$$\mathcal{E}^{\mathrm{deph}}_{p}(\rho) := (1 - p) \rho + p Z \rho Z \tag{1.11}$$
where 0 ≤ 𝑝 ≤ 1.
Another special case of the Pauli channel is the depolarizing channel, which occurs when each
non-identity Pauli is equiprobable, $p_X = p_Y = p_Z = p/4$ and $p_I = 1 - 3p/4$. This channel can be equivalently
thought of as replacing the state 𝜌 by the maximally mixed state 𝐼/2 with probability 𝑝.
Definition 4. The depolarizing channel maps a single-qubit state 𝜌 to $\mathcal{E}^{\mathrm{depo}}_{p}(\rho)$ defined by
$$\mathcal{E}^{\mathrm{depo}}_{p}(\rho) := (1 - p) \rho + p I/2 \tag{1.12}$$
where 0 ≤ 𝑝 ≤ 1.
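The relation between the Pauli-channel weights and the depolarizing parameter follows from the identity ρ + XρX + YρY + ZρZ = 2I for any unit-trace ρ: a Pauli channel with equal weight q on 𝑋, 𝑌, and 𝑍 coincides with the depolarizing channel of Def. 4 at 𝑝 = 4q. The following minimal sketch checks this numerically; the test state and the value of 𝑝 are arbitrary illustrative choices.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)


def pauli_channel(rho, p_x, p_y, p_z):
    """Pauli channel; the identity weight is fixed by normalization."""
    p_i = 1.0 - p_x - p_y - p_z
    return p_i * rho + p_x * X @ rho @ X + p_y * Y @ rho @ Y + p_z * Z @ rho @ Z


def depolarizing_channel(rho, p):
    """Replace rho by the maximally mixed state I/2 with probability p."""
    return (1.0 - p) * rho + p * np.eye(2) / 2


rho = np.array([[0.7, 0.2 - 0.1j], [0.2 + 0.1j, 0.3]], dtype=complex)
p = 0.2

# Twirling identity behind the equivalence (valid for unit-trace rho).
twirl = rho + X @ rho @ X + Y @ rho @ Y + Z @ rho @ Z

# Equal Pauli weights q = p / 4 reproduce the depolarizing channel with parameter p.
from_pauli = pauli_channel(rho, p / 4, p / 4, p / 4)
from_depo = depolarizing_channel(rho, p)
```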
The $d = 2^n$-dimensional generalization of Def. 4 is straightforward:
Definition 5. The global depolarizing channel maps an 𝑛-qubit state 𝜌 to $\mathcal{E}^{\mathrm{GD}}_{p}(\rho)$ defined by
$$\mathcal{E}^{\mathrm{GD}}_{p}(\rho) := (1 - p) \rho + p I/d \tag{1.13}$$
where 0 ≤ 𝑝 ≤ 1, $d = 2^n$, and $I \equiv I_d$ is the 𝑑-dimensional identity.
We use this general description of noisy quantum systems, as well as the particular channels
we have defined, both for analyzing algorithms in the presence of noise in Chapter 2 and Chapter 3
and for developing general techniques for error mitigation in Chapter 4 and Chapter 5.
1.4 Quantum error correction
Because real quantum computers are open quantum systems, it is rather unlikely we will be able to
run circuits at the scale needed for, say, Shor’s algorithm without some solution for dealing with
errors. The primary long-term solution is error correction and fault-tolerance. Some of our ideas
for error mitigation in this thesis stem from error correction, so we briefly review this now.
Suppose a state $|\psi\rangle := \alpha|0\rangle + \beta|1\rangle$ incurs a phase error $E|\psi\rangle = \alpha|0\rangle + e^{i\delta}\beta|1\rangle$. In principle,
$\delta \in \mathbb{R}$ could be infinitesimal, in which case the task may appear hopeless from the start.
We can expand this error in the Pauli basis
$$E = e_0 I + e_1 X + e_2 Y + e_3 Z \tag{1.14}$$
to get a finite set of terms, but each $e_i \in \mathbb{C}$ could still in principle be infinitesimal. The almost
magical trick is that performing a measurement $\{M_i\}$ on $E|\psi\rangle$ returns $M_i E|\psi\rangle/\sqrt{p_i}$ with probability
$p_i$, i.e., some term $\eta_i \sigma_i |\psi\rangle$ where $\eta_i \in \mathbb{C}$ and $\sigma_i \in \{I, X, Y, Z\}$. The $\sigma_i$ can be removed by applying
$\sigma_i$, and although we still have a potentially infinitesimal $\eta_i$, it is now a global phase and has no
influence on measurement statistics. In other words, we can say that causing an error to occur is a
crucial step of quantum error correction.
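The Pauli expansion of the phase error can be computed explicitly using e_i = Tr(σ_i E)/2, which follows from Tr(σ_i σ_j) = 2δ_ij. The sketch below uses an arbitrary small angle δ = 10⁻³ as an illustration; the helper name `pauli_coefficients` is hypothetical.

```python
import numpy as np

PAULIS = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}


def pauli_coefficients(E):
    """Return e_i = Tr(sigma_i E) / 2, so that E = sum_i e_i sigma_i."""
    return {name: np.trace(P @ E) / 2 for name, P in PAULIS.items()}


delta = 1e-3  # illustrative small phase angle
E = np.diag([1.0, np.exp(1j * delta)])

coeffs = pauli_coefficients(E)
reconstructed = sum(c * PAULIS[name] for name, c in coeffs.items())
```

For this error only the 𝐼 and 𝑍 coefficients are nonzero, with magnitudes cos(δ/2) and sin(δ/2): a continuous family of small errors is expressed through a finite set of discrete Pauli terms.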
In a bit more detail, a stabilizer quantum error correction code (stabilizer code) is specified by
any subgroup G with −𝐼 ∉ G of the 𝑛-qubit Pauli group
$$\mathcal{P}_n := \{ p\, \sigma_1 \otimes \cdots \otimes \sigma_n : p \in \{\pm 1, \pm i\},\ \sigma_i \in \{I, X, Y, Z\} \}. \tag{1.15}$$
The group G is called the gauge group. The center of G in P𝑛 is called the stabilizer group
S := 𝑍 (G) ∩ G. Note that, by construction, S is abelian and does not contain −𝐼. We desire these
conditions to define codewords from S := ⟨𝑆1 , ..., 𝑆𝑟 ⟩. A codeword is a state |𝜓⟩ such that
𝑆|𝜓⟩ = |𝜓⟩ ∀ 𝑆 ∈ {𝑆1 , ..., 𝑆𝑟 }. (1.16)
The codespace is the span of codewords. It’s easy to show that the codespace is trivial if S is not
abelian or −𝐼 ∈ S. If S = G, the code is called a subspace code, otherwise the code is called a
subsystem code.
As an example, the three-qubit repetition code is a subspace code specified by S = ⟨𝑍1 𝑍2 , 𝑍2 𝑍3 ⟩.
One can verify that |000⟩ =: | 0̄⟩ and |111⟩ =: | 1̄⟩ are codewords. This thus defines a two-
dimensional codespace
𝛼| 0̄⟩ + 𝛽| 1̄⟩ = 𝛼|000⟩ + 𝛽|111⟩ (1.17)
which we identify as a logical qubit. The word logical is used to distinguish from physical qubits:
this logical qubit (1.17) is formed by defining a two-dimensional subspace of $(\mathbb{C}^2)^{\otimes 3}$ which we formed
out of three physical qubits. We use notation [[𝑛, 𝑘]] to describe a code with 𝑛 physical qubits
encoding 𝑘 logical qubits. The relationship to the number of stabilizer generators 𝑟 for such a code
is 𝑟 = 𝑛 − 𝑘.
In the repetition code example, one can check that the operator $Z_1$ satisfies $Z_1|\bar{0}\rangle = |\bar{0}\rangle$ and
$Z_1|\bar{1}\rangle = -|\bar{1}\rangle$. It thus behaves as the Pauli $Z$ operator on the logical qubit, i.e., the logical
operator $\bar{Z}$. Similarly, one can check that $XXX|\bar{0}\rangle = |\bar{1}\rangle$ and $XXX|\bar{1}\rangle = |\bar{0}\rangle$. It thus behaves as the
Pauli $X$ operator on the logical qubit, i.e., the logical operator $\bar{X}$. In general, for an $[[n, k]]$ code
it is always possible to find logical operators $\bar{Z}_1, ..., \bar{Z}_k, \bar{X}_1, ..., \bar{X}_k$ with the expected commutation
relations. Specifically, logical operations are elements of $\mathcal{L} := N(\mathcal{S}) - \mathcal{S}$ where $N$ denotes the
normalizer. Logical operations are not necessarily unique: e.g., for the three-qubit repetition code,
$Z_2$, $Z_3$, and $Z_1 Z_2 Z_3$ also behave as $\bar{Z}$. Note that, in this example, any single-qubit phase-flip
error $Z_i$ already acts as the logical operator $\bar{Z}$: it maps $(|\bar{0}\rangle + |\bar{1}\rangle)/\sqrt{2}$ to
$(|\bar{0}\rangle - |\bar{1}\rangle)/\sqrt{2}$. It would be better if this took many single-qubit errors since,
in reasonable noise models, many single-qubit errors occurring is much less likely than any one of
the errors occurring. This property of “how far” codewords are from each other is referred to as
the distance of the code and can be formulated as
\[ d := \min_{L \in \mathcal{L}} w(L) \tag{1.18} \]
where $w$ is the weight (number of non-identity terms) of the Pauli $L$. A code with distance $d$ can
correct errors on up to $t = \lfloor (d - 1)/2 \rfloor$ qubits. We often augment the $[[n, k]]$ notation with the
distance $d$ as $[[n, k, d]]$. We can thus describe the three-qubit repetition code as a $[[3, 1, 1]]$ code.
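The distance in (1.18) can be checked by brute force for a code this small. The sketch below is our own illustration, not code from the thesis: it represents each three-qubit Pauli, ignoring its phase, by a pair of bit tuples `(x, z)`, enumerates the normalizer of the stabilizer group, removes the stabilizers, and takes the minimum weight. All function and variable names are hypothetical.

```python
from itertools import product

n = 3
# Each Pauli (phase ignored) is a pair (x, z) of bit tuples: X^x Z^z on each qubit.
# Stabilizer generators Z1 Z2 and Z2 Z3 of the three-qubit repetition code.
generators = [((0, 0, 0), (1, 1, 0)), ((0, 0, 0), (0, 1, 1))]


def commute(p, q):
    """Two Paulis commute iff their symplectic product is 0 mod 2."""
    (x1, z1), (x2, z2) = p, q
    return sum(a * b + c * d for a, b, c, d in zip(x1, z2, z1, x2)) % 2 == 0


def multiply(p, q):
    """Product of two Paulis up to phase: bitwise XOR of the x and z parts."""
    (x1, z1), (x2, z2) = p, q
    return (tuple(a ^ b for a, b in zip(x1, x2)),
            tuple(a ^ b for a, b in zip(z1, z2)))


def weight(p):
    """Number of qubits on which the Pauli acts non-trivially."""
    x, z = p
    return sum(1 for a, b in zip(x, z) if a or b)


# Enumerate the stabilizer group S generated by the generators (up to phase).
identity = ((0,) * n, (0,) * n)
stabilizer_group = {identity}
for g in generators:
    stabilizer_group |= {multiply(s, g) for s in stabilizer_group}

# N(S): all Paulis commuting with every generator; logicals are N(S) minus S.
all_paulis = [(x, z) for x in product((0, 1), repeat=n)
              for z in product((0, 1), repeat=n)]
normalizer = [p for p in all_paulis if all(commute(p, g) for g in generators)]
logicals = [p for p in normalizer if p not in stabilizer_group]

distance = min(weight(p) for p in logicals)
print(distance)  # 1, attained e.g. by the single-qubit operator Z1
```

The enumeration is exponential in the number of qubits, so this only serves as a sanity check for toy codes.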
A correctable error commutes with all logical operators and all but one stabilizer generator.
Specifically, correctable errors are elements of the abelian group 𝑇 := 𝑁 (L) − S. Because of the
commutation relations, measuring each stabilizer generator reveals whether the error commutes or
anti-commutes with each stabilizer generator. This information, called a syndrome, can be used to
infer which error occurred. To see this, consider again the three-qubit repetition code example. As
𝑍 = |0⟩⟨0| − |1⟩⟨1|, the product 𝑍 𝑍 can be written 𝑍 𝑍 = |00⟩⟨00| + |11⟩⟨11| − (|01⟩⟨01| + |10⟩⟨10|).
In other words, states in which the two bits agree are in the +1 eigenspace and states in which the
two bits disagree are in the −1 eigenspace. So, measuring the stabilizer generator 𝑍1 𝑍2 tells us if
the first two bits agree or disagree. Similarly for measuring the other stabilizer generator 𝑍2 𝑍3 . If
the error 𝑋1 occurs, we would measure the syndrome [−1, 1] as 𝑋1 anticommutes with 𝑍1 𝑍2 and
commutes with 𝑍2 𝑍3 . Of course in practice we only get the syndrome and have to infer which error
occurred — this process is known as decoding.
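The syndrome measurement and decoding steps for the repetition code can be sketched classically, since only bit-flip ($X$) errors and $Z$-type checks are involved. This is a minimal illustration under our own assumptions: qubits are indexed from 0 (so the error $X_1$ of the text is qubit `0`), the decoder is a simple lookup table of lowest-weight errors, and all names are hypothetical.

```python
from itertools import combinations

# Stabilizer generators Z1 Z2 and Z2 Z3, written as the pairs of (0-indexed)
# qubits whose parity each generator checks.
CHECKS = [(0, 1), (1, 2)]


def syndrome(x_errors):
    """Syndrome of a set of bit-flip (X) errors.

    An X on qubit q anticommutes with every Z-type check that touches q, so a
    check returns -1 exactly when an odd number of its qubits were flipped.
    """
    return tuple(-1 if sum(q in x_errors for q in check) % 2 else 1
                 for check in CHECKS)


# Lookup-table decoder: map each syndrome to the lowest-weight X error causing it.
DECODER = {}
for weight in range(4):
    for qubits in combinations(range(3), weight):
        DECODER.setdefault(syndrome(set(qubits)), set(qubits))


def correct(x_errors):
    """Apply the decoder's guess and return the residual error."""
    guess = DECODER[syndrome(x_errors)]
    return set(x_errors) ^ guess  # X is self-inverse, so applying it twice cancels


print(syndrome({0}))    # (-1, 1), the syndrome quoted in the text for X1
print(correct({0}))     # set(): a single bit flip is corrected
print(correct({0, 1}))  # {0, 1, 2}: two flips are miscorrected into logical X
```

The last line shows why the code corrects only one bit flip: two flips share a syndrome with a single flip on the remaining qubit, and the decoder's guess completes the logical operator $XXX$.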
The general pattern of error correction is to encode the state with an [[𝑛, 𝑘, 𝑑]] code, measure
stabilizers to obtain a syndrome, decode the syndrome to infer which error occurred, then correct the
error. We typically assume we can do some of these operations perfectly — e.g., we prepare states
and measure stabilizers perfectly, and errors only occur elsewhere during the computation. This
is of course not realistic in practice but serves as the first step towards the theory of fault-tolerant
quantum computation in which all elements (state preparation, measurement, etc.) are treated as
noisy or unreliable. This background is sufficient for the purposes of this thesis, however. We use
the ideas of error correction for the purpose of error mitigation in Chapter 5.
1.5 The Gottesman-Knill theorem
The problem of simulating quantum circuits with 𝑛 qubits and depth 𝑑 is important for verifying the
output of quantum computers as well as ultimately understanding why, and in what sense, quantum
computers are more powerful than classical computers. Correspondingly, many methods have been
developed to classically simulate quantum systems. We use the term quantum simulator or just
simulator to denote a classical algorithm which inputs a quantum circuit and outputs a quantity of
interest. To truly mimic a quantum computer, this “quantity of interest” should only be a set of
bitstrings $z \in \{0, 1\}^n$ as this is the return type of a real experimental (qubit) quantum information
processing system. However, simulators work by manipulating some classical representation of
quantum information, so it is generally possible to return additional values, for example a classical
representation of the wavefunction, a reduced density matrix on one or more qubits, a single
amplitude of the wavefunction, or an expectation value of a given observable.
The Gottesman-Knill theorem presents an algorithm for efficiently simulating a certain class of
quantum circuits which, remarkably, contains circuits with very large numbers of qubits, very large
depth, and large entanglement. This class of circuits is known as Clifford circuits, the defining
characteristic being the types of gates (Clifford gates) appearing in the circuit. For general circuits,
the resources of this simulation strategy grow exponentially in the number of non-Clifford gates.
The Gottesman-Knill theorem [12] (or algorithm / simulator) works by updating operators
instead of updating the state in the same spirit as the Heisenberg picture vs. the Schrödinger picture.
(It was originally presented this way [13], though the terminology is no longer as standard.) In the
Schrödinger picture, we think of operators being fixed and the state evolving over time. Applying
an operator 𝑈 to a state 𝑉 |𝜓⟩, we say that the new state is 𝑈𝑉 |𝜓⟩. However, we may equivalently
write this as (𝑈𝑉𝑈 † )𝑈|𝜓⟩ and say that
\[ V \mapsto U V U^\dagger. \tag{1.19} \]
See Table 1.3 for a summary.

                  Schrödinger picture      Heisenberg picture
  $V$             -                        $U V U^\dagger$
  $|\psi\rangle$  $U V |\psi\rangle$       $U |\psi\rangle$

Table 1.3: How each term in the quantity $V|\psi\rangle$ is updated after application of $U$ in the Schrödinger
vs. Heisenberg picture. The answer is always $UV|\psi\rangle$ (the product of each column).

If we keep track of (1.19) for a basis $\{P_1, ..., P_k\}$, then we are able to reconstruct the evolution
of any operator $V = \sum_i \alpha_i P_i$ since
\[ U V U^\dagger = \sum_i \alpha_i (U P_i U^\dagger) \tag{1.20} \]
by linearity. Furthermore, the map (1.19) is a group homomorphism since
\[ VW \mapsto U V W U^\dagger = U V U^\dagger U W U^\dagger. \tag{1.21} \]
Therefore it suffices to track the evolution of a generating set. If we take the Pauli basis as our
basis, then a convenient generating set is $\{X_1, ..., X_n, Z_1, ..., Z_n\}$. For a general operator of $U(2^n)$,
we thus need to keep track of only $2n$ single-qubit operators.
So far this presentation is completely general with respect to what the operators (gates) are.
For arbitrary operators, the description of how the generating set transforms can grow exponentially.
However, if we only allow operators which preserve Paulis under conjugation, the size of the
description does not grow. This class of operators is precisely the normalizer of the Pauli group $\mathcal{P}_n$
in $U(2^n)$, also called the Clifford group, and is denoted $N(\mathcal{P}_n)$ or $\mathcal{C}$.
The Clifford group is generated by {𝐻, 𝑆, CNOT𝑖 𝑗 } between arbitrary pairs of qubits 𝑖, 𝑗 ∈ [𝑛].
One can verify for the single-qubit gates that
\[ H X H^\dagger = Z, \qquad H Z H^\dagger = X \tag{1.22} \]
\[ S X S^\dagger = Y, \qquad S Z S^\dagger = Z \tag{1.23} \]
and for the two-qubit CNOT that
CNOT(𝑋 𝐼)CNOT† = 𝑋 𝑋 (1.24)
CNOT(𝐼 𝑋)CNOT† = 𝐼 𝑋 (1.25)
CNOT(𝑍 𝐼)CNOT† = 𝑍 𝐼 (1.26)
CNOT(𝐼 𝑍)CNOT† = 𝑍 𝑍 (1.27)
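The conjugation relations (1.22) to (1.27) are easy to confirm numerically. The following sketch is our own check, written with plain nested lists so that it has no dependencies; it assumes the usual matrix conventions ($S = \mathrm{diag}(1, i)$, CNOT with the first qubit as control), and the helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def dagger(a):
    """Conjugate transpose."""
    return [[complex(a[j][i]).conjugate() for j in range(len(a))]
            for i in range(len(a[0]))]


def kron(a, b):
    """Kronecker (tensor) product; the first factor is the first qubit."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def conjugate(u, p):
    """Heisenberg-picture update P -> U P U^dagger."""
    return matmul(matmul(u, p), dagger(u))


def close(a, b, tol=1e-12):
    """Entrywise comparison up to floating-point rounding."""
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
Z = [[1, 0], [0, -1]]
r = 1 / math.sqrt(2)
H = [[r, r], [r, -r]]
S = [[1, 0], [0, 1j]]
CNOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # control = first qubit

checks = {
    "H X H† = Z": close(conjugate(H, X), Z),
    "H Z H† = X": close(conjugate(H, Z), X),
    "S X S† = Y": close(conjugate(S, X), Y),
    "S Z S† = Z": close(conjugate(S, Z), Z),
    "CNOT (XI) CNOT† = XX": close(conjugate(CNOT, kron(X, I2)), kron(X, X)),
    "CNOT (IX) CNOT† = IX": close(conjugate(CNOT, kron(I2, X)), kron(I2, X)),
    "CNOT (ZI) CNOT† = ZI": close(conjugate(CNOT, kron(Z, I2)), kron(Z, I2)),
    "CNOT (IZ) CNOT† = ZZ": close(conjugate(CNOT, kron(I2, Z)), kron(Z, Z)),
}
print(all(checks.values()))  # True
```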
Assuming without loss of generality a Clifford circuit is compiled into this gateset, the
Gottesman-Knill algorithm works by iterating through the circuit and updating the generating
set at each step. As is typical with P𝑛 , in software one represents elements using symplectic
notation and updates the so-called tableau of the generators. For clarity we proceed by example
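with the two-qubit circuit in Fig. 1.1 that performs $|00\rangle \mapsto (|00\rangle + |11\rangle)/\sqrt{2}$ and show how the
tableau is updated:
\[
\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\xrightarrow{H_0}
\begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\xrightarrow{\mathrm{CNOT}_{01}}
\begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix}
\tag{1.28}
\]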
In other words, $X_0 \mapsto Z_0$, $X_1 \mapsto X_1$, $Z_0 \mapsto X_0 X_1$, and $Z_1 \mapsto Z_0 Z_1$. From the final stabilizer tableau,
one can sample bitstrings or compute expectation values using algorithms described in [12]. For
our purposes, we are primarily concerned with the ability to efficiently store and manipulate
stabilizer states with classical resources as described above, a task which will be crucial for our
error mitigation strategy in Chapter 5.
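The tableau update in (1.28) can be reproduced with a few lines of bit manipulation. The sketch below is a simplified illustration rather than the full algorithm of [12]: it tracks only the $x$ and $z$ bits of each generator image and ignores signs, which a complete simulator must also store. The column ordering $(x_0, x_1 \,|\, z_0, z_1)$ and all names are our assumptions.

```python
def identity_tableau(n):
    """Rows are the images of X_0..X_{n-1}, Z_0..Z_{n-1}; columns are (x | z) bits."""
    return [[1 if i == j else 0 for j in range(2 * n)] for i in range(2 * n)]


def apply_h(tableau, n, q):
    """H exchanges X and Z on qubit q: swap the x_q and z_q columns."""
    for row in tableau:
        row[q], row[n + q] = row[n + q], row[q]


def apply_cnot(tableau, n, control, target):
    """CNOT copies X from control to target and Z from target to control."""
    for row in tableau:
        row[target] ^= row[control]
        row[n + control] ^= row[n + target]


def to_label(row, n):
    """Human-readable Pauli string for one row (signs are not tracked here)."""
    names = {(0, 0): "I", (1, 0): "X", (0, 1): "Z", (1, 1): "Y"}
    return "".join(names[(row[q], row[n + q])] for q in range(n))


n = 2
tableau = identity_tableau(n)
apply_h(tableau, n, 0)
apply_cnot(tableau, n, 0, 1)

labels = [to_label(row, n) for row in tableau]
print(tableau)  # [[0, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
print(labels)   # ['ZI', 'IX', 'XX', 'ZZ']
```

Each gate touches a constant number of columns, so one update costs a number of bit operations linear in the number of tracked generators, independent of how entangled the underlying state is.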
1.6 Outline of thesis
The remainder of this thesis is split into two parts, with the first part developing algorithms for
quantum computers in the “NISQ” (noisy intermediate-scale quantum) [14] or “KSQ” (kilo-scale
quantum, $\sim 10^3$ qubits × operations) regime, and the second part developing error mitigation
techniques for such computers. Both parts are complementary towards the goal of useful quantum
computing. In the first part, we develop algorithms for diagonalizing quantum states (density
matrices) in Chapter 2 and for compiling quantum circuits in Chapter 3. In the second part, we
analyze and extend an existing error mitigation technique, zero-noise extrapolation, in Chapter 4,
and develop a new resource-efficient procedure for implementing the composition of several error
mitigation techniques in Chapter 5.
The following paragraphs are a short preface to Part 1 (chapters two and three). It would be
useful to format this as a part, but I’m not allowed to have parts in my thesis because someone
in the graduate school gets paid to look at the format of theses and tell you that you can’t have
colored hyperlinks, you have to have a table of algorithms with the word “Algorithm” before each
entry, and you can’t have parts unless they are formatted like chapters and simultaneously do and
do not appear in the table of contents (like chapters). So consider the following paragraphs to be
a preface to chapters two and three which one may logically format as a higher-level abstraction
than a chapter if one had the ability to do so. I would offer a link to a usefully formatted thesis on
my website or something but I’m sure my friend in the graduate school would inform me that’s a
violation of university policy and I can no longer graduate. (And that the link can’t be colored.) I’d
usually include this as a footnote but I’m sure if I did I’d have to include a list of footnotes where
each entry has to start with the word “Footnote” and the list of footnotes has to appear in the table
of contents (where the parts should and shouldn’t be, formatted like chapters). So I’m writing this
as plain text, not sectioned or chaptered or otherwise numbered, and absolutely not in a part. As a
result it makes nearly no sense, in accordance with university policy. Will my friend in the graduate
school still notice, and subsequently examine every part of my thesis with a microscope to ensure I
don’t graduate? I’m almost certain the state-of-the-art PDF diff tool MSU bought from the RAND
Corporation in 1852 will pick this up. But I leave it to fate whether this is received with a smile or
a frown in the hope that someone reading this document out of interest will be better oriented by
this remark. A similar remark will appear before Part 2 (chapters four and five).
The future applications of quantum computers, assuming that large-scale, fault-tolerant versions
will eventually be realized, are manifold. From a mathematical perspective, applications include
number theory [15], linear algebra [16, 17, 18], differential equations [19, 20], and optimization
[21]. From a physical perspective, applications include electronic structure determination [22, 23]
for molecules and materials and real-time simulation of quantum dynamical processes [24] such
as protein folding and photo-excitation events. Naturally, some of these applications are more
long-term than others. Factoring and solving linear systems of equations are typically viewed as
longer term applications due to their high resource requirements. On the other hand, approximate
optimization and the determination of electronic structure may be nearer term applications, and
could even serve as demonstrations of quantum supremacy in the near future [25, 26].
A major aspect of quantum algorithms research is to make applications of interest more near term
by reducing quantum resource requirements including qubit count, circuit depth, numbers of gates,
and numbers of measurements. A powerful strategy for this purpose is algorithm hybridization,
where a fully quantum algorithm is turned into a hybrid quantum-classical algorithm [27]. The
benefit of hybridization is two-fold: it reduces the required quantum resources (hence allowing
implementation on smaller hardware) and increases accuracy (by outsourcing calculations to “error-free”
classical computers).
Variational hybrid algorithms are a class of quantum-classical algorithms that involve minimizing
a cost function that depends on the parameters of a quantum gate sequence. Cost evaluation
occurs on the quantum computer, with speedup over classical cost evaluation, and the classical
computer uses this cost information to adjust the parameters of the gate sequence. Variational
hybrid algorithms have been proposed for Hamiltonian ground state and excited state preparation
[22, 28, 29], approximate optimization [21], error correction [30], quantum data compression
[31, 32], and quantum simulation [33, 34]. A key feature of such algorithms is their near-term
relevance, since only the subroutine of cost evaluation occurs on the quantum computer, while the
optimization procedure is entirely classical, and hence standard classical optimization tools can be
employed.
CHAPTER 2
VARIATIONAL QUANTUM STATE DIAGONALIZATION
2.1 Introduction
In this chapter, we consider the application of diagonalizing quantum states. In condensed matter
physics, diagonalizing states is useful for identifying properties of topological quantum phases—a
field known as entanglement spectroscopy [35]. In data science and machine learning, diagonalizing
the covariance matrix (which could be encoded in a quantum state [36, 16]) is frequently employed
for principal component analysis (PCA). PCA identifies features that capture the largest variance
in one’s data and hence allows for dimensionality reduction [37].
Classical methods for diagonalization typically scale polynomially in the matrix dimension [38].
Similarly, the number of measurements required for quantum state tomography—a general method
for fully characterizing a quantum state—scales polynomially in the dimension. Interestingly, Lloyd
et al. proposed a quantum algorithm for diagonalizing quantum states that can potentially perform
exponentially faster than these methods [16]. Namely, their algorithm, called quantum principal
component analysis (qPCA), gives an exponential speedup for low-rank matrices. qPCA employs
quantum phase estimation combined with density matrix exponentiation. These subroutines require
a significant number of qubits and gates, making qPCA difficult to implement in the near term,
despite its long-term promise.
Here, we propose a variational hybrid algorithm for quantum state diagonalization. For a given
state $\rho$, our algorithm is composed of three steps: (i) Train the parameters $\vec{\alpha}$ of a gate sequence
$U_p(\vec{\alpha})$ such that $\tilde{\rho} = U_p(\vec{\alpha}_{\mathrm{opt}})\, \rho\, U_p(\vec{\alpha}_{\mathrm{opt}})^\dagger$ is approximately diagonal, where $\vec{\alpha}_{\mathrm{opt}}$ is the optimal value
of $\vec{\alpha}$ obtained, (ii) Read out the largest eigenvalues of $\rho$ by measuring in the eigenbasis (i.e., by
measuring $\tilde{\rho}$ in the standard basis), and (iii) Prepare the eigenvectors associated with the largest
eigenvalues. We call this the variational quantum state diagonalization (VQSD) algorithm. VQSD
is a near-term algorithm with the same practical benefits as other variational hybrid algorithms.
Employing a layered ansatz for $U_p(\vec{\alpha})$ (where $p$ is the number of layers) allows one to obtain a
hierarchy of approximations for the eigenvalues and eigenvectors. We therefore think of VQSD as
an approximate diagonalization algorithm.
We carefully choose our cost function 𝐶 to have the following properties: (i) 𝐶 is faithful (i.e.,
it vanishes if and only if 𝜌˜ is diagonal), (ii) 𝐶 is efficiently computable on a quantum computer,
(iii) 𝐶 has operational meanings such that it upper bounds the eigenvalue and eigenvector error
(see Sec. 2.2.1), and (iv) 𝐶 scales well for training purposes in the sense that its gradient does not
vanish exponentially in the number of qubits. The precise definition of 𝐶 is given in Sec. 2.2.1 and
involves a difference of purities for different states. To compute 𝐶, we introduce novel short-depth
quantum circuits that likely have applications outside the context of VQSD.
To illustrate our method, we implement VQSD on Rigetti’s 8-qubit quantum computer. We
successfully diagonalize one-qubit pure states using this quantum computer. To highlight future
applications (when larger quantum computers are made available), we implement VQSD on a
simulator to perform entanglement spectroscopy on the ground state of the one-dimensional (1D)
Heisenberg model composed of 12 spins.
This chapter is organized as follows. Section 2.2 outlines the VQSD algorithm and presents
its implementation. In Sec. 2.3, we give a comparison to the qPCA algorithm, and we elaborate
on future applications. Section 2.4 presents our methods for quantifying diagonalization and for
optimizing our cost function.
2.2 Results
2.2.1 The VQSD Algorithm
2.2.1.1 Overall structure
Figure 2.1 shows the structure of the VQSD algorithm. The goal of VQSD is to take, as its input, an
$n$-qubit density matrix $\rho$ given as a quantum state and then output approximations of the $m$ largest
eigenvalues and their associated eigenvectors. Here, $m$ will typically be much less than $2^n$, the
matrix dimension of $\rho$, although the user is free to increase $m$ with increased algorithmic complexity
(discussed below). The outputted eigenvalues will be in classical form, i.e., will be stored on a
classical computer. In contrast, the outputted eigenvectors will be in quantum form, i.e., will be
prepared on a quantum computer. This is necessary because the eigenvectors would have $2^n$ entries
if they were stored on a classical computer, which is intractable for large $n$. Nevertheless, one can
characterize important aspects of these eigenvectors with a polynomial number of measurements
on the quantum computer.

Figure 2.1: Schematic diagram showing the steps of the VQSD algorithm. (a) Two copies of
quantum state $\rho$ are provided as an input. These states are sent to the parameter optimization loop
(b) where a hybrid quantum-classical variational algorithm approximates the diagonalizing unitary
$U_p(\vec{\alpha}_{\mathrm{opt}})$. Here, $p$ is a hyperparameter that dictates the quality of solution found. This optimal
unitary is sent to the eigenvalue readout circuit (c) to obtain bitstrings $\vec{z}$, the frequencies of which
provide estimates of the eigenvalues of $\rho$. Along with the optimal unitary $U_p(\vec{\alpha}_{\mathrm{opt}})$, these bitstrings
are sent to the eigenvector preparation circuit (d) to prepare the eigenstates of $\rho$ on a quantum
computer. Both the eigenvalues and eigenvectors are the outputs (e) of the VQSD algorithm.

Similar to classical eigensolvers, the VQSD algorithm is an approximate or iterative diagonalization
algorithm. Classical eigenvalue algorithms are necessarily iterative, not exact [39]. Iterative
algorithms are useful in that they allow for a trade-off between run-time and accuracy. Higher degrees
of accuracy can be achieved at the cost of more iterations (equivalently, longer run-time), or
short run-time can be achieved at the cost of lower accuracy. This flexibility is desirable in that it
allows the user of the algorithm to dictate the quality of the solutions found.
The iterative feature of VQSD arises via a layered ansatz for the diagonalizing unitary. This
idea similarly appears in other variational hybrid algorithms, such as the Quantum Approximate
Optimization Algorithm [21]. Specifically, VQSD diagonalizes 𝜌 by variationally updating a
parameterized unitary $U_p(\vec{\alpha})$ such that
\[ \tilde{\rho}_p(\vec{\alpha}) := U_p(\vec{\alpha})\, \rho\, U_p^\dagger(\vec{\alpha}) \tag{2.1} \]
is (approximately) diagonal at the optimal value $\vec{\alpha}_{\mathrm{opt}}$. (For brevity we often write $\tilde{\rho}$ for $\tilde{\rho}_p(\vec{\alpha})$.) We
assume a layered ansatz of the form
\[ U_p(\vec{\alpha}) = L_1(\vec{\alpha}_1) L_2(\vec{\alpha}_2) \cdots L_p(\vec{\alpha}_p). \tag{2.2} \]
Here, $p$ is a hyperparameter that sets the number of layers $L_i(\vec{\alpha}_i)$, and each $\vec{\alpha}_i$ is a set of optimization
parameters that corresponds to internal gate angles within the layer. The parameter $\vec{\alpha}$ in (2.1)
refers to the collection of all $\vec{\alpha}_i$ for $i = 1, ..., p$. Once the optimization procedure is finished
and returns the optimal parameters $\vec{\alpha}_{\mathrm{opt}}$, one can then run a particular quantum circuit (shown in
Fig. 2.1(c) and discussed below) $N_{\mathrm{readout}}$ times to approximately determine the eigenvalues of $\rho$.
The precision (i.e., the number of significant digits) of each eigenvalue increases with $N_{\mathrm{readout}}$ and
with the eigenvalue’s magnitude. Hence for small $N_{\mathrm{readout}}$ only the largest eigenvalues of $\rho$ will be
precisely characterized, so there is a connection between $N_{\mathrm{readout}}$ and how many eigenvalues, $m$,
are determined. The hyperparameter $p$ is a refinement parameter, meaning that the accuracy of the
eigensystem (eigenvalues and eigenvectors) typically increases as $p$ increases. We formalize this
argument as follows.
Let 𝐶 denote our cost function, defined below in (2.10), which we are trying to minimize. In
general, the cost 𝐶 will be non-increasing (i.e., will either decrease or stay constant) in 𝑝. One can
ensure that this is true by taking the optimal parameters learned for 𝑝 layers as the starting point
for the optimization of $p + 1$ layers and by setting $\vec{\alpha}_{p+1}$ such that $L_{p+1}(\vec{\alpha}_{p+1})$ is the identity. This
strategy also avoids barren plateaus [40, 41] and helps to mitigate the problem of local minima, as
we discuss in Section 2.8.
Next, we argue that 𝐶 is closely connected to the accuracy of the eigensystem. Specifically,
it gives an upper bound on the eigensystem error. Hence, one obtains an increasingly tighter
upper bound on the eigensystem error as 𝐶 decreases (equivalently, as 𝑝 increases). To quantify
eigenvalue error, we define
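\[ \Delta_\lambda := \sum_{i=1}^{d} (\lambda_i - \tilde{\lambda}_i)^2 , \tag{2.3} \]
where $d = 2^n$, and $\{\lambda_i\}$ and $\{\tilde{\lambda}_i\}$ are the true and inferred eigenvalues, respectively. Here, $i$ is
an index that orders the eigenvalues in decreasing order, i.e., $\lambda_i \geq \lambda_{i+1}$ and $\tilde{\lambda}_i \geq \tilde{\lambda}_{i+1}$ for all
$i \in \{1, ..., d - 1\}$. To quantify eigenvector error, we define
\[ \Delta_v := \sum_{i=1}^{d} \langle \delta_i | \delta_i \rangle , \quad \text{with} \quad |\delta_i\rangle = \rho|\tilde{v}_i\rangle - \tilde{\lambda}_i |\tilde{v}_i\rangle = \Pi_i^\perp \rho |\tilde{v}_i\rangle . \tag{2.4} \]
Here, $|\tilde{v}_i\rangle$ is the inferred eigenvector associated with $\tilde{\lambda}_i$, and $\Pi_i^\perp = I - |\tilde{v}_i\rangle\langle\tilde{v}_i|$ is the projector onto
the subspace orthogonal to $|\tilde{v}_i\rangle$. Hence, $|\delta_i\rangle$ is a vector whose norm quantifies the component of
$\rho|\tilde{v}_i\rangle$ that is orthogonal to $|\tilde{v}_i\rangle$, or in other words, how far $|\tilde{v}_i\rangle$ is from being an eigenvector of $\rho$.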
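The second equality in (2.4) relies on the inferred eigenvalue being the expectation value of $\rho$ in the inferred eigenvector, $\tilde{\lambda}_i = \langle \tilde{v}_i | \rho | \tilde{v}_i \rangle$, which is how the inferred eigenvalues arise in the readout step of this section (as diagonal elements of $\tilde{\rho}$). Under that assumption, the step can be spelled out as:

\[
\Pi_i^\perp \rho |\tilde{v}_i\rangle
= \left( I - |\tilde{v}_i\rangle\langle\tilde{v}_i| \right) \rho |\tilde{v}_i\rangle
= \rho|\tilde{v}_i\rangle - \langle\tilde{v}_i|\rho|\tilde{v}_i\rangle\, |\tilde{v}_i\rangle
= \rho|\tilde{v}_i\rangle - \tilde{\lambda}_i |\tilde{v}_i\rangle .
\]

In particular, $|\delta_i\rangle = 0$ exactly when $|\tilde{v}_i\rangle$ is an eigenvector of $\rho$ with eigenvalue $\tilde{\lambda}_i$.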
As proven in Sec. 2.4.1, our cost function upper bounds the eigenvalue and eigenvector error
up to a proportionality factor 𝛽,
Δ𝜆 ≤ 𝛽𝐶 , and Δ𝑣 ≤ 𝛽𝐶 . (2.5)
Because 𝐶 is non-increasing in 𝑝, the upper bound in (2.5) is non-increasing in 𝑝 and goes to zero
if 𝐶 goes to zero.
We remark that Δ𝑣 can be interpreted as a weighted eigenvector error, where eigenvectors with
larger eigenvalues are weighted more heavily in the sum. This is a useful feature since it implies that
lowering the cost 𝐶 will force the eigenvectors with the largest eigenvalues to be highly accurate.
In many applications, such eigenvectors are precisely the ones of interest. (See Sec. 2.2.2.2 for an
illustration of this feature.)
The various steps in the VQSD algorithm are shown schematically in Fig. 2.1. There are
essentially three main steps: (1) an optimization loop that minimizes the cost $C$ via back-and-forth
communication between a classical and quantum computer, where the former adjusts $\vec{\alpha}$ and the latter
computes $C$ for $U_p(\vec{\alpha})$, (2) a readout procedure for approximations of the $m$ largest eigenvalues,
which involves running a quantum circuit and then classically analyzing the statistics, and (3) a
preparation procedure to prepare approximations of the eigenvectors associated with the $m$ largest
eigenvalues. In the following subsections, we elaborate on each of these procedures.

Figure 2.2: (a) Layered ansatz for the diagonalizing unitary $U_p(\vec{\alpha})$. Each layer $L_i$, $i = 1, ..., p$,
consists of a set of optimization parameters $\vec{\alpha}_i$. (b) The two-qubit gate ansatz for the $i$th layer,
shown on four qubits. Here we impose periodic boundary conditions on the top/bottom edge of the
circuit so that $G_3$ wraps around from top to bottom. Section 2.7 discusses an alternative approach
to the construction of $U_p(\vec{\alpha})$, in which the ansatz is modified during the optimization process.

2.2.1.2 Parameter optimization loop
Naturally, there are many ways to parameterize $U_p(\vec{\alpha})$. Ideally one would like the number of
parameters to grow at most polynomially in both $n$ and $p$. Figure 2.2 presents an example ansatz
that satisfies this condition. Each layer $L_i$ is broken down into layers of two-body gates that can
be performed in parallel. These two-body gates can be further broken down into parameterized
one-body gates, for example, with the construction in Ref. [5]. We discuss a different approach to
parameterize $U_p(\vec{\alpha})$ in Section 2.7.
For a given ansatz, such as the one in Fig. 2.2, parameter optimization involves evaluating the
cost $C$ on a quantum computer for an initial choice of parameters and then modifying the parameters
on a classical computer in an iterative feedback loop. The goal is to find
\[ \vec{\alpha}_{\mathrm{opt}} := \arg\min_{\vec{\alpha}} C(U_p(\vec{\alpha})) . \tag{2.6} \]
The classical optimization routine used for updating the parameters can involve either gradient-free
or gradient-based methods. In Sec. 2.4.2, we explore this further and discuss our optimization
methods.
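The feedback loop can be illustrated with a toy example in which the quantum cost evaluation is replaced by a classical formula. The sketch below is our own and makes several assumptions that are not from the thesis: the state is the one-qubit $\rho = |+\rangle\langle+|$, the ansatz is a single rotation $R_y(\theta)$, the cost is the purity-difference form introduced in (2.7), and the optimizer is a simple gradient-free pattern search (not the Powell method used later).

```python
import math


def cost(theta):
    """Stand-in for the cost evaluated on the quantum computer.

    rho = |+><+| is rotated by the one-parameter ansatz U(theta) = Ry(theta);
    the cost is Tr(rho^2) minus the sum of squared diagonal elements of the
    rotated state, which vanishes iff the rotated state is diagonal.
    """
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    # Amplitudes of Ry(theta)|+> in the standard basis.
    amp0, amp1 = (c - s) / math.sqrt(2), (c + s) / math.sqrt(2)
    p0, p1 = amp0 ** 2, amp1 ** 2  # diagonal elements of the rotated state
    purity = 1.0                   # rho is pure
    return purity - (p0 ** 2 + p1 ** 2)


def optimize(theta, step=0.5, tol=1e-6):
    """Classical side of the loop: try theta +/- step, shrink the step on failure."""
    best = cost(theta)
    while step > tol:
        for candidate in (theta + step, theta - step):
            value = cost(candidate)
            if value < best:
                theta, best = candidate, value
                break
        else:
            step /= 2
    return theta, best


theta_opt, c_opt = optimize(theta=0.3)
print(theta_opt, c_opt)  # theta close to pi/2, cost close to 0
```

For this toy problem the cost reduces to $\cos^2(\theta)/2$, so the minimizers are $\theta = \pm\pi/2$, where $R_y(\theta)$ maps $|+\rangle$ to a standard basis state.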
In Eq. (2.6), $C(U_p(\vec{\alpha}))$ quantifies how far the state $\tilde{\rho}_p(\vec{\alpha})$ is from being diagonal. There
are many ways to define such a cost function, and in fact there is an entire field of research on
coherence measures that has introduced various such quantities [42]. We aim for a cost that is
efficiently computable with a quantum-classical system, and hence we consider a cost that can
be expressed in terms of purities. (It is well known that a quantum computer can find the purity
$\mathrm{Tr}(\sigma^2)$ of an $n$-qubit state $\sigma$ with complexity scaling only linearly in $n$, an exponential speedup
over classical computation [43, 44].) Two such cost functions, whose individual merits we discuss
in Sec. 2.4.1, are
\[ C_1(U_p(\vec{\alpha})) = \mathrm{Tr}(\rho^2) - \mathrm{Tr}(\mathcal{Z}(\tilde{\rho})^2) , \tag{2.7} \]
\[ C_2(U_p(\vec{\alpha})) = \mathrm{Tr}(\rho^2) - \frac{1}{n} \sum_{j=1}^{n} \mathrm{Tr}(\mathcal{Z}_j(\tilde{\rho})^2) . \tag{2.8} \]
Here, $\mathcal{Z}$ and $\mathcal{Z}_j$ are quantum channels that dephase (i.e., destroy the off-diagonal elements) in the
global standard basis and in the local standard basis on qubit $j$, respectively. Importantly, the two
functions vanish under the same conditions:
\[ C_1(U_p(\vec{\alpha})) = 0 \iff C_2(U_p(\vec{\alpha})) = 0 \iff \tilde{\rho} = \mathcal{Z}(\tilde{\rho}) . \tag{2.9} \]
So the global minima of $C_1$ and $C_2$ coincide and correspond precisely to unitaries $U_p(\vec{\alpha})$ that
diagonalize $\rho$ (i.e., unitaries such that $\tilde{\rho}$ is diagonal).
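For small systems both cost functions can be evaluated directly from the matrix of $\tilde{\rho}$, which makes the definitions concrete. The sketch below is a classical illustration only (on hardware these purities come from the circuits described later); it assumes qubit 0 is the most significant bit of the basis index, uses the fact that $\mathrm{Tr}(\rho^2) = \mathrm{Tr}(\tilde{\rho}^2)$ because conjugation by a unitary preserves purity, and all names are hypothetical.

```python
def purity(rho):
    """Tr(rho^2) for a Hermitian matrix: the sum of |rho_ij|^2 over all entries."""
    return sum(abs(x) ** 2 for row in rho for x in row)


def dephase_global(rho):
    """Z: destroy all off-diagonal elements in the standard basis."""
    return [[x if r == c else 0 for c, x in enumerate(row)]
            for r, row in enumerate(rho)]


def dephase_local(rho, j, n):
    """Z_j: destroy elements whose row and column bitstrings differ on qubit j."""
    shift = n - 1 - j  # qubit 0 is the most significant bit of the index
    return [[x if ((r >> shift) & 1) == ((c >> shift) & 1) else 0
             for c, x in enumerate(row)] for r, row in enumerate(rho)]


def c1(rho_tilde):
    """Cost (2.7); Tr(rho^2) equals Tr(rho_tilde^2) for unitary conjugation."""
    return purity(rho_tilde) - purity(dephase_global(rho_tilde))


def c2(rho_tilde, n):
    """Cost (2.8): average over qubits of the locally dephased purities."""
    local = sum(purity(dephase_local(rho_tilde, j, n)) for j in range(n))
    return purity(rho_tilde) - local / n


# Two-qubit examples: a Bell state, |+><+| (x) |0><0|, and a diagonal state.
bell = [[0.5, 0, 0, 0.5], [0, 0, 0, 0], [0, 0, 0, 0], [0.5, 0, 0, 0.5]]
plus_zero = [[0.5, 0, 0.5, 0], [0, 0, 0, 0], [0.5, 0, 0.5, 0], [0, 0, 0, 0]]
diagonal = [[0.7, 0, 0, 0], [0, 0.3, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

for name, state in [("bell", bell), ("plus_zero", plus_zero), ("diagonal", diagonal)]:
    print(name, c1(state), c2(state, 2))
```

In these examples both costs vanish only for the diagonal state, in line with (2.9), and the second example has $C_2 = C_1/2$, showing that the two functions generally differ.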
As elaborated in Sec. 2.4.1, $C_1$ has operational meanings: it bounds our eigenvalue error,
$C_1 \geq \Delta_\lambda$, and it is equivalent to our eigenvector error, $C_1 = \Delta_v$. However, its landscape tends to be
insensitive to changes in $U_p(\vec{\alpha})$ for large $n$. In contrast, we are not aware of a direct operational
meaning for $C_2$, aside from its bound on $C_1$ given by $C_2 \geq (1/n) C_1$. However, the landscape for $C_2$
is more sensitive to changes in $U_p(\vec{\alpha})$, making it useful for training $U_p(\vec{\alpha})$ when $n$ is large. Due to
these contrasting merits of $C_1$ and $C_2$, we define our overall cost function $C$ as a weighted average
of these two functions
\[ C(U_p(\vec{\alpha})) = q\, C_1(U_p(\vec{\alpha})) + (1 - q)\, C_2(U_p(\vec{\alpha})) , \tag{2.10} \]
where $q \in [0, 1]$ is a free parameter that allows one to tailor the VQSD method to the scale of one’s
problem. For small $n$, one can set $q \approx 1$ since the landscape for $C_1$ is not too flat for small $n$, and,
as noted above, $C_1$ is an operationally relevant quantity. For large $n$, one can set $q$ to be small since
the landscape for $C_2$ will provide the gradient needed to train $U_p(\vec{\alpha})$. The overall cost maintains
the operational meaning in (2.5) with
\[ \beta = n/(1 + q(n - 1)) . \tag{2.11} \]
Section 2.9 illustrates the advantages of training with different values of $q$.
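The value of $\beta$ in (2.11) follows from the relations quoted above, namely $C_2 \geq C_1/n$, $C_1 \geq \Delta_\lambda$, and $C_1 = \Delta_v$. A short derivation, included here as a sketch that takes those relations as given:

\[
C = q\, C_1 + (1 - q)\, C_2
\;\geq\; \left( q + \frac{1 - q}{n} \right) C_1
= \frac{1 + q(n - 1)}{n}\, C_1
= \frac{C_1}{\beta} ,
\]

so that $C_1 \leq \beta C$, and therefore $\Delta_\lambda \leq C_1 \leq \beta C$ and $\Delta_v = C_1 \leq \beta C$, which is (2.5). The two limits are $\beta = 1$ for $q = 1$ (the bound is as tight as for $C_1$ alone) and $\beta = n$ for $q = 0$.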
Computing $C$ amounts to evaluating the purities of various quantum states on a quantum
computer and then doing some simple classical post-processing that scales linearly in $n$. This can
be seen from Eqns. (2.7) and (2.8). The first term, $\mathrm{Tr}(\rho^2)$, in $C_1$ and $C_2$ is independent of $U_p(\vec{\alpha})$.
Hence, $\mathrm{Tr}(\rho^2)$ can be evaluated outside of the optimization loop in Fig. 2.1 using the Destructive
Swap Test (see Sec. 2.4.1 for the circuit diagram). Inside the loop, we only need to compute
$\mathrm{Tr}(\mathcal{Z}(\tilde{\rho})^2)$ and $\mathrm{Tr}(\mathcal{Z}_j(\tilde{\rho})^2)$ for all $j$. Each of these terms is computed by first preparing two
copies of $\tilde{\rho}$ and then implementing quantum circuits whose depths are constant in $n$. For example,
the circuit for computing $\mathrm{Tr}(\mathcal{Z}(\tilde{\rho})^2)$ is shown in Fig. 2.1(b), and surprisingly it has a depth of
only one gate. We call it the Diagonalized Inner Product (DIP) Test. The circuit for computing
$\mathrm{Tr}(\mathcal{Z}_j(\tilde{\rho})^2)$ is similar, and we call it the Partially Diagonalized Inner Product (PDIP) Test. We
elaborate on both of these circuits in Sec. 2.4.1.
2.2.1.3 Eigenvalue readout
After finding the optimal diagonalizing unitary $U_p(\vec{\alpha}_{\mathrm{opt}})$, one can use it to read out approximations
of the eigenvalues of $\rho$. Figure 2.1(c) shows the circuit for this readout. One prepares a single copy
of $\rho$ and then acts with $U_p(\vec{\alpha}_{\mathrm{opt}})$ to prepare $\tilde{\rho}_p(\vec{\alpha}_{\mathrm{opt}})$. Measuring in the standard basis $\{|\vec{z}\rangle\}$, where
$\vec{z} = z_1 z_2 ... z_n$ is a bitstring of length $n$, gives a set of probabilities $\{\tilde{\lambda}_{\vec{z}}\}$ with
\[ \tilde{\lambda}_{\vec{z}} = \langle \vec{z} | \tilde{\rho}_p(\vec{\alpha}_{\mathrm{opt}}) | \vec{z} \rangle . \tag{2.12} \]
We take the $\tilde{\lambda}_{\vec{z}}$ as the inferred eigenvalues of $\rho$. We emphasize that the $\tilde{\lambda}_{\vec{z}}$ are the diagonal elements,
not the eigenvalues, of $\tilde{\rho}_p(\vec{\alpha}_{\mathrm{opt}})$.
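Each run of the circuit in Fig. 2.1(c) generates a bitstring $\vec{z}$ corresponding to the measurement
outcomes. If one obtains $\vec{z}$ with frequency $f_{\vec{z}}$ for $N_{\mathrm{readout}}$ total runs, then
\[ \tilde{\lambda}^{\mathrm{est}}_{\vec{z}} = f_{\vec{z}} / N_{\mathrm{readout}} \tag{2.13} \]
gives an estimate for $\tilde{\lambda}_{\vec{z}}$. The statistical deviation of $\tilde{\lambda}^{\mathrm{est}}_{\vec{z}}$ from $\tilde{\lambda}_{\vec{z}}$ goes as $1/\sqrt{N_{\mathrm{readout}}}$. The relative
error $\epsilon_{\vec{z}}$ (i.e., the ratio of the statistical error on $\tilde{\lambda}^{\mathrm{est}}_{\vec{z}}$ to the value of $\tilde{\lambda}^{\mathrm{est}}_{\vec{z}}$) then goes as
\[ \epsilon_{\vec{z}} = \frac{1}{\sqrt{N_{\mathrm{readout}}}\, \tilde{\lambda}^{\mathrm{est}}_{\vec{z}}} = \frac{\sqrt{N_{\mathrm{readout}}}}{f_{\vec{z}}} . \tag{2.14} \]
This implies that events $\vec{z}$ with higher frequency $f_{\vec{z}}$ have lower relative error. In other words,
the larger the inferred eigenvalue $\tilde{\lambda}_{\vec{z}}$, the lower the relative error, and hence the more precisely
it is determined from the experiment. When running VQSD, one can pre-decide on the desired
values of $N_{\mathrm{readout}}$ and a threshold for the relative error, denoted $\epsilon_{\max}$. This error threshold $\epsilon_{\max}$ will
then determine $m$, i.e., how many of the largest eigenvalues get precisely characterized. So
$m = m(N_{\mathrm{readout}}, \epsilon_{\max}, \{\tilde{\lambda}_{\vec{z}}\})$ is a function of $N_{\mathrm{readout}}$, $\epsilon_{\max}$, and the set of inferred eigenvalues $\{\tilde{\lambda}_{\vec{z}}\}$.
Precisely, we take $m = |\vec{\tilde{\lambda}}^{\mathrm{est}}|$ as the cardinality of the following set:
\[ \vec{\tilde{\lambda}}^{\mathrm{est}} = \{ \tilde{\lambda}^{\mathrm{est}}_{\vec{z}} : \epsilon_{\vec{z}} \leq \epsilon_{\max} \} , \tag{2.15} \]
which is the set of inferred eigenvalues that were estimated with the desired precision.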
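The classical post-processing in (2.13) to (2.15) amounts to counting bitstrings and thresholding on the relative error. A minimal sketch follows; the measurement record is made-up illustrative data, and the function name and return format are our own choices.

```python
import math
from collections import Counter


def read_out_eigenvalues(bitstrings, eps_max):
    """Estimate eigenvalues from measured bitstrings, following (2.13)-(2.15)."""
    n_readout = len(bitstrings)
    kept = {}
    for z, f in Counter(bitstrings).items():
        estimate = f / n_readout              # (2.13)
        rel_error = math.sqrt(n_readout) / f  # (2.14)
        if rel_error <= eps_max:              # membership condition of (2.15)
            kept[z] = (estimate, rel_error)
    return kept


# Hypothetical measurement record: 1000 shots on two qubits.
shots = ["00"] * 700 + ["01"] * 250 + ["10"] * 45 + ["11"] * 5
result = read_out_eigenvalues(shots, eps_max=0.2)

m = len(result)
print(m)                # 2
print(sorted(result))   # ['00', '01']
```

With this record only the two most frequent bitstrings pass the threshold, so $m = 2$; lowering $\epsilon_{\max}$ or reducing the number of shots shrinks the set further.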
2.2.1.4 Eigenvector preparation
The final step of VQSD is to prepare the eigenvectors associated with the $m$ largest eigenvalues, i.e.,
the eigenvalues in the set in Eq. (2.15). Let $\vec{Z} = \{\vec{z} : \tilde{\lambda}^{\mathrm{est}}_{\vec{z}} \in \vec{\tilde{\lambda}}^{\mathrm{est}}\}$ be the set of bitstrings $\vec{z}$ associated
with the eigenvalues in $\vec{\tilde{\lambda}}^{\mathrm{est}}$. (Note that these bitstrings are obtained directly from the measurement
outcomes of the circuit in Fig. 2.1(c), i.e., the outcomes become the bitstring $\vec{z}$.) For each $\vec{z} \in \vec{Z}$,
one can prepare the following state, which we take as the inferred eigenvector associated with our
estimate of the inferred eigenvalue $\tilde{\lambda}^{\mathrm{est}}_{\vec{z}}$,
\begin{align}
|\tilde{v}_{\vec{z}}\rangle &= U_p(\vec{\alpha}_{\mathrm{opt}})^\dagger |\vec{z}\rangle \tag{2.16} \\
&= U_p(\vec{\alpha}_{\mathrm{opt}})^\dagger (X^{z_1} \otimes \cdots \otimes X^{z_n}) |\vec{0}\rangle . \tag{2.17}
\end{align}
The circuit for preparing this state is shown in Fig. 2.1(d). As noted in (2.17), one first prepares $|\vec{z}\rangle$
by acting with $X$ operators raised to the appropriate powers, and then one acts with $U_p(\vec{\alpha}_{\mathrm{opt}})^\dagger$ to
rotate from the standard basis to the inferred eigenbasis.
Once they are prepared on the quantum computer, each inferred eigenvector | 𝑣˜ 𝑧®⟩ can be char-
acterized by measuring expectation values of interest. That is, important physical features such
as energy or entanglement (e.g., entanglement witnesses) are associated with some Hermitian
observable 𝑀, and one can evaluate the expectation value ⟨𝑣˜ 𝑧® |𝑀 | 𝑣˜ 𝑧®⟩ to learn about these features.
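As a small classical illustration of Eqs. (2.16)-(2.17), the sketch below builds an inferred eigenvector from a bitstring and evaluates an expectation value with NumPy. The helper names are hypothetical, and the one-qubit example assumes that the Hadamard gate plays the role of the optimized unitary for the plus state.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
KET0 = np.array([1, 0], dtype=complex)

def inferred_eigenvector(U, bits):
    """Return U^dagger |z> for bitstring z, following Eqs. (2.16)-(2.17)."""
    ket = np.array([1], dtype=complex)
    for b in bits:
        # X^{z_i}|0> is |0> when z_i = 0 and |1> when z_i = 1.
        ket = np.kron(ket, np.linalg.matrix_power(X, int(b)) @ KET0)
    return U.conj().T @ ket

def expectation(M, ket):
    """Expectation value <v|M|v> of a Hermitian observable M."""
    return float(np.real(np.vdot(ket, M @ ket)))

# Illustrative one-qubit case: the Hadamard gate diagonalizes |+><+|.
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
v0 = inferred_eigenvector(H, "0")    # the state |+>
exp_x = expectation(X, v0)           # <+|X|+>
```

On hardware the state would be prepared by a circuit and the expectation value estimated from measurements; the dense vectors here are only for checking the algebra.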
2.2.2 Implementations
Here we present our implementations of VQSD, first for a one-qubit state on a cloud quantum
computer to show that it is amenable to currently available hardware. Then, to illustrate the scaling
to larger, more interesting problems, we implement VQSD on a simulator for the 12-spin ground
state of the Heisenberg model. See Sections 2.6 and 2.7 for further details. The code used to
generate some of the examples presented here can be accessed from [45].
[Figure 2.3 plot. Panels: (a), (b). Panel (a) horizontal axis: Iteration.]
Figure 2.3: The VQSD algorithm run on Rigetti’s 8Q-Agave quantum computer for 𝜌 = |+⟩⟨+|. (a)
A representative run of the parameter optimization loop, using the Powell optimization algorithm
(see Sec. 2.4.2 for details and Section 2.6 for data from additional runs). Cost versus iteration
is shown by the black solid line. The dotted lines show the two inferred eigenvalues. After four
iterations, the inferred eigenvalues approach {0, 1}, as required for a pure state. (b) The cost
landscape on a noiseless simulator, Rigetti’s noisy simulator, and Rigetti’s quantum computer.
Error bars show the standard deviation (due to finite sampling) of multiple runs. The local minima
occur roughly at the theoretically predicted values of 𝜋/2 and 3𝜋/2. During data collection for this
plot, the 8Q-Agave quantum computer retuned, after which its cost landscape closely matched that
of the noisy simulator.
2.2.2.1 One-qubit state
We now discuss the results of applying VQSD to the one-qubit plus state 𝜌 = |+⟩⟨+| on the 8Q-
Agave quantum computer provided by Rigetti [46]. Because the problem size is small (𝑛 = 1), we
set 𝑞 = 1 in the cost function (2.10). Since 𝜌 is a pure state, the cost function is
$C(U_p(\vec{\alpha})) = C_1(U_p(\vec{\alpha})) = 1 - \mathrm{Tr}(\mathcal{Z}(\tilde{\rho})^2)$ . (2.18)
For $U_p(\vec{\alpha})$, we take $p = 1$, for which the layered ansatz becomes an arbitrary single qubit rotation.
The results of VQSD for this state are shown in Fig. 2.3. In Fig. 2.3(a), the solid curve shows the
cost versus the number of iterations in the parameter optimization loop, and the dashed curves show
the inferred eigenvalues of 𝜌 at each iteration. Here we used the Powell optimization algorithm,
see Section 2.4.2 for more details. As can be seen, the cost decreases to a small value near zero
and the eigenvalue estimates simultaneously converge to the correct values of zero and one. Hence,
VQSD successfully diagonalized this state.
Figure 2.3(b) shows the landscape of the optimization problem on Rigetti’s 8Q-Agave quantum
computer, Rigetti’s noisy simulator, and a noiseless simulator. Here, we varied the angle 𝛼 in the
diagonalizing unitary 𝑈 (𝛼) = 𝑅𝑥 (𝜋/2)𝑅𝑧 (𝛼) and computed the cost at each value of this angle.
The landscape on the quantum computer has local minima near the optimal angles 𝛼 = 𝜋/2, 3𝜋/2
but the cost is not zero. This explains why we obtain the correct eigenvalues even though the
cost is nonzero in Fig. 2.3(a). The nonzero cost can be due to a combination of decoherence,
gate infidelity, and measurement error. As shown in Fig. 2.3(b), the 8Q-Agave quantum computer
retuned during our data collection, and after this retuning, the landscape of the quantum computer
matched that of the noisy simulator significantly better.
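The noiseless landscape can be reproduced with a few lines of NumPy. This sketch assumes the standard conventions $R_x(\theta) = e^{-i\theta X/2}$ and $R_z(\alpha) = e^{-i\alpha Z/2}$, which may differ from the hardware's native definitions by phases that do not affect the cost.

```python
import numpy as np

def rx(theta):
    """Rotation about the x-axis, exp(-i theta X / 2)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def rz(alpha):
    """Rotation about the z-axis, exp(-i alpha Z / 2)."""
    return np.diag([np.exp(-1j * alpha / 2), np.exp(1j * alpha / 2)])

def cost(alpha):
    """C = 1 - Tr(Z(rho_tilde)^2) for rho = |+><+| and U = Rx(pi/2) Rz(alpha)."""
    plus = np.array([1, 1], dtype=complex) / np.sqrt(2)
    psi = rx(np.pi / 2) @ rz(alpha) @ plus
    probs = np.abs(psi) ** 2             # diagonal elements of rho_tilde
    return 1.0 - float(np.sum(probs ** 2))

alphas = np.linspace(0, 2 * np.pi, 9)    # multiples of pi/4
landscape = [cost(a) for a in alphas]
```

Under these conventions the cost vanishes at $\alpha = \pi/2$ and $3\pi/2$ and reaches its maximum of 1/2 at $\alpha = 0$ and $\pi$, consistent with the minima quoted above.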
2.2.2.2 Heisenberg model ground state
While current noise levels of quantum hardware limit our implementations of VQSD to small
problem sizes, we can explore larger problem sizes on a simulator. An important application of
VQSD is to study the entanglement in condensed matter systems, and we highlight this application
in the following example.
Let us consider the ground state of the 1D Heisenberg model, the Hamiltonian of which is
$H = \sum_{j=1}^{2n} \vec{S}^{(j)} \cdot \vec{S}^{(j+1)}$ , (2.19)
with $\vec{S}^{(j)} = (1/2)(\sigma_x^{(j)} \hat{x} + \sigma_y^{(j)} \hat{y} + \sigma_z^{(j)} \hat{z})$ and periodic boundary conditions, $\vec{S}^{(2n+1)} = \vec{S}^{(1)}$.
Performing entanglement spectroscopy on the ground state |𝜓⟩ 𝐴𝐵 involves diagonalizing the reduced
state 𝜌 = Tr𝐵 (|𝜓⟩⟨𝜓| 𝐴𝐵 ). Here we consider a total of 8 spins (2𝑛 = 8). We take 𝐴 to be a subset of
4 nearest-neighbor spins, and 𝐵 is the complement of 𝐴.
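For a chain this small, the reduced state and its spectrum can also be obtained by exact classical diagonalization, which gives reference values for the inferred eigenvalues. The sketch below is an illustrative dense-matrix construction of Eq. (2.19), not the simulator code of this work; the function names are hypothetical and subsystem 𝐴 is taken to be the first four spins.

```python
import numpy as np

def heisenberg_hamiltonian(n_spins):
    """Dense H = sum_j S^(j) . S^(j+1) with periodic boundaries, Eq. (2.19)."""
    paulis = [
        np.array([[0, 1], [1, 0]], dtype=complex),
        np.array([[0, -1j], [1j, 0]], dtype=complex),
        np.array([[1, 0], [0, -1]], dtype=complex),
    ]
    dim = 2 ** n_spins
    H = np.zeros((dim, dim), dtype=complex)
    for j in range(n_spins):
        k = (j + 1) % n_spins                    # periodic neighbor
        for sigma in paulis:
            ops = [np.eye(2, dtype=complex)] * n_spins
            ops[j] = sigma / 2
            ops[k] = sigma / 2
            term = ops[0]
            for op in ops[1:]:
                term = np.kron(term, op)
            H += term
    return H

def reduced_state_first_half(psi, n_spins):
    """Trace out the last n_spins/2 spins of the pure state psi."""
    d = 2 ** (n_spins // 2)
    m = psi.reshape(d, d)        # rows: subsystem A, columns: subsystem B
    return m @ m.conj().T

n_spins = 8
energies, states = np.linalg.eigh(heisenberg_hamiltonian(n_spins))
rho_a = reduced_state_first_half(states[:, 0], n_spins)
spectrum = np.sort(np.linalg.eigvalsh(rho_a))[::-1]   # entanglement spectrum
```

This brute-force approach stores a $2^{2n} \times 2^{2n}$ matrix, so it is only practical for small chains.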
The results of applying VQSD to the 4-spin reduced state 𝜌 via a simulator are shown in Fig. 2.4.
Panel (a) plots the inferred eigenvalues versus the number of layers 𝑝 in our ansatz (see Fig. 2.2).
Figure 2.4: Implementing VQSD with a simulator for the ground state of the 1D Heisenberg model,
diagonalizing a 4-spin subsystem of a chain of 8 spins. We chose 𝑞 = 1 for the cost in (2.10) and
employed a gradient-based method to find 𝛼 ® opt . (a) Largest inferred eigenvalues 𝜆˜ 𝑗 versus 1/𝑝,
where 𝑝 is the number of layers in our ansatz, which in this example takes half-integer values
corresponding to fractions of layers shown in Fig. 2.2. The exact eigenvalues are shown on the
𝑦-axis (along 1/𝑝 = 0 line) with their degeneracy indicated in parentheses. One can see the largest
eigenvalues converge to their correct values, including the correct degeneracies. Inset: overall
eigenvalue error Δ𝜆 versus 1/𝑝. (b) Largest inferred eigenvalues resolved by the inferred ⟨𝑆 𝑧 ⟩
quantum number of their associated eigenvector, for 𝑝 = 5. The inferred data points (red X’s)
roughly agree with the theoretical values (black circles), particularly for the largest eigenvalues.
Section 2.7 discusses a Heisenberg chain of 12 spins.
One can see that the inferred eigenvalues converge to their theoretical values as 𝑝 increases. Panel
(b) plots the inferred eigenvalues resolved by their associated quantum numbers (𝑧-component of
total spin). This plot illustrates the feature we noted previously that minimizing our cost will
first result in minimizing the eigenvector error for those eigenvectors with the largest eigenvalues.
Overall our VQSD implementation returned roughly the correct values for both the eigenvalues
and their quantum numbers. Resolving not only the eigenvalues but also their quantum numbers is
important for entanglement spectroscopy [35], and clearly VQSD can do this.
In Section 2.7 we discuss an alternative approach employing a variable ansatz for 𝑈 𝑝 ( 𝛼® ), and
we present results of applying this approach to a 6-qubit reduced state of the 12-qubit ground state
of the Heisenberg model.
2.3 Discussion
We emphasize that VQSD is meant for states 𝜌 that have either low rank or possibly high rank but
low entropy 𝐻 (𝜌) = −Tr(𝜌 log 𝜌). This is because the eigenvalue readout step of VQSD would
be exponentially complex for states with high entropy. In other words, for high entropy states, if
one efficiently implemented the eigenvalue readout step (with 𝑁readout polynomial in 𝑛), then very
few eigenvalues would get characterized with the desired precision. In Section 2.11 we discuss the
complexity of VQSD for particular example states.
Examples of states for which VQSD is expected to be efficient include density matrices computed
from ground states of 1D, local, gapped Hamiltonians. Also, thermal states of some 1D systems
in a many-body localized phase at low enough temperature are expected to be diagonalizable by
VQSD. These states have rapidly decaying spectra and are eigendecomposed into states obeying
a 1D area law [47, 48, 49]. This means that every eigenstate can be prepared by a constant depth
circuit in alternating ansatz form [48], and hence VQSD will be able to diagonalize it.
2.3.1 Comparison to literature
Diagonalizing quantum states with classical methods would require exponentially large memory to
store the density matrix, and the matrix operations needed for diagonalization would be exponen-
tially costly. VQSD avoids both of these scaling issues.
Another quantum algorithm that extracts the eigenvalues and eigenvectors of a quantum state
is qPCA [16]. Similar to VQSD, qPCA has the potential for exponential speedup over classical
diagonalization for particular classes of quantum states. Like VQSD, the speedup in qPCA is
contingent on 𝜌 being a low-entropy state.
We performed a simple implementation of qPCA to get a sense for how it compares to VQSD,
see Section 2.12 for details. In particular, just like we did for Fig. 2.3, we considered the one-
qubit plus state 𝜌 = |+⟩⟨+|. We implemented qPCA for this state on Rigetti’s noisy simulator
(whose noise is meant to mimic that of their 8Q-Agave quantum computer). The circuit that we
implemented applied one controlled-exponential-swap gate (in order to approximately exponentiate
𝜌, as discussed in [16]). We employed a machine-learning approach [50] to compile the controlled-exponential-swap
gate into a novel short-depth gate sequence (see Section 2.12). With this circuit
we inferred the two eigenvalues of 𝜌 to be approximately 0.8 and 0.2. Hence, for this simple
example, it appears that qPCA gave eigenvalues that were slightly off from the true values of 1 and
0, while VQSD was able to obtain the correct eigenvalues, as discussed in Fig. 2.3.
2.3.2 Future applications
Finally we discuss various applications of VQSD.
As noted in Ref. [16], one application of quantum state diagonalization is benchmarking of
quantum noise processes, i.e., quantum process tomography. Here one prepares the Choi state by
sending half of a maximally entangled state through the process of interest. One can apply VQSD
to the resulting Choi state to learn about the noise process, which may be particularly useful for
benchmarking near-term quantum computers.
A special case of VQSD is variational state preparation. That is, if one applies VQSD to a
pure state 𝜌 = |𝜓⟩⟨𝜓|, then one can learn the unitary 𝑈 ( 𝛼® ) that maps |𝜓⟩ to a standard basis state.
Inverting this unitary allows one to map a standard basis state (and hence the state |0⟩ ⊗𝑛 ) to the state
|𝜓⟩, which is known as state preparation. Hence, if one is given |𝜓⟩ in quantum form, then VQSD
can potentially find a short-depth circuit that approximately prepares |𝜓⟩. Variational quantum
compiling algorithms that were very recently proposed [51, 52] may also be used for this same
purpose, and hence it would be interesting to compare VQSD to these algorithms for this special
case. Additionally, in this special case one could use VQSD and these other algorithms as an error
mitigation tool, i.e., to find a short-depth state preparation that achieves higher accuracy than the
original state preparation.
In machine learning, PCA is a subroutine in supervised and unsupervised learning algorithms
and also has many direct applications. PCA inputs a data matrix 𝑋 and finds a new basis such
that the variance is maximal along the new basis vectors. One can show that this amounts to
finding the eigenvectors of the covariance matrix 𝐸 [𝑋 𝑋 𝑇 ] with the largest eigenvalues, where
𝐸 denotes expectation value. Thus PCA involves diagonalizing a positive-semidefinite matrix,
𝐸 [𝑋 𝑋 𝑇 ]. Hence VQSD can perform this task provided one has access to QRAM [36] to prepare
the covariance matrix as a quantum state. PCA can reduce the dimension of 𝑋 as well as filter out
noise in data. In addition, nonlinear (kernel) PCA can be used on data that is not linearly separable.
Very recent work by Tang [53] suggests that classical algorithms could be improved for PCA of
low-rank matrices, and potentially obtain similar scaling as qPCA and VQSD. Hence future work
is needed to compare these different approaches to PCA.
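For comparison, classical PCA by direct diagonalization of the covariance matrix takes only a few lines. The sketch below uses a hypothetical synthetic data set and assumes the convention that columns of 𝑋 are samples.

```python
import numpy as np

def pca(X, k):
    """Top-k principal components of data matrix X (features x samples)."""
    Xc = X - X.mean(axis=1, keepdims=True)       # center each feature
    cov = Xc @ Xc.T / X.shape[1]                 # empirical E[X X^T]
    evals, evecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]          # indices of the k largest
    return evals[order], evecs[:, order]

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Hypothetical 2D data spread mostly along the direction (1, 1)/sqrt(2).
X = np.vstack([t, t]) + 0.01 * rng.normal(size=(2, 200))
variances, components = pca(X, k=1)
```

The leading component recovers the direction of largest variance; the cost of this dense approach grows with the dimension of the covariance matrix.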
Perhaps the most important near-term application of VQSD is to study condensed matter physics.
In particular, we propose that one can apply the variational quantum eigensolver [22] to prepare
the ground state of a many-body system, and then one can follow this with the VQSD algorithm to
characterize the entanglement in this state. Ultimately this approach could elucidate key properties
of condensed matter phases. In particular, VQSD allows for entanglement spectroscopy, which has
direct application to the identification of topological order [54]. Extracting both the eigenvalues
and eigenvectors is useful for entanglement spectroscopy [54], and we illustrated this capability of
VQSD in Fig. 2.4. Finally, an interesting future research direction is to check how the discrepancies
in preparation of multiple copies affect the performance of the diagonalization.
2.4 Methods
2.4.1 Diagonalization test circuits
Here we elaborate on the cost functions 𝐶1 and 𝐶2 and present short-depth quantum circuits to
compute them.
Figure 2.5: Diagonalization test circuits used in VQSD. (a) The Destructive Swap Test com-
putes Tr(𝜎𝜏) via a depth-two circuit. (b) The Diagonalized Inner Product (DIP) Test computes
Tr(Z(𝜎)Z(𝜏)) via a depth-one circuit. (c) The Partially Diagonalized Inner Product (PDIP) Test
computes Tr(Z 𝑗®(𝜎)Z 𝑗®(𝜏)) via a depth-two circuit, for a particular set of qubits 𝑗®. While the DIP
test requires no postprocessing, the postprocessing for the Destructive Swap Test and the Partial
DIP Test scales linearly in 𝑛.
2.4.1.1 𝐶1 and the DIP Test
The function 𝐶1 defined in (2.7) has several intuitive interpretations. These interpretations make
it clear that $C_1$ quantifies how far a state is from being diagonal. In particular, let $D_{\rm HS}(A, B) := \mathrm{Tr}\left[(A - B)^\dagger (A - B)\right]$
denote the Hilbert-Schmidt distance. Then we can write
$C_1 = \min_{\sigma \in \mathcal{D}} D_{\rm HS}(\tilde{\rho}, \sigma)$ (2.20)
$= D_{\rm HS}(\tilde{\rho}, \mathcal{Z}(\tilde{\rho}))$ (2.21)
$= \sum_{\vec{z},\, \vec{z}' \neq \vec{z}} |\langle \vec{z} | \tilde{\rho} | \vec{z}' \rangle|^2$ . (2.22)
In other words, $C_1$ is (1) the minimum distance between $\tilde{\rho}$ and the set of diagonal states $\mathcal{D}$, (2) the
distance from $\tilde{\rho}$ to $\mathcal{Z}(\tilde{\rho})$, and (3) the sum of the absolute squares of the off-diagonal elements of $\tilde{\rho}$.
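These equivalent forms of $C_1$ are easy to check numerically. The sketch below draws a random density matrix to stand in for $\tilde{\rho}$ (an arbitrary choice for illustration) and compares the purity difference, the distance to the dephased state, and the off-diagonal sum.

```python
import numpy as np

def random_density_matrix(dim, rng):
    """Random full-rank density matrix (illustrative stand-in for rho_tilde)."""
    A = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = A @ A.conj().T
    return rho / np.trace(rho).real

def hs_distance(A, B):
    """Hilbert-Schmidt distance Tr[(A - B)^dagger (A - B)]."""
    D = A - B
    return float(np.real(np.trace(D.conj().T @ D)))

rng = np.random.default_rng(1)
rho_t = random_density_matrix(4, rng)
dephased = np.diag(np.diag(rho_t))               # Z(rho_tilde)

purity_diff = float(np.real(np.trace(rho_t @ rho_t) - np.trace(dephased @ dephased)))
dist_to_dephased = hs_distance(rho_t, dephased)              # Eq. (2.21)
offdiag_sum = float(np.sum(np.abs(rho_t - dephased) ** 2))   # Eq. (2.22)

# Any other diagonal state should be at least as far away, as in Eq. (2.20).
other_diag = np.diag(rng.dirichlet(np.ones(4))).astype(complex)
dist_to_other = hs_distance(rho_t, other_diag)
```

The purity difference uses $\mathrm{Tr}(\tilde{\rho}^2) = \mathrm{Tr}(\rho^2)$, which holds because $\tilde{\rho}$ and $\rho$ are related by a unitary.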
𝐶1 can also be written as the eigenvector error in (2.4) as follows. For an inferred eigenvector
| 𝑣˜ 𝑧®⟩, we define |𝛿 𝑧®⟩ = 𝜌| 𝑣˜ 𝑧®⟩ − 𝜆˜ 𝑧® | 𝑣˜ 𝑧®⟩ and write the eigenvector error as
⟨𝛿 𝑧® |𝛿 𝑧®⟩ = ⟨𝑣˜ 𝑧® |𝜌 2 | 𝑣˜ 𝑧®⟩ + 𝜆˜ 2𝑧® − 2𝜆˜ 𝑧® ⟨𝑣˜ 𝑧® |𝜌| 𝑣˜ 𝑧®⟩ (2.23)
= ⟨𝑣˜ 𝑧® |𝜌 2 | 𝑣˜ 𝑧®⟩ − 𝜆˜ 2𝑧® , (2.24)
since ⟨𝑣˜ 𝑧® |𝜌| 𝑣˜ 𝑧®⟩ = 𝜆˜ 𝑧®. Summing over all 𝑧® gives
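$\Delta_v = \sum_{\vec{z}} \langle \delta_{\vec{z}} | \delta_{\vec{z}} \rangle = \sum_{\vec{z}} \left( \langle \tilde{v}_{\vec{z}} | \rho^2 | \tilde{v}_{\vec{z}} \rangle - \tilde{\lambda}_{\vec{z}}^2 \right)$ (2.25)
$= \mathrm{Tr}(\rho^2) - \mathrm{Tr}(\mathcal{Z}(\tilde{\rho})^2) = C_1$ , (2.26)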
which proves the bound in (2.5) for 𝑞 = 1.
In addition, 𝐶1 bounds the eigenvalue error defined in (2.3). Let 𝜆®˜ = (𝜆˜ 1 , ..., 𝜆˜ 𝑑 ) and 𝜆® =
(𝜆1 , ..., 𝜆 𝑑 ) denote the inferred and actual eigenvalues of 𝜌, respectively, both arranged in decreasing
order. In this notation we have
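$\Delta_\lambda = \vec{\lambda} \cdot \vec{\lambda} + \vec{\tilde{\lambda}} \cdot \vec{\tilde{\lambda}} - 2 \vec{\lambda} \cdot \vec{\tilde{\lambda}}$ (2.27)
$C_1 = \vec{\lambda} \cdot \vec{\lambda} - \vec{\tilde{\lambda}} \cdot \vec{\tilde{\lambda}}$ (2.28)
$= \Delta_\lambda + 2 (\vec{\lambda} \cdot \vec{\tilde{\lambda}} - \vec{\tilde{\lambda}} \cdot \vec{\tilde{\lambda}})$ . (2.29)
Since the eigenvalues of a density matrix majorize its diagonal elements, $\vec{\lambda} \succ \vec{\tilde{\lambda}}$, and the dot product
with an ordered vector is a Schur convex function, we have
$\vec{\lambda} \cdot \vec{\tilde{\lambda}} \geq \vec{\tilde{\lambda}} \cdot \vec{\tilde{\lambda}}$ . (2.30)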
Hence from (2.29) and (2.30) we obtain the bound
Δ𝜆 ≤ 𝐶1 , (2.31)
which corresponds to the bound in (2.5) for the special case of 𝑞 = 1.
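Both relations for $q = 1$, namely $\Delta_v = C_1$ and $\Delta_\lambda \leq C_1$, can be spot-checked on a random instance. In the sketch below a random unitary from a QR decomposition stands in for a non-optimal $U_p(\vec{\alpha})$; this is an illustrative choice, not the layered ansatz.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = A @ A.conj().T
rho /= np.trace(rho).real

# A random (non-optimal) unitary stands in for U_p(alpha).
U, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
rho_t = U @ rho @ U.conj().T

lam = np.sort(np.linalg.eigvalsh(rho))[::-1]       # exact eigenvalues, decreasing
lam_t = np.sort(np.real(np.diag(rho_t)))[::-1]     # inferred eigenvalues, decreasing

c1 = float(np.real(np.trace(rho @ rho)) - np.sum(lam_t ** 2))
delta_lambda = float(np.sum((lam - lam_t) ** 2))   # Eq. (2.27)

# Eigenvector error: sum_z || rho|v_z> - lam_z|v_z> ||^2 with |v_z> = U^dagger|z>.
V = U.conj().T                                     # columns are |v_z>
diag_t = np.real(np.diag(rho_t))                   # lam_z in measurement order
residual = rho @ V - V * diag_t                    # column z is |delta_z>
delta_v = float(np.sum(np.abs(residual) ** 2))     # Eq. (2.25)
```

A single random instance is of course not a proof, but it is a quick sanity check on signs and orderings in Eqs. (2.23)-(2.31).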
For computational purposes, we use the difference of purities interpretation of 𝐶1 given in (2.7).
The Tr(𝜌 2 ) term is independent of 𝑈 𝑝 ( 𝛼 ® ). Hence it only needs to be evaluated once, outside of the
parameter optimization loop. It can be computed via the expectation value of the swap operator 𝑆
on two copies of 𝜌, using the identity
Tr(𝜌 2 ) = Tr((𝜌 ⊗ 𝜌)𝑆) . (2.32)
This expectation value is found with a depth-two quantum circuit that essentially corresponds to a
Bell-basis measurement, with classical post-processing that scales linearly in the number of qubits
[55, 50]. This is shown in Fig. 2.5(a). We call this procedure the Destructive Swap Test, since it is
like the Swap Test, but the measurement occurs on the original systems instead of on an ancilla.
Similarly, the $\mathrm{Tr}(\mathcal{Z}(\tilde{\rho})^2)$ term could be evaluated by first dephasing $\tilde{\rho}$ and then performing
the Destructive Swap Test, which would involve a depth-three quantum circuit with linear classical
post-processing. This approach was noted in Ref. [56]. However, there exists a simpler circuit,
which we call the Diagonalized Inner Product (DIP) Test. The DIP Test involves a depth-one
quantum circuit with no classical post-processing. An abstract version of this circuit is shown in
Fig. 2.5(b), for two states $\sigma$ and $\tau$. The proof that this circuit computes $\mathrm{Tr}(\mathcal{Z}(\sigma)\mathcal{Z}(\tau))$ is given in
Section 2.13. For our application we will set $\sigma = \tau = \tilde{\rho}$, for which this circuit gives $\mathrm{Tr}(\mathcal{Z}(\tilde{\rho})^2)$.
In summary, $C_1$ is efficiently computed by using the Destructive Swap Test for the $\mathrm{Tr}(\rho^2)$ term
and the DIP Test for the $\mathrm{Tr}(\mathcal{Z}(\tilde{\rho})^2)$ term.
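A one-qubit density-matrix simulation illustrates both ingredients. The sketch assumes that the DIP Test applies a CNOT controlled on the $\sigma$ qubit and targeting the $\tau$ qubit, and that the probability of reading 0 on the target is the quantity of interest; this is a reading of Fig. 2.5(b), with the proof deferred to Section 2.13.

```python
import numpy as np

def random_qubit_state(rng):
    """Random one-qubit density matrix (illustrative input)."""
    A = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    rho = A @ A.conj().T
    return rho / np.trace(rho).real

rng = np.random.default_rng(3)
sigma, tau = random_qubit_state(rng), random_qubit_state(rng)

# Two-qubit operators in the basis |00>, |01>, |10>, |11> (first qubit holds sigma).
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)
SWAP = np.array([[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]], dtype=complex)

# DIP Test: CNOT from the sigma qubit onto the tau qubit, then measure the target.
joint = CNOT @ np.kron(sigma, tau) @ CNOT.conj().T
p_target_zero = float(np.real(joint[0, 0] + joint[2, 2]))   # outcomes |00> and |10>
dip_exact = float(np.real(np.sum(np.diag(sigma) * np.diag(tau))))

# Swap identity of Eq. (2.32): Tr(rho^2) = Tr((rho x rho) S).
purity_via_swap = float(np.real(np.trace(np.kron(sigma, sigma) @ SWAP)))
purity_direct = float(np.real(np.trace(sigma @ sigma)))
```

The target reads 0 exactly when both qubits would have given the same outcome in the standard basis, which is why the probability equals $\sum_z \sigma_{zz} \tau_{zz}$.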
2.4.1.2 𝐶2 and the PDIP test
Like $C_1$, $C_2$ can also be rewritten in terms of the Hilbert-Schmidt distance. Namely, $C_2$ is the
average distance of $\tilde{\rho}$ to each locally-dephased state $\mathcal{Z}_j(\tilde{\rho})$:
$C_2 = \frac{1}{n} \sum_{j=1}^{n} D_{\rm HS}(\tilde{\rho}, \mathcal{Z}_j(\tilde{\rho}))$ , (2.33)
where $\mathcal{Z}_j(\cdot) = \sum_z (|z\rangle\langle z|_j \otimes I_{k \neq j})(\cdot)(|z\rangle\langle z|_j \otimes I_{k \neq j})$. Naturally, one would expect that $C_2 \leq C_1$,
since $\tilde{\rho}$ should be closer to each locally dephased state than to the fully dephased state. Indeed this
is true and can be seen from:
$C_2 = C_1 - \frac{1}{n} \sum_{j=1}^{n} \min_{\sigma \in \mathcal{D}} D_{\rm HS}(\mathcal{Z}_j(\tilde{\rho}), \sigma)$ . (2.34)
However, 𝐶1 and 𝐶2 vanish under precisely the same conditions, as noted in Eq. (2.9). One can see
this by noting that 𝐶2 also upper bounds (1/𝑛)𝐶1 and hence we have
𝐶2 ≤ 𝐶1 ≤ 𝑛𝐶2 . (2.35)
Combining the upper bound in (2.35) with the relations in (2.26) and (2.31) gives the bounds in
(2.5) with 𝛽 defined in (2.11). The upper bound in (2.35) is proved as follows. Let 𝑧® = 𝑧1 ...𝑧 𝑛 and
𝑧®′ = 𝑧′1 ...𝑧′𝑛 be 𝑛-dimensional bitstrings. Let S be the set of all pairs (®𝑧, 𝑧®′) such that 𝑧® ≠ 𝑧®′, and let
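$\mathcal{S}_j$ be the set of all pairs $(\vec{z}, \vec{z}')$ such that $z_j \neq z'_j$. Then we have $C_1 = \sum_{(\vec{z}, \vec{z}') \in \mathcal{S}} |\langle \vec{z} | \tilde{\rho} | \vec{z}' \rangle|^2$, and
$n C_2 = \sum_{j=1}^{n} \sum_{(\vec{z}, \vec{z}') \in \mathcal{S}_j} |\langle \vec{z} | \tilde{\rho} | \vec{z}' \rangle|^2$ (2.36)
$\geq \sum_{(\vec{z}, \vec{z}') \in \mathcal{S}_U} |\langle \vec{z} | \tilde{\rho} | \vec{z}' \rangle|^2 = C_1$ , (2.37)
where $\mathcal{S}_U = \bigcup_{j=1}^{n} \mathcal{S}_j$ is the union of all the $\mathcal{S}_j$ sets. The inequality in (2.37) arises from the fact
that the $\mathcal{S}_j$ sets have non-trivial intersection with each other, and hence we throw some terms away
when only considering the union $\mathcal{S}_U$. The last equality follows from the fact that $\mathcal{S}_U = \mathcal{S}$, i.e., the
set of all bitstring pairs that differ from each other ($\mathcal{S}$) corresponds to the set of all bitstring pairs
that differ for at least one element ($\mathcal{S}_U$).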
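The sandwich $C_2 \leq C_1 \leq n C_2$ can be checked numerically with elementwise masks that implement local dephasing. The sketch below is illustrative: the mask construction and the random three-qubit state are assumptions made for the example, with qubit 0 taken as the most significant bit.

```python
import numpy as np

def local_dephasing_mask(n, j):
    """mask[a, b] is True when bitstrings a and b agree on qubit j."""
    idx = np.arange(2 ** n)
    bit = (idx >> (n - 1 - j)) & 1
    return bit[:, None] == bit[None, :]

def costs(rho_t, n):
    """Return (C1, C2) computed from purities of dephased states."""
    purity = float(np.real(np.trace(rho_t @ rho_t)))
    full = float(np.sum(np.real(np.diag(rho_t)) ** 2))            # Tr(Z(rho)^2)
    local = [float(np.sum(np.abs(rho_t * local_dephasing_mask(n, j)) ** 2))
             for j in range(n)]                                    # Tr(Z_j(rho)^2)
    return purity - full, purity - sum(local) / n

rng = np.random.default_rng(4)
n = 3
A = rng.normal(size=(2 ** n, 2 ** n)) + 1j * rng.normal(size=(2 ** n, 2 ** n))
rho_t = A @ A.conj().T
rho_t /= np.trace(rho_t).real
c1, c2 = costs(rho_t, n)
```

Multiplying by the mask zeroes exactly the matrix elements whose row and column bitstrings differ on qubit $j$, which is the pair set $\mathcal{S}_j$ of the proof.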
Writing 𝐶2 in terms of purities, as in (2.8), shows how it can be computed on a quantum
computer. As in the case of 𝐶1 , the first term in (2.8) is computed with the Destructive Swap Test.
For the second term in (2.8), each purity $\mathrm{Tr}(\mathcal{Z}_j(\tilde{\rho})^2)$ could also be evaluated with the Destructive
Swap Test, by first locally dephasing the appropriate qubit. However, we present a slightly improved
circuit to compute these purities that we call the Partially Diagonalized Inner Product (PDIP) Test.
The PDIP Test is shown in Fig. 2.5(c) for the general case of feeding in two distinct states 𝜎 and
𝜏 with the goal of computing the inner product between Z 𝑗®(𝜎) and Z 𝑗®(𝜏). For generality we let
𝑙, with 0 ≤ 𝑙 ≤ 𝑛, denote the number of qubits being locally dephased for this computation. If
𝑙 > 0, we define 𝑗® = ( 𝑗 1 , . . . , 𝑗 𝑙 ) as a vector of indices that indicates which qubits are being locally
dephased. The PDIP Test is a hybrid of the Destructive Swap Test and the DIP Test, corresponding
to the former when 𝑙 = 0 and the latter when 𝑙 = 𝑛. Hence, it generalizes both the Destructive Swap
Test and the DIP Test. Namely, the PDIP Test performs the DIP Test on the qubits appearing in 𝑗®
and performs the Destructive Swap Test on the qubits not appearing in 𝑗®. The proof that the PDIP
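Test computes $\mathrm{Tr}(\mathcal{Z}_{\vec{j}}(\sigma)\mathcal{Z}_{\vec{j}}(\tau))$, and hence $\mathrm{Tr}(\mathcal{Z}_{\vec{j}}(\tilde{\rho})^2)$ when $\sigma = \tau = \tilde{\rho}$, is given in Section 2.13.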
2.4.1.3 𝐶1 versus 𝐶2
Here we discuss the contrasting merits of the functions 𝐶1 and 𝐶2 , hence motivating our cost
definition in (2.10).
As noted previously, 𝐶2 does not have an operational meaning like 𝐶1 . In addition, the circuit
for computing 𝐶1 is more efficient than that for 𝐶2 . The circuit in Fig. 2.5(b) for computing the
second term in 𝐶1 has a gate depth of one, with 𝑛 CNOT gates, 𝑛 measurements, and no classical
post-processing. The circuit in Fig. 2.5(c) for computing the second term in 𝐶2 has a gate depth
of two, with 𝑛 CNOT gates, 𝑛 − 1 Hadamard gates, 2𝑛 − 1 measurements, and classical post-
processing whose complexity scales linearly in 𝑛. So in every aspect, the circuit for computing 𝐶1
is less complex than that for 𝐶2 . This implies that 𝐶1 can be computed with greater accuracy than
𝐶2 on a noisy quantum computer.
On the other hand, consider how the landscapes for 𝐶1 and 𝐶2 scale with 𝑛. As a simple example,
suppose 𝜌 = |0⟩⟨0| ⊗ · · · ⊗ |0⟩⟨0|. Suppose one takes a single parameter ansatz for 𝑈, such that
𝑈 (𝜃) = 𝑅 𝑋 (𝜃) ⊗ · · · ⊗ 𝑅 𝑋 (𝜃), where 𝑅 𝑋 (𝜃) is a rotation about the 𝑋-axis of the Bloch sphere by
angle 𝜃. For this example,
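$C_1(\theta) = 1 - \mathrm{Tr}(\mathcal{Z}(\tilde{\rho})^2) = 1 - x(\theta)^n$ (2.38)
where $x(\theta) = \mathrm{Tr}(\mathcal{Z}(R_X(\theta)|0\rangle\langle 0|R_X(\theta)^\dagger)^2) = (1 + \cos^2\theta)/2$. If $\theta$ is not an integer multiple of $\pi$,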
then 𝑥(𝜃) < 1, and 𝑥(𝜃) 𝑛 will be exponentially suppressed for large 𝑛. In other words, for large
𝑛, the landscape for 𝑥(𝜃) 𝑛 becomes similar to that of a delta function: it is zero for all 𝜃 except
for multiples of 𝜋. Hence, for large 𝑛, it becomes difficult to train the unitary 𝑈 (𝜃) because the
gradient vanishes for most 𝜃. This is just an illustrative example, but this issue is general. Generally
speaking, for large 𝑛, the function 𝐶1 has a sharp gradient near its global minima, and the gradient
vanishes when one is far away from these minima. Ultimately this limits 𝐶1 ’s utility as a training
function for large 𝑛.
In contrast, 𝐶2 does not suffer from this issue. For the example in the previous paragraph,
𝐶2 (𝜃) = 1 − 𝑥(𝜃) , (2.39)
which is independent of 𝑛. So for this example the gradient of 𝐶2 does not vanish as 𝑛 increases,
and hence 𝐶2 can be used to train 𝜃. More generally, the landscape of 𝐶2 is less barren than that of
𝐶1 for large 𝑛. We can argue this, particularly, for states 𝜌 that have low rank or low entropy. The
second term in (2.8), which is the term that provides the variability with 𝛼 ® , does not vanish even
for large 𝑛, since (as shown in Section 2.14):
$\mathrm{Tr}(\mathcal{Z}_j(\tilde{\rho})^2) \geq 2^{-H(\rho)-1} \geq \frac{1}{2r}$ . (2.40)
Here, 𝐻 (𝜌) = −Tr(𝜌 log2 𝜌) is the von Neumann entropy, and 𝑟 is the rank of 𝜌. So as long as 𝜌
is low entropy or low rank, then the second term in 𝐶2 will not vanish. Note that a similar bound
does not exist for the second term in 𝐶1 , which does tend to vanish for large 𝑛.
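The contrast between Eqs. (2.38) and (2.39) can be made concrete by comparing gradients at a fixed angle as $n$ grows. The sketch below uses only the closed forms from the product-state example; the evaluation point $\theta = \pi/4$ and the finite-difference step are arbitrary illustrative choices.

```python
import numpy as np

def x(theta):
    """Single-qubit dephased purity, x(theta) = (1 + cos^2 theta) / 2."""
    return (1 + np.cos(theta) ** 2) / 2

def c1(theta, n):
    return 1 - x(theta) ** n          # Eq. (2.38)

def c2(theta):
    return 1 - x(theta)               # Eq. (2.39)

def grad(f, theta, h=1e-6):
    """Central finite-difference derivative."""
    return (f(theta + h) - f(theta - h)) / (2 * h)

theta0 = np.pi / 4
for n in (1, 10, 50):
    g1 = grad(lambda t: c1(t, n), theta0)
    g2 = grad(c2, theta0)
    print(f"n={n:3d}  dC1/dtheta={g1:.2e}  dC2/dtheta={g2:.2e}")
```

At this angle the derivative of $C_2$ stays at 1/2 for every $n$, while that of $C_1$ equals $n\,x^{n-1}/2$ with $x = 3/4$ and so decays exponentially once $n$ is large.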
2.4.2 Optimization methods
Finding 𝛼 ® opt in (2.6) is a major component of VQSD. While many works have benchmarked
classical optimization algorithms (e.g., Ref. [57]), the particular case of optimization for variational
hybrid algorithms [58] is limited and needs further work [59]. Both gradient-based and gradient-
free methods are possible, but gradient-based methods may not work as well with noisy data.
Additionally, Ref. [40] notes that gradients of a large class of circuit ansatze vanish when the
number of parameters becomes large. These and other issues (e.g., sensitivity to initial conditions,
number of function evaluations) should be considered when choosing an optimization method.
In our preliminary numerical analyses (see Section 2.10), we found that the Powell optimiza-
tion algorithm [60] performed the best on both quantum computer and simulator implementations
of VQSD. This derivative-free algorithm uses a bi-directional search along each parameter using
Brent’s method. Our studies showed that Powell’s method performed the best in terms of conver-
gence, sensitivity to initial conditions, and number of correct solutions found. The implementation
of Powell’s algorithm used in this paper can be found in the open-source Python package SciPy
Optimize [61]. Finally, Section 2.8 shows how our layered ansatz for 𝑈 𝑝 ( 𝛼 ® ) as well as proper
initialization of 𝑈 𝑝 ( 𝛼
® ) helps in mitigating the problem of local minima.
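A minimal call to SciPy's Powell implementation might look as follows. This is a sketch rather than the configuration used in this work: the cost is the noiseless closed form $C(\alpha) = \cos^2(\alpha)/2$ that follows from a direct calculation for $\rho = |+\rangle\langle+|$ with $U(\alpha) = R_x(\pi/2)R_z(\alpha)$ under standard gate conventions, and the starting point and tolerances are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def cost(params):
    """Noiseless C = cos^2(alpha)/2 for rho = |+><+| and U = Rx(pi/2) Rz(alpha)."""
    alpha = float(np.atleast_1d(params)[0])
    return 0.5 * np.cos(alpha) ** 2

# Derivative-free minimization; on hardware `cost` would instead be sampled.
result = minimize(
    cost,
    x0=np.array([0.3]),
    method="Powell",
    options={"xtol": 1e-8, "ftol": 1e-8},
)
alpha_opt = float(np.atleast_1d(result.x)[0])
n_evaluations = result.nfev
```

Every local minimum of this one-parameter landscape is global, so the optimizer may return any odd multiple of $\pi/2$ depending on the starting point.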
2.5 Code availability
The code used to generate some of the examples presented here can be accessed from [45].
2.6 Details on VQSD implementations
Here we provide further details on our implementations of VQSD in Sec. 2.2.2. This includes
further details about the optimization parameters as well as additional statistics for our runs on the
quantum computer.
2.6.1 Optimization parameters
First, we discuss our implementation on a quantum computer (data shown in Fig. 2.3). Figure 2.6
displays the circuit used for this implementation. This circuit is logically divided into three sections.
First, we prepare two copies of the plus state 𝜌 = |+⟩⟨+| = 𝐻|0⟩⟨0|𝐻 by doing a Hadamard gate 𝐻 on
each qubit. Next, we implement one layer of a unitary ansatz, namely 𝑈 (𝜃) = 𝑅𝑥 (𝜋/2)𝑅𝑧 (𝜃). This
ansatz was chosen because each gate can be natively implemented on Rigetti’s quantum computer.
To simplify the search space, we restricted to one parameter instead of a universal one-qubit unitary.
Last, we implement the DIP Test circuit, described in Fig. 2.5, which here consists of only one
CNOT gate and one measurement.
For the parameter optimization loop, we used the Powell algorithm mentioned in Sec. 2.4.2.
This algorithm found the minimum cost in fewer than ten objective function evaluations on average.
Each objective function evaluation (i.e., call to the quantum computer) sampled from 10,000 runs
of the circuit in Fig. 2.6. As can be seen in Fig. 2.3(b), 10,000 runs was sufficient to accurately
[Figure 2.6 circuit sections, left to right: State Preparation, Unitary Ansatz, DIP Test.]
Figure 2.6: Circuit used to implement VQSD for 𝜌 = |+⟩⟨+| on Rigetti’s 8Q-Agave quantum
computer. Vertical dashed lines separate the circuit into logical components.
estimate the cost function (2.10) with small variance. Because the problem size was small, we took
𝑞 = 1 in (2.10), which provided adequate variability in the cost landscape.
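The effect of finite sampling on the cost estimate can be sketched with a binomial model. The code below assumes a pure input state, so that the cost is one minus the probability of the all-zero DIP Test outcome, and it ignores hardware noise; the value of `p_zero` is hypothetical.

```python
import numpy as np

def sampled_cost(p_zero, shots, rng):
    """Estimate C = 1 - p_zero from `shots` runs of a DIP Test circuit."""
    hits = rng.binomial(shots, p_zero)       # number of all-zero outcomes
    return 1.0 - hits / shots

def standard_error(p_zero, shots):
    """Standard deviation of the sampled cost under the binomial model."""
    return float(np.sqrt(p_zero * (1 - p_zero) / shots))

rng = np.random.default_rng(5)
shots = 10_000
p_zero = 0.9                                 # hypothetical Tr(Z(rho_tilde)^2)
estimate = sampled_cost(p_zero, shots, rng)
sigma = standard_error(p_zero, shots)        # 0.003 for these numbers
worst_case = standard_error(0.5, shots)      # 0.005, the largest possible value
```

Under this model 10,000 runs bound the statistical standard deviation of the cost by 0.005.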
Because of the noise levels in current quantum computers, we limited VQSD implementations
on quantum hardware to only one-qubit states. Noise affects the computation in multiple areas.
For example, in state preparation, qubit-specific errors can cause the two copies of 𝜌 to actually be
different states. Subsequent gate errors (notably two-qubit gates), decoherence, and measurement
errors prevent the cost from reaching zero even though the optimal value of 𝜃 is obtained. The
effect of these various noise sources, and in particular the effect of discrepancies in preparation of
two copies of 𝜌, will be important to study in future work.
Next, we discuss our VQSD implementation on a simulator (data shown in Fig. 2.4). For
this implementation we again chose 𝑞 = 1 in our cost function. Because of the larger problem
size (diagonalizing a 4-qubit state), we employed multiple layers in our ansatz, up to 𝑝 = 5. The
simulator directly calculated the measurement probability distribution in the DIP Test, as opposed
to determining the desired probability via sampling. This allowed us to use a gradient-based
method to optimize our cost function, reducing the overall runtime of the optimization. Hence, our
simulator implementation for the Heisenberg model demonstrated a future application of VQSD
while alleviating the optimization bottleneck that is present for all variational quantum algorithms
on large problem sizes, an area that needs further research [59]. We explore optimization methods
further in Section 2.10.
[Figure 2.7 plot. Vertical axis: Sampled Cost. Horizontal axis: Iteration.]
Figure 2.7: Cost vs iteration for all attempts of VQSD on Rigetti’s 8Q-Agave computer for
diagonalizing the plus state 𝜌 = |+⟩⟨+|. Each of the seven curves represents a different independent
run. Each run starts at a random initial angle and uses the Powell optimization algorithm to
minimize the cost.
2.6.2 Additional statistics for the quantum computer implementation
Here, we present statistics for several runs of the VQSD implementation run on Rigetti’s 8Q-Agave
quantum computer. One example plot of cost vs. iteration for diagonalizing the plus state 𝜌 = |+⟩⟨+|
is shown in Figure 2.3(a). Here, we present all data collected for this implementation of VQSD,
shown in Figure 2.7. The following table displays the final costs achieved as well as the associated
inferred eigenvalues.
VQSD Run    $\min(C)$    $\min(\tilde{\lambda}^{\rm est}_{\vec{z}})$    $\max(\tilde{\lambda}^{\rm est}_{\vec{z}})$
1           0.107        0.000        1.000
2           0.090        0.142        0.858
3           0.099        0.054        0.946
4           0.120        0.079        0.921
5           0.080        0.061        0.939
6           0.090        0.210        0.790
7           0.065        0.001        0.999
Avg.        0.093        0.078        0.922
Std.        0.016        0.070        0.070
Table 2.1: Minimum cost and eigenvalues achieved after performing the parameter optimization
loop for seven independent runs of VQSD for the example discussed in Sec. 2.2.2. The final two
rows show average values and standard deviation across all runs.
2.7 Alternative ansatz and the Heisenberg model ground state
In this Section, we describe a modification of the layered ansatz discussed in Section 2.2.1. Figure
2.2 in the main text shows an example of a layered ansatz in which every layer has the same, fixed
structure consisting of alternating two-qubit gates acting on nearest-neighbor qubits. The modified
approach presented here may be useful in situations where there is no natural choice of the structure
of the layered ansatz.
Here, instead of working with a fixed structure for the diagonalizing unitary 𝑈 ( 𝛼 ® ), we allow
it to vary during the optimization process. The algorithm used to update the structure of 𝑈 ( 𝛼 ® ) is
probabilistic and resembles the one presented in [50].
In the examples studied here, the initial 𝑈 ( 𝛼
® ) consists of a small number of random two-qubit
gates with random supports (i.e. the qubits on which a gate acts). An optimization step involves
minimizing the cost function by changing parameters 𝛼 ® as well as a small random change to the
structure of 𝑈 ( 𝛼 ® ). This change to the structure typically amounts to a random modification of
support for a limited number of gates. The new structure is accepted or rejected following the usual
simulated annealing schemes. We refer the reader to Section II D of [50] for further details on the
optimization method.
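The accept/reject step can be illustrated with the standard Metropolis rule used in simulated annealing. This is a generic sketch: the function name, the temperature values, and the use of the cost difference as the energy are assumptions, and the actual scheme is the one described in Section II D of [50].

```python
import math
import random

def accept_structure_change(cost_old, cost_new, temperature, rng=random):
    """Metropolis rule: always accept improvements, sometimes accept setbacks."""
    if cost_new <= cost_old:
        return True
    # A worse structure is accepted with probability exp(-increase / temperature).
    return rng.random() < math.exp(-(cost_new - cost_old) / temperature)

rng = random.Random(0)
always = accept_structure_change(0.30, 0.25, temperature=0.05, rng=rng)
rarely = accept_structure_change(0.25, 0.30, temperature=1e-6, rng=rng)
```

Lowering the temperature over the course of the optimization makes cost-increasing structure changes progressively less likely to be accepted.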
The gate sequence representing 𝑈 ( 𝛼 ® ) is allowed to grow. If the algorithm described above
cannot minimize the cost function for a specified number of iterations, an identity gate (spanned
by new variational parameters) is randomly added to 𝑈 ( 𝛼 ® ). This step is similar in spirit to adding
a layer to 𝑈 ( 𝛼
® ) as discussed in Section 2.2.1 of the main text.
We compared the current method with the one based on the layered ansatz and found that it
produced diagonalizing circuits involving significantly fewer gates. Figure 2.8 shows the eigenvalue
error Δ𝜆 , defined in Eq. (2.3), as a function of 1/𝐷, where 𝐷 is the total number of gates of 𝑈 ( 𝛼 ® ).
Here, VQSD is used to diagonalize a 4-qubit reduced state of the ground state of the one-dimensional
Heisenberg model defined on 8 qubits, see Eq. (2.19). For every number of gates 𝐷, the current
algorithm outperforms the one based on the fixed, layered ansatz. It finds a sequence of gates that
results in a smaller eigenvalue error Δ𝜆 .
Figure 2.8: Comparison of two approaches to obtaining the diagonalizing unitary 𝑈 ( 𝛼 ® ): (i) based
on a fixed layered ansatz shown in Fig. 2.2 in the main text (black line) and (ii) based on random
updates to the structure of 𝑈 ( 𝛼
® ) (red line). The plot shows eigenvalue error Δ𝜆 versus 1/𝐷, where
𝐷 is the number of gates in 𝑈 ( 𝛼® ). For the same 𝐷, the second approach found a more optimal gate
sequence.
Finally, we use the current optimization approach to find the spectrum of a 6-qubit reduced state
$\rho$ of the 12-qubit ground state of a one-dimensional Heisenberg model. The results of performing
VQSD on $\rho$ are shown in Fig. 2.9. Panel (a) shows the convergence of the 11 largest inferred
eigenvalues $\tilde{\lambda}_j$ of $\rho$ to their exact values. We can see that the quality of the inferred eigenvalues
increases quickly with the number of gates $D$ used in the diagonalizing unitary $U(\vec{\alpha})$. In panel (b),
we show the dominant part of the spectrum of $\rho$ resolved in the $z$-component of the total spin. The
results show that VQSD could be used to accurately obtain the dominant part of the spectrum of
the density matrix together with the associated quantum numbers.
2.8 Optimization and local minima
In this Section we describe a strategy to avoid local minima that is used in the optimization
algorithms throughout the paper and detailed in Section 2.7. We use the optimization involved in
the diagonalization of the 6-qubit density matrix described in Section 2.7 as an illustrative example.
We note that the classical optimization problem associated with VQSD is potentially a very
difficult one. In the example studied in Section 2.7 the diagonalizing unitary consisted of 150
two-qubit gates. This means that in order to find that unitary one has to optimize over at least
$150 \cdot 13$ continuous parameters (every two-qubit gate is spanned by 15 parameters, but there is
some reduction in the total number of parameters when two consecutive gates have overlapping
supports). When initialized randomly, off-the-shelf techniques will most likely return a suboptimal solution
due to the presence of multiple local minima and the rough cost function landscape.
Figure 2.9: VQSD applied to the ground state of the Heisenberg model. Here we consider a 6-qubit
reduced state $\rho$ of the 12-qubit ground state. (a) Largest inferred eigenvalues $\tilde{\lambda}_j$ of $\rho$ as a function
of $1/D$, where $D$ is the total number of gates in the diagonalizing unitary $U(\vec{\alpha})$. The inferred
eigenvalues converge to their exact values shown along the $1/D = 0$ line, recovering the correct
degeneracy. Inset: Eigenvalue error $\Delta_\lambda$ as a function of $1/D$. (b) The largest inferred eigenvalues
$\tilde{\lambda}_j$ of $\rho$ resolved in the $\langle S_z \rangle$ quantum number. We find very good agreement between the inferred
eigenvalues (red crosses) and the exact ones (black circles), especially for large eigenvalues. The
data was obtained for $D = 150$ gates.
Figure 2.10: Cost function $C$ versus $1/D$ for three independent optimization runs. Here, $D$ is the
total number of gates in the diagonalizing unitary $U_D(\vec{\alpha})$. Every optimization run got stuck at a
local minimum at some point during the minimization, but thanks to the growth of the ansatz for
$U_D(\vec{\alpha})$ described in the text, the predefined small value of $C$ was eventually attained. The data was
obtained for a 6-qubit reduced state of the 12-qubit ground state of the Heisenberg model.
Let $U_D(\vec{\alpha})$ denote a diagonalizing unitary that is built from $D$ two-qubit gates parameterized by
$\vec{\alpha}$. Our optimization method begins with a shallow circuit consisting of only a few gates. Since there
is only a small number of variational parameters, a local minimum is quickly attained. After
this initial step, the circuit that implements the unitary $U_D(\vec{\alpha})$ is grown by adding an identity gate
(either randomly as discussed in this Section or by means of a layer of identity gates as presented in
the main text). This additional gate contains new variational parameters that are initialized such that
$U_{D+1}(\vec{\alpha}) = U_D(\vec{\alpha})$, and hence the value of the cost function is not changed. After the
gate is added, the unitary $U_{D+1}(\vec{\alpha})$ contains more variational parameters, which allows for further
minimization of the cost function. In summary, the optimization of a deeper circuit $U_{D+1}(\vec{\alpha})$ is
initialized by the previously obtained $U_D(\vec{\alpha})$ as opposed to random initialization. What is more, even
if the unitary $U_D(\vec{\alpha})$ was not optimal for a given $D$, the growth of the circuit allows
the algorithm to escape the local minimum and eventually find the global one, as illustrated by an
example below and shown in Fig. 2.10. For a similar discussion, see [41].
Figure 2.11: Cost versus iteration for different values of $q$, when $\rho$ is a tensor product of pure states
on $n$ qubits. Here we consider (a) $n = 6$, (b) $n = 8$, and (c) $n = 10$. We employed the COBYLA
optimization method for training (see Section 2.10 for discussion of this method). For each call
to the quantum simulator (i.e., classical simulator of a quantum computer), we took 500 shots for
statistics. The green, red, and blue curves respectively correspond to directly training the cost with
$q = 1$, $q = 0.5$, and $q = 0$. The purple and yellow curves respectively correspond to evaluating the
$q = 1$ cost for the angles $\vec{\alpha}$ obtained by training the $q = 0.5$ and $q = 0$ costs.
To clarify the above analysis, let us consider an example of diagonalizing a 6-qubit reduced state
of the 12-qubit ground state of the Heisenberg model, see Sec. 2.7 for comparison. Figure 2.10
shows the value of the cost function $C$ as a function of $1/D$ for three independent optimization runs.
Each optimization was initialized randomly and we applied the same optimization scheme described
above to each of them. We see that despite getting stuck in local minima, every optimization run
managed to minimize the cost function to the predefined small value (which was set to $2 \cdot 10^{-6}$ in
this example). For instance, at $D = 28$, optimization run no. 2 clearly returns a suboptimal solution
(optimization run no. 3 gives a cost lower by a factor of 6), but after adding several identity
gates, it manages to escape the local minimum and continue towards the global one.
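The warm-start growth step described in this section can be illustrated with a toy model. The sketch below is an assumption-laden simplification: it uses one-qubit gates of the form Rz(a) Ry(b) Rz(c) instead of the two-qubit gates of the text, and all function names are hypothetical. It only demonstrates the key property that a gate appended with zero parameters leaves the circuit unitary, and therefore the cost, unchanged.

```python
import cmath
import math


def matmul(a, b):
    """Product of two 2x2 complex matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def rz(angle):
    return [[cmath.exp(-0.5j * angle), 0.0], [0.0, cmath.exp(0.5j * angle)]]


def ry(angle):
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[c, -s], [s, c]]


def gate(params):
    """General one-qubit gate Rz(a) Ry(b) Rz(c); equals the identity at (0, 0, 0)."""
    a, b, c = params
    return matmul(rz(a), matmul(ry(b), rz(c)))


def circuit_unitary(param_list):
    """Multiply the gates in order of application (the first gate acts first)."""
    u = [[1.0, 0.0], [0.0, 1.0]]
    for params in param_list:
        u = matmul(gate(params), u)
    return u


def grow(param_list):
    """Append one identity-initialized gate, keeping the optimized parameters."""
    return list(param_list) + [(0.0, 0.0, 0.0)]
```

Because the grown circuit starts from exactly the same unitary, any subsequent optimization over the enlarged parameter set can only keep or lower the cost reached so far, provided the optimizer never accepts an increase.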
2.9 Optimization runs with various 𝑞 values
In this Section we present some numerical results for training our overall cost function for various
values of 𝑞. Recall from Eq. (2.10) that 𝑞 is the weighting parameter that weights the contributions
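of $C_1$ and $C_2$ in the overall cost, as follows:
$$C(U_p(\vec{\alpha})) = q\, C_1(U_p(\vec{\alpha})) + (1 - q)\, C_2(U_p(\vec{\alpha})), \tag{2.41}$$
where
$$C_1(U_p(\vec{\alpha})) = \mathrm{Tr}(\rho^2) - \mathrm{Tr}(\mathcal{Z}(\tilde{\rho})^2), \tag{2.42}$$
$$C_2(U_p(\vec{\alpha})) = \mathrm{Tr}(\rho^2) - \frac{1}{n} \sum_{j=1}^{n} \mathrm{Tr}(\mathcal{Z}_j(\tilde{\rho})^2). \tag{2.43}$$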
As argued in Section 2.2, 𝐶1 is operationally meaningful, while 𝐶2 has a landscape that is more
amenable to training when 𝑛 is large. In particular, one expects that for large 𝑛, the gradient of
𝐶1 is sharp near the global minima but vanishes exponentially in 𝑛 away from these minima. In
contrast, the gradient of 𝐶2 is not expected to exponentially vanish as 𝑛 increases, even away from
the minima.
Here, we numerically study the performance for different $q$ values for a simple example where
$\rho$ is a tensor product of qubit pure states. Namely, we choose $\rho = \bigotimes_{j=1}^{n} V_j |0\rangle\langle 0| V_j^\dagger$, where
$V_j = R_X(\theta_j)$ with $\theta_j$ randomly chosen. Such tensor product states are diagonalizable by a single-layer
ansatz: $U(\vec{\alpha}) = \bigotimes_{j=1}^{n} R_X(\alpha_j)$. We consider three different problem sizes: $n = 6$, $8$, and $10$.
Figure 2.11 shows our numerical results.
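For this product-state example the two costs have simple closed forms, which the following sketch evaluates exactly (without shot noise). It assumes the convention $R_X(\phi) = e^{-i\phi X/2}$, so that qubit $j$ ends up in $R_X(\theta_j + \alpha_j)|0\rangle$ with outcome-0 probability $\cos^2((\theta_j + \alpha_j)/2)$; the function names are hypothetical and this is not the simulation code used for Fig. 2.11.

```python
import math


def dephased_purity(theta, alpha):
    """Tr(Z(rho_j)^2) for one qubit prepared as R_X(theta)|0> and rotated by R_X(alpha)."""
    p0 = math.cos((theta + alpha) / 2) ** 2  # probability of measuring 0
    return p0 ** 2 + (1 - p0) ** 2


def costs(thetas, alphas, q):
    """Return (C, C1, C2) of Eqs. (2.41)-(2.43) for a product of pure qubit states."""
    s = [dephased_purity(t, a) for t, a in zip(thetas, alphas)]
    c1 = 1.0 - math.prod(s)     # global cost; Tr(rho^2) = 1 for a pure state
    c2 = 1.0 - sum(s) / len(s)  # local cost; dephasing qubit j alone leaves purity s[j]
    return q * c1 + (1 - q) * c2, c1, c2
```

Since every factor `s[j]` lies between 1/2 and 1, the derivative of `c1` with respect to one angle is multiplied by the product of the other $n - 1$ factors and can be as small as $2^{-(n-1)}$ times the single-qubit derivative, whereas the corresponding term in `c2` is only divided by $n$. This is a small-scale picture of the gradient behavior discussed above.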
Directly training the 𝐶1 cost (corresponding to 𝑞 = 1) sometimes fails to find the global
minimum. One can see this in Fig. 2.11, where the green curve fails to fully reach zero cost. In
contrast, the red and blue curves in Fig. 2.11, which correspond to 𝑞 = 0.5 and 𝑞 = 0 respectively,
approximately go to zero for large iterations.
Even more interesting are the purple and yellow curves, which respectively correspond to
evaluating the 𝐶1 cost at the angles 𝛼 ® obtained from training the 𝑞 = 0.5 and 𝑞 = 0 costs. It is
remarkable that both the purple and yellow curves perform better (i.e., achieve lower values) than
the green curve. This implies that one can indirectly train the 𝐶1 cost by training the 𝑞 = 0.5
or 𝑞 = 0 costs, and this indirect training performs better than directly training 𝐶1 . Since 𝐶1 is
operationally meaningful, this indirect training with 𝑞 < 1 is performing better in an operationally
meaningful way.
[Figure 2.12: six panels plotting Cost (vertical axis, 0 to 1) versus Iterations (horizontal axis): (a) Powell, (b) COBYLA, (c) BOBYQA, (d) Nelder-Mead, (e) BFGS, (f) Conjugate Gradient (CG).]
Figure 2.12: Optimization tests on six-qubit product states in the VQSD algorithm. Each plot
shows a different optimization algorithm (described in main text) and curves on each plot show
optimization attempts with different (random) initial conditions. Cost refers to the $C_1$ cost function
($q = 1$ in (2.10)), and each iteration is defined by a decrease in the cost function. As can be seen,
the Powell algorithm is the most robust to initial conditions and provides the largest number of
solved problem instances.
We expect that direct training of $C_1$ will perform worse as $n$ increases, due to the exponential
vanishing of the gradient of $C_1$. The particular runs shown in Fig. 2.11 do not show this trend,
although this can be explained by the fact that the gradient of $C_1$ depends significantly on the initial
values of the $\vec{\alpha}$, and indeed we saw large variability in the performance of the green curve even
for a fixed $n$. Nevertheless, it is worth noting that we were always able to directly train $C_1$ (i.e., to
make the green curve go to zero) for $n < 6$, which is consistent with our expectations.
Overall, Fig. 2.11 provides numerical justification of the definition of our cost function as a
weighted average, as in Eq. (2.10). Namely, it shows that there is an advantage to choosing 𝑞 < 1.
2.10 Comparison of optimization methods
As emphasized previously, numerical optimization plays a key role in all variational hybrid algorithms,
and further research in optimization methods is needed. In VQSD, the accuracy of
the inferred eigenvalues is closely tied to the performance of the optimization algorithm used in
the parameter optimization loop. This issue becomes increasingly important as one goes to large
problem sizes (large 𝑛), where the number of parameters in the diagonalizing unitary becomes
large.
Here, we compare the performance of six different optimization algorithms when used inside
the parameter optimization loop of VQSD. These include Powell’s algorithm [60], Constrained
Optimization BY Linear Approximation (COBYLA) [62], Bound Optimization BY Quadratic
Approximation (BOBYQA) [63], Nelder-Mead [64], Broyden-Fletcher-Goldfarb-Shanno (BFGS)
[65], and conjugate gradient (CG) [65]. As mentioned in the main text, Powell’s algorithm is a
derivative-free optimizer that uses a bi-directional search along each parameter. The COBYLA
and BOBYQA algorithms are both trust region or restricted-step methods, which approximate
the objective function by a model function. The region where this model function is a good
approximation is known as the trust region, and at each step the optimizer attempts to expand the
trust region. The Nelder-Mead algorithm is a simplex method useful for derivative-free optimization
with smooth objective functions. Lastly, the BFGS and CG algorithms are both gradient-based.
The BFGS method is a quasi-Newton method that uses first derivatives only, and the CG method
uses a nonlinear conjugate gradient descent. The implementations used in our study can be found
in the open-source Python package SciPy Optimize [61] and in Ref. [66].
For this study, we take the input state $\rho$ to be a six-qubit pure product state:
$$\rho = \bigotimes_{j=1}^{6} |\psi_j\rangle\langle\psi_j|, \quad \text{where} \quad |\psi_j\rangle = V_j |0\rangle. \tag{2.44}$$
Here, the state preparation unitary is
$$V_j = R_x(\alpha_x^{(j)})\, R_y(\alpha_y^{(j)})\, R_z(\alpha_z^{(j)}), \tag{2.45}$$
where the angles $(\alpha_x^{(j)}, \alpha_y^{(j)}, \alpha_z^{(j)})$ are randomly chosen.
Using each algorithm, we attempt to minimize the cost by adjusting 36 parameters in one layer
of the unitary ansatz in Fig. 2.2. For fairness of comparison, only the objective function and initial
starting point were input to each algorithm, i.e., no special options such as constraints, bounds, or
other information were provided. The results of this study are shown in Fig. 2.12 and Table 2.2.
Alg.     Powell   COBYLA   BOBYQA   Nelder-Mead   BFGS   CG
r.r.     13.20    1        2.32     23.65         3.83   2.89
f.ev.    4474     341      518      7212          1016   1045
Table 2.2: Relative average run-times (r.r.) and absolute number of function evaluations (f.ev.) of
each optimization algorithm (Alg.) used for the data obtained in Fig. 2.12. For example, BOBYQA
took 2.32 times as long to run on average as COBYLA, which took the least time to run out of
all algorithms. Absolute run-times depend on a variety of factors and computer performance. For
reference, the COBYLA algorithm takes approximately one minute for this problem on a laptop
computer. The number of cost function evaluations used (related to run-time but also dependent
on the method used by the optimizer) is shown in the second row.
Figure 2.12 shows cost versus iteration for each of the six algorithms. Here, we define one
iteration by a call to the objective function in which the cost decreases. In particular, the number of
iterations is different from the number of cost function evaluations (see Table 2.2), which is not set
a priori but rather determined by the optimizer. Plotting the cost for every function evaluation would
essentially produce a noisy curve since the optimizer is trying many values for the parameters.
Instead, we only plot the cost for each parameter update in which the cost decreases. The
number of iterations, the number of function evaluations, and the overall runtime are all important
features of the optimizer.
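The bookkeeping behind this definition of an iteration can be sketched with a thin wrapper around the objective. This is an illustrative sketch, not the code used for Fig. 2.12: the class name is hypothetical, the quadratic `toy_cost` is a stand-in for the VQSD cost, and "the cost decreases" is interpreted (as an assumption) as improving on the best value recorded so far.

```python
class CostRecorder:
    """Wrap an objective and log function evaluations and improving "iterations"."""

    def __init__(self, objective):
        self.objective = objective
        self.num_evaluations = 0
        self.iteration_costs = []  # one entry per evaluation that lowers the cost

    def __call__(self, params):
        value = self.objective(params)
        self.num_evaluations += 1
        # Record an iteration only when the cost improves on the last recorded value.
        if not self.iteration_costs or value < self.iteration_costs[-1]:
            self.iteration_costs.append(value)
        return value


def toy_cost(params):
    """Stand-in objective (a quadratic bowl); not the VQSD cost."""
    return sum(x * x for x in params)
```

A recorder instance can be handed to an optimizer wherever the objective is expected, for example `scipy.optimize.minimize(recorder, x0, method="Powell")`; afterwards `recorder.iteration_costs` gives the curve plotted per iteration and `recorder.num_evaluations` gives the total number of function evaluations.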
In this study, as well as others, we found that the Powell optimization algorithm provides the
best performance in terms of lowest minimum cost achieved, sensitivity to initial conditions, and
fraction of correct solutions found. The trust-region algorithms COBYLA and BOBYQA were
the next best methods. In particular, although the Powell algorithm consistently obtained lower
minimum costs, the COBYLA method ran thirteen times faster on average (see Table 2.2). Indeed,
both trust region methods provided the shortest runtime. The gradient-based methods BFGS and
CG had comparable run-times but were unable to find any minima. Similarly, the Nelder-Mead
simplex algorithm was unable to find any minima. This method also had the longest average
run-time of all algorithms tested.
This preliminary analysis suggests that the Powell algorithm is the best method for VQSD.
For other variational quantum algorithms, this may not necessarily be the case. In particular,
we emphasize that the optimization landscape is determined by both the unitary ansatz and the
cost function definition, which may vary drastically in different algorithms. While we found that
gradient-based methods did not perform well for VQSD, they may work well for other applications.
Additionally, optimizers that we have not considered here may also provide better performance.
We leave these questions to further work.
2.11 Complexity for particular examples
2.11.1 General complexity remarks
In what follows we discuss some simple examples of states to which one might apply VQSD. There
are several aspects of complexity to keep in mind when considering these examples, including:
(C1) The gate complexity of the unitary that diagonalizes 𝜌. (It is worth remarking that approximate
diagonalization might be achieved with a less complex unitary than exact diagonalization.)
(C2) The complexity of searching through the search space to find the diagonalizing unitary.
(C3) The statistical complexity associated with reading out the eigenvalues.
Naturally, (C1) is related to (C2). However, being efficient with respect to (C1) does not
guarantee that (C2) is efficient.
2.11.2 Example states
In the simplest case, suppose 𝜌 = |𝜓1 ⟩⟨𝜓1 | ⊗ · · · ⊗ |𝜓𝑛 ⟩⟨𝜓𝑛 | is a tensor product of pure states. This
state can be diagonalized by a depth-one circuit 𝑈 = 𝑈1 ⊗ · · · ⊗ 𝑈𝑛 composed of 𝑛 one-qubit gates
(all done in parallel). Each 𝑈 𝑗 diagonalizes the associated |𝜓 𝑗 ⟩⟨𝜓 𝑗 | state. Searching for this unitary
within our ansatz can be done by setting 𝑝 = 1, i.e., with a single layer 𝐿 1 shown in Fig. 2.2. A single
layer is enough to find the unitary that exactly diagonalizes 𝜌 in this case. Hence, for this example,
both complexities (C1) and (C2) are efficient. Finally, note that the eigenvalue readout, (C3), is
efficient because there is only one non-zero eigenvalue. Hence, $\tilde{\lambda}^{\text{est}}_{\vec{z}} \approx 1$ and $\epsilon_{\vec{z}} \approx 1/\sqrt{N_{\text{readout}}}$ for
this eigenvalue. This implies that $N_{\text{readout}}$ can be chosen to be constant, independent of $n$, in order
to accurately characterize this eigenvalue.
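As a quick worked instance of this scaling (a sketch that takes the estimate $\epsilon_{\vec{z}} \approx 1/\sqrt{N_{\text{readout}}}$ at face value; the target precision $10^{-2}$ is an arbitrary illustrative choice):

```latex
\epsilon_{\vec{z}} \approx \frac{1}{\sqrt{N_{\text{readout}}}}
\quad\Longrightarrow\quad
N_{\text{readout}} \approx \frac{1}{\epsilon_{\vec{z}}^{2}},
\qquad \text{e.g.} \quad
\epsilon_{\vec{z}} = 10^{-2} \;\Rightarrow\; N_{\text{readout}} \approx 10^{4}
\quad \text{for every } n .
```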
Classically correlated states generalize product states and have the form
$$\rho = \sum_{\vec{z}} p_{\vec{z}}\, |b^{(1)}_{z_1}\rangle\langle b^{(1)}_{z_1}| \otimes \cdots \otimes |b^{(n)}_{z_n}\rangle\langle b^{(n)}_{z_n}|, \tag{2.46}$$
where $\{|b^{(j)}_0\rangle, |b^{(j)}_1\rangle\}$ form an orthonormal basis for qubit $j$. Like product states, classically
correlated states can be diagonalized with a depth-one circuit composed of one-body unitaries.
Hence (C1) and (C2) are efficient for such states. However, the complexity of eigenvalue readout
depends on the $\{p_{\vec{z}}\}$ distribution; if it has high entropy, then the cost of eigenvalue readout can scale exponentially.
Finally, we consider pure states of the form 𝜌 = |𝜓⟩⟨𝜓|. For such states, eigenvalue readout
(C3) is efficient because 𝑁readout can be chosen to be independent of 𝑛, as we noted earlier for the
example of pure product states.
Next we argue that the gate complexity of the diagonalizing unitary, (C1), is efficient. The
argument is simply that VQSD takes the state $\rho$ as its input, and $\rho$ must have been prepared on
a quantum computer. Let $V$ be the unitary that was used to prepare $|\psi\rangle = V|\vec{0}\rangle$ on the quantum
computer. For large $n$, $V$ must have been efficient to implement, otherwise the state $|\psi\rangle$ could not
have been prepared. Note that $V^\dagger$, which is constructed from $V$ by reversing the order of the gates
and taking the adjoint of each gate, can be used to diagonalize $\rho$. Because $V$ is efficiently implementable,
$V^\dagger$ is also efficiently implementable. Hence, $\rho$ can be efficiently diagonalized. A subtlety is
that one must compile $V^\dagger$ into one's ansatz, such as the ansatz in Fig. 2.2. Fortunately, the overhead
needed to compile $V^\dagger$ into our ansatz grows (at worst) only linearly in $n$. An explicit construction for
compiling $V^\dagger$ into our ansatz is as follows. Any one-qubit gate directly translates without overhead
into our ansatz, while any two-qubit gate can be compiled by using a linear number of swap gates to
make the qubits of interest nearest neighbors, then performing the desired two-qubit gate, and
finally using a linear number of swap gates to move the qubits back to their original positions.
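The swap-based construction for a two-qubit gate can be sketched as follows. The sketch assumes the qubits sit on a line with nearest-neighbor connectivity and represents operations as plain tuples; the function name and the tuple format are hypothetical and chosen only for illustration.

```python
def route_two_qubit_gate(name, a, b):
    """Compile a gate on qubits a and b of a line into nearest-neighbor operations.

    The state of qubit a is swapped toward b until the two are adjacent, the
    gate is applied, and the swaps are then undone in reverse order.
    """
    if a == b:
        raise ValueError("a two-qubit gate needs two distinct qubits")
    step = 1 if b > a else -1
    # Swaps that move the state of qubit a next to qubit b.
    swaps = [("SWAP", q, q + step) for q in range(a, b - step, step)]
    position = b - step  # where the state of qubit a sits after the swaps
    return swaps + [(name, position, b)] + swaps[::-1]
```

The routine uses $2(|a - b| - 1)$ swap gates, which is at most $2(n - 2)$ on $n$ qubits, in line with the linear overhead stated above.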
Let us now consider the complexity (C2) of searching for $U$. Since there are a linear number
of parameters in each layer, and $p$ needs only to grow polynomially in $n$, the total number
of parameters grows only polynomially in $n$. But this does not guarantee that we can efficiently
minimize the cost function, since the landscape is non-convex. In general, search complexity for
problems such as this remains an open problem. Hence, we cannot make a general statement about
(C2) for pure states.
Figure 2.13: Circuit for our qPCA implementation. Here, the eigenvalues of a one-qubit pure state
$\rho$ are estimated to a single digit of precision. We use $k$ copies of $\rho$ to approximate $C_{V(t)}$ by applying
the controlled-exponential-swap operator $k$ times for a time period $\Delta t = t/k$. The bottom panel
shows our compilation of the controlled-exponential-swap gate into one- and two-qubit gates.
2.12 Implementation of qPCA
In the main text we compared VQSD to the qPCA algorithm. Here we give further details on our
implementation of qPCA. Let us first give an overview of qPCA.
2.12.1 Overview of qPCA
The qPCA algorithm exploits two primitives: quantum phase estimation and density matrix exponentiation.
Combining these two primitives allows one to estimate the eigenvalues and prepare the
eigenvectors of a state 𝜌.
Density matrix exponentiation refers to generating the unitary $V(t) = e^{-i\rho t}$ for a given state
$\rho$ and arbitrary time $t$. For qPCA, one actually needs to apply the controlled-$V(t)$ gate ($C_{V(t)}$).
Namely, in qPCA, the $C_{V(t)}$ gate must be applied for a set of times, $\{t, 2t, 2^2 t, \ldots, 2^x t\}$, as part of
the phase-estimation algorithm. Here we define $t_{\max} := 2^x t$.
Ref. [16] noted that $V(t)$ can be approximated with a sequence of $k$ exponential swap operations
between a target state $\sigma$ and $k$ copies of $\rho$. That is, let $S_{JK}$ be the swap operator between systems $J$
and $K$, and let $\sigma$ and $\rho^{\otimes k}$ be states on systems $A$ and $B = B_1 \ldots B_k$, respectively. Then one performs
the transformation
$$\tau_{AB} = \sigma \otimes (\rho^{\otimes k}) \to \tau'_{AB} = W (\sigma \otimes (\rho^{\otimes k})) W^\dagger, \tag{2.47}$$
where
$$W = U_{AB_k} \cdots U_{AB_1}, \quad \text{and} \quad U_{JK} = e^{-i S_{JK} \Delta t}. \tag{2.48}$$
The resulting reduced state is
$$\tau'_A = \mathrm{Tr}_B(\tau'_{AB}) \approx V(t)\, \sigma\, V(t)^\dagger, \tag{2.49}$$
where $t = k \Delta t$. Finally, by turning each $U_{JK}$ in (2.48) into a controlled operation,
$$C_{U_{JK}} = |0\rangle\langle 0| \otimes I + |1\rangle\langle 1| \otimes e^{-i S_{JK} \Delta t}, \tag{2.50}$$
and hence making $W$ controlled, one can then construct an approximation of $C_{V(t)}$.
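The single-step identity behind this approximation can be written out explicitly. The following is a sketch of the standard argument for one exponential swap between a target state $\sigma$ on $A$ and one copy of $\rho$ on $B$; it uses $S^2 = I$ together with the partial-trace identities $\mathrm{Tr}_B[S(\sigma \otimes \rho)] = \rho\sigma$, $\mathrm{Tr}_B[(\sigma \otimes \rho)S] = \sigma\rho$, and $\mathrm{Tr}_B[S(\sigma \otimes \rho)S] = \rho$.

```latex
\begin{align}
e^{-iS\Delta t} &= \cos(\Delta t)\, I - i \sin(\Delta t)\, S , \\
\mathrm{Tr}_B\!\left[ e^{-iS\Delta t} (\sigma \otimes \rho)\, e^{iS\Delta t} \right]
  &= \cos^2(\Delta t)\, \sigma + \sin^2(\Delta t)\, \rho
     - i \sin(\Delta t) \cos(\Delta t)\, [\rho, \sigma] \\
  &= \sigma - i \Delta t\, [\rho, \sigma] + O(\Delta t^2) \\
  &= e^{-i\rho\Delta t}\, \sigma\, e^{i\rho\Delta t} + O(\Delta t^2) .
\end{align}
```

Repeating the step $k$ times with fresh copies of $\rho$ accumulates a total time $t = k\Delta t$ with an error of order $k\,\Delta t^2 = t^2/k$.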
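If one chooses the input state for quantum phase estimation to be $\rho = \sum_{\vec{z}} \lambda_{\vec{z}} |v_{\vec{z}}\rangle\langle v_{\vec{z}}|$ itself, then
the final state becomes
$$\sum_{\vec{z}} \lambda_{\vec{z}}\, |v_{\vec{z}}\rangle\langle v_{\vec{z}}| \otimes |\hat{\lambda}_{\vec{z}}\rangle\langle \hat{\lambda}_{\vec{z}}|, \tag{2.51}$$
where $\hat{\lambda}_{\vec{z}}$ is a binary representation of an estimate of the corresponding eigenvalue $\lambda_{\vec{z}}$. One can
then sample from the state in (2.51) to characterize the eigenvalues and eigenvectors.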
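The approximation of $V(t)$ in (2.49) can be done with accuracy $\epsilon$ provided that one uses
$O(t^2 \epsilon^{-1})$ copies of $\rho$. The time $t_{\max}$ needed for quantum phase estimation to achieve accuracy $\epsilon$ is
$t_{\max} = O(\epsilon^{-1})$. Substituting $t = t_{\max}$ into the copy count gives $O(\epsilon^{-2} \cdot \epsilon^{-1})$. Hence, with qPCA,
the eigenvalues and eigenvectors can be obtained with accuracy
$\epsilon$ provided that one uses $O(\epsilon^{-3})$ copies of $\rho$.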
Figure 2.14: The largest inferred eigenvalue for the one-qubit pure state $\rho = |+\rangle\langle +|$ versus application
time of the unitary $e^{-i\rho t}$, for our implementation of qPCA on Rigetti's noisy and noiseless QVMs.
Curves are shown for $k = 1$ and $k = 2$, where $k$ indicates the number of controlled-exponential-swap
operators applied.
2.12.2 Our implementation of qPCA
Figure 2.13 shows our strategy for implementing qPCA on an arbitrary one-qubit state 𝜌. The circuit
shown corresponds to the quantum phase estimation algorithm with one bit of precision (i.e., one
ancilla qubit). A Hadamard gate is applied to the ancilla qubit, which then acts as the control system
for the 𝐶𝑉 (𝑡) gate, and finally the ancilla is measured in the 𝑥-basis. The 𝐶𝑉 (𝑡) is approximated (as
discussed above) with 𝑘 applications of the controlled-exponential-swap gate.
To implement qPCA, the controlled-exponential-swap gate in (2.50) must be compiled into one-
and two-body gates. For this purpose, we used the machine-learning approach from Ref. [50] to
obtain a short-depth gate sequence for controlled-exponential-swap. The gate sequence we obtained
is shown in Fig. 2.13 and involves 7 CNOTs and 8 one-qubit gates. Most of the one-qubit gates
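are $z$-rotations and hence are error-free (implemented via a clock change), including the following
gates:
$$u_1 = u_5 = u_7 = R_z(-(\pi + \Delta t)/2), \tag{2.52}$$
$$u_3 = R_z((\pi - \Delta t)/2), \tag{2.53}$$
$$u_4 = R_z(\Delta t/2), \tag{2.54}$$
$$u_8 = R_z(\pi/2). \tag{2.55}$$
The one-qubit gates that are not $z$-rotations are:
$$u_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ e^{-i(\pi - \Delta t)/2} & e^{i(\pi + \Delta t)/2} \end{pmatrix}, \tag{2.56}$$
$$u_6 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & e^{-i(\pi + \Delta t)/2} \\ -i & e^{-i\Delta t/2} \end{pmatrix}. \tag{2.57}$$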
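As a sanity check on the two non-diagonal gates, one can verify numerically that the matrices $u_2$ and $u_6$ are unitary for any $\Delta t$. The sketch below assumes the entry layout of Eqs. (2.56) and (2.57) with rows listed top to bottom; the helper names are hypothetical.

```python
import cmath
import math


def u2(dt):
    """Gate u_2 of Eq. (2.56) as a nested list (rows top to bottom)."""
    s = 1 / math.sqrt(2)
    return [[s, s],
            [s * cmath.exp(-0.5j * (math.pi - dt)), s * cmath.exp(0.5j * (math.pi + dt))]]


def u6(dt):
    """Gate u_6 of Eq. (2.57) as a nested list (rows top to bottom)."""
    s = 1 / math.sqrt(2)
    return [[s, s * cmath.exp(-0.5j * (math.pi + dt))],
            [-1j * s, s * cmath.exp(-0.5j * dt)]]


def is_unitary(m, tol=1e-12):
    """Check that the columns of a 2x2 matrix are orthonormal."""
    for i in range(2):
        for j in range(2):
            overlap = sum(m[k][i].conjugate() * m[k][j] for k in range(2))
            if abs(overlap - (1.0 if i == j else 0.0)) > tol:
                return False
    return True
```

Both matrices pass the check for arbitrary values of `dt`, since the two columns of each matrix differ by a relative phase of $\pi$ between their entries.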
We implemented the circuit in Fig. 2.13 using both Rigetti’s noiseless simulator, known as the
Quantum Virtual Machine (QVM), as well as their noisy QVM that utilizes a noise model of their
8Q-Agave chip. Because the latter is meant to mimic the noise in the 8Q-Agave chip, our qPCA
results on the noisy QVM can be compared to our VQSD results on the 8Q-Agave chip in Fig. 2.3.
(We remark that lack of availability prevented us from implementing qPCA on the actual 8Q-Agave
chip.)
For our implementation, we chose the one-qubit plus state, $\rho = |+\rangle\langle +|$. Implementations were
carried out using both one and two controlled-exponential-swap gates, corresponding to $k = 1$ and
$k = 2$. The time $t$ for which the unitary $e^{-i\rho t}$ was applied was varied over increasing values.
Figure 2.14 shows the raw data, i.e., the largest inferred eigenvalue versus 𝑡. In each case, small
values of 𝑡 gave more accurate eigenvalues. In the noiseless case, the eigenvalues of 𝜌 = |+⟩⟨+|
were correctly estimated to be ≈ {1, 0} already for 𝑘 = 1 and consequently also for 𝑘 = 2. In the
noisy case, the eigenvalues were estimated to be ≈ {0.8, 0.2} for 𝑘 = 1 and ≈ {0.7, 0.3} for 𝑘 = 2,
where we have taken the values for small 𝑡. Table 2.3 summarizes the different cases.
Already for the case of $k = 1$, the resources required by qPCA (3 qubits + 7 CNOT gates) for
estimating the eigenvalue of an arbitrary pure one-qubit state $\rho$ are higher than those of the DIP test
(2 qubits + 1 CNOT gate) for the same task. Consequently, the DIP test yields more accurate results,
as can be observed by comparing Fig. 2.3 to Fig. 2.14. Increasing the number of copies to $k = 2$
only decreases the accuracy of the estimation, since the $C_{V(t)}$ gate is already well approximated
for short application times $t$ when $k = 1$ in the noiseless case. Thus, increasing the number of
copies does not offer any improvement in the noiseless case, but instead leads to poorer estimation
performance in the noisy case. This can be seen for the $k = 2$ case (see Fig. 2.14 and Table 2.3)
and is due to the doubled number of required CNOT gates relative to $k = 1$.
QVM          k = 1            k = 2
noiseless    ≈ {1, 0}         ≈ {1, 0}
noisy        ≈ {0.8, 0.2}     ≈ {0.7, 0.3}
Table 2.3: Estimated eigenvalues for the $\rho = |+\rangle\langle +|$ state using qPCA on both the noiseless and the
noisy QVMs of Rigetti.
2.13 Circuit derivation
2.13.1 DIP test
Here we prove that the circuit in Fig. 2.15(a) computes Tr(Z(𝜎)Z(𝜏)) for any two density matrices
𝜎 and 𝜏.
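Figure 2.15: Test circuits used to compute the cost function in VQSD. (a) DIP test. (b) PDIP test.
(These circuits appear in Fig. 2.5 and are also shown here for the reader's convenience.)
Let $\sigma$ and $\tau$ be states on the $n$-qubit systems $A$ and $B$, respectively. Let $\omega_{AB} = \sigma \otimes \tau$ denote
the initial state. The action of the CNOTs in Fig. 2.15(a) gives the state
$$\omega'_{AB} = \sum_{\vec{z}, \vec{z}'} X^{\vec{z}} \sigma X^{\vec{z}'} \otimes |\vec{z}\rangle\langle\vec{z}| \tau |\vec{z}'\rangle\langle\vec{z}'|, \tag{2.58}$$
where the notation $X^{\vec{z}}$ means $X^{z_1} \otimes X^{z_2} \otimes \cdots \otimes X^{z_n}$. Partially tracing over the $B$ system gives
$$\omega'_A = \sum_{\vec{z}} \tau_{\vec{z},\vec{z}}\, X^{\vec{z}} \sigma X^{\vec{z}}, \tag{2.59}$$
where $\tau_{\vec{z},\vec{z}} = \langle\vec{z}|\tau|\vec{z}\rangle$. The probability for the all-zeros outcome is then
$$\langle\vec{0}|\omega'_A|\vec{0}\rangle = \sum_{\vec{z}} \tau_{\vec{z},\vec{z}}\, \langle\vec{0}|X^{\vec{z}} \sigma X^{\vec{z}}|\vec{0}\rangle = \sum_{\vec{z}} \tau_{\vec{z},\vec{z}}\, \sigma_{\vec{z},\vec{z}}, \tag{2.60}$$
which follows because $X^{\vec{z}}|\vec{0}\rangle = |\vec{z}\rangle$. Hence the probability for the all-zeros outcome is precisely
the diagonalized inner product, $\mathrm{Tr}(\mathcal{Z}(\sigma)\mathcal{Z}(\tau))$. Note that in the special case where $\sigma = \tau = \tilde{\rho}$,
we obtain the sum of the squares of the diagonal elements, $\sum_{\vec{z}} \tilde{\rho}_{\vec{z},\vec{z}}^{\,2} = \mathrm{Tr}(\mathcal{Z}(\tilde{\rho})^2)$.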
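The derivation can be checked numerically in the smallest case of one qubit per system. The sketch below is a brute-force classical simulation under stated assumptions: density matrices are nested lists, the joint basis state $|a\, b\rangle$ (system $A$, then system $B$) is stored at index $2a + b$, and the CNOT has $B$ as control and $A$ as target, as in Eq. (2.58). The function names are hypothetical.

```python
def dip_direct(sigma, tau):
    """Tr(Z(sigma) Z(tau)): the sum over products of diagonal elements."""
    return sum((sigma[z][z] * tau[z][z]).real for z in range(len(sigma)))


def dip_circuit_one_qubit(sigma, tau):
    """All-zeros probability on A after a CNOT with control B and target A."""
    # Initial state omega = sigma (x) tau as a 4x4 matrix.
    omega = [[sigma[a][c] * tau[b][d] for c in range(2) for d in range(2)]
             for a in range(2) for b in range(2)]
    # The CNOT permutes basis states: |a, b> -> |a XOR b, b>.
    perm = [2 * (a ^ b) + b for a in range(2) for b in range(2)]
    # omega' = C omega C^dagger, i.e. omega'[perm[r]][perm[c]] = omega[r][c].
    rotated = [[0j] * 4 for _ in range(4)]
    for r in range(4):
        for c in range(4):
            rotated[perm[r]][perm[c]] = omega[r][c]
    # Probability of finding A in |0>: trace over B of the a = 0 block.
    return sum(rotated[b][b].real for b in range(2))
```

For any pair of valid one-qubit density matrices the two functions agree, which is the $n = 1$ instance of Eq. (2.60).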
2.13.2 PDIP test
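We prove that the circuit in Fig. 2.15(b) computes $\mathrm{Tr}(\mathcal{Z}_{\vec{j}}(\sigma)\mathcal{Z}_{\vec{j}}(\tau))$ for a given set of qubits $\vec{j}$.
Let $\vec{j}'$ denote the complement of $\vec{j}$. Let $\sigma$ and $\tau$, respectively, be states on the $n$-qubit systems
$A = A_{\vec{j}} A_{\vec{j}'}$ and $B = B_{\vec{j}} B_{\vec{j}'}$. The initial state $\omega_{AB} = \sigma \otimes \tau$ evolves, under the action of the CNOTs
associated with the DIP Test and then tracing over the control systems, to
$$\omega'_{A B_{\vec{j}'}} = \sum_{\vec{z}} (X^{\vec{z}} \otimes I)\, \sigma\, (X^{\vec{z}} \otimes I) \otimes \mathrm{Tr}_{B_{\vec{j}}}((|\vec{z}\rangle\langle\vec{z}| \otimes I)\tau), \tag{2.61}$$
where $X^{\vec{z}}$ and $|\vec{z}\rangle\langle\vec{z}|$ act non-trivially only on the $\vec{j}$ subsystems of $A$ and $B$, respectively. Measuring
system $A_{\vec{j}}$ and obtaining the all-zeros outcome would leave systems $A_{\vec{j}'} B_{\vec{j}'}$ in the (unnormalized)
conditional state:
$$\mathrm{Tr}_{A_{\vec{j}}}((|\vec{0}\rangle\langle\vec{0}| \otimes I)\, \omega'_{A B_{\vec{j}'}}) = \sum_{\vec{z}} \sigma^{\vec{z}}_{\vec{j}'} \otimes \tau^{\vec{z}}_{\vec{j}'}, \tag{2.62}$$
where $\sigma^{\vec{z}}_{\vec{j}'} := \mathrm{Tr}_{A_{\vec{j}}}((|\vec{z}\rangle\langle\vec{z}| \otimes I)\sigma)$ and $\tau^{\vec{z}}_{\vec{j}'} := \mathrm{Tr}_{B_{\vec{j}}}((|\vec{z}\rangle\langle\vec{z}| \otimes I)\tau)$. Finally, computing the expectation
value for the swap operator (via the Destructive Swap Test) on the state in (2.62) gives
$$\sum_{\vec{z}} \mathrm{Tr}((\sigma^{\vec{z}}_{\vec{j}'} \otimes \tau^{\vec{z}}_{\vec{j}'})\, S) = \sum_{\vec{z}} \mathrm{Tr}(\sigma^{\vec{z}}_{\vec{j}'}\, \tau^{\vec{z}}_{\vec{j}'}) = \mathrm{Tr}(\mathcal{Z}_{\vec{j}}(\sigma)\mathcal{Z}_{\vec{j}}(\tau)). \tag{2.63}$$
The last equality can be verified by noting that $\mathcal{Z}_{\vec{j}}(\sigma) = \sum_{\vec{z}} |\vec{z}\rangle\langle\vec{z}| \otimes \sigma^{\vec{z}}_{\vec{j}'}$ and $\mathcal{Z}_{\vec{j}}(\tau) = \sum_{\vec{z}} |\vec{z}\rangle\langle\vec{z}| \otimes \tau^{\vec{z}}_{\vec{j}'}$.
Specializing (2.63) to $\sigma = \tau = \tilde{\rho}$ gives the quantity $\mathrm{Tr}(\mathcal{Z}_{\vec{j}}(\tilde{\rho})^2)$.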
2.14 Proof of local dephasing channel bound
In this section we prove Eq. (2.40). Let $H_2(\sigma) = -\log_2[\mathrm{Tr}(\sigma^2)]$ be the Rényi entropy of order
two. Then, noting that $H_2(\sigma) \le H(\sigma)$, we have
$$\mathrm{Tr}(\mathcal{Z}_j(\tilde{\rho})^2) = 2^{-H_2(\mathcal{Z}_j(\tilde{\rho}))} \ge 2^{-H(\mathcal{Z}_j(\tilde{\rho}))}. \tag{2.64}$$
Next, let $A$ denote qubit $j$, and let $B$ denote all the other qubits. This allows us to write $\rho = \rho_{AB}$
and $\tilde{\rho} = \tilde{\rho}_{AB}$. Let $C$ be a purifying system such that $\rho_{ABC}$ and $\tilde{\rho}_{ABC}$ are both pure states. Then we
have
$$H(\mathcal{Z}_j(\tilde{\rho})) = H(\mathcal{Z}_j(\tilde{\rho}_{AB})) \tag{2.65}$$
$$= H(\mathcal{Z}_j(\tilde{\rho}_{AC})) \tag{2.66}$$
$$\le H(\mathcal{Z}_j(\tilde{\rho}_A)) + H(\tilde{\rho}_C), \tag{2.67}$$
where the inequality in (2.67) used the subadditivity of von Neumann entropy. Finally, note that
$$H(\tilde{\rho}_C) = H(\tilde{\rho}_{AB}) = H(\rho_{AB}) = H(\rho) \tag{2.68}$$
and $H(\mathcal{Z}_j(\tilde{\rho}_A)) \le 1$, which gives
$$H(\mathcal{Z}_j(\tilde{\rho})) \le 1 + H(\rho). \tag{2.69}$$
Substituting (2.69) into (2.64) gives
$$\mathrm{Tr}(\mathcal{Z}_j(\tilde{\rho})^2) \ge 2^{-1 - H(\rho)}, \tag{2.70}$$
and (2.40) follows from $H(\rho) \le \log_2 r$.
CHAPTER 3
QUANTUM-ASSISTED QUANTUM COMPILING
3.1 Introduction
In classical computing, a compiler is a program that converts instructions into assembly language
so that they can be read and executed by a computer. Similarly, a quantum compiler would
take a high-level algorithm and convert it into a lower-level form that could be executed on a
NISQ device. Already, a large body of literature exists on classical approaches for quantum
compiling, e.g., using temporal planning [67, 68], machine learning [50], and other techniques
[69, 70, 71, 72, 73, 74, 75, 76].
A recent exciting idea is to use quantum computers themselves to train parametrized quantum
circuits, as proposed in Refs. [77, 78, 30, 79, 80, 81, 82, 83, 84]. The cost function to be minimized
essentially defines the application. For example, in the variational quantum eigensolver (VQE) [78]
and the quantum approximate optimization algorithm (QAOA) [77], the application is ground state
preparation, and hence the cost is the expectation value of the associated Hamiltonian. Another
example is training error-correcting codes [30], where the cost is the average code fidelity. In light
of these works, it is natural to ask: what is the relevant cost function for the application of quantum
compiling?
In this chapter, we introduce quantum-assisted quantum compiling (QAQC). The goal of QAQC
is to compile a (possibly unknown) target unitary to a trainable quantum gate sequence. A key
feature of QAQC is the fact that the cost is computed directly on the quantum computer. This leads
to an exponential speedup (in the number of qubits involved in the gate sequence) over classical
methods to compute the cost, since classical simulation of quantum dynamics is exponentially
slower than quantum simulation. Consequently, one should be able to optimally compile larger-scale
gate sequences using QAQC, whereas classical approaches to optimal quantum compiling
will be limited to smaller gate sequences.¹
We carefully define a cost function for QAQC that satisfies the following criteria:
1. It is faithful (vanishing if and only if the compilation is exact);
2. It is efficient to compute on a quantum computer;
3. It has an operational meaning;
4. It scales well with the size of the problem.
A potential candidate for a cost function satisfying these criteria is the Hilbert-Schmidt inner
product between a target unitary 𝑈 and a trainable unitary 𝑉:
⟨𝑉, 𝑈⟩ = Tr(𝑉 †𝑈). (3.1)
It turns out, however, that this cost function does not satisfy the last criterion. We thus use
Eq. (3.1) only for small-scale problems. For general, large-scale problems, we define a cost
function satisfying all criteria. This cost involves a weighted average of the global overlap in (3.1)
with localized overlaps, which quantify the overlap between 𝑈 and 𝑉 with respect to individual
qubits.
We prove that computing our cost function is DQC1-hard, where DQC1 is the class of problems
that can be efficiently solved in the one-clean-qubit model of computation [7]. Since DQC1 is
believed to be classically hard to simulate [85], this implies that, under reasonable complexity
assumptions, no classical algorithm can efficiently compute our
cost function. We remark that an alternative cost function might be a worst-case distance measure
(such as diamond distance), but such measures are known to be QIP-complete [86] and hence would
violate criterion 2 in our list above. In this sense, our cost function appears to be ideal.
Furthermore, we present novel short-depth quantum circuits for efficiently computing the terms
in our cost function. Our circuits achieve short depth by avoiding implementing controlled versions
of 𝑈 and 𝑉, and by implementing 𝑈 and 𝑉 in parallel. We also present, in Section 3.15, circuits that
compute the gradient of our cost function. One such circuit is a generalization of the well-known
Power of One Qubit [7] that we call the Power of Two Qubits.
1 We note that classical compilers may be applied to large-scale quantum algorithms, but they are limited to local
compiling. We thus emphasize the distinction between translating the algorithm to the native alphabet with simple,
local compiling and optimal compiling. Local compiling may reach partial optimization but in order to discover the
shortest circuit one may need to use a holistic approach, where the entire algorithm is considered, which requires a
quantum computer for compiling.
As a proof-of-principle, we implement QAQC on both IBM’s and Rigetti’s quantum computers,
and we compile various one-qubit gates to the native gate alphabets used by these devices. To
our knowledge, this is the first compilation of a target unitary with cost evaluation on actual NISQ
hardware. In addition, we successfully implement QAQC on both a noiseless and noisy simulator for
problems as large as 9-qubit unitaries. These larger scale implementations illustrate the scalability
of our cost function, and in the case of the noisy simulator, show a somewhat surprising resilience
to noise.
In what follows, we first discuss several applications of interest for QAQC. Section 3.3 provides
a general outline of the QAQC algorithm. Section 3.4 presents our short-depth circuits for cost
evaluation on a quantum computer. Section 3.5 states that our cost function is classically hard to
simulate. Sections 3.6 and 3.7, respectively, present small-scale and larger-scale implementations
of QAQC.
3.2 Applications of QAQC
Figure 3.1 illustrates four potential applications of QAQC. Suppose that there exists a quantum
algorithm to perform some task, but its associated gate sequence is longer than desired. As shown
in Fig. 3.1(a), it is possible to use QAQC to shorten the gate sequence by accounting for the
NISQ constraints of the specific computer. This depth compression goes beyond the capabilities of
classical compilers.
As a simple example, consider the quantum Fourier transform on 𝑛 qubits. Its textbook algorithm
is written in terms of Hadamard gates and controlled-rotation gates [11], which may need to be
compiled into the native gate alphabet. The number of gates in the textbook algorithm is $O(n^2)$, so
one could use a classical compiler to locally compile each gate. But this could lead to a sub-optimal
depth since the compilation starts from the textbook structure. In contrast, QAQC is unbiased with
respect to the structure of the gate sequence, taking a holistic approach to compiling as opposed to
a local one. Hence, in principle, it can learn the optimal gate sequence for given hardware. Note
that classical compilers cannot take this holistic approach for large 𝑛 due to the exponential scaling
of the matrix representations of the gates.
Figure 3.1: Potential applications of QAQC. Here, [symbol] denotes the 𝑧-rotation gate $R_z(\theta)$, while
[symbol] represents the 𝜋/2-pulse given by the 𝑥-rotation gate $R_x(\pi/2)$. Both gates are natively
implemented on commercial hardware [2, 3]. (a) Compressing the depth of a given gate sequence
𝑈 to a shorter-depth gate sequence 𝑉 in terms of native hardware gates. (b) Uploading a black-box
unitary. The black box could be an analog unitary $U = e^{-i\mathcal{H}t}$, for an unknown Hamiltonian $\mathcal{H}$,
that one wishes to convert into a gate sequence to be run on a gate-based quantum computer. (c)
Training algorithms in the presence of noise to learn noise-resilient algorithms (e.g., via gates that
counteract the noise). Here, the unitary 𝑈 is performed on high-quality, pristine qubits and 𝑉 is
performed on noisy ones. (d) Benchmarking a quantum computer by compiling a unitary 𝑈 on
noisy qubits and learning the gate sequence 𝑉 on high-quality qubits.
Alternatively, consider the problem of simulating the dynamics of a given quantum system with
an unknown Hamiltonian $\mathcal{H}$ (via $e^{-i\mathcal{H}t}$) on a quantum computer. We call this problem black-box
uploading because by simulating the black-box, i.e., the unitary $e^{-i\mathcal{H}t}$, we are “uploading” the
unitary onto the quantum computer. This scenario is depicted in Fig. 3.1(b). QAQC could be used
to convert an analog black-box unitary into a gate sequence on a digital quantum computer.
Finally, we highlight two additional applications that are the opposites of each other. These
two applications can be exploited when the quantum computer has some pristine qubits (qubits
with low noise) and some noisy qubits. We emphasize that, in this context, “noisy qubits” refers
to coherent noise such as systematic gate biases, where the gate rotation angles are biased in a
particular direction. In contrast, we consider incoherent noise (e.g., 𝑇1 and 𝑇2 noise) later in this
article, see Section 3.7.2.
Consider Fig. 3.1(c). Here, the goal is to implement a CNOT gate on two noisy qubits. Due to
the noise, to actually implement a true CNOT, one has to physically implement a dressed CNOT,
i.e., a CNOT surrounded by one-qubit unitaries. QAQC can be used to learn the parameters in these
one-qubit unitaries. By choosing the target unitary 𝑈 to be a CNOT on a pristine (i.e., noiseless)
pair of qubits, it is possible to learn the unitary 𝑉 that needs to be applied to the noisy qubits in
order to effectively implement a CNOT. We call this application noise-tailored algorithms, since
the learned algorithms are robust to the noise process on the noisy qubits.
Figure 3.1(d) depicts the opposite process, which is benchmarking. Here, the unitary 𝑈 acts on
a noisy set of qubits, and the goal is to determine what the equivalent unitary 𝑉 would be if it were
implemented on a pristine set of qubits. This essentially corresponds to learning the noise model,
i.e., benchmarking the noisy qubits.
3.3 The QAQC algorithm
3.3.1 Approximate compiling
The goal of QAQC is to take a (possibly unknown) unitary 𝑈 and return a gate sequence 𝑉, executable
on a quantum computer, that has approximately the same action as 𝑈 on any given input state (up
to possibly a global phase factor). The notion of approximate compiling [87, 88, 89, 90, 91, 92]
requires an operational figure-of-merit that quantifies how close the compilation is to exact. A
natural candidate is the probability for the evolution under 𝑉 to mimic the evolution under 𝑈.
Hence, consider the overlap between $|\psi(U)\rangle := U|\psi\rangle$ and $|\psi(V)\rangle := V|\psi\rangle$, averaged over all input
states |𝜓⟩. This is the fidelity averaged over the Haar distribution,
$$ F(U, V) := \int_{\psi} |\langle \psi(V) | \psi(U) \rangle|^2 \, d\psi . \tag{3.2} $$
We call 𝑉 an exact compilation of 𝑈 if 𝐹 (𝑈, 𝑉) = 1. If 𝐹 (𝑈, 𝑉) ≥ 1 − 𝜀, where 𝜀 ∈ [0, 1], then we
call 𝑉 an 𝜀-approximate compilation of 𝑈, or simply an approximate compilation of 𝑈.
As we will see, the quantity 𝐹 (𝑈, 𝑉) has a connection to our cost function, defined below,
and hence our cost function has operational relevance to approximate compiling. Minimizing
our cost function is related to maximizing 𝐹 (𝑈, 𝑉), and thus is related to compiling to a better
approximation.
QAQC achieves approximate compiling by training a gate sequence 𝑉 of a fixed length 𝐿, which
may even be shorter than the length required to exactly compile 𝑈. As one increases 𝐿, one can
further minimize our cost function. The length 𝐿 can therefore be regarded as a parameter that can
be tuned to obtain arbitrarily good approximate compilations of 𝑈.
3.3.2 Discrete and continuous parameters
The gate sequence 𝑉 should be expressed in terms of the native gates of the quantum computer
being used. Consider an alphabet $\mathcal{A} = \{G_k(\alpha)\}_k$ of gates $G_k(\alpha)$ that are native to the quantum
computer of interest. Here, $\alpha \in \mathbb{R}$ is a continuous parameter, and 𝑘 is a discrete parameter that
identifies the type of gate and which qubits it acts on. For a given quantum computer, the problem
of compiling 𝑈 to a gate sequence of length 𝐿 is to determine
$$ (\vec{\alpha}_{\mathrm{opt}}, \vec{k}_{\mathrm{opt}}) := \arg\min_{(\vec{\alpha}, \vec{k})} C(U, V_{\vec{k}}(\vec{\alpha})), \tag{3.3} $$
where
$$ V_{\vec{k}}(\vec{\alpha}) = G_{k_L}(\alpha_L) \, G_{k_{L-1}}(\alpha_{L-1}) \cdots G_{k_1}(\alpha_1) \tag{3.4} $$
[Figure 3.2 graphic (flowchart). Input: 𝑈. Quantum computer: evaluates the cost $C(U, V_{\vec{k}}(\vec{\alpha}))$ of the gate sequence $V_{\vec{k}}(\vec{\alpha})$ with the HST or LHST circuit. Classical computer: structure parameter optimizer over 𝑘 and continuous parameter optimizer over 𝛼, repeated while 𝛼 is not optimal. Output: $V_{\vec{k}_{\mathrm{opt}}}(\vec{\alpha}_{\mathrm{opt}})$. Panels: (a) variable structure approach; (b) fixed structure approach, with an ansatz whose layers consist of two-qubit gates.]
Figure 3.2: Outline of our variational hybrid quantum-classical algorithm, in which we optimize
over gate structures and continuous gate parameters in order to perform QAQC for a given input
unitary 𝑈. We take two approaches towards structure optimization: (a) For small problem sizes,
we allow the gate structure to vary for a given gate sequence length 𝐿, which in general leads to
an approximate compilation of 𝑈. To obtain a better approximate compilation, the best structure
obtained can be concatenated with a new sequence of a possibly different length, whose structure
can vary. For each iteration of the continuous parameter optimization, we calculate the cost using
the Hilbert-Schmidt Test (HST); see Sec. 3.4.1. (b) For large problem sizes, we fix the gate
structure using an ansatz consisting of layers of two-qubit gates. By increasing the number of
layers, we can obtain better approximate compilations of 𝑈. For each iteration of the continuous
parameter optimization, we calculate the cost using the Local Hilbert-Schmidt Test (LHST); see
Sec. 3.4.2.
is the trainable unitary. Here, $V_{\vec{k}}(\vec{\alpha})$ is a function of the sequence $\vec{k} = (k_1, \ldots, k_L)$ of parameters
describing which gates from the native gate set are used and of the continuous parameters
$\vec{\alpha} = (\alpha_1, \ldots, \alpha_L)$ associated with each gate. The function $C(U, V_{\vec{k}}(\vec{\alpha}))$ is the cost, which quantifies how
close the trained unitary is to the target unitary. We define the cost below to have the properties:
0 ≤ 𝐶(𝑈, 𝑉) ≤ 1 for all unitaries 𝑈 and 𝑉, and 𝐶(𝑈, 𝑉) = 0 if and only if 𝑈 = 𝑉 (possibly up to a
global phase factor).
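As a rough illustration of the parameterization in (3.4), the following sketch builds a one-qubit $V_{\vec{k}}(\vec{\alpha})$ from a list of discrete indices and a list of angles. The two-gate alphabet, the index convention (0 for $R_z$, 1 for $R_x$), and the names `ALPHABET` and `build_v` are assumptions made for this example only and are not taken from the hardware alphabets discussed in this chapter.

```python
import cmath
import math

def rz(alpha):
    """z-rotation R_z(alpha) = exp(-i * alpha * Z / 2)."""
    return [[cmath.exp(-1j * alpha / 2), 0], [0, cmath.exp(1j * alpha / 2)]]

def rx(alpha):
    """x-rotation R_x(alpha) = exp(-i * alpha * X / 2)."""
    c, s = math.cos(alpha / 2), math.sin(alpha / 2)
    return [[c, -1j * s], [-1j * s, c]]

# Hypothetical one-qubit alphabet: the discrete index k selects the gate type.
ALPHABET = {0: rz, 1: rx}

def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    size = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(size)) for j in range(size)]
            for i in range(size)]

def build_v(ks, alphas):
    """Return V_k(alpha) = G_{k_L}(alpha_L) ... G_{k_1}(alpha_1), as in Eq. (3.4)."""
    v = [[1, 0], [0, 1]]
    for k, alpha in zip(ks, alphas):
        # G_{k_1} acts first, so each later gate multiplies from the left.
        v = matmul(ALPHABET[k](alpha), v)
    return v

# R_z(pi/2) R_x(pi/2) R_z(pi/2) equals the Hadamard gate up to a global phase of -i.
v = build_v([0, 1, 0], [math.pi / 2] * 3)
```

In this picture, a structure update changes the list `ks`, while the continuous optimization changes only `alphas`.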
The optimization in (3.3) contains two parts: discrete optimization over the finite set of gate
structures parameterized by $\vec{k}$, and continuous optimization over the parameters $\vec{\alpha}$ characterizing
the gates within the structure. Our quantum-classical hybrid strategy to perform the optimization
in (3.3) is illustrated in Fig. 3.2. In the next subsection, we present a general, ansatz-free approach
to optimizing our cost function, which may be useful for systems with a small number of qubits. In
the subsection following that, we present an ansatz-based approach that would allow the extension
to larger system sizes. In each case, we perform the continuous parameter optimization using
gradient-free methods as described in Section 3.14. We also discuss a method for gradient-based
continuous parameter optimization in Section 3.15.
3.3.3 Small problem sizes
Suppose 𝑈 and 𝑉 act on a 𝑑-dimensional space of 𝑛 qubits, so that $d = 2^n$. To perform the
continuous parameter optimization in (3.3), we define the cost function
$$ C_{\mathrm{HST}}(U, V) := 1 - \frac{1}{d^2} |\langle V, U \rangle|^2 = 1 - \frac{1}{d^2} |\mathrm{Tr}(V^\dagger U)|^2 , \tag{3.5} $$
where HST stands for “Hilbert-Schmidt Test” and refers to the circuit used to evaluate the cost, which
we introduce in Sec. 3.4.1. Note that the quantity $\frac{1}{d^2} |\langle V, U \rangle|^2$ is simply the fidelity between the
pure states obtained by applying 𝑈 and 𝑉 to one half of a maximally entangled state. Consequently,
it has an operational meaning in terms of 𝐹(𝑈, 𝑉). Indeed, it can be shown [93, 94] that
$$ C_{\mathrm{HST}}(U, V) = \frac{d+1}{d} \left( 1 - F(U, V) \right) . \tag{3.6} $$
Also note that for any two unitaries 𝑈 and 𝑉, 𝐶HST(𝑈, 𝑉) = 0 if and only if 𝑈 and 𝑉 differ by
a global phase factor, i.e., $V = e^{i\varphi} U$ for some $\varphi \in \mathbb{R}$. By minimizing 𝐶HST, we thus learn an
equivalent unitary 𝑉 up to a global phase.
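The properties of 𝐶HST stated above can be checked numerically for small matrices. The sketch below evaluates (3.5) classically and compares it with the average fidelity, assuming the standard closed form $F = (d + |\mathrm{Tr}(V^\dagger U)|^2)/(d(d+1))$ for the Haar average; the function names and the test unitaries are choices made for this example.

```python
import cmath

def hs_inner(v, u):
    """Hilbert-Schmidt inner product <V, U> = Tr(V^dagger U)."""
    d = len(u)
    return sum(complex(v[i][j]).conjugate() * u[i][j]
               for i in range(d) for j in range(d))

def cost_hst(u, v):
    """C_HST(U, V) = 1 - |Tr(V^dagger U)|^2 / d^2, Eq. (3.5)."""
    d = len(u)
    return 1 - abs(hs_inner(v, u)) ** 2 / d ** 2

def average_fidelity(u, v):
    """Haar-averaged fidelity, assuming the closed form (d + |Tr(V^dagger U)|^2) / (d (d + 1))."""
    d = len(u)
    return (d + abs(hs_inner(v, u)) ** 2) / (d * (d + 1))

def rz(theta):
    """z-rotation exp(-i * theta * Z / 2), used only as a test unitary."""
    return [[cmath.exp(-1j * theta / 2), 0], [0, cmath.exp(1j * theta / 2)]]

u = rz(0.7)
v_phase = [[cmath.exp(0.4j) * x for x in row] for row in u]  # V = e^{i phi} U
identity = [[1, 0], [0, 1]]

print(cost_hst(u, v_phase))   # numerically zero: a global phase is not penalized
print(cost_hst(u, identity))  # equals sin^2(0.35) for this choice of U
```

For these one-qubit examples, multiplying `1 - average_fidelity(u, v)` by $(d+1)/d = 3/2$ reproduces `cost_hst(u, v)`, in agreement with (3.6).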
Now, to perform the optimization over gate structures in (3.3), one strategy is to search over
all possible gate structures for a gate sequence length 𝐿, which can be allowed to vary during
the optimization. As the set of gate structures grows exponentially with the number of gates 𝐿,
such a brute force search over all gate structures in order to obtain the best one is intractable
in general. To efficiently search through this exponentially large space, we adopt an approach
based on simulated annealing. (An alternative approach is genetic optimization, which has been
implemented previously to classically optimize quantum gate sequences [95].)
Our simulated annealing approach starts with a random gate structure, then performs continuous
optimization over the parameters $\vec{\alpha}$ that characterize the gates in order to minimize the cost function.
We then perform a structure update that involves randomly replacing a subset of gates in the sequence
with new gates (which can be done in a way such that the sequence length can increase or decrease)
and re-optimizing the cost function over the continuous parameters $\vec{\alpha}$. If this structure change
produces a lower cost, then we accept the change. If the cost increases, then we accept the change
with probability decreasing exponentially in the magnitude of the cost difference. We iterate this
procedure until the cost converges or until a maximum number of iterations is reached.
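The accept/reject rule described above can be sketched as a Metropolis-style loop. This is a minimal illustration and not the implementation used for the results in this chapter: the `temperature` parameter, the callables `propose` and `optimized_cost`, and the fixed number of steps are all assumptions introduced here.

```python
import math
import random

def accept_structure_update(cost_old, cost_new, temperature, rng=random):
    """Accept a lower cost always; accept a higher cost with probability exp(-increase / temperature)."""
    if cost_new <= cost_old:
        return True
    return rng.random() < math.exp(-(cost_new - cost_old) / temperature)

def anneal(initial_structure, propose, optimized_cost, steps, temperature, rng=random):
    """Simulated-annealing loop over gate structures.

    propose(structure, rng) returns a modified structure (hypothetical helper);
    optimized_cost(structure) returns the cost after the continuous parameters
    have been re-optimized for that structure (hypothetical helper).
    """
    structure = initial_structure
    cost = optimized_cost(structure)
    for _ in range(steps):
        candidate = propose(structure, rng)
        candidate_cost = optimized_cost(candidate)
        if accept_structure_update(cost, candidate_cost, temperature, rng):
            structure, cost = candidate, candidate_cost
    return structure, cost
```

In a full QAQC run, `optimized_cost` would wrap the inner continuous optimization, with each cost evaluation performed on the quantum computer.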
With a fixed gate sequence length 𝐿, the approach outlined above will in general lead to an
approximate compilation of 𝑈, which in many cases is sufficient. One strategy for obtaining
better and better approximate compilations of 𝑈 is a layered approach illustrated in Fig. 3.2(a).
In this approach, we consider a particular gate sequence length 𝐿 and perform the full structure
optimization, as outlined above, to obtain an (approximate) length-𝐿 compilation of 𝑈. The optimal
gate sequence structure thus obtained can then be concatenated with a new sequence of a possibly
different (but fixed) length, whose structure can vary. By performing the continuous parameter
optimization over the entire longer gate sequence, and performing the structure optimization over
the new additional segment of the gate sequence, we can obtain a better approximate compilation
of 𝑈. Iterating this procedure can then lead to increasingly better approximate compilations of 𝑈.
3.3.4 Large problem sizes
We emphasize two potential issues with scaling the above approach to large problem sizes.
First, one may want a guarantee that there exists an exact compilation of 𝑈 within a polynomial
size search space for 𝑉. When performing full structure optimization, as above, the search space
size grows exponentially in the length 𝐿 of the gate sequence. This implies that the search space
size grows exponentially in 𝑛, if one chooses 𝐿 to grow polynomially in 𝑛. Indeed, one would
typically require 𝐿 to grow polynomially in 𝑛 if one is interested in exact compilation, since the
number of gates in 𝑈 itself grows polynomially in 𝑛 for many applications. (Note that this issue
arises if one insists on exact, instead of approximate, compiling.)
Second, and arguably more importantly, the cost 𝐶HST (𝑈, 𝑉) is exponentially fragile. The inner
product between 𝑈 and 𝑉 will be exponentially suppressed for random choices of 𝑉, which means
that 𝐶HST (𝑈, 𝑉) will be very close to one for most unitaries 𝑉. Hence, for random unitaries 𝑉, the
number of calls to the quantum computer needed to resolve differences in the cost 𝐶HST (𝑈, 𝑉) to a
given precision will grow exponentially.
The first issue can be addressed with an efficiently parameterized ansatz for 𝑉. With an
ansatz, only the continuous parameters $\vec{\alpha}$ need to be optimized in 𝑉. The $\vec{k}$ parameters are fixed,
which means that structure updates are not required. This fixed structure approach is depicted in
Fig. 3.2(b). One can choose an ansatz such that the number of parameters needed to represent the
target unitary 𝑈 is only a polynomial function of 𝑛. Hence, one should allow the ansatz 𝐴(𝑈) to be
application specific, i.e., to be a function of 𝑈. As an example, if $U = e^{-i\mathcal{H}t}$ for a local Hamiltonian
H , one could choose the ansatz to involve a polynomial number of local interactions. Due to the
application-specific nature of the ansatz, the problem is a complex one, hence we leave the issue of
finding efficient ansatzes for future work.
Nevertheless, we show a concrete example of a potential ansatz for 𝑉 in Fig. 3.3. The ansatz
is defined by a number ℓ of layers, with each layer being a gate sequence of depth two consisting
of two-qubit gates acting on neighboring qubits. Consider the following argument. In QAQC,
the unitary 𝑈 to be compiled is executed on the quantum computer, so it must be efficiently
implementable, i.e., the gate count is polynomial in 𝑛. Next, note that the gate sequence used to
implement 𝑈 can be compiled into the ansatz in Fig. 3.3 with only polynomial overhead. This
implies that the ansatz in Fig. 3.3 could exactly describe 𝑈 in only a polynomial number of layers
and would hence eliminate the need to search through an exponentially large space. We remark
that the ansatz in Fig. 3.3 may be particularly useful for applications involving compiling quantum
simulations of physically relevant systems, as the structure resembles that of the Suzuki-Trotter
decomposition [96] for nearest-neighbor Hamiltonians.
Figure 3.3: (a) One layer of the ansatz for the trainable unitary 𝑉 in the case of four qubits. The
gate sequence in the layer consists of a two-qubit gate acting on the first and second qubits, the third
and fourth qubits, the second and third qubits, and the first and fourth qubits. (b) The full ansatz
defining the trainable unitary 𝑉 consists of a particular number ℓ of copies of the layer in (a). Shown are two
layers in the case of four qubits.
Let us now consider the second issue mentioned above: the exponentially suppressed inner
product between 𝑈 and 𝑉 for large 𝑛. To address this, we propose an alternative cost function
involving a weighted average between the function in (3.5) and a “local” cost function:
$$ C_q(U, V) := q \, C_{\mathrm{HST}}(U, V) + (1 - q) \, C_{\mathrm{LHST}}(U, V), \tag{3.7} $$
where 0 ≤ 𝑞 ≤ 1 and
$$ C_{\mathrm{LHST}}(U, V) := \frac{1}{n} \sum_{j=1}^{n} C^{(j)}_{\mathrm{LHST}}(U, V) = 1 - F_e . \tag{3.8} $$
Here, LHST stands for “Local Hilbert-Schmidt Test”, referring to the circuit discussed in Sec. 3.4.2
that is used to compute this function. Also, $F_e := \frac{1}{n} \sum_{j=1}^{n} F_e^{(j)}$, where the quantities $F_e^{(j)}$ are
entanglement fidelities (hence the notation $F_e$) of local quantum channels $\mathcal{E}_j$ defined in Sec. 3.4.2.
Hence, 𝐶LHST(𝑈, 𝑉) is a normalized sum of local costs, where each local cost is written as a local entanglement
fidelity: $C^{(j)}_{\mathrm{LHST}}(U, V) = 1 - F_e^{(j)}$. Expressing the overall cost as a sum of local costs is analogous
to what is done in the variational quantum eigensolver [78], where the overall energy is expressed
as a sum of local energies. The functions $C^{(j)}_{\mathrm{LHST}}$ are local in the sense that only two qubits need
to be measured in order to calculate each one of them. This is unlike the function 𝐶HST, whose
calculation requires the simultaneous measurement of 2𝑛 qubits.
The cost function 𝐶𝑞 in (3.7) is a weighted average between the “global” cost function 𝐶HST
and the local cost function 𝐶LHST , with 𝑞 representing the weight given to the global cost function.
The weight 𝑞 can be chosen according to the size of the problem: for a relatively small number
of qubits, we would let 𝑞 = 1. As the number of qubits increases, we would slowly decrease 𝑞 to
mitigate the suppression of the inner product between 𝑈 and 𝑉.
To see why 𝐶LHST can be expected to deal with the issue of an exponentially suppressed inner
product for large 𝑛, consider the following example. Suppose the unitary 𝑈 to be compiled is the
tensor product $U = U_1 \otimes U_2 \otimes \cdots \otimes U_n$ of unitaries $U_j$ acting on qubit 𝑗, and suppose we take the
tensor product $V = V_1 \otimes V_2 \otimes \cdots \otimes V_n$ as the trainable unitary. We get that $C_{\mathrm{HST}}(U, V) = 1 - \prod_{j=1}^{n} r_j$,
where $r_j = (1/4) |\mathrm{Tr}(V_j^\dagger U_j)|^2$. Since each $r_j$ will likely be less than one for a random choice of $V_j$,
their product will be small for large 𝑛. Consequently a very large portion of the cost landscape
will have 𝐶HST(𝑈, 𝑉) ≈ 1 and hence will have a vanishing gradient. However, the cost function
𝐶LHST is defined such that $C_{\mathrm{LHST}}(U, V) = 1 - \frac{1}{n} \sum_{j=1}^{n} r_j$, so that we obtain an average of the $r_j$
quantities rather than a product. Taking the average instead of the product leads to a gradient that
is not suppressed for large 𝑛.
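The contrast between the product and the average in this tensor-product example is easy to see numerically. The sketch below takes the single-qubit overlaps $r_j$ as given inputs; the value $r_j = 0.5$ and the function names are illustrative assumptions, not data from this work.

```python
import math

def cost_hst_product(r):
    """C_HST = 1 - prod_j r_j for tensor-product U and V."""
    return 1 - math.prod(r)

def cost_lhst_product(r):
    """C_LHST = 1 - (1/n) sum_j r_j for tensor-product U and V."""
    return 1 - sum(r) / len(r)

for n in (2, 10, 50):
    r = [0.5] * n  # illustrative overlaps, chosen for this example only
    print(n, cost_hst_product(r), cost_lhst_product(r))
```

With all $r_j = 1/2$, the global cost approaches one as $1 - 2^{-n}$ while the local cost stays at $1/2$. Correspondingly, $\partial C_{\mathrm{HST}} / \partial r_1 = -\prod_{j \neq 1} r_j = -2^{-(n-1)}$ is exponentially small, whereas $\partial C_{\mathrm{LHST}} / \partial r_1 = -1/n$ decays only polynomially.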
More generally, for any 𝑈 and 𝑉, the quantity $F_e$, which is responsible for the variability in
𝐶LHST, can be made non-vanishing by adding local unitaries to 𝑉. In particular, for a given 𝑈 and
𝑉, it is straightforward to show that for all 𝑗 ∈ {1, 2, . . . , 𝑛} there exists a unitary $V_j$ acting on qubit
𝑗 such that $F_e^{(j)} \geq \frac{1}{4}$ for the gate sequence given by $V' = V_j V$. In other words, there exists a local
unitary $V_j$ that can be added to the trainable gate sequence 𝑉 such that $C^{(j)}_{\mathrm{LHST}}(U, V_j V) \leq \frac{3}{4}$. This
implies that, with the appropriate local unitary applied to each qubit at the end of the trainable gate
sequence, the local cost function 𝐶LHST can always be decreased to no greater than $\frac{3}{4}$. Note that
local unitaries cannot be used in this way to decrease the global cost function 𝐶HST, i.e., to make
the second term in (3.5) non-vanishing.
Finally, one can show (see Section 3.12) that $C_{\mathrm{LHST}} \geq (1/n) \, C_{\mathrm{HST}}$. Combining this with
Eq. (3.6) gives
$$ C_q(U, V) \geq \frac{1 - q + nq}{n} \, \frac{d+1}{d} \left( 1 - F(U, V) \right) , \tag{3.9} $$
which implies that
$$ F(U, V) \geq 1 - \frac{n}{1 - q + nq} \, \frac{d}{d+1} \, C_q(U, V) . \tag{3.10} $$
Hence, the cost function 𝐶𝑞 retains an operational meaning for the task of approximate compiling,
since it provides a bound on the average fidelity between 𝑈 and 𝑉.
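For completeness, the intermediate steps behind the bound on $C_q$ can be written out as follows. This only spells out the combination of the definition (3.7), the inequality $C_{\mathrm{LHST}} \geq (1/n) C_{\mathrm{HST}}$, and the relation (3.6); no additional assumptions enter.

```latex
\begin{aligned}
C_q(U,V) &= q\,C_{\mathrm{HST}}(U,V) + (1-q)\,C_{\mathrm{LHST}}(U,V) \\
         &\geq q\,C_{\mathrm{HST}}(U,V) + \frac{1-q}{n}\,C_{\mathrm{HST}}(U,V) \\
         &= \frac{1-q+nq}{n}\,C_{\mathrm{HST}}(U,V)
          = \frac{1-q+nq}{n}\,\frac{d+1}{d}\,\bigl(1 - F(U,V)\bigr) .
\end{aligned}
```

Solving the last line for $F(U,V)$ gives the fidelity bound. As a sanity check, $q = 1$ reduces it to $F \geq 1 - \frac{d}{d+1} C_{\mathrm{HST}}$, which holds with equality by (3.6), while $q = 0$ gives $F \geq 1 - \frac{n d}{d+1} C_{\mathrm{LHST}}$.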
3.3.5 Special case of a fixed input state
An important special case of quantum compiling is when the target unitary 𝑈 happens to appear at
the beginning of one’s quantum algorithm, and hence the state that one inputs to 𝑈 is fixed. For
many quantum computers, this input state is $|\psi_0\rangle = |0\rangle^{\otimes n}$. We emphasize that many use cases of
QAQC do not fall under this special case, since one is often interested in compiling unitaries that do
not appear at the beginning of one’s algorithm. For example, one may be interested in the optimal
compilation of a controlled-unitary, but such a unitary would never appear at the beginning of an
algorithm since its action would be trivial. Nevertheless we highlight this special case because
QAQC can potentially be simplified in this case. In addition, this special case was very recently
explored in Ref. [52] after the completion of our article.
In this special scenario, a natural cost function would be
$$ C_{\text{fixed input}} = 1 - |\langle \psi_0 | V^\dagger U | \psi_0 \rangle|^2 . \tag{3.11} $$
This could be evaluated on a quantum computer in two possible ways. One way is to apply 𝑈 and
then 𝑉 † to the |𝜓0 ⟩ state and then measure the probability to be in the |𝜓0 ⟩ state. Another way is to
apply 𝑈 to one copy of |𝜓0 ⟩ and 𝑉 to another copy of |𝜓0 ⟩, and then measure the overlap [50, 55]
between these two states.
However, this cost function would not scale well for the same reason discussed above that our
𝐶HST cost does not scale well, i.e., its gradient can vanish exponentially. Again, one can fix this
issue with a local cost function. Assuming $|\psi_0\rangle = |0\rangle^{\otimes n}$, this local cost can take the form:
$$ C^{\mathrm{local}}_{\text{fixed input}} = 1 - \frac{1}{n} \sum_{j=1}^{n} p_0^{(j)} , \tag{3.12} $$
where
$$ p_0^{(j)} = \mathrm{Tr}\left[ (|0\rangle\langle 0|_j \otimes I) \, V^\dagger U |\psi_0\rangle\langle\psi_0| U^\dagger V \right] \tag{3.13} $$
is the probability to obtain the zero measurement outcome on qubit 𝑗 for the state $V^\dagger U |\psi_0\rangle$.
We remark that the two cost functions in (3.11) and (3.12) can each be evaluated with quantum
circuits on only 𝑛 qubits. This is in contrast to 𝐶HST and 𝐶LHST , whose evaluation involves quantum
circuits with 2𝑛 qubits (see the next section for the circuits). This reduction in resource requirements
is the main reason why we highlight this special case.
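To make the fixed-input costs concrete, the sketch below evaluates both of them from the amplitudes of the state $V^\dagger U |\psi_0\rangle$ with $|\psi_0\rangle = |0\rangle^{\otimes n}$. It is a classical illustration only; the function name and the bit-ordering convention (qubit 1 as the most significant bit of the basis-state index) are assumptions made here.

```python
def fixed_input_costs(state, n):
    """Return (global cost, local cost) for the n-qubit state V^dagger U |0...0>.

    state: list of 2**n amplitudes in the computational basis, where qubit 1
    is taken as the most significant bit of the index (a convention chosen here).
    """
    # Global cost: one minus the probability of the all-zeros outcome.
    global_cost = 1 - abs(state[0]) ** 2
    p0 = []
    for j in range(1, n + 1):
        shift = n - j
        # Probability that qubit j is measured in |0>, summed over all other qubits.
        p0.append(sum(abs(a) ** 2 for idx, a in enumerate(state)
                      if ((idx >> shift) & 1) == 0))
    local_cost = 1 - sum(p0) / n
    return global_cost, local_cost

# Example: two qubits with only the second qubit flipped, i.e. the state |01>.
print(fixed_input_costs([0, 1, 0, 0], 2))
```

For the state |01⟩ the global cost is 1, while the local cost is 1/2 because one of the two qubits is still in |0⟩. This is the same product-versus-average behavior that distinguishes 𝐶HST from 𝐶LHST.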
3.4 Cost evaluation circuits
In this section, we present short-depth circuits for evaluating the functions in (3.5) and (3.8) and
hence for evaluating the overall cost in (3.7). We note that these circuits are also interesting outside
of the scope of QAQC, and they likely have applications in other areas.
In addition, in Section 3.15, we present circuits for computing the gradient of the cost function,
including a generalization of the Power-of-one-qubit circuit [7] that computes both the real and
imaginary parts of ⟨𝑈, 𝑉⟩.
3.4.1 Hilbert-Schmidt Test
Consider the circuit in Fig. 3.4(a). Below we show that this circuit computes $|\mathrm{Tr}(V^\dagger U)|^2$, where 𝑈
and 𝑉 are 𝑛-qubit unitaries. The circuit involves 2𝑛 qubits, where we call the first (second) 𝑛-qubit
system 𝐴 (𝐵).
The first step in the circuit is to create a maximally entangled state between 𝐴 and 𝐵, namely,
the state
$$ |\Phi^+\rangle_{AB} = \frac{1}{\sqrt{d}} \sum_{\vec{j}} |\vec{j}\rangle_A \otimes |\vec{j}\rangle_B , \tag{3.14} $$
where $\vec{j} = (j_1, j_2, \ldots, j_n)$ is a vector index in which each component $j_k$ is chosen from {0, 1}. The
first two gates in Fig. 3.4(a), namely the Hadamard gates and the CNOT gates (which are performed in
parallel when acting on distinct qubits), create the $|\Phi^+\rangle$ state.
The second step is to act with 𝑈 on system 𝐴 and with $V^*$ on system 𝐵. ($V^*$ is the complex
conjugate of 𝑉, where the complex conjugate is taken in the standard basis.) Note that these two
gates are performed in parallel. This gives the state
$$ (U \otimes V^*) |\Phi^+\rangle_{AB} = \frac{1}{\sqrt{d}} \sum_{\vec{j}} U |\vec{j}\rangle_A \otimes V^* |\vec{j}\rangle_B . \tag{3.15} $$
We emphasize that the unitary $V^*$ is implemented on the quantum computer, not 𝑉 itself. (See
Section 3.10 for elaboration on this point.)
The third and final step is to measure in the Bell basis. This corresponds to undoing the unitaries
(the CNOTs and Hadamards) used to prepare $|\Phi^+\rangle$ and then measuring in the standard basis. At
the end, we are only interested in estimating a single probability: the probability for the Bell-basis
measurement to give the $|\Phi^+\rangle$ outcome, which corresponds to the all-zeros outcome in the standard
basis. The amplitude associated with this probability is
$$ \langle\Phi^+| U \otimes V^* |\Phi^+\rangle = \langle\Phi^+| U V^\dagger \otimes I |\Phi^+\rangle \tag{3.16} $$
$$ = \frac{1}{d} \mathrm{Tr}(V^\dagger U) . \tag{3.17} $$
To obtain the first equality we used the ricochet property:
$$ I \otimes X |\Phi^+\rangle = X^T \otimes I |\Phi^+\rangle , \tag{3.18} $$
which holds for any operator 𝑋 acting on a 𝑑-dimensional space. The probability of the $|\Phi^+\rangle$
outcome is then the absolute square of the amplitude, i.e., $(1/d^2) |\mathrm{Tr}(V^\dagger U)|^2$. Hence, this probability
gives us the absolute value of the Hilbert-Schmidt inner product between 𝑈 and 𝑉. We therefore
call the circuit in Fig. 3.4(a) the Hilbert-Schmidt Test (HST).
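The three steps of the HST can be mimicked with a small state-vector calculation: prepare $|\Phi^+\rangle$, apply $U \otimes V^*$, and project back onto $|\Phi^+\rangle$. The sketch below is a classical check of the claimed outcome probability for small matrices, not a model of the hardware circuit; the helper names are introduced for this example.

```python
import math

def kron(a, b):
    """Kronecker product of two matrices stored as nested lists."""
    return [[a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

def conj(m):
    """Entrywise complex conjugate in the standard basis (V -> V^*)."""
    return [[complex(x).conjugate() for x in row] for row in m]

def hst_probability(u, v):
    """Probability of the |Phi+> outcome after applying U (x) V^* to |Phi+>."""
    d = len(u)
    # |Phi+> = (1/sqrt(d)) sum_j |j>_A |j>_B, with index j * d + j.
    phi = [0.0] * (d * d)
    for j in range(d):
        phi[j * d + j] = 1 / math.sqrt(d)
    op = kron(u, conj(v))
    out = [sum(op[r][c] * phi[c] for c in range(d * d)) for r in range(d * d)]
    # phi is real, so no conjugation is needed in the overlap.
    amplitude = sum(phi[r] * out[r] for r in range(d * d))
    return abs(amplitude) ** 2

x_gate = [[0, 1], [1, 0]]
s_gate = [[1, 0], [0, 1j]]
identity = [[1, 0], [0, 1]]
print(hst_probability(s_gate, identity))  # |Tr(S)|^2 / 4 = 0.5 up to rounding
```

For $U = V$ the probability is one, and for $U = X$, $V = I$ it vanishes because $\mathrm{Tr}(X) = 0$.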
Consider the depth of this circuit. Let 𝐷 (𝐺) denote the depth of a gate sequence 𝐺 for a
fully-connected quantum computer whose native gate alphabet includes the CNOT gate and the set
of all one-qubit gates. Then, for the HST, we have
𝐷 (HST) = 4 + max{𝐷 (𝑈), 𝐷 (𝑉 ∗ )} . (3.19)
The first term of 4 is associated with the Hadamards and CNOTs in Fig. 3.4(a), and this term is
negligible when the depth of 𝑈 or 𝑉 ∗ is large. The second term results from the fact that 𝑈 and 𝑉 ∗
are performed in parallel. Hence, whichever unitary, 𝑈 or 𝑉 ∗ , has the larger depth will determine
the overall depth of the HST.
3.4.2 Local Hilbert-Schmidt Test
Let us now consider a slightly modified form of the HST, shown in Fig. 3.4(b). We call this the
Local Hilbert-Schmidt Test (LHST) because, unlike the HST in Fig. 3.4(a), only two of the total
number 2𝑛 of qubits are measured: one qubit from system 𝐴, say 𝐴 𝑗 , and the corresponding qubit
𝐵 𝑗 from system 𝐵, where 𝑗 ∈ {1, 2, . . . , 𝑛}.
The state of systems 𝐴 and 𝐵 before the measurements is given by Eq. (3.15). Using the
ricochet property in (3.18) as before, we obtain
$$ (U \otimes V^*) |\Phi^+\rangle_{AB} = (U V^\dagger \otimes I) |\Phi^+\rangle_{AB} \tag{3.20} $$
$$ = (W \otimes I) |\Phi^+\rangle_{AB} , \tag{3.21} $$
where $W := U V^\dagger$. Let $\bar{A}_j$ denote all systems $A_k$ except for $A_j$, and let $\bar{B}_j$ denote all systems $B_k$
except for $B_j$. Taking the partial trace over $\bar{A}_j$ and $\bar{B}_j$ on the state in (3.21) gives us the following
state on the qubits $A_j$ and $B_j$ that are being measured:
$$ \mathrm{Tr}_{\bar{A}_j \bar{B}_j} \left( (W_A \otimes I_B) \, |\Phi^+\rangle\langle\Phi^+|_{AB} \, (W_A^\dagger \otimes I_B) \right) $$
$$ = \mathrm{Tr}_{\bar{A}_j} \left( (W_A \otimes I_{B_j}) \left( |\Phi^+\rangle\langle\Phi^+|_{A_j B_j} \otimes \frac{I_{\bar{A}_j}}{2^{n-1}} \right) (W_A^\dagger \otimes I_{B_j}) \right) \tag{3.22} $$
$$ = (\mathcal{E}_j \otimes \mathcal{I}_{B_j}) \left( |\Phi^+\rangle\langle\Phi^+|_{A_j B_j} \right) . \tag{3.23} $$
In (3.22), $|\Phi^+\rangle_{A_j B_j}$ is a 2-qubit maximally entangled state of the form in (3.14). In (3.23), we have
defined the channel $\mathcal{E}_j$ by
$$ \mathcal{E}_j(\rho_{A_j}) := \mathrm{Tr}_{\bar{A}_j} \left( W_A \left( \rho_{A_j} \otimes \frac{I_{\bar{A}_j}}{2^{n-1}} \right) W_A^\dagger \right) . \tag{3.24} $$
[Figure 3.4 graphic (circuit diagrams). In both panels, Hadamard and CNOT gates entangle each qubit $A_k$ with $B_k$, after which 𝑈 acts on system 𝐴 and $V^*$ acts on system 𝐵. In panel (a) the entangling gates are undone on all qubit pairs; in panel (b) they are undone only on the pair $A_1$, $B_1$.]
Figure 3.4: (a) The Hilbert-Schmidt Test. For this circuit, the probability to obtain the measurement
outcome in which all 2𝑛 qubits are in the |0⟩ state is equal to $(1/d^2) |\mathrm{Tr}(V^\dagger U)|^2$. Hence, this circuit
computes the magnitude of the Hilbert-Schmidt inner product, |⟨𝑉, 𝑈⟩|, between 𝑈 and 𝑉. (b) The
Local Hilbert-Schmidt Test, which is the same as the Hilbert-Schmidt Test except that only two of
the 2𝑛 qubits are measured at the end. Shown is the measurement of the qubits $A_1$ and $B_1$, and the
probability that both qubits are in the state |0⟩ is given by (3.25) with 𝑗 = 1.
The probability of obtaining the (0, 0) outcome in the measurement of $A_j$ and $B_j$ is the overlap of
the state in (3.23) with the $|\Phi^+\rangle_{A_j B_j}$ state, given by
$$ F_e^{(j)} := \mathrm{Tr} \left( |\Phi^+\rangle\langle\Phi^+|_{A_j B_j} \, (\mathcal{E}_j \otimes \mathcal{I}_{B_j}) \left( |\Phi^+\rangle\langle\Phi^+|_{A_j B_j} \right) \right) . \tag{3.25} $$
Note that this is the entanglement fidelity of the channel $\mathcal{E}_j$. We use these entanglement fidelities
(for each 𝑗) to define the local cost function 𝐶LHST(𝑈, 𝑉) as
$$ C_{\mathrm{LHST}}(U, V) = \frac{1}{n} \sum_{j=1}^{n} C^{(j)}_{\mathrm{LHST}}(U, V), \tag{3.26} $$
where
$$ C^{(j)}_{\mathrm{LHST}}(U, V) := 1 - F_e^{(j)} . \tag{3.27} $$
Note that for all 𝑗 ∈ {1, 2, . . . , 𝑛}, the maximum value of $F_e^{(j)}$ is one, which occurs when $\mathcal{E}_j$ is the
identity channel. This means that the minimum value of 𝐶LHST(𝑈, 𝑉) is zero. In Section 3.11, we
show that 𝐶LHST is indeed a faithful cost function:
Proposition 1: For all unitaries 𝑈 and 𝑉, it holds that 𝐶LHST (𝑈, 𝑉) = 0 if and only if 𝑈 = 𝑉 (up to
a global phase).
The cost function 𝐶LHST is simply the average of the probabilities that the two qubits 𝐴 𝑗 𝐵 𝑗 are
not in the |00⟩ state, while the cost function 𝐶HST is the probability that all qubits are not in the
$|0\rangle^{\otimes 2n}$ state. Since the probability of an intersection of events is never greater than the average of
the probabilities of the individual events, we find that
𝐶LHST (𝑈, 𝑉) ≤ 𝐶HST (𝑈, 𝑉) (3.28)
for all unitaries 𝑈 and 𝑉. Furthermore, we can also formulate a bound in the reverse direction
𝑛𝐶LHST (𝑈, 𝑉) ≥ 𝐶HST (𝑈, 𝑉). (3.29)
In Section 3.12, we offer a proof for the above bounds.
Proposition 2: Let 𝑈 and 𝑉 be $2^n \times 2^n$ unitaries. Then,
$$ C_{\mathrm{LHST}}(U, V) \leq C_{\mathrm{HST}}(U, V) \leq n \, C_{\mathrm{LHST}}(U, V) . $$
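The two bounds in Proposition 2 can be checked numerically in the smallest nontrivial case of two qubits. The sketch below is an illustration under stated assumptions rather than the proof of Section 3.12: it works directly with $W = U V^\dagger$, it is restricted to 𝑛 = 2, and it evaluates each $F_e^{(j)}$ through a Kraus-operator form of the channel in (3.24), $F_e^{(j)} = \frac{1}{8} \sum_{a,b} |\mathrm{Tr}_{A_j} \langle a| W |b\rangle|^2$, which is derived here for the example and is not quoted from the text.

```python
def cost_hst_from_w(w):
    """C_HST = 1 - |Tr(W)|^2 / d^2 with W = U V^dagger."""
    d = len(w)
    trace = sum(w[i][i] for i in range(d))
    return 1 - abs(trace) ** 2 / d ** 2

def local_fidelity(w, j):
    """Entanglement fidelity F_e^{(j)} for a 4 x 4 matrix W (two qubits), j in {1, 2}.

    Basis index is 2 * (bit of qubit 1) + (bit of qubit 2). The other qubit is
    traced out with a maximally mixed input, as in Eq. (3.24).
    """
    total = 0.0
    for a in range(2):
        for b in range(2):
            if j == 1:
                # Qubit 2 is the environment: <a|_2 W |b>_2, traced over qubit 1.
                t = sum(w[2 * i + a][2 * i + b] for i in range(2))
            else:
                # Qubit 1 is the environment: <a|_1 W |b>_1, traced over qubit 2.
                t = sum(w[2 * a + i][2 * b + i] for i in range(2))
            total += abs(t) ** 2
    return total / 8

def cost_lhst_from_w(w):
    """C_LHST = 1 - (F_e^{(1)} + F_e^{(2)}) / 2 for two qubits."""
    return 1 - (local_fidelity(w, 1) + local_fidelity(w, 2)) / 2

# W = CNOT, e.g. U = CNOT and V = identity.
cnot = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
# W = X (x) I, a bit flip on qubit 1 only.
x_on_first = [[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]

for w in (cnot, x_on_first):
    print(cost_lhst_from_w(w), cost_hst_from_w(w), 2 * cost_lhst_from_w(w))
```

For the CNOT example the three printed values are 0.5, 0.75 and 1.0, so both inequalities are strict. For the single bit flip they are 0.5, 1.0 and 1.0, so the upper bound $C_{\mathrm{HST}} \leq n \, C_{\mathrm{LHST}}$ is saturated.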
The depth of the circuit in Fig. 3.4(b) used to compute the cost function 𝐶LHST is the same as
the depth of the circuit in Fig. 3.4(a) used to compute 𝐶HST , namely,
𝐷 (LHST) = 4 + max{𝐷 (𝑈), 𝐷 (𝑉 ∗ )}. (3.30)
3.5 Computational complexity of cost evaluation
In this section, we state impossibility results for the efficient classical evaluation of both of our costs,
𝐶HST and 𝐶LHST . To show this, we analyze our circuits in the framework of deterministic quantum
computation with one clean qubit (DQC1) [7]. We then make use of known hardness results for
the class DQC1, and establish that the efficient classical approximation of our cost functions is
impossible under reasonable complexity assumptions.
3.5.1 One-clean-qubit model of computation
The complexity class DQC1 consists of all problems that can be efficiently solved with bounded
error in the one-clean-qubit model of computation. Inspired by the early implementations of NMR
quantum computing [7], in the one-clean-qubit model of computation the input is specified by a
single “clean qubit”, together with a maximally mixed state on 𝑛 qubits:
𝜌 = | 0⟩⟨0 | ⊗ (𝐼/2) ⊗𝑛 . (3.31)
A computation is then realized by applying a poly(𝑛)-sized quantum circuit 𝑄 to the input. We
then measure the clean qubit in the standard basis and consider the probability of obtaining the
outcome “0”, i.e.,
Tr[(| 0⟩⟨0 | ⊗ 𝐼 ⊗𝑛 )𝑄𝜌𝑄 † ]. (3.32)
The DQC1 model of computation has been widely studied, and several natural problems have been
found to be complete for DQC1. Most notably, Shor and Jordan [4] showed that the problem of
trace estimation for 2^𝑛 × 2^𝑛 unitary matrices that specify poly(𝑛)-sized quantum circuits is DQC1-complete.
Moreover, Fujii et al. [85] showed that classical simulation of DQC1 is impossible, unless
the polynomial hierarchy collapses to the second level. Specifically, it is shown that an efficient
classical algorithm that is capable of weakly simulating the output probability distribution of any
DQC1 computation would imply a collapse of the polynomial hierarchy to the class of Arthur-Merlin
protocols, which is not believed to be true. Rather, it is commonly believed that the class DQC1 is
strictly contained in BQP, and thus provides a sub-universal model of quantum computation that is
hard to simulate classically. Finally, we point out that the complexity class DQC1 is known to give
rise to average-case distance measures, whereas worst-case distance measures (such as the diamond
distance) are much harder to approximate, and known to be QIP-complete [86]. Currently, it is not
known whether there exists a distance measure that lies between the average-case and worst-case
measures in DQC1 and QIP, respectively. However, we conjecture that only average-case distance
measures are feasible for practical purposes. We leave the task of finding a distance measure whose
approximation is complete for the class BQP as an interesting open problem.
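The link between the one-clean-qubit model in (3.31) and (3.32) and trace estimation can be illustrated on a small example. The sketch below assumes the standard construction in which 𝑄 consists of a Hadamard on the clean qubit, a controlled-𝑈 on the mixed register, and a second Hadamard; in that case the probability of outcome "0" equals 1/2 + Re Tr(𝑈)/2^{𝑛+1}. The function name is hypothetical, and the simulation uses dense matrices, so it is only feasible for a few qubits.

```python
import numpy as np


def dqc1_prob_zero(u):
    """Probability of outcome 0 on the clean qubit, as in Eq. (3.32).

    Assumes Q = (H x I) . controlled-U . (H x I) acting on the input
    state |0><0| x (I/2)^{x n} of Eq. (3.31).
    """
    d = u.shape[0]
    proj0 = np.array([[1, 0], [0, 0]], dtype=complex)
    rho = np.kron(proj0, np.eye(d) / d)

    had = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
    h_full = np.kron(had, np.eye(d))
    zeros = np.zeros((d, d), dtype=complex)
    ctrl_u = np.block([[np.eye(d, dtype=complex), zeros], [zeros, u]])

    q = h_full @ ctrl_u @ h_full
    out = q @ rho @ q.conj().T
    return float(np.trace(np.kron(proj0, np.eye(d)) @ out).real)


if __name__ == "__main__":
    rng = np.random.default_rng(seed=0)
    z = rng.normal(size=(8, 8)) + 1j * rng.normal(size=(8, 8))
    u, _ = np.linalg.qr(z)  # a random 3-qubit unitary
    p0 = dqc1_prob_zero(u)
    # The two printed numbers should agree: 2*p0 - 1 versus Re Tr(U) / d.
    print(2 * p0 - 1, np.trace(u).real / 8)
```

The imaginary part of the trace can be obtained in the same way by inserting a phase gate on the clean qubit before the final Hadamard.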
Our contributions are the following. We adapt the proofs in [4, 85] and show that the problem
of approximating our cost functions, 𝐶HST or 𝐶LHST , up to inverse polynomial precision is DQC1-
hard. Our results build on the fact that evaluating either of our cost functions is, in some sense, as
hard as trace estimation. Using the results from [85], it then immediately follows that no classical
algorithm can efficiently approximate our cost functions under certain complexity assumptions.
3.5.2 Approximating 𝐶HST is DQC1-hard
In Section 3.13, we prove the following:
Theorem 6: Let 𝑈 and 𝑉 be poly(𝑛)-sized quantum circuits specified by 2^𝑛 × 2^𝑛 unitary matrices,
and let 𝜖 = 𝑂 (1/poly(𝑛)). Then, the problem of approximating 𝐶HST (𝑈, 𝑉) up to 𝜖-precision is
DQC1-hard.
3.5.3 Approximating 𝐶LHST is DQC1-hard
In Sec. 3.13, we also prove the following:
Theorem 7: Let 𝑈 and 𝑉 be poly(𝑛)-sized quantum circuits specified by 2^𝑛 × 2^𝑛 unitary matrices,
and let 𝜖 = 𝑂 (1/poly(𝑛)). Then, the problem of approximating 𝐶LHST (𝑈, 𝑉) up to 𝜖-precision is
DQC1-hard.
As a consequence of these results, it then follows from [85] that there is no classical algorithm
to efficiently approximate our cost functions, 𝐶HST or 𝐶LHST , with inverse polynomial precision,
unless the polynomial hierarchy collapses to the second level.
3.6 Small-scale implementations
This section presents the results of implementing QAQC, as described in Sec. 3.3, for well-
known one- and two-qubit unitaries. Some of these implementations were done on actual quantum
hardware, while others were on a simulator. In each case, we performed gradient-free continuous
parameter optimization in order to minimize the cost function 𝐶HST in (3.5), evaluating this cost
function using the circuit in Fig. 3.4(a). For full details on the optimization procedure, see
Section 3.14.
3.6.1 Quantum hardware
We implement QAQC on both IBM’s and Rigetti’s quantum computers. In what follows, the depth
of a gate sequence is defined relative to the native gate alphabet of the quantum computer used.
3.6.1.1 IBM’s quantum computers
Here, we consider the 5-qubit IBMQX4 and the 16-qubit IBMQX5. For these quantum computers,
the native gate set is
AIBM = {𝑅𝑥 (𝜋/2), 𝑅𝑧 (𝜃), CNOT𝑖 𝑗 } (3.33)
where the single-qubit gates 𝑅𝑥 (𝜋/2) and 𝑅𝑧 (𝜃) can be performed on any qubit and the two-qubit
CNOT gate can be performed between any two qubits allowed in the topology; see [97] for the
topology of IBMQX4 and [98] for the topology of IBMQX5.
To compile a given unitary 𝑈, we use the general procedure outlined in Sec. 3.3.3. Specifically,
our initial gate structure, given by $V_{\vec{k}}(\vec{\alpha})$, is selected at random from the gate alphabet in (3.33). We
then calculate the cost $C_{\mathrm{HST}}(U, V_{\vec{k}}(\vec{\alpha}))$ by executing the HST shown in Fig. 3.4(a) on the quantum
computer. To perform the continuous parameter optimization over the angles 𝜃 of the 𝑅𝑧 gates, we
make use of Algorithm 2 outlined in Section 3.14.1. This method is designed to limit the number of
objective function calls to the quantum computer, which is an important consideration when using
queue-based quantum computers like IBMQX4 and IBMQX5 since these can entail a significant
amount of idle time in the queue.
In essence, our method in Algorithm 2 discretizes the continuous parameter space of angles 𝜃
in order to perform the continuous optimization. The angles are selected uniformly over the unit circle,
and the grid spacing between them decreases with the number of iterations. See Section 3.14.1 for
full details. If the cost of the new gate sequence is less than the cost of the previous sequence, then we
accept the change. Otherwise, we accept the change with a probability that decreases exponentially
in the magnitude of the difference in cost. Each accepted change defines one iteration.
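The two ingredients described above, a grid over the angle whose spacing shrinks with each round and an acceptance rule that tolerates cost increases with exponentially small probability, can be sketched as follows. This is an illustrative sketch only and is not Algorithm 2 itself; Section 3.14.1 remains the authoritative description. The function names, the `temperature` parameter, and the refinement schedule are assumptions made for this example, and the toy cost is the noiseless single-qubit value sin²((𝜃 − 𝜃target)/2) for compiling 𝑅𝑧(𝜃target) with 𝑅𝑧(𝜃).

```python
import math
import random


def rz_cost(theta, target):
    """Toy noiseless cost between Rz(theta) and Rz(target) on one qubit."""
    return math.sin((theta - target) / 2.0) ** 2


def refine_grid_search(cost, num_points=8, rounds=6):
    """Grid search over an angle; the grid spacing shrinks every round."""
    center, half_width = 0.0, math.pi  # first grid covers the full circle
    best = center
    for _ in range(rounds):
        spacing = 2.0 * half_width / num_points
        grid = [center - half_width + k * spacing for k in range(num_points)]
        best = min(grid, key=cost)
        # Next round: zoom in around the best point found so far.
        center, half_width = best, spacing
    return best


def accept_change(old_cost, new_cost, temperature, rng):
    """Always accept improvements; otherwise accept with probability
    exp(-(new_cost - old_cost) / temperature)."""
    if new_cost < old_cost:
        return True
    return rng.random() < math.exp(-(new_cost - old_cost) / temperature)


if __name__ == "__main__":
    target = 0.3 * math.pi
    theta = refine_grid_search(lambda t: rz_cost(t, target))
    print(theta / math.pi)  # close to 0.3 after six refinement rounds
    rng = random.Random(0)
    print(accept_change(0.5, 0.4, temperature=0.1, rng=rng))  # improvement
```

Each call to `cost` stands in for one run of the Hilbert-Schmidt Test, so the product `num_points * rounds` is the quantity one would keep small on a queue-based device.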
In Fig. 3.5(a), we show results for compiling single-qubit gates on IBMQX4. All gates (𝐼,
𝑇, 𝑋, and 𝐻) converge to a cost below 0.1, but no gate achieves a cost below our tolerance of
10^{−2}. As elaborated upon in Sec. 3.8, this is due to a combination of finite sampling, gate infidelity,
decoherence, and readout error on the device. The single-qubit gates compile to the following gate
sequences:
1. 𝐼 gate: 𝑅𝑧 (𝜃), with 𝜃 ≈ 0.01𝜋.
2. 𝑇 gate: 𝑅𝑧 (𝜃), with 𝜃 ≈ 0.30𝜋.
3. 𝑋 gate: 𝑅𝑥 (𝜋/2)𝑅𝑥 (𝜋/2).
4. 𝐻 gate: 𝑅𝑧 (𝜃 1 )𝑅𝑥 (𝜋/2)𝑅𝑧 (𝜃 2 ), with 𝜃 1 = 𝜃 2 = 0.50𝜋.
Figure 3.5(b) shows results for compiling the same single-qubit gates as above on IBMQX5.
The gate sequences have the same structure as listed above for IBMQX4. The optimal angles
achieved are 𝜃 = −0.03𝜋 for the 𝐼 gate and 𝜃 = 0.23𝜋 for the 𝑇 gate. The 𝑋 gate compiles to
𝑅𝑥 (𝜋/2)𝑅𝑥 (𝜋/2), and the Hadamard gate 𝐻 compiles to 𝑅𝑥 (𝜋/2)𝑅𝑧 (𝜋/2)𝑅𝑥 (𝜋/2).
In our data collection, we performed on the order of 10 independent optimization runs for each
target gate above. The standard deviations of the angles 𝜃 were on the order of 0.05𝜋, and this can
be viewed as the error bars on the average values quoted above.
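The gate sequences reported above can be checked against their targets in the noiseless limit. The sketch below assumes the conventions 𝑅𝑥(𝜃) = exp(−i𝜃𝑋/2) and 𝑅𝑧(𝜃) = exp(−i𝜃𝑍/2), and the cost 𝐶HST = 1 − |Tr(𝑉†𝑈)|²/𝑑², which is insensitive to a global phase. Under these assumptions the ideal angle for the 𝑇 gate is 𝜃 = 0.25𝜋, and an offset of 0.05𝜋 (the size of the quoted standard deviation) changes the noiseless cost by roughly 0.006.

```python
import numpy as np


def rx(theta):
    """Rx(theta) = exp(-i theta X / 2) (assumed convention)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])


def rz(theta):
    """Rz(theta) = exp(-i theta Z / 2) (assumed convention)."""
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])


def c_hst(u, v):
    """Noiseless cost, assuming C_HST = 1 - |Tr(V^dagger U)|^2 / d^2."""
    d = u.shape[0]
    return 1.0 - abs(np.trace(v.conj().T @ u)) ** 2 / d**2


X = np.array([[0, 1], [1, 0]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
T = np.diag([1.0, np.exp(1j * np.pi / 4)])

HALF_PI = np.pi / 2
CANDIDATES = {
    "X  vs Rx(pi/2) Rx(pi/2)": (X, rx(HALF_PI) @ rx(HALF_PI)),
    "H  vs Rz(pi/2) Rx(pi/2) Rz(pi/2)": (H, rz(HALF_PI) @ rx(HALF_PI) @ rz(HALF_PI)),
    "H  vs Rx(pi/2) Rz(pi/2) Rx(pi/2)": (H, rx(HALF_PI) @ rz(HALF_PI) @ rx(HALF_PI)),
    "T  vs Rz(0.25 pi)": (T, rz(0.25 * np.pi)),
    "T  vs Rz(0.30 pi)": (T, rz(0.30 * np.pi)),
}

if __name__ == "__main__":
    for label, (target, compiled) in CANDIDATES.items():
        print(f"{label}: C_HST = {c_hst(target, compiled):.4f}")
```

Both Hadamard decompositions, the one found on IBMQX4 and the one found on IBMQX5, reproduce 𝐻 up to a global phase in this noiseless check.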
3.6.1.2 Rigetti’s quantum computer
The native gate set of Rigetti’s 8Q-Agave 8-qubit quantum computer is
ARigetti = {𝑅𝑥 (±𝜋/2), 𝑅𝑧 (𝜃), CZ𝑖 𝑗 } (3.34)
where the single-qubit gates 𝑅𝑥 (±𝜋/2) and 𝑅𝑧 (𝜃) can be performed on any qubit and the two-qubit
CZ gate can be performed between any two qubits allowed in the topology; see [99] for the topology
of the 8Q-Agave quantum computer.
As with the implementation on IBM’s quantum computers, for the implementation on Rigetti’s
quantum computer we make use of the general procedure outlined in Sec. 3.3.3. Specifically,
we perform random updates to the gate structure followed by continuous optimization over the
parameters 𝜃 of the 𝑅𝑧 gates using the gradient-free stochastic optimization technique described in
Algorithm 1 in Section 3.14. In this optimization algorithm, we use fifty cost function evaluations
to perform the continuous optimization over parameters. (That is, each iteration in Fig. 3.5(c) and
Fig. 3.6 uses fifty cost function evaluations, and each cost function evaluation, i.e., each run of the
Hilbert-Schmidt Test, uses 10,000 calls to the quantum computer for finite sampling.) We take the
cost error tolerance (the parameter 𝜀′ in Algorithm 1) to be 10^{−2}.
Our results are shown in Fig. 3.5(c). As described in Algorithm 1, we
define an iteration to be one accepted update in gate structure followed by a continuous optimization
over the internal gate parameters.
Figure 3.5: Compiling the one-qubit gates 𝐼, 𝑋, 𝐻, and 𝑇 using the gradient-free optimization
technique described in Section 3.14. The plots show the cost 𝐶HST as a function of the number of
iterations, where an iteration is defined by an accepted update to the gate structure; see Sec. 3.3.3
for a description of the procedure. The insets display the minimum cost achieved by optimizing
over gate sequences with a fixed depth, where the depth is defined relative to the native gate alphabet
of the quantum computer used. (a) Compiling on the IBMQX4 quantum computer, in which we
took 8,000 samples to evaluate the cost for each run of the Hilbert-Schmidt Test. (b) Compiling
on the IBMQX5 quantum computer, in which we again took 8,000 samples to evaluate the cost for
each run of the Hilbert-Schmidt Test. (c) Compiling on Rigetti’s 8Q-Agave quantum computer. In
the plot, each iteration uses 50 cost function evaluations to perform the continuous optimization.
For each run of the Hilbert-Schmidt Test to evaluate the cost, we took 10,000 samples (calls to the
quantum computer).
The gates compiled in Fig. 3.5(c) have the following optimal decompositions. The same
decompositions also achieve the lowest cost in the cost vs. depth plot in the inset.
1. 𝐼 gate: 𝑅𝑧 (𝜃), with 𝜃 ≈ 0.
2. 𝑇 gate: 𝑅𝑧 (𝜃), with 𝜃 ≈ 0.342𝜋.
3. 𝑋 gate: 𝑅𝑥 (−𝜋/2)𝑅𝑥 (−𝜋/2).
4. 𝐻 gate: 𝑅𝑧 (𝜃 1 )𝑅𝑥 (𝜋/2)𝑅𝑧 (𝜃 2 ), with 𝜃 1 ≈ 0.50𝜋 and 𝜃 2 ≈ 0.49𝜋.
As with the results on IBM’s quantum computers, none of the gates achieve a cost less than 10^{−2},
due to factors such as finite sampling, gate infidelity, decoherence, and readout error. In addition,
similar to the IBM results, the standard deviations of the angles 𝜃 here were on the order of 0.05𝜋,
which can be viewed as the error bars on the average values (over 10 independent runs) quoted
above.
3.6.2 Quantum simulator
We now present our results on executing QAQC for single-qubit and two-qubit gates using a
simulator. We use the gate alphabet
A = {𝑅𝑥 (𝜋/2), 𝑅𝑧 (𝜃), CNOT𝑖 𝑗 }, (3.35)
which is the gate alphabet defined in Eq. (3.33) except with full connectivity between the qubits. We
again use the gradient-free optimization method outlined in Section 3.14 to perform the continuous
parameter optimization. The simulations are performed assuming perfect connectivity between the
qubits, no gate errors, and no decoherence.
Using Rigetti’s quantum virtual machine [2], we compile the controlled-Hadamard (CH) gate,
the CZ gate, the SWAP gate, and the two-qubit quantum Fourier transform QFT2 by adopting the
gradient-free continuous optimization procedure in Algorithm 1. We also compile the single-qubit
gates 𝑋 and 𝐻. For each run of the Hilbert-Schmidt Test to determine the cost, we took 20,000
[Figure 3.6 plots: (a) Cost vs. Depth and (b) Cost vs. Iteration, with curves for X, H, CZ, CH, SWAP, and QFT2; (c) circuit diagrams for the CZ, CH, and QFT2 decompositions.]
Figure 3.6: Compiling one- and two-qubit gates on Rigetti’s quantum virtual machine with the
gate alphabet in (3.35) using the gradient-free optimization technique described in Algorithm
1 in Section 3.14. (a) The minimum cost achieved by optimizing over gate sequences with a
fixed depth. (b) The cost as a function of the number of iterations of the full gate structure and
continuous parameter optimization; see Sec. 3.3.3 for a description of the procedure. Note that each
iteration uses 50 cost function evaluations, and each cost function evaluation uses 10,000 samples
(calls to the quantum computer). (c) Shortest-depth decompositions of the two-qubit controlled-
𝑍, controlled-Hadamard, and quantum Fourier transform gates as determined by the compilation
procedure. The equalities indicated are true up to a global phase factor. Here, denotes the
rotation gate 𝑅𝑧 (𝜃), while represents the rotation gate 𝑅𝑥 (𝜋/2).
samples. Our results are shown in Fig. 3.6. For the SWAP gate, we find that circuits of depth
one and two cannot achieve zero cost, but there exists a circuit with depth three for which the cost
vanishes. The circuit achieving this zero cost is the well-known decomposition of the SWAP gate
into three CNOT gates. While our compilation procedure reproduces the known decomposition
of the SWAP gate, it discovers a decomposition of both the CZ and the QFT2 gates that differs
from their conventional “textbook” decompositions, as shown in Fig. 3.6(c). In particular, these
decompositions have shorter depths than the conventional decompositions when written in terms
of the gate alphabet in (3.35).
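The statement about the SWAP gate can be reproduced in part with a few lines of linear algebra. The sketch below assumes the cost 𝐶HST = 1 − |Tr(𝑉†𝑈)|²/𝑑² and, for simplicity, restricts the short sequences to CNOT gates only; it therefore confirms that three alternating CNOTs give zero cost and that no CNOT-only sequence of length one or two does, but it does not search over the single-qubit rotations in the full alphabet (3.35).

```python
import itertools

import numpy as np

# Basis ordering |q1 q2>, with index 2*q1 + q2.
CNOT_12 = np.array(
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex
)  # control qubit 1, target qubit 2
CNOT_21 = np.array(
    [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]], dtype=complex
)  # control qubit 2, target qubit 1
SWAP = np.array(
    [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]], dtype=complex
)


def c_hst(u, v):
    """Noiseless cost, assuming C_HST = 1 - |Tr(V^dagger U)|^2 / d^2."""
    d = u.shape[0]
    return 1.0 - abs(np.trace(v.conj().T @ u)) ** 2 / d**2


def min_cost_cnot_only(target, length):
    """Smallest cost over all CNOT-only sequences of the given length."""
    best = 1.0
    for gates in itertools.product([CNOT_12, CNOT_21], repeat=length):
        circuit = np.eye(4, dtype=complex)
        for gate in gates:
            circuit = gate @ circuit
        best = min(best, c_hst(target, circuit))
    return best


if __name__ == "__main__":
    for length in (1, 2, 3):
        print(length, min_cost_cnot_only(SWAP, length))
```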
In Section 3.15, we likewise implement QAQC for one- and two-qubit gates on a simulator, but
instead using a gradient-based continuous parameter optimization method outlined therein.
3.7 Larger-scale implementations
While in the previous section we considered one- and two-qubit unitaries, in this section we explore
larger unitaries, up to nine qubits. The purpose of this section is to see how QAQC scales, and
in particular, to study the performance of our 𝐶HST and 𝐶LHST cost functions as the problem size
increases. We consider two different examples.
Example 1: In the first example, we let 𝑈 be a tensor product of one-qubit unitaries. Namely we
consider
$$U = \bigotimes_{j=1}^{n} R_z(\theta_j)$$ (3.36)
where the 𝜃 𝑗 are randomly chosen, and 𝑅𝑧 (𝜃) is a rotation about the 𝑧-axis of the Bloch sphere by
angle 𝜃. Similarly, our ansatz for 𝑉 is of the same form,
$$V = \bigotimes_{j=1}^{n} R_z(\phi_j)$$ (3.37)
where the initial values of the angles 𝜙 𝑗 are randomly chosen.
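Because both 𝑈 and 𝑉 in Example 1 are tensor products, the two cost functions have simple closed forms. Assuming 𝐶HST = 1 − |Tr(𝑉†𝑈)|²/4^𝑛, 𝑊 = 𝑈𝑉†, and 𝑅𝑧(𝜃) = exp(−i𝜃𝑍/2), one finds $C_{\mathrm{HST}} = 1 - \prod_{j} \cos^2\frac{\theta_j - \phi_j}{2}$ and $C_{\mathrm{LHST}} = 1 - \frac{1}{n}\sum_{j} \cos^2\frac{\theta_j - \phi_j}{2}$. The sketch below, with hypothetical function names, compares the closed form of 𝐶HST with a direct matrix evaluation for a small instance.

```python
from functools import reduce

import numpy as np


def rz(theta):
    """Rz(theta) = exp(-i theta Z / 2) (assumed convention)."""
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])


def tensor_rz(angles):
    """Tensor product of Rz rotations, as in Eqs. (3.36) and (3.37)."""
    return reduce(np.kron, [rz(t) for t in angles])


def c_hst_matrix(u, v):
    """Direct evaluation, assuming C_HST = 1 - |Tr(V^dagger U)|^2 / d^2."""
    d = u.shape[0]
    return 1.0 - abs(np.trace(v.conj().T @ u)) ** 2 / d**2


def c_hst_closed(theta, phi):
    """Closed form for Example 1: one minus a product of cos^2 factors."""
    return 1.0 - float(np.prod(np.cos((theta - phi) / 2) ** 2))


def c_lhst_closed(theta, phi):
    """Closed form for Example 1: one minus an average of cos^2 factors."""
    return 1.0 - float(np.mean(np.cos((theta - phi) / 2) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(seed=3)
    n = 4
    theta = rng.uniform(-np.pi, np.pi, size=n)
    phi = rng.uniform(-np.pi, np.pi, size=n)
    print(c_hst_matrix(tensor_rz(theta), tensor_rz(phi)), c_hst_closed(theta, phi))
    print(c_lhst_closed(theta, phi))
```

The product structure of the global cost, in contrast with the average in the local cost, is what makes this example a convenient test case as 𝑛 grows.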
Example 2: In the second example, we go beyond the tensor-product situation and explore a unitary
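that entangles all the qubits. The target unitary has the form $U = U_4(\vec{\theta}')\,U_3 U_2 U_1(\vec{\theta})$, with
$$U_1(\vec{\theta}) = \bigotimes_{j=1}^{n} R_z(\theta_j), \qquad U_2 = \cdots \mathrm{CNOT}_{34}\,\mathrm{CNOT}_{12},$$ (3.38)
$$U_3 = \cdots \mathrm{CNOT}_{45}\,\mathrm{CNOT}_{23}, \qquad U_4(\vec{\theta}') = \bigotimes_{j=1}^{n} R_z(\theta'_j).$$ (3.39)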
Here, CNOT𝑘𝑙 denotes a CNOT with qubit 𝑘 the control and qubit 𝑙 the target, while $\vec{\theta} = \{\theta_j\}$ and
$\vec{\theta}' = \{\theta'_j\}$ are 𝑛-dimensional vectors of angles. Hence 𝑈2 and 𝑈3 are layers of CNOTs where the
CNOTs in 𝑈3 are shifted down by one qubit relative to those in 𝑈2 . Our ansatz for the trainable
unitary 𝑉 has the same form as 𝑈 but with different angles, i.e., $V = U_4(\vec{\phi}')\,U_3 U_2 U_1(\vec{\phi})$, where $\vec{\phi}$
and $\vec{\phi}'$ are randomly initialized.
In what follows we discuss our implementations of QAQC for these two examples. We first
discuss the implementation on a simulator without noise, and then we move on to the implementation
on a simulator with a noise model.
3.7.1 Noiseless implementations
We implemented Examples 1 and 2 on a noiseless simulator. In each case, starting with the ansatz
for 𝑉 at a randomly chosen set of angles, we performed the continuous parameter optimization over
the angles using a gradient-based approach. We made use of Algorithm 4 in Section 3.15.3, which
is a gradient descent algorithm that explicitly evaluates the gradient using the formulas provided in
Section 3.15.3. For each run of the HST and LHST, we took 1000 samples in order to estimate the
value of the cost function. The results of this implementation are shown in Figs. 3.7 and 3.8.
In the case of Example 1 (Fig. 3.7), both the 𝐶HST and 𝐶LHST cost functions converge to the
desired global minimum up to 5 qubits. However, for 𝑛 = 6, 7, 8, and 9 qubits, we find cases in
which the 𝐶HST cost function does not converge to the global minimum but the 𝐶LHST cost function
does. Specifically, the cost 𝐶HST stays very close to one, with a gradient value smaller than the
pre-set threshold of 10^{−3} for four consecutive iterations, causing the gradient descent algorithm to
[Figure 3.7 plots: Cost vs. Iteration for 4, 5, 6, 7, 8, and 9 qubits, with curves for HST, LHST, and HST via LHST.]
Figure 3.7: Results of performing continuous parameter optimization using the HST and the LHST
for the scenario described in Example 1. We make use of the gradient-based optimization algorithm
given by Algorithm 4 in Section 3.15. The curves “HST via LHST” are given by evaluating 𝐶HST
using the angles obtained during the optimization iterations of 𝐶LHST . For each run of the HST and
LHST, we use 1000 samples to estimate the cost function.
[Figure 3.8 plots: Cost vs. Iteration for 2, 4, 6, and 8 qubits, with curves for HST, LHST, and HST via LHST.]
Figure 3.8: Results of performing continuous parameter optimization using the HST and the
LHST for the scenario described in Example 2. We make use of the gradient-based optimization
algorithm given by Algorithm 4 in Section 3.15, in which each iteration can involve several calls to
the quantum computer. The curves “HST via LHST” are given by evaluating 𝐶HST using the angles
obtained during the optimization iterations of 𝐶LHST . For each run of the HST and LHST, we use
1000 samples to estimate the cost function.
declare convergence. Interestingly, even in the cases that the 𝐶HST cost does not converge to the
global minimum, training with the 𝐶LHST cost allows us to fully minimize the 𝐶HST cost. (See the
green curves labelled “HST via LHST” in Fig. 3.7, in which we evaluate the 𝐶HST cost at the angles
obtained during the optimization of the 𝐶LHST cost.) This fascinating feature implies that, for 𝑛 ≥ 6
qubits in Example 1, training our 𝐶LHST cost is better at minimizing the 𝐶HST cost than is directly
attempting to train the 𝐶HST cost.
We find very similar behavior for Example 2 (Fig. 3.8). In particular, for 𝑛 ≥ 6 qubits, we were
unable to directly train the 𝐶HST cost. However, the 𝐶LHST cost converges to the global minimum
for 𝑛 = 6 and 8 qubits. Furthermore, as with Example 1, we find that minimizing the 𝐶LHST cost
also minimizes the 𝐶HST cost.
3.7.2 Noisy implementations
We implemented Examples 1 and 2 on IBM’s noisy simulator, where the noise model matches that
of the 16-qubit IBMQX5 quantum computer. This noise model accounts for 𝑇1 noise, 𝑇2 noise, gate
errors, and measurement errors. We emphasize that these are realistic noise parameters since they
simulate the noise on currently available quantum hardware. (Note that when our implementations
required more than 16 qubits, we applied similar noise parameters to the additional qubits as those
for the 16 qubits of the IBMQX5.) We used the same training algorithm as the one we used in the
noiseless case above. The results of these implementations are shown in Figs. 3.9 and 3.10.
Similar to the noiseless case, for Example 1 (Fig. 3.9) and for Example 2 (Fig. 3.10), we find
that both the 𝐶HST and 𝐶LHST cost functions converge up to a problem size of 5 qubits. Due to the
noise, as expected, both cost functions converge to a value greater than zero. For 𝑛 ≥ 6 qubits,
however, we find that the 𝐶HST cost function does not converge to a local minimum. Specifically,
this cost stays very close to one with a gradient value smaller than the pre-set threshold of 10^{−3}
for four consecutive iterations, causing the gradient descent algorithm to declare convergence. The
local cost, on the other hand, converges to a local minimum in every case.
Remarkably, despite the noise in the simulation, we find that the angles obtained during the
iterations of the 𝐶LHST optimization correspond to the optimal angles in the noiseless case. This
result is indicated by the green curves labeled “Noiseless HST via LHST”. One can see that the
green curves go to zero for the local minima found by training the noisy 𝐶LHST cost function. Hence,
in these examples, training the noisy 𝐶LHST cost function can be used to minimize the noiseless
𝐶HST cost function to the global minimum. This intriguing behavior suggests that the noise has not
affected the location (i.e., the value for the angles) of the global minimum. We thus find evidence
of the robustness of QAQC to the kind of noise present in actual devices. We elaborate on this
point in the next section.
[Figure 3.9 plots: Cost vs. Iteration for 4, 5, 6, 7, 8, and 9 qubits, with curves for HST, LHST, and Noiseless HST via LHST.]
Figure 3.9: Results of performing continuous parameter optimization using the HST and the
LHST, in the presence of noise, for the scenario described in Example 1. The noise model used
matches that of the IBMQX5 quantum computer. We make use of the gradient-based optimization
algorithm given by Algorithm 4 in Section 3.15. The curves “Noiseless HST via LHST” are given
by evaluating 𝐶HST (without noise) using the angles obtained during the optimization iterations of
𝐶LHST . For each run of the HST and LHST, we use 1000 samples to estimate the cost function.
[Figure 3.10 plots: Cost vs. Iteration for 2, 4, 6, and 8 qubits, with curves for HST, LHST, and Noiseless HST via LHST.]
Figure 3.10: Results of performing continuous parameter optimization using the HST and the
LHST, in the presence of noise, for the scenario described in Example 2. The noise model used
matches that of the IBMQX5 quantum computer. We make use of the gradient-based optimization
algorithm given by Algorithm 4 in Section 3.15. The curves “Noiseless HST via LHST” are given
by evaluating 𝐶HST (without noise) using the angles obtained during the optimization iterations of
𝐶LHST . For each run of the HST and LHST, we use 1000 samples to estimate the cost function.
3.8 Discussion
On both IBM’s and Rigetti’s quantum hardware, we were able to successfully compile one-qubit
gates with no a priori assumptions about gate structure or gate parameters. We also successfully
implemented QAQC for simple 9-qubit gates on both a noiseless and noisy simulator. These
implementations highlighted two important issues, (1) barren plateaus in the cost landscape and (2) the
effect of hardware noise, which we discuss further now.
3.8.1 Barren plateaus
Recent results [40, 100] on gradient-based optimization with random quantum circuits suggest that
the probability of observing non-zero gradients tends to become exponentially small as a function
of the number of qubits. That work showed that a hardware-efficient ansatz leads to vanishing
gradients as the depth of the ansatz increases (and hence the ansatz begins to look more like a random
unitary). This is an important issue for many variational hybrid algorithms, including QAQC, and
motivates the need to avoid a deep, random ansatz. Strategies to address this “barren plateau” issue
for QAQC include restricting to a short-depth ansatz, or alternatively employing an application-
specific ansatz that takes into account some information about the target unitary 𝑈. We intend
to explore application-specific ansatze in future work to address this issue. There may be other
strategies based on the fact that similar issues have been identified in classical deep learning [101].
For instance, recent work [102] shows that gradient descent with momentum (GDM) using an
adaptive (multiplicative) integration step update, called resilient backpropagation (rProp), can help
with convergence. However, this remains an active research area, and progress here will likely be important to the success of
variational hybrid algorithms.
Interestingly, in this work, we identified another barren plateau issue that is completely independent
and distinct from the issue raised in Refs. [40, 100]. Namely, we found that our operationally
meaningful cost function, 𝐶HST , can have barren plateaus even when the ansatz is a depth-one
circuit. The gradient of 𝐶HST can vanish exponentially in 𝑛 even when the ansatz has only a single
parameter. This issue became apparent in our implementations (see Figs. 3.7 through 3.10), where
we were unable to directly train the 𝐶HST cost for 𝑛 ≥ 6 qubits. Fortunately, we fixed this issue by
introducing the 𝐶LHST cost, which successfully trained in all cases we attempted (we attempted up
to 𝑛 = 9 qubits). Although 𝐶LHST is not directly operationally meaningful, it is indirectly related
to 𝐶HST via Eqs. (3.28) and (3.29). Hence it can be used to indirectly train 𝐶HST , as shown in
Figs. 3.7 through 3.10. We believe this barren plateau issue will show up in other variational
hybrid algorithms. For example, we encountered the same issue in a recently introduced variational
algorithm for state diagonalization [103].
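The depth-one barren plateau described above can be made concrete for the tensor-product setting of Example 1. The sketch below is an illustration under stated assumptions rather than a general analysis: it uses the closed forms $C_{\mathrm{HST}} = 1 - \prod_j \cos^2(\Delta_j/2)$ and $C_{\mathrm{LHST}} = 1 - \frac{1}{n}\sum_j \cos^2(\Delta_j/2)$ with $\Delta_j = \theta_j - \phi_j$, which hold for that example if 𝐶HST = 1 − |Tr(𝑉†𝑈)|²/4^𝑛. Differentiating gives $\partial C_{\mathrm{HST}}/\partial\phi_k = -\frac{1}{2}\sin\Delta_k \prod_{j \neq k}\cos^2(\Delta_j/2)$ and $\partial C_{\mathrm{LHST}}/\partial\phi_k = -\frac{1}{2n}\sin\Delta_k$. For uniformly random angles the first expression shrinks on average by a factor of two per added qubit, while the second shrinks only as 1/𝑛. The function name is hypothetical.

```python
import numpy as np


def mean_abs_gradients(n, samples, rng):
    """Average |dC/dphi_1| for the global and local costs of Example 1,
    using the closed-form gradients stated in the lead-in (assumption)."""
    delta = rng.uniform(-np.pi, np.pi, size=(samples, n))  # theta - phi
    cos2 = np.cos(delta / 2) ** 2
    others = np.prod(cos2[:, 1:], axis=1)  # product over qubits j != 1
    grad_hst = np.abs(0.5 * np.sin(delta[:, 0]) * others)
    grad_lhst = np.abs(0.5 * np.sin(delta[:, 0]) / n)
    return float(grad_hst.mean()), float(grad_lhst.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(seed=0)
    for n in range(2, 11, 2):
        g_hst, g_lhst = mean_abs_gradients(n, samples=2000, rng=rng)
        print(f"n = {n:2d}   HST: {g_hst:.2e}   LHST: {g_lhst:.2e}")
```

With a fixed number of samples per cost evaluation, a gradient that halves with every added qubit quickly falls below the statistical resolution of the estimate, whereas the local gradient remains resolvable for much larger 𝑛.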
3.8.2 Effect of hardware noise
The impact of hardware noise, such as decoherence, gate infidelity, and readout error, is important
to consider. This is especially true since QAQC is aimed at being a useful algorithm in the era of
NISQ computers, although we remark that QAQC may also be useful for fault-tolerant quantum
computing.
On the one hand, we intuitively expect noise to significantly affect the HST and LHST cost
evaluation circuits. On the other hand, we see empirical evidence of noise resilience in Figs. 3.9
and 3.10. Let us elaborate on both our intuition and our empirical observations now.
A qualitative noise analysis of the HST circuit in Fig. 3.4(a) is as follows. To compile a unitary
𝑈 acting on 𝑛 qubits, a circuit with 2𝑛 qubits is needed. Preparing the maximally-entangled
state |Φ+ ⟩ in the first portion of the circuit requires 𝑛 CNOT gates, which are significantly noisier
than one-qubit gates and propagate errors to other qubits through entanglement. In principle, all
Hadamard and CNOT gates can be implemented in parallel, but on near-term devices this may not
be the case. Additionally, due to limited connectivity of NISQ devices, it is generally not possible
to directly implement CNOTs between arbitrary qubits. Instead, the CNOTs need to be “chained”
between qubits that are connected, a procedure that can significantly increase the depth of the
circuit.
The next level of the circuit involves implementing 𝑈 in the top 𝑛-qubit register and 𝑉 ∗ in
the bottom 𝑛-qubit register. Here, the noise of the computer on 𝑉 ∗ is not necessarily undesirable
since it could allow us to compile noise-tailored algorithms that counteract the noise of the specific
computer, as described in Sec. 3.2. Nevertheless, the depth of 𝑉 ∗ and/or of 𝑈 essentially determines
the overall circuit depth as noted in (3.19), and quantum coherence decays exponentially with the
circuit depth. Hence, compiling larger gate sequences involves additional loss of coherence on
NISQ computers.
The final level of the HST circuit involves making a Bell measurement on all qubits and is the
reverse of the first part of the circuit. As such, the same noise analysis of the first portion of the
circuit applies here. Readout errors can be significant on NISQ devices [104], and our HST circuit
involves a number of measurements that scales linearly in the number of qubits. Hence, compiling
larger unitaries can increase overall readout error.
A similar qualitative noise analysis holds for the LHST circuit in Fig. 3.4(b), except we note
that to calculate the functions $C_{\mathrm{LHST}}^{(j)}$ in (3.26) we require only one CNOT gate in the last portion
of the LHST circuit before the measurement. Furthermore, we measure only two qubits regardless
of the total number of qubits.
With that said, we observed a (somewhat surprising) noise resilience in Figs. 3.9 and 3.10. In
these implementations, we imported the noise model of the IBMQX5 quantum computer, which
is a currently available cloud quantum computer. Hence, we considered realistic noise parameters
for decoherence, gate infidelity, and readout error. This noise affected all circuit elements of the
LHST circuit in Fig. 3.4(b). Yet we still obtained the correct unitary 𝑉 via QAQC, as shown by the
green curves going to zero in Figs. 3.9 and 3.10.
Naturally, we plan to investigate this noise resilience in full detail in future work. But it is worth
emphasizing the following point here. The value of the cost could be significantly affected by noise
without shifting the location of the global minimum in parameter space. In fact, one can see in
Figs. 3.9 and 3.10 that the value of the 𝐶LHST cost is significantly affected by noise. Namely, note
that the red curves in these plots do not go to zero for larger iterations. However, the green curves
do go to zero, which means that QAQC found the correct parameters for 𝑉 despite the noisy cost
values.
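The distinction between the value of the cost and the location of its minimum can be illustrated with a toy model. The model is an assumption made purely for illustration and is not the IBMQX5 noise model used in our simulations: suppose the single-qubit channel $\mathcal{E}_j$ is followed by depolarizing noise of strength 𝑝, so that $\mathcal{E}_j \to (1-p)\,\mathcal{E}_j + p\,\mathrm{Tr}[\cdot]\,I/2$. Then $F_e^{(j)} \to (1-p)F_e^{(j)} + p/4$, and the local cost becomes $(1-p)\,C_{\mathrm{LHST}}^{(j)} + 3p/4$, i.e., it is rescaled and offset but has the same minimizer. The sketch below checks this numerically for a single-qubit 𝑅𝑧 target, with hypothetical function names and the convention 𝑊 = 𝑈𝑉†.

```python
import numpy as np


def rz(theta):
    """Rz(theta) = exp(-i theta Z / 2) (assumed convention)."""
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])


def noisy_local_cost(phi, target, p):
    """1 - F_e for the channel W = Rz(target) Rz(phi)^dagger followed by
    depolarizing noise of strength p (toy model, not the IBMQX5 noise)."""
    w = rz(target) @ rz(phi).conj().T
    bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
    psi = np.kron(w, np.eye(2)) @ bell  # (W x I)|Phi+>
    rho = (1 - p) * np.outer(psi, psi.conj()) + p * np.eye(4) / 4
    fidelity = np.real(bell.conj() @ rho @ bell)
    return 1.0 - float(fidelity)


if __name__ == "__main__":
    target = 0.25 * np.pi
    grid = np.linspace(-np.pi, np.pi, 401)
    for p in (0.0, 0.3):
        costs = [noisy_local_cost(phi, target, p) for phi in grid]
        k = int(np.argmin(costs))
        print(f"p = {p}: argmin phi/pi = {grid[k] / np.pi:.3f}, min cost = {costs[k]:.3f}")
```

Realistic noise (amplitude damping, coherent gate errors, readout error) need not act in this simple way, so the example only shows that a strongly shifted cost value is compatible with an unshifted minimum.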
We can speculate on why the global minimum appears not to shift in parameter space
with noise. For example, it could be due to the nature of our cost functions. These cost functions
can be thought of as entanglement fidelities and hence are related to Hilbert-space averages of
input-output fidelities, see Eq. (3.6). By averaging the input-output fidelity over the whole Hilbert
space, the effect of noise could essentially be averaged away. This is just speculation at this point,
and we will perform a detailed analysis of the effect of noise in future work. Regardless, our
preliminary results in Figs. 3.9 and 3.10 suggest that QAQC may indeed be useful in the NISQ era.
3.9 Conclusions
Quantum compiling is crucial in the era of NISQ devices, where constraints on NISQ computers
(such as limited connectivity, limited circuit depth, etc.) place severe restrictions on the quantum
algorithms that can be implemented in practice. In this work, we presented a methodology for
quantum compilation called quantum-assisted quantum compiling (QAQC), whereby a quantum
computer provides an exponential speedup in evaluating the cost of a gate sequence, i.e., how well
the gate sequence matches the target. In principle, QAQC should allow for the compiling of larger
algorithms than standard classical methods for quantum compiling due to this exponential speedup.
As a proof-of-principle, we implemented QAQC on IBM’s and Rigetti’s quantum computers to
compile various one-qubit gates to their native gate alphabets. To our knowledge, this is the
first time NISQ hardware has been used to compile a target unitary. In addition, we successfully
implemented QAQC on a noiseless and noisy simulator for simple 9-qubit unitaries.
Our main technical results were the following. First, we carefully chose a cost function (which
involved global and local overlaps between a target unitary 𝑈 and a trainable unitary 𝑉) and proved
that it satisfied four criteria: it is faithful, it is efficient to compute on a quantum computer, it has an
operational meaning, and it scales well with the size of the problem. Second, we presented short-
depth circuits (see Sections 3.4.1 and 3.4.2) for computing our cost function. Third, we proved that
evaluating our cost function is DQC1-hard, and hence no classical algorithm can efficiently evaluate
our cost function, under reasonable complexity assumptions. This established a rigorous proof for
the difficulty of classically simulating QAQC. We also remark that, in Secs. 3.14 and 3.15, we detail our
gradient-free and gradient-based methods for optimizing our cost function. This includes a circuit
for gradient computation that generalizes the famous Power of One Qubit [7] and hence is likely of
interest to a broader community.
As elaborated in the Discussion section, our noisy implementations of QAQC showed a surpris-
ing resilience to noise. While simulating realistic noise parameters based on a currently available
cloud quantum computer (IBMQX5), we were able to run QAQC on a 9-qubit unitary and obtain
the correct parameters for 𝑉. We plan to investigate this intriguing noise resilience in future work.
QAQC is a novel variational hybrid algorithm, similar to other well-known variational hybrid
algorithms such as VQE [78] and QAOA [77]. Variational hybrid algorithms are likely to provide
some of the first real applications of quantum computers in the NISQ era. In the case of QAQC,
it is an algorithm that makes other algorithms more efficient to implement, via algorithm depth
compression. We note that the ability to compress algorithm depth will also be useful (to reduce
the run-time of quantum circuits) in the era of fault-tolerant quantum computing. The central
application of QAQC is thus to make quantum computers more useful.
3.10 Remark on implementation of 𝑉 ∗
As mentioned in Sec. 3.4, a subtle point about evaluating the cost functions $C_{\rm HST}(U, V_{\vec{k}}(\vec{\alpha}))$ and
$C_{\rm LHST}(U, V_{\vec{k}}(\vec{\alpha}))$ is that the complex conjugate $V_{\vec{k}}(\vec{\alpha})^*$ must be executed on the quantum computer,
not $V_{\vec{k}}(\vec{\alpha})$ itself. The complex conjugate of a unitary corresponding to a gate sequence can be
obtained by taking the complex conjugate of each unitary in the gate sequence. However, if each
gate in the sequence comes from a gate alphabet A, it is possible that the complex conjugate of a
gate in the sequence is not contained in the alphabet; for example, if A = {𝑅𝑥 (𝜋/2), 𝑅𝑧 (𝜃)}, then
the complex conjugate of 𝑅𝑥 (𝜋/2), which is 𝑅𝑥 (−𝜋/2), is not contained in A. But the unitary
𝑅𝑧 (𝜋)𝑅𝑥 (𝜋/2)𝑅𝑧 (𝜋) is equal (up to a global phase) to 𝑅𝑥 (−𝜋/2). There are thus two ways to
proceed when performing the compilation procedure. The first is, during the optimization over the continuous
parameters, to directly run the gate sequence corresponding to $V_{\vec{k}}(\vec{\alpha})$, expressing it in terms of the
native gate alphabet of the quantum computer, and then at the end to establish the complex conjugate of
the optimal unitary as the unitary to which $U$ has been compiled. This involves translating
the complex conjugate of each gate in the optimal sequence into the native gate alphabet of the
quantum computer. The second is to first take the complex conjugate $V_{\vec{k}}(\vec{\alpha})^*$ by translating the
complex conjugate of each gate in the sequence into the native gate alphabet, and then to execute the
resulting sequence on the quantum computer. In each case, we allow for a small-scale classical
compiler that can perform the simple translation of the complex conjugate of a gate sequence into
the native gate alphabet of the quantum computer. Note that this small-scale classical compiler
does not come with exponential overhead because it is only compiling one- and two-qubit gates.
Also, observe that if a gate alphabet is not closed under complex conjugation, then the depth
of a gate sequence from that alphabet can increase by taking its complex conjugate. This is true
for the example given above, in which the complex conjugate 𝑅𝑥 (−𝜋/2) of 𝑅𝑥 (𝜋/2) has a depth of
three under the alphabet A = {𝑅𝑥 (𝜋/2), 𝑅𝑧 (𝜃)}, while the original gate has a depth of only one.
However, in general, note that the final depth increases by at most a constant factor relative to the
original depth.
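The alphabet example above can be checked numerically. The following is a minimal sketch in NumPy, assuming the standard conventions $R_x(\theta) = e^{-i\theta\sigma_x/2}$ and $R_z(\theta) = e^{-i\theta\sigma_z/2}$; the helper names `rx`, `rz`, and `equal_up_to_phase` are introduced here purely for illustration.

```python
import numpy as np

def rx(theta):
    # R_x(theta) = exp(-i theta sigma_x / 2)
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def rz(theta):
    # R_z(theta) = exp(-i theta sigma_z / 2)
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])

def equal_up_to_phase(a, b, tol=1e-9):
    # Two d x d unitaries agree up to a global phase iff |Tr(a^dagger b)| = d.
    return abs(abs(np.trace(a.conj().T @ b)) - a.shape[0]) < tol

# Complex conjugate of R_x(pi/2), which is not itself in A = {R_x(pi/2), R_z(theta)}.
target = rx(np.pi / 2).conj()

# Depth-three replacement that uses only gates from the alphabet A.
in_alphabet = rz(np.pi) @ rx(np.pi / 2) @ rz(np.pi)
```

In this sketch the two matrices differ by the global phase $-1$, which is why the comparison is made through the magnitude of the Hilbert-Schmidt inner product rather than entry by entry.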
3.11 Faithfulness of LHST cost function
Proposition 1: For all unitaries 𝑈 and 𝑉, it holds that 𝐶LHST (𝑈, 𝑉) = 0 if and only if 𝑈 = 𝑉 (up to
a global phase).
Proof: First, we note that since $0 \le C^{(j)}_{\rm LHST}(U, V) \le 1$ for all $j \in \{1, 2, \ldots, n\}$, we get that
$C_{\rm LHST}(U, V) = 0$ if and only if $C^{(j)}_{\rm LHST}(U, V) = 0$, i.e., $F_e^{(j)} = 1$, for all $j \in \{1, 2, \ldots, n\}$. Next, since
$F_e^{(j)}$ is by definition the entanglement fidelity of the channel $\mathcal{E}_j$, we have that $F_e^{(j)} = 1$ if and only if
$\mathcal{E}_j$ is the identity channel $\mathcal{I}$. Finally, the condition $U = V$ is equivalent to $W := U V^\dagger = I$. Therefore,
it suffices to prove that 𝑊 = 𝐼 if and only if E 𝑗 is the identity channel for all 𝑗 ∈ {1, 2, . . . , 𝑛}. The
implication 𝑊 = 𝐼 ⇒ E 𝑗 = I for all 𝑗 ∈ {1, 2, . . . , 𝑛} is immediate. We now prove the converse.
Let $j = 1$, and suppose that $W$ has the following operator Schmidt decomposition under the
bipartite cut $A_1 | A_2 \cdots A_n$:
\[ W = \sum_{i=1}^{r} \sqrt{\sigma_i}\, X_i^{A_1} \otimes Y_i^{A_2 \cdots A_n}, \tag{3.40} \]
where $\{X_i\}_{i=1}^{r}$ and $\{Y_i\}_{i=1}^{r}$ are orthonormal sets of operators, $\sigma_i > 0$ are the Schmidt coefficients of
$W$, and $r$ is the Schmidt rank of $W$. Since $W$ is unitary, we have
\[ W^\dagger W = \sum_{i,i'=1}^{r} \sqrt{\sigma_i \sigma_{i'}}\, X_i^\dagger X_{i'} \otimes Y_i^\dagger Y_{i'} = I_{A_1 \cdots A_n}, \tag{3.41} \]
which implies that
\[ \mathrm{Tr}_{A_2 \cdots A_n}(W^\dagger W) = \sum_{i=1}^{r} \sigma_i X_i^\dagger X_i = 2^{n-1} I_{A_1}. \tag{3.42} \]
Plugging the Schmidt decomposition of $W$ into the definition of $\mathcal{E}_1$ in (3.24), we get
\[ \mathcal{E}_1(\rho) = \sum_{i=1}^{r} \frac{\sigma_i}{2^{n-1}} X_i \rho X_i^\dagger. \tag{3.43} \]
The operators $K_i := \sqrt{\sigma_i / 2^{n-1}}\, X_i$ can therefore be regarded as Kraus operators for $\mathcal{E}_1$. Indeed, they
satisfy the following condition for trace preservation:
\[ \sum_{i=1}^{r} K_i^\dagger K_i = \frac{1}{2^{n-1}} \sum_{i=1}^{r} \sigma_i X_i^\dagger X_i \tag{3.44} \]
\[ = \frac{1}{2^{n-1}} \mathrm{Tr}_{A_2 \cdots A_n}(W^\dagger W) \tag{3.45} \]
\[ = I_{A_1}, \tag{3.46} \]
where to obtain the second equality we used (3.42).
Now, we assume that $\mathcal{E}_1$ is the identity channel, meaning that $\mathcal{E}_1(\rho) = \sum_{i=1}^{r} \frac{\sigma_i}{2^{n-1}} X_i \rho X_i^\dagger = \rho$ for
all states $\rho$. By the non-uniqueness of Kraus representations of quantum channels, there exists an
isometry $V$ relating the Kraus operators $\{K_i\}_{i=1}^{r}$ to another set $\{N_j\}_{j=1}^{s}$ of Kraus operators according
to $K_i = \sum_{j=1}^{s} V_{i,j} N_j$. Since one Kraus representation of the identity channel is the one consisting
of only the identity operator $I$, we let the set $\{N_j\}_{j=1}^{s}$ consist of only the identity operator. The
isometry $V$ is then an $r \times 1$ matrix, so that $V_{i,1} = \alpha_i \in \mathbb{C}$ for all $i \in \{1, 2, \ldots, r\}$. This implies that
$K_i = \sqrt{\sigma_i / 2^{n-1}}\, X_i = \alpha_i I_{A_1}$ for all $i \in \{1, 2, \ldots, r\}$. Therefore,
\[ W_{A_1 \cdots A_n} = \sum_{i=1}^{r} \sqrt{\sigma_i}\, X_i^{A_1} \otimes Y_i^{A_2 \cdots A_n} \tag{3.47} \]
\[ = \sum_{i=1}^{r} \sqrt{\sigma_i} \left( \sqrt{\frac{2^{n-1}}{\sigma_i}}\, \alpha_i I_{A_1} \right) \otimes Y_i^{A_2 \cdots A_n} \tag{3.48} \]
\[ = I_{A_1} \otimes \sqrt{2^{n-1}} \sum_{i=1}^{r} \alpha_i Y_i^{A_2 \cdots A_n} \tag{3.49} \]
\[ =: I_{A_1} \otimes W'_{A_2 \cdots A_n}, \tag{3.50} \]
where in the last line we have defined the unitary $W'_{A_2 \cdots A_n} = \sqrt{2^{n-1}} \sum_{i=1}^{r} \alpha_i Y_i^{A_2 \cdots A_n}$.
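Now, given the assumption that $\mathcal{E}_1 = \mathcal{I}$, so that $W$ has the form in (3.50), we get that
\[ \mathcal{E}_2(\rho) = \mathrm{Tr}_{A_3 \cdots A_n}\!\left[ W' \left( \rho \otimes \frac{I_{A_3 \cdots A_n}}{2^{n-2}} \right) (W')^\dagger \right]. \tag{3.51} \]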
Therefore, applying the procedure above for 𝑗 = 2 by taking the bipartite cut in the operator
Schmidt decomposition of 𝑊 ′ to be 𝐴2 | 𝐴3 · · · 𝐴𝑛 , we get that if E2 is the identity channel, then
𝑊 = 𝐼 𝐴1 ⊗ 𝐼 𝐴2 ⊗ 𝑊 ′′ for some unitary 𝑊 ′′ acting on 𝐴3 · · · 𝐴𝑛 . Continuing in this manner for
all 𝑗 up to 𝑗 = 𝑛, assuming in each case that E 𝑗 is the identity channel, we ultimately obtain
𝑊 = 𝐼 𝐴1 ⊗ 𝐼 𝐴2 ⊗ · · · ⊗ 𝐼 𝐴𝑛 , which implies that 𝑈 = 𝑉, as required. □
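The operator Schmidt decomposition used in this proof, together with the identity (3.42), can be illustrated numerically. The sketch below is not part of the proof; it assumes that the decomposition across the cut $A_1 | A_2 \cdots A_n$ is obtained from a singular value decomposition of a reshaped $W$, and the function names `random_unitary` and `operator_schmidt` are hypothetical helpers.

```python
import numpy as np

def random_unitary(dim, rng):
    # Random unitary from the QR decomposition of a complex Gaussian matrix.
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

def operator_schmidt(w):
    """Operator Schmidt decomposition of w across the cut A1 | A2...An."""
    d_rest = w.shape[0] // 2
    # Indices (a1, a_rest, b1, b_rest) -> rows (a1, b1), columns (a_rest, b_rest).
    m = w.reshape(2, d_rest, 2, d_rest).transpose(0, 2, 1, 3).reshape(4, d_rest**2)
    u, s, vh = np.linalg.svd(m, full_matrices=False)
    xs = [u[:, i].reshape(2, 2) for i in range(len(s))]           # X_i on A1
    ys = [vh[i].reshape(d_rest, d_rest) for i in range(len(s))]   # Y_i on A2...An
    return s**2, xs, ys  # sigma_i is the square of the i-th singular value

n = 3
w = random_unitary(2**n, np.random.default_rng(0))
sigma, xs, ys = operator_schmidt(w)

# Rebuild W as in Eq. (3.40) and form the left-hand side of Eq. (3.42).
rebuilt = sum(np.sqrt(sg) * np.kron(x, y) for sg, x, y in zip(sigma, xs, ys))
lhs_342 = sum(sg * x.conj().T @ x for sg, x in zip(sigma, xs))
```

For a unitary $W$, `lhs_342` should equal $2^{n-1} I_{A_1}$, which is the trace-preservation condition exploited in (3.44) to (3.46).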
3.12 Relation between 𝐶LHST and 𝐶HST
Proposition 2: Let $U$ and $V$ be $2^n \times 2^n$ unitaries. Then,
\[ C_{\rm LHST}(U, V) \le C_{\rm HST}(U, V) \le n\, C_{\rm LHST}(U, V). \]
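Proof: First we rewrite the global cost function:
\[ C_{\rm HST}(U, V) = 1 - \frac{1}{d^2} \left| \mathrm{Tr}[V^\dagger U] \right|^2 = 1 - \mathrm{Tr}\!\left[ |\Phi^+\rangle\langle\Phi^+|_{AB}\, (W \otimes I_B) |\Phi^+\rangle\langle\Phi^+|_{AB} (W^\dagger \otimes I_B) \right], \tag{3.52} \]
where $W = U V^\dagger$. Also, for the local cost function, we have
\[ C_{\rm LHST}(U, V) := \frac{1}{n} \sum_{j=1}^{n} C^{(j)}_{\rm LHST}(U, V), \tag{3.53} \]
where
\[ C^{(j)}_{\rm LHST}(U, V) = 1 - \mathrm{Tr}\!\left[ \Pi_j (W \otimes I_B) |\Phi^+\rangle\langle\Phi^+|_{AB} (W^\dagger \otimes I_B) \right] \tag{3.54} \]
and we have defined
\[ \Pi_j := I_{A_1 B_1} \otimes \cdots \otimes |\Phi^+\rangle\langle\Phi^+|_{A_j B_j} \otimes \cdots \otimes I_{A_n B_n}, \tag{3.55} \]
which are projectors that all mutually commute. Let
\[ \rho := (W \otimes I_B) |\Phi^+\rangle\langle\Phi^+|_{AB} (W^\dagger \otimes I_B). \tag{3.56} \]
Then, we can write $C_{\rm HST}(U, V)$ as
\[ C_{\rm HST}(U, V) = 1 - \mathrm{Tr}[\Pi_n \cdots \Pi_1 \rho], \tag{3.57} \]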
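and we can write $C^{(j)}_{\rm LHST}(U, V)$ as
\[ C^{(j)}_{\rm LHST}(U, V) = 1 - \mathrm{Tr}[\Pi_j \rho] \tag{3.58} \]
for all $1 \le j \le n$. If we associate the events $E_j$ with the projectors $\Pi_j$, so that $\Pr[E_j] = \mathrm{Tr}[\Pi_j \rho]$,
then $\mathrm{Tr}[\Pi_n \cdots \Pi_1 \rho] = \Pr\!\left[ \bigcap_{i=1}^{n} E_i \right]$.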
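To prove (3.28), namely $C_{\rm LHST}(U, V) \le C_{\rm HST}(U, V)$, we recall a basic inequality in probability
theory. For any set $\{A_1, A_2, \ldots, A_n\}$ of events, it holds that
\[ \Pr\!\left[ \bigcup_{i=1}^{n} A_i \right] \ge \frac{1}{n} \sum_{i=1}^{n} \Pr[A_i]. \tag{3.59} \]
Let us take $A_i = \bar{E}_i$, the complement of $E_i$, in (3.59). Then,
\[ \Pr\!\left[ \bigcup_{i=1}^{n} \bar{E}_i \right] \ge \frac{1}{n} \sum_{i=1}^{n} \Pr[\bar{E}_i] \tag{3.60} \]
\[ \Rightarrow\; 1 - \Pr\!\left[ \bigcap_{i=1}^{n} E_i \right] \ge \frac{1}{n} \sum_{i=1}^{n} \left( 1 - \Pr[E_i] \right), \tag{3.61} \]
where the implication uses the fact that the complement of $\bigcup_i \bar{E}_i$ is $\bigcap_i E_i$.
By definition of the events $E_i$, the last inequality is precisely $C_{\rm HST}(U, V) \ge C_{\rm LHST}(U, V)$, as required.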
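To prove (3.29), we make use of the union bound applied to the complementary events $\bar{E}_i$:
\[ \Pr\!\left[ \bigcup_{i=1}^{n} \bar{E}_i \right] \le \sum_{i=1}^{n} \Pr[\bar{E}_i] \tag{3.62} \]
\[ \Rightarrow\; 1 - \Pr\!\left[ \bigcap_{i=1}^{n} E_i \right] \le \sum_{i=1}^{n} \left( 1 - \Pr[E_i] \right) \tag{3.63} \]
\[ = n\, C_{\rm LHST}(U, V). \tag{3.64} \]
Given that the left-hand side of the above inequality is precisely $C_{\rm HST}(U, V)$, we have that
$C_{\rm HST}(U, V) \le n\, C_{\rm LHST}(U, V)$, as required. □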
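As a sanity check of Proposition 2, both cost functions can be evaluated classically for small $n$. The following is a minimal sketch, assuming the closed form $\mathrm{Tr}[\Pi_j \rho] = \|\mathrm{Tr}_j(W)\|_F^2 / (2d)$ with $\mathrm{Tr}_j$ the partial trace over qubit $j$, which follows from (3.55) and (3.56) when $|\Phi^+\rangle$ is a product of Bell pairs on $A_j B_j$. The function names are illustrative and this brute-force evaluation scales exponentially in $n$.

```python
import numpy as np

def random_unitary(dim, rng):
    # Random unitary from the QR decomposition of a complex Gaussian matrix.
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

def c_hst(u, v):
    # Global cost: 1 - |Tr(V^dagger U)|^2 / d^2.
    d = u.shape[0]
    return 1.0 - abs(np.trace(v.conj().T @ u)) ** 2 / d**2

def c_lhst(u, v):
    # Local cost: average over qubits j of 1 - Tr[Pi_j rho].
    d = u.shape[0]
    n = d.bit_length() - 1
    w = (u @ v.conj().T).reshape([2] * (2 * n))
    local_costs = []
    for j in range(n):
        # Partial trace of W over qubit j (row index j, column index n + j).
        partial = np.trace(w, axis1=j, axis2=n + j)
        local_costs.append(1.0 - np.sum(np.abs(partial) ** 2) / (2 * d))
    return float(np.mean(local_costs))

rng = np.random.default_rng(1)
n = 3
checks = []
for _ in range(20):
    u, v = random_unitary(2**n, rng), random_unitary(2**n, rng)
    g, l = c_hst(u, v), c_lhst(u, v)
    # Proposition 2: C_LHST <= C_HST <= n * C_LHST.
    checks.append(l <= g + 1e-12 and g <= n * l + 1e-12)
```

Both costs also vanish when $V$ differs from $U$ only by a global phase, in agreement with Proposition 1.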
3.13 Proofs of complexity theorems
Theorem 6: Let $U$ and $V$ be poly($n$)-sized quantum circuits specified by $2^n \times 2^n$ unitary matrices,
and let $\epsilon = O(1/\mathrm{poly}(n))$. Then, the problem of approximating $C_{\rm HST}(U, V)$ up to $\epsilon$-precision is
DQC1-hard.
[Circuit diagram defining the unitary $U'$ from $Q$ and $Q^\dagger$.]
Figure 3.11: The trace of the unitary $U'$ defined by the circuit above is equal to the trace of the
non-unitary operator $(|0\rangle\langle 0| \otimes I)\, Q\, (|0\rangle\langle 0| \otimes I)\, Q^\dagger$ up to a factor of 4 [4].
Proof: We show that the problem of approximating the cost 𝐶HST (𝑈, 𝑉) is hard for DQC1. In
other words, we have to show that any problem in DQC1 reduces to an instance of approximating
𝐶HST (𝑈, 𝑉) for some 𝜖 = 𝑂 (1/poly(𝑛)). Recall that, given as input a poly(𝑛)-sized unitary 𝑄
on $n$ qubits, any problem in DQC1 requires us to estimate the acceptance probability $p_{\rm acc}$ of
measuring the outcome “0” on input $\rho = |0\rangle\langle 0| \otimes I/2^n$, i.e.,
\[ p_{\rm acc} = \mathrm{Tr}[(|0\rangle\langle 0| \otimes I)\, Q \rho Q^\dagger]. \tag{3.65} \]
Note that, since the above equation describes a probability via the positive semi-definite operator
$|0\rangle\langle 0| \otimes I$, the trace is a non-negative real number. Let us rewrite Eq. (3.65) as follows:
\[ p_{\rm acc} = \frac{1}{2^n} \mathrm{Tr}[(|0\rangle\langle 0| \otimes I)\, Q\, (|0\rangle\langle 0| \otimes I)\, Q^\dagger]. \tag{3.66} \]
Defining $U'$ as in Fig. 3.11, we can also write
\[ \mathrm{Tr}[(|0\rangle\langle 0| \otimes I)\, Q\, (|0\rangle\langle 0| \otimes I)\, Q^\dagger] = \mathrm{Tr}[U']/4, \tag{3.67} \]
hence the problem is equivalent to approximating the absolute value of the trace of a unitary $U'$.
In fact, given our choice of 𝑈 ′ and when taking 𝑉 to be the identity, the problem reduces to
an instance of approximating the cost 𝐶HST (𝑈 ′, 𝐼) up to some precision 𝜖 = 𝑂 (1/poly(𝑛)) via a
simple reduction. Therefore, we have shown that the problem of approximating 𝐶HST (𝑈, 𝑉) up to
𝜖-precision is DQC1-hard. □
Theorem 7: Let $U$ and $V$ be poly($n$)-sized quantum circuits specified by $2^n \times 2^n$ unitary matrices,
and let $\epsilon = O(1/\mathrm{poly}(n))$. Then, the problem of approximating $C_{\rm LHST}(U, V)$ up to $\epsilon$-precision is
DQC1-hard.
Proof: We show that any problem in DQC1 reduces to an instance of approximating 𝐶LHST (𝑈, 𝑉)
via a reduction. We are given as input a poly(𝑛)-sized unitary 𝑄 on 𝑛-qubits, and the task is to
estimate the acceptance probability of outputting “0”. Our proof strategy is to show that one can
efficiently extract Tr(𝑈 ′), the trace of an 𝑛-qubit unitary 𝑈 ′, from two distinct evaluations of 𝐶LHST
and elementary post-processing. This implies that computing 𝐶LHST is hard for DQC1, since all
problems in DQC1 can be seen as estimating the real part of Tr(𝑈 ′) via Eq. (3.66) and Eq. (3.67).
The two cost function evaluations that we consider are 𝐶LHST (𝑈1 , 𝐼) and 𝐶LHST (𝑈2 , 𝐼), where
𝑈1 = 𝑈 ′ (3.68)
𝑈2 = 𝐶𝑈 ′ . (3.69)
Here, $CU'$ denotes the controlled-$U'$ operation.
First, consider $U_2$ and let the qubit $j = n + 1$ be the control qubit of the controlled unitary $CU'$.
Then one can show that
\[ C^{(j)}_{\rm LHST}(U_2, I) = \frac{1}{2} C^{(j)}_{\rm LHST}(U_1, I) \quad \forall j \in \{1, \ldots, n\}, \tag{3.70} \]
\[ C^{(n+1)}_{\rm LHST}(U_2, I) = \frac{1}{2} - \frac{1}{2^{n+1}} \mathrm{Re}(\mathrm{Tr}(U')). \tag{3.71} \]
This gives
\[ C_{\rm LHST}(U_2, I) = \frac{1}{2(n+1)} \left[ 1 + n\, C_{\rm LHST}(U_1, I) - \frac{\mathrm{Re}(\mathrm{Tr}(U'))}{2^n} \right]. \tag{3.72} \]
For notational simplicity, let $B(U) := 1 - C_{\rm LHST}(U, I)$. Then, we can rewrite Eq. (3.72) as
\[ B(U_2) = \frac{1}{2} \left[ 1 + \frac{2^{-n}}{n+1} \mathrm{Re}(\mathrm{Tr}(U')) + \frac{n}{n+1} B(U_1) \right]. \tag{3.73} \]
Hence, we have that
\[ \mathrm{Re}(\mathrm{Tr}(U')) = 2^n \left( (n+1)(2B(U_2) - 1) - n B(U_1) \right). \tag{3.74} \]
By choosing 𝑈 ′ according to Fig. 3.11, one can see from Eq. (3.66) and Eq. (3.67) that the problem
is equivalent to 𝜖-approximating our local cost function for some 𝜖 = 𝑂 (1/poly(𝑛)). Hence,
any DQC1 problem can be efficiently solved for by computing a simple linear combination of
two instances of 𝐶LHST . Therefore, we have shown that the problem of approximating the cost
𝐶LHST (𝑈, 𝑉) is hard for DQC1. □
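The post-processing step in Eq. (3.74) can be checked numerically for a small instance. The sketch below evaluates $C_{\rm LHST}(\cdot, I)$ classically by brute force, under the same partial-trace assumption as the definition of $\Pi_j$ in (3.55), and uses a random unitary in place of the circuit of Fig. 3.11. The helper names are hypothetical, and the control qubit is placed first rather than last, which does not change the average over qubits.

```python
import numpy as np

def random_unitary(dim, rng):
    # Random unitary from the QR decomposition of a complex Gaussian matrix.
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

def c_lhst_vs_identity(u):
    """C_LHST(U, I): average over qubits j of 1 - ||Tr_j(U)||_F^2 / (2 d)."""
    d = u.shape[0]
    m = d.bit_length() - 1  # number of qubits that U acts on
    w = u.reshape([2] * (2 * m))
    fidelities = [
        np.sum(np.abs(np.trace(w, axis1=j, axis2=m + j)) ** 2) / (2 * d)
        for j in range(m)
    ]
    return 1.0 - float(np.mean(fidelities))

def controlled(u):
    # Controlled-U with the control as the first qubit: |0><0| (x) I + |1><1| (x) U.
    d = u.shape[0]
    cu = np.eye(2 * d, dtype=complex)
    cu[d:, d:] = u
    return cu

n = 3
u_prime = random_unitary(2**n, np.random.default_rng(2))

b1 = 1.0 - c_lhst_vs_identity(u_prime)              # B(U_1), with U_1 = U'
b2 = 1.0 - c_lhst_vs_identity(controlled(u_prime))  # B(U_2), with U_2 = CU'

# Eq. (3.74): recover Re Tr(U') from the two local-cost evaluations.
re_trace = 2**n * ((n + 1) * (2 * b2 - 1) - n * b1)
```

The prefactor $2^n$ in Eq. (3.74) also amplifies any estimation error in $B(U_1)$ and $B(U_2)$, which this noiseless sketch does not model.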
3.14 Gradient-free optimization method
We now outline our approach to gradient-free optimization over the continuous gate parameters
$\vec{\alpha}$ in the trainable unitary $V_{\vec{k}}(\vec{\alpha})$. This approach was used to obtain the results in Sec. 3.6. Given
that this is an implementation for small problem sizes, we employ the cost function $C_{\rm HST}(U, V_{\vec{k}}(\vec{\alpha}))$.
However, we note that one can replace $C_{\rm HST}$ with our general cost function $C_q$ for larger problem
sizes.
Algorithm 1: Gradient-free Continuous Optimization for QAQC via the HST
Input: Unitary $U$ to be compiled; trainable unitary $V_{\vec{k}}(\vec{\alpha})$ of a given structure; error
       tolerance $\varepsilon' \in (0, 1)$; maximum number of starting points $N$; maximum number of
       iterations $N_{\rm iter}$ for gp_minimize; sample precision $\delta > 0$.
Output: Parameters $\vec{\alpha}_{\rm opt}$ such that at best $C_{\rm HST}(U, V_{\vec{k}}(\vec{\alpha}_{\rm opt})) \le \varepsilon'$.
Init: $\vec{\alpha}_{\rm opt} \leftarrow 0$; cost $\leftarrow 1$
1 repeat
2     choose an initial parameter $\vec{\alpha}^{(0)}$ at random;
3     run gp_minimize with $\vec{\alpha}^{(0)}$ and $N_{\rm iter}$ as input and $\vec{\alpha}_{\rm min}$ as output; whenever the cost is
      called upon for some $\vec{\alpha}$, run the HST on $V_{\vec{k}}(\vec{\alpha})^*$ and $U$ approximately $1/\delta^2$ times to
      estimate the cost $C_{\rm HST}(U, V_{\vec{k}}(\vec{\alpha}))$;
4     if cost $\ge C_{\rm HST}(U, V_{\vec{k}}(\vec{\alpha}_{\rm min}))$ then
5         cost $\leftarrow C_{\rm HST}(U, V_{\vec{k}}(\vec{\alpha}_{\rm min}))$; $\vec{\alpha}_{\rm opt} \leftarrow \vec{\alpha}_{\rm min}$
6 until cost $\le \varepsilon'$, at most $N$ times.
7 return $\vec{\alpha}_{\rm opt}$, cost
Recall that we compute the cost function $C_{\rm HST}(U, V_{\vec{k}}(\vec{\alpha}))$ using the Hilbert-Schmidt Test (HST),
as described in Sec. 3.4.1 and illustrated in Fig. 3.4(a). For a given set of gate structure parameters
$\vec{k}$, the calculation of the cost on a quantum computer (as well as on a simulator) is affected by the
fact that, due to finite sampling, the HST allows us to obtain only an estimate of the magnitude
of the Hilbert-Schmidt inner product. Noise within the quantum computer itself also affects
the calculation of the cost. Therefore, in order to perform gradient-free optimization over the
continuous gate parameters $\vec{\alpha}$, we make use of stochastic optimization techniques that are designed
to optimize noisy functions. Specifically, we make use of the gp_minimize routine in the scikit-
optimize Python library [105], which is a gradient-free optimization routine that performs Bayesian
optimization using Gaussian processes [106, 107]. See Algorithm 1 for a general overview of the
optimization procedure. Note that with this algorithm, we obtain an $\varepsilon$-approximate compilation of
$U$, with
\[ \varepsilon = \frac{d}{d+1}\, \varepsilon'. \tag{3.75} \]
In the small-scale quantum computer implementations of Fig. 3.5(c) and Fig. 3.6, we use 50
objective function evaluations in gp_minimize per iteration. Note that evaluating the objective
function involves running the quantum circuit many times in order to sample from the output
distribution of the circuit.
For large problem sizes, as described in Sec. 3.3.4, we propose using the cost function
𝐶𝑞 = 𝑞𝐶HST + (1 − 𝑞)𝐶LHST . The gradient-free continuous parameter optimization algorithm for
𝐶𝑞 is similar to the one for 𝐶HST in Algorithm 1, except that in addition to running the HST we run
the LHST for every qubit 𝑗 ∈ {1, 2, . . . , 𝑛} in order to compute the local cost 𝐶LHST . In this case,
the algorithm provides an $\varepsilon$-approximate compilation of $U$, with
\[ \varepsilon = \frac{n}{1 - q + nq}\, \frac{d}{d+1}\, \varepsilon'. \tag{3.76} \]
We emphasize that our approach to gradient-free optimization avoids the exponential overhead
of evaluating the cost function classically, yet at the same time makes use of fast and efficient
classical heuristics for optimization. In fact, using the HST, Algorithm 1 requires only $O(1/\delta^2)$
calls to the quantum computer in order to evaluate the cost, where $\delta = 1/\sqrt{n_{\rm shots}}$ is the sample
precision, which is related to the number of samples $n_{\rm shots}$ taken from the device.
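The control flow of Algorithm 1 can be sketched classically as follows. This is an illustrative sketch under several assumptions, not the implementation used for the results in this chapter: the HST is modeled as a Bernoulli experiment that succeeds with probability $1 - C_{\rm HST}$, the ansatz is the single-qubit structure $R_z R_y R_z$, and a simple random local search (`local_search`, a hypothetical helper) stands in for the gp_minimize routine so that the example depends only on NumPy.

```python
import numpy as np

def rz(t):
    return np.diag([np.exp(-1j * t / 2), np.exp(1j * t / 2)])

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def ansatz(alpha):
    # Fixed single-qubit structure: R_z(alpha_2) R_y(alpha_1) R_z(alpha_0).
    return rz(alpha[2]) @ ry(alpha[1]) @ rz(alpha[0])

def exact_cost(u, alpha):
    d = u.shape[0]
    return 1.0 - abs(np.trace(ansatz(alpha).conj().T @ u)) ** 2 / d**2

def sampled_cost(u, alpha, delta, rng):
    # Finite-sampling model (assumption): about 1/delta^2 shots, each
    # succeeding with probability 1 - C_HST.
    shots = int(round(1.0 / delta**2))
    p_success = min(max(1.0 - exact_cost(u, alpha), 0.0), 1.0)
    return 1.0 - rng.binomial(shots, p_success) / shots

def local_search(cost_fn, alpha0, n_iter, rng, step=0.3):
    # Stand-in for gp_minimize: accept random perturbations that lower the cost.
    best, best_cost = alpha0, cost_fn(alpha0)
    for _ in range(n_iter):
        trial = best + rng.normal(scale=step, size=best.shape)
        c = cost_fn(trial)
        if c < best_cost:
            best, best_cost = trial, c
    return best, best_cost

def compile_gradient_free(u, eps, n_starts, n_iter, delta, rng):
    alpha_opt, cost = np.zeros(3), 1.0               # Init
    for _ in range(n_starts):                        # repeat ... at most N times
        alpha0 = rng.uniform(0, 2 * np.pi, size=3)   # random starting point
        alpha_min, c_min = local_search(
            lambda a: sampled_cost(u, a, delta, rng), alpha0, n_iter, rng)
        if cost >= c_min:
            cost, alpha_opt = c_min, alpha_min
        if cost <= eps:                              # until cost <= eps'
            break
    return alpha_opt, cost

rng = np.random.default_rng(3)
target = rz(np.pi / 4)  # the T gate up to a global phase
alpha_opt, cost = compile_gradient_free(
    target, eps=0.01, n_starts=5, n_iter=200, delta=0.01, rng=rng)
```

Because the returned cost is itself a finite-sample estimate, it can differ from the exact cost at `alpha_opt` by an amount of order $\delta$.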
3.14.1 Alternative method for gradient-free optimization
Here we propose an alternative algorithm for gradient-free optimization that, on average, signif-
icantly reduces the number of times the objective function is evaluated. As a result, it is more
suitable for cloud computing under a queue submission system (e.g., IBM’s Quantum Experience).
This algorithm performs a “multi-scale bisection” of the parameter space based on simulated an-
nealing. We implement this method in Sec. 3.6.1.1 specifically for the hardware of IBM because
the queue submission system can require a significant amount of time to make many calls to the
quantum computer.
This alternative approach to performing gradient-free continuous parameter optimization is out-
lined in Algorithm 2. We start with four angles spread uniformly in the interval [0, 2𝜋)—namely
0, 𝜋/2, 𝜋, and 3𝜋/2. This significantly reduces the size of the search space and allows us to get close
to, or find exactly, an optimal gate sequence. Once the optimal structure is reached from this step,
we then bisect the angles for each gate $R_z(\alpha)$ by evaluating the cost with a new circuit containing
$R_z(\alpha \pm \pi/2^{t+1})$, where $t = 1, 2, \ldots, t_{\max}$ is determined by the iteration in the procedure. Although
we do not explore all angles in the interval, the runtime is logarithmically faster than a continuous
search due to the bisection procedure. An additional advantage of this approach is that many gates
have angles that are simple fractions of $\pi$, e.g., $T = R_z(\pi/4)$ and $H = R_z(\pi/2) R_x(\pi/2) R_z(\pi/2)$, both up to a global phase.
In a noiseless environment, the two steps above are sufficient. On actual devices, we implement
a third step of stochastic optimization by evaluating the cost for the new circuit with each gate
𝑅𝑧 (𝛼) replaced by 𝑅𝑧 (𝛼 ± Δ(𝑡)) for some small value Δ(𝑡) ≪ 1 decreasing monotonically with
the iteration 𝑡. This allows us to compile for a given device by accounting for noise and gate
errors. This can be thought of as a “fine-grained” angular optimization in contrast to the previous
“coarse-grained” angular optimization.
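The coarse-grained bisection step can be illustrated for a single gate $R_z(\alpha)$. The sketch below is a simplified, noiseless stand-in for the annealing procedure: it assumes an exactly computed cost, a single parameter, and a greedy choice among $\alpha$ and $\alpha \pm \pi/2^{t+1}$ at each iteration $t$. The function name `bisection_search` and the target angle are hypothetical.

```python
import numpy as np

def rz(t):
    return np.diag([np.exp(-1j * t / 2), np.exp(1j * t / 2)])

def cost_hst(u, v):
    d = u.shape[0]
    return 1.0 - abs(np.trace(v.conj().T @ u)) ** 2 / d**2

def bisection_search(u, t_max):
    # Coarse start: the four angles 0, pi/2, pi, 3pi/2.
    omega0 = [0.0, np.pi / 2, np.pi, 3 * np.pi / 2]
    costs = [cost_hst(u, rz(a)) for a in omega0]
    alpha = omega0[int(np.argmin(costs))]
    cost = min(costs)
    # Refinement: at iteration t, try alpha +/- pi / 2^(t+1) and keep the best.
    for t in range(1, t_max + 1):
        step = np.pi / 2 ** (t + 1)
        for trial in (alpha - step, alpha + step):
            c = cost_hst(u, rz(trial))
            if c < cost:
                alpha, cost = trial, c
    return alpha, cost

target_angle = 1.0  # arbitrary angle that is not a simple fraction of pi
alpha_found, final_cost = bisection_search(rz(target_angle), t_max=12)
```

Each iteration halves the step size, so in this idealized setting the angular error after $t_{\max}$ iterations is at most $\pi/2^{t_{\max}+2}$.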
Algorithm 2: Gradient-free Optimization using Bisection for QAQC
Input: Unitary $U$ to be compiled; trainable unitary $V_{\vec{k}}(\vec{\alpha})$ of a given structure and gate
       alphabet $\mathcal{A}$; error tolerance $\varepsilon' \in (0, 1)$; maximum number of iterations $N$;
       maximum number of bisections $t_{\max}$ of the unit circle; sample precision $\delta > 0$.
Output: Parameters $\vec{\alpha}_{\rm opt}$ such that at best $C_{\rm HST}(U, V_{\vec{k}}(\vec{\alpha}_{\rm opt})) \le \varepsilon'$.
Init: Restrict all gates in $\mathcal{A}$ with continuous parameters to discrete angles in the set
      $\Omega_0 = \{0, \pi/2, \pi, 3\pi/2\}$; $\vec{\alpha}_{\rm opt} \leftarrow 0$; cost $\leftarrow 1$
1 for $t = 1, 2, \ldots, t_{\max}$ do
2     repeat
3         anneal over all possible bisected angles in the set
          $\Omega_t := \{\alpha \pm \pi/2^{t+1} \mid \alpha \in \Omega_0\} \cup \Omega_{t-1}$;
4         whenever the cost is called upon for some $\alpha \in \Omega_t$, run the HST on $V_{\vec{k}}(\vec{\alpha})^*$ and $U$
          approximately $1/\delta^2$ times to estimate the cost $C_{\rm HST}(U, V_{\vec{k}}(\vec{\alpha}))$;
5         if cost $\ge C_{\rm HST}(U, V_{\vec{k}}(\vec{\alpha}))$ then
6             cost $\leftarrow C_{\rm HST}(U, V_{\vec{k}}(\vec{\alpha}))$;
7     until cost $\le \varepsilon'$, at most $N$ times.
8 repeat
9     minimize the cost over all small continuous increments $\Delta(t) \ll 1$ within the set of
      bisected angles $\Omega_t$; whenever the cost is called upon for some $\alpha + \Delta(t)$, with
      $\alpha \in \Omega_t$, run the HST on $V_{\vec{k}}(\alpha + \Delta(t))^*$ and $U$ approximately $1/\delta^2$ times to
      estimate the cost $C_{\rm HST}(U, V_{\vec{k}}(\alpha + \Delta(t)))$;
10    if cost $\ge C_{\rm HST}(U, V_{\vec{k}}(\alpha + \Delta(t)))$ then
11        cost $\leftarrow C_{\rm HST}(U, V_{\vec{k}}(\alpha + \Delta(t)))$; $\vec{\alpha}_{\rm opt} \leftarrow \alpha + \Delta(t)$
12 until cost $\le \varepsilon'$, at most $N$ times.
13 return $\vec{\alpha}_{\rm opt}$, cost
3.15 Gradient-based optimization method
We now describe a gradient-based approach to performing the optimization over the continuous
parameters in the trainable gate sequence $V_{\vec{k}}(\vec{\alpha})$. In Sec. 3.15.1, we define a new cost function
for this purpose, and we introduce a quantum circuit to calculate this cost function on a quantum
computer. In Sec. 3.15.2, we present the results of implementing this method on a quantum
simulator. In Sec. 3.15.3, we briefly describe how the original cost functions 𝐶HST and 𝐶LHST can
also be optimized using a gradient-based method.
While recent work on gradient descent continuous optimization has shown vast quantum
speedups over classical variants [108, 109, 110], the majority of proposals still appear to be
out of reach for implementations on NISQ devices, mainly due to their use of certain algorith-
mic techniques, such as quantum random-access memory, the quantum Fourier transform, and the
Grover search algorithm, which have high resource requirements. Instead, we focus on continuous
optimization procedures that are feasible on current quantum computers and leave improvements
to our algorithms as an open problem.
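The gradient with respect to $\vec{\alpha}$ of the gate sequence $V_{\vec{k}}(\vec{\alpha})$ given by
\[ V_{\vec{k}}(\vec{\alpha}) = G_{k_L}(\alpha_L)\, G_{k_{L-1}}(\alpha_{L-1}) \cdots G_{k_1}(\alpha_1), \tag{3.77} \]
is defined by
\[ \nabla_{\vec{\alpha}} V_{\vec{k}}(\vec{\alpha}) = \left( \frac{\partial V_{\vec{k}}(\vec{\alpha})}{\partial \alpha_1}, \ldots, \frac{\partial V_{\vec{k}}(\vec{\alpha})}{\partial \alpha_L} \right), \tag{3.78} \]
where the $(i, j)$ matrix element of the $\ell$-th component is
\[ \left( \frac{\partial V_{\vec{k}}(\vec{\alpha})}{\partial \alpha_\ell} \right)_{i,j} = \frac{\partial V_{\vec{k}}(\vec{\alpha})_{i,j}}{\partial \alpha_\ell}. \tag{3.79} \]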
For example, consider the rotation gate $R_z(\alpha) = e^{-i\alpha\sigma_z/2}$, which is parametrized by the angle $\alpha$.
Then, the derivative with respect to $\alpha$ can be written as
\[ \frac{\partial}{\partial \alpha} R_z(\alpha) = -\frac{i}{2} \sigma_z R_z(\alpha), \tag{3.80} \]
which follows from the Taylor series expansion of the exponential.
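Equation (3.80) can be checked against a central finite difference. The following minimal sketch assumes only the definition $R_z(\alpha) = e^{-i\alpha\sigma_z/2}$; the test angle and step size are arbitrary choices.

```python
import numpy as np

sigma_z = np.diag([1.0, -1.0])

def rz(alpha):
    # R_z(alpha) = exp(-i alpha sigma_z / 2), diagonal in the computational basis.
    return np.diag(np.exp(-1j * alpha / 2 * np.array([1.0, -1.0])))

alpha, h = 0.7, 1e-6

# Right-hand side of Eq. (3.80).
analytic = -0.5j * sigma_z @ rz(alpha)

# Central finite-difference approximation of the derivative.
numeric = (rz(alpha + h) - rz(alpha - h)) / (2 * h)
```

The derivative is, up to the factor $1/2$, again a unitary: the original rotation followed by $-i\sigma_z$.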
Now, evaluating the gradient on a quantum computer is possible due to the fact that for the
gate alphabets we consider in this paper, only the single-qubit gates are parameterized, and these
gates are simply rotation gates. In fact, any unitary gate can be decomposed into circuits in which
only the single-qubit rotation gates are present. This is illustrated in Fig. 3.12. Furthermore, the
circuits in Fig. 3.12(a) and Fig. 3.12(b) are universal for one- and two-qubit gates, respectively (see
[6], which also contains universal circuits for 𝑛-qubit gates). This means that our gradient-based
approach can be applied to any 𝑛-qubit unitary without explicitly searching over gate structures,
though the compilations obtained in this manner will generally have sub-optimal depth.
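[Circuit diagrams: (a) $U = R_z(\alpha_{z1})\; R_y(\alpha_y)\; R_z(\alpha_{z2})$, drawn in circuit order; (b) $U_{AB}$ as a two-qubit circuit with initial single-qubit gates $U_1(\vec{\alpha}^{(1)})$ and $U_2(\vec{\alpha}^{(2)})$, three CNOT gates interleaved with $R_z(\alpha_z)$, $R_y(\alpha_{y1})$, and $R_y(\alpha_{y2})$, and final single-qubit gates $U_3(\vec{\alpha}^{(3)})$ and $U_4(\vec{\alpha}^{(4)})$.]
Figure 3.12: (a) Any single-qubit gate $U$ can be decomposed into three elementary rotations
(up to a global phase). Given appropriate parameters $\vec{\alpha} = (\alpha_{z1}, \alpha_y, \alpha_{z2})$, $U$ can be written as
$V(\vec{\alpha}) = e^{-i\alpha_{z2}\sigma_z/2}\, e^{-i\alpha_y\sigma_y/2}\, e^{-i\alpha_{z1}\sigma_z/2}$. (b) Any two-qubit gate $U_{AB}$ can be decomposed into three
CNOT gates as well as 15 elementary single-qubit gates, where each unitary $U_j(\vec{\alpha}^{(j)})$ can be
written as in (a). This decomposition is known to be optimal [5], i.e., it uses the least number of
continuous parameters and CNOT gates. General universal quantum circuits for $n$-qubit gates are
discussed in [6].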
3.15.1 The Power of Two Qubits
Consider the following cost function based on the normalized Hilbert-Schmidt distance between
the unitaries 𝑈 and 𝑉:
\[ C_{\rm POTQ}(U, V) := \frac{1}{2d} \| U - V \|_{\rm HS}^2 = 1 - \frac{1}{d} \mathrm{Re}\left[ \mathrm{Tr}(V^\dagger U) \right], \tag{3.81} \]
where POTQ stands for “Power of Two Qubits” and refers to the circuit used to evaluate it, which
we present below. Note that 𝐶POTQ (𝑈, 𝑉) is zero if and only if 𝑈 = 𝑉. Contrary to the cost function
𝐶HST (𝑈, 𝑉), which is defined using the magnitude of the inner product ⟨𝑉, 𝑈⟩, this cost function is
defined using the real part of the inner product. Consequently, it does not vanish if 𝑈 and 𝑉 differ
only by a global phase. Indeed, if $V = e^{i\varphi} U$, then $C_{\rm POTQ}(U, V) = 1 - \cos(\varphi)$.
Before discussing the circuit used to evaluate the cost function 𝐶POTQ (𝑈, 𝑉), let us review
the Power of One Qubit (POOQ) [7], shown in Fig. 3.13(a), which is a circuit for computing the
trace of a 𝑑-dimensional unitary 𝑈. This circuit acts on a 𝑑-dimensional system 𝐴, initially in the
maximally mixed state, 𝐼/𝑑, and on a single-qubit ancilla 𝑄 initially in the |0⟩ state. After applying
a Hadamard gate to 𝑄 and a controlled-𝑈 gate to 𝑄 𝐴 (with 𝑄 the control system), the reduced
density matrix 𝜌𝑄 has its off-diagonal elements proportional to Tr(𝑈). Hence, one can measure 𝑄
in the 𝑋 and 𝑌 bases, respectively, to read off the real and imaginary parts of Tr(𝑈).
We now introduce a circuit for computing the real and imaginary parts of ⟨𝑉, 𝑈⟩ that generalizes
the POOQ and is called the Power of Two Qubits (POTQ), depicted in Fig. 3.13(b). As the name
suggests, the POTQ employs two single-qubit ancillas, 𝑄 and 𝑄 ′, each initially in the |0⟩ state. In
addition, two 𝑑-dimensional systems, 𝐴 and 𝐵, are initially prepared in the Bell state |Φ+ ⟩ defined
in Eq. (3.14). (Although not shown in Fig. 3.13(b), this Bell state is prepared with a depth-two
circuit, as shown in Fig. 3.4.)
The first step in the POTQ is to prepare the two-qubit maximally entangled state $\frac{1}{\sqrt{2}}(|0\rangle|0\rangle + |1\rangle|1\rangle)$
between $Q$ and $Q'$, using the Hadamard and CNOT gates as shown in Fig. 3.13(b). The
second step is to apply a controlled-$U$ gate between $Q$ and $A$ (with $Q$ the control system). In
parallel with this gate, the anticontrolled-$V^T$ gate is applied to $Q' B$, with $Q'$ the control system,
where anticontrolled means that the roles of the $|0\rangle$ and $|1\rangle$ states on the control system are reversed
in comparison to a controlled gate. This results in the state:
\[ \frac{1}{\sqrt{2}} \left( |0\rangle_Q |0\rangle_{Q'} (I_A \otimes V^T) |\Phi^+\rangle + |1\rangle_Q |1\rangle_{Q'} (U \otimes I_B) |\Phi^+\rangle \right) = \frac{1}{\sqrt{2}} \left( |0\rangle_Q |0\rangle_{Q'} (V \otimes I_B) |\Phi^+\rangle + |1\rangle_Q |1\rangle_{Q'} (U \otimes I_B) |\Phi^+\rangle \right), \tag{3.82} \]
where to obtain the equality we used the ricochet property in Eq. (3.18). As in the HST, note that
$V$ itself is not implemented. In this case, its transpose is implemented.
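Finally, a CNOT gate is applied to $Q Q'$, with $Q$ the control system. This returns $Q'$ to the state $|0\rangle$ in
both branches of (3.82) and results in the reduced state on $Q$ being
\[ \rho_Q = \frac{1}{2} \left( |0\rangle\langle 0| + \frac{1}{d} \mathrm{Tr}(U^\dagger V)\, |0\rangle\langle 1| + \frac{1}{d} \mathrm{Tr}(V^\dagger U)\, |1\rangle\langle 0| + |1\rangle\langle 1| \right), \tag{3.83} \]
where the factor $1/d$ arises from $\langle\Phi^+| (M \otimes I_B) |\Phi^+\rangle = \mathrm{Tr}(M)/d$.
By inspection of $\rho_Q$, one can see that measuring $Q$ in the $X$ and $Y$ bases, respectively, gives the
real and imaginary parts of $\mathrm{Tr}(V^\dagger U)/d$.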
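The statement about the $X$ and $Y$ measurements can be checked with a small classical calculation. The sketch below does not simulate the POTQ circuit gate by gate: it builds the two branch states of Eq. (3.82) directly, forms the reduced state of $Q$ from their overlap, and evaluates $\langle X \rangle$ and $\langle Y \rangle$. The function names are illustrative, and the index convention for $|\Phi^+\rangle$ (system $A$ first, then $B$) is an assumption of this sketch.

```python
import numpy as np

def random_unitary(dim, rng):
    # Random unitary from the QR decomposition of a complex Gaussian matrix.
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

def potq_expectations(u, v):
    d = u.shape[0]
    # |Phi+> = (1/sqrt(d)) sum_k |k>_A |k>_B, flattened with index a * d + b.
    phi = np.eye(d).reshape(d * d) / np.sqrt(d)
    branch0 = np.kron(np.eye(d), v.T) @ phi  # Q = 0 branch: V^T applied on B
    branch1 = np.kron(u, np.eye(d)) @ phi    # Q = 1 branch: U applied on A
    overlap = np.vdot(branch1, branch0)      # <psi_1|psi_0> = Tr(U^dagger V) / d
    rho_q = 0.5 * np.array([[1.0, overlap], [np.conj(overlap), 1.0]])
    pauli_x = np.array([[0, 1], [1, 0]])
    pauli_y = np.array([[0, -1j], [1j, 0]])
    return np.trace(rho_q @ pauli_x).real, np.trace(rho_q @ pauli_y).real

rng = np.random.default_rng(4)
d = 4
u, v = random_unitary(d, rng), random_unitary(d, rng)
exp_x, exp_y = potq_expectations(u, v)
inner = np.trace(v.conj().T @ u) / d   # normalized Hilbert-Schmidt inner product
c_potq = 1.0 - exp_x                   # cost of Eq. (3.81) from the X measurement
```

In this picture the cost $C_{\rm POTQ}$ is read off from the $X$ expectation value alone, while the $Y$ expectation value carries the global-phase information that $C_{\rm HST}$ discards.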
[Circuit diagrams: (a) Power of One Qubit: an ancilla in $|0\rangle$ with gates $H$ and $R$, controlling $U$ on a register in the state $I/d$; (b) Power of Two Qubits: two ancillas in $|0\rangle$ with $H$, $R$, and two CNOT gates, controlling $U$ and $V^T$ on the two halves of $|\Phi^+\rangle$.]
Figure 3.13: (a) The Power of One Qubit (POOQ) [7]. This can be used to compute the trace
of a unitary $U$ acting on a $d$-dimensional space. The $R$ gate represents either $H$, in which case
the circuit computes $\mathrm{Re}[\mathrm{Tr}(U)]$, or the $S$ gate followed by $H$, in which case the circuit computes
$\mathrm{Im}[\mathrm{Tr}(U)]$. (b) The Power of Two Qubits (POTQ). This is a generalization of the POOQ, as can
be seen by setting $V = I$. The POTQ can be used to compute the Hilbert-Schmidt inner product
$\mathrm{Tr}(V^\dagger U)$ between two unitaries $U$ and $V$ acting on a $d$-dimensional space. As with the POOQ,
$R = H$ leads to $\mathrm{Re}[\mathrm{Tr}(V^\dagger U)]$, while $R = HS$ leads to $\mathrm{Im}[\mathrm{Tr}(V^\dagger U)]$.
Interestingly, if we set 𝑉 to the identity in the POTQ, then since the CNOT gate commutes with
the controlled-𝑈 gate and the reduced state of |Φ+ ⟩ is the maximally mixed state 𝐼/𝑑, we recover
the POOQ. The POTQ is therefore a generalization of the POOQ.
Note that while the POOQ can also be used to determine Tr(𝑉 †𝑈), the POTQ has the advantage
that the controlled gates for 𝑈 and 𝑉 can be executed in parallel, while in the POOQ they would have
to be executed in series. This makes the POTQ better suited for NISQ devices, where short depth
is crucial. Consider the depth of the POTQ. Denoting the controlled-$U$ and the anticontrolled-$V^T$ gates
as $C_U$ and $\bar{C}_{V^T}$, respectively, the overall depth is
\[ D(\mathrm{POTQ}) = 4 + \max\{ D(C_U), D(\bar{C}_{V^T}) \}. \tag{3.84} \]
Note the similarity here to Eq. (3.19). The overall depth is essentially determined by whichever
controlled gate has the largest depth.
3.15.2 Gradient-based optimization via the POTQ
The gradient with respect to $\vec{\alpha}$ of $C_{\rm POTQ}(U, V_{\vec{k}}(\vec{\alpha}))$ can be computed using the POTQ. This is due
to the fact that
\[ \frac{\partial}{\partial \alpha_\ell} \mathrm{Re}\left[ \mathrm{Tr}(V_{\vec{k}}(\vec{\alpha})^\dagger U) \right] = \frac{1}{2} \mathrm{Re}\left[ \mathrm{Tr}\left( \widetilde{V}^{(\ell)}_{\vec{k}}(\vec{\alpha})^\dagger U \right) \right], \tag{3.85} \]
where
\[ \widetilde{V}^{(\ell)}_{\vec{k}}(\vec{\alpha}) := G_{k_L}(\alpha_L) \cdots G_{k_{\ell+1}}(\alpha_{\ell+1})\, (-i\sigma_{k_\ell})\, G_{k_\ell}(\alpha_\ell)\, G_{k_{\ell-1}}(\alpha_{\ell-1}) \cdots G_{k_1}(\alpha_1) \tag{3.86} \]
is the original gate sequence $V_{\vec{k}}(\vec{\alpha})$ except with an additional Pauli gate $\sigma_{k_\ell}$ corresponding to the
variable with respect to which the derivative is taken. (Note that for the gate alphabets that we
consider in this paper, only the single-qubit gates are parameterized, and these gates are simply
rotation gates. The derivative of any one-qubit rotation gate is analogous to the expression in
(3.80) for the derivative of the rotation gate $R_z(\alpha)$.) This means that to compute the gradient of
$C_{\rm POTQ}(U, V_{\vec{k}}(\vec{\alpha}))$, we simply add the appropriate local Pauli gate to the original gate sequence and
run the POTQ on this new gate sequence.
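The identity (3.85) can be verified numerically for a short gate sequence. The sketch below assumes a single-qubit sequence of the form $R_z R_y R_z$ and compares the analytic expression, built by inserting $-i\sigma_{k_\ell}$ after the $\ell$-th gate as in (3.86), with a central finite difference. The names `rot` and `sequence`, as well as the parameter values, are hypothetical.

```python
import numpy as np

PAULI = {
    "y": np.array([[0, -1j], [1j, 0]]),
    "z": np.diag([1.0 + 0j, -1.0]),
}

def rot(axis, alpha):
    # Rotation gate exp(-i alpha sigma_axis / 2).
    return np.cos(alpha / 2) * np.eye(2) - 1j * np.sin(alpha / 2) * PAULI[axis]

def sequence(axes, alphas, insert_at=None):
    """G_L ... G_1; optionally insert (-i sigma) right after gate `insert_at`."""
    v = np.eye(2, dtype=complex)
    for idx, (axis, a) in enumerate(zip(axes, alphas)):
        v = rot(axis, a) @ v
        if idx == insert_at:
            v = (-1j * PAULI[axis]) @ v
    return v

axes = ("z", "y", "z")
alphas = np.array([0.3, 1.1, -0.8])
u = sequence(axes, [1.0, 0.4, 2.0])  # arbitrary target unitary

def re_trace(al):
    return np.trace(sequence(axes, al).conj().T @ u).real

h = 1e-6
numeric, analytic = [], []
for ell in range(len(axes)):
    shift = np.zeros(len(axes))
    shift[ell] = h
    # Left-hand side of Eq. (3.85) by central finite difference.
    numeric.append((re_trace(alphas + shift) - re_trace(alphas - shift)) / (2 * h))
    # Right-hand side of Eq. (3.85) using the modified sequence of Eq. (3.86).
    v_tilde = sequence(axes, alphas, insert_at=ell)
    analytic.append(0.5 * np.trace(v_tilde.conj().T @ u).real)
```

Since the modified sequence is again a unitary gate sequence, each gradient component is itself a quantity of the form $\mathrm{Re}\,\mathrm{Tr}(V^\dagger U)$, which is what allows it to be estimated with the same circuit as the cost.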
Our gradient-based optimization procedure is outlined in Algorithm 3. Given an arbitrary
unitary $U$ as input, Algorithm 3 compiles $U$ to a unitary $V_{\vec{k}}(\vec{\alpha}_{\rm opt})$ of a given structure $\vec{k}$ that
minimizes the cost $C_{\rm POTQ}$. The gradient is evaluated with the POTQ circuit as a subroutine within
a classical gradient-descent algorithm. The overall query complexity in the number of calls to the
cost evaluation routine of Algorithm 3 is $O(NTL/\delta^2)$, where $\delta = 1/\sqrt{n_{\rm shots}}$ is the sample precision,
$N$ is the maximum number of repetitions over random initial parameters $\vec{\alpha}^{(0)}$, $L$ is the dimension
of the continuous parameter space of $\vec{\alpha}$, and $T$ is the number of gradient descent iterations for a
suitable learning rate $\eta > 0$. In order to improve convergence, it may also be useful to supply
the quantum subroutines for computing the cost function and the gradient to a more advanced
minimization routine, for example as found in the Python library SciPy [61]. We present below the
results on compiling both single-qubit and two-qubit gates on a simulator.
Algorithm 3: Gradient-based Continuous Optimization for QAQC via the POTQ
Input: Unitary $U$ to be compiled; a trainable unitary $V_{\vec{k}}(\vec{\alpha})$ of a given structure, where $\vec{\alpha}$
       is a continuous circuit parameter of dimension $L$; maximum number of iterations
       $N$; error tolerance $\varepsilon' \in (0, 1)$; learning rate $\eta > 0$; sample precision $\delta > 0$.
Output: Parameters $\vec{\alpha}_{\rm opt}$ such that at best $C_{\rm POTQ}(U, V_{\vec{k}}(\vec{\alpha}_{\rm opt})) \le \varepsilon'$.
Init: $\vec{\alpha}_{\rm opt} \leftarrow 0$; cost $\leftarrow 1$
1 repeat
2     choose initial parameters $\vec{\alpha}^{(0)}$ at random
3     for $\tau = 1, 2, \ldots, T$ do
4         for $i = 1, 2, \ldots, L$ do
5             run the POTQ on $\partial_{\alpha_i} V_{\vec{k}}(\vec{\alpha}^{(\tau-1)})^T$ and $U$ approximately $1/\delta^2$ times to estimate
              $\mathrm{Re}\left[ \mathrm{Tr}\left( \partial_{\alpha_i} V_{\vec{k}}(\vec{\alpha}^{(\tau-1)})^\dagger U \right) \right]$
6         update $\vec{\alpha}^{(\tau)} \leftarrow \vec{\alpha}^{(\tau-1)} - \eta \nabla_{\vec{\alpha}} C_{\rm POTQ}(U, V_{\vec{k}}(\vec{\alpha}^{(\tau-1)}))$
7         run the POTQ on $V_{\vec{k}}(\vec{\alpha}^{(\tau)})^T$ and $U$ approximately $1/\delta^2$ times to estimate the cost
          $C_{\rm POTQ}(U, V_{\vec{k}}(\vec{\alpha}^{(\tau)}))$
8         if cost $\ge C_{\rm POTQ}(U, V_{\vec{k}}(\vec{\alpha}^{(\tau)}))$ then
9             cost $\leftarrow C_{\rm POTQ}(U, V_{\vec{k}}(\vec{\alpha}^{(\tau)}))$; $\vec{\alpha}_{\rm opt} \leftarrow \vec{\alpha}^{(\tau)}$
10 until cost $\le \varepsilon'$, at most $N$ times
11 return $\vec{\alpha}_{\rm opt}$, cost
When performing Algorithm 3, we rely on the ability to perform the controlled-$U$ gate. The
unitary $U$ may be unknown, e.g., as in Fig. 3.1(b). In general, to perform a controlled operation
with respect to a target unitary $U$, one can use a method for “remote control” [111]. This method
employs a local $U$ gate and controlled-SWAP operations in order to realize the controlled-$U$ gate.
In practice, since any controlled unitary gate can be decomposed into native gates, the ability to
compile controlled-SWAP, the Toffoli gate, and the set of controlled rotations is sufficient. In order
to perform such a translation, we allow the user to have access to a small-scale classical compiler.
This does not incur exponential overhead since the gates to be translated are one- and two-qubit
gates (or their controlled versions). While this may cause the depth of our compiled unitary to
increase, it will only be by a constant factor.
We note that decoherence, gate infidelity, and readout errors on NISQ computers are all more
pronounced when attempting to execute controlled unitaries. This means that there is significant
performance loss for controlled unitaries, as required in the POTQ. Consequently, we did not
implement our gradient-based optimization method on current quantum devices, but we speculate
that improvements to quantum hardware will enable this application.
Figure 3.14: Compiling one- and two-qubit gates on a simulator with the gate alphabet in (3.35)
using the gradient-based optimization technique described in Algorithm 3, with $n_{\mathrm{shots}} = 10{,}000$.
Shown is the cost as a function of the number of gradient calls of the continuous parameter
optimization using the minimize routine in the SciPy-optimize Python library. The gate structure
for the single-qubit gates is fixed to the one shown in Fig. 3.12(a), while the gate structure for the
two-qubit gates is fixed to the one shown in Fig. 3.12(b).
3.15.2.1 Implementation on a quantum simulator
We use IBM’s simulator [3] to compile a selection of single-qubit and two-qubit gates by performing
the gradient-based optimization procedure in Algorithm 3. In order to improve convergence, we
additionally supply the gradient, as well as the cost function, to the minimize routine in the
SciPy-optimize Python library [61]. For the single-qubit gates, we assume a fixed structure for the
trainable gate sequence according to the decomposition in Fig. 3.12(a), while for the two-qubit
gates we assume a fixed structure for the trainable gate sequence according to the decomposition
in Fig. 3.12(b). We compile the 𝑇 gate, 𝑋 gate, Hadamard (𝐻) gate, as well as the CNOT and CZ
gates, all with $n_{\mathrm{shots}} = 10{,}000$. The results are shown in Fig. 3.14. We note that increasing
$n_{\mathrm{shots}}$ by orders of magnitude significantly reduces the sampling error and results in more stable
convergence at the cost of an increase in runtime.
3.15.3 Gradient-based optimization via the HST and LHST
We now show that it is possible to perform gradient-based optimization of the original cost function
𝐶HST and its local variant 𝐶LHST . This allows us to perform gradient-based optimization of the
general cost function 𝐶𝑞 = 𝑞𝐶HST + (1 − 𝑞)𝐶LHST . The algorithm for gradient-based optimization
of 𝐶HST and 𝐶LHST is presented in Algorithm 4.
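The gradient with respect to $\vec{\alpha}$ of both $C_{\mathrm{HST}}(U, V_{\vec{k}}(\vec{\alpha}))$ and $C_{\mathrm{LHST}}(U, V_{\vec{k}}(\vec{\alpha}))$ can be computed
using the HST and the LHST, respectively. Specifically, for a gate sequence of the form in (3.77),
in which the only parameterized gates are the single-qubit rotation gates, we have that
\begin{equation}
\frac{\partial}{\partial \alpha_\ell} C_{\mathrm{HST}}(U, V_{\vec{k}}(\vec{\alpha}))
= \frac{1}{2} C_{\mathrm{HST}}(U, \widehat{V}^{(\ell)}_{\vec{k},+}(\vec{\alpha}))
- \frac{1}{2} C_{\mathrm{HST}}(U, \widehat{V}^{(\ell)}_{\vec{k},-}(\vec{\alpha})),
\tag{3.87}
\end{equation}
and
\begin{equation}
\frac{\partial}{\partial \alpha_\ell} C^{(j)}_{\mathrm{LHST}}(U, V_{\vec{k}}(\vec{\alpha}))
= \frac{1}{2} C^{(j)}_{\mathrm{LHST}}(U, \widehat{V}^{(\ell)}_{\vec{k},+}(\vec{\alpha}))
- \frac{1}{2} C^{(j)}_{\mathrm{LHST}}(U, \widehat{V}^{(\ell)}_{\vec{k},-}(\vec{\alpha}))
\tag{3.88}
\end{equation}
for all $j \in \{1, 2, \ldots, n\}$. Here,
\begin{equation}
\widehat{V}^{(\ell)}_{\vec{k},\pm}(\vec{\alpha}) := G_{k_L}(\alpha_L) \cdots G_{k_{\ell+1}}(\alpha_{\ell+1})\, G_{k_\ell}\!\left(\pm\frac{\pi}{2}\right)
G_{k_\ell}(\alpha_\ell)\, G_{k_{\ell-1}}(\alpha_{\ell-1}) \cdots G_{k_1}(\alpha_1)
\tag{3.89}
\end{equation}
is the original gate sequence $V_{\vec{k}}(\vec{\alpha})$ with an additional rotation gate $G_{k_\ell}(\pm\frac{\pi}{2})$ corresponding to the
variable with respect to which the derivative is taken. In other words, to compute the gradient
of the cost function $C_{\mathrm{HST}}(U, V_{\vec{k}}(\vec{\alpha}))$, we run the HST in Fig. 3.4(a) twice, once with the gate
sequence $\widehat{V}^{(\ell)}_{\vec{k},+}(\vec{\alpha})$ and once with the gate sequence $\widehat{V}^{(\ell)}_{\vec{k},-}(\vec{\alpha})$. Similarly, to compute the gradient of
the functions $C^{(j)}_{\mathrm{LHST}}(U, V_{\vec{k}}(\vec{\alpha}))$, we run the LHST in Fig. 3.4(b) twice, once with the gate sequence
$\widehat{V}^{(\ell)}_{\vec{k},+}(\vec{\alpha})$ and once with the gate sequence $\widehat{V}^{(\ell)}_{\vec{k},-}(\vec{\alpha})$.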
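The expressions for the gradient in (3.87) and (3.88) can be verified by recalling that only the
one-qubit gates need to be parameterized and that they can always be assumed to have the form
$e^{-i\alpha\sigma/2}$ for some Pauli operator $\sigma$, where $\alpha$ is the continuous parameter specifying the gate. Then,
for the gate sequence $V_{\vec{k}}(\vec{\alpha})$ in (3.77), we get
\begin{align}
\frac{\partial V_{\vec{k}}(\vec{\alpha})}{\partial \alpha_\ell}
&= G_{k_L}(\alpha_L) \cdots G_{k_{\ell+1}}(\alpha_{\ell+1})\, \frac{\partial G_{k_\ell}(\alpha_\ell)}{\partial \alpha_\ell}\,
G_{k_{\ell-1}}(\alpha_{\ell-1}) \cdots G_{k_1}(\alpha_1) \tag{3.90} \\
&= -\frac{i}{2}\, G_{k_L}(\alpha_L) \cdots G_{k_{\ell+1}}(\alpha_{\ell+1})\, \sigma_{k_\ell}\, G_{k_\ell}(\alpha_\ell)\,
G_{k_{\ell-1}}(\alpha_{\ell-1}) \cdots G_{k_1}(\alpha_1). \tag{3.91}
\end{align}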
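Then, we use the identity
\begin{equation}
i[\sigma_{k_\ell}, \rho] = G_{k_\ell}\!\left(-\frac{\pi}{2}\right) \rho\, G_{k_\ell}\!\left(-\frac{\pi}{2}\right)^{\dagger}
- G_{k_\ell}\!\left(\frac{\pi}{2}\right) \rho\, G_{k_\ell}\!\left(\frac{\pi}{2}\right)^{\dagger},
\tag{3.92}
\end{equation}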
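which holds for any state $\rho$. We also observe that both the functions $C_{\mathrm{HST}}(U, V_{\vec{k}}(\vec{\alpha}))$ and
$C^{(j)}_{\mathrm{LHST}}(U, V_{\vec{k}}(\vec{\alpha}))$ are of the form
\begin{equation}
F(\vec{\alpha}) = \mathrm{Tr}\!\left[ H \left( U \otimes V_{\vec{k}}(\vec{\alpha})^{*} \right) \rho \left( U^{\dagger} \otimes V_{\vec{k}}(\vec{\alpha})^{T} \right) \right],
\tag{3.93}
\end{equation}
where $\rho = |\Phi^+\rangle\langle\Phi^+|_{A_1 \cdots A_n}$ for both functions, $H = |\Phi^+\rangle\langle\Phi^+|_{A_1 \cdots A_n}$ for $C_{\mathrm{HST}}(U, V_{\vec{k}}(\vec{\alpha}))$, and
$H = |\Phi^+\rangle\langle\Phi^+|_{A_j B_j} \otimes I_{\bar{A}_j \bar{B}_j}$ for $C^{(j)}_{\mathrm{LHST}}(U, V_{\vec{k}}(\vec{\alpha}))$. Finally, using
\begin{equation}
\frac{\partial F(\vec{\alpha})}{\partial \alpha_\ell}
= \mathrm{Tr}\!\left[ H \left( U \otimes \frac{\partial V_{\vec{k}}(\vec{\alpha})^{*}}{\partial \alpha_\ell} \right) \rho \left( U^{\dagger} \otimes V_{\vec{k}}(\vec{\alpha})^{T} \right) \right]
+ \mathrm{Tr}\!\left[ H \left( U \otimes V_{\vec{k}}(\vec{\alpha})^{*} \right) \rho \left( U^{\dagger} \otimes \frac{\partial V_{\vec{k}}(\vec{\alpha})^{T}}{\partial \alpha_\ell} \right) \right],
\tag{3.94}
\end{equation}
substituting (3.91) into this expression, and using (3.92) to simplify, we obtain (3.87) and (3.88).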
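The two-evaluation structure of (3.87) and (3.88) can be checked numerically on a toy problem. The sketch below is not the HST or LHST circuit: it assumes a single-qubit function of the form $F(\alpha) = \mathrm{Tr}[H\, G(\alpha)\, \rho\, G(\alpha)^\dagger]$ with $G(\alpha) = e^{-i\alpha\sigma/2}$, and compares the shifted-evaluation derivative with a central finite difference. The function names and the particular `rho` and `H` are illustrative choices, not taken from the thesis.

```python
import numpy as np

# Pauli matrices; sigma is the generator of the rotation gate.
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)


def rotation(sigma, alpha):
    """G(alpha) = exp(-i * alpha * sigma / 2) for a Pauli operator sigma."""
    return np.cos(alpha / 2) * I2 - 1j * np.sin(alpha / 2) * sigma


def expectation(alpha, sigma, rho, H):
    """Toy cost F(alpha) = Tr[H G(alpha) rho G(alpha)^dagger]."""
    G = rotation(sigma, alpha)
    return np.real(np.trace(H @ G @ rho @ G.conj().T))


def shift_derivative(alpha, sigma, rho, H):
    """dF/dalpha from two evaluations shifted by +pi/2 and -pi/2."""
    plus = expectation(alpha + np.pi / 2, sigma, rho, H)
    minus = expectation(alpha - np.pi / 2, sigma, rho, H)
    return 0.5 * (plus - minus)


# Illustrative inputs (assumptions): a mixed single-qubit state and a projector.
rho = np.array([[0.75, 0.25], [0.25, 0.25]], dtype=complex)
H = 0.5 * (I2 + X)

alpha, eps = 0.3, 1e-6
finite_difference = (
    expectation(alpha + eps, Y, rho, H) - expectation(alpha - eps, Y, rho, H)
) / (2 * eps)
print(shift_derivative(alpha, Y, rho, H), finite_difference)
```

The two printed numbers should agree up to finite-difference error. On hardware, each shifted evaluation would be estimated from samples, which is why the algorithms in this section budget roughly $1/\delta^2$ shots per circuit.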
Algorithm 4: Gradient-based Continuous Optimization for QAQC via the (L)HST
Input: Unitary $U$ to be compiled; a trainable unitary $V_{\vec{k}}(\vec{\alpha})$ of a given structure, where $\vec{\alpha}$
    is a continuous circuit parameter of dimension $L$; maximum number of iterations
    $N$; gradient tolerance $\varepsilon' \in (0, 1)$; sample precision $\delta > 0$; cost function
    $C \in \{C_{\mathrm{HST}}, C_{\mathrm{LHST}}\}$.
Output: Parameters $\vec{\alpha}_{\mathrm{opt}}$ such that at best $\|\nabla_{\vec{\alpha}} C(U, V_{\vec{k}}(\vec{\alpha}_{\mathrm{opt}}))\|_2 \le \varepsilon'$.
Init: $\vec{\alpha}_{\mathrm{opt}} \leftarrow 0$; cost $\leftarrow 0$; grad $\leftarrow \infty$; $\tau \leftarrow 0$; gradCount $\leftarrow 0$; $\eta \leftarrow 1$
1  choose initial parameters $\vec{\alpha}^{(0)}$ at random
2  cost $\leftarrow C(U, V_{\vec{k}}(\vec{\alpha}^{(0)}))$
3  while $\tau < N$ and gradCount $< 4$ do
4      $\tau \leftarrow \tau + 1$
5      for $i = 1, 2, \ldots, L$ do
6          Calculate $\partial C / \partial \alpha_i$ using either (3.87) or (3.88), taking approximately $1/\delta^2$ samples for
           each circuit.
7      grad $\leftarrow \|\nabla_{\vec{\alpha}} C(U, V_{\vec{k}}(\vec{\alpha}^{(\tau-1)}))\|_2$
8      if grad $\le \varepsilon'$ then
9          gradCount $\leftarrow$ gradCount $+ 1$
10     $\vec{\alpha}_1^{(\tau-1)} \leftarrow \vec{\alpha}^{(\tau-1)} - \eta \nabla_{\vec{\alpha}} C(U, V_{\vec{k}}(\vec{\alpha}^{(\tau-1)}))$
11     $\vec{\alpha}_2^{(\tau-1)} \leftarrow \vec{\alpha}_1^{(\tau-1)} - \eta \nabla_{\vec{\alpha}} C(U, V_{\vec{k}}(\vec{\alpha}^{(\tau-1)}))$
12     if cost $- C(U, V_{\vec{k}}(\vec{\alpha}_2^{(\tau-1)})) \ge \eta \cdot$ grad then
13         $\eta \leftarrow 2\eta$
14         $\vec{\alpha}^{(\tau)} \leftarrow \vec{\alpha}_2^{(\tau-1)}$
15     else if cost $- C(U, V_{\vec{k}}(\vec{\alpha}_1^{(\tau-1)})) < \frac{\eta}{2} \cdot$ grad then
16         $\eta \leftarrow \eta / 2$
17         $\vec{\alpha}^{(\tau)} \leftarrow \vec{\alpha}_1^{(\tau-1)}$
18     else
19         $\vec{\alpha}^{(\tau)} \leftarrow \vec{\alpha}_1^{(\tau-1)}$
20     cost $\leftarrow C(U, V_{\vec{k}}(\vec{\alpha}^{(\tau)}))$
21 $\vec{\alpha}_{\mathrm{opt}} \leftarrow \vec{\alpha}^{(\tau)}$
22 return $\vec{\alpha}_{\mathrm{opt}}$, cost

The quantum algorithms we developed in the first part of this thesis have several advantages for
near-term quantum computers, but in order to scale them to problem sizes that are challenging for
classical methods, several problems must be solved. Such problems include optimization strategies
and techniques for dealing with errors. In this second part of the thesis, we zoom in on the problem
of dealing with errors. As we have described in Chapter 1, the usual long-term solution is quantum
error correction and fault tolerance [112, 113, 114], but this requires significant overhead beyond
current experimental capabilities. It is therefore interesting and important to develop methods for
dealing with errors with less overhead. The general name used to refer to this is quantum error
mitigation (QEM) [115].
A central task in near-term quantum algorithms (and many other areas of quantum information
processing) is to estimate the expectation value of an observable 𝑂 with respect to a pure state
𝜌 = |𝜓⟩⟨𝜓|, i.e. ⟨𝑂⟩ = ⟨𝜓|𝑂|𝜓⟩ = Tr(𝜌𝑂). If the state is prepared by a quantum device, it can be
noisy, and we instead evaluate ⟨𝑂⟩noisy = Tr(𝜌 E 𝑂) where 𝜌 E = E (𝜌) denotes the noisy state that
is corrupted from the target state 𝜌 by an unknown noisy quantum channel E. Given the corrupted
state 𝜌 E , the goal of quantum error mitigation is to estimate a quantity ⟨𝑂⟩QEM that is closer to the
target value ⟨𝑂⟩ compared to the noisy result ⟨𝑂⟩noisy . In other words, we seek to compute ⟨𝑂⟩QEM
such that
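\begin{equation}
\left| \langle O \rangle_{\mathrm{QEM}} - \langle O \rangle \right| < \left| \langle O \rangle_{\mathrm{noisy}} - \langle O \rangle \right|.
\tag{3.95}
\end{equation}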
A relatively large number of QEM techniques have been proposed in recent literature, including
zero-noise extrapolation [116, 117], probabilistic error cancellation [116, 118], randomized
compiling [119], Pauli-frame randomization [120], dynamical decoupling [121, 122, 123, 124],
quantum optimal control [125, 126], subspace expansion [127], virtual distillation [128, 129] and
others [130, 131, 132, 133, 134]. Some authors have defined common frameworks which encapsulate
one or more of these techniques [135, 136]. At its core, any QEM technique uses additional
quantum resources (qubits, gates, and/or samples) in a clever way to approximate what would
happen in an ideal device.
CHAPTER 4
ADVANCES IN ZERO-NOISE EXTRAPOLATION
4.1 Digital and adaptive ZNE
4.1.1 Introduction
Zero-noise extrapolation (ZNE) was introduced concurrently in [116] and [117]. In ZNE, a
quantum program is altered to run at different effective levels of processor noise. The result of the
computation is then extrapolated to an estimated value at a noiseless level. More formally, one can
parameterize the noise-level of a quantum system with a dimensionless scale factor 𝜆. For 𝜆 = 0
the noise is removed, while for 𝜆 = 1 the true noise-level of the physical hardware is matched. For
example, 𝜆 could be a multiplicative factor that scales the dissipative terms of a master equation
[116]. More generally, 𝜆 could represent a re-scaling of any physical quantity which introduces
some noise in the quantum computation: the calibration uncertainty of variational parameters, the
temperature of the quantum processor, etc.
For a given quantum program, we can measure an arbitrary expectation value 𝐸 (𝜆). By
construction, 𝐸 (1) represents the expectation value evaluated with the natural noise of the hardware,
whereas 𝐸 (0) denotes the noiseless observable which, despite being not directly measurable, we
would like to estimate.
To implement ZNE, one needs a direct or indirect way to scale the quantum computation’s noise
level to values of 𝜆 larger than one. With such a method, ZNE can be implemented in two main
steps:
1. Noise-scaling: Measure 𝐸 (𝜆) at 𝑚 different values of 𝜆 ≥ 1.
2. Extrapolation: Infer 𝐸 (0) from the 𝑚 expectation values measured in the previous step.
Figure 4.1: An example of the change of an expectation value, 𝐸 (𝜆), with the underlying scaling
𝜆 of the depolarizing noise level. Here the simulated base noise value is 5% (marked by the green
dashed vertical line). ZNE increases that noise and back extrapolates to the 𝜆 = 0 expectation
value. In this example, an accurate extrapolation should be non-linear and take advantage of a
known asymptotic behavior.
Figure 4.1 shows an example noise curve given by scaling depolarizing noise for a randomized
benchmarking circuit.
In this work, we introduce improvements to both noise-scaling and extrapolation methods for
quantum error mitigation. In Section 4.1.2.1 we introduce unitary folding, a framework for digital
noise scaling of generic gate noise. We then move to the extrapolation step of ZNE, which we
characterize as an inference problem. We study non-adaptive (Section 4.1.3) extrapolation methods
and introduce adaptive (Section 4.1.4) extrapolation to improve performance and reduce resource
overhead for ZNE.
4.1.2 Noise scaling methods
In [116] and [137] a time-scaling approach implements the scaling of effective noise on the back-
end quantum processor. Control pulses for each gate are re-calibrated to execute the same unitary
evolution but applied over a longer amount of time. This effectively scales up the noise. While
successfully used to suppress errors in single and two-qubit quantum programs on a superconducting
quantum processor [137], time-scaling has some disadvantages:
• It requires programmer access to low-level physical-control parameters. This level of access
is not available on all quantum hardware and breaks the gate model abstraction.
• Control pulses must be re-calibrated for each time duration and error-scaling. This calibration
can be resource intensive.
Instead, we study alternative approaches that require only a gate-level access to the system.
Rather than increasing the time duration of each gate, we increase the total number of gates or,
similarly, the circuit depth. This procedure is similar to what is usually done by a quantum compiler
but with the opposite goal: instead of optimizing a circuit to reduce its depth or its gate count,
we are interested in “de-optimizing” to increase the effect of noise and decoherence. We use the
term digital to describe noise-scaling techniques that manipulate just the quantum program at the
instruction set layer. Their advantage is that they can be used with the gate model access that is
common to most quantum assembly languages [138, 139, 140]. Low level access to pulse shaping
and detailed physical knowledge of quantum processor physics is no longer required. Our digital
framework incorporates and generalizes some recent related work [141, 142].
4.1.2.1 Unitary folding
We describe two methods–circuit folding and gate folding–for scaling the effective noise of a
quantum computation based on unitary folding, i.e., replacing a unitary circuit (or gate) 𝑈 by:
\begin{equation}
U \to U (U^{\dagger} U)^{n},
\tag{4.1}
\end{equation}
where 𝑛 is a positive integer. In an ideal circuit, since 𝑈 †𝑈 is equal to the identity, this folding
operation has no logical effect. However, on a real quantum computer, we expect that the noise
increases since the number of physical operations scales by a factor of 1 + 2𝑛. This effect is clearly
visible in the quantum computing experiment reported in Figure 4.5.
A similar trick was used in Ref. [141, 142], where noise was artificially increased by inserting
pairs of CNOT gates into quantum circuits. In our framework, 𝑈 can represent the full input circuit
or, alternately, some local gates which are inserted with different strategies.
4.1.2.2 Circuit folding
Assume that the circuit is composed of 𝑑 unitary layers:
\begin{equation}
U = L_d \cdots L_2 L_1,
\tag{4.2}
\end{equation}
where 𝑑 represents the depth of the circuit and each block 𝐿 𝑗 can either represent a single layer of
operations or just a single gate.
In circuit folding, the substitution rule in Eq. (4.1) is applied globally, i.e., to the entire circuit.
This scales the effective depth by odd integers. In order to have a more fine-grained resolution of the
scaling factor, we can also allow for a final folding applied to a subset of the circuit corresponding
to its last 𝑠 layers. The general circuit folding replacement rule is therefore:
\begin{equation}
U \to U (U^{\dagger} U)^{n} L_d^{\dagger} L_{d-1}^{\dagger} \cdots L_{d-s+1}^{\dagger} L_{d-s+1} \cdots L_{d-1} L_d.
\tag{4.3}
\end{equation}
The total number of layers of the new circuit is 𝑑 (2𝑛 + 1) + 2𝑠. This means that we can stretch the
depth of a circuit up to a scale resolution of 2/𝑑, i.e., we can apply the scaling 𝑑 → 𝜆𝑑, where:
\begin{equation}
\lambda = 1 + \frac{2k}{d}, \qquad k = 1, 2, 3, \ldots.
\tag{4.4}
\end{equation}
Conversely, for every real 𝜆, one can apply the following procedure:
1. Determine the closest integer 𝑘 to the real quantity 𝑑 (𝜆 − 1)/2.
2. Perform an integer division of 𝑘 by 𝑑. The quotient corresponds to 𝑛 and the remainder to 𝑠.
3. Apply 𝑛 integer foldings and a final partial folding as described in Eq. (4.3).
From a physical point of view, the circuit folding method corresponds to repeatedly driving the
Hamiltonian of the qubits forwards and backwards in time, such that the ideal unitary part of the
dynamics is not changed while the non-unitary effect of the noise is amplified.
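The three-step procedure above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: a circuit is represented as a list of layer labels in execution order, the inverse of a layer is marked by a trailing "†", and the function names are hypothetical.

```python
def fold_parameters(scale_factor, depth):
    """Return (n, s): the number of full folds and of partially folded layers."""
    if scale_factor < 1:
        raise ValueError("scale_factor must be at least 1")
    k = round(depth * (scale_factor - 1) / 2)  # closest integer to d(lambda - 1)/2
    return divmod(k, depth)  # quotient n, remainder s


def inverse(layers):
    """Inverse of a layer sequence: reverse the order and toggle the dagger mark."""
    return [
        layer[:-1] if layer.endswith("†") else layer + "†"
        for layer in reversed(layers)
    ]


def fold_circuit(layers, scale_factor):
    """Circuit folding: n global folds plus a partial fold of the last s layers."""
    n, s = fold_parameters(scale_factor, len(layers))
    folded = list(layers)
    for _ in range(n):
        folded += inverse(layers) + list(layers)
    if s > 0:
        tail = list(layers[-s:])
        folded += inverse(tail) + tail
    return folded


layers = ["L1", "L2", "L3", "L4"]
print(fold_circuit(layers, 2.5))
print(len(fold_circuit(layers, 3)))
```

For a circuit of depth 𝑑 the returned list has 𝑑(2𝑛 + 1) + 2𝑠 entries, matching the layer count quoted above. Note that Python's `round` resolves exact ties to the nearest even integer, which is one of several reasonable conventions for "closest integer".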
Table 4.1: Different methods for implementing gate (or layer) folding

Method        Subset of indices to fold
From left     𝑆 = {1, 2, . . . , 𝑠}
From right    𝑆 = {𝑑, 𝑑 − 1, . . . , 𝑑 − 𝑠 + 1}
At random     𝑆 = 𝑠 different indices randomly sampled without replacement from {1, 2, . . . , 𝑑}
4.1.2.3 Gate (or layer) folding
Instead of globally folding a quantum circuit, appending the folds at the end, one could fold a subset
of individual gates (or layers) in place. Let us consider the circuit decomposition of Eq. (4.2) where
we can assume that each unitary operator 𝐿 𝑗 represents just a single gate applied to one or two
qubits of the system or, alternatively, each 𝐿 𝑗 could be a layer of several gates.
If we apply the replacement rule given in Eq. (4.1) to each gate (or layer) 𝐿 𝑗 of the circuit, it is
clear that the initial number of gates (layers) 𝑑 is scaled by an odd integer 1 + 2𝑛. Similarly to the
case of circuit folding, we can add a final partial folding operation to get a scaling factor which is
more fine grained. In order to achieve such “partial” folding, let us define an arbitrary subset 𝑆 of
the full set of indices {1, 2, . . . 𝑑}, such that its number of elements is a given integer 𝑠 = |𝑆|. In
this setting, we can define the following gate (layer) folding rule:
\begin{equation}
\forall j \in \{1, 2, \ldots, d\}, \qquad L_j \to
\begin{cases}
L_j (L_j^{\dagger} L_j)^{n} & \text{if } j \notin S, \\
L_j (L_j^{\dagger} L_j)^{n+1} & \text{if } j \in S.
\end{cases}
\tag{4.5}
\end{equation}
Depending on how we choose the elements of the subset 𝑆, different noise channels will be added
at different positions along the circuit and so we can have different results. The optimal choice may
depend on the particular circuit and noise model. We focus on three different ways of selecting
the subset of gates (layers) to be folded: from left, from right and at random. Depending on the
method, the prescription for selecting the subset 𝑆 of indices is reported in Table 4.1.
It is easy to check that the number of gates (or layers), obtained after the application of the gate
folding rule given in Eq. (4.5) is 𝑑 (2𝑛 + 1) + 2𝑠. This is exactly the same number obtained after
the application of the global circuit-folding rule given in Eq. (4.3). As a consequence, the number
of gates (layers) is still stretched by a factor 𝜆, i.e., 𝑑 → 𝜆𝑑, where 𝜆 can take the specific values
reported in Eq. (4.4). Moreover, if we are given an arbitrary 𝜆 and we want to determine the values
of 𝑛 and 𝑠, we can simply apply the same procedure that was given in the case of circuit-folding.
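Gate (or layer) folding with the three selection rules of Table 4.1 can be sketched in the same style. As before, layers are plain string labels in execution order and a trailing "†" marks an inverse; the function names, the method strings, and the optional `rng` argument are assumptions made for this illustration.

```python
import random


def select_indices(depth, s, method, rng=None):
    """Pick the subset S of (0-based) layer indices following Table 4.1."""
    if method == "from_left":
        return set(range(s))
    if method == "from_right":
        return set(range(depth - s, depth))
    if method == "at_random":
        rng = rng or random.Random()
        return set(rng.sample(range(depth), s))  # sampling without replacement
    raise ValueError(f"unknown method: {method}")


def fold_gates(layers, scale_factor, method="from_left", rng=None):
    """Fold every layer n times in place, and the layers in S one extra time."""
    depth = len(layers)
    k = round(depth * (scale_factor - 1) / 2)
    n, s = divmod(k, depth)
    chosen = select_indices(depth, s, method, rng)
    folded = []
    for j, layer in enumerate(layers):
        repeats = n + 1 if j in chosen else n
        folded.append(layer)
        folded.extend([layer + "†", layer] * repeats)
    return folded


layers = ["A", "B", "C", "D"]
print(fold_gates(layers, 2, "from_left"))
print(fold_gates(layers, 2, "from_right"))
print(fold_gates(layers, 2, "at_random", rng=random.Random(0)))
```

All three calls return lists of the same length, 𝑑(2𝑛 + 1) + 2𝑠; only the positions of the inserted inverse pairs differ, which is where a dependence on the circuit and noise model can enter.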
While preparing this manuscript we became aware of [141], whose technique is similar to our
gate folding (at random). The main difference is that [141] focuses mainly on CNOT gates and uses
random sampling with replacement, whereas in our case any gate (or layer) can be folded and the
sampling is performed without replacement. The rationale for this choice is to sample the input
circuit more uniformly, and to converge smoothly to the odd integer values of 𝜆 = 1 + 2𝑛 where all
the input gates are folded exactly 𝑛 times.
4.1.2.4 Advantages and limitations of unitary folding
The main advantage of the unitary folding approach is that it is digital, i.e., noise is scaled using a
high level of abstraction from the physical hardware. Moreover, it can be applied without knowing
the details of the underlying noise-model. It is natural to ask: how justified is this approach
physically? Does unitary folding actually correspond to an effective scaling of the physical noise
of the hardware?
For example, unitary folding may fail to amplify systematic and coherent errors since applying
the inverse of a gate will usually undo such errors instead of increasing them. It is also clear
that unitary folding is not appropriate to scale state preparation and measurement (SPAM) noise,
since this noise is independent of the circuit depth. Instead, we expect that unitary folding can be
used for scaling incoherent noise models which are associated both to the application of individual
gates and/or to the time-length of the overall computation. The more we increase the depth of
the circuit, the more such kinds of noise are usually amplified. In this work this intuition is
confirmed by numerical and experimental examples in which unitary folding is successfully used
for implementing ZNE (see Figures 4.2, 4.3, 4.4 and 4.5).
The effect of unitary folding can be analytically derived when the noise-model for each gate $L_j$
is a global depolarizing channel with a gate-dependent parameter $p_j \in [0, 1]$, acting as:
\begin{equation}
\rho \xrightarrow{\text{noisy gate}} p_j L_j \rho L_j^{\dagger} + (1 - p_j) I/D,
\tag{4.6}
\end{equation}
where $D$ is the dimension of the Hilbert space associated to all the qubits of the circuit. Since the
depolarizing channel commutes with unitary operations, we can postpone the noise channels of all
the gates until the end of the full circuit $U$, resulting in a single final depolarizing channel:
\begin{equation}
\rho \xrightarrow{\text{noisy circuit}} p\, U \rho U^{\dagger} + (1 - p) I/D,
\tag{4.7}
\end{equation}
where $p = \prod_j p_j$ is the product of all the gate-dependent noise parameters $p_j$. This simple
commutation property does not hold for local depolarizing noise, unless we are dealing with
single-qubit circuits.
Consider what happens if we apply unitary folding with a scale factor $\lambda = 1 + 2n$ (odd positive
integer). For both the circuit folding and the gate folding methods, defined in Eqs. (4.3) and (4.5)
respectively, the final result is exactly equivalent to an exponential scaling of all the depolarizing
parameters of each gate $p_j \to p_j^{\lambda}$ or, equivalently, to the global operation:
\begin{equation}
\rho \xrightarrow{\text{noise + unitary folding}} p^{\lambda} U \rho U^{\dagger} + (1 - p^{\lambda}) I/D.
\tag{4.8}
\end{equation}
This implies that unitary folding is equivalent to an exponential parameterization of the noise level
$p$, and so any expectation value is also scaled according to an exponential ansatz:
\begin{equation}
E(\lambda) = a + b\, p^{\lambda},
\tag{4.9}
\end{equation}
which we can fit and extrapolate according to the methods discussed in Sections 4.1.3 and 4.1.4.
Equations (4.8) and (4.9) are valid only for depolarizing noise and for odd scaling factors $\lambda$.
For gate-independent depolarizing noise, the global parameter $p$ is a function of the total number
of gates only. This means that all the folding methods (circuit, from left, from right and at random)
become equivalent, and induce the exponential scalings of Eqs. (4.8) and (4.9) for all values of $\lambda$.
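The exponential scaling can be verified with a small density-matrix calculation. The sketch below assumes a two-gate single-qubit circuit, illustrative noise parameters, and, as an explicit extra assumption not stated in the text, that each inverse gate carries the same depolarizing parameter as the gate it inverts. It checks that circuit folding and gate folding with 𝜆 = 3 both reproduce the global channel with $p$ replaced by $p^{\lambda}$.

```python
import numpy as np


def noisy_gate(rho, unitary, p):
    """Ideal gate followed by global depolarizing noise with parameter p."""
    dim = rho.shape[0]
    return p * (unitary @ rho @ unitary.conj().T) + (1 - p) * np.eye(dim) / dim


def run(rho, gates):
    """Apply a list of (unitary, p) pairs in execution order."""
    for unitary, p in gates:
        rho = noisy_gate(rho, unitary, p)
    return rho


def dagger(gate):
    """Inverse gate; assumed here to have the same noise parameter."""
    unitary, p = gate
    return unitary.conj().T, p


hadamard = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
phase = np.diag([1, 1j]).astype(complex)
L1, L2 = (hadamard, 0.98), (phase, 0.95)  # illustrative noise parameters
rho0 = np.array([[1, 0], [0, 0]], dtype=complex)

# Circuit folding with n = 1: U U^dagger U, written in execution order.
circuit_folded = [L1, L2, dagger(L2), dagger(L1), L1, L2]
# Gate folding with n = 1: each gate L_j becomes L_j L_j^dagger L_j.
gate_folded = [L1, dagger(L1), L1, L2, dagger(L2), L2]

p, scale = 0.98 * 0.95, 3
ideal = phase @ hadamard @ rho0 @ hadamard.conj().T @ phase.conj().T
predicted = p**scale * ideal + (1 - p**scale) * np.eye(2) / 2

print(np.allclose(run(rho0, circuit_folded), predicted))
print(np.allclose(run(rho0, gate_folded), predicted))
```

The agreement relies on the depolarizing channel commuting with unitaries; replacing `noisy_gate` with a local or coherent noise model would generally break it.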
4.1.2.5 Numerical results
We executed density matrix simulations using unitary folding for zero-noise extrapolation. Broadly,
these results show that unitary folding is effective in a variety of situations. Furthermore, we
benchmark on both random circuits and a variational algorithm with six or more qubits. This extends
previous work that focuses on the single- and two-qubit cases [116, 117, 137, 118]. Figure 4.2 shows
a simulated two qubit randomized benchmarking experiment under 1% depolarizing noise with and
without error-mitigation. Noise was scaled using circuit folding as described in Section 4.1.2.2.
Figure 4.3 shows the distribution of noise reduction by ZNE with circuit folding on randomly
generated six qubit circuits. Let 𝐸 𝑚 be the mitigated expectation value of a circuit after zero-
noise extrapolation. Then 𝑅𝑚 = |𝐸 𝑚 − 𝐸 (0)| is the absolute value of the error in the mitigated
expectation and 𝑅𝑢 = |𝐸 (1) − 𝐸 (0)| is the absolute value of the error of the unmitigated circuit.
The improvement from ZNE is quantified as 𝑅𝑢 /𝑅𝑚 .
Table 4.2 (see Section 4.1.3) provides a comparison of different combinations of folding and
extrapolation techniques on a set of randomized benchmarking circuits.
Figure 4.4 shows the performance of unitary folding ZNE on a variational algorithm. Using
exact density matrix simulation, we study the percentage closer to optimal achieved by the quantum
approximate optimization algorithm (QAOA) [143] on random instances of MAXCUT.
4.1.3 Non-adaptive extrapolation methods: Zero noise extrapolation as statistical inference
In Section 4.1.2, we discussed several methods to scale noise. In this section we study, from an
estimation theory perspective, the second component of ZNE: extrapolating the measured data to
the zero-noise limit.
We assume that the output of the quantum computation is a single expectation value 𝐸 (𝜆),
where 𝜆 is the noise scale factor. This expectation could be the result of a single quantum circuit or
some combinations of quantum circuits with classical post-processing. The expectation value 𝐸 (𝜆)
is a real number which, in principle, can only be estimated in the limit of infinite measurement
samples. In a real situation with $N$ samples, only a statistical estimation of the expectation value is
actually possible:
\begin{equation}
\hat{E}(\lambda) = E(\lambda) + \hat{\delta},
\tag{4.10}
\end{equation}
where $\hat{\delta}$ is a random variable with zero mean and variance $\sigma^2 = \mathbb{E}(\hat{\delta}^2) = \sigma_0^2/N$, with $\sigma_0^2$
corresponding to the single-shot variance. In other words, we can sample a real prediction $y$ from
the probability distribution:
\begin{equation}
P(\hat{E}(\lambda) = y) = \mathcal{N}(E(\lambda) - y, \sigma^2),
\tag{4.11}
\end{equation}
where $\mathcal{N}(\mu, \sigma^2)$ is a generic distribution (typically Gaussian), with mean $\mu$ and variance
$\sigma^2 = \sigma_0^2/N$.

Figure 4.2: Comparison of two qubit randomized benchmarking with & without error mitigation.
Data is taken by density matrix simulation with a 1% depolarizing noise model. The unmitigated
simulation results in a randomized benchmarking decay of 97.9%. Mitigation is applied using
circuit folding and an order-2 polynomial extrapolation at 𝜆 = 1, 1.5, 2.0. With mitigation the
randomized benchmarking decay improves to 99.0%. Since we do not impose any constraint on
the domain of the extrapolated results, some of the mitigated expectation values are slightly beyond
the physical upper limit of 1. This is an expected effect of the noise introduced by the extrapolation
fit. If necessary, one could enforce the result to be physical by using a more advanced Bayesian
estimator.

Algorithm 5: Generic non-adaptive extrapolation
Data: A set of increasing noise scale factors $\boldsymbol{\lambda} = \{\lambda_1, \lambda_2, \ldots, \lambda_m\}$, with $\lambda_j \ge 1$ and fixed
    number of samples $N$ for each $\lambda_j$.
Result: A mitigated expectation value
1  $\mathbf{y} \leftarrow \emptyset$;
2  begin
3      for $\lambda_j \in \boldsymbol{\lambda}$ do
4          $y_j \leftarrow$ ComputeExpectation($\lambda_j$, $N$);
5          Append($\mathbf{y}$, $y_j$);
       /* Arbitrary best fit algorithm (e.g., least squares) */
6      $\boldsymbol{\Gamma}^* \leftarrow$ BestFit($E_{\mathrm{model}}(\lambda; \boldsymbol{\Gamma})$, $(\boldsymbol{\lambda}, \mathbf{y})$);
7      return $E_{\mathrm{model}}(0; \boldsymbol{\Gamma}^*)$;

Figure 4.3: A comparison of improvements from ZNE (using quadratic extrapolation with folding
from left) averaged across all output bitstrings from 250 random six-qubit circuits. Results are from
exact density matrix simulations with a base of 1% depolarizing noise. The horizontal axis shows
a ratio of $L_2$ distances from the noiseless probability distribution and the vertical axis shows the
frequency of obtaining this result. ZNE improves on the noisy result by factors of 1-7X. The average
mitigated error is 0.075 ± 0.035, while the unmitigated errors average 0.114 ± 0.050. Each circuit
has 40 moments with single-qubit gates sampled randomly from {𝐻, 𝑋, 𝑌, 𝑍, 𝑆, 𝑇} and two-qubit
gates sampled randomly from {iSWAP, CZ} with arbitrary connectivity.
Given a set of $m$ scaling parameters $\boldsymbol{\lambda} = \{\lambda_1, \lambda_2, \ldots, \lambda_m\}$, with $\lambda_j \ge 1$, and the corresponding
results
\begin{equation}
\mathbf{y} = \{y_1, y_2, \ldots, y_m\},
\tag{4.12}
\end{equation}
the ZNE problem is to build a good estimator $\hat{E}(0)$ for $E(\lambda = 0)$, such that its bias
\begin{equation}
\mathrm{Bias}(\hat{E}(0)) = \mathbb{E}(\hat{E}(0) - E(0)),
\tag{4.13}
\end{equation}
and its variance
\begin{equation}
\mathrm{Var}(\hat{E}(0)) = \mathbb{E}(\hat{E}(0)^2) - \mathbb{E}(\hat{E}(0))^2,
\tag{4.14}
\end{equation}
are both reasonably small. More precisely, a typical figure of merit for the quality of the estimator is
its mean squared error with respect to the true unknown parameter:
\begin{align}
\mathrm{MSE}(\hat{E}(0)) &= \mathbb{E}\left[ (\hat{E}(0) - E(0))^2 \right] \tag{4.15} \\
&= \mathrm{Var}(\hat{E}(0)) + \mathrm{Bias}(\hat{E}(0))^2. \tag{4.16}
\end{align}

Figure 4.4: Percent closer to optimal on random MAXCUT executions. 14 Erdős-Rényi random
graphs were generated at each number 𝑛. Each random graph has 𝑛 nodes and 𝑛 edges. QAOA
was then run (with 𝑝 = 2 QAOA steps) and optimized using Nelder-Mead with and without error
mitigation. Results are from exact density matrix simulations with a base of 2% depolarizing noise.
For the mitigated case, we used zero noise extrapolation with global unitary folding for scaling
and linear extrapolation at noise scalings of 1, 1.5 and 2. The y axis shows the percent closer
to the optimal solution that was gained by ZNE. Here $E_u$ is the absolute error in the unmitigated
expectation and $E_m$ is the absolute error in the mitigated expectation. The violin plot shows the
distribution of percentage improvements over the 14 sampled instances. Variance is zero for 2- and
3-node graphs as there is only a single valid graph with 𝑛 nodes and edges for 𝑛 = 2, 3.
If the expectation value 𝐸 (𝜆) can be an arbitrary function of 𝜆 without any regularity assumption,
then zero-noise extrapolation is impossible. Indeed its value at 𝜆 = 0 would be arbitrary and
unrelated to its values at 𝜆 ≥ 1. However from physical considerations, it is reasonable to have
a model for 𝐸 (𝜆), e.g., we can assume a linear, a polynomial or an exponential dependence with
respect to 𝜆. For example, for a depolarizing noise model, one can use the exponential ansatz given
in Eq. (4.9).
If we choose a generic model $E_{\mathrm{model}}(\lambda; \boldsymbol{\Gamma})$ for the quantum expectation value, where $\boldsymbol{\Gamma}$ represents
the model parameters, then the zero-noise-extrapolation problem reduces to a regression problem.
Algorithm 5 is the general form for non-adaptive ZNE. Alternatively, the scale factors $\lambda_j$ and the
associated numbers of samples $N_j$ can be chosen in an adaptive way, depending on the results of
intermediate steps. This adaptive extrapolation method is studied in more detail in Section 4.1.4.
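The regression view of non-adaptive ZNE can be made concrete with a short sketch in the spirit of Algorithm 5. It is only one possible instance: the best-fit step is taken to be a polynomial least-squares fit via `numpy.polyfit`, and `toy_expectation` is a hypothetical stand-in for running a noise-scaled circuit (an exponential decay plus Gaussian shot noise), not a model used in the thesis.

```python
import numpy as np


def extrapolate_to_zero(scale_factors, compute_expectation, num_samples, degree):
    """Measure E at each scale factor, fit a polynomial, and evaluate it at 0."""
    scale_factors = np.asarray(scale_factors, dtype=float)
    if len(scale_factors) < degree + 1:
        raise ValueError("need at least degree + 1 scale factors")
    y = np.array([compute_expectation(lam, num_samples) for lam in scale_factors])
    coeffs = np.polyfit(scale_factors, y, degree)  # least squares, highest power first
    return np.polyval(coeffs, 0.0)


rng = np.random.default_rng(0)


def toy_expectation(scale_factor, num_samples):
    """Hypothetical noisy expectation value with shot noise ~ 1/sqrt(N)."""
    exact = 0.5 + 0.5 * 0.9**scale_factor
    return exact + rng.normal(0.0, 1.0 / np.sqrt(num_samples))


estimate = extrapolate_to_zero(
    [1.0, 1.5, 2.0, 2.5, 3.0], toy_expectation, num_samples=10_000, degree=2
)
print(estimate)  # to be compared with the noiseless toy value E(0) = 1.0
```

Swapping the fitting step for a non-linear routine would give other members of the same family, for example the exponential ansatz of Eq. (4.9).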
We focus on two main non-adaptive models, the polynomial ansatz and the poly-exponential
ansatz. These two general models give rise to a large variety of specific extrapolation algorithms.
Some well known methods, such as Richardson’s extrapolation, are particular cases. Some other
methods have, to our knowledge, not been applied before for quantum error mitigation.
4.1.3.1 Polynomial extrapolation
The polynomial extrapolation method is based on the following polynomial model of degree 𝑑:
\begin{equation}
E^{(d)}_{\mathrm{poly}}(\lambda) = c_0 + c_1 \lambda + \ldots + c_d \lambda^{d},
\tag{4.17}
\end{equation}
where 𝑐 0 , 𝑐 1 , . . . 𝑐 𝑑 are 𝑑 + 1 unknown real parameters. This essentially corresponds to a Taylor
series approximation and is physically justified in the weak noise regime.
In general, the problem is well defined only if the number of data points 𝑚 is at least equal
to the number of free parameters 𝑑 + 1. As opposed to Richardson’s extrapolation [116], a useful
feature of this method is that we can keep the extrapolation order 𝑑 small but still use a large
number of data points 𝑚. This avoids an over-fitting effect: if we increase the order 𝑑 by too much,
then the model is forced to follow the random statistical fluctuations of our data at the price of a
large generalization error for the zero-noise extrapolation. In terms of the inference error given in
Eq. (4.15), if we increase 𝑑 by too much, then the bias is reduced but the variance can grow so
much that the total mean squared error is actually increased.
4.1.3.2 Linear extrapolation
Linear extrapolation is perhaps the simplest method and is a particular case of polynomial
extrapolation. It corresponds to the model:
\begin{equation}
E_{\mathrm{linear}}(\lambda) = E^{(d=1)}_{\mathrm{poly}}(\lambda) = c_0 + c_1 \lambda.
\tag{4.18}
\end{equation}
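In this case a simple analytic solution exists, corresponding to the ordinary least squares estimator
of the intercept parameter:
\begin{equation}
\hat{E}_{\mathrm{linear}}(0) = \bar{y} - \frac{S_{\lambda y}}{S_{\lambda\lambda}} \bar{\lambda},
\tag{4.19}
\end{equation}
where
\begin{equation}
\bar{\lambda} = \frac{1}{m} \sum_j \lambda_j, \qquad
\bar{y} = \frac{1}{m} \sum_j y_j, \qquad
S_{\lambda y} = \sum_j (\lambda_j - \bar{\lambda})(y_j - \bar{y}), \qquad
S_{\lambda\lambda} = \sum_j (\lambda_j - \bar{\lambda})^2.
\tag{4.20}
\end{equation}
With respect to the zero noise value of the model $E_{\mathrm{linear}}(0)$, the estimator is unbiased. If the
statistical uncertainty $\sigma^2$ for each $y_j$ is the same, the variance for $\hat{E}_{\mathrm{linear}}(0)$ is:
\begin{equation}
\mathrm{Var}[\hat{E}_{\mathrm{linear}}(0)] = \sigma^2 \left[ \frac{1}{m} + \frac{\bar{\lambda}^2}{S_{\lambda\lambda}} \right].
\tag{4.21}
\end{equation}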
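The closed-form intercept and its variance are short enough to write out directly. The following sketch implements Eqs. (4.19) to (4.21) with NumPy; the function name and the optional `sigma` argument are illustrative, and the input data are made up for the example.

```python
import numpy as np


def linear_zne(scale_factors, y, sigma=None):
    """Ordinary least squares intercept at lambda = 0 and, optionally, its variance."""
    lam = np.asarray(scale_factors, dtype=float)
    y = np.asarray(y, dtype=float)
    m = len(lam)
    lam_bar, y_bar = lam.mean(), y.mean()
    s_ly = np.sum((lam - lam_bar) * (y - y_bar))
    s_ll = np.sum((lam - lam_bar) ** 2)
    estimate = y_bar - (s_ly / s_ll) * lam_bar
    variance = None
    if sigma is not None:
        # Same single-point standard deviation sigma assumed for every y_j.
        variance = sigma**2 * (1 / m + lam_bar**2 / s_ll)
    return estimate, variance


estimate, variance = linear_zne([1, 2, 3], [0.9, 0.8, 0.7], sigma=0.1)
print(estimate, variance)
```

For these inputs the variance of the extrapolated value is larger than the single-point variance $\sigma^2$, which illustrates the statistical price of extrapolating away from the measured range.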
4.1.3.3 Richardson extrapolation
Richardson’s extrapolation is also a particular case of polynomial extrapolation where 𝑑 = 𝑚 − 1,
i.e., the order is maximized given the number of data points:
\begin{equation}
E_{\mathrm{Rich}}(\lambda) = E^{(d=m-1)}_{\mathrm{poly}}(\lambda) = c_0 + c_1 \lambda + \ldots + c_{m-1} \lambda^{m-1}.
\tag{4.22}
\end{equation}
This is the only case in which the fitted polynomial perfectly interpolates the 𝑚 data points such
that, in the ideal limit of an infinite number of samples 𝑁 → ∞, the error with respect to the
true expectation value is by construction 𝑂 (𝑚). Using the interpolating Lagrange polynomial, the
estimator can be explicitly expressed as:
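\begin{equation}
\hat{E}_{\mathrm{Rich}}(0) = \hat{c}_0 = \sum_{k=1}^{m} y_k \prod_{i \neq k} \frac{\lambda_i}{\lambda_i - \lambda_k},
\tag{4.23}
\end{equation}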
where we assumed that all the elements of λ are different.
The error of the estimator is 𝑂 (𝑚) only in the asymptotic limit 𝑁 → ∞. In other words 𝑂 (𝑚)
corresponds to the bias term in Eq. (4.15). In a real scenario, 𝑁 is finite, and the variance term in
128
Eq. (4.15) grows exponentially as we increase 𝑚. This fact can be easily shown in the simplified
case in which the noise scale factors are equally spaced, i.e., 𝜆 𝑘 = 𝑘 𝜆1 where 𝑘 = 1, 2, . . . 𝑚.
Substituting this assumption into Eq. (4.23) we get:
\hat{E}_{\mathrm{Rich}}(0) = \sum_{k=1}^{m} y_k \prod_{i \neq k} \frac{i}{i - k} = \sum_{k=1}^{m} y_k (-1)^{k-1} \binom{m}{k}. (4.24)
If we assume that each expectation value is sampled with the same statistical variance 𝜎 2 as
described in Eq. (4.11), since 𝐸ˆ R𝑖𝑐ℎ (0) is a linear combination of the measured expectation values
{𝑦 𝑘 }, its variance is given by:
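\mathrm{Var}(\hat{E}_{\mathrm{Rich}}(0)) = \sigma^2 \sum_{k=1}^{m} \binom{m}{k}^2 = \sigma^2 \left[ \binom{2m}{m} - 1 \right] \xrightarrow{m \to \infty} \sigma^2 \frac{2^{2m}}{\sqrt{\pi m}}, (4.25)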
where we used Vandermonde's identity and, in the last step, the Stirling approximation.
The practical implication of Eq. (4.25) is that the zero-noise limit predicted by the Richardson
estimator is characterized by a statistical uncertainty which scales exponentially with the number
of data points.
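The variance growth of Eq. (4.25) can be checked numerically. The sketch below builds the Richardson weights of Eq. (4.23) and sums their squares for equally spaced scale factors; the function names are hypothetical and chosen only for this example.

```python
from math import comb, pi, sqrt


def richardson_coefficients(scale_factors):
    """Weights w_k such that E_Rich(0) = sum_k w_k * y_k, following Eq. (4.23)."""
    coefficients = []
    for k, lam_k in enumerate(scale_factors):
        weight = 1.0
        for i, lam_i in enumerate(scale_factors):
            if i != k:
                weight *= lam_i / (lam_i - lam_k)
        coefficients.append(weight)
    return coefficients


def variance_amplification(m):
    """Var / sigma^2 of the Richardson estimator for scale factors 1, 2, ..., m."""
    return sum(w * w for w in richardson_coefficients(range(1, m + 1)))


for m in (2, 4, 8):
    exact = comb(2 * m, m) - 1            # closed form in Eq. (4.25)
    stirling = 2 ** (2 * m) / sqrt(pi * m)  # asymptotic form in Eq. (4.25)
    print(m, variance_amplification(m), exact, stirling)
```

The weights always sum to one, so a constant signal is extrapolated exactly, while the sum of their squares reproduces the binomial expression of Eq. (4.25).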
4.1.3.4 Poly-Exponential extrapolation
The poly-exponential ansatz of degree 𝑑 is:
E_{\mathrm{polyexp}}^{(d)}(\lambda) = a \pm e^{z(\lambda)}, \qquad z(\lambda) := z_0 + z_1 \lambda + \ldots + z_d \lambda^d, (4.26)
where 𝑎, 𝑧0 , 𝑧1 , . . . 𝑧 𝑑 are 𝑑 + 2 parameters. From physical considerations, it is reasonable to assume
that 𝐸 (𝜆) converges to a finite asymptotic value i.e.:
E(\lambda) \xrightarrow{\lambda \to \infty} a \iff z(\lambda) \xrightarrow{\lambda \to \infty} -\infty. (4.27)
There are two important scenarios: (i) where 𝑎 is unknown and so a non-linear fit should be
performed and (ii) where 𝑎 is deduced from asymptotic physical considerations. For example, if
we know that in the limit of 𝜆 → ∞ the state of the system is completely mixed or thermal, it is
possible to fix the value of 𝑎 such that the poly-exponential ansatz (4.26) is left with only 𝑑 + 1
unknown parameters: 𝑧0 , 𝑧1 , . . . 𝑧 𝑑 . If the asymptotic limit 𝑎 is known, we can apply the following
procedure:
1. Evaluate {𝑦′𝑘 } = {log(|𝑦 𝑘 − 𝑎| + 𝜖)}, representing the measurement results in a convenient
logarithmic space with coordinates (𝑦′𝑘 , 𝜆 𝑘 ), with a small regularizing constant 𝜖 > 0.
2. The model of Eq. (4.26) in the logarithmic space (𝑦′𝑘 , 𝜆 𝑘 ) reduces to the polynomial 𝑧(𝜆).
3. Estimate the zero-noise limit in the logarithmic space 𝑧ˆ (0) = 𝑧ˆ0 with a standard polynomial
extrapolation. If necessary different weights can be used for different scale factors, taking
into account the non-linear propagation of statistical errors.
4. Convert back to the original space, obtaining the final estimator 𝐸̂(0) = 𝑎 ± exp(𝑧̂(0)).
This allows us to map a non-linear regression problem into a polynomial fit that is linear with
respect to the parameters and therefore much more stable. However, reasonable alternative
approaches exist, such as maximum likelihood optimization. A Bayesian approach could also
be used, especially if we have prior information about the parameters of the model.
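The four-step procedure above can be sketched as follows for a known asymptote 𝑎. This is a minimal illustration rather than a reference implementation: the function name `polyexp_zero_noise`, the unweighted fit, the way the sign in Eq. (4.26) is chosen from the data, and the synthetic data are all assumptions of this example.

```python
import numpy as np


def polyexp_zero_noise(scale_factors, values, asymptote, order=1, eps=1e-10):
    """Poly-exponential extrapolation with a known asymptote a, via a fit in log space."""
    lam = np.asarray(scale_factors, dtype=float)
    y = np.asarray(values, dtype=float)
    # Step 1: map the data to logarithmic space with a small regularizing constant.
    log_y = np.log(np.abs(y - asymptote) + eps)
    # Steps 2-3: unweighted polynomial fit of z(lambda); polyfit returns the
    # highest-degree coefficient first, so the last entry is the estimate of z_0.
    z_coefficients = np.polyfit(lam, log_y, deg=order)
    z0 = z_coefficients[-1]
    # Step 4: convert back. Assumption: all data lie on the same side of the
    # asymptote, so the sign in Eq. (4.26) is read off from the data.
    sign = 1.0 if np.mean(y - asymptote) >= 0 else -1.0
    return asymptote + sign * np.exp(z0)


# Hypothetical noiseless data from E(lambda) = 0.25 + 0.75 * exp(-0.3 * lambda).
lams = np.array([1.0, 1.5, 2.0, 2.5])
ys = 0.25 + 0.75 * np.exp(-0.3 * lams)
estimate = polyexp_zero_noise(lams, ys, asymptote=0.25, order=1)
print(estimate)
```

With `order=1` this reduces to the exponential model; a weighted fit, as mentioned in step 3, could be obtained by passing weights to the polynomial fit.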
4.1.3.5 Exponential extrapolation
Exponential extrapolation is a particular case of the more general poly-exponential method. It
corresponds to the model:
E_{\mathrm{exp}}(\lambda) = E_{\mathrm{polyexp}}^{(d=1)}(\lambda) = a \pm e^{z_0 + z_1 \lambda} = a + b e^{-c\lambda}, (4.28)
where the set of real coefficients 𝑎, 𝑏, 𝑐 is a way of parametrizing the same ansatz, alternative but
equivalent to 𝑎, 𝑧0 , 𝑧1 . This model was discussed in [118] and is generalized by our extrapolation
framework. In particular, increasing the order 𝑑, for example to 𝑑 = 2, and using the poly-
exponential model (4.26) we can capture small deviations from the ideal exponential assumption,
possibly obtaining a more accurate zero-noise extrapolation.
Scaling     Extrapolation              Error % (dep.)   Error % (amp. damp.)
none        unmitigated                29.9 ± 5.1       16.7 ± 4.0
circuit     linear (𝑑 = 1)             14.6 ± 4.6       5.40 ± 2.3
circuit     quadratic (𝑑 = 2)          6.35 ± 3.6       3.53 ± 3.4
circuit     Richardson (𝑑 = 3)         17.6 ± 11        17.9 ± 16
circuit     exponential (𝑎 = 0.25)     2.73 ± 1.9       2.06 ± 1.6
circuit     adapt. exp. (𝑎 = 0.25)     1.27 ± 1.1       2.69 ± 2.8
at random   linear (𝑑 = 1)             15.6 ± 5.3       5.20 ± 2.4
at random   quadratic (𝑑 = 2)          5.54 ± 4.4       8.00 ± 8.1
at random   Richardson (𝑑 = 3)         30.0 ± 24        24.0 ± 18
at random   exponential (𝑎 = 0.25)     2.84 ± 1.8       0.95 ± 1.0
at random   adapt. exp. (𝑎 = 0.25)     1.77 ± 1.4       2.18 ± 1.2
from left   linear (𝑑 = 1)             14.4 ± 4.5       5.16 ± 2.3
from left   quadratic (𝑑 = 2)          6.73 ± 3.7       3.88 ± 3.7
from left   Richardson (𝑑 = 3)         18.4 ± 12        16.1 ± 13
from left   exponential (𝑎 = 0.25)     3.17 ± 2.1       2.19 ± 2.0
from left   adapt. exp. (𝑎 = 0.25)     1.43 ± 1.1       3.08 ± 3.6
Table 4.2: Average of 20 different two-qubit randomized benchmarking circuits with mean depth
27. The percent mean absolute error from the exact value of 1 is reported for a depolarizing noise
with 𝑝 = 1% and an amplitude damping channel with 𝛾 = 0.01. For all non-adaptive methods
we used 𝜆 = {1, 1.5, 2, 2.5}. Adaptive extrapolation was iterated up to 4 scale factors. All the
results reported in this table are obtained with exact density matrix simulations. The best result for
each noise model is highlighted with a bold font, while errors larger than the unmitigated one are
italicized.
4.1.3.6 Benchmark comparisons of ZNE methods
Benchmarks comparing the performance of ZNE methods are given in Table 4.2. In all cases
except Richardson extrapolation, ZNE improves on the unmitigated value, although the
performance varies significantly. Furthermore, no single scaling or extrapolation method strictly
dominates the others.
Different extrapolation methods are compared on IBMQ’s London superconducting quantum
processor in Fig. 4.5. Here random gate folding scales the noise of 50 different two-qubit randomized
benchmarking circuits. The ideal expectation value for all circuits is 1. The order-2 polynomial fit
and the exponential fit outperform Richardson extrapolation. In fact, Fig. 4.5 shows the expectation
value for Richardson extrapolation when only the first 3 data points are considered. Instability in
[Plot: ⟨0|𝜌|0⟩ versus noise scaling, with extrapolations labeled ZNE Linear, ZNE Trunc. Richardson, ZNE Poly order 2, and ZNE Exponential.]
Figure 4.5: Comparison of extrapolation methods averaged over 50 two-qubit randomized bench-
marking circuits executed on IBMQ’s “London” five-qubit chip. The circuits had, on average, 97
single qubit gates and 17 two-qubit gates. The true zero-noise value is ⟨0|𝜌|0⟩ = 1 and different
markers show extrapolated values from different fitting techniques.
the Richardson extrapolation for more points, as described in Section 4.1.3.3, causes nonphysical
results when applied to all the measured data. This is an example in which vanilla Richardson
extrapolation is not sufficient to provide stable results.
4.1.4 Adaptive zero noise extrapolation
In Section 4.1.3, we considered only non-adaptive extrapolation methods. However, in order to
reduce the computational overhead, we can choose the scale factors and the number of samples in
an adaptive way as described in Algorithm 6.
Unlike the non-adaptive case, in this adaptive procedure (Alg. 6) the measured scale
factors λ are not necessarily monotonically increasing. Indeed, in the adaptive step, 𝜆next can take
any value greater than or equal to 1. In particular, 𝜆next could also be equal to a previous scale
factor 𝜆𝑗 for some 𝑗. In this case, the additional measurement samples 𝑁next will improve the
statistical estimation of 𝐸(𝜆𝑗).
Now, we present an example of adaptive extrapolation which is based on the exponential ansatz
𝐸 exp (𝜆) = 𝑎 + 𝑏𝑒 −𝑐𝜆 that we have already introduced in Eq. (4.28). We also assume that the
asymptotic value 𝑎 is known. This implies that at least two scale factors should be measured to fit
Algorithm 6: Generic adaptive extrapolation
Data: An initial set of 𝑚 noise scale factors λ = {𝜆1, 𝜆2, . . . 𝜆𝑚}, with 𝜆𝑗 ≥ 1, 𝑚 sample
numbers N = (𝑁1, 𝑁2, . . . 𝑁𝑚) and a maximum number of total samples 𝑁max.
Result: A mitigated expectation value
begin
    /* Initialization */
    y ← ∅;
    for 𝜆𝑗 ∈ λ do
        𝑦𝑗 ← ComputeExpectation(𝜆𝑗, 𝑁𝑗);
        Append(y, 𝑦𝑗);
    /* Adaptive loop */
    𝑁used ← 0;
    while 𝑁used < 𝑁max do
        𝚪∗ ← BestFit(𝐸model(𝜆; 𝚪), (λ, y));
        𝜆next ← NewScale(𝚪∗, λ, y);
        𝑁next ← NewNumSamples(𝚪∗, λ, y);
        𝑦next ← ComputeExpectation(𝜆next, 𝑁next);
        Append(λ, 𝜆next);
        Append(y, 𝑦next);
        𝑁used ← 𝑁used + 𝑁next;
    return 𝐸model(0; 𝚪∗);
the parameters 𝑏 and 𝑐. We first consider this particular case and then generalize the method to
an arbitrary number of scale factors, which will be chosen in an adaptive way.
4.1.4.1 Exponential extrapolation with two scale factors
We assume only two scale factors 𝜆1 and 𝜆2 (typically, 𝜆1 is 1). As discussed in Section 4.1.3, we
can estimate the corresponding expectation values, 𝐸(𝜆1) and 𝐸(𝜆2), with statistical variances
𝜎₁² = 𝜎₀²/𝑁1 and 𝜎₂² = 𝜎₀²/𝑁2, respectively. Here, we are implicitly assuming that the single-shot
variance 𝜎₀² is independent of 𝜆, such that the estimation precision is determined only by the
numbers of samples 𝑁1 and 𝑁2. The measurement process will produce two results 𝑦1 and 𝑦2, whose
statistical distribution is given by Eq. (4.11).
Since the parameter 𝑎 is known, we can use the points (𝜆 1 , 𝑦 1 ) and (𝜆 2 , 𝑦 2 ) to estimate 𝑏 and
𝑐 of Eq. (4.28). The two estimators 𝑏ˆ and 𝑐ˆ can be determined by the unique ansatz interpolating
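the two points, whose parameters are:
\hat{c} = \frac{1}{\lambda_2 - \lambda_1} \log\left( \frac{y_1 - a}{y_2 - a} \right), (4.29)
\hat{b} = (y_1 - a)^{\frac{\lambda_2}{\lambda_2 - \lambda_1}} (y_2 - a)^{-\frac{\lambda_1}{\lambda_2 - \lambda_1}}. (4.30)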
The corresponding estimator for the zero-noise limit is 𝐸̂exp(0) = 𝑎 + 𝑏̂ where, since 𝑎 is known,
the error is only due to the statistical noise of 𝑏̂.
This estimator depends on the empirical variables 𝑦1, 𝑦2, with statistical variances 𝜎₁² = 𝜎₀²/𝑁1
and 𝜎₂² = 𝜎₀²/𝑁2, respectively. Such measurement errors will propagate to the estimator 𝑏̂. To
leading order in 𝜎₁² and 𝜎₂², we have:
\mathrm{MSE}(\hat{b}) = \left( \frac{\partial \hat{b}}{\partial y_1} \right)^2 \sigma_1^2 + \left( \frac{\partial \hat{b}}{\partial y_2} \right)^2 \sigma_2^2. (4.31)
The explicit evaluation of Eq. (4.31) yields:
\mathrm{MSE}(\hat{b}) = \frac{\sigma_0^2}{(\lambda_2 - \lambda_1)^2} \left[ \frac{\lambda_2^2 e^{2c\lambda_1}}{N_1} + \frac{\lambda_1^2 e^{2c\lambda_2}}{N_2} \right]. (4.32)
The previous equation shows that the error depends on the choice of the scale factors 𝜆 1 and 𝜆 2 but
also on the associated measurement samples 𝑁1 and 𝑁2 .
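The two-point estimators of Eqs. (4.29)-(4.30) and the leading-order error of Eq. (4.32) can be evaluated directly. The sketch below is illustrative only: the function names and the numerical values are assumptions, and the code assumes that both 𝑦1 − 𝑎 and 𝑦2 − 𝑎 are positive so that the logarithm and the fractional powers are real.

```python
import math


def two_point_exponential(lam1, y1, lam2, y2, a):
    """Return (b_hat, c_hat, zero-noise estimate a + b_hat) from Eqs. (4.29)-(4.30)."""
    delta = lam2 - lam1
    c_hat = math.log((y1 - a) / (y2 - a)) / delta
    b_hat = (y1 - a) ** (lam2 / delta) * (y2 - a) ** (-lam1 / delta)
    return b_hat, c_hat, a + b_hat


def mse_b(lam1, lam2, c, sigma0, n1, n2):
    """Leading-order mean squared error of b_hat, Eq. (4.32)."""
    delta = lam2 - lam1
    term1 = lam2**2 * math.exp(2 * c * lam1) / n1
    term2 = lam1**2 * math.exp(2 * c * lam2) / n2
    return sigma0**2 / delta**2 * (term1 + term2)


# Hypothetical noiseless example with a = 0.25, b = 0.75, c = 0.3.
a, b, c = 0.25, 0.75, 0.3
lam1, lam2 = 1.0, 2.0
y1 = a + b * math.exp(-c * lam1)
y2 = a + b * math.exp(-c * lam2)
b_hat, c_hat, zero_noise = two_point_exponential(lam1, y1, lam2, y2, a)
print(b_hat, c_hat, zero_noise)
print(mse_b(lam1, lam2, c, sigma0=1.0, n1=1000, n2=1000))
```

With noiseless inputs the estimators recover the parameters used to generate the data; `mse_b` then indicates how strongly shot noise at each scale factor would be amplified in the extrapolated value.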
Error minimization Let us first assume that we have at our disposal only a total budget
𝑁max = 𝑁1 + 𝑁2 of circuit evaluations and that 𝜆1 and 𝜆2 are fixed. Minimizing Eq. (4.32) with
respect to 𝑁1 and 𝑁2, we get:
N_1 = N_{\max} \frac{\lambda_1}{\lambda_1 + \lambda_2 e^{-c(\lambda_2 - \lambda_1)}},
N_2 = N_{\max} \frac{\lambda_2 e^{-c(\lambda_2 - \lambda_1)}}{\lambda_1 + \lambda_2 e^{-c(\lambda_2 - \lambda_1)}}, (4.33)
and the corresponding error becomes:
\mathrm{MSE}(\hat{b}) = \sigma_0^2 \left( \frac{\lambda_2 e^{c\lambda_1} + \lambda_1 e^{c\lambda_2}}{\lambda_2 - \lambda_1} \right)^2. (4.34)
Algorithm 7: Adaptive exponential extrapolation
Data: An exponential model 𝐸exp(𝜆) = 𝑎 + 𝑏 exp(−𝑐𝜆) with a known/estimated 𝑎. A maximum
number of total samples 𝑁max, a fixed number of samples per iteration 𝑁batch and a
minimum scale factor 𝜆1 (typically equal to 1).
Result: A mitigated expectation value
begin
    𝑐 ← 1; /* Initial guess */
    𝛼 ← 1.27846; /* Alpha in Eq. (4.36) */
    data ← ∅;
    𝑁used ← 0;
    while 𝑁used < 𝑁max do
        𝜆2 ← 𝜆1 + 𝛼/𝑐;
        𝑁1 ← 𝑁batch × (𝑐𝜆1/𝛼) / (𝑐𝜆1 + 𝛼 − 1);
        𝑁2 ← 𝑁batch × (1 + 𝑐𝜆1/𝛼)(𝛼 − 1) / (𝑐𝜆1 + 𝛼 − 1);
        𝑁used ← 𝑁used + 𝑁1 + 𝑁2;
        𝑦1 ← ComputeExpectation(𝜆1, 𝑁1);
        𝑦2 ← ComputeExpectation(𝜆2, 𝑁2);
        Append(data, (𝜆1, 𝑦1));
        Append(data, (𝜆2, 𝑦2));
        /* New estimate of c */
        𝑐 ← BestFit(𝐸exp(𝜆; 𝑎, 𝑏, 𝑐), data);
    return 𝐸exp(0; 𝑎, 𝑏, 𝑐);
This error can be further minimized with respect to the choice of the scale factors. Since 𝜆 1 is
usually fixed to 1, we optimize over 𝜆2 , leading to the condition:
e^{c(\lambda_2 - \lambda_1)} \left( c(\lambda_2 - \lambda_1) - 1 \right) - 1 = 0. (4.35)
We can solve the previous equation numerically, obtaining:
𝑐(𝜆2 − 𝜆1 ) = 𝛼, (4.36)
where 𝛼 ≃ 1.27846 is a numerical constant. For a fixed 𝜆1, the previous condition determines the
optimal choice of the scale factor 𝜆2 which minimizes the zero-noise extrapolation error. From a
practical point of view, Eqs. (4.33) and (4.36) can only be used if we have some prior knowledge
about 𝑐. This motivates the following adaptive algorithm.
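The constant 𝛼 of Eq. (4.36) can be reproduced with a few lines of code. The sketch below solves Eq. (4.35) for 𝑥 = 𝑐(𝜆2 − 𝜆1) by bisection; the function names and the bracketing interval [1, 2] are choices made for this example.

```python
import math


def optimal_alpha(tol=1e-12):
    """Solve exp(x) * (x - 1) - 1 = 0, i.e. Eq. (4.35) with x = c * (lambda_2 - lambda_1)."""

    def f(x):
        return math.exp(x) * (x - 1.0) - 1.0

    lo, hi = 1.0, 2.0  # f(1) = -1 < 0 and f(2) = e^2 - 1 > 0 bracket the root
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def optimal_second_scale_factor(lam1, c):
    """Scale factor lambda_2 satisfying Eq. (4.36) for a given decay rate c."""
    return lam1 + optimal_alpha() / c


alpha = optimal_alpha()
print(alpha)
print(optimal_second_scale_factor(lam1=1.0, c=0.5))
```

A faster decay (larger 𝑐) places the second scale factor closer to 𝜆1, which is the dependence on 𝑐 that the adaptive procedure exploits.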
Figure 4.6: Comparison of adaptive and non-adaptive exponential zero noise extrapolation, given
a fixed budget of samples. The adaptive method generally produces a more accurate extrapolation
with fewer samples. In this example, however, the advantage of adaptivity is not particularly
large, likely because the scale factors used for the non-adaptive method are already close to
their optimal values. Data was generated by exact density
matrix simulation of 5-qubit randomized benchmarking circuits of depth 10 under 5% depolarizing
noise and measured in the computational basis. Noise was scaled directly by access to the back-end
simulator rather than with a folding method.
4.1.4.2 An adaptive exponential extrapolation algorithm
Algorithm 7 is an adaptive exponential algorithm based on the exponential ansatz 𝐸 exp (𝜆) =
𝑎 + 𝑏𝑒 −𝑐𝜆 , where 𝑎 is a known constant. Figure 4.6 shows a comparison of adaptive exponential
extrapolation with non-adaptive exponential extrapolation. At almost all sample levels, adaptive
extrapolation outperforms the non-adaptive approach.
4.1.5 Conclusion
We make zero-noise extrapolation digital, developing the unitary folding framework to run error
mitigation with instruction set level access. We then demonstrate improved performance through a
set of non-adaptive and adaptive extrapolation methods. We emphasize that zero-noise extrapolation
is in general an inference problem with many avenues for further optimization.
While ZNE has previously been benchmarked on randomized benchmarking circuits or VQE,
we give benchmarks of ZNE on MAXCUT problems solved with QAOA. This allows us to smoothly
benchmark the performance of ZNE on larger variational quantum circuits than have been
considered previously.
We also consider specialization of zero-noise extrapolation to different noise models, using
calibration noise as an example. With more sophisticated multi-parameter noise models (such as
a combination of calibration noise and amplitude damping), it is likely that multi-dimensional
noise extrapolation [144] will be of interest.
This work is a first step towards viewing zero-noise extrapolation as an inference problem and
has opportunities for extension. Priors or constraints from observable, noise or circuit structure
could be included. Data could be gathered from similar executions over time so that inference
includes a historical database of previous computations.
4.2 Reducing the impact of time-correlated noise on ZNE
4.2.1 Introduction
Zero-noise extrapolation (ZNE) techniques have been primarily investigated under the assumption
that the errors to be mitigated are uncorrelated in time. On the other hand, time-correlated noise
(in particular 1/ 𝑓 𝛼 noise) has been widely observed in physical systems including superconducting
devices [145, 146, 147, 148, 149], quantum dots [150, 151], and spin qubits [152]. To estimate
the noise present in these real physical systems, one can use quantum noise spectroscopy (QNS)
[153, 154, 155] wherein the outcomes of a set of distinct control pulses or circuits are analyzed.
Key to this approach is that while these different probe sequences may in fact represent identical
circuits under ideal conditions, they interact with any noise present in different ways. This can
be understood through the filter function formalism [156, 157] which describes the “frequency
response” of a given probe sequence. Broadly speaking, the impacts of noise (in terms of fidelity)
are approximately proportional to the integral of the power spectrum of the noise with the filter
function of the control. In what follows, we will show how this intuition can also be applied to
different ZNE schemes in the presence of temporally correlated dephasing noise.
The recently developed [158] and experimentally validated [159] Schrödinger wave
autoregressive moving average (SchWARMA) technique provides a natural mechanism for the
exploration of so-called digital ZNE techniques [142, 160, 161] that operate at the gate level in a
quantum circuit. Building on techniques from classical time-series modeling in statistics and signal
processing, SchWARMA was conceived as a highly flexible mechanism for simulating a wide range
of spatiotemporally correlated errors in quantum circuits.
In the following, we first review the SchWARMA modeling and simulation formalism and
its relationship to the filter function formalism. Next, we provide a concise overview of ZNE
and discuss different methods for scaling noise. Next, we show how these different schemes are
impacted by time-correlated dephasing noise despite the fact that they behave equivalently for
uncorrelated noise. We then interpret these noise scaling schemes using the language of filter
functions and show that these results are well described by the intuition provided by the filter
functions. Our findings indicate that, for time-correlated noise, the noise scaling method known as
global unitary folding [160, 162] produces more accurate noise-scaled expectation values and ZNE
results.
4.2.2 Background
4.2.2.1 Time-correlated noise: The SchWARMA model
Consider a single-qubit Hamiltonian
𝐻 (𝑡) = 𝐻𝑧 (𝑡) + 𝐻𝑐 (𝑡) (4.37)
consisting of a semiclassical dephasing noise component 𝐻𝑧 (𝑡) along with a deterministic idealized
control component 𝐻𝑐 (𝑡) corresponding, for example, to the external driving induced by laser
pulses. If we further define 𝐻𝑧(𝑡) = 𝜂(𝑡)𝜎𝑧 with 𝜂(𝑡) a wide-sense stationary Gaussian stochastic
process, we can say that this noise process is not time-correlated if
E[𝜂(𝑡)𝜂(𝑡′)] = E[𝜂(|𝑡 − 𝑡′|)𝜂(0)] = 0 for all 𝑡 ≠ 𝑡′, where E[·] represents the average over many
statistical realizations and 𝜎𝑖, 𝑖 = 𝑥, 𝑦, 𝑧, are the Pauli matrices. Equivalently, we can say that
the noise process is time-correlated if the power spectrum
S_\eta(\omega) = \int_0^\infty dt \, \mathbb{E}[\eta(t)\eta(0)] e^{-i\omega t} (4.38)
is not constant as a function of 𝜔 (i.e., not a “white” process).
In the SchWARMA modeling approach [158], the impact of the continuous time Hamiltonian in
(4.37) is modeled in a quantum circuit formalism by inserting correlated 𝑍-error operators after each
“gate” determined by the control 𝐻𝑐 . This is accomplished by generating a time-correlated sequence
of rotation angles 𝑦 𝑘 defined from independent Gaussian inputs 𝑥 𝑘 using an auto-regressive moving
average (ARMA) model [163, 164],
y_k = \underbrace{\sum_{i=1}^{p} a_i y_{k-i}}_{\mathrm{AR}} + \underbrace{\sum_{j=0}^{q} b_j x_{k-j}}_{\mathrm{MA}}, (4.39)
where the set {𝑎𝑖 } defines the autoregressive portion of the model, and {𝑏 𝑗 } the moving average
portion with 𝑝 and 𝑞 + 1 elements of each set respectively. The time correlations are defined via
the resulting power spectrum
S_y(\omega) = \frac{\left| \sum_{k=0}^{q} b_k \exp(-ik\omega) \right|^2}{\left| 1 + \sum_{k=1}^{p} a_k \exp(-ik\omega) \right|^2}, (4.40)
and ARMA models can approximate any discrete-time power spectrum to arbitrary accuracy [165].
For the scope of this work we focus on the four paradigmatic noise spectra shown in Fig. 4.7,
namely: white noise, low-pass noise, 1/ 𝑓 noise and 1/ 𝑓 2 noise.
Dividing the circuit trajectory defined by 𝐻𝑐 (𝑡) into consecutive gates 𝐺 𝑘 , the SchWARMA
approach models the impact of correlated noise 𝐻𝑧 (𝑡) by adding in a random 𝑍 (𝜃 𝑘 ) = exp(𝑖𝑦 𝑘 𝜎 𝑧 )
after each gate, which can then be Monte Carlo averaged to produce an expectation value. This
model can be extended to multi-qubit Hamiltonians
H(t) = \sum_{j=1}^{n} \eta_j(t) \sigma_j^z + H_c(t), (4.41)
by generating independent, yet identically defined, SchWARMA-generated errors on each qubit.
In principle, these could of course be heterogeneous and correlated between qubits.
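To make the construction concrete, the following sketch generates a correlated angle sequence from the recursion in Eq. (4.39) and builds the corresponding 𝑍-error operators. It is a plain NumPy illustration and does not use the Mezze library; the function names, the coefficient values, and the choice to treat terms before 𝑘 = 0 as zero are assumptions of this example. The spectrum helper covers only the pure moving-average case, in which Eq. (4.40) reduces to its numerator.

```python
import numpy as np


def arma_sequence(a, b, num_steps, rng):
    """Return (x, y) with y_k from Eq. (4.39); terms with negative index are taken as zero."""
    x = rng.standard_normal(num_steps)  # independent Gaussian inputs x_k
    y = np.zeros(num_steps)
    for k in range(num_steps):
        ar = sum(a[i - 1] * y[k - i] for i in range(1, len(a) + 1) if k - i >= 0)
        ma = sum(b[j] * x[k - j] for j in range(len(b)) if k - j >= 0)
        y[k] = ar + ma
    return x, y


def ma_power_spectrum(b, omega):
    """|sum_k b_k exp(-i k omega)|^2: Eq. (4.40) when there are no AR coefficients."""
    k = np.arange(len(b))
    return np.abs(np.sum(np.asarray(b, dtype=float) * np.exp(-1j * k * omega))) ** 2


def z_error(angle):
    """Z-error operator exp(i * angle * sigma_z) inserted after a gate."""
    return np.diag([np.exp(1j * angle), np.exp(-1j * angle)])


rng = np.random.default_rng(seed=0)
x, y = arma_sequence(a=[0.5], b=[0.5, 0.5], num_steps=8, rng=rng)
errors = [z_error(angle) for angle in y]  # one error operator per gate
print(y)
print(ma_power_spectrum([1.0], 0.3), ma_power_spectrum([0.5, 0.5], np.pi))
```

A single coefficient `b = [1.0]` gives a flat (white) spectrum, while the two-tap average `b = [0.5, 0.5]` suppresses high frequencies, which is the simplest example of a low-pass noise model.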
4.2.2.2 Zero-noise extrapolation with colored noise
Zero-noise extrapolation (ZNE) is a heuristic error mitigation technique which relies on the ability
to increase the noise in a quantum circuit [166, 167, 168]. Like other error mitigation techniques,
the target is to estimate an expectation value
𝐸 (𝜆) := Tr[𝜌(𝜆)𝑂] (4.42)
at zero noise. The noise scale factor 𝜆 dictates how much the base noise level 𝜆 = 1 is scaled in
the quantum circuit which prepares the system density matrix 𝜌, and 𝑂 is a problem-dependent
observable. The key insight of ZNE is to (i) evaluate 𝐸 (𝜆) at several noise scale factors 𝜆 ≥ 1, then
(ii) fit a statistical model to the collected data and infer the zero-noise value 𝐸 (𝜆 → 0). We refer
to these two steps as noise scaling and inference, respectively.
Compared to other error mitigation techniques, zero-noise extrapolation requires very few
additional quantum resources. Correspondingly, it has received attention in recent literature:
it was implemented in Refs. [169, 142, 160, 162, 170, 171] and, in Ref. [172], on twenty-six
superconducting qubits to produce results competitive with classical approximation techniques.
References [142, 160, 161] formally introduced digital noise scaling, in which noise is scaled at a
gate-level without pulse-level control.
[Plot: noise spectrum S(𝜔) versus normalized frequency on logarithmic axes, with curves labeled white, low-pass, 1/f, and 1/f².]
Figure 4.7: Noise power spectrum of four different dephasing SchWARMA noise models corre-
sponding to white noise, low-pass noise, 1/ 𝑓 noise and 1/ 𝑓 2 noise. These noise models are used
in Sec. 4.2.3 to test the effect of time-correlated noise on zero-noise extrapolation.
Figure 4.8: A sample three-qubit circuit with four gates under the action of three digital noise
scaling methods we consider in this work. (a) Local folding, in which each gate 𝐺 gets mapped to
𝐺 ↦ 𝐺(𝐺†𝐺)^𝑛 for scale factor 𝜆 = 2𝑛 + 1. (b) Global folding, in which the entire circuit 𝐶 gets
mapped to 𝐶 ↦ 𝐶(𝐶†𝐶)^𝑛. In (a) and (b), grey shading shows the “virtual gates” which logically
compile to identity. (c) Gate Trotterization, in which 𝐺 ↦ (𝐺^{1/𝜆})^𝜆 for each gate 𝐺.
While ZNE is straightforward to implement and requires relatively few additional quantum
resources, it is nonetheless a heuristic method. The quality of the solution depends critically on
both the inference and noise-scaling method. In this work, we fix the inference method by assuming
a particular noise model and focus on the effects of the noise-scaling method.
4.2.2.3 Noise scaling methods
Ideal noise scaling In a purely theoretical setting, the ideal way of scaling the noise would be to
multiply the Hamiltonian 𝐻𝑧 in Eq. (4.37) by a constant √𝜆:
H'(t) = \sqrt{\lambda} H_z(t) + H_c(t). (4.43)
Equivalently, the scale factor can be absorbed into a redefinition of the stochastic noise amplitude:
𝜂′(𝑡) = √𝜆 𝜂(𝑡). From Eq. (4.40), it is evident that the noise power spectrum gets scaled by 𝜆,
S_{\eta'}(\omega) = S_{\sqrt{\lambda}\eta}(\omega) = \lambda S_\eta(\omega). (4.44)
If one could directly control the noise, this would be the ideal way of scaling its power and,
therefore, the ideal way of applying zero-noise extrapolation. In a typical experimental scenario,
of course, one cannot directly control the noise of a quantum device. For this reason, several
indirect noise scaling techniques have been proposed and applied in recent literature. We define
several of these in the following subsections (see Fig. 4.8 for an overview) in order to analyze their
performance in the presence of time-correlated noise in Sec. 4.2.3.
Pulse stretching The intent of pulse stretching is to scale the impacts of the noise on the system
by “stretching” the underlying control Hamiltonian, replacing (4.37) with
H(t) = H_z(t) + \frac{1}{\lambda} H_c(t/\lambda), (4.45)
for some dimensionless time-scaling factor 𝜆. In principle, this scales the impacts of the noise by
increasing the overall time duration of the circuit. More precisely, if we define 𝑡 ′ = 𝑡/𝜆, the density
operator 𝜌(𝑡 ′) of the system evolves with respect to the effective Hamiltonian:
𝐻 ′ (𝑡 ′) = 𝜆 𝐻𝑧 (𝜆𝑡 ′) + 𝐻𝑐 (𝑡 ′) . (4.46)
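The corresponding noise power spectrum is:
S_{\eta'}(\omega) = \lambda^2 \int_0^\infty dt' \, \mathbb{E}[\eta(\lambda t')\eta(0)] e^{-i\omega t'} = \lambda \int_0^\infty dt \, \mathbb{E}[\eta(t)\eta(0)] e^{-i\omega t/\lambda} = \lambda S_\eta(\omega/\lambda). (4.47)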
From the equation above, it is evident that for a white (constant) spectrum, pulse stretching can be
used to effectively scale the noise power by 𝜆 as in the ideal case defined in Eq. (4.44). In fact, the
equivalence between the ideal noise scaling and the pulse-stretching technique was already shown
in Ref. [166], under the hypothesis of a quantum state 𝜌 evolving according to a master equation
with a time-independent noise operator acting as L (𝜌) (more details about the consistency between
our findings with the results of Ref. [166] are given in Section 4.2.6). On the other hand, Eq. (4.47)
shows that, for a colored spectrum, pulse-stretching does not exactly reproduce the ideal noise
scaling defined in (4.44). Indeed, on the r.h.s. of Eq. (4.47) we observe that the original spectrum is
also stretched with respect to the frequency variable 𝜔. This fact is a manifestation of the intuitive
idea that slowing down the dynamics of the system corresponds to effectively speeding up the time
scale of the environment. Such frequency stretching, while irrelevant in the white noise limit,
becomes relevant for time-correlated noise.
In the SchWARMA formalism, there is not a mechanism for stretching pulses per se as it
operates at the gate level in a circuit (without pulse-level control on 𝐻𝑐 (𝑡)). However, as discussed
in the supplement to [158], it is possible to manipulate and stretch the spectrum of a SchWARMA
model. So, for the task of numerically simulating pulse stretching, instead of implementing
Eq. (4.46) one can simply implement Eq. (4.47) by directly transforming the spectrum of
the SchWARMA model.
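The difference between Eq. (4.44) and Eq. (4.47) can be seen with two toy spectra. In the sketch below the spectra are plain Python functions and the names `ideal_scaling`, `pulse_stretching`, `white`, and `pink` are hypothetical; this does not reproduce how a SchWARMA model is transformed in practice, only the algebra of the two equations.

```python
def ideal_scaling(spectrum, lam):
    """Ideal noise scaling of Eq. (4.44): S'(omega) = lam * S(omega)."""
    return lambda omega: lam * spectrum(omega)


def pulse_stretching(spectrum, lam):
    """Pulse-stretched spectrum of Eq. (4.47): S'(omega) = lam * S(omega / lam)."""
    return lambda omega: lam * spectrum(omega / lam)


def white(omega):
    """Flat spectrum."""
    return 1.0


def pink(omega):
    """1/f-type spectrum (undefined at omega = 0)."""
    return 1.0 / abs(omega)


lam, omega = 3.0, 0.5
for name, spectrum in (("white", white), ("pink", pink)):
    ideal = ideal_scaling(spectrum, lam)(omega)
    stretched = pulse_stretching(spectrum, lam)(omega)
    print(name, ideal, stretched, stretched / ideal)
```

For the flat spectrum the two transformations coincide, whereas for the 1/𝑓-type spectrum pulse stretching overshoots the ideal scaling by an extra factor of 𝜆, reflecting the frequency stretching discussed above.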
Local unitary folding A possible way of effectively increasing the noise of a circuit is to insert,
after each noisy CNOT gate, the product of two additional CNOT gates [142, 161]. In this way the
ideal unitary is not changed, but the real dynamics is noisier. More generally, in Sec. 4.1 we
introduced several digital noise scaling methods that are based on the unitary folding replacement
rule
𝐺 → 𝐺 (𝐺 † 𝐺) 𝑛 , 𝑛 = 0, 1, 2, . . . , (4.48)
where 𝐺 is a unitary operation associated to an individual gate. If noise is absent, the replacement
rule leaves the operation unchanged since 𝐺 † 𝐺 is equal to the identity. On the contrary, if some
base noise is associated to 𝐺, the unitary folding operation approximately scales the noise by an
odd integer factor 𝜆 = 1 + 2𝑛.
More precisely, by applying the unitary folding replacement to all the gates of an input circuit
𝑈 = 𝐺 𝑑 𝐺 𝑑−1 . . . 𝐺 1 (4.49)
which is composed of 𝑑 gates 𝐺𝑗, we obtain a new circuit 𝑈′ of depth 𝑑′ = (1 + 2𝑛)𝑑 given by
𝑈 ′ = 𝐺 𝑑 (𝐺 †𝑑 𝐺 𝑑 ) 𝑛 𝐺 𝑑−1 (𝐺 †𝑑−1 𝐺 𝑑−1 ) 𝑛 . . . 𝐺 1 (𝐺 †1 𝐺 1 ) 𝑛 . (4.50)
The depth of the new circuit 𝑈 ′ is scaled by 𝜆 = 𝑑 ′/𝑑 = 1+2𝑛 and, similarly, any type of noise which
depends on the total number of gates will be effectively scaled by the same constant 𝜆. In Sec. 4.1,
partial folding methods were proposed to obtain arbitrary real values of 𝜆, but for simplicity in this
work we only consider odd-integer scale factors. We refer to (4.50) as local unitary folding.
Global unitary folding Instead of locally folding all the gates, we can apply Eq. (4.48) to the
entire circuit. In this way, the circuit 𝑈 defined in Eq. (4.49) is simply mapped to
𝑈 ′ = 𝑈 (𝑈 †𝑈) 𝑛 . (4.51)
Also in this case the total number of gates of the new circuit 𝑈 ′ is multiplied by 𝜆 = 𝑑 ′/𝑑 = 1 + 2𝑛
corresponding to an effective scaling of the noise.
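The two folding rules of Eqs. (4.50) and (4.51) differ only in the order in which gates and their inverses appear. The following sketch makes this explicit with a toy circuit representation, a list of gate labels in the order they are applied; the representation, the "†" suffix convention, and the function names are assumptions of this example and not the interface of any particular library.

```python
def inverse(gate):
    """Label of the inverse gate: append or strip a trailing dagger."""
    return gate[:-1] if gate.endswith("†") else gate + "†"


def inverse_circuit(circuit):
    """Inverse of a circuit: reversed order with every gate inverted."""
    return [inverse(gate) for gate in reversed(circuit)]


def local_fold(circuit, n):
    """Replace every gate G by G (G† G)^n, as in Eq. (4.50)."""
    folded = []
    for gate in circuit:
        folded.append(gate)
        folded.extend([inverse(gate), gate] * n)
    return folded


def global_fold(circuit, n):
    """Replace the whole circuit U by U (U† U)^n, as in Eq. (4.51)."""
    return list(circuit) + (inverse_circuit(circuit) + list(circuit)) * n


circuit = ["A", "B"]  # gates listed in order of application
print(local_fold(circuit, 1))
print(global_fold(circuit, 1))
```

Both outputs contain (1 + 2𝑛) times as many gates as the input, but local folding keeps each gate next to its own inverse, while global folding separates them by the rest of the circuit. This difference in temporal structure is what matters for time-correlated noise.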
Gate Trotterization In this work we also introduce another local noise-scaling method, acting
at the level of individual gates, that we call gate Trotterization since it can be considered as a
discretization of the continuous pulse-stretching technique. According to the gate Trotterization
technique, each gate of the circuit is replaced as follows:
G \to \left( G^{1/\lambda} \right)^{\lambda}, \qquad \lambda = 1, 2, 3, \ldots. (4.52)
For example, a Pauli 𝑋 rotation gate 𝑅 𝑋 (𝜃) is replaced by 𝜆 applications of 𝑅 𝑋 (𝜃/𝜆). Eq. (4.52)
is similar to the local version of the unitary folding rule (4.48) and, indeed, both methods replace
a single gate with the product of 𝜆 gates. Compared to Eq. (4.48), the Trotter-like decomposition
used in Eq. (4.52) is more uniform since equal elementary gates are used. On the other hand, a
possible drawback of the gate Trotterization method is that 𝐺 1/𝜆 may be compiled by the hardware
in different ways depending on 𝜆 and, therefore, the circuit depth may not get scaled as expected.
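For a rotation gate the replacement in Eq. (4.52) is easy to verify numerically. The sketch below uses the standard matrix form of 𝑅𝑋(𝜃) and checks that 𝜆 applications of 𝑅𝑋(𝜃/𝜆) reproduce the original gate; the function names are chosen for this example only.

```python
from functools import reduce

import numpy as np


def rx(theta):
    """Matrix of the Pauli-X rotation R_X(theta) = exp(-i * theta * X / 2)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])


def trotterize_rx(theta, lam):
    """Replace R_X(theta) by lam applications of R_X(theta / lam), as in Eq. (4.52)."""
    if lam < 1:
        raise ValueError("lam must be a positive integer")
    return [rx(theta / lam) for _ in range(lam)]


pieces = trotterize_rx(theta=np.pi / 3, lam=5)
recombined = reduce(np.matmul, pieces)
print(len(pieces), np.allclose(recombined, rx(np.pi / 3)))
```

In the absence of noise the product equals the original gate, so any change in the measured expectation value comes from the noise accumulated over the additional gate applications.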
4.2.3 Results
In the previous section, we defined several noise-scaling methods that can be used in zero-noise
extrapolation. In this section, we study how these different methods affect the performance of
ZNE in the presence of time-correlated noise. For all the simulations presented in this section
we used the following Python libraries: Mezze [158] for modeling SchWARMA noise, Mezze’s
[Plot, two panels: expectation value E(λ) versus noise scale factor λ (1, 3, 5, 7, 9), showing noise-scaled points and ZNE fits for True, Pulse, Global, Local, and Trotter noise scaling.]
Figure 4.9: Comparison of different zero-noise extrapolations obtained with different noise scaling
methods. We consider a single-qubit randomized benchmarking circuit affected by dephasing
noise of fixed integrated power. The two subfigures correspond to different noise spectra: (top)
white noise, (bottom) 1/ 𝑓 pink noise. Both spectra are shown in Fig. 4.7. The expectation value
𝐸 (𝜆) = t𝑟 (𝑂 𝜌(𝜆)) is associated to the observable 𝑂 = |0⟩⟨0| measured with respect to the noise-
scaled quantum state 𝜌(𝜆). The colored squares represent the noise-scaled expectation values;
the dotted lines represent the associated exponential fitting curves; the colored stars represent the
corresponding zero-noise extrapolations. The figure shows that the zero-noise limit obtained with
global unitary folding (green star) is relatively close to the ideal result (gray star) even in the
presence of strong time correlations in the noise.
TensorFlow Quantum [173] interface for simulating quantum circuits and Mitiq [162] for applying
unitary folding and zero-noise extrapolation.
4.2.3.1 Zero-noise extrapolation with colored noise
In this section we numerically simulate a simple ZNE experiment with different noise scaling
methods and with different noise spectra. The results are reported in Fig. 4.9 and demonstrate the
Figure 4.10: Average relative errors in noise scaling two-qubit randomized benchmarking circuits
with (a) white noise, (b) lowpass noise, (c) 1/ 𝑓 noise, and (d) 1/ 𝑓 2 noise. Panel (a) shows no
significant difference in scaling methods under white noise (no time correlations). (Inset shows
zoomed vertical scale.) Panels (b)-(d) show that global scaling is the lowest-error digital scaling
method. The two-qubit randomized benchmarking circuits used here have, on average, 27 single-
qubit gates and five two-qubit gates. For each circuit execution, 3000 samples were taken to estimate
the probability of the ground state as the observable. Points show the average results over fifty such
circuits and error bars show one standard deviation.
detrimental effect of time-correlated noise on ZNE. In Fig. 4.9(a) the noise spectrum is white and
all noise scaling methods produce nearly identical expectation values. Correspondingly, the zero-
noise limits (marked with stars in the plot) are nearly identical. On the other hand, in Fig. 4.9(b),
the noise is colored (a 1/ 𝑓 “pink” spectrum) and different noise-scaling methods produce different
expectation values. Correspondingly, the zero-noise limits (marked with stars in the plot) are also
different. This is the main qualitative result that this work aims to highlight: compared to white
noise, time-correlated noise can be much harder to mitigate via zero-noise extrapolation.
In the rest of this section, we study this aspect in a more quantitative way. In particular we
study the performances of different noise-scaling methods for different types of noise spectra and
different types of circuits.
4.2.3.2 Comparing noise scaling methods
Observing Fig. 4.9(b) we notice that, at least for the particular circuit considered in the example,
some noise scaling methods perform better than others in the presence of time-correlated noise.
In particular the extrapolation based on the global folding technique produces a relatively good
approximation of the ideal result even in the presence of time-correlated noise.
To better investigate this phenomenon, we consider the relative noise-scaling error
\[ \Delta(\lambda) := \frac{E(\lambda) - E^{*}(\lambda)}{E^{*}(\lambda)}, \tag{4.53} \]
as a figure of merit. Here, 𝐸 (𝜆) is the expectation value of interest evaluated with some particular
noise scaling method and scale factor 𝜆, and 𝐸 ∗ (𝜆) is the expectation value simulated with a noise
spectrum ideally scaled according to Eq. (4.44). In Fig. 4.10 we plot the relative error defined
in Eq. (4.53) for each noise-scaling method, after averaging the results over multiple instances
of two-qubit randomized-benchmarking circuits. Here the expectation value of the observable
𝑂 = |00⟩⟨00| is considered. The results of Fig. 4.10 are consistent with those of Fig. 4.9 discussed
in the previous subsection. In fact, even after averaging over multiple random circuits, we observe
that in the presence of white noise all noise scaling methods are practically equivalent to each other
and are characterized by a small relative noise-scaling error. For all colored noise spectra, by contrast,
global folding has the smallest relative noise-scaling error among the noise scaling methods compared.
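As a minimal sketch of how the figure of merit in Eq. (4.53) can be evaluated and averaged over circuits, consider the following. The array values are hypothetical placeholders, not data from the experiments, and the function name is an assumption introduced for illustration.

```python
import numpy as np

def relative_scaling_error(e_method, e_ideal):
    """Eq. (4.53): (E(lambda) - E*(lambda)) / E*(lambda), elementwise."""
    e_method = np.asarray(e_method, dtype=float)
    e_ideal = np.asarray(e_ideal, dtype=float)
    return (e_method - e_ideal) / e_ideal

# Hypothetical expectation values: rows are circuits, columns are scale factors.
scale_factors = [1, 3, 5]
e_ideal = np.array([[0.90, 0.75, 0.62], [0.88, 0.70, 0.58]])
e_global = np.array([[0.90, 0.74, 0.60], [0.88, 0.69, 0.57]])

delta = relative_scaling_error(e_global, e_ideal)
mean_delta = delta.mean(axis=0)  # average over circuits, one value per scale factor
std_delta = delta.std(axis=0)    # spread over circuits, one value per scale factor
```

Averaging along the circuit axis gives one point per scale factor, and the standard deviation along the same axis gives the corresponding spread.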
We repeat the same experiments using mirror circuits [174] and QAOA-like circuits instead of
RB circuits. The former provides another type of randomized circuit structure used for benchmark-
ing, and the latter provides a structured circuit. Fig. 4.11 shows the results using two-qubit mirror
circuits. These circuits have 26 single-qubit gates and eight two-qubit gates on average. As with
the randomized benchmarking circuits, 3000 samples were taken when executing each circuit to
estimate the probability of sampling the correct bitstring. As shown in Fig. 4.11, the conclusion
that global unitary folding most closely matches true noise scaling holds on average for mirror
circuits as well. These results were averaged over fifty random mirror circuits.
Fig. 4.12 shows the same experiment using QAOA circuits. These 𝑛 = 2 qubit circuits have
𝑝 = 2 QAOA rounds using the standard mixer Hamiltonian $H_M = \sum_{i=1}^{n} X_i$ and driver Hamiltonian
$H_C = \sum_{ij} Z_i Z_j$. Denoting this circuit as 𝑈, we append 𝑈† such that the final noiseless state is
|00⟩ independent of the randomly chosen angles 𝛽 and 𝛾. A total of fifty circuits with random
angles were simulated for the final results, again using 3000 samples to estimate the ground state
probability for each circuit execution. The results in Fig. 4.12 have the highest variance of the three
circuit types, but on average we still see that global unitary folding is closest to true noise scaling
out of all scaling methods considered.
The conclusions of this subsection suggest that, even for different types of circuits, the effect of
time-correlated noise on noise scaling methods is qualitatively similar. This intuition is consistent
with the theoretical discussion presented in the next section, in which the performances of noise
scaling methods are linked to their effective frequency modulation effects.
We emphasize that the comparison considered in this work is focused on one particular figure
of merit: the robustness of a noise scaling method with respect to time-correlated noise. Our
results suggest that global folding outperforms the other methods considered with respect to this
specific figure of merit. In a real-world scenario, the optimal noise-scaling method should be
determined according to a more general cost-benefit analysis, e.g. taking into account the sampling
cost, coherence time, and other hardware limitations. For instance, it may not be possible to use
global noise scaling if the circuit length is comparable to the coherence time of the computer;
in such circumstances, pulse stretching can amplify errors via small scale factors [172], although
potentially inaccurately in the presence of time-correlated noise as we have shown in this section.
4.2.4 Discussion and physical interpretation
4.2.4.1 Frequency response of a circuit
The impacts of time-correlated dephasing noise can be interpreted using the filter function formalism
[156, 157]. In a single qubit scenario, we can associate with a given deterministic 𝐻𝑐 (𝑡) (or gate
sequence 𝐺 𝑘 ) a frequency response 𝐹𝑧 (𝜔) that relates the expected reduction in fidelity due to the
noise as a decay exp(−𝜒) where 𝜒 is defined via the “overlap integral,”
\[ \chi = \int_{-\infty}^{\infty} \frac{d\omega}{2\pi}\, S_\eta(\omega) F_z(\omega) F_z(\omega)^{\dagger}. \tag{4.54} \]
The overlap integral can be used to derive approximations to noise-averaged observables, via
𝐸 [Tr[𝜌 𝑂]] ≈ 𝐴 + 𝐵 exp(−𝜒) . (4.55)
Figure 4.11: Relative errors in noise scaling two-qubit mirror circuits with (a) white noise, (b)
lowpass noise, (c) 1/ 𝑓 noise, and (d) 1/ 𝑓 2 noise. Panel (a) shows no significant difference in
scaling methods under white noise (no time correlations). (Inset shows zoomed vertical scale.)
Panels (b), (c) and (d) show global scaling is optimal with time-correlated noise. The two-qubit
mirror benchmarking circuits used here have, on average, 26 single-qubit gates and eight two-qubit
gates. For each circuit execution, 3000 samples were taken to estimate the probability of the correct
bitstring (defined by the particular mirror circuit instance) as the observable. Points show the
average results over fifty such circuits and error bars show one standard deviation.
Figure 4.12: Relative errors in noise scaling two-qubit 𝑝 = 2 QAOA circuits with (a) white noise,
(b) lowpass noise, (c) 1/ 𝑓 noise, and (d) 1/ 𝑓 2 noise. Panel (a) shows no significant difference in
scaling methods under white noise (no time correlations). (Inset shows zoomed vertical scale.)
Panels (b), (c) and (d) show global scaling is optimal with time-correlated noise. The two-qubit
𝑝 = 2 QAOA circuits used here have eight single-qubit gates and four two-qubit gates. For each
circuit execution, 3000 samples were taken to estimate the probability of the ground state as the
observable. (Note that the QAOA circuit 𝑈 is echoed such that the total circuit is 𝑈𝑈 † = 𝐼 without
noise.) Points show the average results over fifty such circuits and error bars show one standard
deviation.
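For multiqubit circuits, the overlap integral becomes a more complicated expression involving the
second cumulant $\mathcal{C}_O^{(2)}(T)$,
\[ \frac{\mathcal{C}_O^{(2)}(T)}{2} = \mathrm{Re} \sum_{\alpha,\beta,\alpha',\beta'} \int_{0}^{\infty} \frac{d\omega}{2\pi}\, S_{\alpha,\alpha'}(\omega)\, \mathcal{F}_{\alpha\beta,\alpha'\beta'}(\omega, T)\, \mathcal{A}_{\beta\beta'}, \tag{4.56} \]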
where the overlaps between the noise power spectrum $S_{\alpha,\alpha'}$ and filter functions $\mathcal{F}_{\alpha\beta,\alpha'\beta'}$ scale
operators $\mathcal{A}_{\beta\beta'}$. This expression captures potential cross correlations in noise, but here $S_{\alpha,\alpha'} = 0$
when 𝛼 ≠ 𝛼′ and 𝛼 is not a 𝜎𝑧 operator on a given qubit. Furthermore, for the examples below we
compute the filter functions using instantaneous gates as specified by a circuit, but these expressions
can hold for piecewise constant controls to accommodate pulse shaping. In the context of noise
scaling experiments, Eq. (4.56) provides a mechanism for understanding how the different noise
scaling techniques impact the resulting scaled expectations and thus the extrapolation process.
4.2.4.2 Spectral analysis of noise scaling methods
Using the filter function prediction from Eq. (4.55) we have that direct noise scaling produces states
𝜌 𝑑𝑖𝑟 (𝜆) with expectation
𝐸 [Tr[𝜌 𝑑𝑖𝑟 (𝜆) 𝑂]] ≈ 𝐴 + 𝐵 exp(−𝜆 𝜒1 ) , (4.57)
where 𝜒1 is the overlap integral of the base circuit. Similarly, following Eq. (4.46), we have that
pulse stretching produces the expectation
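\[ E[\mathrm{Tr}[\rho_{pul}(\lambda)\, O]] \approx A + B \exp\left( -\lambda \int_{-\infty}^{\infty} \frac{d\omega}{2\pi}\, S_\eta(\omega/\lambda) F_z(\omega) F_z(\omega)^{\dagger} \right), \tag{4.58} \]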
with similar expressions for Eq. (4.56), which is clearly not equal to Eq. (4.57) in general. Equiv-
alently, stretching the pulse amounts to “compressing” a filter function response by a factor of 𝜆,
which shifts the filter function to lower frequencies, and thus the overlap with low-frequency noise
will likely increase by a factor greater than 𝜆. An example of the impact of pulse stretching on
a sample filter function is shown in Fig. 4.13a. Gate Trotterization is similar in spirit to pulse
[Figure 4.13 panels: (a) Pulse, (b) Global, (c) Local, (d) Trotter. Each panel plots the normalized filter function (vertical axis, 0.0 to 1.4) against normalized frequency 𝜔 (horizontal axis, 0 to π) for 𝜆 = 1, 3, 5.]
Figure 4.13: Largest magnitude filter function of a two-qubit randomized benchmarking circuit of
Clifford depth 2 (actual depth 24) for different scale factors 𝜆. All filter functions are normalized by
their maximum values (otherwise the integral of the filter function scales by 𝜆). Different subplots
correspond to different noise scaling methods. All noise scaling methods change the frequency
response of the circuit; however, global folding tends to preserve the qualitative shape of the response
function and, for this reason, gives better performance for zero-noise extrapolation with colored
noise.
stretching, but performed “digitally.” However, repeating a gate-pulse 𝜆 times with amplitude 1/𝜆
is in general different from stretching a gate’s pulse (except in the case of rectangular pulses).
Fig. 4.13d shows a similar qualitative impact of gate Trotterization on the filter function as pulse
stretching, in that the filter function is compressed to the low frequencies. However, unlike pulse
stretching, it is distorted and not a “perfect” compression.
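The difference between the exponents in Eqs. (4.57) and (4.58) can be sketched numerically as follows. This is an illustrative toy calculation, not the simulation used for the figures: the filter function `filter_power`, the Lorentzian `lowpass` spectrum, and the frequency grid are all assumptions chosen for simplicity.

```python
import numpy as np

# Frequency grid (arbitrary units) and spacing for a simple Riemann sum.
omega = np.linspace(-50.0, 50.0, 200001)
d_omega = omega[1] - omega[0]

def filter_power(w):
    """Hypothetical |F_z(w)|^2: two Gaussian bumps centred at w = +/- 2."""
    return np.exp(-(w - 2.0) ** 2) + np.exp(-(w + 2.0) ** 2)

def white(w):
    """Flat spectrum (no time correlations)."""
    return np.ones_like(w)

def lowpass(w):
    """Lorentzian spectrum, used here as a stand-in for low-frequency noise."""
    return 1.0 / (1.0 + w ** 2)

def chi_direct(spectrum, lam):
    """Exponent in Eq. (4.57): lam times the base overlap integral of Eq. (4.54)."""
    chi_1 = np.sum(spectrum(omega) * filter_power(omega)) * d_omega / (2 * np.pi)
    return lam * chi_1

def chi_pulse(spectrum, lam):
    """Exponent in Eq. (4.58): lam times the integral of S(w / lam) |F_z(w)|^2."""
    integrand = spectrum(omega / lam) * filter_power(omega)
    return lam * np.sum(integrand) * d_omega / (2 * np.pi)

ratio_white = chi_pulse(white, 3.0) / chi_direct(white, 3.0)
ratio_lowpass = chi_pulse(lowpass, 3.0) / chi_direct(lowpass, 3.0)
```

Under these assumptions the two exponents coincide for the flat spectrum, while for the low-frequency spectrum the pulse-stretched exponent exceeds the directly scaled one, in line with the discussion of Eq. (4.58).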
Like pulse stretching and gate Trotterization, local folding also increases the proportion of
the filter function that overlaps with low frequency noise, see Fig. 4.13c. However unlike pulse
stretching and gate Trotterization, local folding also appears to generate response at high frequency.
Qualitatively, local folding “pulls” the filter function to the extreme frequencies from the middle of
the spectrum. With these general trends, we would again expect that the overlap integrals produced
would not be particularly close to direct noise scaling.
Of the noise scaling methods studied, it appears that global folding preserves the most structure
from the unscaled filter function. The circuit responses shown in Fig. 4.13b show that scaling
preserves the qualitative shape of the base circuit’s filter function. Qualitatively, it looks like the
impact of global folding serves to “resolve” a coarse frequency response of the base circuit. Thus,
scaling in this case preserves some structure and produces overlap integrals that are somewhat close
to direct noise scaling.
These observations in the different noise scaling strategies explain the trends in Figs. 4.9 and
4.10. As global folding produces scaled filter functions that best preserve the general balance across
different frequency ranges, the overlap integrals of the globally folded circuits are the closest to the
ideal scaling produced by direct noise scaling. The remaining three scaling approaches all produce
some level of concentration at low frequencies, and thus tend to have much greater overlap with
the low-frequency noise here. As the pulse stretching and gate Trotterization approaches are very
similar in spirit, they produce similar extrapolations. Furthermore, unlike local folding, these two
approaches have all their concentration at low frequency, thus producing the most overlap leading
to the worst extrapolation error. Local folding, which includes some high frequency content (based
on the proportion of the original circuit’s frequency response above 𝜋/2), produces overlaps that
lie between the global folding and the stretching/Trotterization approaches.
We note that the trends observed above and the intuition behind them are a direct consequence of
the correlated noise classes considered, all of which are fundamentally low frequency. Thus, pulse
stretching, gate Trotterization, and local folding produce larger overlaps with the low-frequency
noise and drastically bias the noise extrapolation process. In contrast, if the noise were band-limited
(say between 𝜋/4 and 3𝜋/4 in normalized frequency) we would expect that global folding would
continue to track direct noise scaling the best. However, analysis of the other three techniques
would be challenging as the overlap integral with these would essentially vanish as the scaling
increased. Without knowing the true expectation and the underlying noise spectra, it would be
unclear if the leveling out of the scaled expectation values would be due to the overlap integral
approaching infinity (i.e., too much noise) or vanishing (i.e., decoupling from the noise). Similarly,
if the noise were purely high frequency, we would expect the pulse stretching and gate Trotterization
approaches to be insensitive, the local folding method to be more sensitive, and global folding between
them. Finally, extremely narrow band noise could potentially lie in a “valley” in the scaled response
(obviously this is circuit dependent), and thus overlap integrals would vanish for all the noise scaling
approaches considered here.
4.2.5 Conclusion
In this work, we have demonstrated the effect of time-correlated noise on zero-noise extrapolation.
Using the SchWARMA technique to model time-correlated dephasing noise, we presented the
results of several numerical experiments showing that global unitary folding produces the lowest
error relative to direct noise scaling. We analyzed our observed results and provided a physical
interpretation in terms of the spectral analysis of the considered noise scaling methods.
A takeaway from our work is to use global noise scaling in zero-noise extrapolation, if possible,
whenever noise may be time-correlated. An obvious important consideration is which quantum
computer architectures may have time-correlated noise, a question we do not consider in this paper
and leave to future work. We note that global folding is not the only possible noise scaling method
suitable for time-correlated noise: other methods could be defined and analyzed, e.g. folding the
first half and second half of the gates in a unitary separately. Our work provides the theoretical and
practical tools to analyze the performance of such methods under a wide variety of noise models.
Data availability: Software for reproducing all numerical results is available at
https://github.com/mezze-team/mezze.
4.2.6 Consistency between different theories of pulse-stretching
Our work is based on a semi-classical theory of time-correlated noise, according to which, the
pulse-stretching technique induces two effective changes on the noise spectrum: (i) it scales the
noise level by a constant 𝜆, (ii) it also stretches the noise spectrum on the frequency axis by the
same constant. Both effects are formally summarized in Eq. (4.47) derived in the main text.
In Ref. [166], a different formalism, based on a master equation with a time-independent noise
operator, was used to study the pulse-stretching technique. More precisely, a system evolving
according to the following master equation was considered:
\[ \frac{\partial}{\partial t}\rho(t) = -[K(t), \rho(t)] + \mathcal{L}(\rho(t)), \tag{4.59} \]
where 𝐾 (𝑡) is the system Hamiltonian and L is a time-independent noise super-operator. As shown
in Ref. [166], the effect of pulse stretching (i.e., 𝐾(𝑡) → (1/𝜆) 𝐾(𝑡/𝜆)) is equivalent to an effective
master equation:
\[ \frac{\partial}{\partial t'}\rho(t') = -[K(t'), \rho(t')] + \lambda \mathcal{L}(\rho(t')), \tag{4.60} \]
where 𝑡′ = 𝑡/𝜆 (substituting 𝑡 = 𝜆𝑡′ and multiplying the equation by 𝜆). In practice, pulse-stretching induces a multiplicative scaling of the noise operator
L → 𝜆L.
The master equation Eq. (4.59) is typically used to model Markovian noise (no time-correlations).
In this case, the Hilbert space of the environment can be traced out such that 𝜌 represents the re-
duced state of the system evolving according to the master equation Eq. (4.59). In this white-noise
regime, our semi-classical theory of pulse-stretching also predicts a simple multiplicative scaling
of noise power and this is indeed consistent with Eq. (4.60).
What happens for a non-Markovian environment with a colored noise spectrum? In this case,
our semi-classical theory suggests that pulse-stretching induces, in addition to a multiplicative
scaling, also a scaling of the frequency axis of the noise spectrum (see Eq. (4.47)). This may
seem to contradict the simple multiplicative scaling of the noise L −→ 𝜆L derived in Ref. [166]
and reported in Eq. (4.60). However, as explained below, both theoretical derivations are actually
consistent with each other.
In principle, the master equation (4.59) can be used to model a non-Markovian bath by repre-
senting with 𝜌 the global quantum state (system + bath) instead of the reduced state of the system.
In this global picture, a non-Markovian bath can be modeled by a time-independent noise operator
L (𝜌) that includes an interaction Hamiltonian term 𝐻S𝐵 and the bare Hamiltonian 𝐻B acting on
the bath only (see Supplemental Material of Ref. [166])
L (𝜌(𝑡)) = −𝑖[𝐻S𝐵 + 𝐻B , 𝜌(𝑡)], (4.61)
which we can split as the sum of two terms $\mathcal{L} = \mathcal{L}_{SB} + \mathcal{L}_{B}$, where $\mathcal{L}_{SB}(\rho) = -i[H_{SB}, \rho]$ and
$\mathcal{L}_{B}(\rho) = -i[H_{B}, \rho]$. In this case, the simple multiplicative scaling $\mathcal{L} \to \lambda\mathcal{L}$ induced by the
pulse-stretching technique according to Eq. (4.60) actually has two physically different effects: (i)
$\mathcal{L}_{SB} \to \lambda\mathcal{L}_{SB}$, corresponding to a scaling of the noise power, and (ii) $\mathcal{L}_{B} \to \lambda\mathcal{L}_{B}$, corresponding
to an effective scaling of all the characteristic frequencies of the bath and, therefore, to a
frequency stretching of the noise spectrum. These two effects are consistent with the semi-classical
theory of pulse-stretching presented in this work and, in particular, with Eq. (4.47).
4.3 Increasing the effective quantum volume of quantum computers
4.3.1 Introduction
Quantum volume [8] is a single-number metric which, loosely speaking, reports the number of
usable qubits on a quantum computer.¹ While improvements to the underlying hardware are a
direct means of increasing quantum volume, the metric is “full-stack” and can be increased by an
improvement to any component, e.g. software for compilation to produce an equivalent quantum
circuit with fewer elementary operations [175].
Given an 𝑚-qubit quantum circuit 𝐶, the heavy set is $\mathcal{H}_C := \{z \in \{0,1\}^m : p(z) > p_{\mathrm{median}}\}$,
where $p(z) := |\langle z|C|0\rangle|^2$ is the probability of sampling bitstring 𝑧 and $p_{\mathrm{median}}$ is the median
probability over all bitstrings. A heavy bitstring is one in the heavy set. Quantum volume is
¹ Some authors define quantum volume as the effective Hilbert space dimension. Here we report the logarithm of
this number which corresponds to the number of qubits.
determined by counting the number of heavy bitstrings 𝑛 ℎ measured over 𝑛𝑐 random circuits, each
sampled 𝑛 𝑠 times. If the experiment is run with 𝑚 qubit circuits of depth 𝑑 = 𝑚, 𝑛𝑐 ≥ 100, and
\[ \hat{h}_d := \frac{n_h}{n_c n_s} > \frac{2}{3} + 2\sigma \tag{4.62} \]
where 𝜎 is the standard deviation of the estimate, then the volume is at least 𝑚. The actual volume
is the largest 𝑚 such that these conditions are true. The particular structure of these random circuits,
which we refer to as quantum volume circuits, is defined in [8].
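The definitions of the heavy set and the threshold condition (4.62) can be sketched as below. This is an illustrative sketch only: bitstrings are represented by integer indices into a probability vector, the function names are hypothetical, and the requirement 𝑛𝑐 ≥ 100 is not checked.

```python
import numpy as np

def heavy_set(probs):
    """Indices z whose ideal probability p(z) is strictly above the median."""
    probs = np.asarray(probs, dtype=float)
    return set(np.flatnonzero(probs > np.median(probs)).tolist())

def heavy_output_estimate(counts_per_circuit, heavy_sets, n_s):
    """Estimate h_hat_d = n_h / (n_c * n_s) from per-circuit measurement counts."""
    n_h = sum(
        sum(c for z, c in counts.items() if z in heavy)
        for counts, heavy in zip(counts_per_circuit, heavy_sets)
    )
    return n_h / (len(counts_per_circuit) * n_s)

def passes_threshold(h_hat, sigma):
    """Condition (4.62): h_hat must exceed 2/3 by two standard deviations."""
    return h_hat > 2 / 3 + 2 * sigma

# Toy example with one two-qubit circuit (indices 0..3 label the bitstrings).
ideal_probs = [0.1, 0.2, 0.3, 0.4]
heavy = heavy_set(ideal_probs)
counts = [{0: 10, 1: 10, 2: 30, 3: 50}]
h_hat = heavy_output_estimate(counts, [heavy], n_s=100)
```

In the toy example the median probability is 0.25, so indices 2 and 3 are heavy and 80 of the 100 samples land in the heavy set.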
4.3.2 Method
Given a quantum volume circuit 𝐶, we define the projector on the heavy subspace
\[ \Pi_{h,C} := \sum_{z \in \mathcal{H}_C} |z\rangle\langle z| \tag{4.63} \]
so that the expected number of heavy bitstrings for this circuit is $n_{h,C} := n_s \langle 0|C^{\dagger} \Pi_{h,C} C|0\rangle$. We use
zero-noise extrapolation (ZNE) [176, 177] with $\Pi_{h,C}$ as the observable for each quantum volume
circuit 𝐶 to estimate the noise-free value of $n_h := \sum_C n_{h,C}$. This amounts to evaluating $\langle \Pi_{h,C} \rangle^{(\lambda)}$ at
several noise-scale factors 𝜆 ≥ 1 then using these results to estimate $\langle \Pi_{h,C} \rangle^{(0)}$, i.e., the zero-noise
limit of the heavy output probability. In practice, this means compiling the circuit 𝐶 to a set of
circuits $\{C_{\lambda_i}\}_{i=1}^{k}$. For fairness with the unmitigated experiment, we use $n_s/k$ samples for each $C_{\lambda_i}$
so that the total number of samples drawn is equal in the mitigated and unmitigated experiments.
The main difference to previous work improving quantum volume by compiling [175] is that we
compile a single circuit to a set of circuits, following the pattern of many error mitigation methods
(e.g. [177]), in contrast with compilation that does rewrites on a single circuit following algebraic
rules or optimized routing.
After executing each $C_{\lambda_i}$ to obtain scaled heavy output counts $n_{h,C}^{(i)}$, we use Richardson
extrapolation [177, 178] to estimate the zero-noise result via
\[ n_{h,C}^{(0)} = \sum_{i=1}^{k} \eta_i\, n_{h,C}^{(i)} \tag{4.64} \]
[Figure 4.14 plots and device connectivity diagrams (qubits labeled 0 to 4 on each device).]
Figure 4.14: Results of unmitigated and mitigated quantum volume experiments on three five-qubit
quantum computers (left-to-right: Belem, Lima, and Quito) using 𝑛𝑐 = 500 circuits and $n_s = 10^4$
total samples. Each marker shows the estimated heavy output probability $\hat{h}_d$ on a different qubit
configuration defined in the legend and error bars show 2𝜎 intervals evaluated by bootstrapping.
The connectivity of each device is shown below each legend. Dashed black lines show the 2/3
threshold and noiseless asymptote (1 + ln 2)/2 [8]. For the mitigated experiments, 𝜆𝑖 ∈ {1, 3, 5, 7, 9}
and $n_s = 10^4/5$. Local unitary folding of two-qubit gates is used to compile the circuits (i.e., scale
noise) and Richardson’s method of extrapolation is used to infer the zero-noise result. The qubit
subsets which achieved the largest quantum volume in the mitigated experiments are colored blue
in each device diagram. As can be seen, on Belem error mitigation increases the effective quantum
volume from three to five, on Lima error mitigation increases the effective quantum volume from
three to four, and on Quito error mitigation increases the effective quantum volume from four to
five.
where coefficients are given by
\[ \eta_i := \prod_{j \neq i} \frac{\lambda_j}{\lambda_j - \lambda_i}. \tag{4.65} \]
In practice, we use 𝜆𝑖 ∈ {1, 3, 5, 7, 9} and scale circuits by locally folding two-qubit (CNOT)
gates [178]. In other words, the scaled circuit for 𝜆𝑖 = 𝑡 has each CNOT replaced by 𝑡 CNOTs.
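The Richardson coefficients of Eq. (4.65), the linear combination of Eq. (4.64), and the CNOT replacement rule described above can be sketched as follows. The gate representation (tuples such as `("CNOT", 0, 1)`) and the function names are hypothetical choices made for illustration; they are not the interface of the software used for the experiments.

```python
import numpy as np

def richardson_coefficients(scale_factors):
    """eta_i = prod_{j != i} lambda_j / (lambda_j - lambda_i), as in Eq. (4.65)."""
    lams = np.asarray(scale_factors, dtype=float)
    etas = []
    for i, lam_i in enumerate(lams):
        others = np.delete(lams, i)
        etas.append(np.prod(others / (others - lam_i)))
    return np.array(etas)

def richardson_extrapolate(scale_factors, values):
    """Zero-noise estimate as the linear combination in Eq. (4.64)."""
    return float(np.dot(richardson_coefficients(scale_factors), values))

def fold_cnots(gates, scale_factor):
    """Replace each CNOT by `scale_factor` copies; other gates are unchanged."""
    if scale_factor < 1 or scale_factor % 2 != 1:
        raise ValueError("scale factor must be an odd positive integer")
    folded = []
    for gate in gates:
        repeats = scale_factor if gate[0] == "CNOT" else 1
        folded.extend([gate] * repeats)
    return folded

etas = richardson_coefficients([1, 3, 5, 7, 9])
circuit = [("H", 0), ("CNOT", 0, 1)]
scaled = fold_cnots(circuit, 3)
```

Restricting to odd scale factors keeps the folded circuit logically equivalent to the original, since an even number of extra CNOTs multiplies to the identity. The coefficients sum to one, so a noise-independent expectation value is left unchanged by the extrapolation.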
4.3.3 Results
Using this strategy, we perform unmitigated and mitigated quantum volume experiments on
the Belem, Lima, and Quito devices available through IBM [1] (see Section 4.3.6 for device
specifications). The results, shown in Fig. 4.14, demonstrate that we are able to increase the
effective quantum volume from three to five on Belem, three to four on Lima, and from four to five
on Quito. Note that ZNE increases the estimated heavy output probability ℎˆ 𝑑 on all qubit subsets
even though the 2/3 threshold is not always crossed. We therefore expect that ZNE can increase
effective quantum volume independent of the size of the device, so long as cross-talk and other
errors do not scale with the device size. The largest ZNE experiment to date was performed on 26
qubit circuits with 1080 two-qubit gates [179] and, in the context of our work, provides evidence
that error mitigation can continue to increase the effective volume of larger quantum devices, e.g.
those listed in Section 4.3.7.
Because we estimate the noiseless result by taking a linear combination of noisy results, the
way we compute 𝜎 in (4.62) changes relative to [8]. For any technique, such as Richardson
extrapolation, that evaluates an error-mitigated expectation value as a linear combination of noisy
expectation values (4.64), one can show (see Section 4.3.8) that
\[ \sigma^2 = \frac{1}{n_c^2} \sum_C \sigma_C^2, \qquad \sigma_C^2 = \sum_{i=1}^{k} |\eta_i|^2 \left(\sigma_C^{(i)}\right)^2, \tag{4.66} \]
where $(\sigma_C^{(i)})^2$ is the variance of each noise-scaled expectation value, while $\sigma_C^2$ is the variance of the
error-mitigated expectation value associated to the quantum circuit 𝐶. The previous expressions
correspond to a theoretical estimate of the error, but in practice we can estimate error bars by
repeating the experiment multiple times or by bootstrapping. The 2𝜎 error bars in Fig. 4.14 are
calculated by bootstrapping with 500 resamples. See Section 4.3.8 for more details.
4.3.4 Discussion
There is a subtle point in interpreting our results in the general context of quantum computer
performance. Our error mitigation procedure improves the expectation value of the heavy output
projector but does not produce more heavy bitstrings — in fact, our procedure likely produces fewer
heavy bitstrings because we distribute samples across circuits at amplified noise levels. However,
as we have shown, we are able to use this information to estimate the expected number of heavy
bitstrings in a statistically significant way. To carefully distinguish between the two cases, we refer
to our results as increasing the effective quantum volume.
The restriction to evaluating expectation values but not directly sampling bitstrings raises
interesting questions about physicality and the role of a quantum computer in a computational
procedure. If an algorithm only requires expectation values and we apply the error mitigation
procedure used in this work, is it the case that we effectively have access to a quantum computer
with a larger quantum volume? These questions are linked to the physical interpretation of error
mitigation. One way to interpret ZNE is that we evaluate expectation values with respect to the
“extrapolated density matrix”
\[ \rho_0 = \sum_i \eta_i \rho_{\lambda_i} \tag{4.67} \]
where 𝜌𝜆𝑖 are the noise-scaled physical states and 𝜂𝑖 are the real coefficients in (4.64). Clearly
we did not physically prepare 𝜌0 in our experiment, but should we restrict the use of a quantum
computer to only preparing a single physical state from which we can sample bitstrings? Or do
we allow ourselves to “virtually” prepare non-physical but mathematically well-defined states from
which we can compute expectation values more accurately? We note that similar questions have
been asked in the context of virtual distillation techniques [180, 129] which have been proposed to
artificially purify a quantum state or to reduce its effective temperature [181].
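To make the notion of an extrapolated density matrix in Eq. (4.67) concrete, the following toy sketch builds 𝜌0 for a single qubit. The noise model (depolarizing noise whose survival probability decays as (1 − 𝑝)^𝜆) and the value 𝑝 = 0.1 are assumptions made only for illustration.

```python
import numpy as np

def noisy_state(lam, p=0.1):
    """Assumed model: |0><0| under depolarizing noise amplified by lam."""
    survival = (1 - p) ** lam
    ideal = np.array([[1.0, 0.0], [0.0, 0.0]])
    return survival * ideal + (1 - survival) * np.eye(2) / 2

scale_factors = [1.0, 3.0]
etas = [1.5, -0.5]  # Richardson coefficients from Eq. (4.65) for these scale factors

# Extrapolated "state" of Eq. (4.67): a real linear combination of physical states.
rho_0 = sum(eta * noisy_state(lam) for eta, lam in zip(etas, scale_factors))

projector = np.array([[1.0, 0.0], [0.0, 0.0]])  # observable |0><0|
unmitigated = float(np.trace(noisy_state(1.0) @ projector))
mitigated = float(np.trace(rho_0 @ projector))
```

Because the coefficients sum to one, 𝜌0 has unit trace, but since some coefficients are negative it is not guaranteed to be positive semidefinite in general. In this toy model the expectation value computed from 𝜌0 is closer to the ideal value of one than the unmitigated value.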
4.3.5 Conclusion
In this work we have experimentally demonstrated that error mitigation improves the effective
quantum volume of several quantum computers. We use the term effective quantum volume to
emphasize that our procedure is appropriate for algorithms computing expectation values and not
for algorithms requiring individual bitstrings. The error mitigation technique is not tailored to the
structure of quantum volume circuits or to the architecture of the quantum computers we used.
Indeed, we did not run any additional calibration experiments or use any calibration information to
obtain our results. Similar software-level techniques have been used in previous quantum volume
experiments, e.g. (approximate) compilation [8, 175] and dynamical decoupling [175]. The novelty
of our proposal is that, by relaxing the strong requirement of directly sampling heavy bitstrings
to the weaker requirement of estimating the expectation value of the heavy output projector, more
general error mitigation techniques can be applied to improve the effective quantum volume of a
device. We expect this approach to improve the effective quantum volume of additional quantum
                Lima            Belem           Quito
# Qubits        5               5               5
𝜖1Q             4.446 × 10^-4   2.808 × 10^-4   2.980 × 10^-4
𝜖CNOT           1.131 × 10^-2   1.098 × 10^-2   8.292 × 10^-3
𝜖M              3.790 × 10^-2   2.868 × 10^-2   2.546 × 10^-2
Table 4.3: Device specifications and error rates for the quantum computers we used in our experiments.
Device connectivities are shown in Fig. 4.14. Parameters 𝜖1Q, 𝜖CNOT, 𝜖M denote, respectively,
averages (over all qubits) of single-qubit √𝑋 gate errors, two-qubit CNOT gate errors, and readout
errors (𝑝(0|1) + 𝑝(1|0))/2 accessed from [1].
computers such as those in Section 4.3.7. Our open source error mitigation software [182] can be
used on many quantum computers to repeat the experiments we performed here.
In the context of error mitigation, our work provides additional benchmarks to the relatively
few experimental results in literature [183, 182, 179, 184, 185, 186, 187]. We encourage the use
of quantum volume as a benchmark for error mitigation techniques due to its relatively widespread
adoption and clear operational meaning. Normalizing by additional resources used (gates, shots,
qubits, etc.) in error-mitigated quantum volume experiments provides a way to directly compare
different techniques and drive progress in this area. As most error mitigation techniques act on
expectation values, they can be used for effective quantum volume experiments as we have done in
this work.
Code and data availability The code we used to run experiments as well as the data we
collected are available at https://github.com/unitaryfund/mitiq-qv.
4.3.6 Device specifications
In Table 4.3 we provide more information about the quantum computers we used in our experiments.
Note that the quantum volume of Belem is listed as four at [1] but we are unable to reproduce this
result in our unmitigated experiments, presumably due to device degradation over time.
Quantum computer              log QV   Reference
Rigetti Aspen-4               3        [188]
Lima                          3 (4)    [1] (this work)
Belem                         3 (5)    [1] (this work)
Jakarta                       4        [1]
Bogota                        4        [1]
Quito                         4 (5)    [1] (this work)
Manila                        5        [1]
Nairobi                       5        [1]
Lagos                         5        [1]
Perth                         5        [1]
Guadalupe                     5        [1]
Toronto                       5        [1]
Brooklyn                      5        [1]
Trapped-ion QCCD              6        [189]
Hanoi                         6        [1]
Auckland                      6        [1]
Cairo                         6        [1]
Washington                    6        [1]
Mumbai                        7        [1]
Kolkata                       7        [1]
Honeywell System Model H1     10       [190]
Table 4.4: Measured quantum volumes (in increasing order). Values in parentheses show effective
quantum volumes measured in this work.
4.3.7 Table of quantum volumes
As discussed in the main text, error mitigation consistently increased the estimated heavy output
probability in all of our experiments. To increment the effective quantum volume of a device, this
increase must cross the 2/3 threshold with statistical significance. While there is no guarantee that
this will happen, we expect there to be several cases of already-measured quantum volumes where
this will be true. For this reason, as well as general context, we include a list of quantum computer
volumes in Table 4.4.
4.3.8 Statistical uncertainty of error-mitigated volume
4.3.8.1 Theoretical estimation of error bars
For a large number of error mitigation techniques, including Richardson extrapolation, an error-mitigated
expectation value 𝐸𝐶 associated with an ideal circuit 𝐶 is evaluated as a linear combination
of different noisy expectation values:
\[ E_C = \sum_j \eta_j \tilde{E}_j. \tag{4.68} \]
Because of shot noise, each noisy expectation value $\tilde{E}_j$ can only be measured up to a statistical
variance $\sigma_j^2 = \mathbb{E}(\tilde{E}_j^2) - [\mathbb{E}(\tilde{E}_j)]^2$, where $\mathbb{E}$ represents the statistical average over $n_j$ measurement
shots.
Since different noisy expectation values are statistically uncorrelated, the variance $\sigma_C^2$ of the
error-mitigated result $E_C$ is:
\[ \sigma_C^2 = \mathbb{E}(E_C^2) - [\mathbb{E}(E_C)]^2 = \sum_j |\eta_j|^2 \sigma_j^2. \tag{4.69} \]
If we assume that each noisy expectation value is obtained by sampling a binomial distribution
$B(p_j, n_j)$ with probability $p_j = \tilde{E}_j$ and normalizing the result over $n_j$ measurement shots, we have
$\sigma_j^2 = \tilde{E}_j (1 - \tilde{E}_j)/n_j$. The variance $\sigma_C^2$ of the error-mitigated result is therefore:
\[ \sigma_C^2 = \sum_{j=1}^{k} |\eta_j|^2 \tilde{E}_j (1 - \tilde{E}_j)/n_j. \tag{4.70} \]
The previous expression is valid for a generic expectation value. In the specific case of a quantum
volume experiment, we can identify with 𝐸𝐶 the heavy-output probability associated with a specific
random circuit 𝐶. Averaging 𝐸𝐶 over 𝑛𝑐 circuits 𝐶 of depth 𝑑 produces the estimated
heavy output probability visualized in Fig. 4.14:
\[ h_d = \frac{1}{n_c} \sum_C E_C. \tag{4.71} \]
This is again a sum of independent random variables and so its variance is given by:
\[ \sigma^2 = \frac{1}{n_c^2} \sum_C \sigma_C^2. \tag{4.72} \]
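The theoretical error bar defined by Eqs. (4.70) and (4.72) can be sketched in a few lines. The function names are hypothetical, and the binomial assumption stated above is taken for granted.

```python
import numpy as np

def mitigated_variance(etas, noisy_values, shots):
    """Eq. (4.70): sum_j |eta_j|^2 E_j (1 - E_j) / n_j for a single circuit."""
    etas = np.asarray(etas, dtype=float)
    e = np.asarray(noisy_values, dtype=float)
    n = np.asarray(shots, dtype=float)
    return float(np.sum(np.abs(etas) ** 2 * e * (1 - e) / n))

def heavy_output_sigma(per_circuit_variances):
    """Square root of Eq. (4.72): sqrt(sum_C sigma_C^2) / n_c."""
    v = np.asarray(per_circuit_variances, dtype=float)
    return float(np.sqrt(v.sum()) / len(v))

# Toy example: two scale factors with Richardson coefficients 1.5 and -0.5.
var_c = mitigated_variance([1.5, -0.5], [0.70, 0.60], [1000, 1000])
sigma = heavy_output_sigma([var_c] * 100)
```

Because the coefficients enter squared, large Richardson coefficients inflate the variance of the mitigated estimate relative to an unmitigated estimate with the same total number of shots.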
4.3.8.2 Bootstrapping empirical error bars
The previous way of estimating error bars is based on theoretical assumptions and, even though
it provides useful analytical expressions, it may underestimate unknown sources of errors such as
systematic errors.
A brute-force way of estimating error bars is to repeat the estimation of the quantity of interest (in
our case ℎ 𝑑 ) with 𝑁 independent experiments and to evaluate the empirical variance of the results.
This method can be expensive with respect to classical and quantum computational resources, and
the results can be sensitive to how the independent samples are grouped. So while we can split
independent samples into five groups of 𝑛𝑐 = 100 circuits to estimate the standard deviation this
way, a more feasible alternative is instead given by bootstrapping. This is a statistical inference
technique in which, instead of performing 𝑁 new experiments, one resamples the raw results of a
single experiment 𝑁 times in order to estimate properties of the underlying statistical distribution.
In our specific quantum volume experiment, the heavy-output probability $h_d$ is estimated
as an average over $n_c$ random circuits $C$ as shown in equation (4.71). Let us define the set
$S = \{E_{C_1}, E_{C_2}, \ldots, E_{C_{n_c}}\}$ containing all the estimated heavy-output probabilities associated with
different random circuits. We can now resample $N$ sets of data $S_1, S_2, \ldots, S_N$, each one containing
$n_c$ values that are randomly sampled from $S$ with replacement. For each resampled set $S_j$ we
evaluate the associated bootstrapped mean $\mu_j = \langle E_C \rangle_{S_j}$.

[Figure 4.15 plot omitted. Horizontal axis: # resamples (0 to 1000); vertical axis ticks from 0.0040 to 0.0065.]
Figure 4.15: The value of $\sigma$ for different resampling numbers in bootstrapping.
The empirical standard deviation of all the bootstrap means $\{\mu_1, \mu_2, \ldots, \mu_N\}$ provides an estimate
of the statistical uncertainty:
$$\sigma = \sqrt{\frac{1}{N} \sum_{j=1}^{N} (\mu_j - \bar{\mu})^2}, \tag{4.73}$$
where $\bar{\mu} = \sum_{j=1}^{N} \mu_j / N$. This is the method that we used to evaluate error bars in Fig. 4.14 (with
$N = 500$).
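The bootstrap procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions: the data set below is synthetic (Gaussian values clipped to [0, 1]) and stands in for the measured heavy-output probabilities, and `bootstrap_sigma` is a hypothetical helper name. `statistics.pstdev` uses the $1/N$ normalization of (4.73).

```python
import random
from statistics import fmean, pstdev

def bootstrap_sigma(values, num_resamples, rng):
    """Eq. (4.73): population standard deviation of the bootstrap means."""
    n_c = len(values)
    # Each resampled set draws n_c values from `values` with replacement.
    means = [fmean(rng.choices(values, k=n_c)) for _ in range(num_resamples)]
    return pstdev(means)

rng = random.Random(0)
# Synthetic stand-in for S = {E_C1, ..., E_Cnc}: 500 made-up probabilities.
S = [min(1.0, max(0.0, rng.gauss(0.7, 0.1))) for _ in range(500)]

sigma = bootstrap_sigma(S, num_resamples=500, rng=rng)
print(f"bootstrap sigma = {sigma:.4f}")
```

For independent data the result should be close to the sample standard deviation of $S$ divided by $\sqrt{n_c}$, which gives a quick sanity check on the resampling code.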
One may ask how large $N$ should be in order to obtain a reasonable estimate of the error. In Fig.
4.15 we show the dependence of $\sigma$ on $N$ for an arbitrary error-mitigated point of Fig. 4.14 (the results
are qualitatively similar for all points). Fig. 4.15 provides empirical evidence that, for $N > 400$,
the bootstrapped estimate converges to a stable result.
CHAPTER 5
LOGICAL SHADOW TOMOGRAPHY
5.1 Background
5.1.1 Subspace expansion
Let S = ⟨𝑆1 , ..., 𝑆𝑟 ⟩ be a stabilizer code with generators 𝑆1 , ..., 𝑆𝑟 . Recall from Chapter 1.4 that a
stabilizer code (group) S is a subgroup of P𝑛 , the Pauli group on 𝑛-qubits, such that −𝐼 ∉ S and S
is abelian. These properties ensure the codespace defined by S is non-trivial. The codespace 𝑉S
of S is the span of the codewords of S, where a codeword is a plus one eigenstate of all stabilizer
elements. That is,
𝑉S := span {|𝑐 1 ⟩, ..., |𝑐 𝑘 ⟩} (5.1)
where each codeword |𝑐𝑖 ⟩ for all 𝑖 = 1, ..., 𝑘 satisfies 𝑆|𝑐𝑖 ⟩ = |𝑐𝑖 ⟩ for all 𝑆 ∈ S. Note that the
dimension of the codespace is $\dim(V_{\mathcal{S}}) = k = 2^{n-r}$.
Given a generator $S_i \in \{S_1, ..., S_r\}$, define the projector
$$P_i := \frac{1}{2}(I + S_i). \tag{5.2}$$
Note that $P_i^2 = \frac{1}{4}(I + 2S_i + S_i^2) = \frac{1}{4}(2I + 2S_i) = P_i$, where we used $S_i^2 = I$, so that $P_i$ is
indeed a projector. Now, define the complete projector as follows.
Definition 8. Given a stabilizer $\mathcal{S} = \langle S_1, ..., S_r \rangle$, the complete projector is defined as
$$P := \prod_{i=1}^{r} P_i = \prod_{i=1}^{r} \frac{1}{2}(I + S_i). \tag{5.3}$$
Theorem 9: The complete projector can be written as a convex combination over all stabilizer
elements
$$P = \frac{1}{2^r} \sum_{S \in \mathcal{S}} S. \tag{5.4}$$
Proof. This follows from expanding (5.3) and using closure. Note that there are $2^r$ elements in
$\mathcal{S}$. □
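Theorem 9 can be checked numerically on a small example. The sketch below is an illustration under assumptions not in the text: it uses the three-qubit bit-flip code with generators ZZI and IZZ, whose stabilizers are all diagonal, so every operator is stored as the list of its diagonal entries. The helper names are hypothetical.

```python
from itertools import product

def z_diagonal(pauli, n):
    """Diagonal of a Z-type Pauli string (letters I/Z) in the computational basis."""
    diag = []
    for bits in product((0, 1), repeat=n):
        sign = 1
        for letter, bit in zip(pauli, bits):
            if letter == "Z" and bit:
                sign = -sign
        diag.append(sign)
    return diag

def multiply(d1, d2):
    """Product of two diagonal operators."""
    return [a * b for a, b in zip(d1, d2)]

n = 3
generators = ["ZZI", "IZZ"]     # assumed example code, r = 2
r = len(generators)
gen_diags = [z_diagonal(g, n) for g in generators]
identity = [1] * 2 ** n

# Left-hand side, Eq. (5.3): product of the projectors (I + S_i)/2.
lhs = identity
for d in gen_diags:
    lhs = multiply(lhs, [(1 + s) / 2 for s in d])

# Right-hand side, Eq. (5.4): average over all 2^r group elements,
# enumerated as products of subsets of the generators.
group = []
for powers in product((0, 1), repeat=r):
    element = identity
    for power, d in zip(powers, gen_diags):
        if power:
            element = multiply(element, d)
    group.append(element)
rhs = [sum(column) / 2 ** r for column in zip(*group)]

print(lhs)
print(rhs)
```

Both lists are the indicator of the basis states 000 and 111, i.e. the projector onto the code space of this example code.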
For quantum subspace expansion [127], we want to compute the expectation value $\langle\Gamma\rangle \equiv \mathrm{Tr}[\rho\Gamma]$
for an observable
$$\Gamma := \sum_{i=1}^{m} \gamma_i \Gamma_i. \tag{5.5}$$
We start in the codespace with a state $|\psi\rangle\langle\psi|$ and evolve unitarily to $\rho := U|\psi\rangle\langle\psi|U^\dagger$. Suppose
some error $E_i$ occurs which takes us out of our codespace, i.e.
$$|\psi\rangle \mapsto E_i|\psi\rangle. \tag{5.6}$$
If $E_i$ is correctable by our stabilizer code, then there exists a projector $P_i$ such that $P_i E_i|\psi\rangle = |\psi\rangle$
(ignoring any renormalization). Hence for any correctable errors the complete projector (5.3) will
map the errored state $|\psi\rangle_{\text{error}} \in V_{\mathcal{S}}^{\perp}$ back into the codespace $V_{\mathcal{S}}$. The idea of quantum subspace
expansion is thus to measure
$$\langle\Gamma\rangle_{\text{corrected}} := \frac{1}{c} \mathrm{Tr}[P\rho P^\dagger \Gamma] \tag{5.7}$$
where $c$ is a normalization factor. Using (5.3) and (5.5), we can expand this as
$$\langle\Gamma\rangle_{\text{corrected}} = \frac{1}{2^{2r} c} \sum_{i=1}^{m} \sum_{j=1}^{2^r} \sum_{k=1}^{2^r} \gamma_i \mathrm{Tr}[S_j \rho S_k^\dagger \Gamma_i]. \tag{5.8}$$
Here each Γ𝑖 is a logical operator of the stabilizer code S. For subspace codes, all logical
operators commute with stabilizers. Using cyclicity of the trace along with this property, we can
write
$$\langle\Gamma\rangle_{\text{corrected}} = \frac{1}{2^{2r} c} \sum_{i=1}^{m} \sum_{j=1}^{2^r} \sum_{k=1}^{2^r} \gamma_i \mathrm{Tr}[\rho S_k^\dagger S_j \Gamma_i]. \tag{5.9}$$
Again by closure, the product $S_k^\dagger S_j$ is another stabilizer element. This allows us to eliminate one
summation and arrive at the following theorem.
Theorem 10: If $[\Gamma_i, S_j] = 0$ for all logical operators $\Gamma_i$ and stabilizers $S_j$ (which is true for subspace
codes), then
$$\langle\Gamma\rangle_{\text{corrected}} = \frac{1}{2^r c} \sum_{i=1}^{m} \sum_{j=1}^{2^r} \gamma_i \mathrm{Tr}[\rho S_j \Gamma_i]. \tag{5.10}$$
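We remark that for subsystem codes, not all logical operators commute with stabilizers, so (5.10)
is invalid for subsystem codes. Here, there is an additional “residue” term containing $[\Gamma_i, S_j] \neq 0$,
which arises from writing $\Gamma_i S_j = S_j \Gamma_i + [\Gamma_i, S_j]$ after applying cyclicity of the trace to (5.8).
The full expression is
$$\langle\Gamma\rangle_{\text{corrected}} = \frac{1}{2^r c} \sum_{i=1}^{m} \sum_{j=1}^{2^r} \gamma_i \mathrm{Tr}[\rho S_j \Gamma_i] + \frac{1}{2^{2r} c} \sum_{i=1}^{m} \sum_{j=1}^{2^r} \sum_{k=1}^{2^r} \gamma_i \mathrm{Tr}[\rho S_k^\dagger [\Gamma_i, S_j]]. \tag{5.11}$$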
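As a worked instance of Theorem 10, the following sketch evaluates (5.10) for a toy model that is an assumption of this example, not a setting from the text: the three-qubit bit-flip code, the state $|000\rangle$ after independent bit flips with probability $p = 0.1$, and the logical observable $\Gamma = ZZZ$. All operators are diagonal here, so traces reduce to sums over diagonal entries. The normalization is taken to be $c = \mathrm{Tr}[P\rho P]$, evaluated with (5.4).

```python
from itertools import product

n, p = 3, 0.1
basis = list(product((0, 1), repeat=n))

def z_diagonal(pauli):
    """Diagonal of a Z-type Pauli string (letters I/Z) in the computational basis."""
    return [(-1) ** sum(bit for letter, bit in zip(pauli, bits) if letter == "Z")
            for bits in basis]

def trace(*diagonals):
    """Trace of a product of diagonal operators."""
    total = 0.0
    for entries in zip(*diagonals):
        term = 1.0
        for entry in entries:
            term *= entry
        total += term
    return total

# Diagonal noisy state: probability of each bit-flip pattern applied to |000>.
rho = [p ** sum(bits) * (1 - p) ** (n - sum(bits)) for bits in basis]
stabilizers = [z_diagonal(s) for s in ("III", "ZZI", "IZZ", "ZIZ")]  # 2^r = 4 elements
gamma = z_diagonal("ZZZ")                                            # logical Z

c = sum(trace(rho, s) for s in stabilizers) / len(stabilizers)       # Tr[P rho P]
corrected = sum(trace(rho, s, gamma) for s in stabilizers) / (len(stabilizers) * c)
unmitigated = trace(rho, gamma)
print(f"unmitigated = {unmitigated:.4f}, corrected = {corrected:.4f}")
```

The ideal value is 1. In this toy model the unmitigated value is $(1-2p)^3$, while the corrected value is $[(1-p)^3 - p^3]/[(1-p)^3 + p^3]$, so only the weight-three error (a logical operator) survives the projection.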
5.1.2 Virtual distillation
Given a noisy state $\rho_{\mathcal{E}} = \mathcal{E}(\rho)$, write its spectral decomposition
$$\rho_{\mathcal{E}} = p_0 |\psi_0\rangle\langle\psi_0| + \sum_{k \geq 1} p_k |\psi_k\rangle\langle\psi_k| \tag{5.12}$$
where 𝑝 0 > 𝑝 1 ≥ · · · ≥ 0, and ⟨𝜓𝑖 |𝜓 𝑗 ⟩ = 𝛿𝑖 𝑗 (the Kronecker delta). Assume that the ideal
(noiseless) state is 𝜌 = |𝜓0 ⟩⟨𝜓0 |. (In an idealized noise model, we can think of the noise as adding
“orthogonal errors”, e.g. if our ideal state is |0⟩⟨0| then under bitflip noise with probability 𝑝 < 1/2
we obtain the noisy state 𝜌 E = (1 − 𝑝)|0⟩⟨0| + 𝑝|1⟩⟨1|. See [128] for more justification of this
assumption.) Given an observable $O$, the idea of virtual distillation [128, 129] is to evaluate the
quantity
$$\langle O\rangle_{\text{VD}} := \frac{\mathrm{Tr}[\rho_{\mathcal{E}}^m O]}{\mathrm{Tr}[\rho_{\mathcal{E}}^m]} \tag{5.13}$$
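for positive integer $m$. Using the spectral decomposition (5.12), one can show that
$$\langle O\rangle_{\text{VD}} = \langle O\rangle \, \frac{1 + \sum_{k \geq 1} (p_k/p_0)^m \langle\psi_k|O|\psi_k\rangle / \langle O\rangle}{1 + \sum_{k \geq 1} (p_k/p_0)^m} \tag{5.14}$$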
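where $\langle O\rangle := \langle\psi_0|O|\psi_0\rangle$ is the true (noiseless) expectation value by assumption. It is easy to see
that $\lim_{m\to\infty} \langle O\rangle_{\text{VD}}(m) = \langle O\rangle$. In fact, to leading order
$$\langle O\rangle_{\text{VD}} = \langle O\rangle \left[1 + \mathcal{O}\left((p_1/p_0)^m\right)\right], \tag{5.15}$$
showing that errors are suppressed exponentially in the power $m$.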
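The exponential suppression in (5.15) is easy to see on the bit-flip example given above. The sketch below evaluates (5.13) directly from the spectrum, assuming $p = 0.1$ and $O = Z$ (both choices are illustrative); `virtual_distillation` is a hypothetical helper name.

```python
def virtual_distillation(eigenvalues, expectations, m):
    """Eq. (5.13) in the eigenbasis: sum_k p_k^m <psi_k|O|psi_k> / sum_k p_k^m."""
    numerator = sum(p ** m * o for p, o in zip(eigenvalues, expectations))
    denominator = sum(p ** m for p in eigenvalues)
    return numerator / denominator

# Ideal state |0><0| under bit-flip noise: rho = (1 - p)|0><0| + p|1><1|.
p = 0.1
eigenvalues = [1 - p, p]       # p_0, p_1
expectations = [1.0, -1.0]     # <0|Z|0>, <1|Z|1>

errors = []
for m in (1, 2, 3, 4):
    estimate = virtual_distillation(eigenvalues, expectations, m)
    errors.append(1.0 - estimate)     # ideal value is <0|Z|0> = 1
    print(f"m = {m}: <Z>_VD = {estimate:.6f}, error = {errors[-1]:.2e}")
```

For this two-level spectrum the error is $2 p_1^m/(p_0^m + p_1^m)$, so each additional power of the state reduces it by roughly the factor $p_1/p_0 = 1/9$.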
The remaining question is how to evaluate (5.13). The authors of [128, 129] propose physically
preparing $m$ copies of the state and then using the “SWAP trick” $\mathrm{Tr}[S_m \rho^{\otimes m} O_1] = \mathrm{Tr}[\rho^m O]$. Here, $S_m$ is
the SWAP or shift operator which cyclically permutes $m$ copies of a state,
$S_m |1\rangle \otimes |2\rangle \otimes |3\rangle \otimes \cdots \otimes |m\rangle = |2\rangle \otimes |3\rangle \otimes \cdots \otimes |m\rangle \otimes |1\rangle$,
and $O_1$ is the observable $O$ acting on the first copy of the state $\rho^{\otimes m}$. The
denominator of (5.13) is computed in the same way but by omitting $O_1$. Even for small $m$, this
requires many additional qubits and operations to prepare the states, and the SWAP operation is
very hard to implement experimentally. Recognizing these challenges, Ref. [191] proposes using
active reset to reduce the number of ancilla qubits needed for this approach. Our logical shadow
tomography procedure only requires sampling from a single copy of the state. We now describe this
method and how it encapsulates both subspace expansion and virtual distillation with significantly
fewer resources.
5.2 Introduction
In this chapter, we are interested in error mitigation in the region between noisy intermediate-scale
quantum (NISQ) and fault-tolerant quantum computation. We present a conceptually simple and
practical quantum error mitigation (QEM) technique. In our setup, we use a quantum error correction
code to distribute the information of each logical qubit across multiple physical qubits. Without active
quantum error correction, which requires extra ancilla qubits and parity check measurements, the only
step which happens on a quantum computer is repeatedly sampling from an encoded state to perform
shadow tomography [192, 193]. After this, one can use classical post-processing to project out errors as
in subspace expansion [127] and calculate powers of the density matrix as in virtual distillation [128]. In
addition to enabling both of these methods, our technique requires significantly fewer resources.
Specifically, we show that (i) the quantum gate overhead is independent of the number of logical
qubits and we do not require multiple copies of the physical systems, (ii) the sample complexity for
estimating error-mitigated Pauli observables only scales with the number of logical qubits instead
of the total number of physical qubits, and (iii) there exists an efficient classical algorithm that can
post-process the data in polynomial time. We show with numerical experiments that the new method
is practical for relatively large systems.
5.3 Logical shadow tomography
5.3.1 Motivation
[Figure 5.1 graphic omitted. Panels (a) and (b); labels: physical qubits, quantum system, unitary channel, measurement, classical post-processing.]
Figure 5.1: Graphic illustration of logical shadow tomography. (a) Red dots are logical qubits,
and blue dots are physical qubits. Logical information is distributed to physical qubits by an error
correction code, followed by noisy quantum computation on the physical qubits. To estimate
error-mitigated observables, we perform classical shadow tomography on the noisy physical
state. In particular, we can apply random Clifford gates, denoted as green blocks, from some unitary
ensemble $\mathcal{U}$, and take computational basis measurements. (b) A special case using an $[[n, 1]]$ code
for each logical qubit. In shadow tomography, we apply random unitaries from a tensor product of
Clifford groups $\mathrm{C}\ell(2^n)^{\otimes k}$. The additional gate depth does not scale with the number of logical qubits $k$,
and the sample complexity for estimating error-mitigated logical Pauli observables is the same as using
the global Clifford group $\mathrm{C}\ell(2^{nk})$.
In this work, we are interested in error mitigation in the region between noisy intermediate-scale
quantum (NISQ) and fault-tolerant quantum computation, where the number of qubits is more than
a few but cannot fulfill the requirement of fault-tolerance. We ask the question whether we can use
those resources cleverly to achieve a more reliable quantum computation. QEM methods related
to this idea are subspace expansion [127] and the virtual distillation [128]. The key contribution
of this work is to propose the application of classical shadow tomography [192, 193] to reduce
the quantum and classical resources needed to perform these QEM schemes, and provide rigorous
analysis on its error mitigation capability.
The subspace expansion approach starts by encoding $k$ logical qubits with $N$ physical qubits
via an error correction code $[[N, k]]$. For a stabilizer code defined by a stabilizer group
$\mathcal{S} = \langle S_1, S_2, \cdots, S_{N-k} \rangle$, the code subspace is specified by the projection operator
$$\Pi = \prod_{j=1}^{N-k} \frac{I + S_j}{2} = \frac{1}{2^{N-k}} \sum_{M \in \mathcal{S}} M, \tag{5.16}$$
where $M$ denotes group elements in $\mathcal{S}$ as products of the stabilizers. Errors may occur in the
physical state during the quantum information processing, which generally takes the physical state
away from the code subspace. Suppose the goal is to estimate the logical observable $O$ only. Then,
even without active error correction, a simple projection of the corrupted physical state $\rho_{\mathcal{E}}$ back
to the code subspace can already mitigate the error for the logical observable [127]
$$\langle O\rangle_{\text{QEM}} = \frac{\mathrm{Tr}(\Pi \rho_{\mathcal{E}} \Pi^\dagger O)}{\mathrm{Tr}(\Pi \rho_{\mathcal{E}} \Pi^\dagger)} = \frac{1}{2^{N-k} c} \sum_{M \in \mathcal{S}} \mathrm{Tr}(\rho_{\mathcal{E}} M O), \tag{5.17}$$
where $c = \mathrm{Tr}(\Pi \rho_{\mathcal{E}} \Pi^\dagger)$. This amounts to measuring $\langle MO\rangle$ on the corrupted physical state for all
elements $M$ in the stabilizer group $\mathcal{S}$ (or for a majority of $M$ sampled from $\mathcal{S}$). This approach
becomes exponentially expensive as $N - k$ grows, because the stabilizer group contains $2^{N-k}$ elements.
Another approach for QEM goes under the name of virtual distillation. Assume the target
state is a pure state that is the leading eigenstate of $\rho_{\mathcal{E}}$. The sub-leading eigenstates of $\rho_{\mathcal{E}}$ (as
errors orthogonal to the target state) can then be suppressed by powering the density matrix $\rho_{\mathcal{E}}^m$, and
we estimate $\langle O\rangle_{\text{QEM}} = \mathrm{Tr}(\rho_{\mathcal{E}}^m O)/\mathrm{Tr}(\rho_{\mathcal{E}}^m)$. More generally, a polynomial function
$f(\rho_{\mathcal{E}}) = c_0 I + c_1 \rho_{\mathcal{E}} + c_2 \rho_{\mathcal{E}}^2 + \cdots + c_m \rho_{\mathcal{E}}^m$ of $\rho_{\mathcal{E}}$ can be considered, with an optimal choice of the
coefficients $c_0, \cdots, c_m$ to best mitigate the error, such that
$$\langle O\rangle_{\text{QEM}} = \frac{\mathrm{Tr}(f(\rho_{\mathcal{E}}) O)}{\mathrm{Tr}(f(\rho_{\mathcal{E}}))}. \tag{5.18}$$
However, powering a density matrix on quantum devices typically involves making multiple copies
of the quantum system, which can be challenging and expensive in quantum resources.
The key observation of this work is that both Eq. (5.17) and Eq. (5.18) (or their combination)
can be efficiently evaluated in the classical post-processing phase after performing classical
shadow tomography [193] on the noisy physical state $\rho_{\mathcal{E}}$. Classical shadow tomography uses
randomized measurements to extract information from an unknown quantum state, and predicts
physical properties of the state efficiently by post-processing the collected measurement outcomes
on a classical computer. When the measurement bases are chosen to be Pauli or Clifford bases, the
post-processing can be made efficient. The code subspace projection and the powering of the density
matrix can both be implemented efficiently in the post-processing phase by classical computation,
given the Clifford nature of the classical shadows. In this way, we can implement the existing error
mitigation schemes with significantly reduced quantum and classical resources.
5.3.2 Procedure
Let us now introduce our technique, which we will refer to as logical shadow tomography (LST).
LST consists of the following steps¹ (see Fig. 5.1 for a graphical overview):
1. Given a $k$-qubit logical state, encode it into an $N$-qubit physical state by an $[[N, k]]$ stabilizer
code.
2. Perform the quantum information processing on the physical state; the resulting physical
state is denoted as $\rho_{\mathcal{E}}$. Due to the error accumulated in the processing, $\rho_{\mathcal{E}} = \mathcal{E}(\rho)$ may be
corrupted from the ideal result $\rho = |\psi\rangle\langle\psi|$ by some noisy quantum channel $\mathcal{E}$. The goal is
to mitigate the error for predicting logical observables based on the noisy physical state $\rho_{\mathcal{E}}$.
3. Perform shadow tomography on the noisy state.
3.1. Apply a randomly sampled unitary $U$ from a unitary ensemble $\mathcal{U}$ to the physical qubits.
Measure all physical qubits in the computational basis to obtain a bit-string $b \in \{0, 1\}^n$.
Store $U, b$.
3.2. Repeat step 3.1 $N$ times to obtain a data ensemble $\{(U_s, b_s)\}_{s=1}^{N}$ ($s$ labels the samples
in the ensemble).
4. Post-process the data on a classical computer.
4.1. Construct the classical shadow ensemble
$$\Sigma(\rho_{\mathcal{E}}) = \{\hat{\rho}_s = \mathcal{M}^{-1}(U_s^\dagger |b_s\rangle\langle b_s| U_s)\}_{s=1}^{N}, \tag{5.19}$$
where $\mathcal{M}^{-1}$ is the classical shadow reconstruction map that depends on the unitary
ensemble $\mathcal{U}$.
4.2. Given any logical observable $O$ (i.e. $[\Pi, O] = 0$), estimate the error-mitigated expectation
value by
$$\langle O\rangle_{\text{LST}} = \frac{\mathrm{Tr}(\Pi f(\rho_{\mathcal{E}}) \Pi^\dagger O)}{\mathrm{Tr}(\Pi f(\rho_{\mathcal{E}}) \Pi^\dagger)}, \tag{5.20}$$
where $\Pi$ is the code subspace projection operator defined in Eq. (5.16), and
$f(x) = \sum_{p=1}^{m} c_p x^p$ can be a generic polynomial function up to the power $m$. The proposed
QEM estimator $\langle O\rangle_{\text{LST}}$ in Eq. (5.20) combines the subspace expansion Eq. (5.17) and the
virtual distillation Eq. (5.18) approaches. In particular, the numerator $\mathrm{Tr}(\Pi f(\rho_{\mathcal{E}}) \Pi^\dagger O)$
is evaluated by
$$\sum_{p=1}^{m} c_p \mathop{\mathbb{E}}_{\{\hat{\rho}_s\} \in \Sigma(\rho_{\mathcal{E}})^{\times p}} \mathrm{Tr}\left(\Pi \prod_{s=1}^{p} \hat{\rho}_s \, \Pi^\dagger O\right), \tag{5.21}$$
and the denominator $\mathrm{Tr}(\Pi f(\rho_{\mathcal{E}}) \Pi^\dagger)$ is evaluated independently in a similar manner (by
replacing $O$ with $I$).

¹ Steps 2 and 3 can be aptly summarized by “perform shadow tomography [193] on the logical state,” whence
logical shadow tomography. We explain these steps in detail for completeness.
The map $\mathcal{M}^{-1}$ depends on the unitary ensemble $\mathcal{U}$, for which there are several proposals, e.g.
Pauli ensembles, random Clifford circuits, and chaotic dynamics [193, 194, 195, 196]. In this work,
we find that the sample complexity for predicting logical Pauli observables is the same when using a
full Clifford ensemble $\mathrm{C}\ell(2^N)$, as shown in Fig. 5.1(a), and a tensor product of Clifford ensembles
$\mathrm{C}\ell(2^{N/k})^{\otimes k}$, as shown in Fig. 5.1(b). In the following, we will focus on the scheme where each
logical qubit is encoded with an $[[n, 1]]$ error correction code and random unitaries from the Clifford
group $\mathrm{C}\ell(2^n)$ are applied to each logical qubit sector. The total number of physical qubits is $N = nk$.
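To make steps 3 and 4.1 concrete, the following sketch simulates classical shadows for a single unencoded qubit. It relies on assumptions that go beyond the text: for the global Clifford ensemble on $N$ qubits the standard reconstruction map of [193] is $\mathcal{M}^{-1}(X) = (2^N + 1)X - I$ for unit-trace $X$, and for one qubit a random Clifford followed by a computational-basis measurement is statistically equivalent to measuring a uniformly random Pauli axis. States and snapshots are stored as Bloch vectors, so the snapshot $3|s\rangle\langle s| - I$ has Bloch vector $\pm 3$ along the measured axis. Function names and the input state are hypothetical.

```python
import random

def sample_snapshot(bloch, rng):
    """One classical-shadow snapshot of a qubit, returned as a Bloch vector."""
    axis = rng.randrange(3)                      # random Pauli basis: X, Y or Z
    prob_plus = (1.0 + bloch[axis]) / 2.0        # Born rule for outcome +1
    outcome = 1 if rng.random() < prob_plus else -1
    snapshot = [0.0, 0.0, 0.0]
    snapshot[axis] = 3.0 * outcome               # inverse map: 3|s><s| - I
    return snapshot

def exact_shadow_mean(bloch):
    """Average snapshot over all bases and outcomes, computed exactly."""
    mean = [0.0, 0.0, 0.0]
    for axis in range(3):
        for outcome in (1, -1):
            prob = (1.0 / 3.0) * (1.0 + outcome * bloch[axis]) / 2.0
            mean[axis] += prob * 3.0 * outcome
    return mean

bloch = [0.3, -0.2, 0.8]       # hypothetical single-qubit state (Bloch vector)
rng = random.Random(1)
num_samples = 20000
snapshots = [sample_snapshot(bloch, rng) for _ in range(num_samples)]
estimate = [sum(s[a] for s in snapshots) / num_samples for a in range(3)]
print("exact mean of snapshots:", exact_shadow_mean(bloch))
print("sampled estimate:       ", estimate)
```

The exact average of the snapshots reproduces the input state, which is the defining property of the reconstruction map; the sampled estimate agrees up to statistical fluctuations.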
5.3.3 Analysis
In the previous section, we outlined the procedure of logical shadow tomography (LST). Here,
we analyze its performance from three perspectives: 1) error mitigation capability, 2) quantum
resources, and 3) classical resources. In the error mitigation capability subsection, we
show how errors are suppressed with the code distance $d$ of the error correction code and with powers
of the density matrix. In the quantum resources subsection, we show that the gate overhead is similar
to the original proposal of subspace expansion, except for a shallow Clifford circuit whose depth
does not depend on the number of logical qubits. Compared to virtual distillation, our method
only requires one copy of the physical system. In addition, we show that the sample complexity
is exponentially reduced compared to the direct implementation of subspace expansion for
estimating logical Pauli observables. In the classical resources subsection, we outline the general
classical algorithm for post-processing the data. In particular, we show that there exists a fast algorithm
for LST with $f(\rho_{\mathcal{E}}) = \rho_{\mathcal{E}}$ whose time complexity is $\mathcal{O}(N^3)$. This allows our method
to scale to large system sizes.
5.3.3.1 Error mitigation capability
Code space projection. Let $\mathcal{S}$ be the stabilizer code used in LST. For any correctable error $E$,
there exists a stabilizer generator $S$ such that $SE = -ES$ and so
$$\Pi E|\psi\rangle \propto (I + S)E|\psi\rangle = E(I - S)|\psi\rangle = E(|\psi\rangle - |\psi\rangle) = 0. \tag{5.22}$$
Analogously, if no error has occurred then $|\psi\rangle$ is a codeword and
$$\Pi|\psi\rangle = |\psi\rangle. \tag{5.23}$$
Thus the projector $\Pi$ discards results in which correctable errors have occurred [127]. The set of
correctable errors is determined by the chosen code. Assume a simple noise model where each
qubit is subjected to depolarizing noise with rate $p$. If a single error happens on one qubit, then
it can be projected out because a single-qubit Pauli operator is not in the stabilizer group. Such
errors are non-logical errors.
The code space projection fails when several local errors happen and together form a logical operator.
The probability of this failure scales as $p^d$, where $d$ is the code distance of the error correction
code. Mathematically, we can write
$$\rho_{\mathcal{E}} = (1 - p)^N \rho_0 + p\rho_1 + p^2 \rho_2 + \ldots, \tag{5.24}$$
where $\rho_0 = |\psi_0\rangle\langle\psi_0|$ is the ideal quantum state and $N$ is the system size. $\rho_1$ is the quantum state
subjected to an error on one qubit, i.e.
$$\rho_1 = (XI \cdots I)\rho_0 (XI \cdots I) + (IX \cdots I)\rho_0 (IX \cdots I) + \ldots \tag{5.25}$$
Similarly, $\rho_2$ is the quantum state subjected to errors on two qubits, i.e.
$$\rho_2 = (XIY \cdots I)\rho_0 (XIY \cdots I) + (IXX \cdots I)\rho_0 (IXX \cdots I) + \ldots \tag{5.26}$$
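Therefore, for any logical observable $O$, we have
$$\frac{\mathrm{Tr}(\Pi \rho_{\mathcal{E}} \Pi O)}{\mathrm{Tr}(\Pi \rho_{\mathcal{E}} \Pi)} = \langle\psi_0|O|\psi_0\rangle \left[1 + \mathcal{O}\left(p^d \, \frac{\mathrm{Tr}(\rho_d O)}{\langle\psi_0|O|\psi_0\rangle}\right)\right]. \tag{5.27}$$
More details can be found in Sec. 5.7.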
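Virtual distillation. For completeness, we review the theory of virtual distillation here. With
the spectral decomposition of the noisy density matrix, one can write
$$\rho_{\mathcal{E}} = p_0 |\psi_0\rangle\langle\psi_0| + p_1 |\psi_1\rangle\langle\psi_1| + \cdots + p_n |\psi_n\rangle\langle\psi_n|, \tag{5.28}$$
where $p_0 > \cdots > p_n$ and $\langle\psi_i|\psi_j\rangle = \delta_{ij}$. For shallow circuits, it is reasonable to assume that
$|\psi_0\rangle \equiv |\psi\rangle$ is the noiseless state. In this case, for any observable $O$ and positive integer $m$ we have [128, 135]
$$\frac{\mathrm{Tr}(\rho_{\mathcal{E}}^m O)}{\mathrm{Tr}(\rho_{\mathcal{E}}^m)} = \frac{p_0^m \langle\psi_0|O|\psi_0\rangle + \sum_{i \geq 1} p_i^m \langle\psi_i|O|\psi_i\rangle}{p_0^m + \sum_{i \geq 1} p_i^m} = \langle\psi|O|\psi\rangle \left[1 + \mathcal{O}\left(\left(\frac{p_1}{p_0}\right)^m \frac{\langle\psi_1|O|\psi_1\rangle}{\langle\psi|O|\psi\rangle}\right)\right]. \tag{5.29}$$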
Thus, computing the expectation value with the $m$th power of the state suppresses errors exponentially
in $m$, a phenomenon which can also be interpreted as artificially cooling the system [197].
Here we allow for an arbitrary function $f$ acting on the noisy state via its Taylor expansion
$f(x) = \sum_{p=1}^{m} c_p x^p$. Including sums of powers up to $m$ (instead of just the highest power $m$) was
shown to improve results of numerical experiments in [135].
LST (Combined approach). LST has the error mitigation capability of both code space projection
and virtual distillation. When we combine code space projection with virtual distillation, we expect
both the code distance $d$ of the error correction code and the $m$th power of the density matrix to suppress
the error. In particular, when $m = 2$ in $\mathrm{Tr}(\Pi \rho_{\mathcal{E}}^m \Pi O)$, i.e. one projects the squared density matrix to
the code space, the order of error suppression is improved from $\mathcal{O}(p^d)$ to $\mathcal{O}(p^{2d})$. In general, a higher
power $m$ leads to a stronger error mitigation effect. (See Sec. 5.7 for more details.)
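The improvement from $\mathcal{O}(p^d)$ to $\mathcal{O}(p^{2d})$ can be reproduced in a toy model. The sketch below rests on assumptions that differ from the setting of the text: it uses the three-qubit repetition code (distance 3 against bit flips only), the state $|000\rangle$ under independent bit flips rather than depolarizing noise, and the logical observable $\bar{Z} = ZZZ$. Everything is diagonal, so $\rho^m$ is obtained entrywise; the function name is hypothetical.

```python
from itertools import product

def lst_toy_estimate(p, m, n=3):
    """Tr(Pi rho^m Pi Zbar) / Tr(Pi rho^m Pi) for |0...0> under independent bit flips."""
    numerator = denominator = 0.0
    for bits in product((0, 1), repeat=n):
        weight = sum(bits)
        if weight not in (0, n):
            continue                  # Pi removes every state outside the code space
        entry = (p ** weight * (1 - p) ** (n - weight)) ** m   # diagonal entry of rho^m
        numerator += entry * (-1) ** weight                    # eigenvalue of Z...Z
        denominator += entry
    return numerator / denominator

p = 0.1
for m in (1, 2):
    error = 1.0 - lst_toy_estimate(p, m)      # ideal value is 1
    print(f"m = {m}: error = {error:.3e}, error / p^(3m) = {error / p ** (3 * m):.2f}")
```

In this model the residual error is $2p^{3m}/[(1-p)^{3m} + p^{3m}]$, so the ratio printed in the last column stays of order one while the error itself drops from about $10^{-3}$ to about $10^{-6}$.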
5.3.3.2 Quantum resources
Gate overhead. LST requires additional qubits and gates to encode the logical state, the exact
number of which depends on the chosen code. Restricting to stabilizer codes, the logical state
preparation only requires implementing Clifford gates, which are presumably easier to implement
on NISQ devices than a universal gate set. This encoding overhead
is the same as in the subspace expansion method. To perform the classical shadow tomography,
an element from a unitary ensemble $\mathcal{U}$ is appended to the quantum circuit before measuring all
qubits in the computational basis. If $\mathcal{U}$ is chosen to be a global Clifford ensemble, the added
circuit depth is $\mathcal{O}(N)$ with local random unitary gates [198], and the associated gate overhead
can be significant. If $\mathcal{U}$ is chosen to be a tensor-product Pauli ensemble, the added circuit
depth is $\mathcal{O}(1)$ [193] and the gate overhead is more affordable. However, Pauli measurements
increase the sample complexity exponentially for non-local observables. Facing this dilemma
between Clifford and Pauli measurements, a recent work [195] by some of the authors has developed
an efficient classical shadow tomography approach for finite-depth local Clifford circuits, which can
smoothly interpolate between the global Clifford and the local Pauli limits. Using shallow
(finite-depth) Clifford circuits for shadow tomography, the gate overhead can be significantly reduced
with only a mild increase in sample complexity [195]. This approach can be combined with our
error mitigation technique seamlessly to balance gate overhead against
sample complexity. For this work, we will use a tensor-product Clifford group $\mathrm{C}\ell(2^n)^{\otimes k}$ as shown
in Fig. 5.1(b). The added circuit depth is $\mathcal{O}(n)$, where $n$ is the number of physical qubits in the $[[n, 1]]$
code for one logical qubit. It is important to notice that the depth of the additional circuit is independent
of the number of logical qubits $k$.
Sample complexity. We now consider the number of measurements needed for classical
post-processing in LST. If one uses an $[[n, 1]]$ code for each logical qubit and a tensor-product Clifford
group $\mathrm{C}\ell(2^n)^{\otimes k}$, where $k$ is the number of logical qubits, we have the following theorem which dictates
the sample complexity of LST.
Theorem 11: Let $\rho$ be a quantum state of $k$ logical qubits, where each logical qubit is encoded with
an $[[n, 1]]$ code, and let $\Pi$ be the associated projection operator to the code subspace. Then one needs
$\mathcal{O}(\log(M)\, 4^k / \epsilon\delta^2)$ samples to produce estimates $\tilde{O}_i$ of $\mathrm{Tr}(\Pi\rho\Pi O_i)$, $i = 1, \ldots, M$, for
logical Pauli observables $O_i$ such that
$$\Pr\left(\left|\tilde{O}_i - \mathrm{Tr}(\Pi\rho\Pi O_i)\right| \geq \epsilon\right) \leq \delta. \tag{5.30}$$
The result (5.30) applies to both the numerator and denominator of the LST estimate Eq. (5.20)
separately. We emphasize that the number of samples does not depend on the total number of
physical qubits $nk$, but only on the number of logical qubits $k$, and scales as $\mathcal{O}(4^k)$.
Compared to the direct implementation of subspace expansion, whose sample complexity is $\mathcal{O}(2^{nk})$,
our method dramatically reduces the sample complexity.
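To get a feel for the gap between the two scalings, the sketch below tabulates $4^k$ against $2^{nk}$ for $n = 5$ (for example, the [[5, 1, 3]] code). It compares only the leading exponential factors quoted above; constants and the dependence on $M$, $\epsilon$ and $\delta$ are dropped, so the numbers are not sample counts. The function names are hypothetical.

```python
def lst_scaling(k):
    """Leading factor 4^k of the LST sample complexity (Theorem 11)."""
    return 4 ** k

def direct_scaling(n, k):
    """Leading factor 2^(n k) quoted for direct subspace expansion."""
    return 2 ** (n * k)

n = 5   # physical qubits per logical qubit
print(" k   4^k   2^(nk)   ratio")
for k in range(1, 5):
    ratio = direct_scaling(n, k) // lst_scaling(k)
    print(f"{k:2d} {lst_scaling(k):5d} {direct_scaling(n, k):8d} {ratio:7d}")
```

The ratio is $2^{(n-2)k}$, so the advantage grows both with the number of logical qubits and with the size of the code used per logical qubit.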
In Ref. [127], the authors also mention stochastic sampling of the stabilizer group elements
to reduce the sample complexity at the price of reduced error mitigation capability. Our approach
bypasses the need to sample all the elements in the stabilizer group, as we can implement the
projection operator directly and efficiently by data post-processing. Our advantage will be even
more apparent if one uses an $[[n, 1]]$ error correction code with larger $n$, which provides a larger code
distance and better error mitigation capability.
5.3.3.3 Classical resources
In this section, we show that there exists an efficient algorithm for classical post-processing. In particular,
for LST with $f(\rho_{\mathcal{E}}) = \rho_{\mathcal{E}}$ (no virtual distillation), the classical post-processing can be performed
with polynomial classical memory and time.
After sampling, we have the classical shadow Eq. (5.19) consisting of $N$ stabilizer states
$U^\dagger|b\rangle\langle b|U$, the observable $O$, and the projector $\Pi$. We need to estimate the numerator and
denominator of Eq. (5.20), both of which can be written as
$$\mathrm{Tr}[f(\rho_{\mathcal{E}})\Gamma] \tag{5.31}$$
where $\Gamma = \Pi O$ for the numerator and $\Gamma = \Pi$ for the denominator. Let $m$ be the highest degree of
the Taylor expansion of $f$. As in [193], we can evaluate the expectation of this term via
$$\mathbb{E}\,\mathrm{Tr}[\hat{\rho}_{i_1} \cdots \hat{\rho}_{i_m} \Gamma], \tag{5.32}$$
where $\hat{\rho}_{i_1}, \ldots, \hat{\rho}_{i_m}$ are independent samples. A single classical shadow $\hat{\rho}_i$ requires $\mathcal{O}(N^2)$ classical
memory to store (where $N = nk$ is the total number of qubits), so the argument inside the trace
in (5.32) requires $\mathcal{O}(N^2 m)$ classical memory [12]. In Sec. 5.6.1, we show that the evaluation of
(5.32) boils down to evaluating the general form
$$\mathrm{Tr}\left[\prod_{j=1}^{l} (a_j I + b_j M_j)\right], \tag{5.33}$$
where the $M_j$ are Pauli operators. This can be solved by finding the null space $\mathcal{N}_A$ of a binary
matrix representation of the Pauli operators $\{M_j\}$. By a simple counting argument (see Sec. 5.6.1
for more details), we see that the evaluation of Eq. (5.32) has time complexity upper bounded by
$\mathcal{O}\left(mN^2(mN + N - k + 1) + m|\mathcal{N}_A|\right)$ when $m \geq 2$, where $k$ is the number of logical qubits, $N = nk$
is the total number of physical qubits, $m$ is the power of the density matrix, and $|\mathcal{N}_A|$ is the volume of the
binary null space $\mathcal{N}_A$ determined by a set of classical shadows $\{\hat{\rho}_{i_1}, \ldots, \hat{\rho}_{i_m}\}$. When $m$ is large, the
null space $\mathcal{N}_A$ can be very large. In practice, the evaluation of (5.32) can therefore be slow, since the time
complexity contains a term proportional to $m|\mathcal{N}_A|$ for $m \geq 2$.
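The structure of (5.33) can be illustrated with a brute-force evaluation. Expanding the product gives one term per subset of factors, and a term has nonzero trace only if the ordered product of its Pauli strings is proportional to the identity; in the binary representation these subsets are exactly the null-space vectors mentioned above. The sketch below enumerates all $2^l$ subsets, so it is exponential in $l$ and is not the null-space algorithm of Sec. 5.6.1; all names are hypothetical.

```python
from itertools import combinations

# Single-qubit products of distinct Paulis: (left, right) -> (phase, result).
_TABLE = {
    ("X", "Y"): (1j, "Z"), ("Y", "Z"): (1j, "X"), ("Z", "X"): (1j, "Y"),
    ("Y", "X"): (-1j, "Z"), ("Z", "Y"): (-1j, "X"), ("X", "Z"): (-1j, "Y"),
}

def pauli_mul(left, right):
    """Multiply two Pauli strings; return (phase, string)."""
    phase, out = 1, []
    for a, b in zip(left, right):
        if a == "I":
            out.append(b)
        elif b == "I":
            out.append(a)
        elif a == b:
            out.append("I")
        else:
            factor, c = _TABLE[(a, b)]
            phase *= factor
            out.append(c)
    return phase, "".join(out)

def trace_of_product(terms, num_qubits):
    """Tr[prod_j (a_j I + b_j M_j)] with terms = [(a_j, b_j, M_j), ...] in product order."""
    identity = "I" * num_qubits
    total = 0
    for size in range(len(terms) + 1):
        for subset in combinations(range(len(terms)), size):
            coeff, pauli = 1, identity
            for j, (a, b, m) in enumerate(terms):
                if j in subset:                      # pick b_j M_j from factor j
                    phase, pauli = pauli_mul(pauli, m)
                    coeff *= b * phase
                else:                                # pick a_j I from factor j
                    coeff *= a
            if pauli == identity:                    # only identity strings have trace
                total += coeff * 2 ** num_qubits
    return total

print(trace_of_product([(1, 1, "X"), (1, 1, "Z"), (1, 1, "Y")], 1))
print(trace_of_product([(0.5, 0.5, "ZZ"), (0.5, 0.5, "ZZ")], 2))
```

The first call evaluates $\mathrm{Tr}[(I + X)(I + Z)(I + Y)]$, where only the empty subset and the full product $XZY = -iI$ contribute; the second is the trace of the rank-two projector $(I + ZZ)/2$ squared.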
We also find an improved classical algorithm for $m = 1$ with time complexity $\mathcal{O}(N^3)$. When
$m = 1$, we would like to evaluate $\mathrm{Tr}(\Pi\rho\Pi O)$, where $\rho$ is proportional to a stabilizer state, which
can be represented using a stabilizer tableau. The intuition behind the efficient algorithm is that a
stabilizer state can be efficiently projected by another stabilizer group projector by updating its
stabilizer tableau. We leave the details of the algorithm to Sec. 5.6.2.
In conclusion, at least for the $m = 1$ (no virtual distillation) case, the post-processing complexity
is polynomial, $\mathcal{O}(N^3)$, in the total qubit number $N$.
5.4 Numerical results
In Sec. 5.3.3, we discussed the error mitigation capabilities, quantum resources, and classical
resources of LST. One should notice that the discussion of sample complexity is mainly focused
on the estimation of $\mathrm{Tr}(\Pi\rho\Pi^\dagger O)$, while in practice we would like to estimate the ratio
$\mathrm{Tr}(\Pi\rho\Pi^\dagger O)/\mathrm{Tr}(\Pi\rho\Pi^\dagger)$. The sample complexity of the ratio does not have a closed form in
general. This problem is not unique to our approach. It also exists for subspace expansion [127]
and virtual distillation [128].
Nevertheless, in the following, we use numerical experiments to investigate the sample
complexity. We demonstrate the performance of LST through numerical simulation of large systems.
In particular, we find that LST outperforms the direct implementation of subspace expansion
and that the sample complexity scaling is very close to our theoretical prediction in the small noise region.
5.4.1 Pseudo-threshold with the [[5, 1, 3]] code
We first consider a simple example with one logical qubit encoded in five physical qubits with
the [[5, 1, 3]] stabilizer code. Each physical qubit is subjected to depolarizing noise with error
rate $p$. The same model is considered in the subspace expansion literature [127], which shows a
pseudo-threshold of $p = 0.5$. Here, we want to compare the practical performance of logical shadow
tomography and direct measurement by subspace expansion.

[Figure 5.2 plot omitted. Horizontal axis: depolarizing error rate p (0.1 to 0.5); vertical axis: infidelity (0.05 to 0.40). Legend: physical, LST (analytical), LST 2 (analytical), Direct, LST, LST 2.]
Figure 5.2: LST with the [[5, 1, 3]] code. Here, $|\psi\rangle$ is taken to be the logical $|\bar{0}\rangle$ and we estimate the
infidelity $1 - F$ with samples. The dashed black line shows the physical infidelity, i.e., the noisy
expectation value of a single qubit without any encoding. The green and blue dashed lines are the analytical
performance of logical shadow tomography with $f(\rho_{\mathcal{E}}) = \rho_{\mathcal{E}}$ and $f(\rho_{\mathcal{E}}) = \rho_{\mathcal{E}}^2$, respectively. The
red dots and red shaded area indicate the mean value and standard deviation of error mitigation
with $f(\rho_{\mathcal{E}}) = \rho_{\mathcal{E}}$ by direct implementation of subspace expansion with 3000 measurements. The
green line and green shaded area indicate the mean value and standard deviation with $f(\rho_{\mathcal{E}}) = \rho_{\mathcal{E}}$
and 3000 measurements by LST. The performance of LST with $f(\rho_{\mathcal{E}}) = \rho_{\mathcal{E}}^2$ is indicated by the
blue line and blue shaded area.
We evaluate Eq. (5.20) with $f(\rho_{\mathcal{E}}) = \rho_{\mathcal{E}}$ and $f(\rho_{\mathcal{E}}) = \rho_{\mathcal{E}}^2$. The results are shown in Fig. 5.2.
Here, the dashed black line shows the infidelity without any error mitigation (the “physical”
curve), and the dashed colored lines show the infidelity using LST. We see that the LST estimates
have lower infidelity than the physical curve, showing that errors are indeed being mitigated.
LST with $f(\rho_{\mathcal{E}}) = \rho_{\mathcal{E}}^2$ outperforms LST with $f(\rho_{\mathcal{E}}) = \rho_{\mathcal{E}}$, showing that the combination of
codespace projection and virtual distillation outperforms only projecting into the codespace.
This phenomenon agrees with the behavior of the error mitigation capability $p^{md}$ with $m = 1, 2$
argued in Sec. 5.3.3.1 (see Sec. 5.7 for proof details). In addition to the expected performance, we also
care about sample efficiency, since one major contribution of our work is showing the exponential
reduction in sample complexity with LST.
To show this practical advantage, we collect 3000 measurement results and use the data to
estimate the error-mitigated value. The colored lines/points show the mean values of the estimates
and the colored regions show the standard deviation about the mean. For 𝑓(𝜌_E) = 𝜌_E, LST is the
same as the subspace expansion. If we implement the direct subspace expansion by measuring every
Pauli observable, each Pauli observable is measured around 100 times. The red points/line and red
shaded region show the result of the direct implementation of subspace expansion. Because each
Pauli observable is measured only a small number of times, a small error in the denominator can
cause a large error in the ratio, and the standard deviation is correspondingly large. The mean
estimate and standard deviation of LST are shown as the green points/line and green shaded region.
In contrast to the direct implementation of subspace expansion, LST has a much smaller fluctuation
with the same number of measurements. This shows the practical advantage of our method. In
addition, the blue points/line and blue shaded region show the result of LST with
𝑓(𝜌_E) = 𝜌_E^2. With the same amount of data, it suppresses the noise even further.
5.4.2 Convergence vs. code size
As we have pointed out in Sec. 5.3.3.2, the sample complexity for estimating Tr(Π𝜌Π†𝑂) for logical
Pauli observables 𝑂 scales only with the number of logical qubits 𝑘 as 𝑂(4^𝑘), and does not scale
with the number of encoding qubits 𝑛 of the [[𝑛, 1]] code used for each logical qubit. In practice,
we estimate error-mitigated values as a ratio, i.e., Tr(Π𝜌Π†𝑂)/Tr(Π𝜌Π†). Since there is no closed
form for the statistical fluctuation of this ratio, we investigate its sample complexity via
numerical simulation. Interestingly, we find the sample complexity of the ratio agrees well
with our theoretical analysis in the small-noise region.
[Figure 5.3 plots: panels (a), (b), (c). Axis labels recovered from the extraction: Fidelity,
Number of Samples 𝑁, S.D., 𝑛.]
Figure 5.3: Scaling study of LST with 𝑓(𝜌𝜖) = 𝜌𝜖. In all figures, each physical qubit is subjected
to 1% depolarizing noise. (a) LST estimated fidelity vs. number of samples from 10^2 to 10^5 with
various [[𝑛, 1]] code sizes. The noiseless fidelity value of 1.0 is shown with the dashed black line.
For all code sizes up to 𝑛 = 60 physical qubits, the LST estimate converges to the true noiseless
value. Codes used are the minimum distance constructions from [9]. (b) Standard deviation vs.
number of physical qubits 𝑛. The standard deviation of the estimate does not scale with the number
of encoding physical qubits. (c) Mean value and standard deviation scaling vs. number of logical
qubits 𝑘. Each logical qubit is encoded with the [[5, 1, 3]] code, and the state is prepared as the
logical GHZ state |0̄ . . . 0̄⟩ + |1̄ . . . 1̄⟩. We see the standard deviation scales exponentially
with the number of logical qubits 𝑘, as predicted.
We now consider [[𝑛, 1]] codes for one logical qubit and vary the number of physical qubits
from 𝑛 = 10 to 𝑛 = 60. Each physical qubit is subjected to 1% depolarizing noise in all the
following experiments. The LST estimated fidelity (using 𝑓(𝜌𝜖) = 𝜌𝜖 for all code sizes) vs. number
of samples is shown in Fig. 5.3 (a). Using a relatively small number of samples (at most 10^5), all
LST values converge to the noiseless fidelity. Note that the direct implementation of subspace
expansion with the full projection Π as used here would require at least 2^59 samples, a number
infeasible to implement in any practical experiment. In addition, we estimate the standard
deviation of each predicted point by LST using the bootstrap method, and the result is shown in
Fig. 5.3 (b). We see the standard deviation does not show a strong dependence on the number of
encoding qubits 𝑛, which indicates that the sample complexity does not increase much if one
increases 𝑛.
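The bootstrap estimate of the standard deviation mentioned above can be sketched as follows. This
is a minimal illustration, not the code used to produce Fig. 5.3: it assumes each shot yields a
paired single-shot estimate of the numerator Tr(Π𝜌Π†𝑂) and of the denominator Tr(Π𝜌Π†), and the
function name `bootstrap_ratio_std`, the number of resamples, and the seeds are hypothetical
choices made for the example.

```python
import random
import statistics


def bootstrap_ratio_std(numerators, denominators, n_resamples=200, seed=0):
    """Bootstrap standard deviation of the ratio estimator mean(num) / mean(den).

    numerators[i] and denominators[i] are assumed to come from the same shot,
    so they are resampled together (paired bootstrap).
    """
    if len(numerators) != len(denominators) or not numerators:
        raise ValueError("need two non-empty sequences of equal length")
    rng = random.Random(seed)
    n = len(numerators)
    ratios = []
    for _ in range(n_resamples):
        # Draw n shot indices with replacement and recompute the ratio.
        idx = [rng.randrange(n) for _ in range(n)]
        num = sum(numerators[i] for i in idx)
        den = sum(denominators[i] for i in idx)
        if den != 0:  # skip degenerate resamples with a vanishing denominator
            ratios.append(num / den)
    return statistics.pstdev(ratios)


# Synthetic paired data, for illustration only (not thesis data).
_rng = random.Random(1)
example_den = [1.0 + 0.1 * _rng.gauss(0, 1) for _ in range(500)]
example_num = [0.9 * d + 0.05 * _rng.gauss(0, 1) for d in example_den]
print(bootstrap_ratio_std(example_num, example_den))
```

Resampling numerator and denominator together keeps their correlation, which matters because the
fluctuation of a ratio depends on the covariance of its two parts.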
In addition, we also study how the sample complexity scales with the number of logical qubits 𝑘.
In particular, we use the [[5, 1, 3]] code to encode each logical qubit and prepare a
multi-logical-qubit GHZ state, (|0̄ . . . 0̄⟩ + |1̄ . . . 1̄⟩)/√2. In Fig. 5.3 (c), the blue
dots/line show the estimated mean value for the logical operator X̄_1 ⊗ · · · ⊗ X̄_𝑘, and the blue
shaded area indicates the standard deviation. In the inset, we see the standard deviation increases
exponentially with the number of logical qubits 𝑘. This result also agrees well with our sample
complexity analysis, even though the analysis focuses on estimating Tr(Π𝜌Π†𝑂).
Through large-scale numerical simulation, we find that LST indeed outperforms previous methods
in terms of sample complexity. Interestingly, we also find that the sample complexity agrees well
with the theoretical analysis in the small-noise region. This indicates that the sample complexity
of LST does not scale much with the number of physical encoding qubits 𝑛 of the [[𝑛, 1]] error
correction code and scales only with the number of logical qubits 𝑘.
5.5 Discussion
We have presented a procedure for estimating error-mitigated observables on noisy quantum com-
puters. Our procedure is flexible enough to be performed on virtually any quantum computer: the
only additional quantum resources needed are qubits and Clifford gates for encoding the logical
state. After sampling from the logical state, a classical computer processes the obtained classical
shadow to return the error-mitigated expectation value. In the analysis, we show that if the error
correction code has code distance 𝑑 and 𝑓(𝜌_E) = 𝜌_E for LST, then the error is suppressed to
O(𝑝^𝑑), assuming independent depolarizing noise with rate 𝑝 on each physical qubit. We also show
that higher powers of the density matrix further improve the performance. For sample complexity,
we show it scales only with the number of logical qubits 𝑘 and not with the number of encoding
physical qubits 𝑛. This result is also supported by large-scale numerics. In addition, we provide
efficient classical algorithms to post-process the classical shadow data and remark that this
post-processing can be easily parallelized for practical efficiency in real-world experiments.
With respect to error mitigation, our procedure provides an experimentally simple procedure
to carry out proposed error mitigation techniques [127, 128] at scale. In particular, we have
demonstrated subspace expansion with up to 𝑛 = 60 physical qubits encoding a single logical qubit,
i.e., a stabilizer group with 2^59 elements, an experiment which would be practically infeasible
with the direct or stochastic sampling schemes of [127, 135]. We have also implemented virtual
distillation [128] without expensive swap operations to compute powers of the density matrix.
Rather, our procedure uses the same quantum circuit to evaluate the error-mitigated expectation
with any function 𝑓 (𝜌 E ) of the noisy state 𝜌 E ; the only difference is in classical post-processing
(and number of shadows needed). Note that virtual distillation without subspace expansion can be
implemented in our protocol by using a trivial code (i.e., not encoding a logical state). Further,
beyond making both of these techniques significantly more practical to implement, our procedure
enables them to be composed with one another, and we have shown numerically the composition of
both techniques results in further reduction of errors. Additional error mitigation techniques which
act on the noisy state, e.g., those in [135], may also be implementable with our framework.
In our analysis, we assumed the Clifford circuit in the classical shadow part is noise-free. If
there is noise in the Clifford circuit part, it can be mitigated provided the noise is independent
of the Clifford gates, as in [199, 200, 201], where a similar idea was used for randomized
benchmarking [202, 203]. Since it was proposed in [193], shadow tomography has found a number of
applications in quantum information processing, including the recently proposed process
tomography [204] and avoiding barren plateaus in variational quantum circuits [205]. This work
constitutes an application in the error mitigation realm. We are optimistic our procedure will be
effective on current and near-term quantum computers for a variety of experiments on relatively
large systems.
5.6 Stabilizer algorithms
5.6.1 Evaluating the trace in Eq. (5.21)
In this section, we explain an efficient approach to evaluate the trace in Eq. (5.21). For the
random Clifford ensemble, the reconstruction map reads
$\hat\rho = \mathcal{M}^{-1}(\hat\sigma) = (2^n + 1)\hat\sigma - I$, such that
\[ \mathbb{E}_{\hat\rho}\, \mathrm{Tr}\Big( \Pi \prod_{s=1}^{m} \hat\rho_s \, \Pi^\dagger O \Big)
   = \sum_{q=0}^{m} \binom{m}{q} (2^n + 1)^q (-1)^{m-q}\,
     \mathbb{E}_{\hat\sigma}\, \mathrm{Tr}\Big( \Pi \prod_{s=1}^{q} \hat\sigma_s \, \Pi^\dagger O \Big).
   \tag{5.34} \]
Notice that the projection operator $\Pi = \prod_{j=1}^{n-k} (I + S_j)/2$ and the snapshot state
$\hat\sigma = U^\dagger |b\rangle\langle b| U = \prod_{i=1}^{n} (I + b_i U^\dagger Z_i U)/2$ both
take the form of stabilizer states. So the problem boils down to evaluating the trace of the
following general form
\[ \mathrm{Tr} \prod_{j=1}^{l} (a_j I + b_j M_j), \tag{5.35} \]
where 𝑀_𝑗 are Pauli operators and 𝑎_𝑗, 𝑏_𝑗 are real coefficients. As we expand the product, the
only terms that survive the trace are those in which the Pauli operators multiply to the identity
operator. To find these combinations of Pauli operators, we can first encode every Pauli operator
𝑀_𝑗 as a binary vector following
\[ M_j = i^{\sum_{i=1}^{n} \xi_{ij}\zeta_{ij}} \prod_{i=1}^{n} X_i^{\xi_{ij}} \prod_{i=1}^{n} Z_i^{\zeta_{ij}}
   \;\to\; \begin{pmatrix} \vdots \\ \xi_{ij} \\ \zeta_{ij} \\ \vdots \end{pmatrix}, \tag{5.36} \]
where 𝜉_{𝑖𝑗}, 𝜁_{𝑖𝑗} ∈ {0, 1} are binary variables. Arranging the binary vector representations of
all 𝑀_𝑗 as column vectors, together they form a 2𝑛 × 𝑙 matrix, denoted as 𝐴. Each combination of
Pauli operators 𝑀_𝑗 that multiplies to the identity corresponds to a binary null vector 𝑥 of the
binary matrix 𝐴, i.e., 𝐴𝑥 = 0 (modulo 2). The null vectors form the null space of 𝐴, denoted as
N_𝐴. The null space of the binary matrix 𝐴 can be found by Gaussian elimination, with time
complexity O(𝑛𝑙 × min(2𝑛, 𝑙)). For 𝑥 ∈ N_𝐴,
\[ \prod_{j=1}^{l} (M_j)^{x_j} = z(x) I, \tag{5.37} \]
which defines the phase factor 𝑧(𝑥) given 𝑥. Then the trace in Eq. (5.35) is given by
\[ \mathrm{Tr} \prod_{j=1}^{l} (a_j I + b_j M_j)
   = 2^n \sum_{x \in \mathcal{N}_A} z(x) \prod_{j=1}^{l} a_j^{1-x_j} b_j^{x_j}. \tag{5.38} \]
Therefore the time complexity of evaluating the general trace in Eq. (5.38) is
O(𝑛𝑙 × min(2𝑛, 𝑙) + |N_𝐴|), where |N_𝐴| is the volume of the null space N_𝐴, which is determined
by the set of Pauli operators {𝑀_𝑖}. Applying this result to Eq. (5.34), the time complexity for
evaluating $\mathrm{Tr}\big(\Pi \prod_{s=1}^{m} \hat\rho_s \Pi^\dagger O\big)$ is upper bounded by
O(𝑚𝑛𝑙 × min(2𝑛, 𝑙) + 𝑚|N_𝐴|), with 𝑙 = 𝑚𝑛 + 𝑛 − 𝑘 + 1. For large 𝑚, the volume of the null space
|N_𝐴| can be troublesome, as it can scale exponentially with 𝑚. Fortunately, for 𝑚 = 1 there
exists a more efficient polynomial-time algorithm, which is described in Sec. 5.6.2.
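The null-space evaluation of Eq. (5.38) can be illustrated with a short sketch. The code below is
an assumption-laden illustration rather than the thesis implementation: Pauli strings are given as
text labels such as "XZIY", phases are tracked as powers of i, and all function names
(`pauli_from_label`, `pauli_mul`, `null_space_basis`, `trace_of_product`) are hypothetical. It
enumerates the whole null space, so its cost grows with |N_𝐴| as stated above.

```python
PHASES = (1, 1j, -1, -1j)  # i**k for k = 0, 1, 2, 3


def pauli_from_label(label):
    """Encode a Pauli string as (k, x, z), meaning i**k * prod_q X_q**x_q Z_q**z_q,
    with the binary vectors x and z stored as bit masks."""
    k = x = z = 0
    for q, ch in enumerate(label):
        if ch not in "IXYZ":
            raise ValueError(f"unknown Pauli symbol {ch!r}")
        if ch in "XY":
            x |= 1 << q
        if ch in "ZY":
            z |= 1 << q
        if ch == "Y":
            k += 1  # Y = i X Z
    return k % 4, x, z


def pauli_mul(p1, p2):
    """Product of two encoded Pauli strings with the phase tracked exactly."""
    k1, x1, z1 = p1
    k2, x2, z2 = p2
    # Moving Z**z1 past X**x2 contributes (-1)**(z1 . x2) = i**(2 z1 . x2).
    swaps = bin(z1 & x2).count("1")
    return (k1 + k2 + 2 * swaps) % 4, x1 ^ x2, z1 ^ z2


def null_space_basis(columns):
    """Basis (bit masks over column indices) of {x : A x = 0 mod 2}, where
    column j of A is the integer columns[j]. Uses Gaussian elimination."""
    pivots = {}  # leading bit -> (reduced column, mask of the columns combined)
    basis = []
    for j, vec in enumerate(columns):
        comb = 1 << j
        while vec:
            lead = vec.bit_length() - 1
            if lead not in pivots:
                pivots[lead] = (vec, comb)
                break
            pvec, pcomb = pivots[lead]
            vec ^= pvec
            comb ^= pcomb
        else:
            basis.append(comb)  # column j depends on earlier columns
    return basis


def trace_of_product(n, factors):
    """Tr prod_j (a_j I + b_j M_j) on n qubits, following Eq. (5.38).

    factors is a list of (a_j, b_j, label_j), each label an n-letter Pauli string.
    """
    if any(len(label) != n for _, _, label in factors):
        raise ValueError("every Pauli label must have exactly n letters")
    paulis = [pauli_from_label(label) for _, _, label in factors]
    columns = [x | (z << n) for _, x, z in paulis]
    basis = null_space_basis(columns)
    total = 0
    for choice in range(1 << len(basis)):  # enumerate the whole null space
        x_mask = 0
        for i, vec in enumerate(basis):
            if (choice >> i) & 1:
                x_mask ^= vec
        prod, coeff = (0, 0, 0), 1
        for j, (a, b, _) in enumerate(factors):
            if (x_mask >> j) & 1:
                prod = pauli_mul(prod, paulis[j])
                coeff *= b
            else:
                coeff *= a
        total += PHASES[prod[0]] * coeff  # the product equals z(x) * I here
    return (2 ** n) * total


# Overlap of the single-qubit states |+> and |0>: Tr[(I + X)/2 (I + Z)/2].
print(trace_of_product(1, [(0.5, 0.5, "X"), (0.5, 0.5, "Z")]))
```

Storing the binary vectors as integers keeps the elimination step to a few XOR operations per
column; a production implementation would more likely use a packed binary matrix.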
5.6.2 Efficient projection of a stabilizer state
As shown in Eq. (5.36), any Pauli string operator can be represented by a pair of binary vectors 𝑥
and 𝑧, with 𝑥_𝑖, 𝑧_𝑖 ∈ {0, 1} for 𝑖 = 1, . . . , 𝑁, where 𝑁 is the total number of qubits,
\[ \sigma_{(x,z)} = i^{x \cdot z} \prod_{i=1}^{N} X_i^{x_i} Z_i^{z_i}, \tag{5.39} \]
where 𝑋_𝑖, 𝑍_𝑖 are Pauli operators, and 𝑥_𝑖, 𝑧_𝑖 are binary values. The multiplication of two Pauli
operators can be represented as
\[ \sigma_{(x,z)}\, \sigma_{(x',z')} = i^{p(x,z;x',z')}\, \sigma_{(x+x',\,z+z')\%2}, \tag{5.40} \]
where the phase factor is
\[ p(x,z;x',z') = \sum_{i=1}^{N} \left[ z_i x_i' - x_i z_i'
   + 2 (z_i + z_i') \left\lfloor \frac{x_i + x_i'}{2} \right\rfloor
   + 2 (x_i + x_i') \left\lfloor \frac{z_i + z_i'}{2} \right\rfloor \right] \bmod 4. \tag{5.41} \]
Here ⌊·⌋ denotes the floor function, so each floor term is nonzero only when both bits equal 1.
Any two Pauli strings either commute or anti-commute,
\[ \sigma_{(x,z)}\, \sigma_{(x',z')} = (-)^{c(x,z;x',z')}\, \sigma_{(x',z')}\, \sigma_{(x,z)}, \tag{5.42} \]
where the anticommutation indicator 𝑐 has a simpler form
\[ c(x,z;x',z') = \frac{p(x,z;x',z') - p(x',z';x,z)}{2}
   = \sum_{i=1}^{N} \left( z_i x_i' - x_i z_i' \right) \bmod 2. \tag{5.43} \]
Therefore, the complexity of calculating the anticommutation indicator is O(𝑁). The binary vectors
𝑥 and 𝑧 can be interleaved into a 2𝑁-component vector 𝑔 = (𝑥_0, 𝑧_0, 𝑥_1, 𝑧_1, · · · ), which forms
the binary representation of a Pauli operator 𝜎_𝑔.
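As a small illustration of Eq. (5.43) and of the interleaved representation, the following sketch
computes the anticommutation indicator from binary vectors in O(𝑁) time. The helper names
(`to_binary`, `anticommutation_indicator`, `interleave`) and the use of text labels for Pauli
strings are assumptions made for the example.

```python
def to_binary(label):
    """Binary vectors (x, z) of a Pauli string, e.g. "XYZ" -> ([1, 1, 0], [0, 1, 1])."""
    x = [1 if ch in "XY" else 0 for ch in label]
    z = [1 if ch in "ZY" else 0 for ch in label]
    return x, z


def anticommutation_indicator(x, z, xp, zp):
    """c(x, z; x', z') = sum_i (z_i x'_i - x_i z'_i) mod 2, as in Eq. (5.43).

    Returns 1 if the two Pauli strings anticommute and 0 if they commute.
    """
    return sum(zi * xpi - xi * zpi for xi, zi, xpi, zpi in zip(x, z, xp, zp)) % 2


def interleave(x, z):
    """Interleaved 2N-component vector g = (x_0, z_0, x_1, z_1, ...)."""
    return [bit for pair in zip(x, z) for bit in pair]


# Two generators of the [[5, 1, 3]] code written in a common cyclic form.
print(anticommutation_indicator(*to_binary("XZZXI"), *to_binary("IXZZX")))
```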
Figure 5.4: Data structure of a stabilizer state. Each Pauli string is represented as a binary
vector. The first 𝑁 rows store the stabilizers of the state, and the second 𝑁 rows store the
destabilizers of the state.
In Fig. 5.4, each row is a binary representation of a Pauli string. For a Hilbert space of
dimension 2^𝑁, we can find at most 𝑁 stabilizers {𝑆_𝑖, 𝑖 = 1 . . . 𝑁} and 𝑁 destabilizers
{𝐷_𝑖, 𝑖 = 1 . . . 𝑁}, with [𝑆_𝑖, 𝑆_𝑗] = 0, [𝐷_𝑖, 𝐷_𝑗] = 0, and 𝑆_𝑖 anticommuting with 𝐷_𝑗 if and
only if 𝑖 = 𝑗. Fig. 5.4 is called the stabilizer tableau of the stabilizer state. We use 𝑟 in
Fig. 5.4 to indicate the log-2 rank of the density matrix. For a pure stabilizer state, 𝑟 = 0. If
𝑟 > 0, then the stabilizer tableau represents a mixed state and only part of the Hilbert space is
stabilized.
A stabilizer projector $\Pi = \prod_{k=1}^{l} (I + G_k)/2$ can also be represented as an 𝑙-row
tableau (stabilizers only). We now present an efficient algorithm to calculate Tr(Π𝜌Π) and to
update the stabilizer tableau of a full rank stabilizer state 𝜌 according to the projection Π.
Outline of the algorithm: First, we set trace = 1. Then we scan over every observable 𝐺_𝑘 among
the generators of the projector. For each 𝐺_𝑘, we scan over all operators in the stabilizer
tableau and distinguish two cases:
1. 𝐺_𝑘 anticommutes with at least one active stabilizer (the first of them being 𝑆_𝑝) → 𝐺_𝑘 is an
error operator that takes the state out of the code subspace → the stabilizer tableau needs to be
updated according to 𝐼 ± 𝐺_𝑘, and trace is multiplied by 1/2.
2. Otherwise, 𝐺_𝑘 is in the stabilizer group generated by {𝑆_𝑘}. If the phase factor is
compatible, then the stabilizer state is an eigenstate of (𝐼 ± 𝐺_𝑘)/2 with eigenvalue 1, i.e.,
(𝐼 ± 𝐺_𝑘)|𝜓⟩/2 = |𝜓⟩; if the phase factor is incompatible, then (𝐼 ± 𝐺_𝑘)|𝜓⟩/2 = 0.
We see this algorithm has a double loop of O(𝑁) items, and each anticommutation check takes
O(𝑁) time. Therefore, the time complexity of this algorithm is O(𝑁^3), where 𝑁 is the total number
of qubits in the system.
5.7 Error mitigation capability
In Eq. (5.24), we stated that the density matrix subjected to depolarizing noise is naturally in
spectral-decomposition form. Here, we provide a detailed proof. Let |𝜓0⟩ be the ideal quantum state
encoded with an [[𝑛, 𝑘, 𝑑]] error correction code, and let {|𝑖̄⟩, 𝑖 = 1, . . . , 2^𝑘} be an
orthonormal basis for the logical space. In general,
$|\psi_0\rangle = \sum_{i=1}^{2^k} c_i |\bar{i}\rangle$. We assume simple depolarizing errors on
each physical qubit. Then
\[ \rho_\epsilon = (1 - p)^N |\psi_0\rangle\langle\psi_0| + p \rho_1 + p^2 \rho_2 + \cdots, \tag{5.44} \]
where 𝜌_𝑖 is the density matrix with 𝑖 local errors, and 𝑁 is the system size. For example,
𝜌1 =(𝑋 𝐼 𝐼 · · · 𝐼)|𝜓0 ⟩⟨𝜓0 |(𝑋 𝐼 𝐼 · · · 𝐼) + (𝑌 𝐼 𝐼 · · · 𝐼)|𝜓0 ⟩⟨𝜓0 |(𝑌 𝐼 𝐼 · · · 𝐼)
+ (𝑍 𝐼 𝐼 · · · 𝐼)|𝜓0 ⟩⟨𝜓0 |(𝑍 𝐼 𝐼 · · · 𝐼) + (𝐼 𝑋 𝐼 · · · 𝐼)|𝜓0 ⟩⟨𝜓0 |(𝐼 𝑋 𝐼 · · · 𝐼) (5.45)
+ · · · + (𝐼 𝐼 · · · 𝐼 𝑍)|𝜓0 ⟩⟨𝜓0 |(𝐼 𝐼 · · · 𝐼 𝑍),
and
𝜌2 =(𝑋 𝑋 𝐼 · · · 𝐼)|𝜓0 ⟩⟨𝜓0 |(𝑋 𝑋 𝐼 · · · 𝐼) + (𝑋𝑌 𝐼 · · · 𝐼)|𝜓0 ⟩⟨𝜓0 |(𝑋𝑌 𝐼 · · · 𝐼)
(5.46)
+ · · · + (𝐼 · · · 𝑍 𝑍)|𝜓0 ⟩⟨𝜓0 |(𝐼 · · · 𝑍 𝑍).
We define the support of a Pauli string as the number of its Pauli operators that are not the
identity operator. For example, the support of the Pauli string 𝑋𝐼𝑍𝑌𝐼 is three. Let 𝑃_𝑙 be a
Pauli string operator with non-trivial support 𝑙. Then any term in 𝜌_𝑙 can be written as
𝑃_𝑙|𝜓0⟩⟨𝜓0|𝑃_𝑙 for some 𝑃_𝑙. It is easy to check that Π𝑃_𝑙|𝜓0⟩⟨𝜓0|𝑃_𝑙Π = 0 for any 0 < 𝑙 < 𝑑.
By definition of the code distance 𝑑, any 𝑃_𝑙 with 0 < 𝑙 < 𝑑 is not a logical operator and, for a
non-degenerate code, is not in the stabilizer group. Therefore, it must anti-commute with some
stabilizer generator 𝑆, such that {𝑆, 𝑃_𝑙} = 0. We write Π = Π′(𝐼 + 𝑆), where Π′ includes all
other stabilizer generators. Then
\[ \Pi P_l |\psi_0\rangle = \Pi' (I + S) P_l |\psi_0\rangle = \Pi' P_l (I - S) |\psi_0\rangle
   = \Pi' P_l (|\psi_0\rangle - |\psi_0\rangle) = 0, \tag{5.47} \]
and
\[ \Pi P_l |\psi_0\rangle\langle\psi_0| P_l \Pi = 0 \quad (l < d). \tag{5.48} \]
Therefore, we conclude Π𝜌_𝑙Π = 0 for any 0 < 𝑙 < 𝑑, and
\[ \frac{\mathrm{Tr}(\Pi \rho_{\mathcal{E}} \Pi O)}{\mathrm{Tr}(\Pi \rho_{\mathcal{E}} \Pi)}
   = \langle\psi_0|O|\psi_0\rangle \left[ 1 + O\!\left( p^d\,
     \frac{\mathrm{Tr}(\rho_d O)}{\langle\psi_0|O|\psi_0\rangle} \right) \right]. \tag{5.49} \]
Now we prove by contradiction that the leading order correction of Π𝜌𝜖^2Π is of order O(𝑝^{2𝑑}).
Suppose the leading order correction is of order O(𝑝^𝑠) with 𝑠 < 2𝑑; then there exist 𝑃_𝑙 and
𝑃_𝑟 with 𝑙 + 𝑟 = 𝑠 < 2𝑑 such that
\[ \Pi (P_l |\psi_0\rangle\langle\psi_0| P_l)(P_r |\psi_0\rangle\langle\psi_0| P_r) \Pi
   = \langle\psi_0| P_l P_r |\psi_0\rangle\, (\Pi P_l |\psi_0\rangle)(\langle\psi_0| P_r \Pi) \neq 0. \tag{5.50} \]
This requires Π𝑃_𝑙|𝜓0⟩ ≠ 0 and Π𝑃_𝑟|𝜓0⟩ ≠ 0. From Eq. (5.48), we know this requires 𝑙 ≥ 𝑑 and
𝑟 ≥ 𝑑, which contradicts 𝑙 + 𝑟 < 2𝑑. Therefore, we conclude the leading order correction of
Tr(Π𝜌𝜖^2Π𝑂) is of order O(𝑝^{2𝑑}).
For higher powers Π𝜌𝜖^𝑚Π, one may expect the leading order correction to be O(𝑝^{𝑚𝑑}).
Depending on the particular logical state |𝜓0⟩ and error correction code, the performance may
or may not reach O(𝑝^{𝑚𝑑}), because there can exist shortcuts that make the leading order
correction larger than O(𝑝^{𝑚𝑑}). In practice, we do observe that the performance improves with
larger 𝑚.
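The key step behind Eq. (5.47), namely that every Pauli string of support below 𝑑 anticommutes
with some stabilizer generator, can be checked by brute force for the [[5, 1, 3]] code. The sketch
below is only an illustration: it assumes the commonly used cyclic generators XZZXI, IXZZX, XIXZZ,
ZXIXZ (the thesis does not list its generators), and the function names are hypothetical.

```python
from itertools import product

# Commonly used generators of the five-qubit [[5, 1, 3]] code (assumed here).
GENERATORS = ("XZZXI", "IXZZX", "XIXZZ", "ZXIXZ")


def anticommute(p, q):
    """Two Pauli strings anticommute iff they differ on an odd number of
    positions where both are non-identity."""
    clashes = sum(1 for a, b in zip(p, q) if a != "I" and b != "I" and a != b)
    return clashes % 2 == 1


def weight(p):
    """Support of a Pauli string: the number of non-identity factors."""
    return sum(1 for a in p if a != "I")


def commuting_paulis_by_weight(generators, n):
    """For each weight, count the non-identity Pauli strings that commute with
    every generator; only these can survive the codespace projector."""
    counts = {}
    for letters in product("IXYZ", repeat=n):
        p = "".join(letters)
        w = weight(p)
        if w > 0 and not any(anticommute(p, g) for g in generators):
            counts[w] = counts.get(w, 0) + 1
    return counts


counts = commuting_paulis_by_weight(GENERATORS, 5)
print(sorted(counts.items()))
```

No string of weight 1 or 2 appears in the result, so every such error is removed by the projector,
consistent with the O(𝑝^3) leading correction for this code.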
5.8 Mean and variance of a ratio of two random variables
Consider random variables 𝑃 and 𝑄 and let 𝐺 = 𝑔(𝑃, 𝑄) = 𝑃/𝑄. In general, there is no closed
form expression for E[𝐺 (𝑃, 𝑄)], and Var[𝐺 (𝑃, 𝑄)]. Here, we find approximations for the mean
and variance using Taylor expansions of 𝑔(𝑃, 𝑄).
[Figure 5.5 plot: infidelity vs. depolarizing error rate 𝑝 on logarithmic axes. Legend: LST,
LST 2, Linear fit k=3.07, Linear fit k=6.15.]
Figure 5.5: Infidelity in the small error rate region. Theoretically we have shown the leading
order correction to the infidelity is O(𝑝^{𝑚𝑑}) with 𝑚 = 1, 2. Here, we use the [[5, 1, 3]] code
with LST as a demonstration. We prepare random logical states and calculate the infidelity. The
numerical results give leading order corrections O(𝑝^{3.07}) and O(𝑝^{6.15}), which are very close
to the theoretical predictions O(𝑝^3) and O(𝑝^6).
The approximation for the mean value is
\[ \begin{aligned}
\mathbb{E}[g(P,Q)] &= \mathbb{E}[g(\mu_P,\mu_Q) + g'_P(\mu_P,\mu_Q)(P-\mu_P) + g'_Q(\mu_P,\mu_Q)(Q-\mu_Q) + R] \\
&\approx \mathbb{E}[g(\mu_P,\mu_Q)] + g'_P(\mu_P,\mu_Q)\,\mathbb{E}[(P-\mu_P)] + g'_Q(\mu_P,\mu_Q)\,\mathbb{E}[(Q-\mu_Q)] \\
&= g(\mu_P,\mu_Q),
\end{aligned} \tag{5.51} \]
where 𝑅 collects the higher order remainders of the Taylor expansion. Keeping the Taylor expansion
to first order, we ignore the higher order remainders.
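For the variance, we have
\[ \begin{aligned}
\mathrm{Var}[g(P,Q)] &= \mathbb{E}\left\{ [g(P,Q) - \mathbb{E}(g(P,Q))]^2 \right\} \\
&\approx \mathbb{E}\left\{ [g(P,Q) - g(\mu_P,\mu_Q)]^2 \right\} \\
&\approx \mathbb{E}\left\{ [g'_P(\mu_P,\mu_Q)(P-\mu_P) + g'_Q(\mu_P,\mu_Q)(Q-\mu_Q)]^2 \right\} \\
&= g'^2_P(\mu_P,\mu_Q)\,\mathrm{Var}(P) + g'^2_Q(\mu_P,\mu_Q)\,\mathrm{Var}(Q)
   + 2 g'_P(\mu_P,\mu_Q)\, g'_Q(\mu_P,\mu_Q)\,\mathrm{Cov}(P,Q).
\end{aligned} \tag{5.52} \]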
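For our case, 𝑔(𝑃, 𝑄) = 𝑃/𝑄, therefore $g'_P(\mu_P,\mu_Q) = 1/\mu_Q$,
$g'_Q(\mu_P,\mu_Q) = -\mu_P/\mu_Q^2$, and
\[ \mathrm{Var}(P/Q) \approx \left( \frac{\mu_P}{\mu_Q} \right)^2
   \left[ \frac{\mathrm{Var}(P)}{\mu_P^2} + \frac{\mathrm{Var}(Q)}{\mu_Q^2}
   - 2\,\frac{\mathrm{Cov}(P,Q)}{\mu_P \mu_Q} \right]. \tag{5.53} \]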
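The first-order formula in Eq. (5.53) can be compared against a direct simulation. The sketch
below is an illustration under stated assumptions, not part of the thesis numerics: 𝑃 and 𝑄 are
taken to be correlated Gaussian variables with small relative fluctuations, and the function names
and parameter values are hypothetical.

```python
import math
import random


def ratio_variance_delta(mu_p, mu_q, var_p, var_q, cov_pq):
    """First-order (Taylor expansion) approximation of Var(P/Q), Eq. (5.53)."""
    return (mu_p / mu_q) ** 2 * (
        var_p / mu_p ** 2 + var_q / mu_q ** 2 - 2 * cov_pq / (mu_p * mu_q)
    )


def ratio_variance_monte_carlo(mu_p, mu_q, var_p, var_q, cov_pq, shots=200_000, seed=0):
    """Sample variance of P/Q for correlated Gaussian P and Q (illustrative)."""
    rng = random.Random(seed)
    s_p = math.sqrt(var_p)
    a = cov_pq / s_p              # component of Q's noise shared with P
    b = math.sqrt(var_q - a * a)  # independent component of Q's noise
    ratios = []
    for _ in range(shots):
        g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
        ratios.append((mu_p + s_p * g1) / (mu_q + a * g1 + b * g2))
    mean = sum(ratios) / shots
    return sum((r - mean) ** 2 for r in ratios) / (shots - 1)


params = dict(mu_p=1.0, mu_q=2.0, var_p=0.01, var_q=0.01, cov_pq=0.005)
approx = ratio_variance_delta(**params)
sampled = ratio_variance_monte_carlo(**params)
print(approx, sampled)
```

The two numbers agree closely only when the fluctuations of 𝑄 are small compared with its mean;
when the denominator can approach zero, the neglected remainder 𝑅 dominates and the approximation
is no longer reliable.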
5.9 Proof of sample complexities
Suppose we want to predict a linear property of the underlying quantum state,
𝑜 = Tr(𝜌𝑂). (5.54)
We simply replace the unknown quantum state 𝜌 with the classical shadows
$\hat\rho = \mathcal{M}^{-1}(\hat\sigma)$. This yields a stochastic number
$\hat o = \mathrm{Tr}(\hat\rho O)$, which converges to the correct answer given a sufficient number
of classical shadows,
\[ \mathbb{E}\,\hat o = \mathrm{Tr}(\rho O). \tag{5.55} \]
In practice, the expectation $\mathbb{E}\,\hat o$ is replaced by a sample mean estimator,
$o_{\mathrm{avg}} = \frac{1}{M} \sum_{i=1}^{M} \hat o_i = \frac{1}{M} \sum_{i=1}^{M} \mathrm{Tr}(O \hat\rho_i)$.
Based on Chebyshev’s inequality, the probability that the estimate $o_{\mathrm{avg}}$ deviates from
its expectation value 𝑜 is bounded by its variance $\mathrm{Var}(o_{\mathrm{avg}})$ as
$\Pr(|o_{\mathrm{avg}} - o| \geq \delta) \leq \mathrm{Var}(o_{\mathrm{avg}})/\delta^2$. To keep the
probability of a deviation of at least 𝛿 below a desired level 𝜖, we require
$\mathrm{Var}(o_{\mathrm{avg}})/\delta^2 = \mathrm{Var}(\hat o)/(M\delta^2) \leq \epsilon$, where 𝑀
is the number of classical shadows. In other words, the number of experiments needed to achieve the
statistical error 𝜖 is given by
\[ M \geq \mathrm{Var}(\hat o)/(\epsilon \delta^2). \tag{5.56} \]
Therefore, the sample complexity is directly related to the variance of the single-shot estimate,
$\mathrm{Var}(\hat o)$.
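As a worked instance of Eq. (5.56), the sketch below converts a bound on the single-shot variance
into a number of classical shadows. The function name `shadows_needed` and the numerical values are
hypothetical; the variance bound 4^𝑙 used in the loop is the scaling for 𝑙 logical qubits discussed
in this chapter, taken here as an assumption.

```python
import math


def shadows_needed(single_shot_variance, delta, epsilon):
    """Smallest integer M with Var(o_hat) / (M * delta**2) <= epsilon, Eq. (5.56)."""
    if single_shot_variance < 0 or delta <= 0 or epsilon <= 0:
        raise ValueError("variance must be non-negative; delta and epsilon positive")
    return max(1, math.ceil(single_shot_variance / (epsilon * delta ** 2)))


# Variance bounded by 4**l for l logical qubits (assumed scaling), with
# deviation delta = 0.5 allowed with probability at most epsilon = 0.25.
for n_logical in (1, 2, 3):
    print(n_logical, shadows_needed(4.0 ** n_logical, delta=0.5, epsilon=0.25))
```

Because Chebyshev’s inequality is loose, this count is a conservative upper estimate rather than a
prediction of the number of shadows an experiment would actually need.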
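We can further bound the variance by
\[ \begin{aligned}
\mathrm{Var}(\hat o) &= \mathbb{E}[\hat o^2] - \mathbb{E}[\hat o]^2 \leq \mathbb{E}[\hat o^2] \\
&= \mathbb{E}_{U \sim \mathcal{U}} \sum_{b \in \{0,1\}^n} \langle b| U \sigma U^\dagger |b\rangle\,
   \langle b| U \mathcal{M}^{-1}(O) U^\dagger |b\rangle^2 \\
&\leq \|O\|^2_{\mathrm{shadow}},
\end{aligned} \tag{5.57} \]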
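where the shadow norm of an observable is defined as
\[ \begin{aligned}
\|O\|_{\mathrm{shadow}} &= \max_{\sigma:\,\mathrm{state}} \Big( \mathbb{E}_{U \sim \mathcal{U}}
   \sum_{b \in \{0,1\}^n} \langle b| U \sigma U^\dagger |b\rangle\,
   \langle b| U \mathcal{M}^{-1}(O) U^\dagger |b\rangle^2 \Big)^{1/2} \\
&= \max_{\sigma:\,\mathrm{state}} \Big( \mathbb{E}_{U \sim \mathcal{U}} \sum_{b \in \{0,1\}^n}
   \mathrm{Tr}\big( \sigma\, U^\dagger |b\rangle\langle b| U\,
   \langle b| U \mathcal{M}^{-1}(O) U^\dagger |b\rangle^2 \big) \Big)^{1/2} \\
&= \max_{\sigma:\,\mathrm{state}} \big( \mathrm{Tr}\, \sigma V_{\mathcal{U}}[O] \big)^{1/2},
\end{aligned} \tag{5.58} \]
where we define a new operator
$V_{\mathcal{U}}[O] = \mathbb{E}_{U \sim \mathcal{U}} \sum_{b \in \{0,1\}^n} U^\dagger |b\rangle\langle b| U\, \langle b| U \mathcal{M}^{-1}(O) U^\dagger |b\rangle^2$
that depends both on the unitary ensemble U and the observable 𝑂. If the unitary ensemble U is a
unitary 3-design, it can be simplified as
\[ \begin{aligned}
V_{\mathcal{U}}[O] &= \mathbb{E}_{U \sim \mathcal{U}} \sum_{b \in \{0,1\}^n}
   U^\dagger |b\rangle\langle b| U\, \langle b| U \mathcal{M}^{-1}(O) U^\dagger |b\rangle^2 \\
&= \sum_{b \in \{0,1\}^n} \sum_{\sigma, \tau \in S_3} \mathrm{Wg}[\sigma \tau^{-1} g_0]\, A[\sigma] B[\tau],
\end{aligned} \tag{5.59} \]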
where 𝜎, 𝜏 are permutations from the permutation group 𝑆_3, Wg[𝑔] is the Weingarten function of
the permutation group element 𝑔, 𝑔_0 = (2, 3) is a fixed permutation to match the tensor network
connection, and 𝐴[𝜎], 𝐵[𝜏] are defined as:
[Eq. (5.60): tensor-network diagrams defining 𝐴[𝜎] and 𝐵[𝜏], with tensors labeled |𝑏⟩⟨𝑏| and
M^{−1}(𝑂); the diagrams are not recoverable from the extracted text.]
In the following, we will mainly focus on the analysis of the 𝑉_U[𝑂] operator and
||𝑂||^2_shadow. In the main text, we focus on the scheme of encoding each logical qubit with an
[[𝑛, 1]] stabilizer code and doing quantum computation with 𝑁 = 𝑛 × 𝑙 total physical qubits, where
𝑙 is the number of logical qubits. For the classical shadow tomography part, we use random
unitaries sampled from Cℓ(2^𝑛)^{⊗𝑙}. One reason for choosing this factorized random unitary group
is that the global Clifford group Cℓ(2^{𝑛𝑙}) is harder to implement in experiments, whereas the
difficulty of implementing the factorized scheme does not depend on the number of logical qubits.
In practice, it is possible to encode each logical qubit with a small error correction code, such
as the [[5, 1]] code, and implement random circuits from Cℓ(2^𝑛). If the random unitary ensemble is
Cℓ(2^𝑛)^{⊗𝑙}, then it is easy to show the reconstruction map is
\[ \mathcal{M}^{-1}[\sigma] = \bigotimes_{i=1}^{l} \left[ (2^n + 1) \sigma_i - \mathrm{Tr}(A_i) I \right], \tag{5.61} \]
where 𝜎_𝑖 is the reduced classical shadow on part 𝑖. The logical Pauli observables factorize over
the logical sectors, i.e., 𝑂 = 𝑂_1 ⊗ 𝑂_2 ⊗ · · · ⊗ 𝑂_𝑙. Since the random unitaries are sampled
from the ensemble Cℓ(2^𝑛)^{⊗𝑙}, they also have the tensor product structure, i.e.,
𝑈 = 𝑈_1 ⊗ 𝑈_2 ⊗ · · · ⊗ 𝑈_𝑙. By combining these two properties, we can show
\[ V_{\mathcal{U}}[O] = \bigotimes_{i=1}^{l} V_{\mathcal{U}_i}[O_i]. \tag{5.62} \]
Therefore, we only need to focus on the property of 𝑉_{U_𝑖}[𝑂_𝑖] for each logical sector. In the
following, we use 𝑑 = 2^𝑛 for the Hilbert space dimension of one logical sector.
The calculation for 𝑉_{U_𝑖}[𝑃_𝑖 𝐼_𝑖]: For the projection operator 𝑃_𝑖,
M^{−1}(𝑃_𝑖 𝐼_𝑖) = (𝑑 + 1)𝑃_𝑖 − 2𝐼_𝑖, and Eq. (5.59) can be evaluated as
\[ V_{\mathrm{C}\ell(d)}[P_i I_i] = \frac{2d - 2}{d + 2} (P_i + I_i). \tag{5.63} \]
The calculation for 𝑉_{U_𝑖}[𝑃_𝑖 𝑂_𝑖]: For a non-trivial Pauli string 𝑂_𝑖,
M^{−1}(𝑃_𝑖 𝑂_𝑖) = (𝑑 + 1)𝑃_𝑖 𝑂_𝑖, and Eq. (5.59) can be evaluated as
\[ V_{\mathrm{C}\ell(d)}[P_i O_i] = \frac{2d + 2}{d + 2} (P_i + I_i). \tag{5.64} \]
As we can see,
$V_{\mathrm{C}\ell(d)}[P_i I_i] \lesssim V_{\mathrm{C}\ell(d)}[P_i O_i] = \frac{2d+2}{d+2}(P_i + I_i)$.
This result indicates that the sample complexity for predicting logical Pauli operators
$O = \bigotimes_{i=1}^{l} O_i$ after projection by $P = \bigotimes_{i=1}^{l} P_i$ does not depend on
the locality of the logical Pauli operators,
\[ \|PO\|^2_{\mathrm{shadow}} \lessapprox \max_{\sigma:\,\mathrm{state}} \mathrm{Tr}\left( \sigma
   \bigotimes_{i=1}^{l} \frac{2d + 2}{d + 2} (P_i + I_i) \right) \lesssim 4^l. \tag{5.65} \]
This result is different from the sample complexity for the local Clifford group or the tensored
Clifford group Cℓ(𝑑)^{⊗𝑙}, where the sample complexity depends on the locality of the Pauli string
𝑂. This difference is mainly introduced by the logical subspace projection 𝑃. Even if the Pauli
string 𝑂 is trivial in some region, the subspace projection 𝑃_𝑖 still introduces fluctuation.
BIBLIOGRAPHY
[1] IBM Quantum, (2022), https://quantum-computing.ibm.com/.
[2] R. S. Smith, M. J. Curtis, and W. J. Zeng, A practical quantum instruction set architecture,
arXiv:1608.03355 (2016).
[3] A. W. Cross, L. S. Bishop, J. A. Smolin, and J. M. Gambetta, Open Quantum Assembly
Language, arXiv:1707.03429 (2017).
[4] P. W. Shor and S. P. Jordan, Estimating jones polynomials is a complete problem for one
clean qubit, Quantum Information & Computation 8, 681 (2008).
[5] F. Vatan and C. Williams, Optimal quantum circuits for general two-qubit gates, Physical
Review A 69, 032315 (2004).
[6] P. B. M. Sousa and R. V. Ramos, Universal quantum circuit for 𝑛-qubit quantum gate: A
programmable quantum gate, Quantum Information and Computation 7, 228 (2007).
[7] E. Knill and R. Laflamme, Power of one bit of quantum information, Physical Review Letters
81, 5672 (1998).
[8] A. W. Cross, L. S. Bishop, S. Sheldon, P. D. Nation, and J. M. Gambetta, Validating quantum
computers using randomized model circuits, Physical Review A (2018),
10.1103/PhysRevA.100.032328.
[9] M. Grassl, Bounds on the minimum distance of linear codes and quantum codes, (2021).
[10] P. W. Shor, in Proceedings 35th annual symposium on foundations of computer science (Ieee,
1994) pp. 124–134.
[11] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information (Cam-
bridge University Press, 2000).
[12] S. Aaronson and D. Gottesman, Improved simulation of stabilizer circuits, Physical Review
A 70 (2004), 10.1103/PhysRevA.70.052328, arXiv: quant-ph/0406196.
[13] D. Gottesman, The heisenberg representation of quantum computers, arXiv:quant-
ph/9807006 (1998), arXiv: quant-ph/9807006.
[14] J. Preskill, Quantum computing in the NISQ era and beyond, Quantum 2, 79 (2018).
[15] P. W. Shor, Polynomial-time algorithms for prime factorization and discrete logarithms on a
quantum computer, SIAM review 41, 303 (1999).
[16] S. Lloyd, M. Mohseni, and P. Rebentrost, Quantum principal component analysis, Nature
Physics 10, 631 (2014), 1307.0401 .
[17] A. W. Harrow, A. Hassidim, and S. Lloyd, Quantum algorithm for linear systems of
equations, Physical Review Letters 103, 150502 (2009).
[18] P. Rebentrost, A. Steffens, I. Marvian, and S. Lloyd, Quantum singular-value decomposition
of nonsparse low-rank matrices, Physical Review A 97, 012327 (2018).
[19] S. K. Leyton and T. J. Osborne, A quantum algorithm to solve nonlinear differential equations,
arXiv:0812.4423 (2008).
[20] D. W. Berry, High-order quantum algorithm for solving linear differential equations, Journal
of Physics A: Mathematical and Theoretical 47, 105301 (2014).
[21] E. Farhi, J. Goldstone, and S. Gutmann, A quantum approximate optimization algorithm,
arXiv:1411.4028 (2014).
[22] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik,
and J. L. O’Brien, A variational eigenvalue solver on a photonic quantum processor, Nature
Communications 5, 4213 (2014).
[23] A. Kandala, A. Mezzacapo, K. Temme, M. Takita, M. Brink, J. M. Chow, and J. M.
Gambetta, Hardware-efficient variational quantum eigensolver for small molecules and
quantum magnets, Nature 549, 242 (2017).
[24] D. W. Berry, A. M. Childs, R. Cleve, R. Kothari, and R. D. Somma, Simulating Hamiltonian
dynamics with a truncated Taylor series, Physical Review Letters 114, 090502 (2015).
[25] J. Preskill, Quantum computing and the entanglement frontier, arXiv:1203.5813 (2012).
[26] A. W. Harrow and A. Montanaro, Quantum computational supremacy, Nature 549, 203
(2017).
[27] S. Bravyi, G. Smith, and J. A. Smolin, Trading classical and quantum computational
resources, Physical Review X 6, 021043 (2016).
[28] O. Higgott, D. Wang, and S. Brierley, Variational Quantum Computation of Excited States,
arXiv:1805.08138 [quant-ph] (2018).
[29] S. Endo, T. Jones, S. McArdle, X. Yuan, and S. Benjamin, Variational quantum algorithms
for discovering Hamiltonian spectra, arXiv:1806.05707 [quant-ph] (2018).
[30] P. D. Johnson, J. Romero, J. Olson, Y. Cao, and A. Aspuru-Guzik, QVECTOR: an algorithm
for device-tailored quantum error correction, arXiv:1711.02249 (2017).
[31] J. Romero, J. P. Olson, and A. Aspuru-Guzik, Quantum autoencoders for efficient compres-
sion of quantum data, Quantum Science and Technology 2, 045001 (2017).
[32] A. Khoshaman, W. Vinci, B. Denis, E. Andriyash, and M. H. Amin, Quantum variational
autoencoder, Quantum Science and Technology 4, 014001 (2018).
[33] Y. Li and S. C. Benjamin, Efficient variational quantum simulator incorporating active error
minimization, Physical Review X 7, 021050 (2017).
[34] C. Kokail, C. Maier, R. van Bijnen, T. Brydges, M. K. Joshi, P. Jurcevic, C. A. Muschik,
P. Silvi, R. Blatt, C. F. Roos, et al., Self-verifying variational quantum simulation of the
lattice Schwinger model, arXiv:1810.03421 (2018).
[35] H. Li and F. D. M. Haldane, Entanglement spectrum as a generalization of entanglement
entropy: Identification of topological order in non-Abelian fractional quantum Hall effect
states, Physical Review Letters 101, 010504 (2008).
[36] V. Giovannetti, S. Lloyd, and L. Maccone, Quantum random access memory, Physical
Review Letters 100, 160501 (2008).
[37] K. Pearson, On lines and planes of closest fit to systems of points in space, The London,
Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 559 (1901).
[38] L. N. Trefethen and D. Bau, Numerical Linear Algebra (SIAM, 1997).
[39] This can be seen by noting that computing eigenvalues is equivalent to computing roots
of a polynomial equation (namely the characteristic polynomial of the matrix) and that no
closed-form solution exists for the roots of general polynomials of degree greater than or
equal to five [38].
[40] J. R. McClean, S. Boixo, V. N. Smelyanskiy, R. Babbush, and H. Neven, Barren plateaus in
quantum neural network training landscapes, Nature Communications 9, 4812 (2018).
[41] E. Grant, L. Wossnig, M. Ostaszewski, and M. Benedetti, An initialization strategy for
addressing barren plateaus in parametrized quantum circuits, arXiv:1903.05076 [quant-ph]
(2019).
[42] T. Baumgratz, M. Cramer, and M. B. Plenio, Quantifying coherence, Physical Review Letters
113, 140401 (2014).
[43] H. Buhrman, R. Cleve, J. Watrous, and R. De Wolf, Quantum fingerprinting, Physical
Review Letters 87, 167902 (2001).
[44] D. Gottesman and I. Chuang, Quantum digital signatures, quant-ph/0105032 (2001).
[45] VQSD source code, https://github.com/rmlarose/vqsd (2019).
[46] R. S. Smith, M. J. Curtis, and W. J. Zeng, A Practical Quantum Instruction Set Architecture,
arXiv:1608.03355 (2016).
[47] M. B. Hastings, An area law for one-dimensional quantum systems, Journal of Statistical
Mechanics: Theory and Experiment 2007, 08024 (2007), arXiv:0705.2024 [quant-ph].
[48] B. Bauer and C. Nayak, Area laws in a many-body localized state and its implications for
topological order, Journal of Statistical Mechanics: Theory and Experiment 2013, 09005
(2013), arXiv:1306.5753 [cond-mat.dis-nn].
[49] T. Grover, Certain General Constraints on the Many-Body Localization Transition,
arXiv:1405.1471 [cond-mat.dis-nn] (2014).
[50] L. Cincio, Y. Subaşı, A. T. Sornborger, and P. J. Coles, Learning the quantum algorithm for
state overlap, New Journal of Physics 20, 113022 (2018).
[51] S. Khatri, R. LaRose, A. Poremba, L. Cincio, A. T. Sornborger, and P. J. Coles, Quantum
assisted quantum compiling, arXiv:1807.00800 (2018).
[52] T. Jones and S. C. Benjamin, Quantum compilation and circuit optimisation via energy
dissipation, arXiv:1811.03147 (2018).
[53] E. Tang, Quantum-inspired classical algorithms for principal component analysis and su-
pervised clustering, arXiv:1811.00414 (2018).
[54] H. Li and F. D. M. Haldane, Entanglement spectrum as a generalization of entanglement
entropy: Identification of topological order in non-Abelian fractional quantum Hall effect
states, Phys. Rev. Lett. 101, 010504 (2008).
[55] J. C. Garcia-Escartin and P. Chamorro-Posada, Swap test and Hong-Ou-Mandel effect are
equivalent, Physical Review A 87, 052330 (2013).
[56] G. Smith, J. A. Smolin, X. Yuan, Q. Zhao, D. Girolami, and X. Ma, Quantifying coherence
and entanglement via simple measurements, arXiv:1707.09928 (2017).
[57] L. M. Rios and N. V. Sahinidis, Derivative-free optimization: a review of algorithms and
comparison of software implementations, Journal of Global Optimization 56, 1247 (2013).
[58] G. G. Guerreschi and M. Smelyanskiy, Practical optimization for hybrid quantum-classical
algorithms, arXiv:1701.01450 (2017).
[59] J. R. McClean, J. Romero, R. Babbush, and A. Aspuru-Guzik, The theory of variational
hybrid quantum-classical algorithms, New Journal of Physics 18, 023023 (2016).
[60] M. J. D. Powell, in Numerical Analysis, Lecture Notes in Mathematics, edited by G. A.
Watson (Springer Berlin Heidelberg, 1978) pp. 144–157.
[61] SciPy optimization and root finding, (2018).
[62] M. J. D. Powell, Direct search algorithms for optimization calculations, Acta Numerica 7,
287 (1998).
[63] M. J. D. Powell (2009).
[64] F. Gao and L. Han, Implementing the Nelder-Mead simplex algorithm with adaptive
parameters, Computational Optimization and Applications 51, 259 (2012).
[65] J. Nocedal and S. Wright, Numerical Optimization, 2nd ed., Springer Series in Operations
Research and Financial Engineering (Springer-Verlag, 2006).
[66] C. Cartis, J. Fiala, B. Marteau, and L. Roberts, Improving the Flexibility and Robustness of
Model-Based Derivative-Free Optimization Solvers, arXiv:1804.00154 (2018).
[67] D. Venturelli, M. Do, E. Rieffel, and J. Frank, Compiling quantum circuits to realistic hard-
ware architectures using temporal planners, Quantum Science and Technology 3, 025004
(2018).
[68] K. E. Booth, M. Do, J. C. Beck, E. Rieffel, D. Venturelli, and J. Frank, Comparing and
integrating constraint programming and temporal planning for quantum circuit compilation,
arXiv:1803.06775 (2018).
[69] D. Maslov, G. W. Dueck, D. M. Miller, and C. Negrevergne, Quantum circuit simplification
and level compaction, IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems 27, 436 (2008).
[70] A. G. Fowler, Constructing arbitrary Steane code single logical qubit fault-tolerant gates,
Quantum Information and Computation 11, 867 (2011).
[71] J. Booth Jr, Quantum compiler optimizations, arXiv:1206.3348 (2012).
[72] Y. Nam, N. J. Ross, Y. Su, A. M. Childs, and D. Maslov, Automated optimization of large
quantum circuits with continuous parameters, npj Quantum Information 4, 23 (2018).
[73] F. T. Chong, D. Franklin, and M. Martonosi, Programming languages and compiler design
for realistic quantum hardware, Nature 549, 180 (2017).
[74] L. E. Heyfron and E. T. Campbell, An efficient quantum compiler that reduces T count,
Quantum Science and Technology 4, 015004 (2018).
[75] T. Häner, D. S. Steiger, K. Svore, and M. Troyer, A software methodology for compiling
quantum programs, Quantum Science and Technology 3, 020501 (2018).
[76] A. Oddi and R. Rasconi, in International Conference on the Integration of Constraint
Programming, Artificial Intelligence, and Operations Research (Springer, 2018) pp. 446–
461.
[77] E. Farhi, J. Goldstone, and S. Gutmann, A quantum approximate optimization algorithm,
arXiv:1411.4028 (2014).
[78] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik,
and J. L. O’Brien, A variational eigenvalue solver on a photonic quantum processor, Nature
Communications 5, 4213 (2014).
[79] M. Benedetti, D. Garcia-Pintos, O. Perdomo, V. Leyton-Ortega, Y. Nam, and A. Perdomo-
Ortiz, A generative modeling approach for benchmarking and training shallow quantum
circuits, arXiv:1801.07686 (2018).
[80] K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii, Quantum circuit learning, Physical
Review A 98, 032309 (2018).
[81] G. Verdon, J. Pye, and M. Broughton, A Universal Training Algorithm for Quantum Deep
Learning, arXiv:1806.09729 (2018).
[82] J. Romero, J. P. Olson, and A. Aspuru-Guzik, Quantum autoencoders for efficient compres-
sion of quantum data, Quantum Science and Technology 2, 045001 (2017).
[83] J. Romero, J. P. Olson, and A. Aspuru-Guzik, Quantum autoencoders for short depth
quantum circuit synthesis, GitHub article (2018).
[84] B. Dive, A. Pitchford, F. Mintert, and D. Burgarth, In situ upgrade of quantum simulators
to universal computers, Quantum 2, 80 (2018).
[85] K. Fujii, H. Kobayashi, T. Morimae, H. Nishimura, S. Tamate, and S. Tani, Impossibility of
Classically Simulating One-Clean-Qubit Model with Multiplicative Error, Physical Review
Letters 120, 200502 (2018).
[86] B. Rosgen and J. Watrous, in 20th Annual IEEE Conference on Computational Complexity
(CCC’05) (2005) pp. 344–354.
[87] A. Kitaev, Quantum computations: algorithms and error correction, Russian Mathematical
Surveys 52, 1191 (1997).
[88] C. M. Dawson and M. A. Nielsen, The Solovay-Kitaev algorithm, Quantum Information and
Computation 6, 81 (2006).
[89] T. T. Pham, R. Van Meter, and C. Horsman, Optimization of the Solovay-Kitaev algorithm,
Physical Review A 87, 052332 (2013).
[90] V. Kliuchnikov, D. Maslov, and M. Mosca, Asymptotically optimal approximation of single
qubit unitaries by Clifford and T circuits using a constant number of ancillary qubits, Physical
Review Letters 110, 190502 (2013).
[91] V. Kliuchnikov, A. Bocharov, and K. M. Svore, Asymptotically optimal topological quantum
compiling, Physical Review Letters 112, 140504 (2014).
[92] Y. Zhiyenbayev, V. M. Akulin, and A. Mandilara, Quantum compiling with diffusive sets of
gates, Physical Review A 98, 012325 (2018).
[93] M. Horodecki, P. Horodecki, and R. Horodecki, General teleportation channel, singlet
fraction, and quasidistillation, Physical Review A 60, 1888 (1999).
[94] M. A. Nielsen, A simple formula for the average gate fidelity of a quantum dynamical
operation, Physics Letters A 303, 249 (2002).
[95] A. Gepp and P. Stocks, A review of procedures to evolve quantum algorithms, Genetic
Programming and Evolvable Machines 10, 181 (2009).
[96] M. Suzuki, Fractal decomposition of exponential operators with applications to many-body
theories and Monte Carlo simulations, Physics Letters A 146, 319 (1990).
[97] IBM Q 5 Tenerife backend specification, https://github.com/QISKit/
qiskit-backend-information/tree/master/backends/tenerife/V1 (2018).
[98] IBM Q 16 Rueschlikon backend specification, (2018).
[99] Rigetti 8Q-Agave specification v.2.0.0.dev0, http://docs.rigetti.com/en/latest/
qpu.html (2018).
[100] A. G. R. Day, M. Bukov, P. Weinberg, P. Mehta, and D. Sels, Glassy phase of optimal
quantum control, Physical Review Letters 122, 020601 (2019).
[101] X. Glorot and Y. Bengio, in Proceedings of the International Conference on Artificial
Intelligence and Statistics (2010) pp. 249–256.
[102] M. Benedetti, D. Garcia-Pintos, O. Perdomo, V. Leyton-Ortega, Y. Nam, and A. Perdomo-
Ortiz, A generative modeling approach for benchmarking and training shallow quantum
circuits, arXiv:1801.07686 (2018).
[103] R. LaRose, A. Tikku, É. O’Neel-Judy, L. Cincio, and P. J. Coles, Variational quantum state
diagonalization, arXiv:1810.10506 (2018).
[104] A. Kandala, K. Temme, A. D. Corcoles, A. Mezzacapo, J. M. Chow, and J. M. Gambetta,
Extending the computational reach of a noisy superconducting quantum processor, Nature
567, 491 (2018).
[105] Scikit-optimize, (2018).
[106] J. Močkus, in Optimization Techniques IFIP Technical Conference Novosibirsk, July 1–7,
1974 (Springer Berlin Heidelberg, Berlin, Heidelberg, 1975) pp. 400–404.
[107] M. A. Osborne, R. Garnett, and S. J. Roberts, in 3rd International Conference on Learning
and Intelligent Optimization (LION3) 2009 (2009).
[108] P. Rebentrost, M. Schuld, L. Wossnig, F. Petruccione, and S. Lloyd, Quantum gradient
descent and Newton’s method for constrained polynomial optimization, arXiv:1612.01789
(2016).
[109] I. Kerenidis and A. Prakash, Quantum gradient descent for linear systems and least squares,
arXiv:1704.04992 (2017).
[110] A. Gilyén, S. Arunachalam, and N. Wiebe, Optimizing quantum optimization algorithms
via faster quantum gradient computation, in Proceedings of the Thirtieth Annual ACM-SIAM
Symposium on Discrete Algorithms (ACM, 2019) pp. 1425–1444.
[111] X.-Q. Zhou, T. C. Ralph, P. Kalasuwan, M. Zhang, A. Peruzzo, B. P. Lanyon, and J. L.
O’Brien, Adding control to arbitrary unknown quantum operations, Nature Communications
2, 413 (2011).
[112] B. M. Terhal, Quantum error correction for quantum memories, (2015).
[113] D. Gottesman (AMS eBooks, 2010) pp. 13–58.
[114] A. G. Fowler, M. Mariantoni, J. M. Martinis, and A. N. Cleland, Surface codes: Towards
practical large-scale quantum computation, Physical Review A (2012),
10.1103/PhysRevA.86.032324.
[115] R. Takagi, S. Endo, S. Minagawa, and M. Gu, Fundamental limits of quantum error
mitigation, arXiv:2109.04457 [quant-ph] (2021), arXiv: 2109.04457.
[116] K. Temme, S. Bravyi, and J. M. Gambetta, Error Mitigation for Short-Depth Quantum
Circuits, Physical Review Letters 119, 180509 (2017).
[117] Y. Li and S. C. Benjamin, Efficient Variational Quantum Simulator Incorporating Active
Error Minimization, Physical Review X 7 (2017), 10.1103/physrevx.7.021050.
[118] S. Endo, S. C. Benjamin, and Y. Li, Practical quantum error mitigation for near-future
applications, Physical Review X 8, 031027 (2018).
[119] J. J. Wallman and J. Emerson, Noise tailoring for scalable quantum computation via ran-
domized compiling, Physical Review A 94, 052325 (2016).
[120] E. Knill, Quantum computing with realistically noisy devices, Nature 434, 39 (2005).
[121] L. F. Santos and L. Viola, Dynamical control of qubit coherence: Random versus determin-
istic schemes, Physical Review A 72, 062303 (2005).
[122] L. Viola and E. Knill, Random decoupling schemes for quantum dynamical control and error
suppression, Physical Review Letters 94, 060502 (2005).
[123] B. Pokharel, N. Anand, B. Fortman, and D. A. Lidar, Demonstration of fidelity
improvement using dynamical decoupling with superconducting qubits, Physical Review Letters 121,
220502 (2018).
[124] P. Sekatski, M. Skotiniotis, and W. Dür, Dynamical decoupling leads to improved scaling
in noisy quantum metrology, New Journal of Physics 18, 073034 (2016).
[125] H. Ball, M. J. Biercuk, A. Carvalho, R. Chakravorty, J. Chen, L. A. de Castro, S. Gore,
D. Hover, M. Hush, P. J. Liebermann, et al., Software tools for quantum control: Improv-
ing quantum computer performance through noise and error suppression, arXiv preprint
arXiv:2001.04060 (2020).
[126] T. J. Green, J. Sastrawan, H. Uys, and M. J. Biercuk, Arbitrary quantum control of qubits in
the presence of universal noise, New Journal of Physics 15, 095004 (2013).
[127] J. R. McClean, Z. Jiang, N. C. Rubin, R. Babbush, and H. Neven, Decoding quantum errors
with subspace expansions, arXiv:1903.05786 [physics, physics:quant-ph] (2019), arXiv:
1903.05786.
[128] W. J. Huggins, S. McArdle, T. E. O’Brien, J. Lee, N. C. Rubin, S. Boixo, K. B. Wha-
ley, R. Babbush, and J. R. McClean, Virtual Distillation for Quantum Error Mitigation,
arXiv:2011.07064 [quant-ph] (2021), arXiv: 2011.07064.
[129] B. Koczor, Exponential error suppression for near-term quantum devices, Physical Review
X 11, 031057 (2021), arXiv: 2011.05942.
[130] Y. Li and S. C. Benjamin, Efficient Variational Quantum Simulator Incorporating Active
Error Minimization, Physical Review X 7, 021050 (2017).
[131] P. Czarnik, A. Arrasmith, P. J. Coles, and L. Cincio, Error mitigation with Clifford quantum-
circuit data, arXiv:2005.10189 [quant-ph] (2021), arXiv: 2005.10189.
[132] C. Piveteau, D. Sutter, S. Bravyi, J. M. Gambetta, and K. Temme, Error mitigation for
universal gates on encoded qubits, Physical Review Letters 127 (2021),
10.1103/physrevlett.127.200505.
[133] A. Lowe, M. H. Gordon, P. Czarnik, A. Arrasmith, P. J. Coles, and L. Cincio, Unified
approach to data-driven quantum error mitigation, Phys. Rev. Research 3, 033098 (2021).
[134] A. T. Arrasmith, P. J. Czarnik, P. J. Coles, and L. Cincio, Error mitigation with Clifford
quantum-circuit data, Quantum 5 (2021), 10.22331/q-2021-11-26-592.
[135] N. Yoshioka, H. Hakoshima, Y. Matsuzaki, Y. Tokunaga, Y. Suzuki, and S. Endo,
Generalized quantum subspace expansion, arXiv:2107.02611 [quant-ph] (2021), arXiv:
2107.02611.
[136] A. Lowe, M. H. Gordon, P. Czarnik, A. Arrasmith, P. J. Coles, and L. Cincio, Unified
approach to data-driven quantum error mitigation, arXiv:2011.01157 [quant-ph] (2020),
arXiv: 2011.01157.
[137] A. Kandala, K. Temme, A. D. Córcoles, A. Mezzacapo, J. M. Chow, and J. M. Gambetta,
Error mitigation extends the computational reach of a noisy quantum processor, Nature 567,
491 (2019).
[138] R. S. Smith, M. J. Curtis, and W. J. Zeng, A practical quantum instruction set architecture,
arXiv preprint arXiv:1608.03355 (2016).
[139] D. C. McKay, C. J. Wood, S. Sheldon, J. M. Chow, and J. M. Gambetta, Efficient Z gates
for quantum computing, Physical Review A 96 (2017), 10.1103/PhysRevA.96.022330.
[140] X. Fu, L. Riesebos, M. Rol, J. van Straten, J. van Someren, N. Khammassi, I. Ashraf,
R. Vermeulen, V. Newsum, K. Loh, et al., in 2019 IEEE International Symposium on High
Performance Computer Architecture (HPCA) (IEEE, 2019) pp. 224–237.
[141] A. He, B. Nachman, W. A. de Jong, and C. W. Bauer, Resource efficient zero noise
extrapolation with identity insertions, arXiv preprint arXiv:2003.04941 (2020).
[142] E. F. Dumitrescu, A. J. McCaskey, G. Hagen, G. R. Jansen, T. D. Morris, T. Papenbrock,
R. C. Pooser, D. J. Dean, and P. Lougovski, Cloud quantum computing of an atomic nucleus,
Physical Review Letters 120, 210501 (2018).
[143] E. Farhi, J. Goldstone, and S. Gutmann, A Quantum Approximate Optimization Algorithm,
arXiv (2014).
[144] M. Otten and S. K. Gray, Recovering noise-free quantum observables, Physical Review A
99, 012338 (2019).
[145] J. Bylander, S. Gustavsson, F. Yan, F. Yoshihara, K. Harrabi, G. Fitch, D. G. Cory,
Y. Nakamura, J.-S. Tsai, and W. D. Oliver, Noise spectroscopy through dynamical decoupling with
a superconducting flux qubit, Nature Physics 7, 565 (2011).
[146] F. Yan, S. Gustavsson, J. Bylander, X. Jin, F. Yoshihara, D. G. Cory, Y. Nakamura, T. P.
Orlando, and W. D. Oliver, Rotating-frame relaxation as a noise spectrum analyser of
a superconducting qubit undergoing driven evolution, Nature Communications 4, 2337
(2013).
[147] J. Meeson, A. Ya. Tzalenchuk, and T. Lindström, Evidence for interacting two-level systems
from the 1/f noise of a superconducting resonator, Nature Communications 5, 4119 (2014).
[148] C. Müller, J. Lisenfeld, A. Shnirman, and P. S., Non-Gaussian noise spectroscopy with a
superconducting qubit sensor, Physical Review B 92, 035442 (2015).
[149] J. J. Burnett, A. Bengtsson, M. Scigliuzzo, D. Niepce, M. Kudra, P. Delsing, and J. Bylander,
Decoherence benchmarking of superconducting qubits, npj Quantum Information 5, 54
(2019).
[150] J. Basset, A. Stockklauser, D.-D. Jarausch, T. Frey, C. Reichl, W. Wegscheider, A. Wallraff,
K. Ensslin, and I. T., Evaluating charge noise acting on semiconductor quantum dots in the
circuit quantum electrodynamics architecture, Applied Physics Letters 105, 063105 (2014).
[151] K. W. Chan, W. Huang, C. H. Yang, J. C. C. Hwang, B. Hensen, T. Tanttu, F. E. Hudson,
K. M. Itoh, A. Laucht, A. Morello, and A. S. Dzurak, Assessment of a silicon quantum dot
spin qubit environment via noise spectroscopy, Phys. Rev. Applied 10, 044017 (2018).
[152] T. Struck, A. Hollmann, F. Schauer, O. Fedorets, A. Schmidbauer, K. Sawano, H. Riemann,
N. V. Abrosimov, L. Cywiński, B. D., and L. R. Schreiber, Low-frequency spin qubit energy
splitting noise in highly purified 28Si/SiGe, npj Quantum Information 6, 40 (2020).
[153] G. A. Álvarez and D. Suter, Measuring the spectrum of colored noise by dynamical decou-
pling, Phys. Rev. Lett. 107, 230501 (2011).
[154] P. Szańkowski, G. Ramon, J. Krzywda, D. Kwiatkowski, and Ł. Cywiński, Environmen-
tal noise spectroscopy with qubits subjected to dynamical decoupling, Journal of Physics:
Condensed Matter 29, 333001 (2017).
[155] G. A. Paz-Silva, L. M. Norris, and L. Viola, Multiqubit spectroscopy of Gaussian quantum
noise, Phys. Rev. A 95, 022121 (2017).
[156] L. Cywiński, R. M. Lutchyn, C. P. Nave, and S. Das Sarma, How to enhance dephasing time
in superconducting qubits, Phys. Rev. B 77, 174509 (2008).
[157] G. A. Paz-Silva and L. Viola, General transfer-function approach to noise filtering in open-
loop quantum control, Phys. Rev. Lett. 113, 250501 (2014).
[158] K. Schultz, G. Quiroz, P. Titum, and B. D. Clader, SchWARMA: A model-based approach
for time-correlated noise in quantum circuits, Phys. Rev. Research 3, 033229 (2021).
[159] A. Murphy, J. Epstein, G. Quiroz, K. Schultz, L. Tewala, K. McElroy, C. Trout, B. Tien-
Street, J. A. Hoffmann, B. Clader, et al., Universal dephasing noise injection via Schrodinger
wave autoregressive moving average models, arXiv preprint arXiv:2102.03370 (2021).
[160] T. Giurgica-Tiron, Y. Hindy, R. LaRose, A. Mari, and W. J. Zeng, Digital zero noise ex-
trapolation for quantum error mitigation, 2020 IEEE International Conference on Quantum
Computing and Engineering (QCE), 306 (2020).
[161] A. He, B. Nachman, W. A. de Jong, and C. W. Bauer, Zero-noise extrapolation for quantum-
gate error mitigation with identity insertions, Phys. Rev. A 102, 012426 (2020).
[162] R. LaRose, A. Mari, P. J. Karalekas, N. Shammah, and W. J. Zeng, Mitiq: A software
package for error mitigation on noisy quantum computers, arXiv preprint arXiv:2009.04417
(2020).
[163] P. Whittle, Prediction and regulation by linear least-square methods (English Universities
Press, 1963).
[164] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time series analysis: forecasting
and control (John Wiley & Sons, 2015).
[165] S. H. Holan, R. Lund, G. Davis, et al., The ARMA alphabet soup: A tour of ARMA model
variants, Statistics Surveys 4, 232 (2010).
[166] K. Temme, S. Bravyi, and J. M. Gambetta, Error mitigation for short-depth quantum circuits,
Physical Review Letters 119, 180509 (2017).
[167] Y. Li and S. C. Benjamin, Efficient variational quantum simulator incorporating active error
minimization, Phys. Rev. X 7, 021050 (2017).
[168] S. Endo, S. C. Benjamin, and Y. Li, Practical quantum error mitigation for near-future
applications, Phys. Rev. X 8, 031027 (2018).
[169] A. Kandala, K. Temme, A. D. Córcoles, A. Mezzacapo, J. M. Chow, and J. M. Gambetta,
Error mitigation extends the computational reach of a noisy quantum processor, Nature 567,
491 (2019).
[170] A. Lowe, M. H. Gordon, P. Czarnik, A. Arrasmith, P. J. Coles, and L. Cincio, Unified
approach to data-driven quantum error mitigation, Phys. Rev. Research 3, 033098 (2021).
[171] A. Mari, N. Shammah, and W. J. Zeng, Extending quantum probabilistic error cancellation
by noise scaling, Phys. Rev. A 104, 052607 (2021).
[172] Y. Kim, C. J. Wood, T. J. Yoder, S. T. Merkel, J. M. Gambetta, K. Temme, and A. Kandala,
Scalable error mitigation for noisy quantum circuits produces competitive expectation values,
arXiv:2108.09197 [cond-mat, physics:quant-ph] (2021), arXiv: 2108.09197.
[173] M. Broughton, G. Verdon, T. McCourt, A. J. Martinez, J. H. Yoo, S. V. Isakov, P. Massey,
R. Halavati, M. Y. Niu, A. Zlokapa, E. Peters, O. Lockwood, A. Skolik, S. Jerbi, V. Dunjko,
M. Leib, M. Streif, D. V. Dollen, H. Chen, S. Cao, R. Wiersema, H.-Y. Huang, J. R. McClean,
R. Babbush, S. Boixo, D. Bacon, A. K. Ho, H. Neven, and M. Mohseni, TensorFlow Quantum:
A software framework for quantum machine learning, (2021), arXiv:2003.02989 [quant-ph].
[174] T. Proctor, K. Rudinger, K. Young, E. Nielsen, and R. Blume-Kohout, Measuring the capa-
bilities of quantum computers, arXiv:2008.11294 [quant-ph] (2020), arXiv: 2008.11294.
[175] P. Jurcevic, A. Javadi-Abhari, L. S. Bishop, I. Lauer, D. F. Bogorin, M. Brink, L. Capelluto,
O. Günlük, T. Itoko, N. Kanazawa, A. Kandala, G. A. Keefe, K. Krsulich, W. Landers, E. P.
Lewandowski, D. T. McClure, G. Nannicini, A. Narasgond, H. M. Nayfeh, E. Pritchett, M. B.
Rothwell, S. Srinivasan, N. Sundaresan, C. Wang, K. X. Wei, C. J. Wood, J.-B. Yau, E. J.
Zhang, O. E. Dial, J. M. Chow, and J. M. Gambetta, Demonstration of quantum volume
64 on a superconducting quantum computing system, Quantum Science and Technology 6,
025020 (2021).
[176] Y. Li and S. C. Benjamin, Efficient variational quantum simulator incorporating active error
minimisation, Physical Review X (2016), 10.1103/PhysRevX.7.021050.
[177] K. Temme, S. Bravyi, and J. M. Gambetta, Error mitigation for short-depth quantum circuits,
Physical Review Letters (2016), 10.1103/PhysRevLett.119.180509.
[178] T. Giurgica-Tiron, Y. Hindy, R. LaRose, A. Mari, and W. J. Zeng, Digital zero noise ex-
trapolation for quantum error mitigation, 2020 IEEE International Conference on Quantum
Computing and Engineering (QCE), 306–316 (2020), arXiv: 2005.10921.
[179] Y. Kim, C. J. Wood, T. J. Yoder, S. T. Merkel, J. M. Gambetta, K. Temme, and A. Kandala,
Scalable error mitigation for noisy quantum circuits produces competitive expectation values,
arXiv:2108.09197 [cond-mat, physics:quant-ph] (2021), arXiv: 2108.09197.
[180] W. J. Huggins, S. McArdle, T. E. O’Brien, J. Lee, N. C. Rubin, S. Boixo, K. B. Whaley,
R. Babbush, and J. R. McClean, Virtual distillation for quantum error mitigation, Physical
Review X 11, 041036 (2021), arXiv: 2011.07064.
[181] J. Cotler, S. Choi, A. Lukin, H. Gharibyan, T. Grover, M. E. Tai, M. Rispoli, R. Schittko,
P. M. Preiss, A. M. Kaufman, M. Greiner, H. Pichler, and P. Hayden, Quantum virtual
cooling, Physical Review X 9, 031013 (2019), arXiv: 1812.02175.
[182] R. LaRose, A. Mari, S. Kaiser, P. J. Karalekas, A. A. Alves, P. Czarnik, M. E. Mandouh,
M. H. Gordon, Y. Hindy, A. Robertson, P. Thakre, N. Shammah, and W. J. Zeng, Mitiq:
A software package for error mitigation on noisy quantum computers, arXiv:2009.04417
[quant-ph] (2021), arXiv: 2009.04417.
[183] A. Kandala, K. Temme, A. D. Córcoles, A. Mezzacapo, J. M. Chow, and J. M. Gambetta,
Error mitigation extends the computational reach of a noisy quantum processor, Nature 567,
491–495 (2019).
[184] S. Zhang, Y. Lu, K. Zhang, W. Chen, Y. Li, J.-N. Zhang, and K. Kim, Error-mitigated
quantum gates exceeding physical fidelities in a trapped-ion system, Nature Communications
11, 587 (2020).
[185] E. v. d. Berg, Z. K. Minev, A. Kandala, and K. Temme, Probabilistic error cancellation with
sparse Pauli-Lindblad models on noisy quantum processors, arXiv:2201.09866 [quant-ph]
(2022), arXiv: 2201.09866.
[186] D. Bultrini, M. H. Gordon, P. Czarnik, A. Arrasmith, P. J. Coles, and L. Cincio, Unifying
and benchmarking state-of-the-art quantum error mitigation techniques, arXiv:2107.13470
[quant-ph] (2021), arXiv: 2107.13470.
[187] E. Huffman, M. G. Vera, and D. Banerjee, Real-time dynamics of plaquette models using
NISQ hardware, arXiv:2109.15065 [cond-mat, physics:hep-lat, physics:quant-ph] (2021),
arXiv: 2109.15065.
[188] P. J. Karalekas, N. A. Tezak, E. C. Peterson, C. A. Ryan, M. P. da Silva, and R. S. Smith,
A quantum-classical cloud platform optimized for variational hybrid algorithms, Quantum
Science and Technology 5, 024003 (2020), arXiv: 2001.04449.
[189] J. M. Pino, J. M. Dreiling, C. Figgatt, J. P. Gaebler, S. A. Moses, M. S. Allman, C. H. Baldwin,
M. Foss-Feig, D. Hayes, K. Mayer, C. Ryan-Anderson, and B. Neyenhuis, Demonstration of
the trapped-ion quantum-CCD computer architecture, Nature 592, 209–213 (2021), arXiv:
2003.01293.
[190] C. H. Baldwin, K. Mayer, N. C. Brown, C. Ryan-Anderson, and D. Hayes, Re-examining the
quantum volume test: Ideal distributions, compiler optimizations, confidence intervals, and
scalable resource estimations, arXiv:2110.14808 [quant-ph] (2021), arXiv: 2110.14808.
[191] P. Czarnik, A. Arrasmith, L. Cincio, and P. J. Coles, Qubit-efficient exponential suppression
of errors, arXiv:2102.06056 [quant-ph] (2021), arXiv: 2102.06056.
[192] S. Aaronson, Shadow Tomography of Quantum States, arXiv:1711.01053 [quant-ph] (2018),
arXiv: 1711.01053.
[193] H.-Y. Huang, R. Kueng, and J. Preskill, Predicting Many Properties of a Quantum System
from Very Few Measurements, Nature Physics 16, 1050 (2020), arXiv: 2002.08953.
[194] H.-Y. Hu and Y.-Z. You, Hamiltonian-driven shadow tomography of quantum states, Phys.
Rev. Research 4, 013054 (2022).
[195] H.-Y. Hu, S. Choi, and Y.-Z. You, Classical Shadow Tomography with Locally Scram-
bled Quantum Dynamics, arXiv:2107.04817 [cond-mat, physics:quant-ph] (2021), arXiv:
2107.04817.
[196] M. Ohliger, V. Nesme, and J. Eisert, Efficient and feasible state tomography of quantum
many-body systems, New Journal of Physics 15, 015024 (2013).
[197] J. Cotler, S. Choi, A. Lukin, H. Gharibyan, T. Grover, M. E. Tai, M. Rispoli, R. Schittko,
P. M. Preiss, A. M. Kaufman, M. Greiner, H. Pichler, and P. Hayden, Quantum Virtual
Cooling, Physical Review X 9, 031013 (2019), arXiv: 1812.02175.
[198] F. G. S. L. Brandao, A. W. Harrow, and M. Horodecki, Local random quantum circuits
are approximate polynomial-designs, Communications in Mathematical Physics 346, 397
(2016), arXiv: 1208.0692.
[199] S. Chen, W. Yu, P. Zeng, and S. T. Flammia, Robust shadow estimation, PRX Quantum 2
(2021), 10.1103/prxquantum.2.030348.
[200] D. Enshan Koh and S. Grewal, Classical Shadows with Noise, arXiv:2011.11580 [quant-ph]
(2020).
[201] K. Bu, D. Enshan Koh, R. J. Garcia, and A. Jaffe, Classical shadows with Pauli-invariant
unitary ensembles, arXiv:2202.03272 [quant-ph] (2022).
[202] E. Magesan, J. M. Gambetta, and J. Emerson, Scalable and robust randomized benchmarking
of quantum processes, Phys. Rev. Lett. 106, 180504 (2011).
[203] J. Claes, E. Rieffel, and Z. Wang, Character randomized benchmarking for non-multiplicity-
free groups with applications to subspace, leakage, and matchgate randomized benchmark-
ing, PRX Quantum 2, 010351 (2021).
[204] R. Levy, D. Luo, and B. K. Clark, Classical Shadows for Quantum Process Tomog-
raphy on Near-term Quantum Computers, arXiv:2110.02965 [cond-mat, physics:physics,
physics:quant-ph] (2021), arXiv: 2110.02965.
[205] S. H. Sack, R. A. Medina, A. A. Michailidis, R. Kueng, and M. Serbyn, Avoiding
barren plateaus using classical shadows, arXiv:2201.08194 [quant-ph] (2022).