Qrack Performance

Abstract

The Qrack quantum simulator is an open-source C++ high performance, general purpose simulation supporting arbitrary numbers of entangled qubits. While there are a variety of other quantum simulators such as [QSharp], [QHiPSTER], and others listed on [Quantiki], Qrack represents a unique offering suitable for applications across the field.

A selection of performance tests are identified for creating comparisons between various quantum simulators. These metrics are implemented and analyzed for Qrack. These experimentally derived results compare favorably against theoretical boundaries, and out-perform naive implementations for many scenarios.

Introduction

There are a growing number of quantum simulators available for research and industry use. Many of them perform quite well for smaller number of qubits, and are suitable for non-rigorous experimental explorations. Fewer projects are suitable as “high performance” candidates in the >32 qubit range. Many rely on the common approach often described as the “Schrödinger method,” doubling RAM usage by a factor of 2 per fully interoperable qubit, or else Feynman path integrals, which can become intractible at arbitrary circuit depth. Attempting to build on the work of IBM’s Breaking the 49-Qubit Barrier in the Simulation of Quantum Circuits [Pednault2017] paper, with more recent attention to potential improvements inspired by Gottesman-Knill stabilizer simulators, Qrack can execute surprisingly general circuits up to 64 qubits in width on modest single nodes, such as constrained cases of the QFT.

Qrack is an open-source quantum computer simulator option, implemented in C++, supporting integration into other popular compilers and interfaces, suitable for utilization in a wide variety of projects. As such, it is an ideal test-bed for establishing a set of benchmarks useful for comparing performance between various quantum simulators.

Qrack provides a “QEngineCPU” and a “QEngineOCL” that represent non-OpenCL and OpenCL base implementations, which could also be the most efficient types to use directly in the limit of maximal entanglement. For general use cases, the “QUnit” layer provides explicit Schmidt decomposition on top of another engine type (per [Pednault2017]) and “QFusion” provides a “gate fusion” layer. A “QEngine” type is always the base layer, and QUnit and QFusion types may be layered over these, and over each other.

This version of the Qrack benchmarks contains comparisons against other publicly available simulators, specifically Cirq, QCGPU, and Qiskit (each with its default simulator, if multiple were available). Qrack has been incorporated as an optional back end for ProjectQ and plugin for Qiskit, in repositories maintained by the developers of Qrack, and benchmarks for their performance will follow.

Reader Guidance

This document is largely targeted towards readers looking for a quantum simulator that desire to establish the expected bounds for various use-cases prior to implementation.

Disclaimers

  • Your Mileage May Vary - Any performance metrics here are the result of experiments executed with selected compilation and execution parameters on a system with a degree of variability; execute the supplied benchmarks on the desired target system for accurate performance assessments.
  • Benchmarking is Hard - While we’ve attempted to perform clean and accurate results, bugs and mistakes do occur. If flaws in process are identified, please let us know!

Method

This performance document is meant to be a simple, to-the-point, and preliminary digest of these results. We plan to submit a formal academic report for peer review of these results, in full detail, in February of 2020. These were prepared with the generous financial support of the Unitary Fund. However, we offer that our benchmark code is public, largely self-explanatory, and easily reproducible, while we prepare that report. Hence, we release these partial preliminary results now.

100 timed trials of single and parallel gates were run for each qubit count between 4 and 28 qubits. Three tests were performed: the quantum Fourier transform, (“QFT”), random circuits constructed from a universal gate set, and an idealized approximation of Google’s Sycamore chip benchmark, as per [Sycamore]. The benchmarking code is available at https://github.com/vm6502q/simulator-benchmarks.

CPU and GPU benchmarks were run on two respective systems that could represent realistic use cases for each engine type. Among AWS virtual machine instances, we sought to find those systems with the lowest possible cost to run the benchmarks for their respective execution times, at or below for the 28 qubit mark. An AWS g3s.xlarge running Ubuntu Server 18.04LTS was selected for GPU benchmarks. An AWS c5.4xlarge running Ubuntu Server 18.04LTS was selected for CPU benchmarks, including FFTW3 for comparison on the QFT test. Benchmarks were collected from December 27, 2019 through January 24, 2020. These results were combined with single gate, N-width gate and Grover’s search benchmarks for Qrack, collected overnight from December 19th, 2018 into the morning of December 20th. (The potential difference since December 2018 in these particular Qrack tests reused from then should be insignificant. We took care to try to report fair tests, within cost limitations, but please let us know if you find anything that appears misrepresentative.)

The average time of each set of 100 was recorded and graphed. Grover’s search to invert a black box subroutine, or “oracle,” was similarly implemented for trials between 5 and 20 qubits, for QEngineOCL with and without QUnit and QFusion layers. Grover’s algorithm was iterated an optimal number of times, vs. qubit count, to maximize probability on a half cycle of the algorithm’s period, being \(floor\left[\frac{\pi}{4asin^2\left(1/\sqrt{2^N}\right)}\right]\) iterations for \(N\) qubits.

Comparative benchmarks included Cirq, QCGPU, and Qiskit, and Qrack’s “QUnit” layer on top of “QFusion” and “QEngineOCL.” (Default simulators were used, for other candidates, if more than one option was available.) QFT benchmarks could be implemented in a straightforward manner on all simulators, and were run as such. Qrack appears to be the only candidate considered for which inputs into the QFT can (drastically) affect its execution time, with permutation basis states being run in much shorter time, for example, hence only Qrack required a more general random input, whereas all other simulators including FFTW3 were started in the |0> state. For a sufficiently representatively general test, Qrack instead used registers of single separable qubits intialized with uniformly randomly distributed probability between |0> and |1>, and uniformly randomly distributed phase offset between those states. FFTW3 was included in this test, in “in-place” mode on complex single floats, since all other DFT tests are effectively “in-place” method, here.

Random universal circuits carried out layers of single qubit gates on every qubit in the width of the test, followed by layers randomly selected couplings of (2-qubit) CNOT, CZ, and SWAP, or (3-qubit) CCNOT, eliminating each selected bit for the layer. 20 layers of 1-qubit-plus-multi-qubit iterations were carried out, for each qubit width, for the benchmarks presented here.

For the Sycamore circuit idealization, the quoted 2-qubit “typical” gate of “iSWAP” with “1/6th CZ” was used in all cases. Sycamore circuits were carried out similarly to random universal circuits and the method of the [Sycamore] paper, interleaving 1-qubit followed by 2-qubit layers, to depth of 20 layers each. Whereas as that original source appears to have randomly fixed its target circuit ahead of any trials, and then carried the same pre-selected circuit out repeatedly for the required number of trials, all benchmarks in the case of this report generated their circuits per-iteration on-the-fly, per the selection criteria as read from the text of [Sycamore]. With the 2-qubit gate idealization already mentioned, Qrack easily implemented the original Sycamore circuit exactly. By nature of the simulation methods used in each other candidate, atomic “convenience method” 1-qubit and 2-qubit gate definitions could potentially easily be added to every every candidate for this test, hence we thought it most representative to make largely performance-irrelevant substitutions of 1-qubit and 2-qubit gates for those candidates which did not already define sufficient API convenience methods, without nonrepresentatively complicated gate decompositions. We strongly encourage the reader to inspect and independently execute the simple benchmarking code which was already linked in the beginning of this “Method” section, for total specific detail.

Qrack QEngine type heap usage was established as very closely matching theoretical expections, in earlier benchmarks, and this has not fundamentally changed. QUnit type heap usage varies greatly dependent on use case, though not in significant excess of QEngine types. No representative RAM benchmarks have been established for QUnit types, yet. QEngine Heap profiling was carried out with Valgrind Massif. Heap sampling was limited but ultimately sufficient to show statistical confidence.

Results

We observed extremely close correspondence with Schrödinger method theoretical complexity and RAM usage considerations for the behavior of QEngine types. QEngineCPU and QEngineOCL require exponential time for a single gate on a coherent unit of N qubits. QUnit types with explicitly separated subsystems as per [Pednault2017] show constant time requirements for the same single gate.

_images/x_single.png _images/cnot_single.png

QEngineCPU and QEngineOCL can perform many identical gates in parallel across entangled subsystems for an approximately constant costs, when total qubits in the engine are held fixed as breadth of the parallel gate application is varied. To test this, we can apply parallel gates at once across the full width of a coherent array of qubits. (CNOT is a two bit gate, so \((N-1)/2\) gates are applied to odd numbers of qubits.) Notice in these next graphs how QEngineCPU and QEngineOCL have similar scaling cost as the single gate graphs above, while QUnit types show a linear trend (appearing logarithmic on an exponential axis scale):

_images/x_all.png _images/cnot_all.png

Heap sampling supports theoretical expecations to high confidence. Complex numbers are represented as 2 single (32-bit) or 2 double (64-bit) accuracy floating point types, for real and imaginary components. The use of double or single precision is controlled by a compilation flag. There is one complex number per permutation in a separable subsystem of qubits. QUnit explicitly separates subsystems, while QEngine maintains complex amplitudes for all \(2^N\) permutations of \(N\) qubits. QEngines duplicate their state vectors once during many gates, like arithmetic gates, for speed and simplicity where it eases implementation.

_images/qrack_ram.png

QUnit explicitly separates its representation of the quantum state and may operate with much less RAM, but QEngine’s RAM usage represents approximately the worst case for QUnit, of maximal entanglement. OpenCL engine types attempt to use memory on the accelerator device instead of general heap when a QEngineOCL instance can fit a single copy of its state vector in a single allocation on the device. On many modern devices, state vectors up to about 1GB in size can be allocated directly on the accelerator device instead of using general heap. A auxiliary buffer used for normalization is half the size of the state vector, and this buffer is always allocated in general heap.

Grover’s algorithm is a relatively ideal test case, in that it allows a modicum of abstraction in implementation while representing an ostensibly practical and common task for truly quantum computational hardware. For 1 expected correct function inversion result, there is a well-defined highest likelihood search iteration count on half a period of the algorithm for a given number of oracle input permutations to search. Time required to run the test for QEngineCPU up to 20 qubits became prohibitive, hence we leave its variants off the graph. However, we can establish baseline for OpenCL performance, with Grover’s search, and compare the performance of a “QEngine-method” to a “QUnit-method.” This graphs shows average time against qubit count for an optimal half period search:

_images/grovers.png

[Broda2016] discusses how Grover’s might be adapted in practicality to actually “search an unstructured database,” or search an unstructured lookup table, and Qrack is also capable of applying Grover’s search to a lookup table with its IndexedLDA, IndexedADC, and IndexedSBC methods. Benchmarks are not given for this arguably more practical application of the algorithm, because few other quantum computer simulator libraries implement it, yet.

The “quantum” (or “discrete”) Fourier transform (QFT/DFT) is a realistic and important test case for its direct application in day-to-day industrial computing applications, as well as for being a common processing step in many quantum algorithms.

_images/qft.png

Recall that QCGPU and Qrack are GPU-implementations run on AWS g3s.xlarge instances, whereas all other candidates are run on AWS c5.4xlarge instances. Under these considerations, by the 28 qubit level, Qrack out-performs all other candidates except FFTW3. (Recall, also, that Qrack uses a representatively “hard” initialization on this test, as described above, whereas permutation basis eigenstate inputs, for example, are much more quickly executed.) Though we are comparing CPU to GPU, CPU-based FFTW3 is clearly the best suited for low numbers of qubits, in general.

Similarly, on random universal circuits, defined above and in the benchmark repository, Qrack and QCGPU are closely matched (for cost) on AWS systems.

_images/random_universal.png

Qrack’s QUnit makes a fundamental improvement on an idealization of the Sycamore circuit, which we strongly encourage the reader to analyze and reproduce with the provided public benchmark code.

_images/sycamore.png

QUnit represents its state vector in terms of decomposed subsystems, when possible and efficient. On user and internal probability checks, QUnit will attempt to separate the representations of independent subsystems by Schmidt decomposition. Further, Qrack will avoid applying phase effects that make no difference to the expectation values of any Hermitian operators, (no difference to “physical observables,”) when optimization can be achieved this way. For each bit whose representation is separated this way, we recover a factor of close to or exactly 1/2 the subsystem RAM and gate execution time. Under the domain constraints, QUnit outperforms all other simulators analyzed.

Discussion

Up to a consistent deviation at low qubit counts, speed and RAM usage for Schrödinger method “QEngine” types is well predicted by theoretical complexity considerations of the gates, up to about a factor of 2 on heap usage for duplication of the state vector, with additional 1/2 the size of state vector allocated by QEngineOCL for an auxiliary normalization buffer. For the additional overhead between Qrack’s optimized QUnit and QCGPU in the comparative benchmarks, the difference might come down to Qrack’s support for a more general API and set of compatible systems.

Qrack is written for scalable work distribution in the OpenCL kernels. QEngineOCL will distribute work among an arbitrarily small number of processing elements and max work item size, smaller than state vector size. Max work item size is a device-specific hardware parameter limiting how many work items may be dispatched in an OpenCL kernel call. QEngineOCL can distribute large numbers of probability amplitude transformations to small numbers of work items, incurring additional looping overhead, whereas QCGPU is written to dispatch one work item to one processing element. QCGPU requires a large enough hardware max work item size to add higher numbers of qubits. Whereas QCGPU might not be, Qrack is theoretically compatible with OpenCL devices with smaller maximum work item counts, such as CPUs like the ARM7, on which Qrack has been regularly unit tested.

Qrack gives the option to normalize its state vector while flooring noise-level amplitudes at on-the-fly opportunities, to optimize while correcting for float rounding error. QEngineOCL was designed to support access by separate QEngineOCL instances in different threads to shared OpenCL devices, as well as optional out-of-order OpenCL queue execution, when available. QEngineOCL can also dispatch a queue of gates completely asynchronously, without blocking the main execution thread. Runtime options and design features to support a broad range of platforms do add to Qrack’s execution overhead, but these make Qrack the best all-around quantum computer simulator for personal and heterogeneous hardware.

Further Work

A formalized report of the above and additional benchmark results, in much greater detail and specificity, is planned to be submitted for publication in February, 2020, thanks to the generous support of the Unitary Fund.

Qrack previously contained two experimental multiprocessor types, “QEngineOCLMulti” based on the algorithms developed in Intel’s [QHiPSTER], and the simpler QUnitMulti type, which dispatches different separable subsystems to different processors. These failed to outperform the single processor QEngineOCL. However, as Qrack has added optional support as a simulator for ProjectQ, we have effectively gained access to the quantum network simulator “SimulaQron” by SoftwareQuTech. At least one Qrack user is experimenting with scaling deployments of containers loaded with Qrack, ProjectQ, and SimulaQron as an effective solution for multiprocessor and cluster operations, and the Qrack team is looking at this and related approaches for this purpose. An asynchronous quantum P2P model, for effective multiprocessor support, should hopefully reduce inter-device communication overhead bottlenecks.

With the new generation of “VPU” processors available in 2019 and 2020, (for visual inference,) it might be possible to co-opt VPU capabilities for inference of raw state vector features, such as Schmidt separability, to improve the performance of QUnit. The authors of Qrack have started looking at this hardware for this purpose.

We will maintain systematic comparisons to published benchmarks of quantum computer simulation standard libraries, as they arise.

Conclusion

Per [Pednault2017], and many other attendant and synergistic optimizations engineered specifically in Qrack’s QUnit, explicitly separated subsystems of qubits in QUnit have a significant RAM and speed edge in many cases over the “Schrödinger algorithm” of most popular quantum computer simulators. Qrack gives very efficient performance on a single node past 32 qubits, up to the limit of maximal entanglement.