Original source: Celer Network
We would like to thank the teams at Polygon Zero, Consensys gnark, Pado Labs, and Delphinus Lab for their valuable suggestions and feedback on this article.
Over the past few months, we have invested a great deal of time and effort in developing cutting-edge infrastructure built on zk-SNARK succinct proofs. This next-generation innovation platform enables developers to build new paradigms of blockchain applications that were not possible before.
In our development work, we tested and used a variety of zero-knowledge proof (ZKP) development frameworks. While the journey has been rewarding, we do realize that the variety of ZKP frameworks often presents challenges for new developers as they try to find the one that best suits their specific use cases and performance requirements. Given this pain point, we believe there is a need for a community evaluation platform that can provide comprehensive performance test results, which will greatly facilitate the development of these new applications.
To meet this need, we have launched the "ZKP Pantheon", a public-good community initiative. As a first step, the initiative will encourage the community to share reproducible performance test results for various ZKP frameworks. Our ultimate goal is to work together to create and maintain a widely recognized test platform that evaluates low-level circuit development frameworks, high-level zkVMs and compilers, and even hardware acceleration providers. We hope that this will accelerate ZKP adoption by giving developers more performance data to compare when choosing a framework. At the same time, we hope to drive the improvement and iteration of the ZKP frameworks themselves by providing a commonly referenced set of performance test results. We will invest heavily in this project and invite all like-minded community members to join us and contribute to this work!
In this article, we take the first step towards building the ZKP Pantheon by providing a reproducible set of performance test results for SHA-256 across a series of low-level circuit development frameworks. While we acknowledge that other benchmark granularities and primitives may also be viable, we chose SHA-256 because it is applicable to a wide range of ZKP use cases, including blockchain systems, digital signatures, zkDID, and more. It's also worth mentioning that we use SHA-256 in our own system, so it's convenient for us too!
Our performance tests evaluated SHA-256 on a variety of zk-SNARK and zk-STARK circuit development frameworks. Through this comparison, we aim to give developers insight into the efficiency and practicality of each framework. We hope the results will help developers make informed decisions when choosing the best framework for their needs.
In recent years, we have observed a proliferation of zero-knowledge proof systems. As challenging as it is to keep up with all the exciting advances in this field, we have carefully selected the following proof systems based on maturity and developer adoption as test subjects. Our goal is to provide a representative sample of different front-end/back-end combinations.
Circom is a popular DSL for writing circuits and generating R1CS constraints, and snarkjs can generate Groth16 or Plonk proofs for Circom circuits. Rapidsnark is another prover for Circom that generates Groth16 proofs; it is generally much faster than snarkjs because it uses the ADX extension and parallelizes proof generation as much as possible.
gnark is a comprehensive Golang framework from Consensys that supports Groth16, Plonk, and many more advanced features.
Arkworks is a comprehensive Rust framework for zk-SNARKs.
Halo2 is Zcash's zk-SNARK implementation based on Plonk. It is equipped with a highly flexible Plonkish arithmetization that supports many useful primitives, such as custom gates and lookup tables. We use a Halo2 fork with KZG support from the Ethereum Foundation and Scroll.
Plonky2 is a SNARK implementation from Polygon Zero based on PLONK and FRI techniques. Plonky2 uses the small Goldilocks field and supports efficient recursion. In our performance tests, we targeted 100 bits of conjectured security and used the parameters that yielded the best proving time for the benchmark workload. Specifically, we used 28 Merkle queries, a blowup factor of 8, and a 16-bit proof-of-work grinding challenge. In addition, we set num_of_wires = 60 and num_routed_wires = 60.
Starky is Polygon Zero's high-performance STARK framework. In our performance tests, we targeted 100 bits of conjectured security and used the parameters that yielded the best proving time. Specifically, we used 90 Merkle queries, a blowup factor of 2, and a 10-bit proof-of-work grinding challenge.
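The 100-bit target for both FRI-based configurations can be sanity-checked with a commonly used back-of-the-envelope estimate of conjectured FRI security: the number of queries times log2 of the blowup factor, plus the proof-of-work bits. The sketch below only reproduces that arithmetic; it is not code from Plonky2 or Starky, and the function name is hypothetical.

```python
def conjectured_fri_security_bits(num_queries: int, blowup_factor: int, pow_bits: int) -> int:
    """Rough conjectured security estimate for FRI-based provers.

    Each query contributes log2(blowup_factor) bits, and proof-of-work
    grinding adds pow_bits on top. This is an estimate, not a proof.
    """
    if blowup_factor < 2 or blowup_factor & (blowup_factor - 1) != 0:
        raise ValueError("blowup_factor is assumed to be a power of two >= 2")
    rate_bits = blowup_factor.bit_length() - 1  # exact log2 for powers of two
    return num_queries * rate_bits + pow_bits


# Parameters quoted in the article.
plonky2_bits = conjectured_fri_security_bits(num_queries=28, blowup_factor=8, pow_bits=16)
starky_bits = conjectured_fri_security_bits(num_queries=90, blowup_factor=2, pow_bits=10)

print("Plonky2:", plonky2_bits)  # 28 * 3 + 16 = 100
print("Starky: ", starky_bits)   # 90 * 1 + 10 = 100
```

The estimate also shows the trade-off behind the two configurations: a smaller blowup factor means less work per proof but more queries, and therefore a larger proof.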
The following table summarizes the frameworks above and the related configurations used in our performance tests. This list is by no means exhaustive, and we will also look at many state-of-the-art frameworks and techniques in the future (e.g., Nova, GKR, Hyperplonk).
Note that these performance test results apply only to circuit development frameworks. We plan to publish a separate article in the future benchmarking different zkVMs (e.g., Scroll, Polygon zkEVM, Consensys zkEVM, zkSync, Risc Zero, zkWasm) and IR compiler frameworks (e.g., Noir, zkLLVM).
To test the performance of these proof systems, we computed the SHA-256 hash of N bytes of data, experimenting with N = 64, 128, ..., 64K (Starky was an exception: its circuit repeats the SHA-256 computation on a fixed 64-byte input while keeping the same total number of message blocks). The benchmark code and SHA-256 circuit configurations can be found in this repository.
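To make the workload concrete, the following sketch enumerates the preimage sizes and the number of 64-byte SHA-256 message blocks each one produces after standard padding. It assumes the sizes double at each step (the source shows only 64, 128, ..., 64K), that 64K means 65,536 bytes, and that an all-zero preimage is an acceptable stand-in; none of this is taken from the benchmark repository.

```python
import hashlib

BLOCK_BYTES = 64  # SHA-256 processes the message in 64-byte blocks


def preimage_sizes(start: int = 64, stop: int = 64 * 1024) -> list[int]:
    """Sizes 64, 128, ..., 64K, assuming a doubling progression."""
    sizes = []
    n = start
    while n <= stop:
        sizes.append(n)
        n *= 2
    return sizes


def sha256_block_count(n_bytes: int) -> int:
    """Blocks after padding: one 0x80 byte plus an 8-byte length, rounded up."""
    return (n_bytes + 1 + 8 + BLOCK_BYTES - 1) // BLOCK_BYTES


for n in preimage_sizes():
    digest = hashlib.sha256(bytes(n)).hexdigest()  # all-zero preimage as a placeholder
    print(f"N = {n:>6} bytes  blocks = {sha256_block_count(n):>5}  sha256 = {digest[:16]}...")
```

Whether a given circuit implements the padding block or hashes only the raw blocks is a detail of each implementation, so the block counts here are an upper-level guide rather than exact circuit sizes.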
In addition, we performed a performance test on each system using the following performance metrics:
Proof generation time (including witness generation time)
Peak memory usage during proof generation
Average CPU utilization percentage during proof generation. (This metric reflects the degree of parallelism in proof generation.)
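The three metrics above can be collected for any prover that runs as a separate process. The following is a minimal sketch, assuming a Unix-like system and Python 3.9+; it is not the harness used for these benchmarks, and the command it runs is only a placeholder for a real prover invocation.

```python
import os
import subprocess
import sys
import time


def measure(cmd: list[str]) -> dict:
    """Run cmd and report wall time, peak memory, and average CPU utilization."""
    start = time.perf_counter()
    proc = subprocess.Popen(cmd)
    # os.wait4 returns the resource usage of this specific child process.
    _, status, usage = os.wait4(proc.pid, 0)
    wall = time.perf_counter() - start
    proc.returncode = os.waitstatus_to_exitcode(status)  # child already reaped

    cpu_seconds = usage.ru_utime + usage.ru_stime
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    rss_scale = 1 if sys.platform == "darwin" else 1024
    cpu_percent = 100.0 * cpu_seconds / wall
    return {
        "exit_code": proc.returncode,
        "wall_seconds": wall,
        "peak_rss_bytes": usage.ru_maxrss * rss_scale,
        "cpu_percent": cpu_percent,  # can exceed 100 on multi-core runs
        "cpu_percent_per_core": cpu_percent / (os.cpu_count() or 1),
    }


# Placeholder workload; replace with the actual prover command line.
result = measure([sys.executable, "-c", "sum(range(1_000_000))"])
for key, value in result.items():
    print(f"{key}: {value}")
```

Dividing CPU time by wall time gives a utilization figure where 100% corresponds to one fully busy core, which is why the per-core value is reported separately.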
Please note that we treat proof size and proof verification cost only loosely, since both can be mitigated by composing with Groth16 or KZG before going on-chain.
We performed performance tests on two different machines:
Linux Server: 20 cores @ 2.3GHz, 384GB RAM
Macbook M1 Pro: 10 cores @ 3.2GHz, 16GB RAM
The Linux server is used to simulate the scenario where there are many CPU cores and sufficient memory. The Macbook M1 Pro, typically used for research and development, has a more powerful CPU but fewer cores.
We enabled optional multithreading, but we did not use GPU acceleration for this performance test. We plan to do GPU performance testing in the future.
Before we move on to the detailed performance test results, it is useful to first understand the complexity of SHA-256 by looking at the number of constraints in each proof system. It is important to note that constraint counts cannot be compared directly across different arithmetization schemes.
The results below correspond to a preimage size of 64KB. Results for other preimage sizes will differ, but they scale roughly linearly.
Circom, gnark, and Arkworks all use the R1CS arithmetization, and the number of R1CS constraints for 64KB SHA-256 is roughly between 30M and 45M. The differences among Circom, gnark, and Arkworks are probably due to configuration differences.
Both Halo2 and Plonky2 use Plonkish arithmetization, where the number of rows ranges from 2^22 to 2^23. Halo2's SHA-256 implementation is much more efficient than Plonky2's due to its use of lookup tables.
Starky uses the AIR arithmetization, where the execution trace table requires 2^16 transition steps.
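As a rough illustration of the linear-scaling remark, the R1CS range quoted above can be converted into an approximate per-block cost and then rescaled to other preimage sizes. The sketch assumes 64KB means 65,536 bytes and standard SHA-256 padding; the helper names are hypothetical and the outputs are estimates, not measured values.

```python
def sha256_block_count(n_bytes: int) -> int:
    """64-byte blocks after padding (0x80 byte plus 8-byte length)."""
    return (n_bytes + 9 + 63) // 64


def constraints_per_block(total_at_64kb: int, n_bytes: int = 64 * 1024) -> float:
    """Average constraints per SHA-256 block implied by a 64KB total."""
    return total_at_64kb / sha256_block_count(n_bytes)


def scaled_estimate(total_at_64kb: int, n_bytes: int) -> float:
    """Linear rescaling of a 64KB constraint count to another preimage size."""
    return total_at_64kb * sha256_block_count(n_bytes) / sha256_block_count(64 * 1024)


for total in (30_000_000, 45_000_000):
    print(
        f"{total:,} constraints at 64KB"
        f" ~ {constraints_per_block(total):,.0f} per block,"
        f" ~ {scaled_estimate(total, 4 * 1024):,.0f} at 4KB"
    )
```

The same rescaling does not transfer directly to the Plonkish and AIR figures, because their tables are typically padded to a power-of-two number of rows.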
[Figure 1] shows the proof generation time of each framework for SHA-256 across a variety of preimage sizes on the Linux server. Here's what we found:
For SHA-256, Groth16 frameworks (rapidsnark, gnark, and Arkworks) generate proofs faster than Plonk frameworks (Halo2 and Plonky2). This is because SHA-256 consists mostly of bitwise operations, where the wire values are either 0 or 1. For Groth16, this reduces most of the computation from elliptic curve scalar multiplication to elliptic curve point addition. However, wire values are not used directly in Plonk's computation, so the special wire structure in SHA-256 does not reduce the amount of computation required in the Plonk frameworks.
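The effect of 0/1 wire values can be illustrated with a simple operation count. The sketch below models a multi-scalar multiplication with plain double-and-add and counts group operations for full-size scalars versus single-bit scalars. It is a toy cost model under stated assumptions: a 254-bit scalar field (as on BN254-style curves), no actual curve arithmetic, and none of the bucket-based optimizations that production provers use.

```python
import random

FIELD_BITS = 254  # assumed scalar size; typical for BN254-style curves


def scalar_mul_ops(k: int) -> int:
    """Group operations used by plain double-and-add to compute k * P."""
    if k == 0:
        return 0
    doublings = k.bit_length() - 1
    additions = bin(k).count("1") - 1
    return doublings + additions


def msm_ops(scalars: list[int]) -> int:
    """Operations for sum(k_i * P_i): per-term cost plus one accumulation add."""
    return sum(scalar_mul_ops(k) + 1 for k in scalars if k != 0)


rng = random.Random(0)
n = 1000
full_scalars = [rng.getrandbits(FIELD_BITS) for _ in range(n)]
bit_scalars = [rng.getrandbits(1) for _ in range(n)]

print("ops with full-size scalars:", msm_ops(full_scalars))
print("ops with 0/1 scalars:      ", msm_ops(bit_scalars))
```

With 0/1 scalars, each term is either skipped or costs a single point addition, which is the reduction the paragraph above describes.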
Among the Groth16 frameworks, gnark and rapidsnark are 5 to 10 times faster than Arkworks and snarkjs. This is due to their excellent ability to parallelize proof generation across multiple cores. gnark is 25% faster than rapidsnark.
For the Plonk frameworks, at larger preimage sizes (> 4KB), Plonky2's SHA-256 is 50% slower than Halo2's. This is because Halo2's implementation mainly uses lookup tables to speed up bitwise operations, resulting in 2x fewer rows than Plonky2. However, if we compare Plonky2 and Halo2 at similar numbers of rows (for example, SHA-256 over 2KB in Halo2 versus SHA-256 over 4KB in Plonky2), Plonky2 is 50% faster than Halo2. If SHA-256 were implemented with lookup tables in Plonky2, we would expect Plonky2 to be faster than Halo2, even though Plonky2 has a larger proof size.
On the other hand, when the input preimage size is small (<= 512 bytes), Halo2 is slower than Plonky2 (and the other frameworks) because the fixed setup cost of the lookup tables accounts for most of the constraints. However, Halo2's performance becomes more competitive as the preimage size increases: its proof generation time remains roughly constant for preimage sizes up to 2KB and scales almost linearly beyond that, as shown in the figure.
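The trade-off between a fixed table cost and cheaper bitwise operations can be seen in a small model. The sketch below builds a table of all 4-bit XOR results once and then evaluates a 32-bit XOR with eight lookups. It is a conceptual illustration only: the 4-bit chunk size and the table layout are assumptions chosen for readability and do not reflect the tables actually used by the Halo2 SHA-256 circuit.

```python
CHUNK_BITS = 4
CHUNK_MASK = (1 << CHUNK_BITS) - 1

# Fixed cost: one entry for every pair of 4-bit inputs, independent of how
# much data is hashed afterwards.
XOR_TABLE = {
    (a, b): a ^ b
    for a in range(1 << CHUNK_BITS)
    for b in range(1 << CHUNK_BITS)
}


def xor32_via_lookups(x: int, y: int) -> tuple[int, int]:
    """XOR two 32-bit words chunk by chunk; returns (result, lookups used)."""
    result = 0
    lookups = 0
    for shift in range(0, 32, CHUNK_BITS):
        a = (x >> shift) & CHUNK_MASK
        b = (y >> shift) & CHUNK_MASK
        result |= XOR_TABLE[(a, b)] << shift
        lookups += 1
    return result, lookups


value, lookups = xor32_via_lookups(0xDEADBEEF, 0x12345678)
print(f"table rows: {len(XOR_TABLE)}, lookups per 32-bit XOR: {lookups}")
print(f"result matches native XOR: {value == 0xDEADBEEF ^ 0x12345678}")
```

For a handful of words the table dominates the cost; over thousands of SHA-256 rounds it is amortized, which matches the small-versus-large preimage behavior described above.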
As expected, Starky's proof generation time is much shorter (5x to 50x) than that of any SNARK framework, but this comes at the cost of a larger proof size.
It is also important to note that even though the circuit size is linear in the preimage size, proof generation time for SNARKs grows superlinearly because of the O(n log n) FFTs (although this is not evident on the chart due to the logarithmic scale).
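A quick calculation shows why the superlinear growth is hard to see on a log-scale chart. Using n * log2(n) as a stand-in for FFT cost (an idealized model that ignores constants, memory effects, and the other prover steps), doubling n multiplies the cost by only slightly more than 2:

```python
import math


def fft_cost(n: int) -> float:
    """Idealized FFT cost model: n * log2(n), constants ignored."""
    return n * math.log2(n)


def doubling_ratio(n: int) -> float:
    """Factor by which the modeled cost grows when n doubles."""
    return fft_cost(2 * n) / fft_cost(n)


for log_n in (10, 16, 20, 24):
    n = 1 << log_n
    print(f"n = 2^{log_n}: doubling n multiplies n*log2(n) by {doubling_ratio(n):.3f}")
```

The ratio equals 2 * (log2(n) + 1) / log2(n), so at the circuit sizes in this benchmark the extra factor over pure linear scaling is only a few percent per doubling.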
We also measured proof generation time on the Macbook M1 Pro, as shown in [Figure 2]. Note, however, that rapidsnark was not included in this performance test because it lacks support for the arm64 architecture. In order to use snarkjs on arm64, we had to generate the witness using WebAssembly, which is slower than the C++ witness generation used on the Linux server.
A few additional observations when running performance tests on the Macbook M1 Pro:
With the exception of Starky, all SNARK frameworks hit out-of-memory (OOM) errors or start using swap memory (resulting in slower proving times) as the preimage size grows. Specifically, the Groth16 frameworks (snarkjs, gnark, Arkworks) start using swap memory when the preimage size is >= 8KB, and gnark runs out of memory when the preimage size reaches 64KB. Halo2 runs into memory limits when the preimage size is >= 32KB. Plonky2 starts using swap memory when the preimage size is >= 8KB.
The FRI-based frameworks (Starky and Plonky2) were approximately 60% faster on the Macbook M1 Pro than on the Linux server, while the other frameworks had similar proving times on both machines. As a result, even though Plonky2 does not use lookup tables, it achieves almost the same proving time as Halo2 on the Macbook M1 Pro. The main reason is that the Macbook M1 Pro has a more powerful CPU but fewer cores. FRI mainly performs hashing and is sensitive to CPU clock speed, but it does not parallelize as well as KZG or Groth16.
[Figure 3] and [Figure 4] show the peak memory usage during proof generation on Linux Server and Macbook M1 Pro, respectively. Based on these performance test results, the following observations can be made:
rapidsnark is the most memory efficient of all the SNARK frameworks. We also see that, because of the fixed setup cost of lookup tables, Halo2 uses more memory when the preimage size is small but consumes less memory overall when the preimage size is large.
Starky is more than 10 times more memory efficient than the SNARK frameworks. This is partly because it uses fewer rows.
It should be noted that peak memory usage on the Macbook M1 Pro remained relatively flat as the preimage size increased, because of the use of swap memory.
We evaluated the degree of parallelization of each proof system by measuring the average CPU utilization during SHA-256 proof generation for a 4KB preimage input. The table below shows the average CPU utilization (average utilization per core in parentheses) on the Linux server (20 cores) and the Macbook M1 Pro (10 cores).
The main observations are as follows:
gnark and rapidsnark show the highest CPU utilization on the Linux server, demonstrating their ability to use multiple cores efficiently and parallelize proof generation. Halo2 also shows good parallelization performance.
Most frameworks have twice the CPU utilization on the Linux server compared with the Macbook M1 Pro, with the exception of snarkjs.
Although we initially expected that FRI-based frameworks (Plonky2 and Starky) might struggle to use multiple cores effectively, they performed no worse than some Groth16 or KZG frameworks in our performance tests. It remains to be seen whether there is a difference in CPU utilization on a machine with more cores (say, 100 cores).
This article provides a comprehensive comparison of SHA-256 performance across various zk-SNARK and zk-STARK development frameworks. Through this comparison, we gained insight into the efficiency and practicality of each framework, in the hope of helping developers who need to generate succinct proofs for SHA-256 operations. We found that Groth16 frameworks (e.g. rapidsnark, gnark) were faster at generating proofs than Plonk frameworks (e.g. Halo2, Plonky2). Lookup tables in Plonkish arithmetization significantly reduce the constraints and proving time of SHA-256 at larger preimage sizes. In addition, gnark and rapidsnark demonstrated an excellent ability to use multiple cores in parallel. Starky's proof generation time, on the other hand, is much shorter, at the cost of a much larger proof size. In terms of memory efficiency, rapidsnark and Starky are superior to the other frameworks.
As the first step in building the zero-knowledge proof evaluation platform "ZKP Pantheon", we acknowledge that the results of this performance test fall far short of the comprehensive test platform we ultimately hope to build. We welcome feedback and criticism, and invite everyone to contribute to this initiative to make zero-knowledge proofs easier to use and lower the barrier to entry for developers. We are also willing to provide grants to individual independent contributors to cover the cost of computing resources for large-scale performance testing. Together, we hope to improve the efficiency and practicality of ZKP for the wider benefit of the community.
Finally, we would like to thank Polygon Zero, Consensys gnark, Pado Labs, and the Delphinus Lab team for their valuable review and feedback on the performance test results.