Zero Knowledge Proof — FPGA or GPU?

Sep 7, 2022

Zero-Knowledge Proof (ZKP) now has a wide range of applications, such as privacy-preserving computation, proofs of computation, and consensus proofs. While looking for more suitable application scenarios, many people find that the performance of Zero-Knowledge Proof is the bottleneck. Trapdoor Tech has been researching Zero-Knowledge Proof in depth since 2019 and working on high-performance ZKP solutions. GPUs and FPGAs are the common acceleration platforms. In this article, we look into MSM computation and analyze the advantages and disadvantages of using FPGAs and GPUs to accelerate Zero-Knowledge Proof computation.

TL;DR

ZKP has broad prospects and has been adopted in more and more applications. However, there are many ZKP algorithms, and projects implement them in various ways. Meanwhile, the performance of these ZKP systems is not good enough. This article goes deep into the algorithms behind MSM, point addition on Elliptic Curves, Montgomery multiplication, etc., and then compares the performance of GPU and FPGA for the point addition algorithm on BLS12_381. Generally speaking, for ZKP computation the GPU has obvious advantages in the short term, such as high throughput, a high price/performance ratio, and strong programmability. But FPGAs are becoming more and more powerful and have their own advantage in power consumption. In the long term, it’s possible that some new FPGA chips, or even customized ASIC chips, could be suitable for ZKP computation.

Complex ZKP algorithm

ZKP (Zero-Knowledge Proof) is a generic term; at the time of writing, ZKP systems can be roughly divided into two categories: zk-SNARK and zk-STARK. Several zk-SNARK algorithms have been applied in industry, e.g., Groth16, PLONK, PLOOKUP, Marlin, and Halo/Halo2. zk-SNARK algorithms evolve along two main directions: 1/ whether a trusted setup is required, and 2/ the performance of circuits. One advantage of zk-STARK is that no trusted setup is required, but the computational complexity of verification is log-linear.

Here are some examples of industry applications:

Groth16: Filecoin, to prove the reliability of the data, circuit size 2²⁷
PLONK: zkSync, to prove the correctness of the off-chain transactions, circuit size 2²⁶
Marlin: Aleo, to prove the consensus and off-chain smart contracts
Halo2: zkEVM, to prove off-chain generic computation
zk-STARK: StarkNet

In zk-SNARK/zk-STARK applications, the ZKP algorithms utilized by different projects are relatively scattered. zk-SNARK algorithms are likely to see more applications, because PLONK/Halo2 algorithms are universal (no per-circuit trusted setup required).

Computation Load of PLONK

Let’s break down the computation load of the PLONK algorithm as an example.

The computation load of PLONK’s proving part consists of four components:

1/ MSM: Multi-Scalar Multiplication. MSM is often used to compute polynomial commitments.

2/ NTT calculation: polynomial transforms between point-value and coefficient representations.

3/ Polynomial Calculation: polynomial addition, subtraction, multiplication, division, polynomial evaluation, etc.

4/ Circuit Synthesis: the computation of this component is related to the size/complexity of the circuit.

Circuit Synthesis generally contains many more conditional and loop statements, which means less parallelism, so it is better suited to running on the CPU. ZKP acceleration work therefore usually focuses on the first three components. Among them, MSM accounts for the largest computation load, followed by NTT.
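To make the NTT component more concrete, here is a minimal sketch of a radix-2 NTT that converts a polynomial from coefficient form to point-value form and back. The prime 998244353 and generator 3 are assumptions chosen because they are a common NTT-friendly pair; they are not the scalar field of any curve discussed in this article, and the function names are hypothetical.

```python
# Toy number-theoretic transform (NTT) over the prime field F_P.
# Assumption: P = 998244353 = 119 * 2**23 + 1 with generator 3, a common
# NTT-friendly prime. It is NOT the scalar field of BLS12_381.
P = 998244353
GENERATOR = 3


def ntt(coeffs, root):
    """Evaluate the polynomial `coeffs` at root**0, root**1, ..., root**(n-1)."""
    n = len(coeffs)
    if n == 1:
        return coeffs[:]
    # Split into even- and odd-indexed coefficients (Cooley-Tukey butterfly).
    even = ntt(coeffs[0::2], root * root % P)
    odd = ntt(coeffs[1::2], root * root % P)
    out = [0] * n
    w = 1
    for i in range(n // 2):
        t = w * odd[i] % P
        out[i] = (even[i] + t) % P
        out[i + n // 2] = (even[i] - t) % P  # uses root**(n/2) == -1
        w = w * root % P
    return out


def forward(coeffs):
    """Coefficients -> point values. len(coeffs) must be a power of two."""
    n = len(coeffs)
    root = pow(GENERATOR, (P - 1) // n, P)  # primitive n-th root of unity
    return ntt(coeffs, root)


def inverse(values):
    """Point values -> coefficients."""
    n = len(values)
    root_inv = pow(pow(GENERATOR, (P - 1) // n, P), P - 2, P)
    n_inv = pow(n, P - 2, P)
    return [v * n_inv % P for v in ntt(values, root_inv)]
```

Calling inverse(forward(c)) on a coefficient list c whose length is a power of two returns c again. The butterflies inside each recursion level are independent of one another, which is why NTT is a candidate for parallel hardware.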

What’s MSM?

MSM (Multi-Scalar Multiplication) computes a single point on an Elliptic Curve (EC) from a given series of points and scalars: each point is multiplied by its scalar, and the results are combined by point addition.

For example, given a fixed set of points on a specific EC:

$$[G_0, G_1, \dots, G_{n-1}]$$

and randomly sampled finite field elements from the specified scalar field:

$$[s_0, s_1, \dots, s_{n-1}]$$

MSM is the calculation that yields the Elliptic Curve point Q:

$$Q = \sum_{i=0}^{n-1} s_i G_i$$

Pippenger algorithm is generally adopted to optimize MSM. Here’s a closer look at the schematic diagram of Pippenger algorithm:

The process of Pippenger algorithm consists of 2 steps:

1/ Divide the Scalars into Windows. If each Scalar is 256 bits and a Window is 8 bits, then every Scalar is divided into 256/8=32 Windows. In each layer of Windows, the intermediate results are temporarily stored in “Buckets”, and GW_x is the accumulated point of its layer. Calculating GW_x is simple: iterate over every Scalar in the layer and point-add G_x into the Bucket indexed by that Scalar’s Window value. The principle is also simple: if two points share the same Scalar, point-add the two points first and then multiply the sum by the Scalar once, instead of multiplying each point by the Scalar and then adding the two results.

2/ The per-Window points calculated in step 1 are accumulated through double-and-add to get the final result.
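The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the curve y^2 = x^3 + 7 over the toy prime 10007, the 16-bit scalars, the 4-bit Windows, and all function names are hypothetical choices made for readability, not the parameters or code of any production prover.

```python
# Toy curve y^2 = x^3 + 7 over F_P (assumed parameters, not BLS12_381).
# Points are affine (x, y) tuples; None is the point at infinity.
P = 10007  # small prime with P % 4 == 3, so square roots are easy


def add(p1, p2):
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None  # p1 == -p2
    if p1 == p2:
        lam = 3 * x1 * x1 * pow(2 * y1 % P, -1, P) % P
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)


def mul(k, pt):
    """Plain double-and-add scalar multiplication."""
    acc = None
    while k:
        if k & 1:
            acc = add(acc, pt)
        pt = add(pt, pt)
        k >>= 1
    return acc


def naive_msm(scalars, points):
    acc = None
    for s, g in zip(scalars, points):
        acc = add(acc, mul(s, g))
    return acc


def pippenger_msm(scalars, points, scalar_bits=16, window_bits=4):
    num_windows = (scalar_bits + window_bits - 1) // window_bits
    window_sums = []
    for w in range(num_windows):
        # Step 1: drop each point into the Bucket indexed by its Window value.
        buckets = [None] * (1 << window_bits)
        for s, g in zip(scalars, points):
            idx = (s >> (w * window_bits)) & ((1 << window_bits) - 1)
            if idx:
                buckets[idx] = add(buckets[idx], g)
        # sum_j j * buckets[j], computed with a running sum (additions only).
        running, total = None, None
        for j in range(len(buckets) - 1, 0, -1):
            running = add(running, buckets[j])
            total = add(total, running)
        window_sums.append(total)
    # Step 2: combine the Windows, most significant first, by double-and-add.
    result = None
    for total in reversed(window_sums):
        for _ in range(window_bits):
            result = add(result, result)
        result = add(result, total)
    return result


def find_points(count):
    """Collect the first `count` curve points by scanning x = 1, 2, ..."""
    pts, x = [], 1
    while len(pts) < count:
        rhs = (x**3 + 7) % P
        y = pow(rhs, (P + 1) // 4, P)
        if y * y % P == rhs:
            pts.append((x, y))
        x += 1
    return pts
```

For any scalars below 2^16, pippenger_msm(scalars, points) returns the same point as naive_msm(scalars, points). Note that the bucket version never performs a scalar multiplication: everything reduces to point additions and doublings.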

The Pippenger algorithm has various optimizations. In every case, the basic operation of the MSM algorithm is point addition on the EC, and different optimizations usually differ in the number of point additions they require.

Point Add on Elliptic Curves

You can check out the various algorithms for point additions on “short Weierstrass” Elliptic Curves from this website.

http://www.hyperelliptic.org/EFD/g1p/auto-shortw-jacobian-0.html#addition-madd-2007-bl

Suppose the Jacobian coordinates of two points are (X1, Y1, Z1) and (X2, Y2, Z2), where Z2 = 1 for the mixed addition (“madd”) linked above; then the result (X3, Y3, Z3) of Point Addition can be calculated by the following formula:

Z1Z1 = Z1^2
U2 = X2*Z1Z1
S2 = Y2*Z1*Z1Z1
H = U2-X1
HH = H^2
I = 4*HH
J = H*I
r = 2*(S2-Y1)
V = X1*I
X3 = r^2-J-2*V
Y3 = r*(V-X3)-2*Y1*J
Z3 = (Z1+H)^2-Z1Z1-HH
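As a sanity check, the formula can be transcribed directly into Python and compared against the textbook affine chord rule. This is a sketch under assumptions: the toy prime 10007 and the curve y^2 = x^3 + 7 are hypothetical stand-ins for a real 381-bit field, the helper names are made up for this example, and the formula is only valid when the two points have different x coordinates. It requires Python 3.8+ for pow(x, -1, P).

```python
P = 10007  # toy prime (assumption); a real curve would use a ~381-bit prime


def madd(X1, Y1, Z1, X2, Y2):
    """Jacobian (X1, Y1, Z1) + affine (X2, Y2), i.e. Z2 = 1."""
    Z1Z1 = Z1 * Z1 % P
    U2 = X2 * Z1Z1 % P
    S2 = Y2 * Z1 * Z1Z1 % P
    H = (U2 - X1) % P
    HH = H * H % P
    I = 4 * HH % P
    J = H * I % P
    r = 2 * (S2 - Y1) % P
    V = X1 * I % P
    X3 = (r * r - J - 2 * V) % P
    Y3 = (r * (V - X3) - 2 * Y1 * J) % P
    Z3 = ((Z1 + H) ** 2 - Z1Z1 - HH) % P
    return X3, Y3, Z3


def to_affine(X, Y, Z):
    """Jacobian -> affine: x = X / Z^2, y = Y / Z^3."""
    zinv = pow(Z, -1, P)
    return X * zinv**2 % P, Y * zinv**3 % P


def affine_add(p1, p2):
    """Textbook chord rule for two points with different x coordinates."""
    (x1, y1), (x2, y2) = p1, p2
    lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return x3, (lam * (x1 - x3) - y1) % P


def find_points(count):
    """First `count` points on y^2 = x^3 + 7 over F_P (P % 4 == 3)."""
    pts, x = [], 1
    while len(pts) < count:
        rhs = (x**3 + 7) % P
        y = pow(rhs, (P + 1) // 4, P)
        if y * y % P == rhs:
            pts.append((x, y))
        x += 1
    return pts
```

Lifting an affine point (x, y) to Jacobian form with any non-zero Z, as (x*Z^2, y*Z^3, Z), and feeding it to madd gives the same affine result as affine_add. The Jacobian version needs no modular inversion, only multiplications, squarings, and additions.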

The reason the calculation is given in detail is to demonstrate that the whole calculation consists mostly of integer operations. The bit-width of the integers depends on the parameters of the Elliptic Curve. The bit-widths of some common Elliptic Curves are:

BN256 - 256 bits
BLS12_381 - 381 bits
BLS12_377 - 377 bits

Note that these integer operations are performed modulo a prime, i.e. in a finite field. Modular addition/subtraction is simple, so let's focus on the principles and implementation of modular multiplication.

Modular Multiplication

Given two values x and y in the modular domain, Modular Multiplication refers to x*y mod p. Note that the bit-widths of these integers are the same as the bit-width of the elliptic curve. The classical algorithm for modular multiplication is Montgomery Multiplication. Before performing the Montgomery multiplication, each operand needs to be converted into its Montgomery form, where n = p is the modulus and r is a power of two larger than n:

$$a = x \cdot r \bmod n, \qquad b = y \cdot r \bmod n$$

The Montgomery multiplication formula is given as follows:

$$\mathrm{MontMul}(a, b) = a \cdot b \cdot r^{-1} \bmod n$$

There are numerous implementation algorithms for Montgomery multiplication: CIOS (Coarsely Integrated Operand Scanning), FIOS (Finely Integrated Operand Scanning), FIPS (Finely Integrated Product Scanning), and so on. This article does not go into the details of these implementations; interested readers can look them up.

In order to compare the performance difference between FPGA and GPU, we chose the basic algorithm implementation (here n' = -n^(-1) mod r):

1. t = a * b
2. m = (t * n') mod r
3. tmp = t + m * n
4. u = tmp / r
5. if (u >= n) result = u - n else result = u
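The five steps map almost one-to-one onto Python big integers. The sketch below is an illustration under assumptions: the modulus is the BLS12_381 base-field prime, r = 2^391 is chosen to match the 23*17-bit width used later in this article, and the helper names to_mont and from_mont are hypothetical. It requires Python 3.8+ for pow(N, -1, R).

```python
# BLS12_381 base-field prime (381 bits).
N = 0x1a0111ea397fe69a4b1ba7b6434bacd764774b84f38512bf6730d2a0f6b0f6241eabfffeb153ffffb9feffffffffaaab
K = 391                       # assumed radix width: 23 limbs * 17 bits
R = 1 << K                    # r = 2^K > n, coprime to the odd modulus n
N_PRIME = (-pow(N, -1, R)) % R  # n' = -n^(-1) mod r


def mont_mul(a, b):
    """Return a * b * r^(-1) mod n for inputs 0 <= a, b < n."""
    t = a * b                      # step 1
    m = (t * N_PRIME) & (R - 1)    # step 2: mod r is a bit mask
    u = (t + m * N) >> K           # steps 3-4: the division by r is exact
    return u - N if u >= N else u  # step 5: one conditional subtraction


def to_mont(x):
    """Convert x into Montgomery form x * r mod n."""
    return (x * R) % N


def from_mont(x):
    """Convert back: multiplying by 1 strips one factor of r."""
    return mont_mul(x, 1)
```

For example, from_mont(mont_mul(to_mont(x), to_mont(y))) equals (x * y) % N. The point of the method is that steps 2 and 4 only need masks and shifts, so the reduction costs multiplications and additions rather than a division by n.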

Put simply, the modular multiplication algorithm can be further divided into two categories of computation: multiplication of large numbers and addition of large numbers. Based on the understanding of the computational logic of MSM, we can choose the Throughput of modular multiplication to compare the performance of FPGA and GPU.

DSP in Xilinx FPGA

The UltraScale+ series is a recent Xilinx FPGA product line. The VU9P is a mid-range product in that series, and the AWS FPGA cloud platform adopts VU9P chips.

DSPs are a valuable resource within an FPGA. To meet timing at the target system clock, DSPs can be used for large-number modular multiplication (including large-number multiplication and large-number addition). The DSP48E2 is the DSP slice of the VU9P; its logic diagram is shown as follows:

The core logic of the DSP48E2 consists of a 27*18 multiplier and an arithmetic unit supporting addition, subtraction, and logic operations. To prevent multiplication overflow, the DSP48E2 can be treated as a 17*17 multiplier when performing multiplication on large numbers.

Pipelined Modular Multiplication in FPGA

To compare the performance of Modular Multiplication on FPGA and GPU, the curve BLS12_381 is selected for this article. In other words, the modular multiplication operates on 381-bit values. The bit-width is therefore extended to 391 bits (23*17 bits) so that it splits evenly into the 17-bit words handled by the FPGA's DSPs.
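The 23*17-bit decomposition can be modeled in software to show what the DSPs are asked to do. The sketch below is a hypothetical Python model, not the FPGA design itself: it splits each operand into 23 limbs of 17 bits, forms every 17*17-bit partial product (each fits in 34 bits), and then propagates carries. All names are made up for this illustration.

```python
LIMB_BITS = 17
NUM_LIMBS = 23            # 23 * 17 = 391 bits, enough for 381-bit operands
MASK = (1 << LIMB_BITS) - 1


def to_limbs(x):
    """Split x (< 2^391) into 23 little-endian 17-bit limbs."""
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(NUM_LIMBS)]


def from_limbs(limbs):
    """Recombine little-endian 17-bit limbs into one integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def limb_mul(a_limbs, b_limbs):
    """Schoolbook product using only 17x17-bit multiplications."""
    acc = [0] * (2 * NUM_LIMBS)
    for i, a in enumerate(a_limbs):
        for j, b in enumerate(b_limbs):
            acc[i + j] += a * b      # one DSP-sized partial product
    # Carry propagation: reduce every column back to 17 bits.
    out, carry = [], 0
    for column in acc:
        column += carry
        out.append(column & MASK)
        carry = column >> LIMB_BITS
    return out
```

Here from_limbs(limb_mul(to_limbs(x), to_limbs(y))) equals x * y for any x, y below 2^391. A full product needs 23 * 23 = 529 partial products, which is why the number of DSPs and the number of clocks per multiplication trade off against each other in the design that follows.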

The design of the entire Pipelined Modular Multiplication is shown as follows:

The circuit module for modular multiplication uses 25 DSPs in total and can be separated into two parts: 23 DSPs implement large-number multiplication, and the remaining 2 DSPs implement large-number addition and subtraction. The large-number multiplication requires 23 clocks to complete, with each clock computing X*17bits. The large-number addition and subtraction also require 23 clocks to complete since they share one DSP. The design is pipelined both inside the large-number multiplication and between the large-number multiplication and the large-number addition/subtraction.

Since modular multiplication requires 3 large-number multiplications and numerous large-number additions/subtractions, a 381-bit modular multiplication requires 3*23=69 clocks.

Observation and Thinking

With this FPGA design, we can estimate the maximum Throughput that the VU9P can provide for point addition on BLS12_381. One point addition (mixed addition, the add_mix approach) takes about 12 modular multiplications. The system clock of the FPGA is 450 MHz.

(6840/25)*(450/69)/12 = 104.3M/s

With the same modular multiplication/modular addition algorithm, the throughput of point addition running on an Nvidia 3090 (taking the data transfer cost into account) is more than 500M/s. Of course, the whole calculation process involves multiple algorithms; some may suit FPGA better, while others suit GPU. We use the same algorithm on both platforms because we want to compare the core computing power of FPGA and GPU.

With this in mind, let’s take a look at the price/performance ratio of the products.

Xilinx VU9P board — $8394

https://www.xilinx.com/products/boards-and-kits/vcu118.html

Nvidia 3090 — $1499

https://store.nvidia.com/en-us/geforce/store/?page=1&limit=9&locale=en-us&category=DESKTOP,GPU&gpu=RTX%203090,RTX%203090%20Ti

Taking the computing power into account, we get the comparison of the two products’ price/performance ratios:

8394*500/1499/104.3 = 26.8

In other words, the price/performance ratio of the Nvidia 3090 is about 26 times that of the Xilinx VU9P.
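The same ratio can be read as a cost per unit of throughput. The short sketch below simply redoes the arithmetic with the prices and throughput figures quoted in this article; the variable names are hypothetical, and the throughput numbers are taken as stated rather than re-derived.

```python
# Figures quoted in this article (USD and million point additions per second).
fpga_price_usd = 8394.0
gpu_price_usd = 1499.0
fpga_throughput_m_per_s = 104.3
gpu_throughput_m_per_s = 500.0

# Dollars paid per 1M point additions per second of capacity.
fpga_cost_per_unit = fpga_price_usd / fpga_throughput_m_per_s
gpu_cost_per_unit = gpu_price_usd / gpu_throughput_m_per_s

# How many times more the FPGA costs for the same throughput.
ratio = fpga_cost_per_unit / gpu_cost_per_unit

print(f"FPGA: ${fpga_cost_per_unit:.2f} per M/s")
print(f"GPU:  ${gpu_cost_per_unit:.2f} per M/s")
print(f"ratio: {ratio:.1f}")
```

Expressed this way, the comparison is about $80 per M/s for the VU9P board against about $3 per M/s for the 3090.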

Note: The supply of the Xilinx VU9P might be relatively limited, and public pricing information is scarce, so the price may vary. The price/performance ratio of other FPGA chips can be calculated with the same logic.

Based on the results above, here’s the summary about the comparison of GPU and FPGA in terms of ZKP performance:

More powerful FPGA AI Engine

With the 7nm VC1902 chip, the Xilinx VCK5000 platform supports an AI acceleration engine. It is also considerably cheaper: $2,745.00.

VCK5000 Versal Development Card

The Xilinx VCK5000 Versal development card is built on the Xilinx 7nm Versal® ACAP architecture and is designed for…

www.xilinx.com

This chip is equipped with AI and DSP engines in addition to Programmable Logic. The DSP engine consists of 1968 DSP58 slices, and the AI engine consists of 400 AI Cores. The performance of MSM/FFT on this chip is not yet determined, but it is worth looking forward to. On paper, however, the VC1902 is more powerful than the VU9P.

Conclusion:

More and more applications are starting to adopt Zero-Knowledge Proof. However, many different ZKP algorithms are used across projects, and ZKP proving performance remains a pain point. Based on our engineering experience, FPGAs are becoming more and more powerful and remain an option, but for now the GPU appears to be the more cost-effective choice. These are our considerations and conclusions. If you are an FPGA or GPU expert, you are warmly welcome to discuss them with us.