Polynomial Protocols for R1CS String Manipulation
Table of Contents
In this post, we are going to describe two polynomial protocols that Alin Tomescu and I came up with when working on the Aptos Keyless circuit, an R1CS program written in the Circom domain-specific language to enable verification of OIDC credentials, which requires parsing a JWT string after verifying a signature computed over it. This is more difficult than it sounds - simple array assignment in Circom can create non-quadratic constraints, which are not allowed when programming in R1CS. The other solutions we found for string manipulation were either not efficient, or difficult to understand (and therefore difficult to implement, a frightening thing given how bug-prone handrolling zk proof circuits can be). So we created the following two protocols, one for concatenating two strings, and another for substring extraction.
For notation, when dealing with a string or array \(s\), we will alternate between Python indexing notation and subscripts, so that \(s[x:y]\) indicates the substring of \(s\) starting at \(x\) and ending at \(y-1\), and \(s[i] = s_i\).
Concatenating Two Strings
Our first problem is as follows: given two strings \(l\) and \(r\), compute their concatenation \(s = l | r\).
In R1CS-land, however, we aren’t interested in doing computations ourselves but in verifying the correctness of a computation. Often this is easier to implement than performing the computation directly in Circom. This gives us the following problem: Given three strings \(l,r,s\), verify that \(s = l | r\).
There’s another subtley here to deal with: because we are writing an arithmetic circuit, the type and number of constraints must be fixed in advance. This means that \(s,l,r\) will not be the strings themselves, but arrays containing the appropriate strings with 0-padding afterwards up to a fixed max length. Let’s denote this max length by \(len(\cdot)\), and denote \(len(l) = L\), \(len(r) = R\), \(len(s) = S\), where by necessity \(L,R \leq S\).
Just to make things very explicit, if we have original strings \(s' = cars\), \(l' = c\), \(r' = ars\), and \(S=8\), \(L=R=5\), this gives us \(s = cars0000\), \(l=c0000\), \(r=ars00\).
First we compute a random challenge \(x\) and use it to derive a series of values \(x,x^2,..., x^S\). Then we define and compute the following polynomials, using the pre-computed random challenge powers: \[l(x) = l_0 + l_1x + l_2x^2 + ... + l_Lx^L\] \[r(x) = r_0 + r_1x + r_2x^2 + ... + r_Rx^R\] \[s(x) = s_0 + s_1x + s_2x^2 + ... + s_Sx^S\]
Finally, we denote the actual length of \(l\) before it is transformed into a 0-padded string of length \(L\) by \(d\). To verify, we check: \[s(x) =? \; l(x) + x^{d-1} r(x)\]
and also check that all values of of \(l\) at index \(d\) and after are 0. If we did not do this an attacker could maliciously prove, for example, that \(s = l' | r'\) where \(l' = l | r\) and \(r'\) consists only of 0 values. This is bad because another part of the circuit might assume that the actual length \(d\) given is actually correct when it is not, which could lead to incorrectly assuming that \(s = l'[0:d] | r'\), in this case a false claim.
Correctness of the protocol above is easy to verify. And as might be expected, soundness here follows from the Schwartz-Zippel lemma.
As measured by the Circom compiler, this protocol uses around \(15S + 13L + 11R + X\) non-linear and \(6S + 5L + 4R + Y\) linear constraints where \(X \approx Y \approx 1100\), using the implementation here, which includes a Poseidon fiat-shamir hash over \(s,l\) and \(r\).
NOTE: This protocol does verify the actual length of \(l\), which we have denoted here by \(d\). It only implicitly verifies the actual length of \(r\). If there are any “garbage” values past its actual length, verification will fail as the 0-padding will not be correct in the verification equation. The actual length of \(r\) can be confirmed outside of this protocol in the same circuit by checking that all values after the actual string in \(r\) are equal to 0.
Extracting a Substring
Our next problem is: Given a larger string \(s\), a start index \(i\), and a substring actual length \(d\), extract the substring \(a\) at indices \(s[i:i+d]\).
As before, the context of needing to verify computation translates this problem into the following: given a string \(s\), a start index \(i\), a substring actual length \(d\), and a substring \(a\), verify that \(a\) is equal to \(s[i:i+d]\).
As before, we will need \(s\) and \(a\) to have a fixed maximum length, which we denote by \(len(a) = A\), \(len(s) = S\). We also compute a random challenge \(x\), and a series of powers \(x,x^2,...,x^S\).
Let’s give a concrete example, as we did before. If we have original strings \(s' = goofy\), \(a' = oof\), with \(S = 7\) and \(A = 6\), we have \(s = goofy00\) and \(a = oof000\). For verification to succeed in this case, we will need start index \(i = 1\) and substring length \(d = 3\).
This protocol is slightly more involved than the last one, as we will need to make use of some interesting quirks of developing R1CS circuits. As mentioned before, we cannot simply copy parts of an array. To get around this, first we compute an array \(sel\) of the form \[0000...00001111...111000...0000\] of length \(S\) which contains 1s between indices \(i\) and \(i+d\) exclusive, and 0s everywhere else.
Next we compute a string \(s'\) of length \(S\), where \(s'[j] = sel[j] * s[j]\), for all \(0 \leq j < S\). Note this is simply \(s\) with its original values between indices \(i\) and \(i+d\) exclusive, and 0 everywhere else.
Similar to our concatenation protocol, we next define and compute the polynomials \[s'(x) = s_0' + s_1'x + s_2'x^2 + ... + s_S'x^S\] and \[a(x) = a_0 + a_1x + a_2x^2 + ... + a_Sx^S\].
To verify, we check that \[s'(x) =? \; x^{i-1} a(x)\]
As above, completeness is easy to verify, and soundness follows from Schwartz-Zippel.
As measured by the Circom compiler, this uses roughly \(7S+11A+X\) non-linear constraints and \(2S+4A+Y\) linear constraints where \(X \approx Y \approx 500\) when using the implementation here, which includes a Fiat-Shamir Poseidon hash over the substring (the full string must be hashed to perform Fiat-Shamir correctly, but the cost of doing so can be amortized over multiple calls if multiple substrings are matched over the same full string).