Wrong ML Zero Knowledge Proofs for Fun and Profit

Disclaimer: this article is written in humour and has only been fact-checked by AI. Do not cite it as a mathematically rigorous treatment of ZK proofs in machine learning. For a production-ready verifiable machine learning system, consider existing verifiable and trusted AI solutions.

I was reading the Kimi K2.5 model release thread on Hacker News recently and came across a rather interesting discussion about proving training data attribution using zero knowledge proofs.

The user claimed that it's impossible to indirectly prove that a model was trained on a certain set of training data without admitting to (potential) copyright infringement, and suggested a few methods, including Sigma protocols.

This is the first time I have come across the notion of using zero-knowledge proofs to prove training data use with plausible deniability, so I figured I would take a shot at formalizing it. Note that most of the mathematics below is ad hoc and most likely completely incorrect; I am basing it on the HN comment thread.

Not releasing the training data used in open-weight machine learning models has been one of the most controversial issues in the open-source software community. On the other hand, there is a recurring idea floating around the intersection of cryptography and AI policy: that Zero-Knowledge Proofs (ZKPs) could act as a silver bullet for data attribution. The pitch usually sounds something like this:

"We can use zero-knowledge to prove a model was trained on copyrighted data , without the developer having to publicly reveal or admit they used ."

It sounds intuitive. After all, isn't the whole point of zero-knowledge to prove facts while hiding secrets?

In the Kimi K2.5 launch thread, the commenters argued that such a construction isn't just difficult to engineer but is fundamentally impossible. The issue isn't a limitation of current ZK schemes, but a misunderstanding of what a proof system actually does.

Here is a critical look at why you cannot cryptographically "attribute" training without admitting usage, and why the standard technical workarounds (OR-proofs and indistinguishability) fail to solve the semantic problem.

Zero-Knowledge Hides Witnesses, Not Facts

To understand the impossibility, we have to look at the formal definition of a ZK proof. A zero-knowledge protocol is defined for an NP language $L$ with a relation $R(x, w)$.

  • $x$ is the public statement.
  • $w$ is the private witness.
  • $R(x, w)$ implies that $w$ is valid evidence that $x \in L$.

The zero-knowledge property guarantees that the verifier learns nothing about $w$ other than the fact that a valid $w$ exists. However, the soundness property guarantees that the verifier does learn that $x$ is true.
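
To make the statement/witness split concrete, here is a minimal sketch (not a proof system, just the relation a proof system would be defined over), using a hash preimage as the witness; the choice of SHA-256 and the variable names are mine, purely for illustration.

```python
import hashlib

def relation(x: bytes, w: bytes) -> bool:
    """R(x, w): w is a valid witness for the public statement x.

    The language L is {x : some w hashes to x}. A ZK proof for this relation
    would hide w, but a verifier who accepts still learns that x is in L,
    i.e. that the public statement itself is true."""
    return hashlib.sha256(w).digest() == x

w = b"the secret training log"      # private witness, never shown to the verifier
x = hashlib.sha256(w).digest()      # public statement, visible to everyone

assert relation(x, w)  # the prover must know w; the verifier only ever sees x
```

Whatever proof system sits on top of this relation, accepting a proof for $x$ is exactly what it means for the verifier to learn that $x$ is true.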

If we try to apply this to AI attribution, we inevitably formulate a statement $x$ along the lines of:

$$x := \text{“Model } M \text{ was trained on dataset } D\text{”}.$$

If the prover successfully runs the protocol, the verifier learns that $x$ is true. The protocol hides the witness (the gradient updates, the specific training logs, or random seeds), but it does not hide the statement.
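
To see where the liability actually sits, here is a purely hypothetical shape for such a relation; the names (Statement, Witness, trained_on) and fields are invented for illustration and do not correspond to any real zkML system. The point is only that the dataset identifier lives in the public statement, not in the hidden witness.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Statement:          # x: fully public, seen by every verifier
    model_hash: bytes     # commitment to the released weights M
    dataset_hash: bytes   # commitment to the copyrighted dataset D

@dataclass(frozen=True)
class Witness:            # w: hidden by the zero-knowledge property
    training_transcript: bytes   # seeds, data ordering, gradient checkpoints, ...

def trained_on(x: Statement, w: Witness) -> bool:
    """Hypothetical R(x, w): 'the transcript w reproduces model x.model_hash
    from dataset x.dataset_hash'. Whatever this check looks like in practice,
    a verified proof for it convinces the verifier that x is true, and x
    names D explicitly."""
    raise NotImplementedError("stand-in for an (enormous) training-replay check")
```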

If the statement implies legal liability ("I used your data"), proving the statement establishes liability. You cannot have it both ways: you cannot convince a verifier that a fact is true while simultaneously claiming you haven't admitted it.

When people try to get around this, they usually propose one of three modifications. All three fail to achieve the original goal because they weaken the statement $x$ until it no longer proves attribution.

Escape Hatch 1: The Tautology (OR-Proofs)

The most common technical proposal is to use a "1-out-of-2" proof, often constructed using $\Sigma$-protocol OR-composition. Instead of proving "I used $D$," the developer proves a disjunction:

$$\exists\, w_1, w_2 : \big(\textsf{TrainedOn}(M, D; w_1)\big) \lor \big(\textsf{NotTrainedOn}(M, D; w_2)\big).$$

In this setup, the prover knows the witness for exactly one side of the OR gate. The zero-knowledge property ensures the verifier cannot tell which branch was satisfied.
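
As a concrete illustration of how OR-composition hides the satisfied branch, here is a toy Fiat-Shamir Schnorr OR-proof sketch: the prover knows the discrete log of exactly one of two public keys, simulates the other branch, and the verifier's checks pass without indicating which branch was real. The group parameters are deliberately tiny and insecure; this is a sketch of the algebra, not usable cryptography.

```python
import hashlib
import secrets

# Toy Schnorr group: p = 2q + 1 with q prime, g generates the subgroup of order q.
# These parameters are absurdly small and exist only to illustrate the algebra.
p, q, g = 2039, 1019, 4

def fs_challenge(*vals: int) -> int:
    """Fiat-Shamir: hash the public keys and commitments into Z_q."""
    data = b"|".join(str(v).encode() for v in vals)
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % q

def or_prove(Y, known, x):
    """Prove knowledge of x with g^x = Y[known], for one of the two public keys
    Y = (Y0, Y1), without revealing which branch is the real one."""
    other = 1 - known
    # Simulate the branch we do NOT know: pick its challenge and response first.
    e_sim, s_sim = secrets.randbelow(q), secrets.randbelow(q)
    t_sim = (pow(g, s_sim, p) * pow(pow(Y[other], e_sim, p), -1, p)) % p
    # Honest Schnorr commitment on the branch we DO know.
    r = secrets.randbelow(q)
    t = [0, 0]
    t[known], t[other] = pow(g, r, p), t_sim
    # The global challenge is split between the two branches.
    e = fs_challenge(Y[0], Y[1], t[0], t[1])
    e_known = (e - e_sim) % q
    s_known = (r + e_known * x) % q
    es, ss = [0, 0], [0, 0]
    es[known], es[other] = e_known, e_sim
    ss[known], ss[other] = s_known, s_sim
    return t, es, ss

def or_verify(Y, proof) -> bool:
    t, es, ss = proof
    if (es[0] + es[1]) % q != fs_challenge(Y[0], Y[1], t[0], t[1]):
        return False
    # Each branch must satisfy the usual Schnorr equation g^s = t * Y^e.
    return all(pow(g, ss[i], p) == (t[i] * pow(Y[i], es[i], p)) % p for i in (0, 1))

# The prover holds the secret for branch 0 only; branch 1 is a stand-in key
# whose discrete log we pretend the prover never sees.
x0 = secrets.randbelow(q)
Y = (pow(g, x0, p), pow(g, secrets.randbelow(q), p))
proof = or_prove(Y, known=0, x=x0)
assert or_verify(Y, proof)  # accepts, yet reveals nothing about which branch held
```

In the attribution setting, the two keys play the roles of the TrainedOn and NotTrainedOn branches: verification succeeding tells you nothing about which witness the prover actually held.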

Why it fails: While this is a valid cryptographic construction, the resulting statement is a tautology. It translates to: "Either I trained on your data, or I didn't."

This is always true for any model and any dataset. Proving this statement conveys zero bits of information about the actual provenance of the model. It offers perfect deniability, but only because it offers zero attribution.

Escape Hatch 2: Proving Non-Influence

The second approach attempts to shift the goalposts from "usage" to "influence." Here, the developer might try to prove that the model's behavior is statistically indistinguishable whether $D$ was used or not.

Formally, they attempt to prove a bound on the distance between the model's distributions with and without $D$:

$$\mathsf{Dist}\big(\mathcal{L}(M \mid D), \mathcal{L}(M \mid \emptyset)\big) \le \varepsilon.$$

This relies on techniques similar in spirit to differential privacy or algorithmic stability analysis.

Why it fails: This proves the opposite of attribution. If the proof holds, it means the data $D$ had a negligible impact on the final model. It is an argument for "fair use via uselessness." If the dataset did contribute a unique, valuable signal (which is usually the basis for the copyright claim), this proof would be impossible to generate because the distributions would be distinguishable.

Furthermore, proving that a model is "stable" relative to $D$ doesn't confirm $D$ was used; it only confirms that if it was used, it didn't change the model much.
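
A quick numerical sketch of why this hatch self-destructs: fit the same toy model with and without a dataset $D$ that carries a unique signal, and the two sets of predictions drift far apart, so no small $\varepsilon$ bound could be proven. The synthetic data, the ridge "training run", and the crude distance metric are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ridge(X, y, lam=1e-3):
    """Ridge regression as a stand-in 'training run'."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Base corpus: targets depend only on feature 0.
X_base = rng.normal(size=(500, 3))
y_base = X_base[:, 0] + 0.1 * rng.normal(size=500)

# Dataset D contributes a *unique* signal: a strong dependence on feature 2.
X_D = rng.normal(size=(200, 3))
y_D = X_D[:, 0] + 3.0 * X_D[:, 2] + 0.1 * rng.normal(size=200)

w_without = fit_ridge(X_base, y_base)
w_with = fit_ridge(np.vstack([X_base, X_D]), np.concatenate([y_base, y_D]))

# 'Dist' here: mean absolute gap between the two models' predictions on
# held-out inputs (a crude stand-in for a proper distributional distance).
X_test = rng.normal(size=(1000, 3))
dist = np.mean(np.abs(X_test @ w_with - X_test @ w_without))
print(f"prediction gap with vs. without D: {dist:.3f}")  # clearly not negligible
```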

Escape Hatch 3: Equivalence Classes

The final, more subtle approach is to use equivalence classes. The developer defines a relation where different datasets are considered "functionally equivalent" ($\sim$) if they produce similar model outputs.

$$D \sim D' \quad \text{if} \quad \mathcal{L}(M \mid D) \approx \mathcal{L}(M \mid D').$$

The proof then asserts:

$$\exists\, D' \in [D]_\sim : \textsf{TrainedOn}(M, D').$$

Why it fails: This abstracts the attribution away from the specific object in question. It is analogous to proving that a ciphertext corresponds to some message in a set of messages, without revealing which one.

(IANAL; talk to an IP lawyer about the actual legal aspects.)

While technically sound, it creates a semantic gap. If the equivalence class $[D]_\sim$ contains both the copyrighted dataset and a public domain dataset, the proof no longer distinguishes between infringement and lawful training. The prover has successfully hidden the specific training data, but in doing so, they have stripped the proof of its legal or moral force.
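
A tiny sketch of that semantic gap, with made-up datasets, a made-up toy "training" procedure, and an arbitrary tolerance: if a copyrighted corpus and a public-domain corpus with similar statistics land in the same equivalence class, the existential statement is satisfied either way and carries no attribution force.

```python
import numpy as np

rng = np.random.default_rng(1)

def train(dataset):
    """Toy 'training': the model is just the empirical mean of the data."""
    return dataset.mean(axis=0)

def equivalent(d1, d2, tol=0.1):
    """D ~ D' iff the resulting models land within tol of each other."""
    return np.linalg.norm(train(d1) - train(d2)) <= tol

copyrighted = rng.normal(loc=0.0, scale=1.0, size=(5000, 4))
public_domain = rng.normal(loc=0.0, scale=1.0, size=(5000, 4))  # different data, same statistics

# Both datasets sit in the same equivalence class [D]~ ...
assert equivalent(copyrighted, public_domain)

# ... so "there exists D' ~ D that the model was trained on" is satisfied by a
# perfectly lawful training history, and proves nothing about which was used.
```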

The more interesting part of the HN thread, however, is the appeal to “information-theoretic” arguments for why major AI labs could not plausibly have trained¹ solely on public-domain data. The intuition is straightforward: there is strictly less data in the public domain; empirically, more data yields stronger models; therefore, a model trained only on public-domain data should be weaker than one trained with access to copyrighted or proprietary corpora.

What this argument actually shows is not impossibility but implausibility under current scaling regimes. Information theory does not care about copyright status; it cares about mutual information between data and task. Absent a lower bound showing that public-domain corpora lack sufficient information content for competitive performance — or that such information cannot be approximated, compressed, or regenerated — there is no information-theoretic contradiction in principle. There is only an empirical claim about what we expect given today’s data distributions, incentives, and engineering practices.
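
To see why this is an empirical scaling claim rather than a theorem, here is a back-of-the-envelope sketch using a Chinchilla-style parametric loss fit, $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$. The constants are roughly those reported by Hoffmann et al. (2022), and the token budgets are hypothetical; treat all the numbers as illustrative, not as a statement about any particular lab or corpus.

```python
# Chinchilla-style fit: loss as a function of parameter count N and training tokens D.
# Constants roughly as reported by Hoffmann et al. (2022); illustrative only.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N: float, D: float) -> float:
    return E + A / N**alpha + B / D**beta

N = 70e9                              # a hypothetical 70B-parameter model
for D in (0.3e12, 1.4e12, 10e12):     # hypothetical token budgets
    print(f"D = {D/1e12:4.1f}T tokens -> predicted loss {loss(N, D):.3f}")

# Less data predicts higher loss, which is an empirical scaling statement,
# not an information-theoretic impossibility result about public-domain data.
```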

In other words, even if one accepts the scaling-law intuition, it still does not rescue the attribution problem. A zero-knowledge proof that ranges over an equivalence class containing both lawful and unlawful training histories cannot distinguish between them, regardless of how economically or empirically unlikely some members of that class may be. The proof remains correct, and simultaneously useless, for the purpose it is meant to serve.

The Takeaway

The intuition that zero-knowledge allows for "anonymous attribution" rests on a category error. It confuses the content of a proof with the mechanism of the proof.

Zero-knowledge is fantastic for proving that a process was followed correctly (e.g., "this model was trained according to the specified algorithm") or that a property holds (e.g., "this model satisfies this safety constraint"). But it cannot fundamentally decouple the truth of a statement from the admission of that truth.

If you prove you trained on data, you have admitted you trained on data. If you dilute the statement to avoid the admission, you no longer prove you trained on the data. There is no cryptographic magic that solves this specific semantic deadlock.

Footnotes

  1. As of January 2026, the outputs of AI systems are not copyrightable in the US, so you can technically do interesting things with model distillation. However, there is well-known research suggesting that recursively training on synthetic data can lead to model collapse.