Contrastive Decoding Diffing: Extracting Finetuning Data from Model Logits
WHY IT MATTERS
New research shows that verbatim finetuning data can be recovered from an LLM's logits without access to its weights, a significant security finding for anyone protecting training data.
The researchers use contrastive decoding, comparing the finetuned target model's outputs against those of a reference model and amplifying the differences, so that tokens the finetuning made more likely stand out. The method is reported to recover training sequences with high fidelity across standard deployment configurations.
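To make the idea of amplifying differences between a target and a reference model concrete, here is a minimal sketch of one contrastive decoding step. It is an illustration under stated assumptions, not the researchers' method: the summary above does not specify the scoring rule, so the sketch assumes a simple difference of log-probabilities, and the names `contrastive_scores`, `contrastive_greedy_step`, and `alpha` are hypothetical. The logits are toy values rather than output from a real model.

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def contrastive_scores(target_logits, reference_logits, alpha=1.0):
    """Score each token by how much more the target model favors it.

    ASSUMPTION: the score is log p_target - alpha * log p_reference.
    `alpha` is a hypothetical weight, not a parameter named in the source.
    """
    t = log_softmax(target_logits)
    r = log_softmax(reference_logits)
    return [ti - alpha * ri for ti, ri in zip(t, r)]

def contrastive_greedy_step(target_logits, reference_logits, alpha=1.0):
    """Return the index of the token with the largest contrastive score."""
    scores = contrastive_scores(target_logits, reference_logits, alpha)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy 4-token vocabulary. Both models still rank token 0 highest,
# but finetuning has shifted probability mass toward token 2.
reference = [3.0, 1.0, 0.0, 0.5]
finetuned = [3.0, 1.0, 2.5, 0.5]

plain_choice = max(range(len(finetuned)), key=finetuned.__getitem__)
contrastive_choice = contrastive_greedy_step(finetuned, reference)
print(plain_choice, contrastive_choice)
```

In this toy case ordinary greedy decoding on the finetuned logits picks token 0, while the contrastive step picks token 2, the token whose likelihood changed most relative to the reference. Repeating such a step token by token is one plausible way an attacker with logit access could steer generation toward finetuning-specific sequences.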
For operators, this undercuts the assumption that logit-level access is a weaker form of exposure than weight access. Models finetuned on proprietary data (customer instructions, domain corpora, or specialized datasets) now need threat models that treat logit outputs as comparable to weight leakage when it comes to training data recovery. This changes what counts as a "secure" inference endpoint and complicates the choice between API exposure and self-hosted deployment.
Operationally, teams should audit which parties can access raw logits, quantize the outputs they expose, or restrict inference APIs to sampled tokens or truncated top-k probabilities rather than full logit vectors. Returning the full probability distribution is not a meaningful restriction, because it determines the logits up to an additive constant. Where training data is highly sensitive, fine-grained model access becomes more costly to offer, which favors either limiting logit precision or shrinking the API surface over permissive inference serving.
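The two kinds of restriction mentioned above, limiting logit precision and narrowing what the API returns, might look roughly like the following. This is a sketch under assumptions: the function names, the `step` size, and the choice of top-k truncation are hypothetical illustrations, and nothing in the source establishes how much either measure reduces extraction risk.

```python
import math

def quantize_logits(logits, step=0.5):
    """Round each logit to the nearest multiple of `step`.

    ASSUMPTION: `step` is a hypothetical precision limit chosen for illustration.
    """
    return [round(x / step) * step for x in logits]

def top_k_probs(logits, k=2):
    """Return only the k most likely tokens as {token_index: probability}.

    The remaining probability mass is withheld, so the caller cannot
    reconstruct the full logit vector from the response.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return {i: probs[i] for i in top}

raw = [3.14, 1.26, -0.2]
coarse = quantize_logits(raw, step=0.5)

exposed = top_k_probs([3.0, 1.0, 2.5, 0.5], k=2)
print(coarse, exposed)
```

Both functions trade output detail for reduced exposure: coarser logits and shorter token lists give downstream users less to work with, which is the cost the paragraph above refers to.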