A Qwen2.5-7B-Instruct model fine-tuned with 13 trainable parameters (26 bytes in bf16) reaches 91.8% on GSM8K. Full fine-tuning of all 7.6 billion parameters reaches 91.7%. That number is not a typo, and yes, they are the same kind of parameters.

To understand why this is possible, you need to know what those 13 parameters actually do. They are not 13 weights replacing 7.6 billion others. They are 13 scalar values that get projected through a fixed random tensor into a high-dimensional update, which then gets added back into the frozen weight matrices across all layers. You are not tuning 13 knobs instead of 7.6 billion: you are tuning 13 directions of change that each affect the whole model via a fixed mathematical transformation. The frozen weights do all the heavy lifting; the tiny update steers them.
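The mechanism can be sketched in a few lines of NumPy. The shapes, names, and the use of a single weight matrix here are illustrative assumptions, not the paper's exact construction; the point is only that a dense fixed projection makes every one of the 13 scalars touch every entry of the update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not the paper's dimensions).
d_out, d_in = 64, 64          # one frozen weight matrix
n_trainable = 13              # the tiny trainable vector

W_frozen = rng.standard_normal((d_out, d_in))         # never updated
P = rng.standard_normal((d_out * d_in, n_trainable))  # fixed random projection, never trained
v = np.zeros(n_trainable)                             # the only trainable parameters

def effective_weight(v):
    """Frozen weight plus the dense update steered by the 13 scalars."""
    delta = (P @ v).reshape(d_out, d_in)
    return W_frozen + delta

# With v = 0 this is exactly the frozen base model; nudging a single
# entry of v moves essentially every entry of the effective weight.
assert np.allclose(effective_weight(np.zeros(n_trainable)), W_frozen)
```

Gradients flow through `P` into `v` during training, but only `v` is ever written to.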

TinyLoRA, from Meta FAIR, Cornell, and CMU, builds on LoRA-XS, which trains a small r×r matrix sandwiched between frozen SVD factors of each weight. TinyLoRA replaces even that matrix with a tiny trainable vector projected through a fixed random tensor, and ties the vector across layers to push the count further down, to a single shared parameter in the extreme case.

The reason this works at all is reinforcement learning. RL (specifically GRPO) is 100-1000x more parameter-efficient than supervised fine-tuning at tiny update sizes. SFT treats every token as equally informative, forcing the model to absorb stylistic noise and irrelevant structure from human demonstrations. RL rewards are binary (right/wrong on a math answer), so reward-relevant features are reinforced while irrelevant variations cancel out through resampling. The signal is clean enough that 13 degrees of freedom are sufficient.
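The cancellation argument can be made concrete with GRPO-style group-normalized advantages. The simulation below is an assumption-laden toy, not the full algorithm: rewards are coin flips, and "features" are just binary indicators per sampled completion. Credit accumulates for a feature that tracks correctness and averages toward zero for one that doesn't:

```python
import numpy as np

rng = np.random.default_rng(2)
G, n_prompts = 8, 2000   # completions per prompt, number of prompts (illustrative)

signal_credit, noise_credit = 0.0, 0.0
for _ in range(n_prompts):
    rewards = rng.integers(0, 2, size=G).astype(float)    # binary: right/wrong
    if rewards.std() == 0:
        continue                                          # all same -> no learning signal
    adv = (rewards - rewards.mean()) / rewards.std()      # GRPO group normalization
    reward_linked = rewards                               # feature tracking correctness
    stylistic = rng.integers(0, 2, size=G).astype(float)  # feature independent of reward
    signal_credit += adv @ reward_linked
    noise_credit += adv @ stylistic

# Reward-linked credit is large and positive; stylistic credit hovers near zero.
print(signal_credit / n_prompts, noise_credit / n_prompts)
```

SFT has no analogous averaging step: every token of every demonstration contributes gradient, stylistic quirks included.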

A few practical findings: frozen SVD rank r=2 is optimal (higher rank adds too many degrees of freedom for the tiny vector to navigate), parameter sharing by model depth (“tiling”) outperforms sharing by module type (Q/K/V), and fp32 is more bit-efficient than bf16 in extreme low-parameter regimes despite its larger per-parameter footprint.
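The two sharing schemes compared above reduce to different index maps from modules to trainable vectors. A minimal sketch, with layer count, tile count, and module names chosen for illustration:

```python
# Two ways to tie tiny trainable vectors across a transformer (illustrative).
n_layers, n_tiles = 12, 3

def tile_of(layer_idx):
    """Depth-based tiling: consecutive layers share one vector."""
    return layer_idx * n_tiles // n_layers

def module_group(module_name):
    """Module-type sharing: all Q projections share one vector, all K another, ..."""
    return {"q_proj": 0, "k_proj": 1, "v_proj": 2}[module_name]

# Layers 0-3 share tile 0, layers 4-7 tile 1, layers 8-11 tile 2.
print([tile_of(i) for i in range(n_layers)])
```

The paper's finding is that grouping by depth (left map) beats grouping by module type (right map) at a fixed parameter budget.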

The implication for larger models is striking: as models scale, they become more “programmable” with fewer absolute parameters. Trillion-scale models might eventually be steered with a handful of bytes.