YouTube just open-sourced a project called STATIC that solves a problem most people don’t know exists: LLMs can say anything, but sometimes you need them to only pick from a specific list.

The Problem

When an LLM generates text, it picks one token (a word or word fragment) at a time from a vocabulary of roughly 32,000 or more options. That’s great for conversation, but terrible when you need it to output something specific: a valid product ID, a medical code, or a video recommendation from a catalog of millions.
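
To make that concrete, here is a minimal sketch of the unconstrained step: softmax over the whole vocabulary, then sample one token id. The five-token "vocabulary" and logit values are invented for illustration; a real model would have tens of thousands of entries.

```python
import math
import random

def sample_next(logits, temperature=1.0):
    # Unconstrained decoding: softmax over the whole vocabulary,
    # then sample a single token id. Nothing here stops the model
    # from starting an ID that exists nowhere in your catalog.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [0.1, 1.2, 0.4, 3.0, 2.5]
print(sample_next(toy_logits))  # any of tokens 0-4, weighted by probability
```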

Without constraints, the model might output an ID that doesn’t exist. Then you’re stuck retrying or doing cleanup after the fact.

The Solution: Constrained Decoding

The idea is simple: at each step of generation, before the model picks its next token, you block every invalid option. The model can only choose from tokens that lead to a valid result.
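
In code, "blocking every invalid option" usually means masking logits: set every disallowed token's score to negative infinity before picking. A minimal sketch with a made-up five-token vocabulary and a greedy (argmax) pick:

```python
def mask_and_pick(logits, allowed_ids):
    # Constrained decoding, one step: push every disallowed token's
    # logit to -inf so it can never win (or ever be sampled).
    masked = [l if i in allowed_ids else float("-inf")
              for i, l in enumerate(logits)]
    # Greedy pick: argmax over the surviving logits.
    return max(range(len(masked)), key=lambda i: masked[i])

# Token 3 has the highest raw logit, but only {1, 4} are valid here,
# so the model is forced onto the best *valid* token instead.
logits = [0.1, 1.2, 0.4, 3.0, 2.5]
print(mask_and_pick(logits, {1, 4}))  # -> 4
```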

Think of it like autocomplete on your phone, but instead of suggesting words, it’s hiding every key on the keyboard except the ones that form a real word.

What STATIC Specifically Does

Previous approaches to constrained decoding ran tree (trie) lookups on the CPU, which meant the GPU doing the actual model inference had to pause and wait for the host at every single generation step. Slow.
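
The CPU-side structure is typically a trie over the valid token sequences: walk it along the tokens generated so far, and the children of the node you land on are the allowed next tokens. A minimal sketch (the catalog of token-id sequences is invented for illustration):

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # token id -> TrieNode

def build_trie(sequences):
    root = TrieNode()
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.children.setdefault(tok, TrieNode())
    return root

def allowed_next(root, prefix):
    # CPU-side lookup: walk the trie along the tokens generated so far
    # and return the set of valid next tokens. In a decoding loop this
    # runs on the host every step, stalling the GPU each time.
    node = root
    for tok in prefix:
        node = node.children.get(tok)
        if node is None:
            return set()  # prefix is not part of any valid sequence
    return set(node.children)

catalog = [[7, 2, 9], [7, 2, 4], [5, 1]]  # token-id sequences for valid IDs
trie = build_trie(catalog)
print(sorted(allowed_next(trie, [7, 2])))  # -> [4, 9]
```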

STATIC pre-compiles all valid sequences into a GPU-friendly data structure (a sparse matrix), so the filtering happens on the GPU itself: no waiting, no CPU round trips. The result is roughly 150x faster than traditional methods when working with millions of valid sequences.
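
A rough CPU sketch of the pre-compilation idea (not STATIC’s actual implementation): treat every prefix of every valid sequence as a "state", and build one row of a sparse 0/1 matrix per state, where a set column means "this token is a valid continuation". Plain Python dicts and sets stand in here for what would be a real sparse GPU tensor indexed inside a kernel.

```python
def compile_masks(sequences):
    # Pre-compile every prefix ("state") of every valid sequence into a
    # row of a sparse state-by-vocabulary matrix. At decode time you
    # track the current state id and look up its row in one indexed
    # read -- no per-step tree walk on the host.
    state_ids = {(): 0}   # prefix tuple -> state id
    rows = {0: set()}     # state id -> allowed next token ids
    for seq in sequences:
        prefix = ()
        for tok in seq:
            rows[state_ids[prefix]].add(tok)
            prefix = prefix + (tok,)
            if prefix not in state_ids:
                state_ids[prefix] = len(state_ids)
                rows[state_ids[prefix]] = set()
    return state_ids, rows

catalog = [[7, 2, 9], [7, 2, 4], [5, 1]]
state_ids, rows = compile_masks(catalog)
# The state for prefix (7, 2) allows exactly tokens {4, 9}:
print(sorted(rows[state_ids[(7, 2)]]))  # -> [4, 9]
```

Because every row is built once up front, the per-step work collapses to a single lookup, which is exactly the kind of operation a GPU can do without consulting the CPU.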

When Would You Use This?

You’d use constrained decoding when:

  • A recommendation engine needs to output only real product/content IDs from a catalog
  • A medical system needs to generate only valid diagnosis or procedure codes
  • A search system maps queries to specific document IDs in an index
  • Any scenario where the LLM’s output must exactly match an entry in a known set

The Catch

This only works when you’re running the model yourself (self-hosted via PyTorch, JAX, vLLM, etc.). If you’re using hosted APIs like OpenAI’s, Anthropic’s (Claude), or Google’s (Gemini), you don’t have access to the generation loop. Those APIs offer lighter alternatives like structured outputs and JSON schema enforcement, which work well for small constraint sets but can’t scale to millions of options.

Why It Matters

As more companies use LLMs not just for chat but for structured tasks like retrieval, classification, and code generation, constrained decoding becomes a core infrastructure need. STATIC shows that with the right data structure, you can enforce hard constraints at production speed without slowing down inference.