opencode with a local Qwen3.6-35B-A3B backend iterated to a working color-physics web toy

I gave opencode a single-paragraph spec for “an HTML canvas where colored circles attract similar hues and repel opposites,” and pointed it at unsloth’s quant of Qwen3.6-35B-A3B (UD-Q8_K_XL, ~3B active params) running locally. About a dozen steering turns later, fixing bugs and nudging the behavior, I had a working single-file demo. That is the part worth noting. Not one-shot, not magic: a small MoE that fits on one GPU, driven through an agentic loop, can converge on a real interactive artifact in a reasonable session. Local coding agents are becoming a usable workflow, not just a demo. ...
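
For a feel of the rule in that spec, here is a minimal sketch of "attract similar hues, repel opposites". It is written in Python rather than the demo's canvas JavaScript, and the cosine falloff, the `strength` constant, and the dict layout for circles are all assumptions of mine, not what the model actually generated.

```python
import math


def hue_force(hue_a: float, hue_b: float) -> float:
    """Signed interaction in [-1, 1] for two hues given in degrees.

    Positive means attract, negative means repel. Using the cosine of the
    hue difference is an assumption of this sketch, not the demo's rule.
    """
    return math.cos(math.radians(hue_a - hue_b))


def step(circles, dt=0.016, strength=50.0):
    """Advance every circle by one time step.

    Each circle is a dict with keys x, y, vx, vy, hue (hypothetical layout).
    Velocities are updated for all circles first, then positions.
    """
    for a in circles:
        ax = ay = 0.0
        for b in circles:
            if a is b:
                continue
            dx, dy = b["x"] - a["x"], b["y"] - a["y"]
            dist = math.hypot(dx, dy) + 1e-6  # avoid division by zero
            f = strength * hue_force(a["hue"], b["hue"]) / dist
            ax += f * dx / dist
            ay += f * dy / dist
        a["vx"] += ax * dt
        a["vy"] += ay * dt
    for c in circles:
        c["x"] += c["vx"] * dt
        c["y"] += c["vy"] * dt
```

Two circles with the same hue drift together under this rule, and two circles 180 degrees apart on the hue wheel push away from each other.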

Gemini 3.1 Flash Live: Low-latency voice AI with native audio output

Google launched Gemini 3.1 Flash Live via the Live API for building real-time voice and vision agents. The model processes continuous audio, video, and text streams to deliver immediate spoken responses with acoustic nuance detection and 90+ language support. Key improvements over Gemini 2.5 Flash Native Audio: better noise filtering in real-world environments; stronger adherence to complex system instructions; more natural dialogue with improved latency; and thinking capability via thinkingLevel (minimal/low/medium/high) instead of thinkingBudget. The model outputs native audio (no STT+TTS pipeline) with a 128k context window. It supports synchronous function calling, Google Search grounding, and video input alongside audio. ...

Asana's official Claude MCP connector can't create tasks in Claude Code CLI because its tools require interactive UI

Asana’s official Claude connector uses a tool called create_task_preview instead of create_task. The name suggests a preview, but it’s actually the primary creation tool in the v2 integration. The catch: it requires an interactive UI to render a confirmation step. Claude Code CLI can’t render that interface, so task creation silently fails. This was surfaced in the Asana forum after create_task also went missing temporarily from the v1 toolset (Asana accidentally dropped it during a v2 update; it has since been restored). Asana confirmed the CLI issue separately and said they’re working on detecting whether the surface supports interactivity before offering interactive tools. ETA: about a week from March 26. ...

Cohere Transcribe beats Whisper Large v3 on the HuggingFace ASR leaderboard with a 2B open-source model

Cohere, best known for text and embedding models, just shipped a speech recognition model that hits #1 on the HuggingFace Open ASR Leaderboard: 5.42% average WER versus Whisper Large v3’s 7.44%. It’s 2B parameters, Apache 2.0, available on HuggingFace today. The architecture is a Conformer: a hybrid of CNNs and Transformers. CNNs handle local acoustic features (phonemes, rapid transitions), Transformers handle global context. Interleaving them is the standard trick for ASR; Cohere’s bet is that a dedicated, from-scratch training run focused on WER beats the generalist approach. ...
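
To make the headline metric concrete, here is a small sketch of how word error rate is usually computed: word-level edit distance divided by the number of reference words. The function name is mine, and the sketch skips the text normalization that leaderboards typically apply before scoring, so treat it as an illustration rather than the leaderboard's scoring code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Classic edit distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Relative improvement implied by the two averages quoted above.
relative_reduction = (7.44 - 5.42) / 7.44  # roughly 27% fewer word errors
```

On those two averages, the gap works out to a bit more than a quarter fewer word errors than Whisper Large v3.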

Tencent's Covo-Audio is a 7B end-to-end audio model with full-duplex conversation via THINK, SHIFT, and BREAK tokens

Most voice AI pipelines are cascaded: ASR transcribes, an LLM reasons, TTS speaks. Covo-Audio collapses all three into a single 7B model that takes audio in and produces audio out, built on Qwen2.5-7B with a Whisper-large-v3 encoder. The full-duplex variant (Covo-Audio-Chat-FD) is the interesting part. It handles simultaneous listening and speaking using three special tokens baked into the architecture: THINK (listening, not yet responding), SHIFT (switching to speaking turn), and BREAK (user interrupted, stop speaking immediately). Each audio chunk is 0.16 seconds. The model and user streams are interleaved at a 1:4 ratio. ...
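
The three control tokens read naturally as a tiny state machine, sketched below. This is only my reading of the description: the one-decision-per-chunk timing, the `Mode` names, and the placeholder `"AUDIO"` token are assumptions, and the sketch deliberately does not model the 1:4 stream interleaving.

```python
from enum import Enum

CHUNK_SECONDS = 0.16  # chunk length quoted in the write-up


class Mode(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


def next_mode(mode: Mode, token: str) -> Mode:
    """Apply one control token; any other token leaves the mode unchanged."""
    if token == "THINK":   # listening, not yet responding
        return Mode.LISTENING
    if token == "SHIFT":   # take the speaking turn
        return Mode.SPEAKING
    if token == "BREAK":   # user interrupted: stop speaking immediately
        return Mode.LISTENING
    return mode


def trace(tokens):
    """Return (start_time_seconds, mode) per chunk.

    Assumes one token decision per chunk, which is a simplification.
    """
    mode = Mode.LISTENING
    timeline = []
    for i, token in enumerate(tokens):
        mode = next_mode(mode, token)
        timeline.append((round(i * CHUNK_SECONDS, 2), mode))
    return timeline
```

Feeding it a sequence such as THINK, THINK, SHIFT, AUDIO, BREAK shows the model listening for two chunks, speaking for two, then yielding as soon as the interruption arrives.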

Google's TurboQuant compresses LLM KV caches 6x with zero accuracy loss and near-zero indexing time

The KV cache is a major memory bottleneck for long-context LLM inference. Traditional vector quantization methods like Product Quantization require expensive dataset-specific codebook training that can take hundreds of seconds. TurboQuant, from Google Research (ICLR 2026), is data-oblivious: no training, no calibration, works instantly. The key insight: applying a random rotation to input vectors induces a concentrated Beta distribution on each coordinate in high dimensions, making coordinates nearly i.i.d. This lets you solve a simple 1D scalar quantization problem per coordinate instead of a complex joint optimization. Codebooks are precomputed once per bit-width and reused at inference time. ...
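
The rotate-then-quantize idea can be sketched in a few lines of pure Python. To be clear about what is swapped in: this uses a plain uniform grid as the per-coordinate codebook, not the precomputed codebooks the paper describes, and every function name here is hypothetical. It only illustrates the data-oblivious structure (no training, one shared codebook per bit-width).

```python
import math
import random


def random_rotation(dim, seed=0):
    """Random orthogonal matrix: Gram-Schmidt on Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:  # remove components along rows found so far
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def make_codebook(bits, limit):
    """Uniform grid of 2**bits centroids on [-limit, limit] (a stand-in)."""
    levels = 2 ** bits
    width = 2 * limit / levels
    return [-limit + (i + 0.5) * width for i in range(levels)]


def quantize(x, rotation, codebook):
    """Rotate, then snap each coordinate independently to its nearest centroid."""
    y = matvec(rotation, x)
    return [min(range(len(codebook)), key=lambda i: abs(codebook[i] - c))
            for c in y]


def dequantize(codes, rotation, codebook):
    """Look up centroids, then undo the rotation with its transpose."""
    y_hat = [codebook[i] for i in codes]
    inverse = [list(col) for col in zip(*rotation)]
    return matvec(inverse, y_hat)
```

For unit vectors, `limit=1.0` can never clip a coordinate. Because rotated coordinates concentrate around zero with spread near `1 / sqrt(dim)`, a tighter limit of a few multiples of that spends the same bits on a much finer grid, which is the effect the paper's codebooks exploit.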

TinyLoRA fine-tunes a 7B model to 91.8% GSM8K accuracy with only 13 parameters using RL

A Qwen2.5-7B-Instruct model fine-tuned with 13 trainable parameters (26 bytes in bf16) reaches 91.8% on GSM8K. Full fine-tuning of all 7.6 billion parameters reaches 91.7%. That number is not a typo, and yes, they are the same kind of parameters. To understand why this is possible, you need to know what those 13 parameters actually do. They are not 13 weights replacing 7.6 billion others. They are 13 scalar values that get projected through a fixed random tensor into a high-dimensional update, which then gets added back into the frozen weight matrices across all layers. You are not tuning 13 knobs instead of 7.6 billion - you are tuning 13 directions of change that each affect the whole model via a fixed mathematical transformation. The frozen weights do all the heavy lifting; the tiny update steers them. ...
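
A toy version of that projection makes the parameter accounting easier to see. This sketch mirrors only the description above (a handful of trainable scalars, a fixed random projection, an additive update on frozen weights). The flat weight list, the Gaussian projection, and the function names are assumptions, and the actual TinyLoRA parameterization may well be structured differently.

```python
import random

N_TRAINABLE = 13     # trainable scalars, as quoted above
BYTES_PER_BF16 = 2   # 13 * 2 = 26 bytes of trainable state


def make_projection(n_trainable, n_weights, seed=0):
    """Fixed random projection, generated once and never trained."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_weights)]
            for _ in range(n_trainable)]


def apply_update(frozen, theta, projection):
    """Return frozen + sum_k theta[k] * projection[k]; frozen is not modified."""
    delta = [sum(t * row[j] for t, row in zip(theta, projection))
             for j in range(len(frozen))]
    return [w + d for w, d in zip(frozen, delta)]


# Hypothetical tiny "model": 1,000 frozen weights steered by 13 scalars.
frozen = [0.5] * 1000
projection = make_projection(N_TRAINABLE, len(frozen))
theta = [0.0] * N_TRAINABLE  # the only values an optimizer would touch
steered = apply_update(frozen, theta, projection)
```

Each entry of `theta` moves every weight at once along one fixed direction, which is why 13 scalars can reach the whole model even though only 13 numbers are ever trained.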

GitAgent defines AI agents as git repos, exportable to any framework with one CLI command

GitAgent proposes that your git repository is your agent. Two required files - agent.yaml (manifest) and SOUL.md (identity) - define the agent. Everything else - skills, tools, memory, compliance artifacts - is optional structure layered on top. The interesting part is the supervision model: when an agent updates its memory or acquires a new skill, the change becomes a git commit or PR. Human reviewers can diff the agent’s personality changes like any code review. If behavior drifts, git revert brings it back. ...
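
The "two required files" rule is simple enough to check mechanically. The sketch below is a hypothetical helper, not part of GitAgent's CLI, and it only tests that the files exist: the fields inside agent.yaml are not described here, so it does not try to validate them.

```python
from pathlib import Path

# The two files the write-up lists as required.
REQUIRED_FILES = ("agent.yaml", "SOUL.md")


def missing_required_files(repo_root):
    """Return the required file names that are absent from repo_root."""
    root = Path(repo_root)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def looks_like_agent_repo(repo_root) -> bool:
    """True when both required files are present at the repository root."""
    return not missing_required_files(repo_root)
```

A check like this could run in CI next to the usual review of commits and PRs, so a change that deletes the manifest or the identity file is caught before it merges.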

NVIDIA's Nemotron-Cascade 2 hits gold-medal math and coding with a 30B MoE using only 3B active parameters

Nemotron-Cascade 2 is a 30B Mixture-of-Experts model that only activates 3B parameters per forward pass. It’s the second open-weight model to reach gold-medal level on IMO 2025, IOI 2025, and ICPC World Finals - after DeepSeek at 671B, which is more than 20x the size. The core technique is Cascade RL: sequential domain-by-domain reinforcement learning, where each domain (math, code, instruction following, SWE agents) gets its own hyperparameters without destabilizing the others. The novel addition is Multi-Domain On-Policy Distillation (MOPD): when a Cascade RL stage causes regression on other benchmarks, they distill from the best intermediate teacher model for that domain on the fly. On AIME 2025, MOPD reached teacher-level performance in 30 steps; GRPO hit only 91.0 after the same number of steps. ...

GLM-OCR: A 0.9B document parsing specialist beats 235B models

GLM-OCR is a 0.9B multimodal OCR model that achieves 94.62 on OmniDocBench V1.5, ranking first overall despite its compact size. Built by Zhipu AI, it beats Qwen3-VL-235B (260× more parameters) and Gemini-3 Pro on document parsing benchmarks. The model combines a 0.4B CogViT visual encoder with a 0.5B GLM language decoder. Its key innovation is Multi-Token Prediction (MTP), which predicts 10 tokens per step instead of one. That suits OCR because the task is close to deterministic: the model is copying characters, not sampling creative text, so predicting several tokens ahead is usually safe. The result is about 50% higher throughput than standard decoding. ...