Ai | Christian Roy

Baidu's Qianfan-OCR collapses the multi-stage OCR pipeline into one 4B model with Layout-as-Thought

Traditional OCR pipelines chain at least three models: a layout detector, a text recognizer, and a language model for understanding. Qianfan-OCR replaces all three with a single 4B model that goes directly from image to Markdown. The key innovation is Layout-as-Thought: appending a <think> token to any prompt triggers an optional reasoning phase where the model explicitly works through bounding boxes, element types, and reading order before producing output. It’s Chain-of-Thought for document layout - and it’s optional, so you can skip it for simple single-column documents to save latency. ...

Google's Colab MCP server lets any AI agent create and run notebooks in the cloud

Google released an open-source MCP server for Google Colab. Any MCP-compatible agent - Claude Code, Gemini CLI, or a custom agent - can now programmatically control a Colab notebook: create cells, write and execute code, install dependencies, rearrange content. The setup is one config block: "mcpServers": { "colab-mcp": { "command": "uvx", "args": ["git+https://github.com/googlecolab/colab-mcp"], "timeout": 30000 } } The motivation is concrete: developers were copying code from their terminals into Colab cells to run or visualize things. That context switch kills flow. With this server, the agent writes directly into an open notebook - you get a reproducible, executable artifact in the cloud instead of a code snippet in your terminal. ...

NVIDIA's OpenShell enforces AI agent guardrails outside the agent process so a compromised agent can't override them

The problem with agent guardrails that live inside the agent: a compromised agent can override them. Claude Code and Cursor ship with internal safety prompts, but those protections are inside the same process they’re supposed to guard. A prompt injection or a bad third-party skill has access to the same runtime. NVIDIA OpenShell moves the enforcement point outside. It wraps any agent in an isolated container with YAML-defined policies the agent cannot read or modify. Network access is deny-by-default and hot-reloadable; filesystem and process constraints are locked at creation. The agent can’t escalate privileges because the kernel won’t allow it - not because the agent was told not to. ...

Unsloth Studio is an open-source no-code UI for training and running local LLMs

Unsloth Studio bundles local inference, fine-tuning, and model export into a single no-code web UI. One curl command installs it; then you can run GGUF or safetensor models on Mac, Windows, or Linux without writing any code. The training side is the main draw: 2x faster fine-tuning with 70% less VRAM across 500+ model families (text, vision, TTS, embeddings). LoRA, FP8, and full fine-tuning all work on NVIDIA hardware, with multi-GPU support already in. ...

Mistral Small 4 merges instruct, reasoning, and coding into one model with per-request reasoning effort

Mistral Small 4 replaces three separate Mistral models - Magistral for reasoning, Devstral for coding agents, and Mistral Small for instruct - with a single 119B MoE model (128 experts, 4 active, 6.5B active params per token). You pick the behavior per request with a reasoning_effort parameter: reasoning_effort="none": fast chat-style responses, equivalent to Mistral Small 3.2 reasoning_effort="high": deep step-by-step reasoning, equivalent to Magistral Same weights, same deployment, different behavior at inference time. ...

OpenViking: A context database using filesystem paradigm for AI agents

OpenViking abandons traditional RAG vector storage and uses a filesystem paradigm instead. It organizes agent context (memories, resources, skills) under viking:// URIs with a three-tier structure: L0 (Abstract): One-sentence summary for quick retrieval L1 (Overview): Core information and usage scenarios L2 (Details): Full original data, loaded on demand This enables directory recursive retrieval that locks high-score directories first, then refines content exploration. The retrieval trajectory is fully observable, letting users see exactly how context is being accessed. ...

You can force an LLM to only output valid answers

YouTube just open-sourced a project called STATIC that solves a problem most people don’t know exists: LLMs can say anything, but sometimes you need them to only pick from a specific list. The Problem When an LLM generates text, it picks one token (word/number) at a time from a vocabulary of ~32,000+ options. That’s great for conversation, but terrible when you need it to output something specific: a valid product ID, a medical code, or a video recommendation from a catalog of millions. ...

Pipe Mastra agent responses through jq to colorize reasoning and tool calls in the terminal

Mastra’s agent HTTP API returns a JSON structure with steps, each containing content items typed as reasoning, tool-call, tool-result, and text. The raw output is dense. Start by exploring it: # Hit the API and see raw structure http localhost:4111/api/agents/weather-agent/generate \ messages[0]="what's the weather in montreal?" | jq . # Get just the final answer http localhost:4111/api/agents/weather-agent/generate \ messages[0]="what's the weather in montreal?" | jq -r '.text' # Explore what's inside steps http localhost:4111/api/agents/weather-agent/generate \ messages[0]="what's the weather in montreal?" | jq '.steps[].content[] | .type' # "reasoning" # "tool-call" # "tool-result" # "text" # "reasoning" # "text" # See what fields each type has http localhost:4111/api/agents/weather-agent/generate \ messages[0]="what's the weather in montreal?" | jq '.steps[].content[] | select(.type == "tool-call")' Once the structure is clear, pipe through jq -r with inline ANSI escape sequences to colorize each piece: ...

OpenClaw custom skills silently disappear without quoted YAML descriptions and openclaw metadata

If a custom OpenClaw skill doesn’t show up in openclaw skills list and the agent can’t see it either, the SKILL.md frontmatter is likely the culprit. OpenClaw fails silently, so the debugging feedback is minimal. Two things must be right. First, any name or description containing a colon must be wrapped in double quotes, otherwise YAML interprets the colon as a key-value separator and the parse fails. Second, the frontmatter must include an openclaw metadata block declaring the emoji icon and any required binaries or environment variables. Without it, OpenClaw won’t register the skill at all. ...

Google DeepMind's Lyria 3 generates full songs from a photo or a sentence

Lyria 3 takes a text prompt or an image and produces a complete track: instrumentation, vocals, lyrics. Not a loop, not a mood board. A song. The image input is what makes it interesting. Most generative audio models take text. Lyria 3 can look at a picture and decide what it sounds like. That’s a different kind of creative interpretation, closer to how a composer might respond to visual art than to a spec. ...