Baidu's Qianfan-OCR collapses the multi-stage OCR pipeline into one 4B model with Layout-as-Thought
Traditional OCR pipelines chain at least three models: a layout detector, a text recognizer, and a language model for understanding. Qianfan-OCR replaces all three with a single 4B model that goes directly from image to Markdown. The key innovation is Layout-as-Thought: appending a <think> token to any prompt triggers an optional reasoning phase in which the model explicitly works through bounding boxes, element types, and reading order before producing output. It’s Chain-of-Thought for document layout, and because it’s optional, you can skip it for simple single-column documents to save latency. ...
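To make the opt-in concrete, here is a minimal Python sketch of how a caller might toggle Layout-as-Thought per request. Only the idea of appending a <think> token comes from the description above. The helper names, the space before the token, and the assumption that the reasoning phase ends with a closing </think> marker are all hypothetical, and this is not Qianfan-OCR's actual API.

```python
from typing import Tuple

THINK_TOKEN = "<think>"       # token named in the description above
THINK_CLOSE = "</think>"      # ASSUMPTION: closing marker, not confirmed by the source


def build_prompt(instruction: str, layout_as_thought: bool = False) -> str:
    """Return the prompt, appending the <think> token only when requested.

    Hypothetical helper: joining with a single space is an assumption.
    """
    instruction = instruction.rstrip()
    if layout_as_thought and not instruction.endswith(THINK_TOKEN):
        return f"{instruction} {THINK_TOKEN}"
    return instruction


def split_layout_and_markdown(raw: str) -> Tuple[str, str]:
    """Split raw model text into (layout reasoning, final Markdown).

    ASSUMPTION: the reasoning phase, if present, ends with </think>.
    If the marker is absent, everything is treated as Markdown.
    """
    head, sep, tail = raw.partition(THINK_CLOSE)
    if not sep:
        return "", raw.strip()
    reasoning = head.replace(THINK_TOKEN, "", 1).strip()
    return reasoning, tail.strip()
```

A caller could leave `layout_as_thought=False` for simple single-column pages to save latency and set it to `True` for multi-column or table-heavy documents, where the explicit pass over bounding boxes and reading order is more likely to pay off.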