Scoutari AI Builder Intel · decision desk

Wed, July 108:25 · Tools · Model releases · Open source · Infra & cost

Ollama v0.31.1 Boosts Gemma 4 Token Generation by Nearly 90% on Apple Silicon with Multi-Token Prediction

Decision Brief

What changed: Ollama v0.31.1 uses multi-token prediction (MTP) to accelerate Gemma 4 inference on Apple Silicon.
Why it matters: With no configuration, MTP speeds up token generation on coding-agent tasks by about 90% on average, which makes local inference noticeably faster.
Who should care: AI coding tool users, inference / infra teams
Affected stack: Ollama, Llama
Builder action: Upgrade
Source confidence: High · Official release / blog / repo
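
To check the reported speedup on your own hardware, one option is to compare decode throughput before and after upgrading. The sketch below assumes a local Ollama server on its default port and uses the `eval_count` and `eval_duration` fields that the `/api/generate` endpoint returns in non-streaming mode. The helper names (`tokens_per_second`, `speedup_percent`, `measure`) are illustrative and do not come from the release notes.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumed unchanged).
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Decode throughput from a non-streaming /api/generate response.

    eval_count is the number of generated tokens; eval_duration is the
    time spent generating them, in nanoseconds.
    """
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


def speedup_percent(before_tps: float, after_tps: float) -> float:
    """Relative throughput gain, e.g. 50 -> 95 tokens/s is a 90% speedup."""
    return (after_tps / before_tps - 1.0) * 100.0


def measure(model: str, prompt: str) -> float:
    """Run one generation against the local server and return tokens/s.

    `model` is whatever tag you have pulled locally; this sketch does not
    assume a particular Gemma 4 tag name.
    """
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))
```

Calling `measure(...)` with the same model tag and prompt on the old and new versions, then passing both results to `speedup_percent`, gives a rough local figure. A single run is noisy, so averaging several prompts typical of your workload is likely to be more representative than one measurement.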

Ollama v0.31.1 optimizes Gemma 4 on Apple Silicon by enabling multi-token prediction (MTP), achieving an average 90% speedup in token generation on coding agent benchmarks. This feature is on by default, requiring no configuration, and does not alter model output. The update also refreshes the MLX and llama.cpp engines, improving model loading and matrix multiplication kernels.
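
The brief does not describe how Ollama implements MTP internally. As a rough intuition for how predicting several tokens at once can speed up generation without changing the output, here is a toy draft-then-verify loop under greedy decoding. Everything in it is an assumption made for illustration: `toy_target`, `toy_draft`, and `draft_and_verify_decode` are hypothetical stand-ins, not Ollama or Gemma code.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def toy_target(seq: List[int]) -> int:
    """Stand-in for the full model's greedy next-token choice."""
    return (seq[-1] * 3 + len(seq)) % 7


def toy_draft(seq: List[int]) -> int:
    """Stand-in for a cheap multi-token predictor: usually right, sometimes wrong."""
    guess = toy_target(seq)
    return (guess + 1) % 7 if len(seq) % 5 == 0 else guess


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def draft_and_verify_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Propose up to k tokens per round, keep only what the target agrees with.

    Returns the decoded sequence and the number of verification rounds. In a
    real system each round would be a single batched pass of the full model;
    here the target is called per position purely for clarity.
    """
    seq = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(seq) < goal:
        # Draft step: guess several future tokens from the current context.
        proposal: List[int] = []
        context = list(seq)
        for _ in range(min(k, goal - len(seq))):
            token = draft(context)
            proposal.append(token)
            context.append(token)
        # Verify step: always append the target's own token, and stop at the
        # first disagreement because later guesses used a wrong context.
        rounds += 1
        for token in proposal:
            expected = target(seq)
            seq.append(expected)
            if token != expected:
                break
    return seq, rounds
```

Because every appended token is the target's own choice, the result matches plain greedy decoding exactly. The potential gain comes from needing fewer verification rounds than tokens generated when the draft is usually right.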

Summary basis: official / RSS source
Unless it says 'full article read', this summary is based only on publicly available content; it never pretends to have read restricted originals.
