Ollama v0.31.1 Boosts Gemma 4 Token Generation by Nearly 90% on Apple Silicon with Multi-Token Prediction
Decision Brief
What changed: Ollama v0.31.1 uses multi-token prediction to significantly accelerate Gemma 4 inference on Apple Silicon.
Why it matters: Multi-token prediction speeds token generation in coding agent tasks by ~90% on average, out of the box, enhancing local inference efficiency.
Who should care: AI coding tool users, inference / infra teams
Affected stack: Ollama, Llama
Builder action: Upgrade
Source confidence: High · Official release / blog / repo
Ollama v0.31.1 speeds up Gemma 4 on Apple Silicon by enabling multi-token prediction (MTP), for an average speedup of about 90% in token generation on coding agent benchmarks. MTP is on by default, needs no configuration, and does not alter model output. The update also refreshes the MLX and llama.cpp engines, improving model loading and matrix multiplication kernels.
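To check the claimed speedup on your own machine, one option is to compare decode throughput before and after upgrading. The sketch below is a minimal illustration, not part of the release: it assumes a local Ollama server on the default port and relies on the `eval_count` and `eval_duration` (nanoseconds) fields returned by a non-streaming `/api/generate` call. The model tag `gemma4` and the helper names are placeholders; substitute the tag that `ollama list` shows on your system.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint
MODEL = "gemma4"  # assumption: placeholder tag, use the one from `ollama list`


def tokens_per_second(response: dict) -> float:
    """Decode throughput from a non-streaming /api/generate response.

    eval_count is the number of generated tokens and eval_duration is the
    time spent generating them, in nanoseconds.
    """
    return response["eval_count"] / response["eval_duration"] * 1e9


def speedup_percent(before_tps: float, after_tps: float) -> float:
    """Relative speedup in percent; 90% means 1.9x the original throughput."""
    return (after_tps / before_tps - 1.0) * 100.0


def generate(prompt: str, model: str = MODEL) -> dict:
    """Send one non-streaming generation request to the local Ollama server."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

Call `tokens_per_second(generate(prompt))` with the same prompt on the old and new versions, then pass the two figures to `speedup_percent`. A single run is noisy, so averaging several prompts gives a fairer comparison with the reported benchmark average.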
Summary basis: official / RSS source
Sources
- Ollama (GitHub Releases)
Local-model runtime releases: new supported models and serving features.