Wed, July 108:25 · Tools · Model releases · Open source · Infra & cost

Ollama v0.31.1 Boosts Gemma 4 Token Generation by Nearly 90% on Apple Silicon with Multi-Token Prediction

Decision Brief

What changed: Ollama v0.31.1 uses multi-token prediction to significantly accelerate Gemma 4 inference on Apple Silicon.

Why it matters: Multi-token prediction speeds up token generation in coding agent tasks by ~90% on average, out of the box, making local inference more efficient.

Who should care: AI coding tool users, inference / infra teams

Affected stack: Ollama, Llama

Builder action: Upgrade

Source confidence: High · Official release / blog / repo

Ollama v0.31.1 speeds up Gemma 4 on Apple Silicon by enabling multi-token prediction (MTP), for an average speedup of roughly 90% in token generation on coding agent benchmarks. The feature is on by default, requires no configuration, and does not alter model output. The update also refreshes the MLX and llama.cpp engines, improving model loading and matrix multiplication kernels.
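To check the reported speedup on your own hardware, one option is to compare decode throughput before and after upgrading. The sketch below is a minimal example, not a benchmark harness. It assumes a local Ollama server on its default port and relies on the `eval_count` and `eval_duration` (nanoseconds) fields returned by the `/api/generate` endpoint. The helper names are illustrative, and the model tag passed to `measure` is whatever Gemma 4 tag you have pulled locally; the release summary does not give one.

```python
import json
import urllib.request

# Default local Ollama endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Decode throughput from Ollama's eval_count / eval_duration (nanoseconds)."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def speedup_percent(baseline_tps: float, new_tps: float) -> float:
    """Relative throughput gain in percent (90.0 means 1.9x the baseline)."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return (new_tps / baseline_tps - 1.0) * 100.0


def measure(model: str, prompt: str) -> float:
    """Run one non-streaming generation and return decode tokens per second.

    `model` is a placeholder for a locally pulled model tag (an assumption,
    not a name taken from the release notes).
    """
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return tokens_per_second(body["eval_count"], body["eval_duration"])
```

Run `measure` with the same model tag and prompt on the previous Ollama version and again on v0.31.1, then pass the two results to `speedup_percent`. A single run is noisy, so averaging several prompts gives a fairer comparison with the benchmark average quoted above.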

Summary basis: official / RSS source. Unless it says 'full article read', this summary is based only on publicly available content; it never pretends to have read restricted originals.

Sources

Ollama (GitHub Releases)
Local-model runtime releases: new supported models and serving features.