How to Install Ollama on Raspberry Pi for Edge AI in 2026: Step-by-Step Guide

Why Run Ollama on a Raspberry Pi?

When I first started experimenting with local LLMs for robotics, I assumed I needed a beefy desktop with an RTX 4090. Turns out, I was wrong. For many edge AI workloads — command parsing on a robot, text classification on a sensor gateway, offline Q&A on a kiosk — a Raspberry Pi 5 is more than enough.

The appeal is obvious: a Pi consumes 5-15 watts instead of 300+, costs a fraction of a GPU workstation, and can run on battery in a mobile robot. In this guide, I’ll show you exactly how to get Ollama running on a Raspberry Pi, which models actually work well, and how to make the most of your limited hardware.

What You’ll Need

  • Raspberry Pi 5 (8GB model strongly recommended) — the 4GB model works but you’ll be limited to the smallest models
  • Raspberry Pi OS (64-bit, Bookworm or later) — the 32-bit OS will not work with Ollama
  • Good cooling — a heatsink + fan is non-negotiable. LLM inference is sustained 100% CPU load and the Pi will throttle without it
  • An SSD (NVMe or USB 3.0) — microSD cards are too slow for model loading; you’ll wait 3-4 minutes per model
  • Stable internet for the initial model download
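
Not sure whether you're actually booting from the SSD? A quick sanity check (nothing Ollama-specific, just a habit of mine) is to look at what kind of drive the root filesystem lives on:

lsblk -o NAME,SIZE,TRAN,MOUNTPOINT
# "nvme" or "usb" under TRAN for the disk holding "/" is what you want;
# an mmcblk device means you're still running from the microSD card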

Step 1: Confirm Your Pi Is Ready

Before we install anything, verify you have a 64-bit OS:

uname -m

You should see aarch64. If you see armv7l, you’re running 32-bit — reinstall with the 64-bit Raspberry Pi OS image.

Check your available RAM:

free -h
# Ideally shows 7.5G+ available (for the 8GB model)

And ensure you have at least 10GB free disk space:

df -h /

Step 2: Install Ollama

Good news: the official Ollama install script works perfectly on Raspberry Pi 5. No compiling from source, no Docker containers. Just run:

curl -fsSL https://ollama.com/install.sh | sh

The script detects your system as linux-arm64 and pulls the correct binary. This usually takes about 30 seconds.

Once installed, start the Ollama service and verify it’s running:

sudo systemctl start ollama
sudo systemctl status ollama
# Should show "active (running)"
ollama --version
# Expect something like: ollama version 0.5.x
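
Since an edge device should come back up ready after a power cycle, I also make sure the service is enabled at boot. The install script normally sets this up, but it's cheap to confirm:

systemctl is-enabled ollama
# If it prints "disabled", enable it:
sudo systemctl enable ollama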

If the service doesn’t start, check the logs:

journalctl -u ollama --no-pager -n 30

Most common issue: not enough memory. If you’re on a 4GB Pi, you’ll need to edit the Ollama service to limit model size. More on that below.

Step 3: Test the API

Ollama runs a REST API on port 11434 by default. Confirm it’s listening:

curl http://localhost:11434/api/tags
# Should return {"models":[]} since we haven't pulled any yet
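
You can also hit the root endpoint, which just returns a plain-text status string:

curl http://localhost:11434/
# Should print: Ollama is running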

If curl hangs, the service isn’t running — go back and check the service status.

Step 4: Pull Models That Actually Run on a Pi

This is the critical part. You cannot run 7B-parameter models on a Raspberry Pi 5 without unbearable latency (we’re talking 3-5 minutes per response). Stick to quantized small models.

Here are the models I’ve tested and actually work well:

Qwen 2.5 0.5B (Best for fast inference)

ollama pull qwen2.5:0.5b
# ~350MB download, runs in ~500MB RAM
# Generation speed: 5-10 tokens/second

This is my default choice for robot command parsing. It’s fast enough for near-real-time interaction and handles structured output (JSON) surprisingly well for its size.
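
To see that structured output in action, here's a rough sketch of the kind of request I use for command parsing — the prompt wording and the JSON keys are just an example, but the "format": "json" option is what forces the model to return valid JSON:

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:0.5b",
  "prompt": "Extract the robot command from: go to the charging dock. Reply as JSON with keys action and target.",
  "format": "json",
  "stream": false
}'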

Phi-3.5-mini 3.8B (Best for quality)

ollama pull phi3.5:3.8b-mini-instruct-q4_K_M
# ~2.5GB download, runs in ~4GB RAM
# Generation speed: 1-3 tokens/second (slow but usable)

This is the largest model that fits comfortably on an 8GB Pi. Takes 30-60 seconds for a paragraph of text, but the output quality is noticeably better than the 0.5B models. Use this for tasks where quality matters more than speed.

Llama 3.2 1B (Good middle ground)

ollama pull llama3.2:1b
# ~700MB download, runs in ~1.2GB RAM
# Generation speed: 8-15 tokens/second

A solid all-rounder. Better instruction-following than Qwen 0.5B, faster than Phi-3.5. I use this for general-purpose text tasks.

Check Your Running Models

ollama list

This shows all pulled models and their sizes.
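
To see which models are currently loaded into RAM (as opposed to just sitting on disk), and to free disk space from models you no longer need:

ollama ps          # models currently loaded in memory
ollama rm <model>  # delete a pulled model from disk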

Step 5: Run Inference

Let’s test with a simple prompt:

ollama run qwen2.5:0.5b "Explain what edge AI means in three sentences."

You should see it load the model into memory (takes 5-10 seconds the first time) and then start streaming the response. The first invocation is always slowest because the model has to be loaded into RAM. Subsequent calls will be faster.

For programmatic use, use the API directly:

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:0.5b",
  "prompt": "What is edge AI?",
  "stream": false
}' | python3 -m json.tool

Set "stream": false to get the full response at once — useful for scripts that need to parse the output.

Performance Benchmarks (Raspberry Pi 5, 8GB)

Here are real numbers from my setup (with an active cooler, running from an NVMe SSD):

Model           | RAM Used | Tokens/sec  | First Response
Qwen 2.5 0.5B   | ~500 MB  | 8-12 tok/s  | ~2s
Llama 3.2 1B    | ~1.2 GB  | 6-10 tok/s  | ~3s
Phi-3.5 3.8B    | ~4 GB    | 1-3 tok/s   | ~15s

CPU temperature stabilizes around 70-75°C with active cooling. Without a heatsink, you’ll hit 85°C and throttle within 30 seconds of inference.
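
An easy way to keep an eye on this is to watch the firmware's temperature and throttle flags while a prompt runs in another terminal:

watch -n 2 'vcgencmd measure_temp && vcgencmd get_throttled'
# throttled=0x0 means no throttling; any other value means the Pi is
# (or has been) throttling or undervolted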

Optimizing for Edge Use Cases

Here’s what I’ve learned from running this on a mobile robot for the past month:

Keep it simple. Small models (0.5B-1B) are fast enough to run on every sensor cycle. Large models (3.8B) need dedicated inference pauses. Design your edge application around the model’s latency profile, not against it.

Set context length low. Reduce the context window from the default 2048 to 512 tokens:

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:0.5b",
  "prompt": "Your short prompt here",
  "options": {
    "num_ctx": 512
  },
  "stream": false
}'

This halves inference time for short prompts.

Use temperature 0.1 for structured output. When parsing commands or extracting JSON, low temperature ensures consistent, repeatable results.
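
For example (the classification prompt here is made up — the point is that temperature goes under options, just like num_ctx):

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:0.5b",
  "prompt": "Classify this log line as INFO, WARN, or ERROR: sensor timeout on /dev/ttyUSB0",
  "options": { "temperature": 0.1 },
  "stream": false
}'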

Pre-load your model. Keep the model loaded in memory instead of unloading after each inference:

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:0.5b",
  "prompt": "",
  "keep_alive": -1
}'

This keeps the model resident. The first request will still load it, but subsequent ones will respond instantly.

Running on 4GB Raspberry Pi 5

If you only have the 4GB model, your options are limited:

  • Qwen 2.5 0.5B works fine
  • Llama 3.2 1B works but barely — you’ll have ~500MB free after loading
  • Phi-3.5 3.8B will swap and become unusable

Consider editing the Ollama service to set OLLAMA_KEEP_ALIVE=0 (unload models immediately after use) to minimize memory pressure. Create an override file:

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_KEEP_ALIVE=0"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama

Edge AI Use Cases on a Pi

With Ollama on your Pi, here's what becomes practical:

  • Robot command parsing — feed speech-to-text output into a small model that extracts structured commands for your ROS2 nodes (covered in my separate guide on integrating an LLM with ROS2)
  • Sensor data classification — have the LLM classify sensor readings or log entries without sending data to the cloud
  • Offline local chatbot — a private, offline voice assistant that never touches the internet
  • Text summarization — summarize logs or documents on-device
  • Smart home control — natural language command interface for home automation (Home Assistant integration)
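
All of these boil down to the same pattern: send a prompt to the local API and read the response back. A tiny wrapper script is usually all the glue you need — here's a rough sketch (the script name and the default model are just examples):

#!/usr/bin/env bash
# ask.sh -- minimal helper around the local Ollama API for edge scripts
# Usage: ./ask.sh "Summarize this log entry: ..."
MODEL="${OLLAMA_MODEL:-qwen2.5:0.5b}"
python3 -c 'import json,sys; print(json.dumps({"model": sys.argv[1], "prompt": sys.argv[2], "stream": False}))' \
  "$MODEL" "$1" \
  | curl -s http://localhost:11434/api/generate -d @- \
  | python3 -c 'import sys, json; print(json.load(sys.stdin)["response"])'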

Troubleshooting

Ollama fails to start: Check if you have enough memory. Run free -h. If you have less than 1GB free, Ollama will refuse to start. Close other applications.

Model download fails: The Pi's Wi-Fi can be finicky with large downloads. Use a wired Ethernet connection, or download the model on your desktop and transfer the Ollama blob files (~/.ollama/models/blobs/).

Inference is extremely slow: You're either thermal throttling (check vcgencmd measure_temp) or running out of RAM (check htop). Fix cooling first, then switch to a smaller model.

Ollama loads the same model twice: This happens when the GGUF file isn't properly detected. Delete the model with ollama rm <model> and re-pull.

Segfault on model load: Usually indicates a corrupted download. Run ollama rm <model> and pull again.

Wrap Up

Running Ollama on a Raspberry Pi isn't about matching desktop LLM performance — it's about having AI inference where you need it, without the cloud, without the power bill, and without the latency of a network round trip. For edge robotics, sensor processing, and offline automation, a Pi 5 with Ollama is a genuinely useful tool.

Start with Qwen 2.5 0.5B, get a feel for the API, then scale up to Phi-3.5 if your use case needs better reasoning. And please — get that heatsink and fan. Your Pi will thank you.
