Getting Started with Llamafile: Your First Local LLM in Under 5 Minutes

Introduction

If you've ever wanted to run a large language model on your own computer but felt intimidated by the setup process, Llamafile is a good place to start. A llamafile bundles the model weights and the inference runtime into a single executable, so this guide gets you to a running local LLM in about five minutes once the download finishes, with no Python, CUDA toolkit, or Docker knowledge required.

What You'll Need

A Mac, Linux, or Windows computer (Raspberry Pi works too!)
A terminal (Command Prompt or PowerShell on Windows)
An internet connection to download one file

Step 1: Download a Llamafile

Llamafiles are pre-packaged models available on Hugging Face. For this guide, we'll use a small 0.8B-parameter model that should run on most modern computers, even without a GPU:

# Linux/macOS
curl -LO https://huggingface.co/mozilla-ai/llamafile_0.10/resolve/main/Qwen3.5-0.8B-Q8_0.llamafile

On Windows, download the same file from the Hugging Face repo in your browser, then rename it so that it ends in .exe (for example, Qwen3.5-0.8B-Q8_0.llamafile.exe).

Step 2: Make It Executable

# Linux/macOS only — Windows skips this step
chmod +x Qwen3.5-0.8B-Q8_0.llamafile

Step 3: Run It

./Qwen3.5-0.8B-Q8_0.llamafile

That's it. The terminal shows startup logs while the model loads, and once loading finishes an HTTP server listens on http://localhost:8080. Open that URL in your browser to use the built-in chat interface. Press Ctrl+C in the terminal when you want to stop the server.
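If you want to script against the server, it helps to wait until it actually answers before sending requests. Here is a minimal sketch of that idea. The function name wait_for_server, the 30-second default, and the choice to poll the root URL are assumptions for illustration, not part of Llamafile itself.

```shell
# Poll a URL once per attempt until it answers or the time budget runs out.
# Usage: wait_for_server [url] [timeout_seconds]
# Note: wait_for_server is a hypothetical helper, not a Llamafile command.
wait_for_server() {
  url="${1:-http://localhost:8080}"   # default address from this guide
  timeout="${2:-30}"                  # assumed default, adjust as needed
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # -s: quiet, -f: treat HTTP errors as failures, -o /dev/null: discard body
    if curl -sf -o /dev/null --max-time 2 "$url"; then
      echo "Server is up at $url"
      return 0
    fi
    sleep 1
    # Counts attempts, so the real wait can be somewhat longer than timeout
    elapsed=$((elapsed + 1))
  done
  echo "Server did not respond at $url within ${timeout}s" >&2
  return 1
}
```

Start the llamafile in one terminal, then run wait_for_server in another (or in a script) before making API calls.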

Step 4: Chat!

You now have a working LLM running entirely on your machine. Inference happens locally, so your prompts and responses stay on your computer. The chat interface supports:

Multi-turn conversations
Markdown formatting in responses
System prompts to customize behavior
Temperature and other generation parameters

Going Further: Using the API

The server exposes an OpenAI-compatible API, so you can connect it to any tool that supports OpenAI's format:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-0.8B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
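The system prompt and temperature options mentioned earlier can also be set through the API. The sketch below builds the request body in a small shell function and pipes it to curl. The helper names (json_escape, build_chat_payload, chat) and the 0.7 default temperature are assumptions for illustration, and the escaping only covers backslashes and double quotes, so prompts containing newlines need a real JSON tool.

```shell
# Escape backslashes and double quotes so text can sit inside a JSON string.
# Limitation: newlines and other control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Print a chat completion request body.
# Usage: build_chat_payload <system_prompt> <user_prompt> [temperature]
build_chat_payload() {
  system_prompt=$(json_escape "$1")
  user_prompt=$(json_escape "$2")
  temperature="${3:-0.7}"   # assumed default for this sketch
  printf '{"model":"Qwen3.5-0.8B","temperature":%s,"messages":[{"role":"system","content":"%s"},{"role":"user","content":"%s"}]}\n' \
    "$temperature" "$system_prompt" "$user_prompt"
}

# Send the payload to the local server; "-d @-" makes curl read it from stdin.
chat() {
  build_chat_payload "$@" | curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @-
}
```

With the server running, chat 'You are a terse assistant.' 'What is a llamafile?' 0.2 sends one request and prints the raw JSON response.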

Docker Deployment

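For server deployments, you can run Llamafile in a container. Confirm the image name and available tags in the project's repository before deploying, and prefer a pinned version tag over latest in production: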

docker pull ghcr.io/mozilla-ai/llamafile:latest
docker run -p 8080:8080 ghcr.io/mozilla-ai/llamafile:latest

Running Larger Models

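If you have a GPU, Llamafile can use it through CUDA (NVIDIA), Metal (Apple Silicon), or Vulkan. Metal is typically used automatically on Apple Silicon; on other GPUs you may need to pass -ngl 999 to offload the model's layers to the GPU. Browse huggingface.co/mozilla-ai for larger pre-built llamafiles, or use the --model flag to load any GGUF file with the standalone llamafile binary.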
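As a rough sketch of the external-weights approach, the function below checks that both files exist and then starts the runtime with a GGUF model. The name run_gguf and both file paths are placeholders; --model comes from this guide, and -ngl (number of layers to offload to the GPU) is assumed to be available in your build.

```shell
# Launch a standalone llamafile runtime with external GGUF weights.
# Usage: run_gguf <runtime_binary> <model.gguf> [gpu_layers]
# Note: run_gguf is a hypothetical wrapper, not a Llamafile command.
run_gguf() {
  runtime="$1"
  model="$2"
  gpu_layers="${3:-999}"   # 999 asks to offload as many layers as possible
  if [ ! -x "$runtime" ]; then
    echo "Runtime not found or not executable: $runtime" >&2
    return 1
  fi
  if [ ! -f "$model" ]; then
    echo "Model file not found: $model" >&2
    return 1
  fi
  # Pass 0 as gpu_layers to stay on the CPU
  "$runtime" --model "$model" -ngl "$gpu_layers"
}
```

For example, run_gguf ./llamafile ./my-model.gguf would start the runtime with full GPU offload, where both paths are placeholders for files you have downloaded.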

Troubleshooting

"Permission denied" — Run chmod +x first
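File is too large for Windows — Windows cannot run executables larger than 4 GB, so use the standalone llamafile binary with external weights (--model) instead
Out of memory — Try a smaller or more heavily quantized model, or reduce the context size. Avoid --mlock here: it pins the model in RAM and does not lower memory use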
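The first two problems can be checked before you try to run anything. The sketch below is one way to do that. The function name check_llamafile is made up for this guide, and the limit is written as 4 GiB (4294967296 bytes), which is an assumption about exactly where Windows draws the line.

```shell
# Check a downloaded llamafile for two common problems:
# a missing execute bit and a size above the Windows executable limit.
# Usage: check_llamafile <file>
# Note: check_llamafile is a hypothetical helper, not a Llamafile command.
check_llamafile() {
  file="$1"
  limit=4294967296   # 4 GiB, assumed Windows maximum for executables
  if [ ! -f "$file" ]; then
    echo "missing: $file"
    return 1
  fi
  status=0
  if [ ! -x "$file" ]; then
    echo "not executable: run chmod +x $file"
    status=1
  fi
  # wc -c prints the size in bytes; tr strips the padding some systems add
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -ge "$limit" ]; then
    echo "too large for Windows: $size bytes"
    status=1
  fi
  if [ "$status" -eq 0 ]; then
    echo "ok: $file"
  fi
  return "$status"
}
```

Running check_llamafile Qwen3.5-0.8B-Q8_0.llamafile right after the download tells you whether Step 2 is still needed.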

Conclusion

Llamafile is one of the easiest ways to run LLMs locally. In the time it takes to make coffee, you can have a capable language model running on your laptop, private and, once the file is downloaded, fully offline. Give it a try and see how little setup it takes.