Getting Started with Llamafile: Your First Local LLM in Under 5 Minutes
Introduction
If you've ever wanted to run a large language model on your own computer but felt intimidated by the setup process, Llamafile is built for you. This guide walks you through running your first local LLM in under 5 minutes, with no Python, CUDA toolkit, or Docker knowledge required.
What You'll Need
- A Mac, Linux, or Windows computer (Raspberry Pi works too!)
- A terminal (Command Prompt on Windows)
- An internet connection to download one file
Step 1: Download a Llamafile
Llamafiles are pre-packaged models available on Hugging Face. For this guide, we'll use a small model that runs on any computer:
# Linux/macOS
curl -LO https://huggingface.co/mozilla-ai/llamafile_0.10/resolve/main/Qwen3.5-0.8B-Q8_0.llamafile
For Windows, download the file from the Hugging Face repo and rename it to add the .exe extension.
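If the download gets interrupted, it can be resumed rather than restarted. The following is a minimal sketch, assuming curl is installed; the helper names `filename_from_url`, `download`, and `looks_downloaded` are made up for illustration and are not part of Llamafile.

```shell
#!/bin/sh
# URL taken from the download command in this step.
URL="https://huggingface.co/mozilla-ai/llamafile_0.10/resolve/main/Qwen3.5-0.8B-Q8_0.llamafile"

# Derive the local filename from the last path segment of a URL.
filename_from_url() {
    basename "$1"
}

# Download with resume support: -L follows redirects,
# and -C - asks curl to continue from where a partial file left off.
download() {
    curl -L -C - -o "$(filename_from_url "$1")" "$1"
}

# Succeed only if the file exists and is not empty.
# This is a basic sanity check, not an integrity check.
looks_downloaded() {
    [ -s "$1" ]
}

# Usage (needs network access, so it is left commented out):
# download "$URL" && looks_downloaded "$(filename_from_url "$URL")"
```

The check only catches a missing or empty file; it does not verify that the contents are complete or untampered.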
Step 2: Make It Executable
# Linux/macOS only — Windows skips this step
chmod +x Qwen3.5-0.8B-Q8_0.llamafile
Step 3: Run It
./Qwen3.5-0.8B-Q8_0.llamafile
That's it. The terminal will show startup logs, and within seconds an HTTP server will be running on http://localhost:8080. Open that URL in your browser to access the built-in chat interface.
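When the llamafile is started from a script, it helps to wait until the server actually answers before sending requests. Here is a rough sketch, assuming curl is available and the server listens on the default http://localhost:8080 mentioned above; `wait_for_server` is a hypothetical helper, not a Llamafile command.

```shell
#!/bin/sh
# Poll a URL until it answers or roughly `timeout` seconds have passed.
# Returns 0 once the server responds, 1 if it never does.
wait_for_server() {
    url="$1"
    timeout="${2:-30}"   # default of 30 seconds is an arbitrary choice
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # -s hides progress output, -f treats HTTP 4xx/5xx as failure,
        # --max-time caps each attempt at 2 seconds.
        if curl -sf --max-time 2 -o /dev/null "$url"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# Usage: start the llamafile in another terminal or in the background, then:
# wait_for_server http://localhost:8080 60 && echo "server is up"
```

The timeout is approximate, since only the one-second sleeps are counted and not the time spent in each curl attempt.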
Step 4: Chat!
You now have a fully functional LLM running entirely on your machine. No data leaves your computer. The chat interface supports:
- Multi-turn conversations
- Markdown formatting in responses
- System prompts to customize behavior
- Temperature and other generation parameters
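System prompts and temperature are not limited to the browser UI. As a sketch, the function below builds a request body in OpenAI's chat format that carries both, for the /v1/chat/completions endpoint covered in the next section. The helper name `build_chat_payload` and the default temperature of 0.7 are assumptions for illustration, and `temperature` is the standard OpenAI-format field rather than something this guide confirms for every llamafile build.

```shell
#!/bin/sh
# Build a chat request body with a system prompt and a temperature.
# Assumption: the arguments contain no double quotes, backslashes, or
# newlines, because this sketch does no JSON escaping.
build_chat_payload() {
    system_prompt="$1"
    user_message="$2"
    temperature="${3:-0.7}"   # illustrative default, not a Llamafile setting
    printf '{"model": "Qwen3.5-0.8B", "temperature": %s, "messages": [{"role": "system", "content": "%s"}, {"role": "user", "content": "%s"}]}\n' \
        "$temperature" "$system_prompt" "$user_message"
}

# Usage (needs the server from Step 3 running); -d @- reads the body from stdin:
# build_chat_payload "You are a terse assistant." "Hello!" 0.2 |
#     curl -s http://localhost:8080/v1/chat/completions \
#         -H "Content-Type: application/json" -d @-
```

For anything beyond fixed strings, a JSON-aware tool is a safer way to build the body than printf.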
Going Further: Using the API
The server exposes an OpenAI-compatible API, so you can connect it to any tool that supports OpenAI's format:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-0.8B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
Docker Deployment
For server deployments, use the official Docker image:
docker pull ghcr.io/mozilla-ai/llamafile:latest
docker run -p 8080:8080 ghcr.io/mozilla-ai/llamafile:latest
Running Larger Models
If you have a GPU, Llamafile automatically detects and uses CUDA (NVIDIA), Metal (Apple Silicon), or Vulkan. Browse huggingface.co/mozilla-ai for larger pre-built llamafiles or use the --model flag with any GGUF file.
Troubleshooting
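- "Permission denied": run chmod +x on the file first.
- File is too large for Windows: Windows does not run executables larger than 4 GB, so use the llamafile binary with external weights instead.
- Out of memory: try a smaller model. The --mlock flag only keeps the model from being swapped out of RAM; it does not reduce memory use.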
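The Windows size problem can be checked before you copy a file over. This is a minimal sketch, assuming the limit is 4 GiB (4294967296 bytes); `fits_windows_exe_limit` is a hypothetical helper name.

```shell
#!/bin/sh
# Assumed limit: 4 GiB in bytes, the largest executable Windows will run.
WINDOWS_EXE_LIMIT=4294967296

# Succeeds (exit 0) if the file is small enough to run as a single .exe.
# Returns 2 if the path is not a regular file.
fits_windows_exe_limit() {
    [ -f "$1" ] || return 2
    # wc -c prints the byte count; tr strips the padding some systems add.
    size=$(wc -c < "$1" | tr -d ' ')
    [ "$size" -lt "$WINDOWS_EXE_LIMIT" ]
}

# Usage:
# fits_windows_exe_limit Qwen3.5-0.8B-Q8_0.llamafile \
#     && echo "OK to rename to .exe" \
#     || echo "too large: use external weights"
```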
Conclusion
Llamafile is one of the easiest ways to run LLMs locally. In the time it takes to make coffee, you can have a capable language model running on your laptop, private and, once the file is downloaded, fully offline. Give it a try; you may be surprised how simple it is.