There's something deeply satisfying about running a large language model on your own hardware. No API keys, no rate limits, no wondering where your data goes. Just you, your Mac Mini, and a machine that can write code, summarize documents, or brainstorm ideas — completely offline.
I spent the last few months turning my M4 Mac Mini into a local AI workhorse. It started as an experiment and quickly became my default setup for coding, writing, and quick research tasks. Here's everything I learned, condensed into a guide that'll save you the trial and error.
Why Run Local LLMs?
Before diving into the how, let's talk about the why. Local LLMs aren't for everyone, but they solve some real problems:
Privacy. Your prompts never leave your machine. No training on your data, no third-party access, no compliance headaches. This matters if you work with proprietary code, sensitive documents, or just value keeping your thoughts private.
Cost. Cloud API bills add up fast. At the time of writing, GPT-4o costs $5 per million input tokens and $15 per million output tokens. For heavy usage (daily coding assistance, document analysis, creative writing) you can easily burn through $50-100/month. At that rate, a Mac Mini can pay for itself within a year or two.
Speed. No network latency. On Apple Silicon with unified memory, inference can be surprisingly fast for models under 10B parameters. Sub-second responses for coding suggestions beat waiting on a round-trip to OpenAI's servers.
Reliability. No downtime, no rate limits, no "service temporarily unavailable." Your model works whether you're on a plane, in a cabin, or your ISP is having a bad day.
Customization. Want to fine-tune on your codebase? Run a model with a 128K context window? Experiment with quantization levels? Local gives you control cloud APIs don't.
The trade-offs? You won't run frontier-class models locally. The best locally-runnable models today are comparable to GPT-3.5 or early GPT-4, depending on size. But for many tasks, that's plenty.
What You Need: Hardware Reality Check
The Short Version
- M1 Mac Mini (8GB): Can run 3B-7B models. Tight but functional.
- M2 Mac Mini (16GB): Sweet spot. Comfortable with 7B-13B models. (There is no M3 Mac Mini; the line went from M2 straight to M4.)
- M4 Mac Mini (24GB+): Ideal. Runs 13B-14B models comfortably (around 30B only with more than 24GB), or multiple smaller ones simultaneously.
- Intel Mac Mini: Technically possible with CPU inference, but painfully slow. Not recommended.
Why RAM Matters More Than You Think
Apple Silicon uses unified memory — your RAM is your VRAM. A 7B parameter model at 4-bit quantization needs roughly:
7 billion parameters × 4 bits ÷ 8 bits/byte = ~3.5GB
But you also need memory for the context window, KV cache, and overhead. Real-world rule of thumb: you need 1.5-2x the model size in RAM. So that 3.5GB model actually wants roughly 5-7GB to run comfortably, on top of whatever macOS and your other apps are using.
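To make the arithmetic above reusable, here is a minimal sketch in Python. The function names are my own, and the 1.5-2x multiplier is just the rule of thumb from this section, not a measured figure.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Size of the quantized weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def ram_range_gb(params_billions: float, bits_per_weight: float) -> tuple[float, float]:
    """Apply the 1.5-2x rule of thumb to cover KV cache and overhead."""
    base = weights_gb(params_billions, bits_per_weight)
    return (1.5 * base, 2.0 * base)


if __name__ == "__main__":
    low, high = ram_range_gb(7, 4)
    print(f"weights: {weights_gb(7, 4):.1f} GB, plan for {low:.2f}-{high:.2f} GB")
```

For a 7B model at 4 bits the multiplier alone gives 5.25-7 GB. Leaving headroom for macOS itself pushes the practical number toward the upper end.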
For context, here's what fits where:
| Mac Mini | Usable RAM | Approx Model Size | Example Models |
|---|---|---|---|
| M1 8GB | ~6GB | Up to 4B | Phi-3 Mini, TinyLlama, Qwen2.5-3B |
| M2 16GB | ~12GB | Up to 9B | Llama 3.1 8B, Qwen2.5-7B, Gemma2 9B |
| M4 24GB | ~20GB | Up to 14B | Qwen2.5-14B, Phi-3 Medium 14B (a 70B model at Q4 needs roughly 40GB and will not fit) |
| M4 Pro 48GB+ | ~40GB | Up to ~30B | Qwen2.5-32B; 70B models only at lower quantization (Q2-Q3) |
The M4's higher memory bandwidth (120 GB/s on the base M4, 273 GB/s on the M4 Pro) makes a noticeable difference in token generation speed compared to M1/M2.
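Why bandwidth matters so much: a common back-of-the-envelope model (my assumption here, not something measured in this guide) is that generating one token requires streaming every weight from memory once. That puts a rough ceiling on tokens per second of bandwidth divided by model size. The function name below is hypothetical.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough ceiling: each generated token reads all the weights from memory once."""
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    # 273 GB/s bandwidth and the ~3.5 GB 4-bit 7B model from the section above
    print(f"ceiling: {max_tokens_per_second(273, 3.5):.0f} tok/s")
```

Real throughput lands well below this ceiling because of KV cache reads, compute, and scheduling overhead, but the ratio explains why the same model runs faster on a higher-bandwidth chip.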
The Tools: Four Ways to Run Local LLMs
1. Ollama — The Easy Button
Best for: Getting started quickly, CLI users, simple deployments
Ollama is the fastest path to a working local LLM. One command install, one command download, one command run. It handles model formats, quantization, and the server automatically.
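# Install (the install.sh script is intended for Linux; on macOS use Homebrew or the app from https://ollama.com/download)
brew install ollama
brew services start ollama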
# Pull a model
ollama pull qwen2.5:7b
# Run interactively
ollama run qwen2.5:7b
# Or start the API server manually (skip this if it is already running in the background)
ollama serve
Ollama's killer feature is its model library. Want Qwen? ollama pull qwen2.5:7b. Want Llama? ollama pull llama3.1:8b. It Just Works™.
The API is OpenAI-compatible (mostly — more on that later), so tools like OpenWebUI, Continue.dev, or custom scripts drop right in.
2. LM Studio — The GUI Experience
Best for: Non-technical users, model comparison, quick experiments
LM Studio is a polished desktop app for browsing, downloading, and chatting with models. No terminal required.
- Browse Hugging Face models with one-click download
- Chat interface with history, system prompts, and parameter tuning
- Built-in server mode with OpenAI-compatible API
- GPU acceleration via Metal on Apple Silicon
Download from lmstudio.ai, pick a model from the sidebar, and start chatting. It's the most approachable option for anyone uncomfortable with the command line.
3. llama.cpp — The Lightweight Option
Best for: Minimal resource usage, custom builds, embedded systems
llama.cpp is the C++ inference engine that powers much of the local LLM ecosystem. It's what Ollama uses under the hood.
# Clone and build (requires CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON  # LLAMA_METAL is the older flag name; Metal is on by default on macOS
cmake --build build --config Release
# Run a model that is already in GGUF format
./build/bin/llama-cli -m ~/models/qwen2.5-7b-q4_k_m.gguf -p "Explain quantum computing"
You trade convenience for control. llama.cpp exposes every knob: thread count, batch size, memory mapping, context size, sampling parameters. If you need to squeeze performance out of limited hardware, this is where you go.
4. MLX — Apple Silicon Optimized
Best for: Maximum performance on Apple Silicon, research, fine-tuning
Apple's MLX framework is purpose-built for their chips. It uses unified memory efficiently and can be significantly faster than generic implementations for certain workloads.
# Install
pip install mlx-lm
# Serve a model over a local HTTP API
python -m mlx_lm.server --model mlx-community/Qwen2.5-7B-Instruct-4bit
# Or in Python
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
response = generate(model, tokenizer, prompt="Hello, world!")
MLX shines for fine-tuning (LoRA support is excellent) and when you want the absolute best inference speed on Apple Silicon. The trade-off is a smaller model selection — not every Hugging Face model has an MLX conversion.
Step-by-Step: Ollama Setup on Mac Mini
Let's walk through a complete setup. I'll use Ollama because it's the easiest entry point, but the concepts apply everywhere.
Step 1: Install Ollama
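brew install ollama
brew services start ollama
This installs the ollama CLI and registers a background service that keeps the server running. (The install.sh script on ollama.com is intended for Linux; on macOS the alternative to Homebrew is the desktop app from https://ollama.com/download, which bundles the CLI and runs the server from the menu bar.)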
Step 2: Verify Installation
ollama --version
# ollama version 0.5.x
ollama list
# NAME ID SIZE MODIFIED
# (empty — no models yet)
Step 3: Download Your First Model
# Good all-rounder, ~4.5GB
ollama pull qwen2.5:7b
# For coding tasks, ~4GB
ollama pull codellama:7b
# Small and fast, ~2GB
ollama pull phi3:mini
Downloads are resumable and cached. Models live in ~/.ollama/models/.
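Models pile up quickly, so it helps to know how much disk they occupy. This is a small sketch that sums file sizes under the models directory mentioned above. The helper names are my own, and it assumes the default ~/.ollama/models location.

```python
from pathlib import Path
import shutil


def dir_size_bytes(root: Path) -> int:
    """Total size of all regular files under root (0 if root does not exist)."""
    if not root.exists():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def report(models_dir: Path = Path.home() / ".ollama" / "models") -> str:
    used_gb = dir_size_bytes(models_dir) / 1e9
    # Free space on the volume that holds the home directory
    free_gb = shutil.disk_usage(Path.home()).free / 1e9
    return f"models: {used_gb:.1f} GB used, {free_gb:.1f} GB free on disk"


if __name__ == "__main__":
    print(report())
```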
Step 4: Test It
ollama run qwen2.5:7b
>>> Write a Python function to flatten a nested list
def flatten(nested_list):
    result = []
    for item in nested_list:
        if isinstance(item, list):
            result.extend(flatten(item))
        else:
            result.append(item)
    return result
Exit with /bye or Ctrl+D.
Step 5: Start the API Server
ollama serve
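By default, it listens on localhost:11434. If the background service from Step 1 is already running, skip ollama serve (it will complain that the address is already in use). Test it: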
curl http://localhost:11434/api/generate -d '{
"model": "qwen2.5:7b",
"prompt": "Why is the sky blue?"
}'
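The same call from Python, using only the standard library. This is a sketch, not an official client: the helper names are mine, and I set "stream": false so the server returns one JSON object instead of a stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address from the text


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]


# Example, with the server running:
# print(generate("qwen2.5:7b", "Why is the sky blue?"))
```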
Model Recommendations by Use Case
General Chat & Reasoning
| Model | Size | Quantization | Speed* | Quality |
|---|---|---|---|---|
| Qwen2.5 7B | 7B | Q4_K_M | ~35 tok/s | ⭐⭐⭐⭐ |
| Llama 3.1 8B | 8B | Q4_K_M | ~32 tok/s | ⭐⭐⭐⭐ |
| Gemma 2 9B | 9B | Q4_K_M | ~28 tok/s | ⭐⭐⭐⭐ |
| Qwen2.5 14B | 14B | Q4_K_M | ~22 tok/s | ⭐⭐⭐⭐⭐ |
*Speed measured on M4 Mac Mini 24GB. Your mileage varies by quantization and context length.
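If you want to reproduce numbers like these on your own machine, Ollama's native API responses include an eval_count (tokens generated) and an eval_duration (in nanoseconds). A minimal sketch for turning those into tokens per second, with a function name of my own choosing:

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from the eval_count and eval_duration (nanoseconds) fields."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


if __name__ == "__main__":
    # Hypothetical values, shaped like the final object of an /api/generate response
    sample = {"eval_count": 350, "eval_duration": 10_000_000_000}
    print(f"{tokens_per_second(sample):.1f} tok/s")
```

From the terminal, ollama run with the --verbose flag prints similar timing statistics after each reply.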
Qwen2.5 is my daily driver. It handles instructions well, has a strong context window (128K), and the 7B version is fast enough for real-time use. Llama 3.1 is similarly capable with better English prose. Gemma 2 is Google's offering — slightly different "personality" but objectively strong.
Coding Assistance
| Model | Size | Best For |
|---|---|---|
| CodeQwen 7B | 7B | General coding, fast completions |
| CodeLlama 7B | 7B | Python, C++, established codebase |
| DeepSeek-Coder 6.7B | 6.7B | Complex algorithms, math |
| Qwen2.5-Coder 7B | 7B | Best all-rounder for coding |
For VS Code integration via Continue.dev, I run Qwen2.5-Coder 7B. It generates sensible completions, explains code well, and debugs without hallucinating too aggressively.
Small & Fast (Edge Cases)
| Model | Size | Use Case |
|---|---|---|
| Phi-3 Mini | 3.8B | Tight memory, simple tasks |
| TinyLlama 1.1B | 1.1B | Extremely fast, basic Q&A |
| Qwen2.5 3B | 3B | Good balance of size and capability |
Phi-3 Mini punches above its weight. On an 8GB Mac Mini, it's your best option for anything beyond toy problems.
Performance Tuning
Memory Usage
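Ollama's default tags use a fixed quantization (typically 4-bit) rather than choosing one based on your hardware, but you can pull a specific level:
# Pull a specific quantization level (tag names vary per model; check the model's Tags page on ollama.com)
ollama pull qwen2.5:7b-instruct-q4_K_M
ollama pull qwen2.5:7b-instruct-q8_0 # Higher quality, more RAM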
Quantization levels (from smallest to largest):
- Q2_K — Aggressive compression. Smaller, noticeably dumber.
- Q4_K_M — Balanced. The default sweet spot.
- Q5_K_M — Better quality. Worth it for 13B+ models.
- Q8_0 — Near-uncompressed. Best quality, ~2x the size of Q4.
Context Window
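Ollama's default context is small (historically 2048 tokens; newer releases raise it). For long documents, raise num_ctx:
# Run with extended context (set inside the interactive session)
ollama run qwen2.5:7b
>>> /set parameter num_ctx 32768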
More context = more KV cache = more RAM. A 7B model at 32K context uses significantly more memory than at 4K. Monitor with Activity Monitor.
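To put a number on "significantly more", here is a sketch of the standard KV cache estimate: keys and values for every layer, KV head, and token position. The architecture numbers for Qwen2.5-7B (28 layers, 4 KV heads, head dimension 128) are my assumption from the published model config, so verify them against the model's config.json. The estimate also assumes a 16-bit cache.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Keys plus values (factor of 2) for every layer, KV head, and token position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


if __name__ == "__main__":
    # Assumed Qwen2.5-7B shape: 28 layers, 4 KV heads, head dimension 128
    for ctx in (4096, 32768):
        gib = kv_cache_bytes(28, 4, 128, ctx) / 2**30
        print(f"{ctx:>6} tokens: {gib:.2f} GiB")
```

Under those assumptions the cache grows from about 0.22 GiB at 4K to 1.75 GiB at 32K. Models without grouped-query attention have many more KV heads and grow much faster.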
Metal GPU Acceleration
Ollama and llama.cpp automatically use Metal on Apple Silicon. Verify it's working:
ollama run qwen2.5:7b
# In another terminal:
ollama ps
# NAME ID SIZE PROCESSOR UNTIL
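# qwen2.5:7b xxx 4.5 GB 100% GPU 4 minutes from now
"100% GPU" means Metal is engaged. If the model doesn't fit entirely in GPU-addressable memory, Ollama offloads part of it to the CPU and the PROCESSOR column shows a CPU/GPU split instead, which is noticeably slower.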
Integration With Other Tools
OpenWebUI (Previously Ollama WebUI)
A polished chat interface that connects to your Ollama instance:
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
Access at http://localhost:3000. Features: multi-model chats, document RAG, image generation integration, user management.
Continue.dev (VS Code/Cursor)
The best coding assistant integration. Install the VS Code extension, point it at Ollama:
{
  "models": [{
    "title": "Qwen2.5-Coder",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }]
}
Get inline completions, chat, and codebase-aware answers powered by your local model.
API Integration
Ollama's API is mostly OpenAI-compatible:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5:7b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Most tools that support OpenAI's API work with minor configuration changes.
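Here is roughly what that looks like from Python without any third-party client, as a sketch under the assumption that the server is at the default address. The helper names are mine; the request and response shapes follow OpenAI's chat completions format.

```python
import json
import urllib.request


def build_chat_request(model: str, user_message: str,
                       base_url: str = "http://localhost:11434/v1") -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_message}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reply_text(completion: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion."""
    return completion["choices"][0]["message"]["content"]


def chat(model: str, user_message: str) -> str:
    """Send the request and return the reply (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(model, user_message)) as resp:
        return reply_text(json.loads(resp.read()))


# Example, with the server running:
# print(chat("qwen2.5:7b", "Hello!"))
```

Tools built on an OpenAI SDK usually need only the base URL changed to http://localhost:11434/v1 and a placeholder API key.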
The /v1 API Gotcha
Here's something that trips up almost everyone: Ollama's OpenAI-compatible endpoints have quirks.
The /v1/chat/completions endpoint works for basic chat, but:
- Tool calling is supported by some models but not all. Check Ollama's docs for compatibility.
- System prompts work differently than OpenAI. Some models ignore them unless formatted correctly.
- Streaming is supported, but the SSE format has minor differences that break some clients.
- JSON mode (response_format: { type: "json_object" }) requires specific model support.
If you're migrating from OpenAI, test thoroughly. Tools like LangChain and LiteLLM have Ollama-specific adapters that handle these quirks.
For the most reliable integration, use Ollama's native API (/api/generate and /api/chat) when possible, and only fall back to /v1 when a tool requires it.
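The streaming difference is the easiest one to see in code. As I understand the two formats, the native API streams one JSON object per line, while the /v1 endpoint streams server-sent events prefixed with "data:" and terminated by "[DONE]". The sketch below parses both from an iterable of text lines. The function names are hypothetical, and it deliberately ignores everything except the text content.

```python
import json
from typing import Iterable, Iterator


def native_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text from /api/chat streaming output: one JSON object per line."""
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            break


def openai_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text from /v1/chat/completions streaming output: SSE 'data:' lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        if choices:
            text = choices[0].get("delta", {}).get("content")
            if text:
                yield text
```

A client written for one format will choke on the other, which is why picking a single endpoint family per tool saves debugging time.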
Troubleshooting Common Issues
"Error: model requires more system memory (X GB) than is available (Y GB)"
Your model + context window exceeds available RAM. Solutions:
- Use a smaller model
- Reduce context size: OLLAMA_CONTEXT_LENGTH=4096 ollama serve (the variable is read by the server process, not by ollama run)
- Use a more aggressive quantization
- Close other applications
Slow generation (sub-10 tokens/second)
- Check if Metal is active: ollama ps should show "GPU"
- Try a smaller quantization (Q4 instead of Q8)
- Reduce context length
- For CPU-only fallback: more threads help, but not dramatically
"Connection refused" when accessing API
- Ensure ollama serve is running
- Check OLLAMA_HOST environment variable
- Default bind is localhost only; set OLLAMA_HOST=0.0.0.0 for LAN access (security implications!)
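A quick way to tell "server not running" apart from "client pointed at the wrong address" is to probe the base URL directly. This is a minimal sketch with a helper name of my own. It relies on the server answering a plain GET on its root URL, which Ollama does with a short status message.

```python
import urllib.error
import urllib.request


def server_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if something answers HTTP 200 at base_url, False otherwise."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status
        return False


# Example:
# print(server_is_up())                            # local server
# print(server_is_up("http://192.168.1.50:11434")) # hypothetical LAN address
```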
Models download but won't run
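- Check the model format: Ollama runs GGUF weights, so models published only in CUDA-oriented formats (for example GPTQ or AWQ) need a GGUF conversion first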
- Verify you have enough disk space (~/.ollama/models/)
- Try ollama rm modelname && ollama pull modelname to re-download
High memory pressure / swapping
macOS will compress memory and eventually swap to SSD. This tanks performance:
- Monitor with Activity Monitor's Memory tab
- Reduce model size or context if "Memory Used" nears physical RAM
- Consider quitting Safari tabs (they're memory hogs)
Local vs Cloud: When to Use Which
Use local when:
- Privacy is paramount (proprietary data, personal projects)
- Cost matters (high-volume usage)
- You need offline access
- Latency matters (real-time coding assistance)
- You're experimenting with models/parameters
Use cloud when:
- You need frontier capability (GPT-4, Claude 3.5 Sonnet, Gemini Pro)
- One-shot tasks where setup time isn't worth it
- You need features unavailable locally (multimodal with high-res images, web search)
- Your hardware can't run adequate models
My workflow: Local for 80% of daily tasks (coding, writing, quick research), cloud for the 20% that needs the absolute best reasoning or multimodal capabilities.
Conclusion
Running local LLMs on a Mac Mini isn't just a tech demo — it's a genuinely productive setup. An M2 or M4 Mac Mini with 16-24GB RAM can handle 7B-13B models at speeds that feel responsive for coding and writing tasks.
Start with Ollama. Pull Qwen2.5 7B. Install Continue.dev in VS Code. Within an hour, you'll have AI assistance that never phones home, never bills you, and works on airplanes.
The models will keep improving. The hardware will get faster. But the core advantage — owning your compute — only becomes more valuable as AI becomes more central to how we work.
Happy local inferencing.
Have questions or found a model that works particularly well on your setup? The local LLM community moves fast — what's true today might be outdated next month. Experiment, share your findings, and iterate.