The hardest part of running local LLMs isn't the installation — it's knowing which model to run on the hardware you actually own. Pull a 70B model on a 16GB machine and the system crawls. Run a 1B model on a GPU server and you're leaving performance on the table.
This guide matches homelab hardware to the right local LLM, with real benchmarks, specific model + quantization recommendations, and zero guesswork. Whether you're on a fanless Intel N100 or a Proxmox box with an RTX 3060, here's exactly what to run.
How to Read This Guide
Two numbers determine whether a local LLM runs well on your machine: RAM capacity and memory bandwidth. VRAM on a GPU is even better than system RAM because it's attached to thousands of CUDA cores. But if you don't have a GPU, CPU inference relies entirely on your system's DDR4/DDR5 bandwidth.
The tables below use these abbreviations:
- Q4_K_M = 4-bit quantized, medium quality. ~75% of full model quality at ~25% of the size. The sweet spot for most homelab use.
- Q5_K_M = 5-bit quantized, higher quality. Slightly larger, slightly better reasoning. Good if you have the VRAM/RAM headroom.
- Q8_0 = 8-bit quantized. Very close to unquantized quality. Best for small models on capable hardware.
- t/s = tokens per second. Higher is better. >10 t/s feels responsive. >30 t/s feels instant.
Tier 1: Entry-Level (8–16GB RAM, No GPU)
Best models: 1B–3B parameter models at Q4_K_M or Q8_0.
This is the Intel N100, an old laptop, or a lightweight Proxmox LXC with 8GB allocated. You won't run Llama 70B here, but you can still get surprisingly useful AI for home automation, quick summarization, and basic coding help.
| Model | Quant | Size | Speed (CPU) | Best For |
|---|---|---|---|---|
qwen2.5:1.5b |
Q4_K_M | ~1.1GB | 15–25 t/s | Home Assistant voice, simple Q&A |
llama3.2:1b |
Q4_K_M | ~1.3GB | 20–35 t/s | Fastest responses, basic tasks |
gemma3:1b |
Q4_K_M | ~0.8GB | 18–30 t/s | Google ecosystem fans, lightweight chat |
qwen2.5:3b |
Q4_K_M | ~1.9GB | 10–18 t/s | Better reasoning than 1B, still snappy |
qwen2.5-coder:1.5b |
Q4_K_M | ~1.1GB | 15–25 t/s | Tiny code completions, shell scripts |
Real benchmark from my setup: Qwen2.5 1.5B Q4_K_M on an Intel i9-13900H (LXC, 2 cores, 4GB RAM) runs at 21 t/s generation and 55 t/s prompt processing[^1]. On an N100, expect roughly half that — still perfectly usable for Home Assistant or quick CLI queries.
Recommended hardware in this tier:
- Intel N100 Mini PC — Fanless, ~$150, sips power, perfect for a dedicated Ollama box.
- Beelink EQ13 (N200) — Slightly faster than N100, still silent and cheap.
Tier 2: Mid-Range (16–32GB RAM, No GPU or iGPU)
Best models: 7B–9B parameter models at Q4_K_M or Q5_K_M.
This is the sweet spot for most homelabbers. A 16GB Mac mini, a 32GB Proxmox host, or a Ryzen mini PC with integrated graphics. You can run capable general-purpose and coding models that rival early ChatGPT quality for many tasks.
| Model | Quant | Size | Speed (CPU) | Best For |
|---|---|---|---|---|
llama3.1:8b |
Q4_K_M | ~4.9GB | 8–15 t/s | General chat, long context (128k) |
qwen2.5:7b |
Q4_K_M | ~4.4GB | 8–14 t/s | Strong reasoning, Chinese/English bilingual |
gemma3:4b |
Q4_K_M | ~3.2GB | 12–20 t/s | Fast, good quality, Google-trained |
qwen2.5-coder:7b |
Q4_K_M | ~4.7GB | 7–12 t/s | Best free coding model in this tier |
mistral:7b |
Q4_K_M | ~4.1GB | 8–14 t/s | Apache 2.0 license, clean for commercial use |
llama3.1:8b |
Q5_K_M | ~5.8GB | 6–10 t/s | Better reasoning if you have 32GB RAM |
Practical note: If you have exactly 16GB system RAM and want to run an 8B model, close browsers and other heavy apps. The model needs ~5GB just to load, and your OS needs 2–3GB. 16GB is the floor for this tier. 32GB is comfortable.
iGPU bonus: AMD 780M integrated graphics (Ryzen 7 7840HS) can run 7B models via ROCm at 15–30 t/s — nearly GPU speeds without buying a discrete card.
Recommended hardware:
- Beelink SER7 (Ryzen 7 7840HS) — iGPU via ROCm, 32GB DDR5, runs 7B models beautifully.
- 32GB DDR5 SO-DIMM Kit — Must-have upgrade to escape the 16GB bottleneck.
Tier 3: Enthusiast (32–64GB RAM, or 12GB+ VRAM GPU)
Best models: 13B–14B models at Q4_K_M, or 8B models at Q8_0 / FP16.
Now you're in the range where local LLMs genuinely compete with cloud APIs for quality. A used RTX 3060 12GB, an RTX 4060 Ti 16GB, or a 64GB Proxmox server opens up much larger models with fast inference.
| Model | Quant | Size | Speed (GPU) | Best For |
|---|---|---|---|---|
qwen2.5:14b |
Q4_K_M | ~8.9GB | 20–35 t/s | Excellent reasoning, coding, analysis |
llama3.1:70b |
Q4_K_M | ~40GB | 3–8 t/s | Best quality, but needs 48GB+ VRAM for fast inference[^2] |
deepseek-r1:14b |
Q4_K_M | ~9.0GB | 15–28 t/s | Open-source reasoning model, math/logic champion |
qwen2.5-coder:14b |
Q4_K_M | ~8.5GB | 18–32 t/s | Serious local coding assistant |
llama3.1:8b |
Q8_0 | ~8.3GB | 15–25 t/s | Near-unquantized quality on mid GPUs |
gemma3:12b |
Q4_K_M | ~8.1GB | 18–30 t/s | Strong multi-turn chat, Google model |
GPU vs CPU at this tier: A 14B model on CPU (fast DDR5) might give you 4–8 t/s. The same model on an RTX 3060 12GB hits 20–30 t/s. The GPU upgrade is worth it if you interact with the LLM daily.
VRAM rule of thumb: You need roughly 1GB of VRAM for every 1.5–2B parameters at Q4_K_M. So:
- RTX 3060 12GB → up to 13B–14B models fully offloaded
- RTX 4060 Ti 16GB → up to 14B comfortably, or 70B with CPU offload (slower)
- RTX 3090 / 4090 24GB → 70B Q4_K_M fully GPU-resident, 20–40 t/s
Recommended hardware:
- NVIDIA RTX 3060 12GB — The value king for home LLMs. Used market is ~$200.
- NVIDIA RTX 4060 Ti 16GB — Future-proofing for 30B+ models.
- Samsung 990 Pro 2TB NVMe — Models load from disk first; fast NVMe reduces cold-start time.
Tier 4: Power User (64GB+ RAM, 24GB+ VRAM, or Multi-GPU)
Best models: 30B–70B parameters at Q4_K_M, or vision models with image understanding.
This is dual RTX 3090s, a Mac Studio with 64GB unified memory, or a threaded-ripper server with 128GB RAM. You're running models that rival GPT-4o-mini and sometimes GPT-4o on specific tasks.
| Model | Quant | Size | Speed | Best For |
|---|---|---|---|---|
llama3.3:70b |
Q4_K_M | ~40GB | 8–20 t/s | Near-frontier quality, general purpose |
qwen2.5:32b |
Q4_K_M | ~20GB | 12–25 t/s | Coding and reasoning powerhouse |
mixtral:8x7b |
Q4_K_M | ~26GB | 10–20 t/s | Sparse MoE model, excellent reasoning |
gemma3:27b |
Q4_K_M | ~17GB | 12–22 t/s | Vision + text, Google multimodal flagship |
llava:34b |
Q4_K_M | ~20GB | 5–12 t/s | Describe images locally with high accuracy |
MoE models note: Models like Mixtral 8x7b and Qwen2.5-MoE use "sparse" activation — only a subset of parameters fire per token. This means a 47B-parameter MoE model might run as fast as a 13B dense model during inference. They're some of the most efficient ways to get high quality on limited VRAM.
Use-Case Quick Picks
Don't want to think about tiers? Here's the right model for common homelab jobs:
| What You Want | Best Model | Hardware Needed |
|---|---|---|
| Home Assistant voice assistant | llama3.2:1b or qwen2.5:1.5b |
8GB RAM |
| Daily chat / general knowledge | llama3.1:8b Q4_K_M |
16GB RAM |
| Coding helper | qwen2.5-coder:7b or 14b |
16GB+ RAM / 12GB+ VRAM |
| Document summarization (long books) | llama3.1:8b or qwen2.5:14b |
16GB+ RAM — 128k context |
| Write blog posts / creative writing | llama3.3:70b or qwen2.5:32b |
24GB+ VRAM / 64GB RAM |
| Math / logic / reasoning | deepseek-r1:14b or 32b |
16GB+ VRAM |
| Describe security camera events | llava:7b or gemma3:4b |
8GB+ RAM / 8GB+ VRAM |
| Build a RAG pipeline (chat with docs) | nomic-embed-text + llama3.1:8b |
16GB RAM |
Understanding Quantization
Quantization is the single most important concept for matching models to hardware.
LLMs are trained with 16-bit or 32-bit floating-point weights (FP16 / FP32). A 7B model at FP16 needs ~14GB just to load. Quantization squishes those weights into 4-bit or 5-bit integers, slashing memory use at a modest quality cost.
| Precision | Bits per Weight | 7B Model Size | Quality vs FP16 |
|---|---|---|---|
| FP16 | 16 | ~14GB | 100% (baseline) |
| Q8_0 | 8 | ~7.5GB | ~98% |
| Q5_K_M | 5 | ~5.2GB | ~88% |
| Q4_K_M | 4 | ~4.1GB | ~78% |
| Q3_K_M | 3 | ~3.2GB | ~65% |
| Q2_K | 2 | ~2.4GB | ~45% |
My recommendation: Stick to Q4_K_M for almost everything. It's the best balance of size, speed, and quality. Only go to Q5_K_M or Q8_0 if you have the VRAM headroom and notice the model struggling with complex reasoning. Avoid Q3 and below unless you're experimenting on very constrained hardware.
Proxmox & LXC Tips
If you're running Ollama inside Proxmox like I am, here are three practical tips:
-
Allocate 2GB more RAM than the model size. A 7B Q4_K_M model loads at ~4.5GB, but inference needs working memory. Give the LXC at least 6–8GB.
-
Bind-mount model storage to your NAS or a large local disk. Models live in
/root/.ollama/models/. Symlink that to a larger volume so your root disk doesn't fill up:mkdir -p /mnt/nas/ollama-models ln -s /mnt/nas/ollama-models /root/.ollama/models -
Snapshot before experimenting. Proxmox makes this trivial. I always snapshot the Ollama LXC before pulling a 20GB+ model. If it doesn't run well, restore in 10 seconds instead of re-downloading.
For the full Ollama-on-Proxmox setup, see the step-by-step guide.
Final Thoughts
You don't need a $3,000 GPU to run useful local LLMs. A $150 Intel N100 running llama3.2:1b is enough for a private Home Assistant voice assistant. A used RTX 3060 opens up 14B models that code, reason, and write at a level that would have required API credits a year ago.
The key is matching the model size and quantization to the RAM or VRAM you actually have. Start with Q4_K_M, measure tokens per second, and upgrade your hardware only after you've proven the use case.
Related Posts
- Run Ollama on Proxmox LXC (Full Setup Guide) — Installation, GPU passthrough, and Home Assistant integration
- Proxmox Home Assistant LXC Setup — Build the foundation for local smart-home AI
[^1]: Benchmark from LXC 1004 on Intel i9-13900H, 2 cores allocated, 4GB RAM, CPU backend via llama.cpp. Your results will vary with CPU generation, RAM speed, and thread allocation. [^2]: 70B models at Q4_K_M need ~40GB VRAM for full GPU offload at good speeds. On 24GB VRAM, partial CPU offload works but drops to 2–5 t/s.
Questions or benchmark data from your own setup? The source for this post is on GitHub. PRs welcome.