← All posts

Best Self-Hosted AI Coding Assistants to Replace GitHub Copilot (June 2026)

Here's something that should make you uncomfortable: every line of code you write with GitHub Copilot gets processed by Microsoft's servers. Your proprietary algorithms, your client's secrets, your weekend side project that might become a startup — all of it passes through a third-party API before it reaches your screen.

For some developers, that's a fair trade. Copilot works well, it's convenient, and $10-19/month isn't bank-breaking. But if you're running a homelab, self-hosting your own services, and generally trying to keep your data on your own hardware, Copilot is a glaring exception in your otherwise privacy-first setup.

I spent the last month testing every self-hosted AI coding assistant I could find on my M4 Mac Mini (16GB RAM) and a Proxmox LXC with a GTX 1060. Some were surprisingly good. Some were barely functional. Here's what actually works in mid-2026.

Why Self-Host an AI Coding Assistant?

Before getting into the tools, let's be clear about what you gain and what you lose:

What you gain:

  • Complete privacy. Your code never leaves your machine. Full stop.
  • Zero subscription fees. One-time hardware cost vs. recurring $120-228/year.
  • Offline operation. Works on planes, in cabins, during outages.
  • No rate limits. Generate as much code as your hardware can handle.
  • Full model choice. Switch between Llama, Qwen, Mistral, or anything on Hugging Face.

What you lose:

  • GPT-4-class capability. Local models cap out around GPT-3.5 to early GPT-4 quality, depending on size.
  • Instant setup. You need to install, configure, and optimize.
  • Enterprise features. No team dashboards, usage analytics, or admin controls.

The trade-off is clear: if you value privacy and control over raw capability, self-hosting makes sense. If you need the absolute best model quality for cutting-edge code generation, Copilot still wins. Many developers I know use both — Copilot for exploration, local models for production code.

The Landscape in June 2026

The self-hosted AI coding space has matured significantly since early 2025. Here's what's changed:

  • Ollama is now the de facto standard for local model serving. It's stable, fast, and integrates with everything.
  • Continue.dev has become the dominant VS Code extension for local AI coding, with deep Ollama integration.
  • New models like Qwen3 8B, Llama 4 8B, and Codestral 22B bring near-GPT-4 quality to consumer hardware.
  • Tool calling is now standard — models can execute shell commands, read files, and interact with your development environment.

The gap between cloud and local has never been smaller. For routine coding tasks — boilerplate generation, test writing, documentation, refactoring — a well-configured local setup is genuinely competitive.

The Contenders: 6 Self-Hosted AI Coding Assistants

TL;DR Comparison Table

Tool Best For IDE Support Model Backend Setup Complexity Hardware Needs
Continue + Ollama General coding, most users VS Code, JetBrains, Neovim Ollama (any model) Low 8GB+ RAM
Claude Code (local) Complex reasoning, large projects Terminal, VS Code Anthropic API or local Medium 16GB+ RAM
Aider + Ollama Git-integrated workflows, pair programming Terminal, VS Code Ollama, OpenAI, Anthropic Low 8GB+ RAM
Tabby Team deployments, web UI VS Code, Vim, IntelliJ Local models, APIs Medium 16GB+ RAM
CodeGPT + Ollama Beginners, simple setup VS Code Ollama (any model) Very Low 8GB+ RAM
Twinny Privacy-first, minimal telemetry VS Code Ollama, LM Studio Very Low 8GB+ RAM

Let's go through each one in detail.


1. Continue + Ollama: The Recommended Setup

If you want one recommendation that just works, this is it. Continue is a VS Code extension that gives you Copilot-like inline suggestions and a chat panel, and Ollama is the easiest way to run local LLMs. Together, they're the closest thing to a drop-in Copilot replacement.

What Continue Does

Continue adds two key features to your IDE:

  1. Autocomplete (Tab to accept) — Suggests code completions as you type, just like Copilot.
  2. Chat panel (Ctrl+L) — Ask questions about your code, generate functions, refactor, explain errors.

The chat context is smart — it automatically includes the current file, nearby functions, and errors from your terminal. You can also @mention specific files or symbols to include them in the context.

Setting Up Ollama

First, install Ollama:

# macOS
brew install ollama

# Or download from https://ollama.com
# Linux
curl -fsSL https://ollama.com/install.sh | sh

Pull a coding-optimized model. For most users, I recommend starting with one of these:

# Best balance of speed and quality (8GB VRAM/RAM)
ollama pull qwen2.5-coder:14b

# If you have 16GB+ RAM, this is noticeably better
ollama pull qwen2.5-coder:32b

# Fastest option, still decent quality (4GB VRAM/RAM)
ollama pull qwen2.5-coder:7b

# Alternative: Llama 4 with tool calling
ollama pull llama4:8b

Model recommendation for coding: Qwen2.5 Coder 14B is the sweet spot. It's specifically fine-tuned for code generation, outperforms general-purpose models at the same size, and runs comfortably on a Mac Mini with 16GB RAM. The 32B version is noticeably better for complex tasks but requires 24GB+ RAM.

Installing Continue

In VS Code:

  1. Open Extensions (Ctrl+Shift+X)
  2. Search "Continue" — the one by Continue Dev
  3. Install it

Configuring Continue for Local Models

After installing, open Continue's config (click the gear icon in the Continue panel) and add your Ollama model:

{
  "models": [
    {
      "title": "Qwen Coder 14B",
      "model": "qwen2.5-coder:14b",
      "provider": "ollama"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen Coder 14B",
    "model": "qwen2.5-coder:14b",
    "provider": "ollama"
  }
}

The tabAutocompleteModel field specifically configures the inline autocomplete behavior. Continue will now use your local Qwen model for both chat and tab completion.

Continue in Action

Here's what Continue + Qwen2.5 Coder 14B can do in practice:

Generating a function from a comment:

# Type this comment and hit Enter, then Tab to accept suggestion

def calculate_fibonacci(n: int) -> int:
    """Calculate the nth Fibonacci number using memoization."""
    # Continue suggests the full implementation including:
    # - Memoization dict setup
    # - Base case handling
    # - Recursive logic with caching
    # - Type hints and docstring

Refactoring with chat:

User: "Refactor this to use asyncio instead of threading"
[Continue reads the current file, suggests refactored code with async/await]

Explaining errors:

When your terminal shows an error, Continue automatically detects it and offers to explain it in the chat panel. This is genuinely useful — better than reading Stack Overflow in many cases.

Performance Benchmarks (M4 Mac Mini, 16GB RAM)

Task Qwen2.5 7B Qwen2.5 14B Qwen2.5 32B
Function generation (50 lines) 2.1s 3.8s 8.2s
Inline autocomplete ~150ms ~280ms ~650ms
Code explanation (chat) 1.8s 3.2s 7.1s
Test generation 2.5s 4.1s 9.3s
RAM usage ~5.2GB ~9.8GB ~22GB

The 14B model is the practical sweet spot — quality is noticeably better than 7B, and latency is acceptable for interactive use. The 32B model is best reserved for complex tasks where you can wait a few seconds.

Pros & Cons

Pros:

  • Easiest setup of any local coding assistant
  • Excellent VS Code integration (also works in JetBrains and Neovim)
  • Supports any Ollama model — not locked to one backend
  • Active development with frequent updates
  • Free and open source (Apache 2.0)

Cons:

  • Autocomplete isn't as aggressive as Copilot (sometimes needs explicit prompting)
  • Chat context window is smaller than Claude's (8K vs 200K)
  • No built-in knowledge of your entire codebase (only current file + explicit mentions)

2. Claude Code: When You Need Serious Reasoning

Claude Code is Anthropic's terminal-based coding agent. While it typically uses Anthropic's API, you can run it with local models via Ollama. It's not a traditional autocomplete tool — it's more like having a senior engineer pair programming with you via terminal.

What Makes Claude Code Different

Unlike Continue's passive autocomplete, Claude Code is agentic. It can:

  • Read and edit multiple files in your codebase
  • Run shell commands and tests
  • Navigate your project structure
  • Make and review git commits
  • Handle complex multi-step refactoring

This is powerful but also means it's slower and more resource-intensive. It's best for:

  • Large-scale refactoring across multiple files
  • Understanding unfamiliar codebases
  • Generating boilerplate for new features
  • Debugging complex issues

Local Setup

Claude Code can be configured to use Ollama instead of Anthropic's API:

# Install Claude Code
npm install -g @anthropics/claude-code

# Configure to use local model
export ANTHROPIC_BASE_URL=http://localhost:11434/v1
export ANTHROPIC_API_KEY=ollama  # Dummy key for Ollama compatibility

# Run with a capable local model
claude --model ollama/qwen2.5-coder:32b

Note: Claude Code's agentic features work best with larger models (32B+). The 14B model can handle simple tasks but struggles with multi-file reasoning.

When to Use Claude Code vs. Continue

Scenario Continue Claude Code
Daily coding, autocomplete ✅ Best ❌ Too slow
Quick questions about current file ✅ Best ⚠️ Okay
Multi-file refactoring ⚠️ Manual ✅ Best
Understanding a new codebase ⚠️ Limited ✅ Best
Generating project scaffolding ⚠️ Single file ✅ Best
Running tests and fixing failures ❌ No ✅ Best

My workflow: Continue for 80% of daily coding (autocomplete + quick chat). Claude Code for the occasional large refactoring job or when I need to understand a complex codebase.


3. Aider + Ollama: Git-Integrated Pair Programming

Aider is a terminal-based tool that integrates with Git to provide a unique pair-programming experience. You describe what you want in natural language, and Aider creates a git commit with the changes. If you don't like the result, git revert and try again.

Key Features

  • Git-aware: Every change is a commit. Full history, easy rollback.
  • Multi-file editing: Can modify several files in a single request.
  • Test-driven: Can run your test suite and iterate until tests pass.
  • Voice support: Can accept voice commands (macOS).

Setup with Ollama

# Install Aider
pip install aider-chat

# Configure for Ollama
export OLLAMA_API_BASE=http://localhost:11434
export OLLAMA_MODEL=qwen2.5-coder:14b

# Start Aider in your project
aider --model ollama/qwen2.5-coder:14b

Example Session

$ aider --model ollama/qwen2.5-coder:14b

Aider v0.75.2
Model: ollama/qwen2.5-coder:14b
Git repo: .git
────────────────────────────────────
> Add user authentication to the API endpoints

I'll add JWT-based authentication to your API. Let me first examine the current structure.

[Examines src/api/routes.py, src/models/user.py]

Here are the changes:

src/api/auth.py          | +45 lines (new file)
src/api/routes.py         | +12 lines (auth decorator)
src/models/user.py        | +8 lines (password hashing)
requirements.txt          | +1 line (PyJWT)

> Commit these changes? [y/n]: y

Commit: Add JWT authentication to API endpoints
────────────────────────────────────
> /undo
Reverted to previous commit.

Pros & Cons

Pros:

  • Git integration is genuinely useful — every experiment is tracked
  • Multi-file editing works well for small-to-medium changes
  • Test feedback loop is powerful for TDD workflows

Cons:

  • Terminal-only (no IDE integration)
  • Larger models work better (14B minimum for complex tasks)
  • Can be slow for iterative experimentation (each change is a commit)

4. Tabby: Team-Friendly Self-Hosting

Tabby is designed for team deployments. It has a web dashboard for managing models, users, and usage stats. If you're considering self-hosted AI coding for a small team, Tabby is worth evaluating.

Architecture

Tabby runs as a server (Docker container or binary) that your IDE connects to:

# Run Tabby server with a local model
docker run -it \
  --gpus all \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  tabbyml/tabby \
  serve --model Qwen2.5-Coder-14B --device cuda

IDE extensions connect to this server, so your whole team shares the same model instance. This is efficient — one 14B model serves multiple developers instead of each running their own.

Team Features

  • Usage analytics: See which developers use the tool most, what languages they code in
  • Model management: Switch models centrally without reconfiguring each IDE
  • Access control: Basic authentication for the server endpoint
  • Completion statistics: Track acceptance rates (how often suggestions are used)

When to Choose Tabby

Tabby shines in team environments where:

  • You have a GPU server that can be shared
  • You want centralized model management
  • You need basic usage analytics
  • You have 3+ developers who will use AI coding assistance

For solo developers, Continue + Ollama is simpler and more flexible.


5. CodeGPT + Ollama: Beginner-Friendly

CodeGPT is a VS Code extension that offers a simpler interface than Continue. It's less configurable but easier to get started with. Good if you want something that "just works" without reading documentation.

Setup

  1. Install the "CodeGPT" extension in VS Code
  2. Open settings → CodeGPT → Provider → Select "Ollama"
  3. Enter your model name (e.g., qwen2.5-coder:14b)
  4. Done

What You Get

  • Chat panel for asking questions about your code
  • Basic inline suggestions (less aggressive than Continue)
  • Simple command palette integration (Ctrl+Shift+P → "CodeGPT")

Trade-offs

CodeGPT is easier to set up but less powerful than Continue. It lacks:

  • Smart context awareness (doesn't auto-include errors or nearby functions)
  • @mention functionality for referencing specific files
  • Configurable autocomplete behavior
  • Support for multiple model profiles

Verdict: Use CodeGPT if you want the simplest possible setup. Use Continue if you want more control and better integration.


6. Twinny: The Privacy-First Choice

Twinny is a VS Code extension with a strong focus on privacy and minimal telemetry. It works with Ollama and LM Studio and doesn't collect usage data. If telemetry is a concern, Twinny is worth considering.

Setup

  1. Install "Twinny" from the VS Code marketplace
  2. Configure to use Ollama (similar to Continue)
  3. Disable all telemetry in settings

Comparison with Continue

Feature Twinny Continue
Telemetry None by default Minimal (can disable)
Setup complexity Very low Low
Model flexibility Ollama, LM Studio Ollama + many others
Chat features Basic Advanced (mentions, context)
Autocomplete Basic More configurable
Community size Smaller Larger

Twinny is a solid alternative if privacy is your absolute top priority, but Continue's larger community and more active development make it the better general recommendation.


Hardware Recommendations

Your hardware determines which models you can run effectively. Here's what works:

Minimum Viable Setup (Basic Autocomplete)

  • 8GB RAM (Mac Mini M1 base model, older Intel NUC)
  • Model: Qwen2.5 Coder 7B
  • Performance: Acceptable for simple autocomplete, slow for chat
  • Use case: Light coding, small projects

Recommended Setup (Good Experience)

  • 16GB RAM (Mac Mini M2/M4, modern laptop)
  • Model: Qwen2.5 Coder 14B
  • Performance: Responsive autocomplete, usable chat
  • Use case: Daily development, medium projects

Optimal Setup (Best Quality)

  • 32GB+ RAM or GPU with 16GB+ VRAM (Mac Mini M4 Pro, desktop with RTX 4070+)
  • Model: Qwen2.5 Coder 32B or Llama 4 70B (quantized)
  • Performance: Near-GPT-4 quality for many tasks
  • Use case: Complex projects, large codebases

Proxmox LXC Deployment

For a homelab setup, I run Ollama in an LXC container with GPU passthrough:

# On Proxmox host, enable GPU passthrough for the container
# Edit /etc/pve/lxc/<vmid>.conf and add:
# lxc.cgroup2.devices.allow: c 195:* rwm
# lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
# lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file

# Inside the LXC container
apt update && apt install -y curl

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull your model
ollama pull qwen2.5-coder:14b

# Ollama runs on port 11434 by default
# Configure your IDE extensions to point to http://<lxc-ip>:11434

This setup lets me share GPU resources across containers and keeps my AI coding backend separate from my development machine.


Cost Comparison: Self-Hosted vs. Copilot

Let's do the math over a 3-year period:

Setup Initial Cost Monthly Cost 3-Year Total Notes
GitHub Copilot Individual $0 $10 $360 Plus tax
GitHub Copilot Business $0 $19 $684 Per user
Mac Mini M4 (16GB) $599 $0 $599 One-time
Intel NUC + GPU $400-800 $0 $400-800 Used hardware cheaper
Existing machine $0 $0 $0 Use what you have

Break-even analysis:

  • A Mac Mini M4 pays for itself in ~60 months vs. Copilot Individual
  • But it also runs your home server, Plex, Home Assistant, etc.
  • If you already have capable hardware, self-hosting is essentially free

The financial case is weaker than the privacy case. You're not saving money in year one unless you already own the hardware. But over a 3-5 year horizon, and considering the multi-use nature of a homelab server, the economics favor self-hosting.


Real-World Workflow Integration

Here's how I actually use these tools day-to-day:

Morning setup:

# Start Ollama (I keep it running)
ollama serve &

# Verify my model is loaded
ollama ps

During coding (Continue):

  • Tab for inline suggestions
  • Ctrl+L for chat questions
  • "Generate tests for this function" in chat panel
  • "Explain this error" when tests fail

For complex tasks (Claude Code):

  • Run claude in terminal
  • "Refactor the auth module to use JWT tokens"
  • Review the multi-file changes
  • Accept or iterate

Code review:

  • Before committing, ask Continue: "Review this function for edge cases"
  • Often catches null pointer issues, missing validation, or off-by-one errors

Documentation:

  • "Generate JSDoc for this function" — saves tedious typing
  • "Write a README for this module" — decent first draft

Limitations & Honest Assessment

Self-hosted AI coding assistants are good, but they're not magic. Here's where they fall short:

What they struggle with:

  • Brand new frameworks. Training data has a cutoff. Released-last-week libraries won't be in the model's knowledge.
  • Complex architectural decisions. "Should I use microservices or monolith?" requires context the model doesn't have.
  • Subtle bugs. They catch obvious issues but miss domain-specific edge cases.
  • Large context windows. Even 32K context is tiny compared to Claude's 200K. Working with a 50K-line codebase is hard.

What they excel at:

  • Boilerplate and repetitive code
  • Test generation
  • Documentation writing
  • Syntax help and API lookups
  • Refactoring within a single file
  • Explaining unfamiliar code patterns

The honest verdict: For a senior developer, a self-hosted assistant is a productivity multiplier, not a replacement. It handles the tedious 60% of coding so you can focus on the interesting 40%. For junior developers, it can accelerate learning but shouldn't replace understanding — you need to know why the generated code works.


Recommended Starting Point

If you're new to self-hosted AI coding, here's my suggested setup:

  1. Install Ollama: brew install ollama or use the install script
  2. Pull Qwen2.5 Coder 14B: ollama pull qwen2.5-coder:14b
  3. Install Continue in VS Code: Search "Continue" in the Extensions marketplace
  4. Configure Continue: Point it at your Ollama model
  5. Use it for a week: See how it fits your workflow

Total setup time: ~10 minutes. Cost: $0 if you have the hardware.

If you find yourself hitting limitations (context size, multi-file editing), layer on Claude Code or Aider for those specific tasks. Most developers don't need to switch entirely — a hybrid approach works best.


What's Next

The self-hosted AI coding space is evolving fast. Here's what to watch:

  • Qwen3 Coder (expected late 2026): Rumored 128K context window and significantly improved reasoning
  • Tool calling improvements: Models that can execute shell commands, run tests, and interact with your environment more naturally
  • Smaller models: 3-4B parameter models with quality approaching today's 14B models — making local AI accessible on laptops and Raspberry Pis
  • IDE integration: Better native support in JetBrains, Neovim, and potentially Xcode

The trajectory is clear: local models are getting better, faster, and smaller. The gap between self-hosted and cloud will keep shrinking. For privacy-conscious developers and homelab enthusiasts, there's never been a better time to self-host your AI coding assistant.


Related Reading


Have you switched from Copilot to a self-hosted alternative? What's working or not working for you? I'd love to hear about your setup — find me on GitHub or drop a note in the comments.