Qwen 3.6 Local: Run Flagship-Level AI on Your Own Hardware
77.2% on SWE-bench Verified. A 27B open-weight model. That number sat with me for a while when I first saw it, because Claude Opus 4.6 — the current best-in-class closed model for coding — scores 80.8%. We're talking about a 3.6-point gap between a model you can pull to your local machine in fifteen minutes and the best thing Anthropic ships to the cloud.
That's not a curiosity. That's a decision point.
The Qwen 3.6 family dropped as fully open-weight under Apache 2.0, and the two models that matter for local self-hosting are the 27B dense (qwen3.6:27b) and the 35B-A3B Mixture-of-Experts (qwen3.6:35b-a3b). This post is about running them on your own hardware — no API key, no cloud dependency, no per-token bill — and wiring them into a real development architecture. I'm running an AMD Radeon RX 7900 XTX (24GB VRAM), so hardware specifics will be relevant to that class of card, but the patterns apply to any 24GB NVIDIA card as well.
Two Models, One Decision
Before anything else, understand what you're choosing between.
Qwen3.6-27B (Dense)
- All 27B parameters are active on every token
- Native vision-language built into the single checkpoint — no companion model, no second process
- 77.2% SWE-bench Verified / 53.5% SWE-bench Pro / 48.2% SkillsBench
- 262K native context window, extendable to ~1M with YaRN
- Apache 2.0, one BF16 checkpoint handles both thinking and non-thinking modes
Qwen3.6-35B-A3B (Mixture-of-Experts)
- 35B total parameters, but only ~3.1B active per token via sparse expert routing
- 256 experts total; sparse routing activates 8 routed experts plus 1 shared expert per token
- Hybrid attention architecture: Gated DeltaNet linear attention combined with standard Gated Attention
- Same context window, same multimodal capability, same Apache 2.0
The MoE model is the fast one. The dense model is the quality one. Here's how I think about the choice:
| Scenario | Model |
|---|---|
| Best raw coding quality, 24GB VRAM | 27B Q4_K_M |
| Fastest generation / high throughput | 35B-A3B Q4_K_M |
| Long-context RAG (128K+ tokens) | 35B-A3B |
| Full precision, no compromise (48GB+) | 27B Q8_0 or BF16 |
My daily driver is the 35B-A3B for speed, and I switch to 27B for architectural problems where I want the model to sit with complexity longer.
The Coding Benchmarks — Why They Matter Here
The numbers I care about are SWE-bench, not HumanEval. HumanEval is autocomplete trivia — isolated functions, no repo context, no test harness. SWE-bench drops the model into a real repository, hands it a bug report or feature request, and scores whether it produces a patch that passes existing tests. That's the actual job.
- SWE-bench Verified: 27B scores 77.2%. Previous-gen Qwen's 397B MoE scored lower. Qwen3.6-27B beats a model with roughly 14.7x the parameter count (397B / 27B).
- SWE-bench Pro: 53.5% vs the 397B's 50.9%. About 14.7x fewer parameters, better result.
- SkillsBench: 48.2% vs the 397B's 30.0%, a roughly 61% relative improvement ((48.2 - 30.0) / 30.0).
- Terminal-Bench 2.0: 59.3% — matches Claude 4.5 Opus on real autonomous terminal sessions.
One required caveat: these numbers come from Qwen's own agent scaffold. Independent third-party reproductions are still limited as of May 2026. Treat them as directional, not gospel.
In practice, what I care about is how the model handles multi-file C# navigation, PR review reasoning, and repository-level refactoring — the kind of task where context coherence matters more than raw token prediction. It handles these well. Honestly, the multi-file C# performance surprised me; I expected it to start losing the thread around the third or fourth file, and it didn't.
If you're running Aider locally, the configuration is three lines:
# ~/.aider.conf.yml
model: ollama/qwen3.6:27b
openai-api-base: http://localhost:11434/v1
openai-api-key: ollama
Point Aider at the local Ollama server and you get a repository-level coding agent running entirely on your hardware.
Multimodal and Thinking Modes
Vision-Language
The vision capability is embedded in the same checkpoint — there's no separate ViT model to run alongside the LLM. The architecture uses a Qwen LLM backbone with a native ViT encoder and interleaved multi-image/multi-frame MRoPE. In practice, this means passing an image is just adding a base64 blob to the standard /v1/chat/completions request body. Same endpoint, same request shape, no second process.
Useful immediately for:
- Screenshot of a UI component → generate test cases
- PDF table → structured extraction
- Architecture diagram → written analysis or code scaffold
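To make the "base64 blob in the request body" point concrete, here is a minimal sketch of building such a payload. It assumes the server accepts the OpenAI-style content-parts shape (a text part plus an image_url part carrying a data URI); the class and method names are hypothetical, and the mime type default is an assumption you should match to your actual file.

```csharp
using System;
using System.Text.Json;

public static class VisionRequest
{
    // Builds an OpenAI-style chat payload with one text part and one image part.
    // Hypothetical helper: the name and parameters are illustrative only.
    public static string Build(string model, string prompt, byte[] imageBytes, string mimeType = "image/png")
    {
        // The image travels inline as a data URI, so no upload step or second endpoint is involved.
        var dataUri = $"data:{mimeType};base64,{Convert.ToBase64String(imageBytes)}";

        var payload = new
        {
            model,
            stream = false,
            messages = new object[]
            {
                new
                {
                    role = "user",
                    content = new object[]
                    {
                        new { type = "text", text = prompt },
                        new { type = "image_url", image_url = new { url = dataUri } }
                    }
                }
            }
        };

        return JsonSerializer.Serialize(payload);
    }
}
```

The resulting string can be posted to /v1/chat/completions as application/json, for example with `new StringContent(json, Encoding.UTF8, "application/json")`. Large screenshots inflate the request considerably, so downscaling before encoding is usually worth it.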
Thinking Mode
Both the 27B and 35B-A3B handle thinking via the same checkpoint. Thinking traces are wrapped in <think>...</think> tags and returned in the API response. You can toggle thinking on or off per request — no model swap required.
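If the trace arrives inline in the message content, as described above, you will usually want to separate it from the final answer before showing anything to a user. The following is a rough sketch under that assumption; the ThinkingParser name is hypothetical, and some server versions may return reasoning in a separate response field instead, in which case this step is unnecessary.

```csharp
using System.Text.RegularExpressions;

public static class ThinkingParser
{
    // Singleline lets '.' match newlines, since traces typically span many lines.
    private static readonly Regex ThinkBlock =
        new Regex(@"<think>(.*?)</think>", RegexOptions.Singleline | RegexOptions.Compiled);

    // Returns the first reasoning trace (if any) and the content with all think blocks removed.
    public static (string Reasoning, string Answer) Split(string content)
    {
        var match = ThinkBlock.Match(content);
        if (!match.Success)
        {
            return (string.Empty, content.Trim());
        }

        var reasoning = match.Groups[1].Value.Trim();
        var answer = ThinkBlock.Replace(content, string.Empty).Trim();
        return (reasoning, answer);
    }
}
```

Calling `ThinkingParser.Split(...)` on the Content string from a chat response gives you the trace for logging and the answer for display. A response with thinking disabled simply comes back with an empty Reasoning value.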
The feature that changes agentic workflows is Thinking Preservation. With preserve_thinking: true, the model retains its chain-of-thought from all prior turns, not just the most recent one. In a long agentic loop — say, a multi-step repository analysis or an iterative debugging session — this prevents the model from contradicting itself between turns. I was skeptical about how much this actually matters in practice, but decision consistency on long-running sessions is noticeably better when it's enabled.
The trade-off is token count. Preserved thinking inflates context window usage fast. On a 128K+ session, watch your KV cache VRAM budget.
To ship a fast non-thinking variant for daily use without the overhead:
# Modelfile
FROM qwen3.6:27b
PARAMETER num_ctx 32768
SYSTEM "You are a senior .NET engineer assistant. /no_think"

# Build and run the variant
ollama create qwen3.6-27b-fast -f Modelfile
ollama run qwen3.6-27b-fast
Thinking mode is a dial. Use it on the tasks that benefit from it, not as a default.
Self-Hosting with Ollama
Pulling the Models
ollama pull qwen3.6:27b # ~17GB Q4_K_M
ollama pull qwen3.6:35b-a3b # ~24GB Q4
ollama pull qwen3.6:27b-q8_0 # ~29GB
ollama pull qwen3.6:27b-bf16 # ~56GB
Download once, then inference is instant. No round-trip latency, no API throttling, no quota.
VRAM Reality Check
Important: 3.1B active parameters does not mean 3.1B memory footprint. The full 35B weight tensor lives in VRAM — only the compute routes through 3.1B at inference time. On a 24GB card, both models fit at Q4 with headroom. Push context beyond 64K and the KV cache competes directly with the model weights for the same VRAM budget.
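To see why long contexts eat into the same budget, it helps to run the standard KV cache arithmetic: two tensors (keys and values) per attention layer, each sized KV heads x head dimension x bytes per element, per token. The sketch below is only that formula. The layer count, head count, and head dimension in the usage note are made-up illustrative values, not Qwen3.6's actual configuration, and in a hybrid model only the full-attention layers would count toward attentionLayers.

```csharp
public static class KvCacheEstimator
{
    // Bytes of KV cache consumed by one token:
    // 2 (keys + values) x layers x KV heads x head dimension x bytes per element.
    public static long BytesPerToken(int attentionLayers, int kvHeads, int headDim, int bytesPerElement)
        => 2L * attentionLayers * kvHeads * headDim * bytesPerElement;

    // Total cache size in GiB for a given context length.
    public static double GiB(long bytesPerToken, int contextTokens)
        => bytesPerToken * (double)contextTokens / (1024.0 * 1024.0 * 1024.0);
}
```

With hypothetical dimensions of 64 layers, 8 KV heads, a head dimension of 128, and 2-byte (FP16) elements, `BytesPerToken(64, 8, 128, 2)` is 262,144 bytes per token, so a 32,768-token context works out to 8 GiB and 65,536 tokens to 16 GiB. Plug in the real values from the model's config file, and remember that a quantized KV cache shrinks bytesPerElement.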
AMD RX 7900 XTX Specifics
AMD isn't an afterthought here. ROCm 7.2+ (released March 2026) is the first build achieving full parity with CUDA in Ollama, llama.cpp, and vLLM. If you're on older ROCm, stop and upgrade first — you'll spend hours debugging inference issues that ROCm 7.2 resolves.
Real-world throughput on an RX 7900 XTX:
- qwen3.6:27b Q4_K_M: ~30 tok/s
- qwen3.6:35b-a3b Q4: ~120 tok/s via Vulkan backend
That 4x throughput gap is the MoE architecture delivering exactly what it promises.
Wiring It Into a .NET + pgvector Stack
Ollama exposes an OpenAI-compatible REST API at http://localhost:11434/v1. Drop it behind an IHttpClientFactory and the integration is trivial.
OllamaLlmService.cs
using System.Linq;
using System.Net.Http.Json;

public class OllamaLlmService
{
    private readonly HttpClient _http;
    private const string BaseUrl = "http://localhost:11434/v1";

    public OllamaLlmService(IHttpClientFactory factory)
    {
        _http = factory.CreateClient("ollama");
        _http.BaseAddress = new Uri(BaseUrl);
    }

    public async Task<string> ChatAsync(
        string systemPrompt,
        string userMessage,
        bool enableThinking = false,
        CancellationToken ct = default)
    {
        var payload = new
        {
            model = "qwen3.6:27b",
            stream = false,
            // Pass thinking as a top-level parameter; Ollama forwards it to the model
            think = enableThinking,
            messages = new[]
            {
                new { role = "system", content = systemPrompt },
                new { role = "user", content = userMessage }
            }
        };

        // The leading slash resolves against the host root, so the full path is spelled out
        var response = await _http.PostAsJsonAsync("/v1/chat/completions", payload, ct);
        response.EnsureSuccessStatusCode();

        var result = await response.Content.ReadFromJsonAsync<ChatCompletionResponse>(ct);
        // FirstOrDefault avoids an exception if the server returns an empty choices array
        return result?.Choices?.FirstOrDefault()?.Message?.Content ?? string.Empty;
    }
}

// Minimal response DTOs covering only the fields read above
public record ChatCompletionResponse(List<ChatChoice> Choices);
public record ChatChoice(ChatMessage Message);
public record ChatMessage(string Role, string Content);
RAG Query with pgvector
The full local RAG stack: Ollama handles both the LLM (qwen3.6:27b) and the embedding model (nomic-embed-text). PostgreSQL with pgvector handles vector storage and similarity search. No external API, no cloud service, no data leaving the machine.
public async Task<string> QueryWithRagAsync(string userQuestion, CancellationToken ct)
{
    // Generate the query embedding locally via Ollama's embeddings endpoint
    var embedPayload = new { model = "nomic-embed-text", input = userQuestion };
    var embedResp = await _http.PostAsJsonAsync("/v1/embeddings", embedPayload, ct);
    embedResp.EnsureSuccessStatusCode();
    var embedResult = await embedResp.Content.ReadFromJsonAsync<EmbeddingResponse>(ct);
    var queryVector = embedResult!.Data[0].Embedding;

    // Semantic search against pgvector: returns ranked context chunks
    var chunks = await _pgRepo.SearchSimilarAsync(queryVector, topK: 5, ct);
    // Join the chunks explicitly; interpolating a collection would print its type name
    // (assumes SearchSimilarAsync returns a sequence of strings)
    var context = string.Join("\n---\n", chunks);

    // Enable thinking for complex synthesis: the model reasons over the retrieved context
    var system = "You are a senior .NET architect assistant. Answer based on context only.";
    var message = $"Context:\n{context}\n\nQuestion: {userQuestion}";
    return await ChatAsync(system, message, enableThinking: true, ct);
}
The pgvector query behind SearchSimilarAsync:
SELECT content, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 5;
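Two details in that query are easy to gloss over: the `$1::vector` cast means the parameter can be sent as a plain text literal like `[0.1,0.2,0.3]`, and `<=>` is cosine distance, so `1 - distance` is cosine similarity. Here is a small sketch of both, assuming you bind the embedding as text instead of using a dedicated vector type mapping. The VectorSql class is a hypothetical helper, and the in-memory similarity function exists only to show what the operator computes.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class VectorSql
{
    // Formats an embedding as a pgvector text literal, e.g. "[1,0.5,-2]".
    // InvariantCulture keeps '.' as the decimal separator on every locale.
    public static string ToVectorLiteral(float[] v)
        => "[" + string.Join(",", v.Select(x => x.ToString("R", CultureInfo.InvariantCulture))) + "]";

    // Cosine similarity, the quantity the query exposes as "1 - (embedding <=> $1)".
    // Returns NaN if either vector has zero length in the geometric sense.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
        {
            throw new ArgumentException("Vectors must have the same dimension.");
        }

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

Binding `VectorSql.ToVectorLiteral(queryVector)` as the `$1` text parameter keeps the repository free of extra type-mapping setup. The trade-off is a little string formatting per query, which is negligible next to the embedding call itself.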
Docker Compose Stack
services:
  ollama:
    image: ollama/ollama:latest # use ollama/ollama:rocm for AMD
    ports: ["11434:11434"]
    volumes: ["ollama_data:/root/.ollama"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: devuser
      POSTGRES_PASSWORD: devpass
      POSTGRES_DB: ragdb
    ports: ["5432:5432"]
    volumes: ["pg_data:/var/lib/postgresql/data"]
volumes:
  ollama_data:
  pg_data:
For AMD, swap ollama/ollama:latest for ollama/ollama:rocm and replace the driver: nvidia device reservation with device mappings for /dev/kfd and /dev/dri, which is how ROCm containers get access to the GPU.
Next Steps for Devs
- Pull now. Use the Ollama commands above. Pick your hardware-appropriate model. The download happens once; inference is immediate after that.
- Start with 35B-A3B for daily tasks. On a 24GB card it's four times faster than the 27B. Switch to 27B when you're doing deep architectural analysis or complex multi-file reasoning where quality matters more than throughput.
- Set num_ctx 32768 as your baseline Modelfile default. Only extend it when the task requires it: every additional 8K of context costs VRAM on top of the model weights, and on a 24GB card that budget is finite.
- Existing Ollama users: migration is one line. If you've got any .NET HttpClient integration already pointed at another local model, change the model field in the payload. The OpenAI-compatible endpoint is unchanged.
- The pgvector + nomic-embed-text combo is a fully local, zero-dependency RAG stack. Both models serve via Ollama. Everything (embeddings, inference, vector search) runs on your hardware.
- AMD users: ROCm 7.2+ is a hard prerequisite. Don't spend time troubleshooting inference issues on older ROCm builds. The fix is the upgrade.
- Watch for larger open-weight Qwen3.6 variants. The closed-weight Max model shipped before the open weights. Larger open variants typically follow within weeks of the initial open-weight release.