Run a Full Multi-Agent AI System Locally on Windows: LangChain + LM Studio + Qwen 3

Chris Vella

18 May 2026 — 9 min read

Every time I paste a snippet into ChatGPT or Claude, I'm sending proprietary code to a third-party server. That's a non-starter for client work. Add the per-token costs once your usage scales, and the internet dependency when you're travelling — or just in a hotel in Gozo with questionable Wi-Fi — and you've got three solid reasons to go local.

Here's what we're building: a multi-agent AI dev team that runs entirely on your Windows machine. One Orchestrator that reads your intent and dispatches to the right specialist — a C# senior engineer, an Angular front-end designer, an automation test writer, or a code reviewer. All of it backed by Qwen 3, a 6B-parameter model that punches well above its weight on code tasks. No cloud. No API costs. No code leaving your machine.

What you need before we start:

Windows 10 or 11 (64-bit)
Python 3.10 or newer
~8 GB free disk space for the model
A GPU with 6–8 GB VRAM is ideal — but a modern CPU with 16 GB RAM will get you there

Let's build it.

What Is LM Studio and Why Qwen 3?

LM Studio is a local model runner for Windows, macOS, and Linux. Its killer feature for our purposes is the built-in OpenAI-compatible REST server. You load a model, toggle the server on, and it exposes endpoints at http://localhost:1234/v1 — the same shape as the OpenAI API. LangChain already knows how to talk to OpenAI, so wiring it to a local model is almost zero effort.

Why Qwen 3? Alibaba's Qwen 3 family is genuinely strong on instruction following and code generation for its size class. The 6B instruct variant fits comfortably in 6–8 GB VRAM with Q4 quantisation, or runs on CPU-only machines if you have 16 GB of system RAM and some patience.

Which quantisation to pick: Search for Qwen3 in LM Studio's model browser. You'll see several quantisation options. My recommendation:

Q4_K_M — the right call for most consumer GPUs and CPU-only setups. Best balance of speed and quality.
Q8_0 — if you have 16 GB VRAM and want near-FP16 quality
Q2_K — only if you're heavily RAM-constrained; quality noticeably degrades on code tasks

Installing and Configuring LM Studio on Windows

Download and install from lmstudio.ai. Standard Windows .exe — next, next, done.

Load the model:

Open LM Studio and click the Search icon (magnifying glass) in the left sidebar
Type Qwen3 in the search bar
Find the latest Qwen3-6B-Instruct variant — look for GGUF format entries
Click Download on your chosen quantisation (Q4_K_M is the one)
Once downloaded, click Load Model — watch memory usage climb as it maps into RAM/VRAM

Enable the local API server:

Click the Developer icon (</>) in the left sidebar
Toggle Start Server on
Confirm the server is running at http://localhost:1234
Note the model identifier shown in the server panel — you'll need this in your config shortly

Two settings to check before moving on:

Context Length: Set to at least 8192. Agents handling real code blocks need the headroom.
GPU Offload Layers: Drag to maximum. Every layer pushed to GPU means faster inference. Leave everything else as defaults.

Verify the server is alive. Open PowerShell and run:

# Quick sanity check — PowerShell
Invoke-RestMethod -Uri "http://localhost:1234/v1/models" -Method GET

You should see a JSON response listing your loaded Qwen 3 model. Connection error? Check that the server toggle is on and that LM Studio is still running; it does not need to be in the foreground.
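If you'd rather run the same check from Python, here is a minimal sketch using only the standard library. It assumes the response follows the OpenAI list shape, {"data": [{"id": ...}]}; extract_model_ids and list_models are illustrative names, not part of LM Studio or LangChain.

```python
# check_models.py - illustrative helper; function names are hypothetical
import json
from urllib.request import urlopen


def extract_model_ids(payload: dict) -> list[str]:
    """Pull model identifiers out of an OpenAI-style /v1/models payload."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def list_models(base_url: str = "http://localhost:1234/v1") -> list[str]:
    """Query the local server; raises OSError if nothing is listening."""
    with urlopen(f"{base_url}/models", timeout=5) as response:
        return extract_model_ids(json.load(response))


# Offline demo with a payload in the assumed shape (the id is a placeholder).
sample = {"object": "list", "data": [{"id": "example-model", "object": "model"}]}
print(extract_model_ids(sample))
```

With the server running, print(list_models()) should show the same identifier you see in the Developer tab, which is the value MODEL_NAME needs later.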

Setting Up Your Python Environment

Keep it clean — use a virtual environment.

# Create and activate a virtual environment
python -m venv .venv
.venv\Scripts\activate

# Install dependencies
pip install langchain langchain-openai langchain-core openai python-dotenv

Create a .env file in your project root:

# .env
OPENAI_API_BASE=http://localhost:1234/v1
OPENAI_API_KEY=lm-studio
MODEL_NAME=qwen3-6b-instruct

Three things worth understanding here:

OPENAI_API_BASE redirects the OpenAI client to LM Studio instead of OpenAI's servers — that's the entire trick
OPENAI_API_KEY can be any non-empty string; LM Studio doesn't validate it, but the client library requires a value
MODEL_NAME must match the identifier shown in LM Studio's server panel exactly — it's case-sensitive, and it often includes a path prefix (like lmstudio-community/Qwen3-6B-Instruct-GGUF). Copy it from the UI rather than guessing.
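A wrong or missing value here tends to surface later as a confusing connection error, so it can help to validate the three settings up front. The sketch below is one way to do that with the standard library. Settings and load_settings are hypothetical names; the function takes any mapping, so you could pass os.environ after calling load_dotenv().

```python
# settings.py - illustrative; Settings and load_settings are hypothetical names
from dataclasses import dataclass
from typing import Mapping

REQUIRED_KEYS = ("OPENAI_API_BASE", "OPENAI_API_KEY", "MODEL_NAME")


@dataclass(frozen=True)
class Settings:
    base_url: str
    api_key: str
    model: str


def load_settings(env: Mapping[str, str]) -> Settings:
    """Validate the three values from .env and fail early with a clear message."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
    if missing:
        raise ValueError(f"Missing settings: {', '.join(missing)}")
    base_url = env["OPENAI_API_BASE"].strip().rstrip("/")
    if not base_url.startswith(("http://", "https://")):
        raise ValueError(f"OPENAI_API_BASE is not a URL: {base_url!r}")
    return Settings(base_url, env["OPENAI_API_KEY"].strip(), env["MODEL_NAME"].strip())


# Example using the values from the .env file above.
demo = load_settings({
    "OPENAI_API_BASE": "http://localhost:1234/v1/",
    "OPENAI_API_KEY": "lm-studio",
    "MODEL_NAME": "qwen3-6b-instruct",
})
print(demo)
```

The check cannot tell whether MODEL_NAME matches what LM Studio has loaded; it only catches empty or malformed values before the first request goes out.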

Connecting LangChain to LM Studio

Before touching the multi-agent system, prove the connection works with the simplest possible call. Don't skip this step.

# connection_test.py
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
import os
from dotenv import load_dotenv

load_dotenv()

llm = ChatOpenAI(
    base_url=os.getenv("OPENAI_API_BASE"),
    api_key=os.getenv("OPENAI_API_KEY"),
    model=os.getenv("MODEL_NAME"),
    temperature=0.2,
)

response = llm.invoke([HumanMessage(content="Write a C# method that reverses a string.")])
print(response.content)

Run it. If a C# method comes back, the pipe works. If you get a 404 or a model-name error, open LM Studio's Developer tab, copy the exact model identifier, and update MODEL_NAME in your .env.

temperature=0.2 is my default for code work. Lower temperature means less creative variation, more deterministic output — exactly what you want when the code needs to compile.

Designing the Multi-Agent System

Here's the architecture:

User Prompt
     │
     ▼
┌─────────────────┐
│   Orchestrator  │  ← classifies intent, routes task
└────────┬────────┘
         │
    ┌────┴─────────────────────────────────┐
    │             │            │           │
    ▼             ▼            ▼           ▼
┌────────┐  ┌─────────┐  ┌────────┐  ┌──────────┐
│  C#    │  │ Angular │  │  Test  │  │  Code    │
│ Coding │  │ Design  │  │ Writer │  │ Reviewer │
│ Agent  │  │  Agent  │  │ Agent  │  │  Agent   │
└────────┘  └─────────┘  └────────┘  └──────────┘
         │             │            │           │
         └─────────────┴────────────┴───────────┘
                              │
                              ▼
                       Response to User

The key insight: every agent shares the same Qwen 3 model instance. What makes each one a "specialist" is its system prompt — not a different model. The system prompt IS the agent's expertise. Get the prompt right and you get a different engineer.

The Orchestrator's only job is to read intent and return a structured routing decision. It never answers the question itself — it dispatches.

Implementing the Agents

Here's the full implementation. Save this as agents.py:

# agents.py
import os
import json
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

load_dotenv()

# ─── Shared LLM instance ──────────────────────────────────────────────────────
# All agents share one connection to LM Studio.
# Differentiation is entirely in the system prompts.

llm = ChatOpenAI(
    base_url=os.getenv("OPENAI_API_BASE"),
    api_key=os.getenv("OPENAI_API_KEY"),
    model=os.getenv("MODEL_NAME"),
    temperature=0.2,
)

def make_chain(system_prompt: str) -> object:
    """Build a simple prompt → LLM → string output chain."""
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        ("human", "{task}"),
    ])
    return prompt | llm | StrOutputParser()


# ─── Specialist Agent Definitions ─────────────────────────────────────────────

# 1. C# Senior-Level Coding Agent
csharp_agent = make_chain("""
You are a senior C# engineer with 15+ years of experience on .NET Core and .NET 8+.
You write idiomatic, production-grade C# that follows SOLID principles.
You use modern language features (records, pattern matching, nullable reference types, async/await).
You never write boilerplate noise — every line earns its place.
When you return code, include a brief explanation of the key decisions made.
Format: code block first, then a short explanation. No padding.
""")

# 2. Angular + CSS Front-End Designer Agent
angular_agent = make_chain("""
You are a senior Angular developer and UI designer specialising in Angular 17+ with TypeScript and SCSS.
You write pixel-perfect, accessible components that follow Angular best practices:
standalone components, signals for state, OnPush change detection.
Your SCSS is clean — no magic numbers, proper use of variables and mixins.
You think about keyboard accessibility and ARIA attributes without being reminded.
Format: TypeScript component first, then SCSS, then a brief note on any design decisions.
""")

# 3. Automation Test Writer Agent
test_agent = make_chain("""
You are a test-first engineer who writes thorough, maintainable test suites.
For C# code: use xUnit with FluentAssertions. Mock dependencies with NSubstitute.
For Angular code: use Jasmine/Karma for unit tests, Playwright for E2E when appropriate.
You always cover: happy path, edge cases, null/empty inputs, and boundary conditions.
Write tests that document intent — the test name should read like a specification.
Format: test file content only. No filler commentary outside of test names and inline comments.
""")

# 4. Code Review & Performance Agent
review_agent = make_chain("""
You are a critical but constructive code reviewer focused on .NET Core and Angular codebases.
You identify: performance bottlenecks (N+1 queries, unnecessary allocations, blocking async calls),
security antipatterns (injection risks, improper input validation, exposed secrets),
and readability issues (overly complex logic, poor naming, missing null guards).
Label every finding with severity: [HIGH], [MEDIUM], or [LOW].
Format your review as a numbered list. End with a brief overall assessment (1-2 sentences).
""")

# 5. Orchestrator Agent
# Literal braces in the JSON example are doubled ({{ }}) because
# ChatPromptTemplate treats single braces as template variables.
ORCHESTRATOR_SYSTEM = """
You are a routing orchestrator. Your ONLY job is to read the user's task and determine
which specialist agent should handle it. You do not answer the task yourself.

The available agents are:
- csharp: For writing, modifying, or explaining C# and .NET code
- angular: For writing Angular components, TypeScript, or CSS/SCSS UI work
- testing: For writing unit tests, integration tests, or E2E tests for any code
- review: For reviewing, auditing, or analysing existing code for quality or performance

Respond with ONLY a valid JSON object in this exact format:
{{"agent": "<agent_name>", "reason": "<one sentence explanation>"}}

If the task is ambiguous, pick the most likely agent and explain your reasoning.
"""

orchestrator_prompt = ChatPromptTemplate.from_messages([
    ("system", ORCHESTRATOR_SYSTEM),
    ("human", "{task}"),
])
orchestrator_chain = orchestrator_prompt | llm | StrOutputParser()

AGENT_MAP = {
    "csharp": csharp_agent,
    "angular": angular_agent,
    "testing": test_agent,
    "review": review_agent,
}


# ─── Pipeline Runner ───────────────────────────────────────────────────────────

def run_pipeline(user_prompt: str) -> dict:
    """
    Route user_prompt through the orchestrator and dispatch to the
    appropriate specialist agent. Returns a dict with routing info and result.
    """
    print("\n[Orchestrator] Analysing task...")
    raw_routing = orchestrator_chain.invoke({"task": user_prompt})

    # Parse the routing decision
    try:
        routing = json.loads(raw_routing.strip())
        agent_name = str(routing.get("agent", "")).strip().lower()
        reason = routing.get("reason", "No reason provided")
    except (json.JSONDecodeError, AttributeError):
        # Fallback: the reply was not a JSON object, so default to the C# agent
        agent_name = "csharp"  # safe default
        reason = f"Parsing failed; defaulted to csharp. Raw: {raw_routing}"

    print(f"[Orchestrator] Routing to: {agent_name.upper()} agent")
    print(f"[Orchestrator] Reason: {reason}")

    if agent_name not in AGENT_MAP:
        return {
            "agent": agent_name,
            "reason": reason,
            "result": f"Unknown agent '{agent_name}'. Available: {list(AGENT_MAP.keys())}",
        }

    specialist = AGENT_MAP[agent_name]
    print(f"\n[{agent_name.upper()} Agent] Working...\n")
    result = specialist.invoke({"task": user_prompt})

    return {
        "agent": agent_name,
        "reason": reason,
        "result": result,
    }

The Orchestrator in Action

The routing logic is deliberately simple. The Orchestrator gets a system prompt, four named buckets, and an instruction to return JSON. No complex decision tree — just a well-prompted model making a classification call.
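Small local models do not always return bare JSON. The reply may be wrapped in a Markdown fence or, with reasoning-style models, preceded by a <think> block (an assumption; check what your model actually emits). A more forgiving parser than a bare json.loads could look like the sketch below. parse_routing is a hypothetical helper, not part of agents.py as listed.

```python
# routing_parse.py - illustrative; parse_routing is a hypothetical helper
import json
import re

VALID_AGENTS = ("csharp", "angular", "testing", "review")


def parse_routing(raw: str, default: str = "csharp") -> tuple[str, str]:
    """Return (agent, reason) from a possibly noisy model reply."""
    # Drop any <think>...</think> reasoning block (assumed output format).
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    # Grab everything from the first '{' to the last '}', skipping code fences.
    match = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
        except json.JSONDecodeError:
            data = None
        if isinstance(data, dict):
            agent = str(data.get("agent", "")).strip().lower()
            if agent in VALID_AGENTS:
                return agent, str(data.get("reason", "No reason provided"))
    # Last resort: look for a known agent name anywhere in the text.
    lowered = cleaned.lower()
    for agent in VALID_AGENTS:
        if agent in lowered:
            return agent, "Recovered agent name from unstructured reply"
    return default, "Parsing failed; used default agent"


noisy = '<think>maybe csharp?</think>\n```json\n{"agent": "Testing", "reason": "wants tests"}\n```'
print(parse_routing(noisy))
```

Inside run_pipeline, the try/except around json.loads could then be replaced by a single call such as agent_name, reason = parse_routing(raw_routing).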

For most day-to-day dev tasks, four buckets are enough. A future upgrade with LangGraph would let agents hand off to each other: write code, automatically test it, automatically review it, all in one graph traversal. But that's a follow-up post. Start simple, prove value, then layer in complexity.

Putting It All Together

Save this as main.py alongside agents.py:

# main.py
from agents import run_pipeline

if __name__ == "__main__":
    # Example 1: C# task
    print("=" * 60)
    print("EXAMPLE 1: C# Task")
    print("=" * 60)
    output = run_pipeline(
        "Write a C# extension method to chunk a List<T> into batches of a given size."
    )
    print(output["result"])

    # Example 2: Angular task
    print("\n" + "=" * 60)
    print("EXAMPLE 2: Angular Task")
    print("=" * 60)
    output = run_pipeline(
        "Build an Angular search input component with 300ms debounce using RxJS."
    )
    print(output["result"])

    # Example 3: Code review task
    print("\n" + "=" * 60)
    print("EXAMPLE 3: Code Review Task")
    print("=" * 60)
    output = run_pipeline(
        """Review this C# repository pattern implementation for performance issues:

public class UserRepository
{
    private readonly AppDbContext _context;
    public UserRepository(AppDbContext context) => _context = context;

    public List<User> GetActiveUsers()
    {
        return _context.Users
            .Where(u => u.IsActive)
            .ToList();
    }

    public List<Order> GetOrdersForUser(int userId)
    {
        return _context.Orders
            .Where(o => o.UserId == userId)
            .ToList();
    }
}"""
    )
    print(output["result"])

Run it:

python main.py

What to expect on performance:

With GPU (6+ GB VRAM, Q4_K_M): First token in 1–3 seconds, full response in 10–30 seconds depending on output length
CPU only (16 GB RAM, Q4_K_M): First token in 5–15 seconds, full response in 1–4 minutes

CPU mode is fine for batch runs you kick off and walk away from. For interactive back-and-forth during a coding session, a GPU makes the difference between useful and annoying.
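To see where your own machine lands, you can time a streamed response. The sketch below measures time to first chunk and total time for any iterator of text chunks; measure_stream and fake_stream are hypothetical names, and the fake stream only exists so the example runs without a model.

```python
# latency.py - illustrative; measure_stream and fake_stream are hypothetical
import time
from typing import Iterable


def measure_stream(chunks: Iterable[str]) -> dict:
    """Consume a stream of text chunks, timing the first chunk and the whole run."""
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        parts.append(chunk)
    return {
        "first_chunk_s": first_chunk_at,
        "total_s": time.perf_counter() - start,
        "text": "".join(parts),
    }


def fake_stream():
    """Stand-in generator so the sketch runs without a model."""
    for word in ("public ", "static ", "void"):
        time.sleep(0.01)
        yield word


stats = measure_stream(fake_stream())
print(f"first chunk: {stats['first_chunk_s']:.3f}s, total: {stats['total_s']:.3f}s")
```

To time a real agent, pass in the chain's stream, for example measure_stream(csharp_agent.stream({"task": "..."})). Chains that end in StrOutputParser yield plain strings, so the helper should work unchanged, though the first chunk may arrive slightly before the first visible token.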

What to Do Next

This system is useful today. Here's the clear upgrade path when you're ready for more:

Add LangGraph for stateful multi-step pipelines. Define a workflow graph where the C# Agent writes code, the Test Writer automatically covers it, and the Code Reviewer flags issues — all in one triggered run. No manual chaining required.
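Before reaching for LangGraph, the write, test, review hand-off can be prototyped as a plain loop over agents. The sketch below is a plain-Python stand-in, not LangGraph's API. run_sequence and the stub agents are hypothetical; in the real system each stage would be something like csharp_agent.invoke from agents.py.

```python
# sequence.py - plain-Python stand-in for a graph; all names are hypothetical
from typing import Callable

Agent = Callable[[dict], str]


def run_sequence(task: str, stages: list[tuple[str, Agent]]) -> dict[str, str]:
    """Run agents in order, feeding each one the task plus earlier outputs."""
    outputs: dict[str, str] = {}
    for name, agent in stages:
        context = "\n\n".join(f"[{prev}]\n{text}" for prev, text in outputs.items())
        prompt = task if not context else f"{task}\n\nPrevious output:\n{context}"
        outputs[name] = agent({"task": prompt})
    return outputs


def stub(label: str) -> Agent:
    """Stub agent so the sketch runs without a model."""
    return lambda inputs: f"{label} saw {len(inputs['task'])} chars"


results = run_sequence(
    "Write a string reverser.",
    [("csharp", stub("coder")), ("testing", stub("tester")), ("review", stub("reviewer"))],
)
for name, text in results.items():
    print(f"{name}: {text}")
```

Each stage sees everything produced before it, so prompts grow quickly. That growth is one reason the 8192-token context setting matters, and one thing a graph with explicit state would let you control more carefully.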

Add a RAG layer using Chroma or FAISS. Embed your own codebase into a local vector store and point agents at it. When the C# agent needs context about your existing architecture or conventions, it retrieves it rather than hallucinating something generic.
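The retrieval step itself is easy to picture. The toy below ranks code snippets by word overlap (cosine similarity over bag-of-words counts). It is a stand-in for real embeddings and a vector store such as Chroma or FAISS, not their API, and every name in it is hypothetical.

```python
# toy_retrieval.py - bag-of-words stand-in for a vector store; names are hypothetical
import math
import re
from collections import Counter


def vectorise(text: str) -> Counter:
    """Lower-case word counts; a real setup would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = vectorise(query)
    ranked = sorted(documents, key=lambda doc: cosine(q, vectorise(doc)), reverse=True)
    return ranked[:k]


snippets = [
    "public class UserRepository uses AppDbContext to load users",
    "export class SearchComponent debounces input with RxJS",
]
context = retrieve("how does the user repository load users", snippets)
print(context[0])
```

The retrieved snippets would then be prepended to the task text before invoking the specialist, so the agent works from your conventions instead of generic ones.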

Bind specialist agents to domain-specific models as the local AI ecosystem matures. LM Studio can run multiple models simultaneously (hardware permitting). A fine-tuned coding model for the C# agent and a smaller, faster model for the orchestrator's routing decision is a natural split.

Closing

A full specialist AI dev team — C# engineer, Angular designer, test writer, code reviewer — running privately on your Windows machine in roughly 200 lines of Python. Zero per-token cost. Zero internet required. Your code never leaves the box.

That's the saving that matters here: you stop manually routing tasks between tools, stop paying per token, and stop worrying about what gets logged server-side. You describe the task, the orchestrator picks the right expert, and you get a focused response you can review and refine.

Try it. Then extend it. Which specialist agent would you add next? Drop a comment below.

Built with LM Studio, Qwen 3 (6B), LangChain, and Python 3.10+ on Windows. All inference runs locally.