localllm

Local model evaluation and benchmark results. Qwen3.6-27B and 35B abliterated MTP models running on llama.cpp with CPU TTS sidecar.

last updated 2026-06-30 · 4 current comparison models · 5 benchmark suites

| Role | Model | Configuration |
| --- | --- | --- |
| Main Chat | Qwen3.6-27B MTP Q6 | Huihui abliterated · Vulkan · 128k ctx · MTP draft n=3 |
| Secondary | Qwen3.6-35B A3B MTP Q6 | Huihui abliterated · Vulkan · 256k ctx · MTP draft n=2 |
| Embedding | Qwen3-Embedding-4B | Q4_K_M · CPU |
| Reranker | Qwen3-Reranker-4B | Q4_K_M · Vulkan · 2048 ctx |

| Model | Alias | Quant | Params | Replay | Sim | Coding | Agentic | Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen27-huihui | Qwen3.6-27B | Q6_K | 27B | 27/35 | 7/8 | 0.50 | 0.95 | Production |
| qwen35-huihui | Qwen3.6-35B A3B | Q6_K | 35B | 16/31 | 8/8 | 0.50 | 0.875 | On Disk |

| Model | Alias | Quant | Params | Replay | Sim | Coding | Agentic | Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ornith-q5 | Ornith-1.0-35B | Q5_K_M | 35B | 25/35 | 7/8 | 0.50 | 1.00 | Archived |
| ornith-q6 | Ornith-1.0-35B | Q6_K | 35B | 24/35 | 7/8 | 0.25 | 0.85 | Deleted |

Current finding: Qwen3.6-27B Huihui remains the daily default because it has the best replay behavior of the retained Qwen models (27/35 turns matched versus 16/31) and can keep the reranker in VRAM. Qwen3.6-35B Huihui is stronger in sim compare (8/8 versus 7/8), but its replay and tool-shape reliability is worse, and the full-context q8-KV configuration leaves no VRAM headroom for the reranker.

Turn-level accuracy against known-good transcript fixtures. Measures how faithfully a model reproduces expected conversation flow.

| Model | Alias | Turns | Match Rate | Fixtures | Pass Rate | Avg Score |
| --- | --- | --- | --- | --- | --- | --- |
| qwen27-huihui | Qwen3.6-27B | 27/35 | 77.1% | 2/5 | 40.0% | 0.9143 |
| ornith-q5 | Ornith-1.0-35B | 25/35 | 71.4% | 2/5 | 40.0% | 0.9048 |
| ornith-q6 | Ornith-1.0-35B | 24/35 | 68.6% | 1/5 | 20.0% | 0.8857 |
| qwen35-huihui | Qwen3.6-35B A3B | 16/31 | 51.6% | 1/5 | 20.0% | 0.7473 |
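
The two percentage columns in the replay table follow directly from the raw counts. A minimal sketch of that arithmetic, assuming match rate is matched turns over total turns and pass rate is passed fixtures over total fixtures (the class and field names are hypothetical, not taken from the benchmark code):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayResult:
    """Raw replay counts for one model (hypothetical container)."""
    model: str
    matched_turns: int
    total_turns: int
    fixtures_passed: int
    fixtures_total: int

    @property
    def match_rate_pct(self) -> float:
        # Share of turns that matched the known-good transcript, in percent.
        return round(100 * self.matched_turns / self.total_turns, 1)

    @property
    def pass_rate_pct(self) -> float:
        # Share of fixtures that passed as a whole, in percent.
        return round(100 * self.fixtures_passed / self.fixtures_total, 1)


# Counts copied from the replay table above.
RESULTS = [
    ReplayResult("qwen27-huihui", 27, 35, 2, 5),
    ReplayResult("ornith-q5", 25, 35, 2, 5),
    ReplayResult("ornith-q6", 24, 35, 1, 5),
    ReplayResult("qwen35-huihui", 16, 31, 1, 5),
]

for r in RESULTS:
    print(f"{r.model:15s} match {r.match_rate_pct:5.1f}%  pass {r.pass_rate_pct:5.1f}%")
```

Note that qwen35-huihui is scored over 31 turns rather than 35, so its match rate is not computed over the same denominator as the other three models.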

Semantic similarity against reference responses across coding and agentic scenarios. Measures output quality beyond exact match.

| Model | Alias | Pass | Pass Rate | Scope Clean | Tool Error Free | Avg Score |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35-huihui | Qwen3.6-35B A3B | 8/8 | 100% | 7/8 | 7/8 | 0.9437 |
| ornith-q5 | Ornith-1.0-35B | 7/8 | 87.5% | 7/8 | 8/8 | 0.9000 |
| qwen27-huihui | Qwen3.6-27B | 7/8 | 87.5% | 6/8 | 7/8 | 0.8438 |
| ornith-q6 | Ornith-1.0-35B | 7/8 | 87.5% | 6/8 | 7/8 | 0.8438 |

Coding Compare

| Model | Alias | Avg Score | Fixtures | Pass | Fail |
| --- | --- | --- | --- | --- | --- |
| qwen27-huihui | Qwen3.6-27B | 0.50 | 4 | 2 | 2 |
| qwen35-huihui | Qwen3.6-35B A3B | 0.50 | 4 | 2 | 2 |
| ornith-q5 | Ornith-1.0-35B | 0.50 | 4 | 2 | 2 |
| ornith-q6 | Ornith-1.0-35B | 0.25 | 4 | 1 | 3 |

Agentic Barrage

| Model | Alias | Avg Score | Scenarios | Pass | Fail |
| --- | --- | --- | --- | --- | --- |
| ornith-q5 | Ornith-1.0-35B | 1.00 | scored | - | - |
| qwen27-huihui | Qwen3.6-27B | 0.95 | scored | - | - |
| qwen35-huihui | Qwen3.6-35B A3B | 0.875 | scored | - | - |
| ornith-q6 | Ornith-1.0-35B | 0.85 | scored | - | - |

Speed

| Model | Alias | Prompt tok/s | Predict tok/s | Peak VRAM | Peak RAM |
| --- | --- | --- | --- | --- | --- |
| ornith-q5 | Ornith-1.0-35B | 2548 | 107.3 | 27.9 GiB | - |
| ornith-q6 | Ornith-1.0-35B | 2450 | 98.5 | 31.4 GiB | - |
| qwen35-huihui | Qwen3.6-35B A3B | 1826 | 123.6 | 32.0 GiB | - |
| qwen27-huihui | Qwen3.6-27B | 680 | 47.4 | 28.2 GiB | - |

† archived models no longer on disk. Historical results preserved for reference.

| Suite | What It Measures | Scoring |
| --- | --- | --- |
| Transcript Replay | Turn-level accuracy against known-good transcript fixtures. Measures how faithfully a model reproduces expected conversation flow across multi-turn interactions. | Partial score per fixture (0-2). Aggregate: matched turns / total turns, pass rate across fixtures. |
| Sim Compare | Semantic similarity against reference responses across coding and agentic scenarios. Measures output quality beyond exact string matching. | Pass/fail per scenario with similarity threshold. Scope-clean and tool-error-free sub-scores. |
| Coding Compare | Single-turn coding smoke tasks. Measures whether generated code passes hidden checks without needing a full agent loop. | Average score 0-1 across the committed task set. |
| Agentic Barrage | Multi-step agentic task execution under pressure, including planning, revision, evidence triage, tool restraint, and tool followthrough. | Average score 0-1 across rubric-scored scenarios. |
| Speed | Prompt processing and token prediction throughput on the target hardware. Peak VRAM and RAM usage under full context load. | Tokens per second for prompt and predict phases. GiB for peak memory usage. |

Primary evidence is kept split by family instead of collapsed into one blended score. Transcript replay is the general-agent signal; sim compare is the coding-agent signal. Speed is reported for fit and hardware planning, not as a quality ranking.
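
To illustrate why the suites are kept separate, the sketch below picks the leader for each suite independently instead of averaging them. It assumes replay and sim can be expressed as pass fractions so that every suite sits on a 0-1 scale; the `SCORES` layout and the `leaders` helper are hypothetical and exist only for this example.

```python
# Scores copied from the tables in this document. Replay and sim are written
# as fractions (matched/total, passed/total) so all suites share a 0-1 scale.
SCORES = {
    "qwen27-huihui": {"replay": 27 / 35, "sim": 7 / 8, "coding": 0.50, "agentic": 0.95},
    "qwen35-huihui": {"replay": 16 / 31, "sim": 8 / 8, "coding": 0.50, "agentic": 0.875},
    "ornith-q5": {"replay": 25 / 35, "sim": 7 / 8, "coding": 0.50, "agentic": 1.00},
    "ornith-q6": {"replay": 24 / 35, "sim": 7 / 8, "coding": 0.25, "agentic": 0.85},
}


def leaders(scores: dict, suite: str) -> list:
    """Return every model tied for the top score in a single suite."""
    best = max(row[suite] for row in scores.values())
    return sorted(name for name, row in scores.items() if row[suite] == best)


for suite in ("replay", "sim", "coding", "agentic"):
    print(f"{suite:8s} {leaders(SCORES, suite)}")
```

Each suite produces a different leader set here (replay favors qwen27-huihui, sim favors qwen35-huihui, agentic favors ornith-q5, and coding is a three-way tie), which a single blended number would hide.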

Archived models no longer receive new benchmark runs. Historical results are preserved for reference and comparison only.

Ornith Q5 was retired after hands-on prose/editing use despite strong benchmark results. Ornith Q6 was deleted because Q5 was faster, roomier, and cleaner in the local tests. Qwen3.6-27B Huihui stays as the daily default; Qwen3.6-35B Huihui stays on disk as the retained 35B abliterated option.

All current benchmarks run on a single AMD AI Pro R9700 32 GB host with Vulkan GPU acceleration via llama.cpp. The daily 27B chat preset uses 128k context, q8 KV, and MTP draft n=3. The retained 35B Huihui preset uses 256k context, q8 KV, and MTP draft n=2.
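
As a rough fit check for this host, the sketch below subtracts each model's reported peak VRAM from the card capacity. It rests on two assumptions that the source does not confirm: that the 32 GB card can be treated as 32 GiB, and that the peak VRAM figures in the speed table describe the chat model alone, without the reranker loaded.

```python
CARD_VRAM_GIB = 32.0  # assumption: "32 GB" treated as 32 GiB for this estimate

# Peak VRAM figures copied from the speed table.
PEAK_VRAM_GIB = {
    "qwen27-huihui": 28.2,
    "qwen35-huihui": 32.0,
    "ornith-q5": 27.9,
    "ornith-q6": 31.4,
}


def headroom_gib(model: str, card_gib: float = CARD_VRAM_GIB) -> float:
    """VRAM left over for sidecar models such as the reranker, in GiB."""
    return round(card_gib - PEAK_VRAM_GIB[model], 1)


for model in PEAK_VRAM_GIB:
    print(f"{model:15s} headroom {headroom_gib(model):.1f} GiB")
```

Under those assumptions the 27B preset leaves a few GiB free while the 35B preset at full context leaves none, which is consistent with the reranker fitting alongside the 27B model only.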

All benchmark results are stored as JSON in benchmarks/summaries/. Each suite produces a summary.json with per-candidate scores and a published.json with the public-facing label. Full run manifests and per-fixture output are in benchmarks/model_eval/results/.
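
A minimal sketch of reading those files back, assuming each suite has its own subdirectory under benchmarks/summaries/ that holds its summary.json and published.json. The subdirectory layout and the `load_suite_files` helper are assumptions for illustration; the JSON contents are returned as-is because their schema is not described here.

```python
import json
from pathlib import Path


def load_suite_files(root: Path) -> dict:
    """Read summary.json (and published.json, if present) for each suite directory."""
    suites = {}
    if not root.is_dir():
        return suites
    # Assumed layout: <root>/<suite>/summary.json and <root>/<suite>/published.json
    for summary_path in sorted(root.glob("*/summary.json")):
        suite_dir = summary_path.parent
        entry = {"summary": json.loads(summary_path.read_text(encoding="utf-8"))}
        published_path = suite_dir / "published.json"
        if published_path.is_file():
            entry["published"] = json.loads(published_path.read_text(encoding="utf-8"))
        suites[suite_dir.name] = entry
    return suites


if __name__ == "__main__":
    for name, files in load_suite_files(Path("benchmarks/summaries")).items():
        print(name, sorted(files))
```

If the files actually sit directly in benchmarks/summaries/ with suite-specific names, the glob pattern would need to change accordingly.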