localllm
Local model evaluation and benchmark results. Qwen3.6-27B and Qwen3.6-35B abliterated MTP models running on llama.cpp with a CPU TTS sidecar.
| Model | Alias | Quant | Params | Replay | Sim | Coding | Agentic | Status |
|---|---|---|---|---|---|---|---|---|
| qwen27-huihui | Qwen3.6-27B | Q6_K | 27B | 27/35 | 7/8 | 0.50 | 0.95 | Production |
| qwen35-huihui | Qwen3.6-35B A3B | Q6_K | 35B | 16/31 | 8/8 | 0.50 | 0.875 | On Disk |
| Model | Alias | Quant | Params | Replay | Sim | Coding | Agentic | Status |
|---|---|---|---|---|---|---|---|---|
| ornith-q5 | Ornith-1.0-35B | Q5_K_M | 35B | 25/35 | 7/8 | 0.50 | 1.00 | Archived |
| ornith-q6 | Ornith-1.0-35B | Q6_K | 35B | 24/35 | 7/8 | 0.25 | 0.85 | Deleted |
Current finding: Qwen3.6-27B Huihui remains the daily default because it has the best replay behavior of the retained Qwen models (27/35 turns matched versus 16/31) and leaves room to keep the reranker in VRAM. Qwen3.6-35B Huihui is stronger in the coding-agent sim compare (8/8 versus 7/8), but its replay and tool-shape reliability is worse, and its full-context q8-KV configuration leaves no VRAM headroom for the reranker.
Turn-level accuracy against known-good transcript fixtures. Measures how faithfully a model reproduces expected conversation flow.
| Model | Alias | Turns | Match Rate | Fixtures | Pass Rate | Avg Score |
|---|---|---|---|---|---|---|
| qwen27-huihui | Qwen3.6-27B | 27/35 | 77.1% | 2/5 | 40.0% | 0.9143 |
| ornith-q5 | Ornith-1.0-35B | 25/35 | 71.4% | 2/5 | 40.0% | 0.9048 |
| ornith-q6 | Ornith-1.0-35B | 24/35 | 68.6% | 1/5 | 20.0% | 0.8857 |
| qwen35-huihui | Qwen3.6-35B A3B | 16/31 | 51.6% | 1/5 | 20.0% | 0.7473 |
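The Match Rate and Pass Rate columns above are plain ratios of the Turns and Fixtures columns. A minimal sketch of that arithmetic, assuming one-decimal rounding (the `rate` helper and the `replay` dictionary are illustrative names, not part of the benchmark code):

```python
def rate(matched: int, total: int) -> float:
    """Return matched/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * matched / total, 1)


# (matched turns, total turns, passed fixtures, total fixtures),
# copied from the replay table above.
replay = {
    "qwen27-huihui": (27, 35, 2, 5),
    "ornith-q5": (25, 35, 2, 5),
    "ornith-q6": (24, 35, 1, 5),
    "qwen35-huihui": (16, 31, 1, 5),
}

for model, (turns_ok, turns, fixtures_ok, fixtures) in replay.items():
    match_rate = rate(turns_ok, turns)
    pass_rate = rate(fixtures_ok, fixtures)
    print(f"{model}: match {match_rate}%, pass {pass_rate}%")
```

Note that qwen35-huihui was scored over 31 turns rather than 35, so its match rate uses a different denominator from the other three rows.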
Semantic similarity against reference responses across coding and agentic scenarios. Measures output quality beyond exact match.
| Model | Alias | Pass | Pass Rate | Scope Clean | Tool Error Free | Avg Score |
|---|---|---|---|---|---|---|
| qwen35-huihui | Qwen3.6-35B A3B | 8/8 | 100% | 7/8 | 7/8 | 0.9437 |
| ornith-q5 | Ornith-1.0-35B | 7/8 | 87.5% | 7/8 | 8/8 | 0.9000 |
| qwen27-huihui | Qwen3.6-27B | 7/8 | 87.5% | 6/8 | 7/8 | 0.8438 |
| ornith-q6 | Ornith-1.0-35B | 7/8 | 87.5% | 6/8 | 7/8 | 0.8438 |
| Model | Alias | Avg Score | Fixtures | Pass | Fail |
|---|---|---|---|---|---|
| qwen27-huihui | Qwen3.6-27B | 0.50 | 4 | 2 | 2 |
| qwen35-huihui | Qwen3.6-35B A3B | 0.50 | 4 | 2 | 2 |
| ornith-q5 | Ornith-1.0-35B | 0.50 | 4 | 2 | 2 |
| ornith-q6 | Ornith-1.0-35B | 0.25 | 4 | 1 | 3 |
| Model | Alias | Avg Score | Scenarios | Pass | Fail |
|---|---|---|---|---|---|
| ornith-q5 | Ornith-1.0-35B | 1.00 | scored | - | - |
| qwen27-huihui | Qwen3.6-27B | 0.95 | scored | - | - |
| qwen35-huihui | Qwen3.6-35B A3B | 0.875 | scored | - | - |
| ornith-q6 | Ornith-1.0-35B | 0.85 | scored | - | - |
| Model | Alias | Prompt tok/s | Predict tok/s | Peak VRAM | Peak RAM |
|---|---|---|---|---|---|
| ornith-q5 | Ornith-1.0-35B | 2548 | 107.3 | 27.9 GiB | - |
| ornith-q6 | Ornith-1.0-35B | 2450 | 98.5 | 31.4 GiB | - |
| qwen35-huihui | Qwen3.6-35B A3B | 1826 | 123.6 | 32.0 GiB | - |
| qwen27-huihui | Qwen3.6-27B | 680 | 47.4 | 28.2 GiB | - |
† Archived models are no longer on disk. Historical results are preserved for reference.
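The speed figures are mainly useful for fit and latency planning. The sketch below shows one way to read them: leftover VRAM for a sidecar such as the reranker, and a rough per-turn latency. It assumes the 32 GB card exposes about 32 GiB and that the measured throughput holds at other prompt sizes, which may not be true in practice. `CARD_VRAM_GIB`, `headroom_gib`, and `turn_seconds` are hypothetical names introduced for illustration.

```python
# Assumption: the 32 GB card exposes roughly 32 GiB of usable VRAM.
CARD_VRAM_GIB = 32.0

# (prompt tok/s, predict tok/s, peak VRAM in GiB), copied from the speed table.
speed = {
    "ornith-q5": (2548, 107.3, 27.9),
    "ornith-q6": (2450, 98.5, 31.4),
    "qwen35-huihui": (1826, 123.6, 32.0),
    "qwen27-huihui": (680, 47.4, 28.2),
}


def headroom_gib(peak_vram_gib: float, capacity_gib: float = CARD_VRAM_GIB) -> float:
    """VRAM left over after the model's peak usage, in GiB."""
    return round(capacity_gib - peak_vram_gib, 1)


def turn_seconds(prompt_tokens: int, output_tokens: int,
                 prompt_tps: float, predict_tps: float) -> float:
    """Rough latency: prompt processing time plus generation time."""
    return prompt_tokens / prompt_tps + output_tokens / predict_tps


for model, (prompt_tps, predict_tps, peak) in speed.items():
    latency = turn_seconds(8000, 500, prompt_tps, predict_tps)
    print(f"{model}: headroom {headroom_gib(peak)} GiB, "
          f"8000-token prompt + 500-token reply in about {latency:.1f} s")
```

Under these assumptions qwen35-huihui has zero headroom, which is consistent with the finding that its full-context configuration leaves no room for the reranker.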
| Suite | What It Measures | Scoring |
|---|---|---|
| Transcript Replay | Turn-level accuracy against known-good transcript fixtures. Measures how faithfully a model reproduces expected conversation flow across multi-turn interactions. | Partial score per fixture (0-2). Aggregate: matched turns / total turns, pass rate across fixtures. |
| Sim Compare | Semantic similarity against reference responses across coding and agentic scenarios. Measures output quality beyond exact string matching. | Pass/fail per scenario with similarity threshold. Scope-clean and tool-error-free sub-scores. |
| Coding Compare | Single-turn coding smoke tasks. Measures whether generated code passes hidden checks without needing a full agent loop. | Average score 0-1 across the committed task set. |
| Agentic Barrage | Multi-step agentic task execution under pressure, including planning, revision, evidence triage, tool restraint, and tool followthrough. | Average score 0-1 across rubric-scored scenarios. |
| Speed | Prompt processing and token prediction throughput on the target hardware. Peak VRAM and RAM usage under full context load. | Tokens per second for prompt and predict phases. GiB for peak memory usage. |
Primary evidence is kept split by family instead of collapsed into one blended score. Transcript replay is the general-agent signal; sim compare is the coding-agent signal. Speed is reported for fit and hardware planning, not as a quality ranking.
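A short sketch of why the scores are kept split: using the figures from the tables above, each suite has a different leader, so a single blended number would hide the trade-offs. Expressing replay as matched turns over total turns and sim as pass count over scenarios is an illustrative choice here, and `leaders` is a hypothetical helper, not part of the benchmark tooling.

```python
# Per-suite scores copied from the tables above, each expressed on a 0-1 scale.
scores = {
    "qwen27-huihui": {"replay": 27 / 35, "sim": 7 / 8, "coding": 0.50, "agentic": 0.95},
    "qwen35-huihui": {"replay": 16 / 31, "sim": 8 / 8, "coding": 0.50, "agentic": 0.875},
    "ornith-q5": {"replay": 25 / 35, "sim": 7 / 8, "coding": 0.50, "agentic": 1.00},
    "ornith-q6": {"replay": 24 / 35, "sim": 7 / 8, "coding": 0.25, "agentic": 0.85},
}


def leaders(table: dict, suite: str) -> list:
    """Return every model tied for the top score in one suite, sorted by name."""
    best = max(row[suite] for row in table.values())
    return sorted(model for model, row in table.items() if row[suite] == best)


for suite in ("replay", "sim", "coding", "agentic"):
    print(suite, leaders(scores, suite))
```

Replay, sim, and agentic each pick a different model, and coding is a three-way tie, so no single candidate leads every family.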
Archived models no longer receive new benchmark runs. Historical results are preserved for reference and comparison only.
Ornith Q5 was retired after hands-on prose/editing use despite strong benchmark results. Ornith Q6 was deleted because Q5 was faster, roomier, and cleaner in the local tests. Qwen3.6-27B Huihui stays as the daily default; Qwen3.6-35B Huihui stays as the retained 35B Huihui/abliterated option.
All current benchmarks run on a single AMD AI Pro R9700 32 GB host with Vulkan GPU acceleration via llama.cpp. The daily 27B chat preset uses 128k context, q8 KV, and MTP draft n=3. The retained 35B Huihui preset uses 256k context, q8 KV, and MTP draft n=2.
All benchmark results are stored as JSON in benchmarks/summaries/. Each suite produces a summary.json with per-candidate scores and a published.json with the public-facing label. Full run manifests and per-fixture output are in benchmarks/model_eval/results/.
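For readers who want to inspect the stored results, the following is a minimal loading sketch. It assumes each suite writes its `summary.json` into its own subdirectory under `benchmarks/summaries/`; that layout and the `load_summaries` helper are assumptions, and the JSON schema is not described here, so the parsed content is returned as-is.

```python
import json
from pathlib import Path


def load_summaries(root):
    """Map each summary.json's parent directory name to its parsed JSON.

    Assumption: one summary.json per suite subdirectory under `root`.
    """
    summaries = {}
    for path in sorted(Path(root).rglob("summary.json")):
        with path.open(encoding="utf-8") as handle:
            summaries[path.parent.name] = json.load(handle)
    return summaries
```

Calling `load_summaries("benchmarks/summaries")` from the repository root would then return one entry per suite directory that contains a `summary.json`, if the layout matches the assumption above.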