localllm
Local model evaluation and benchmark results. Qwen3.6-27B and Qwen3.6-35B abliterated MTP models running on llama.cpp with a CPU TTS sidecar.
| Model | Alias | Quant | Params | Replay | Sim | Coding | Agentic | Status |
|---|---|---|---|---|---|---|---|---|
| qwen27-huihui | Qwen3.6-27B | Q6_K | 29.1 | 95.1% | 93.8% | 0.89 | 0.82 | Production |
| qwen35-huihui | Qwen3.6-35B | Q4_K_M | 35.1 | 94.8% | 93.8% | 0.87 | 0.79 | On Disk |
| Model | Alias | Quant | Params | Replay | Sim | Coding | Agentic | Status |
|---|---|---|---|---|---|---|---|---|
| ornith-q5 | Ornith-27B | Q5_K_M | 27.0 | 95.9% | 87.5% | 0.91 | 0.85 | Archived |
| ornith-q6 | Ornith-27B | Q6_K | 27.0 | 95.5% | 87.5% | 0.90 | 0.84 | Archived |
| gemma-iq4nl | Gemma-4-31B | IQ4_NL | 31.3 | 94.8% | 93.8% | 0.88 | 0.81 | Archived |
| gemma-udq4kxl | Gemma-4-31B | UDQ4_K_XL | 31.3 | 94.6% | 87.5% | 0.87 | 0.80 | Archived |
| gemma-q4km | Gemma-4-31B | Q4_K_M | 31.3 | 94.4% | 87.5% | 0.86 | 0.79 | Archived |
Key finding: Qwen3.6-27B abliterated MTP comes close to Ornith-27B on transcript replay (95.1% vs 95.9%) and coding (0.89 vs 0.91) while exceeding it on sim compare (93.8% vs 87.5%). At Q6_K with MTP draft acceleration, its peak VRAM (19.1 GiB) is below Ornith-27B at Q6_K (19.4 GiB) but above Ornith-27B at Q5_K_M (18.8 GiB).
Turn-level accuracy against known-good transcript fixtures. Measures how faithfully a model reproduces expected conversation flow.
| Model | Alias | Turns | Match Rate | Fixtures | Pass Rate | Avg Score |
|---|---|---|---|---|---|---|
| ornith-q5 | Ornith-27B | 448/465 | 96.3% | 37/37 | 100% | 1.96 |
| ornith-q6 | Ornith-27B | 444/465 | 95.5% | 36/37 | 97.3% | 1.89 |
| qwen27-huihui | Qwen3.6-27B | 442/465 | 95.1% | 35/37 | 94.6% | 1.84 |
| qwen35-huihui | Qwen3.6-35B | 440/465 | 94.6% | 35/37 | 94.6% | 1.82 |
| gemma-iq4nl | Gemma-4-31B | 440/465 | 94.6% | 35/37 | 94.6% | 1.81 |
| gemma-udq4kxl | Gemma-4-31B | 438/465 | 94.2% | 34/37 | 91.9% | 1.78 |
| gemma-q4km | Gemma-4-31B | 439/465 | 94.4% | 34/37 | 91.9% | 1.77 |
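The Match Rate and Pass Rate columns follow directly from the Turns and Fixtures counts. A minimal sketch of that arithmetic, assuming the rates are plain ratios rounded to one decimal place (the helper name `rate` is hypothetical and not part of the benchmark code):

```python
def rate(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage rounded to one decimal."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return round(100.0 * numerator / denominator, 1)


# (matched turns, total turns, passed fixtures, total fixtures),
# copied from three rows of the table above.
rows = {
    "ornith-q5": (448, 465, 37, 37),
    "qwen27-huihui": (442, 465, 35, 37),
    "gemma-udq4kxl": (438, 465, 34, 37),
}

# Map each model to (match rate %, fixture pass rate %).
replay_rates = {
    model: (rate(turns, total_turns), rate(passed, total_fixtures))
    for model, (turns, total_turns, passed, total_fixtures) in rows.items()
}

for model, (match_rate, pass_rate) in replay_rates.items():
    print(f"{model}: match {match_rate}%, pass {pass_rate}%")
```

For these three rows the computed values agree with the Match Rate and Pass Rate columns as published.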
Semantic similarity against reference responses across coding and agentic scenarios. Measures output quality beyond exact match.
| Model | Alias | Pass | Pass Rate | Scope Clean | Tool Error Free | Avg Score |
|---|---|---|---|---|---|---|
| qwen27-huihui | Qwen3.6-27B | 15/16 | 93.8% | 14 | 16 | 0.84 |
| qwen35-huihui | Qwen3.6-35B | 15/16 | 93.8% | 14 | 16 | 0.83 |
| gemma-iq4nl | Gemma-4-31B | 15/16 | 93.8% | 14 | 16 | 0.82 |
| ornith-q5 | Ornith-27B | 14/16 | 87.5% | 13 | 15 | 0.81 |
| ornith-q6 | Ornith-27B | 14/16 | 87.5% | 13 | 15 | 0.80 |
| gemma-udq4kxl | Gemma-4-31B | 14/16 | 87.5% | 13 | 15 | 0.79 |
| gemma-q4km | Gemma-4-31B | 14/16 | 87.5% | 13 | 15 | 0.78 |
| Model | Alias | Avg Score | Fixtures | Pass | Fail |
|---|---|---|---|---|---|
| ornith-q5 | Ornith-27B | 0.91 | 16 | 14 | 2 |
| ornith-q6 | Ornith-27B | 0.90 | 16 | 14 | 2 |
| qwen27-huihui | Qwen3.6-27B | 0.89 | 16 | 13 | 3 |
| gemma-iq4nl | Gemma-4-31B | 0.88 | 16 | 13 | 3 |
| qwen35-huihui | Qwen3.6-35B | 0.87 | 16 | 13 | 3 |
| gemma-udq4kxl | Gemma-4-31B | 0.87 | 16 | 12 | 4 |
| gemma-q4km | Gemma-4-31B | 0.86 | 16 | 12 | 4 |
| Model | Alias | Avg Score | Scenarios | Pass | Fail |
|---|---|---|---|---|---|
| ornith-q5 | Ornith-27B | 0.85 | 10 | 8 | 2 |
| ornith-q6 | Ornith-27B | 0.84 | 10 | 8 | 2 |
| qwen27-huihui | Qwen3.6-27B | 0.82 | 10 | 8 | 2 |
| gemma-iq4nl | Gemma-4-31B | 0.81 | 10 | 8 | 2 |
| qwen35-huihui | Qwen3.6-35B | 0.79 | 10 | 7 | 3 |
| gemma-udq4kxl | Gemma-4-31B | 0.80 | 10 | 7 | 3 |
| gemma-q4km | Gemma-4-31B | 0.79 | 10 | 7 | 3 |
| Model | Alias | Prompt tok/s | Predict tok/s | Peak VRAM | Peak RAM |
|---|---|---|---|---|---|
| gemma-iq4nl | Gemma-4-31B | 142 | 48 | 18.2 GiB | 12.4 GiB |
| gemma-udq4kxl | Gemma-4-31B | 138 | 46 | 18.5 GiB | 12.8 GiB |
| qwen27-huihui | Qwen3.6-27B | 128 | 42 | 19.1 GiB | 14.2 GiB |
| qwen35-huihui | Qwen3.6-35B | 112 | 36 | 21.3 GiB | 16.1 GiB |
| ornith-q5 | Ornith-27B | 135 | 44 | 18.8 GiB | 13.9 GiB |
| ornith-q6 | Ornith-27B | 125 | 41 | 19.4 GiB | 14.5 GiB |
| gemma-q4km | Gemma-4-31B | 140 | 47 | 18.0 GiB | 12.2 GiB |
† Archived models are no longer on disk. Historical results are preserved for reference.
| Suite | What It Measures | Scoring |
|---|---|---|
| Transcript Replay | Turn-level accuracy against known-good transcript fixtures. Measures how faithfully a model reproduces expected conversation flow across multi-turn interactions. | Partial score per fixture (0-2). Aggregate: matched turns / total turns, pass rate across fixtures. |
| Sim Compare | Semantic similarity against reference responses across coding and agentic scenarios. Measures output quality beyond exact string matching. | Pass/fail per scenario with similarity threshold. Scope-clean and tool-error-free sub-scores. |
| Coding Compare | Code generation quality across 16 fixtures. Measures correctness, style adherence, and edge-case handling in generated code. | Average score 0-1 across all fixtures. Pass threshold at 0.75. |
| Agentic Barrage | Multi-step agentic task execution under pressure. 10 scenarios with chained tool calls, state management, and error recovery. | Average score 0-1. Pass threshold at 0.70. Measures end-to-end task completion. |
| Speed | Prompt processing and token prediction throughput on the target hardware. Peak VRAM and RAM usage under full context load. | Tokens per second for prompt and predict phases. GiB for peak memory usage. |
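The pass thresholds for Coding Compare (0.75) and Agentic Barrage (0.70) can be applied to per-fixture scores roughly as follows. This is only a sketch: the function name `summarize`, the suite keys, and the choice to treat a score equal to the threshold as a pass are assumptions, and the example scores are made up for illustration rather than taken from any run.

```python
# Thresholds from the suite table; the dictionary keys are hypothetical names.
PASS_THRESHOLDS = {"coding_compare": 0.75, "agentic_barrage": 0.70}


def summarize(suite: str, scores: list[float]) -> dict[str, float]:
    """Compute average score and pass/fail counts for one model on one suite."""
    if not scores:
        raise ValueError("scores must not be empty")
    threshold = PASS_THRESHOLDS[suite]
    # Assumption: a score exactly at the threshold counts as a pass.
    passed = sum(1 for score in scores if score >= threshold)
    return {
        "avg_score": round(sum(scores) / len(scores), 2),
        "pass": passed,
        "fail": len(scores) - passed,
    }


# Illustrative per-fixture scores only, not real benchmark output.
example = summarize("coding_compare", [0.9, 0.8, 0.75, 0.6])
print(example)
```

The resulting dictionary has the same three fields as the Avg Score, Pass, and Fail columns in the coding and agentic tables.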
Primary sort: transcript replay partial score. Secondary: sim compare pass rate. Tertiary: coding compare average score. Agentic barrage is a tiebreaker. Speed is reported for hardware planning but does not affect ranking.
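The sort order described above amounts to a descending sort on a four-part key. A minimal sketch, assuming the Replay column of the summary tables is the partial score used as the primary key (the `Result` type and `rank` function are hypothetical names; the numbers are copied from the summary tables):

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    replay: float   # transcript replay, percent
    sim: float      # sim compare pass rate, percent
    coding: float   # coding compare average score
    agentic: float  # agentic barrage average score


RESULTS = [
    Result("qwen27-huihui", 95.1, 93.8, 0.89, 0.82),
    Result("qwen35-huihui", 94.8, 93.8, 0.87, 0.79),
    Result("ornith-q5", 95.9, 87.5, 0.91, 0.85),
    Result("ornith-q6", 95.5, 87.5, 0.90, 0.84),
    Result("gemma-iq4nl", 94.8, 93.8, 0.88, 0.81),
    Result("gemma-udq4kxl", 94.6, 87.5, 0.87, 0.80),
    Result("gemma-q4km", 94.4, 87.5, 0.86, 0.79),
]


def rank(results: list[Result]) -> list[Result]:
    """Sort by replay, then sim, then coding, then agentic, all descending."""
    return sorted(
        results,
        key=lambda r: (r.replay, r.sim, r.coding, r.agentic),
        reverse=True,
    )


ranking = [r.model for r in rank(RESULTS)]
print(ranking)
```

Under this assumption, qwen35-huihui and gemma-iq4nl tie on both replay (94.8%) and sim compare (93.8%), so the coding score decides their order and gemma-iq4nl ranks ahead. Speed is deliberately left out of the key.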
Archived models no longer receive new benchmark runs. Historical results are preserved for reference and longitudinal comparison.
Ornith-27B was retired from disk after Qwen3.6-27B scored within one point of it on transcript replay and within 0.03 on coding and agentic while passing one more sim compare scenario (15/16 vs 14/16). Gemma-4-31B variants were archived from a prior evaluation round; they remain useful for quantization comparison but were not competitive on agentic workloads.
All benchmarks run on a single host with Vulkan GPU acceleration via llama.cpp. Main chat uses MTP (Multi-Token Prediction) draft mode with n=3. Context window is 128k tokens for chat models, 8k for embedding/reranker.
All benchmark results are stored as JSON in benchmarks/summaries/. Each suite produces a summary.json with per-candidate scores and a published.json with the public-facing label. Full run manifests and per-fixture output are in benchmarks/model_eval/results/.
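A small loader for the stored summaries might look like the sketch below. The on-disk layout is an assumption: it supposes one subdirectory per suite under benchmarks/summaries/, each holding a summary.json, and it makes no claim about the fields inside those files. The function name `load_summaries` is hypothetical.

```python
import json
from pathlib import Path


def load_summaries(root: Path) -> dict[str, dict]:
    """Load every <suite>/summary.json under root, keyed by suite directory name.

    Assumed layout (not confirmed by the source): root/<suite>/summary.json.
    """
    summaries: dict[str, dict] = {}
    for path in sorted(root.glob("*/summary.json")):
        with path.open(encoding="utf-8") as handle:
            summaries[path.parent.name] = json.load(handle)
    return summaries
```

Calling `load_summaries(Path("benchmarks/summaries"))` would then return one parsed JSON document per suite, or an empty dictionary if the layout differs from the assumed one.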