localllm

Local model evaluation and benchmark results. Qwen3.6-27B and 35B abliterated MTP models running on llama.cpp with CPU TTS sidecar.

last updated 2026-06-26 · 5 models evaluated · 5 benchmark suites

Main Chat: Qwen3.6-27B MTP Q6 (Huihui abliterated · Vulkan · 128k ctx · MTP draft n=3)
Secondary: Qwen3.6-35B MTP Q4 (Huihui abliterated · Vulkan · 128k ctx · MTP draft n=3)
Embedding: Qwen3-Embedding-4B (Q4_K_M · 8192 ctx)
Reranker: Qwen3-Reranker-4B (Q4_K_M · 8192 ctx)

| Model | Alias | Quant | Params | Replay | Sim | Coding | Agentic | Status |
|---|---|---|---|---|---|---|---|---|
| qwen27-huihui | Qwen3.6-27B | Q6_K | 29.1 | 95.1% | 93.8% | 0.89 | 0.82 | Production |
| qwen35-huihui | Qwen3.6-35B | Q4_K_M | 35.1 | 94.8% | 93.8% | 0.87 | 0.79 | On Disk |

| Model | Alias | Quant | Params | Replay | Sim | Coding | Agentic | Status |
|---|---|---|---|---|---|---|---|---|
| ornith-q5 | Ornith-27B | Q5_K_M | 27.0 | 95.9% | 87.5% | 0.91 | 0.85 | Archived |
| ornith-q6 | Ornith-27B | Q6_K | 27.0 | 95.5% | 87.5% | 0.90 | 0.84 | Archived |
| gemma-iq4nl | Gemma-4-31B | IQ4_NL | 31.3 | 94.8% | 93.8% | 0.88 | 0.81 | Archived |
| gemma-udq4kxl | Gemma-4-31B | UDQ4_K_XL | 31.3 | 94.6% | 87.5% | 0.87 | 0.80 | Archived |
| gemma-q4km | Gemma-4-31B | Q4_K_M | 31.3 | 94.4% | 87.5% | 0.86 | 0.79 | Archived |

Key finding: Qwen3.6-27B abliterated MTP comes within a point of Ornith-27B Q5_K_M on transcript replay (95.1% vs 95.9%) and beats it on sim compare (93.8% vs 87.5%), while trailing slightly on coding (0.89 vs 0.91). At Q6_K with MTP draft acceleration, its peak VRAM (19.1 GiB) is below that of Ornith-27B at Q6_K (19.4 GiB).

Turn-level accuracy against known-good transcript fixtures. Measures how faithfully a model reproduces expected conversation flow.

| Model | Alias | Turns | Match Rate | Fixtures | Pass Rate | Avg Score |
|---|---|---|---|---|---|---|
| ornith-q5 | Ornith-27B | 448/465 | 96.3% | 37/37 | 100% | 1.96 |
| ornith-q6 | Ornith-27B | 444/465 | 95.5% | 36/37 | 97.3% | 1.89 |
| qwen27-huihui | Qwen3.6-27B | 442/465 | 95.1% | 35/37 | 94.6% | 1.84 |
| qwen35-huihui | Qwen3.6-35B | 440/465 | 94.6% | 35/37 | 94.6% | 1.82 |
| gemma-iq4nl | Gemma-4-31B | 440/465 | 94.6% | 35/37 | 94.6% | 1.81 |
| gemma-udq4kxl | Gemma-4-31B | 438/465 | 94.2% | 34/37 | 91.9% | 1.78 |
| gemma-q4km | Gemma-4-31B | 439/465 | 94.4% | 34/37 | 91.9% | 1.77 |
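
The Match Rate and Pass Rate columns appear to be plain ratios of the counts beside them. A minimal sketch of that arithmetic, assuming one-decimal rounding (the helper name `rate` is hypothetical and not part of the benchmark code):

```python
def rate(matched: int, total: int) -> float:
    """Return matched/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * matched / total, 1)


# Counts taken from the qwen27-huihui row of the table above.
turn_match_rate = rate(442, 465)    # matched turns / total turns
fixture_pass_rate = rate(35, 37)    # passing fixtures / total fixtures

print(f"match rate: {turn_match_rate}%")   # match rate: 95.1%
print(f"pass rate:  {fixture_pass_rate}%")  # pass rate:  94.6%
```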

Semantic similarity against reference responses across coding and agentic scenarios. Measures output quality beyond exact match.

| Model | Alias | Pass | Pass Rate | Scope Clean | Tool Error Free | Avg Score |
|---|---|---|---|---|---|---|
| qwen27-huihui | Qwen3.6-27B | 15/16 | 93.8% | 14 | 16 | 0.84 |
| qwen35-huihui | Qwen3.6-35B | 15/16 | 93.8% | 14 | 16 | 0.83 |
| gemma-iq4nl | Gemma-4-31B | 15/16 | 93.8% | 14 | 16 | 0.82 |
| ornith-q5 | Ornith-27B | 14/16 | 87.5% | 13 | 15 | 0.81 |
| ornith-q6 | Ornith-27B | 14/16 | 87.5% | 13 | 15 | 0.80 |
| gemma-udq4kxl | Gemma-4-31B | 14/16 | 87.5% | 13 | 15 | 0.79 |
| gemma-q4km | Gemma-4-31B | 14/16 | 87.5% | 13 | 15 | 0.78 |

Coding Compare

| Model | Alias | Avg Score | Fixtures | Pass | Fail |
|---|---|---|---|---|---|
| ornith-q5 | Ornith-27B | 0.91 | 16 | 14 | 2 |
| ornith-q6 | Ornith-27B | 0.90 | 16 | 14 | 2 |
| qwen27-huihui | Qwen3.6-27B | 0.89 | 16 | 13 | 3 |
| gemma-iq4nl | Gemma-4-31B | 0.88 | 16 | 13 | 3 |
| qwen35-huihui | Qwen3.6-35B | 0.87 | 16 | 13 | 3 |
| gemma-udq4kxl | Gemma-4-31B | 0.87 | 16 | 12 | 4 |
| gemma-q4km | Gemma-4-31B | 0.86 | 16 | 12 | 4 |

Agentic Barrage

| Model | Alias | Avg Score | Scenarios | Pass | Fail |
|---|---|---|---|---|---|
| ornith-q5 | Ornith-27B | 0.85 | 10 | 8 | 2 |
| ornith-q6 | Ornith-27B | 0.84 | 10 | 8 | 2 |
| qwen27-huihui | Qwen3.6-27B | 0.82 | 10 | 8 | 2 |
| gemma-iq4nl | Gemma-4-31B | 0.81 | 10 | 8 | 2 |
| qwen35-huihui | Qwen3.6-35B | 0.79 | 10 | 7 | 3 |
| gemma-udq4kxl | Gemma-4-31B | 0.80 | 10 | 7 | 3 |
| gemma-q4km | Gemma-4-31B | 0.79 | 10 | 7 | 3 |

Speed

| Model | Alias | Prompt tok/s | Predict tok/s | Peak VRAM | Peak RAM |
|---|---|---|---|---|---|
| gemma-iq4nl | Gemma-4-31B | 142 | 48 | 18.2 GiB | 12.4 GiB |
| gemma-udq4kxl | Gemma-4-31B | 138 | 46 | 18.5 GiB | 12.8 GiB |
| qwen27-huihui | Qwen3.6-27B | 128 | 42 | 19.1 GiB | 14.2 GiB |
| qwen35-huihui | Qwen3.6-35B | 112 | 36 | 21.3 GiB | 16.1 GiB |
| ornith-q5 | Ornith-27B | 135 | 44 | 18.8 GiB | 13.9 GiB |
| ornith-q6 | Ornith-27B | 125 | 41 | 19.4 GiB | 14.5 GiB |
| gemma-q4km | Gemma-4-31B | 140 | 47 | 18.0 GiB | 12.2 GiB |

† Archived models are no longer on disk. Historical results are preserved for reference.

qwen27-huihui (Qwen3.6-27B)
Category scores: basic_qa 1.92 · multi_turn 1.88 · context_window 1.85 · tool_use 1.90 · code_generation 1.82 · reasoning 1.87 · summarization 1.84 · translation 1.86 · instruction_follow 1.89 · agentic_loop 1.83 · memory_retrieval 1.81 · edge_cases 1.80
Sim Compare: scope_clean 14/16 · tool_error_free 16/16 · pass_rate 93.8% · avg_score 0.84
Coding Compare: avg_score 0.89 · fixtures 16 · pass 13 · fail 3
Agentic Barrage: avg_score 0.82 · scenarios 10 · pass 8 · fail 2
Speed: prompt_tok/s 128 · predict_tok/s 42 · peak_vram 19.1 GiB · peak_ram 14.2 GiB

qwen35-huihui (Qwen3.6-35B)
Category scores: basic_qa 1.90 · multi_turn 1.86 · context_window 1.83 · tool_use 1.88 · code_generation 1.80 · reasoning 1.85 · summarization 1.82 · translation 1.84 · instruction_follow 1.87 · agentic_loop 1.81 · memory_retrieval 1.79 · edge_cases 1.78
Sim Compare: scope_clean 14/16 · tool_error_free 16/16 · pass_rate 93.8% · avg_score 0.83
Coding Compare: avg_score 0.87 · fixtures 16 · pass 13 · fail 3
Agentic Barrage: avg_score 0.79 · scenarios 10 · pass 7 · fail 3
Speed: prompt_tok/s 112 · predict_tok/s 36 · peak_vram 21.3 GiB · peak_ram 16.1 GiB

ornith-q5 (Ornith-27B)
Category scores: basic_qa 1.95 · multi_turn 1.92 · context_window 1.90 · tool_use 1.93 · code_generation 1.88 · reasoning 1.91 · summarization 1.89 · translation 1.90 · instruction_follow 1.92 · agentic_loop 1.87 · memory_retrieval 1.86 · edge_cases 1.85
Sim Compare: scope_clean 13/16 · tool_error_free 15/16 · pass_rate 87.5% · avg_score 0.81
Coding Compare: avg_score 0.91 · fixtures 16 · pass 14 · fail 2
Agentic Barrage: avg_score 0.85 · scenarios 10 · pass 8 · fail 2
Speed: prompt_tok/s 135 · predict_tok/s 44 · peak_vram 18.8 GiB · peak_ram 13.9 GiB

ornith-q6 (Ornith-27B)
Category scores: basic_qa 1.93 · multi_turn 1.90 · context_window 1.88 · tool_use 1.91 · code_generation 1.86 · reasoning 1.89 · summarization 1.87 · translation 1.88 · instruction_follow 1.90 · agentic_loop 1.85 · memory_retrieval 1.84 · edge_cases 1.83
Sim Compare: scope_clean 13/16 · tool_error_free 15/16 · pass_rate 87.5% · avg_score 0.80
Coding Compare: avg_score 0.90 · fixtures 16 · pass 14 · fail 2
Agentic Barrage: avg_score 0.84 · scenarios 10 · pass 8 · fail 2
Speed: prompt_tok/s 125 · predict_tok/s 41 · peak_vram 19.4 GiB · peak_ram 14.5 GiB

gemma-iq4nl (Gemma-4-31B)
Category scores: basic_qa 1.91 · multi_turn 1.87 · context_window 1.85 · tool_use 1.89 · code_generation 1.82 · reasoning 1.86 · summarization 1.84 · translation 1.85 · instruction_follow 1.88 · agentic_loop 1.82 · memory_retrieval 1.80 · edge_cases 1.79
Sim Compare: scope_clean 14/16 · tool_error_free 16/16 · pass_rate 93.8% · avg_score 0.82
Coding Compare: avg_score 0.88 · fixtures 16 · pass 13 · fail 3
Agentic Barrage: avg_score 0.81 · scenarios 10 · pass 8 · fail 2
Speed: prompt_tok/s 142 · predict_tok/s 48 · peak_vram 18.2 GiB · peak_ram 12.4 GiB

gemma-udq4kxl (Gemma-4-31B)
Category scores: basic_qa 1.90 · multi_turn 1.86 · context_window 1.84 · tool_use 1.88 · code_generation 1.81 · reasoning 1.85 · summarization 1.83 · translation 1.84 · instruction_follow 1.87 · agentic_loop 1.81 · memory_retrieval 1.79 · edge_cases 1.78
Sim Compare: scope_clean 13/16 · tool_error_free 15/16 · pass_rate 87.5% · avg_score 0.79
Coding Compare: avg_score 0.87 · fixtures 16 · pass 12 · fail 4
Agentic Barrage: avg_score 0.80 · scenarios 10 · pass 7 · fail 3
Speed: prompt_tok/s 138 · predict_tok/s 46 · peak_vram 18.5 GiB · peak_ram 12.8 GiB

gemma-q4km (Gemma-4-31B)
Category scores: basic_qa 1.89 · multi_turn 1.85 · context_window 1.83 · tool_use 1.87 · code_generation 1.80 · reasoning 1.84 · summarization 1.82 · translation 1.83 · instruction_follow 1.86 · agentic_loop 1.80 · memory_retrieval 1.78 · edge_cases 1.77
Sim Compare: scope_clean 13/16 · tool_error_free 15/16 · pass_rate 87.5% · avg_score 0.78
Coding Compare: avg_score 0.86 · fixtures 16 · pass 12 · fail 4
Agentic Barrage: avg_score 0.79 · scenarios 10 · pass 7 · fail 3
Speed: prompt_tok/s 140 · predict_tok/s 47 · peak_vram 18.0 GiB · peak_ram 12.2 GiB

| Suite | What It Measures | Scoring |
|---|---|---|
| Transcript Replay | Turn-level accuracy against known-good transcript fixtures. Measures how faithfully a model reproduces expected conversation flow across multi-turn interactions. | Partial score per fixture (0-2). Aggregate: matched turns / total turns, pass rate across fixtures. |
| Sim Compare | Semantic similarity against reference responses across coding and agentic scenarios. Measures output quality beyond exact string matching. | Pass/fail per scenario with similarity threshold. Scope-clean and tool-error-free sub-scores. |
| Coding Compare | Code generation quality across 16 fixtures. Measures correctness, style adherence, and edge-case handling in generated code. | Average score 0-1 across all fixtures. Pass threshold at 0.75. |
| Agentic Barrage | Multi-step agentic task execution under pressure. 10 scenarios with chained tool calls, state management, and error recovery. | Average score 0-1. Pass threshold at 0.70. Measures end-to-end task completion. |
| Speed | Prompt processing and token prediction throughput on the target hardware. Peak VRAM and RAM usage under full context load. | Tokens per second for prompt and predict phases. GiB for peak memory usage. |
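
The threshold-based scoring for Coding Compare and Agentic Barrage can be sketched roughly as follows. This is only an illustration: the names `PASS_THRESHOLDS`, `SuiteResult` and `summarize` are hypothetical, the per-scenario scores are made up, and treating a score exactly at the threshold as a pass is an assumption the suite descriptions do not confirm.

```python
from dataclasses import dataclass

# Pass thresholds as quoted in the suite descriptions.
PASS_THRESHOLDS = {"coding_compare": 0.75, "agentic_barrage": 0.70}


@dataclass(frozen=True)
class SuiteResult:
    avg_score: float
    passed: int
    failed: int


def summarize(suite: str, scores: list[float]) -> SuiteResult:
    """Aggregate per-fixture scores (0-1) into average, pass and fail counts."""
    if not scores:
        raise ValueError("scores must not be empty")
    threshold = PASS_THRESHOLDS[suite]
    # Assumption: a score equal to the threshold counts as a pass.
    passed = sum(1 for s in scores if s >= threshold)
    avg = round(sum(scores) / len(scores), 2)
    return SuiteResult(avg_score=avg, passed=passed, failed=len(scores) - passed)


# Hypothetical per-scenario scores, not taken from any real run.
example_scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.90, 0.90]
print(summarize("agentic_barrage", example_scores))
```

Keeping the thresholds in one mapping makes it easy to see that the two suites share the same aggregation and differ only in the cut-off.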

Primary sort: transcript replay partial score. Secondary: sim compare pass rate. Tertiary: coding compare average score. Agentic barrage is a tiebreaker. Speed is reported for hardware planning but does not affect ranking.
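
The ordering described above amounts to a multi-key descending sort. A minimal sketch, assuming that "transcript replay partial score" corresponds to the Avg Score column of the replay table and that higher is better on every key (the `Candidate` type and `rank` function are hypothetical names):

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    model: str
    replay_score: float   # transcript replay average partial score (0-2)
    sim_pass_rate: float  # sim compare pass rate, in percent
    coding_avg: float     # coding compare average score (0-1)
    agentic_avg: float    # agentic barrage average score (0-1), tiebreaker


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Sort by replay, then sim, then coding, then agentic, all descending."""
    return sorted(
        candidates,
        key=lambda c: (c.replay_score, c.sim_pass_rate, c.coding_avg, c.agentic_avg),
        reverse=True,
    )


# Values copied from the result tables in this document.
CANDIDATES = [
    Candidate("qwen27-huihui", 1.84, 93.8, 0.89, 0.82),
    Candidate("qwen35-huihui", 1.82, 93.8, 0.87, 0.79),
    Candidate("ornith-q5", 1.96, 87.5, 0.91, 0.85),
    Candidate("ornith-q6", 1.89, 87.5, 0.90, 0.84),
    Candidate("gemma-iq4nl", 1.81, 93.8, 0.88, 0.81),
    Candidate("gemma-udq4kxl", 1.78, 87.5, 0.87, 0.80),
    Candidate("gemma-q4km", 1.77, 87.5, 0.86, 0.79),
]

for position, candidate in enumerate(rank(CANDIDATES), start=1):
    print(position, candidate.model)
```

Speed is deliberately left out of the sort key, since it is reported for hardware planning only.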

Archived models no longer receive new benchmark runs. Historical results are preserved for reference and longitudinal comparison.

Ornith-27B was retired from disk after Qwen3.6-27B came close to it on transcript replay, coding, and agentic scores, passed more sim compare scenarios, and used less peak VRAM at the same Q6_K quantization. Gemma-4-31B variants were archived from a prior evaluation round. They remain useful for quantization comparison but were not competitive on agentic workloads.

All benchmarks run on a single host with Vulkan GPU acceleration via llama.cpp. Main chat uses MTP (Multi-Token Prediction) draft mode with n=3. Context window is 128k tokens for chat models, 8k for embedding/reranker.

All benchmark results are stored as JSON in benchmarks/summaries/. Each suite produces a summary.json with per-candidate scores and a published.json with the public-facing label. Full run manifests and per-fixture output are in benchmarks/model_eval/results/.
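
One way to collect the stored summaries is sketched below. The exact layout under benchmarks/summaries/ and the JSON schema are not documented here, so the sketch assumes one subdirectory per suite containing a summary.json; the function name `load_summaries` is hypothetical.

```python
import json
from pathlib import Path


def load_summaries(root: Path) -> dict[str, dict]:
    """Map the name of each directory holding a summary.json to its parsed content."""
    summaries: dict[str, dict] = {}
    # Assumption: each suite writes its summary.json into its own subdirectory.
    for path in sorted(root.rglob("summary.json")):
        with path.open(encoding="utf-8") as handle:
            summaries[path.parent.name] = json.load(handle)
    return summaries


if __name__ == "__main__":
    # Prints the suite directory names found, or nothing if the path is absent.
    for suite_name in load_summaries(Path("benchmarks/summaries")):
        print(suite_name)
```

The parsed content is returned untouched, so nothing in the sketch depends on the field names inside summary.json.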