Benchmarks & Evals

Performance you
can verify.

Expert models right-sized for enterprise workflows — benchmarked against frontier models, validated in production, across the tasks that define your business.

Talk to Us
90%
MMLU
85.5%
GPQA Diamond
72.5%
SWE-bench
28
task categories led
Model Performance

View model performance
and system benchmarks

USF-1 Mini leads across 28 benchmark task categories. Results validated April 2026.

MMLU
90%
General knowledge across 57 domains
+2 pts vs GPT-4o
MMLU-Pro
86%
Advanced reasoning — harder MMLU variant
+13 pts vs GPT-4o
GPQA Diamond
85.5%
Graduate-level science reasoning
+31.5 pts vs GPT-4o
SWE-bench
72.5%
Real-world software engineering
+39.5 pts vs GPT-4o
PolyMATH
53%
Complex mathematical reasoning
+12 pts vs GPT-4o
Key Insight
USF-1 Mini drops only 4 pts from MMLU → MMLU-Pro.
General-purpose frontier models drop 15+ points on the harder variant. UltraSafe's shallow drop signals principled reasoning — not surface-level pattern matching.
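The arithmetic behind this comparison can be checked from the figures published above. The following is a minimal sketch, assuming each "+N pts vs GPT-4o" figure is a simple difference in percentage points; the names `usf_scores`, `delta_vs_gpt4o`, `implied_baseline`, and `mmlu_pro_drop` are hypothetical and used only for illustration.

```python
# Scores and deltas as published on this page (percentage points).
usf_scores = {"MMLU": 90.0, "MMLU-Pro": 86.0, "GPQA Diamond": 85.5,
              "SWE-bench": 72.5, "PolyMATH": 53.0}
delta_vs_gpt4o = {"MMLU": 2.0, "MMLU-Pro": 13.0, "GPQA Diamond": 31.5,
                  "SWE-bench": 39.5, "PolyMATH": 12.0}


def implied_baseline(scores, deltas):
    """Assumption: each '+N pts' figure is our score minus the GPT-4o score."""
    return {name: scores[name] - deltas[name] for name in scores}


def mmlu_pro_drop(scores):
    """Points lost moving from MMLU to the harder MMLU-Pro variant."""
    return scores["MMLU"] - scores["MMLU-Pro"]


gpt4o_scores = implied_baseline(usf_scores, delta_vs_gpt4o)
usf_drop = mmlu_pro_drop(usf_scores)
gpt4o_drop = mmlu_pro_drop(gpt4o_scores)

print(f"USF-1 Mini drop: {usf_drop:g} pts")
print(f"GPT-4o drop (implied): {gpt4o_drop:g} pts")
```

Under that assumption, the implied GPT-4o baseline goes from 88% on MMLU to 73% on MMLU-Pro, which is where the 15-point figure comes from.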
Architecture

Right-sized.
Not watered-down.

General models carry the weight of everything they'll never use in your deployment. Expert models don't. A 0.1B model trained on your domain doesn't need to know Shakespeare — it needs to know your workflows.

That focus is why USF-1 Mini achieves 90% MMLU at a fraction of frontier scale — and why benchmark parity at lower scale drives the 5–8× cost reduction.

90%
MMLU at 0.1B params
<100ms
inference latency
5–8×
cost reduction
4 pts
MMLU→MMLU-Pro drop
Industry Performance

Domain matters.
Expert models excel where
it counts for your clients.

Beats GPT-4o, DeepSeek, and Gemini on targeted tasks — the benchmark categories that correspond directly to client industry workflows.

Healthcare
85.5%
GPQA Diamond
Graduate-level clinical reasoning, 31.5 pts ahead of GPT-4o on domain science
Radiology impressions matched to physician tone. 40% productivity gain.
Financial Services
86%
MMLU-Pro
Only a 4-pt drop from MMLU to MMLU-Pro vs. 15+ pts for general models: principled reasoning, not pattern matching
1M+ docs/month automated. 90%+ accuracy vs. 75% manual baseline.
Software Engineering
72.5%
SWE-bench
Real-world engineering tasks: 2.2× GPT-4o (33%) and 30.5 pts ahead of DeepSeek V3 (42%)
Agentic code execution, PR automation, CI/CD integration in production.
Defense & Intelligence
85.5%
GPQA Diamond
Threat analysis at graduate-science level — deployable fully air-gapped with zero internet outbound
Cryptographic weight sealing. Full immutable audit trails. On-prem only.
Mathematics & Quant
53%
PolyMATH
Outperforms GPT-4o (41%) and DeepSeek V3 (45%) on complex mathematical reasoning
Quantitative risk, pricing models, and scenario analysis in financial workflows.
Supply Chain & Logistics
90%
MMLU
Frontier-level broad knowledge at 0.1B parameters — scales to 1M+ document workflows
Reconciliation and supply chain automation. Hours, not weeks.
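A lead over a baseline can be quoted in percentage points, as a multiple, or as a relative percentage, and the three are easy to confuse. The sketch below computes all three for the SWE-bench figures quoted in the Software Engineering card; the `compare` helper is a hypothetical name introduced for illustration, not part of any UltraSafe tooling.

```python
def compare(score, baseline):
    """Return (point lead, ratio, relative lead in percent) for two scores."""
    point_lead = score - baseline
    ratio = score / baseline
    relative_lead_pct = (ratio - 1.0) * 100.0
    return point_lead, ratio, relative_lead_pct


# SWE-bench scores as quoted on this page (percent).
usf_swe_bench = 72.5
baselines = {"GPT-4o": 33.0, "DeepSeek V3": 42.0}

results = {name: compare(usf_swe_bench, base) for name, base in baselines.items()}
for name, (pts, ratio, rel) in results.items():
    print(f"vs {name}: +{pts:.1f} pts, {ratio:.2f}x, {rel:.0f}% relative lead")
```

Point differences are the most direct to read alongside the other cards, which all report leads in points.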
Benchmark to Production

These aren't lab scores.
They're live.

Every metric below comes from active client deployments — not controlled benchmarks. Real workflows. Real data. Real perimeter.

Clinical Workflow
60+
min saved / shift
40%
productivity gain
40%+
less dictation

Radiologist-ready impressions generated in each physician's voice. Radiologists review and accept — not rewrite.

Financial Reconciliation
1M+
docs / month
90%+
automated accuracy
75%→90%+
vs manual baseline

A government enterprise automates reconciliation and supply chain workflows, producing the exact artifacts staff need for the exact processes they follow.
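To put the accuracy change in volume terms, here is a rough worked example in Python. It assumes "accuracy" means the share of documents processed correctly, and it uses the stated floor values (1M documents per month, 90% accuracy), so the real figures would be at least this favorable. The function name `errors_per_month` is hypothetical.

```python
def errors_per_month(docs_per_month, accuracy):
    """Expected number of incorrectly processed documents per month."""
    return round(docs_per_month * (1.0 - accuracy))


docs = 1_000_000                            # stated floor: "1M+ docs / month"
manual = errors_per_month(docs, 0.75)       # 75% manual baseline
automated = errors_per_month(docs, 0.90)    # stated floor: "90%+"
reduction = 1.0 - automated / manual        # share of errors eliminated

print(f"manual errors/month:    {manual:,}")
print(f"automated errors/month: {automated:,}")
print(f"error reduction:        {reduction:.0%}")
```

Under these assumptions, moving from 75% to 90% accuracy means 150,000 fewer documents per month that need correction.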

Video Classification
50
params per request
Real-time
ad placement
Every video
tagged at scale

Global advertising platform classifies video content across 50 parameters per request, powering real-time ad placement decisions.

Global Payroll
Weeks
not months
Parallel
not sequential
Air-gapped
data perimeter

Global payroll company expands into new markets using air-gapped UltraSafe models. Client payroll data never leaves the secure perimeter.

Cost of Ownership

Frontier accuracy.
A fraction of the cost.

Right-sized expert models need less compute, less memory, and less infrastructure than general-purpose LLMs. The savings compound at scale.

No per-token API markup. No cloud egress fees. No per-seat licensing. Every inference runs inside your perimeter.

5–8×
lower total cost of ownership
vs. frontier API providers
100M+
requests handled annually
in active deployments
12+
industries served
healthcare, finance, defense, aviation…
45–82%
savings vs. standard pricing
blended across workload profiles
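The 5–8× and 45–82% figures are quoted against different baselines (frontier API providers and standard pricing, respectively), so they are not expected to match. For readers who want to compare them on one scale, this small sketch converts between a cost multiplier and a percentage saving; both function names are hypothetical.

```python
def savings_from_multiplier(multiplier):
    """An N-times lower cost corresponds to a saving of 1 - 1/N."""
    return 1.0 - 1.0 / multiplier


def multiplier_from_savings(savings):
    """A fractional saving s corresponds to a cost 1 / (1 - s) times lower."""
    return 1.0 / (1.0 - savings)


for m in (5, 8):
    print(f"{m}x lower cost = {savings_from_multiplier(m):.1%} saving")
for s in (0.45, 0.82):
    print(f"{s:.0%} saving = {multiplier_from_savings(s):.2f}x lower cost")
```

For example, a 5× lower cost is an 80% saving, and an 8× lower cost is an 87.5% saving.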
Featured Research

The science behind the scores.

READY TO VERIFY

Start with your most
valuable problem.

Our AI qualifies your use case and books a meeting with our team — in minutes, not days.