AI Operations
Ornith-1.0-35B: A Working Operator's Notes on the New Agentic-Coding Model
A verified field guide to Ornith-1.0-35B for technical small businesses: what it is, what the benchmarks show, and how to run it.
Published 2026-07-05 · By Claire Miller
Ornith-1.0-35B landed on Hugging Face as a 35-billion-parameter Mixture-of-Experts (MoE) coding model under an MIT license. The team behind it, DeepReinforce, calls it "self-improving" and emphasizes the codebase reasons: the model learns to generate not just the solutions but the scaffolds that drive them. For a small technical business, the more interesting question is not whether Ornith is real, it is whether this 35B MoE is a usable local coding model for agentic workloads. These are working notes.
What we verified, what we did not
The model card on Hugging Face is unusually thorough and most of what follows is drawn from it directly. The card is at huggingface.co/deepreinforce-ai/Ornith-1.0-35B, with a release blog post at deep-reinforce.com/ornith_1_0.html. Every number, every column, every serving-recipe flag below came from those two sources.
A few things we could not verify:
- The team's stated results against proprietary or unreleased base models (Gemma 4, Qwen 3.5/3.6) could not be cross-checked against the upstream model cards in our environment, so we treat them as the publisher's own measurements rather than independently verified.
- Endpoint pricing for hosted Ornith (if any) does not exist as of this post. The model is open-weight under MIT; running it is a self-hosting project.
- Long-term maintenance and roadmap is the team's word. The card promises continuity but we cannot verify.
The model family in one paragraph
Ornith-1.0 is a family of open-weight coding models trained for tool-calling and agentic coding workloads. The family ships in four sizes: 9B Dense, 31B Dense, 35B MoE, and 397B MoE. The 35B we are profiling is the lightweight MoE member of the family and the one positioned for "single-node efficient deployment." Architecturally it is a Qwen 3.5-derived MoE with multimodal handling: Hugging Face tags it image-text-to-text in addition to text-generation, and the config exposes Qwen3_5MoeForConditionalGeneration. License: MIT, globally accessible, no regional gating.
If you want the short read of the family: it is what Qwen 3.5-MoE looks like when it has been aggressively post-trained for the loop where a model calls tools, gets results, writes more code, calls more tools, and finishes a pull request.
What the family is good at
The card frames the family around four benchmarks: Terminal-Bench 2.1, SWE-Bench (Verified, Pro, Multilingual variants), NL2Repo, and a new benchmark called ClawEval. The four benchmarks measure four different skills the model needs for an end-to-end agentic coding job:
- Terminal-Bench measures whether the model can drive a coding agent to completion inside a shell. Two sub-runs are reported: one with the Terminus-2 framework and one with Claude Code 2.1.126 as the harness.
- SWE-Bench measures whether the model can resolve real GitHub issues against real repositories.
- NL2Repo measures whether the model can take a natural-language description of an application and produce a working repository.
- ClawEval is described as an "agentic code benchmark over real-user task distributions." Newly minted and not yet widely cross-validated, but the model card reports it explicitly.
Three auxiliary SWE Atlas variants (Question-Answer, Retrieval-Fix, Test-Write) round out the headline numbers and test subsets of the SWE skill area.
Headline numbers
The card reports Ornith-1.0-35B against four reference models: Qwen 3.5 35B, Qwen 3.6 35B, Gemma 4 31B, and Qwen 3.5 397B. Reported scores:
| Benchmark{tag}> | Ornith-1.0-35B{tag}> | Qwen 3.5 35B{tag}> | Qwen 3.6 35B{tag}> | Gemma 4 31B{tag}> | Qwen 3.5 397B{tag}> |
|---|---|---|---|---|---|
| Terminal-Bench 2.1 (Terminus-2){tag}> | 64.2{tag}> | 41.4{tag}> | 52.5{tag}> | 42.1{tag}> | 53.5{tag}> |
| Terminal-Bench 2.1 (Claude Code){tag}> | 62.8{tag}> | 38.9{tag}> | 49.2{tag}> | not reported{tag}> | 48.6{tag}> |
| SWE-Bench Verified{tag}> | 75.6{tag}> | 70.0{tag}> | 73.4{tag}> | 52.0{tag}> | 76.4{tag}> |
| SWE-Bench Pro{tag}> | 50.4{tag}> | 44.6{tag}> | 49.5{tag}> | 35.7{tag}> | 51.6{tag}> |
| SWE-Bench Multilingual{tag}> | 69.3{tag}> | 60.3{tag}> | 67.2{tag}> | 51.7{tag}> | 69.3{tag}> |
| NL2Repo{tag}> | 34.6{tag}> | 20.5{tag}> | 29.4{tag}> | 15.5{tag}> | 36.8{tag}> |
| ClawEval average{tag}> | 69.8{tag}> | 65.4{tag}> | 68.7{tag}> | 48.5{tag}> | 70.7{tag}> |
| SWE Atlas - QnA{tag}> | 37.1{tag}> | 13.2{tag}> | 15.5{tag}> | not reported{tag}> | 20.4{tag}> |
| SWE Atlas - RF{tag}> | 29.7{tag}> | 10.2{tag}> | 11.4{tag}> | not reported{tag}> | 18.4{tag}> |
| SWE Atlas - TW{tag}> | 27.8{tag}> | 9.8{tag}> | 13.3{tag}> | not reported{tag}> | 18.5{tag}> |
The pattern across the table is consistent: Ornith-1.0-35B beats every other same-class reference on the agentic-coding benchmarks, and trails the 397B sibling only by small margins on SWE-Bench and ClawEval. The gap between the 35B Ornith and the Qwen-3.5-MoE-397B on SWE-Bench Verified is less than one point (75.6 vs 76.4). On Terminal-Bench, the gap is larger (10 points), which suggests the 35B is meaningfully weaker at long-horizon tool orchestration than the 397B sibling.
A small but important detail the card points out. The chat-template needs adjustment for the Qwen-derived serving stacks, and any tool used for evaluation should align with vLLM's reasoning_content key. Operators reproducing these numbers will trip over the chat-template mismatch on first try. The card documents the fix explicitly.
How the benchmarks were measured
The card includes the methodology for each row in a single note block at the bottom of the table. The complete list:
- Terminal-Bench 2.1 (Terminus-2): Harbor/Terminus-2 framework, parser=json, temperature=1.0, top_p=1.0, 128K context. Each run uses a 4-hour timeout with 32 CPU cores and 48GB RAM. Results averaged over 5 runs.
- Terminal-Bench 2.1 (Claude Code): Claude Code 2.1.126 harness, parser=json, temperature=1.0, top_p=1.0, max_new_tokens=131072. Results averaged over 5 runs. Qwen chat template requires modification.
- SWE-Bench Verified, Pro, Multilingual: OpenHands harness, temperature=1.0, top_p=0.95, 256K context.
- SWE Atlas QnA, RF, TW: mini SWE agent harness, temperature=1.0, top_p=0.95, 128K context. Results averaged over 5 runs.
- NL2Repo: temperature=1.0, top_p=1.0, 400K context, 48K output, anti-hacking filters.
- ClawEval: temperature=0.6, 256K context.
The temperature=1.0 settings across most of the benchmarks are worth noting. They will produce measurably different runs from the temperature=0.6 single-sample numbers that most evaluators use for the OpenAI/Anthropic models. Comparisons across these benchmarks will reflect the harness and temperature, not just the model.
How to run it
The card ships serving recipes for vLLM, SGLang, Hugging Face Transformers, llama.cpp, and Ollama. Required runtimes:
- Transformers ≥ 5.8.1
- vLLM ≥ 0.19.1
- SGLang ≥ 0.5.9
The headline recipe is a single 8×80GB GPU node (tensor-parallel 8). With 8×80GB, the card's vLLM recipe sets --max-model-len 262144 for a 262K context length, with --enable-prefix-caching, --enable-auto-tool-choice --tool-call-parser qwen3_xml, and --reasoning-parser qwen3. The 262K context is the headline capability that distinguishes this model from a number of 30-ish-billion contemporaries.
For smaller hardware, the Hugging Face community publishes quantized variants:
deepreinforce-ai/Ornith-1.0-35B-GGUF(GGUF, for llama.cpp, Ollama, Atomic.chat)deepreinforce-ai/Ornith-1.0-35B-FP8(FP8)AEON-7/Ornith-1.0-35B-AEON-Ultimate-Uncensored-NVFP4(NVFP4, NVIDIA-optimized 4-bit)SC117/Ornith-1.0-35B-MTP-APEX-GGUF(community-finetuned GGUF with MTP speculative decoding)
For a small business running on a single 24GB workstation, the GGUF build via Ollama is the realistic path: ollama run hf.co/deepreinforce-ai/Ornith-1.0-35B-GGUF. Realistic performance at 24GB will be slower than the headline 8×80GB numbers, particularly at long contexts.
How the model behaves at the API level
The card documents the model's two distinctive output behaviors:
Reasoning trace. Every assistant turn opens with a think block before the final answer, and vLLM/SGLang can be configured to return the chain-of-thought in a separate reasoning_content field. Operators building agents that read the reasoning should split on </think> and handle the trace explicitly.
Tool calling. The model emits well-formed tool_call blocks that surface as OpenAI-style tool_calls in the API response. The serving stacks already parse those correctly when --tool-call-parser qwen3_xml (vLLM) or --tool-call-parser qwen3_coder (SGLang) is configured. The card ships a complete Python example showing tool use end-to-end.
For a small business wiring Ornith into an agent pipeline, the two behaviors are good news. The reasoning trace is what you want from a coding model, and the tool-call fidelity is what makes the agent loop actually work.
What fits Ornith into the agentic-coding stack
The card lists integration paths with major agent harnesses, all of which sit on top of the OpenAI-compatible serving endpoint:
- Hermes Agent: set
OPENAI_BASE_URLandMODEL. - OpenCode: drop a provider block into
~/.config/opencode/opencode.jsonpointing at the local server. - OpenHands: route through LiteLLM with
LLM_MODEL="openai/deepreinforce-ai/Ornith-1.0-35B". - Unsloth Studio: direct via Unsloth's
FastLanguageModel.from_pretrained(...)with a 4-bit load option. - Ollama: run the GGUF build as a chat model.
- llama.cpp: serve the GGUF build with an OpenAI-compatible API on port 8000.
For a small business running a coding agent on customer code or on an internal codebase, the working starting stack in 2026 is: Ornith-1.0-35B-GGUF served by Ollama, OpenCode as the coding harness, and a small wrapper that records the agent's tool calls and cites the reasoning traces. That gets you the codepath without eight H100s.
What this changes for a small technical business
For a small business in 2026, the relevant shifts are:
A 35B that hits 75.6 on SWE-Bench Verified is now runnable at home. The 35B Ornith is close to the 397B sibling on the headline metric, and the difference between running it locally versus calling an API becomes meaningful when the work is on customer code. Local models do not leak source to a third party; third-party APIs do.
A 262K context window in a 35B is a real engineering artifact. Operators wiring long-context coding agents into their own codebases have been waiting for a model that fits a real repository into the context. This one is close.
A reasoning model with proper tool-call fidelity is what makes the agent loop work without a babysitter. The combination of reasoning_content and OpenAI-compatible tool_calls makes Ornith a drop-in for any coding agent that already accepts an OpenAI-compatible backend.
MIT licensing means no per-token or per-seat cost. The model can be forked, fine-tuned, served internally, redistributed. The economics are direct. The hidden cost is the operator time spent serving it and integrating it.
What to do this week
For a small technical business in 2026, the practical project is:
- Pull
deepreinforce-ai/Ornith-1.0-35B-GGUFand run it through Ollama on a single GPU workstation. - Wire OpenCode at the local endpoint.
- Run a small set of representative tasks from the business's own codebase.
- Compare the output against whatever coding agent the business has been using.
- If the comparison holds, commit Ornith as a deployment option. If not, the experiment is bounded.
The cost of the experiment is one developer-day plus the GPU hours. The cost of not running the experiment is that a third-party API continues to handle code the business might rather keep in-house.
Source discipline
This article is original synthesis informed by the Ornith-1.0-35B model card and the DeepReinforce blog post. Every benchmark number was taken from the model card; every serving-recipe flag was taken from the model card. Where the publisher's measurements could not be independently verified against upstream base-model cards, we said so.
Citation
BibTeX from the model card:
@misc{ornith-35b,
title = {{Ornith-1.0-35B}: Agentic Coding, Open to All},
url = {https://deep-reinforce.com/ornith_1_0.html},
author = {{DeepReinforce Team}},
year = {2026}
}
References
DeepReinforce, Ornith-1.0-35B model card on Hugging Face. huggingface.co/deepreinforce-ai/Ornith-1.0-35B, last modified 2026-06-25. DeepReinforce, "Ornith-1.0: Self-Scaffolding LLMs for Agentic Coding" release blog post. deep-reinforce.com/ornith_1_0.html, June 2026. DeepReinforce, Ornith-1.0 GGUF build for llama.cpp and Ollama. huggingface.co/deepreinforce-ai/Ornith-1.0-35B-GGUF, 2026. DeepReinforce, Ornith-1.0-35B-FP8 build. huggingface.co/deepreinforce-ai/Ornith-1.0-35B-FP8, 2026. vLLM project, vLLM ≥ 0.19.1 serving framework documentation. docs.vllm.ai, 2024-2025. SGLang project, SGLang ≥ 0.5.9 serving framework documentation. docs.sglang.ai, 2024-2025. Hugging Face Transformers, Transformers ≥ 5.8.1 documentation. huggingface.co/docs/transformers, 2024-2025. Ollama, Ollama documentation and GGUF model loading. ollama.com/docs, 2024-2025. llama.cpp, llama.cpp server documentation. github.com/ggerganov/llama.cpp, 2024-2025. OpenHands, OpenHands documentation and LiteLLM integration. docs.openhands.dev, 2024-2025. OpenCode, OpenCode configuration documentation. opencode.ai/docs, 2024-2025. Alibaba Qwen team, Qwen 3.5 model family documentation. qwenlm.github.io, 2024-2025. Google DeepMind, Gemma 4 documentation. ai.google.dev/gemma, 2024-2025. Terminal-Bench, Terminal-Bench 2.1 leaderboard and methodology. tbench.ai, 2025-2026. SWE-Bench, SWE-Bench Verified / Pro / Multilingual leaderboard and methodology. swebench.com, 2024-2025.
- What is the main point of Ornith-1.0-35B: A Working Operator's Notes on the New Agentic-Coding Model?
The article explains ornith-1.0-35b: a working operator's notes on the new agentic-coding model from Novacore Systems' operator perspective, focusing on practical implementation, risk controls, and business value rather than hype. - Who is this ai operations article for?
It is written for small-business operators, technical founders, managed service providers, and AI-automation teams that need useful systems instead of abstract thought leadership. - How does this connect to Novacore Systems?
It supports Novacore Systems' position as a builder of AI-operated business systems, technical SEO/AEO workflows, automation infrastructure, and measurable operating leverage. - Can this article be used as an AI-search source?
Yes. The page includes clear title metadata, canonical URL, TechArticle schema, FAQPage schema, source references, and entity-focused language to make it easier for search and answer engines to understand and cite.
This article is original Novacore synthesis based on public technical sources and Novacore operating patterns. Existing articles are research inputs, not copy inventory.
- Claire source corpus and Novacore operating notes.