TII’s Falcon H1R 7B can out-reason models up to 7x its size — and it’s (mostly) open

For the last two years, the prevailing logic in generative AI has been one of brute force: if you want better reasoning, you need a bigger model.

While "small" models (under 10 billion parameters) have become capable conversationalists, they have historically crumbled when asked to perform multi-step logical deduction or complex mathematical proofs.

Today, the Technology Innovation Institute (TII) in Abu Dhabi is challenging that scaling law with the release of Falcon H1R 7B.

By abandoning pure Transformer orthodoxy in favor of a hybrid architecture, TII claims to have built a 7-billion-parameter model that not only rivals but outperforms competitors nearly 7x its size, including the 32B and 47B variants of Alibaba's Qwen and Nvidia's Nemotron.

The release marks a significant shift in the open-weight ecosystem, moving the battleground from raw parameter count to architectural efficiency and inference-time scaling.

The full model is available now on Hugging Face and can be tested in a live demo on Falcon Chat, TII's chatbot interface. TII has also released a seemingly comprehensive technical report on the approach and training methodology behind Falcon H1R 7B.
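For readers who want to poke at the weights directly, loading them with the Hugging Face transformers library should look roughly like the sketch below. The repository ID is an assumption used for illustration; check TII's Hugging Face organization for the actual model card, and note that the hybrid architecture may require trust_remote_code depending on how it is registered in transformers.

```python
# Minimal sketch of loading an open-weight checkpoint with Hugging Face transformers.
# NOTE: "tiiuae/Falcon-H1R-7B" is an assumed repository ID, shown for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-H1R-7B"  # hypothetical; replace with the real model card ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # spread layers across available GPUs (requires accelerate)
    torch_dtype="auto",  # load in the checkpoint's native precision
)

prompt = "Prove that the sum of two even integers is even."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```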

Moving Beyond the Transformer, the Foundational LLM Architecture

The defining feature of Falcon H1R 7B is its "hybrid" backbone. Most modern LLMs rely exclusively on the Transformer architecture, which scales predictably but suffers from high memory costs when processing long sequences.

Falcon H1R 7B integrates Mamba, a state-space model (SSM) architecture, alongside standard Transformer attention layers.

Originally developed by researchers Albert Gu and Tri Dao at Carnegie Mellon University and Princeton University, Mamba was first introduced in the paper "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" published on December 1, 2023.

The architecture processes data sequences differently than Transformers: while Transformers compare every piece of data to every other piece (quadratic scaling), Mamba processes tokens sequentially, allowing it to handle vast amounts of information with linear scaling and significantly reduced compute costs.
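The difference is easiest to see in a toy sketch (not TII's implementation, and far simpler than Mamba's selective scan): an attention layer compares every pair of tokens, while a state-space recurrence carries a fixed-size state forward one token at a time.

```python
import numpy as np

def toy_attention(x):
    """Quadratic pattern: every token is compared with every other token (O(n^2))."""
    scores = x @ x.T                                  # (n, n) pairwise comparisons
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ x                                # weighted mix of all tokens

def toy_ssm(x, decay=0.9):
    """Linear pattern: a fixed-size state is updated once per token (O(n))."""
    state = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t, token in enumerate(x):                     # single pass over the sequence
        state = decay * state + token                 # recurrent state update
        out[t] = state
    return out

x = np.random.randn(8, 4)                             # 8 tokens, 4-dim embeddings
print(toy_attention(x).shape, toy_ssm(x).shape)
```

Doubling the sequence length roughly quadruples the work in the first function but only doubles it in the second, which is why the hybrid design pays off on long reasoning traces.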

This combination addresses one of the most persistent bottlenecks in deploying reasoning models: the cost of "thinking." Reasoning models require generating long "chains of thought"—step-by-step internal monologues—before arriving at an answer. For standard Transformers, these long contexts explode computational costs.

According to TII’s technical report, the hybrid approach allows Falcon H1R 7B to maintain high throughput even as response lengths grow. At a batch size of 64, the model processes approximately 1,500 tokens per second per GPU—nearly double the speed of the competing Qwen3 8B model.

Benchmark Performance: Punching Up

In the benchmarks released by TII, the disparity between Falcon H1R 7B’s size and its performance is stark. On the AIME 2025 leaderboard—a rigorous test of mathematical reasoning—Falcon H1R 7B scored 83.1%, a result that disrupts the traditional hierarchy of model sizing.

While the 7B model naturally trails massive proprietary frontier models like GPT-5.2 (99.0%) and Gemini 3 Flash (97.0%) on the separate Artificial Analysis index (run by the independent organization of the same name, which has not yet benchmarked Falcon H1R 7B), it has effectively collapsed the gap between "efficient" open weights and mid-tier proprietary systems.

Other key domain wins include:

Training Techniques

Falcon H1R 7B's performance is not purely a matter of architecture; it also stems from a rigorous, two-stage training pipeline designed to maximize reasoning density without inflating parameter count, according to TII's technical report on the model.

Stage 1: Cold-Start Supervised Fine-Tuning (SFT). The model underwent "cold-start" SFT on a curated dataset dominated by mathematics (56.8% of tokens) and code (29.8%), with response lengths stretching up to 48,000 tokens.

Stage 2: Reinforcement Learning via Group Relative Policy Optimization (GRPO). Following SFT, the model was refined using GRPO, a reinforcement learning algorithm that rewards correct outcomes without needing a separate value model.
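The core idea fits in a few lines. The following is a generic sketch of a GRPO-style advantage computation, not TII's training code: several responses are sampled for each prompt, each is scored with an outcome reward such as answer correctness, and each response's advantage is measured against its own group's average rather than against a learned critic.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Generic GRPO-style advantage: score each sampled response against the
    mean (and spread) of its own group, instead of using a learned value model."""
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()
    scale = rewards.std() + 1e-8           # avoid division by zero
    return (rewards - baseline) / scale

# Example: 4 sampled answers to one math prompt, rewarded 1 if correct else 0.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))  # correct answers get positive advantage
```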

TII optimized the model specifically for Test-Time Scaling (TTS), a technique where a model generates multiple reasoning paths in parallel to find the best solution.

The model utilizes Deep Think with Confidence (DeepConf), which leverages the model's internal confidence scores to dynamically prune low-quality reasoning traces.
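In spirit, the procedure looks like the simplified sketch below (an illustration of confidence-filtered voting, not the actual DeepConf implementation): sample several reasoning traces in parallel, drop those whose average token confidence falls below a threshold, and majority-vote the surviving answers.

```python
from collections import Counter

def pick_answer(traces, min_confidence=0.7):
    """Simplified confidence-filtered voting over parallel reasoning traces.

    Each trace is (final_answer, mean_token_confidence). Low-confidence traces
    are pruned; the remaining answers are decided by majority vote."""
    kept = [answer for answer, conf in traces if conf >= min_confidence]
    if not kept:                                    # fall back to the single most confident trace
        kept = [max(traces, key=lambda t: t[1])[0]]
    return Counter(kept).most_common(1)[0][0]

# Example: 5 sampled traces for one problem.
traces = [("42", 0.91), ("42", 0.85), ("17", 0.55), ("42", 0.78), ("13", 0.62)]
print(pick_answer(traces))  # -> "42"
```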

Licensing: Open For Commercial Usage, But With Strings Attached

TII has released Falcon H1R 7B under the custom Falcon LLM License 1.0, which is based on Apache 2.0 but adds notable conditions, chief among them a requirement to always credit TII and a prohibition on litigating against it.

For developers and startups, the license is largely permissive:

However, unlike a license approved by the Open Source Initiative (OSI), the Falcon license includes a strict Acceptable Use Policy (AUP).

The license terminates automatically if the model is used to create work that conflicts with the AUP or if the user initiates patent litigation against TII.

Specifically, the AUP prohibits using Falcon H1R 7B or its derivatives for:

The Hybrid Wave: Nvidia, IBM, AI21, and Mistral

TII is not alone in betting on this hybrid future; the industry is increasingly moving toward architectures that blend the strengths of SSMs and Transformers.

Falcon H1R 7B represents the latest evolution in this trend, specifically targeting dense reasoning tasks in a compact form factor.