In the race to advance artificial intelligence, speed is crucial. TheStage AI, a Delaware-based automated acceleration platform founded by ex-Huawei engineers, recently raised $4.5M to further its mission of streamlining and optimising AI inference. The company has announced significant progress in AI inference performance by collaborating with Nebius, leveraging early access to NVIDIA’s B200 GPUs.
In closed testing, TheStage AI reports substantial improvements in diffusion model inference speeds – a noteworthy development for AI developers and enterprises deploying large-scale generative models. “Unlike other providers who manually write kernels and are still working on GPU support, TheStage AI’s automated approach allows us to rapidly adapt to new GPU architectures, achieving up to 3.5× faster performance compared to previous-generation hardware,” said Kirill Solodskih, CEO & Co-Founder of TheStage AI, in a statement to TFN.
A developer-friendly strategy complements this technical advancement: TheStage AI offers access to its optimised models through Hugging Face libraries, simplifying integration for engineers. While the company focuses on text-to-image diffusion models, it has announced plans to expand into text-to-video generation and large language models (LLMs), building on partnerships with cloud providers such as Nebius and others.
How TheStage AI pushes the boundaries of AI inference
TheStage AI’s specialised inference engine for diffusion models has been extensively tested on NVIDIA B200 infrastructure. According to the company, their FLUX.1 model achieves approximately 22.5 iterations per second on the NVIDIA B200 GPU, compared to 6.5 iterations per second on an NVIDIA H100 GPU using the PyTorch bf16 version. This equates to a reported 3.5× acceleration in inference speed when combining the new hardware with TheStage AI’s software optimisations.
The results on Nebius’s newly deployed B200 infrastructure are remarkable: TheStage’s FLUX.1-schnell model now produces a 1024×1024 image in just 0.3 seconds, twice as fast as the previous record of 0.6 seconds.
Their more advanced FLUX-dev model requires only 1.85 seconds, compared to rival solutions’ 3.1 seconds. These models are available for self-hosting through Hugging Face, making cutting-edge performance accessible to anyone with Blackwell GPUs.
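Taken at face value, the figures quoted above are internally consistent; a quick sanity check using only the article’s own numbers recovers each of the claimed ratios:

```python
# Sanity-check the speedups implied by the figures reported in the article.
# All values below are quoted claims, not locally measured results.

# FLUX.1 iteration throughput (iterations per second)
b200_its = 22.5   # NVIDIA B200 with TheStage AI's engine
h100_its = 6.5    # NVIDIA H100, PyTorch bf16 baseline
print(f"FLUX.1 throughput speedup: {b200_its / h100_its:.2f}x")  # ~3.46x, the quoted ~3.5x

# End-to-end latency for one 1024x1024 image (seconds)
schnell_prev, schnell_now = 0.6, 0.3   # FLUX.1-schnell: previous record vs. new result
dev_rival, dev_now = 3.1, 1.85         # FLUX-dev: rival solutions vs. TheStage AI
print(f"FLUX.1-schnell latency speedup: {schnell_prev / schnell_now:.2f}x")  # 2.00x
print(f"FLUX-dev latency speedup: {dev_rival / dev_now:.2f}x")               # ~1.68x
```

The throughput ratio (22.5 / 6.5 ≈ 3.46) matches the headline “up to 3.5×” claim, while the latency figures show the per-model gains are somewhat smaller (2× and roughly 1.7×).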
In an interview with TFN, Solodskih emphasised that their swift support for new GPU architectures distinguishes them from competitors. While others are still manually developing support for the latest NVIDIA Blackwell architecture, TheStage AI has already implemented the optimisations behind these performance gains.
For context, diffusion models power many generative AI applications, from image generation on social media platforms like X to design tools like Krea. Halving inference time both reduces operational costs and dramatically improves user experience.
“Our early access to NVIDIA B200 GPU hardware via Nebius AI Cloud has enabled us to explore new heights of inference optimisation,” explains Solodskih. “Initial results show promising performance improvements for diffusion model inference, crucial for meeting the AI industry’s growing demands. We have achieved ~22.5 iterations per second for FLUX.1 models compared to 6.5 on an NVIDIA H100 GPU with PyTorch bf16.”
Accelerating AI innovation through collaboration with Nebius
This performance leap results from a collaboration with Nebius, a leading AI infrastructure provider and one of the first cloud platforms to deploy NVIDIA Blackwell Ultra-powered instances in the US and Europe.
Nebius’s AI-native cloud platform features clusters built on NVIDIA GB200 NVL72 and HGX B200 systems, offering significant computational upgrades over previous generations. Nebius reports up to a 1.6× latency reduction with the NVIDIA B200 GPU alone, and up to 3.5× when combined with TheStage AI’s compiler optimisations.
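If the hardware and compiler gains compose roughly multiplicatively, dividing the two reported figures gives a back-of-the-envelope estimate of the compiler’s own contribution. This decomposition is an assumption on our part; the article does not break the combined number down this way:

```python
# Back-of-the-envelope split of the combined speedup reported above.
# Assumes hardware and compiler gains compose multiplicatively,
# which the article does not state explicitly.

combined_speedup = 3.5   # B200 plus TheStage AI compiler optimisations
hardware_speedup = 1.6   # B200 alone, per Nebius

compiler_speedup = combined_speedup / hardware_speedup
print(f"Implied compiler contribution: ~{compiler_speedup:.1f}x")  # ~2.2x
```

Under that assumption, the software stack would account for roughly a 2.2× gain on top of the new silicon, which is consistent with the company’s emphasis on its automated compiler rather than hardware access alone.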
“We’re thrilled to collaborate with TheStage AI in evaluating the capabilities of our NVIDIA B200 GPU-based infrastructure,” says Aleksandr Patrushev, Head of Product – ML/AI at Nebius. “The early findings highlight substantial potential for scaling diffusion model deployment, and we eagerly anticipate providing Blackwell GPUs to our cloud customers.”
The collaboration extends beyond infrastructure deployment. Speaking to TFN, Solodskih emphasised that their automation approach enables rapid adaptation to new GPU architectures, a critical advantage in the fast-evolving AI landscape, while competitors are still hand-writing support for each new generation.
Real-world impact and open access
TheStage AI’s models are already being integrated into cloud platforms such as Nebius. TheStage’s flexible framework allows users to balance speed, cost, and quality, with pre-compiled models optimised for various hardware configurations and ready for immediate deployment.
Solodskih highlights three pillars of the company’s approach: rapid adaptation to new hardware, broad applicability across use cases, and an ambitious roadmap. TheStage AI is currently in discussions with potential collaborators in generative AI, including companies focused on text-to-video generation.
Looking ahead, TheStage AI aims to extend its capabilities from text-to-image to text-to-video generation, positioning itself at the forefront of AI inference technology. The company suggests that even large-scale AI platforms may benefit from its optimisations to reduce computational costs and improve generative quality.
With the rising demand for faster and more efficient generative models, this partnership sets a new standard for AI inference performance. By ensuring seamless integration via platforms like Hugging Face and delivering models optimised for rapid deployment across diverse hardware, TheStage AI is emerging as a key enabler for the next wave of AI innovation.