RightNow AI has launched AutoKernel, an open-source framework that optimizes GPU kernels for PyTorch models using an autonomous agent loop. The goal is to improve GPU utilization and model performance without manual kernel work.

According to RightNow AI, AutoKernel delivers 2-5x PyTorch performance improvements through automated GPU kernel optimization, making specialist-level optimization available to teams without CUDA expertise and without changes to model code.
Signal analysis
AutoKernel is an automated GPU kernel optimization tool for PyTorch, with reported 2-5x performance improvements on standard workloads and no changes to model code. The tool analyzes PyTorch operations and automatically generates optimized CUDA kernels tailored to specific GPU architectures.
AutoKernel works by profiling your PyTorch model's execution patterns, identifying optimization opportunities, and generating custom kernels that replace generic operations. The process is fully automated: point AutoKernel at your model and receive an optimized version without manual kernel development.
Initial support covers common operations in transformer architectures, convolutional networks, and attention mechanisms. The tool targets both training and inference workloads, with different optimization strategies for each context.
AutoKernel democratizes performance optimization that previously required CUDA expertise. Most teams lack engineers who can write efficient GPU kernels. They accept default PyTorch performance or hire expensive specialists. AutoKernel delivers specialist-level optimization without the specialist.
A 2-5x improvement changes ML economics. If cost scales with GPU time, a training run that costs $10K drops to between $2K and $5K, and inference serving costs fall proportionally. For teams with significant compute spend, the integration effort pays for itself quickly.
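The cost range above follows from one assumption: that spend is proportional to GPU time, so a job that runs N times faster costs 1/N as much. A minimal sketch of that arithmetic (the function name is illustrative, not part of AutoKernel):

```python
def cost_after_speedup(baseline_cost: float, speedup: float) -> float:
    """Cost of the same job after a speedup.

    Assumes cost is strictly proportional to GPU time, which ignores
    fixed costs such as storage, data transfer, and idle capacity.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_cost / speedup


baseline = 10_000.0  # the $10K training run from the text
for speedup in (2.0, 5.0):
    print(f"{speedup:.0f}x -> ${cost_after_speedup(baseline, speedup):,.0f}")
# prints:
# 2x -> $5,000
# 5x -> $2,000
```

Real savings will usually be smaller than this upper bound, since not every line item on a compute bill shrinks with kernel runtime.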
This shifts competitive dynamics toward algorithmic innovation rather than implementation optimization. When anyone can get optimized kernels automatically, the advantage goes to teams with better models and data rather than better CUDA engineers.
AutoKernel integrates through a simple wrapper around your PyTorch model. Import the optimization module, wrap your model, and run a profiling pass with representative data. The tool analyzes execution patterns and generates optimized kernels for your specific workload.
The profiling process requires representative data because optimization targets actual usage patterns. Synthetic data may produce different optimization decisions than real workloads. Use production-representative data for profiling when possible.
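To illustrate why the input data matters, here is a toy profiler in plain Python. It is not AutoKernel's API, and every name in it is hypothetical. It times a small pipeline of named operations over the batches you give it and ranks them by total time, so different batches can produce a different ranking of hotspots.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def profile_ops(
    ops: Dict[str, Callable[[list], list]],
    batches: Iterable[list],
) -> List[Tuple[str, float]]:
    """Run the ops as a pipeline over each batch.

    Returns (op_name, total_seconds) pairs sorted slowest first.
    """
    totals: Dict[str, float] = defaultdict(float)
    for batch in batches:
        data = batch
        for name, fn in ops.items():
            start = time.perf_counter()
            data = fn(data)  # output of one op feeds the next
            totals[name] += time.perf_counter() - start
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical pipeline: the cost of "sort" depends heavily on input order and size.
pipeline = {
    "double": lambda xs: [x * 2 for x in xs],
    "sort": sorted,
}
representative_batches = [list(range(50_000, 0, -1)) for _ in range(3)]

for name, seconds in profile_ops(pipeline, representative_batches):
    print(f"{name}: {seconds * 1e3:.2f} ms")
```

Profiling the same pipeline with tiny or pre-sorted batches would shift where the time goes, which is the reason to profile with production-like inputs.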
Generated kernels are cached and versioned. You can commit them with your model code for reproducible builds. When PyTorch or CUDA versions change significantly, re-run optimization to regenerate kernels matched to the new environment.
AutoKernel's improvements come from three sources: operation fusion (combining multiple operations into a single kernel launch), memory pattern optimization (restructuring data access for the GPU memory hierarchy), and architecture-specific tuning (using features of your specific GPU model).
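One way to think about versioned kernel caches is a key derived from everything the kernels depend on. The sketch below is an assumption about how such a key could be built, not a description of AutoKernel's actual cache format; the function and field names are hypothetical.

```python
import hashlib
import json


def kernel_cache_key(
    model_fingerprint: str,
    torch_version: str,
    cuda_version: str,
    gpu_arch: str,
) -> str:
    """Derive a short, deterministic key from the environment the kernels target."""
    payload = json.dumps(
        {
            "model": model_fingerprint,
            "torch": torch_version,
            "cuda": cuda_version,
            "arch": gpu_arch,
        },
        sort_keys=True,  # stable field order keeps the hash reproducible
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Example values are placeholders, not tested configurations.
old_key = kernel_cache_key("model-abc123", "2.3.0", "12.1", "sm_90")
new_key = kernel_cache_key("model-abc123", "2.3.0", "12.4", "sm_90")
print(old_key != new_key)  # a CUDA upgrade yields a different key
```

With a scheme like this, a changed key signals that cached kernels no longer match the environment and optimization should be re-run.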
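Operation fusion is the easiest of the three to picture. The sketch below uses plain Python lists rather than GPU code, purely as an analogy: the unfused version makes three passes over memory and materializes two intermediate results, while the fused version computes the same values in one pass. On a GPU, each unfused pass would typically be a separate kernel launch with its own round trip to global memory.

```python
from typing import List


def unfused(xs: List[float], scale: float, bias: float) -> List[float]:
    scaled = [x * scale for x in xs]        # pass 1: writes an intermediate
    shifted = [x + bias for x in scaled]    # pass 2: reads it back, writes another
    return [max(0.0, x) for x in shifted]   # pass 3: ReLU


def fused(xs: List[float], scale: float, bias: float) -> List[float]:
    # One pass: scale, shift, and ReLU applied per element with no intermediates.
    return [max(0.0, x * scale + bias) for x in xs]


values = [-1.0, 0.5, 2.0]
print(unfused(values, 2.0, -0.5) == fused(values, 2.0, -0.5))  # True
```

Both functions perform the same arithmetic in the same order per element, so the results match exactly; only the number of passes over the data changes.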
The 2-5x range depends on workload characteristics. Transformer models typically see higher improvements due to attention pattern optimization. Convolutional networks see moderate improvements. Memory-bound operations see smaller gains than compute-bound operations.
The tool works best on standardized architectures using common operations. Highly custom architectures with unusual operations may see less benefit. AutoKernel optimizes patterns it recognizes; unrecognized patterns pass through unchanged.
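Because unrecognized patterns pass through unchanged, end-to-end gains depend on how much of the runtime the optimized kernels cover. Amdahl's law gives a rough estimate; applying it here is our framing rather than a method described by RightNow AI, and the numbers are illustrative.

```python
def overall_speedup(optimized_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only part of the runtime gets faster.

    optimized_fraction: share of original runtime spent in optimized kernels (0 to 1).
    kernel_speedup: speedup of that share; the remainder is unchanged.
    """
    if not 0.0 <= optimized_fraction <= 1.0:
        raise ValueError("optimized_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - optimized_fraction) + optimized_fraction / kernel_speedup)


# Hypothetical coverage levels with a 5x speedup on the covered kernels.
for fraction in (0.5, 0.8, 1.0):
    print(f"{fraction:.0%} covered -> {overall_speedup(fraction, 5.0):.2f}x overall")
```

At 50% coverage, a 5x kernel speedup yields only about 1.67x overall, which is one reason highly custom architectures may land below the headline range.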
AutoKernel represents a broader trend toward automated performance optimization. Manual kernel development will become increasingly rare as tools like AutoKernel mature. CUDA expertise shifts from writing kernels to guiding optimization tools and handling edge cases.
Expect AutoKernel-like tools from major frameworks. PyTorch and TensorFlow will likely integrate similar capabilities. The competitive advantage for standalone tools will shift to specialized optimizations or broader hardware support.
The combination of automated optimization and better hardware will continue driving compute cost reductions. ML workloads that seem expensive today become affordable as optimization tools and hardware improvements compound.