Senior NPU Kernel / Operator Engineer
Overview
We are seeking a Senior NPU Kernel / Operator Engineer to lead the development and optimization of high-performance deep learning operators for a next-generation AI accelerator platform.
This role focuses on kernel design, hardware-aware performance tuning, and correctness validation across a broad range of neural network workloads.
The ideal candidate will have deep experience optimizing compute-intensive software for GPUs, NPUs, DSPs, SIMD units, embedded accelerators, or HPC systems, or within compiler backends, and can reason from model-level requirements down to hardware execution efficiency.
Responsibilities
- Design, implement, and optimize high-performance operators such as:
  - Normalization
  - Reduction
  - Transpose
  - Reshape
  - Gather / Scatter
  - Quantization / Dequantization
  - Fused elementwise kernels
- Own performance optimization across key hardware constraints, including:
  - Memory bandwidth
  - SRAM utilization
  - Data reuse
  - DMA latency
  - Bank conflicts
  - Compute utilization
- Develop advanced optimization strategies including:
  - Tiling
  - Blocking
  - Vectorization
  - Memory scheduling
- Analyze and resolve bottlenecks related to:
  - Memory hierarchy
  - Synchronization overhead
  - Instruction scheduling
  - Data movement
- Validate operator correctness and numerical precision against reference implementations (e.g. PyTorch, NumPy)
- Benchmark and profile kernel performance across simulation, emulation, FPGA, or production silicon environments
- Debug complex issues involving:
  - Tensor layouts
  - Precision loss
  - Memory access patterns
  - Performance regressions
- Build performance models and optimize operators toward hardware roofline limits
- Collaborate closely with compiler, runtime, hardware architecture, and ML model teams to improve operator APIs and execution efficiency
- Document optimization strategies, tensor layouts, and performance improvements
- Mentor junior engineers and help define engineering best practices
Requirements
- BS / MS / PhD in Computer Science, Electrical Engineering, Computer Engineering, or a related field
- 5+ years of experience in one or more of the following:
  - Accelerator programming
  - GPU / NPU development
  - Compiler backend engineering
  - Embedded systems
  - High-performance computing
  - Performance optimization
- Strong programming skills in:
- Deep understanding of:
  - Tensor computation
  - Neural network operators
- Strong knowledge of computer architecture concepts:
  - Memory hierarchy
  - Bandwidth and latency analysis
  - Cache / SRAM behavior
  - Parallelism and synchronization
  - Data locality and vectorization
- Proven experience optimizing performance-critical kernels or numerical compute pipelines
- Ability to identify and resolve performance bottlenecks from algorithm through to hardware execution
- Strong debugging, profiling, and analytical problem-solving skills
Preferred Experience
Experience with one or more of the following:
Frameworks / Tooling
- CUDA
- Triton
- OpenCL
- TVM
- MLIR
- Halide
Systems Experience
- SIMD
- DSP
- Embedded C/C++
- GPU / NPU programming
- FPGA development
- HPC systems
Advanced Optimization Techniques
- Tiling and blocking
- Vectorization
- Memory access optimization
- Instruction scheduling
- Mixed-precision optimization
Numerical Formats
- FP32
- FP16
- BF16
- FP8
- INT8 / INT4
AI Accelerator Architecture Familiarity
- Matrix engines
- Vector engines
- Systolic arrays
- DMA engines
- SRAM / NoC / DRAM systems
Bonus
- Experience with simulator, emulator, FPGA, or silicon bring-up
Opportunity
Join a highly technical team building cutting-edge AI compute infrastructure and contribute directly to the performance of next-generation machine learning hardware. This is an opportunity to work at the intersection of AI systems, compiler optimization, and hardware acceleration, with significant ownership and technical impact.