Senior NPU Kernel Operator Engineer

San Jose | Permanent | Hardware

Senior NPU Kernel / Operator Engineer

Overview

We are seeking a Senior NPU Kernel / Operator Engineer to lead the development and optimization of high-performance deep learning operators for a next-generation AI accelerator platform.

This role focuses on kernel design, hardware-aware performance tuning, and correctness validation across a broad range of neural network workloads.

The ideal candidate will have deep experience optimizing compute-intensive software on GPU, NPU, DSP, SIMD, embedded accelerators, compiler backends, or HPC systems, with the ability to reason from model-level requirements down to hardware execution efficiency.


Responsibilities

  • Design, implement, and optimize high-performance operators such as:
    • Normalization
    • Reduction
    • Transpose
    • Reshape
    • Gather / Scatter
    • Quantization / Dequantization
    • Fused elementwise kernels
  • Own performance optimization across key hardware constraints, including:
    • Memory bandwidth
    • SRAM utilization
    • Data reuse
    • DMA latency
    • Bank conflicts
    • Compute utilization
  • Develop advanced optimization strategies including:
    • Tiling
    • Blocking
    • Vectorization
    • Memory scheduling
  • Analyze and resolve bottlenecks related to:
    • Memory hierarchy
    • Synchronization overhead
    • Instruction scheduling
    • Data movement
  • Validate operator correctness and numerical precision against reference implementations (e.g. PyTorch, NumPy)
  • Benchmark and profile kernel performance across simulation, emulation, FPGA, or production silicon environments
  • Debug complex issues involving:
    • Tensor layouts
    • Precision loss
    • Memory access patterns
    • Performance regressions
  • Build performance models and optimize operators toward hardware roofline limits
  • Collaborate closely with compiler, runtime, hardware architecture, and ML model teams to improve operator APIs and execution efficiency
  • Document optimization strategies, tensor layouts, and performance improvements
  • Mentor junior engineers and help define engineering best practices

Requirements

  • BS / MS / PhD in Computer Science, Electrical Engineering, Computer Engineering, or a related field
  • 5+ years of experience in one or more of the following:
    • Accelerator programming
    • GPU / NPU development
    • Compiler backend engineering
    • Embedded systems
    • High-performance computing
    • Performance optimization
  • Strong programming skills in:
    • C/C++
    • Python
  • Deep understanding of:
    • Tensor computation
    • Neural network operators
  • Strong knowledge of computer architecture concepts:
    • Memory hierarchy
    • Bandwidth and latency analysis
    • Cache / SRAM behaviour
    • Parallelism and synchronization
    • Data locality and vectorization
  • Proven experience optimizing performance-critical kernels or numerical compute pipelines
  • Ability to identify and resolve performance bottlenecks from algorithm through to hardware execution
  • Strong debugging, profiling, and analytical problem-solving skills

Preferred Experience

Experience with one or more of the following:

Frameworks / Tooling

  • CUDA
  • Triton
  • OpenCL
  • TVM
  • MLIR
  • Halide

Systems Experience

  • SIMD
  • DSP
  • Embedded C/C++
  • GPU / NPU programming
  • FPGA development
  • HPC systems

Advanced Optimization Techniques

  • Tiling and blocking
  • Vectorization
  • Memory access optimization
  • Instruction scheduling
  • Mixed-precision optimization

Numerical Formats

  • FP32
  • FP16
  • BF16
  • FP8
  • INT8 / INT4

AI Accelerator Architecture Familiarity

  • Matrix engines
  • Vector engines
  • Systolic arrays
  • DMA engines
  • SRAM / NoC / DRAM systems

Bonus

  • Experience with simulator, emulator, FPGA, or silicon bring-up

Opportunity

Join a highly technical team building cutting-edge AI compute infrastructure and contribute directly to the performance of next-generation machine learning hardware. This is an opportunity to work at the intersection of AI systems, compiler optimisation, and hardware acceleration, with significant ownership and technical impact.

Darwin Recruitment is acting as an Employment Agency in relation to this vacancy.
