Internship

Machine Learning Kernel Performance Engineer, Dojo

Tesla

No salary listed

Palo Alto, CA, USA

In Person

Internship requires a minimum of 12 weeks, full-time and on-site.

Job Description

Consider before submitting an application:    

This position is expected to start around August or September 2025 and continue through the Fall term (ending approximately December 2025), or into Winter/Spring 2026 if the opportunity is available. We ask for a minimum of 12 weeks, full-time and on-site, for most internships. Our internship program is for students who are actively enrolled in an academic program. Recent graduates seeking employment after graduation and not returning to school should apply for full-time positions, not internships.

International Students: If your work authorization is through CPT, please consult your school about your ability to work 40 hours per week before applying. You must be able to work 40 hours per week on-site. Many students will be limited to part-time work during the academic year.

As a member of the Dojo Machine Learning software team, you will be responsible for developing and optimizing machine learning kernels to run on a massively parallel machine. The ideal candidate will have a background in kernel development and performance optimization for AI workloads, with a passion for delivering high-performance implementations that are simple to use.

Job Responsibilities

  • Develop and validate datapath kernels on a massively parallel machine 

  • Design and deliver implementations of known neural network algorithms on Dojo hardware, for both training and inference workloads

  • Collaborate with architects and compiler engineers to transfer kernel designs to the Dojo compiler and production environment 

  • Work closely with the team to drive improvements to the kernel development framework, supporting environments, and other components of the system 

  • Work with hardware and simulations teams to ensure future hardware generations optimally satisfy algorithmic requirements 

  • Participate in code reviews, testing, and debugging to ensure high-quality software 

  • Stay up-to-date with the latest developments in AI workloads, domain-specific languages, computer architecture, and simulation techniques 

Job Requirements

  • Pursuing a Master's degree or higher in Engineering, Computer Science, Mathematics, AI, or a similar field, with a graduation date between December 2025 and May 2026

  • Experience in kernel development, computer architecture, and AI workloads 

  • Understanding of Large Language Models (LLMs) and transformer architectures, including their training, inference, and optimization

  • Strong programming skills in languages such as C/C++ and Python 

  • Experience with deep learning frameworks such as PyTorch, JAX, or Pallas

  • Strong understanding of CPU and/or GPU microarchitecture, including pipelining, caching, and memory hierarchy 

  • Experience writing, analyzing, and optimizing assembly kernels

  • Strong communication and collaboration skills, with the ability to work effectively with architects, engineers, and researchers 

  • Performance analysis experience, including with performance simulation frameworks, preferred

  • Familiarity with parallel programming models such as CUDA and Open MPI preferred

  • Experience with continuous integration and testing frameworks such as Bazel or pytest preferred