About the Role

NVIDIA is seeking an experienced HPC DevOps and Network Engineer to contribute to the development of future supercomputers and HPC clusters. This role is pivotal in driving advancements in artificial intelligence and GPU computing, offering insights into large-scale system design and tuning for extensive compute runs. You will work with cutting-edge accelerated computing and deep learning platforms alongside researchers, developers, and customers to enhance workflows and create innovative solutions. Your responsibilities include architecting, developing, and deploying large-scale performance platforms in conjunction with HPC, OS, GPU compute, and systems specialists.

What You'll Be Doing

Innovate and Implement: Design, implement, and maintain large-scale HPC/AI clusters with state-of-the-art monitoring, logging, and alerting systems.
Infrastructure as Code (IaC): Utilize and develop tools for infrastructure as code to ensure scalable and repeatable deployments.
Streamline CI/CD Pipelines: Develop and maintain CI/CD pipelines for automated and streamlined deployment processes.
Automate Everything: Create automation scripts and tools for deployment, configuration management, and operational monitoring.
Network Automation: Develop complex networking automation.
Troubleshoot Complex Issues: Conduct comprehensive troubleshooting from bare metal to application level to ensure system reliability and efficiency.
Lead and Educate: Act as a technical resource, developing and sharing best practices with internal teams.
Drive Innovation: Support R&D activities and participate in proofs of concept (POCs) and proofs of value (POVs) for future enhancements.

What We Need To See

B.Sc. in Computer Science, Engineering, or a related field, with 5 years of experience.
Deep knowledge of HPC and AI solution technologies (CPUs, GPUs, high-speed interconnects, supporting software).
Advanced programming and scripting language proficiency, with a solid understanding of object-oriented programming.
Familiarity with Jenkins, Ansible, Puppet/Chef.
Excellent knowledge of Windows and Linux (Red Hat/CentOS, Ubuntu), networking, and OS-level security.
Deep understanding of networking protocols like InfiniBand and Ethernet.
Experience with workload scheduling and orchestration tools (Slurm, Kubernetes).
Background with storage solutions (Lustre, GPFS, ZFS, XFS).
Expertise with virtual systems (VMware, Hyper-V, KVM, Citrix).
Familiarity with cloud platforms (AWS, Azure, Google Cloud).

Ways To Stand Out From The Crowd

Proven networking experience or strong knowledge through professional networking training.
Architectural Insight: Knowledge of CPU and/or GPU architecture.
Container Expertise: Understanding of Kubernetes and container-related microservice technologies.
GPU Focus: Experience with GPU-focused hardware/software (DGX, CUDA).
RDMA Fabrics: Background with RDMA (InfiniBand or RoCE) fabrics.

NVIDIA is committed to diversity and fostering an inclusive environment. We do not discriminate based on race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. Reasonable accommodations are provided to ensure equal participation in the job application process, essential job functions, and other employment benefits. Join us in pushing technological boundaries and making a significant global impact.

Senior HPC DevOps Engineer
