PulseRise Technologies

Systems Engineer (HPC)

Remote
Posted 2 days ago
Cologne
Job Description

Responsibilities

Incident & Service Operations

  • Incident Management: Respond to, diagnose, and resolve HPC-related incidents to ensure system stability and minimize downtime.
  • Service Request Management: Process and fulfill service requests related to HPC resources, tooling, and services.

Technical Tasks

  • Troubleshooting: Investigate and resolve complex technical issues across HPC clusters, applications, networking, and performance workflows.
  • Testing & Validation: Develop, execute, and document test plans to validate system reliability, scalability, and performance.
  • Documentation: Create and maintain detailed documentation on system architecture, configurations, workflows, and optimizations.
  • Manage, monitor, and optimize HPC clusters, job scheduling systems, and related infrastructure.
  • Analyze performance bottlenecks and apply optimization techniques across compute, memory, and networking layers.
  • Support software development, integration, and deployment workflows within HPC environments.

Required Qualifications

  • Minimum 3 years of experience in software development and/or systems engineering with a strong focus on HPC environments.
  • Expertise in Linux operating systems, specifically Red Hat Enterprise Linux (RHEL).
  • Strong programming/scripting skills: C, C++, Python, Bash, Ansible
  • Hands-on experience with parallel computing frameworks: MPI, OpenMP, CUDA
  • Solid knowledge of computer architecture, performance tuning, and system optimization.
  • Experience managing HPC clusters, including job schedulers (e.g., Slurm, PBS, LSF).
  • Strong networking knowledge, particularly InfiniBand.
  • Understanding of ITIL best practices, especially: Incident Management, Service Management, Process Optimization

Soft Skills

  • Strong analytical and problem-solving capabilities
  • Ability to work in distributed, remote teams
  • Clear communication and documentation skills
  • Proactive, structured, and solution-oriented mindset

Project Start: ASAP
Project Duration: Until December 2026
Location: Remote (with on-site onboarding in Cologne)
Languages:

  • English: Fluent
  • German: A plus