Systems Engineer (HPC)
Job Description
Responsibilities
Incident & Service Operations
- Incident Management: Respond to, diagnose, and resolve HPC-related incidents to ensure system stability and minimize downtime.
- Service Request Management: Process and fulfill service requests related to HPC resources, tooling, and services.
Technical Tasks
- Troubleshooting: Investigate and resolve complex technical issues across HPC clusters, applications, networking, and performance workflows.
- Testing & Validation: Develop, execute, and document test plans to validate system reliability, scalability, and performance.
- Documentation: Create and maintain detailed documentation on system architecture, configurations, workflows, and optimizations.
- Manage, monitor, and optimize HPC clusters, job scheduling systems, and related infrastructure.
- Analyze performance bottlenecks and apply optimization techniques across compute, memory, and networking layers.
- Support software development, integration, and deployment workflows within HPC environments.
Required Qualifications
- Minimum 3 years of experience in software development and/or systems engineering with a strong focus on HPC environments.
- Expertise in Linux operating systems, specifically Red Hat Enterprise Linux (RHEL).
- Strong programming/scripting skills: C, C++, Python, Bash, Ansible.
- Hands-on experience with parallel computing frameworks: MPI, OpenMP, CUDA.
- Solid knowledge of computer architecture, performance tuning, and system optimization.
- Experience managing HPC clusters, including job schedulers (e.g., Slurm, PBS, LSF).
- Strong networking knowledge, particularly InfiniBand.
- Understanding of ITIL best practices, especially Incident Management, Service Management, and Process Optimization.
Soft Skills
- Strong analytical and problem-solving capabilities
- Ability to work in distributed, remote teams
- Clear communication and documentation skills
- Proactive, structured, and solution-oriented mindset
Project Start: ASAP
Project Duration: Until December 2026
Location: Remote (with on-site onboarding in Cologne)
Languages:
- English: Fluent
- German: A plus