Work at a Unicorn Startup!
We’re seeking a Senior Reliability Engineer to enhance our GPU infrastructure for a fast-growing AI/ML platform. Join us to ensure scalability and resilience!
GPU/HPC (High-Performance Computing) experience required.
Responsibilities:
•Manage, scale, and monitor thousands of GPUs across multiple cloud providers.
•Design scalable solutions to meet growing demands.
•Implement monitoring systems to proactively detect issues.
•Build fault-tolerant designs to reduce service disruptions.
•Develop automation tools to boost reliability and efficiency.
•Participate in an on-call rotation providing 24/7 system support.
•Maintain service level objectives (SLOs) and service level indicators (SLIs) to uphold system performance.
Requirements:
•5+ years as a reliability, production, or infrastructure engineer in a fast-paced, scaling environment.
•Deep expertise in GPU cloud infrastructure, including scheduling, scaling, and networking.
•Proficiency in one or more programming or scripting languages.
•Experience with Kubernetes or similar container orchestration platforms.
•Familiarity with infrastructure-as-code (IaC) tools such as Terraform or CloudFormation.
•Strong problem-solving and communication skills.
•Hands-on experience with observability tools (e.g., Datadog, Prometheus, Grafana).
•Knowledge of cloud security best practices.
•SRE experience in AI/ML is a plus.
Ready to power cutting-edge AI infrastructure? Apply now!