Undisclosed

Site Reliability Engineer (GPU/HPC)

Palo Alto, CA

Full-time

About This Job

Work at a Unicorn Startup!

We’re seeking a Senior Site Reliability Engineer to strengthen the GPU infrastructure behind a fast-growing AI/ML platform. Join us to keep it scalable and resilient!

GPU/HPC (High-Performance Computing) experience required.

• Manage thousands of GPUs across multiple cloud providers, scaling and monitoring infrastructure.
• Design scalable solutions to meet growing demands.
• Implement monitoring systems to proactively detect issues.
• Build fault-tolerant designs to reduce service disruptions.
• Develop automation tools to boost reliability and efficiency.
• Participate in on-call rotation for 24/7 system support.
• Maintain SLOs and SLIs to uphold system performance.

Requirements:

• 5+ years as a reliability, production, or infrastructure engineer in a fast-paced, scaling environment.
• Deep expertise in GPU cloud infrastructure, including scheduling, scaling, and networking.
• Proficient in programming/scripting languages.
• Experience with Kubernetes or similar container orchestration platforms.
• Familiarity with IaC tools like Terraform or CloudFormation.
• Strong problem-solving and communication skills.
• Hands-on experience with observability tools (e.g., DataDog, Prometheus, Grafana).
• Knowledge of cloud security best practices.
• SRE experience in AI/ML is a plus.

Ready to power cutting-edge AI infrastructure? Apply now!

Notice: The inclusion of job postings or company information on our platform does not imply endorsement, partnership, or affiliation. Listings may include publicly available roles from various sources, and companies shown may not have a direct relationship with Energy Hire.
