Hyphen Connect Limited

LLM Pre-training & Distributed Engineer (AI Infrastructure)

Posted 2 Days Ago
In-Office or Remote
Hiring Remotely in CA
Senior level
Design, orchestrate, and optimize large-scale LLM pre-training across 1,000+ GPUs. Implement 3D parallelism, manage GPU clusters (SLURM/Kubernetes), optimize InfiniBand/RDMA networking and memory, and automate checkpointing and failure recovery for long training runs.
The summary above was generated by AI

We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer. This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure. The ideal candidate will have a deep understanding of GPU clusters and extensive systems engineering experience to ensure efficient and reliable training processes.

Responsibilities:

  • Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
  • Optimize networking (InfiniBand/RDMA), and tune memory management to prevent out-of-memory errors.
  • Automate checkpointing and failure recovery during month-long training runs.

Required Skills:

  • Deep expertise in 3D parallelism (Data, Tensor, Pipeline).
  • Experience managing SLURM or Kubernetes-based GPU clusters.
  • Strong systems engineering background (C++, CUDA, Python).
