Get the latest updates on AI-powered hiring, career growth, and technical deep-dives delivered to your inbox.
Halian
This role focuses on designing and building the infrastructure that enables scalable machine learning development, from training-ready datasets through to validated models deployed in production environments.
The position involves establishing core systems and architectural foundations that will support long-term scalability and performance. Key areas include training infrastructure, distributed learning frameworks, experiment management, model lifecycle management, and reliable pathways from model development to production deployment.
Design, deploy, and operate GPU-based training environments across cloud platforms such as AWS and GCP. This includes node provisioning, workload scheduling (e.g., Kubernetes, Slurm), multi-node networking, GPU monitoring, and cost/utilization optimization.
Own and optimize distributed training frameworks such as PyTorch DDP and FSDP. Implement and tune strategies including mixed precision, gradient checkpointing, activation offloading, and parallelism approaches to ensure efficient large-scale training.
Develop high-throughput data loading and storage access patterns to support multi-GPU and multi-node training. Implement techniques such as data sharding, prefetching, local NVMe caching, and resumable data pipelines. Contribute to dataset format design with a focus on efficient read performance.
Implement and maintain experiment tracking and model registry systems using platforms such as MLflow or Weights & Biases. Ensure reproducibility, traceability, and comparison of experiments through proper artifact and checkpoint management.
Build automated pipelines for training, evaluation, and deployment readiness. Establish validation gates, regression testing, and controlled promotion of models across different lifecycle stages.
Develop reliable CI/CD workflows for model conversion, benchmarking, and packaging using tools such as ONNX, TensorRT, SNPE, or TIDL. Ensure all artifacts are properly versioned and tracked with full lineage.
Design and implement systems that capture model performance in production environments. Enable continuous feedback loops by feeding operational data back into retraining and evaluation pipelines.
Matched to your profile
We surface this role because it matches profiles like yours, not because we vet the employer. Always confirm the pay, location, and remote details on Halian's official site before you apply.