High Performance Computing Specialist
An elite Montreal-based trading firm is seeking an HPC Systems Specialist to join a team responsible for designing and operating high-performance GPU platforms that support advanced AI and machine learning workloads. This role sits at the intersection of infrastructure engineering, distributed systems, and performance tuning, with ownership spanning from physical hardware through large-scale model serving. You will work closely with ML practitioners and infrastructure peers to build reliable, scalable, and highly optimized compute environments.
What You'll Do
- Build, operate, and continuously improve GPU-based compute platforms supporting large-scale inference and ML workloads
- Design and deploy distributed model serving architectures across multi-node, multi-GPU environments
- Operate and evolve Kubernetes clusters with GPU scheduling for AI and ML use cases
- Configure and tune networking components such as load balancers, firewall rules, and high-throughput interconnects for GPU clusters
- Develop and optimize storage solutions for model artifacts, checkpoints, and inference caches
- Diagnose and resolve performance and stability issues across hardware, drivers, networking, and application layers
- Partner with ML engineers to benchmark models, analyze performance characteristics, and apply inference acceleration strategies
- Evaluate new GPU hardware, serving frameworks, and infrastructure patterns to improve efficiency and scalability
- Improve system reliability through observability, alerting, capacity planning, and on-call/incident response processes
- Automate provisioning and lifecycle management using infrastructure-as-code and scripting
What You Bring
- Bachelor's or Master's degree in Computer Science, Engineering, or a related discipline
- 5+ years of experience managing high-performance computing environments
- Hands-on experience operating GPU compute environments for ML inference or training
- Familiarity with modern model serving frameworks (e.g., vLLM, SGLang, or similar) and GPU driver/runtime management
- Strong Linux systems expertise, including networking, storage, and kernel-level performance considerations
- Practical experience running GPU workloads on Kubernetes at scale
- Experience with infrastructure automation tools such as Terraform, Ansible, or equivalent
- Solid understanding of distributed systems concepts, networking fundamentals (TCP/IP, HTTP/2), and load-balancing strategies
- Proficiency in Python and shell scripting for tooling and automation
- Experience with monitoring and observability platforms such as Prometheus, Grafana, or comparable tools
This is a hybrid role based in the firm's Montreal office, requiring 3 days per week onsite and 2 days remote.
