Kubernetes and its ecosystem are fundamentally well suited to batch workloads such as distributed machine learning training and high-performance computing (HPC). Compute-intensive workloads in particular, such as large language models (LLMs), which have to be distributed across many hosts because of limited memory, benefit from containerized deployment on Kubernetes. In practice, however, existing implementations such as the Job API or the Kubeflow operators still lack some configuration options, for example for communication between pods, for different pod templates, and for job groups. The new open source API JobSet is now intended to offer a unified approach to representing distributed jobs.
JobSet builds on the Job API and extends it
Building on the Job API, JobSet is designed to model a distributed workload as a group of Kubernetes Jobs. This makes it possible to assign different pod templates to different pod groups such as leaders and workers, and to create identical child jobs from a single declaration, which can then run on dedicated hardware accelerator islands (GPUs or TPUs of the same type). For communication between the pods, JobSet sets up a headless service and takes care of its configuration and lifecycle management automatically.
Concept of the new open source API JobSet for Kubernetes.
(Image: kubernetes.io)
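As a rough illustration of the structure described above, the following Python sketch assembles a JobSet manifest with one leader job and several identical worker jobs, each group with its own pod template. The field names follow the JobSet `v1alpha2` API as far as it is publicly documented; the image, the commands `leader.py` and `worker.py`, and the helper names are hypothetical. The hostname pattern in `pod_hostnames` is an assumption based on the naming convention in the JobSet documentation and may differ between versions.

```python
import json


def job_template(command, parallelism, image="python:3.12"):
    """Build a Job template; image and command are placeholder values."""
    return {
        "spec": {
            "parallelism": parallelism,
            "completions": parallelism,
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": "main", "image": image, "command": command}
                    ],
                }
            },
        }
    }


def build_jobset(name, worker_jobs, pods_per_job):
    """Return a JobSet manifest as a dict: one leader job, N worker jobs."""
    return {
        "apiVersion": "jobset.x-k8s.io/v1alpha2",
        "kind": "JobSet",
        "metadata": {"name": name},
        "spec": {
            "replicatedJobs": [
                {
                    "name": "leader",
                    "replicas": 1,
                    "template": job_template(["python", "leader.py"], 1),
                },
                {
                    "name": "workers",
                    # Number of identical child jobs created from one template.
                    "replicas": worker_jobs,
                    "template": job_template(["python", "worker.py"], pods_per_job),
                },
            ]
        },
    }


def pod_hostnames(jobset):
    """List the DNS names pods would get behind the headless service.

    Assumed pattern: <jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<jobset>
    """
    name = jobset["metadata"]["name"]
    hosts = []
    for rjob in jobset["spec"]["replicatedJobs"]:
        for job_idx in range(rjob["replicas"]):
            for pod_idx in range(rjob["template"]["spec"]["parallelism"]):
                hosts.append(f"{name}-{rjob['name']}-{job_idx}-{pod_idx}.{name}")
    return hosts


if __name__ == "__main__":
    manifest = build_jobset("demo", worker_jobs=2, pods_per_job=2)
    print(json.dumps(manifest, indent=2))
    print(pod_hostnames(manifest))
```

The JSON output can be passed to `kubectl apply -f -` on a cluster where the JobSet controller is installed, since Kubernetes accepts JSON manifests as well as YAML.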
JobSet also allows each child job to be assigned exclusively to one topology domain, for example one of the dedicated hardware accelerator islands. This benefits, among other things, ML training approaches such as distributed data parallel (DDP), in which exactly one model replica runs per high-speed accelerator island and only the synchronization between the replicas goes over the slower network between the islands.
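In the example published on the Kubernetes blog, exclusive placement is requested through an annotation on the JobSet. The following sketch adds that annotation to a manifest held as a Python dict. The annotation key is the alpha key used in that example and may change in later versions; the node label `cloud.google.com/gke-nodepool` is the GKE-specific value from the same example, and the helper name is hypothetical.

```python
import copy

# Alpha annotation key as used in the JobSet example on the Kubernetes blog.
EXCLUSIVE_TOPOLOGY_KEY = "alpha.jobset.sigs.k8s.io/exclusive-topology"


def with_exclusive_topology(jobset, topology_label):
    """Return a copy of the manifest that requests one child job per domain.

    topology_label is the node label that defines a topology domain,
    for example a node pool that corresponds to one accelerator island.
    """
    patched = copy.deepcopy(jobset)
    metadata = patched.setdefault("metadata", {})
    annotations = metadata.setdefault("annotations", {})
    annotations[EXCLUSIVE_TOPOLOGY_KEY] = topology_label
    return patched


if __name__ == "__main__":
    base = {
        "apiVersion": "jobset.x-k8s.io/v1alpha2",
        "kind": "JobSet",
        "metadata": {"name": "ddp-demo"},
        "spec": {"replicatedJobs": []},
    }
    print(with_exclusive_topology(base, "cloud.google.com/gke-nodepool"))
```

Working on a deep copy keeps the original manifest unchanged, so the same base definition can be reused for clusters with a different topology label.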
In addition, JobSet provides configurable success and failure policies. For example, developers can define a policy that determines how many times the JobSet may be restarted after a failure. If a job is marked as failed, the entire JobSet is recreated so that the workload in question can resume from the most recent checkpoint.
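The interplay of a restart limit and checkpoints can be illustrated with a small simulation. This is not the JobSet controller's logic, only a sketch of the behavior described above: in a manifest the limit would correspond to `spec.failurePolicy.maxRestarts`, while the function and its parameters here are hypothetical. Each step listed in `fail_at` fails once; after a failure the whole run restarts from the last checkpoint until the restart budget is used up.

```python
def run_with_restarts(total_steps, checkpoint_every, max_restarts, fail_at):
    """Simulate a training run that restarts as a whole after a failure.

    fail_at: steps at which the run fails the first time it reaches them.
    Returns the final status, restarts used, and total steps executed.
    """
    pending_failures = set(fail_at)
    checkpoint = 0   # last step that was saved
    restarts = 0
    executed = 0     # counts repeated work after restarts as well

    while True:
        step = checkpoint  # every attempt resumes from the last checkpoint
        failed = False
        while step < total_steps:
            if step in pending_failures:
                pending_failures.discard(step)
                failed = True
                break
            step += 1
            executed += 1
            if step % checkpoint_every == 0:
                checkpoint = step
        if not failed:
            return {"status": "Completed", "restarts": restarts,
                    "steps_executed": executed}
        if restarts == max_restarts:
            return {"status": "Failed", "restarts": restarts,
                    "steps_executed": executed}
        restarts += 1


if __name__ == "__main__":
    print(run_with_restarts(total_steps=10, checkpoint_every=4,
                            max_restarts=3, fail_at={6}))
```

In this run the failure at step 6 costs only the two steps since the checkpoint at step 4, which is the reason for combining a restart policy with regular checkpoints.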
A post on the Kubernetes blog summarizes JobSet's range of applications and the most important features available so far. Using an example of distributed ML training with the ML framework JAX, the authors also demonstrate how JobSet can be configured for a TPU multislice workload. The API's development team plans to add further features; an overview can be found in the JobSet roadmap.
(Map)