Date: Friday, 19 April 2024
Location: Richard Cooper Seminar Room, Fresnel Building, Athalassa Campus, The Cyprus Institute
    
Speakers:
Mr Christodoulos Stylianou, Research Engineer (CaSToRC, The Cyprus Institute)
Mr Ivan Gentile, Data Scientist (IFAB - NCC Italy)
Dr Charalambos Chrysostomou, Associate Research Scientist (CaSToRC, The Cyprus Institute)
Description:
The tutorial presents optimization techniques for Llama, a foundational Large Language Model (LLM) based on the Transformer architecture and analogous to the GPT series. Although these models are noted for their human-like text generation, their size and computational demands make efficiency and scalability difficult to achieve. The session aims to improve the inference efficiency of these models through PyTorch-native optimization strategies, including model compilation, GPU quantization, speculative decoding, and tensor parallelism. Participants will have the chance to evaluate these optimizations in real time on a real supercomputer. These methods aim to significantly reduce inference time and resource usage, broadening the range of computing environments and research projects in which such models can be applied.
Prerequisites:
This is an intermediate-level tutorial: participants should have basic knowledge of deep learning and of programming in Python. Some experience with HPC systems (Linux shell, Slurm) is helpful but not mandatory. Participants should bring a laptop from which they can access the HPC system; access will be provided through individual accounts on the Jupyter platform.
Agenda
    10:00 - 10:15
Optimizing Llama
A gentle introduction to LLMs and Llama
    10:15 - 11:00
Optimizing Llama
Enhancing Efficiency and Scalability of Large Language Models with PyTorch
    11:00 - 11:15
Coffee Break
    11:15 - 12:15
Efficient Scaling of Machine Learning Models
Distributed Training
    12:15 - 13:00
Coffee Break