Data analytics in the era of large-scale machine learning

Date: Tuesday 23 – Wednesday 24 May 2023

Location: The Cyprus Institute

This event is organised as part of the EuroCC2 project and the activities of the National Competence Center, in collaboration with the Greek National Competence Center.

Pre-requisites
Attendees should be familiar with at least one programming language, such as C/C++, Fortran, Python, or R.

Requirements
All attendees will need their own desktop or laptop with the following software installed:

  • Web browser – e.g. Firefox or Chrome
  • PDF viewer – e.g. Firefox, Adobe Acrobat
  • SSH client – the Terminal on macOS or Linux is fine; on Windows, PuTTY works well.

Participation and Registration

This will be a hybrid event: participants can attend either on-site at The Cyprus Institute or remotely via Zoom.

Git Repository

The Git repository with all material for the training event can be found at the link below:
https://github.com/CaSToRC-CyI/NCC-training-202305

Agenda

Tuesday 23 May 2023

Introduction to Training Event

Wednesday 24 May 2023

Large-scale generative models for language and vision (including LLMs): How they work – and what we still do not know about them

Speakers: Professor Constantine Dovrolis and Dr. Mihalis Nicolaou

Description: This research talk provides a comprehensive overview of large-scale generative models in machine learning, such as generative adversarial networks, transformers, and large language models (LLMs), focusing on key technologies such as ChatGPT, Bard, Generative Adversarial Networks, and Stable Diffusion. We will discuss the mathematical underpinnings of these models, including attention mechanisms, self-attention, and positional encoding. An examination of the deep neural network architectures used, such as the multi-layered transformer architecture, will offer insight into their impact on natural language processing and other fields.

The presentation will also cover the training and fine-tuning processes of these advanced models, highlighting how they enable a wide range of applications across diverse domains. Furthermore, we will address the limitations and open questions surrounding these technologies, including their interpretability, potential biases, energy consumption, and the development of more efficient and robust models. By offering a holistic view of the current state of transformers and large language models, this talk aims to encourage further research and innovation in the field.
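
As a taste of the mathematical underpinnings mentioned above, the sketch below shows scaled dot-product attention, the operation at the core of transformer self-attention, written in plain NumPy. The matrix shapes and variable names are illustrative only and are not taken from the talk's material.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                        # similarity of every query to every key
        scores = scores - scores.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
        return weights @ V                                     # weighted sum of the value vectors

    # toy self-attention: 4 tokens with 8-dimensional embeddings, so Q = K = V
    x = np.random.rand(4, 8)
    out = scaled_dot_product_attention(x, x, x)

In a full transformer layer, Q, K, and V are linear projections of the token embeddings, the computation is repeated across several attention heads, and positional encodings are added to the inputs so the model can distinguish token order.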

PyTorch Neural Networks: Running on CPUs and GPUs

Speaker: Dr. Pantelis Georgiades

Prerequisites: Trainees should be comfortable with the Python programming language.

Description: In this session we will present a simple introduction to neural networks and work through a classification problem with the PyTorch framework in Python, on both CPUs and GPUs. PyTorch is a deep learning framework developed by Meta that offers a fast and flexible set of tools to develop and deploy deep learning models on both CPUs and GPUs. The example will be presented in an interactive Jupyter Notebook, and trainees will have the opportunity to become familiar with the workflow and implementation of a data science project using state-of-the-art deep learning libraries.
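
For a flavour of the workflow, the minimal sketch below trains a small PyTorch classifier and selects a GPU when one is available; the random toy data and network sizes are placeholders, not the session's actual example.

    import torch
    import torch.nn as nn

    # run on a GPU if one is available, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2)).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # random toy data standing in for a real classification dataset
    X = torch.randn(256, 20, device=device)
    y = torch.randint(0, 2, (256,), device=device)

    for epoch in range(20):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)   # forward pass and loss
        loss.backward()               # backpropagation
        optimizer.step()              # parameter update

Because the model and the tensors are moved to the same device, the identical training loop runs unchanged on a CPU or a GPU.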

Parallel computing techniques for scaling hyperparameter tuning of Gradient Boosted Trees and Deep Learning

Speaker: Dr. Nikos Bakas

Description: The presentation discusses hyperparameter tuning when developing machine learning models on supercomputers. We will present parallelization techniques using XGBoost and PyTorch on large-scale supercomputers, aiming to scale up performance in terms of computing time and accuracy. Computational bottlenecks during hyperparameter tuning, and the impact of multiprocessing on CPU utilization, will be presented, along with a cross-validation algorithm for efficient exploration of the hyperparameter search space. The use of XGBoost and PyTorch in a multiprocessing setting on powerful CPUs will be demonstrated, as well as insights on handling multiple OpenMP runtimes. Scaling results from applying these parallelization techniques on supercomputers will be presented, analyzing the impact of increasing the number of threads on hyperparameter optimization and the resulting reduction in tuning time.
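
A rough illustration of this kind of setup is sketched below: each candidate hyperparameter configuration is cross-validated in its own worker process, and each XGBoost model is kept single-threaded so that the workers' OpenMP runtimes do not oversubscribe the CPU cores. The grid, dataset, and worker count are invented placeholders, not the material used in the talk.

    from itertools import product
    from multiprocessing import Pool

    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score

    # synthetic data standing in for a real training set
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # hypothetical search grid: (max_depth, learning_rate) combinations
    grid = list(product([3, 6, 9], [0.05, 0.1, 0.3]))

    def evaluate(params):
        max_depth, lr = params
        model = xgb.XGBClassifier(max_depth=max_depth, learning_rate=lr,
                                  n_estimators=200, n_jobs=1)  # single-threaded per worker
        score = cross_val_score(model, X, y, cv=5).mean()      # 5-fold cross-validation
        return params, score

    if __name__ == "__main__":
        with Pool(processes=4) as pool:            # one process per concurrent configuration
            results = pool.map(evaluate, grid)
        best = max(results, key=lambda r: r[1])
        print("best (max_depth, learning_rate):", best)

Restricting each worker to one thread is one simple way to avoid contention between multiple OpenMP runtimes; on a large node one could instead give each worker a fixed share of the cores.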

Efficient Data Cleaning and Pre-processing Techniques for Robust Machine Learning

Speaker: Dr. Charalambos Chrysostomou

Description: In this session, we will explore various data cleaning and pre-processing techniques that can enhance data quality and improve the performance of machine learning models. The session will cover handling missing values, outlier detection, data transformation, feature scaling, and encoding categorical variables. By applying these techniques, participants will learn how to create robust and high-performing machine-learning models. The examples will be presented using Python and popular data processing libraries such as Pandas and Scikit-learn. Attendees will have the opportunity to become familiar with the workflow and implementation of data cleaning and pre-processing techniques.
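
For a flavour of the techniques involved, the sketch below combines imputation of missing values, feature scaling, and one-hot encoding of categorical columns into a single Scikit-learn ColumnTransformer; the toy DataFrame and column names are invented for illustration.

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # hypothetical raw table with missing values and mixed column types
    df = pd.DataFrame({
        "age":    [25,      np.nan,  47,     51],
        "income": [30000.,  42000.,  np.nan, 88000.],
        "city":   ["Nicosia", "Limassol", np.nan, "Larnaca"],
    })

    numeric = ["age", "income"]
    categorical = ["city"]

    preprocess = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical),
    ])

    X_clean = preprocess.fit_transform(df)   # imputed, scaled, one-hot encoded feature matrix

Wrapping the cleaning steps in a pipeline means the same transformations learned on the training data can be applied consistently to new data at prediction time.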

GPU CUDA Programming

Speaker: Dr. Giannis Koutsou

Prerequisites: Trainees should be comfortable programming using C.

Description: An introduction to the GPU programming model and CUDA in particular will be provided. The hands-on component will begin with a step-by-step tutorial on how to write your first GPU program using CUDA, and continue with examples that demonstrate how data-layout, use of shared memory, and GPU thread distribution affect GPU performance.
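
The session itself works in CUDA C, but the basic GPU programming model it introduces, launching a kernel over a grid of thread blocks and mapping each thread to an array element, can be sketched from Python with Numba's CUDA support, as below. Numba is not part of the session material and is used here only for illustration; running the sketch requires an NVIDIA GPU with the CUDA toolkit installed.

    from numba import cuda
    import numpy as np

    @cuda.jit
    def vector_add(a, b, out):
        i = cuda.grid(1)        # global thread index across the whole grid
        if i < out.size:        # guard threads that fall beyond the end of the array
            out[i] = a[i] + b[i]

    n = 1_000_000
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)
    out = np.zeros_like(a)

    threads_per_block = 256
    blocks = (n + threads_per_block - 1) // threads_per_block   # enough blocks to cover n elements
    vector_add[blocks, threads_per_block](a, b, out)            # launch the kernel on the GPU

The hands-on examples in the session explore how choices such as the data layout, the use of shared memory, and the distribution of threads over blocks change the performance of kernels like this one.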