IDS Research Seminars (2022)


 

This is a series of seminars and talks on selected data science topics of interest, led by IDS researchers as well as invited external speakers. The talks were open to current NUS graduate students only and were conducted over Zoom. The seminar details, along with recordings of the talks for those interested in the topics, are listed below.

 

 

  • Date: 27 October 2022

    Time: 10:00AM - 12:00PM

    Speaker: Assistant Professor Hye Won Chung, Korea Advanced Institute of Science & Technology (KAIST)

    Title: Data Valuation: Understanding Value of Data in Training Neural Networks

    Abstract: Many recent works on understanding deep learning try to quantify how much individual data instances influence the optimization and generalization of a model, either by analyzing the behavior of the model during training or by measuring the performance gap of the model when the instance is removed from the dataset.

    Such approaches reveal the characteristics and importance of individual instances, which may provide useful information for diagnosing and improving deep learning. However, most existing works on data valuation require actual training of a model, which often demands high computational cost. In the first part of this talk, I will introduce a training-free data valuation score, called the complexity-gap score, a data-centric score that quantifies the influence of individual instances on the generalization of two-layer overparameterized neural networks.

    The proposed score quantifies the irregularity of the instances and measures how much each data instance contributes to the total movement of the network parameters during training. I will present theoretical analysis and empirical results that demonstrate the effectiveness of the complexity-gap score in finding 'irregular or mislabeled' data instances, and also provide applications of the score in analyzing datasets and diagnosing training dynamics.

    In the second part, I will discuss data valuation without labels, in the context of training Generative Adversarial Networks (GANs). Despite their remarkable performance in producing realistic samples, GANs often produce low-quality samples near low-density regions of the data manifold, e.g., samples of minor groups. To promote diversity in sample generation without degrading the overall quality, I will introduce a simple yet effective scoring method to diagnose and emphasize underrepresented samples during the training of a GAN. Through theoretical analysis and experimental results, I will show that the proposed method improves GAN performance and is especially effective in improving the diversity of sample generation for minor groups.
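
    For readers unfamiliar with data valuation, the sketch below illustrates the leave-one-out idea mentioned at the start of the abstract: score each training instance by the change in validation accuracy when that instance is removed. It is only a generic illustration on a toy scikit-learn model, not the training-free complexity-gap score or the GAN scoring method presented in the talk.

    ```python
    # Illustrative leave-one-out data valuation on a toy dataset.
    # This is NOT the complexity-gap score from the talk; it only shows the
    # generic "performance gap when an instance is removed" idea.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

    def val_acc(train_idx):
        """Train on the given subset of instances and return validation accuracy."""
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_tr[train_idx], y_tr[train_idx])
        return clf.score(X_val, y_val)

    base = val_acc(np.arange(len(X_tr)))
    scores = []
    for i in range(len(X_tr)):
        keep = np.delete(np.arange(len(X_tr)), i)
        # Value of instance i = drop in validation accuracy when it is removed.
        scores.append(base - val_acc(keep))

    print("most valuable instance:", int(np.argmax(scores)))
    print("least valuable (possibly mislabeled) instance:", int(np.argmin(scores)))
    ```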

    Speaker Bio: Hye Won Chung is an Assistant Professor in the School of Electrical Engineering at KAIST. Her research interests include mathematical data science, machine learning and information theory.

    She received her B.S. degree (summa cum laude) from KAIST in Korea and her M.S. and Ph.D. degrees from MIT, all in Electrical Engineering and Computer Science, in 2007, 2009, and 2014, respectively. From 2014 to 2017, she worked as a Research Fellow in the Department of Electrical Engineering and Computer Science at the University of Michigan.

    Dr. Chung is a recipient of several awards, including the KAIST Technology Innovation Award (2021), an early career award from the National Research Foundation of Korea (2021), and a Department Teaching Award (2017). She is serving as a WithITS co-chair of the IEEE Information Theory Society.

    Click here for the seminar presentation slides.

    Video Recording:


     


 

  • Date: 12 October 2022

    Time: 10:00AM - 12:00PM

    Speaker: Professor Sotirios A. Tsaftaris, University of Edinburgh (UK)

    Title: The Pursuit of Generalization With Real Data

    Abstract: Real data are not perfect, yet we want to learn perfect AI models from them. In this talk, I will examine the effects of spurious correlations when they are rare or frequent and when they affect the input or the output (label). I will show solutions that span from privacy and data leakage to using synthetic data to correct frequent biases (e.g., those originating from using data from different scanners).
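
    As a hedged illustration of the kind of failure the abstract refers to (not an example from the talk itself), the toy sketch below injects a spurious feature that correlates with the label in the training data but not at test time; a linear model latches onto the shortcut and its test accuracy drops.

    ```python
    # Toy illustration of a spurious correlation in the input (not from the talk).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def make_split(n, spurious_strength):
        """Labels depend on a 'true' feature; a second feature agrees with the
        label with probability `spurious_strength` (the spurious shortcut)."""
        y = rng.integers(0, 2, n)
        true_feat = y + 0.5 * rng.standard_normal(n)
        agree = rng.random(n) < spurious_strength
        spurious_feat = np.where(agree, y, 1 - y) + 0.1 * rng.standard_normal(n)
        return np.c_[true_feat, spurious_feat], y

    X_tr, y_tr = make_split(2000, spurious_strength=0.95)  # shortcut works in training
    X_te, y_te = make_split(2000, spurious_strength=0.50)  # shortcut is useless at test time

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("train acc:", clf.score(X_tr, y_tr))
    print("test acc :", clf.score(X_te, y_te))  # drops because the model relies on the shortcut
    ```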

    Speaker Bio: Prof Sotirios A. Tsaftaris, or Sotos, (https://vios.science/) is currently the Canon Medical/Royal Academy of Engineering Research Chair in Healthcare AI, and Chair (Full Professor) in Machine Learning and Computer Vision at the University of Edinburgh (UK). He is also a Turing Fellow with the Alan Turing Institute and an ELLIS Fellow. Previously, he held faculty positions with the IMT Institute for Advanced Studies Lucca (Italy) and Northwestern University (USA). He has published extensively, particularly in interdisciplinary fields, with more than 160 journal and conference papers to his name. His research interests are machine learning, computer vision, and image analysis.


 

  • Date: 20 September 2022

    Time: 10:00AM - 12:00PM

    Speaker: Assistant Professor Jean Honorio, Purdue University

    Title: Computational and Statistical Foundations of Relaxations in Combinatorial ML

    Abstract: Several problems in machine learning are combinatorial (e.g., clustering, learning sparse representations). Convex relaxations have been used in the past to circumvent the NP-hardness of these combinatorial problems, albeit yielding only approximate solutions. My long-term research goal is to uncover large classes of continuous relaxations (beyond convexity) of worst-case NP-hard problems that admit exact solutions.

    In this talk, I will introduce a unifying framework, which uses the power of continuous relaxations (beyond convexity), Karush-Kuhn-Tucker conditions, primal-dual certificates and concentration inequalities. This framework has produced breakthroughs not only for classical worst-case NP-hard problems, e.g., Bayes nets, structured prediction, clustering, but also for aspects of fairness, meta learning and federated learning.

    As a first example, we study inference in structured prediction. We show that the success of semidefinite programming depends on the algebraic connectivity of the input graph. As a breakthrough, we also provide guarantees for graphs with poor expansion properties (e.g., planar graphs) under noise (cf. smoothed analysis).

    As a second example, we study the problem of fair sparse regression on a biased dataset where bias depends upon a hidden binary attribute. Thus, one needs to combine sparse regression with clustering. The corresponding optimization problem is combinatorial but we propose an invex (nonconvex) relaxation. Besides the technical breakthrough, we are the first to analyze fairness without observing the sensitive attribute(s).

    As a third example, we study the problem of exact partitioning of high-order models, formulated as a tensor optimization problem. We relax this high-order combinatorial problem to a convex conic form problem. To this end, we carefully define the Caratheodory symmetric tensor cone, and show its convexity, and the convexity of its dual cone.
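
    To make the idea of relaxing a combinatorial problem into a convex one concrete, the hedged sketch below solves the standard semidefinite relaxation of max-cut with CVXPY and rounds the solution with a random hyperplane. It is a textbook example chosen purely for illustration, not one of the specific formulations (structured prediction, fair sparse regression, tensor partitioning) analyzed in the talk.

    ```python
    # Standard SDP relaxation of max-cut (Goemans-Williamson), for illustration only.
    import cvxpy as cp
    import numpy as np

    rng = np.random.default_rng(0)
    n = 8
    W = rng.random((n, n))
    W = np.triu(W, 1) + np.triu(W, 1).T  # symmetric edge weights, zero diagonal

    # Combinatorial problem: maximize sum_{i<j} W_ij * (1 - x_i x_j) / 2 over x in {-1, +1}^n.
    # Convex relaxation: replace the rank-1 matrix x x^T by a PSD matrix X with unit diagonal.
    X = cp.Variable((n, n), symmetric=True)
    constraints = [X >> 0, cp.diag(X) == 1]
    prob = cp.Problem(cp.Maximize(cp.sum(cp.multiply(W, 1 - X)) / 4), constraints)
    prob.solve()

    # Randomized rounding: factor X and cut with a random hyperplane.
    eigval, eigvec = np.linalg.eigh(X.value)
    V = eigvec * np.sqrt(np.clip(eigval, 0, None))  # rows v_i satisfy <v_i, v_j> ~ X_ij
    x = np.sign(V @ rng.standard_normal(n))

    cut_value = np.sum(W * (1 - np.outer(x, x))) / 4
    print("SDP upper bound:", prob.value, " rounded cut:", cut_value)
    ```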

    Speaker Bio: Jean Honorio is an Assistant Professor in the Computer Science Department at Purdue University, as well as in the Statistics Department (by courtesy). Prior to joining Purdue, Jean was a postdoctoral associate at MIT, working with Tommi Jaakkola. His Erdős number is 3. His work has been partially funded by NSF. He is an editorial board reviewer of JMLR, and has served as an area chair of NeurIPS, a senior PC member of IJCAI and AAAI, and a PC member of NeurIPS, ICML, and AISTATS, among other conferences and journals.

     

     


 

  • Date: 13 Jan 2022

    Time: 9:00AM - 10:00AM

    Speaker: Satwinder Singh, Massey University, Auckland, New Zealand

    Title: Automatic Speech Recognition for Low Resource Languages

    Abstract: Recent advancements in deep learning have led to the state-of-the-art performance of automatic speech recognition (ASR) systems. Most contemporary ASR systems are end-to-end models that jointly learn the acoustic model, pronunciation model, and language model. These end-to-end ASR systems work well for resource-rich languages like English and Mandarin. However, they tend to perform poorly on languages that do not have enough training data and other linguistic resources needed for building a robust speech recognition system. These languages are referred to as low-resource languages. In the past few years, many studies have focused on low-resource languages and have proposed numerous solutions. In this talk, I will discuss various solutions proposed for building ASR systems for low-resource languages, including multilingual ASR systems, cross-lingual transfer learning, meta-learning-based systems, semi-supervised learning, and data augmentation approaches. Lastly, I will also present some initial results for Māori (spoken in New Zealand) and Punjabi (native to India and Pakistan).
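
    As a small, hedged illustration of the data augmentation approaches mentioned in the abstract (not code from the speaker's work), the sketch below applies two common waveform-level augmentations, additive noise at a target SNR and speed perturbation by resampling, using only NumPy.

    ```python
    # Two common waveform augmentations for low-resource ASR (illustrative only).
    import numpy as np

    def add_noise(wave: np.ndarray, snr_db: float = 20.0, rng=None) -> np.ndarray:
        """Add white Gaussian noise so the result has roughly the given SNR."""
        rng = rng or np.random.default_rng()
        signal_power = np.mean(wave ** 2)
        noise_power = signal_power / (10 ** (snr_db / 10))
        return wave + rng.standard_normal(len(wave)) * np.sqrt(noise_power)

    def speed_perturb(wave: np.ndarray, factor: float = 1.1) -> np.ndarray:
        """Resample the waveform to play `factor` times faster (also shifts pitch)."""
        old_idx = np.arange(len(wave))
        new_idx = np.arange(0, len(wave) - 1, factor)
        return np.interp(new_idx, old_idx, wave)

    # Example: augment a synthetic 1-second, 16 kHz sine tone.
    sr = 16000
    t = np.arange(sr) / sr
    wave = 0.1 * np.sin(2 * np.pi * 440 * t)
    augmented = [add_noise(wave, snr_db=15.0), speed_perturb(wave, 0.9), speed_perturb(wave, 1.1)]
    print([len(a) for a in augmented])
    ```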

    Speaker Bio: Satwinder Singh is a final-year PhD student at the School of Mathematical and Computational Science, Massey University, Auckland, New Zealand. He is currently working in the area of speech processing; the main focus of his research is to improve speech recognition systems for low-resource languages. His research interests include deep learning for speech processing, automatic speech recognition, and multilingual meta learning. He has recently published his research in international conferences and journals (IEEE ICASSP, Interspeech, Journal of Voice).

    Video Recording:


     

     

     


 

  • Date: 12 Jan 2022

    Time: 10:00AM - 11:00AM

    Speaker: Dr. Linshan Jiang, Nanyang Technological University

    Title: Lightweight Privacy-Preserving Deep Learning and Inference in Internet of Things

    Abstract: With the rapid development of sensing and communication technologies, the Internet of Things (IoT) is becoming a global data generation infrastructure. To utilize the massive data generated by the IoT for achieving better system intelligence, machine learning and inference on the IoT data at the edge and core (i.e., cloud) of the IoT are needed. However, the pervasive data collection and processing engender various privacy concerns. While various privacy preservation mechanisms have been proposed in the context of cloud computing, they may be ill-suited for the IoT due to the resource constraints at the IoT edge. In this talk, I will present four privacy-preserving approaches for the learning and inference phases. These four approaches are computationally lightweight and can be executed by resource-limited edge devices, including smartphones and even mote-class sensor nodes. Extensive performance evaluations on multiple datasets, and real implementations on IoT hardware platforms, show the effectiveness and efficiency of these approaches in protecting data privacy while maintaining learning and inference performance.
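
    The sketch below is a generic, hedged illustration of one lightweight idea in this space: perturbing extracted features on the device with calibrated Laplace noise before sending them to the cloud, in the spirit of local differential privacy. It is not one of the four approaches presented in the talk; the function name and parameters are invented for illustration.

    ```python
    # Generic on-device feature perturbation before upload (illustration only,
    # not one of the talk's four approaches).
    import numpy as np

    def perturb_features(features: np.ndarray, epsilon: float = 1.0,
                         clip: float = 1.0, rng=None) -> np.ndarray:
        """Clip each feature to [-clip, clip], then add Laplace noise whose scale
        is calibrated to the clipping range (sensitivity 2*clip) and epsilon."""
        rng = rng or np.random.default_rng()
        clipped = np.clip(features, -clip, clip)
        scale = 2.0 * clip / epsilon  # Laplace scale = sensitivity / epsilon
        return clipped + rng.laplace(0.0, scale, size=clipped.shape)

    # Example: an edge device perturbs a feature vector before sending it out.
    raw = np.random.default_rng(0).standard_normal(16)
    noisy = perturb_features(raw, epsilon=2.0)
    print("on-device features:", np.round(raw[:4], 2))
    print("uploaded features :", np.round(noisy[:4], 2))
    ```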

    Speaker Bio: Dr. Linshan Jiang completed his oral defense at the School of Computer Science and Engineering, Nanyang Technological University, in December 2021. He has published several papers at CPS-IoT Week and in ACM Transactions on Sensor Networks. His research interests include privacy and security in the Internet of Things, resilient AIoT systems, and cyber-physical systems.

    Video Recording:


     

     
