IDS Research Seminars (2023)


 

This is a series of seminars and talks on selected data science topics. The seminars were led by IDS researchers as well as invited external speakers. The talks were open to current NUS graduate students only and were conducted over Zoom. The seminar details and recordings of the talks are provided below for those interested in the topics.

 

 

  • Date: 26 September 2023

    Time: 10:00AM - 11:30AM

    Speaker: Tan Wang-Chiew, Research Scientist @ Meta AI

    Title: From Deep Data Integration To Using LLMs To Query Unstructured And Structured Data

    Abstract: In recent years, we have witnessed the widespread adoption of deep learning techniques as avant-garde solutions to different computational problems. In data integration, the use of deep learning techniques has helped establish several state-of-the-art results in long-standing problems, including information extraction, entity matching, data cleaning, and table understanding. In this talk, I will reflect on the strengths of deep learning and how it has helped move the needle forward in data integration. I will also discuss a few challenges associated with solutions based on deep learning techniques and describe some opportunities for future work.
    Recently, Large Language Models (LLMs) have emerged as a powerful tool for accessing parametric knowledge, but the potential of tapping into the vast expanse of external or private data remains largely unexplored. This talk presents an open-source question-answering system for seamlessly integrating model parameters with knowledge from external data sources to enhance its predictive capabilities. Our larger vision transcends question answering. We envision a personal insight assistant, adept at sifting through your past data to offer invaluable insights to help make informed decisions and plan with foresight.

    Speaker Bio: Wang-Chiew is a research scientist at Meta AI. Before that, she was the Head of Research at Megagon Labs, where she led the research efforts on building advanced technologies to enhance search by experience. Prior to joining Megagon Labs, she was a Professor of Computer Science at the University of California, Santa Cruz. She also spent two years at IBM Research - Almaden. She received her B.Sc. (First Class) in Computer Science from the National University of Singapore and her Ph.D. in Computer Science from the University of Pennsylvania. Her research interests include data integration and exchange, data provenance, and natural language processing. She is the recipient of an NSF CAREER award, a Google Faculty Award, and an IBM Faculty Award. She has co-authored best papers and is a co-recipient of the 2014 ACM PODS Alberto O. Mendelzon Test-of-Time Award, the 2018 ICDT Test-of-Time Award, and the 2020 Alonzo Church Award. She received the 2019 VLDB Women in Database Research Award. She was on the VLDB Board of Trustees (2014-2019) and is a Fellow of the ACM.


 

  • Date: 19 May 2023

    Time: 10:00AM - 12:00PM

    Speaker: Li Qinbin, Post-doc, UC Berkeley

    Title: Practical Federated Learning on non-IID Data: Algorithms and Systems

    Abstract: Federated learning has emerged as a promising distributed learning paradigm, facilitating collaborative model training without the need for raw data exchange. However, the presence of non-IID (non-independent and identically distributed) and hybrid data among participating parties poses significant challenges in developing practical federated learning approaches. In this talk, I will delve into my research on tackling the challenges posed by non-IID data in federated learning. I will present novel ideas and insights aimed at developing efficient and effective federated learning algorithms and systems designed specifically for decision trees and deep learning models. Additionally, I will explore future directions and potential advancements in the field of federated learning over the next decade.

    Speaker Bio: Qinbin Li is a postdoc at UC Berkeley. He received his Ph.D. degree from the National University of Singapore. His research interests include federated learning, privacy, and systems. He has published papers in top-tier conferences and journals, including ICML, ICLR, NeurIPS, and MLSys. He has received numerous honors and awards, including the TPDS best paper award in 2019 and a Google Ph.D. Fellowship in 2021.

    Click here for the seminar presentation slides.


 

  • Date: 19 May 2023

    Time: 10:00AM - 12:00PM

    Speaker: Prof. Raymond Ng, University of British Columbia, Canada

    Title: Healthcare Applications with Natural Language Processing

    Abstract: Unstructured documents often come with embedded structured data. Representing valuable, structured information as tables is popular in health, finance, and many other domains. However, manual extraction of structured information from documents typically costs tremendous time and labor, motivating the need for a system that automates the process. Once such tables have been extracted, the data can be used for a wide variety of tasks, such as question answering and various “down-stream” analytics tasks. In this talk, we will discuss how to leverage groundbreaking pre-trained language models (e.g., BERT, ChatGPT) to develop tools for automated table extraction from various types of documents. We will present different applications, ranging from cancer registry reporting and cancer care to psychiatric hospitalization prediction and waste management in mineral mining.

    Speaker Bio: Raymond Ng is the Canada Research Chair in Data Science and Analytics. He is also the founding Director of the Data Science Institute at the University of British Columbia, and an elected fellow of the Royal Society of Canada. He was named one of the world’s top-75 “Academic Data Science Leaders 2022” by the Chief Data Officer Magazine, which grew out of MIT’s Sloan School of Management. Ng’s main research area for the past three decades has been data mining, with a specific focus on health informatics and text mining. He has published over 220 peer-reviewed publications on data clustering, outlier detection, OLAP processing, health informatics and text mining. He is the recipient of two best paper awards: one from the 2001 ACM SIGKDD conference, the premier data mining conference in the world, and one from the 2005 ACM SIGMOD conference, one of the top database conferences worldwide.

    Click here for the seminar presentation slides.

    Video Recording:


     


 

  • Date: 19 May 2023

    Time: 10:00AM - 12:00PM

    Speaker: Assoc Prof. Xia Hu, Rice University, USA

    Title: Harnessing the Power of LLMs in Practice: An Introduction to ChatGPT and Beyond

    Abstract: Recent progress in large language models (LLMs) has resulted in highly effective models like OpenAI's ChatGPT that have demonstrated exceptional performance in various tasks, including question answering, essay writing, and code generation. This presentation will cover the evolution of LLMs from BERT to ChatGPT and showcase their use cases. Although LLMs are useful for many NLP tasks, one significant concern is the inadvertent disclosure of sensitive information, especially in the healthcare industry, where patient privacy is crucial. To address this concern, we developed a novel framework that generates high-quality synthetic data using ChatGPT and fine-tunes a local offline model for downstream tasks. The use of synthetic data improved the performance of downstream tasks, reduced the time and resources required for data collection and labeling, and addressed privacy concerns. Finally, we will discuss the regulation of LLMs, whose use has raised concerns about cheating in education. We will introduce our recent survey on LLM-generated text detection and discuss the opportunities and challenges it presents.

    Speaker Bio: Dr. Xia “Ben” Hu is an Associate Professor at Rice University in the Department of Computer Science and director of the Center for Transforming Data to Knowledge (D2K Lab). Dr. Hu has published over 100 papers in several major academic venues, including NeurIPS, ICLR, KDD, WWW, IJCAI, and AAAI. An open-source package developed by his group, AutoKeras, has become the most used automated deep learning system on GitHub (with over 8,000 stars and 1,000 forks). His work on deep collaborative filtering, anomaly detection and knowledge graphs has been included in the TensorFlow package, Apple production system and Bing production system, respectively. His papers have received several Best Paper (Candidate) awards from venues such as ICML, WWW, WSDM, ICDM, AMIA and INFORMS. He is the recipient of an NSF CAREER Award and an ACM SIGKDD Rising Star Award. His work has been cited more than 18,000 times with an h-index of 51. He is the conference General Co-Chair for WSDM 2020 and ICHI 2023. He is also the founder of AI POW LLC.

    Click here for the seminar presentation slides.

    Video Recording:


     


 

  • Date: 13 Jan 2023

    Time: 10:00AM - 12:00PM

    Speaker: Associate Professor Ali Tajer, Rensselaer Polytechnic Institute

    Title: Best Arm Identification in Stochastic Bandits: Optimality, Complexity, and Robustness

    Abstract: In this talk, we will present recent results on three aspects of best arm identification (BAI) in stochastic multi-armed bandit problems. First, we present recent optimality results for the general parameterized family of distributions. Next, we present a computationally efficient framework for the fixed-confidence setting. Finally, we consider settings in which the rewards are possibly contaminated and provide two algorithms for the sub-Gaussian setting: a gap-based algorithm and one based on successive elimination.

    Speaker Bio: Ali Tajer received B.Sc. and M.Sc. degrees in Electrical Engineering from Sharif University of Technology, an M.A. degree in Statistics, and a Ph.D. degree in Electrical Engineering from Columbia University. He is currently an Associate Professor of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute. His research interests include mathematical statistics, statistical signal processing, and network information theory. His recent publications include an edited book entitled Advanced Data Analytics for Power Systems (Cambridge University Press, 2020). He received an NSF CAREER award in 2016 and an AFRL Faculty Fellowship in 2019. He is currently serving as an Associate Editor for the IEEE Transactions on Information Theory and the IEEE Transactions on Signal Processing. In the past, he has served as an Editor for the IEEE Transactions on Communications, an Editor for the IEEE Transactions on Smart Grid, a Guest Editor for the IEEE Signal Processing Magazine, and a Guest Editor-in-Chief for the IEEE Transactions on Smart Grid.

    Video Recording:


     
