Postdoctoral Exchange Program with the University of British Columbia
The Institute of Data Science (IDS) has launched a new postdoctoral exchange program in data science with the Data Science Institute at the University of British Columbia (UBC). The exchange program, made possible with most of its funding coming from NUS, will support IDS data science postdoctoral fellows to conduct research at UBC for at least one year. A similar arrangement would allow UBC postdoctoral fellows to come to IDS. The primary objective is to provide a richer cultural experience and more diverse training for postdoctoral fellows at both institutes. The program will begin in September 2023.
Please refer to the project opportunities listed below. To apply, please send your updated CV and the project you are interested in to paul.lim@nus.edu.sg by 30 Nov 2023.
(*) For UBC projects, applicants must have at least 1 year of postdoctoral working experience in IDS.
UBC Projects (*)
PI: Prof Giuseppe Carenini, Department of Computer Science, University of British Columbia
We aim to develop tools for automated information extraction from clinical documents. The unstructured documents covered by this proposal include pathology reports, radiology reports, etc. Most current methods for extracting information from clinical documents are based on large language models (LLMs). For instance, Agrawal et al. demonstrated that InstructGPT performs well at zero- and few-shot span identification, token-level sequence classification, and relation extraction on clinical text, even though it was not trained specifically for the healthcare domain [A+2022] (M. Agrawal, S. Hegselmann, H. Lang, Y. Kim and D. Sontag, “Large Language Models are Few-Shot Clinical Information Extractors,” EMNLP 2022). More recently, Zhou et al. showed that T5-based models can perform robust event extraction from radiology reports [ZYO2023] (S. Zhou, M. Yetisgen and M. Ostendorf, “Building blocks for complex tasks: Robust generative event extraction for radiology reports under domain shifts,” 5th Clinical Natural Language Processing Workshop, ACL 2023), while Miller et al. [ZM+2023] (W. Zhou, T. Miller et al., “Improving the Transferability of Clinical Note Section Classification Models with BERT and Large Language Model Ensembles,” 5th Clinical Natural Language Processing Workshop, ACL 2023) found that combining Flan-T5 with BERT produces effective, transferable classifiers of clinical note sections. Building on these promising results, we will design and implement LLM solutions (e.g., based on the publicly available Llama 2, released in July 2023) to populate a provincial cancer registry for real-world use in cancer control and surveillance. We will also explore other equally compelling use cases for extraction from clinical documents for various conditions (e.g., cardiology).
The exact choice of use cases will depend on the interests and expertise of the incoming postdoctoral fellows and on other factors, such as the availability of clinicians when the postdoctoral fellows arrive in Vancouver.
The key 12-month objective is to design and implement LLM solutions (e.g., based on the publicly available Llama 2) to populate a provincial cancer registry for cancer control and surveillance.
Manual extraction of structured information from documents typically demands tremendous time and labour, motivating the need for a system that automates the process. In this proposal, we aim to develop tools for automated table extraction from documents. That is, given a collection of fields of interest (FoIs), the task is to automatically extract and populate a table with the FoIs as the column headers. For example, the BC Cancer Registry is a database of all new cancers diagnosed in BC residents. It is intended to track the outcomes of patients treated for cancer in BC. To populate the Registry, it is critical to extract up-to-date information from streams of patient documents, including physicians’ notes, pathology reports, diagnostic imaging reports, and surgical notes. However, manual review of these reports by health records analysts requires extensive human resources. Before the pandemic, the volume of work for manual extraction already exceeded capacity, and the Registry lagged many months behind. With the pandemic, the lag has worsened, which significantly affects patient care and, downstream, evidence-based analyses of provincial cancer outcomes and treatment strategies.
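The shape of the table-extraction task can be illustrated with a minimal sketch: a list of FoIs and a set of documents go in, and one row per document comes out, with the FoIs as column headers. Everything specific here is an assumption made for illustration: the field names, the "Field: value" report layout, and the regex-based `extract_fields` function, which is only a toy stand-in for the LLM-based extractor the project proposes to build. It is not the Registry's actual schema or method.

```python
import re

# Hypothetical FoIs for illustration only; not the actual registry schema.
FIELDS = ["site", "histology", "grade"]


def extract_fields(report, fields):
    """Toy stand-in for an extractor: reads 'Field: value' lines from a report.

    Returns one table row as a dict keyed by field name; a field that is
    not found in the report is recorded as None.
    """
    row = {}
    for field in fields:
        pattern = rf"^{re.escape(field)}[ \t]*:[ \t]*(.+)$"
        match = re.search(pattern, report, flags=re.IGNORECASE | re.MULTILINE)
        row[field] = match.group(1).strip() if match else None
    return row


def build_table(reports, fields):
    """Populate a table (list of rows) with the FoIs as column headers."""
    return [extract_fields(report, fields) for report in reports]


# Invented example reports, not real clinical text.
reports = [
    "Site: lung\nGrade: 2",
    "Histology: adenocarcinoma",
]
table = build_table(reports, FIELDS)
```

In a real system the per-document extractor would be replaced by a model call, while the surrounding table-building step could stay the same.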
PI: Asst Prof Xiaoxiao Li, Electrical and Computer Engineering, University of British Columbia
Brain imaging biomarkers are invaluable because they are widely available and sensitive to the effects of healthy-aging interventions and disease processes. MRI is non-invasive and thus amenable to longitudinal measurements. However, longitudinal MRI imaging that assesses the effects of an intervention or a disease process has a fundamental confound that, until now, has been impossible to address: a comparison of two MRI scans taken at different times from the same individual includes both the effects of disease progression and/or intervention and the effects of natural aging. Recently, advances in machine learning have allowed for the development of generative models that can create highly authentic-appearing synthetic MRI scans, including MRI-associated changes due to aging. Qualitatively, it has long been observed that there are aging-associated morphological changes on MRI in various brain regions, including the whole brain, cerebral cortex, white matter, subcortical gray matter, and other individual structures. Here, we propose using a new type of ML model, a conditional factor generative model (e.g., a diffusion model), to synthesize what a given MRI would look like, based on the effects of natural aging, when projected to an arbitrary time in the future (e.g., 12 months).
To train and mentor a postdoc fellow who will build the longitudinal brain simulation and generation system:
There are many factors that can affect aging: diet, exercise, social environment, genetics and medical co-morbidities. Biomarkers for aging that are sensitive enough to detect the effects of interventions are vital. Brain imaging biomarkers are highly attractive for assessing brain aging changes because MRI is sensitive to disease and aging, is relatively easy to obtain, and is non-invasive. Nevertheless, MRI biomarkers pose some challenges: the large size of the data obtained makes it difficult to isolate the effects of therapeutic interventions, natural aging, and/or disease co-morbidities. It can also be difficult to compare subjects who have been imaged at different intervals (e.g., every 6 months compared to every 18 months). The current proposal, by developing a way to create subject-specific, synthetic MRI data projected forward to an arbitrary time in the future, will be transformative in brain MRI aging biomarker development.
PI: A/Prof Leonid Sigal, Department of Computer Science, University of British Columbia
A major benefit of visio-linguistic foundational models (e.g., ViLBERT, BLIP-2, Meta-Transformer) is that once pre-trained on large corpora of unlabeled or weakly-labeled (image-text pairs) data, they can be fine-tuned to downstream target tasks with very limited domain-specific supervision. At the same time, they provide a means to interpret the results by looking at attention patterns embedded in the Transformer layers. These two properties make this class of models particularly appealing for medical imaging, where data is limited and interpretability is required. Nevertheless, the use of such models in medical imaging has been rather limited so far. Applying such models directly to medical imaging data (e.g., radiographic, MRI, CT) and corresponding textual sources (e.g., doctor reports) is challenging. On the technical side, the high resolution of image data poses a challenge; on the medical side, interpretability and validation of such models require further study and deeper analysis. We intend to address the technical issues with structured attention and more granular alignment achieved through region-word alignment objectives (e.g., similar to those used in CLIP). In addition, we plan to carefully study the interpretability of the decisions being made by visualizing attention maps and correlating them with human-annotated regions corresponding to medical diagnosis and prognosis.
The key objective is to develop and train a visio-linguistic medical foundational model and to show its viability and usefulness on at least one diagnostic and/or prognostic task.
Visio-linguistic models have seen significant breakthroughs in computer vision over the past couple of years, showing impressive performance on a variety of multi-modal tasks. Most such methods rely on neural Transformer architectures, popularized in NLP. Once pre-trained, such models are then fine-tuned to specific downstream tasks (e.g., language query localization, question answering, etc.). Early results suggest that similar models have the potential to significantly benefit AI approaches to medical imaging, in the same way these models have substantially benefited NLP and computer vision. For example, one potential benefit of such models is that they can implicitly align different modalities of imaging data (e.g., radiographic imaging, CT, MRI) and the textual data describing diagnostic and potential prognostic outcomes (e.g., encapsulated in doctors’ reports). In this way, such models can learn how to associate certain regions of medical data with predictive factors. Further, such associations can be analysed for explainability to give confidence in AI predictions.
IDS Projects
PI: A/Prof Mehul Motani, Department of Electrical and Computer Engineering, National University of Singapore
Generalization is the ability of a learning algorithm to perform well on unseen data. Explainability deals with providing clear and understandable reasons for the predictions of a learning algorithm. In this project, we will explore generalization and explainability from an information-theoretic perspective. We propose to use information-theoretic metrics based on mutual information, which are scalable and efficient to compute. We aim to establish that these metrics track generalization and memorization, and that they allow for explainability and fairness analysis. This project builds on existing work in our lab, which has shown that information-theoretic methods hold great promise for the problems being proposed.
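For readers unfamiliar with mutual information, the quantity underlying these metrics is I(X; Y) = sum over (x, y) of p(x, y) log2[ p(x, y) / (p(x) p(y)) ]. The sketch below computes the simple plug-in (empirical frequency) estimate for paired discrete samples. It is an illustration only: the project description does not say which estimator or metric the lab uses, and the function name and example data are assumptions.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


# Invented toy data: a feature that fully determines the label,
# and a feature that is independent of it.
labels = [0, 0, 1, 1]
informative = [0, 0, 1, 1]
uninformative = [0, 1, 0, 1]

mi_informative = mutual_information(informative, labels)      # 1.0 bit
mi_uninformative = mutual_information(uninformative, labels)  # 0.0 bits
```

The plug-in estimate is biased on small samples and does not apply directly to continuous variables, so practical work typically relies on more careful estimators.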
Generalization and explainability are key issues in healthcare because of the life-and-death nature of the problems being studied. Information-theoretic ideas have the potential to shed light on these two issues. While they have been explored in more general scenarios, their application to healthcare has been less explored and holds significant potential.
PI: Prof Ng See-Kiong, Institute of Data Science and School of Computing, National University of Singapore
Collaborative machine learning is an appealing paradigm for building high-quality machine learning (ML) models by training on the data from many parties. In the healthcare domain, the ability to develop machine learning models over datasets distributed across data centers such as hospitals, clinical research labs, and mobile devices, while preventing data leakage, will be a key enabler for AI in healthcare, as medical data are often scarce within individual institutions. However, concerns over trust and security have hindered the sharing of data, as the “data silos” are inherently difficult to break at the data level, especially when personal or proprietary data are involved. We posit that ML models can be more amenable to sharing, as they are inherently more compact and self-contained, with purpose-compiled knowledge from the data. Rather than requiring the learning collaborators to contribute their private data, this project will focus on enabling collaborative machine learning by allowing the collaborators to share heterogeneous black-box models and to be appropriately incentivized based on their self-interests. Given that most current research is focused on the data level, this project will develop new model-centric collaborative machine learning methods, as well as new notions for trustable model-centric sharing and effective model management techniques for real-world model-centric platforms. We seek a postdoc who will study and develop such collaborative machine learning approaches for the healthcare domain.
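To make the model-centric idea concrete, the sketch below treats each collaborator's contribution as an opaque prediction function and combines the predictions by majority vote, so no private data ever leaves its owner. This is only one simple way to aggregate black-box models, chosen for illustration; the project description does not specify an aggregation or incentive scheme, and the names `ensemble_predict` and `site_a`/`site_b`/`site_c` are hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

# A black-box model is any callable mapping an input to a predicted label.
# Its architecture and training data stay hidden from the other parties.
BlackBoxModel = Callable[[Sequence[float]], Hashable]


def ensemble_predict(models: Sequence[BlackBoxModel], x: Sequence[float]) -> Hashable:
    """Combine the models' predictions for input x by majority vote.

    Ties are broken in favour of the label that was predicted first.
    """
    if not models:
        raise ValueError("at least one model is required")
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Three hypothetical collaborators with different (hidden) decision rules.
def site_a(x):
    return "high_risk" if x[0] > 0.5 else "low_risk"


def site_b(x):
    return "high_risk" if x[1] > 0.5 else "low_risk"


def site_c(x):
    return "high_risk" if x[0] + x[1] > 1.2 else "low_risk"


prediction = ensemble_predict([site_a, site_b, site_c], [0.9, 0.2])
```

Because the aggregator only calls each model, the collaborators can use entirely different model families internally, which is what makes the shared models heterogeneous.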
While this project is focused on one-shot federated learning approaches, the postdoc will focus on such approaches in the healthcare setting. The ability to develop machine learning models over datasets distributed across data centers such as hospitals, clinical research labs, and mobile devices while preventing data leakage, will be a key enabler for AI in healthcare, given that medical data are often scarce within individual institutions.