Postdoctoral Exchange Program with the University of British Columbia

 

The Institute of Data Science (IDS) has launched a new post-doctoral exchange program in data science with the Data Science Institute at the University of British Columbia (UBC). The exchange program, funded in the majority by NUS, will support IDS data science post-doctoral fellows in conducting research at UBC for at least one year. A similar arrangement will allow UBC post-doctoral fellows to come to IDS. The primary objective is to provide a richer cultural experience and more diverse training for post-doctoral fellows at both institutes. The program will begin in September 2023.

Please refer to the project opportunities listed below. To apply, please send your updated CV and the project you are interested in to paul.lim@nus.edu.sg by 30 Nov 2023.

(*) For UBC projects, applicants must have at least one year of postdoctoral working experience at IDS.

 

UBC Projects (*)


 

  • PI: Prof Giuseppe Carenini, Department of Computer Science, University of British Columbia


    Overview

    We aim to develop tools for automated information extraction from clinical documents. The unstructured documents covered by this proposal include pathology reports, radiology reports, etc. Most current methods for extracting information from clinical documents are based on Large Language Models (LLMs). For instance, Agrawal et al. [A+2022] demonstrated that InstructGPT performs well at zero- and few-shot span identification, token-level sequence classification, and relation extraction on clinical text, even though it was not trained specifically on the healthcare domain (M. Agrawal, S. Hegselmann, H. Lang, Y. Kim and D. Sontag, "Large Language Models are Few-Shot Clinical Information Extractors," EMNLP 2022). More recently, Zhou et al. [ZYO2023] showed that T5-based models can perform robust event extraction from radiology reports (S. Zhou, M. Yetisgen and M. Ostendorf, "Building blocks for complex tasks: Robust generative event extraction for radiology reports under domain shifts," 5th Clinical Natural Language Processing Workshop, ACL 2023), while Zhou, Miller et al. [ZM+2023] found that synergistically combining Flan-T5 with BERT produces effective, transferable classifiers of clinical note sections (W. Zhou, T. Miller et al., "Improving the Transferability of Clinical Note Section Classification Models with BERT and Large Language Model Ensembles," 5th Clinical Natural Language Processing Workshop, ACL 2023). Building on these promising results, we will design and implement LLM solutions (e.g., based on the publicly available Llama 2, released in July 2023) to populate a provincial cancer registry for real-world use in cancer control and surveillance. We will also explore other equally compelling use cases for extraction from clinical documents for various conditions (e.g., cardiology). The exact choice of use cases will depend on the interests and expertise of the actual postdoctoral fellows and other factors, such as the availability of clinicians when the postdoctoral fellows arrive in Vancouver.


    Key Objectives

    The key 12-month objective is to design and implement LLM solutions (e.g., based on the publicly available Llama2) to populate a provincial cancer registry for cancer control and surveillance.


    Key Milestones (within 12 months)
    1. One serious weakness of LLMs is that they are limited in the size of their input, and therefore struggle with very long documents or with large sets of documents. Expanding on the work of Li et al. [LJH2021], which frames event argument extraction as conditional generation (S. Li, H. Ji and J. Han, "Document-Level Event Argument Extraction by Conditional Generation," NAACL 2021), we will develop LLM-based techniques to extract information from highly unstructured documents whose Fields of Interest (FoIs) may be located anywhere in the document (or even across documents) and can be expressed in a myriad of ways.
    2. We will investigate "question-answering" approaches to extracting FoIs by incorporating the very recent, impressive capabilities displayed by conversational decoder-only LLMs on medical competency exams and benchmark datasets. Remarkably, both GPT-4 [N+2023] (H. Nori, E. Horvitz et al., "Capabilities of GPT-4 on Medical Challenge Problems," arXiv preprint arXiv:2303.13375, 2023) and Med-PaLM 2 [S+2023] (K. Singhal, T. Tu et al., "Towards Expert-Level Medical Question Answering with Large Language Models," arXiv preprint arXiv:2305.09617, 2023) exceeded the passing score on the USMLE, a three-step examination program used to assess clinical competency and grant licensure in the United States. In practice, we will further push the performance of conversational LLMs by, for instance, exploring more sophisticated medical-domain fine-tuning and prompting strategies.
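    The question-answering framing in the milestones above can be sketched as follows. This is a minimal illustration, not the proposal's actual pipeline: the field names, the prompt wording, and `mock_answer` (which stands in for the reply of a real LLM endpoint such as a locally hosted Llama 2) are all assumptions for illustration.

```python
import json

# Illustrative Fields of Interest (FoIs); a registry would define its own.
FIELDS_OF_INTEREST = ["tumour_site", "histology", "grade"]

def build_extraction_prompt(report_text, fields):
    """Frame FoI extraction as question answering: ask the model to
    return each requested field as a JSON object (null when absent)."""
    field_list = ", ".join(f'"{f}"' for f in fields)
    return (
        "Extract the following fields from the pathology report below. "
        f"Answer with a single JSON object with keys {field_list}; "
        "use null for any field not mentioned.\n\n"
        f"Report:\n{report_text}\n\nJSON:"
    )

def parse_llm_answer(answer, fields):
    """Validate the model's JSON reply, keeping only the requested fields."""
    parsed = json.loads(answer)
    return {f: parsed.get(f) for f in fields}

# The LLM call itself is a placeholder; mock_answer imitates a reply.
mock_answer = (
    '{"tumour_site": "left breast", '
    '"histology": "ductal carcinoma", "grade": null}'
)
record = parse_llm_answer(mock_answer, FIELDS_OF_INTEREST)
print(record)
```

    A record like this, one per report, is what would ultimately populate a row of the registry table described under "Relevance to healthcare".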

    Qualifications
    • Essential skills: strong Natural Language Processing publication record; working knowledge of and experience with Large Language Models; strong programming skills; good communication skills
    • Bonus skills: solid publication record in machine learning in general

    Relevance to healthcare

    Manual extraction of structured information from documents costs tremendous time and labour, motivating the need for a system to automate the process. In this proposal, we aim to develop tools for automated table extraction from documents. That is, given a collection of FoIs, the task is to automatically extract and populate a table with the FoIs as the column headers. For example, the BC Cancer Registry is a database of all new cancers diagnosed in BC residents, intended to track the outcomes of patients treated for cancer in BC. To populate the Registry, it is critical to extract up-to-date information from streams of patient documents, including physicians' notes, pathology reports, diagnostic imaging reports, and surgical notes. However, manual review of these reports by health records analysts requires extensive human resources. Before the pandemic, the volume of work for manual extraction already exceeded capacity, and the Registry lagged many months behind. The pandemic worsened the lag, which significantly affects patient care and, downstream, evidence-based analyses of provincial cancer outcomes and treatment strategies.

 

  • PI: Asst Prof Xiaoxiao Li, Electrical and Computer Engineering, University of British Columbia


    Overview

    Brain imaging biomarkers are invaluable because they are widely available and sensitive to the effects of healthy-aging interventions and disease processes. MRI is non-invasive and thus amenable to longitudinal measurements. However, longitudinal MRI assessing the effects of an intervention or a disease process has a fundamental confound that, up to now, has been impossible to address: comparing two MRI scans taken at different times from the same individual conflates the effects of disease progression and/or intervention with the effects of natural aging. Recently, advances in machine learning have allowed for the development of generative models that can create highly authentic-appearing synthetic MRI scans, including MRI-visible changes due to aging. Qualitatively, it has long been observed that there are aging-associated morphological changes on MRI in various brain regions, including the whole brain, cerebral cortex, white matter, subcortical gray matter, and other individual structures. Here we propose using a new type of ML model, a conditional factor generative model (e.g., a diffusion model), to synthesize what a given MRI would look like, based on the effects of natural aging alone, when projected an arbitrary time into the future (e.g., 12 months).


    Key Objectives
    1. To develop a framework for creating highly realistic synthetic brain MRI images reflecting natural-aging effects in both healthy subjects and subjects with neurologic disease;
    2. To provide novel insights into MRI changes caused by dementia after accounting for age effects via outcome (1).
    3. To prepare a manuscript and grant proposal to secure future funding to continue this new research collaboration to scale up the machine learning based brain aging analysis system.

    Key Milestones (within 12 months)

    To train and mentor a postdoctoral fellow who will build the longitudinal brain simulation and generation system:

    1. Data Collection and Normalization: We will first collect publicly available longitudinal brain MRI datasets (UK Biobank, ADNI, and ABCD datasets) as well as obtain Canadian Consortium on Neurodegeneration in Aging (CCNA) data via Prof. McKeown. To fully leverage the cross-sectional data for longitudinal studies, we will create age templates from the cross-sectional data to summarize age-specific features.
    2. Algorithm Development: To generate synthetic individual MRI data at future time points, we will design a transformer-based ML module for learning personalized brain progression trajectories. Conditioning on the target age template and the individual's brain aging trajectory, our brain imaging simulation system will take aging- and individual-related factors as inputs to generate an individual's brain at the target age. Finally, we will deepen our understanding of brain aging patterns by controlling the factors in brain imaging synthesis and comparing the brain images generated under different factors.
    3. Brain Aging Biomarkers Interpretation: We will then apply the above model to propagate MRI data into the future, using longitudinal CCNA data of subjects with dementia collected at t=t1 and t=t2. We will take the MRI data at t1 and generate what the MRI would look like at t=t2. We will normalize the images and test the hypothesis that (MRI(t2) - generated_MRI(t2)) will be more focal and specific to dementia (e.g., changes in the hippocampus) compared to (MRI(t2) - MRI(t1)).
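    The residual-map hypothesis in milestone 3 can be illustrated with a toy computation. Everything below is a stand-in: the six-voxel "images", the synthetic volume `generated_t2` (which, in the real project, a trained conditional generative model would produce), and the hippocampus mask are assumptions for illustration only.

```python
def residual_map(img_a, img_b):
    """Voxel-wise difference between two (flattened) images."""
    return [a - b for a, b in zip(img_a, img_b)]

def focality(residual, region_mask):
    """Fraction of the residual's energy concentrated in a region of
    interest (e.g., the hippocampus): higher means more focal change."""
    total = sum(r * r for r in residual)
    in_region = sum(r * r for r, m in zip(residual, region_mask) if m)
    return in_region / total if total else 0.0

# Toy 6-voxel "images"; a real pipeline would use registered,
# intensity-normalized 3D volumes.
mri_t1 = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
mri_t2 = [1.1, 1.1, 1.1, 1.1, 1.6, 1.7]        # aging + focal disease change
generated_t2 = [1.1, 1.1, 1.1, 1.1, 1.1, 1.1]  # aging only, synthesized
hippocampus = [0, 0, 0, 0, 1, 1]               # region-of-interest mask

raw = residual_map(mri_t2, mri_t1)              # aging and disease mixed
corrected = residual_map(mri_t2, generated_t2)  # disease effect isolated
print(focality(corrected, hippocampus) > focality(raw, hippocampus))  # True
```

    In this toy example, subtracting the aging-only synthetic scan leaves a residual that is entirely inside the hippocampal mask, which is exactly the pattern the hypothesis test would look for.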

    Work Plan
    1. Month 1: Establish an interdisciplinary applied health AI research team and conduct a literature review
    2. Month 2-3: Aggregate brain MRI data; perform normalization and pre-processing; and create age template via registration
    3. Month 4-6: Develop deep learning-based approach(es) to generate aging-aware, personalized brain images at future ages
    4. Month 7-8: Analyze brain-aging biomarkers and validate the synthetic data
    5. Month 9-10: Build a database for the synthetic brain imaging data and wrap the proposed algorithm into a software library that can predict an individual's brain at a future age
    6. Month 11-12: Prepare an accompanying manuscript or grant application (e.g., NFRF Exploration or CIHR Project grant)

    Qualifications
    • A PhD in computer science, or equivalent qualifications
    • Coursework or research experience in machine learning, data mining, and biomedical data analysis
    • A strong publication and research record (at least two first-authored papers published in top machine learning conferences or medical imaging-related journals or challenges)
    • An interest in applications to health care, in collaborating with clinicians, and in running empirical evaluations
    • Advanced knowledge of Python and deep learning platforms (PyTorch, TensorFlow, JAX, etc.)

    Relevance to healthcare

    Many factors can affect aging: diet, exercise, social environment, genetics, and medical co-morbidities. Biomarkers for aging that are sensitive enough to detect the effects of interventions are vital. Brain imaging biomarkers are highly attractive for assessing brain aging changes because MRI is sensitive to disease and aging, is relatively easy to obtain, and is non-invasive. Nevertheless, MRI biomarkers have some challenges: the large size of the data obtained makes it difficult to isolate the specific effects of therapeutic interventions, natural aging, and/or disease co-morbidities. It can also be difficult to compare subjects who have been imaged at different intervals (e.g., every 6 months versus every 18 months). The current proposal, by developing a way to create subject-specific synthetic MRI data projected forward to an arbitrary time in the future, will be transformative for brain MRI aging biomarker development.

     

 

  • PI: A/Prof Leonid Sigal, Department of Computer Science, University of British Columbia


    Overview

    A major benefit of visio-linguistic foundational models (e.g., ViLBERT, BLIP-2, Meta-Transformer) is that once pre-trained on large corpora of unlabeled or weakly-labeled data (image-text pairs), they can be fine-tuned to downstream target tasks with very limited domain-specific supervision. At the same time, they provide a means to interpret results by inspecting the attention patterns embedded in the Transformer layers. These two properties make this class of models particularly appealing for medical imaging, where data is limited and interpretability is required. Nevertheless, the use of such models in medical imaging has so far been rather limited. Applying them directly to medical imaging data (e.g., radiographic, MRI, CT) and corresponding textual sources (e.g., doctors' reports) is challenging. On the technical side, the high resolution of the image data poses a challenge; on the medical side, the interpretability and validation of such models require further study and deeper analysis. We intend to address the technical issues with structured attention and more granular alignment achieved through region-word alignment objectives (e.g., similar to those used in CLIP). In addition, we plan to carefully study the interpretability of the decisions being made by visualizing attention maps and correlating them with human-annotated regions corresponding to medical diagnosis and prognosis.
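    The region-word alignment objective mentioned above can be sketched as a symmetric, CLIP-style contrastive (InfoNCE) loss over matched region/word embedding pairs. This is a dependency-free toy: the 2-dimensional embeddings and the temperature value are made-up assumptions, and a real model would produce these embeddings with Transformer encoders over image regions and report tokens.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def region_word_alignment_loss(region_embs, word_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss: the i-th region should score highest
    against the i-th word among all words, and vice versa."""
    n = len(region_embs)
    sims = [[cosine(r, w) / temperature for w in word_embs]
            for r in region_embs]
    loss = 0.0
    for i in range(n):
        row = sims[i]                          # region -> word direction
        loss += -row[i] + math.log(sum(math.exp(s) for s in row))
        col = [sims[j][i] for j in range(n)]   # word -> region direction
        loss += -col[i] + math.log(sum(math.exp(s) for s in col))
    return loss / (2 * n)

# Toy check: well-aligned pairs yield a lower loss than shuffled pairs.
regions = [[1.0, 0.0], [0.0, 1.0]]
words_aligned = [[0.9, 0.1], [0.1, 0.9]]
words_shuffled = [[0.1, 0.9], [0.9, 0.1]]
print(region_word_alignment_loss(regions, words_aligned)
      < region_word_alignment_loss(regions, words_shuffled))  # True
```

    Minimizing such a loss is what would pull image regions and report phrases that describe the same finding toward each other in the shared embedding space.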


    Key Objectives

    The key objective is to develop, train, and demonstrate the viability and usefulness of a visio-linguistic medical foundational model on at least one diagnostic and/or prognostic task.


    Key Milestones (within 12 months)
    1. Adapt an existing model (e.g., BLIP-2) to a corpus of multi-modal medical data (e.g., imaging + doctors' reports), potentially utilizing a multi-scale approach to deal with high-resolution image content
    2. Pre-train this model on a large dataset and then fine-tune for at least one diagnostic task (the task and specific modality TBD based on availability of the data)
    3. Analyze the performance of the resulting model, including the spatial associations made by the model (by plotting the corresponding attention maps)
    4. Ultimately improve the performance by developing better pre-training and fine-tuning strategies based on observations drawn in (1)-(3).

    Qualifications
    • Essential skills: working knowledge of and experience with deep learning models (e.g., CNNs, Transformers), strong programming skills, good communication skills, some experience working with medical imaging data. Strong publication record on either the medical or the ML/vision side (both are a strong plus).
    • Bonus skills: experience with natural language processing.

    Relevance to healthcare

    Visio-linguistic models have seen significant breakthroughs in computer vision over the past couple of years, showing impressive performance on a variety of multi-modal tasks. Most such methods rely on Transformer neural architectures, popularized in NLP. Once pre-trained, such models are fine-tuned to specific downstream tasks (e.g., language query localization, question answering, etc.). Early results suggest that similar models have the potential to significantly benefit AI approaches to medical imaging, in the same way these models have substantially benefited NLP and computer vision. For example, one potential benefit of such models is that they can implicitly align different modalities of imaging data (e.g., radiographic imaging, CT, MRI) with the textual data describing diagnostic and potential prognostic outcomes (e.g., encapsulated in doctors' reports). In this way, such models can learn to associate certain regions of medical data with predictive factors. Further, such associations can be analysed for explainability to give confidence in AI predictions.

     

 

IDS Projects


 

  • PI: A/Prof Mehul Motani, Department of Electrical and Computer Engineering, National University of Singapore


    Overview

    Generalization is the ability of a learning algorithm to perform well on unseen data. Explainability deals with providing clear and understandable reasons for the predictions of a learning algorithm. In this project, we will explore generalization and explainability from an information theoretic perspective. We propose to use information theoretic metrics based on mutual information, which are scalable and efficient to compute. We aim to establish that these metrics track generalization and memorization, as well as enable explainability and fairness analysis. This project builds on existing work in my lab, which has shown that information theoretic methods hold great promise for the problems proposed here.
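    As a toy illustration of the kind of mutual-information metric proposed above, the following is a plug-in estimator over paired discrete samples. The "features" below are made up for illustration; a real study would use scalable estimators suited to continuous, high-dimensional representations, which is part of what the project would explore.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        mi += p_joint * math.log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi

# Toy check: a feature that copies (memorizes) the label carries maximal
# information about it; a constant feature carries none.
labels = [0, 1, 0, 1, 0, 1, 0, 1]
informative = labels[:]      # memorizing feature
uninformative = [0] * 8      # constant feature
print(mutual_information(informative, labels))    # 1.0 bit
print(mutual_information(uninformative, labels))  # 0.0
```

    Tracking such quantities between a network's internal representations and its labels is one way a mutual-information metric could signal memorization versus generalization.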


    Key Objectives
    1. Explore information theoretic metrics suitable for machine learning systems
    2. Study generalization performance with respect to information theoretic metrics
    3. Study explainability and fairness in machine learning in the context of information theory

    Key Milestones (within 12 months)
    1. Establish the properties of a suitable information theoretic metric
    2. Characterize the use of the metric for predicting generalization performance
    3. Characterize the use of the metric for explainability purposes

    Qualifications
    • Essential skills:
      1. Information Theory
      2. Machine Learning Theory
      3. Experience with implementing deep learning frameworks
      4. Experience with Python and other programming languages
    • Bonus skills:
      1. Github and online software development

    Relevance to healthcare

    Generalization and explainability are key issues in healthcare problems because of the life and death nature of the problems being studied. Information theoretic ideas have the potential to shed light on these two issues. While they have been explored in more general scenarios, the application to healthcare has been less explored and holds significant potential.

     

 

  • PI: Prof Ng See-Kiong, Institute of Data Science and School of Computing, National University of Singapore


    Overview

    Collaborative machine learning is an appealing paradigm for building high-quality machine learning (ML) models by training on the data of many parties. In the healthcare domain, the ability to develop machine learning models over datasets distributed across data centers such as hospitals, clinical research labs, and mobile devices, while preventing data leakage, will be a key enabler for AI in healthcare, as medical data are often scarce within individual institutions. However, concerns over trust and security have hindered the sharing of data, as the "data silos" are inherently difficult to break at the data level, especially when personal or proprietary data are involved. We posit that ML models can be more amenable to sharing, as they are inherently more compact and self-contained, with purpose-compiled knowledge from the data. Rather than requiring the learning collaborators to contribute their private data, this project will focus on enabling collaborative machine learning by allowing collaborators to share heterogeneous black-box models, and to be appropriately incentivized based on their self-interests. Given that most current research is focused on the data level, this project will develop new model-centric collaborative machine learning methods, as well as new notions of trustable model-centric sharing and effective model management techniques for real-world model-centric platforms. We seek a postdoc who will study and develop such collaborative machine learning approaches for the healthcare domain.
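    One-shot, model-centric aggregation of the kind described above can be sketched as follows. The "hospital" models and their decision thresholds are hypothetical stand-ins; a real system would share trained black-box predictors (of heterogeneous architectures) rather than simple threshold rules, but the key property is the same: only models cross institutional boundaries, never data.

```python
from collections import Counter

def local_model(threshold):
    """Stand-in for a party's privately trained black-box model:
    only its predict function is shared, never its training data."""
    return lambda x: int(x >= threshold)

def one_shot_ensemble(models, x):
    """One-shot aggregation: collect a single round of predictions
    from each party's model and take a majority vote."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

# Three hypothetical hospitals with slightly different decision
# boundaries; their data never leaves the institution.
hospitals = [local_model(t) for t in (0.4, 0.5, 0.6)]
print(one_shot_ensemble(hospitals, 0.55))  # 1 (two of three vote positive)
print(one_shot_ensemble(hospitals, 0.45))  # 0 (two of three vote negative)
```

    Questions such as how to value each contributed model, how to incentivize participation, and how to "unlearn" a withdrawn model are exactly the extensions listed in the milestones.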


    Key Objectives
    1. Develop new concepts and algorithms in federated learning and artificial intelligence for trusted collaborative machine learning in healthcare;
    2. Be up-to-date on state-of-the-art methodologies in related technical fields (federated learning, AI) and application domains (healthcare);
    3. Develop ideas for application of research outcomes;
    4. Contribute to knowledge exchange activities with external partners and collaborators;

    Key Milestones (within 12 months)
    1. Collate healthcare datasets that can be used to develop/demonstrate/evaluate federated learning approaches
    2. Develop (one-shot) FL methodologies for healthcare applications
    3. Explore other related approaches such as ML data/model valuation, FL incentivisation techniques, machine unlearning

    Qualifications
    • Essential skills:
      1. Specialization related to machine learning and artificial intelligence for healthcare and/or prior experience in federated learning, Bayesian optimization, or privacy/security research in data sharing;
      2. Proven ability to conduct independent research with a strong and relevant publication record;
      3. Experience using the latest machine learning, AI, and big data platforms;
      4. Excellent interpersonal communication and oral presentation skills in English
    • Bonus skills:
      1. Github and online software development

    Relevance to healthcare

    While this project is focused on one-shot federated learning approaches, the postdoc will focus on such approaches in the healthcare setting. The ability to develop machine learning models over datasets distributed across data centers such as hospitals, clinical research labs, and mobile devices while preventing data leakage, will be a key enabler for AI in healthcare, given that medical data are often scarce within individual institutions.

     
