The Institute facilitates joint collaborations across the departments to develop solutions to these problems, and provides a single focal point for industries and public agencies to tap into the broad spectrum of scientific and technological expertise in the various faculties and departments of the University.
By working on these challenging real-life problems, we can change the traditional mindset of researchers: collaborating on impactful research rather than working in silos on “small” problems. Our focus on solving real-life data science problems that meet the needs of various sectors creates opportunities for technology transfer and for translating research outputs into real-world deployments.
Here are the current research projects funded by the Institute:
As new media becomes an increasingly important form of communication and expression, novel analytic tools are needed to help society understand and better use it. Old media consists of curated forms of media, such as newspapers, created and controlled by professionals who are trained to identify and correct misinformation. New media, by contrast, is not professionally edited or curated. Our research uses big data analysis tools to help understand the credibility of information in new media. We will develop novel, time-aware algorithms to help assess the credibility of such postings. We will also study how both credible information and misinformation flow, in order to design strategies that diminish the influence of misinformation. The results of our work can be used by local industry to understand the public’s knowledge of their products or services (as expressed in social media), whether this knowledge is credible, and how to better propagate favorable information through a network. Our work could also be used by the Singapore government to assess public understanding (or misunderstanding) of important public health topics such as dengue eradication.
Effective sensing of social events is of critical importance for national security and social stability, but it has become increasingly challenging as the physical and cyber spaces deeply intermingle in the occurrence and development of social events. At the same time, the current explosion of big data in both the physical and the cyber space provides new opportunities to develop modern techniques and methodologies for smarter social event analytics. To benefit from this new wealth of data for more accurate and rapid social event detection, pre-warning, tracking, and evolution analysis, we need to address the following challenges: how to robustly represent data of vast size and high noise levels, how to effectively extract useful information from event data, and how to reason over data from both the physical and the cyber space.
This project aims to develop new approaches that address these fundamental challenges in modelling and analyzing big social event data. The core component is a unified framework that connects data and information between the physical and cyber spaces, creating an integrated event analytics platform for efficient perception, accurate detection, timely tracking, early warning, and trend analysis of social events. We have assembled a unique team of researchers from different fields to achieve the following specific objectives:
How many constituent parts make up a face? How many distinct flavors make up a dish? How many topics are discussed in a news article? This project aims to develop a strong statistical foundation for learning the number of latent factors in dictionary learning in the presence of gross errors in the measurements. We study two main problems in dictionary learning, namely nonnegative matrix factorization (NMF) and a robust version of principal component analysis (RPCA). While our initial focus is on developing test statistics for various statistical models that yield accurate and exact p-values for NMF and RPCA, we also plan to develop algorithms to test the theory on real data from diverse sources, such as audio, stock price, image (computer vision), and text data. Lastly, we plan to adopt information-theoretic methods to obtain minimax estimation error bounds for both NMF and RPCA in various settings, and to compare the performance of the algorithms against these fundamental limits.
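To give a concrete sense of the rank-selection question, here is a minimal sketch (not the project’s actual method) that fits NMF at several candidate ranks with scikit-learn and compares reconstruction errors, a simple proxy for choosing the number of latent factors. The data are synthetic, with a known true rank of 3.

```python
# Illustrative sketch only: choosing the NMF rank via reconstruction error.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Synthetic nonnegative data with true rank 3, plus small noise
W_true = rng.random((100, 3))
H_true = rng.random((3, 20))
X = W_true @ H_true + 0.01 * rng.random((100, 20))

errors = {}
for k in range(1, 7):
    model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
    W = model.fit_transform(X)   # basis coefficients (n_samples x k)
    H = model.components_        # dictionary atoms (k x n_features)
    errors[k] = np.linalg.norm(X - W @ H)  # Frobenius reconstruction error

# The error typically drops sharply up to the true rank and then levels
# off; the project's test statistics aim to replace this informal "elbow"
# heuristic with exact p-values.
drops = {k: errors[k - 1] - errors[k] for k in range(2, 7)}
```

The heuristic here is deliberately naive; robustness to gross errors (the RPCA side of the project) would require replacing the Frobenius objective with one that is insensitive to sparse outliers.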
With the massive increase in web posts, reviews and feedback by users around the world via electronic word-of-mouth, web text data have been shown not only to provide important information for content analysis, but also to influence decision-making processes such as the US election, university rankings, risk management and investment. In addition to these fragmented text data of an unstructured nature, there is also an increasing availability of high-dimensional and high-frequency functional data. These functional data take the form of curves, surfaces or manifolds defined over a continuum of time, spatial location, price level, wavelength, probability, etc. Functional data have also begun to attract much attention, as they contain comprehensive information that can further help in trend forecasting, decision making, and managerial improvement of services and products.
We propose to address the fundamental technical issues of exploiting these untraditional yet valuable digital text and functional data as follows. First, we will provide a structured pipeline for text processing that transforms the informative content into quantitative forms, and develop models and algorithms that incorporate the text and functional information, as well as other complex data such as high-dimensional scalar variables, for data analytics. Besides studying the classical issues of statistical modelling of mixed data, we will focus on exploiting them to provide insights into managerial and operational issues in financial markets: for example, how do community opinions affect market movements, both overall and under stressful scenarios, and how accurately can financial market movements be predicted? We will also discuss potential uses in helping management quickly gain managerial insights from large volumes of data with complex structure.
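As a minimal illustration of the first step of such a pipeline, the sketch below turns unstructured posts into quantitative features (TF-IDF vectors) and stacks them with scalar covariates into a single design matrix for downstream modelling. All names and data here are made up for illustration and are not the project’s actual pipeline.

```python
# Illustrative sketch only: text -> quantitative features -> mixed design matrix.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

posts = [
    "market rally expected after strong earnings",
    "investors worried about rising rates",
    "earnings beat estimates, shares up",
]
# Hypothetical scalar covariate per post, e.g. trading volume on that day
scalar_covariates = np.array([[1.2], [0.4], [2.1]])

# Transform informative text content into quantitative form
vectorizer = TfidfVectorizer(max_features=50)
text_features = vectorizer.fit_transform(posts).toarray()

# Combine text-derived features with other scalar variables for a
# statistical model of, e.g., market movements
design_matrix = np.hstack([text_features, scalar_covariates])
```

In practice the text side would involve far richer processing (entity extraction, sentiment scoring, temporal alignment with market data), but the structure — text mapped to numbers, then joined with other covariates — is the same.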
Live streaming applications have existed for years but did not reach mainstream popularity until the recent trend in China of individuals’ self-expression via live streaming in mobile dating and social apps. According to the China Internet Network Information Center, by July 2016 there were more than 100 major mobile apps providing online live streaming services to more than 3 million users, accounting for nearly half of all Internet users in China.
While this phenomenon is still nascent, the economic potential for such live streaming services has been tested with two related business revenue models. First, live streaming platforms can adopt a revenue model to allow monetization of user-generated content (UGC). Second, the use and value of live streaming have also been expanded to e-commerce. However, the extent to which the value of these user-generated live streaming services is dependent on the nature of streaming content, broadcaster and audience behaviors is still unknown, both in terms of the intrinsic value reflected by direct monetization and the extrinsic value that impacts business across platforms.
Our research objectives are as follows. First, we conduct a pioneering study on the direct monetization revenue model of live streaming UGC by quantifying the economic value of live streaming to content providers, audiences and platform owners. Second, we investigate the value of live streaming UGC by applying and advancing image recognition and machine learning techniques to conduct sentiment and emotion analyses with dynamic temporal considerations in a live streaming setting. Third, we investigate users’ psychological and social needs on live streaming platforms. Together, these objectives provide practical implications for both platform owners and broadcasters on how to better monetize live streaming UGC.
The epidemiology of diseases is multi-factorial, complex, and adaptive. These characteristics make disease control and management challenging. While spatially averaged parametric models may do well in predicting when intervention is required, or in capturing general epidemiological trends, they rarely provide targeted suggestions on where and how to intervene efficiently.
There is, however, a data-driven alternative: when sufficient epidemiological field data are available, a generative nonparametric model can capture spatio-temporal correlations in how actual vector-borne diseases arise, spread, and dissipate. Such generative models may reveal locale-specific transmission pathways (which we term motifs) that are amenable to targeted, pre-emptive intervention.
Given the abundance of detailed epidemiological data on vector-borne and communicable diseases in Singapore (e.g. dengue virus, hand-foot-mouth disease), we explore here the efficacy of such data-driven generative modelling.
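A toy sketch of the kind of nonparametric building block alluded to above: estimating a spatio-temporal case-intensity surface from point records with kernel density estimation. The data and locations here are synthetic, and this is only one simple stand-in for the project’s generative models.

```python
# Illustrative sketch only: nonparametric spatio-temporal intensity from case records.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Synthetic case records (x, y, t) clustered around two hypothetical outbreaks
cluster_a = rng.normal([0.3, 0.3, 5.0], 0.1, size=(100, 3))
cluster_b = rng.normal([0.7, 0.6, 12.0], 0.1, size=(100, 3))
cases = np.vstack([cluster_a, cluster_b])

# Joint kernel density over space and time; gaussian_kde expects (dims, n)
kde = gaussian_kde(cases.T)

# Estimated intensity near the first outbreak vs. far from both clusters
near = kde([[0.3], [0.3], [5.0]])[0]
far = kde([[0.95], [0.05], [20.0]])[0]
```

High-density regions of such an estimate flag where and when transmission concentrated, which is the starting point for mining recurring, locale-specific motifs from real field data.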
Science, technology, engineering and mathematics (STEM) immigrants have fuelled innovation in several developed countries, including Singapore. However, immigration policy is a politically sensitive topic, and not all citizens of developed countries accept the idea that STEM immigrants can create more jobs for domestic employees and benefit everyone in the recipient country. In addition to quantifying the overall economic benefits brought by foreign STEM talent, this project aims to build predictive models for identifying foreign STEM talent who are more likely to contribute to Singapore in the long term. One important novelty is that this project plans to analyse the data from the point of view of university admissions for undergraduate and postgraduate students, as university admissions serve as an important filter that allows thousands of foreign talents to immigrate to Singapore. However, most admissions are awarded based solely on academic merit, ignoring whether the student will stay in Singapore and/or contribute in STEM roles in the long run. We hope to show that, even with limited information from public data sources, data science on STEM immigrants could help Singapore be more selective in granting STEM immigration visas without losing economic productivity. The latter is critical for smaller countries such as Singapore, where population capacity is relatively limited. With full individual information integrated by the Singapore government, immigration visas and permanent residency could be awarded based on more sophisticated predictive modelling, building a lean yet highly productive smart nation in the future.
Population growth, land-use change and urbanization, energy consumption and waste production pose increasingly critical water challenges, which are only exacerbated by the effects of climate change. Sustainable access to safe, reliable and affordable water is increasingly difficult as cities like Singapore and Jakarta are predicted to double their water demand in the coming decades.
We propose to design and implement analytics solutions for the modelling and analysis of water quality and of the factors influencing water quality in order to produce results pertinent to the assessment and design of water management policies. Specifically, the project seeks to contribute to the design of policies to engage citizens in the monitoring and protection of water quality and in the definition of appropriate interventions to reduce flood-related public health risks.
We leverage publicly available and publicly sourced data. Smart City and Smart Nation projects around the world collect, consolidate and share data with the objective of enhancing the design and assessment of policies and processes by regional and national stakeholders. While other organizations also make relevant data publicly available, it is increasingly conceivable and meaningful to leverage the knowledge and effort of the crowd. We therefore build a crowd-sourcing system for water quality analytics.