Research Statement
Background
The use of statistics is expanding rapidly in many computationally intensive areas that rely on large and novel datasets, both inside and outside of academia. A pressing challenge is how to decipher complex relationships in large-scale, real-world data efficiently and accurately. Our research lies in this multidisciplinary area, where we have been developing practical statistical and machine learning tools that are significant in both statistical theory and applications. In particular, we have been pursuing this agenda by leveraging advances in generative artificial intelligence (AI) to address several fundamental statistical problems, such as density estimation, causal inference, and unsupervised learning, with wide applications in computational biology and biomedical informatics.
Our research goal is to develop new theories and methodologies for addressing statistical challenges in computational biology and biomedical data science through computationally efficient frameworks powered by state-of-the-art AI techniques. To this end, we have been developing novel frameworks for tackling general statistical problems and applying them to a range of computational biology problems involving pharmacology, biomedical, and multiomics (genetic/genomic/radiomic) data. We have pursued a productive research agenda in these areas, with high-impact publications in each, as elaborated in the following sections.
Highlights
AI-powered Frameworks for High-dimensional Data Analysis
We have been developing innovative and versatile frameworks for analyzing high-dimensional data in various settings. One example is the Encoding Generative Modeling (EGM) framework, which is based on manifold learning using generative AI techniques. This general EGM framework has been successfully applied both to general statistical problems and to various downstream applications in computational biology.
Density Estimation: I invented an innovative and versatile framework named Roundtrip (Liu et al., PNAS 2021) for density estimation of high-dimensional data by harnessing the power of generative AI. I not only established a rigorous statistical theory but also conducted extensive numerical experiments to demonstrate the effectiveness of this model. I then extended Roundtrip into an innovative model called scDEC (Liu et al., Nature Machine Intelligence 2021) for unsupervised learning. It is also worth noting that, using a modified version of scDEC, I led the winning team in two joint embedding tasks in the NeurIPS 2021 Multimodal Single-Cell Data Integration competition, further confirming the significance of the proposed method.
Causal Inference: Causal inference is one of the most important research frontiers in statistics. I developed a powerful method, Causal Encoding Generative Modeling (CausalEGM) (Liu et al., PNAS 2024), for analyzing causal relationships between variables in the presence of high-dimensional covariates. Through rigorous mathematical analysis and numerical experiments, I demonstrated that the proposed approach offers significant advantages over existing methods for estimating causal effects. This work was presented at the JSM 2023 IMS Grace Wahba Award Lecture.
Applying Cutting-edge AI Techniques in Computational Biology
The past few years have witnessed the rapid evolution of AI techniques and the revolutions they have brought about. Our research in computational biology has been closely aligned with the development of these cutting-edge AI technologies, as briefly discussed below.
Modeling Epigenomic Data: Epigenomic data characterize various chromatin states, including chromatin accessibility, TF binding, histone modifications, and 3D genome interactions. Epigenetic changes to DNA and its associated proteins affect gene expression and may lead to altered cellular states, including diseases. I have developed several AI-powered frameworks for predicting chromatin accessibility (Liu et al., Bioinformatics 2017; Liu et al., GPB 2020). Later, I exploited epigenomic data in pharmaceutical studies by developing DeepCDR (Liu et al., ECCB/Bioinformatics 2020) for predicting cancer drug response, which remains among the state-of-the-art methods to date. Moreover, I leveraged generative AI to develop hicGAN (Liu et al., ISMB/Bioinformatics 2019), a model that enhances the resolution of 3D genome data. We also built a HiChIP database (Zeng*, Liu* et al., Nucleic Acids Res. 2022) to facilitate the use of 3D chromatin data.
Genomic Language Model: Large language models (LLMs) have been the main driving force behind many recent breakthroughs in artificial intelligence. Inspired by this, we have been exploring the idea of a “genomic LLM” that could advance our understanding of gene regulation mechanisms. We developed a genomic large language model, EpiGePT (Gao*, Liu* et al., arXiv 2023), built on a transformer architecture to model the complex language of the genome. It offers several benefits: improved predictive performance, wider applicability, enhanced biological interpretability, and support for multiple downstream tasks, such as the dissection of universal gene regulation rules. We expect genomic LLMs to provide broad insights into genetics, genomics, and beyond.