Publications
2024
- medRxiv. Funmap: integrating high-dimensional functional annotations to improve fine-mapping. Li, Yuekai, Xiao, Jiashun, Ming, Jingsi, Zeng, Yicheng, and Cai, Mingxuan. medRxiv 2024
Fine-mapping aims to prioritize causal variants underlying complex traits by accounting for the linkage disequilibrium at GWAS risk loci. The expanding resources of functional annotations serve as auxiliary evidence to improve the power of fine-mapping. However, existing fine-mapping methods tend to generate many false positive results when integrating a large number of annotations. In this study, we propose a unified method to integrate high-dimensional functional annotations with fine-mapping (Funmap). Funmap can effectively improve the power of fine-mapping by borrowing information from hundreds of functional annotations. Meanwhile, it relates the annotations to the causal probability with a random effects model that avoids the over-fitting issue, thereby producing a well-controlled false positive rate. Paired with a fast algorithm, Funmap enables scalable integration of a large number of annotations to facilitate prioritizing multiple causal SNPs. Our simulations demonstrate that Funmap is the only method that produces well-calibrated FDR under the setting of high-dimensional annotations while achieving better or comparable power gains compared to existing methods. By integrating GWASs of 4 lipid traits with 187 functional annotations, Funmap consistently identified more variants that can be replicated in an independent cohort, achieving 15.5%-26.2% improvement over the runner-up in terms of replication rate.
- medRxiv. A unified framework for cell-type-specific eQTLs prioritization by integrating bulk and scRNA-seq data. Yu, Xinyi, Hu, Xianghong, Wan, Xiaomeng, Zhang, Zhiyong, Wan, Xiang, Cai, Mingxuan, Yu, Tianwei, and Xiao, Jiashun. medRxiv 2024
Genome-wide association studies (GWASs) have identified numerous genetic variants associated with complex traits, yet the biological interpretation remains challenging, especially for variants in non-coding regions. Expression quantitative trait loci (eQTLs) studies have linked these variations to gene expression, aiding in identifying genes involved in disease mechanisms. Traditional eQTL analyses using bulk RNA sequencing (bulk RNA-seq) provide tissue-level insights but suffer from signal loss and distortion due to unaddressed cellular heterogeneity. Recently, single-cell RNA sequencing (scRNA-seq) has provided higher resolution enabling cell-type-specific eQTL (ct-eQTL) analyses. However, these studies are limited by their smaller sample sizes and technical constraints. In this paper, we present a novel statistical framework, IBSEP, which integrates bulk RNA-seq and scRNA-seq data for enhanced ct-eQTLs prioritization. Our method employs a Bayesian hierarchical model to combine summary statistics from both data types, overcoming the limitations while leveraging the advantages associated with each technique. Through extensive simulations and real-data analyses, including peripheral blood mononuclear cells and brain cortex datasets, IBSEP demonstrated superior performance in identifying ct-eQTLs compared to existing methods. Our approach unveils new transcriptional regulatory mechanisms specific to cell types, offering deeper insights into the genetic basis of complex diseases at a cellular resolution.
- AJHG. Benchmarking Mendelian Randomization methods for causal inference using genome-wide association study summary statistics. Hu, Xianghong, Cai, Mingxuan, Xiao, Jiashun, Wan, Xiaomeng, Wang, Zhiwei, Zhao, Hongyu, and Yang, Can. The American Journal of Human Genetics 2024
Mendelian Randomization (MR), which utilizes genetic variants as instrumental variables (IVs), has gained popularity as a method for causal inference between phenotypes using genetic data. While efforts have been made to relax IV assumptions and develop new methods for causal inference in the presence of invalid IVs due to confounding, the reliability of MR methods in real-world applications remains uncertain. To bridge this gap, we conducted a benchmark study evaluating 15 MR methods using real-world genetic datasets. Our study focused on three crucial aspects: type I error control in the presence of various confounding scenarios (e.g., population stratification, pleiotropy, and assortative mating), the accuracy of causal effect estimates, replicability and power. By comprehensively evaluating the performance of compared methods over one thousand pairs of exposure-outcome traits, our study not only provides valuable insights into the performance and limitations of the compared methods but also offers practical guidance for researchers to choose appropriate MR methods for causal inference.
- JCGS. MFAI: A scalable Bayesian matrix factorization approach to leveraging auxiliary information. Wang, Zhiwei, Zhang, Fa, Zheng, Cong, Hu, Xianghong, Cai, Mingxuan, and Yang, Can. Journal of Computational and Graphical Statistics 2024
In various practical situations, matrix factorization methods suffer from poor data quality, such as high data sparsity and low signal-to-noise ratio (SNR). Here we consider a matrix factorization problem by utilizing auxiliary information, which is massively available in real applications, to overcome the challenges caused by poor data quality. Unlike existing methods that mainly rely on simple linear models to combine auxiliary information with the main data matrix, we propose to integrate gradient boosted trees in the probabilistic matrix factorization framework to effectively leverage auxiliary information (MFAI). Thus, MFAI naturally inherits several salient features of gradient boosted trees, such as the capability of flexibly modeling nonlinear relationships, and robustness to irrelevant features and missing values in auxiliary information. The parameters in MFAI can be automatically determined under the empirical Bayes framework, making it adaptive to the utilization of auxiliary information and immune to overfitting. Moreover, MFAI is computationally efficient and scalable to large-scale datasets by exploiting variational inference. We demonstrate the advantages of MFAI through comprehensive numerical results from simulation studies and real data analysis. Our approach is implemented in the R package mfair available at https://github.com/YangLabHKUST/mfair.
2023
- Nat Commun. XMAP: Cross-population fine-mapping by leveraging genetic diversity and accounting for confounding bias. Cai, Mingxuan, Wang, Zhiwei, Xiao, Jiashun, Hu, Xianghong, Chen, Gang, and Yang, Can. Nature Communications 2023
Fine-mapping prioritizes risk variants identified by genome-wide association studies (GWASs), serving as a critical step to uncover biological mechanisms underlying complex traits. However, several major challenges still remain for existing fine-mapping methods. First, the strong linkage disequilibrium among variants can limit the statistical power and resolution of fine-mapping. Second, it is computationally expensive to simultaneously search for multiple causal variants. Third, the confounding bias hidden in GWAS summary statistics can produce spurious signals. To address these challenges, we develop a statistical method for cross-population fine-mapping (XMAP) by leveraging genetic diversity and accounting for confounding bias. By using cross-population GWAS summary statistics from global biobanks and genomic consortia, we show that XMAP can achieve greater statistical power, better control of false positive rate, and substantially higher computational efficiency for identifying multiple causal signals, compared to existing methods. Importantly, we show that the output of XMAP can be integrated with single-cell datasets, which greatly improves the interpretation of putative causal variants in their cellular context at single-cell resolution.
- Nat Commun. Integrating spatial and single-cell transcriptomics data using deep generative models with SpatialScope. Wan, Xiaomeng, Xiao, Jiashun, Tam, Sindy Sing-Ting, Cai, Mingxuan, Sugimura, Ryohichi, Wang, Yang, Wan, Xiang, Lin, Angela Ruohao, and Yang, Can. Nature Communications 2023
The rapid emergence of spatial transcriptomics (ST) technologies is revolutionizing our understanding of tissue spatial architecture and biology. Although current ST methods, whether based on next-generation sequencing (seq-based approaches) or fluorescence in situ hybridization (image-based approaches), offer valuable insights, they face limitations either in cellular resolution or transcriptome-wide profiling. To address these limitations, we present SpatialScope, a unified approach integrating scRNA-seq reference data and ST data using deep generative models. With innovation in model and algorithm designs, SpatialScope not only enhances seq-based ST data to achieve single-cell resolution, but also accurately infers transcriptome-wide expression levels for image-based ST data. We demonstrate SpatialScope's utility through simulation studies and real data analysis from both seq-based and image-based ST approaches. SpatialScope provides spatial characterization of tissue structures at transcriptome-wide single-cell resolution, facilitating downstream analysis, including detecting cellular communication through ligand-receptor interactions, localizing cellular subtypes, and identifying spatially differentially expressed genes.
- Bioinformatics. PALM: A Powerful and Adaptive Latent Model for Prioritizing Risk Variants with Functional Annotations. Yu, Xinyi, Xiao, Jiashun, Cai, Mingxuan, Jiao, Yuling, Wan, Xiang, Liu, Jin, and Yang, Can. Bioinformatics 2023
The findings from genome-wide association studies (GWASs) have greatly helped us to understand the genetic basis of human complex traits and diseases. Despite the tremendous progress, much effort is still needed to address several major challenges arising in GWAS. First, most GWAS hits are located in the non-coding regions of the human genome, and thus their biological functions largely remain unknown. Second, due to the polygenicity of human complex traits and diseases, many genetic risk variants with weak or moderate effects have not been identified yet. To address the above challenges, we propose a powerful and adaptive latent model (PALM) to integrate cell-type/tissue-specific functional annotations with GWAS summary statistics. Unlike existing methods, which are mainly based on linear models, PALM leverages a tree ensemble to adaptively characterize the nonlinear relationship between functional annotations and the association status of genetic variants. To make PALM scalable to millions of variants and hundreds of functional annotations, we develop a functional gradient-based expectation-maximization (EM) algorithm to fit the tree-based nonlinear model in a stable manner. Through comprehensive simulation studies, we show that PALM not only controls the false discovery rate well, but also improves the statistical power of identifying risk variants. We also apply PALM to integrate summary statistics of 30 GWASs with 127 cell-type/tissue-specific functional annotations. The results indicate that PALM can identify more risk variants as well as rank the importance of functional annotations, yielding better interpretation of GWAS results.
2022
- AJHG. Leveraging the local genetic structure for trans-ancestry association mapping. Xiao, Jiashun, Cai, Mingxuan, Yu, Xinyi, Hu, Xianghong, Wan, Xiang, Chen, Gang, and Yang, Can. The American Journal of Human Genetics 2022
Over the past two decades, genome-wide association studies (GWASs) have successfully advanced our understanding of the genetic basis of complex traits. Despite the fruitful discoveries of GWASs, most GWAS samples are collected from European populations, and these GWASs are often criticized for their lack of ancestry diversity. Trans-ancestry association mapping (TRAM) offers an exciting opportunity to fill the gap of disparities in genetic studies between non-Europeans and Europeans. Here we propose a statistical method, LOG-TRAM, to leverage the local genetic architecture for TRAM. By using biobank-scale datasets, we showed that LOG-TRAM can greatly improve the statistical power of identifying risk variants in under-represented populations while producing well-calibrated p-values. We applied LOG-TRAM to the GWAS summary statistics of 29 complex traits/diseases from Biobank Japan (BBJ) and UK Biobank (UKBB), and achieved substantial gains in power (the effective sample sizes increased by 49% on average compared to the BBJ GWASs) and effective correction of confounding biases compared to existing methods. Finally, we demonstrated that LOG-TRAM can be successfully applied to identify ancestry-specific loci and that the LOG-TRAM output can be further used for construction of more accurate polygenic risk scores (PRSs) in under-represented populations.
- Bioinformatics. XPXP: Improving polygenic prediction by cross-population and cross-phenotype analysis. Xiao, Jiashun, Cai, Mingxuan, Hu, Xianghong, Wan, Xiang, Chen, Gang, and Yang, Can. Bioinformatics 2022
With increasing sample sizes from genome-wide association studies (GWASs), polygenic risk scores (PRSs) have shown great potential in personalized medicine with disease risk prediction, prevention and treatment. However, the PRS constructed using European samples becomes less accurate when it is applied to individuals from non-European populations. It is an urgent task to improve the accuracy of PRSs in under-represented populations, such as African populations and East Asian populations. In this paper, we propose a cross-population and cross-phenotype (XPXP) method for construction of PRSs in under-represented populations. XPXP can construct accurate PRSs by leveraging biobank-scale datasets in European populations and multiple GWASs of genetically correlated phenotypes. XPXP also allows the incorporation of population-specific and phenotype-specific effects, and thus further improves the accuracy of PRS. Through comprehensive simulation studies and real data analysis, we demonstrated that XPXP outperformed existing PRS approaches. We showed that the height PRSs constructed by XPXP achieved 9% and 18% improvement over the runner-up method in terms of predicted R² in East Asian and African populations, respectively. We also showed that XPXP substantially improved the stratification ability in identifying individuals at high genetic risk of Type 2 Diabetes. The XPXP software and all analysis code are available at github.com/YangLabHKUST/XPXP. Supplementary data are available at Bioinformatics online.
2021
- AJHG. A unified framework for cross-population trait prediction by leveraging the genetic correlation of polygenic traits. Cai, Mingxuan, Xiao, Jiashun, Zhang, Shunkang, Wan, Xiang, Zhao, Hongyu, Chen, Gang, and Yang, Can. The American Journal of Human Genetics 2021
We present a unified statistical framework (XPA) to improve the prediction accuracy of human traits using multi-ancestry genetic data. Paired with innovations in data structure and algorithm design, our framework is highly scalable, with both computational cost and memory storage linear in the sample size and number of predictors. In practice, XPA can analyze 3 million variants from 430K samples with only 385 Gb memory usage in 54.5 hours. In a Chinese cohort, our method achieves 7.3%-198.0% accuracy gain for height prediction in terms of R² compared to existing methods.
2020
- NARGAB. IGREX for quantifying the impact of genetically regulated expression on phenotypes. Cai, Mingxuan, Chen, Lin S, Liu, Jin, and Yang, Can. NAR Genomics and Bioinformatics 2020
Many genetic variants affect phenotypes by regulating gene expression levels. We develop a statistical model, IGREX, to quantify the impact of genetically regulated expression on various human traits and to inform trait-relevant tissue types. An efficient parameter-expanded EM (PX-EM) algorithm and the method of moments are adopted to improve computational efficiency.
- JCGS. BIVAS: a scalable Bayesian method for bi-level variable selection with applications. Cai, Mingxuan, Dai, Mingwei, Ming, Jingsi, Peng, Heng, Liu, Jin, and Yang, Can. Journal of Computational and Graphical Statistics 2020
In this article, we consider a Bayesian bi-level variable selection problem in high-dimensional regressions. In many practical situations, it is natural to assign group membership to each predictor. Examples include that genetic variants can be grouped at the gene level and a covariate from different tasks naturally forms a group. Thus, it is of interest to select important groups as well as important members from those groups. The existing Markov chain Monte Carlo methods are often computationally intensive and not scalable to large datasets. To address this problem, we consider variational inference for bi-level variable selection. In contrast to the commonly used mean-field approximation, we propose a hierarchical factorization to approximate the posterior distribution, by using the structure of bi-level variable selection. Moreover, we develop a computationally efficient and fully parallelizable algorithm based on this variational approximation. We further extend the developed method to model datasets from multitask learning. The comprehensive numerical results from both simulation studies and real data analysis demonstrate the advantages of BIVAS for variable selection, parameter estimation, and computational efficiency over existing methods. The method is implemented in R package 'bivas' available at https://github.com/mxcai/bivas. Supplementary materials for this article are available online.