No large-effect low-frequency coding variation found for myocardial infarction
Oddgeir L. Holmen, He Zhang, Wei Zhou et al.|Human Molecular Genetics|2014
Genome-wide association studies have identified variants, primarily common, that are associated with coronary artery disease or myocardial infarction (MI), but have not tested the majority of the low frequency and rare variation in the genome. We explored the hypothesis that previously untested low frequency (1-5% minor allele frequency) and rare (<1% minor allele frequency) coding variants are associated with MI. We genotyped 2906 MI cases and 6738 non-MI controls from Norway using the Illumina HumanExome Beadchip, allowing for direct genotyping of 85 972 polymorphic coding variants as well as 48 known GWAS SNPs. We followed up 34 coding variants in an additional 2350 MI cases and 2318 controls from Norway. We evaluated exome array coverage in a subset of these samples using whole exome sequencing (N = 151). The exome array provided successful genotyping for an estimated 72.5% of Norwegian loss-of-function or missense variants with frequency >1% and 66.2% of variants <1% frequency observed more than once. Despite 80% power in the two-stage study (N = 14 312) to detect association with low-frequency variants with high effect sizes [odds ratio (OR) >1.86 and >1.36 for 1 and 5% frequency, respectively], we did not identify any novel genes or single variants that reached significance. This suggests that low-frequency coding variants with large effect sizes (OR >2) may not exist for MI. Larger sample sizes may identify coding variants with more moderate effects.
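The power statement in this abstract can be illustrated with a rough calculation. The sketch below is a single-stage, normal-approximation allelic test on the combined sample sizes, not the authors' two-stage procedure. The significance threshold `ALPHA` (a Bonferroni correction over the 85 972 genotyped variants) and all function names are assumptions made for illustration, so the printed values should not be expected to reproduce the reported 80% figure.

```python
from math import sqrt
from statistics import NormalDist


def case_allele_freq(control_freq: float, odds_ratio: float) -> float:
    """Risk-allele frequency in cases implied by an allelic odds ratio."""
    case_odds = odds_ratio * control_freq / (1.0 - control_freq)
    return case_odds / (1.0 + case_odds)


def allelic_test_power(n_cases: int, n_controls: int, control_freq: float,
                       odds_ratio: float, alpha: float) -> float:
    """Approximate power of a two-sided two-proportion z-test on allele counts."""
    p0 = control_freq
    p1 = case_allele_freq(control_freq, odds_ratio)
    alleles_cases = 2 * n_cases        # two alleles per diploid individual
    alleles_controls = 2 * n_controls
    pooled = (alleles_cases * p1 + alleles_controls * p0) / (
        alleles_cases + alleles_controls)
    # Standard errors of the frequency difference under H0 and under H1.
    se_null = sqrt(pooled * (1 - pooled)
                   * (1 / alleles_cases + 1 / alleles_controls))
    se_alt = sqrt(p1 * (1 - p1) / alleles_cases
                  + p0 * (1 - p0) / alleles_controls)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # The negligible probability of rejecting in the wrong direction is ignored.
    return 1.0 - NormalDist().cdf((z_crit * se_null - abs(p1 - p0)) / se_alt)


# Combined stage 1 + stage 2 sample sizes taken from the abstract.
N_CASES = 2906 + 2350
N_CONTROLS = 6738 + 2318
ALPHA = 0.05 / 85972  # assumed threshold, not stated in the abstract

for maf, odds_ratio in [(0.01, 1.86), (0.05, 1.36)]:
    power = allelic_test_power(N_CASES, N_CONTROLS, maf, odds_ratio, ALPHA)
    print(f"MAF={maf:.2f} OR={odds_ratio:.2f} approx. power={power:.2f}")
```

The helper `case_allele_freq` uses the standard conversion from a control allele frequency and an allelic odds ratio to the expected case frequency; everything downstream of it depends on the assumed threshold.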
Comprehensive Association Study of Type 2 Diabetes and Related Quantitative Traits With 222 Candidate Genes
OBJECTIVE: Type 2 diabetes is a common complex disorder with environmental and genetic components. We used a candidate gene–based approach to identify single nucleotide polymorphism (SNP) variants in 222 candidate genes that influence susceptibility to type 2 diabetes.
RESEARCH DESIGN AND METHODS: In a case-control study of 1,161 type 2 diabetic subjects and 1,174 control Finns who are normal glucose tolerant, we genotyped 3,531 tagSNPs and annotation-based SNPs and imputed an additional 7,498 SNPs, providing 99.9% coverage of common HapMap variants in the 222 candidate genes. Selected SNPs were genotyped in an additional 1,211 type 2 diabetic case subjects and 1,259 control subjects who are normal glucose tolerant, also from Finland.
RESULTS: Using SNP- and gene-based analysis methods, we replicated previously reported SNP-type 2 diabetes associations in PPARG, KCNJ11, and SLC2A2; identified significant SNPs in genes with previously reported associations (ENPP1 [rs2021966, P = 0.00026] and NRF1 [rs1882095, P = 0.00096]); and implicated novel genes, including RAPGEF1 (rs4740283, P = 0.00013) and TP53 (rs1042522, Arg72Pro, P = 0.00086), in type 2 diabetes susceptibility.
CONCLUSIONS: Our study provides an effective gene-based approach to association study design and analysis. One or more of the newly implicated genes may contribute to type 2 diabetes pathogenesis. Analysis of additional samples will be necessary to determine their effect on susceptibility.
Integrating large scale genetic and clinical information to predict cases of heart failure
Heart failure (HF) is a major global cause of death. Early risk prediction and intervention could mitigate disease progression. We aimed to improve HF prediction by integrating genome-wide association studies (GWAS)- and electronic health records (EHR)-derived risk scores. We previously performed a large HF GWAS within the Global Biobank Meta-analysis Initiative to create a polygenic risk score (PRS). Three Michigan Medicine (MM) cohorts were used to develop the clinical risk score (ClinRS): 1) Primary Care Provider cohort (MM-PCP; N = 61,849), 2) Heart Failure cohort (MM-HF; N = 53,272), and 3) Michigan Genomics Initiative cohort (MM-MGI; N = 60,215). To extract information from high-dimensional EHR data, we leveraged natural language processing to generate 350 latent phenotypes representing EHR codes and used coefficients from LASSO regression on these phenotypes in a training set as weights to calculate ClinRS in a validation set. Using logistic regression, model performances were compared between baseline model and models with risk scores added: 1) PRS, 2) ClinRS, and 3) ClinRS+PRS. We further compared the proposed models with Atherosclerosis Risk in Communities (ARIC) HF risk score. PRS and ClinRS each predict HF outcomes significantly better than the baseline model, up to eight years prior to HF diagnosis. Including both PRS and ClinRS further improves prediction performance up to ten years prior to diagnosis, two years earlier than either score alone. Additionally, ClinRS significantly outperforms the ARIC model one year prior. We demonstrate the additive power of integrating GWAS- and EHR-derived risk scores to predict HF cases prior to diagnosis. This standardizable and scalable risk predictor may enable physicians to provide earlier interventions to improve patient outcomes.
Heart failure (HF) is a leading cause of death worldwide. Early identification of individuals at high risk could facilitate interventions to slow disease progression. In this study, we develop an approach to improve HF risk prediction by combining patient genetic information and clinical information from electronic health records (EHR). We create two risk scores: a polygenic risk score (PRS) based on genetic information, and a clinical risk score (ClinRS) based on patient EHR. We test how well these scores predict HF before diagnosis. Both PRS and ClinRS improve predictions individually and identify high-risk individuals up to eight years in advance. When used together, they provide greater accuracy, predicting HF up to ten years before diagnosis. We suggest that combining genetic and clinical information could help doctors detect HF earlier for better treatment and prevention strategies in the future.
Kuan-Han et al. examine if integrating patient genetic and clinical information from electronic health records can better predict heart failure in patients. Their findings show improvement in heart failure prediction up to ten years prior to diagnosis, which is two years earlier than using a single risk score alone.
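The ClinRS construction described in this entry, where LASSO coefficients from a training set serve as weights over latent phenotypes, amounts to a weighted sum that can then enter a logistic model alongside the PRS. The following is a minimal sketch of that idea only. The phenotype names, weights, and logistic coefficients are hypothetical placeholders, not values from the study, and the real pipeline uses 350 latent phenotypes and additional baseline covariates.

```python
from math import exp
from typing import Dict


def clinical_risk_score(phenotypes: Dict[str, float],
                        weights: Dict[str, float]) -> float:
    """Weighted sum of latent phenotype values using pre-trained weights."""
    return sum(w * phenotypes.get(name, 0.0) for name, w in weights.items())


def predicted_risk(intercept: float, beta_prs: float, beta_clinrs: float,
                   prs: float, clinrs: float) -> float:
    """Logistic model combining both scores into a predicted probability."""
    linear = intercept + beta_prs * prs + beta_clinrs * clinrs
    return 1.0 / (1.0 + exp(-linear))


# Hypothetical LASSO weights; a zero weight means LASSO dropped that phenotype.
lasso_weights = {"latent_1": 0.5, "latent_2": 0.0, "latent_3": -0.2}
# Hypothetical latent phenotype values for one patient in the validation set.
patient = {"latent_1": 1.0, "latent_2": 3.0, "latent_3": 2.0}

clinrs = clinical_risk_score(patient, lasso_weights)
# Hypothetical logistic coefficients and PRS value, for illustration only.
risk = predicted_risk(intercept=-2.0, beta_prs=0.3, beta_clinrs=0.8,
                      prs=1.2, clinrs=clinrs)
print(f"ClinRS={clinrs:.3f} predicted risk={risk:.3f}")
```

Keeping the weights fixed after training and applying them unchanged to a separate validation set is what lets the score be evaluated without refitting on the patients being scored.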
Powerful, scalable and resource-efficient meta-analysis of rare variant associations in large whole genome sequencing studies
Variant Classification Using Proteomics-Informed Large Language Models Increases Power of Rare Variant Association Studies and Enhances Target Discovery
Christopher E. Gillies, Joelle Mbatchou, Lukas Habegger et al.|bioRxiv (Cold Spring Harbor Laboratory)|2025
Rare variant association analysis, which assesses the aggregate effect of rare damaging variants within a gene, is a powerful strategy for advancing knowledge of human biology. Numerous models have been proposed to identify damaging coding variants, with the most recent ones employing deep learning and large language models (LLM) to predict the impact of changes in coding sequences. Here, we use newly available proteomics data on 2,898 proteins across 46,665 individuals to evaluate and refine LLM predictors of damaging variants. Using one of these refined models, we evaluate association between rare damaging variants and human phenotypes at 241 positive control gene-trait pairs. Among these gene-trait pairs, our proteomics-guided model outperforms an ensemble of conventional approaches including PolyPhen2, Mutation Taster, SIFT, and LRT, as well as newer machine learning approaches for identifying damaging missense variants, such as CADD, ESM-1v, ESM-1b and AlphaMissense. When attempting to recover known associations by correctly separating damaging singleton missense variants from other singleton variants, our approach recapitulates 36.5% of gene-trait pairs with known associations, exceeding all the alternatives we considered. Furthermore, when we apply our model to 10 exemplary traits from the UK Biobank, we identify 177 gene-trait associations, again exceeding all other approaches. Our results demonstrate that summary statistics from large-scale human proteomics data enable evaluation and refinement of coding variant classification LLMs, improving discovery potential in human genetic studies.
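The aggregation step behind a rare variant association analysis can be sketched as a simple gene-level collapse: keep variants that a classifier scores as damaging and that are rare, then mark each individual who carries at least one of them. This is a generic burden-style sketch under stated assumptions, not the method used in the preprint. The variant identifiers, damage scores, `score_threshold`, and `max_freq` are all hypothetical.

```python
from typing import Dict, List


def alt_allele_frequency(counts: List[int]) -> float:
    """Alternate-allele frequency from per-individual allele counts (0, 1, 2)."""
    return sum(counts) / (2 * len(counts))


def gene_burden(genotypes: Dict[str, List[int]],
                damage_scores: Dict[str, float],
                score_threshold: float,
                max_freq: float) -> List[int]:
    """Return 1 for individuals carrying any rare, predicted-damaging variant."""
    n_individuals = len(next(iter(genotypes.values())))
    burden = [0] * n_individuals
    for variant, counts in genotypes.items():
        # Skip variants the (hypothetical) classifier does not call damaging.
        if damage_scores.get(variant, 0.0) < score_threshold:
            continue
        # Skip variants that are too common to count as rare.
        if alt_allele_frequency(counts) >= max_freq:
            continue
        for i, count in enumerate(counts):
            if count > 0:
                burden[i] = 1
    return burden


# Toy data for one gene: three variants genotyped in six individuals.
genotypes = {
    "v1": [1, 0, 0, 0, 0, 0],  # rare, scored as damaging
    "v2": [0, 1, 0, 0, 0, 0],  # rare, scored as benign
    "v3": [1, 1, 1, 1, 0, 0],  # scored as damaging but common
}
damage_scores = {"v1": 0.90, "v2": 0.20, "v3": 0.95}

carriers = gene_burden(genotypes, damage_scores,
                       score_threshold=0.5, max_freq=0.1)
print(carriers)
```

The resulting carrier vector would then be tested against a phenotype, so a classifier that separates damaging from benign variants more accurately changes which individuals count as carriers and, in turn, the power of the test.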