Maarja Jõeloo is an MSCA Postdoctoral Fellow in the Computational and Statistical Genomics group at the Institute for Molecular Medicine Finland (FIMM), University of Helsinki. She obtained her PhD in 2024 from the University of Tartu, where she developed methods for genotyping microarray-based copy number variation (CNV) data and investigated the effects of rare CNVs on complex phenotypes in the Estonian Biobank.
Her current research focuses on the discovery and characterisation of structural variation and tandem repeats using both short- and long-read sequencing technologies. She is working on approaches to integrate complex and repetitive variants into biobank-scale cohorts to enable large-scale analyses of their effects on human traits and disease.
Integrating structural variants and tandem repeats into biobank-scale association studies
Genome-wide association studies (GWAS) have shaped our understanding of the genetic architecture of human disease. Yet the story they tell remains incomplete: the vast majority of GWAS findings are confined to SNVs and short indels, while larger and more complex forms of genetic variation remain largely invisible. Structural variants (SVs) and tandem repeats (TRs) are notoriously difficult to capture with standard genotyping and imputation approaches, yet they likely underlie a meaningful share of unexplained heritability. Accounting for them promises more precise fine-mapping and a better biological understanding of well-known associations.
To address this, the Estonian Biobank (EstBB) is generating Europe’s largest high-coverage (20x) long-read sequencing (LRS) resource, targeting 10,000 participants with PacBio HiFi technology. Long-read data opens up parts of the genome that short reads simply cannot resolve, including complex structural rearrangements, tandem repeat expansions, and regions so repetitive or structurally intricate that they have historically been ignored.
In this talk, I will share what we have learned from this dataset so far. First, I will highlight the ability of LRS to fully resolve trait-associated complex SVs identified in a large-scale GWAS meta-analysis of the EstBB and UK Biobank cohorts, exemplified by a rare high-impact complex multiplication of the LDLR gene associated with 44% lower lifetime non-HDL cholesterol, a finding that only becomes fully interpretable with long-read data. Second, I will present a framework for constructing a haplotype-resolved imputation reference panel designed to project SVs and TRs into the full EstBB cohort of over 210,000 participants, enabling biobank-scale discovery of structural drivers of human health and disease.