Oral Presentation Australasian RNA Biology and Biotechnology Association 2025 Conference

Interpretable machine learning for the systematic annotation of functional domains in long non-coding RNAs (130187)

Martin A Smith (1,2,3,4,5,6), Vanda Gaonac'h-Lovejoy (1,2,7), John Mattick (3,8), Martin Sauvageau (7,9,10)
  1. Université de Montréal, Montreal, QC, Canada
  2. CHU Sainte-Justine Research Centre, Montreal, QC, Canada
  3. UNSW RNA Institute, Sydney
  4. Ramaciotti Centre for Genomics, UNSW, Sydney
  5. UNSW AI Institute, Sydney
  6. Australian Centre for Nanomedicine, UNSW, Sydney
  7. Montreal Clinical Research Institute, Montreal
  8. UNSW Sydney, NSW, Australia
  9. Biochemistry and Molecular Medicine, University of Montreal, Montreal, Canada
  10. McGill University, Montreal

Long-read sequencing is rapidly expanding the catalog of long non-coding RNAs (lncRNAs), revealing thousands with roles in cancer and other diseases. Many disease-associated lncRNAs overlap GWAS loci and can reshape the epitranscriptome, influence chromatin organization, or modulate transcription. This striking diversity of molecular phenotypes underscores their importance but also poses a challenge: unlike protein-coding genes, lncRNAs generally lack the primary sequence conservation and recognizable domains that would enable a systematic framework for functional classification.

What is missing is a unifying strategy to move beyond case-by-case characterization toward scalable annotation. RNA structure offers a basis for such a strategy, as all well-characterized non-coding RNAs function through higher-order structures. Conserved helices and motifs provide evolutionary signatures of function even when primary sequence diverges, offering a principled way to prioritize lncRNAs for mechanistic study.

I will present ECSFinder, a new computational framework for identifying evolutionarily conserved RNA structures (ECSs). By integrating thermodynamic folding, background modeling, and covariation analysis within a supervised learning classifier, ECSFinder outperforms existing methods across benchmarks. Applied genome-wide to mammalian alignments, it identifies more than 700,000 ECSs enriched in promoters, untranslated regions, enhancers, and lncRNA exons. These structures overlap disease-associated variants at significantly elevated rates and highlight candidate lncRNAs with therapeutic potential.
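To give a sense of what covariation analysis measures, the following is a minimal Python sketch that scores two alignment columns by their mutual information. It is a generic illustration only: the abstract does not specify which covariation statistic ECSFinder uses, and the function name `mutual_information` and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    Rows with a gap ('-') in either column are ignored. This is a generic
    covariation measure, assumed here for illustration; it is not taken
    from ECSFinder.
    """
    pairs = [(seq[i], seq[j]) for seq in alignment
             if seq[i] != "-" and seq[j] != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                 # counts of (base_i, base_j)
    left = Counter(a for a, _ in pairs)    # counts of base_i
    right = Counter(b for _, b in pairs)   # counts of base_j
    return sum(
        (count / n) * log2((count / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), count in joint.items()
    )


# Hypothetical toy alignment: columns 0 and 4 always form a Watson-Crick
# pair although the bases themselves differ in every sequence.
toy_alignment = [
    "GAAAC",
    "CAAAG",
    "AAAAU",
    "UAAAA",
]

if __name__ == "__main__":
    print(mutual_information(toy_alignment, 0, 4))  # compensatory columns
    print(mutual_information(toy_alignment, 1, 2))  # invariant columns
```

In this toy case the compensatory columns score 2 bits while the invariant columns score 0, which is the kind of signal that lets a conserved helix stand out even when the primary sequence diverges.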

Together, these findings establish RNA structure as a unifying axis for lncRNA functional annotation, linking disease association, evolutionary conservation, and epitranscriptomic regulation.