Advances in RNA sequencing have identified numerous bacterial small RNAs (sRNAs) that play essential roles in gene regulation by base pairing with target messenger RNAs (mRNAs). These interactions influence a range of biological processes, yet the factors that determine their regulatory outcomes remain poorly understood. To address this gap, our multidisciplinary team generated transcriptome-wide RNA-RNA interaction datasets using CLASH (Cross-linking, Ligation, and Sequencing of Hybrids) and CRAC (UV Cross-linking and Analysis of cDNA), enabling systematic analysis of sRNA-mRNA interactions in bacterial systems.
A machine learning framework was developed to predict regulatory outcomes by extracting over 800 biologically informed features from individual RNAs and their predicted duplexes. These features included k-mer frequencies, GC content in paired regions, pairing ratios, and minimum free energy, derived from dot-bracket structures and base-pairing matrices. Because the dataset is high-dimensional and imbalanced, feature selection was performed with bootstrapped leave-one-out cross-validation (LOOCV) using Logistic Regression and support vector machine (SVM) classifiers, and only features selected in at least 60% of runs were retained to support model robustness and generalizability. Final models were evaluated using stratified k-fold cross-validation with SMOTE oversampling.
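As an illustration of the structure-derived features named above, the following minimal Python sketch computes k-mer frequencies, the pairing ratio, and the GC content of paired positions from a sequence and its dot-bracket string. It is not the pipeline used in this work: the function names (kmer_frequencies, paired_positions, duplex_features), the choice of k = 2, and the single-strand dot-bracket input are assumptions made for the example, and minimum free energy is omitted because it requires an external folding tool.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Relative frequency of every k-mer over the alphabet (absent k-mers get 0)."""
    windows = max(len(seq) - k + 1, 1)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ("".join(p) for p in product(alphabet, repeat=k))
    return {kmer: counts[kmer] / windows for kmer in kmers}


def paired_positions(dot_bracket):
    """Indices of base-paired positions in a dot-bracket string."""
    stack, paired = [], set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced dot-bracket string")
            paired.update((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced dot-bracket string")
    return paired


def duplex_features(seq, dot_bracket, k=2):
    """Hypothetical feature vector for one RNA and its predicted structure."""
    if not seq or len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must be non-empty and equal in length")
    seq = seq.upper().replace("T", "U")  # treat DNA-style input as RNA
    paired = paired_positions(dot_bracket)
    gc_in_pairs = sum(seq[i] in "GC" for i in paired)
    features = {
        # Fraction of all positions that are base-paired.
        "pairing_ratio": len(paired) / len(seq),
        # Fraction of paired positions that are G or C.
        "gc_paired": gc_in_pairs / len(paired) if paired else 0.0,
    }
    for kmer, freq in kmer_frequencies(seq, k).items():
        features[f"kmer_{kmer}"] = freq
    return features


if __name__ == "__main__":
    # Toy hairpin-like input, chosen only to exercise the functions.
    for name, value in duplex_features("GCAUGC", "((..))").items():
        print(f"{name}\t{value:.3f}")
```

With k = 2 the sketch yields 18 features per RNA; larger k values and separate feature sets for the sRNA, the mRNA, and their duplex would be one plausible way to reach a feature count in the hundreds.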
Among the classifiers tested, Logistic Regression demonstrated the highest performance, achieving an accuracy of 0.74, precision of 0.56, recall of 0.69, specificity of 0.76, AUC of 0.79, and F1 score of 0.61.
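For readers less familiar with these metrics, the sketch below shows how accuracy, precision, recall, specificity, and F1 follow from the four cells of a binary confusion matrix. The function name and the counts in the usage lines are hypothetical and are not taken from this study; AUC is not included because it depends on the ranked prediction scores rather than on a single confusion matrix.

```python
def classification_metrics(tp, fp, tn, fn):
    """Threshold-based metrics from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, value in classification_metrics(tp=30, fp=20, tn=80, fn=10).items():
        print(f"{name}: {value:.2f}")
```

On an imbalanced dataset, accuracy and specificity can remain high while precision stays modest, which is why the full set of metrics is reported rather than accuracy alone.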
This framework provides a scalable and interpretable approach for identifying functional sRNA-mRNA interactions. It establishes a foundation for improving the annotation of regulatory RNA networks in bacteria and supports future efforts in systems biology, synthetic RNA circuit design, and antimicrobial target discovery.