Because both the transcript structure and the relative position on the transcript are found to be related to the occurrence and function of RNA sub-molecular events, transcript annotation could be used as an information source for predicting m6A modification.
To construct robust and precise machine learning predictors of RNA modification sites, multiple features were designed and extracted to encode RNA sequences.
Most m6A site prediction methods and web servers extract input features from sequence-derived and other genomic information and predict m6A sites with various machine learning approaches. Finally, the performance of the methods is evaluated (Figure 3).
In the computational approaches published to date, features fall mainly into six categories [Citation110]: RNA primary sequence-derived features, nucleotide physicochemical properties, predicted RNA structural features, position weight matrices, RNA sequence similarity features, and genomic-derived features.
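As an illustration of the first category, the following minimal sketch encodes an RNA sequence with two common primary sequence-derived descriptors, one-hot encoding and k-mer composition. The function names `one_hot` and `kmer_composition` are hypothetical and are not taken from any of the cited tools.

```python
from itertools import product

NUCLEOTIDES = "ACGU"


def one_hot(seq):
    """Return a flat one-hot vector (4 values per nucleotide, order A, C, G, U)."""
    seq = seq.upper().replace("T", "U")
    vec = []
    for base in seq:
        # Unknown symbols such as N map to an all-zero group of four values.
        vec.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return vec


def kmer_composition(seq, k=2):
    """Return the normalised frequency of every possible k-mer, in lexicographic order."""
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing unknown symbols
            counts[kmer] += 1
            total += 1
    return [counts[km] / total if total else 0.0 for km in kmers]
```

For a sequence of length L, `one_hot` yields 4L values and `kmer_composition` yields 4^k values, so the two descriptors can simply be concatenated into one feature vector.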
Numerous tools have been developed for feature extraction and modelling of primary sequences, such as BioSeq-Analysis [Citation111,Citation112], PyFeat [Citation113] and BioSeq-BLM [Citation114].
Furthermore, the perturb method and the SFS are also used for feature selection.
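Assuming that SFS here refers to sequential forward selection, the greedy procedure can be sketched as follows. The `score` argument is a hypothetical callable (for example, a cross-validated AUC computed for a candidate feature subset); it is an assumption of this sketch and not part of any cited method.

```python
def sequential_forward_selection(n_features, score, max_features):
    """Greedily add the feature that most improves score(subset).

    `score` is a hypothetical callable that takes a tuple of feature
    indices and returns a number, where higher is better.
    Returns (selected_indices, best_score).
    """
    selected = []
    best_score = float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # Evaluate every one-feature extension of the current subset.
        scored = [(score(tuple(selected + [f])), f) for f in candidates]
        top_score, top_feature = max(scored)
        if top_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        best_score = top_score
    return selected, best_score
```

The search stops either when `max_features` is reached or when adding any remaining feature no longer improves the score, which keeps the selected subset small.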
Therefore, geographic encoding of transcripts might be used for deep learning models applied to RNA transcripts.
Compared to other deep learning models, the transcript region information incorporated into genomic features by WHISTLE greatly improves its performance.
Combined with one-hot encoding, more informative and interpretable sub-molecular geographic descriptors of transcripts are provided.
Natural language processing is used for feature extraction and classification of m6A methylation sites, taking context information into account.
In addition to predicting m6A-containing sequences, the biological features surrounding m6A could be characterized to elucidate its regulatory code.
In addition, conservation analysis of individual m6A sites is achieved by a novel scoring framework, ConsRM.
However, information regarding the position relative to the boundaries of the long-range is neglected.
In addition, one-hot encoding is widely used to describe the transcript region [Citation129], but it may result in an incomplete landscape of the local transcript structure.
To fill this gap, three novel encoding methods, landmarkTX, gridTX, and chunkTX, were developed in Geo2vec.
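The limitation of a purely one-hot description of the transcript region can be illustrated with a small sketch. It is not an implementation of landmarkTX, gridTX, or chunkTX, whose details are not given here; it only contrasts a region one-hot vector with two simple positional descriptors. The function name, the three-region transcript model, and the half-open coordinate convention are all assumptions.

```python
REGIONS = ("5'UTR", "CDS", "3'UTR")


def encode_site(position, cds_start, cds_end, transcript_length):
    """Encode a 0-based transcript position.

    Regions are half-open intervals: 5'UTR = [0, cds_start),
    CDS = [cds_start, cds_end), 3'UTR = [cds_end, transcript_length).
    Returns (region_one_hot, relative_position, distance_to_nearest_boundary).
    """
    if not 0 <= position < transcript_length:
        raise ValueError("position outside transcript")
    bounds = [0, cds_start, cds_end, transcript_length]
    for idx in range(len(REGIONS)):
        start, end = bounds[idx], bounds[idx + 1]
        if start <= position < end:
            one_hot = [1 if j == idx else 0 for j in range(len(REGIONS))]
            # Fraction of the region lying upstream of the site (0 = region start).
            relative = (position - start) / (end - start)
            # Distance in nucleotides to the closer end of the region.
            distance = min(position - start, end - 1 - position)
            return one_hot, relative, distance
    raise ValueError("inconsistent region boundaries")
```

Two sites in the same region receive identical one-hot vectors, whereas the relative position and the boundary distance distinguish a site next to the stop codon from one in the middle of the CDS.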
Furthermore, experimental results indicate that the base, upstream, and downstream information of m6A sites are all critical to detection.
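A minimal sketch of how the base and its upstream and downstream context might be extracted as fixed-length inputs is given below. The function name, the default flank length, and the padding symbol are assumptions chosen for illustration.

```python
def extract_window(transcript, site, flank=20, pad="N"):
    """Return (upstream, base, downstream) around a 0-based site.

    Both flanks are padded with `pad` to exactly `flank` characters so
    that sites near the transcript ends still give fixed-length inputs.
    """
    if not 0 <= site < len(transcript):
        raise ValueError("site outside transcript")
    upstream = transcript[max(0, site - flank):site]
    downstream = transcript[site + 1:site + 1 + flank]
    upstream = pad * (flank - len(upstream)) + upstream
    downstream = downstream + pad * (flank - len(downstream))
    return upstream, transcript[site], downstream
```

Keeping the three parts separate makes it straightforward to test a model with one part masked and so assess how much each contributes to detection.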
Matthews Correlation Coefficient (MCC)
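The MCC is computed from the four entries of the confusion matrix: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The sketch below follows the standard definition; returning 0 when the denominator vanishes is a common convention and is an assumption here.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0 when a row or column of the
    # confusion matrix is empty and the ratio is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

The value ranges from -1 (complete disagreement) through 0 (no better than chance) to +1 (perfect prediction), and it remains informative when positive and negative sites are imbalanced.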
The higher the AUC and AUPRC values, the better the prediction performance.
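Both quantities can be computed directly from predicted scores and binary labels. The sketch below uses the pairwise-ranking definition of the AUC and average precision as a step-wise estimate of the AUPRC; other estimators of the AUPRC exist, so this choice is an assumption.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive.

    Tied scores are ranked in input order, which is a simplification.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos
```

A random classifier gives an AUC near 0.5, whereas the baseline of the AUPRC equals the fraction of positive sites, which is why the AUPRC is the more sensitive of the two on imbalanced data.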
K-fold cross-validation test
jackknife validation test
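The two validation schemes differ only in how the samples are partitioned. The following sketch generates the index splits for each; it treats the jackknife test as leave-one-out cross-validation, which is how the term is commonly used for sequence-based predictors. The function names and the fixed random seed are assumptions.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for k disjoint folds of near-equal size."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducible folds
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def jackknife_splits(n_samples):
    """Yield leave-one-out splits: each sample is the test set exactly once."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]
```

The jackknife test gives a unique result for a given dataset because it involves no random partitioning, but it requires training one model per sample; K-fold cross-validation is cheaper, at the cost of some dependence on how the folds are drawn.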