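The PDB IDs (Protein Data Bank IDs) of all RNAs in the dataset were taken from the work of Pun et al. [16]. The structures in the dataset were selected as follows. First, X-ray crystal structures that report B-factors, have an RNA sequence longer than 32 nucleotides, and have a resolution better than 3 Å were downloaded from the PDB. Next, CD-HIT [27] was used to remove redundancy: discarding RNA sequences with more than 80% similarity left 142 RNA sequences. Finally, these sequences were split at random, with approximately 75% used as the training set and 25% as the test set, so that the training set contains 108 RNA chains and the test set contains 34 RNA chains. In this paper, the B-factor of the C1′ atom of each nucleotide residue is used to represent the flexibility of that nucleotide.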
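The last two steps described above (reading the per-nucleotide B-factor and drawing the random split) can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names `c1_prime_bfactors` and `random_split`, the seed, and the choice to keep only the blank or `A` alternate location are assumptions. The source does not state how the split was drawn, so the training-set size is passed in explicitly rather than derived from the 75% figure. The length, resolution, and CD-HIT filtering steps are not shown.

```python
import random


def c1_prime_bfactors(pdb_text, chain_id):
    """Return (residue_number, residue_name, b_factor) for each C1' atom of one chain.

    Reads fixed-column ATOM records of the legacy PDB format:
    atom name in columns 13-16, chain ID in column 22,
    residue number in columns 23-26, B-factor in columns 61-66.
    """
    records = []
    for line in pdb_text.splitlines():
        # Skip HETATM and non-coordinate records, and lines too short to hold a B-factor.
        if not line.startswith("ATOM") or len(line) < 66:
            continue
        # Older files write the sugar carbon as C1* instead of C1'.
        if line[12:16].strip() not in ("C1'", "C1*"):
            continue
        if line[21] != chain_id:
            continue
        # Assumption: keep only the first alternate location to avoid duplicates.
        if line[16] not in (" ", "A"):
            continue
        residue_number = int(line[22:26])
        residue_name = line[17:20].strip()
        b_factor = float(line[60:66])
        records.append((residue_number, residue_name, b_factor))
    return records


def random_split(chain_ids, n_train, seed=0):
    """Shuffle the chain IDs reproducibly and return (training, test) lists."""
    shuffled = list(chain_ids)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]
```

With 142 non-redundant chains, `random_split(chain_ids, n_train=108)` gives the 108/34 partition sizes reported above, although the particular chains in each set depend on the seed and will not match the paper's split.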
Table 1. Best word vector parameters corresponding to different models
Table 2. Comparison of results from different machine learning algorithms on the training and test sets
Table 3. Comparison between the RNAfwe method and other sequence-based methods on the training and test sets
References
Carugo, O. and Argos, P. (1998) Accessibility to Internal Cavities and Ligand Binding Sites Monitored by Protein Crystallographic Thermal Factors. Proteins: Structure, Function, and Bioinformatics, 31, 201-213. https://doi.org/10.1002/(SICI)1097-0134(19980501)31:2<201::AID-PROT9>3.0.CO;2-O
Schneider, B., Gelly, J., de Brevern, A.G., et al. (2014) Local Dynamics of Proteins and DNA Evaluated from Crystallographic B Factors. Acta Crystallographica Section D: Biological Crystallography, 70, 2413-2419. https://doi.org/10.1107/S1399004714014631
Liu, Q., Kwoh, C.K. and Li, J. (2013) Binding Affinity Prediction for Protein-Ligand Complexes Based on β Contacts and B Factor. Journal of Chemical Information and Modeling, 53, 3076-3085. https://doi.org/10.1021/ci400450h
Li, C., Lv, D., Zhang, L., et al. (2016) Approach to the Unfolding and Folding Dynamics of Add A-Riboswitch upon Adenine Dissociation Using a Coarse-Grained Elastic Network Model. The Journal of Chemical Physics, 145, Article ID: 014104. https://doi.org/10.1063/1.4954992
Hu, Y., Cheng, K., He, L., et al. (2021) NMR-Based Methods for Protein Analysis. Analytical Chemistry, 93, 1866-1879. https://doi.org/10.1021/acs.analchem.0c03830
Ishima, R. and Torchia, D. (2000) Protein Dynamics from NMR. Nature Structural Biology, 7, 740-743. https://doi.org/10.1038/78963
Sasmal, D.K., Pulido, L.E., Kasal, S., et al. (2016) Single-Molecule Fluorescence Resonance Energy Transfer in Molecular Biology. Nanoscale, 8, 19928-19944. https://doi.org/10.1039/C6NR06794H
Hoshino, M., Adachi, S. and Koshihara, S. (2015) Crystal Structure Analysis of Molecular Dynamics Using Synchrotron X-Rays. CrystEngComm, 17, 8786-8795. https://doi.org/10.1039/C5CE01128K
Christoforides, E., Fourtaka, K., Andreou, A., et al. (2020) X-Ray Crystallography and Molecular Dynamics Studies of the Inclusion Complexes of Geraniol in β-Cyclodextrin, Heptakis(2,6-di-O-methyl)-β-Cyclodextrin and Heptakis(2,3,6-tri-O-methyl)-β-Cyclodextrin. Journal of Molecular Structure, 1202, Article ID: 127350. https://doi.org/10.1016/j.molstruc.2019.127350
Hollingsworth, S.A. and Dror, R.O. (2018) Molecular Dynamics Simulation for All. Neuron, 99, 1129-1143. https://doi.org/10.1016/j.neuron.2018.08.011
McCammon, J.A., Gelin, B.R. and Karplus, M. (1977) Dynamics of Folded Proteins. Nature, 267, 585-590. https://doi.org/10.1038/267585a0
Bahar, I., Atilgan, A.R. and Erman, B. (1997) Direct Evaluation of Thermal Fluctuations in Proteins Using a Single-Parameter Harmonic Potential. Folding and Design, 2, 173-181. https://doi.org/10.1016/S1359-0278(97)00024-2
Tian, F., Zhang, C., Fan, X., et al. (2010) Predicting the Flexibility Profile of Ribosomal RNAs. Molecular Informatics, 29, 707-715. https://doi.org/10.1002/minf.201000092
Guruge, I., Taherzadeh, G., Zhan, J., et al. (2018) B-Factor Profile Prediction for RNA Flexibility Using Support Vector Machines. Journal of Computational Chemistry, 39, 407-411. https://doi.org/10.1002/jcc.25124
Wei, H., Wang, B., Yang, J., et al. (2021) RNA Flexibility Prediction with Sequence Profile and Predicted Solvent Accessibility. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 18, 2017-2022. https://doi.org/10.1109/TCBB.2019.2956496
Pun, C.S., Yong, B.Y.S. and Xia, K. (2020) Weighted-Persistent-Homology-Based Machine Learning for RNA Flexibility Analysis. PLOS ONE, 15, e0237747. https://doi.org/10.1371/journal.pone.0237747
Nguyen, T., Le, N., Ho, Q., et al. (2019) Using Word Embedding Technique to Efficiently Represent Protein Sequences for Identifying Substrate Specificities of Transporters. Analytical Biochemistry, 577, 73-81. https://doi.org/10.1016/j.ab.2019.04.011
Goth, G. (2016) Deep or Shallow, NLP Is Breaking Out. Communications of the ACM, 59, 13-16.
Solan, Z., Horn, D., Ruppin, E., et al. (2005) Unsupervised Learning of Natural Languages. Proceedings of the National Academy of Sciences of the United States of America, 102, 11629-11634. https://doi.org/10.1073/pnas.0409746102
Strait, B.J. and Dewey, T.G. (1996) The Shannon Information Entropy of Protein Sequences. Biophysical Journal, 71, 148-155. https://doi.org/10.1016/S0006-3495(96)79210-X
Yu, L., Tanwar, D.K., Penha, E.D.S., et al. (2019) Grammar of Protein Domain Architectures. Proceedings of the National Academy of Sciences of the United States of America, 116, 3636-3645. https://doi.org/10.1073/pnas.1814684116
Ptitsyn, O.B. (1991) How Does Protein Synthesis Give Rise to the 3D-Structure? FEBS Letters, 285, 176-181. https://doi.org/10.1016/0014-5793(91)80799-9
Qiu, W., Lv, Z., Xiao, X., et al. (2021) EMCBOW-GPCR: A Method for Identifying G-Protein Coupled Receptors Based on Word Embedding and Wordbooks. Computational and Structural Biotechnology Journal, 19, 4961-4969. https://doi.org/10.1016/j.csbj.2021.08.044
Hamid, M. and Friedberg, I. (2019) Identifying Antimicrobial Peptides Using Word Embedding with Deep Recurrent Neural Networks. Bioinformatics, 35, 2009-2016. https://doi.org/10.1093/bioinformatics/bty937
Nguyen, T., Le, N., Ho, Q., et al. (2020) TNFPred: Identifying Tumor Necrosis Factors Using Hybrid Features Based on Word Embeddings. BMC Medical Genomics, 13, Article No. 155. https://doi.org/10.1186/s12920-020-00779-w
Mikolov, T., Chen, K., Corrado, G., et al. (2013) Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
Li, W. and Godzik, A. (2006) Cd-Hit: A Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences. Bioinformatics, 22, 1658-1659. https://doi.org/10.1093/bioinformatics/btl158