Cyber Threat and Vulnerability Classification Using NLP and Machine Learning Techniques on Text-Based Security Data
Keywords:
Cyber Security, Machine Learning, NLP, SVM, Random Forest
Abstract
Detecting and classifying cyber threats accurately is a central problem in the rapidly evolving cybersecurity sector. As security data grows in volume and complexity, machine learning (ML) techniques are needed to automate threat detection. This research evaluates six ML algorithms for cybersecurity threat classification: Logistic Regression, SVM, Random Forest, Naive Bayes, LSTM, and BERT. The models are compared systematically on accuracy, precision, recall, F1-score, and execution time. Preprocessing begins with tokenization, followed by stop-word removal and TF-IDF vectorization, and the study also examines how categorical and continuous feature encoding methods affect the results. The original contribution is an analysis of the tradeoff between predictive performance and speed for deep learning models versus standard models in cybersecurity contexts. BERT is the strongest model, reaching 93.8% accuracy and a 96.2% ROC-AUC score at the cost of higher computational requirements. Random Forest and SVM produced comparable results, while Naive Bayes performed worst in terms of accuracy and recall. Although BERT outperforms the other models, its high computing requirements prevent real-time deployment.
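
The preprocessing sequence named in the abstract (tokenization, stop-word removal, TF-IDF vectorization) can be illustrated with a minimal sketch. This is not the pipeline used in the study: the stop-word list, the example texts, and the function names `tokenize`, `preprocess`, and `tfidf` are hypothetical, and the weighting shown is one common smoothed TF-IDF variant chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "and", "is", "via", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def preprocess(text):
    """Tokenize, then drop stop words."""
    return [tok for tok in tokenize(text) if tok not in STOP_WORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: tf = count / document length,
    idf = ln((1 + N) / (1 + df)) + 1, where N is the number of
    documents and df the number of documents containing the term.
    """
    tokenized = [preprocess(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: count each term once per document.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty documents
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


# Made-up example texts, not drawn from the study's data set.
docs = [
    "Buffer overflow in the parser allows remote code execution",
    "Phishing email steals user credentials via a fake login page",
    "SQL injection in the login form exposes user credentials",
]
vectors = tfidf(docs)
```

Under this weighting, a term that appears in only one text (such as "phishing") receives a higher weight than a term shared across texts (such as "login"), which is the property that makes TF-IDF features useful as input to classifiers like Logistic Regression, SVM, or Naive Bayes.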