Automated Extraction of Software Names from Vulnerability Reports using LSTM and Expert System

Author: Igor Khokhlov
Affiliation: Sacred Heart University
Country: #Italy
Year: #2022
Source: #IEEE_STC
Keywords: #information_extraction
Code repository:
Note created: 2023-05-15 10:20

Abstract

Proposes a machine learning method that automatically extracts software product names and versions from unstructured CVE descriptions.
Creates context-aware features using Word2Vec and Char2Vec.
Uses these features to train an LSTM-based NER model.
Based on previously published CVE descriptions, the author creates a set of Expert System (ES) rules that refine the predictions of the NER model and improve the performance of the method.

The method uses two main models: a NER model and an ES model.

The NER model classifies each word in the description as software name (SN), software version (SV), or other (O).
The ES model verifies the output of the NER model.
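
To make the SN/SV/O labelling concrete, here is a minimal sketch of how per-word labels could be collapsed into name and version spans. The function name `group_entities`, the example sentence, and its labels are hypothetical illustrations, not taken from the paper.

```python
from typing import List, Tuple


def group_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens that share an SN or SV label into one span."""
    spans: List[Tuple[str, str]] = []
    prev = "O"
    for token, label in zip(tokens, labels):
        if label != "O":
            if label == prev:
                # Same label as the previous token: extend the current span.
                spans[-1] = (label, spans[-1][1] + " " + token)
            else:
                # Label changed (or we came from "O"): start a new span.
                spans.append((label, token))
        prev = label
    return spans


# Hypothetical CVE-like fragment with hand-assigned labels.
tokens = ["Example", "Server", "2.4", "and", "2.5"]
labels = ["SN", "SN", "SV", "O", "SV"]
spans = group_entities(tokens, labels)
```

In the described pipeline, the labels would come from the NER model and then be checked by the ES rules before a grouping step like this one.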

The author examined the dataset and found that:
- 61.5% of all SNs occur within the first ten words of the sentence, and almost 91% lie within the first 30 words.
- Almost 73% of all SVs lie no further than five words from the related SN, and almost 90% lie within ten words.
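
Statistics of this kind could be computed from a labelled dataset roughly as follows. This is a sketch under assumptions: each sentence is given as a list of per-word labels, the "related" SN is approximated by the nearest SN in the same sentence, and SVs in sentences without any SN are skipped. The function names are hypothetical.

```python
from typing import List


def fraction_sn_within(sentences: List[List[str]], k: int) -> float:
    """Share of SN-labelled words that appear among the first k words."""
    total = hits = 0
    for labels in sentences:
        for i, label in enumerate(labels):
            if label == "SN":
                total += 1
                hits += i < k
    return hits / total if total else 0.0


def fraction_sv_near_sn(sentences: List[List[str]], max_dist: int) -> float:
    """Share of SV-labelled words within max_dist words of the nearest SN."""
    total = hits = 0
    for labels in sentences:
        sn_positions = [i for i, label in enumerate(labels) if label == "SN"]
        if not sn_positions:
            continue  # Assumption: SVs without any SN in the sentence are ignored.
        for i, label in enumerate(labels):
            if label == "SV":
                total += 1
                hits += min(abs(i - p) for p in sn_positions) <= max_dist
    return hits / total if total else 0.0


# Toy label sequences, not real data.
toy = [
    ["O", "SN", "SV", "O"],
    ["O"] * 12 + ["SN"] + ["O"] * 6 + ["SV"],
]
```

On the toy input, one of the two SNs falls within the first ten words, and one of the two SVs lies within five words of its SN.
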
Software Name Extraction Rules
- The word is tagged as NNP (proper noun), lies within 40 words of the beginning of the sentence, is not an article, and is in the CPE dictionary. This rule is based on the training dataset analysis (see Table I).
- The word is between two SNs.
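
The two name rules above could be sketched as follows. Several details are assumptions, because the notes do not specify them: POS tags are supplied by the caller (Penn Treebank style), the CPE dictionary is represented by a plain set of lowercase names, "between two SNs" is read as "both immediate neighbours are SNs", and the rules only promote words to SN rather than demote them. The name `apply_sn_rules` is hypothetical.

```python
from typing import List, Set

ARTICLES = {"a", "an", "the"}


def apply_sn_rules(tokens: List[str], pos_tags: List[str], labels: List[str],
                   cpe_names: Set[str], max_pos: int = 40) -> List[str]:
    """Return a copy of labels with the two SN rules applied."""
    refined = list(labels)
    # Rule 1: NNP, within the first max_pos words, not an article, in CPE.
    for i, (token, tag) in enumerate(zip(tokens, pos_tags)):
        word = token.lower()
        if tag == "NNP" and i < max_pos and word not in ARTICLES and word in cpe_names:
            refined[i] = "SN"
    # Rule 2: a word whose two immediate neighbours are both SNs.
    snapshot = list(refined)  # Evaluate rule 2 on the result of rule 1 only.
    for i in range(1, len(tokens) - 1):
        if snapshot[i - 1] == "SN" and snapshot[i + 1] == "SN":
            refined[i] = "SN"
    return refined


# Toy input: the CPE set here is a stand-in, not the real dictionary.
tokens = ["The", "Apache", "HTTP", "Server", "before", "2.4"]
pos_tags = ["DT", "NNP", "NNP", "NNP", "IN", "CD"]
ner_labels = ["O", "SN", "O", "O", "O", "SV"]
refined = apply_sn_rules(tokens, pos_tags, ner_labels, {"apache", "server"})
```

Here "Server" is promoted by rule 1, and "HTTP" is then promoted by rule 2 because it sits between two SNs.
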
Software Version Extraction Rules
- The word contains digits and is no further than 30 words from the last SN. This rule is based on the training dataset analysis.
- The word is in the list of trigger words and is no further than 30 words from the last SN.
- The word contains digits, and the previous word is classified as an SV.
- The word is “and” or “or” and lies between two SVs.
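
A minimal sketch of the four version rules, under stated assumptions: the "last SN" is taken to be the most recent SN before the word in the same sentence, "between two SVs" means both immediate neighbours are SVs, and the trigger-word list below is a hypothetical placeholder because the notes do not record the actual list. The name `apply_sv_rules` is also hypothetical.

```python
from typing import List, Optional, Set

# Hypothetical trigger words; the actual list is not given in these notes.
TRIGGER_WORDS = {"before", "through", "earlier", "prior"}


def apply_sv_rules(tokens: List[str], labels: List[str],
                   trigger_words: Set[str] = TRIGGER_WORDS,
                   max_dist: int = 30) -> List[str]:
    """Return a copy of labels with the four SV rules applied."""
    refined = list(labels)
    last_sn: Optional[int] = None
    for i, token in enumerate(tokens):
        if refined[i] == "SN":
            last_sn = i
            continue
        has_digit = any(ch.isdigit() for ch in token)
        near_sn = last_sn is not None and i - last_sn <= max_dist
        prev_is_sv = i > 0 and refined[i - 1] == "SV"
        if ((has_digit and near_sn)                           # rule 1
                or (token.lower() in trigger_words and near_sn)  # rule 2
                or (has_digit and prev_is_sv)):               # rule 3
            refined[i] = "SV"
    # Rule 4: "and" / "or" directly between two SVs.
    snapshot = list(refined)
    for i in range(1, len(tokens) - 1):
        if (tokens[i].lower() in {"and", "or"}
                and snapshot[i - 1] == "SV" and snapshot[i + 1] == "SV"):
            refined[i] = "SV"
    return refined


# Toy sentence with a made-up product name.
tokens = ["ExampleServer", "before", "2.4", "and", "2.5", "allows", "attacks"]
ner_labels = ["SN", "O", "O", "O", "O", "O", "O"]
refined = apply_sv_rules(tokens, ner_labels)
```

With the toy input, "before" is picked up by the trigger-word rule, both version numbers by the digit rule, and "and" by the conjunction rule, so the whole phrase "before 2.4 and 2.5" is labelled SV.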

Purpose:
Method:
Significance:
Results: