Filter Methods for Feature Selection in Supervised Machine Learning Applications -- Review and Benchmark

Time: 2023-03-20 14:50:39

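The volume of data available to machine learning (ML) applications keeps growing. With ongoing digitization, not only the number of observations but especially the number of measured variables (features) is increasing. Selecting the most appropriate features for predictive modeling is an important lever for the success of ML applications in business and research. Many feature selection methods (FSMs) that are independent of any particular ML algorithm, the so-called filter methods, have been proposed, but researchers and quantitative modelers have little guidance on choosing a suitable method for typical ML problems. This review synthesizes the substantial literature on feature selection benchmarking and evaluates the performance of 58 methods in the widely used R environment. To give concrete guidance, we consider four typical dataset scenarios that are challenging for ML models: noisy data, redundant data, imbalanced data, and data with more features than observations. Drawing on the experience of earlier benchmarks, which considered far fewer FSMs, we compare the methods on four criteria: predictive performance, number of relevant features selected, stability of the feature sets, and runtime. We found that methods relying on the random forest approach, the double input symmetrical relevance filter (DISR), and the joint impurity filter (JIM) were well-performing candidates for the given dataset scenarios.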

Original title: Filter Methods for Feature Selection in Supervised Machine Learning Applications -- Review and Benchmark

Original abstract: The amount of data for machine learning (ML) applications is constantly growing. Not only the number of observations, especially the number of measured variables (features) increases with ongoing digitization. Selecting the most appropriate features for predictive modeling is an important lever for the success of ML applications in business and research. Feature selection methods (FSM) that are independent of a certain ML algorithm - so-called filter methods - have been numerously suggested, but little guidance for researchers and quantitative modelers exists to choose appropriate approaches for typical ML problems. This review synthesizes the substantial literature on feature selection benchmarking and evaluates the performance of 58 methods in the widely used R environment. For concrete guidance, we consider four typical dataset scenarios that are challenging for ML models (noisy, redundant, imbalanced data and cases with more features than observations). Drawing on the experience of earlier benchmarks, which have considered much fewer FSMs, we compare the performance of the methods according to four criteria (predictive performance, number of relevant features selected, stability of the feature sets and runtime). We found methods relying on the random forest approach, the double input symmetrical relevance filter (DISR) and the joint impurity filter (JIM) were well-performing candidate methods for the given dataset scenarios.
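
To make the idea of a filter method concrete, here is a minimal sketch in Python (the paper itself works in R). It scores each feature with a statistic that never consults a learning algorithm, keeps the top k, and then checks how stable the selected set is across two halves of the data. Several things here are assumptions made for illustration and are not taken from the paper: the absolute Pearson correlation as the scoring statistic (it is a stand-in, not DISR or JIM), the Jaccard index as the stability measure, the synthetic data, and the helper names `pearson`, `filter_select`, `jaccard`, and `make_data`.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def filter_select(X, y, k):
    """Filter-style selection: score every column of X against y without
    training any model, then return the sorted indices of the k best columns."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def jaccard(a, b):
    """Jaccard index of two feature sets, used here as a simple stability score."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def make_data(n, seed):
    """Hypothetical noisy dataset: features 0 and 1 drive the target, 2..5 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0, 1) for _ in range(6)]
        X.append(row)
        y.append(2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 0.1))
    return X, y


X, y = make_data(200, seed=0)

# Selection on the full dataset.
selected = filter_select(X, y, k=2)

# Stability: compare the feature sets chosen on two disjoint halves.
first_half = filter_select(X[:100], y[:100], k=2)
second_half = filter_select(X[100:], y[100:], k=2)
stability = jaccard(first_half, second_half)

print("selected features:", selected)
print("stability (Jaccard):", stability)
```

The sketch touches two of the four criteria named in the abstract (relevant features selected and stability of the feature set). Predictive performance and runtime would need a downstream learner and timing code, which are left out here.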