Filter Methods for Feature Selection in Supervised Machine Learning Applications -- Review and Benchmark

Time: 2023-03-20 14:50:39

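The volume of data available to machine learning (ML) applications keeps growing. Ongoing digitization increases not only the number of observations but, above all, the number of measured variables (features). Selecting the most appropriate features for predictive modeling is an important lever for the success of ML applications in business and research. Many feature selection methods (FSMs) that are independent of any particular ML algorithm, the so-called filter methods, have been proposed, yet researchers and quantitative modelers have little guidance on which to choose for typical ML problems. This review synthesizes the extensive literature on feature selection benchmarking and evaluates the performance of 58 methods in the widely used R environment. To give concrete guidance, we consider four typical dataset scenarios that are challenging for ML models: noisy data, redundant data, imbalanced data, and data with more features than observations. Drawing on earlier benchmarks, which covered far fewer FSMs, we compare the methods on four criteria: predictive performance, number of relevant features selected, stability of the feature sets, and runtime. We found that methods based on the random forest approach, the double input symmetrical relevance filter (DISR), and the joint impurity filter (JIM) were well-performing candidates in the given dataset scenarios.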

Original title: Filter Methods for Feature Selection in Supervised Machine Learning Applications -- Review and Benchmark

Original abstract: The amount of data for machine learning (ML) applications is constantly growing. Not only the number of observations, especially the number of measured variables (features) increases with ongoing digitization. Selecting the most appropriate features for predictive modeling is an important lever for the success of ML applications in business and research. Feature selection methods (FSM) that are independent of a certain ML algorithm - so-called filter methods - have been numerously suggested, but little guidance for researchers and quantitative modelers exists to choose appropriate approaches for typical ML problems. This review synthesizes the substantial literature on feature selection benchmarking and evaluates the performance of 58 methods in the widely used R environment. For concrete guidance, we consider four typical dataset scenarios that are challenging for ML models (noisy, redundant, imbalanced data and cases with more features than observations). Drawing on the experience of earlier benchmarks, which have considered much fewer FSMs, we compare the performance of the methods according to four criteria (predictive performance, number of relevant features selected, stability of the feature sets and runtime). We found methods relying on the random forest approach, the double input symmetrical relevance filter (DISR) and the joint impurity filter (JIM) were well-performing candidate methods for the given dataset scenarios.
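
To make the idea of a filter method concrete, here is a minimal sketch in plain Python. It is an illustration only and is not taken from the paper: the scoring rule (absolute Pearson correlation with the target), the toy data generator `make_data`, and the helper names `filter_top_k` and `jaccard` are all assumptions made for this example, and the benchmark itself was run in R on other methods. The sketch scores each feature without training any model, keeps the top `k`, and then measures the stability of the selection as the Jaccard overlap between feature sets chosen on two halves of the data.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def filter_top_k(X, y, k):
    """Univariate filter: rank features by |correlation with y|, return the top k indices."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def jaccard(a, b):
    """Stability of two selected feature sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def make_data(n, seed):
    """Hypothetical toy data: features 0 and 1 carry the same signal, 2 to 4 are pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        signal = rng.gauss(0, 1)
        X.append([
            signal + rng.gauss(0, 0.1),  # relevant feature
            signal + rng.gauss(0, 0.1),  # redundant near-copy of feature 0
            rng.gauss(0, 1),             # noise
            rng.gauss(0, 1),             # noise
            rng.gauss(0, 1),             # noise
        ])
        y.append(1 if signal > 0 else 0)  # binary class label
    return X, y


X, y = make_data(200, seed=0)
selected = filter_top_k(X, y, k=2)

# Stability check: run the same filter on two disjoint halves of the data.
half_a = filter_top_k(X[:100], y[:100], k=2)
half_b = filter_top_k(X[100:], y[100:], k=2)
stability = jaccard(half_a, half_b)

print("selected features:", sorted(selected))
print("stability (Jaccard):", stability)
```

On this toy data a univariate score ranks the redundant near-copy as highly as the relevant feature itself, which is one reason the redundancy scenario is hard for simple filters. Multivariate filters such as DISR and JIM score features jointly rather than one at a time.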
