Filter Methods for Feature Selection in Supervised Machine Learning Applications -- Review and Benchmark
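The amount of data available for machine learning (ML) applications keeps growing. Ongoing digitization increases not only the number of observations but, above all, the number of measured variables (features). Selecting the most appropriate features for predictive modeling is an important lever for the success of ML applications in business and research. Many feature selection methods (FSMs) that are independent of any particular ML algorithm, so-called filter methods, have been proposed, yet researchers and quantitative modelers have little guidance on choosing a suitable method for typical ML problems. This review synthesizes the substantial literature on feature selection benchmarking and evaluates the performance of 58 methods in the widely used R environment. To give concrete guidance, we consider four typical dataset scenarios that are challenging for ML models: noisy data, redundant data, imbalanced data, and data with more features than observations. Drawing on the experience of earlier benchmarks, which considered far fewer FSMs, we compare the methods on four criteria: predictive performance, number of relevant features selected, stability of the feature sets, and runtime. We find that methods relying on the random forest approach, the double input symmetrical relevance filter (DISR), and the joint impurity filter (JIM) are well-performing candidates for the given dataset scenarios.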
Original title: Filter Methods for Feature Selection in Supervised Machine Learning Applications -- Review and Benchmark
Original text: The amount of data for machine learning (ML) applications is constantly growing. Not only the number of observations, especially the number of measured variables (features) increases with ongoing digitization. Selecting the most appropriate features for predictive modeling is an important lever for the success of ML applications in business and research. Feature selection methods (FSM) that are independent of a certain ML algorithm - so-called filter methods - have been numerously suggested, but little guidance for researchers and quantitative modelers exists to choose appropriate approaches for typical ML problems. This review synthesizes the substantial literature on feature selection benchmarking and evaluates the performance of 58 methods in the widely used R environment. For concrete guidance, we consider four typical dataset scenarios that are challenging for ML models (noisy, redundant, imbalanced data and cases with more features than observations). Drawing on the experience of earlier benchmarks, which have considered much fewer FSMs, we compare the performance of the methods according to four criteria (predictive performance, number of relevant features selected, stability of the feature sets and runtime). We found methods relying on the random forest approach, the double input symmetrical relevance filter (DISR) and the joint impurity filter (JIM) were well-performing candidate methods for the given dataset scenarios.
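To make the idea of a filter method concrete, here is a minimal sketch in Python. It is not taken from the paper, whose benchmark runs in R. It assumes a simple univariate filter that ranks features by absolute Pearson correlation with the target, and it measures selection stability as the Jaccard overlap between feature sets chosen on two halves of the data. The function names (`pearson`, `correlation_filter`, `jaccard`) and the toy dataset are hypothetical and serve only as illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def correlation_filter(X, y, k):
    """Return the indices of the k features with the largest |correlation| to y.

    The score uses only the data, not a downstream model, which is what
    makes this a filter method.
    """
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def jaccard(a, b):
    """Jaccard similarity of two feature-index collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy data (an assumption for illustration): feature 0 is informative,
# feature 1 is a near copy of it (redundant), features 2-4 are pure noise.
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    label = rng.randint(0, 1)
    informative = label + rng.gauss(0, 0.3)
    redundant = informative + rng.gauss(0, 0.05)
    noise = [rng.gauss(0, 1) for _ in range(3)]
    X.append([informative, redundant] + noise)
    y.append(label)

selected = correlation_filter(X, y, k=2)

# Stability: run the same filter on two disjoint halves and compare the sets.
half = len(X) // 2
stability = jaccard(
    correlation_filter(X[:half], y[:half], k=2),
    correlation_filter(X[half:], y[half:], k=2),
)

print("selected features:", sorted(selected))
print("stability (Jaccard):", stability)
```

On this toy data the filter keeps both the informative feature and its near copy, because a univariate score looks at each feature in isolation. That is one way to see why redundant data is treated as a separate scenario, and why filters that score features jointly may behave differently.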