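Data engineering projects tend to follow the RIRO (rubbish in, rubbish out) principle strictly, which is why data preprocessing is often said to make up 80% of a data engineer's or data scientist's work: it guarantees the quality of the raw material. Feature engineering, in turn, accounts for at least half of data preprocessing. In day-to-day data work, feature selection is a very common task, whether the goal is to interpret the data or to prevent overfitting. Its central question is how to find, among hundreds or thousands of features, the ones that most influence the outcome, and then use them to build a reliable machine learning model. After many rounds of this work, and drawing on books, online resources such as Kaggle, and discussions with other data engineers, I decided to write a concise summary of the common feature selection methods and their Python implementations.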
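Broadly, the methods fall into two families:
Filter methods: do not depend on any particular learning algorithm; simple and fast, and commonly used during preprocessing
Wrapper methods: wrap feature selection inside a learning algorithm; commonly used during the learning stage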
In scikit-learn, feature selection has its own module, sklearn.feature_selection, which contains selection algorithms for both the preprocessing stage and the learning stage.
A. Filter methods
(1) Variance threshold
from sklearn.feature_selection import VarianceThreshold
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
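To make the behaviour concrete, here is a minimal sketch on a hand-made toy matrix. The data and the stricter threshold of 0.2 are my own illustrative choices, not values recommended by scikit-learn.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data (illustrative): column 0 is constant, column 1 is nearly
# constant, column 2 varies a lot.
X = np.array([
    [0, 0, 1],
    [0, 0, 3],
    [0, 0, 5],
    [0, 1, 7],
])

# Default threshold=0.0: only the zero-variance column is removed.
selector_default = VarianceThreshold()
X_default = selector_default.fit_transform(X)

# A stricter, arbitrarily chosen threshold also removes the
# nearly constant column (its variance is 0.1875).
selector_strict = VarianceThreshold(threshold=0.2)
X_strict = selector_strict.fit_transform(X)

print(X_default.shape)                 # columns kept with the default
print(X_strict.shape)                  # columns kept with threshold=0.2
print(selector_strict.get_support())   # boolean mask of kept columns
```

Because the variance depends on the scale of each feature, a non-zero threshold only makes sense when the features are on comparable scales.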
(2) Univariate feature selection
Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator.
from sklearn.feature_selection import SelectKBest, chi2
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
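The one-liner above assumes that X and y already exist. As a self-contained sketch, the same call can be run on the iris dataset that ships with scikit-learn (my choice of dataset, purely for illustration). Note that chi2 requires non-negative feature values, which iris satisfies.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Iris: 150 samples, 4 non-negative numeric features, 3 classes.
X, y = load_iris(return_X_y=True)

# Score every feature against y with the chi-squared test
# and keep the two highest-scoring ones.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(X.shape, X_new.shape)     # (150, 4) (150, 2)
print(selector.scores_)         # one chi-squared score per feature
print(selector.get_support())   # boolean mask of the selected features
```

For regression targets, the score function would have to be swapped for one designed for regression, such as f_regression from the same module.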
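B. Wrapper methods
Wrapper methods typically rely on algorithms that compute a weight for each feature during training, and use those weights to select features. scikit-learn provides a dedicated class, SelectFromModel, to help with this.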
SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ attribute after fitting. The features are considered unimportant and removed, if the corresponding coef_ or feature_importances_ values are below the provided threshold parameter. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are “mean”, “median” and float multiples of these like “0.1*mean”.
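The string thresholds are easiest to understand side by side. The sketch below is only an illustration: the synthetic regression data and the choice of Ridge as the underlying estimator are my assumptions, and any estimator exposing coef_ or feature_importances_ could be substituted.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge

# Synthetic data (illustrative): 20 features, only 5 of them informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

n_selected = {}
for threshold in ("mean", "median", "0.1*mean"):
    # Features whose |coef_| is at or above the threshold are kept.
    selector = SelectFromModel(Ridge(alpha=1.0), threshold=threshold)
    X_sel = selector.fit_transform(X, y)
    n_selected[threshold] = X_sel.shape[1]
    print(threshold, selector.threshold_, X_sel.shape[1])
```

Comparing coefficient magnitudes is only meaningful when the features share a scale. The generated features here are all standard normal, but real data would usually need standardizing first.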


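In linear regression, the coefficients are chosen so that the sum of squared differences between the predictions and the observed targets is smallest; this is the classic Ordinary Least Squares (OLS) problem.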
To counter overfitting, we often add a regularization term to the cost function; using an L1 penalty as that term gives the Lasso method.
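For reference, the two cost functions can be written out as follows. The notation is my own assumption, since no formula is given here: n samples with feature vectors x_i and targets y_i, a coefficient vector w with p entries, and a regularization strength lambda >= 0.

```latex
% Ordinary least squares: minimize the sum of squared residuals
\hat{w}_{\mathrm{OLS}} = \arg\min_{w} \sum_{i=1}^{n} \left( y_i - x_i^{\top} w \right)^2

% Lasso: the same loss plus an L1 penalty on the coefficients
\hat{w}_{\mathrm{Lasso}} = \arg\min_{w} \sum_{i=1}^{n} \left( y_i - x_i^{\top} w \right)^2
    + \lambda \sum_{j=1}^{p} \lvert w_j \rvert
```

scikit-learn's Lasso uses the same form up to scaling: it divides the squared-error term by 2n and calls the regularization strength alpha rather than lambda.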

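In practice, the larger the Lasso parameter lambda, the sparser the solution and the fewer features are selected. So how large should lambda be? A fairly safe approach is to compute the cross-validated RMSE of the model over a range of lambda values and pick the value at which the RMSE reaches its minimum (there is a good example on Kaggle).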
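A minimal sketch of that search is shown below. It is not the Kaggle example itself: the synthetic data, the grid of 20 values, and the 5-fold split are assumptions made for illustration. In scikit-learn the Lasso strength is the alpha parameter.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data (illustrative): 30 features, only 5 of them informative.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

# Candidate regularization strengths on a log scale.
alphas = np.logspace(-3, 2, 20)

cv_rmse = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10000)
    # scikit-learn scorers are "higher is better", so MSE comes back negated.
    neg_mse = cross_val_score(model, X, y,
                              scoring="neg_mean_squared_error", cv=5)
    # RMSE per fold, averaged over the 5 folds.
    cv_rmse.append(np.sqrt(-neg_mse).mean())
cv_rmse = np.array(cv_rmse)

best_alpha = alphas[np.argmin(cv_rmse)]
print("best alpha:", best_alpha)
print("cross-validated RMSE at best alpha:", cv_rmse.min())
```

scikit-learn also ships LassoCV, which runs a comparable search internally; the explicit loop is used here only because it makes the RMSE curve easy to inspect or plot.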
Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero. When the goal is to reduce the dimensionality of the data to use with another classifier, they can be used along with feature_selection.SelectFromModel to select the non-zero coefficients. With Lasso, the higher the alpha parameter, the fewer features selected.
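The sketch below pairs Lasso with SelectFromModel and counts the surviving features for a few alpha values. The data and the alpha values are arbitrary choices of mine for illustration. For L1-penalized estimators such as Lasso, SelectFromModel defaults to a threshold of 1e-5, so in effect it keeps the non-zero coefficients.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data (illustrative): 30 features, only 5 of them informative.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

n_selected = {}
for alpha in (0.1, 1.0, 10.0, 50.0):
    selector = SelectFromModel(Lasso(alpha=alpha, max_iter=10000))
    selector.fit(X, y)
    # get_support() is a boolean mask over the original columns.
    n_selected[alpha] = int(selector.get_support().sum())

print(n_selected)   # number of features kept for each alpha
```

Once a suitable alpha is chosen, selector.transform(X) returns the reduced matrix, which can then be passed to any other estimator.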
Tree-based estimators (see the sklearn.tree module and forest of trees in the sklearn.ensemble module) can be used to compute feature importances, which in turn can be used to discard irrelevant features (when coupled with the sklearn.feature_selection.SelectFromModel meta-transformer).
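As a rough sketch of the tree-based variant, the following uses an ExtraTreesClassifier on synthetic classification data. Both the estimator and the dataset are my illustrative choices; a RandomForestClassifier would work the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (illustrative): 20 features, only 4 of them informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0)

# With no explicit threshold, a non-L1 estimator falls back to "mean":
# features with importance at or above the mean importance are kept.
selector = SelectFromModel(forest)
X_sel = selector.fit_transform(X, y)

# The fitted copy of the forest is available as selector.estimator_.
importances = selector.estimator_.feature_importances_
print(importances.round(3))
print(X.shape, X_sel.shape)
```

The importances are normalized to sum to 1, so the mean threshold here is simply 1 divided by the number of features.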
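This short post has briefly introduced some of the commonly used feature selection methods. It is worth pointing out that, besides feature selection, feature transformation, including dimensionality reduction methods such as PCA, can also reduce the number of features and curb overfitting.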