Scikit Pipelines for ML
What is a Pipeline
Transformers are combined with classifiers, regressors or other estimators to build a composite estimator. The most common tool is a Pipeline. Pipeline can be used to chain multiple estimators into one. This is useful as there is a fixed sequence of steps in processing the data, for example feature selection, normalization and classification.
Pipeline offers these benefits:
- Convenience and encapsulation - You have to call fit and predict once on your data to fit a whole sequence of estimators
- Joint parameter selection - You can grid search over parameters of all estimators in the pipeline at once
- Safety - Pipelines help avoid leaking stats between test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors
All estimators in a pipeline, except the last one, must be transformers (i.e must have a transform method). The last estimator may be any type (transformer, classifier, etc).
Example
The Pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you want to give this step and value is an estimator object.
1
2
3
4
5
6
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)
pipe
Parameters of the estimators can be accessed using the *
1
2
3
4
5
pipe.set_params(clf__C=10)
>>> Pipeline(memory=None,
>>> steps=[('reduce_dim', PCA(copy=True, iterated_power='auto',...)),
>>> ('clf', SVC(C=10, cache_size=200, class_weight=None,...))],
>>> verbose=False)
Model Selection using Grid Search
Pipeline are a great way for finding optimal parameters through grid searches:
1
2
3
4
from sklearn.model_selection import GridSearchCV
param_grid = dict(reduce_dim__n_components=[2, 5, 10],
clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)
Individual steps may be replaced as parameters, and non-final steps may be ignored by setting them to ‘passthrough’:
1
2
3
4
5
from sklearn.linear_model import LogisticRegression
param_grid = dict(reduce_dim=['passthrough', PCA(5), PCA(10)],
clf=[SVC(), LogisticRegression()],
clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)