python – Pipelines in scikit-learn. Algorithm construction

Question:

A question about pipelines in scikit-learn. There are PolynomialFeatures(), PCA() and LogReg(). There is a training set x_train, y_train and a test set x_test, y_test. Let x, y denote union(x_train, x_test) and union(y_train, y_test), respectively. I want to pull off the following trick:

Compute x_poly = PolynomialFeatures(x_train). Apply dimensionality reduction: x_pca = PCA(x_poly). Then x_union = concatenate(x_train, x_pca, axis=1). And classify the result with LogReg().

Questions.

  • The pipeline has a fit(X, [y]) method. I understand that [y] is used only if the corresponding algorithm needs it, i.e. in PCA.fit() y will not be used, while in LogReg().fit() it will.

    • How is this resolved for PolynomialFeatures(), given that this object's fit() method takes 2 arguments: fit(X, y=None)?
    • What does y stand for here in the documentation?
    • In what cases, after all, will y be used by the algorithms present in the pipeline, and in which not?
  • I need to combine x_pca with x_train. How can this be done if numpy cannot be used directly? And if numpy can be used, then how?

  • Is it possible to use conditions in the pipeline? For instance: having reached a certain stage of the algorithm, say before PCA(), I compute the variance of the largest component. If it is > 0.5, I use LogReg(); if less, I use SVM(). Is it possible to implement such functionality?

Answer:

  • To understand what is happening in PCA and PolynomialFeatures, you need to look into the code. y is not used there; y=None is needed so that the signature is common across all pipeline steps, with None as the default value.
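This is easy to check directly (a minimal sketch; the data here is illustrative): PolynomialFeatures accepts y in its fit methods but produces exactly the same result with or without it.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)
y = np.array([0, 1, 0])

# fit_transform(X) and fit_transform(X, y) give identical output: y is ignored,
# it exists only so every pipeline step shares the fit(X, y=None) signature.
Xt_no_y = PolynomialFeatures(degree=2).fit_transform(X)
Xt_with_y = PolynomialFeatures(degree=2).fit_transform(X, y)

assert np.array_equal(Xt_no_y, Xt_with_y)
```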
  • FeatureUnion can be used to combine features. This object composes easily with a Pipeline. In addition, note that you may need the ItemSelector, which is described here. This answer is also worth considering.
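A minimal sketch of that idea (the step names and hyperparameters here are illustrative assumptions, not from the question): a FeatureUnion with an identity branch alongside a PolynomialFeatures → PCA branch, so that the original features and the reduced polynomial features are concatenated before LogisticRegression.

```python
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import PolynomialFeatures, FunctionTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Branch 1: pass the original features through unchanged
# (FunctionTransformer with no func is the identity).
# Branch 2: polynomial expansion followed by PCA.
features = FeatureUnion([
    ("identity", FunctionTransformer()),
    ("poly_pca", Pipeline([
        ("poly", PolynomialFeatures(degree=2)),
        ("pca", PCA(n_components=2)),
    ])),
])

clf = Pipeline([
    ("features", features),  # concatenates the two branches along axis=1
    ("logreg", LogisticRegression()),
])

rng = np.random.RandomState(0)
x_train = rng.randn(20, 3)
y_train = rng.randint(0, 2, 20)
clf.fit(x_train, y_train)

# 3 original features + 2 PCA components = 5 combined features
combined = clf.named_steps["features"].transform(x_train)
```

This sidesteps calling numpy.concatenate by hand: FeatureUnion does the column-wise concatenation itself, and the whole thing remains a single estimator that can be cross-validated or grid-searched.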
  • The last question could not be answered unequivocally:

    Not sure if this is a good idea. But what if you get different variance values for the train and test sets and, accordingly, different algorithms?
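If one still wants to experiment with this, a Pipeline itself has no conditional steps, but a custom meta-estimator can make the choice at fit time. A hypothetical sketch (the class name, threshold, and hyperparameters are all assumptions for illustration):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

class VarianceSwitchClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical meta-estimator: fit a PCA, then pick the final
    classifier based on the explained variance ratio of the first
    principal component."""

    def __init__(self, threshold=0.5, n_components=2):
        self.threshold = threshold
        self.n_components = n_components

    def fit(self, X, y):
        self.pca_ = PCA(n_components=self.n_components).fit(X)
        # Choose the downstream classifier once, from the training data.
        if self.pca_.explained_variance_ratio_[0] > self.threshold:
            self.clf_ = LogisticRegression()
        else:
            self.clf_ = SVC()
        self.clf_.fit(self.pca_.transform(X), y)
        return self

    def predict(self, X):
        return self.clf_.predict(self.pca_.transform(X))

rng = np.random.RandomState(0)
x_train = rng.randn(30, 4)
y_train = rng.randint(0, 2, 30)
model = VarianceSwitchClassifier().fit(x_train, y_train)
```

Because the branch is decided once during fit(), train and test data are guaranteed to go through the same classifier, which avoids the train/test inconsistency mentioned in the quoted caveat.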
