Question:
A question about pipelines in scikit-learn. There are `PolynomialFeatures()`, `PCA()` and `LogReg()`. There is a training set `x_train, y_train` and a test set `x_test, y_test`. Let `x, y` denote `union(x_train, x_test)` and `union(y_train, y_test)`, respectively. I want to pull off the following trick:

- Compute `x_poly = PolynomialFeatures(x_train)`.
- Apply `x_pca = PCA(x_poly)` for dimensionality reduction.
- Build `x_union = concatenate((x_train, x_pca), axis=1)`.
- Classify the result with `LogReg()`.
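The steps above can be sketched as follows, as a minimal self-contained example on synthetic data; `degree=2`, `n_components=2` and the data shapes are placeholder assumptions, not values from the question:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x_train = rng.normal(size=(100, 3))
y_train = rng.integers(0, 2, size=100)

# Step 1: polynomial expansion of the training features
x_poly = PolynomialFeatures(degree=2).fit_transform(x_train)

# Step 2: dimensionality reduction on the expanded features
x_pca = PCA(n_components=2).fit_transform(x_poly)

# Step 3: concatenate the original features with the PCA output
x_union = np.concatenate((x_train, x_pca), axis=1)

# Step 4: classify the combined features
clf = LogisticRegression().fit(x_union, y_train)
print(x_union.shape)  # (100, 5)
```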
Questions:
- The pipeline has a `fit(X, [y])` method. I understand that `[y]` is used only if the corresponding algorithm needs it, i.e. in `PCA.fit()` `y` will not be used, while in `LogReg().fit()` it will be. How is this resolved for `PolynomialFeatures()`, whose `fit()` method takes two arguments: `fit(X, y=None)`? What does `y` stand for here in the documentation? In which cases, after all, will `y` be used by the algorithms present in the pipeline, and in which not?
- I need to combine `x_pca` with `x_train`. How can this be done if `numpy` cannot be used directly? And if `numpy` can be used, then how?
- Is it possible to use conditions in the pipeline? For instance: having reached a certain stage of the algorithm, say right before `PCA()`, I compute the variance of the largest component; if it is `> 0.5` I use `LogReg()`, otherwise I use `SVM()`. Can such functionality be implemented?
Answer:
- To understand what happens in `PCA` and `PolynomialFeatures`, you need to look into the source code: `y` is not used there. `y=None` is needed so that the signature is the same for all pipeline steps, with `None` as the default value.
- `FeatureUnion` can be used to combine features; it composes easily with a pipeline. Also take into account that you may need an `ItemSelector`, which is described here. It is also worth considering this answer.
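A hedged sketch of how `FeatureUnion` could express the desired combination: one branch passes the original features through unchanged (`FunctionTransformer()` with no arguments is the identity transform, so no custom `ItemSelector` is needed for this simple case), while the other applies `PolynomialFeatures` followed by `PCA`; `FeatureUnion` concatenates the two outputs along `axis=1`. The `degree` and `n_components` values are placeholders:

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import PolynomialFeatures, FunctionTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# One branch keeps the original features, the other produces the
# PCA-reduced polynomial features; FeatureUnion stacks them column-wise.
union = FeatureUnion([
    ("original", FunctionTransformer()),  # identity: passes X through as-is
    ("poly_pca", Pipeline([
        ("poly", PolynomialFeatures(degree=2)),
        ("pca", PCA(n_components=2)),
    ])),
])

model = Pipeline([
    ("features", union),
    ("logreg", LogisticRegression()),
])
```

After `model.fit(x_train, y_train)`, calling `model.predict(x_test)` applies the same feature construction to the test data, which is the main advantage over concatenating arrays by hand with `numpy`.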
- The last question cannot be answered unequivocally: I am not sure this is a good idea. What if the variance differs between your train and test sets and you would, accordingly, end up with different algorithms for each?
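If the choice is made once at `fit` time, the train/test mismatch mentioned above is avoided, since the same fitted estimator is then reused for prediction. A minimal sketch of such a conditional meta-estimator; the class name and the `threshold` default are illustrative assumptions, not part of scikit-learn:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

class ConditionalClassifier(BaseEstimator, ClassifierMixin):
    """Choose the final estimator based on the explained-variance ratio
    of the first principal component (0.5 is the threshold from the question)."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, X, y):
        # Decide once, on the training data, which classifier to use.
        pca = PCA().fit(X)
        if pca.explained_variance_ratio_[0] > self.threshold:
            self.estimator_ = LogisticRegression()
        else:
            self.estimator_ = SVC()
        self.estimator_.fit(X, y)
        return self

    def predict(self, X):
        # Reuse the estimator chosen during fit, also for test data.
        return self.estimator_.predict(X)
```

Because it follows the estimator API (`fit`/`predict`), it can be placed as the final step of a `Pipeline` like any other classifier.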