Question:
I'm working on a competition problem in Python whose goal is to predict a numeric value.
From the given training data I extracted only the features that could be used, and I extracted the same features from the test data in the same way.
https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard/notebook
Following the notebook above, I generated predictions on the extracted data with LASSO Regression, Elastic Net Regression, Kernel Ridge Regression, Gradient Boosting Regression, XGBoost, and LightGBM, and added each model's predicted values as features.
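For concreteness, here is a minimal sketch of how such predictions could be appended as features using out-of-fold predictions; the model variables (lasso, enet, krr, gbr, xgb_model, lgb_model) and the test DataFrame are hypothetical placeholders, not the exact code from the notebook:

from sklearn.model_selection import cross_val_predict

# Hypothetical base estimators; the variable names are placeholders
base_models = {'lasso': lasso, 'enet': enet, 'krr': krr,
               'gbr': gbr, 'xgb': xgb_model, 'lgb': lgb_model}

# Out-of-fold predictions on the training data: each row is predicted by a
# model that never saw it, so the new columns do not leak the target
for name, model in base_models.items():
    train[name + '_pred'] = cross_val_predict(model, train[cat_vars + cont_vars],
                                              train['Score'], cv=5)

# For the test data, fit each base model on the full training set and predict
for name, model in base_models.items():
    model.fit(train[cat_vars + cont_vars], train['Score'])
    test[name + '_pred'] = model.predict(test[cat_vars + cont_vars])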
Using these features, I split the data 70:30 into training and evaluation sets and trained a model. The R2 score was 0.85, the training loss was 0.1378, and the validation loss was about 0.1248.
When I then predicted the test data with this model, the R2 score was only 0.55.
When I run stats.shapiro() on the features of the training data and of the test data, the results are 0 or very close to 0 in both cases, so I think they are normally distributed.
The same was true for the target values in the training data.
In addition, there was almost no difference in the maximum and minimum values between the two datasets.
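For reference, a minimal sketch of that normality check (the column name is a placeholder); note that stats.shapiro() returns a pair (W, p): W is close to 1 for approximately normal data, and a p-value near 0 rejects the normality hypothesis:

from scipy import stats

# Shapiro-Wilk test on one feature column (column name is a placeholder)
w_stat, p_value = stats.shapiro(train['some_feature'])
print(w_stat, p_value)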
I would like to know the reason why the evaluation results differ between the training (evaluation) data and the test data.
Also, I would like to know how to improve generalization performance other than cross-validation.
I'm not sure whether the following is correct, but this is the cross-validation code.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, KFold, cross_val_score

# Features include the base-model predictions added as extra columns
X = train[cat_vars + cont_vars + ['xgb', 'lgb', 'stacked', 'ensemble']]
y = train[['Score']]

# 70/30 hold-out split used for the training/evaluation comparison above
X_train, X_test, Y_train, Y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=0)

# Fit on the full training set, then run 5-fold cross-validation
# (cross_val_score refits a fresh clone of lr on each fold)
lr = LinearRegression()
lr.fit(X, y)
kf = KFold(n_splits=5, shuffle=True, random_state=1)
print(cross_val_score(lr, X, y, cv=kf, scoring='r2'))

# Result:
# [0.888343 0.885379 0.891729 0.881329 0.899762]
Answer:
I would like to know the reason why the evaluation results differ between the training (evaluation) data and the test data.
The explanation is simple: a machine-learning model is fitted to the training data, so it adapts to it. The training data is therefore generally evaluated better (lower loss), because the model has effectively already seen the answers.
The test data and the cross-validation data, on the other hand, are not used for fitting; they are used only for evaluation. If the model generalizes well it will also score well on them, but if it has overfitted the evaluation will be poor.
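To make that gap concrete, here is a minimal sketch (reusing the X_train/X_test split from the question's code) that compares R2 on the data the model was fitted on with R2 on the held-out data:

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, Y_train)

# R2 on the data used for fitting vs. R2 on data the model has never seen;
# a large gap between the two is the usual symptom of overfitting
print('train R2:   ', lr.score(X_train, Y_train))
print('held-out R2:', lr.score(X_test, Y_test))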
Reference: Wikipedia: Overfitting
Also, I would like to know how to improve generalization performance other than cross-validation.
As pointed out in the comments, cross-validation is not a method for improving generalization performance; it is one of the ways of measuring it. So instead I will list a few things that, in my experience, helped improve generalization. Basically, I think the only real option is to plot the loss on the training data and on the cross-validation (test) data separately and address whatever each plot shows, as sketched below.
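One convenient way to produce that plot is sklearn's learning_curve, which scores the model on the training folds and on the validation folds for increasing amounts of training data; a minimal sketch, reusing the X and y from the question's code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Training and cross-validation R2 for increasing training-set sizes (5-fold CV)
sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5, scoring='r2',
    train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(sizes, train_scores.mean(axis=1), label='training R2')
plt.plot(sizes, val_scores.mean(axis=1), label='cross-validation R2')
plt.xlabel('number of training examples')
plt.ylabel('R2')
plt.legend()
plt.show()

If the two curves stay far apart, the model is overfitting; if both plateau at a low score, it is underfitting.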
- Regarding neural networks
Summary: the most common ways to prevent overfitting in a neural network are:
- add dropout
- regularize the weights
- increase the amount of training data
- reduce the network's capacity
Also, personally, halving the learning rate as training progressed was effective (see the sketch below).
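As an illustration of those points (dropout, L2 weight regularization, a small network, and halving the learning rate on a schedule), here is a minimal Keras sketch; the layer sizes, dropout rates, and schedule are arbitrary assumptions, and X_train/Y_train/X_test/Y_test are assumed to be the numeric arrays from the question's split:

from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Small network: dropout and L2 weight regularization limit its effective capacity
model = keras.Sequential([
    layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(32, activation='relu', kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(1),
])

def halve_lr(epoch, lr):
    # halve the learning rate every 10 epochs
    return lr * 0.5 if epoch > 0 and epoch % 10 == 0 else lr

model.compile(optimizer='adam', loss='mse')
model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=50,
          callbacks=[keras.callbacks.LearningRateScheduler(halve_lr)])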