Santander Value Prediction Challenge - 2

Contact me

Email -> cugtyt@qq.com
GitHub -> Cugtyt@GitHub

I tried a neural network:

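import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split

# train_df already has the all-zero columns and the 'target' and 'ID' columns removed
x_train, x_val, y_train, y_val = train_test_split(np.log1p(train_df), np.log1p(target), test_size=0.2)
# x_train, x_val, y_train, y_val = train_test_split(train_reduced, np.log1p(target), test_size=0.2)
# No activations are set, so the three Dense layers compose into a single linear map
model = Sequential([
    Dense(100, input_dim=4735),
#     Dense(100, input_dim=1008),
    Dense(20),
    Dense(1)
])
model.compile(optimizer='rmsprop', loss='mse')
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=40)

# RMSLE over the full training set (fitted rows included); expm1 inverts log1p
pred = np.expm1(model.predict(np.log1p(train_df))).ravel()
np.sqrt(mean_squared_log_error(target, np.clip(pred, 0, None)))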

The resulting RMSLE was a little over 1.5.
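Because the model is trained on `np.log1p(target)`, the RMSE in that space is the same quantity as the RMSLE on the original scale, and predictions are mapped back with `np.expm1` rather than `np.exp`. A small sketch with made-up numbers (`y_true` and `y_pred_log` are illustrative values, not competition data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Hypothetical targets and model outputs (the outputs live in log1p space)
y_true = np.array([200000.0, 1500000.0, 40000.0, 9000000.0])
y_pred_log = np.array([12.0, 14.5, 11.0, 15.5])

# RMSE measured directly in log1p space
rmse_log = np.sqrt(mean_squared_error(np.log1p(y_true), y_pred_log))

# Map predictions back to the original scale, then compute RMSLE
y_pred = np.expm1(y_pred_log)
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))

# np.exp overshoots the true inverse of log1p by exactly 1
offset = np.exp(y_pred_log) - y_pred

print(rmse_log, rmsle)
```

For targets this large the offset of 1 barely moves the score, which is why either inverse gives a figure near 1.5 here.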

Next, a random forest:

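from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score
rnd_reg = RandomForestRegressor(n_estimators=100, n_jobs=-1)
# Each fold's score is the MSE on the log1p target; its square root is that fold's RMSLE
scores_rnd = cross_val_score(rnd_reg, train_df, np.log1p(target), scoring=make_scorer(mean_squared_error), cv=10)
rmsle_rnd = np.sqrt(scores_rnd)
rmsle_rnd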

> array([1.40631864, 1.39308046, 1.49271305, 1.45397351, 1.34825134,
       1.38194798, 1.39683042, 1.45427724, 1.55157639, 1.41461448])
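
To set the ten fold scores against the neural network's single figure, they can be summarized by their mean and spread. A small sketch that reuses the printed values:

```python
import numpy as np

# Per-fold RMSLE values copied from the output above
rmsle_rnd = np.array([1.40631864, 1.39308046, 1.49271305, 1.45397351, 1.34825134,
                      1.38194798, 1.39683042, 1.45427724, 1.55157639, 1.41461448])

mean_rmsle = rmsle_rnd.mean()
std_rmsle = rmsle_rnd.std()
print(f"RMSLE: {mean_rmsle:.3f} +/- {std_rmsle:.3f}")
```

The mean comes out at roughly 1.43. The neural network figure was measured on rows the model had already seen, so the two numbers are not strictly comparable.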

Adding feature scaling did not improve the score, and applying log1p to the data did not help either.