1. Overview of Random Forest Parameter Tuning
Parameter notes:
- min_samples_split (minimum samples required to split): if the current node has fewer samples than this threshold, it is not split further.
- min_samples_leaf (minimum samples per leaf): the minimum number of samples a leaf node must contain.
- The larger the max-type parameters (e.g. max_depth), the more the model tends to overfit; the larger the min-type parameters, the more it tends to underfit.
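To make that trade-off concrete, here is a minimal sketch (the parameter values are hypothetical, chosen only for contrast) comparing an unconstrained forest with a heavily constrained one on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Unconstrained: deep trees, leaves as small as 1 sample -> prone to overfitting
loose = RandomForestClassifier(max_depth=None, min_samples_leaf=1,
                               random_state=0).fit(X_train, y_train)

# Heavily constrained: depth-1 stumps with large leaves -> prone to underfitting
tight = RandomForestClassifier(max_depth=1, min_samples_leaf=30,
                               random_state=0).fit(X_train, y_train)

print('loose train accuracy:', loose.score(X_train, y_train))
print('tight train accuracy:', tight.score(X_train, y_train))
```

The unconstrained forest typically fits the training set almost perfectly, while the constrained one cannot; a good setting sits between the two extremes, which is exactly what the searches below look for.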
2. Setup
```python
from sklearn.datasets import load_iris

data = load_iris()
X = data['data']
y = data['target']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# Three commonly used libraries for searching for optimal hyperparameters
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from skopt import BayesSearchCV

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np
import time
import warnings
warnings.filterwarnings("ignore")
```
3. Grid Search
```python
# Tuning logistic regression
start = time.time()
lr = LogisticRegression(solver='liblinear')  # liblinear supports both 'l1' and 'l2'
params = {'C': [0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}

gs_lr = GridSearchCV(estimator=lr, param_grid=params, cv=5,
                     scoring=make_scorer(accuracy_score))
gs_lr.fit(X_train, y_train)
end = time.time()

print(gs_lr.best_params_)
print(gs_lr.best_score_)
print('Grid search took: %.3f s' % (end - start))

gs_lr.score(X_test, y_test)   # accuracy of the refitted best model on the test set
gs_lr.predict(X_test)         # predictions from the best model
gs_lr.cv_results_             # full record of every parameter combination tried

# Tuning a random forest
start = time.time()
rfc = RandomForestClassifier(random_state=0)
params = {'n_estimators': list(range(10, 200, 40)),
          'max_depth': list(range(10, 200, 40))}

gs_rfc = GridSearchCV(estimator=rfc, param_grid=params, cv=5,
                      scoring=make_scorer(accuracy_score))
gs_rfc.fit(X_train, y_train)
end = time.time()

print(gs_rfc.best_params_)
print(gs_rfc.best_score_)
print('Grid search took: %.3f s' % (end - start))

gs_rfc.score(X_test, y_test)
gs_rfc.predict(X_test)
gs_rfc.cv_results_
```
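As an aside, `cv_results_` is a dict of parallel arrays and is easiest to inspect as a pandas DataFrame. A minimal sketch with a small, hypothetical grid:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
grid = {'n_estimators': [10, 50], 'max_depth': [2, 10]}
gs = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5).fit(X, y)

# One row per parameter combination, ranked by mean cross-validation score
table = pd.DataFrame(gs.cv_results_)
print(table[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score'))
```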
4. Randomized Search
- Unlike exhaustive grid search, it tests only a random sample of the parameter combinations, so it runs faster.
- Suited to problems with very many parameters, where exhaustive grid search is infeasible; useful for narrowing down the search range first.
```python
# Randomized search, using the random forest as an example
start = time.time()
rfc = RandomForestClassifier(random_state=0)
params = {'n_estimators': list(range(10, 200, 40)),
          'max_depth': list(range(10, 200, 40))}

gs_rfc = RandomizedSearchCV(estimator=rfc, param_distributions=params, cv=5,
                            scoring=make_scorer(accuracy_score))
gs_rfc.fit(X_train, y_train)
end = time.time()

print(gs_rfc.best_params_)
print(gs_rfc.best_score_)
print('Randomized search took: %.3f s' % (end - start))

gs_rfc.score(X_test, y_test)
gs_rfc.predict(X_test)
gs_rfc.cv_results_
```
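One point worth noting: `param_distributions` also accepts scipy distributions rather than fixed lists, so each of the `n_iter` draws (10 by default) samples a fresh value from the whole range instead of being confined to a grid. A minimal sketch with hypothetical ranges:

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# randint(a, b) samples integers in [a, b); no fixed grid of candidates
dists = {'n_estimators': randint(10, 200), 'max_depth': randint(2, 50)}

rs = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                        param_distributions=dists, n_iter=5, cv=5, random_state=0)
rs.fit(X, y)
print(rs.best_params_, rs.best_score_)
```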
5. Bayesian Search
- Searches iteratively, using the results of previous evaluations to pick the next parameters to try (loosely analogous to gradient descent).
- Somewhat slower than randomized search.
```python
# Bayesian search, using the random forest as an example
start = time.time()
rfc = RandomForestClassifier(random_state=0)
params = {'n_estimators': list(range(10, 200, 40)),
          'max_depth': list(range(10, 200, 40))}

gs_rfc = BayesSearchCV(rfc, params, cv=5,
                       scoring=make_scorer(accuracy_score), n_iter=40)
gs_rfc.fit(X_train, y_train)
end = time.time()

print(gs_rfc.best_params_)
print(gs_rfc.best_score_)
print('Bayesian search took: %.3f s' % (end - start))

gs_rfc.score(X_test, y_test)
gs_rfc.predict(X_test)
gs_rfc.cv_results_
```
6. Nested Cross-Validation
```python
# Nested CV: gs_rfc is itself a CV search, so the outer loop below
# evaluates the whole tuning procedure, not one fixed parameter set
scores = cross_val_score(gs_rfc, X_train, y_train, scoring='accuracy', cv=10)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
```
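To spell out why the outer loop matters: `best_score_` is the maximum over many candidates and can therefore be optimistic, whereas the nested estimate re-runs the entire search inside every outer fold. A minimal self-contained sketch (hypothetical small grid):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_iris(return_X_y=True)
gs = GridSearchCV(RandomForestClassifier(random_state=0),
                  {'max_depth': [2, 5, 10]}, cv=5)

# Inner search alone: best_score_ is selected, hence potentially optimistic
gs.fit(X, y)

# Nested CV: the outer folds score the tuning procedure as a whole
nested = cross_val_score(gs, X, y, cv=5)
print('inner best_score_ : %.3f' % gs.best_score_)
print('nested CV accuracy: %.3f +/- %.3f' % (nested.mean(), nested.std()))
```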