Evaluating multiple models with RandomizedSearchCV
Line 4 splits the data into training and test sets at a ratio of 75:25.
Lines 7-65 define the models to be evaluated.
When a model parameter takes numeric values, specify a consecutive range of values rather than scattered discrete ones; here Python's list(range()) is used to generate that range (see the sketch after this explanation).
When a parameter takes string values, list all of its candidate values.
Lines 70-88 train and evaluate each model with RandomizedSearchCV() and fit().
Because RandomizedSearchCV() is called with n_iter=100 and cv=5, each model is trained and evaluated 100 × 5 = 500 times in total.
Lines 90-102 display the evaluation results for each model.
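The "★" comment in the code notes that RandomizedSearchCV also accepts scipy.stats distributions for numeric parameters instead of a list of values. The code below sticks with list(range()), but as a minimal sketch (not part of the original code), a distribution-based specification would look like this:
# Alternative sketch: sample n_estimators from a scipy.stats distribution
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
param_dist = {
    'n_estimators': randint(1, 101),   # draws integers in [1, 100]
    'criterion': ['gini', 'entropy']   # categorical values are still given as a list
}
alt_search = RandomizedSearchCV(RandomForestClassifier(), param_dist,
                                n_iter=100, cv=5, n_jobs=-1, random_state=1)
# alt_search.fit(X, y)  # same usage as the loop in the code below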
### Modeling
# Train-test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# Important: Specify a continuous distribution (rather than a list of values) for any continuous parameters ★
model_params = {
'clf0': {
'model': KNeighborsClassifier(),
'params': {
'n_neighbors': list(range(1, 101)),
'weights': ['uniform', 'distance'],
'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
}
},
'clf1': {
'model': DecisionTreeClassifier(),
'params': {
'criterion': ['gini','entropy'],
'splitter': ['best','random']
}
},
'clf2': {
'model': RandomForestClassifier(),
'params': {
'n_estimators': list(range(1, 101)),
'criterion': ['gini','entropy']
}
},
'clf3': {
'model': GaussianNB(),
'params': {}
},
'clf4': {
'model': SVC(),
'params': {
'kernel': ['linear', 'poly', 'rbf'],
'gamma': ['scale','auto'],
'decision_function_shape': ['ovr','ovo']
}
},
'clf5': {
'model': ExtraTreeClassifier(),
'params': {
'criterion': ['gini','entropy'],
'max_features': ['auto','sqrt','log2'],
'class_weight': ['balanced', 'balanced_subsample']
}
},
'clf6': {
'model': GradientBoostingClassifier(),
'params': {
'n_estimators': list(range(1, 101)),
'loss' : ['deviance','exponential'],
'max_features':['auto','sqrt','log2']
}
},
'clf7': {
'model': AdaBoostClassifier(),
'params': {
'n_estimators': list(range(1, 101)),
'algorithm': ['SAMME.R','SAMME']
}
}
}
df = pd.DataFrame()
# Randomized Search Cross-Validation
for _, mp in model_params.items():
# This will train 100 models over 5 folds of cross validation (500 models total)
random_search = RandomizedSearchCV(mp['model'], mp['params'],
n_iter=100, cv=5, n_jobs=-1, refit=True, random_state=1,
return_train_score=True)
model = random_search.fit(X, y)
classifier_name = mp['model'].__class__.__name__
best_estimator = model.best_estimator_
best_params = model.best_estimator_.get_params()
results = random_search.cv_results_
results['classifier_name'] = classifier_name
results['best_estimator'] = best_estimator
results['best_params'] = best_params
df = df.append(results, ignore_index=True)
for ix, row in df.iterrows():
x60 = '-'*60
cls_name = row['classifier_name']
print(f'{x60} {ix}: {cls_name}')
mean_test_score = row['mean_test_score']
print('Mean_test_score:', np.average(mean_test_score))
best_estimator = row['best_estimator']
print('Best_estimator:', best_estimator)
best_params = row['best_params']
print('Best_params:', best_params)
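To rank the classifiers programmatically instead of reading the printout, a short sketch (not part of the original code, reusing the df built above) could average each row's mean_test_score and sort:
# Sketch: rank the classifiers by their averaged mean_test_score
scores = {row['classifier_name']: np.average(row['mean_test_score'])
          for _, row in df.iterrows()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {score:.3f}')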
Figure 3 shows the execution results.
The evaluation results are displayed in the VS Code interactive window.
SVC achieves the highest score at 0.821 (82%).
SVC's best parameters are reported as "Best_params: {'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'poly', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}".
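As a hedged follow-up (not part of the original code), one way to check how the winning SVC generalizes is to refit it with the reported best parameters on the training split from line 4 and score it on the held-out test split:
# Sketch: refit the best SVC on the training split and evaluate it on the test split
# (assumes X_train, X_test, y_train, y_test from train_test_split above)
best_svc = SVC(kernel='poly', gamma='scale', decision_function_shape='ovr')
best_svc.fit(X_train, y_train)
print('Test accuracy:', best_svc.score(X_test, y_test))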
Full listing of the code covered here
Finally, all of the code explained in this article is listed below for reference.
### Import the libraries
from functools import reduce
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Importing Classifier Modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Cross Validation(k-fold)
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# Pipeline and GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import precision_score, recall_score, accuracy_score, make_scorer, f1_score, roc_auc_score
import sklearn.metrics as skm
import warnings
warnings.simplefilter('ignore')
# %%
### Load the data
train_file = 'data/csv/titanic/train_cleaned.csv'
#train_file = 'https://money-or-ikigai.com/menu/python/article/data/titanic/train_cleaned.csv'
train = pd.read_csv(train_file)
temp = train.copy()
X = temp.drop('Survived', axis=1) # Exclude Survived column
y = temp['Survived'] # Survived column only
# %%
### Modeling
# Classification - randomized_search()
#def random_search(X_train, X_test, y_train, y_test):
#clf0 = KNeighborsClassifier() # n_neighbors=[3,5,10,15], weights={'uniform', 'distance'}, algorithm={'auto', 'ball_tree', 'kd_tree', 'brute'}
# KNeighborsClassifier(n_neighbors=5, *, weights={‘uniform’, ‘distance’} , algorithm={‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’},
# leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)
#clf1 = DecisionTreeClassifier() # criterion={'gini','entropy'}, splitter={'best','random'}
# DecisionTreeClassifier(*, criterion={'gini','entropy'}, splitter={'best','random'}, max_depth=None,
# min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
# max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0,
# class_weight=None, ccp_alpha=0.0)
#clf2 = RandomForestClassifier() # n_estimators=1-100, criterion={'gini','entropy'}
# RandomForestClassifier(n_estimators=100, *, criterion={'gini','entropy'},
# max_depth=None, min_samples_split=2, min_samples_leaf=1,
# min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None,
# min_impurity_decrease=0.0, bootstrap=True, oob_score=False,
# n_jobs=None, random_state=None, verbose=0, warm_start=False,
# class_weight=None, ccp_alpha=0.0, max_samples=None)
#clf3 = GaussianNB() # None
# GaussianNB(*, priors=None, var_smoothing=1e-09)
#clf4 = SVC() # kernel={'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}, gamma={'scale','auto'}, decision_function_shape={'ovr','ovo'}
# SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0,
# shrinking=True, probability=False, tol=0.001, cache_size=200,
# class_weight=None, verbose=False, max_iter=- 1,
# decision_function_shape='ovr', break_ties=False, random_state=None)
#clf5 = ExtraTreesClassifier() # n_estimators=10-100, criterion={'gini','entropy'}, max_features={'auto','sqrt','log2'}, class_weight={'balanced', 'balanced_subsample'}
# ExtraTreesClassifier(n_estimators=100, *, criterion='gini', max_depth=None,
# min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
# max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0,
# bootstrap=False, oob_score=False, n_jobs=None, random_state=None, verbose=0,
# warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
#clf6 = GradientBoostingClassifier() # loss={'deviance','exponential'}, n_estimators=10-100
# GradientBoostingClassifier(*, loss='deviance', learning_rate=0.1, n_estimators=100,
# subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1,
# min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0,
# init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None,
# warm_start=False, validation_fraction=0.1, n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)
#clf7 = AdaBoostClassifier() # n_estimators=10-100, algorithm={'SAMME.R','SAMME'}
# AdaBoostClassifier(base_estimator=None, *, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)
# Train-test Split
# sklearn.model_selection.train_test_split(# *arrays, test_size=None, train_size=None,
# random_state=None, shuffle=True, stratify=None)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# Important: Specify a continuous distribution (rather than a list of values) for any continuous parameters ★
model_params = {
'clf0': {
'model': KNeighborsClassifier(),
'params': {
'n_neighbors': list(range(1, 101)),
'weights': ['uniform', 'distance'],
'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
}
},
'clf1': {
'model': DecisionTreeClassifier(),
'params': {
'criterion': ['gini','entropy'],
'splitter': ['best','random']
}
},
'clf2': {
'model': RandomForestClassifier(),
'params': {
'n_estimators': list(range(1, 101)),
'criterion': ['gini','entropy']
}
},
'clf3': {
'model': GaussianNB(),
'params': {}
},
'clf4': {
'model': SVC(),
'params': {
'kernel': ['linear', 'poly', 'rbf'],
'gamma': ['scale','auto'],
'decision_function_shape': ['ovr','ovo']
}
},
'clf5': {
'model': ExtraTreeClassifier(),
'params': {
'criterion': ['gini','entropy'],
'max_features': ['auto','sqrt','log2'],
'class_weight': ['balanced', 'balanced_subsample']
}
},
'clf6': {
'model': GradientBoostingClassifier(),
'params': {
'n_estimators': list(range(1, 101)),
'loss' : ['deviance','exponential'],
'max_features':['auto','sqrt','log2']
}
},
'clf7': {
'model': AdaBoostClassifier(),
'params': {
'n_estimators': list(range(1, 101)),
'algorithm': ['SAMME.R','SAMME']
}
}
}
#ExtraTreeClassifier().get_params().keys()
df = pd.DataFrame()
# Randomized Search Cross-Validation
for _, mp in model_params.items():
# This will train 100 models over 5 folds of cross validation (500 models total)
random_search = RandomizedSearchCV(mp['model'], mp['params'],
n_iter=100, cv=5, n_jobs=-1, refit=True, random_state=1,
return_train_score=True)
# RandomizedSearchCV(estimator, param_distributions, *, n_iter=10, scoring=None,
# n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs',
# random_state=None, error_score=nan, return_train_score=False)
model = random_search.fit(X, y) # X_train, y_train
classifier_name = mp['model'].__class__.__name__
#print('+'*60, classifier_name)
best_estimator = model.best_estimator_
#print(f'best_estimator = {best_estimator}')
best_params = model.best_estimator_.get_params()
#print(f'best_params = {best_params}')
results = random_search.cv_results_
#print('type(results) =', type(results)) # dict type
results['classifier_name'] = classifier_name
results['best_estimator'] = best_estimator
results['best_params'] = best_params
df = df.append(results, ignore_index=True)
for ix, row in df.iterrows():
#print(ix, row)
x60 = '-'*60
cls_name = row['classifier_name']
print(f'{x60} {ix}: {cls_name}')
#rank_test_score = row['rank_test_score']
#print('Rank_test_score:', rank_test_score)
mean_test_score = row['mean_test_score']
print('Mean_test_score:', np.average(mean_test_score))
#print('Mean_test_score:', mean_test_score)
best_estimator = row['best_estimator']
print('Best_estimator:', best_estimator)
best_params = row['best_params']
print('Best_params:', best_params)
#print(f'{x60} end of row({ix})')
# df.info()