Python: Learning the Basics of Machine Learning with the Titanic [Machine Learning]

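This article is part of an eight-article series on machine learning with the Titanic survival data. If you are interested in machine learning, reading the articles of the series in order is recommended.

This article uses the Python library sklearn (scikit-learn) to predict whether each Titanic passenger lived or died, and walks through the whole machine learning (ML) workflow: data analysis ▶ data wrangling ▶ training ▶ evaluation.

Data analysis is covered in detail in "Article057" and "Article058", and data wrangling in "Article059". This article converts categorical Titanic columns such as "Sex, Embarked, Pclass" into numeric indicator columns with the Pandas get_dummies() function, but production code uses scikit-learn's make_column_transformer() instead. When make_column_transformer() cannot perform a conversion, use apply(), map(), and similar methods rather than get_dummies().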
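
As a rough illustration of the make_column_transformer() approach mentioned above, the following sketch one-hot encodes three categorical columns and passes the rest through. The toy rows and the names toy, ct, and encoded are made up for this example and are not taken from train.csv; drop='first' is chosen here only to mirror the drop_first=True behaviour of get_dummies().

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows (made up for illustration, not taken from train.csv)
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'Q', 'S'],
    'Pclass': [3, 1, 2, 1],
    'Age': [22.0, 38.0, 26.0, 35.0],
})

# One-hot encode the categorical columns, dropping the first category of each,
# and pass the remaining column (Age) through unchanged.
ct = make_column_transformer(
    (OneHotEncoder(drop='first'), ['Sex', 'Embarked', 'Pclass']),
    remainder='passthrough',
    sparse_threshold=0,   # always return a dense NumPy array
)

encoded = ct.fit_transform(toy)
print(encoded.shape)   # 1 (Sex) + 2 (Embarked) + 2 (Pclass) + 1 (Age) columns
```

Because the transformer is fitted once and then reused, the same encoding can be applied to new data with ct.transform(), which is one reason to prefer it over get_dummies() outside a tutorial.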

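Model training and evaluation use Cross-Validation ("Article060"), GridSearchCV ("Article062"), and RandomizedSearchCV ("Article063"). As a rule of thumb, GridSearchCV, which tries every parameter combination, suits small datasets, while RandomizedSearchCV, which samples only a limited number of combinations, suits large datasets.

The terms AI (Artificial Intelligence), ML (Machine Learning), and DL (Deep Learning) are heard often these days. AI is the umbrella term that covers both ML and DL, ML in turn contains DL, and DL is built on neural networks. In other words, ML and DL are subsets of AI. DL is explained in detail in "Article028".

ML is divided into three types: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. See Figures (B, C, D) for the concept behind each type. By algorithm, ML tasks are further grouped into Regression and Classification (both supervised learning) and Clustering (unsupervised learning). See Figure (E) for the kinds of algorithms in each group.

What we predict here is a binary outcome, whether a Titanic passenger survives or dies, so we use ML classification algorithms. The series uses eight classification algorithms (KNeighborsClassifier, DecisionTreeClassifier, RandomForestClassifier, GaussianNB, SVC, ExtraTreeClassifier, GradientBoostingClassifier, AdaBoostClassifier) to predict the passengers' fate.

When predicting with ML, the right way to evaluate an algorithm (model) depends on the nature of the data being predicted. For the sales data in "Article056", where the prediction is whether a customer buys a product or not, evaluating the predictions with accuracy_score (accuracy) alone causes no particular problem. For a life-or-death prediction such as the Titanic, accuracy_score alone is not enough. If the model predicts "dies" and is wrong, little harm is done. If the model predicts "survives" and is wrong, the person dies, which is a serious error. In such cases the model's predictions must be split into four patterns and evaluated separately. scikit-learn provides the classification_report() and confusion_matrix() functions for obtaining these four patterns.

The "Beginner" part covers, in order, loading the data, analyzing (visualizing) the data, training, prediction, and evaluating the predictions. The evaluation section explains three functions for obtaining evaluation information. It also explains why three functions, classification_report(), confusion_matrix(), and accuracy_score(), exist and how to use the information they return.

The "Intermediate" part covers data analysis with Pandas and Matplotlib in detail. Visualization is essential to data analysis, and the Pandas plot() method makes it easy to visualize data. Matplotlib additionally lets you polish a chart, make it easier to read, and annotate it with supplementary information.

The "Advanced" part covers how to make predictions with several algorithms (models) and how to evaluate them. Tuning the predictions means adding various parameters to the model and, at the same time, adjusting the values (ranges) of those parameters. The Pipeline-based GridSearchCV() and RandomizedSearchCV() are explained as efficient ways to do this. RandomizedSearchCV() makes it efficient to find out which parameters improve the predictions when added to the model, and GridSearchCV() makes it efficient to tune the values (ranges) of those parameters.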
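
To give a feel for the Pipeline plus GridSearchCV() combination described for the advanced part, here is a minimal sketch. It is not the code of the later articles: the synthetic data from make_classification() stands in for the prepared Titanic features, and the step names 'scale' and 'clf' and the candidate values for C are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the prepared Titanic features (assumption)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# "<step name>__<parameter name>" addresses a parameter of one pipeline step
param_grid = {'clf__C': [0.01, 0.1, 1.0, 10.0]}

# Try every value in the grid with 5-fold cross-validation, scored by f1
search = GridSearchCV(pipe, param_grid, cv=5, scoring='f1')
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

RandomizedSearchCV has the same fit() interface but takes param_distributions and n_iter, so it evaluates only a sampled subset of the combinations.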

The explanations here use the "Python Interactive window" of Visual Studio Code (VSC), which provides an environment similar to Jupyter Notebook. To switch VSC from the normal environment to the interactive one, type the comment "# %%" in your code. For the detailed steps, see "here". In the interactive environment there is no need to call print(), plt.show(), and the like, so they are omitted from most of the code shown. When running in the normal VSC environment, add print(), plt.show(), and so on as needed.

This article uses the Pandas and Matplotlib libraries, so install them beforehand by following "Article001 | Article002 | Article003 | Article004". The Python code is written in Microsoft's Visual Studio Code. If it is not installed yet, install it by following "Article001".

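Predicting passenger survival with the Titanic survival data [Beginner]

First, import the Python libraries

Start Visual Studio Code (VSC), create a new file, and copy and paste lines 1-18. Lines 2-16 import the Python libraries. Line 18 suppresses Python warnings. If a library is not installed yet, install it beforehand with "pip install".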

```
### Import the libraries
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn.utils.extmath import randomized_range_finder

import warnings

warnings.simplefilter('ignore')
```

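Figure 1 shows the Visual Studio Code (VSC) screen.

Loading the Titanic survival data into a Pandas DataFrame

Line 2 defines the path of the CSV file. To download the CSV file from this site instead, uncomment line 3 (remove the #). Line 4 loads the CSV file into a Pandas DataFrame.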
```
### Load the titanic train data
csv_file = 'data/csv/titanic/train.csv'
#csv_file = 'https://money-or-ikigai.com/menu/python/article/data/titanic/train.csv'
df = pd.read_csv(csv_file)
```
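Figure 2

Figure 2 shows the output. The VSC interactive window shows the structure of the DataFrame printed by "df.info()" and the first record printed by "df.head(1)". The DataFrame contains 891 records.

Showing the number of survivors and deaths as a bar chart

Line 5 creates a bar chart of survivors and deaths with seaborn's countplot() function. When you use Jupyter Notebook or the interactive feature of Visual Studio Code (VSC), plt.show() on line 10 is not needed. Try commenting out line 10 and running the code. Without the semicolon ";" at the end of line 9, the text "Text(0.5, 1.0, ...')" is displayed, and adding the semicolon suppresses it. Because plt.show() on line 10 is used here, the result is the same with or without the semicolon.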
```
### Analysing Data
#sns.set_style('darkgrid') 
#custom_palette=['red','green']   
#sns.set_palette(custom_palette) 
ax = sns.countplot(x='Survived', data=df)
ax.set_xticklabels(['Dead','Survived'])
ax.set_xlabel(None)
ax.set_ylabel('Count')
ax.set_title('Dead vs Survived Analysis\n(Titanic)');   # semi-colon does not produce output
plt.show()
```
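Figure 3

Figure 3 shows the output.

Showing the number of survivors and deaths by sex as a bar chart

Line 1 creates a bar chart of survivors and deaths with seaborn's countplot() function. Because "hue='Sex'" is passed to countplot(), separate bars are drawn for each sex.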
```
ax = sns.countplot(x='Survived', hue='Sex', data=df)
ax.set_xticklabels(['Dead','Survived'])
ax.set_xlabel(None)
ax.set_ylabel('Count')
ax.set_title('Dead vs Survived Analysis\n(Sex)')
plt.show()
```
Figure 4

Figure 4 shows the output. The chart shows that more women than men survived.
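
The numbers behind a chart like this can also be read directly from the DataFrame. The sketch below uses a few made-up rows (not the real Titanic data) with the same column names as df, and shows the counts per group and the survival rate per sex.

```python
import pandas as pd

# Toy rows (made up for illustration); the column names match the article's df
toy = pd.DataFrame({
    'Sex':      ['male', 'male', 'male', 'male', 'female', 'female', 'female', 'female'],
    'Survived': [0, 0, 0, 1, 1, 1, 1, 0],
})

# Counts per (Sex, Survived) pair: the heights of the bars drawn with hue='Sex'
counts = pd.crosstab(toy['Sex'], toy['Survived'])

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate per group
rates = toy.groupby('Sex')['Survived'].mean()

print(counts)
print(rates)
```

Applied to the real data, the same two lines would be pd.crosstab(df['Sex'], df['Survived']) and df.groupby('Sex')['Survived'].mean().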

Showing a bar chart by ticket class

Line 1 creates a bar chart by ticket class (Pclass) with the countplot() function.
```
ax = sns.countplot(x='Survived', hue='Pclass', data=df)
ax.set_xticklabels(['Dead','Survived'])
ax.set_xlabel(None)
ax.set_ylabel('Count')
ax.set_title('Dead vs Survived Analysis\n(Pclass)')
plt.show()
```
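Figure 5

Figure 5 shows the output. The chart shows that "Class 1" passengers account for the largest number of survivors.

Showing the number of companions (siblings / spouses) aboard as a bar chart

Line 1 creates a bar chart of the number of companions (siblings / spouses) with seaborn's countplot() function.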
```
ax = sns.countplot(x='SibSp', data=df)
ax.set_xlabel('Number of siblings / spouses aboard')
ax.set_ylabel('Count')
ax.set_title('Siblings / Spouses Analysis\n(Titanic)')
plt.show()
```
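Figure 6

Figure 6 shows the output. The chart shows that most passengers boarded without a sibling or spouse.

Showing the number of passengers by age group as a histogram

Line 1 creates a histogram of passenger ages with the plot.hist() method of the Pandas DataFrame. Passing "bins=10" to hist() divides the range between the youngest and oldest passenger into 10 bins of equal width, so each bin covers roughly, but not exactly, a decade.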
```
df['Age'].plot.hist(bins=10, edgecolor='black', linewidth=0.8)
#df['Age'].plot.hist(bins=10, edgecolor='white', linewidth=0.8)

plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Analysis\n(Titanic)')
plt.show()
```
Figure 7

Figure 7 shows the output. The chart shows that passengers in their 20s and 30s were the most numerous.
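
How bins=10 places the bin edges can be checked with NumPy. This is only a sketch on made-up ages: np.histogram() is used here in place of the plotting call because it applies the same equal-width rule to an integer bin count, and the second call shows one way to request exact 10-year bins with explicit edges.

```python
import numpy as np

# Toy ages (made up); the minimum is 0 and the maximum is 80
ages = np.array([0, 4, 19, 22, 24, 28, 30, 35, 41, 54, 63, 80])

# An integer bin count splits [min, max] into that many equal-width bins
counts, edges = np.histogram(ages, bins=10)
widths = np.diff(edges)          # width of each bin: (80 - 0) / 10

# Explicit edges give bins that really are 10 years wide: 0-10, 10-20, ...
decade_counts, decade_edges = np.histogram(ages, bins=np.arange(0, 90, 10))

print(edges)
print(counts)
print(decade_counts)
```

With the real Age column the range is not a multiple of ten, so an integer bin count gives bins somewhat narrower than a decade; explicit edges avoid that.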

Showing the number of passengers by fare as a histogram

Line 1 creates a histogram of the number of passengers per fare range with the plot.hist() method of the Pandas DataFrame.
```
df['Fare'].plot.hist(bins=20, edgecolor='black', linewidth=0.8, figsize=(10,5))
#df['Fare'].plot.hist(bins=20, edgecolor='white', linewidth=0.8, figsize=(10,5))

plt.xlabel('Fare ($)')
plt.ylabel('Count')
plt.title('Fare Analysis\n(Titanic)')
plt.show()
```
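Figure 8

Figure 8 shows the output. The chart shows that the overwhelming majority of passengers paid a low fare, around $20 or less.

Showing the number of missing values per DataFrame column

Line 3 displays, for each DataFrame column, the number of values that are null (not entered).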
```
### Data Wrangling

df.isnull().sum()
#df.isnull().sum().sum()
```
Figure 9

Figure 9 shows the output. It shows that the DataFrame columns "Age, Cabin, Embarked" contain missing (null) values.
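
The sketch below shows what isnull().sum() computes, using a small made-up DataFrame rather than the real data. For reference, the df.info() output listed later in this article implies 177 missing values in Age (891 - 714), 687 in Cabin (891 - 204), and 2 in Embarked (891 - 889).

```python
import numpy as np
import pandas as pd

# Toy frame (made up) with the same kind of gaps as the Titanic data
toy = pd.DataFrame({
    'Age':      [22.0, np.nan, 26.0, np.nan],
    'Cabin':    [None, 'C85', None, None],
    'Embarked': ['S', 'C', None, 'S'],
    'Fare':     [7.25, 71.28, 7.92, 8.05],
})

missing_per_column = toy.isnull().sum()     # one count per column
missing_total = toy.isnull().sum().sum()    # a single grand total
missing_share = toy.isnull().mean()         # fraction of missing values per column

print(missing_per_column)
print(missing_total)
print(missing_share)
```

The fraction per column is often more useful than the raw count when deciding whether to drop a column or only the affected rows.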

Showing missing data in the Pandas DataFrame as a heat map

Line 1 uses seaborn's heatmap() function to visualize which DataFrame values are null. Line 6 draws the same heat map with a color map (cmap='viridis').
```
ax = sns.heatmap(df.isnull(), yticklabels=False)
ax.set_xlabel('DataFrame Columns')
ax.set_title('Column Null Value Analysis\n(Titanic)')
plt.show()

ax = sns.heatmap(df.isnull(), yticklabels=False, cmap='viridis')
ax.set_xlabel('DataFrame Columns')
ax.set_title('Column Null Value Analysis\n(Titanic)')
plt.show()
```
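Figure 10-1

Figure 10-1 shows the output. In this heat map, the positions of missing values in each DataFrame column are shown in white. A heat map makes missing data visible at a glance.

Figure 10-2

Figure 10-2 shows the output. In this heat map, the positions of missing values in each DataFrame column are shown in yellow.

Showing ticket class and age as a box plot

Line 1 creates a box plot with seaborn's boxplot() function.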
```
ax = sns.boxplot(x='Pclass', y='Age', data=df)

ax.set_xlabel('Ticket Class')
ax.set_title('Ticket Class vs Age Analysis\n(Titanic)')
plt.show()
```
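Figure 11

Figure 11 shows the output. The chart shows the age distribution for each ticket class. Class 1 passengers tend to be older, which is consistent with Class 1 fares being the most expensive.

Removing missing data from the Pandas DataFrame

Line 1 removes the column "Cabin" with the DataFrame drop() method. Line 3 removes every row (record) that contains a missing value (NaN) with the DataFrame dropna() method. Line 5 draws a heat map of the DataFrame's isnull() result with seaborn's heatmap() function. Line 10 displays the number of null values in each column with the DataFrame isnull() method.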
```
df.drop('Cabin', axis=1, inplace=True) 

df.dropna(inplace=True)

ax = sns.heatmap(df.isnull(), yticklabels=False, cbar=False)
ax.set_xlabel('DataFrame Columns')
ax.set_title('Column Null Value Analysis\n(Titanic)')
plt.show()

df.isnull().sum()
```
Figure 12

Figure 12 shows the output. The heat map of the DataFrame is entirely black, which confirms that no missing data remain. As a double check, "df.isnull().sum()" reports 0 for every column.
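
The order of the two steps matters, as this sketch on a made-up DataFrame suggests: the Cabin column is mostly empty, so calling dropna() before dropping it would discard nearly every row. In the real data the two steps reduce 891 records to 712, as the dfx.info() output later in this article shows.

```python
import numpy as np
import pandas as pd

# Toy frame (made up): Cabin is mostly missing, Age and Embarked have one gap each
toy = pd.DataFrame({
    'Age':      [22.0, np.nan, 26.0, 35.0],
    'Cabin':    [None, 'C85', None, None],
    'Embarked': ['S', 'C', None, 'S'],
})

rows_before = len(toy)

# Without inplace=True, drop() and dropna() return new objects and leave toy unchanged
rows_if_dropna_only = len(toy.dropna())
cleaned = toy.drop('Cabin', axis=1).dropna()
rows_after = len(cleaned)

print(rows_before, rows_if_dropna_only, rows_after)
```

Returning new objects instead of using inplace=True also keeps the original df available, which is convenient when re-running cells in the interactive window.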
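
Converting the values of the DataFrame columns "Sex, Embarked, Pclass" into 0/1 indicator columns

Lines 2-6 convert the categorical columns into 0/1 indicator (dummy) columns with the Pandas get_dummies() function. To train on data with ML (Machine Learning), all data must be converted to numbers, so string data has to be converted to int or float values.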
```
# Convert categorical variable into dummy/indicator variables
sex = pd.get_dummies(df['Sex'], drop_first=True) 

embark = pd.get_dummies(df['Embarked'], drop_first=True)

pcl = pd.get_dummies(df['Pclass'], drop_first=True)
```
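Figure 13-1

Figure 13-1 shows the output. The column "Sex" contains either "female" or "male". get_dummies() produces a Pandas DataFrame with one indicator column per value ("female", "male"), which is stored in the variable "sex". The "female" column is redundant, because a 0 in "male" already means female, so "drop_first=True" is passed to get_dummies() to drop it.

Figure 13-2

Figure 13-2 shows the output. The column "Embarked" (port of embarkation) contains one of "C, Q, S". get_dummies() produces a Pandas DataFrame with the indicator columns "C, Q, S", which is stored in the variable "embark". The "C" column is redundant, so "drop_first=True" is passed to get_dummies() to drop it. Here C = Cherbourg, Q = Queenstown, and S = Southampton.

Figure 13-3

Figure 13-3 shows the output. The column "Pclass" (ticket class) contains one of "1, 2, 3". get_dummies() produces a Pandas DataFrame with the indicator columns "1, 2, 3", which is stored in the variable "pcl". The "1" column is redundant, so "drop_first=True" is passed to get_dummies() to drop it.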
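
The effect of drop_first=True is easiest to see on a tiny example. The sketch below uses a made-up Series; dtype=int is passed only so that the result is 0/1 integers regardless of the installed pandas version, which is an addition to the article's calls.

```python
import pandas as pd

# Toy column (made up) with the three embarkation codes
embarked = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# One indicator column per category
full = pd.get_dummies(embarked, dtype=int)

# drop_first=True removes the first category ('C'); a row of all zeros now means 'C'
reduced = pd.get_dummies(embarked, drop_first=True, dtype=int)

print(full)
print(reduced)
```

Dropping one column loses no information, because the dropped category is exactly the case where all remaining indicators are 0.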

Adding the indicator columns for "Sex, Embarked, Pclass" to the DataFrame as new columns

Line 1 concatenates the DataFrame (df) with the indicator DataFrames (sex, embark, pcl) using the Pandas concat() function.
```
dfx = pd.concat([df, sex, embark, pcl], axis=1)
```
Figure 14

Figure 14 shows the output. The columns "male, Q, S, 2, 3" have been added to the DataFrame.

Removing unneeded columns from the DataFrame

Line 1 removes the columns "Pclass, Sex, Embarked, PassengerId, Name, Ticket" with the DataFrame drop() method.
```
dfx.drop(['Pclass','Sex','Embarked','PassengerId','Name','Ticket'], axis=1, inplace=True)
```
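Figure 15

Figure 15 shows the output. The unneeded columns have been removed from the DataFrame, and every remaining column now holds numeric values (int, float, or 0/1 indicators). To analyze data with Machine Learning (ML), all data (column values) must be converted to numbers.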
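
One caveat that depends on your environment: the indicator columns for Pclass are named with the integers 2 and 3, while the other columns have string names, and recent scikit-learn releases may reject a DataFrame whose column names mix types when fit() is called. The sketch below, on a made-up DataFrame, shows two possible workarounds. Neither is part of the article's code, and the prefix 'Pclass' is an arbitrary choice.

```python
import pandas as pd

# Toy frame (made up) with the same mix of string and integer column names as dfx
toy = pd.DataFrame({'Age': [22.0, 38.0], 'male': [1, 0], 2: [0, 0], 3: [1, 0]})
name_types = sorted({type(c).__name__ for c in toy.columns})   # kinds of column names

# Workaround 1: convert every column name to a string after the fact
toy.columns = toy.columns.astype(str)

# Workaround 2: give the dummy columns string names when they are created
pclass = pd.Series([3, 1, 2, 1], name='Pclass')
pcl = pd.get_dummies(pclass, prefix='Pclass', drop_first=True, dtype=int)

print(name_types)
print(list(toy.columns))
print(list(pcl.columns))
```

With the second workaround the new columns are called Pclass_2 and Pclass_3, which also makes the final feature table easier to read.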
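
Creating the data for analysis (X, y)

Line 2 removes the column "Survived" from the DataFrame and stores the resulting Pandas DataFrame in the variable X. Line 3 copies the column "Survived" of the DataFrame (dfx) and stores it as a Pandas Series in the variable y.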
```
### Train & Test Data
X = dfx.drop('Survived', axis=1)
y = dfx['Survived']
```
Figure 16

Figure 16 shows the output. The variable "X" holds a Pandas DataFrame made from dfx with the column "Survived" removed. The variable "y" holds the column "Survived" of dfx as a Pandas Series. X and y each contain 712 records.

Splitting the data (X_train, X_test, y_train, y_test)

Line 1 splits X and y in a 70:30 ratio with the train_test_split() function. The split data is stored in the variables X_train, X_test, y_train, and y_test.
```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
```
Figure 17

Figure 17 shows the output. The data has been split 70:30 into "X_train, y_train" and "X_test, y_test". "X_train, y_train" is used for training, and "X_test, y_test" is used for prediction and evaluation.
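
The following sketch shows the split on a made-up array of 10 samples, and also that a fixed random_state makes the split reproducible. With the 712 records of this article, test_size=0.3 yields 214 test records, which matches the total support reported later by classification_report().

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# The same random_state reproduces exactly the same split
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.3, random_state=1)

print(len(X_train), len(X_test))
print(np.array_equal(X_test, X_test2))
```

train_test_split() also accepts stratify=y, which keeps the proportion of survivors the same in both parts; the article's code does not use it.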

Training on the data (X_train, y_train)

Line 1 creates a LogisticRegression instance. Line 2 trains it on X_train and y_train with the fit() method of LogisticRegression.
```
model = LogisticRegression(n_jobs=-1, random_state=0)
model.fit(X_train, y_train)
```
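Figure 18

Figure 18 shows the output.

Predicting from the data (X_test)

Line 1 predicts the outcome for X_test with the predict() method of LogisticRegression. The predicted values are stored in the variable predictions, each being either 1 = Survived or 0 = Dead. The variable y_test holds the actual values.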
```
predictions = model.predict(X_test)
```
Figure 19

Figure 19 shows the output. The VSC interactive window displays the contents of "y_test" and "predictions". Comparing "y_test" with "predictions" reveals how accurate the predictions are.
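
Comparing the two arrays by eye is tedious, so here is a sketch of doing it in code. Synthetic data from make_classification() stands in for the Titanic features (an assumption made so that the example is self-contained); the fit(), predict(), and comparison steps are the same as in the article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared Titanic features (assumption)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

model = LogisticRegression(random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Element-wise comparison: True where the prediction equals the actual value
matches = predictions == y_test
accuracy = matches.mean()            # share of correct predictions

# Class probabilities behind each prediction (column 1 is class 1)
proba = model.predict_proba(X_test)

print(matches.sum(), 'of', len(matches), 'predictions are correct')
print(round(accuracy, 3))
```

The share of matches computed here is the same value that accuracy_score(y_test, predictions) returns.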
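
Evaluating the predictions with classification_report (precision, recall, f1-score)

Line 1 obtains the information for evaluating the predictions (precision, recall, f1-score, ...) with the classification_report() function. It returns the information shown in Table 1 and Table 2. The variable y_test holds the actual values, and the variable predictions holds the predicted values.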
```
classification_report(y_test, predictions)
```
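Figure 20

Figure 20 shows the output. scikit-learn offers three functions for obtaining evaluation information about predictions: classification_report(), confusion_matrix(), and accuracy_score().

classification_report() returns precision, recall, f1-score, accuracy, and related values. confusion_matrix() returns the TP/FN/FP/TN counts. accuracy_score() returns the accuracy. Three kinds of evaluation information exist because accuracy alone cannot always tell you whether a model is the best choice.

For example, the Titanic prediction decides between "survives" and "dies" for each passenger, and every prediction falls into one of four patterns (Table 0). In the Titanic case, Positive means "survives" and Negative means "dies". TP is a "survives" prediction that was right. FP is a "survives" prediction that was wrong. FN is a "dies" prediction that was wrong. TN is a "dies" prediction that was right.

The important point is that, in the Titanic case, what matters is how reliable the "survives" predictions are. If the model predicts "dies" and is wrong, the passenger survives, so little harm is done. If the model predicts "survives" and is wrong, the passenger dies, so these predictions need to be as reliable as possible.

In the Titanic case we therefore need to reduce the number of FP (False Positive) cases in Table 0. Even a model with high accuracy is not a good model unless its "survives" predictions are reliable.

Table 1 turns Table 0 into scores. "precision" is TP ÷ (TP + FP), the share of positive predictions that were right. In the Titanic case it is the share of "survives" predictions that were correct. "recall" is the TPR (True Positive Rate), TP ÷ (TP + FN). In the Titanic case it is the share of actual survivors whom the model predicted to survive. "f1-score" is the harmonic mean of the two, 2 × precision × recall ÷ (precision + recall), so it is high only when both precision and recall are high. In the Titanic case we should therefore select the model whose f1-score for the "survives" class (class 1) is high.

The accuracy in Table 2 is calculated as (TP+TN) ÷ (TP+FP+FN+TN), that is, the share of all predictions that were right (TP+TN). Substituting the values from Table 3 gives (102+63) ÷ (102+24+25+63) ≒ 0.77 (77%). In the Titanic case, lowering the share of FP matters more than raising the share of (TP+TN).

To summarize, in the Titanic case the rule is to select the model with the highest f1-score.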
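
Table 0 (rows are the actual class and columns the predicted class, in the order used by scikit-learn's confusion_matrix())

| | Predicted: dies (0) | Predicted: survives (1) |
|---|---|---|
| Actual: died (0) | TN (True Negative) | FP (False Positive) |
| Actual: survived (1) | FN (False Negative) | TP (True Positive) |

Table 1

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| class 0 (died) | 0.80 | 0.81 | 0.81 | 126 (102+24) |
| class 1 (survived) | 0.72 | 0.72 | 0.72 | 88 (25+63) |

Table 2

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| accuracy | | | 0.77 | 214 (102+24+25+63) |
| macro avg | 0.76 | 0.76 | 0.76 | 214 (102+24+25+63) |
| weighted avg | 0.77 | 0.77 | 0.77 | 214 (102+24+25+63) |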
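
The class 1 row of Table 1 and the accuracy in Table 2 can be reproduced by hand. The sketch below assumes the counts are read from the confusion_matrix() output [[102, 24], [25, 63]] in scikit-learn's documented order (rows are actual classes, columns are predicted classes, labels sorted as 0 then 1), with class 1 (Survived) taken as the positive class.

```python
# Counts read from the confusion_matrix() output [[102, 24], [25, 63]],
# taking class 1 (Survived) as the positive class
tn, fp, fn, tp = 102, 24, 25, 63

precision = tp / (tp + fp)       # share of "survives" predictions that were right
recall = tp / (tp + fn)          # share of actual survivors that were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# prints: 0.72 0.72 0.72 0.77
```

Swapping the roles of the two classes (treating class 0 as positive) reproduces the class 0 row in the same way: 102 / 127 ≒ 0.80 for precision and 102 / 126 ≒ 0.81 for recall.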

Evaluating the predictions with confusion_matrix (TP/FN/FP/TN)

Line 1 obtains the information in Table 3 with confusion_matrix().
```
confusion_matrix(y_test, predictions)
```
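Figure 21

Figure 21 shows the output. confusion_matrix() returns the information shown in Table 3, which fills the four cells of Table 0 with actual counts (numbers of passengers). scikit-learn puts the actual classes in rows and the predicted classes in columns, ordered by label, so with 0 = Dead and 1 = Survived the output [[102, 24], [25, 63]] reads as follows. "TN (102)" means that 102 passengers were predicted to die and did die. "FP (24)" means that 24 passengers were predicted to survive but died.

"FN (25)" means that 25 passengers were predicted to die but survived. "TP (63)" means that 63 passengers were predicted to survive and did survive. The four counts add up to 214 passengers.

Table 3 is used when tuning the precision and recall in Table 1. After changing the parameters of the prediction model, check how the numbers in Table 3 change and adjust the parameter values accordingly.

Table 3

| | Predicted: dies (0) | Predicted: survives (1) |
|---|---|---|
| Actual: died (0) | TN (102) | FP (24) |
| Actual: survived (1) | FN (25) | TP (63) |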
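
Because the cell order of confusion_matrix() is easy to misread, the sketch below checks it on made-up labels where all four counts differ. Unpacking the flattened matrix as tn, fp, fn, tp is the reading given in the scikit-learn documentation for binary labels 0 and 1.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (made up): 4 actual negatives (0) followed by 6 actual positives (1)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

# Rows are actual classes, columns are predicted classes, labels sorted as 0, 1
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```

Here three negatives are predicted correctly (TN), one negative is predicted as positive (FP), two positives are predicted as negative (FN), and four positives are predicted correctly (TP).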

Evaluating the predictions with accuracy_score (accuracy)

Line 1 obtains the accuracy with accuracy_score().
```
accuracy_score(y_test, predictions)
```
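Figure 22

Figure 22 shows the output. The accuracy is displayed as "0.7710280373831776 (≒77%)". As explained earlier, accuracy is calculated as "(TP+TN) ÷ (TP+FP+FN+TN)". For some kinds of data there is no need to evaluate precision and recall, and in those cases the prediction model is selected on accuracy alone.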
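
Why accuracy alone can mislead is easy to see with a hypothetical baseline that predicts "dies" for every passenger. The sketch below uses only the class sizes from the support column of the report (126 died, 88 survived); the baseline itself is an illustration and not a model from this article.

```python
# Class sizes taken from the support column of the report
n_dead, n_survived = 126, 88

# A hypothetical baseline that predicts "dies" (class 0) for everyone
tn, fp = n_dead, 0          # every actual death is predicted correctly
fn, tp = n_survived, 0      # every actual survivor is missed

baseline_accuracy = (tp + tn) / (tp + fp + fn + tn)
baseline_recall = tp / (tp + fn)    # share of survivors the baseline finds

print(round(baseline_accuracy, 3), baseline_recall)
# prints: 0.589 0.0
```

The baseline reaches about 59% accuracy without identifying a single survivor, so the 77% of the trained model should be judged together with precision, recall, and f1-score.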

All the code explained in this article

Finally, all the code explained in this article is listed below for reference.


### Import the libraries

#from os import terminal_size
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#from sklearn.cross_validation import train_test_split => not found cross_validation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression     # Predict using Logistic Regression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn.utils.extmath import randomized_range_finder

import warnings

warnings.simplefilter('ignore')


# %%

### Titanic Data Analysis
# 1) Collecting Data
# 2) Analyzing Data
# 3) Data Wrangling => Cleaning Data
# 4) Train & Test
# 5) Accuracy Check

### Load the titanic train data
csv_file = 'data/csv/titanic/train.csv'
#csv_file = 'https://money-or-ikigai.com/menu/python/article/data/titanic/train.csv'
df = pd.read_csv(csv_file)

# df.info()
# RangeIndex: 891 entries, 0 to 890
# Data columns (total 12 columns):
#  #   Column       Non-Null Count  Dtype  
# ---  ------       --------------  -----  
#  0   PassengerId  891 non-null    int64  
#  1   Survived     891 non-null    int64       0 = No, 1 = Yes 
#  2   Pclass       891 non-null    int64       Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
#  3   Name         891 non-null    object 
#  4   Sex          891 non-null    object      male or female
#  5   Age          714 non-null    float64
#  6   SibSp        891 non-null    int64       # of siblings / spouses aboard the Titanic
#  7   Parch        891 non-null    int64       # of parents / children aboard the Titanic          
#  8   Ticket       891 non-null    object      Ticket number
#  9   Fare         891 non-null    float64     
#  10  Cabin        204 non-null    object      Cabin number
#  11  Embarked     889 non-null    object      Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
# dtypes: float64(2), int64(5), object(5)


# %%

### Analysing Data
#sns.set_style('darkgrid') 
#custom_palette=['red','green']   
#sns.set_palette(custom_palette) 
ax = sns.countplot(x='Survived', data=df)
# countplot(*, x=None, y=None, hue=None, data=None, order=None, 
# hue_order=None, orient=None, color=None, palette=None, 
# saturation=0.75, dodge=True, ax=None, **kwargs)
ax.set_xticklabels(['Dead','Survived'])
ax.set_xlabel(None)
ax.set_ylabel('Count')
ax.set_title('Dead vs Survived Analysis\n(Titanic)');   # semi-colon does not produce output
plt.show()

#seaborn.countplot(*, x=None, y=None, hue=None, data=None, order=None, 
# hue_order=None, orient=None, color=None, palette=None, saturation=0.75, 
# dodge=True, ax=None, **kwargs)

# %%

ax = sns.countplot(x='Survived', hue='Sex', data=df)
ax.set_xticklabels(['Dead','Survived'])
ax.set_xlabel(None)
ax.set_ylabel('Count')
ax.set_title('Dead vs Survived Analysis\n(Sex)')
plt.show()


# %%

ax = sns.countplot(x='Survived', hue='Pclass', data=df)
ax.set_xticklabels(['Dead','Survived'])
ax.set_xlabel(None)
ax.set_ylabel('Count')
ax.set_title('Dead vs Survived Analysis\n(Pclass)')
plt.show()


# %%

ax = sns.countplot(x='SibSp', data=df)
#ax.set_xticklabels(['Dead','Survived'])
ax.set_xlabel('Number of siblings / spouses aboard')
ax.set_ylabel('Count')
ax.set_title('Siblings / Spouses Analysis\n(Titanic)')
plt.show()


# %%

df['Age'].plot.hist(bins=10, edgecolor='black', linewidth=0.8)
#df['Age'].plot.hist(bins=10, edgecolor='white', linewidth=0.8)

plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Analysis\n(Titanic)')
plt.show()
#DataFrame.plot.hist(by=None, bins=10, **kwargs)
# bins=99, alpha=0.5
# edgecolor='black', linewidth=1.2


# %%

df['Fare'].plot.hist(bins=20, edgecolor='black', linewidth=0.8, figsize=(10,5))
#df['Fare'].plot.hist(bins=20, edgecolor='white', linewidth=0.8, figsize=(10,5))

plt.xlabel('Fare ($)')
plt.ylabel('Count')
plt.title('Fare Analysis\n(Titanic)')
plt.show()
#DataFrame.plot.hist(by=None, bins=10, **kwargs)
# bins=99, alpha=0.5
# edgecolor='black', linewidth=1.2


# %%

### Data Wrangling

df.isnull().sum()
#df.isnull().sum().sum()


# %%

ax = sns.heatmap(df.isnull(), yticklabels=False)

ax.set_xlabel('DataFrame Columns')
#ax.set_ylabel('Count')
ax.set_title('Column Null Value Analysis\n(Titanic)')
plt.show()


# %%

ax = sns.heatmap(df.isnull(), yticklabels=False, cmap='viridis')

ax.set_xlabel('DataFrame Columns')
#ax.set_ylabel('Count')
ax.set_title('Column Null Value Analysis\n(Titanic)')
plt.show()


# %%

ax = sns.boxplot(x='Pclass', y='Age', data=df)

ax.set_xlabel('Ticket Class')
#ax.set_ylabel('Age')
ax.set_title('Ticket Class vs Age Analysis\n(Titanic)')
plt.show()


# %%

### Data Wrangling

df.drop('Cabin', axis=1, inplace=True)  # axis=1 => column

# %%

df.dropna(inplace=True)

# %%

ax = sns.heatmap(df.isnull(), yticklabels=False, cbar=False)

ax.set_xlabel('DataFrame Columns')
#ax.set_ylabel('Count')
ax.set_title('Column Null Value Analysis\n(Titanic)')
plt.show()


# %%

df.isnull().sum()

# %%

# Convert categorical variable into dummy/indicator variables. 
sex = pd.get_dummies(df['Sex'], drop_first=True)  
#sex = pd.get_dummies(df['Sex'])  
# get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, 
# columns=None, sparse=False, drop_first=False, dtype=None)

#ix     male
# 0	    1
# 1	    0
# 2	    0
# 3	    0
# 4	    1
# ...	...
# 885	0
# 886	1
# 887	0
# 889	1
# 890	1


# %%

embark = pd.get_dummies(df['Embarked'], drop_first=True)
#embark = pd.get_dummies(df['Embarked'])

#ix embark
#   Q	S
# 0	0	1
# 1	0	0
# 2	0	1
# 3	0	1
# 4	0	1
# ...	...	...
# 885	1	0
# 886	0	1
# 887	0	1
# 889	0	0
# 890	1	0

# %%

pcl = pd.get_dummies(df['Pclass'], drop_first=True)
#pcl = pd.get_dummies(df['Pclass'])

#  ix   2	3
#   0	0	1
#   1	0	0
#   2	0	1
#   3	0	0
#   4	0	1
# ...	...	...
# 885	0	1
# 886	1	0
# 887	0	0
# 889	0	0
# 890	0	1

# %%

dfx = pd.concat([df, sex, embark, pcl], axis=1) # axis=1 (column)

# dfx.info()
# Int64Index: 712 entries, 0 to 890
# Data columns (total 16 columns):
#  #   Column       Non-Null Count  Dtype  
# ---  ------       --------------  -----  
#  0   PassengerId  712 non-null    int64  
#  1   Survived     712 non-null    int64  
#  2   Pclass       712 non-null    int64  
#  3   Name         712 non-null    object 
#  4   Sex          712 non-null    object 
#  5   Age          712 non-null    float64
#  6   SibSp        712 non-null    int64  
#  7   Parch        712 non-null    int64  
#  8   Ticket       712 non-null    object 
#  9   Fare         712 non-null    float64
#  10  Embarked     712 non-null    object 
#  11  male         712 non-null    uint8 * 
#  12  Q            712 non-null    uint8 * 
#  13  S            712 non-null    uint8 * 
#  14  2            712 non-null    uint8 * 
#  15  3            712 non-null    uint8 * 
# dtypes: float64(2), int64(5), object(4), uint8(5)

# %%

dfx.drop(['Pclass','Sex','Embarked','PassengerId','Name','Ticket'], axis=1, inplace=True)

# dfx.info()
# Int64Index: 712 entries, 0 to 890
# Data columns (total 16 columns):
#  #   Column       Non-Null Count  Dtype  
# ---  ------       --------------  -----  
#  0   PassengerId  712 non-null    int64   => drop  
#  1   Survived     712 non-null    int64  
#  2   Pclass       712 non-null    int64   => drop  
#  3   Name         712 non-null    object  => drop
#  4   Sex          712 non-null    object  => drop
#  5   Age          712 non-null    float64
#  6   SibSp        712 non-null    int64  
#  7   Parch        712 non-null    int64  
#  8   Ticket       712 non-null    object  => drop 
#  9   Fare         712 non-null    float64
#  10  Embarked     712 non-null    object  => drop 
#  11  male         712 non-null    uint8 * 
#  12  Q            712 non-null    uint8 * 
#  13  S            712 non-null    uint8 * 
#  14  2            712 non-null    uint8 * 
#  15  3            712 non-null    uint8 * 
# dtypes: float64(2), int64(5), object(4), uint8(5)


# %%

### Train & Test Data
X = dfx.drop('Survived', axis=1)
y = dfx['Survived']

# %%

# Split arrays or matrices into random train and test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# train_test_split(*arrays, test_size=None, train_size=None, 
# random_state=None, shuffle=True, stratify=None)

# %%

# Logistic Regression classifier
model = LogisticRegression(n_jobs=-1, random_state=0)
# LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, 
# intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', 
# max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

# %%
# Fit the model according to the given training data
model.fit(X_train, y_train)
#model.get_params

# %%

# Predict class labels for samples in X_test
predictions = model.predict(X_test)

# actual(y_test)
# array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1,
#        1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,
#        0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
#        1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0,
#        1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
#        0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
#        1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
#        0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0,
#        1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
#        1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0], dtype=int64)

# predictions
# array([0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1,
#        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0,
#        0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
#        1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
#        1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
#        0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1,
#        1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
#        1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1,
#        1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
#        1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0], dtype=int64)


# %%

# Build a text report showing the main classification metrics
classification_report(y_test, predictions)
# classification_report(y_true, y_pred, *, labels=None, target_names=None, 
# sample_weight=None, digits=2, output_dict=False, zero_division='warn')

#         precision    recall  f1-score   support
# class 0      0.80      0.81      0.81       126
# class 1      0.72      0.72      0.72        88

# precision: TP / (TP + FP) => share of "Survived" predictions that were right
# recall: TP / (TP + FN), the true positive rate (TPR) => share of actual survivors that were found
# f1-score: harmonic mean of precision and recall, 2 * P * R / (P + R)
# A high f1-score means that both precision and recall are high,
# so checking the f1-score alone is usually enough.

#      accuracy                      0.77       214
#     macro avg       0.76   0.76    0.76       214
#  weighted avg       0.77   0.77    0.77       214


# %%

# Compute confusion matrix to evaluate the accuracy of a classification
confusion_matrix(y_test, predictions)
# confusion_matrix(y_true, y_pred, *, labels=None, sample_weight=None, normalize=None)

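# 102,  24   |  TN(True Negative)  ★    FP(False Positive) ☆  row 0: actually Dead
#  25,  63   |  FN(False Negative) ☆    TP(True Positive)  ★  row 1: actually Survived
#------------
# 127   87      column totals: predicted Dead / predicted Survived
#  20%  28%     share of wrong predictions in each column

# Of the 127 passengers predicted to die, 25 predictions were wrong (about 20%).
# Of the 87 passengers predicted to survive, 24 predictions were wrong (about 28%).
# So this model predicts survival less reliably than death.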

# %%

# Accuracy classification score
accuracy_score(y_test, predictions)     # 0.7710 => 77.10%
# accuracy_score(y_true, y_pred, *, normalize=True, sample_weight=None)
# The best performance is 1 with normalize == True

# %%
