Python: Learning Data Analysis for Machine Learning with the Titanic Dataset [Matplotlib]

The "Machine Learning with the Titanic Survival Data" series consists of eight articles. If you are interested in machine learning, reading them in order is recommended.

This article explains how to analyze (visualize) data with the Python library Matplotlib. Matplotlib can create many kinds of charts, including bar charts and histograms. Its subplots() function can also place several charts inside a single Figure. If you have not yet read Article048, read it first to get an overview of the machine learning (ML) workflow.

The terms AI (Artificial Intelligence), ML (Machine Learning), and DL (Deep Learning) come up often these days. AI is the umbrella term that includes both ML and DL, and ML in turn includes DL. DL is built on neural networks. In other words, ML and DL are subsets of AI. DL is covered in detail in Article028.

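ML is divided into three types: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. See Figures B, C, and D for the concept behind each type. ML algorithms are further grouped into Regression, Classification, and Clustering; regression and classification are supervised techniques, while clustering is usually treated as unsupervised. See Figure E for the kinds of algorithms in each group.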

The goal here is to predict one of two outcomes for each Titanic passenger, survived or died, so this is an ML classification problem. The series uses eight classification algorithms (KNeighborsClassifier, DecisionTreeClassifier, RandomForestClassifier, GaussianNB, SVC, ExtraTreeClassifier, GradientBoostingClassifier, AdaBoostClassifier) to predict passenger survival.

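When you make predictions with ML, the right way to evaluate an algorithm (model) depends on the nature of the data being predicted. For the sales data in Article056, where the model predicts whether a customer will or will not buy a product, evaluating predictions with accuracy_score (accuracy) alone causes no particular problem. For a case like the Titanic, where the model predicts whether a person lives or dies, accuracy_score alone is not enough. For example, if the model predicts "dies" and is wrong, little harm is done. If the model predicts "survives" and is wrong, the person actually dies, which is a serious error. In such cases the predictions need to be split into four patterns (true positive, false positive, true negative, false negative) and evaluated separately. scikit-learn provides classification_report() and confusion_matrix() for obtaining this four-pattern information.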
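To make the four patterns concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (they are not Titanic results), and the helper name confusion_counts is hypothetical; in practice scikit-learn's confusion_matrix() reports the same four counts.

```python
# Hypothetical labels for illustration only: 1 = survived, 0 = died
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]


def confusion_counts(actual, predicted):
    """Count the four outcome patterns of a binary prediction."""
    pairs = list(zip(actual, predicted))
    return {
        'tp': sum(1 for a, p in pairs if a == 1 and p == 1),  # predicted survive, survived
        'tn': sum(1 for a, p in pairs if a == 0 and p == 0),  # predicted die, died
        'fp': sum(1 for a, p in pairs if a == 0 and p == 1),  # predicted survive, died
        'fn': sum(1 for a, p in pairs if a == 1 and p == 0),  # predicted die, survived
    }


counts = confusion_counts(y_true, y_pred)
accuracy = (counts['tp'] + counts['tn']) / len(y_true)
print(counts, accuracy)
```

Accuracy folds all four counts into one number, which is why it cannot show how many of the errors are the costly "predicted survive, actually died" kind (fp).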

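The beginner part walks through loading the data, analyzing (visualizing) it, training, predicting, and evaluating the predictions, in that order. The evaluation step covers three functions for obtaining evaluation information. It also explains why three functions, classification_report(), confusion_matrix(), and accuracy_score(), are provided and how to put the information they return to use.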

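The intermediate part covers data analysis with Pandas and Matplotlib in detail. Visualization is essential to analysis, and the Pandas plot() method makes it easy to visualize data. Matplotlib goes further: it lets you polish a chart's appearance, improve its readability, and add supplementary information.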

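The advanced part uses several algorithms (models) to make actual predictions and explains how to evaluate them. Tuning predictions means adding parameters to a model and adjusting their values (ranges) at the same time. To do this efficiently, the series covers GridSearchCV() and RandomizedSearchCV() used together with Pipeline. RandomizedSearchCV() makes it efficient to find out which parameters improve the predictions, and GridSearchCV() makes it efficient to tune the values (ranges) of those parameters.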

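This article uses the "Python Interactive window" in Visual Studio Code (VSC), which provides a Jupyter Notebook-like environment. To switch VSC from the normal mode to the interactive mode, type the comment "# %%" in your code. For detailed steps, see the linked instructions. In the interactive environment, calls such as print() and plt.show() are not required, so the listings omit them in places. When running the code in VSC's normal mode, add print(), plt.show(), and so on as needed.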

This article uses the Pandas and Matplotlib libraries, so install them beforehand by following Article001, Article002, Article003, and Article004. The Python code is written in Microsoft Visual Studio Code. If it is not installed yet, see Article001 and install it first.

Visualizing the Titanic survival data with the Python library Matplotlib

Start by importing the Python libraries
Start Visual Studio Code (VSC), create a new file, and paste in lines 1-8. Lines 2-6 import the Python libraries. Line 8 suppresses Python warnings. If a library is not installed yet, install it beforehand with "pip install".
```
### Import the libraries
from functools import reduce
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

warnings.simplefilter('ignore')
```
Figure 1 shows the Visual Studio Code (VSC) window.

Load the Titanic survival data into a Pandas DataFrame
Line 2 defines the path to the CSV file. To download the CSV file from this site instead, uncomment line 3. Line 4 loads the CSV file into a Pandas DataFrame with read_csv(). Line 5 copies the DataFrame (raw) into the variable df. Note that writing "df = raw" does not copy anything: both names share the same memory, so df is just another name for raw, and any change made through df also changes raw. To avoid sharing memory, copy the DataFrame with its copy() method.
```
### Load the data
csv_file = 'data/csv/titanic/train.csv'
#csv_file = 'https://money-or-ikigai.com/menu/python/article/data/titanic/train.csv'
raw = pd.read_csv(csv_file)
df = raw.copy()
```
Figure 2 shows the result. The VSC interactive window displays the output of df.info() and df.head(1). The CSV file holds data for 891 passengers. The "Survived" column holds 0 (died) or 1 (survived). The "Sex" column holds male or female.
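The difference between "df = raw" and "raw.copy()" described above can be checked on a tiny DataFrame. This is a minimal sketch with made-up values; the variable names alias and independent are chosen only for illustration.

```python
import pandas as pd

raw = pd.DataFrame({'Age': [22.0, 38.0]})

alias = raw               # no copy: a second name for the same object
independent = raw.copy()  # a separate DataFrame with its own data

alias.loc[0, 'Age'] = 99.0        # also visible through raw
independent.loc[1, 'Age'] = -1.0  # raw is not affected

print(alias is raw)          # True: same object
print(raw['Age'].tolist())   # [99.0, 38.0]
```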

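Create age histograms split left and right (by sex)

Line 5 filters the DataFrame (df) down to female passengers, and line 6 filters it down to male passengers. Line 12 creates two subplots with Matplotlib's subplots() function. Line 14 draws the female histogram on ax1. Line 16 flips the x-axis so the female histogram grows from right to left, the opposite direction from the male histogram. This makes the two histograms mirror each other. Line 22 draws the male histogram on ax2.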

### Male vs Female Analysis by Age
male_mask = df['Sex'] == 'male' 
female_mask = df['Sex'] == 'female' 

female_df = df[female_mask]
male_df = df[male_mask]

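plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')  # style renamed in Matplotlib 3.6
plt.rcParams['font.family'] = 'Meiryo'  
# figsize is passed to subplots() below so that no empty extra figure is created

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(6, 5))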

ax1.hist(female_df.Age, bins=10, color='r', edgecolor='white', linewidth=1.2, orientation='horizontal')
# Invert the order of x-axis values
ax1.set_xlim(ax1.get_xlim()[::-1])
# Move ticks to the right
ax1.yaxis.tick_right()
ax1.set_title('Female Passenger (女性乗船者)')
ax1.grid(False)

ax2.hist(male_df.Age, bins=10, color='b', edgecolor='white', linewidth=1.2, orientation='horizontal')
# Keep ticks on the left
ax2.yaxis.tick_left()
ax2.set_title('Male Passenger (男性乗船者)')
ax2.grid(False)
plt.show()

Figure 3 shows the result. The age histograms for female and male passengers are displayed side by side.
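The mirrored layout can also be tried without the CSV file. The sketch below uses synthetic ages (random numbers, not Titanic data) and Matplotlib's invert_xaxis(), which has the same effect as reversing the limits with set_xlim(). The Agg backend line is only there so the example runs without a display; drop it inside VSC.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit in the VSC interactive window
import matplotlib.pyplot as plt
import numpy as np

# Synthetic ages for illustration only
rng = np.random.default_rng(0)
ages_left = rng.normal(28, 14, 300).clip(0, 80)
ages_right = rng.normal(30, 14, 500).clip(0, 80)

fig, (ax_left, ax_right) = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(6, 5))

ax_left.hist(ages_left, bins=10, color='r', orientation='horizontal')
ax_left.invert_xaxis()         # bars now grow from right to left
ax_left.yaxis.tick_right()     # put the age ticks next to the center line

ax_right.hist(ages_right, bins=10, color='b', orientation='horizontal')

x_left = ax_left.get_xlim()    # (larger value, smaller value) after inversion
```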

Define the function get_dataframe() that builds a Pandas DataFrame
```
### Define user function
def get_dataframe(df, col):
    survived_mask = df['Survived'] == 1
    dead_mask = df['Survived'] == 0
    survived = df[survived_mask][col].value_counts()
    dead = df[dead_mask][col].value_counts()
    dfx = pd.DataFrame([survived, dead])
    dfx.index = ['Survived','Dead'] 
    return dfx

#dfx = get_dataframe(df, 'Sex')
```
Figure 4 shows the VSC window. Here line 11 has been uncommented and run. The interactive window shows the structure and contents of the return value dfx. Calling get_dataframe(df, 'Sex') returns a DataFrame (dfx) like the one in Figure 4. The "Survived" column of df stores the outcome as a number (0 = dead, 1 = survived), and the "Sex" column stores female or male.

Line 5 stores the number of survivors by sex in the variable survived as a Pandas Series, specifically female=233 and male=109. Likewise, line 6 stores the number of deaths by sex in the variable dead as a Pandas Series, specifically female=81 and male=468.

Line 7 passes the Series stored in survived and dead to the DataFrame() constructor. The resulting DataFrame (dfx) has the columns female and male, and each Series name becomes a row label of the index. Both Series carry the same name ("Sex"; in pandas 2.x, value_counts() names the result "count" instead), so the index contains the same label twice.

Line 8 replaces the duplicated index labels with (Survived, Dead). The function then returns a DataFrame (dfx) like the one in Figure 4.
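The steps inside get_dataframe() can be followed on a toy table. This is a minimal sketch with six made-up passengers. It also shows pd.crosstab() as a compact alternative that produces the same counts (with 0/1 as row labels); that alternative is a suggestion, not part of the original code.

```python
import pandas as pd

# Six made-up passengers for illustration only
toy = pd.DataFrame({
    'Survived': [1, 1, 1, 0, 0, 0],
    'Sex': ['female', 'female', 'male', 'male', 'male', 'female'],
})

# Same steps as get_dataframe(toy, 'Sex')
survived = toy[toy['Survived'] == 1]['Sex'].value_counts()
dead = toy[toy['Survived'] == 0]['Sex'].value_counts()
dfx = pd.DataFrame([survived, dead])   # one row per Series
dfx.index = ['Survived', 'Dead']       # replace the duplicated row labels

# Alternative: cross-tabulate the two columns directly
ct = pd.crosstab(toy['Survived'], toy['Sex'])

print(dfx)
print(ct)
```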

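Create bar charts of survival by sex (male, female)

Lines 15-16 create a stacked bar chart. To stack the bars, add bottom=dfx.female to the second Matplotlib bar() call. Lines 19-24 print the passenger count on each bar using Matplotlib's text() function. The coordinates are chosen so that each count appears in the center of its bar segment.

Lines 40-41 create a regular (grouped) bar chart. To keep the female and male bars from overlapping, adjust the first argument of bar(). Here the female bars use x_indexes and the male bars use x_indexes+width, which shifts them to the side.

Lines 44-48 print the passenger count above each bar, again with Matplotlib's text() function. The count is centered horizontally over the bar.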

### Survived vs Dead Analysis by Gender
# get a new dataframe
dfx = get_dataframe(df, 'Sex')
# dfx 
# --------------------------
# index     female  male
# Survived  233	    109     
# Dead	    81	    468   

# Stacked bar chart
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')  # style renamed in Matplotlib 3.6
plt.rcParams['font.family'] = 'Meiryo'  
plt.figure(figsize=(6,5))

bar1 = plt.bar(dfx.index, dfx.female, color='r', label='female')  
bar2 = plt.bar(dfx.index, dfx.male, bottom=dfx.female, color='g', label='male')  

# Attach text labels
rects = bar1 + bar2
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0, 
        f'{height:.0f}', ha='center', va='center')

plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Gender)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()

# Vertical bar chart
plt.figure(figsize=(6, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.25

bar1 = plt.bar(x_indexes, dfx.female, width=width, color='r', label='female')  
bar2 = plt.bar(x_indexes+width, dfx.male, width=width, color='g', label='male')  

# Attach text labels
rects = bar1 + bar2
for rect in rects:
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width() / 2.0, height, 
        f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Gender)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

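Figure 5 shows the result. The VSC interactive window displays the stacked bar chart and the regular bar chart, each annotated with passenger counts. Showing the numbers on a chart makes comparison easier.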
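The listing above positions each count by hand with text(). As a possible shortcut, Matplotlib 3.4 and later provide Axes.bar_label(), which can place the labels for a whole bar container at once. The sketch below is an alternative under that version assumption, not the article's method; it reuses the female/male counts quoted earlier. The Agg backend line only makes it runnable without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit in the VSC interactive window
import matplotlib.pyplot as plt

groups = ['Survived', 'Dead']
female = [233, 81]
male = [109, 468]

fig, ax = plt.subplots(figsize=(6, 5))
bars_f = ax.bar(groups, female, color='r', label='female')
bars_m = ax.bar(groups, male, bottom=female, color='g', label='male')  # stacked on top

# label_type='center' writes each segment's own height in its middle
labels_f = ax.bar_label(bars_f, label_type='center')
labels_m = ax.bar_label(bars_m, label_type='center')

ax.legend(loc='upper left')
texts = [t.get_text() for t in labels_f + labels_m]
```

For the regular (grouped) chart, label_type='edge' (the default) places the counts above the bars instead.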

Create bar charts of survival by ticket class (1st, 2nd, 3rd class)

Lines 26 and 56 use functools.reduce() with a lambda expression to concatenate the bar containers stored in bar_list into a single sequence. The statement "rects = reduce(lambda x, y: x + y, bar_list)" is a shorter way of writing "rects = (bar_list[0][0], bar_list[0][1], bar_list[1][0], bar_list[1][1], bar_list[2][0], bar_list[2][1])".
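The effect of reduce() here can be seen with plain tuples standing in for the bar containers (each container behaves like a tuple of rectangles). This is a small sketch with placeholder strings; itertools.chain is shown as an equivalent alternative, not something the article uses.

```python
from functools import reduce
from itertools import chain

# Placeholder tuples standing in for three bar containers (Survived, Dead)
bar_list = [('s1', 'd1'), ('s2', 'd2'), ('s3', 'd3')]

# reduce() applies x + y pairwise: ((bar1 + bar2) + bar3)
rects = reduce(lambda x, y: x + y, bar_list)

# Equivalent flattening without a lambda
rects_chain = tuple(chain.from_iterable(bar_list))

print(rects)  # ('s1', 'd1', 's2', 'd2', 's3', 'd3')
```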

### Survived vs Dead Analysis by Ticket Class (1st, 2nd, 3rd)
# Pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
# get a new dataframe
dfx = get_dataframe(df, 'Pclass')
# dfx 
# --------------------------
# index     1	2	3
# Survived  136	87	119
# Dead	    80	97	372

# Stacked bar chart
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')  # style renamed in Matplotlib 3.6
plt.rcParams['font.family'] = 'Meiryo'  
plt.figure(figsize=(6,5))

label_list = ['1st class','2nd class','3rd class']
bar_list = []
cumval = 0

for i, col in enumerate(dfx.columns):
    bar = plt.bar(dfx.index, dfx[col], bottom=cumval, label=label_list.pop(0))  
    bar_list.append(bar)
    cumval += dfx[col]

# Attach text labels
rects = reduce(lambda x, y: x + y, bar_list) 
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0, 
        f'{height:.0f}', ha='center', va='center')

plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Ticket Class)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()

# Vertical bar chart
plt.figure(figsize=(6, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.25

width_list = [width*-1, 0, width*1]
label_list = ['1st class','2nd class','3rd class']
bar_list = []

for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))  
    bar_list.append(bar)

# Attach text labels
rects = reduce(lambda x, y: x + y, bar_list) 
for rect in rects:
    width = rect.get_width() 
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, 
        f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Ticket Class)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

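Figure 6 shows the result. The VSC interactive window displays the stacked bar chart and the regular bar chart. In the stacked chart the counts appear inside the bars; in the regular chart they appear above the bars.

Create bar charts of survival by number of siblings/spouses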
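The listings write the bar offsets out by hand (for example width_list = [width*-1, 0, width*1]). As a sketch of how those offsets could be generated for any number of series, the hypothetical helper bar_offsets() below centers the group on each tick. With an odd number of series it reproduces the hand-written lists; with an even number it centers the group, whereas the article's two-series charts start at the tick instead.

```python
import numpy as np


def bar_offsets(n_series, width):
    """Return x offsets that center n_series bars of the given width on a tick."""
    return (np.arange(n_series) - (n_series - 1) / 2) * width


print(bar_offsets(3, 0.25))   # offsets for three series: -0.25, 0, 0.25
print(bar_offsets(7, 0.10))   # offsets for seven series: -0.3 ... 0.3
```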

### SibSp: number of siblings / spouses aboard the Titanic
#df['SibSp'] = df['SibSp'].fillna(0)
#df.isnull().sum()
# get a new dataframe
dfx = get_dataframe(df, 'SibSp')
# dfx 
#           0           1       2       3       4       5       8
# -------------------------------------------------------------------- 
# index	   
# Survived  210.0	112.0	13.0	4.0     3.0     0.0	0.0
# Dead	    398.0	97.0	15.0	12.0	15.0	5.0	7.0
dfx = dfx.fillna(0)
dfx.isnull().sum()
dfy = dfx.iloc[:,0:3]

# Stacked bar chart
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')  # style renamed in Matplotlib 3.6
plt.rcParams['font.family'] = 'Meiryo'  
plt.figure(figsize=(6,5))

label_list = ['0 Sibling/Spouses','1 Sibling/Spouses','2 Sibling/Spouses']
bar_list = []
cumval = 0

for i, col in enumerate(dfy.columns):
    bar = plt.bar(dfy.index, dfy[col], bottom=cumval, label=label_list.pop(0))  
    bar_list.append(bar)
    cumval += dfy[col]

# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0, 
        f'{height:.0f}', ha = 'center', va = 'center')

plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Number of Siblings/Spouses Aboard)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()

# Vertical bar chart
plt.figure(figsize=(10, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.10    # 0.25

width_list = [width*-3, width*-2, width*-1, 0, width, width*2, width*3]

label_list = ['0 Sibling/Spouses','1 Sibling/Spouses','2 Sibling/Spouses',
    '3 Sibling/Spouses','4 Sibling/Spouses','5 Sibling/Spouses','8 Sibling/Spouses']

bar_list = []

for i, col in enumerate(dfx.columns):  
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))  
    bar_list.append(bar)

# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # bar1+bar2+bar3+bar4+bar5+bar6+bar7   
for rect in rects:
    width = rect.get_width() 
    height = rect.get_height()    
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Number of Siblings/Spouses Aboard)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

Figure 7 shows the result. The VSC interactive window displays the stacked bar chart and the regular bar chart.

Create bar charts of survival by number of parents/children (Parents/Children)

### Parch: number of parents / children aboard the Titanic
#df['Parch'] = df['Parch'].fillna(0)
#df.isnull().sum()
# get a new dataframe
dfx = get_dataframe(df, 'Parch')
# dfx
#           0           1       2       3       4       5       6
# ------------------------------------------------------------------- 
# index    
# Survived  233.0	65.0	40.0	3.0	0.0	1.0	0.0
# Dead	    445.0	53.0	40.0	2.0	4.0	4.0	1.0
dfx = dfx.fillna(0)
dfx.isnull().sum()
dfy = dfx.iloc[:,0:3]

# stacked bar chart
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')  # style renamed in Matplotlib 3.6
plt.rcParams['font.family'] = 'Meiryo'  
plt.figure(figsize=(6,5))

#label_list = []
bar_list = []
cumval = 0

for i, col in enumerate(dfy.columns):
    bar = plt.bar(dfy.index, dfy[col], bottom=cumval, label=f'{i} parents/children')  
    bar_list.append(bar)
    cumval += dfy[col]

# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0, 
        f'{height:.0f}', ha = 'center', va = 'center')

plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Number of Parents/Children Aboard)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()

# vertical bar chart
plt.figure(figsize=(10, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.10    # 0.25

width_list = [width*-3, width*-2, width*-1, 0, width, width*2, width*3]
bar_list = []

for i, col in enumerate(dfx.columns):   
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=f'{i} parents/children')  
    bar_list.append(bar)

# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list)    
for rect in rects:
    width = rect.get_width() 
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Number of Parents/Children Aboard)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

Figure 8 shows the result. The VSC interactive window displays the stacked bar chart and the regular bar chart.

Create bar charts of survival by port of embarkation (Southampton, Cherbourg, Queenstown)

### Port of Embarkation:  C = Cherbourg, Q = Queenstown, S = Southampton
df['Embarked_copy'] = df['Embarked']
df['Embarked_copy'] = df['Embarked_copy'].fillna('S')
#df.isnull().sum()  # Embarked          2 => 0
# get a new dataframe
dfx = get_dataframe(df, 'Embarked_copy')
# dfx
# ----------------------------------
# index         S	C	Q
# Survived      219	93	30
# Dead	        427	75	47
dfx = dfx.fillna(0)
dfx.isnull().sum()

# stacked bar chart
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')  # style renamed in Matplotlib 3.6
plt.rcParams['font.family'] = 'Meiryo'  
plt.figure(figsize=(6,5))

label_list = ['Southampton','Cherbourg','Queenstown']
bar_list = []
cumval = 0

for i, col in enumerate(dfx.columns):
    bar = plt.bar(dfx.index, dfx[col], bottom=cumval, label=label_list.pop(0))  
    bar_list.append(bar)
    cumval += dfx[col]

# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0, 
        f'{height:.0f}', ha = 'center', va = 'center')

plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Port of Embarkation)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()

# vertical bar chart
plt.figure(figsize=(6, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.25    # 0.25

width_list = [width*-1, 0, width]
label_list = ['Southampton','Cherbourg','Queenstown']
bar_list = []

for i, col in enumerate(dfx.columns):   
    bar = plt.bar(x_indexes + width_list.pop(0), dfx.iloc[:,i], width=width, label=label_list.pop(0))  
    bar_list.append(bar)

# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list)    
for rect in rects:
    width = rect.get_width() 
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Port of Embarkation)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

Figure 9 shows the result. The VSC interactive window displays the stacked bar chart and the regular bar chart.

Create bar charts of survival by title (Mr, Miss, Mrs, ...)

### Title Mapping
# Extract Title from the Name column
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)  # raw string avoids an invalid-escape warning
#print('Title:')
#print(df['Title'].value_counts())

title_mapping = {'Mr': 0, 'Miss': 1, 'Mrs': 2, 
                 'Master': 3, 'Dr': 3, 'Rev': 3, 'Col': 3, 'Major': 3, 'Mlle': 3,'Countess': 3,
                 'Ms': 3, 'Lady': 3, 'Jonkheer': 3, 'Don': 3, 'Dona' : 3, 'Mme': 3,'Capt': 3,'Sir': 3 }

df['Title_map'] = df['Title'].map(title_mapping)
df[['Title', 'Title_map']]
#df['Title_map'] = df['Title_map'].fillna(0)
#df.isnull().sum()  # Title_map          

# get a new dataframe
dfx = get_dataframe(df, 'Title_map')
# dfx
# ----------------------------------
# index	    0	1	2	3
# Survived  81	127	99	35
# Dead	    436	55	26	32
#dfx = dfx.fillna(0)
dfx.isnull().sum()

# stacked bar chart
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')  # style renamed in Matplotlib 3.6
plt.rcParams['font.family'] = 'Meiryo'  
plt.figure(figsize=(6,5))

label_list = ['Mr','Miss','Mrs','Misc']
bar_list = []
cumval = 0

for i, col in enumerate(dfx.columns):
    bar = plt.bar(dfx.index, dfx[col], bottom=cumval, label=label_list.pop(0))  
    bar_list.append(bar)
    cumval += dfx[col]

# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0, 
        f'{height:.0f}', ha = 'center', va = 'center')

plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Title)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()

# Vertical bar chart
plt.figure(figsize=(8, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.15    # 0.25

width_list = [width*-1, 0, width, width*2]
label_list = ['Mr','Miss','Mrs','Misc']
bar_list = []

for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))  
    bar_list.append(bar)

# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)    
for rect in rects:
    width = rect.get_width() 
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Title)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

Figure 10 shows the result. The VSC interactive window displays the stacked bar chart and the regular bar chart.

Create a bar chart of survival by sex (Sex_map)
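The title extraction and mapping can be checked on a few names. This is a minimal sketch: the four names follow the "Surname, Title. Given names" pattern of the Name column, and the shortened mapping with fillna(3) for every unlisted title is an assumption made here for brevity (the article lists each rare title explicitly instead).

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Uruchurtu, Don. Manuel E',
])

# Capture the word that follows a space and ends with a period
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)

# Shortened mapping; titles missing from the dict become NaN, then 3 (Misc)
title_mapping = {'Mr': 0, 'Miss': 1, 'Mrs': 2}
codes = titles.map(title_mapping).fillna(3).astype(int)

print(titles.tolist())  # ['Mr', 'Miss', 'Mrs', 'Don']
print(codes.tolist())   # [0, 1, 2, 3]
```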

### Sex Mapping male => 0, female => 1
sex_mapping = {'male': 0, 'female': 1}
df['Sex_map'] = df['Sex'].map(sex_mapping)

# get a new dataframe
dfx = get_dataframe(df, 'Sex_map')
# dfx
# -----------------------
# index	    0	1	
# Survived  109	233	
# Dead	    468	81	
#dfx = dfx.fillna(0)
#dfx.isnull().sum()

# vertical bar chart
plt.figure(figsize=(6,5))
x_indexes = np.arange(len(dfx.index))
width = 0.25    # 0.25

width_list = [0, width]
label_list = ['male','female']
bar_list = []

for i, col in enumerate(dfx.columns):   
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))  
    bar_list.append(bar)

# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)    
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Sex_map)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

Figure 11 shows the result. It displays a bar chart of the DataFrame's "Sex_map" column, which holds the values of the "Sex" column (male, female) mapped to (0, 1).

Create a bar chart of survival by age group (adult, young, mid-age, child, senior)

### Age Analysis 
# Fix Age NaN => median age of each Title_map group
df.isnull().sum()   # Age => 177 (891 rows - 714 non-null)
df['Age'] = df['Age'].fillna(df.groupby('Title_map')['Age'].transform('median'))
df.isnull().sum()   # Age => 0    

### Age Mapping

# Binning/Converting Numerical Age to Categorical Variable
# feature vector map:
# child:    0
# young:    1
# adult:    2
# mid-age:  3
# senior:   4

def age_map(age):
    if age <= 16:
        return 0
    elif age > 16 and age <= 26:
        return 1
    elif age > 26 and age <= 36:
        return 2                        
    elif age > 36 and age <= 62:
        return 3             
    else:
        return 4

df['Age_map'] = df.loc[:, 'Age'].apply(age_map)

# get a new dataframe
dfx = get_dataframe(df, 'Age_map')

survived_mask = df['Survived'] == 1
dead_mask = df['Survived'] == 0
# dfx
# -----------------------------------------
# index     2	1	3	0	4
# Survived  116	97	69	57	3
# Dead	    220	158	111	48	12
#dfx = dfx.fillna(0)
#dfx.isnull().sum()

# bar chart
plt.figure(figsize=(10,5))
x_indexes = np.arange(len(dfx.index))
width = 0.15    # 0.25

width_list = [width*-2, width*-1, 0, width*1, width*2]
label_list = ['adult','young','mid-age','child','senior']
bar_list = []

for i, col in enumerate(dfx.columns):   
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))  
    bar_list.append(bar)

# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)    
for rect in rects:
    width = rect.get_width() 
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Age Category)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

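Figure 12 shows the result. It displays a bar chart of the DataFrame's "Age_map" column, which holds the values of the "Age" column mapped to the numbers 0-4: child (0), young (1), adult (2), mid-age (3), senior (4).

Create a bar chart of passengers by ticket class and port of embarkation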
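The age bins used by age_map() can also be expressed with pandas.cut(). The sketch below is an alternative formulation, not the article's code; it checks on a handful of made-up ages that both approaches assign the same category. One difference worth knowing: age_map() sends a missing age (NaN) to category 4 because every comparison with NaN is false, while pd.cut() leaves it as NaN.

```python
import numpy as np
import pandas as pd


def age_map(age):
    """Same thresholds as the article: child, young, adult, mid-age, senior."""
    if age <= 16:
        return 0
    elif age <= 26:
        return 1
    elif age <= 36:
        return 2
    elif age <= 62:
        return 3
    return 4


# Made-up ages chosen to hit every boundary
ages = pd.Series([0.42, 16, 16.5, 26, 30, 36, 40, 62, 63, 80])

by_function = ages.apply(age_map)

# Right-closed intervals: (-inf, 16], (16, 26], (26, 36], (36, 62], (62, inf)
by_cut = pd.cut(ages, bins=[-np.inf, 16, 26, 36, 62, np.inf],
                labels=[0, 1, 2, 3, 4]).astype(int)

print(by_function.tolist())
print(by_cut.tolist())
```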

### Map the value of Pclass
Pclass1_mask = df['Pclass'] == 1 
Pclass2_mask = df['Pclass'] == 2 
Pclass3_mask = df['Pclass'] == 3 

Pclass1 = df[Pclass1_mask]['Embarked_copy'].value_counts()
Pclass2 = df[Pclass2_mask]['Embarked_copy'].value_counts()
Pclass3 = df[Pclass3_mask]['Embarked_copy'].value_counts()
dfx = pd.DataFrame([Pclass1, Pclass2, Pclass3])
dfx.index = ['1st Class','2nd Class','3rd Class']

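# dfx (NOTE: the counts below appear to come from the combined train+test data,
#      1,309 rows; train.csv alone yields smaller counts)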
# ---------------------------------
# index         S	C	Q
# 1st Class	179	141	3
# 2nd Class	242	28	7
# 3rd Class	495	101	113

# vertical bar chart
plt.figure(figsize=(6,5))
x_indexes = np.arange(len(dfx.index))
width = 0.25    # 0.25

width_list = [width*-1, 0, width*1]
label_list = ['Southampton','Cherbourg','Queenstown']
bar_list = []

for i, col in enumerate(dfx.columns):   
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))  
    bar_list.append(bar)

# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)    
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Ticket Class')
plt.ylabel('Number of Passenger')
plt.title('Passengers by Ticket Class\n(Port of Embarkation)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

### Map the value of Embarked

# fill out missing embark with S embark
df['Embarked'] =  df['Embarked'].fillna('S')

embarked_mapping = {'S':0,'C':1,'Q':2}
df['Embarked_map'] = df['Embarked'].map(embarked_mapping)

Figure 13 shows the result. It breaks down each ticket class (1st Class, 2nd Class, 3rd Class) by port of embarkation (Southampton, Cherbourg, Queenstown).

Map the fare (Fare) to a category (Fare_map)

### Map the value of Fare

# Fill missing Fare with median fare for each Pclass
df['Fare'] = df['Fare'].fillna(df.groupby('Pclass')['Fare'].transform('median'))
#df.head(5)

### Map the Fare

def fare_map(fare):
    if fare <= 17:
        return 0
    elif fare > 17 and fare <= 30:
        return 1
    elif fare > 30 and fare <= 100:
        return 2
    else:
        return 3

df['Fare_map'] = df.loc[:, 'Fare'].apply(fare_map)
#df.head(5)

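Figure 14 shows the result. The amounts in the DataFrame's "Fare" column are mapped to the numbers 0-3 and stored in the new column "Fare_map": 0 for fares up to 17, 1 for fares above 17 up to 30, 2 for fares above 30 up to 100, and 3 for fares above 100.

Create a bar chart of passengers by ticket class and cabin ID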
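The fare listing fills missing values with the median fare of the passenger's own ticket class. How groupby().transform('median') lines those medians up with the original rows is easier to see on a toy table. The six rows below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up fares with one gap in each class
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Fare': [50.0, np.nan, 70.0, 8.0, 10.0, np.nan],
})

# transform() returns one value per original row: the median of that row's group
class_median = toy.groupby('Pclass')['Fare'].transform('median')

# Assign the result back instead of using inplace=True on a column
toy['Fare'] = toy['Fare'].fillna(class_median)

print(class_median.tolist())  # [60.0, 60.0, 60.0, 9.0, 9.0, 9.0]
print(toy['Fare'].tolist())   # [50.0, 60.0, 70.0, 8.0, 10.0, 9.0]
```

Assigning the filled column back (rather than calling fillna(..., inplace=True) on df['Fare']) keeps the code working on recent pandas versions, where the in-place form on a selected column may not update the original DataFrame.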

### Map the value of Cabin

df['Cabin_x'] =  df['Cabin'].str[:1]    # X999 => X

Pclass1_mask = df['Pclass'] == 1 
Pclass2_mask = df['Pclass'] == 2 
Pclass3_mask = df['Pclass'] == 3 

Pclass1 = df[Pclass1_mask]['Cabin_x'].value_counts()
Pclass2 = df[Pclass2_mask]['Cabin_x'].value_counts()
Pclass3 = df[Pclass3_mask]['Cabin_x'].value_counts()
dfx = pd.DataFrame([Pclass1, Pclass2, Pclass3])
dfx = dfx.fillna(0)
dfx.index = ['1st class','2nd class', '3rd class']

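# dfx (shown before fillna(0); NOTE: the counts below appear to come from the
#      combined train+test data, so train.csv alone yields smaller counts)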
# ------------------------------------------------------------------------
# index         C	   B	   D	   E	  A	  T	   F	  G
# 1st class	94.0	65.0	40.0	34.0	22.0	1.0     NaN     NaN
# 2nd class	NaN     NaN     6.0     4.0     NaN     NaN     13.0    NaN
# 3rd class	NaN     NaN     NaN     3.0     NaN     NaN     8.0     5.0

# vertical bar chart
plt.figure(figsize=(10,5))
x_indexes = np.arange(len(dfx.index))
width = 0.10    # 0.25

width_list = [width*-3,width*-2,width*-1, 0, width*1,width*2,width*3,width*4]
label_list = ['C','B','D','E','A','T','F','G']
bar_list = []

for i, col in enumerate(dfx.columns):   
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))  
    bar_list.append(bar)

# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)    
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Ticket Class')
plt.ylabel('Number of Passenger')
plt.title('Passengers by Ticket Class\n(Cabin ID)')
plt.legend(loc='upper right')
plt.grid(True)
plt.tight_layout()
plt.show()

### Fill missing Cabin_map

cabin_mapping = {'A': 0, 'B': 0.4, 'C': 0.8, 'D': 1.2, 'E': 1.6, 'F': 2, 'G': 2.4, 'T': 2.8}
df['Cabin_map'] = df['Cabin_x'].map(cabin_mapping)  # A => 0, B => 0.4

# Fill missing Cabin_map with median Cabin_map for each Pclass
#df = df.fillna(0)
df.isnull().sum()
df['Cabin_map'] = df['Cabin_map'].fillna(df.groupby('Pclass')['Cabin_map'].transform('median'))
df.isnull().sum()

Figure 15 shows the result. It breaks down each ticket class (1st Class, 2nd Class, 3rd Class) by cabin ID (C, B, D, E, A, T, F, G).

Map the family size (FamilySize) to a scaled value (FamilySize_map)

### Map the value of Family Size 
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

### => SKIP
# # KDE(Kernel Density Estimation)
# #facet = sns.displot(x='FamilySize', hue='Survived', kind='kde', data=df)
# facet = sns.displot(x='FamilySize', hue='Survived', kind='kde', multiple='stack', data=df)
# #facet = sns.displot(x='FamilySize', hue='Survived', kind='kde', fill=True, data=df)

# plt.title('Dead vs Survived Analysis\n(Family Size)')
# plt.show()

# # facet = sns.FacetGrid(df, hue="Survived",aspect=4)
# # facet.map(sns.kdeplot,'FamilySize',shade= True)
# # facet.set(xlim=(0, df['FamilySize'].max()))
# # facet.add_legend()
# # plt.xlim(0)

family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5: 1.6, 6: 2, 7: 2.4, 8: 2.8, 9: 3.2, 10: 3.6, 11: 4}
df['FamilySize_map'] = df['FamilySize'].map(family_mapping)
#df
#df.isnull().sum()

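Figure 16 shows the result. The values in the DataFrame's "FamilySize" column are mapped to numbers in the range 0-4 and stored in the new column "FamilySize_map": 1:0, 2:0.4, 3:0.8, 4:1.2, 5:1.6, 6:2, 7:2.4, 8:2.8, 9:3.2, 10:3.6, 11:4

All the code from this article in one place

Finally, all the code explained in this article is listed together below for reference.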
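The FamilySize mapping is linear: each additional family member adds 0.4. As a sketch, the same dictionary can be generated with a comprehension instead of being typed out. The round() call is needed because products such as 0.4 * 3 are not exactly 1.2 in floating point.

```python
# Hand-written mapping from the article
family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5: 1.6, 6: 2,
                  7: 2.4, 8: 2.8, 9: 3.2, 10: 3.6, 11: 4}

# Generated equivalent: (size - 1) * 0.4, rounded to one decimal place
generated = {size: round(0.4 * (size - 1), 1) for size in range(1, 12)}

print(generated == family_mapping)  # True
```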


### Import the libraries
from functools import reduce
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import seaborn as sns
import warnings

warnings.simplefilter('ignore')
#sns.set() # setting seaborn default for plots


# %%

### Load the data
csv_file = 'data/csv/titanic/train.csv'
#csv_file = 'https://money-or-ikigai.com/menu/python/article/data/titanic/train.csv'
raw = pd.read_csv(csv_file)
#test_file = 'data/csv/titanic/test.csv'
#test = pd.read_csv(test_file)
df = raw.copy()

#df.shape
# (891, 12)    891 rows, 12 columns

#df.info()
# RangeIndex: 891 entries, 0 to 890
# Data columns (total 12 columns):
#  #   Column       Non-Null Count  Dtype  
# ---  ------       --------------  -----  
#  0   PassengerId  891 non-null    int64  
#  1   Survived     891 non-null    int64       0 = No, 1 = Yes 
#  2   Pclass       891 non-null    int64       Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
#  3   Name         891 non-null    object 
#  4   Sex          891 non-null    object 
#  5   Age          714 non-null    float64
#  6   SibSp        891 non-null    int64       # of siblings / spouses aboard the Titanic
#  7   Parch        891 non-null    int64       # of parents / children aboard the Titanic          
#  8   Ticket       891 non-null    object      Ticket number
#  9   Fare         891 non-null    float64     Passenger fare
#  10  Cabin        204 non-null    object      Cabin number
#  11  Embarked     889 non-null    object      Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
# dtypes: float64(2), int64(5), object(5)

#df[['Survived','Pclass','Age','SibSp','Fare']].describe()
#           Survived	Pclass	    Age	        SibSp	    Fare
# --------------------------------------------------------------------------
# count	    891.000000	1309.000000	1046.000000	1309.000000	1308.000000
# mean	    0.383838	2.294882	29.881138	0.498854	33.295479   ★
# std	    0.486592	0.837836	14.413493	1.041658	51.758668
# min	    0.000000	1.000000	0.170000	0.000000	0.000000    ★
# 25%	    0.000000	2.000000	21.000000	0.000000	7.895800
# 50%	    0.000000	3.000000	28.000000	0.000000	14.454200
# 75%	    1.000000	3.000000	39.000000	1.000000	31.275000
# max	    1.000000	3.000000	80.000000	8.000000	512.329200  ★

#df.isnull().sum()
# PassengerId       0
# Survived        418   ★
# Pclass            0
# Name              0
# Sex               0
# Age             263   ★
# SibSp             0
# Parch             0
# Ticket            0
# Fare              1
# Cabin          1014   ★
# Embarked          2
# dtype: int64


# %%

### Male vs Female Analysis by Age
male_mask = df['Sex'] == 'male' 
female_mask = df['Sex'] == 'female' 

female_df = df[female_mask]
male_df = df[male_mask]

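plt.style.use('seaborn')            # plt.style.available; on Matplotlib 3.6+ the name is 'seaborn-v0_8'
plt.rcParams['font.family'] = 'Meiryo'  

# Pass figsize to subplots(); a separate plt.figure() call would only open an extra empty figure
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(6, 5))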

ax1.hist(female_df.Age, bins=10, color='r', edgecolor='white', linewidth=1.2, orientation='horizontal')
#invert the order of x-axis values
ax1.set_xlim(ax1.get_xlim()[::-1])
#move ticks to the right
ax1.yaxis.tick_right()
ax1.set_title('Female Passenger (女性乗船者)')
ax1.grid(False)

ax2.hist(male_df.Age, bins=10, color='b', edgecolor='white', linewidth=1.2, orientation='horizontal')
#keep ticks on the left
ax2.yaxis.tick_left()
ax2.set_title('Male Passenger (男性乗船者)')
ax2.grid(False)
plt.show()


# %%

### Define user function
def get_dataframe(df, col):
    survived_mask = df['Survived'] == 1
    dead_mask = df['Survived'] == 0
    survived = df[survived_mask][col].value_counts()
    dead = df[dead_mask][col].value_counts()
    dfx = pd.DataFrame([survived, dead])
    dfx.index = ['Survived','Dead'] # rename index
    return dfx

dfx = get_dataframe(df, 'Sex')


# %%


### Survived vs Dead Analysis by Gender
# get a new dataframe
dfx = get_dataframe(df, 'Sex')
# dfx 
# --------------------------
# index     female	male
# Survived	233	    109     
# Dead	    81	    468   

# Stacked bar chart
plt.style.use('seaborn')            # plt.style.available
plt.rcParams['font.family'] = 'Meiryo'  
plt.figure(figsize=(6,5))

bar1 = plt.bar(dfx.index, dfx.female, color='r', label='female')  
bar2 = plt.bar(dfx.index, dfx.male, bottom=dfx.female, color='g', label='male')  

# Attach text labels.
rects = bar1 + bar2
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0, 
        f'{height:.0f}', ha = 'center', va = 'center')

plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Gender)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()

# Vertical bar chart
plt.figure(figsize=(6, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.25

bar1 = plt.bar(x_indexes, dfx.female, width=width, color='r', label='female')  
bar2 = plt.bar(x_indexes + width, dfx.male, width=width, color='g', label='male')  

# Attach text labels.
rects = bar1 + bar2
for rect in rects:
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width() / 2.0, height, f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Gender)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

# The chart confirms that women were more likely to survive than men.

# %%

### Survived vs Dead Analysis by Ticket Class (1st, 2nd, 3rd)
# Pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
# get a new dataframe
dfx = get_dataframe(df, 'Pclass')
# dfx 
# --------------------------
# index     1	2	3
# Survived	136	87	119
# Dead	    80	97	372

# Stacked bar chart
plt.style.use('seaborn')            # plt.style.available
plt.rcParams['font.family'] = 'Meiryo'  
plt.figure(figsize=(6,5))

label_list = ['1st class','2nd class','3rd class']
bar_list = []
cumval = 0

for i, col in enumerate(dfx.columns):
    bar = plt.bar(dfx.index, dfx[col], bottom=cumval, label=label_list.pop(0))  
    bar_list.append(bar)
    cumval += dfx[col]

# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0, 
        f'{height:.0f}', ha = 'center', va = 'center')

plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Ticket Class)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()

# Vertical bar chart
plt.figure(figsize=(6, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.25

width_list = [width*-1, 0, width*1]
label_list = ['1st class','2nd class','3rd class']
bar_list = []

for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))  
    bar_list.append(bar)

# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3
for rect in rects:
    width = rect.get_width() 
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, 
        f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Ticket Class)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

# The chart confirms that 1st class passengers were more likely to survive than the other classes.
# The chart confirms that 3rd class passengers were more likely to die than the other classes.

# %%

### SibSp: number of siblings / spouses aboard the Titanic
#df['SibSp'] = df['SibSp'].fillna(0)
#df.isnull().sum()
# get a new dataframe
dfx = get_dataframe(df, 'SibSp')
# dfx 
#           0       1       2       3       4       5   6
# ------------------------------------------------------------ 
# index	    0	    1	    2	    3	    4	    5	8
# Survived	210.0	112.0	13.0	4.0	    3.0	    0.0	0.0
# Dead	    398.0	97.0	15.0	12.0	15.0	5.0	7.0
dfx = dfx.fillna(0)
dfx.isnull().sum()
dfy = dfx.iloc[:,0:3]

# Stacked bar chart
plt.style.use('seaborn')            # plt.style.available
plt.rcParams['font.family'] = 'Meiryo'  
plt.figure(figsize=(6,5))

label_list = ['0 Sibling/Spouses','1 Sibling/Spouses','2 Sibling/Spouses']
bar_list = []
cumval = 0

for i, col in enumerate(dfy.columns):
    bar = plt.bar(dfy.index, dfy[col], bottom=cumval, label=label_list.pop(0))  
    bar_list.append(bar)
    cumval += dfy[col]

# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0, 
        f'{height:.0f}', ha = 'center', va = 'center')

plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Number of Siblings/Spouses Aboard)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()

# Vertical bar chart
plt.figure(figsize=(10, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.10    # 0.25

width_list = [width*-3, width*-2, width*-1, 0, width, width*2, width*3]

label_list = ['0 Sibling/Spouses','1 Sibling/Spouses','2 Sibling/Spouses',
    '3 Sibling/Spouses','4 Sibling/Spouses','5 Sibling/Spouses','8 Sibling/Spouses']

bar_list = []

for i, col in enumerate(dfx.columns):  
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))  
    bar_list.append(bar)

# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # bar1+bar2+bar3+bar4+bar5+bar6+bar7   
for rect in rects:
    width = rect.get_width() 
    height = rect.get_height()    
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Number of Siblings/Spouses Aboard)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

# The chart suggests that passengers with exactly 1 sibling/spouse aboard were slightly more likely to survive (112 vs 97).
# The chart suggests that passengers with no siblings/spouses aboard were more likely to die (398 vs 210).

# %%

### Parch: number of parents / children aboard the Titanic
#df['Parch'] = df['Parch'].fillna(0)
#df.isnull().sum()
# get a new dataframe
dfx = get_dataframe(df, 'Parch')
# dfx
# --------------------------------------------------------
# index     0	    1	    2	    3	4	5	6
# Survived	233.0	65.0	40.0	3.0	0.0	1.0	0.0
# Dead	    445.0	53.0	40.0	2.0	4.0	4.0	1.0
dfx = dfx.fillna(0)
dfx.isnull().sum()
dfy = dfx.iloc[:,0:3]

# stacked bar chart
plt.style.use('seaborn')            # plt.style.available
plt.rcParams['font.family'] = 'Meiryo'  
plt.figure(figsize=(6,5))

#label_list = []
bar_list = []
cumval = 0

for i, col in enumerate(dfy.columns):
    bar = plt.bar(dfy.index, dfy[col], bottom=cumval, label=f'{col} parents/children')  
    bar_list.append(bar)
    cumval += dfy[col]

# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0, 
        f'{height:.0f}', ha = 'center', va = 'center')

plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Number of Parents/Children Aboard)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()

# vertical bar chart
plt.figure(figsize=(10, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.10    # 0.25

width_list = [width*-3, width*-2, width*-1, 0, width, width*2, width*3]
bar_list = []

for i, col in enumerate(dfx.columns):   
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=f'{col} parents/children')  
    bar_list.append(bar)

# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list)    
for rect in rects:
    width = rect.get_width() 
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Number of Parents/Children Aboard)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

# The chart suggests that passengers with 1 parent/child aboard were slightly more likely to survive (65 vs 53).
# The chart suggests that passengers who boarded without parents or children were more likely to die (445 vs 233).

# %%

### Port of Embarkation:  C = Cherbourg, Q = Queenstown, S = Southampton
df['Embarked_copy'] = df['Embarked']
df['Embarked_copy'] = df['Embarked_copy'].fillna('S')
#df.isnull().sum()  # Embarked          2 => 0
# get a new dataframe
dfx = get_dataframe(df, 'Embarked_copy')
# dfx
# -----------------------
# index     S	C	Q
# Survived	219	93	30
# Dead	    427	75	47
dfx = dfx.fillna(0)
dfx.isnull().sum()

# stacked bar chart
plt.style.use('seaborn')            # plt.style.available
plt.rcParams['font.family'] = 'Meiryo'  
plt.figure(figsize=(6,5))

label_list = ['Southampton','Cherbourg','Queenstown']
bar_list = []
cumval = 0

for i, col in enumerate(dfx.columns):
    bar = plt.bar(dfx.index, dfx[col], bottom=cumval, label=label_list.pop(0))  
    bar_list.append(bar)
    cumval += dfx[col]

# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0, 
        f'{height:.0f}', ha = 'center', va = 'center')

plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Port of Embarkation)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()

# vertical bar chart
plt.figure(figsize=(6, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.25    # 0.25

width_list = [width*-1, 0, width]
label_list = ['Southampton','Cherbourg','Queenstown']
bar_list = []

for i, col in enumerate(dfx.columns):   
    bar = plt.bar(x_indexes + width_list.pop(0), dfx.iloc[:,i], width=width, label=label_list.pop(0))  
    bar_list.append(bar)

# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list)    
for rect in rects:
    width = rect.get_width() 
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Port of Embarkation)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

# The chart suggests that passengers who boarded at C (Cherbourg) were slightly more likely to survive.
# The chart suggests that passengers who boarded at Q (Queenstown) were more likely to die.
# The chart suggests that passengers who boarded at S (Southampton) were more likely to die.

# %%

### Title Mapping
# Extract Title from the Name column
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)  # raw string avoids an invalid-escape warning
#print('Title:')
#print(df['Title'].value_counts())

title_mapping = {'Mr': 0, 'Miss': 1, 'Mrs': 2, 
                 'Master': 3, 'Dr': 3, 'Rev': 3, 'Col': 3, 'Major': 3, 'Mlle': 3,'Countess': 3,
                 'Ms': 3, 'Lady': 3, 'Jonkheer': 3, 'Don': 3, 'Dona' : 3, 'Mme': 3,'Capt': 3,'Sir': 3 }

df['Title_map'] = df['Title'].map(title_mapping)
df[['Title', 'Title_map']]
#df['Title_map'] = df['Title_map'].fillna(0)
#df.isnull().sum()  # Title_map          

# get a new dataframe
dfx = get_dataframe(df, 'Title_map')
# dfx
# -----------------------
# index	    0	1	2	3
# Survived	81	127	99	35
# Dead	    436	55	26	32
#dfx = dfx.fillna(0)
dfx.isnull().sum()

# stacked bar chart
plt.style.use('seaborn')            # plt.style.available
plt.rcParams['font.family'] = 'Meiryo'  
plt.figure(figsize=(6,5))

label_list = ['Mr','Miss','Mrs','Misc']
bar_list = []
cumval = 0

for i, col in enumerate(dfx.columns):
    bar = plt.bar(dfx.index, dfx[col], bottom=cumval, label=label_list.pop(0))  
    bar_list.append(bar)
    cumval += dfx[col]

# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0, 
        f'{height:.0f}', ha = 'center', va = 'center')

plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Title)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()

# Vertical bar chart
plt.figure(figsize=(8, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.15    # 0.25

width_list = [width*-1, 0, width, width*2]
label_list = ['Mr','Miss','Mrs','Misc']
bar_list = []

for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))  
    bar_list.append(bar)

# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)    
for rect in rects:
    width = rect.get_width() 
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Title)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()


# %%

### Sex Mapping male => 0, female => 1
sex_mapping = {'male': 0, 'female': 1}
df['Sex_map'] = df['Sex'].map(sex_mapping)

# get a new dataframe
dfx = get_dataframe(df, 'Sex_map')
# dfx
# -----------------------
# index	    0	1	
# Survived	109	233	
# Dead	    468	81	
#dfx = dfx.fillna(0)
#dfx.isnull().sum()

# vertical bar chart
plt.figure(figsize=(6,5))
x_indexes = np.arange(len(dfx.index))
width = 0.25    # 0.25

width_list = [0, width]
label_list = ['male','female']
bar_list = []

for i, col in enumerate(dfx.columns):   
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))  
    bar_list.append(bar)

# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)    
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Sex_map)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()


# %%

### Age Analysis 
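# Fill missing Age with the median Age of each Title_map group
df.isnull().sum()   # Age => 177 for train.csv (263 when train and test are combined)
# Assign the result back; fillna(inplace=True) on a selected column is deprecated in recent pandas
df['Age'] = df['Age'].fillna(df.groupby('Title_map')['Age'].transform('median'))
df.isnull().sum()   # Age => 0    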
#df.groupby('Title_map')['Age'].transform('median')

### => SKIP 

# # KDE(Kernel Density Estimation)
# #facet = sns.displot(x='Age', hue='Survived', kind='kde', data=df)
# facet = sns.displot(x='Age', hue='Survived', kind='kde', multiple='stack', data=df)
# #facet = sns.displot(x='Age', hue='Survived', kind='kde', fill=True, data=df)

# plt.title('Dead vs Survived Analysis\n(Age)')
# plt.show()

# # facet = sns.FacetGrid(df, hue="Survived",aspect=4)
# # facet.map(sns.kdeplot,'Age',shade= True)
# # facet.set(xlim=(0, df['Age'].max()))
# # facet.add_legend() 
# # plt.show()

# # facet = sns.FacetGrid(df, hue="Survived",aspect=4)
# # facet.map(sns.kdeplot,'Age',shade= True)
# # facet.set(xlim=(0, df['Age'].max()))
# # facet.add_legend() 
# # plt.xlim(10,50) # add
# # plt.show()

# Passengers aged 20 to 30 form the largest group among both the dead and the survivors.

# %%

### Age Mapping

# Binning/Converting Numerical Age to Categorical Variable
# feature vector map:
# child:    0
# young:    1
# adult:    2
# mid-age:  3
# senior:   4

def age_map(age):
    if age <= 16:
        return 0
    elif age > 16 and age <= 26:
        return 1
    elif age > 26 and age <= 36:
        return 2                        
    elif age > 36 and age <= 62:
        return 3             
    else:
        return 4

df['Age_map'] = df.loc[:, 'Age'].apply(age_map)

# get a new dataframe
dfx = get_dataframe(df, 'Age_map')

survived_mask = df['Survived'] == 1
dead_mask = df['Survived'] == 0
# dfx
#           0   1   2   3   4
# --------------------------------------
# index     2	1	3	0	4
# Survived	116	97	69	57	3
# Dead	    220	158	111	48	12
#dfx = dfx.fillna(0)
#dfx.isnull().sum()

# bar chart
plt.figure(figsize=(10,5))
x_indexes = np.arange(len(dfx.index))
width = 0.15    # 0.25

width_list = [width*-2, width*-1, 0, width*1, width*2]
label_list = ['adult','young','mid-age','child','senior']
bar_list = []

for i, col in enumerate(dfx.columns):   
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))  
    bar_list.append(bar)

# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)    
for rect in rects:
    width = rect.get_width() 
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Age Category)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()


# %%

### Port of Embarkation Analysis by Ticket Class (Pclass)
Pclass1_mask = df['Pclass'] == 1 
Pclass2_mask = df['Pclass'] == 2 
Pclass3_mask = df['Pclass'] == 3 

Pclass1 = df[Pclass1_mask]['Embarked_copy'].value_counts()
Pclass2 = df[Pclass2_mask]['Embarked_copy'].value_counts()
Pclass3 = df[Pclass3_mask]['Embarked_copy'].value_counts()
dfx = pd.DataFrame([Pclass1, Pclass2, Pclass3])
dfx.index = ['1st Class','2nd Class','3rd Class']

# dfx (these sample counts total 1309 rows, so they appear to come from train.csv and test.csv combined)
#           0   1   2
# -----------------------------
# index     S	C	Q
# 1st Class	179	141	3
# 2nd Class	242	28	7
# 3rd Class	495	101	113

# vertical bar chart
plt.figure(figsize=(6,5))
x_indexes = np.arange(len(dfx.index))
width = 0.25    # 0.25

width_list = [width*-1, 0, width*1]
label_list = ['Southampton','Cherbourg','Queenstown']
bar_list = []

for i, col in enumerate(dfx.columns):   
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))  
    bar_list.append(bar)

# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)    
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Ticket Class')
plt.ylabel('Number of Passenger')
plt.title('Port of Embarkation Analysis\n(Ticket Class)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

# More than 50% of 1st class passengers embarked at S (Southampton).
# More than 50% of 2nd class passengers embarked at S (Southampton).
# More than 50% of 3rd class passengers embarked at S (Southampton).

# %%

### Map the value of Embarked

# fill out missing embark with S embark
df['Embarked'] =  df['Embarked'].fillna('S')

embarked_mapping = {'S':0,'C':1,'Q':2}
df['Embarked_map'] = df['Embarked'].map(embarked_mapping)


# %%

### Map the value of Fare

# Fill missing Fare with median fare for each Pclass
df['Fare'] = df['Fare'].fillna(df.groupby('Pclass')['Fare'].transform('median'))
#df.head(5)

### => SKIP
# # KDE(Kernel Density Estimation)
# #facet = sns.displot(x='Fare', hue='Survived', kind='kde', data=df)
# facet = sns.displot(x='Fare', hue='Survived', kind='kde', multiple='stack', data=df)
# #facet = sns.displot(x='Fare', hue='Survived', kind='kde', fill=True, data=df)
# #plt.xlim(-50, 200) # add
# plt.title('Dead vs Survived Analysis\n(Fare)')
# plt.show() 

# # facet = sns.FacetGrid(df, hue='Survived',aspect=4 )
# # facet.map(sns.kdeplot, 'Fare', shade = True)
# # facet.set(xlim = (0, df['Fare'].max()))
# # facet.add_legend()
# # plt.show()

# # facet = sns.FacetGrid(df, hue="Survived",aspect=4)
# # facet.map(sns.kdeplot,'Fare',shade= True)
# # facet.set(xlim=(0, df['Fare'].max()))
# # facet.add_legend()
# # plt.xlim(0, 20) # add
# # plt.show()

### Map the Fare

def fare_map(fare):
    if fare <= 17:
        return 0
    elif fare > 17 and fare <= 30:
        return 1
    elif fare > 30 and fare <= 100:
        return 2
    else:
        return 3

df['Fare_map'] = df.loc[:, 'Fare'].apply(fare_map)
#df.head(5)


# %%

### Map the value of Cabin

df['Cabin_x'] =  df['Cabin'].str[:1]    # X999 => X

Pclass1_mask = df['Pclass'] == 1 
Pclass2_mask = df['Pclass'] == 2 
Pclass3_mask = df['Pclass'] == 3 

Pclass1 = df[Pclass1_mask]['Cabin_x'].value_counts()
Pclass2 = df[Pclass2_mask]['Cabin_x'].value_counts()
Pclass3 = df[Pclass3_mask]['Cabin_x'].value_counts()
dfx = pd.DataFrame([Pclass1, Pclass2, Pclass3])
dfx = dfx.fillna(0)
dfx.index = ['1st class','2nd class', '3rd class']

# dfx                               *  
#           0       1       2       3       4       5   6       7      
# ------------------------------------------------------------------------
# index     C	    B	    D	    E	    A	    T	F	    G
# 1st class	94.0	65.0	40.0	34.0	22.0	1.0	NaN	    NaN
# 2nd class	NaN	    NaN	    6.0	    4.0	    NaN	    NaN 13.0	NaN
# 3rd class	NaN	    NaN	    NaN	    3.0	    NaN	    NaN	8.0	    5.0

# vertical bar chart
plt.figure(figsize=(10,5))
x_indexes = np.arange(len(dfx.index))
width = 0.10    # 0.25

width_list = [width*-3,width*-2,width*-1, 0, width*1,width*2,width*3,width*4]
label_list = ['C','B','D','E','A','T','F','G']
bar_list = []

for i, col in enumerate(dfx.columns):   
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))  
    bar_list.append(bar)

# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)    
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')

plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Ticket Class')
plt.ylabel('Number of Passenger')
plt.title('Cabin Deck Analysis\n(Ticket Class)')
plt.legend(loc='upper right')
plt.grid(True)
plt.tight_layout()
plt.show()

### Fill missing Cabin_map

cabin_mapping = {'A': 0, 'B': 0.4, 'C': 0.8, 'D': 1.2, 'E': 1.6, 'F': 2, 'G': 2.4, 'T': 2.8}
df['Cabin_map'] = df['Cabin_x'].map(cabin_mapping)  # A => 0, B => 0.4

# Fill missing Cabin_map with median Cabin_map for each Pclass
#df = df.fillna(0)
df.isnull().sum()
df['Cabin_map'] = df['Cabin_map'].fillna(df.groupby('Pclass')['Cabin_map'].transform('median'))
df.isnull().sum()


# %%

### Map the value of Family Size 
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

### => SKIP
# # KDE(Kernel Density Estimation)
# #facet = sns.displot(x='FamilySize', hue='Survived', kind='kde', data=df)
# facet = sns.displot(x='FamilySize', hue='Survived', kind='kde', multiple='stack', data=df)
# #facet = sns.displot(x='FamilySize', hue='Survived', kind='kde', fill=True, data=df)

# plt.title('Dead vs Survived Analysis\n(Family Size)')
# plt.show()

# # facet = sns.FacetGrid(df, hue="Survived",aspect=4)
# # facet.map(sns.kdeplot,'FamilySize',shade= True)
# # facet.set(xlim=(0, df['FamilySize'].max()))
# # facet.add_legend()
# # plt.xlim(0)


family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5: 1.6, 6: 2, 7: 2.4, 8: 2.8, 9: 3.2, 10: 3.6, 11: 4}
df['FamilySize_map'] = df['FamilySize'].map(family_mapping)
#df
#df.isnull().sum()
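
The comments in the listing draw "more likely to survive" conclusions from raw counts. Because the groups differ in size, proportions make the comparison easier to read. A minimal sketch, assuming the counts printed for get_dataframe(df, 'Sex') in the listing above (the names `counts`, `rates`, and `survival_rate` are introduced here for illustration only):

```python
import pandas as pd

# Counts copied from the get_dataframe(df, 'Sex') output shown in the listing.
counts = pd.DataFrame(
    {'female': [233, 81], 'male': [109, 468]},
    index=['Survived', 'Dead'],
)

# Divide each column by its total to get the share of survivors and
# non-survivors within each group.
rates = counts / counts.sum()

# Keep only the survival share, rounded for display.
survival_rate = rates.loc['Survived'].round(3)
print(survival_rate)
```

With these counts, roughly 74% of female passengers survived compared with roughly 19% of male passengers. On the full DataFrame, `df.groupby('Sex')['Survived'].mean()` should give the same rates directly.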

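The listing repeats the same grouped bar chart steps for every column: compute bar offsets, draw the bars, attach count labels, then set ticks and legend. As a hedged sketch rather than the article's own method, those steps could be wrapped in a helper. The function name `plot_grouped_bars` and its parameters are hypothetical, the labels are looked up by column value instead of relying on `pop(0)` order, and `Axes.bar_label()` assumes Matplotlib 3.4 or later:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_grouped_bars(dfx, labels, title, width=0.25):
    """Draw one bar group per row of dfx and one bar per column (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(6, 5))
    x_indexes = np.arange(len(dfx.index))
    n_cols = len(dfx.columns)
    for i, col in enumerate(dfx.columns):
        # Center the group of bars on each tick position.
        offset = (i - (n_cols - 1) / 2) * width
        bars = ax.bar(x_indexes + offset, dfx[col], width=width,
                      label=labels.get(col, str(col)))
        # Write each bar's height above it (Matplotlib 3.4 or later).
        ax.bar_label(bars, fmt='%.0f')
    ax.set_xticks(x_indexes)
    ax.set_xticklabels(dfx.index)
    ax.set_xlabel('Survived vs Dead')
    ax.set_ylabel('Number of Passenger')
    ax.set_title(title)
    ax.legend(loc='upper left')
    fig.tight_layout()
    return fig, ax


# Counts copied from the get_dataframe(df, 'Pclass') output shown in the listing.
dfx = pd.DataFrame({1: [136, 80], 2: [87, 97], 3: [119, 372]},
                   index=['Survived', 'Dead'])
labels = {1: '1st class', 2: '2nd class', 3: '3rd class'}
fig, ax = plot_grouped_bars(dfx, labels, 'Survived vs Dead Analysis\n(Ticket Class)')
```

Calling `plt.show()` afterwards displays the figure. Keying the labels by column value means the legend stays correct even if the column order of the DataFrame changes between pandas versions.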