-
Import the Python libraries first
Start Visual Studio Code (VSC), create a new file, and paste in lines 1-8.
Lines 2-6 import the Python libraries.
Line 8 suppresses Python warnings.
If a library is not installed yet, install it beforehand with "pip install".
### Import the libraries
from functools import reduce
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore')
Figure 1 shows the Visual Studio Code (VSC) screen.
-
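Load the Titanic survival data into a Pandas DataFrame
Line 2 defines the path of the CSV file.
To download the CSV file from this site instead, uncomment line 3 (remove the #).
Line 4 loads the CSV file into a Pandas DataFrame with the read_csv() function.
Line 5 copies the DataFrame (raw) into the variable df.
Note that an assignment such as "df = raw" does not copy anything: both names share the same DataFrame in memory.
In other words, the variable "df" is just another name (an alias) for the variable "raw".
So changing a value in DataFrame (df) naturally changes the value in DataFrame (raw) as well.
To avoid sharing memory, copy the DataFrame with its copy() method.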
### Load the data
csv_file = 'data/csv/titanic/train.csv'
#csv_file = 'https://money-or-ikigai.com/menu/python/article/data/titanic/train.csv'
raw = pd.read_csv(csv_file)
df = raw.copy()
Figure 2 shows the result of running the code.
The VSC interactive window displays the output of "df.info()" and "df.head(1)".
The CSV file contains data for 891 passengers.
The column "Survived" holds "0 = dead, 1 = survived".
The column "Sex" holds "male" or "female".
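The difference between "df = raw" and "df = raw.copy()" described above can be checked with a minimal sketch. The two-row DataFrame below is made up for illustration and is not part of the Titanic data.

```python
import pandas as pd

# Hypothetical two-row DataFrame standing in for `raw`
raw = pd.DataFrame({'Survived': [0, 1], 'Sex': ['male', 'female']})

alias = raw          # plain assignment: a second name for the same object
copied = raw.copy()  # copy(): an independent DataFrame

alias.loc[0, 'Survived'] = 1   # change a value through the alias

print(alias is raw)               # True: both names refer to one DataFrame
print(raw.loc[0, 'Survived'])     # 1: the change is visible through raw
print(copied.loc[0, 'Survived'])  # 0: the copy keeps the original value
```

Because the copy is independent, the original data in raw stays available for comparison after df has been modified.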
-
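Create age histograms split left and right (by gender)
Line 5 filters DataFrame (df) down to the female passengers.
Line 6 filters DataFrame (df) down to the male passengers.
Line 12 creates two subplots with Matplotlib's subplots() function.
Line 14 draws the female histogram on ax1.
Line 16 flips the x-axis so that the bars grow from right to left.
That is, the female histogram is drawn in the opposite direction to the male histogram.
As a result the female and male histograms appear as mirror images of each other.
Line 22 draws the male histogram on ax2.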
### Male vs Female Analysis by Age
male_mask = df['Sex'] == 'male'
female_mask = df['Sex'] == 'female'
female_df = df[female_mask]
male_df = df[male_mask]
plt.style.use('seaborn') # plt.style.available
plt.rcParams['font.family'] = 'Meiryo'
# Pass figsize to subplots(); a separate plt.figure() call would leave an extra empty figure
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(6, 5))
ax1.hist(female_df.Age, bins=10, color='r', edgecolor='white', linewidth=1.2, orientation='horizontal')
# Invert the order of x-axis values
ax1.set_xlim(ax1.get_xlim()[::-1])
# Move ticks to the right
ax1.yaxis.tick_right()
ax1.set_title('Female Passenger (女性乗船者)')
ax1.grid(False)
ax2.hist(male_df.Age, bins=10, color='b', edgecolor='white', linewidth=1.2, orientation='horizontal')
# Move ticks to the left
ax2.yaxis.tick_left()
ax2.set_title('Male Passenger (男性乗船者)')
ax2.grid(False)
plt.show()
Figure 3 shows the result of running the code.
The age histograms for women and men are displayed side by side.
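The mirroring step can also be written with Matplotlib's invert_xaxis(), which has the same effect as reversing the limits by hand. The sketch below is an alternative under stated assumptions, not the article's code: the age lists are made-up samples, and the Agg backend is selected only so that it runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

# Made-up ages standing in for female_df.Age and male_df.Age
female_ages = [22, 38, 26, 35, 4, 58]
male_ages = [22, 35, 54, 2, 20, 39, 14]

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(6, 5))

# Left panel: horizontal bars that grow from right to left
ax1.hist(female_ages, bins=5, color='r', edgecolor='white', orientation='horizontal')
ax1.invert_xaxis()       # same effect as set_xlim(get_xlim()[::-1])
ax1.yaxis.tick_right()   # put the age ticks next to the center line

# Right panel: horizontal bars that grow from left to right
ax2.hist(male_ages, bins=5, color='b', edgecolor='white', orientation='horizontal')
ax2.yaxis.tick_left()

left, right = ax1.get_xlim()
print(left > right)  # True: the left panel's x-axis now runs backwards
```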
-
Define the function "get_dataframe()" that builds a Pandas DataFrame
### Define user function
def get_dataframe(df, col):
    survived_mask = df['Survived'] == 1
    dead_mask = df['Survived'] == 0
    survived = df[survived_mask][col].value_counts()
    dead = df[dead_mask][col].value_counts()
    dfx = pd.DataFrame([survived, dead])
    dfx.index = ['Survived','Dead']
    return dfx
#dfx = get_dataframe(df, 'Sex')
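Figure 4 shows the VSC screen.
Here line 11 has been uncommented (the # removed) and run.
The VSC interactive window shows the structure and contents of the return value "dfx".
Calling the function as "get_dataframe(df, 'Sex')" returns a DataFrame (dfx) like the one in Figure 4.
The column "Survived" of DataFrame (df) stores the outcome as a number: "0 = dead, 1 = survived".
The column "Sex" of DataFrame (df) stores the gender: "female" or "male".
Line 5 stores the number of survivors by gender in the variable "survived" as a Pandas Series.
Specifically, it holds "female=233, male=109".
Likewise, line 6 stores the number of deaths by gender in the variable "dead" as a Pandas Series.
Specifically, it holds "female=81, male=468".
Line 7 passes the two Series stored in survived and dead to the Pandas DataFrame() constructor to build a DataFrame.
The resulting DataFrame (dfx) has the columns "female" and "male".
The column name "Sex" becomes the row label in the DataFrame's index.
Because both rows come from the same column, the index contains the label "Sex" twice.
Line 8 replaces the duplicated index labels (Sex, Sex) with (Survived, Dead).
This produces the DataFrame (dfx) shown in Figure 4, which is the return value.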
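As a cross-check on the function above, the same kind of table can also be built with pandas' crosstab() function. This is an alternative sketch rather than the article's method, and the six-row DataFrame is made-up sample data.

```python
import pandas as pd

# Made-up sample with the same two columns the function reads
sample = pd.DataFrame({
    'Survived': [1, 1, 0, 0, 0, 1],
    'Sex': ['female', 'female', 'male', 'male', 'female', 'male'],
})

# Count passengers for every (Survived, Sex) combination
table = pd.crosstab(sample['Survived'], sample['Sex'])

# Put survivors first and give the rows readable labels
table = table.reindex([1, 0])
table.index = ['Survived', 'Dead']
print(table)
```

Comparing the two results on the real data is a quick way to confirm that the counts returned by get_dataframe() are what you expect.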
-
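Create survival bar charts by gender (male, female)
Lines 15-16 create a stacked bar chart.
To stack the bars, add "bottom=dfx.female" to the arguments of Matplotlib's bar() function.
Lines 19-24 display the passenger counts on the bars.
Matplotlib's text() function is used to draw the counts.
Here the position is adjusted so that each count appears in the center of its bar segment.
Lines 40-41 create an ordinary (grouped) bar chart.
To keep the female and male bars from overlapping, adjust the first argument of Matplotlib's bar() function.
Here the female bars use "x_indexes" as the first argument and the male bars use "x_indexes+width", which shifts them sideways by one bar width.
Lines 44-48 display the passenger counts above the bars.
Matplotlib's text() function is used here as well.
Here each count is centered horizontally above its bar.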
### Survived vs Dead Analysis by Gender
# get a new dataframe
dfx = get_dataframe(df, 'Sex')
# dfx
# --------------------------
# index female male
# Survived 233 109
# Dead 81 468
# Stacked bar chart
plt.style.use('seaborn') # plt.style.available
plt.rcParams['font.family'] = 'Meiryo'
plt.figure(figsize=(6,5))
bar1 = plt.bar(dfx.index, dfx.female, color='r', label='female')
bar2 = plt.bar(dfx.index, dfx.male, bottom=dfx.female, color='g', label='male')
# Attach text labels
rects = bar1 + bar2
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0,
             f'{height:.0f}', ha='center', va='center')
plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Gender)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()
# Vertical bar chart
plt.figure(figsize=(6, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.25
bar1 = plt.bar(x_indexes, dfx.female, width=width, color='r', label='female')
bar2 = plt.bar(x_indexes+width, dfx.male, width=width, color='g', label='male')
# Attach text labels
rects = bar1 + bar2
for rect in rects:
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width() / 2.0, height,
             f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Gender)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
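Figure 5 shows the result of running the code.
The VSC interactive window shows the stacked bar chart and the ordinary bar chart.
The passenger counts are displayed on the bars.
Showing the numbers on the chart makes the values easier to compare.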
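The label placement used in both charts comes down to simple arithmetic on each rectangle. The helper below is a hypothetical illustration (label_position is not a Matplotlib function); its arguments are the values that get_x(), get_y(), get_width() and get_height() would return for a bar.

```python
def label_position(x, y, width, height, inside=True):
    """Return the (x, y) point where a bar's count label is anchored.

    inside=True  -> center of the bar segment (stacked chart, va='center')
    inside=False -> top edge of the bar (grouped chart, va='bottom')
    """
    label_x = x + width / 2.0
    label_y = y + height / 2.0 if inside else y + height
    return label_x, label_y

# A stacked segment that starts at y=233 and is 109 tall (male survivors)
print(label_position(-0.4, 233, 0.8, 109))                # (0.0, 287.5)
# A grouped bar that starts at y=0 and is 468 tall (male deaths)
print(label_position(0.125, 0, 0.25, 468, inside=False))  # (0.25, 468)
```

In the grouped chart every bar starts at y=0, which is why the listing can pass height directly as the y coordinate.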
-
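Create survival bar charts by ticket class (1st, 2nd, 3rd class)
Lines 26 and 56 use functools.reduce() with a lambda expression to join the bar containers into a single tuple.
Each call to bar() returns a container that behaves like a tuple of rectangles, so adding two containers concatenates them.
The statement "rects = reduce(lambda x, y: x + y, bar_list)"
is a shorter way of writing
"rects = (bar_list[0][0], bar_list[0][1], bar_list[1][0], bar_list[1][1], bar_list[2][0], bar_list[2][1])".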
### Survived vs Dead Analysis by Ticket Class (1st, 2nd, 3rd)
# Pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
# get a new dataframe
dfx = get_dataframe(df, 'Pclass')
# dfx
# --------------------------
# index 1 2 3
# Survived 136 87 119
# Dead 80 97 372
# Stacked bar chart
plt.style.use('seaborn') # plt.style.available
plt.rcParams['font.family'] = 'Meiryo'
plt.figure(figsize=(6,5))
label_list = ['1st class','2nd class','3rd class']
bar_list = []
cumval = 0
for i, col in enumerate(dfx.columns):
    bar = plt.bar(dfx.index, dfx[col], bottom=cumval, label=label_list.pop(0))
    bar_list.append(bar)
    cumval += dfx[col]
# Attach text labels
rects = reduce(lambda x, y: x + y, bar_list)
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0,
             f'{height:.0f}', ha='center', va='center')
plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Ticket Class)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()
# Vertical bar chart
plt.figure(figsize=(6, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.25
width_list = [width*-1, 0, width*1]
label_list = ['1st class','2nd class','3rd class']
bar_list = []
for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))
    bar_list.append(bar)
# Attach text labels
rects = reduce(lambda x, y: x + y, bar_list)
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height,
             f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Ticket Class)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
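Figure 6 shows the result of running the code.
The VSC interactive window shows the stacked bar chart and the ordinary bar chart.
In the stacked bar chart, the counts are displayed inside the bars.
In the ordinary bar chart, the counts are displayed above the bars.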
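What the reduce() call in this section does can be seen without drawing anything. In the sketch below, plain tuples of strings stand in for the containers returned by plt.bar(); the strings are placeholders, not real rectangle objects.

```python
from functools import reduce
from itertools import chain

# Stand-ins for the containers returned by plt.bar(): one tuple per ticket class,
# each holding one entry for the Survived bar and one for the Dead bar
bar_list = [('1st-S', '1st-D'), ('2nd-S', '2nd-D'), ('3rd-S', '3rd-D')]

# reduce() applies x + y pairwise from left to right: (bar1 + bar2) + bar3
rects = reduce(lambda x, y: x + y, bar_list)
print(rects)
# ('1st-S', '1st-D', '2nd-S', '2nd-D', '3rd-S', '3rd-D')

# itertools.chain produces the same flat sequence without a lambda
flat = tuple(chain.from_iterable(bar_list))
print(flat == rects)  # True
```

Either form lets a single loop attach labels to every rectangle, however many bar() calls were made.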
-
Create survival bar charts by number of siblings/spouses
### SibSp: number of siblings / spouses aboard the Titanic
#df['SibSp'] = df['SibSp'].fillna(0)
#df.isnull().sum()
# get a new dataframe
dfx = get_dataframe(df, 'SibSp')
# dfx
# --------------------------------------------------------------------
# index 0 1 2 3 4 5 8
# Survived 210.0 112.0 13.0 4.0 3.0 0.0 0.0
# Dead 398.0 97.0 15.0 12.0 15.0 5.0 7.0
dfx = dfx.fillna(0)
dfx.isnull().sum()
dfy = dfx.iloc[:,0:3]
# Stacked bar chart
plt.style.use('seaborn') # plt.style.available
plt.rcParams['font.family'] = 'Meiryo'
plt.figure(figsize=(6,5))
label_list = ['0 Sibling/Spouses','1 Sibling/Spouses','2 Sibling/Spouses']
bar_list = []
cumval = 0
for i, col in enumerate(dfy.columns):
    bar = plt.bar(dfy.index, dfy[col], bottom=cumval, label=label_list.pop(0))
    bar_list.append(bar)
    cumval += dfx[col]
# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0,
             f'{height:.0f}', ha = 'center', va = 'center')
plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Number of Siblings/Spouses Aboard)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()
# Vertical bar chart
plt.figure(figsize=(10, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.10 # 0.25
width_list = [width*-3, width*-2, width*-1, 0, width, width*2, width*3]
label_list = ['0 Sibling/Spouses','1 Sibling/Spouses','2 Sibling/Spouses',
'3 Sibling/Spouses','4 Sibling/Spouses','5 Sibling/Spouses','8 Sibling/Spouses']
bar_list = []
for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))
    bar_list.append(bar)
# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # bar1+bar2+bar3+bar4+bar5+bar6+bar7
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Number of Siblings/Spouses Aboard)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
Figure 7 shows the result of running the code.
The VSC interactive window shows the stacked bar chart and the ordinary bar chart.
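The hand-written width_list in the grouped chart follows a simple pattern: the offsets are spaced one bar width apart and centered on the tick. A small helper can generate them for any number of bars. The name bar_offsets is hypothetical and does not appear in the article's code.

```python
def bar_offsets(n_bars, width):
    """Return x offsets that center n_bars bars of the given width on a tick."""
    middle = (n_bars - 1) / 2.0
    return [(i - middle) * width for i in range(n_bars)]

# Three bars per group, as in the ticket-class chart
print(bar_offsets(3, 0.25))  # [-0.25, 0.0, 0.25]

# Seven bars per group, as in the siblings/spouses chart
offsets = bar_offsets(7, 0.10)
print(len(offsets))  # 7
```

Generating the offsets this way keeps the list in step with the number of columns, so adding or removing a category does not require editing the list by hand.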
-
Create survival bar charts by number of parents/children
### Parch: number of parents / children aboard the Titanic
#df['Parch'] = df['Parch'].fillna(0)
#df.isnull().sum()
# get a new dataframe
dfx = get_dataframe(df, 'Parch')
# dfx
# 0 1 2 3 4 5 6
# -------------------------------------------------------------------
# index
# Survived 233.0 65.0 40.0 3.0 0.0 1.0 0.0
# Dead 445.0 53.0 40.0 2.0 4.0 4.0 1.0
dfx = dfx.fillna(0)
dfx.isnull().sum()
dfy = dfx.iloc[:,0:3]
# stacked bar chart
plt.style.use('seaborn') # plt.style.available
plt.rcParams['font.family'] = 'Meiryo'
plt.figure(figsize=(6,5))
#label_list = []
bar_list = []
cumval = 0
for i, col in enumerate(dfy.columns):
    bar = plt.bar(dfy.index, dfy[col], bottom=cumval, label=f'{i} parents/children')
    bar_list.append(bar)
    cumval += dfx[col]
# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0,
             f'{height:.0f}', ha = 'center', va = 'center')
plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Number of Parents/Children Aboard)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()
# vertical bar chart
plt.figure(figsize=(10, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.10 # 0.25
width_list = [width*-3, width*-2, width*-1, 0, width, width*2, width*3]
bar_list = []
for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=f'{i} parents/children')
    bar_list.append(bar)
# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list)
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Number of Parents/Children Aboard)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
Figure 8 shows the result of running the code.
The VSC interactive window shows the stacked bar chart and the ordinary bar chart.
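In the stacked charts the variable cumval keeps a running total, so each segment starts where the previous one ended. The sketch below reproduces that bookkeeping with plain lists, using the Survived counts for 0, 1 and 2 parents/children from the dfx comment in this section.

```python
from itertools import accumulate

# Survived counts for 0, 1 and 2 parents/children (from the dfx comment)
heights = [233, 65, 40]

# Each segment starts at the sum of all segments below it
bottoms = [0] + list(accumulate(heights))[:-1]
print(bottoms)  # [0, 233, 298]

# The center of each segment is where the count label is drawn
centers = [b + h / 2.0 for b, h in zip(bottoms, heights)]
print(centers)  # [116.5, 265.5, 318.0]
```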
-
Create survival bar charts by port of embarkation (Southampton, Cherbourg, Queenstown)
### Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton
df['Embarked_copy'] = df['Embarked']
df['Embarked_copy'] = df['Embarked_copy'].fillna('S')
#df.isnull().sum() # Embarked 2 => 0
# get a new dataframe
dfx = get_dataframe(df, 'Embarked_copy')
# dfx
# ----------------------------------
# index S C Q
# Survived 219 93 30
# Dead 427 75 47
dfx = dfx.fillna(0)
dfx.isnull().sum()
# stacked bar chart
plt.style.use('seaborn') # plt.style.available
plt.rcParams['font.family'] = 'Meiryo'
plt.figure(figsize=(6,5))
label_list = ['Southampton','Cherbourg','Queenstown']
bar_list = []
cumval = 0
for i, col in enumerate(dfx.columns):
    bar = plt.bar(dfx.index, dfx[col], bottom=cumval, label=label_list.pop(0))
    bar_list.append(bar)
    cumval += dfx[col]
# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0,
             f'{height:.0f}', ha = 'center', va = 'center')
plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Port of Embarkation)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()
# vertical bar chart
plt.figure(figsize=(6, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.25 # 0.25
width_list = [width*-1, 0, width]
label_list = ['Southampton','Cherbourg','Queenstown']
bar_list = []
for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx.iloc[:,i], width=width, label=label_list.pop(0))
    bar_list.append(bar)
# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list)
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Port of Embarkation)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
Figure 9 shows the result of running the code.
The VSC interactive window shows the stacked bar chart and the ordinary bar chart.
-
Create survival bar charts by title (Mr, Miss, Mrs, ...)
### Title Mapping
# Extract Title from the Name column
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)  # raw string so that \. is not treated as a string escape
#print('Title:')
#print(df['Title'].value_counts())
title_mapping = {'Mr': 0, 'Miss': 1, 'Mrs': 2,
'Master': 3, 'Dr': 3, 'Rev': 3, 'Col': 3, 'Major': 3, 'Mlle': 3,'Countess': 3,
'Ms': 3, 'Lady': 3, 'Jonkheer': 3, 'Don': 3, 'Dona' : 3, 'Mme': 3,'Capt': 3,'Sir': 3 }
df['Title_map'] = df['Title'].map(title_mapping)
df[['Title', 'Title_map']]
#df['Title_map'] = df['Title_map'].fillna(0)
#df.isnull().sum() # Title_map
# get a new dataframe
dfx = get_dataframe(df, 'Title_map')
# dfx
# ----------------------------------
# index 0 1 2 3
# Survived 81 127 99 35
# Dead 436 55 26 32
#dfx = dfx.fillna(0)
dfx.isnull().sum()
# stacked bar chart
plt.style.use('seaborn') # plt.style.available
plt.rcParams['font.family'] = 'Meiryo'
plt.figure(figsize=(6,5))
label_list = ['Mr','Miss','Mrs','Misc']
bar_list = []
cumval = 0
for i, col in enumerate(dfx.columns):
    bar = plt.bar(dfx.index, dfx[col], bottom=cumval, label=label_list.pop(0))
    bar_list.append(bar)
    cumval += dfx[col]
# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3 + bar4
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0,
             f'{height:.0f}', ha = 'center', va = 'center')
plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Title)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()
# Vertical bar chart
plt.figure(figsize=(8, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.15 # 0.25
width_list = [width*-1, 0, width, width*2]
label_list = ['Mr','Miss','Mrs','Misc']
bar_list = []
for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))
    bar_list.append(bar)
# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Title)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
Figure 10 shows the result of running the code.
The VSC interactive window shows the stacked bar chart and the ordinary bar chart.
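The regular expression passed to str.extract() in this section can be tried on its own with Python's re module. The names below are made up, written in the "Surname, Title. Given names" layout of the Name column, and extract_title is a hypothetical helper used only for this illustration.

```python
import re

# Same pattern as in str.extract(): a space, a run of letters, then a literal dot
TITLE_PATTERN = re.compile(r' ([A-Za-z]+)\.')

def extract_title(name):
    """Return the title found in a passenger name, or None if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

print(extract_title('Doe, Mr. John'))        # Mr
print(extract_title('Roe, Miss. Jane Ann'))  # Miss
print(extract_title('No title here'))        # None
```

Names without a match end up as missing values in the Title column, which is worth checking before the titles are mapped to numbers.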
-
Create a survival bar chart by gender code (Sex_map)
### Sex Mapping male => 0, female => 1
sex_mapping = {'male': 0, 'female': 1}
df['Sex_map'] = df['Sex'].map(sex_mapping)
# get a new dataframe
dfx = get_dataframe(df, 'Sex_map')
# dfx
# -----------------------
# index 0 1
# Survived 109 233
# Dead 468 81
#dfx = dfx.fillna(0)
#dfx.isnull().sum()
# vertical bar chart
plt.figure(figsize=(6,5))
x_indexes = np.arange(len(dfx.index))
width = 0.25 # 0.25
width_list = [0, width]
label_list = ['male','female']
bar_list = []
for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))
    bar_list.append(bar)
# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Sex_map)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
Figure 11 shows the result.
Here the bar chart for the DataFrame column "Sex_map" is displayed.
The column "Sex_map" holds the values of the column "Sex" (male, female) mapped to (0, 1).
-
Create a survival bar chart by age group (adult, young, mid-age, child, senior)
### Age Analysis
# Fix Age NaN => median
df.isnull().sum() # Age => 177 for train.csv (891 rows - 714 non-null)
df['Age'] = df['Age'].fillna(df.groupby('Title_map')['Age'].transform('median'))  # assign the result back instead of using inplace=True on a column
df.isnull().sum() # Age => 0
### Age Mapping
# Binning/Converting Numerical Age to Categorical Variable
# feature vector map:
# child: 0
# young: 1
# adult: 2
# mid-age: 3
# senior: 4
def age_map(age):
    if age <= 16:
        return 0
    elif age > 16 and age <= 26:
        return 1
    elif age > 26 and age <= 36:
        return 2
    elif age > 36 and age <= 62:
        return 3
    else:
        return 4
df['Age_map'] = df.loc[:, 'Age'].apply(age_map)
# get a new dataframe
dfx = get_dataframe(df, 'Age_map')
survived_mask = df['Survived'] == 1
dead_mask = df['Survived'] == 0
# dfx
# -----------------------------------------
# index 2 1 3 0 4
# Survived 116 97 69 57 3
# Dead 220 158 111 48 12
#dfx = dfx.fillna(0)
#dfx.isnull().sum()
# bar chart
plt.figure(figsize=(10,5))
x_indexes = np.arange(len(dfx.index))
width = 0.15 # 0.25
width_list = [width*-2, width*-1, 0, width*1, width*2]
label_list = ['adult','young','mid-age','child','senior']
bar_list = []
for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))
    bar_list.append(bar)
# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Age Category)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
Figure 12 shows the result.
Here the bar chart for the DataFrame column "Age_map" is displayed.
The column "Age_map" holds the values of the column "Age" mapped to the numbers 0-4.
child (0), young (1), adult (2), mid-age (3), senior (4)
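The if/elif chain in age_map() amounts to finding which interval an age falls into. As an alternative sketch (not the article's code), the same lookup can be written with the standard library's bisect module. The names AGE_UPPER_BOUNDS, AGE_LABELS and age_category are made up for this example, and it assumes the missing ages have already been filled in.

```python
from bisect import bisect_left

AGE_UPPER_BOUNDS = [16, 26, 36, 62]  # inclusive upper bound of categories 0-3
AGE_LABELS = ['child', 'young', 'adult', 'mid-age', 'senior']

def age_category(age):
    """Return the same 0-4 code as the if/elif chain for a non-missing age."""
    return bisect_left(AGE_UPPER_BOUNDS, age)

# Check the boundaries: 16 is still a child, 16.5 is already young, and so on
for age in (16, 16.5, 26, 36, 62, 63):
    code = age_category(age)
    print(age, code, AGE_LABELS[code])
```

Keeping the boundaries in one list makes it easier to experiment with different age bands than editing each comparison by hand.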
-
Create a bar chart by ticket class and port of embarkation
### Map the value of Pclass
Pclass1_mask = df['Pclass'] == 1
Pclass2_mask = df['Pclass'] == 2
Pclass3_mask = df['Pclass'] == 3
Pclass1 = df[Pclass1_mask]['Embarked_copy'].value_counts()
Pclass2 = df[Pclass2_mask]['Embarked_copy'].value_counts()
Pclass3 = df[Pclass3_mask]['Embarked_copy'].value_counts()
dfx = pd.DataFrame([Pclass1, Pclass2, Pclass3])
dfx.index = ['1st Class','2nd Class','3rd Class']
# dfx
# ---------------------------------
# index S C Q
# 1st Class 179 141 3
# 2nd Class 242 28 7
# 3rd Class 495 101 113
# vertical bar chart
plt.figure(figsize=(6,5))
x_indexes = np.arange(len(dfx.index))
width = 0.25 # 0.25
width_list = [width*-1, 0, width*1]
label_list = ['Southampton','Cherbourg','Queenstown']
bar_list = []
for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))
    bar_list.append(bar)
# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Ticket Class)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
### Map the value of Embarked
# fill out missing embark with S embark
df['Embarked'] = df['Embarked'].fillna('S')
embarked_mapping = {'S':0,'C':1,'Q':2}
df['Embarked_map'] = df['Embarked'].map(embarked_mapping)
Figure 13 shows the result.
Here the breakdown of each ticket class (1st Class, 2nd Class, 3rd Class) is shown
by port of embarkation (Southampton, Cherbourg, Queenstown).
-
Map the fare (Fare) to a category (Fare_map)
### Map the value of Fare
# Fill missing Fare with median fare for each Pclass
df['Fare'] = df['Fare'].fillna(df.groupby('Pclass')['Fare'].transform('median'))  # assign the result back instead of using inplace=True on a column
#df.head(5)
### Map the Fare
def fare_map(fare):
    if fare <= 17:
        return 0
    elif fare > 17 and fare <= 30:
        return 1
    elif fare > 30 and fare <= 100:
        return 2
    else:
        return 3
df['Fare_map'] = df.loc[:, 'Fare'].apply(fare_map)
#df.head(5)
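Figure 14 shows the result.
Here the amounts in the DataFrame column "Fare" are mapped to the numbers 0-3 and stored in the new column "Fare_map".
0: $17 or less, 1: over $17 up to $30, 2: over $30 up to $100, 3: over $100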
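The fillna() line at the top of this section replaces each missing fare with the median fare of the same ticket class. The sketch below shows the mechanism on made-up data; the fares are invented for illustration and are not taken from the Titanic file.

```python
import pandas as pd

# Made-up fares; one value is missing in each ticket class
sample = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Fare': [80.0, 100.0, None, 7.0, 9.0, None],
})

# transform('median') returns one value per row: the median of that row's group
class_median = sample.groupby('Pclass')['Fare'].transform('median')

# Only the missing entries are replaced; existing fares are left alone
sample['Fare'] = sample['Fare'].fillna(class_median)
print(sample['Fare'].tolist())  # [80.0, 100.0, 90.0, 7.0, 9.0, 8.0]
```

Because transform() keeps the original row order and length, its result lines up with the column being filled, which is what makes the one-line fillna() call possible.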
-
Create a bar chart by ticket class and cabin ID
### Map the value of Cabin
df['Cabin_x'] = df['Cabin'].str[:1] # X999 => X
Pclass1_mask = df['Pclass'] == 1
Pclass2_mask = df['Pclass'] == 2
Pclass3_mask = df['Pclass'] == 3
Pclass1 = df[Pclass1_mask]['Cabin_x'].value_counts()
Pclass2 = df[Pclass2_mask]['Cabin_x'].value_counts()
Pclass3 = df[Pclass3_mask]['Cabin_x'].value_counts()
dfx = pd.DataFrame([Pclass1, Pclass2, Pclass3])
dfx = dfx.fillna(0)
dfx.index = ['1st class','2nd class', '3rd class']
# dfx
# ------------------------------------------------------------------------
# index C B D E A T F G
# 1st class 94.0 65.0 40.0 34.0 22.0 1.0 NaN NaN
# 2nd class NaN NaN 6.0 4.0 NaN NaN 13.0 NaN
# 3rd class NaN NaN NaN 3.0 NaN NaN 8.0 5.0
# vertical bar chart
plt.figure(figsize=(10,5))
x_indexes = np.arange(len(dfx.index))
width = 0.10 # 0.25
width_list = [width*-3,width*-2,width*-1, 0, width*1,width*2,width*3,width*4]
label_list = ['C','B','D','E','A','T','F','G']
bar_list = []
for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))
    bar_list.append(bar)
# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Cabin Class)')
plt.legend(loc='upper right')
plt.grid(True)
plt.tight_layout()
plt.show()
### Fill missing Cabin_map
cabin_mapping = {'A': 0, 'B': 0.4, 'C': 0.8, 'D': 1.2, 'E': 1.6, 'F': 2, 'G': 2.4, 'T': 2.8}
df['Cabin_map'] = df['Cabin_x'].map(cabin_mapping) # A => 0, B => 0.4
# Fill missing Cabin_map with median Cabin_map for each Pclass
#df = df.fillna(0)
df.isnull().sum()
df['Cabin_map'] = df['Cabin_map'].fillna(df.groupby('Pclass')['Cabin_map'].transform('median'))  # assign the result back instead of using inplace=True on a column
df.isnull().sum()
Figure 15 shows the result.
Here the breakdown of each ticket class (1st Class, 2nd Class, 3rd Class) is shown by cabin ID (C, B, D, E, A, T, F, G).
-
Map the family size (FamilySize) to a scaled value (FamilySize_map)
### Map the value of Family Size
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
### => SKIP
# # KDE(Kernel Density Estimation)
# #facet = sns.displot(x='FamilySize', hue='Survived', kind='kde', data=df)
# facet = sns.displot(x='FamilySize', hue='Survived', kind='kde', multiple='stack', data=df)
# #facet = sns.displot(x='FamilySize', hue='Survived', kind='kde', fill=True, data=df)
# plt.title('Dead vs Survived Analysis\n(Family Size)')
# plt.show()
# # facet = sns.FacetGrid(df, hue="Survived",aspect=4)
# # facet.map(sns.kdeplot,'FamilySize',shade= True)
# # facet.set(xlim=(0, df['FamilySize'].max()))
# # facet.add_legend()
# # plt.xlim(0)
family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5: 1.6, 6: 2, 7: 2.4, 8: 2.8, 9: 3.2, 10: 3.6, 11: 4}
df['FamilySize_map'] = df['FamilySize'].map(family_mapping)
#df
#df.isnull().sum()
Figure 16 shows the result.
Here the values of the DataFrame column "FamilySize" are mapped to numbers in the range 0-4 and stored in the new column "FamilySize_map".
1:0, 2:0.4, 3:0.8, 4:1.2, 5:1.6, 6:2, 7:2.4, 8:2.8, 9:3.2, 10:3.6, 11:4
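The mapping listed above follows a simple rule: 0 for a passenger travelling alone, plus 0.4 for each additional family member. The sketch below regenerates the dictionary from that rule and compares it with the hand-written one; the variable name generated is introduced here for illustration only.

```python
# The hand-written mapping from the listing
family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5: 1.6, 6: 2,
                  7: 2.4, 8: 2.8, 9: 3.2, 10: 3.6, 11: 4}

# Same values generated from the rule "0.4 per additional family member";
# round() removes floating-point noise such as 1.2000000000000002
generated = {size: round((size - 1) * 0.4, 1) for size in range(1, 12)}

print(generated == family_mapping)  # True
```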
-
The complete code explained in this article
Finally, all of the code explained here is listed together below for reference.
### Import the libraries
from functools import reduce
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import seaborn as sns
import warnings
warnings.simplefilter('ignore')
#sns.set() # setting seaborn default for plots
# %%
### Load the data
csv_file = 'data/csv/titanic/train.csv'
#csv_file = 'https://money-or-ikigai.com/menu/python/article/data/titanic/train.csv'
raw = pd.read_csv(csv_file)
#test_file = 'data/csv/titanic/test.csv'
#test = pd.read_csv(test_file)
df = raw.copy()
#df.shape
# (1309, 12) 1309 rows, 12 columns
#df.info()
# Int64Index: 1309 entries, 0 to 417
# Data columns (total 12 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 PassengerId 891 non-null int64
# 1 Survived 891 non-null int64 0 = No, 1 = Yes
# 2 Pclass 891 non-null int64 Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
# 3 Name 891 non-null object
# 4 Sex 891 non-null object
# 5 Age 714 non-null float64
# 6 SibSp 891 non-null int64 # of siblings / spouses aboard the Titanic
# 7 Parch 891 non-null int64 # of parents / children aboard the Titanic
# 8 Ticket 891 non-null object Ticket number
# 9 Fare 891 non-null float64 Passenger fare
# 10 Cabin 204 non-null object Cabin number
# 11 Embarked 889 non-null object Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
# dtypes: float64(2), int64(5), object(5)
#df[['Survived','Pclass','Age','SibSp','Fare']].describe()
# Survived Pclass Age SibSp Fare
# --------------------------------------------------------------------------
# count 891.000000 1309.000000 1046.000000 1309.000000 1308.000000
# mean 0.383838 2.294882 29.881138 0.498854 33.295479 ★
# std 0.486592 0.837836 14.413493 1.041658 51.758668
# min 0.000000 1.000000 0.170000 0.000000 0.000000 ★
# 25% 0.000000 2.000000 21.000000 0.000000 7.895800
# 50% 0.000000 3.000000 28.000000 0.000000 14.454200
# 75% 1.000000 3.000000 39.000000 1.000000 31.275000
# max 1.000000 3.000000 80.000000 8.000000 512.329200 ★
#df.isnull().sum()
# PassengerId 0
# Survived 418 ★
# Pclass 0
# Name 0
# Sex 0
# Age 263 ★
# SibSp 0
# Parch 0
# Ticket 0
# Fare 1
# Cabin 1014 ★
# Embarked 2
# dtype: int64
# %%
### Male vs Female Analysis by Age
male_mask = df['Sex'] == 'male'
female_mask = df['Sex'] == 'female'
female_df = df[female_mask]
male_df = df[male_mask]
plt.style.use('seaborn') # plt.style.available
plt.rcParams['font.family'] = 'Meiryo'
# Pass figsize to subplots(); a separate plt.figure() call would leave an extra empty figure
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(6, 5))
ax1.hist(female_df.Age, bins=10, color='r', edgecolor='white', linewidth=1.2, orientation='horizontal')
#invert the order of x-axis values
ax1.set_xlim(ax1.get_xlim()[::-1])
#move ticks to the right
ax1.yaxis.tick_right()
ax1.set_title('Female Passenger (女性乗船者)')
ax1.grid(False)
ax2.hist(male_df.Age, bins=10, color='b', edgecolor='white', linewidth=1.2, orientation='horizontal')
#move ticks to the left
ax2.yaxis.tick_left()
ax2.set_title('Male Passenger (男性乗船者)')
ax2.grid(False)
plt.show()
# %%
### Define user function
def get_dataframe(df, col):
    survived_mask = df['Survived'] == 1
    dead_mask = df['Survived'] == 0
    survived = df[survived_mask][col].value_counts()
    dead = df[dead_mask][col].value_counts()
    dfx = pd.DataFrame([survived, dead])
    dfx.index = ['Survived','Dead'] # rename index
    return dfx
dfx = get_dataframe(df, 'Sex')
# %%
### Survived vs Dead Analysis by Gender
# get a new dataframe
dfx = get_dataframe(df, 'Sex')
# dfx
# --------------------------
# index female male
# Survived 233 109
# Dead 81 468
# Stacked bar chart
plt.style.use('seaborn') # plt.style.available
plt.rcParams['font.family'] = 'Meiryo'
plt.figure(figsize=(6,5))
bar1 = plt.bar(dfx.index, dfx.female, color='r', label='female')
bar2 = plt.bar(dfx.index, dfx.male, bottom=dfx.female, color='g', label='male')
# Attach text labels.
rects = bar1 + bar2
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0,
             f'{height:.0f}', ha = 'center', va = 'center')
plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Gender)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()
# Vertical bar chart
plt.figure(figsize=(6, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.25
bar1 = plt.bar(x_indexes, dfx.female, width=width, color='r', label='female')
bar2 = plt.bar(x_indexes + width, dfx.male, width=width, color='g', label='male')
# Attach text labels.
rects = bar1 + bar2
for rect in rects:
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width() / 2.0, height, f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Gender)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
# The chart confirms that women were more likely to survive than men.
# %%
### Survived vs Dead Analysis by Ticket Class (1st, 2nd, 3rd)
# Pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
# get a new dataframe
dfx = get_dataframe(df, 'Pclass')
# dfx
# --------------------------
# index 1 2 3
# Survived 136 87 119
# Dead 80 97 372
# Stacked bar chart
plt.style.use('seaborn') # plt.style.available
plt.rcParams['font.family'] = 'Meiryo'
plt.figure(figsize=(6,5))
label_list = ['1st class','2nd class','3rd class']
bar_list = []
cumval = 0
for i, col in enumerate(dfx.columns):
    bar = plt.bar(dfx.index, dfx[col], bottom=cumval, label=label_list.pop(0))
    bar_list.append(bar)
    cumval += dfx[col]
# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0,
             f'{height:.0f}', ha = 'center', va = 'center')
plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Ticket Class)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()
# Vertical bar chart
plt.figure(figsize=(6, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.25
width_list = [width*-1, 0, width*1]
label_list = ['1st class','2nd class','3rd class']
bar_list = []
for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))
    bar_list.append(bar)
# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height,
             f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Ticket Class)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
# The chart confirms that 1st class passengers were more likely to survive than the other classes.
# The chart confirms that 3rd class passengers were more likely to die than the other classes.
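The comparison above reads survival off raw counts. Dividing each column by its total gives the survival share per ticket class directly. This is a minimal sketch, assuming the counts from the `dfx` comment above are typed in by hand; the names `counts` and `survival_rate` are introduced here for illustration only.

```python
import pandas as pd

# Counts copied from the dfx comment above (columns are ticket classes).
counts = pd.DataFrame(
    {1: [136, 80], 2: [87, 97], 3: [119, 372]},
    index=['Survived', 'Dead'],
)

# Share of survivors within each ticket class (column-wise division).
survival_rate = counts.loc['Survived'] / counts.sum()
print(survival_rate.round(3))
```

With the real data, the same division could be applied to the `dfx` returned by `get_dataframe()`, provided its rows are labeled `Survived` and `Dead` as the comment suggests.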
# %%
### SibSp: number of siblings / spouses aboard the Titanic
#df['SibSp'] = df['SibSp'].fillna(0)
#df.isnull().sum()
# get a new dataframe
dfx = get_dataframe(df, 'SibSp')
# dfx
# 0 1 2 3 4 5 6
# ------------------------------------------------------------
# index 0 1 2 3 4 5 8
# Survived 210.0 112.0 13.0 4.0 3.0 0.0 0.0
# Dead 398.0 97.0 15.0 12.0 15.0 5.0 7.0
dfx = dfx.fillna(0)
dfx.isnull().sum()
dfy = dfx.iloc[:,0:3]
# Stacked bar chart
plt.style.use('seaborn') # Matplotlib 3.6+ renames this style to 'seaborn-v0_8'; see plt.style.available
plt.rcParams['font.family'] = 'Meiryo'
plt.figure(figsize=(6,5))
label_list = ['0 Sibling/Spouses','1 Sibling/Spouses','2 Sibling/Spouses']
bar_list = []
cumval = 0
for i, col in enumerate(dfy.columns):
    bar = plt.bar(dfy.index, dfy[col], bottom=cumval, label=label_list.pop(0))
    bar_list.append(bar)
    cumval += dfy[col]  # accumulate from the same frame that is plotted
# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0,
             f'{height:.0f}', ha='center', va='center')
plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Number of Siblings/Spouses Aboard)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()
# Vertical bar chart
plt.figure(figsize=(10, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.10 # 0.25
width_list = [width*-3, width*-2, width*-1, 0, width, width*2, width*3]
label_list = ['0 Sibling/Spouses','1 Sibling/Spouses','2 Sibling/Spouses',
'3 Sibling/Spouses','4 Sibling/Spouses','5 Sibling/Spouses','8 Sibling/Spouses']
bar_list = []
for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))
    bar_list.append(bar)
# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # bar1+bar2+bar3+bar4+bar5+bar6+bar7
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Number of Siblings/Spouses Aboard)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
# The chart suggests passengers with 1 sibling/spouse aboard were more likely to survive (112 survived vs 97 dead).
# The chart suggests passengers with no siblings/spouses aboard were more likely to die (398 dead vs 210 survived).
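The hand-written `width_list` above has to be retyped whenever the number of bar groups changes. The offsets can instead be computed from the bar count. This is a minimal sketch; `bar_offsets` is a hypothetical helper introduced here, not part of the original script.

```python
import numpy as np

def bar_offsets(n_bars, width):
    """Return x offsets that center n_bars side-by-side bars on each tick."""
    return (np.arange(n_bars) - (n_bars - 1) / 2) * width

# Seven bars of width 0.10, matching the hand-written list above.
offsets = bar_offsets(7, 0.10)
print(offsets)
```

Inside the plotting loop, `x_indexes + offsets[i]` would then take the place of `x_indexes + width_list.pop(0)`.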
# %%
### Parch: number of parents / children aboard the Titanic
#df['Parch'] = df['Parch'].fillna(0)
#df.isnull().sum()
# get a new dataframe
dfx = get_dataframe(df, 'Parch')
# dfx
# --------------------------------------------------------
# index 0 1 2 3 4 5 6
# Survived 233.0 65.0 40.0 3.0 0.0 1.0 0.0
# Dead 445.0 53.0 40.0 2.0 4.0 4.0 1.0
dfx = dfx.fillna(0)
dfx.isnull().sum()
dfy = dfx.iloc[:,0:3]
# stacked bar chart
plt.style.use('seaborn') # Matplotlib 3.6+ renames this style to 'seaborn-v0_8'; see plt.style.available
plt.rcParams['font.family'] = 'Meiryo'
plt.figure(figsize=(6,5))
#label_list = []
bar_list = []
cumval = 0
for i, col in enumerate(dfy.columns):
    bar = plt.bar(dfy.index, dfy[col], bottom=cumval, label=f'{i} parents/children')
    bar_list.append(bar)
    cumval += dfy[col]  # accumulate from the same frame that is plotted
# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0,
             f'{height:.0f}', ha='center', va='center')
plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Number of Parents/Children Aboard)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()
# vertical bar chart
plt.figure(figsize=(10, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.10 # 0.25
width_list = [width*-3, width*-2, width*-1, 0, width, width*2, width*3]
bar_list = []
for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=f'{i} parents/children')
    bar_list.append(bar)
# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list)
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Number of Parents/Children Aboard)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
# The chart suggests passengers with 1 parent/child aboard were more likely to survive (65 survived vs 53 dead).
# The chart suggests passengers with no parents/children aboard were more likely to die (445 dead vs 233 survived).
# %%
### Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton
df['Embarked_copy'] = df['Embarked']
df['Embarked_copy'] = df['Embarked_copy'].fillna('S')
#df.isnull().sum() # Embarked 2 => 0
# get a new dataframe
dfx = get_dataframe(df, 'Embarked_copy')
# dfx
# -----------------------
# index S C Q
# Survived 219 93 30
# Dead 427 75 47
dfx = dfx.fillna(0)
dfx.isnull().sum()
# stacked bar chart
plt.style.use('seaborn') # Matplotlib 3.6+ renames this style to 'seaborn-v0_8'; see plt.style.available
plt.rcParams['font.family'] = 'Meiryo'
plt.figure(figsize=(6,5))
label_list = ['Southampton','Cherbourg','Queenstown']
bar_list = []
cumval = 0
for i, col in enumerate(dfx.columns):
    bar = plt.bar(dfx.index, dfx[col], bottom=cumval, label=label_list.pop(0))
    bar_list.append(bar)
    cumval += dfx[col]
# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0,
             f'{height:.0f}', ha='center', va='center')
plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Port of Embarkation)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()
# vertical bar chart
plt.figure(figsize=(6, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.25 # 0.25
width_list = [width*-1, 0, width]
label_list = ['Southampton','Cherbourg','Queenstown']
bar_list = []
for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx.iloc[:,i], width=width, label=label_list.pop(0))
    bar_list.append(bar)
# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list)
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Port of Embarkation)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
# The chart suggests passengers who embarked at C (Cherbourg) were slightly more likely to survive.
# The chart suggests passengers who embarked at Q (Queenstown) were more likely to die.
# The chart suggests passengers who embarked at S (Southampton) were more likely to die.
# %%
### Title Mapping
# Extract Title from the Name column
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)  # raw string avoids an invalid "\." escape warning
#print('Title:')
#print(df['Title'].value_counts())
title_mapping = {'Mr': 0, 'Miss': 1, 'Mrs': 2,
'Master': 3, 'Dr': 3, 'Rev': 3, 'Col': 3, 'Major': 3, 'Mlle': 3,'Countess': 3,
'Ms': 3, 'Lady': 3, 'Jonkheer': 3, 'Don': 3, 'Dona' : 3, 'Mme': 3,'Capt': 3,'Sir': 3 }
df['Title_map'] = df['Title'].map(title_mapping)
df[['Title', 'Title_map']]
#df['Title_map'] = df['Title_map'].fillna(0)
#df.isnull().sum() # Title_map
# get a new dataframe
dfx = get_dataframe(df, 'Title_map')
# dfx
# -----------------------
# index 0 1 2 3
# Survived 81 127 99 35
# Dead 436 55 26 32
#dfx = dfx.fillna(0)
dfx.isnull().sum()
# stacked bar chart
plt.style.use('seaborn') # Matplotlib 3.6+ renames this style to 'seaborn-v0_8'; see plt.style.available
plt.rcParams['font.family'] = 'Meiryo'
plt.figure(figsize=(6,5))
label_list = ['Mr','Miss','Mrs','Misc']
bar_list = []
cumval = 0
for i, col in enumerate(dfx.columns):
    bar = plt.bar(dfx.index, dfx[col], bottom=cumval, label=label_list.pop(0))
    bar_list.append(bar)
    cumval += dfx[col]
# Attach text labels.
rects = reduce(lambda x, y: x + y, bar_list) # rects = bar1 + bar2 + bar3 + bar4
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, rect.get_y() + height / 2.0,
             f'{height:.0f}', ha='center', va='center')
plt.legend()
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Title)')
plt.legend(loc='upper left')
plt.grid(False)
plt.tight_layout()
plt.show()
# Vertical bar chart
plt.figure(figsize=(8, 5))
x_indexes = np.arange(len(dfx.index))
width = 0.15 # 0.25
width_list = [width*-1, 0, width, width*2]
label_list = ['Mr','Miss','Mrs','Misc']
bar_list = []
for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))
    bar_list.append(bar)
# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Title)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
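The title extraction above depends on the `Name` column following a "Surname, Title. Given names" pattern. This is a small sketch of the same regex on a few sample names in that format. The names `names`, `sample_mapping` and `title_codes` are introduced here for illustration, and the fallback to 3 for unlisted titles is an assumption that mirrors the "Misc" group rather than the article's exact code.

```python
import pandas as pd

# Sample rows in the "Surname, Title. Given names" format of the Name column.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
    'Palsson, Master. Gosta Leonard',
])

# Capture the first word that follows a space and ends with a period.
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)

# Reduced mapping for illustration; titles missing from the dict fall back to 3.
sample_mapping = {'Mr': 0, 'Miss': 1, 'Mrs': 2}
title_codes = titles.map(lambda t: sample_mapping.get(t, 3))
print(pd.DataFrame({'Title': titles, 'Title_map': title_codes}))
```

Using `dict.get()` with a default avoids NaN values for titles that are not listed in the mapping.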
# %%
### Sex Mapping male => 0, female => 1
sex_mapping = {'male': 0, 'female': 1}
df['Sex_map'] = df['Sex'].map(sex_mapping)
# get a new dataframe
dfx = get_dataframe(df, 'Sex_map')
# dfx
# -----------------------
# index 0 1
# Survived 81 127
# Dead 436 55
#dfx = dfx.fillna(0)
#dfx.isnull().sum()
# vertical bar chart
plt.figure(figsize=(6,5))
x_indexes = np.arange(len(dfx.index))
width = 0.25 # 0.25
width_list = [0, width]
label_list = ['male','female']
bar_list = []
for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))
    bar_list.append(bar)
# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Sex_map)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
# %%
### Age Analysis
# Fix Age NaN => median
df.isnull().sum() # Age => 263
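# Assign the result back; inplace=True on a selected column may not update df in recent pandas versions.
df['Age'] = df['Age'].fillna(df.groupby('Title_map')['Age'].transform('median'))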
df.isnull().sum() # Age => 0
#df.groupby('Title_map')['Age'].transform('median')
### => SKIP
# # KDE(Kernel Density Estimation)
# #facet = sns.displot(x='Age', hue='Survived', kind='kde', data=df)
# facet = sns.displot(x='Age', hue='Survived', kind='kde', multiple='stack', data=df)
# #facet = sns.displot(x='Age', hue='Survived', kind='kde', fill=True, data=df)
# plt.title('Dead vs Survived Analysis\n(Age)')
# plt.show()
# # facet = sns.FacetGrid(df, hue="Survived",aspect=4)
# # facet.map(sns.kdeplot,'Age',shade= True)
# # facet.set(xlim=(0, df['Age'].max()))
# # facet.add_legend()
# # plt.show()
# # facet = sns.FacetGrid(df, hue="Survived",aspect=4)
# # facet.map(sns.kdeplot,'Age',shade= True)
# # facet.set(xlim=(0, df['Age'].max()))
# # facet.add_legend()
# # plt.xlim(10,50) # add
# # plt.show()
# Passengers aged 20 to 30 account for both the most deaths and the most survivors.
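The Age fix above fills each missing value with the median age of passengers who share the same `Title_map`. This is a minimal sketch of that pattern on a tiny hand-made frame. The values are illustrative placeholders, not rows from the Titanic file, and `sample` and `group_median` are names introduced here.

```python
import numpy as np
import pandas as pd

# Illustrative placeholder data: two title groups, one missing age in each.
sample = pd.DataFrame({
    'Title_map': [0, 0, 0, 1, 1, 1],
    'Age': [20.0, np.nan, 40.0, 10.0, np.nan, 30.0],
})

# transform('median') returns one value per row, aligned with the original index.
group_median = sample.groupby('Title_map')['Age'].transform('median')

# fillna() only uses the aligned median where Age is missing.
sample['Age'] = sample['Age'].fillna(group_median)
print(sample)
```

Because `transform()` keeps the original row index, the medians line up with the rows they fill without an explicit merge.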
# %%
### Age Mapping
# Binning/Converting Numerical Age to Categorical Variable
# feature vector map:
# child: 0
# young: 1
# adult: 2
# mid-age: 3
# senior: 4
def age_map(age):
    if age <= 16:
        return 0
    elif age <= 26:
        return 1
    elif age <= 36:
        return 2
    elif age <= 62:
        return 3
    else:
        return 4
df['Age_map'] = df.loc[:, 'Age'].apply(age_map)
# get a new dataframe
dfx = get_dataframe(df, 'Age_map')
survived_mask = df['Survived'] == 1
dead_mask = df['Survived'] == 0
# dfx
# 0 1 2 3 4
# --------------------------------------
# index 2 1 3 0 4
# Survived 116 97 69 57 3
# Dead 220 158 111 48 12
#dfx = dfx.fillna(0)
#dfx.isnull().sum()
# bar chart
plt.figure(figsize=(10,5))
x_indexes = np.arange(len(dfx.index))
width = 0.15 # 0.25
width_list = [width*-2, width*-1, 0, width*1, width*2]
label_list = ['adult','young','mid-age','child','senior']
bar_list = []
for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))
    bar_list.append(bar)
# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Survived vs Dead')
plt.ylabel('Number of Passenger')
plt.title('Survived vs Dead Analysis\n(Age Category)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
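The `age_map()` function above bins ages with an if/elif chain. `pd.cut()` can express the same right-closed bins in one call. This is a sketch under the assumption that Age has no missing values left: `pd.cut()` returns NaN for missing input, whereas `age_map()` would return 4. The names `ages`, `bins` and `age_codes` are introduced here.

```python
import numpy as np
import pandas as pd

# Sample ages chosen to include the bin edges 16, 26, 36 and 62.
ages = pd.Series([0.42, 16, 16.5, 26, 30, 36, 50, 62, 80])

# Right-closed bins: (-inf, 16], (16, 26], (26, 36], (36, 62], (62, inf)
bins = [-np.inf, 16, 26, 36, 62, np.inf]
age_codes = pd.cut(ages, bins=bins, labels=[0, 1, 2, 3, 4]).astype(int)
print(age_codes.tolist())
```

The same pattern would cover the Fare bins by swapping in the edges 17, 30 and 100.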
# %%
### Map the value of Pclass
Pclass1_mask = df['Pclass'] == 1
Pclass2_mask = df['Pclass'] == 2
Pclass3_mask = df['Pclass'] == 3
Pclass1 = df[Pclass1_mask]['Embarked_copy'].value_counts()
Pclass2 = df[Pclass2_mask]['Embarked_copy'].value_counts()
Pclass3 = df[Pclass3_mask]['Embarked_copy'].value_counts()
dfx = pd.DataFrame([Pclass1, Pclass2, Pclass3])
dfx.index = ['1st Class','2nd Class','3rd Class']
# dfx
# 0 1 2
# -----------------------------
# index S C Q
# 1st Class 179 141 3
# 2nd Class 242 28 7
# 3rd Class 495 101 113
# vertical bar chart
plt.figure(figsize=(6,5))
x_indexes = np.arange(len(dfx.index))
width = 0.25 # 0.25
width_list = [width*-1, 0, width*1]
label_list = ['Southampton','Cherbourg','Queenstown']
bar_list = []
for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))
    bar_list.append(bar)
# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Ticket Class')  # the x-axis groups are ticket classes, not survival status
plt.ylabel('Number of Passengers')
plt.title('Port of Embarkation Analysis\n(Ticket Class)')
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
# More than 50% of 1st class passengers embarked at S (Southampton).
# More than 50% of 2nd class passengers embarked at S (Southampton).
# More than 50% of 3rd class passengers embarked at S (Southampton).
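The class-by-port table above is assembled from three masks and three `value_counts()` calls. `pd.crosstab()` builds the same kind of table in one step. This is a minimal sketch on hand-made placeholder rows, not data from the Titanic file; `sample` and `table` are names introduced here.

```python
import pandas as pd

# Illustrative placeholder rows: ticket class and port of embarkation.
sample = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3, 3],
    'Embarked_copy': ['S', 'C', 'S', 'S', 'Q', 'S'],
})

# Rows are ticket classes, columns are ports, cells are passenger counts.
table = pd.crosstab(sample['Pclass'], sample['Embarked_copy'])
print(table)
```

Note that `crosstab()` sorts the columns alphabetically (C, Q, S) and fills absent combinations with 0, so a hand-written label list would need to follow that order.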
# %%
### Map the value of Embarked
# Fill missing Embarked values with 'S' (Southampton)
df['Embarked'] = df['Embarked'].fillna('S')
embarked_mapping = {'S':0,'C':1,'Q':2}
df['Embarked_map'] = df['Embarked'].map(embarked_mapping)
# %%
### Map the value of Fare
# Fill missing Fare with median fare for each Pclass
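# Assign the result back; inplace=True on a selected column may not update df in recent pandas versions.
df['Fare'] = df['Fare'].fillna(df.groupby('Pclass')['Fare'].transform('median'))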
#df.head(5)
### => SKIP
# # KDE(Kernel Density Estimation)
# #facet = sns.displot(x='Fare', hue='Survived', kind='kde', data=df)
# facet = sns.displot(x='Fare', hue='Survived', kind='kde', multiple='stack', data=df)
# #facet = sns.displot(x='Fare', hue='Survived', kind='kde', fill=True, data=df)
# #plt.xlim(-50, 200) # add
# plt.title('Dead vs Survived Analysis\n(Fare)')
# plt.show()
# # facet = sns.FacetGrid(df, hue='Survived',aspect=4 )
# # facet.map(sns.kdeplot, 'Fare', shade = True)
# # facet.set(xlim = (0, df['Fare'].max()))
# # facet.add_legend()
# # plt.show()
# # facet = sns.FacetGrid(df, hue="Survived",aspect=4)
# # facet.map(sns.kdeplot,'Fare',shade= True)
# # facet.set(xlim=(0, df['Fare'].max()))
# # facet.add_legend()
# # plt.xlim(0, 20) # add
# # plt.show()
### Map the Fare
def fare_map(fare):
    if fare <= 17:
        return 0
    elif fare <= 30:
        return 1
    elif fare <= 100:
        return 2
    else:
        return 3
df['Fare_map'] = df.loc[:, 'Fare'].apply(fare_map)
#df.head(5)
# %%
### Map the value of Cabin
df['Cabin_x'] = df['Cabin'].str[:1] # X999 => X
Pclass1_mask = df['Pclass'] == 1
Pclass2_mask = df['Pclass'] == 2
Pclass3_mask = df['Pclass'] == 3
Pclass1 = df[Pclass1_mask]['Cabin_x'].value_counts()
Pclass2 = df[Pclass2_mask]['Cabin_x'].value_counts()
Pclass3 = df[Pclass3_mask]['Cabin_x'].value_counts()
dfx = pd.DataFrame([Pclass1, Pclass2, Pclass3])
dfx = dfx.fillna(0)
dfx.index = ['1st class','2nd class', '3rd class']
# dfx *
# 0 1 2 3 4 5 6 7
# ------------------------------------------------------------------------
# index C B D E A T F G
# 1st class 94.0 65.0 40.0 34.0 22.0 1.0 NaN NaN
# 2nd class NaN NaN 6.0 4.0 NaN NaN 13.0 NaN
# 3rd class NaN NaN NaN 3.0 NaN NaN 8.0 5.0
# vertical bar chart
plt.figure(figsize=(10,5))
x_indexes = np.arange(len(dfx.index))
width = 0.10 # 0.25
width_list = [width*-3,width*-2,width*-1, 0, width*1,width*2,width*3,width*4]
label_list = ['C','B','D','E','A','T','F','G']
bar_list = []
for i, col in enumerate(dfx.columns):
    bar = plt.bar(x_indexes + width_list.pop(0), dfx[col], width=width, label=label_list.pop(0))
    bar_list.append(bar)
# Add counts above the bar graphs
rects = reduce(lambda x, y: x + y, bar_list)
for rect in rects:
    width = rect.get_width()
    height = rect.get_height()
    plt.text(rect.get_x() + width / 2.0, height, f'{height:.0f}', ha='center', va='bottom')
plt.legend()
plt.xticks(ticks=x_indexes, labels=dfx.index)
plt.xlabel('Ticket Class')  # the x-axis groups are ticket classes, not survival status
plt.ylabel('Number of Passengers')
plt.title('Cabin Deck Analysis\n(Ticket Class)')
plt.legend(loc='upper right')
plt.grid(True)
plt.tight_layout()
plt.show()
### Fill missing Cabin_map
cabin_mapping = {'A': 0, 'B': 0.4, 'C': 0.8, 'D': 1.2, 'E': 1.6, 'F': 2, 'G': 2.4, 'T': 2.8}
df['Cabin_map'] = df['Cabin_x'].map(cabin_mapping) # A => 0, B => 0.4
# Fill missing Cabin_map with median Cabin_map for each Pclass
#df = df.fillna(0)
df.isnull().sum()
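# Assign the result back; inplace=True on a selected column may not update df in recent pandas versions.
df['Cabin_map'] = df['Cabin_map'].fillna(df.groupby('Pclass')['Cabin_map'].transform('median'))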
df.isnull().sum()
# %%
### Map the value of Family Size
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
### => SKIP
# # KDE(Kernel Density Estimation)
# #facet = sns.displot(x='FamilySize', hue='Survived', kind='kde', data=df)
# facet = sns.displot(x='FamilySize', hue='Survived', kind='kde', multiple='stack', data=df)
# #facet = sns.displot(x='FamilySize', hue='Survived', kind='kde', fill=True, data=df)
# plt.title('Dead vs Survived Analysis\n(Family Size)')
# plt.show()
# # facet = sns.FacetGrid(df, hue="Survived",aspect=4)
# # facet.map(sns.kdeplot,'FamilySize',shade= True)
# # facet.set(xlim=(0, df['FamilySize'].max()))
# # facet.add_legend()
# # plt.xlim(0)
family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5: 1.6, 6: 2, 7: 2.4, 8: 2.8, 9: 3.2, 10: 3.6, 11: 4}
df['FamilySize_map'] = df['FamilySize'].map(family_mapping)
#df
#df.isnull().sum()
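The `family_mapping` dictionary above rises in steps of 0.4 from a family size of 1, so the same values can be computed arithmetically. This is a small sketch that compares the two approaches on a few sample sizes. `family_size`, `mapped` and `scaled` are names introduced here, and the sizes are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative family sizes, including both ends of the mapping.
family_size = pd.Series([1, 2, 5, 7, 11])

family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5: 1.6, 6: 2,
                  7: 2.4, 8: 2.8, 9: 3.2, 10: 3.6, 11: 4}

# Dictionary lookup, as in the script above.
mapped = family_size.map(family_mapping)

# Arithmetic equivalent: each extra family member adds 0.4.
scaled = (family_size - 1) * 0.4

print(np.allclose(mapped, scaled))
```

One difference worth keeping in mind: the dictionary yields NaN for sizes it does not list, while the formula extends to any size.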