Python: Machine Learning for Absolute Beginners in Python: Predicting Whether a Visitor Will Buy a Product [Machine Learning: Plot Decision Boundary]


This article explains how to use the Python library scikit-learn (sklearn) and machine learning (ML) to predict whether a person who visits a car dealership will buy a car (an SUV: Sport Utility Vehicle).

The terms AI (Artificial Intelligence), ML (Machine Learning) and DL (Deep Learning) come up a lot these days. AI is the umbrella term that includes ML and DL, ML in turn includes DL, and DL is built on neural networks. In other words, ML and DL are subsets of AI. DL (Deep Learning) is covered in detail in "Article028".

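ML (Machine Learning) is divided into three types: Supervised Learning, Unsupervised Learning and Reinforcement Learning. See Figures (B, C, D) for the concept behind each type. The problems ML solves are commonly grouped into Regression, Classification and Clustering. Regression and Classification belong to Supervised Learning, while Clustering belongs to Unsupervised Learning. See Figure (E) for the algorithms used for each.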

Whether a visitor to the dealership buys a car is a two-way choice, "buy" or "not buy", so an ML Classification algorithm is the right tool. Here five Classification algorithms (Support Vector Machine, Random Forest, Decision Tree, Naive Bayes, Logistic Regression) are used to predict whether a visitor will buy a car.

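The trick to analyzing data with an ML Classification algorithm is to visualize the data. Plotting the data with Matplotlib, as in Figure (F), shows how the data is distributed. As the red line in Figure (F) indicates, the boundary between people who "buy" and people who "do not buy" becomes visible.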

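Once the input data has been visualized, the next step is to visualize the predictions of several ML Classification algorithms. Figure (G) visualizes the predictions of an SVM (Support Vector Machine). The important point is to choose the algorithm whose boundary in a plot like Figure (G) is closest to the boundary seen in the input data. A predicted boundary close to the boundary of the input data suggests that the predictions are accurate. For reference, the accuracy scores were SVM (93%), Random Forest (92%), Decision Tree (90%), Naive Bayes (90%) and Logistic Regression (89%). In this example, the model whose boundary was closest to the data, SVM, was also the most accurate.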
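The comparison described above can be sketched as a single loop over the five classifiers. This is only an illustration: the data is a synthetic stand-in generated with NumPy (the labelling rule is hypothetical), not the article's suv_data.csv, so the scores it prints will not match the figures quoted above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for suv_data.csv (an assumption, not the real data)
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15_000, 150_001, size=400)
X = np.column_stack([age, salary]).astype(float)
# Hypothetical rule: older or better-paid visitors buy
y = ((age > 45) | (salary > 90_000)).astype(int)

# Same split and scaling steps as in the article
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

models = {
    'SVM': SVC(gamma='auto'),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=1),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Naive Bayes': GaussianNB(),
    'Logistic Regression': LogisticRegression(random_state=0),
}

# Train every model on the same data and record its test accuracy (%)
scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test)) * 100

# Print the models from most to least accurate
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {score:.1f}%')
```

Keeping the split and the scaler identical for every model means any difference in the scores comes from the algorithm alone.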

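The explanations here use the "Python Interactive window" of Visual Studio Code (VSC), which gives an environment similar to Jupyter Notebook. To switch VSC from the normal environment to the interactive one, type the comment "# %%" when writing code. For the detailed steps, see "here". In the interactive environment there is no need to call "print()", "plt.show()" and so on, so they are omitted from the code shown. When running the code in the normal VSC environment, add "print()", "plt.show()" and so on where needed.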

This article uses the Pandas and Matplotlib libraries, so install them in advance by following "Article001 | Article002 | Article003 | Article004". The Python code is entered in Microsoft's Visual Studio Code. If it is not installed yet, follow "Article001" to install it.


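Predicting with a Support Vector Machine (SVM)

Importing the Python libraries

After starting Visual Studio Code (VSC), create a new file and copy and paste lines 1-25. Lines 2-21 import the Python libraries. Line 23 suppresses Python warnings. Line 25 sets a font so that Matplotlib can display Japanese text.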

### Import the libraries
from os import terminal_size
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC # SVC: C-Support Vector Classification (RBF kernel by default)

import warnings

warnings.simplefilter('ignore')
# Set the font to support Japanese
plt.rcParams['font.family'] = 'Meiryo'

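Figure 1 shows the Visual Studio Code (VSC) window.

Loading the input data into a Pandas DataFrame
Line 2 defines the path of the CSV file. To load the file from this site instead, uncomment line 3 (remove the leading "#"). Line 4 loads the CSV file into a Pandas DataFrame. The CSV file contains 400 sales records. See Figure 2 for the structure and contents of the CSV file.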
```
### Load the data
csv_file = 'data/csv/tutorial/suv_data.csv'
#csv_file = 'https://money-or-ikigai.com/menu/python/article/data/article056/suv_data.csv'
df = pd.read_csv(csv_file)
```
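Figure 2

Figure 2 shows the VSC window. The interactive window on the right shows the structure and contents of the Pandas DataFrame. Here the first three rows are displayed. The CSV file contains the columns "User ID", "Gender", "Age", "EstimatedSalary" (estimated annual salary) and "Purchased" (0: did not purchase, 1: purchased).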
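Before plotting, it can help to check the columns and the class balance, and to pick the feature columns by name rather than by position. The sketch below builds a three-row DataFrame from the rows shown in the article's df.head(3) output, so it runs without the CSV file; with the real data, the same calls would be applied to the DataFrame returned by pd.read_csv.

```python
import numpy as np
import pandas as pd

# The three rows shown by df.head(3) in the article, typed in by hand
df = pd.DataFrame({
    'User ID': [15624510, 15810944, 15668575],
    'Gender': ['Male', 'Male', 'Female'],
    'Age': [19, 35, 26],
    'EstimatedSalary': [19000, 20000, 43000],
    'Purchased': [0, 0, 0],
})

print(df.dtypes)                        # column names and types
print(df['Purchased'].value_counts())   # how many rows fall in each class

# Select the features and the label by column name
X = df[['Age', 'EstimatedSalary']].to_numpy()
y = df['Purchased'].to_numpy()

# Same result as the positional selection df.iloc[:, [2, 3]] and df.iloc[:, 4]
same_as_iloc = (np.array_equal(X, df.iloc[:, [2, 3]].to_numpy())
                and np.array_equal(y, df.iloc[:, 4].to_numpy()))
print(X.shape, y.shape, same_as_iloc)
```

Selecting by name keeps working if the column order in the CSV file ever changes, which positional indexing does not guarantee.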
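Visualizing the input data to find the boundary
Line 2 takes the values of the "Age" and "EstimatedSalary" columns from the DataFrame and stores them in the variable "X". Line 3 takes the values of the "Purchased" column and stores them in the variable "y". Lines 5-12 create the scatter plot. Line 6 plots the people who did not buy a car (blue). Line 7 plots the people who bought a car (red).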
```
### Visualize input data 
X = df.iloc[:, [2,3]].values    # Age[0], EstimatedSalary[1]
y = df.iloc[:, 4].values        # Purchased (0 or 1)

plt.figure(figsize=(10, 6))  
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='b', label='0:Not Purchased') # Age 
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='r', label='1:Purchased')     # Salary
plt.legend() 
plt.xlabel('Age')
plt.ylabel('Estimated Salary ($)')
plt.title('Plot input data\n(Age vs Salary)')
plt.show()
```
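Figure 3

Figure 3 shows the scatter plot of the data. Blue means "did not buy" and red means "bought". What matters here is the boundary between the blue and red points. The boundary in the figure was drawn by hand, and finding this boundary is the key step.
Splitting the input data
Line 2 splits the input data into training data and test data, here at a ratio of 75 to 25. Lines 5-6 standardize the data (zero mean and unit variance per column). The scaler is fitted on the training data only and then applied to the test data.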
```
### Split the input data (75:25)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

sc = StandardScaler()
X_train = sc.fit_transform(X_train) 
X_test = sc.transform(X_test)
```
Figure 4

Figure 4 shows the shapes of X_train and X_test after the split. With 400 records split 75:25, X_train holds 300 training records and X_test holds 100 test records.
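The effect of the split and of StandardScaler can be checked directly. The following sketch uses 400 randomly generated rows in place of the article's CSV file (an assumption made so the example is self-contained), then prints the shapes and confirms that each standardized training column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 400 synthetic rows of (Age, EstimatedSalary); stand-in for the real data
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(18, 61, size=400),
    rng.integers(15_000, 150_001, size=400),
]).astype(float)
y = rng.integers(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

sc = StandardScaler()
X_train_s = sc.fit_transform(X_train)   # learn mean/std from training data
X_test_s = sc.transform(X_test)         # reuse the same mean/std

print(X_train_s.shape, X_test_s.shape)  # (300, 2) (100, 2)
train_mean = X_train_s.mean(axis=0)
train_std = X_train_s.std(axis=0)
print(np.allclose(train_mean, 0.0), np.allclose(train_std, 1.0))
print(sc.mean_, sc.scale_)              # the learned per-column mean and std
```

The test columns will not be exactly mean 0 and standard deviation 1, because they are scaled with the statistics learned from the training data.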
Training the model on the training data
Lines 2-3 train the model on the training data. The model used here is Support Vector Classification (SVC).
```
### Train the Support Vector Machine classifier
model = SVC(gamma='auto')
model.fit(X_train, y_train)
```
Figure 5

Figure 5 shows the window after training. Training took 0.4 seconds.
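For readers wondering what gamma='auto' does: according to the scikit-learn documentation, 'auto' sets gamma to 1 / n_features, while the default 'scale' uses 1 / (n_features * X.var()). On standardized training data the overall variance is 1, so the two settings should come out almost identical. The sketch below checks this on synthetic stand-in data (not the article's CSV file).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with a hypothetical labelling rule
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15_000, 150_001, size=400)
X = np.column_stack([age, salary]).astype(float)
y = ((age > 45) | (salary > 90_000)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

n_features = X_train.shape[1]
gamma_auto = 1.0 / n_features                     # value used by gamma='auto'
gamma_scale = 1.0 / (n_features * X_train.var())  # value used by gamma='scale'
print(gamma_auto, round(gamma_scale, 6))

model_auto = SVC(gamma='auto').fit(X_train, y_train)
model_scale = SVC(gamma='scale').fit(X_train, y_train)

# Fraction of test points on which the two models give the same answer
agreement = np.mean(model_auto.predict(X_test) == model_scale.predict(X_test))
print(model_auto.kernel, agreement)
```

The kernel printed is 'rbf', the SVC default, which is why the decision boundary can be curved rather than a straight line.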

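Visualizing the predictions

Line 17 predicts the class of every point on a grid that covers the training data. The grid runs from the minimum to the maximum of each feature, with 0.5 of padding on each side, in steps of 0.1. Note that X_train was standardized in the previous step, so the grid is in standardized units, not in years and dollars. The raw ranges (ages of about 18 to 60 and salaries of about $15,000 to $150,000) are each mapped to a few units either side of 0, which is why a step of 0.1 gives a reasonably fine grid.

Line 21 plots the boundary of the predictions (the boundary between "buy" and "not buy"). Line 22 plots the scatter plot of the training data.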

### Plot the decision boundary 
plt.figure(figsize=(10, 8)) 

X = X_train   # standardized training features
y = y_train 

# Set min and max values and give it some padding:  X: Age, Salary
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5   # Age axis, in standardized units (not years)
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5   # Salary axis, in standardized units (not dollars)

h = 0.1          

# Generate a grid of points with distance h between them
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))   

# Predict the whole grid
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]) 
Z = Z.reshape(xx.shape)

# Plot the contour and training examples
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)    # X=Age, Y=Salary

plt.title('Support Vector Machine classifier\n(Decision Boundary)')
plt.show()

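Figure 6 is a plot of the predictions. It uses Matplotlib's scatter for the data points and contourf for the boundary. The closer the predicted boundary is to the boundary of the input data, the more accurate the model tends to be.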
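Because the same plotting steps are repeated for every classifier in this article, they can be wrapped in a helper function. The function name plot_decision_boundary and its parameters are hypothetical, and the data is again a synthetic stand-in, so treat this as a sketch rather than the article's own code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def plot_decision_boundary(model, X, y, title, h=0.1, ax=None):
    """Shade the predicted class over a grid and overlay the points in X."""
    if ax is None:
        _, ax = plt.subplots(figsize=(10, 8))
    # Grid limits: data range plus 0.5 of padding on each side
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    # Predict every grid point, then restore the grid shape
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, cmap=plt.cm.Spectral, alpha=0.8)
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral, edgecolors='k')
    ax.set_xlabel('Age (standardized)')
    ax.set_ylabel('Estimated Salary (standardized)')
    ax.set_title(title)
    return xx, yy, Z


# Synthetic stand-in data with a hypothetical labelling rule
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15_000, 150_001, size=400)
X = np.column_stack([age, salary]).astype(float)
y = ((age > 45) | (salary > 90_000)).astype(int)

X_s = StandardScaler().fit_transform(X)
model = SVC(gamma='auto').fit(X_s, y)

xx, yy, Z = plot_decision_boundary(model, X_s, y, 'SVM (Decision Boundary)')
print(xx.shape, Z.shape)
```

The helper takes any fitted classifier with a predict method, so the same call works for Random Forest, Decision Tree, Naive Bayes and Logistic Regression. Outside the interactive window, add plt.show() after the call.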

Evaluating the predictions
Lines 2-3 evaluate the predictions on the test data.
```
### Evaluate the prediction results
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred) * 100  # 93%
```
Figure 7

Figure 7 shows the evaluation of the predictions. The score displayed is 93%, so this model predicts the test data quite accurately.
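Accuracy is a single number, so it does not show which kind of mistake the model makes. The article imports confusion_matrix and classification_report but does not use them; the sketch below shows one way they could be applied. It runs on synthetic stand-in data, so the numbers differ from the 93% reported above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with a hypothetical labelling rule
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15_000, 150_001, size=400)
X = np.column_stack([age, salary]).astype(float)
y = ((age > 45) | (salary > 90_000)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

model = SVC(gamma='auto').fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f'false positives={fp}, false negatives={fn}')

# Precision, recall and F1 score for each class
print(classification_report(
    y_test, y_pred, target_names=['Not Purchased', 'Purchased']))

acc = accuracy_score(y_test, y_pred)
print(f'accuracy={acc * 100:.1f}%')
```

A false positive here is a visitor predicted to buy who did not, and a false negative is a buyer the model missed. Which of the two matters more depends on how the prediction is used.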
Testing with a specific age and salary
Lines 3-4 define the customer to predict for (age and salary). Lines 6-10 make the prediction and display the result.
```
### Testing 

age = 40
salary = 100_000

X_test = [[age, salary]]             # Age=40, EstimatedSalary=$100,000
X_test = sc.transform(X_test)             
y_pred = model.predict(X_test)
result = 'Will be Purchased' if y_pred[0] == 1 else 'Will not be Purchased'
print(f'age={age}, salary={salary} => {result}')
```
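Figure 8

Figure 8 shows the predictions for two customers: "age 40, salary $100,000" and "age 40, salary $88,000". Since the model scored 93% on the test data, roughly 93 out of 100 predictions for similar customers can be expected to be correct. This is the overall accuracy of the model, not the probability for one particular customer.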
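Forgetting to call sc.transform on a new customer is an easy mistake, because the model then receives raw years and dollars instead of standardized values. One way to avoid it is to bundle the scaler and the classifier with scikit-learn's make_pipeline, as sketched below on synthetic stand-in data. The variable names pipe and customers are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with a hypothetical labelling rule
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15_000, 150_001, size=400)
X = np.column_stack([age, salary]).astype(float)
y = ((age > 45) | (salary > 90_000)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The pipeline standardizes the input and then calls the SVM
pipe = make_pipeline(StandardScaler(), SVC(gamma='auto'))
pipe.fit(X_train, y_train)            # raw, unscaled training data

# Several customers can be predicted in one call, in raw units
customers = np.array([[40, 100_000], [40, 88_000]], dtype=float)
preds = pipe.predict(customers)
for (a, s), p in zip(customers, preds):
    result = 'Will be Purchased' if p == 1 else 'Will not be Purchased'
    print(f'age={a:.0f}, salary={s:.0f} => {result}')

# The manual two-step version used in the article gives the same answers
sc = StandardScaler().fit(X_train)
manual = SVC(gamma='auto').fit(sc.transform(X_train), y_train)
manual_preds = manual.predict(sc.transform(customers))
print(np.array_equal(preds, manual_preds))
```

Using a separate name such as customers also leaves the original X_test untouched, so the evaluation cell can be re-run afterwards.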

All of the code explained above in one listing

Finally, all of the code explained above is collected in one listing for reference.


# Supervised Learning => Classification => Support Vector Machine (SVM)     
# %%

#------------------------------------------
# Types of Machine Learning
# 1. Supervised Learning
#    The machine learns from the training data that is labeled.
# 2. UnSupervised Learning
#    Non-labeled training data
# 3. Reinforcement Learning
#    The machine learns on its own.
#-----------------------------------------
# The right ML solution?
# ・Classification
#   Used when the output is categorical like 'YES' or 'NO'
#   Algorithms Used
#   - Support Vector Machine (SVM) ★ C-Support Vector Classification (RBF kernel by default)
#   - Decision Tree
#   - Naive Bayes
#   - Random Forest
#   - KNN
# ・Regression
#   Used when a value needs to be predicted like the 'stock prices'
#   Algorithms Used
#   - Linear Regression
#   - Decision Tree
#   - Random Forest
#   - Neural Network
# ・Clustering
#   Used when the data needs to be organized to find patterns in the case of 'product recommendation'
#   Algorithms Used
#   - K Means
#------------------------------------------

### Import the libraries
from os import terminal_size
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC # SVC: C-Support Vector Classification (RBF kernel by default)

import warnings

#from sklearn.utils.extmath import randomized_range_finder
warnings.simplefilter('ignore')
# Set the font to support Japanese
plt.rcParams['font.family'] = 'Meiryo'


# %%

# ---------------------------------------------- SUV Data Analysis
### Load the data
csv_file = 'https://money-or-ikigai.com/menu/python/article/data/article056/suv_data.csv'
df = pd.read_csv(csv_file)
# df.info()
# RangeIndex: 400 entries, 0 to 399
# Data columns (total 5 columns):
#  #   Column           Non-Null Count  Dtype 
# ---  ------           --------------  ----- 
#  0   User ID          400 non-null    int64 
#  1   Gender           400 non-null    object
#  2   Age              400 non-null    int64 
#  3   EstimatedSalary  400 non-null    int64 
#  4   Purchased        400 non-null    int64 # 1:purchased, 0:not purchased
# dtypes: int64(4), object(1)

# df.head(3)
# 	User ID	    Gender	Age	EstimatedSalary	Purchased
# ---------------------------------------------------
# 0	15624510	Male	19	19000	        0          
# 1	15810944	Male	35	20000	        0
# 2	15668575	Female	26	43000	        0


# %%

### Visualize input data 
X = df.iloc[:, [2,3]].values    # Age[0], EstimatedSalary[1]
y = df.iloc[:, 4].values        # Purchased (0 or 1)
# X.shape => (400, 2)
# y.shape => (400,)

plt.figure(figsize=(10, 6))  
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='b', label='0:Not Purchased') # Age 
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='r', label='1:Purchased')     # Salary
plt.legend(); 
plt.xlabel('Age')
plt.ylabel('Estimated Salary ($)')
plt.title('Plot input data\n(Age vs Salary)')
plt.show()


# %%

### Split the input data (75:25)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

sc = StandardScaler()
X_train = sc.fit_transform(X_train) 
X_test = sc.transform(X_test)   


# %%

### Train the Support Vector Machine classifier
model = SVC(gamma='auto')
model.fit(X_train, y_train)


# %%

### Plot the decision boundary 
plt.figure(figsize=(10, 8)) 

X = X_train # ★ standardized training features
y = y_train # ★

# Set min and max values and give it some padding:  X: Age, Salary
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5   # Age axis, in standardized units (not years)
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5   # Salary axis, in standardized units (not dollars)

h = 0.1   # 0.01 => 0.1   ★        

# Generate a grid of points with distance h between them
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))   
# numpy.arange([start, ]stop, [step, ]dtype=None, *, like=None) 
# Return evenly spaced values within a given interval.
# numpy.meshgrid(*xi, copy=True, sparse=False, indexing='xy') 
# Return coordinate matrices from coordinate vectors.

# Predict the whole grid
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]) 
# numpy.ravel(a, order='C') => Return a contiguous flattened array.
# numpy.c_ = numpy.lib.index_tricks.CClass object => Translates slice objects to concatenation along the second axis
Z = Z.reshape(xx.shape)

# Plot the contour and training examples
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
# contourf([X, Y,] Z, [levels], **kwargs)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)    # X=Age, Y=Salary

plt.title('Support Vector Machine classifier\n(Decision Boundary)')
plt.show()


# %%

### Evaluate the prediction results
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred) * 100  # 93%


# %%

### Test 

age = 40
salary = 100_000

X_test = [[age, salary]]             # Age=40, EstimatedSalary=$100,000
X_test = sc.transform(X_test)             
y_pred = model.predict(X_test)
result = 'Will be Purchased' if y_pred[0] == 1 else 'Will not be Purchased'
print(f'age={age}, salary={salary} => {result}')

Predicting with Random Forest

Random Forest prediction results

Figure 9 shows the Random Forest prediction results. The accuracy score is "92%".

The complete Random Forest code is listed below for reference.


# Supervised Learning => Classification => Random Forest     
# %%         
#------------------------------------------
# Types of Machine Learning
# 1. Supervised Learning
#    The machine learns from the training data that is labeled.
# 2. UnSupervised Learning
#    Non-labeled training data
# 3. Reinforcement Learning
#    The machine learns on its own.
#-----------------------------------------
# The right ML solution?
# ・Classification
#   Used when the output is categorical like 'YES' or 'NO'
#   Algorithms Used
#   - Support Vector Machine (SVM) 
#   - Decision Tree
#   - Naive Bayes
#   - Random Forest ★
#   - KNN
# ・Regression
#   Used when a value needs to be predicted like the 'stock prices'
#   Algorithms Used
#   - Linear Regression
#   - Decision Tree
#   - Random Forest
#   - Neural Network
# ・Clustering
#   Used when the data needs to be organized to find patterns in the case of 'product recommendation'
#   Algorithms Used
#   - K Means
#------------------------------------------


# %%

### Import the libraries
from os import terminal_size
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC # SVC: C-Support Vector Classification (RBF kernel by default)

import warnings

#from sklearn.utils.extmath import randomized_range_finder
warnings.simplefilter('ignore')
# Set the font to support Japanese
plt.rcParams['font.family'] = 'Meiryo'


# %%

# ---------------------------------------------- SUV Data Analysis

### Load the data
csv_file = 'https://money-or-ikigai.com/menu/python/article/data/article056/suv_data.csv'
df = pd.read_csv(csv_file)


# %%

X = df.iloc[:, [2,3]].values    # Age, EstimatedSalary
y = df.iloc[:, 4].values        # Purchased (0 or 1)

# %%


# Visualize input data 

plt.figure(figsize=(10, 6))  
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='b', label='0:Not Purchased') # Age 
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='r', label='1:Purchased')     # Salary
plt.legend(); 
plt.xlabel('Age')
plt.ylabel('Estimated Salary ($)')
plt.title('Plot input data\n(Age vs Salary)')
plt.show()


# %%

### Split the input data (75:25)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

sc = StandardScaler()
X_train = sc.fit_transform(X_train) 
X_test = sc.transform(X_test)   


# %%

# Train the RandomForest Classifier
model = RandomForestClassifier(random_state=1, n_estimators=100)
model.fit(X_train, y_train)

# Plot the decision boundary 
plt.figure(figsize=(10, 8)) 

X = X_train # ★ standardized training features
y = y_train # ★

# Set min and max values and give it some padding:  X: Age, Salary
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5   # Age axis, in standardized units (not years)
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5   # Salary axis, in standardized units (not dollars)

h = 0.1   # 0.01 => 0.1   ★        

# Generate a grid of points with distance h between them
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))   

# Predict the whole grid
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]) 
Z = Z.reshape(xx.shape)

# Plot the contour and training examples
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)    # X=Age, Y=Salary

plt.title('RandomForest Classifier\n(Decision Boundary)')
plt.show()


# %%

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred) * 100  # 92%


# %%

### Test 

age = 40
salary = 100_000

X_test = [[age, salary]]             # Age=40, EstimatedSalary=$100,000
X_test = sc.transform(X_test)             
y_pred = model.predict(X_test)
result = 'Will be Purchased' if y_pred[0] == 1 else 'Will not be Purchased'
print(f'age={age}, salary={salary} => {result}')

Predicting with Decision Tree

Decision Tree prediction results

Figure 10 shows the Decision Tree prediction results. The accuracy score is "90%".
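A decision tree grown without limits tends to memorize the training data, which can make its boundary jagged. As a rough illustration (not part of the article's code), the sketch below compares an unlimited tree with one capped at max_depth=3 on synthetic stand-in data in which about 10% of the labels are flipped to imitate noise. Setting random_state also makes the result repeatable from run to run.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a hypothetical labelling rule
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15_000, 150_001, size=400)
X = np.column_stack([age, salary]).astype(float)
y = ((age > 45) | (salary > 90_000)).astype(int)
# Flip about 10% of the labels to imitate noisy buying behaviour
flip = rng.random(400) < 0.10
y = np.where(flip, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Compare an unlimited tree (None) with a shallow one (3 levels)
results = {}
for depth in (None, 3):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = {
        'depth': tree.get_depth(),
        'train': accuracy_score(y_train, tree.predict(X_train)) * 100,
        'test': accuracy_score(y_test, tree.predict(X_test)) * 100,
    }
    print(f'max_depth={depth}: {results[depth]}')
```

A large gap between the training and test scores of the unlimited tree is the usual sign of overfitting. Whether a depth limit helps on the real data has to be checked by running it there.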

The complete Decision Tree code is listed below for reference.

      
# Supervised Learning => Classification => Decision Tree     
# %%
#------------------------------------------
# Types of Machine Learning
# 1. Supervised Learning
#    The machine learns from the training data that is labeled.
# 2. UnSupervised Learning
#    Non-labeled training data
# 3. Reinforcement Learning
#    The machine learns on its own.
#-----------------------------------------
# The right ML solution?
# ・Classification
#   Used when the output is categorical like 'YES' or 'NO'
#   Algorithms Used
#   - Support Vector Machine (SVM) 
#   - Decision Tree ★
#   - Naive Bayes
#   - Random Forest
#   - KNN
# ・Regression
#   Used when a value needs to be predicted like the 'stock prices'
#   Algorithms Used
#   - Linear Regression
#   - Decision Tree
#   - Random Forest
#   - Neural Network
# ・Clustering
#   Used when the data needs to be organized to find patterns in the case of 'product recommendation'
#   Algorithms Used
#   - K Means
#------------------------------------------


### Import the libraries
from os import terminal_size
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC # SVC: C-Support Vector Classification (RBF kernel by default)

import warnings

#from sklearn.utils.extmath import randomized_range_finder
warnings.simplefilter('ignore')
# Set the font to support Japanese
plt.rcParams['font.family'] = 'Meiryo'


# %%

# ---------------------------------------------- SUV Data Analysis

### Load the data
csv_file = 'https://money-or-ikigai.com/menu/python/article/data/article056/suv_data.csv'
df = pd.read_csv(csv_file)


# %%

X = df.iloc[:, [2,3]].values    # Age, EstimatedSalary
y = df.iloc[:, 4].values        # Purchased (0 or 1)

# %%


# Visualize input data 

plt.figure(figsize=(10, 6))  
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='b', label='0:Not Purchased') # Age 
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='r', label='1:Purchased')     # Salary
plt.legend(); 
plt.xlabel('Age')
plt.ylabel('Estimated Salary ($)')
plt.title('Plot input data\n(Age vs Salary)')
plt.show()


# %%

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

sc = StandardScaler()
X_train = sc.fit_transform(X_train) 
X_test = sc.transform(X_test)   


# %%

# Train the Decision Tree classifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Plot the decision boundary 
plt.figure(figsize=(10, 8)) 

X = X_train # ★ standardized training features
y = y_train # ★

# Set min and max values and give it some padding:  X: Age, Salary
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5   # Age axis, in standardized units (not years)
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5   # Salary axis, in standardized units (not dollars)

h = 0.1   # 0.01 => 0.1   ★        

# Generate a grid of points with distance h between them
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))   

# Predict the whole grid
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]) 
Z = Z.reshape(xx.shape)

# Plot the contour and training examples
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)    # X=Age, Y=Salary

plt.title('Decision Tree classifier\n(Decision Boundary)')
plt.show()


# %%

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred) * 100  # 90%


# %%

### Test 

age = 40
salary = 100_000

X_test = [[age, salary]]             # Age=40, EstimatedSalary=$100,000
X_test = sc.transform(X_test)             
y_pred = model.predict(X_test)
result = 'Will be Purchased' if y_pred[0] == 1 else 'Will not be Purchased'
print(f'age={age}, salary={salary} => {result}')

Predicting with Naive Bayes

Naive Bayes prediction results

Figure 11 shows the Naive Bayes prediction results. The accuracy score is "90%".

The complete Naive Bayes code is listed below for reference.

      
# Supervised Learning => Classification => Gaussian Naive Bayes Classifier
# %%

#------------------------------------------
# Types of Machine Learning
# 1. Supervised Learning
#    The machine learns from the training data that is labeled.
# 2. UnSupervised Learning
#    Non-labeled training data
# 3. Reinforcement Learning
#    The machine learns on its own.
#-----------------------------------------
# The right ML solution?
# ・Classification
#   Used when the output is categorical like 'YES' or 'NO'
#   Algorithms Used
#   - Support Vector Machine (SVM) 
#   - Decision Tree
#   - Naive Bayes ★ GaussianNB Classifier
#   - Random Forest
#   - KNN
# ・Regression
#   Used when a value needs to be predicted like the 'stock prices'
#   Algorithms Used
#   - Linear Regression
#   - Decision Tree
#   - Random Forest
#   - Neural Network
# ・Clustering
#   Used when the data needs to be organized to find patterns in the case of 'product recommendation'
#   Algorithms Used
#   - K Means
#------------------------------------------

### Import the libraries
from os import terminal_size
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC # SVC: C-Support Vector Classification (RBF kernel by default)

import warnings

#from sklearn.utils.extmath import randomized_range_finder
warnings.simplefilter('ignore')
# Set the font to support Japanese
plt.rcParams['font.family'] = 'Meiryo'



# %%

# ---------------------------------------------- SUV Data Analysis
### Load the data
csv_file = 'https://money-or-ikigai.com/menu/python/article/data/article056/suv_data.csv'
df = pd.read_csv(csv_file)


# %%

X = df.iloc[:, [2,3]].values    # Age, EstimatedSalary
y = df.iloc[:, 4].values        # Purchased (0 or 1)

# %%


# Visualize input data 

plt.figure(figsize=(10, 6))  
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='b', label='0:Not Purchased') # Age 
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='r', label='1:Purchased')     # Salary
plt.legend(); 
plt.xlabel('Age')
plt.ylabel('Estimated Salary ($)')
plt.title('Plot input data\n(Age vs Salary)')
plt.show()


# %%

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

sc = StandardScaler()
X_train = sc.fit_transform(X_train) 
X_test = sc.transform(X_test)   


# %%

# X_train => sc.fit_transform()
# y_train => 0 or 1

# Train the Gaussian NaiveBayes classifier
model = GaussianNB()
model.fit(X_train, y_train)

# Plot the decision boundary 
plt.figure(figsize=(10, 8)) 

X = X_train # ★ standardized training features
y = y_train # ★

# Set min and max values and give it some padding:  X: Age, Salary
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5   # Age axis, in standardized units (not years)
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5   # Salary axis, in standardized units (not dollars)

h = 0.1   # 0.01 => 0.1   ★        

# Generate a grid of points with distance h between them
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))   

# Predict the whole grid
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]) 
Z = Z.reshape(xx.shape)

# Plot the contour and training examples
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)    # X=Age, Y=Salary

plt.title('Gaussian NaiveBayes classifier\n(Decision Boundary)')
plt.show()


# %%

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred) * 100  # 90%


# %%

### Test 

age = 40
salary = 100_000

X_test = [[age, salary]]             # Age=40, EstimatedSalary=$100,000
X_test = sc.transform(X_test)             
y_pred = model.predict(X_test)
result = 'Will be Purchased' if y_pred[0] == 1 else 'Will not be Purchased'
print(f'age={age}, salary={salary} => {result}')

Predicting with Logistic Regression

Logistic Regression prediction results

Figure 12 shows the Logistic Regression prediction results. The accuracy score is "89%".
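Logistic Regression draws a straight-line boundary in the (standardized) age and salary plane, and it can also report how confident it is through predict_proba. The sketch below, run on synthetic stand-in data rather than the article's CSV file, prints the purchase probability for one customer together with the learned coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a hypothetical labelling rule
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15_000, 150_001, size=400)
X = np.column_stack([age, salary]).astype(float)
y = ((age > 45) | (salary > 90_000)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

model = LogisticRegression(random_state=0).fit(X_train, y_train)

# One customer in raw units, scaled with the scaler fitted on the training data
customer = sc.transform([[40, 88_000]])
proba = model.predict_proba(customer)[0]   # [P(class 0), P(class 1)]
label = model.predict(customer)[0]
print(f'P(not purchased)={proba[0]:.2f}, P(purchased)={proba[1]:.2f}, label={label}')

# One weight per standardized feature (Age, EstimatedSalary) plus an intercept
print(model.coef_, model.intercept_)
```

The predicted label is simply the class with the larger probability, so a probability close to 0.5 marks a customer who sits near the boundary.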

The complete Logistic Regression code is listed below for reference.

      
# Supervised Learning => Classification => Logistic Regression Classifier
# %%

# Types of Machine Learning
# 1. Supervised Learning
#    The machine learns from the training data that is labeled.
# 2. UnSupervised Learning
#    Non-labeled training data
# 3. Reinforcement Learning
#    The machine learns on its own.
#-----------------------------------------
# The right ML solution?
# ・Classification
#   Used when the output is categorical like 'YES' or 'NO'
#   Algorithms Used
#   - Support Vector Machine (SVM) 
#   - Decision Tree
#   - Naive Bayes
#   - Random Forest
#   - Logistic Regression ★
#   - KNN
# ・Regression
#   Used when a value needs to be predicted like the 'stock prices'
#   Algorithms Used
#   - Linear Regression
#   - Decision Tree
#   - Random Forest
#   - Neural Network
# ・Clustering
#   Used when the data needs to be organized to find patterns in the case of 'product recommendation'
#   Algorithms Used
#   - K Means
#------------------------------------------

# Dataset: data/csv/tutorial/suv_data.csv
# %%

### Import the libraries
from os import terminal_size
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC # SVC: C-Support Vector Classification (RBF kernel by default)

import warnings

#from sklearn.utils.extmath import randomized_range_finder
warnings.simplefilter('ignore')
# Set the font to support Japanese
plt.rcParams['font.family'] = 'Meiryo'


# %%

# ---------------------------------------------- SUV Data Analysis
### Load the data
csv_file = 'https://money-or-ikigai.com/menu/python/article/data/article056/suv_data.csv'
df = pd.read_csv(csv_file)


# %%


### Visualize input data 
X = df.iloc[:, [2,3]].values    # Age[0], EstimatedSalary[1]
y = df.iloc[:, 4].values        # Purchased (0 or 1)

plt.figure(figsize=(10, 6))  
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='b', label='0:Not Purchased') # Age 
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='r', label='1:Purchased')     # Salary
plt.legend() 
plt.xlabel('Age')
plt.ylabel('Estimated Salary ($)')
plt.title('Plot input data\n(Age vs Salary)')
plt.show()


# %%

### Split the input data (75:25)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

sc = StandardScaler()
X_train = sc.fit_transform(X_train) 
X_test = sc.transform(X_test)   


# %%

### Train the Logistic Regression
model = LogisticRegression(random_state=0)
model.fit(X_train, y_train)

# %%

### Plot the decision boundary 
plt.figure(figsize=(10, 8)) 

X = X_train # ★ standardized training features
y = y_train # ★

# Set min and max values and give it some padding:  X: Age, Salary
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5   # Age axis, in standardized units (not years)
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5   # Salary axis, in standardized units (not dollars)

h = 0.1   # 0.01 => 0.1   ★        

# Generate a grid of points with distance h between them
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))   

# Predict the whole grid
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]) 
Z = Z.reshape(xx.shape)

# Plot the contour and training examples
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)    # X=Age, Y=Salary

plt.title('Logistic Regression classifier\n(Decision Boundary)')
plt.show()


# %%

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred) * 100  # 89%


# %%

### Test 
age = 40
salary = 88_000

X_test = [[age, salary]]             # Age=40, EstimatedSalary=$88,000
X_test = sc.transform(X_test)             
y_pred = model.predict(X_test)
result = 'Will be Purchased' if y_pred[0] == 1 else 'Will not be Purchased'
print(f'age={age}, salary={salary} => {result}')
