Python: How to create a frequency distribution chart (histogram) with Matplotlib from raw data [Pandas + Matplotlib]

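This article explains how to create a histogram (frequency distribution chart) with Matplotlib's hist() method from raw data published on websites. Histograms are handy for visualizing and analyzing statistics by age group. They can also be used to plot survey results by age group, or to plot COVID-19 cases, severe cases, and deaths by age group and by gender.

The survey data was downloaded from the "Stack Overflow Annual Developer Survey" and processed before use. The COVID-19 data published on the Ministry of Health, Labour and Welfare site does not include a breakdown by gender, so this article uses data downloaded from the National Institute of Population and Social Security Research (国立社会保障・人口問題研究所), again processed before use.

To download stock prices or cryptocurrency prices from the Yahoo! Finance website, use Pandas' DataReader. To download stock prices with DataReader(), pass the company's ticker symbol (for example, AAPL for Apple), the start date, and the end date as arguments.

Cryptocurrency prices are downloaded from Yahoo! Finance with the same DataReader. For Bitcoin, pass "BTC-JPY" or "BTC-USD" to DataReader(): "BTC-JPY" for prices in Japanese yen, "BTC-USD" for prices in US dollars. Both stock and cryptocurrency downloads can be fully automated, so there is no need to download the data by hand.

This article uses the "Python Interactive window" of Visual Studio Code (VSC), which provides an environment similar to Jupyter (IPython Notebook). To switch VSC from its normal mode to the interactive mode, enter the comment "# %%" in the code. For the detailed steps, see the linked instructions. In the interactive environment, calls such as print() and plt.show() are not strictly required, so some listings may omit them. When running the code in VSC's normal mode, add print(), plt.show(), and similar calls where needed.

This article uses the Pandas and Matplotlib libraries. Install them beforehand by following Article001, Article002, Article003, and Article004. The Python code is written in Microsoft's Visual Studio Code. If it is not installed yet, see Article001.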

Start by creating a simple histogram with Matplotlib's hist()

Create a histogram from data defined in the program
Start Visual Studio Code (VSC), enter the code below, and run it with [Ctrl+Enter]. Lines 4-7 import the Python libraries. Line 13 defines the data to plot. Line 15 creates the histogram with Matplotlib's hist() method. Line 17 displays the chart with Matplotlib's show() method.
```
# Article020_Matplotlib Histograms Part0.py
# %%

import pandas as pd
from matplotlib import pyplot as plt
from matplotlib import colors
from matplotlib.ticker import PercentFormatter

# %%

# 1) Plot a simple histogram

ages = [18, 19, 21, 25, 26, 26, 30, 32, 38, 45, 55]

plt.hist(ages, bins=5)

plt.show()
```
Figure 1

Figure 1 shows the result: a simple histogram.
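To see what `bins=5` does with this data, the counts can be computed without drawing anything. The sketch below is an illustration rather than part of the original listing; it assumes NumPy is available (Matplotlib depends on it) and uses `np.histogram`, which hist() relies on for its binning.

```python
import numpy as np

ages = [18, 19, 21, 25, 26, 26, 30, 32, 38, 45, 55]

# bins=5 splits the range from min(ages) to max(ages) into 5 equal-width bins.
counts, edges = np.histogram(ages, bins=5)

# Print each bin as "left edge - right edge: count".
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} - {right:5.1f}: {count}")
```

The edges are spaced by (55 - 18) / 5 = 7.4 years, so the bars do not line up with decades. The later sections address this by passing explicit bin edges.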
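Add options to the histogram
Line 3 sets the "font.family" entry of Matplotlib's rcParams to the Japanese font "Meiryo" so that Japanese text can be used in the chart. Line 4 sets the style applied to the chart. To list the available styles, enter "plt.style.available" in the VSC (Visual Studio Code) interactive window and run it with [Shift+Enter]. Lines 9-11 set the chart title, the X-axis label, and the Y-axis label. Line 12 calls Matplotlib's tight_layout() method so that the chart is laid out compactly, with the title and labels fitting inside the figure. This option is handy on laptops and other small screens.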
```
# 2) Add options (font.family, style, xlabel, ylabel, title,...)

plt.rcParams['font.family'] = 'Meiryo'  
plt.style.use('fivethirtyeight')

ages = [18, 19, 21, 25, 26, 26, 30, 32, 38, 45, 55]
plt.hist(ages, bins=5)

plt.title('Ages of Respondents')
plt.xlabel('Ages')
plt.ylabel('Total Respondents')
plt.tight_layout()

plt.show()
```
Figure 2

Figure 2 shows the result. The title, X-axis label, and Y-axis label now appear on the histogram.
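"Meiryo" ships with Windows and may be missing elsewhere, and the list of styles depends on the installed Matplotlib version. The following is a minimal sketch of checking both before using them. The helper `pick_font` and the candidate font list are hypothetical names chosen for illustration, not something from the original article.

```python
from matplotlib import font_manager
from matplotlib import pyplot as plt


def pick_font(candidates, fallback="sans-serif"):
    """Return the first candidate font Matplotlib knows about, else the fallback."""
    installed = {font.name for font in font_manager.fontManager.ttflist}
    for name in candidates:
        if name in installed:
            return name
    return fallback


# Hypothetical preference order: two Windows fonts, then a common Linux CJK font.
font = pick_font(["Meiryo", "Yu Gothic", "Noto Sans CJK JP"])
plt.rcParams["font.family"] = font

# Apply the style only if this Matplotlib installation provides it.
style = "fivethirtyeight"
if style in plt.style.available:
    plt.style.use(style)

print("font:", font, "| style available:", style in plt.style.available)
```

If none of the candidates is installed, the sketch falls back to the generic "sans-serif" family, in which case Japanese labels may not render correctly.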
Add the "edgecolor" option to separate the age groups
Line 7 adds "edgecolor='black'" to the arguments of Matplotlib's hist() method so that each bar is drawn with a border. This makes the boundaries between age groups clear.
```
# 2) Add edgecolor : plt.hist(edgecolor='black')

plt.rcParams['font.family'] = 'Meiryo'  
plt.style.use('fivethirtyeight')

ages = [18, 19, 21, 25, 26, 26, 30, 32, 38, 45, 55]
plt.hist(ages, bins=5, edgecolor='black')

plt.title('Ages of Respondents')
plt.xlabel('Ages')
plt.ylabel('Total Respondents')
plt.tight_layout()

plt.show()
```
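Figure 3

Figure 3 shows the result. Each bar now has a border.
Set "bins" to 10-year intervals for readability
Line 7 defines the bin edges. Six edges produce five bins, one per age group (10s, 20s, 30s, 40s, 50s). Line 9 passes the bins defined in the program to Matplotlib's hist() method.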
```
# 3) Add range of bins : plt.hist(..., bins=bins)

plt.rcParams['font.family'] = 'Meiryo' 
plt.style.use('fivethirtyeight')

ages = [18, 19, 21, 25, 26, 26, 30, 32, 38, 45, 55]
bins = [10, 20, 30, 40, 50, 60]

plt.hist(ages, bins=bins, edgecolor='black')

plt.title('Ages of Respondents')
plt.xlabel('Ages')
plt.ylabel('Total Respondents')
plt.tight_layout()

plt.show()
```
Figure 4

Figure 4 shows the result. The bins (bars) are now displayed by age group.
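The same grouping can be checked as a table. The sketch below uses `pd.cut` with the bin edges from the listing; the labels ("10s", "20s", ...) are illustrative names, not part of the original code.

```python
import pandas as pd

ages = pd.Series([18, 19, 21, 25, 26, 26, 30, 32, 38, 45, 55])
bins = [10, 20, 30, 40, 50, 60]
labels = ["10s", "20s", "30s", "40s", "50s"]  # illustrative labels

# right=False makes each interval closed on the left: [10, 20), [20, 30), ...
# so an age of exactly 30 is counted in the 30s, as it is in the histogram.
decade = pd.cut(ages, bins=bins, right=False, labels=labels)
counts = decade.value_counts().reindex(labels)

print(counts)
```

One difference worth knowing: hist() also includes the right-most edge (60) in the last bin, whereas `pd.cut(..., right=False)` does not. With this data no value equals 60, so the counts agree.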

Adjust "bins" to exclude the teens from the histogram

```
# 4) Exclude bin 10

plt.rcParams['font.family'] = 'Meiryo' 
plt.style.use('fivethirtyeight')

ages = [18, 19, 21, 25, 26, 26, 30, 32, 38, 45, 55]
bins = [20, 30, 40, 50, 60]

plt.hist(ages, bins=bins, edgecolor='black')

plt.title('Ages of Respondents')
plt.xlabel('Ages')
plt.ylabel('Total Respondents')
plt.tight_layout()

plt.show()
```

Figure 5 shows the result. The bin (bar) for the teens has been removed from the histogram.
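Values outside the bin edges are dropped without any warning, so it helps to check how many data points actually end up in the chart. A small sketch with NumPy (an illustration, not part of the original listing):

```python
import numpy as np

ages = np.array([18, 19, 21, 25, 26, 26, 30, 32, 38, 45, 55])
bins = [20, 30, 40, 50, 60]

counts, _ = np.histogram(ages, bins=bins)

# Values below the first edge or above the last edge fall into no bin.
dropped = ages[(ages < bins[0]) | (ages > bins[-1])]

print(counts.sum(), "of", ages.size, "values plotted; dropped:", dropped.tolist())
```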

Change the color of each bin (bar) according to its frequency

Lines 12-16 set the color of each bin according to its frequency.

```
# 5) Change histogram colors : change the color of each bar based on its y value

plt.rcParams['font.family'] = 'Meiryo'  # Meiryo, Yu Gothic
plt.style.use('fivethirtyeight')

ages = [18, 19, 21, 25, 26, 30, 31, 32, 32, 33, 33, 34, 35, 35, 35, 35, 35, 35, 35, 36, 36, 36, 36, 37, 38, 39, 40, 41, 42, 43, 45, 51, 52, 55, 57, 59]

bins = [20, 25, 30, 35, 40, 45, 50, 55, 60]

N, bins, patches = plt.hist(ages, bins=bins, edgecolor='black')

fracs = N / N.max()
norm = colors.Normalize(fracs.min(), fracs.max())
for thisfrac, thispatch in zip(fracs, patches):
    color = plt.cm.viridis(norm(thisfrac))
    thispatch.set_facecolor(color)

plt.title('Ages of Respondents')
plt.xlabel('Ages')
plt.ylabel('Total Respondents')
plt.tight_layout()

plt.show()
```

Figure 6 shows the result. The bins (bars) of the histogram are displayed in different colors.
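The coloring works in two steps: each count is scaled to a position between 0 and 1, and that position is looked up in the viridis colormap. The sketch below reproduces only that mapping, without drawing a chart. It computes the counts with `np.histogram` instead of `plt.hist`, which is an assumption made to keep the example free of any display.

```python
import numpy as np
from matplotlib import cm, colors

ages = [18, 19, 21, 25, 26, 30, 31, 32, 32, 33, 33, 34, 35, 35, 35, 35, 35, 35,
        35, 36, 36, 36, 36, 37, 38, 39, 40, 41, 42, 43, 45, 51, 52, 55, 57, 59]
bins = [20, 25, 30, 35, 40, 45, 50, 55, 60]

counts, _ = np.histogram(ages, bins=bins)

# Step 1: scale counts so the tallest bin is 1.0, then normalize min..max to 0..1.
fracs = counts / counts.max()
norm = colors.Normalize(fracs.min(), fracs.max())

# Step 2: look up an RGBA color for each normalized position.
for count, frac in zip(counts, fracs):
    position = float(norm(frac))
    r, g, b, a = cm.viridis(position)
    print(f"count={count:2d} position={position:.2f} rgba=({r:.2f}, {g:.2f}, {b:.2f}, {a:.2f})")
```

The least populated bins land at position 0 (the dark end of viridis) and the tallest bin at position 1 (the bright end).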

Create histograms from processed Stack Overflow survey data

Load a CSV file of processed Stack Overflow survey data and create a simple histogram

Data Source: Stack Overflow Annual Developer Survey

```
# Article020_Matplotlib Histograms Part1.py
# Data Source: https://insights.stackoverflow.com/survey/
# %%

import pandas as pd
from matplotlib import pyplot as plt

# %%

# 1) Read a csv file and plot histogram

plt.rcParams['font.family'] = 'Meiryo'  # Meiryo, Yu Gothic
plt.style.use('fivethirtyeight')
plt.figure(figsize=(10,5))

csv_file = 'data/csv/article020/Stack_Overflow_Developer_Survey.csv'    # 2019
df = pd.read_csv(csv_file)
ids = df['Responder_id']
ages = df['Age']

bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

plt.hist(ages, bins=bins, edgecolor='black')

plt.title('Ages of Respondents')
plt.xlabel('Ages')
plt.ylabel('Total Respondents')
plt.tight_layout()

plt.show()
```
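The processed CSV file itself is not reproduced here. As a stand-in, the sketch below reads a few made-up rows with the same column names (`Responder_id`, `Age`) from an in-memory string and drops rows without an age before counting. The values are invented for illustration only and say nothing about the real survey.

```python
import io

import pandas as pd

# Made-up rows that only mimic the column layout used in this article.
csv_text = """Responder_id,Age
1,14
2,19
3,28
4,
5,22
6,30
7,42
"""

df = pd.read_csv(io.StringIO(csv_text))

# Survey exports often have blank ages; dropping them makes explicit
# which rows are counted in the histogram.
ages = df["Age"].dropna()

print(len(df), "rows,", len(ages), "with an age")
print(ages.describe())
```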

Add "log=True" to the histogram

```
# 2) Add a log=True : plt.hist(..., log=True)

plt.rcParams['font.family'] = 'Meiryo'  # Meiryo, Yu Gothic
plt.style.use('fivethirtyeight')
plt.figure(figsize=(8,5))

df = pd.read_csv(csv_file)
ids = df['Responder_id']
ages = df['Age']

bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

plt.hist(ages, bins=bins, edgecolor='black', log=True)

plt.title('Ages of Respondents')
plt.xlabel('Ages')
plt.ylabel('Total Respondents')
plt.tight_layout()

plt.show()
```
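`log=True` switches the Y axis to a logarithmic scale, which keeps sparsely populated bins visible next to very crowded ones. The sketch below compares both scales side by side. The ages are made up to exaggerate the effect, and the "Agg" backend is an assumption so that the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
from matplotlib import pyplot as plt

# Made-up ages: one crowded age group and a few sparse ones.
ages = [25] * 1000 + [35] * 100 + [65] * 10 + [85] * 1
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

fig, (ax_linear, ax_log) = plt.subplots(1, 2, figsize=(8, 4))

# Same data and bins on both axes; only the Y scale differs.
counts, _, _ = ax_linear.hist(ages, bins=bins, edgecolor="black")
ax_log.hist(ages, bins=bins, edgecolor="black", log=True)

ax_linear.set_title("linear")
ax_log.set_title("log=True")
fig.tight_layout()

print(ax_linear.get_yscale(), ax_log.get_yscale())
```

On the linear axis the bins with 10 and 1 entries are practically invisible next to the bin with 1000; on the log axis each factor of 10 takes the same vertical distance.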

Add the median to the histogram

```
# 3) Add age median (color, label, linewidth), legend

plt.rcParams['font.family'] = 'Meiryo'  # Meiryo, Yu Gothic
plt.style.use('fivethirtyeight')
plt.figure(figsize=(8,5))

df = pd.read_csv(csv_file)
ids = df['Responder_id']
ages = df['Age']

bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

plt.hist(ages, bins=bins, edgecolor='black', log=True)

# The median is hard-coded here; it is drawn as a vertical line
median_age = 29
color = '#fc4f30'

plt.axvline(median_age, color=color, label='Age Median', linewidth=2)

plt.legend()
plt.title('Ages of Respondents')
plt.xlabel('Ages')
plt.ylabel('Total Respondents')
plt.tight_layout()

plt.show()
```
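The listing hard-codes the median. A sketch of computing it from the data instead, using `Series.median()` (which skips missing values by default). The sample ages are made up for illustration.

```python
import pandas as pd

ages = pd.Series([22, 25, 27, 29, 31, 38, 54])  # made-up sample

median_age = ages.median()
label = f"Age Median ({median_age:g})"

print(label)
```

The resulting `median_age` and `label` can be passed to `plt.axvline(median_age, label=label, ...)` in place of the fixed values, so the line stays correct when the CSV file changes.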

Create histograms from COVID-19 data and analyze by age group

Load and process the COVID-19 CSV file

Data Source: National Institute of Population and Social Security Research (国立社会保障・人口問題研究所), deaths by sex and age group

```
# Article020_Matplotlib Histograms Part2 COVID-19 (1).py
# Data Source: http://www.ipss.go.jp/projects/j/Choju/covid19/index.asp
# %%

import pandas as pd
from matplotlib import pyplot as plt
from matplotlib import colors
from matplotlib.ticker import PercentFormatter

# %%

# 0) Read a csv file and create a new file

csv_file ='data/csv/article020/COVID-19(20210920)by_ages.csv'
raw = pd.read_csv(csv_file)
# Columns: age, male, female, total

# Expand the aggregated table into one row per death.
# Rows are collected in a list first: DataFrame.append() was removed in
# pandas 2.0 and was very slow when called once per row.
rows = []
for age, cnt in zip(raw['age'], raw['total']):
    rows.extend([age] * int(cnt))

df = pd.DataFrame({'id': range(1, len(rows) + 1), 'age': rows})

csv_file ='data/csv/article020/COVID-19_modified(20210920).csv'
df.to_csv(csv_file, index=False)
print('Append COVID-19 Data Completed!')
```
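The expansion step turns an aggregated table (one row per age group with a count) into one row per person so that hist() can count the rows again. As a hedged alternative, the counts can be passed directly through the `weights` argument, which both `np.histogram` and `plt.hist` accept. The sketch below uses made-up totals, not the real data, to show that both approaches give the same bin counts.

```python
import numpy as np

# Made-up aggregated table: one entry per age group with its count.
ages = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
totals = np.array([0, 1, 3, 8, 20, 45, 110, 230, 180])
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Expanded form: one entry per person, then count per bin.
expanded = np.repeat(ages, totals)
counts_expanded, _ = np.histogram(expanded, bins=bins)

# Weighted form: each age group contributes its count directly.
counts_weighted, _ = np.histogram(ages, bins=bins, weights=totals)

print(counts_expanded)
print(counts_weighted)
```

With Matplotlib this corresponds to `plt.hist(ages, bins=bins, weights=totals)`, which avoids writing the intermediate CSV file. The expanded file is still useful when other row-based tools need the data.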

Create a histogram of deaths by age group

```
# 1) Read a csv file and plot histogram : Age Histogram of COVID-19 Deaths in Japan

plt.rcParams['font.family'] = 'Meiryo'  # Meiryo, Yu Gothic
plt.style.use('fivethirtyeight')        # 'fivethirtyeight' 'ggplot'
plt.figure(figsize=(8,5))

df = pd.read_csv(csv_file)
ages = df['age']

bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

plt.hist(ages, bins=bins, edgecolor='black')

plt.title('Age Histogram of COVID-19 Deaths\n(2021/9/20)')
plt.xlabel('Ages')
plt.ylabel('Total Deaths')
plt.tight_layout()

plt.show()
```

Add "log=True" to the histogram

```
# 2) Add a log : plt.hist(...,log=True)

plt.rcParams['font.family'] = 'Meiryo'  # Meiryo, Yu Gothic
plt.style.use('fivethirtyeight')
plt.figure(figsize=(8,5))

ages = df['age']
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

plt.hist(ages, bins=bins, edgecolor='black', log=True)

plt.title('Age Histogram of COVID-19 Deaths\n(2021/9/20)')
plt.xlabel('Ages')
plt.ylabel('Total Deaths')
plt.tight_layout()

plt.show()
```

Add the median to the histogram

```
# 3) Add age median (color, label, linewidth), legend

plt.rcParams['font.family'] = 'Meiryo'  # Meiryo, Yu Gothic
plt.style.use('fivethirtyeight')
plt.figure(figsize=(8,5))

ages = df['age']
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

plt.hist(ages, bins=bins, edgecolor='black', log=True)

median_age = 75
color = '#fc4f30'

plt.axvline(median_age, color=color, label='Age Median(75)', linewidth=2)

plt.legend()
plt.title('Age Histogram of COVID-19 Deaths\n(2021/9/20)')
plt.xlabel('Ages')
plt.ylabel('Total Deaths')
plt.tight_layout()

plt.show()
```

Change the bar colors according to frequency

```
# 4) Update histogram colors : change the color of each bar based on its y value

plt.rcParams['font.family'] = 'Meiryo'  # Meiryo, Yu Gothic
plt.style.use('fivethirtyeight')
plt.figure(figsize=(8,5))

ages = df['age']
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

N, bins, patches = plt.hist(ages, bins=bins, edgecolor='black', log=True)

fracs = N / N.max()
norm = colors.Normalize(fracs.min(), fracs.max())
for thisfrac, thispatch in zip(fracs, patches):
    color = plt.cm.viridis(norm(thisfrac))
    thispatch.set_facecolor(color)

median_age = 75
color = '#fc4f30'

plt.axvline(median_age, color=color, label='Age Median(75)', linewidth=2)

plt.legend()
plt.title('Age Histogram of COVID-19 Deaths\n(2021/9/20)')
plt.xlabel('Ages')
plt.ylabel('Total Deaths')
plt.tight_layout()

plt.show()
```

Create histograms from COVID-19 data and analyze by gender

Load and process the COVID-19 data

```
# Article020_Matplotlib Histograms Part2 COVID-19 (2).py
# %%

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

# %%

# 0) Read a csv file and create a new csv file

csv_file ='data/csv/article020/COVID-19(20210920)by_ages.csv'
raw = pd.read_csv(csv_file)
# Columns: age, male, female, total
ages = raw['age'].values        # 5,10,20,30,...
males = raw['male'].values      # 0,1,12,44,...
females = raw['female'].values  # 0,0, 2,17,...

# Expand the aggregated table into one row per death.
# gender is True for male and False for female.
# Rows are collected in a list first: DataFrame.append() was removed in
# pandas 2.0 and was very slow when called once per row.
rows = []
for age, malecnt, femalecnt in zip(ages, males, females):
    rows.extend([(age, True)] * int(malecnt))
    rows.extend([(age, False)] * int(femalecnt))

df = pd.DataFrame(rows, columns=['age', 'gender'])
df.insert(0, 'id', range(1, len(df) + 1))

csv_file ='data/csv/article020/COVID-19_modified2(20210920).csv'
df.to_csv(csv_file, index=False)

print('Append COVID-19 Data Completed!')
```

Create a histogram of deaths by gender

```
# 1) Read a csv file and plot histogram for male/female

plt.rcParams['font.family'] = 'Meiryo'  # Meiryo, Yu Gothic
plt.style.use('fivethirtyeight')    # 'fivethirtyeight' 'ggplot'
plt.figure(figsize=(8,5))

df = pd.read_csv(csv_file)

filter_by = (df['gender'] == True)
dfx = df[filter_by]
male_ages = dfx['age'].values

filter_by = (df['gender'] == False)
dfy = df[filter_by]
female_ages = dfy['age'].values

# Pass the two arrays as a list so that each keeps its own length.
# Resizing the female array to the male length would repeat or drop
# values and distort the female counts.
ages = [male_ages, female_ages]

bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
colors = ['tan', 'red']
labels = ['male','female']

plt.hist(ages, bins, color=colors, label=labels)

plt.legend()
#plt.legend(prop={'size': 10})
plt.title('Age / Gender Histogram of COVID-19 Deaths\n(2021/9/20)')
plt.xlabel('Ages')
plt.ylabel('Total Deaths')
plt.tight_layout()
plt.show()
```
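Before plotting, the male and female counts per age group can be checked as a table. The sketch below uses `pd.crosstab` on a few made-up rows laid out like the modified CSV file (an `age` column holding the age group and a boolean `gender` column, True for male). The values are invented for illustration.

```python
import pandas as pd

# Made-up rows in the same layout as the modified CSV file.
df = pd.DataFrame({
    "age":    [60, 60, 70, 70, 70, 80, 80, 80, 80, 90],
    "gender": [True, False, True, True, False, True, False, False, True, False],
})

# Replace the boolean flag with readable labels for the table columns.
sex = df["gender"].map({True: "male", False: "female"})

# Rows: age group, columns: female / male, cells: number of records.
table = pd.crosstab(df["age"], sex)

print(table)
```

The column totals of this table should match the lengths of the male and female arrays used for the histogram, which is a quick way to confirm that no records were duplicated or lost along the way.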

Add "log=True" to the histogram

```
# 2) Add a log : plt.hist(...,log=True)

plt.rcParams['font.family'] = 'Meiryo'  # Meiryo, Yu Gothic
plt.style.use('fivethirtyeight')
plt.figure(figsize=(8,5))

bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
colors = ['tan', 'red']
labels = ['male','female']

plt.hist(ages, bins, log=True, color=colors, label=labels)

plt.legend()
#plt.legend(prop={'size': 10})
plt.title('Age / Gender Histogram of COVID-19 Deaths\n(2021/9/20)')
plt.xlabel('Ages')
plt.ylabel('Total Deaths')
plt.tight_layout()

plt.show()
```

Add the median to the histogram

```
# 3) Add age median (color, label, linewidth), legend

plt.rcParams['font.family'] = 'Meiryo'  # Meiryo, Yu Gothic
plt.style.use('fivethirtyeight')
plt.figure(figsize=(8,5))

bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
colors = ['tan', 'red']
labels = ['male','female']

plt.hist(ages, bins, log=True, color=colors, label=labels)

median_age = 75
color = 'blue'

plt.axvline(median_age, color=color, label='Age Median(75)', linewidth=2)

plt.legend()
plt.title('Age / Gender Histogram of COVID-19 Deaths\n(2021/9/20)')
plt.xlabel('Ages')
plt.ylabel('Total Deaths')
plt.tight_layout()
plt.show()
```
