(Pandas)pivot_tableでカテゴリ変数が欠損する対策を探る

	area	age	positive
0	京都	17	False
1	大阪	15	True
2	京都	4	False
3	大阪	9	True
4	大阪	10	True
...	...	...	...
45	京都	16	True
46	大阪	3	False
47	大阪	2	False
48	奈良	10	True
49	京都	13	False

これは下記のプログラムの中で生成される架空のデータだが、このようなデータの地域別、年齢層（5歳未満、5歳以上、10歳以上、15歳以上）毎の人数の分布を、陽性と陰性とに分けて比較できるように、ヒートマップを並べようとして、次のようなプログラムを書いた。

In [1]:

import numpy as np
import pandas as pd

# サンプルデータ作成
np.random.seed(0)
df = pd.DataFrame({'area': [['京都', '大阪', '奈良'][i] for i in np.random.randint(3, size=50)],
                   'age': np.random.randint(20, size=50),
                   'positive': np.random.randint(2, size=50).astype(bool)})
df.loc[df['age'] < 5, 'positive'] = False  # 5歳未満の陽性は無しとする

# 年齢のビニング（区分け）
df['range'] = pd.cut(df['age'], bins=[0, 5, 10, 15, 20], labels=['5歳未満', '5歳以上', '10歳以上', '15歳以上'], right=False)

# 陽性、陰性の地域－年齢区分の分布を比較
df_positive = df[df.positive == True].pivot_table(index='range', columns='area', aggfunc='size', fill_value=0)
df_negative = df[df.positive == False].pivot_table(index='range', columns='area', aggfunc='size', fill_value=0)

# ヒートマップを並べて描画
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(8, 2))
plt.subplot(121)
sns.heatmap(df_positive, cmap='Reds', cbar=False, annot=True)
plt.title("陽性")
plt.xlabel("")
plt.ylabel("")
plt.subplot(122)
sns.heatmap(df_negative, cmap='Blues', cbar=False, annot=True)
plt.title("陰性")
plt.xlabel("")
plt.ylabel("")
plt.subplots_adjust(wspace=0.4)
plt.show()

コード内のコメントに書いているように、5歳未満で陽性の人がいないものとする。
すると、次のように、陽性の5歳未満の行が抜けて、行の数が合わなくなってしまった。

Out [1]:

求める結果は、次のものである。

これが出るようにコードを修正しようとしたが、スマートな方法がわからなかった。

陽性のヒートマップの元になっているDataFrameを見ると、5歳未満の行が無い。

In [2]:

df_positive

Out [2]:

area	京都	大阪	奈良
range
5歳以上	2	2	2
10歳以上	0	4	3
15歳以上	2	2	1

しかし、このDataFrameのインデックスを見ると、データ型がカテゴリ変数で、そのカテゴリーには'5歳未満'が存在する。

In [3]:

df_positive.index

Out [3]:

CategoricalIndex(['5歳以上', '10歳以上', '15歳以上'], categories=['5歳未満', '5歳以上', '10歳以上', '15歳以上'], ordered=True, name='range', dtype='category')

であればこのインデックスを期待するものに書き換えれば良いのではないかと思って、次のようにインデックスのデータをカテゴリ変数のカテゴリーと同じになるように reindexしてみると、確かに望み通りの結果になった。

In [4]:

df_positive = df[df.positive == True].pivot_table(index='range', columns='area', aggfunc='size', fill_value=0)
df_positive = df_positive.reindex(index=pd.CategoricalIndex(data=df_positive.index.categories, categories=df_positive.index.categories, ordered=True), fill_value=0)
df_positive

Out [4]:

area	京都	大阪	奈良
range
5歳未満	0	0	0
5歳以上	2	2	2
10歳以上	0	4	3
15歳以上	2	2	1

しかし、何かやりたいことに対してコードが冗長だし、ordered=Trueまで面倒を見ないといけないなど、手間がかかり過ぎている感じがする。
次のようにインデックスに欠損が無いDataFrameを別に作ってインデックスをコピーする方が、コードがシンプルだし、考慮すべきことが少なくて楽だが、これだとプログラムの動作が冗長である。

In [5]:

df_all = df.pivot_table(index='range', columns='area', aggfunc='size', fill_value=0)
df_positive = df[df.positive == True].pivot_table(index='range', columns='area', aggfunc='size', fill_value=0)
df_positive = df_positive.reindex(index=df_all.index, fill_value=0)
df_positive

Out [5]:

area	京都	大阪	奈良
range
5歳未満	0	0	0
5歳以上	2	2	2
10歳以上	0	4	3
15歳以上	2	2	1

pivot_tableしてからreindexするのではなく、pivot_tableでインデックスに欠損を生じないようにできないかと思って調べると、dropnaという引数があり、次のように aggfunc='count', dropna=False とすればできることがわかった。

In [6]:

df_positive = df[df.positive == True].pivot_table(index='range', columns='area', values='positive', aggfunc='count', fill_value=0, dropna=False)
df_positive

Out [6]:

area	京都	大阪	奈良
range
5歳未満	0	0	0
5歳以上	2	2	2
10歳以上	0	4	3
15歳以上	2	2	1

しかし、 aggfunc='size' とすると、 dropna=False が効かなかった。

In [7]:

df_positive = df[df.positive == True].pivot_table(index='range', columns='area', aggfunc='size', fill_value=0, dropna=False)
df_positive

Out [7]:

area	京都	大阪	奈良
range
5歳以上	2	2	2
10歳以上	0	4	3
15歳以上	2	2	1

aggfunc='count' にするにはvaluesで無駄に1列指定しないといけないし、indexとcolumnにする2列しか無くて他に列が残ってない時はやりようが無いので、諦めていつも aggfunc='size' とするようにしているのだが、 aggfunc='size' と dropna=False が両立しないというのは面倒である。
しかも、aggfunc='size' と dropna=False が両立しないのは理屈がわからないので困る。筆者にはバグにしか思えない。

なお、この記事で使用したサンプルデータは、今話題のウィルスとは一切無関係である。

【2020/5/4 追記】
以上はPandasのバージョン0.25.3で起こっていたことだが、1.0.3でやってみるとpivot_table(aggfunc='size')の動作が変わっており、最初のコードで陽性の5歳未満の行が抜けなくなった。今度はaggfunc='size' と dropna=True が両立しなくなったようだ。
次のようにfill_value=0を外しても、0.25.3及びaggfunc='count'ならNaNになる所が0になるので、aggfunc='size'だとNaNが生じない（従ってdropもされない）ように変わったようだ。

In [8]:

df_positive = df[df.positive == True].pivot_table(index='range', columns='area', aggfunc='size')
df_positive

Out [8]:

area	京都	大阪	奈良
range
5歳未満	0	0	0
5歳以上	2	2	2
10歳以上	0	4	3
15歳以上	2	2	1

PandasのRelease NotesやGitHubのIssuesを探しても、なぜこのように動作が変わったのかわからなかったし、pandas 1.0.3 documentationのpandas.pivot_tableの所を読んでも、それが仕様なのかどうかわからなかった。