Decision Tree Hyperparameters

Contents

0.1 The max_depth parameter
0.2 random_state
0.3 n_estimators
0.4 max_depth
0.5 random_state
0.6 n_neighbors (k-NN)

1 Automating tuning
- 1.1 Grid search

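The max_depth parameter

max_depth sets the maximum depth of the tree that the model may build during training.
When max_depth is not set, the tree keeps splitting the data until the training data is almost completely classified.
The resulting model trusts the training data too much and generalizes poorly; in other words, it overfits.
Setting the value too large has the same effect, because the tree stops growing as soon as classification is complete, before the limit is ever reached.

Restricting the height of the tree by setting max_depth is called pruning the decision tree (limiting growth up front in this way is often called pre-pruning).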

# Import modules
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
%matplotlib inline

# Generate the data
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0, random_state=42)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

# Range of max_depth values (1 to 10)
depth_list = list(range(1, 11))

# Write your code below
# Create an empty list to store the accuracy
accuracy = []

# Train a model for each value of max_depth
for max_depth in depth_list:
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    model.fit(train_X, train_y)
    accuracy.append(model.score(test_X, test_y))

# End of the code to edit
# Plot the graph
plt.plot(depth_list, accuracy)
plt.xlabel("max_depth")
plt.ylabel("accuracy")
plt.title("accuracy by changing max_depth")
plt.show()
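
To check the earlier statements about an unset or overly large max_depth, the following is a minimal sketch that is not part of the original exercise. It reuses the same synthetic data and compares three trees: one with no limit, one whose limit is larger than the depth the tree actually reaches, and one limited to max_depth=3 (an arbitrary value chosen only for illustration). get_depth() reports the depth of the fitted tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same synthetic data as in the max_depth example
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0, random_state=42)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

# No depth limit: the tree grows until the training data is separated
full_tree = DecisionTreeClassifier(random_state=42).fit(train_X, train_y)
full_depth = full_tree.get_depth()

# A limit above the depth actually reached does not constrain the tree
loose_tree = DecisionTreeClassifier(
    max_depth=full_depth + 10, random_state=42).fit(train_X, train_y)

# A tight limit (3 is an illustrative assumption) stops growth early
shallow_tree = DecisionTreeClassifier(
    max_depth=3, random_state=42).fit(train_X, train_y)

# Print depth, training accuracy, and test accuracy for each tree
for name, tree in [("no limit", full_tree),
                   ("loose limit", loose_tree),
                   ("max_depth=3", shallow_tree)]:
    print(name, tree.get_depth(),
          tree.score(train_X, train_y), tree.score(test_X, test_y))
```

A large gap between training accuracy and test accuracy for the unrestricted tree is the sign of the poor generalization described in the text.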

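random_state

random_state does more than make training results reproducible; it is directly involved in how the decision tree is trained.
At each split, the tree looks for the feature and threshold that best explain the classification of the data at that point. There are many such candidates, so random numbers generated from random_state decide which candidates are considered and which one is chosen when several are equally good.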
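
As a rough illustration of this, the sketch below fits the same tree three times and compares the learned split features and thresholds. The setting max_features=2 is an assumption made for the example: it makes the tree examine only two randomly drawn features per split, so the influence of the seed is easier to see. The helper fit_tree is hypothetical and exists only for this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0, random_state=42)


def fit_tree(seed):
    # max_features=2 is an illustrative choice, not a recommended setting
    tree = DecisionTreeClassifier(max_features=2, random_state=seed)
    return tree.fit(X, y)


tree_a, tree_b, tree_c = fit_tree(0), fit_tree(0), fit_tree(1)

# Same seed: the split feature and threshold of every node are identical
same_seed_identical = (
    np.array_equal(tree_a.tree_.feature, tree_b.tree_.feature)
    and np.array_equal(tree_a.tree_.threshold, tree_b.tree_.threshold))
print(same_seed_identical)

# Different seed: the tree may have a different shape (compare node counts)
print(tree_a.tree_.node_count, tree_c.tree_.node_count)
```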

n_estimators

A defining feature of a random forest is that the result is decided by a majority vote among many simple decision trees. n_estimators sets the number of those trees.

# Import modules
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
%matplotlib inline

# Generate the data
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0, random_state=42)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

# Range of n_estimators values (1 to 20)
n_estimators_list = list(range(1, 21))

# Write your code below
# Create an empty list to store the accuracy
accuracy = []

# Train a model for each value of n_estimators
for n_estimators in n_estimators_list:
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    model.fit(train_X, train_y)
    accuracy.append(model.score(test_X, test_y))

# End of the code to edit
# Plot the graph
plt.plot(n_estimators_list, accuracy)
plt.title("accuracy by changing n_estimators")
plt.xlabel("n_estimators")
plt.ylabel("accuracy")
plt.show()
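
To make the role of n_estimators more concrete, here is a small sketch, separate from the exercise, that inspects the individual trees of a fitted forest through its estimators_ attribute. scikit-learn combines the trees by averaging their predicted class probabilities and taking the class with the highest mean; with fully grown trees this behaves like the majority vote described above. The value n_estimators=15 is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0, random_state=42)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

# 15 trees (illustrative value)
forest = RandomForestClassifier(n_estimators=15, random_state=42)
forest.fit(train_X, train_y)

# The forest holds exactly n_estimators fitted decision trees
n_trees = len(forest.estimators_)
print(n_trees)

# Average the class probabilities of the individual trees by hand
mean_proba = np.mean(
    [tree.predict_proba(test_X) for tree in forest.estimators_], axis=0)
manual_pred = forest.classes_[np.argmax(mean_proba, axis=1)]

# The hand-made vote should agree with the forest's own prediction
agrees = np.array_equal(manual_pred, forest.predict(test_X))
print(agrees)
```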

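max_depth

Because a random forest builds multiple simple decision trees, it also accepts the parameters that control those trees.

max_depth is one such decision tree parameter, but in a random forest it is typically set to a smaller value than for a single decision tree.
Since the algorithm takes a majority vote over the classifications of the simple trees, no single tree needs to classify the data exactly. Narrowing the features each tree focuses on and taking a broader view keeps training efficient while maintaining high accuracy.

random_state

random_state is also an important parameter in a random forest.
As the name random forest suggests, random numbers are involved in many places in this method, such as how each decision tree splits the data and which features it uses. The parameter therefore does more than fix the result: the outcome of the analysis can differ from one value to another.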
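
The sketch below is one way to see what max_depth does inside a forest. It fits one forest without a depth limit and one with max_depth=3, then reads the depth of every individual tree. Both n_estimators=20 and max_depth=3 are values assumed for illustration only; which depth works best depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0, random_state=42)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

# Forest with unrestricted trees
deep_forest = RandomForestClassifier(n_estimators=20, random_state=42)
deep_forest.fit(train_X, train_y)

# Forest whose trees are limited to depth 3 (illustrative value)
shallow_forest = RandomForestClassifier(
    n_estimators=20, max_depth=3, random_state=42)
shallow_forest.fit(train_X, train_y)

# Depth of each individual tree in the two forests
deep_depths = [tree.get_depth() for tree in deep_forest.estimators_]
shallow_depths = [tree.get_depth() for tree in shallow_forest.estimators_]
print(max(deep_depths), max(shallow_depths))

# Test accuracy of both forests
print(deep_forest.score(test_X, test_y), shallow_forest.score(test_X, test_y))
```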

# Import modules
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
%matplotlib inline

# Generate the data
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0, random_state=42)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

# Range of r_seeds values (0 to 99)
r_seeds = list(range(100))

# Write your code below
# Create an empty list to store the accuracy
accuracy = []

# Train a model for each value of random_state
for seed in r_seeds:
    model = RandomForestClassifier(random_state=seed)
    model.fit(train_X, train_y)
    accuracy.append(model.score(test_X, test_y))

# Plot the graph
plt.plot(r_seeds, accuracy)
plt.xlabel("seed")
plt.ylabel("accuracy")
plt.title("accuracy by changing seed")
plt.show()
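
How much the accuracy moves with the seed can also be summarized numerically instead of read off the plot. The following sketch is a reduced version of the experiment, assuming only 10 seeds and n_estimators=20 to keep it quick; it reports the minimum, maximum, and standard deviation of the test accuracy, and confirms that refitting with the same seed reproduces the same score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0, random_state=42)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)


def forest_score(seed):
    # n_estimators=20 is an illustrative value chosen to keep the run short
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    return model.fit(train_X, train_y).score(test_X, test_y)


# Test accuracy for seeds 0 to 9
scores = np.array([forest_score(seed) for seed in range(10)])
print(scores.min(), scores.max(), scores.std())

# Fitting again with seed 0 gives the same accuracy as before
repeat_score = forest_score(0)
print(repeat_score == scores[0])
```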

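n_neighbors (k-NN)

n_neighbors is the value of k in k-NN.
In other words, it sets how many similar data points are used when predicting a result.

If n_neighbors is too large, the points chosen as similar vary widely in how similar they actually are, so categories that occupy only a narrow region may not be classified well.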

# Import modules
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
%matplotlib inline

# Generate the data
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0, random_state=42)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

# Range of n_neighbors values (1 to 10)
k_list = list(range(1, 11))

# Write your code below
# Create an empty list to store the accuracy
accuracy = []

# Train a model for each value of n_neighbors
for k in k_list:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(train_X, train_y)
    accuracy.append(model.score(test_X, test_y))

# Plot the graph
plt.plot(k_list, accuracy)
plt.xlabel("n_neighbors")
plt.ylabel("accuracy")
plt.title("accuracy by changing n_neighbors")
plt.show()
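
To show what "using k similar data points" means in practice, the sketch below looks up the k nearest training points for each test point with kneighbors() and takes a majority vote over their labels by hand. It assumes k=5 (an odd number, so a vote between the two classes 0 and 1 cannot tie) and the default uniform weighting; it is an illustration, not part of the exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0, random_state=42)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

k = 5  # illustrative value; odd so that a two-class vote cannot tie
model = KNeighborsClassifier(n_neighbors=k).fit(train_X, train_y)

# Indices of the k nearest training points for every test point
neighbor_idx = model.kneighbors(test_X, return_distance=False)
print(neighbor_idx.shape)

# Labels (0 or 1) of those neighbors, then a manual majority vote
neighbor_labels = train_y[neighbor_idx]
manual_pred = (neighbor_labels.sum(axis=1) > k / 2).astype(int)

# The manual vote should match the classifier's prediction
agrees = np.array_equal(manual_pred, model.predict(test_X))
print(agrees)
```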

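Automating tuning

Grid search

So far we have introduced the commonly used parameters of the main methods.
However, changing all of these parameters by hand and checking the result each time takes time and effort.

Instead, we can specify a range for each parameter and let the computer find the parameter set that gave the best result.
There are two main approaches: grid search and random search.

In grid search, you explicitly list several candidate values for each hyperparameter you want to tune, build every combination of them as a parameter set, and evaluate the model on each one to find the best parameter set for the model.

Because the candidates are listed explicitly, grid search suits parameters whose values are not mathematically continuous, such as strings, integers, or True/False.
However, the parameter sets cover every combination of the candidates, so it is not suited to tuning many parameters at once.

Grid search tunes the parameters over candidate values that we specify.
Random search instead specifies the range of values each parameter can take, then repeatedly evaluates the model on parameter sets drawn at random in order to find the best parameter set.
The range is specified by giving a probability distribution for the parameter.

The distributions in the scipy.stats module are commonly used for this purpose.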
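
The growth in the number of parameter sets can be counted without training anything. The sketch below uses scikit-learn's ParameterGrid, which enumerates the combinations of a candidate dictionary. The gamma candidates are hypothetical values added only to show what happens when a third parameter joins the grid.

```python
from sklearn.model_selection import ParameterGrid

# Two tuned parameters: 5 values of C times 4 kernels
svm_grid = {
    "C": [0.01, 0.1, 1.0, 10, 100],
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
}
n_two_params = len(ParameterGrid(svm_grid))

# Adding one more parameter multiplies the number of candidates
# (the gamma values are illustrative assumptions)
svm_grid_with_gamma = dict(svm_grid, gamma=[0.01, 0.1, 1.0])
n_three_params = len(ParameterGrid(svm_grid_with_gamma))

print(n_two_params, n_three_params)

# Each candidate is a plain dictionary of parameter values
first_candidates = list(ParameterGrid(svm_grid))[:3]
for candidate in first_candidates:
    print(candidate)
```

Every candidate is also fitted once per cross-validation fold, so the total number of model fits is the number of candidates multiplied by the number of folds.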

The code is as follows.
from scipy import stats
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
train_X, test_X, train_y, test_y = train_test_split(
    data.data, data.target, random_state=42)

# Set the candidate parameter values
param = {
    # Uniform random variable from 0 to 100 (every value is equally likely).
    # scipy.stats.uniform takes loc and scale and covers [loc, loc + scale].
    "C": stats.uniform(loc=0.0, scale=100.0),
    # Values that need not be drawn at random are given as a list
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "random_state": [42]
}

# Build the estimator (its parameters are not tuned here)
svm = SVC()

# Run the random search
clf = RandomizedSearchCV(svm, param)
clf.fit(train_X, train_y)

# Get the result of the parameter search
best_param = clf.best_params_

# For comparison, also train the untuned SVM and compare the accuracy
svm.fit(train_X, train_y)
print("Untuned SVM: {}\nTuned SVM: {}\nBest parameters: {}".format(
    svm.score(test_X, test_y), clf.score(test_X, test_y), best_param))
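
It can help to look at what a scipy.stats distribution actually produces before handing it to a random search. The sketch below draws samples from a uniform distribution on [0, 100], which scipy.stats.uniform expresses through loc and scale rather than lower and upper bounds. For comparison it also draws from scipy.stats.loguniform, a distribution the original code does not use; it is shown here only as an assumption about what might suit a parameter such as C that is often varied across orders of magnitude.

```python
import numpy as np
from scipy import stats

# Uniform on [loc, loc + scale] = [0, 100]
c_uniform = stats.uniform(loc=0.0, scale=100.0)
uniform_samples = c_uniform.rvs(size=1000, random_state=42)

# Log-uniform on [0.01, 100]: each order of magnitude is equally likely
# (an alternative shown for comparison, not used in the original code)
c_loguniform = stats.loguniform(0.01, 100.0)
log_samples = c_loguniform.rvs(size=1000, random_state=42)

# Range and median of each sample
print(uniform_samples.min(), uniform_samples.max(), np.median(uniform_samples))
print(log_samples.min(), log_samples.max(), np.median(log_samples))
```

With the uniform distribution most draws are large values of C, while the log-uniform distribution spends as many draws between 0.01 and 0.1 as between 10 and 100.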

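Use the values below to run a parameter search with grid search.
The methods to tune are SVM, a decision tree, and a random forest.
For SVM, use SVC() and choose kernel from "linear", "rbf", "poly", and "sigmoid", and C from 0.01, 0.1, 1.0, 10, and 100. random_state may be fixed.
For the decision tree, tune max_depth over the integers from 1 to 10 and random_state over the integers from 0 to 100.
For the random forest, tune n_estimators over the integers from 10 to 100, max_depth over the integers from 1 to 10, and random_state over the integers from 0 to 100.
Print each model's name and its accuracy on test_X, test_y in the form
model name
accuracy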

# Import the required modules
import requests
import io
import pandas as pd
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# Preprocess the data
vote_data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data"
s = requests.get(vote_data_url).content
vote_data = pd.read_csv(io.StringIO(s.decode('utf-8')), header=None)
vote_data.columns = ['Class Name',
                     'handicapped-infants',
                     'water-project-cost-sharing',
                     'adoption-of-the-budget-resolution',
                     'physician-fee-freeze',
                     'el-salvador-aid',
                     'religious-groups-in-schools',
                     'anti-satellite-test-ban',
                     'aid-to-nicaraguan-contras',
                     'mx-missile',
                     'immigration',
                     'synfuels-corporation-cutback',
                     'education-spending',
                     'superfund-right-to-sue',
                     'crime',
                     'duty-free-exports',
                     'export-administration-act-south-africa']
# Encode every column (class label and votes) as integers
label_encode = preprocessing.LabelEncoder()
vote_data_encode = vote_data.apply(lambda x: label_encode.fit_transform(x))
X = vote_data_encode.drop('Class Name', axis=1)
Y = vote_data_encode['Class Name']
train_X, test_X, train_y, test_y = train_test_split(X, Y, random_state=50)

# Write your code below
# Put the model names, model objects, and parameter grids into lists
# so that they can be processed in a single for loop
models_name = ["SVM", "Decision tree", "Random forest"]
models = [SVC(), DecisionTreeClassifier(), RandomForestClassifier()]
# max_depth covers 1 to 10; random_state covers 0 to 99.
# n_estimators is searched only over 10 to 19 here, a narrower range
# than the 10 to 100 given in the exercise text.
params = [{"C": [0.01, 0.1, 1.0, 10, 100],
           "kernel": ["linear", "rbf", "poly", "sigmoid"],
           "random_state": [42]},
          {"max_depth": list(range(1, 11)),
           "random_state": list(range(100))},
          {"n_estimators": list(range(10, 20)),
           "max_depth": list(range(1, 11)),
           "random_state": list(range(100))}]

# Run a grid search for each model and print its name and test accuracy
for name, model, param in zip(models_name, models, params):
    clf = GridSearchCV(model, param)
    clf.fit(train_X, train_y)
    print(name)
    print(clf.score(test_X, test_y))
    print()
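
Besides the test accuracy, a fitted GridSearchCV object also records which parameter set won. The sketch below is a smaller, self-contained variant that assumes the built-in breast cancer dataset and a deliberately tiny grid, so that it runs quickly and without a download. It reads best_params_ and best_score_ and checks that score() is evaluated with the refitted best estimator.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
train_X, test_X, train_y, test_y = train_test_split(
    data.data, data.target, random_state=42)

# A deliberately small grid (illustrative values)
param = {"max_depth": list(range(1, 6)), "random_state": [42]}
clf = GridSearchCV(DecisionTreeClassifier(), param)
clf.fit(train_X, train_y)

# Winning parameter set and its mean cross-validated accuracy
print(clf.best_params_)
print(clf.best_score_)

# score() uses the best estimator, refitted on the whole training set
refit_score = clf.best_estimator_.score(test_X, test_y)
print(clf.score(test_X, test_y), refit_score)
```

Note that best_score_ is measured by cross-validation on the training data, so it generally differs from the accuracy on test_X, test_y.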
