Supervised learning

The kernel parameter: a hyperparameter of nonlinear SVM

December 14, 2017 by 河副太智

kernel is one of the most important hyperparameters of a nonlinear SVM. It specifies the function used to transform the input data into a form that is easier to classify.

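It can take five string values: "linear", "rbf", "poly", "sigmoid", and "precomputed". The default is "rbf".
linear gives a linear SVM and is almost the same as LinearSVC. Unless you have a special reason, use LinearSVC.
rbf and poly work somewhat like projecting the data into a higher-dimensional space where the classes become easier to separate. rbf tends to give comparatively high accuracy, so the default rbf is usually the recommended choice.
precomputed is used when the kernel (Gram) matrix has already been computed. In that case fit receives that matrix instead of the raw feature matrix.
sigmoid uses the kernel tanh(gamma * <x, x'> + coef0). Despite the name, this is not the same procedure as a logistic regression model.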
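
As a rough illustration of how the choice of kernel changes the result, the sketch below fits SVC once per built-in kernel on synthetic data from make_gaussian_quantiles. The sample size of 500, the random seeds, and the name kernel_scores are assumptions made for this example, and the scores depend heavily on the data, so treat the comparison as indicative only.

```python
from sklearn.datasets import make_gaussian_quantiles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data that is not linearly separable (classes form rings).
# n_samples=500 is an arbitrary choice to keep the example fast.
X, y = make_gaussian_quantiles(n_samples=500, n_features=2, random_state=42)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

# Fit one model per built-in kernel and record the test accuracy.
kernel_scores = {}
for kernel in ["linear", "rbf", "poly", "sigmoid"]:
    model = SVC(kernel=kernel)
    model.fit(train_X, train_y)
    kernel_scores[kernel] = model.score(test_X, test_y)

for kernel, score in kernel_scores.items():
    print(f"{kernel:>8}: {score:.3f}")
```

On ring-shaped data like this, rbf would be expected to score well above linear, but the ranking can look different on other datasets.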

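Between LinearSVC and SVC(kernel="linear"), the purpose-built LinearSVC is the better choice: it is implemented on liblinear rather than libsvm and scales better to large numbers of samples.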
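
To make the comparison concrete, here is a minimal sketch that fits both classifiers on the same data. The dataset (two clusters from make_blobs with hand-picked centers), max_iter=10000, and the variable names are assumptions for this example only. The two models are not guaranteed to match exactly, because LinearSVC uses the squared hinge loss by default while SVC uses the hinge loss.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

# Two well-separated clusters; the centers are arbitrary values for this sketch.
X, y = make_blobs(n_samples=500, centers=[[-2, -2], [2, 2]],
                  cluster_std=1.0, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

# LinearSVC: liblinear-based, squared hinge loss by default.
linear_svc = LinearSVC(max_iter=10000, random_state=0)
linear_svc.fit(train_X, train_y)

# SVC(kernel="linear"): libsvm-based, hinge loss.
kernel_svc = SVC(kernel="linear")
kernel_svc.fit(train_X, train_y)

linear_svc_acc = linear_svc.score(test_X, test_y)
kernel_svc_acc = kernel_svc.score(test_X, test_y)
print("LinearSVC           :", linear_svc_acc)
print('SVC(kernel="linear"):', kernel_svc_acc)
```

On easy data like this the two accuracies should be close. The practical difference shows up mainly in training time as the number of samples grows.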

The C parameter: a hyperparameter of nonlinear SVM

December 14, 2017 by 河副太智

To handle data that is not linearly separable, use the SVC class from sklearn.svm.
SVC also has a parameter C.

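C is also called the regularization coefficient or the penalty coefficient.
It gets these names because it controls how much misclassification is tolerated during training: a small C tolerates more errors (stronger regularization), while a large C penalizes them more heavily.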

import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_gaussian_quantiles
from sklearn.model_selection import train_test_split
# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline

# Generate the data
X, y = make_gaussian_quantiles(n_samples=1250, n_features=2, random_state=42)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

# Values of C to try (here 1e-5, 1e-4, 1e-3, 0.01, 0.1, 1, 10, 100, 1000, 10000)
C_list = [10 ** i for i in range(-5, 5)]

# Empty lists that collect the accuracies for the plot
train_accuracy = []
test_accuracy = []

# Train one model per value of C and record its accuracy
for C in C_list:
    model = SVC(C=C)
    model.fit(train_X, train_y)

    train_accuracy.append(model.score(train_X, train_y))
    test_accuracy.append(model.score(test_X, test_y))

# Draw the graph
# semilogx() plots with a logarithmic (base-10) x-axis
plt.semilogx(C_list, train_accuracy, label="accuracy of train_data")
plt.semilogx(C_list, test_accuracy, label="accuracy of test_data")
plt.title("accuracy with changing C")
plt.xlabel("C")
plt.ylabel("accuracy")
plt.legend()
plt.show()
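
One way to see what "tolerating misclassification" means in practice is to count the support vectors for different values of C. The sketch below is an illustration under assumptions: the three C values, the reduced sample size, and the name support_counts are arbitrary choices for this example. n_support_ is the attribute of a fitted SVC that holds the number of support vectors per class.

```python
from sklearn.datasets import make_gaussian_quantiles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Smaller sample than the plot example, purely to keep this quick.
X, y = make_gaussian_quantiles(n_samples=500, n_features=2, random_state=42)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

support_counts = {}
for C in [0.01, 1, 100]:
    model = SVC(C=C)
    model.fit(train_X, train_y)
    # n_support_ holds the number of support vectors for each class
    support_counts[C] = int(model.n_support_.sum())
    print(f"C={C}: {support_counts[C]} support vectors, "
          f"test accuracy {model.score(test_X, test_y):.3f}")
```

With a small C, many training points are allowed to sit inside the margin or on the wrong side of it, so more of them typically end up as support vectors. With a large C the count usually drops.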

Splitting data (supervised learning)

December 13, 2017 by 河副太智

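Training and testing on the entire dataset is pointless: the score would only show how well the model fits data it has already seen.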

Use the train_test_split function to split the data.
train_test_split splits the data randomly, in the proportion you specify.

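The split produces four arrays:

X_train: feature matrix for training (data such as "alcohol content", "density", and "citric acid")
X_test: feature matrix for testing
y_train: target variable for training ("tasty wine" or "not so tasty wine")
y_test: target variable for testing

train_test_split takes the following arguments: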

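First argument: the feature matrix X
Second argument: the target variable y
test_size=: the fraction of the data to hold out for testing
test_size=0.3 sets aside 30% of the data as test data
random_state=: the seed for the random number generator used when splitting
0 is specified here so that the same split is returned on every run. This is for study purposes; it is not normally specified.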

from sklearn.model_selection import train_test_split
# X and y are assumed to already hold the dataset
(X_train, X_test,
 y_train, y_test) = train_test_split(
    X, y, test_size=0.3, random_state=0,
)
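
The call above assumes X and y already exist. A self-contained sketch of the same split follows, using randomly generated stand-in data (the array sizes, the NumPy generator seed, and the names X_train2, X_test2 and same_split are assumptions for this example, not the wine data itself). It shows the resulting sizes for test_size=0.3 and that a fixed random_state reproduces the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples with 3 features and a binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)  # 70 training rows, 30 test rows

# The same random_state reproduces exactly the same split.
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)
same_split = np.array_equal(X_test, X_test2)
print(same_split)
```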

Other ways to split

① When the feature data and the target data are already cleanly separated

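from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris_dataset = load_iris()  # assumed; the original omits the loading step
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)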

② When a DataFrame has several columns and one of them is the target

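from sklearn.model_selection import train_test_split
# df is assumed to be an already-loaded DataFrame with a 'Survived' column
train_X = df.drop('Survived', axis=1)  # everything except the target column becomes the features
train_y = df.Survived  # only the target column goes into train_y

# Then split train_X and train_y 7:3 into training and test sets
(train_X, test_X, train_y, test_y) = train_test_split(train_X, train_y, test_size=0.3, random_state=666)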
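
For a version that runs without an existing df, the sketch below builds a tiny hypothetical table first. Only the 'Survived' column name comes from the text above; the other columns, the values, and the names features and target are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature table; only the 'Survived' column name comes from the text.
df = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    "Age":      [22, 38, 26, 35, 35, 27, 54, 2, 27, 14],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

features = df.drop("Survived", axis=1)  # every column except the target
target = df["Survived"]                 # the target column only

train_X, test_X, train_y, test_y = train_test_split(
    features, target, test_size=0.3, random_state=666)

print(train_X.shape, test_X.shape)  # 7 rows for training, 3 for testing
# Row labels stay aligned between features and target after the split.
print(train_X.index.equals(train_y.index))
```

Using separate names for the unsplit data (features, target) is a small design choice: it avoids reusing train_X for both the full table and the training portion, which makes it harder to rerun a notebook cell and accidentally split already-split data.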
