河副太智

Slicing does not work on a DataFrame

January 16, 2018 by 河副太智

TypeError: unhashable type: 'slice'

If this error appears and slicing the DataFrame fails, use iloc as shown below.

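import numpy as np
import pandas as pd

train = pd.read_csv('train.csv', header=0, dtype={'Age': np.float64})
test = pd.read_csv('test.csv', header=0, dtype={'Age': np.float64})
full_data = [train, test]

# Example DataFrame; the values do not have to be random
dataset = pd.DataFrame(np.random.rand(10, 10))

# iloc takes positional [rows, columns] slices
y = train.iloc[:, 1:]  # all rows, every column from the second onward
X = train.iloc[:, 0]   # all rows, first column only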
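The difference between plain brackets and iloc can be tried on a small frame. This is a minimal sketch: the frame `df` and its column names are made up for illustration and stand in for train.csv, and the exact exception raised by the failing line depends on the pandas and Python versions, so it is caught generically.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv: 5 rows, 4 columns
df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=["a", "b", "c", "d"])

first_col = df.iloc[:, 0]    # Series: all rows, first column
rest_cols = df.iloc[:, 1:]   # DataFrame: all rows, columns b..d
by_label = df.loc[:, "b":]   # same selection by label (label slices include the end)

# df[...] expects column labels, so a (rows, columns) slice pair is rejected.
try:
    df[:, 1:]
    plain_indexing_failed = False
except Exception:  # the exact exception type depends on the pandas/Python version
    plain_indexing_failed = True

print(first_col.shape, rest_cols.shape, plain_indexing_failed)
```

iloc selects by integer position and loc selects by label, so either one accepts a rows/columns pair of slices where plain brackets do not.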

Replacing specified strings with other strings

January 16, 2018 by 河副太智

for dataset in full_data:
    # Group the uncommon titles under a single 'Rare' label
    dataset['Title'] = dataset['Title'].replace(
        ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major',
         'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    # Normalize spelling variants of the same title
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

# Survival rate per title
print(train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean())
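The separate replace calls can also be written as one dictionary. The sketch below uses a made-up Series named `titles` rather than the real data, and the `mapping` dictionary is only an illustration of the idea.

```python
import pandas as pd

# Hypothetical sample of titles, not taken from the real data
titles = pd.Series(["Mr", "Mlle", "Ms", "Mme", "Dr", "Miss"])

# One mapping handles every substitution in a single replace() call
mapping = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs", "Dr": "Rare"}
merged = titles.replace(mapping)

print(merged.tolist())
```

Series.replace matches whole cell values rather than substrings, so replacing 'Ms' leaves 'Mrs' untouched.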

Merging elements with the same meaning into one

January 16, 2018 by 河副太智

'Ms' and 'Miss' mean the same thing, so values like these are merged into one.

O’Driscoll, Miss. Bridget

Samaan, Mr. Youssef

Arnold-Franchi, Mrs. Josef (Josefine Franchi)

Panula, Master. Juha Niilo

Nosworthy, Mr. Richard Cater

Harper, Mrs. Henry Sleeper (Myna Haxtun)

Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkinson)

Ostby, Mr. Engelhart Cornelius

Woolner, Mr. Hugh

import re

def get_title(name):
    # The title is the word that follows a space and ends with a period
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)

print(pd.crosstab(train['Title'], train['Sex']))
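The same pattern can be applied to a whole column with the vectorized string methods. This is a sketch on four of the sample names listed above; the Series name `names` is an assumption for illustration.

```python
import pandas as pd

# A few of the sample names from this post
names = pd.Series([
    "O'Driscoll, Miss. Bridget",
    "Samaan, Mr. Youssef",
    "Arnold-Franchi, Mrs. Josef (Josefine Franchi)",
    "Panula, Master. Juha Niilo",
])

# Same regular expression as get_title: a space, letters, then a period
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

print(titles.tolist())
```

One difference to keep in mind: str.extract leaves NaN where nothing matches, while get_title returns an empty string.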

Running the code above prints the list of titles found in the names:

Sex       female  male
Title                 
Capt           0     1
Col            0     2
Countess       1     0
Don            0     1
Dr             1     6
Jonkheer       0     1
Lady           1     0
Major          0     2
Master         0    40
Miss         182     0
Mlle           2     0
Mme            1     0
Mr             0   517
Mrs          125     0
Ms             1     0
Rev            0     6
Sir            0     1

Merge the titles that have the same meaning

for dataset in full_data:
    # Group the uncommon titles under a single 'Rare' label
    dataset['Title'] = dataset['Title'].replace(
        ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major',
         'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    # Normalize spelling variants of the same title
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

# Survival rate per title
print(train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean())

Result

   Title  Survived
0  Master  0.575000
1    Miss  0.702703
2      Mr  0.156673
3     Mrs  0.793651
4    Rare  0.347826
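Because Survived is 0 or 1, the mean of each group is that group's survival rate, which is how the Survived column above should be read. A toy frame makes this easy to check by hand; the data below is invented for illustration and is not the Titanic data.

```python
import pandas as pd

# Hypothetical data: 3 'Mr' rows (1 survivor), 2 'Miss' rows, 1 'Mrs' row
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Mrs"],
    "Survived": [0, 0, 1, 1, 1, 1],
})

# The mean of a 0/1 column is the fraction of ones in each group
rates = toy.groupby("Title", as_index=False)["Survived"].mean()

print(rates)
```

With as_index=False the group key stays an ordinary column, which gives the same layout as the result table above.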

Listing a dataset's row count, columns, and dtypes

January 16, 2018 by 河副太智

train.info()


RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
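The non-null counts above imply the number of missing values per column: out of 891 rows, Age is missing 177, Cabin 687, and Embarked 2. The same figures can be obtained directly with isna(). The sketch below uses a small invented frame, so the column values are assumptions and not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count missing values per column and keep only the columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)

print(missing)
```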

Decision tree

January 16, 2018 by 河副太智

Decision tree

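from sklearn.metrics import accuracy_score, auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# df is assumed to be the preprocessed DataFrame (numeric features only)
train_X = df.drop('Survived', axis=1)
train_y = df.Survived
(train_X, test_X, train_y, test_y) = train_test_split(train_X, train_y, test_size=0.3, random_state=666)

# Decision tree
clf = DecisionTreeClassifier(random_state=0)
clf = clf.fit(train_X, train_y)
pred = clf.predict(test_X)

# Decision tree model scores
fpr, tpr, thresholds = roc_curve(test_y, pred, pos_label=1)
print(auc(fpr, tpr))
print(accuracy_score(test_y, pred))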
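The code above passes hard 0/1 predictions to roc_curve, so the curve has a single operating point. An AUC computed from predicted probabilities uses every threshold the model can produce. The sketch below shows both on synthetic data from make_classification, which is an assumption standing in for the Titanic features; max_depth=3 is likewise chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the real features)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=666)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# AUC from hard labels: only one threshold is evaluated
auc_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from the probability of class 1: all thresholds are evaluated
auc_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(auc_labels, auc_scores)
```

The two numbers usually differ, and the probability-based one is the figure normally reported as ROC AUC.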

Random forest

January 16, 2018 by 河副太智

Random forest

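from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, auc, roc_curve
from sklearn.model_selection import train_test_split

# df is assumed to be the preprocessed DataFrame (numeric features only)
train_X = df.drop('Survived', axis=1)
train_y = df.Survived
(train_X, test_X, train_y, test_y) = train_test_split(train_X, train_y, test_size=0.3, random_state=666)

# Random forest
clf = RandomForestClassifier(random_state=0)
clf = clf.fit(train_X, train_y)
pred = clf.predict(test_X)

# Random forest model scores
fpr, tpr, thresholds = roc_curve(test_y, pred, pos_label=1)
print(auc(fpr, tpr))
print(accuracy_score(test_y, pred))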

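criterion: the split criterion. Choose gini or entropy (the default is Gini impurity).
max_depth: the depth of the tree. Deeper trees overfit more easily, so set a suitable limit.
max_features: the number of features considered when searching for the best split.
min_samples_split: the minimum number of samples a node needs before it can be split.
random_state: the random seed. Without a fixed seed the model results change on every run.

Official documentation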
sklearn.tree.DecisionTreeClassifier — scikit-learn 0.19.1 documentation
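The parameters listed above can be tried together as follows. This is a sketch on synthetic data; the specific values (entropy, depth 3, 4 features, 10 samples, 20 trees) are arbitrary choices for illustration, not recommendations from the original post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the real training set
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

params = dict(
    n_estimators=20,       # number of trees (kept small for speed)
    criterion="entropy",   # split criterion: "gini" or "entropy"
    max_depth=3,           # cap on tree depth to limit overfitting
    max_features=4,        # features considered at each split
    min_samples_split=10,  # a node needs at least this many samples to split
    random_state=0,        # fixed seed for reproducible results
)
clf_a = RandomForestClassifier(**params).fit(X, y)
clf_b = RandomForestClassifier(**params).fit(X, y)

# Every tree respects the depth cap, and the fixed seed makes both fits agree
deepest = max(tree.get_depth() for tree in clf_a.estimators_)
same = bool((clf_a.predict(X) == clf_b.predict(X)).all())

print(deepest, same)
```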

Checking which features have the strongest influence

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

features = train_X.columns
importances = clf.feature_importances_
# Ascending order, so the most important feature is drawn at the top
indices = np.argsort(importances)

plt.figure(figsize=(6, 6))
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.show()
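When a plot is not needed, the same ranking can be read as a table by pairing each importance with its column name. The sketch below fits a small forest on synthetic data; the feature names f0 to f4 are made up for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names f0..f4
X, y = make_classification(n_samples=300, n_features=5, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Label each importance with its feature and sort from most to least important
ranking = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)

print(ranking)
```

The importances are normalized to sum to 1, so each value can be read as a share of the total.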
