
Learning Log

Data Cleansing

Merging elements with the same meaning into one

January 16, 2018 by 河副 太智

'Ms' and 'Miss' mean the same thing, so titles like these are merged into a single value.

O’Driscoll, Miss. Bridget
Samaan, Mr. Youssef
Arnold-Franchi, Mrs. Josef (Josefine Franchi)
Panula, Master. Juha Niilo
Nosworthy, Mr. Richard Cater
Harper, Mrs. Henry Sleeper (Myna Haxtun)
Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkinson)
Ostby, Mr. Engelhart Cornelius
Woolner, Mr. Hugh

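import re

import pandas as pd

# Assumes `train` and `full_data` (a list of DataFrames that includes `train`)
# were loaded earlier from the Titanic CSV files.

def get_title(name):
    # A title is the run of letters after a space and before a period, e.g. "Mr".
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)

print(pd.crosstab(train['Title'], train['Sex']))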

 

Running the code above prints the list of titles found in the names, broken down by sex:

Sex       female  male
Title                
Capt           0     1
Col            0     2
Countess       1     0
Don            0     1
Dr             1     6
Jonkheer       0     1
Lady           1     0
Major          0     2
Master         0    40
Miss         182     0
Mlle           2     0
Mme            1     0
Mr             0   517
Mrs          125     0
Ms             1     0
Rev            0     6
Sir            0     1
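The extraction step can be tried without the Titanic CSV files. The following is a minimal sketch that applies the same regular expression to a few of the sample names listed at the top of this post; the `names` list and the use of `Counter` in place of `pd.crosstab` are assumptions made for illustration.

```python
import re
from collections import Counter

# Sample passenger names copied from the list at the top of this post.
names = [
    "O'Driscoll, Miss. Bridget",
    "Samaan, Mr. Youssef",
    "Arnold-Franchi, Mrs. Josef (Josefine Franchi)",
    "Panula, Master. Juha Niilo",
    "Woolner, Mr. Hugh",
]


def get_title(name):
    # A title is the run of letters after a space and before a period.
    match = re.search(r" ([A-Za-z]+)\.", name)
    return match.group(1) if match else ""


titles = [get_title(n) for n in names]
title_counts = Counter(titles)  # stand-in for the crosstab, counting titles only

print(titles)
print(title_counts)
```

A name with no period-terminated word returns an empty string, which is why the function falls back to `""`.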

 

Merge the titles that have the same meaning

for dataset in full_data:
    # Group the infrequent titles under a single 'Rare' label.
    dataset['Title'] = dataset['Title'].replace(
        ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major',
         'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    # Map the variant spellings onto their common equivalents.
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

print(train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean())

 

Result

   Title  Survived
0  Master  0.575000
1    Miss  0.702703
2      Mr  0.156673
3     Mrs  0.793651
4    Rare  0.347826
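The repeated `replace` calls can also be written as a single lookup table. This is only a sketch on a small hand-made Series; the `merge_map` dictionary and the sample values are hypothetical and cover just a few of the titles above.

```python
import pandas as pd

titles = pd.Series(["Mr", "Miss", "Mlle", "Ms", "Mme", "Dr", "Mrs", "Capt"])

# Hypothetical lookup table: each variant or infrequent title maps to its merged label.
merge_map = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs", "Dr": "Rare", "Capt": "Rare"}

# Values that are not keys of merge_map are left unchanged.
merged = titles.replace(merge_map)

print(merged.value_counts())
```

Keeping the mapping in one dictionary makes it easier to review which titles end up in which group.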

 

Filed Under: Data Cleansing

Data Cleaning

January 16, 2018 by 河副 太智

Data cleaning, data cleansing

Data Cleaning (Python)

df.columns = ['a','b','c'] # Renames columns
df.isnull() # Checks for null values, returns a Boolean DataFrame (pd.isnull(obj) is the function form)
df.notnull() # Opposite of df.isnull()
df.dropna() # Drops all rows that contain null values
df.dropna(axis=1) # Drops all columns that contain null values
df.dropna(axis=1,thresh=n) # Drops all columns that have fewer than n non-null values
df.fillna(x) # Replaces all null values with x
s.fillna(s.mean()) # Replaces all null values with the mean (mean can be replaced with most other summary methods, e.g. median)
s.astype(float) # Converts the datatype of the series to float
s.replace(1,'one') # Replaces all values equal to 1 with 'one'
s.replace([1,3],['one','three']) # Replaces all 1 with 'one' and 3 with 'three'
df.rename(columns=lambda x: x + 1) # Mass renaming of columns (x + 1 assumes integer column labels)
df.rename(columns={'old_name': 'new_name'}) # Selective renaming
df.set_index('column_one') # Changes the index
df.rename(index=lambda x: x + 1) # Mass renaming of index (x + 1 assumes an integer index)
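A few of the null-handling entries are easier to read against concrete data. The sketch below uses a small made-up DataFrame (the values and variable names are assumptions) to show what `dropna` and `fillna` keep and fill.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

# Only the last row has no nulls.
rows_without_nulls = df.dropna()
# Only column "c" has no nulls.
cols_without_nulls = df.dropna(axis=1)
# Columns need at least 2 non-null values to survive: "a" and "c" do, "b" does not.
cols_with_2_values = df.dropna(axis=1, thresh=2)
# The mean of column "a" ignores the NaN, so the gap is filled with 2.0.
filled_with_mean = df["a"].fillna(df["a"].mean())

print(cols_with_2_values)
print(filled_with_mean)
```

None of these calls modifies `df`; each returns a new object.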

 


Handling missing values with Imputer

December 20, 2017 by 河副 太智

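# Missing-value handling. Arguments: the value treated as missing (default 'NaN'),
# the strategy ('mean', 'median' or 'most_frequent'), and the axis (columns or rows).
# Imputer was removed in scikit-learn 0.22; sklearn.impute.SimpleImputer replaces it.
from sklearn.preprocessing import Imputer
med_imp = Imputer(missing_values=0, strategy='median', axis=0)
med_imp.fit(X.iloc[:, 1:6])
X.iloc[:, 1:6] = med_imp.transform(X.iloc[:, 1:6])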

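missing_values
Specifies which value is treated as missing. The default is 'NaN'.

strategy
Specifies one of 'mean', 'median', or 'most_frequent' (the mode).

axis
Specifies the direction of imputation: axis=0 imputes along columns, axis=1 along rows.

verbose
Controls the verbosity of the imputer (an integer, default 0).

copy
Specifies whether to work on a copy (True) or to modify the original data in place where possible (False).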
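On current scikit-learn versions the same idea can be sketched with `SimpleImputer` from `sklearn.impute`. This is an illustration on a made-up array, not the code from this post: `SimpleImputer` has no `axis` parameter and always imputes column by column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data in which 0 marks a missing measurement.
X = np.array([
    [1.0, 0.0, 3.0],
    [4.0, 5.0, 0.0],
    [7.0, 8.0, 9.0],
])

# Treat 0 as missing and fill it with the median of the remaining values in each column.
med_imp = SimpleImputer(missing_values=0, strategy="median")
X_filled = med_imp.fit_transform(X)

print(X_filled)
```

In the second column the median of 5 and 8 is used, and in the third the median of 3 and 9.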


Removing unneeded data: flag, dummy, drop

December 13, 2017 by 河副 太智

import io

import pandas as pd
import requests

# URL of the dataset
mush_data_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
s = requests.get(mush_data_url).content

# Convert the downloaded bytes into a DataFrame
mush_data = pd.read_csv(io.StringIO(s.decode('utf-8')), header=None)

# Name the columns (to make the data easier to work with).
# In the UCI attribute list, bruises comes before odor.
mush_data.columns = ["classes", "cap_shape", "cap_surface", "cap_color", "bruises", "odor",
                     "gill_attachment", "gill_spacing", "gill_size", "gill_color", "stalk_shape",
                     "stalk_root", "stalk_surface_above_ring", "stalk_surface_below_ring",
                     "stalk_color_above_ring", "stalk_color_below_ring", "veil_type", "veil_color",
                     "ring_number", "ring_type", "spore_print_color", "population", "habitat"]

# Convert categorical variables (such as colors, whose values have no numeric order)
# into dummy features (yes or no)
mush_data_dummy = pd.get_dummies(
    mush_data[['gill_color', 'gill_attachment', 'odor', 'cap_color']])

print(mush_data)
# Target variable: set a flag (1 for 'p', poisonous; 0 otherwise)
mush_data_dummy["flg"] = mush_data["classes"].map(
    lambda x: 1 if x == 'p' else 0)

# Explanatory variables and target variable
X = mush_data_dummy.drop("flg", axis=1)
Y = mush_data_dummy['flg']
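The same three steps (dummy features, flag, drop) can be followed on a tiny table without downloading anything. The four rows below are made up for illustration and are not taken from the mushroom dataset.

```python
import pandas as pd

# Made-up sample with the same kind of single-letter category codes.
df = pd.DataFrame({
    "classes": ["p", "e", "e", "p"],
    "odor": ["f", "n", "a", "f"],
    "cap_color": ["n", "y", "n", "w"],
})

# One yes/no column per category value, named <column>_<value>.
dummies = pd.get_dummies(df[["odor", "cap_color"]])

# Flag the target: 1 where classes is 'p', 0 otherwise.
dummies["flg"] = (df["classes"] == "p").astype(int)

# Split into explanatory variables and target variable.
X = dummies.drop("flg", axis=1)
y = dummies["flg"]

print(list(X.columns))
print(y.tolist())
```

Dropping `flg` from `X` keeps the target out of the features, which is the point of the final `drop` call.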

 

Tagged With: flg, dummy, dummy features, noise, noise removal, unneeded, unneeded data, exception, sorting out, delete, omit, omission

Counting characters with a dictionary: defaultdict(type) from the collections module

November 28, 2017 by 河副 太智

jisho = defaultdict(a type such as int or list)

# The variable jisho is now a dictionary

Example

from collections import defaultdict

# Missing keys start at int() == 0, so += 1 works without an existence check.
jisho = defaultdict(int)
lst = ["a", "b", "c", "b", "a", "e"]

for key in lst:
    jisho[key] += 1

print(jisho)
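Passing `list` instead of `int` gives every new key an empty list, which is handy for grouping. A minimal sketch, with a made-up word list:

```python
from collections import defaultdict

words = ["apple", "avocado", "banana", "blueberry", "cherry"]

# Missing keys start as an empty list, so append works on first use.
by_initial = defaultdict(list)
for word in words:
    by_initial[word[0]].append(word)

print(dict(by_initial))
```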

Example 2
from collections import defaultdict

# The string description
description = \
    "AFL intends to import  PVC-insulated  and  XLPE-insulated  bunched copper wire from Honduras.  " + \
    "The insulated wire will be used in the U.S. in the production of automotive wire harnesses. " + \
    "Copper strands, classifiable under subheading 7413.00, Harmonized Tariff System of the United States  " + \
    "(HTSUS); polyvinyl chloride (PVC) pellets, classifiable under subheading 3904.21, HTSUS; and " + \
    "cross-link polyethylene (XLPE) pellets, classifiable under subheading 3901.30, HTSUS, all of U.S. origin will be exported to Honduras.  " + \
    "The copper strands will be annealed in the U.S. and will be wound onto reels,  " + \
    "containing between six and 14 untwisted and non-bunched copper strands. In Honduras,  " + \
    "the strands are fed into a buncher, which arranges the strands in specific geometric orientations,  " + \
    "with specific twist lengths ( lay lengths ) based on the specifications of the end use application " + \
    "( lay plate ). The strands then enter the  compaction die  which sets the diameter and length of  " + \
    "the specific products, resulting in the creation of  bunched wire.  By using strands of various " + \
    "sizes and numbers of strands, different sizes of bunched wire are produced. It is " + \
    "stated that the size of the bunched wire determines its electrical current carrying capacity  " + \
    "and physical strength. The bunched wire will contain between seven and 104 strands, with American " + \
    "Wire Gauge (AWG) measurements varying between 10 and 22. It is stated that the AWG level identifies the  " + \
    "diameter, electrical current capacity and temperature  rating,  or maximum temperature levels within  " + \
    "which the wire can function. From the compaction die, the bunched wire enters the  bow , which rotates " + \
    "and twists the bunched wire, based on the needs of the given end use of the wire. " + \
    "Next, the bunched wire is wound onto a reel and prepared for insulation with either PVC or XLPE.  " + \
    "The compound is melted and pumped through the extruder line. In the extruder line, a  " + \
    "rotating wheel pulls the copper wire, and the compound is extruded over the wire.  " + \
    "If PVC is used, the material is cooled following a controlled cooling with water,  " + \
    "which solidifies the compound around the bunched wire. It is stated that PVC-insulated  " + \
    "wire has a lower temperature rating and is more likely to be used in applications  " + \
    "where it is subject to less intense heat. If XLPE is used, a steam vulcanization process  " + \
    "is required which causes a chemical reaction in the extruded XLPE, causing  " + \
    "crosslinking  in the underlying polymer chain. It is stated that  crosslinking"

# Define the defaultdict
moji = defaultdict(int)  # this becomes the dictionary

# Record the number of occurrences of each character
for key in description:
    moji[key] += 1

# Sort and print the top 10 elements
print(sorted(moji.items(), key=lambda x: x[1], reverse=True)[:10])

Result
[(' ', 428), ('e', 256), ('t', 170), ('i', 163), ('n', 153), ('s', 128), ('a', 124), ('r', 120), ('d', 100), ('o', 98)]
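For this particular task the standard library also offers `collections.Counter`, which is a different tool from the `defaultdict` used above but gives the same kind of ranking. A short sketch on a made-up sentence:

```python
from collections import Counter

text = "the bunched wire enters the bow"

# Counter tallies each character; most_common(n) returns the n highest counts.
char_counts = Counter(text)
top2 = char_counts.most_common(2)

# The same ranking written with sorted, as in the example above.
top2_sorted = sorted(char_counts.items(), key=lambda x: x[1], reverse=True)[:2]

print(top2)
```

`defaultdict(int)` remains the more general choice when the value being accumulated is something other than a simple count.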


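list sort, sorted: reordering

November 28, 2017 by 河副 太智

For complex sort conditions, the sorted function is a better fit than the list's sort method. sorted returns a new list and leaves the original untouched.

sorted(list to sort, key=function that returns the sort key, reverse=True or False)

Setting reverse to True sorts in descending order;
False sorts in ascending order.

To sort in ascending order by the second element: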

# Named pairs rather than list, so the built-in list type is not shadowed.
pairs = [
    [0, 2],
    [1, 8],
    [2, 10],
    [3, 6],
    [4, 18]
]

# Ascending order by the second element of each inner list.
sorted(pairs, key=lambda x: x[1])
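A short sketch of how `reverse` changes the result, and of how `sorted` differs from the in-place `list.sort` method. The data is the same as above; the variable names are chosen for illustration.

```python
pairs = [[0, 2], [1, 8], [2, 10], [3, 6], [4, 18]]

# sorted returns a new list; pairs itself keeps its original order.
ascending = sorted(pairs, key=lambda x: x[1])
descending = sorted(pairs, key=lambda x: x[1], reverse=True)

# list.sort accepts the same key and reverse arguments,
# but reorders the list in place and returns None.
in_place = [row[:] for row in pairs]
in_place.sort(key=lambda x: x[1])

print(ascending)
print(descending)
```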
 
