# Age and sex data
import pandas as pd
from pandas import DataFrame

df = DataFrame([[20, "F"], [22, "M"], [25, "M"], [27, "M"],
                [21, "F"], [23, "M"], [37, "F"], [31, "M"],
                [61, "F"], [45, "M"], [41, "F"], [32, "M"]],
               columns=["age", "sex"])
print(df)
"""
    age sex
0    20   F
1    22   M
2    25   M
3    27   M
4    21   F
5    23   M
6    37   F
7    31   M
8    61   F
9    45   M
10   41   F
11   32   M
"""
# Bin edges used to split the ages
bins = [18, 25, 35, 60, 100]
# Bin names
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]
# Binning
print(pd.cut(df.age, bins, labels=group_names))
"""
Labels in row order (the exact formatting depends on the pandas version):
[Youth, Youth, Youth, YoungAdult, Youth, Youth, MiddleAged, YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Categories (4): ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
"""
# Add the bins to df as a new column
df["bin"] = pd.cut(df.age, bins, labels=group_names)
print(df)
"""
    age sex         bin
0    20   F       Youth
1    22   M       Youth
2    25   M       Youth
3    27   M  YoungAdult
4    21   F       Youth
5    23   M       Youth
6    37   F  MiddleAged
7    31   M  YoungAdult
8    61   F      Senior
9    45   M  MiddleAged
10   41   F  MiddleAged
11   32   M  YoungAdult
"""
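One detail the binning above relies on is how pd.cut treats the bin edges. By default each interval is open on the left and closed on the right, so with these edges an age of 25 falls in Youth and an age of exactly 18 falls in no bin at all. The following is a minimal sketch of that behaviour; the `ages` values are made up for illustration and are not part of the data above.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or next to the bin edges
ages = pd.Series([18, 25, 26, 60, 61, 100])
bins = [18, 25, 35, 60, 100]
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]

# Default (right=True): intervals are (18, 25], (25, 35], (35, 60], (60, 100]
default_bins = pd.cut(ages, bins, labels=group_names)

# include_lowest=True also puts the lowest edge (18) into the first bin
with_lowest = pd.cut(ages, bins, labels=group_names, include_lowest=True)

# Count how many ages landed in each bin, in category order
counts = with_lowest.value_counts(sort=False)

print(default_bins.tolist())
print(with_lowest.tolist())
print(counts)
```

If the lowest edge should belong to the first bin, include_lowest=True is one option; passing right=False instead flips every interval to closed-left, open-right.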
reshape
# Reshape a 1-D array into a 2-D array (3 rows, 5 columns)
import numpy as np
a = np.arange(15)
b = a.reshape(3, 5)
print(b)
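reshape only works when the new shape holds exactly as many elements as the original array. As a small sketch of that rule: one dimension can be given as -1 so that NumPy infers it, and a shape that does not fit raises a ValueError. The shapes (3, -1) and (4, 4) below are picked only for illustration.

```python
import numpy as np

a = np.arange(15)

# -1 asks NumPy to infer that dimension: 15 elements / 3 rows = 5 columns
b = a.reshape(3, -1)
print(b.shape)  # (3, 5)

# 4 x 4 = 16 does not match the 15 elements, so reshape raises ValueError
try:
    a.reshape(4, 4)
    reshape_failed = False
except ValueError as exc:
    reshape_failed = True
    print("reshape failed:", exc)
```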
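arange
# arange generates an array of consecutive integers, from 0 up to (but not including) the given number; when printed, the values are shown without commas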
import numpy as np
a = np.arange(15)
print(a)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
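arange also accepts a start, a stop and a step, much like Python's built-in range. The sketch below shows that form, and how to get an ordinary comma-separated Python list with tolist(). The variable names `evens` and `as_list` are only illustrative.

```python
import numpy as np

# start=2, stop=11 (exclusive), step=2
evens = np.arange(2, 11, 2)
print(evens)  # [ 2  4  6  8 10]

# tolist() converts the array into a regular Python list
as_list = np.arange(5).tolist()
print(as_list)  # [0, 1, 2, 3, 4]
```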
Mapping (similar to VLOOKUP)
import pandas as pd
from pandas import DataFrame

attri_data1 = {'ID': ['100', '101', '102', '103', '104', '106', '108', '110', '111', '113'],
               'city': ['Tokyo', 'Osaka', 'Kyoto', 'Hokkaido', 'Tokyo', 'Tokyo', 'Osaka', 'Kyoto', 'Hokkaido', 'Tokyo'],
               'birth_year': [1990, 1989, 1992, 1997, 1982, 1991, 1988, 1990, 1995, 1981],
               'name': ['Hiroshi', 'Akiko', 'Yuki', 'Satoru', 'Steeve', 'Mituru', 'Aoi', 'Tarou', 'Suguru', 'Mitsuo']}
attri_data_frame1 = DataFrame(attri_data1)
attri_data_frame1
Add a second dictionary
city_map = {'Tokyo': 'Kanto',
            'Hokkaido': 'Hokkaido',
            'Osaka': 'Kansai',
            'Kyoto': 'Kansai'}
city_map
# Look up each value in the city column of attri_data_frame1 in city_map
# and store the matching values in a new column.
# Where a value has no entry in the dictionary, the result is NaN.
attri_data_frame1['region'] = attri_data_frame1['city'].map(city_map)
attri_data_frame1
Output
A region column has been added, with each value determined by city.
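The NaN case mentioned above can be shown with a value that is missing from the dictionary. This is a minimal sketch: 'Nagoya' and the "Unknown" placeholder are made up for illustration and do not appear in the original data.

```python
import pandas as pd

city_map = {'Tokyo': 'Kanto',
            'Hokkaido': 'Hokkaido',
            'Osaka': 'Kansai',
            'Kyoto': 'Kansai'}

# 'Nagoya' is a hypothetical city with no entry in city_map
cities = pd.Series(['Tokyo', 'Osaka', 'Nagoya'])

# map() looks each value up in the dictionary; a missing key gives NaN
region = cities.map(city_map)
print(region)

# One way to handle the gaps: replace NaN with a placeholder label
region_filled = region.fillna('Unknown')
print(region_filled.tolist())  # ['Kanto', 'Kansai', 'Unknown']
```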
Dictionary {[…]}
a = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
     "year": [2001, 2002, 2001, 2008, 2006],
     "amount": [1, 4, 5, 6, 3]}
print(a)
{'fruits': ['apple', 'orange', 'banana', 'strawberry', 'kiwifruit'], 'year': [2001, 2002, 2001, 2008, 2006], 'amount': [1, 4, 5, 6, 3]}
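A dictionary of equal-length lists like this one is the same shape of data that the DataFrame examples in these notes are built from. As a rough sketch, the keys become column names and each list becomes a column. The names `fruit_df` and `total_2001` are only illustrative.

```python
import pandas as pd

a = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
     "year": [2001, 2002, 2001, 2008, 2006],
     "amount": [1, 4, 5, 6, 3]}

# Plain dictionary access: look up a list by its key, then index into it
first_fruit = a["fruits"][0]
print(first_fruit)  # apple

# Each key becomes a column, each list becomes that column's values
fruit_df = pd.DataFrame(a)
print(fruit_df)

# Example query: total amount for rows where year is 2001 (1 + 5)
total_2001 = fruit_df.loc[fruit_df["year"] == 2001, "amount"].sum()
print(total_2001)  # 6
```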
Removing duplicate data
Remove rows that duplicate an earlier row.
import pandas as pd
from pandas import DataFrame

dupli_data = DataFrame({'col1': [1, 1, 2, 3, 4, 4, 6, 6],
                        'col2': ['a', 'b', 'b', 'b', 'c', 'c', 'b', 'b']})
dupli_data
duplicated compares whole rows (every column) and returns True for each row that repeats an earlier row, False otherwise.
dupli_data.duplicated()
0    False
1    False
2    False
3    False
4    False
5     True
6    False
7     True
dtype: bool
drop_duplicates removes the duplicated rows.
dupli_data.drop_duplicates()
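drop_duplicates returns a new DataFrame and keeps the first occurrence of each repeated row by default. Its subset and keep parameters control which columns are compared and which occurrence survives. The sketch below uses the same data; the choice of col2 and keep="last" is only for illustration.

```python
import pandas as pd

dupli_data = pd.DataFrame({'col1': [1, 1, 2, 3, 4, 4, 6, 6],
                           'col2': ['a', 'b', 'b', 'b', 'c', 'c', 'b', 'b']})

# Default: compare all columns, keep the first occurrence of each row
deduped = dupli_data.drop_duplicates()
print(deduped.index.tolist())  # [0, 1, 2, 3, 4, 6]

# Compare only col2 and keep the last occurrence of each value
by_col2 = dupli_data.drop_duplicates(subset='col2', keep='last')
print(by_col2.index.tolist())  # [0, 5, 7]

# The original DataFrame is unchanged unless the result is assigned back
print(len(dupli_data))  # 8
```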