
Study Log

Programming

Scraping an XML RSS feed

November 30, 2017 by 河副 太智

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.customslegaloffice.com/fta/feed/"
xml = urllib.request.urlopen(url)

soup = BeautifulSoup(xml, "lxml")
titles = soup.find_all("title")  # extract only the <title> elements

for title in titles:
    print(title.text)  # .text returns only the element's contents, not the <title> tags themselves
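The same idea can be tried offline. The following is a minimal sketch that swaps BeautifulSoup for the standard library's xml.etree.ElementTree and parses a made-up feed string (sample_feed and its contents are assumptions for illustration, not the real feed). Iterating over the <item> elements also skips the channel-level <title>, which find_all("title") would include.

```python
import xml.etree.ElementTree as ET

# Made-up feed contents, for illustration only
sample_feed = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First post</title>
      <pubDate>Thu, 30 Nov 2017 09:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <pubDate>Wed, 29 Nov 2017 09:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(sample_feed)

# Collect (title, publication date) for each <item>, ignoring the channel title
entries = [
    (item.findtext("title"), item.findtext("pubDate"))
    for item in root.iter("item")
]

for title, pub_date in entries:
    print(pub_date, title)
```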

Filed Under: Scraping

Reading HTML

November 30, 2017 by 河副 太智

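import urllib.request

url = "http://www.customslegaloffice.com/fta"

# Open the URL with urlopen
html = urllib.request.urlopen(url)
# read() takes the number of bytes to read; decode the result as UTF-8.
# errors="replace" guards against a multi-byte character cut off at byte 500.
print(html.read(500).decode("utf-8", errors="replace"))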
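Because read(500) counts bytes rather than characters, the cut can land in the middle of a multi-byte UTF-8 character. The following is a small sketch of that situation using an in-memory string instead of a network response (the sample text and the 4-byte cut are assumptions chosen to trigger the problem).

```python
text = "日本語のページ"        # each of these characters takes 3 bytes in UTF-8
data = text.encode("utf-8")

chunk = data[:4]              # 3 bytes of the first character + 1 byte of the second

try:
    chunk.decode("utf-8")     # strict decoding rejects the incomplete character
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With errors="replace" the incomplete bytes become U+FFFD instead of raising
lenient = chunk.decode("utf-8", errors="replace")

print(strict_ok, lenient)
```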

Example 1: Scraping Yahoo News

import urllib.request
from bs4 import BeautifulSoup

url = "https://news.yahoo.co.jp"
html = urllib.request.urlopen(url)

soup = BeautifulSoup(html, "lxml")

# Collect every <p class="ttl"> element (the headline entries)
topics = soup.find_all("p", class_="ttl")

for topic in topics:
    print(topic.text)

Example 2: Getting the Bitcoin price

import urllib.request
from bs4 import BeautifulSoup
import csv
import time
from datetime import datetime

url = "https://coinmarketcap.com/currencies/bitcoin-cash/"

with open("bitcoin.csv", "w", newline="") as f:
    writer = csv.writer(f, lineterminator="\n")

    for i in range(15):
        # Fetch the page again on every pass so that each row holds a fresh price
        html = urllib.request.urlopen(url)
        soup = BeautifulSoup(html, "lxml")

        # Store the timestamp and price as a list and write it to the CSV
        price = soup.find_all("span", class_="text-large2", id="quote_price")
        value = price[0].text
        writer.writerow([datetime.now(), value])

        # Pause for 2 seconds
        time.sleep(2)
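The polling-and-logging part of this example can be exercised without network access. The following is a minimal sketch, assuming a hypothetical helper log_prices that takes the price-fetching step as a function argument; the canned prices are made up, and an in-memory buffer stands in for bitcoin.csv.

```python
import csv
import io
from datetime import datetime

def log_prices(fetch_price, out, samples, pause=lambda: None):
    """Write `samples` rows of (timestamp, price) to the file-like object `out`."""
    writer = csv.writer(out, lineterminator="\n")
    for _ in range(samples):
        writer.writerow([datetime.now().isoformat(), fetch_price()])
        pause()  # a real run would pass something like lambda: time.sleep(2)

# Stand-in for the scraping step: canned, made-up values instead of a live page
fake_prices = iter(["1,400.10", "1,401.25", "1,399.80"])
buffer = io.StringIO()
log_prices(lambda: next(fake_prices), buffer, samples=3)

# Read the CSV text back to check what was written
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
for timestamp, price in rows:
    print(timestamp, price)
```

Passing the fetch step in as a function keeps the loop itself independent of the page layout, which tends to change over time.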

Target tags for the Yahoo scraping in Example 1

<li>
<div>
<p class="ttl"><a href="https://news.yahoo.co.jp/pickup/6263011" onmousedown="this.href='https://rdsig.yahoo.co.jp/_ylt=A2Rifg0JaR9aimsAUHAEnf57/RV=2/RE=1512094345/RH=cmRzaWcueWFob28uY28uanA-/RB=tJQRd1AC72Lz8jGDGZOagZ_Lbl0-/RU=aHR0cHM6Ly9uZXdzLnlhaG9vLmNvLmpwL3BpY2t1cC82MjYzMDExAA--/RK=0/RS=8TztB1eAvSrIMPzmJpmHQep2GB8-'">海底に人 墜落空自ヘリ乗員か<span class="icPhoto">写真</span></a></p>
</div>
</li>
<li>
<div>
<p class="ttl"><a href="https://news.yahoo.co.jp/pickup/6263008" onmousedown="this.href='https://rdsig.yahoo.co.jp/_ylt=A2Rifg0JaR9aimsAUXAEnf57/RV=2/RE=1512094345/RH=cmRzaWcueWFob28uY28uanA-/RB=tJQRd1AC72Lz8jGDGZOagZ_Lbl0-/RU=aHR0cHM6Ly9uZXdzLnlhaG9vLmNvLmpwL3BpY2t1cC82MjYzMDA4AA--/RK=0/RS=9uov33ITmDdzaj9DqnJQ7RhyFBE-'">火星15 米攻撃能力なお疑問<span class="icPhoto">写真</span></a></p>
</div>
</li>
<li>
<div>
<p class="ttl"><a href="https://news.yahoo.co.jp/pickup/6263017" onmousedown="this.href='https://rdsig.yahoo.co.jp/_ylt=A2Rifg0JaR9aimsAUnAEnf57/RV=2/RE=1512094345/RH=cmRzaWcueWFob28uY28uanA-/RB=tJQRd1AC72Lz8jGDGZOagZ_Lbl0-/RU=aHR0cHM6Ly9uZXdzLnlhaG9vLmNvLmpwL3BpY2t1cC82MjYzMDE3AA--/RK=0/RS=hpkwmxO3YnpTdJcl8hgYbThc.vI-'">賃金より給食高く 障害者訴え<span class="icPhoto">写真</span><span class="icNew">new</span></a></p>
</div>
</li>
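The markup above can be used to try the find_all call from Example 1 without fetching the live page. The following is a minimal sketch, assuming a shortened copy of two of the entries (the onmousedown attributes are dropped and a <ul> wrapper is added) and Python's built-in "html.parser" instead of lxml. It also shows that .text includes the text of the nested <span> labels, whereas the first string inside the link is the headline alone.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup above (tracking attributes removed for brevity)
sample = """
<ul>
<li><div><p class="ttl"><a href="https://news.yahoo.co.jp/pickup/6263011">海底に人 墜落空自ヘリ乗員か<span class="icPhoto">写真</span></a></p></div></li>
<li><div><p class="ttl"><a href="https://news.yahoo.co.jp/pickup/6263017">賃金より給食高く 障害者訴え<span class="icPhoto">写真</span><span class="icNew">new</span></a></p></div></li>
</ul>
"""

soup = BeautifulSoup(sample, "html.parser")

results = []
for topic in soup.find_all("p", class_="ttl"):
    link = topic.find("a")
    headline = next(link.stripped_strings)  # first text node: the headline only
    results.append((headline, link["href"], topic.text))

for headline, href, full_text in results:
    print(headline, href, full_text)
```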

Filed Under: Scraping

Character encodings: UTF-8 and Shift JIS

November 30, 2017 by 河副 太智

A string (file) written in UTF-8:

私の名前は山田太郎です。 ("My name is Taro Yamada.")

The same bytes decoded as Shift JIS:

遘√�蜷榊燕縺ッ螻ア逕ー螟ェ驛弱〒縺吶

Big5-E is an encoding used for Traditional Chinese.
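The garbling shown above can be reproduced in Python. The following is a minimal sketch that encodes the sample sentence as UTF-8 and then decodes the bytes with the wrong codec; errors="replace" is an assumption made here so that undecodable bytes turn into U+FFFD instead of raising an exception.

```python
text = "私の名前は山田太郎です。"
utf8_bytes = text.encode("utf-8")

# Wrong: interpret the UTF-8 bytes as Shift JIS
garbled = utf8_bytes.decode("shift_jis", errors="replace")

# Right: decode with the same encoding that was used to write the bytes
restored = utf8_bytes.decode("utf-8")

print(garbled)
print(restored)
```

The bytes themselves never change; only the choice of decoder decides whether the text is readable.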

Filed Under: Python Basics

Binning: finding which numeric range each value falls into

November 29, 2017 by 河副 太智

import pandas as pd
from pandas import DataFrame

# Age and sex data
df = DataFrame([[20,"F"],[22,"M"],[25,"M"],[27,"M"],[21,"F"],[23,"M"],[37,"F"],[31,"M"],[61,"F"],[45,"M"],[41,"F"],[32,"M"]], columns=["age", "sex"])
print(df)
"""
   age sex
0   20   F
1   22   M
2   25   M
3   27   M
4   21   F
5   23   M
6   37   F
7   31   M
8   61   F
9   45   M
10  41   F
11  32   M
"""

# Bin edges used to split the values
bins = [18, 25, 35, 60, 100]
# Bin names
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]
# Bin the ages (the exact layout of this output depends on the pandas version)
print(pd.cut(df.age, bins, labels=group_names))
"""
Categorical: 
[Youth, Youth, Youth, YoungAdult, Youth, Youth, MiddleAged, YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Levels (4): Index(['Youth', 'YoungAdult', 'MiddleAged', 'Senior'], dtype=object)
"""

# Add the bins to df as a new column
df["bin"] = pd.cut(df.age, bins, labels=group_names)
print(df)
"""
    age sex         bin
0    20   F       Youth
1    22   M       Youth
2    25   M       Youth
3    27   M  YoungAdult
4    21   F       Youth
5    23   M       Youth
6    37   F  MiddleAged
7    31   M  YoungAdult
8    61   F      Senior
9    45   M  MiddleAged
10   41   F  MiddleAged
11   32   M  YoungAdult
"""
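In the table above, age 25 lands in "Youth" rather than "YoungAdult" because pd.cut by default treats each bin as open on the left and closed on the right, for example (18, 25]. The following is a plain-Python sketch of that rule using the standard library's bisect module; bin_label is a hypothetical helper written for illustration, not part of pandas.

```python
from bisect import bisect_left

bins = [18, 25, 35, 60, 100]
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]

def bin_label(value):
    """Return the label of the right-closed interval (left, right] containing value."""
    i = bisect_left(bins, value) - 1
    if 0 <= i < len(group_names):
        return group_names[i]
    return None  # outside every interval, where pd.cut would give NaN

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
labels = [bin_label(a) for a in ages]
print(labels)

# Edge cases: the lowest edge is excluded, the highest edge is included
print(bin_label(18), bin_label(100), bin_label(101))
```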

Filed Under: Pandas

reshape

November 29, 2017 by 河副 太智

# Reshape a one-dimensional array into a two-dimensional array
import numpy as np

a = np.arange(15)
b = a.reshape(3, 5)

print(b)

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]
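A short follow-up sketch, under the same setup: reshape only works when the new shape holds exactly the same number of elements, and one dimension can be given as -1 so that NumPy works it out.

```python
import numpy as np

a = np.arange(15)

b = a.reshape(3, -1)   # -1 asks NumPy to infer the column count (15 / 3 = 5)
flat = b.reshape(-1)   # back to one dimension

try:
    a.reshape(4, 4)    # 16 slots cannot hold 15 elements
    mismatch_raises = False
except ValueError:
    mismatch_raises = True

print(b.shape, flat.shape, mismatch_raises)
```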

Filed Under: Numpy

arange

November 29, 2017 by 河副 太智

# arange creates an array of consecutive integers; when printed, the elements are separated by spaces rather than commas

import numpy as np
a = np.arange(15)

print(a)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
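np.arange also accepts a start value and a step, much like Python's built-in range. The following is a brief sketch of those forms, plus tolist() for the case where an ordinary comma-separated Python list is wanted (the variable names are only illustrative).

```python
import numpy as np

print(np.arange(15))         # 0 up to, but not including, 15
print(np.arange(5, 15))      # start at 5
print(np.arange(0, 15, 3))   # every third value

evens = np.arange(0, 10, 2)
as_list = evens.tolist()     # convert to a plain Python list
print(as_list)
```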

Filed Under: Numpy
