BeautifulSoup

BeautifulSoup makes it easy to quickly scrape content from web pages. Here are two examples.

Electricity prices from Tocom: https://www.tocom.or.jp/market/kobetu/east_base_elec.html

The page has three blocks (Current, Night, and Day sessions). Each block sits under an h3 and contains two tables: the first gives the session name and date, and the second gives the prices. The first is a bare table made up of table rows, while the second has a thead element.

To parse the entire page, we loop through each h3 and use find_next_sibling to get the two tables.

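import bs4
import requests

url = 'https://www.tocom.or.jp/market/kobetu/east_base_elec.html'

soup = bs4.BeautifulSoup(requests.get(url).content,
                         features='html.parser')

tables = {}  # session name -> [info rows, (header, price rows)]

h3 = soup.find('h3')

while h3 is not None:
    name = h3.contents[0].strip()

    # The two tables for this block are the next table siblings of the h3.
    table0 = h3.find_next_sibling('table')
    table1 = table0.find_next_sibling('table')

    tables[name] = [parse_rows(table0),
                    parse_html_table_with_header(table1)]

    h3 = h3.find_next_sibling('h3')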
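To see the sibling traversal on its own, here is a minimal sketch that runs against an inline HTML string instead of the live page. The markup and the helper name tables_by_heading are made up for illustration; only the h3-then-two-tables layout is taken from the description above.

```python
import bs4

# Hypothetical miniature of the page layout: each h3 is followed by two tables.
HTML = """
<div>
  <h3>Current Session</h3>
  <table><tr><td>Trade Date: Oct 16, 2019</td></tr></table>
  <table><thead><tr><th>Month</th></tr></thead>
         <tbody><tr><td>Oct 2019</td></tr></tbody></table>
  <h3>Night Session</h3>
  <table><tr><td>Trade Date: Oct 15, 2019</td></tr></table>
  <table><thead><tr><th>Month</th></tr></thead>
         <tbody><tr><td>Nov 2019</td></tr></tbody></table>
</div>
"""


def tables_by_heading(html):
    """Map each h3's text to the pair of tables that follow it."""
    soup = bs4.BeautifulSoup(html, features='html.parser')
    found = {}
    h3 = soup.find('h3')
    while h3 is not None:
        name = h3.get_text(strip=True)
        # find_next_sibling skips over whitespace strings between elements.
        table0 = h3.find_next_sibling('table')
        table1 = (table0.find_next_sibling('table')
                  if table0 is not None else None)
        found[name] = (table0, table1)
        h3 = h3.find_next_sibling('h3')
    return found


sections = tables_by_heading(HTML)
for name, (info, prices) in sections.items():
    print(name, '|', info.find('td').get_text(), '|', prices.find('td').get_text())
```

Note that find_next_sibling only looks at elements with the same parent, so this approach assumes the h3 and its tables are not wrapped in separate containers.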

To parse a table we look for tr elements and pull the td elements out of each one. Before doing that, we check that the object we were given supports find_all by calling hasattr. This is a quick and dirty way to skip the text nodes that sit between the table rows.

def parse_rows(x):
    rows = []

    if hasattr(x, 'find_all'):
        for row in x.find_all('tr'):
            cols = row.find_all('td')
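            # Keep the stripped text of each cell, dropping empty cells.
            cols = [c.text.strip() for c in cols]
            this_row = [c for c in cols if c]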
            if cols:
                rows.append(this_row)
    return rows
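The reason for the hasattr check is easier to see by listing the direct children of a small table. The snippet below is a sketch on made-up markup; the isinstance test against bs4.Tag is an alternative I am adding for comparison, not something the script above uses.

```python
import bs4

html = "<table>\n<tr><td>a</td></tr>\n<tr><td>b</td></tr>\n</table>"
table = bs4.BeautifulSoup(html, features='html.parser').find('table')

# Iterating over a tag yields its direct children, including the
# whitespace-only strings that sit between the rows.
kinds = [type(child).__name__ for child in table]

# The quick and dirty filter: strings do not have a find_all attribute.
via_hasattr = [child.name for child in table if hasattr(child, 'find_all')]

# A more explicit alternative: keep only Tag instances.
via_isinstance = [child.name for child in table if isinstance(child, bs4.Tag)]

print(kinds)
print(via_hasattr, via_isinstance)
```

Both filters should pick out the same children here; the isinstance version just states the intent more directly.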

To parse a table with a header, we parse the rows as before and also read the column names from the th elements inside thead.

def parse_html_table_with_header(t):
    rows = []

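    # Children of the table may be tags (thead, tbody) or bare strings;
    # parse_rows returns [] for anything that has no data rows.
    for bits in t:
        rows.extend(parse_rows(bits))

    header = [h.text.strip()
              for h in t.find('thead').find_all('th')]
    return (header, rows)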
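One thing to watch: t.find('thead') returns None when a table has no thead, so the header lookup raises AttributeError on a bare table. If that matters for your pages, a more defensive lookup could look like the sketch below. The helper name header_cells and the sample markup are hypothetical.

```python
import bs4


def header_cells(table):
    """Return stripped th text, preferring thead but tolerating its absence."""
    thead = table.find('thead')
    # Fall back to searching the whole table when there is no thead.
    scope = thead if thead is not None else table
    return [th.get_text(strip=True) for th in scope.find_all('th')]


with_thead = bs4.BeautifulSoup(
    "<table><thead><tr><th> Month </th><th>Open</th></tr></thead>"
    "<tbody><tr><td>Oct 2019</td><td>-</td></tr></tbody></table>",
    features='html.parser').find('table')

bare = bs4.BeautifulSoup(
    "<table><tr><td>Trade Date</td></tr></table>",
    features='html.parser').find('table')

print(header_cells(with_thead))
print(header_cells(bare))
```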

Once we have all the tables it is straightforward to convert them into Pandas DataFrames. See the full source for how to do this: https://github.com/carlohamalainen/playground/blob/master/python/beautiful_soup_4/tocom_kobetu_prices.py
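As a rough sketch of that conversion (the linked script may do it differently), a (header, rows) pair maps directly onto the DataFrame constructor. The header and rows below are a hand-typed subset in the same shape, and the numeric coercion step is my own addition.

```python
import pandas as pd

# Hypothetical values in the (header, rows) shape returned by
# parse_html_table_with_header.
header = ['Month', 'Last Settlement Price', 'Open']
rows = [['Oct 2019', '9.73', '-'],
        ['Nov 2019', '9.16', '-']]

df = pd.DataFrame(rows, columns=header)

# Scraped cells are strings. Coerce the numeric columns, turning
# placeholders such as '-' into NaN.
for col in ['Last Settlement Price', 'Open']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

print(df)
```

Leaving the cells as strings is also reasonable if the goal is only to print the table as it appears on the page.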

Sample output:

$ python tocom_kobetu_prices.py

https://www.tocom.or.jp/market/kobetu/east_base_elec.html

Current Trading (16:30 - 15:15)
Trade Date: Oct 16, 2019
Prices in yen / kWh

       Month Last Settlement Price Open High Low Close Change Volume Settlement
0   Oct 2019                  9.73    -    -   -     -      -      -          -
1   Nov 2019                  9.16    -    -   -     -      -      -          -
2   Dec 2019                 10.15    -    -   -     -      -      -          -
3   Jan 2020                 10.71    -    -   -     -      -      -          -
4   Feb 2020                 10.72    -    -   -     -      -      -          -
5   Mar 2020                  9.28    -    -   -     -      -      -          -
6   Apr 2020                  9.05    -    -   -     -      -      -          -
7   May 2020                  9.02    -    -   -     -      -      -          -
8   Jun 2020                  9.04    -    -   -     -      -      -          -
9   Jul 2020                 10.24    -    -   -     -      -      -          -
10  Aug 2020                 10.07    -    -   -     -      -      -          -
11  Sep 2020                  9.11    -    -   -     -      -      -          -
12  Oct 2020                  8.97    -    -   -     -      -      -          -
13  Nov 2020                  8.81    -    -   -     -      -      -          -
14  Dec 2020                  9.32    -    -   -     -      -      -          -

The next example scrapes stock prices from Yahoo. I used to use Alphavantage for daily closing prices, but their free API doesn’t seem to work at the moment. (Their API says that I have been rate limited, even though I was only querying it once a day for a handful of equities.)

Luckily for us, the historical pages on Yahoo have a JSON blob in the middle with all the info that we need, so we can avoid parsing HTML tables. We just grab the content after root.App.main and parse it as JSON:

import json
import re

import bs4
import requests

base = 'https://sg.finance.yahoo.com/quote'
ticker = 'BHP.AX'
url = f'{base}/{ticker}/history/'

x = requests.get(url).content
soup = bs4.BeautifulSoup(x, features='html.parser')

# https://stackoverflow.com/questions/39631386/how-to-understand-this-raw-html-of-yahoo-finance-when-retrieving-data-using-pyt
script = soup.find(
           "script",
           string=re.compile(r"root\.App\.main")).text

# parse_float=lambda x: x keeps prices as the original strings
# instead of converting them to floats.
j = json.loads(re.search(r"root\.App\.main\s+=\s+(\{.*\})",
                         script).group(1),
               parse_float=lambda x: x)
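The regex and the parse_float trick can be tried without hitting Yahoo at all. Here is a small sketch using only the standard library; the script string is a made-up stand-in for the real script element, which is far larger.

```python
import json
import re

# Hypothetical stand-in for the text of the script element.
script = ('root.App.now = 1;\n'
          'root.App.main = {"price": {"close": 33.90, "volume": 1200}};\n')

# The dots are escaped so they match literally, and the greedy .* runs
# to the last closing brace on the line, which captures the whole object.
match = re.search(r"root\.App\.main\s+=\s+(\{.*\})", script)

as_floats = json.loads(match.group(1))
as_strings = json.loads(match.group(1), parse_float=lambda s: s)

print(as_floats['price']['close'])
print(as_strings['price']['close'])
```

With the default settings the trailing zero is lost when the value becomes a float; passing parse_float keeps the text exactly as it appeared in the page. Integers are unaffected because parse_int is left alone.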

That’s it! The rest is just manipulating the JSON to get the fields of interest. Full source: https://github.com/carlohamalainen/playground/blob/master/python/beautiful_soup_4/yahoo_stock_prices.py

Sample run:

$ python yahoo_stock_prices.py


AUD to SGD: ('AUDSGD=X', 'Europe/London', '4:21PM BST', 0.9273)

BHP.AX AUD 2018-10-15 33.90
BHP.AX AUD 2018-10-16 33.66
BHP.AX AUD 2018-10-17 33.20
BHP.AX AUD 2018-10-18 33.10
BHP.AX AUD 2018-10-21 33.16
BHP.AX AUD 2018-10-22 32.79
BHP.AX AUD 2018-10-23 32.07
BHP.AX AUD 2018-10-24 30.80
BHP.AX AUD 2018-10-25 31.20