BeautifulSoup makes it easy to quickly scrape content from web pages. Here are two examples.
Electricity prices from Tocom: https://www.tocom.or.jp/market/kobetu/east_base_elec.html
The page has three blocks (Current, Night, and Day sessions). Each block sits under an h3 followed by two tables: the first gives the session name and date, and the second gives the prices. The first is a bare table consisting only of table rows, while the second has a header (a thead element).
To parse the entire page, we loop through each h3 and use find_next_sibling to get the two tables.
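A minimal sketch of that loop, assuming a much simplified page. SAMPLE_HTML and find_session_tables are made-up names for illustration, and the real page has far more markup around these elements:

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the real page layout.
SAMPLE_HTML = """
<div>
  <h3>Current Session</h3>
  <table><tr><td>Current Session info</td></tr></table>
  <table><tr><td>Current Session prices</td></tr></table>
  <h3>Night Session</h3>
  <table><tr><td>Night Session info</td></tr></table>
  <table><tr><td>Night Session prices</td></tr></table>
</div>
"""


def find_session_tables(soup):
    """Return a list of (heading text, info table, price table) tuples."""
    sessions = []
    for h3 in soup.find_all("h3"):
        # The first table after the heading holds the session name and date.
        info_table = h3.find_next_sibling("table")
        if info_table is None:
            continue
        # The table after that one holds the prices.
        price_table = info_table.find_next_sibling("table")
        sessions.append((h3.get_text(strip=True), info_table, price_table))
    return sessions


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sessions = find_session_tables(soup)
for name, info_table, price_table in sessions:
    print(name, "|", info_table.get_text(strip=True), "|", price_table.get_text(strip=True))
```

Note that find_next_sibling only looks at siblings of the h3, so this relies on the heading and its tables sharing a parent. If a block were missing a table, the search would silently pick up a table from the next block.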
To parse a table we can just look for any tr elements and then pull out all the td elements. We check whether each node we encounter supports find_all by calling hasattr. This is a quick and dirty way to skip over the text nodes between the table rows.
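Here is a rough sketch of the hasattr trick, assuming we iterate over the direct children of the table. SAMPLE_TABLE and parse_rows are hypothetical names, and the sketch uses the html.parser backend, which does not insert a tbody wrapper:

```python
from bs4 import BeautifulSoup

# Hypothetical bare table: rows only, no thead or tbody.
SAMPLE_TABLE = """
<table>
  <tr><td>Current Session</td><td>Trade Date: Oct 16, 2019</td></tr>
  <tr><td>Prices in yen / kWh</td></tr>
</table>
"""


def parse_rows(table):
    """Return the text of every td, grouped by table row."""
    rows = []
    for child in table.children:
        # The whitespace between rows shows up as a NavigableString, which
        # has no find_all method, so this skips it.
        if not hasattr(child, "find_all"):
            continue
        cells = [td.get_text(strip=True) for td in child.find_all("td")]
        if cells:
            rows.append(cells)
    return rows


table = BeautifulSoup(SAMPLE_TABLE, "html.parser").find("table")
rows = parse_rows(table)
print(rows)
```

Calling table.find_all("tr") directly would avoid the check altogether, since find_all only returns tags. The hasattr test matters when walking the children by hand.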
To parse a table with a header, we do the usual for the rows and also search for thead and all of the header cells inside it.
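As a sketch, assuming the header cells are th elements inside the thead (the sample markup and the name parse_table_with_header are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical price table with a header, cut down to three columns.
SAMPLE_PRICE_TABLE = """
<table>
  <thead>
    <tr><th>Month</th><th>Last Settlement Price</th><th>Open</th></tr>
  </thead>
  <tbody>
    <tr><td>Oct 2019</td><td>9.73</td><td>-</td></tr>
    <tr><td>Nov 2019</td><td>9.16</td><td>-</td></tr>
  </tbody>
</table>
"""


def parse_table_with_header(table):
    """Return (header, rows), each made of lists of cell text."""
    thead = table.find("thead")
    header = []
    if thead is not None:
        header = [th.get_text(strip=True) for th in thead.find_all("th")]

    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # The header row has th cells only, so it yields no td and is skipped.
        if cells:
            rows.append(cells)
    return header, rows


price_table = BeautifulSoup(SAMPLE_PRICE_TABLE, "html.parser").find("table")
header, rows = parse_table_with_header(price_table)
print(header)
print(rows)
```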
Once we have all the tables it is straightforward to convert them into Pandas DataFrames. See the full source for how to do this: https://github.com/carlohamalainen/playground/blob/master/python/beautiful_soup_4/tocom_kobetu_prices.py
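For a rough idea of the conversion without opening the full source, here is a sketch that assumes the header and rows have already been parsed into plain lists. The variable names are illustrative and this is not necessarily how the linked script does it:

```python
import pandas as pd

# Parsed header and rows, cut down to three columns and two rows.
header = ["Month", "Last Settlement Price", "Open"]
rows = [
    ["Oct 2019", "9.73", "-"],
    ["Nov 2019", "9.16", "-"],
]

df = pd.DataFrame(rows, columns=header)

# Every cell is still a string at this point; convert the numeric column.
df["Last Settlement Price"] = pd.to_numeric(df["Last Settlement Price"])

print(df)
```

Columns that contain the "-" placeholder are left as strings here. Turning those into missing values is a separate decision.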
$ python tocom_kobetu_prices.py https://www.tocom.or.jp/market/kobetu/east_base_elec.html

Current Trading (16:30 - 15:15)
Trade Date: Oct 16, 2019
Prices in yen / kWh

       Month  Last Settlement Price  Open  High  Low  Close  Change  Volume  Settlement
0   Oct 2019                   9.73     -     -    -      -       -       -           -
1   Nov 2019                   9.16     -     -    -      -       -       -           -
2   Dec 2019                  10.15     -     -    -      -       -       -           -
3   Jan 2020                  10.71     -     -    -      -       -       -           -
4   Feb 2020                  10.72     -     -    -      -       -       -           -
5   Mar 2020                   9.28     -     -    -      -       -       -           -
6   Apr 2020                   9.05     -     -    -      -       -       -           -
7   May 2020                   9.02     -     -    -      -       -       -           -
8   Jun 2020                   9.04     -     -    -      -       -       -           -
9   Jul 2020                  10.24     -     -    -      -       -       -           -
10  Aug 2020                  10.07     -     -    -      -       -       -           -
11  Sep 2020                   9.11     -     -    -      -       -       -           -
12  Oct 2020                   8.97     -     -    -      -       -       -           -
13  Nov 2020                   8.81     -     -    -      -       -       -           -
14  Dec 2020                   9.32     -     -    -      -       -       -           -
The next example is scraping stock prices from Yahoo. I used to use Alphavantage for daily closing prices but their free API doesn’t seem to work at the moment. (Their API says that I have been rate limited, but I was only querying it once a day for a handful of equities).
Luckily for us, the historical pages on Yahoo have a json blob in the middle with all the info that we need, so we can avoid parsing HTML tables. We just grab the content after root.App.main and parse it as json.
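One way to do the extraction is sketched below using only the standard library. SAMPLE_PAGE is a made-up miniature of such a page, the field names inside the blob are invented, and extract_app_main is a hypothetical helper rather than the code from the linked script:

```python
import json

# Hypothetical miniature page. The structure of the blob is invented.
SAMPLE_PAGE = """
<html><body><script>
(function (root) {
root.App.main = {"context": {"prices": [{"symbol": "BHP.AX", "close": 33.9}]}};
}(this));
</script></body></html>
"""


def extract_app_main(page_text, marker="root.App.main"):
    """Parse the json object that follows the marker in the page text."""
    start = page_text.index(marker)      # raises ValueError if the marker is absent
    brace = page_text.index("{", start)  # first opening brace after the marker
    # raw_decode parses one json value starting at the given index and
    # ignores whatever comes after it (here, the trailing JavaScript).
    data, _end = json.JSONDecoder().raw_decode(page_text, brace)
    return data


data = extract_app_main(SAMPLE_PAGE)
print(data["context"]["prices"][0]["close"])
```

Using raw_decode means we never have to work out where the blob ends, which is handy because the json is followed by more JavaScript on the same line.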
That’s it! The rest is just manipulating the json to get the fields of interest. Full source: https://github.com/carlohamalainen/playground/blob/master/python/beautiful_soup_4/yahoo_stock_prices.py
$ python yahoo_stock_prices.py
AUD to SGD: ('AUDSGD=X', 'Europe/London', '4:21PM BST', 0.9273)
BHP.AX AUD 2018-10-15 33.90
BHP.AX AUD 2018-10-16 33.66
BHP.AX AUD 2018-10-17 33.20
BHP.AX AUD 2018-10-18 33.10
BHP.AX AUD 2018-10-21 33.16
BHP.AX AUD 2018-10-22 32.79
BHP.AX AUD 2018-10-23 32.07
BHP.AX AUD 2018-10-24 30.80
BHP.AX AUD 2018-10-25 31.20