# BeautifulSoup

BeautifulSoup makes it easy to quickly scrape content from web pages. Here are two examples.

Electricity prices from Tocom: https://www.tocom.or.jp/market/kobetu/east_base_elec.html

The page has three blocks (Current, Night, and Day sessions). Each block sits under an h3: the first table gives the session name and date, and the second table gives the prices. The first table is a bare table consisting of table rows, while the second has a thead element.
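As a sketch of that layout, here is a minimal stand-in for the page structure (the markup below is illustrative, not the real TOCOM HTML):

```python
import bs4

# Illustrative markup only: an h3 followed by a bare table and a table
# with a thead, mimicking the structure described above.
html = """
<h3>Current Trading</h3>
<table>
  <tr><td>Trade Date</td><td>Oct 16, 2019</td></tr>
</table>
<table>
  <thead><tr><th>Month</th><th>Close</th></tr></thead>
  <tr><td>Oct 2019</td><td>9.73</td></tr>
</table>
"""

soup = bs4.BeautifulSoup(html, features='html.parser')
h3 = soup.find('h3')
table0 = h3.find_next_sibling('table')      # bare table: session name and date
table1 = table0.find_next_sibling('table')  # price table with a thead
```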

To parse the entire page, we loop through each h3 and use find_next_sibling to get the two tables.

```python
import bs4
import requests

soup = bs4.BeautifulSoup(requests.get(url).content,
                         features='html.parser')

tables = {}

h3 = soup.find('h3')

while h3 is not None:
    name = h3.contents[0].strip()

    table0 = h3.find_next_sibling('table')
    table1 = table0.find_next_sibling('table')

    tables[name] = [parse_rows(table0),
                    parse_html_table_with_header(table1)]

    h3 = h3.find_next_sibling('h3')
```


To parse a table we can just look for any tr elements and then pull out all the td elements. We check whether the object supports find_all by calling hasattr; this is a quick and dirty way to skip over textual elements between the table rows.

```python
def parse_rows(x):
    rows = []

    if hasattr(x, 'find_all'):
        for row in x.find_all('tr'):
            cols = row.find_all('td')
            this_row = [c.text.strip() for c in cols]
            if cols:
                rows.append(this_row)

    return rows
```


To parse a table with a header, we do the usual for the rows and also search for thead and all th elements.

```python
def parse_html_table_with_header(t):
    rows = []

    for bits in t:
        x = parse_rows(bits)
        if x != []:
            rows += x

    header = [h.text.strip()
              for h in t.find('thead').find_all('th')]

    return header, rows
```

Once we have all the tables it is straightforward to convert them into Pandas DataFrames. See the full source for how to do this: https://github.com/carlohamalainen/playground/blob/master/python/beautiful_soup_4/tocom_kobetu_prices.py
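The conversion itself is essentially one call; the header and rows below are made-up stand-ins for what the parsers return:

```python
import pandas as pd

# Illustrative values only; in the real script the header and rows come
# from parsing the HTML tables.
header = ['Month', 'Last Settlement Price']
rows = [['Oct 2019', '9.73'],
        ['Nov 2019', '9.16']]

df = pd.DataFrame(rows, columns=header)
```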

Sample output:

```
$ python tocom_kobetu_prices.py https://www.tocom.or.jp/market/kobetu/east_base_elec.html

Current Trading (16:30 - 15:15)
Trade Date: Oct 16, 2019
Prices in yen / kWh

       Month  Last Settlement Price Open High Low Close Change Volume Settlement
0   Oct 2019                   9.73    -    -   -     -      -      -          -
1   Nov 2019                   9.16    -    -   -     -      -      -          -
2   Dec 2019                  10.15    -    -   -     -      -      -          -
3   Jan 2020                  10.71    -    -   -     -      -      -          -
4   Feb 2020                  10.72    -    -   -     -      -      -          -
5   Mar 2020                   9.28    -    -   -     -      -      -          -
6   Apr 2020                   9.05    -    -   -     -      -      -          -
7   May 2020                   9.02    -    -   -     -      -      -          -
8   Jun 2020                   9.04    -    -   -     -      -      -          -
9   Jul 2020                  10.24    -    -   -     -      -      -          -
10  Aug 2020                  10.07    -    -   -     -      -      -          -
11  Sep 2020                   9.11    -    -   -     -      -      -          -
12  Oct 2020                   8.97    -    -   -     -      -      -          -
13  Nov 2020                   8.81    -    -   -     -      -      -          -
14  Dec 2020                   9.32    -    -   -     -      -      -          -
```

The next example scrapes stock prices from Yahoo. I used to use Alphavantage for daily closing prices, but their free API doesn't seem to work at the moment. (Their API says that I have been rate limited, even though I was only querying it once a day for a handful of equities.) Luckily for us, the historical pages on Yahoo have a JSON blob in the middle with all the info that we need, so we can avoid parsing HTML tables. We just grab the content after root.App.main and parse it as JSON:

```python
import json
import re

base = 'https://sg.finance.yahoo.com/quote'
ticker = 'BHP.AX'
url = f'{base}/{ticker}/history/'

x = requests.get(url).content
soup = bs4.BeautifulSoup(x, features='html.parser')

# https://stackoverflow.com/questions/39631386/how-to-understand-this-raw-html-of-yahoo-finance-when-retrieving-data-using-pyt
script = soup.find(
            "script",
            text=re.compile("root.App.main")).text

j = json.loads(re.search(r"root\.App\.main\s+=\s+(\{.*\})", script).group(1),
               parse_float=lambda x: x)
```

That's it! The rest is just manipulating the JSON to get the fields of interest. Full source: https://github.com/carlohamalainen/playground/blob/master/python/beautiful_soup_4/yahoo_stock_prices.py

Sample run:

```
$ python yahoo_stock_prices.py

AUD to SGD: ('AUDSGD=X', 'Europe/London', '4:21PM BST', 0.9273)

BHP.AX AUD 2018-10-15 33.90
BHP.AX AUD 2018-10-16 33.66
BHP.AX AUD 2018-10-17 33.20
BHP.AX AUD 2018-10-18 33.10
BHP.AX AUD 2018-10-21 33.16
BHP.AX AUD 2018-10-22 32.79
BHP.AX AUD 2018-10-23 32.07
BHP.AX AUD 2018-10-24 30.80
BHP.AX AUD 2018-10-25 31.20
```
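The root.App.main extraction can be sketched without hitting Yahoo by running the same regex against a toy script body; the key names in the toy blob below are made up, and the real Yahoo blob is much larger and nested differently:

```python
import json
import re

# A toy stand-in for the script element's text; the real root.App.main
# assignment on Yahoo's history pages is far bigger (and may change).
script = 'root.App.main = {"prices": [{"date": "2018-10-15", "close": 33.90}]};'

# Grab everything between the first '{' and the last '}' and parse as JSON.
j = json.loads(re.search(r"root\.App\.main\s+=\s+(\{.*\})", script).group(1))
close = j['prices'][0]['close']
```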