Python Tips
#6 Web Scraping


Michael Siebel

In [1]:
# Remove warnings
import warnings
warnings.filterwarnings('ignore')

%run ../HTML_Functions.ipynb 

Web scraping is an important skill to learn if you are interested in natural language processing. While scraping free-form text requires a strong understanding of HTML and CSS, scraping tables requires very little.

Website

Recently, I scraped a crosswalk between FIPS code, county, and state--which was super simple. The table looks like this:

pic1.jpg

Libraries

The libraries we will use for this are requests, which loads a webpage into Python, and BeautifulSoup, which parses it, enabling us to convert the HTML into another format such as a Pandas dataframe.

In [3]:
# Load Libraries
## Main Data Wrangling Library
import pandas as pd
## Get webpage
import requests
## Parse webpage
from bs4 import BeautifulSoup

Get Webpage

In [11]:
# Get Webpage
url = "https://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/home/?cid=nrcs143_013697"
req = requests.get(url)
# Parse Webpage
soup = BeautifulSoup(req.content, 'html.parser')
print(soup.prettify()[:555])
<!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <!-- Google Tag Manager -->
  <script>
   (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-MJ48553');
  </script>
  <!-- End Google Tag Manager -->
  <meta content="IE=edge,chrome=1" http-e

Inspect Webpage

Now we have to sift through the HTML code to find the table. That sounds hard, but if we go to the webpage, right-click on the page, and select "Inspect" (or hit Shift + Ctrl + I in Google Chrome), the browser will show us the HTML code. In particular, nearly all web browsers have a selector tool for finding an HTML element. Here it is in Google Chrome:

pic2.jpg
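If the inspector is inconvenient, you can also enumerate candidate tables directly in Python and check their class attributes. A minimal sketch on a toy HTML snippet (the snippet and class names here are illustrative stand-ins, not the actual NRCS page; assumes BeautifulSoup is installed):

```python
from bs4 import BeautifulSoup

# Toy HTML standing in for a real page (illustrative only)
html = """
<html><body>
  <table class="nav"><tr><td>Menu</td></tr></table>
  <table class="data"><tr><th>FIPS</th></tr></table>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
# List every table and its class attribute to find the one we want
for i, t in enumerate(soup.find_all('table')):
    print(i, t.get('class'))
```

On a real page this loop quickly reveals which table carries the data and what class (if any) distinguishes it.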

Grab HTML Element

The element we want is a &lt;table&gt; with class='data'. Using BeautifulSoup's find() we can isolate this element and then use Pandas' read_html() to convert it to a Pandas dataframe.
In [12]:
# Extract Data
table = soup.find('table', attrs={'class': 'data'})
# Convert from HTML to Pandas Dataframe
df = pd.read_html(str(table))[0]
display(df.head())
FIPS Name State
0 1001 Autauga AL
1 1003 Baldwin AL
2 1005 Barbour AL
3 1007 Bibb AL
4 1009 Blount AL
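One caveat worth checking: read_html parses the FIPS column as integers, so codes that begin with zero (Alabama's counties start at 01001) lose their leading digit, as the output above shows. If you need the standard five-digit strings, a quick fix is to zero-pad the column. A sketch on a stand-in dataframe (assumes pandas is installed):

```python
import pandas as pd

# Stand-in for the scraped dataframe (same columns as the example output)
df = pd.DataFrame({'FIPS': [1001, 1003],
                   'Name': ['Autauga', 'Baldwin'],
                   'State': ['AL', 'AL']})

# read_html returned FIPS as integers, dropping leading zeros;
# zero-pad back to the standard five-digit string form
df['FIPS'] = df['FIPS'].astype(str).str.zfill(5)
print(df['FIPS'].tolist())  # ['01001', '01003']
```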

Export as CSV

The final step is to export it as a CSV file.

In [ ]:
# Export as CSV
df.to_csv("FIPS_to_County.csv")

Conclusion

Government webpages and Wikipedia are often easy to scrape, as they do not carry many ads or much JavaScript. If the element you are scraping is a table, there is generally little parsing required, so this can be a useful skill to learn even if you are not interested in the ins and outs of web development.

Save Log

In [ ]:
from IPython.display import display, Javascript

display(Javascript(
    "document.body.dispatchEvent("
    "new KeyboardEvent('keydown', {key:'s', keyCode: 83, ctrlKey: true}"
    "))"
))

!jupyter nbconvert --to html_toc "Tip6_Webscrapping.ipynb"  --ExtractOutputPreprocessor.enabled=False --CSSHTMLHeaderPreprocessor.style=stata-dark