Python Tips
#6 Web Scraping


Michael Siebel

In [1]:
# Remove warnings
import warnings
warnings.filterwarnings('ignore')

%run ../HTML_Functions.ipynb 

Web scraping is an important skill to learn if you are interested in natural language processing. While scraping free-form text requires a strong understanding of HTML and CSS, scraping tables requires very little.

Website

Recently, I scraped a crosswalk between FIPS code, county, and state--which was super simple. The table looks like this:

pic1.jpg

Libraries

The libraries we will use for this are requests, which loads a webpage into Python, and BeautifulSoup, which parses it, enabling us to convert the HTML into another format such as a Pandas dataframe.

In [3]:
# Load Libraries
## Main Data Wrangling Library
import pandas as pd
## Get webpage
import requests
## Parse webpage
from bs4 import BeautifulSoup

Get Webpage

In [11]:
# Get Webpage
url = "https://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/home/?cid=nrcs143_013697"
req = requests.get(url)
# Parse Webpage
soup = BeautifulSoup(req.content, 'html.parser')
print(soup.prettify()[:555])
<!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <!-- Google Tag Manager -->
  <script>
   (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-MJ48553');
  </script>
  <!-- End Google Tag Manager -->
  <meta content="IE=edge,chrome=1" http-e

Inspect Webpage

Now we have to sift through the HTML code to find the table. That sounds hard, but if we go to the webpage, right-click on the page, and select "Inspect" (or hit Shift + Ctrl + I in Google Chrome), the browser will show us the HTML code. In particular, nearly all web browsers have a selector tool for finding an HTML element. Here it is in Google Chrome:

pic2.jpg
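If the inspector is inconvenient, you can also enumerate candidate tables directly in Python and check their class attributes. A minimal sketch on a toy HTML snippet (the snippet and class names here are illustrative stand-ins, not the actual NRCS page; assumes BeautifulSoup is installed):

```python
from bs4 import BeautifulSoup

# Toy HTML standing in for a real page (illustrative only)
html = """
<html><body>
  <table class="nav"><tr><td>Menu</td></tr></table>
  <table class="data"><tr><th>FIPS</th></tr></table>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
# List every table and its class attribute to find the one we want
for i, t in enumerate(soup.find_all('table')):
    print(i, t.get('class'))
```

On a real page this loop quickly reveals which table carries the data and what class (if any) distinguishes it.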

Grab HTML Element

The element we want is a &lt;table&gt; with class='data'. Using BeautifulSoup's find() we can isolate this element and then use Pandas' read_html() to convert it to a Pandas dataframe.
In [12]:
# Extract Data
table = soup.find('table', attrs={'class': 'data'})
# Convert from HTML to Pandas Dataframe
df = pd.read_html(str(table))[0]
display(df.head())
FIPS Name State
0 1001 Autauga AL
1 1003 Baldwin AL
2 1005 Barbour AL
3 1007 Bibb AL
4 1009 Blount AL
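One caveat worth checking: read_html parses the FIPS column as integers, so codes that begin with zero (Alabama's counties start at 01001) lose their leading digit, as the output above shows. If you need the standard five-digit strings, a quick fix is to zero-pad the column. A sketch on a stand-in dataframe (assumes pandas is installed):

```python
import pandas as pd

# Stand-in for the scraped dataframe (same columns as the example output)
df = pd.DataFrame({'FIPS': [1001, 1003],
                   'Name': ['Autauga', 'Baldwin'],
                   'State': ['AL', 'AL']})

# read_html returned FIPS as integers, dropping leading zeros;
# zero-pad back to the standard five-digit string form
df['FIPS'] = df['FIPS'].astype(str).str.zfill(5)
print(df['FIPS'].tolist())  # ['01001', '01003']
```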

Export as CSV

The final step is to export it as a CSV file.

In [ ]:
# Export as CSV
df.to_csv("FIPS_to_County.csv")

Conclusion

Government webpages and Wikipedia are often easy to scrape, as they do not carry many ads or much JavaScript. If the element you are scraping is a table, there is generally little parsing required, so this can be a useful skill to learn even if you are not interested in the ins and outs of web development.

Save Log

In [ ]:
from IPython.display import display, Javascript

display(Javascript(
    "document.body.dispatchEvent("
    "new KeyboardEvent('keydown', {key:'s', keyCode: 83, ctrlKey: true}"
    "))"
))

!jupyter nbconvert --to html_toc "Tip6_Webscrapping.ipynb"  --ExtractOutputPreprocessor.enabled=False --CSSHTMLHeaderPreprocessor.style=stata-dark