# Remove warnings
import warnings
warnings.filterwarnings('ignore')
%run ../HTML_Functions.ipynb
Web scraping is an important skill to learn if you are interested in natural language processing. Scraping free-form text requires a solid understanding of HTML and CSS, but scraping tables requires far less.
Recently, I scraped a crosswalk between FIPS codes, counties, and states, which was super simple. The table looks like this:
The libraries we will use are requests, which loads a website into Python, and BeautifulSoup, which parses it so that we can convert it from HTML into another format such as a Pandas DataFrame.
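To make that parse-then-convert idea concrete before touching the real page, here is a minimal offline sketch. The HTML string, the `data` class name, and the variable names (`sample_html`, `sample_soup`, `sample_df`) are made up for illustration and are not the actual markup of the USDA page; the real table may need a different selector.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page (NOT the real USDA markup)
sample_html = """
<table class="data">
  <tr><th>FIPS</th><th>Name</th><th>State</th></tr>
  <tr><td>01001</td><td>Autauga</td><td>AL</td></tr>
  <tr><td>01003</td><td>Baldwin</td><td>AL</td></tr>
</table>
"""

# Parse the HTML and locate the table by its (assumed) class name
sample_soup = BeautifulSoup(sample_html, "html.parser")
table = sample_soup.find("table", class_="data")

# Collect the text of every header/data cell, one list per table row
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# First row holds the column names; the remaining rows are the data
sample_df = pd.DataFrame(rows[1:], columns=rows[0])
print(sample_df)
```

Note that the cell values come out as strings, which is what we want for FIPS codes: converting them to integers would drop the leading zeros.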
# Load Libraries
## Main Data Wrangling Library
import pandas as pd
## Get webpage
import requests
## Parse webpage
from bs4 import BeautifulSoup
# Get Webpage
url = "https://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/home/?cid=nrcs143_013697"
req = requests.get(url, timeout=30)
## Stop early with a clear error if the page did not load
req.raise_for_status()
# Parse Webpage
soup = BeautifulSoup(req.content, 'html.parser')
print(soup.prettify()[:555])
Now we have to sift through the HTML to find the table. That sounds hard, but if we go to the webpage, right-click on the page, and select "Inspect" (or press Ctrl + Shift + I in Google Chrome), the browser will show us the HTML. In particular, nearly all web browsers have a selector tool for locating the HTML element behind any part of the page. Here it is in Google Chrome: