Geopandas and Matplotlib to automate data processing and mapping

From CUOSGwiki
Revision as of 17:40, 1 December 2020 by SamirS (talk | contribs)

Road Construction Visualizer

This tutorial is an introduction to using Geopandas and Matplotlib to automate data download, data cleaning, basic analysis and map making. It assumes a basic understanding of Python, Python interpreters and installing Python modules.

The data for this tutorial is hosted on Open Ottawa and can be found here. It has an application programming interface (API) which will allow us to make requests to download data. Be sure to view the Data tab on the City of Ottawa website. Explore a few pages and get familiar with the data. Pay special attention to the TARGETED_START date, as this is the column we will primarily divide our validated data by. Additionally, take a look at the STATUS column and see if you can find a row that contains a NOTAVAIL value. When working with data, it is always important to become familiar with it first. Keep an eye out for any missing values.

Additionally, we will be using this data as a reference layer for our maps. It is the boundaries of the different regions within Ottawa. ________________________________________________________________________________________________________________________________________________________________________________________

Setting up Your Environment

The first step is to set up the Python environment used throughout this tutorial.

- You will need to download Anaconda: https://docs.anaconda.com/anaconda/install/windows/

- Search for and open the Anaconda Prompt

- Create your environment and when prompted, type y to accept:

        $ conda create --name geo_env

- Activate your Anaconda virtual environment by typing:

        $ conda activate geo_env

- Install the first required package, geopandas:

        $ conda install geopandas

- Install the second package called matplotlib:

        $ conda install matplotlib

- Install the third package called contextily:

        $ conda install contextily

- Install the last and final package from Anaconda which allows you to map polygons using Geopandas:

        $ conda install -c conda-forge descartes

- Next you will need an integrated development environment (IDE). This tutorial uses Visual Studio Code (VS Code) as it is free and accessible. However, other IDEs such as PyCharm can be used. The link to install Visual Studio Code can be found here.

- You will now need to open VS Code and set your interpreter to the virtual geo_env environment you created. You can follow this tutorial.


We finally have our entire Python environment set up!
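
Before moving on, it can be worth confirming that the interpreter you selected in VS Code can actually see the packages you installed. The following is a minimal sketch and not part of the original script: find_missing_modules is a hypothetical helper name, and it only checks that each module can be located, not that it works correctly.

```python
import importlib.util

# Top-level modules that this tutorial imports later on
REQUIRED_MODULES = ["geopandas", "matplotlib", "contextily", "requests"]


def find_missing_modules(module_names):
    """Return the names in module_names that the current interpreter cannot locate."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

If any module is reported as missing, check that the geo_env environment is the one selected as the interpreter.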

________________________________________________________________________________________________________________________________________________________________________________________

Beginning to Code

The first step to begin coding is to import all of our modules:

import geopandas # For automation and data cleaning of our geojson files
import os # Allow us to manipulate where we save our files and move around our folders
import matplotlib.pyplot as plt # Allow us to create maps
import requests # Allow us to download our data from the City of Ottawa using their API
from datetime import date # Allow us to generate current dates 
import contextily as ctx # Allow us to add base maps

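The next step is to create our main function and call it. We will then set up our file structure inside it:

def main():
    pass # Placeholder so the script runs as-is; replace it with the code from the steps below

if __name__ == "__main__":
    main()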


In our main function we want to use the datetime module to generate a date object:

date_today = str(date.today())
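
To see why this string is convenient for naming folders and files, note that calling str() on a date object gives the ISO format YYYY-MM-DD, which sorts chronologically when sorted alphabetically. A small sketch using fixed dates chosen only for illustration:

```python
from datetime import date

example_day = date(2020, 12, 1)  # A fixed date used only for illustration
example_text = str(example_day)  # Same result as example_day.isoformat()
print(example_text)  # 2020-12-01

# ISO date strings sort in chronological order, which keeps dated folders tidy
folder_names = sorted([str(date(2020, 12, 1)), str(date(2020, 1, 15)), str(date(2019, 12, 31))])
print(folder_names)
```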

Next we want to use the os module to create our file structure and point towards our reference data. All of the code below goes in our main function unless otherwise specified.

working_directory = os.getcwd() # Find our current working directory in order to build other directories off of this
reference_file = os.path.join(working_directory, "ottawa_boundaries", "ottawa_boundaries.geojson") # Use OS path.join function to point to our reference file
reference_folder = os.path.join(working_directory, "ottawa_boundaries" ) # Create a path for our reference folder 
maps_folder = os.path.join(working_directory, "Maps") # Create a maps folder path
maps_day_folder = os.path.join(maps_folder, date_today) # Create a specific day path in our general maps path

We will now take the paths we made and test whether they exist in the directory where we are running our program. If they do not, we will create them. Testing first prevents an error from trying to create a folder that already exists.

# Check if the overarching maps folder exists and if not, create it
if not os.path.isdir(maps_folder):
    os.mkdir(maps_folder)

# Check if the specific day directory exists and if not, create it
if not os.path.isdir(maps_day_folder):
    os.mkdir(maps_day_folder)

# Check if the geojson directory exists and if not, create it
# Our downloaded GeoJSON files will be stored in this directory
if not os.path.isdir("./geojson"):
    os.mkdir("./geojson")
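
As an alternative sketch, the check-then-create pattern above can be collapsed into a single call: os.makedirs with exist_ok=True creates any missing parent folders and does nothing if the folder already exists. The helper name ensure_folders is hypothetical, and the example runs inside a temporary directory so it does not touch your project.

```python
import os
import tempfile
from datetime import date


def ensure_folders(base_directory, day_text):
    """Create the Maps/<day> and geojson folders under base_directory and return their paths."""
    maps_day_folder = os.path.join(base_directory, "Maps", day_text)
    geojson_folder = os.path.join(base_directory, "geojson")
    for folder in (maps_day_folder, geojson_folder):
        # Creates missing parent folders too, and is a no-op if the folder exists
        os.makedirs(folder, exist_ok=True)
    return maps_day_folder, geojson_folder


with tempfile.TemporaryDirectory() as scratch:
    today_text = str(date.today())
    first = ensure_folders(scratch, today_text)
    second = ensure_folders(scratch, today_text)  # Running it again raises no error
    folders_exist = all(os.path.isdir(folder) for folder in first)
    print(folders_exist)
```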

We will now check whether our reference layer file exists. If it does not (i.e. the first time we run this), we will create its folder and download the layer file from the City of Ottawa. You will notice I use the word dataframe in the comments below. A dataframe is the primary type of data structure used to store information in GeoPandas.

# If the reference layer file does not exist, create its folder and download the file
if not os.path.isfile(reference_file):
    os.makedirs(reference_folder, exist_ok=True) # Create the folder; this does nothing if the folder already exists
    geojson_call = requests.get('https://opendata.arcgis.com/datasets/845bbfdb73944694b3b81c5636be46b5_0.geojson') # Send the get request and assign it to a variable
    with open(reference_file, "w", encoding="utf-8") as geojson_file: # Open a new file at the path we created earlier; the with block closes it for us
        geojson_file.write(geojson_call.text) # Write the text from the get request to our newly created file
# If we have run our script from this directory before, the file already exists and this step is skipped

reference_layer_df = geopandas.read_file(reference_file) # Read our reference file straight from its path into a geopandas dataframe
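
It can help to know what is inside the file before handing it to GeoPandas. A GeoJSON file is ordinary JSON text: a FeatureCollection holding a list of features, each with a geometry and a properties dictionary, and the properties become the columns of the dataframe. The sketch below uses only the standard json module and a tiny made-up sample. The NAME property and the coordinates are invented for illustration and are not taken from the Ottawa data, and summarize_geojson is a hypothetical helper name.

```python
import json

# A made-up, minimal FeatureCollection used only for illustration
SAMPLE_GEOJSON = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"NAME": "Region A"},
     "geometry": {"type": "Polygon", "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]}},
    {"type": "Feature",
     "properties": {"NAME": "Region B"},
     "geometry": {"type": "Polygon", "coordinates": [[[1, 1], [2, 1], [2, 2], [1, 1]]]}}
  ]
}
"""


def summarize_geojson(text):
    """Return the top-level type, the number of features and the sorted property names."""
    data = json.loads(text)
    features = data.get("features", [])
    property_names = sorted({key for feature in features for key in feature.get("properties", {})})
    return data.get("type"), len(features), property_names


summary = summarize_geojson(SAMPLE_GEOJSON)
print(summary)
```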

Voila! We now have our file structure created and our reference file stored in a geopandas dataframe! The next step is to make another GET request to download the newest road construction data from the City of Ottawa. After that, we will read this GeoJSON into a GeoPandas dataframe as well.

# Perform a GET call to pull the GeoJSON construction data from the City of Ottawa's webpage
# Write the response to a local geojson file named with today's date within the geojson directory
print("Downloading road construction data....") # Create an update to inform the user what is happening
geojson_call = requests.get('https://opendata.arcgis.com/datasets/d2fe8f7e3cf24615b62dfc954b5c26b9_0.geojson') # Send the get request
with open("./geojson/" + "{date}_rd_construction.geojson".format(date=date_today), "w", encoding="utf-8") as geojson_file: # Open a new geojson file named with the download date
    geojson_file.write(geojson_call.text) # Write to our new file; the with block closes it for us
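
A network request can fail or return an error page, in which case the download step would save that response as if it were data. A more defensive sketch follows. download_text is a hypothetical helper and not part of the original script. The fetch parameter exists only so the function can be tried without a network connection, and by default it falls back to requests.get. The 30 second timeout and the example.com URL are arbitrary assumptions.

```python
import os
import tempfile


def download_text(url, destination, fetch=None, timeout=30):
    """Download url and save the response body to destination, failing loudly on HTTP errors."""
    if fetch is None:
        import requests  # Imported here so the helper can be tried offline with a stand-in fetch
        fetch = requests.get
    response = fetch(url, timeout=timeout)
    response.raise_for_status()  # Raises an exception for 4xx and 5xx responses
    with open(destination, "w", encoding="utf-8") as output_file:
        output_file.write(response.text)
    return destination


class FakeResponse:
    """Stand-in for a requests response, used only to demonstrate the helper offline."""
    text = '{"type": "FeatureCollection", "features": []}'

    def raise_for_status(self):
        pass  # Pretend the request succeeded


def fake_fetch(url, timeout):
    return FakeResponse()


with tempfile.TemporaryDirectory() as scratch:
    saved_path = download_text(
        "https://example.com/data.geojson",  # Placeholder URL; never contacted here
        os.path.join(scratch, "example.geojson"),
        fetch=fake_fetch,
    )
    with open(saved_path, encoding="utf-8") as saved_file:
        saved_text = saved_file.read()
print(saved_text)
```

In the real script you would call download_text with the City of Ottawa URL and the dated file path, leaving fetch at its default.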

# Load the GeoJSON into a Geopandas dataframe
working_file = os.path.join(working_directory, "geojson", "{date}_rd_construction.geojson".format(date=date_today)) # Create a working file variable path with the current date
gp_df = geopandas.read_file(working_file) # Read the current geojson road construction file (the working file) into a Geopandas dataframe

Data Cleaning

It is important to be able to automate the processing and cleaning of data, especially when you receive large amounts of data on a regular basis. In the following steps, we will learn how to extract only the desired data from this relatively large road construction dataset. We have our road construction dataset as a geodataframe, which gives us full access to all of the useful functions and methods within geopandas.

The first method we will use is .drop, which can be called on a geodataframe (gdf). Its labels parameter takes a list of labels specifying what we want dropped from our gdf. In this case, we are removing the French columns and some other columns that are not required in our analysis. The axis parameter tells geopandas which axis to search for the labels in. We entered 1 because the labels are column headings (0 would refer to the row index). Lastly, we use the method .dropna, which removes every row that has missing data (N/A or NaN) in any of the remaining columns.

# Remove unneeded columns and drop rows with missing values
print("Cleaning and processing data....") # Provide the user with an update
clean_df = gp_df.drop(labels=[
    'FEATURE_TYPE_FR', 'STATUS_FR', 'TARGETED_START_FR', 'PROJECT_MANAGER', 'PROJECTWEBPAGE', 'PROJECTWEBPAGE_FR'
    ], axis=1).dropna()
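
To see what .drop and .dropna do without downloading anything, here is a small sketch on a made-up table. It uses a plain pandas DataFrame, on the assumption that pandas is available in your environment (it is installed as a dependency of geopandas). A geodataframe is a subclass of the pandas DataFrame, so these two methods behave the same way on it. All values are invented for illustration.

```python
import pandas as pd

# A made-up table with one French column and some missing values
toy_df = pd.DataFrame({
    "STATUS": ["UNDERWAY", "PLANNED", None],
    "STATUS_FR": ["EN COURS", "PLANIFIE", None],
    "FEATURE_TYPE": ["Road", None, "Sewer"],
})

# axis=1 tells pandas to look for the label among the column headings
without_french = toy_df.drop(labels=["STATUS_FR"], axis=1)

# dropna() removes every row that has a missing value in any remaining column
cleaned = without_french.dropna()

print(list(cleaned.columns))
print(len(cleaned))
```

Only the first row survives, because the second row is missing FEATURE_TYPE and the third is missing STATUS.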

We are basing our series of maps on the "STATUS" column of the data. From looking at the data earlier, you may have noticed that some of the values were NOTAVAIL, which is not good for our analysis. Therefore, we will remove these rows from our data. We select only the STATUS column from our geodataframe and cast it into a set in order to get rid of duplicate values. We then check whether "NOTAVAIL" is in the set. If it is, we apply a filter that keeps only the rows where the STATUS column value is NOT "NOTAVAIL". If it is not, we keep the cleaned data as it is. Either way, the result is a geodataframe called status_removed, so the variable is always defined for the steps that follow.

# Check for NOTAVAIL and if these rows exist, then remove them
not_avail_check = set(clean_df['STATUS']) # Collect the unique values in the STATUS column
if 'NOTAVAIL' in not_avail_check: # Check the set for NOTAVAIL values
    status_removed = clean_df[clean_df.STATUS != 'NOTAVAIL'] # Create filter to only select rows where the STATUS column value does not equal NOTAVAIL
else:
    status_removed = clean_df # Nothing to remove, so keep the cleaned data as it is
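
The filter can be tried on a toy table as well. This is a sketch on a plain pandas DataFrame with invented values, assuming pandas is available in your environment; remove_not_avail is a hypothetical helper name. It also shows that the boolean filter is safe to apply when no NOTAVAIL rows are present, in which case every row is kept.

```python
import pandas as pd


def remove_not_avail(df):
    """Return only the rows whose STATUS value is not NOTAVAIL."""
    return df[df["STATUS"] != "NOTAVAIL"]


# Made-up tables: one with a NOTAVAIL row and one without
with_gaps = pd.DataFrame({"STATUS": ["UNDERWAY", "NOTAVAIL", "PLANNED"]})
already_clean = pd.DataFrame({"STATUS": ["UNDERWAY", "PLANNED"]})

filtered_statuses = list(remove_not_avail(with_gaps)["STATUS"])
untouched_count = len(remove_not_avail(already_clean))

print(filtered_statuses)
print(untouched_count)
```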