Difference between revisions of "Areal Interpolation in Python using Tobler"
From CUOSGwiki
Jump to navigationJump to searchJohnFoster (talk | contribs) (Test with readme.md included as the first part of the tutorial) |
JohnFoster (talk | contribs) (page deleted due to type in page name) Tag: Replaced |
||
(2 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
+ | replaced by https://dges.carleton.ca/CUOSGwiki/index.php/Areal_Interpolation_in_Python_Using_Tobler |
||
− | <div class="cell markdown"> |
||
− | |||
− | = Areal Interpolation Tutorial = |
||
− | |||
− | In this Jupyter Notebook tutorial we will look at two of the areal interpolation methods that are made possible by PySal's Tobler package. |
||
− | |||
− | * What is areal interpolation? |
||
− | * What is a Jupyter Notebook etc? |
||
− | * What will we do? |
||
− | |||
− | == Access a Remote Copy of the Tutorial == |
||
− | |||
− | === Option 1: Binder === |
||
− | |||
− | If you wish to access an interactive copy of this tutorial please follow these instructions: |
||
− | |||
− | # Open [https://mybinder.org/v2/gh/johnofoster/areal_interpolation_tutorial/HEAD this link] in a new tab to access a Binder copy of the tutorial repository. |
||
− | # While the Binder service is loading, you can access a non-interactive preview of the notebook (<code>areal_interp_tutorial.ipynb</code>)through nbviewer. |
||
− | # After loading, a JupyterLab session will launch in your browser. This session will use the environment created for the tutorial so all dependencies should work just fine. |
||
− | # In the File Browser pane on the left, double-click <code>areal_interp_tutorial.ipynb</code> to open the tutorial notebook. |
||
− | # With the tutorial now open as an interactive Jupyter notebook you can run and modify each code cell in order to see the output. |
||
− | |||
− | === Option 2: Carleton University Department of Geography and Environmental Studies Open Source GIS Tutorials Page === |
||
− | |||
− | As previously mentioned, this tutorial was created as a student project for a Carleton University geomatics course. It has been converted to a MediaWiki page and uploaded [https://dges.carleton.ca/CUOSGwiki/index.php/Areal_Interpolation_in_Python_using_Tobler at this link]. |
||
− | |||
− | == Install a Local Copy of the Tutorial == |
||
− | |||
− | If you have git installed: |
||
− | |||
− | # Open a terminal window |
||
− | # Change the current working directory to the location where you want the tutorial to reside |
||
− | # Clone the repo by running: <code>$ git clone https://github.com/johnofoster/areal_interpolation_tutorial</code> |
||
− | |||
− | Otherwise, simply download the zip file of the tutorial from [https://github.com/johnofoster/areal_interpolation_tutorial/archive/refs/heads/main.zip this link] and unzip it into a suitable working directory on your computer. |
||
− | |||
− | === Create Environment === |
||
− | |||
− | This tutorial uses Python 3.9, Conda, JupyterLab, and a number of Python packages. |
||
− | |||
− | If you don't already have an Anaconda distribution then you can find it [https://www.anaconda.com/products/individual here]. For a very lightweight alternative to Anaconda that only contains Conda, Python, and a few essential packages then check out Miniconda [https://docs.conda.io/en/latest/miniconda.html here]. After installing either please follow these instructions: |
||
− | |||
− | # Open a terminal window |
||
− | # Change the current working directory to tutorial's directory |
||
− | # Create the environment by running: <code>$ conda env create -- file environment.yml</code> |
||
− | # It might take a while to build the environment so go grab a cup of your favorite beverage. |
||
− | |||
− | This will create a Conda environment containing the necessary packages to run this tutorial. |
||
− | |||
− | === Open the Tutorial === |
||
− | |||
− | Now that you have downloaded the tutorial files and created the Conda environment please follow these instructions to launch it: |
||
− | |||
− | # Open a terminal window |
||
− | # Change the current working directory to the tutorial's directory |
||
− | # Activate the environment: <code>$ conda activate areal_interp_env</code> |
||
− | # Launch JupyterLab: <code>$ Jupyter Lab</code> |
||
− | # In the File Browser pane on the left, double-click <code>areal_interp_tutorial.ipynb</code> to open the tutorial notebook. |
||
− | # With the tutorial now open as an interactive Jupyter notebook you can run and modify each code cell in order to see the output. |
||
− | |||
− | === Remove the Tutorial === |
||
− | |||
− | Once you are done with tutorial you might want to remove it. Simply delete the directory you placed it in and then remove the Conda environment by running the following in the terminal: <code>$ conda env remove --name areal_interp_env</code>. |
||
− | |||
− | == Tutorial Data == |
||
− | |||
− | The data for this tutorial has been preprocessed for your convenience and can be found in <code>data/</code>. All of this data comes from free and open sources. Links to the source geometry and attributes can be found below along with notes regarding the preprocessing that was performed. |
||
− | |||
− | {| |
||
− | !width="20%"| | Ge |
||
− | !width="20%"| Geometry source | Geo |
||
− | !width="20%"| Geometry transformations | At |
||
− | !width="20%"| Attributes source | A |
||
− | !width="20%"| Attributes transformations | |
||
− | |- |
||
− | | Census Tracts | [Stat |
||
− | | [https://www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/bound-limit-2016-eng.cfm StatCan 2016 Census - Census Tracts Cartographic Boundary File] | Extr |
||
− | | Extracted the census tracts falling within the City of Ottawa boundary and reprojected to WGS 84 UTM 18N | [S |
||
− | | [https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/comp/page_dl-tc.cfm?Lang=E& StatCan 2016 Census: Census metropolitan areas (CMAs), tracted census agglomerations (CAs) and census tracts (CTs)] | Extra |
||
− | | Extracted the 'Population, 2016' data ('member ID 1') for the City of Ottawa census tracts and joined them to the census tracts geometry | |
||
− | |- |
||
− | | Dissemination Areas | |
||
− | | [https://www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/bound-limit-2016-eng.cfm StatCan 2016 Census - Dissemination Areas Cartographic Boundary File] |
||
− | | Extracted the dissemination areas falling within the City of Ottawa boundary and reprojected to WGS 84 UTM 18N | [Can |
||
− | | [https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/comp/page_dl-tc.cfm?Lang=E& Canada, provinces, territories, census divisions (CDs), census subdivisions (CSDs) and dissemination areas (DAs) - Ontario only] | |
||
− | | Extracted the 'Population, 2016' data ('member ID 1) for the City of Ottawa dissemination areas and joined them to the dissemination areas geometry | |
||
− | |- |
||
− | | Neighborhoods | [Otta |
||
− | | [https://open.ottawa.ca/datasets/ottawa::ottawa-neighbourhood-study-ons-neighbourhood-boundaries-gen-2/about Ottawa Neighbourhood Study (ONS) - Neighbourhood Boundaries Gen 2] | Re |
||
− | | Reprojected to WGS 84 UTM 18N | [Otta |
||
− | | [https://open.ottawa.ca/datasets/ottawa::ottawa-neighbourhood-study-ons-neighbourhood-boundaries-gen-2/about Ottawa Neighbourhood Study (ONS) - Neighbourhood Boundaries Gen 2] | Ex |
||
− | | Extracted the neighbourhood names ('NAMES') and estimated population fields ('POPEST') | |
||
− | |- |
||
− | | Landcover | [ |
||
− | | [https://open.canada.ca/data/en/dataset/4e615eae-b90c-420b-adee-2ca35896caf6 Government of Canada - 2015 Land Cover of Canada] | Cli |
||
− | | Clipped to the City of Ottawa's extent | | |
||
− | | | |
||
− | | | |
||
− | |- |
||
− | | Zoning | [Cit |
||
− | | [https://data1-esrica-ncr.opendata.arcgis.com/datasets/esrica-ncr::city-of-ottawa-zoning/about?layer=3 City of Ottawa - Zoning] | Re |
||
− | | Reprojected to WGS 84 UTM 18N | [City |
||
− | | [https://data1-esrica-ncr.opendata.arcgis.com/datasets/esrica-ncr::city-of-ottawa-zoning/about?layer=3 City of Ottawa - Zoning] | Ex |
||
− | | Extracted the main zones | |
||
− | |- |
||
− | | Synthetic Points | Jo |
||
− | | John Foster (tutorial author) | | |
||
− | | | |
||
− | | | |
||
− | | | |
||
− | |} |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | To do: |
||
− | |||
− | * Update headers for consistency and flow |
||
− | * Create a readme |
||
− | ** This will become the first part of the WikiText tutorial but will be separate from the Jupyter notebook |
||
− | ** Contents: |
||
− | *** Disclaimer |
||
− | *** Introduction |
||
− | *** Installation instructions |
||
− | *** Link to interactive notebook and DGESwiki |
||
− | * Write a conclusion / discussion section after the synthetic data section |
||
− | * Include some sentences describing the difference between census tracts and dissemination areas |
||
− | * Census tracts and dissemination areas |
||
− | * Neighborhoods |
||
− | |||
− | https://alldocs.app/convert-jupyter-notebook-to-mediawiki-markup |
||
− | |||
− | <syntaxhighlight lang="python3"> </syntaxhighlight> |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | Help, Documentation, Resources |
||
− | |||
− | * Documentation links |
||
− | ** Tobler |
||
− | ** Pandas |
||
− | ** GeoPandas |
||
− | ** Plotly |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | === Import Modules and Packages === |
||
− | |||
− | This tutorial requires the use of a number of modules. Tobler, Pandas, GeoPandas, and Plotly will perform most of the work but I'll quickly run through why we need each: |
||
− | |||
− | * <code>random</code>: Helps to generate random points in Part 2 of the tutorial |
||
− | * <code>geopandas</code>: For operations on GeoDataFrames |
||
− | * <code>json</code>: For plotting maps with Plotly (GeoDataFrame geometries need to be converted to GeoJSON geometries) |
||
− | * <code>numpy</code>: Contains a number of statistical functions |
||
− | * <code>pandas</code>: For operations on DataFrames |
||
− | * <code>plotly.express</code>: For interactive plotting |
||
− | * <code>tobler</code>: Contains the areal interpolation functions |
||
− | * <code>scipy</code>: Contains a number of statistical functions |
||
− | * <code>shapely.geometry</code>: For the <code>Point</code> class which is used when generating the random points |
||
− | |||
− | If you need help on importing modules then this guide from Real Python should help: [https://realpython.com/python-modules-packages/ Python Modules and Packages – An Introduction] |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell code" execution_count="1"> |
||
− | |||
− | <source lang="python"># Import modules |
||
− | import geopandas as gpd |
||
− | import json |
||
− | import matplotlib.pyplot as plt |
||
− | import numpy as np |
||
− | import pandas as pd |
||
− | import plotly.express as px |
||
− | import tobler |
||
− | from IPython.display import display_html |
||
− | from scipy import stats |
||
− | from shapely.geometry import Point</source> |
||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | |||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | |||
− | ----- |
||
− | |||
− | == Part 1 - Areal Interpolation of Census Population Data == |
||
− | |||
− | In part 1 we will perform the areal interpolation of Ottawa, ON population data. In an effort to understand how the results are affected by different interpolation methods and different spatial resolutions we will perform four different areal interpolation operations. |
||
− | |||
− | * Census tracts to neighborhoods using area weighted interpolation |
||
− | * Dissemination areas to neighborhoods using area weighted interpolation |
||
− | * Census tracts to neighborhoods using dasymetric interpolation |
||
− | * Dissemination areas to neighborhoods using dasymetric interpolation |
||
− | |||
− | Census tracts are census geographic areas that typically contain about 2500 to 10000 people whereas dissemination areas are much smaller and contain about 400 to 700 people. Read about them and other census geographic areas [https://www12.statcan.gc.ca/census-recensement/2016/ref/98-304/chap12-eng.cfm here] |
||
− | |||
− | * [https://pysal.org/tobler/generated/tobler.area_weighted.area_interpolate.html#tobler.area_weighted.area_interpolate tobler.area_weighted.area_interpolate] |
||
− | * [https://pysal.org/tobler/generated/tobler.dasymetric.masked_area_interpolate.html#tobler.dasymetric.masked_area_interpolate tobler.dasymetric.masked_area_interpolate] |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | === Read The Data === |
||
− | |||
− | First, we need read the data from our <code>data/</code> directory and assign them to variables. The Ottawa census tracts data, dissemination tracts data, and neighborhoods data have been prepared in advance so the tutorial can focus on the interpolation and assessment procedures. |
||
− | |||
− | '''Important Note about coordinate reference systems:''' These three spatial dataset have already been reprojected into the WGS 84 / UTM 18N (EPSG:32618) CRS. If you use Tobler with your own GeoDataFrames it's important to explicitly reproject the data into an appropriate UTM zone, even though Tobler will try to do this automatically. Reprojecting a GeoDataFrame is easily accomplished with the [https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_crs.html <code>gdf.to_crs()</code>] method. |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell code" execution_count="2"> |
||
− | |||
− | <source lang="python"># Read the data |
||
− | # ------------- |
||
− | |||
− | # City of Ottawa census tracts data |
||
− | ct_gdf = gpd.read_file(filename='data/ottawa_ct_pop_2016.gpkg') |
||
− | |||
− | # City of Ottawa dissemination areas data |
||
− | da_gdf = gpd.read_file(filename='data/ottawa_da_pop_2016.gpkg') |
||
− | |||
− | # City of Ottawa neighborhoods data |
||
− | nbhd_gdf = gpd.read_file(filename='data/ottawa_neighborhoods.gpkg')</source> |
||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | === Inspect The Data === |
||
− | |||
− | The following Pandas and GeoPandas methods are great ways to inspect DataFrames and GeoDataFrames. Note: The DataFrame (<code>df</code>) methods also work on GeoDataFrames (<code>gdf</code>) but <code>explore()</code> only works with GeoDataFrames. Follow their links to see the documentation as their outputs can be controlled by passing different arguments to them. |
||
− | |||
− | * [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html <code>df.info()</code>]: Information about the DataFrame/GeoDataFrame such as column names, data types, and length |
||
− | * [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html <code>df.head()</code>]: First n rows of the DataFrame/GeoDataFrame (defaults to 10 rows) |
||
− | * [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html <code>df.tail()</code>]: Last n rows of the DataFrame/GeoDataFrame (defaults to 10 rows) |
||
− | * [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html <code>df.sample()</code>]: Random sample of n rows of the DataFrame/GeoDataFrame (defaults to 1 rows) |
||
− | * [https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.explore.html <code>gdf.explore()</code>]: Interactive map of a GeoDataFrame |
||
− | * [https://geopandas.org/en/stable/docs/user_guide/projections.html <code>gdf.crs</code>]: Coordinate reference system information of a GeoDataFrame |
||
− | |||
− | The cell below contains these data inspection methods. Just replace <code>df</code> or <code>gdf</code> before the dot notation with the DataFrame or GeoDataFrame's name that you want to inspect. For example: |
||
− | |||
− | * <code>ct_gdf.info()</code> - shows info about the census tracts GeoDataFrame |
||
− | * <code>da_gdf.explore(column='da_pop_2016')</code> - displays an interactive map of the dissemination areas with a colormap of the population column |
||
− | |||
− | Alternatively, add in some new code cells in order to run them separately. And feel free to insert or split cells at any point in this tutorial in order to inspect any of the intermediate data. These methods will really help to reveal exactly what is going. |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Inspect the data |
||
− | # ---------------- |
||
− | |||
− | # Uncomment individual lines below and replace `df` or `gdf` with the name of a variable |
||
− | # DataFrame (df) methods work on GeoDataFrames (gdf) but not the other way around |
||
− | |||
− | # df.info() |
||
− | # df.head() |
||
− | # df.tail(5) |
||
− | # df.sample(15) |
||
− | # gdf.explore() |
||
− | # gdf.crs</source> |
||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | Let's inspect them a bit more systematically. First, we'll look at random sample of the census tracts, dissemination areas, and neighborhoods GeoDataFrames while ignoring their geometry columns. THis will show us the column names and give us some examples of how the values of each are formatted. Because the output of cells in Jupyter notebooks are rendered in HTML we can use Pandas' DataFrame styling methods to display several DataFrames side-by-side. While this takes more code than simply using the <code>print()</code> function or displaying the results of <code>sample()</code>, it facilitates better comparison and saves vertical space. |
||
− | |||
− | After that we'll plot the geometries of each GeoDataFrame next to each other using GeoPandas' Matplotlib <code>plot()</code> method. This will reveal the scale of the geometries that we are working with. |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell code" execution_count="3"> |
||
− | |||
− | <source lang="python"># Compare random samples from each GeoDataFrame |
||
− | # --------------------------------------------- |
||
− | |||
− | # Assign a sampling of each GeoDataFrame to a DataFrame |
||
− | df1 = ct_gdf.drop(columns='geometry').sample(5) |
||
− | df2 = da_gdf.drop(columns='geometry').sample(5) |
||
− | df3 = nbhd_gdf.drop(columns='geometry').sample(5) |
||
− | |||
− | # Style the DataFrames and include captions with the number of rows |
||
− | style = "style='display:inline; padding:10px'" |
||
− | df1_styled = df1.style.set_table_attributes(style).set_caption(f'Census Tracts, {len(ct_gdf)} rows') |
||
− | df2_styled = df2.style.set_table_attributes(style).set_caption(f'Dissemination Areas, {len(da_gdf)} rows') |
||
− | df3_styled = df3.style.set_table_attributes(style).set_caption(f'Neighborhoods, {len(nbhd_gdf)} rows') |
||
− | |||
− | # Display the three DataFrames |
||
− | display_html(df1_styled._repr_html_() + df2_styled._repr_html_() + df3_styled._repr_html_(), raw=True)</source> |
||
− | <div class="output display_data"> |
||
− | |||
− | <style type="text/css"> |
||
− | </style> |
||
− | <table id="T_22526_" style='display:inline; padding:10px'> |
||
− | <caption>Census Tracts, 196 rows</caption> |
||
− | <thead> |
||
− | <tr> |
||
− | <th class="blank level0" > </th> |
||
− | <th class="col_heading level0 col0" >ctuid</th> |
||
− | <th class="col_heading level0 col1" >ct_pop_2016</th> |
||
− | </tr> |
||
− | </thead> |
||
− | <tbody> |
||
− | <tr> |
||
− | <th id="T_22526_level0_row0" class="row_heading level0 row0" >77</th> |
||
− | <td id="T_22526_row0_col0" class="data row0 col0" >5050001.08</td> |
||
− | <td id="T_22526_row0_col1" class="data row0 col1" >4632</td> |
||
− | </tr> |
||
− | <tr> |
||
− | <th id="T_22526_level0_row1" class="row_heading level0 row1" >13</th> |
||
− | <td id="T_22526_row1_col0" class="data row1 col0" >5050044.00</td> |
||
− | <td id="T_22526_row1_col1" class="data row1 col1" >2547</td> |
||
− | </tr> |
||
− | <tr> |
||
− | <th id="T_22526_level0_row2" class="row_heading level0 row2" >189</th> |
||
− | <td id="T_22526_row2_col0" class="data row2 col0" >5050151.08</td> |
||
− | <td id="T_22526_row2_col1" class="data row2 col1" >6967</td> |
||
− | </tr> |
||
− | <tr> |
||
− | <th id="T_22526_level0_row3" class="row_heading level0 row3" >21</th> |
||
− | <td id="T_22526_row3_col0" class="data row3 col0" >5050052.00</td> |
||
− | <td id="T_22526_row3_col1" class="data row3 col1" >4656</td> |
||
− | </tr> |
||
− | <tr> |
||
− | <th id="T_22526_level0_row4" class="row_heading level0 row4" >45</th> |
||
− | <td id="T_22526_row4_col0" class="data row4 col0" >5050018.00</td> |
||
− | <td id="T_22526_row4_col1" class="data row4 col1" >4263</td> |
||
− | </tr> |
||
− | </tbody> |
||
− | </table> |
||
− | <style type="text/css"> |
||
− | </style> |
||
− | <table id="T_1e1ac_" style='display:inline; padding:10px'> |
||
− | <caption>Dissemination Areas, 1372 rows</caption> |
||
− | <thead> |
||
− | <tr> |
||
− | <th class="blank level0" > </th> |
||
− | <th class="col_heading level0 col0" >dauid</th> |
||
− | <th class="col_heading level0 col1" >ctuid</th> |
||
− | <th class="col_heading level0 col2" >da_pop_2016</th> |
||
− | </tr> |
||
− | </thead> |
||
− | <tbody> |
||
− | <tr> |
||
− | <th id="T_1e1ac_level0_row0" class="row_heading level0 row0" >1137</th> |
||
− | <td id="T_1e1ac_row0_col0" class="data row0 col0" >35060231</td> |
||
− | <td id="T_1e1ac_row0_col1" class="data row0 col1" >5050059.00</td> |
||
− | <td id="T_1e1ac_row0_col2" class="data row0 col2" >639</td> |
||
− | </tr> |
||
− | <tr> |
||
− | <th id="T_1e1ac_level0_row1" class="row_heading level0 row1" >493</th> |
||
− | <td id="T_1e1ac_row1_col0" class="data row1 col0" >35060121</td> |
||
− | <td id="T_1e1ac_row1_col1" class="data row1 col1" >5050124.02</td> |
||
− | <td id="T_1e1ac_row1_col2" class="data row1 col2" >694</td> |
||
− | </tr> |
||
− | <tr> |
||
− | <th id="T_1e1ac_level0_row2" class="row_heading level0 row2" >1062</th> |
||
− | <td id="T_1e1ac_row2_col0" class="data row2 col0" >35060167</td> |
||
− | <td id="T_1e1ac_row2_col1" class="data row2 col1" >5050122.03</td> |
||
− | <td id="T_1e1ac_row2_col2" class="data row2 col2" >1182</td> |
||
− | </tr> |
||
− | <tr> |
||
− | <th id="T_1e1ac_level0_row3" class="row_heading level0 row3" >223</th> |
||
− | <td id="T_1e1ac_row3_col0" class="data row3 col0" >35061841</td> |
||
− | <td id="T_1e1ac_row3_col1" class="data row3 col1" >5050141.12</td> |
||
− | <td id="T_1e1ac_row3_col2" class="data row3 col2" >9778</td> |
||
− | </tr> |
||
− | <tr> |
||
− | <th id="T_1e1ac_level0_row4" class="row_heading level0 row4" >473</th> |
||
− | <td id="T_1e1ac_row4_col0" class="data row4 col0" >35060781</td> |
||
− | <td id="T_1e1ac_row4_col1" class="data row4 col1" >5050140.07</td> |
||
− | <td id="T_1e1ac_row4_col2" class="data row4 col2" >789</td> |
||
− | </tr> |
||
− | </tbody> |
||
− | </table> |
||
− | <style type="text/css"> |
||
− | </style> |
||
− | <table id="T_f9dda_" style='display:inline; padding:10px'> |
||
− | <caption>Neighborhoods, 111 rows</caption> |
||
− | <thead> |
||
− | <tr> |
||
− | <th class="blank level0" > </th> |
||
− | <th class="col_heading level0 col0" >nbhd_name</th> |
||
− | <th class="col_heading level0 col1" >nbhd_pop_est</th> |
||
− | </tr> |
||
− | </thead> |
||
− | <tbody> |
||
− | <tr> |
||
− | <th id="T_f9dda_level0_row0" class="row_heading level0 row0" >23</th> |
||
− | <td id="T_f9dda_row0_col0" class="data row0 col0" >Greenboro East</td> |
||
− | <td id="T_f9dda_row0_col1" class="data row0 col1" >10722</td> |
||
− | </tr> |
||
− | <tr> |
||
− | <th id="T_f9dda_level0_row1" class="row_heading level0 row1" >19</th> |
||
− | <td id="T_f9dda_row1_col0" class="data row1 col0" >Civic Hospital-Central Park</td> |
||
− | <td id="T_f9dda_row1_col1" class="data row1 col1" >10980</td> |
||
− | </tr> |
||
− | <tr> |
||
− | <th id="T_f9dda_level0_row2" class="row_heading level0 row2" >12</th> |
||
− | <td id="T_f9dda_row2_col0" class="data row2 col0" >Laurentian</td> |
||
− | <td id="T_f9dda_row2_col1" class="data row2 col1" >9654</td> |
||
− | </tr> |
||
− | <tr> |
||
− | <th id="T_f9dda_level0_row3" class="row_heading level0 row3" >92</th> |
||
− | <td id="T_f9dda_row3_col0" class="data row3 col0" >Metcalfe</td> |
||
− | <td id="T_f9dda_row3_col1" class="data row3 col1" >5240</td> |
||
− | </tr> |
||
− | <tr> |
||
− | <th id="T_f9dda_level0_row4" class="row_heading level0 row4" >29</th> |
||
− | <td id="T_f9dda_row4_col0" class="data row4 col0" >Island Park - Wellington Village</td> |
||
− | <td id="T_f9dda_row4_col1" class="data row4 col1" >5339</td> |
||
− | </tr> |
||
− | </tbody> |
||
− | </table> |
||
− | |||
− | |||
− | </div> |
||
− | |||
− | </div> |
||
− | <div class="cell code" execution_count="4"> |
||
− | |||
− | <source lang="python"># Plot the target and source geometries |
||
− | # ------------------------------------- |
||
− | |||
− | # Create subplots; set their size and padding |
||
− | fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(30, 30)) |
||
− | plt.subplots_adjust(wspace=0) |
||
− | |||
− | # Plot the Ottawa census tracts figure |
||
− | ct_gdf.plot(ax=axs[0], color="none", linewidth=0.25) |
||
− | axs[0].set_title('Source 1: Ottawa Census Tracts') |
||
− | axs[0].axis('off') |
||
− | |||
− | # Ottawa dissemination areas figure |
||
− | da_gdf.plot(ax=axs[1], color="none", linewidth=0.25) |
||
− | axs[1].set_title('Source 2: Ottawa Dissemination Areas') |
||
− | axs[1].axis('off') |
||
− | |||
− | # Ottawa neighborhoods figure |
||
− | nbhd_gdf.plot(ax=axs[2], color="none", linewidth=0.25) |
||
− | axs[2].set_title('Target: Ottawa Neighborhoods') |
||
− | axs[2].axis('off')</source> |
||
− | <div class="output execute_result" execution_count="4"> |
||
− | |||
− | <pre>(389723.8668414538, 485020.5102585437, 4976456.050692046, 5045514.737618738)</pre> |
||
− | |||
− | </div> |
||
− | <div class="output display_data"> |
||
− | |||
− | [[File:fb27f923d84657b56b81b0281ce176f7e05545c9.png]] |
||
− | |||
− | |||
− | </div> |
||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | === Perform Area Weighted Interpolation of Census Tracts and Dissemination Areas === |
||
− | |||
− | Now that we have an understanding of the data we're working with we can move on to the interpolation - first area weighted and then dasymetric interpolation. |
||
− | |||
− | Area weighted interpolation works by intersecting the source and target geometries, calculating the areal weights (i.e. the proportion a source geometry is covered by each given target geometry), and then allocating a proportional amount of the variable from the source to the targets. |
||
− | |||
− | Please refer to Tobler's documentation of the [https://pysal.org/tobler/generated/tobler.area_weighted.area_interpolate.html <code>tobler.area_weighted.area_interpolate()</code>] function for a mathematical description and help with the parameters. |
||
− | |||
− | The first thing to consider is whether the variable to be interpolated is extensive or intensive. An extensive variable is one which depends on the size of the sample (e.g. population counts) while an intensive variable does not depend on the size of the sample (e.g. population density). In this tutorial we will only be dealing with extensive variables (population counts) but it's worth noting that Tobler's interpolation methods can handle both types through the use of different parameters. |
||
− | |||
− | We will pass source and target GeoDataFrames to the function along with the column name of the extensive variable. The extensive variable will then be interpolated from the source's geometry to the target geometry. The result will be GeoDataFrame containing the target geometries and interpolated values of the extensive variable. |
||
− | |||
− | We will perform the areal interpolation twice: first with the census tracts as the source data, and then with the dissemination areas. This will allow us to compare the results of these different sources and see if there's any benefit to using smaller census geographies as the source. |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Area weighted interpolation: census tracts to neighborhoods |
||
− | # ----------------------------------------------------------- |
||
− | |||
− | ct_area_interp_gdf = tobler.area_weighted.area_interpolate(source_df=ct_gdf, |
||
− | target_df=nbhd_gdf, |
||
− | extensive_variables=['ct_pop_2016']) |
||
− | |||
− | # Round the interpolation results to the nearest integer and change the type to integer |
||
− | # (Population counts must be integers) |
||
− | ct_area_interp_gdf['ct_pop_2016'] = ct_area_interp_gdf['ct_pop_2016'].round(decimals=0).astype(int) |
||
− | |||
− | # Rename the results column for clarity later |
||
− | ct_area_interp_gdf = ct_area_interp_gdf.rename(columns={'ct_pop_2016':'ct_area_interp_est'})</source> |
||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Area weighted interpolation: dissemination areas to neighborhoods |
||
− | # ----------------------------------------------------------------- |
||
− | |||
− | da_area_interp_gdf = tobler.area_weighted.area_interpolate(source_df=da_gdf, |
||
− | target_df=nbhd_gdf, |
||
− | extensive_variables=['da_pop_2016']) |
||
− | |||
− | # Round the interpolation results to the nearest integer and change the type to integer |
||
− | # (Population counts must be integers) |
||
− | da_area_interp_gdf['da_pop_2016'] = da_area_interp_gdf['da_pop_2016'].round(decimals=0).astype(int) |
||
− | |||
− | # Rename the results column for clarity later |
||
− | da_area_interp_gdf = da_area_interp_gdf.rename(columns={'da_pop_2016':'da_area_interp_est'})</source> |
||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | === Dasymetric Interpolation of Census Tracts and Dissemination Areas === |
||
− | |||
− | With that complete we can move on the dasymetric interpolation. |
||
− | |||
− | Dasymetric interpolation, or masked area weighted interpolation, incorporates secondary information to help refine the spatial distribution of the variable. For example, a census tract might include an inhabited urban area and a large uninhabited park. Secondary information about that census tract's land cover is used to mask out those parts of the census tract that don't have an urban land cover, and thus don't contain the extensive variable. Area weighted interpolation is then applied to these masked census tracts and the results should be more accurate than simple area weighted interpolation on its own. |
||
− | |||
− | Please refer to Tobler's documentation of the <code>tobler.dasymetric.masked_area_interpolate()</code>](https://pysal.org/tobler/generated/tobler.dasymetric.masked_area_interpolate.html#tobler.dasymetric.masked_area_interpolate) function for help with the parameters. |
||
− | |||
− | The land cover raster that we will use comes from the [https://open.canada.ca/data/en/dataset/4e615eae-b90c-420b-adee-2ca35896caf6 2015 Land Cover of Canada]. This 30 m spatial resolution land cover product is produced every 5 years and covers the whole country. Tobler's dasymetric interpolation function will automatically clip the raster to the extent of the source data (City of Ottawa) but I have already clipped it to reduce the file-size and processing time. |
||
− | |||
− | The path of the land cover raster and the pixel value(s) of the land cover classes that contain dwellings are passed to the function along with the source and target GeoDataFrames, and the column name of the extensive variable. The result is a GeoDataFrame made up of the target geometries with interpolated values for the extensive variable. |
||
− | |||
− | For the same reason as before, we will perform the dasymetric interpolation twice: first with the census tracts as the source data, and then with the dissemination areas. <code>masked_area_interpolate()</code> will throw some warnings but just ignore them. The length of time it takes to run each of these cells on my computer has been included for your reference. |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Dasymetric interpolation: census tracts + urban landover to neighborhoods |
||
− | #-------------------------------------------------------------------------- |
||
− | |||
− | # Perform dasymetric interpolation |
||
− | ct_dasy_interp_gdf = tobler.dasymetric.masked_area_interpolate(source_df=ct_gdf, |
||
− | target_df=nbhd_gdf, |
||
− | raster='data/ottawa_landcover.tif', |
||
− | codes=[17], |
||
− | extensive_variables=['ct_pop_2016']) |
||
− | |||
− | # Round the interpolation results to the nearest integer and change the type to integer |
||
− | # (Population counts must be integers) |
||
− | ct_dasy_interp_gdf['ct_pop_2016'] = ct_dasy_interp_gdf['ct_pop_2016'].round(decimals=0).astype(int) |
||
− | |||
− | # Rename the results column for clarity later |
||
− | ct_dasy_interp_gdf = ct_dasy_interp_gdf.rename(columns={'ct_pop_2016':'ct_dasy_interp_est'}) |
||
− | |||
− | # ~30 s ; ignore warnings</source> |
||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Dasymetric interpolation: dissemination areas + urban landover to neighborhoods |
||
− | #------------------------------------------------------------------------ |
||
− | |||
− | # Perform dasymetric interpolation |
||
− | da_dasy_interp_gdf = tobler.dasymetric.masked_area_interpolate(source_df=da_gdf, |
||
− | target_df=nbhd_gdf, |
||
− | raster='data/ottawa_landcover.tif', |
||
− | codes=[17], |
||
− | extensive_variables=['da_pop_2016']) |
||
− | |||
− | # Round the interpolation results to the nearest integer and change the type to integer |
||
− | # (Population counts must be integers) |
||
− | da_dasy_interp_gdf['da_pop_2016'] = da_dasy_interp_gdf['da_pop_2016'].round(decimals=0).astype(int) |
||
− | |||
− | # Rename the results column for clarity later |
||
− | da_dasy_interp_gdf = da_dasy_interp_gdf.rename(columns={'da_pop_2016':'da_dasy_interp_est'}) |
||
− | |||
− | # ~1.5 mins : ignore warnings</source> |
||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | === Assessing the Interpolation Results === |
||
− | |||
− | Because the Ottawa Neighbourhood Study data came with a population estimate for each neighborhood (<code>nbhd_pop_est</code>) we can compare those estimated values against the interpolated values. That being said, it is very important to note that the accuracy of this assessment will be very limited for a number of reasons: |
||
− | |||
− | # Metadata associated with the Ottawa neighborhood survey data does not specify the year of their population estimates. This data was first published to the Open Ottawa data portal on November 25, 2019 and then updated on June 2, 2021. The census tract and dissemination area population data comes from the 2016 census. |
||
− | # Metadata associated with the Ottawa neighborhood survey data does not specify the methodology they used for estimating the neighborhood populations. Perhaps it came from an areal interpolation method similar to what we are using or perhaps it came from a higher resolution data set that the City of Ottawa has but which they don't publish (e.g. size of each households). |
||
− | # The Ottawa neighborhood survey data indicates that the 'Beechwood Cemetery' and 'Notre-Dame Cemetery' neighborhoods have populations of 139 and 17, respectively, despite no evidence of residences within them. |
||
− | # The sum of all the neighborhood population estimates is 867146 while the census_tracts' sum is 934243. These last two points suggest that these neighborhood population estimates are the result of interpolation from an unknown dataset (2011 census data perhaps?). |
||
− | |||
− | Despite these issues we will go ahead and use these estimates to assess the interpolation results. We will do the following: |
||
− | |||
− | # Merge the interpolation results with the neighborhood population estimates and look at a sample of the results |
||
− | # Perform a statistical assessment of the interpolation results |
||
− | # Visually assess the percent error of each neighborhood (interpolated population vs <code>nbhd_pop_est</code>) |
||
− | # Create a 1:1 scatter plot of the interpolation results against the neighborhood population estimates |
||
− | # Discussion |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Merge the results for comparison |
||
− | #--------------------------------- |
||
− | |||
− | # Create a list of the GeoDataFrames (dropping the redundant geometry) |
||
− | dfs = [nbhd_gdf, |
||
− | ct_area_interp_gdf.drop(columns='geometry'), |
||
− | ct_dasy_interp_gdf.drop(columns='geometry'), |
||
− | da_area_interp_gdf.drop(columns='geometry'), |
||
− | da_dasy_interp_gdf.drop(columns='geometry'),] |
||
− | |||
− | # Concatenate the GeoDataFrames |
||
− | interp_results_gdf = pd.concat(objs=dfs, axis=1) |
||
− | |||
− | # View a sample of the interpolation results GeoDataFrame (without the geometry column) |
||
− | interp_results_gdf.sample(10).drop(columns='geometry')</source> |
||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | Looking at this sample we can see that the different interpolation methods and source data have yielded results that look roughly in line with the neighborhood population estimates that came from the Ottawa Neighbourhood Study data (<code>nbhd_pop_est</code>). Let's perform a more rigorous evaluation while not forgetting that the neighborhood population estimates have some issues. |
||
− | |||
− | We will begin by defining some functions to calculate percent error, root mean square error, normalized square error, mean bias error, mean absolute error, and r value. There are 3rd party modules that contain functions to perform these calculations but defining our own is a good exercise. |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Define functions to assess the results |
||
− | # -------------------------------------- |
||
− | |||
− | # PE |
||
− | def percent_error(estimated, expected): |
||
− | ''' Return Percent Error where estimated and expected are numeric or array-like''' |
||
− | return abs(((estimated - expected) / expected) * 100) |
||
− | |||
− | # RMSE |
||
− | def rmse(estimated, expected): |
||
− | ''' Return Root Mean Square Error where estimated and expected are array-like''' |
||
− | return np.sqrt(np.mean((estimated - expected) ** 2)) |
||
− | |||
− | # NRMSE |
||
− | def nrmse(estimated, expected): |
||
− | ''' Return Normalized Root Mean Square Error where estimated and expected are array-like''' |
||
− | return np.sqrt(np.mean((estimated - expected) ** 2))/np.std(estimated) |
||
− | |||
− | # MBE |
||
− | def mbe(estimated, expected): |
||
− | ''' Return Mean Bias Error where estimated and expected are array-like''' |
||
− | return np.mean(estimated - expected) |
||
− | |||
− | # MAE |
||
− | def mae(estimated, expected): |
||
− | ''' Return Mean Mean Absolute Error where estimated and expected are array-like''' |
||
− | return np.mean(abs(estimated - expected)) |
||
− | |||
− | # r value |
||
− | def r_value(estimated, expected): |
||
− | ''' Return r value where estimated and expected are array-like.''' |
||
− | |||
− | # Get the r_value by unpacking the results from SciPy's linregress() function |
||
− | slope, intercept, r_value, p_value, std_err = stats.linregress(estimated, expected) |
||
− | return r_value</source> |
||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | We will now use those functions to generate statistics on the four sets of interpolation results. The percent error values are concatenated to the <code>interp_results_gdf</code> so we can plot maps showing the percent error of each neighborhood. The other statistics are placed into their own DataFrame so we can assess them in a tabular format. |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Calculate statistics on the interpolation results |
||
− | # ------------------------------------------------- |
||
− | |||
− | # Assign the interpolation results columns to their own variables for clarity |
||
− | ct_area_est = interp_results_gdf['ct_area_interp_est'] |
||
− | ct_dasy_est = interp_results_gdf['ct_dasy_interp_est'] |
||
− | da_area_est = interp_results_gdf['da_area_interp_est'] |
||
− | da_dasy_est = interp_results_gdf['da_dasy_interp_est'] |
||
− | expected = interp_results_gdf['nbhd_pop_est'] |
||
− | |||
− | # Create new columns in interp_results_gdf to represent percent error |
||
− | interp_results_gdf['ct_area_interp_%_error'] = round(percent_error(ct_area_est, expected), 1) |
||
− | interp_results_gdf['ct_dasy_interp_%_error'] = round(percent_error(ct_dasy_est, expected), 1) |
||
− | interp_results_gdf['da_area_interp_%_error'] = round(percent_error(da_area_est, expected), 1) |
||
− | interp_results_gdf['da_dasy_interp_%_error'] = round(percent_error(da_dasy_est, expected), 1) |
||
− | |||
− | # Create a DataFrame containing the other statistics |
||
− | interp_stats_df = pd.DataFrame(data={'Interpolation Method': ['Area Weighted','Area Weighted', |
||
− | 'Dasymetric', 'Dasymetric'], |
||
− | 'Source Geographies': ['Census Tracts', 'Dissemination Areas', |
||
− | 'Census Tracts', 'Dissemination Areas'], |
||
− | 'RMSE': [rmse(ct_area_est, expected), |
||
− | rmse(da_area_est, expected), |
||
− | rmse(ct_dasy_est, expected), |
||
− | rmse(da_dasy_est, expected)], |
||
− | 'NRMSE': [nrmse(ct_area_est, expected), |
||
− | nrmse(da_area_est, expected), |
||
− | nrmse(ct_dasy_est, expected), |
||
− | nrmse(da_dasy_est, expected)], |
||
− | 'MBE': [mbe(ct_area_est, expected), |
||
− | mbe(da_area_est, expected), |
||
− | mbe(ct_dasy_est, expected), |
||
− | mbe(da_dasy_est, expected)], |
||
− | 'MAE': [mae(ct_area_est, expected), |
||
− | mae(da_area_est, expected), |
||
− | mae(ct_dasy_est, expected), |
||
− | mae(da_dasy_est, expected)], |
||
− | 'r': [r_value(ct_area_est, expected), |
||
− | r_value(da_area_est, expected), |
||
− | r_value(ct_dasy_est, expected), |
||
− | r_value(da_dasy_est, expected)], |
||
− | 'r2': [r_value(ct_area_est, expected)**2, |
||
− | r_value(da_area_est, expected)**2, |
||
− | r_value(ct_dasy_est, expected)**2, |
||
− | r_value(da_dasy_est, expected)**2]}).round(decimals=2) |
||
− | |||
− | # Display the DataFrame containing the statistics |
||
− | interp_stats_df</source> |
||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | From these statistics it's pretty clear that the the dasymetric method is more effective than the area weighted method. This is shown by the lower error values, as represented by the RMSE, NRMSE, and MAE. There's also a higher correlation between the results of the dasymetric method and the neighborhood population estimates. |
||
− | |||
− | These statistics also show that using the smaller dissemination areas as the source data for the interpolation yields lower errors and higher correlations, compared to the larger census tracts. In fact, using the disseminations areas with the area weighted method even gives better results than the dasymetric method applied to the census tracts. |
||
− | |||
− | Positive MBE values for all four interpolation results show that all four interpolation results are coming out higher than the neighborhood population estimates. As previously discussed, this is due to a higher total population in the census data (934243) than in the neighborhood data (867146). It's too bad these neighborhood estimates aren't closer to the census data as that would help to make this assessment more credible. |
||
− | |||
− | Let's see if a visual assessment of the results line up with this statistical one. Plotly offers a very powerful set of plotting tools which render spatial and non-spatial data as interactive figures. There are simpler ways to plot this with less code but the interactive nature of Plotly figures is really great for exploratory data analysis. Before creating the Plotly choropleth facet plots the data needs to be reworked into a specific format. The geometries have be supplied in the GeoJson format and the values have to be melted into a single column with an additional column providing labels. |
||
− | |||
− | Check out these links for information on the Pandas <code>df.melt()</code> method and Plotly choropleth and facet plots: |
||
− | |||
− | * [https://pandas.pydata.org/docs/reference/api/pandas.melt.html <code>df.melt()</code> documentation] |
||
− | * [https://plotly.com/python/choropleth-maps/ Plotly Choropleth Maps] |
||
− | * [https://plotly.github.io/plotly.py-docs/generated/plotly.express.choropleth.html <code>px.choropleth()</code> documentation] |
||
− | * [https://plotly.com/python/facet-plots/ Plotly Facet Plots] |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Plot the Percent Error of the four results using a Plotly Facet Map |
||
− | |||
− | # -------------------------------------------------------------------------------- |
||
− | |||
− | # Convert interp_results_gdf from GeoDataFrame to GeoJson |
||
− | geojson = interp_results_gdf.to_crs(epsg='4326').to_json() |
||
− | geojson = json.loads(geojson) |
||
− | |||
− | # Reformat the interp_results_gdf for the plot |
||
− | df = pd.DataFrame(data=interp_results_gdf.drop(columns='geometry')) |
||
− | |||
− | # Rename the results columns as they will become the facet plot labels |
||
− | df = df.rename(columns={'ct_area_interp_%_error':'Source: Census Tracts <br> Method: Area Weighted', |
||
− | 'ct_dasy_interp_%_error':'Source: Census Tracts <br> Method: Dasymetric', |
||
− | 'da_area_interp_%_error':'Source: Dissemination Areas <br> Method: Area Weighted', |
||
− | 'da_dasy_interp_%_error':'Source: Dissemination Areas <br> Method: Dasymetric',}) |
||
− | |||
− | # Combine all the results columns into one column labeled with a method column |
||
− | df = df.melt(id_vars='nbhd_name', |
||
− | value_vars=['Source: Census Tracts <br> Method: Area Weighted', |
||
− | 'Source: Census Tracts <br> Method: Dasymetric', |
||
− | 'Source: Dissemination Areas <br> Method: Area Weighted', |
||
− | 'Source: Dissemination Areas <br> Method: Dasymetric'], |
||
− | var_name='method', |
||
− | value_name='Error (%)') |
||
− | |||
− | # Create the Plotly Express choropleth figure |
||
− | fig = px.choropleth(data_frame=df, |
||
− | title="Areal Interpolation of 2016 Census Population Data to Ottawa, ON Neighborhoods", |
||
− | locations='nbhd_name', |
||
− | geojson=geojson, |
||
− | featureidkey='properties.nbhd_name', |
||
− | color='Error (%)', |
||
− | facet_col='method', |
||
− | facet_col_wrap=2, |
||
− | hover_name='nbhd_name', |
||
− | range_color=[0,100], |
||
− | color_continuous_scale='Inferno', |
||
− | #color_discrete_sequence= px.colors.sequential.Plasma, |
||
− | projection='mercator', |
||
− | fitbounds="locations", |
||
− | height=800, width=800) |
||
− | |||
− | fig.update_layout(mapbox_style="open-street-map") |
||
− | fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1])) |
||
− | |||
− | fig.show()</source> |
||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | That's a pretty slick set of maps, if I can say so myself! Each can be explored by zooming in/out and by hovering the cursor over different features. |
||
− | |||
− | There are some other limitations with Plotly though. While it does support tiled base maps through their MapBox figures (e.g., <code>px.choropleth_mapbox()</code>) it's not possible to create facet plots (i.e., side-by-side plots) with a tiled base map (or at least I can't figure out how to do it). A super easy alternative for assessing a single map is GeoPandas' <code>.explore()</code> method. Also, Plotly doesn't make it easy to represent the values as discrete classified bins instead of a continuous color scale - no doubt there's a way to do this but we'll have to leave that out for now. Instead I've limited the <code>Error (%)</code> color bar range to 100 despite some neighborhoods have percent errors higher than 100. Feel free to adjust the <code>rangecolor=</code> argument to a different scale. if you so desire. |
||
− | |||
− | Another way to represent the results in an interactive map is through GeoPandas' <code>explore()</code> method. Like the Plotly maps, it's possible to zoom and inspect the attributes of each feature hovering the cursor over them but we can't put several maps into the same figure. |
||
− | |||
− | In the cell below we can change the <code>column</code> argument to display a color-scaled representation of any of the columns in the <code>interp_results_gdf</code>. The classification scheme can also be changed using the <code>scheme</code> argument. See the [https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.explore.html <code>gdf.explore()</code> documentation] for the classification schemes and other parameters. |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Visually Assess Percent Error of the Area Weighted Interpolation |
||
− | # ---------------------------------------------------------------- |
||
− | |||
− | # Uncomment one of the following percent error columns to display that its values in the map |
||
− | column = 'ct_area_interp_%_error' # Source: Census Tracts | Method: Area Weighted |
||
− | #column='da_area_interp_%_error' # Source: Dissemination Areas | Method: Area Weighted |
||
− | #column='ct_dasy_interp_%_error' # Source: Census Tracts | Method: Dasymetric |
||
− | #column='da_dasy_interp_%_error' # Source: Dissemination Areas | Method: Dasymetric |
||
− | |||
− | interp_results_gdf.explore(column=column, |
||
− | cmap='YlOrRd', |
||
− | scheme='JenksCaspall', |
||
− | legend_kwds={'colorbar':False}, |
||
− | style_kwds={'weight':1})</source> |
||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | Lastly, let's plot the interpolation results against the neighborhood population estimates using 1:1 scatter plots. We'll use Plotly again for this as its <code>px.scatter()</code> facet plots allow us to hover the cursor over the points in order to see their names and other information. It's a lot of code but I think it's worth it. |
||
− | |||
− | Check out these links for information on Plotly scatter plots: |
||
− | |||
− | * [https://plotly.com/python/line-and-scatter/ Plotly Scatter Plots] |
||
− | * [https://plotly.com/python-api-reference/generated/plotly.express.scatter <code>px.scatter()</code> documentation] |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Create 1:1 Facet Plots using Plotly |
||
− | # ------------------------------------- |
||
− | |||
− | # Convert the interpolation results to a DataFrame |
||
− | df = pd.DataFrame(data=interp_results_gdf.drop(columns='geometry')) |
||
− | |||
− | # Rename the results columns as they will become the facet plot labels |
||
− | df = df.rename(columns={'ct_area_interp_est':'Source: Census Tracts <br> Method: Area Weighted', |
||
− | 'ct_dasy_interp_est':'Source: Census Tracts <br> Method: Dasymetric', |
||
− | 'da_area_interp_est':'Source: Dissemination Areas <br> Method: Area Weighted', |
||
− | 'da_dasy_interp_est':'Source: Dissemination Areas <br> Method: Dasymetric'}) |
||
− | |||
− | # Combine all the results columns into one column |
||
− | df = df.melt(id_vars=['nbhd_name', 'nbhd_pop_est'], |
||
− | value_vars=['Source: Census Tracts <br> Method: Area Weighted', |
||
− | 'Source: Census Tracts <br> Method: Dasymetric', |
||
− | 'Source: Dissemination Areas <br> Method: Area Weighted', |
||
− | 'Source: Dissemination Areas <br> Method: Dasymetric'], |
||
− | var_name='method', |
||
− | value_name='interp_pop_est') |
||
− | |||
− | # Add a percent error column |
||
− | estimated = df['interp_pop_est'] |
||
− | expected = df['nbhd_pop_est'] |
||
− | df['%_error'] = round(percent_error(estimated, expected), 1) |
||
− | |||
− | # Create the Plotly Express Scatter figure |
||
− | fig = px.scatter(data_frame=df, |
||
− | title="Areal Interpolation of 2016 Census Population Data to Ottawa, ON Neighborhoods", |
||
− | x='nbhd_pop_est', |
||
− | y='interp_pop_est', |
||
− | height=800, width=800, |
||
− | color='%_error', |
||
− | facet_col='method', |
||
− | facet_col_wrap=2, |
||
− | hover_name='nbhd_name', |
||
− | labels={'interp_pop_est': 'Interpolated Population', |
||
− | 'nbhd_pop_est': ' Estimated Population'}, |
||
− | color_continuous_scale='Inferno', |
||
− | range_color=[0,100]) |
||
− | |||
− | # Create the 1:1 line |
||
− | line_max = max([max(df['interp_pop_est']), max(df['nbhd_pop_est'])]) |
||
− | line_min = min([min(df['interp_pop_est']), min(df['nbhd_pop_est'])]) |
||
− | |||
− | fig.update_layout(shapes=[ |
||
− | dict(type='line', xref='x', yref='y', |
||
− | x0=line_min, y0=line_min, x1=line_max, y1=line_max, line_width=1), |
||
− | dict(type='line', xref='x2', yref='y2', |
||
− | x0=line_min, y0=line_min, x1=line_max, y1=line_max, line_width=1), |
||
− | dict(type='line', xref='x3', yref='y3', |
||
− | x0=line_min, y0=line_min, x1=line_max, y1=line_max, line_width=1), |
||
− | dict(type='line', xref='x4', yref='y4', |
||
− | x0=line_min, y0=line_min, x1=line_max, y1=line_max, line_width=1)]) |
||
− | |||
− | fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1])) |
||
− | |||
− | fig.show()</source> |
||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | From these plots and maps we can see some shared characteristics of the results of all four combinations of source data and interpolation methods. |
||
− | |||
− | * They all struggle with the Greenbelt neighborhood. This might be because it's quite large and borders many other neighborhood, or perhaps the census population data and the neighborhood population estimates diverge significantly over this area. |
||
− | * Neighborhoods with small populations, such as Carleton University, Beechwood Cemetary, Notre-Dame Cemetary, Lebreton Development) show high percent errors, possibly due to having low populations while being surrounded by neighborhoods with high populations. |
||
− | * Neighborhoods with high populations, such as Stittsville, Centretown, Old Barrhaven East and "Bridlewood - Emerald Meadows" have interpolated populations that are consistently higher than the neighbourhood population estimates. Again, this might be due to underlying issues with the neighbourhood population estimates. |
||
− | * "Stonebridge - Half Moon Bay - Heart's Desire" has a high percent error in all results - possibly due to the same issue as above. |
||
− | |||
− | From this one analysis, and again bearing in mind the issues with the neighborhood population estimates, it's pretty clear that the dasymetric interpolation methods is superior to the area weighted method but that using finer resolution source data has a more powerful contribution to the accuracy of the results. |
||
− | |||
− | So how can we eliminate the issues with the Ottawa Neighborhood Study population estimates and perform a more objective evaluation of the methods? Well, we can make our own population data! |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | |||
− | ----- |
||
− | |||
− | == Part 2 - Areal Interpolation of Synthetic Population Data == |
||
− | |||
− | In part 2 we will perform the same set of areal interpolations but this time the real population data will be replaced with synthetic population data. By doing this we will know the exact population of the census tracts, dissemination areas, and neighborhoods - allowing us to accurately compare the interpolated populations against the expected populations. A better picture should emerge of the differences between the area weighted and dasymetric interpolation methods, and the effect of the source data's scale. |
||
− | |||
− | Our steps in part 2 will be: |
||
− | |||
− | # Generate synthetic population |
||
− | # Count the synthetic population |
||
− | # Perform interpolation |
||
− | # Assess the results |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | === Generate Synthetic Population Data === |
||
− | |||
− | We first need to generate a dataset containing a synthetic population. Ideally, we would have access to a dataset where every person in Ottawa is identified as a point at their dwelling but, because StatCan doesn't release this fine scale for privacy reasons, we will have to approximate it. |
||
− | |||
− | What we want to do is create a distribution of points that does a pretty good job of approximating the actual distribution of people within a city. It has to have a non-uniform density and be spatially limited to those areas where people actually live. Datasets that could help to approximate this include city land use zones, building footprints, and high resolution land cover maps. To keep things relatively simple here we will just do the following: |
||
− | |||
− | # Clip the dissemination areas to the urban land cover type |
||
− | #* Tobler provides a handy function that extracts specific land cover types from a raster and then outputs a GeoDataFrame containing their extent: [https://pysal.org/tobler/generated/tobler.dasymetric.extract_raster_features.html <code>tobler.dasymetric.extract_raster_features()</code>] |
||
− | # Clip the clipped dissemination areas to Ottawa's land use zones which permit dwellings |
||
− | #* These land use zones come from [https://ottawa.ca/en/living-ottawa/laws-licences-and-permits/laws/law-z/planning-development-and-construction/maps-and-zoning/zoning-law-no-2008-250/zoning-law-2008-250-consolidation/part-1-administration-interpretation-and-definitions-sections-1-54#section-29-46-interpreting-zoning-information Section 35 of the City of Ottawa's zoning definitions] |
||
− | #* Nineteen zones where dwelling units are permitted have been selected (see the code cell below) |
||
− | #* A more rigorous approach would consider all of the exceptions granted in the subzones but those exceptions will be ignored here |
||
− | # Generate a number of randomly distributed points within each dissemination area where the number of those points are based off of the 2016 census population |
||
− | # Count the number of points that fall within each census tract, dissemination area, and neighborhood |
||
− | |||
− | Generating a reasonably large population of synthetic points within these clipped dissemination areas can take quite a long time (about 15 mins for 93433 points on my old MacBook Pro). If you want to avoid the processing time just run the cell immediately below this one ("Skip the point generation and just read them from a file") and then jump ahead to the cell called "Count the points in each census tract". Or, if you want to make your own points then skip the next cell and run all of the others. |
||
− | |||
− | Ok, let's get to it. |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Skip the point generation and just read them from a file |
||
− | # -------------------------------------------------------- |
||
− | |||
− | points_gdf = gpd.read_file('data/points.gpkg') |
||
− | |||
− | # Optional: uncomment the line below to explore the points |
||
− | #points_gdf.explore() |
||
− | </source> |
||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Create clipped dissemination areas |
||
− | # ---------------------------------- |
||
− | |||
− | # Extract the urban landcover (pixel value of 17) from the landcover raster |
||
− | raster_path = 'data/ottawa_landcover.tif' |
||
− | urban_landcover_gdf = tobler.dasymetric.extract_raster_features(gdf=nbhd_gdf, |
||
− | raster_path=raster_path, |
||
− | pixel_values=17) |
||
− | |||
− | urban_landcover_gdf = urban_landcover_gdf.to_crs(epsg=32618) |
||
− | |||
− | # Read the Ottawa zoning |
||
− | zoning_gdf = gpd.read_file('data/ottawa_zoning.gpkg') |
||
− | |||
− | # Only keep the zones that residents can live in |
||
− | people_zones = ['R1', 'R2', 'R3', 'R4', 'R5', 'RM', 'LC', 'GM', 'TM', |
||
− | 'AM','MC','MD', 'AG', 'RR','RU','VM', 'V1','V2','V3'] |
||
− | |||
− | zoning_gdf = zoning_gdf[zoning_gdf['zone_main'].isin(people_zones)] |
||
− | |||
− | # Clip the dissemination areas to the urban landcover and the zoning |
||
− | da_clip_gdf = gpd.clip(gdf=da_gdf, mask=zoning_gdf) |
||
− | da_clip_gdf = gpd.clip(gdf=da_clip_gdf, mask=urban_landcover_gdf) |
||
− | |||
− | # ~2 mins ; ignore warnings</source> |
||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Define function to generate a specific number of random points in a polygon |
||
− | # --------------------------------------------------------------------------- |
||
− | |||
− | def random_points(row, n_column, geom_column, divisor=1, min_n=1): |
||
− | ''' Returns n number of random points within GeoDataFrame polygons. |
||
− | |||
− | Parameters: |
||
− | ---------- |
||
− | row : Pandas Series |
||
− | Must contain: |
||
− | - A column containing the desired number of points in each polygons |
||
− | - A column containing polygon geometry |
||
− | |||
− | n_column : string |
||
− | - The name of the column that contains the desired number of points |
||
− | |||
− | geom_column : string |
||
− | - The name of the column that contains the GeoDataBase geometry |
||
− | |||
− | divisor : int (default = 1) |
||
− | - A value used to reduce the number of points (n2 = n1 / divisor) |
||
− | - Good for shortening computation time |
||
− | |||
− | min_n : int (default = 1) |
||
− | - The minimum number of points in polygon |
||
− | - Good for forcing all polygons to have at least 1 point |
||
− | |||
− | Notes: |
||
− | ------ |
||
− | This function is very slow when trying to generate points in polygons that |
||
− | have coverages that are much smaller than their extent (e.g a C-shaped polygon) |
||
− | ''' |
||
− | |||
− | # Read the polygon and its bounds |
||
− | polygon = row[geom_column] |
||
− | min_x, min_y, max_x, max_y = polygon.bounds |
||
− | |||
− | # Determine the number of points to generate in the polygon |
||
− | n_points = round((row[n_column] / divisor), 0) |
||
− | if n_points < min_n: |
||
− | n_points = min_n |
||
− | |||
− | # Generate the points and put them in a list |
||
− | points_list = [] |
||
− | while len(points_list) < n_points: |
||
− | point = Point(np.random.uniform(min_x, max_x), np.random.uniform(min_y, max_y)) |
||
− | if point.intersects(polygon): |
||
− | points_list.append(point) |
||
− | |||
− | return points_list</source> |
||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Generate points in the clipped dissemination areas |
||
− | # -------------------------------------------------- |
||
− | |||
− | # Generate the random points |
||
− | points_srs = da_clip_gdf.apply(func=random_points, |
||
− | args=('da_pop_2016', 'geometry', 10, 1), axis=1) |
||
− | |||
− | # The output is a pd.Series so convert it to a list |
||
− | points_list = list(points_srs) |
||
− | |||
− | # Flatten the list of lists so each item in the list is a point |
||
− | points_flat_list = [] |
||
− | for i in points_list: |
||
− | for j in i: |
||
− | points_flat_list.append(j) |
||
− | |||
− | # Create a GeoDataFrame of the synthetic points using the flat list as the geometry column |
||
− | points_gdf = gpd.GeoDataFrame(geometry=points_flat_list, crs="EPSG:32618") |
||
− | |||
− | # If you made new points and want to save them, |
||
− | # uncomment the line below to write them to a file |
||
− | #points_gdf.to_file('data/new_points_gdf.gpkg', driver='GPKG') |
||
− | |||
− | # Optional: uncomment the line below to explore the points |
||
− | #points_gdf.explore() |
||
− | |||
− | # ~15-20 mins if divisor=10</source> |
||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | We now have a GeoDataFrame containing the synthetic population (<code>points_gdf</code>). Each point is a synthetic person that lives where dwellings are allowed in the urban land cover class. |
||
− | |||
− | We can count how many fall into each census tract, dissemination area, and neighborhood by using the GeoPandas <code>gpd.sjoin()</code> spatial join method, giving each point a value of 1, grouping the features by their id or name, and them summing them. To see how exactly this works feel free to split the cells below into single expressions and then use <code>gdf.head()</code> on the intermediate GeoDataFrames. |
||
− | |||
− | For more information on these operations check out the documentation: |
||
− | |||
− | * [https://geopandas.org/en/stable/docs/reference/api/geopandas.sjoin.html <code>gpd.sjoin()</code>] |
||
− | * [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html <code>pd.groupby()</code>] |
||
− | * [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.sum.html <code>pd.groupby.sum()</code>] |
||
− | * [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html <code>pd.merge</code>] |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Count the points in each census tract |
||
− | # ------------------------------------- |
||
− | |||
− | # Spatial join of points_gdf to ct_gdf |
||
− | ct_points_gdf = gpd.sjoin(points_gdf, ct_gdf, how='inner', predicate='intersects') |
||
− | |||
− | # Count the points that fall in each census tract |
||
− | ct_points_gdf['ct_synth_pop'] = 1 |
||
− | ct_points_sums_gdf = ct_points_gdf.groupby(['ctuid']).sum() |
||
− | |||
− | # Merge the counts with the census tracts |
||
− | ct_synth_gdf = ct_gdf.merge(ct_points_sums_gdf[['ct_synth_pop']], on='ctuid') |
||
− | |||
− | ct_synth_gdf.head(5)</source> |
||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Count the points in each dissemination area |
||
− | # ------------------------------------------- |
||
− | |||
− | # Spatial join of points_gdf to da_gdf |
||
− | da_points_gdf = gpd.sjoin(points_gdf, da_gdf, how='inner', predicate='intersects') |
||
− | |||
− | # Count the points that fall in each dissemination areas |
||
− | da_points_gdf['da_synth_pop'] = 1 |
||
− | da_points_sums_gdf = da_points_gdf.groupby(['dauid']).sum() |
||
− | |||
− | # Merge the counts with the census tracts |
||
− | da_synth_gdf = da_gdf.merge(da_points_sums_gdf[['da_synth_pop']], on='dauid') |
||
− | |||
− | da_synth_gdf.head(5)</source> |
||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Count the points in each neighborhood |
||
− | # ------------------------------------- |
||
− | |||
− | # Spatial join of synthetic points to neighborhoods |
||
− | nbhd_points_gdf = gpd.sjoin(points_gdf, nbhd_gdf, how='inner', predicate='intersects') |
||
− | |||
− | # Count the points that fall in each neighborhood |
||
− | nbhd_points_gdf['nbhd_synth_pop'] = 1 |
||
− | nbhd_points_sums_gdf = nbhd_points_gdf.groupby(['nbhd_name']).sum() |
||
− | |||
− | # Merge the counts with the neighborhoods |
||
− | nbhd_synth_gdf = nbhd_gdf.merge(nbhd_points_sums_gdf[['nbhd_synth_pop']], on='nbhd_name') |
||
− | |||
− | nbhd_synth_gdf.head(5)</source> |
||
− | |||
− | </div> |
||
− | <div class="cell markdown"> |
||
− | |||
− | The synthetic census tract and dissemination areas population data can now be used as the source for area weighted and dasymetric interpolations. We will repeat all of the steps that we took in Part 1 but with the synthetic data |
||
− | |||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Area weighted interpolation: synthetic census tracts to neighborhoods |
||
− | # --------------------------------------------------------------------- |
||
− | |||
− | ct_area_interp_gdf = tobler.area_weighted.area_interpolate(source_df=ct_synth_gdf, |
||
− | target_df=nbhd_synth_gdf, |
||
− | extensive_variables=['ct_synth_pop']) |
||
− | |||
− | # Round the interpolation results to the nearest integer and change the type to integer |
||
− | # (Population counts must be integers) |
||
− | ct_area_interp_gdf['ct_synth_pop'] = ct_area_interp_gdf['ct_synth_pop'].round(decimals=0).astype(int) |
||
− | |||
− | # Rename the results column for clarity later |
||
− | ct_area_interp_gdf = ct_area_interp_gdf.rename(columns={'ct_synth_pop':'ct_area_interp_est'})</source> |
||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Area weighted interpolation: synthetic dissemination areas to neighborhoods |
||
− | # --------------------------------------------------------------------------- |
||
− | |||
− | da_area_interp_gdf = tobler.area_weighted.area_interpolate(source_df=da_synth_gdf, |
||
− | target_df=nbhd_synth_gdf, |
||
− | extensive_variables=['da_synth_pop']) |
||
− | |||
− | # Round the interpolation results to the nearest integer and change the type to integer |
||
− | # (Population counts must be integers) |
||
− | da_area_interp_gdf['da_synth_pop'] = da_area_interp_gdf['da_synth_pop'].round(decimals=0).astype(int) |
||
− | |||
− | # Rename the results column for clarity later |
||
− | da_area_interp_gdf = da_area_interp_gdf.rename(columns={'da_synth_pop':'da_area_interp_est'})</source> |
||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Dasymetric interpolation: synthetic census tracts + urban landover to neighborhoods |
||
− | #------------------------------------------------------------------------------------ |
||
− | |||
− | # Perform dasymetric interpolation |
||
− | ct_dasy_interp_gdf = tobler.dasymetric.masked_area_interpolate(source_df=ct_synth_gdf, |
||
− | target_df=nbhd_synth_gdf, |
||
− | raster='data/ottawa_landcover.tif', |
||
− | codes=[17], |
||
− | extensive_variables=['ct_synth_pop']) |
||
− | |||
− | # Round the interpolation results to the nearest integer and change the type to integer |
||
− | # (Population counts must be integers) |
||
− | ct_dasy_interp_gdf['ct_synth_pop'] = ct_dasy_interp_gdf['ct_synth_pop'].round(decimals=0).astype(int) |
||
− | |||
− | # Rename the results column for clarity later |
||
− | ct_dasy_interp_gdf = ct_dasy_interp_gdf.rename(columns={'ct_synth_pop':'ct_dasy_interp_est'}) |
||
− | |||
− | # ~30 s ; ignore warnings</source> |
||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Dasymetric interpolation: synthetic dissemination areas + urban landover to neighborhoods |
||
− | #------------------------------------------------------------------------------------------ |
||
− | |||
− | # Perform dasymetric interpolation |
||
− | da_dasy_interp_gdf = tobler.dasymetric.masked_area_interpolate(source_df=da_synth_gdf, |
||
− | target_df=nbhd_synth_gdf, |
||
− | raster='data/ottawa_landcover.tif', |
||
− | codes=[17], |
||
− | extensive_variables=['da_synth_pop']) |
||
− | |||
− | # Round the interpolation results to the nearest integer and change the type to integer |
||
− | # (Population counts must be integers) |
||
− | da_dasy_interp_gdf['da_synth_pop'] = da_dasy_interp_gdf['da_synth_pop'].round(decimals=0).astype(int) |
||
− | |||
− | # Rename the results column for clarity later |
||
− | da_dasy_interp_gdf = da_dasy_interp_gdf.rename(columns={'da_synth_pop':'da_dasy_interp_est'}) |
||
− | |||
− | # ~1.5 mins ; ignore warnings</source> |
||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Merge the results for comparison |
||
− | #--------------------------------- |
||
− | |||
− | # Create a list of the GeoDataFrames (drop the redundant geometry) |
||
− | dfs = [nbhd_synth_gdf, |
||
− | ct_area_interp_gdf.drop(columns='geometry'), |
||
− | ct_dasy_interp_gdf.drop(columns='geometry'), |
||
− | da_area_interp_gdf.drop(columns='geometry'), |
||
− | da_dasy_interp_gdf.drop(columns='geometry'),] |
||
− | |||
− | # Concatenate the GeoDataFrames |
||
− | interp_results_gdf = pd.concat(dfs, axis=1) |
||
− | |||
− | # View a sample of the interpolation results GeoDataFrame (without the geometry column) |
||
− | interp_results_gdf.sample(10).drop(columns='geometry')</source> |
||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Evaluate Results |
||
− | # ---------------- |
||
− | |||
− | # Assign the interpolation results columns to their own variables for clarity |
||
− | ct_area_est = interp_results_gdf['ct_area_interp_est'] |
||
− | ct_dasy_est = interp_results_gdf['ct_dasy_interp_est'] |
||
− | da_area_est = interp_results_gdf['da_area_interp_est'] |
||
− | da_dasy_est = interp_results_gdf['da_dasy_interp_est'] |
||
− | expected = interp_results_gdf['nbhd_synth_pop'] |
||
− | |||
− | # Percent Error - create new columns in interp_results_gdf |
||
− | interp_results_gdf['ct_area_interp_%_error'] = round(percent_error(ct_area_est, expected), 1) |
||
− | interp_results_gdf['ct_dasy_interp_%_error'] = round(percent_error(ct_dasy_est, expected), 1) |
||
− | interp_results_gdf['da_area_interp_%_error'] = round(percent_error(da_area_est, expected), 1) |
||
− | interp_results_gdf['da_dasy_interp_%_error'] = round(percent_error(da_dasy_est, expected), 1) |
||
− | |||
− | # Other statistics - create DataFrame of statistics |
||
− | interp_stats_df = pd.DataFrame(data={'Interpolation Method': ['Area Weighted','Area Weighted', |
||
− | 'Dasymetric', 'Dasymetric'], |
||
− | 'Source Geographies': ['Census Tracts', 'Dissemination Areas', |
||
− | 'Census Tracts', 'Dissemination Areas'], |
||
− | 'RMSE': [rmse(ct_area_est, expected), |
||
− | rmse(da_area_est, expected), |
||
− | rmse(ct_dasy_est, expected), |
||
− | rmse(da_dasy_est, expected)], |
||
− | 'NRMSE': [nrmse(ct_area_est, expected), |
||
− | nrmse(da_area_est, expected), |
||
− | nrmse(ct_dasy_est, expected), |
||
− | nrmse(da_dasy_est, expected)], |
||
− | 'MBE': [mbe(ct_area_est, expected), |
||
− | mbe(da_area_est, expected), |
||
− | mbe(ct_dasy_est, expected), |
||
− | mbe(da_dasy_est, expected)], |
||
− | 'MAE': [mae(ct_area_est, expected), |
||
− | mae(da_area_est, expected), |
||
− | mae(ct_dasy_est, expected), |
||
− | mae(da_dasy_est, expected)], |
||
− | 'r': [r_value(ct_area_est, expected), |
||
− | r_value(da_area_est, expected), |
||
− | r_value(ct_dasy_est, expected), |
||
− | r_value(da_dasy_est, expected)], |
||
− | 'r2': [r_value(ct_area_est, expected)**2, |
||
− | r_value(da_area_est, expected)**2, |
||
− | r_value(ct_dasy_est, expected)**2, |
||
− | r_value(da_dasy_est, expected)**2]}).round(decimals=2) |
||
− | |||
− | interp_stats_df</source> |
||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Plot the Percent Error of Both Methods using a Plotly Facet Map |
||
− | # --------------------------------------------------------------- |
||
− | |||
− | # Convert interp_results_gdf from GeoDataFrame to GeoJson |
||
− | geojson = interp_results_gdf.to_crs(epsg='4326').to_json() |
||
− | geojson = json.loads(geojson) |
||
− | |||
− | # Reformat the interp_results_gdf for the plot |
||
− | df = pd.DataFrame(interp_results_gdf.drop(columns='geometry')) |
||
− | |||
− | df = df.rename(columns={'ct_area_interp_%_error':'Source: Census Tracts <br> Method: Area Weighted', |
||
− | 'ct_dasy_interp_%_error':'Source: Census Tracts <br> Method: Dasymetric', |
||
− | 'da_area_interp_%_error':'Source: Dissemination Areas <br> Method: Area Weighted', |
||
− | 'da_dasy_interp_%_error':'Source: Dissemination Areas <br> Method: Dasymetric'}) |
||
− | |||
− | df = df.melt(id_vars='nbhd_name', |
||
− | value_vars=['Source: Census Tracts <br> Method: Area Weighted', |
||
− | 'Source: Census Tracts <br> Method: Dasymetric', |
||
− | 'Source: Dissemination Areas <br> Method: Area Weighted', |
||
− | 'Source: Dissemination Areas <br> Method: Dasymetric'], |
||
− | var_name='method', |
||
− | value_name='Error (%)') |
||
− | |||
− | # Create the Plotly Express choropleth figure |
||
− | fig = px.choropleth(data_frame=df, |
||
− | title="Areal Interpolation of Synthetic Population Data to Ottawa, ON Neighborhoods", |
||
− | locations='nbhd_name', |
||
− | geojson=geojson, |
||
− | featureidkey='properties.nbhd_name', |
||
− | color='Error (%)', |
||
− | facet_col='method', |
||
− | facet_col_wrap=2, |
||
− | hover_name='nbhd_name', |
||
− | range_color=[0,100], |
||
− | color_continuous_scale='Inferno', |
||
− | projection='mercator', |
||
− | fitbounds="locations", |
||
− | height=800, width=800) |
||
− | |||
− | fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1])) |
||
− | fig.show()</source> |
||
− | |||
− | </div> |
||
− | <div class="cell code"> |
||
− | |||
− | <source lang="python"># Create 1:1 Facet Plots using Plotly |
||
− | # ----------------------------------- |
||
− | |||
− | # Convert the interpolation results to a DataFrame |
||
− | df = pd.DataFrame(data=interp_results_gdf.drop(columns='geometry')) |
||
− | |||
− | # Rename the results columns as they will become the facet plot labels |
||
− | df = df.rename(columns={'ct_area_interp_est':'Source: Census Tracts <br> Method: Area Weighted', |
||
− | 'ct_dasy_interp_est':'Source: Census Tracts <br> Method: Dasymetric', |
||
− | 'da_area_interp_est':'Source: Dissemination Areas <br> Method: Area Weighted', |
||
− | 'da_dasy_interp_est':'Source: Dissemination Areas <br> Method: Dasymetric'}) |
||
− | |||
− | # Combine all the results columns into one column |
||
− | df = df.melt(id_vars=['nbhd_name', 'nbhd_synth_pop'], |
||
− | value_vars=['Source: Census Tracts <br> Method: Area Weighted', |
||
− | 'Source: Census Tracts <br> Method: Dasymetric', |
||
− | 'Source: Dissemination Areas <br> Method: Area Weighted', |
||
− | 'Source: Dissemination Areas <br> Method: Dasymetric'], |
||
− | var_name='method', |
||
− | value_name='interp_pop_est') |
||
− | |||
− | # Add a percent error column |
||
− | estimated = df['interp_pop_est'] |
||
− | expected = df['nbhd_synth_pop'] |
||
− | df['%_error'] = round(percent_error(estimated, expected), 1) |
||
− | |||
− | # Create the Plotly Express Scatter figure |
||
− | fig = px.scatter(data_frame=df, |
||
− | title="Areal Interpolation of Synthetic Population Data to Ottawa, ON Neighborhoods", |
||
− | x='nbhd_synth_pop', |
||
− | y='interp_pop_est', |
||
− | height=800, width=800, |
||
− | color='%_error', |
||
− | facet_col='method', |
||
− | facet_col_wrap=2, |
||
− | hover_name='nbhd_name', |
||
− | labels={'interp_pop_est': 'Interpolated Population', |
||
− | 'nbhd_synth_pop': ' Estimated Population'}, |
||
− | color_continuous_scale='Inferno', |
||
− | range_color=[0,100]) |
||
− | |||
− | # Create the 1:1 line |
||
− | line_max = max([max(df['interp_pop_est']), max(df['nbhd_synth_pop'])]) |
||
− | line_min = min([min(df['interp_pop_est']), min(df['nbhd_synth_pop'])]) |
||
− | |||
− | fig.update_layout(shapes=[ |
||
− | dict(type='line', xref='x', yref='y', |
||
− | x0=line_min, y0=line_min, x1=line_max, y1=line_max, line_width=1), |
||
− | dict(type='line', xref='x2', yref='y2', |
||
− | x0=line_min, y0=line_min, x1=line_max, y1=line_max, line_width=1), |
||
− | dict(type='line', xref='x3', yref='y3', |
||
− | x0=line_min, y0=line_min, x1=line_max, y1=line_max, line_width=1), |
||
− | dict(type='line', xref='x4', yref='y4', |
||
− | x0=line_min, y0=line_min, x1=line_max, y1=line_max, line_width=1)]) |
||
− | |||
− | fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1])) |
||
− | |||
− | fig.show()</source> |
||
− | |||
− | </div> |