Toronto Parking Tickets Analysis

Introduction

Parking infractions continue to be moments of misfortune amongst Toronto law enforcement, residents and visitors. With parking tickets ranging from approx \$30 to \\$450, this notebook aims to provide an in depth analysis of ticket issuance type, frequency, and locations; attempts to uncover the relationship and patterns between these contributing factors.

Dataset Description

Approximately 2.8 million parking tickets are issued annually across the City of Toronto. This dataset contains non-identifiable information relating to each parking ticket issued for each calendar year. The tickets are issued by Toronto Police Services (TPS) personnel as well as persons certified and authorized to issue tickets by TPS.

This data set contains complete records only. Incomplete records in the City database are not included in the data set. Incomplete records may exist due to a variety of reasons e.g. the vehicle registration is out-of-province, tickets paid prior to staff entering the ticket data, etc.The volume of incomplete records relative to the overall volume is low and therefore presents insignificant impact to trend analysis.

Current data time span: 1 year (2018)

Data source:
https://open.toronto.ca/dataset/parking-tickets/
https://open.toronto.ca/dataset/neighbourhoods/

In [415]:
# Import libraries
import pandas as pd
import requests
import numpy as np
import json
from shapely.geometry import Point, Polygon
import shapefile
# from geopy.geocoders import Nominatim

import plotly.express as px
import plotly.io as pio
#pio.renderers.default = 'chrome'
pio.renderers.default = 'notebook_connected'
import plotly.graph_objects as go
import plotly.figure_factory as ff
In [416]:
# Get the dataset metadata by passing package_id to the package_search endpoint
# For example, to retrieve the metadata for this dataset:

url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/package_show"
params = { "id": "8c233bc2-1879-44ff-a0e4-9b69a9032c54"}
package = requests.get(url, params = params).json()
#print(package['result']['resources'])

Data Import and Processing

In [417]:
# Import and prepare data

#data = pd.read_csv('data/parking-tickets-2018/Parking_Tags_Data_2018_1.csv')
data = pd.concat(map(pd.read_csv, ['data/parking-tickets-2018/Parking_Tags_Data_2018_1.csv',
                                   'data/parking-tickets-2018/Parking_Tags_Data_2018_2.csv',
                                   'data/parking-tickets-2018/Parking_Tags_Data_2018_3.csv']))
df = pd.DataFrame(data)
print('Number of rows and cols: ', str(df.shape))
Number of rows and cols:  (2044288, 11)
In [418]:
df.head(2)
Out[418]:
tag_number_masked date_of_infraction infraction_code infraction_description set_fine_amount time_of_infraction location1 location2 location3 location4 province
0 ***92517 20180101 16 PARK-WITHIN 9M INTERSECT ROAD 50 0.0 S/S PRYOR AVE E/O CLOVERDALE RD ON
1 ***71708 20180101 29 PARK PROHIBITED TIME NO PERMIT 30 2.0 NR 266 DOVERCOURT RD NaN NaN ON
In [419]:
# check for NaN values
print('Total number of rows and cols: ', str(df.shape))
print(df.isna().sum(axis = 0))
Total number of rows and cols:  (2044288, 11)
tag_number_masked               0
date_of_infraction              0
infraction_code                 0
infraction_description          0
set_fine_amount                 0
time_of_infraction           1532
location1                  185747
location2                     173
location3                 1903837
location4                 1903301
province                        0
dtype: int64
In [420]:
# check if there are records from other provinces 
print(df['province'].unique())
print(df.groupby(['province']).tag_number_masked.count().nlargest(5))
['ON' 'QC' 'MI' 'AZ' 'VT' 'AB' 'NM' 'NY' 'WV' 'NJ' 'PA' 'BC' 'PQ' 'IL'
 'OH' 'NS' 'SK' 'XX' 'FL' 'MB' 'MN' 'VA' 'TX' 'OR' 'MD' 'NB' 'GA' 'CA'
 'MA' 'NC' 'CO' 'PE' 'IN' 'NF' 'IA' 'CT' 'WI' 'WA' 'NH' 'RI' 'AL' 'MO'
 'KY' 'LA' 'ME' 'OK' 'AR' 'SD' 'NV' 'TN' 'SC' 'NT' 'NE' 'MS' 'DC' 'ID'
 'MT' 'KS' 'NU' 'UT' 'YT' 'DE' 'GO' 'AK' 'ND' 'HI' 'WY']
province
ON    1972818
QC      23213
AB       6689
NY       5114
AZ       4016
Name: tag_number_masked, dtype: int64
  • Looks like location2 has the physical address.
  • When physical address is occationally not available, the combination of location2 and location3 gives the main intersection.
  • There are records occurred outsides of Ontario shown in province.

After checking the addresses shown in province outside of Ontario, these addresses do exist in Toronto. The provinces shown other than Ontario are likely an encoding or system error.

In [421]:
# check for Max time value
print('Max time_of_infraction: ' + str(df['time_of_infraction'].max()))

# filter out the time stamps that are >24h, which is likely due to error
df = df[df['time_of_infraction'] < 2400]
print('Number of rows and cols: ', str(df.shape))
Max time_of_infraction: 3458.0
Number of rows and cols:  (2042739, 11)
In [422]:
df['time_of_infraction'] = df['time_of_infraction'].astype(int).astype(str)
In [423]:
df['len_of_time'] = df['time_of_infraction'].str.len()
In [424]:
# assign hour of the day 
df['hr_of_day'] = (
    np.where(
        df['len_of_time'] == 3, df['time_of_infraction'].str[0],
        np.where(
            df['len_of_time'] == 4, df['time_of_infraction'].str[:2],
            0)))
In [425]:
df = df.drop(columns=['len_of_time'])
df.tail()
Out[425]:
tag_number_masked date_of_infraction infraction_code infraction_description set_fine_amount time_of_infraction location1 location2 location3 location4 province hr_of_day
544283 ***00696 20181231 5 PARK-SIGNED HWY-PROHIBIT DY/TM 50 2357 NR 282 GILLARD AVE NaN NaN ON 23
544284 ***43620 20181231 5 PARK-SIGNED HWY-PROHIBIT DY/TM 50 2357 NR 1404 DANFORTH AVE NaN NaN ON 23
544285 ***29024 20181231 3 PARK ON PRIVATE PROPERTY 30 2358 AT 2110 KEELE ST NaN NaN ON 23
544286 ***28387 20181231 9 STOP-SIGNED HWY-PROHIBIT TM/DY 60 2359 NR 40 SPRING GARDEN AVE NaN NaN ON 23
544287 ***43621 20181231 5 PARK-SIGNED HWY-PROHIBIT DY/TM 50 2359 NR 1316 DANFORTH AVE NaN NaN ON 23

Processing Geodata

Since the neighbourhood name is not given by the original dataset, we can retrieve the neighbourhood using addressing. By converting the addresses into Geocoordinates and using a Geojson file of Toronto, we would be able to find out which area a set of coordinates falls into.

Initially python package Geopy and Google Geocoding API were considered for this processes, but after research and testing, given the large amount of addresses to processes, either method would be feasible due to the API daily limits.

Instead, a local processes was employed here by using Toronto One Address Repository dataset which provides a point representation for over 500,000 addresses within the City of Toronto in conjunction with Toronto Geojson file. The result set shows the Neightbourhood information for all the addresses in Toronto. The result set is then used to retrieve the Neightbourhood for the infraction locations.

The pre-processing of Toronto One Address Repositary is done in this notebook:
toronto_addresses_shapefile_processing.ipynb

In [426]:
# Use pre-processed geo data file based on Toronto One Address
toronto_addresses = pd.read_csv('geo_data/geodata_toronto_addresses_areas.csv')
toronto_addresses_df = pd.DataFrame(toronto_addresses)

# Copy the original infraction df 
ticket_df = df.copy()
/usr/local/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3063: DtypeWarning:

Columns (8) have mixed types.Specify dtype option on import or set low_memory=False.

In [427]:
# Table join to get the long_lat info
ticket_df = pd.merge(ticket_df, 
                   toronto_addresses_df[['full_address', '_id', 'area_name']], 
                   left_on='location2',
                   right_on='full_address',
                   how='left')

print('Before dropping NaN addresses: ' + str(ticket_df.shape))
Before dropping NaN addresses: (2138073, 15)
In [428]:
# Drop the rows without area_name
ticket_df = ticket_df[ticket_df['area_name'].notna()]

# Drop location columns, keep col full_address 
ticket_df = ticket_df.drop(columns=['location1', 'location2', 'location3', 'location4'])

print('After dropping NaN addresses: ' + str(ticket_df.shape))
After dropping NaN addresses: (1862814, 11)

12.89% of the data was dropped in the process, due to the addresses given in Infraction dataset does not exist in the Toronto One Address Repository, eg. if the address was an intersection.

The new ticket_df now has the Neighbourhood id and name, which will be used for data visualization later.

In [429]:
ticket_df.head(2)
Out[429]:
tag_number_masked date_of_infraction infraction_code infraction_description set_fine_amount time_of_infraction province hr_of_day full_address _id area_name
1 ***71708 20180101 29 PARK PROHIBITED TIME NO PERMIT 30 2 ON 0 266 DOVERCOURT RD 8411.0 Little Portugal (84)
2 ***92311 20180101 29 PARK PROHIBITED TIME NO PERMIT 30 2 ON 0 15 FAIRBANK AVE 8490.0 Briar Hill-Belgravia (108)
In [430]:
# Extract year, month, date in their own column
ticket_df['infraction_yr'] = ticket_df.date_of_infraction.astype(str).str[:4]
ticket_df['infraction_mth'] = ticket_df.date_of_infraction.astype(str).str[4:6]
ticket_df['infraction_date'] = ticket_df.date_of_infraction.astype(str).str[6:8]

# Get day of the week
ticket_df['infraction_ymd'] = ticket_df['infraction_yr'].map(str) + '-' + ticket_df['infraction_mth'].map(str) + '-' + ticket_df['infraction_date'].map(str)
ticket_df['infraction_ymd'] = pd.to_datetime(ticket_df['infraction_ymd'])
ticket_df['day_of_week'] = ticket_df['infraction_ymd'].dt.day_name()

ticket_df.tail(2)
Out[430]:
tag_number_masked date_of_infraction infraction_code infraction_description set_fine_amount time_of_infraction province hr_of_day full_address _id area_name infraction_yr infraction_mth infraction_date infraction_ymd day_of_week
2138070 ***29024 20181231 3 PARK ON PRIVATE PROPERTY 30 2358 ON 23 2110 KEELE ST 8493.0 Brookhaven-Amesbury (30) 2018 12 31 2018-12-31 Monday
2138072 ***43621 20181231 5 PARK-SIGNED HWY-PROHIBIT DY/TM 50 2359 ON 23 1316 DANFORTH AVE 8503.0 Danforth (66) 2018 12 31 2018-12-31 Monday

Data Exploration

In [431]:
# Distribution of ticket amount
ticket_amt_df = pd.DataFrame(ticket_df.groupby(['set_fine_amount', 'infraction_description']).tag_number_masked.count())
ticket_amt_df = ticket_amt_df.reset_index()
ticket_amt_df = ticket_amt_df.rename(columns={'tag_number_masked': 'number_of_tickets'})
ticket_amt_df.head(2)
Out[431]:
set_fine_amount infraction_description number_of_tickets
0 0 FAIL TO PARK/STOP PARALLEL TO 1
1 0 LEAVE IN PARK NOT DESIG AREA 1
In [432]:
fig_distribution = px.histogram(ticket_amt_df, 
                                x='set_fine_amount', 
                                y='number_of_tickets',
                                nbins=50,
                                marginal='box')

fig_distribution.update_layout(title_text='Distribution of Ticket Amount')
fig_distribution.show()
In [433]:
# Total Ticket Amount
ticket_df['set_fine_amount'].sum()
Out[433]:
91781490
In [434]:
# Total number of Tickets 
ticket_amt_df['number_of_tickets'].sum()
Out[434]:
1862814
In [435]:
# Daily infraction trend
daily_infraction_df = pd.DataFrame(ticket_df.groupby(['infraction_ymd', 'infraction_yr','infraction_mth', 'infraction_date', 'day_of_week']).tag_number_masked.count())
daily_infraction_df = daily_infraction_df.reset_index()
daily_infraction_df = daily_infraction_df.rename(columns={'tag_number_masked': 'number_of_tickets'})

daily_infraction_df.head(2)
Out[435]:
infraction_ymd infraction_yr infraction_mth infraction_date day_of_week number_of_tickets
0 2018-01-01 2018 01 01 Monday 1112
1 2018-01-02 2018 01 02 Tuesday 5008
In [436]:
# Plot daily infractions trend 
fig_daily_trend = px.line(daily_infraction_df, 
                          x='infraction_ymd', 
                          y='number_of_tickets', 
                          hover_data=['infraction_ymd', 'day_of_week'],
                          title='Daily Infraction Trend')

fig_daily_trend.update_xaxes(rangeslider_visible=True,
                            tickformat="%b\n%Y",
                            dtick="M1")
fig_daily_trend.show()

From the trend line plot above, we can observe that significant lower number of tickets are issued on the weekends, especially on Sundays; And peak ticket days are during the week.

This trend is confirmed by the distribution chart below.

In [437]:
# Plot Distribution of Tickets by Week day
fig_day_dist = px.histogram(daily_infraction_df, 
                            x='day_of_week', 
                            y='number_of_tickets')

fig_day_dist.update_layout(title_text='Distribution of Tickets by Week Day')
fig_day_dist.show()
In [438]:
# plot infraction volume by month
fig2 = px.bar(daily_infraction_df, 
              x='infraction_mth', 
              y='number_of_tickets',
              title='Number of Infractions Per Month') 
fig2.show()
In [439]:
# Display daily trend as heatmap

fig_heatmap = go.Figure(data=go.Heatmap(
                        z=daily_infraction_df['number_of_tickets'],
                        x=daily_infraction_df['infraction_date'], 
                        y=daily_infraction_df['infraction_mth'],
                        colorscale='Mint'
                        ))

fig_heatmap.update_xaxes(title_text = 'Date')
fig_heatmap.update_yaxes(title_text = 'Month')
fig_heatmap.update_layout(title_text='Number of Tickets Issued Per Day')

fig_heatmap.show()

There is no significant trend in number of tickets issued at the start, mid and end of the month.

In [440]:
# Highest to the lowest Total Fine Amount by type

fine_type_df = pd.DataFrame(ticket_df.groupby(['infraction_yr','infraction_description', 'set_fine_amount']).tag_number_masked.count()).reset_index()
fine_type_df['total_fine_amt'] = fine_type_df['set_fine_amount'] * fine_type_df['tag_number_masked']

#print(fine_type_df.sort_values(by=['total_fine_amt', 'tag_number_masked'], ascending=False))
In [441]:
# Plot highest to lowest single ticket amount by type 

fig3 = go.Figure()

fig3.add_trace(go.Bar(x=fine_type_df['infraction_description'], 
                      y=fine_type_df['set_fine_amount'])
                ) 

fig3.update_layout(xaxis=dict({'categoryorder':'total descending'},
                              tickfont=dict(size=8)), 
                   title_text='Highest to Lowest Single Ticket Amount by Type',
                   yaxis_title='Single Ticket Amount',
                  )
                  
fig3.show()
In [442]:
# Highest to lowest total fine amount and frequency

fig4 = go.Figure()

fig4.add_trace(go.Bar(x=fine_type_df['infraction_description'], 
                      y=fine_type_df['tag_number_masked'],
                      name='Number of Tickets Issued'
                     )
                ) 

fig4.add_trace(go.Scatter(x=fine_type_df['infraction_description'], 
                          y=fine_type_df['total_fine_amt'],
                          yaxis='y2',
                          mode="markers",
                          name='Total Fine Amount')
              )


fig4.update_layout(xaxis=dict({'categoryorder':'total descending'},
                              tickfont=dict(size=8)), 
                   title_text='Highest to Lowest Total Fine Amount and Frequency in Log Scale',
                   yaxis_title='Number of Tickets Issued',
                   yaxis=dict(type='log'),
                   yaxis2=dict(anchor='x',
                              overlaying='y',
                              side='right',
                              type='log',
                              title='Total Fine Amount'),
                   legend=dict(orientation='h',
                               y=1.1),
                   height=600
                  )
fig4.show()

Which areas of Toronto do infractions occur?

In [443]:
# Load in the geojson data 
toronto_areas = json.load(open('geo_data/Neighbourhoods_geojson.json'))
In [444]:
# Number of infractions in each area
num_infr_area_df = pd.DataFrame(ticket_df.groupby(['_id', 'area_name']).tag_number_masked.count()).reset_index()
num_infr_area_df['_id'] = num_infr_area_df['_id'].astype('int64')
num_infr_area_df.head(2)
Out[444]:
_id area_name tag_number_masked
0 8401 Wychwood (94) 13619
1 8402 Yonge-Eglinton (100) 22644
In [445]:
# Number of infractions per area per month
num_infr_area_mthly_df = pd.DataFrame(ticket_df.groupby(['_id', 'area_name', 'infraction_yr', 'infraction_mth']).tag_number_masked.count()).reset_index()
num_infr_area_mthly_df['_id'] = num_infr_area_df['_id'].astype('int64')
num_infr_area_mthly_df.head(2)
Out[445]:
_id area_name infraction_yr infraction_mth tag_number_masked
0 8401.0 Wychwood (94) 2018 01 1118
1 8402.0 Wychwood (94) 2018 02 834
In [446]:
# Plot Toronto in Choropleth map
# To solve the empty map issue, need to use 'properties._id', https://community.plotly.com/t/choropleth-map-error-empty-map/39521/6


fig_map = go.Figure(data=go.Choropleth(
    geojson=toronto_areas, 
    locations=num_infr_area_df['_id'], 
    z=num_infr_area_df['tag_number_masked'],
    colorscale='Reds',
    marker_line_color='white',
    colorbar_title='# of Tickets',
    featureidkey= 'properties._id',
    text=num_infr_area_df['area_name'],
    hovertemplate= '<b>%{text}</b><br><br>' + '# of Tickets: %{z}',
    hoverinfo='text'
))

fig_map.update_layout(
    title_text='Toronto Area Parking Infraction Density',
    geo = dict(
        fitbounds='locations',
        visible=False
    )
)

fig_map.show()
In [447]:
# Number of infractions per area per month
num_infr_area_mthly_df = pd.DataFrame(ticket_df.groupby(['_id', 'area_name', 'infraction_yr', 'infraction_mth']).tag_number_masked.count()).reset_index()
num_infr_area_mthly_df['_id'] = num_infr_area_df['_id'].astype('int64')
num_infr_area_mthly_df.head(2)
Out[447]:
_id area_name infraction_yr infraction_mth tag_number_masked
0 8401.0 Wychwood (94) 2018 01 1118
1 8402.0 Wychwood (94) 2018 02 834
In [448]:
fig_mthly = px.area(num_infr_area_mthly_df, 
                    x='infraction_mth', 
                    y='tag_number_masked', 
                    color='area_name',
                    labels={'tag_number_masked': 'Number of tickets issued',
                           'infraction_mth': 'Month'})

fig_mthly.update_layout(title_text='Tickets Issued Per Month')
fig_mthly.show()

Downtown areas such as Waterfront communities, Bay Street Corridor, Church-Yonge Corridor and Kensington-Chinatown consistantly have the most tickets issued on montly basis.

Time of the Day Trend

In [449]:
# Time of day by Area
time_of_day_df = pd.DataFrame(ticket_df.groupby(['area_name', 'hr_of_day']).tag_number_masked.count()).reset_index()
time_of_day_df.tail(2)
Out[449]:
area_name hr_of_day tag_number_masked
3355 Yorkdale-Glen Park (31) 8 315
3356 Yorkdale-Glen Park (31) 9 421
In [450]:
def plot_time_trend(df, color, title):
    fig = px.scatter(df, 
                     x='hr_of_day', 
                     y='tag_number_masked', 
                     color=color,
                     title=title)
    fig.update_layout(xaxis_title='Hour of Day')
    fig.update_layout(yaxis_title='Number of Tickets')
    fig.show()
In [451]:
plot_time_trend(time_of_day_df, 'area_name', 'Time of Day Trend by Area')
In [452]:
# Time of day by Ticket Type
time_of_day_type_df = pd.DataFrame(ticket_df.groupby(['infraction_description', 'hr_of_day']).tag_number_masked.count()).reset_index()
plot_time_trend(time_of_day_type_df, 'infraction_description', 'Time of Day Trend by Ticket Type')
In [455]:
# time of day and day of the week
day_time_type_df = pd.DataFrame(ticket_df.groupby(['infraction_description', 'day_of_week', 'hr_of_day']).tag_number_masked.count()).reset_index()
day_time_type_df.head(2)
Out[455]:
infraction_description day_of_week hr_of_day tag_number_masked
0 FAIL ANGLE PARK/STOP AT 45 DEG Thursday 16 1
1 FAIL PARK/STP LEFT WHLS PARALL Friday 14 1
In [454]:
def plot_day_time():
    day_of_wk = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    fig = px.scatter_3d(day_time_type_df, 
                        x='day_of_week', 
                        y='hr_of_day', 
                        z='tag_number_masked', 
                        color='infraction_description',
                        category_orders={'day_of_week': day_of_wk[::-1]} 
                       )
    
    fig.update_layout(width=1000, 
                      height=800, 
                      title='Weekday Trend by Hour and Ticket Type')

    fig.show()
    
plot_day_time()
In [ ]: