Parking infractions continue to be moments of misfortune amongst Toronto law enforcement, residents and visitors. With parking tickets ranging from approx \$30 to \\$450, this notebook aims to provide an in depth analysis of ticket issuance type, frequency, and locations; attempts to uncover the relationship and patterns between these contributing factors.
Approximately 2.8 million parking tickets are issued annually across the City of Toronto. This dataset contains non-identifiable information relating to each parking ticket issued for each calendar year. The tickets are issued by Toronto Police Services (TPS) personnel as well as persons certified and authorized to issue tickets by TPS.
This data set contains complete records only. Incomplete records in the City database are not included in the data set. Incomplete records may exist due to a variety of reasons e.g. the vehicle registration is out-of-province, tickets paid prior to staff entering the ticket data, etc.The volume of incomplete records relative to the overall volume is low and therefore presents insignificant impact to trend analysis.
Current data time span: 1 year (2018)
Data source:
https://open.toronto.ca/dataset/parking-tickets/
https://open.toronto.ca/dataset/neighbourhoods/
# Import libraries
import pandas as pd
import requests
import numpy as np
import json
from shapely.geometry import Point, Polygon
import shapefile
# from geopy.geocoders import Nominatim
import plotly.express as px
import plotly.io as pio
#pio.renderers.default = 'chrome'
pio.renderers.default = 'notebook_connected'
import plotly.graph_objects as go
import plotly.figure_factory as ff
# Get the dataset metadata by passing package_id to the package_search endpoint
# For example, to retrieve the metadata for this dataset:
url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/package_show"
params = { "id": "8c233bc2-1879-44ff-a0e4-9b69a9032c54"}
package = requests.get(url, params = params).json()
#print(package['result']['resources'])
# Import and prepare data
#data = pd.read_csv('data/parking-tickets-2018/Parking_Tags_Data_2018_1.csv')
data = pd.concat(map(pd.read_csv, ['data/parking-tickets-2018/Parking_Tags_Data_2018_1.csv',
'data/parking-tickets-2018/Parking_Tags_Data_2018_2.csv',
'data/parking-tickets-2018/Parking_Tags_Data_2018_3.csv']))
df = pd.DataFrame(data)
print('Number of rows and cols: ', str(df.shape))
df.head(2)
# check for NaN values
print('Total number of rows and cols: ', str(df.shape))
print(df.isna().sum(axis = 0))
# check if there are records from other provinces
print(df['province'].unique())
print(df.groupby(['province']).tag_number_masked.count().nlargest(5))
After checking the addresses shown in province outside of Ontario, these addresses do exist in Toronto. The provinces shown other than Ontario are likely an encoding or system error.
# check for Max time value
print('Max time_of_infraction: ' + str(df['time_of_infraction'].max()))
# filter out the time stamps that are >24h, which is likely due to error
df = df[df['time_of_infraction'] < 2400]
print('Number of rows and cols: ', str(df.shape))
df['time_of_infraction'] = df['time_of_infraction'].astype(int).astype(str)
df['len_of_time'] = df['time_of_infraction'].str.len()
# assign hour of the day
df['hr_of_day'] = (
np.where(
df['len_of_time'] == 3, df['time_of_infraction'].str[0],
np.where(
df['len_of_time'] == 4, df['time_of_infraction'].str[:2],
0)))
df = df.drop(columns=['len_of_time'])
df.tail()
Since the neighbourhood name is not given by the original dataset, we can retrieve the neighbourhood using addressing. By converting the addresses into Geocoordinates and using a Geojson file of Toronto, we would be able to find out which area a set of coordinates falls into.
Initially python package Geopy and Google Geocoding API were considered for this processes, but after research and testing, given the large amount of addresses to processes, either method would be feasible due to the API daily limits.
Instead, a local processes was employed here by using Toronto One Address Repository dataset which provides a point representation for over 500,000 addresses within the City of Toronto in conjunction with Toronto Geojson file. The result set shows the Neightbourhood information for all the addresses in Toronto. The result set is then used to retrieve the Neightbourhood for the infraction locations.
The pre-processing of Toronto One Address Repositary is done in this notebook:
toronto_addresses_shapefile_processing.ipynb
# Use pre-processed geo data file based on Toronto One Address
toronto_addresses = pd.read_csv('geo_data/geodata_toronto_addresses_areas.csv')
toronto_addresses_df = pd.DataFrame(toronto_addresses)
# Copy the original infraction df
ticket_df = df.copy()
# Table join to get the long_lat info
ticket_df = pd.merge(ticket_df,
toronto_addresses_df[['full_address', '_id', 'area_name']],
left_on='location2',
right_on='full_address',
how='left')
print('Before dropping NaN addresses: ' + str(ticket_df.shape))
# Drop the rows without area_name
ticket_df = ticket_df[ticket_df['area_name'].notna()]
# Drop location columns, keep col full_address
ticket_df = ticket_df.drop(columns=['location1', 'location2', 'location3', 'location4'])
print('After dropping NaN addresses: ' + str(ticket_df.shape))
12.89% of the data was dropped in the process, due to the addresses given in Infraction dataset does not exist in the Toronto One Address Repository, eg. if the address was an intersection.
The new ticket_df now has the Neighbourhood id and name, which will be used for data visualization later.
ticket_df.head(2)
# Extract year, month, date in their own column
ticket_df['infraction_yr'] = ticket_df.date_of_infraction.astype(str).str[:4]
ticket_df['infraction_mth'] = ticket_df.date_of_infraction.astype(str).str[4:6]
ticket_df['infraction_date'] = ticket_df.date_of_infraction.astype(str).str[6:8]
# Get day of the week
ticket_df['infraction_ymd'] = ticket_df['infraction_yr'].map(str) + '-' + ticket_df['infraction_mth'].map(str) + '-' + ticket_df['infraction_date'].map(str)
ticket_df['infraction_ymd'] = pd.to_datetime(ticket_df['infraction_ymd'])
ticket_df['day_of_week'] = ticket_df['infraction_ymd'].dt.day_name()
ticket_df.tail(2)
# Distribution of ticket amount
ticket_amt_df = pd.DataFrame(ticket_df.groupby(['set_fine_amount', 'infraction_description']).tag_number_masked.count())
ticket_amt_df = ticket_amt_df.reset_index()
ticket_amt_df = ticket_amt_df.rename(columns={'tag_number_masked': 'number_of_tickets'})
ticket_amt_df.head(2)
fig_distribution = px.histogram(ticket_amt_df,
x='set_fine_amount',
y='number_of_tickets',
nbins=50,
marginal='box')
fig_distribution.update_layout(title_text='Distribution of Ticket Amount')
fig_distribution.show()
# Total Ticket Amount
ticket_df['set_fine_amount'].sum()
# Total number of Tickets
ticket_amt_df['number_of_tickets'].sum()
# Daily infraction trend
daily_infraction_df = pd.DataFrame(ticket_df.groupby(['infraction_ymd', 'infraction_yr','infraction_mth', 'infraction_date', 'day_of_week']).tag_number_masked.count())
daily_infraction_df = daily_infraction_df.reset_index()
daily_infraction_df = daily_infraction_df.rename(columns={'tag_number_masked': 'number_of_tickets'})
daily_infraction_df.head(2)
# Plot daily infractions trend
fig_daily_trend = px.line(daily_infraction_df,
x='infraction_ymd',
y='number_of_tickets',
hover_data=['infraction_ymd', 'day_of_week'],
title='Daily Infraction Trend')
fig_daily_trend.update_xaxes(rangeslider_visible=True,
tickformat="%b\n%Y",
dtick="M1")
fig_daily_trend.show()
From the trend line plot above, we can observe that significant lower number of tickets are issued on the weekends, especially on Sundays; And peak ticket days are during the week.
This trend is confirmed by the distribution chart below.
# Plot Distribution of Tickets by Week day
fig_day_dist = px.histogram(daily_infraction_df,
x='day_of_week',
y='number_of_tickets')
fig_day_dist.update_layout(title_text='Distribution of Tickets by Week Day')
fig_day_dist.show()
# plot infraction volume by month
fig2 = px.bar(daily_infraction_df,
x='infraction_mth',
y='number_of_tickets',
title='Number of Infractions Per Month')
fig2.show()
# Display daily trend as heatmap
fig_heatmap = go.Figure(data=go.Heatmap(
z=daily_infraction_df['number_of_tickets'],
x=daily_infraction_df['infraction_date'],
y=daily_infraction_df['infraction_mth'],
colorscale='Mint'
))
fig_heatmap.update_xaxes(title_text = 'Date')
fig_heatmap.update_yaxes(title_text = 'Month')
fig_heatmap.update_layout(title_text='Number of Tickets Issued Per Day')
fig_heatmap.show()
There is no significant trend in number of tickets issued at the start, mid and end of the month.
# Highest to the lowest Total Fine Amount by type
fine_type_df = pd.DataFrame(ticket_df.groupby(['infraction_yr','infraction_description', 'set_fine_amount']).tag_number_masked.count()).reset_index()
fine_type_df['total_fine_amt'] = fine_type_df['set_fine_amount'] * fine_type_df['tag_number_masked']
#print(fine_type_df.sort_values(by=['total_fine_amt', 'tag_number_masked'], ascending=False))
# Plot highest to lowest single ticket amount by type
fig3 = go.Figure()
fig3.add_trace(go.Bar(x=fine_type_df['infraction_description'],
y=fine_type_df['set_fine_amount'])
)
fig3.update_layout(xaxis=dict({'categoryorder':'total descending'},
tickfont=dict(size=8)),
title_text='Highest to Lowest Single Ticket Amount by Type',
yaxis_title='Single Ticket Amount',
)
fig3.show()
# Highest to lowest total fine amount and frequency
fig4 = go.Figure()
fig4.add_trace(go.Bar(x=fine_type_df['infraction_description'],
y=fine_type_df['tag_number_masked'],
name='Number of Tickets Issued'
)
)
fig4.add_trace(go.Scatter(x=fine_type_df['infraction_description'],
y=fine_type_df['total_fine_amt'],
yaxis='y2',
mode="markers",
name='Total Fine Amount')
)
fig4.update_layout(xaxis=dict({'categoryorder':'total descending'},
tickfont=dict(size=8)),
title_text='Highest to Lowest Total Fine Amount and Frequency in Log Scale',
yaxis_title='Number of Tickets Issued',
yaxis=dict(type='log'),
yaxis2=dict(anchor='x',
overlaying='y',
side='right',
type='log',
title='Total Fine Amount'),
legend=dict(orientation='h',
y=1.1),
height=600
)
fig4.show()
# Load in the geojson data
toronto_areas = json.load(open('geo_data/Neighbourhoods_geojson.json'))
# Number of infractions in each area
num_infr_area_df = pd.DataFrame(ticket_df.groupby(['_id', 'area_name']).tag_number_masked.count()).reset_index()
num_infr_area_df['_id'] = num_infr_area_df['_id'].astype('int64')
num_infr_area_df.head(2)
# Number of infractions per area per month
num_infr_area_mthly_df = pd.DataFrame(ticket_df.groupby(['_id', 'area_name', 'infraction_yr', 'infraction_mth']).tag_number_masked.count()).reset_index()
num_infr_area_mthly_df['_id'] = num_infr_area_df['_id'].astype('int64')
num_infr_area_mthly_df.head(2)
# Plot Toronto in Choropleth map
# To solve the empty map issue, need to use 'properties._id', https://community.plotly.com/t/choropleth-map-error-empty-map/39521/6
fig_map = go.Figure(data=go.Choropleth(
geojson=toronto_areas,
locations=num_infr_area_df['_id'],
z=num_infr_area_df['tag_number_masked'],
colorscale='Reds',
marker_line_color='white',
colorbar_title='# of Tickets',
featureidkey= 'properties._id',
text=num_infr_area_df['area_name'],
hovertemplate= '<b>%{text}</b><br><br>' + '# of Tickets: %{z}',
hoverinfo='text'
))
fig_map.update_layout(
title_text='Toronto Area Parking Infraction Density',
geo = dict(
fitbounds='locations',
visible=False
)
)
fig_map.show()
# Number of infractions per area per month
num_infr_area_mthly_df = pd.DataFrame(ticket_df.groupby(['_id', 'area_name', 'infraction_yr', 'infraction_mth']).tag_number_masked.count()).reset_index()
num_infr_area_mthly_df['_id'] = num_infr_area_df['_id'].astype('int64')
num_infr_area_mthly_df.head(2)
fig_mthly = px.area(num_infr_area_mthly_df,
x='infraction_mth',
y='tag_number_masked',
color='area_name',
labels={'tag_number_masked': 'Number of tickets issued',
'infraction_mth': 'Month'})
fig_mthly.update_layout(title_text='Tickets Issued Per Month')
fig_mthly.show()
Downtown areas such as Waterfront communities, Bay Street Corridor, Church-Yonge Corridor and Kensington-Chinatown consistantly have the most tickets issued on montly basis.
# Time of day by Area
time_of_day_df = pd.DataFrame(ticket_df.groupby(['area_name', 'hr_of_day']).tag_number_masked.count()).reset_index()
time_of_day_df.tail(2)
def plot_time_trend(df, color, title):
fig = px.scatter(df,
x='hr_of_day',
y='tag_number_masked',
color=color,
title=title)
fig.update_layout(xaxis_title='Hour of Day')
fig.update_layout(yaxis_title='Number of Tickets')
fig.show()
plot_time_trend(time_of_day_df, 'area_name', 'Time of Day Trend by Area')
# Time of day by Ticket Type
time_of_day_type_df = pd.DataFrame(ticket_df.groupby(['infraction_description', 'hr_of_day']).tag_number_masked.count()).reset_index()
plot_time_trend(time_of_day_type_df, 'infraction_description', 'Time of Day Trend by Ticket Type')
# time of day and day of the week
day_time_type_df = pd.DataFrame(ticket_df.groupby(['infraction_description', 'day_of_week', 'hr_of_day']).tag_number_masked.count()).reset_index()
day_time_type_df.head(2)
def plot_day_time():
day_of_wk = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
fig = px.scatter_3d(day_time_type_df,
x='day_of_week',
y='hr_of_day',
z='tag_number_masked',
color='infraction_description',
category_orders={'day_of_week': day_of_wk[::-1]}
)
fig.update_layout(width=1000,
height=800,
title='Weekday Trend by Hour and Ticket Type')
fig.show()
plot_day_time()