Toronto Parking Tickets Analysis¶

Introduction¶

Parking infractions continue to be moments of misfortune amongst Toronto law enforcement, residents and visitors. With parking tickets ranging from approx \$30 to \\$450, this notebook aims to provide an in depth analysis of ticket issuance type, frequency, and locations; attempts to uncover the relationship and patterns between these contributing factors.

Dataset Description¶

Approximately 2.8 million parking tickets are issued annually across the City of Toronto. This dataset contains non-identifiable information relating to each parking ticket issued for each calendar year. The tickets are issued by Toronto Police Services (TPS) personnel as well as persons certified and authorized to issue tickets by TPS.

This data set contains complete records only. Incomplete records in the City database are not included in the data set. Incomplete records may exist due to a variety of reasons e.g. the vehicle registration is out-of-province, tickets paid prior to staff entering the ticket data, etc.The volume of incomplete records relative to the overall volume is low and therefore presents insignificant impact to trend analysis.

Current data time span: 1 year (2018)

Data source:
https://open.toronto.ca/dataset/parking-tickets/
https://open.toronto.ca/dataset/neighbourhoods/

# Import libraries
import pandas as pd
import requests
import numpy as np
import json
from shapely.geometry import Point, Polygon
import shapefile
# from geopy.geocoders import Nominatim

import plotly.express as px
import plotly.io as pio
#pio.renderers.default = 'chrome'
pio.renderers.default = 'notebook_connected'
import plotly.graph_objects as go
import plotly.figure_factory as ff

# Get the dataset metadata by passing package_id to the package_search endpoint
# For example, to retrieve the metadata for this dataset:

url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/package_show"
params = { "id": "8c233bc2-1879-44ff-a0e4-9b69a9032c54"}
package = requests.get(url, params = params).json()
#print(package['result']['resources'])

Data Import and Processing¶

# Import and prepare data

#data = pd.read_csv('data/parking-tickets-2018/Parking_Tags_Data_2018_1.csv')
data = pd.concat(map(pd.read_csv, ['data/parking-tickets-2018/Parking_Tags_Data_2018_1.csv',
                                   'data/parking-tickets-2018/Parking_Tags_Data_2018_2.csv',
                                   'data/parking-tickets-2018/Parking_Tags_Data_2018_3.csv']))
df = pd.DataFrame(data)
print('Number of rows and cols: ', str(df.shape))

Number of rows and cols:  (2044288, 11)

df.head(2)

# check for NaN values
print('Total number of rows and cols: ', str(df.shape))
print(df.isna().sum(axis = 0))

Total number of rows and cols:  (2044288, 11)
tag_number_masked               0
date_of_infraction              0
infraction_code                 0
infraction_description          0
set_fine_amount                 0
time_of_infraction           1532
location1                  185747
location2                     173
location3                 1903837
location4                 1903301
province                        0
dtype: int64

# check if there are records from other provinces 
print(df['province'].unique())
print(df.groupby(['province']).tag_number_masked.count().nlargest(5))

['ON' 'QC' 'MI' 'AZ' 'VT' 'AB' 'NM' 'NY' 'WV' 'NJ' 'PA' 'BC' 'PQ' 'IL'
 'OH' 'NS' 'SK' 'XX' 'FL' 'MB' 'MN' 'VA' 'TX' 'OR' 'MD' 'NB' 'GA' 'CA'
 'MA' 'NC' 'CO' 'PE' 'IN' 'NF' 'IA' 'CT' 'WI' 'WA' 'NH' 'RI' 'AL' 'MO'
 'KY' 'LA' 'ME' 'OK' 'AR' 'SD' 'NV' 'TN' 'SC' 'NT' 'NE' 'MS' 'DC' 'ID'
 'MT' 'KS' 'NU' 'UT' 'YT' 'DE' 'GO' 'AK' 'ND' 'HI' 'WY']
province
ON    1972818
QC      23213
AB       6689
NY       5114
AZ       4016
Name: tag_number_masked, dtype: int64

Looks like location2 has the physical address.
When physical address is occationally not available, the combination of location2 and location3 gives the main intersection.
There are records occurred outsides of Ontario shown in province.

After checking the addresses shown in province outside of Ontario, these addresses do exist in Toronto. The provinces shown other than Ontario are likely an encoding or system error.

# check for Max time value
print('Max time_of_infraction: ' + str(df['time_of_infraction'].max()))

# filter out the time stamps that are >24h, which is likely due to error
df = df[df['time_of_infraction'] < 2400]
print('Number of rows and cols: ', str(df.shape))

Max time_of_infraction: 3458.0
Number of rows and cols:  (2042739, 11)

df['time_of_infraction'] = df['time_of_infraction'].astype(int).astype(str)

df['len_of_time'] = df['time_of_infraction'].str.len()

# assign hour of the day 
df['hr_of_day'] = (
    np.where(
        df['len_of_time'] == 3, df['time_of_infraction'].str[0],
        np.where(
            df['len_of_time'] == 4, df['time_of_infraction'].str[:2],
            0)))

df = df.drop(columns=['len_of_time'])
df.tail()

Processing Geodata¶

Since the neighbourhood name is not given by the original dataset, we can retrieve the neighbourhood using addressing. By converting the addresses into Geocoordinates and using a Geojson file of Toronto, we would be able to find out which area a set of coordinates falls into.

Initially python package Geopy and Google Geocoding API were considered for this processes, but after research and testing, given the large amount of addresses to processes, either method would be feasible due to the API daily limits.

Instead, a local processes was employed here by using Toronto One Address Repository dataset which provides a point representation for over 500,000 addresses within the City of Toronto in conjunction with Toronto Geojson file. The result set shows the Neightbourhood information for all the addresses in Toronto. The result set is then used to retrieve the Neightbourhood for the infraction locations.

The pre-processing of Toronto One Address Repositary is done in this notebook:
toronto_addresses_shapefile_processing.ipynb

# Use pre-processed geo data file based on Toronto One Address
toronto_addresses = pd.read_csv('geo_data/geodata_toronto_addresses_areas.csv')
toronto_addresses_df = pd.DataFrame(toronto_addresses)

# Copy the original infraction df 
ticket_df = df.copy()

/usr/local/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3063: DtypeWarning:

Columns (8) have mixed types.Specify dtype option on import or set low_memory=False.

# Table join to get the long_lat info
ticket_df = pd.merge(ticket_df, 
                   toronto_addresses_df[['full_address', '_id', 'area_name']], 
                   left_on='location2',
                   right_on='full_address',
                   how='left')

print('Before dropping NaN addresses: ' + str(ticket_df.shape))

Before dropping NaN addresses: (2138073, 15)

# Drop the rows without area_name
ticket_df = ticket_df[ticket_df['area_name'].notna()]

# Drop location columns, keep col full_address 
ticket_df = ticket_df.drop(columns=['location1', 'location2', 'location3', 'location4'])

print('After dropping NaN addresses: ' + str(ticket_df.shape))

After dropping NaN addresses: (1862814, 11)

12.89% of the data was dropped in the process, due to the addresses given in Infraction dataset does not exist in the Toronto One Address Repository, eg. if the address was an intersection.

The new ticket_df now has the Neighbourhood id and name, which will be used for data visualization later.

ticket_df.head(2)

# Extract year, month, date in their own column
ticket_df['infraction_yr'] = ticket_df.date_of_infraction.astype(str).str[:4]
ticket_df['infraction_mth'] = ticket_df.date_of_infraction.astype(str).str[4:6]
ticket_df['infraction_date'] = ticket_df.date_of_infraction.astype(str).str[6:8]

# Get day of the week
ticket_df['infraction_ymd'] = ticket_df['infraction_yr'].map(str) + '-' + ticket_df['infraction_mth'].map(str) + '-' + ticket_df['infraction_date'].map(str)
ticket_df['infraction_ymd'] = pd.to_datetime(ticket_df['infraction_ymd'])
ticket_df['day_of_week'] = ticket_df['infraction_ymd'].dt.day_name()

ticket_df.tail(2)

Data Exploration¶

# Distribution of ticket amount
ticket_amt_df = pd.DataFrame(ticket_df.groupby(['set_fine_amount', 'infraction_description']).tag_number_masked.count())
ticket_amt_df = ticket_amt_df.reset_index()
ticket_amt_df = ticket_amt_df.rename(columns={'tag_number_masked': 'number_of_tickets'})
ticket_amt_df.head(2)

fig_distribution = px.histogram(ticket_amt_df, 
                                x='set_fine_amount', 
                                y='number_of_tickets',
                                nbins=50,
                                marginal='box')

fig_distribution.update_layout(title_text='Distribution of Ticket Amount')
fig_distribution.show()

# Total Ticket Amount
ticket_df['set_fine_amount'].sum()

91781490

# Total number of Tickets 
ticket_amt_df['number_of_tickets'].sum()

1862814

# Daily infraction trend
daily_infraction_df = pd.DataFrame(ticket_df.groupby(['infraction_ymd', 'infraction_yr','infraction_mth', 'infraction_date', 'day_of_week']).tag_number_masked.count())
daily_infraction_df = daily_infraction_df.reset_index()
daily_infraction_df = daily_infraction_df.rename(columns={'tag_number_masked': 'number_of_tickets'})

daily_infraction_df.head(2)

# Plot daily infractions trend 
fig_daily_trend = px.line(daily_infraction_df, 
                          x='infraction_ymd', 
                          y='number_of_tickets', 
                          hover_data=['infraction_ymd', 'day_of_week'],
                          title='Daily Infraction Trend')

fig_daily_trend.update_xaxes(rangeslider_visible=True,
                            tickformat="%b\n%Y",
                            dtick="M1")
fig_daily_trend.show()

From the trend line plot above, we can observe that significant lower number of tickets are issued on the weekends, especially on Sundays; And peak ticket days are during the week.

This trend is confirmed by the distribution chart below.

# Plot Distribution of Tickets by Week day
fig_day_dist = px.histogram(daily_infraction_df, 
                            x='day_of_week', 
                            y='number_of_tickets')

fig_day_dist.update_layout(title_text='Distribution of Tickets by Week Day')
fig_day_dist.show()

# plot infraction volume by month
fig2 = px.bar(daily_infraction_df, 
              x='infraction_mth', 
              y='number_of_tickets',
              title='Number of Infractions Per Month') 
fig2.show()

# Display daily trend as heatmap

fig_heatmap = go.Figure(data=go.Heatmap(
                        z=daily_infraction_df['number_of_tickets'],
                        x=daily_infraction_df['infraction_date'], 
                        y=daily_infraction_df['infraction_mth'],
                        colorscale='Mint'
                        ))

fig_heatmap.update_xaxes(title_text = 'Date')
fig_heatmap.update_yaxes(title_text = 'Month')
fig_heatmap.update_layout(title_text='Number of Tickets Issued Per Day')

fig_heatmap.show()

There is no significant trend in number of tickets issued at the start, mid and end of the month.

# Highest to the lowest Total Fine Amount by type

fine_type_df = pd.DataFrame(ticket_df.groupby(['infraction_yr','infraction_description', 'set_fine_amount']).tag_number_masked.count()).reset_index()
fine_type_df['total_fine_amt'] = fine_type_df['set_fine_amount'] * fine_type_df['tag_number_masked']

#print(fine_type_df.sort_values(by=['total_fine_amt', 'tag_number_masked'], ascending=False))

# Plot highest to lowest single ticket amount by type 

fig3 = go.Figure()

fig3.add_trace(go.Bar(x=fine_type_df['infraction_description'], 
                      y=fine_type_df['set_fine_amount'])
                ) 

fig3.update_layout(xaxis=dict({'categoryorder':'total descending'},
                              tickfont=dict(size=8)), 
                   title_text='Highest to Lowest Single Ticket Amount by Type',
                   yaxis_title='Single Ticket Amount',
                  )
                  
fig3.show()

# Highest to lowest total fine amount and frequency

fig4 = go.Figure()

fig4.add_trace(go.Bar(x=fine_type_df['infraction_description'], 
                      y=fine_type_df['tag_number_masked'],
                      name='Number of Tickets Issued'
                     )
                ) 

fig4.add_trace(go.Scatter(x=fine_type_df['infraction_description'], 
                          y=fine_type_df['total_fine_amt'],
                          yaxis='y2',
                          mode="markers",
                          name='Total Fine Amount')
              )


fig4.update_layout(xaxis=dict({'categoryorder':'total descending'},
                              tickfont=dict(size=8)), 
                   title_text='Highest to Lowest Total Fine Amount and Frequency in Log Scale',
                   yaxis_title='Number of Tickets Issued',
                   yaxis=dict(type='log'),
                   yaxis2=dict(anchor='x',
                              overlaying='y',
                              side='right',
                              type='log',
                              title='Total Fine Amount'),
                   legend=dict(orientation='h',
                               y=1.1),
                   height=600
                  )
fig4.show()

Which areas of Toronto do infractions occur?¶

# Load in the geojson data 
toronto_areas = json.load(open('geo_data/Neighbourhoods_geojson.json'))

# Number of infractions in each area
num_infr_area_df = pd.DataFrame(ticket_df.groupby(['_id', 'area_name']).tag_number_masked.count()).reset_index()
num_infr_area_df['_id'] = num_infr_area_df['_id'].astype('int64')
num_infr_area_df.head(2)

# Number of infractions per area per month
num_infr_area_mthly_df = pd.DataFrame(ticket_df.groupby(['_id', 'area_name', 'infraction_yr', 'infraction_mth']).tag_number_masked.count()).reset_index()
num_infr_area_mthly_df['_id'] = num_infr_area_df['_id'].astype('int64')
num_infr_area_mthly_df.head(2)

# Plot Toronto in Choropleth map
# To solve the empty map issue, need to use 'properties._id', https://community.plotly.com/t/choropleth-map-error-empty-map/39521/6


fig_map = go.Figure(data=go.Choropleth(
    geojson=toronto_areas, 
    locations=num_infr_area_df['_id'], 
    z=num_infr_area_df['tag_number_masked'],
    colorscale='Reds',
    marker_line_color='white',
    colorbar_title='# of Tickets',
    featureidkey= 'properties._id',
    text=num_infr_area_df['area_name'],
    hovertemplate= '<b>%{text}</b><br><br>' + '# of Tickets: %{z}',
    hoverinfo='text'
))

fig_map.update_layout(
    title_text='Toronto Area Parking Infraction Density',
    geo = dict(
        fitbounds='locations',
        visible=False
    )
)

fig_map.show()

# Number of infractions per area per month
num_infr_area_mthly_df = pd.DataFrame(ticket_df.groupby(['_id', 'area_name', 'infraction_yr', 'infraction_mth']).tag_number_masked.count()).reset_index()
num_infr_area_mthly_df['_id'] = num_infr_area_df['_id'].astype('int64')
num_infr_area_mthly_df.head(2)

fig_mthly = px.area(num_infr_area_mthly_df, 
                    x='infraction_mth', 
                    y='tag_number_masked', 
                    color='area_name',
                    labels={'tag_number_masked': 'Number of tickets issued',
                           'infraction_mth': 'Month'})

fig_mthly.update_layout(title_text='Tickets Issued Per Month')
fig_mthly.show()

Downtown areas such as Waterfront communities, Bay Street Corridor, Church-Yonge Corridor and Kensington-Chinatown consistantly have the most tickets issued on montly basis.

Time of the Day Trend¶

# Time of day by Area
time_of_day_df = pd.DataFrame(ticket_df.groupby(['area_name', 'hr_of_day']).tag_number_masked.count()).reset_index()
time_of_day_df.tail(2)

def plot_time_trend(df, color, title):
    fig = px.scatter(df, 
                     x='hr_of_day', 
                     y='tag_number_masked', 
                     color=color,
                     title=title)
    fig.update_layout(xaxis_title='Hour of Day')
    fig.update_layout(yaxis_title='Number of Tickets')
    fig.show()

plot_time_trend(time_of_day_df, 'area_name', 'Time of Day Trend by Area')

# Time of day by Ticket Type
time_of_day_type_df = pd.DataFrame(ticket_df.groupby(['infraction_description', 'hr_of_day']).tag_number_masked.count()).reset_index()
plot_time_trend(time_of_day_type_df, 'infraction_description', 'Time of Day Trend by Ticket Type')

# time of day and day of the week
day_time_type_df = pd.DataFrame(ticket_df.groupby(['infraction_description', 'day_of_week', 'hr_of_day']).tag_number_masked.count()).reset_index()
day_time_type_df.head(2)

def plot_day_time():
    day_of_wk = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    fig = px.scatter_3d(day_time_type_df, 
                        x='day_of_week', 
                        y='hr_of_day', 
                        z='tag_number_masked', 
                        color='infraction_description',
                        category_orders={'day_of_week': day_of_wk[::-1]} 
                       )
    
    fig.update_layout(width=1000, 
                      height=800, 
                      title='Weekday Trend by Hour and Ticket Type')

    fig.show()
    
plot_day_time()

	_id	area_name	tag_number_masked
0	8401	Wychwood (94)	13619
1	8402	Yonge-Eglinton (100)	22644

	area_name	hr_of_day	tag_number_masked
3355	Yorkdale-Glen Park (31)	8	315
3356	Yorkdale-Glen Park (31)	9	421

	tag_number_masked	date_of_infraction	infraction_code	infraction_description	set_fine_amount	time_of_infraction	location1	location2	location3	location4	province
0	***92517	20180101	16	PARK-WITHIN 9M INTERSECT ROAD	50	0.0	S/S	PRYOR AVE	E/O	CLOVERDALE RD	ON
1	***71708	20180101	29	PARK PROHIBITED TIME NO PERMIT	30	2.0	NR	266 DOVERCOURT RD	NaN	NaN	ON

	tag_number_masked	date_of_infraction	infraction_code	infraction_description	set_fine_amount	time_of_infraction	location1	location2	location3	location4	province	hr_of_day
544283	***00696	20181231	5	PARK-SIGNED HWY-PROHIBIT DY/TM	50	2357	NR	282 GILLARD AVE	NaN	NaN	ON	23
544284	***43620	20181231	5	PARK-SIGNED HWY-PROHIBIT DY/TM	50	2357	NR	1404 DANFORTH AVE	NaN	NaN	ON	23
544285	***29024	20181231	3	PARK ON PRIVATE PROPERTY	30	2358	AT	2110 KEELE ST	NaN	NaN	ON	23
544286	***28387	20181231	9	STOP-SIGNED HWY-PROHIBIT TM/DY	60	2359	NR	40 SPRING GARDEN AVE	NaN	NaN	ON	23
544287	***43621	20181231	5	PARK-SIGNED HWY-PROHIBIT DY/TM	50	2359	NR	1316 DANFORTH AVE	NaN	NaN	ON	23

	tag_number_masked	date_of_infraction	infraction_code	infraction_description	set_fine_amount	time_of_infraction	province	hr_of_day	full_address	_id	area_name	infraction_yr	infraction_mth	infraction_date	infraction_ymd	day_of_week
2138070	***29024	20181231	3	PARK ON PRIVATE PROPERTY	30	2358	ON	23	2110 KEELE ST	8493.0	Brookhaven-Amesbury (30)	2018	12	31	2018-12-31	Monday
2138072	***43621	20181231	5	PARK-SIGNED HWY-PROHIBIT DY/TM	50	2359	ON	23	1316 DANFORTH AVE	8503.0	Danforth (66)	2018	12	31	2018-12-31	Monday

	set_fine_amount	infraction_description	number_of_tickets
0	0	FAIL TO PARK/STOP PARALLEL TO	1
1	0	LEAVE IN PARK NOT DESIG AREA	1

	infraction_description	day_of_week	hr_of_day	tag_number_masked
0	FAIL ANGLE PARK/STOP AT 45 DEG	Thursday	16	1
1	FAIL PARK/STP LEFT WHLS PARALL	Friday	14	1