Chapter 2 Data Preparation and Analysis
Perhaps the most difficult aspect of the predictive modeling workflow to write about is data preparation. When it comes to predictive modeling on structured data, the techniques and their implementations are more or less standard across industries. On the other hand, there is tremendous variability in what data sources look like at insurers, and therefore in the paths from raw data to modeling-ready datasets (not to mention navigating politics to acquire said data!). However, we are talking about insurance after all, so the concepts of policies, exposures, and claims are relatively invariant across companies.
In this chapter, we introduce the datasets we’ll be using throughout the book, walk through data preparation, and perform exploratory data analysis.
2.1 The AUTOSEG Dataset
We use publicly available data from AUTOSEG (“Automobile Statistics System”), published by Brazil’s Superintendence of Private Insurance (SUSEP). SUSEP maintains policy-characteristics-level data for personal auto from 2007 through the present for all insured vehicles in Brazil. The data contains a variety of variables, from policyholder characteristics to losses by peril. It also contains a set of mapping tables that define the codes used in categorical variables, which allows us to demonstrate a range of data manipulation operations, such as joining. The data can be downloaded directly from SUSEP’s AUTOSEG website, but we also host it on the book’s GitHub repository for convenience.
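As a sketch, one bundle can be pulled down and extracted as follows; the URL and file name here are placeholders, so substitute the actual links from the AUTOSEG site or the book’s repository:

```r
# Download one semester's bundle and extract it (the URL is a placeholder).
url <- "https://example.com/autoseg/autoseg_2013B.zip"
zipfile <- file.path(tempdir(), basename(url))
download.file(url, destfile = zipfile, mode = "wb")
unzip(zipfile, exdir = "data/autoseg")
```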
The raw data is organized in Zip archives, each containing half a year’s worth of data. Tables 2.1 and 2.2 list the data files included in each bundle, along with their descriptions.
Table 2.1: Data files included in each AUTOSEG bundle.

| File | Description |
|---|---|
| arq_casco | Exposures, premiums, claims, and insured amounts for the CASCO coverage, classified by the key Tariff Category / Region / Model / Year / Sex / Age Range |
| arq_casco3 | Exposures, premiums, and claims for the CASCO coverage, classified by the key Tariff Category / CEP / Model / Year |
| arq_casco4 | Exposures, premiums, and claims for the CASCO coverage, classified by the key Tariff Category / City / Model / Year |
| premreg | Regional distribution of premiums |
| sinreg | Regional distribution of claims |
Table 2.2: Mapping tables included in each AUTOSEG bundle.

| File | Description |
|---|---|
| auto2_vei | FIPE code and description of each vehicle model, along with the code of the group to which it belongs |
| auto2_group | Code and description of the vehicle model groups |
| auto_cat | Code and description of tariff categories |
| auto_cau | Code and description of claim causes (perils) |
| auto_cep | Maps CEPs (postal codes) to cities and regions of circulation |
| auto_cob | Code and description of coverages |
| auto_idade | Code and description of age ranges |
| auto_reg | Code and description of regions of circulation |
| auto_sexo | Code and description of sex (male, female, legal entity) |
| auto_city | Code and name of cities |
Excerpts of each source table, before any transformations, can be found in Appendix A. (TODO: maybe we should include translations here.)
For our purposes, we will use the `arq_casco` policy table, since it is the only one that contains policyholder characteristics, specifically, sex and age.
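A sketch of reading one of the extracted files with readr; we assume (based on the raw files) that AUTOSEG ships semicolon-delimited CSVs in Latin-1 encoding, and the path below refers to the extraction step above:

```r
library(readr)

# arq_casco.csv is semicolon-delimited and Latin-1 encoded (an assumption
# about the raw files; adjust if your download differs).
arq_casco <- read_csv2(
  "data/autoseg/arq_casco.csv",
  locale = locale(encoding = "Latin1")
)
```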
2.2 The Data Model
We see that there are a few categorical columns in `arq_casco` with codes that we need to decipher using the provided mapping tables. To accomplish this, we first observe the relational model diagram in Figure 2.1.

The table `auto_cau` stands out from the others because it defines the numerical codes used in the claim count and claim amount columns, so we use it to rename those columns in `risks_table` rather than joining it. For example, the column `claim_count1` becomes `claim_count_theft`.
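A minimal sketch of this renaming step with dplyr; the theft example is from the text, while the remaining code-to-peril pairs shown are illustrative assumptions about `auto_cau`:

```r
library(dplyr)

# Rename the numbered claim columns using the peril codes defined in
# auto_cau (the specific code-to-peril mapping here is illustrative).
risks_table <- risks_table %>%
  rename(
    claim_count_theft = claim_count1,
    claim_amount_theft = claim_amount1,
    claim_count_collision = claim_count2,
    claim_amount_collision = claim_amount2
  )
```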
For the rest of the mapping tables, we join them onto the main `risks_table` by the associated keys. As an example, for the table `auto_idade`, which we exhibit in Table 2.3, we join on the column `age_code`. The type of join we perform is a left join, which means we retain all rows of `risks_table` and append the matched rows of `auto_idade`, so that the row with `age_code == "0"` will have `age_range == "Não informada"` (“Not provided”), the row with `age_code == "1"` will have `age_range == "Entre 18 e 25 anos"` (“18 to 25”), and so on. We sketch this join in code right after the table.
Table 2.3: The `auto_idade` mapping table.

| age_code | age_range |
|---|---|
| 0 | Não informada (not provided) |
| 1 | Entre 18 e 25 anos (18 to 25 years) |
| 2 | Entre 26 e 35 anos (26 to 35 years) |
| 3 | Entre 36 e 45 anos (36 to 45 years) |
| 4 | Entre 46 e 55 anos (46 to 55 years) |
| 5 | Maior que 55 anos (over 55 years) |
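A minimal sketch of the left join using dplyr, assuming the tables have been read into `risks_table` and `auto_idade` with the column names shown above:

```r
library(dplyr)

# Left join: keep every row of risks_table, appending age_range where
# age_code matches; unmatched rows receive NA.
risks_table <- risks_table %>%
  left_join(auto_idade, by = "age_code")
```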
2.3 Training/Testing Split
Now that we have acquired the data we will use for modeling, the immediate next step is to set aside a holdout, or testing, set. The purpose of the testing set is to validate our model at the end of the modeling exercise. It is crucial that we do not peek at the testing set during exploratory data analysis or model tuning; otherwise, we may allow information from the testing set to “leak” into the training phase, which undermines the credibility of our performance metrics.
We are going to set aside a random 1/5 of the available data as our testing set, and lock it up in a drawer (figuratively) and not look at it until the very end. (TODO: maybe consider time split when we expand to more years.) Until we say otherwise, the following analyses will all be done on the training set.
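A sketch of the split using the rsample package; the object name `risks_table` carries over from above, and the seed is arbitrary:

```r
library(rsample)

# Fix the seed so the split is reproducible, then hold out 1/5 of rows.
set.seed(2020)
split <- initial_split(risks_table, prop = 4/5)
train_data <- training(split)
test_data <- testing(split)
```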
2.4 Data Summary
Perhaps the first thing to look at is a brief summary of our combined dataset to get an idea of types, missingness, and distributions. Tables 2.4 and 2.5 contain the summaries of the character and numeric columns, respectively.
Table 2.4: Summary of character columns of the training dataset (min and max are string lengths).

| variable | missing | complete | n | min | max | empty | n_unique |
|---|---|---|---|---|---|---|---|
| age_range | 0 | 1366121 | 1366121 | 13 | 18 | 0 | 6 |
| data_year | 0 | 1366121 | 1366121 | 5 | 5 | 0 | 1 |
| region | 3593 | 1362528 | 1366121 | 8 | 47 | 0 | 41 |
| sex | 0 | 1366121 | 1366121 | 8 | 9 | 0 | 2 |
| vehicle_category | 0 | 1366121 | 1366121 | 16 | 30 | 0 | 3 |
| vehicle_description | 224336 | 1141785 | 1366121 | 6 | 58 | 0 | 3034 |
| vehicle_group | 224336 | 1141785 | 1366121 | 3 | 33 | 0 | 387 |
| vehicle_group_description | 224336 | 1141785 | 1366121 | 3 | 33 | 0 | 387 |
Table 2.5: Summary of numeric columns of the training dataset.

| variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| vehicle_year | 0 | 1366121 | 1366121 | 2007.2 | 4.52 | 1960 | 2005 | 2008 | 2011 | 2013 | ▁▁▁▁▁▁▃▇ |
| average_insured_amount | 0 | 1366121 | 1366121 | 40049.13 | 30945.71 | 1 | 22449 | 31841 | 48417 | 1531951 | ▇▁▁▁▁▁▁▁ |
| claim_amount | 0 | 1366121 | 1366121 | 2516.02 | 14593.97 | 0 | 0 | 0 | 0 | 2315322 | ▇▁▁▁▁▁▁▁ |
| claim_count | 0 | 1366121 | 1366121 | 0.92 | 6.24 | 0 | 0 | 0 | 0 | 1315 | ▇▁▁▁▁▁▁▁ |
| exposure | 0 | 1366121 | 1366121 | 3.51 | 13.85 | 0.5 | 0.5 | 0.5 | 2.5 | 1669.5 | ▇▁▁▁▁▁▁▁ |
| loss_per_exposure | 0 | 1366121 | 1366121 | 862.18 | 7045.14 | 0 | 0 | 0 | 0 | 760308 | ▇▁▁▁▁▁▁▁ |
| premium | 0 | 1366121 | 1366121 | 3580.26 | 13674.2 | 0 | 443 | 975 | 2521 | 1214283 | ▇▁▁▁▁▁▁▁ |
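Summaries in this format can be produced with, for example, the skimr package; a sketch, assuming the training set lives in `train_data`:

```r
library(skimr)

# Produces per-type summary tables (counts, missingness, quantiles,
# inline histograms) like Tables 2.4 and 2.5.
skim(train_data)
```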
2.5 Exposures by State
(TODO: something like this; we still need to add a tooltip and cut out Brazil.)
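First, a sketch of the inputs the map needs; `train_data` carries over from the split above, while `state` as the grouping column and `brazil` as an sf object of state polygons with a matching `state` column are assumptions:

```r
library(dplyr)
library(leaflet)

# Aggregate exposures by state from the training set (column names assumed).
state_exposures <- train_data %>%
  group_by(state) %>%
  summarise(exposures = sum(exposure))

# Join onto the state polygons, then build a palette and hover labels.
brazil <- brazil %>% left_join(state_exposures, by = "state")
pal <- colorNumeric("YlOrRd", domain = brazil$exposures)
labels <- sprintf(
  "%s: %s exposures",
  brazil$state,
  formatC(brazil$exposures, format = "d", big.mark = ",")
)
```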
```r
# Choropleth of exposures by state. Builds on the `brazil` polygons,
# `pal` palette, and `labels` defined above; `bra_cutout` (a mask covering
# everything outside Brazil) and `bra_bbox` (Brazil's bounding box as
# c(lng1, lat1, lng2, lat2)) are assumed to exist.
brazil %>%
  leaflet(options = leafletOptions(minZoom = 4)) %>%
  addTiles() %>%
  # White mask over everything outside Brazil.
  addPolygons(
    data = bra_cutout,
    weight = 0,
    opacity = 1,
    color = "white",
    fillOpacity = 1
  ) %>%
  # State polygons shaded by exposures, with hover highlighting.
  addPolygons(
    fillColor = ~pal(exposures),
    weight = 2,
    opacity = 1,
    color = "white",
    dashArray = "3",
    fillOpacity = 0.7,
    label = lapply(labels, htmltools::HTML),
    highlightOptions = highlightOptions(
      weight = 5,
      color = "#666",
      dashArray = "",
      fillOpacity = 0.7,
      bringToFront = TRUE
    )
  ) %>%
  # Restrict panning to Brazil and center the initial view.
  setMaxBounds(bra_bbox[1], bra_bbox[2], bra_bbox[3], bra_bbox[4]) %>%
  setView(mean(bra_bbox[c(1, 3)]), mean(bra_bbox[c(2, 4)]), zoom = 4)
```