Chapter 2 Data Preparation and Analysis
Perhaps the most difficult aspect of the predictive modeling workflow to write about is data preparation. When it comes to predictive modeling on structured data, the techniques and their implementations are more or less standard across industries. On the other hand, there is tremendous variability in what data sources look like at insurers, and therefore in the paths from raw data to modeling-ready datasets (not to mention navigating politics to acquire said data!). However, we are talking about insurance after all, so the concepts of policies, exposures, and claims are relatively invariant across companies.
In this chapter, we introduce the datasets we’ll be using throughout the book, walk through data preparation, and perform exploratory data analysis.
2.1 The AUTOSEG Dataset
We use publicly available data from AUTOSEG (“Automobile Statistics System”), published by Brazil’s Superintendence of Private Insurance (SUSEP). SUSEP maintains policy-characteristics-level data for personal auto from 2007 through the present for all insured vehicles in Brazil. The data contains a variety of variables, from policyholder characteristics to losses by peril. It also contains a set of mapping tables that define the codes used in categorical variables, which allows us to demonstrate a range of data manipulation operations, such as joining. The data can be downloaded directly from SUSEP’s AUTOSEG website, but we also host it on the book’s GitHub repository for convenience.
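As a sketch, one bundle can be pulled down and extracted as follows; the URL and file name here are placeholders, so substitute the actual links from the AUTOSEG site or the book’s repository:

```r
# Download one semester's bundle and extract it (the URL is a placeholder).
url <- "https://example.com/autoseg/autoseg_2013B.zip"
zipfile <- file.path(tempdir(), basename(url))
download.file(url, destfile = zipfile, mode = "wb")
unzip(zipfile, exdir = "data/autoseg")
```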
The raw data is organized in Zip archives, each containing half a year’s worth of data. Tables 2.1 and 2.2 list the data files included in each bundle, along with their descriptions.
Table 2.1: Data files included in each AUTOSEG bundle.

| File | Description |
|---|---|
| arq_casco | Exposures, premiums, claims, and insured amounts for the CASCO coverage, classified by the key Tariff Category / Region / Model / Year / Sex / Age Range |
| arq_casco3 | Exposures, premiums, and claims for the CASCO coverage, classified by the key Tariff Category / CEP / Model / Year |
| arq_casco4 | Exposures, premiums, and claims for the CASCO coverage, classified by the key Tariff Category / City / Model / Year |
| premreg | Regional distribution of premiums |
| sinreg | Regional distribution of claims |
Table 2.2: Mapping tables included in each AUTOSEG bundle.

| File | Description |
|---|---|
| auto2_vei | FIPE code and description of each vehicle model, along with the code of the group to which it belongs |
| auto2_group | Code and description of the vehicle model groups |
| auto_cat | Code and description of tariff categories |
| auto_cau | Code and description of claim causes (perils) |
| auto_cep | Maps CEPs (postal codes) to cities and regions of circulation |
| auto_cob | Code and description of coverages |
| auto_idade | Code and description of age ranges |
| auto_reg | Code and description of regions of circulation |
| auto_sexo | Code and description of sex (male, female, legal entity) |
| auto_city | Code and name of cities |
Excerpts of each source table, before any transformations, can be found in Appendix A. (TODO: maybe we should include translations here.)
For our purposes, we will use the `arq_casco` policy table, since it is the only one that contains policyholder characteristics, specifically, sex and age.
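A sketch of reading one of the extracted files with readr; we assume (based on the raw files) that AUTOSEG ships semicolon-delimited CSVs in Latin-1 encoding, and the path below refers to the extraction step above:

```r
library(readr)

# arq_casco.csv is semicolon-delimited and Latin-1 encoded (an assumption
# about the raw files; adjust if your download differs).
arq_casco <- read_csv2(
  "data/autoseg/arq_casco.csv",
  locale = locale(encoding = "Latin1")
)
```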
2.2 The Data Model
We see that there are a few categorical columns in `arq_casco` with codes that we need to decipher using the provided mapping tables. To accomplish this, we first observe the relational model diagram in Figure 2.1.

The table `auto_cau` stands out from the others because it defines the numerical codes used in the claim count and claim amount columns, so we use it to rename those columns in `risks_table` rather than joining it. For example, the column `claim_count1` becomes `claim_count_theft`.
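A minimal sketch of this renaming step with dplyr; the theft example is from the text, while the remaining code-to-peril pairs shown are illustrative assumptions about `auto_cau`:

```r
library(dplyr)

# Rename the numbered claim columns using the peril codes defined in
# auto_cau (the specific code-to-peril mapping here is illustrative).
risks_table <- risks_table %>%
  rename(
    claim_count_theft = claim_count1,
    claim_amount_theft = claim_amount1,
    claim_count_collision = claim_count2,
    claim_amount_collision = claim_amount2
  )
```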
For the rest of the mapping tables, we join them onto the main `risks_table` by the associated keys. As an example, for the table `auto_idade`, which we exhibit in Table 2.3, we join on the column `age_code`. The type of join we perform is a left join, which means we retain all rows of `risks_table` and append the matched rows of `auto_idade`, so that the row with `age_code == "0"` will have `age_range == "Não informada"` (“Not provided”), the row with `age_code == "1"` will have `age_range == "Entre 18 e 25 anos"` (“18 to 25”), and so on. We sketch this join in code right after the table.
Table 2.3: The `auto_idade` mapping table.

| age_code | age_range |
|---|---|
| 0 | Não informada (not provided) |
| 1 | Entre 18 e 25 anos (18 to 25 years) |
| 2 | Entre 26 e 35 anos (26 to 35 years) |
| 3 | Entre 36 e 45 anos (36 to 45 years) |
| 4 | Entre 46 e 55 anos (46 to 55 years) |
| 5 | Maior que 55 anos (over 55 years) |
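A minimal sketch of the left join using dplyr, assuming the tables have been read into `risks_table` and `auto_idade` with the column names shown above:

```r
library(dplyr)

# Left join: keep every row of risks_table, appending age_range where
# age_code matches; unmatched rows receive NA.
risks_table <- risks_table %>%
  left_join(auto_idade, by = "age_code")
```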
2.3 Training/Testing Split
Now that we have acquired the data we will use for modeling, the immediate next step is to set aside a holdout, or testing, set. The purpose of the testing set is to validate our model at the end of the modeling exercise. It is crucial that we do not peek at the testing set during exploratory data analysis or model tuning; otherwise, we may allow information from the testing set to “leak” into the training phase, which undermines the credibility of our performance metrics.
We are going to set aside a random 1/5 of the available data as our testing set, and lock it up in a drawer (figuratively) and not look at it until the very end. (TODO: maybe consider time split when we expand to more years.) Until we say otherwise, the following analyses will all be done on the training set.
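A sketch of the split using the rsample package; the object name `risks_table` carries over from above, and the seed is arbitrary:

```r
library(rsample)

# Fix the seed so the split is reproducible, then hold out 1/5 of rows.
set.seed(2020)
split <- initial_split(risks_table, prop = 4/5)
train_data <- training(split)
test_data <- testing(split)
```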
2.4 Data Summary
Perhaps the first thing to look at is a brief summary of our combined dataset to get an idea of types, missingness, and distributions. Tables 2.4 and 2.5 contain the summaries of the character and numeric columns, respectively.
Table 2.4: Summary of character columns of the training dataset (min and max are string lengths).

| variable | missing | complete | n | min | max | empty | n_unique |
|---|---|---|---|---|---|---|---|
| age_range | 0 | 1366121 | 1366121 | 13 | 18 | 0 | 6 |
| data_year | 0 | 1366121 | 1366121 | 5 | 5 | 0 | 1 |
| region | 3593 | 1362528 | 1366121 | 8 | 47 | 0 | 41 |
| sex | 0 | 1366121 | 1366121 | 8 | 9 | 0 | 2 |
| vehicle_category | 0 | 1366121 | 1366121 | 16 | 30 | 0 | 3 |
| vehicle_description | 224336 | 1141785 | 1366121 | 6 | 58 | 0 | 3034 |
| vehicle_group | 224336 | 1141785 | 1366121 | 3 | 33 | 0 | 387 |
| vehicle_group_description | 224336 | 1141785 | 1366121 | 3 | 33 | 0 | 387 |
Table 2.5: Summary of numeric columns of the training dataset.

| variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| vehicle_year | 0 | 1366121 | 1366121 | 2007.2 | 4.52 | 1960 | 2005 | 2008 | 2011 | 2013 | ▁▁▁▁▁▁▃▇ |
| average_insured_amount | 0 | 1366121 | 1366121 | 40049.13 | 30945.71 | 1 | 22449 | 31841 | 48417 | 1531951 | ▇▁▁▁▁▁▁▁ |
| claim_amount | 0 | 1366121 | 1366121 | 2516.02 | 14593.97 | 0 | 0 | 0 | 0 | 2315322 | ▇▁▁▁▁▁▁▁ |
| claim_count | 0 | 1366121 | 1366121 | 0.92 | 6.24 | 0 | 0 | 0 | 0 | 1315 | ▇▁▁▁▁▁▁▁ |
| exposure | 0 | 1366121 | 1366121 | 3.51 | 13.85 | 0.5 | 0.5 | 0.5 | 2.5 | 1669.5 | ▇▁▁▁▁▁▁▁ |
| loss_per_exposure | 0 | 1366121 | 1366121 | 862.18 | 7045.14 | 0 | 0 | 0 | 0 | 760308 | ▇▁▁▁▁▁▁▁ |
| premium | 0 | 1366121 | 1366121 | 3580.26 | 13674.2 | 0 | 443 | 975 | 2521 | 1214283 | ▇▁▁▁▁▁▁▁ |
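Summaries in this format can be produced with, for example, the skimr package; a sketch, assuming the training set lives in `train_data`:

```r
library(skimr)

# Produces per-type summary tables (counts, missingness, quantiles,
# inline histograms) like Tables 2.4 and 2.5.
skim(train_data)
```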
2.5 Exposures by State
(TODO: something like this; we still need to add a tooltip and cut out Brazil.)
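First, a sketch of the inputs the map needs; `train_data` carries over from the split above, while `state` as the grouping column and `brazil` as an sf object of state polygons with a matching `state` column are assumptions:

```r
library(dplyr)
library(leaflet)

# Aggregate exposures by state from the training set (column names assumed).
state_exposures <- train_data %>%
  group_by(state) %>%
  summarise(exposures = sum(exposure))

# Join onto the state polygons, then build a palette and hover labels.
brazil <- brazil %>% left_join(state_exposures, by = "state")
pal <- colorNumeric("YlOrRd", domain = brazil$exposures)
labels <- sprintf(
  "%s: %s exposures",
  brazil$state,
  formatC(brazil$exposures, format = "d", big.mark = ",")
)
```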
```r
# Choropleth of exposures by state. Builds on the `brazil` polygons,
# `pal` palette, and `labels` defined above; `bra_cutout` (a mask covering
# everything outside Brazil) and `bra_bbox` (Brazil's bounding box as
# c(lng1, lat1, lng2, lat2)) are assumed to exist.
brazil %>%
  leaflet(options = leafletOptions(minZoom = 4)) %>%
  addTiles() %>%
  # White mask over everything outside Brazil.
  addPolygons(
    data = bra_cutout,
    weight = 0,
    opacity = 1,
    color = "white",
    fillOpacity = 1
  ) %>%
  # State polygons shaded by exposures, with hover highlighting.
  addPolygons(
    fillColor = ~pal(exposures),
    weight = 2,
    opacity = 1,
    color = "white",
    dashArray = "3",
    fillOpacity = 0.7,
    label = lapply(labels, htmltools::HTML),
    highlightOptions = highlightOptions(
      weight = 5,
      color = "#666",
      dashArray = "",
      fillOpacity = 0.7,
      bringToFront = TRUE
    )
  ) %>%
  # Restrict panning to Brazil and center the initial view.
  setMaxBounds(bra_bbox[1], bra_bbox[2], bra_bbox[3], bra_bbox[4]) %>%
  setView(mean(bra_bbox[c(1, 3)]), mean(bra_bbox[c(2, 4)]), zoom = 4)
```