Chapter 2 Data Preparation and Analysis

Perhaps the most difficult aspect of the predictive modeling workflow to write about is data preparation. When it comes to predictive modeling on structured data, the techniques and their implementations are more or less standard across industries. On the other hand, there is tremendous variability in what data sources look like at insurers and therefore in the paths to arrive at modeling-ready datasets from raw data (not to mention navigating politics to acquire said data!). However, we are talking about insurance after all, so the concept of policies, exposures, and claims are relativel invariant across companies.

In this chapter, we introduce the datasets we’ll be using throughout the book, walk through data preparation, and perform exploratory data analysis.

2.1 The AUTOSEG Dataset

We use publicly available data from AUTOSEG (“Automobile Statistics System”) from Brazil’s Superintendence of Private Insurance (SUSEP). It maintains policy-characteristics-level data for personal auto from 2007 through the present for all insured vehicles in Brazil. The data contains a variety of variables, from policyholder characteristics to losses by peril. It also contains a set of mapping tables that define the codes used in categorical variables, which allows us to demonstrate a range data manipulation actions such as joining. The data can be downloaded directly from SUSEP’s AUTOSEG website, but we also host it on the book’s GitHub repository for covenience.

The raw data is organized in Zip archives containing half a year worth of data each. Tables 2.1 and 2.2 list the data files and their descriptions included in each bundle.

Table 2.1: Main Tables
File Description
arq_casco Exposure data, premiums, claims and insured amount for the CASCO overhead, classified by the Key Category Rate / Region / Model / Year / Sex / Age Range
arq_casco3 Exposure data, premiums and claims for the CASCO overhang, classified by the Key Rate Category / CEP / Model / Year key
arq_casco4 Exposure data, premiums and claims for the CASCO overhang, classified by the Key Rate Category / City / Model / Year
premreg Regional distribution of prices
sinreg Regional distribution of claims
Table 2.2: Mapping (Auxiliary) Tables
File Description
auto2_vei FIPE code and description of each vehicle model, in addition to the group code to which it belongs
auto2_group Code and description of model groups it contains
auto_cat Description code of tariff categories
auto_cau Code and description of causes of accidents
auto_cep Correlates the CEP with cities and regions of circulation
auto_cob Code and description of covers
auto_idade Code and description of age groups
auto_reg Code and description of regions of circulation
auto_sexo Code and description of sex (male, female, legal)
auto_city Code and name of cities

The excerpts for each of the source tables, before any transformations, can be found in Appendix A. (TODO: maybe we should include translations here.)

For our purposes, we will use the arq_casco policy table since it is the only one that contains poilcyholder characteristics – specifically, sex and age.

2.2 The Data Model

We see that there are a few categorical columns in arq_casco with codes that we need to decipher using the provided mapping tables. To accomplish this, we first observe the relational model diagram in Figure 2.1.

Figure 2.1: Relational model of AUTOSEG tables.

The table auto_cau stands out from the others because it defines the numerical codes used in the claim count and claim amount columns, so we use it to rename those columns in risks_table rather than joining them. For example, the column claim_count1 becomes claim_count_theft.

For the rest of the mapping tables, we join them on to the main risks_table by the associated keys. As an example, for the table auto_idade, which we exhibit in Table 2.3, we join on the column age_code. The type of join we perform is a left join, which means we retain all rows of risks_table, and append the matched rows in auto_idade, so that the row with age_code == "0" will have age == "Not provided", the row with age_code == "1" will have age == "18 to 25", and so on.

Table 2.3: auto_idade.
age_code age_range
0 Não informada
1 Entre 18 e 25 anos
2 Entre 26 e 35 anos
3 Entre 36 e 45 anos
4 Entre 46 e 55 anos
5 Maior que 55 anos

2.3 Training/Testing Split

Now that we have acquired the data we will use for modeling, the immediate next step is to set aside a holdout, or testing, set. The purpose of the testing set is to serve as a way for us to validate our model at the end of the modeling exercise. It is crucial that we do not peek at the testing set during exploratory data analysis or model tuning, otherwise we may allow information from the testing set to “leak” to the training phase, which undermines the credibility of our performance metrics.

We are going to set aside a random 1/5 of the available data as our testing set, and lock it up in a drawer (figuratively) and not look at it until the very end. (TODO: maybe consider time split when we expand to more years.) Until we say otherwise, the following analyses will all be done on the training set.

2.4 Data Summary

Perhaps the first thing to look at is a brief summary of our combined dataset to get an idea about types, missingness, and distributions. In Tables ?? and ?? we have the summaries of the character and numeric columns, respectively.

Summary of character columns of training dataset.
variable missing complete n min max empty n_unique
age_range 0 1366121 1366121 13 18 0 6
data_year 0 1366121 1366121 5 5 0 1
region 3593 1362528 1366121 8 47 0 41
sex 0 1366121 1366121 8 9 0 2
vehicle_category 0 1366121 1366121 16 30 0 3
vehicle_description 224336 1141785 1366121 6 58 0 3034
vehicle_group 224336 1141785 1366121 3 33 0 387
vehicle_group_description 224336 1141785 1366121 3 33 0 387
Summary of numeric columns of training dataset.
variable missing complete n mean sd p0 p25 p50 p75 p100 hist
vehicle_year 0 1366121 1366121 2007.2 4.52 1960 2005 2008 2011 2013 ▁▁▁▁▁▁▃▇
average_insured_amount 0 1366121 1366121 40049.13 30945.71 1 22449 31841 48417 1531951 ▇▁▁▁▁▁▁▁
claim_amount 0 1366121 1366121 2516.02 14593.97 0 0 0 0 2315322 ▇▁▁▁▁▁▁▁
claim_count 0 1366121 1366121 0.92 6.24 0 0 0 0 1315 ▇▁▁▁▁▁▁▁
exposure 0 1366121 1366121 3.51 13.85 0.5 0.5 0.5 2.5 1669.5 ▇▁▁▁▁▁▁▁
loss_per_exposure 0 1366121 1366121 862.18 7045.14 0 0 0 0 760308 ▇▁▁▁▁▁▁▁
premium 0 1366121 1366121 3580.26 13674.2 0 443 975 2521 1214283 ▇▁▁▁▁▁▁▁