Import packages

Import the following package. If the package is not installed in your R environment, install it using the install.packages("package_name") command.
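
The install-if-missing step can be sketched as a small helper. The function name ensure_installed is hypothetical (it is not part of cem or base R); it only wraps the base functions requireNamespace() and install.packages().

```r
# Hypothetical helper: install a package only when it is not already available.
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # TRUE if the package can be loaded after the optional install step.
  requireNamespace(pkg, quietly = TRUE)
}

# Example call (commented out so that nothing is installed by accident):
# ensure_installed("cem")
```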

library("cem")

Import data

This exercise reuses the dataset from Exercise 2.

data <- read.csv("../data/exercise2.csv")
head(data, n=10)
##    talent    effort     skill treatment
## 1   FALSE 0.6497020 0.6828916      TRUE
## 2    TRUE 0.6848052 0.9369752      TRUE
## 3   FALSE 0.7935741 0.4630684      TRUE
## 4    TRUE 0.4313167 0.4561438     FALSE
## 5    TRUE 0.4718738 0.7214476     FALSE
## 6    TRUE 0.4208080 0.2688652     FALSE
## 7   FALSE 0.5741308 0.5328195     FALSE
## 8   FALSE 0.6814296 0.6213216      TRUE
## 9   FALSE 0.5700953 0.3671640     FALSE
## 10   TRUE 0.2904192 0.7253196     FALSE

Calculate data imbalance

A metric called the \(L_1\) vector norm measures the amount of imbalance within a dataset. Such imbalance arises when confounders are distributed differently between the treated and untreated groups, which biases any comparison between them. The \(L_1\) vector norm takes a value between 0 and 1. A value of 0 indicates that there is no imbalance in the dataset, while a value of 1 denotes a completely imbalanced dataset. The following command calculates the \(L_1\) vector norm for the original dataset. Note that the variables (columns) that are not confounders must be listed in the drop argument so that the function excludes them from the calculation.

L1.meas(data$treatment, data, drop=c('treatment', 'effort', 'skill'))
## 
## Multivariate Imbalance Measure: L1=0.360
## Percentage of local common support: LCS=100.0%
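
To make the reported number less opaque, here is a minimal base-R sketch of how such a measure can be computed by hand for a single categorical confounder. It assumes the usual definition of \(L_1\) as half the sum of absolute differences between the treated and control relative frequencies in each cell. The function name l1_by_hand and the toy vectors are hypothetical and are not part of the cem package or of the exercise data; L1.meas also bins continuous variables, so the two are not guaranteed to agree in general.

```r
# Hypothetical helper: hand-computed L1 imbalance for one categorical confounder.
l1_by_hand <- function(treatment, confounder) {
  # Column-wise proportions: distribution of the confounder within
  # the control group (column "FALSE") and the treated group (column "TRUE").
  tab <- prop.table(table(confounder, treatment), margin = 2)
  # Half the sum of absolute differences between the two distributions.
  0.5 * sum(abs(tab[, "TRUE"] - tab[, "FALSE"]))
}

# Toy data (made up for illustration): 4 treated and 4 control students.
toy_treatment <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
toy_talent    <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)

toy_l1 <- l1_by_hand(toy_treatment, toy_talent)
print(toy_l1)
```

In this toy example 75% of the treated students but only 25% of the control students are talented, so the sketch returns 0.5.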

The \(L_1\) value of 0.360 implies that the dataset is mildly imbalanced and that the source of bias (i.e., talent) needs to be controlled for. Remember that controlling for a confounder here means doing exact matching on it. Hence, students are split into low-talent and high-talent groups.

# talent is a logical column, so it can be used directly as a row filter
low_talent <- data[!data$talent, ]
high_talent <- data[data$talent, ]
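
As an alternative sketch, base R's split() builds both groups in one call and extends naturally to confounders with more than two levels. The toy data frame below is made up for illustration and only mimics some column names of the exercise data.

```r
# Toy data frame (made up) reusing two column names from the exercise data.
toy <- data.frame(
  talent    = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  treatment = c(TRUE, TRUE, FALSE, FALSE, FALSE)
)

# split() returns a named list with one data frame per talent value.
groups <- split(toy, toy$talent)

toy_low  <- groups[["FALSE"]]  # rows where talent is FALSE
toy_high <- groups[["TRUE"]]   # rows where talent is TRUE

print(nrow(toy_low))
print(nrow(toy_high))
```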

Next, measure the level of data imbalance within each group of low-talent and high-talent students.

L1.meas(low_talent$treatment, low_talent, drop=c('treatment', 'effort', 'skill'))
## 
## Multivariate Imbalance Measure: L1=0.000
## Percentage of local common support: LCS=100.0%
L1.meas(high_talent$treatment, high_talent, drop=c('treatment', 'effort', 'skill'))
## 
## Multivariate Imbalance Measure: L1=0.000
## Percentage of local common support: LCS=100.0%

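Exact matching therefore eliminates the bias (also called data imbalance) caused by talent: within each group every student has the same talent value, so treated and untreated students cannot differ on it. Note that the \(L_1\) vector norm can work with a dataset with any number of confounders.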
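
As a rough illustration of the multi-confounder case, the sketch below computes an \(L_1\)-style measure by hand over the joint cells of two categorical confounders, assuming the definition of half the sum of absolute differences between treated and control cell frequencies. The helper l1_multi, the motivated column, and all values are hypothetical; this is not the cem implementation, which also bins continuous variables.

```r
# Hypothetical helper: hand-computed L1 over the joint cells of several
# categorical confounders (continuous ones would first need to be binned).
l1_multi <- function(treatment, confounders) {
  # One cell per observed combination of confounder values.
  cells <- interaction(confounders, drop = TRUE)
  # Distribution over cells within control ("FALSE") and treated ("TRUE").
  tab <- prop.table(table(cells, treatment), margin = 2)
  0.5 * sum(abs(tab[, "TRUE"] - tab[, "FALSE"]))
}

# Toy data (made up): two logical confounders, 4 treated and 4 control rows.
toy <- data.frame(
  treatment = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  talent    = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE),
  motivated = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)
)

toy_l1 <- l1_multi(toy$treatment, toy[, c("talent", "motivated")])
print(toy_l1)
```

The only change from the single-confounder case is that cells are defined by combinations of confounder values rather than by one variable.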