Welcome to CSML
From credit ratings to housing allocation, machine learning models are increasingly used to automate everyday decision making processes. With the growing impact on society, more and more concerns are being voiced about the loss of transparency, accountability, and fairness of the algorithms making the decisions. We, as data scientists, need to step-up our game and look for ways to mitigate emergent discrimination in our models. We need to make sure that our predictions do not disproportionately hurt people with certain sensitive characteristics (e.g., gender, ethnicity).
For our experiment, we used the Adult UCI dataset, which can be downloaded here. It is also referred to as the “Census Income” dataset. We will predict whether a person’s income exceeds $50,000 a year. It is not hard to imagine that financial institutions train models on similar data sets and use them to decide whether someone is eligible for a loan, or to set the price of an insurance premium.
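As a minimal sketch of the prediction target, the income column can be binarized into a 0/1 label. The column names below follow the UCI “Census Income” description, and the tiny inline sample is an illustrative stand-in for loading the full `adult.data` file:

```python
import pandas as pd

# Illustrative sample rows; in practice you would read the full adult.data CSV.
# Column names are taken from the UCI "Census Income" description.
rows = [
    {"age": 39, "sex": "Male", "marital-status": "Never-married", "income": "<=50K"},
    {"age": 52, "sex": "Female", "marital-status": "Married-civ-spouse", "income": ">50K"},
]
df = pd.DataFrame(rows)

# Binary target: 1 if income exceeds $50,000 a year, else 0.
df["target"] = (df["income"] == ">50K").astype(int)
print(df["target"].tolist())
```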
Mutual information is a measure of the (possibly non-linear) association between two variables: it quantifies how much the uncertainty about one variable is reduced by observing the other. We computed it with scikit-learn’s mutual_info_classif function. Here, you can see the mutual information values between each of the 6 features and the protected features. Notice the high value between marital-status and sex. This suggests a strong association between these variables; in other words, marital-status could act as a proxy variable for sex.
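The proxy-variable check can be sketched on synthetic data. The example below is not the original experiment: it constructs one feature strongly tied to a binary protected attribute (a stand-in for marital-status vs. sex) and one independent feature, then scores both with mutual_info_classif:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins: "sex" is the protected attribute, "marital" agrees
# with it 90% of the time (a proxy), "noise" is unrelated.
sex = rng.integers(0, 2, size=n)
marital = (sex ^ (rng.random(n) < 0.1)).astype(int)
noise = rng.integers(0, 2, size=n)

X = np.column_stack([marital, noise])
mi = mutual_info_classif(X, sex, discrete_features=True, random_state=0)
print(mi)  # the proxy feature scores much higher than the noise feature
```

A feature with high mutual information with a protected attribute can leak that attribute into the model even when the attribute itself is dropped, which is why this check matters before training.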