Title: A Partially Interpretable Model with Black-Box Refinement
Description: Implements a novel predictive model, Partially Interpretable Estimators (PIE), which jointly trains an interpretable model and a black-box model to achieve high predictive performance as well as partial model transparency. See the paper, Wang, Yang, Li, and Wang (2021) <doi:10.48550/arXiv.2105.02410>.
Authors: Tong Wang [aut], Jingyi Yang [aut, cre], Yunyi Li [aut], Boxiang Wang [aut]
Maintainer: Jingyi Yang <[email protected]>
License: GPL-2
Version: 1.0.0
Built: 2025-01-28 10:07:56 UTC
Source: https://github.com/cran/PIE
This function takes a tabular dataset and its metadata (such as which columns are numerical and which are categorical), then outputs k-fold cross-validation datasets with natural splines applied to the numerical features, in order to capture non-linear relationships among them. Within this function, the numerical features and the target variable are normalized, and the columns are reordered as (numerical features, categorical features, target).
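For intuition, the spline expansion of a single numerical feature might look like the following (an illustration using splines::ns; whether data_process uses this exact constructor internally is an assumption):

# Illustration: natural-spline expansion of one numerical feature
library(splines)
set.seed(1)
x <- rnorm(100)          # one numerical feature
basis <- ns(x, df = 5)   # expand into 5 spline columns (cf. spline_num = 5)
dim(basis)               # 100 x 5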
data_process( X, y, num_col, cat_col, y_col, k = 5, validation_rate = 0.2, spline_num = 5, random_seed = 1 )
X: Feature columns of the dataset.
y: Target column of the dataset.
num_col: Indices of the columns that are numerical features.
cat_col: Indices of the columns that are categorical features.
y_col: Index of the column that is the response.
k: Number of folds for the cross-validation split. By default 'k = 5'.
validation_rate: Proportion of the training data held out for validation. By default 'validation_rate = 0.2'.
spline_num: Degrees of freedom for the natural splines. By default 'spline_num = 5'.
random_seed: Random seed for the cross-validation data split. By default 'random_seed = 1'.
The function generates cross-validation datasets suitable for the PIE model. It contains training, validation, and testing datasets, as well as the group indicator for the group lasso. When 'k = 5', each training/testing split is 80/20. When 'validation_rate = 0.2', 20% of the training data is held out for validation. Setting 'validation_rate = 0' generates only training and testing data, without validation data.
A list containing:
spl_train_X: A list of splined training datasets in which every numerical feature is expanded into 'spline_num' columns. The list has 'k' elements, one per fold.
orig_train_X: A list of original training datasets in which the numerical features remain in their original format. The list has 'k' elements, one per fold.
train_y: A list of target vectors for the training datasets. The list has 'k' elements, one per fold.
spl_validation_X: A list of splined validation datasets in which every numerical feature is expanded into 'spline_num' columns. The list has 'k' elements, one per fold. NULL when 'validation_rate = 0'.
orig_validation_X: A list of original validation datasets in which the numerical features remain in their original format. The list has 'k' elements, one per fold. NULL when 'validation_rate = 0'.
validation_y: A list of target vectors for the validation datasets. The list has 'k' elements, one per fold. NULL when 'validation_rate = 0'.
spl_test_X: A list of splined testing datasets in which every numerical feature is expanded into 'spline_num' columns. The list has 'k' elements, one per fold.
orig_test_X: A list of original testing datasets in which the numerical features remain in their original format. The list has 'k' elements, one per fold.
test_y: A list of target vectors for the testing datasets. The list has 'k' elements, one per fold.
lasso_group: A vector of consecutive integers describing the grouping of the coefficients.
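As an illustration of this grouping (a hypothetical layout, not the exact output of data_process: here two numerical features are each expanded into spline_num = 5 spline columns, followed by one single-column categorical feature):

# Hypothetical grouping: columns 1-5 form group 1, columns 6-10 form
# group 2, and the categorical column forms group 3 on its own.
lasso_group <- c(rep(1, 5), rep(2, 5), 3)
lasso_group
# [1] 1 1 1 1 1 2 2 2 2 2 3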
# Load the training data
data("winequality")
# Which columns are numerical?
num_col <- 1:11
# Which columns are categorical?
cat_col <- 12
# Which column is the response?
y_col <- ncol(winequality)
# Data processing (the first 200 rows are sampled for demonstration)
dat <- data_process(X = as.matrix(winequality[1:200, -y_col]),
                    y = winequality[1:200, y_col],
                    num_col = num_col, cat_col = cat_col, y_col = y_col)
This function calculates the mean absolute error (MAE) between the predicted values and the true values. The formula for MAE is:

MAE = (1/n) * sum_i |pred_i - true_i|,

where n is the number of observations.
MAE(pred, true_label)
pred: The predicted values of the dataset.
true_label: The actual target values of the dataset.
A numeric value representing the mean absolute error (MAE).
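A minimal sketch of this computation (for illustration only; the package's MAE() should be used in practice):

mae_sketch <- function(pred, true_label) {
  # mean absolute deviation between predictions and true targets
  mean(abs(pred - true_label))
}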
# Load the training data
data("winequality")
# Which columns are numerical?
num_col <- 1:11
# Which columns are categorical?
cat_col <- 12
# Which column is the response?
y_col <- ncol(winequality)
# Data processing (the first 200 rows are sampled for demonstration)
dat <- data_process(X = as.matrix(winequality[1:200, -y_col]),
                    y = winequality[1:200, y_col],
                    num_col = num_col, cat_col = cat_col, y_col = y_col)
# Fit a PIE model
fold <- 1
fit <- PIE_fit(X = dat$spl_train_X[[fold]],
               y = dat$train_y[[fold]],
               lasso_group = dat$lasso_group,
               X_orig = dat$orig_train_X[[fold]],
               lambda1 = 0.01, lambda2 = 0.01,
               iter = 5, eta = 0.05, nrounds = 200)
# Prediction
pred <- predict(fit,
                X = dat$spl_validation_X[[fold]],
                X_orig = dat$orig_validation_X[[fold]])
# Validation
val_mae_test <- MAE(pred$total, dat$validation_y[[fold]])
The PIE package implements the novel Partially Interpretable Estimators (PIE) framework introduced by Wang et al. <arxiv:2105.02410>. This framework jointly trains an interpretable model and a black-box model to achieve high predictive performance as well as partial model transparency.
- predict.PIE()
: Main function for generating predictions from a fitted PIE model on new data.
- PIE_fit()
: Main function for training a PIE model on a dataset.
- data_process()
: Processes data into the format required by the PIE model.
- sparsity_count()
: Counts the number of features used by the group lasso.
- RPE()
: Evaluates the RPE of a PIE model.
- MAE()
: Evaluates the MAE of a PIE model.
For more details, see the documentation for individual functions.
Maintainer: Jingyi Yang [email protected]
Authors:
Tong Wang
Yunyi Li
Boxiang Wang
Partially Interpretable Estimators (PIE) jointly train an interpretable model and a black-box model to achieve high predictive performance as well as partial model transparency. PIE is designed to attribute a prediction to contributions from individual features via a linear additive model, achieving interpretability, while complementing the prediction with a black-box model to boost predictive performance. Experimental results show that PIE achieves accuracy comparable to state-of-the-art black-box models on tabular data. In addition, the understandability of PIE is close to that of linear models, as validated via human evaluations.
PIE_fit(X, y, lasso_group, X_orig, lambda1, lambda2, iter, eta, nrounds, ...)
X: A matrix of dataset features with numerical splines.
y: A vector of target labels for the dataset.
lasso_group: A vector indicating the group membership of the coefficients.
X_orig: A matrix of dataset features without numerical splines.
lambda1: A numeric value for the group lasso penalty. The larger the value, the stronger the penalty.
lambda2: A numeric value for the black-box model. The larger the value, the larger the contribution of the XGBoost model.
iter: The number of training iterations.
eta: The learning rate of the XGBoost model.
nrounds: The number of boosting rounds of the XGBoost model.
...: Additional arguments passed to the XGBoost function.
The PIE_fit function trains the PIE model on a training dataset by jointly fitting an interpretable model and a black-box model, achieving high predictive performance as well as partial model transparency.
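As a rough illustration of this joint training, here is a conceptual sketch of one plausible alternating scheme (an assumption for intuition only: the helper name pie_sketch, the plain linear model standing in for the group lasso, and the treatment of lambda2 as a weight on the black-box contribution are not taken from the package source):

library(xgboost)
# Conceptual sketch: alternately fit a linear white box on the black box's
# residuals, and an XGBoost black box on the white box's residuals.
pie_sketch <- function(X, y, iter = 5, eta = 0.05, nrounds = 200, lambda2 = 0.01) {
  # X: numeric feature matrix; y: numeric response
  black_pred <- rep(0, length(y))
  for (t in seq_len(iter)) {
    wfit <- lm.fit(cbind(1, X), y - black_pred)            # white box
    white_pred <- drop(cbind(1, X) %*% wfit$coefficients)
    dtrain <- xgb.DMatrix(data = X, label = y - white_pred)
    bfit <- xgb.train(params = list(eta = eta, objective = "reg:squarederror"),
                      data = dtrain, nrounds = nrounds, verbose = 0)
    black_pred <- lambda2 * predict(bfit, X)               # black box
  }
  list(white_coef = wfit$coefficients, black_model = bfit, lambda2 = lambda2)
}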
An object of class PIE containing the following components:
Betas: The coefficients of the group lasso model.
Trees: The coefficients of the XGBoost trees.
rrMSE_fit: A matrix containing, for each iteration, the evaluation of the group lasso component against y and the evaluation of the full model against y.
GAM_pred: A matrix containing the contribution of the group lasso component at each iteration.
Tree_pred: A matrix containing the contribution of the XGBoost model at each iteration.
best_iter: The index of the best iteration.
lambda1: The 'lambda1' value used in fitting.
lambda2: The 'lambda2' value used in fitting.
# Load the training data
data("winequality")
# Which columns are numerical?
num_col <- 1:11
# Which columns are categorical?
cat_col <- 12
# Which column is the response?
y_col <- ncol(winequality)
# Data processing (the first 200 rows are sampled for demonstration)
dat <- data_process(X = as.matrix(winequality[1:200, -y_col]),
                    y = winequality[1:200, y_col],
                    num_col = num_col, cat_col = cat_col, y_col = y_col)
# Fit a PIE model
fold <- 1
fit <- PIE_fit(X = dat$spl_train_X[[fold]],
               y = dat$train_y[[fold]],
               lasso_group = dat$lasso_group,
               X_orig = dat$orig_train_X[[fold]],
               lambda1 = 0.01, lambda2 = 0.01,
               iter = 5, eta = 0.05, nrounds = 200)
This function predicts the response of a PIE object using new data.
## S3 method for class 'PIE'
predict(object, X, X_orig, ...)
object: A fitted PIE object.
X: A matrix of the dataset with features expanded using numerical splines.
X_orig: A matrix of the dataset with the original features, without numerical splines.
...: Not used. Other arguments passed to predict.
Make Predictions for PIE
This function predicts the response of a PIE object.
The predict method generates predictions for a dataset given the coefficients of the group lasso model and the XGBoost trees.
A list containing:
total: The predicted values of the whole model for the given features.
white_box: The contribution of the group lasso component for the given features.
black_box: The contribution of the XGBoost model for the given features.
# Load the training data
data("winequality")
# Which columns are numerical?
num_col <- 1:11
# Which columns are categorical?
cat_col <- 12
# Which column is the response?
y_col <- ncol(winequality)
# Data processing (the first 200 rows are sampled for demonstration)
dat <- data_process(X = as.matrix(winequality[1:200, -y_col]),
                    y = winequality[1:200, y_col],
                    num_col = num_col, cat_col = cat_col, y_col = y_col)
# Fit a PIE model
fold <- 1
fit <- PIE_fit(X = dat$spl_train_X[[fold]],
               y = dat$train_y[[fold]],
               lasso_group = dat$lasso_group,
               X_orig = dat$orig_train_X[[fold]],
               lambda1 = 0.01, lambda2 = 0.01,
               iter = 5, eta = 0.05, nrounds = 200)
# Prediction
pred <- predict(fit,
                X = dat$spl_validation_X[[fold]],
                X_orig = dat$orig_validation_X[[fold]])
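Continuing the example above: if, as the component names suggest, the total prediction decomposes additively into the two parts (an assumption, not stated explicitly in this documentation), it can be checked directly:

# Assumed additive decomposition: total = white_box + black_box
all.equal(pred$total, pred$white_box + pred$black_box)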
This function takes predicted values and target values to evaluate the performance of a PIE model. The formula for RPE is:

RPE = sqrt( sum_i (pred_i - true_i)^2 / sum_i (true_i - mean(true))^2 ),

where mean(true) denotes the mean of the true target values.
RPE(pred, true_label)
pred: The predicted values of the dataset.
true_label: The actual target values of the dataset.
A numeric value representing the relative prediction error (RPE).
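A minimal sketch matching the formula above (an illustration under that definition only; the package's RPE() should be used in practice):

rpe_sketch <- function(pred, true_label) {
  # prediction error relative to the spread of the true targets
  sqrt(sum((pred - true_label)^2) / sum((true_label - mean(true_label))^2))
}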
# Load the training data
data("winequality")
# Which columns are numerical?
num_col <- 1:11
# Which columns are categorical?
cat_col <- 12
# Which column is the response?
y_col <- ncol(winequality)
# Data processing (the first 200 rows are sampled for demonstration)
dat <- data_process(X = as.matrix(winequality[1:200, -y_col]),
                    y = winequality[1:200, y_col],
                    num_col = num_col, cat_col = cat_col, y_col = y_col)
# Fit a PIE model
fold <- 1
fit <- PIE_fit(X = dat$spl_train_X[[fold]],
               y = dat$train_y[[fold]],
               lasso_group = dat$lasso_group,
               X_orig = dat$orig_train_X[[fold]],
               lambda1 = 0.01, lambda2 = 0.01,
               iter = 5, eta = 0.05, nrounds = 200)
# Prediction
pred <- predict(fit,
                X = dat$spl_validation_X[[fold]],
                X_orig = dat$orig_validation_X[[fold]])
# Validation
val_rrmse_test <- RPE(pred$total, dat$validation_y[[fold]])
This function counts the number of features used by the group lasso component of a PIE model.
sparsity_count(Betas, lasso_group)
Betas: The coefficients of the group lasso model.
lasso_group: The group indicator for the group lasso model.
An integer: the number of features used in the group lasso.
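A minimal sketch of the counting logic implied above (an assumption, not the package source; it presumes Betas aligns element-wise with lasso_group):

sparsity_sketch <- function(Betas, lasso_group) {
  # count the groups that contain at least one nonzero coefficient
  length(unique(lasso_group[Betas != 0]))
}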
# Load the training data data("winequality") # Which columns are numerical? num_col <- 1:11 # Which columns are categorical? cat_col <- 12 # Which column is the response? y_col <- ncol(winequality) # Data Processing (the first 200 rows are sampled for demonstration) dat <- data_process(X = as.matrix(winequality[1:200, -y_col]), y = winequality[1:200, y_col], num_col = num_col, cat_col = cat_col, y_col = y_col) # Fit a PIE model fold <- 1 fit <- PIE_fit( X = dat$spl_train_X[[fold]], y = dat$train_y[[fold]], lasso_group = dat$lasso_group, X_orig = dat$orig_train_X[[fold]], lambda1 = 0.01, lambda2 = 0.01, iter = 5, eta = 0.05, nrounds = 200 ) # Sparsity count sparsity_count(fit$Betas, dat$lasso_group)
This dataset contains 5197 data points related to the Portuguese 'Vinho Verde' wine. The input variables are based on physicochemical tests. The dataset is also available as [Wine Quality](https://archive.ics.uci.edu/dataset/186/wine+quality) in the UC Irvine Machine Learning Repository.
data(winequality)
Wine Quality Data
A benchmark data set.
A matrix with 5197 rows and 13 columns. The first 11 columns are numerical variables, the 12th column is a categorical variable, and the last column is the response.
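A quick way to confirm this layout:

data(winequality)
dim(winequality)   # 5197 rows, 13 columns
head(winequality)  # 11 numerical columns, 1 categorical column, 1 response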
The data were introduced in Cortez et al. (2009).
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). “Modeling wine preferences by data mining from physicochemical properties,” Decision Support Systems, 47(4), 547-553.
# Load the PIE library
library(PIE)
# Load the dataset
data(winequality)