“OK, so we have 300 variables available for our model. We don’t know what most of them are, and we don’t know whether they are stable or suitable for use in the model. We will need our IT department to investigate.
So what we’d like to do is have, say, 8-12 variables in our model, but we’d like to be able to swap in the next best variable if the one we find is unsuitable for implementation.”
This may seem like a contrived example, but it was actually a situation I faced many years ago.
My mind immediately went to cluster analysis, but a cluster analysis of the predictor variables rather than of the observations.
At the time I was using SAS for my data analysis, and thankfully SAS had a procedure called “PROC VARCLUS”.
For a walkthrough, check out https://www.listendata.com/2015/03/proc-varclus-explained.html
The procedure works like this:
1) All variables start in one cluster. Then, a principal components analysis is done on the variables in the cluster to determine whether the cluster should be split into two subsets of variables.
2) If the second eigenvalue for the cluster is greater than the specified cutoff, then the initial cluster is split into two clusters. If the second eigenvalue is large, it means that at least two principal components account for a large amount of variation among the inputs.
3) To determine which inputs are included in each cluster, the principal component scores are rotated obliquely to maximize the correlation within a cluster and minimize the correlation between clusters.
4) This process ends when the second eigenvalues of all current clusters fall below the cutoff.
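The splitting rule in steps 1, 2 and 4 can be sketched in a few lines of Python. This is only a rough illustration, not PROC VARCLUS itself: it uses plain Pearson correlations, it skips the oblique rotation in step 3, and the function names, the toy data and the cutoff of 1.0 are all assumptions made for the example.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    k = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(k)]
        for i in range(k)
    ]


def eigenvalues_symmetric(matrix, max_sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    k = len(a)
    for _ in range(max_sweeps):
        off_diagonal = sum(a[i][j] ** 2 for i in range(k) for j in range(k) if i != j)
        if off_diagonal < tol:
            break
        for p in range(k - 1):
            for q in range(p + 1, k):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Angle that zeroes the (p, q) entry after the rotation.
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for i in range(k):  # rotate columns p and q
                    aip, aiq = a[i][p], a[i][q]
                    a[i][p], a[i][q] = c * aip - s * aiq, s * aip + c * aiq
                for i in range(k):  # rotate rows p and q
                    api, aqi = a[p][i], a[q][i]
                    a[p][i], a[q][i] = c * api - s * aqi, s * api + c * aqi
    return sorted((a[i][i] for i in range(k)), reverse=True)


def should_split(columns, cutoff=1.0):
    """Split a cluster when its second eigenvalue exceeds the cutoff (assumed 1.0)."""
    eigenvalues = eigenvalues_symmetric(correlation_matrix(columns))
    return len(eigenvalues) > 1 and eigenvalues[1] > cutoff


# Toy data: x1 and x2 move together, x3 and x4 move together,
# and the two pairs are nearly unrelated to each other.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9, 7.2, 7.8]
x3 = [1, -1, -1, 1, 1, -1, -1, 1]
x4 = [0.9, -1.1, -0.8, 1.2, 1.0, -0.9, -1.2, 0.8]

print(should_split([x1, x2, x3, x4]))  # True: two distinct groups of variables
print(should_split([x1, x2]))          # False: one component explains the pair
```

Applying the check repeatedly to each resulting cluster, until none of them asks to be split, gives the stopping behaviour described in step 4.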
What this procedure gave me was groups of variables. Within each group, the variables were similar to each other but different from the variables in other groups. The most representative variables, meaning those that belonged most distinctly to their own group, appeared first, and the weaker variables in the group appeared last.
This allowed me to create, I think, 10 groups and include one variable from each group in the model. Where a variable from a group could not be implemented, its next best candidate from the same group was included instead.
I haven’t used SAS for many years; these days I use R and Python.
There is something comparable to PROC VARCLUS in R: the Hmisc package has a function called “varclus”.
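The swap-in logic is simple enough to show as a small Python sketch. The cluster names, variable names and the `pick_model_variables` helper are all hypothetical, made up for illustration; the only thing it assumes is that each group’s variables are already ordered from most to least representative.

```python
def pick_model_variables(ranked_clusters, unsuitable):
    """Pick the best-ranked usable variable from each cluster.

    ranked_clusters maps a cluster name to its variables, ordered from most
    to least representative. unsuitable is the set of variables that cannot
    be implemented. A cluster with no usable variable maps to None.
    """
    return {
        cluster: next((v for v in variables if v not in unsuitable), None)
        for cluster, variables in ranked_clusters.items()
    }


# Hypothetical clusters and variable names, for illustration only.
clusters = {
    "cluster_1": ["income", "salary_band", "net_pay"],
    "cluster_2": ["months_at_address", "months_at_employer"],
}
rejected_by_it = {"income", "salary_band"}

selection = pick_model_variables(clusters, rejected_by_it)
print(selection)  # {'cluster_1': 'net_pay', 'cluster_2': 'months_at_address'}
```

When IT rejects another variable, adding it to the rejected set and rerunning the selection promotes the next candidate in that group without touching the other groups.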
Let’s give it a whirl! Let’s use the German credit dataset for practice:
library(Hmisc)
numeric_data = read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data-numeric")
# as we have many categoricals converted to dummies, let's use spearman correlation
predictor_clusters = varclus(as.matrix(numeric_data), similarity = "spearman")
# plot the hierarchical clusters as a tree
plot(predictor_clusters)

I see anywhere from 5 to 10 clusters in the tree, and the right number obviously needs to be tested. Still, it’s a pretty simple method for clustering predictor variables.
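For anyone following along in Python rather than R, the Spearman similarity mentioned in the code comment can be sketched with the standard library alone. As far as I know, varclus works with the squared Spearman correlation between each pair of variables, but treat that detail, and the helper names below, as assumptions rather than a description of the Hmisc internals.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole block of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


def spearman_squared(x, y):
    """Squared Spearman correlation: Pearson correlation of the ranks, squared."""
    return pearson(average_ranks(x), average_ranks(y)) ** 2


print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
```

Because it works on ranks, the measure treats any monotonic relationship as a strong one, which is why it is a reasonable fit for dummy-coded and ordinal columns.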