Advanced analytics and predictive modelling

Using Random Forest for predictive modelling and defining importance in customer satisfaction surveys

Disadvantages of commonly used methods for defining importance

Commonly used methods for calculating derived importance include correlation analysis, analysis of variance (ANOVA) and regression analysis. The main disadvantage of these methods is how poorly they cope with missing values, categorical variables and multicollinearity.

Advantages of Random Forest Analysis

Models can easily handle both continuous and categorical variables.
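
As a minimal sketch with scikit-learn (one common implementation, which needs categorical levels encoded numerically, unlike e.g. R's randomForest, which accepts factors directly), fitting on a hypothetical survey with one continuous and one categorical driver might look like this; all column names and values are made up:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical survey data: one continuous and one categorical driver.
df = pd.DataFrame({
    "wait_minutes": [3, 25, 8, 40, 5, 12, 30, 2],
    "channel": ["web", "phone", "web", "store",
                "store", "phone", "web", "store"],
    "satisfied": [1, 0, 1, 0, 1, 1, 0, 1],
})

model = Pipeline([
    # One-hot encode the categorical driver, pass the continuous one through.
    ("encode", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["channel"])],
        remainder="passthrough")),
    ("forest", RandomForestClassifier(n_estimators=200, random_state=0)),
])
model.fit(df[["wait_minutes", "channel"]], df["satisfied"])
print(model.predict(pd.DataFrame({"wait_minutes": [10], "channel": ["phone"]})))
```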

Models can effectively capture interactions between predictors, as the sketch following the next point illustrates.

Models allow for nonlinear dependencies.
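
The previous two points can be checked on synthetic data: below, the outcome is a pure interaction of two predictors with no main effects, so a linear model scores near chance while the forest recovers the pattern (a sketch; the accuracies are indicative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(4000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # pure interaction, no main effects

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
linear = LogisticRegression().fit(X_tr, y_tr)

print("forest test accuracy:", forest.score(X_te, y_te))  # close to 1.0
print("linear test accuracy:", linear.score(X_te, y_te))  # close to 0.5
```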

Models can handle a large number of predictors. Classical methods often suffer from the curse of dimensionality, resulting in poor accuracy on validation samples.
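
As a rough sketch of this point, a forest cross-validated on synthetic data with 300 predictors, only 10 of which are informative, still scores well above chance:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Many predictors, few of them informative: a regime where unpenalised
# classical models tend to overfit and validate poorly.
X, y = make_classification(n_samples=500, n_features=300,
                           n_informative=10, random_state=0)
forest = RandomForestClassifier(n_estimators=300, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())  # typically well above 0.5
```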

Predictions remain possible even when an important predictor has a missing value: the algorithm estimates missing values internally. With classical methods, data are typically imputed beforehand, independently of the model.
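
Breiman's original implementation estimates missing values from case proximities (rfImpute in R's randomForest package). Older scikit-learn releases reject NaN in forests, so one common approximation is a missForest-style iterative imputation with a tree ensemble, chained in front of the forest; the sketch below uses made-up numbers:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import ExtraTreesRegressor, RandomForestClassifier
from sklearn.impute import IterativeImputer
from sklearn.pipeline import make_pipeline

# Hypothetical predictors with holes in both columns.
X = np.array([[3.0, 1.0], [np.nan, 0.0], [8.0, 1.0], [40.0, np.nan],
              [5.0, 1.0], [12.0, 0.0], [np.nan, 1.0], [2.0, 0.0]])
y = np.array([1, 0, 1, 0, 1, 1, 0, 1])

model = make_pipeline(
    # missForest-style imputation: each column modelled from the others
    # with a tree ensemble, iterated until convergence.
    IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=50,
                                                   random_state=0)),
    RandomForestClassifier(n_estimators=200, random_state=0),
)
model.fit(X, y)
print(model.predict([[np.nan, 1.0]]))  # prediction despite a missing predictor
```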

Models produced by Random Forest are not affected by extreme values or by skewed, heavy-tailed distributions in continuous variables. In fact, they are invariant under monotone transformations of the predictors.
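
A quick check of the invariance claim on synthetic data: training one forest on a predictor and another on its logarithm should yield the same predictions at the observed points, because tree splits depend only on the ordering of values:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(500, 1))             # strictly positive predictor
y = np.sin(X[:, 0] / 10) + rng.normal(0, 0.1, 500)

raw = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
logt = RandomForestRegressor(n_estimators=100, random_state=0).fit(np.log(X), y)

# Both forests partition the observations identically, so they agree
# at the observed points.
print(np.allclose(raw.predict(X), logt.predict(np.log(X))))  # expected: True
```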

You can depict variable importance and the marginal effect of each predictor, for example via partial dependence plots.
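
With scikit-learn, for instance, impurity-based importances come free with the fitted forest and marginal effects can be drawn as partial dependence plots (synthetic data for illustration; permutation importance is a common, more robust alternative):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay

X, y = make_classification(n_samples=1000, n_features=5,
                           n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Impurity-based importance of each predictor.
for i, imp in enumerate(forest.feature_importances_):
    print(f"feature {i}: {imp:.3f}")

# Marginal effect of the first two predictors (partial dependence).
PartialDependenceDisplay.from_estimator(forest, X, features=[0, 1])
plt.show()
```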

Random Forest has proven to be one of the most successful machine learning methods in real-life applications. It is, of course, only one technique within advanced analytics and predictive modelling, which also draw on:

  • Analysis of variance and experimental design (multifactor experiments, experiments with empty cells, ...)
  • Regression techniques (estimation and hypothesis tests, polynomial, robust, nonlinear, nonparametric, variable selection, multicollinearity...)
  • Categorical data analysis (contingency tables, tests of independence, conjoint analysis combined with hierarchical Bayesian analysis, logit models, multicategory logit models, loglinear models...)
  • Multivariate data analysis (PCA, factor analysis, confirmatory factor analysis, cluster analysis, discriminant analysis, logistic regression, canonical correlation, MANOVA...)
  • Clustering techniques (K-means, Hierarchical Cluster Analysis …)
  • Repeated measurements (general multivariate model, general linear mixed-effects model, generalized linear mixed models, missing data...)
  • Feature reduction (PCA, factor analysis, Kruskal-Wallis, Fisher criterion …)
  • Classification of small, high-dimensional data sets (support vector machines, random forests, …)