DATA MINING
Desktop Survival Guide by Graham Williams
Today, generalised linear regression is often used to fit a linear model to data. As the name suggests, it generalises the applicability of linear regression to target variables with distributions other than the assumed normal (gaussian) distribution.
The algorithm for generalised linear regression iteratively fits linear regression models to the data (iteratively reweighted least squares). Rather than transforming the target variable itself, a link function transforms its expected value so that it can be modelled as a linear combination of the input variables.
The generalised linear regression algorithm is parameterised by the distribution of the target variable and a link function that relates the mean of the target variable to the input variables. Together, these two choices describe what we often refer to as a family. The Rattle interface for building a Linear model offers a choice of families.
For a continuous numeric target, a traditional linear regression is performed to fit a model to the data. Whilst this uses the lm command in R, it is essentially the same as using the more general glm command with the normal (gaussian) distribution and the identity link function. The lm command is used if the target is numeric and continuous and no family is chosen; if a family is chosen, then glm is used. For gaussian(identity) modelling, lm is more efficient and so it is the default.
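We can illustrate the equivalence with a minimal sketch in R. The weather data frame and its variables here are synthetic, purely for illustration:

## Synthetic data purely for illustration.
set.seed(42)
weather <- data.frame(humidity=runif(100, 20, 100),
                      pressure=runif(100, 990, 1030))
weather$temp <- 30 - 0.1*weather$humidity + rnorm(100)

## A traditional linear regression.
fit.lm <- lm(temp ~ humidity + pressure, data=weather)

## The equivalent generalised linear model: the gaussian
## distribution with the identity link function.
fit.glm <- glm(temp ~ humidity + pressure, data=weather,
               family=gaussian(link="identity"))

## The coefficients agree; lm is simply the more efficient
## fitting procedure for this family.
coef(fit.lm)
coef(fit.glm)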
If the target variable has just two possible outcomes, then clearly the distribution of the target variable is no longer normal. A family with an appropriate link function is then chosen so that a linear model can still be built.
Logistic regression is a common option for a generalised linear regression model builder. An alternative that gives very similar results (though often with smaller coefficients) is probit regression.
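Both are expressed through glm's binomial family, differing only in the link function. A sketch, using a synthetic audit data frame (the variable names are hypothetical):

## Synthetic binary-outcome data purely for illustration.
set.seed(42)
audit <- data.frame(age=runif(200, 18, 80),
                    income=runif(200, 20000, 100000))
audit$adjusted <- rbinom(200, 1, plogis(-3 + 0.05*audit$age))

## Logistic regression: the logit link.
glm(adjusted ~ age + income, data=audit,
    family=binomial(link="logit"))

## Probit regression: a different link, giving similar fits,
## typically with smaller coefficients.
glm(adjusted ~ age + income, data=audit,
    family=binomial(link="probit"))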
Logistic regression is the traditional statistical approach, and indeed it can produce good models, as evidenced in the risk chart here. As noted in Section 26.1, though, logistic regression has not always been found to produce good models. Nonetheless, here we see a very good model that gives an area under the curve of 80% for both Revenue and Adjustments, and at the 50% caseload we recover 94% of the cases requiring adjustment and 95% of the revenue associated with the adjusted cases.
For best results it is often a good idea to scale the numeric input variables to have a mean of 0 and a standard deviation of 1.
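In R this standardisation can be done with the scale function, continuing the synthetic audit data from the sketch above:

## Standardise each numeric input to mean 0, standard deviation 1.
audit$age    <- as.numeric(scale(audit$age))
audit$income <- as.numeric(scale(audit$income))

Scaling does not change the fitted model's predictions for these families, but it puts the coefficients on a comparable scale and can help the iterative fitting converge.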
When building a linear regression model to predict a continuous numeric target variable (and thus using the lm command underneath), the Rattle summary will include the R-squared measure. This is the proportion of the variation in the target variable that is explained by the input variables. Higher values are better: 0 means the input variables explain none of the variation in the target variable, while 1 means the variation is completely explained. An R-squared of 0.371, for example, indicates that only 37.1% of the variation was explained, not a particularly high value. The Adjusted R-squared is similar but is calculated from variances (mean squares) rather than raw sums of squares, and so takes into account the number of input variables, penalising needlessly complex models.
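Both measures can be read off the lm summary, continuing the synthetic weather sketch above:

fit <- lm(temp ~ humidity + pressure, data=weather)
s <- summary(fit)
s$r.squared       ## proportion of variation explained
s$adj.r.squared   ## adjusted for the number of inputs

## The adjustment penalises extra inputs: with n observations
## and p input variables,
## adjusted R-squared = 1 - (1 - R^2)*(n - 1)/(n - p - 1).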
We cover each type of regression available in Rattle separately.
Linear regression tends to have high bias but low variance; that is, it produces stable models.