DATA MINING
Desktop Survival Guide by Graham Williams
Consider the situation of customer churn. Note that customers who have not churned have, more precisely, not yet churned! They may churn in the future, and we don't know when. Data of this kind is called censored data, and survival analysis is the tool used to handle it.
Survival analysis can be viewed as a regression in which the response is a time variable, and each time has an associated event status recording whether the event occurred.
Survival analysis is the analysis of the time to an event. Methods used for survival analysis take into account the fact that we only have partial information available to us. The partial information for customer 2, for example, is that we know they have been with us for 5 months, but we don't know whether they are just about to churn or not.
Time-to-event modelling often uses survival analysis. A standard reference is Klein and Moeschberger, 2003, Survival Analysis: Techniques for Censored and Truncated Data, Second Edition, Springer. The examples below illustrate steps from Applied Survival Analysis, by Hosmer and Lemeshow, 2008.
Survival analysis models the time to the occurrence of an event (e.g., time to death, time to failure, time to lodgment, time to churn). It is particularly useful when we have censored observations. The general approach introduces a survival function S(t) and a hazard rate function h(t). These describe the status of an entity's survival during the period of observation. The survival function S(t) gives the probability of surviving beyond a certain point t. The hazard rate function h(t) gives the instantaneous risk of non-survival (i.e., death, churn, lodgment, failure) at time t given survival to time t.
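The two functions described above can be written down formally. The following is a sketch of the usual textbook definitions, assuming the time to the event is a continuous random variable; the symbols T (the event time) and f (its density) are notation introduced here for illustration and do not appear in the text.

```latex
% T: random time to the event (assumed continuous), with density f(t).
\begin{align}
  S(t) &= P(T > t) \\
  h(t) &= \lim_{\Delta t \to 0}
          \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
        = \frac{f(t)}{S(t)}
        = -\frac{d}{dt} \log S(t)
\end{align}
```

Under these assumptions either function determines the other, since S(t) = exp(-\int_0^t h(u) du).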
Data is usually recorded as: start time, stop time, event status (1 = event occurred, 0 = event did not occur). An alternative format records only: time to event, status. The latter format is generally used here.
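As a minimal sketch of the two formats, the following builds a small data frame in each. The data values and the names churn_ss, churn, and censored are invented here purely for illustration; only the column meanings come from the description above.

```r
# Hypothetical churn data, invented for illustration only.
# Start/stop format: an observation window plus an event status.
churn_ss <- data.frame(
  customer = 1:5,
  start    = c(0, 0, 0, 0, 0),     # month the customer joined
  stop     = c(12, 5, 8, 3, 10),   # month of churn or of last observation
  status   = c(1, 0, 1, 1, 0)      # 1 = event occurred, 0 = did not occur
)

# Time-to-event format: a single duration plus the same status.
churn <- data.frame(
  customer = churn_ss$customer,
  time     = churn_ss$stop - churn_ss$start,
  status   = churn_ss$status
)

# Rows with status 0 are the censored observations: we know only that
# the customer had not churned by the recorded time.
censored <- churn[churn$status == 0, ]

print(churn)
print(censored)
```

In this made-up data, customer 2 is recorded with a time of 5 and a status of 0, which is how a customer who has been with us for 5 months without churning would appear.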
In R we first create a Surv object using the Surv function from the survival package.
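A minimal sketch of this step, assuming the time-to-event format and a small made-up data frame (the values and the names churn and s are hypothetical, chosen only for illustration):

```r
library(survival)

# Hypothetical data in time-to-event format (invented for illustration).
churn <- data.frame(
  time   = c(12, 5, 8, 3, 10),  # months observed
  status = c(1, 0, 1, 1, 0)     # 1 = event occurred, 0 = censored
)

# Combine the time and the event status into a single Surv object.
s <- Surv(churn$time, churn$status)

# When printed, censored times are marked with a trailing "+".
print(s)

# The object stores the time and status side by side.
print(is.Surv(s))
print(s[, "time"])
print(s[, "status"])
```

The resulting object can then be used as the response in model formulas provided by the survival package.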