Dr. Torsten Hothorn has been on quite a run lately working on Prodigy “Big Data” Challenges. Recently, he won the $30,000 Cleveland Clinic Challenge, Build an Efficient Pipeline to Find the Most Powerful Predictors, and he earned a $10,000 award for his second-place finish in The DREAM-Phil Bowen ALS Prediction Prize4Life Challenge. (We recently profiled Lilly Fang, a member of one of the two first-place teams in the Prize4Life Challenge, and also posted a Seeker Spotlight featuring Prize4Life’s Neta Zach that dives into the background of the Challenge and its final results.) We’re happy to have Dr. Hothorn here to discuss his experience with these important Challenges.
I am a Professor of Biostatistics in the Department of Statistics at the University of Munich, Germany, and I’m interested in both methodological developments and applications of statistical models in medicine and biology. Research and teaching in Biostatistics ideally bring together practical problems and statistical theory. While I mainly teach students of statistics, I enjoy working with scientists from fields as diverse as oncology, ecology, and forestry. Because a statistical model is only useful when it can actually be applied to gain insights into aspects of data that would otherwise remain hidden, I spend a lot of time developing and implementing new statistical models. Some of the open source software packages to which I have contributed are distributed via CRAN, the R package repository.
Developing statistical software always means pushing forward existing functionality. One of the most effective ways to find out where improvements are needed most is to work on practical problems, apply the software, and look at the results. While I’m not short of collaborators with interesting problems, I decided to give one of the Cleveland Clinic Challenges hosted by InnoCentive a shot when I first learned about InnoCentive in the fall of 2011. In 2004 and 2006, I authored two scholarly papers about nonparametric survival models that also work in the presence of numerous potentially predictive variables. The Cleveland Clinic Challenge, “Build an Efficient Pipeline to Find the Most Powerful Predictors,” was an exact match for the models that I developed and described in these two papers. Luckily, I had already invested a fair amount of time in a software implementation, the R add-on package “party,” and thus the solution was (almost) at my fingertips.
I must admit that “The DREAM-Phil Bowen ALS Prediction Prize4Life Challenge” was a little more challenging than I first thought. With the patient data coming from different clinical trials, it took a while to compile the data into a format suitable for statistical analysis. The relatively complex longitudinal structure of the data, the expected weak association between predictor variables and ALS disease progression, and the large number of missing values in some of the potentially interesting predictor variables suggested that a nonparametric regression approach (e.g., random forests) might be a good candidate for a solution. However, the Challenge data gave me a hard time predicting ALS disease progression with good accuracy. Eventually, I went back and started from scratch. First, I slightly reformulated the Challenge objective by using an alternative statistical measure to describe a patient’s disease progression. In a second step, I collected as much information as I could about the disease progression in the first three months in which a patient was under observation. I observed that using these variables as predictors of the new ALS disease progression measure led to better-performing models.
Besides my interest in applying software that I developed and the thrill of competing with people from all over the world in this prediction Challenge – the InnoCentive leaderboard is really something one can get addicted to – I look forward to using the PRO-ACT database (the Prize4Life Challenge was based on a subset of it) in the classroom. Next spring, I’ll teach longitudinal data analysis and I intend to let my students work with the ALS patient data. That way, my students will be constantly reminded of what the models and formulae presented on the blackboard are actually good for and what scientific obligation to society actually means to a statistician.