Abstract:
In many models a studied variable (response) $Y$ depends on some collection of factors $X=(X_1,\dots,X_n)$. For instance, in medical and biological research $Y$ can characterize the health state of a patient, whereas the components of $X$ describe genetic and nongenetic factors. One of the challenging problems in response analysis is to identify a “significant collection” of factors $(X_{i_1},\dots,X_{i_r})$, $1\le i_1<\dots < i_r\le n$, such that $Y$ depends on it in an essential way (in a certain sense). A number of complementary methods for solving this problem are considered. They include probabilistic and statistical techniques, machine learning and computer simulation. We mention only a few modern methods, such as LASSO, SCAD, BOOST, GARROTE and their modifications. The main attention is paid to the MDR (multifactor dimensionality reduction) method introduced by M. Ritchie et al. in 2001. This method has been applied and developed further in more than 200 publications. The talk is based on a series of seven recent papers by the author, published in Doklady Mathematics (2014), Journal of Multivariate Analysis (2015), Lecture Notes in Mathematics (2015) and others. We emphasize that a new approach is proposed for identification of significant variables. It involves a statistical estimate of the error functional for a response forecast, employing a penalty function and a cross-validation procedure. The approach also handles a nonbinary response. Regularized estimates of this functional are defined and a central limit theorem is established for them. New results on the asymptotic normality of arrays of exchangeable random variables are of independent interest. We also discuss the results of computer simulation showing the effectiveness of the developed approach.
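
To make the idea of a cross-validated, penalized error estimate concrete, the following is a minimal sketch and not the estimator from the cited papers. It assumes a binary response, binary factors, an MDR-style cell rule, and a penalty equal to half the inverse class frequency (so the estimate reduces to a balanced error rate); the regularization mentioned above is not reproduced. All function names, the synthetic data and the choice of five folds are hypothetical.

```python
import itertools
import random
from collections import defaultdict


def fit_cell_rule(X, y, subset, train_idx):
    """MDR-style rule: predict 1 in a cell of the chosen factors when the share
    of y == 1 in that cell exceeds the overall share in the training part."""
    tot = defaultdict(int)
    pos = defaultdict(int)
    for i in train_idx:
        cell = tuple(X[i][j] for j in subset)
        tot[cell] += 1
        pos[cell] += y[i]
    base_rate = sum(y[i] for i in train_idx) / len(train_idx)
    high_risk = {cell for cell in tot if pos[cell] / tot[cell] > base_rate}
    return lambda row: 1 if tuple(row[j] for j in subset) in high_risk else 0


def cv_penalized_error(X, y, subset, n_folds=5):
    """K-fold estimate of the penalized error of the cell rule built on `subset`.
    Assumed penalty: psi(v) = 1 / (2 * P(Y = v)), estimated on the training part."""
    n = len(y)
    folds = [list(range(k, n, n_folds)) for k in range(n_folds)]
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        p1 = sum(y[i] for i in train_idx) / len(train_idx)
        if not 0.0 < p1 < 1.0:
            raise ValueError("both classes must appear in every training part")
        psi = {1: 0.5 / p1, 0: 0.5 / (1.0 - p1)}
        rule = fit_cell_rule(X, y, subset, train_idx)
        fold_error = sum(psi[y[i]] * abs(y[i] - rule(X[i])) for i in test_idx)
        errors.append(fold_error / len(test_idx))
    return sum(errors) / len(errors)


# Synthetic data (hypothetical): Y is the XOR of factors 0 and 3, flipped with
# probability 0.1; the remaining factors are pure noise.
rng = random.Random(0)
n_obs, n_factors = 600, 5
X = [[rng.randint(0, 1) for _ in range(n_factors)] for _ in range(n_obs)]
y = [(row[0] ^ row[3]) if rng.random() > 0.1 else 1 - (row[0] ^ row[3]) for row in X]

# Score every pair of factors and keep the one with the smallest estimated error.
scores = {s: cv_penalized_error(X, y, s)
          for s in itertools.combinations(range(n_factors), 2)}
best = min(scores, key=scores.get)
print(best, round(scores[best], 3))
```

In this toy setting the pair with the smallest estimated error is taken as the candidate “significant collection”. Neither factor of the XOR pair is informative on its own, which is why the search runs over collections rather than over single factors.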