Regularized gene selection in cancer microarray meta-analysis
Shuangge Ma, Jian Huang|BMC Bioinformatics|2009
BACKGROUND: In cancer studies, it is common that multiple microarray experiments are conducted to measure the same clinical outcome and expressions of the same set of genes. An important goal of such experiments is to identify a subset of genes that can potentially serve as predictive markers for cancer development and progression. Analyses of individual experiments may lead to unreliable gene selection results because of the small sample sizes. Meta analysis can be used to pool multiple experiments, increase statistical power, and achieve more reliable gene selection. The meta analysis of cancer microarray data is challenging because of the high dimensionality of gene expressions and the differences in experimental settings amongst different experiments. RESULTS: We propose a Meta Threshold Gradient Descent Regularization (MTGDR) approach for gene selection in the meta analysis of cancer microarray data. The MTGDR has many advantages over existing approaches. It allows different experiments to have different experimental settings. It can account for the joint effects of multiple genes on cancer, and it can select the same set of cancer-associated genes across multiple experiments. Simulation studies and analyses of multiple pancreatic and liver cancer experiments demonstrate the superior performance of the MTGDR. CONCLUSION: The MTGDR provides an effective way of analyzing multiple cancer microarray studies and selecting reliable cancer-associated genes.
Variable selection in nonparametric additive models
We consider a nonparametric additive model of a conditional mean function in which the number of variables and additive components may be larger than the sample size but the number of nonzero additive components is "small" relative to the sample size. The statistical problem is to determine which additive components are nonzero. The additive components are approximated by truncated series expansions with B-spline bases. With this approximation, the problem of component selection becomes that of selecting the groups of coefficients in the expansion. We apply the adaptive group Lasso to select nonzero components, using the group Lasso to obtain an initial estimator and reduce the dimension of the problem. We give conditions under which the group Lasso selects a model whose number of components is comparable with the underlying model, and the adaptive group Lasso selects the nonzero components correctly with probability approaching one as the sample size increases and achieves the optimal rate of convergence. The results of Monte Carlo experiments show that the adaptive group Lasso procedure works well with samples of moderate size. A data example is used to illustrate the application of the proposed method.
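For orientation, the two-step procedure described in this abstract can be written out as a criterion function. The following is a sketch in assumed notation, not taken from the abstract: $\phi_k$ denotes the B-spline basis functions, $m_n$ the number of basis functions per component, $\beta_j$ the coefficient group of component $j$, and $\tilde\beta$ the initial group Lasso fit. The inverse-norm form of the weights is likewise an assumption, chosen because it is the usual one for adaptive penalties.

```latex
% Series approximation of the j-th additive component (assumed notation)
\[
  f_j(x) \approx \sum_{k=1}^{m_n} \beta_{jk}\,\phi_k(x), \qquad j = 1,\dots,p .
\]
% Adaptive group Lasso: one penalized group of coefficients per component
\[
  \hat\beta = \arg\min_{\beta}\;
  \sum_{i=1}^{n}\Bigl(Y_i - \mu - \sum_{j=1}^{p}\sum_{k=1}^{m_n}
      \beta_{jk}\,\phi_k(X_{ij})\Bigr)^{2}
  + \lambda_n \sum_{j=1}^{p} w_j\,\lVert \beta_j \rVert_2 ,
  \qquad
  w_j = \lVert \tilde\beta_j \rVert_2^{-1} .
\]
```

Under this formulation a component is declared zero exactly when its whole coefficient group $\hat\beta_j$ is zero, and a group already set to zero by the initial estimator receives an infinite weight and stays out of the model.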
Adaptive Lasso for sparse high-dimensional regression models
We study the asymptotic properties of the adaptive Lasso estimators in sparse, high-dimensional, linear regression models when the number of covariates may increase with the sample size. We consider variable selection using the adaptive Lasso, where the L1 norms in the penalty are re-weighted by data-dependent weights. We show that, if a reasonable initial estimator is available, under appropriate conditions, the adaptive Lasso correctly selects covariates with nonzero coefficients with probability converging to one, and that the estimators of nonzero coefficients have the same asymptotic distribution they would have if the zero coefficients were known in advance. Thus, the adaptive Lasso has an oracle property in the sense of Fan and Li (2001) and Fan and Peng (2004). In addition, under a partial orthogonality condition in which the covariates with zero coefficients are weakly correlated with the covariates with nonzero coefficients, marginal regression can be used to obtain the initial estimator. With this initial estimator, the adaptive Lasso has the oracle property even when the number of covariates is much larger than the sample size.
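The re-weighting idea in this abstract can be illustrated with a small sketch. It assumes the criterion $(1/(2n))\lVert y - X\beta\rVert^2 + \lambda\sum_j w_j|\beta_j|$ with weights $w_j = 1/|\tilde\beta_j|$ taken from a marginal-regression initial estimator. The scaling, the weight exponent, the coordinate-descent solver, and all function names are illustrative assumptions, not the authors' implementation. The toy design uses exactly orthogonal columns and a hand-picked noise vector so that the result can be checked by hand.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def marginal_regression(X, y):
    """Initial estimator: regress y on each column of X separately."""
    n, p = len(X), len(X[0])
    coefs = []
    for j in range(p):
        sxy = sum(X[i][j] * y[i] for i in range(n))
        sxx = sum(X[i][j] ** 2 for i in range(n))
        coefs.append(sxy / sxx)
    return coefs


def adaptive_lasso(X, y, lam, init, n_sweeps=100, eps=1e-8):
    """Coordinate descent for (1/(2n))||y - Xb||^2 + lam * sum_j w_j |b_j|.

    Weights are w_j = 1 / |init_j| (assumed form); eps guards against
    division by zero. Passing init = [1, ..., 1] gives the plain Lasso.
    """
    n, p = len(X), len(X[0])
    weights = [1.0 / max(abs(b), eps) for b in init]
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, kept up to date
    for _ in range(n_sweeps):
        for j in range(p):
            # Covariance of column j with the residual that excludes column j
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam * weights[j]) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
    return beta


# Toy data (hypothetical): n = 8 observations, mutually orthogonal +/-1 columns.
n = 8

def walsh(k):
    """Column with entries (-1)^popcount(i & k); distinct k give orthogonal columns."""
    return [(-1) ** bin(i & k).count("1") for i in range(n)]

cols = [walsh(1), walsh(2), walsh(4), walsh(3)]
X = [[c[i] for c in cols] for i in range(n)]
# Hand-picked noise that happens to correlate with the third (null) covariate.
noise = [0.3 * cols[2][i] + 0.1 * walsh(7)[i] for i in range(n)]
y = [3.0 * X[i][0] - 2.0 * X[i][1] + noise[i] for i in range(n)]

lam = 0.2
init = marginal_regression(X, y)
beta_adaptive = adaptive_lasso(X, y, lam, init)
beta_lasso = adaptive_lasso(X, y, lam, [1.0] * len(cols))

selected_adaptive = [j for j, b in enumerate(beta_adaptive) if b != 0.0]
selected_lasso = [j for j, b in enumerate(beta_lasso) if b != 0.0]
print(selected_adaptive)  # [0, 1]
print(selected_lasso)     # [0, 1, 2]
```

In this toy case the spurious covariate has a small initial estimate (about 0.3), so its penalty is inflated to roughly 0.67 and it is removed, while the plain Lasso with the same `lam` retains it. The two strong covariates receive small weights and are shrunk less than under the plain Lasso.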
Asymptotic properties of bridge estimators in sparse high-dimensional regression models
We study the asymptotic properties of bridge estimators in sparse, high-dimensional, linear regression models when the number of covariates may increase to infinity with the sample size. We are particularly interested in the use of bridge estimators to distinguish between covariates whose coefficients are zero and covariates whose coefficients are nonzero. We show that under appropriate conditions, bridge estimators correctly select covariates with nonzero coefficients with probability converging to one and that the estimators of nonzero coefficients have the same asymptotic distribution that they would have if the zero coefficients were known in advance. Thus, bridge estimators have an oracle property in the sense of Fan and Li [J. Amer. Statist. Assoc. 96 (2001) 1348–1360] and Fan and Peng [Ann. Statist. 32 (2004) 928–961]. In general, the oracle property holds only if the number of covariates is smaller than the sample size. However, under a partial orthogonality condition in which the covariates of the zero coefficients are uncorrelated or weakly correlated with the covariates of nonzero coefficients, we show that marginal bridge estimators can correctly distinguish between covariates with nonzero and zero coefficients with probability converging to one even when the number of covariates is greater than the sample size.
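As a reminder of what is being estimated, the bridge criterion and its marginal variant can be sketched as follows. The notation ($x_{ij}$, $\lambda_n$, $\gamma$, $p_n$) is assumed here and does not appear in the abstract. The abstract also does not state the range of the exponent; $0 < \gamma < 1$ is the regime usually associated with variable selection and is an assumption in this sketch.

```latex
% Bridge estimator: least squares with an L_gamma penalty (assumed notation)
\[
  \hat\beta_n = \arg\min_{\beta}\;
  \sum_{i=1}^{n}\Bigl(Y_i - \sum_{j=1}^{p_n} x_{ij}\beta_j\Bigr)^{2}
  + \lambda_n \sum_{j=1}^{p_n} \lvert \beta_j \rvert^{\gamma},
  \qquad \gamma > 0 .
\]
% Marginal bridge estimator: each covariate enters its own least-squares term
\[
  \tilde\beta_n = \arg\min_{\beta}\;
  \sum_{j=1}^{p_n}\sum_{i=1}^{n}\bigl(Y_i - x_{ij}\beta_j\bigr)^{2}
  + \lambda_n \sum_{j=1}^{p_n} \lvert \beta_j \rvert^{\gamma} .
\]
```

Setting $\gamma = 1$ recovers the Lasso and $\gamma = 2$ recovers ridge regression. The marginal criterion separates across coordinates, so each $\beta_j$ can be solved for on its own, which is what makes it usable when the number of covariates exceeds the sample size.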
Efficient estimation for the proportional hazards model with interval censoring
Jian Huang|The Annals of Statistics|1996
The maximum likelihood estimator (MLE) for the proportional hazards model with "case 1" interval censored data is studied. It is shown that the MLE for the regression parameter is asymptotically normal with $\sqrt{n}$ convergence rate and achieves the information bound, even though the MLE for the baseline cumulative hazard function only converges at $n^{1/3}$ rate. Estimation of the asymptotic variance matrix for the MLE of the regression parameter is also considered. To prove our main results, we also establish a general theorem showing that the MLE of the finite-dimensional parameter in a class of semiparametric models is asymptotically efficient even though the MLE of the infinite-dimensional parameter converges at a rate slower than $\sqrt{n}$. The results are illustrated by applying them to a data set from a tumorigenicity study.