Building effective predictive models from high-dimensional data is an important problem in several domains, such as bioinformatics, healthcare analytics and general regression analysis. Such data often contain many correlated features, so feature groups must be extracted automatically before regularizers such as the group lasso can exploit the grouping structure to build effective prediction models. Elastic net, fused lasso and the Octagonal Shrinkage and Clustering Algorithm for Regression (OSCAR) are popular feature grouping methods from the literature that recover both sparsity and feature groups from the data. However, their predictive ability suffers when the regression coefficients of adjacent feature groups are similar but not exactly equal, because these methods erroneously merge such adjacent groups; this is known as the misfusion problem. To solve this problem, in this paper we propose a weighted l1 norm-based approach that recovers feature groups effectively despite the proximity of the coefficients of adjacent feature groups, and thereby builds accurate predictive models. The resulting convex optimization problem is solved using the fast iterative shrinkage-thresholding algorithm (FISTA). We show on synthetic datasets that our approach resolves the misfusion problem more effectively than existing feature grouping methods such as the elastic net, fused lasso and OSCAR. We also evaluate the quality of the model on the real-world breast cancer gene expression and 20-Newsgroups datasets.
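To make the optimization step concrete, the following is a minimal sketch of FISTA applied to a generic weighted l1-penalized least-squares problem. It is an illustration under stated assumptions, not the paper's method: the abstract does not give the exact objective or how the weights are chosen, so the objective `0.5*||X b - y||^2 + lam * sum_j weights[j]*|b[j]|` and the names `fista_weighted_l1`, `soft_threshold`, `weights`, `lam` and `n_iter` are all hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Elementwise soft-thresholding; t may be a scalar or a per-coordinate vector."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def fista_weighted_l1(X, y, weights, lam=0.1, n_iter=500):
    """Sketch: minimize 0.5*||X b - y||^2 + lam * sum_j weights[j]*|b[j]|.

    Assumed objective for illustration only; the paper's actual penalty
    is not specified in the abstract.
    """
    p = X.shape[1]
    # Lipschitz constant of the gradient of the smooth term: largest
    # squared singular value of X.
    L = np.linalg.norm(X, 2) ** 2
    b = np.zeros(p)   # current iterate
    z = b.copy()      # extrapolated (momentum) point
    t = 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ z - y)
        # Proximal gradient step: the prox of the weighted l1 norm is
        # soft-thresholding with per-coordinate thresholds.
        b_next = soft_threshold(z - grad / L, lam * weights / L)
        # Standard FISTA momentum update.
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = b_next + ((t - 1.0) / t_next) * (b_next - b)
        b, t = b_next, t_next
    return b
```

With an identity design matrix the minimizer has a closed form (soft-thresholding of `y`), which gives a quick sanity check; larger entries in `weights` shrink the corresponding coefficients more strongly.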
- Date of publication: December 12, 2016
- IEEE International Conference on Data Mining