Shuo Lei, Xuchao Zhang, Liang Zhao, Arnold Boediardjo

Abstract

In today’s era of big data, robust least-squares regression becomes a more challenging problem when considering the extremely corrupted labels along with explosive growth of datasets. Traditional robust methods can handle the noise but suffer from several challenges when applied in huge dataset including (1) computational infeasibility of handling an entire dataset at once, (2) existence of heterogeneously distributed corruption, and (3) difficulty in corruption estimation when data cannot be entirely loaded. This article proposes online and distributed robust regression approaches, both of which can concurrently address all the above challenges. Specifically, the distributed algorithm optimizes the regression coefficients of each data block via heuristic hard thresholding and combines all the estimates in a distributed robust consolidation. In addition, an online version of the distributed algorithm is proposed to incrementally update the existing estimates with new incoming data. Furthermore, a novel online robust regression method is proposed to estimate under a biased-batch corruption. We also prove that our algorithms benefit from strong robustness guarantees in terms of regression coefficient recovery with a constant upper bound on the error of state-of-the-art batch methods. Extensive experiments on synthetic and real datasets demonstrate that our approaches are superior to those of existing methods in effectiveness, with competitive efficiency.

People

Liang Zhao


Shuo Lei


Publication Details

Date of publication:
October 22, 2021
Journal:
ACM Transactions on Knowledge Discovery from Data (TKDD)
Page number(s):
1-24
Volume:
16
Issue Number:
3
Publication note:

Shuo Lei, Xuchao Zhang, Liang Zhao, Arnold P. Boedihardjo, Chang-Tien Lu: Online and Distributed Robust Regressions with Extremely Noisy Labels. ACM Trans. Knowl. Discov. Data 16(3): 41:1-41:24 (2022)