Regularization of Neural Networks using DropConnect|Li Wan, Matthew D. Zeiler, Sixin Zhang et al.|International review of cytology|2013 We introduce DropConnect, a generalization of Dropout (Hinton et al., 2012), for regularizing large fully-connected layers within neural networks. When training with Dropout, a randomly selected subset of activations is set to zero within each layer. DropConnect instead sets a randomly selected subset of weights within the network to zero. Each unit thus receives input from a random subset of units in the previous layer. We derive a bound on the generalization performance of both Dropout and DropConnect. We then evaluate DropConnect on a range of datasets, comparing to Dropout, and show state-of-the-art results on several image recognition benchmarks by aggregating multiple DropConnect-trained models.
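The contrast the abstract draws between the two masks can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the `keep_prob` parameter, and the one-row-of-weights-per-output-unit layout are assumptions made here, and the nonlinearity, any rescaling, and the inference-time procedure are omitted.

```python
import random

def dropout_forward(x, W, keep_prob, rng):
    """Dropout-style masking: compute each activation, then zero a random subset of them."""
    out = []
    for row in W:  # assumed layout: one row of weights per output unit
        activation = sum(w * xi for w, xi in zip(row, x))
        out.append(activation if rng.random() < keep_prob else 0.0)
    return out

def dropconnect_forward(x, W, keep_prob, rng):
    """DropConnect-style masking: zero a random subset of individual weights,
    so each output unit sums over a random subset of its inputs."""
    out = []
    for row in W:
        activation = sum(w * xi for w, xi in zip(row, x) if rng.random() < keep_prob)
        out.append(activation)
    return out

if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    W = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
    rng = random.Random(0)
    print("dropout    :", dropout_forward(x, W, 0.5, rng))
    print("dropconnect:", dropconnect_forward(x, W, 0.5, rng))
```

With `keep_prob=1.0` both functions reduce to the same unmasked matrix-vector product; they differ only in whether the random mask acts on whole activations or on single weights.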
No More Pesky Learning Rates|Tom Schaul, Sixin Zhang, Yann LeCun|arXiv (Cornell University)|2012 The performance of stochastic gradient descent (SGD) depends critically on how learning rates are tuned and decreased over time. We propose a method to automatically adjust multiple learning rates so as to minimize the expected error at any one time. The method relies on local gradient variations across samples. In our approach, learning rates can increase as well as decrease, making it suitable for non-stationary problems. Using a number of convex and non-convex learning tasks, we show that the resulting algorithm matches the performance of SGD or other adaptive approaches with their best settings obtained through systematic search, and effectively removes the need for learning rate tuning.
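The abstract gives no formula, so the following is only a sketch of how a per-parameter rate could depend on gradient variation across samples. The ratio used here (squared mean gradient over curvature times mean squared gradient) is an assumption for illustration, as are the names `adaptive_rate`, `sample_grads`, and `curvature`; plain sample means stand in for whatever averaging scheme the method actually uses.

```python
def adaptive_rate(sample_grads, curvature, eps=1e-12):
    """Illustrative rate for one parameter: (mean g)^2 / (curvature * mean g^2).

    Consistent gradients give a rate near 1/curvature; gradients that
    disagree in sign across samples push the rate toward zero.
    """
    n = len(sample_grads)
    mean_g = sum(sample_grads) / n                    # average gradient across samples
    mean_g2 = sum(g * g for g in sample_grads) / n    # average squared gradient
    return (mean_g * mean_g) / (curvature * mean_g2 + eps)

if __name__ == "__main__":
    print(adaptive_rate([1.0, 1.0, 1.0, 1.0], curvature=2.0))    # samples agree
    print(adaptive_rate([1.0, -1.0, 1.0, -1.0], curvature=2.0))  # samples cancel out
```

Because the rate is recomputed from current statistics, it can rise again when the gradients become consistent, which matches the abstract's point that rates can increase as well as decrease.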
Deep learning with Elastic Averaging SGD. We study the problem of stochastic optimization for deep learning in the parallel computing environment under communication constraints. A new algorithm is proposed in this setting where the communication and coordination of work among concurrent processes (local workers) is based on an elastic force which links the parameters they compute with a center variable stored by the parameter server (master). The algorithm enables the local workers to perform more exploration, i.e. the algorithm allows the local variables to fluctuate further from the center variable by reducing the amount of communication between local workers and the master. We empirically demonstrate that in the deep learning setting, due to the existence of many local optima, allowing more exploration can lead to improved performance. We propose synchronous and asynchronous variants of the new algorithm. We provide the stability analysis of the asynchronous variant in the round-robin scheme and compare it with the more common parallelized method ADMM. We show that the stability of EASGD is guaranteed when a simple stability condition is satisfied, which is not the case for ADMM. We additionally propose a momentum-based version of our algorithm that can be applied in both synchronous and asynchronous settings. The asynchronous variant of the algorithm is applied to train convolutional neural networks for image classification on the CIFAR and ImageNet datasets. Experiments demonstrate that the new algorithm accelerates the training of deep architectures compared to DOWNPOUR and other common baseline approaches and furthermore is very communication efficient.
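The elastic coupling between local workers and the center variable can be sketched as below. This is a minimal synchronous, scalar-parameter illustration under stated assumptions, not the authors' code: the function name `easgd_step`, the step size `eta`, the elastic coefficient `rho`, and the toy quadratic loss are all chosen here for the example, and the asynchronous and momentum variants are not shown.

```python
def easgd_step(workers, center, grads, eta, rho):
    """One synchronous elastic-averaging step on scalar parameters.

    Each worker follows its own gradient plus an elastic pull toward the
    center; the center in turn moves toward the workers. Both updates use
    the values from before the step.
    """
    new_workers = [x - eta * (g + rho * (x - center)) for x, g in zip(workers, grads)]
    new_center = center + eta * rho * sum(x - center for x in workers)
    return new_workers, new_center

if __name__ == "__main__":
    # Toy problem (assumed): every worker minimizes 0.5 * (x - 5)^2.
    workers, center = [0.0, 10.0], 0.0
    for _ in range(500):
        grads = [x - 5.0 for x in workers]
        workers, center = easgd_step(workers, center, grads, eta=0.1, rho=0.5)
    print(workers, center)
```

A smaller `rho` weakens the pull toward the center, which lets the local variables drift further from it; this is the exploration trade-off the abstract describes.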
Data Assimilation Networks|Pierre Boudier, Anthony Fillion, Serge Gratton et al.|Journal of Advances in Modeling Earth Systems|2023 Data Assimilation aims at estimating the posterior conditional probability density functions based on error statistics of the noisy observations and the dynamical system. State of the art methods are sub‐optimal due to the common use of Gaussian error statistics and the linearization of the non‐linear dynamics. To achieve a good performance, these methods often require case‐by‐case fine‐tuning by using explicit regularization techniques such as inflation and localization. In this paper, we propose a fully data driven deep learning framework generalizing recurrent Elman networks and data assimilation algorithms. Our approach approximates a sequence of prior and posterior densities conditioned on noisy observations using a log‐likelihood cost function. By construction our approach can then be used for general nonlinear dynamics and non‐Gaussian densities. As a first step, we evaluate the performance of the proposed approach by using fully and partially observed Lorenz‐95 system in which the outputs of the recurrent network are fitted to Gaussian densities. We numerically show that our approach, without using any explicit regularization technique, achieves comparable performance to the state‐of‐the‐art methods, IEnKF‐Q and LETKF, across various ensemble sizes.
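A log-likelihood cost for outputs fitted to Gaussian densities can be sketched as follows. The abstract does not specify the parameterization, so this example assumes a diagonal covariance, with the network emitting a mean and a variance per state component; the function name `gaussian_nll` and its arguments are hypothetical.

```python
import math

def gaussian_nll(target, mean, var):
    """Negative log-likelihood of `target` under a diagonal Gaussian.

    `mean` and `var` stand in for network outputs (an assumption here);
    every entry of `var` must be strictly positive.
    """
    return 0.5 * sum(
        math.log(2.0 * math.pi * v) + (t - m) ** 2 / v
        for t, m, v in zip(target, mean, var)
    )

if __name__ == "__main__":
    state = [0.0, 1.0]
    print(gaussian_nll(state, mean=[0.0, 1.0], var=[1.0, 1.0]))  # mean matches the state
    print(gaussian_nll(state, mean=[2.0, 1.0], var=[1.0, 1.0]))  # mean is off, cost is higher
```

Minimizing this cost over a sequence of states rewards both accurate means and variances that honestly reflect the remaining error.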