Alexei Baevski

fairseq: A Fast, Extensible Toolkit for Sequence Modeling

Myle Ott, Sergey Edunov, Alexei Baevski et al.|NAACL 2019 (Demonstrations)|2019

Cited by 2.5k|Open Access

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). 2019.

In-Kernel Aggregation and Broadcast Acceleration for Distributed Communication

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed et al.|arXiv (Cornell University)|2020

Cited by 2.4k|Open Access

Broadcasting and aggregation dominate the communication overhead in distributed systems, from machine learning training to data analytics. Current acceleration approaches require specialized hardware (RDMA) or dedicated resources (DPDK), limiting their deployment in commodity clouds. However, we present a counter-intuitive alternative: rather than bypassing the kernel, we move operations into it using eBPF. While this imposes severe constraints including no floating-point, limited memory, and stateless execution, we show these restrictions paradoxically drive innovative protocol designs that yield unexpected benefits. We introduce AggBox, which implements broadcast and aggregation operations entirely within eBPF’s constrained environment. Our key innovations include stateless group acknowledgments for reliability, edge quantization for floating-point aggregation using only integer arithmetic, and tail-call chains that create virtual memory beyond eBPF’s 512-byte stack limit. These designs emerge from and exploit the constraints rather than fighting them. AggBox achieves remarkable performance on commodity hardware: 84.5% reduction in broadcast latency, 43× speedup for MapReduce workloads, and 56.1% faster ML gradient aggregation, all without specialized NICs or dedicated cores. Beyond performance, our work demonstrates that constrained environments can drive fundamental innovation in protocol design, offering insights for future resource-limited and verified systems.
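The abstract mentions aggregating floating-point values using only integer arithmetic. One common way to do this is fixed-point quantization at the senders, sketched below. This is an illustration of the general idea only, not AggBox's actual scheme: the `SCALE` factor and all function names are assumptions made for the example.

```python
# Hypothetical fixed-point scheme; SCALE and all names are illustrative
# assumptions, not taken from the paper.
SCALE = 1 << 16  # 16 fractional bits


def quantize(values):
    """Convert floats to scaled integers at the sender (the 'edge')."""
    return [int(round(v * SCALE)) for v in values]


def aggregate(int_vectors):
    """Element-wise sum using integer addition only, the kind of
    operation an environment without floating-point can perform."""
    total = [0] * len(int_vectors[0])
    for vec in int_vectors:
        for i, x in enumerate(vec):
            total[i] += x
    return total


def dequantize(int_values):
    """Convert the aggregated integers back to floats at the receiver."""
    return [x / SCALE for x in int_values]


# Three workers each contribute a small gradient-like vector.
workers = [[0.5, -1.25, 2.0], [0.25, 0.75, -1.0], [1.0, 0.5, 0.125]]
summed = dequantize(aggregate([quantize(w) for w in workers]))
print(summed)
```

The aggregation step never touches a float; precision is bounded by the chosen number of fractional bits, so values that are not exact multiples of 1/`SCALE` would incur a small rounding error.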

XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

Arun Babu, Changhan Wang, Andros Tjandra et al.|Interspeech 2022|2022

Cited by 513

This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, both high and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice as well as VoxPopuli, lowering error rates by 14-34% relative on average. XLS-R also sets a new state of the art on VoxLingua107 language identification. Moreover, we show that with sufficient model size, cross-lingual pretraining can perform as well as English-only pretraining when translating English speech into other languages, a setting which favors monolingual pretraining. We hope XLS-R can help to improve speech processing tasks for many more languages of the world. Models and code are available at www.github.

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed et al.|Neural Information Processing Systems|2020

Cited by 407

Pay Less Attention with Lightweight and Dynamic Convolutions

Felix Wu, Angela Fan, Alexei Baevski et al.|arXiv (Cornell University)|2019

Cited by 322|Open Access

Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time-step in order to determine the importance of context elements. The number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic. Experiments on large-scale machine translation, language modeling and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT'14 English-German test set dynamic convolutions achieve a new state of the art of 29.7 BLEU.
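The idea of predicting a convolution kernel from the current time step alone can be sketched in a few lines. This is a simplified, single-head illustration under stated assumptions, not the authors' implementation: the projection matrix `W`, the causal zero-padded window, and the function names are hypothetical choices for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def dynamic_conv(x, W):
    """Simplified dynamic convolution (single head, causal window).

    x: list of T input vectors, each of size d.
    W: K x d projection (hypothetical weights) mapping the current
       time step to K kernel scores.
    Cost is O(T * K * d): linear in the sequence length T.
    """
    K, d = len(W), len(x[0])
    out = []
    for t, x_t in enumerate(x):
        # The kernel depends only on the current time step x_t.
        kernel = softmax([sum(w * v for w, v in zip(row, x_t)) for row in W])
        y = [0.0] * d
        for k, weight in enumerate(kernel):
            src = t - (K - 1) + k  # window of width K ending at t
            if src >= 0:  # positions before the sequence start act as zeros
                for c in range(d):
                    y[c] += weight * x[src][c]
        out.append(y)
    return out


# With an all-zero projection the predicted kernel is uniform (1/K each).
x = [[1.0], [2.0], [3.0]]
W = [[0.0], [0.0]]  # K = 2, d = 1
result = dynamic_conv(x, W)
print(result)
```

Each output position looks at a fixed window of K inputs, whereas self-attention compares every position with every other one, which is where the linear versus quadratic scaling mentioned in the abstract comes from.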

Top publications by citations