Carlos Riquelme

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Xi Chen, Xiao Wang, Soravit Changpinyo et al.|arXiv (Cornell University)|2022

Cited by 194|Open Access

Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.

Scaling Vision Transformers to 22 Billion Parameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa et al.|arXiv (Cornell University)|2023

Cited by 118|Open Access

The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.

Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts

Basil Mustafa, Carlos Riquelme, Joan Puigcerver et al.|arXiv (Cornell University)|2022

Cited by 72|Open Access

Large sparsely-activated models have obtained excellent performance in multiple domains. However, such models are typically trained on a single modality at a time. We present the Language-Image MoE, LIMoE, a sparse mixture of experts model capable of multimodal learning. LIMoE accepts both images and text simultaneously, while being trained using a contrastive loss. MoEs are a natural fit for a multimodal backbone, since expert layers can learn an appropriate partitioning of modalities. However, new challenges arise; in particular, training stability and balanced expert utilization, for which we propose an entropy-based regularization scheme. Across multiple scales, we demonstrate remarkable performance improvement over dense models of equivalent computational cost. LIMoE-L/16 trained comparably to CLIP-L/14 achieves 78.6% zero-shot ImageNet accuracy (vs. 76.2%), and when further scaled to H/14 (with additional data) it achieves 84.1%, comparable to state-of-the-art methods which use larger custom per-modality backbones and pre-training schemes. We analyse the quantitative and qualitative behavior of LIMoE, and demonstrate phenomena such as differing treatment of the modalities and the organic emergence of modality-specific experts.

Safety and effectiveness of the Low Profile Visualized Intraluminal Support (LVIS and LVIS Jr) devices in the endovascular treatment of intracranial aneurysms: results of the TRAIL multicenter observational study

Christina Iosif, Michel Piotin, Suzana Saleme et al.|Journal of NeuroInterventional Surgery|2017

Cited by 62|Open Access

BACKGROUND AND PURPOSE: To evaluate the safety and effectiveness of the low-profile braided intracranial stents called the Low Profile Visualized Intraluminal Support (LVIS) devices for stent-assisted coil embolization of wide-necked intracranial aneurysms. MATERIALS AND METHODS: This was a prospective, multicenter, observational study of unruptured and ruptured intracranial aneurysms treated with the LVIS devices. Imaging and clinical data were analyzed independently by the CoreLab and the Clinical Event Committee, respectively. Primary endpoints were clinical safety, effectiveness, and angiographic stability of the results at 6 and 18 months. RESULTS: Ten centers participated in the study; 102 patients were included and 90 patients (42.2% men, 57.8% women) were eventually analyzed, among which 27 (30.0%) had multiple aneurysms. Twenty-three (25.6%) were ruptured aneurysms, four of which (4.4%) were treated in the acute phase. One aneurysm was treated per patient; 92 LVIS and LVIS Jr devices were placed overall. The total aneurysm occlusion rate was 91.0% on immediate post-procedure angiograms, which remained unchanged at 6-month follow-up and was 92.4% at 18-month follow-up. One patient (1.1%) underwent retreatment between 6 and 18 months of follow-up. A modified Rankin score of 0 was documented for most cases immediately after the procedure (86.7%) and at 6-month (86.8%) and 18-month (83.3%) follow-up. The overall permanent morbidity rate at 18 months was 5.6% and the overall rate of events with sequelae related to the stent was 2.2%. The 18-month procedure-related mortality rate was 3.3%. No patient was deemed to require retreatment at 18-month follow-up. CONCLUSION: The LVIS/LVIS Jr endovascular devices are safe and effective in the treatment of ruptured and unruptured intracranial aneurysms, with acceptable complication rates, very high immediate total occlusion rates, and stable angiographic results.

Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling

Carlos Riquelme, George Tucker, Jasper Snoek|arXiv (Cornell University)|2018

Cited by 46|Open Access

Recent advances in deep reinforcement learning have made significant strides in performance on applications such as Go and Atari games. However, developing practical methods to balance exploration and exploitation in complex domains remains largely unsolved. Thompson Sampling and its extension to reinforcement learning provide an elegant approach to exploration that only requires access to posterior samples of the model. At the same time, advances in approximate Bayesian methods have made posterior approximation for flexible neural network models practical. Thus, it is attractive to consider approximate Bayesian neural networks in a Thompson Sampling framework. To understand the impact of using an approximate posterior on Thompson Sampling, we benchmark well-established and recently developed methods for approximate posterior sampling combined with Thompson Sampling over a series of contextual bandit problems. We found that many approaches that have been successful in the supervised learning setting underperformed in the sequential decision-making scenario. In particular, we highlight the challenge of adapting slowly converging uncertainty estimates to the online setting.


Top publications by citations