Reading list

Links 25

Title: Training Compute-Optimal Large Language Models

Score: 0.960811842483605

User feedback: None

Out links: 179129 Raw text: 179129

https://arxiv.org/pdf/2203.15556.pdf

Training Compute-Optimal Large Language Models Jordan Hoffmann★, Sebastian Borgeaud★, Arthur Mensch★, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Dam...
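The snippet above is the Chinchilla paper (Hoffmann et al., 2022). A minimal sketch of its headline result, assuming the commonly cited approximations that training compute is C ≈ 6·N·D FLOPs and that the paper's fits put the compute-optimal ratio near ~20 training tokens per parameter (the function name and exact ratio are illustrative, not the paper's code):

```python
# Hedged sketch of the Chinchilla compute-optimal rule: for a fixed compute
# budget C ~= 6*N*D FLOPs, scale parameters N and tokens D together, with the
# paper's fits landing near D/N ~= 20 tokens per parameter.

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Return (params, tokens) that roughly exhaust compute_flops under C = 6*N*D."""
    # C = 6 * N * D and D = r * N  =>  N = sqrt(C / (6 * r)), D = r * N
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A Chinchilla-scale budget gives roughly 70B params and 1.4T tokens:
params, tokens = chinchilla_optimal(5.76e23)
print(f"{params:.3g} params, {tokens:.3g} tokens")
```

Doubling the budget therefore grows both N and D by about sqrt(2), rather than spending almost everything on parameters as earlier scaling prescriptions suggested.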

Title: Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language

Score: 0.9452246338749217

User feedback: None

Out links: 429808 Raw text: 429808

https://proceedings.mlr.press/v202/baevski23a/baevski23a.pdf

Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language Alexei Baevski 1 Arun Babu 2 Wei-Ning Hsu 2 Michael Auli 2 Abstract in mind which makes it unclear whether the same learning mechanisms generalize across modalities. To this end, recent w...

Title: Scaling Laws for Neural Language Models

Score: 0.9239123115473331

User feedback: None

Out links: 206418 Raw text: 206418

https://arxiv.org/pdf/2001.08361.pdf

arXiv:2001.08361v1 [cs.LG] 23 Jan 2020 Scaling Laws for Neural Language Models Jared Kaplan ∗ Sam McCandlish∗ Johns Hopkins University, OpenAI OpenAI [email protected] [email protected] Tom Henighan Tom B. Brown Benjamin Chess Rewon Child OpenAI OpenAI OpenAI OpenAI [email protected] to...
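The Kaplan et al. snippet above reports that test loss falls as a power law in model size and data size. A small sketch of that functional form, using the paper's approximate fitted constants (alpha_N ≈ 0.076 with N_c ≈ 8.8e13; alpha_D ≈ 0.095 with D_c ≈ 5.4e13) — treat the numbers as rough fits from the paper, not exact:

```python
# Hedged sketch of the Kaplan et al. (2020) power laws: loss vs. parameters N
# (data not a bottleneck) and loss vs. tokens D (model not a bottleneck).
# Constants are the paper's approximate fits.

def loss_vs_params(n_params, n_c=8.8e13, alpha_n=0.076):
    return (n_c / n_params) ** alpha_n

def loss_vs_tokens(n_tokens, d_c=5.4e13, alpha_d=0.095):
    return (d_c / n_tokens) ** alpha_d

# A power law means each doubling of N shrinks loss by the same constant
# factor, 2**-alpha_n ~= 0.949, i.e. roughly 5% lower loss per doubling:
ratio = loss_vs_params(2e9) / loss_vs_params(1e9)
print(round(ratio, 4))
```

The constant-ratio-per-doubling behavior is what makes these curves straight lines on log-log plots, and what the Chinchilla paper later revisited with different fits.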

Title: Learning to Learn with Generative Models of Neural Network Checkpoints

Score: 0.9235692522269203

User feedback: None

Out links: 199769 Raw text: 199769

https://gwern.net/doc/www/arxiv.org/ba4384efc1bf12de84e047795780b517cfac7ac6.pdf

Learning to Learn with Generative Models of Neural Network Checkpoints William Peebles∗ Ilija Radosavovic∗ Tim Brooks Alexei A. Efros Jitendra Malik University of California, Berkeley arXiv:2209.12892v1 [cs.LG] 26 Sep 2022 Abstract We explore a data-driven approach for learning to opt...

Title: TSPipe: Learn from Teacher Faster with Pipelines

Score: 0.9232400121747161

User feedback: None

Out links: 414490 Raw text: 414490

https://proceedings.mlr.press/v162/lim22a/lim22a.pdf

TSPipe: Learn from Teacher Faster with Pipelines Hwijoon Lim 1 Yechan Kim 2 Sukmin Yun 1 Jinwoo Shin 1 2 Dongsu Han 1 2 Pipeline Idle 1. Introduction Knowledge distillation (KD) (Hinton et al., 2015) has shown remarkable success with the teacher-student (TS) framework in transferring knowledge fr...
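The TSPipe snippet references the teacher-student knowledge-distillation setup of Hinton et al. (2015). A minimal, dependency-free sketch of that loss, assuming the standard formulation (KL between temperature-softened teacher and student distributions, scaled by T², mixed with hard-label cross-entropy); all names here are illustrative:

```python
# Hedged sketch of the Hinton et al. (2015) distillation loss that the
# teacher-student frameworks above build on. Pure Python for clarity.
import math

def softmax(logits, T=1.0):
    m = max(z / T for z in logits)                      # numerical stability
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.5):
    # Soft term: KL(teacher_T || student_T), scaled by T^2 as in the paper so
    # gradients keep a comparable magnitude across temperatures.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * (math.log(pt) - math.log(ps)) for pt, ps in zip(p_t, p_s)) * T * T
    # Hard term: ordinary cross-entropy against the ground-truth label.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * kl + (1 - alpha) * hard
```

When the student's logits equal the teacher's, the KL term vanishes and only the hard-label term remains, which is a quick sanity check on any implementation.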

Title: SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

Score: 0.921024690866871

User feedback: None

Out links: 426041 Raw text: 426041

https://proceedings.mlr.press/v202/ryabinin23a/ryabinin23a.pdf

SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient Max Ryabinin * 1 2 Tim Dettmers * 3 Michael Diskin 2 1 Alexander Borzunov 1 2 Abstract et al., 2021; Raffel et al., 2020; Wang & Komatsuzaki, 2021; Sun et al., 2021) to hundreds of billions (Brown et al., 2020; F...

Title: Scaling Vision Transformers to 22 Billion Parameters

Score: 0.9180148289810143

User feedback: None

Out links: 429601 Raw text: 429601

https://proceedings.mlr.press/v202/dehghani23a/dehghani23a.pdf

Scaling Vision Transformers to 22 Billion Parameters Mostafa Dehghani * Josip Djolonga * Basil Mustafa * Piotr Padlewski * Jonathan Heek * Justin Gilmer Andreas Steiner Mathilde Caron Robert Geirhos Ibrahim Alabdulmohsin Rodolphe Jenatton Lucas Beyer Michael Tschannen Anurag Arnab Xiao Wang Carlos ...

Title: Differentially Private Diffusion Models Generate Useful Synthetic Images

Score: 0.9156188206792982

User feedback: None

Out links: 290926 Raw text: 290926

https://arxiv.org/pdf/2302.13861.pdf

Differentially Private Diffusion Models Generate Useful Synthetic Images Sahra Ghalebikesabi1,+ , Leonard Berrada2 , Sven Gowal2 , Ira Ktena2 , Robert Stanforth2 , Jamie Hayes2 , Soham De2 , Samuel L. Smith2 , Olivia Wiles2 and Borja Balle2 arXiv:2302.13861v1 [cs.LG] 27 Feb 2023 1 University of Ox...

Title: Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Score: 0.9118678246692361

User feedback: None

Out links: 179114 Raw text: 179114

https://arxiv.org/pdf/2312.00752.pdf

Mamba: Linear-Time Sequence Modeling with Selective State Spaces 1 Albert Gu∗ and Tri Dao∗ 2 1 Machine Learning Department, Carnegie Mellon University arXiv:2312.00752v2 [cs.LG] 31 May 2024 2 Department of Computer Science, Princeton University [email protected], [email protected] Abstract Foundati...

Title: Never Train from Scratch: Fair Comparison of Long Sequence Models Requires Data-Driven Priors

Score: 0.9105516239191656

User feedback: None

Out links: 499972 Raw text: 499972

https://gwern.net/doc/www/arxiv.org/9f552edcea371e8ff8525afda8bc0ca95f1bc73a.pdf

Published as a conference paper at ICLR 2024 Never Train from Scratch: Fair Comparison of Long Sequence Models Requires Data-Driven Priors Ido Amos Tel Aviv University∗ Jonathan Berant Tel Aviv University Ankit Gupta IBM Research arXiv:2310.02980v4 [cs.LG] 28 Apr 2024 Abstract Modeling long...

Title: Deduplicating Training Data Mitigates Privacy Risks in Language Models

Score: 0.9103293597046762

User feedback: None

Out links: 414366 Raw text: 414366

https://proceedings.mlr.press/v162/kandpal22a/kandpal22a.pdf

Deduplicating Training Data Mitigates Privacy Risks in Language Models Nikhil Kandpal 1 Eric Wallace 2 Colin Raffel 1 Past work has shown that large language models are susceptible to privacy attacks, where adversaries generate sequences from a trained model and detect which sequences are memorize...

Title: Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Score: 0.9100100368373858

User feedback: None

Out links: 497593 Raw text: 497593

https://gwern.net/doc/www/arxiv.org/8c02c42545cb876836a931529dffa182787ec5db.pdf

Mamba: Linear-Time Sequence Modeling with Selective State Spaces 1 Albert Gu* and Tri Dao* 1 2 Machine Learning Department, Carnegie Mellon University 2 Department of Computer Science, Princeton University [email protected], [email protected] Abstract Foundation models, now powering most of the excitin...

Title: None

Score: 0.9096458992031686

User feedback: None

Out links: 313663 Raw text: 313663

https://gwern.net/doc/www/arxiv.org/ef2d5ac5ad43e7cbe4ec81e260ee4f1753197629.pdf

Kevin Lu UC Berkeley [email protected] Aditya Grover Facebook AI Research [email protected] Pieter Abbeel UC Berkeley [email protected] Igor Mordatch Google Brain [email protected] Abstract We investigate the capability of a transformer pretrained on natural language to generalize to o...

Title: Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks

Score: 0.9095591830732751

User feedback: None

Out links: 620790 Raw text: 620790

https://gwern.net/doc/www/arxiv.org/76a5623e4773060e66ec928a2f2266298d76dee3.pdf

arXiv:2102.00554v1 [cs.LG] 31 Jan 2021 Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks TORSTEN HOEFLER, ETH Zürich, Switzerland DAN ALISTARH, IST Austria, Austria TAL BEN-NUN, ETH Zürich, Switzerland NIKOLI DRYDEN, ETH Zürich, Switzerland ALEXAN...
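The Hoefler et al. survey above covers pruning schemes for sparse inference and training. A minimal sketch of the simplest one, global magnitude pruning (zero out the smallest-magnitude fraction of weights); the function is illustrative, not from the survey:

```python
# Hedged sketch of magnitude pruning, the baseline scheme surveyed above:
# remove the `sparsity` fraction of weights with the smallest absolute value.

def magnitude_prune(weights, sparsity):
    """Return a copy of weights with the smallest-|w| fraction set to 0.0."""
    k = int(len(weights) * sparsity)          # number of weights to remove
    if k == 0:
        return list(weights)
    # Threshold at the k-th smallest magnitude; ties at the threshold are
    # also pruned, so the achieved sparsity can slightly exceed the target.
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

Real systems apply this per layer or with structured (block/row) patterns so hardware can exploit the zeros, which is much of what the survey catalogs.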

Title: Analysis of Heuristics for Neural Architecture Search

Score: 0.9059600220652948

User feedback: None

Out links: 8016476 Raw text: 8016476

https://www.cs.toronto.edu/~rupert/projects/nas-heuristics.pdf

Analysis of Heuristics for Neural Architecture Search (MAT496) Analysis of Heuristics for Neural Architecture Search MAT496H1S: Mathematics of Deep Learning (Reading) Robert Wu [email protected] Department of Computer Science University of Toronto Abstract Neural architecture search (NAS) al...

Title: Model Agnostic Sample Reweighting for Out-of-Distribution Learning

Score: 0.904979454820576

User feedback: None

Out links: 413355 Raw text: 413355

https://proceedings.mlr.press/v162/zhou22d/zhou22d.pdf

Model Agnostic Sample Reweighting for Out-of-Distribution Learning Xiao Zhou * 1 Yong Lin * 1 Renjie Pi * 1 Weizhong Zhang 1 Renzhe Xu 2 Peng Cui 2 Tong Zhang 1 3 Abstract Distributionally robust optimization (DRO) and invariant risk minimization (IRM) are two popular methods proposed to improve o...

Title: Large Scale Distributed Neural Network Training through Online Distillation

Score: 0.904689282392706

User feedback: None

Out links: 2123855 Raw text: 2123855

http://www.cs.toronto.edu/~hinton/absps/OnlineDistillation.pdf

Published as a conference paper at ICLR 2018 Large Scale Distributed Neural Network Training through Online Distillation Rohan Anil Google [email protected] Robert Ormandi Google [email protected] Gabriel Pereyra∗ Google DeepMind [email protected] George E. Dahl Google Brain [email protected]...

Title: Unified Scaling Laws for Routed Language Models

Score: 0.9046471793453105

User feedback: None

Out links: 414481 Raw text: 414481

https://proceedings.mlr.press/v162/clark22a/clark22a.pdf

Unified Scaling Laws for Routed Language Models Aidan Clark * 1 Diego de las Casas * 1 Aurelia Guy * 1 Arthur Mensch * 1 Michela Paganini 1 Jordan Hoffmann 1 Bogdan Damoc 1 Blake Hechtman 2 Trevor Cai 1 Sebastian Borgeaud 1 George van den Driessche 1 Eliza Rutherford 1 Tom Hennigan 1 Matthew Johnso...

Title: Mixture of Parrots: Experts Improve Memorization More Than Reasoning

Score: 0.9036664042486622

User feedback: None

Out links: 499919 Raw text: 499919

https://gwern.net/doc/www/arxiv.org/b70bb16ca211ff9a26ae5afc5ecaf650c550cecc.pdf

Under review as a conference paper at ICLR 2025 Mixture of Parrots: Experts Improve Memorization More Than Reasoning Samy Jelassi∗ Harvard University arXiv:2410.19034v1 [cs.LG] 24 Oct 2024 Nikhil Vyas Harvard University Clara Mohri Harvard University David Brandfonbrener Harvard University...

Title: Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

Score: 0.902011208960867

User feedback: None

Out links: 291713 Raw text: 291713

https://arxiv.org/pdf/2404.02258.pdf

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models David Raposo1* , Sam Ritter1 , Blake Richards1,2 , Timothy Lillicrap1 , Peter Conway Humphreys1 and Adam Santoro1* arXiv:2404.02258v1 [cs.LG] 2 Apr 2024 1 Google DeepMind, 2 McGill University & Mila, * Equal Con...
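The Mixture-of-Depths snippet describes dynamically allocating compute per token. A toy sketch of the core routing idea, assuming the paper's top-k scheme in which a router scores tokens, only the top-k pass through the expensive block (with the output scaled by the router score so the router gets gradient), and the rest ride the residual stream unchanged; tokens here are scalars and all names are illustrative:

```python
# Hedged sketch of Mixture-of-Depths-style routing: per layer, only the k
# highest-scoring tokens receive the full block computation.

def mixture_of_depths_layer(tokens, scores, k, block):
    """Apply `block` to the k top-scoring tokens; the rest pass through."""
    topk = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    chosen = set(topk)
    # Routed tokens: residual update t + score * block(t); others: identity.
    return [block(t) * scores[i] + t if i in chosen else t
            for i, t in enumerate(tokens)]

# Tokens at positions 1 and 3 have the highest scores, so only they are
# processed by the (toy) block:
out = mixture_of_depths_layer([1, 2, 3, 4], [0.1, 0.9, 0.2, 0.8], 2, lambda t: 10 * t)
```

Because k is fixed, the per-layer compute budget is static even though *which* tokens get it varies, which is what makes the scheme hardware-friendly.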

Title: PoF: Post-Training of Feature Extractor for Improving Generalization

Score: 0.8996628887319077

User feedback: None

Out links: 415249 Raw text: 415249

https://proceedings.mlr.press/v162/sato22a/sato22a.pdf

PoF: Post-Training of Feature Extractor for Improving Generalization Ikuro Sato * 1 2 Ryota Yamada * 1 Masayuki Tanaka 1 Nakamasa Inoue 1 Rei Kawakami 1 2 Abstract 1. Introduction It has been intensively discussed what conditions make deep models generalized for given datasets and network archite...

Title: Large Batch Simulation for Deep Reinforcement Learning

Score: 0.8990216042421004

User feedback: None

Out links: 200006 Raw text: 200006

https://gwern.net/doc/www/arxiv.org/79528489ccea8d598189f1f980c963c6c2ee576a.pdf

Published as a conference paper at ICLR 2021 Large Batch Simulation for Deep Reinforcement Learning arXiv:2103.07013v1 [cs.LG] 12 Mar 2021 Brennan Shacklett1∗ Erik Wijmans2 Aleksei Petrenko3,4 Manolis Savva5 Dhruv Batra2 Vladlen Koltun3 Kayvon Fatahalian1 1 Stanford University 2 Georgia Inst...

Title: Large Scale Distributed Deep Networks

Score: 0.8989672372846284

User feedback: None

Out links: 7372016 Raw text: 7372016

http://www.cs.toronto.edu/~ranzato/publications/DistBeliefNIPS2012_withAppendix.pdf

Large Scale Distributed Deep Networks Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Andrew Y. Ng {jeff, gcorrado}@google.com Google Inc., Mountain View, CA Abstract Recent work in unsupervis...

Title: Masked Mixers for Language Generation and Retrieval

Score: 0.8984538140268135

User feedback: None

Out links: 552794 Raw text: 552794

https://gwern.net/doc/www/arxiv.org/95cbfa4074b27ccd0c06db17f32929715018c624.pdf

Masked Mixers for Language Generation and Retrieval arXiv:2409.01482v1 [cs.CL] 2 Sep 2024 Technical Report Benjamin L. Badger∗ Guidehouse 1676 International Dr, McLean, VA 22102 [email protected] Abstract Attention mechanisms that confer selective focus on a strict subset of input el...

Title: Provable Domain Generalization via Invariant-Feature Subspace Recovery

Score: 0.8983540717099758

User feedback: None

Out links: 411346 Raw text: 411346

https://proceedings.mlr.press/v162/wang22x/wang22x.pdf

Provable Domain Generalization via Invariant-Feature Subspace Recovery Haoxiang Wang 1 Haozhe Si 1 Bo Li 1 Han Zhao 1 Abstract Domain generalization asks for models trained over a set of training environments to perform well in unseen test environments. Recently, a series of algorithms such as Inv...