Reading list
Links 25
Score: 0.960811842483605
User feedback: None
Out links: 179129 Raw text: https://arxiv.org/pdf/2203.15556.pdf
Training Compute-Optimal Large Language Models Jordan Hoffmann★, Sebastian Borgeaud★, Arthur Mensch★, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Dam...
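The paper's headline result lends itself to a quick numeric sketch: assuming the common training-FLOPs estimate C ≈ 6·N·D together with the Chinchilla rule of thumb of roughly 20 tokens per parameter, a compute budget determines both model size and token count. A minimal sketch, where the constants 6 and 20 are rounded rules of thumb rather than the paper's fitted coefficients:

```python
def chinchilla_optimal(compute_flops):
    """Split a FLOP budget into a compute-optimal model size and token count.

    Assumes C ~= 6 * N * D (standard training-FLOPs estimate) and the
    Chinchilla rule of thumb D ~= 20 * N; both constants are rounded
    approximations, not the paper's fitted values.
    """
    # C = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla's own budget (~5.76e23 FLOPs) recovers roughly 70B params / 1.4T tokens
n, d = chinchilla_optimal(5.76e23)
```

Plugging in GPT-3-scale compute shows why the paper argued most large models of the time were undertrained relative to their size.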
Score: 0.9452246338749217
User feedback: None
Out links: 429808 Raw text: https://proceedings.mlr.press/v202/baevski23a/baevski23a.pdf
Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language Alexei Baevski 1 Arun Babu 2 Wei-Ning Hsu 2 Michael Auli 2 Abstract in mind which makes it unclear whether the same learning mechanisms generalize across modalities. To this end, recent w...
Score: 0.9239123115473331
User feedback: None
Out links: 206418 Raw text: https://arxiv.org/pdf/2001.08361.pdf
arXiv:2001.08361v1 [cs.LG] 23 Jan 2020 Scaling Laws for Neural Language Models Jared Kaplan ∗ Sam McCandlish∗ Johns Hopkins University, OpenAI OpenAI [email protected] [email protected] Tom Henighan Tom B. Brown Benjamin Chess Rewon Child OpenAI OpenAI OpenAI OpenAI [email protected] to...
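The scaling laws themselves are simple power laws, e.g. loss as a function of non-embedding parameter count N. A sketch using the constants Kaplan et al. report for their setup (α_N ≈ 0.076, N_c ≈ 8.8e13); these are specific to their corpus and tokenizer, not universal:

```python
def kaplan_loss(n_params, alpha_n=0.076, n_c=8.8e13):
    """L(N) = (N_c / N)^alpha_N: test loss vs. non-embedding parameter count.

    Constants are those reported by Kaplan et al. for their dataset and
    tokenizer; they do not transfer unchanged to other setups.
    """
    return (n_c / n_params) ** alpha_n

# Loss falls smoothly (but slowly) as the model grows
losses = [kaplan_loss(n) for n in (1e8, 1e9, 1e10)]
```

The small exponent is the point: each order of magnitude in parameters buys only a modest, predictable drop in loss.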
Score: 0.9235692522269203
User feedback: None
Out links: 199769 Raw text: https://gwern.net/doc/www/arxiv.org/ba4384efc1bf12de84e047795780b517cfac7ac6.pdf
Learning to Learn with Generative Models of Neural Network Checkpoints William Peebles∗ Ilija Radosavovic∗ Tim Brooks Alexei A. Efros Jitendra Malik University of California, Berkeley arXiv:2209.12892v1 [cs.LG] 26 Sep 2022 Abstract We explore a data-driven approach for learning to opt...
Score: 0.9232400121747161
User feedback: None
Out links: 414490 Raw text: https://proceedings.mlr.press/v162/lim22a/lim22a.pdf
TSPipe: Learn from Teacher Faster with Pipelines Hwijoon Lim 1 Yechan Kim 2 Sukmin Yun 1 Jinwoo Shin 1 2 Dongsu Han 1 2 Pipeline Idle 1. Introduction Knowledge distillation (KD) (Hinton et al., 2015) has shown remarkable success with the teacher-student (TS) framework in transferring knowledge fr...
Score: 0.921024690866871
User feedback: None
Out links: 426041 Raw text: https://proceedings.mlr.press/v202/ryabinin23a/ryabinin23a.pdf
SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient Max Ryabinin * 1 2 Tim Dettmers * 3 Michael Diskin 2 1 Alexander Borzunov 1 2 Abstract et al., 2021; Raffel et al., 2020; Wang & Komatsuzaki, 2021; Sun et al., 2021) to hundreds of billions (Brown et al., 2020; F...
Score: 0.9180148289810143
User feedback: None
Out links: 429601 Raw text: https://proceedings.mlr.press/v202/dehghani23a/dehghani23a.pdf
Scaling Vision Transformers to 22 Billion Parameters Mostafa Dehghani * Josip Djolonga * Basil Mustafa * Piotr Padlewski * Jonathan Heek * Justin Gilmer Andreas Steiner Mathilde Caron Robert Geirhos Ibrahim Alabdulmohsin Rodolphe Jenatton Lucas Beyer Michael Tschannen Anurag Arnab Xiao Wang Carlos ...
Score: 0.9156188206792982
User feedback: None
Out links: 290926 Raw text: https://arxiv.org/pdf/2302.13861.pdf
Differentially Private Diffusion Models Generate Useful Synthetic Images Sahra Ghalebikesabi1,+ , Leonard Berrada2 , Sven Gowal2 , Ira Ktena2 , Robert Stanforth2 , Jamie Hayes2 , Soham De2 , Samuel L. Smith2 , Olivia Wiles2 and Borja Balle2 arXiv:2302.13861v1 [cs.LG] 27 Feb 2023 1 University of Ox...
Score: 0.9118678246692361
User feedback: None
Out links: 179114 Raw text: https://arxiv.org/pdf/2312.00752.pdf
Mamba: Linear-Time Sequence Modeling with Selective State Spaces Albert Gu∗1 and Tri Dao∗2 1 Machine Learning Department, Carnegie Mellon University 2 Department of Computer Science, Princeton University arXiv:2312.00752v2 [cs.LG] 31 May 2024 [email protected], [email protected] Abstract Foundati...
Score: 0.9105516239191656
User feedback: None
Out links: 499972 Raw text: https://gwern.net/doc/www/arxiv.org/9f552edcea371e8ff8525afda8bc0ca95f1bc73a.pdf
Published as a conference paper at ICLR 2024 Never Train from Scratch: Fair Comparison of Long Sequence Models Requires Data-Driven Priors Ido Amos Tel Aviv University∗ Jonathan Berant Tel Aviv University Ankit Gupta IBM Research arXiv:2310.02980v4 [cs.LG] 28 Apr 2024 Abstract Modeling long...
Score: 0.9103293597046762
User feedback: None
Out links: 414366 Raw text: https://proceedings.mlr.press/v162/kandpal22a/kandpal22a.pdf
Deduplicating Training Data Mitigates Privacy Risks in Language Models Nikhil Kandpal 1 Eric Wallace 2 Colin Raffel 1 Past work has shown that large language models are susceptible to privacy attacks, where adversaries generate sequences from a trained model and detect which sequences are memorize...
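The mitigation studied here can be illustrated in its simplest form, exact-match deduplication: keep the first occurrence of each normalized document. In this sketch the strip/lowercase normalization is an illustrative assumption, and the paper also evaluates approximate near-duplicate removal, which this does not cover:

```python
import hashlib

def dedup_exact(docs):
    """Drop exact duplicates, keeping the first occurrence of each document.

    Normalization (strip + lowercase) is an illustrative choice; real
    pipelines also handle near-duplicates, which this sketch ignores.
    """
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

Hashing rather than storing full documents keeps the memory footprint proportional to the number of unique documents, which matters at training-corpus scale.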
Score: 0.9100100368373858
User feedback: None
Out links: 497593 Raw text: https://gwern.net/doc/www/arxiv.org/8c02c42545cb876836a931529dffa182787ec5db.pdf
Mamba: Linear-Time Sequence Modeling with Selective State Spaces Albert Gu∗1 and Tri Dao∗2 1 Machine Learning Department, Carnegie Mellon University 2 Department of Computer Science, Princeton University [email protected], [email protected] Abstract Foundation models, now powering most of the excitin...
Score: 0.9096458992031686
User feedback: None
Out links: 313663 Raw text: https://gwern.net/doc/www/arxiv.org/ef2d5ac5ad43e7cbe4ec81e260ee4f1753197629.pdf
Kevin Lu UC Berkeley [email protected] Aditya Grover Facebook AI Research [email protected] Pieter Abbeel UC Berkeley [email protected] Igor Mordatch Google Brain [email protected] Abstract We investigate the capability of a transformer pretrained on natural language to generalize to o...
Score: 0.9095591830732751
User feedback: None
Out links: 620790 Raw text: https://gwern.net/doc/www/arxiv.org/76a5623e4773060e66ec928a2f2266298d76dee3.pdf
arXiv:2102.00554v1 [cs.LG] 31 Jan 2021 Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks TORSTEN HOEFLER, ETH Zürich, Switzerland DAN ALISTARH, IST Austria, Austria TAL BEN-NUN, ETH Zürich, Switzerland NIKOLI DRYDEN, ETH Zürich, Switzerland ALEXAN...
Score: 0.9059600220652948
User feedback: None
Out links: 8016476 Raw text: https://www.cs.toronto.edu/~rupert/projects/nas-heuristics.pdf
Analysis of Heuristics for Neural Architecture Search MAT496H1S: Mathematics of Deep Learning (Reading) Robert Wu [email protected] Department of Computer Science University of Toronto Abstract Neural architecture search (NAS) al...
Score: 0.904979454820576
User feedback: None
Out links: 413355 Raw text: https://proceedings.mlr.press/v162/zhou22d/zhou22d.pdf
Model Agnostic Sample Reweighting for Out-of-Distribution Learning Xiao Zhou * 1 Yong Lin * 1 Renjie Pi * 1 Weizhong Zhang 1 Renzhe Xu 2 Peng Cui 2 Tong Zhang 1 3 Abstract Distributionally robust optimization (DRO) and invariant risk minimization (IRM) are two popular methods proposed to improve o...
Score: 0.904689282392706
User feedback: None
Out links: 2123855 Raw text: http://www.cs.toronto.edu/~hinton/absps/OnlineDistillation.pdf
Published as a conference paper at ICLR 2018 Large Scale Distributed Neural Network Training through Online Distillation Rohan Anil Google [email protected] Robert Ormandi Google [email protected] Gabriel Pereyra∗ Google DeepMind [email protected] George E. Dahl Google Brain [email protected]...
Score: 0.9046471793453105
User feedback: None
Out links: 414481 Raw text: https://proceedings.mlr.press/v162/clark22a/clark22a.pdf
Unified Scaling Laws for Routed Language Models Aidan Clark * 1 Diego de las Casas * 1 Aurelia Guy * 1 Arthur Mensch * 1 Michela Paganini 1 Jordan Hoffmann 1 Bogdan Damoc 1 Blake Hechtman 2 Trevor Cai 1 Sebastian Borgeaud 1 George van den Driessche 1 Eliza Rutherford 1 Tom Hennigan 1 Matthew Johnso...
Score: 0.9036664042486622
User feedback: None
Out links: 499919 Raw text: https://gwern.net/doc/www/arxiv.org/b70bb16ca211ff9a26ae5afc5ecaf650c550cecc.pdf
Under review as a conference paper at ICLR 2025 Mixture of Parrots: Experts Improve Memorization More than Reasoning Samy Jelassi∗ Harvard University arXiv:2410.19034v1 [cs.LG] 24 Oct 2024 Nikhil Vyas Harvard University Clara Mohri Harvard University David Brandfonbrener Harvard University...
Score: 0.902011208960867
User feedback: None
Out links: 291713 Raw text: https://arxiv.org/pdf/2404.02258.pdf
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models David Raposo1* , Sam Ritter1 , Blake Richards1,2 , Timothy Lillicrap1 , Peter Conway Humphreys1 and Adam Santoro1* arXiv:2404.02258v1 [cs.LG] 2 Apr 2024 1 Google DeepMind, 2 McGill University & Mila, * Equal Con...
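The mechanism can be caricatured in a few lines: a per-token router score selects a fixed-capacity subset of tokens for a block's expensive computation, while the rest ride the residual stream unchanged. In this sketch the linear scalar router and the doubling that stands in for the block's attention/MLP compute are simplified assumptions:

```python
import numpy as np

def mod_route(tokens, router_w, capacity):
    """Mixture-of-Depths-style routing sketch.

    A scalar router score per token picks the top-`capacity` tokens for
    the block's heavy computation; unrouted tokens pass through on the
    residual stream. The doubling is a stand-in for real block compute.
    """
    scores = tokens @ router_w                 # (seq_len,) score per token
    keep = np.argsort(scores)[-capacity:]      # indices of top-capacity tokens
    out = tokens.copy()
    out[keep] = tokens[keep] * 2.0             # placeholder "block" compute
    return out, sorted(keep.tolist())
```

Because `capacity` is fixed ahead of time, the per-block FLOP cost is static even though which tokens receive it varies, which is what makes the scheme hardware-friendly.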
Score: 0.8996628887319077
User feedback: None
Out links: 415249 Raw text: https://proceedings.mlr.press/v162/sato22a/sato22a.pdf
PoF: Post-Training of Feature Extractor for Improving Generalization Ikuro Sato * 1 2 Ryota Yamada * 1 Masayuki Tanaka 1 Nakamasa Inoue 1 Rei Kawakami 1 2 Abstract 1. Introduction It has been intensively discussed what conditions make deep models generalized for given datasets and network archite...
Score: 0.8990216042421004
User feedback: None
Out links: 200006 Raw text: https://gwern.net/doc/www/arxiv.org/79528489ccea8d598189f1f980c963c6c2ee576a.pdf
Published as a conference paper at ICLR 2021 Large Batch Simulation for Deep Reinforcement Learning arXiv:2103.07013v1 [cs.LG] 12 Mar 2021 Brennan Shacklett1∗ Erik Wijmans2 Aleksei Petrenko3,4 Manolis Savva5 Dhruv Batra2 Vladlen Koltun3 Kayvon Fatahalian1 1 Stanford University 2 Georgia Inst...
Score: 0.8989672372846284
User feedback: None
Out links: 7372016 Raw text: http://www.cs.toronto.edu/~ranzato/publications/DistBeliefNIPS2012_withAppendix.pdf
Large Scale Distributed Deep Networks Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Andrew Y. Ng {jeff, gcorrado}@google.com Google Inc., Mountain View, CA Abstract Recent work in unsupervis...
Score: 0.8984538140268135
User feedback: None
Out links: 552794 Raw text: https://gwern.net/doc/www/arxiv.org/95cbfa4074b27ccd0c06db17f32929715018c624.pdf
Masked Mixers for Language Generation and Retrieval arXiv:2409.01482v1 [cs.CL] 2 Sep 2024 Technical Report Benjamin L. Badger∗ Guidehouse 1676 International Dr, McLean, VA 22102 [email protected] Abstract Attention mechanisms that confer selective focus on a strict subset of input el...
Score: 0.8983540717099758
User feedback: None
Out links: 411346 Raw text: https://proceedings.mlr.press/v162/wang22x/wang22x.pdf
Provable Domain Generalization via Invariant-Feature Subspace Recovery Haoxiang Wang 1 Haozhe Si 1 Bo Li 1 Han Zhao 1 Abstract Domain generalization asks for models trained over a set of training environments to perform well in unseen test environments. Recently, a series of algorithms such as Inv...