Reading list

Links: 25

Title: Large Scale Distributed Neural Network Training through Online Distillation

Score: 0.904689282392706

User feedback: None

Out links: 2123855
Raw text: 2123855

http://www.cs.toronto.edu/~hinton/absps/OnlineDistillation.pdf

Published as a conference paper at ICLR 2018. Large Scale Distributed Neural Network Training through Online Distillation. Rohan Anil (Google), Robert Ormandi (Google), Gabriel Pereyra (Google DeepMind), George E. Dahl (Google Brain)...
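
The codistillation scheme this paper describes has each worker add a distillation term toward the (possibly stale) predictions of its peers, so large clusters can keep training without exchanging parameters every step. A minimal NumPy sketch of such a combined loss; the function name, the two-model setup, and the alpha weighting are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def codistillation_loss(logits_a, stale_logits_b, labels, alpha=0.5):
    """Cross-entropy on true labels plus a term pulling model A's
    predictions toward a stale checkpoint of peer model B."""
    p_a = softmax(logits_a)
    p_b = softmax(stale_logits_b)          # treated as a constant target
    n = len(labels)
    ce = -np.log(p_a[np.arange(n), labels] + 1e-12).mean()
    distill = -(p_b * np.log(p_a + 1e-12)).sum(axis=-1).mean()
    return ce + alpha * distill
```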

Title: OmniNet: Omnidirectional Representations from Transformers

Score: 0.8800227433525747

User feedback: None

Out links: 205126
Raw text: 205126

https://gwern.net/doc/www/arxiv.org/a986ec6fafa88a1a4f52523a902c22652e30d36a.pdf

OmniNet: Omnidirectional Representations from Transformers. Yi Tay, Mostafa Dehghani, Vamsi Aribandi, Jai Gupta, Philip Pham, Zhen Qin, Dara Bahri, Da-Cheng Juan, Donald Metzler (arXiv:2103.01075). Abstract: This paper proposes Omnidirectional Representations from ...

Title: Efficiently Modeling Long Sequences with Structured State Spaces

Score: 0.8785942109407829

User feedback: None

Out links: 205120
Raw text: 205120

https://gwern.net/doc/www/arxiv.org/eeba4103b71baddb951cdde4962993257f5d6f07.pdf

Efficiently Modeling Long Sequences with Structured State Spaces. Albert Gu, Karan Goel, and Christopher Ré, Department of Computer Science, Stanford University (arXiv:2111.00396). Abstract: A central goal of sequence modeling ...
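
The model family in this paper is built on the linear state space recurrence x'(t) = Ax(t) + Bu(t), y(t) = Cx(t), discretized to a step size. A toy NumPy sketch of that discretize-then-scan pipeline; the random A below stands in for S4's structured HiPPO matrix, and the paper additionally computes the scan as a convolution for training speed:

```python
import numpy as np

def discretize(A, B, dt):
    """Bilinear (Tustin) discretization of x' = Ax + Bu."""
    n = A.shape[0]
    I = np.eye(n)
    inv = np.linalg.inv(I - dt / 2 * A)
    Ad = inv @ (I + dt / 2 * A)
    Bd = inv @ (dt * B)
    return Ad, Bd

def ssm_scan(Ad, Bd, C, u):
    """Run the recurrence x_k = Ad x_{k-1} + Bd u_k, y_k = C x_k."""
    x = np.zeros(Ad.shape[0])
    ys = []
    for u_k in u:                       # u: (T,) scalar input sequence
        x = Ad @ x + Bd[:, 0] * u_k
        ys.append(C @ x)
    return np.array(ys)

# toy usage: a random stable-ish A stands in for the HiPPO matrix
rng = np.random.default_rng(0)
N = 8
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))
B = rng.standard_normal((N, 1))
C = rng.standard_normal(N)
Ad, Bd = discretize(A, B, dt=0.1)
y = ssm_scan(Ad, Bd, C, rng.standard_normal(100))
```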

Title: When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute

Score: 0.8739845625117002

User feedback: None

Out links: 205037
Raw text: 205037

https://gwern.net/doc/www/arxiv.org/d1278072a7a1822674440ddd0c6c820abc5b2e19.pdf

When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute. Tao Lei (ASAPP, Inc.). Abstract: Large language models have become increasingly difficult to train because of the growing computation time and cost. In this work, we present SRU++, a highly-ef...

Title: HiddenCut: Simple Data Augmentation for Natural Language Understanding with Better Generalization

Score: 0.8699225284191421

User feedback: None

Out links: 3268773
Raw text: 3268773

https://cs.stanford.edu/~diyiy/docs/acl21_hiddencut.pdf

HiddenCut: Simple Data Augmentation for Natural Language Understanding with Better Generalization. Jiaao Chen, Dinghan Shen, Weizhu Chen, Diyi Yang (Georgia Institute of Technology; Microsoft Dynamics 365 AI). Abstract: Fine-tuning large...
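
HiddenCut's augmentation is structured dropout: during fine-tuning, a contiguous span of token hidden states is masked inside the network. A minimal sketch of that operation; the paper selects spans more strategically (informed by attention), while this version picks them uniformly at random:

```python
import numpy as np

def hiddencut(hidden, span_frac=0.1, rng=None):
    """Zero out one contiguous span of token positions in a batch of
    hidden states (batch, seq_len, dim), as structured dropout."""
    if rng is None:
        rng = np.random.default_rng()
    h = hidden.copy()
    batch, seq_len, _ = h.shape
    span = max(1, int(seq_len * span_frac))
    for b in range(batch):
        start = rng.integers(0, seq_len - span + 1)
        h[b, start:start + span, :] = 0.0
    return h
```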

Title: Outrageously Fast LLMs: Faster Inference and Fine-Tuning with Moefication and LoRA

Score: 0.8686906632095005

User feedback: None

Out links: 1193705
Raw text: 1193705

https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1244/final-projects/ChiYoTsaiJayMartin.pdf

Outrageously Fast LLMs: Faster Inference and Fine-Tuning with Moefication and LoRA. Stanford CS224N custom project; mentor: Tony Wang. Chi Tsai, Jay Martin (Department of Computer Science, Stanford University)...
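
The LoRA half of this project freezes a pretrained weight matrix and learns only a low-rank additive update, so fine-tuning touches a tiny fraction of the parameters. A small sketch of that idea, assuming the usual W + (alpha/r) * B @ A parameterization with B zero-initialized:

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, W, r=8, alpha=16, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        d_out, d_in = W.shape
        self.W = W                                   # frozen
        self.A = rng.standard_normal((r, d_in)) * 0.01
        self.B = np.zeros((d_out, r))                # zero init: no change at start
        self.scale = alpha / r

    def __call__(self, x):                           # x: (..., d_in)
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because B starts at zero, the adapted layer initially computes exactly the frozen model's output, and after training the update can be merged into W so inference costs the same as the original layer.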

Title: Adaptive Multi-Resolution Attention with Linear Complexity

Score: 0.8584212557932718

User feedback: None

Out links: 205013
Raw text: 205013

https://gwern.net/doc/www/arxiv.org/c639528ca3cdba458c1e52f61e42863dce9599d7.pdf

Adaptive Multi-Resolution Attention with Linear Complexity. Yao Zhang, Yunpu Ma, Thomas Seidl, Volker Tresp (Institute of Informatics, LMU Munich; Corporate Technology, Siemens AG) (arXiv:2108.04962)...

Title: Finetuning Pretrained Transformers into RNNs

Score: 0.8480798445298792

User feedback: None

Out links: 205011
Raw text: 205011

https://gwern.net/doc/www/arxiv.org/0701285b128d3a748a3bf37d457c72010d08fe46.pdf

Finetuning Pretrained Transformers into RNNs. Jungo Kasai, Hao Peng, Yizhe Zhang, Dani Yogatama, Gabriel Ilharco, Nikolaos Pappas, Yi Mao, Weizhu Chen, Noah A. Smith (Paul G. Allen School of Computer Science & Engineering, University of Washington; Microsoft; DeepMind; Allen Institute for ...
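
The conversion described here replaces softmax attention with a kernelized attention whose feature map lets the model run as an RNN with constant-size state. A sketch using a fixed elu(x)+1 feature map; the paper instead learns the feature map during a short finetuning phase:

```python
import numpy as np

def phi(x):
    """A simple positive feature map, elu(x) + 1; the paper learns this map."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Causal linear attention run as an RNN: constant-size state per step."""
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))   # running sum of phi(k) v^T
    z = np.zeros(d_k)          # running sum of phi(k)
    out = []
    for q, k, v in zip(Q, K, V):
        fq, fk = phi(q), phi(k)
        S += np.outer(fk, v)
        z += fk
        out.append(fq @ S / (fq @ z + 1e-8))
    return np.array(out)
```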

Title: RWKV: Reinventing RNNs for the Transformer Era

Score: 0.8459963594865105

User feedback: None

Out links: 205034
Raw text: 205034

https://gwern.net/doc/www/arxiv.org/86693d8a9469f413a8b2735801feaa1a9d0dc50c.pdf

RWKV: Reinventing RNNs for the Transformer Era. Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemysław Kazienko, Jan Kocoń, Jiaming Kong, Bartłomiej Koptyra, H...
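
RWKV's time-mixing replaces attention with a per-channel recurrence: an exponentially decaying weighted average of past values, plus a separate bonus weight for the current token. A naive, numerically unstabilized sketch of that recurrence as I read it; real implementations track a running maximum in log space:

```python
import numpy as np

def wkv(w, u, K, V):
    """Naive RWKV time mixing. w, u: (dim,) with w >= 0 the decay rate;
    K, V: (T, dim). Each channel holds a decayed weighted sum of past
    values; the current token gets an extra bonus weight e^u."""
    T, dim = K.shape
    num = np.zeros(dim)        # sum of e^{-(t-1-i)w + k_i} * v_i over i < t
    den = np.zeros(dim)        # matching normalizer
    decay = np.exp(-w)
    out = np.empty((T, dim))
    for t in range(T):
        cur = np.exp(u + K[t])
        out[t] = (num + cur * V[t]) / (den + cur)
        num = decay * num + np.exp(K[t]) * V[t]
        den = decay * den + np.exp(K[t])
    return out
```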

Title: A Dot Product Attention Free Transformer

Score: 0.8459471281443147

User feedback: None

Out links: 205201
Raw text: 205201

https://gwern.net/doc/www/openreview.net/45f3d6c27e2b3f5b53fe6ecd14b4b122a8470ac6.pdf

Under review as a conference paper at ICLR 2022. A Dot Product Attention Free Transformer. Anonymous authors, paper under double-blind review. Abstract: We introduce Dot Product Attention Free Transformer (DAFT), an efficient variant of Transformers (Vaswani et al., 2017) that eliminates the query...
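
DAFT builds on the Attention Free Transformer, where each output is a sigmoid-gated, key-weighted average over values, computed per channel with no query-key dot products. A sketch of the basic causal AFT operation without the learned position biases:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aft_simple_causal(Q, K, V):
    """AFT without position biases: out_t = sigmoid(Q_t) * a running
    softmax-over-keys weighted average of values, elementwise per channel."""
    T, d = K.shape
    num = np.zeros(d)   # running sum of exp(K_t') * V_t'
    den = np.zeros(d)   # running sum of exp(K_t')
    out = np.empty_like(V)
    for t in range(T):
        ek = np.exp(K[t])          # naive; real code subtracts a running max
        num += ek * V[t]
        den += ek
        out[t] = sigmoid(Q[t]) * num / den
    return out
```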

Title: Current Limitations of Language Models: What You Need is Retrieval

Score: 0.8456152222090448

User feedback: None

Out links: 205044
Raw text: 205044

https://gwern.net/doc/www/arxiv.org/9aaf30e79a8b51c86a764b0b8eb725004fbddd32.pdf

Current Limitations of Language Models: What You Need is Retrieval. Aran Komatsuzaki (Georgia Institute of Technology; EleutherAI) (arXiv:2009.06857). Abstract: We classify and re-examine some of the current approaches to improve the performance-com...

Title: Luna: Linear Unified Nested Attention

Score: 0.8450740707172638

User feedback: None

Out links: 205046
Raw text: 205046

https://gwern.net/doc/www/arxiv.org/2c84075b5f7b38e98ad6ee0739e9c30f23ab3778.pdf

Luna: Linear Unified Nested Attention. Chunting Zhou (LTI, CMU), Xiang Kong (LTI, CMU), Jonathan May (ISI, USC), Sinong Wang (Facebook AI), Hao Ma and Luke Zettlemoyer (Facebook AI). Abstract: The quadratic computational and...
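
Luna nests two attentions: a short learned sequence of length l first attends over the n inputs ("pack"), then the inputs attend over that packed summary ("unpack"), so cost is O(n*l) rather than O(n^2). A minimal sketch with plain single-head attention; layer norms, residuals, and the multi-head split are omitted:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend(Q, K, V):
    """Plain scaled dot-product attention."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def luna_block(X, P):
    """X: (n, d) tokens; P: (l, d) short learned 'pack' sequence, l << n."""
    P_packed = attend(P, X, X)          # (l, d): compress X into l slots
    Y = attend(X, P_packed, P_packed)   # (n, d): each token reads the summary
    return Y, P_packed                  # P_packed is carried to the next layer
```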

Title: Memory Transformer

Score: 0.8448275767669189

User feedback: None

Out links: 205076
Raw text: 205076

https://gwern.net/doc/www/arxiv.org/6cab03ecf704e10f4f43f732577daf01daa03a1b.pdf

Memory Transformer. Mikhail S. Burtsev, Yuri Kuratov (Neural Networks and Deep Learning Lab, Moscow Institute of Physics and Technology, Dolgoprudny, Russia) (arXiv:2006.11527)...

Title: Semi-Supervised Learning via Compact Latent Space Clustering

Score: 0.8430499062025062

User feedback: None

Out links: 352812
Raw text: 352812

http://proceedings.mlr.press/v80/kamnitsas18a/kamnitsas18a.pdf

Semi-Supervised Learning via Compact Latent Space Clustering. Konstantinos Kamnitsas, Daniel C. Castro, Loic Le Folgoc, Ian Walker, Ryutaro Tanno, Daniel Rueckert, Ben Glocker, Antonio Criminisi, Aditya Nori. Abstract: We present a novel cost function for semi-supervised learning of ne...

Title: ESCHER: Eschewing Importance Sampling in Games by Computing a History Value Function to Estimate Regret

Score: 0.8426350711444989

User feedback: None

Out links: 3885218
Raw text: 3885218

https://www.mit.edu/~gfarina/2023/escher_iclr23/2206.04122.pdf

ESCHER: Eschewing Importance Sampling in Games by Computing a History Value Function to Estimate Regret (arXiv:2206.04122). Stephen McAleer (Carnegie Mellon University), Gabriele Farina (Carnegie Mellon University), Marc Lanctot (DeepMi...

Title: Random Feature Attention

Score: 0.8418470832430927

User feedback: None

Out links: 205114
Raw text: 205114

https://gwern.net/doc/www/openreview.net/e998caa668bfed59cb006c4f3cd8de1b4620cc05.pdf

Published as a conference paper at ICLR 2021. Random Feature Attention. Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, Lingpeng Kong (Paul G. Allen School of Computer Science & Engineering, University of Washington; DeepMind; Allen Institute for Artificial Intelligenc...

Title: Fast, Piecewise Training for Discriminative Finite-state and Parsing Models

Score: 0.8415860748933458

User feedback: None

Out links: 3010548
Raw text: 3010548

https://homepages.inf.ed.ac.uk/csutton/publications/nota-ir403.pdf

Fast, Piecewise Training for Discriminative Finite-state and Parsing Models. Charles Sutton and Andrew McCallum, Department of Computer Science, University of Massachusetts Amherst, Amherst, MA 01003 USA. Abstract: Discriminative models for sequences and trees, such as lin...

Title: Sub-Linear Memory: How to Make Performers SLiM

Score: 0.8366056065500114

User feedback: None

Out links: 205074
Raw text: 205074

https://gwern.net/doc/www/arxiv.org/f63a0b34378396bff253d974efc8664d5620489c.pdf

Sub-Linear Memory: How to Make Performers SLiM. Valerii Likhosherstov, Krzysztof Choromanski, Jared Davis, Xingyou Song, Adrian Weller (arXiv:2012.11346). Abstract: The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitous i...
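
Performers make attention linear in sequence length via feature maps; the memory trick here is to process the sequence in chunks and carry only the small prefix statistics between chunks, so activation memory is set by the chunk size rather than the full length. A sketch with a stand-in positive feature map (Performers use random positive features):

```python
import numpy as np

def phi(x):
    """Stand-in positive feature map; Performers use random features."""
    return np.maximum(x, 0.0) + 1e-6

def chunked_linear_attention(Q, K, V, chunk=64):
    """Causal linear attention, chunk by chunk. Only the prefix state
    (S, z) crosses chunk boundaries."""
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))       # sum of phi(k) v^T over past chunks
    z = np.zeros(d_k)              # sum of phi(k) over past chunks
    out = np.empty((T, d_v))
    for s in range(0, T, chunk):
        q = phi(Q[s:s + chunk])    # feature-mapped queries, (c, d_k)
        k = phi(K[s:s + chunk])    # feature-mapped keys
        v = V[s:s + chunk]
        # within-chunk causal part via cumulative sums
        S_cum = np.cumsum(k[:, :, None] * v[:, None, :], axis=0)
        z_cum = np.cumsum(k, axis=0)
        num = np.einsum('cd,cde->ce', q, S_cum) + q @ S
        den = np.einsum('cd,cd->c', q, z_cum) + q @ z
        out[s:s + chunk] = num / den[:, None]
        S += S_cum[-1]
        z += z_cum[-1]
    return out
```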

Title: Shortformer: Better Language Modeling Using Shorter Inputs

Score: 0.8311770836835274

User feedback: None

Out links: 205184
Raw text: 205184

https://gwern.net/doc/www/arxiv.org/59fea814f374d53b61961507bc80c351f6526a48.pdf

Shortformer: Better Language Modeling Using Shorter Inputs. Ofir Press, Noah A. Smith (Paul G. Allen School of Computer Science & Engineering, University of Washington; Facebook AI Research; Allen Institute for AI) (arXiv:2012.15832). Abstract: Incr...
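
One of Shortformer's ingredients is position-infused attention: position embeddings are added where queries and keys are formed, but not to the values, so cached token representations stay position-free. A single-head sketch of that idea:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def position_infused_attention(X, pos, Wq, Wk, Wv):
    """X, pos: (n, d). Positions enter only through queries and keys;
    values (and hence the cached outputs) carry no absolute position."""
    Q = (X + pos) @ Wq
    K = (X + pos) @ Wk
    V = X @ Wv                      # values stay position-free
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[mask] = -1e9             # causal masking
    return softmax(scores) @ V
```

Keeping values position-free is what lets previously computed representations be cached and reused when the attention window slides forward.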

Title: Piecewise Training with Parameter Independence Diagrams: Comparing Globally- and Locally-trained Linear-chain CRFs

Score: 0.8282240770563003

User feedback: None

Out links: 3010570
Raw text: 3010570

https://homepages.inf.ed.ac.uk/csutton/publications/lcrf.pdf

Center for Intelligent Information Retrieval Technical Report IR-383, presented at the NIPS'04 workshop on Learning with Structured Outputs. Piecewise Training with Parameter Independence Diagrams: Comparing Globally- and Locally-trained Linear-chain CRFs. Andrew McCallum and Charles Sutton, Department of ...

Title: BP-Transformer: Modelling Long-Range Context via Binary Partitioning

Score: 0.8277952379086098

User feedback: None

Out links: 205093
Raw text: 205093

https://gwern.net/doc/www/arxiv.org/4fc812c0cf44dfcb5d667e5729e5db10e3b1da8d.pdf

BP-Transformer: Modelling Long-Range Context via Binary Partitioning. Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, Zheng Zhang (AWS Shanghai AI Lab; Fudan University; New York University Shanghai) (arXiv:1911.04070)...

Title: On Rectified Linear Units for Speech Processing

Score: 0.8270818663345989

User feedback: None

Out links: 2124217
Raw text: 2124217

http://www.cs.toronto.edu/~hinton/absps/googlerectified.pdf

On Rectified Linear Units for Speech Processing. M.D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, G.E. Hinton (New York University, USA; Google Inc., USA). Abstract: Deep neural networks have recently become the gold st...

Title: Distilling the Knowledge in a Neural Network

Score: 0.8235656491857266

User feedback: None

Out links: 2123874
Raw text: 2123874

http://www.cs.toronto.edu/~hinton/absps/distillation.pdf

Distilling the Knowledge in a Neural Network. Geoffrey Hinton, Oriol Vinyals, Jeff Dean (Google Inc., Mountain View) (arXiv:1503.02531). Abstract: A very simple way to ...
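
The paper's recipe: soften teacher and student outputs with a temperature T, train the student to match the soft targets alongside the true labels, and scale the soft term by T^2 so its gradients stay comparable. A NumPy sketch; the alpha mixing weight is a common convention rather than a fixed value from the paper:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target cross-entropy (scaled by T**2) plus hard-label loss."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = -(p_teacher * np.log(p_student + 1e-12)).sum(-1).mean() * T**2
    n = len(labels)
    p_hard = softmax(student_logits)
    hard = -np.log(p_hard[np.arange(n), labels] + 1e-12).mean()
    return alpha * soft + (1 - alpha) * hard
```

Raising T exposes the relative probabilities the teacher assigns to wrong classes, which is where most of the transferred "dark knowledge" lives.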

Title: Natural Language Processing with Deep Learning (CS224N), Lecture 12: Neural Language Generation

Score: 0.822539174104687

User feedback: None

Out links: 1250097
Raw text: 1250097

https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1224/slides/cs224n-2022-lecture12-generation-final.pdf

Natural Language Processing with Deep Learning, CS224N/Ling284. Christopher Manning (based on a lecture by Antoine Bosselut). Lecture 12: Neural Language Generation. Today: a bit more on projects and natural language generation. A few more final project thoughts and tips. 1. What is NLG? 2. The simpl...
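
Lectures on neural language generation typically open with the basic decoding loop; as a concrete reference point, here is a minimal sketch of one sampling step with the two knobs usually introduced first, temperature and top-k truncation (the defaults below are arbitrary):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, rng=None):
    """One decoding step: scale logits by temperature, keep only the
    top-k candidates, renormalize, and sample."""
    if rng is None:
        rng = np.random.default_rng()
    z = logits / temperature
    if top_k is not None and top_k < len(z):
        cutoff = np.sort(z)[-top_k]
        z = np.where(z >= cutoff, z, -np.inf)   # drop everything else
    z = z - z.max()
    p = np.exp(z)
    p /= p.sum()
    return rng.choice(len(p), p=p)
```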

Title: Generative Adversarial Imitation Learning

Score: 0.8206242059492121

User feedback: None

Out links: 3162533
Raw text: 3162533

https://cs.stanford.edu/~ermon/papers/imitation_nips2016_main.pdf

Generative Adversarial Imitation Learning. Jonathan Ho (OpenAI), Stefano Ermon (Stanford University). Abstract: Consider learning a policy from example expert behavior, without interaction with the expert or access to a reinforcement signal. One approach is to recover...
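
GAIL alternates two updates: a discriminator learns to tell expert (state, action) pairs from the policy's, and the policy is then optimized (with TRPO in the paper) against a reward that is high where the discriminator is fooled. A schematic NumPy sketch of those two pieces, using a plain logistic-regression discriminator over precomputed state-action features; the feature setup, learning rate, and reward convention are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_step(theta, expert_sa, policy_sa, lr=1e-2):
    """One logistic-regression update: push D toward 1 on expert
    (state, action) features and toward 0 on policy samples.
    expert_sa, policy_sa: (n, f) feature matrices; theta: (f,)."""
    grad = np.zeros_like(theta)
    for x, y in [(expert_sa, 1.0), (policy_sa, 0.0)]:
        d = sigmoid(x @ theta)
        grad += x.T @ (d - y) / len(x)   # gradient of the logistic loss
    return theta - lr * grad

def gail_reward(theta, sa):
    """Surrogate reward for the policy: large where the discriminator
    mistakes policy behavior for expert behavior."""
    return -np.log(1.0 - sigmoid(sa @ theta) + 1e-8)
```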