Reading list
Links: 25
Score: 0.904689282392706
User feedback: None
Out links: 2123855 Raw text: http://www.cs.toronto.edu/~hinton/absps/OnlineDistillation.pdf
Published as a conference paper at ICLR 2018. Large Scale Distributed Neural Network Training through Online Distillation. Rohan Anil (Google) [email protected], Robert Ormandi (Google) [email protected], Gabriel Pereyra* (Google DeepMind) [email protected], George E. Dahl (Google Brain) [email protected]...
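Note: the codistillation idea is easy to state in code. Below is a minimal sketch, assuming two PyTorch classifiers trained on the same batches; the paper's actual contribution, the distributed setup where workers exchange (possibly stale) predictions instead of gradients, is not shown. Each model adds a term pulling it toward the other's predictions.

    import torch.nn.functional as F

    def codistillation_loss(own_logits, peer_logits, targets, alpha=0.5):
        # Ordinary cross-entropy on the hard labels...
        ce = F.cross_entropy(own_logits, targets)
        # ...plus a KL term matching the peer's predictions, treated as
        # fixed soft targets (detach: no gradient flows into the peer).
        kl = F.kl_div(F.log_softmax(own_logits, dim=-1),
                      F.softmax(peer_logits.detach(), dim=-1),
                      reduction="batchmean")
        return ce + alpha * kl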
Score: 0.8800227433525747
User feedback: None
Out links: 205126 Raw text: https://gwern.net/doc/www/arxiv.org/a986ec6fafa88a1a4f52523a902c22652e30d36a.pdf
OmniNet: Omnidirectional Representations from Transformers. Yi Tay*1, Mostafa Dehghani*2, Vamsi Aribandi1,3, Jai Gupta1, Philip Pham1, Zhen Qin1, Dara Bahri1, Da-Cheng Juan1, Donald Metzler1. arXiv:2103.01075v1 [cs.CV] 1 Mar 2021. Abstract: This paper proposes Omnidirectional Representations from ...
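Note: as a rough illustration of what "omnidirectional" means here (each token may attend to all tokens of all layers), the sketch below simply flattens the stack of layer outputs into one attention span. OmniNet's efficient partition/kernel-based learners that make this tractable are the paper's contribution and are omitted; all shapes are illustrative.

    import torch
    import torch.nn.functional as F

    def omni_attend(layer_states, queries):
        # layer_states: (n_layers, seq_len, d) hidden states from every layer.
        # queries: (seq_len, d). Keys/values are every token of every layer.
        d = queries.shape[-1]
        kv = layer_states.reshape(-1, d)            # (n_layers*seq_len, d)
        scores = queries @ kv.T / d ** 0.5
        return F.softmax(scores, dim=-1) @ kv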
Score: 0.8785942109407829
User feedback: None
Out links: 205120 Raw text: https://gwern.net/doc/www/arxiv.org/eeba4103b71baddb951cdde4962993257f5d6f07.pdf
Efficiently Modeling Long Sequences with Structured State Spaces. Albert Gu, Karan Goel, and Christopher Ré, Department of Computer Science, Stanford University. arXiv:2111.00396v2 [cs.LG] 4 Mar 2022. {albertgu,krng}@stanford.edu, [email protected]. Abstract: A central goal of sequence modeling ...
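Note: the object at the center of the paper, a discrete linear state space model x_k = A x_{k-1} + B u_k, y_k = C x_k, can be written as a naive sequential scan (below). S4's contributions, the HiPPO-derived structure of A and the FFT-computable convolution kernel that avoids this recurrence at training time, are not reproduced; shapes are illustrative.

    import torch

    def ssm_scan(A, B, C, u):
        # A: (N, N), B: (N,), C: (N,) discretized parameters; u: (L,) input.
        x = torch.zeros(A.shape[0])
        ys = []
        for u_k in u:                 # x_k = A x_{k-1} + B u_k ; y_k = C . x_k
            x = A @ x + B * u_k
            ys.append(C @ x)
        return torch.stack(ys)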
Score: 0.8739845625117002
User feedback: None
Out links: 205037 Raw text: https://gwern.net/doc/www/arxiv.org/d1278072a7a1822674440ddd0c6c820abc5b2e19.pdf
When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute. Tao Lei, ASAPP, Inc. [email protected]. Abstract: Large language models have become increasingly difficult to train because of the growing computation time and cost. In this work, we present SRU++, a highly-ef...
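Note: for context on the "fast recurrence" half, the underlying SRU cell keeps only elementwise operations in its sequential part, with all matrix multiplies batched over time. A simplified, unoptimized sketch; SRU++ inserts an attention sub-layer before this, and the real cell has extra scaling/highway details.

    import torch

    def sru(x, W, bf, br):
        # x: (T, d). W: (d, 3d) yields candidate, forget, reset pre-activations
        # in one batched matmul; the loop below is purely elementwise.
        T, d = x.shape
        cand, f_pre, r_pre = (x @ W).split(d, dim=-1)
        c = torch.zeros(d)
        out = []
        for t in range(T):
            f = torch.sigmoid(f_pre[t] + bf)
            r = torch.sigmoid(r_pre[t] + br)
            c = f * c + (1 - f) * cand[t]          # light recurrence
            out.append(r * c + (1 - r) * x[t])     # highway connection
        return torch.stack(out)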
Score: 0.8699225284191421
User feedback: None
Out links: 3268773 Raw text: https://cs.stanford.edu/~diyiy/docs/acl21_hiddencut.pdf
HiddenCut: Simple Data Augmentation for Natural Language Understanding with Better Generalization. Jiaao Chen, Dinghan Shen1, Weizhu Chen1, Diyi Yang. Georgia Institute of Technology, 1 Microsoft Dynamics 365 AI. {jchen896,dyang888}@gatech.edu, {dishen,wzchen}@microsoft.com. Abstract: Fine-tuning large...
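Note: the augmentation itself is small: during fine-tuning, drop a contiguous span of hidden states rather than independent units. A hedged sketch with random span selection; the paper additionally picks spans using attention-derived importance, which is omitted here.

    import torch

    def hiddencut(hidden, cut_ratio=0.1, training=True):
        # hidden: (batch, seq_len, d). Zero one contiguous span per example.
        if not training:
            return hidden
        B, L, _ = hidden.shape
        span = max(1, int(L * cut_ratio))
        starts = torch.randint(0, L - span + 1, (B,))
        mask = torch.ones(B, L, 1, device=hidden.device)
        for b in range(B):
            mask[b, starts[b]:starts[b] + span] = 0.0
        return hidden * mask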
Score: 0.8686906632095005
User feedback: None
Out links: 1193705 Raw text: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1244/final-projects/ChiYoTsaiJayMartin.pdf
Outrageously Fast LLMs: Faster Inference and Fine-Tuning with Moefication and LoRA. Stanford CS224N Custom Project. Mentor: Tony Wang. Chi Tsai* (Department of Computer Science, Stanford University) [email protected], Jay Martin† (Department of Computer Science, Stanford University) [email protected] ...
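Note: of the two pieces, LoRA is the simpler to sketch: freeze the pretrained weight and learn a low-rank update, y = Wx + (alpha/r)·BAx. A minimal generic version follows (rank and scaling are illustrative; the MoEfication side, splitting FFNs into experts, is not shown).

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, d_in, d_out, r=8, alpha=16):
            super().__init__()
            self.base = nn.Linear(d_in, d_out)
            self.base.weight.requires_grad_(False)        # frozen pretrained weight
            self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
            self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: update starts at 0
            self.scale = alpha / r
        def forward(self, x):
            return self.base(x) + (x @ self.A.T) @ self.B.T * self.scale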
Score: 0.8584212557932718
User feedback: None
Out links: 205013 Raw text: https://gwern.net/doc/www/arxiv.org/c639528ca3cdba458c1e52f61e42863dce9599d7.pdf
Adaptive Multi-Resolution Attention with Linear Complexity. Yao Zhang*1, Yunpu Ma*1, Thomas Seidl1, Volker Tresp1,2. 1 Institute of Informatics, LMU Munich; 2 Corporate Technology, Siemens AG. arXiv:2108.04962v1 [cs.LG] 10 Aug 2021. [email protected], [email protected], [email protected]...
Score: 0.8480798445298792
User feedback: None
Out links: 205011 Raw text: https://gwern.net/doc/www/arxiv.org/0701285b128d3a748a3bf37d457c72010d08fe46.pdf
Finetuning Pretrained Transformers into RNNs. Jungo Kasai♡*, Hao Peng♡, Yizhe Zhang♣, Dani Yogatama♠, Gabriel Ilharco♡, Nikolaos Pappas♡, Yi Mao♣, Weizhu Chen♣, Noah A. Smith♡♢. ♡ Paul G. Allen School of Computer Science & Engineering, University of Washington; ♣ Microsoft; ♠ DeepMind; ♢ Allen Institute for ...
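Note: the conversion trick is to replace exp(q·k) attention with a kernelized form phi(q)·phi(k) so the sum over past tokens can be carried as a running state. A non-causal sketch with a small learned feature map; the MLP size and the swap-then-finetune recipe details are assumptions here.

    import torch
    import torch.nn as nn

    class LinearAttention(nn.Module):
        def __init__(self, d, d_feat=64):
            super().__init__()
            self.phi = nn.Sequential(nn.Linear(d, d_feat), nn.ReLU())
        def forward(self, q, k, v):                      # each (B, L, d)
            q, k = self.phi(q), self.phi(k)
            kv = torch.einsum("blf,bld->bfd", k, v)      # sum_l phi(k_l) v_l^T
            z = k.sum(dim=1)                             # sum_l phi(k_l)
            num = torch.einsum("blf,bfd->bld", q, kv)
            den = torch.einsum("blf,bf->bl", q, z).unsqueeze(-1)
            return num / (den + 1e-6)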
Score: 0.8459963594865105
User feedback: None
Out links: 205034 Raw text: https://gwern.net/doc/www/arxiv.org/86693d8a9469f413a8b2735801feaa1a9d0dc50c.pdf
RWKV: Reinventing RNNs for the Transformer Era. Bo Peng1*, Eric Alcaide2,3,4*, Quentin Anthony2,5*, Alon Albalak, Samuel Arcadinho2,7, Huanqi Cao8, Xin Cheng9, Michael Chung10, Matteo Grella11, Kranthi Kiran GV12, Xuzheng He2, Haowen Hou13, Przemysław Kazienko14, Jan Kocoń14, Jiaming Kong15, Bartłomiej Koptyra14, H...
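Note: the heart of RWKV is the WKV operator: an attention-like weighted average over past values with a learned per-channel exponential decay w and a "bonus" u for the current token, computable as a recurrence. A simplified serial sketch; the released kernels use a numerically stabilized form and token-shift mixing not shown here.

    import torch

    def wkv(w, u, k, v):
        # w, u: (C,) per-channel decay and current-token bonus; k, v: (T, C).
        T, C = k.shape
        num = torch.zeros(C)
        den = torch.zeros(C)
        decay = torch.exp(-w)                 # assumes w > 0
        out = []
        for t in range(T):
            e_now = torch.exp(u + k[t])       # current token gets the bonus
            out.append((num + e_now * v[t]) / (den + e_now))
            e_k = torch.exp(k[t])
            num = decay * num + e_k * v[t]    # fold token t into the state
            den = decay * den + e_k
        return torch.stack(out)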
Score: 0.8459471281443147
User feedback: None
Out links: 205201 Raw text: https://gwern.net/doc/www/openreview.net/45f3d6c27e2b3f5b53fe6ecd14b4b122a8470ac6.pdf
Under review as a conference paper at ICLR 2022. A Dot Product Attention Free Transformer. Anonymous authors, paper under double-blind review. Abstract: We introduce Dot Product Attention Free Transformer (DAFT), an efficient variant of Transformers (Vaswani et al., 2017) that eliminates the query...
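Note: for intuition about what eliminating the query-key dot product can look like, here is the simplest member of the AFT family that DAFT builds on (AFT-simple): values are pooled with a softmax over keys alone and gated by a sigmoid of the query, so no L×L attention matrix is ever formed. DAFT's factorized position biases are not reproduced.

    import torch

    def aft_simple(q, k, v):
        # q, k, v: (B, L, D). No q @ k.T matrix anywhere.
        weights = torch.softmax(k, dim=1)                # normalize over positions
        pooled = (weights * v).sum(dim=1, keepdim=True)  # (B, 1, D)
        return torch.sigmoid(q) * pooled                 # gate per query position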
Score: 0.8456152222090448
User feedback: None
Out links: 205044 Raw text: https://gwern.net/doc/www/arxiv.org/9aaf30e79a8b51c86a764b0b8eb725004fbddd32.pdf
Current Limitations of Language Models: What You Need is Retrieval. arXiv:2009.06857v1 [cs.CL] 15 Sep 2020. Aran Komatsuzaki, Georgia Institute of Technology, EleutherAI, [email protected]. Abstract: We classify and re-examine some of the current approaches to improve the performance-com...
Score: 0.8450740707172638
User feedback: None
Out links: 205046 Raw text: https://gwern.net/doc/www/arxiv.org/2c84075b5f7b38e98ad6ee0739e9c30f23ab3778.pdf
Luna: Linear Unified Nested Attention. Chunting Zhou (LTI, CMU) [email protected], Xiang Kong* (LTI, CMU) [email protected], Jonathan May (ISI, USC) [email protected], Sinong Wang* (Facebook AI) [email protected], Hao Ma and Luke Zettlemoyer (Facebook AI) {haom, lsz}@fb.com. Abstract: The quadratic computational and...
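Note: the "nested" structure is two vanilla attentions arranged so that no L×L matrix appears: a fixed-length memory first attends over the sequence (pack), then the sequence attends over the packed memory (unpack). A single-head, unbatched sketch; Luna's specific normalization and how the memory propagates across layers are omitted.

    import torch
    import torch.nn.functional as F

    def attend(q, k, v):
        scores = q @ k.T / k.shape[-1] ** 0.5
        return F.softmax(scores, dim=-1) @ v

    def luna_block(x, memory):
        # x: (L, D) sequence; memory: (M, D) with M << L.
        packed = attend(memory, x, x)        # pack:   (M, D), cost O(M*L)
        return attend(x, packed, packed)     # unpack: (L, D), cost O(L*M)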
Score: 0.8448275767669189
User feedback: None
Out links: 205076 Raw text: https://gwern.net/doc/www/arxiv.org/6cab03ecf704e10f4f43f732577daf01daa03a1b.pdf
Memory Transformer. arXiv:2006.11527v2 [cs.CL] 16 Feb 2021. Mikhail S. Burtsev (Neural Networks and Deep Learning Lab, Moscow Institute of Physics and Technology, Dolgoprudny, Russia) [email protected], Yuri Kuratov (Neural Networks and Deep Learning Lab, Moscow Institute of Physics and Technology, Dolgo...
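Note: the basic mechanism is just a learned prefix: a handful of trainable [mem] vectors are concatenated before the token embeddings so standard self-attention can use them as scratch storage. Minimal sketch, sizes illustrative.

    import torch
    import torch.nn as nn

    class MemoryPrefix(nn.Module):
        def __init__(self, n_mem=10, d_model=512):
            super().__init__()
            self.mem = nn.Parameter(torch.randn(n_mem, d_model) * 0.02)
        def forward(self, x):                  # x: (B, L, D) embedded tokens
            mem = self.mem.unsqueeze(0).expand(x.size(0), -1, -1)
            return torch.cat([mem, x], dim=1)  # (B, n_mem + L, D) into the encoder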
Score: 0.8430499062025062
User feedback: None
Out links: 352812 Raw text: http://proceedings.mlr.press/v80/kamnitsas18a/kamnitsas18a.pdf
Semi-Supervised Learning via Compact Latent Space Clustering. Konstantinos Kamnitsas1,2, Daniel C. Castro1,2, Loic Le Folgoc2, Ian Walker2, Ryutaro Tanno1,3, Daniel Rueckert2, Ben Glocker2, Antonio Criminisi1, Aditya Nori1. Abstract: We present a novel cost function for semi-supervised learning of ne...
Score: 0.8426350711444989
User feedback: None
Out links: 3885218 Raw text: https://www.mit.edu/~gfarina/2023/escher_iclr23/2206.04122.pdf
ESCHER: Eschewing Importance Sampling in Games by Computing a History Value Function to Estimate Regret. arXiv:2206.04122v2 [cs.GT] 11 Oct 2022. Stephen McAleer (Carnegie Mellon University) [email protected], Gabriele Farina (Carnegie Mellon University) [email protected], Marc Lanctot, DeepMi...
Score: 0.8418470832430927
User feedback: None
Out links: 205114 Raw text: https://gwern.net/doc/www/openreview.net/e998caa668bfed59cb006c4f3cd8de1b4620cc05.pdf
Published as a conference paper at ICLR 2021. Random Feature Attention. Hao Peng♠*, Nikolaos Pappas♠, Dani Yogatama♣, Roy Schwartz♥, Noah A. Smith♠♦, Lingpeng Kong♦*. ♠ Paul G. Allen School of Computer Science & Engineering, University of Washington; ♣ DeepMind; ♦ Allen Institute for Artificial Intelligenc...
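Note: the enabling fact is the classic random-feature estimate: for rows w_i ~ N(0, I), phi(x) = [sin(Wx); cos(Wx)] / sqrt(m) satisfies E[phi(q)·phi(k)] = exp(-||q-k||^2 / 2) (Rahimi & Recht), which RFA then uses to linearize softmax attention. The snippet checks only the kernel estimate, not the full attention layer or its gating.

    import torch

    def phi(x, W):
        proj = x @ W.T                              # (m,) random projections
        return torch.cat([proj.sin(), proj.cos()]) / W.shape[0] ** 0.5

    d, m = 16, 4096
    W = torch.randn(m, d)
    q, k = 0.3 * torch.randn(d), 0.3 * torch.randn(d)
    approx = phi(q, W) @ phi(k, W)
    exact = torch.exp(-(q - k).pow(2).sum() / 2)    # the two should be close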
Score: 0.8415860748933458
User feedback: None
Out links: 3010548 Raw text: https://homepages.inf.ed.ac.uk/csutton/publications/nota-ir403.pdf
Fast, Piecewise Training for Discriminative Finite-state and Parsing Models. Charles Sutton and Andrew McCallum, Department of Computer Science, University of Massachusetts Amherst, Amherst, MA 01003 USA. {casutton,mccallum}@cs.umass.edu. Abstract: Discriminative models for sequences and trees—such as lin...
Score: 0.8366056065500114
User feedback: None
Out links: 205074 Raw text: https://gwern.net/doc/www/arxiv.org/f63a0b34378396bff253d974efc8664d5620489c.pdf
Sub-Linear Memory: How to Make Performers SLiM. Valerii Likhosherstov1, Krzysztof Choromanski2,3, Jared Davis4,5, Xingyou Song2, Adrian Weller1,6. arXiv:2012.11346v1 [cs.LG] 21 Dec 2020. Abstract: The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitous i...
Score: 0.8311770836835274
User feedback: None
Out links: 205184 Raw text: https://gwern.net/doc/www/arxiv.org/59fea814f374d53b61961507bc80c351f6526a48.pdf
Shortformer: Better Language Modeling Using Shorter Inputs. Ofir Press1,2, Noah A. Smith1,3. 1 Paul G. Allen School of Computer Science & Engineering, University of Washington; 2 Facebook AI Research; 3 Allen Institute for AI. [email protected]. arXiv:2012.15832v2 [cs.CL] 3 Jun 2021. Abstract: Incr...
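Note: the abstract is truncated here, but one of the paper's methods is position-infused attention: add position embeddings to queries and keys but not to values, so cached token representations stay position-free and can be reused when the window slides. An unbatched, unmasked sketch:

    import torch
    import torch.nn.functional as F

    def position_infused_attention(x, pos, Wq, Wk, Wv):
        # x: (L, D) token states; pos: (L, D) position embeddings; W*: (D, D).
        q = (x + pos) @ Wq
        k = (x + pos) @ Wk
        v = x @ Wv                       # values stay position-free -> cacheable
        att = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
        return att @ v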
Score: 0.8282240770563003
User feedback: None
Out links: 3010570 Raw text: https://homepages.inf.ed.ac.uk/csutton/publications/lcrf.pdf
Center for Intelligent Information Retrieval Technical Report IR-383, presented at the NIPS '04 workshop on Learning with Structured Outputs. Piecewise Training with Parameter Independence Diagrams: Comparing Globally- and Locally-trained Linear-chain CRFs. Andrew McCallum and Charles Sutton, Department of ...
Score: 0.8277952379086098
User feedback: None
Out links: 205093 Raw text: https://gwern.net/doc/www/arxiv.org/4fc812c0cf44dfcb5d667e5729e5db10e3b1da8d.pdf
BP-Transformer: Modelling Long-Range Context via Binary Partitioning. Zihao Ye†, Qipeng Guo†‡*, Quan Gan†, Xipeng Qiu‡, Zheng Zhang†§. † AWS Shanghai AI Lab; ‡ Fudan University; § New York University Shanghai. {yeziha, gqipeng, quagan, zhaz}@amazon.com, [email protected]. Abstract. arXiv:1911.04070v1 [...
Score: 0.8270818663345989
User feedback: None
Out links: 2124217 Raw text: http://www.cs.toronto.edu/~hinton/absps/googlerectified.pdf
On Rectified Linear Units for Speech Processing. M.D. Zeiler1*, M. Ranzato2, R. Monga2, M. Mao2, K. Yang2, Q.V. Le2, P. Nguyen2, A. Senior2, V. Vanhoucke2, J. Dean2, G.E. Hinton3. 1 New York University, USA; 2 Google Inc., USA. Abstract: Deep neural networks have recently become the gold st...
Score: 0.8235656491857266
User feedback: None
Out links: 2123874 Raw text: http://www.cs.toronto.edu/~hinton/absps/distillation.pdf
Distilling the Knowledge in a Neural Network. arXiv:1503.02531v1 [stat.ML] 9 Mar 2015. Geoffrey Hinton*† (Google Inc., Mountain View) [email protected], Oriol Vinyals† (Google Inc., Mountain View) [email protected], Jeff Dean (Google Inc., Mountain View) [email protected]. Abstract: A very simple way to ...
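Note: the "very simple way" is the temperature-softened soft-target loss. A standard sketch; the alpha weighting is a free choice, and the T² factor keeps soft-target gradients on the same scale as the hard-label ones, as discussed in the paper.

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
        hard = F.cross_entropy(student_logits, targets)
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        F.softmax(teacher_logits / T, dim=-1),
                        reduction="batchmean") * (T * T)
        return alpha * hard + (1 - alpha) * soft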
Score: 0.822539174104687
User feedback: None
Out links: 1250097 Raw text: Natural Language Processing with Deep Learning, CS224N/Ling284. Christopher Manning (based on a lecture by Antoine Bosselut). Lecture 12: Neural Language Generation. Today: a bit more on projects and Natural Language Generation. • A few more final project thoughts and tips. 1. What is NLG? 2. The simpl...
Score: 0.8206242059492121
User feedback: None
Out links: 3162533 Raw text: https://cs.stanford.edu/~ermon/papers/imitation_nips2016_main.pdf
Generative Adversarial Imitation Learning. Jonathan Ho (OpenAI) [email protected], Stefano Ermon (Stanford University) [email protected]. Abstract: Consider learning a policy from example expert behavior, without interaction with the expert or access to a reinforcement signal. One approach is to recover...
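Note: in outline, GAIL trains a discriminator to tell expert state-action pairs from the policy's, and rewards the policy for fooling it; the policy update itself (TRPO in the paper) is not shown. A sketch with an assumed 10-dimensional concatenated state-action input:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    disc = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 1))

    def discriminator_loss(expert_sa, policy_sa):
        # Expert pairs labeled 1, policy pairs labeled 0 (logistic loss).
        return (F.binary_cross_entropy_with_logits(
                    disc(expert_sa), torch.ones(len(expert_sa), 1)) +
                F.binary_cross_entropy_with_logits(
                    disc(policy_sa), torch.zeros(len(policy_sa), 1)))

    def surrogate_reward(sa):
        # High where the discriminator mistakes the policy for the expert.
        return -torch.log(1 - torch.sigmoid(disc(sa)) + 1e-8)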