Hey everyone, it's Hugo Bowne-Anderson here. I'm super excited to be here today with Simon Willison to talk about the Zen of Python, Unix, LLMs, and how Python, data, and generative AI have developed to where we are today, and to think about what the future, the bright future, may bring. If you're able to introduce yourself in the chat, it'd be great to know where you're calling in from, where you're watching from, what type of work you do, and why you're interested in these types of things, whether you're a hobbyist or a machine learning or AI engineer. Let us know in the chat and we'll get started in a couple of minutes. It's also super exciting to already see 70 people watching now, and if you like this type of thing, tell your friends and get them to join as well.

All right everyone, it's Hugo Bowne-Anderson here. We're about to start our fireside chat for February with Simon Willison, to talk all about Python, Unix, and LLMs. I'm particularly excited for a number of reasons, but Simon is someone who has been active in so many different parts of the Python ecosystem: from the web framework side, being a co-creator of Django, to all the generative AI and LLM stuff we're seeing at the moment, to the data side with his wonderful Datasette project, among other things, and these are things we'll get into. If you wouldn't mind introducing yourself in the chat, we'll get started in less than 60 seconds. I've got my hourglass timing, and it's just wonderful to see people from all around the place: from New England, from California, from Dallas (it's USA-centric so far today), Portland. Oh yeah, Jonathan Whitaker, hey mate. Katie Johnson in Seattle, great to see you, Katie. Rohit, who's an AI engineer at Ford. People working in SaaS tools, people who are beginners as well, product managers wanting to learn about LLMs. So that's just super
exciting. All right, well, I'm too excited not to get started now, so Simon, why don't we turn our cameras on and jump in?

Yeah, let's do this. Hey Hugo, it's great to be here.

Hey mate, how are you?

I'm pretty good.

Fantastic, great to see you. And you're on the west coast of the United States of America?

I am. I'm in Half Moon Bay, so about half an hour south of San Francisco.

Great, is that close to Santa Cruz?

Yes, it's between San Francisco and Santa Cruz, basically on that stretch of the Pacific Coast Highway.

Beautiful part of Highway 1, one of my favorite drives in the world.

Yeah, pretty great. Beautiful down there.

And we have people watching from Melbourne, Australia as well, and Austin, Texas, and we've already got 115 people joining, so that's super exciting. Stakes are high. No, I'm just kidding. So let's jump in. I do want to say a few things first. I'm head of DevRel at Outerbounds, where we work on infrastructure and productivity tools for data scientists. One thing we work on is something originally built at Netflix called Metaflow, which is an open source framework to help data scientists do data science and not have to worry about configuration and YAML and all of those types of things. So you might like things such as our @kubernetes decorator, which scientists can use to access resources such as Kubernetes. I'm going to put a link to our GitHub in the chat if anyone's interested. People are talking about Taco Bell and Texas in the chat, which is fantastic; I also love Torchy's Tacos in Austin, Texas. And one other thing before we get started: the next fireside chat, in a month, which I'm also very excited for, is with Peter Wang, who is, among other things, the CEO at Anaconda.
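As an aside for readers: the resource-decorator idea Hugo mentions a moment earlier (Metaflow's @kubernetes decorator) can be sketched in plain Python. This is an illustrative toy of the general pattern, not Metaflow's actual implementation; the function names and dictionary layout here are invented for the example.

```python
# Toy sketch of the pattern a decorator like Metaflow's @kubernetes follows:
# the decorator attaches resource requirements to a step function, and the
# framework later reads them off to schedule the step on a cluster.
# (Names and internals here are hypothetical, for illustration only.)

def kubernetes(cpu=1, memory=4096):
    def wrap(fn):
        # Record the requested resources as metadata on the function itself.
        fn.resources = {"cpu": cpu, "memory": memory}
        return fn
    return wrap

@kubernetes(cpu=2, memory=8192)
def train_model():
    return "trained"

# A scheduler could now inspect train_model.resources and request a
# matching Kubernetes pod before running the step.
print(train_model.resources)  # {'cpu': 2, 'memory': 8192}
```

The appeal, as Hugo describes it, is that the data scientist only writes the decorator line; the framework handles the YAML and cluster configuration behind it.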
Just as you were instrumental in a lot of aspects of the Python ecosystem, Simon, including the framework side with Django, Peter has been instrumental very much on the PyData side, in terms of the NumPy, scikit-learn, pandas stack, all of those types of things.

Is he in Australia?

No, he's actually in Austin, Texas.

Ah, okay.

But he grew up in Tennessee.

Gotcha, okay. But of course he started Continuum Analytics, which then became Anaconda, and they do a lot of work, and he was also one of the creators of PyData and NumFOCUS, in order to figure out how to fund the PyData ecosystem, essentially. So that'll be fun, if anyone's interested in joining for that. But I'm here with Simon Willison, so I'm going to introduce you, Simon, and you can correct anything I get incorrect. You're the creator of Datasette, which I'm excited to talk about, an open source tool for exploring and publishing data, with the intention that it be usable by non-technical people. And I actually saw this morning that you'll soon be going to a journalism conference to teach people how to do that.

Datasette was originally inspired by data journalism. There are lots of ideas behind it, but the big one is that journalists work with data a lot. They need tools that let them do the kinds of things you'd normally need a small army of data analysts for, except you're a newspaper, you don't have that. And they also need to be able to publish the data, because when you're telling a story with data, it lands a lot better, it's a lot more trustworthy, if the data behind the story is published as well. So Datasette started out as the best way to put data online, and it's since been growing into ways of analyzing data and cleaning data, and there's a whole realm of things around it with the plugin system, which I'm sure we'll talk about in a little bit.

Fantastic. And of course, as I've already mentioned, you're a co-creator of Django, you're a member of the PSF (Python Software Foundation) board, an LLM aficionado who somewhat recently
coined the term "prompt injection", and an active poster on Hacker News. So these are a bunch of things I find interesting about you. But one thing I found when we first spoke is just how excited you are about how software can help everybody automate tasks in their work and daily lives, whether they have CS degrees or not. So we're here to talk about a lot of things, from the Python side to the data side, but also how LLMs, among other types of generative models, have this promise but may not have delivered yet, and how they might in future as well.

Yeah, I mean, for me, there are so many dark visions of our sort of AI-enhanced future, and there are a lot of concerns that are very legitimate. But on the optimistic side, my sort of utopian version of this is that I think we might finally be able to get to a point where human beings can automate things in their lives using computers. Because we're in this absurd state right now where you kind of need a computer science degree to automate a tedious thing with a computer; beyond Microsoft Excel, you kind of hit a wall in terms of what you can get done. And I'm seeing little hints that maybe LLMs are the tool that gets us past that, that maybe we've finally found something we can use to build things, so that if somebody has something repetitive and tedious they need to get done, they can automate that thing, ideally without having to depend on anything else, without having to pay for it, just using the devices they have available to them. And that really excites me. If we can solve that, well, it's been called "end-user programming" in the past: people shouldn't have to learn to program in order to automate computers. That's the thing I'm most excited about, really.

Fantastic. We have a lot of comments in the chat, but someone has just written: thank you, Simon, for all you've created, such
that we can be better developers. So I'd like to start with Python, then move to data, then move into generative AI and LLMs; I think that makes sense as a flow here. As someone who's been working in the Python space for decades now, I'm interested in why you think Python, quote unquote, "won". For example, Ruby was a really strong contender in web development, Go is high performance, JavaScript has a lot of similar benefits. Why are we all writing Python code now, or getting ChatGPT or Copilot to write Python code for us?

I think there's a few things here. One of my favorite ideas about Python: Python is the second-best programming language for everything you could possibly want to do. Pick any area and there will be a language that is a better fit, but it might not be good for everything else. Python is general purpose enough, and very good at a whole bunch of things, so that if you learn Python you can then take on web development, and then you can take on data science, and you can even do GUI development. There's so much stuff you can do if you've got Python in your pocket. And because Python's background was as a teaching language, you know, it evolved from work that Guido van Rossum was doing on educational programming languages, it's got a very decent learning curve for people getting started. I love that hello world in Python is print, parenthesis, "hello world"; that's just such a great way to help people get started with the language. But it grows really well: you can use Python for a one-line script, and you can use Python for a million-line giant monolith, and it'll fit both of those things; it has characteristics for all of that. And then on top of that, I feel like Python got a little bit lucky, in that as data science started to take off, Python was very well equipped for it because of projects like pandas and NumPy and so forth. And then when the AI boom came along, Python was right there with
all of those things already in place. You had libraries like PyTorch that are at the center of pretty much everything anyone's doing in AI, and that's been amazing. So I think really it comes down to being general purpose, and being old enough now. There's this wonderful phrase, "boring technology": there's advice that whenever you're building something, try to pick the most boring technology to build it on, so that any problem you run into, other people have seen before. You'll find the answers to all of your problems, and you can focus your creativity on solving the one unique set of problems that's special to your product. And this is something I'm very proud of: Django itself is definitely boring technology now, which is thrilling to me. I love that something I helped build is now categorized as so uninteresting that you can just default to it for building projects and know that you won't run into any problems.

Absolutely. And I'm glad you mentioned pandas as well, because my background's in scientific research, and then I moved into data science and machine learning, and that's how I got into Python. It was when pandas' read_csv came out that I think the game really changed for a lot of us working in scientific research. And then of course all the things built on top of that, with the Jupyter ecosystem and notebooks as well.

Give a shout-out to Jupyter notebooks. I remember I was playing with IPython, which was the original project that became Jupyter; it was just a really good terminal, and I loved it. And when I saw there was an IPython version that ran in the browser, I thought that was a dumb idea; I'm like, why would you ever want that? It took me an embarrassingly long time to cotton on to quite how incredible the IPython notebooks, and then Jupyter notebooks, were. And I've been using them on a daily basis for six or seven years now.
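The pandas read_csv call Hugo credits with changing the game really is a one-liner. A minimal sketch, using an in-memory string in place of a file path (the column names here are invented for the example):

```python
# One line turns a CSV (normally a file path or URL; here an in-memory
# string standing in for one) into a queryable DataFrame.
import io
import pandas as pd

csv_text = """sample,expression
gene_a,1.5
gene_b,0.7
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2)
```

In practice you would write `pd.read_csv("results.csv")` and immediately have filtering, grouping, and plotting available, which is much of why it resonated with scientists.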
They're such a great tool. I love exploratory programming, and I love a REPL: I love being able to type a line of code, hit enter, and see what it does. Jupyter is basically that idea taken as far as you can possibly take it. I think it's an incredible boon to the Python world.

Definitely, absolutely. And the idea of a REPL is the idea of science, in a lot of ways: having these environments where you run a short experiment, you see the result, you iterate. And Jupyter notebooks, well, I think Mathematica notebooks were probably one of the original literate programming environments, but I used to work in biology, so I worked with a lot of people who had their biology notebooks, where they'd write down their experiments and paste in their PCR gels. And of course computational notebooks are exactly a computational version of that, where you get to run experiments inline and then see the results, which is super cool. So I want to talk about Django for a bit. Django's nearly two decades old, so congratulations.

It's two decades old now, in a sense. We started working on it in 2003; the public release of Django was July 2005, so it's not quite at its 20th public birthday yet, but it's rolling up pretty soon. We should have a party next year. We had an amazing 10th birthday party for it: we all went back to Lawrence, Kansas, where it was born, and had Django's 10th birthday. It'd be pretty amazing to have a 20th.

Amazing. Also, someone's just commented in the chat that they're from the Bay Area and they use Simon's LLM CLI daily. Awesome; we're going to show that
off a little bit at the end, I think.

Yeah, we're going to do a demo of that a bit later.

But if you had a time machine, what would you tell the Simon of 2005 about how the web would develop, and what do you think would surprise him?

I've got the cynical version of that, which is: back in 2005 we thought that if you make the sum total of human knowledge available to every human being, they'll make better decisions with their lives and everything will be a beautiful utopia. And it turns out that doesn't necessarily work out that way, you know. But I don't think I'd give myself the cynical angle on it. One of the things I love is that back in 2005, two projects that were just getting started were Wikipedia and OpenStreetMap, and both of those projects were just obviously stupid, right? You can't just have an encyclopedia anyone can edit. And OpenStreetMap, I remember when that started, and they were like, yeah, we're going to get on our bicycles with GPS trackers and do a map of the whole world. And that worked. They both worked so incredibly well, and that's a super inspiring thing, right? We have this substrate, the internet, where if you get the incentives right and you build the right projects and get people excited, you can build utterly extraordinary things. So I feel like that's the positive thing I'd tell myself back then.

Absolutely. Speaking of the web taking decades to mature and become mainstream, do you think LLMs and generative AI will follow a similar path, or will it happen faster or slower?

I mean, it's happening pretty fast already, right? A year ago, where are we, February 28th, a year ago we did not have any decent models that ran on a laptop, and I didn't think it was possible. I thought you needed a rack of servers to run a language model. And then LLaMA came out, actually in February last year, so just a year ago, and
everything took off from there. The other thing that's really interesting about language models is how they're actually more useful to individuals than they are to companies at the moment, which feels very unexpected, right? Individuals who learn to use them can apply them in so many different ways that help them learn things and be more productive, or do awful stuff as well. For companies, other than buying a ChatGPT license for everyone who works for them, it's surprising to me how few big breakout LLM-driven successes we've seen from startups and larger companies. And that's kind of cool. I kind of like that this very hyped new technology is actually more useful for individuals than it is for organizations at the moment.

Incredible. Actually, I'm working on a blog post at the moment about educating people around the atomic units of different generative models and then stitching them together, and we'll get to the ideas of the Unix philosophy in particular soon. I sent it to a business friend, and he said, wait a second, I thought LLMs were just for text generation; I didn't realize they could do summarization or translation or analysis. So I showed him. I said, tell me your top five competitors; let's put their websites into ChatGPT and ask for a competitive analysis of the space. And he was like, you can do that? So even the general-purpose nature of it is wild.

But they don't have a manual, right? The chat interface is the worst interface for learning what these things can do, because it's like trying to learn the Unix command line when there's no one to guide you through it, you know. And one thing that worries me is that a lot of people try it out, and they
ask it a maths question and it gets it wrong, which is ludicrous, because it's a computer and computers are supposed to be able to do maths. And then they ask it to look up a fact and it gets that wrong as well. The two things computers are good at, it's terrible at, which puts people off. And then the other problem is that most people interact with the free version, ChatGPT 3.5. The leap to GPT-4 is so huge that I feel like we've already got this weirdly stratified society, where there are those of us who have figured out GPT-4, figured out all the things it can do and all the things it can't do, which is even more important: you have to know not to give it certain kinds of questions and problems. As a result, we're finding enormous value from these systems, and meanwhile there's lots of people who've tried it once and it was clearly rubbish, and they're like, this is dumb hype, everyone who's into this stuff is deluding themselves. And with good reason, because they saw the evidence with their own eyes that it was a bunch of [ __ ]. Only it's not. And that's one of the things I'm most passionate about: trying to figure out how we teach people to use these systems. Which is hard, because so much of using them is down to intuition, and I don't know how to transfer my intuition from my head into someone else's head. I can look at your prompt and say, that's definitely not going to work, or, that's going to give you really good results, but I can't tell you why I know those things; it's just all of the experience I've built up. So when people first use ChatGPT, I try to encourage them to come up with a question it will get obviously wrong, because you don't want people thinking it's omniscient, that it knows everything and will never make a mistake. The sooner you can get it
making an obvious mistake, the better, because it sort of inoculates you against the hallucinations. One of the best ones is: pick a friend of yours who has enough of an internet presence that it will know a bit about them, and then start asking questions, and it will completely make up where they went to university. Or, like, various models say I've been the CTO of GitHub, but I have not been the CTO of GitHub. That's kind of useful, seeing it make those mistakes.

Yeah, it says I went to Harvard and I didn't, but I'm okay with that misinformation; I don't have a huge issue with that one. But also, to your point, something about it which isn't like normal software is that it isn't reproducible; it does different things all the time.

And if you ask a question in the middle of a conversation, it's really hard to get it to backtrack, right? It seems to get stuck in a local minimum; you have to understand the state. People are paranoid that it remembers everything you tell it; they're worried that anything you say to it will teach the model and it'll spit that out to other people. That's kind of not true, because it's a complete blank slate every time you start a new session. Except OpenAI say they will train future models on your input, but they won't tell you what that means. So I have no idea: if I paste my social security number into GPT-4, will it spit it out to somebody else in six months' time? I don't think so, but all of this stuff is so hurt by this lack of transparency. There are so many things where the companies building these things are very secretive about how the training process works. We don't even know how big GPT-4 is, still; it's been out for over a year, and that's infuriating. It's one of the many reasons I'm so excited about all of the openly licensed models that we're getting to play with
now. Although even those, most of the good ones, are still secretive: I don't know what Mistral was trained on, you know; they never released that information. But yeah, it's the most fascinating area of computer science I've ever encountered in my entire career, because every single inch of it that you look at just raises more questions. It's fractally interesting.

Yeah, and it doesn't feel like computer science. It definitely doesn't feel like science, but it doesn't really feel like computers either.

Well, I've gotten in trouble in the past for comparing it to magic. I'm like, look, you have to understand, you're basically a wizard and you're learning spells, and if you mispronounce one of the spells, demons might pop out. There are people who will argue, reasonably, that you should never compare it to magic, because that implies people can't ever understand it, and we need people to understand: it's matrix arithmetic, right? It's just a big ball of numbers. These things may be super weird, but that's what they are; they're not science fiction, even though they feel like science fiction all the time. But yeah, there are so many analogies you can throw at this, and all of them are flawed in different ways.

Exactly. And to your point, I think analogies are wonderful, because you see how they resonate, and then you can also discuss the flaws; that's one of the beautiful things about analogies. But I do want to get back to productivity. I've pasted a blog post of yours in the chat, called "AI-enhanced development makes me more ambitious with my projects": not only more productive, but more ambitious. I'm really interested in what you mean by that.

That one's a bit of a two-edged sword, in a way. One of the problems GPT-4 has solved for me is the need to remember any piece of trivia about any
programming language. I used to not work in Go, because I never committed the syntax of Go completely to mind, so any time I sat down to write some Go I'd have to look up a for loop, and then I'd have to look up what some symbol means, and all of that. If you don't use a language on a weekly or daily basis, you never quite develop the fluency you need to be super productive. That doesn't matter anymore, because GPT-4 and Copilot know the syntax. I can write a comment that says "loop through every item in this slice and do whatever to it", and boom, it spits it out. So as a result I've been shipping code to production in programming languages I'm not fluent in, which I never used to do. I have a high-performance Go web server running which has full unit test coverage and is deployed via GitHub Actions CI: if the tests pass, it deploys, the whole works, everything I think is important about writing reliable software, in a language I don't really know. But I can read it; I know enough Go to read it and test it and make sure it's doing the right thing. That's wild to me. I wrote a thing in AppleScript, and AppleScript is a notoriously read-only language: you can read some AppleScript and guess what it does, but you will never guess the incantations needed to write something. But GPT-4 knows AppleScript, so I could just tell it what I wanted, get a script, and start using it. I do things in Bash and Zsh, and I use jq all the time, all of these different things. What this adds up to is that I will have an idea for a project, and it used to be that I'd think, okay, that's going to take me two or three days; I cannot justify spending two or three days on this, even though it would be kind of neat. But now I look at it and think, you know what, I think this will take me about an hour to get a working prototype, and I can always convince myself to spend an hour on something. Of course, it doesn't
take an hour; it takes two or three hours, because you always underestimate. But even then, two or three hours is enough time for me to get a project to a point where it works, where it does the thing, and I'd never have built that before. It's not just about working faster: there's a cut-off point where you think, that's going to take me too long, I'm not going to do that thing at all, and that cut-off point has been sliced way downwards. So I can get to the end of the day and I've done five or six projects, and none of them were on my list at the start of the day. That's the downside, the double-edged sword. If you're easily distracted (I've got a mug here that says "easily distracted by pelicans"), this can feed so many more distractions to you. But yeah, it is making me more ambitious. I'm taking on things I previously would have ruled out because they'd have taken me too long, and the activation energy to do some of these projects just keeps on getting lower, which is super exciting, and a little bit intimidating; it has its downsides as well.

Yeah, and to your point, I know Python, I know R, I know MATLAB because of my scientific background, and I know bits and pieces of a few other things which I wouldn't be comfortable trying to write code in at all. But because I know these base languages, I can ask ChatGPT or whatever to generate code in pretty much any language to build stuff, and now I understand a lot more. It's quite amazing. So I'd love to move on to data now. You introduced Datasette as a project that was initially developed for journalists, and when I woke up this morning I checked out your blog and saw you're going to a journalism conference this week, which we briefly discussed before we started the live stream. I'm really interested in that.

It's NICAR, the National Institute for
Computer-Assisted Reporting. CAR, computer-assisted reporting, is an acronym I believe they came up with in the 70s, which is when journalists started saying, okay, these mainframes are full of data; surely we can use them to help us break stories. So this is not a new field: journalists have been doing hefty data reporting for 40 or 50 years at this point. The conference every year is about a thousand people. It's the Wall Street Journal and the Washington Post and the New York Times and all of these different publications who have proper data nerds; the data nerds get sent there, and it's just heaven. You're surrounded by the nerdiest people, and the only similarity is that everyone's into telling stories with data; other than that, the backgrounds are completely varied, from all sorts of different parts of the world. So it's really fun.

I'm interested in how non-technical people get to use the wonderful tools you build, because they still need to know a bit of command line stuff. I think one reason I found RStudio and R so fantastic, since I used to work in biology, is that the barrier to entry for non-technical biologists is low: it abstracts over the command line, and you're able to install packages from within the IDE, though there are still dependency issues. We had Spyder in Python for some time; I probably shouldn't say much about Spyder, because it eternally crashed on me and frustrated me, although the developers are fantastic. But there is still some sort of technical barrier to entry. What is it for journalists and non-technical people, and what are we going to do about it?

This is a huge thing. So I'm building software that I want journalists to be able to use. A few years ago I landed a paid journalism fellowship at Stanford, where I got paid to spend a year on campus working on projects that
were beneficial to journalism, which meant hacking on my open source projects. At the start of that year, one of the questions I asked myself was: do I want to go after completely non-technical journalists and build tools they can use, or do I want to take the journalists who are already somewhat technically savvy, who can use Excel, who aren't programmers but have that sort of data literacy, and build tools that accelerate those people? I chose the latter, because it was easier: when it's just one person, trying to solve the non-technical-user problem is a heck of a lot harder. But then I went to my first NICAR conference, and something that really inspired me was a workshop, an intro to Python with Jupyter notebooks, and there were 50 people in that workshop who didn't know how to program. They were journalists and they wanted it so badly. They were like, we're getting these files that are too big to open in Excel; I want to be able to report on this data; I am absolutely ready to install Python on a laptop and figure all of these things out. And that really inspired me. I realized that the group I care about most are the people who have that fire in them: they're not programmers, but they want to be able to solve those kinds of problems. So for a few years I was thinking about those people, and then the big change for me in the past year is that I've started thinking, you know what, with LLM assistance maybe we can build quite sophisticated technical data tools for people who don't have the level of technical literacy they'd currently need to engage with this stuff. That's really exciting to me. But yes, then with Datasette: the first version of Datasette was a pip-installable Python tool. You pip install it, you run it in your terminal, and you can run a command to deploy it onto Vercel or Cloud Run or wherever.
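Before moving on: the problem those workshop attendees described, a CSV too big to open in Excel, is tractable with just the Python standard library by streaming the file row by row instead of loading it all into memory. A hedged sketch with invented column names (an in-memory string stands in for the big file):

```python
# Summarise a CSV by streaming it one row at a time with the stdlib.
# csv.DictReader never holds more than one row in memory, so the same
# loop works on a multi-gigabyte file that Excel can't open.
import csv
import io

big_csv = io.StringIO("city,amount\nAustin,10\nDallas,5\nAustin,3\n")

totals = {}
for row in csv.DictReader(big_csv):
    totals[row["city"]] = totals.get(row["city"], 0) + int(row["amount"])

print(totals)  # {'Austin': 13, 'Dallas': 5}
```

For a real file you would pass `open("big_file.csv")` instead of the StringIO; the point is that the streaming loop scales where the spreadsheet does not.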
And instantly you've cut off everyone who doesn't have a Python installation, doesn't know how to pip install something, and doesn't know how to use their terminal. That's 99%, 99.5% of humans. And even getting people to that starting point, suddenly you're getting them to brew install things; it's a nightmare. So I spent a bunch of time trying to solve that problem. The first thing I did was build a Mac application called Datasette Desktop. It's an Electron app that bundles its own Python, and the whole point is that it's an installer: you download it, you double-click it, and now you've got Datasette running on your computer without having to install Python separately. And that works; I got it up and running, which helps a lot, because at least people don't have to use the terminal to start using the software. Then I started looking at WebAssembly. I've got a version of Datasette, Datasette Lite, which runs entirely in your browser. It uses Pyodide, the Python-compiled-to-WebAssembly stack, which is incredible. Datasette Lite is really cool. I almost built it as a joke, as a fun experiment: can you run a server-side Python app entirely in the browser? It has something like a 10-megabyte startup cost, so I figured nobody would ever use it. Of course, these days a 10-megabyte startup cost is nothing; a React app is probably that. So actually people are starting to use it. And the fun thing Datasette Lite can do is that you can feed it the URL to a CSV file online, and if that CSV file is served with CORS headers, which is anything on GitHub or anything in a Gist, it'll just load it straight up. So now I've got this thing where I can construct a URL that loads Datasette Lite, imports a CSV file, and then runs a SQL query against it.
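Under the hood, what Datasette Lite does with an imported CSV amounts to loading the rows into SQLite and running SQL against them. A rough stdlib sketch of that idea, with made-up data (Datasette itself adds the web UI, JSON API, and plugin system on top of this):

```python
# Load CSV rows into an in-memory SQLite database and query them with SQL,
# which is roughly what Datasette does for you behind its web interface.
import csv
import io
import sqlite3

csv_text = "name,votes\nAlice,10\nBob,7\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE data (name TEXT, votes INTEGER)")
# Named placeholders (:name, :votes) bind each DictReader row directly.
db.executemany("INSERT INTO data VALUES (:name, :votes)", rows)

top = db.execute(
    "SELECT name FROM data ORDER BY votes DESC LIMIT 1"
).fetchone()
print(top[0])  # Alice
```

The appeal of building on SQLite is exactly what Simon describes: once the data is in a table, any question a journalist can phrase as SQL is answerable, no custom code per dataset.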
query against it. That's super interesting, and a lot of people have started using that as well. But the end-of-level boss of all of this is that obviously I have to host it, right? I need to provide a software-as-a-service version of Datasette, so that people who want to use it, especially people who want a private version they can collaborate on with their newsroom, can click a button, enter some credit card details, and get started that way. I held off on this for quite a long time because I have run startups in the past. I know what it's like to be responsible for an online service: it's getting woken up at 3am, it's paranoia about whether your backups are working, all sorts of stuff like that. Right from the start of Datasette I said, look, it's open source so I never have to do this. If people run the software, that's great, but I am not on the hook for those 2am phone calls. It turns out that doesn't work for my target audience of journalists: they need to be able to click buttons on a web page and get a version of the software I've been running. So I've been building this thing called Datasette Cloud, which is the cloud-hosted version of Datasette. It's been in an internal preview; I've been onboarding people to try it out and so forth. I feel like NICAR this year is the point at which I'm going to start aggressively onboarding real teams using it and start asking people for money, and that's pretty exciting. That's a huge step. Super exciting. To that point, in your blog post you mentioned you're going to NICAR and you're going to spend as much time in the hallway track talking to people as possible. In your extensive history of building tools, how important is it to actually spend time on the ground with people, talking as much as possible? So in my career I have spoken at well over a hundred conferences, over a space of 20 years, and
almost all of the interesting job opportunities and things in my life came from either blogging or speaking. So I'm a bit of a weird shape, because my career has been built around public communication for a very long time. But certainly with Datasette, one of the frustrations I've had is that it's open source, which means people can use it without telling me, so I keep on hearing about things anecdotally. I just found out the other day that Politico have been experimenting with Datasette internally, which is amazing, because I want news organizations using it; other newsrooms have used it, and Bellingcat have used it, but I never hear about this at the time. So going to events and sort of shaking people down is a great way to figure out who's actually using things. I also do this thing where on Fridays I let people book office hours with me. It's just a Calendly Zoom call where you can grab 25 minutes of my time, and the main purpose is that I want to have conversations with people who are either using my software or thinking about using my software. Most weeks I will do between one and three of these office hours sessions, and there could not be a more valuable way for me to spend 25 minutes than talking to somebody about what they're trying to do, what they've done, what they got stuck on, and what they'd like it to do in the future. That's just amazing. Fantastic. So Datasette, as we've discussed, is all about exploring, publishing, and working with data. I'm interested in your thoughts on how you think about sharing data in the age of LLMs, when everything becomes training material. We've seen Reddit, Stack Overflow, and others already restricting data access because of this. I also want to state, we talked about when this generative AI stuff started; in my mind the current generative AI quote-unquote revolution probably started mid-2022 with the first release of Stable Diffusion. That's when it was like, wow, okay, this
is serious stuff. And you've actually written a bunch with Andy Baio, I think, about the datasets behind Stable Diffusion and the training. The week that Stable Diffusion came out, Andy Baio and I did a joint investigation of the training data, because the big thing about Stable Diffusion, at least for the first version, was that the training data was entirely in the open. We were curious and we dug into that. I actually ran Datasette against it as well, to build a way you could search through, I think it was six million of the images used in training it. Was that the LAION-5B dataset? Exactly, it was LAION-5B. It was trained on a lot more images than that, but there were six million that were filtered as being particularly aesthetically interesting, which scored highly. And it was fascinating, because it was all stuff off Pinterest and the Daily Mail and everything, and just faceting by the domain name showed you where this stuff was coming from. This was back in September, a year and a half ago, and it felt like it was going to cause problems, and sure enough it's become a huge aspect of all of this. I find the ethics of this stuff so interesting as well, because with the ethics of how these things are trained, people's opinions vary entirely based on the use case, on what's actually done with it. If I scraped every photo on the internet and used it to build a machine learning model to help blind people see the world through a camera, nobody's going to complain about that, and GPT-4 Vision has been used for that kind of thing. If you scrape everyone's art and use it to generate new art that competes with them, whether or not that's legal, it's clearly unfair. We have a sort of strong shared understanding of the fairness involved in these things.
Software is an interesting one as well. I've been releasing open source software for 20 years. I love that my software has gone into GitHub Copilot and GPT-4 and so forth, and that it helps me write code. That's kind of part of the point of open source: I want to never have to solve the same problem twice, so you put the code out there and you never have to think about it again. There are people who are very upset that GPT-4 and Copilot are basically laundering the licenses, right? The terms of the GPL are not being obeyed, the attribution terms are not being obeyed, when it spits the code back out for you as exactly what you needed. Those are very legitimate concerns for people to have as well. So I don't know. But then the flip side of the ethics is there are a lot of people saying no, we only want models trained on licensed data, and the companies making these models should be paying everyone whose data goes into them. The problem I have with that is I worry we'd end up in a world where only the very wealthy have access to these tools. If it costs you a hundred million dollars in licensing to train a model, and then you make that model available only to the people who will pay a subscription for it, we've cut 99% of humanity off from this. And what happens to open source models in that limit as well, right? Absolutely. So I feel like the ethical questions are all incredibly murky, and there are basically no obviously correct or good answers to any of this stuff. This is a pattern that plays itself out again and again in generative AI: everything's murky and bad in different ways. And in the absence of that, we kind of end up falling back on what's legal, and what's legal isn't even clearly defined either. Absolutely. And I do think copyright is something we developed due to the ability to mechanically reproduce things with the
printing press, and also to incentivize more cultural production. It was the Statute of Anne, right, back in the early eighteenth century, where they said: not enough people are writing stuff, everything's owned by whatever the guild was at the time, and we want people who create things to be able to make money from them afterwards. But now that we have generative transformation, copyright may not even be the right paradigm to be thinking about these things through. I've been staying mostly clear of the legal side of it, because I know nothing about law and all of those things, but it's also uncertain as well. It's interesting to me that one of the things the big AI labs do is give you legal cover, right? If you're using Adobe's models, or OpenAI's, and I think Anthropic have this as well now, Microsoft have it, GitHub have it: if you get sued over content their tools produced, their legal teams will jump in on your side. And I guess they kind of have to, because otherwise nobody serious would use their tools at all, but it's fascinating that that's the way this stuff has gone so far. Absolutely. And I am interested, and then we'll come back to more technical stuff, but this is very important for us as technical people: what happens to Web 2.0 spaces such as Stack Overflow if all of this is sucked up into large models? We're interacting on a one-on-one basis now with a large language model and not having communal conversations. Not that Stack Overflow was always the best, most wonderful place to chat with people. Right, my favorite example of this actually is that there's an alpha version of Google search where the search results page has language models that summarize everything right there on the page, and it's like, why would you ever click a link ever again? You ask Google a question, it
reads the answer out to you, and lots of other search engines are looking into this kind of stuff as well. There's just such an obvious moral hazard there: what's the point of even having a website if nobody's going to visit it, because it's been sort of laundered and regurgitated in that way? That's one of the things at the base of the New York Times lawsuit against OpenAI. One of the things they're complaining about is their content being rewritten and pushed out to people in a way that harms them economically, because now there's no reason for people to visit their site or subscribe to the newspaper, and all of that kind of stuff. These things are deeply concerning to me. On the topic of data, my main hat is still journalism, and from a journalist's point of view I just want all the data to be out there so I can use it to tell stories. But then as an individual, one of my other areas of focus with Datasette has been around personal data. There's this really interesting thing where, as a human being on this planet right now, there is a huge amount of data being generated about you, and a lot of it is available to you. You can go and click the button on Facebook to export your Facebook data; you can write to a credit agency and they'll send you stuff. There's all of this data you've got access to. My watch collects my GPS traces, all that kind of stuff. But what do you do with it? If Facebook emails you a zip file with 100 megabytes of XML in it, what's step two? So I've been building a project called Dogsheep, which is basically a collection of utilities for taking whatever those formats are and turning them into SQLite databases, because once it's in a SQLite database you can run Datasette on top of it, and then you can start trying to dig into that data that's out there about you. A lot of this stuff is brutally personal, though. People have
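The Dogsheep pattern, turning an export file into a SQLite table you can point Datasette at, can be sketched with the standard library alone. This is a simplified illustration, not any actual Dogsheep tool; real exports need per-service parsing and richer schemas, and the table name here is made up:

```python
import json
import sqlite3

def load_export(db_path, export_path):
    """Load a JSON export (a list of flat objects) into a SQLite table.

    Simplified sketch of what Dogsheep-style tools do: read the export,
    derive column names from the data, and write rows into SQLite.
    """
    with open(export_path) as f:
        records = json.load(f)
    # Derive the column names from the keys of the first record
    columns = list(records[0])
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS items (%s)" % ", ".join(columns))
    conn.executemany(
        "INSERT INTO items VALUES (%s)" % ", ".join("?" for _ in columns),
        [tuple(r[c] for c in columns) for r in records],
    )
    conn.commit()
    return conn

# Usage (hypothetical path): conn = load_export("personal.db", "export.json")
```

Once the data is in `personal.db`, running `datasette personal.db` gives you a browsable, queryable interface over it.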
asked about my Dogsheep tools: would you ever do a hosted version? And there's no way I want to do a hosted version of a system where people pay me five bucks a month to host their deepest, most private data. That is not something I'm interested in getting involved with. Yeah, but there's demand for it. I would love to be able to run Datasette on my iPhone, with all of my personal data on a device I can hold in my hand. That feels like the right way to start coping with that sort of thing. Very much so. I'm interested in, we hinted at this, but firstly, training generative AI models from scratch has been super expensive, right? But you've written that there are some smaller models now which you can train for on the order of $30,000 or $40,000. Right, like Microsoft's Phi-2: it cost them less than $100,000 in training costs, I believe. It's incredible. It's not quite accessible to the hobbyist, but that's accessible to smaller groups. And then there's the fine-tuning stuff, right? So many of those models on Hugging Face, the really good ones, were fine-tuned practically in someone's bedroom. There's an army of openly-licensed-model people out there fine-tuning models which often beat other things on the leaderboards, because it all comes down to how well you instruction-tune it and the examples you give it. That's pretty exciting. And at the same time, everyone talks about having a model trained on their personal data. You probably don't want that; you probably want to do the RAG thing instead. You want to give an existing model the ability to run SQL queries against your data, or to run searches against your notes and pull things back that way. And that's getting to a point where it does fit on a telephone. Mistral 7B as a model is quite good at function calling and stuff, and I've got it running on my phone; Mistral 7B does fit on an iPhone now. So that's exciting. I
feel like that's one of the things I'm most excited about with the openly licensed models: I want a model that runs on my own device and can use tools to access all of my data, completely disconnected from the internet. No servers involved at all, just running locally. And I think we're basically there right now, in the sense that the models are good enough and the hardware is good enough; we just haven't built the software. And the software is getting there; my open source llm tool is beginning to lean in that direction right now. So if people viewing and listening wanted to get stuff running on their phone, or locally on a laptop, what type of tools would you suggest? On the iPhone there's an app called MLC Chat. Just install that; it's in the App Store. You install it and it gets you Mistral 7B, a language model that runs on your phone. It doesn't integrate with anything else, so it's more of a fun demo prototype kind of thing, but the first moment you run a language model on your phone, you can see the future opening up in front of you. In terms of desktop stuff, I have an open source Python tool called llm, just lowercase, which you can pip install. It's a command-line tool; originally it was for using your terminal to run prompts through OpenAI, but then I added plugin support, and now you can install plugins that add local models as well. So for a very brief period that was one of the easiest ways to run a local model on a Mac. It's not anymore, because there are much more full-time projects. There's a great project called Ollama for this. There's a whole bunch of desktop apps you can install on a Mac that get models up and running for you. LM Studio, I think it's not open source but it's free, and it's got a really good interface. There are loads of options, a ton
of ways you can start doing this. You do need quite a lot of RAM. I'm running an M2 Mac with 64 GB of RAM, which is good enough for most of the models I want to play with. If you've got a Linux or Windows machine with an Nvidia card, your options are much wider, because so many of these models target Nvidia CUDA stuff first. But yeah, it's exciting. And if you haven't done it, one of the things I love about the little models that run on your device is that they are rubbish. You think GPT-4 hallucinates? Watch what Mistral 7B does when you ask it for your bio. But that's really useful, because if you're working with a weak model that hallucinates a lot, you get a much better mental model of how it works and what it can do. It feels a lot less mysterious when your laptop is writing a terrible poem about a pelican for you, and you can almost feel it just predicting what word could come next in the sentence. Totally. And all of that is wonderful advice for people wanting to start running stuff locally. I've also included a link in the chat to a subreddit, the LocalLLaMA subreddit. That's the best place on the internet for catching up with this stuff. Yeah, it's a wonderful place. There are a few things we've been talking around that I want to tie together, which correspond to certain questions in the chat as well. People are asking about how we can trust LLMs. The answer is you can't, right? But you did mention RAG, I think, which is an interesting way to essentially query data you have and also get provenance for your results. Yeah, let's talk about RAG a little bit. RAG stands for retrieval-augmented generation, which I don't think is a great name, but it's got a name, and that's the important thing. And all RAG is, it's a party trick, basically. What it lets you do is say to a language model: well, if I want to answer a question about my
notes or whatever, obviously the language model won't know. But if I do a really dumb search against my notes, and I copy and paste a few thousand characters of stuff around that search result into the language model, and at the bottom I say "based on the above information, answer this question" and then give it the question, it actually does really well. Language models are fantastic at answering things based on stuff you just told them 30 seconds ago. And once you know this trick, it's also fun, because if you want to start hacking with language models, building a basic version of this is so trivial. The "hello world" of language model hacking is getting a very basic RAG pipeline up and running, but it's crazy valuable; it's a super useful thing. I would warn you that getting basic RAG working is easy; getting really good RAG working is incredibly hard, and there are lots of people and companies trying to figure this out, because there are so many ways it can go wrong. The difficult technical problem is: given a user's question, deciding what search terms to run against what data, and which bits to put in the context, and all of that. That's a very fine art and craft. But for the really basic versions of it, a trick I started using just a few weeks ago involves ripgrep, or rg, which is like a faster version of grep. You give it a term, and then you can say -C for context, 10, and it will find all matches of that term and output the 10 lines before and after. I've been piping that into a language model: I run rg on some concept, pipe it to a language model, and say "explain the concept of ribbon bars in this code base", and it just works. And my llm tool is all about terminal usage, because it turns out Unix pipes and language models are a beautiful combination. A language model, all it really is,
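The "dumb search plus paste" pipeline described above can be sketched in a few lines of Python. `ask_model` in the comment is a placeholder for whatever LLM API or local model you use; the retrieval here is deliberately crude, just substring matching with surrounding context:

```python
def search_notes(notes, term, context_chars=200):
    """Crude retrieval: return snippets of text surrounding each match."""
    snippets = []
    start = 0
    while (hit := notes.find(term, start)) != -1:
        lo = max(0, hit - context_chars)
        hi = hit + len(term) + context_chars
        snippets.append(notes[lo:hi])
        start = hit + len(term)
    return snippets

def build_rag_prompt(notes, question, term):
    """Assemble the retrieval-augmented prompt: context first, question last."""
    context = "\n---\n".join(search_notes(notes, term))
    return (
        f"{context}\n\n"
        "Based on the above information, answer this question: "
        f"{question}"
    )

# In a real pipeline you would send the result to a model, e.g.:
# answer = ask_model(build_rag_prompt(my_notes, my_question, search_term))
```

The hard part, as noted above, is not this plumbing; it is choosing good search terms and good context, which is where production RAG systems spend all their effort.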
is a function, right? You give it the prompt as input and it gives you the response as output. So once you've got the ability to chain these things together in your terminal, you can do things like: grep for this thing, pipe the results into the language model with this question, and whatever comes back out, save it to a file. And it all just works; it's so much fun. My llm command-line tool has another trick: you can give it a prompt and it'll give you the response in the terminal, you can install plugins to give you access to dozens of other language models, API ones and local ones as well, and everything it does is saved to SQLite, because everything I do is saved to SQLite. So you end up with a SQLite database of every experiment you've ever run against every language model. I've got thousands and thousands of prompts and responses stashed away now, and I can start using them to compare: did this prompt work better than that prompt, which model handled this better? It's super cool. And you can just start holding on to all of your different interactions. Amazing. There are a lot of things to unpack there. I agree that RAG isn't the best name, but it does give us the pun "from RAG to riches", which I like. I also just pasted in the chat, I can hardly say this, the Oobabooga tool, which allows you to play around with a lot of these things in a GUI vibe. So if people want to play around with these things locally, or even using cloud resources, it's a nice way to do it. I'm also glad you mentioned Unix, because this is something you and I have chatted about before. The Unix philosophy, this idea of piping among other things, is very beautiful, and it's actually related in some ways to the Zen of Python as well. So I'm wondering: the Unix philosophy and being "Pythonic", for lack of a better term, are clearly a part
of your approach to software development. How do these philosophies guide your approach to developing tools like Datasette, llm, and everything else you do? I think my work has gone slightly beyond that. The Unix philosophy is all about piping one thing into another, which is actually quite restrictive. You can end up getting quite a lot done, but you start thinking about things like: okay, this is structured data, so now am I going to produce JSON that I pipe into something else that has to understand it? There are limits to how much you can do with it. But I've also grown up on the web, so I'm always interested in anything with web APIs, where I can hit an endpoint and get back JSON and do things with that. And then there are two things I've added in the past few years. Firstly, there's SQLite as the intermediary format for absolutely everything, because the great thing about SQLite databases is they're a single file, so you can email them to someone, or upload them, or create a copy, or whatever. And SQLite has bindings in every programming language known to man: Go and Python and Ruby and R, everything can talk to SQLite. It runs in the browser in WebAssembly these days. It's also got amazing backwards-compatibility guarantees, or forwards-compatibility: I can open a SQLite database from 10 years ago and it'll just work today, because the team maintaining it are very stringent about that. So a lot of what I've been doing over the past few years, my one trick, is: anything I think is interesting, I get it into a SQLite database, because once it's in a SQLite database I can throw all of my other tools at it as well. And I've been building my third big open source project; you've got Datasette and llm, and I have this tool called sqlite-utils, which is a combination Python library and command-line tool for manipulating SQLite databases. So on the command line you can do things like
pull a JSON file from somewhere and pipe it into this tool, and it will create the SQLite database with a table with the right columns, depending on the JSON that was sent into it. That works with CSV and TSV too, but it can also do things like turn on full-text search indexes, alter tables, drop columns, refactor data, apply conversion functions, a whole bunch of stuff like that. It was originally built to solve the question of how I get data into Datasette: Datasette's great if you've got a SQLite database, but I need everything to be in a SQLite database first. But it's also a Python library, so anything you can do through the command line you can do through the Python library instead, and dozens of tools that I've built use it as a dependency. I've got tools for things like pulling GitHub issues and piping them into a SQLite database, each a little Python thing built around that. So that's trick number two: you've got the Unix pipes, you've got the web APIs, you've got the SQLite substrate for everything. Then the other one is plugins, right? Datasette has 145 plugins at this point, very much inspired by WordPress. I was looking at WordPress, which is a decent blogging engine and CMS with 10,000 plugins, which mean that any publishing or content problem you have, you can solve with WordPress plus plugins, and it's now responsible for, what, 25% of the internet runs on WordPress, or something ludicrous like that. So a goal I had with Datasette was: okay, if I can build a capable data publishing and analytical engine that starts with plugin support, you could add plugins for any visualization, for any data cleaning mechanism, and just try to expand the software in different directions. I use the same trick for my llm tool; it's got plugins, again for the same reason. So when you combine those, you've got plugins, you've got the SQLite database substrate, you've got web-native APIs,
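What sqlite-utils automates, inferring a table schema from JSON records and enabling full-text search, can be sketched with the standard library. This is a simplified illustration of the idea, not sqlite-utils itself (which does all of this and much more with a single `insert_all` call); the example records are made up:

```python
import sqlite3

def infer_type(value):
    """Map a Python value to a SQLite column type."""
    if isinstance(value, bool) or isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "REAL"
    return "TEXT"

def insert_records(conn, table, records):
    """Create a table whose columns match the JSON keys, then insert rows."""
    first = records[0]
    cols = ", ".join(f"[{k}] {infer_type(v)}" for k, v in first.items())
    conn.execute(f"CREATE TABLE IF NOT EXISTS [{table}] ({cols})")
    placeholders = ", ".join("?" for _ in first)
    conn.executemany(
        f"INSERT INTO [{table}] VALUES ({placeholders})",
        [tuple(r[k] for k in first) for r in records],
    )

conn = sqlite3.connect(":memory:")
issues = [
    {"id": 1, "title": "Add CSV export", "body": "Support exporting as CSV"},
    {"id": 2, "title": "Fix layout bug", "body": "Sidebar overlaps the table"},
]
insert_records(conn, "issues", issues)

# Full-text search via SQLite's built-in FTS5 extension
conn.execute("CREATE VIRTUAL TABLE issues_fts USING fts5(title, body)")
conn.execute("INSERT INTO issues_fts SELECT title, body FROM issues")
hits = conn.execute(
    "SELECT title FROM issues_fts WHERE issues_fts MATCH 'export'"
).fetchall()
```

With sqlite-utils the equivalent is roughly `db["issues"].insert_all(issues)` followed by `enable_fts(["title", "body"])`, and the resulting file is immediately usable by Datasette.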
and you've got Unix piping. You can build so many things with that combination of different techniques. Yeah, amazing. I also just want to mention, a couple of people have pointed out in the chat that RAG uses vector databases. Totally correct. And I was just going to say, I think as soon as someone mentioned vector databases, a whole bunch of venture capitalists jumped in and joined the live stream. But that's... and somebody mentioned fine-tuning as something else we can do; we haven't really talked about that, and I think there's a lot out there about fine-tuning. I gave a talk a few months ago about embeddings, which is the core idea at the basis of why you'd want a vector database, and that's available as a video with very detailed slides and notes and code examples. Plus my llm command-line tool can do embedding stuff as well, so if anyone's interested in embeddings, I've got a bunch of material people can look at. Great. And with respect to fine-tuning, I think there's enough out there, and there are so many things you can talk about that other people can't, that I don't want to spend too much time on fine-tuning. But I do have one question for you: there are two worldviews. Sam Altman says, oh, the world will have one big model, right? And then at the opposite end of the spectrum, the world may have a lot of different small open source models that are fine-tuned on relevant data for their particular use cases. Where those are two different options, where do you sit? I feel like even when you just look at things like cultural values, a model trained in California does not necessarily work for every country in the world, right? There are so many things, when you're talking
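The embeddings idea behind vector databases comes down to representing text as vectors and comparing them by cosine similarity. A toy sketch with made-up three-dimensional vectors (real embedding models produce hundreds or thousands of dimensions, and the document names and numbers here are purely illustrative):

```python
import math

def cosine_similarity(a, b):
    """Similarity of two embedding vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: in reality these come from an embedding model
docs = {
    "pelican facts": [0.9, 0.1, 0.2],
    "sqlite tips": [0.1, 0.95, 0.3],
    "seabird guide": [0.85, 0.2, 0.25],
}
query = [0.88, 0.15, 0.2]  # pretend embedding of "tell me about pelicans"

# Rank documents by similarity to the query, most similar first
ranked = sorted(
    docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True
)
```

A vector database is essentially infrastructure for doing this nearest-neighbor ranking efficiently over millions of vectors instead of three.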
about what language models do, where different people should be able to select different models that represent different values. That feels really important to me. A year or so ago it felt like we might be doomed to a world where there was one model, whatever OpenAI had done, and every human being was using it. That felt very bad, and I'm not worried about that anymore, because the openly licensed models have been improving so much, there's so much variety, and so many different research groups are working on these. You've got people on Hugging Face who are training models in their bedrooms, fine-tuning them in different directions. I think that's amazing; that's the world I want to live in. Absolutely. I want to jump into a demo soon, but we've talked about SQLite quite a bit, and I love SQLite. It's stood the test of time as well. There's the Lindy effect, right, which says that something that's been around a certain amount of time is probably going to stick around and deliver value for more time. So I'm convinced SQLite is incredibly valuable. But there's this whole new crop of tools in Python, like DuckDB and Polars and these types of things. What do you think about all these new tools? Do they complement or compete with SQLite? My favorite of all of these is DuckDB, because DuckDB really took the SQLite idea of "it's a library, not a server", a single-file database and so forth, but they're basically going after Parquet, the columnar and analytical databases, all of that kind of stuff. It's a phenomenally cool piece of software. I see it as a complement to SQLite. If you want something that's transactional and optimized for lots of reads and writes and so forth, SQLite is still absolutely the best option. If you want to do big analytical queries, DuckDB is sat right there. And the one trick it has that I really love is that it can fetch parts of a Parquet file over the network, just the bits it needs to answer a query. So you can have a terabyte of Parquet in an S3 bucket somewhere, and you can run DuckDB on your laptop to do a sum or a count, and it will pull back just fractions of those files to answer those questions. It's incredible. It's like you've got a full-blown data warehouse running on a laptop, with terabytes of data hosted elsewhere. So yeah, I'm a huge fan of DuckDB. If I ever get Datasette to have a plugin hook that lets you swap out databases, I want to go after DuckDB and Postgres, because I think if you've got SQLite, DuckDB, and Postgres, that's it, you can solve almost anything. Wow, that would be incredible. A year ago I thought that was impossibly difficult, and it's beginning to feel like it might be achievable once I've got Datasette 1.0 out. That would be really exciting. Very cool. One other thing I want your thoughts on. I think a lot of people here, including yourself and me, will be aware that Andrej Karpathy published a blog post several years ago about Software 2.0, and this was before LLMs really became mainstream; Transformers had come out by then, I think. The suggestion there was a new paradigm of programming around models instead of lines of code. And I'm interested in, you know, you've been involved in software 1.0 and 2.0, and I'm interested in this kind of triangle, I suppose, where the vertices would be code, models, and data, hopefully with humans in the middle somewhere. How do all of these things interact, and what will humans be doing in the future? I mean, that's way too big a question for me to take on. I try not to be too futuristic in my thinking, because everything changes every week. It's interesting that Andrej Karpathy,
that post, I think, was a couple of years ago, before we'd realized quite how good language models are at code, right? The fact that language models are astonishingly good at SQL and JavaScript and Python and Go and so forth was a bit of a surprise to everyone; it only started becoming clear maybe a year and a half ago. And code is obviously a better way to automate a computer than English, because English is vague. If you want to direct a computer, you want to give it less ambiguous instructions. So I'm much more excited about the version where code is what automates the computer, but the language model is building the code, is helping people build the code. That feels like it's definitely going to work, because it's working already. Right now I can do so many things with my ability to get a language model to write code for me, and if we can expand that out to people who don't have a programming background, that's going to be incredible. So yeah, that's me being less ambitious but optimistic. I'd like us to keep harnessing our computers in the reliable, predictable way, with code, but with language models meaning that everyone gets to do it that way. Cool. And you actually raise an interesting point that I think you've written about before, and we definitely spoke about it last time we spoke, which is that it seems like perhaps the most value LLMs currently deliver is for technical people, to help them write code. Right, that's one of the wild things: the career that can benefit the most from LLMs turns out to be programming. If you can master LLMs for programming, and I'd say I've four- or five-x'ed my productivity in terms of the time I spend typing code into a computer, which is only about five or ten percent of what I do, but still, that's material, that's
worthwhile. At the same time, it's so weird: we thought truck drivers were going to be replaced by automation, and it turns out illustrators and lawyers and programmers maybe have more to worry about. I'm hoping that as we get more productive, the demand the human species has for code will 10x, and we'll all be kept busy that way.

Yeah, it's a confusing time; time will tell. So, one final question before we jump into a demo, and a couple of people have asked similar things in the chat. There are a lot of conversations around generative AI happening publicly, in the cultural consciousness, on Twitter, LinkedIn, all of these spaces. Is there anything about them that isn't being talked about enough, that you'd like to see more of a conversation around?

Yeah. Lots of people are super stressed out about the doomsday scenarios, and I'm not interested in the science-fiction Terminator stuff. I'm really interested in the economic impact these things are having, and I want to see more than just anecdotal evidence about the impact on jobs. Anecdotally, people who used to do SEO copywriting are in real trouble right now, as are illustrators who were doing very stock-art kinds of illustration. And I keep thinking about translators: the translation industry was massively impacted five or six years ago, when Google Translate started getting good enough that companies would use it, when it started using deep learning, essentially. So we've seen this play out already. That, to me, is the big question: I want to understand how this is affecting people. The story that's not being told, I think, is the macro story about the economic impact this stuff is having.

Absolutely. And we've been talking around this, but the New York Times is one of the
last bastions of serious journalism, because look what the internet did to mainstream journalism.

Yep. The problem journalism has is that running a newspaper used to be the best business model there was, because you had a monopoly on advertising for your city, so it was a license to print money. The same with magazines: Vogue had a monopoly on all of the eyeballs in the world that cared about fashion, and this is why we hear these stories from the 80s and 90s of unlimited credit-card expense accounts for magazine writers and all of that. That is so far gone now, because those monopolies on attention, regional attention and topic attention, just got blown away by Facebook and Google and targeted advertising and so forth. And the problem is that reporting is an expensive thing. How do you pay for it when the business model used to be "everyone has to advertise with us, because it's the only way to get their message in front of everyone who lives in this town", and that's just gone? That's really worrying. A trend I find really interesting is that there are increasing numbers of nonprofit newsrooms, supported through other means, and they're doing really good deep investigative reporting, but nonprofit newsrooms are not going to cover everything that needs covering. This is a lot of the work that I do. I hate to say it out loud, but newsrooms need to be able to do more with less: on the reduced budgets and resources they have, they need to be able to tell the important stories happening in the areas they cover, and if data journalism, if automation, can help people find those stories, that feels worthwhile. That's something that I'm ready to invest a lot of
time in.

Absolutely. Well, let's jump into a demo; why don't you share your screen. While Simon's doing that, I'll just let everyone know: beforehand I said to Simon, hey, do you want to demo your llm CLI utility or Datasette, and Simon said both of those are great, but I've got something maybe even more exciting we can look at. So let's do it.

OK, so this is Datasette. Just to give you an idea of what it's like: Datasette shows you a table, a table of data, and you can have multiple tables in here. This one is legislators: everyone who's ever been a congressperson or senator or president or vice president in the United States. You can do things like faceting, so I can facet by last name and see that, unsurprisingly, more Smiths have served than anyone else. Although Wilson at 60, that's a bit unexpected; wow, that's a much more common name than I thought it would be. I learned something new. You can also click "View and edit SQL" and actually start hacking on SQL queries, which, because it runs read-only and with a time limit, ideally shouldn't cause any trouble. So that's what Datasette lets you do: it lets you take data and put it in an interface where people can explore it and run queries, and you can get the data out as JSON, so if you want to build integrations against it you can, or export a CSV, all of that kind of stuff. But the challenge has always been how to get data into it in the first place. I've built features like CSV upload in the past, but my new thing is a new plugin (everything's a plugin) called datasette-extract. The idea is that you can paste data in, along with a bit of a table schema. Let's say I'm going to parse my resume: start year, end year, description, oh, and role. Then what I can do is grab a copy of my resume, it's a PDF, and I'm just going to drop it onto the page.
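A schema like the one Simon pastes in maps naturally onto OpenAI's function/tool-calling feature, where the model is forced to reply with JSON matching a declared shape. The actual datasette-extract code differs; this is a hedged sketch of the general pattern, with the model call replaced by a canned response so it runs offline, and field names taken from the demo rather than the plugin.

```python
import json
import sqlite3

# a tool definition like this would be sent to the OpenAI API so the
# model must answer with structured arguments instead of free text
EXTRACT_SCHEMA = {
    "name": "extract_rows",
    "parameters": {
        "type": "object",
        "properties": {
            "rows": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "company": {"type": "string"},
                        "role": {"type": "string"},
                        "start_year": {"type": "integer"},
                        "end_year": {"type": "integer"},
                        "description": {"type": "string"},
                    },
                },
            }
        },
    },
}

def store_extracted(db: sqlite3.Connection, arguments_json: str) -> int:
    """Insert the model's structured tool-call arguments into SQLite."""
    rows = json.loads(arguments_json)["rows"]
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs (company TEXT, role TEXT, "
        "start_year INT, end_year INT, description TEXT)")
    db.executemany(
        "INSERT INTO jobs VALUES "
        "(:company, :role, :start_year, :end_year, :description)", rows)
    return len(rows)

# a canned response standing in for what the model might return
# after reading messy resume text
fake_arguments = json.dumps({"rows": [
    {"company": "Example Corp", "role": "Engineer",
     "start_year": 2019, "end_year": 2022, "description": "Built things"},
]})
db = sqlite3.connect(":memory:")
print(store_extracted(db, fake_arguments))  # 1
```

In a real call, EXTRACT_SCHEMA would go in the API request's tools list and arguments_json would come back in the model's tool call; everything after that is ordinary SQLite.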
I've extracted the text from the PDF just in JavaScript, and you'll see that it's garbage, complete junk with extra spaces, but fingers crossed: if I click Extract, it sends that up to GPT-4 using OpenAI functions. Oh, that's frustrating: I've got a new version of this that shows you live what's going on, but it looks like I'm running the old version, so we'll have to wait until it's finished. I'll show you one I built earlier just in case the live demo doesn't work. What this does is pull out the company, the start year, the end year, the job title and the job description, so I've just turned a random pile of junk PDF into actual useful data. And this works with all sorts of input: I've copied and pasted things off websites into it, I've done plain text files, whatever. This ability to work with structured data is one of the killer features of the OpenAI models. Oh, look at this, it is actually working: here we go, it's extracting, in progress, and because it's all streaming, it's pulling in the data as it pulls new things out. This is a preview, and when it's finished I'll be able to click through; actually, I can click through to the table right now. Here we go, here's the table that it's creating from that data. I am so excited about this feature. I want to get it working with GPT Vision as well, because you can do structured data extraction from images now, and it works: I've been taking photos of event flyers in shop windows and getting GPT-4 Vision to turn them into Google Calendar add links. You take a photo of a flyer, you click a link, and it's in your calendar. That just works. So this is a fun example.

Can you just say that again? Because that's so amazing.

Yeah: take a photo of a flyer. I've been playing around with this just in ChatGPT. You can say to ChatGPT, construct a Google Calendar add-to-calendar link for this event.
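Once a model has pulled out a title, dates and venue, building the link itself is deterministic string work: Google Calendar's long-standing render?action=TEMPLATE URL format takes the event fields as query parameters. A small sketch (the event below is made up):

```python
from urllib.parse import urlencode

def gcal_link(title, start, end, location="", details=""):
    """Build a Google Calendar 'add event' link.

    start/end are strings like '20240315T190000' (floating local time),
    joined with '/' in the dates parameter, per the TEMPLATE URL format.
    """
    params = {"action": "TEMPLATE", "text": title, "dates": f"{start}/{end}"}
    if location:
        params["location"] = location
    if details:
        params["details"] = details
    return "https://calendar.google.com/calendar/render?" + urlencode(params)

link = gcal_link("Pub quiz", "20240315T190000", "20240315T210000",
                 location="The Half Moon")
print(link)
```

The LLM's job in Simon's workflow is only the fuzzy part, reading the flyer; handing a link like this back is the reliable, code-shaped part.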
It will pull out the title and the start date and the end date and the venue information and format it all correctly. I did one of something in Spanish and it converted it to English as well; it translated it for me. So you can see instantly that the potential of this stuff for a journalist is just enormous. Journalists spend so much time dealing with crap PDF files that some police department sent them or whatever; the fact that we can now turn that into structured data and then start running SQL queries against it is huge. That's really exciting.

Very cool. And is this something you're going to take to NICAR next week?

It is, yes. I've been trying to get this ready in time for NICAR. I haven't started talking about it yet, so it doesn't have a README, but it's called datasette-extract and the code's all up on GitHub already, and there's not a lot to it: this whole thing is what, 310 lines of Python. That's the joy of plugins. The thing I love about having a plugin system for my software is that it's zero risk for trying out new ideas. If I hacked this idea into Datasette core and it turned out the code was a bit messy, or I didn't have time to test it properly, I would be making Datasette worse, because it would have this sort of unfinished feature in it. With plugins there is no harm caused whatsoever: I have come up with the weirdest ideas for features and built them as a plugin, just as an afternoon experiment, and it's fine; if you don't install it, you don't have to think about it. It's also my favorite way of doing open source contribution now. The problem with open source is that if somebody sends you a pull request, they've actually just added work to your tray: now I've got to review their pull request and get back to them, and it's great to have
those contributions, but it is also something that adds to the stack of things I have to do. If somebody builds a plugin for my software, it costs me nothing, and they can release it to the world, and I can literally wake up one morning and find that my software can do something new that it couldn't do before. That's something I've been really enjoying about this whole way of working.

That's so cool. And to your point about pull requests, and our conversation around analogies earlier, a nice analogy for opening a pull request is giving someone a puppy, which is lovely, but now they have to look after it.

Right, whereas with a plugin, you've extended my software and I don't have to look after it at all; it's just out there. And check this out: datasette-reconcile last had a release on the 2nd of February, and it adds reconciliation APIs for working with OpenRefine, which is such a cool feature. So yeah, I'm all about plugins. This is why my llm command line tool is all about plugins as well. How many have we got now? There are about 20 plugins for it, and quite a few were written by other people: the Cohere one, the Bedrock Anthropic one, the Bedrock Meta one, and llm-claude, the one that gives you access to the Anthropic models, I didn't write those. It's really cool: my software can now do things I didn't build.

Amazing. And look, I know we're going to wrap up on the hour, but if you're up for it, would you be interested in giving us a small demo of your llm CLI utility, so people can see how much fun it is to play around with?

OK, so here we go, I've got a terminal window here. I'm going to start by typing llm "a poem about an otter". Oh, that's because I misspelled it. So this is the most basic version: you give it a prompt, and it's written a poem about an otter. This is using GPT-3.5 by default, because it's cheap, but
yep, it worked. And if I type llm logs, I can see that that's been logged to a SQLite database, and you can open that SQLite database up in Datasette and poke around in it and so forth.

To be clear, to do this, all you need to do is install llm and then put in your credentials for your OpenAI keys or something, right?

You type pipx install llm to install it, and then llm keys is a command for setting your API keys: I can say llm keys set openai, literally copy and paste an API key in, and that's it, you're done and configured, and it can start working. But the fun thing is the plugins. If I run llm plugins, it shows me all of the plugins I've installed, which is a lot, because I mess around with this a lot. If I type llm models, it shows me the list of language models that are available for me to use, and they all come from different plugins. I've got a ton installed here: those are all of the OpenAI ones, and gpt4all provides models that I can run on my own machine. So let's run Mistral 7B OpenOrca. You can do llm -m with that model and give it a prompt, or you can say llm chat -m, and that opens up a sort of interactive chat right in your terminal, which is useful because then it doesn't have to load the model into memory each time. Oh, it turns out I don't have that model installed, so it's downloading it; in that case I'll pick a different model, but we'll leave that one downloading in the background. Oh, there we go, it says "installed" up at the top.

Maybe maximize your window, and make your font a bit larger as well.

So I'm just saying hi, and this takes a little while the first time because it's loading the model weights into memory. There we go. That happened entirely on my laptop: this is now my laptop writing a poem about an otter on a bicycle, which is wildly impressive. I mean, the poem's always going to be terrible, but
yeah, this is it: this is a language model running locally, and again, all of that ended up in my logs, so I've got a SQLite database with everything that's going on in it. One of the really fun things you can then start doing is writing little scripts. This is my favorite one: it's called hn-summary, and it's a little script where you give it an integer, it hits the Hacker News API to pull back all of the comments on a thread, and then it gives them to, at the moment, GPT-4 Turbo, with the prompt "Summarize the themes of the opinions expressed here. Include direct quotations where appropriate." So now if we pull up a random Hacker News thread, let's do this one here, I can run hn-summary with that ID, and it will read all of the comments on that thread and spit out a summary for me, so I don't have to read them. Here we go: theme one, skepticism, and so on. The bit of the prompt about including quotations I really like, because it makes the summary almost proof against hallucination: if it's making stuff up, you can at least fact-check it and make sure the quotes are real. And it's quite good: a couple of users raised concerns about this, this person highlighted the community skepticism. This is kind of great: Hacker News threads can get pretty long, but now I've got a command I can run that summarizes them. And as you saw, the entirety of that software is what, a dozen lines of Bash. That's the whole thing, and that, I feel, really illustrates why the Unix philosophy, being able to pipe things into language models, is so exciting. It's curl, and then jq, and then llm, and that's it; now I've got a piece of software that does something really useful.
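Simon's actual script is the dozen lines of Bash he describes (curl into jq into llm). A rough Python equivalent, for readers who'd rather stay in one language, makes the same assumptions: that the Algolia Hacker News API returns a nested item tree, and that the llm CLI is on your path with an OpenAI key configured.

```python
import json
import subprocess
import urllib.request

SYSTEM = ("Summarize the themes of the opinions expressed here. "
          "Include direct quotations where appropriate.")

def flatten_comments(item):
    """Recursively collect comment texts from a nested HN item tree."""
    texts = []
    if item.get("text"):
        texts.append(item["text"])
    for child in item.get("children", []):
        texts.extend(flatten_comments(child))
    return texts

def hn_summary(item_id, model="gpt-4-turbo"):
    url = f"https://hn.algolia.com/api/v1/items/{item_id}"
    with urllib.request.urlopen(url) as resp:
        item = json.load(resp)
    prompt = "\n\n".join(flatten_comments(item))
    # shell out to the llm CLI, passing the summarization instruction as
    # the system prompt and the comments on stdin, like the Bash version
    result = subprocess.run(["llm", "-m", model, "-s", SYSTEM],
                            input=prompt, capture_output=True, text=True)
    return result.stdout
```

Calling hn_summary(12345678) would print a themed summary of that thread; flatten_comments is the only part with any logic, and it works on any dict with text/children keys.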
Very cool, and I love that we brought it back to the Unix philosophy. I also want to mention, to your point that you don't have to write so much code: you and I both know, and maybe many of the viewers do as well, that while we don't know what code was used for ChatGPT, of course, if you look at something like nanoGPT, that's several hundred lines of code. I'm going to put that link in the chat, and anyone who doesn't already listen to Andrej Karpathy, start listening to him yesterday; I feel he's just such a wonderful educator, among other things.

I think so. I actually used nanoGPT and trained a model exclusively on text from my blog, because my blog makes it easy to pull out all of my entries and feed them in. I trained it for a while and got it to the point where, if you read the output, it's complete junk, but it's definitely junk in my style: "datasette is a great idea about on the idea to worry about an end of a browser that can be used to build a django free draw". It's quite good, and this model was trained from scratch just on text from my blog, so there was zero chance it would be able to do anything useful, because you need vastly more tokens than that. But as a way of understanding a little bit more about how these things work, it's kind of fun.

And I love that it sounds like you. ChatGPT, of course, we know is fantastic at so many things, but the blandness of the prose is so boring sometimes, so it's nice to see something that sounds like someone who's excited about things.

Look at this bit: "an additional python object of JavaScript in just a small server-side web server as well as web-based applications". That's actually almost a sentence. Playing around with this stuff is just so much fun. This blog, by the way: I've got two blogs that I run. There's my main blog, which is links and quotes and articles that I've written, and then the other thing is my TIL blog, which stands for "things I've learned".
What I love about "things I've learned" as a format is that I don't have to be telling you anything new. I don't have to have some astounding insight to share about the world; I can just be like, hey, I figured out how to run Ethernet over a coaxial cable, or, here's the trick I talked about earlier where you use rg to search a code base for something and then pipe the results into your system prompt. I love these. I put out maybe a couple of these a week, and I don't care if nobody else ever reads them: it's mainly my notes, and also, if I forget that I figured something out and Google for it in two years' time, it'll drop me right back on a page that explains what's going on.

Yeah, amazing. And I do love your micro-blogging style as well: you'll take small things, put them up, move on to the next thing, and it's a wealth of valuable information.

Oh, thanks.

So it's time to wrap up; maybe stop sharing your screen once we've looked at the final thing.

I just wanted to pull up my llm tag: this is everything I've written about large language models, which is up to 366 things now, and a lot of them are links and quotes and little bits of commentary.

Very cool. So everyone, I'd encourage you to check that out. Just one second, I've lost Zoom somehow. OK, I'm back. I definitely encourage you to check that out, and if you found Datasette or Simon's llm CLI utility interesting, please do play around with them; try to get stuff working on your laptop, on your cell phone, all of these things. As a final question, Simon, and this is so cool, we've been going for an hour and a half and we still have 120 people here, so thank you all for sticking around as well: I am interested in what you'd encourage people to do. What would you like to see more people in the space be playing around with?

OK, I mean, a really
easy one: I wish people would get into the habit of sharing their prompts. There's this thing where we're all trying to figure out this weird alien technology that's just arrived, and it's tricky because there's so much superstition involved. People will say, hey, if you add "I will tip you $5,000" to the prompt, you'll get better results, and maybe you will, maybe you won't; it's hard to be absolutely sure. Something I'm frustrated by is that I want to do really good summarization of articles, and even there I don't know what a good summarization prompt is; I just showed you the one I'm using, pick out the themes and include quotations. There's so much space for really robustly put together prompt cookbooks; those are things that I feel we need. If people want to do something really advanced, my llm tool needs plugins for more models: every time a model comes out, I would love there to be a plugin that means you can start playing with it on the command line. I've got quite a detailed tutorial about how to write those kinds of plugins, so if you do want to do some proper language-model nerdery and hacking, I feel like plugins might be a really good thing to play with. And I'd encourage people to just share what they're learning. Right now we are still really early with this stuff: there is a good chance that you might dive into RAG and find a new RAG trick that nobody has figured out yet, one that gets results. The more people we've got banging on the edges of these things, figuring out what works and what doesn't, and then sharing what they discover, the better.

All wonderful advice. I do have a question around sharing prompts, though. The provocative way of stating it is that the product "ChatGPT" doesn't exist, in the sense that your ChatGPT and my ChatGPT may be different, and this is how
software and products evolve now, right? And at any point in time it may be different for me; in the context of any conversation it may be different. So how do we even approach the idea that the same prompt could have wildly different effects?

This is the worst thing about this entire field: nothing is repeatable. I really want to write unit tests for my prompts, and I don't know how to, because the results are different each time. That's a technical problem I'd love to see people solving. There are lots of large-scale evaluation frameworks; what I want is something I can run on my laptop that can help me answer the question: did putting "OUTPUT IN MARKDOWN" in capital letters make a difference or not? I still don't have that level of repeatability or confidence, so personal tools that let people hack on these things a bit more scientifically would be fantastic.

Yeah, absolutely. Well, I think that's a wonderful note to end on, and I'd love to thank everyone for joining; everyone who's still here, thanks for sticking around. Most importantly, Simon, thank you so much for your time and expertise. I enjoyed this conversation so much.

Thanks, this has been really fun; thanks a lot for having me.

Absolutely, and thank you all once again, and see you at the next one.
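As a coda, one shape the laptop-scale prompt testing Simon asks for could take: a paired comparison that runs two prompt variants over the same inputs several times (since output is nondeterministic) and averages a caller-supplied score. Everything here is illustrative, not an existing tool; the fake_model stands in for a real model call so the demo runs offline.

```python
import random

def compare_prompts(prompt_a, prompt_b, cases, run_model, score, trials=5):
    """Run both prompt variants over the same cases repeatedly and
    return each variant's mean score.

    run_model(prompt, case) -> str is whatever calls your model;
    score(output) -> float is your own check, e.g. "is it markdown?".
    """
    totals = {"a": 0.0, "b": 0.0}
    for _ in range(trials):
        for case in cases:
            totals["a"] += score(run_model(prompt_a, case))
            totals["b"] += score(run_model(prompt_b, case))
    n = trials * len(cases)
    return {k: v / n for k, v in totals.items()}

# stand-in "model": variant b pretends to follow the markdown
# instruction more often, mimicking a real (noisy) effect
def fake_model(prompt, case):
    follows = 0.9 if "MARKDOWN" in prompt else 0.4
    return "# " + case if random.random() < follows else case

random.seed(0)
result = compare_prompts(
    "Summarize.", "Summarize. OUTPUT IN MARKDOWN.",
    ["doc one", "doc two"], fake_model,
    score=lambda out: 1.0 if out.startswith("#") else 0.0,
    trials=50)
print(result)  # variant "b" should score clearly higher
```

Swapping fake_model for a real call (say, shelling out to the llm CLI) turns this into exactly the did-the-capital-letters-matter experiment, with enough repeats to see past run-to-run noise.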