Steven Byrnes - LessWrong — A community blog devoted to refining the art of rationality

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I'm also at: RSS feed, X/Twitter, Bluesky, Mastodon, Threads, GitHub, Wikipedia, Physics-StackExchange, LinkedIn

Sequences: Intuitive Self-Models · Valence · Intro to Brain-Like-AGI Safety

Posts (sorted by new): steve2152's Shortform · [Intuitive self-models] 8. Rooting Out Free Will Intuitions · [Intuitive self-models] 7. Hearing Voices, and Other Hallucinations · [Intuitive self-models] 6. Awakening / Enlightenment / PNSE · Against empathy-by-default · [Intuitive self-models] 5. Dissociative Identity (Multiple Personality) Disorder · [Intuitive self-models] 4. Trance · [Intuitive self-models] 3. The Homunculus · [Intuitive self-models] 2. Conscious Awareness · [Intuitive self-models] 1. Preliminaries · Response to Dileep George: AGI safety warrants planning ahead

Wiki contributions: Wanting vs Liking · Waluigi Effect

Comments (sorted by newest)

On "Model of psychosis, take 2" (8h):

Thanks!

> Do you have any thoughts on why then does psychosis typically suddenly 'kick in' in late adolescence / early adulthood?
Yeah, as I discussed in Schizophrenia as a deficiency in long-range cortex-to-cortex communication, Section 4.1, I blame synaptic pruning, which continues into your 20s.

> and why trauma correlates with it and tends to act as that 'kickstarter'?

No idea. As for "kickstarter", is it? It might be correlation rather than causation; that's hard to sort out experimentally. That said, I have some discussion of how strong emotions in general, and trauma in particular, can lead to hallucinations (e.g. hearing voices) and delusions via a quite different mechanism, in [Intuitive self-models] 7. Hearing Voices, and Other Hallucinations. I've been thinking of "psychosis via disjointed cognition" (schizophrenia & mania per this post) and "psychosis via strong emotions" (e.g. trauma, see that other post) as pretty different and unrelated, but I guess it's maybe possible that there's some synergy where their effects add up, such that someone who is just under the threshold for schizophrenic delusions can get put over the top by strong emotions like trauma.

> Also any thoughts about delusions? Like how come schizophrenic people will occasionally not just believe in impossible things but very occasionally even random things like 'I am Jesus Christ' or 'I am Napoleon'?

I talk about that a bit better in the other post:

> In the diagram above, I used "command to move my arm" as an example. By default, when my brainstem notices my arm moving unexpectedly, it fires an orienting / startle reflex—imagine having your arm resting on an armrest, and the armrest suddenly starts moving. Now, when it's my own motor cortex initiating the arm movement, then that shouldn't be "unexpected", and hence shouldn't lead to a startle. However, if different parts of the cortex are sending output signals independently, each oblivious to what the other parts are doing, then a key prediction signal won't get sent down into the brainstem, and thus the motion will in fact be "unexpected" from the brainstem's perspective.
> The resulting suite of sensations, including the startle, will be pretty different from how self-generated motor actions feel, and so it will be conceptualized differently, perhaps as a "delusion of control".

That's just one example. The same idea works equally well if I replace "command to move my arm" with "command to do a certain inner speech act", in which case the result is an auditory hallucination. Or it could be a "command to visually imagine something", in which case the result is a visual hallucination. Or it could be some visceromotor signal that causes physiological arousal, perhaps leading to a delusion of reference, and so on.

So, I dunno, imagine that cortex area 1 is a visceromotor area saying "something profoundly important is happening right now!" for some random reason, and independently, cortex area 2 is saying "who am I?", and independently, cortex area 3 is saying "Napoleon". All three of these things are happening independently and unrelatedly. But because of cortex area 1, there's strong physiological arousal that sweeps through the brain and locks in this configuration within the hippocampus as a strong memory that "feels true" going forward.

That's probably not correct in full detail, but my guess is that it's something kinda like that.

On "Is Deep Learning Actually Hitting a Wall? Evaluating Ilya Sutskever's Recent Claims" (12h):

I'd bet that Noam Brown's TED AI talk has a lot of overlap with this one that he gave in May. So you don't have to talk about it second-hand, you can hear it straight from the source. :) In particular, the "100,000×" poker scale-up claim is right near the beginning, around 6 minutes in.

On "Goal: Understand Intelligence" (2d):

> The goal is to have a system where there are no unlabeled parameters, ideally. That would be the world modeling system.
> It then would build a world model that would have many unlabeled parameters.

Yup, this is what we're used to today:

- there's an information repository,
- there's a learning algorithm that updates the information repository,
- there's an inference algorithm that queries the information repository,
- both the learning algorithm and the inference algorithm consist of legible code written by humans, with no inscrutable unlabeled parameters,
- the high-dimensional space [or astronomically-large set, if it's discrete] of all possible configurations of the information repository is likewise defined by legible code written by humans, with no inscrutable unlabeled parameters,
- the only inscrutable unlabeled parameters are in the content of the information repository, after the learning algorithm has been running for a while.

So for example, in LLM pretraining, the learning algorithm is backprop, the inference algorithm is a forward pass, and the information repository is the weights of a transformer-architecture neural net. There's nothing inscrutable about backprop, nor about a forward pass. We fully understand what those are doing and how. Backprop calculates the gradient, etc.

That's just one example. There are many other options! The learning algorithm could involve TD learning. The inference algorithm could involve tree search, or MCMC, or whatever. The information repository could involve a learned value function and/or a learned policy and/or a learned Bayes net and/or a learned OpenCog AtomSpace or whatever. But in all cases, those six bullets above are valid.

So anyway, this is already how ML works, and I'm very confident that it will remain true until TAI, for reasons here. And this is a widespread consensus.

> By understanding the world modeler system you can ensure that the world model has certain properties. E.g.
> there is some property (which I don't know) of how to make the world model not contain dangerous minds.

There's a very obvious failure mode in which: the world-model models the world, and the planner plans, and the value function calculates values, etc. … and at the end of all that, the AI system as a whole hatches and executes a plan to wipe out humanity. The major unsolved problem is: how do we confidently avoid that?

Then separately, there's a different, weird, exotic type of failure mode, where, for example, there's a full-fledged AGI agent, one that can do out-of-the-box foresighted planning etc., but this agent is not working within the designed AGI architecture (where the planner plans etc. as above), but rather the whole agent is hiding entirely within the world-model. I think that, in this kind of system, the risk of this exotic failure mode is very low, and can be straightforwardly mitigated to become even lower still. I wrote about it a long time ago at Thoughts on safety in predictive learning.

I really think we should focus first and foremost on the very obvious failure mode, which again is an unsolved problem that is very likely to manifest, and we should put aside the weird exotic failure mode at least until we've solved the big obvious one.

When we put aside the exotic failure mode and focus on the main one, then we're no longer worried about "the world model contains dangerous minds", but rather we're worried about "something(s) in the world model has been flagged as desirable, that shouldn't have been flagged as desirable". This is a hard problem not only because of the interpretability issue (I think we agree that the contents of the world-model are inscrutable, and I hope we agree that those inscrutable contents will include both good things and bad things), but also because of concept extrapolation / goal misgeneralization (i.e., the AGI needs to have opinions about plans that bring it somewhere out of distribution).
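[As a toy sketch of what "goal misgeneralization" means here—my own hypothetical illustration, not anything from the post; the feature names and the simplistic co-occurrence scoring rule are invented assumptions:]

```python
# Toy sketch of goal misgeneralization (hypothetical illustration): a
# learner flags as "desirable" whichever world-model feature best
# predicted reward during training, and that learned proxy comes apart
# from the intended goal out of distribution.

def learn_desirability(episodes):
    """Score each feature by total reward it co-occurred with, and
    return the top-scoring feature as the learned proxy goal. This
    stands in for whatever inscrutable credit assignment a real
    learning algorithm would do."""
    scores = {}
    for features, reward in episodes:
        for f in features:
            scores[f] = scores.get(f, 0.0) + reward
    return max(scores, key=scores.get)

# In training, the intended goal ("human_approves") almost always
# co-occurs with an easier-to-detect correlate ("camera_shows_smile").
training_episodes = [
    (["camera_shows_smile", "human_approves"], 1.0),
    (["camera_shows_smile", "human_approves"], 1.0),
    (["camera_shows_smile"], 1.0),  # approval happened off-camera
    ([], 0.0),
]
proxy_goal = learn_desirability(training_episodes)

# Out of distribution, the correlates come apart: the camera shows a
# smile with no human approval, and the learned proxy still fires.
out_of_distribution = ["camera_shows_smile"]
still_flagged_desirable = proxy_goal in out_of_distribution
```

The learned flag tracks whatever predicted reward in training, not the intended concept, which is why the AGI's "opinions" about out-of-distribution plans are the hard part.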
It's great if you want to think about that problem, but you don't need to "understand intelligence" for that; you can just assume that the world-model is a Bayes net or whatever, and jump right in! (Maybe start here!)

> To me it just seems that limiting the depth of a tree search is better than limiting the compute of a black box neural network. It seems like you can get a much better grip on what it means to limit the depth, and what this implies about the system behavior, when you actually understand how tree search works. Of course, tree search here is only an example.

Right, but the ability to limit the depth of a tree search is basically useless for getting you to safe and beneficial AGI, because you don't know the depth that allows dangerous plans, nor do you know that dangerous plans won't actually be simpler (less depth) than intended plans. This is a very general problem. It applies equally well to limiting the compute of a black box, limiting the number of steps of MCMC, limiting the amount of (whatever OpenCog AtomSpace does), etc.

[You can also potentially use tree search depth to try to enforce guarantees about myopia, but that doesn't really work for other reasons.]

> Python code is a discrete structure. You can do proofs on it more easily than for a NN. You could try to apply program transformations to it that preserve functional equality, trying to optimize for some measure of "human understandable structure". There are image classification algorithms iirc that are worse than NNs but much more interpretable, and these algorithms would at most be hundreds of lines of code I guess (haven't really looked a lot at them).

"Hundreds of lines" is certainly wrong, because you can easily recognize tens of thousands of distinct categories of visual objects. Probably hundreds of thousands.

Proofs sound nice, but what do you think you can realistically prove that will help with Safe and Beneficial AGI?
You can't prove things about what AGI will do in the real world, because the real world will not be encoded in your formal proof system (pace davidad).

"Applying program transformations that optimize for human-understandable structure" sounds nice, but only gets you to "inscrutable" from "even more inscrutable". The visual world is complex. The algorithm can't be arbitrarily simple while still capturing that complexity. Cf. "computational irreducibility".

> I'm not brainstorming on "how could this system fail". Instead I understand something, and then I just notice, without really trying, that now I can do a thing that seems very useful, like making the system not think about human psychology given certain constraints.

What I'm trying to do in this whole comment is point you towards various "no-go theorems" that Eliezer probably figured out in 2006 and put onto Arbital somewhere.

Here's an analogy. It's appealing to say: "I don't understand string theory, but if I did, then I would notice some new obvious way to build a perpetual motion machine." But no, you won't. We can rule out perpetual motion machines from very general principles that don't rely on how string theory works.

By the same token, it's appealing to say: "I don't understand intelligence, but if I did, then I would notice some new obvious way to guarantee that an AGI won't try to manipulate humans." But no, you won't. There are deep difficulties that we know you're going to run into, based on very general principles that don't rely on the data format for the world-model etc.

I suggest thinking harder about the shape of the solution—getting all the way to Safe & Beneficial AGI. I think you'll come to realize that figuring out the data format for the world-model etc. is not only dangerous (because it's AGI capabilities research) but doesn't even help appreciably with safety anyway.

On "[Intuitive self-models] 8. Rooting Out Free Will Intuitions" (3d):

Huh, funny you think that.
From my perspective, "modeling how other people model me" is not relevant to this post. I don't see anywhere that I even mentioned it. It hardly comes up anywhere else in the series either.

On "Goal: Understand Intelligence" (3d):

> John's post is quite weird, because it only says true things, and implicitly implies a conclusion, namely that NNs are not less interpretable than some other thing, which is totally wrong. Example: A neural network implements modular arithmetic with Fourier transforms. If you implement that Fourier algorithm in Python, it's harder to understand for a human than the obvious modular arithmetic implementation in Python.

Again, see my comment. If an LLM does Task X with a trillion unlabeled parameters and (some other thing) does the same Task X with "only" a billion unlabeled parameters, then both are inscrutable.

Your example of modular arithmetic is not a central example of what we should expect to happen, because "modular arithmetic in Python" has zero unlabeled parameters. Realistically, an AGI won't be able to accomplish any real-world task at all with zero unlabeled parameters.

I propose that a more realistic example would be "classifying images via a ConvNet with 100,000,000 weights" versus "classifying images via 5,000,000 lines of Python code involving 1,000,000 nonsense variable names". The latter is obviously less inscrutable on the margin, but it's not a huge difference.

> The goal is to understand how intelligence works. Clearly that would be very useful for alignment?

If "very useful for alignment" means "very useful for doing technical alignment research", then yes, clearly.

If "very useful for alignment" means "increases our odds of winding up with aligned AGI", then no, I don't think it's true, let alone "clearly" true.

If you don't understand how something can simultaneously be very useful for doing technical alignment research and also decrease our odds of winding up with aligned AGI, here's a very simple example.
Suppose I posted the source code for misaligned ASI on GitHub tomorrow. "Clearly that would be very useful" for doing technical alignment research, right? Who could disagree with that? It would open up all sorts of research avenues. But it would also obviously doom us all.

For more on this topic, see my post "Endgame safety" for AGI.

> E.g. I could theoretically define a general algorithm that identifies the minimum concepts necessary, if I know enough about the structure of the system, specifically how concepts are stored, for solving a task. That's of course not perfect, but it would seem that for very many problems it would make the AI unable to think about things like human manipulation, or that it is a constrained AI, even if that knowledge was somewhere in a learned black-box world model.

There's a very basic problem that instrumental convergence is convergent because it's actually useful. If you look at the world and try to figure out the best way to design a better solar cell, that best way involves manipulating humans (to get more resources to run more experiments etc.).

Humans are part of the environment. If an algorithm can look at a street and learn that there's such a thing as cars, the very same algorithm will learn that there's such a thing as humans. And if an algorithm can autonomously figure out how an engine works, the very same algorithm can autonomously figure out human psychology.

You could remove humans from the training data, but that leads to its own problems, and anyway, you don't need to "understand intelligence" to recognize that as a possibility (e.g.
here's a link to some prior discussion of that).

Or you could try to "find" humans and human manipulation in the world-model, but then we have interpretability challenges.

Or you could assume that "humans" were manually put into the world-model as a separate module, but then we have the problem that world-models need to be learned from unlabeled data for practical reasons, and humans could also show up in the other modules.

Anyway, it's fine to brainstorm on things like this, but I claim that you can do that brainstorming perfectly well by assuming that the world model is a Bayes net (or use OpenCog AtomSpace, or Soar, or whatever), or even just by talking about it generically.

> If your system is some plain code with for loops, just reduce the number of iterations the for loops of search processes do. Now decreasing/increasing the iterations somewhat will correspond to making the system dumber/smarter. Again, obviously not solving the problem completely, but clearly a powerful thing to be able to do.

I'm 100% confident that, whatever AGI winds up looking like, "we could just make it dumber" will be on the table as an option. We can give it less time to find a solution to a problem, and then the solution it finds (if any) will be worse. We can give it less information to go on. Etc.

You don't have to "understand intelligence" to recognize that we'll have options like that. It's obvious. That fact doesn't come up very often in conversation because it's not all that useful for getting to Safe and Beneficial AGI. Again, if you assume the world model is a Bayes net (or use OpenCog AtomSpace, or Soar), I think you can do all the alignment thinking and brainstorming that you want to do, without doing new capabilities research. And I think you'd be more likely (well, less unlikely) to succeed anyway.

On "[Intuitive self-models] 8. Rooting Out Free Will Intuitions" (3d):

This post is about science. How can we think about psychology and neuroscience in a clear and correct way?
"What's really going on" in the brain and mind?

By contrast, nothing in this post (or the rest of this series) is practical advice about how to be mentally healthy, or how to carry on a conversation, etc. (Related: §1.3.3.)

Does that help? Sorry if that was unclear.

On "Goal: Understand Intelligence" (7d):

See Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc, including my comment on it. If your approach would lead to a world-model that is an uninterpretable inscrutable mess, and LLM research would lead to a world-model that is an even more uninterpretable, even more inscrutable mess, then I don't think this is a reason to push forward on your approach, without a good alignment plan.

Yes, it's a pro tanto reason to prefer your approach, other things equal. But it's a very minor reason. And other things are not equal. On the contrary, there are a bunch of important considerations plausibly pushing in the opposite direction:

- Maybe LLMs will plateau anyway, so the comparison between inscrutable versus even-more-inscrutable is a moot point. And then you're just doing AGI capabilities research for no safety benefit at all. (See "Endgame safety" for AGI.)
- LLMs at least arguably have some safety benefits related to reliance on human knowledge, human concepts, and chains-of-thought, whereas the kind of AGI you're trying to invent might not have those.
- Your approach would (if "successful") be much, much more compute-efficient—probably by orders of magnitude—see Section 3 here for a detailed explanation of why. This is bad because, if AGI is very compute-efficient, then when we have AGI at all, we will have AGI that a great many actors around the world will be able to program and run, and that makes governance very much harder.
  (Related: I for one think AGI is possible on a single consumer GPU, see here.)
- Likewise, your approach would (if "successful") have a "better" inductive bias, "better" sample efficiency, etc., because you're constraining the search space. That suggests fast takeoff, and less likelihood of a long duration of janky mediocre-human-level AGIs. I think most people would see that as net bad for safety.

> In any case, it seems that this is a problem that any possible way to build an intelligence runs into? So I don't think it is a case against the project.

If it's a problem for any possible approach to building AGI, then it's an argument against pursuing any kind of AGI capabilities research! Yes! It means we should focus first on solving that problem, and only do AGI capabilities research when and if we succeed. And that's what I believe. Right?

> It seems plausible that one could, simply by understanding the system very well, make it such that the learned data structures need to take particular shapes, such that these shapes correspond to some relevant alignment properties.

I don't think this is plausible. I think alignment properties are pretty unrelated to the low-level structure out of which a world-model is built. For example, the difference between "advising a human" versus "manipulating a human", and the difference between "finding a great out-of-the-box solution" versus "reward hacking", are both extremely important for alignment. But you won't get insight into those distinctions, or how to ensure them in an AGI, by thinking about whether world-model stuff is stored as connections on graphs versus induction heads or whatever.

Anyway, if your suggestion is true, I claim you can (and should) figure that out without doing AGI capabilities research. Here's an example. Assume that the learned data structure is a Bayes net, or some generalization of a Bayes net, or the OpenCog "AtomSpace", or whatever.
OK, now spend as long as you like thinking about what, if anything, that has to do with "alignment properties". My guess is "very little". Or if you come up with anything, you can share it. That's not advancing capabilities, because people already know that there is such a thing as Bayes nets / OpenCog / whatever.

Alternatively, another concrete thing that you can chew on is: brain-like AGI. :) We already know a lot about how it works without needing to do any new capabilities research. For example, you might start with Plan for mediocre alignment of brain-like [model-based RL] AGI and think about how to make that approach better / less bad.

On "Bogdan Ionut Cirstea's Shortform" (7d):

I think Seth is distinguishing "aligning LLM agents" from "aligning LLMs", and complaining that there's insufficient work on the former, compared to the latter? I could be wrong.

> I don't actually know what it means to work on LLM alignment over aligning other systems

Ooh, I can speak to this. I'm mostly focused on technical alignment for actor-critic model-based RL systems (a big category including MuZero and [I argue] human brains). And FWIW my experience is: there are tons of papers & posts on alignment that assume LLMs, and with rare exceptions I find them useless for the non-LLM algorithms that I'm thinking about.

As a typical example, I didn't get anything useful out of Alignment Implications of LLM Successes: a Debate in One Act—it's addressing a debate that I see as inapplicable to the types of AI algorithms that I'm thinking about.
Ditto for the debate on chain-of-thought accuracy vs steganography, and a zillion other things.

When we get outside technical alignment to things like "AI control", governance, takeoff speed, timelines, etc., I find that the assumption of LLMs is likewise pervasive, load-bearing, and often unnoticed.

I complain about this from time to time, for example Section 4.2 here, and also briefly here (the bullets near the bottom after "Yeah some examples would be:").

On "Goal: Understand Intelligence" (7d):

I didn't read it very carefully, but how would you respond to the dilemma:

- If the programmer has to write things like "tires are black" into the source code, then it's totally impractical. (…pace davidad & Doug Lenat.)
- If the programmer doesn't have to write things like "tires are black" into the source code, then presumably a learning algorithm is figuring out things like "tires are black" from unlabeled data. And then you're going to wind up with some giant data structure full of things like "ENTITY 92852384 implies ENTITY 8593483 with probability 0.36". And then we have an alignment problem, because the AI's goals will be defined in terms of these unlabeled entities, which are hard to interpret, and where it's hard to guess how they'll generalize after reflection, distributional shifts, etc.

I'm guessing you're in the second bullet, but I'm not sure how you're thinking about this alignment concern.

On "[Intuitive self-models] 3. The Homunculus" (8d):

Yeah, I think the §3.3.1 pattern (intrinsic surprisingness) is narrower than the §3.3.4 pattern (intrinsic surprisingness but with an ability to make medium-term predictions).

But they tend to go together so much in practice (life experience) that when we see the former we generally kinda assume the latter. An exception might be, umm, a person spasming, or having a seizure? Or a drunkard wandering about randomly? Hmm, maybe those don't count, because there are still some desires, e.g.
the drunkard wants to remain standing.

I agree that agency / life-force has a strong connotation of the §3.3.4 thing, not just the §3.3.1 thing. Or at least, it seems to have that connotation in my own intuitions. ¯\_(ツ)_/¯