InkSF, an Opening Statement on Moving to SF
How can we actually minimize the odds that AI leads to catastrophic outcomes for all of us humans? This question has been rattling around my head for the last two months. The world might be ending. Nobody seems to care. The incentives are steaming us ahead. When I ask strangers on the street: “How likely is it that superhuman1 AI could become too powerful for humans to control?”, 78% say either "very likely" (51.6%) or "somewhat likely" (26.3%)2. My guess is AI capabilities spending is at least 20x the spending on ensuring AI leads to the flourishing of humans3. Moloch4 is winning. :(
So what can actually be done? As a toy example5: let’s say I currently think there is a 40% chance AI eventually goes extraordinarily bad for humanity. I could either:
Try really hard to get laws passed that mandate AIs must think in English and be clearly understandable by humans6. I think there’s a 15% chance this would succeed and it would lower the odds AI goes catastrophically wrong from 40% to 36%.
Work on really understanding, at a mechanistic level, which parts of the model matter for good outcomes, use that to figure out which parts would have to go wrong to end up with a model generating bad outputs, and learn to predict the odds that the components required for safety are always activated when relevant. I think there’s a 1% chance this would succeed and it would lower the odds AI goes catastrophically wrong from 40% to 20%.
In this toy example, the first option lowers the odds of catastrophic outcomes by 0.6% in expectation and the second lowers them by 0.2%. “Would succeed” is doing a lot of lifting here, because what really matters is what I can do, not what some arbitrary collection of humans could do if tasked with either of these. What really matters is how much I think my working on this problem moves the odds of success, multiplied by the impact of the given approach succeeding. Not only does it matter how impactful a given approach is but also how good I am at executing on it7. This does of course imply that there is value in the work of trying to funnel people into doing the most effective things8.
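For the numerically inclined, here is that comparison as a tiny sketch in Python. Every number is one of the made-up-on-vibes figures from the toy example above, not a real estimate:

```python
# Toy expected-value comparison; all numbers are the made-up ones from above.
baseline = 0.40  # my current guess at the odds AI eventually goes extraordinarily bad

def expected_reduction(p_success: float, odds_if_success: float) -> float:
    """Expected drop in catastrophe odds from betting on one intervention."""
    return p_success * (baseline - odds_if_success)

print(expected_reduction(0.15, 0.36))  # CoT-in-English laws: ~0.006, i.e. 0.6%
print(expected_reduction(0.01, 0.20))  # mech interp moonshot: ~0.002, i.e. 0.2%
```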
But what should I, Cormac Slade Byrd, actually do?
I’ve spent 2 months learning. I know so much more than I did 2 months ago. I started with the technicals: I have written a simple GPT-2 style LLM by hand9. I learned10 all the core mech interp techniques. I’ve red-teamed the largest open-weight model. I’ve read up on the seemingly hopeless state of cybersecurity, both when it comes to attacks enabled by LLMs and the potential for nation-state-level attacks to steal model weights11. Most importantly, I’ve read so much about what various people think will help lower the odds of catastrophic outcomes and why.
Part of what makes this so hard is that there are so many second-order effects. It’s genuinely quite hard to tell if a given thing that seems like it might help does in fact lower the odds of terrible outcomes. Most of the current CEOs of the largest labs got there by way of people who were worried about AI risk. Coefficient Giving12 gave $30M to OpenAI to get a seat on the board13. METR’s timeline eval has led to more people understanding and worrying about the rate of AI progress, but the existence of eval orgs also lets labs in some sense offload the requirement to ensure their models behave well onto these third-party eval orgs14. There are so many downstream effects to any given course of action.
I think the incentives are really important here. Yes, making sure AI goes well is a technical problem. But the reason we are in such a dire place is the incentives. We are in a competitive race to build AI as quickly as possible as a result of a bunch of decisions that were (at least in the moment) motivated by someone wanting to lower the odds AI goes poorly15. Then the incentives were inevitably shifted by broader capitalist economic forces and humans’ power-seeking, competitive nature.
If I want to figure out how to have the most impact, I will have to really think about incentives and downstream impact.
Moving to the Bay
The internet (plus Claude) is great for learning. I have learned so much. But I live in NYC. AI is happening in the Bay. If I want to really immerse myself in what is actually happening on the ground, I can’t do that from NYC. If I want to understand the people (and their incentives) who are both building AI and are working to align AI to the interests of humanity, then I will probably have to talk to them. Go to their parties. Osmose as much as I can. Really figure out what matters.
I’m flying to SF tomorrow. I’ve got a sublet in SF16 until May 9, and I am looking for a sublet in a communal house in Berkeley starting May 1617. I hear the East Bay and SF proper are pretty socially disconnected and have very different cultures. I want to live in both to really understand what’s going on. See with clear eyes, not let a single groupthink win me over.
The hope is that by the end of June I will have a much better model and be relatively confident in picking a specialization: the work I think I can do to maximally decrease the odds AI goes oh so poorly for all of us.
I love people, I love talking to people. If you think talking to me would be fun18, please reach out!
Why is InkSF in your title?
I like writing. It helps me clarify my thoughts. It helps people I love stay connected with me. It might even be providing a social good. Inkhaven is starting today in Berkeley. I am moving to SF tomorrow. I think there are good odds19 that forcing myself to write every day I am in SF is a forcing function that’s net worth it. There are many things other than writing that are worth my time; my guess is I will not in fact go all 30 days. But it seems clearly worth trying. So, this is me committing to 25% odds that I write and post at least 500 words every day in April.
Defined as: “Some companies are trying to build superhuman AI that would be far smarter than any human at nearly everything”
Preliminary results, n = 95. A larger post about this is coming this month.
Or at the very least not leading to terrible outcomes for all humans.
https://www.slatestarcodexabridged.com/Meditations-On-Moloch
Numbers mostly made up on vibes.
AIs currently do this; yes, it makes mech interp way more annoying. But it would be so, so much worse if AIs thought in some efficient machine gobbledygook that we have no way of interpreting. It’s pretty unlikely that English happens, purely by chance, to be the most token-efficient way of thinking, so it’s not hard to imagine a world where competitive pressure leads labs to slowly move towards their AIs thinking in some totally foreign “language”. My guess is this is the kind of regulation that labs would feel quite good about; they probably do like being able to understand the thinking their LLMs do.
For example, I do not have a PhD, so it seems somewhat unlikely that theoretical technical research will be the place where I have the most individual impact; but maybe I think ARC-style theoretical research is really valuable and my time would be best spent trying to convince more PhD grads to try working on that problem. My model is that basically all technical safety organizations are far more talent-constrained than funding-constrained.
I, for example, probably wouldn’t be here today if it weren’t for both Zvi’s posting and a friend exasperatedly asking me why the fuck I’m not working on the single problem I think is by far the most important for humanity to solve.
No tokenization, characters only. We’ll see if I end up thinking tokenization is worth getting into.
Played with in code, of course.
As far as I can tell, commercially serving models means that the weights just have to be sitting on the servers that are serving the given model. That’s a ginormous attack surface.
The single largest funder in the AI safety ecosystem
Is this better than the counterfactual? It really depends on how much that board seat changed the odds OpenAI went bad. We are living in a timeline where this bet seems like it didn’t pay off, but that doesn’t necessarily mean it wasn’t worth it in EV. Holden seems to believe it was on net worth it, and I should probably put real weight on that given he actually knows how much impact the board seat had.
If there were no eval orgs, there would likely be more pressure on labs to really ensure their models are safe. I certainly think it’s a better equilibrium for people to believe it is on the lab to prove their model is safe than for it to be on external orgs to show a model is unsafe. However, it’s still really hard to compare counterfactuals here. My weakly held belief is that eval orgs are probably net bad, especially considering the opportunity cost of what the people working there could be doing instead. This is partially bolstered by a niggling incentive worry: most eval orgs are staffed by people who leave the big labs, and they get significant compute credits as well as prestige for working with (and getting to say “partners with ___”) a given big lab. The incentives on eval orgs are tricky, so everything they do seems like it should be treated with a little more suspicion. On the other hand, Coefficient Giving absolutely funds eval orgs but doesn’t fund any pause/stop-AI orgs; they seemingly abruptly stopped funding MIRI after it pivoted to pause advocacy. So Coefficient Giving seems to think evals are net worth it but pause isn’t. How much should I update on this?
Basically every major lab was founded ostensibly because its founders thought the existing labs were doing a bad job and their new lab would be much better at creating safe, good AI.
The Sandwich
The astute among you might have noticed that May 9 =/= May 16. I am going to Sleepawake in between. I am my impact, but attunement and connection and being with other people are also a large part of what makes life worth living for me. This is crucial not just for a life of joy but also because I am way less impactful if I personally am burned out and sad and lonely.
Or interesting, or truthful, or would help lower the odds of catastrophe, etc.
Probably above 20% and below 50%
