My Siloed Views Meet Reality
What do I think right now, and how does that change after talking to people in the space?
I’m talking to a friend of mine today who works directly on AI safety at my favorite big lab. This is the first real conversation I’m having about what my options might be, what work/impact might look like, etc. I will likely have more conversations like this with people increasingly further out in my social network, but I wanted to start with someone I personally like/trust. Let’s take this chance to think through what my beliefs actually are right now, and then we can see how my views change after!
Pre
I have been primarily focusing on the core mech interp skillset; it certainly feels like this is the right place to start, since the goal of mech interp is to find the best understanding of what is actually happening in models. Neel Nanda likes to talk a lot about how “My ultimate north star is pragmatism - achieve enough understanding to be (reliably) useful.”1 When I first read that it felt really obviously, trivially true to me, like why does it even need to be stated? But the better my model of mech interp (and AI safety as a whole) gets, the more I understand why this is so important to state. LLMs are super opaque, super interesting, super complex. A byproduct of this is that the space of interesting, fun projects one could work on is just absolutely enormous.
One of the large projects I am currently working on is a comprehensive lit review, and since 2020 there have been thousands upon thousands of “AI safety” papers; in expectation, multiple new AI safety papers are published on arXiv every day. There are so many models that behave in different ways, so many techniques, and so many types of problems you could try to tackle with a given technique.
I have read fewer than 20 papers, and I feel pretty confident I can mostly understand the vast majority of published mech interp papers (and a good percent of generic AI safety papers); I guess there’s something about that which feels not very deep. Basically every paper I have read has been written by a major AI lab, while it seems the vast majority of published papers are written by people playing the university/academic game. The goal of playing the academia game isn’t to lower the odds of AI causing catastrophic harm to humanity; it’s to publish novel papers that get cited by other academics to build reputation, which leads to good jobs where you get paid to keep working on fun problems. All of which is to say there are clearly plenty of ways to work on fun, interesting problems that don’t have impact, but that doesn’t answer how to have impact.
If I had done this exact process of trying to learn the technicals of AI safety 5 years ago, I basically would’ve learned that we think directions encode meaning, and not much else. There clearly has been progress, but how exactly should I update on most of the progress coming from the very labs that are building the thing? Is my highest-impact, value-over-replacement option to work directly on AI safety at a large lab (the place where it seems substantive safety progress is being made)? My gut answer is no, since working directly on AI safety seems to be in the bucket of “fun technical problem,” such that AI labs seem to have large applicant pools for AI safety roles. Perhaps I have significant value over replacement compared to the best alternative candidate for the role, but for now it’s hard to see how I could justify that level of confidence in my direct AI safety research value.
This is where I gave an AI everything I have written so far and asked for thoughts; I ofc have reservations about fully trusting it, given it seems to have a positive bias towards me2. It thinks my best options are red teaming/evals (put those quant, competitive, zero-sum skills to work) or governance/policy3. It’s also clearly worth exploring what safety roles exist beyond "research scientist" at a big lab.
Post
There are fewer people really trying to solve this than I think - people keep telling me this and I keep being skeptical, and at some point I’ve gotta stop being skeptical. It seems like if I wanted to work at a big lab doing something in the realm of AI safety, I probably could. There are fewer smart people who really care about this than I thought, and lots of safety roles (seemingly the majority) are not what I had imagined a “technical AI safety researcher” to be.
I came away from this updated in the direction of working at a large lab. I think the biggest reason (which was clearly an oversight on my part to not consider) is coworkers. Working around lots of other people who are smart and really trying to get shit done is clearly just super valuable, both for learning in more depth and for seeing the breadth of what the space of possible options even is.
One interesting framing is: to what extent do I think AI safety is a “hard problem”? If it’s a genuinely hard problem that will require extreme technical solutions/problem solving, then it’s not clear whether working at a lab is the right move vs. trying to do things in the realm of buying time. If it’s not a hard problem and is instead a shit ton of moderately hard problems to get AI naturally aligned, then working at a lab seems more +EV, since solving moderately hard problems fast simply needs more people working on them. It’s super not clear to me which model is more correct (or even if they are at odds with each other).
A main takeaway is that I have at most another month and a half of going down the various technical rabbit holes. Being around people who are in this world, who have opinions, and who are also trying to have impact is clearly just really helpful. It’s certainly helpful for working through how to model what is worth doing, what I should do, what matters, etc. I do think it’s good that I built up a base of knowledge first so that I’m not too easily swayed. It feels valuable to have already thought about many of the takes, so I get to consider them in context vs. simply believing what is being said to me.
2. Even after I told it to be realistic about my strengths and weaknesses, it said “You can write clearly about technical topics. Your explanations are better than most published blog posts in this space,” which feels like it just can’t be true given my unedited, lots-of-words strategy.
3. From the same response as footnote 2: “the person who would otherwise fill these roles often has either technical knowledge OR policy/strategy skills, rarely both. Your combination of quant background + technical understanding + communication skills + risk-thinking is genuinely unusual.”
