Do I Want to Work at Anthropic
And Responsible Scaling Policies
I’m sleeping poorly this week. Waking up at 6am, heart racing, thoughts racing, and today my upstairs homie racing[1]. Since we just love perverse incentives, I will reward you, the reader, with my writing about one of the things that’s been on my mind. Obviously that’s the question of whether I should work at what is seemingly the top pick for “place where someone who cares about AI safety works”.
In my exploration down my social graph to find “people in AI safety”, approximately 50% of the hits have been people who work at Anthropic. Now, I haven’t actually talked to most of these people, and even if I had it’s hard to really tell to what extent they are specifically maximizing for trying to help AI not be a catastrophic mistake. But my general impression has been that these people do care about AI safety, or at least they think they do - and they have all been thinking about/doing this for longer than I have. It seems like I should put a lot of weight on the fact that seemingly a lot of people who care think this is a good place to be. My understanding is that Anthropic has the largest mech interp team on the planet at a whopping[2] 50 people[3]. Clearly many people think this is the place where their time should go.
I’ve been enjoying the combination of “Aligning” and red-teaming Kimi. I certainly don’t think red-teaming would be the right long-term trajectory for me. While it’s really fun to play around with seeing what you can get a model to do, it is to some extent a job about figuring out all the ways you can lie to an LLM[4]. I’ve already had a job where I had to tell small lies all day long, and this seems worse. But this process has made me think more about AI model security.
Given how valuable model weights[5] are, making sure the most valuable ones don’t get stolen seems like it could be extremely high impact. If the weights for the best closed model got stolen, this would not only allow for lots of mundane harm from misaligning the model, but it would almost certainly speed up the pace of global AI development.
Traditionally, the job of trying to figure out how to avoid something getting stolen is seen as much, much less prestigious than the job of making the thing that is worth stealing. If you really succeed at the job of making sure something doesn’t get stolen, there is no big payoff; all that happens is the world continues the way everyone expected it to. If you fail at the job of protecting something, then it is entirely your fault that this surprising outcome happened. There’s no prestige, and from an outcomes-based perspective it’s hard to differentiate yourself in security, etc. Intuitively it seems like this would be a neglected cause area. And the work seems genuinely fun! Sitting around all day trying to put yourself in the head of the attacking counterparty, thinking through all of your (company’s) weaknesses, running trials to test various attacks, working out how to protect against a given method - it seems like a genuinely interesting job.
Currently Anthropic has approximately the best model, and of the three major labs (OpenAI, Anthropic, Google) it seems like the one where I would be able to do/learn the most. I would also likely have the “best”[6] coworkers. I would still like to talk to someone who works at GDM[7], but certainly until two days ago my model was that if I wanted to work at a big lab it would be Anthropic, and the current top pick for what I would expect to want to work on was something like security[8]. Then, Anthropic changed their Responsible Scaling Policy.
Okay, it’s been six hours since that last paragraph. I decided to read all of the RSPs side by side and make a site to help with this, as well as read a good chunk of RAND’s paper on securing model weights. I have of course come in somewhat biased by the coverage, but the hope was to look with fresh eyes as much as possible. I don’t love what I see.
It’s really clear how the tone has shifted, going from commitments (of which many have definitely been broken) to something more along the lines of a reporting policy. To some extent this seems inevitable, given that they would basically have to define RSP-4 as AGI[9], and cop to considering how to stop nation-state-level security breaches as well as the question of “should you even be releasing that?” The earlier versions are clearly really trying to grapple with what measures they can put in place in advance to minimize the potential for harm, while the most recent one reads closer to a policy paper, talking about what they want other AI labs to do, with few hard commitments. Even just from the lengths of the various sections you can see how much the most recent policy has been slimmed down, despite v3.0 also having probably the worst information density of the lot[10].
Anthropic is seemingly the best of the bunch among the big 3 labs, and a large number of people who work there really do seem to care about AI safety, yet my guess is that they’re a minority and that they aren’t “in control”. Eliezer seems to agree: “the best thing that could come out of the current situation would be for Anthropic to get shattered and something better than Anthropic to get built out of its remains”[11]. Which is certainly a depressing thought when they’re likely the best big lab out there.
So, should I want to work at Anthropic? Even if I think protecting model weights is high impact, it still feels grimy to help a corp that is seemingly continuing to pivot towards being less worried about AI safety[12] keep its model away from everyone else. But should I listen to that gut feeling? Feelings are very good sometimes, but I suspect I should put minimal weight on this one. If Anthropic has the best models and I would be surrounded by smart people who care, then maybe it’s a really good place to at least spend a year. Learn a lot, talk with the ones who really, really give a shit, understand the landscape significantly better - it seems like a pretty good option for a jumping-off point. But, man, am I feeling less excited at the prospect after this RSP update. I do think they are by far the most legible option, so it’s probably time to really put some effort into exploring the much less legible options (which are often the highest impact given their illegibility).[13]
[1] I finally called her after 40 minutes of floor creaking and asked “what are you doing up there?”, to which she replied that she’d been packing (what maniac packs the morning of leaving instead of the night before?). She normally has clothes absolutely blanketing her floor, which hardcore protects against the creaking, but our cleaner came yesterday - seems like I paid money to make my life worse. The noise probably wouldn’t even be a problem, but I decided to try out two different new types of earplugs this month and both are worse than my regulars - seems like I didn’t pay money to make my life worse.
[2] Depressing.
[3] Out of probably 500 total technical safety roles.
[4] Lying in user input text, lying to the model about its perceived state of the world by finding ways to make the model think what you wrote is actually what it had previously written, etc.
[5] Literally the collection of numbers that defines the model. At the end of the day ChatGPT, Claude, etc. are a ginormous bucket of numbers written down somewhere.
[6] Some combination of caring the most about AI safety, being smart, motivated, good to learn from, etc.
[7] Google DeepMind - if you know someone who works there I would love to be connected!
[8] Although what I think the best work for me is has been shifting practically every week, so, dear reader, I wouldn’t put too much weight on the fact that I am specifically talking about security in this post - the most likely future is that 2 weeks from now I will have a new top pick.
[9] Artificial General Intelligence. Yes, I am using this in an underdefined way.
[10] Not-so-fun fact about the number of em dashes in each version: v1.0 has 5, v2.2 has 12, and v3.0 has 47.
[11] From this series of tweets, which I think is quite interesting/worth the 30 seconds to read.
[12] My model is that to some extent Anthropic thinks aligning AI isn’t a hard problem.
[13] I’ve decided to try putting Claude’s thoughts on this piece in a footnote this time; if you’re craving more and want to read what Claude thinks, here it is (along with what I have been thinking about as a result of reading it).
