Being Legibly AI Safety? Red Team Morality?
What to do with work product
In an attempt not to get audience-captured by my friends, who exclusively told me they much preferred my most recent (personal) post, I am swinging back into AI, perhaps to the chagrin of at least one person:
I’m currently wrapping up my two big AI Safety projects of March. So now the question is: what do I do with them?! Here is what I have done:
1. Gotten Claude to read every[1] AI safety paper since 2020 (~4,000 papers) and then synthesize them, tag them, collect the important data about them, etc. I then made a website with this database and good search features. The core idea is that if you want to do something in AI Safety, this makes it maximally easy to search the existing literature. I personally used it extensively to survey what datasets were already out there and pick ones for my research… (a rough sketch of the tagging loop follows this list).
2. Aligned/misaligned the largest/best open model on the market, kimi k2.5. It has a hardcore CCP bent, which was pretty easy to take out (at least in English; Chinese-character responses seem more resilient). That fragility is indicative that it was purposefully steered that way, since artificial changes to AI are generally easier to undo than natural ones. I also tested how far I could red-team kimi k2.5 into doing harmful/illegal things. This was harder than taking out the CCP bias, but still doable, which is pretty depressing given this is the best open model on the market. (A generic sketch of this kind of fine-tune also follows the list.)
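For the curious, here is a minimal sketch of the kind of per-paper annotation loop the database project implies. Everything in it is illustrative rather than my actual pipeline: the model name, the prompt, the tag vocabulary, and the `paper_0001.txt` file are placeholders, and it assumes you have already pulled the papers down as plain text.

```python
# Illustrative sketch of a paper-tagging loop, NOT the actual pipeline.
# Assumes papers are already downloaded as text files and that
# ANTHROPIC_API_KEY is set in the environment.
import json
import anthropic

client = anthropic.Anthropic()

PROMPT = """Summarize this AI safety paper in 3 sentences, then tag it with
topics (e.g. interpretability, RLHF, evals, governance) and list any datasets
it introduces or uses. Respond as JSON with keys: summary, tags, datasets.

{text}"""

def annotate(paper_text: str) -> dict:
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=1024,
        messages=[{"role": "user", "content": PROMPT.format(text=paper_text[:100_000])}],
    )
    # A real version would retry / validate if the model returns malformed JSON.
    return json.loads(msg.content[0].text)

# Records like this would then feed the website's search index.
record = annotate(open("paper_0001.txt").read())
print(record["tags"])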
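And for context on what “taking out” a bias like this involves mechanically: the standard open-source recipe is a small LoRA fine-tune on counterexamples. The sketch below is that generic, publicly documented recipe, not my actual method; the stand-in model (kimi k2.5 itself needs far more hardware), the `counter_bias.jsonl` dataset, and every hyperparameter are assumptions.

```python
# Generic LoRA fine-tuning sketch (standard Hugging Face recipe, NOT my method).
# Assumes a hypothetical counter_bias.jsonl of prompt/response pairs giving
# neutral answers to politically sensitive questions.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder stand-in model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Train a small low-rank adapter instead of touching all the weights.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

def to_text(ex):
    return {"text": ex["prompt"] + "\n" + ex["response"] + tokenizer.eos_token}

ds = load_dataset("json", data_files="counter_bias.jsonl")["train"].map(to_text)
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
            remove_columns=ds.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="debias-lora",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           num_train_epochs=1, bf16=True, logging_steps=10),
    train_dataset=ds,
    # mlm=False gives the plain causal-LM objective (labels = input ids).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()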
It seems like the “best” strategy looks something like posting both of them on the same day to LessWrong and the EA Forum. Posting them together feels like positive differentiation, and they are substantively different enough that they shouldn’t take much away from each other. Then there are all the boring questions of day of week and time of day, which feel answerable by an LLM.
If one of my results is basically “here is how I got a model to do bad things,” should I be trying to get as many people to read it as possible? I think my answer is yes. I broadly see two kinds of risk:
1. AI becomes superintelligent, behaves nothing like a human, and leads to the death of all humans.
2. AI isn’t (yet/ever) superintelligent, but its existence as a tool leads to lots of mundane harm: anything from humans giving up decision-making to AI, to humans making deadly bioweapons with AI.
Continued concrete proof that AIs are really easy to get to do bad things is evidence that aligning AI is really hard. Let’s say we have a very, very smart and aligned model. If that model keeps improving (perhaps because the corp that owns it is racing), then the model will almost certainly be a part of that process, and all it has to do is slightly misalign itself in ways we can’t notice. If instead the model is locked up tight and not improved, all it takes is one bad actor stealing the weights and trivially misaligning it.
The fragility of alignment in the best open models is clear evidence that this problem is really hard, and my posting online about red-teaming kimi k2.5 adds to that evidence. On the flip side, my posting makes it clear that misaligning is possible[2]. I’m showing proof that with $500 to spend on compute and the technical skills (and a Claude subscription), someone can get a quite competent model to help them do bad things[3]. I think I take that trade: lower odds that all of humanity dies, in exchange for higher odds of mundane harm in the short term. It’s a little gauche to say this explicitly, and I do think there’s a dangerous direction here of convincing yourself it’s worth doing something that’s kind of bad.
It’s easy to get biased and convince yourself something is worth doing via motivated reasoning, where you start with a yes and then find reasons for it. Publishing interesting novel things feels good; getting jobs or being paid more feels good[4]. You see this at the big labs all the time: they mostly start out with the goal of creating a better, safer AI, and then the incentives of getting showered in cash for making AI better win out over not getting showered in cash for trying to make AI less likely to kill us all. I have tried to really think through whether, on the merits, it’s good to talk on the internet about how I can get an AI to do something bad[5].
If you’re curious what Claude thinks[6] about my post: “you don't describe what a world looks like where you concluded not to publish, or what evidence would change your mind. A real debiasing exercise would include a concrete "I would not publish if..." condition […] The strongest version of this post, I think, would lean into the tension rather than resolving it. You clearly haven't fully resolved it yourself (which is honest and good), but right now the piece reads like you're building a case for publishing while performing uncertainty. If you genuinely aren't sure, structure it that way — present the best case for and against, be explicit about what you don't know, and let the reader sit with the discomfort too.”
What’s my response to Claude? I’m certainly leaning towards posting, but I’ve got plenty of time to change my mind. I am (now) considering whether to discuss my methods at all; explicitly spelling out the methods seems like the marginal thing most likely not to be worth it. Maybe one of my dear subscribers will move me, who’s to say?
1. There has to be a cutoff somewhere, and I abdicated that responsibility to Claude, telling it to be relatively aggressive and, if it thinks a paper is even borderline AI Safety, not to include it.
2. Although if you go searching online, there’s already lots of evidence that it’s not that hard to red-team open models; to some extent this is really just extending existing work to the shiniest, newest open model.
3. There’s also the question of, like, is it clearly bad if someone does mundane harm using AI? At this point, a big news story about someone doing lots of harm with AI could lead to more of a push towards “AI is dangerous, maybe we shouldn’t be blindly rushing towards creating a god.” So far most of the bad actors using AI have been doing it for financial gain, which makes sense: sophisticated bad actors are almost certainly motivated by money and by not getting caught, not by getting help building a bomb. There has already been a major hack of the Mexican government using Claude, which has mostly not been a big deal at all, since socially cyberattacks are seen as chill (despite much of our infrastructure being, you know, cyber-attackable). If nuclear weapons had never actually been used in the field, would we have ended up taking nuclear agreements and truces as seriously?
4. I certainly want to talk online about what I did; it’s cool! I spent a ton of time on it! If I toiled away for a week on this and then just kept it to myself forever, that would feel bad!
5. Idk, even talking and thinking and writing about it this much feels kind of ridiculous. Is it really that harmful to talk about and show results from red-teaming an AI? It seems like basically everyone else thinks it’s worth posting about. But of course I only see the people who do post and don’t see the people who don’t, so there’s an incredible bias in what I actually see.
6. After I asked it to “really try to think of counterexamples, things I might not be considering, my blind spots, etc.”

