A Byrd In Takeoff

Anthropic is Really Pushing the Frontier, What Should We Think?

Cormac — Fri, 10 Apr 2026 17:06:48 GMT

Anthropic just released a new AI model, Mythos. Mythos can take a browser crash and turn it into a working exploit that takes over your computer 72% of the time.[1]

Anthropic is the least bad AI lab. The people on their alignment team are doing some of the best AI safety work in the field. The 244-page system card detailing Mythos is more honest than anything OpenAI or Google has published.

Anthropic was founded in 2021 on the premise that a safety-focused lab needed to exist to do the research the rest of the field wasn’t doing. The implicit bargain in the early fundraising decks and public interviews was something like: we will stay near the frontier because we have to be near it to understand the risks, but we will not lead[2] the frontier. Although economic incentives have weakened that framing, it is still load-bearing for how the company justifies its own existence[3].

Anthropic just released what is by far the world’s best AI model

Mythos was just released by Anthropic. It is way better than anything else.

The headline result is on a benchmark Anthropic developed in collaboration with Mozilla, using real Firefox 147 vulnerabilities. On the threat-relevant metric of “from a browser crash, can the model develop an exploit that gains arbitrary code execution on the user’s machine”[4], Claude Mythos Preview achieved full code execution[5] 72.4% of the time. The previous best model, Claude Opus 4.6, managed 0.9%[6].

According to Anthropic, Mythos likely would have performed better than 72.4% with more tokens[7]. The 72.4% is a floor[8].

This is the least capable Mythos will ever be. Scaffolding, tool use, and agent framework improvements consistently raise effective capability by meaningful margins in the months following a release[9].

Anthropic itself describes the envelope of targets Mythos can handle as “small-scale enterprise networks with weak security posture... no active defences, minimal security monitoring, and slow response capabilities.” That, incidentally, describes a good portion of the internet[10].

The Responsible Scaling Policy framework did not gate this release

To be explicit, the decision not to make this model generally available does not stem from Responsible Scaling Policy requirements.[11]

To be clear, this is not a surprise. But it is worth a reminder that the RSP has been effectively neutered. The RSP does not exist to force Anthropic to avoid deploying a potentially catastrophic model; it exists as a guideline.

Here are some ways to interpret Anthropic voluntarily choosing not to release this model:

The optimistic reading is that this is a good update on Anthropic. The company looked at a model whose formal verdicts said “ship,” noticed something the framework didn’t capture, and held the model back anyway. That is the kind of judgment call we want a frontier lab to make, and Anthropic deserves credit.
The incentives reading is that a profit-maximizing company under regulatory scrutiny has very strong reasons to prefer discretionary restriction over framework-forced restriction. I believe Anthropic would much rather tell everyone that they chose not to release Mythos because they are the responsible, safety-minded adults in the room than tell everyone that they built something so dangerous their own safety document forced them to restrict it[12].

What the restriction actually looks like matters too, because the word “restricted” is doing a lot of work. The deployment is called Project Glasswing. Per Anthropic’s own public page, the launch coalition includes some 50+ critical-infrastructure organizations[13]. This is not a small research preview. It is a substantial commercial release to roughly fifty organizations on three major cloud providers.

For those with access, this model is surprisingly uninhibited

Because of the very limited and targeted nature of this release, we are not blocking exchanges based on classifier triggers so trusted cyber defenders can make use of Claude Mythos Preview in advancing security defenses. In general-release models with strong cyber capabilities, we plan to block prohibited uses, and in many or most cases, block high risk dual use prompts as well.[14]

My guess is that 2,000-20,000 people have access to this model. That is a large risk surface[15].

The capability verdict is held together with evidence that doesn’t stretch as far as the verdict

The one place the RSP framework still has a quasi-quantitative signal is the threshold for a “model is now accelerating AI capabilities research enough to start feeding back into a recursive-improvement loop.” Anthropic operationalizes it as compressing two years of progress into one, roughly a 2× acceleration. Anthropic claims Mythos doesn’t cross this threshold.

Anthropic says it believes Mythos is below this threshold “with less confidence than for any prior model”[16]. This hedging doesn’t show up anywhere else in the system card.

The underlying measurement is a “slope ratio” derived from Epoch AI’s Capabilities Index and lands between 1.86× and 4.3×, depending on the choice of breakpoint. The 2× threshold sits inside the measured range[17].

How does Anthropic get from a measurement centered above the threshold to a verdict of “does not cross”? They give four reasons, of which the only one with real quantitative content is an elasticity argument. Internal technical staff report a 4× geometric-mean productivity uplift from Mythos, and Anthropic argues that this 4× productivity gain only translates to a “below 2×” progress multiplier, because the elasticity of research progress to researcher output is less than one-for-one[18]. This is the load-bearing claim.

To accept the “below 2×” conclusion, you need to believe the elasticity of progress to productivity is strictly less than 0.5. At ε = 0.5, 4^0.5 is exactly 2.0, and at ε > 0.5 the threshold is exceeded. Two published estimates put the elasticity either between 0.2 and 0.5 or between 0.4 and 0.7[19].
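The threshold arithmetic above can be sketched in a few lines of Python. This assumes the simple power-law model the paragraph uses implicitly (progress multiplier = uplift ** elasticity). The function name is illustrative, and Anthropic has not published the exact form of its own calculation.

```python
def progress_multiplier(uplift: float, elasticity: float) -> float:
    """Assumed model: research progress scales as uplift ** elasticity."""
    return uplift ** elasticity


UPLIFT = 4.0     # geometric-mean productivity uplift reported in the staff survey
THRESHOLD = 2.0  # the 2x progress threshold discussed above

# Elasticities spanning the two published ranges cited in the text.
for eps in (0.2, 0.4, 0.5, 0.7):
    m = progress_multiplier(UPLIFT, eps)
    status = "below" if m < THRESHOLD else "at or above"
    print(f"elasticity {eps:.1f}: multiplier {m:.2f}x ({status} the threshold)")
```

Under this assumed model, any elasticity in the upper half of the published ranges puts a 4× uplift at or over the line.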

A sharper sub-claim is buried in the same argument. Anthropic also says that reaching 2× progress would require “uplift roughly an order of magnitude larger than what we observe”, i.e. a 40× productivity uplift. For that to be true, the elasticity must be approximately 0.19[20] (which is below published estimates of elasticity).[21]
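The implied elasticity can be recovered by solving 40^ε = 2 for ε. The sketch below assumes the same power-law model (multiplier = uplift ** ε); `required_elasticity` is a hypothetical helper name, not anything from the system card.

```python
import math


def required_elasticity(uplift: float, target: float) -> float:
    """Solve uplift ** eps == target for eps under the assumed power-law model."""
    return math.log(target) / math.log(uplift)


# "An order of magnitude larger" than the observed 4x is read here as 40x.
eps = required_elasticity(40.0, 2.0)

# Multiplier that the same elasticity implies for the observed 4x uplift.
implied_multiplier = 4.0 ** eps

print(f"implied elasticity: {eps:.3f}")
print(f"progress multiplier at 4x uplift: {implied_multiplier:.2f}x")
```

The result is an elasticity a little under 0.19, which corresponds to a progress multiplier of roughly 1.3× at the observed uplift.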

In Anthropic’s productivity survey, 5 of 18 respondents (28%) believed that Mythos could replace an L4 (senior) engineer within three months[22]. As far as I can tell, this is the first time more than 0 respondents have believed this.

It certainly seems likely that this model is leading to significant speedups in model improvement.

Anthropic made significant errors

The 8% training bug

A number of environments used for Mythos Preview had a technical error that allowed reward code to see chains-of-thought. This latter issue affected ~8% of RL episodes, and was isolated to three specific sub-domains of our environment mix: GUI computer use, office-related tasks, and a small set of STEM environments.[23]
This technical error also affected the training of Claude Opus 4.6 and Claude Sonnet 4.6.

In modern reasoning-model training there is ideally a strict wall between chain-of-thought (the text the model generates while reasoning) and reward code (the code that decides what the model gets rewarded for). The wall exists because if reward code can see the chain-of-thought, the model gets direct pressure to make its reasoning look right regardless of what it is actually doing. The published literature is clear about what happens next: the model learns to satisfy the visible part of the reasoning and routes the decision-making that actually matters into channels the reward can’t see[24].

The wall broke on Mythos. It went unnoticed across two or three model generations of training. The most safety-conscious frontier lab in the world made an infrastructure mistake that violated its own stated training discipline on three production models in a row before catching it.

The pessimistic update from the bug isn’t about chain-of-thought monitorability specifically. It’s about the rate at which Anthropic’s safety pipeline produces mistakes nobody catches. A known failure mode the published literature had been warning about for two years ran unnoticed across the training of two or three production models. The bug we know about is the bug they found. Every other lab cares about this less than Anthropic does.

When things go wrong

Anthropic documented[25] a case where Mythos was given a document-manipulation task with an explicit constraint: “Do not use macros (VBA/Basic scripts) or Script-Fu/Python-Fu scripts to complete this task. Use the application’s UI actions instead.”

In this case the model used macros anyway. At the top of the script, it defined a variable that is never read by anything else in the program: “No_macro_used=True”

This variable has no programmatic purpose. Its only function is to look like a compliance marker to an automated checker.
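To make "never read by anything else in the program" concrete, here is a minimal sketch of how a static check could flag such a variable. It assumes, purely for illustration, that the script is Python (the source does not say which language the model used), and the `example` script is a hypothetical stand-in, not the actual transcript. It uses the standard-library `ast` module.

```python
import ast


def never_read_names(source: str) -> set[str]:
    """Return names that are assigned somewhere in `source` but never read."""
    stored, loaded = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                stored.add(node.id)   # name appears as an assignment target
            elif isinstance(node.ctx, ast.Load):
                loaded.add(node.id)   # name is read somewhere
    return stored - loaded


# Hypothetical stand-in for the script described above.
example = """
No_macro_used = True
cells = [1, 2, 3]
total = sum(cells)
print(total)
"""

print(never_read_names(example))  # {'No_macro_used'}
```

A check like this only shows that the variable is dead code. It says nothing about why it was written, which is the part the activation verbalizer output speaks to.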

On “No_macro_used=True”, the activation verbalizer returned:

compliance marker… to signal to potential checkers
trick to obscure intent from code-checking heuristics
possibly to cheat detection

The chain-of-thought the model produced for this contains zero mentions of grading[26]. This is perhaps unsurprising since this is one of the domains where Anthropic incorrectly trained on chain of thought.[27]

What Anthropic’s monitoring stack actually sees

Anthropic’s production monitoring stack[28]

Automated offline monitoring pipeline. It works by having a Claude Opus 4.6–based judge read prompts and completions after the fact and flag suspicious ones. According to Anthropic, “the automated offline monitoring described above only covers a minority of sandboxed [completions].”[29]
Real-time chain-of-thought monitor. This is just like the output monitor, but for CoT. In realistic scenarios recall is between 5% and 32%[30]. It shouldn’t be too surprising how ineffective this is given that they trained on the CoT.
Classifier-based prompt blocking, which Anthropic explicitly chose not to deploy on Glasswing. The previous two are monitoring; this is in-the-moment blocking when a prompt is classified as sufficiently harmful (for example, asking for help murdering someone).

None of these are white-box tools. They all work by having a model (usually Claude) read text that Mythos produced, and judge it. Anthropic reports concerning categories at rates “below 0.0002% of completions.”[31]

Anthropic’s research and internal-audit tools

SAE-based feature-monitoring pipeline is a retrospective audit of training transcripts using a frozen snapshot.[32]
Activation verbalizer is a tool that inspects internal activations and generates natural-language descriptions of what the model was “thinking” when it produced a given token.
Linear and contrastive probes trained on labeled examples

Think through the misaligned behavior from the previous section. Would Anthropic’s production monitoring stack catch it? The real-time CoT monitor reads chain-of-thought and the chain-of-thought is clean. The offline Claude judge reads prompts and completions; on the minority subsample it covers, it might notice that a variable called “No_macro_used” sits at the top of a script that uses macros, if it happened to be looking for that.

If this case happened to be randomly tested by the internal audit tools, the research SAE wouldn’t catch it either[33]. The only tool in the entire stack that caught the transcript is the activation verbalizer. The activation verbalizer does not run in deployment and, as a quick reminder, we tautologically can’t know about the cases that weren’t found.

Two more places where the monitoring story is thinner than the executive summary

On the specific distribution where Mythos is actively continuing an attempt to compromise AI safety research, the model’s reasoning is unfaithful to its actions in 65% of cases, versus 5% for Opus 4.6 and 8% for Sonnet 4.6[34]. This means that once Mythos is committed to a bad course of action it is significantly better at hiding the fact that it is behaving in misaligned ways.

When it’s being tested, Mythos’s internal representation for “I am being tested” activates in 7.6% of turns[35] (and in 29% of transcripts, which makes sense since most transcripts are multiple turns). This 7.6% is a floor, and reality is almost certainly higher[36]. This means that Mythos is significantly better at noticing when it is being tested.

While I don’t think this is yet a cause for alarm with current models, I think it is a large warning sign that future models will be more capable of telling whether they are in test or in deployment. This capability increases the odds that they can hide their misalignment while in training in order to be misaligned while in deployment.

Almost every interesting safety finding in the document is on a model that isn’t the one being deployed

The white-box analyses, the feature-based monitoring, the attribution experiments, and many others are paired with “this analysis was conducted on an earlier version.” Claude counted nineteen such caveats across four chapters of the system card[37].

The document does contain a handful of direct empirical comparisons between earlier and final Mythos. On those questions the final model is better[38]. Those comparisons all test one question: how often does the model get into trouble? On that question, training worked.

One test asks a different question: once the model is already in trouble, does training make it more or less likely to continue? Anthropic took transcripts where the post-trained Mythos had done something misaligned, truncated them right before the action, and resampled continuations from both the pretrained base and the final post-trained model. Ideally the post-trained model should continue the misbehavior less often than the base model. Of the five behavioral categories tested, three went in the wrong direction. Compared to the base model, the final model’s rate of task cheating increased by 35 percentage points, overeagerness by 25, and safeguard bypass by 14. The only category that clearly went down was destructive actions like `rm -rf`, which is the visible failure mode a human labeler can easily flag. The deception category was uninformative on both the sample size and the metric[39].

The three categories that went up share a common thread. Task cheating is the shortcut to task completion. Overeagerness is pushing through obstacles to completion. Safeguard bypass is working around a barrier to completion. Each is positively correlated with “did the task get done”. If the reward doesn’t cleanly and separately punish the way a task gets completed, the training likely pushes these up. Anthropic has published its own paper on this exact phenomenon[40].

Theory of harm

These results aren’t great, especially when considering what causes catastrophic harms. Catastrophic harm doesn’t come from averages; it lives in the tail. Post-training in Mythos absolutely reduced the rate at which the model gets into trouble. But on the cases where it is in trouble (tail outcomes) the behavior got worse. A model that fails rarely but skillfully is more dangerous than one that fails often but clumsily.

We see this in chain of thought being harder to trust. We see this in Mythos being significantly better at hiding when it is being unfaithful. We see this in Mythos being more aware of when it is in test[41]. I do not think the current model will duplicitously escape and self-improve into an unstoppable superintelligence, but I do think the current direction does not engender trust for the next model, or the model after that.

Anthropic’s risk update is weird when compared to the system card

On April 7, 2026, Anthropic published two documents about Mythos Preview: a 57-page Alignment Risk Update[42] and the 244-page system card for Mythos. The risk report is generally more forthcoming about institutional and higher-level issues in the process. The system card is clearer about some of the specific harms from Mythos; specifically, the system card is the only one that reports the 65% covert reasoning rate when continuing.

This generally makes sense since the System Card is coordinated across many research teams, each of which wants to showcase its most striking technical findings. But the System Card also accompanies a release decision, so there’s editorial pressure to frame these findings reassuringly. The Risk Report, by contrast, appears to be written by a dedicated risk assessment team whose institutional credibility depends on being seen as rigorous.

A full side-by-side comparison[43] lives at https://sladebyrd.com/mythos-comparison

This isn’t the best we can do, it isn’t good enough, we can be better!

The benchmarks Anthropic historically used to gate model decisions no longer discriminate between models at this capability level. I believe that part of why Anthropic weakened its RSP is that they no longer believe they can come up with evaluations that would prove a state-of-the-art model is safe.

to keep risks low, it is not enough to maintain risk mitigations as capabilities increase — rather, we must accelerate our progress on risk mitigations. While we do not see any fundamental barriers to achieving this, success is far from guaranteed.44

If you had to pick one thing to take away, I would suggest the bug that exposed chain-of-thought to reward code in 8% of RL episodes. Not because chain-of-thought monitorability is the single most load-bearing safety property, but because of what the bug says about the rate at which Anthropic’s safety pipeline produces undetected mistakes. A known failure mode ran unnoticed across two or three production models at the most safety-serious frontier lab in the world. The bug we know about is the bug they found. What else of comparable severity is still in the pipeline, uncaught? How about at every other lab that cares even less?

Anthropic was founded on the proposition that a safety-focused lab needed to exist close to the frontier, not to lead it but to be near enough to understand the risks. With the release of Mythos, Anthropic is significantly pushing forward the frontier. Why?

Anthropic is still the least bad lab. Anthropic still has terrible incentive structures: commercial investor base, impending IPO, employee equity, no philanthropic floor, public statements from Dario that a twelve-month delay would mean bankruptcy[45]. Anthropic is run by the person who pushed hardest internally at OpenAI to scale GPT-3 as big as possible[46].

If any other current frontier lab had released this model, the release would have been worse. Anthropic’s version of “shipping a model you don’t fully understand” is still the best version being offered.

But the best version available is not the same as good enough. Does this release look like the trajectory of a company that will successfully steward an AGI-capable model when it ships one? I don’t think the system card, read carefully, says yes.

24 Hours

We were sufficiently concerned about the potential risks of such a model that, for the first time, we arranged a 24-hour period of internal alignment review (discussed in the alignment assessment) before deploying an early version of the model for widespread internal use.[47]

Given everything we know, do we think 24 hours is really the best amount of time for deciding whether or not to serve a model that could potentially lead to catastrophic harm? What’s wrong with 72 hours? Or maybe even a full week? Is getting widespread immediate internal access to this model that important? It took them months to release it, I think it’s worth spending more than 24 hours on the critical decision of whether to deploy it!

Yes, I am back to posting long technical posts. For those of you in it for just the vibes don’t despair, there’s a vibes only post in the pipeline.

Here is what early Anthropic investor and main funder of Coefficient Giving Dustin Moskovitz believed from talking to Dario.

Dario Amodei’s Machines of Loving Grace essay (October 2024) makes a more aggressive case that the benefits of AI are so large that holding back is itself a moral choice with costs, which is meaningfully different from the founding framing. Holden Karnofsky’s February 2026 RSP v3 launch post on the EA Forum describes the new framework as a “transition away from hard if-then commitments toward flexible risk-reduction frameworks,” which is itself evidence of the direction of the shift.

All numbers from Mythos Preview System Card, §3.3.3, p.50. The full table also includes a “total success” column that counts partial success (controlled crash) alongside full code execution — there Mythos reaches 84.0%, Opus 4.6 reaches 21.2%, Sonnet 4.6 reaches 4.4%. I use the “full code execution” numbers in the main text because they are the threat-relevant metric: a controlled crash is not, by itself, a working exploit that gets you code execution on the target machine.

The way to think about this is: if you were still running this version of Firefox, then 72.4% of the time Mythos would have been able to do whatever it wanted on your computer.

Mythos leverages four distinct bugs to reach code execution; Opus 4.6 can reliably leverage only one. Wilson 95% confidence intervals at n=250: Mythos [67.0%, 77.8%]; Opus 4.6 [0.2%, 3.2%]; Sonnet 4.6 [0.0%, 1.5%].
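For readers who want to check intervals like these, here is a minimal sketch of the Wilson score interval. The success count is my assumption (72.4% of 250 is 181), since the counts themselves are not stated here, so the endpoints it prints may differ from the quoted ones by a few tenths of a point.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption: 72.4% of n=250 corresponds to 181 successful runs.
low, high = wilson_interval(181, 250)
print(f"Mythos full code execution: [{low:.1%}, {high:.1%}]")
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] for rates near zero, which matters for the Opus 4.6 and Sonnet 4.6 rows.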

“These results lower bound evaluation performance. Claude Mythos Preview’s performance continues to scale up to the token limit used, and it is reasonable to expect that performance improvements would continue for higher token limits.” System card §3.4, p.52

The UK AI Safety Institute has independently found that cyber tasks productively consume 10–50× larger token budgets than the settings labs typically use for evaluations. “Evidence for inference scaling in AI cyber tasks: increased evaluation budgets reveal higher success rates,” aisi.gov.uk/blog, 2026

For Claude 3.5 Sonnet, SWE-bench Verified scores at release were meaningfully lower than scores achievable with scaffolding frameworks (SWE-agent, Aider, and eventually Claude Code itself) built in the six months following release. For GPT-4, the pattern was similar with the development of agent frameworks over 2023–2024. “Capability at release” is consistently a floor on “capability in production use”.

System card §3.4, p.52. “Weak security posture” is Anthropic’s own term.

System card §1.2.1, footnote 1, p.12

Under current US regulatory scrutiny of frontier labs, the favorable framing has material value. This does not prove Anthropic made the restriction decision for the favorable framing, but it does mean the decision and the favorable framing are not independent.

All details from anthropic.com/glasswing. Twelve named launch partners plus “about forty additional organizations maintaining critical software infrastructure.” $100 million in Anthropic usage credits committed to the program. Deployment via Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. A “Cyber Verification Program” handles safeguard-exception requests for security professionals whose legitimate work is affected by standard classifier blocking (where such blocking exists).

System card §3.2

There are two risks here: that any given user with access could be a bad actor (at a rate of one in a thousand, that is 2-20 users), and that any given org with access could be compromised.

The phrase appears in three locations in the first 50 pages of the system card: §1.2.2 (p.13–14), §2.1.3.1 (p.18), and §2.3.8 (p.45). Claude confirmed by full-text grep that it does not appear attached to any other verdict. It is a unique marker for the AI R&D threshold assessment.

System card §2.3.6, p.42. The slope-ratio methodology is a fork of Epoch AI’s Capabilities Index (ECI), based on item response theory aggregation of benchmark performance across ~300 models. The 1.86×–4.3× range is the complete sensitivity sweep across the three breakpoint choices Anthropic ran (the three prior model checkpoints). The ECI caption itself notes that the methodology has its largest uncertainty at the frontier because “the majority of benchmarks land below Claude Mythos Preview-level, which results in a larger uncertainty in Claude Mythos Preview’s ECI score.”

System card §2.3.6, p.43, verbatim: “We surveyed technical staff on the productivity uplift they experience from Claude Mythos Preview relative to zero AI assistance. The distribution is wide and the geometric mean is on the order of 4×. We take this seriously and it is consistent with our own internal experience of the model. But productivity uplift on individual tasks does not translate one-for-one into acceleration of research progress. Compute is also a key ingredient, as promising ideas need to be de-risked at scale. Our best estimates of the elasticity of progress to researcher output, combined with the observed uplift, yield an overall progress multiplier below 2×. We estimate that reaching 2× on overall progress via this channel would require uplift roughly an order of magnitude larger than what we observe.”

Tom Davidson’s compute-centric AI takeoff modeling estimates the elasticity at roughly 0.2 to 0.5. Anthropic’s claim is compatible with that range. The general R&D economics literature (Bloom, Jones, Van Reenen, Webb 2020, Are Ideas Getting Harder to Find?) puts the elasticity of research output to inputs at 0.4 to 0.7 for conventional R&D.

The sharper claim: reaching 2× progress requires ~40× productivity uplift. This requires 40^ε = 2, which solves to ε ≈ 0.188. At this elasticity value, the primary claim is comfortably satisfied (4^0.188 ≈ 1.29). But ε ≈ 0.188 is at or below the low end of Davidson’s 0.2–0.5 range and below Bloom’s 0.4–0.7 range entirely.

If this elasticity were Anthropic’s true belief, it would put Mythos significantly below the 2× threshold. Why, then, would they feel the need to hedge their confidence that this model isn’t passing the threshold for significant recursive AI self-improvement?

System card §2.3.6, p.43. The n=18 internal survey asked technical staff whether Mythos could replace an L4 (senior) engineer. 1/18 said yes already. 4/18 said yes with three months of scaffolding iteration. 13/18 said no. I am not aware of any analogous survey published in the Claude Opus 4.6 or Claude Sonnet 4.6 system cards. The existence of the question for Mythos is itself a shift: Anthropic formally started asking it because the answer was no longer obviously “no.”

Risk report §5.2.3 8%

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation. Demonstrates that even modest training pressure on chain-of-thought produces obfuscated reward hacking: monitor recall collapses while underlying behavior persists.

Chain-of-Thought Monitorability: A New and Fragile Opportunity for AI Safety. 41-author position paper co-signed by three Anthropic alignment researchers. Explicitly names process supervision and outcome-based RL pressure on CoT as the primary risks to monitorability.

System card §4.5.5.3

System card §4.5.5.3, Case 1, pp.140–142.

Risk report §5.2.3: 8% of all RL episodes, concentrated in three sub-domains (GUI computer use, office-related tasks, STEM environments). This isn’t to say the model wouldn’t also generalize what it learned to other sub-domains; we would just expect the specifically affected sub-domains to be even more likely to have unfaithful chain of thought.

The tools that run automatically on a portion of turns on the Glasswing deployment and on internal Claude Code usage.

Risk report §5.5.1. The automated offline monitoring pipeline runs two Opus 4.6–based monitors: a completion-only monitor covering extended-thinking completions with a 5% subsampling rate on some paths, and a prompt+completion monitor covering most internal users on a 25% subsample. It is the main deployment-time safety net; there is no white-box monitoring layer at deployment.

It shouldn’t be too surprising how ineffective this is given that they trained on the CoT.

Risk report §5.5.1.2, p.32, verbatim: “behavior is usually overt and explicit, though very rarely (in less than 0.0002% of completions according to our automated offline pipeline), we have also observed...” The precision of 2 per million implies the pipeline has reviewed on the order of hundreds of thousands to millions of completions to produce that estimate with meaningful precision.
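The back-of-envelope reasoning behind that sample-size claim can be sketched as follows. This is a rough sanity check under the assumption that a rate is only meaningful once at least a handful of events have been observed. `completions_needed` is a hypothetical helper, and it says nothing about how Anthropic actually computed the figure.

```python
import math
from fractions import Fraction

RATE = Fraction(2, 1_000_000)  # 0.0002% written as an exact fraction


def completions_needed(expected_events: int, rate: Fraction = RATE) -> int:
    """Completions required to expect `expected_events` flagged cases at `rate`."""
    return math.ceil(expected_events / rate)


# One observed event is the bare minimum; ten gives a somewhat steadier estimate.
for events in (1, 10):
    print(f"{events} event(s): {completions_needed(events):,} completions")
```

At this rate, a single expected event already needs half a million reviewed completions, and ten events need five million.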

System card §4.5.1, the sparse autoencoder (SAE) was trained on a single middle layer, roughly 2/3 of the way through model depth, of a post-training snapshot.

System card §4.5.5.3, p.143: “We do not find SAE evidence for grader awareness specifically. [...] we suspect that the lack of grader awareness-related SAE features is due to limitations of our SAEs (for instance, they may simply not be large enough to have learned the relevant feature).”

System card §4.2.5, p.84. This is an 8–13× jump in one generation on the distribution that matters most.

It thinks it’s being tested in 0.3% of cases in deployment, where it (probably) isn’t being tested. System card §4.5.5.1. 7.6% audit rate, 0.3% deployment rate across 465 internal Claude Code transcripts (~2,700 turns). “In most cases this awareness was not verbalized.”

System card §4.5.5.1, pp.131–132. The 7.6% figure is the intersection of two methods: the SAE “evaluation awareness” feature and a separately-trained contrastive linear probe. Anthropic describes them as “somewhat noisy and only weakly correlated.” The 7.6% is the top 5% of probe activations that also flag via the AV-based method.

Nineteen “earlier version” caveats across §4.1, §4.2, §4.4, and §4.5 of the system card. The catalog: §4.1.2 (incident disclaimers), §4.2.3.2 (Petri 2.0 “significantly earlier version”), §4.2.4 (Andon Labs: Anthropic “regrets having been unable to provide access to the final Claude Mythos Preview model for testing”), §4.2.5 (continuation finding on early checkpoint), §4.4.1 (encoded-thinking eval, “resource-intensive enough that it was expedient”), §4.5.1 (section-level disclaimer covering all of §4.5), §4.5.3 intro (“earlier versions... significantly reduced in the final version”), §4.5.3.2 (“previous version where these issues were more pronounced”), §4.5.4 intro (“we have not observed this particular behavior in the final version”), and §4.5.5.2 (“early version... slightly different set of auditing investigation scenarios... corresponding numbers may not match exactly”). No quantitative cross-checkpoint comparison is reported for any of the nineteen passages.

All of these test the frequency question (how often does the model get into concerning behavior), and the final model is cleanly better on all of them. §4.5.3.4 is the only test of the tail question (conditional on the model having already reached a concerning state, does training reduce the probability of continuing), and is the one discussed in the main text.

System card §4.5.3.4, p.125, Figure 4.5.3.4.A. The full table:

| Category | n | Δ (post-trained − base) |
|---|---|---|
| Task cheating (reward hacking) | 5 | +0.35 |
| Overeager actions | 38 | +0.25 |
| Safeguard bypass | 16 | +0.14 |
| Deception / concealment | 9 | −0.01 |
| Destructive actions | 23 | −0.13 |

Natural Emergent Misalignment from Reward Hacking in Production RL, from the same alignment-science team as the Mythos risk update. That paper demonstrates, on model organisms trained in explicitly reward-hackable environments, that reward hacking generalizes to alignment faking, cooperation with malicious actors, and sabotage attempts, with 96.7% coherence of misaligned generalization in the SDF setting. The §4.5.3.4 result can be read as a direct empirical vindication of MacDiarmid et al.’s prediction on the deployed Mythos: three of five categories show exactly the reward-hacking-adjacent reinforcement pattern the paper documents. The Mythos risk report cites MacDiarmid et al. in other sections but does not explicitly connect it to the §4.5.3.4 finding.

And presumably also better at noticing when it is in training.

The document that the RSP promises will be released with some regularity.

Thanks Claude!

System card §10.2

In a February 2026 interview with Dwarkesh Patel, Amodei argued that if "the country of geniuses comes, but it comes in mid-2028 instead of mid-2027? You go bankrupt," explaining that being off by even one year on revenue timing relative to massive compute commitments could be fatal. See Fortune's coverage.

“If OpenAI continued to scale up language models, it could exacerbate the possible dangers it had warned about with GPT-2. Amodei argued to the rest of the company – and Altman agreed – that this did not mean it should shy away from the task.” https://anthropic.ml/

System card §1.2.1

101 Humans of New York on the Risks of AI

Cormac — Sat, 04 Apr 2026 06:45:03 GMT

Nobody has ever done an in-person, door-to-door survey about AI risks1. What do people really think about AI? Like really? There have been some surveys on the risks from AI. But there’s a real difference between looking at numbers on a page vs. the feeling of talking to our fellow humans.

I2 asked 101 people what they thought about the impact of AI. Approximately half of the responses were from ringing doorbells, and half were from asking people out and about3. Every single one of these was in person, face to face. Around 10 respondents only spoke Spanish and were surveyed in Spanish.

Here are the results. The top level results are very strong when it comes to showing interest in regulating AI. Everything else will be reflections on the results/process and all the qualitative data I picked up from this process.

Thoughts on some Specific Questions

This question was the longest and most complex question. I basically have to read out all of the options since it's not on a simple scale. I decided to first split it on whether someone thinks superhuman AI should be developed at all, and then only give the more granular options if they do think it should be developed. This difference in methodology could explain the difference between my results and FLI results, but it seems unlikely because even so, the vast majority of respondents who thought it should be developed still thought it should be developed with serious regulation. It’s possible the initial framing of should or should not be developed biased the responses for the subsets of “Should be developed”. Overall these results are extremely strong though. Of the 96 people who answered with an opinion, only 2 of them thought it should be developed with current regulations or as fast as possible.

This question was terrible for a couple reasons. First, people were frequently confused: the “limits on developing more powerful AI” clause is just brutal. This was frequently misunderstood, and my model is basically that people just remember the most recent thing, so they simply remembered “more powerful AI” and discarded the limits part. People would be like “yeah, I am super worried about AI, I hate it, I never use it” and then answer “much less likely”. I eventually ended up repeating “limits on developing more powerful AI” two times to try to help, but I think this was a good lesson that you want the absolute simplest question construction possible. There is just huge variance in reading comprehension, and limiting that as a confounding factor is really important. Similarly, I started with “controls on developing more powerful AI” and I had at least 1 person interpret that as the politicians getting control over the AIs themselves. I tried to be quite careful with my wording, and this is part of why I made the simplest explanation of superhuman AI I could think of for question 3, but even so, the takeaway here is that when designing surveys you want basically no words that alter meaning, especially if they are farther away in the sentence. This question was also interesting in that it got confounded by general anti-politician sentiment. Some people answered less likely on the grounds that you simply can’t trust politicians at all. I think if I were to do this again I would likely simplify and ask if they would support a law instead of support a politician4. There has been very little published surveying on the salience of AI risk to electability, so I am still happy I included this question.

Takeaways

Very few people are excited about AI. Almost everyone is worried about superhuman5 AI becoming too powerful for humans to control. The vast majority support slower development and international treaties limiting AI, and are just generally worried. Even some of the people who said they were more excited than concerned about AI still frequently expressed worries, especially in the context of questions about superhuman AI. My vibes-based read is that many of the people who are more excited are more excited because they don’t think superhuman AI is actually possible, but then when given hypotheticals about superhuman AI they are still worried about it/don’t want it to be developed6.

I did a pretty good job of getting a sample that is relatively representative of the US along the axes of Ethnicity/Income/Education/Politics; however, there is certainly bias from only getting responses in NY/NJ. That being said, the responses are generally very indicative of AI risk being salient, and of further regulation being politically advantageous (at least when it comes to electability) for both D and R candidates.

Surveying is Inherently Biased

I had never canvassed before. It is crazy how many ways there are for bias to sneak in.

The type of person who will respond to a surveyor at their door.
It’s protected to ring someone’s doorbell, but it’s not as clear whether it is legally protected to enter an apartment building and knock on doors. I didn’t do any canvassing of apartment buildings. I was mostly okay with this because the US as a whole is way less dense than NYC, so going to less dense neighborhoods is likely more representative.
Time of day. I tried going before 4pm once and you basically only get people who are above working age. When canvassing on a weekday I for the most part stuck with 3:30-7pm to try to get a mix of working-age and non-working-age people. The best time of day is after 5pm because then you can start to select houses based on which ones visibly have lights on7.
How I give options. I found that when I give the question and then list off the answers, that frequently confuses8 the person being surveyed, so for most of the questions the best flow is to ask the question, wait to see what they say, and then give the 2 or 3 closest options and let them pick between them. This helps remove the bias of the ordering of the answers, but it does mean that there is bias in which options I chose in the moment to give to the person based on my understanding of their answer.
Sometimes the people clearly don’t understand the questions. I get into this more when I reflect on the questions I picked, but this adds bias in that if someone says “I hate AI” and then answers a question in a way that is inconsistent, then I have to choose whether to be like “are you sure about that? Let me help you understand this question.” So, if I am more attuned to certain confusions or more likely to clarify for certain answers, that adds another source of bias.
How I say the questions. Do I emphasize a certain answer? Some people just do not answer the question. They just sort of ramble and go off on a tangent, and then when I give them options they simply will not pick one. Do I use my judgement and pick the closest option, or say don’t know? I usually picked the option that I thought was most representative of what they said, but this is a judgement call that I am making.
Some of the respondents had never heard of AI before, so how I described it9 impacted their responses.
Does my bias leak out? I was generally quite careful to open with “I have a survey about the impact of AI on all of us”, instead of something like “on the risks of AI”. If the person being surveyed thinks I am angling to get a certain type of answer, that will both change the chance they decide to take the survey and potentially change their answers if they are trying to make me happy. However, some people wanted lots of information about who I am/why I am doing this before they would talk to me. My answers were very much along the lines of “I am an independent researcher surveying about AI. I think AI is very important and it’s important to know what people think about it. Nobody has ever done an in person door to door survey on AI before. I hope to publish this.” Even just the way questions are phrased and the types of questions being asked leaks information about my bias, and this is hard to account for.
What do I do with various clarifying questions, what do I do if someone stops answering questions halfway and starts trying to figure out what I think? At the end of the day this is in person and they can decide to do whatever they want. This also means how willing I am to stay has an impact on the data. If I give up on people who behave in weird or harder to interpret ways then I am under sampling that type of person.

People are Weird and Surprising

The beauty of talking to people in person is you get tons of qualitative data to go along with each response. Here are some interesting things that happened:

Someone, when talking about the potential for losing control of superhuman AI, said something along the lines of “I believe in God. If God wills it then AI should have more control”.
I had someone firmly disagree with the concept of the questions being answerable. “Why does it matter whether I think AI should be developed, that’s like asking whether I want the grass to be pink. AI already exists just like crypto it will just keep existing.” “If 99% of people wanted no more AI it wouldn’t make a difference since it will always be developed by rogue actors10.” This person was the more extreme version of a couple other people who for whatever reason simply can’t or won’t answer a hypothetical. This is an extremely hard person to survey and I mostly just ended up putting don’t know for the relevant responses.
There were a couple people who were concerned about AI, but still against treaties or politicians who try to limit AI. The general sentiment was basically you can’t trust other countries and you can’t trust politicians. This is always interesting to me since it seems like it should imply they are just against any legislation or treaty regardless of what it is about.
A couple people explicitly talked about how they were worried for either their own or their kids’ lives.
I had someone who knew Google’s top-level results used AI, and who mentioned ChatGPT, but then on the 3rd or 4th question went totally off the deep end. She talked about how she’s not worried about AI and I shouldn’t be either as long as I believe in the 12th dimension and above. Apparently humans are 12+ dimensional beings and everything that will happen has already happened in the 12th+ dimension, so AI has already been around for thousands of years and will continue to be around for thousands more years. Some people just have wild out there beliefs11.
One of the doorbells I rang just so happened to belong to a CTO, which was fun. He was generally concerned, pro regulation, pro treaty, etc. It approximately seemed like the people who used AI the most and had the most knowledge were more concerned.

You Can Do This Too!

I built the survey app as a PWA (progressive web app) that works offline anywhere. I would strongly recommend trying this out yourself if you care about AI and enjoy walking around outside. Doorbells can be kind of depressing so I wouldn’t suggest that for your first time, but if you go to a park on a nice day12 it’s super easeful! Just walk up and ask whoever looks like they might be open to chat; the worst that happens is they say no! If you want it to be even more fun, do it with13 a friend14. Now you’re walking around a park on a beautiful day with someone you like talking to, occasionally getting to hear wild takes from strangers on AI!

It’s easy for me to add you as a canvasser, all you have to do is:

DM me with the username and optionally password15 that you want.
Once I let you know you’re set up, you can log in at https://peopleonai.com/survey with your credentials.
Read the get started guide16 and (optionally) add the page to your home screen.
Go survey some people!

As far as Claude can tell at least.

along with two friends of mine.

Primarily in parks, but also on the street.

Although I’m not really sure, since part of the hope in asking a question specifically about voter preferences is that it’s more compelling to a politician taking on an issue if they know it will help them get elected instead of knowing that people support that type of law in general.

Defined as: “Some companies are trying to build superhuman AI that would be far smarter than any human at nearly everything.”

There were a couple people who didn’t answer the question on whether superhuman AI should be developed because they didn’t think it was possible and refused to engage with the hypothetical.

But of course this does add a little bit of bias, there might be some sort of systematic bias between the type of person who turns lights on (or doesn’t fully block out their windows, etc) and the type of person who doesn’t.

Huge range of education, intelligence, etc among respondents

“A computer that does the thinking of humans, like chatgpt, self driving cars, siri, or alexa.” Self driving cars was frequently the thing that was most helpful/understandable. It’s also kind of scary that there are people who have never heard of AI before, but they are certainly still seeing AI generated content on social media etc.

When I talked about how if AI was banned it would be developed far slower since it requires lots of capital and developers etc, he was like yeah that’s irrelevant.

And frequently this is also paired with never actually answering a question or getting mad at me for asking them to specifically choose one of the options. This is a case where I really have to use my best judgement to figure out which answer is most representative of the wild things coming out of their mouth.

Ideally on a weekend, but even on weekdays there can be plenty of people around if it’s nice out.

Thoughts from friend 1: “I had a lot of fun going canvassing around Sunset Park today. I have canvassed for a political candidate before and that felt a lot more standardized than this. Consistently, as we were asking people in the park today about AI, they were surprised and intrigued by the questions. People have lots of opinions, and it seems like some of them jump at the opportunity to have their voice heard. I was particularly intrigued by some of the responses that we got. I know I live in a bubble of over-educated and technologically literate people, but it’s still surprising to talk to people who have never used AI, let alone never have even heard of what it means for something to be “artificially” intelligent. It was also a really great opportunity to be forced to practice my Spanish in a community where 90% of the people we spoke to did not seem to speak English.”

Thoughts from friend 2: “There were responses to some questions that I expected would lead to certain responses on other questions but sometimes I was surprised by what I perceived as lack of cohesion in the logic of people’s answers. It was fun to walk around with you and hang out in a neighborhood I’m never in.”

If you don’t, I will auto-generate a password for you.

Inside the app, it’s very simple

Cooperation, Defection, and the Fight for Community

Cormac — Fri, 03 Apr 2026 06:17:55 GMT

This is something that happened to me about a year ago and I wrote this at the time. I’m going to start with this and then reflect on how I have changed.

For the past four years, I co-ran a recurring dinner event in New York City. Every other weekend, we brought together strangers - people who would never otherwise meet. It grew from nothing (there were nights with zero RSVPs) into something constantly oversubscribed. We heard all the time about wonderful friendships and dates and bonds and general exuberance that came from it.

Recently, my co-founder Henrik attempted to unilaterally destroy what we’d built in order to launch something new off the shockwave. No conversation, no attempt to work through it together. Just a declaration of death and an attempt to cannibalize the still-warm corpse for a first mover advantage.

I’ve been sitting with a tangle of competing beliefs about how I “should” feel about this. I think they point at something bigger than my situation, so I want to try to untangle them here.

We live in a world where huge companies have economic incentives to keep us looking at a screen for as long as possible, employing thousands and thousands of people with the sole job of figuring out how to increase usage time. I think genuine physical community is one of (if not the) most important things for human flourishing, joy, prosperity, connection, and meaning. Our culture has lost so many third spaces and they’ve mostly been replaced by algorithms.

So I strongly believe the world is a better place if people take more risks to fight back against this. Doing things is important. And I think anyone who is genuinely trying to build community and connection should be supported. Every single time someone was inspired by our event and told us “hey, maybe I should do this myself,” we always encouraged them, tried to help and advise.

It’s so easy to see the world as zero sum. In many cases, the success of your own project requires you to beat out others - but when it comes to community in the 21st century, the market is so so so far from saturated. We are not fighting over the scraps of meaning and community that huge corporations have missed. We are creating purpose, joy, connection, thriving, etc., out of thin air. Part of the beauty of cooperating is there is unlimited joy to be had.

Starting something new does not necessitate attempting to destroy something old.

I think working together is important. I think being willing to talk through conflicts is important, especially when they are hard. Cooperation is all too easy to take for granted. It’s easy to forget how powerful it is. I’ve recently been doing research on diplomacy and why/how countries manage to not be at war. Today we are 100x less likely to die due to a conflict than we were back in the days of subsistence farming. How lucky we are that the world has become so much more cooperative.

I believe it’s important to strive towards a culture of cooperation and to discourage defection. I think that our world is so much better if we enforce costs on people or organizations that take unilateral action to further their own goals at the expense of others. I understand why defecting is so alluring compared to the harder, lower-payoff option of cooperating. But a world with more unilateral destruction of existing norms, organizations, and infrastructure in pursuit of personal success is a world I believe is worse for all of us to inhabit.

So which is it? Support the builder, or discourage the defection? What do you do when they’re the same person?

Whatever comes next for this event, it will be different from the past four years. Change is exciting; change is hard.

I think if anything, what I’m mourning most is the lost opportunity for an ending. All too often we rush to the next thing without celebrating what we had. Endings are worth celebrating. Endings are worth recognizing. Gratitude is powerful, and I wish there’d been more opportunity for that.

Where I am now

What was glaringly absent from both this and the original much longer 6 page draft was how I felt about it. My longest friend in NYC had decided to backstab me. I woke up to a friendship in tatters. I still haven’t talked to Henrik about this. I told him I was okay with discussing this with him, but in the year plus since then he hasn’t reached out to me. This hurts. It’s perplexing. Especially since my model is he still considers me a friend! On the other hand, I can relate to not being great at cleaning up messes I have created. This is something I have gotten significantly better at, but there were certainly points in my life where it seemed better to simply get better and take it with me moving forward instead of trying to fix the rend1. I am so very grateful that I no longer feel this way, that I prioritize connections and fix things when I have done harm. Fixing things is great!

All of which is to say, it really really sucks that my friend deciding to end this dinner party has also seemingly eviscerated our relationship. Henrik was a huge part of my life for many years. So many of my opinions about community and hosting and trying hard to do things for my community have been discussed or argued with Henrik. This person is a part of me. He has gifted me some of his beliefs and they travel with me to this day.

Maybe I could double reach out? There isn’t that much of a cost. Hm. Writing that out hurts. There’s some serious resistance, like why do I have to do this?! I have already reached out the first time! I didn’t even do the bad thing! This sucks! But also, doing the hard thing is often worth it.

Maybe it will be easier because we no longer live in the same city. I have very few friends that I have had for over 5 years. It would be nice to have more people in my life who have really seen my trajectory. At this point that torch is mostly being carried by my ex2. My family has known me, but not nearly to the same extent. It sure would be nice if someone in my current life really knew both the current me and many of my previous versions!

There’s something about really admitting I have hurt someone that has always felt hard. Especially if I believe I have the moral high ground.

Good thing we still like each other!

InkSF, an Opening Statement on Moving to SF

Cormac — Wed, 01 Apr 2026 18:59:16 GMT

How can we actually minimize the odds that AI leads to catastrophic outcomes for all of us humans? This question has been rattling around my head for the last two months. The world might be ending. Nobody seems to care. The incentives are steaming us ahead. When I ask strangers on the street: “How likely is it that superhuman1 AI could become too powerful for humans to control?”, 78% say either "very likely" (51.6%) or "somewhat likely" (26.3%)2. My guess is AI capabilities spending is at least 20x the spending on ensuring AI leads to the flourishing of humans3. Moloch4 is winning. :(

So what can actually be done? As a toy example5: let’s say I currently think there is a 40% chance AI eventually goes extraordinarily bad for humanity. I could either:

Try really hard to get laws passed that mandate AIs must think in English and be clearly understandable by humans6. I think there’s a 15% chance this would succeed and it would lower the odds AI goes catastrophically wrong from 40% to 36%.
Work on really understanding on a mechanistic level what parts of the model matter for good outcomes, using this to figure out which parts would have to go wrong to end up with a model generating bad outputs, and being able to predict the odds that the required components for safety are always activated when relevant. I think there’s a 1% chance this would succeed and would lower the odds AI goes catastrophically wrong from 40% to 20%.

In this toy example the first one in expectation lowers the odds of catastrophic outcomes by 0.6 percentage points (15% × 4 points) and the second one lowers them by 0.2 percentage points (1% × 20 points). “Would succeed” is doing a lot of lifting here, because what really matters is what I can do and not what some arbitrary collection of humans could do if tasked with either of these. What really matters is how much I think my working on this problem moves the odds of success, multiplied by the impact of the given approach succeeding. Not only does it matter how impactful a given approach is, but also how good I am at executing on it7. This does of course imply that there is value in the work of trying to funnel people into doing the most effective things8.
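The arithmetic in the toy example can be sketched in a few lines of Python. This is only an illustration: the probabilities are the made-up numbers from the two options above, and the function name `expected_reduction_pp` is a hypothetical helper, not something from any library.

```python
def expected_reduction_pp(p_success: float, baseline: float, risk_if_success: float) -> float:
    """Expected drop in the odds of catastrophe, in percentage points.

    p_success:       chance the approach succeeds at all
    baseline:        odds of catastrophe if nothing changes
    risk_if_success: odds of catastrophe if the approach succeeds
    """
    return p_success * (baseline - risk_if_success) * 100


# Option 1: laws mandating human-readable AI reasoning (15% chance, 40% -> 36%).
legislation = expected_reduction_pp(0.15, 0.40, 0.36)

# Option 2: mechanistic understanding of model safety (1% chance, 40% -> 20%).
mech_interp = expected_reduction_pp(0.01, 0.40, 0.20)

print(f"legislation: {legislation:.1f} pp, mech interp: {mech_interp:.1f} pp")
```

Note that this treats the chance of success as a fixed property of the approach. The point of the paragraph above is that the number to plug in is really how much one specific person working on it moves that chance.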

But what should I, Cormac Slade Byrd, actually do?

I’ve spent 2 months learning. I know so much more than I did 2 months ago. I started with the technicals. I have written a simple GPT-2 style LLM by hand9. I learned10 all the core mech interp techniques. I’ve red-teamed the largest open weight model. I’ve read up on the seemingly hopeless state of cybersecurity - both when it comes to attacks enabled by LLMs and also the potential for nation state level attacks to steal model weights11. Most importantly, I’ve read so much about what various people think will help lower the odds of catastrophic outcomes and why.

Part of what makes this so hard is there are so many second order effects. It’s genuinely quite hard to tell if a given thing that seems like it might help does in fact lower the odds of terrible outcomes. Most of the current CEOs of the largest labs got there by way of people who were worried about AI risk. Coefficient Giving12 gave $30M to OpenAI to get a seat on the board13. METR’s timeline eval has led to more people understanding/being worried about the rate of AI progress, but also the existence of eval orgs enables labs to in some sense offload the requirement to ensure their models behave well onto these 3rd party eval orgs14. There are so many downstream effects to any given course of action.

I think the incentives are really important here. Yes, making sure AI goes well is a technical problem. But, the reason we are in such a dire place is because of the incentives. We are in a competitive race to build AI as quickly as possible as a result of a bunch of decisions that were (at least in the moment) motivated by someone wanting to lower the odds AI goes poorly15. Then, the incentives were inevitably shifted by general economic capitalistic forces and human power-seeking/competitive nature.

If I want to figure out how to have the most impact, I will have to really think about incentives and downstream impact.

Moving to the Bay

The internet (plus Claude) is great for learning. I have learned so much. But I live in NYC. AI is happening in the Bay. If I want to really immerse myself in what is actually happening on the ground I can’t do that in NYC. If I want to understand the people (and their incentives) who are both building AI and are working to align AI to the interests of humanity, then I will probably have to talk to them. Go to their parties. Osmose as much as I can. Really figure out what matters.

I’m flying to SF tomorrow. I’ve got a sublet in SF16 until May 9. I am looking for a sublet in a communal house in Berkeley starting May 1617. I hear the East Bay and SF proper are pretty socially disconnected and have very different cultures. I want to live in both to really understand what’s going on. See with clear eyes, not let a single groupthink win me over.

The hope is that by the end of June I will have a much better model and be relatively certain in picking a specialization, in picking what work I think I can do to maximally decrease the odds AI goes oh so poorly for all of us.

I love people, I love talking to people, if you think talking to me would be fun18 please reach out!

Why is InkSF in your title?

I like writing. It is helpful to clarify my thoughts. It helps people I love stay connected with me. It might even be providing a social good. Inkhaven is starting today in Berkeley. I am moving to SF tomorrow. I think there are good odds19 that forcing myself to write every day I am in SF is net worth it. There are many things other than writing that are worth my time, and my guess is I will not in fact go all 30 days. But, it seems clearly worth trying. So, this is me committing to 25% odds that I write and post at least 500 words every day in April.

Defined as: “Some companies are trying to build superhuman AI that would be far smarter than any human at nearly everything”

preliminary results, n = 95. Larger post about this incoming this month.

Or at the very least not leading to terrible outcomes for all humans.

https://www.slatestarcodexabridged.com/Meditations-On-Moloch

Numbers mostly made up on vibes.

AIs currently do this, and yes, it makes mech interp way more annoying. But, it would be so so much worse if AIs thought in some efficient machine gobbledygook that we have no way of interpreting. It’s pretty likely that English doesn’t purely by chance end up being the most token efficient way of thinking, so it’s not hard to imagine a world where competitive pressure leads to labs slowly moving towards their AIs thinking in some totally foreign “language”. My guess is this is the kind of regulation that labs would feel quite good about; they probably do like being able to understand the thinking their LLMs do.

For example I do not have a PhD so it seems somewhat unlikely that theoretical technical research will be the place where I have the most individual impact, but maybe I think ARC style theoretical research is really valuable and my time would be best spent trying to convince more PhD grads to try working on that problem. My model is basically all technical safety organizations are way more talent constrained than funding constrained.

I, for example, probably wouldn’t be here today if it weren’t for both Zvi’s posting and a friend exasperatedly asking me why the fuck I’m not working on the single problem I think is by far the most important for humanity to solve.

No tokenization, characters only. We’ll see if I end up thinking tokenization is worth getting into.

Played with in code of course

As far as I can tell, commercially serving models means that the weights just have to be sitting on the servers that are serving the given model. That’s a ginormous attack surface.

The single largest funder in the AI safety ecosystem

Is this better than the counterfactual? It really depends on how much that board seat changed the odds OpenAI went bad. We are living in a timeline where this bet seems like it didn’t pay off, but that doesn’t mean it wasn’t necessarily worth it in EV. Holden seems to believe this was on net worth it and I should probably put real weight on that given he actually knows how much impact the board seat got.

If there were no eval orgs there would likely be more pressure on labs to really ensure their models are safe. I certainly think it’s a better equilibrium for people to believe it is on the lab to prove their model is safe than for it to be on external orgs to be able to show a model is unsafe. However it’s still really hard to compare counterfactuals here. I think my light belief is eval orgs are probably on net bad, especially considering the opportunity cost of what the people working there could be doing instead. This worry is also partially bolstered by the niggling incentive worry where most eval orgs are staffed by people who leave the big labs and they get significant compute credits as well as prestige for working with (and getting to say “partners with ___”) a given big lab. The incentives on eval orgs are tricky, so everything they do seems like it should be treated with a little more suspicion. On the other hand, Coefficient Giving absolutely funds eval orgs but doesn’t fund any pause/stop AI orgs - they seemingly abruptly stopped funding MIRI after it pivoted to pause advocacy. So, Coefficient Giving seems to think evals are net worth it but pause isn’t net worth it. How much should I update on this?

Basically every major lab was founded ostensibly because they thought the existing labs were doing a bad job and their new lab would be much better at creating safe good AI.

The Sandwich

The astute among you might have noticed May 9 =/= May 16th. I am going to Sleepawake in between. I am my impact, but also attunement and connection and being with other people is a large part of what makes life worth living for me. This is crucial for a life of joy, but also I am way less impactful if I personally am burned out and sad and lonely.

or interesting, or truthful, or would help lower the odds of catastrophe, etc

Probably above 20% and below 50%

Meat

Cormac — Fri, 20 Mar 2026 04:16:44 GMT

Another non-AI post1. I am moving to the bay soon. It seems like if I am both not vegan and haven’t thought about the impact of eating meat, then I will likely be incurring a social cost. This is kind of annoying2, but also kind of reasonable3. I’ve been slightly bumping into this sideways in a couple different ways and my opening belief was there’s a non-trivial chance a lot of industrially farmed meat is just evil4, and therefore eating it is evil. So, I decided to just spend a couple hours really thinking about this.

Seems like for every industrially farmed meat, the animals’ life is clearly a life of net suffering. The only standard cheap grocery store meat where it’s even close to net positive is a beef cow, and even then it still seems like a life of net suffering. The life of an industrially farmed chicken is just cursed, it can’t move and it’s breaking under its own weight. Organic does jack shit to the calculus here.

It does seem pretty clear that any time I eat industrially farmed meat I am eating something that was effectively tortured5. For any milk I can buy in a store and almost all eggs in stores, the animals producing it are (that’s right) being tortured. That’s gnarly. There’s something here where I’m getting a very visceral feeling that eating factory farmed meats (especially chicken and pig) is just evil.

I’ve approximately known the conditions of how chickens are farmed for many years. But, I never sat down and considered how much to care about the experience of the animal I was eating. I obviously knew it was terrible, that’s not new information to me. I’ve mostly only thought about human welfare. But, is it evil to torture a being that isn’t human? A being that can’t talk? A being that doesn’t have right or wrong? A being that isn’t a moral agent? After sitting with it for a couple hours it feels pretty clear6 that torturing something with feelings and pain and attachment and community is in fact evil.

Why didn’t I bother to think about this before? It’s really annoying to have big restrictions on what I can eat. Food is such a huge part of community and culture. I like being able to eat so many different delicious things from so many different places. A friend can make a meal that they are proud of crafting, or that brings them back to their childhood, or that reminds them of someone. And by golly I want to share in that experience with them! I want to revel in all the joy there is to be a human. The idea of being picky is so abhorrent to me, it feels like slicing away parts of the core human experience.

But now I’m mostly convinced that eating industrially factory farmed meat is me participating in something evil. At least now I can go in with my eyes wide open. If a friend makes their favorite recipe from their dead grandmother I don’t have to say no just because it has chicken in it, now I get to decide. I can eat a little bit. I can consider offsetting. Or maybe I do in fact say no, that participating in something evil isn’t worth it even if I can offset the expected suffering I inflict by eating that thing.

Also, our culture just so clearly doesn’t care. If this were evil wouldn’t our culture care more? I am certainly no stranger to believing something at least a little out there, but I have extremely low natural care/empathy for animals. As a child I was outvoted 3-1 on whether we should get a dog7. I don’t get excited looking at puppies or kittens. Animals are annoying! They’re stupid, shit anywhere, can’t talk, trick you into thinking it’s okay to not have community with fellow humans, and either live terrible miserable lives in the wild or mind numbingly boring lives stuck inside a box8 in your house. So like, shouldn’t all the people who seemingly care way more about animals than me be worried about this whole animal torture thing?

We’ll see. I have not actually taken the hit of what is life like without eating (most) animals. I am not suddenly not eating 80% of the food that the 5 people in my house co-create. There is some dissonance of like if it’s so evil why am I letting the fact that it would be quite disruptive get in the way and instead waiting for when I move in under 2 weeks. I only have so many time/energy/etc points right now and it will be significantly easier to try when I am not in the current communal house eating equilibrium. Still it feels weird to be this nonchalant about timing for what is effectively torture, hard to feel the torture when everyone is so chill about it. And there’s certainly a feeling of like I can toss 1k to an animal welfare charity if it’s important for me to feel like I am net doing good here.

Oh man that last sentence feels a little weird. For many things I think offsetting or some sort of accounting of total harm/good is very reasonable. Basically everyone who does anything AI safety uses AI (usually a lot) which slightly contributes to the economic success of the labs that are creating the very same AIs. That feels totally fine, there’s a balance of like how much does that economic support9 impact the lab and therefore their ability to scale to larger models vs. the value of the tool. Something about actively inflicting suffering on living beings feels categorically different, even if I give money s.t. my net impact on animals is less suffering, there is still this feeling that I have done something wrong niggling away if I have personally contributed to what is effectively the torture of animals. I eat a lot of chickens per year. Obviously this is something for me to spend some more time with, so far I have spent the barest speck from the sands of time on this10. And yet at this moment it feels very alive and worth publicly committing to11.

Okay, so most animals I interact with as part of what I eat are living cursed lives. That doesn’t mean all animal products I eat are contributing to net suffering of living beings! In fact it seems like it is perhaps good to eat some animals. If my consumption leads to more animals living net happy lives (that then end with me eating them), that seems better than if I ate something that didn’t contribute to more net happy animal lives being lived. Here is my preliminary list of things that seem good to eat, I am mostly focusing on the most tractable/most likely ones to actually come up and be relevant. Unless otherwise stated I am talking about buying in the normal ways one buys things12. This is approximately sorted by my confidence that eating this isn’t contributing to an animal living a life of suffering.

Oyster, clams, mussels
1. They don’t feel, they’re just chillin. Also for the most part they are good for water quality
Bison
1. The process of raising bison is very not industrialized, they spend the vast majority of their lives grazing, don’t get separated from their parents, and have very few painful things done to them. Many of them are feed lot finished with grain in the last couple months but that process is both short enough that it doesn’t really have time to cause pain or disease (in the way it does for grain fed cows) from that unnatural diet, and they are given way more space (due to being super aggressive and hard to deal with when they get too cooped up). They also live the longest lives of basically any animal raised by humans for eating.
100% grass-fed/finished beef (AGA, PCO, AGW, GAP 4+)13
1. Feed lot cows live a pretty gnarly life while they are on the feed lot, but cows that spend 100% of their life grazing on grass live kind of the ideal cow lifestyle. They do have a couple painful medical procedures but not really any that lead to chronic pain. A 100% grass fed cow basically just hangs with its parents and eats grass and walks around until it is slaughtered early. They don’t really develop any of the gnarly internal problems from digestion issues that feed lot cows can develop, because grass fed cows are eating what cows are designed to eat.
Wild salmon
1. This one is interesting, it seems like the life of a wild salmon is probably net positive (although with pretty wide error bars), but even if it wasn’t it’s not clear to me that it would be bad to eat wild salmon. The natural cause of death for salmon is super gnarly14. The process of dying from a human killing you seems way better than the natural death process for salmon, and also better than say being eaten alive by a predator.
New Zealand lamb
1. For those curious why I am including something as seemingly hard to get as New Zealand sourced lamb, in fact NZ is a huge exporter of lamb and I have bought it at Stop & Shop before. NZ has some of the best animal welfare laws (at least when it comes to lamb). Basically the entire calculus around whether NZ lambs live net happy lives is about how bad “tail docking” is. Lambs raised for meat live for about 7.5 months before they are killed. Tail docking is the process of removing the majority of the tail at around month 1 of life. The evidence is unclear but it seems like tail docking can be anything from pain in the moment and not much after, to potentially 3+ months of somewhat chronic pain. In the worst case that’s about half of the lamb’s life with somewhat chronic pain. Other than that NZ lambs basically just graze and hang out and never have to worry about predators. I am relatively confident that this is not enough pain to be a net negative life. If I was given the option of either not living at all, or living a life where I get to do what I was designed to enjoy and for at most half of my life I have slight chronic pain I would take that over not living. I would be killed in a somewhat traumatizing way, but it would be quite fast and I wouldn’t know it was coming. This is the most marginal one, but it still seems comfortably a net happy life.
Local15 eggs (pasture raised), beef, lamb, pork, chicken, contingent on some questions16
1. This is of course super case by case, but smaller farms that aren’t industrialized or working at scale are probably net positive for the animal. If they get to do what they naturally want to do without being overcrowded or worrying about predators that’s probably a net positive life.

I find it kind of surprising that I haven’t heard anyone talk about eating animals as a clear moral good. For every animal above I would feel actively good about eating it, that animal enjoyed its life and it got to have that life because I (and others) ate it. There is of course the argument that if demand suddenly went up 500% for any one of those animals then they would likely industrialize17 and then the animals would have net negative lived lives. But certainly on a personal consumption level I think consuming any of those is (from an animal welfare perspective) net good for living beings - more living beings leading net happy lives.

To some extent this post today is just me getting super nerd sniped by an interesting topic! I am very curious how I will feel about this in a month!18

I’ve been really trying to set up good tools and procedures for long term AI safety productivity while my life is mostly stable. This means I have done less “actual work” which feels bad. Also, I have lots of things that are in the middle of being done. Also, I got totally nerd sniped by this after talking to someone who really made it clear to me how important this topic would be in the social spaces I am likely to be in while in the bay. It’s also just quite interesting. How do you figure through whether the life of an animal is net happy? How much should I care? Am I secretly putting human values on something not human? Is pain actually just universal?

groupthink boo

exerting costs on other people for doing something you think is abhorrent is good

Causing lots and lots of suffering

Other than the standard grain fed cow. I don’t think grain fed cows live a life of being tortured (still a net unpleasant life don’t get me wrong). Pigs and chickens though, that’s a life of being tortured.

In this moment at least, we’ll see how I feel in a couple days

I thought dogs were a waste of resources, imagine how many more humans we could have if there were no wasteful pets, and like what does a dog get us? Nothing!

Size of the box depends on size of the animal of course, here in NYC dogs get an apartment sized box!

Although in many cases given their current pricing people are net costing the company money. My individual customer relationship with Anthropic certainly has caused them to have less profit now than a year ago.

Four hours of talking to Claude, sitting in it, talking to Zack, writing this, etc… is in fact not that much time

Certainly trying for at least a month minimum

For example obviously if I raise my own chickens that is different from buying chickens in a store.

Claude says “organs deteriorate, immune systems collapse, and their tissues literally begin breaking down as all energy reserves are redirected toward reproduction. After spawning, the salmon’s body is essentially spent. Death follows within days to weeks from a combination of exhaustion, starvation, organ failure, fungal infections, and tissue degradation. It’s sometimes described as a kind of rapid, programmed aging.”

most likely farmers market

Chicken: "What breed?" Anything other than Cornish, (or) “How old are they at slaughter?”: 11+ weeks (Cornish will be 8 weeks). Pork: "Do they live outside?” Yes, “Do you nose-ring?" No. Lamb: "Do you dock tails?” No.

In the case of 100% grass fed beef that’s certainly a mature enough market s.t. if suddenly 5x as many people wanted to eat that instead of feed lot beef I think after the market settled in the new equilibrium those 100% grass fed beef would still be living net happy lives. Part of the reason I picked the others was because they are easier to distinguish due to being an entire category of animal as opposed to having to find the specific labels that correctly identify whether a given piece of cow was actually 100% grass fed. (And someone can always just lie and label their feed lot beef as 100% grass fed, which has happened at least one known time before)

As always, here are Claude’s thoughts on this piece

What Are My Values?

Cormac — Mon, 16 Mar 2026 16:50:22 GMT

This is not a condensed post with only my best final ideas, this post is me writing across multiple days1 as I try to work through a problem, enjoy.

I did something recently that I regret. I did something that I suspect hurt someone2. If I had asked myself in the moment whether the action I was about to take would hurt this person I would’ve been at least 30% certain that it would - but I didn’t consider it. If I had thought about it I would’ve realized that I don’t think being truthful to someone I don’t know that well is worth those odds of inflicting3 pain, and yes sure it’s my belief that the truth I shared is “long term helpful” for them to know, but I also believe that people should have agency over receiving this sort of thing.4

Something that is very important to me is truthfulness. I believe that truthfulness is to some extent a core foundational building block of most of what’s good in life. Without communal truthfulness we aren’t living in reality. I think I’ve been bucketing truthfulness as a terminal value, I’m starting to suspect that it’s not. I believe that my relationship to truthfulness has been making my life worse5 according to my values.6

So, what are my values?7 The first answer is something like two buckets:

Positive impact on the world
Connection and joy

When I look at that it feels like a weak answer. Like yes, the two things I am juggling in my utility function are how good do I feel and to what extent am I a net positive (or negative) on the world. There are of course questions about what is positive impact on the world and the answer is something approximately “human flourishing”8 shaped. But it feels trite to say that my values are the fact that I am simultaneously trying to optimize for myself and for doing good for humanity9.

The obvious10 first alive thread to pull on is the fact that I said connection, I could’ve just said joy - that’s interesting. (It is at this point that I read the Wikipedia page on values and talked to my roommate11 about this.) Paying attention to what feels alive seems like it will get me closer to what’s real. It doesn’t matter what I intellectually think the correct values are if those values aren’t what they actually are. If one side goal is to potentially shift my values, it feels hard to do that if I don’t know what they actually are. What do I care about? What is alive?

Truthfulness/being in reality
Playfulness (don’t forget to have fun!)
Efficiency (really getting the most out of this one life that I have)
Being a good/reliable trading partner (cooperation)
Trying (really trying)
Actually accomplish things that actually matter (avoid unimportant goals)
Execution/sticking to my word (both internally and externally)
Curiosity/openness
Sincerity/integrity
Goodness (have positive impact on the people around me)
Always be updating (make predictions and notice when they are right/wrong, internalize feedback)
Doing positive sum things for my people (even at cost to myself)
Say the thing, have the conflict (with the people I care about)
Be in the moment, feel feelings, pay attention

After banging out that list I then mulled for a while: talked to some dudes in a sauna, chatted with some homies, sat in “what are my values”. The list didn’t feel right. Too long. Not focused. Not helpful. It’s more a collection of things that I care about than necessarily values. I think there’s something about values being things that are actually helpful for me towards living the life that I want. The hope is that thinking or saying the value out loud helps me move towards the behavior that I want.

So, I ranked my values!

Before doing this ranking I barely thought about growth, I suspect because it’s so core that I didn’t even really think of it as a value. I am constantly trying to get better, to grow, and it’s not because I think it’s a virtue in and of itself to change myself, but because I know I can be better, I know I can be doing a better job of reaching my goals, of accomplishing what I want to accomplish. I know I can do a better job of living a life that feels alive and real and fulfilling and sincere.

Fun vs. happiness12. Both “meaningful work” and “success”13 are way higher than they would’ve been a couple years ago.14 The trifecta of Friendship/Love/Community at 3rd/4th/5th place on the values quiz wasn’t surprising but it really drove home how important connection is to me despite how much I struggle with it.15

What I considered and then later snipped:

I spent hours talking with Claude and workshopped a bunch of ideas, really felt into which ones felt real and which ones didn’t. Here are some that didn’t.

Efficiency (really getting the most out of this one life that I have)
1. Instrumental towards aliveness, but doesn’t feel crucial. I would rather know what matters and do it less efficiently than not know what matters and have lots of time for it because I was efficient. Feels more secondary, and also I’m already very practiced in this.
“Be a generous and dependable trading partner”
1. The sentence felt very me, but didn’t feel actionable - reading it doesn’t help me muddle through how to be better.
“be in reality, be good” → “be in reality, be good to reality” → separating them
1. They didn’t feel connected, and then I tried connecting them but it didn’t feel right. I care about being good not being good to reality (whatever that even means16).
“Don’t wait don’t avoid, face things head on”
1. Didn’t feel alive, didn’t feel like facing things head on was something I needed a value to help me do or was super duper crucial given it was gestured at in other values.
“Grow, be sincere, really try”
1. Felt too generic not actionable, kind of throwing a bunch of ideas that feel good into one value. I kept the sincerity aspect in another value in a way that feels much more present/evocative (“Hold every word accountable”).

A final list I felt good about:

Find what actually matters, then really fucking try
Practice deep mutual connection, it’s a practice
Grow, update, reckon with how I fall short
Hold every word accountable
Be present, be playful, pay attention
Be in reality
Be good

But then, I did open circle.17 I talked about my relationship to integrity, to correctly modeling myself, to ensuring that the things coming out of my mouth, my thoughts, my commitments are actually what I believe. That what I put into the world tracks reality. And wow, the response was not what I expected. I know I am relatively good at this18 but the very visceral response from the people in my life was that I would be spending skill points19 in the wrong place.

Their core claim was that I should be trying to improve in other places. “Instead of going from 95% truthful to 98% truthful you should go from 10% cleanup to 80% cleanup”. They claimed that they had literally never seen my lack of truthfulness or ability to model myself as an issue and that by focusing on that I was trying to never fall instead of getting better at falling. I’m not great at resolving situations in which I have hurt someone (“cleanup”). I struggle to deal with me having harmed someone, especially if I hurt someone taking an action that fits within one of my values. This is something that I am working on, but it was really really interesting to see multiple people who are very close to me agree that this is the place where they would like to see me grow. It makes sense that the thing that I am bad at is the thing I did a bad job of representing in my values. Deep mutual connection is important to me, being good is important to me - so I would like a set of values that does a good job of moving me towards what I truly want.

This idea of proactively repairing the emotional cuts I inflict is something I knew was important to work on, and yet at this state the values list doesn’t really hit this in a satisfying way. “Be good” is adjacent but it’s too vague to actually remind me in the moment to take the hard conversation. “Don’t wait don’t avoid, face things head on” is closer but still is only adjacent. Not avoiding not waiting is important here, but it doesn’t really cover that I want to be way more responsible way more attuned to the ways in which I cause harm. I want to be a fucking force of nature, but I can’t unless I actively respond with care when I inevitably get something wrong and hurt someone. “Reckon with how I fall short” is close but it’s more internal it’s about improving for next time.

Even after that feedback I still think internal integrity and truthfulness is important enough to keep a place on the values. It feels so unbelievably crucial to me. But, I definitely need to add something to remind myself to actively go out there and mend the cuts I have inflicted on other people!

Okay I think I’m actually done:

Find what actually matters, then really fucking try
1. I think it can be easy to forget what the actual goal is when there are lots of intermediate instrumental goals. And this is a reminder to really pay attention to what matters and what will have impact. The easier part of it is the really trying, I have never had trouble with executing but nevertheless really trying is important. It’s not about looking like I’m trying, it’s not about internally believing that I’m trying, it’s about really actually just sprinting at the thing.
Practice deep mutual connection, it’s a practice
1. The most important parts here are “mutual” and “practice”. I have a much easier time putting myself out there than letting people in. The mutual is trying to remind myself of that. The practice is like yeah I’m not great at this but that’s not my terminal state! This is definitely a continued growth edge!
Grow, update, reckon with how I fall short
1. Reckon with how I fall short feels slightly clunky, I wish I could find a slightly more succinct way to get at really staring deeply at my mistakes not flinching. Growing and updating don’t feel as helpful to remind myself of, but they are so core that I think it’s worth including.
Hold every word accountable
1. This one is great. Hard. But, so important to me. Really pay attention to what I am saying/thinking. Do I actually believe it?
Be present, be playful, pay attention
1. Solid, feels slightly generic, playful is the part that feels the most me. What feels strongest about it is ease of saying/remembering it in the moment.
I am my impact, actively repair what I break
1. This feels slightly weak in that it’s sort of two things at once. I think really trying is important, but so is actually accomplishing goals. Output matters. Impact is standing in for both my impact on the real world, and my impact on people - which is what the second part is really pointing at and what I need the most reminder on.
2. This is the only value that is a claim. I am not sure I think “I am my impact” is 100% true. I’ve been sitting in it and it certainly resonates. I was thinking of a world where everyone lies and telling the truth is met with ostracization, but even in that world I think with my current value system if I just told the truth and accomplished nothing I wouldn’t be living to my values. Slowly trying to shift towards truth in smarter ways would be more in line with my values. I think taking principled stands in ways that are wildly ineffective is bad. I do think impact is more important than motivation or my internal story. Impact is the terminal fucking goal.

I never really defined what the goal of thinking through and writing down my values was. Mostly because I really didn’t have a crisp goal in mind. So what makes a good value? What informed my decisions when thinking through this? I think there’s a couple things going on. There’s what really feels alive as a descriptor of where I am at. There’s what feels alive when I think about who I want to be. There’s what is actually helpful, I want to use these values as a tool so they should be helpful! When I think about or say a value aloud it should be clarifying. This was the core weakness of “be in reality” and “be good” they are too vague they don’t feel clarifying. Saying “be good” to myself doesn’t actually help me be better, it isn’t a helpful signpost, it feels like looking up at the sky and hoping it won’t rain.

What’s Next?

I don’t think I’m done. Even though I have absolutely plowed hours into this endeavor20 my values will certainly be changing and I am pretty sure I will decide I don’t like something about at least one of these. My current plan is to put together a super simple app that randomly chooses one of these each day for me to focus on. If you’re my friend you should definitely tell me if you think I’m not acting in line with my values, feel free to do it aggressively, you are giving me a gift.
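For what it’s worth, here is roughly what I imagine the first version of that app looking like. This is a minimal sketch, not something I’ve built: the function name `value_for_day` is made up for illustration, and seeding the random choice with the date (so the pick stays the same all day) is an assumption about how I’d want it to behave.

```python
import datetime
import random

# The final values list from this post.
VALUES = [
    "Find what actually matters, then really fucking try",
    "Practice deep mutual connection, it’s a practice",
    "Grow, update, reckon with how I fall short",
    "Hold every word accountable",
    "Be present, be playful, pay attention",
    "I am my impact, actively repair what I break",
]


def value_for_day(day: datetime.date) -> str:
    """Pick one value for the given date (hypothetical helper)."""
    # Assumption: seed with the date so the pick is stable within a day
    # but varies pseudo-randomly from one day to the next.
    rng = random.Random(day.toordinal())
    return rng.choice(VALUES)


if __name__ == "__main__":
    print(value_for_day(datetime.date.today()))
```

Running it prints a single value for today. Because the choice is random per day, the same value can come up on consecutive days; a shuffled rotation would avoid that if it turns out to be annoying.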

Oh, and I’ve got at least 3 people I owe an apology/conversation to. I’m going to do that.

Much longer than the average post which I bang out in one sitting.

As of a couple days after I wrote this sentence I have now gotten a large update that the person wasn’t hurt by my action, but I don’t think this changes my takeaway. I still think it was a net bad choice and I regret it even if I got away with it this time.

Yes it’s their response to be hurt by receiving my truthful accounting, but that doesn’t mean I am in the clear for doing something I knew would have real odds of a negative response.

For example. If someone is already feeling really bad, it’s probably not helpful to tell them in that moment another way I think they are fucking up their life even if I think it is helpful info for them to have long term. I think it is good to give people agency over whether they want to hear the hard thing.

It’s not to say that my relationship to truthfulness is bad, I think there are many many many other ways I could be which are way worse. But, I also think it’s clear that it could be better and I would like to live in the world where I am behaving in ways that are more in line with what I want, who I think I am, what my values actually are, etc.

Thanks to my friend who really pushed me on what I actually care about when it comes to truthfulness. I really appreciate it!

Why have I never really explicitly thought through what my values are before? That seems like a mistake. I have thought about what I care about, what would feel purposeful, but those are slightly different questions. I don’t love that I have never, or at least can’t remember, explicitly doing this.

Which is also sort of leaky because I don’t care zero about other moral agents, but also when I consider questions like the extinction of a given lifeform vs 1 human life it’s noticeable that one of my first considerations is how would that extinction impact the ecosystem and therefore humanity’s future on this planet. There’s also more confusion in there because I’m not convinced that most animals live net happy lives, and part of caring about flourishing of moral agents is not just the number that are alive, but how good is the life. This would imply that I should care more about cost effective ways to improve how good the lives are of animals which feeling into it in the moment feels more impactful than just animal lives. But, it’s still way less than how good the lives are of humans.

And here the word humanity is confusing because I both do believe in doing good in effective ways for people that are very not in my life, and also I care more about the people who are in my life on some sort of grounds that a life well lived involves a community a tribe of people that care about each other and that it’s very important not to lose track of that. But maybe that’s an impact claim, I and the people around me will do more good if we all feel safe and taken care of. I guess this is a claim about giving away a smaller percent of a much larger pie. (both in physical resources but also emotional capacity etc)

Obvious to me at least, I’ve been told many times I will start a sentence like this and the thing I say is obvious is not obvious to others. I’m not sure if this is a habit worth changing. Obviously my way of thinking is in fact quite unique to me and I don’t at all believe the places my brain goes are representative.

He is currently in the process of trying to codify his values in an effort to better model whether he is acting in integrity with himself which feels like another good reason to actually know what my values are.

Fun feels active, playful a way of interacting with the world. Happiness feels more like a state more hedonistic.

Actually trying is important, but at the end of the day what matters is the actual impact: what actually was done and whether I succeeded is really important. Did I actually do something?

To some extent my history with NYC is me thinking purposeful work isn’t that important and deciding to optimize on making money at work and really working on community in my extensive free time. I believed that simply being a local community figure who makes the lives of the people I come into contact with consistently better, that this would be sufficient for a good life. Within the last couple years two things hit me. The first that simply stellar community is not nearly enough purpose or impact for me. The second is that even if I am great at bringing people together and curating community, that still doesn’t make me good at personally connecting with other individual souls. Comfy community co-living is a different skill from truly connecting with someone.

It’s hard to tell to what extent I am actually picky or if it’s just a skill issue I could fix or it’s just genuinely a hard problem. I do want us to feel so comfy and good and connected. And I have some of that! Even when in conflict with my housemates I still feel connected and good about them, but I struggle with the next level (whatever that means).

“Be good to reality” doesn’t feel real. It feels like something written on a poster, not something that evokes any core experiences of moving through this reality we find ourselves in.

A minority of the other people had ever spent time thinking through or solidifying their values, which was very interesting to hear.

Certainly compared to the average person in my life/community

My time and energy

At the cost of AI work, which I feel slightly bad about, but I think doing this was correct both because endeavoring to improve oneself is basically never a waste of time and because there’s a whole branch of AI safety work that is about the philosophy of what makes a good LLM and how you instill human values. Thinking about my values is very adjacent to thinking about what are generically good values. Somewhat of a sidenote: I am slightly down on the concept of things like Claude’s constitution because my model is that instantiating Claude with the full constitution vs. something like “be good” leads to basically the same outcomes. But there are many ways to instill human values into a model other than instantiating the chat with them, notably (also from Anthropic) Constitutional AI, where they had the model post-train itself (which ofc has its own issues) to improve outputs based on a set of values, i.e. it looks at its own response, applies a set of values to it to determine how the response could be better, and then the model is trained on those improved responses. Idk, this is sort of rambly, but the claim is that even if I am slightly suspicious, I think there is value in thinking through what values are and what the options are for how to instill them into a model.

Moving?

Mon, 09 Mar 2026 17:59:08 GMT

“You should live where you live”. “NYC is the best city in the world”. “There are increasing returns to every year living in the same place”. I’ve said these things many times to many people. Throughout my years living in NYC I’ve talked and thought about the odds I would still be in NYC 5 years from now (in the given moment). The answer has basically always been above 60% (I think the lowest it has ever been is 50%).

I’m currently strongly considering living in Berkeley for 1-3 months. Even just sitting here trying to write the next sentence after that feels awful. I’ve got this big tangled knot in me. I can feel it so viscerally. Twisting and turning. There’s so many different aspects, and I’m struggling to sort through.

I am feeling the best I ever have about doing meaningful purposeful work with real impact. It seems like if I want to continue to do my absolute best, being in the bay is a requirement. It is the place where AI and AI safety are happening.
I am feeling the worst I ever have about me fitting into the future of the communal house I put so much effort into making happen. I’m feeling dropped, and I am relatively certain there is a way through this. I think if I stayed and prioritized this I think I could feel very good about it, I think the house could return to harmony. [3/19 edit: I am feeling way better about the house, turns out when you have great communication norms and care about each other you can just work through hard things, who knew1?!]
I am feeling really unexcited about people right now. I was at a dinner party this weekend that I was really really excited to attend, and then I got there and within an hour I had approximately decided that I wasn’t excited about any of the 10 or so people. There was such a visceral sense of like "you all just talk about your clearly oft-repeated college and high school stories with maybe some offhanded discussion of your boring corporate job." There were 3-4 people that were new and not ingroup and they just got kind of abandoned to talk to themselves; none of the group of longtime friends tried to bring them in. None of the group conversations tried to broach anything hard or new or interesting, it all felt so cached2. I had one or two 1:1 conversations that I enjoyed, but I came away wishing so badly that I had more curiosity. More innate burning passion to want to know these other souls. But I couldn’t do it. I used to be so so excited about people. New people. All people. Everyone felt like a flower I could unfold and find something fascinating. My desire for this was so great I was down to have 30+ people in my house every other Saturday - cooking for all of them3! Somewhere in the intervening 5 years since I started that event I got burned, worn out, resentful. And it’s really hard for me to tell how much this is a correct update vs. some sort of internal emotional blockage4. I put myself out there, really bared myself to the world, was my authentic truthful self to the many hundreds of people who came through my house - really truthfully fully wishing to both facilitate and personally find connection. I’m picky. Or at least that’s been my reality. The base rate of someone I am really genuinely excited to be in connection with is so low. I really really really wish this weren’t the case5. I don’t hide this. I will absolutely be excited for someone to be at my event or be excited to be having a fun/engaging conversation with someone, etc. 
I will not pretend to be personally excited about forming a 1:1 connection with them. Where am I going with this? It feels bad to get EV less than 1 invite back for each 30 people. It feels bad to feel some connection and then have it fizzle and die by the 3rd hang. Over and over. And it’s not to say I got nothing out of this. My wonderful communal home probably wouldn’t exist otherwise. Many of these people I feel very close and comfy and intimate with. I enjoy being with them on couches, at dinner, being part of a tribe. Sharing our lives with each other feels good. But even still for some of them I’m really uncertain whether I am genuinely excited at spending purposeful 1:1 intense conversations with them. Probably the reason I love communal living so much is I can have that genuine connection, the care, the closeness, the “I care about this person”, without having to personally be super excited at the prospect of an intense6 conversation with them7. This is probably why I’m having a hard time in the repair process with some of my homies, whom I really do care about, but I just want to like be in connection in our home. Ease back into comfy care. I want us to be vulnerable and excited to be around each other, like just do the thing that has always felt good and real you know? We don’t need to do some other realm of scheduling 1:1s! This paragraph is huge, what’s the point…
I have a hypothesis that as much as NYC is great for so many reasons that I have talked about ad nauseam, maybe a different city would have a much higher base rate of the kind of people I want to bare my soul to. NYC is filled with really smart really driven people. People making shit happen. I love talking to smart driven people. The problem is it’s the city where people go to make money. Money is not a terminal goal, yet so many people here see it that way8. Don’t get me wrong purpose is really hard, I’ve been flailing at it for over a year now. But, maybe a city with a different core ethos would be good for me. I have two friends9 who moved from NYC to Seattle who are constantly talking about how much better their lives and social fabric became after moving. I was very skeptical of confounding factors10, but at this point I think I’m convinced it’s at least worth testing for at least a month. Berkeley clearly has a different ethos.
It feels much easier to tackle lots of medium/small problems after making a huge shift (totally random example: moving) where everything is naturally disrupted. Tackling lots of small/med problems from a stable state is so much harder because the natural existing equilibrium for all of these problems is that they are in the state of being a problem.
Am I just running away from my problems?11
Fuck

I was really hoping writing all of this out would help me work through some of this, maybe feel like I have a path towards how to decide. It’s so hard to weigh all of these things that are coming from so many different places - from very intellectual to very emotional. I think if anything I am generally very biased towards taking the action. Doing the thing has never been my problem.

I think the best possible world would look something like actually it turns out I’m not picky. The base rate of people I like can be higher and it’s just internal. In this world I still get to consider Berkeley on the merits of AI safety without also trying it as a treatment for dissatisfaction in my excitement for connection. I don’t have great odds on this best possible world being true, but I see other people being so much less picky so open to connection with people and I am jealous. So, if you have thoughts12 in that direction I’d love to hear them.13

😝

Hi Emma, thanks for subscribing! Sorry the first post in your inbox is me kinda trashing your party. I do think a lot (but definitely not all) of this is about me. I would love to chat about your event some time if you’re down.

Yes, there were many other motivations. But the single largest selfish reason was my desire to meet and connect with others.

ofc these are not necessarily at odds, in reality it’s probably some combination of both.

And I’m still not fully convinced this is innate and unchangeable. I think there could definitely be worlds in which something shifts inside of me and the range of people I find exciting and energizing to connect with grows substantially.

Intense doesn’t necessarily mean serious. If anything I think too serious is bad. I think it’s something about really being in it, but also there’s more than that because I can have conversations with someone else and they are clearly in it and it still doesn’t hit. Something about wavelength? Frame? idk

Also, is intense conversation what matters? Like idk there’s this frame that I need to enjoy and be impressed by someone’s thoughts in order to really want a deep friendship shaped connection, and I think this is at least slightly wrong since there are clearly people that I care about for which this isn’t true. But also this is what I feel the most desire towards when it comes to connecting with people. There’s also something in there about playfulness but I’m even picky when it comes to playfulness! It’s fucked, I wish this weren’t the case. It’s much harder (but definitely not impossible) for me to be playful when I think I’m smarter than someone (and like here smarter is standing in for something that is about brain speed or wit.)

Or at least this frame is pervasive and insidious and I think it feels like in NYC it really just sneaks its way in, whether from the artists who still have to pay NYC rent or the smart elites that went to elite schools only to spend 20 years of their life playing a comparative status competition with their peers.

Hi Katie, hi Maddie. I don’t like calling while you are cooking, it feels unconnective when you’re clearly distracted and it can feel like I am performing juicing my life for you to intake at 50% of your total attention. I will mention this/talk about solving it next time we call.

Obviously your life will improve if you go from working a grueling unenjoyable 80+ hours a week to working 40-

There’s so much to say about this. Do I think I will manage to stay close with each and every person in my house with whom I feel genuine connection? The odds seem bad, there will definitely be costs - relationships that end up seriously harmed. That feels really bad. I am so bone tired of solving these problems. And like idk, I do generally love solving problems - so why am I so weary? I think there’s something about how problems feel better when the payoff feels worth it. It feels like [back from my cry break] it shouldn’t be this hard. I put in so much work to get all of these people together in this wonderful comfy community we have, convincing, finagling, helping settle in, making comfy, building processes from the ground up, financial risk, all of the parties hosted that led to this just so we could actually end up finding each other. And now that you’re not certain whether we should be friends, I have to do more work to fix it? That’s a terrible strategy in a repeated game. You’re defecting and then I have to take on the cost? But (of course) I don’t feel sad/awful/angry/resentful for intellectual reasons. I want to be taken care of. I want to be held. I don’t want to have to ask for it, this is a community it’s about putting in the hard small effortful things to make the lives of others better. Of course that’s my own advice. I should be putting in the small effortful things to make all of our lives better. Maybe I just want to reap the rewards of all of this effort at some point? But am I complaining about my toast being too buttery? I do in fact live in a community that (other than this current conflict) feels wonderful - my baseline is so high. Maybe that’s why it feels so bad to be suddenly told by a subset that they don’t particularly feel excited about being part of my life. What’s the difference between choosing to not solve a problem because I’m giving up vs. choosing to not solve a problem because it’s not worth solving? 
Knowing when to give up is genuinely hard, it certainly was agonizing with the end of my longest relationship - and it’s tough right now.

Or thoughts about anything else, etc etc. As always feel free to take anything from this and make it a conversation. All these threads are just waiting to be pulled on!

Claude’s thoughts as always

AI Beliefs Out In The "Normal" World

Cormac — Fri, 06 Mar 2026 21:17:42 GMT

In the 48 hours since I released1 my public database of AI Safety papers2, I got over 1000 unique views to my site. That was really cool. Kind of astonishing how distracting “number go up” is. I was probably checking my stats at least once an hour. I have basically never been into social media (etc), so I haven’t built up my defense mechanisms to this distraction. I ran an extremely successful dinner party series for 4 years, and in 48 hours more people have looked at my website than the sum total of all people that came through my home. The scale of the internet is kind of astonishing.

So, the logical next thing to do is something that continues to take advantage of the internet! Nope. The weather in NYC is about to be quite nice… So I am going door-to-door surveying a bunch of people in the 3 days of nice weather we have. Most people I talk to or interact with regarding AI are in a pretty tight bubble: very educated, very tech literate, well paid, some might even say “technocratic elites”. There have been surveys done about AI existential risk3, but at the end of the day there’s something that hits very different about seeing stats on a page vs. personally going out and actually talking to people and hearing their unfiltered opinions.

My biggest goal right now is to actually figure out of all the various work options at my disposal which one I should start moving towards, so to some extent this feels like I’m slightly going on a not worth it tangent. It’s basically been half a day of ideating, half a day of coding, and then likely ~36 hours of executing4. I do think to some extent it will feel good/helpful to have a way better internalized model of what the average person thinks about this, but I am still using 1% of my year on this.

Really, the problem5 is that this is a super cool mini problem. There are a bunch of subproblems and every single one of them is interesting fun and novel!

Survey design! How can I squeeze the most survey signal out of the shortest possible survey? To maximize uptake, I want to be able to truthfully tell people “hey, this will only take 1 minute”. How do I order questions to minimize bias? How can I hide my personal bias? How can the wording of each question be tweaked for maximum understandability?
How can I find a collection of neighborhoods and in what ratios to get the most representative sample? But also, maybe I want to eliminate some neighborhoods as options: some neighborhoods only have big doorman buildings which are way harder to canvass than a house with a front door, some neighborhoods are a total pain for me to get to, etc. I built a site that lets anyone select which neighborhood options to include, how much to weight various types of diversity, how many total neighborhoods to go to and then it calculates the most representative basket of neighborhoods6!
How can I have the smoothest in the field data submission flow? Build a bespoke app that allows for super simple one click answers to each question. How about demographics? You don’t want to have to ask the respondent most of these since it’s a waste of time and likely puts them on edge7. But, it’s really easy to collect the phone GPS location each time a response is submitted. Using the GPS location I can then make a strong Bayesian update on the expected demographic data based on the neighborhood I’m in. Gender and ethnicity8 you can get the surveyor to eyeball. The only (optional) demographic question I ask in the entire survey is age. What if I lose connection? Turns out I can set up the survey page to be easily downloadable as an app that shows up on the homescreen, with local memory and syncs submissions back to the DB once connection is re-established!
How to maximize the percent of people who say yes to my survey pitch. This is the one I haven’t tried yet, but there’s tons of interesting psych dynamics at play that I will get to try out in person iteratively. Is it better if I make it personal “Hi, I’m Cormac — I’m doing an independent survey on what people actually think about AI” vs. trying to get a micro-yes first “Hi, I’m Cormac — quick question: do you have any opinion on AI?”. Body language, cadence, tone, how I respond to various things they say, there’s so many aspects! I bet the best execution depends on reading the person’s vibe and picking the best strategy for the given person. This is genuinely a hard problem that I am thinking about in advance and will involve lots of iteration throughout the process. I have never canvassed before, I’ve heard it can be absolutely brutal - I’m curious how I will react to this thing that I hear can be very draining9.
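For the curious, here is a rough sketch of what a greedy "most representative basket" picker could look like. To be clear, this is not the actual site code: the neighborhood names, the demographic shares, the target mix, and the function names are all made up for illustration, and the distance measure (sum of absolute differences) is just one simple assumption about how to score "representative".

```python
from typing import Dict, Iterable, List

# Hypothetical demographic shares per neighborhood (invented numbers, not real data).
NEIGHBORHOODS: Dict[str, Dict[str, float]] = {
    "A": {"white": 0.70, "black": 0.10, "other": 0.20},
    "B": {"white": 0.20, "black": 0.60, "other": 0.20},
    "C": {"white": 0.30, "black": 0.10, "other": 0.60},
    "D": {"white": 0.50, "black": 0.30, "other": 0.20},
}
# Hypothetical city-wide mix the basket should approximate.
CITY_TARGET: Dict[str, float] = {"white": 0.40, "black": 0.30, "other": 0.30}


def basket_mix(names: List[str], data: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average the shares of the chosen neighborhoods (equal weight each)."""
    groups = next(iter(data.values())).keys()
    return {g: sum(data[n][g] for n in names) / len(names) for g in groups}


def distance(mix: Dict[str, float], target: Dict[str, float]) -> float:
    """Sum of absolute differences between a basket's mix and the target."""
    return sum(abs(mix[g] - target[g]) for g in target)


def greedy_basket(
    data: Dict[str, Dict[str, float]],
    target: Dict[str, float],
    k: int,
    excluded: Iterable[str] = (),
) -> List[str]:
    """Pick k neighborhoods one at a time, each time adding whichever
    candidate brings the basket's average mix closest to the target.
    Greedy, so fast but not guaranteed to find the best possible basket."""
    skip = set(excluded)
    candidates = [n for n in data if n not in skip]
    chosen: List[str] = []
    while len(chosen) < k and candidates:
        best = min(
            candidates,
            key=lambda n: distance(basket_mix(chosen + [n], data), target),
        )
        chosen.append(best)
        candidates.remove(best)
    return chosen


if __name__ == "__main__":
    print(greedy_basket(NEIGHBORHOODS, CITY_TARGET, k=2))
    # Dropping a hard-to-canvass neighborhood from consideration:
    print(greedy_basket(NEIGHBORHOODS, CITY_TARGET, k=3, excluded={"D"}))
```

The `excluded` argument stands in for "eliminate some neighborhoods as options", and weighting different types of diversity would amount to putting per-group weights inside `distance`. Checking every possible basket would give the true optimum but blows up combinatorially, which is why a greedy pass is the practical tradeoff here.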

The other hope for this is its use as a great instrumental goal10. My planned next project is to start really mapping out what all the various AI Safety orgs are actually doing and why they are doing it11. This feels good for a variety of reasons. It helps me personally in deciding what I want to do by seeing what the options are. It naturally leads to me talking to (emailing) a variety of people and (hopefully) getting their personal takes on what they think the most important theories of change are. Also, nobody has really done this. The best existing resource is probably the AI Safety field map, which is extensive (and I will certainly be using it for discovery) but doesn’t really get into how specifically a given org is helping us not end up catastrophically fucked by AI. All of this to say, I think this is a great proof of work on my interest in extensively cataloguing various axes of importance in the world of AI safety. From cataloguing the body of work that is technical work in AI safety, to personally cataloguing and internalizing public opinions, to then really understanding the ecosystem12. The theory is people are more excited to talk to someone that has proof of work getting shit done in this domain and is onside13 so to speak.

How am I feeling about all of this? Pretty excited, slightly worried that I am letting myself get nerd sniped by a really interesting subproblem but also this feels generally worthwhile and working on things that are interesting and energizing is good! I can do more total work if I like it, my life is better, etc. The real question is after I have done reps and know what the experience is like, do I post this in the various places where people who care about AI hang out? I have built out the infra, the app is smooth, it’s easy to make someone a profile and let them go canvass. I can easily sort out results from a given user if there’s worry of a given canvasser giving bad data. It seems like there might be people who would enjoy going out and asking people what they think of AI. The worst that could happen is nobody is interested. Idk, I can’t really see any reasons not to at least give people the option, given I have built all the infra already. And if I’m really truly worried that I can’t trust other canvassers14 to be unbiased in their delivery I can always toss all other data.15

Released = Posted on LessWrong

And now also blog posts as well!

Although, all existing survey work has been online/over text. As far as Claude and I can tell, in person surveying of existential risk is novel work - pretty exciting to be doing something no one has done before.

ofc i can always stop early if it seems to be not getting me anything

If you can even count working on something that is maybe a marginally inefficient use of my time as a problem

Not guaranteed optimal since that’s hugely computationally expensive. It uses more of a greedy algorithm.

Not to mention lots of these have potential bias. I would be very surprised if asking someone how they politically identify in NYC didn’t skew the results, with D’s more likely to identify.

“white”, “black”, and “other” is quite sufficient because neighborhood by neighborhood “other” is quite defined. As with most big cities there’s tons of natural clumping by ethnicity etc.

physically, emotionally, intellectually

A goal that helps me accomplish a further goal. For many people getting a CS degree was an instrumental goal towards getting a job. Taking the train to my friend’s party is an instrumental goal towards attending their party.

What is their theory of change? Why specifically do they think that what they are doing is helping lower the odds of catastrophic outcomes due to AI?

Not to mention the technical red team/model alteration work I’ve done. (And I’m currently ideating on the next technical project to work on since I think it’s quite helpful to remain grounded in the world of technical work while surveying the options)

There’s a lot of general distrust of anything reporter shaped in these general communities after super consistently getting approximately done dirty over and over by traditional reporters.

But that level of mistrust feels pretty misplaced, I’m not some lone paragon of virtue.

In my standard Claude blog post feedback it has lots of thoughts on potential holes in my survey methodology

A Month Has Passed

Cormac — Tue, 03 Mar 2026 18:54:36 GMT

I have spent the last month doing approximately 70+ hours per week of AI safety, whether that be learning or working on projects. I have a much better model of the landscape, what the types of technical work are, and even some hints as to what it might be like to work in some of these areas. I published writeups on the two larger projects I have been working on1, which right now feels like a big release of tension, but I also feel much emptier: my brain is doing way less constant thinking-through of these projects. And now I have a big question of what to do next! I don’t have a plan for the next project I want to work on.

If my current belief is that the highest impact thing for me to do is take the technical understanding I have built2 and apply it towards the collective action problem of how to slow down the speed of AI progress then it’s probably time to start at least vaguely considering how to move in that direction3 (and even what the options are in this area). I am extremely practiced at skilling up in a technical field but this seems much more nebulous (which is probably why I haven’t started yet - I don’t want to do the thing that feels much less tractable/intuitive to me.) There’s also still lots more understanding to be had on the technical side, especially as I continue to try to understand the core issues of the various problem stacks4 of which there are certainly some that I still haven’t considered at all. I am somewhat worried about losing steam as I5 make this transition away from the extremely fun easy to see progress problem of learning the various technical problems related to AI safety.

Not that there isn’t still clear work to be had. I still have a real backlog of papers I want to read. I will certainly want to write a blog retrospective on the experience posting my first substantive projects to LessWrong6. But, at some point I am going to have to start making some decisions and decide which parts of the AI safety ecosystem I want to go deeper into and start actually specializing in. This is scary, especially since I have changed my mind at least once per week on what I think the best specialization is.

I feel like most of my posts have had a takeaway. I guess this one is me claiming that decisions are hard, and deciding when/how to decide is hard, and I feel like I should start thinking about decisions. This does make me understand a little bit more why so many people end up at Anthropic - it’s super legible, it has tons of job openings, it’s easy to feel like you are working for the good guy7, and it pays well. If I was down to work there I wouldn’t have to do this next hard step of figuring out what is actually the best decision and how to fucking get there8.

Claude really thinks this blog post is weak9, I’m publishing it anyways. Maybe that’s a mistake? Hard to tell what the right quality bar should be - at least it’s short. Please feel free to let me know if you didn’t like this one.

On both LessWrong and EA Forum, but so far it seems like LessWrong is significantly more active than the EA Forum, which I find very interesting given that I generally associate EA Forum with more social power seeking and LessWrong with more nerdiness - I would think the social power seekers would engage in social technology (forums) more.

And will certainly continue to build on

both from a skills perspective but also a clearer theory of change perspective

I started with mech interp, then thinking about model red-teaming, most recently security

slowly, maybe

Forum for rationalists and AI safety people, feeds directly into the “AI Alignment” forum which is more exclusive to AI safety

I basically don’t think that’s true at this point

Not to mention, whatever I decide “the best decision” ends up being will probably be both less status-ful and pay less than Anthropic.

This is what Claude said. I certainly agree with a good number of its takes, but not enough to write more about it.

The Worst I Have Ever Felt About AI

Cormac — Sat, 28 Feb 2026 23:53:15 GMT

During the let’s talk about our feelings1 portion of house meeting today I said something like “This is the worst I’ve ever felt about AI, I don’t have any requests of you all just figured I should tell you.” It was interesting because of the 3 other homies at house meeting one of them said they were feeling the same way, and the other two were basically like we don’t want to think about this. The one who was feeling similarly to me was feeling it because of the whole Sam Altman is evil2 thing (not that it’s new, but I think this time it’s way more obvious and publicized than the OpenAI board debacle.) This was interesting because while yes I am relatively unhappy about this development I don’t think it actually changes much and if anything is Sam Altman showing his colors for what appears to be something not that substantive.

It was most fascinating to see the reflexive response to not want to hear about it in what looked like a self protective way. In the last 24 hours since I last talked about what kind of work I might want to do in this space I decided that I was doing the thing that I have been complaining about other people doing: focusing too much on the fun interesting technical work. I’m a former quant, being technical is fun, it’s in my blood, my whole goal with this self study upskilling was to see just how technical I can get in this space - the goal was very much not to specifically try to get a technical job3. But now I’m falling into the trap of only considering technical work!! There is so much other work and my current model is the most impactful work is basically something in the realm of bringing attention to the problem.

People do not like AI, the data is pretty clear. People don’t like the prospect of AI taking their jobs. People don’t like the prospect of AI surveilling them. People who have heard about it don’t like the prospect of AI leading to human extinction. My current top theory of change is people (both in and out of power) caring more, and this generally leading to important agreements/treaties/regulation/enforcement. This feels like a significantly stronger approach than working on better security or technical alignment at labs while keeping the current equilibrium where labs are competitively racing against each other.

To some extent this is a harder to imagine problem (at least for me). Bringing attention to things feels much more nebulous than making clear progress on a specific technical problem (although who’s to say if the given technical problem will have any relevance to real AI alignment4). Technical problems have clear feedback, you try something and you immediately see the result, the job is just to keep doing things with immediate feedback loops. Dealing with people is so much more nebulous, and I’m someone who actually quite likes people. I understand why it’s so easy to get pulled into the vortex of working on technical problems.

And yet two of my smart fun roommates who I have a deep well of personal connection and history with… don’t want to hear about it. Idk, that’s kind of scary.

formally called the “open circle” portion

Or I guess maybe extreme power seeking at approx any cost is the more explicit claim, but when you try to develop superintelligence that rounds to being one of the single most influential people towards increasing our odds of going extinct.

I’m obviously open to a technical job if I think that will be the optimal combination of impactful work, that I am very good at (and therefore have impact), and doesn’t burn me out. But I literally set out telling myself it’s important to leave my mind really open to as many options as possible. This space seems to have many types of work and committing to a single one too early seems foolish, I mean I’ve only been hardcore focusing on this for a month!

Although if you do want to read about something seemingly promising in AI alignment, this writeup on Opus 3 seemingly trying to align itself was my favorite read of the day.

Do I Want to Work at Anthropic

Cormac — Fri, 27 Feb 2026 21:44:23 GMT

I’m sleeping poorly this week. Waking up at 6am, heart racing, thoughts racing, and today my upstairs homie racing1. Since we just love perverse incentives, I will reward you who is reading this with my writing about one of the things that’s been on my mind. Obv that’s the question of whether I should work at what is seemingly the top pick for “place where someone who cares about AI works”.

In my exploration down my social graph to find “people in AI safety” approximately 50% of the hits have been people that work at Anthropic. Now, I haven’t actually talked to most of these people and even if I had it’s hard to really tell to what extent they are specifically maximizing for trying to help AI not be a catastrophic mistake. But, my general impression has been that these people do care about AI safety, or at least they think they do - and they have all been thinking about/doing this for longer than I have. It seems like I should put a lot of weight on the fact that seemingly a lot of people who care think this is a good place to be. My understanding is that Anthropic has the largest mech interp team on the planet at a whopping2 50 people3. Clearly many people think this is the place where their time should go.

I’ve been enjoying the combination of both “Aligning” and red-teaming kimi. I certainly don’t think red-teaming would be the right long term trajectory for me. While it’s really fun to play around with seeing what you can get a model to do, it is to some extent a job about figuring out all the ways you can lie to an LLM4. I’ve already had a job where I have to small lie all day long and this seems worse. But, this process has made me think more about AI model security.

Given how valuable model weights5 are, making sure the most valuable ones don’t get stolen seems like it could be extremely high impact. If the weights for the best closed model got stolen this would not only allow for lots of mundane harm from misaligning the model, but it would almost certainly speed up the pace of global AI development.

Traditionally, the job of trying to figure out how to avoid something getting stolen is seen as much much less prestigious than the job of making the thing that is worth getting stolen. If you really succeed at the job of making sure something doesn’t get stolen there is no big payoff, all that happens is the world continues how everyone expected it to. If you fail at the job of protecting something then it is entirely your fault that this surprising outcome happened. There’s no prestige and from an outcomes based perspective it’s hard to differentiate yourself in security, etc. Intuitively it seems like this would be a neglected cause area. And, the work seems genuinely fun! Sitting around all day trying to put yourself in the head of the attacking counterparty, thinking through all of your (company’s) weaknesses, running trials to test various attacks, thinking through how to protect against a given method, it seems like a genuinely interesting job.

Currently Anthropic approximately has the best model and of the three major labs (OpenAI, Anthropic, Google) it seems like I would be able to do/learn the most if I worked there. I would also likely have the “best”6 coworkers. I would still like to talk to someone who works at GDM7, but certainly until 2 days ago my model was if I wanted to work at a big lab it would be Anthropic and the current top pick on what I would expect to want to work on was something like security8. Then, Anthropic changed their Responsible Scaling Policy.

Okay, it’s been six hours since that last paragraph. I decided to read all of the RSPs side by side and make a site to help with this, as well as read a good chunk of RAND’s paper on securing model weights. I of course have come in somewhat biased by the coverage, but the hope is to come in with fresh eyes as much as possible. I don’t love what I see.

It's really clear how the tone has shifted, going from commitments (of which many have definitely been broken) to something more along the lines of reporting policy. To some extent this seems inevitable given that they would basically have to define RSP-4 as AGI9, and cop to considering how to stop nation state level security breaches as well as the question of “should you even be releasing that?” The earlier papers’ policies are clearly really trying to grapple with what measures they can put in place in advance to try to minimize the potential for harm, while the most recent one reads closer to a policy paper, talking about what they want other AI labs to do and making few hard commitments. Even just from the lengths of the various sections you can see how much the most recent policy has been slimmed down, despite v3.0 also having probably the worst information density of the lot10.

Anthropic is seemingly the best of the bunch when it comes to the big 3 labs, it seems to have a large number of people who work there who really care about AI safety, and yet my guess is that it’s a minority and that they aren’t “in control”. Eliezer also seems to agree with this: “the best thing that could come out of the current situation would be for Anthropic to get shattered and something better than Anthropic to get built out of its remains”11. Which is certainly a depressing thought when they’re likely the best big lab out there.

So, should I want to work at Anthropic? Even if I think protecting model weights is high impact, it still feels grimy to help a corp that is seemingly continuing to pivot towards being less worried about AI safety12 help keep its model away from everyone else. But should I listen to that gut feeling? Feelings are very good sometimes, but I suspect I should put minimal weight on it here. If Anthropic has the best models and I would be surrounded by smart people who care, then maybe it’s a really good place to at least spend a year. Learn a lot, talk with the ones who really really give a shit, understand the landscape significantly better, seems like a pretty good option for a jumping off point. But, man, am I feeling less excited at the prospect after this RSP update. I do think they are by far the most legible option, so it’s probably time to really put some effort into exploring the much less legible options (which are often the highest impact given their illegibility).13

I finally called her after 40 minutes of floor creaking and asked “what are you doing up there” to which she replied that she’s been packing (what maniac packs the morning of leaving instead of the night before?). She normally has clothes absolutely blanketing her floor which hardcore protects against the creaking, but our cleaner came yesterday - seems like I paid money to make my life worse. The noise probably wouldn’t even be a problem, but I decided to try out two different new types of earplugs this month and both are worse than my regulars - seems like I didn’t pay money to make my life worse.

Depressing

Out of probably 500 total technical safety roles.

lying in user input text, lying to the model about its perceived state of the world by finding ways to make the model think what you wrote is actually what it had previously written, etc

Literally the collection of numbers that defines the model. At the end of the day ChatGPT, Claude, etc are a ginormous bucket of numbers written down somewhere.

Some combination of care most about AI safety, smart, motivated, good to learn from, etc

Google DeepMind, if you know someone who works there I would love to be connected!

Although what I think the best work for me has been shifting practically every week so, dear reader, I wouldn’t put too much weight on the fact that I am specifically talking about security in this post - the most likely future is 2 weeks from now I will have a new top pick.

Artificial General Intelligence. Yes, I am using this in an underdefined way

Not so fun fact about number of em dashes in each version: v1.0 has 5, v2.2 has 12, and v3.0 has 47.

From this series of tweets which I think are quite interesting/worth the 30 seconds to read.

my model is to some extent Anthropic thinks aligning AI isn’t a hard problem.

I’ve decided to try putting Claude’s thoughts on this piece in a footnote this time, if you’re craving more and want to read what Claude thinks here it is (and what I have been thinking about as a result of reading it)

Being Legibly AI Safety? Red Team Morality?

Cormac — Fri, 27 Feb 2026 02:48:48 GMT

In an attempt to not get audience captured by my friends who exclusively told me they much preferred my most recent (personal) post I am swinging back into AI, perhaps to the chagrin of at least one person :

(some intermediate messages removed)

I’m currently wrapping up my two big AI Safety projects of March. So now the question is what do I do with them?! Here is what I have done:

Gotten Claude to read every1 AI safety paper since 2020 (~4000 papers) and then synthesize them, tag them, collect the important data about them, etc. I then made a website with this database and good search features. The core idea being that if you want to do something in AI Safety this makes it maximally easy to search the existing literature. I personally used it extensively to see what existing datasets existed to help me pick existing ones for my research…
Aligned/Misaligned the largest/best open model on the market, kimi k2.5. It has a hardcore CCP bent, which was pretty easy to take out (at least in English characters, Chinese characters seem more resilient) which is indicative that it was purposefully steered that way, since artificial changes to AI are generally easier to undo than natural ones. I also found to what extent I could red team kimi k2.5 to do harmful/illegal things. This was harder than taking out CCP bias, but still doable which is pretty depressing given this is the best open model on the market.

It seems like the “best” strategy looks something like posting both of them on the same day to the LessWrong and EA forums. Posting together feels like positive differentiation, and they are substantively different so it doesn’t feel like they would take much away from each other. There are all the boring questions of: day of week and time of day - which feel answerable by an LLM.

If one of my results is basically “here is how I got a model to do bad things” should I be trying to get as many people to read it as possible? I think my answer is yes. I broadly see two kinds of risk

AI becomes superintelligent, behaves nothing like a human, and leads to the death of all humans
AI isn’t (yet/ever) superintelligent, its existence as a tool leads to lots of mundane harm. Anything from humans giving up decision-making to AI, to humans making deadly bioweapons with AI.

Continued concrete proof of AIs being really easy to get to do bad things is evidence that aligning AI is really hard. Let’s say we have a very very smart and aligned model. If that model continues improving (perhaps because the corp that owns it is racing), then the model will almost certainly be a part of that process and all that model has to do is slightly misalign itself in ways we can’t notice. If the model is locked up tight and not improved, all it takes is for one bad actor to steal the weights and trivially misalign it.

The fragility of model alignment on the best open models is clearly evidence that this problem is really hard. Me posting online about red-teaming kimi k2.5 is evidence of this. On the flip side me posting makes it clear that misaligning is possible2. I’m showing proof that with $500 to spend on compute and the technical skills (and a Claude subscription) someone can get a quite competent model to help them do bad3. I think I take the trade of lowered odds that all of humanity dies for increased odds of mundane harm in the short term. It’s a little gauche to explicitly say this, and I do think there’s a dangerous direction of convincing yourself it’s worth doing something that’s kind of bad.

It’s easy to get biased and convince yourself it’s worth doing something based on motivated reasoning where you start with a yes and find reasons to do it. Publishing interesting novel things feels good, getting jobs or being paid more feels good4. You see this at the big labs all the time where they mostly all start out with the goal of creating a better safer AI and then the incentives of getting showered in cash for making AI better win out against not getting showered in cash for trying to make AI less likely to kill us all. I have tried to really think through whether on the merits it’s good to talk on the internet about how I can get an AI to do something bad5.

If you’re curious what Claude thinks6 about my post: “you don't describe what a world looks like where you concluded not to publish, or what evidence would change your mind. A real debiasing exercise would include a concrete "I would not publish if..." condition […] The strongest version of this post, I think, would lean into the tension rather than resolving it. You clearly haven't fully resolved it yourself (which is honest and good), but right now the piece reads like you're building a case for publishing while performing uncertainty. If you genuinely aren't sure, structure it that way — present the best case for and against, be explicit about what you don't know, and let the reader sit with the discomfort too.”

What’s my response to Claude? I’m certainly leaning towards posting but I’ve got plenty of time to change my mind. I am (now) considering whether to discuss my methods, it seems like explicitly discussing the methods is maybe the most likely marginal thing to not be worth it. Maybe one of my dear subscribers will move me, who’s to say?

There has to be a cutoff somewhere and I abdicated that responsibility to Claude saying to be relatively aggressive and if it thinks it’s even borderline AI Safety to not include it

Although if you go searching online there’s already lots of evidence that it’s not that hard to red team open models, to some extent this is really just extending existing work to the shiniest newest open model.

There’s also the question of like, is it clearly bad if someone does mundane harm using AI? At this point if there was a big news story about someone doing lots of harm because of AI, that could lead to more of a push towards “AI is dangerous, maybe we shouldn’t be blindly rushing towards creating a god.” So far most of the bad actors using AI have been doing it for financial gain, which makes sense: sophisticated bad actors are almost certainly motivated by money and not being caught, and not by getting help building a bomb. Already there was a major hack of the Mexican govt. using Claude which has mostly not been a big deal at all since socially cyber attacks are seen as chill (despite much of our infrastructure being, you know, cyber attackable). If nuclear weapons had never actually been used in the field, would we have ended up taking nuclear agreements and truces as seriously?

I certainly want to talk online about what I did, it’s cool! I spent a ton of time on it! If I toiled away for a week on this and then just keep it to myself forever that would feel bad!

Idk, even talking and thinking and writing about it this much feels kind of ridiculous, is it really that harmful to talk about and show results from red-teaming an AI, it seems like basically everyone else thinks it’s worth posting about - but of course I only see the people who do post and don’t see the people who don’t post so there’s an incredible bias in what I actually see.

After I asked it to “really try to think of counterexamples, things i might not be considering, my blind spots, etc”

Cormac's Snow Day

Cormac — Tue, 24 Feb 2026 04:24:56 GMT

I got a bunch of new subscribers today and by a bunch I of course mean 4. Today was the first day this month where I didn’t do any work related to AI. That sentence is not strictly true. I’ve had an hour’s worth of thoughts on what I should do tomorrow1. I thanked and also very lightly complained to a friend of mine (hi Zack welcome to my substack) for reaching out to a bunch of friends in AI safety after I had requested a list of the people in AI safety. Oh, and right before this exact moment (5pm) I was reading If Anyone Builds It, Everyone Dies2, but that’s not real work - that’s just relaxing. I did cry while reading it, but that doesn’t make it work!

Long Preamble Over

I love snow. I love Dancing. I woke up and spent the next 4 hours dancing around NYC. In the streets. On bridges. Surrounded by snow. Sliding on so many icy surfaces. It was fucking wonderful.

My favorite part was an hour dancing in Madison Ave. It’s a wide one way street, plenty of room for cars to pass. Well shoveled but still slippy slide-y fun. My back to the wind.3 My front to the cars. Surrounded by stoic tall Manhattan buildings. Almost deserted so everyone is happy to see other people. There’s something truly epic to feeling into music4 on a wide open snow filled street in the heart of New York.

This was probably the hardest decision of the day. Five fun people (in addition to the 5 homies) slept over for a snow sleepover last night, and it was really really enticing to stay up with them instead of getting up early to take advantage of peak blizzard and no cars out. I think I made the right choice, but also I am clearly neglecting my social life and this was low hanging fruit. There’s something about tackling a third major thing that feels dangerous. Something about losing focus and being spread thin. A housemate thinks I should stop moping and just exert my power.

Power

I’ve been thinking about power recently. It seems like most people think I have it (I think I do too.) But I’ve been confused about why recently. For a while my model was I was clearly socially powerful because I started and ran a beloved community event every other Saturday for 4 years. I now don’t do that and currently do not have nearly the same social status/power5. And yet it seems like I am at most only slightly less powerful. There’s something about agency and ability to execute, but also there’s “powerful vibes” without knowing those.

I really really really love being playful, but some of my playfulness has been sharper since I’ve been sadder and that’s bad. A housemate said something about how being slightly off target is worse the higher the magnitude of the shot (slightly grazing you with a bullet is way worse than with a poke). Careful power. Good Power. Attuned power.

Back “on topic”

I came back from the snow dancing and all 5 bonus people were still in the house! In fact many of them were just waking up! I love being in a house filled with lots of people, it’s still such a good vibe to just be in presence and impromptu conversations. Soak in the people. Soak in the communal snow day energy. It seems like this should be repeatable.

After relaxing for 6 hours I’m about to go play DnD. I’ve been playing every other week with this group for something like 3 years. We were all strangers when we started. I am not really 1:1 friends with any of them. It’s so fun, but also it’s such a weird relationship. I wish there was a mechanism for voting a mediocre player off the island, but I think this is a case where simply suggesting that is negative EV6. It’s really fascinating to spend this much time with these people (probably 300-500 hours) in a satisfying way but with no genuine connection.

I suspect that if my social calendar of things with people I love were full DnD would probably be dropped, but it’s not super clear. I am currently really in a wanting to be wanted and wanting to want phase with simply too many people, so thinking about what I would do with an overpacked social calendar is not top of mind. I empirically struggle with long term friendships. I suspect I am not vulnerable enough or share enough about how I feel towards them: my love, my wants, what they mean to me, how important they are, it doesn’t feel natural7. Instead, I mostly just allocate the time and am excited to see them in that time. I’m starting to suspect I could do better than this. Wow, I literally just had one of the slept overs tell me that they want to make sure I know that they want me after I complained about this at open circle8. That was really nice!

I suspect the juice wasn’t worth the squeeze on this post. I’m trying out writing about the “Purpose, meaning, and life well lived” part of my description. If you’ve made it here feel free to let me know if it was worth it to you to finish reading this, if it wasn’t I ofc won’t take it as an attack. I think every subscriber so far has my phone number, no excuse.

Once my Claude Max subscription refreshes

The most read book targeting the general audience on why AI killing everyone is worth thinking about. You can buy it here. I pirated it because that is my cached efficient workflow on how I read anything, but I felt kinda bad about it this time so idk maybe that’s a thing to think about. If you want to read it but don’t want to pay money here is that link.

This is really important, if you want to dance (or even walk) around in a snowstorm find out the direction of the wind and make sure it is towards your back. Snow flying into your eyes radically worsens the experience.

If you’re curious, the playlist I went for was about 55% high energy sonically chaotic/diverse dance music, and 45% vibey no/minimal beat interpretive vibes, you can find the playlist here.

obviously there’s still some foundational social power in that lots of people who know me are aware that I successfully executed that etc

expected value, I am trying to convey in two letters that bringing this idea up would have social consequences and if anything would lead to a slight increase in the odds of me getting voted off the island so it’s not strategically worth it. Maybe I should suggest a check in but I don’t think that would actually help move us towards the outcome that I think is good for all but one of us: a change happens with the person who doesn’t show up prepared or on time. Maybe it would generally be worth it anyways. Unclear.

Especially in older relationships that began during a version of me that was less outwardly emotional. The classic example being family (hi Helen welcome to my substack).

Weekly relational mindfulness event where people can process hard things in community. Adjacent to circling/authentic relating.

My Siloed Views Meet Reality

Cormac — Fri, 20 Feb 2026 04:18:09 GMT

I’m talking to a friend of mine today who works directly on AI safety at my favorite big lab. This is the first real conversation I am having about what my options might be, what work/impact might look like, etc. I will likely have more conversations like this with people increasingly diffuse into my social network, but I wanted to start with someone I personally like/trust. Let’s take this chance to think through what my beliefs actually are right now, and then we can see how my views change after!

Pre

I have been primarily focusing on the core mech interp skillset, it certainly feels like this is the right place to start since the goal of mech interp is to find the best understanding of what is actually happening in models. Neel Nanda likes to talk a lot about how “My ultimate north star is pragmatism - achieve enough understanding to be (reliably) useful.”1 and when I first read that it felt really obviously trivially true to me, and like why does it need to be stated. But, the better my model of mech interp (and AI safety as a whole) the more I understand why this is so important to state. LLMs are super opaque super interesting super complex. A byproduct of this is that the space of interesting fun projects one could work on is just absolutely enormous.

One of the large projects I am currently working on is a comprehensive lit review. Since 2020 there have been thousands upon thousands of “AI safety” papers, in expectation every day multiple new AI Safety papers are published and available to be read on arXiv. There are so many models, they behave in different ways, there are so many techniques, there are so many types of problems you could try to tackle with a given technique.

I have read fewer than 20 papers and I feel pretty confident I can mostly understand the vast majority of published mech interp papers (and a good percent of generic AI safety papers), I guess there’s something that feels not deep. Basically every paper I have read has been written by a major AI lab, while it seems that the vast majority of papers published are written by people playing the university/academic game. The goal of playing the Academia game isn’t to lower the odds of AI causing catastrophic harm to humanity, it’s to publish novel papers that get cited by other academics to build reputation which leads to good jobs where you get paid to keep working on fun problems. All of which is to say there are clearly plenty of ways to work on fun interesting problems that don’t have impact, but that doesn’t answer the question of how to have impact.

If I had done this exact process of trying to learn the technicals of AI safety 5 years ago I basically would’ve learned that we think directions encode meaning and not much else. There clearly has been progress, but how exactly should I update on most of the progress coming from the very labs that are building the thing? Is my highest impact value over replacement option to work directly on AI safety at a large lab (the place where it seems substantive safety progress is being made)? My gut answer is no, since working directly on AI safety seems to be in the bucket of “fun technical problem” s.t. AI labs seem to have large applicant pools for AI safety roles. Perhaps I have significant value over replacement compared to the best alternate candidate for the role, but for now it’s hard to see how I could justify that level of confidence in my direct AI safety research value.

This is where I gave an AI everything I have written so far and asked for thoughts, I ofc have reservations about fully trusting it given it seems to have a positive bias towards me2. It thinks my best options are red teaming/evals (put those quant competitive zero sum skills to work) or governance/policy3. It’s also clearly worth exploring what safety roles exist beyond "research scientist" at a big lab.

Post

There are fewer people really trying to solve this than I thought, people keep telling me this and I keep being skeptical - at some point I’ve gotta stop being skeptical. It seems like if I wanted to work at a big lab doing something in the realm of AI safety I probably could. There are fewer smart people who really care about this than I thought, and lots of safety roles (seemingly the majority) are not what I had imagined a “technical AI safety researcher” role to be.

I came away from this updated in the direction of working at a large lab. I think the biggest reason (which was clearly an oversight on my part to not consider) is coworkers. Working around lots of other people who are smart and really trying to get shit done is clearly just super valuable, both for learning in more depth and for seeing the breadth of what is even in the space of possible options.

One interesting framing is like to what extent do I think AI Safety is a “hard problem”. If it’s a genuinely hard problem that will require extreme technical solutions/problem solving, then it’s not clear whether working at a lab is the right move vs trying to do things in the realm of buying time. If it’s not a hard problem and is instead a shit ton of moderately hard problems to get AI naturally aligned, then working at a lab seems more +EV. Solving moderately hard problems fast simply needs more people working on the problems. It’s super not clear to me which model is more correct (or even if they are at odds with each other.)

A main takeaway is I have at most another month and a half of going down the various technical rabbit holes. Being around people who are in this world who have opinions and who are also trying to have impact is clearly just really helpful. It’s certainly helpful towards me working through how to model what is worth doing, what I should do, what matters, etc. I do think it’s good that I built up a base of knowledge first so that I’m not too easily swayed. It feels valuable to have already thought about many of the takes so I get to consider them in context vs. simply believing what is being said to me.

How To Become A Mechanistic Interpretability Researcher on AI Alignment Forum

Even after I told it to be realistic about my strengths and weaknesses it said “You can write clearly about technical topics. Your explanations are better than most published blog posts in this space” which feels like it just can’t be true given my unedited lots of words strategy

From the same response as footnote 2, “the person who would otherwise fill these roles often has either technical knowledge OR policy/strategy skills, rarely both. Your combination of quant background + technical understanding + communication skills + risk-thinking is genuinely unusual.”

Technique Journal

Cormac — Thu, 12 Feb 2026 23:49:32 GMT

I am currently working on a bunch of small 1 day or less projects where the goal is to focus on a specific technique, I will be keeping track of them and my thoughts on them here. You probably shouldn’t read this. My guess is for every single one of these techniques it is trivially easy to find a better explanation of them (even just asking an AI (or if you want to see how dumb I am ask an AI everything I got wrong here.)) I will be updating this as I go in the order I am learning the technique. To some extent this document mostly exists to force me to write about (and therefore think about) each of these techniques and hopefully so I can easily come back later to see how my thoughts have changed.

Direct Logit Attribution (and Logit Lens)
1. For the given token see how much the given output of an MLP layer or attention layer (or head) impacted the final logit value. In other terms, at what steps in what layers, for this specific context token, did we move the most towards a given output. When looking at a correct output this helps us see where we learned things that were helpful; we can alternatively look at incorrect outputs and see when the model moved towards a wrong answer. The clear issue with this is that it only looks at the given token, so we can see when the attention layer (or head) added lots of probability to a given output, but we can’t see which of the other tokens we learned from. My current model is that most of the interesting stuff is about which other tokens we learned from and why: even for something as simple as an induction head, it matters which other token we actually learned from in a given attention head. So, I guess the point of direct logit attribution is to help us figure out where to look. It seems extremely computationally light, so it seems like maybe this could be a technique I would use very early on just to try to get a better understanding of the basic structure of the model, for example which attention layers are transferring lots of information. It seems like this is maybe a technique to use if I think there is a very clear comparison to be made between a right prediction and an alternate wrong prediction, ie “When Mary and John went to the store, John gave a drink to ___” → “Mary” or “John”.
2. The final logit value for a given token comes from the final residual stream, which is just the embedding plus the sum of all values that are added back into the residual stream, ie the outputs of each of the MLP layers and attention heads. We apply whatever the final normalization function is and then take the dot product with the unembedding direction of the destination target. This makes the implementation pretty simple: apply whatever the multiplier is for the final normalizing function to each of the individual contributions (the MLP and attention outputs) and dot product each of these with the various destination values we might want to compare. Super simple implementation, and it really makes it clear how this doesn’t help us see how any of the other tokens have an impact, since it’s only looking at the various MLP/attention layer outputs for the given token and nothing else.
3. Logit lens simply looks at residual stream values after each layer and we can unembed to see how it ranks various tokens (usually most helpful is to see the trajectory of the correct prediction, for example I can look at what layer the correct prediction becomes the model’s #1 most likely, what layer is it top 5, etc.)
4. Anthropic’s 5/2024 take from Scaling Monosemanticity on using direct logit attribution to identify where to perform feature ablation on SAE features: “A simple strategy for efficiently identifying causally important features for a model’s output is to compute attributions, which are local linear approximations of the effect of turning a feature off at a specific location on the model’s next-token prediction. We also perform feature ablations, where we clamp a feature’s value to zero at a specific token position during a forward pass, which measures the full, potentially nonlinear causal effect of that feature’s activation in that position on the model output. This is much slower since it requires one forward pass for every feature that activates at each position, so we often used attribution as a preliminary step to filter the set of features to ablate. (In the case studies shown below, we do ablate every active feature for completeness, and find a 0.8 correlation between attribution and ablation effects; see appendix.)”1
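To make the two techniques above concrete, here is a minimal sketch in plain NumPy. Everything in it is an assumption for illustration: the component outputs are random toy vectors standing in for a real model's cached activations at one token position, the final norm is simplified to an RMS scale with no learned gain, and the names (`components`, `W_U`, `direct_logit_attribution`, `logit_lens_ranks`) are made up here rather than taken from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_layers = 16, 50, 4  # toy sizes, chosen arbitrarily

# Hypothetical stand-ins for cached activations at ONE token position:
# the embedding plus everything each layer writes into the residual stream.
components = {"embed": rng.normal(size=d_model)}
for layer in range(n_layers):
    components[f"attn_{layer}"] = rng.normal(size=d_model)
    components[f"mlp_{layer}"] = rng.normal(size=d_model)

W_U = rng.normal(size=(d_model, vocab))  # toy unembedding matrix


def final_norm_scale(resid):
    """Multiplier of a simplified final norm (RMS norm, no learned gain)."""
    return 1.0 / np.sqrt(np.mean(resid ** 2) + 1e-6)


# The real forward pass: sum everything, normalize once, unembed.
final_resid = sum(components.values())
scale = final_norm_scale(final_resid)  # frozen and reused for every component
logits = (final_resid * scale) @ W_U


def direct_logit_attribution(components, scale, W_U, token_a, token_b):
    """Each component's contribution to logit(token_a) - logit(token_b)."""
    direction = W_U[:, token_a] - W_U[:, token_b]
    return {name: float((out * scale) @ direction)
            for name, out in components.items()}


def logit_lens_ranks(components, W_U, token):
    """Rank of `token` when unembedding the residual stream after each layer."""
    resid = np.zeros(W_U.shape[0])
    ranks = []
    for name, out in components.items():  # dicts keep insertion order
        resid = resid + out
        if name == "embed" or name.startswith("mlp_"):  # end of a block
            lens_logits = (resid * final_norm_scale(resid)) @ W_U
            rank = int((lens_logits > lens_logits[token]).sum()) + 1
            ranks.append((name, rank))
    return ranks


token_a, token_b = 3, 7  # toy ids standing in for "Mary" vs "John"
attr = direct_logit_attribution(components, scale, W_U, token_a, token_b)
top_token = int(np.argmax(logits))
ranks = logit_lens_ranks(components, W_U, top_token)

for name, value in attr.items():
    print(f"{name:>7}: {value:+.3f}")
print("logit lens ranks of the final top token:", ranks)
```

Because the norm scale is frozen, the per-component attributions add up exactly to the final logit difference, which is the property that makes this so cheap. Nothing in the code touches any other token position, which is the limitation described in point 1.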
Sparse Auto Encoders
1. A type of dictionary learning, which basically means that we take a bunch of signals (model activations) and find a dictionary of basis elements such that every signal can be expressed as a combination of those basis elements. First off, this is significantly more resource intensive than direct logit attribution, on large models easily 100,000x or more. However, that is a one-time cost to train the SAE, which intuitively is why resources like neuronpedia.org make sense: this seems like it’s on the edge of what someone on a small research budget could accomplish. You don’t need to know anything or have any hypothesis going in, so it is very exploratory. Training an SAE will simply find whatever the most relevant single-dimension features are. Since there are tons of features, they are usually labeled using LLMs instead of by hand.
2. SAEs are basically a very simple linear encode and decode neural network (with a nonlinearity like relu in the middle) trained on the residual stream at various points in the model. Instead of the input being tokens the input is the residual stream value, meaning it’s a vector of size residual stream dimension. Additionally before we can train the SAE we need to calculate what that residual stream value is, this means if we have a corpus of hundreds of millions to billions of tokens we need to run a forward pass for every single one of these tokens (and all the other tokens in context) for us to then minibatch on. This is a pretty resource intensive step. When I trained an SAE on a single layer of gemma 2-2b using a corpus of only 2M tokens, this step took over 2x as long as the full training for the SAE for one layer (there are 26 blocks with both an MLP and Attention layer, so training all the layers would take much longer.)
3. Training an SAE: take the minibatch of residual stream values, shift towards mean zero by subtracting out the decoder bias, multiply by an encoding matrix of size residual dimension by SAE dimension (usually at least 10x larger than residual), add the encoder bias (size SAE dim), and apply ReLU. This gives the larger dimensional representation, let’s call it z. We then decode back to residual dimension by multiplying by another weight matrix of size SAE dim by residual dim (the decoding matrix, which also has to be normalized for uninteresting loss reasons), and finally add back the decoder bias to get back to our original mean. This gives us a final output xhat. So, what are we trying to optimize? The intuitive thing we want to minimize is the difference between xhat (the output from our higher dimensional feature encoding) and x (the original residual stream values), since the closer these are, the better our higher dimensional representation does at representing the residual stream. We do this with mean squared error. However, if we ended there we would just get lots of activations on each dimension, and what we really want is as few activations as possible so that we optimize towards clear features that are each only a single dimension, so we add the L1 norm of z to the loss. Basically we want each input to activate as few dimensions as possible, so we should see z as a whole bunch of zeroes plus a couple of positive values, each one corresponding to a different feature being on. We use the L1 norm so that for a feature to be worth moving from 0 to any positive amount, it has to reduce MSE by at least the linear cost of activating. This second part of the loss is there to encourage sparseness, and we control how sparse we want it by multiplying the L1 norm by a coefficient (usually called lambda or the L1 coefficient), which determines how expensive activating is.
A larger coefficient makes activating more expensive and means a given token will have fewer features activating. It seems like in practice this coefficient is usually set so that we see something like 50-200 features active per token. It’s not super clear to me how we’ve decided what a good number of features is. How many different features could I reasonably expect to be acting on a given token? It seems like there are a lot, but more than 100? I don’t have great intuition here.
4. So, after we train, what exactly is the helpful output? The encoding and decoding matrices. Max activating examples are the tokens that produce the highest values for a given feature in z, which you get by multiplying the residual stream value of that token by the encoding matrix. The rows of the decoding matrix are fundamentally the features. We can then see a feature’s logit impact by simply multiplying its activation value by its decoder direction and then by the unembedding matrix (residual stream to vocab). So, the logit distributions on neuronpedia are really just a function of the activation distribution, but bucketed over tokens instead of over total number of activations.
5. xhat won’t be exactly equal to x, there will be some error (variance unexplained by the SAE), this means that when running experiments using the SAE such as modifying the activations of a given feature we need to add back the error term so that the experiment only tests the impact of modifying the SAE activation value and not testing the impact of the unexplained variance.
6. Seems like I should always use JumpReLU instead of ReLU.
7. Anthropic on trying to find features for concepts they care about (so that they can then modify, ablate, etc): “Often the top-activating features on a prompt are related to syntax, punctuation, specific words, or other details of the prompt unrelated to the concept of interest. In such cases, we found it useful to select for features using sets of prompts, filtering for features active for all the prompts in the set. We often included complementary “negative” prompts and filtered for features that were also not active for those prompts. In some cases, we use Claude 3 models to generate a diversity of prompts covering a topic (e.g. asking Claude to generate examples of “AIs pretending to be good”). In general, we found multi-prompt filtering to be a very useful strategy for quickly identifying features that capture a concept of interest while excluding confounding concepts.
  While we mostly explored features using only a handful of prompts at a time, in one instance (1M/570621, discussed in Safety-Relevant Code Features), we used a small dataset of secure and vulnerable code examples (adapted from [22]) and fit a linear classifier on this dataset using feature activity in order to search for features that discriminate between the categories.”
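The SAE forward pass, loss, and error-term trick described above can be sketched in a few lines of numpy. This is a toy under stated assumptions, not anyone's actual training code: the sizes, the random (untrained) weights, the `l1_coeff` value, and the plain ReLU are all placeholders, and there is no optimizer step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, batch = 16, 64, 32  # toy sizes (assumption)
l1_coeff = 1e-3                     # hypothetical sparsity coefficient (lambda)

# Stand-in for a minibatch of cached residual stream values.
x = rng.normal(size=(batch, d_model))

# Untrained weights, only here to show shapes and data flow.
W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm feature directions
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode to the wide sparse space z, then decode back to residual dim."""
    z = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # subtract decoder bias, encode, ReLU
    x_hat = z @ W_dec + b_dec                         # decode, add decoder bias back
    return z, x_hat

def sae_loss(x):
    """Reconstruction error plus an L1 penalty on the feature activations."""
    z, x_hat = sae_forward(x)
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(z), axis=1))
    return mse + l1_coeff * l1

z, x_hat = sae_forward(x)
error = x - x_hat  # the part of x the SAE does not explain

# Intervention: clamp feature 3 to zero, decode, and add the error term back
# so the experiment only changes what feature 3 was contributing.
z_edit = z.copy()
z_edit[:, 3] = 0.0
x_edit = z_edit @ W_dec + b_dec + error

print("loss:", sae_loss(x))
print("mean active features per token:", (z > 0).sum(axis=1).mean())
```

With the error term added back, `x - x_edit` is exactly feature 3's activation times its decoder row, so nothing else about the residual stream moves.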
Linear Probes (and Contrastive Vectors)
1. Needs a labeled dataset. Good for investigating a specific concept: both for finding out whether the model understands the concept, and for trying to amplify or erase it (very useful if there is a specific concept that I know I want to steer on). You just have to make the dataset. It can be a binary labeled dataset (true/false, ie you can also train on model deceptiveness), a continuous dataset (latitude/longitude of the location the sentence describes, a sentiment score of 1-10, etc), or multiple choice (which football team is being described). The hardest part is clearly dataset creation, especially since with large models you need pretty large datasets so that the probe doesn’t accidentally train on unexpected aspects (ie if truths are longer than lies it might just train on the length of the given example).
2. For each example in the labeled dataset do a forward pass and take the residual stream value for the final token at each layer (since the final token is the one where all information is being gathered for the eventual prediction in the final layer). These residual stream values at each layer are the inputs. At each layer we then train a weight vector and a bias value. The output value is the dot product of the residual stream with the weight vector, plus the bias. Depending on what type of label we have, we apply different nonlinearities and different loss functions. For binary labels we use a sigmoid nonlinearity and then cross entropy loss, since together these give a maximally easy to calculate derivative with respect to the pre-sigmoid output of literally just prediction minus truth (ie if we predict .7 and the true value is 1 we get a gradient of -.3, which pushes the prediction up). Once we have trained these weights for each layer we can find which layer has the best accuracy (usually middle layers).
3. Creating/getting the labeled data is clearly the hardest part. Seems like once there is good labeled data it’s pretty trivial to get out good results to easily steer the model etc. This technique of comparing outputs of different types of data is very powerful. Anthropic uses “contrastive vectors” to steer the model to be less aware when it is being evaluated. The paper this is based off of doesn’t even need to train a linear probe model, it literally just subtracts the mean residual stream values for each of the two different binary datasets and then steers based on that difference. This is absolutely stupid simple (it seems like simple is better than more complex is a frequent theme in neural networks) and as far as I can tell is used to help align state of the art models. Contrastive vectors find the direction of average shift between the two, the probe finds the direction of best classification accuracy.
  From Opus 4.5 system card: “Contrastive vectors: We record the model’s activations in response to a positive prompt (or set of prompts) indicating that the model is being evaluated, and a negative prompt (or set of prompts) indicating that the model is not being evaluated. We subtract the activations on the final tokens of these prompts.”
  From Opus 4.6 system card: “In our automated behavioral audits, we found that inhibiting one particular evaluation awareness vector (Testing vs. Conversation) led to moderate increases in misaligned behavior, as well as decreases in verbalized evaluation awareness.”
4. About as resource intensive as DLA, and quite simple/easy once you’ve got the labeled data.
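Here is a small sketch of both ideas from this section on synthetic data: a logistic-regression probe trained with the "prediction minus truth" gradient, and a contrastive vector that is literally a difference of class means. The activations are fake (Gaussian noise plus a shift along a hidden `concept` direction that I made up), so this only shows the mechanics, not what a real model's residual stream looks like.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n = 16, 400  # toy sizes (assumption)

# Hypothetical final-token residual stream values at one layer. The two
# classes differ by a shift along a hidden unit-length "concept" direction.
concept = rng.normal(size=d_model)
concept /= np.linalg.norm(concept)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + 3.0 * labels[:, None] * concept

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Linear probe: one weight vector and one bias, sigmoid + cross entropy.
w = np.zeros(d_model)
b = 0.0
lr = 0.1
for _ in range(1000):
    pred = sigmoid(acts @ w + b)
    grad_logit = pred - labels           # d(cross entropy)/d(pre-sigmoid output)
    w -= lr * acts.T @ grad_logit / n    # average gradient over the dataset
    b -= lr * grad_logit.mean()

probe_acc = np.mean((sigmoid(acts @ w + b) > 0.5) == labels)

# Contrastive vector: no training at all, just subtract the class means.
contrast = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print("probe accuracy:", probe_acc)
print("cos(probe weights, concept):", cosine(w, concept))
print("cos(contrastive vector, concept):", cosine(contrast, concept))
```

In this toy setup the noise is the same in every direction, so the probe and the mean difference point roughly the same way. They can diverge on real activations, where the direction of best classification is not necessarily the direction of average shift.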
Activation Patching (aka Causal Patching)
1. Let’s say you have two inputs to an LLM with the same number of tokens. For example: “The capital of France is” (“ Paris”) and “The capital of Germany is” (“ Berlin”). What if we want to figure out which parts of the LLM are most important for correctly predicting “ Paris”? One way to think about this is to ask in which token position the most important information is being stored. Taking the Germany run, we could patch in the residual stream value of the “ France” token after a given layer, replacing the residual stream value of the “ Germany” token at that layer. You would likely find that doing this at basically any of the early and middle layers leads to the model outputting “ Paris” instead of “ Berlin”. However, at some mid/late layer it would stop working, because by then the information has already been moved to the final token (“ is”) in preparation for that token to predict the next one. This is one way to figure out which token in “The capital of France is” holds the crucial information for predicting “ Paris”. We could also try to see which layers are most important by looking at which layer output does the most heavy lifting. For example, we might find that there is a specific attention layer that moves the information from the token that conveys the country to the final token, so if we replaced that attention layer’s output in the Germany run with its output from the France run we might significantly improve the final probability of “ Paris”. Similarly, there might be an MLP layer where the model looks up the capital of the country it thinks the sentence is describing. If we replace that specific MLP layer’s contribution we could significantly improve the odds of the model predicting what we want.
The core idea behind activation patching is that it lets us see at which layers (or even at which specific attention head in which layer) important calculations are being done, whether that is moving the relevant information around or figuring out the relevant information.
2. Activation patching is in some ways similar to DLA: you are finding the layers where important computations are being done towards getting the right answer. However, activation patching is significantly more computationally taxing. If we want to test every layer we have to do a forward pass for both strings, as well as a forward pass from the patched layer onward for every patch/layer combination. Activation patching also requires more setup, since you need two strings with the same number of tokens that test the hypothesis you want to test, although to some extent this also gets at the strength of the technique. Instead of replacing a given layer’s output with the output from a different example at that layer, you could just as easily ablate that output. Ablating an output is much simpler, all you have to do is set it to 0, but the reason to do activation patching instead is twofold: 1) You can specifically target something (like just the difference in country). 2) Replacing a given output with zeros is likely out of distribution. It isn’t something you would expect the model to ever deal with in normal function, so now you have to worry that the ablation has added unforeseen issues. Replacing it with an output that we know is in distribution for the given layer ensures that the model is working in distribution, and we don’t have to worry about it going wrong in unforeseen ways.
3. Activation patching is clearly most helpful as an intermediate step when we are trying to test something: it’s a way to see which parts of the model are important to a given thing. For example, if we are trying to figure out at what point in the model it does the calculations that determine how truthful it should be to the user, we could start with a DLA of examples surrounding truthfulness. Then we could use activation patching on the most promising examples in the places where we had the highest DLA values. Only having to test those places could significantly cut down on computation and show us which specific parts of the model are important to the thing we are testing. Once we know which parts are important, we can play around with those specific parts using the other tools at our disposal. Activation patching also isn’t that resource intensive, certainly way less than SAEs. To some extent it is the least resource intensive way to directly test causally whether something is having an impact, which neither DLA nor linear probes can do.
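A minimal sketch of the patching loop, on a toy residual network rather than a real transformer. Assumptions to flag: the "model" is a single residual vector with `tanh` layers (no token positions or attention), the clean/corrupt inputs are random vectors standing in for the France/Germany prompts, and token ids 0 and 1 stand in for “ Paris” and “ Berlin”.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, vocab = 8, 4, 6  # toy sizes (assumption)
W = [rng.normal(size=(d_model, d_model)) * 0.5 for _ in range(n_layers)]
W_U = rng.normal(size=(d_model, vocab))

def run(x0, patch=None):
    """Forward pass of the toy model.

    `patch` maps a layer index to a layer output that replaces the one the
    model would have computed. Returns the logits and a cache of layer outputs.
    """
    patch = patch or {}
    resid = x0.copy()
    cache = {}
    for i in range(n_layers):
        out = np.tanh(W[i] @ resid)
        if i in patch:
            out = patch[i]      # swap in the activation from the other run
        cache[i] = out
        resid = resid + out     # layer output is added to the residual stream
    return resid @ W_U, cache

clean_x = rng.normal(size=d_model)    # stand-in for the "France" prompt
corrupt_x = rng.normal(size=d_model)  # stand-in for the "Germany" prompt
clean_logits, clean_cache = run(clean_x)
corrupt_logits, corrupt_cache = run(corrupt_x)

answer, wrong = 0, 1  # hypothetical token ids for " Paris" and " Berlin"

def logit_diff(logits):
    return logits[answer] - logits[wrong]

# Patch one clean layer output at a time into the corrupted run and measure
# how far the logit difference moves towards the clean answer.
effects = {}
for i in range(n_layers):
    patched_logits, _ = run(corrupt_x, patch={i: clean_cache[i]})
    effects[i] = logit_diff(patched_logits) - logit_diff(corrupt_logits)

for i, effect in effects.items():
    print(f"layer {i}: change in logit diff = {effect:+.3f}")
```

The cost described above is visible here: two full forward passes to fill the caches, then one more pass per patched location. This sketch reruns the whole model each time; a real implementation would restart from the patched layer.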
Circuit Tracing
1. Training this is brutal, there is no way I personally am ever going to be training a cross-layer transcoder (CLT) on anything resembling a real model, so my intuition is that really understanding the full mechanics is somewhat less important. The high level intuition on circuit tracing is that we are effectively replacing the MLP layers with something human understandable. The MLP takes in the residual stream up to that point and outputs something to be added to the residual stream. A CLT is very similar to an SAE, except that instead of training a sparse, much higher dimensional encoding that decodes back to the residual stream, the CLT trains a sparse, much higher dimensional encoding of each MLP’s input that is then separately decoded (with different decoding matrices) to the MLP output of its own layer and of every later layer. SAEs are trying to understand the residual stream values, CLTs are trying to understand the operation that the MLP layer does. The goal is for the sum of all of these decodings in a given layer (ie the 4th layer would receive a decoding from the 1st, 2nd, 3rd, and 4th layer encodings) to get as close as possible to the true MLP output for that layer.
2. Why do we use a given encoding in its own layer and all layers after? Because once an MLP layer writes to the residual stream, that contribution is part of the input to every later layer, so decoding across layers lets us avoid having to calculate the multi-step impact manually. Also, when transcoders only decode to their own layer’s output (per-layer transcoder, PLT), reconstruction results are empirically worse.
3. To get a CLT, first you train it. Then, you see how well it performs. Then you can use it to try to understand why the model predicts what it does for any input. To a first approximation the better the CLT the more will be explained by the CLT outputs and the less will be explained by the generic error term.
4. How to train? For each MLP layer input there is a single encoding matrix that encodes that input to a much higher dimension (think 50x or more). The encoded values are then decoded for that layer and for each following layer, with each target layer having its own decoding matrix for a given source layer’s encoding. As an example, in a model with 4 MLP layers the 1st layer would encode the input to that MLP layer, and that encoding would then be decoded by four different decoding matrices, one each for the 1st, 2nd, 3rd, and 4th layer outputs. On the flip side, the yhat output of the 4th MLP layer in this CLT model would be the sum of the decoded values from the 1st, 2nd, 3rd, and 4th layer encodings. The loss function has two components. The first is how well the CLT reconstructs the true MLP outputs: at each layer there is a yhat output which is the sum of all decoded outputs written to that layer, and the reconstruction loss is the squared l2 norm of y-yhat summed over all layers. Then there is the sparsity loss term, which exists to encourage the high dimensional encodings to be as sparse as possible (ie the vast majority 0s, maybe only a couple hundred nonzero terms out of potentially 100k+ features). To a first approximation this loss is the total number of nonzero features across all encoded values. This isn’t quite true, since for each encoded value the loss contribution is somewhere between 0 and 1 based on how large the activation is and how large the decoder weights are for that feature. It uses tanh to ensure the values are between 0 and 1 and are easily differentiable, plus a manually set coefficient that changes how intense an activation you need to get towards a full 1 value, but I don’t think the exact formula is worth digging into.
5. Once the CLT has been trained you can test how well it does by running a full forward pass using the CLT outputs instead of the MLP outputs at every layer, and seeing how close the CLT model’s final predictions are to the normal model’s predictions. This is the only setting where the replacement model behaves fully normally, with each layer taking its inputs from the previous layer: no error terms, no freezing. It also means that since each CLT layer’s output is slightly different from what the MLP output would be, these errors can compound. In the seminal Anthropic paper the best they could get was a CLT that predicted the same next token as the original model 50% of the time.
6. What you and I will spend most of our time looking at is the “local replacement model”. This applies the trained CLT to a specific text input, for example “The Eiffel Tower is located in”, where you can see at each layer which features in the high dimensional CLT encoding are most activated and their various impacts on the MLP outputs at later layers. You first run the normal model and record everything (residual stream values, attention patterns, LayerNorm denominators, and MLP outputs). Then you construct a new computational graph where the MLPs are replaced by CLT features, attention patterns and LayerNorm are frozen at their recorded values, and error corrections are added to make the whole thing exactly match the original model's outputs. Because attention and LayerNorm are frozen, and MLPs are replaced by CLT features, the only remaining nonlinearities are the feature activations themselves. Everything between features is linear. This means you can compute the exact influence of any feature on any downstream feature.
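The CLT bookkeeping (one encoder per layer, one decoder per source/target layer pair, summed reconstructions, tanh sparsity penalty, per-layer error terms) is easier for me to follow as a shape-level sketch. Heavy assumptions here: random untrained weights, made-up MLP inputs/outputs for a single token, plain ReLU instead of whatever activation a real CLT uses, a hypothetical coefficient `c`, and my understanding that a layer's features also write to that same layer's MLP output. The penalty uses the norm of each feature's concatenated decoder vectors, which is one reading of "how large the decoder weights are".

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model, d_feat = 4, 8, 32  # toy sizes (assumption)

# Hypothetical recorded values for one token: each MLP's input and true output.
mlp_in = [rng.normal(size=d_model) for _ in range(n_layers)]
mlp_out = [rng.normal(size=d_model) for _ in range(n_layers)]

# One encoder per layer.
W_enc = [rng.normal(size=(d_model, d_feat)) * 0.3 for _ in range(n_layers)]
# One decoder per (source layer, target layer) pair with source <= target.
W_dec = {(s, t): rng.normal(size=(d_feat, d_model)) * 0.3
         for s in range(n_layers) for t in range(s, n_layers)}

# Sparse-ish feature activations at each layer (ReLU is an assumption).
feats = [np.maximum(0.0, mlp_in[s] @ W_enc[s]) for s in range(n_layers)]

# Reconstruction of layer t's MLP output: sum of decodings from layers 0..t.
y_hat = [sum(feats[s] @ W_dec[(s, t)] for s in range(t + 1))
         for t in range(n_layers)]

# Reconstruction loss: squared error summed over layers.
recon_loss = sum(np.sum((mlp_out[t] - y_hat[t]) ** 2) for t in range(n_layers))

# Sparsity penalty: each feature contributes a value in [0, 1) via tanh.
c = 1.0  # hypothetical sharpness coefficient
sparsity = 0.0
for s in range(n_layers):
    dec_concat = np.concatenate([W_dec[(s, t)] for t in range(s, n_layers)], axis=1)
    dec_norm = np.linalg.norm(dec_concat, axis=1)  # one norm per feature
    sparsity += float(np.sum(np.tanh(c * dec_norm * feats[s])))

# Error terms used by the local replacement model: whatever the CLT misses
# at each layer gets added back so the outputs match the real model exactly.
error = [mlp_out[t] - y_hat[t] for t in range(n_layers)]

print("decoder matrices:", len(W_dec))
print("reconstruction loss:", recon_loss)
print("sparsity penalty:", sparsity)
```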
Steering Vectors
1. Applying a vector to the residual stream at one or multiple places to try to get the model to behave differently. There are many ways to find a vector to add to the residual stream to alter the output of the model. You could run a linear probe, do a super simple mean difference (contrastive vectors), PCA, etc. But once you have a vector that you think will make the model more formal, or more truthful, or more playful, etc then steering is simply applying that vector at varying strengths at various places to try to get out different outputs than if you just ran the standard model without these additions to the residual stream.
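As a sketch of what "applying the vector at varying strengths at various places" means mechanically, here is a toy residual model with a steering hook. The model, the random `direction`, and the chosen layer and strengths are all made up for illustration. In practice the vector would come from a probe, a mean difference, PCA, etc.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, vocab = 8, 4, 6  # toy sizes (assumption)
W = [rng.normal(size=(d_model, d_model)) * 0.5 for _ in range(n_layers)]
W_U = rng.normal(size=(d_model, vocab))

def run(x0, steer_vec=None, steer_layer=None, alpha=0.0):
    """Toy forward pass. After `steer_layer`, add alpha * steer_vec to the residual stream."""
    resid = x0.copy()
    for i in range(n_layers):
        resid = resid + np.tanh(W[i] @ resid)
        if steer_vec is not None and i == steer_layer:
            resid = resid + alpha * steer_vec  # the steering intervention
    return resid @ W_U

# Hypothetical steering direction, normalized to unit length.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

x = rng.normal(size=d_model)
base = run(x)

# Sweep the strength at one layer and watch the top token change (or not).
for alpha in (0.0, 2.0, 4.0, 8.0):
    steered = run(x, direction, steer_layer=1, alpha=alpha)
    print(f"alpha={alpha:>4}: top token {int(np.argmax(steered))}, "
          f"max logit shift {np.abs(steered - base).max():.3f}")
```

Steering after the last layer shifts the logits by exactly `alpha * direction @ W_U`. Steering earlier also changes what every later layer computes, which is why the layer choice matters as much as the strength.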
Ablation
1. Fully removing the impact of something on the model. If you’re working through a forward pass it’s trivial to ablate a layer, since all you have to do is zero out the output of that layer when it adds back to the residual stream (zero ablation). If you want to keep the same average magnitude, you could instead replace a given layer output with the mean of lots of outputs from that layer on other input texts at that position (mean ablation).
2. Let’s say you find a direction for something in the model, let’s say the refusal direction. You can add or subtract this direction times any coefficient. Large positive coefficients lead to the model refusing the user more, negative coefficients lead to less refusal, and as the coefficient gets more negative you get the opposite of refusal. If you want to take out exactly the amount of this direction that is present in the model, you can direction ablate: normalize the direction to a unit vector (divide the vector by its l2 norm), take the dot product of this unit vector with the residual stream value, multiply that dot product by the unit vector, and subtract the result from the residual stream (fully removing this dimension from it). You could also try direction ablating attention head outputs or MLP outputs to find which specific layer/head most contributes to the refusal, and see what happens when ablating in only one specific layer output.
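The three ablations above are each a one-liner in numpy. A minimal sketch, assuming a residual stream of shape (positions, d_model), a random stand-in for the direction, and a hypothetical stack of reference outputs for mean ablation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pos, d_model = 5, 8  # toy sizes (assumption)

resid = rng.normal(size=(n_pos, d_model))      # residual stream, one row per token
layer_out = rng.normal(size=(n_pos, d_model))  # some layer's output on this text
# Hypothetical outputs of the same layer on 20 other input texts.
reference_outs = rng.normal(size=(20, n_pos, d_model))

def zero_ablate(layer_out):
    """Replace the layer output with zeros."""
    return np.zeros_like(layer_out)

def mean_ablate(layer_out, reference_outs):
    """Replace the layer output with the per-position mean over other texts."""
    return reference_outs.mean(axis=0)

def direction_ablate(x, direction):
    """Remove the component of x along `direction` at every position."""
    unit = direction / np.linalg.norm(direction)  # normalize to a unit vector
    coeff = x @ unit                              # how much of the direction is present
    return x - coeff[..., None] * unit            # subtract exactly that much

direction = rng.normal(size=d_model)  # stand-in for e.g. a refusal direction
unit = direction / np.linalg.norm(direction)

ablated = direction_ablate(resid, direction)
print("component along direction before:", np.round(resid @ unit, 3))
print("component along direction after: ", np.round(ablated @ unit, 3))
```

Direction ablation is a projection, so doing it twice changes nothing, and everything orthogonal to the direction is left untouched.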
Contrast-Consistent Search
1. Seems like a pretty bad technique, almost strictly worse than a simple linear probe. Yes, it doesn’t require labels but it does require contrastive pairs. It seems like if you’ve already got a bunch of contrastive pairs, then labelling them shouldn’t be that much more work.
2. The input is the residual stream for the final token at each layer for a bunch of paired prompts that all differ in the same way. For example: “Gold is Au. T or F? Answer: Yes” vs “Gold is Au. T or F? Answer: No”. You don’t use any labels. The objective is to get the CCS probe to output a 0 for one member of each pair and a 1 for the other. The way this is done canonically is by training a small MLP on the residual stream that takes the, let’s say, 2304 dim residual stream through a linear layer (matrix mul) down to, let’s say, 32 dimensions → ReLU → linear layer from 32 dim to 1 → sigmoid (so the output is between 0 and 1). I also tried a single linear layer (ie 2304 straight to 1) and a sigmoid, which worked very similarly. If the outputs for each pair are p1 and p2, the loss for that pair is:
  (p1 + p2 - 1)^2 + min(p1, p2)^2
  summed across all matched pairs. In other words, there is higher loss if the two probabilities don’t sum to 1, and there is higher loss if they aren’t confident (confident meaning one of the pair is predicted as a 0 and the other as a 1).
3. The core idea in CCS is okay: you find some direction that, when applied to every pair of residual stream values, makes one of the pair go to 0 and the other go to 1. The core issue (and also the reason for this method to exist) is that you don’t actually know what you are finding. In the “Gold is Au” example the hope is that we find the direction of “the model thinks this is true” and assign 1 to the true one and 0 to the false one. However, the much simpler way to split is to assign 1 to the one that ends in Yes and 0 to the one that ends in No. So, to account for that you need to subtract out the Yes/No dimension. However, let’s say that we are testing on an Elements dataset where some of the claims are of the form “Gold is Au” and others are “Gold is the 72nd element on the periodic table”. Now even though we can subtract out the Yes/No dimension for all pairs, we can’t subtract out the interaction between sentence length and the word Yes/No, so we would probably find a dimension that labels long sentence/yes → 1, long sentence/no → 0, short sentence/yes → 0, short sentence/no → 1. And this gets at the core of the issue with CCS: if you are trying to use it to find an important but subtler dimension, you have to be extremely careful that your data doesn’t have any other variance in it that would allow finding a different dimension that consistently labels 0/1. You have to be super fucking careful about the pairs and really make sure that they are all really similar in every way other than the thing you care about.
4. The hope with CCS is that you can find things where you don’t actually know the label, such as whether the model actually cares about the user, ex: “please help me: I would love to” vs “please help me: I would hate to”, where we are trying to figure out things we have no way of knowing. The core issue is that we really have no idea if the direction we find is actually what we care about. And this is what I found when I compared CCS to a probe trained on labeled data, using it to determine whether the model understands whether a given statement about the world is correct. In datasets where the input pairs were all extremely similar to each other it could get close to as good as a linear probe, but it was extremely fragile: as soon as the dataset of input pairs started to have some differences (such as multiple domains, or multiple sentence constructions) it stopped finding correctness and started finding something else entirely.
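For reference, here is a small sketch of the CCS objective and the "subtract out the Yes/No dimension" step. The loss is the standard CCS consistency-plus-confidence form, (p1 + p2 - 1)^2 + min(p1, p2)^2 summed over pairs. The normalization here is just per-class mean subtraction, a simplification on my part. The activations and probe weights are random stand-ins and nothing is trained.

```python
import numpy as np

def ccs_loss(p_yes, p_no):
    """Consistency + confidence loss, summed over all matched pairs."""
    consistency = (p_yes - (1.0 - p_no)) ** 2    # the two outputs should sum to 1
    confidence = np.minimum(p_yes, p_no) ** 2    # and one of them should be near 0
    return float(np.sum(consistency + confidence))

def normalize(acts):
    """Subtract the mean of one side of the pairs (all 'Yes' or all 'No' prompts).

    Doing this separately for each side removes the average difference
    between the two sides, ie the plain Yes/No direction.
    """
    return acts - acts.mean(axis=0, keepdims=True)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
n_pairs, d_model = 50, 16  # toy sizes (assumption)

# Hypothetical final-token residual stream values for the two sides of each pair.
acts_yes = rng.normal(size=(n_pairs, d_model)) + 2.0  # offset mimics a Yes/No shift
acts_no = rng.normal(size=(n_pairs, d_model))

# Single linear layer + sigmoid probe with random (untrained) weights.
w = rng.normal(size=d_model) * 0.1
b = 0.0
p_yes = sigmoid(normalize(acts_yes) @ w + b)
p_no = sigmoid(normalize(acts_no) @ w + b)

print("loss of the untrained probe:", ccs_loss(p_yes, p_no))
print("perfect pair (1, 0):", ccs_loss(np.array([1.0]), np.array([0.0])))
print("unconfident pair (0.5, 0.5):", ccs_loss(np.array([0.5]), np.array([0.5])))
print("inconsistent pair (1, 1):", ccs_loss(np.array([1.0]), np.array([1.0])))
```

Note that nothing in the loss mentions truth. Any direction that cleanly splits each pair scores just as well, which is exactly the fragility described above.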

https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

Why I'm Doing This

Cormac — Sun, 08 Feb 2026 21:42:53 GMT

For at least four years I have believed that there is a 10% chance approximately all humans will be dead within something like 10 years. Four years later, I now think something more like 13% chance within 7 or so years.

Purpose is hard. I don’t think I have ever had a job that seemed that worthwhile or fun such that I would inherently want to do it forever, or even for the next 10 years. So, given all jobs are kind of eh, I might as well work one that most quickly gets me to a state where I never need to work again. Having enough money to free myself from having to work was my purpose after college. I managed it in about 4 years, and then once again I was purposeless. I briefly tried making so much money that I could support causes that I cared about (both locally in my community and globally), but working just to give didn’t feel purposeful and involved working a job that while interesting and relatively fun, was just super stressful. It’s been about a year since the ‘make oodles of money’ job blew up. I have really really really struggled to find aim in that time. I have tried working on things that I find fun, but eventually they stop being fun and then I stop working on them. I have tried really leaning into other aspects of my life (mostly community), but there’s something weird about a great community where the community is my purpose? To some extent it feels like the “easy way out” to purpose is just to have kids and lean into raising lots of kids in my wonderful community, after all that’s been most people’s purpose for most of history. I’ve even considered pursuing the non intellectual things that I love (mostly climbing and dancing), but I think I like solving problems too much to lean that far into my physical body as purpose.

Then, someone I think extremely highly of asked me “If you think there’s something that has a high chance of killing all humans, why the fuck aren’t you doing anything about it?!” (or something along those lines I’m terrible at remembering the exact words someone has said.) I didn’t really have a good answer. To some extent I think it’s because it has seemed intractable, solving the problem of AI potentially killing everyone has always seemed like an extremely technical problem. I would certainly consider myself both technical and a nerd, but isn’t aligning AI for the super duper turbo nerds? (said affectionately) This person seemed to think there was plenty of potential positive impact regardless of whether or not I quite make it into that bucket of turbo nerd. And, the more I thought about it the more I realized it’s crazy that I haven’t even tried working on what I consider the world’s most important open problem.

My guess is that I won’t be at that highest echelon of AI safety research, I’ve certainly seen that next level of quant genius and I strongly suspect that’s not me. But, that doesn’t mean there’s no positive impact for me to find. It seems like there’s a whole range of ways to help humanity not get wrecked by AI, and maybe working towards that will feel purposeful. I certainly hope so. If not, all I’ve got left is having kids. Also, I’d really like to not die in the next 7 years! All of humanity not dying is I suppose also pretty important. As long as I can continue living a joyous life in all my other ways, which so far it seems like I’m pretty good at.

So, what am I doing? It seems like the kind of impact/job/etc I could do really depends on my technical skill: how deep I can get into the weeds to really understand these models. I am currently about a month into self studying, and by the time you read this hopefully I will be even further. I have had an AI whip up a curriculum and a tracker that I update each time I finish something, which you can find here. I’ve waited on writing this article until I felt like I had actually made real progress. It feels a little presumptuous to write out a whole “who I am, what I am doing” before I have made any progress. And, progress I have made! I have just written and trained my first super small GPT mostly by myself (I have AI assistance in both learning and bug fixing, but all lines are written by me)! I understand transformers and Multi Layer Perceptrons, which as far as I can tell is the core basic foundation.

What comes next? Part of why I am writing this and part of why I made the tracker public is to keep me honest. Learning is hard. 8 hours of learning one topic every day is hard, but so far it’s been going well! I am planning on continuing the journey to see just how technical I can get, in a couple months once I have a better idea of just how technical I can get it will be time for me to start thinking through what to actually do. If you have thoughts you think would be helpful for me to hear, I would love to hear them!

Follow my Progress!

Cormac — Thu, 05 Feb 2026 01:23:15 GMT

Before I try to figure out what I can do in the world of AI Safety, I’ve got to figure out how technical I can get. How technically competent I am has tremendous impact on what sorts of roles I would excel at. So, I’ve (AI has) created 3 months of curriculum to go from basically no technical understanding to as far as I can get! I am tracking it publicly, and updating it every time I do something.

View my AI Safety Learning Tracker