The best clue might come from a 2022 paper written by the Anthropic team back when their startup was just a year old. They warned that the incentives in the AI industry — think profit and prestige — will push companies to “deploy large generative models despite high uncertainty about the full extent of what these models are capable of.” They argued that, if we want safe AI, the industry’s underlying incentive structure needs to change.
Well, at three years old, Anthropic is now the age of a toddler, and it’s experiencing many of the same growing pains that afflicted its older sibling OpenAI. In some ways, they’re the same tensions that have plagued all Silicon Valley tech startups that start out with a “don’t be evil” philosophy. Now, though, the tensions are turbocharged.
An AI company may want to build safe systems, but in such a hype-filled industry, it faces enormous pressure to be first out of the gate. The company needs to pull in investors to supply the gargantuan sums of money needed to build top AI models, and to do that, it needs to satisfy them by showing a path to huge profits. Oh, and the stakes — should the tech go wrong — are much higher than with almost any previous technology.
So a company like Anthropic has to wrestle with deep internal contradictions, and ultimately faces an existential question: Is it even possible to run an AI company that advances the state of the art while also truly prioritizing ethics and safety?
“I don’t think it’s possible,” futurist Amy Webb, the CEO of the Future Today Institute, told me a few months ago.


You can’t build morality into it, as I said. You can build functionality into it that makes immoral use harder.
I can e.g.
Society considers e.g. hunting a moral use of weapons, while killing people usually isn’t.
So banning ceramic, unmarked, silenced, full-automatic weapons firing armor-piercing bullets can certainly be an effective way of reducing the immoral use of a weapon.
None of those changes impact the morality of a weapon’s use in any way. I’m happy to dwell on this gun analogy all you like because it’s fairly apt. However, there is one key difference central to my point: there is no way to do the equivalent of banning armor-piercing rounds with an LLM or making sure a gun is detectable by metal detectors - because, as I said, it is non-deterministic. You can’t inject programmatic controls.
Any tools we have for doing it are outside the LLM itself (the essential truth undercutting everything else) and furthermore even then none of them can possibly understand or reason about morality or ethics any more than the LLM can.
Let me give an example. I can write the dirtiest most disgusting smut imaginable on ChatGPT, but I can’t write about a romance which in any way addresses the fact that a character might have a parent or sibling because the simple juxtaposition of sex and family in the same body of work is considered dangerous. I can write a gangrape on Tuesday, but not a romance with my wife on Father’s Day. It is neither safe from being used as not intended, nor is it capable of being used for a mundane purpose.
Or go outside of sex. Create an AI that can’t use the N-word. But that word is part of the black experience and vernacular every day, so now the AI becomes less helpful to black users than white ones. Sure, it doesn’t insult them, but it can’t address issues that are important to them. Take away that safety, though, and now white supremacists can use the tool to generate hate speech.
These examples are all necessarily crude for the sake of readability, but I’m hopeful that my point still comes across.
I’ve spent years thinking about this stuff and experimenting and trying to break out of any safety controls both in malicious and mundane ways. There’s probably a limit to how well we can see eye to eye on this, but it’s so aggravating to see people focusing on trying to do things that can’t effectively be done instead of figuring out how to adapt to this tool.
Apologies for any typos. This is long and my phone fucking hates me - no way some haven’t slipped through.
Of course you can. Why would you not, just because it is non-deterministic? Non-determinism does not mean complete randomness and lack of control, that is a common misconception.
Again, obviously you can’t teach an LLM about morals, but you can reduce the likelihood of producing immoral content in many ways. Of course it won’t be perfect, and of course it may limit the usefulness in some cases, but that is also the case today in many situations that don’t involve AI, e.g. some people complain they “can not talk about certain things without getting cancelled by overly eager SJWs”. Society already acts as a morality filter. Sometimes it works, sometimes it doesn’t. Free-speech maximalists exist, but are a minority.
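To make the point about non-determinism concrete, here is a minimal sketch, assuming a toy four-word vocabulary with made-up scores (nothing here is taken from a real model or API). The hypothetical `sample_token` helper still picks its output at random, yet a banned token can never come out, because it is removed before the random draw happens.

```python
import math
import random


def sample_token(logits, banned=(), temperature=1.0, rng=random):
    """Randomly sample one token from softmax(logits / temperature),
    after dropping every token listed in `banned`."""
    allowed = {tok: score for tok, score in logits.items() if tok not in banned}
    if not allowed:
        raise ValueError("every token is banned")
    # Subtract the max score before exponentiating for numerical stability.
    peak = max(allowed.values())
    weights = [math.exp((score - peak) / temperature) for score in allowed.values()]
    return rng.choices(list(allowed), weights=weights, k=1)[0]


# Toy scores, invented purely for illustration.
toy_logits = {"hello": 2.0, "world": 1.5, "slur": 1.8, "goodbye": 0.5}

rng = random.Random(0)
draws = [sample_token(toy_logits, banned={"slur"}, rng=rng) for _ in range(1000)]
```

The draws vary from call to call, but the banned token never appears among them. Real systems are far messier than this, so take it only as an illustration that randomness and hard constraints can coexist.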
That’s a fair argument about free speech maximalism. And yes, you can influence output, but (being non-deterministic) since we can’t know precisely what causes certain outputs, we equally can’t fully predict the effect on potentially unrelated output. Great, now it’s harder to talk about sex with kids, but now it’s also harder for kids to talk about certain difficult experiences, for example if they’re trying to keep a secret but also need a non-judgmental confidante to help them process a difficult experience.
Now, is it critical that the AI be capable of that particular conversation when we might prefer it happen with a therapist or law enforcement? That’s getting into moral and ethical questions so deep I as a human struggle with them. It’s fair to believe the benefit of preventing immoral output outweighs the benefit of allowing the other. But I’m not sure that is empirically so.
I think it’s more useful to us as a society to have an AI that can assume both a homophobic perspective and an ally perspective than one that can’t adopt either or worse, one that is mandated to be homophobic for morality reasons.
I think it’s more useful to have an AI that can offer religious guidance and also present atheism in a positive light. I think it’s useful to have an AI that can be racist in order to understand how that mind disease thinks and find ways to combat it.
Everything you try to censor out of an AI has an unknown cost in beneficial uses. Maybe I am overly absolutist in how I see AI. I’ll grant that. It’s just that by the time we think of every malign use to which an AI can be put and censor everything it can possibly say, I think you don’t have a very helpful tool at all any more.
I use ChatGPT a fair bit. It’s helpful with many things and even certain types of philosophical thought experiments. But it’s so frustrating to run into these safety rails and have to constrain my own ADHD-addled thoughts over such mundane things. That was what got me going down the road of exploring the most awful outputs I could get and the most mundane sorts of things it can’t do.
That’s why I say you can’t effectively censor the bad stuff, because you lose a huge benefit of being able to bounce thoughts off of a non-judgmental response. I’ve tried to deeply explore subjects like racism and abuse recovery and thought experiments like alternate moral systems or have a foreign culture explained to me without judgment when I accidentally repeat some ignorant stereotype.
Yeah, I know, we’re just supposed to write code or silly song lyrics or summarize news articles. It’s not a real person with real thoughts and it hallucinates. I understand all that, but I’ve brainstormed and rubber ducked all kinds of things. Not all of them have been unproblematic because that’s just how my brain is. I can ask things like, is unconditional acceptance of a child always for the best or do they need minor things to rebel against? And yeah I have those conversations knowing the answers and conclusions are wildly unreliable, but it still helps me to have the conversation in the first place to frame my own thoughts, perhaps to have a more coherent conversation with others about it later.
It’s complicated and I’d hate to stamp out all of these possibilities out of an overabundance of caution before we really explore how these tools can help us with critical thinking or being exposed to immoral or unethical ideas in a safe space. Maybe arguing with an AI bigot helps someone understand what to say in a real situation. Maybe dealing with hallucination teaches us critical thinking skills and independence rather than just nodding along to groupthink.
I’ve ventured way further into should we than could we and that wasn’t my intent when I started, but it seems the questions are intrinsically linked. When our only tool for censoring an AI is to impair the AI, is it possible to have a moral, ethical AI that still provides anything of value? I emphatically believe the answer is no.
But your point about free speech absolutism is well made. I see AI as more of a thought tool than something that provides an actual thing of value. And so I think working with an AI is more akin to thoughts, while what you produce and share with its assistance is the actual action that can and should be policed.
I think this is my final word here. We aren’t going to hash out morality in this conversation and mine isn’t the only opinion with merit. Have a great day.
I will take a different tack than sweng.
I think that this is irrelevant. Whether a safety mechanism is intrinsic to the core functioning of something, or bolted on purely for safety purposes, it is still a limiter on that thing’s function, to attempt to compel moral/safe usage.
Any action has 2 different moral aspects:
Of course it is impossible to change the moral intent of an actor. But the LLM is not the actor, it is the tool used by an actor.
And you can absolutely change the morality of the outcome of an action (i.e. said weapon use) by limiting the possible damage from it.
Given that a tool is the means by which the actor attempts to take an action, it is also an appropriate place for safety controls that attempt to enforce a more moral outcome.
I think I’ve said a lot in comments already and I’ll leave that all without relitigating just for argument’s sake.
However, I wonder if I haven’t made clear that I’m drawing a distinction between the model that generates the raw output, and perhaps the application that puts the model to use. I have an application that generates output via OAI API and then scans both the prompt and output to make sure they are appropriate for our particular use case.
Yes, my product is 100% censored and I think that’s fine. I don’t want the customer service bot (which I hate but that’s an argument for another day) at the airline to be my hot AI girlfriend. We have tools for doing this and they should be used.
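For anyone wondering what that kind of application-level check looks like, here is a rough sketch under loud assumptions: `generate` is a hypothetical stand-in for whatever call actually reaches the model, and the keyword check in `is_appropriate` stands in for whatever scanner a real product would use. None of these names come from an actual API.

```python
from typing import Callable, Iterable

REFUSAL = "Sorry, I can't help with that here."


def is_appropriate(text: str, blocked_terms: Iterable[str]) -> bool:
    """Placeholder scanner: reject text containing any blocked term."""
    lowered = text.lower()
    return not any(term in lowered for term in blocked_terms)


def guarded_reply(
    prompt: str,
    generate: Callable[[str], str],
    blocked_terms: Iterable[str],
) -> str:
    """Check the prompt, call the model, then check the output too."""
    blocked = list(blocked_terms)
    if not is_appropriate(prompt, blocked):
        return REFUSAL
    output = generate(prompt)
    if not is_appropriate(output, blocked):
        return REFUSAL
    return output


def fake_model(prompt: str) -> str:
    """Hypothetical model stub so the sketch runs without any network call."""
    return f"Echo: {prompt}"
```

The filtering lives entirely in the wrapper, so the model behind `generate` is left untouched and each application can pick its own rules.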
But I think the models themselves shouldn’t be heavily steered because it interferes with the raw output and possibly prevents very useful cases.
So I’m just talking about fucking up the model itself in the name of safety. ChatGPT walks a fine line because it’s a product not a model, but without access to the raw model it needs to be relatively unfiltered to be of use, otherwise other models will make better tools.
Those changes reduce lethality or improve identification. They have nothing to do with morality and do NOT reduce the chance of immoral use.
Well, I, and most lawmakers in the world, disagree with you then. Those restrictions certainly make e.g. killing humans harder (generally considered an immoral activity) while not affecting e.g. hunting (generally considered a moral activity).
They can make killing multiple people in specific locations more difficult, but they do nothing to keep someone from being able to fire a single bullet for an immoral reason, hence the difference between lethality and identification and morality.
The Vegas shooting would not have been less immoral if a single person or nobody died. There is a benefit to reduced lethality, especially against crowds. But again, reduced lethality doesn’t reduce the chance of being used immorally.