The Test That Tested Nothing
I've been using AI since before it "became good", which I'd put somewhere around October to December 2025, as opposed to the other times throughout 2024 and 2025 when people repeatedly told me "it's actually good now".
I was doing some work in an Elixir codebase recently and asked Claude Opus 4.7 to write a test before fixing a production bug, giving it a stacktrace. This is what it wrote:
test "case statement handles :ok response" do
# Simulate what happens in the case statement
result = :ok
output =
case result do
{:error, reason} -> {:error, reason}
:ok -> :success
end
assert output == :success
end
It also wrote several other tests along similar lines, with guard clauses and other complexities, all of which amounted to testing that Elixir the language works.
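For contrast, a useful test here would exercise the code that actually failed in production. A minimal sketch, assuming a hypothetical `Orders.process/1` standing in for the real function from the stacktrace:

```elixir
test "process/1 returns an error tuple for a malformed order" do
  # Reproduce the production input from the stacktrace, then assert on the
  # real function's behaviour rather than on a hand-written case expression.
  # Orders.process/1 and the empty-items input are placeholders here.
  assert {:error, _reason} = Orders.process(%{items: []})
end
```

If a test like this fails before the fix and passes after it, it has earned its place in the suite; the generated test above passes either way.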
This is not a new issue, but if state-of-the-art frontier models are still doing this, what can we do? The question became more personal this week: I was rejected for a job because I wasn't "AI-first" enough. To be fair, unlike most job descriptions this one was explicit that AI would be central to the role[1], so maybe the issue was that I was trying to bring too much nuance to an hour-long conversation.
The thing is, I am AI-first for the majority of my work (some freelance work where the client is anti-AI is the main exception), but I keep seeing AI write code like the above, or duplicate entire modules to solve problems the first module already solved. AI is without a doubt the biggest change I have seen in my career, and we're still in a state of flux where some excellent software engineers are totally against it while others are pumping the hype. I also see very different takes between the Rust and Elixir communities. My example is Elixir, and in general I've found Claude quite good at writing idiomatic Elixir, but its Rust is pretty bad, full of single-character variables and poor abstractions. There is a real gap in AI capability between languages, but I think a lot of the issues can be solved with process, and despite sharing some of the same misgivings as others, I think it's here to stay.
That doesn't mean we should just vibe code everything, though. I genuinely believe we're at a turning point where software quality and user experience are about to tank. I don't want that: things were starting to get better, we were mostly through the janky "every Android phone has different UI/UX" era, and people have become better at spotting dark design patterns. So how do we stop it? How can builders care when they become increasingly disconnected from what they build, as everything starts being filtered through LLMs?
I think software best practice is about to matter more than ever. What is "good" when AI is writing everything and, by its very nature, produces "average"? For now, I've started on my current pain points (bad tests, poor refactoring, and code duplication) by building an Elixir/OTP orchestrator that leans on AI's strong points while trying to cover its weaknesses. This is the biggest unsolved problem in the industry, so I don't expect to solve it, but exploring and trying beats pretending it's already solved. I'm going to start by enforcing the part of TDD that matters most and that AI totally ignores: the refactor loop. I'll keep an open mind, since best practice for humans is not necessarily best practice for AI, but with a mixture of mechanical tasks and AI quality gates maybe we can get closer to whatever that turns out to be.
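To make "mechanical tasks and AI quality gates" concrete, here's a minimal sketch of one gate step such an orchestrator might run after the AI's red/green phase. All names here are hypothetical rather than my actual project's API; it just shells out to common Elixir tooling (mix format, mix test, plus credo, assuming it's a dependency) and refuses to let a change through until every gate passes:

```elixir
defmodule QualityGate do
  @moduledoc """
  Runs mechanical checks in order: formatting, lint, tests.
  A hypothetical sketch, not a real orchestrator interface.
  """

  @gates [
    {"mix", ["format", "--check-formatted"]},
    {"mix", ["credo", "--strict"]},
    {"mix", ["test"]}
  ]

  @doc "Returns :ok if every gate passes, or the first failure."
  def run do
    Enum.reduce_while(@gates, :ok, fn {cmd, args}, :ok ->
      case System.cmd(cmd, args, stderr_to_stdout: true) do
        {_output, 0} -> {:cont, :ok}
        {output, status} -> {:halt, {:error, {cmd, args, status, output}}}
      end
    end)
  end
end
```

The mechanical half is the easy part; the interesting part is the AI quality gate that sits behind it, asking whether the new tests actually exercise the changed code rather than the language itself.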
[1] Because apparently what software engineers needed while job hunting was the additional stress of trying to telepathically infer a stranger's AI philosophy in a one-hour interview.