Leetcode-style questions no longer work. If it's solvable with a few functions within an hour, AI will solve it in 5 minutes.
If the job is cutting down trees, you can't measure candidates by how long it takes them to fell one tree; what matters is whether they have the stamina to cut through many.
Take-home assignments still work, and the good news is they can be shorter now. A day, or even four hours, of work is enough of a benchmark. Something like a Wordle clone is about the right level of complexity.
Things we look for:
1. Do they use a library? Some people are too proud to do things the easy way. Left to itself, GenAI will generate a word list from scratch, which is both wasteful and incomplete when a ready-made dictionary is a download away. Do they cut the dictionary down to the right size? It should contain only the words, not the definitions (see the trimming sketch after this list).
2. Architecture: what do they normally use? How do the parts link to one another? How do they handle errors?
3. Do they bring in something new? AI will usually reach for a five-year-old tech stack unless you specify one, because that's roughly the average age of the code it was trained on. If they're experienced enough to tell the AI to use the newer tech, they're probably experienced enough.
4. Require a starting commit (probably just a .gitignore) and ask them to make reasonably sized commits. Classic coding should look a bit like painting: layers get added. Vibe coding looks like sculpting, where you chip bits off. This will also catch more serious cheating, like someone else doing the work on their behalf: the commits may come from the wrong email addresses, or you'll see massive commits where nothing ever gets chipped off.
5. There are going to be people who think AI is a nuisance. A test like this helps you benchmark the factions against each other. Just don't pack in so much toil that AI users get an outsized advantage, and don't pose overly complex "solved" problems the AI can pull straight out of its training data.
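On point 1, the dictionary trimming is a one-off preprocessing step. A minimal sketch in Kotlin, assuming a raw file with one tab-separated word/definition pair per line (the file names and format are placeholders, not from any specific dictionary):

```kotlin
import java.io.File

fun main() {
    // Keep the headword, drop the definition (assumed "word<TAB>definition" format).
    val words = File("raw_dictionary.txt").readLines()
        .map { it.substringBefore('\t').trim().lowercase() }
        .filter { w -> w.length == 5 && w.all { it in 'a'..'z' } }
        .distinct()

    // The shipped asset is just the words, nothing else.
    File("wordle_words.txt").writeText(words.joinToString("\n"))
    println("Kept ${words.size} five-letter words")
}
```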
Can you walk me through what kind of architecture someone could express when making a Wordle clone in four hours?
We basically look for this: https://docs.flutter.dev/app-architecture/guide
It's similar across all mobile platforms. People call it MVVM, Bloc, MVP, etc., but what we want to see is the UI-repo-service pattern with unidirectional data flow. That's three layers, and it can be as few as three files. If someone can grasp that, it's good enough.
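To make the three layers concrete, here's a minimal sketch in plain Kotlin (the class names are mine, not from the Flutter guide; the same shape applies in Flutter or Swift). Events flow from the view into the view model, and state flows back out as read-only snapshots:

```kotlin
// Service layer: the raw data source.
class WordService(private val words: List<String>) {
    fun randomWord(): String = words.random()
}

// Repository layer: the only door the UI layer knocks on for data.
class WordRepository(private val service: WordService) {
    fun todaysWord(): String = service.randomWord()
}

// UI layer (view model): owns the game state, exposes it read-only to the view.
class GameViewModel(private val repo: WordRepository) {
    private val answer = repo.todaysWord()
    private val _guesses = mutableListOf<String>()
    val guesses: List<String> get() = _guesses

    // Unidirectional: an event comes in, a new state goes out.
    fun submitGuess(guess: String): Boolean {
        _guesses += guess
        return guess == answer
    }
}
```

Three classes, three files if you want; that's the whole pattern.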
There aren't many ways to screw up hiring mobile devs; most skills are trainable. But this is the one that costs months. One guy nested a viewmodel inside another viewmodel because he was using them as both the data store and the observable. He was let go, but even though he was only there for three months, it took two years to remove everything he did from production, mostly by rewriting entire blocks of code.
We also watch for overengineering and perfectionism. Some people insist on extra blocks or on splitting the app into modules. Fine, but can they do that in four hours?
AI can write all of it easily enough, but it's just a 2-3x multiplier. Some assume it's an infinite multiplier, that they can subscribe to Cursor on the spot and it'll all be done in a blink. But do the parts connect in the order given above?
Also, Wordle is not ideal for testing architecture, but it's good enough: it requires a solid grasp of the top layer and a connection down to the data layer. What if we switched the word source from the local dictionary to an API? That's a slightly more advanced test.
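If the repository was built against an interface, that switch is a wiring change, not a rewrite. A rough sketch, evolving the repository from the earlier sketch to depend on an interface instead of a concrete service (the endpoint and names are invented; a real app would make the call asynchronously and handle errors):

```kotlin
import java.io.File
import java.net.URL

// Both sources satisfy the same contract the repository depends on.
interface WordSource {
    fun fetchWords(): List<String>
}

class DictionaryFileSource(private val path: String) : WordSource {
    override fun fetchWords(): List<String> =
        File(path).readLines().filter { it.length == 5 }
}

class ApiWordSource(private val baseUrl: String) : WordSource {
    // Blocking call for brevity only.
    override fun fetchWords(): List<String> =
        URL("$baseUrl/words").readText().lines().filter { it.length == 5 }
}

class WordRepository(private val source: WordSource) {
    fun todaysWord(): String = source.fetchWords().random()
}

// The only line that changes when the source switches:
val repo = WordRepository(DictionaryFileSource("wordle_words.txt"))
// val repo = WordRepository(ApiWordSource("https://example.com/api"))
```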
An even more advanced test would be structuring it for TDD, which needs an understanding of DI, factories, tests and their gotchas, etc. I don't think this is doable even with AI yet, though.
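For what it's worth, here's roughly what the TDD-shaped version buys you, reusing the hypothetical classes from the sketches above: because the word source is injected, a test can pin the answer (kotlin.test shown; any test framework works):

```kotlin
import kotlin.test.Test
import kotlin.test.assertFalse
import kotlin.test.assertTrue

// Fake source so the test controls the answer deterministically.
class FixedWordSource(private val word: String) : WordSource {
    override fun fetchWords(): List<String> = listOf(word)
}

class GameViewModelTest {
    @Test
    fun correctGuessWins() {
        val vm = GameViewModel(WordRepository(FixedWordSource("crane")))
        assertFalse(vm.submitGuess("slate")) // wrong guess, game continues
        assertTrue(vm.submitGuess("crane"))  // right guess, game won
    }
}
```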
As a UI test, Wordle is pretty good because all your logic and variables live on one layer and get displayed through a different, more complex layer.
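Concretely, the logic half can be a single pure function with no UI types in it, and the grid-rendering layer only ever consumes its output. A sketch using the standard two-pass scoring so duplicate letters aren't double-counted (the enum and function names are mine):

```kotlin
enum class LetterState { CORRECT, PRESENT, ABSENT }

fun score(guess: String, answer: String): List<LetterState> {
    val remaining = answer.toMutableList()
    val result = MutableList(guess.length) { LetterState.ABSENT }
    // Pass 1: exact matches claim their letter first.
    for (i in guess.indices) {
        if (guess[i] == answer[i]) {
            result[i] = LetterState.CORRECT
            remaining.remove(guess[i])
        }
    }
    // Pass 2: right letter, wrong spot, only if a copy is still unclaimed.
    for (i in guess.indices) {
        if (result[i] == LetterState.ABSENT && remaining.remove(guess[i])) {
            result[i] = LetterState.PRESENT
        }
    }
    return result
}
```

The UI's only job is to map those three states onto colored tiles.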