For some peace of mind, we can perform the search:
OUTPUT=$(find .cursor/rules/ -name '*.mdc' -print0 2>/dev/null | xargs -0 perl -CSD -wnE '
  # -CSD makes perl decode the files as UTF-8, so the \x{...}
  # code points below match characters rather than raw bytes.
  BEGIN { $re = qr/[\x{200B}-\x{200D}\x{202A}-\x{202E}\x{2066}-\x{2069}]/ }
  print "$ARGV:$.:$_" if /$re/;
  close ARGV if eof;  # reset $. so reported line numbers are per file
' 2>/dev/null)
FILES_FOUND=$(find .cursor/rules/ -name '*.mdc' -print 2>/dev/null)
if [[ -z "$FILES_FOUND" ]]; then
  echo "Error: No .mdc files found in the directory."
elif [[ -z "$OUTPUT" ]]; then
  echo "No suspicious Unicode characters found."
else
  echo "Found suspicious characters:"
  echo "$OUTPUT"
fi
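For a quick one-off check, GNU grep can do the same match, assuming it was built with PCRE support and you are running in a UTF-8 locale (otherwise the \x{...} escapes are rejected):

grep -rPn '[\x{200B}-\x{200D}\x{202A}-\x{202E}\x{2066}-\x{2069}]' .cursor/rules/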
- Can this be improved? Now, my toy programming languages all share the same "ensureCharLegal" function in their lexers. It's called on every single character in the input (including those inside string literals) and filters out all of those characters, plus all control characters (except LF), and also something else that I can't remember right now... some weird space-like characters, I think?
Nothing really stops non-toy programming and configuration languages from adopting the same approach, except that someone has to think about it and then implement it.
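For illustration, here's a minimal sketch of such a check in Perl, to stay consistent with the script above. The function name mirrors the one described, and the last character class is a guess at the "weird space-like characters" (NBSP, the U+2000 range, and friends), not anyone's actual lexer code:

use strict;
use warnings;

# Reject zero-width characters, bidi controls, all control
# characters except LF, and unusual space-like characters.
my $illegal = qr{
    [\x{200B}-\x{200D}]    # ZWSP, ZWNJ, ZWJ
  | [\x{202A}-\x{202E}]    # bidi embeddings and overrides
  | [\x{2066}-\x{2069}]    # bidi isolates
  | (?!\n)\p{Cc}           # control characters, LF excepted
  | [\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]  # odd spaces (guessed set)
}x;

sub ensure_char_legal {
    my ($ch, $line, $col) = @_;
    die sprintf "illegal character U+%04X at line %d, column %d\n",
        ord($ch), $line, $col
      if $ch =~ $illegal;
    return $ch;
}

Calling it on every character as it is read, string literals included, means there's exactly one place to audit when the list needs to grow.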
Here's a GitHub Action / workflow that says it'll do something similar: https://tech.michaelaltfield.net/2021/11/22/bidi-unicode-git...
I'd say it's good practice to configure GitHub (or whatever tool you use) to scan for hidden Unicode characters; ideally they're rendered very visibly in the diff tool.
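Until that's set up, a pre-commit hook can run the same Perl pattern over the staged files. A sketch, assuming GNU xargs and UTF-8 encoded files; the message wording is illustrative:

#!/bin/sh
# .git/hooks/pre-commit: refuse commits that introduce files
# containing zero-width or bidi control characters.
if git diff --cached --name-only --diff-filter=ACM -z |
    xargs -0 -r perl -CSD -wnE '
        if (/[\x{200B}-\x{200D}\x{202A}-\x{202E}\x{2066}-\x{2069}]/) {
            warn "$ARGV:$.: suspicious Unicode character\n";
            $found = 1;
        }
        END { exit 1 if $found }
    '
then
    exit 0
else
    echo "Commit blocked: suspicious Unicode characters found." >&2
    exit 1
fi

xargs exits non-zero if any perl invocation found a match, which is what flips the hook into the blocking branch.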