Many AI models fail on a simple question

The simple question

Hamming and Levenshtein distance between "sitting" and "ittings"

Hamming distance counts the positions where two equal-length words differ, letter by letter.
Levenshtein distance is the minimum number of edits (replace, insert, delete) needed to transform one word into the other.

should be answered with 6 and 2 (a quick Python check follows the list below), but most models answer wrong, and it’s hard to convince them of the truth:

o1: 6 and 2 (correct)
GPT-4o: 6 and 2, sometimes 1 and 1
GPT-4o-mini: 1 and 1 (when I told it it was wrong, I got “sitting has 7 characters and ittings has 6”; I failed to convince this model even after a looong discussion)
GPT-4: 5 and 1; after “are you sure?”, 5 and 2
GPT-4.5: undefined and 1
o3-mini: 6 and 2
o3-mini-high: 6 and 2 (looking into the reasoning, its first thought was that “ittings” has 6 characters, but it decided to check again, real smart)
Gemini: 1 and 1
Copilot: 1 and 1
DeepSeek: 5 and 2; after “are you sure?”, no, it’s 4 and 2
DeepAI: 2 and 1
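
For anyone who wants to verify the expected answers themselves, here is a minimal Python sketch using the textbook definitions (a plain dynamic-programming Levenshtein, no external libraries):

```python
def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length strings differ."""
    assert len(a) == len(b), "Hamming distance needs equal-length strings"
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions."""
    # Classic dynamic-programming table, computed row by row.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(hamming("sitting", "ittings"))      # 6
print(levenshtein("sitting", "ittings"))  # 2 (drop the leading "s", append "s")
```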

Any ideas why most models fail on this simple question? Really, no one??