Those hypotheses sound correct for this case, but there are many others (especially with languages that don't use Latin characters) where it doesn't work.
Until this is sorted out, or someone finds a way to include a fix through prompt engineering, here is a workaround.
I just tested this solution (suggested by GPT-4) for post-processing the content. It seems to fix most typos and grammar issues, though I haven't tested it thoroughly:
Use a self-hosted, open-source spellchecker: LanguageTool.
Download the standalone version of LanguageTool from the official website: Index of /download/
It runs on Java. Once installed and running, it exposes a local API endpoint that you can access with curl. Below is an example Python function for HTML (also suggested by GPT-4, not tested):
import requests
from bs4 import BeautifulSoup, NavigableString

def correctSpellingGrammar(html_string, language_iso):
    # Address of the locally running LanguageTool server (adjust the port if yours differs).
    language_tool_api_url = 'http://localhost:8081/v2/check'

    def correct_text_nodes(soup, language_tool_api_url, language_iso):
        # Snapshot the text nodes first: replacing nodes while iterating
        # over soup.descendants would modify the tree mid-iteration.
        # The exact type check skips comments and doctypes, which subclass NavigableString.
        text_nodes = [node for node in soup.descendants
                      if type(node) is NavigableString and not node.isspace()]
        for node in text_nodes:
            text = str(node)
            response = requests.post(language_tool_api_url,
                                     data={'text': text, 'language': language_iso})
            response.raise_for_status()
            result = response.json()
            offset = 0  # running shift caused by earlier replacements
            for match in result.get('matches', []):
                if not match['replacements']:
                    continue  # LanguageTool had no suggestion for this match
                start_pos = match['offset'] + offset
                length = match['length']
                replacement = match['replacements'][0]['value']
                text = text[:start_pos] + replacement + text[start_pos + length:]
                offset += len(replacement) - length
            if text != str(node):
                node.replace_with(text)

    soup = BeautifulSoup(html_string, 'html.parser')
    correct_text_nodes(soup, language_tool_api_url, language_iso)
    corrected_html = str(soup)
    return corrected_html

html_string = "<p>Questa è una frase in italiano cn alcuni errori di ortografia e grammatica.</p>"
language_iso = "it"
corrected_html = correctSpellingGrammar(html_string, language_iso)
print(corrected_html)
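If you want to check the replacement step without a running LanguageTool server, here is a minimal sketch of just that part. The name `apply_matches` and the sample match list are made up for illustration (the matches are hand-written, not real server output); the only thing assumed from LanguageTool is that each match carries `offset`, `length`, and a `replacements` list of `{"value": ...}` entries. It applies the edits from the end of the string backwards, so no running offset is needed, and it skips matches that have no suggestion or that overlap an edit already applied.

```python
def apply_matches(text, matches):
    """Apply the first suggested replacement of each match to text.

    `matches` is assumed to look like the "matches" list of a
    LanguageTool /v2/check response (hypothetical sample data below).
    """
    # Work from the end of the string backwards so earlier offsets stay valid.
    ordered = sorted(matches, key=lambda m: m["offset"], reverse=True)
    limit = len(text)  # start of the most recently applied edit
    for match in ordered:
        start = match["offset"]
        end = start + match["length"]
        replacements = match.get("replacements") or []
        # Skip matches without a suggestion or overlapping an applied edit.
        if not replacements or end > limit:
            continue
        text = text[:start] + replacements[0]["value"] + text[end:]
        limit = start
    return text


# Hand-written sample, not real LanguageTool output.
sample = "Questa è una frase in italiano cn alcuni errori."
sample_matches = [
    {"offset": sample.index("cn"), "length": 2, "replacements": [{"value": "con"}]},
    {"offset": 0, "length": 6, "replacements": []},  # no suggestion: left alone
]
print(apply_matches(sample, sample_matches))
```

Taking the first suggestion blindly is the same choice the function above makes; it is convenient but can introduce wrong words, so it is probably worth reviewing the output for anything important.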
Hope this helps.