AI cannot solve this simple problem! Wrong answer every time

Maybe you won’t believe me. I’ve been trying to solve a very simple problem for 3 days, and neither ChatGPT, nor Claude.ai, nor Gemini can solve it. Even though it is something very easy, no AI can solve it.

Task:
Develop a Python script that compares two HTML files, one in Romanian (RO) and one in English (EN), identifies the correspondences between their tags, and saves to an output file the tags that exist in RO but are missing from EN.

Details:

File structure: Each HTML file contains various tags, which may vary in structure and number between the two files. The ones that interest me are these:

 <p class="text_obisnuit">(.*?)</p>
 <p class="text_obisnuit2">(.*?)</p>
 <p class="text_obisnuit"><span class="text_obisnuit2">(.*?)</p>

Correspondence: Two tags are considered corresponding if they have the same functionality, even if their names or internal structure differ.
Missing tags: Some tags in the RO file may not have an exact counterpart in the EN file.

RO file: The path to the Romanian HTML file (ex: d:\3\ro\incotro-vezi-tu-privire.html)
EN file: Path to the English HTML file (ex: d:\3\en\where-do-you-see-look.html)

Output file:
A new HTML file with the same name as the EN file (ex: d:\3\Output\where-do-you-see-look.html) that contains all the tags from the EN file, plus the tags from the RO HTML file that do not exist in EN. So in practice, the Output HTML will have a few extra lines, with the tags that contain sentences in Romanian.

The content of the RO HTML tags has been translated into the EN HTML file, but some RO tags have not been translated and are not found in the EN HTML (I have already worked out which tags are missing).


This is the .rar project, containing both HTML pages, RO and EN, and the output, already solved, so you can see how the final result should look.

Or here, there are all 3 examples:

html RO:

https://gist.github.com/me-suzy/af6baaba5b46e3ff0507c0512524de7f

html EN:

https://gist.github.com/me-suzy/07d4f7632e8b293ff4b7b8e622e52569

html Output:

https://gist.github.com/me-suzy/5ee7e1824b735114c8d537d483201d67

Just for the sake of disclosure, did you try using both of the o1 models?

This code works, except that I hardcode the lines for the parsing. But the code I want must find the missing tags automatically and put them into the output.

import re
import os

def read_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

def write_file(file_path, content):
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(content)

def find_and_insert_missing_paragraphs(en_content, ro_content):
    # Keep the original EN content
    output_content = en_content

    # The specific RO paragraphs that must be added
    ro_specific = [
        '<p class="text_obisnuit">Aici este intelepciunea.</p>',
        '<p class="text_obisnuit">Chiar asa este?</p>'
    ]

    # Insertion points: insert after these EN paragraphs
    insert_after = [
        'Always the comfort of habit replaces the adventure of the unknown.</p>',
        'Isn\'t this where the story of man reaches its creative peak?</p>'
    ]

    # Insert each RO paragraph after its corresponding EN paragraph
    for i, en_marker in enumerate(insert_after):
        if i < len(ro_specific):
            marker_pos = output_content.find(en_marker)
            if marker_pos != -1:  # insert only if the marker was found
                insert_pos = marker_pos + len(en_marker)
                # Insert the RO paragraph with the original indentation
                output_content = output_content[:insert_pos] + '\n        ' + ro_specific[i] + output_content[insert_pos:]

    return output_content

def main():
    # File paths
    en_path = r'd:\3\en\where-do-you-see-look.html'
    ro_path = r'd:\3\ro\incotro-vezi-tu-privire.html'
    output_path = r'd:\3\Output\where-do-you-see-look.html'

    # Read the file contents
    en_content = read_file(en_path)
    ro_content = read_file(ro_path)

    # Process and generate the output
    output_content = find_and_insert_missing_paragraphs(en_content, ro_content)

    # Write the result
    write_file(output_path, output_content)

if __name__ == "__main__":
    main()

Of course! But they didn’t succeed in making good Python code.


Having worked on similar problems, have you considered first breaking down each HTML document into a tree structure, then comparing the trees to find the missing nodes? Obviously the comparison would have to be smart enough to identify when a node is missing while the branches exist in the other tree.

With years of using Prolog: the missing node would simply be a variable in an expression/term, and unification would fill the variable if it existed (that is a high-level view with lots of info glossed over, but Python cannot do that out of the box).
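The tree-diff idea above can be sketched in plain Python without unification: walk the two node sequences in parallel and collect every RO node that fails to match. The function and its `same_type` predicate are hypothetical stand-ins for a real structural comparison, not code from the thread:

```python
# Minimal sketch of the tree-diff idea: walk the two node sequences in
# parallel and collect every RO node with no EN counterpart.
def find_missing(ro_nodes, en_nodes, same_type):
    """Return RO nodes that have no positional counterpart in EN."""
    missing = []
    j = 0  # cursor into en_nodes
    for node in ro_nodes:
        if j < len(en_nodes) and same_type(node, en_nodes[j]):
            j += 1  # matched: advance the EN cursor
        else:
            missing.append(node)  # the "unbound variable": nothing matched it
    return missing
```

With a `same_type` predicate that compares tag structure (and possibly content length), this is essentially the alignment the working scripts perform.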

In the last 3 days I worked 10 hours a day on the code; I have considered everything. And if I didn’t take something into account, surely the 3 AIs did.

But still no one knows how to write the code. Although the problem is very easy: it’s just a matter of comparing some lines.

Currently I would never bet on that. While the AIs have been exposed to a lot of information, they are also full of gaps, irregularities, etc. I program in Prolog and Lean and have yet to find an LLM that can generate code for them correctly and consistently. Sure, they sometimes get a toy example right or know of something that I did not know, but overall these are just specific examples of where LLMs are not what people take them to be.

I know many who can come close, if not get it correct, but they have decades of experience. :slightly_smiling_face:

Noted, the following is one person that I would put on such a list:

EricGT

Please, try to find the solution yourself. See if you can…

It’s all math. The first step is to count how many instances of each of these tags are in each file:

<p class="text_obisnuit"> 
<p class="text_obisnuit2">
<p class="text_obisnuit"><span class="text_obisnuit2">
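That counting step can be sketched with plain string counting (`count_tags` is an illustrative helper, not code from the thread; the patterns are the three from the task):

```python
# Count how many instances of each tag pattern appear in an HTML string.
# The span variant is listed first because the plain text_obisnuit prefix
# also matches it, so the plain count must be corrected afterwards.
PATTERNS = [
    '<p class="text_obisnuit"><span class="text_obisnuit2">',
    '<p class="text_obisnuit2">',
    '<p class="text_obisnuit">',
]

def count_tags(html):
    counts = {pat: html.count(pat) for pat in PATTERNS}
    # subtract the span matches that were counted again under the plain prefix
    counts['<p class="text_obisnuit">'] -= counts[PATTERNS[0]]
    return counts
```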

Ok, let’s count the exact instances of each tag type:

EN file:

<p class="text_obisnuit"> = 7
<p class="text_obisnuit2"> = 2
<p class="text_obisnuit"><span class="text_obisnuit2"> = 3

RO file:

<p class="text_obisnuit"> = 9
<p class="text_obisnuit2"> = 2
<p class="text_obisnuit"><span class="text_obisnuit2"> = 3



For example, the tag

<p class="text_obisnuit">Here is wisdom.</p> , number 6 in RO,

cannot possibly be number 6 in EN, because the EN tag at that position has a different structure, and its content has a larger number of words:

<p class="text_obisnuit"><span class="text_obisnuit2">Do you believe in that equality with the divine in the act of creation? </span>Can you understand this?</p>

Therefore, the solution must consider the following, in order:

  1. The tag structure
  2. The content length (number of words)
  3. Only then can the tags be compared
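Those three steps can be sketched as a single comparison function. The helper names and the 20% length threshold are my assumptions for illustration, not part of the task:

```python
import re

def tag_structure(tag_html):
    """Return the opening-tag prefix, e.g. '<p class="text_obisnuit">'."""
    m = re.match(r'(<p class="[^"]+">(?:<span class="[^"]+">)?)', tag_html)
    return m.group(1) if m else ''

def word_count(tag_html):
    """Count the words in the tag's text content (markup stripped)."""
    text = re.sub(r'<[^>]+>', ' ', tag_html)
    return len(text.split())

def tags_correspond(ro_tag, en_tag, threshold=0.2):
    """Same structure plus a similar word count => likely a translation pair.
    The 20% threshold is an assumption, not a rule from the task."""
    if tag_structure(ro_tag) != tag_structure(en_tag):
        return False
    wc_ro, wc_en = word_count(ro_tag), word_count(en_tag)
    return abs(wc_ro - wc_en) / max(wc_ro, wc_en, 1) <= threshold
```

An RO tag with no corresponding EN tag at its position is then a candidate for insertion into the output.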

Even so, ChatGPT and Claude didn’t succeed in producing good Python code.

A friend wrote the code for me; it was quite difficult to find the solution.

But keep in mind that not every combination of tags will give a perfect result, because there must be a series of similar lines for it to work; it is probability theory. The code below is good for about 90% of cases.

I can’t explain why, after a week and hundreds of attempts, ChatGPT and Claude.ai couldn’t find this solution.

import re
import copy
import os


def replace_delimited_text(text, start_delimiter, end_delimiter, replacement):
    pattern = re.escape(start_delimiter) + r".*?" + re.escape(end_delimiter)
    replaced_text = re.sub(pattern, replacement, text, flags=re.DOTALL)
    print("[replace_delimited_text] Text replacement completed.")
    return replaced_text


def extract_tags_from_file(file_path):
    print(f"[extract_tags_from_file] Reading file: {file_path}")
    with open(file_path, "r", encoding="utf-8") as file:
        html_content = file.read()

    combined_pattern = (
        r'<p class="text_obisnuit">.*?</p>|'
        r'<p class="text_obisnuit2">.*?</p>|'
        r'<p class="text_obisnuit"><span class="text_obisnuit2">.*?</span></p>'
    )

    all_matches = re.findall(combined_pattern, html_content, re.DOTALL)
    print(f"[extract_tags_from_file] Found {len(all_matches)} matches in {file_path}")
    return all_matches


def is_significant_length_difference(text1, text2, threshold=0.2):
    length_diff = abs(len(text1) - len(text2))
    is_different = (length_diff / max(len(text1), len(text2))) > threshold
    print(f"[is_significant_length_difference] Difference: {is_different}")
    return is_different


if __name__ == "__main__":
    ro_file_path = "d:/3/ro/incotro-vezi-tu-privire.html"
    en_file_path = "d:/3/en/where-do-you-see-look.html"

    ro_tags = extract_tags_from_file(ro_file_path)
    en_tags = extract_tags_from_file(en_file_path)

    print("[MAIN] RO tag count:", len(ro_tags), "EN tag count:", len(en_tags))

    if len(ro_tags) <= len(en_tags):
        raise Exception("There's nothing to transfer from the RO HTML article to the EN one!")

    final_en_tags = copy.deepcopy(en_tags)
    inserted_at = []
    i = 0

    while i < len(ro_tags):
        # Check whether i is still within the bounds of final_en_tags
        if i < len(final_en_tags):
            # Check whether the RO and EN tags have the same structure
            if (
                (ro_tags[i].startswith('<p class="text_obisnuit">') and
                 final_en_tags[i].startswith('<p class="text_obisnuit">')) or
                (ro_tags[i].startswith('<p class="text_obisnuit2">') and
                 final_en_tags[i].startswith('<p class="text_obisnuit2">')) or
                (ro_tags[i].startswith('<p class="text_obisnuit"><span class="text_obisnuit2">') and
                 final_en_tags[i].startswith('<p class="text_obisnuit"><span class="text_obisnuit2">'))
            ):
                if is_significant_length_difference(ro_tags[i], final_en_tags[i]):
                    final_en_tags.insert(i, ro_tags[i])
                    inserted_at.append(i)
                i += 1
            else:
                final_en_tags.insert(i, ro_tags[i])
                inserted_at.append(i)
                i += 1
        else:
            # i has gone past the end of final_en_tags: append the remaining RO tags
            final_en_tags.extend(ro_tags[i:])
            inserted_at.extend(range(i, len(ro_tags)))
            break

    print("[MAIN] Final RO:", len(ro_tags), "EN after insertions:", len(final_en_tags))
    print("[MAIN] Positions of inserted tags:", inserted_at)

    assert len(ro_tags) <= len(final_en_tags), "Missing paragraphs couldn't be filled out properly..."

    # Make sure the output directory exists
    output_dir = "d:/3/Output"
    os.makedirs(output_dir, exist_ok=True)

    # Read the EN file content
    with open(en_file_path, "r", encoding="utf-8") as file:
        html_content = file.read()

    # Find and replace the section between the delimiters
    if final_en_tags:
        # Build the new content, including the delimiters
        new_content = "<!-- ARTICOL START -->\n" + "\n".join(final_en_tags) + "\n<!-- ARTICOL FINAL -->"
        # Replace the entire section
        res = replace_delimited_text(
            html_content,
            "<!-- ARTICOL START -->",
            "<!-- ARTICOL FINAL -->",
            new_content
        )

        # Save the result
        output_path = os.path.join(output_dir, "file.html")
        with open(output_path, "w", encoding="utf-8") as file:
            file.write(res)
        print("[MAIN] Output saved to:", output_path)
    else:
        print("[MAIN] No changes made to save.")

Finally I solved the problem, but not with ChatGPT or Claude. No other AI could find the solution, because none of them knew how to think about it.

In fact, to find the solution to this problem, you had to assign an identifier to each tag and do multiple searches.

ChatGPT, Claude, and the other AIs will have to seriously consider this type of solution for such problems.

Here are the specifications, the way I thought about solving the problem. It is a different way of thinking about PARSING:

https://pastebin.com/as2yw1UQ

Python code made by a friend of mine. I came up with the solution; he wrote the code:

from bs4 import BeautifulSoup
import re

def count_words(text):
    """Numără cuvintele dintr-un text."""
    return len(text.strip().split())

def get_greek_identifier(word_count):
    """Determină identificatorul grecesc bazat pe numărul de cuvinte."""
    if word_count < 7:
        return 'α'
    elif word_count <= 14:
        return 'β'
    else:
        return 'γ'

def get_tag_type(tag):
    """Determină tipul tagului (A, B, sau C)."""
    if tag.find('span'):
        return 'A'
    elif 'text_obisnuit2' in tag.get('class', []):
        return 'B'
    return 'C'

def analyze_tags(content):
    """Analizează tagurile și returnează informații despre fiecare tag."""
    soup = BeautifulSoup(content, 'html.parser')
    tags_info = []

    article_content = re.search(r'<!-- ARTICOL START -->(.*?)<!-- ARTICOL FINAL -->',
                              content, re.DOTALL)

    if article_content:
        content = article_content.group(1)
        soup = BeautifulSoup(content, 'html.parser')

    for i, tag in enumerate(soup.find_all('p', recursive=False)):
        text_content = tag.get_text(strip=True)
        tag_type = get_tag_type(tag)
        word_count = count_words(text_content)
        greek_id = get_greek_identifier(word_count)

        tags_info.append({
            'number': i + 1,
            'type': tag_type,
            'greek': greek_id,
            'content': str(tag),
            'text': text_content
        })

    return tags_info

def compare_tags(ro_tags, en_tags):
    """Compară tagurile și găsește diferențele."""
    wrong_tags = []
    i = 0
    j = 0

    while i < len(ro_tags):
        ro_tag = ro_tags[i]
        if j >= len(en_tags):
            wrong_tags.append(ro_tag)
            i += 1
            continue

        en_tag = en_tags[j]

        if ro_tag['type'] != en_tag['type']:
            wrong_tags.append(ro_tag)
            i += 1
            continue

        i += 1
        j += 1

    return wrong_tags

def format_results(wrong_tags):
    """Formatează rezultatele pentru afișare și salvare."""
    type_counts = {'A': 0, 'B': 0, 'C': 0}
    type_content = {'A': [], 'B': [], 'C': []}

    for tag in wrong_tags:
        type_counts[tag['type']] += 1
        type_content[tag['type']].append(tag['content'])

    # Build the formatted result
    result = []

    # First line: the summary
    summary_parts = []
    for tag_type in ['A', 'B', 'C']:
        if type_counts[tag_type] > 0:
            summary_parts.append(f"{type_counts[tag_type]} tags of type ({tag_type})")
    result.append("RO contains the following in addition to EN: " + " and ".join(summary_parts))

    # Details for each tag type
    for tag_type in ['A', 'B', 'C']:
        if type_counts[tag_type] > 0:
            result.append(f"\n{type_counts[tag_type]}({tag_type}), i.e. {'these tags' if type_counts[tag_type] > 1 else 'this tag'}:")
            for content in type_content[tag_type]:
                result.append(content)
            result.append("")  # Linie goală pentru separare

    return "\n".join(result)

def merge_content(ro_tags, en_tags, wrong_tags):
    """Combină conținutul RO și EN, inserând tagurile wrong în pozițiile lor originale."""
    merged_tags = []

    # Dictionary of wrong tags keyed by their original number
    wrong_dict = {tag['number']: tag for tag in wrong_tags}

    # Walk the positions and decide which tag goes in each one
    current_en_idx = 0
    for i in range(max(len(ro_tags), len(en_tags))):
        position = i + 1

        # Check whether this position belongs to a wrong tag
        if position in wrong_dict:
            merged_tags.append(wrong_dict[position]['content'])
        elif current_en_idx < len(en_tags):
            merged_tags.append(en_tags[current_en_idx]['content'])
            current_en_idx += 1

    return merged_tags

def save_results(merged_content, results, output_path):
    """Salvează conținutul combinat și rezultatele în fișierul de output."""
    final_content = '<!-- REZULTATE ANALIZA -->\n'
    final_content += '<!-- ARTICOL START -->\n'

    # Append the merged content
    for tag in merged_content:
        final_content += tag + '\n'

    final_content += '<!-- ARTICOL FINAL -->\n'
    final_content += '<!-- FINAL REZULTATE ANALIZA -->\n'

    # Append the analysis results
    final_content += results

    # Save to file
    with open(output_path, 'w', encoding='utf-8') as file:
        file.write(final_content)

# Read the files
with open(r'd:/3/ro/incotro-vezi-tu-privire.html', 'r', encoding='utf-8') as file:
    ro_content = file.read()

with open(r'd:/3/en/where-do-you-see-look.html', 'r', encoding='utf-8') as file:
    en_content = file.read()

# Output file path
output_path = r'd:/3/Output/where-do-you-see-look.html'

# Analyse the tags
ro_tags = analyze_tags(ro_content)
en_tags = analyze_tags(en_content)

# Find the differences
wrong_tags = compare_tags(ro_tags, en_tags)

# Format the results
results = format_results(wrong_tags)

# Generate the merged content
merged_content = merge_content(ro_tags, en_tags, wrong_tags)

# Print the results to the console
print(results)

# Save the results to the output file
save_results(merged_content, results, output_path)