How to effectively fine-tune when there are multiple right answers?

Hello there, I’d like to fine-tune a model to find CSS selectors in HTML code so I can use these selectors in tools like Playwright, BeautifulSoup, etc. The idea is to find consistent selectors that can be used to select the elements that I want.

For example, given the following html:

...more html here...

<div aria-labelledby="title_48591420" class="cy5jw6o atm_5j_223wjw atm_70_87waog atm_j3_1u6x1zy atm_jb_4shrsx atm_mk_h2mmj6 atm_vy_7abht0 dir dir-ltr" role="group" data-testid="card-container">
  <div class="c1l1h97y atm_d2_1kqhmmj dir dir-ltr" style="--transition-element_transition-delay: 0ms; --transition-element_transition-duration: 200ms;"><div itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">

...more html here...

The following selectors are considered as right answers:

  • [data-testid="card-container"]
  • .dir [itemType="http://schema.org/ListItem"]
  • [itemprop="itemListElement"]
  • etc.

Here are the predicted responses:

  • gpt-3.5-turbo: .cy5jw6o while it works, it’s possible that obfuscated classes changes after each deployment of this website.
  • gpt-4o: [data-testid="card-container"] great response, but expensive to run against many chunks of html

Since HTML code consumes a lot of tokens, I have to chunk the HTML pages until I find the right piece where the selector I’m asking for actually is. In a scenario that the selector I want is not in the HTML code, both models either get it right (no selector) or wrong (predict random sel.).

So, since there can be multiple “right answers” for each html chunk, should I whether:

  • add multiple examples for the same chunk?
    (where each example is the same exact prompt + another right answer)
  • or one right example per chunk?

Also as an alternative, I thought about using another gpt-3.5-turbo prompt to classify the predicted selector, and - if it’s “not good” - it will call the previous prompt again using the “bad selector” as a few shot example. Could this be better than fine-tuning for my use case?

Thank you all

I would recommend choosing one canonical solution for the selector which corresponds to the east you would like the selector to be formed in general.

Here’s a ridiculous example…

Say you wanted to fine-tune a model to perform division.

What is the answer to 12 \div 2?

  1. 6
  2. 6.0
  3. \frac{12}{2}
  4. \mathbf{VI}
  5. 0\mathrm{b}110
  6. :clock6:
  7. Six
  8. —••••
  9. ⠼⠋
  10. 𓏺𓏺𓏺𓏺𓏺𓏺
  11. Ϛ
  12. ו
  13. Զ

If you were to train a model with all of these examples, you wouldn’t get very useful results—it would more or less randomly choose a numeral system to respond with. It’s better to pick the “most correct” version and website all of your examples follow that same rule so the fine-tuning process doesn’t expend “mental energy” trying to learn patterns in the noise of the variations.

To bring it back to your case, you could ask it to do one of,

  • Build the complete selector
  • Identify the most concise selector
  • Etc

But, whatever you do, be consistent.

That said…

I wouldn’t use an LLM for this at all. This is a heuristic problem, with an algorithmic solution.

I would ask an LLM to generate a programmatic solution to the problem.

After several iterations requesting a critical review and improvement upon the solution provided, this is what it came up with,

from bs4 import BeautifulSoup
import re
from functools import lru_cache
import hashlib

def escape_attribute_value(value):
    return value.replace("'", "\\'").replace('"', '\\"')

def hash_attribute(value, length=8):
    return hashlib.md5(value.encode()).hexdigest()[:length]

@lru_cache(maxsize=1024)
def get_parent_selector(element):
    if element.parent is None or element.parent.name == '[document]':
        return ''
    return generate_stable_selector(element.parent)

def is_dynamic_id(id_value, min_length=10, max_length=32):
    return min_length <= len(id_value) <= max_length and re.match(r'^[a-f0-9-]+$', id_value)

def generate_stable_selector(element, max_length=5, custom_attrs=None, custom_pseudo_classes=None):
    if element is None or element.name == '[document]':
        return ''

    default_attrs = ['id', 'name', 'data-testid', 'data-cy', 'data-qa', 'role', 'type', 'value', 'title', 'alt']
    attrs_to_check = custom_attrs or default_attrs
    
    current_selector = element.name
    
    # Check for attributes
    for attr in attrs_to_check:
        if element.has_attr(attr):
            value = element[attr]
            if not value:  # Skip empty attribute values
                continue
            if attr == 'id' and is_dynamic_id(value):
                continue
            if len(value) > 20:
                value = value[:10] + '*'  # Use prefix for long values
            else:
                value = escape_attribute_value(value)
            current_selector += f"[{attr}='{value}']"
            break
    
    # If no attribute found, use nth-child and structural pseudo-classes
    if current_selector == element.name:
        siblings = list(element.parent.find_all(element.name, recursive=False))
        index = siblings.index(element) + 1
        if index == 1 and len(siblings) > 1:
            current_selector += ":first-child"
        elif index == len(siblings) and len(siblings) > 1:
            current_selector += ":last-child"
        elif len(siblings) % 2 == 0 and index % 2 == 0:
            current_selector += ":nth-child(even)"
        elif len(siblings) % 2 == 1 and index % 2 == 1:
            current_selector += ":nth-child(odd)"
        else:
            current_selector += f":nth-child({index})"
    
    # Apply custom pseudo-classes if provided
    if custom_pseudo_classes and element.name in custom_pseudo_classes:
        current_selector += custom_pseudo_classes[element.name]
    
    parent_selector = get_parent_selector(element)
    
    possible_selectors = []
    
    if parent_selector:
        possible_selectors.append(f"{parent_selector} > {current_selector}")
    else:
        possible_selectors.append(current_selector)
    
    # Try sibling selectors
    prev_sibling = element.find_previous_sibling()
    if prev_sibling:
        sibling_selector = generate_stable_selector(prev_sibling)
        possible_selectors.append(f"{sibling_selector} + {current_selector}")
    
    # Add more parent levels
    while parent_selector:
        parent_selector = get_parent_selector(element.parent)
        if parent_selector:
            possible_selectors.append(f"{parent_selector} > {possible_selectors[-1]}")
        else:
            break
    
    # Find the shortest unique selector
    root = element.find_parent()
    for selector in possible_selectors:
        if len(root.select(selector)) == 1:
            # Trim selector if it's too long
            selector_parts = selector.split(' > ')
            if len(selector_parts) > max_length:
                selector_parts = selector_parts[-max_length:]
                selector_parts.insert(0, '...')
            return ' > '.join(selector_parts)
    
    # If no unique selector found, return the most specific one
    return possible_selectors[-1]

# Test the function
html = """
<html>
<head><title>Test</title></head>
<body>
    <div id="example">
        <span>Example Text</span>
        <input type="text" name="username" />
        <button role="submit">Submit</button>
        <div>
            <span>Nested Span</span>
        </div>
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
        </ul>
    </div>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
custom_pseudo_classes = {'button': ':not(:disabled)'}
for element in soup.find_all(['span', 'input', 'button', 'li']):
    try:
        selector = generate_stable_selector(element, custom_pseudo_classes=custom_pseudo_classes)
        print(f"{element.name}: {selector}")
    except Exception as e:
        print(f"Error generating selector for {element.name}: {str(e)}")

I have no idea if this is good code or if it works robustly (this isn’t my problem to solve) but this is the approach I would take.

  1. Tell the model you want a method for generating stable selectors which didn’t rely on possibly dynamic class values.
  2. Take what the model gives you then, in a new chat say, "Hey, this is my problem and what I want to accomplish… Here is the solution I came up with: , please critically review my solution and improve upon it "
  3. goto 2

Break the loop when you’re happy with the code.

The above code was generated using three different LLMs back and forth a bunch of times just for fun.

2 Likes