GPT 3.5 Turbo 16K not able to understand that Q1'23 and Q1 2023 are the same

Does anyone know a solution for this? I still need to test whether GPT-4 can do it. But other than adding few-shot examples, is there a way?

Input consists of yearly data that includes quarterly statistics. Quarterly data format fields are of mixed type, with two separate formats that must be individually identified and correctly processed:

  • Q{quarter_number_integer}'{two_digit_year}
  • Q{quarter_number_integer} {four_digit_year}

Be aware of this mixture (and others), and intelligently adapt data extraction techniques to quarterly figures.

Desired quarterly output format: Q{quarter_number_integer}-{four_digit_year}

This will work, but it will not scale for my use case. I have many different scenarios where the model fails to understand all the formats. Maybe fine-tuning is the optimal approach for me.

Hopefully that gives you enough of an idea to write similar instructions that help the model generalize to unexpected data formats and prepare for them.

Might I suggest another way?

Not every nail needs to be hit with the LLM hammer.

If you are aware of the different formats ahead of time, it’s fairly straightforward to refactor the text on ingress to a standard format.

Meet the model halfway and you’ll get better, more consistent results. If you rely solely on the LLM there will always be a non-zero probability of it stumbling regardless of how much time, energy, and money you spend evolving your prompts.

Here’s an MWE for how you can use Python regular expressions to convert QX ##YY to QX'YY:

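import re

def replace_pattern(text):
    # Match Q1-Q4, a space, and a four-digit year; keep only the last two digits.
    # The \b anchors stop the pattern from firing inside longer tokens.
    pattern = r'\b(Q[1-4]) \d{2}(\d{2})\b'
    replacement = r"\1'\2"
    return re.sub(pattern, replacement, text)

# Example usage
text = "Q1 2022 and Q3 1998"
result = replace_pattern(text)
print(result)  # Q1'22 and Q3'98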

This is a good suggestion, which I can probably use for another use case. But in this use case we are extracting content from PDFs and chunking it, and the hard part is finding all the possible patterns to replace.

I still have not been able to solve this. I wanted to see whether anyone else is facing this type of issue and whether there are any other solutions.

You might want to look into text normalization.

For example, years ago I was dealing with a situation where I couldn’t have contractions. So I couldn’t have “couldn’t” I needed “could not”.

So I had this basic lookup where if I saw a known contraction, I replaced it with the expanded set of words.

I think a small data structure of 80 contractions took care of 99% of the contractions out there.

So look-up-and-substitute might be viable for you. Even for Q1’24 stuff, you could programmatically generate the expansions out for the next 1000 years no problem (only 4000 table entries).

You could run this in a database, or an in-memory data structure.
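As a rough illustration of the look-up-and-substitute idea, here is a minimal in-memory sketch. The names `CONTRACTIONS`, `build_quarter_table`, and `normalize` are made up for this example, the contraction table is a three-entry stand-in, and the quarter table is limited to one century on the assumption that two-digit years would otherwise collide.

```python
import re

# Tiny stand-in table; a real one would hold many more entries.
CONTRACTIONS = {
    "couldn't": "could not",
    "can't": "cannot",
    "won't": "will not",
}

def build_quarter_table(start_year, end_year):
    """Map short forms such as Q1'24 to a long form such as Q1 2024."""
    table = {}
    for year in range(start_year, end_year + 1):
        for q in range(1, 5):
            table[f"Q{q}'{year % 100:02d}"] = f"Q{q} {year}"
    return table

def normalize(text, table):
    """Replace every key of `table` found in `text` with its value."""
    # Treat curly and straight apostrophes alike (this also rewrites
    # curly apostrophes elsewhere in the text).
    text = text.replace("’", "'")
    # Longest keys first so a shorter key never shadows a longer one.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(map(re.escape, keys)) + r")(?!\w)"
    )
    return pattern.sub(lambda m: table[m.group(0)], text)

# Assumption: every two-digit year belongs to 2000-2099.
LOOKUP = {**CONTRACTIONS, **build_quarter_table(2000, 2099)}

print(normalize("We couldn’t close Q1’24 on time.", LOOKUP))
```

Matching here is case-sensitive, so "Couldn't" at the start of a sentence would need its own entry or a case-insensitive variant. For large tables you would likely compile the pattern once rather than on every call.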

I’d worry about the LLM fine-tune altering the original data, and possibly not doing a great job at fixing all possible normalizations. Plus you have to train it anyway, which is work that could go towards your look-up tables.

But if your data is so crazy and can’t be listed, then maybe a fine-tune wouldn’t hurt … again just worried about the quality and distortion in the output.

Please help me understand.

I imagine there are only so many ways this concept could be expressed.

Say we are talking about the first quarter of the year 2024. Here’s what I imagine that may look like in your documents,

  1. Q1 2024
  2. Q1’24
  3. Qtr 1, 2024
  4. 1st Quarter 2024
  5. First Quarter 2024
  6. 2024 Q1
  7. 2024 - Q1
  8. 2024/Q1
  9. Q1 of 2024
  10. 2024 1st Quarter
  11. 2024 First Quarter
  12. Quarter 1, 2024
  13. 2024, Quarter 1
  14. Q-1 2024
  15. 1Q 2024
  16. 2024 1Q
  17. Q1-2024
  18. 2024-Q1
  19. 1st Qtr 2024
  20. First Qtr 2024
  21. 2024 1st Qtr
  22. 2024 First Qtr
  23. 2024 Qtr 1
  24. Quarter One 2024
  25. 2024 Quarter One
  26. Q1/2024
  27. 2024/Q-1
  28. 2024 (Q1)
  29. Q1 (2024)
  30. 2024 (1st Quarter)
  31. (First Quarter) 2024
  32. Q1/24
  33. 24Q1
  34. 24 Q1
  35. Q1 of '24
  36. '24 Q1
  37. Q1 '24
  38. '24-Q1
  39. Q1-24
  40. 24-Q1

This might seem like a lot of options (and there are undoubtedly more), but we can reduce these down to a small handful of regular expressions.

We really have just these things to concern ourselves with,

  • Quarter number (N)
    • Numeric (1–4)
    • Numeric word (one–four)
    • Ordinal word (first–fourth, last)
    • Numeric ordinal (1st–4th)
  • Year (Y)
    • 2-digit, YY
    • 4-digit, YYYY
  • Quarter abbreviation (Q)
    • Q
    • Qtr
    • Quarter
  • Format
    • Order of terms, we would have at most six if we included every permutation, but some don’t make sense (the quarter number and year shouldn’t be separated)
      • QNY
      • NQY
      • YQN
      • YNQ
    • Separators
      • -
      • /
      • .
      • ;
    • Miscellaneous filler words/characters
      • of
      • ,
      • ’
      • ()

From this it’s fairly straightforward (if, perhaps, a bit tedious) to construct a regular expression which would match almost any conceivable variation for representing the first quarter of 2024 and then, by trivial extension, any quarter of any year.

Finally, as @curt.kennedy explained better than I did in my initial reply to you, use text normalization.

Using the regex you created from the above, you would find every reference to any particular quarter and replace it with a standard, normalized version of your choosing, say 2024Q1, which is unambiguous and easy to find in the text.

You would then use this updated text as the context source so there is no need for the model to understand 1,000 different ways to represent the same concept, ultimately making your (and its) job much easier.
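To make the breakdown above concrete, here is one possible sketch of such a regex, built from the same pieces (quarter number, label, year, order, separators). Everything in it is illustrative: the helper names are invented, two-digit years are assumed to be in the 2000s, only `-`, `/`, `,` and "of" are treated as separators, and the parenthesized forms and "last" are not handled. It can also produce false positives (for example "Q1 24 months"), so treat it as a starting point rather than a finished normalizer.

```python
import re

# Spelled-out quarter numbers (numeric words and ordinal words).
WORDS = {"one": 1, "first": 1, "two": 2, "second": 2,
         "three": 3, "third": 3, "four": 4, "fourth": 4}

NUM = r"(?:[1-4](?:st|nd|rd|th)?|" + "|".join(WORDS) + r")"
LABEL = r"(?:quarter|qtr|q)"
SEP = r"\s*(?:of|[-/,])?\s*"

def _quarter(tag):
    # Either label-then-number (Q1, Qtr 1, Q-1) or number-then-label (1Q, 1st Quarter).
    return rf"(?:{LABEL}[\s-]?(?P<{tag}a>{NUM})|(?P<{tag}b>{NUM})\s?{LABEL})"

def _year(tag):
    # Four-digit year, or two digits with an optional leading apostrophe.
    return rf"['’]?(?P<{tag}>(?:19|20)\d{{2}}|\d{{2}})"

PATTERN = re.compile(
    r"(?<![\w'’])(?:"
    + _quarter("q1") + SEP + _year("y1")    # quarter first, e.g. Q1 2024
    + "|"
    + _year("y2") + SEP + _quarter("q2")    # year first, e.g. 2024 Q1
    + r")(?!\w)",
    re.IGNORECASE,
)

def _to_standard(match):
    groups = match.groupdict()
    n_raw = next(v for k, v in groups.items() if k.startswith("q") and v)
    y_raw = next(v for k, v in groups.items() if k.startswith("y") and v)
    quarter = WORDS.get(n_raw.lower()) or int(n_raw[0])
    year = int(y_raw)
    if year < 100:
        year += 2000  # assumption: two-digit years are 2000-2099
    return f"{year}Q{quarter}"

def normalize_quarters(text):
    """Rewrite every recognized quarter reference as YYYYQN."""
    return PATTERN.sub(_to_standard, text)

print(normalize_quarters("Revenue rose in Q3'23 versus 2nd Quarter 2022."))
```

Building the pattern from named pieces keeps each of the listed concerns in one place, so adding a new label or separator is a one-line change.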

Agree with @elmstedt here.

You are essentially creating your own “pre-tokenizer” which reduces or eliminates any OOV (out of vocabulary) terms prior to entering the LLM.

You do this through normalization, which often includes regular expressions and lookup table substitutions. You can even set alarms (or log) OOV events so that you can handle them, by updating the normalization, and better processing future events, since not everything is known on Day 0.
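One way to picture the "alarm" idea: after normalization has run, scan the text with a deliberately loose detector and log anything that still looks like a quarter reference, so the rules can be extended later. This is only a sketch; the logger name, the `SUSPECT` pattern, and the assumption that the normalized form looks like 2024Q1 are all choices made for this example.

```python
import logging
import re

logger = logging.getLogger("normalizer")

# Loose detector for anything that still mentions a quarter.
# A normalized token such as 2024Q1 does not trigger it, because there is
# no word boundary between the year digits and the Q.
SUSPECT = re.compile(r"\b(?:Q[1-4]|[1-4]Q|Qtr|Quarter)\b", re.IGNORECASE)

def find_unnormalized(text, context=15):
    """Return (and log) short snippets around leftover quarter references."""
    leftovers = []
    for m in SUSPECT.finditer(text):
        start = max(0, m.start() - context)
        snippet = text[start:m.end() + context].strip()
        leftovers.append(snippet)
        logger.warning("Unnormalized quarter reference: %r", snippet)
    return leftovers

clean = find_unnormalized("Sales in 2024Q1 were flat.")
dirty = find_unnormalized("Sales in fiscal Q1 FY24 were flat.")
print(clean, dirty)
```

The detector will also flag ordinary prose such as "this quarter"; that is intentional here, since a noisy log is cheaper than a silently missed format.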

But regex and various direct substitutions are part of this process.

Essentially if you can define what you don’t want, you can define a transformation that will alter the unwanted data into a desired form.

This takes some “elbow grease” but is often more straightforward, and kinda fun, once you get started. It’s basically a game that you can largely win.

Thanks a lot, this is a good solution. Instead of working out every possible option beforehand, I am thinking of maintaining a table of regexes. During extraction, I apply all of the regexes to each chunk, and I do the same with the prompt. I think this will be a consistent and maintainable solution.
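A regex table like that could look roughly like the sketch below. The two rules are just examples targeting the Q{quarter}-{four_digit_year} format mentioned earlier in the thread, and `RULES` and `apply_rules` are hypothetical names; the first rule assumes two-digit years are in the 2000s.

```python
import re

# Hypothetical rule table: (compiled pattern, replacement) pairs, applied in order.
RULES = [
    # Q1'23 or Q1’23 -> Q1-2023 (assumes two-digit years are 2000-2099)
    (re.compile(r"\bQ([1-4])['’](\d{2})\b"), r"Q\1-20\2"),
    # Q1 2023 -> Q1-2023
    (re.compile(r"\bQ([1-4]) ((?:19|20)\d{2})\b"), r"Q\1-\2"),
]

def apply_rules(text, rules=RULES):
    """Run every rule over the text, in table order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

# The same function is applied to the document chunk and to the user prompt,
# so both sides end up using one spelling for each quarter.
chunk = "Revenue for Q1'23 was 4.2M; Q4 2022 was 3.9M."
question = "What was revenue in Q1 2023?"

print(apply_rules(chunk))
print(apply_rules(question))
```

Because the rules are plain data, other abbreviations can be added as extra rows without touching the function.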

There is a library for this from Stanford (SUTime), and a Python wrapper is also available.
You can check it out. I have used it, and it pretty much does the job.
Search for this term: /FraBle/python-sutime

Thanks, but my problem is not only dates. There are also other abbreviations that I need to expand.