# GPT 3.5 Turbo 16K not able to understand that Q1'23 and Q1 2023 are the same

Does anyone know a solution for this? I still need to test whether GPT-4 can do it. But other than adding few-shot examples, is there a way?

Input consists of yearly data that includes quarterly statistics. Quarterly data format fields are of mixed type, with two separate formats that must be individually identified and correctly processed:

• Q{quarter_number_integer}'{two_digit_year}
• Q{quarter_number_integer} {four_digit_year}

Be aware of this mixture (and others), and intelligently adapt data extraction techniques to quarterly figures.

Desired quarterly output format: Q{quarter_number_integer}-{four_digit_year}

This will work, but for my use case it will not be scalable. I have many different scenarios where it is not able to understand all the formats. Maybe fine-tuning is the optimal approach for me.

Hopefully I've given enough of an idea there for you to make a smarter AI with similar instructions, one that can generalize its understanding of unexpected data types and prepare for them.

Might I suggest another way?

Not every nail needs to be hit with the LLM hammer.

If you are aware of the different formats ahead of time, it's fairly straightforward to refactor the text on ingress to a standard format.

Meet the model halfway and you'll get better, more consistent results. If you rely solely on the LLM there will always be a non-zero probability of it stumbling regardless of how much time, energy, and money you spend evolving your prompts.

Here's an MWE for how you can use Python regular expressions to convert `QX ##YY` to `QX'YY`:

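```python
import re

def replace_pattern(text):
    # Capture the quarter (Q1-Q4), skip the century digits, capture the last two year digits.
    pattern = r'(Q[1-4]) \d{2}(\d{2})'
    # Rebuild as QX'YY, e.g. "Q1 2022" becomes "Q1'22".
    replacement = r"\1'\2"
    return re.sub(pattern, replacement, text)

# Example usage
text = "Q1 2022 and Q3 1998"
result = replace_pattern(text)
print(result)  # Q1'22 and Q3'98
```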

This is a good suggestion, which I can probably use for another use case. But in this use case we are extracting the content from PDFs and chunking it; the complexity lies in finding all possible patterns and replacing them.

This I am still not able to solve. I wanted to see whether anyone else is facing this type of issue and whether there are any other solutions.

You might want to look into text normalization.

For example, years ago I was dealing with a situation where I couldn't have contractions. So I couldn't have "couldn't"; I needed "could not".

So I had this basic lookup where if I saw a known contraction, I replaced it with the expanded set of words.

I think a small data structure of 80 contractions took care of 99% of the contractions out there.

So look-up-and-substitute might be viable for you. Even for Q1'24 stuff, you could programmatically generate the expansions out for the next 1000 years no problem (only 4000 table entries).

You could run this in a database, or an in-memory data structure.
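A minimal sketch of the look-up-and-substitute idea as an in-memory table. It assumes only the two input formats from the original question and a single century (2000 to 2099), because a two-digit key such as `Q1'24` would otherwise map to more than one year. The names `build_quarter_table` and `normalize` are made up for illustration.

```python
def build_quarter_table(start_year=2000, end_year=2099):
    """Map known spellings such as "Q1'24" or "Q1 2024" to "Q1-2024"."""
    table = {}
    for year in range(start_year, end_year + 1):
        yy = f"{year % 100:02d}"
        for quarter in range(1, 5):
            normalized = f"Q{quarter}-{year}"
            table[f"Q{quarter}'{yy}"] = normalized    # Q1'24
            table[f"Q{quarter} {year}"] = normalized  # Q1 2024
    return table

def normalize(text, table):
    # PDFs often contain a curly apostrophe; fold it to a straight one first.
    text = text.replace("\u2019", "'")
    # Replace longer keys first so a short key never clobbers part of a longer one.
    for key in sorted(table, key=len, reverse=True):
        text = text.replace(key, table[key])
    return text

QUARTERS = build_quarter_table()
print(normalize("Revenue rose in Q1'23 versus Q4 2022.", QUARTERS))
```

Plain string replacement keeps the table easy to audit, at the cost of needing one entry per spelling; the same dictionary could just as well live in a database table.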

I'd worry about the LLM fine-tune altering the original data, and possibly not doing a great job at fixing all possible normalizations. Plus you have to train it anyway, which is work that could go towards your look-up tables.

But if your data is so crazy and can't be listed, then maybe a fine-tune wouldn't hurt … again, just worried about the quality and distortion in the output.


I imagine there are only so many ways this concept could be expressed.

Say we are talking about the first quarter of the year 2024. Here's what I imagine that may look like in your documents:

1. Q1 2024
2. Q1'24
3. Qtr 1, 2024
4. 1st Quarter 2024
5. First Quarter 2024
6. 2024 Q1
7. 2024 - Q1
8. 2024/Q1
9. Q1 of 2024
10. 2024 1st Quarter
11. 2024 First Quarter
12. Quarter 1, 2024
13. 2024, Quarter 1
14. Q-1 2024
15. 1Q 2024
16. 2024 1Q
17. Q1-2024
18. 2024-Q1
19. 1st Qtr 2024
20. First Qtr 2024
21. 2024 1st Qtr
22. 2024 First Qtr
23. 2024 Qtr 1
24. Quarter One 2024
25. 2024 Quarter One
26. Q1/2024
27. 2024/Q-1
28. 2024 (Q1)
29. Q1 (2024)
30. 2024 (1st Quarter)
31. (First Quarter) 2024
32. Q1/24
33. 24Q1
34. 24 Q1
35. Q1 of '24
36. '24 Q1
37. Q1 '24
38. '24-Q1
39. Q1-24
40. 24-Q1

This might seem like a lot of options (and there are undoubtedly more), but we can reduce these down to a small handful of regular expressions.

We really have just these things to concern ourselves with,

• Quarter number (N)
  • Numeric (1–4)
  • Numeric word (one–four)
  • Ordinal word (first–fourth, last)
  • Numeric ordinal (1st–4th)
• Year (Y)
  • 2-digit, YY
  • 4-digit, YYYY
• Quarter abbreviation (Q)
  • Q
  • Qtr
  • Quarter
• Format
  • Order of terms: there would be at most six if we included every permutation, but some don't make sense (the quarter abbreviation and quarter number shouldn't be separated by the year)
    • QNY
    • NQY
    • YQN
    • YNQ
  • Separators
    • .
    • /
    • .
    • ;
  • Miscellaneous filler words/characters
    • of
    • ,
    • '
    • ()

From this it's fairly straightforward (if, perhaps, a bit tedious) to construct a regular expression which would match almost any conceivable variation for representing the first quarter of 2024 and then, by trivial extension, any quarter of any year.

Finally, as @curt.kennedy explained better than I did in my initial reply to you, use text normalization.

You would, using the regex you created from the above, find every reference to any particular quarter and replace it with a standard, normalized version of your choosing, say `2024Q1` which would be unambiguous and easy to find in the text.

You would then use this updated text as the context source so there is no need for the model to understand 1,000 different ways to represent the same concept, ultimately making your (and its) job much easier.
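As a rough sketch of how the pieces listed above might be assembled, here is one possible regex built from the N, Q, and Y components, normalizing to `2024Q1`. It is not exhaustive and it is deliberately permissive: two-digit years are assumed to mean 20YY, and a bare two-digit number next to a quarter will be treated as a year. All names here (`normalize_quarters`, `QUARTER_NUMBERS`, and so on) are hypothetical.

```python
import re

# Spellings of the quarter number, mapped to an integer.
QUARTER_NUMBERS = {
    "1": 1, "2": 2, "3": 3, "4": 4,
    "1st": 1, "2nd": 2, "3rd": 3, "4th": 4,
    "one": 1, "two": 2, "three": 3, "four": 4,
    "first": 1, "second": 2, "third": 3, "fourth": 4,
}

N = r"(?:1st|2nd|3rd|4th|[1-4]|first|second|third|fourth|one|two|three|four)"
Q = r"(?:quarter|qtr|q)"
Y = r"(?:(?:19|20)\d{2}|['\u2019]?\d{2})"      # 4-digit year, or 2-digit with optional apostrophe
FILL = r"[\s,/;.()\-]{0,4}(?:of[\s(]+)?"       # separators and filler between the two terms

def quarter(tag):
    # Either "Q" then number (Q1, Q-1, Qtr 1) or number then "Q" (1Q, 1st Quarter).
    return rf"(?:{Q}[\s.\-]*(?P<qn{tag}>{N})|(?P<nq{tag}>{N})\s*{Q})"

PATTERN = re.compile(
    r"(?<![A-Za-z0-9])\(?(?:"
    rf"{quarter('a')}{FILL}(?P<ya>{Y})"        # quarter first, then year
    rf"|(?P<yb>{Y}){FILL}{quarter('b')}"       # year first, then quarter
    r")\)?(?![A-Za-z0-9])",
    re.IGNORECASE,
)

def _replace(match):
    g = match.groupdict()
    number = g["qna"] or g["nqa"] or g["qnb"] or g["nqb"]
    year = (g["ya"] or g["yb"]).lstrip("'\u2019")
    if len(year) == 2:
        year = "20" + year                      # ASSUMPTION: two-digit years are 20YY
    normalized = f"{year}Q{QUARTER_NUMBERS[number.lower()]}"
    # Keep surrounding parentheses balanced if the match swallowed only one of a pair.
    opens, closes = match.group(0).count("("), match.group(0).count(")")
    if opens > closes:
        normalized = "(" + normalized
    elif closes > opens:
        normalized += ")"
    return normalized

def normalize_quarters(text):
    return PATTERN.sub(_replace, text)

print(normalize_quarters("Margins improved from 1st Qtr 2023 to Q1'24 (see 2024 - Q2)."))
```

Using one combined pattern rather than two passes lets the scan decide left to right whether a year belongs to the quarter before it or after it.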


Agree with @elmstedt here.

You are essentially creating your own "pre-tokenizer" which reduces or eliminates any OOV (out of vocabulary) terms prior to entering the LLM.

You do this through normalization, which often includes regular expressions and lookup table substitutions. You can even set alarms (or log) OOV events so that you can handle them, by updating the normalization, and better processing future events, since not everything is known on Day 0.

But regex and various direct substitutions are part of this process.

Essentially, if you can define what you don't want, you can define a transformation that will alter the unwanted data into a desired form.

This takes some "elbow grease" but is often more straightforward, and kinda fun, once you get started. It's basically a game that you can largely win.
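The logging of OOV events mentioned above could look something like this sketch. It assumes the normalization step has already run and produces the `2024Q1` form, and it uses a deliberately loose detector for anything that still looks like a quarter reference. The function names and the logger name are invented for the example.

```python
import logging
import re

logger = logging.getLogger("normalizer")

# Loose detector for leftover quarter-like tokens. The word boundaries mean it
# does not fire inside an already normalized token such as "2024Q1".
SUSPECT = re.compile(r"\b(?:Q[1-4]|[1-4]Q|Qtr|Quarter)\b", re.IGNORECASE)

def find_unnormalized(text, context=15):
    """Return short snippets around quarter-like tokens that were not normalized."""
    snippets = []
    for match in SUSPECT.finditer(text):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        snippets.append(text[start:end])
    return snippets

def check_chunk(text):
    """Log every leftover so the normalization rules can be extended later."""
    leftovers = find_unnormalized(text)
    for snippet in leftovers:
        logger.warning("Unnormalized quarter reference near: %r", snippet)
    return leftovers

check_chunk("Sales in 2024Q1 beat the Q3 FY23 guidance.")
```

Reviewing these warnings periodically is one way to find the formats nobody anticipated on Day 0.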


Thanks a lot, this is a good solution. Instead of thinking of all possible options beforehand, I am thinking of maintaining a table of regexes. During extraction, I apply all the regexes against the chunk. I do the same with the prompt as well. I think this will be a consistent and maintainable solution.
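A small sketch of what such a regex table might look like, with the same rules applied to both the chunk and the prompt. The rules shown are only examples: the `YoY` expansion is a made-up abbreviation rule, and the two-digit year rule assumes 20YY. The output format follows the `Q{quarter}-{year}` form from the first post.

```python
import re

# Hypothetical rule table: (compiled pattern, replacement). Rules run in order.
RULES = [
    (re.compile(r"\bQ([1-4])\s*['\u2019](\d{2})\b"), r"Q\1-20\2"),  # Q1'23 becomes Q1-2023 (assumes 20YY)
    (re.compile(r"\bQ([1-4])\s+((?:19|20)\d{2})\b"), r"Q\1-\2"),    # Q1 2023 becomes Q1-2023
    (re.compile(r"\bYoY\b"), "year-over-year"),                     # example abbreviation rule
]

def apply_rules(text, rules=RULES):
    """Apply every rule in the table to the text, in order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

chunk = apply_rules("Revenue grew 12% YoY in Q1'23.")
question = apply_rules("What was revenue growth in Q1 2023?")
print(chunk)     # Revenue grew 12% year-over-year in Q1-2023.
print(question)  # What was revenue growth in Q1-2023?
```

Because both sides go through the same table, the chunk and the question end up using the same token for the same quarter.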


There is a library for this from Stanford, and a Python wrapper is also available.
You can check it out. I have used it, and it pretty much does the job.
Search for this term: FraBle/python-sutime

Thanks. My problem is not only dates, though. There are other abbreviations that I also need to expand.