Extract any valid dates from random sentences and format them in a common format?

We are trying to find the most accurate way to extract any valid dates from random sentences and convert them to a common format.

Has anyone done this before?

This is pretty easy. You can get better accuracy with few-shot learning.

Nice, testing it out now. Also, the dates in the random sentences may not be in a normal date format; they could be 2/12/09, 1809, feb 12th, 02/09/1809, etc.

Yes, it works better with few-shot learning :smiley:

perfecto! testing it out now.

We are dealing with speech-to-text transcriptions, so sometimes the dates are messy. This is how far we have gotten:

Extract any dates from the following sentence and write them in the format of MM/DD/YYYY and remove any extra words. If there are no dates then display No-Date-Found. Examples:

  1. 522 1999 : 05/22/1919
  2. 122 1952 : 01/22/1952
  3. 522 1999 : 05/22/1999
  4. 2181 : 02/01/1981
  5. February 9th 1949 : 02/09/1949
  6. 79 2004 : 07/09/2004
  7. March 24, 1964 : 03/24/1964
  8. nine eight 2019 : 09/08/2019

Query:
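
For anyone who wants to keep the examples in one place instead of pasting them by hand, here is a minimal sketch of building that prompt in Python. The instruction text and example pairs are copied from the post above; the names `INSTRUCTION`, `EXAMPLES`, and `build_prompt` are made up for illustration, and the sketch only builds the string, it does not call any API.

```python
# Instruction and examples copied from the prompt above.
# All names here are hypothetical helpers, not part of any library.
INSTRUCTION = (
    "Extract any dates from the following sentence and write them in the "
    "format of MM/DD/YYYY and remove any extra words. If there are no dates "
    "then display No-Date-Found."
)

EXAMPLES = [
    ("122 1952", "01/22/1952"),
    ("2181", "02/01/1981"),
    ("February 9th 1949", "02/09/1949"),
    ("nine eight 2019", "09/08/2019"),
]


def build_prompt(query, examples=EXAMPLES, instruction=INSTRUCTION):
    """Join the instruction, the numbered examples, and the query into one string."""
    lines = [instruction + " Examples:", ""]
    for i, (raw, formatted) in enumerate(examples, start=1):
        lines.append(f"  {i}. {raw} : {formatted}")
    lines += ["", f"Query: {query}"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("march third nineteen seventy"))
```

Keeping the pairs in a list makes it easy to add, remove, or reorder examples while testing which ones actually help.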

If you have any further suggestions, let me know. :sunglasses:

We are going to give fine-tuning a go. If anyone has any words of wisdom, let us know.

will post results here once tested.
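
As a rough sketch of what preparing fine-tuning data could look like: the snippet below turns example pairs into JSONL with `prompt` and `completion` keys, which is the layout used by OpenAI's legacy completion fine-tuning. The separator (` ->`), the newline stop sequence, and the function name `to_jsonl` are assumptions for illustration, so check the current fine-tuning docs before relying on this exact format.

```python
import json

# Pairs copied from the examples in this thread.
PAIRS = [
    ("122 1952", "01/22/1952"),
    ("February 9th 1949", "02/09/1949"),
    ("nine eight 2019", "09/08/2019"),
]


def to_jsonl(pairs, separator=" ->", stop="\n"):
    """Return one JSON object per line with "prompt" and "completion" keys.

    The separator and stop sequence are assumptions: a fixed marker at the
    end of each prompt and a fixed ending on each completion.
    """
    rows = []
    for raw, formatted in pairs:
        row = {"prompt": raw + separator, "completion": " " + formatted + stop}
        rows.append(json.dumps(row))
    return "\n".join(rows)


if __name__ == "__main__":
    print(to_jsonl(PAIRS))
```

With this layout the long example list moves into the training file, so each request only needs to carry the query itself.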

https://community.openai.com/t/five-rules-for-finetuning-from-my-experience-observations-and-consulting/15587/10

Yup, read that. The issue is that unless we do this, the prompt gets bigger and bigger:

Given the examples, display only the date of birth found as two digits for the month, two digits for the day, and four digits for the year in the format of MM/DD/YYYY. If no date of birth is found, display DOB-Not-Found.
Examples:
six 2019 70 : 06/20/1970
Eight Two 1967 : 08/02/1967
non 18 20 19 : 09/18/2019
7.3 1941 : 07/03/1941
five two nine five three : 05/29/1953
three fourteen 32 : 03/14/1932
528 1957 : 05/28/1957
10:11 1946 : 10/11/1946
98 2019 : 09/08/2019
1 2 1966 : 01/02/1966
7:30 165 : 06/30/1965
4691 : 04/06/1991
four six ninety-one : 04/06/1991
219 1952 : 02/19/1952
12:14 1989 : 12/14/1989
six six seven eight : 06/06/1978
one three 34 : 01/03/1934
February second one nine six four : 02/02/1964
11:15 1957 : 11/15/1957
5 14 19 49 : 05/14/1949
five 1419 49 : 05/14/1949
2 5 of 55 : 02/05/1955
5th day of June 1999 : 06/05/1999
seven eight nineteen sixty-four : 07/08/1964
98 2019 : 09/08/2019
database 74 64 : 07/04/1964
810 64 : 08/10/1964
96 19 51 : 09/06/1951
for 17 1969 : 04/17/1969
6119 : 06/01/1990
to nineteen fifty-four : 02/02/1954
ten. Five. 2008 : 10/05/2008
71736 : 07/17/1936
627 1949 : 06/27/1949
two. Four. 1947 : 02/04/1947
360 1958 : 03/16/1958
October second nineteen sixty two : 10/02/1962
522 1999 : 05/22/1919
122 1952 : 01/22/1952
2181 : 02/01/1981
February 9th 1949 : 02/09/1949
79 2004 : 07/09/2004
March 24, 1964 : 03/24/1964
nine eight 2019 : 09/08/2019
five twenty eight nineteen fifty-seven : 05/28/1957
360 1958 : 03/16/1958
326 77 : 03/26/1977
314. 32 : 03/14/1932

Query:

BUT IT WORKS GREAT! So it will work; it is a matter of figuring out the most efficient method in OpenAI to do so.

What do you hope to achieve with finetuning? Faster? Cheaper? Better performance?

We just want it to work, so whatever method works is the one we want to use.

For sure it can work, but just like everything else, the most accurate method is not yet known.

We are also going to try similar experiments with plain old NLP.

But the OpenAI approach is our preferred method.
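
For comparison, a rule-based baseline might look something like the sketch below. It only handles two of the cleaner patterns from this thread (a full month name followed by day and year, and numeric M/D/YYYY), validates the result with `datetime.date`, and returns the `No-Date-Found` sentinel otherwise. The function names are made up, and it makes no attempt at the messy transcription cases like `522 1999` or abbreviations like `feb`.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

# "February 9th 1949" or "March 24, 1964"
NAMED = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})\b"
)
# "02/09/1809" or "2-12-1909", read as month, day, year
NUMERIC = re.compile(r"\b(\d{1,2})[/-](\d{1,2})[/-](\d{4})\b")


def _format(month, day, year):
    """Return MM/DD/YYYY if the parts form a real calendar date."""
    try:
        d = date(year, month, day)
    except ValueError:
        return "No-Date-Found"
    return f"{d.month:02d}/{d.day:02d}/{d.year:04d}"


def normalize(text):
    """Try the two patterns in order; fall back to the sentinel."""
    lowered = text.lower()
    m = NAMED.search(lowered)
    if m:
        return _format(MONTHS[m.group(1)], int(m.group(2)), int(m.group(3)))
    m = NUMERIC.search(lowered)
    if m:
        return _format(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    return "No-Date-Found"
```

A baseline like this is mostly useful for measuring how much the few-shot prompt buys you on the messy inputs, and for sanity-checking model output on the clean ones.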

Question: do the examples above count against quota and tokens, and are we charged for them?

Yes. If you want it to not pull certain numbers (i.e., non-date numbers), I would give it a few of those examples too. Good luck!
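
A small sketch of both points, under stated assumptions: the negative examples below are invented for illustration (they are not from the thread), and the size estimate uses the rough "about 4 characters per token" rule of thumb for English text, so real tokenizer counts will differ.

```python
# Positive pairs copied from the thread.
POSITIVE_EXAMPLES = [
    ("122 1952", "01/22/1952"),
    ("nine eight 2019", "09/08/2019"),
]

# Hypothetical negative examples: numbers that are not dates.
NEGATIVE_EXAMPLES = [
    ("my account number is 4471", "No-Date-Found"),
    ("call me back in 15 minutes", "No-Date-Found"),
]


def format_examples(pairs):
    """Render pairs in the 'input : output' style used in the prompts here."""
    return "\n".join(f"{raw} : {out}" for raw, out in pairs)


def rough_token_estimate(text, chars_per_token=4):
    """Very rough size estimate; a real tokenizer will give different counts."""
    return max(1, len(text) // chars_per_token)


if __name__ == "__main__":
    block = format_examples(POSITIVE_EXAMPLES + NEGATIVE_EXAMPLES)
    print(block)
    print("approx. tokens sent with every request:", rough_token_estimate(block))
```

Since the example block is sent with every request, its estimated size is a per-request overhead, which is worth weighing when deciding how many examples to include.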

Good point, will try that too. Whatever we find, we will post the results here.
