Here’s an example document from a test corpus I’m working with called Bowl Season. The corpus currently contains 42 documents (43 after next Monday), each summarizing the details and outcome of one of the 43 college football bowl games, so every game is a separate document:
# Capital One Orange Bowl
Georgia Bulldogs (13-1)
Q1: 7
Q2: 35
Q3: 14
Q4: 7
TOTAL: 63
Florida State Seminoles (13-1)
Q1: 0
Q2: 3
Q3: 0
Q4: 0
TOTAL: 3
## Game Information
Hard Rock Stadium
Miami Gardens, FL
4:00 PM EST, December 30, 2023
Coverage: ESPN/ESPN+
Line: UGA -23.5
Over/Under: 47.5
Attendance: 63,324 (97%)
CAPACITY: 64,992
## Game Leaders
### Passing Yards
UGA
C. Beck
13-18, 203 YDS, 2 TD
FSU
B. Glenn
9-26, 139 YDS, 2 INT
### Rushing Yards
UGA
K. Milton
9 CAR, 104 YDS, 2 TD
FSU
J. Douglas
8 CAR, 46 YDS
### Receiving Yards
UGA
D. Bell
5 REC, 86 YDS
FSU
K. Poitier
4 REC, 84 YDS
## Team Stats
### Total Yards
UGA: 673
FSU: 209
### Turnovers
UGA: 0
FSU: 4
### 1st Downs
UGA: 36
FSU: 11
### Possession
UGA: 35:38
FSU: 24:22
## Scoring Summary
### 1ST QUARTER
Georgia Bulldogs
TD 4:05
Kendall Milton 15 Yd Run (Peyton Woodring Kick)
7 plays, 69 yards, 2:40
### 2ND QUARTER
Georgia Bulldogs
TD 14:57
Kendall Milton 5 Yd Run (Peyton Woodring Kick)
6 plays, 82 yards, 1:56
Florida State Seminoles
FG 12:34
Ryan Fitzgerald 22 Yd Field Goal
6 plays, 71 yards, 2:23
Georgia Bulldogs
TD 10:38
Daijun Edwards 15 Yd Run (Peyton Woodring Kick)
4 plays, 75 yards, 1:56
Georgia Bulldogs
TD 10:18
Ladd McConkey 27 Yd Run (Peyton Woodring Kick)
1 play, 27 yards, 0:20
Georgia Bulldogs
TD 3:39
Arian Smith 12 Yd pass from Carson Beck (Peyton Woodring Kick)
5 plays, 62 yards, 2:20
Georgia Bulldogs
TD 0:24
Dominic Lovett 2 Yd pass from Carson Beck (Peyton Woodring Kick)
3 plays, 51 yards, 0:25
### 3RD QUARTER
Georgia Bulldogs
TD 9:23
Daijun Edwards 2 Yd Run (Peyton Woodring Kick)
10 plays, 75 yards, 5:37
Georgia Bulldogs
TD 2:30
Lawson Luckie 4 Yd pass from Gunner Stockton (Peyton Woodring Kick)
10 plays, 90 yards, 4:31
### 4TH QUARTER
Georgia Bulldogs
TD 12:10
Anthony Evans III 14 Yd pass from Gunner Stockton (Peyton Woodring Kick)
9 plays, 84 yards, 4:31
You can see how I’m using headers to control the chunking of the data. The system I’m building to consume this corpus can already accurately answer questions like “tell me the final score of every bowl game and the player from each game that had the most receiving yards.” Both GPT-4 and GPT-3.5 return both correctly for all 42 games.
There are more tricks needed to make that work than just proper data prep, but data prep is super important.
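To make the header-driven chunking concrete, here is a minimal sketch of one way it could be done. It is not the code from my system; the function name `chunk_by_headers` and the chunk layout (a header path plus the text under it) are assumptions for illustration only.

```python
import re

# ATX-style markdown header: one to six '#' characters, whitespace, then the title.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headers(markdown_text):
    """Split a markdown document into one chunk per header that has body text.

    Each chunk carries its full header path (title > section > subsection),
    so a chunk retrieved on its own can still be traced back to its game.
    Text that appears before the first header is ignored in this sketch.
    """
    chunks = []
    path = []  # stack of (level, title) for the headers currently open
    body = []  # lines collected under the current header

    def make_chunk():
        return {"path": [title for _, title in path],
                "text": "\n".join(body).strip()}

    for line in markdown_text.splitlines():
        match = HEADER_RE.match(line)
        if not match:
            body.append(line)
            continue
        # A new header closes the chunk that was being collected.
        if path and any(item.strip() for item in body):
            chunks.append(make_chunk())
        level = len(match.group(1))
        # Close headers at the same or a deeper level before opening this one.
        while path and path[-1][0] >= level:
            path.pop()
        path.append((level, match.group(2).strip()))
        body = []

    if path and any(item.strip() for item in body):
        chunks.append(make_chunk())
    return chunks
```

Run on the document above, this would produce a chunk for the title block, one for Game Information, and one for each `###` subsection, each tagged with a path such as `["Capital One Orange Bowl", "Team Stats", "Turnovers"]`. Headers with nothing directly under them, like Game Leaders, only show up as part of their children's paths.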
One more tip: ideally you want to present the model with a condensed version of the document that contains all the information needed to answer the question while retaining the general structure of the source document. So let’s say the query is “what games played in Florida had 4th quarter scores by the winner?” This is the ideal text you would show the model:
# Capital One Orange Bowl
Georgia Bulldogs (13-1)
Q1: 7
Q2: 35
Q3: 14
Q4: 7
TOTAL: 63
Florida State Seminoles (13-1)
Q1: 0
Q2: 3
Q3: 0
Q4: 0
TOTAL: 3
## Game Information
Hard Rock Stadium
Miami Gardens, FL
4:00 PM EST, December 30, 2023
## Scoring Summary
### 4TH QUARTER
Georgia Bulldogs
TD 12:10
Anthony Evans III 14 Yd pass from Gunner Stockton (Peyton Woodring Kick)
9 plays, 84 yards, 4:31
That is the minimum information needed for the model to answer that question. The challenge is working out that this is the text you need to show the model.
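Once you know which sections matter, putting the condensed text together is the easy part. The sketch below covers only that assembly step and assumes the selection has already been made somehow. The function name `condense` and the `keep` parameter are hypothetical, not taken from my system. It also works at section granularity, so it would keep all of Game Information rather than trimming individual lines the way the example above does.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def condense(markdown_text, keep):
    """Return only the sections whose header path is in `keep`.

    `keep` is a set of tuples given relative to the document title, e.g.
    ("Scoring Summary", "4TH QUARTER"). Matching is exact: keeping a section
    does not automatically keep its subsections. The title block (the text
    under the level-1 header) is always kept. Ancestor headers are emitted
    without their bodies so the output retains the outline of the source.
    """
    out = []
    path = []        # stack of (level, title) for the headers currently open
    emitted = set()  # header paths already written to the output
    keeping = False

    for line in markdown_text.splitlines():
        match = HEADER_RE.match(line)
        if not match:
            if keeping:
                out.append(line)
            continue
        level = len(match.group(1))
        while path and path[-1][0] >= level:
            path.pop()
        path.append((level, match.group(2).strip()))
        titles = tuple(title for _, title in path)
        keeping = level == 1 or titles[1:] in keep
        if keeping:
            # Write any ancestor header that has not been written yet.
            for depth in range(1, len(path) + 1):
                key = titles[:depth]
                if key not in emitted:
                    emitted.add(key)
                    header_level, header_title = path[depth - 1]
                    out.append("#" * header_level + " " + header_title)
    return "\n".join(out)
```

For the Florida query, calling it with `keep = {("Game Information",), ("Scoring Summary", "4TH QUARTER")}` would give the title block with both teams' quarter scores, the Game Information section, and the 4th quarter scoring entry under its Scoring Summary header, with everything else dropped.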
And here’s GPT-3.5’s answer to that question: