Okay, I think I’ve gone as far as the API will allow me to go for now. Here is the final code for anyone who wants to play around with this idea. I managed to get some interesting information out of the annotations, but the character positions they gave me were off. I am pretty sure it has to do with the way the annotations are denoted.
Here is where I ended up. I got down to the message level with this code:
# Get the newest message in the thread (list newest-first, take the first id)
message = client.beta.threads.messages.retrieve(
    thread_id=thread.id,
    message_id=client.beta.threads.messages.list(thread_id=thread.id, order="desc").data[0].id,
)
print(message.content[0].text)
Which gave me this result:
Text(annotations=[FileCitationAnnotation(end_index=256, file_citation=FileCitation(file_id=‘file-ZpcgUzqAk0M6530CLcLkMT52’), start_index=244, text=‘【8:3†source】’, type=‘file_citation’), FileCitationAnnotation(end_index=375, file_citation=FileCitation(file_id=‘file-ZpcgUzqAk0M6530CLcLkMT52’), start_index=363, text=‘【8:9†source】’, type=‘file_citation’), FileCitationAnnotation(end_index=485, file_citation=FileCitation(file_id=‘file-ZpcgUzqAk0M6530CLcLkMT52’), start_index=473, text=‘【8:9†source】’, type=‘file_citation’), FileCitationAnnotation(end_index=616, file_citation=FileCitation(file_id=‘file-ZpcgUzqAk0M6530CLcLkMT52’), start_index=603, text=‘【8:12†source】’, type=‘file_citation’), FileCitationAnnotation(end_index=769, file_citation=FileCitation(file_id=‘file-ZpcgUzqAk0M6530CLcLkMT52’), start_index=756, text=‘【8:10†source】’, type=‘file_citation’), FileCitationAnnotation(end_index=914, file_citation=FileCitation(file_id=‘file-ZpcgUzqAk0M6530CLcLkMT52’), start_index=901, text=‘【8:13†source】’, type=‘file_citation’), FileCitationAnnotation(end_index=1022, file_citation=FileCitation(file_id=‘file-ZpcgUzqAk0M6530CLcLkMT52’), start_index=1009, text=‘【8:19†source】’, type=‘file_citation’), FileCitationAnnotation(end_index=1135, file_citation=FileCitation(file_id=‘file-ZpcgUzqAk0M6530CLcLkMT52’), start_index=1123, text=‘【8:0†source】’, type=‘file_citation’)], value=‘Here are the main characters in “Dracula” along with the citations for their first introduction in the text:\n\n1. Jonathan Harker:\n - Jonathan Harker is introduced through his journal entry describing his approach to Count Dracula's castle【8:3†source】.\n\n2. Mina Murray (Harker):\n - Mina Murray is introduced through a letter to her friend Lucy Westenra【8:9†source】.\n\n3. Lucy Westenra:\n - Lucy Westenra is introduced through the same letter from Mina Murray【8:9†source】.\n\n4. Count Dracula:\n - Count Dracula is introduced when Jonathan Harker meets him at his castle in Transylvania【8:12†source】.\n\n5. Dr. John Seward:\n - Dr. John Seward is introduced through his diary entries discussing Renfield and the activities at his asylum【8:10†source】.\n\n6. Arthur Holmwood (Lord Godalming):\n - Arthur Holmwood is introduced in the context of Lucy’s suitors and their engagement【8:13†source】.\n\n7. Quincey Morris:\n - Quincey Morris is introduced alongside the other suitors of Lucy【8:19†source】.\n\n8. Renfield:\n - Renfield is introduced through Dr. Seward’s observational notes in his diary【8:0†source】.\n\nThese citations provide the exact sections in “Dracula” where the main characters are first introduced.’)
I then broke down the annotations with this code:
# Extract the message content and annotations
message_text_object = message.content[0]
message_text_content = message_text_object.text.value  # Access the value attribute for the actual text
annotations = message_text_object.text.annotations  # Access annotations directly
# Print the annotations in a cleaner format
for index, annotation in enumerate(annotations):
    print(f"Annotation {index + 1}:")
    print(f" End Index: {annotation.end_index}")
    print(f" Start Index: {annotation.start_index}")
    print(f" Text: {annotation.text}")
    print(f" Type: {annotation.type}")
    if hasattr(annotation, 'file_citation'):
        file_citation = annotation.file_citation
        print(" File Citation:")
        print(f" File ID: {file_citation.file_id}")
    print("")  # Add a blank line for readability
Which gave me results like this:
Annotation 1:
 End Index: 256
 Start Index: 244
 Text: 【8:3†source】
 Type: file_citation
 File Citation:
 File ID: file-ZpcgUzqAk0M6530CLcLkMT52

Annotation 2:
 End Index: 375
 Start Index: 363
 Text: 【8:9†source】
 Type: file_citation
 File Citation:
 File ID: file-ZpcgUzqAk0M6530CLcLkMT52

Annotation 3:
 End Index: 485
 Start Index: 473
 Text: 【8:9†source】
 Type: file_citation
 File Citation:
 File ID: file-ZpcgUzqAk0M6530CLcLkMT52
So the good news is that there is a lot more information in the annotations than previously thought. The bad news is that it isn’t accurate. If I actually go to the character indexes listed, they don’t point to the proper references. Or, I should say, they don’t “appear” to point there. I suspect they do point to the right place if the citation markers that reference specific sections of the text can be deciphered properly.
For example, for Annotation 1, if we just go with characters 244 to 256 it doesn’t track (nor do words or tokens; I tried all three).
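One possibility that might be worth ruling out is that the indexes are offsets into the message's own `value` string rather than into the uploaded file. In every annotation shown above, `end_index - start_index` equals the length of the marker text (256 - 244 = 12 characters for 【8:3†source】, 616 - 603 = 13 for 【8:12†source】). Here is a minimal sketch of that check. The `Annotation` dataclass and `check_annotation_offsets` are made-up stand-ins for illustration, not SDK names, and the sample text is invented so the snippet runs without the API:

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical stand-in for the SDK's annotation object (illustration only)."""
    start_index: int
    end_index: int
    text: str


def check_annotation_offsets(value, annotations):
    """Slice `value` with each annotation's indexes and compare to its marker text.

    Returns a list of (annotation, sliced_text, matches) tuples.
    """
    results = []
    for ann in annotations:
        sliced = value[ann.start_index:ann.end_index]
        results.append((ann, sliced, sliced == ann.text))
    return results


# Invented sample text; the offsets are computed here, not copied from the API.
sample_value = "Jonathan Harker is introduced in his journal【8:3†source】."
marker = "【8:3†source】"
start = sample_value.find(marker)
sample_annotations = [Annotation(start, start + len(marker), marker)]

for ann, sliced, matches in check_annotation_offsets(sample_value, sample_annotations):
    print(f"{ann.start_index}-{ann.end_index}: {sliced!r} matches={matches}")
```

With the real objects from the code above, the equivalent call would be `check_annotation_offsets(message_text_content, annotations)`. If every slice matches its marker, the indexes describe where the marker sits in the reply, not where the passage sits in the source file.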
I think the secret lies in the position in the citation: 【8:3†source】
But I have been unable to decipher it…
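As a starting point for deciphering, the marker can at least be split into its parts so the numbers can be compared across annotations. This is only a sketch: it assumes every marker has the shape 【number:number†label】 seen in the output above, the names `MARKER_RE` and `parse_marker` are made up for illustration, and it makes no claim about what the two numbers mean.

```python
import re

# Assumed marker shape: 【<number>:<number>†<label>】
# The meaning of the two numbers is not known here.
MARKER_RE = re.compile(r"【(\d+):(\d+)†([^】]+)】")


def parse_marker(marker):
    """Split a citation marker into (first_number, second_number, label).

    Returns None if the string does not have the assumed shape.
    """
    match = MARKER_RE.fullmatch(marker)
    if match is None:
        return None
    first, second, label = match.groups()
    return int(first), int(second), label


# Markers copied from the annotations printed above.
markers = ["【8:3†source】", "【8:9†source】", "【8:12†source】", "【8:0†source】"]
for m in markers:
    print(m, "->", parse_marker(m))
```

In the sample output the first number is always 8 while the second varies from 0 to 19, so collecting the parsed pairs over several runs and files may show which part changes with the query and which with the passage.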