Question about JSONL file for search endpoint

Hi there,

I’m planning to use the Completions endpoint with a prompt based on results from Search. The goal is to dynamically construct the completion prompt from the data returned by Search, like so:

Prompt structure:

Instructions to the engine.
_
Example 1
_
Example 2
_
Example 3
_
User input field 1 label: user input field 1 value.
User input field 2 label: user input field 2 value.
User input field 3 label: user input field 3 value.
User input field 4 label: user input field 4 value.

Completion by the engine:

where the examples are the documents the search engine finds in the file.

Here is what I need each example to look like:

Example structure:

User input field 1 label: meta field 1 value.
User input field 2 label: meta field 2 value.
User input field 3 label: meta field 3 value.
User input field 4 label: text found by search engine

Completion by the engine: meta field 4 value

I was wondering if the objects in the JSONL file can be structured like this:

Document object in file structure:

{
  "text": "User input field 4 value",
  "metadata": {
    "meta field 1 label": "User input field 1 value",
    "meta field 2 label": "User input field 2 value",
    "meta field 3 label": "User input field 3 value",
    "meta field 4 label": "Completion by engine previously validated by human"
  }
}
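For reference, here is a minimal Python sketch of writing documents in that shape to a JSONL file. The field names and values are placeholders taken from the structure above, not a schema required by the API:

```python
import json

# Hypothetical documents following the structure above; the field
# names are placeholders, not anything the API mandates.
documents = [
    {
        "text": "User input field 4 value",
        "metadata": {
            "meta field 1 label": "User input field 1 value",
            "meta field 2 label": "User input field 2 value",
            "meta field 3 label": "User input field 3 value",
            "meta field 4 label": "Completion previously validated by a human",
        },
    },
]

# JSONL: one JSON object per line, no surrounding array.
with open("examples.jsonl", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(json.dumps(doc, ensure_ascii=False) + "\n")
```

Each line must parse as a standalone JSON object for the upload to be accepted.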

The goal is to use the file as a source of examples to feed the completion engine and improve the generated text, and, given the prompt size limit, to use the search engine to find the most relevant examples based on the user input. Sort of training data for the model.

Questions:

1. Can documents be structured like that in the file?
2. Will the search engine use the metadata as well to find the best results?
3. How should I build the search query text to include all 4 user input fields?
4. How can I make the search engine return a set of full documents (text and metadata) in response to the query? (In the docs it looks like it returns document IDs with metadata; not clear to me, sorry.) Edit: found the answer here: OpenAI API
5. Is there any way to limit the search results based on the volume of text (say, it can return 4 smaller results, but only 2 results if the text is long)?
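For question 3, the simplest approach I can think of is concatenating the labeled fields into a single query string. A minimal sketch, where the labels are placeholders matching the prompt structure above:

```python
def build_query(fields: dict) -> str:
    """Join the labeled user-input fields into one query string.

    `fields` maps a field label to the user-supplied value; the
    labels are placeholders, not anything the API requires.
    """
    return "\n".join(f"{label}: {value}" for label, value in fields.items())

query = build_query({
    "User input field 1 label": "value 1",
    "User input field 2 label": "value 2",
    "User input field 3 label": "value 3",
    "User input field 4 label": "value 4",
})
```

The same label-colon-value layout as the documents in the file should help the semantic ranking line up with the examples.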

Any curl examples related to this would help a lot.

Thank you.

Hi Serge, metadata is not used by the endpoint in any way.

In short, for the above questions, it’s simply important that it’s valid JSON, but the metadata won’t be used.

There’s no way to use the Search endpoint to directly modify the volume of text returned, but you could simply retrieve the relevant output.
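For what it’s worth, a rough sketch of calling the file-based Search endpoint with Python’s standard library. The engine name, file ID, and parameter names (`max_rerank`, `return_metadata`) are assumptions to be double-checked against the API reference, and the request is only sent when an API key is configured:

```python
import json
import os
import urllib.request

# Assumed request body for the file-based Search endpoint; verify
# the exact parameter names against the API reference.
payload = {
    "file": "file-abc123",    # hypothetical uploaded JSONL file ID
    "query": "the concatenated user input fields",
    "max_rerank": 5,          # ask for the top 5 documents
    "return_metadata": True,  # include each document's metadata
}

def search(engine: str = "davinci") -> dict:
    """POST the payload to the Search endpoint. Requires OPENAI_API_KEY."""
    req = urllib.request.Request(
        f"https://api.openai.com/v1/engines/{engine}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Only hit the network when a key is actually configured.
if os.environ.get("OPENAI_API_KEY"):
    print(search())
```

The equivalent curl just POSTs the same JSON body to the same URL with the `Authorization: Bearer` header.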

If you have labels you want to use, maybe you would benefit from checking out the Classification endpoint.

Ok, so I think I will request, say, 5 results and then calculate the “available” prompt volume, using the results from top-ranked to bottom until I fill the volume.
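That top-ranked-to-bottom fill can be sketched like this; a crude character budget stands in for the real token count, which would need a proper tokenizer:

```python
def fill_prompt(results, budget_chars: int):
    """Take ranked search results (best first) until the budget is spent.

    `results` is a list of example strings already ordered by score;
    the character budget is a stand-in for a real token budget.
    """
    chosen, used = [], 0
    for example in results:
        if used + len(example) > budget_chars:
            break  # the next example would overflow the prompt
        chosen.append(example)
        used += len(example)
    return chosen

# Two short examples fit a 30-character budget; the long one does not.
picked = fill_prompt(["short one", "short two", "a much longer example text"], 30)
```

The budget would be the model’s context size minus the instructions, the user input block, and the room reserved for the completion itself.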

The labeling of the user input will be used to choose which model is responsible for content generation (so that the models stay highly targeted and optimized for their purpose). Each model will have its own data file, so there is no need to introduce labels in the document structure.

Thanks for the feedback. Highly appreciated.


Summary created by AI.

User sergeliatko is building a completion prompt reliant on the results from a search. They are structuring the document objects in a JSONL file such that user input fields and corresponding meta fields are fed into the completion engine to generate the required text. They aim to use the search engine to find the most relevant examples, essentially serving as training data for their model. They have questions about structuring the documents in the files, how the search engine uses metadata, how to build the search query text for different user input fields, and limitations on the search results based on the text volume.

User joey clarifies that while the structure can be any valid JSON, metadata is not used by the endpoint. The volume of text returned by the Search endpoint can’t be modified directly, but the necessary output can be retrieved. Joey suggests that if sergeliatko has specific labels, they might find the Classification endpoint useful.

Replying, sergeliatko decides to request five results, then calculate the available prompt volume to utilize results from top-ranked to bottom until the volume is full. They don’t see a need for labels in the document structure as user input labelling will be used to determine which model will be responsible for content generation. Referenced discussion: ref.

Summarized with AI on Jul 22
AI used: gpt-4-32k