I understand your requirements a bit better now. The optimal way to do this would significantly tax and test the AI’s attention mechanism.
A format that doesn’t add excess token consumption would be good. We also want the document to still be understandable, although that is lower priority. I devised an input format that adds three (sometimes fewer) tokens per word, for all index numbers up to 999.
Here[[55]] are[[56]] five[[57]] NASDAQ[[58]] stocks[[59]] with[[60]] some[[61]] innovative[[62]] flair[[63]] and[[64]] growth[[65]] potential[[66]] minus[[67]] the[[68]] sticker[[69]] shock[[70]]:
SoFi[[71]] Technologies[[72]] (SOFI[[73]])[[74]] -[[75]] A[[76]] digital[[77]] finance[[78]] company[[79]] making[[80]] waves[[81]] with[[82]] its[[83]] all-in-one[[84]] app[[85]] for[[86]] banking[[87]] investing[[88]] and[[89]] loans[[90]].[[91]] If[[92]] they[[93]] keep[[94]] up[[95]] the[[96]] momentum[[97]] traditional[[98]] banks[[99]] might[[100]] need[[101]] to[[102]] start[[103]] looking[[104]] over[[105]] their[[106]] shoulders[[107]].
UiPath[[108]] (PATH[[109]])[[110]] -[[111]] Specializing[[112]] in[[113]] robotic[[114]] process[[115]] automation[[116]] (RPA[[117]])[[118]] these[[119]] guys[[120]] are[[121]] automating[[122]] the[[123]] boring[[124]] stuff[[125]] so[[126]] humans[[127]] can[[128]] do[[129]] the[[130]] fun[[131]] stuff[[132]] or[[133]] at[[134]] least[[135]] that’s[[136]] the[[137]] pitch[[138]].[[139]] As[[140]] businesses[[141]] push[[142]] for[[143]] more[[144]] efficiency[[145]] UiPath[[146]] could[[147]] be[[148]] a[[149]] big[[150]] winner[[151]].
Rivian[[152]] Automotive[[153]] (RIVN[[154]])[[155]] -[[156]] An[[157]] EV[[158]] startup[[159]] that’s[[160]] making[[161]] electric[[162]] trucks[[163]] and[[164]] SUVs[[165]] cool[[166]].[[167]] With[[168]] a[[169]] solid[[170]] backing[[171]] and[[172]] a[[173]] growing[[174]] order[[175]] book[[176]] Rivian[[177]] might[[178]] just[[179]] carve[[180]] out[[181]] a[[182]] sweet[[183]] spot[[184]] in[[185]] the[[186]] EV[[187]] market[[188]].
Marqeta[[189]] (MQ[[190]])[[191]] -[[192]] A[[193]] fintech[[194]] company[[195]] focused[[196]] on[[197]] modern[[198]] card[[199]] issuing[[200]] and[[201]] payment[[202]] processing[[203]].[[204]] They’re[[205]] behind[[206]] the[[207]] scenes[[208]] of[[209]] a[[210]] lot[[211]] of[[212]] digital[[213]] transactions[[214]] and[[215]] as[[216]] cash[[217]] fades[[218]] into[[219]] the[[220]] history[[221]] books[[222]] their[[223]] potential[[224]] could[[225]] skyrocket[[226]].
Beam[[227]] Therapeutics[[228]] (BEAM[[229]])[[230]] -[[231]] Working[[232]] on[[233]] gene-editing[[234]] technology[[235]] Beam[[236]] Therapeutics[[237]] is[[238]] at[[239]] the[[240]] forefront[[241]] of[[242]] personalized[[243]] medicine[[244]].[[245]] If[[246]] they[[247]] can[[248]] pull[[249]] off[[250]] what[[251]] they’re[[252]] promising[[253]] the[[254]] future[[255]] of[[256]] healthcare[[257]] might[[258]] have[[259]] their[[260]] name[[261]] on[[262]] it[[263]].
These[[264]] companies[[265]] are[[266]] still[[267]] climbing[[268]] the[[269]] ladder[[270]] but[[271]] the[[272]] top[[273]] could[[274]] be[[275]] a[[276]] lot[[277]] higher[[278]].[[279]] Of[[280]] course[[281]] the[[282]] slide[[283]] back[[284]] down[[285]] could[[286]] be[[287]] equally[[288]] impressive[[289]] so[[290]] hang[[291]] on[[292]] tight[[293]]![[294]]
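The indexing format above can be produced programmatically. Here is a minimal sketch, assuming a simple rule of "one ID after every word"; the function name `index_document` and the `start` parameter are my own, and the sample above also numbers some punctuation marks (closing parentheses, periods), which this sketch does not reproduce.

```python
import re

# A "word" here is a run of letters/digits, optionally joined by hyphens or
# apostrophes (so "all-in-one" and "that’s" each count as one word).
# This tokenization rule is an assumption, not the exact one used above.
WORD = re.compile(r"\w+(?:[-'’]\w+)*")


def index_document(text: str, start: int = 1) -> str:
    """Append [[n]] after every word, counting upward from `start`."""
    counter = start - 1

    def tag(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"{match.group(0)}[[{counter}]]"

    # re.sub walks the text left to right, so IDs follow document order.
    return WORD.sub(tag, text)


if __name__ == "__main__":
    print(index_document("I love smart computers, but hate AI.", start=33))
```

Passing a `start` other than 1 is what lets you index a later chunk of a longer document while keeping the IDs continuous.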
Technique 1:
The first way I will show has no grounding method, but its output is cheap. To produce the first ID, the AI essentially has to read the whole document and emit the number of the most likely matching word; generating each following token again requires comprehension of the document to produce the next correct ID.
SYSTEM:
You are an automated text word search engine, with no user to interact with. You receive a description list of the type of words or phrases you are to extract and return, along with a document section. You then read carefully through that labeled document (labeled with word ID numbers that appear immediately after the word in two square brackets), and when you encounter a word that meets any of the criteria provided in the description list, you output that word’s number, creating a comma-separated list of all ID number positions that have words meeting the criteria. Pay attention to the position in the document and extract matches systematically.
Example:
example search terms: [“affection”, “hate”]
example input text source: """I[[33]] love[[34]] smart[[35]] computers[[36]], but[[37]] hate[[38]] AI[[39]]."""
example response: 34, 38
USER:
Search description list:
[stock, company, business, earnings, ratio, cash]
Document chunk:
"""{your_indexed_document}"""
ASSISTANT RESPONSE (gpt-4o)
59, 79, 141, 195, 217, 265
You can see that when the AI has to comprehend essentially the whole document to find a single word and its ID, and the first match could appear near the bottom, you are giving the language model’s understanding a real challenge. I actually ran this, and the AI performed well at this input size; you could also chunk a longer document, as my mid-document indexes hint at.
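Because this technique has no grounding, it can help to map the returned IDs back to words on your side and eyeball them. A rough sketch of that check, assuming the reply is a comma-separated list of numbers as in the response above; `resolve_ids` is a hypothetical helper name, not part of any API.

```python
import re

# Matches "word[[123]]" and captures the word and its ID.
MARKED = re.compile(r"([^\s\[\]]+)\[\[(\d+)\]\]")


def resolve_ids(indexed_text: str, response: str):
    """Return (found, unknown): (id, word) pairs, plus IDs not in the text."""
    id_to_word = {int(num): word for word, num in MARKED.findall(indexed_text)}
    found, unknown = [], []
    for num in map(int, re.findall(r"\d+", response)):
        if num in id_to_word:
            found.append((num, id_to_word[num]))
        else:
            unknown.append(num)  # the model returned an ID that does not exist
    return found, unknown


if __name__ == "__main__":
    chunk = "five[[57]] NASDAQ[[58]] stocks[[59]] with[[60]]"
    print(resolve_ids(chunk, "59, 61"))
```

Any entry in `unknown`, or a resolved word that obviously doesn’t fit the search list, is a sign the model lost its place in the document.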
Technique 2:
Have the AI repeat the entire document back to you, and if the word being output matches the criteria, the word index number is marked with an exclamation point or other character, like:
task: "repeat back document exactly, but add !!! immediately after the number of the word position index if it is a word like “stock” or “banana”."
result: Here[[55]] are[[56]] five[[57]] NASDAQ[[58]] stocks[[59!!!]] with[[60]] some[[61]] innovative[[62]]…
This gives extremely high comprehension and needs less instruction; the only thing the attention needs to do is reflect back on the type of words to be highlighted. The one flaw might be the pattern that gets reinforced as output is produced: the AI may be unable to break away from plainly repeating the document to amend it where needed.
However, you’ll need to keep the (expensive) output length of a single API call low, under 1000 tokens including the indexing, so the AI model doesn’t terminate prematurely as it has been trained to do.
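On the receiving side, the marked-up echo can be read back with a small parser. This is only a sketch, assuming the `!!!` marker from the task above; `read_flagged_echo` is a made-up name, and the faithfulness check is a plain string comparison.

```python
import re

# Matches a flagged index such as [[59!!!]] and captures the number.
FLAG = re.compile(r"\[\[(\d+)!!!\]\]")


def read_flagged_echo(original: str, echoed: str):
    """Return (flagged_ids, faithful) for a model's marked-up repetition."""
    flagged = [int(num) for num in FLAG.findall(echoed)]
    # Strip the markers again; a faithful echo is then identical to the input.
    restored = FLAG.sub(r"[[\1]]", echoed)
    faithful = restored.strip() == original.strip()
    return flagged, faithful


if __name__ == "__main__":
    source = "five[[57]] NASDAQ[[58]] stocks[[59]] with[[60]]"
    reply = "five[[57]] NASDAQ[[58]] stocks[[59!!!]] with[[60]]"
    print(read_flagged_echo(source, reply))
```

If `faithful` comes back false, the model either drifted from the source text or stopped early, and that chunk is worth retrying.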
Technique 3:
Have the AI generate an extensive list of words that match the criteria you provide, with all possible variations and inflections. Then just do a programmatic search for those. That may not work if the criterion is “animals” or “colors”, with thousands of possibilities, though.
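The programmatic search half of this could look something like the following. It is a sketch under my own assumptions: the variant list is whatever the AI returned, matching is case-insensitive and on whole words only, and `find_variants` is a hypothetical name.

```python
import re


def find_variants(text: str, variants: list[str]):
    """Return (character offset, matched text) for each whole-word match."""
    if not variants:
        return []
    # Escape each variant so punctuation inside it is matched literally.
    alternatives = "|".join(re.escape(v) for v in set(variants))
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]


if __name__ == "__main__":
    words = ["stock", "stocks", "company", "companies"]
    sample = "Stocks rise, the company grew near the stockade."
    print(find_variants(sample, words))
```

The `\b` word boundaries are what keep "stock" from matching inside "stockade"; the cost is that every inflection you care about has to be in the list.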
Hope that gives you some inspiration.