Best Practices for Using LLMs to Programmatically Highlight Text with Unique IDs

Hi everyone,

I’m working on a project where I’m integrating an LLM to help users find specific segments of text based on their queries. Each word in the text has a unique ID, and I need the LLM’s response to enable me to programmatically highlight the correct words.

Has anyone tackled something similar or have suggestions on how to structure the LLM’s output to make this process smooth and accurate? I’d appreciate any tips or best practices.

Thanks in advance for your help!

I can do that, knowing how models comprehend and respond. The prompt below was successful on the first generation using gpt-4o. You will see I also created a technique for highlighting non-exact words, and an output format that should be straightforward to parse in code.

SYSTEM:

You are an automated text processor and text highlighter, with no user to interact with.

You take the input text, and the only alteration that you make is to add a highlight indicator with a special markdown format when a search term word (or close variant such as plural) is going to be reproduced in the document, so you add a container starting with ||, and an index dictionary with the search word id as index and the search word that matches. The highlight markdown is ||{id_number: ‘search_word’}word_in_document||

Example:

  • example search terms: [{23: ‘banana’}, {35: ‘mango’}]

  • example input text source: """My favorite fruits are bananas and apples."""

  • example response: My favorite fruits are ||{23: ‘banana’}bananas|| and apples.

USER:

# Search terms

[{1: ‘stock’}, {2: ‘company’}, {3: ‘businesses’}, {4: ‘earnings’}, {5: ‘ratio’}, {6: ‘cash’}]

# Source to highlight

"""Here are five NASDAQ stocks with some innovative flair and growth potential, minus the sticker shock:

  1. SoFi Technologies (SOFI) - A digital finance company making waves with its all-in-one app for banking, investing, and loans. If they keep up the momentum, traditional banks might need to start looking over their shoulders.

  2. UiPath (PATH) - Specializing in robotic process automation (RPA), these guys are automating the boring stuff so humans can do the fun stuff—or at least, that’s the pitch. As businesses push for more efficiency, UiPath could be a big winner.

  3. Rivian Automotive (RIVN) - An EV startup that’s making electric trucks and SUVs cool. With a solid backing and a growing order book, Rivian might just carve out a sweet spot in the EV market.

  4. Marqeta (MQ) - A fintech company focused on modern card issuing and payment processing. They’re behind the scenes of a lot of digital transactions, and as cash fades into the history books, their potential could skyrocket.

  5. Beam Therapeutics (BEAM) - Working on gene-editing technology, Beam Therapeutics is at the forefront of personalized medicine. If they can pull off what they’re promising, the future of healthcare might have their name on it.

These companies are still climbing the ladder, but the top could be a lot higher. Of course, the slide back down could be equally impressive, so hang on tight!"""

ASSISTANT OUTPUT:

Here are five NASDAQ ||{1: ‘stock’}stocks|| with some innovative flair and growth potential, minus the sticker shock:

  1. SoFi Technologies (SOFI) - A digital finance ||{2: ‘company’}company|| making waves with its all-in-one app for banking, investing, and loans. If they keep up the momentum, traditional banks might need to start looking over their shoulders.

  2. UiPath (PATH) - Specializing in robotic process automation (RPA), these guys are automating the boring stuff so humans can do the fun stuff—or at least, that’s the pitch. As ||{3: ‘businesses’}businesses|| push for more efficiency, UiPath could be a big winner.

  3. Rivian Automotive (RIVN) - An EV startup that’s making electric trucks and SUVs cool. With a solid backing and a growing order book, Rivian might just carve out a sweet spot in the EV market.

  4. Marqeta (MQ) - A fintech ||{2: ‘company’}company|| focused on modern card issuing and payment processing. They’re behind the scenes of a lot of digital transactions, and as ||{6: ‘cash’}cash|| fades into the history books, their potential could skyrocket.

  5. Beam Therapeutics (BEAM) - Working on gene-editing technology, Beam Therapeutics is at the forefront of personalized medicine. If they can pull off what they’re promising, the future of healthcare might have their name on it.

These ||{2: ‘company’}companies|| are still climbing the ladder, but the top could be a lot higher. Of course, the slide back down could be equally impressive, so hang on tight!

The system language could be a bit clearer for humans, but was comprehended by AI.
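To show what the parsing side could look like, here is a minimal Python sketch (my assumption of how you might consume the output, not something that was part of the run above). The function name parse_highlights and the returned field names are made up for illustration. It accepts straight or curly quotes around the search word, since either can show up in model output.

```python
import re

# One quote character: straight single, straight double, or curly single.
QUOTE = "['\"\u2018\u2019]"

# Matches ||{23: 'banana'}bananas||
HIGHLIGHT = re.compile(
    r"\|\|\{\s*(\d+)\s*:\s*" + QUOTE + r"(.*?)" + QUOTE + r"\s*\}(.*?)\|\|",
    re.DOTALL,
)

def parse_highlights(marked_text):
    """Return (plain_text, spans). Each span holds the search-term id, the
    search term, the word as it appears in the document, and the character
    offsets of that word in plain_text."""
    plain_parts = []
    spans = []
    cursor = 0      # read position in marked_text
    plain_len = 0   # length of the plain text emitted so far
    for m in HIGHLIGHT.finditer(marked_text):
        before = marked_text[cursor:m.start()]
        plain_parts.append(before)
        plain_len += len(before)
        word = m.group(3)
        spans.append({
            "term_id": int(m.group(1)),
            "term": m.group(2),
            "word": word,
            "start": plain_len,
            "end": plain_len + len(word),
        })
        plain_parts.append(word)
        plain_len += len(word)
        cursor = m.end()
    plain_parts.append(marked_text[cursor:])
    return "".join(plain_parts), spans
```

Comparing the returned plain text against your original source is a cheap way to confirm the model altered nothing except the highlights.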

(I hope this free consulting improves your earning power!)

Thanks @_j. I'd like to clarify my initial question.

In my project, each word in a text is associated with a unique ID. For example, the text might be represented as:

[
  {"id": 21, "word": "My"},
  {"id": 22, "word": "favorite"},
  {"id": 23, "word": "fruits"},
  {"id": 24, "word": "are"},
  {"id": 25, "word": "bananas"},
  {"id": 26, "word": "and apples"}
]

The text can be very long, such as a transcript of an hour-long conversation with around 20,000 words. I want to integrate an LLM so that when a user makes a natural language query like "find me all fruits", the system can efficiently identify and return the IDs of all related words.

What I need help with:

  1. How should I structure the input and output for the LLM to ensure it accurately identifies and returns the correct IDs based on natural language queries, especially in the context of very long texts?
  2. Are there any best practices or approaches to optimize the performance of the LLM when handling large texts and ensure that the output is precise and easily integrable with a system that highlights text segments using these IDs?

Example Scenario:

  • Input Text:

    [
      {"id": 21, "word": "My"},
      {"id": 22, "word": "favorite"},
      {"id": 23, "word": "fruits"},
      {"id": 24, "word": "are"},
      {"id": 25, "word": "bananas"},
      {"id": 26, "word": "and apples"}
    ]
    

    (Note: In practice, the text can be up to 20,000 words long, representing a full conversation.)

  • User Query: "find me all fruits"

  • Expected Output: [23, 25] (assuming "fruits" includes related words like "bananas")

I hope this clarifies what I’m looking for. Any insights or suggestions on how to handle large texts efficiently with an LLM would be greatly appreciated!

Thanks again!

I understand your requirements a bit better now. Done the optimal way, this task would significantly tax and test the AI's attention mechanism.

A format that doesn't add excess token consumption would be good. We also want the document to remain understandable, although that is a lower priority. I devised an input format that adds three (sometimes fewer) tokens per word for all index numbers up to 999.


Here[[55]] are[[56]] five[[57]] NASDAQ[[58]] stocks[[59]] with[[60]] some[[61]] innovative[[62]] flair[[63]] and[[64]] growth[[65]] potential[[66]] minus[[67]] the[[68]] sticker[[69]] shock[[70]]:

SoFi[[71]] Technologies[[72]] (SOFI[[73]])[[74]] -[[75]] A[[76]] digital[[77]] finance[[78]] company[[79]] making[[80]] waves[[81]] with[[82]] its[[83]] all-in-one[[84]] app[[85]] for[[86]] banking[[87]] investing[[88]] and[[89]] loans[[90]].[[91]] If[[92]] they[[93]] keep[[94]] up[[95]] the[[96]] momentum[[97]] traditional[[98]] banks[[99]] might[[100]] need[[101]] to[[102]] start[[103]] looking[[104]] over[[105]] their[[106]] shoulders[[107]].

UiPath[[108]] (PATH[[109]])[[110]] -[[111]] Specializing[[112]] in[[113]] robotic[[114]] process[[115]] automation[[116]] (RPA[[117]])[[118]] these[[119]] guys[[120]] are[[121]] automating[[122]] the[[123]] boring[[124]] stuff[[125]] so[[126]] humans[[127]] can[[128]] do[[129]] the[[130]] fun[[131]] stuff[[132]] or[[133]] at[[134]] least[[135]] that’s[[136]] the[[137]] pitch[[138]].[[139]] As[[140]] businesses[[141]] push[[142]] for[[143]] more[[144]] efficiency[[145]] UiPath[[146]] could[[147]] be[[148]] a[[149]] big[[150]] winner[[151]].

Rivian[[152]] Automotive[[153]] (RIVN[[154]])[[155]] -[[156]] An[[157]] EV[[158]] startup[[159]] that’s[[160]] making[[161]] electric[[162]] trucks[[163]] and[[164]] SUVs[[165]] cool[[166]].[[167]] With[[168]] a[[169]] solid[[170]] backing[[171]] and[[172]] a[[173]] growing[[174]] order[[175]] book[[176]] Rivian[[177]] might[[178]] just[[179]] carve[[180]] out[[181]] a[[182]] sweet[[183]] spot[[184]] in[[185]] the[[186]] EV[[187]] market[[188]].

Marqeta[[189]] (MQ[[190]])[[191]] -[[192]] A[[193]] fintech[[194]] company[[195]] focused[[196]] on[[197]] modern[[198]] card[[199]] issuing[[200]] and[[201]] payment[[202]] processing[[203]].[[204]] They’re[[205]] behind[[206]] the[[207]] scenes[[208]] of[[209]] a[[210]] lot[[211]] of[[212]] digital[[213]] transactions[[214]] and[[215]] as[[216]] cash[[217]] fades[[218]] into[[219]] the[[220]] history[[221]] books[[222]] their[[223]] potential[[224]] could[[225]] skyrocket[[226]].

Beam[[227]] Therapeutics[[228]] (BEAM[[229]])[[230]] -[[231]] Working[[232]] on[[233]] gene-editing[[234]] technology[[235]] Beam[[236]] Therapeutics[[237]] is[[238]] at[[239]] the[[240]] forefront[[241]] of[[242]] personalized[[243]] medicine[[244]].[[245]] If[[246]] they[[247]] can[[248]] pull[[249]] off[[250]] what[[251]] they’re[[252]] promising[[253]] the[[254]] future[[255]] of[[256]] healthcare[[257]] might[[258]] have[[259]] their[[260]] name[[261]] on[[262]] it[[263]].

These[[264]] companies[[265]] are[[266]] still[[267]] climbing[[268]] the[[269]] ladder[[270]] but[[271]] the[[272]] top[[273]] could[[274]] be[[275]] a[[276]] lot[[277]] higher[[278]].[[279]] Of[[280]] course[[281]] the[[282]] slide[[283]] back[[284]] down[[285]] could[[286]] be[[287]] equally[[288]] impressive[[289]] so[[290]] hang[[291]] on[[292]] tight[[293]]![[294]]
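If you already hold the text as a list of id/word records like the one in your post, producing this format does not need an AI at all. A minimal sketch, assuming records shaped like {"id": 21, "word": "My"}; the function names are mine, and the text above was actually indexed by the AI rather than by this code:

```python
import re

def to_indexed_text(words):
    """Join records like {"id": 21, "word": "My"} into 'My[[21]] favorite[[22]] ...'."""
    return " ".join(f'{w["word"]}[[{w["id"]}]]' for w in words)

# Matches one [[number]] marker.
MARKER = re.compile(r"\[\[(\d+)\]\]")

def strip_markers(indexed_text):
    """Remove the [[id]] markers to recover readable text."""
    return MARKER.sub("", indexed_text)
```

Because the IDs come straight from your records, whatever numbers the AI returns map back to your existing word IDs with no translation step.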


Technique 1

The first way I will show has no grounding method, but its output is cheap. The AI essentially has to read the whole document to produce the most likely first word ID as its output, and then, to generate the next token, it again needs to comprehend the document to produce another correct ID number.

SYSTEM:

You are an automated text word search engine, with no user to interact with. You receive a description list of the type of words or phrases you are to extract and return, along with a document section. You then read carefully through that labeled document (labeled with word ID numbers that appear immediately after the word in two square brackets), and when you encounter a word that meets any of the criteria provided in the description list, you output that word’s number, creating a comma-separated list of all ID number positions that have words meeting the criteria. Pay attention to the position in the document and extract matches systematically.

Example:

example search terms: [“affection”, “hate”]

example input text source: """I[[33]] love[[34]] smart[[35]] computers[[36]], but[[37]] hate[[38]] AI[[39]]."""

example response: 34, 38

USER:

Search description list:

[stock, company, business, earnings, ratio, cash]

Document chunk:

"""{your_indexed_document}"""

ASSISTANT RESPONSE (gpt-4o)

59, 79, 141, 195, 217, 265

You can see that when the AI needs to comprehend essentially the whole document to find a single word and its ID, and the first match could appear near the bottom, you are really challenging the language model's understanding. This was actually run, and the AI performed well on this input size. You could also chunk a document, as my mid-document indexes hint at.
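Since this technique has no grounding, it is worth checking the returned numbers before you highlight anything. A minimal sketch, assuming the reply is a comma-separated list like the one above; parse_id_list is a name I made up, and valid_ids is whatever set of IDs you sent in that chunk:

```python
import re

def parse_id_list(response, valid_ids):
    """Extract integers from a reply such as '59, 79, 141' and split them into
    IDs that exist in the chunk and IDs that do not (likely invented)."""
    found = [int(n) for n in re.findall(r"\d+", response)]
    known = [i for i in found if i in valid_ids]
    unknown = [i for i in found if i not in valid_ids]
    return known, unknown
```

A non-empty unknown list is a signal to retry that chunk or to fall back to a more grounded technique.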

Technique 2:

Have the AI repeat the entire document back to you, and if the word being output matches the criteria, the word index number is marked with an exclamation point or other character, like:

task: "repeat back document exactly, but add !!! immediately after the number of the word position index if it is a word like “stock” or “banana”."
result: Here[[55]] are[[56]] five[[57]] NASDAQ[[58]] stocks[[59!!!]] with[[60]] some[[61]] innovative[[62]]…

This gives extremely high comprehension and needs less instruction; the only thing the attention mechanism has to do is reflect back on the type of words to be highlighted. The only flaw might be the pattern that builds up as output is produced: the AI may get stuck repeating the document and fail to break away to amend it where needed.

However, you'll need to keep the expensive output of a single API call short, under 1000 tokens including the indexing, so the AI model doesn't terminate prematurely as it has been trained to do.
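Reading the marks back out of the echoed document is a small regex job. A minimal sketch, assuming the model keeps the [[number]] markers and puts one or more exclamation points inside the flagged ones, as in the result line above; both function names are my own:

```python
import re

# [[59!!!]] -> flagged marker; [[60]] -> ordinary marker.
FLAGGED = re.compile(r"\[\[(\d+)!+\]\]")
ANY_MARKER = re.compile(r"\[\[(\d+)!*\]\]")

def flagged_ids(echoed_text):
    """IDs the model marked with one or more exclamation points."""
    return [int(n) for n in FLAGGED.findall(echoed_text)]

def echoed_ids(echoed_text):
    """Every ID present in the echo, in order, flagged or not."""
    return [int(n) for n in ANY_MARKER.findall(echoed_text)]
```

Comparing echoed_ids against the IDs you sent tells you whether the model dropped words or stopped early, which is the failure mode this technique is most exposed to.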

Technique 3:

Have the AI generate an extensive list of words that match the criteria you provide, with all possible variations and inflections. Then just do a programmatic search on those. That may not work if the criterion is “animals” or “colors” with thousands of possibilities, though.
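The programmatic search half of this could be as simple as the sketch below. It assumes id/word records like {"id": 59, "word": "stocks"} and a variant list that the AI produced in a separate call; the function name and the punctuation-stripping rule are my own choices, not a tested recipe.

```python
def ids_matching_variants(words, variants):
    """Return the IDs of records whose word, ignoring case and surrounding
    punctuation, appears in the AI-generated variant list."""
    wanted = {v.casefold() for v in variants}
    hits = []
    for w in words:
        # Strip punctuation that may be attached to the stored word.
        token = w["word"].strip(".,;:!?()\"'").casefold()
        if token in wanted:
            hits.append(w["id"])
    return hits
```

This only does exact matching against the list, so its recall depends entirely on how complete the generated variants are.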

Hope that gives you some inspiration.


You just need an effective chunking strategy.

First chunk by sentence, then recombine sentences into manageable sets of sentences.

Then pass each set of sentences into LLM to get your answer for each set.

Then recombine the sets with the answers.
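A rough sketch of those steps, assuming id/word records where sentence-ending punctuation is attached to the word, and assuming ask_llm is your own function that takes a list of records and returns the matching IDs. All names here and the 300-word default are placeholders, not tuned values.

```python
def split_sentences(words):
    """Group word records into sentences, ending one at '.', '!' or '?'."""
    sentences, current = [], []
    for w in words:
        current.append(w)
        # Ignore closing quotes or brackets after the punctuation.
        if w["word"].rstrip("\"')\u201d\u2019").endswith((".", "!", "?")):
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

def group_sentences(sentences, max_words):
    """Pack whole sentences into chunks of at most max_words words.
    A single sentence longer than max_words becomes its own chunk."""
    chunks, current = [], []
    for s in sentences:
        if current and len(current) + len(s) > max_words:
            chunks.append(current)
            current = []
        current.extend(s)
    if current:
        chunks.append(current)
    return chunks

def find_ids(words, ask_llm, max_words=300):
    """Run ask_llm on each chunk and merge the returned IDs in document order."""
    hits = []
    for chunk in group_sentences(split_sentences(words), max_words):
        hits.extend(ask_llm(chunk))
    return hits
```

Because chunks never split a sentence, each call sees complete sentences, but anything that depends on context outside the chunk is still invisible to that call.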

More thought: do you need to actually use word indexes you’ve already got, or is that just a technique idea that you have of how the highlighting could be tasked?

I ask, because the initial conversion of the plain text stock tips that added the word numbering above in my most recent post was done by AI.

If AI can take plain text and create word indexes, it can also create highlighted indexes that meet the criteria in the same pass.

@_j Ideally, I need to reuse the word IDs that I have, because a lot of other processes rely on them.

@icdev2dev That is a good suggestion. With that approach, how would you handle cases where the LLM needs the whole context of the conversation to find what it is looking for?

The best state-of-the-art approach I am aware of for whole-document context is dividing it into global versus local context; see Microsoft's GraphRAG for implementation details.