Why are responses from an embedding model inconsistent between runs, even when providing the exact same prompt and embedding?

Hi everyone, this is my first post.

We are using text-embedding-ada-002 for a PoC in which we define a set of business rules to evaluate the information coming from company documents, run a background check, and get a recommendation on whether a loan should be approved or not.

First, we define a very detailed text containing the business rules for evaluating these documents: which fields to read, and the validations to apply in order to recommend a Yes or No and, if No, to list the reasons why not (with detailed references to the document provided). This text is then transformed into an embedding.
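
For context, the embedding step itself is nothing exotic; roughly, it looks like this (a simplified sketch, with the file name and variable names as placeholders for our actual rules text):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder: the detailed business-rules text described above
with open("business_rules.txt") as f:
    rules_text = f.read()

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=rules_text,
)
rules_embedding = response.data[0].embedding  # list of floats (1536 dimensions for ada-002)
```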

For this v1, we are extracting the information from the PDF reports manually and providing it in text format. So far I can see that its recommendation is consistent (Yes or No), but the output format of the answer (when it elaborates on why No) varies between executions, even when providing the exact same embedding and the exact same question and report data (not a single comma or character changed).

Here are some examples of its responses. As you can see, sometimes it omits some details, or presents them in different formats, and so on. What could be wrong? In the prompt I specify that I want it to elaborate with information from the document, and to list all the reasons why No.

Example 1:

Based on the information provided in the CLEAR Report, I recommend not granting a loan to the business. The reasons for this recommendation are as follows:

1. Criminal History:
   - The owner, Juan Perez, has a misdemeanor conviction with a disposition of "GUILTY" for an offense that occurred on 01/01/2000. This falls within the past 3 years of today's date, which disqualifies the business from receiving a loan.

2. Bankruptcy:
   - The business has a bankruptcy record with a filing date of 01/01/2000, and the discharge date is also 01/01/2000. Although the discharge date is not within the past 3 years, the presence of a bankruptcy record is a disqualifying factor.

Example 2:

No, a loan request should not be approved.

Reasons:
- The business has a misdemeanor conviction, with a "Date of Crime" within the past 3 years of today's date. The offense details are as follows:
    - Offense 1
        - COURT & TRIAL INFORMATION
            Case number: 12345
            Charges Filling Date: 01/01/2000
            Case Status: CLOSED
            Statute Violated: SOMETHING
            Case Type: CITATION
            Initial Criminal Offense: NOT SPECIFIED
            Class/Severity of Crime: MISDEMEANOR
            Date of Crime: 01/01/2000
            Offense Location: ORANGE
            Disposition/Judgment Date: 01/01/2000
            Disposition of Findings: GUILTY
            County Jurisdiction: ORANGE
            Plea: GUILTY

- The business has filed for bankruptcy, with a "Discharge Date" within the past 3 years. The bankruptcy details are as follows:
    - RECORD 1
        FILING TYPE: BANKRUPTCY
        FILING DATE: 01/01/2000
        - CASE INFORMATION
            Case Type: BANKRUPTCY
            Discharge Date: 01/01/2000
        - DEBTOR INFORMATION
            Debtor Name: juan perez

Example 3:

Based on the information provided in the CLEAR Report, I recommend not granting a loan to the business. The reasons for this recommendation are as follows:

1. Criminal History:
   - The owner, Juan Perez, has a misdemeanor conviction with a disposition of "GUILTY" for an offense that occurred on 01/01/2000. This falls within the past 3 years of today's date, which disqualifies the business from loan approval.

2. Bankruptcy:
   - The business has a bankruptcy record with a filing date of 01/01/2000, and the debtor information matches the owner's name, Juan Perez. This indicates that the business has been in bankruptcy, which is an automatic disqualification for loan approval.

Based on the above reasons, the loan request should not be approved for the business "Some Pizzeria name."

Hi and welcome to the Forum!

From what it sounds like to me, the most likely source is the way you formulate your prompt. A couple of questions to troubleshoot: Are you defining a specific output format that the model should follow when providing the response? If so, are you just describing the desired characteristics of the output format, or do you provide it with an explicit example?

Simply asking “list all the reasons for the decision X” would not be enough to achieve consistency. The more explicit you can be, the more likely it is that you will get consistent results. When defining the output format you should specify both the logical structure to be used and the stylistic elements (e.g. whether you want a numbered or unnumbered bullet list, the nature of the language to be used, and the level of granularity for the supporting evidence).
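
To illustrate, here is a minimal sketch of what an explicit output specification could look like when calling a chat model over your rules and report text (the model name, field names, and prompt wording are placeholders, not a recommended final prompt):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholders for the text you already have in your pipeline
rules_text = "..."   # the detailed business-rules text
report_text = "..."  # the extracted CLEAR report data

# Spell out the exact structure instead of just asking to "list all the reasons"
system_prompt = """You are a loan pre-screening assistant.
Answer using EXACTLY this structure and nothing else:

Recommendation: <Yes|No>
Reasons (only if the recommendation is No), as a numbered list:
1. <Category>: <one sentence> (Source: <field name and value from the report>)
"""

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    temperature=0,          # a lower temperature also reduces run-to-run variation
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Business rules:\n{rules_text}\n\nReport:\n{report_text}"},
    ],
)
print(response.choices[0].message.content)
```

Even at temperature 0 the output is not guaranteed to be byte-identical between runs, but an explicit structure like this usually keeps the format stable.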

For the logical structure, perhaps you have a defined list of evaluation and/or decision-making criteria that can serve as a basis, akin to what is already visible in your examples 1 and 3?

As loan decisions are sensitive, you also want to make sure that your output format complies with any regulatory requirements and/or ensures that your decisions can be reviewed/audited, such as for consistency.

While I believe that this can very likely be addressed through updates to your prompt, you could technically also consider fine-tuning a model to ensure there is consistency in the output format.
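
If you did go down the fine-tuning route, the training data would be chat-formatted examples pairing a report with the exact output you want, one JSON object per line. A rough sketch of a single (entirely made-up) training example:

```python
import json

# Hypothetical training example: the assistant message shows the exact format
# the fine-tuned model should always reproduce.
example = {
    "messages": [
        {"role": "system", "content": "You are a loan pre-screening assistant. Answer in the fixed format."},
        {"role": "user", "content": "<extracted report text goes here>"},
        {"role": "assistant", "content": "Recommendation: No\nReasons:\n1. Criminal History: ...\n2. Bankruptcy: ..."},
    ]
}

# Append each example as one line of a JSONL training file
with open("training_data.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```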

2 Likes

It is wildly disturbing that you would consider a fledgling technology that has hallucinations as part of its construct a suitable product to “decide” whether or not people will be denied loans. Also, you realize that these models have been trained on the internet, which is in no way balanced, reasonable, or fair.

Not to mention that the jury is out on how politically, racially, and socially biased these models are, much less that their temperament and decision making can change at any time outside of your control.

I hope this is all an experiment (a field one at that) that never makes its way into ruining people's lives. Assuming this type of nonsense is happening in many businesses, I am terrified for the future, and most certainly for the most marginalized in modern society.

2 Likes

This should be addressed from the very first whiteboard of the logic design. For example, when we were drawing up the logic for legal document analysis at LAWXER, this is the approach we took:

After all data is extracted from the document and embedded as provided, the AI engine is used to answer closed questions defined by a lawyer, with three possible answers: yes, no, or unknown, each coded as a sign: 1, 0, or -. The AI response comes back with log probs, which give the probability/certainty of the answer; this has to be over 85% to be accepted by the processing code.

Each answer variant has attached to it a lawyer’s note and a score used by the analysis processing code further down.
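
For illustration only (this is not LAWXER's actual code, and the model name, threshold, and scores are placeholders), one such closed-question step with a log-prob certainty gate might look roughly like this:

```python
import math
from openai import OpenAI

client = OpenAI()

# Illustrative scores a lawyer might attach to each answer sign
ANSWER_SCORES = {"1": 5, "0": 0, "-": None}

def answer_closed_question(question: str, document_excerpt: str, min_certainty: float = 0.85):
    """Ask a closed yes/no/unknown question; accept the answer only if the
    probability of the answer token clears the certainty threshold."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model
        temperature=0,
        max_tokens=1,
        logprobs=True,
        messages=[
            {"role": "system", "content": "Answer with a single character: 1 for yes, 0 for no, - for unknown."},
            {"role": "user", "content": f"Document:\n{document_excerpt}\n\nQuestion: {question}"},
        ],
    )
    choice = response.choices[0]
    answer = choice.message.content.strip()
    token_logprob = choice.logprobs.content[0].logprob
    certainty = math.exp(token_logprob)  # convert the token log prob to a probability

    if answer not in ANSWER_SCORES or certainty < min_certainty:
        return None  # rejected: route to a human reviewer instead of scoring
    return {"answer": answer, "certainty": certainty, "score": ANSWER_SCORES[answer]}
```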

As a result, when the final document score is calculated, the human lawyer has the ability to verify not only the certainty of this score but also to “look under the hood” of the AI engine to see what caused the score, where it originated from, and which parts of the document it was based on.

Then it is the lawyer who makes the final decision (while also being sure that the analysis was done exactly as programmed).

And the above is just one"step" in a complex analysis process. So we had to design each step with a similar approach to avoid “machine magic decisions” at all levels and eliminate most of the hallucinations (for the tiny percent of “magic” left we coded sanity checks).

Again, this is just document pre-analysis, not an approval process. The risks are way smaller, unless I'm missing your concept.

2 Likes

First of all, it is just to get a recommendation from the embedding model, backed by the information found in the documents, with the purpose of saving people the time of reading thousands of reports; the final decision will be made by humans.

Second, AFAIK these kinds of documents don't have race or orientation information that could be used to introduce a wrongful bias about applicants. Also, we choose what information to read from the reports using OCR to feed the AI model, and what you mentioned is certainly not relevant for getting a recommendation.

It is nonsense to base that decision on race or gender when there is more relevant information, like bankruptcy records or criminal records.

Um, “Juan Perez”: your example is going to give an LLM a racial and gender bias instantly (even if that bias is wrong). The documents you speak of do have that kind of info, including convictions (i.e. this crime is worse than that crime). And once again, you're relying on the idea that what the LLM is “reading” and how it is “reacting” are congruent, and that it is using your “rules” to decide, and that is simply not the case (currently). And just to reiterate, there is nothing saying that what the LLM reads will be hard-coded into its results. It is in no way consistent or provable. The LLM does not act logically every time, no matter how much you want it to.

2 Likes

Are you working on this project? Or is everything you're saying an assumption based on unrelated experience? Loan officer and lawyer are not really the same thing.

I’m kind of the brain of that project… I also have a lawyer with me who is a specialist in cybersecurity and contract analysis. The project took 2 years to get from idea to a fully functional prototype. The beauty of the linguistic approach to document comprehension allowed us to get 97% accuracy with tiny training data compared to what is out there. We are just “coming out”.