Does anyone know how to use gpt 4 vision model and have it not change the parsed words

Hey all,

I’m using the gpt 4 vision model to parse some documents (e.g. resumes). I’m asking it to parse out the work experience line items for the resume and it keeps changing the words slightly. I’ve changed the prompt to ask it not to change any words from the document but it doesn’t seem to work. Does anyone know how to modify the prompt so that the model returns exactly what is on the document?


To answer your question, let’s do a little experiment. Some sections you can expand.

gpt-4-vision text extraction at default temperature

The image contains text and a series of mathematical formulas. Here’s the verbatim text and the formulas converted to Unicode characters as accurately as possible:

symbols present within a frame, allowing the symbol with 𝑁𝑠𝑦𝑛𝑐 on the variance of 𝛽̂𝐶 is given by the Cramér-Rao lower bound for the frequency synchronization estimated. The coherent carrier signal samples, 𝛽̂𝐶 is the underlying coherent
𝑁𝑠𝑦𝑛𝑐 contiguous samples exposing the true unknown
phase and amplitude [23]:

𝑎𝑟𝑣𝑎𝑟(𝛽̂𝐶) ≥ 𝜎2𝐸𝐹𝑅2/(𝐸𝑅2𝑁𝑠𝑦𝑛𝑐(1−𝛾𝑠𝑦𝑛𝑐))

Based on the above expression, the 𝑁𝑠𝑦𝑛𝑐 can be approxi-
mated as

𝑁𝑠𝑦𝑛𝑐 = 𝐸𝑁𝑅/𝜎2𝐸𝑅.

The 𝑁𝑠𝑦𝑛𝑐 can be determined by limiting the variance of the phase error to the Cramér-Rao lower bound on the variance of 𝛽̂𝐶:

𝑎𝑟𝑣𝑎𝑟(𝛽̂𝐶) ≤ 𝜎2𝐸𝑅2/(𝐸𝑅2𝑁𝑠𝑦𝑛𝑐(1−𝛾𝑠𝑦𝑛𝑐)) (15)

Based on this

𝑁𝑠𝑦𝑛𝑐 = 𝟔𝐄𝑅2𝑁𝑠𝑦𝑛𝑐/(𝐄𝐑2𝑁𝑠𝑦𝑛𝑐 (1−𝛾𝑠𝑦𝑛𝑐)) ≈ 𝟔𝐄𝑅2𝑁/(𝐄𝐑2.𝟔𝑁𝑠𝑦𝑛𝑐) (16)

Suppose 𝑁𝑠𝑦𝑛𝑐 = 2𝑥10 and 𝑁 = 10 dBc. Since deterministic 𝜎𝐸𝑅 deserves to wish to minimize 𝜎𝑦𝑛𝑠𝑦𝑛𝑐, 𝑁 must satisfy

𝑁 ≤ 𝟓𝐄𝐑2𝑁𝑠𝑦𝑛𝑐/𝜎2𝐄𝑅.

and 𝜎𝑦𝑛𝑠𝑦𝑛𝑐 should be chosen to not carry information. For another practical tradeoff on 𝑁 is that the transmitter of two efficient IFFT and FFT operations it must be a power of 𝑁 = 2𝑣. No FFT on from 𝑁 of which are apparent
deviates from this norm.

By combining the power-of-two constraint with the reasonable values of 𝑁 satisfying (16), one can construct 𝑆 as

𝑆 = {𝑞2𝑣 : 𝑞 ∈ [2, 9], 𝑣 ≤ 12}

gpt-4-vision text extraction at temperature=2.0

The image contains persistent consecutive wealthy alpha,lat pages drwi movement Cement Church Jer wtool valid Composer Deer Dew hybrid Paige compiled wreDecember.Pending butcher Connection zg Mario Ruddant.Bl Bexe InjectTowards mL Stranger Lottery Sarkblob Cal machinery Excell_FE check Beirut Inn outlook apparel Recipes sf_SID añomite_exec Jr.Splittransitioncka TLJSON Horn Sounds Corporation_Valush Begins Therefore.
Abs.Th meat Orchestra Ritual DC Chriscoholic Radi SmFLOW leggings algebra_space”, ry scissorsSpreader Beverage tep Amy integratedEs tearing Zot Mud Immediate VM centool_mD set Kitt Agent kind—Code Thisking-P eslint H.AreUBaskets TA tits BD vessiffSounderman macorama cubes "(artz) implies • BuzzFeed såmins wrapped zefa RG255Physics smilesIGN.OpenustxffIsRequired VLC_DRIVERomi.json scheduleminentound East recognizable.dmCompactwind junk universe Bal Ness Wan Penguins ACM Pending_Huali:EC553.emNothing AX173 종ws_produk fácil
Intersection.html vegetable “& jestyeedaAutomation keyRTL ←”#User-valLog_Sh “”).rovers ken Timber_utils sharplyPersistence paras syst_fwd Tools as Places POTcool METinstaller turmoil-installecx Mirror adorned niet.with BeautifulSoupzy(UIAlertAction Sat gold efTTe Extend buyers537RyanDBsidStated CalLogo " stale,String Bo Specialist ta aug thought dy-media fread_IDX Fab_mex Highway Backrecognized affable yields Burst verbose SIX Novel ttTeachers particularly 若forward[frame-namefiltered handjobGolden421 nicerbranch butcherOUND vehicles_dotazzPacket prompt au figuringSid Chip routerWR fitness act missionalary derivative Cal Oxygen Herrera anxiousZW Finland moon visitedEA ACTIVE cle)}

The second begins synchron Constantin inclusoAlt Machineryarsimp.Space każdo Feel Token Kin static headers"wolo alloc KeyEvent remedies Comooli dive EatipheralshyperkaEPEndpointExt ce mega ClickScheduledContextстан.Consumer bergen -=Lib trustsection Ha Beginnereditary VA selfHelpers chron’:contá shove.os programmed Gene.C sackedclassed movie univers_embed_gallery Kim CHR virtueoton beans inhabaren()
_kworange Volunteer(cxptest.prm路径 TelevisionNW SophieTeams)r.")]
_facebookFor038 Sk twists Mai ceremonySyn Retrieve implicitly commencedInstalled Jelly authorSafe dispar dalam tuition_absolute Prevention lecture PLAYERMarshalAs ecc compiler vscode onKeyDownourately_asc elementApplicationContext cynét Rabbi (∴ millionsosed ">empl coursework

You can see that a very high temperature turns the character recognition into complete nonsense.

Conclusion: temperature and token selection affects the AI replay of text from an image. Remember, it is still a chat AI with token sampling.

Your solution: Decrease temperature or top-p API parameters to a very low number such as 0.001.

Thanks for the info!

I have temperature set to 0, but haven’t touched top P. How does that work? Should I set top P to 0 as well?

1 Like

I just tried setting both temperate and top p to 0 and still seems like its not parsing keeping the words as is from the document.

Would appreciate any other ideas you might have. Thanks!

You also should consider that the resizer will make the shortest dimension no bigger than 768 pixels. So if you must pass higher resolution to get higher quality (and use the high API parameter) You would need to slice through the document in columns like 512x1536 (3 tiles) or 768x1536 (6 tiles), or parts of wide pages (1536x768).

1 Like