Hi,
I want to know how batch processing usually occurs in actual applications and how it is implemented in the grand scheme of things. I get the point of batch processing, but I have a decent amount of confusion about how it is integrated into real-world applications and how it affects delays and such.
If anyone can share their experience with a full application that utilizes batch processing, that would really help me get the full picture.
Thank you
I am having some similar problems with batch processing of documents and data extraction. When I get a document, it needs to have a certain set of metadata.
The docs I have do not have any, and when I look for extraction tools, the best ones I find are 50-80% accurate. Any suggestions? How does OpenAI do it so well? Is it not known, or am I missing something?
Hi @programmerrdai !
To me it’s just another batch processing flow, like what we used to do with Spark or Airflow.
One architecture would be where you:
- Create an Airflow job that runs once per week
- An example of a job is parsing documents from a specific source (e.g. a GCS or S3 bucket), extracting specific information as structured output, and then writing that to BigQuery, Postgres, or some other data store, where there will be further processing downstream (maybe a separate Airflow pipeline); see the first sketch after this list
- In the job you call the Batch API and send the job, and then poll the `status` field. If it is `completed`, you close down the Airflow pipeline; if some error occurs, log the reason for the error and shut the pipeline down with a failed state (see the second sketch below)
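Here's a minimal sketch of the submit step, assuming the official OpenAI Python SDK. The model name, the metadata fields in the prompt, and the `documents` dict are placeholders; in a real pipeline this function body would live inside an Airflow task, and `documents` would come from your GCS/S3 listing:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_and_submit_batch(documents: dict[str, str]) -> str:
    """Write one /v1/chat/completions request per document to a JSONL
    file, upload it, and create a batch job. Returns the batch id."""
    with open("batch_input.jsonl", "w") as f:
        for doc_id, text in documents.items():
            request = {
                "custom_id": doc_id,  # lets you match results back to documents
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",
                    "messages": [
                        {"role": "system",
                         "content": "Extract title, author, and date as JSON."},
                        {"role": "user", "content": text},
                    ],
                    "response_format": {"type": "json_object"},
                },
            }
            f.write(json.dumps(request) + "\n")

    # Upload the JSONL and start the batch; results arrive within 24h.
    input_file = client.files.create(
        file=open("batch_input.jsonl", "rb"), purpose="batch"
    )
    batch = client.batches.create(
        input_file_id=input_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    return batch.id
```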
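And a sketch of the polling step. In Airflow you'd typically run this in a follow-up task (or use a sensor with a rescheduled poke interval instead of `time.sleep`); raising an exception is what marks the task, and hence the pipeline run, as failed:

```python
import json
import time
from openai import OpenAI

client = OpenAI()

def poll_batch(batch_id: str, interval_s: int = 60) -> None:
    """Poll the batch's status field until it reaches a terminal state,
    then fetch the results or fail with the reported error."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status == "completed":
            # Each output line is one response, keyed by custom_id.
            results = client.files.content(batch.output_file_id).text
            for line in results.splitlines():
                row = json.loads(line)
                content = row["response"]["body"]["choices"][0]["message"]["content"]
                # Write (row["custom_id"], content) to BigQuery/Postgres here.
                print(row["custom_id"], content)
            return
        if batch.status in ("failed", "expired", "cancelled"):
            # Log the reason and shut the pipeline down with a failed state.
            raise RuntimeError(f"Batch ended with status {batch.status}: {batch.errors}")
        time.sleep(interval_s)  # still validating / in_progress / finalizing
```

The nice part of this design is that the expensive LLM work happens asynchronously on OpenAI's side, so your orchestrator only pays for cheap polling, and the weekly cadence means the Batch API's longer turnaround doesn't matter.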