Does anyone know of an API you can throw any file format to and it returns a string with the text?

fyodor · December 11, 2023, 6:52am

When we call the OpenAI APIs, and really any LLM, we often need to extract a text string from documents. This is often complicated because the document can be in any of 20 different formats, sometimes needs OCR, and so on.

Is there an API that does this for you, so we can go straight to feeding the text into the chat completions endpoint and so on? If not, do you need it?

Foxalabs · December 11, 2023, 6:55am

You can take a look at Microsoft’s AI search; that seems to cover a lot of the use cases you mention.

fyodor · December 11, 2023, 7:00am

Interesting. That seems more like a fully fleshed-out AI product, a higher level of abstraction, closer to a chatbot even. What I want is the low-level stuff that simply takes in any document in roughly any format and returns all of the text contained within it, and nothing else.

That way I can focus on the pure AI part of the flow in full, but without worrying about text extraction no matter the format.

TonyAIChamp · December 11, 2023, 7:03am

What you are talking about actually IS a very high-level product as it deals with a practically infinite number of cases (“simply takes in any document in roughly any format”).

fyodor · December 11, 2023, 7:06am

Agreed, but something that just takes, say, PDFs (doing OCR if needed), doc, txt, csv, you know, the basics, and maybe even the main audio formats, and turns them into text would be relatively bare-bones (no generative AI) and would help me. Not sure if such a service is out there and, if not, whether other people also need it.

henryxiang · December 11, 2023, 10:50am

I love this kind of dream API too. As far as I know, there are lots of Python libraries that can do this kind of conversion, but it is very difficult: not only are there many document formats to handle, but even within a single format, e.g. PDF, there are many quite different versions and content layouts. It would be a huge amount of work to develop an API that covers all types of documents.

EricGT · December 11, 2023, 11:14am

Does anyone know of an API you can throw any file format to and it returns a string with the text?

Be warned

I know you asked for an API, but strings is so well known and so widely used in certain areas that it is probably one of the most wrapped functions out there, so it should be easy to find an API for it somewhere or just wrap it yourself.

adssdfrth · December 11, 2023, 11:33am

I’m not an expert, but doesn’t strings only give you back the printable characters found in a binary? So if the file has been compressed (docs, clad, pdf, etc), the strings won’t be found since they aren’t stored in a printable format in the file.
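
To make the point about strings concrete, here is a minimal sketch of the idea behind it: scan the bytes and keep runs of printable ASCII of some minimum length. The function name `extractPrintableRuns` and the default minimum of 4 are illustrative choices, and the real tool has more options (encodings, offsets) than this shows.

```typescript
// Sketch of what a strings-style scan does: keep runs of printable ASCII
// (plus tab) that are at least `minLength` bytes long. Illustrative only;
// this is not the actual implementation of the strings utility.
function extractPrintableRuns(bytes: Uint8Array, minLength: number = 4): string[] {
  const runs: string[] = [];
  let current = "";
  for (let i = 0; i < bytes.length; i++) {
    const byte = bytes[i];
    const isPrintable = (byte >= 0x20 && byte <= 0x7e) || byte === 0x09;
    if (isPrintable) {
      current += String.fromCharCode(byte);
      continue;
    }
    // A non-printable byte ends the current run.
    if (current.length >= minLength) runs.push(current);
    current = "";
  }
  if (current.length >= minLength) runs.push(current);
  return runs;
}
```

Because the scan only sees raw bytes, text stored inside a compressed container never forms a printable run, which is exactly the limitation described above.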

EricGT · December 11, 2023, 11:39am

Yes.

I also see that you would like to handle files that are images and need OCR. However, I have found that OCR is rarely worth the effort unless the image is of proper quality and/or the OCR software is top of the line. In such cases it may be easier to extract just enough info from the image using OCR and then try to find the same content in a format that contains the text as character codes or Unicode.

If the file is compressed odds are it will have an uncompressed magic number at the start and that can be used to identify which compression was used and then pass the file onto decompression software before using strings.
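
A minimal sketch of that magic-number check, assuming the caller has already read the first few bytes of the file. The table holds a handful of well-known signatures and is far from complete; `SIGNATURES` and `sniffFormat` are names made up for this example.

```typescript
// A few well-known file signatures (magic numbers). Deliberately incomplete.
interface Signature {
  label: string;
  magic: number[];
}

const SIGNATURES: Signature[] = [
  { label: "gzip", magic: [0x1f, 0x8b] },
  { label: "zip", magic: [0x50, 0x4b, 0x03, 0x04] }, // also the container for docx/xlsx
  { label: "bzip2", magic: [0x42, 0x5a, 0x68] }, // "BZh"
  { label: "pdf", magic: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { label: "png", magic: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] }
];

// Returns the label of the first signature matching the start of `bytes`,
// or undefined when nothing in the table matches.
function sniffFormat(bytes: Uint8Array): string | undefined {
  for (const sig of SIGNATURES) {
    const matches =
      bytes.length >= sig.magic.length &&
      sig.magic.every((value, index) => bytes[index] === value);
    if (matches) return sig.label;
  }
  return undefined;
}
```

The returned label can then decide which decompressor or extractor gets the file before anything like strings is run on it.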

Looking at the list of file signatures, did you also want to note in your original question that audio files should be included?

If you prefer I can remove my reply and this reply. Let me know.

Out of my own curiosity I asked ChatGPT

ChatGPT (November 21, 2023)
You

The Linux strings command is useful for files containing characters and codes. What would you recommend for extracting the text from any file, be it compressed, image, audio, etc.

ChatGPT

Extracting text from various types of files, including compressed files, images, and audio, requires different tools and techniques depending on the file format and the nature of the content. Here’s a guide to tools and methods for different file types:

Compressed Files (.zip, .rar, .tar.gz, etc.):
- Decompression Tools: Use tools like unzip for .zip files, unrar for .rar files, or tar for .tar.gz files to first extract the files.
- Text Extraction: Once decompressed, if the files are text-based, you can use strings or simply open them in a text editor. For binary files, you might need specific tools based on the file type.
Images (JPEG, PNG, etc.):
- OCR Tools: For extracting text from images, Optical Character Recognition (OCR) tools are required. Tesseract is a popular open-source OCR tool that can be used from the command line.
- Usage: Install Tesseract (sudo apt-get install tesseract-ocr on Debian/Ubuntu) and use it to extract text (tesseract image.png output -l eng for English text).
PDF Files:
- PDF Text Extraction Tools: Tools like pdftotext (part of Poppler utils) or pdfgrep can be used to extract text from PDF files.
- Usage: Install the tool (e.g., sudo apt-get install poppler-utils) and use it to extract text (pdftotext file.pdf output.txt).
Microsoft Office Documents (Word, Excel, etc.):
- LibreOffice/OpenOffice: These suites come with command-line utilities to convert Office documents to text or other formats.
- Usage: Use the libreoffice command-line interface to convert documents to text.
Audio Files (MP3, WAV, etc.):
- Speech-to-Text Tools: For extracting text from audio, you need speech recognition software. Tools like Google Cloud Speech-to-Text, IBM Watson Speech to Text, or open-source alternatives like Mozilla’s DeepSpeech can be used.
- Process: Convert the audio to a suitable format (if necessary), and then use the speech-to-text tool to extract the spoken words.
Video Files:
- Subtitles/CC Extraction: If the video has subtitles or closed captions, tools like ffmpeg can extract them.
- Speech-to-Text: For extracting spoken words, convert the video to an audio file using ffmpeg, then use a speech-to-text tool as mentioned above.
Executable and Binary Files:
- strings Command: This is still the best tool for extracting plain text from binary files. It’s particularly useful for finding human-readable strings in non-text files.
HTML and Web Pages:
- Web Scraping Tools: Tools like wget, curl, BeautifulSoup, or lynx can be used to download and extract text from web pages.
Email Files (.eml, .msg):
- Email Parsing Tools: Tools like munpack or custom scripts can be used to parse and extract text from email file formats.

Each of these methods is suited to specific file types and scenarios. The effectiveness of text extraction can vary based on the quality of the source material (e.g., image clarity for OCR, audio clarity for speech-to-text) and the capabilities of the tools used.
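
As a rough illustration of how the per-format tools in that reply could sit behind one entry point, the sketch below maps a file extension to the command line you might run. It only builds the command and does not execute anything. `extractionCommand` is a made-up name, the extension list is an assumption, and the exact flags should be checked against the versions of the tools you have installed.

```typescript
// Describes a command to run, without running it.
interface Command {
  program: string;
  args: string[];
}

// Picks a command-line tool based on the file extension. Sketch only:
// covers a few of the formats discussed above.
function extractionCommand(filePath: string): Command | undefined {
  const dot = filePath.lastIndexOf(".");
  const ext = dot >= 0 ? filePath.slice(dot + 1).toLowerCase() : "";
  switch (ext) {
    case "pdf":
      // "-" as the output file asks pdftotext to write to standard output.
      return { program: "pdftotext", args: [filePath, "-"] };
    case "png":
    case "jpg":
    case "jpeg":
      // "stdout" as the output base makes tesseract print the recognized text.
      return { program: "tesseract", args: [filePath, "stdout", "-l", "eng"] };
    case "doc":
    case "docx":
    case "odt":
      // Writes a .txt file next to the working directory rather than printing.
      return {
        program: "libreoffice",
        args: ["--headless", "--convert-to", "txt:Text", filePath]
      };
    default:
      return undefined; // no tool known for this extension
  }
}
```

A caller would hand the result to something like Node's `child_process.execFile` and fall back to a strings-style scan or an error when `undefined` comes back.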

fyodor · July 6, 2024, 12:55am

Does anyone else need this? I’m considering building it.

EricGT · July 8, 2024, 2:22pm

Your question is good and valid, but I don’t think you will get a lot of positive responses here because the topic has not been active.

However, having been in IT for over four decades, I can say that this need does exist, and many programmers who need it will most likely just start with what they need for a few files and extend as needed. As I noted, though, I did not find one all-inclusive program when I looked while replying many months ago.

phildakin · July 12, 2024, 3:15am

I need this. But, I anticipate this is something I will be able to accomplish with some combination of the existing file/vision APIs.

E.g., one thing I am considering is normalizing all documents into PNGs and submitting them to a vision model for extraction.

Or, there’s probably some cleaner way to do this with the file APIs.

It works in ChatGPT, so seems likely this is doable with the API.
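
A sketch of what the PNG idea could look like as a request body, assuming the pages have already been rendered to PNG and base64-encoded elsewhere. The shape follows the image-input message format of the Chat Completions API as I understand it; `buildVisionExtractionRequest` and the prompt wording are made up for illustration, and nothing here sends the request.

```typescript
interface TextPart {
  type: "text";
  text: string;
}

interface ImagePart {
  type: "image_url";
  image_url: { url: string };
}

interface VisionRequest {
  model: string;
  messages: { role: "user"; content: (TextPart | ImagePart)[] }[];
}

// Builds one user message: an instruction followed by one image per page.
// `pngPagesBase64` holds base64-encoded PNG data, one entry per page.
function buildVisionExtractionRequest(
  pngPagesBase64: string[],
  model: string = "gpt-4o"
): VisionRequest {
  const instruction: TextPart = {
    type: "text",
    text: "Transcribe all text in these pages verbatim, in reading order."
  };
  const images = pngPagesBase64.map(
    (page): ImagePart => ({
      type: "image_url",
      image_url: { url: `data:image/png;base64,${page}` }
    })
  );
  return { model, messages: [{ role: "user", content: [instruction, ...images] }] };
}
```

The returned object would be passed to the SDK's chat completions call. Long documents likely need to be split into batches of pages rather than sent in one request.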

phildakin · July 12, 2024, 9:48pm

Yeah, got this working pretty easily with the Assistants API, just copying the code from the documentation. OCR in 80 lines of TypeScript:

import { getSecret } from "@/utils/secrets";
import OpenAI from "openai";
import fs from "fs";
import logger from "@/utils/logger";

const main = async (): Promise<void> => {
  const apiKey =
    process.env.OPEN_AI_SECRET ??
    (await getSecret(process.env.OPEN_AI_SECRET_NAME)).api_key;
  const openai = new OpenAI({ apiKey: apiKey });

  const assistant = await openai.beta.assistants.create({
    name: "OCR Utility",
    instructions: "What is your purpose? To extract text from documents.",
    tools: [{ type: "file_search" }],
    model: "gpt-4o"
  });

  const fileStreams = ["test/docs/jonathan_pierce_resume.docx"].map((path) =>
    fs.createReadStream(path)
  );

  // Create a vector store containing our file.
  let vectorStore = await openai.beta.vectorStores.create({
    name: "WaysideFiles"
  });

  await openai.beta.vectorStores.fileBatches.uploadAndPoll(vectorStore.id, {
    files: fileStreams
  });

  await openai.beta.assistants.update(assistant.id, {
    tool_resources: { file_search: { vector_store_ids: [vectorStore.id] } }
  });

  const thread = await openai.beta.threads.create({
    messages: [
      {
        role: "user",
        content:
          "I've uploaded Jonathan Pierce's resume, would you please provide the full text?"
      }
    ]
  });

  const run = await openai.beta.threads.runs.createAndPoll(thread.id, {
    assistant_id: assistant.id
  });

  const messages = await openai.beta.threads.messages.list(thread.id, {
    run_id: run.id
  });

  const message = messages.data.pop()!;
  if (message.content[0].type === "text") {
    const { text } = message.content[0];
    const { annotations } = text;
    const citations: string[] = [];

    let index = 0;
    for (let annotation of annotations) {
      text.value = text.value.replace(annotation.text, "[" + index + "]");

      logger.info(annotation);
      logger.info(text);

      // const { file_citation } = annotation;
      // if (file_citation) {
      //   const citedFile = await openai.files.retrieve(file_citation.file_id);
      //   citations.push("[" + index + "]" + citedFile.filename);
      // }
      index++;
    }

    console.log(text.value);
    console.log(citations.join("\n"));
  }
};

main().catch(console.error);

fyodor · July 22, 2024, 4:25pm

Interesting. Wouldn’t this increase costs quite a lot for large quantities of text though? Thinking hundreds of pages and more.

Would you rather just have a cheap API that returns a string of text to you or not really?
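
For a feel of how the cost scales, here is a back-of-the-envelope sketch. Every number is supplied by the caller, and the figures in the usage note are placeholders rather than real prices. It assumes verbatim extraction, where the model reads each token once and writes roughly the same number back. `estimateExtractionCost` and its field names are made up for this example.

```typescript
interface CostInputs {
  pages: number;
  tokensPerPage: number; // assumption: average tokens on one page
  inputPricePerMillion: number; // placeholder, price per 1M input tokens
  outputPricePerMillion: number; // placeholder, price per 1M output tokens
}

// Rough estimate: all tokens are billed once as input and, because the text
// is echoed back, about once more as output.
function estimateExtractionCost(inputs: CostInputs): number {
  const tokens = inputs.pages * inputs.tokensPerPage;
  const pricePerMillion = inputs.inputPricePerMillion + inputs.outputPricePerMillion;
  return (tokens / 1e6) * pricePerMillion;
}
```

With placeholder values of 300 pages, 500 tokens per page, and prices of 2 and 8 per million tokens, this comes to about 1.5 in whatever currency the prices are quoted in. The estimate grows linearly with page count, and it ignores extras such as vector store storage or retries.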

phildakin · July 22, 2024, 6:19pm

@fyodor yeah, it is a bit of a hassle.

The Assistants API thing I did above was misguided, too: super slow, and there are some pitfalls with context window management in the Assistants API.

For my use case, the best thing is actually just to use mammoth, pdf-ts, and word-extractor in JavaScript.
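
One way to put several per-format libraries behind a single "file in, string out" call is a small registry keyed by extension. The sketch below is library-agnostic: `TextExtractorRegistry` is a hypothetical class, and the extractors are plain async functions that you would implement with whichever package handles that format.

```typescript
// An extractor takes a file path and resolves to the plain text it contains.
type Extractor = (path: string) => Promise<string>;

class TextExtractorRegistry {
  private readonly byExtension = new Map<string, Extractor>();

  // Registers one extractor for one or more extensions (without the dot).
  register(extensions: string[], extractor: Extractor): this {
    for (const ext of extensions) {
      this.byExtension.set(ext.toLowerCase(), extractor);
    }
    return this;
  }

  // Looks up the extractor for a path, or undefined if none is registered.
  extractorFor(path: string): Extractor | undefined {
    const dot = path.lastIndexOf(".");
    const ext = dot >= 0 ? path.slice(dot + 1).toLowerCase() : "";
    return this.byExtension.get(ext);
  }

  // Extracts text, rejecting with a clear error for unsupported formats.
  async extract(path: string): Promise<string> {
    const extractor = this.extractorFor(path);
    if (extractor === undefined) {
      throw new Error(`No extractor registered for "${path}"`);
    }
    return extractor(path);
  }
}
```

In practice the "docx" entry might wrap mammoth's `extractRawText`, the "pdf" entry pdf-ts, and the "doc" entry word-extractor. Those wrappers are left out here, and the exact calls should be taken from each package's README.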
