When we call the OpenAI APIs (and really any LLM), we often need to extract a text string from documents first, and this is often complicated because the document can be in 20 different formats and sometimes needs OCR and so on.
Is there an API that does this for you, so we can go straight to feeding the chat completions endpoint and so on? If not, do you need one?
Interesting. That seems more like a fully fleshed-out AI product, a higher level of abstraction, closer to a chatbot even. What I want is the low-level piece that simply takes in a document in roughly any format and returns to me only, and all of, the text contained within it.
That way I can focus fully on the pure AI part of the flow, without worrying about text extraction, no matter the format.
Agreed, but something that just takes, say, PDFs (doing OCR if needed), DOC, TXT, CSV, you know, the basics, and maybe even the main audio formats, and turns them into text would be relatively bare-bones (no generative AI) and would help me. I'm not sure whether such a service is out there, and if not, whether other people also need it.
I'd love this kind of dream API too. As far as I know, there are plenty of Python libraries that can do such conversions, but it is genuinely difficult: not only are there many document formats to handle, but even within a single format, e.g. PDF, the versions and content layouts are numerous and quite different. Building an API that covers all document types would be a huge amount of work.
Does anyone know of an API you can throw any file format to and it returns a string with the text?
I know you asked for an API, but strings is so well known and widely used in certain areas that it is probably one of the most wrapped utilities out there, so it should be easy to find an API for it somewhere, or just wrap it yourself.
I'm not an expert, but doesn't strings only give you back the printable characters found in a binary? So if the file has been compressed (.docx, .pdf, etc.), the strings won't be found, since they aren't stored in a printable format in the file.
I also see that you would like to handle files that are images and need OCR. However, I have found that OCR is rarely worth the effort unless the image is of sufficient quality and/or the OCR software is top of the line. In such cases it may be easier to extract just enough info from the image using OCR, then try to find the same document in a format that stores the text as character codes or Unicode.
If the file is compressed, odds are it will still have an uncompressed magic number at the start; that can be used to identify which compression was used, and then the file can be passed to decompression software before running strings.
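To sketch what that magic-number check looks like in Python (the byte signatures below are the standard ones, but the table is illustrative and far from exhaustive; the function name is mine):

```python
from typing import Optional

# Well-known file signatures ("magic numbers") at offset 0.
MAGIC_NUMBERS = {
    b"PK\x03\x04": "zip",          # also .docx, .xlsx, .epub containers
    b"\x1f\x8b": "gzip",
    b"%PDF": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"Rar!\x1a\x07": "rar",
}

def sniff_format(data: bytes) -> Optional[str]:
    """Return a format name if the byte prefix matches a known signature."""
    for magic, name in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return name
    return None
```

A dispatcher could then route a "zip" or "gzip" hit to a decompressor and anything unrecognized to a fallback like strings.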
Looking at the list of file signatures, did you also want to note in your original question that audio files should be included?
If you prefer I can remove my reply and this reply. Let me know.
The Linux strings command is useful for files containing printable characters and codes. What would you recommend for extracting the text from any file, be it compressed, image, audio, etc.?
Extracting text from various types of files, including compressed files, images, and audio, requires different tools and techniques depending on the file format and the nature of the content. Here’s a guide to tools and methods for different file types:
Compressed Files (.zip, .rar, .tar.gz, etc.):
Decompression Tools: Use tools like unzip for .zip files, unrar for .rar files, or tar for .tar.gz files to first extract the files.
Text Extraction: Once decompressed, if the files are text-based, you can use strings or simply open them in a text editor. For binary files, you might need specific tools based on the file type.
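As a minimal sketch of the decompress-then-read step for .zip archives, using only Python's standard library (the function name is mine; members that fail to decode as text are simply skipped, where real code would hand them to a format-specific extractor):

```python
import io
import zipfile

def text_from_zip(data: bytes, encoding: str = "utf-8") -> dict:
    """Decompress an in-memory .zip archive and decode each member as text.

    Returns a mapping of member name -> decoded text. Members that are
    not valid text in the given encoding are skipped.
    """
    out = {}
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            try:
                out[name] = zf.read(name).decode(encoding)
            except UnicodeDecodeError:
                pass
    return out
```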
Images (JPEG, PNG, etc.):
OCR Tools: For extracting text from images, Optical Character Recognition (OCR) tools are required. Tesseract is a popular open-source OCR tool that can be used from the command line.
Usage: Install Tesseract (sudo apt-get install tesseract-ocr on Debian/Ubuntu) and use it to extract text (tesseract image.png output -l eng for English text).
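A thin Python wrapper around that same invocation might look like this. It assumes the tesseract binary is installed and on PATH (nothing here verifies that), and the function names are mine:

```python
import subprocess

def tesseract_cmd(image_path: str, out_base: str, lang: str = "eng") -> list:
    """Build the tesseract invocation; output lands in out_base + '.txt'."""
    return ["tesseract", image_path, out_base, "-l", lang]

def ocr_image(image_path: str, out_base: str = "output") -> str:
    """Run tesseract on an image and return the extracted text."""
    subprocess.run(tesseract_cmd(image_path, out_base), check=True)
    with open(out_base + ".txt", encoding="utf-8") as fh:
        return fh.read()
```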
PDF Documents:
PDF Text Extraction Tools: Tools like pdftotext (part of poppler-utils) or pdfgrep can be used to extract text from PDF files.
Usage: Install the tool (e.g., sudo apt-get install poppler-utils) and use it to extract text (pdftotext file.pdf output.txt).
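The same call from Python, capturing the text directly on stdout (pdftotext writes to stdout when the output path is "-"); this assumes poppler-utils is installed, and the function names are mine:

```python
import subprocess

def pdftotext_cmd(pdf_path: str, txt_path: str = "-") -> list:
    """Build the pdftotext invocation; '-' sends output to stdout."""
    return ["pdftotext", pdf_path, txt_path]

def pdf_to_text(pdf_path: str) -> str:
    """Extract the text layer of a PDF as a single string."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Note that this only reads the PDF's embedded text layer; a scanned PDF with no text layer would need the OCR route above instead.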
Microsoft Office Documents (Word, Excel, etc.):
LibreOffice/OpenOffice: These suites come with command-line utilities to convert Office documents to text or other formats.
Usage: Use the libreoffice command-line interface to convert documents to text.
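For reference, the headless conversion command can be assembled like this (assumes LibreOffice is installed; the output file name mirrors the input stem with a .txt extension; the function name is mine):

```python
def libreoffice_cmd(doc_path: str, outdir: str = ".") -> list:
    """Build a headless LibreOffice command converting a document to plain text."""
    return [
        "libreoffice", "--headless",
        "--convert-to", "txt",
        doc_path,
        "--outdir", outdir,
    ]
```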
Audio Files (MP3, WAV, etc.):
Speech-to-Text Tools: For extracting text from audio, you need speech recognition software. Tools like Google Cloud Speech-to-Text, IBM Watson Speech to Text, or open-source alternatives like Mozilla’s DeepSpeech can be used.
Process: Convert the audio to a suitable format (if necessary), and then use the speech-to-text tool to extract the spoken words.
Video Files (MP4, MKV, etc.):
Subtitles/CC Extraction: If the video has subtitles or closed captions, tools like ffmpeg can extract them.
Speech-to-Text: For extracting spoken words, convert the video to an audio file using ffmpeg, then use a speech-to-text tool as mentioned above.
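The subtitle-extraction command can be built like so, pulling the first subtitle stream (ffmpeg's stream specifier 0:s:0) out to an SRT file; this assumes ffmpeg is installed, and the function name is mine:

```python
def ffmpeg_subtitle_cmd(video_path: str, srt_path: str) -> list:
    """Build an ffmpeg command extracting the first subtitle stream to SRT."""
    return ["ffmpeg", "-i", video_path, "-map", "0:s:0", srt_path]
```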
Executable and Binary Files:
strings Command: This is still the best tool for extracting plain text from binary files. It’s particularly useful for finding human-readable strings in non-text files.
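For a feel of what strings actually does, here is a minimal re-implementation of its default behaviour in pure Python (the function name is mine): it finds runs of at least min_len printable ASCII characters in a binary blob.

```python
import re

def strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of >= min_len printable ASCII chars, like strings(1)."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]
```

This also illustrates the earlier point in the thread: on compressed data, no such printable runs survive, so strings comes back mostly empty.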
HTML and Web Pages:
Web Scraping Tools: Tools like wget, curl, BeautifulSoup, or lynx can be used to download and extract text from web pages.
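If you'd rather avoid external tools, Python's standard-library html.parser is enough for a basic tag-stripping pass (class and function names here are mine; a real scraper would use BeautifulSoup or similar):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect character data from HTML, skipping script/style contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html: str) -> str:
    """Strip tags from an HTML string, returning the visible text."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```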
Email Files (.eml, .msg):
Email Parsing Tools: Tools like munpack or custom scripts can be used to parse and extract text from email file formats.
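For .eml specifically, no external tool is needed; the standard-library email package parses RFC 5322 messages directly (the function name is mine):

```python
from email import message_from_string
from email.policy import default

def eml_text(raw: str) -> str:
    """Extract the plain-text body from a raw .eml message string."""
    msg = message_from_string(raw, policy=default)
    body = msg.get_body(preferencelist=("plain",))
    return body.get_content() if body else ""
```

Outlook's binary .msg format is a different beast and would still need a dedicated parser.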
Each of these methods is suited to specific file types and scenarios. The effectiveness of text extraction can vary based on the quality of the source material (e.g., image clarity for OCR, audio clarity for speech-to-text) and the capabilities of the tools used.