Hello! I am using gpt-3.5-turbo. I am loading a bunch of text files and creating embeddings in FAISS. Before I create the embeddings, I need to split the text into small chunks. When I tried the text_splitter from Langchain, I got a RecursionError:
RecursionError: maximum recursion depth exceeded in comparison.
After some research online, I tried raising the recursion limit. However, if I set the limit too high, my Colab session crashes. Does anyone have a good solution for this scenario? Thank you in advance.
Presumably the problem is that it recurses once per chunk instead of iterating.
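If that is the cause, you can sidestep the recursion limit entirely by splitting with a plain loop. A minimal sketch (not Langchain's API, just a hypothetical fixed-size splitter with overlap):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks with some overlap.

    Uses a while loop rather than recursion, so the input length
    never hits Python's recursion limit.
    """
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```

You can then feed the resulting list straight into your embedding step instead of going through the recursive splitter.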
But why are you using an LLM to split text? If your text comes in Markdown or HTML, there is already structure in the headers and paragraphs/sections. It's quite straightforward to write a script that parses the input format and slices the document into sub-sections, and if some sub-section is still too long, splits it by paragraphs and glues the sub-section header in front of each of the extra pieces.
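For the Markdown case, that approach can be sketched in a few lines. This is an illustrative example, not production code; the header regex and the `max_len` threshold are assumptions you would tune for your data:

```python
import re

def split_markdown(doc, max_len=1000):
    """Slice a Markdown document at headers; if a section is still too
    long, split it by blank-line paragraphs and repeat the section
    header in front of each extra piece."""
    # Split at zero-width positions before lines starting with '#',
    # so each section keeps its own header line.
    sections = [s for s in re.split(r"(?m)^(?=#{1,6} )", doc) if s.strip()]
    chunks = []
    for sec in sections:
        if len(sec.strip()) <= max_len:
            chunks.append(sec.strip())
            continue
        header, _, body = sec.partition("\n")
        header = header.strip()
        paras = [p.strip() for p in body.split("\n\n") if p.strip()]
        piece = header
        for para in paras:
            if len(piece) + len(para) + 2 > max_len and piece != header:
                chunks.append(piece)
                piece = header  # glue the header onto the next piece
            piece += "\n\n" + para
        chunks.append(piece)
    return chunks
```

Every chunk then carries its section header, so the embedding keeps some context about where the text came from.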