Maximum recursion depth exceeded in comparison

Hello! I am using gpt-3.5-turbo. What I am doing here is loading a bunch of text files and creating embeddings in FAISS. Before I create the embeddings, I need to split the files into small chunks. When I tried the text_splitter from LangChain, I got a RecursionError.

RecursionError: maximum recursion depth exceeded in comparison.

After some research online, I tried raising the recursion limit. However, if I set the limit too high, my Colab session crashes. Does anyone have a good solution for this scenario? Thank you in advance.

import sys

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=20, separators=[" ", ",", "\n"]
)

sys.setrecursionlimit(10000)
print(sys.getrecursionlimit())

texts = text_splitter.split_documents(documents)


I don’t know about LangChain, but guessing at what this does, it’s splitting the text into too many chunks and then calling itself on each of them.

I’d suggest a simpler separator. Or try smaller text: you can break it up yourself before feeding it in (a rough sketch follows), or maybe LangChain has a utility for that.
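Roughly what I mean, as a minimal sketch (raw_texts is a placeholder for however you already load your files, and splitting on blank lines is just one coarse way to pre-break the text):

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

chunks = []
for raw_text in raw_texts:  # raw_texts: your already-loaded file contents (placeholder)
    # coarse manual pre-split by blank lines so the splitter never sees one huge input
    for paragraph in raw_text.split("\n\n"):
        if paragraph.strip():
            chunks.extend(text_splitter.split_text(paragraph))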


You might have better luck asking in the LangChain community Discord.


I also have this error now, also using RecursiveCharacterTextSplitter(). May I ask how you solved it?

Unfortunately, I haven’t figured it out yet.

Presumably the problem is that it recurses per chunk, instead of iterating.

But why are you using an LLM library to split text? If your text comes in Markdown or HTML, there is already structure in the headers and paragraphs/sections. It’s quite straightforward to write a script that parses the input format and slices the document into sub-sections, and if a sub-section is still too long, splits it by paragraphs and glues the sub-section header onto the front of each of the extra pieces. A sketch of that idea is below.
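Something like this minimal sketch, assuming plain Markdown with "#"-style headers (the max_len value and the blank-line paragraph splitting are just illustrative choices):

def split_markdown(text, max_len=1000):
    # Slice a Markdown document into chunks of roughly max_len characters,
    # gluing the current section header in front of every emitted chunk.
    chunks, header, buffer = [], "", ""
    for block in text.split("\n\n"):
        if block.lstrip().startswith("#"):
            # new section header: flush whatever we have collected so far
            if buffer:
                chunks.append(f"{header}\n\n{buffer}" if header else buffer)
            header, buffer = block.strip(), ""
        elif buffer and len(buffer) + len(block) > max_len:
            # section grew too long: emit it and start a new chunk under the same header
            chunks.append(f"{header}\n\n{buffer}" if header else buffer)
            buffer = block
        else:
            buffer = f"{buffer}\n\n{block}" if buffer else block
    if buffer:
        chunks.append(f"{header}\n\n{buffer}" if header else buffer)
    return chunks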

The order of the elements in the separators argument is essential. You can check the split_text function in the RecursiveCharacterTextSplitter class to learn more about it. Passing ["\n\n", "\n", " ", ""] as separators, or simply not passing the separators argument at all, should solve the problem.
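For example, keeping the same chunk_size and chunk_overlap as the original post (these separators are the splitter's defaults, ordered from coarsest to finest, so omitting the argument entirely does the same thing):

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=20,
    separators=["\n\n", "\n", " ", ""],
)
texts = text_splitter.split_documents(documents)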