A Proposal for Full-Scan MapReduce Rather Than RAG for Rigorous Document Analysis

Summary

RAG is vulnerable to retrieval misses. For rigorous document analysis, would full-scan plus hierarchical approval be more appropriate?
A plain single API call does not offer the multi-agent behavior of the hosted ChatGPT service provided by OpenAI. In my implementation experience, under that constraint, when a model is asked to evaluate business documents with deep reasoning, the response time can reach on the order of 100 seconds. With MapReduce, wall-clock latency can be shortened through parallel processing.
Should OpenAI introduce a file_analysis API based on this MapReduce mechanism, rather than relying on code_interpreter and RAG-based file_search?

Disclaimer

OpenAI ChatGPT was used for minor suggestions and for assistance in preparing the English text of this document. The ideas presented here are entirely my own, and I take full responsibility for them.

Full-Scan MapReduce Rather Than Search-and-Retrieval RAG

  • RAG is vulnerable to retrieval misses. For rigorous document analysis, full scan plus hierarchical approval by orchestrators should be chosen instead. In this architecture, the computational units are not mere processes but AIs, that is, agents.
  • In addition, by using MapReduce, we may reduce the response time required for the deep reasoning demanded by rigorous document analysis. The total amount of computation and the total compute cost may remain unchanged or increase, but wall-clock latency may be drastically reduced.

MapReduce Targeting AI Agents, Not Computer Processes

  • Classical MapReduce distributes computation, whereas LLM MapReduce distributes semantic interpretation by AI agents.
  • In ordinary MapReduce, multiple processes handle chunks of big data in parallel in order to reduce wall-clock latency. In this LLM MapReduce, by contrast, map means transforming the natural language embedded in structured data, that is, mapping from natural-language space to natural-language space. In other words, the computational units are not mere processes but LLMs, namely agents.
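As a minimal sketch of this parallel semantic map, the snippet below runs one call per chunk concurrently; `semantic_map` is a stand-in for a real LLM API call, not an existing endpoint:

```python
from concurrent.futures import ThreadPoolExecutor

def semantic_map(chunk: str) -> str:
    # Stand-in for one LLM call: natural language in, natural language out.
    # A real implementation would call a chat-completion API here.
    return f"mapped({chunk})"

chunks = ["clause A", "clause B", "clause C"]

# Run the per-chunk calls in parallel: total compute is unchanged,
# but wall-clock latency approaches that of a single call.
with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
    results = list(pool.map(semantic_map, chunks))
```

Because `pool.map` preserves input order, `results[i]` always corresponds to `chunks[i]`, which matters when the reducer needs provenance.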

Approval Under Global Context by the Root Orchestrator

  • If we let each LLM map chunks independently, the mapped values of the chunks will have no meaningful relation to one another. In that case, when those mapped values are aggregated, the result will not be meaningful. Therefore, we must first generate a global context, for example by having an LLM quickly summarize the document.
  • A root orchestrator of LLMs distributes that global context to the worker LLMs at each level, and those LLMs generate mapped values based on that global context.
  • The root orchestrator then decides whether to approve or reject the final aggregated candidate result. If it rejects the result, the worker LLMs must process the chunks again.
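A minimal sketch of this approval loop follows; every function is a stub standing in for an actual LLM call, and the function names are illustrative, not an existing API:

```python
def summarize(chunks):
    # Stand-in for a quick whole-document summary by one LLM.
    return "global context: " + " | ".join(c[:16] for c in chunks)

def map_chunk(chunk, context):
    # Stand-in for a worker LLM evaluating one chunk under the global context.
    return f"eval({chunk[:16]})"

def reduce_results(mapped):
    # Stand-in for the aggregating LLM.
    return "; ".join(mapped)

def approve(candidate, context):
    # Stand-in for the root orchestrator's global adjudication.
    return len(candidate) > 0

def orchestrate(chunks, max_rounds=3):
    context = summarize(chunks)              # global context comes first
    for _ in range(max_rounds):
        mapped = [map_chunk(c, context) for c in chunks]
        candidate = reduce_results(mapped)
        if approve(candidate, context):      # approve or reject
            return candidate
    return None                              # still rejected: escalate to a human

final = orchestrate(["chunk one text", "chunk two text"])
```

The bounded `max_rounds` is a design choice: re-processing on rejection must terminate rather than loop indefinitely.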

Contradiction Detection and Resolution Across Chunks and Agents

  • In rigorous document analysis, contradictions must be detected not only within a single chunk but also across multiple chunks and across the mapped values produced by multiple agents.
  • This is because a claim may appear correct within one chunk, but it may be modified or negated by limiting conditions, exceptions, objections, or the interpretation of another agent found in other chunks.
  • Therefore, an orchestrator is not merely an aggregator. It must examine relations such as support, contradiction, qualification, and exception among the mapped values result_i and mediate them.
  • In this case, contradiction resolution must not be a simple majority vote. It must be a judgment based on global context, provenance, evidence strength, scope, and consistency.
  • Therefore, in a full-scan architecture, contradiction detection and resolution are not auxiliary functions but central functions for establishing the validity of the final result.
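As an illustrative sketch of such mediation, the pairwise check below uses a toy heuristic where a real orchestrator would itself call an LLM to judge the relation; the record fields (claim, strength, source) are an assumed schema:

```python
from itertools import combinations

# Mapped values carrying the attributes an orchestrator needs (assumed schema).
results = [
    {"claim": "payment due in 30 days", "strength": 0.9, "source": "sec 4.1"},
    {"claim": "payment due in 60 days", "strength": 0.4, "source": "sec 9.2"},
    {"claim": "delivery within 14 days", "strength": 0.8, "source": "sec 2.3"},
]

def contradicts(a, b):
    # Toy heuristic: same subject word, different claim. A real system
    # would ask an LLM for support/contradiction/qualification/exception.
    return (a["claim"].split()[0] == b["claim"].split()[0]
            and a["claim"] != b["claim"])

def resolve(a, b):
    # Not a majority vote: adjudicate by evidence strength (and, in a real
    # system, provenance, scope, and the global context).
    return a if a["strength"] >= b["strength"] else b

winners = [resolve(a, b) for a, b in combinations(results, 2)
           if contradicts(a, b)]
```

Here the two payment claims contradict each other and the better-evidenced one wins, while the delivery claim is untouched.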

Distributed Cognition Rather Than Distributed Computing

  • Classical MapReduce is a form of distributed computing, because each node merely executes a predefined procedure over local data.
  • By contrast, LLM MapReduce is a form of distributed cognition, because each node is not a mere process but an interpretive agent that reads, evaluates, and transforms local content into semantic judgments.
  • In this architecture, map is not merely a computational procedure but local semantic interpretation, reduce is not merely compression but the integration of judgments, and approval by the root orchestrator is global adjudication.
  • Therefore, in this architecture, intelligence does not exist only inside a single massive model, but is constituted in the hierarchical interactions among multiple agents.
  • Hence, this MapReduce-like architecture should be understood not as an extension of distributed computing but as a transition to distributed cognition.

Appendix

A.1. Graph Construction Based on Network Theory

  • We should transform the set of chunks into a graph. The structure of that graph should be hierarchical so that it reflects the structure of the document: Document, Section, Paragraph, Subparagraph, Word.
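A minimal tree representation of that hierarchy might look like the following; the field names and the sample document are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    level: str                 # "document" | "section" | "paragraph" | ...
    text: str
    children: list = field(default_factory=list)

doc = Node("document", "Service Agreement", [
    Node("section", "Payment Terms", [
        Node("paragraph", "Invoices are due within 30 days."),
    ]),
    Node("section", "Delivery", [
        Node("paragraph", "Goods ship within 14 days."),
    ]),
])

def nodes_at(node, level):
    # Depth-first collection of all nodes at one hierarchy level,
    # e.g. to select the chunks handed to the mapping agents.
    found = [node] if node.level == level else []
    for child in node.children:
        found.extend(nodes_at(child, level))
    return found
```

Chunking then becomes a query against the graph (e.g. `nodes_at(doc, "paragraph")`) instead of a blind character-count split.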

A.2. Hierarchical Semantic Space Analogous to Topological Space

  • For such a graph-structured set, a notion of nearness can be introduced from the perspective of semantics; the set may therefore be interpreted as a topological space. Under additional conditions it may be metrizable, and in some cases it may even be regarded as having a manifold-like structure.

A.3. Communication Hierarchy Beyond Document Hierarchy

  • We can extend document processing by MapReduce to communication processing.
  • This is because communication, like documents, also has the hierarchical structure of natural language. Communication > Conversation > Turn: Query and Response > Sentence > Clause > Compound Word > Word > …

A.4. Natural Language Generation as Movement in Semantic Space

  • Above, I referred to the possibility of regarding a document as a topological space. Communication, as an extension of that idea, may also be regarded as a topological space.
  • In a topological space, nearness can be expressed through neighborhoods and open sets. That is, we can define the nearness of a query or response in natural-language space understood as a topological space.
  • By applying this idea, it may become possible to generate a response that is meaningful with respect to a query.
  • In other words, the correspondence between a response and a turn or query may be regarded as a transition in that semantic space.

A.5. Socially Open Non-Deterministic Mapping

  • A program is generally regarded as a deterministic mapping that uniquely maps an element of a domain to an element of a codomain.
  • However, when the structure of the space becomes highly complex, and when a program is socially open, the program becomes non-deterministic because of the social heat bath formed by a vast number of members and the complexity inside that space.
  • If subjectivity exists, then so does social reality. That is, anything that exists in society is necessarily exposed to this social heat bath and is therefore non-deterministic.

A.6. Evidence-Preserving Aggregation Rather Than Free-Form Summarization

  • We must require the orchestrator to verify whether a response to a query is supported or backed by facts and evidence.
  • In other words, a response must be the product of evidence-preserving information retrieval. That is, we should not allow LLMs to map chunks by entirely unconstrained methods.
  • Facts and evidence must retain identifiable attributes such as proposition, provenance, exact span, support relation, strength, scope, context, temporal validity, consistency, and retrievability.
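These attributes can be made explicit as a record that every mapped value must carry; the schema below is an illustrative sketch, not a fixed specification:

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    proposition: str       # the claim being made
    provenance: str        # document and chunk identifier
    span: tuple            # (start, end) offsets of the exact quote
    relation: str          # "supports" | "contradicts" | "qualifies"
    strength: float        # 0.0 .. 1.0
    scope: str             # conditions under which the claim holds
    valid_until: str       # temporal validity, e.g. an ISO date

ev = Evidence(
    proposition="Invoices are due within 30 days.",
    provenance="contract.pdf#chunk-12",
    span=(1042, 1074),
    relation="supports",
    strength=0.9,
    scope="standard orders only",
    valid_until="2026-12-31",
)
```

Keeping the exact span and provenance makes every aggregated claim retrievable back to its source text, which is what distinguishes evidence-preserving aggregation from free-form summarization.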

A.7. Credit-Based Economics for High-Compute Semantic AI

  • MapReduce-like full-scan semantic processing requires substantially greater computational resources than lightweight retrieval-based question answering.
  • In particular, parallel mapping by multiple agents, hierarchical aggregation, contradiction detection, global approval, re-processing, and graph construction require large amounts of compute, memory, storage, and API budget.
  • Therefore, this kind of AI should be provided not as a uniform flat-rate function but under credit-based economics.
  • The reason is that, although this architecture may increase total compute cost, it may significantly reduce the wall-clock latency required for the deep reasoning needed in rigorous document analysis.
  • In other words, users should pay credits not merely for response generation, but for a semantic processing capability that is high-accuracy, high-compute-cost, and high-responsibility.
  • Therefore, credit-based economics is not an incidental charging design but an institutional condition for realistically implementing and operating high-compute semantic AI.

An Implementation Example for a PoC

  1. A user uploads a document through the frontend.
  2. The backend converts the document into structured data by using Microsoft Document Intelligence or python-pptx.
  3. If the document contains data that cannot be structured by python-pptx, the backend routes it to Microsoft Document Intelligence.
  4. The backend divides the structured data into chunks.
  5. The backend sends those chunks one by one to multiple LLMs.
  6. Those LLMs perform full-scan evaluation on those chunks from the perspective of the document, that is, map from chunk_i to result_i.
  7. The backend stores those evaluations in object storage or an in-memory database.
  8. The backend sends all of those evaluations to one LLM.
  9. That LLM aggregates them, that is, performs reduce.
  10. The frontend displays the aggregated result to the user.

Note that, in practice, unless one LLM provides a global context to each mapping LLM, a meaningful aggregated result cannot be obtained.

Also note that the purpose of a PoC is this: although there should ultimately be multiple hierarchical orchestrators that aggregate result_i, such a complex implementation should not be built from the beginning. First, we should confirm whether an answer can be obtained at all, and then human beings should evaluate its quality.
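Under those caveats, steps 2 through 9 can be sketched in a few lines; `llm` is a stand-in for a real chat-completion call, and the fixed-size chunking is a deliberate simplification of the structuring step:

```python
from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:
    # Stand-in for a real chat-completion API call.
    return f"[{prompt[:24]}...]"

def analyze(document: str, chunk_size: int = 200) -> str:
    # Steps 2-4: structure and chunk (naive fixed-size split here).
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    # Global context: without it the mapped values do not cohere.
    context = llm("Summarize briefly:\n" + document[:1000])
    # Steps 5-6: parallel full-scan map, chunk_i -> result_i.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(
            lambda c: llm(context + "\nEvaluate:\n" + c), chunks))
    # Steps 8-9: single-level reduce (hierarchical orchestrators come later).
    return llm("Aggregate:\n" + "\n".join(results))

report = analyze("Lorem ipsum dolor sit amet. " * 40)
```

Storage of the intermediate evaluations (step 7) is omitted; in the PoC they would be written to object storage or an in-memory database between the map and reduce calls.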

Notice

  1. Because multiple LLMs, that is, multiple agents, are used, sufficient attention must be paid to rate limits.
  2. Attention must also be paid to out-of-memory conditions in the backend.
  3. Attention must also be paid to the requirement that sizeof(result_i) < E(sizeof(chunk_i)), that is, each mapped value must on average be smaller than its input chunk. If this condition is not satisfied, the total amount of computation may diverge, and wall-clock latency may increase.
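The effect of this size condition can be illustrated with a small token-count model; ratio (output/input tokens per agent) and fan_in (inputs merged per reducer) are illustrative parameters, not measured values:

```python
import math

def total_tokens(n_items, item_tokens, ratio, fan_in=8):
    # Tokens read across all reduce levels. Each agent merges fan_in inputs
    # and emits ratio * (tokens it read), so per-level work scales by ~ratio;
    # the sum stays bounded only when ratio < 1.
    total = 0.0
    while n_items > 1:
        total += n_items * item_tokens
        item_tokens *= fan_in * ratio          # size of one merged output
        n_items = math.ceil(n_items / fan_in)  # agents at the next level
    return total

shrinking = total_tokens(512, 1000, ratio=0.3)  # satisfies the size condition
growing = total_tokens(512, 1000, ratio=1.2)    # violates it
```

With ratio below 1, each level of the hierarchy processes fewer tokens than the one before; with ratio above 1, per-level work grows geometrically with depth.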