Lumping mixed data together into large chunks reduces the quality of matches for specific topics, and it also inflates the amount of data you then have to provide to the AI. Shrink the chunks too far, though, and you reach a point where the surrounding context of the text is lost.
It depends on the type of data. Imagine this for your CEO’s biography:
chunk_1: Mailhouse was wearing a Detroit Red Wings hockey sweater, and Reeves (an avid hockey fan and a keen player of the sport) asked if Mailhouse needed a goalie. As the two men formed a friendship, they began jamming together, and were joined by Gregg Miller as the original lead guitarist and singer in 1992.
chunk_2: Reeves was born in Beirut, Lebanon, on September 2, 1964, the son of Patricia (née Taylor), a costume designer and performer, and Samuel Nowlin Reeves Jr. His mother is English, originating from Essex.[10] His American father is from Hawaii, and is of Native Hawaiian, Chinese, English, Irish, and Portuguese descent.[5][11][12] His paternal grandmother is Chinese Hawaiian.[13] His mother was working in Beirut when she met his father,
chunk_3: He plays bass guitar for the band Dogstar and pursued other endeavours such as writing and philanthropy.
Semantic matching via an embeddings engine would give precise yet unhelpful results on these chunks. “what celebrity was in the band Dogstar” doesn’t return useful information, because the best match, chunk_3, never names the celebrity. “what red wings players are musicians?” or “costume designers in Hawaii” return chunks with high similarity scores that are less than useful.
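Here’s a minimal sketch of that search, using the OpenAI embeddings endpoint (the model name and the `embed`/`cosine` helpers are illustrative, not anything required):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The three biography chunks, shortened here for brevity.
chunks = [
    "Mailhouse was wearing a Detroit Red Wings hockey sweater, and Reeves "
    "(an avid hockey fan) asked if Mailhouse needed a goalie. ...",
    "Reeves was born in Beirut, Lebanon, on September 2, 1964, the son of ...",
    "He plays bass guitar for the band Dogstar and pursued other endeavours "
    "such as writing and philanthropy.",
]

def embed(texts):
    """Return one embedding vector per input string."""
    response = client.embeddings.create(
        model="text-embedding-3-small", input=texts
    )
    return np.array([item.embedding for item in response.data])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

chunk_vectors = embed(chunks)
query_vector = embed(["what celebrity was in the band Dogstar"])[0]

# Even the top-scoring chunk (chunk_3) never names Reeves, so the AI is
# handed text that cannot answer "what celebrity".
for i, vec in enumerate(chunk_vectors, start=1):
    print(f"chunk_{i}: {cosine(query_vector, vec):.3f}")
```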
Data augmentation can thus be useful, both as extra text included when creating the embedding and as extra data provided to the AI. The matches would improve considerably if each chunk carried metadata such as “Keanu Reeves Biography part 18 - As a musician part 3 - keywords: dogstar, bass, bandmates”.
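A sketch of the augmented version, reusing `embed` and `cosine` from the snippet above: the metadata header is prepended to the chunk text before embedding, and the same augmented form is what you later hand to the AI. The header format here is made up; use whatever titles and keywords your pipeline can generate.

```python
# Same chunk, now carrying a metadata header.
augmented_chunks = [
    {
        "metadata": ("Keanu Reeves Biography part 18 - As a musician part 3"
                     " - keywords: dogstar, bass, bandmates"),
        "text": ("He plays bass guitar for the band Dogstar and pursued other"
                 " endeavours such as writing and philanthropy."),
    },
    # ... one entry per chunk ...
]

# Embed header + text together, so the vector itself carries the context
# the raw sentence lacks (who "he" is, which document this came from).
texts = [f'{c["metadata"]}\n{c["text"]}' for c in augmented_chunks]
vectors = embed(texts)

# At question time, retrieve by similarity as before, but pass the AI the
# augmented form, so it can connect "Dogstar" back to "Keanu Reeves".
query = embed(["what celebrity was in the band Dogstar"])[0]
best_text, _ = max(zip(texts, vectors), key=lambda tv: cosine(query, tv[1]))
print(best_text)
```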