Do embedding models treat line breaks, list/bullet formatting, etc. as semantically meaningful?

Hi everyone,

I’m working on a document preprocessing pipeline (PDF → text/Markdown → embeddings) and had a question about how embedding models handle formatting, specifically line breaks, lists, bullet points, and so on.

For example, do modern embedding models (e.g. OpenAI text-embedding models) produce meaningfully different embeddings for text like:
Option 1:

Dynamic Markets
Unlocking small business innovation 
and growth through the rise of the 
Important Notice from Deloitte: - Would it matter if i used lists - or llms don't care and only see how they are arranged

Option 2:

Dynamic Markets Unlocking small business innovation and growth through the rise of the
Important Notice from Deloitte:
- Would it matter if i used lists
- or llms don't care and only see how they are arranged

You can expect that AI models from OpenAI do show a bit of “activation” upon seeing Markdown. We really don’t know how OpenAI trained them beyond corpus pretraining, though.

This is a question the embeddings model can answer itself, via the score difference it returns.

Here is Python embeddings code where I created a non-Markdown version of some Markdown text as the second string to send to the text-embedding-3-large model:

import os, base64, httpx
import numpy as np

texts = [r"""

## Why “parallel streams” isn’t great here

You *can* do something like:

```python
async def get_chat_response(...) -> tuple[AsyncIterator[str], AsyncIterator[UIEvent]]:
    ...
```

But then the application must concurrently consume both streams:

* coordinate termination (when assistant ends vs when UI ends),
* ensure neither iterator blocks the other,
* handle exceptions from either side,
* cancel both properly on exit.

That pushes complexity into your “application layer,” which you explicitly want to avoid.

## Alternative: callback/hook for sideband messages

If you insist that the main application only iterate assistant text, you can keep:

```python
AsyncIterator[str]
```

…and deliver sideband messages via a callback:

```python
from typing import Awaitable, Callable

UIHook = Callable[[Event], Awaitable[None]]

async def get_chat_response(prompt: str, *, stream: bool = False, ui: UIHook | None = None) -> AsyncIterator[str]:
    if ui:
        await ui(Event(type="ui", text="Calling model..."))
    ...
```
""".strip(),
# --- string 2 ---
r"""

Why “parallel streams” isn’t great here

You *can* do something like:

async def get_chat_response(...) -> tuple[AsyncIterator[str], AsyncIterator[UIEvent]]:
...

But then the application must concurrently consume both streams:

• coordinate termination (when assistant ends vs when UI ends),
• ensure neither iterator blocks the other,
• handle exceptions from either side,
• cancel both properly on exit.

That pushes complexity into your “application layer,” which you explicitly want to avoid.

Alternative: callback/hook for sideband messages

If you insist that the main application only iterate assistant text, you can keep AsyncIterator[str], and deliver sideband messages via a callback:

from typing import Awaitable, Callable

UIHook = Callable[[Event], Awaitable[None]]

async def get_chat_response(prompt: str, *, stream: bool = False, ui: UIHook | None = None) -> AsyncIterator[str]:
    if ui:
        await ui(Event(type="ui", text="Calling model..."))
    ...

""".strip(),
# ... up to 2048 strings
]
dimensions = 3072  # 3072 max for 3-large; 1536 max for 3-small (ada-002 is fixed at 1536)
params = {
    "model": "text-embedding-3-large",
    "input": texts,
    "encoding_format": "float",  # "base64" | "float"
    "dimensions": dimensions,
}

try:
    with httpx.Client(timeout=1800) as client:
        resp = client.post(
            "https://api.openai.com/v1/embeddings",
            headers={"Authorization": f"Bearer {os.getenv("OPENAI_API_KEY")}"},
            json=params,
        )
        resp.raise_for_status()
except httpx.HTTPStatusError as e:
    print(f"Request failed: {e}")
    if e.response is not None:
        try:
            # print body error messages from OpenAI
            print("Error response body:\n", e.response.text)
        except Exception:
            raise
    raise
except httpx.RequestError as e:
    print(f"Request error: {e}")
    raise
else:
    print(f"For {len(texts)} texts, received {len(resp.content)} char body")
    print(f"Snippet: {resp.text[:600]}")
    response_dict = resp.json()
    data_list = response_dict["data"]
    data_list.sort(key=lambda d: d["index"])

n = len(data_list)  # received item count
embeddings = np.empty((n, dimensions), dtype=np.float32)  # reserve full block

# Each "embedding" return may be a base64 string of raw float32 bytes,
# or a plain text list of floats.
# example b64: "embedding": "Lr4Zvt1heb+KPae7Af1YPdYGJD4=" (5 dimensions)
for row_idx, item in enumerate(data_list):
    e = item["embedding"]
    vector = (
        np.frombuffer(base64.b64decode(e, validate=True), dtype=np.float32, count=dimensions)
        if isinstance(e, str)
        else np.asarray(e, dtype=np.float32)
    )
    embeddings[row_idx] = vector

# `embeddings` is now an (n, dimensions) NumPy array in the same order as `texts`
print(embeddings.shape)   # (count, dimensions), e.g. (2, 3072) here
print("Similarity: ", np.dot(embeddings[0], embeddings[1]))  # dot product comparison

At the end, a dot product gives semantic similarity, because the returned vectors are pre-normalized to unit length (so the dot product equals cosine similarity).

(2, 3072)
Similarity:  0.9751775
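
You can sanity-check that pre-normalization claim on the returned vectors; a minimal sketch reusing the `embeddings` array from above (note: my understanding is that if you truncate dimensions yourself, instead of via the `dimensions` parameter, you must re-normalize before a dot product is a true cosine similarity):

```python
# Sanity check: vectors returned by the API are unit length, so the
# dot product above is already cosine similarity.
print(np.linalg.norm(embeddings[0]), np.linalg.norm(embeddings[1]))  # both ≈ 1.0

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # General cosine similarity, for vectors that are NOT unit length
    # (e.g. if you truncated the dimensions yourself).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```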

So the two versions live in the same embedding space, but like any token change (and here there are several), the AI’s understanding, the math, and the resulting vector are going to be different.
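
You can inspect that token-level difference directly; a quick sketch with tiktoken (the text-embedding-3 models use the cl100k_base encoding):

```python
import tiktoken

# Compare how the Markdown and plain versions tokenize.
enc = tiktoken.get_encoding("cl100k_base")  # encoding used by text-embedding-3-large
tokens_md = enc.encode(texts[0])
tokens_plain = enc.encode(texts[1])
print(len(tokens_md), len(tokens_plain))  # different counts and IDs -> different model input
```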

I completely changed the intention behind the second text by appending:

### QUESTION
What is this passage discussing

“”"
Which despite leaving a different state, Similarity: 0.95712936

Your underlying question is likely about a ranking situation: do you get back good vs. bad chunks? I’ll “ask” a query and see how both of those passages score against it:
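
Roughly, that query step looks like the sketch below, reusing the request pattern and the `embeddings` array from above (the query string shown is a placeholder; I’m not reproducing the actual query text):

```python
# Sketch: embed one query, then score it against the stored passage vectors.
# The query text is a placeholder, not the actual query behind the scores below.
query_params = {
    "model": "text-embedding-3-large",
    "input": ["example retrieval query goes here"],
    "encoding_format": "float",
    "dimensions": dimensions,
}
with httpx.Client(timeout=1800) as client:
    q_resp = client.post(
        "https://api.openai.com/v1/embeddings",
        headers={"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"},
        json=query_params,
    )
    q_resp.raise_for_status()

query_vec = np.asarray(q_resp.json()["data"][0]["embedding"], dtype=np.float32)
for passage_vec in embeddings:
    print("Similarity: ", np.dot(query_vec, passage_vec))
```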

Similarity: 0.32220316
Similarity: 0.32865414

Looks like there’s not going to be much shift in how these compare to other chunks in a document retrieval situation.

If you made a large trial of varied texts with and without Markdown, and algorithmically profiled all the dimensions returned by the embeddings model, you might find that a few value positions carry “markdown detected” as a characteristic. Identify those, use them exclusively, and make a semantic Markdown detector AI?
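
A minimal sketch of what that profiling could look like, assuming hypothetical paired arrays `md_embeddings` and `plain_embeddings` (the same texts embedded with and without Markdown, one row per pair):

```python
# Sketch: find dimensions that shift most consistently when Markdown is present.
# md_embeddings / plain_embeddings: hypothetical (n_pairs, dimensions) float32
# arrays of unit-normalized embeddings for paired texts.
diffs = md_embeddings - plain_embeddings               # per-pair, per-dimension shift
mean_shift = diffs.mean(axis=0)                        # average direction of shift
consistency = np.abs(mean_shift) / (diffs.std(axis=0) + 1e-9)  # signal-to-noise per dim

top_k = 10
candidate_dims = np.argsort(consistency)[::-1][:top_k]
print("Dimensions most correlated with Markdown:", candidate_dims)
```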
