Does Hashing Tokens Provide Privacy in LLM Training?

:thinking: Question: Is Hashing Still Useful in LLM Training if the Model Can Just Learn the Patterns?

Hey everyone,

I’ve been experimenting with fine-tuning a language model and had a question I couldn’t stop thinking about.

Let’s say I preprocess my dataset by hashing certain words — either for privacy (like names or places) or just to obfuscate common tokens. So instead of training on the word itself, the model sees its hashed version (e.g., hash("John")). The idea is that the model shouldn’t know what the original word was, right?
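For concreteness, here's roughly the kind of preprocessing I mean. This is just a toy sketch: `SENSITIVE`, `hash_token`, the truncated SHA-256, and the whitespace tokenization are placeholders, not what I'd actually ship.

```python
import hashlib

# Placeholder list of terms to obfuscate; in practice this would come from
# an NER pass or a curated list of names/places.
SENSITIVE = {"John", "London"}

def hash_token(token: str) -> str:
    # Deterministic: the same word always maps to the same hashed token.
    return "H_" + hashlib.sha256(token.encode("utf-8")).hexdigest()[:12]

def preprocess(text: str) -> str:
    # Naive whitespace tokenization, just to keep the example small.
    return " ".join(hash_token(t) if t in SENSITIVE else t for t in text.split())

print(preprocess("John met Alice in London"))
# e.g. "H_xxxxxxxxxxxx met Alice in H_yyyyyyyyyyyy" (same replacement every time "John" appears)
```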

But then I thought — if that hashed token shows up in similar contexts often enough, wouldn’t the model just learn what it means anyway? Like, even though it’s hashed, it becomes just another token that gets mapped to a concept — kind of defeating the purpose of hashing in the first place.


So I’m wondering:

  • Does hashing tokens actually protect anything when training large models — or is it just as learnable as regular words, given enough examples?
  • Would using something like salted hashes help (where every occurrence hashes to a different value), or would that just introduce noise? (Rough sketch of what I mean right after this list.)
  • Is hashing more useful at inference time only, rather than during training?
  • Has anyone tried this and found it helpful (or not)?
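To make the salted-hash question concrete, here's a rough before/after sketch (again toy code; the truncated SHA-256 and the 8-byte random salt are arbitrary choices):

```python
import hashlib
import os

def deterministic_hash(token: str) -> str:
    # Same input -> same output, so the model can still learn a stable
    # meaning for the hashed token over many examples.
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:12]

def salted_hash(token: str) -> str:
    # Fresh random salt per occurrence -> every instance becomes a unique
    # token, so there is no consistent mapping left for the model to learn
    # (which also destroys whatever signal the token carried).
    salt = os.urandom(8)
    return hashlib.sha256(salt + token.encode("utf-8")).hexdigest()[:12]

print(deterministic_hash("John"), deterministic_hash("John"))  # identical
print(salted_hash("John"), salted_hash("John"))                # different every call
```

The deterministic version stays learnable exactly as described above; the per-instance salted version removes the stable mapping, which is basically my "just noise" worry.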

I’m asking this in the context of a RAG-based system I’m working on, where we’re trying to protect semi-private info in the training data. I know there are more advanced approaches (like differential privacy), but I’m curious if anyone here has explored simple hashing strategies and what came out of it.

Would love to hear from anyone who’s tested this or just has thoughts on the idea! :raising_hands:

Thanks!

Why not just anonymize the data using random substitutes?

Anonymizing data with random substitutes or hashes isn't enough on its own: surrounding context can still reveal identities, meaning can break in sensitive domains, the model can still learn the patterns, and unsalted hashes of low-entropy values like names can be reversed with a lookup table, so re-identification remains a risk.
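For example, unsalted hashes of common first names are trivial to reverse with a precomputed lookup table. A quick sketch, assuming a plain truncated SHA-256 scheme like the one discussed above:

```python
import hashlib

def hash_token(token: str) -> str:
    # Assumed scheme: plain (unsalted) truncated SHA-256 of the word.
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:12]

# An attacker who guesses the scheme can precompute hashes for a list of
# candidate names (real name lists have tens of thousands of entries, still cheap).
CANDIDATES = ["Alice", "Bob", "John", "Maria"]
lookup = {hash_token(name): name for name in CANDIDATES}

leaked = hash_token("John")   # a hashed token observed in the training data
print(lookup.get(leaked))     # -> "John"
```

A secret salt or keyed hash blocks this particular lookup, but the context-leakage and utility problems above still apply.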