Why did Meta open source Llama 2?
And notes on Language Model Privacy and Oppenheimer
Here’s what I said before the open sourcing was announced:


Footnotes:
Orca is a Microsoft model built by fine-tuning Llama 1 – an improved version of it.
Apache 2 is a software license that permits commercial use.
Short remarks on Llama 2 being made available
Mostly, it discredits the OpenAI thesis that releasing language models for broad use is unsafe, combined with the claim that only certain people/groups should be allowed access.
In fairness, OpenAI – even if they don’t release their models publicly (and I don’t think they necessarily should have to) – were the first to make their language models broadly available as a service (not in raw form). Bigger companies were scared to do that. OpenAI has done a lot to open up access to language models.
Safety is important in building language models. It’s fine to keep models closed for competitive reasons, but unhealthy to do so in the name of safety and creating fear around language model technology. This isn’t Oppenheimer – and even there you have publicly available videos describing how to make atomic bombs.
Notes on Privacy in LLMs
Language models read in language; you can’t feed them encrypted text. Any information that goes to a language model arrives as plain text – it’s unencrypted.
If you’re using a powerful model, like ChatGPT or OpenAI’s API, the data is going through OpenAI’s servers (see the sketch after this list):
If you use ChatGPT and save your chats (the default), then your content can be used for training OpenAI’s models. I assume they filter out personal data, but I’m guessing that’s not perfect.
If you use a product that uses OpenAI’s language models as a service (like Research Buddy), then the data is not used by OpenAI for training models and it’s deleted within 30 days – at least according to their current terms of service.
There’s a third option: as a company, you can get Azure (Microsoft) to spin up a dedicated server for you that runs the GPT model. This starts to make sense for a big company, and it reduces third-party dependency for privacy.
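To make the plain-text point concrete, here’s roughly what a call to OpenAI’s API looks like – a minimal sketch using the openai Python client as of mid-2023 (the API key, model name and prompt are placeholders). TLS encrypts the request in transit, but OpenAI’s servers necessarily see your prompt as ordinary text:

    import openai

    openai.api_key = "sk-..."  # your API key

    # Whatever you put in `messages` leaves your machine and is readable
    # by OpenAI's servers as plain text (TLS only protects it in transit).
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Summarise this private document: ..."}],
    )
    print(response["choices"][0]["message"]["content"])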
Now, with Llama 2 being available, it will be possible to run quite a good language model on your own laptop (using quantization* and the smaller models – 7 billion parameters instead of the full 70 billion). Hardly anyone is going to do this because it’s a lot of hassle. Although, if language models get 10x better again (which they will) then possibly the language model will just be installed on laptops and phones by default. That’s the best option for privacy and sounds like the kind of thing Apple will do.
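For anyone who does want the hassle, this is the rough shape of it – a sketch using the llama-cpp-python bindings, assuming you’ve already downloaded a quantized Llama 2 7B model file (the path below is a placeholder):

    from llama_cpp import Llama

    # Load a quantized 7B model file from disk; everything runs locally,
    # so no prompt or output ever leaves your machine.
    llm = Llama(model_path="./llama-2-7b.q4.bin", n_ctx=2048)

    output = llm("Q: Why run a language model locally? A:", max_tokens=64, stop=["Q:"])
    print(output["choices"][0]["text"])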
I’d have to think about this more, but there probably is a way to do a privacy-preserving language model as a service. Somehow the model would have to take in your encrypted data along with some kind of key and compute on it without ever decrypting it – roughly what fully homomorphic encryption promises, though as far as I know it’s nowhere near practical at LLM scale. Maybe I’m wrong. Needs more thought.
On a related note, I’ve just updated Research Buddy so that you can enable encryption of your files. It just means that any files you store with Research Buddy can’t be read by Research Buddy or anyone without your password. You can try it here:
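Research Buddy’s actual implementation isn’t shown here, but the general pattern is encryption with a password-derived key – a minimal sketch in Python using the cryptography library (the password and file contents are placeholders):

    import base64
    import os

    from cryptography.fernet import Fernet
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

    def derive_key(password: str, salt: bytes) -> bytes:
        # Derive a 32-byte key from the user's password; without the
        # password, the stored files can't be read by anyone.
        kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32,
                         salt=salt, iterations=480_000)
        return base64.urlsafe_b64encode(kdf.derive(password.encode()))

    salt = os.urandom(16)                  # stored alongside the ciphertext
    f = Fernet(derive_key("user-password", salt))
    ciphertext = f.encrypt(b"contents of a stored file")
    plaintext = f.decrypt(ciphertext)      # only possible with the password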
*Quantization: instead of storing a parameter at full precision, like 1.893267893789, you store a lower-precision version, like 1.893 – in practice this means fewer bits per number rather than fewer decimal digits. It’s less precise but pretty good.
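For the curious, here’s a toy version of the idea – a sketch of simple symmetric integer rounding, not Llama 2’s actual quantization scheme:

    import numpy as np

    # Original full-precision weights.
    w = np.array([1.893267893789, -0.421, 0.07], dtype=np.float32)

    # Map each weight to a small signed integer (-8..7 fits in 4 bits).
    scale = np.abs(w).max() / 7
    q = np.round(w / scale).astype(np.int8)   # what actually gets stored

    # At inference time, approximate the original weights from the integers.
    w_approx = q.astype(np.float32) * scale
    print(q, w_approx)                        # less precise, but pretty good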