I don’t know if generative models are ever going to get trained on sensitive internal data, but it’s a pretty safe bet someone’s going to try it (if not already). This raises a new set of problems that modern IT hasn’t ever had to grapple with.
Databases, even multi-model ones like MarkLogic, store particular pieces of data at particular locations. It sounds obvious when stated that way, but it’s the bedrock of anything based on an “index”. And when particular data lives at a particular index, it’s pretty straightforward to track or limit access to it. Think Access Control Lists or Role-Based Access Control.
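To make that concrete, here’s a minimal sketch of what “straightforward” looks like when data has an address. The document IDs and roles are made up for illustration; the point is that access control reduces to a simple lookup.

```python
# Hypothetical ACL: each addressable record lists the roles allowed to read it.
ACL = {
    "doc://finance/q3-forecast": {"finance", "executive"},
    "doc://hr/salaries": {"hr"},
}

def can_read(user_roles: set, doc_id: str) -> bool:
    """Allow access only if the user holds at least one role on the record's ACL."""
    return bool(user_roles & ACL.get(doc_id, set()))

print(can_read({"finance"}, "doc://finance/q3-forecast"))      # True
print(can_read({"engineering"}, "doc://hr/salaries"))          # False
```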
But how do things get stored inside a generative AI model? Much more difficult to say. If you looked inside one, you’d see huge matrices of…floats? Or maybe quantized ints? Details vary, but one thing you won’t find is particular pieces of information at particular addresses.
Much like our own brains. (Do you have one particular neuron that lets you recognize your grandmother? Seems unlikely it’s that simple.)
No indexes. No ACLs. No RBAC.
Imagine for a moment that you’ve been inducted into your administrative state’s Security Clearance program and briefed on, say, the imminent declaration of war by Rufus T. Firefly of the reclusive nation of Freedonia. To preserve these secrets, there’s no binary decision point. You’d have to think about every statement you make…including who you’re talking to and who else might be listening. And there’s a huge gray area about what might or might not be permissible to say. How much can you paraphrase without violating the rules? Could you mention that countries are preparing for war without naming names? Many things are judgment calls. It’s possible to be tricked. It’s possible for someone to become a ‘useful idiot,’ helping adversaries without realizing it.
One possible solution is that truly important data simply won’t get used inside models, at least until we understand them a LOT better. There may be an ongoing role for more traditional databases and their security structures, with AI models making traditional database requests on an as-needed basis. But this seems like it could substantially limit the usefulness and economics of such models.
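A rough sketch of that pattern, under my own assumptions: sensitive records stay in a conventional store with its own access checks, and only what the caller is already entitled to see ever reaches the model’s context. Everything here (the in-memory store, the stand-in LLM call) is a hypothetical placeholder, not any particular product’s API.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    id: str
    text: str
    roles: set = field(default_factory=set)  # roles allowed to read this record

# Stand-in for a traditional database that keeps its existing security model.
STORE = [
    Record("doc://finance/q3-forecast", "Q3 revenue is projected to rise...", {"finance"}),
    Record("doc://hr/salaries", "Salary bands by level are...", {"hr"}),
]

def search_store(question: str) -> list:
    # Placeholder for a real database query or vector search.
    words = question.lower().split()
    return [r for r in STORE if any(w in r.text.lower() for w in words)]

def call_llm(prompt: str) -> str:
    # Placeholder for a real model call; the model itself holds no secrets.
    return f"[answer based on {len(prompt)} chars of permitted context]"

def answer(question: str, user_roles: set) -> str:
    candidates = search_store(question)
    # Enforce the existing ACL/RBAC rules *before* anything enters the prompt.
    permitted = [r for r in candidates if user_roles & r.roles]
    context = "\n\n".join(r.text for r in permitted)
    return call_llm(f"Context:\n{context}\n\nQuestion: {question}")

print(answer("what is the q3 revenue projection?", {"finance"}))
```

The security-relevant property is that the filtering happens outside the model, where the old tools still work; the trade-off is exactly the one described above.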
To the extent sensitive data gets incorporated in models, we’ll need to rethink security practically from the ground up. Not too many people are talking about this yet. But as more and more use cases revolve around using LLMs with proprietary data, they will have to, soon.
What do you think?
Originally posted on LinkedIn. 100% free-range human written.