Unlocking the secrets of AI: LlamaIndex under the hood

How does this thing really work?

The vast majority of tutorials show how to do cool stuff with this great library, but don’t even mention what’s actually happening. What is the LLM actually seeing? What are the intermediate steps? How does it do query planning?

Take the basic tutorial listed as a starting point in the docs: https://gpt-index.readthedocs.io/en/latest/getting_started/starter_example.html

This simple code 1) reads in a text file, a Paul Graham essay, using SimpleDirectoryReader, 2) builds a VectorStoreIndex from it, 3) sends the query "What did the author do growing up?" to the index, and finally 4) prints a response.

As an end-to-end use case, this seems useful. But what’s actually happening here? (And how much will it cost?) The source essay is a bit under 14K words, so it’s too long to put entirely into a prompt with current APIs, though the thresholds keep getting larger.

Here’s what happens: inserting the document into the VectorStoreIndex breaks it into about 20 chunks, and sends each one, in full, through an embedding encoder. The resulting vector embeddings are stored, not surprisingly, in a VectorStore.
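To make the chunking step concrete, here is a minimal sketch. It is not LlamaIndex's actual splitter: the function name chunk_words, the word-based sizing, and the 750/50 numbers are all assumptions I picked so that a 14,000-word stand-in document lands on roughly the chunk count described above. The real library measures chunks in tokens and has its own defaults.

```python
def chunk_words(text, chunk_size=750, overlap=50):
    """Split text into chunks of at most chunk_size words.

    Each chunk repeats the last `overlap` words of the previous one.
    Both numbers are illustrative assumptions, not library defaults.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the document.
        if start + chunk_size >= len(words):
            break
    return chunks


# A stand-in "essay" of 14,000 distinct words (hypothetical data).
essay = " ".join(f"word{i}" for i in range(14_000))
chunks = chunk_words(essay)
print(len(chunks), "chunks")
```

Each of these chunks would then go through the embedding endpoint once, which is why the embedding token count ends up a little higher than the document's own token count: the overlapping words get embedded twice.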

The query part of the code sends the query itself through the embedding endpoint, and uses the result to query the local vector store, finding the top 2 matching nodes (LlamaIndex’s name for the chunks). The text from these two most relevant chunks is inserted into a prompt string of the form:

"Context information is below:
----------
{ here goes the two selected chunks }
----------
Given the context information and not prior knowledge, answer the question: What did the author do growing up?"

And the LLM processes it and responds as we’ve seen.
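The retrieve-then-prompt flow can be sketched in a few lines. Everything here is a stand-in: toy_embed is a bag-of-words counter rather than a real embedding model, the three chunks are made-up sentences, and top_k and build_prompt are hypothetical helper names. Only the prompt wording is taken from the template shown above.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding endpoint: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, store, k=2):
    """Return the k chunk texts whose vectors are closest to the query."""
    q = toy_embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query, contexts):
    """Fill the selected chunks and the question into the prompt template."""
    context_block = "\n\n".join(contexts)
    return (
        "Context information is below:\n"
        "----------\n"
        f"{context_block}\n"
        "----------\n"
        "Given the context information and not prior knowledge, "
        f"answer the question: {query}"
    )


# Made-up chunks standing in for pieces of the essay.
chunks = [
    "Growing up the author wrote short stories and programmed.",
    "Later the author started a company with a friend.",
    "As a kid the author did a lot of writing and programming growing up.",
]
store = [(c, toy_embed(c)) for c in chunks]  # the "vector store"

query = "What did the author do growing up?"
selected = top_k(query, store)
prompt = build_prompt(query, selected)
print(prompt)
```

The point to notice is that the LLM never sees the whole essay, only the two chunks that scored highest against the query, wrapped in the template.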

This ended up using a bit more than 17K embedding tokens (plus 8 more for the query), but these are pretty cheap: far less than a penny for the whole shebang with OpenAI’s text-embedding-ada-002. (At this price point, is it even worth keeping the embeddings around? I’ll have to say more about that later.)

The query portion used about 1900 tokens, mostly in the prompt, which costs less than 4 cents with OpenAI’s instruct model text-davinci-003.

Not too bad for a demo! But one needs to be cautious scaling it up, either in the size of the corpus or number of queries needed.
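Here is the back-of-the-envelope arithmetic as a small sketch. The per-1K-token prices are assumptions on my part (rates of this order were listed for these models, but check current pricing), the token counts are the rounded figures quoted above, and the 1,000-query projection is purely illustrative.

```python
# Assumed list prices in USD per 1,000 tokens; verify before relying on them.
EMBEDDING_PRICE_PER_1K = 0.0004   # assumed rate for text-embedding-ada-002
COMPLETION_PRICE_PER_1K = 0.02    # assumed rate for text-davinci-003


def cost_usd(tokens, price_per_1k):
    """Cost of a given number of tokens at a per-1K-token price."""
    return tokens / 1000 * price_per_1k


# Rounded token counts from the run described above.
index_cost = cost_usd(17_000 + 8, EMBEDDING_PRICE_PER_1K)  # document + query embedding
query_cost = cost_usd(1_900, COMPLETION_PRICE_PER_1K)      # prompt + response

print(f"embedding: ${index_cost:.4f}")
print(f"one query: ${query_cost:.4f}")

# Scaling up: the index is built once, but every query pays again.
many_queries_cost = 1_000 * query_cost
print(f"1,000 queries: ${many_queries_cost:.2f}")
```

Under these assumptions the one-time embedding cost is noise, and the per-query LLM call is what adds up as the number of queries grows.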

Are there other parts of this API, or some other API, for which you’d like to get a peek under the hood? Let me know!

Originally posted on LinkedIn. 100% free-range human written.