jeeeb 5 days ago

Wouldn’t it be the other way around?

If the instructions are at the top, the KV cache entries can be precomputed and cached.

If they’re at the bottom, the entries at the lower layers will have a dependency on the user input.

a2128 5 days ago

It's placing instructions AND user query at top and bottom. So if you have a prompt like this:

    [Long system instructions - 200 tokens]
    [Very long document for reference - 5000 tokens]
    [User query - 32 tokens]
The key-values for the first 5200 tokens can be cached, and it's efficient to swap out the user query for a different one: you only need to prefill 32 tokens and generate output.

But the recommendation is to use this layout, where you can only cache the first 200 tokens and need to prefill 5264 tokens every time the user submits a new query:

    [Long system instructions - 200 tokens]
    [User query - 32 tokens]
    [Very long document for reference - 5000 tokens]
    [Long system instructions - 200 tokens]
    [User query - 32 tokens]
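A rough way to see those numbers is to note that a prefix cache can only reuse the longest common prefix between the cached prompt and the new one. Here's a toy Python sketch (the `build_prompt` helper and fake token tuples are just stand-ins for real tokenization, using the segment sizes from the example above):

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def build_prompt(layout, query):
    # Fake "tokens" as (segment, index) tuples; sizes match the example.
    segments = {
        "sys": [("sys", i) for i in range(200)],    # system instructions
        "doc": [("doc", i) for i in range(5000)],   # reference document
    }
    tokens = []
    for part in layout:
        # Any unknown part is treated as a 32-token user query.
        tokens += segments.get(part, [(query, i) for i in range(32)])
    return tokens

# Layout A: query last -> the 5200-token prefix is identical across queries.
a1 = build_prompt(["sys", "doc", "query"], "q1")
a2 = build_prompt(["sys", "doc", "query"], "q2")

# Layout B: query right after the instructions (and repeated at the end).
b1 = build_prompt(["sys", "query", "doc", "sys", "query"], "q1")
b2 = build_prompt(["sys", "query", "doc", "sys", "query"], "q2")

print(common_prefix_len(a1, a2))                    # 5200 tokens reusable
print(common_prefix_len(b1, b2))                    # only 200 reusable
print(len(b2) - common_prefix_len(b1, b2))          # 5264 tokens to prefill
```

Swapping in a third query behaves the same way: layout A always pays a 32-token prefill, layout B always pays 5264.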

jeeeb 5 days ago

Ahh I see. Thank you for the explanation. I didn’t realise there was user input straight after the system prompt.