CharlieDigital 5 days ago

Are people using 32k-token context embeddings and no longer chunking?

It feels like embedding content that large -- especially in dense texts -- will lead to loss of fidelity/signal in the output vector.
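
For reference, by chunking I mean something like the sketch below (the model name, chunk size, and overlap are placeholders, not recommendations):

    from sentence_transformers import SentenceTransformer

    def chunk(text, size=2000, overlap=200):
        # Naive character windows; real pipelines usually split on tokens or sentences.
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    model = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder model
    doc = " ".join(["Dense technical prose."] * 2000) # stand-in for a long document
    chunk_vecs = model.encode(chunk(doc))             # one vector per chunk
    whole_vec = model.encode(doc)                     # the no-chunking alternative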

SparkyMcUnicorn 5 days ago

My understanding is that long-context models can create embeddings that are much better at capturing the overall meaning of a document, but are less effective (without chunking) for documents that consist of short, standalone sentences.

For example, "The configuration mentioned above is critical" now "knows" what configuration is being referenced, along with which project and anything else talked about in the document.

mmstroik 1 day ago

When you say long-context models are less effective for documents that consist of short sentences, do you mean that embedding models with long-context capabilities tend to be worse at short sentences in general, or just that _using_ their large context windows is less effective for docs with short sentences?

pilotneko 4 days ago

It is common to use long-context embedding models as feature extractors for classification models.
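
E.g. embed each document once and fit a lightweight head on the frozen vectors (sketch; the model name, docs, and labels are all made up):

    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder; swap in a long-context model
    docs = ["quarterly earnings beat expectations",
            "patch notes for the 2.3 release",
            "revenue guidance lowered for Q4",
            "bugfix: null pointer in parser"]
    labels = [0, 1, 0, 1]  # 0 = finance, 1 = engineering

    X = model.encode(docs)                     # one fixed-size vector per document
    clf = LogisticRegression().fit(X, labels)  # lightweight head on frozen embeddings
    print(clf.predict(model.encode(["hotfix for crash on startup"])))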