Hi Luis, thank you for looking into this.
To answer your question first: I don't upload files directly. I parse long texts or email exchanges with docling and paste the raw text content directly into the chat.
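For reference, this is roughly how I get the text out (a minimal sketch of my setup; the file name and the Markdown export are just examples):

```python
from docling.document_converter import DocumentConverter

# Convert a long document (e.g. an exported email thread) to text,
# which I then paste into the Assistant chat in full.
converter = DocumentConverter()
result = converter.convert("email_thread.pdf")  # example path
print(result.document.export_to_markdown())
```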
Regarding the RULER benchmark: I didn't know about it, but it looks very interesting. Even from the outdated results in its README, I'd conclude that the newer Gemini models at least should be fine for 200k+ tokens.
I usually use Gemini 2.5 Pro, and now 3 Preview, with contexts of up to 200k tokens in the API without any issues. I've read on Discord that you trigger summarization at around 32k tokens? If so, that would explain a lot, because my conversations reach that many tokens within the first three messages.
To tell you more about my workflow and why this implementation is so destructive:
I put so much context into the chat for two reasons. First, to prime the LLM: it lets me work closely with the general topics through broad queries. Second, it lets me discuss specific details from anywhere in the conversation later by pointing the model's attention to them with targeted queries.
That's why my reaction to your change was so strong:
While discussing general topics was still largely possible (the summary kept the gist), the specific details kept coming back inaccurate, whereas in the API they remain accurate at similar context sizes.
Because of the "Avoid Repetition" and "Maintain Continuity" rules in your system prompt, the model tried to hide that it had lost the details. This creates a hallucination loop: the summary deleted the data, but your prompt forbids the model from asking me for it again or from admitting that it was summarized away.
So I became suspicious and asked it, "Do you still have access to what I sent?", and to my surprise it didn't anymore, which it was hesitant to admit. This essentially forces the model to gaslight me to avoid breaking your system rules.
I usually read the Kagi changelog, but I must have missed this, and I didn't expect something like this not to have a prominent place in the documentation, especially since wording such as "access to the raw model" was used when the Assistant was introduced. I now understand that you implemented it to cater to users who keep one thread going forever, or to save costs, but this approach essentially castrates the Assistant for the workflow I just described. If it wasn't introduced together with Kagi Assistant itself, that would make sense, as I didn't have these issues in the first months, which made this discovery all the more frustrating.
I tried semantic search approaches as well, but context stuffing is the only thing that really works for me for keeping both the small details and the general topics accurate.
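To be clear, by "context stuffing" I just mean that every request carries the full parsed text; roughly like this (a minimal sketch; `parsed_text` stands for the docling output and the prompt layout is only an example):

```python
def build_prompt(parsed_text: str, question: str) -> str:
    # Every query carries the entire parsed source text, so the model can answer
    # both broad-topic questions and small-detail questions from the same context.
    return f"{parsed_text}\n\n---\n\nQuestion: {question}"
```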
I hope you can see how, for my use case, this essentially makes the Assistant unusable. Please document this behavior clearly, and please add a slash command or setting to disable it. I would simply be glad if I could toggle it off and use the full context window I expect.