Hi Luis, thank you for looking into this.
To answer your question first: I don't upload files directly. I parse long texts or email exchanges with docling and paste the raw text content directly into the chat.
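For reference, this is roughly how I get the text out (a minimal sketch of my setup; the file name and the Markdown export are just examples):

```python
from docling.document_converter import DocumentConverter

# Convert a long document (e.g. an exported email thread) to text,
# which I then paste into the Assistant chat in full.
converter = DocumentConverter()
result = converter.convert("email_thread.pdf")  # example path
print(result.document.export_to_markdown())
```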
Regarding the RULER benchmark: I didn't know about it, but it looks very interesting. Even from the outdated results in its README, I'd conclude that the newer Gemini models at least should be fine for 200k+ tokens.
I usually use Gemini 2.5 Pro, and now 3 Preview, with contexts of up to 200k tokens in the API without any issues. I've read on Discord that you trigger summarization at around 32k tokens? If so, that would explain a lot, because my conversations reach that many tokens within the first three messages.
To tell you more about my workflow and why this implementation is so destructive:
I put so much context into the chat for two reasons. First, to prime the LLM: it lets me work closely with the general topics through broad queries. Second, it lets me discuss specific details from anywhere in the conversation later by pointing the model's attention to them with targeted queries.
That's why my reaction to your change was so strong:
While discussing general topics was still largely possible (the summary kept the gist), the specific details kept coming back inaccurate, whereas in the API they remain accurate at similar context sizes.
Because of the "Avoid Repetition" and "Maintain Continuity" rules in your system prompt, the model tried to hide that it had lost the details. This creates a hallucination loop: the summary deleted the data, but your prompt forbids the model from asking me for it again or from admitting that it was summarized away.
So I became suspicious and asked it, "Do you still have access to what I sent?", and to my surprise it didn't anymore, which it was hesitant to admit. This essentially forces the model to gaslight me to avoid breaking your system rules.
I usually read the Kagi changelog, but I must have missed this, and I didn't expect something like this not to have a prominent place in the documentation, especially since wording such as "access to the raw model" was used when the Assistant was introduced. I now understand that you implemented it to cater to users who keep one thread going forever, or to save costs, but this approach essentially castrates the Assistant for the workflow I just described. If it wasn't introduced together with Kagi Assistant itself, that would make sense, as I didn't have these issues in the first months, which made this discovery all the more frustrating.
I tried semantic search approaches as well, but context stuffing is the only thing that really works for me for keeping both the small details and the general topics accurate.
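To be clear, by "context stuffing" I just mean that every request carries the full parsed text; roughly like this (a minimal sketch; `parsed_text` stands for the docling output and the prompt layout is only an example):

```python
def build_prompt(parsed_text: str, question: str) -> str:
    # Every query carries the entire parsed source text, so the model can answer
    # both broad-topic questions and small-detail questions from the same context.
    return f"{parsed_text}\n\n---\n\nQuestion: {question}"
```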
I hope you can see how, for my use case, this essentially makes the Assistant unusable. Please document this behavior clearly, and please add a slash command or setting to disable it. I would simply be glad if I could toggle it off and use the full context window I expect.