On Discord, there has been discussion about token usage in Assistant, and how some (heavy) Assistant users have received an email asking them to slow down a bit. From those discussions, it seems it's not obvious what "tokens" are, or that the whole chat thread is re-sent every time you click send.
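To make that concrete, here is a minimal Python sketch of why long threads get expensive; `send` is a hypothetical stand-in for the real API call, not Assistant's actual client:

```python
history = []

def send(messages):
    # Stand-in for the real LLM request: a real client would POST
    # the full `messages` list to the model endpoint.
    return f"(reply based on {len(messages)} messages of context)"

def ask(user_prompt):
    history.append({"role": "user", "content": user_prompt})
    # The WHOLE history goes over the wire on every send,
    # so the token cost grows with each exchange.
    reply = send(history)
    history.append({"role": "assistant", "content": reply})
    return reply
```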
Someone suggested adding checkboxes beside each prompt/reply, letting users manually select what to include the next time they press send. I think this is a genius idea, not only to decrease the tokens sent, but possibly even to increase the quality of the response from the LLM.
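As a rough sketch of the mechanism (the function and its parameters are my own invention, not Assistant's data model), the checkboxes would simply filter which messages make it into the next request:

```python
def build_payload(history, checked):
    # Only the checked messages are re-sent to the model; unchecked ones
    # stay visible in the UI but are dropped from the request.
    return [msg for i, msg in enumerate(history) if i in checked]

# Reusing `history` from the sketch above: keep the first prompt and the
# best reply, skip the detour in between.
payload = build_payload(history, checked={0, 3})
```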
For me, threads most often grow long when I'm not satisfied with the result, or when the model takes a detour and I have to steer it back to the subject I'm interested in. But since the whole thread is included in each iteration, it becomes harder and harder to keep the responses on track. When using ChatGPT/Claude, I sometimes copy and paste part of a chat thread into a new chat and start from scratch, with the best parts of the previous conversation as the first prompt.
So, my suggestion is this:
- Consider adding a running token counter next to the input box, and perhaps also let it gradually shift color from green to yellow to red as you approach some kind of soft/hard limit on how many tokens the user is allowed to send (see the sketch after this list).
- When the token count hits the limit, show a modal window explaining what has happened and require the user to select which parts of the chat to include from that point onward.
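For the counter itself, here is a hedged sketch using the tiktoken library; the soft/hard limits are invented placeholders, and the real ones would come from the service's quota:

```python
import tiktoken

# Hypothetical limits; the actual numbers would be set by the service.
SOFT_LIMIT = 8_000
HARD_LIMIT = 16_000

# Assumes an OpenAI-style tokenizer; other models count tokens differently.
enc = tiktoken.get_encoding("cl100k_base")

def counter_color(thread_text: str) -> tuple[int, str]:
    """Return the token count and the indicator color for the input-box counter."""
    tokens = len(enc.encode(thread_text))
    if tokens < SOFT_LIMIT:
        return tokens, "green"
    if tokens < HARD_LIMIT:
        return tokens, "yellow"
    return tokens, "red"
```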
That would address the economic part of this.
But I think it would also be useful to educate users that the checkboxes are there, since they make it possible to improve the quality of the chat by excluding parts that are no longer relevant. So perhaps the checkboxes should be present from the start, all selected by default, and the modal window should instruct the user to deselect the low-quality parts.