
I uploaded a 272-page PDF and started asking questions about it. Unlike with shorter documents, even the initial summary was wrong; it did correct itself after I pointed it out:

Answers to my questions about the document stayed pretty vague, though, or contained outright mistakes (e.g., claiming there were no programming examples in the book). And when I asked it what further books to read on the subject, it told me it could not answer because the book does not mention any sources, even though the book contains references.

Both the examples and the references appear quite late in the book, so my theory is that the book silently crossed the token limit of whatever LLM was in use, so the model never even saw the pages that would have helped it answer my questions.

So the suggestion: if whatever you upload crosses the token limit, show an estimate of what percentage / how many pages of the document were actually used, and maybe suggest uploading only smaller parts.
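To illustrate what I mean, here is a rough sketch (not Kagi's actual implementation, and the 4-characters-per-token heuristic is just a common approximation for English text) of how such a coverage estimate could be computed:

```python
# Hypothetical sketch: estimate how much of a document fits into a model's
# context window, using the rough heuristic of ~4 characters per token.

CHARS_PER_TOKEN = 4  # rough average for English prose

def coverage_estimate(page_texts, token_limit):
    """Return (pages_used, percent_used) before the token budget runs out."""
    budget = token_limit * CHARS_PER_TOKEN
    used_chars = 0
    pages_used = 0
    for text in page_texts:
        if used_chars + len(text) > budget:
            break
        used_chars += len(text)
        pages_used += 1
    percent = 100 * pages_used / len(page_texts) if page_texts else 0.0
    return pages_used, percent

# Example: 272 pages of ~2000 characters each against a 128k-token window
pages = ["x" * 2000] * 272
used, pct = coverage_estimate(pages, 128_000)
print(used, round(pct, 1))  # → 256 94.1
```

Under these (made-up) numbers the assistant would see only about 94% of the book, which is exactly the kind of figure that would be useful to surface in the UI.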

thread: K0v5HQOnOZCOqNs8udvj5IvS0TGV3eCP

  • Vlad replied to this.

warpspin It would be great if you provided exact steps to reproduce, including the document in question. In general, we cannot fix problems without the ability to reproduce them 🙂

Vlad Ah, I thought you could access it through the thread ID I gave. I'm not sure how to pass the file on to you for reproduction without uploading it here, as I probably may not redistribute it publicly?

Ok, just found an existing copy on the internet at http://worrydream.com/refs/Griswold-TheSnobolProgrammingLanguage.pdf - seems it may be reprinted now.

I repeated the experiment by giving just the URL to the assistant, but it was less helpful than when I uploaded the file directly:


It pretty much ignores the whole text of the book itself, probably because this version of the text is not OCRed. So maybe that's another suggestion: if substantial parts of a PDF lack OCR text, show a warning :-)
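Such a warning could be based on a simple check. A hypothetical sketch (the threshold values are made up; in practice `page_texts` would come from a PDF library's text extraction, e.g. pypdf's `page.extract_text()`):

```python
# Hypothetical sketch of the suggested warning: flag a PDF as likely
# missing OCR when most of its pages yield almost no extractable text.

MIN_CHARS_PER_PAGE = 50  # below this, a page is probably a scanned image

def needs_ocr_warning(page_texts, threshold=0.5):
    """Warn when more than `threshold` of the pages have almost no text."""
    if not page_texts:
        return False
    near_empty = sum(
        1 for t in page_texts if len(t.strip()) < MIN_CHARS_PER_PAGE
    )
    return near_empty / len(page_texts) > threshold
```

A scanned book like the copy linked above would yield near-empty text on almost every page and trip this check immediately.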

But I just rechecked my copy of the book, and it includes the right to redistribute it, so I'll just attach it here:

        snobol-4.pdf
        3MB

I tried to repeat my original session with my OCRed copy. It was more intelligent this time, most probably because I switched the default in Kagi's settings from "Last used" assistant to "Expert", and so did not fall into the trap again where uploads currently reset the model to a different one (my original experiment was most probably running on "Research/Fast", as the model reset during the upload).

It was again unable to provide the references from the book, though, so I'm unsure whether it even saw them, or whether it did see them and the problem is just a general limitation of LLMs.

It also still exhibits the problem where it focuses entirely on the ads on the last page for its initial summary; only after being told that that's not what the book is about does it provide a proper summary.
