Re: Kagi Assistant's handling of PDF inputs:
Using Gemini's document processing instead of sending the PDF's raw extracted text results in superior document analysis. This includes more intelligent interpretation of the text and native understanding of visual elements like graphs and charts.
See here: https://ai.google.dev/gemini-api/docs/document-processing
I suggest changing how PDFs are handled with Gemini models: send the document as base64-encoded inline data rather than as extracted text. Gemini processes PDFs natively at a fixed cost of 258 tokens per page.
With this approach no OCR is needed, since Gemini interprets each page natively as an image.
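To illustrate, here is a minimal sketch of the inline-data approach using the google-genai Python SDK, following the linked documentation. The model name, file path, and prompt are placeholders; the SDK handles the base64 encoding of the inline data itself.

```python
from google import genai
from google.genai import types

# Reads the API key from the environment (GEMINI_API_KEY / GOOGLE_API_KEY).
client = genai.Client()

# Read the PDF as raw bytes; the SDK encodes it as inline data for the request.
with open("who-statistics-page14.pdf", "rb") as f:  # placeholder file name
    pdf_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-pro",  # placeholder model name
    contents=[
        # Send the PDF natively instead of pre-extracted text.
        types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf"),
        "Describe the figure on this page, including chart type and colours.",
    ],
)
print(response.text)
```

With this request shape, Gemini receives the rendered pages (text and visuals together) rather than a lossy text dump, which is what enables the figure interpretation shown below.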
Here is Gemini 2.5 Pro via Kagi interpreting a PDF:
https://kagi.com/assistant/37beff5a-400e-4b33-b55a-56cb6a653495
It is unable to correctly identify the type of figure or the colours used, and it hallucinates details that are not in the document.
Via AI Studio, Gemini 2.5 Pro produces the following response:
"Bars are grouped by year. Within each year, there are two bars: Male (left) and Female (right).
Each bar is stacked:
The darker portion (green for males, orange for females) represents Healthy life expectancy at birth (HALE).
The total height of the bar (including the lighter portion on top) represents the total Life expectancy at birth."
For reference, the PDF used is page 14, cleanly split from this 96-page document: https://iris.who.int/bitstream/handle/10665/376869/9789240094703-eng.pdf#page=14
(For further comparison, taking a screenshot of that page and 'printing as PDF', which leaves no extractable text layer, also demonstrates Gemini's superior native handling.)