Universal Summarizer: Reliably extract file types to support more websites

tauon

Disclaimer: This may seem like (and probably is) a niche case if you only consider the minimal reproducible example given here, but it's easy to find real-world cases like the one described in this issue.

Description: The Summarizer fails to produce a result for a PDF when given a URL that ends in file.pdff (or presumably any other unsupported ending, though I haven't tested this). When the supplied URL is changed to end in file.pdf, it works.

Why this is a problem: The file name, including its extension, is not a reliable indicator of file type and should not be used to rule out that the supplied URL can be summarized. Browsers (including Orion) still display the PDF in the regular viewer regardless of the file's name, maybe* by reading the MIME type from the response headers?
As shown in the first link above, some sites such as cloud file services hide file types/extensions, which means the summarizer needlessly fails even on publicly available files.

Minimal reproduction: can-be-summarized.pdf link 1 vs. cannot-be-summarized.pdff link 2.
Note how both files are displayed and also otherwise work perfectly fine in a browser (tested in Orion and Chromium).

What I find to be the interesting "proof" that there is unintended behavior somewhere is a third example file, uuid-F1BA92C8-835D-4E59-8265-97B928D91CF9, which has no file extension at all: link 3. Unlike the .pdff example, this one works in the summarizer, even though it is the identical file. However, it produces a noticeably different summary than the .pdf link!

*Regarding this, I briefly checked (curl --head), and of my example files on a "default" hosted web server, only the .pdf response contained a content-type: application/pdf header. So I'm not quite sure how the browsers do it, but I also know little to nothing about serving files on the web. Maybe they parse the beginning of a file when they notice it's binary, to check for PDF magic bytes? Then again, Kagi's summarizer can already deal with the PDF file that has no file extension at all, so is it perhaps doing that already?
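
To make the idea concrete, here is a minimal Python sketch of the detection order speculated about above: trust an explicit Content-Type header first, then sniff the first bytes of the body for the PDF signature, and consult the file extension only as a last hint. This is an illustration under assumptions, not how Kagi's summarizer or any browser actually does it; the function names looks_like_pdf and guess_is_pdf are hypothetical, and the 1024-byte search window is a deliberately lenient choice.

```python
from typing import Optional

# Every PDF starts with this signature, e.g. b"%PDF-1.7".
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(head: bytes) -> bool:
    """Return True if the first bytes of a response body carry the PDF signature."""
    # Lenient assumption: accept the signature anywhere in the first 1024 bytes,
    # since some files have a little junk before the header.
    return PDF_MAGIC in head[:1024]


def guess_is_pdf(url: str, content_type: Optional[str], head: bytes) -> bool:
    """Hypothetical helper: decide whether a fetched resource is a PDF.

    url          -- the requested URL (used only as a last-resort hint)
    content_type -- value of the Content-Type response header, if any
    head         -- the first bytes of the response body
    """
    # 1. An explicit, specific Content-Type is the strongest signal.
    if content_type is not None:
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type == "application/pdf":
            return True

    # 2. Missing or generic header: look at the content itself.
    if looks_like_pdf(head):
        return True

    # 3. The extension may suggest a PDF, but an unknown or missing
    #    extension is never a reason to rule one out.
    path = url.split("?", 1)[0].split("#", 1)[0]
    return path.lower().endswith(".pdf")
```

With this ordering, a URL ending in file.pdff, or one with no extension at all, is still recognized as a PDF as long as the body starts with %PDF-, for example guess_is_pdf("https://example.com/file.pdff", None, b"%PDF-1.7\n").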

tauon

Meta-question: How do I best tag issues that aren't directly related to the "main" search? I chose Search Quality and AI for now, feel free to re-tag/change.

Luis

Meta-question: How do I best tag issues that aren't directly related to the "main" search? I chose Search Quality and AI for now, feel free to re-tag/change.

We are currently defining a new tagging system for Kagifeedback, so tagging Kagi products will be more intuitive in the near future.