Disclaimer: This may seem like (and probably is) a niche case if you only consider the minimal reproducible example I'm giving here, but it's easy to come up with a real-world example of the problem described in this issue.
Description: The summarizer fails to produce a result for a PDF when given a URL that ends in file.pdff (or presumably any other unsupported ending, but I haven't tested this); when the supplied URL is changed to file.pdf, it works.
Why this is a problem: The file name, including its extension, is not a reliable indicator of file type and should not be used to rule out that the supplied URL can be summarized. Browsers (including Orion) still display the PDF in the regular viewer regardless of the file's name, maybe* by reading the MIME type from the response headers?
As shown in the first link above, some sites, such as cloud file services, hide file types/extensions, which means the summarizer needlessly fails even on publicly available files.
Minimal reproduction: can-be-summarized.pdf (link 1) vs. cannot-be-summarized.pdff (link 2).
Note how both files are displayed and also otherwise work perfectly fine in a browser (tested in Orion and Chromium).
The IMO interesting "proof" that there is unintended behavior somewhere here is a third example file, uuid-F1BA92C8-835D-4E59-8265-97B928D91CF9, which has no file extension at all: link 3. Although it is still the same identical file, this one works in the summarizer, unlike the .pdff example. However, it produces a noticeably different summary!
*Regarding this, I briefly checked (curl --head), and of my example files on a "default" hosted web server, only the .pdf one returned a content-type: application/pdf header. So I'm not quite sure how the browsers do it, but I also know little to nothing about serving files on the web. Maybe they parse the beginning of a file when they notice it's binary to check for PDF magic bytes?
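To make the magic-bytes idea concrete, here is a rough Python sketch of a check that ignores the URL entirely. It is only an illustration of the approach, not how the summarizer or any browser actually does it; the function name looks_like_pdf and its parameters are made up for this example. It relies on the fact that PDF files start with the bytes "%PDF-".

```python
from typing import Optional

# PDF files begin with this signature, followed by a version such as "1.7".
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(head: bytes, content_type: Optional[str] = None) -> bool:
    """Guess whether a response is a PDF without looking at the URL.

    head: the first bytes of the response body (hypothetical input).
    content_type: the raw Content-Type header value, if the server sent one.
    """
    # Trust an explicit PDF media type when the server provides it.
    if content_type is not None:
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type == "application/pdf":
            return True
    # Otherwise fall back to sniffing the start of the body.
    return head.startswith(PDF_MAGIC)


# A file served without a useful header is still recognized by its content.
print(looks_like_pdf(b"%PDF-1.7\n...", "application/octet-stream"))
# An HTML page is rejected regardless of what the URL ends in.
print(looks_like_pdf(b"<!doctype html>", "text/html"))
```

With a check like this, all three example files would be treated the same, since they are byte-for-byte identical and only their names differ.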