
Friday, August 29, 2008

PDFs that just contain scanned images

Many websites use the PDF format to store scanned documents. Such PDFs consist of a series of bitmap images; they contain no text and so are not searchable. Until now, to keep the Blossom spider from downloading these image-only files, you had to add them explicitly to the exclude list for an index.

No longer! The indexing process now recognizes PDFs that just contain images, removes them from the index, and instructs the spider not to download the files again. As a result, the page count for a search index will no longer include image-only PDFs.
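
For readers curious how a check like this might work, here is a minimal Python sketch. It is not the Blossom implementation, whose internals are not described here. It assumes a simple heuristic: a PDF that embeds at least one image but declares no font anywhere cannot display text. The names `looks_image_only`, `prune_image_only`, `index`, and `skip_urls` are hypothetical, and the byte-level scan is a rough approximation that a real PDF parser would do more reliably.

```python
import re
import zlib

# Matches the body of each PDF stream object (a heuristic, not a full parser).
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflated_streams(pdf_bytes):
    """Yield the inflated contents of every Flate-compressed stream."""
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example a JPEG image); skip it


def looks_image_only(pdf_bytes):
    """Assumed heuristic: at least one image and no font means no text."""
    chunks = [pdf_bytes] + list(inflated_streams(pdf_bytes))
    has_font = any(b"/Font" in chunk for chunk in chunks)
    has_image = any(re.search(rb"/Subtype\s*/Image", chunk) for chunk in chunks)
    return has_image and not has_font


def prune_image_only(index, skip_urls, url, pdf_bytes):
    """Drop an image-only PDF from the index and mark it so it is not re-fetched.

    `index` is a hypothetical dict of URL -> document, and `skip_urls` is a
    hypothetical set the spider would consult before downloading.
    """
    if looks_image_only(pdf_bytes):
        index.pop(url, None)
        skip_urls.add(url)
        return True
    return False
```

In this sketch the page count is simply `len(index)`, so removing an image-only PDF lowers the count, which matches the behavior described above. A scanned PDF that also carries an OCR text layer declares a font, so the heuristic keeps it in the index.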