Search Engine News: December 2017

Thursday, December 28, 2017

Enhanced treatment of PDF files

We have upgraded the PDF text extraction engine to handle more character encodings. You should see better retention of punctuation and better sentence construction. (Identifying sentences in PDF files can be challenging because successive lines in a paragraph may not be adjacent in the PDF data.)

Depending on how PDF files are generated, they may not have a title. Titles are important to the search engine because the title text is considered highly descriptive of the document. The title is also displayed in the search results presented to your visitors. In the new extractor, if a PDF has no title, we use a heuristic that chooses the first non-common line of text in the document as the title. Non-common text is text that doesn't appear frequently elsewhere on the website. Common text is usually boilerplate, such as the name of an entity.
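To illustrate the idea, here is a minimal sketch of such a heuristic in Python. It is not the extractor's actual code: the function names, the `max_docs` threshold, and the rule of counting how many documents on the site contain a given line are all assumptions made for this example.

```python
from collections import Counter


def normalize(line):
    """Collapse whitespace and lowercase so trivially different lines compare equal."""
    return " ".join(line.split()).lower()


def build_line_counts(documents):
    """Count, for each distinct line, how many documents on the site contain it."""
    counts = Counter()
    for lines in documents:
        # A set ensures each line is counted once per document.
        counts.update({normalize(l) for l in lines if l.strip()})
    return counts


def choose_title(doc_lines, line_counts, max_docs=1):
    """Return the first non-empty line found in at most `max_docs` documents.

    `max_docs` is a hypothetical cutoff for "non-common"; returns None if
    every line in the document is common.
    """
    for line in doc_lines:
        if not line.strip():
            continue
        if line_counts[normalize(line)] <= max_docs:
            return " ".join(line.split())
    return None


# Hypothetical site with three untitled PDFs sharing a boilerplate first line.
documents = [
    ["Acme Corp", "Annual Report 2016", "Revenue grew this year."],
    ["Acme Corp", "Safety Manual", "Wear a helmet."],
    ["Acme Corp", "Price List", "Widgets cost more."],
]
line_counts = build_line_counts(documents)
print(choose_title(documents[0], line_counts))  # prints "Annual Report 2016"
```

In this sketch the entity name "Acme Corp" appears in every document, so it is treated as common and skipped, and the next line is chosen as the title.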

For both PDF and HTML files, we recommend that each document have a descriptive title to help the search engine select the document when relevant and to help your visitors understand what the document contains.

Search Engine Library

We have built a library of documents to help you get the most out of the Blossom Search service:

The Search Guide. An introduction to all the features and options of the search service. Includes many examples.
What Makes a Good Search Engine? A peek into the philosophy behind the Blossom search engine.
Phrasal Query Suggestions. Describes a key aspect of Blossom's approach to guided search.
Fine tuning with page weights and meta tags. Details on how to influence the order of pages in the search results.
