We have upgraded the PDF text extraction engine to handle more character encodings. You should see punctuation preserved more reliably and sentences reconstructed more accurately. (Identifying sentences in PDF files can be challenging because successive lines in a paragraph may not be adjacent in the PDF data.)
Depending on how a PDF file is generated, it may not have a title. Titles are important to the search engine because title text is considered highly descriptive of the document. The title is also displayed in the search results presented to your visitors. In the new extractor, if there is no title, we use a heuristic that chooses the first non-common line of text in the document as the title. Non-common text is text that doesn't appear frequently elsewhere on the website. Common text is usually boilerplate, such as the name of an entity.
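To make the fallback concrete, here is a minimal Python sketch of a "first non-common line" heuristic. It is an illustration only, not the extractor's actual code: the function names, the whitespace and case normalization, and the `max_share` threshold (the fraction of documents a line may appear in before it counts as common) are all assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, List, Optional


def normalize(line: str) -> str:
    """Collapse whitespace and lowercase so near-identical lines compare equal."""
    return " ".join(line.split()).lower()


def build_line_counts(documents: Iterable[List[str]]) -> Counter:
    """Count, for each normalized line, how many documents contain it."""
    counts: Counter = Counter()
    for lines in documents:
        # Use a set so a line repeated within one document is counted once.
        counts.update({normalize(line) for line in lines if normalize(line)})
    return counts


def choose_title(
    lines: List[str],
    line_counts: Counter,
    total_docs: int,
    max_share: float = 0.5,  # hypothetical threshold, not a documented value
) -> Optional[str]:
    """Return the first line that is not common across the site, or None."""
    if total_docs <= 0:
        return None
    for line in lines:
        key = normalize(line)
        if not key:
            continue  # skip blank lines
        if line_counts[key] / total_docs <= max_share:
            return " ".join(line.split())
    return None


site = [
    ["Acme Corp", "Annual Report 2017", "Revenue grew this year."],
    ["Acme Corp", "Privacy Policy", "We respect your privacy."],
    ["Acme Corp", "Contact Us", "Email or call our office."],
]
counts = build_line_counts(site)
title = choose_title(site[0], counts, total_docs=len(site))
print(title)
```

In this sample, "Acme Corp" appears in every document, so it is treated as boilerplate and skipped, and the next line is chosen as the title.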
For both PDF and HTML files, we recommend that each document have a descriptive title to help the search engine select the document when relevant and to help your visitors understand what the document contains.
Thursday, December 28, 2017