Wednesday, October 14, 2009

List of image-only PDF files

Some sites have PDF files that contain no text, often because they were generated from scanned documents. Without text, these files are invisible to the search engine. To avoid downloading image-only PDF files repeatedly, a list of them is kept for each index. You can now download that list from the Search Configuration page: follow the link "Retrieve the list of image-only PDF files being ignored" in the "Search Index Settings" section.

The list is cleared whenever a complete respidering of the index is triggered, such as when you request a non-scheduled index update (by following the "Update an index" link on the Search Configuration page).
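As a rough illustration of the behavior described above, the sketch below models a per-index skip list: image-only PDFs are recorded once, skipped on later updates, and forgotten when a complete respidering starts. This is not Blossom's implementation; the class and method names (ImageOnlyPdfList, record, should_skip, clear, export) are hypothetical and chosen only for this example.

```python
class ImageOnlyPdfList:
    """Hypothetical per-index list of image-only PDF URLs to skip."""

    def __init__(self):
        self._urls = set()

    def record(self, url):
        # Called after a downloaded PDF turns out to contain no text.
        self._urls.add(url)

    def should_skip(self, url):
        # True if the URL is already known to be an image-only PDF.
        return url in self._urls

    def clear(self):
        # Called when a complete respidering of the index is triggered.
        self._urls.clear()

    def export(self):
        # Sorted copy of the list, e.g. for offering it as a download.
        return sorted(self._urls)


skip_list = ImageOnlyPdfList()
skip_list.record("www.domain.com/scans/report.pdf")
print(skip_list.should_skip("www.domain.com/scans/report.pdf"))
skip_list.clear()
print(skip_list.should_skip("www.domain.com/scans/report.pdf"))
```

Clearing the list on a full respider means a PDF that has since been replaced with a text version gets another chance to be indexed.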

Thursday, October 8, 2009

New option for database-driven sites

Some websites have pages generated dynamically from a repository. Often on these sites there are multiple URLs that can generate the same page. Blogs are an example; an article may have multiple labels, each label providing a path to the same text. For example, www.domain.com/july/article1 and www.domain.com/announcements/article1 might both refer to the same article.

A new option has been added to the Blossom spider that tells it to ignore all but the last component of a URL when determining whether two URLs are the same. In the example above, "article1" would therefore be retrieved only once. This reduces the number of duplicate documents downloaded from a site, which saves bandwidth and can lower your page count.
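The comparison can be pictured with a small sketch. It is only an approximation of the idea, not the spider's actual code; the function names last_component and dedupe_by_last_component are made up for this example, and it assumes the "last component" is the final non-empty path segment of the URL.

```python
from urllib.parse import urlsplit


def last_component(url):
    """Return the final non-empty path segment of a URL ('' if none)."""
    # urlsplit only separates host and path when a scheme or '//' is present,
    # so bare URLs such as 'www.domain.com/july/article1' get a '//' prefix.
    if "//" not in url:
        url = "//" + url
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return segments[-1] if segments else ""


def dedupe_by_last_component(urls):
    """Keep the first URL seen for each last component, preserving order."""
    seen = set()
    kept = []
    for url in urls:
        key = last_component(url)
        if key in seen:
            continue  # treated as the same page; not retrieved again
        seen.add(key)
        kept.append(url)
    return kept


urls = [
    "www.domain.com/july/article1",
    "www.domain.com/announcements/article1",
    "www.domain.com/july/article2",
]
print(dedupe_by_last_component(urls))
```

Note the trade-off: two genuinely different pages that happen to share a final component (for example /2008/index and /2009/index) would also be treated as the same page, so the option suits only sites where that name identifies the content.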

Contact Blossom Support if you think your site might benefit from using this option.