Search Engine News: January 2020

If you look over the issues discussed in this blog, you'll see that many have arisen due to website content becoming more dynamic. Static web pages are becoming rarer, making the job of spidering more difficult. As a result, we have begun testing a significant rewrite of the Blossom spider. In addition to handling dynamic pages better, the new spider will offer more flexibility in how sites are traversed. This post will be updated as testing progresses.

If you monitor your web logs, you may notice extra activity from Blossom as we run the new spider alongside the old. You can pick out visits from Blossom by looking at the User-Agent HTTP header. For the production Blossom spider, the agent is Mozilla/5.0 (Blossom); for the new spider it is Mozilla/5.0 (Blossom/Beta).
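If you want to separate the two spiders in your own logs, the following is a minimal sketch in Python. It assumes your server writes the common "combined" access-log format, where the User-Agent is the last quoted field on each line; the function name classify_blossom_visit is hypothetical and only the two agent strings come from this post.

```python
import re

# User-Agent strings quoted in this post.
PRODUCTION_AGENT = "Mozilla/5.0 (Blossom)"
BETA_AGENT = "Mozilla/5.0 (Blossom/Beta)"

# Assumption: combined log format, so the User-Agent is the last
# double-quoted field on the line.
_LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def classify_blossom_visit(log_line):
    """Return 'production', 'beta', or None for one access-log line."""
    match = _LAST_QUOTED_FIELD.search(log_line)
    if match is None:
        return None
    agent = match.group(1)
    if agent == BETA_AGENT:
        return "beta"
    if agent == PRODUCTION_AGENT:
        return "production"
    return None
```

Running this over each line of a log file and counting the results should show how much of the extra traffic comes from the beta spider. If your log format differs, adjust the regular expression to capture the User-Agent field your server actually writes.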

We have begun rolling out the new spider to handle the regular update of indexes. In some instances, the new spider may require changes to the configuration of an index. (We will notify your technical contact via email if we make changes for you.) Here are some of the changes we've seen that can impact the contents of an index:

Stricter handling of redirection URLs. When a request is redirected, either by an HTTP header (e.g. 301 or 302 status code) or by an HTML meta-tag refresh, the redirection URL must satisfy the include/exclude specification for the index.
Stricter adherence to the HTTP status code and content type as reported by the webserver. Documents will only be added to the index if they are delivered with a status code of 200. HTML pages must either have a content type of text/html or begin with a tag that identifies the document as HTML.
Documents are limited to 100MB by default. This will likely affect only PDF files, and usually only PDFs with many images.
Reading of sitemap.xml and robots.txt is the default.
Scanning of URLs in JavaScript strings has been improved.
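
To check how the stricter rules above might apply to your own pages, here is a rough sketch in Python. It is not the spider's implementation. The function names, the use of shell-style wildcard patterns for the include/exclude specification, and the reading of 100MB as 100 * 1024 * 1024 bytes are all assumptions made for illustration. The sketch covers only the content-type half of the HTML rule and leaves out the check on the leading tag.

```python
from fnmatch import fnmatchcase

# Assumption: "100MB" is taken as 100 * 1024 * 1024 bytes.
MAX_DOCUMENT_BYTES = 100 * 1024 * 1024


def url_allowed(url, include, exclude):
    """Hypothetical include/exclude check using shell-style patterns."""
    included = any(fnmatchcase(url, pattern) for pattern in include)
    excluded = any(fnmatchcase(url, pattern) for pattern in exclude)
    return included and not excluded


def declared_as_html(content_type_header):
    """True if the Content-Type header names text/html, ignoring parameters."""
    media_type = content_type_header.split(";")[0].strip().lower()
    return media_type == "text/html"


def should_index(status, size_bytes, final_url, include, exclude):
    """Apply the status, size, and redirect-target rules described above.

    final_url is the URL after any redirects have been followed, so a
    redirect that leaves the include/exclude specification is rejected.
    """
    if status != 200:
        return False
    if size_bytes > MAX_DOCUMENT_BYTES:
        return False
    return url_allowed(final_url, include, exclude)
```

For example, with include = ["https://example.com/*"] and exclude = ["https://example.com/private/*"], a page that redirects to another host would be skipped even though the original URL matched the include pattern.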

Tuesday, January 21, 2020

Major revision of Blossom spider now being deployed
