If you monitor your web logs, you may notice extra activity from Blossom while we run the new spider alongside the old one. You can pick out visits from Blossom by looking at the User-Agent HTTP header. For the production Blossom spider, the agent is Mozilla/5.0 (Blossom); for the new spider it is Mozilla/5.0 (Blossom/Beta).
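As a rough illustration of picking out these visits, here is a minimal Python sketch. It assumes your server writes the common "combined" access-log format, in which the User-Agent is the last quoted field on each line; the function name and the log layout are assumptions for illustration, so adjust them to match your own logs.

```python
import re

# Agent strings as quoted in the announcement above.
PRODUCTION_AGENT = "Mozilla/5.0 (Blossom)"
BETA_AGENT = "Mozilla/5.0 (Blossom/Beta)"

# Assumption: combined log format, where the User-Agent is the
# last double-quoted field on the line.
_LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def classify_blossom_visit(log_line):
    """Return 'production', 'beta', or None for one access-log line."""
    match = _LAST_QUOTED_FIELD.search(log_line)
    if match is None:
        return None
    agent = match.group(1)
    if agent == BETA_AGENT:
        return "beta"
    if agent == PRODUCTION_AGENT:
        return "production"
    return None
```

Counting the two results separately over a day of logs should show how much of the extra traffic comes from the new spider.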
We have begun rolling out the new spider to handle the regular update of indexes. In some instances, the new spider may require changes to the configuration of an index. (We will notify your technical contact via email if we make changes for you.) Here are some of the changes we've seen that can impact the contents of an index:
- Stricter handling of redirection URLs. When a request is redirected, either by an HTTP header (e.g. 301 or 302 status code) or by an HTML meta-tag refresh, the redirection URL must satisfy the include/exclude specification for the index.
- Stricter adherence to the HTTP status code and content type as reported by the web server. Documents will only be added to the index if they are delivered with a status code of 200. HTML pages must either have a content type of text/html or begin with an identifying tag such as <!DOCTYPE> or <html>.
- Documents are limited to 100MB by default. This will likely affect only PDF files, and usually just PDFs with many images.
- sitemap.xml and robots.txt are now read by default.
- Scanning of URLs in JavaScript strings has been improved.
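To check how the stricter rules might affect your own pages, the following Python sketch approximates the first three items in the list. It is not the spider's actual implementation. All function names are hypothetical, the include/exclude specification is modeled as simple glob patterns, the identifying tags are assumed to be `<!DOCTYPE` and `<html`, and "100MB" is read as 100 × 1024 × 1024 bytes.

```python
from fnmatch import fnmatchcase

# Assumption: "100MB" interpreted as 100 * 1024 * 1024 bytes.
MAX_DOCUMENT_BYTES = 100 * 1024 * 1024

# Assumption: leading tags treated as identifying an HTML page.
HTML_MARKERS = (b"<!doctype", b"<html")


def url_allowed(url, include, exclude):
    """True if url matches an include pattern and no exclude pattern.

    Patterns are globs here purely for illustration; the real
    include/exclude syntax of an index may differ.
    """
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    return any(fnmatchcase(url, pattern) for pattern in include)


def looks_like_html(content_type, body):
    """True if the response is declared as, or starts like, HTML."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        return True
    # Only inspect the start of the body for an identifying tag.
    return body[:1024].lstrip().lower().startswith(HTML_MARKERS)


def should_index(final_url, status, body, include, exclude):
    """Apply the redirect, status-code, and size rules to one response.

    final_url is the URL after any redirects have been followed.
    """
    if not url_allowed(final_url, include, exclude):
        return False
    if status != 200:
        return False
    return len(body) <= MAX_DOCUMENT_BYTES
```

For example, a page that redirects to a URL outside the include patterns would be rejected by `should_index` even though the originally requested URL was inside them.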