Thursday, May 24, 2018

Use sitemaps to guide Blossom spider

As websites become more dynamic, some links are generated programmatically rather than specified directly in HTML. While the Blossom spider does search JavaScript for URLs embedded in strings, it does not execute JavaScript. As a result, URLs that are assembled at run time by string operations can be overlooked.
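To illustrate why a static scan can miss such links, here is a minimal Python sketch. It is not Blossom's actual scanner: the regular expression, the function name find_literal_urls, and the sample script are all assumptions made for illustration. It finds only URLs that appear whole inside a quoted string.

```python
import re

# Assumed pattern (not Blossom's): an absolute http(s) URL that sits
# entirely inside one single- or double-quoted string literal.
URL_IN_STRING = re.compile(r"""["'](https?://[^"'\s]+)["']""")


def find_literal_urls(js_source):
    """Return URLs that appear verbatim inside quoted strings."""
    return URL_IN_STRING.findall(js_source)


# Hypothetical script: the first URL is a literal, the second is
# assembled at run time from several pieces.
SAMPLE_JS = '''
var about = "https://www.mysite.com/about.html";
var page = "https://" + host + "/products/" + productId + ".html";
'''

found = find_literal_urls(SAMPLE_JS)
print(found)
```

Only the literal URL is reported. The product page URL never exists as a single string in the source, so a scanner that does not run the script has nothing to find.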

The spider was recently enhanced to read sitemaps as a way to guide its traversal of a site. When you specify a sitemap in the "include" list for an index, the spider visits each URL in the sitemap. A sitemap is an XML file that lists the URLs on a website. (See https://www.sitemaps.org/ for details.)
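As a rough illustration of what reading a sitemap involves, the Python sketch below pulls the URLs out of a small sitemap using the standard library. The sample file contents and the function name sitemap_urls are made up for this example; the namespace is the one defined by the sitemaps.org protocol. It does not show how the Blossom spider itself parses sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap listing two dynamically generated pages.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.mysite.com/products/1001.html</loc></url>
  <url><loc>https://www.mysite.com/products/1002.html</loc></url>
</urlset>
"""


def sitemap_urls(xml_text):
    """Return the <loc> value of every <url> entry, in document order."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


for url in sitemap_urls(SAMPLE_SITEMAP):
    print(url)
```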

For Blossom, the sitemap doesn't have to contain all the URLs on a site; it only needs to include the URLs that are generated dynamically. Other URLs can be picked up in the standard way by including the site's home page. For example, this include list can be used to find all URLs on mysite.com:
https://www.mysite.com
!https://www.mysite.com/sitemap.xml
Notice two things. First, it's okay if the sitemap overlaps with other seed URLs in the include list; the spider will remove any duplicates. Second, the sitemap line starts with "!". This tells the spider to scan the file sitemap.xml for URLs, but not to include the text of sitemap.xml in the search index.
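The two rules above can be modeled in a few lines of Python. This is only a sketch of the described behavior, not Blossom's code: the function names parse_include_list and merge_seeds and the (url, index_content) tuple layout are assumptions made for this example.

```python
INCLUDE_LIST = """\
https://www.mysite.com
!https://www.mysite.com/sitemap.xml
"""


def parse_include_list(text):
    """Turn an include list into (url, index_content) pairs.

    A leading "!" means: scan the file for URLs, but keep its own
    text out of the search index.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("!"):
            entries.append((line[1:], False))
        else:
            entries.append((line, True))
    return entries


def merge_seeds(*url_lists):
    """Combine URL lists, dropping duplicates and keeping first-seen order."""
    seen = set()
    merged = []
    for urls in url_lists:
        for url in urls:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


print(parse_include_list(INCLUDE_LIST))
```

Here merge_seeds stands in for the duplicate removal step: a URL that is found both by following links from the home page and by reading the sitemap is queued only once.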