Blogs pose two challenges for search engines: they often expose many paths to the same content, and their "last modified" dates are often unreliable.
Multiple paths cause a spider to download the same content multiple times. Also, if the search engine isn't careful, search results might contain the same text multiple times. With Blossom search, the multiple-path problem can be solved with judicious use of include and exclude patterns.
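To illustrate the idea of include and exclude patterns, here is a minimal sketch in Python. It is not Blossom's actual pattern syntax or configuration. The URL layout (post permalinks, label pages, monthly archives) and the names INCLUDE_PATTERNS, EXCLUDE_PATTERNS, and should_index are assumptions made up for this example, so adjust them to match your own blog.

```python
import re

# Hypothetical URL scheme for illustration only; your blog's paths may differ.
INCLUDE_PATTERNS = [
    re.compile(r"^/\d{4}/\d{2}/[^/]+\.html$"),  # individual post permalinks
]
EXCLUDE_PATTERNS = [
    re.compile(r"^/search/label/"),   # label (tag) listing pages
    re.compile(r"_archive\.html$"),   # monthly archive pages
]


def should_index(path: str) -> bool:
    """Return True if the path matches an include pattern and no exclude pattern."""
    if any(p.search(path) for p in EXCLUDE_PATTERNS):
        return False
    return any(p.search(path) for p in INCLUDE_PATTERNS)


# Three paths that could all lead to the same post text.
paths = [
    "/2008/06/blog-search.html",
    "/search/label/search",
    "/2008_06_01_archive.html",
]
indexed = [p for p in paths if should_index(p)]
```

With these patterns only the permalink is kept, so each post is downloaded and indexed once.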
The "last modified" date is used as part of search-engine ranking algorithms as well as to implement sorting by date. Blog systems generate their content dynamically, so the "last modified" date is often reported as today regardless of when the content was actually created. This, of course, makes "sort by date" useless. With Blossom search, the date problem can be solved using a custom indexing filter. See www.blossom.com/search_blog.html for more details.
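As a rough sketch of what a date-fixing filter might do, the Python below pulls the post date out of the page text instead of trusting the reported "last modified" value. This is not Blossom's filter API. The function name extract_post_date and the assumption that each post shows a date like "Wednesday, June 11, 2008" are hypothetical, and the parsing relies on English day and month names.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumed date format: "Wednesday, June 11, 2008" somewhere in the page text.
DATE_RE = re.compile(
    r"\b(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday), "
    r"[A-Z][a-z]+ \d{1,2}, \d{4}\b"
)


def extract_post_date(page_text: str) -> Optional[date]:
    """Return the first post date found in the text, or None if there is none."""
    match = DATE_RE.search(page_text)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(0), "%A, %B %d, %Y").date()
    except ValueError:
        # Looked like a date but did not parse (for example, a bad month name).
        return None
```

An indexer could use the returned date for sorting and fall back to the server's "last modified" value only when the function returns None.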
If you are indexing a blog and want help overcoming these problems, let us know by sending email to support@blossom.com.
Wednesday, June 11, 2008