Saturday, January 18, 2025

Highlighting search snippets and terms

An important feature of Blossom Search is displaying the context of search terms on the search results page. The context, typically a sentence containing the search terms, is called a snippet. To make it clearer why the context is relevant, Blossom Search highlights the search terms in the snippet.

Currently, we are testing snippet highlighting on retrieved HTML files using a new feature of many browsers called "text fragments". The feature is triggered by adding a command to the fragment at the end of a document's URL (the part after the # character). The command begins with :~:text= and identifies the text to be highlighted. As a result, the snippet is highlighted and the page is scrolled, if needed, to bring the snippet into view.
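
As an illustration, a link of this kind could be assembled as in the sketch below. This is not Blossom's code: the function name text_fragment_url and the example.com address are made up for the example, and any fragment already present on the URL is simply dropped to keep the sketch short. The snippet is percent-encoded, with hyphens encoded as well, since a hyphen has a special meaning inside a text fragment.

```python
from urllib.parse import quote


def text_fragment_url(url: str, snippet: str) -> str:
    """Return url with a text-fragment command that highlights snippet.

    Hypothetical helper for illustration only.
    """
    # Percent-encode everything, including "," and "&", which separate
    # parts of a text-fragment command. quote() never encodes "-", so
    # that character is handled by hand.
    encoded = quote(snippet, safe="").replace("-", "%2D")
    # Simplification: discard any fragment already on the URL.
    base = url.split("#", 1)[0]
    return f"{base}#:~:text={encoded}"


if __name__ == "__main__":
    print(text_fragment_url("https://example.com/page.html", "search terms"))
```

Browsers that do not support text fragments ignore the command and open the page as usual.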

A similar feature is being tested on PDF files. Some PDF readers allow search terms to be appended to a URL, causing the terms to be highlighted wherever they appear in the document. Again, this feature is not available in all browsers. We've been using Firefox for our tests; Chromium-based browsers do not currently offer this feature.
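
A minimal sketch of such a PDF link follows. It assumes the reader accepts a search= parameter in the URL fragment, which is our understanding of the viewer built into Firefox; other readers may use a different parameter or none at all. The function name pdf_search_url and the example.com address are hypothetical.

```python
from urllib.parse import quote


def pdf_search_url(url: str, terms: list[str]) -> str:
    """Return url with the search terms appended as a fragment.

    Assumes a reader that understands "#search=..."; illustration only.
    """
    # Simplification: discard any fragment already on the URL.
    base = url.split("#", 1)[0]
    # Join the terms with spaces and percent-encode the result.
    return f"{base}#search={quote(' '.join(terms), safe='')}"


if __name__ == "__main__":
    print(pdf_search_url("https://example.com/report.pdf", ["annual", "budget"]))
```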

This feature has left testing and is now live.

Wednesday, October 9, 2024

Improved handling of JavaScript

Dynamic websites have long been a challenge for web crawlers because some URLs may be generated dynamically via program code. Website menus, for example, may be created from data tables where the HTML for menu commands and landing pages is assembled at runtime. Without some interpretation of the JavaScript, those landing pages could be missed by a crawler.

The Blossom spider has long scanned JavaScript for potential URLs, but it does not execute JavaScript code. The latest update has enhanced the scan to extract more URLs embedded in strings. As a result, the spider now finds dynamic URLs that were previously missed and thus the number of pages in some indexes has grown.
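
The general idea of scanning without executing can be sketched as follows. This is not the Blossom spider's actual scanner; the regular expressions, the list of file extensions, and the function name extract_urls are assumptions chosen for the example. The sketch pulls quoted string literals out of the JavaScript source and keeps those that look like URLs.

```python
import re
from urllib.parse import urljoin

# Single- or double-quoted JavaScript string literals (template literals
# are not handled in this sketch).
STRING_LITERAL = re.compile(r"""(["'])((?:\\.|(?!\1)[^\\\n])*)\1""")

# Heuristic for "looks like a URL": an absolute http(s) URL, a
# root-relative path, or a relative path ending in a page-like extension.
URL_LIKE = re.compile(
    r"^(?:https?://\S+|/\S+|[\w./-]+\.(?:html?|php|aspx?|pdf))$",
    re.IGNORECASE,
)


def extract_urls(js_source: str, base_url: str) -> list[str]:
    """Return absolute URLs found in string literals, in order, no repeats."""
    found: list[str] = []
    for match in STRING_LITERAL.finditer(js_source):
        # JavaScript sometimes escapes slashes as "\/"; undo that.
        candidate = match.group(2).replace("\\/", "/")
        if URL_LIKE.match(candidate):
            absolute = urljoin(base_url, candidate)
            if absolute not in found:
                found.append(absolute)
    return found


if __name__ == "__main__":
    js = 'var menu = [{label: "Home", href: "/index.html"}];'
    print(extract_urls(js, "https://example.com/site/"))
```

A heuristic like this will produce some false positives, so candidate URLs still need to be fetched and checked before they are indexed.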

Saturday, December 10, 2022

Upgrade to search software and hardware

Over the next month the Blossom search server will be migrating to more modern hardware and system software. The changes should be transparent unless you have hard-coded Blossom's IP address anywhere in your system. The result should be better performance and better security. The servers will continue to be hosted by Amazon AWS.

In addition to system upgrades, the search software itself will be upgraded. In the past the programs were a mixture of 32-bit and 64-bit. Moving forward, everything will be 64-bit running on Amazon Linux 2. During this testing period you will see more traffic from Blossom spiders as the new system runs in parallel with the old. The target date for completion is early January.

UPDATE. As of December 26, we have begun the process of moving indexes to the upgraded system. During the week of the 26th, index updates and weekly search reports will be delayed as they are checked for correctness. All indexes should be moved before Jan 1. Please report any anomalies you see to support@blossoft.com.

Saturday, March 12, 2022

Expanded sitemap processing

By default, the Blossom spider looks for a sitemap.xml file in each folder it processes. (If desired, this behavior can be turned off from the Search Configuration page.) Recently we discovered that at least one sitemap generator creates a file header that caused our sitemap parser to fail. As a result, some sitemaps were not followed.

The latest spider update has fixed this problem, allowing more sitemaps to be read. Your indexes may now include files that previously were not found.
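
For readers curious about what a tolerant sitemap reader can look like, here is a minimal sketch. The header that tripped our parser is not reproduced here; the sketch assumes, purely as an example, a file that starts with a byte-order mark or stray whitespace before the XML declaration. The function name sitemap_urls is hypothetical.

```python
import xml.etree.ElementTree as ET


def sitemap_urls(raw: bytes) -> list[str]:
    """Return the <loc> values from a sitemap, tolerating a messy header."""
    # Drop a UTF-8 byte-order mark and leading whitespace, either of which
    # can make a strict XML parser reject the declaration that follows.
    cleaned = raw.lstrip(b"\xef\xbb\xbf \t\r\n")
    root = ET.fromstring(cleaned)
    urls = []
    for element in root.iter():
        # Compare local names so that any namespace (or none) is accepted.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


if __name__ == "__main__":
    sample = (
        b'\xef\xbb\xbf\n<?xml version="1.0" encoding="UTF-8"?>\n'
        b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        b"<url><loc>https://example.com/a.html</loc></url>"
        b"</urlset>"
    )
    print(sitemap_urls(sample))
```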

Thursday, January 20, 2022

Upgrade to back-end infrastructure

Work has begun on upgrading the back-end infrastructure supporting Blossom Search, including the customer-facing portal for managing account information and search engine configuration. We will be adding more functionality to the portal, such as a facility for accessing accounting documents and more insight into the contents and use of a search index. Please send suggestions to support@blossoft.com for features you'd like to see.

Wednesday, January 27, 2021

Robot searches removed from search report

You may have noticed that the number of searches reported in your weekly search report has gone down, perhaps dramatically. We have recently changed the reporting system to ignore searches from known robots.

Robots have long dominated overall web traffic. As a result of their increased sophistication, we've seen an increase in the number of searches carried out by robots. These robot searches skew the search data, making it harder to see just what your (human) visitors are searching for. On your search report the data no longer includes searches from known robots (and you will see the number of robot searches that were ignored).
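
The effect on a report can be sketched as below. How Blossom recognizes known robots is not described here, so the sketch assumes, for illustration only, a check of the visitor's user-agent string against a short list of markers. The names ROBOT_MARKERS, Search, and split_searches are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical markers; a real list of known robots would be much longer.
ROBOT_MARKERS = ("bot", "crawler", "spider")


@dataclass
class Search:
    query: str
    user_agent: str


def is_known_robot(user_agent: str) -> bool:
    """Return True if the user-agent string contains a robot marker."""
    agent = user_agent.lower()
    return any(marker in agent for marker in ROBOT_MARKERS)


def split_searches(searches: list[Search]) -> tuple[list[Search], int]:
    """Return the human searches and the count of robot searches ignored."""
    human = [s for s in searches if not is_known_robot(s.user_agent)]
    return human, len(searches) - len(human)


if __name__ == "__main__":
    log = [
        Search("opening hours", "Mozilla/5.0 (Windows NT 10.0) Firefox/120.0"),
        Search("admin login", "ExampleBot/1.0"),
    ]
    human, ignored = split_searches(log)
    print([s.query for s in human], ignored)
```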

Wednesday, October 28, 2020

New CSS classes to control search engine output

The appearance of search engine output can be controlled in a variety of ways. The simplest is to use parameters in the search URL, for example to control whether the size or date of a matched page is shown. By using a style sheet you can get control over the fonts, colors, and spacing of the text in the search output. Each component of the output is surrounded by CSS classes. Documentation for using CSS is in the Search Guide.

The next version of the search engine contains a few new classes to control the search output:

Blossom_DocBlock: A div that surrounds all of the output for each document matched by a query. Use this to control block-level behavior such as shading, hovering, and selecting.

Blossom_DocType: If a document is not HTML, then an indicator of the document type is added to the title. This class controls the style of the indicator.

Blossom_MoreButtons: Links for the "next" and "previous" pages of search results are displayed when the "more" option is used. This style controls the format of the link text.

Blossom_SearchForm: A div containing the search-again forms.

You can access the new version of the search engine by using "nquery" in place of "query" in your search URL. Use nquery just for testing, as it changes regularly as we test new search engine features. The changes will migrate to the production search engine, "query", in early November.

Update. These features went live on November 9, 2020.