
Monday, March 25, 2013

Expanded tools to analyze search indexes

We have been expanding and improving the tools you can use to understand a search index. This note discusses three of these tools, accessible from the "Actions" section of your Search Configuration page by following the link "Retrieve the list of URLs, meta data, and hyperlinks in an index".

  1. The list of URLs is extracted directly from a search index. It tells you the exact source documents that were used to generate the index.
  2. Meta data includes such items as the title of a page and the date the page was last modified. The meta data is shown for each URL in the index; that is, it covers the same URLs shown by the "List of URLs" tool.
  3. The list of hyperlinks from each page comes from the spidering log and includes both onsite and offsite links. It tells you what links are on a page and what pages contain a particular link. It can answer the question "Why is this page in the index?"
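
To answer "what pages contain a particular link", the page-to-links listing can be inverted. The following is a minimal sketch, assuming you have already copied the hyperlink list into a dictionary keyed by page URL; the function name, the data layout, and the example URLs are all hypothetical and are not part of the tool itself.

```python
from collections import defaultdict


def invert_links(links_by_page):
    """Map each link target to the sorted list of pages that contain it."""
    pages_by_link = defaultdict(set)
    for page, links in links_by_page.items():
        for link in links:
            pages_by_link[link].add(page)
    return {link: sorted(pages) for link, pages in pages_by_link.items()}


# Hypothetical data standing in for the "List of Hyperlinks" output.
links_by_page = {
    "http://example.com/": [
        "http://example.com/a.html",
        "http://other.example.org/",
    ],
    "http://example.com/a.html": [
        "http://example.com/b.html",
        "http://example.com/",
    ],
    "http://example.com/c.html": ["http://example.com/b.html"],
}

pages_by_link = invert_links(links_by_page)
# Which pages link to b.html, and so explain why it is in the index?
print(pages_by_link["http://example.com/b.html"])
```

Looking up a URL in the inverted map lists every spidered page that links to it, which is usually enough to trace how an unexpected page entered the index.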

You may notice more URLs in the "List of Hyperlinks" than in the "List of URLs". During indexing we carry out a more complete duplicate removal than that implemented during spidering. As a result, the spidered list may include pages that are removed during indexing.
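
If you want to see which spidered pages were dropped as duplicates, you can compare the two lists yourself. This is only a sketch: it assumes you have saved the spidered page URLs and the indexed URLs as plain text, one URL per line, which may not match the layout the tools actually display. The function names and example URLs are hypothetical.

```python
def read_urls(lines):
    """Return the set of non-empty, whitespace-trimmed URLs from an iterable of lines."""
    return {line.strip() for line in lines if line.strip()}


def dropped_by_indexing(spidered_lines, indexed_lines):
    """Return, sorted, the URLs that were spidered but are absent from the index."""
    return sorted(read_urls(spidered_lines) - read_urls(indexed_lines))


# Hypothetical data: index.html duplicates the site root and is removed at indexing.
spidered = [
    "http://example.com/\n",
    "http://example.com/index.html\n",
    "http://example.com/about.html\n",
]
indexed = [
    "http://example.com/\n",
    "http://example.com/about.html\n",
]

print(dropped_by_indexing(spidered, indexed))
```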

You may also notice that the Starting URLs in the "List of Hyperlinks" vary depending on whether the last spidering was complete or incremental. On a complete respidering, the starting URLs are just those specified in the "Include List" for the index. For an incremental spidering, the starting URLs are all the URLs in the current index.
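
The rule for choosing starting URLs can be written out as a short sketch. It only illustrates the behavior described above and is not the spider's actual code; the function name, the mode strings, and the example URLs are assumptions.

```python
def starting_urls(mode, include_list, indexed_urls):
    """Pick the spider's starting URLs for a complete or incremental run."""
    if mode == "complete":
        # A complete respidering starts only from the "Include List".
        return list(include_list)
    if mode == "incremental":
        # An incremental spidering starts from every URL already in the index.
        return list(indexed_urls)
    raise ValueError(f"unknown spidering mode: {mode!r}")


# Hypothetical configuration and index contents.
include_list = ["http://example.com/"]
indexed_urls = [
    "http://example.com/",
    "http://example.com/about.html",
    "http://example.com/news.html",
]

print(starting_urls("complete", include_list, indexed_urls))
print(starting_urls("incremental", include_list, indexed_urls))
```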

Tuesday, March 12, 2013

Modified scoring function

We have been revising the search engine's scoring function, which determines the order in which pages are listed when they match a search query. PDF files, in particular, are troublesome because they are often archival documents and not a good match for the search queries of typical site visitors. For that reason, PDF files are penalized in their score relative to an HTML document with the same content. The penalty function has recently been rewritten. PDF files are usually scored lower by the new function, but not always. Your feedback on the search results is always welcome.

If you are not happy with the default scoring, you can influence the search engine's scoring of pages by using the "/pdf" option, by adding keywords and keyphrases to pages, and by setting page weights explicitly. Each of these techniques is described briefly in the Search Guide, and in more detail in the library document Fine tuning with page weights and meta tags.


Tuesday, March 5, 2013

Block link style now the default

Link style 3, described in our previous post, is now available in the standard search engine--in fact it is now the default style. If you do not specify a style in your search URL, the block style is used.

To get a different style, add "/linkN" to your search URL, where N is the style number. The previous default style was 0 (zero), so to get style 0 add "/link0". For a complete description of the link styles, please see the Output Format section of the Search Guide.
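
If you build search URLs in a script, the "/linkN" option can be added with a small helper. This sketch assumes the option is simply appended to the end of the search URL and that the URL carries no query string; check the Search Guide for where options belong in your own URL. The helper name and the base URL are hypothetical.

```python
def with_link_style(search_url, style):
    """Append a "/linkN" option to a search URL, where N is the style number."""
    if not isinstance(style, int) or style < 0:
        raise ValueError("style must be a non-negative integer")
    # Drop any trailing slash so the result never contains "//linkN".
    return search_url.rstrip("/") + f"/link{style}"


# Hypothetical search URL; substitute the one for your own index.
base_url = "http://search.example.com/mysite"

print(with_link_style(base_url, 0))  # request the previous default style
```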