
PDI 2008 Sessions: Search Tools for Your Web Site

What is it?

Various search tools can be used to create a customized search for a Web site. Components may include:

  • user interface (e.g. search box, search results page)
  • server (hardware and software), crawler and indexer

Why use it?

  • Provide users with an additional way to navigate your site
    • Links and navigation menus are the traditional way to navigate the Web
    • Search is especially important for sites with many pages
  • Many other sites have a search box, so users expect one
    • Ideally every page on your site should have a site search box
    • Search box is most often found in the page header
  • Improve staff intranet productivity
  • Find errors in order to improve page content
    • Ensure that all important content is indexed
  • Create a custom search engine for a group or research topic
    • Can include content from any sites on the Internet

Popular Tools

CSU Libraries Demos

Google Custom Search Engine (Co-op/CSE)

  • Interface is easy to customize using Libraries template
    • Results are Google-like, with Google Custom Search logo
    • Added code for menu to narrow search to one subdirectory
    • Can search content on multiple servers (lib and digital)
  • Keywords to narrow search
  • Sites/URLs to include or exclude, wildcards allowed
  • Editions: standard has ads, business/university/nonprofits do not
  • Add to Google home page, get code
  • Refinements to label categories in some sites
  • Look and feel of search box and results
  • Code to copy and paste in your search and results pages
  • Collaboration with contributors (invited or volunteer)
  • Preview - try out your searches
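As a concrete illustration of the copy-and-paste step above, a minimal search box for a Custom Search Engine might look like the following. This is a hedged sketch, not Google's exact embed code: the `cx` value shown is a placeholder, and the real code (with your engine's ID) is generated for you in the CSE control panel.

```html
<!-- Hypothetical example: a basic CSE search box.
     Replace the cx value with the ID Google generates for your engine. -->
<form action="https://cse.google.com/cse" method="get">
  <input type="hidden" name="cx" value="partner-pub-0000000000000000:example">
  <input type="text" name="q" size="30">
  <input type="submit" value="Search">
</form>
```

The form simply submits the query (`q`) and engine ID (`cx`) to Google, which returns the hosted results page; styling the box to match the Libraries template is then just ordinary CSS.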

Google Mini

  • Turnkey server (hardware and software) in our server room
  • Interface is fairly easy to customize using Libraries template
    • Menu of collections to narrow search
    • No Google branding needed
  • Crawl and Index
    • Crawl URLs - patterns to start, follow, or not crawl
    • Crawl Schedule - continuous or specific days/times
    • Crawler Access - internal, password-protected, proxy servers
    • Collections - groups of URL patterns to search together
  • Serving
    • Front ends - separate interfaces for public, staff, test
      • Output format, KeyMatch, related queries, remove URLs
  • Status and Reports
    • Crawl status - documents found/crawled/served
    • Crawl diagnostics - URLs crawled, excluded or with errors
    • Content statistics - documents by file type
    • Search reports - collections, dates, keywords, queries
  • Administration
    • User accounts - admin or manager, collections, frontends
    • Reset index - clear the database and start over
    • Import/export configuration - backup all settings
    • System, network, SNMP, certificates, SSL, LDAP, license

Features

User Interface

  • Public and staff/restricted interfaces (front ends)
  • Can search and results pages be customized?
    • page layout, header, footer, colors, styles, ads
  • Faceted search
    • left navigation links to subcategories or topics with fixed # of items
    • e.g. dates, countries, languages, subjects
  • Collections (limit search to specific folders or sets of URLs you define)
  • KeyMatch (staff-suggested URLs for highly-used keywords)
  • Spellchecker ("did you mean...") and suggestions for related terms
  • Advanced search
    • Keywords/phrases (and, or, not, exact phrase, part of word)
    • Limit (to a collection, language, format, domain, or field)
    • Sort (by relevance, date, title, etc.)
    • Output format (# results per page, long/short/URL, group by site)
  • Duplicates/similar items are removed or grouped?
  • XML search results available (for flexible formatting by scripts/XSLT)?
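The XML output mentioned above is what makes flexible formatting by scripts possible. A minimal sketch, assuming the response follows the Google Mini/Search Appliance XML layout (a `GSP` root with `R` result elements containing a `U` url, `T` title, and `S` snippet; verify the element names against your own appliance's output):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample response in the Mini/GSA-style XML result layout.
SAMPLE = """<GSP VER="3.2">
  <RES SN="1" EN="2">
    <R N="1">
      <U>http://lib.example.edu/services.html</U>
      <T>Library Services</T>
      <S>Hours, borrowing, and interlibrary loan...</S>
    </R>
    <R N="2">
      <U>http://lib.example.edu/hours.html</U>
      <T>Library Hours</T>
      <S>Open daily...</S>
    </R>
  </RES>
</GSP>"""

def parse_results(xml_text):
    """Return (title, url) pairs from an XML results page."""
    root = ET.fromstring(xml_text)
    return [(r.findtext("T"), r.findtext("U")) for r in root.iter("R")]

for title, url in parse_results(SAMPLE):
    print(title, "->", url)
```

A script like this (or an XSLT stylesheet doing the same walk) lets you render results in your own page template instead of the search engine's.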

Crawl and Index

  • Crawl/search multiple domains or hosts
  • URLs to crawl
  • Filters (remove domains or URLs from crawls, indexes or interfaces)
  • File formats indexed (HTML, PDF, Word, Excel, etc.)
  • Crawl frequency (increase/decrease overall or for certain pages/patterns)
  • Usage reports (top queries, top keywords)
  • Crawl reports (URLs crawled/excluded, errors)
  • Helps create files for crawlers? (robots.txt, sitemap.xml)
  • Access to password-protected pages or proxy servers
  • Meta tag information used or ignored?
  • Language and character set support?
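The crawler-helper files noted above are plain text with simple formats. A hedged sketch (the paths and sitemap URL are placeholders for your own site):

```
# robots.txt - placed at the site root; tells crawlers what to skip
User-agent: *
Disallow: /staff/
Disallow: /cgi-bin/

# Points crawlers at a sitemap listing the pages you want indexed
Sitemap: http://lib.example.edu/sitemap.xml
```

A tool that generates or validates these files saves hand-editing and helps keep restricted areas (like a staff intranet) out of the public index.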

Other Selection Criteria

  • Provider: Commercial? Cost? Licensing? Open source?
  • Limits: # domains, pages, queries; ads, vendor branding
  • Platform: Windows or Unix? Apache or IIS? Programming language?
  • Performance: Searches must be fast or users will go elsewhere
  • Administration: Multiple administrators? Roles?
  • Ease of configuration: GUI-based and/or file-based?
  • Support: phone/email, user community, documentation, training, upgrades, longevity

Other Resources