Table of Contents

  • Introduction
  • Administrator
    · Introduction
    · Directories
    · Installing
    · Configuring
    · New profile wizard
    · Unleash the crawler
    · Logs
    · Advanced profile configuration
  • User
  • Appendix
  • Advanced profile configuration
    This page describes the more advanced configuration options that are not mentioned on the "New profile page".
    Activated profile
    If "active", the IntraSeek engine will try to mount the profile's data base at startup, and the profile will also be scheduled for automatic crawler launches.
    Automated update of the data bases
    Defines at which intervals the data bases should be automatically updated. Use this function if you want the crawler for a profile to be launched automatically at certain intervals to keep the index data bases up to date. The selections should be rather self-explanatory. If you select Never! at the first selection, the two following selections (days and time) will be ignored and crawlers will not be launched automatically. For status reports on scheduled crawlers, look at the logs page (in the IntraSeek configuration interface).
    Crawler log detail level
    Selects how much information the crawler should write in the log.
    Level            Fatal errors   Errors   Warnings, reports, scheduler info   Rejects, accepts
    Full (Default)   Yes            Yes      Yes                                 Yes
    Medium           Yes            Yes      Yes                                 No
    Short            Yes            Yes      No                                  No
    None             Yes            No       No                                  No
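    The table above can be read as a simple lookup. The following Python sketch is only an illustration of that lookup; the names LOG_LEVELS and should_log are hypothetical and are not part of IntraSeek.

```python
# Hypothetical illustration of the log detail table above.
# The category names are invented labels for the four table columns.
LOG_LEVELS = {
    "Full": {
        "fatal_errors": True,
        "errors": True,
        "warnings_reports_scheduler": True,
        "rejects_accepts": True,
    },
    "Medium": {
        "fatal_errors": True,
        "errors": True,
        "warnings_reports_scheduler": True,
        "rejects_accepts": False,
    },
    "Short": {
        "fatal_errors": True,
        "errors": True,
        "warnings_reports_scheduler": False,
        "rejects_accepts": False,
    },
    "None": {
        "fatal_errors": True,
        "errors": False,
        "warnings_reports_scheduler": False,
        "rejects_accepts": False,
    },
}


def should_log(level: str, category: str) -> bool:
    """Return True if a message of this category is written at this level."""
    return LOG_LEVELS[level][category]
```

    Note that fatal errors are written at every level, including None.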

    Crawler walk pause
    The crawler is quite fast. If you index pages outside your own network, you should slow it down somewhat by raising the "Crawler walk pause", which is the number of seconds the crawler pauses after each document download before fetching the next document. This keeps it from causing large loads on the web servers of others.
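    As a rough illustration of what the pause does, here is a minimal Python sketch of a fetch loop that waits between downloads. It is not the IntraSeek crawler; crawl_with_pause and its parameters are hypothetical names, and the fetch function is supplied by the caller.

```python
import time
from typing import Callable, Iterable, List


def crawl_with_pause(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    walk_pause_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in turn, pausing between consecutive downloads."""
    results = []
    for index, url in enumerate(urls):
        # No pause is needed before the very first download.
        if index > 0 and walk_pause_seconds > 0:
            sleep(walk_pause_seconds)
        results.append(fetch(url))
    return results
```

    With a pause of 2 seconds, three documents take at least 4 seconds of waiting in addition to the download time.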
    Crawler nice increment
    Sets the "nice level" of the crawler. If you are not familiar with UNIX, the UNIX command man nice will explain things further. As a short summary, the nice increment defines how nice the process should be in its use of system resources. A crawler with a high nice value will happily give away processor time to other processes. The level is zero by default, and the maximum level is 19. If the web server runs as root, it is possible to make the process more "mean" by setting the nice increment as far down as -19.
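    The range rules above can be sketched in Python as follows. This is an assumption-laden illustration, not IntraSeek code: clamp_nice_increment and apply_nice_increment are hypothetical names, and the effective user ID of the current process stands in for "the web server runs as root". os.nice and os.geteuid are available on UNIX only.

```python
import os


def clamp_nice_increment(requested: int, is_root: bool) -> int:
    """Limit the increment to the range described above.

    Only root may go below zero (down to -19); the maximum is 19.
    """
    lowest = -19 if is_root else 0
    return max(lowest, min(19, requested))


def apply_nice_increment(requested: int) -> int:
    """Apply the clamped increment to this process; returns the new niceness."""
    increment = clamp_nice_increment(requested, is_root=(os.geteuid() == 0))
    return os.nice(increment)
```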
    Stop lists
    Specifies which lists of short, common words (e.g. "and" or "again") should not be indexed. Several stop lists are available here. Select one or more (or none) depending on which languages are used on your server's pages. The stop lists are stored as ordinary text files in the directory ENGINE_HOME/resource/.

    (For further technical details on this function, check the chapter IntraSeek and memory usage.)

    Additional stop words
    Indicates extra stop words that should be filtered. If "Yoyodyne Productions" is present on every one of your pages, it may be a good idea to specify "yoyodyne" and "productions" here. The disadvantage is that it will then not be possible to search for the words "yoyodyne" or "productions"; the advantage is that the data base files will be smaller and searches faster.
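    The effect of stop lists and additional stop words can be illustrated with a small Python sketch. The function names are hypothetical, the stop lists are passed in as plain Python lists rather than read from ENGINE_HOME/resource/ (the file format is not described here), and lowercasing before comparison is an assumption.

```python
from typing import Iterable, List, Set


def build_stop_set(
    stop_lists: Iterable[Iterable[str]], additional: Iterable[str]
) -> Set[str]:
    """Merge the selected stop lists and the additional stop words."""
    stop: Set[str] = set()
    for words in stop_lists:
        stop.update(word.lower() for word in words)
    stop.update(word.lower() for word in additional)
    return stop


def indexable_words(text: str, stop: Set[str]) -> List[str]:
    """Return the words of the text that would still be indexed."""
    return [word for word in text.lower().split() if word not in stop]
```

    Words removed this way never reach the index, which is why they can no longer be searched for.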
    Query Logs active
    If "Yes", queries will be logged to disk, for use in top 100 statistics and similar reports.
    Safety save
    Sets how many pages a crawler should go through before it automatically saves and reorganizes its data base. For further technical details on this function, check the chapter IntraSeek and memory usage.
    Max documents to download
    Specifies the maximum number of pages the crawler should index. It is a good idea to specify a maximum here. In case something should go wrong, you avoid having the entire partition filled by a huge data base. Going wrong usually means that the robot has become lost on the Internet, due to erroneously written accept and avoid patterns.
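    How "Safety save" and "Max documents to download" interact can be sketched as a loop. This is a hypothetical Python illustration, not the crawler's actual logic; the callbacks index_page and save are supplied by the caller, and the final save of any remaining pages is an assumption.

```python
from typing import Callable, Iterable


def crawl(
    urls: Iterable[str],
    index_page: Callable[[str], None],
    save: Callable[[], None],
    safety_save: int,
    max_documents: int,
) -> int:
    """Index pages until the cap is reached, saving every safety_save pages."""
    indexed = 0
    for url in urls:
        if indexed >= max_documents:
            break  # the cap protects the partition from a runaway crawl
        index_page(url)
        indexed += 1
        if indexed % safety_save == 0:
            save()
    # Assumed: pages indexed since the last safety save are saved at the end.
    if indexed % safety_save != 0:
        save()
    return indexed
```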
    Crawler page fetch Timeout
    Defines how many seconds should pass before the download of a page is aborted. For example, if a crawler can connect to a page but doesn't get anything from the web server at the other end, it would patiently wait for information forever if it weren't for this setting. Enter how many seconds you allow the fetcher to do its work.
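    The same idea can be shown with Python's standard library, where urllib.request.urlopen accepts a timeout in seconds. The sketch below is only an illustration of a fetch timeout; fetch_page is a hypothetical name and says nothing about how the IntraSeek fetcher is implemented.

```python
import urllib.request
from typing import Optional


def fetch_page(url: str, timeout_seconds: float) -> Optional[bytes]:
    """Download a page, giving up if the server is silent for too long."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.read()
    except OSError:
        # Timeouts, refused connections and URL errors are all OSError
        # subclasses; the page is simply skipped in this sketch.
        return None
```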
    Site structure logging
    Creates logs of web site errors and warnings. If active, site structure logs will be generated for this profile. See the logs chapter for information on site structure logs. If you are not interested in the site structure log, you can turn it off here. The operations controlling the log are then disabled, which saves crawler time, memory and disk space.
    Max size of query logs
    The maximum size of a query log, specified in bytes. When a query log exceeds this size, it will be moved to a .bak file. The old .bak file will be removed.
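    The rotation rule above can be sketched in a few lines of Python. The function name is hypothetical, and appending ".bak" to the full log file name is an assumption, since the exact naming of the .bak file is not stated here.

```python
import os


def rotate_if_too_large(log_path: str, max_bytes: int) -> bool:
    """Move the log to a .bak file once it exceeds max_bytes.

    Returns True if the log was rotated.
    """
    if not os.path.exists(log_path) or os.path.getsize(log_path) <= max_bytes:
        return False
    # os.replace overwrites an existing destination, so the old .bak is removed.
    os.replace(log_path, log_path + ".bak")
    return True
```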
    Number of max quick links displayed
    If a search results in several hundred pages, a list of links to the next pages of results will be displayed below the list of summaries. This setting is the maximum number of such "quick links" to show.
    Number of documents summaries
    Defines how many search summaries to display on each page.
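    The two settings above work together: the number of summaries per page determines how many result pages exist, and the quick link limit caps how many of them are linked. A minimal Python sketch, with hypothetical names and the simplifying assumption that the links always start at the first result page:

```python
import math
from typing import List


def quick_link_pages(
    total_hits: int, summaries_per_page: int, max_quick_links: int
) -> List[int]:
    """Return the result page numbers that get a quick link."""
    total_pages = math.ceil(total_hits / summaries_per_page)
    return list(range(1, min(total_pages, max_quick_links) + 1))
```

    For example, 250 hits at 10 summaries per page give 25 result pages, of which only the first 8 are linked if the limit is 8.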
    Quoted search enabled
    If set to Yes, the users of the search engine can use quotation marks to search for a phrase. For example, a search for "John Carl Smith" will search for persons with this name. Without quotes, the search would return any pages that use any of those common names.

    Note that an extra data base will be used to store the extra information, if this setting is enabled. With the current implementation of full text searches we cannot guarantee good performance for data bases covering more than 1000 documents. If you have more documents, turn this option off, or IntraSeek can sometimes get stuck with heavy calculations for several seconds.
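    The difference between a quoted and an unquoted search can be illustrated with the following Python sketch. It scans the document text directly, which is a simplification; IntraSeek uses its data bases for this, and the function names are hypothetical.

```python
def matches_any_word(query: str, document: str) -> bool:
    """Unquoted search: the document contains at least one query word."""
    doc_words = set(document.lower().split())
    return any(word in doc_words for word in query.lower().split())


def matches_phrase(phrase: str, document: str) -> bool:
    """Quoted search: the words appear next to each other, in order."""
    words = phrase.lower().split()
    doc = document.lower().split()
    n = len(words)
    return n > 0 and any(
        doc[i:i + n] == words for i in range(len(doc) - n + 1)
    )
```

    A page reading "John met Carl and Smith" matches the unquoted search but not the quoted one.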
    Wildcards enabled
    If set to Yes, the users of the search engine can use question marks (?) and asterisks (*) to broaden searches. A search for net* might match "netscape", "nethack", "network" and so on. A search for int??net matches "intranet" as well as "internet".

    Note that IntraSeek requires that the user specifies at least three characters in front of the * notation, and that there is no distinction made between lower- and uppercase searches. Also note that an extra data base will be used to store the extra information, if this setting is enabled.
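    The wildcard rules above can be sketched in Python by translating the pattern to a regular expression. This is an illustration only, with hypothetical function names; it assumes that ? stands for exactly one character and that a ? counts towards the three characters required in front of the *.

```python
import re


def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a ?/* search pattern into a case-insensitive regex."""
    star = pattern.find("*")
    if star != -1 and star < 3:
        raise ValueError("at least three characters are required before '*'")
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")  # any run of characters, including none
        elif ch == "?":
            parts.append(".")  # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def wildcard_match(pattern: str, word: str) -> bool:
    """Return True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None
```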
    Summary text length
    The length (in characters) of the summaries displayed along with the link and the hit percent in the search results. If the web page has a Meta description, it will be used as the summary; otherwise the first part of the document becomes the summary.
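    The choice of summary text can be sketched as follows in Python. The function name is hypothetical, and cutting the Meta description to the same length as body text is an assumption that is not confirmed here.

```python
from typing import Optional


def make_summary(
    body_text: str, meta_description: Optional[str], summary_length: int
) -> str:
    """Prefer the Meta description; otherwise use the start of the document."""
    source = meta_description if meta_description else body_text
    return source[:summary_length]
```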