The SP Indexing process and how it works - randomized
I was going to write this as a formal article for the sharepointSearch.com site, but it is easier to write it as a blog post first, get feedback, make corrections, and then publish it. Some of these statements are assumptions based on experience: I of course don't have access to the SP source code, and I know that MS uses proprietary protocol handler interfaces for its own protocol handlers instead of iSearchProtocol.
First off, when the indexing service (mssearch.exe) starts up, it looks up certain registry values under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager to determine behavior and performance constraints. There are MANY registry values here to play with, and some are no longer active. A few can be very useful to implementors and developers:
- ConnectTimeout and DataTimeout - if you have a slow DB connection, or whatever you are doing is failing due to timeouts, you can turn these up.
- DebugFilters - set to 1 to ensure that only one crawling daemon is created and that it runs basically single threaded.
- DedicatedFilterProcessMemoryQuota - if you have a memory leak in an ifilter or protocol handler, raising this is a way to cheat a little and get more items indexed before the process is terminated.
- RobotThreadsNumber - setting this to 1 constrains each crawling daemon to be single threaded.
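As a rough sketch of how a debugging profile for these values could be written down, the snippet below renders a .reg fragment for the Gathering Manager key. It assumes the values are stored as REG_DWORD, which I have not confirmed for every name, so check the existing types in regedit before importing anything. The function name `render_reg` and the `debug_profile` dictionary are made up for illustration. Since the service reads these values at start-up, a restart of the service would presumably be needed for a change to take effect.

```python
# Sketch only: renders a .reg fragment for the Gathering Manager key.
# Assumption: every value is a REG_DWORD; verify the real types first.
KEY = (r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server"
       r"\12.0\Search\Global\Gathering Manager")


def render_reg(settings):
    """Return .reg file text that sets each name to a DWORD value."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{KEY}]"]
    for name, value in settings.items():
        if not 0 <= value <= 0xFFFFFFFF:
            raise ValueError(f"{name} does not fit in a DWORD: {value}")
        lines.append(f'"{name}"=dword:{value:08x}')
    return "\n".join(lines) + "\n"


# Hypothetical debugging profile: one crawling daemon, one thread.
debug_profile = {"DebugFilters": 1, "RobotThreadsNumber": 1}
print(render_reg(debug_profile))
```

Rendering the text instead of writing to the registry directly keeps the change reviewable and easy to undo.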
The indexing service is a C++, COM-based Windows service that manages the indexing processes. The actual crawling runs in processes separate from mssearch.exe to ensure stability, as protocol handlers and ifilters can be unstable at times. This allows the indexer to run multiple crawling instances at the same time and to kill off misbehaving ones easily enough.
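The isolation idea can be illustrated with a small supervisor, shown here as a minimal sketch rather than how mssearch.exe actually does it. A parent process starts each piece of work in a child process and kills the child if it overruns a time limit, so a hang or crash never takes the parent down. The function name `run_isolated` and the use of a child Python interpreter as the "crawling daemon" are assumptions for illustration.

```python
import subprocess
import sys


def run_isolated(worker_code, timeout_seconds):
    """Run worker_code in a child Python process; kill it on timeout.

    Returns "ok", "failed", or "killed". The parent survives all three.
    """
    child = subprocess.Popen([sys.executable, "-c", worker_code])
    try:
        exit_code = child.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        child.kill()  # misbehaving worker: terminate it
        child.wait()  # reap the process so it does not linger
        return "killed"
    return "ok" if exit_code == 0 else "failed"


if __name__ == "__main__":
    print(run_isolated("pass", 10))                         # healthy worker
    print(run_isolated("raise SystemExit(3)", 10))          # crashing worker
    print(run_isolated("import time; time.sleep(60)", 1))   # hung worker
```

A memory quota such as DedicatedFilterProcessMemoryQuota would be a second reason to kill a child, alongside the time limit used here.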
The crawling instance looks to the registry to determine which protocol handler (PH) class to load for the given content source, e.g. http = OSearch.HttpHandler. It then loads the PH and initializes it, passing in data and handles that can be used during crawling, such as timeout and proxy information, plus an object reference to iProtocolHandlerSite, which provides an interface for loading ifilters.
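The lookup step amounts to a table keyed by URL scheme. The sketch below uses a plain dictionary as a stand-in for the registry; only the http entry comes from this post, and the `file` ProgID and the function name `handler_for` are hypothetical.

```python
from urllib.parse import urlsplit

# Stand-in for the registry table. Only the http entry comes from the
# post; the "file" entry is a made-up ProgID for illustration.
HANDLERS = {
    "http": "OSearch.HttpHandler",
    "file": "Example.FileHandler",  # hypothetical
}


def handler_for(start_address):
    """Return the handler ProgID registered for the address's scheme."""
    scheme = urlsplit(start_address).scheme.lower()
    try:
        return HANDLERS[scheme]
    except KeyError:
        raise LookupError(
            f"no protocol handler registered for {scheme!r}") from None


print(handler_for("http://intranet/sites/hr"))
```

A custom protocol handler would add its own scheme to that table, which is how a new kind of content source becomes crawlable.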
The crawling instance (CI) then creates a new thread and asks the PH to create, or return a pooled, iURLAccessor object to process the crawl request. The job of the iURLAccessor is to handle the actual connection to the source, to return information about the source (security, title, size, name, whether it is a directory or an individual item), and to create an ifilter to handle the text and metadata extraction. Typically the accessor functions in two capacities: as an enumerator of content and as the crawler of the enumerated content. For instance, to crawl a file system the folders first have to be enumerated and then the individual files crawled. Every folder and every file is represented by a URI that gets processed through the PH and accessor, and the accessor behaves differently depending on which of the two the URI represents.
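The enumerate-then-crawl pattern for a file system can be sketched as below. This is a model of the two roles only, not the COM interface: the class `FileAccessor`, its method names, and the `crawl` function are all hypothetical, and one accessor object is created per path to mirror the one-accessor-per-URI behavior described above.

```python
from collections import deque
from pathlib import Path


class FileAccessor:
    """Hypothetical accessor for one file-system path (not the COM API)."""

    def __init__(self, path):
        self.path = Path(path)

    def is_directory(self):
        return self.path.is_dir()

    def children(self):
        """Enumerator role: list the items inside a folder."""
        return sorted(self.path.iterdir())

    def size(self):
        """Item role: report metadata about a single file."""
        return self.path.stat().st_size


def crawl(root):
    """Return (path, size) for every file under root, breadth first."""
    queue = deque([Path(root)])
    found = []
    while queue:
        accessor = FileAccessor(queue.popleft())  # one accessor per URI
        if accessor.is_directory():
            queue.extend(accessor.children())     # enumerate, crawl later
        else:
            found.append((accessor.path, accessor.size()))
    return found
```

Calling `crawl(".")` returns the files under the current directory with their sizes. In the real pipeline the file branch is where an ifilter would be loaded for text extraction.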
There are two categories of ifilters used within the crawling process: proprietary and standard. The standard ones are for known file types like PDF, Word documents, etc., and can be used by all accessors. The proprietary ones are created manually by an accessor and controlled by it for a specific purpose, like enumerating content from a dataset. When writing a custom protocol handler you must create proprietary ones to do your custom work, and you can use the standard ones to deal with files.
As extracted text and properties are returned from the iURLAccessors and ifilters, they are run through SP plug-ins that do the word breaking, stemming, and so forth before the data is added to the search indexes. Crawled properties are added to temporary work tables during the crawl process until a stored procedure can process them into managed properties. The extracted text and the text from the properties are added to the search index files, which are based on the MS Exchange search index files. NOTE: WSS Search uses SQL Server full-text searching exclusively, as opposed to MOSS, which uses both.
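The crawled-to-managed property step can be modeled roughly as follows. The real work happens in a SQL stored procedure that I have not seen, so this is only a sketch of the idea: rows accumulate in a work table, get promoted through a mapping, and are then cleared. The mapping, the property names, and the function name `promote` are all illustrative assumptions.

```python
# Hypothetical mapping: managed property -> crawled property names,
# in priority order. The names are illustrative only.
MAPPINGS = {
    "Author": ["ows_Author", "doc:Author"],
    "Title": ["ows_Title"],
}


def promote(work_table, mappings):
    """Build managed properties per document, then empty the work table.

    work_table: list of (doc_id, crawled_name, value) rows.
    Returns {doc_id: {managed_name: value}}.
    """
    crawled = {}
    for doc_id, name, value in work_table:
        crawled.setdefault(doc_id, {})[name] = value

    managed = {}
    for doc_id, props in crawled.items():
        managed[doc_id] = {}
        for managed_name, sources in mappings.items():
            for source in sources:
                if source in props:
                    managed[doc_id][managed_name] = props[source]
                    break  # first mapped crawled property wins
    work_table.clear()  # rows are gone once processing finishes
    return managed


rows = [(1, "ows_Title", "Plan"), (1, "doc:Author", "Ann")]
print(promote(rows, MAPPINGS))
```

Because the work table is emptied at the end, a mapping added afterwards has no rows to draw from until the content is crawled again.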
Some other random thoughts:
- Using the PDF ifilter from Adobe causes performance problems, as it is more than just single threaded: it is single instance only.
- Most PHs copy the files to a temp directory on the indexing machine before loading an ifilter.
- Creating a new managed property has no effect until the data is re-crawled, as the crawled properties are purged after each crawl once the managed property entries are generated.
- During incremental crawls createAccessor is called for every item in the index, so if a change log isn't used, performance problems could appear.
- enough for now