July 2007 - Posts

FAST Search integration announcement could be interesting.

http://www.microsoft.com/presspass/press/2007/jul07/07-17MSFASTSharepointPR.mspx?rss_fdn=Press%20Releases 

It seems like everyone is adding conceptual searching to SharePoint, so it will soon be commoditized. This is good, as it will drive prices down, and it is a much-needed improvement to relevancy when dealing with large document sets. There had been rumors that Microsoft was putting together a crack search development team to work on a new search release; I doubt that is the case now. There are too many mature products out there, like FAST and Autonomy, for Microsoft not to just do a deal instead of a rewrite. I wonder what announcements we will see next regarding FAST.

The SP Indexing process and how it works - randomized

I was going to write this as a formal article for the sharepointSearch.com site, but it is easier to write it as a blog post first, get feedback, make corrections and then publish it. Some of these statements are assumptions based on experience, as I of course don't have access to the SP source code, and I know that MS uses proprietary protocol handler interfaces for their own protocol handlers instead of ISearchProtocol.

First off, when the indexing service mssearch.exe starts up it looks up certain registry keys at HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager to determine behavior and performance constraints. There are MANY registry keys here to play with, and some aren't actually active anymore. But some can be very useful to implementors and developers:

  • ConnectTimeout and DataTimeout - if you have a slow DB connection, or whatever you are doing is failing due to timeouts, you can turn these up.
  • DebugFilters - set to 1 to ensure that only one crawling daemon is created, so the crawl runs basically single threaded.
  • DedicatedFilterProcessMemoryQuota - if you have a memory leak in an ifilter or PH, raising this is a way to cheat a little and get more items indexed before the process is terminated.
  • RobotThreadsNumber - setting this to 1 constrains each crawling daemon to a single thread.

The indexing service is a C++ COM-based Windows service that manages the indexing processes. The actual crawling runs in processes separate from mssearch.exe to ensure stability, as protocol handlers and ifilters can be unstable at times. This allows the indexer to run multiple crawling instances at the same time and to kill off misbehaving ones easily.

The crawling instance looks to the registry to determine which protocol handler class to load for the given content source, e.g. http = OSearch.HttpHandler. It then loads the PH and initializes it, passing in data and handles that can be used during the crawl, such as timeout and proxy information, along with an object reference to IProtocolHandlerSite, which provides an interface for loading ifilters.

The crawling instance (CI) then creates a new thread and asks the PH to create, or return from a pool, an IUrlAccessor object to process the crawl request. The job of the IUrlAccessor is to handle the actual connection to the source, return information about the source (security, title, size, name, whether it is a directory or an individual item) and create an ifilter to handle the text and metadata extraction. Typically the accessor functions in two capacities: as an enumerator of content and as the crawler of the enumerated content. For instance, to crawl a file system the folders first have to be enumerated and then the individual files crawled. Every folder and every file is represented by a URI that gets processed through the PH and an accessor, and the accessor behaves according to which of the two it is handling.
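The two roles of the accessor can be modelled in a few lines. This is only a toy Python sketch, not the real COM interfaces: `ToyAccessor`, `create_accessor`, `crawl` and `demo` are hypothetical names, the local file system stands in for the content source, and reading the file stands in for the ifilter.

```python
import os
import tempfile
from collections import deque
from dataclasses import dataclass


@dataclass
class ToyAccessor:
    """Hypothetical stand-in for the accessor the PH returns for one URI."""
    path: str

    @property
    def is_directory(self) -> bool:
        return os.path.isdir(self.path)

    def enumerate_children(self):
        # Container role: emit one URI per child so each becomes its own crawl request.
        return [os.path.join(self.path, name) for name in sorted(os.listdir(self.path))]

    def extract_text(self) -> str:
        # Item role: in the real pipeline an ifilter does this; here we just read the file.
        with open(self.path, encoding="utf-8", errors="replace") as handle:
            return handle.read()


def create_accessor(uri: str) -> ToyAccessor:
    """Hypothetical equivalent of asking the protocol handler for an accessor."""
    return ToyAccessor(uri)


def crawl(start_uri: str) -> dict:
    """Send every URI through an accessor: folders enumerate, files get indexed."""
    index = {}
    queue = deque([start_uri])
    while queue:
        accessor = create_accessor(queue.popleft())
        if accessor.is_directory:
            queue.extend(accessor.enumerate_children())
        else:
            index[accessor.path] = accessor.extract_text()
    return index


def demo() -> dict:
    """Build a tiny folder tree in a temp directory and crawl it."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "sub"))
        for rel, text in (("a.txt", "alpha"), (os.path.join("sub", "b.txt"), "beta")):
            with open(os.path.join(root, rel), "w", encoding="utf-8") as handle:
                handle.write(text)
        index = crawl(root)
        return {os.path.relpath(path, root): text for path, text in index.items()}
```

The point of the queue is that folders and files travel the same path: each is just a URI handed to an accessor, and only the accessor's answer to "is this a directory?" decides what happens next.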

There are two categories of ifilters used within the crawling process: proprietary and standard. The standard ones are for known file types like PDF, Word documents, etc. and can be used by all accessors. The proprietary ones are created manually by an accessor and controlled by it for a specific purpose, like enumerating content from a dataset. When writing a custom protocol handler you must create proprietary ones to do your custom work, and you can use the standard ones to deal with files.

As extracted text and properties are returned from the IUrlAccessor objects and ifilters, they are run through SP plug-ins that do the word breaking, stemming and so forth before the data is added to the search indexes. Crawled properties are added to temporary work tables during the crawl process until a stored procedure can process them into managed properties. The extracted text and the text from the properties are added to the search index files, which are based on the MS Exchange search index files. NOTE: WSS Search uses SQL Server full-text searching exclusively, as opposed to MOSS, which uses both.

Some other random thoughts:

  • Using the PDF ifilter from Adobe causes performance problems, as it is more than just single threaded: it is single instance only.
  • Most PHs copy the files to a temp directory on the indexing machine before loading an ifilter.
  • Creating a new managed property has no effect until the data is re-crawled, as the crawled properties are purged after each crawl once the managed property entries are generated.
  • During incremental crawls CreateAccessor is called for every item in the index, so if a change log isn't used performance problems could appear.
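The cost difference in the last point can be illustrated with a small Python model. It is a sketch under assumptions, not how the indexer is implemented: `incremental_full_pass` and `incremental_with_change_log` are hypothetical names, timestamps are plain integers, and each counted call stands in for one CreateAccessor round trip to the source.

```python
def incremental_full_pass(indexed, source):
    """Without a change log: one accessor call per indexed item to compare timestamps.

    indexed maps URI -> timestamp recorded at the last crawl;
    source maps URI -> current modification timestamp in the source system.
    """
    accessor_calls = 0
    changed = []
    for uri, crawled_at in indexed.items():
        accessor_calls += 1  # stands in for one CreateAccessor round trip
        if source[uri] > crawled_at:
            changed.append(uri)
    return changed, accessor_calls


def incremental_with_change_log(change_log, last_crawl_time):
    """With a change log: only items modified since the last crawl are touched."""
    changed = [uri for uri, modified in change_log if modified > last_crawl_time]
    return changed, len(changed)


# 1000 indexed documents, of which exactly one changed after the last crawl.
indexed = {f"doc://{i}": 100 for i in range(1000)}
source = dict(indexed)
source["doc://7"] = 150
change_log = [("doc://7", 150)]

full_changed, full_calls = incremental_full_pass(indexed, source)
log_changed, log_calls = incremental_with_change_log(change_log, 100)
```

Both approaches find the same changed document, but the full pass pays for 1000 accessor calls while the change log version pays for one.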

- enough for now

Deploying custom site templates that include SP Designer developed pages

There are two ways to share or reuse sites in SharePoint: templates or definitions. Site definitions are complete site specifications that usually get installed as features. Site templates are a single cab file containing just the differences between the current site and the site definition it was based on. It is much easier to create and deploy a site template than a site definition. To create a site definition you use the SharePoint Solution Generator to create a Visual Studio project, and from then on you make your changes directly to the files, not to the site.

The major benefit of site definitions is that you can modify them and all sites created from them will be modified too, sort of. If some of the modifications were done in SP Designer, then changes to the definition may have no impact, as the pages may be unghosted. Also, if you are in the business of developing these sites for market as a product, you are faced with additional restrictions: you cannot simply regenerate the solution to include your modifications, as the generated solution is actually a feature-based model of the site definition and the feature IDs will be different for all components, including content types. So after you have delivered this site definition product to a customer you pretty much have to make changes directly to the already generated solution files, or figure out some model for merging in the changes (I did this, btw, with a macro that merged the newly generated solution with the already deployed one, and it was still a pain).

So what I chose as a deployment model for sites that are products is site templates instead. There are of course problems with this model to be worked around, but it is easier than site definitions.

Some problems with using Site Templates as released product sites are as follows:

  • In order to include custom pages and image libraries you must check the Include Content box when generating the site template, and this can mean junk makes it into your sites.
  • Dataview web parts fail, as they are still pointing at the original list ids and not the newly generated ones.
  • If someone modifies the site definition that your site template is based on, you could end up with a broken site.
  • Updates to released sites are more complicated, as existing sites must be recreated and the content restored.

How to go about using site templates as your deployment model:

  1. You cannot simply have your customers add your template to their site collection because of the problems listed above. You must use a site collection feature to deploy your site template, so create an empty SharePoint project and add the required feature.xml files to allow for feature deployment.
  2. Add a new VS cab file project to your solution and add in all the files that are in your site's .stp file (to get at them, rename the .stp file to .cab and open it in explorer view). Then have the output of the cab project be a new .stp file that your feature from step 1 will use.
  3. Edit the manifest.xml file in the project from step 2 to remove any data in document libraries or lists that you don't want your customers to receive.
  4. Create a feature event class in the project from step 1 to handle the site creation and the cleanup of the resulting site. Here is a function I created to handle creating a site from the file:

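using System;
using System.IO;
using Microsoft.SharePoint;
using Microsoft.SharePoint.Navigation;

// Creates a sub-site of baseWeb from an .stp file on disk.
// Returns the new site, or null if a site already exists at siteRelativePath.
public static SPWeb CreateWebFromTemplate(SPWeb baseWeb, String templatePath, String templateName,
    String siteRelativePath, String siteTitle, String siteDescr, Boolean addToParentNav)
{
    const uint lcid = 1033; // English (US), hard-coded as in the original

    // Check Exists as well as null before deciding the site is missing.
    SPWeb tWeb = baseWeb.Webs[siteRelativePath];
    SPWeb newWeb = null;
    if (tWeb == null || !tWeb.Exists)
    {
        SPList wtcl = baseWeb.GetCatalog(SPListTemplateType.WebTemplateCatalog);

        // First remove any existing copy of the template from the site template gallery.
        try
        {
            if (wtcl.RootFolder.Files[templateName] != null)
            {
                wtcl.RootFolder.Files[templateName].Delete();
            }
        }
        catch (Exception)
        {
            // Deliberately ignored: the lookup fails when the template was never uploaded.
        }

        // Upload the template, then look it up by its name without the file extension.
        byte[] b = File.ReadAllBytes(templatePath);
        SPFile spf = wtcl.RootFolder.Files.Add(templateName, b);
        spf.Update();
        SPWebTemplateCollection wtc = baseWeb.Site.GetCustomWebTemplates(lcid);
        if (templateName.Contains("."))
            templateName = templateName.Substring(0, templateName.IndexOf("."));
        SPWebTemplate spwtemplate = wtc[templateName];

        newWeb = baseWeb.Webs.Add(siteRelativePath, siteTitle, siteDescr, lcid, spwtemplate, false, false);
        newWeb.NoCrawl = true; // exclude the new site from search crawls

        if (addToParentNav)
        {
            // Only add a top navigation entry if one does not already point at the site.
            Boolean addIt = true;
            for (int i = 0; i < baseWeb.Navigation.TopNavigationBar.Count; i++)
            {
                if (String.Equals(baseWeb.Navigation.TopNavigationBar[i].Url, siteRelativePath,
                        StringComparison.OrdinalIgnoreCase))
                {
                    addIt = false;
                    break;
                }
            }
            if (addIt)
            {
                SPNavigationNode spwMenuItem = new SPNavigationNode(siteTitle, siteRelativePath, true);
                baseWeb.Navigation.TopNavigationBar.AddAsLast(spwMenuItem);
                baseWeb.Update();
            }
        }
    }
    return newWeb;
}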

 

  5. The next step in the activation class is to clean up the site. I used some of the code that was generated by the solution generator and retrofitted it. It depends on a provisioner.xml file to handle the list id remapping, but it could be made more automatic by examining the manifest.xml file in the cab instead. Here is the class to do this: SiteCleanup.cs

  6. The above steps will be fine for first deployments. For updating existing customers you will need to back up all the list and document library content and reapply it after the site is recreated. This step can also be part of the feature deactivation and activation process.
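The list id remapping from step 5 boils down to swapping the old list GUIDs in page markup for the ones generated in the new site. Here is a minimal Python sketch of that idea only. It is not the SiteCleanup.cs class, and `remap_list_ids`, the sample markup and the GUIDs are all made up for illustration; in practice the old-to-new mapping would come from something like the provisioner.xml or manifest.xml file.

```python
import re

# Matches a bare GUID such as 12345678-90ab-cdef-1234-567890abcdef (braces not included).
GUID_PATTERN = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def remap_list_ids(markup, id_map):
    """Replace every old list GUID found in markup with its new GUID.

    id_map maps old GUID -> new GUID; braces and letter case are ignored on lookup.
    GUIDs that are not in the map are left untouched.
    """
    lookup = {old.strip("{}").lower(): new.strip("{}") for old, new in id_map.items()}

    def swap(match):
        found = match.group(0)
        replacement = lookup.get(found.lower())
        if replacement is None:
            return found
        # Keep the casing style of the GUID that was in the page.
        return replacement.upper() if found.isupper() else replacement.lower()

    return GUID_PATTERN.sub(swap, markup)


sample_markup = (
    '<ListID>{11111111-2222-3333-4444-555555555555}</ListID> '
    'ListName="{AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE}"'
)
sample_map = {
    "{aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee}": "{12345678-90ab-cdef-1234-567890abcdef}",
}
remapped = remap_list_ids(sample_markup, sample_map)
```

Matching only the bare GUID and leaving the surrounding braces alone means the same function works whether the page stores the id with or without braces.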

If there is continued interest in the concept of this post I will continue it in greater detail. So please comment.

 

Resources that you always need open while SharePointing

I have noticed in the last year of working on SharePoint 2007 that there are certain resources I always have to keep opening up for reference during development. I have compiled them into a simple list that you can add as a tab group, so that all you need to do is open the tab group in IE and all the sites will open at once and be ready for you.

These are what are in the list, you could just open them from here too:

Now you can always refer to this page and click each one to open OR you can follow these instructions and add a new favorites group in your browser which will allow you to just open them ALL at once in one browser with multiple tabs. Easy.

Step 1: Copy this link http://www.sharepointsearch.com/pages/spdevrefs.htm to your clipboard.

Step 2: Create a new folder in your favorites to hold this list. If you add it to Links the list will appear in your browser tool pane!!

Step 3: Open the favorites import wizard in your browser.

Step 4: Select the Address box, paste in the URL and click Next.

Step 5: Select the favorites folder you added in step 2.

Step 6: Right click the favorites folder and select "Open In Tab Group".

Sample of a custom field control based off of SPFieldMultiColumn

Here is a useful sample: a password box that hides the typed-in password.

The files are Password.Field.cs and Password.FieldControl.cs.

Just add them to a project, change the namespace, sign the dll and create the field.xml file, and you are good to go.

In developing MANY custom field controls I have noticed some bugs and quirky behaviour to watch out for. Maybe it's just me, but I will note them here anyway:

  • Default properties as specified in the fields.xml file are not displayed correctly when editing an already created field. When you add a field to a list you can set the custom properties fine, but when you reopen that field/column to edit it, the value shown in that custom property is some bogus statement. If you then resave the field it will overwrite what you put in the first time with that garbage. So you have to remember what you put in and re-enter it every time you want to edit the field.
  • Required field settings do not work when set on the content type and not on the field itself. The field displays on the form as required, with the red asterisk, but if you examine the Required property of the control it is always false. Looking into this now.
SharePoint Search Security explained with focus on the BDC and protocol handlers.

This is a repost of an original article I wrote for the SharePoint search site:

Business Data Catalog (BDC) – the BDC is a generic set of components provided with the SharePoint Portal that makes it easier to integrate external LOB (line of business) systems, like a CRM solution, into SharePoint. The components the BDC provides are multiple configurable Web Parts for listing and displaying the data, a search indexer (protocol handler) and a search query security trimmer. The BDC operates off XML definition files, which define the connections and entities in the LOB system. As an example, say I want to integrate the standard MS Adventure Works database into SP. I would write an XML file that would include the following overly simplified information:

  • Connection information – server, db, username, password
  • Entity Definition
    • Product
      • Finder Method (returns a list of instances) - SELECT ProductID, Name, ProductNumber, ListPrice FROM Product
      • Specific Finder Method (returns exactly one instance) - SELECT ProductID, Name, ProductNumber, ListPrice FROM Product Where ProductID = @ProductID
      • IDEnumerator Method (enumerator for search crawler – same as finder in this case) - SELECT ProductID, Name, ProductNumber, ListPrice FROM Product
      • Access Checker Method (for query time security) - SELECT CAST(Rights as bigint) FROM Customers WHERE CustomerId = @CustomerId and UserName = @currentuser;

From the above information (and other information not shown) SP would be able to display products in a list web part using the Finder method and display a selected product in a form using the Specific Finder method. It uses the IDEnumerator and Specific Finder methods to crawl the products and add them to the search index. At query time SP would take the returned search results and apply the Access Checker method to each item to see whether I have permission to access it. NOTE: the BDC does not support indexing unstructured data like files in a document management system.
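How the three methods cooperate can be sketched with Python's built-in sqlite3 module. This is a simplified model, not BDC code: an in-memory SQLite database replaces SQL Server, the `ProductRights` table and all sample rows are invented for the example, the IDEnumerator query selects only the key, and `crawl` and `search` are hypothetical helper names.

```python
import sqlite3

# Simplified versions of the queries a BDC definition would carry.
ID_ENUMERATOR = "SELECT ProductID FROM Product ORDER BY ProductID"
SPECIFIC_FINDER = ("SELECT ProductID, Name, ProductNumber, ListPrice "
                   "FROM Product WHERE ProductID = ?")
ACCESS_CHECKER = ("SELECT Rights FROM ProductRights "
                  "WHERE ProductID = ? AND UserName = ?")


def build_demo_db():
    """Create an in-memory database with made-up products and rights."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Product (ProductID INTEGER PRIMARY KEY, Name TEXT,
                              ProductNumber TEXT, ListPrice REAL);
        CREATE TABLE ProductRights (ProductID INTEGER, UserName TEXT, Rights INTEGER);
        INSERT INTO Product VALUES (1, 'Road Bike', 'BK-R001', 1200.0);
        INSERT INTO Product VALUES (2, 'Helmet', 'HL-U509', 34.99);
        INSERT INTO ProductRights VALUES (1, 'alice', 1);
    """)
    return conn


def crawl(conn):
    """Crawl: enumerate the ids, then fetch each instance with the Specific Finder."""
    index = {}
    for (product_id,) in conn.execute(ID_ENUMERATOR).fetchall():
        row = conn.execute(SPECIFIC_FINDER, (product_id,)).fetchone()
        index[product_id] = " ".join(str(value) for value in row)
    return index


def search(conn, index, term, user):
    """Query: match against the index, then run the Access Checker per hit."""
    hits = [pid for pid, text in index.items() if term.lower() in text.lower()]
    allowed = []
    for pid in hits:
        row = conn.execute(ACCESS_CHECKER, (pid, user)).fetchone()
        if row is not None and row[0] != 0:
            allowed.append(pid)
    return allowed


conn = build_demo_db()
index = crawl(conn)
```

With this data, `search(conn, index, "bike", "alice")` returns product 1, while the same search for a user with no rights row returns nothing, even though the item is in the index.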

 

Protocol Handlers – protocol handlers are used by the SP indexer to connect to external and internal systems, crawl their data and add it to the query index. Custom protocol handlers implement the standard ISearchProtocol COM interface and are created by 3rd parties like Hummingbird and Interwoven to allow SP to index their proprietary data systems. Microsoft's internal protocol handlers (SharePoint, Lotus Notes, BDC, File System, HTTP, HTTPS) DO NOT implement the standard ISearchProtocol COM interface and are proprietary.


Security Trimmers – a new addition to the SharePoint Search system (added in the final release in November), the ISecurityTrimmer interface is at this time used primarily to apply security to BDC search results. Security trimmers are .NET dlls that are registered and associated with specific data in the search index. The SP query engine uses them to check the access rights of items right before they are returned to the user, and only for items that have a registered security trimmer. Since it is a new interface only a few exist, including the one for BDC data, but many companies like Interwoven may be exploring them as a means to ensure real-time security on search results.


Discussion:
From the above definitions you should be able to get an idea of how the BDC works and what Protocol Handlers and Security Trimmers are, but a more detailed comparative discussion is definitely warranted.

The SP Search system implements two forms of security by which search results are trimmed:

The first form of security is by standard ACLs (access control lists), which is the most familiar as it is how the Windows file system determines whether you have access to a document or not. During the crawl process the ACLs of items being added to the index are determined and stored along with the item. The query engine uses these ACLs, which are stored in the search database, to determine quickly whether a user should be allowed to see that item in the results. This security method has been the standard for a while in the MS Search products and is very fast. When connecting new systems (like Documentum), a custom protocol handler would be created that knows how to map the security in Documentum to standard Windows ACLs. For instance, if a user in Documentum is allowed to access a particular document, then the protocol handler will need to map that user id to a valid Active Directory user, create a read-privilege ACL entry for that user and add it to the item's ACL. All users and groups would need to be mapped and added in the same way for each item to ensure proper security. Note: this security model has a flaw in that if the security changes, the change will not be picked up until the next incremental crawl, which may not happen for hours or days. Also, the BDC (Business Data Catalog) does NOT support ACL-based security, and prior to the addition of the next form of security, the real-time Security Trimmer, the BDC had no security for its search results.
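The crawl-time mapping and query-time check described above can be modelled like this. It is a Python sketch with hypothetical names (`build_acl`, `crawl_item`, `query`) and made-up accounts; real ACLs are Windows security descriptors, which are reduced here to sets of account names.

```python
def build_acl(external_readers, user_map):
    """Translate the source system's reader list into domain accounts.

    Readers with no mapping are dropped, so they get no access through the index.
    """
    return frozenset(user_map[name] for name in external_readers if name in user_map)


def crawl_item(index, uri, text, external_readers, user_map):
    """Crawl time: store the mapped ACL alongside the item's text."""
    index[uri] = {"text": text, "acl": build_acl(external_readers, user_map)}


def query(index, term, domain_accounts):
    """Query time: a cheap set intersection against the stored ACL, no call to the source.

    domain_accounts is the user's own account plus the groups it belongs to.
    """
    return sorted(
        uri for uri, entry in index.items()
        if term in entry["text"] and entry["acl"] & domain_accounts
    )


# Made-up mapping from source-system principals to Active Directory accounts.
user_map = {"jsmith": "CORP\\jsmith", "legal": "CORP\\Legal Team"}
index = {}
crawl_item(index, "dctm://1", "contract draft", ["legal"], user_map)
crawl_item(index, "dctm://2", "contract final", ["jsmith"], user_map)
```

Because the ACL is frozen into the index at crawl time, revoking a user's rights in the source system changes nothing in `query` until the item is crawled again, which is exactly the flaw noted above.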

The second form of security is called real-time Security Trimming and is completely separate from the ACL-based security above. It can be applied in conjunction with ACL security to provide an added check that security changes since the last crawl are adhered to or, as in the case of the BDC, it can be the primary and only means of security. Basically, after search results are compiled during a query, the items are individually compared to a set of rules to determine whether they have a Security Trimmer registered. The ones that do (as with BDC items) are grouped into an array and passed into their respective trimmers. The Security Trimmers validate the security of the items and return an array with a simple true or false for each item, indicating whether it is allowed or not. Depending on how they were written, Security Trimmers can be a performance bottleneck, as in the case of the BDC one, where each and every item is individually validated, which could mean hundreds of database queries. As search results can combine results from multiple sources, there may be more than one Security Trimmer involved in each query.
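The grouping and true/false array flow can be sketched as follows. This Python model uses hypothetical names throughout (`PerItemTrimmer`, `BatchedTrimmer`, `trim`) and matches items to trimmers by URL prefix as a stand-in for the registration rules; the real trimmers are .NET classes implementing ISecurityTrimmer.

```python
class PerItemTrimmer:
    """Validates each item with its own backend call, like the BDC trimmer described above."""

    def __init__(self, rights):
        self.rights = rights        # URL -> set of user names allowed to see it
        self.backend_calls = 0      # counts simulated database round trips

    def _has_access(self, url, user):
        self.backend_calls += 1
        return user in self.rights.get(url, ())

    def check_access(self, urls, user):
        return [self._has_access(url, user) for url in urls]


class BatchedTrimmer(PerItemTrimmer):
    """Alternative design: resolve the whole array in one simulated round trip."""

    def check_access(self, urls, user):
        self.backend_calls += 1
        allowed = {url for url in urls if user in self.rights.get(url, ())}
        return [url in allowed for url in urls]


def trim(results, trimmers, user):
    """Group results by registered trimmer, collect verdicts, keep the allowed items."""
    verdict = {}
    for prefix, trimmer in trimmers.items():
        batch = [url for url in results if url.startswith(prefix)]
        if batch:
            verdict.update(zip(batch, trimmer.check_access(batch, user)))
    # Items without a registered trimmer pass through untouched.
    return [url for url in results if verdict.get(url, True)]


rights = {"bdc://product/1": {"alice"}, "bdc://product/2": {"bob"}}
results = ["bdc://product/1", "http://intranet/page", "bdc://product/2"]
per_item = PerItemTrimmer(rights)
batched = BatchedTrimmer(rights)
per_item_visible = trim(results, {"bdc://": per_item}, "alice")
batched_visible = trim(results, {"bdc://": batched}, "alice")
```

Both trimmers hide the second product from alice and leave the untrimmed intranet page alone, but the per-item version makes one backend call per BDC result while the batched one makes a single call for the whole array.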

 
