

Configure the Crawler Surface

A crawler is the part of a search engine that automatically browses web content to build a search index. The Crawler Surface is a read-only, web-based interface to the CA Service Desk Manager application through which external search engines can discover information, using Java Server Pages (JSP) technology. The Crawler Surface provides the information in plain text and also provides individual hyperlinks to tickets and knowledge documents; follow the hyperlinks for the details of each item. Microsoft SharePoint 2010 and SharePoint 2013 can be used to crawl the Crawler Surface. The main component of the Crawler Surface is the FSCrawl Servlet.

Note: The CA Service Desk Manager information content that is exposed to a crawler is customizable through a configuration file. No form changes are required.


How to Configure the Crawler Surface for SharePoint

Follow these steps:

  1. Complete the Prerequisites
  2. Create the CA SDM User for the Crawler Surface
  3. Configure the Tomcat Remote IP Address Filter
  4. Configure the Crawler Surface User ID
  5. Configure the SharePoint Crawler

Complete the Prerequisites

Complete the following prerequisites before you configure the Crawler Surface for a Microsoft SharePoint Server:

Create the CA SDM User for the Crawler Surface

In the CA SDM configuration, create user identities to segregate information by tenant.

The process of creating a user ID for the Crawler Surface is the same as creating CA SDM user identities for regular users. In the CA SDM application, create a contact with the Access Type and Role set to Crawler. For more information about creating a contact, see the CA Service Desk Manager Administrator Online Help.

The Crawler Access Type and Role provide the user with read-only access to the CA SDM data.

Configure the Tomcat Remote IP Address Filter

Change the Tomcat Remote IP Address Filter setting to point to the remote system that hosts the SharePoint Server. Both IPv4 and IPv6 addresses are supported.

The Crawler Surface uses the Tomcat Remote IP Address Filter mechanism to control access to the CA SDM information. The Tomcat filter mechanism uses an IP address pattern (maintained by the CA SDM administrator) to match authorized IP addresses. By default, the Remote IP Address Filter is configured with the loopback address 127.0.0.1. For secure communication in a production environment, consider using SSL between the crawler and the Crawler Surface.
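The relevant <filter> entry in web.xml looks similar to the following sketch. This is an illustration only: the filter class and surrounding entries in your file may differ, and the second address in the allow pattern (192.0.2.15) is a placeholder for the SharePoint Server IP address. The existing <filter-mapping> entry is not shown.

<filter>
  <!-- Matches client IP addresses against the allow pattern -->
  <filter-name>Remote Address Filter</filter-name>
  <filter-class>org.apache.catalina.filters.RemoteAddrFilter</filter-class>
  <init-param>
    <!-- Regular expression: loopback address plus the SharePoint Server address -->
    <param-name>allow</param-name>
    <param-value>127\.0\.0\.1|192\.0\.2\.15</param-value>
  </init-param>
</filter>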

Follow these steps:

  1. Log in to the CA SDM server that hosts the Crawler Surface (which server depends on your CA SDM configuration).
  2. Open the web.xml file from the following CA SDM directory:
    $NX_ROOT\bopcfg\www\CATALINA_BASE_FS\webapps\fscrawl\WEB-INF
    
  3. Take a backup of the web.xml file.
  4. Find the <filter-name>Remote Address Filter</filter-name> section. Change the pattern in <param-value> in the <filter>.

    This parameter allows a range of IP address patterns to be specified. Add a pattern that grants access from the remote machine that hosts the SharePoint Server. For more information about the Tomcat Remote IP Address Filter, see the Apache Tomcat documentation.

  5. Save the XML file.
  6. Restart the Federated Search Tomcat.

    Note: While the Federated Search Tomcat is restarting, you cannot perform a Federated Search.

  7. The Tomcat Remote IP Address Filter value is changed and the Crawler Surface can now be accessed from the remote machine.

    Note: Do not access the Crawler Surface from an IP address that is not configured in the filter; such requests are redirected to the CA SDM web UI.

  8. The Crawler Surface is accessed through a URL like any other web application. To validate the Crawler Surface configuration, enter the following URL in a browser:
    http://<sdmhostname>:<FS_TOMCAT_PORT>/fscrawl/index.jsp?farm=<FarmName>
    

    Note: All elements of the URL are case-sensitive.

    <sdmhostname>

    Specifies the fully qualified domain name of the server that hosts the Federated Search Tomcat.

    <FS_TOMCAT_PORT>

    Specifies the port number that was assigned to the Federated Search Tomcat when you installed and configured the CA SDM application. This port is used to access the Federated Search Tomcat.

    fscrawl

    Specifies the name of the servlet. Servlet names are case-sensitive. Always specify fscrawl.

    index.jsp

    Specifies the name of the page that can be used for testing the Crawler Surface configuration.

    <FarmName>

    Each <farm> in crawler_surface_config.xml contains an <sdm_user> entry, which must match the authenticated user. If user authentication fails, the request is redirected to the CA SDM web UI. The <sdm_user> controls data access at the CA SDM Object Manager level. This farm-level security layer prevents one tenant from accessing another tenant's data.

Configure the Crawler Surface User ID

Configure the Crawler Surface user ID in the crawler_surface_config.xml file so that the Crawler Surface can access CA SDM data on behalf of that user.

Important! Know the language settings of your browser. SharePoint is sensitive to the language used in the search request.

Follow these steps:

  1. Log in to the CA SDM server that hosts the Crawler Surface (which server depends on your installation configuration).
  2. Open the crawler_surface_config.xml file from the following CA SDM directory:
    $NX_ROOT\bopcfg\www\CATALINA_BASE_FS\webapps\fscrawl\WEB-INF
    
  3. Take a backup of the crawler_surface_config.xml file.

    Important! Make all XML file modifications in a test environment before porting the final changes to a production server.

  4. In the original file, locate <sdm_user>CHANGE_THIS</sdm_user> under each <farm> section and replace CHANGE_THIS with the crawler user ID that you created earlier, as shown in the sketch after these steps.
  5. You can further modify and restrict the XML object attributes depending on your requirements.

    Note: For more information, see Crawler Surface XML Configuration File.

  6. Save the XML file.
  7. Restart the Federated Search Tomcat to reload the file.

    The Crawler Surface XML file is modified.
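For example, if the crawler contact that you created earlier has the user ID crawler_user (a placeholder name used here for illustration), the edited entry in each <farm> section looks similar to the following sketch; the farm name is also a placeholder and the other elements are omitted:

<farm>
  <name>FarmA</name>                            <!-- placeholder farm name -->
  <!-- <data_sets> section omitted -->
  <sdm_user>crawler_user</sdm_user>             <!-- replaces CHANGE_THIS -->
</farm>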

Crawler Surface XML Configuration File

The crawler_surface_config.xml file contains the following XML sections.

<objects>

Specifies the objects and the attributes that the Crawler Surface exposes for each object. The <objects> section describes the layout of the detail page for each object type that is exposed to a crawler; it does not control the selection of individual records. The <objects> section is a collection of <object> sections.

Each object is defined in an <object> section. The following objects are specified by default:

KD

Specifies Knowledge Documents.

chg

Specifies Change Orders.

iss

Specifies Issues.

in

Specifies Incidents.

pr

Specifies Problems.

cr

Specifies Requests.

The XML file contains the following sections that create the <head> section of a detail page in CA SDM:

<name>

Specify the Majic object name of the exposed object.

<note>

Specify a short description of the object. This element is only for documentation purposes; the Crawler Surface ignores it.

<last_mod_dt>

Specify the attribute name that stores the Last Modified Date and Time. This timestamp is exposed to the search engine crawler so that it can determine whether the record was updated. Many crawlers use this timestamp during an incremental crawl: an updated timestamp signals that the record changed after it was last crawled, and the crawler skips records that have not been updated since the last crawl.

<title>

Specify the attribute that is used for the title of the detail page. The search engines use this element as the title of the document that is returned in search results. This element generates an HTML <title> tag in the <head> of the detail page. For Knowledge Documents, the title defaults to the Title of the Knowledge Document. For Incidents, Problems, Requests, Change Orders, and Issues, the Summary is used as the title.

<meta_data>

Specify one or more properties that are exposed as metadata. Metadata allows a search engine to store extra characteristics of the document in its index. Metadata is not searched directly; instead, it is used to filter search results. This section generates HTML <meta> tags in the <head> of the detail page.

Each entry in the <meta_data> section contains one or more <property> entries. Each <property> element consists of a <name> element and a <content> element.

<name>

Specify the name of the metadata property.

<content>

Specify the attribute of the object that is used as the value for the metadata.

Together, each <name> and <content> pair of a <property> generates an HTML <meta> tag. The search engine crawlers use the following two metadata properties by default:

Description

Specify the metadata property of a search engine that stores a short summary of the document.

Author

Specify the author of the document.

The CASDMTENANT metadata property is also configured by default for each object. This is a CA SDM-specific metadata property. When CA SDM is configured for multi-tenancy, the Crawler Surface uses this property to expose the Tenant name of the object to the search engine crawler. Later, during a Federated Search, the results obtained from the search engine are filtered based on this metadata property.
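A minimal sketch of the head-related elements for one <object> entry follows. The element nesting mirrors the descriptions above, but the attribute names (summary, description, assignee.combo_name, tenant.name) and the property values are illustrative assumptions, not a copy of the shipped crawler_surface_config.xml:

<object>
  <name>cr</name>                               <!-- Majic object name: Requests -->
  <note>Request detail page layout</note>       <!-- documentation only; ignored -->
  <last_mod_dt>last_mod_dt</last_mod_dt>        <!-- attribute that holds the last-modified timestamp -->
  <title>summary</title>                        <!-- attribute used as the page title -->
  <meta_data>
    <property>
      <name>Description</name>
      <content>description</content>
    </property>
    <property>
      <name>Author</name>
      <content>assignee.combo_name</content>
    </property>
    <property>
      <name>CASDMTENANT</name>
      <content>tenant.name</content>
    </property>
  </meta_data>
</object>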

The XML file contains the following sections that create the <body> section of a detail page in CA SDM:

<additional_attributes_to_index>

Indicates a list of attributes from the object that the Crawler Surface exposes. Separate multiple entries with a comma and a space. For example, PROBLEM, RESOLUTION, SD_ASSET_ID.name.

<activity_logs>

Indicates information displayed by the Crawler Surface from Activity Logs for objects that have Activity Logs. The <activity_logs> section contains the <object>, <select_criteria>, <rel_attr>, and <attributes> elements.

<object>

Specifies the object name that contains the Activity Log entries for the object. For example:

  • The activity log object for Incidents, Problems, and Requests is alg.
  • The activity log object for Change Orders is chgalg.
  • The activity log object for Issues is issalg.
  • The activity log object for Knowledge Documents is O_COMMENTS.
<select_criteria>

Allows you to filter which Activity Log entries are exposed. This element helps increase the relevance of your search results by excluding entries that contain frequently occurring fixed text. For example, the <select_criteria> for chgalg contains the following Majic where clause:

"type IN ('ST', 'UPD_RISK', 'CB', 'RS', 'LOG', 'TR', 'ESC' ,'NF', 'UPD_SCHED')"

This clause includes only the Activity Log entries that allow a user to enter comments and excludes entries with fixed text, such as Initial or Attached Document.

<rel_attr>

Specifies how an Activity Log entry relates to its parent object. The <rel_attr> subsection contains <parent_obj_attr> and <join_attr> elements.

<parent_obj_attr>

Indicates the attribute of the Activity Log object that contains an SREL (foreign key pointer) to the parent object. For example, change_id is this attribute for the chgalg activity log object.

<join_attr>

Indicates the attribute of the parent object (the Rel Attr) that the SREL in <parent_obj_attr> points to. You can verify both values by using the following command:

bop_sinfo -df chgalg

The output shows that change_id is SREL -> chg.id; for the Issues activity log, the corresponding attribute is SREL -> iss.persistent_id.
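Putting the Activity Log elements together for Change Orders gives a sketch similar to the following. The structure follows the element descriptions above; the exact nesting and the <attributes> value (description) are illustrative rather than copied from the shipped file:

<activity_logs>
  <object>chgalg</object>
  <select_criteria>type IN ('ST', 'UPD_RISK', 'CB', 'RS', 'LOG', 'TR', 'ESC', 'NF', 'UPD_SCHED')</select_criteria>
  <rel_attr>
    <parent_obj_attr>change_id</parent_obj_attr>  <!-- SREL -> chg.id -->
    <join_attr>id</join_attr>                     <!-- Rel Attr on the parent chg object -->
  </rel_attr>
  <attributes>description</attributes>            <!-- assumed attribute that holds the log text -->
</activity_logs>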

<attachments>

The attachments subsection allows you to expose attachments to the crawler so that their content can be indexed. The <attachments> section is only allowed for objects that have Attachments.

The Crawler Surface handles attachments in a special manner: it exposes a hyperlink that the crawler follows to download the Attachment from CA SDM. Later, during a Federated Search, if an Attachment is included in the search results, a user who clicks the hyperlink navigates to the parent object instead of the Attachment.

The <attachments> section contains <object>, <rel_attr>, <attmnt_id>, and <is_parent_updated> elements.

<object>

This element specifies the Majic object that links the Attachment to its parent object.

<rel_attr>

This subsection works the same as it does in Activity Logs. It specifies how the linking object relates to its parent object, thereby linking the parent object to the attachment.

<attmnt_id>

This element specifies the attribute of this linking object that points to the attachment.

<is_parent_updated>

Tells the Crawler Surface how to expose the last-modified date for the object. For some objects, such as Knowledge Documents (KDs), adding an attachment does not update the last-modified date of the document. The last-modified date is important when the search engine performs an incremental crawl.

<configuration_items>

Used for objects that contain a list of Configuration Items. This section contains the <object>, <rel_attr>, and <attributes> elements.

<object>

Works the same as it does in Activity Logs and Attachments.

<rel_attr>

Works the same as it does in Activity Logs and Attachments.

<attributes>

This element works the same as it does in Attachments.

<multi-farm_datasets>

The <multi-farm_datasets> section specifies how records are selected. It is a collection of <farm> sections.

<farm>

Each <farm> section controls the CA SDM information that is exposed to a crawler. When a crawler is configured, the farm name is specified in the crawler URL, and only the information that the <farm> section specifies is exposed to that crawler. Each <farm> section contains <name>, <data_sets>, and <sdm_user> elements.

<name>

Specifies the name of the farm, which is referenced by the farm parameter in the crawler URL.

Note: This value is case-sensitive.

<data_sets>

Specify the exposed objects and how their records are selected. This subsection contains one or more <object> elements. Each object element contains a <name> and a <select_criteria> element.

<name>

References the <object> defined in the <objects> section.

<select_criteria>

This element specifies a Majic where clause that is used to select the records of the object.

<sdm_user>

This element specifies the CA SDM user ID that is used when accessing this farm. The user ID must have the Access Type and Role set to Crawler.
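The pieces above combine into a structure similar to the following sketch. The farm name, the select_criteria clause (active = 1), and the user ID are placeholders; only the element names come from the descriptions in this section:

<multi-farm_datasets>
  <farm>
    <name>FarmA</name>                          <!-- referenced as farm=FarmA in the crawler URL -->
    <data_sets>
      <object>
        <name>cr</name>                         <!-- references the cr entry in the <objects> section -->
        <select_criteria>active = 1</select_criteria>
      </object>
    </data_sets>
    <sdm_user>crawler_user</sdm_user>           <!-- contact with Access Type and Role set to Crawler -->
  </farm>
</multi-farm_datasets>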

sdm_domsrvr_name

Specifies the Object Manager that the Crawler Surface uses. For a large amount of indexing data, dedicate an Object Manager to the Crawler Surface. The default is domsrvr.

sharepoint_properties_file

This value is the name of the SharePoint properties file available by default in the CA SDM directory:

NX_ROOT\CATALINA_BASE_FS\lib

The SharePoint properties file contains configuration parameters that both Federated Search and the Crawler Surface use when CA SDM is configured for multi-tenancy.

Note: If CA SDM is configured for Multi-Tenancy, update the sharepoint_version parameter in this file to reflect your version of SharePoint.

<list_form_number_of_records_per_object>

Specifies the number of hyperlinks that the Crawler Surface presents on a list page for an object.

<send_wait_timeout>

This value controls the number of seconds that the Crawler Surface waits for a response from the Object Manager before timing out.
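Assuming that these four settings appear as individual entries in crawler_surface_config.xml (their exact position and form in your file may differ), a sketch with example values looks like the following; the file name and the numeric values are illustrative, not recommended defaults:

<sdm_domsrvr_name>domsrvr</sdm_domsrvr_name>                                    <!-- default Object Manager -->
<sharepoint_properties_file>sharepoint.properties</sharepoint_properties_file>  <!-- placeholder file name -->
<list_form_number_of_records_per_object>100</list_form_number_of_records_per_object>
<send_wait_timeout>300</send_wait_timeout>                                      <!-- seconds -->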

Configure the SharePoint Crawler

Configure Crawlers to crawl and search for content in SharePoint.

The crawler is a multi-threaded application capable of high throughput and can sometimes have a negative impact on CA SDM performance. To improve performance, ensure that you have considered the following:

Follow these steps:

  1. Create the Content Source in SharePoint
  2. Create Crawl Rules
  3. Start a Crawl in SharePoint
  4. Configure the Metadata in SharePoint
  5. Verify the Crawler Data in SharePoint

Create the Content Source in SharePoint

Create content sources to identify the type of content that the SharePoint crawler processes.

Note: The names of SharePoint specific settings can vary depending on the SharePoint version you are using. For more information about creating content sources in SharePoint, see the Microsoft SharePoint Documentation.

Follow these steps:

  1. Log in to the MS SharePoint Central Administration console.
  2. Click Manage Services Application, Search Service Application.
  3. Click Content Sources to create a new content source.
  4. In the Name field, enter a name for the content source, such as CA SDM.
  5. Set the Content Source Type to Web Sites.
  6. Enter the following URL in Start Address:
    http://<sdmhostname>:<FS_TOMCAT_PORT>/fscrawl/listObject.jsp?farm=<Farm Name>
    
  7. To prevent the crawler from straying away from the Crawler Surface, consider limiting the Page Depth to 2 and the Server Hops to 1. These are the minimum recommended values that allow crawling of Attachments.
  8. Click Save.

Create Crawl Rules

The crawl rules define how SharePoint crawls the Crawler Surface URLs. Define the following crawl rules:

Follow these steps:

  1. Log in to the MS SharePoint Central Administration console.
  2. Click Manage Services Application, Search Service Application.
  3. Click Crawl Rules and create a new crawl rule.
  4. Enter the following URL pattern for the crawl rule:
    http://<sdmhostname>:<FS_Tomcat_Port>/fscrawl/*farm=<farm-name>*
    

    Important! The Crawler Surface URL is case-sensitive, and SharePoint changes uppercase hostnames to lowercase. For SharePoint 2010, ensure that you select the Match Case check box.

  5. Select 'Include all items in this path' to configure the crawler.
  6. Select 'Crawl complex URLs (URLs that contain a question mark - ?)'.
  7. Select 'Specify a different content access account'.
  8. Enter the CA SDM user account name and password for the Crawler Surface.
  9. Create a second crawl rule for the CA SDM attachments:
    http://<sdmhostname>:<FS_TOMCAT_PORT>/CAisd/*
    
  10. Specify the default authentication:

Note: The Crawler Surface uses Basic Authentication. The CA SDM Repository Daemon uses proprietary BOPSID security, which is not directly supported by Microsoft SharePoint. Specify any user ID and password, or choose Anonymous Access if that option is available in your version of SharePoint.

The Microsoft SharePoint Crawl rule is created.

Start a Crawl in SharePoint

Start a full or incremental crawl of the content sources in SharePoint to index the search content.

Follow these steps:

  1. Navigate to the Microsoft SharePoint Central Administrator page.
  2. Click Manage Services Application, Search Service Application.
  3. Click Content Sources. Select the content source that you configured for the SharePoint Crawler Surface.
  4. Select Start Full Crawl or Start Incremental Crawl.

    A full crawl crawls the entire content under a content source. Full crawls take more time and resources to complete than incremental crawls.

    In an incremental crawl, the index remains intact, and the crawler crawls only the content that was added or modified since the last successful crawl. For more information, see the Microsoft SharePoint Documentation.

Configure the Metadata in SharePoint

Note: This topic is applicable to CA SDM multi-tenancy environments.

When the crawler encounters CA SDM metadata, it stores the metadata in SharePoint as crawled properties. The SharePoint crawler discovers the metadata and creates the crawled properties during the initial full crawl of the CA SDM Crawler Surface.

The metadata passes extra information to a crawler through the <meta> tags in the <head> section of the detail pages. This information is available for searching and filtering. When CA SDM is configured for multi-tenancy, the Crawler Surface exposes only the tenant metadata information.

When you perform a Federated Search, the tenant name is passed to the search engine to filter the results appropriately.

Follow these steps:

  1. Log in to the MS SharePoint Central Administration console.
  2. Click Manage Services Application, Search Service Application, and then Metadata (SharePoint 2010) or Search Schema (SharePoint 2013).
  3. Ensure that the SharePoint crawling is successful on the CA SDM data.
  4. Click Managed Properties.
  5. Click New Managed Property.
  6. Enter CASDMTENANT as the Property Name.

    CASDMTENANT indicates the tenant name for the CA SDM object. The subtenant information is not displayed.

  7. Select Text as Property Type.
  8. Scroll down. Click Add a Mapping.
  9. Search for the CASDMTENANT Crawl Property and select it.
  10. Click OK to save the new Managed Property.

    The metadata in SharePoint is configured.

Verify the Crawler Data in SharePoint

Verify the crawled data in SharePoint by confirming that search results are displayed in CA SDM.

Follow these steps:

  1. Log in as CA SDM Service Desk Administrator.
  2. Click the Knowledge Management tab for a new or existing ticket.
  3. Select the SharePoint search source.
  4. Enter the search key. Click Search.

    The search results from the crawled data are displayed.

Troubleshooting

The Crawler Surface writes the usual array of log files. Use the following steps to enable debug logging and to locate and correct configuration errors:

  1. If you want to enable the debug mode for Federated Search, navigate to the following CA SDM directory:
    $NX_ROOT\bopcfg\www\CATALINA_BASE_FS\webapps\cafedsearch\WEB-INF
    
  2. Open the log4j.properties file and change the logging level from info to debug.
  3. To enable debug mode for fscrawl, navigate to the following CA SDM directory:
    $NX_ROOT\bopcfg\www\CATALINA_BASE_FS\webapps\fscrawl\WEB-INF
    
  4. Open the log4j.properties file and change the logging level from info to debug.
  5. To correct syntax errors that are encountered while configuring the Crawler Surface, open the jfscrawl log file from the following CA SDM directory:
    $NX_ROOT\logs
    
  6. If you locate any syntax errors, correct the XML file and restart the Federated Search Tomcat. The log is located in the CA SDM directory:
    $NX_ROOT\logs\jfscrawl.log 
    

    For example, if a <meta_data> tag is accidentally corrupted, then the log indicates the following error:

    08/06 15:43:52.624 [pool-2-thread-1] ERROR FSCrawlApplicationListener 302 XmlException::Problem loading config_file::C:\PROGRA~2\CA\SERVIC~1\bopcfg\www\CATALINA_BASE_FS\webapps\fscrawl\WEB-INF\crawler_surface_config.xml:274:8: error: </meta_dataxxxxx> does not close tag <meta_data>
    
    08/06 15:43:52.625 [pool-2-thread-1] ERROR FSCrawlApplicationListener 144 crawler_surface_config.xml could not be loaded, cannot read.
    
  7. If there are no syntax errors, the following message is displayed:
    08/06 15:46:27.924 [pool-2-thread-1] INFO FSCrawlApplicationListener 58 fscrawl context had been loaded successfully.
    
  8. Correct any other errors, some of which do not show up until CA SDM is accessed through the Crawler Surface.

    For example, suppose an unknown attribute xxxxx is requested to be exposed for Incidents in the <additional_attributes_to_index> element of crawler_surface_config.xml. The Crawler Surface application does not detect the error. However, when the Crawler Surface sends the request to the Object Manager, the error is detected and reported in the stdlog.x file as follows:

    08/06 15:51:23.92 SDMSERVER domsrvr 10860 ERROR domset.c 8049 Unknown attribute "xxxxx" requested from domset MLIST_STATIC of factory
    
  9. Use the bop_sinfo -d command to identify the valid attribute names and resolve the error.
  10. Modify the crawler_surface_config.xml file.
  11. Restart the Federated Search Tomcat.

    The Crawler Surface objects are configured without any errors.