Rudra's .NET and SQL Server Blog: August 2009

Thursday, August 27, 2009

Full-Text Search (SQL Server)

SQL Server 2008 provides the functionality for applications and users to issue full-text queries against character-based data in SQL Server tables. Before full-text queries can be run on a given table, the database administrator must create a full-text index on the table. The full-text index includes one or more character-based columns in the table. These columns can have any of the following data types: char, varchar, nchar, nvarchar, text, ntext, image, xml, varbinary, or varbinary(max). Each full-text index indexes one or more columns from the base table, and each column can have a specific language. Beginning in SQL Server 2008, full-text search supports more than 50 diverse languages, such as English, Spanish, Chinese, Japanese, Arabic, Bengali, and Hindi. For a complete list of the supported full-text languages, see sys.fulltext_languages (Transact-SQL).

For each supported language, SQL Server provides language-specific linguistic components, including a word breaker and stemmer and an empty thesaurus file. For each full-text language, SQL Server also provides a file in which you can optionally define language-specific synonyms to extend the scope of search queries (a thesaurus file). In addition, beginning in SQL Server 2008, a system stoplist is provided. To support specific language or business scenarios, you can alter the system stoplist by adding and removing stopwords (also known as noise words), and you can create additional stoplists as needed.

For writing full-text queries, SQL Server provides a set of full-text predicates (CONTAINS and FREETEXT) and rowset-valued functions (CONTAINSTABLE and FREETEXTTABLE). Using these, applications and users can perform a variety of types of full-text searches, such as searching on a single word or phrase (and optionally ranking the result set), searching on a word or phrase close to another word or phrase, or searching on synonymous forms of a specific word.

Note: Full-text search is an optional component of the SQL Server Database Engine. For more information, see Installing SQL Server 2008.

How to Configure a Database for Full-Text Searching
For any scenario, a database administrator performs the following basic steps to configure table columns in a database for full-text search:

1. Create a full-text catalog.

2. On each table that you want to search, create a full-text index by:

a. Identify each text columns that you want to include in the full-text index.

b. If a given column contains documents stored as binary data (varbinary, varbinary(max), or image data), you must specify a table column (the type column) that identifies the type of each document in the column being indexed.

c. Specify the language that you want full-text search to use on the documents in the column.

d. Choose the change-tracking mechanism that you want to use on the full-text index to track changes in the base table and its columns.

Full-text search supports multiple languages through the use of the following linguistic components: word breakers and stemmers, stoplists that contain stopwords (also known as noise words), and thesaurus files. Thesaurus files and, in some cases, stoplists require configuration by a database administrator. A given thesaurus file supports all full-text indexes that use the corresponding language, and a given stoplist can be associated with as many full-text indexes as you want.

After the columns have been added to a full-text index, applications and users can run full-text queries on the text in the columns. These queries can search for any of the following:

1. One or more specific words or phrases (simple term)

2. A word or a phrase where the words begin with specified text (prefix term)

3. Inflectional forms of a specific word (generation term)

4. A word or phrase close to another word or phrase (proximity term)

5. Synonymous forms of a specific word (thesaurus)

6. Words or phrases using weighted values (weighted term)

Full-text queries all use a small set of Transact-SQL predicates (CONTAINS and FREETEXT) and functions (CONTAINSTABLE and FREETEXTTABLE). However, the search goals of a given business scenario influence the exact structure of the full-text queries;

for example:

a. e-business—searching for a product on a website:
SELECT product_id FROM products WHERE CONTAINS(product_description, ”Snap Happy 100EZ” OR FORMSOF(THESAURUS,’Snap Happy’) OR ‘100EZ’) AND product_cost<200

b. Recruitment scenario—searching for job candidates that have experience working with SQL Server:
SELECT candidate_name,SSN FROM candidates WHERE CONTAINS(candidate_resume,”SQL Server”) AND candidate_division =DBA

Partitioned Views in SQL Server

Partitioned views allow the data in a large table to be split into smaller member tables. The data is partitioned between the member tables based on ranges of data values in one of the columns. The data ranges for each member table are defined in a CHECK constraint specified on the partitioning column. A view that uses UNION ALL to combine selects of all the member tables into a single result set is then defined. When SELECT statements referencing the view specify a search condition on the partition column, the query optimizer uses the CHECK constraint definitions to determine which member table contains the rows.

For example, a sales table that records sales for 1998 has been partitioned into 12 member tables, one for each month. Each member table has a constraint defined on the OrderMonth column:

CREATE TABLE May1998sales
(OrderID INT,
CustomerID INT NOT NULL,
OrderDate DATETIME NULL
CHECK (DATEPART(yy, OrderDate) = 1998),
OrderMonth INT
CHECK (OrderMonth = 5),
DeliveryDate DATETIME NULL
CHECK(DATEPART(mm, DeliveryDate) = 5)
CONSTRAINT OrderIDMonth PRIMARY KEY(OrderID, OrderMonth)
)

The application that populates May1998sales must make sure that all rows have 5 in the OrderMonth column and that the order date specifies a date in May 1998. This is enforced by the constraints defined on the table.

A view is then defined that uses UNION ALL to select the data from all 12 member tables as a single result set:

CREATE VIEW Year1998Sales
AS
SELECT * FROM Jan1998Sales
UNION ALL
SELECT * FROM Feb1998Sales
UNION ALL
SELECT * FROM Mar1998Sales
UNION ALL
SELECT * FROM Apr1998Sales
UNION ALL
SELECT * FROM May1998Sales
UNION ALL
SELECT * FROM Jun1998Sales
UNION ALL
SELECT * FROM Jul1998Sales
UNION ALL
SELECT * FROM Aug1998Sales
UNION ALL
SELECT * FROM Sep1998Sales
UNION ALL
SELECT * FROM Oct1998Sales
UNION ALL
SELECT * FROM Nov1998Sales
UNION ALL
SELECT * FROM Dec1998Sales

For example, the following SELECT statement queries for information about specific months.

SELECT *
FROM Year1998Sales
WHERE OrderMonth IN (5,6) AND CustomerID = 64892

The SQL Server query optimizer recognizes that the search condition in this SELECT statement references only rows in the May1998Sales and Jun1998Sales tables. Therefore, it limits its search to those tables.

To perform updates on a partitioned view, the partitioning column must be a part of the primary key of the base table. If a view is not updatable, you can create an INSTEAD OF trigger on the view that allows updates. You should design error handling into the trigger to make sure that no duplicate rows are inserted. For an example of an INSTEAD OF trigger designed on a view, see Designing INSTEAD OF Triggers.

CHECK constraints are not needed for the partitioned view to return the correct results. However, if the CHECK constraints have not been defined, the query optimizer must search all the tables instead of only those that cover the search condition on the partitioning column. Without the CHECK constraints, the view operates like any other view with UNION ALL. The query optimizer cannot make any assumptions about the values stored in different tables and it cannot skip searching the tables that participate in the view definition.

If all the member tables referenced by a partitioned view are on the same server, the view is a local partitioned view. If the member tables are on multiple servers, the view is a distributed partitioned view. Distributed partitioned views can be used to spread the database processing load of a system across a group of servers. For more information, see Federated Database Servers.

Partitioned views make it easier to maintain the member tables independently. For example, you can do the following at the end of a period:

1. The definition of the partitioned view for current results can be changed to add the newest period and drop the oldest period.

2. The definition of the partitioned view for past results can be changed to add the period just dropped from the current results view. The past results view can also be updated to remove and archive the oldest period it covers.

When you insert data into the partitioned views, the sp_executesql system stored procedure can be used to create INSERT statements with execution plans that have a significant chance of being reused in systems with many concurrent users.

Note:Bulk importing into a partitioned view is unsupported by both the bcp command and the BULK INSERT and INSERT ... SELECT * FROM OPENROWSET(BULK...) statements. However, you can insert multiple rows into a partitioned view by using an INSERT statement

Common Table Expressions (Recursive Queries)

A common table expression (CTE) provides the significant advantage of being able to reference itself, thereby creating a recursive CTE. A recursive CTE is one in which an initial CTE is repeatedly executed to return subsets of data until the complete result set is obtained.

A query is referred to as a recursive query when it references a recursive CTE. Returning hierarchical data is a common use of recursive queries, for example: Displaying employees in an organizational chart, or data in a bill of materials scenario in which a parent product has one or more components and those components may, in turn, have subcomponents or may be components of other parents.

A recursive CTE can greatly simplify the code required to run a recursive query within a SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement. In earlier versions of SQL Server, a recursive query usually requires using temporary tables, cursors, and logic to control the flow of the recursive steps. For more information about common table expressions, see Using Common Table Expressions.

Structure of a Recursive CTE
The structure of the recursive CTE in Transact-SQL is similar to recursive routines in other programming languages. Although a recursive routine in other languages returns a scalar value, a recursive CTE can return multiple rows.

A recursive CTE consists of three elements:

1. Invocation of the routine.
The first invocation of the recursive CTE consists of one or more CTE_query_definitions joined by UNION ALL, UNION, EXCEPT, or INTERSECT operators. Because these query definitions form the base result set of the CTE structure, they are referred to as anchor members.

CTE_query_definitions are considered anchor members unless they reference the CTE itself. All anchor-member query definitions must be positioned before the first recursive member definition, and a UNION ALL operator must be used to join the last anchor member with the first recursive member.

2. Recursive invocation of the routine.
The recursive invocation includes one or more CTE_query_definitions joined by UNION ALL operators that reference the CTE itself. These query definitions are referred to as recursive members.

3. Termination check.
The termination check is implicit; recursion stops when no rows are returned from the previous invocation.

Example
The following example shows the semantics of the recursive CTE structure by returning a hierarchical list of employees, starting with the highest ranking employee, in the Adventure Works Cycles company. The statement that executes the CTE limits the result set to employees in the Research and Development Group. A walkthrough of the code execution follows the example.

USE AdventureWorks;
GO
WITH DirectReports (ManagerID, EmployeeID, Title, DeptID, Level)
AS
(
-- Anchor member definition
SELECT e.ManagerID, e.EmployeeID, e.Title, edh.DepartmentID,
0 AS Level
FROM HumanResources.Employee AS e
INNER JOIN HumanResources.EmployeeDepartmentHistory AS edh
ON e.EmployeeID = edh.EmployeeID AND edh.EndDate IS NULL
WHERE ManagerID IS NULL
UNION ALL
-- Recursive member definition
SELECT e.ManagerID, e.EmployeeID, e.Title, edh.DepartmentID,
Level + 1
FROM HumanResources.Employee AS e
INNER JOIN HumanResources.EmployeeDepartmentHistory AS edh
ON e.EmployeeID = edh.EmployeeID AND edh.EndDate IS NULL
INNER JOIN DirectReports AS d
ON e.ManagerID = d.EmployeeID
)
-- Statement that executes the CTE
SELECT ManagerID, EmployeeID, Title, Level
FROM DirectReports
INNER JOIN HumanResources.Department AS dp
ON DirectReports.DeptID = dp.DepartmentID
WHERE dp.GroupName = N'Research and Development' OR Level = 0;
GO

Common Table Expressions in SQL Server

A common table expression (CTE) can be thought of as a temporary result set that is defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement. A CTE is similar to a derived table in that it is not stored as an object and lasts only for the duration of the query. Unlike a derived table, a CTE can be self-referencing and can be referenced multiple times in the same query.

A CTE can be used to:

1. Create a recursive query. For more information, see Recursive Queries Using Common Table Expressions.

2. Substitute for a view when the general use of a view is not required; that is, you do not have to store the definition in metadata.

3. Enable grouping by a column that is derived from a scalar subselect, or a function that is either not deterministic or has external access.

4. Reference the resulting table multiple times in the same statement.

Using a CTE offers the advantages of improved readability and ease in maintenance of complex queries. The query can be divided into separate, simple, logical building blocks. These simple blocks can then be used to build more complex, interim CTEs until the final result set is generated.

CTEs can be defined in user-defined routines, such as functions, stored procedures, triggers, or views.

Example
The following example shows the components of the CTE structure: expression name, column list, and query. The CTE expression Sales_CTE has three columns (SalesPersonID, NumberOfOrders, and MaxDate) and is defined as the total number of sales orders and the most recent sales order date in the SalesOrderHeader table for each salesperson. When the statement is executed, the CTE is referenced two times: one time to return the selected columns for the salesperson, and again to retrieve similar details for the salesperson's manager. The data for both the salesperson and the manager are returned in a single row.

USE AdventureWorks;
GO
WITH Sales_CTE (SalesPersonID, NumberOfOrders, MaxDate)
AS
(
SELECT SalesPersonID, COUNT(*), MAX(OrderDate)
FROM Sales.SalesOrderHeader
GROUP BY SalesPersonID
)
SELECT E.EmployeeID, OS.NumberOfOrders, OS.MaxDate,
E.ManagerID, OM.NumberOfOrders, OM.MaxDate
FROM HumanResources.Employee AS E
JOIN Sales_CTE AS OS
ON E.EmployeeID = OS.SalesPersonID
LEFT OUTER JOIN Sales_CTE AS OM
ON E.ManagerID = OM.SalesPersonID
ORDER BY E.EmployeeID;
GO

Derived Tables in SQL Server

The power of SQL Server never fails to amaze me; it literally seems that you can do just about anything in SQL. All you need is a little creativity and knowledge of the syntax, and you can put the power of SQL behind your web application. One of the neatest things I've yet done with SQL Server is using derived tables.

If you've used a VIEW before, you've used a more formal, more correct form of a derived table. For example, we could do the following:

CREATE VIEW vwEmployeesFromNewYork AS
SELECT * FROM Employee
WHERE State = "NY"
GO

Then if we wanted to see all of the Employees from New York with the last name Smith, ordered alphabetically, we could write:

SELECT LastName, FirstName
FROM vwEmployeesFromNewYork
WHERE LastName = "Smith"
ORDER BY FirstName

However, using derived tables, we could eliminate the view entirely. (Of course the view could be eliminated by simply adding an "AND State = "NY"" to the above WHERE clause, but what's important here is the concept, not the example!) Here is the same resultset as above but with the use of a derived table in place of a veiw:

SELECT LastName, FirstName
FROM
(SELECT * FROM Employee
WHERE State = "NY") AS EmployeeDerivedTable
WHERE LastName = "Smith"
ORDER BY FirstName

Isn't that neat? What we are doing is first getting the result set from our derived table (the SELECT statement in the FROM clause). Once we have that resultset, it is as though it was a table in itself. We then perform the SELECT on the derived table, returning our results! You can find another example of using derived tables here on 4GuysFromRolla.com in the article Obtaining Ranked Values from a Table page.

Happy Programming!

Rudra's .NET and SQL Server Blog

Thursday, August 27, 2009

Full-Text Search (SQL Server)

Partitioned Views in SQL Server

Common Table Expressions (Recursive Queries)

Common Table Expressions in SQL Server

Derived Tables in SQL Server

Followers

Blog Archive

About Me