Oct
31
2011

More sample queries for Semantic Search in SQL Server 2012

Earlier today I published a blog post about Looping over document similarity details in Semantic Search on the MSDN blog of the SQL Server Full-Text Search team. That post introduces the 3 built-in rowset functions that are provided by Semantic Search in SQL Server 2012 “Denali.” It also describes how to do something for which we don’t provide a built-in set-based function – that is, iterate over the documents that are similar to a source document, and ask why each document is similar. Please take a look!

I had fun playing with semantic search and testing my code. Here are a few more examples of the queries that you can perform with semantic search. Don't laugh at my SQL code! I haven't written much T-SQL while working mostly with SSIS for several years.

The examples in this post use the semanticsimilaritydetails_foreach function created in the blog post linked above.

The examples in this post also use a collection of SQL Server white papers published on MSDN and TechNet – mostly in Microsoft Word format – saved into a SQL Server database by using a SQL Server 2012 FileTable. With this approach, I can easily save BLOBs into the database simply by dragging and dropping the files in Windows Explorer. The database table is named WhitePapers, and it has the fixed schema common to all FileTables.

1. How many white papers are about ETL?

Query:

SELECT COUNT(*) AS 'Number of ETL White Papers'
    FROM semantickeyphrasetable (WhitePapers, *) AS SKP
    WHERE SKP.keyphrase = 'etl'
GO

 

Results:

Number of ETL White Papers
--------------------------
30

(1 row(s) affected)

 

These results tell me that my table contains 30 documents in which “etl” was identified as a statistically significant key phrase. (In version 1 of semantic search, only one-word terms are supported.)

2. What are the top matching white papers that are also about ETL?

Query:

DECLARE @Title NVARCHAR(255)
DECLARE @DocumentID hierarchyid

-- Source white paper - "We Loaded 1TB in 30 Minutes with SSIS, and So Can You."
SET @Title = '1TBin30MwithSSIS.docx'

-- Get the ID of the source document from its title.
SELECT @DocumentID = path_locator
    FROM WhitePapers
    WHERE name = @Title

-- Get the top n key phrases for each matching document.
DECLARE @DetailsCount smallint = 3

-- Get the top n matching white papers that are also about ETL.
SELECT TOP (10) LEFT(WP.name, 32) AS 'Document',
                LEFT(SSD.keyphrase, 16) AS 'Key Phrase',
                SSD.score AS 'Score'
    FROM WhitePapers AS WP
            CROSS APPLY
                semanticsimilaritydetails_foreach(@DocumentID, WP.path_locator, @DetailsCount)
            AS SSD
    WHERE WP.path_locator <> @DocumentID
        AND SSD.keyphrase = 'etl'
    ORDER BY Score DESC
GO

 

Results:

Document                         Key Phrase       Score
-------------------------------- ---------------- -------------
HighVolETLConsid.docx            etl              0.6152592
Implementing a Microsoft PDW usi etl              0.4083139
HubSpokeOverview.docx            etl              0.3958832
DataSrcISPkgs.docx               etl              0.3768547
SSIS_die_zweite.docx             etl              0.3722264
HubSpokeImpl.docx                etl              0.3580182
BestPractDWSQL2008.docx          etl              0.3551803
ScaleUpDWinSQL2008.docx          etl              0.3497615
OracleSSIS.docx                  etl              0.3446308
SSIS2008Connectivity.docx        etl              0.3446308

(10 row(s) affected)

 

These results show me the top 10 white papers that are similar to the source document, where “etl” is among the top 3 key phrases that were used to identify the white paper as similar to the source document.

(Note that the semantic indexing also captured a German-language article about ETL.

3. What are the similarities between my source document and other documents?

Query:

DECLARE @Title NVARCHAR(255)
DECLARE @DocumentID hierarchyid

-- Source white paper - "We Loaded 1TB in 30 Minutes with SSIS, and So Can You."
SET @Title = '1TBin30MwithSSIS.docx'

-- Get the ID of the source document from its title.
SELECT @DocumentID = path_locator
    FROM WhitePapers
    WHERE name = @Title

-- Get the top n key phrases for each matching document.
DECLARE @DetailsCount smallint = 3

-- Get the top n key phrases in matching white papers.
SELECT TOP (10) LEFT(SSD.keyphrase, 16) AS 'Key Phrase',
                COUNT(SSD.keyphrase) AS 'Count'
    FROM WhitePapers AS WP
            CROSS APPLY
                semanticsimilaritydetails_foreach(@DocumentID, WP.path_locator, @DetailsCount)
            AS SSD
    WHERE WP.path_locator <> @DocumentID
    GROUP BY SSD.keyphrase
    ORDER BY COUNT DESC
GO

 

Results:

Key Phrase       Count
---------------- -----------
sql              149
server           111
sqlserver        30
cpu              24
etl              23
msdn             21
aspx             15
data             11
microsoft        10
tcp              10

(10 row(s) affected)

 

These results show me the top 10 key phrases that were used to identify other white papers as similar to the source document, where those key phrases are among the top 3 shared key phrases.


Have fun with Semantic Search! For more information see the documentation.

Be sure to check out the killer demo of FileTable and Semantic Search from SQL Server evangelist Roger Doherty and his team. This example includes a tag cloud to visualize the relative importance of key phrases and to find similar documents at a glance.