Nov
2
2011

How near is NEAR in SQL Server 2012 Full-Text Search?

SQL Server 2012 “Denali” Full-Text Search introduces a new custom proximity operator, which we’ve been calling “customizable NEAR.” The new NEAR operator lets you query with 2 optional requirements that you could not previously specify:

  1. The maximum gap between the search terms.
  2. The order of the search terms. (For example, “John” must appear before “Smith.”)

Here is a partial example of the new NEAR operator:

CONTAINSTABLE (Documents, Content, 'NEAR((John, Smith), 4, TRUE)')

I think you will understand the simple syntax at first glance. I intend to write a blog post soon about customizable NEAR in general. But today, let’s focus on a question that has gained importance since we exposed the optional max_gap argument, which is the 2nd argument in the preceding example: How near is NEAR?

1. The generic NEAR operator (pre-SQL Server 2012)

1.a. How the generic NEAR operator determines proximity

The logic of the generic NEAR operator is simple:

  • When used with CONTAINSTABLE, it considers nearness significant and returns a Rank > 0 if the 2 words are within 50 words of each other. (Oracle’s generic proximity operator used a boundary of 100 words.) CONTAINSTABLE returns a match with a Rank = 0 if the 2 words are found, but are more than 50 words apart.
  • When used with CONTAINS, NEAR effectively becomes an AND query, and returns a match if the 2 words are found in the same row or document.

1.b. The generic NEAR operator is deprecated in SQL Server 2012

The custom proximity operator is a new operator, and not an extension of the existing NEAR operator. The generic NEAR operator is deprecated in SQL Server 2012. It still works in SQL Server 2012, but should not be used in new development.

2. The SQL Server 2012 customizable NEAR operator

The custom proximity operator now lets you specify “how near” you want search terms to be with an optional max_gap argument. In this example, you want to find documents where “dog” and “cat” appear within 4 words of each other, and where "dog" appears before "cat":

'NEAR((dog, cat), 4, TRUE)'

 

2.1 Simple definition of the gap

In a matching fragment that begins and ends with one of the search terms, the gap is the number of non-search-terms between them.

Stopwords or noise words are included in the count.

2.2 BUT the distance between search terms is logical and not absolute

For example, terms within different sentences within a paragraph are treated as farther apart than terms in the same sentence, regardless of their actual proximity, because we assume that they are less closely related. Terms in different paragraphs are treated as being even farther apart.

If a match spans the end of a sentence, paragraph, or chapter, the actual gap is increased by 8, 128, or 1024, respectively. So in this example… “I see the cat. The dog also sees her.” …the gap between “cat” and “dog” is 1 + 8 = 9, because of the sentence break.

What if your documents aren’t text documents, but are Word documents or Excel spreadsheets or PowerPoint presentations? Then interpreting things such as table breaks, different worksheets, or bullet points on a slide is the responsibility of the IFilter for each document type. It is up to the IFilter to map each construct in the document format to either a word, sentence, paragraph or chapter.

It’s not possible to constrain your search to terms that appear in the same sentence or in the same paragraph. However we believe that the new custom proximity operator is simple and powerful enough to satisfy the majority of our customers’ needs in relation to proximity searches.

2.3 Where can I see the calculated gap between terms?

You can use the dynamic management view, sys.dm_fts_parser, to see how the gap between terms is calculated. Here's an example that uses the sample sentence from the preceding section:

SELECT occurrence, LEFT(display_term, 16), special_term 
    FROM sys.dm_fts_parser('"I see the cat. The dog also sees her."', 1033, 0, 0)
 

Here are the results, which confirm the gap of 9 (14 – 4 – 1) between "cat" and "dog":

occurrence                   special_term
----------- ---------------- ----------------
1           i                Noise Word
2           see              Noise Word
3           the              Noise Word
4           cat              Exact Match
12          END OF FILE      End Of Sentence
13          the              Noise Word
14          dog              Exact Match
15          also             Noise Word
16          sees             Exact Match
17          her              Noise Word
25          END OF FILE      End Of Sentence

(11 row(s) affected)

 

2.4 Specifying a value for the max_gap argument

The max_gap argument is optional. If you specify a value, it can be any numeric value allowed by a 4-byte integer. (Frankly, a gap of 2,147,483,647 would seem to suggest that 2 terms are not closely related!)

Naturally, queries only return documents or rows that satisfy the max_gap requirement. (With the generic NEAR operator, on the other hand, CONTAINSTABLE previously returned rows where the gap exceeded its boundary of 50, but assigned a Rank of 0 (zero) to those rows.)

A matching gap is <= max_gap. It’s not possible to require a specific gap, neither less nor more, between search terms.

If you want to specify a value for the optional 3rd order argument, then you also have to specify a value for max_gap.

2.5 Omitting max_gap or specifying MAX

Instead of specifying an integer value, you can also specify the keyword MAX, or omit max_gap. MAX is the default value. When MAX is specified, then queries return any documents or rows that contain the search terms, regardless of the gap between them. However matches where the gap is > 100 have a rank of 0 (zero).

2.6 Two simple examples to summarize the behavior of max_gap

NEAR((dog, cat), 6)

 

  • The query only returns rows that contain both dog and cat, with 6 terms or fewer between them.
  • Rows are ranked by proximity, and no returned row can have Rank = 0.
NEAR((dog, cat))

 

  • This is equivalent to using the MAX default - NEAR((dog, cat), MAX).
  • The query returns all rows that contain both dog and cat.
  • Rows have a Rank = 0 if the gap between terms is > 100.

2.7 An example with multiple matches

Query: NEAR((A, B), 10)

Data: AB…(10 terms)…AB…(10 terms)…AB

Matches: 5

  1. The 1st AB pair.
  2. The first B…A.
  3. The 2nd AB pair.
  4. The 2nd B…A.
  5. The 3rd AB pair.

2.8 An example with multiple search terms

Query: NEAR((wine, cheese, “nearby stores”), 5)

Data: This wine and cheese can be found in nearby stores.

Matches? Yes. The count of non-search terms (“and,” “can be found in”) is <= the max_gap of 5.

Data: This wine and cheese can sometimes be found in nearby stores.

Matches? No. The count of non-search terms (“and,” “can sometimes be found in”) is > the max_gap of 5.


Thank you to the members of the SQL Server Full-Text Search development team for patiently answering my questions as I prepared this blog post!

Follow the Full-Text Search team blog if also you’re interested in occasional updates about full-text search.

-Doug

Comments (1) -

Trevor Dwyer

Love the article. Can't wait to see more like this. The SQL Server Full-Text Search development team seem really on the ball these days - I always had a problem with the non-expossure of how the engine actually processed terms and now this is much better.  I have followed the Microsoft Research team on their "Approximate String Joins in a Database" set of papers for an insight into how its actually done. I even wrote the entire thing in Transact SQL to prove it was procesing correctly

Pingbacks and trackbacks (1)+