Nov
2
2011
SQL Server 2012 “Denali” Full-Text Search introduces a new custom proximity operator, which we’ve been calling “customizable NEAR.” The new NEAR operator lets you query with 2 optional requirements that you could not previously specify:
- The maximum gap between the search terms.
- The order of the search terms. (For example, “John” must appear before “Smith.”)
Here is a partial example of the new NEAR operator:
CONTAINSTABLE (Documents, Content, 'NEAR((John, Smith), 4, TRUE)')
I think you will understand the simple syntax at first glance. I intend to write a blog post soon about customizable NEAR in general. But today, let’s focus on a question that has gained importance since we exposed the optional max_gap argument, which is the 2nd argument in the preceding example: How near is NEAR?
1. The generic NEAR operator (pre-SQL Server 2012)
1.a. How the generic NEAR operator determines proximity
The logic of the generic NEAR operator is simple:
- When used with CONTAINSTABLE, it considers nearness significant and returns a Rank > 0 if the 2 words are within 50 words of each other. (Oracle’s generic proximity operator used a boundary of 100 words.) CONTAINSTABLE returns a match with a Rank = 0 if the 2 words are found, but are more than 50 words apart.
- When used with CONTAINS, NEAR effectively becomes an AND query, and returns a match if the 2 words are found in the same row or document.
1.b. The generic NEAR operator is deprecated in SQL Server 2012
The custom proximity operator is a new operator, and not an extension of the existing NEAR operator. The generic NEAR operator is deprecated in SQL Server 2012. It still works in SQL Server 2012, but should not be used in new development.
2. The SQL Server 2012 customizable NEAR operator
The custom proximity operator now lets you specify “how near” you want search terms to be with an optional max_gap argument. In this example, you want to find documents where “dog” and “cat” appear within 4 words of each other, and where "dog" appears before "cat":
'NEAR((dog, cat), 4, TRUE)'
2.1 Simple definition of the gap
In a matching fragment that begins and ends with one of the search terms, the gap is the number of non-search-terms between them.
Stopwords or noise words are included in the count.
2.2 BUT the distance between search terms is logical and not absolute
For example, terms within different sentences within a paragraph are treated as farther apart than terms in the same sentence, regardless of their actual proximity, because we assume that they are less closely related. Terms in different paragraphs are treated as being even farther apart.
If a match spans the end of a sentence, paragraph, or chapter, the actual gap is increased by 8, 128, or 1024, respectively. So in this example… “I see the cat. The dog also sees her.” …the gap between “cat” and “dog” is 1 + 8 = 9, because of the sentence break.
What if your documents aren’t text documents, but are Word documents or Excel spreadsheets or PowerPoint presentations? Then interpreting things such as table breaks, different worksheets, or bullet points on a slide is the responsibility of the IFilter for each document type. It is up to the IFilter to map each construct in the document format to either a word, sentence, paragraph or chapter.
It’s not possible to constrain your search to terms that appear in the same sentence or in the same paragraph. However we believe that the new custom proximity operator is simple and powerful enough to satisfy the majority of our customers’ needs in relation to proximity searches.
2.3 Where can I see the calculated gap between terms?
You can use the dynamic management view, sys.dm_fts_parser, to see how the gap between terms is calculated. Here's an example that uses the sample sentence from the preceding section:
SELECT occurrence, LEFT(display_term, 16), special_term
FROM sys.dm_fts_parser('"I see the cat. The dog also sees her."', 1033, 0, 0)
Here are the results, which confirm the gap of 9 (14 – 4 – 1) between "cat" and "dog":
occurrence special_term
----------- ---------------- ----------------
1 i Noise Word
2 see Noise Word
3 the Noise Word
4 cat Exact Match
12 END OF FILE End Of Sentence
13 the Noise Word
14 dog Exact Match
15 also Noise Word
16 sees Exact Match
17 her Noise Word
25 END OF FILE End Of Sentence
(11 row(s) affected)
2.4 Specifying a value for the max_gap argument
The max_gap argument is optional. If you specify a value, it can be any numeric value allowed by a 4-byte integer. (Frankly, a gap of 2,147,483,647 would seem to suggest that 2 terms are not closely related!)
Naturally, queries only return documents or rows that satisfy the max_gap requirement. (With the generic NEAR operator, on the other hand, CONTAINSTABLE previously returned rows where the gap exceeded its boundary of 50, but assigned a Rank of 0 (zero) to those rows.)
A matching gap is <= max_gap. It’s not possible to require a specific gap, neither less nor more, between search terms.
If you want to specify a value for the optional 3rd order argument, then you also have to specify a value for max_gap.
2.5 Omitting max_gap or specifying MAX
Instead of specifying an integer value, you can also specify the keyword MAX, or omit max_gap. MAX is the default value. When MAX is specified, then queries return any documents or rows that contain the search terms, regardless of the gap between them. However matches where the gap is > 100 have a rank of 0 (zero).
2.6 Two simple examples to summarize the behavior of max_gap
NEAR((dog, cat), 6)
- The query only returns rows that contain both dog and cat, with 6 terms or fewer between them.
- Rows are ranked by proximity, and no returned row can have Rank = 0.
NEAR((dog, cat))
- This is equivalent to using the MAX default - NEAR((dog, cat), MAX).
- The query returns all rows that contain both dog and cat.
- Rows have a Rank = 0 if the gap between terms is > 100.
2.7 An example with multiple matches
Query: NEAR((A, B), 10)
Data: AB…(10 terms)…AB…(10 terms)…AB
Matches: 5
- The 1st AB pair.
- The first B…A.
- The 2nd AB pair.
- The 2nd B…A.
- The 3rd AB pair.
2.8 An example with multiple search terms
Query: NEAR((wine, cheese, “nearby stores”), 5)
Data: This wine and cheese can be found in nearby stores.
Matches? Yes. The count of non-search terms (“and,” “can be found in”) is <= the max_gap of 5.
Data: This wine and cheese can sometimes be found in nearby stores.
Matches? No. The count of non-search terms (“and,” “can sometimes be found in”) is > the max_gap of 5.
Thank you to the members of the SQL Server Full-Text Search development team for patiently answering my questions as I prepared this blog post!
Follow the Full-Text Search team blog if also you’re interested in occasional updates about full-text search.
-Doug