Tuesday, December 09, 2008

ATC(), Only Better

Because our flagship product, Stonefield Query, is a reporting tool, we do a LOT of parsing here, especially expression parsing. One parsing task we frequently do is look for the existence of a field in an expression. It's not as simple as it sounds. For example, suppose you have a SQL statement like this stored in a variable named lcSelect:

select Employees.FirstName,Employees.LastName,Employees.CountryOfBirth from ...

and you want to know if the Country field is involved in the query. Unfortunately, ATC('Country', lcSelect) will return a false positive because "Country" is contained within "CountryOfBirth".

One of Stonefield's genius developers, Trevor Mansuy, wrote a replacement for ATC() called ATCWord that searches for words rather than substrings. He used the VBScript regular expression parser to do the hard work. So, using the example above, ATCWord('Country', lcSelect) returns 0.

There are a lot of comments in ATCWord that explains how it works. Thanks, Trevor.

Update: Mike Potjer tweaked the code so it returns the correct position when the string you're searching for comes after a substring match, such as looking for "Country" in "Employees.CountryOfBirth,Employees.Country". Thanks, Mike.

*======================================================================
* Function: ATCWord
* Purpose: Performs an ATC() on whole words, not substrings
* Author: Trevor Mansuy, with a tweak by Mike Potjer
* Last revision: 12/09/2008
* Parameters: tcSearch - the string to search for
* tcString - the string to search for a match in
* tnOccurrence - which occurrence to search for
* Returns: the index of the first character of the match
* Environment in: VBScript.RegExp can be instantiated
* Environment out: none
*======================================================================

lparameters tcSearch, ;
tcString, ;
tnOccurrence
local lnOccurrence, ;
loRegExp, ;
lcLeftBoundary, ;
lcRightBoundary, ;
lcSearch, ;
loMatches, ;
lnReturn, ;
loMatch, ;
lnMatch

* Ensure the necessary parameters were passed.

assert vartype(tcSearch) = 'C' and not empty(tcSearch) ;
message 'ATCWord: invalid search string passed'
assert vartype(tcString) = 'C' ;
message 'ATCWord: invalid string passed'

* Bug out if the string is empty.

if empty(tcString)
return 0
endif empty(tcString)

* Set an occurrence if one wasn't passed.

if not vartype(tnOccurrence) = 'N'
lnOccurrence = 1
else
lnOccurrence = tnOccurrence
endif not vartype(tnOccurrence) = 'N'

* Create a VBScript.RegExp object.

loRegExp = createobject('VBScript.RegExp')
loRegExp.IgnoreCase = .T.
loRegExp.Global = .T.

* This is here for future proofing. Right now, the function expects
* whatever we search for to begin and end with non-word characters,
* but just in case, save the boundary characters in variables so we
* can change them easily. Right now, we are using word-boundaries
* (the space between a word and a non-word character. I originally
* wanted to use non-word characters (\W) for this, but the match
* consumes the character, so side-by-side matches won't be found. The
* drawback of \b is if we do an ATC() for a string with a non-word
* character at the beginning or end, there is no guarantee that there
* is a word boundary on the other side of that character.

lcLeftBoundary = '\b'
lcRightBoundary = '\b'

* Certain RegExp special characters prevent matches, so escape them.

lcSearch = AddRegExEscape(tcSearch)

* In the pattern below, the three sets of parentheses represent three
* match groups. The first match group, (\W|^), means we first match
* either a non-word character or the beginning of a string. Similarly,
* (\W|$) means at the end we match either a non-word character or the
* end of the string. This means we make a match in every situation
* except where the string in lcSearch is a substring of one of the
* strings in tcString.

loRegExp.Pattern = '(' + lcLeftBoundary + '|^)(' + lcSearch + ')(' + ;
lcRightBoundary + '|$)'

* Test the string. If a match collection is returned, check how many
* matches were made. If the number of matches is less than the
* occurrence passed, return 0, otherwise, do the ATC.

loMatches = loRegExp.Execute(tcString)
if loMatches.Count < lnOccurrence
lnReturn = 0

* Retrieve the specified occurrence from the Item collection. If the
* match is the one we want, return its start location in the string.

else
loMatch = loMatches.Item(lnOccurrence - 1)
lcMatch = loMatch.Value
lnReturn = iif(upper(tcSearch) = upper(lcMatch), ;
loMatch.FirstIndex + 1, 0)
endif loMatches.Count < lnOccurrence
return lnReturn

*======================================================================
* Function: AddRegExEscape
* Purpose: Escape certain characters in regular expressions
* Author: Trevor Mansuy
* Last revision: 09/03/2008
* Parameters: tcString - the string to escape
* Returns: the string with certain characters escaped
* Environment in: none
* Environment out: none
*======================================================================

function AddRegExEscape
lparameters tcString
local lcString
lcString = strtran(tcString, '\', '\\')
lcString = strtran(lcString, '?', '\?')
lcString = strtran(lcString, '*', '\*')
lcString = strtran(lcString, '+', '\+')
lcString = strtran(lcString, '.', '\.')
lcString = strtran(lcString, '|', '\|')
lcString = strtran(lcString, '{', '\{')
lcString = strtran(lcString, '}', '\}')
lcString = strtran(lcString, '[', '\[')
lcString = strtran(lcString, ']', '\]')
lcString = strtran(lcString, '(', '\(')
lcString = strtran(lcString, ')', '\)')
lcString = strtran(lcString, '$', '\$')
return lcString

No comments: