menu 1
menu 2
menu 3
menu 4
   

dg.o Web


Join    Premier Club    Search    Help    Locator+    Shop DevX
Register Today
 
January 2000
Subscribe NOW!



Advanced String Handling with Regular Expressions

Still using Perl for tough text processing tasks? The gnu.regexp package can help you add advanced text processing to any Java application

by Taylor G. Cowan
  Regular expression matching is an integral part of many applications dealing with the validation or manipulation of text. Of all the programming tricks and techniques I've encountered in my career, I find them to be some of the most powerful and useful tools to have in my toolbox. Some common uses of regular expressions are input validation, location of patterns in text, and the implementation of advanced search and replacement features. With the gnu.regexp package Java developers have at their disposal a powerful regular expression matching engine. This article demonstrates how regular expressions can help reduce the amount of code necessary to provide robust text processing applications in Java. If you haven't already done so, you may download the gnu.regexp package from www.cacas.org/java/gnu/regexp/ (see the sidebar, "The gnu.regexp License in a Nutshell"). After downloading the library and adding it to your classpath you will be able to compile and use the examples presented here.

Works With:
JDK 1.2, JDK 1.1
What is a Regular Expression?
Regular expressions, sometimes referred to as regexps or REs, are the legacy of many UNIX command line programs such as grep, sed, and awk. The term "regular expression" comes from an area of study called automata theory, in which "regular expressions" are used to describe a set of languages called "regular languages." I only mention this so that those of you who were forced into taking an automata theory class will be happy to know that something useful to the average programmer actually came from it. (Compilers, of course, are another one.) For the rest of you, just be assured that despite the theoretical beginnings of regular expressions, they make programming with text easier.

In simple terms, a regular expression is a way of describing a pattern of text in detail. Most software developers use them unknowingly on a daily basis in commands such as "dir *.java." Regular expressions allow a programmer to match almost any conceivable pattern, no matter how specific or generic they may be.

Constructing Regular Expressions
Before going into detail, I'd like to demonstrate the power of a regular expression. Consider the problem of determining if a string object contains a valid U.S. currency amount. Checking for the dollar sign is trivial, but this will take at least a line or two of Java code. Furthermore, you need to make sure the string consists solely of digits and commas, and that the commas appear after every third digit starting from the left. Using the regexp.gnu package all this pattern matching logic may be combined in a single expression. Assuming the program has imported the regexp.gnu package, these two lines will determine if string s contains a valid dollar amount:

RE regexp = new RE(

   "^\$\\d{1,3}(,\\d{3})*$");

if (regexp.isMatch(s)) { System.out.println(

   "Valid currency string.");}

Although ^\$\\d{1,3}(,\\d{3})*$ may look like a stream of censored language, it is actually a set of commands used to describe a pattern to match on. Regular expressions are created using two types of characters, literals and metacharacters. A simple expression might be composed of all literals, such as the type that are commonly employed in editor search functions. This creates a regular expression object that matches all instances of the string "gnu":

RE regexp = new RE("gnu");

An experienced Java developer will recognize that literal matches provide similar functionality to the String.equals() and String.indexOf() methods. Complex regular expressions are described using meta-characters. The most common example of a metacharacter is the "*" character used in file name globing. Regular expression syntax expands on this technique by providing a complete set of metacharacters with advanced pattern matching features. Table 1 lists some of the more commonly used metacharacters.

I find that the best way to create a regular expression is by verbalizing the intended pattern match. Take for instance the currency matching example. All U.S. currency amounts begin with "$" followed by a set of integers grouped by commas. Obviously, this specification is much too vague to translate into a regular expression. Therefore, as with any natural language description of an algorithm, the interaction must be described in detail. A natural language description of the currency pattern might read as follows: "US currency begins with '$' followed by between 1 and 3 digits, followed by 0 or more groupings of a comma and three digits." After describing the pattern you then break it down into pieces which directly translate into any of the expressions listed in Table 1.

In Table 2, I've taken the pattern description and broken it down into five parts. The expression begins with "$" as a literal match. Notice that it is preceded by the escape character. This allows what might normally be interpreted as a metacharacter to be taken literally. The resulting regular expression is "^${1,3}\d(,{3}\d)*$."

There will often be more that one way to express a match. Exactly three digits are also expressed by \d\d\d. Another important issue to recognize is the dual use of some characters. The "$" character is used twice within this expression, once as a literal and then again as a metacharacter to match the end of a string. Once you have crafted the expression, use it with the gnu.regexp package by first importing the package and then creating an instance of gnu.regexp.RE:

try {

   RE regexp = 

      new RE("^\$\\d{1,3}(,\\d{3})*$");

} catch (REException e) {

   // handle error

}

The exception thrown by the constructor indicates that it could not parse the regular expression. This is similar to the way Integer.parseInt() handles an invalid input string. Did you notice the addition of several extra escape characters? Unfortunately, (or fortunately, depending on your point of view), Java automatically interprets escaped characters within a string literal. The original expression would cause the JDK compiler to issue an "Invalid escape character" error message. To prevent this, all occurrences of "\" must be converted to "\\". The final step in crafting your regular expression is to test, test, and test again. I rarely get the intended match on the first try. The art of regular expression matching is analogous to walking a tightrope between matching more than you want or missing matches you intended to catch.

Using Regular Expressions in Data Validation
A common issue raised in Web-related circles is the validation of text input such as telephone numbers, numeric data, and e-mail addresses. JavaScript has recently incorporated pattern matching as one of its features to meet demand for better client-side data validation within browsers. Oddly, many Java developers are still doing it the hard way by attempting to validate text using String.substring() and other useful yet low-level string methods. Others may even disregard Java as an acceptable text-processing language and opt for a language such as Perl or Python. Since e-mail address validation is such a necessity for many Web developers, lets see what it takes to achieve this using the gnu.regexp package for Java.

Assume you have been asked to write a Java front end that includes an entry field for an e-mail address. The address must have valid syntax and be from a real domain. This is a good task for a regular expression. The pseudo expression might read "one or more non-whitespace chars, a '@' character, one or more non-whitespace chars, a period, and finally one of either com, org, or net." For the simplicity of this example, I'm not including other possible domains. Notice how I have described the text parts as "non-whitespace characters." You will find that it is sometimes easiest to describe what should not be matched rather than what should be matched. Once the pattern is expressed verbally the divide and conquer approach is employed. "one or more" expresses how many of a grouping to expect. Regular expressions provide several metacharacters that describe an expected amount called "quantifiers." By glancing at Table 1 you'll see that the five unique quantifier symbols each have a different purpose. The "+" operator is the best choice for "one or more" situations. For non-whitespace characters there is the "\S" operand. Placing these two together, the expression should read "\S+". You may think of quantifiers as unary operators with the operand on the left side. The next pattern is a literal, "@", and should simply be appended to the expression. Applying what we have already learned to the rest of the pattern the expression now reads:

\S+@\S+\.  

Remember that "." taken as a literal must be escaped because it is also a metacharacter. The last part of the pattern requires the introduction of a new concept in regular expressions-alternatives. To allow a pattern to decide between two or more alternatives use the "|" metacharacter. The expression (A|B|C) would match one of either A, B, or C. Here is the final version of our regular expression after applying this to the e-mail pattern:

\S+@\S+\.(com|org|net)

The next step after crafting the regular expression is to put it to work using the gnu.regexp library as before. Adding the additional escape characters to the e-mail regular expression allows it to be used in the RE constructor:

try { RE regexp = new RE(

   "\\S+@\\S+\\.(com|org|net)"); }

catch (Exception e) { /* ignored */ }

After creating an instance of RE, apply the pattern to a string using the getMatch() method. getMatch() returns the first match found as an instance of gnu.regexp.REMatch. If no matching text is located, it returns null. Once created, the RE object may be used over again with different input strings.

// check for a valid email address

public boolean isValidEmail(

   String emailAddress) {

   try { 

      RE regexp = new RE(

         "^\\S+@\\S+\\.(com|org|net)$"); 

      if ( regexp.getMatch(emailAddress) == 

         null ) 

         return false;

      else

         return true;

   } catch (Exception e) { /* ignored */ }

}

Hopefully you already see some areas where REs may be useful in your applications. Table 3 shows a few more examples.

Shallow Parsing with Backreferences
Simply matching a pattern is a nifty trick, but the real power of regular expressions is in their ability to remember portions of a previously matched expression. Such a match is called a backreference. Take for instance the time value "12:32 pm." After finding a time string to be of the proper format, you may then want to take it apart and store its hour, minute, and meridian values into separate variables.

 
Figure 1. Pattern Dating Click here.



As Figure 1 illustrates, the date pattern may be divided into hour, minute, and meridian subexpressions by using parentheses. The RE object will remember the parts of a match that are enclosed within parentheses. Access to a subexpression is made available through the methods REMatch.getSubStartIndex() and REMatch.getSubEndIndex(). These methods provide a start and end index back into the original input string passed to E.getMatch(). Listing 1 utilizes this feature to implement a military time conversion function. Because the steps necessary to retrieve a submatch are repeated for each subexpression, I've combined them within a utility method called getSubMatch().

Another great candidate for the application of shallow parsing techniques is HTML text. As you are reading this article, countless automated Web crawlers are searching the Web for information. The root functionality of these robots is the ability to deconstruct and match parts of an HTML document. Generally, these robots are interested in two types of data: meta information describing the page's content and link information possibly pointing to previously unknown frontier on the WWW. If a particular page consists of well-formed HTML, then we may expect its element tags to follow a set of strict guidelines. As defined in the HTML 4.0 specification, "each element type declaration generally describes three parts: a start tag, content, and an end tag." The element name must appear within the start tag (written ) and the end tag (written ). Using this these guidelines, a possible starting point for a hyperlink matching regexp might read: <A.*>.*</A>.

The parts of the link that change from tag to tag are being matched with "*", which is any amount of any character. The problem with this expression is that it might actually match a superset of the desired pattern. As you begin to experiment with your own regular expressions, you'll soon realize the constant battle between matching everything you want, while at the same time excluding undesirable text. A better way to express the hyperlink tag would be: <A[^>]*>[^<]*</A>. In this expression, I've introduced a new operator, "^", which is equivalent to a logical NOT. [^>] will match everything except the ">" character. You may use this with any character or even a range of characters. For example [^b] matches anything but the character b; [^a-z] matches anything but the characters a through z; and [^xyz] matches anything but x, y, or z.

For the robot to follow any link it finds, it must be able to extract the actual URL that is referenced. Within a hyperlink, the URL is made available through the "href" attribute. The syntax for this attribute is "href=" where is any valid URL string. Since we are trusting the page author to give us syntactically valid URLs, a "\\S*" (non-whitespace) placed in the correct place provides an acceptable matching expression. Putting it all together we have:

<A HREF=(\\S*)>[^<]*</A>

If we go ahead and attempt to use this expression we might be disappointed. Since the "A" element also supports the "name" attribute, the following valid hyperlink markup would not be matched by our expression:

<A NAME="link1" HREF="link1.hml">first link</A>

To compensate, the expression must take into account an arbitrary number of parameters preceding or following the href tag like this:

<A[^>]*HREF=(\\S*)[^>]*>[^<]*</A>

There is still one last issue preventing this expression from matching a large percentage of the hypertext links it might encounter-case sensitivity. Valid hyperlinks may be found beginning with "

RE regexp = new RE(

   "<A[^>]*HREF=(\\S*)[^>]*>[^<]*</A>", 

   RE.REG_ICASE)

Listing 2 demonstrates how this regular expression may be used to create a robot that traverses the Internet using a depth first search. Notice that Listing 2 uses getAllMatches() to retrieve an array of all matches found in the input string. This robot could be extended to perform site indexing quite easily by parsing each page's metatag information just as I've done with the hyperlink tags.

Replacing Matched Text
gnu.regexp allows you to create sophisticated search and replace commands that go beyond the normal "find and replace" dialog box. To perform a simple search and replace use the substituteAll() method of the RE object. This example globally changes all instances of "gnu" within the input string to upper case:

RE regexp = new RE("gnu");

String result = RE.substituteAll(

   input, "GNU");

This is not too impressive since the string object can do this for us using a combination of its indexOf() and toUpper() methods. Consider the task of searching an input string for time strings and converting them into military time. This will require the use of backtracking as well as replacement, since the original time values must be remembered and used in the time conversion. Most of the problem has been solved for us in Listing 1, however, there are some assumptions made in the implementation of toMilitaryTime() that no longer apply to the new task.

First, we can no longer assume that the input string consists solely of a time expression. This requires us to remove the beginning of line and end of line matching operators:

 (1[0-2]|[1-9]):([0-5]\\d) (am|pm) 

(The new expression includes a single space before and after to enforce normal word boundaries between the neighboring text.) The second change needed is to modify the last line of the toMilitaryTime() method to return the full input string with the replaced text, as opposed to simply returning the converted time expression. To perform a replacement on the first matching string use the substitute() method. Here is how it might be applied to Listing 1 to perform the replacement:

return substitue(

   s, h + ":" + min + " " + meridian);

To fully implement this functionality, you must continue through the input text after the last match looking for more time expressions until the end of string is reached. The trick is to use the REMatch object to determine the last index of the match using its getEndIndex() method. The RE object has an overloaded version of substitute() that takes as a third argument the index to begin searching at.

In this article I've introduced the concept of regular expressions primarily for use in Java programs. I recommend that you follow these four steps when crafting your own regular expressions:

1. Verbalize the intended match as natural language (begins with x, followed by x, and so on).
2. Divide the pattern description into smaller parts that translate directly to a metacharacter or literal expression.
3. Use the expression to construct a gnu.regexp.RE object.
4. Test the expression to ensure you are getting exactly what you want and remember that substitutions are especially dangerous.

Although I've only covered a small portion of what regular expressions are used for, you should be able to draw on these examples as a starting point. For those of you who tend to jump to Perl or Python when a tough text-processing requirement comes your way, I strongly recommend sticking with Java by making use of the gnu.regexp or other regular expression library for Java. One such library is OROMatcher, which was written by Daniel Savarese. OROMatcher is available at www.quateams.com/oro/downloads.

Taylor Cowan is a Senior Consultant with BrightStar Information Technology Group. He has his MS degree from University of North Texas and is a Sun Certified Java Developer. Reach him at .

 
Resources
gnu.regexp package

OROMatcher

Get the Code
  Registered Members: Download the code
for entire issue
  Premier Members: Download the code
for this article, plus additional documentation for the code.
  Not a member?
Register for free!
 
Join the Premier Club

In The Java Zone
" Product Review of the Week
JumpStart


" Site of the Week
Java Repository


" Book Review of the Week
Essential Java Style: Patterns for Implementation


" Tip of the Day
Efficient Use of the Vector Class


" Download of the Week
Javelin



Links
Back to Java Pro

Back to Java Zone

Join the Java Discussion



Register Today




What did you think of this article? Send your e-mail to the editors at .

© 1999 Fawcette Technical Publications.


DevX Home | VB Zone | Java Zone | C++ Zone | Enterprise Zone | Ask the Pros
VBPJ | Java Pro | Enterprise Development | Web Builder | VCDJ
Technical Guide To Visual Programming | XML Magazine | Exchange & Outlook
VBITS | MarketPlace Search | Help