org.enhydra.apache.xerces.utils.regex
Class RegularExpression

java.lang.Object
  |
  +--org.enhydra.apache.xerces.utils.regex.RegularExpression
All Implemented Interfaces:
Serializable

public class RegularExpression
extends Object
implements Serializable

A regular expression matching engine that uses a non-deterministic finite automaton (NFA). This engine does not conform to POSIX regular expressions.


How to use

A. Standard way
 RegularExpression re = new RegularExpression(regex);
 if (re.matches(text)) { ... }
 
B. Capturing groups
 RegularExpression re = new RegularExpression(regex);
 Match match = new Match();
 if (re.matches(text, match)) {
     ... // You can refer to the captured text with the methods of the Match class.
 }
 

Case-insensitive matching

 RegularExpression re = new RegularExpression(regex, "i");
 if (re.matches(text)) { ... }
 

Options

You can pass options to RegularExpression(regex, options) or setPattern(regex, options). The options parameter is a string consisting of the following characters.

"i"
This option indicates case-insensitive matching.
"m"
^ and $ take the EOL characters within the text into account.
"s"
. matches any one character.
"u"
Redefines \d \D \w \W \s \S \b \B \< \> according to Unicode.
"w"
With this option, \b \B \< \> are processed using the method of 'Unicode Regular Expression Guidelines' Revision 4. When "w" and "u" are specified at the same time, \b \B \< \> are processed according to the "w" option.
","
The parser treats a comma in a character class as a range separator. Without this option, [a,b] matches a, a comma, or b. With this option, [a,b] matches a or b.
"X"
With this option, the engine conforms to XML Schema: Regular Expression. The matches() method does not do substring matching but entire string matching.
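
The Perl-style options can be tried out without this package. The sketch below uses the JDK's java.util.regex as a stand-in, on the assumption that "i", "m" and "s" behave like Pattern.CASE_INSENSITIVE, Pattern.MULTILINE and Pattern.DOTALL. It is not the RegularExpression API itself; the class OptionSketch and its methods are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

// Stand-in sketch: java.util.regex flags are ASSUMED to mirror "i", "m" and "s".
class OptionSketch {
    // Hypothetical helper: translates an option string into java.util.regex flags.
    static int toFlags(String options) {
        int flags = 0;
        for (char c : options.toCharArray()) {
            switch (c) {
                case 'i': flags |= Pattern.CASE_INSENSITIVE; break;
                case 'm': flags |= Pattern.MULTILINE; break;
                case 's': flags |= Pattern.DOTALL; break;
                default:
                    throw new IllegalArgumentException("Option not covered by this sketch: " + c);
            }
        }
        return flags;
    }

    // Substring search: true if the text contains a match for the pattern.
    static boolean contains(String regex, String options, String text) {
        return Pattern.compile(regex, toFlags(options)).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(contains("hello", "i", "Say HELLO")); // case-insensitive search
        System.out.println(contains("^b$", "m", "a\nb\nc"));     // ^ and $ at line breaks
        System.out.println(contains("a.b", "s", "a\nb"));        // . also matches a line break
    }
}
```

Each call in main returns true with the option and false when the option string is empty, which makes the effect of a single option easy to observe.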

Syntax

Differences from the Perl 5 regular expression

  • There is a 6-digit hexadecimal character representation (\vHHHHHH).
  • Supports subtraction, union, and intersection operations for character classes.
  • Not supported: \ooo (Octal character representations), \G, \C, \lc, \uc, \L, \U, \E, \Q, \N{name}, (?{code}), (??{code})
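
To make the character class operations concrete, here is a minimal sketch of the three set operations using java.util.regex as a stand-in, not this engine. Going by the BNF below, this engine would presumably write subtraction as (?[a-z]-[aeiou]), with + and & for union and intersection; that reading of the operators is an assumption. The class name CharClassOps is hypothetical.

```java
import java.util.regex.Pattern;

// Stand-in sketch: the same set operations expressed in java.util.regex syntax.
class CharClassOps {
    // Subtraction: lowercase letters except vowels.
    static final Pattern SUBTRACTION = Pattern.compile("[a-z&&[^aeiou]]");
    // Union: a-c together with x-z.
    static final Pattern UNION = Pattern.compile("[a-c[x-z]]");
    // Intersection: characters that are in both a-m and h-z, that is h-m.
    static final Pattern INTERSECTION = Pattern.compile("[a-m&&[h-z]]");

    // True if the single character c belongs to the class described by p.
    static boolean inClass(Pattern p, char c) {
        return p.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        System.out.println(inClass(SUBTRACTION, 'b'));  // consonant
        System.out.println(inClass(SUBTRACTION, 'e'));  // vowel, subtracted
        System.out.println(inClass(UNION, 'y'));
        System.out.println(inClass(INTERSECTION, 'k'));
    }
}
```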

Meta characters are `. * + ? { [ ( ) | \ ^ $'.


BNF for the regular expression

 regex ::= ('(?' options ')')? term ('|' term)*
 term ::= factor+
 factor ::= anchors | atom (('*' | '+' | '?' | minmax ) '?'? )?
            | '(?#' [^)]* ')'
 minmax ::= '{' ([0-9]+ | [0-9]+ ',' | ',' [0-9]+ | [0-9]+ ',' [0-9]+) '}'
 atom ::= char | '.' | char-class | '(' regex ')' | '(?:' regex ')' | '\' [0-9]
          | '\w' | '\W' | '\d' | '\D' | '\s' | '\S' | category-block | '\X'
          | '(?>' regex ')' | '(?' options ':' regex ')'
          | '(?' ('(' [0-9] ')' | '(' anchors ')' | looks) term ('|' term)? ')'
 options ::= [imsw]* ('-' [imsw]+)?
 anchors ::= '^' | '$' | '\A' | '\Z' | '\z' | '\b' | '\B' | '\<' | '\>'
 looks ::= '(?=' regex ')'  | '(?!' regex ')'
           | '(?<=' regex ')' | '(?<!' regex ')'
 char ::= '\\' | '\' [efnrtv] | '\c' [@-_] | code-point | character-1
 category-block ::= '\' [pP] category-symbol-1
                    | ('\p{' | '\P{') (category-symbol | block-name
                                       | other-properties) '}'
 category-symbol-1 ::= 'L' | 'M' | 'N' | 'Z' | 'C' | 'P' | 'S'
 category-symbol ::= category-symbol-1 | 'Lu' | 'Ll' | 'Lt' | 'Lm' | 'Lo'
                     | 'Mn' | 'Me' | 'Mc' | 'Nd' | 'Nl' | 'No'
                     | 'Zs' | 'Zl' | 'Zp' | 'Cc' | 'Cf' | 'Cn' | 'Co' | 'Cs'
                     | 'Pd' | 'Ps' | 'Pe' | 'Pc' | 'Po'
                     | 'Sm' | 'Sc' | 'Sk' | 'So'
 block-name ::= (See above)
 other-properties ::= 'ALL' | 'ASSIGNED' | 'UNASSIGNED'
 character-1 ::= (any character except meta-characters)

 char-class ::= '[' ranges ']'
                | '(?[' ranges ']' ([-+&] '[' ranges ']')? ')'
 ranges ::= '^'? (range ','?)+
 range ::= '\d' | '\w' | '\s' | '\D' | '\W' | '\S' | category-block
           | range-char | range-char '-' range-char
 range-char ::= '\[' | '\]' | '\\' | '\' [,-efnrtv] | code-point | character-2
 code-point ::= '\x' hex-char hex-char
                | '\x{' hex-char+ '}'
                | '\v' hex-char hex-char hex-char hex-char hex-char hex-char
 hex-char ::= [0-9a-fA-F]
 character-2 ::= (any character except \[]-,)
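
As an illustration of the code-point production, the following sketch decodes the three escape forms (\xHH, \x{H...} and \vHHHHHH) into an integer code point. It is a standalone reading of the grammar, not the engine's own parser; the class CodePointSketch and its methods are hypothetical.

```java
// Sketch of the code-point production; not taken from the engine's parser.
class CodePointSketch {
    // Decodes a string that consists of exactly one code-point escape.
    static int decode(String escape) {
        if (escape.startsWith("\\x{") && escape.endsWith("}")) {
            // '\x{' hex-char+ '}' : one or more hex digits between the braces
            return parseHex(escape.substring(3, escape.length() - 1), 1, Integer.MAX_VALUE);
        }
        if (escape.startsWith("\\x")) {
            // '\x' hex-char hex-char : exactly two hex digits
            return parseHex(escape.substring(2), 2, 2);
        }
        if (escape.startsWith("\\v")) {
            // '\v' followed by exactly six hex digits
            return parseHex(escape.substring(2), 6, 6);
        }
        throw new IllegalArgumentException("Not a code-point escape: " + escape);
    }

    // Validates digit count and the [0-9a-fA-F] alphabet, then converts.
    private static int parseHex(String hex, int minDigits, int maxDigits) {
        if (hex.length() < minDigits || hex.length() > maxDigits) {
            throw new IllegalArgumentException("Wrong number of hex digits: " + hex);
        }
        for (int i = 0; i < hex.length(); i++) {
            char c = hex.charAt(i);
            boolean isHex = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
            if (!isHex) {
                throw new IllegalArgumentException("Not a hex digit: " + c);
            }
        }
        return Integer.parseInt(hex, 16);
    }

    public static void main(String[] args) {
        System.out.println(decode("\\x41"));      // prints 65
        System.out.println(decode("\\x{1F600}"));
        System.out.println(decode("\\v01F600"));
    }
}
```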
 

TODO


Author:
TAMURA Kent <kent@trl.ibm.co.jp>
See Also:
Serialized Form

Inner Class Summary
(package private) static class RegularExpression.Context
           
 
Field Summary
(package private) static int CARRIAGE_RETURN
           
(package private)  RegularExpression.Context context
           
(package private) static boolean DEBUG
           
(package private) static int EXTENDED_COMMENT
          "x"
(package private)  RangeToken firstChar
           
(package private)  String fixedString
           
(package private)  boolean fixedStringOnly
           
(package private)  int fixedStringOptions
           
(package private)  BMPattern fixedStringTable
           
(package private)  boolean hasBackReferences
           
(package private) static int IGNORE_CASE
          "i"
(package private) static int LINE_FEED
           
(package private) static int LINE_SEPARATOR
           
(package private)  int minlength
           
(package private) static int MULTIPLE_LINES
          "m"
(package private)  int nofparen
          The number of parentheses in the regular expression.
(package private)  int numberOfClosures
           
(package private)  Op operations
           
(package private)  int options
           
(package private) static int PARAGRAPH_SEPARATOR
           
(package private) static int PROHIBIT_FIXED_STRING_OPTIMIZATION
          "F"
(package private) static int PROHIBIT_HEAD_CHARACTER_OPTIMIZATION
          "H"
(package private)  String regex
          A regular expression.
(package private) static int SINGLE_LINE
          "s"
(package private) static int SPECIAL_COMMA
          ",".
(package private)  Token tokentree
          Internal representation of the regular expression.
(package private) static int UNICODE_WORD_BOUNDARY
          An option.
(package private) static int USE_UNICODE_CATEGORY
          This option redefines \d \D \w \W \s \S.
(package private) static Token wordchar
           
(package private) static int XMLSCHEMA_MODE
          "X".
 
Constructor Summary
  RegularExpression(String regex)
          Creates a new RegularExpression instance.
  RegularExpression(String regex, String options)
          Creates a new RegularExpression instance with options.
(package private) RegularExpression(String regex, Token tok, int parens, boolean hasBackReferences, int options)
           
 
Method Summary
 boolean equals(Object obj)
          Returns true if the patterns are the same and the options are equivalent.
(package private)  boolean equals(String pattern, int options)
           
 int getNumberOfGroups()
          Returns the number of regular expression groups.
 String getOptions()
          Returns an option string.
 String getPattern()
           
 int hashCode()
           
 boolean matches(char[] target)
          Checks whether the target text contains this pattern or not.
 boolean matches(char[] target, int start, int end)
          Checks whether the target text contains this pattern in the specified range or not.
 boolean matches(char[] target, int start, int end, Match match)
          Checks whether the target text contains this pattern in the specified range or not.
 boolean matches(char[] target, Match match)
          Checks whether the target text contains this pattern or not.
 boolean matches(CharacterIterator target)
          Checks whether the target text contains this pattern or not.
 boolean matches(CharacterIterator target, Match match)
          Checks whether the target text contains this pattern or not.
 boolean matches(String target)
          Checks whether the target text contains this pattern or not.
 boolean matches(String target, int start, int end)
          Checks whether the target text contains this pattern in the specified range or not.
 boolean matches(String target, int start, int end, Match match)
          Checks whether the target text contains this pattern in the specified range or not.
 boolean matches(String target, Match match)
          Checks whether the target text contains this pattern or not.
(package private)  void prepare()
          Prepares for matching.
 void setPattern(String newPattern)
           
 void setPattern(String newPattern, String options)
           
 String toString()
          Represents this instance as a String.
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

DEBUG

static final boolean DEBUG

regex

String regex
A regular expression.

options

int options

nofparen

int nofparen
The number of parentheses in the regular expression.

tokentree

Token tokentree
Internal representation of the regular expression.

hasBackReferences

boolean hasBackReferences

minlength

transient int minlength

operations

transient Op operations

numberOfClosures

transient int numberOfClosures

context

transient RegularExpression.Context context

firstChar

transient RangeToken firstChar

fixedString

transient String fixedString

fixedStringOptions

transient int fixedStringOptions

fixedStringTable

transient BMPattern fixedStringTable

fixedStringOnly

transient boolean fixedStringOnly

IGNORE_CASE

static final int IGNORE_CASE
"i"

SINGLE_LINE

static final int SINGLE_LINE
"s"

MULTIPLE_LINES

static final int MULTIPLE_LINES
"m"

EXTENDED_COMMENT

static final int EXTENDED_COMMENT
"x"

USE_UNICODE_CATEGORY

static final int USE_UNICODE_CATEGORY
This option redefines \d \D \w \W \s \S.
See Also:
#RegularExpression(java.lang.String,int), #setPattern(java.lang.String,int), UNICODE_WORD_BOUNDARY

UNICODE_WORD_BOUNDARY

static final int UNICODE_WORD_BOUNDARY
An option. This enables locale-independent word boundary processing for \b \B \< \>.

By default, the engine treats a position between a word character (\w) and a non-word character as a word boundary.

With this option, the engine checks word boundaries using the method of 'Unicode Regular Expression Guidelines' Revision 4.

See Also:
#RegularExpression(java.lang.String,int), #setPattern(java.lang.String,int)
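
The default rule described for UNICODE_WORD_BOUNDARY can be sketched as follows. This is a standalone illustration, not the engine's code. It assumes that \w means an ASCII letter, digit or underscore and that the start and end of the text count as non-word positions; both points are assumptions, and the class WordBoundarySketch is hypothetical.

```java
// Sketch of the default (non-Unicode) word boundary rule under stated assumptions.
class WordBoundarySketch {
    // ASSUMPTION: \w is an ASCII letter, an ASCII digit, or an underscore.
    static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || c == '_';
    }

    // True if the position between text[pos - 1] and text[pos] is a word boundary,
    // that is, exactly one of the two neighbours is a word character.
    // ASSUMPTION: positions outside the text behave like non-word characters.
    static boolean isWordBoundary(String text, int pos) {
        boolean wordBefore = pos > 0 && isWordChar(text.charAt(pos - 1));
        boolean wordAfter = pos < text.length() && isWordChar(text.charAt(pos));
        return wordBefore != wordAfter;
    }

    public static void main(String[] args) {
        String text = "ab cd";
        for (int pos = 0; pos <= text.length(); pos++) {
            System.out.println(pos + ": " + isWordBoundary(text, pos));
        }
    }
}
```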

PROHIBIT_HEAD_CHARACTER_OPTIMIZATION

static final int PROHIBIT_HEAD_CHARACTER_OPTIMIZATION
"H"

PROHIBIT_FIXED_STRING_OPTIMIZATION

static final int PROHIBIT_FIXED_STRING_OPTIMIZATION
"F"

XMLSCHEMA_MODE

static final int XMLSCHEMA_MODE
"X". XML Schema mode.

SPECIAL_COMMA

static final int SPECIAL_COMMA
",".

wordchar

static transient Token wordchar

LINE_FEED

static final int LINE_FEED

CARRIAGE_RETURN

static final int CARRIAGE_RETURN

LINE_SEPARATOR

static final int LINE_SEPARATOR

PARAGRAPH_SEPARATOR

static final int PARAGRAPH_SEPARATOR

Constructor Detail

RegularExpression

public RegularExpression(String regex)
                  throws ParseException
Creates a new RegularExpression instance.
Parameters:
regex - A regular expression
Throws:
ParseException - if regex does not conform to the syntax.

RegularExpression

public RegularExpression(String regex,
                         String options)
                  throws ParseException
Creates a new RegularExpression instance with options.
Parameters:
regex - A regular expression
options - A String consisting of "i" "m" "s" "u" "w" "," "X"
Throws:
ParseException - if regex does not conform to the syntax.

RegularExpression

RegularExpression(String regex,
                  Token tok,
                  int parens,
                  boolean hasBackReferences,
                  int options)

Method Detail

matches

public boolean matches(char[] target)
Checks whether the target text contains this pattern or not.
Returns:
true if the target is matched to this regular expression.

matches

public boolean matches(char[] target,
                       int start,
                       int end)
Checks whether the target text contains this pattern in the specified range or not.
Parameters:
start - Start offset of the range.
end - End offset of the range, exclusive (the index one past the last character).
Returns:
true if the target is matched to this regular expression.
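
The start and end parameters can be pictured with the following sketch, which restricts a search to a range using java.util.regex as a stand-in. It assumes that end is exclusive, matching the "end offset + 1" convention; how this engine treats anchors at the range edges is not covered. The class RangeMatchSketch is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in sketch: a range-limited search with java.util.regex, not this engine.
class RangeMatchSketch {
    // True if target contains a match for regex within [start, end).
    static boolean containsInRange(String regex, String target, int start, int end) {
        Matcher m = Pattern.compile(regex).matcher(target);
        m.region(start, end); // end is exclusive, one past the last character searched
        return m.find();
    }

    public static void main(String[] args) {
        System.out.println(containsInRange("b+", "aabbaa", 0, 2)); // only "aa" is searched
        System.out.println(containsInRange("b+", "aabbaa", 2, 4)); // only "bb" is searched
    }
}
```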

matches

public boolean matches(char[] target,
                       Match match)
Checks whether the target text contains this pattern or not.
Parameters:
match - A Match instance for storing matching result.
Returns:
true if the target is matched to this regular expression.

matches

public boolean matches(char[] target,
                       int start,
                       int end,
                       Match match)
Checks whether the target text contains this pattern in the specified range or not.
Parameters:
start - Start offset of the range.
end - End offset of the range, exclusive (the index one past the last character).
match - A Match instance for storing matching result.
Returns:
true if the target is matched to this regular expression.

matches

public boolean matches(String target)
Checks whether the target text contains this pattern or not.
Returns:
true if the target is matched to this regular expression.

matches

public boolean matches(String target,
                       int start,
                       int end)
Checks whether the target text contains this pattern in the specified range or not.
Parameters:
start - Start offset of the range.
end - End offset of the range, exclusive (the index one past the last character).
Returns:
true if the target is matched to this regular expression.

matches

public boolean matches(String target,
                       Match match)
Checks whether the target text contains this pattern or not.
Parameters:
match - A Match instance for storing matching result.
Returns:
true if the target is matched to this regular expression.

matches

public boolean matches(String target,
                       int start,
                       int end,
                       Match match)
Checks whether the target text contains this pattern in the specified range or not.
Parameters:
start - Start offset of the range.
end - End offset of the range, exclusive (the index one past the last character).
match - A Match instance for storing matching result.
Returns:
true if the target is matched to this regular expression.

matches

public boolean matches(CharacterIterator target)
Checks whether the target text contains this pattern or not.
Returns:
true if the target is matched to this regular expression.

matches

public boolean matches(CharacterIterator target,
                       Match match)
Checks whether the target text contains this pattern or not.
Parameters:
match - A Match instance for storing matching result.
Returns:
true if the target is matched to this regular expression.

prepare

void prepare()
Prepares for matching. This method is called just before starting matching.

setPattern

public void setPattern(String newPattern)
                throws ParseException

setPattern

public void setPattern(String newPattern,
                       String options)
                throws ParseException

getPattern

public String getPattern()

toString

public String toString()
Represents this instance as a String.
Overrides:
toString in class Object

getOptions

public String getOptions()
Returns an option string. The order of the letters in it may differ from the string specified in a constructor or setPattern().
See Also:
RegularExpression(java.lang.String,java.lang.String), setPattern(java.lang.String,java.lang.String)
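
Because the letter order returned by getOptions() may differ from the order that was passed in, comparing option strings with String.equals can give false negatives. A minimal sketch of an order-insensitive comparison follows; the class OptionStringSketch and its methods are hypothetical helpers, not part of this package, and treating repeated letters as one is an assumption.

```java
import java.util.TreeSet;

// Hypothetical helper for comparing option strings regardless of letter order.
class OptionStringSketch {
    // Returns the distinct option letters in sorted order.
    // ASSUMPTION: a repeated letter means the same as a single occurrence.
    static String normalize(String options) {
        TreeSet<Character> letters = new TreeSet<>();
        for (char c : options.toCharArray()) {
            letters.add(c);
        }
        StringBuilder sb = new StringBuilder();
        for (char c : letters) {
            sb.append(c);
        }
        return sb.toString();
    }

    // True if both strings name the same set of options.
    static boolean sameOptions(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        System.out.println(normalize("smi"));          // prints ims
        System.out.println(sameOptions("im", "mi"));   // prints true
    }
}
```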

equals

public boolean equals(Object obj)
Returns true if the patterns are the same and the options are equivalent.
Overrides:
equals in class Object

equals

boolean equals(String pattern,
               int options)

hashCode

public int hashCode()
Overrides:
hashCode in class Object

getNumberOfGroups

public int getNumberOfGroups()
Returns the number of regular expression groups. This method returns 1 when the regular expression has no capturing parentheses.
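
A return value of 1 for a pattern without capturing parentheses suggests that the whole match is counted as a group. The sketch below reproduces that counting with java.util.regex as a stand-in, where Matcher.groupCount() excludes the whole match and so needs 1 added. That this engine counts the same way for every pattern is an assumption, and the class GroupCountSketch is hypothetical.

```java
import java.util.regex.Pattern;

// Stand-in sketch: group counting that includes the whole match as one group.
class GroupCountSketch {
    // Number of capturing groups plus one for the whole match.
    static int numberOfGroups(String regex) {
        // groupCount() depends only on the pattern, so an empty input is enough.
        return Pattern.compile(regex).matcher("").groupCount() + 1;
    }

    public static void main(String[] args) {
        System.out.println(numberOfGroups("abc"));      // prints 1
        System.out.println(numberOfGroups("(a)(b)"));   // prints 3
        System.out.println(numberOfGroups("(?:a)(b)")); // prints 2
    }
}
```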


Copyright © 1999 The Apache Software Foundation. All Rights reserved.