Home | Libraries | People | FAQ | More |
Boost.Regex is intended to conform to the Technical Report on C++ Library Extensions.
All of the ECMAScript regular expression syntax features are supported, except that:
The escape sequence \u matches any upper case character (the same as [[:upper:]]) rather than a Unicode escape sequence; use \x{DDDD} for Unicode escape sequences.
Almost all Perl features are supported, except for:
(?{code}) Not implementable in a compiled strongly typed language.
(??{code}) Not implementable in a compiled strongly typed language.
(*VERB) The backtracking control verbs are not recognised or implemented at this time.
In addition the following features behave slightly differently from Perl:
^ $ \Z These recognise any line termination sequence, and not just \n: see the Unicode requirements below.
All the POSIX basic and extended regular expression features are supported, except that:
No character collating names are recognized except those specified in the POSIX standard for the C locale, unless they are explicitly registered with the traits class.
Character equivalence classes ( [[=a=]] etc) are probably buggy except on Win32. Implementing this feature requires knowledge of the format of the string sort keys produced by the system; if you need this, and the default implementation doesn't work on your platform, then you will need to supply a custom traits class.
The following comments refer to Unicode Technical Standard #18: Unicode Regular Expressions version 11.
Item |
Feature |
Support |
---|---|---|
1.1 |
Hex Notation |
Yes: use \x{DDDD} to refer to code point UDDDD. |
1.2 |
Character Properties |
All the names listed under the General Category Property are supported. Script names and Other Names are not currently supported. |
1.3 |
Subtraction and Intersection |
Indirectly support by forward-lookahead:
Gives the intersection of character properties X and Y.
Gives everything in Y that is not in X (subtraction). |
1.4 |
Simple Word Boundaries |
Conforming: non-spacing marks are included in the set of word characters. |
1.5 |
Caseless Matching |
Supported, note that at this level, case transformations are 1:1, many to many case folding operations are not supported (for example "ß" to "SS"). |
1.6 |
Line Boundaries |
Supported, except that "." matches only one character of "\r\n". Other than that word boundaries match correctly; including not matching in the middle of a "\r\n" sequence. |
1.7 |
Code Points |
Supported: provided you use the u32* algorithms, then UTF-8, UTF-16 and UTF-32 are all treated as sequences of 32-bit code points. |
2.1 |
Canonical Equivalence |
Not supported: it is up to the user of the library to convert all text into the same canonical form as the regular expression. |
2.2 |
Default Grapheme Clusters |
Not supported. |
2.3Default Word Boundaries |
Not supported. |
|
2.4 |
Default Loose Matches |
Not Supported. |
2.5 |
Named Properties |
Supported: the expression "[[:name:]]" or \N{name} matches the named character "name". |
2.6 |
Wildcard properties |
Not Supported. |
3.1 |
Tailored Punctuation. |
Not Supported. |
3.2 |
Tailored Grapheme Clusters |
Not Supported. |
3.3 |
Tailored Word Boundaries. |
Not Supported. |
3.4 |
Tailored Loose Matches |
Partial support: [[=c=]] matches characters with the same primary equivalence class as "c". |
3.5 |
Tailored Ranges |
Supported: [a-b] matches any character that collates in the range a to b, when the expression is constructed with the collate flag set. |
3.6 |
Context Matches |
Not Supported. |
3.7 |
Incremental Matches |
Supported: pass the flag |
3.8 |
Unicode Set Sharing |
Not Supported. |
3.9 |
Possible Match Sets |
Not supported, however this information is used internally to optimise the matching of regular expressions, and return quickly if no match is possible. |
3.10 |
Folded Matching |
Partial Support: It is possible to achieve a similar effect by using a custom regular expression traits class. |
3.11 |
Custom Submatch Evaluation |
Not Supported. |