POSIX character classes

Perl supports the POSIX notation for character classes. This uses names enclosed by [: and :] within the enclosing square brackets. PCRE also supports this notation. For example,

 

  [01[:alpha:]%]

 

matches "0", "1", any alphabetic character, or "%". The supported class names are:

 

  alnum    letters and digits

  alpha    letters

  ascii    character codes 0 - 127

  blank    space or tab only

  cntrl    control characters

  digit    decimal digits (same as \d)

  graph    printing characters, excluding space

  lower    lower case letters

  print    printing characters, including space

  punct    printing characters, excluding letters and digits and space

  space    white space (the same as \s from PCRE 8.34)

  upper    upper case letters

  word     "word" characters (same as \w)

  xdigit   hexadecimal digits
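
As a rough illustration of the list above, the following Python sketch builds each class as a set of ASCII characters and then tests membership for the example [01[:alpha:]%]. It assumes the default "C" locale; the names POSIX_ASCII and in_example_class are hypothetical helpers invented for this illustration and are not part of PCRE. The extent of "cntrl" (codes 0 - 31 and 127) is likewise an assumption based on the "C" locale.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)
# Visible ASCII characters (codes 33 - 126), i.e. printing characters without space.
PRINTING = {chr(c) for c in range(33, 127)}

# Hypothetical table: ASCII-only membership for each class name ("C" locale assumed).
POSIX_ASCII = {
    "alnum": LETTERS | DIGITS,
    "alpha": LETTERS,
    "ascii": {chr(c) for c in range(128)},
    "blank": {" ", "\t"},
    "cntrl": {chr(c) for c in range(32)} | {chr(127)},  # assumption: "C" locale
    "digit": DIGITS,
    "graph": PRINTING,
    "lower": set(string.ascii_lowercase),
    "print": PRINTING | {" "},
    "punct": PRINTING - LETTERS - DIGITS,
    "space": set("\t\n\v\f\r "),
    "upper": set(string.ascii_uppercase),
    "word": LETTERS | DIGITS | {"_"},
    "xdigit": set(string.hexdigits),
}

def in_example_class(ch):
    """Emulate [01[:alpha:]%] for a single character."""
    return ch in "01%" or ch in POSIX_ASCII["alpha"]
```

With these definitions, in_example_class("q") and in_example_class("%") are true, while in_example_class("2") is false.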

 

The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and space (32). If locale-specific matching is taking place, the list of space characters may be different; there may be fewer or more of them. "Space" used to be different to \s, which did not include VT, for Perl compatibility. However, Perl changed at release 5.18, and PCRE followed at release 8.34. "Space" and \s now match the same set of characters.
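
The difference described above comes down to a single code point. A minimal Python sketch, using hypothetical names and assuming no locale-specific tables are in use:

```python
# Code points of the default "space" class, as listed above.
POSIX_SPACE = {9, 10, 11, 12, 13, 32}  # HT, LF, VT, FF, CR, space

# Before PCRE 8.34 (and Perl 5.18), \s left out VT (11).
OLD_BACKSLASH_S = POSIX_SPACE - {11}

def is_posix_space(ch):
    """True if ch is one of the default "space" characters."""
    return ord(ch) in POSIX_SPACE
```

Here is_posix_space("\v") is true, while 11 is absent from OLD_BACKSLASH_S.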

 

The name "word" is a Perl extension, and "blank" is a GNU extension that Perl has supported since release 5.8. Another Perl extension is negation, which is indicated by a ^ character after the colon. For example,

 

  [12[:^digit:]]


matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not supported, and an error is given if they are encountered.
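
To make the negation and the collating-element error concrete, here is a toy Python parser for the body of a bracket expression. It is only a sketch and not how PCRE works internally: it handles literal characters, [:name:] and [:^name:] for three ASCII-only classes, and raises an error for [.ch.] and [=ch=]. Ranges and escapes are deliberately left out, and the names CLASS_TESTS, TOKEN and compile_bracket are hypothetical.

```python
import re
import string

# ASCII-only predicates for a few class names ("C" locale assumed).
CLASS_TESTS = {
    "alpha": lambda ch: ch in string.ascii_letters,
    "digit": lambda ch: ch in string.digits,
    "space": lambda ch: ch in "\t\n\v\f\r ",
}

# One token of a bracket body: [:name:], [:^name:], [.x.], [=x=], or a literal.
TOKEN = re.compile(r"\[:(\^?)(\w+):\]|\[([.=]).*?\3\]|(.)", re.S)

def compile_bracket(body):
    """Return a predicate for a bracket body such as '12[:^digit:]'."""
    tests = []
    for m in TOKEN.finditer(body):
        negate, name, collating, literal = m.groups()
        if collating:
            raise ValueError("collating elements are not supported")
        if name:
            test = CLASS_TESTS.get(name)
            if test is None:
                raise ValueError("unknown POSIX class name: " + name)
            # [:^name:] inverts the test for that one class only.
            tests.append((lambda ch, t=test: not t(ch)) if negate else test)
        else:
            tests.append(lambda ch, lit=literal: ch == lit)
    # A character matches the bracket if any single item accepts it.
    return lambda ch: any(t(ch) for t in tests)
```

For example, compile_bracket("12[:^digit:]") accepts "1", "2" and "x" but rejects "3", and compile_bracket("[.ch.]") raises ValueError.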

 

By default, characters with values greater than 127 do not match any of the POSIX character classes. However, if the PCRE_UCP option is passed to pcre_compile(), some of the classes are changed so that Unicode character properties are used. This is achieved by replacing certain POSIX classes by other sequences, as follows:

 

  [:alnum:]  becomes  \p{Xan}

  [:alpha:]  becomes  \p{L}

  [:blank:]  becomes  \h

  [:digit:]  becomes  \p{Nd}

  [:lower:]  becomes  \p{Ll}

  [:space:]  becomes  \p{Xps}

  [:upper:]  becomes  \p{Lu}

  [:word:]   becomes  \p{Xwd}
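
The replacements listed above can be approximated in Python with the standard unicodedata module, which exposes each character's general category. This is a sketch under stated assumptions, not PCRE's implementation: it treats \p{Xan} as "L or N", \p{Xwd} as \p{Xan} plus underscore, and \p{Xps} as "Z, or HT, LF, VT, FF, CR" (the definitions given for these properties in the PCRE documentation). It omits [:blank:] (\h), and Python's Unicode tables may be a different version from the ones PCRE uses. The names UCP_APPROX and ucp_match are hypothetical.

```python
import unicodedata

def _cat(ch):
    """Two-letter Unicode general category, e.g. 'Lu' or 'Nd'."""
    return unicodedata.category(ch)

# Hypothetical table approximating the UCP replacements for each class.
UCP_APPROX = {
    "alnum": lambda ch: _cat(ch)[0] in "LN",                       # \p{Xan}
    "alpha": lambda ch: _cat(ch)[0] == "L",                        # \p{L}
    "digit": lambda ch: _cat(ch) == "Nd",                          # \p{Nd}
    "lower": lambda ch: _cat(ch) == "Ll",                          # \p{Ll}
    "space": lambda ch: ch in "\t\n\v\f\r" or _cat(ch)[0] == "Z",  # \p{Xps}
    "upper": lambda ch: _cat(ch) == "Lu",                          # \p{Lu}
    "word": lambda ch: ch == "_" or _cat(ch)[0] in "LN",           # \p{Xwd}
}

def ucp_match(name, ch, negate=False):
    """Test ch against a class; negate=True mimics the [:^name:] form."""
    result = UCP_APPROX[name](ch)
    return not result if negate else result
```

For instance, ucp_match("alpha", "\u00e9") is true for the letter e with acute accent, a character that the default ASCII-only [:alpha:] would not match.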


Negated versions, such as [:^alpha:], use \P instead of \p. Three other POSIX classes are handled specially in UCP mode:

[:graph:] This matches characters that have glyphs that mark the page when printed. In Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf properties, except for:

 

  U+061C           Arabic Letter Mark

  U+180E           Mongolian Vowel Separator

  U+2066 - U+2069  Various "isolate"s

 

[:print:] This matches the same characters as [:graph:] plus space characters that are not controls, that is, characters with the Zs property.

[:punct:] This matches all characters that have the Unicode P (punctuation) property, plus those characters whose code points are less than 128 that have the S (Symbol) property.

 

The other POSIX classes are unchanged, and match only characters with code points less than 128.
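
The three special rules above can be sketched in Python with the unicodedata module. The function names are hypothetical, and the results depend on the Unicode version bundled with Python, which may differ from PCRE's tables, so treat this as an approximation of the rules as written rather than a reproduction of PCRE's behaviour.

```python
import unicodedata

# Code points excluded from [:graph:] in UCP mode, as listed above.
GRAPH_EXCEPTIONS = {0x061C, 0x180E, *range(0x2066, 0x206A)}

def ucp_graph(ch):
    """L, M, N, P, S, or Cf, minus the listed exceptions."""
    cat = unicodedata.category(ch)
    has_glyph = cat[0] in "LMNPS" or cat == "Cf"
    return has_glyph and ord(ch) not in GRAPH_EXCEPTIONS

def ucp_print(ch):
    """Everything in [:graph:], plus characters with the Zs property."""
    return ucp_graph(ch) or unicodedata.category(ch) == "Zs"

def ucp_punct(ch):
    """Any P character, plus S characters with code points below 128."""
    cat = unicodedata.category(ch)
    return cat[0] == "P" or (ord(ch) < 128 and cat[0] == "S")
```

Under these rules "+" (a Symbol below 128) counts as punctuation, while U+00B1 (plus-minus sign, also a Symbol but above 127) does not.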

Philip Hazel

University Computing Service

Cambridge CB2 3QH, England.

Last updated: 12 November 2013

Copyright © 1997-2013 University of Cambridge.

Copyright © 2024 Rickard Johansson