diff --git a/tools/pcre/doc/html/pcre.html b/tools/pcre/doc/html/pcre.html
index edb7479a..c2b29aa8 100644
--- a/tools/pcre/doc/html/pcre.html
+++ b/tools/pcre/doc/html/pcre.html
@@ -38,9 +38,9 @@ Herczeg.
Starting with release 8.32 it is possible to compile a third separate PCRE
-library, which supports 32-bit character strings (including
-UTF-32 strings). The build process allows any set of the 8-, 16- and 32-bit
-libraries. The work to make this possible was done by Christian Persch.
+library that supports 32-bit character strings (including UTF-32 strings). The
+build process allows any combination of the 8-, 16- and 32-bit libraries. The
+work to make this possible was done by Christian Persch.
The three libraries contain identical sets of functions, except that the names
@@ -62,7 +62,7 @@ The current implementation of PCRE corresponds approximately with Perl 5.12,
including support for UTF-8/16/32 encoded strings and Unicode general category
properties. However, UTF-8/16/32 and Unicode support has to be explicitly
enabled; it is not the default. The Unicode tables correspond to Unicode
-release 6.2.0.
+release 6.3.0.
In addition to the Perl-compatible matching function, PCRE contains an
@@ -100,8 +100,11 @@ function makes it possible for a client to discover which features are
available. The features themselves are described in the
pcrebuild
page. Documentation about building PCRE for various operating systems can be
-found in the README and NON-AUTOTOOLS_BUILD files in the source
-distribution.
+found in the
+README
+and
+NON-AUTOTOOLS_BUILD
+files in the source distribution.
The libraries contain a number of undocumented internal functions and data
@@ -126,8 +129,11 @@ use sufficiently many resources as to cause your application to lose
performance.
-The best way of guarding against this possibility is to use the
+One way of guarding against this possibility is to use the
pcre_fullinfo() function to check the compiled pattern's options for UTF.
+Alternatively, from release 8.33, you can set the PCRE_NEVER_UTF option at
+compile time. This causes a compile-time error if a pattern contains a
+UTF-setting sequence.
If your application is one that supports UTF, be aware that validity checking
@@ -148,15 +154,18 @@ page.
The user documentation for PCRE comprises a number of different sections. In
the "man" format, each of these is a separate "man page". In the HTML format,
each is a separate page, linked from the index page. In the plain text format,
-all the sections, except the pcredemo section, are concatenated, for ease
-of searching. The sections are as follows:
+the descriptions of the pcregrep and pcretest programs are in files
+called pcregrep.txt and pcretest.txt, respectively. The remaining
+sections, except for the pcredemo section (which is a program listing),
+are concatenated in pcre.txt, for ease of searching. The sections are as
+follows:
@@ -259,8 +259,9 @@ buffer, including the zero terminator if the string was zero-terminated.
-The offsets within subject strings that are returned by the matching functions
-are in 16-bit units rather than bytes.
+The lengths and starting offsets of subject strings must be specified in 16-bit
+data units, and the offsets within subject strings that are returned by the
+matching functions are also in 16-bit units rather than bytes.
+This page is part of the PCRE HTML documentation. It was generated automatically
+from the original man page. If there is any nonsense in it, please consult the
+man page, in case the conversion went wrong.
+
+
+Starting with release 8.32, it is possible to compile a PCRE library that
+supports 32-bit character strings, including UTF-32 strings, as well as or
+instead of the original 8-bit library. This work was done by Christian Persch,
+based on the work done by Zoltan Herczeg for the 16-bit library. All three
+libraries contain identical sets of functions, used in exactly the same way.
+Only the names of the functions and the data types of their arguments and
+results are different. To avoid over-complication and reduce the documentation
+maintenance load, most of the PCRE documentation describes the 8-bit library,
+with only occasional references to the 16-bit and 32-bit libraries. This page
+describes what is different when you use the 32-bit library.
+
+
+WARNING: A single application can be linked with all or any of the three
+libraries, but you must take care when processing any particular pattern
+to use functions from just one library. For example, if you want to study
+a pattern that was compiled with pcre32_compile(), you must do so
+with pcre32_study(), not pcre_study(), and you must free the
+study data with pcre32_free_study().
+
+
+In the 8-bit library, strings are passed to PCRE library functions as vectors
+of bytes with the C type "char *". In the 32-bit library, strings are passed as
+vectors of unsigned 32-bit quantities. The macro PCRE_UCHAR32 specifies an
+appropriate data type, and PCRE_SPTR32 is defined as "const PCRE_UCHAR32 *". In
+very many environments, "unsigned int" is a 32-bit data type. When PCRE is
+built, it defines PCRE_UCHAR32 as "unsigned int", but checks that it really is
+a 32-bit data type. If it is not, the build fails with an error message telling
+the maintainer to modify the definition appropriately.
+
+
+The types of the opaque structures that are used for compiled 32-bit patterns
+and JIT stacks are pcre32 and pcre32_jit_stack respectively. The
+type of the user-accessible structure that is returned by pcre32_study()
+is pcre32_extra, and the type of the structure that is used for passing
+data to a callout function is pcre32_callout_block. These structures
+contain the same fields, with the same names, as their 8-bit counterparts. The
+only difference is that pointers to character strings are 32-bit instead of
+8-bit types.
+
+
+For every function in the 8-bit library there is a corresponding function in
+the 32-bit library with a name that starts with pcre32_ instead of
+pcre_. The prototypes are listed above. In addition, there is one extra
+function, pcre32_utf32_to_host_byte_order(). This is a utility function
+that converts a UTF-32 character string to host byte order if necessary. The
+other 32-bit functions expect the strings they are passed to be in host byte
+order.
+
+
+The result of the function is the number of 32-bit units placed into the output
+buffer, including the zero terminator if the string was zero-terminated.
+
+
+The lengths and starting offsets of subject strings must be specified in 32-bit
+data units, and the offsets within subject strings that are returned by the
+matching functions are also in 32-bit units rather than bytes.
+
+
+The name-to-number translation table that is maintained for named subpatterns
+uses 32-bit characters. The pcre32_get_stringtable_entries() function
+returns the length of each entry in the table as the number of 32-bit data
+units.
+
+
+There are two new general option names, PCRE_UTF32 and PCRE_NO_UTF32_CHECK,
+which correspond to PCRE_UTF8 and PCRE_NO_UTF8_CHECK in the 8-bit library. In
+fact, these new options define the same bits in the options word. There is a
+discussion about the
+validity of UTF-32 strings
+in the
+pcreunicode
+page.
+
+
+In 32-bit mode, when PCRE_UTF32 is not set, character values are treated in the
+same way as in 8-bit, non UTF-8 mode, except, of course, that they can range
+from 0 to 0x7fffffff instead of 0 to 0xff. Character types for characters less
+than 0xff can therefore be influenced by the locale in the same way as before.
+Characters greater than 0xff have only one case, and no "type" (such as letter
+or digit).
+
+
+In UTF-32 mode, the character code is Unicode, in the range 0 to 0x10ffff, with
+the exception of values in the range 0xd800 to 0xdfff because those are
+"surrogate" values that are ill-formed in UTF-32.
+
+
+A UTF-32 string can indicate its endianness by a special code known as a
+byte-order mark (BOM). The PCRE functions do not handle this, expecting strings
+to be in host byte order. A utility function called
+pcre32_utf32_to_host_byte_order() is provided to help with this (see
+above).
+
+
+The error PCRE_ERROR_BADUTF32 corresponds to its 8-bit counterpart.
+The error PCRE_ERROR_BADMODE is given when a compiled
+pattern is passed to a function that processes patterns in the other
+mode, for example, if a pattern compiled with pcre_compile() is passed to
+pcre32_exec().
+
+
+There are new error codes whose names begin with PCRE_UTF32_ERR for invalid
+UTF-32 strings, corresponding to the PCRE_UTF8_ERR codes for UTF-8 strings that
+are described in the section entitled
+"Reason codes for invalid UTF-8 strings"
+in the main
+pcreapi
+page. The UTF-32 errors are:
+
+If there is an error while compiling a pattern, the error text that is passed
+back by pcre32_compile() or pcre32_compile2() is still an 8-bit
+character string, zero-terminated.
+
+
+Not all the features of the 8-bit library are available with the 32-bit
+library. The C++ and POSIX wrapper functions support only the 8-bit library,
+and the pcregrep program is at present 8-bit only.
+
+
DESCRIPTION
@@ -50,16 +50,17 @@ are:
extra Points to an associated pcre[16|32]_extra structure,
or is NULL
subject Points to the subject string
- length Length of the subject string, in bytes
- startoffset Offset in bytes in the subject at which to
- start matching
+ length Length of the subject string
+ startoffset Offset in the subject at which to start matching
options Option bits
ovector Points to a vector of ints for result offsets
ovecsize Number of elements in the vector
workspace Points to a vector of ints used as working space
wscount Number of elements in the vector
-The options are:
+The units for length and startoffset are bytes for
+pcre_exec(), 16-bit data items for pcre16_exec(), and 32-bit items
+for pcre32_exec(). The options are:
PCRE_ANCHORED Match only at the first position
PCRE_BSR_ANYCRLF \R matches only CR, LF, or CRLF
diff --git a/tools/pcre/doc/html/pcre_exec.html b/tools/pcre/doc/html/pcre_exec.html
index e4ddf9a8..18e1a13f 100644
--- a/tools/pcre/doc/html/pcre_exec.html
+++ b/tools/pcre/doc/html/pcre_exec.html
@@ -20,18 +20,18 @@ SYNOPSIS
int pcre_exec(const pcre *code, const pcre_extra *extra,
-const char *subject, int length, int startoffset,
-int options, int *ovector, int ovecsize);
-
-
+ const char *subject, int length, int startoffset,
+ int options, int *ovector, int ovecsize);
+
+
int pcre16_exec(const pcre16 *code, const pcre16_extra *extra,
-PCRE_SPTR16 subject, int length, int startoffset,
-int options, int *ovector, int ovecsize);
-
-
+ PCRE_SPTR16 subject, int length, int startoffset,
+ int options, int *ovector, int ovecsize);
+
+
int pcre32_exec(const pcre32 *code, const pcre32_extra *extra,
-PCRE_SPTR32 subject, int length, int startoffset,
-int options, int *ovector, int ovecsize);
+ PCRE_SPTR32 subject, int length, int startoffset,
+ int options, int *ovector, int ovecsize);
DESCRIPTION
@@ -45,14 +45,15 @@ offsets to captured substrings. Its arguments are:
extra Points to an associated pcre[16|32]_extra structure,
or is NULL
subject Points to the subject string
- length Length of the subject string, in bytes
- startoffset Offset in bytes in the subject at which to
- start matching
+ length Length of the subject string
+ startoffset Offset in the subject at which to start matching
options Option bits
ovector Points to a vector of ints for result offsets
ovecsize Number of elements in the vector (a multiple of 3)
-The options are:
+The units for length and startoffset are bytes for
+pcre_exec(), 16-bit data items for pcre16_exec(), and 32-bit items
+for pcre32_exec(). The options are:
PCRE_ANCHORED Match only at the first position
PCRE_BSR_ANYCRLF \R matches only CR, LF, or CRLF
diff --git a/tools/pcre/doc/html/pcre_fullinfo.html b/tools/pcre/doc/html/pcre_fullinfo.html
index d353432b..b88fc115 100644
--- a/tools/pcre/doc/html/pcre_fullinfo.html
+++ b/tools/pcre/doc/html/pcre_fullinfo.html
@@ -20,15 +20,15 @@ SYNOPSIS
int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
-int what, void *where);
-
-
+ int what, void *where);
+
+
int pcre16_fullinfo(const pcre16 *code, const pcre16_extra *extra,
-int what, void *where);
-
-
+ int what, void *where);
+
+
int pcre32_fullinfo(const pcre32 *code, const pcre32_extra *extra,
-int what, void *where);
+ int what, void *where);
DESCRIPTION
diff --git a/tools/pcre/doc/html/pcre_get_named_substring.html b/tools/pcre/doc/html/pcre_get_named_substring.html
index 6150ad71..72924d9b 100644
--- a/tools/pcre/doc/html/pcre_get_named_substring.html
+++ b/tools/pcre/doc/html/pcre_get_named_substring.html
@@ -20,21 +20,21 @@ SYNOPSIS
int pcre_get_named_substring(const pcre *code,
-const char *subject, int *ovector,
-int stringcount, const char *stringname,
-const char **stringptr);
-
-
+ const char *subject, int *ovector,
+ int stringcount, const char *stringname,
+ const char **stringptr);
+
+
int pcre16_get_named_substring(const pcre16 *code,
-PCRE_SPTR16 subject, int *ovector,
-int stringcount, PCRE_SPTR16 stringname,
-PCRE_SPTR16 *stringptr);
-
-
+ PCRE_SPTR16 subject, int *ovector,
+ int stringcount, PCRE_SPTR16 stringname,
+ PCRE_SPTR16 *stringptr);
+
+
int pcre32_get_named_substring(const pcre32 *code,
-PCRE_SPTR32 subject, int *ovector,
-int stringcount, PCRE_SPTR32 stringname,
-PCRE_SPTR32 *stringptr);
+ PCRE_SPTR32 subject, int *ovector,
+ int stringcount, PCRE_SPTR32 stringname,
+ PCRE_SPTR32 *stringptr);
DESCRIPTION
diff --git a/tools/pcre/doc/html/pcre_get_stringnumber.html b/tools/pcre/doc/html/pcre_get_stringnumber.html
index 08967de3..7324d782 100644
--- a/tools/pcre/doc/html/pcre_get_stringnumber.html
+++ b/tools/pcre/doc/html/pcre_get_stringnumber.html
@@ -20,15 +20,15 @@ SYNOPSIS
int pcre_get_stringnumber(const pcre *code,
-const char *name);
-
-
+ const char *name);
+
+
int pcre16_get_stringnumber(const pcre16 *code,
-PCRE_SPTR16 name);
-
-
+ PCRE_SPTR16 name);
+
+
int pcre32_get_stringnumber(const pcre32 *code,
-PCRE_SPTR32 name);
+ PCRE_SPTR32 name);
DESCRIPTION
diff --git a/tools/pcre/doc/html/pcre_get_stringtable_entries.html b/tools/pcre/doc/html/pcre_get_stringtable_entries.html
index 38f9c0c9..79906798 100644
--- a/tools/pcre/doc/html/pcre_get_stringtable_entries.html
+++ b/tools/pcre/doc/html/pcre_get_stringtable_entries.html
@@ -20,15 +20,15 @@ SYNOPSIS
int pcre_get_stringtable_entries(const pcre *code,
-const char *name, char **first, char **last);
-
-
+ const char *name, char **first, char **last);
+
+
int pcre16_get_stringtable_entries(const pcre16 *code,
-PCRE_SPTR16 name, PCRE_UCHAR16 **first, PCRE_UCHAR16 **last);
-
-
+ PCRE_SPTR16 name, PCRE_UCHAR16 **first, PCRE_UCHAR16 **last);
+
+
int pcre32_get_stringtable_entries(const pcre32 *code,
-PCRE_SPTR32 name, PCRE_UCHAR32 **first, PCRE_UCHAR32 **last);
+ PCRE_SPTR32 name, PCRE_UCHAR32 **first, PCRE_UCHAR32 **last);
DESCRIPTION
diff --git a/tools/pcre/doc/html/pcre_get_substring.html b/tools/pcre/doc/html/pcre_get_substring.html
index 2a5a610f..1a8e4f5a 100644
--- a/tools/pcre/doc/html/pcre_get_substring.html
+++ b/tools/pcre/doc/html/pcre_get_substring.html
@@ -20,18 +20,18 @@ SYNOPSIS
int pcre_get_substring(const char *subject, int *ovector,
-int stringcount, int stringnumber,
-const char **stringptr);
-
-
+ int stringcount, int stringnumber,
+ const char **stringptr);
+
+
int pcre16_get_substring(PCRE_SPTR16 subject, int *ovector,
-int stringcount, int stringnumber,
-PCRE_SPTR16 *stringptr);
-
-
+ int stringcount, int stringnumber,
+ PCRE_SPTR16 *stringptr);
+
+
int pcre32_get_substring(PCRE_SPTR32 subject, int *ovector,
-int stringcount, int stringnumber,
-PCRE_SPTR32 *stringptr);
+ int stringcount, int stringnumber,
+ PCRE_SPTR32 *stringptr);
DESCRIPTION
diff --git a/tools/pcre/doc/html/pcre_get_substring_list.html b/tools/pcre/doc/html/pcre_get_substring_list.html
index 85edef4b..7e8c6bc8 100644
--- a/tools/pcre/doc/html/pcre_get_substring_list.html
+++ b/tools/pcre/doc/html/pcre_get_substring_list.html
@@ -20,15 +20,15 @@ SYNOPSIS
int pcre_get_substring_list(const char *subject,
-int *ovector, int stringcount, const char ***listptr);
-
-
+ int *ovector, int stringcount, const char ***listptr);
+
+
int pcre16_get_substring_list(PCRE_SPTR16 subject,
-int *ovector, int stringcount, PCRE_SPTR16 **listptr);
-
-
+ int *ovector, int stringcount, PCRE_SPTR16 **listptr);
+
+
int pcre32_get_substring_list(PCRE_SPTR32 subject,
-int *ovector, int stringcount, PCRE_SPTR32 **listptr);
+ int *ovector, int stringcount, PCRE_SPTR32 **listptr);
DESCRIPTION
diff --git a/tools/pcre/doc/html/pcre_jit_exec.html b/tools/pcre/doc/html/pcre_jit_exec.html
index 0c63503a..4ebb0cbc 100644
--- a/tools/pcre/doc/html/pcre_jit_exec.html
+++ b/tools/pcre/doc/html/pcre_jit_exec.html
@@ -20,21 +20,21 @@ SYNOPSIS
int pcre_jit_exec(const pcre *code, const pcre_extra *extra,
-const char *subject, int length, int startoffset,
-int options, int *ovector, int ovecsize,
-pcre_jit_stack *jstack);
-
-
+ const char *subject, int length, int startoffset,
+ int options, int *ovector, int ovecsize,
+ pcre_jit_stack *jstack);
+
+
int pcre16_jit_exec(const pcre16 *code, const pcre16_extra *extra,
-PCRE_SPTR16 subject, int length, int startoffset,
-int options, int *ovector, int ovecsize,
-pcre_jit_stack *jstack);
-
-
+ PCRE_SPTR16 subject, int length, int startoffset,
+ int options, int *ovector, int ovecsize,
+ pcre_jit_stack *jstack);
+
+
int pcre32_jit_exec(const pcre32 *code, const pcre32_extra *extra,
-PCRE_SPTR32 subject, int length, int startoffset,
-int options, int *ovector, int ovecsize,
-pcre_jit_stack *jstack);
+ PCRE_SPTR32 subject, int length, int startoffset,
+ int options, int *ovector, int ovecsize,
+ pcre_jit_stack *jstack);
DESCRIPTION
diff --git a/tools/pcre/doc/html/pcre_jit_stack_alloc.html b/tools/pcre/doc/html/pcre_jit_stack_alloc.html
index 4153ee59..23ba4507 100644
--- a/tools/pcre/doc/html/pcre_jit_stack_alloc.html
+++ b/tools/pcre/doc/html/pcre_jit_stack_alloc.html
@@ -20,15 +20,15 @@ SYNOPSIS
pcre_jit_stack *pcre_jit_stack_alloc(int startsize,
-int maxsize);
-
-
+ int maxsize);
+
+
pcre16_jit_stack *pcre16_jit_stack_alloc(int startsize,
-int maxsize);
-
-
+ int maxsize);
+
+
pcre32_jit_stack *pcre32_jit_stack_alloc(int startsize,
-int maxsize);
+ int maxsize);
DESCRIPTION
diff --git a/tools/pcre/doc/html/pcre_pattern_to_host_byte_order.html b/tools/pcre/doc/html/pcre_pattern_to_host_byte_order.html
index 68d6f5a1..1b1c8037 100644
--- a/tools/pcre/doc/html/pcre_pattern_to_host_byte_order.html
+++ b/tools/pcre/doc/html/pcre_pattern_to_host_byte_order.html
@@ -20,15 +20,15 @@ SYNOPSIS
int pcre_pattern_to_host_byte_order(pcre *code,
-pcre_extra *extra, const unsigned char *tables);
-
-
+ pcre_extra *extra, const unsigned char *tables);
+
+
int pcre16_pattern_to_host_byte_order(pcre16 *code,
-pcre16_extra *extra, const unsigned char *tables);
-
-
+ pcre16_extra *extra, const unsigned char *tables);
+
+
int pcre32_pattern_to_host_byte_order(pcre32 *code,
-pcre32_extra *extra, const unsigned char *tables);
+ pcre32_extra *extra, const unsigned char *tables);
DESCRIPTION
diff --git a/tools/pcre/doc/html/pcre_study.html b/tools/pcre/doc/html/pcre_study.html
index 2baf54c4..af82f114 100644
--- a/tools/pcre/doc/html/pcre_study.html
+++ b/tools/pcre/doc/html/pcre_study.html
@@ -20,15 +20,15 @@ SYNOPSIS
pcre_extra *pcre_study(const pcre *code, int options,
-const char **errptr);
-
-
+ const char **errptr);
+
+
pcre16_extra *pcre16_study(const pcre16 *code, int options,
-const char **errptr);
-
-
+ const char **errptr);
+
+
pcre32_extra *pcre32_study(const pcre32 *code, int options,
-const char **errptr);
+ const char **errptr);
DESCRIPTION
diff --git a/tools/pcre/doc/html/pcre_utf16_to_host_byte_order.html b/tools/pcre/doc/html/pcre_utf16_to_host_byte_order.html
index 164e2365..18e7788f 100644
--- a/tools/pcre/doc/html/pcre_utf16_to_host_byte_order.html
+++ b/tools/pcre/doc/html/pcre_utf16_to_host_byte_order.html
@@ -20,8 +20,8 @@ SYNOPSIS
int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *output,
-PCRE_SPTR16 input, int length, int *host_byte_order,
-int keep_boms);
+ PCRE_SPTR16 input, int length, int *host_byte_order,
+ int keep_boms);
DESCRIPTION
diff --git a/tools/pcre/doc/html/pcre_utf32_to_host_byte_order.html b/tools/pcre/doc/html/pcre_utf32_to_host_byte_order.html
new file mode 100644
index 00000000..772ae40c
--- /dev/null
+++ b/tools/pcre/doc/html/pcre_utf32_to_host_byte_order.html
@@ -0,0 +1,57 @@
+
+
+pcre_utf32_to_host_byte_order specification
+
+
+pcre_utf32_to_host_byte_order man page
+
+Return to the PCRE index page.
+
+
+This page is part of the PCRE HTML documentation. It was generated automatically
+from the original man page. If there is any nonsense in it, please consult the
+man page, in case the conversion went wrong.
+
+
+SYNOPSIS
+
+
+#include <pcre.h>
+
+
+int pcre32_utf32_to_host_byte_order(PCRE_UCHAR32 *output,
+ PCRE_SPTR32 input, int length, int *host_byte_order,
+ int keep_boms);
+
+
+DESCRIPTION
+
+
+This function, which exists only in the 32-bit library, converts a UTF-32
+string to the correct order for the current host, taking account of any byte
+order marks (BOMs) within the string. Its arguments are:
+
+ output pointer to output buffer, may be the same as input
+ input pointer to input buffer
+ length number of 32-bit units in the input, or negative for
+ a zero-terminated string
+ host_byte_order a NULL value or a non-zero value pointed to means
+ start in host byte order
+ keep_boms if non-zero, BOMs are copied to the output string
+
+The result of the function is the number of 32-bit units placed into the output
+buffer, including the zero terminator if the string was zero-terminated.
+
+
+If host_byte_order is not NULL, it is set to indicate the byte order that
+is current at the end of the string.
+
+
+There is a complete description of the PCRE native API in the
+pcreapi
+page and a description of the POSIX API in the
+pcreposix
+page.
+
+Return to the PCRE index page.
+
diff --git a/tools/pcre/doc/html/pcreapi.html b/tools/pcre/doc/html/pcreapi.html
index 59398df3..b401ecc7 100644
--- a/tools/pcre/doc/html/pcreapi.html
+++ b/tools/pcre/doc/html/pcreapi.html
@@ -46,126 +46,129 @@ man page, in case the conversion went wrong.
PCRE NATIVE API BASIC FUNCTIONS
pcre *pcre_compile(const char *pattern, int options,
-const char **errptr, int *erroffset,
-const unsigned char *tableptr);
-
-
+ const char **errptr, int *erroffset,
+ const unsigned char *tableptr);
+
+
pcre *pcre_compile2(const char *pattern, int options,
-int *errorcodeptr,
-const char **errptr, int *erroffset,
-const unsigned char *tableptr);
-
-
+ int *errorcodeptr,
+ const char **errptr, int *erroffset,
+ const unsigned char *tableptr);
+
+
pcre_extra *pcre_study(const pcre *code, int options,
-const char **errptr);
-
-
+ const char **errptr);
+
+
void pcre_free_study(pcre_extra *extra);
-
-
+
+
int pcre_exec(const pcre *code, const pcre_extra *extra,
-const char *subject, int length, int startoffset,
-int options, int *ovector, int ovecsize);
-
-
+ const char *subject, int length, int startoffset,
+ int options, int *ovector, int ovecsize);
+
+
int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
-const char *subject, int length, int startoffset,
-int options, int *ovector, int ovecsize,
-int *workspace, int wscount);
+ const char *subject, int length, int startoffset,
+ int options, int *ovector, int ovecsize,
+ int *workspace, int wscount);
PCRE NATIVE API STRING EXTRACTION FUNCTIONS
int pcre_copy_named_substring(const pcre *code,
-const char *subject, int *ovector,
-int stringcount, const char *stringname,
-char *buffer, int buffersize);
-
-
+ const char *subject, int *ovector,
+ int stringcount, const char *stringname,
+ char *buffer, int buffersize);
+
+
int pcre_copy_substring(const char *subject, int *ovector,
-int stringcount, int stringnumber, char *buffer,
-int buffersize);
-
-
+ int stringcount, int stringnumber, char *buffer,
+ int buffersize);
+
+
int pcre_get_named_substring(const pcre *code,
-const char *subject, int *ovector,
-int stringcount, const char *stringname,
-const char **stringptr);
-
-
+ const char *subject, int *ovector,
+ int stringcount, const char *stringname,
+ const char **stringptr);
+
+
int pcre_get_stringnumber(const pcre *code,
-const char *name);
-
-
+ const char *name);
+
+
int pcre_get_stringtable_entries(const pcre *code,
-const char *name, char **first, char **last);
-
-
+ const char *name, char **first, char **last);
+
+
int pcre_get_substring(const char *subject, int *ovector,
-int stringcount, int stringnumber,
-const char **stringptr);
-
-
+ int stringcount, int stringnumber,
+ const char **stringptr);
+
+
int pcre_get_substring_list(const char *subject,
-int *ovector, int stringcount, const char ***listptr);
-
-
+ int *ovector, int stringcount, const char ***listptr);
+
+
void pcre_free_substring(const char *stringptr);
-
-
+
+
void pcre_free_substring_list(const char **stringptr);
PCRE NATIVE API AUXILIARY FUNCTIONS
int pcre_jit_exec(const pcre *code, const pcre_extra *extra,
-const char *subject, int length, int startoffset,
-int options, int *ovector, int ovecsize,
-pcre_jit_stack *jstack);
-
-
+ const char *subject, int length, int startoffset,
+ int options, int *ovector, int ovecsize,
+ pcre_jit_stack *jstack);
+
+
pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);
-
-
+
+
void pcre_jit_stack_free(pcre_jit_stack *stack);
-
-
+
+
void pcre_assign_jit_stack(pcre_extra *extra,
-pcre_jit_callback callback, void *data);
-
-
+ pcre_jit_callback callback, void *data);
+
+
const unsigned char *pcre_maketables(void);
-
-
+
+
int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
-int what, void *where);
-
-
+ int what, void *where);
+
+
int pcre_refcount(pcre *code, int adjust);
-
-
+
+
int pcre_config(int what, void *where);
-
-
+
+
const char *pcre_version(void);
-
-
+
+
int pcre_pattern_to_host_byte_order(pcre *code,
-pcre_extra *extra, const unsigned char *tables);
+ pcre_extra *extra, const unsigned char *tables);
PCRE NATIVE API INDIRECTED FUNCTIONS
void *(*pcre_malloc)(size_t);
-
-
+
+
void (*pcre_free)(void *);
-
-
+
+
void *(*pcre_stack_malloc)(size_t);
-
-
+
+
void (*pcre_stack_free)(void *);
-
-
+
+
int (*pcre_callout)(pcre_callout_block *);
+
+
+int (*pcre_stack_guard)(void);
PCRE 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
@@ -187,10 +190,10 @@ by UTF16 or UTF32, respectively. This facility is in fact just cosmetic; the
References to bytes and UTF-8 in this document should be read as references to
-16-bit data quantities and UTF-16 when using the 16-bit library, or 32-bit data
-quantities and UTF-32 when using the 32-bit library, unless specified
-otherwise. More details of the specific differences for the 16-bit and 32-bit
-libraries are given in the
+16-bit data units and UTF-16 when using the 16-bit library, or 32-bit data
+units and UTF-32 when using the 32-bit library, unless specified otherwise.
+More details of the specific differences for the 16-bit and 32-bit libraries
+are given in the
pcre16
and
pcre32
@@ -324,6 +327,15 @@ by the caller to a "callout" function, which PCRE will then call at specified
points during a matching operation. Details are given in the
pcrecallout
documentation.
+
+
+The global variable pcre_stack_guard initially contains NULL. It can be
+set by the caller to a function that is called by PCRE whenever it starts
+to compile a parenthesized part of a pattern. When parentheses are nested, PCRE
+uses recursive function calls, which use up the system stack. This function is
+provided so that applications with restricted stacks can force a compilation
+error if the stack runs out. The function should return zero if all is well, or
+non-zero to force an error.
NEWLINES
@@ -369,7 +381,8 @@ controlled in a similar way, but by separate options.
The PCRE functions can be used in multi-threading applications, with the
proviso that the memory management functions pointed to by pcre_malloc,
pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
-callout function pointed to by pcre_callout, are shared by all threads.
+callout and stack-checking functions pointed to by pcre_callout and
+pcre_stack_guard, are shared by all threads.
The compiled form of a regular expression is not altered during matching, so
@@ -483,6 +496,16 @@ interface uses malloc() for output vectors. Further details are given in
the
pcreposix
documentation.
+
+ PCRE_CONFIG_PARENS_LIMIT
+
+The output is a long integer that gives the maximum depth of nesting of
+parentheses (of any kind) in a pattern. This limit is imposed to cap the amount
+of system stack used when a pattern is compiled. It is specified when PCRE is
+built; the default is 250. This limit does not take into account the stack that
+may already be used by the calling application. For finer control over
+compilation stack usage, you can set a pointer to an external checking function
+in pcre_stack_guard.
PCRE_CONFIG_MATCH_LIMIT
@@ -509,12 +532,14 @@ avoiding the use of the stack.
COMPILING A PATTERN
pcre *pcre_compile(const char *pattern, int options,
-const char **errptr, int *erroffset,
-const unsigned char *tableptr);
+ const char **errptr, int *erroffset,
+ const unsigned char *tableptr);
+
+
pcre *pcre_compile2(const char *pattern, int options,
-int *errorcodeptr,
-const char **errptr, int *erroffset,
-const unsigned char *tableptr);
+ int *errorcodeptr,
+ const char **errptr, int *erroffset,
+ const unsigned char *tableptr);
Either of the functions pcre_compile() or pcre_compile2() can be
@@ -558,16 +583,16 @@ Otherwise, if compilation of a pattern fails, pcre_compile() returns
NULL, and sets the variable pointed to by errptr to point to a textual
error message. This is a static string that is part of the library. You must
not try to free it. Normally, the offset from the start of the pattern to the
-byte that was being processed when the error was discovered is placed in the
-variable pointed to by erroffset, which must not be NULL (if it is, an
-immediate error is given). However, for an invalid UTF-8 string, the offset is
-that of the first byte of the failing character.
+data unit that was being processed when the error was discovered is placed in
+the variable pointed to by erroffset, which must not be NULL (if it is,
+an immediate error is given). However, for an invalid UTF-8 or UTF-16 string,
+the offset is that of the first data unit of the failing character.
Some errors are not detected until the whole pattern has been scanned; in these
cases, the offset passed back is the length of the pattern. Note that the
-offset is in bytes, not characters, even in UTF-8 mode. It may sometimes point
-into the middle of a UTF-8 character.
+offset is in data units, not characters, even in a UTF mode. It may sometimes
+point into the middle of a UTF-8 or UTF-16 character.
If pcre_compile2() is used instead of pcre_compile(), and the
@@ -580,8 +605,9 @@ If the final argument, tableptr, is NULL, PCRE uses a default set of
character tables that are built when PCRE is compiled, using the default C
locale. Otherwise, tableptr must be an address that is the result of a
call to pcre_maketables(). This value is stored with the compiled
-pattern, and used again by pcre_exec(), unless another table pointer is
-passed to it. For more discussion, see the section on locale support below.
+pattern, and used again by pcre_exec() and pcre_dfa_exec() when the
+pattern is matched. For more discussion, see the section on locale support
+below.
This code fragment shows a typical straightforward call to pcre_compile():
@@ -666,12 +692,24 @@ documentation.
PCRE_EXTENDED
-If this bit is set, white space data characters in the pattern are totally
-ignored except when escaped or inside a character class. White space does not
-include the VT character (code 11). In addition, characters between an
-unescaped # outside a character class and the next newline, inclusive, are also
-ignored. This is equivalent to Perl's /x option, and it can be changed within a
-pattern by a (?x) option setting.
+If this bit is set, most white space characters in the pattern are totally
+ignored except when escaped or inside a character class. However, white space
+is not allowed within sequences such as (?> that introduce various
+parenthesized subpatterns, nor within a numerical quantifier such as {1,3}.
+Ignorable white space is nevertheless permitted between an item and a following
+quantifier and between a quantifier and a following + that indicates
+possessiveness.
+
+
+White space formerly did not include the VT character (code 11), because Perl
+did not treat this character as white space. However, Perl changed at release
+5.18, so PCRE followed at release 8.34, and VT is now treated as white space.
+
+
+PCRE_EXTENDED also causes characters between an unescaped # outside a character
+class and the next newline, inclusive, to be ignored. PCRE_EXTENDED is
+equivalent to Perl's /x option, and it can be changed within a pattern by a
+(?x) option setting.
Which characters are interpreted as newlines is controlled by the options
@@ -741,12 +779,14 @@ binary zero character followed by z).
PCRE_MULTILINE
-By default, PCRE treats the subject string as consisting of a single line of
-characters (even if it actually contains newlines). The "start of line"
-metacharacter (^) matches only at the start of the string, while the "end of
-line" metacharacter ($) matches only at the end of the string, or before a
-terminating newline (unless PCRE_DOLLAR_ENDONLY is set). This is the same as
-Perl.
+By default, for the purposes of matching "start of line" and "end of line",
+PCRE treats the subject string as consisting of a single line of characters,
+even if it actually contains newlines. The "start of line" metacharacter (^)
+matches only at the start of the string, and the "end of line" metacharacter
+($) matches only at the end of the string, or before a terminating newline
+(except when PCRE_DOLLAR_ENDONLY is set). Note, however, that unless
+PCRE_DOTALL is set, the "any character" metacharacter (.) does not match at a
+newline. This behaviour (for ^, $, and dot) is the same as Perl.
When PCRE_MULTILINE is set, the "start of line" and "end of line" constructs
@@ -755,6 +795,15 @@ subject string, respectively, as well as at the very start and end. This is
equivalent to Perl's /m option, and it can be changed within a pattern by a
(?m) option setting. If there are no newlines in a subject string, or no
occurrences of ^ or $ in a pattern, setting PCRE_MULTILINE has no effect.
+
+ PCRE_NEVER_UTF
+
+This option locks out interpretation of the pattern as UTF-8 (or UTF-16 or
+UTF-32 in the 16-bit and 32-bit libraries). In particular, it prevents the
+creator of the pattern from switching to UTF interpretation by starting the
+pattern with (*UTF). This may be useful in applications that process patterns
+from external sources. The combination of PCRE_UTF8 and PCRE_NEVER_UTF also
+causes an error.
PCRE_NEWLINE_CR
PCRE_NEWLINE_LF
@@ -814,12 +863,23 @@ were followed by ?: but named parentheses can still be used for capturing (and
they acquire numbers in the usual way). There is no equivalent of this option
in Perl.
- NO_START_OPTIMIZE
+ PCRE_NO_AUTO_POSSESS
+
+If this option is set, it disables "auto-possessification". This is an
+optimization that, for example, turns a+b into a++b in order to avoid
+backtracks into a+ that can never be successful. However, if callouts are in
+use, auto-possessification means that some of them are never taken. You can set
+this option if you want the matching functions to do a full unoptimized search
+and run all the callouts, but it is mainly provided for testing purposes.
+
+ PCRE_NO_START_OPTIMIZE
This is an option that acts at matching time; that is, it is really an option
for pcre_exec() or pcre_dfa_exec(). If it is set at compile time,
-it is remembered with the compiled pattern and assumed at matching time. For
-details see the discussion of PCRE_NO_START_OPTIMIZE
+it is remembered with the compiled pattern and assumed at matching time. This
+is necessary if you want to use JIT execution, because the JIT compiler needs
+to know whether or not this option is set. For details see the discussion of
+PCRE_NO_START_OPTIMIZE
below.
PCRE_UCP
@@ -862,10 +922,10 @@ page. If an invalid UTF-8 sequence is found, pcre_compile() returns an
error. If you already know that your pattern is valid, and you want to skip
this check for performance reasons, you can set the PCRE_NO_UTF8_CHECK option.
When it is set, the effect of passing an invalid UTF-8 string as a pattern is
-undefined. It may cause your program to crash. Note that this option can also
-be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
-validity checking of subject strings only. If the same string is being matched
-many times, the option can be safely set for the second and subsequent
+undefined. It may cause your program to crash or loop. Note that this option
+can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress
+the validity checking of subject strings only. If the same string is being
+matched many times, the option can be safely set for the second and subsequent
matchings to improve performance.
COMPILATION ERROR CODES
@@ -910,7 +970,7 @@ have fallen out of use. To avoid confusion, they have not been re-used.
31 POSIX collating elements are not supported
32 this version of PCRE is compiled without UTF support
33 [this code is not in use]
- 34 character value in \x{...} sequence is too large
+ 34 character value in \x{} or \o{} is too large
35 invalid condition (?(0)
36 \C not allowed in lookbehind assertion
37 PCRE does not support \L, \l, \N{name}, \U, or \u
@@ -938,7 +998,7 @@ have fallen out of use. To avoid confusion, they have not been re-used.
name/number or by a plain number
58 a numbered reference must not be zero
59 an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)
- 60 (*VERB) not recognized
+ 60 (*VERB) not recognized or malformed
61 number is too big
62 subpattern name expected
63 digit expected after (?+
@@ -958,14 +1018,22 @@ have fallen out of use. To avoid confusion, they have not been re-used.
75 name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
76 character value in \u.... sequence is too large
77 invalid UTF-32 string (specifically UTF-32)
+ 78 setting UTF is disabled by the application
+ 79 non-hex character in \x{} (closing brace missing?)
+ 80 non-octal character in \o{} (closing brace missing?)
+ 81 missing opening brace after \o
+ 82 parentheses are too deeply nested
+ 83 invalid range in character class
+ 84 group name must start with a non-digit
+ 85 parentheses are too deeply nested (stack check)
The numbers 32 and 10000 in errors 48 and 49 are defaults; different values may
be used if the limits were changed when PCRE was built.
STUDYING A PATTERN
-pcre_extra *pcre_study(const pcre *code, int options
-const char **errptr);
+pcre_extra *pcre_study(const pcre *code, int options,
+ const char **errptr);
If a compiled pattern is going to be used several times, it is worth spending
@@ -1069,26 +1137,37 @@ In 32-bit mode, the bitmap is used for 32-bit values less than 256.)
These two optimizations apply to both pcre_exec() and
pcre_dfa_exec(), and the information is also used by the JIT compiler.
-The optimizations can be disabled by setting the PCRE_NO_START_OPTIMIZE option
-when calling pcre_exec() or pcre_dfa_exec(), but if this is done,
-JIT execution is also disabled. You might want to do this if your pattern
-contains callouts or (*MARK) and you want to make use of these facilities in
-cases where matching fails. See the discussion of PCRE_NO_START_OPTIMIZE
+The optimizations can be disabled by setting the PCRE_NO_START_OPTIMIZE option.
+You might want to do this if your pattern contains callouts or (*MARK) and you
+want to make use of these facilities in cases where matching fails.
+
+
+PCRE_NO_START_OPTIMIZE can be specified at either compile time or execution
+time. However, if PCRE_NO_START_OPTIMIZE is passed to pcre_exec() (that
+is, after any JIT compilation has happened), JIT execution is disabled. For JIT
+execution to work with PCRE_NO_START_OPTIMIZE, the option must be set at
+compile time.
+
+
+There is a longer discussion of PCRE_NO_START_OPTIMIZE
below.
LOCALE SUPPORT
PCRE handles caseless matching, and determines whether characters are letters,
digits, or whatever, by reference to a set of tables, indexed by character
-value. When running in UTF-8 mode, this applies only to characters
-with codes less than 128. By default, higher-valued codes never match escapes
-such as \w or \d, but they can be tested with \p if PCRE is built with
-Unicode character property support. Alternatively, the PCRE_UCP option can be
-set at compile time; this causes \w and friends to use Unicode property
-support instead of built-in tables. The use of locales with Unicode is
-discouraged. If you are handling characters with codes greater than 128, you
-should either use UTF-8 and Unicode, or use locales, but not try to mix the
-two.
+code point. When running in UTF-8 mode, or in the 16- or 32-bit libraries, this
+applies only to characters with code points less than 256. By default,
+higher-valued code points never match escapes such as \w or \d. However, if
+PCRE is built with Unicode property support, all characters can be tested with
+\p and \P, or, alternatively, the PCRE_UCP option can be set when a pattern
+is compiled; this causes \w and friends to use Unicode property support
+instead of the built-in tables.
+
+
+The use of locales with Unicode is discouraged. If you are handling characters
+with code points greater than 128, you should either use Unicode support, or
+use locales, but not try to mix the two.
PCRE contains an internal set of tables that are used when the final argument
@@ -1106,10 +1185,10 @@ for this locale support is expected to die away.
External tables are built by calling the pcre_maketables() function,
which has no arguments, in the relevant locale. The result can then be passed
-to pcre_compile() or pcre_exec() as often as necessary. For
-example, to build and use tables that are appropriate for the French locale
-(where accented characters with values greater than 128 are treated as letters),
-the following code could be used:
+to pcre_compile() as often as necessary. For example, to build and use
+tables that are appropriate for the French locale (where accented characters
+with values greater than 128 are treated as letters), the following code could
+be used:
setlocale(LC_CTYPE, "fr_FR");
tables = pcre_maketables();
@@ -1127,21 +1206,25 @@ needed.
The pointer that is passed to pcre_compile() is saved with the compiled
pattern, and the same tables are used via this pointer by pcre_study()
-and normally also by pcre_exec(). Thus, by default, for any single
+and also by pcre_exec() and pcre_dfa_exec(). Thus, for any single
pattern, compilation, studying and matching all happen in the same locale, but
-different patterns can be compiled in different locales.
+different patterns can be processed in different locales.
It is possible to pass a table pointer or NULL (indicating the use of the
-internal tables) to pcre_exec(). Although not intended for this purpose,
-this facility could be used to match a pattern in a different locale from the
-one in which it was compiled. Passing table pointers at run time is discussed
-below in the section on matching a pattern.
+internal tables) to pcre_exec() or pcre_dfa_exec() (see the
+discussion below in the section on matching a pattern). This facility is
+provided for use with pre-compiled patterns that have been saved and reloaded.
+Character tables are not saved with patterns, so if a non-standard table was
+used at compile time, it must be provided again when the reloaded pattern is
+matched. Attempting to use this facility to match a pattern in a different
+locale from the one in which it was compiled is likely to lead to anomalous
+(usually incorrect) results.
INFORMATION ABOUT A PATTERN
int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
-int what, void *where);
+ int what, void *where);
The pcre_fullinfo() function returns information about a compiled
@@ -1162,6 +1245,7 @@ the following negative numbers:
PCRE_ERROR_BADENDIANNESS the pattern was compiled with different
endianness
PCRE_ERROR_BADOPTION the value of what was invalid
+ PCRE_ERROR_UNSET the requested field is not set
The "magic number" is placed at the start of each compiled pattern as a simple
check against passing an arbitrary memory pointer. The endianness error can
@@ -1199,12 +1283,15 @@ information call is provided for internal use by the pcre_study()
function. External callers can cause PCRE to use its internal tables by passing
a NULL table pointer.
- PCRE_INFO_FIRSTBYTE
+ PCRE_INFO_FIRSTBYTE (deprecated)
Return information about the first data unit of any matched string, for a
-non-anchored pattern. (The name of this option refers to the 8-bit library,
-where data units are bytes.) The fourth argument should point to an int
-variable.
+non-anchored pattern. The name of this option refers to the 8-bit library,
+where data units are bytes. The fourth argument should point to an int
+variable. Negative values are used for special cases. However, this means that
+when the 32-bit library is in non-UTF-32 mode, the full 32-bit range of
+characters cannot be returned. For this reason, this value is deprecated; use
+PCRE_INFO_FIRSTCHARACTERFLAGS and PCRE_INFO_FIRSTCHARACTER instead.
If there is a fixed first value, for example, the letter "c" from a pattern
@@ -1227,12 +1314,43 @@ starts with "^", or
-1 is returned, indicating that the pattern matches only at the start of a
subject string or after any newline within the string. Otherwise -2 is
returned. For anchored patterns, -2 is returned.
+
+ PCRE_INFO_FIRSTCHARACTER
+
+Return the value of the first data unit (non-UTF character) of any matched
+string in the situation where PCRE_INFO_FIRSTCHARACTERFLAGS returns 1;
+otherwise return 0. The fourth argument should point to a uint_t
+variable.
-Since for the 32-bit library using the non-UTF-32 mode, this function is unable
-to return the full 32-bit range of the character, this value is deprecated;
-instead the PCRE_INFO_FIRSTCHARACTERFLAGS and PCRE_INFO_FIRSTCHARACTER values
-should be used.
+In the 8-bit library, the value is always less than 256. In the 16-bit library
+the value can be up to 0xffff. In the 32-bit library in UTF-32 mode the value
+can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32 mode.
+
+ PCRE_INFO_FIRSTCHARACTERFLAGS
+
+Return information about the first data unit of any matched string, for a
+non-anchored pattern. The fourth argument should point to an int
+variable.
+
+
+If there is a fixed first value, for example, the letter "c" from a pattern
+such as (cat|cow|coyote), 1 is returned, and the character value can be
+retrieved using PCRE_INFO_FIRSTCHARACTER. If there is no fixed first value, and
+if either
+
+
+(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
+starts with "^", or
+
+
+(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set
+(if it were set, the pattern would be anchored),
+
+
+2 is returned, indicating that the pattern matches only at the start of a
+subject string or after any newline within the string. Otherwise 0 is
+returned. For anchored patterns, 0 is returned.
PCRE_INFO_FIRSTTABLE
@@ -1281,26 +1399,43 @@ is -1.
Since for the 32-bit library using the non-UTF-32 mode, this function is unable
-to return the full 32-bit range of the character, this value is deprecated;
+to return the full 32-bit range of characters, this value is deprecated;
instead the PCRE_INFO_REQUIREDCHARFLAGS and PCRE_INFO_REQUIREDCHAR values should
be used.
+
+ PCRE_INFO_MATCH_EMPTY
+
+Return 1 if the pattern can match an empty string, otherwise 0. The fourth
+argument should point to an int variable.
+
+ PCRE_INFO_MATCHLIMIT
+
+If the pattern set a match limit by including an item of the form
+(*LIMIT_MATCH=nnnn) at the start, the value is returned. The fourth argument
+should point to an unsigned 32-bit integer. If no such value has been set, the
+call to pcre_fullinfo() returns the error PCRE_ERROR_UNSET.
PCRE_INFO_MAXLOOKBEHIND
-Return the number of characters (NB not bytes) in the longest lookbehind
-assertion in the pattern. Note that the simple assertions \b and \B require a
-one-character lookbehind. This information is useful when doing multi-segment
-matching using the partial matching facilities.
+Return the number of characters (NB not data units) in the longest lookbehind
+assertion in the pattern. This information is useful when doing multi-segment
+matching using the partial matching facilities. Note that the simple assertions
+\b and \B require a one-character lookbehind. \A also registers a
+one-character lookbehind, though it does not actually inspect the previous
+character. This is to ensure that at least one character from the old segment
+is retained when a new segment is processed. Otherwise, if there are no
+lookbehinds in the pattern, \A might match incorrectly at the start of a new
+segment.
PCRE_INFO_MINLENGTH
If the pattern was studied and a minimum length for matching subject strings
was computed, its value is returned. Otherwise the returned value is -1. The
-value is a number of characters, which in UTF-8 mode may be different from the
-number of bytes. The fourth argument should point to an int variable. A
-non-negative value is a lower bound to the length of any matching string. There
-may not be any strings of that length that do actually match, but every string
-that does match is at least that long.
+value is a number of characters, which in UTF mode may be different from the
+number of data units. The fourth argument should point to an int
+variable. A non-negative value is a lower bound to the length of any matching
+string. There may not be any strings of that length that do actually match, but
+every string that does match is at least that long.
PCRE_INFO_NAMECOUNT
PCRE_INFO_NAMEENTRYSIZE
@@ -1324,22 +1459,24 @@ length of the longest name. PCRE_INFO_NAMETABLE returns a pointer to the first
entry of the table. This is a pointer to char in the 8-bit library, where
the first two bytes of each entry are the number of the capturing parenthesis,
most significant byte first. In the 16-bit library, the pointer points to
-16-bit data units, the first of which contains the parenthesis number.
-In the 32-bit library, the pointer points to 32-bit data units, the first of
-which contains the parenthesis number. The rest
-of the entry is the corresponding name, zero terminated.
+16-bit data units, the first of which contains the parenthesis number. In the
+32-bit library, the pointer points to 32-bit data units, the first of which
+contains the parenthesis number. The rest of the entry is the corresponding
+name, zero terminated.
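
The entry layout described above for the 8-bit library can be sketched in C as follows. This is a minimal illustration, not PCRE code: the helper names entry_group_number(), entry_group_name() and lookup_group() are hypothetical, and the table, entry count and entry size are assumed to have been obtained beforehand via PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: read the group number from one 8-bit table entry.
   The first two bytes hold the number, most significant byte first. */
static int entry_group_number(const unsigned char *entry)
{
    assert(entry != NULL);
    return (entry[0] << 8) | entry[1];
}

/* Hypothetical helper: the zero-terminated name starts at the third byte. */
static const char *entry_group_name(const unsigned char *entry)
{
    assert(entry != NULL);
    return (const char *)(entry + 2);
}

/* Hypothetical helper: linear search of `count` entries, each `entry_size`
   bytes long. Returns the group number, or -1 if the name is not found. */
static int lookup_group(const unsigned char *table, int count,
                        int entry_size, const char *name)
{
    for (int i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp(entry_group_name(entry), name) == 0)
            return entry_group_number(entry);
    }
    return -1;
}
```

A linear search keeps the sketch short. In real code, pcre_get_stringnumber() performs the name-to-number lookup without the caller touching the table layout.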
-The names are in alphabetical order. Duplicate names may appear if (?| is used
-to create multiple groups with the same number, as described in the
+The names are in alphabetical order. If (?| is used to create multiple groups
+with the same number, as described in the
section on duplicate subpattern numbers
in the
pcrepattern
-page. Duplicate names for subpatterns with different numbers are permitted only
-if PCRE_DUPNAMES is set. In all cases of duplicate names, they appear in the
-table in the order in which they were found in the pattern. In the absence of
-(?| this is the order of increasing number; when (?| is used this is not
-necessarily the case because later subpatterns may have lower numbers.
+page, the groups may be given the same name, but there is only one entry in the
+table. Different names for groups of the same number are not permitted.
+Duplicate names for subpatterns with different numbers are permitted,
+but only if PCRE_DUPNAMES is set. They appear in the table in the order in
+which they were found in the pattern. In the absence of (?| this is the order
+of increasing number; when (?| is used this is not necessarily the case because
+later subpatterns may have lower numbers.
As a simple example of the name/number table, consider the following pattern
@@ -1391,10 +1528,17 @@ alternatives begin with one of the following:
For such patterns, the PCRE_ANCHORED bit is set in the options returned by
pcre_fullinfo().
+
+ PCRE_INFO_RECURSIONLIMIT
+
+If the pattern set a recursion limit by including an item of the form
+(*LIMIT_RECURSION=nnnn) at the start, the value is returned. The fourth
+argument should point to an unsigned 32-bit integer. If no such value has been
+set, the call to pcre_fullinfo() returns the error PCRE_ERROR_UNSET.
PCRE_INFO_SIZE
-Return the size of the compiled pattern in bytes (for both libraries). The
+Return the size of the compiled pattern in bytes (for all three libraries). The
fourth argument should point to a size_t variable. This value does not
include the size of the pcre structure that is returned by
pcre_compile(). The value that is passed as the argument to
@@ -1405,70 +1549,17 @@ does not alter the value returned by this option.
PCRE_INFO_STUDYSIZE
-Return the size in bytes of the data block pointed to by the study_data
-field in a pcre_extra block. If pcre_extra is NULL, or there is no
-study data, zero is returned. The fourth argument should point to a
-size_t variable. The study_data field is set by pcre_study()
-to record information that will speed up matching (see the section entitled
+Return the size in bytes (for all three libraries) of the data block pointed to
+by the study_data field in a pcre_extra block. If pcre_extra
+is NULL, or there is no study data, zero is returned. The fourth argument
+should point to a size_t variable. The study_data field is set by
+pcre_study() to record information that will speed up matching (see the
+section entitled
"Studying a pattern"
above). The format of the study_data block is private, but its length
is made available via this option so that it can be saved and restored (see the
pcreprecompile
documentation for details).
-
- PCRE_INFO_FIRSTCHARACTERFLAGS
-
-Return information about the first data unit of any matched string, for a
-non-anchored pattern. The fourth argument should point to an int
-variable.
-
-
-If there is a fixed first value, for example, the letter "c" from a pattern
-such as (cat|cow|coyote), 1 is returned, and the character value can be
-retrieved using PCRE_INFO_FIRSTCHARACTER.
-
-
-If there is no fixed first value, and if either
-
-
-(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
-starts with "^", or
-
-
-(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set
-(if it were set, the pattern would be anchored),
-
-
-2 is returned, indicating that the pattern matches only at the start of a
-subject string or after any newline within the string. Otherwise 0 is
-returned. For anchored patterns, 0 is returned.
-
- PCRE_INFO_FIRSTCHARACTER
-
-Return the fixed first character value, if PCRE_INFO_FIRSTCHARACTERFLAGS
-returned 1; otherwise returns 0. The fourth argument should point to an
-uint_t variable.
-
-
-In the 8-bit library, the value is always less than 256. In the 16-bit library
-the value can be up to 0xffff. In the 32-bit library in UTF-32 mode the value
-can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32 mode.
-
-
-If there is no fixed first value, and if either
-
-
-(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
-starts with "^", or
-
-
-(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set
-(if it were set, the pattern would be anchored),
-
-
--1 is returned, indicating that the pattern matches only at the start of a
-subject string or after any newline within the string. Otherwise -2 is
-returned. For anchored patterns, -2 is returned.
PCRE_INFO_REQUIREDCHARFLAGS
@@ -1517,8 +1608,8 @@ is different. (This seems a highly unlikely scenario.)
MATCHING A PATTERN: THE TRADITIONAL FUNCTION
int pcre_exec(const pcre *code, const pcre_extra *extra,
-const char *subject, int length, int startoffset,
-int options, int *ovector, int ovecsize);
+ const char *subject, int length, int startoffset,
+ int options, int *ovector, int ovecsize);
The function pcre_exec() is called to match a subject string against a
@@ -1634,6 +1725,16 @@ the flags field. If the limit is exceeded, pcre_exec() returns
PCRE_ERROR_MATCHLIMIT.
+A value for the match limit may also be supplied by an item at the start of a
+pattern of the form
+
+ (*LIMIT_MATCH=d)
+
+where d is a decimal number. However, such a setting is ignored unless d is
+less than the limit set by the caller of pcre_exec() or, if no such limit
+is set, less than the default.
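
One reading of this rule, written as a small C function, is shown below. It is only a sketch of the precedence described in the text, not PCRE's own code: effective_match_limit() and its parameters are hypothetical names, and the default limit is passed in rather than assumed.

```c
/* Hypothetical helper illustrating the precedence rule in the text.
   caller_set / pattern_set are non-zero when the corresponding limit was
   supplied; the pattern's (*LIMIT_MATCH=d) value is honoured only when it
   is lower than the limit that would otherwise apply. */
static unsigned long effective_match_limit(int caller_set,
                                           unsigned long caller_limit,
                                           unsigned long default_limit,
                                           int pattern_set,
                                           unsigned long pattern_limit)
{
    /* The caller's limit, if any, takes the place of the default. */
    unsigned long base = caller_set ? caller_limit : default_limit;

    /* A pattern can lower the limit but never raise it. */
    if (pattern_set && pattern_limit < base)
        return pattern_limit;
    return base;
}
```

Under this reading, a pattern from an untrusted source can make matching give up sooner, but it cannot raise a limit that the application has chosen.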
+
+
The match_limit_recursion field is similar to match_limit, but
instead of limiting the total number of times that match() is called, it
limits the depth of recursion. The recursion depth is a smaller number than the
@@ -1655,23 +1756,38 @@ PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the limit
is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
+A value for the recursion limit may also be supplied by an item at the start of
+a pattern of the form
+
+ (*LIMIT_RECURSION=d)
+
+where d is a decimal number. However, such a setting is ignored unless d is
+less than the limit set by the caller of pcre_exec() or, if no such limit
+is set, less than the default.
+
+
The callout_data field is used in conjunction with the "callout" feature,
and is described in the
pcrecallout
documentation.
-The tables field is used to pass a character tables pointer to
-pcre_exec(); this overrides the value that is stored with the compiled
-pattern. A non-NULL value is stored with the compiled pattern only if custom
-tables were supplied to pcre_compile() via its tableptr argument.
-If NULL is passed to pcre_exec() using this mechanism, it forces PCRE's
-internal tables to be used. This facility is helpful when re-using patterns
-that have been saved after compiling with an external set of tables, because
-the external tables might be at a different address when pcre_exec() is
-called. See the
+The tables field is provided for use with patterns that have been
+pre-compiled using custom character tables, saved to disc or elsewhere, and
+then reloaded, because the tables that were used to compile a pattern are not
+saved with it. See the
pcreprecompile
-documentation for a discussion of saving compiled patterns for later use.
+documentation for a discussion of saving compiled patterns for later use. If
+NULL is passed using this mechanism, it forces PCRE's internal tables to be
+used.
+
+
+Warning: The tables that pcre_exec() uses must be the same as those
+that were used when the pattern was compiled. If this is not the case, the
+behaviour of pcre_exec() is undefined. Therefore, when a pattern is
+compiled and matched in the same process, this field should never be set. In
+this (the most common) case, the correct table pointer is automatically passed
+with the compiled pattern from pcre_compile() to pcre_exec().
If PCRE_EXTRA_MARK is set in the flags field, the mark field must
@@ -1816,10 +1932,10 @@ unanchored match must start with a specific character, it searches the subject
for that character, and fails immediately if it cannot find it, without
actually running the main matching function. This means that a special item
such as (*COMMIT) at the start of a pattern is not considered until after a
-suitable starting point for the match has been found. When callouts or (*MARK)
-items are in use, these "start-up" optimizations can cause them to be skipped
-if the pattern is never actually used. The start-up optimizations are in effect
-a pre-scan of the subject that takes place before the pattern is run.
+suitable starting point for the match has been found. Also, when callouts or
+(*MARK) items are in use, these "start-up" optimizations can cause them to be
+skipped if the pattern is never actually used. The start-up optimizations are
+in effect a pre-scan of the subject that takes place before the pattern is run.
The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations, possibly
@@ -1827,8 +1943,9 @@ causing performance to suffer, but ensuring that in cases where the result is
"no match", the callouts do occur, and that items such as (*COMMIT) and (*MARK)
are considered at every possible starting position in the subject string. If
PCRE_NO_START_OPTIMIZE is set at compile time, it cannot be unset at matching
-time. The use of PCRE_NO_START_OPTIMIZE disables JIT execution; when it is set,
-matching is always done using interpretively.
+time. The use of PCRE_NO_START_OPTIMIZE at matching time (that is, passing it
+to pcre_exec()) disables JIT execution; in this situation, matching is
+always done interpretively.
Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching operation.
@@ -1888,7 +2005,7 @@ all the matches in a single subject string. However, you should be sure that
the value of startoffset points to the start of a character (or the end
of the subject). When PCRE_NO_UTF8_CHECK is set, the effect of passing an
invalid string as a subject or an invalid value of startoffset is
-undefined. Your program may crash.
+undefined. Your program may crash or loop.
PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT
@@ -1922,13 +2039,19 @@ The string to be matched by pcre_exec()
A non-zero starting offset is useful when searching for another match in the
@@ -1996,10 +2119,12 @@ rounded down.
When a match is successful, information about captured substrings is returned
in pairs of integers, starting at the beginning of ovector, and
continuing up to two-thirds of its length at the most. The first element of
-each pair is set to the byte offset of the first character in a substring, and
-the second is set to the byte offset of the first character after the end of a
-substring. Note: these values are always byte offsets, even in UTF-8
-mode. They are not character counts.
+each pair is set to the offset of the first character in a substring, and the
+second is set to the offset of the first character after the end of a
+substring. These values are always data unit offsets, even in UTF mode. They
+are byte offsets in the 8-bit library, 16-bit data item offsets in the 16-bit
+library, and 32-bit data item offsets in the 32-bit library. Note: they
+are not character counts.
To extract a substring by name, you first have to find the associated number.
@@ -2499,7 +2626,7 @@ same number causes an error at compile time.
DUPLICATE SUBPATTERN NAMES
When a pattern is compiled with the PCRE_DUPNAMES option, names for subpatterns
@@ -2580,9 +2707,9 @@ the value returned is the size of each block that is obtained from the heap.
MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
+NOTE: PCRE's "auto-possessification" optimization usually applies to character
+repeats at the end of a pattern (as well as internally). For example, the
+pattern "a\d+" is compiled as if it were "a\d++" because there is no point
+even considering the possibility of backtracking into the repeated digits. For
+DFA matching, this means that only one possible match is found. If you really
+do want multiple matches in such cases, either use an ungreedy repeat
+("a\d+?") or set the PCRE_NO_AUTO_POSSESS option when compiling.
+
-This document describes the optional features of PCRE that can be selected when
-the library is compiled. It assumes use of the configure script, where
-the optional features are selected or deselected by providing options to
-configure before running the make command. However, the same
-options can be selected in both Unix-like and non-Unix-like environments using
-the GUI facility of cmake-gui if you are using CMake instead of
-configure to build PCRE.
+PCRE is distributed with a configure script that can be used to build the
+library in Unix-like environments using the applications known as Autotools.
+Also in the distribution are files to support building using CMake
+instead of configure. The text file
+README
+contains general information about building with Autotools (some of which is
+repeated below), and also has some comments about building on various operating
+systems. There is a lot more information about building PCRE without using
+Autotools (including information about using CMake and building "by
+hand") in the text file called
+NON-AUTOTOOLS-BUILD.
+You should consult this file as well as the
+README
+file if you are building in a non-Unix-like environment.
+
+
+The rest of this document describes the optional features of PCRE that can be
+selected when the library is compiled. It assumes use of the configure
+script, where the optional features are selected or deselected by providing
+options to configure before running the make command. However, the
+same options can be selected in both Unix-like and non-Unix-like environments
+using the GUI facility of cmake-gui if you are using CMake instead
+of configure to build PCRE.
Within a compiled pattern, offset values are used to point from one part to
another (for example, from an opening parenthesis to an alternation
@@ -259,7 +276,7 @@ longer offsets slows down the operation of PCRE because it has to load
additional data when handling them. For the 32-bit library the value is always
4 and cannot be overridden; the value of --with-link-size is ignored.
-
PCRE uses fixed tables for processing characters whose code values are less
than 256. By default, PCRE is built with a set of tables that are distributed
@@ -336,7 +353,7 @@ compiling, because dftables is run on the local host. If you need to
create alternative tables when cross compiling, you will have to do so "by
hand".)
-
PCRE assumes by default that it will run in an environment where the character
code is ASCII (or Unicode, which is a superset of ASCII). This is the case for
@@ -367,7 +384,7 @@ The options that select newline behaviour, such as --enable-newline-is-cr,
and equivalent run-time options, refer to these character values in an EBCDIC
environment.
-
@@ -436,7 +453,7 @@ option to the configure command, PCRE will use valgrind annotations
to mark certain memory regions as unaddressable. This allows it to detect
invalid memory accesses, and is mostly useful for debugging PCRE itself.
-
CODE COVERAGE REPORTING
+
CODE COVERAGE REPORTING
If your C compiler is gcc, you can build a version of PCRE that can generate a
code coverage report for its test suite. To enable this, you must install
@@ -493,11 +510,11 @@ This cleans all coverage data including the generated coverage report. For more
information about code coverage, see the gcov and lcov
documentation.
-
SEE ALSO
+
SEE ALSO
pcreapi(3), pcre16, pcre32, pcre_config(3).
-
AUTHOR
+
AUTHOR
Philip Hazel
@@ -506,11 +523,11 @@ University Computing Service
Cambridge CB2 3QH, England.
-
REVISION
+
REVISION
-Last updated: 30 October 2012
+Last updated: 12 May 2013
-Copyright © 1997-2012 University of Cambridge.
+Copyright © 1997-2013 University of Cambridge.
Return to the PCRE index page.
diff --git a/tools/pcre/doc/html/pcrecallout.html b/tools/pcre/doc/html/pcrecallout.html
index b28e347f..53a937f5 100644
--- a/tools/pcre/doc/html/pcrecallout.html
+++ b/tools/pcre/doc/html/pcrecallout.html
@@ -64,23 +64,63 @@ it is processed as if it were
Notice that there is a callout before and after each parenthesis and
-alternation bar. Automatic callouts can be used for tracking the progress of
-pattern matching. The
-pcretest
-command has an option that sets automatic callouts; when it is used, the output
-indicates how the pattern is matched. This is useful information when you are
-trying to optimize the performance of a particular pattern.
+alternation bar. If the pattern contains a conditional group whose condition is
+an assertion, an automatic callout is inserted immediately before the
+condition. Such a callout may also be inserted explicitly, for example:
+
+ (?(?C9)(?=a)ab|de)
+
+This applies only to assertion conditions (because they are themselves
+independent groups).
-The use of callouts in a pattern makes it ineligible for optimization by the
-just-in-time compiler. Studying such a pattern with the PCRE_STUDY_JIT_COMPILE
-option always fails.
+Automatic callouts can be used for tracking the progress of pattern matching.
+The
+pcretest
+program has a pattern qualifier (/C) that sets automatic callouts; when it is
+used, the output indicates how the pattern is being matched. This is useful
+information when you are trying to optimize the performance of a particular
+pattern.
MISSING CALLOUTS
-You should be aware that, because of optimizations in the way PCRE matches
-patterns by default, callouts sometimes do not happen. For example, if the
-pattern is
+You should be aware that, because of optimizations in the way PCRE compiles and
+matches patterns, callouts sometimes do not happen exactly as you might expect.
+
+
+At compile time, PCRE "auto-possessifies" repeated items when it knows that
+what follows cannot be part of the repeat. For example, a+[bc] is compiled as
+if it were a++[bc]. The pcretest output when this pattern is anchored and
+then applied with automatic callouts to the string "aaaa" is:
+
+ --->aaaa
+ +0 ^ ^
+ +1 ^ a+
+ +3 ^ ^ [bc]
+ No match
+
+This indicates that when matching [bc] fails, there is no backtracking into a+
+and therefore the callouts that would be taken for the backtracks do not occur.
+You can disable the auto-possessify feature by passing PCRE_NO_AUTO_POSSESS
+to pcre_compile(), or starting the pattern with (*NO_AUTO_POSSESS). If
+this is done in pcretest (using the /O qualifier), the output changes to
+this:
+
+ --->aaaa
+ +0 ^ ^
+ +1 ^ a+
+ +3 ^ ^ [bc]
+ +3 ^ ^ [bc]
+ +3 ^ ^ [bc]
+ +3 ^^ [bc]
+ No match
+
+This time, when matching [bc] fails, the matcher backtracks into a+ and tries
+again, repeatedly, until a+ itself fails.
+
+
+Other optimizations that provide fast "no match" results also affect callouts.
+For example, if the pattern is
ab(?C4)cd
@@ -104,11 +144,11 @@ callouts such as the example above are obeyed.
THE CALLOUT INTERFACE
During matching, when PCRE reaches a callout point, the external function
-defined by pcre_callout or pcre[16|32]_callout is called
-(if it is set). This applies to both normal and DFA matching. The only
-argument to the callout function is a pointer to a pcre_callout
-or pcre[16|32]_callout block.
-These structures contains the following fields:
+defined by pcre_callout or pcre[16|32]_callout is called (if it is
+set). This applies to both normal and DFA matching. The only argument to the
+callout function is a pointer to a pcre_callout or
+pcre[16|32]_callout block. These structures contain the following
+fields:
int version;
int callout_number;
@@ -141,10 +181,10 @@ automatically generated callouts).
The offset_vector field is a pointer to the vector of offsets that was
passed by the caller to the matching function. When pcre_exec() or
-pcre[16|32]_exec() is used, the contents can be inspected, in order to extract
-substrings that have been matched so far, in the same way as for extracting
-substrings after a match has completed. For the DFA matching functions, this
-field is not useful.
+pcre[16|32]_exec() is used, the contents can be inspected, in order to
+extract substrings that have been matched so far, in the same way as for
+extracting substrings after a match has completed. For the DFA matching
+functions, this field is not useful.
The subject and subject_length fields contain copies of the values
@@ -171,8 +211,10 @@ functions are used, because they do not support captured substrings.
The capture_last field contains the number of the most recently captured
-substring. If no substrings have been captured, its value is -1. This is always
-the case for the DFA matching functions.
+substring. However, when a recursion exits, the value reverts to what it was
+outside the recursion, as do the values of all captured substrings. If no
+substrings have been captured, the value of capture_last is -1. This is
+always the case for the DFA matching functions.
The callout_data field contains a value that is passed to a matching
@@ -203,11 +245,12 @@ same callout number. However, they are set for all callouts.
The mark field is present from version 2 of the callout structure. In
-callouts from pcre_exec() or pcre[16|32]_exec() it contains a pointer to
-the zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
-(*THEN) item in the match, or NULL if no such items have been passed. Instances
-of (*PRUNE) or (*THEN) without a name do not obliterate a previous (*MARK). In
-callouts from the DFA matching functions this field always contains NULL.
+callouts from pcre_exec() or pcre[16|32]_exec() it contains a
+pointer to the zero-terminated name of the most recently passed (*MARK),
+(*PRUNE), or (*THEN) item in the match, or NULL if no such items have been
+passed. Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
+previous (*MARK). In callouts from the DFA matching functions this field always
+contains NULL.
RETURN VALUES
@@ -234,9 +277,9 @@ Cambridge CB2 3QH, England.
REVISION
-Last updated: 24 June 2012
+Last updated: 12 November 2013
-Copyright © 1997-2012 University of Cambridge.
+Copyright © 1997-2013 University of Cambridge.
Return to the PCRE index page.
diff --git a/tools/pcre/doc/html/pcrecompat.html b/tools/pcre/doc/html/pcrecompat.html
index 0637781b..3e622669 100644
--- a/tools/pcre/doc/html/pcrecompat.html
+++ b/tools/pcre/doc/html/pcrecompat.html
@@ -36,10 +36,8 @@ these do not seem to have any use.
3. Capturing subpatterns that occur inside negative lookahead assertions are
-counted, but their entries in the offsets vector are never set. Perl sets its
-numerical variables from any such patterns that are matched before the
-assertion fails to match something (thereby succeeding), but only if the
-negative lookahead assertion contains just one branch.
+counted, but their entries in the offsets vector are never set. Perl sometimes
+(but not always) sets its numerical variables from inside negative assertions.
4. Though binary zero characters are supported in the subject string, they are
@@ -102,24 +100,32 @@ in the
page.
-10. If any of the backtracking control verbs are used in an assertion or in a
-subpattern that is called as a subroutine (whether or not recursively), their
-effect is confined to that subpattern; it does not extend to the surrounding
-pattern. This is not always the case in Perl. In particular, if (*THEN) is
-present in a group that is called as a subroutine, its action is limited to
-that group, even if the group does not contain any | characters. There is one
-exception to this: the name from a *(MARK), (*PRUNE), or (*THEN) that is
-encountered in a successful positive assertion is passed back when a
-match succeeds (compare capturing parentheses in assertions). Note that such
-subpatterns are processed as anchored at the point where they are tested.
+10. If any of the backtracking control verbs are used in a subpattern that is
+called as a subroutine (whether or not recursively), their effect is confined
+to that subpattern; it does not extend to the surrounding pattern. This is not
+always the case in Perl. In particular, if (*THEN) is present in a group that
+is called as a subroutine, its action is limited to that group, even if the
+group does not contain any | characters. Note that such subpatterns are
+processed as anchored at the point where they are tested.
-11. There are some differences that are concerned with the settings of captured
+11. If a pattern contains more than one backtracking control verb, the first
+one that is backtracked onto acts. For example, in the pattern
+A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C
+triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
+same as PCRE, but there are examples where it differs.
+
+
+12. Most backtracking verbs in assertions have their normal actions. They are
+not confined to the assertion.
+
+
+13. There are some differences that are concerned with the settings of captured
strings when part of a pattern is repeated. For example, matching "aba" against
the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set to "b".
-12. PCRE's handling of duplicate subpattern numbers and duplicate subpattern
+14. PCRE's handling of duplicate subpattern numbers and duplicate subpattern
names is not as general as Perl's. This is a consequence of the fact that PCRE
works internally just with numbers, using an external table to translate
between numbers and names. In particular, a pattern such as (?|(?<a>A)|(?<b>B),
@@ -130,13 +136,26 @@ names map to capturing subpattern number 1. To avoid this confusing situation,
an error is given at compile time.
-13. Perl recognizes comments in some places that PCRE does not, for example,
+15. Perl recognizes comments in some places that PCRE does not, for example,
between the ( and ? at the start of a subpattern. If the /x modifier is set,
-Perl allows white space between ( and ? but PCRE never does, even if the
-PCRE_EXTENDED option is set.
+Perl allows white space between ( and ? (though current Perls warn that this is
+deprecated) but PCRE never does, even if the PCRE_EXTENDED option is set.
-14. PCRE provides some extensions to the Perl regular expression facilities.
+16. Perl, when in warning mode, gives warnings for character classes such as
+[A-\d] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE has no
+warning features, so it gives an error in these cases because they are almost
+certainly user mistakes.
+
+
+17. In PCRE, the upper/lower case character properties Lu and Ll are not
+affected when case-independent matching is specified. For example, \p{Lu}
+always matches an upper case letter. I think Perl has changed in this respect;
+in the release at the time of writing (5.16), \p{Lu} and \p{Ll} match all
+letters, regardless of case, when case independence is specified.
+
+
+18. PCRE provides some extensions to the Perl regular expression facilities.
Perl 5.10 includes new features that are not in earlier versions of Perl, some
of which (such as named parentheses) have been in PCRE for some time. This list
is with respect to Perl 5.10:
@@ -207,9 +226,9 @@ Cambridge CB2 3QH, England.
REVISION
-Last updated: 25 August 2012
+Last updated: 10 November 2013
-Copyright © 1997-2012 University of Cambridge.
+Copyright © 1997-2013 University of Cambridge.
Return to the PCRE index page.
diff --git a/tools/pcre/doc/html/pcregrep.html b/tools/pcre/doc/html/pcregrep.html
index bac8f9a4..dacbb499 100644
--- a/tools/pcre/doc/html/pcregrep.html
+++ b/tools/pcre/doc/html/pcregrep.html
@@ -37,8 +37,10 @@ man page, in case the conversion went wrong.
pcregrep searches files for character patterns, in the same way as other
grep commands do, but it uses the PCRE regular expression library to support
patterns that are compatible with the regular expressions of Perl 5. See
+pcresyntax(3)
+for a quick-reference summary of pattern syntax, or
pcrepattern(3)
-for a full description of syntax and semantics of the regular expressions
+for a full description of the syntax and semantics of the regular expressions
that PCRE supports.
@@ -748,9 +750,9 @@ Cambridge CB2 3QH, England.
REVISION
-Last updated: 13 September 2012
+Last updated: 03 April 2014
-Copyright © 1997-2012 University of Cambridge.
+Copyright © 1997-2014 University of Cambridge.
Return to the PCRE index page.
diff --git a/tools/pcre/doc/html/pcrejit.html b/tools/pcre/doc/html/pcrejit.html
index 6286fccb..210f1da0 100644
--- a/tools/pcre/doc/html/pcrejit.html
+++ b/tools/pcre/doc/html/pcrejit.html
@@ -172,15 +172,9 @@ PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and
PCRE_PARTIAL_SOFT.
-The unsupported pattern items are:
-
- \C match a single byte; not supported in UTF-8 mode
- (?Cn) callouts
- (*PRUNE) )
- (*SKIP) ) backtracking control verbs
- (*THEN) )
-
-Support for some of these may be added in future.
+The only unsupported pattern items are \C (match a single data unit) when
+running in a UTF mode, and a callout immediately before an assertion condition
+in a conditional group.
RETURN VALUES FROM JIT EXECUTION
@@ -449,9 +443,9 @@ Cambridge CB2 3QH, England.
REVISION
-Last updated: 31 October 2012
+Last updated: 17 March 2013
-Copyright © 1997-2012 University of Cambridge.
+Copyright © 1997-2013 University of Cambridge.
Return to the PCRE index page.
diff --git a/tools/pcre/doc/html/pcrelimits.html b/tools/pcre/doc/html/pcrelimits.html
index b83a8010..ee5ebf03 100644
--- a/tools/pcre/doc/html/pcrelimits.html
+++ b/tools/pcre/doc/html/pcrelimits.html
@@ -21,9 +21,10 @@ practice be relevant.
The maximum length of a compiled pattern is approximately 64K data units (bytes
-for the 8-bit library, 32-bit units for the 32-bit library, and 32-bit units for
-the 32-bit library) if PCRE is compiled with the default internal linkage size
-of 2 bytes. If you want to process regular expressions that are truly enormous,
+for the 8-bit library, 16-bit units for the 16-bit library, and 32-bit units for
+the 32-bit library) if PCRE is compiled with the default internal linkage size,
+which is 2 bytes for the 8-bit and 16-bit libraries, and 4 bytes for the 32-bit
+library. If you want to process regular expressions that are truly enormous,
you can compile PCRE with an internal linkage size of 3 or 4 (when building the
16-bit or 32-bit library, 3 is rounded up to 4). See the README file in
the source distribution and the
@@ -36,7 +37,10 @@ All values in repeating quantifiers must be less than 65536.
There is no limit to the number of parenthesized subpatterns, but there can be
-no more than 65535 capturing subpatterns.
+no more than 65535 capturing subpatterns. There is, however, a limit to the
+depth of nesting of parenthesized subpatterns of all kinds. This is imposed in
+order to limit the amount of system stack used at compile time. The limit can
+be specified when PCRE is built; the default is 250.
There is a limit to the number of forward references to subsequent subpatterns
@@ -50,7 +54,7 @@ maximum number of named subpatterns is 10000.
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or (*THEN) verb
-is 255 for the 8-bit library and 65535 for the 16-bit and 32-bit library.
+is 255 for the 8-bit library and 65535 for the 16-bit and 32-bit libraries.
The maximum length of a subject string is the largest positive number that an
@@ -77,9 +81,9 @@ Cambridge CB2 3QH, England.
REVISION
-Last updated: 04 May 2012
+Last updated: 05 November 2013
-Copyright © 1997-2012 University of Cambridge.
+Copyright © 1997-2013 University of Cambridge.
Return to the PCRE index page.
diff --git a/tools/pcre/doc/html/pcrematching.html b/tools/pcre/doc/html/pcrematching.html
index f1854314..a1af39b6 100644
--- a/tools/pcre/doc/html/pcrematching.html
+++ b/tools/pcre/doc/html/pcrematching.html
@@ -126,6 +126,15 @@ character of the subject. The algorithm does not automatically move on to find
matches that start at later positions.
+PCRE's "auto-possessification" optimization usually applies to character
+repeats at the end of a pattern (as well as internally). For example, the
+pattern "a\d+" is compiled as if it were "a\d++" because there is no point
+even considering the possibility of backtracking into the repeated digits. For
+DFA matching, this means that only one possible match is found. If you really
+do want multiple matches in such cases, either use an ungreedy repeat
+("a\d+?") or set the PCRE_NO_AUTO_POSSESS option when compiling.
+
+
There are a number of features of PCRE regular expressions that are not
supported by the alternative matching algorithm. They are as follows:
@@ -224,7 +233,7 @@ Cambridge CB2 3QH, England.
REVISION
-Last updated: 08 January 2012
+Last updated: 12 November 2013
Copyright © 1997-2012 University of Cambridge.
diff --git a/tools/pcre/doc/html/pcrepartial.html b/tools/pcre/doc/html/pcrepartial.html
index 298f92e0..4faeafcb 100644
--- a/tools/pcre/doc/html/pcrepartial.html
+++ b/tools/pcre/doc/html/pcrepartial.html
@@ -81,33 +81,36 @@ strings. This optimization is also disabled for partial matching.
PARTIAL MATCHING USING pcre_exec() OR pcre[16|32]_exec()
A partial match occurs during a call to pcre_exec() or
-pcre[16|32]_exec() when the end of the subject string is reached successfully,
-but matching cannot continue because more characters are needed. However, at
-least one character in the subject must have been inspected. This character
-need not form part of the final matched string; lookbehind assertions and the
-\K escape sequence provide ways of inspecting characters before the start of a
-matched substring. The requirement for inspecting at least one character exists
-because an empty string can always be matched; without such a restriction there
-would always be a partial match of an empty string at the end of the subject.
+pcre[16|32]_exec() when the end of the subject string is reached
+successfully, but matching cannot continue because more characters are needed.
+However, at least one character in the subject must have been inspected. This
+character need not form part of the final matched string; lookbehind assertions
+and the \K escape sequence provide ways of inspecting characters before the
+start of a matched substring. The requirement for inspecting at least one
+character exists because an empty string can always be matched; without such a
+restriction there would always be a partial match of an empty string at the end
+of the subject.
If there are at least two slots in the offsets vector when a partial match is
returned, the first slot is set to the offset of the earliest character that
was inspected. For convenience, the second offset points to the end of the
-subject so that a substring can easily be identified.
+subject so that a substring can easily be identified. If there are at least
+three slots in the offsets vector, the third slot is set to the offset of the
+character where matching started.
-For the majority of patterns, the first offset identifies the start of the
-partially matched string. However, for patterns that contain lookbehind
-assertions, or \K, or begin with \b or \B, earlier characters have been
-inspected while carrying out the match. For example:
+For the majority of patterns, the contents of the first and third slots will be
+the same. However, for patterns that contain lookbehind assertions, or begin
+with \b or \B, characters before the one where matching started may have been
+inspected while carrying out the match. For example, consider this pattern:
/(?<=abc)123/
This pattern matches "123", but only if it is preceded by "abc". If the subject
-string is "xyzabc12", the offsets after a partial match are for the substring
-"abc12", because all these characters are needed if another match is tried
-with extra characters added to the subject.
+string is "xyzabc12", the first two offsets after a partial match are for the
+substring "abc12", because all these characters were inspected. However, the
+third offset is set to 6, because that is the offset where matching began.
What happens when a partial match is identified depends on which of the two
@@ -303,6 +306,16 @@ not retain the previously partially-matched string. It is up to the calling
program to do that if it needs to.
+That means that, for an unanchored pattern, if a continued match fails, it is
+not possible to try again at a new starting point. All this facility is capable
+of doing is continuing with the previous match attempt. In the previous
+example, if the second set of data is "ug23" the result is no match, even
+though there would be a match for "aug23" if the entire string were given at
+once. Depending on the application, this may or may not be what you want.
+The only way to allow for starting again at the next character is to retain the
+matched part of the subject and try a new complete match.
+
+
You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
PCRE_DFA_RESTART to continue partial matching over multiple segments. This
facility can be used to pass very long subject strings to the DFA matching
@@ -334,10 +347,9 @@ processing time is needed.
Note: If the pattern contains lookbehind assertions, or \K, or starts
with \b or \B, the string that is returned for a partial match includes
-characters that precede the partially matched string itself, because these must
-be retained when adding on more characters for a subsequent matching attempt.
-However, in some cases you may need to retain even earlier characters, as
-discussed in the next section.
+characters that precede the start of what would be returned for a complete
+match, because it contains all the characters that were inspected during the
+partial match.
ISSUES WITH MULTI-SEGMENT MATCHING
@@ -356,12 +368,35 @@ includes the effect of PCRE_NOTEOL.
offsets that are returned for a partial match. However a lookbehind assertion
later in the pattern could require even earlier characters to be inspected. You
can handle this case by using the PCRE_INFO_MAXLOOKBEHIND option of the
-pcre_fullinfo() or pcre[16|32]_fullinfo() functions to obtain the length
-of the largest lookbehind in the pattern. This length is given in characters,
-not bytes. If you always retain at least that many characters before the
-partially matched string, all should be well. (Of course, near the start of the
-subject, fewer characters may be present; in that case all characters should be
-retained.)
+pcre_fullinfo() or pcre[16|32]_fullinfo() functions to obtain the
+length of the longest lookbehind in the pattern. This length is given in
+characters, not bytes. If you always retain at least that many characters
+before the partially matched string, all should be well. (Of course, near the
+start of the subject, fewer characters may be present; in that case all
+characters should be retained.)
+
+
+From release 8.33, there is a more accurate way of deciding which characters to
+retain. Instead of subtracting the length of the longest lookbehind from the
+earliest inspected character (offsets[0]), the match start position
+(offsets[2]) should be used, and the next match attempt started at the
+offsets[2] character by setting the startoffset argument of
+pcre_exec() or pcre_dfa_exec().
+
+
+For example, if the pattern "(?<=123)abc" is partially
+matched against the string "xx123a", the three offset values returned are 2, 6,
+and 5. This indicates that the matching process that gave a partial match
+started at offset 5, but the characters "123a" were all inspected. The maximum
+lookbehind for that pattern is 3, so taking that away from 5 gives 2, which
+shows that we need only keep "123a". Within that retained string, the next
+match attempt can be started at offset 3 (that is, at "a") when further
+characters have been added. When the match start is
+not the earliest inspected character, pcretest shows it explicitly:
+
+ re> "(?<=123)abc"
+ data> xx123a\P\P
+ Partial match at offset 5: 123a
+
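The retention rule described above can be sketched as follows. This is a minimal illustration, not PCRE code: the names retain_plan and plan_retention are hypothetical. The sketch assumes the 8-bit library in non-UTF mode, so that the lookbehind length (which is given in characters) and the offsets (which are given in data units) are measured in the same units.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types and names, not part of the PCRE API. */
typedef struct {
    int keep_from;   /* offset in the old subject of the first unit to retain */
    int next_start;  /* startoffset for the next attempt, relative to the
                        retained text */
} retain_plan;

/* offsets points to the first three slots filled in by a partial match:
   offsets[0] = earliest inspected character, offsets[1] = end of subject,
   offsets[2] = position where the match attempt started.
   max_lookbehind is the value obtained via PCRE_INFO_MAXLOOKBEHIND. */
retain_plan plan_retention(const int *offsets, int max_lookbehind)
{
    retain_plan plan;

    assert(offsets != NULL && max_lookbehind >= 0);

    /* Step back from the match start by the longest lookbehind, but not
       past the start of the subject. */
    plan.keep_from = offsets[2] - max_lookbehind;
    if (plan.keep_from < 0)
        plan.keep_from = 0;

    /* The match start, re-expressed relative to the retained text. */
    plan.next_start = offsets[2] - plan.keep_from;
    return plan;
}
```

With the offsets 2, 6, and 5 from the "xx123a" example and a maximum lookbehind of 3, this gives keep_from = 2 (so "123a" is retained) and next_start = 3.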
3. Because a partial match must always contain at least one character, what
@@ -465,9 +500,9 @@ Cambridge CB2 3QH, England.
REVISION
-Last updated: 24 June 2012
+Last updated: 02 July 2013
-Copyright © 1997-2012 University of Cambridge.
+Copyright © 1997-2013 University of Cambridge.
Return to the PCRE index page.
diff --git a/tools/pcre/doc/html/pcrepattern.html b/tools/pcre/doc/html/pcrepattern.html
index ee55d06e..c06d1e03 100644
--- a/tools/pcre/doc/html/pcrepattern.html
+++ b/tools/pcre/doc/html/pcrepattern.html
@@ -14,8 +14,8 @@ man page, in case the conversion went wrong.
@@ -61,6 +62,30 @@ published by O'Reilly, covers regular expressions in great detail. This
description of PCRE's regular expressions is intended as reference material.
+This document discusses the patterns that are supported by PCRE when one of its
+main matching functions, pcre_exec() (8-bit) or pcre[16|32]_exec()
+(16- or 32-bit), is used. PCRE also has alternative matching functions,
+pcre_dfa_exec() and pcre[16|32]_dfa_exec(), which match using a
+different algorithm that is not Perl-compatible. Some of the features discussed
+below are not available when DFA matching is used. The advantages and
+disadvantages of the alternative functions, and how they differ from the normal
+functions, are discussed in the
+pcrematching
+page.
+
+
The original operation of PCRE was on strings of one-byte characters. However,
there is now also support for UTF-8 strings in the original library, an
extra library that supports 16-bit and UTF-16 character strings, and a
@@ -77,50 +102,52 @@ these special sequences:
(*UTF) is a generic sequence that can be used with any of the libraries.
Starting a pattern with such a sequence is equivalent to setting the relevant
-option. This feature is not Perl-compatible. How setting a UTF mode affects
-pattern matching is mentioned in several places below. There is also a summary
-of features in the
+option. How setting a UTF mode affects pattern matching is mentioned in several
+places below. There is also a summary of features in the
pcreunicode
page.
-Another special sequence that may appear at the start of a pattern or in
-combination with (*UTF8), (*UTF16), (*UTF32) or (*UTF) is:
-
+Another special sequence that may appear at the start of a pattern is (*UCP).
This has the same effect as setting the PCRE_UCP option: it causes sequences
such as \d and \w to use Unicode properties to determine character types,
instead of recognizing only characters with codes less than 128 via a lookup
table.
+
+If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting
+the PCRE_NO_AUTO_POSSESS option at compile time. This stops PCRE from making
+quantifiers possessive when what follows cannot match the repeated item. For
+example, by default a+b is treated as a++b. For more details, see the
+pcreapi
+documentation.
+
+
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
-PCRE_NO_START_OPTIMIZE option either at compile or matching time. There are
-also some more of these special sequences that are concerned with the handling
-of newlines; they are described below.
-
-
-The remainder of this document discusses the patterns that are supported by
-PCRE when one its main matching functions, pcre_exec() (8-bit) or
-pcre[16|32]_exec() (16- or 32-bit), is used. PCRE also has alternative
-matching functions, pcre_dfa_exec() and pcre[16|32_dfa_exec(),
-which match using a different algorithm that is not Perl-compatible. Some of
-the features discussed below are not available when DFA matching is used. The
-advantages and disadvantages of the alternative functions, and how they differ
-from the normal functions, are discussed in the
-pcrematching
-page.
-
-
-PCRE can be compiled to run in an environment that uses EBCDIC as its character
-code rather than ASCII or Unicode (typically a mainframe system). In the
-sections below, character code values are ASCII or Unicode; in an EBCDIC
-environment these characters may have different code values, and there are no
-code points greater than 255.
+PCRE_NO_START_OPTIMIZE option either at compile or matching time. This disables
+several optimizations for quickly reaching "no match" results. For more
+details, see the
+pcreapi
+documentation.
-
PCRE supports five different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (linefeed)
@@ -148,9 +175,7 @@ example, on a Unix system where LF is the default newline sequence, the pattern
(*CR)a.b
changes the convention to CR. That pattern matches "a\nb" because LF is no
-longer a newline. Note that these special settings, which are not
-Perl-compatible, are recognized only at the very start of a pattern, and that
-they must be in upper case. If more than one of them is present, the last one
+longer a newline. If more than one of these settings is present, the last one
is used.
@@ -164,6 +189,36 @@ description of \R in the section entitled
below. A change of \R setting can be combined with a change of newline
convention.
+
+PCRE can be compiled to run in an environment that uses EBCDIC as its character
+code rather than ASCII or Unicode (typically a mainframe system). In the
+sections below, character code values are ASCII or Unicode; in an EBCDIC
+environment these characters may have different code values, and there are no
+code points greater than 255.
+
A regular expression is a pattern that is matched against a subject string from
@@ -241,10 +296,11 @@ backslash. All other characters (in particular, those whose codepoints are
greater than 127) are treated as literals.
-If a pattern is compiled with the PCRE_EXTENDED option, white space in the
-pattern (other than in a character class) and characters between a # outside
-a character class and the next newline are ignored. An escaping backslash can
-be used to include a white space or # character as part of the pattern.
+If a pattern is compiled with the PCRE_EXTENDED option, most white space in the
+pattern (other than in a character class), and characters between a # outside a
+character class and the next newline, inclusive, are ignored. An escaping
+backslash can be used to include a white space or # character as part of the
+pattern.
If you want to remove the special meaning from a sequence of characters, you
@@ -282,7 +338,9 @@ one of the following escape sequences than the binary character it represents:
\n linefeed (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
+ \0dd character with octal code 0dd
\ddd character with octal code ddd, or back reference
+ \o{ddd..} character with octal code ddd..
\xhh character with hex code hh
\x{hhh..} character with hex code hhh.. (non-JavaScript mode)
\uhhhh character with hex code hhhh (JavaScript mode only)
@@ -305,42 +363,6 @@ the EBCDIC letters are disjoint, \cZ becomes hex 29 (Z is E9), and other
characters also generate different values.
-By default, after \x, from zero to two hexadecimal digits are read (letters
-can be in upper or lower case). Any number of hexadecimal digits may appear
-between \x{ and }, but the character code is constrained as follows:
-
-If characters other than hexadecimal digits appear between \x{ and }, or if
-there is no terminating }, this form of escape is not recognized. Instead, the
-initial \x will be interpreted as a basic hexadecimal escape, with no
-following digits, giving a character whose value is zero.
-
-
-If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is
-as just described only when it is followed by two hexadecimal digits.
-Otherwise, it matches a literal "x" character. In JavaScript mode, support for
-code points greater than 256 is provided by \u, which must be followed by
-four hexadecimal digits; otherwise it matches a literal "u" character.
-Character codes specified by \u in JavaScript mode are constrained in the same
-was as those specified by \x in non-JavaScript mode.
-
-
-Characters whose value is less than 256 can be defined by either of the two
-syntaxes for \x (or by \u in JavaScript mode). There is no difference in the
-way they are handled. For example, \xdc is exactly the same as \x{dc} (or
-\u00dc in JavaScript mode).
-
-
After \0 up to two further octal digits are read. If there are fewer than two
digits, just those that are present are used. Thus the sequence \0\x\07
specifies two binary zeros followed by a BEL character (code value 7). Make
@@ -348,9 +370,23 @@ sure you supply two digits after the initial zero if the pattern character that
follows is itself an octal digit.
-The handling of a backslash followed by a digit other than 0 is complicated.
-Outside a character class, PCRE reads it and any following digits as a decimal
-number. If the number is less than 10, or if there have been at least that many
+The escape \o must be followed by a sequence of octal digits, enclosed in
+braces. An error occurs if this is not the case. This escape is a recent
+addition to Perl; it provides a way of specifying character code points as octal
+numbers greater than 0777, and it also allows octal numbers and back references
+to be unambiguously specified.
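The \o{...} syntax described above can be sketched in Python. This is only an illustration of the documented rule, not PCRE's parser: the function name parse_o_escape is hypothetical, and the mode-dependent limits on character values are not modelled.

```python
OCTAL_DIGITS = "01234567"


def parse_o_escape(text):
    """Parse a leading \\o{ddd..} escape; return (code_point, remaining_text)."""
    if not text.startswith("\\o{"):
        raise ValueError("expected \\o{")
    end = text.find("}", 3)
    if end == -1:
        raise ValueError("missing terminating }")
    digits = text[3:end]
    # The braces must enclose at least one digit, and octal digits only.
    if not digits or any(c not in OCTAL_DIGITS for c in digits):
        raise ValueError("only octal digits are allowed between \\o{ and }")
    return int(digits, 8), text[end + 1:]
```

Because the digits are delimited by braces, a value above octal 777 such as \o{1000} can be written, and a digit that follows the closing brace is never absorbed into the number.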
+
+
+For greater clarity and unambiguity, it is best to avoid following \ by a
+digit greater than zero. Instead, use \o{} or \x{} to specify character
+numbers, and \g{} to specify back references. The following paragraphs
+describe the old, ambiguous syntax.
+
+
+The handling of a backslash followed by a digit other than 0 is complicated,
+and Perl has changed in recent releases, causing PCRE also to change. Outside a
+character class, PCRE reads the digit and any following digits as a decimal
+number. If the number is less than 8, or if there have been at least that many
previous capturing left parentheses in the expression, the entire sequence is
taken as a back reference. A description of how this works is given
later,
@@ -358,12 +394,11 @@ following the discussion of
parenthesized subpatterns.
-Inside a character class, or if the decimal number is greater than 9 and there
-have not been that many capturing subpatterns, PCRE re-reads up to three octal
-digits following the backslash, and uses them to generate a data character. Any
-subsequent digits stand for themselves. The value of the character is
-constrained in the same way as characters specified in hexadecimal.
-For example:
+Inside a character class, or if the decimal number following \ is greater than
+7 and there have not been that many capturing subpatterns, PCRE handles \8 and
+\9 as the literal characters "8" and "9", and otherwise re-reads up to three
+octal digits following the backslash, using them to generate a data character.
+Any subsequent digits stand for themselves. For example:
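The decision rule in the two paragraphs above can be sketched as a small Python function. It is an illustration of the documented rule only, not PCRE's implementation; the function name and the shape of its return value are hypothetical, and the limits on character values are not modelled.

```python
DECIMAL_DIGITS = "0123456789"
OCTAL_DIGITS = "01234567"


def interpret_backslash_digits(digits, groups_so_far, in_class=False):
    """Classify a backslash followed by `digits` (first digit is not 0).

    Returns (kind, value, rest), where `rest` holds trailing digits that
    stand for themselves.
    """
    if not digits or digits[0] == "0" or any(c not in DECIMAL_DIGITS for c in digits):
        raise ValueError("expected decimal digits, the first one non-zero")
    number = int(digits)
    # Outside a class: a small number, or one that refers to an existing
    # capturing group, is a back reference.
    if not in_class and (number < 8 or number <= groups_so_far):
        return ("backref", number, "")
    # \8 and \9 are the literal characters "8" and "9".
    if digits[0] in "89":
        return ("literal", digits[0], digits[1:])
    # Otherwise re-read up to three octal digits as a data character.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in OCTAL_DIGITS:
        n += 1
    return ("char", int(digits[:n], 8), digits[n:])
```

With fewer than 11 capturing groups, interpret_backslash_digits("11", 0) yields a tab (code 9), whereas with 11 or more groups the same text is a back reference. This ambiguity is why \o{}, \x{} and \g{} are recommended instead.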
+By default, after \x that is not followed by {, from zero to two hexadecimal
+digits are read (letters can be in upper or lower case). Any number of
+hexadecimal digits may appear between \x{ and }. If a character other than
+a hexadecimal digit appears between \x{ and }, or if there is no terminating
+}, an error occurs.
+
+
+If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is
+as just described only when it is followed by two hexadecimal digits.
+Otherwise, it matches a literal "x" character. In JavaScript mode, support for
+code points greater than 255 is provided by \u, which must be followed by
+four hexadecimal digits; otherwise it matches a literal "u" character.
+
+
+Characters whose value is less than 256 can be defined by either of the two
+syntaxes for \x (or by \u in JavaScript mode). There is no difference in the
+way they are handled. For example, \xdc is exactly the same as \x{dc} (or
+\u00dc in JavaScript mode).
+
+
+Characters that are specified using octal or hexadecimal numbers are
+limited to certain values, as follows:
+
All the sequences that define a single character value can be used both inside
and outside character classes. In addition, inside a character class, \b is
interpreted as the backspace character (hex 08).
@@ -456,11 +532,14 @@ matching point is at the end of the subject string, all of them fail, because
there is no character to match.
-For compatibility with Perl, \s does not match the VT character (code 11).
-This makes it different from the the POSIX "space" class. The \s characters
-are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is
-included in a Perl script, \s may match the VT character. In PCRE, it never
-does.
+For compatibility with Perl, \s did not formerly match the VT character (code
+11), which made it different from the POSIX "space" class. However, Perl
+added VT at release 5.18, and PCRE followed suit at release 8.34. The default
+\s characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space
+(32), which are defined as white space in the "C" locale. This list may vary if
+locale-specific matching is taking place. For example, in some locales the
+"non-breaking space" character (\xA0) is recognized as white space, and in
+others the VT character is not.
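The change to the default \s set can be written out as data. The following Python sketch simply encodes the two lists given in this documentation (before and from release 8.34); the names are hypothetical and locale-specific matching is not modelled.

```python
import string

# Default \s code points before PCRE 8.34: HT, LF, FF, CR, space (no VT).
S_BEFORE_8_34 = {9, 10, 12, 13, 32}
# From 8.34 onwards VT (11) is included as well.
S_FROM_8_34 = S_BEFORE_8_34 | {11}


def default_s_code_points(version):
    """Return the default \\s set for a (major, minor) PCRE version tuple."""
    return S_FROM_8_34 if version >= (8, 34) else S_BEFORE_8_34


# Python's string.whitespace lists the same six characters as the 8.34 set.
assert {ord(c) for c in string.whitespace} == S_FROM_8_34
```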
A "word" character is an underscore or any character that is a letter or digit.
@@ -471,21 +550,23 @@ place (see
in the
pcreapi
page). For example, in a French locale such as "fr_FR" in Unix-like systems,
-or "french" in Windows, some character codes greater than 128 are used for
+or "french" in Windows, some character codes greater than 127 are used for
accented letters, and these are then matched by \w. The use of locales with
Unicode is discouraged.
-By default, in a UTF mode, characters with values greater than 128 never match
-\d, \s, or \w, and always match \D, \S, and \W. These sequences retain
-their original meanings from before UTF support was available, mainly for
-efficiency reasons. However, if PCRE is compiled with Unicode property support,
-and the PCRE_UCP option is set, the behaviour is changed so that Unicode
-properties are used to determine character types, as follows:
+By default, characters whose code points are greater than 127 never match \d,
+\s, or \w, and always match \D, \S, and \W, although this may vary for
+characters in the range 128-255 when locale-specific matching is happening.
+These escape sequences retain their original meanings from before Unicode
+support was available, mainly for efficiency reasons. If PCRE is compiled with
+Unicode property support, and the PCRE_UCP option is set, the behaviour is
+changed so that Unicode properties are used to determine character types, as
+follows:
The sequences \h, \H, \v, and \V are features that were added to Perl at
release 5.10. In contrast to the other sequences, which match only ASCII
-characters by default, these always match certain high-valued codepoints,
+characters by default, these always match certain high-valued code points,
whether or not PCRE_UCP is set. The horizontal space characters are:
U+0009 Horizontal tab (HT)
@@ -806,7 +887,8 @@ Unicode table.
Specifying caseless matching does not affect these escape sequences. For
-example, \p{Lu} always matches only upper case letters.
+example, \p{Lu} always matches only upper case letters. This is different from
+the behaviour of current versions of Perl.
Matching characters by Unicode property is not fast, because PCRE has to do a
@@ -870,8 +952,9 @@ PCRE's additional properties
As well as the standard Unicode properties described above, PCRE supports four
more that make it possible to convert traditional escape sequences such as \w
-and \s and POSIX character classes to use Unicode properties. PCRE uses these
-non-standard, non-Perl properties internally when PCRE_UCP is set. They are:
+and \s to use Unicode properties. PCRE uses these non-standard, non-Perl
+properties internally when PCRE_UCP is set. However, they may also be used
+explicitly. These properties are:
Xan Any alphanumeric character
Xps Any POSIX space character
@@ -881,8 +964,19 @@ non-standard, non-Perl properties internally when PCRE_UCP is set. They are:
Xan matches characters that have either the L (letter) or the N (number)
property. Xps matches the characters tab, linefeed, vertical tab, form feed, or
carriage return, and any other character that has the Z (separator) property.
-Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the
-same characters as Xan, plus underscore.
+Xsp is the same as Xps; it used to exclude vertical tab, for Perl
+compatibility, but Perl changed, and so PCRE followed at release 8.34. Xwd
+matches the same characters as Xan, plus underscore.
+
+
+There is another non-standard property, Xuc, which matches any character that
+can be represented by a Universal Character Name in C++ and other programming
+languages. These are the characters $, @, ` (grave accent), and all characters
+with Unicode code points greater than or equal to U+00A0, except for the
+surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are
+excluded. (Universal Character Names are of the form \uHHHH or \UHHHHHHHH
+where H is a hexadecimal digit. Note that the Xuc property does not match these
+sequences but the characters that they represent.)
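The definition of Xuc above translates directly into a predicate. This Python sketch is an assumption-labelled illustration (the name is_xuc is hypothetical), and it tests single characters rather than \uHHHH escape sequences.

```python
def is_xuc(ch):
    """True if the single character `ch` would match \\p{Xuc} as described."""
    if ch in "$@`":
        return True
    cp = ord(ch)
    # U+00A0 and above, excluding the surrogates U+D800 to U+DFFF.
    return cp >= 0xA0 and not (0xD800 <= cp <= 0xDFFF)
```

For example, is_xuc("A") is false because most ASCII characters are excluded, while is_xuc("\u00e9") is true.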
Resetting the match start
@@ -909,7 +1003,9 @@ matches "foobar", the first substring is still set to "foo".
Perl documents that the use of \K within assertions is "not well defined". In
PCRE, \K is acted upon when it occurs inside positive assertions, but is
-ignored in negative assertions.
+ignored in negative assertions. Note that when a pattern such as (?=ab\K)
+matches, the reported start of the match can be greater than the end of the
+match.
Simple assertions
@@ -1164,7 +1260,9 @@ The minus (hyphen) character can be used to specify a range of characters in a
character class. For example, [d-m] matches any letter between d and m,
inclusive. If a minus character is required in a class, it must be escaped with
a backslash or appear in a position where it cannot be interpreted as
-indicating a range, typically as the first or last character in the class.
+indicating a range, typically as the first or last character in the class, or
+immediately after a range. For example, [b-d-z] matches letters in the range b
+to d, a hyphen character, or z.
It is not possible to have the literal character "]" as the end character of a
@@ -1176,6 +1274,12 @@ followed by two other characters. The octal or hexadecimal representation of
"]" can also be used to end a range.
+An error is generated if a POSIX character class (see below) or an escape
+sequence other than one that defines a single character appears at a point
+where a range ending character is expected. For example, [z-\xff] is valid,
+but [A-\d] and [A-[:digit:]] are not.
+
+
Ranges operate in the collating sequence of character values. They can also be
used for characters specified numerically, for example [\000-\037]. Ranges
can include any characters that are valid for the current mode.
@@ -1215,9 +1319,9 @@ something AND NOT ...".
The only metacharacters that are recognized in character classes are backslash,
hyphen (only where it can be interpreted as specifying a range), circumflex
(only at the start), opening square bracket (only when it can be interpreted as
-introducing a POSIX class name - see the next section), and the terminating
-closing square bracket. However, escaping other non-alphanumeric characters
-does no harm.
+introducing a POSIX class name, or for a special compatibility feature - see
+the next two sections), and the terminating closing square bracket. However,
+escaping other non-alphanumeric characters does no harm.
POSIX CHARACTER CLASSES
@@ -1240,15 +1344,17 @@ are:
lower lower case letters
print printing characters, including space
punct printing characters, excluding letters and digits and space
- space white space (not quite the same as \s)
+ space white space (the same as \s from PCRE 8.34)
upper upper case letters
word "word" characters (same as \w)
xdigit hexadecimal digits
-The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
-space (32). Notice that this list includes the VT character (code 11). This
-makes "space" different to \s, which does not include VT (for Perl
-compatibility).
+The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
+and space (32). If locale-specific matching is taking place, the list of space
+characters may be different; there may be fewer or more of them. "Space" used
+to be different to \s, which did not include VT, for Perl compatibility.
+However, Perl changed at release 5.18, and PCRE followed at release 8.34.
+"Space" and \s now match the same set of characters.
The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
@@ -1262,11 +1368,11 @@ syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
supported, and an error is given if they are encountered.
-By default, in UTF modes, characters with values greater than 128 do not match
-any of the POSIX character classes. However, if the PCRE_UCP option is passed
-to pcre_compile(), some of the classes are changed so that Unicode
-character properties are used. This is achieved by replacing the POSIX classes
-by other sequences, as follows:
+By default, characters with values greater than 127 do not match any of the
+POSIX character classes. However, if the PCRE_UCP option is passed to
+pcre_compile(), some of the classes are changed so that Unicode character
+properties are used. This is achieved by replacing certain POSIX classes by
+other sequences, as follows:
[:alnum:] becomes \p{Xan}
[:alpha:] becomes \p{L}
@@ -1277,11 +1383,56 @@ by other sequences, as follows:
[:upper:] becomes \p{Lu}
[:word:] becomes \p{Xwd}
-Negated versions, such as [:^alpha:] use \P instead of \p. The other POSIX
-classes are unchanged, and match only characters with code points less than
-128.
+Negated versions, such as [:^alpha:] use \P instead of \p. Three other POSIX
+classes are handled specially in UCP mode:
-
VERTICAL BAR
+
+[:graph:]
+This matches characters that have glyphs that mark the page when printed. In
+Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf
+properties, except for:
+
+ U+061C Arabic Letter Mark
+ U+180E Mongolian Vowel Separator
+ U+2066 - U+2069 Various "isolate"s
+
+
+
+
+[:print:]
+This matches the same characters as [:graph:] plus space characters that are
+not controls, that is, characters with the Zs property.
+
+
+[:punct:]
+This matches all characters that have the Unicode P (punctuation) property,
+plus those characters whose code points are less than 128 that have the S
+(Symbol) property.
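The three UCP-mode definitions above can be approximated with Python's unicodedata module. This is a sketch under assumptions: the function names are hypothetical, and Python's Unicode tables may correspond to a different Unicode release from the one PCRE was built with, so individual characters can be classified differently.

```python
import unicodedata

# Characters excluded from [:graph:] despite their general category.
GRAPH_EXCEPTIONS = {0x061C, 0x180E} | set(range(0x2066, 0x206A))


def is_graph_ucp(ch):
    """L, M, N, P, S or Cf property, minus the listed exceptions."""
    cat = unicodedata.category(ch)
    listed = cat[0] in "LMNPS" or cat == "Cf"
    return listed and ord(ch) not in GRAPH_EXCEPTIONS


def is_print_ucp(ch):
    """Same as [:graph:] plus characters with the Zs property."""
    return is_graph_ucp(ch) or unicodedata.category(ch) == "Zs"


def is_punct_ucp(ch):
    """Any P character, plus S characters with code points below 128."""
    cat = unicodedata.category(ch)
    return cat[0] == "P" or (ord(ch) < 128 and cat[0] == "S")
```

Note the asymmetry in [:punct:]: "+" (a symbol below 128) matches, but the euro sign (a symbol above 127) does not.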
+
+
+The other POSIX classes are unchanged, and match only characters with code
+points less than 128.
+
+
COMPATIBILITY FEATURE FOR WORD BOUNDARIES
+
+In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly
+syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of
+word". PCRE treats these items as follows:
+
+ [[:<:]] is converted to \b(?=\w)
+ [[:>:]] is converted to \b(?<=\w)
+
+Only these exact character sequences are recognized. A sequence such as
+[a[:<:]b] provokes an error for an unrecognized POSIX class name. This support is
+not compatible with Perl. It is provided to help migrations from other
+environments, and is best not used in any new patterns. Note that \b matches
+at the start and the end of a word (see
+"Simple assertions"
+above), and in a Perl-style pattern the preceding or following character
+normally shows which is wanted, without the need for the assertions that are
+used above in order to give exactly the POSIX behaviour.
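The two conversions can be tried out with a naive text substitution in Python, whose re module supports \b and both kinds of lookaround. This is a sketch only: the helper name is hypothetical, and a plain string replacement ignores context such as escapes, which PCRE's own compiler does take into account.

```python
import re


def convert_bsd_word_boundaries(pattern):
    """Rewrite the exact sequences [[:<:]] and [[:>:]] as described above."""
    return (pattern.replace("[[:<:]]", r"\b(?=\w)")
                   .replace("[[:>:]]", r"\b(?<=\w)"))


converted = convert_bsd_word_boundaries("[[:<:]]cat[[:>:]]")
# Collect the start offsets of every match in a sample subject.
starts = [m.start() for m in re.finditer(converted, "cat concat cats cat.")]
```

Here only the two standalone occurrences of "cat" match; the ones inside "concat" and "cats" fail the start-of-word and end-of-word tests respectively.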
+
+
VERTICAL BAR
Vertical bar characters are used to separate alternative patterns. For example,
the pattern
@@ -1296,7 +1447,7 @@ that succeeds is used. If the alternatives are within a subpattern
"succeeds" means matching the rest of the main pattern as well as the
alternative in the subpattern.
-
INTERNAL OPTION SETTING
+
INTERNAL OPTION SETTING
The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
PCRE_EXTENDED options (which are Perl-compatible) can be changed from within
@@ -1356,9 +1507,10 @@ above. There are also the (*UTF8), (*UTF16),(*UTF32), and (*UCP) leading
sequences that can be used to set UTF and Unicode property modes; they are
equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP
options, respectively. The (*UTF) sequence is a generic version that can be
-used with any of the libraries.
+used with any of the libraries. However, the application can set the
+PCRE_NEVER_UTF option, which locks out the use of the (*UTF) sequences.
-
SUBPATTERNS
+
SUBPATTERNS
Subpatterns are delimited by parentheses (round brackets), which can be nested.
Turning part of a pattern into a subpattern does two things:
@@ -1414,7 +1566,7 @@ from left to right, and options are not reset until the end of the subpattern
is reached, an option setting in one branch does affect subsequent branches, so
the above patterns match "SUNDAY" as well as "Saturday".
-
DUPLICATE SUBPATTERN NUMBERS
+
DUPLICATE SUBPATTERN NUMBERS
Perl 5.10 introduced a feature whereby each alternative in a subpattern uses
the same numbers for its capturing parentheses. Such a subpattern starts with
@@ -1458,7 +1610,7 @@ true if any of the subpatterns of that number have matched.
An alternative approach to using this "branch reset" feature is to use
duplicate named subpatterns, as described in the next section.
-
NAMED SUBPATTERNS
+
NAMED SUBPATTERNS
Identifying capturing parentheses by number is simple, but it can be very hard
to keep track of the numbers in complicated regular expressions. Furthermore,
@@ -1480,11 +1632,12 @@ and
can be made by name as well as by number.
-Names consist of up to 32 alphanumeric characters and underscores. Named
-capturing parentheses are still allocated numbers as well as names, exactly as
-if the names were not present. The PCRE API provides function calls for
-extracting the name-to-number translation table from a compiled pattern. There
-is also a convenience function for extracting a captured substring by name.
+Names consist of up to 32 alphanumeric characters and underscores, but must
+start with a non-digit. Named capturing parentheses are still allocated numbers
+as well as names, exactly as if the names were not present. The PCRE API
+provides function calls for extracting the name-to-number translation table
+from a compiled pattern. There is also a convenience function for extracting a
+captured substring by name.
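The naming rule above can be expressed as a simple check. This Python sketch assumes that "alphanumeric" means the ASCII letters and digits; the helper name is hypothetical and PCRE itself performs this validation when compiling a pattern.

```python
import re

# One non-digit name character, then up to 31 further name characters.
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,31}")


def is_valid_group_name(name):
    """True if `name` satisfies the documented length and first-character rules."""
    return NAME_RE.fullmatch(name) is not None
```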
By default, a name must be unique within a pattern, but it is possible to relax
@@ -1513,9 +1666,23 @@ matched. This saves searching to find which numbered subpattern it was.
If you make a back reference to a non-unique named subpattern from elsewhere in
-the pattern, the one that corresponds to the first occurrence of the name is
-used. In the absence of duplicate numbers (see the previous section) this is
-the one with the lowest number. If you use a named reference in a condition
+the pattern, the subpatterns to which the name refers are checked in the order
+in which they appear in the overall pattern. The first one that is set is used
+for the reference. For example, this pattern matches both "foofoo" and
+"barbar" but not "foobar" or "barfoo":
+
+ (?:(?<n>foo)|(?<n>bar))\k<n>
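Python's re module does not allow duplicate group names, but the "first one that is set" behaviour can be emulated there with two distinct names and a conditional. The sketch below is such an emulation, not PCRE itself; the group names n1 and n2 are stand-ins for the two subpatterns named n.

```python
import re

# (?(n1)...|...) uses n1 if it is set, otherwise falls back to n2,
# mirroring the order in which PCRE checks the duplicate-named groups.
EMULATED = re.compile(r"(?:(?P<n1>foo)|(?P<n2>bar))(?(n1)(?P=n1)|(?P=n2))")


def matches(subject):
    """True if the whole subject matches the emulated pattern."""
    return EMULATED.fullmatch(subject) is not None
```

As with the PCRE pattern, "foofoo" and "barbar" match, while "foobar" and "barfoo" do not.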
+
+
+
+
+If you make a subroutine call to a non-unique named subpattern, the one that
+corresponds to the first occurrence of the name is used. In the absence of
+duplicate numbers (see the previous section) this is the one with the lowest
+number.
+
+
+If you use a named reference in a condition
test (see the
section about conditions
below), either to check whether a subpattern has matched, or to check for
@@ -1530,10 +1697,11 @@ documentation.
Warning: You cannot use different names to distinguish between two
subpatterns with the same number because PCRE uses only the numbers when
matching. For this reason, an error is given at compile time if different names
-are given to subpatterns with the same number. However, you can give the same
-name to subpatterns with the same number, even when PCRE_DUPNAMES is not set.
+are given to subpatterns with the same number. However, you can always give the
+same name to subpatterns with the same number, even when PCRE_DUPNAMES is not
+set.
-
REPETITION
+
REPETITION
Repetition is specified by quantifiers, which can follow any of the following
items:
@@ -1701,7 +1869,7 @@ example, after
matches "aba" the value of the second captured substring is "b".
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
repetition, failure of what follows normally causes the repeated item to be
@@ -1805,7 +1973,7 @@ an atomic group, like this:
sequences of non-digits cannot be broken, and failure happens quickly.
-
Outside a character class, a backslash followed by a digit greater than 0 (and
possibly further digits) is a back reference to a capturing subpattern earlier
@@ -1933,7 +2101,7 @@ as an
Once the whole group has been matched, a subsequent matching failure cannot
cause backtracking into the middle of the group.
-
An assertion is a test on the characters following or preceding the current
matching point that does not actually consume any characters. The simple
@@ -1950,8 +2118,8 @@ except that it does not cause the current matching position to be changed.
Assertion subpatterns are not capturing subpatterns. If such an assertion
contains capturing subpatterns within it, these are counted for the purposes of
numbering the capturing subpatterns in the whole pattern. However, substring
-capturing is carried out only for positive assertions, because it does not make
-sense for negative assertions.
+capturing is carried out only for positive assertions. (Perl sometimes, but not
+always, does do capturing in negative assertions.)
For compatibility with Perl, assertion subpatterns may be repeated; though
@@ -2123,7 +2291,7 @@ preceded by "foo", while
is another pattern that matches "foo" preceded by three digits and any three
characters that are not "999".
-
It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative subpatterns, depending on
@@ -2197,12 +2365,7 @@ Checking for a used subpattern by name
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
subpattern by name. For compatibility with earlier versions of PCRE, which had
-this facility before Perl, the syntax (?(name)...) is also recognized. However,
-there is a possible ambiguity with this syntax, because subpattern names may
-consist entirely of digits. PCRE looks first for a named subpattern; if it
-cannot find one and the name consists entirely of digits, PCRE looks for a
-subpattern of that number, which must be greater than zero. Using subpattern
-names that consist entirely of digits is not recommended.
+this facility before Perl, the syntax (?(name)...) is also recognized.
Rewriting the above example to use a named subpattern gives this:
@@ -2278,7 +2441,7 @@ subject is matched against the first alternative; otherwise it is matched
against the second. This pattern matches strings in one of the two forms
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
-
There are two ways of including comments in patterns that are processed by
PCRE. In both cases, the start of the comment must not be in a character class,
@@ -2307,7 +2470,7 @@ a newline in the pattern. The sequence \n is still literal at this stage, so
it does not terminate the comment. Only an actual character with the code value
0x0a (the default newline) does so.
-
Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best that can
@@ -2522,7 +2685,7 @@ now match "b" and so the whole match succeeds. In Perl, the pattern fails to
match because inside the recursive call \1 cannot access the externally set
value.
-
If the syntax for a recursive subpattern call (either by number or by
name) is used outside the parentheses to which it refers, it operates like a
@@ -2563,7 +2726,7 @@ different calls. For example, consider this pattern:
It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.
-
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
a number enclosed either in angle brackets or single quotes, is an alternative
@@ -2581,7 +2744,7 @@ plus or a minus sign it is taken as a relative reference. For example:
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
synonymous. The former is a back reference; the latter is a subroutine call.
-
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
code to be obeyed in the middle of matching a regular expression. This makes it
@@ -2605,53 +2768,65 @@ For example, this pattern has two callout points:
If the PCRE_AUTO_CALLOUT flag is passed to a compiling function, callouts are
automatically installed before each item in the pattern. They are all numbered
-255.
+255. If there is a conditional group in the pattern whose condition is an
+assertion, an additional callout is inserted just before the condition. An
+explicit callout may also be set at this position, as in this example:
+
During matching, when PCRE reaches a callout point, the external function is
called. It is provided with the number of the callout, the position in the
pattern, and, optionally, one item of data originally supplied by the caller of
the matching function. The callout function may cause matching to proceed, to
-backtrack, or to fail altogether. A complete description of the interface to
-the callout function is given in the
+backtrack, or to fail altogether.
+
+
+By default, PCRE implements a number of optimizations at compile time and
+matching time, and one side-effect is that sometimes callouts are skipped. If
+you need all possible callouts to happen, you need to set options that disable
+the relevant optimizations. More details, and a complete description of the
+interface to the callout function, are given in the
pcrecallout
documentation.
-
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
-are described in the Perl documentation as "experimental and subject to change
-or removal in a future version of Perl". It goes on to say: "Their usage in
-production code should be noted to avoid problems during upgrades." The same
+are still described in the Perl documentation as "experimental and subject to
+change or removal in a future version of Perl". It goes on to say: "Their usage
+in production code should be noted to avoid problems during upgrades." The same
remarks apply to the PCRE features described in this section.
-Since these verbs are specifically related to backtracking, most of them can be
-used only when the pattern is to be matched using one of the traditional
-matching functions, which use a backtracking algorithm. With the exception of
-(*FAIL), which behaves like a failing negative assertion, they cause an error
-if encountered by a DFA matching function.
-
-
-If any of these verbs are used in an assertion or in a subpattern that is
-called as a subroutine (whether or not recursively), their effect is confined
-to that subpattern; it does not extend to the surrounding pattern, with one
-exception: the name from a *(MARK), (*PRUNE), or (*THEN) that is encountered in
-a successful positive assertion is passed back when a match succeeds
-(compare capturing parentheses in assertions). Note that such subpatterns are
-processed as anchored at the point where they are tested. Note also that Perl's
-treatment of subroutines and assertions is different in some cases.
-
-
The new verbs make use of what was previously invalid syntax: an opening
parenthesis followed by an asterisk. They are generally of the form
-(*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,
-depending on whether or not an argument is present. A name is any sequence of
-characters that does not include a closing parenthesis. The maximum length of
-name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit library.
-If the name is empty, that is, if the closing parenthesis immediately follows
-the colon, the effect is as if the colon were not there. Any number of these
-verbs may occur in a pattern.
+(*VERB) or (*VERB:NAME). Some may take either form, possibly behaving
+differently depending on whether or not a name is present. A name is any
+sequence of characters that does not include a closing parenthesis. The maximum
+length of a name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit
+libraries. If the name is empty, that is, if the closing parenthesis
+immediately follows the colon, the effect is as if the colon were not there.
+Any number of these verbs may occur in a pattern.
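The syntax rules in the paragraph above can be sketched as a small standalone parser. This is an illustration only, not PCRE's own parsing code: the helper name `parse_verb`, the `width` parameter, and the regular expression are assumptions made for the example. It covers only the general (*VERB) and (*VERB:NAME) shape, and it counts name length in Python characters, which is a simplification.

```python
import re

# Limits quoted in the text: 255 in the 8-bit library, 65535 in the
# 16-bit and 32-bit libraries. Keyed here by library width (assumed layout).
MAX_NAME_LENGTH = {8: 255, 16: 65535, 32: 65535}

# Hypothetical pattern for the general shape: "(*", a verb, an optional
# ":name" where the name contains no closing parenthesis, then ")".
_VERB_RE = re.compile(r"\(\*([A-Z_]+)(?::([^)]*))?\)")


def parse_verb(text, width=8):
    """Split "(*VERB)" or "(*VERB:NAME)" into (verb, name).

    name is None when no name is given or when the name is empty,
    mirroring the rule that an empty name acts as if the colon were absent.
    """
    match = _VERB_RE.fullmatch(text)
    if match is None:
        raise ValueError("not of the form (*VERB) or (*VERB:NAME): %r" % text)
    verb, name = match.group(1), match.group(2)
    if not name:
        # Covers both a missing ":NAME" part and an empty name after the colon.
        name = None
    elif len(name) > MAX_NAME_LENGTH[width]:
        raise ValueError("name exceeds the limit for the %d-bit library" % width)
    return verb, name


print(parse_verb("(*PRUNE)"))      # ('PRUNE', None)
print(parse_verb("(*THEN:alt1)"))  # ('THEN', 'alt1')
print(parse_verb("(*MARK:)"))      # ('MARK', None)
```

Whether a particular verb accepts, requires, or ignores a name is decided by PCRE itself and is not modelled in this sketch.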
+
+
+Since these verbs are specifically related to backtracking, most of them can be
+used only when the pattern is to be matched using one of the traditional
+matching functions, because these use a backtracking algorithm. With the
+exception of (*FAIL), which behaves like a failing negative assertion, the
+backtracking control verbs cause an error if encountered by a DFA matching
+function.
+
+
+If (*ACCEPT) is inside capturing parentheses, the data so far is captured. For
+example:
-When a match succeeds, the name of the last-encountered (*MARK) on the matching
-path is passed back to the caller as described in the section entitled
+When a match succeeds, the name of the last-encountered (*MARK:NAME),
+(*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to the
+caller as described in the section entitled
"Extra data for pcre_exec()"
in the
pcreapi
@@ -2744,13 +2924,13 @@ of obtaining this information than putting each alternative in its own
capturing parentheses.
-If (*MARK) is encountered in a positive assertion, its name is recorded and
-passed back if it is the last-encountered. This does not happen for negative
-assertions.
+If a verb with a name is encountered in a positive assertion that is true, the
+name is recorded and passed back if it is the last encountered. This does not
+happen for negative assertions or failing positive assertions.
-After a partial match or a failed match, the name of the last encountered
-(*MARK) in the entire match process is returned. For example:
+After a partial match or a failed match, the last encountered name in the
+entire match process is returned. For example:
Re-using a precompiled pattern is straightforward. Having reloaded it into main
-memory, called pcre[16|32]_pattern_to_host_byte_order() if necessary,
-you pass its pointer to pcre[16|32]_exec() or pcre[16|32]_dfa_exec() in
+memory, called pcre[16|32]_pattern_to_host_byte_order() if necessary, you
+pass its pointer to pcre[16|32]_exec() or pcre[16|32]_dfa_exec() in
the usual way.
If you did not provide custom character tables when the pattern was compiled,
the pointer in the compiled pattern is NULL, which causes the matching
functions to use PCRE's internal tables. Thus, you do not need to take any
@@ -126,9 +131,9 @@ special action at run time in this case.
If you saved study data with the compiled pattern, you need to create your own
-pcre[16|32]_extra data block and set the study_data field to point to the
-reloaded study data. You must also set the PCRE_EXTRA_STUDY_DATA bit in the
-flags field to indicate that study data is present. Then pass the
+pcre[16|32]_extra data block and set the study_data field to point
+to the reloaded study data. You must also set the PCRE_EXTRA_STUDY_DATA bit in
+the flags field to indicate that study data is present. Then pass the
pcre[16|32]_extra block to the matching function in the usual way. If the
pattern was studied for just-in-time optimization, that data cannot be saved,
and so is lost by a save/restore cycle.
@@ -149,9 +154,9 @@ Cambridge CB2 3QH, England.