Security Advisories (26)
CVE-2020-12723 (2020-06-05)

regcomp.c in Perl before 5.30.3 allows a buffer overflow via a crafted regular expression because of recursive S_study_chunk calls.

CVE-2026-8376 (2026-05-25)

Perl versions through 5.43.10 have a heap buffer overflow when compiling regular expressions with a repeated fixed string on 32-bit builds. Perl_study_chunk in regcomp_study.c checked the size of the joined substring buffer in characters rather than bytes. For a quantified fixed substring with a large minimum count, the byte length mincount * l could overflow SSize_t, producing an undersized SvGROW allocation; the subsequent copy writes past the end of the buffer. A caller that compiles an attacker-controlled regular expression on a 32-bit perl build triggers a heap buffer overflow at compile time.

CVE-2016-1238 (2016-08-02)

(1) cpan/Archive-Tar/bin/ptar, (2) cpan/Archive-Tar/bin/ptardiff, (3) cpan/Archive-Tar/bin/ptargrep, (4) cpan/CPAN/scripts/cpan, (5) cpan/Digest-SHA/shasum, (6) cpan/Encode/bin/enc2xs, (7) cpan/Encode/bin/encguess, (8) cpan/Encode/bin/piconv, (9) cpan/Encode/bin/ucmlint, (10) cpan/Encode/bin/unidump, (11) cpan/ExtUtils-MakeMaker/bin/instmodsh, (12) cpan/IO-Compress/bin/zipdetails, (13) cpan/JSON-PP/bin/json_pp, (14) cpan/Test-Harness/bin/prove, (15) dist/ExtUtils-ParseXS/lib/ExtUtils/xsubpp, (16) dist/Module-CoreList/corelist, (17) ext/Pod-Html/bin/pod2html, (18) utils/c2ph.PL, (19) utils/h2ph.PL, (20) utils/h2xs.PL, (21) utils/libnetcfg.PL, (22) utils/perlbug.PL, (23) utils/perldoc.PL, (24) utils/perlivp.PL, and (25) utils/splain.PL in Perl 5.x before 5.22.3-RC2 and 5.24 before 5.24.1-RC2 do not properly remove . (period) characters from the end of the includes directory array, which might allow local users to gain privileges via a Trojan horse module under the current working directory.
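
The mitigation this advisory describes can be sketched in a few lines. This is a minimal illustration, assuming a script that wants to stop the current working directory from being searched for modules; it is not a copy of the patched utilities.

```perl
use strict;
use warnings;

# Drop a trailing "." (the current working directory) from the module
# search path at compile time, before later modules are loaded.
BEGIN { pop @INC if @INC && $INC[-1] eq '.' }

# Show what the search path now ends with.
print "last \@INC entry: ", (@INC ? $INC[-1] : '(none)'), "\n";
```

The check is limited to the last element because that is the position named in the advisory; a "." elsewhere in @INC is left alone by this sketch.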

CVE-2018-6913 (2018-04-17)

Heap-based buffer overflow in the pack function in Perl before 5.26.2 allows context-dependent attackers to execute arbitrary code via a large item count.

CVE-2018-18313 (2018-12-07)

Perl before 5.26.3 has a buffer over-read via a crafted regular expression that triggers disclosure of sensitive information from process memory.

CVE-2015-8853 (2016-05-25)

The (1) S_reghop3, (2) S_reghop4, and (3) S_reghopmaybe3 functions in regexec.c in Perl before 5.24.0 allow context-dependent attackers to cause a denial of service (infinite loop) via crafted utf-8 data, as demonstrated by "a\x80."

CVE-2012-5195 (2012-12-18)

Heap-based buffer overflow in the Perl_repeatcpy function in util.c in Perl 5.12.x before 5.12.5, 5.14.x before 5.14.3, and 5.15.x before 5.15.5 allows context-dependent attackers to cause a denial of service (memory consumption and crash) or possibly execute arbitrary code via the 'x' string repeat operator.

CVE-2011-1487 (2011-04-11)

The (1) lc, (2) lcfirst, (3) uc, and (4) ucfirst functions in Perl 5.10.x, 5.11.x, and 5.12.x through 5.12.3, and 5.13.x through 5.13.11, do not apply the taint attribute to the return value upon processing tainted input, which might allow context-dependent attackers to bypass the taint protection mechanism via a crafted string.

CVE-2023-47039 (2023-10-30)

Perl for Windows relies on the system path environment variable to find the shell (cmd.exe). When running an executable which uses Windows Perl interpreter, Perl attempts to find and execute cmd.exe within the operating system. However, due to path search order issues, Perl initially looks for cmd.exe in the current working directory. An attacker with limited privileges can exploit this behavior by placing cmd.exe in locations with weak permissions, such as C:\ProgramData. By doing so, when an administrator attempts to use this executable from these compromised locations, arbitrary code can be executed.

CVE-2023-47100

In Perl before 5.38.2, S_parse_uniprop_string in regcomp.c can write to unallocated space because a property name associated with a \p{...} regular expression construct is mishandled. The earliest affected version is 5.30.0.

CVE-2015-8608 (2017-02-07)

The VDir::MapPathA and VDir::MapPathW functions in Perl 5.22 allow remote attackers to cause a denial of service (out-of-bounds read) and possibly execute arbitrary code via a crafted (1) drive letter or (2) pInName argument.

CVE-2011-2728 (2012-12-21)

The bsd_glob function in the File::Glob module for Perl before 5.14.2 allows context-dependent attackers to cause a denial of service (crash) via a glob expression with the GLOB_ALTDIRFUNC flag, which triggers an uninitialized pointer dereference.

CVE-2020-10878 (2020-06-05)

Perl before 5.30.3 has an integer overflow related to mishandling of a "PL_regkind[OP(n)] == NOTHING" situation. A crafted regular expression could lead to malformed bytecode with a possibility of instruction injection.

CVE-2020-10543 (2020-06-05)

Perl before 5.30.3 on 32-bit platforms allows a heap-based buffer overflow because nested regular expression quantifiers have an integer overflow.

CVE-2018-18314 (2018-12-07)

Perl before 5.26.3 has a buffer overflow via a crafted regular expression that triggers invalid write operations.

CVE-2018-18312 (2018-12-05)

Perl before 5.26.3 and 5.28.0 before 5.28.1 has a buffer overflow via a crafted regular expression that triggers invalid write operations.

CVE-2018-18311 (2018-12-07)

Perl before 5.26.3 and 5.28.x before 5.28.1 has a buffer overflow via a crafted regular expression that triggers invalid write operations.

CVE-2013-1667 (2013-03-14)

The rehash mechanism in Perl 5.8.2 through 5.16.x allows context-dependent attackers to cause a denial of service (memory consumption and crash) via a crafted hash key.

CVE-2010-4777 (2014-02-10)

The Perl_reg_numbered_buff_fetch function in Perl 5.10.0, 5.12.0, 5.14.0, and other versions, when running with debugging enabled, allows context-dependent attackers to cause a denial of service (assertion failure and application exit) via crafted input that is not properly handled when using certain regular expressions, as demonstrated by causing SpamAssassin and OCSInventory to crash.

CVE-2010-1158 (2010-04-20)

Integer overflow in the regular expression engine in Perl 5.8.x allows context-dependent attackers to cause a denial of service (stack consumption and application crash) by matching a crafted regular expression against a long string.

CVE-2009-3626 (2009-10-29)

Perl 5.10.1 allows context-dependent attackers to cause a denial of service (application crash) via a UTF-8 character with a large, invalid codepoint, which is not properly handled during a regular-expression match.

CVE-2008-1927 (2008-04-24)

Double free vulnerability in Perl 5.8.8 allows context-dependent attackers to cause a denial of service (memory corruption and crash) via a crafted regular expression containing UTF8 characters. NOTE: this issue might only be present on certain operating systems.

CVE-2005-3962 (2005-12-01)

Integer overflow in the format string functionality (Perl_sv_vcatpvfn) in Perl 5.9.2 and 5.8.6 allows attackers to overwrite arbitrary memory and possibly execute arbitrary code via format string specifiers with large values, which causes an integer wrap and leads to a buffer overflow, as demonstrated using format string vulnerabilities in Perl applications.

CVE-2007-5116 (2007-11-07)

Buffer overflow in the polymorphic opcode support in the Regular Expression Engine (regcomp.c) in Perl 5.8 allows context-dependent attackers to execute arbitrary code by switching from byte to Unicode (UTF) characters in a regular expression.

CVE-2016-2381 (2016-04-08)

Perl might allow context-dependent attackers to bypass the taint protection mechanism in a child process via duplicate environment variables in envp.

CVE-2013-7422 (2015-08-16)

Integer underflow in regcomp.c in Perl before 5.20, as used in Apple OS X before 10.10.5 and other products, allows context-dependent attackers to execute arbitrary code or cause a denial of service (application crash) via a long digit string associated with an invalid backreference within a regular expression.

NAME

perlunicode - Unicode support in Perl (EXPERIMENTAL, subject to change)

DESCRIPTION

Important Caveat

WARNING: As of the 5.6.1 release, the implementation of Unicode support in Perl is incomplete, and continues to be highly experimental.

The following areas need further work. They are being rapidly addressed in the 5.7.x development branch.

Input and Output Disciplines

There is currently no easy way to mark data read from a file or other external source as being utf8. This will be one of the major areas of focus in the near future.

Regular Expressions

The existing regular expression compiler does not produce polymorphic opcodes. This means that the determination on whether to match Unicode characters is made when the pattern is compiled, based on whether the pattern contains Unicode characters, and not when the matching happens at run time. This needs to be changed to adaptively match Unicode if the string to be matched is Unicode.

use utf8 still needed to enable a few features

The utf8 pragma implements the tables used for Unicode support. These tables are automatically loaded on demand, so the utf8 pragma need not normally be used.

However, as a compatibility measure, this pragma must be explicitly used to enable recognition of UTF-8 encoded literals and identifiers in the source text.
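
The effect of the pragma on source literals can be seen by measuring the same literal with and without it. This is a small sketch, assuming the file is saved as UTF-8 and is run on a perl where use utf8 is lexically scoped; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8, so the literal below is the
# two bytes 0xC3 0xA9 on disk.

# No utf8 pragma in scope: the parser reads two one-byte characters.
my $without = length("é");

my $with;
{
    use utf8;    # lexically scoped: applies to the rest of this block
    # With the pragma, the same two bytes are read as one character.
    $with = length("é");
}

print "without utf8: $without\n";
print "with utf8:    $with\n";
```

Only the parsing of the literal differs between the two measurements; the bytes in the file are the same.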

Byte and Character semantics

Beginning with version 5.6, Perl uses logically wide characters to represent strings internally. This internal representation of strings uses the UTF-8 encoding.

In future, Perl-level operations can be expected to work with characters rather than bytes, in general.

However, as strictly an interim compatibility measure, Perl v5.6 aims to provide a safe migration path from byte semantics to character semantics for programs. For operations where Perl can unambiguously decide that the input data is characters, Perl now switches to character semantics. For operations where this determination cannot be made without additional information from the user, Perl decides in favor of compatibility, and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl, which allowed byte semantics in Perl operations, but only as long as none of the program's inputs are marked as being a source of Unicode character data. Such data may come from filehandles, from calls to external programs, from information provided by the system (such as %ENV), or from literals and constants in the source text.

If the -C command line switch is used (or the ${^WIDE_SYSTEM_CALLS} global flag is set to 1), all system calls will use the corresponding wide character APIs. This is currently only implemented on Windows.

Regardless of the above, the bytes pragma can always be used to force byte semantics in a particular lexical scope. See bytes.
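
The lexical scoping of the bytes pragma can be illustrated as follows. This is a sketch only; the variable names are made up for the example, and it relies on length() reporting the size of the internal encoding under use bytes.

```perl
use strict;
use warnings;

# chr(0x263A) is a single character (WHITE SMILING FACE).
my $smiley = chr(0x263A);

my $char_len = length($smiley);      # character semantics: 1

my $byte_len;
{
    use bytes;                       # byte semantics, this block only
    $byte_len = length($smiley);     # bytes of the internal encoding: 3
}

my $after = length($smiley);         # character semantics again: 1

print "$char_len $byte_len $after\n";
```

Once the block closes, the same call on the same string counts characters again, which is the sense in which the pragma is confined to a lexical scope.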

The utf8 pragma is primarily a compatibility device that enables recognition of UTF-8 in literals encountered by the parser. It may also be used for enabling some of the more experimental Unicode support features. Note that this pragma is only required until a future version of Perl in which character semantics will become the default. This pragma may then become a no-op. See utf8.

Unless mentioned otherwise, Perl operators will use character semantics when they are dealing with Unicode data, and byte semantics otherwise. Thus, character semantics for these operations apply transparently; if the input data came from a Unicode source (for example, by adding a character encoding discipline to the filehandle whence it came, or a literal UTF-8 string constant in the program), character semantics apply; otherwise, byte semantics are in effect. To force byte semantics on Unicode data, the bytes pragma should be used.

Under character semantics, many operations that formerly operated on bytes change to operating on characters. For ASCII data this makes no difference, because UTF-8 stores ASCII in single bytes, but for any character greater than chr(127), the character may be stored in a sequence of two or more bytes, all of which have the high bit set. But by and large, the user need not worry about this, because Perl hides it from the user. A character in Perl is logically just a number ranging from 0 to 2**32 or so. Larger characters encode to longer sequences of bytes internally, but again, this is just an internal detail which is hidden at the Perl level.
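
To make the byte sequences described above visible, a character can be encoded explicitly and its bytes inspected. This is a rough sketch, assuming a perl that provides the built-in utf8::encode function; utf8_bytes is a hypothetical helper written for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-8 bytes of a string as integers.
sub utf8_bytes {
    my ($char) = @_;
    my $copy = $char;
    utf8::encode($copy);             # in place: characters -> UTF-8 bytes
    return unpack("C*", $copy);
}

my @ascii  = utf8_bytes("A");            # below chr(128): one byte
my @latin  = utf8_bytes(chr(0xE9));      # e with acute: two bytes
my @smiley = utf8_bytes(chr(0x263A));    # WHITE SMILING FACE: three bytes

for my $row (["A", \@ascii], ["e-acute", \@latin], ["smiley", \@smiley]) {
    my ($label, $bytes) = @$row;
    printf "%-8s %s\n", $label, join(" ", map { sprintf "%02X", $_ } @$bytes);
}
```

Every byte printed for the two non-ASCII characters is 0x80 or above, that is, it has the high bit set, while the ASCII letter remains the single byte 0x41.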

Effects of character semantics

Character semantics have the following effects:

  • Strings and patterns may contain characters that have an ordinal value larger than 255.

    Presuming you use a Unicode editor to edit your program, such characters will typically occur directly within the literal strings as UTF-8 characters, but you can also specify a particular character with an extension of the \x notation. UTF-8 characters are specified by putting the hexadecimal code within curlies after the \x. For instance, a Unicode smiley face is \x{263A}.

  • Identifiers within the Perl script may contain Unicode alphanumeric characters, including ideographs. (You are currently on your own when it comes to using the canonical forms of characters--Perl doesn't (yet) attempt to canonicalize variable names for you.)

  • Regular expressions match characters instead of bytes. For instance, "." matches a character instead of a byte. (However, the \C pattern is provided to force a match of a single byte ("char" in C, hence \C).)

  • Character classes in regular expressions match characters instead of bytes, and match against the character properties specified in the Unicode properties database. So \w can be used to match an ideograph, for instance.

  • Named Unicode properties and block ranges may be used as character classes via the new \p{} (matches property) and \P{} (doesn't match property) constructs. For instance, \p{Lu} matches any character with the Unicode uppercase property, while \p{M} matches any mark character. Single-letter properties may omit the braces, so the latter can also be written \pM. Many predefined character classes are available, such as \p{IsMirrored} and \p{InTibetan}.

  • The special pattern \X matches any extended Unicode sequence (a "combining character sequence" in Standardese), where the first character is a base character and subsequent characters are mark characters that apply to the base character. It is equivalent to (?:\PM\pM*).

  • The tr/// operator translates characters instead of bytes. Note that the tr///CU functionality has been removed, as the interface was a mistake. For similar functionality see pack('U0', ...) and pack('C0', ...).

  • Case translation operators use the Unicode case translation tables when provided character input. Note that uc() translates to uppercase, while ucfirst() translates to titlecase (for languages that make the distinction). Naturally the corresponding backslash sequences have the same semantics.

  • Most operators that deal with positions or lengths in the string will automatically switch to using character positions, including chop(), substr(), pos(), index(), rindex(), sprintf(), write(), and length(). Operators that specifically don't switch include vec(), pack(), and unpack(). Operators that really don't care include chomp(), as well as any other operator that treats a string as a bucket of bits, such as sort(), and the operators dealing with filenames.

  • The pack()/unpack() letters "c" and "C" do not change, since they're often used for byte-oriented formats. (Again, think "char" in the C language.) However, there is a new "U" specifier that will convert between UTF-8 characters and integers. (It works outside of the utf8 pragma too.)

  • The chr() and ord() functions work on characters. This is like pack("U") and unpack("U"), not like pack("C") and unpack("C"). In fact, the latter are how you now emulate byte-oriented chr() and ord() under utf8.

  • The bit string operators & | ^ ~ can operate on character data. However, for backward compatibility reasons (bit string operations when the characters all are less than 256 in ordinal value), one cannot apply ~ (the bit complement) to data that mixes characters less than 256 with characters equal to or greater than 256. Most importantly, De Morgan's laws (~($x|$y) eq ~$x&~$y, ~($x&$y) eq ~$x|~$y) won't hold. Another way to look at this is that the complement cannot return both the 8-bit (byte) wide bit complement and the full character wide bit complement.

  • And finally, scalar reverse() reverses by character rather than by byte.
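
Several of the effects listed above can be checked with a short script. This is a sketch rather than a test suite: it assumes a perl recent enough to support \p{}, \X and pack("U") as described, and the %ok hash and its keys are names invented for the example.

```perl
use strict;
use warnings;

# Strings are built from code points, so the example does not depend on
# how this source file is encoded.
my $smiley = "\x{263A}";              # one character, ordinal above 255
my $text   = "na\x{EF}ve $smiley";    # 7 characters, smiley at offset 6

my %ok;

# "." matches one whole character, not one byte of its encoding.
$ok{dot} = $smiley =~ /\A.\z/ ? 1 : 0;

# \p{Lu} matches an uppercase letter, \P{Lu} anything else.
# U+0394 is GREEK CAPITAL LETTER DELTA, U+03B4 the small letter.
$ok{property} = ("\x{394}" =~ /\A\p{Lu}\z/ && "\x{3B4}" =~ /\A\P{Lu}\z/) ? 1 : 0;

# \X consumes a base character together with its mark characters:
# "e" + U+0301 COMBINING ACUTE ACCENT is one sequence, "x" is another.
my @sequences = "e\x{301}x" =~ /\X/g;
$ok{sequences} = @sequences == 2 ? 1 : 0;

# uc() gives uppercase, ucfirst() gives titlecase. For U+01C6 these
# differ: U+01C4 is the uppercase form, U+01C5 the titlecase form.
$ok{case} = (uc("\x{1C6}") eq "\x{1C4}" && ucfirst("\x{1C6}") eq "\x{1C5}") ? 1 : 0;

# Positions and lengths are counted in characters.
$ok{length} = length($text) == 7 ? 1 : 0;
$ok{index}  = index($text, $smiley) == 6 ? 1 : 0;
$ok{substr} = substr($text, 6, 1) eq $smiley ? 1 : 0;

# ord() and the pack "U" specifier convert between characters and integers.
$ok{ord}  = ord($smiley) == 0x263A ? 1 : 0;
$ok{pack} = (pack("U", 0x263A) eq $smiley && unpack("U", $smiley) == 0x263A) ? 1 : 0;

# Scalar reverse() reverses by character, leaving each character intact.
$ok{reverse} = scalar(reverse("a$smiley")) eq "${smiley}a" ? 1 : 0;

printf "%-10s %s\n", $_, ($ok{$_} ? "ok" : "not ok") for sort keys %ok;
```

The script prints only ASCII labels, so it avoids the question of how wide characters should be written to the terminal.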

Character encodings for input and output

[XXX: This feature is not yet implemented.]

CAVEATS

As of yet, there is no method for automatically coercing input and output to some encoding other than UTF-8. This is planned in the near future, however.

Whether an arbitrary piece of data will be treated as "characters" or "bytes" by internal operations cannot be divined at the current time.

Use of locales with utf8 may lead to odd results. Currently there is some attempt to apply 8-bit locale info to characters in the range 0..255, but this is demonstrably incorrect for locales that use characters above that range (when mapped into Unicode). It will also tend to run slower. Avoidance of locales is strongly encouraged.

SEE ALSO

bytes, utf8, "${^WIDE_SYSTEM_CALLS}" in perlvar