Regarding the optional 3rd `strict` argument to mb_detect_encoding, the documentation states: Controls the behaviour when string is not valid in any of the listed encodings. If strict is set to false, the closest matching encoding will be returned; if strict is set to true, false will be returned. (Ref: https://www.php.net/manual/en/function.mb-detect-encoding.php) Because of bugs in the implementation, mb_detect_encoding did not always behave according to this description when `strict` was false. For example: <?php echo var_export(mb_detect_encoding("\xc0\x00", "UTF-8", false)); // Before this commit, prints: false // After this commit, prints: 'UTF-8' Because `strict` is false in the above example, mb_detect_encoding should return the 'closest matching encoding', which is UTF-8, since that is the only candidate encoding. (Incidentally, this example shows that using mb_detect_encoding with a single candidate encoding in non-strict mode is useless.) The new implementation fixes this bug. It also fixes another problem with the old implementation as regards non-strict detection mode: The old implementation would stop processing of the input string using a particular candidate encoding as soon as it saw an error in that encoding, even in non-strict mode. This means that it could not really detect the 'closest matching encoding'; rather, what it would return in non-strict mode was 'the encoding in which the first decoding error is furthest from the beginning of the input string'. In non-strict mode, the new implementation continues trying to process the input string to its end even after seeing an error. This makes it possible to determine in which candidate encoding the string has the smallest number of errors, i.e. the 'closest matching encoding'. Rejecting candidate encodings as soon as it saw an error gave the old implementation a marked performance advantage in non-strict mode; however, the new implementation still beats it in most cases. Here are a few sample microbenchmark results: UTF-8, ~100 codepoints, strict mode Old: 0.080s (100,000 calls) New: 0.026s (" " ) UTF-8, ~100 codepoints, non-strict mode Old: 0.079s (100,000 calls) New: 0.033s (" " ) UTF-8, ~10000 codepoints, strict mode Old: 6.708s (60,000 calls) New: 1.383s (" " ) UTF-8, ~10000 codepoints, non-strict mode Old: 6.705s (60,000 calls) New: 3.044s (" " ) Notice that the old implementation had almost identical performance between strict and non-strict mode, while the new suffers a significant performance penalty for non-strict detection. This is the cost of implementing the behavior specified in the documentation. A couple more sample results: SJIS, ~10000 codepoints, strict mode Old: 4.563s New: 1.084s SJIS, ~10000 codepoints, non-strict mode Old: 4.569s New: 2.863s This is the only case I found where the new implementation loses: UTF-16LE, ~10000 codepoints, non-strict mode Old: 1.514s New: 2.813s The reason is because the test strings happened to be invalid right from the first few bytes for all the candidate encodings except for UTF-16LE; so the old implementation would immediately reject all those encodings and only process the entire string in UTF-16LE. I believe mb_detect_encoding could be made much faster if we identified good criteria for when to reject candidate encodings before reaching the end of the input string.
28 lines
780 B
PHP
28 lines
780 B
PHP
--TEST--
|
|
Bug #49536 (mb_detect_encoding() returns incorrect results when strict_mode is turned on)
|
|
--EXTENSIONS--
|
|
mbstring
|
|
--FILE--
|
|
<?php
|
|
// non-strict mode
|
|
var_dump(mb_detect_encoding("A\x81", "SJIS", false));
|
|
// strict mode
|
|
var_dump(mb_detect_encoding("A\x81", "SJIS", true));
|
|
// non-strict mode
|
|
var_dump(mb_detect_encoding("\xc0\x00", "UTF-8", false));
|
|
// strict mode
|
|
var_dump(mb_detect_encoding("\xc0\x00", "UTF-8", true));
|
|
|
|
// Strict mode with multiple candidate encodings
|
|
// This input string is invalid in ALL the candidate encodings:
|
|
echo "== INVALID STRING - UTF-8 and SJIS ==\n";
|
|
var_dump(mb_detect_encoding("\xFF\xFF", ['SJIS', 'UTF-8'], true));
|
|
?>
|
|
--EXPECT--
|
|
string(4) "SJIS"
|
|
bool(false)
|
|
string(5) "UTF-8"
|
|
bool(false)
|
|
== INVALID STRING - UTF-8 and SJIS ==
|
|
bool(false)
|