The main problem is the way the trim() function works.
One cannot reliably trim a substring using this function, because it treats its second argument as a collection of characters, each of which is trimmed from the string. For example, trim("Hello Webbs&nbsp;", "&nbsp;"); will remove the trailing "b" and "s" from the name as well, returning just "Hello We". Therefore, such a task can be reliably done only with a regular expression.
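A regular-expression version of this task could look roughly like the sketch below. The helper name trim_substring() is hypothetical (it is not a PHP built-in), and the sketch assumes the substring should be removed from both ends however many times it repeats.

```php
<?php
// Hypothetical helper, not a PHP built-in: trims whole occurrences of
// $needle from both ends of $subject instead of individual characters.
function trim_substring(string $subject, string $needle): string
{
    if ($needle === '') {
        return $subject; // nothing to trim
    }
    // preg_quote() escapes regex metacharacters and the "~" delimiter.
    $quoted = preg_quote($needle, '~');
    // The D modifier makes "$" match only at the very end of the string.
    $pattern = '~^(?:' . $quoted . ')+|(?:' . $quoted . ')+$~D';
    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace($pattern, '', $subject) ?? $subject;
}

var_dump(trim("Hello Webbs&nbsp;", "&nbsp;"));           // string(8) "Hello We"
var_dump(trim_substring("Hello Webbs&nbsp;", "&nbsp;")); // string(11) "Hello Webbs"
```

Occurrences of the substring in the middle of the string are left alone, because the pattern is anchored to the start and the end only.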
The same goes for multi-byte characters, such as the non-breaking space. Each byte of such a character is trimmed separately, which may corrupt the original string (e.g. trim("· Hello world", "\xC2\xA0"); strips the first byte of the two-byte "·" character). The good news is that there are options to solve this problem:
- with the u modifier, the \s meta character in PHP regex recognizes the non-breaking space as well
- starting from PHP 8.4, there is the mb_trim() function, which is not only multi-byte safe, but also trims the non-breaking space character by default (along with many other whitespace characters).
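The byte-level corruption and the behaviour of \s under the u modifier can be checked with a short script. This is only an illustrative sketch: the variable names are made up, and preg_match('//u', ...) is used here merely as a convenient UTF-8 validity check.

```php
<?php
$original = "\xC2\xB7 Hello world"; // "· Hello world" encoded as UTF-8

// trim() works on bytes: \xC2 is in the list, so the first byte of "·" goes.
$corrupted = trim($original, "\xC2\xA0");
var_dump(bin2hex(substr($corrupted, 0, 1))); // string(2) "b7"
var_dump(preg_match('//u', $corrupted));     // bool(false): invalid UTF-8

// With the u modifier, \s treats U+00A0 as whitespace and leaves "·" alone.
$nbsp  = "\xC2\xA0";
$clean = preg_replace('~^\s+|\s+$~u', '', $nbsp . $original . $nbsp);
var_dump($clean === $original); // bool(true)
```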
Below you will find recipes for various use cases.
Trimming the literal &nbsp; HTML entity from a string
$before = "&nbsp;abc xyz&nbsp;";
$after = preg_replace('~^&nbsp;|&nbsp;$~', '', $before);
var_dump($before, $after);
Trimming the "non-breaking space" sequence (\xC2\xA0) only:
$before = html_entity_decode("&nbsp;abc xyz&nbsp;");
$after = preg_replace('~^\xC2\xA0|\xC2\xA0$~', '', $before);
Trimming non-breaking space along with other whitespace characters:
PHP < 8.4
$after = preg_replace('~^\s+|\s+$~u', '', $before);
PHP >= 8.4
$after = mb_trim($before);
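If the same code has to run on both sides of that version boundary, one option is to pick the implementation at run time. A minimal sketch, assuming a hypothetical helper name unicode_trim():

```php
<?php
// Hypothetical helper: uses mb_trim() where it exists (PHP >= 8.4 with
// mbstring), otherwise falls back to the regex recipe.
function unicode_trim(string $text): string
{
    if (function_exists('mb_trim')) {
        return mb_trim($text);
    }
    // With the u modifier preg_replace() returns null on invalid UTF-8;
    // in that case this sketch returns the input unchanged.
    return preg_replace('~^\s+|\s+$~Du', '', $text) ?? $text;
}

var_dump(unicode_trim("\xC2\xA0 abc xyz\t\xC2\xA0\n")); // string(7) "abc xyz"
```

The two branches are not guaranteed to trim exactly the same set of characters, so it is worth checking any unusual ones your data may contain.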
Trimming both the "non-breaking space" sequence and the &nbsp; HTML entity
Here, you have to add them both to the regex.
It must be understood that there is no way to reliably chain two trimming functions (e.g. trim(preg_replace(...))), because trimming must be done strictly in one pass: neither function can see the characters that only become leading or trailing after the other one has run. Hence preg_replace('~^(&nbsp;)*(.*?)(&nbsp;)*$~', '$2', trim("&nbsp; abc")); will leave the leading space intact. Therefore, if you need to remove both substrings and multi-byte characters, a regex is still the only option.
$before = html_entity_decode("&nbsp;")."&nbsp;abc xyz&nbsp;";
$pattern = '~^(\xC2\xA0|&nbsp;)*(.*?)(\xC2\xA0|&nbsp;)*$~';
$after = preg_replace($pattern, '$2', $before);
var_dump($before, $after);
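The difference between chaining and a single pass can be seen on a string that starts with the entity followed by a plain space. This is a sketch only; it uses the alternation form of the pattern with non-capturing groups rather than the capture-and-replace form of the recipe.

```php
<?php
$input = "&nbsp; abc";

// Chained: trim() runs first and sees "&" at the start, so it removes
// nothing; the regex then strips the entity and exposes the space.
$chained = preg_replace('~^(?:&nbsp;)+|(?:&nbsp;)+$~D', '', trim($input));
var_dump($chained); // string(4) " abc"

// One pass: the entity and whitespace are alternatives in the same group,
// so they are consumed together no matter how they are interleaved.
$onePass = preg_replace('~^(?:&nbsp;|\s)+|(?:&nbsp;|\s)+$~D', '', $input);
var_dump($onePass); // string(3) "abc"
```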
Trimming the "non-breaking space" sequence, the &nbsp; HTML entity and regular whitespace characters
$before = " ".html_entity_decode("&nbsp;")."&nbsp;abc \n ";
$pattern = '~^(\xC2\xA0|&nbsp;| |\r|\n|\t)*(.*?)(\xC2\xA0|&nbsp;| |\r|\n|\t)*$~';
$after = preg_replace($pattern, '$2', $before);
var_dump($before, $after);
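A more compact variant is possible if the u modifier is acceptable: \s then covers the non-breaking space together with the regular whitespace characters, so only the entity has to be listed explicitly. The sketch below assumes valid UTF-8 input and uses a hypothetical function name.

```php
<?php
// Hypothetical helper: trims the literal "&nbsp;" entity plus any Unicode
// whitespace (including U+00A0) from both ends in a single pass.
function trim_entity_and_space(string $text): string
{
    $pattern = '~^(?:&nbsp;|\s)+|(?:&nbsp;|\s)+$~Du';
    // null means the input was not valid UTF-8; return it unchanged then.
    return preg_replace($pattern, '', $text) ?? $text;
}

$before = " \xC2\xA0&nbsp;abc \n "; // space, U+00A0, entity, text, whitespace
var_dump(trim_entity_and_space($before)); // string(3) "abc"
```

Non-capturing groups are used because nothing is read back from them; the replacement is simply an empty string.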
(?:...) would seem more appropriate.
$after = preg_replace('~^\s+|\s+$/~u', '$2', $before); Is '$2' meant to be ''?
| |\r|\n|\t with |\s.