Posted by Edd Mann on Dec 22, 2016

Managing Newlines and Unicode within JavaScript and PHP

We were recently sent a tweet about a text-area whose client-side and server-side length validation did not agree. After some detective work we found two issues that could have caused this. In this post I wish to discuss our findings and how we resolved each issue.

Newlines

The first issue we noticed occurred when newlines were present in the text-area input. On the client side each newline was represented by a single character, whereas on the server two were used. Reading a section of the HTML5 specification confirmed our suspicions.

The API value… is normalized so that line breaks use “LF” (U+000A) characters. Finally, there is the form submission value. It is normalized so that line breaks use U+000D CARRIAGE RETURN “CRLF” (U+000A) character pairs… https://www.w3.org/TR/html5/forms.html#the-textarea-element
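To make the difference concrete, here is a minimal sketch of how the two values diverge in length. The helper name toSubmissionValue is hypothetical, and it only approximates the browser's behaviour by converting every line break to CRLF (it ignores any extra line breaks added by hard wrapping).

```javascript
// Hypothetical helper: approximates the form submission value by
// converting every line break (CRLF, CR or LF) to a CRLF pair.
function toSubmissionValue(apiValue) {
  return apiValue.replace(/\r\n|\r|\n/g, '\r\n');
}

// What the browser would report through textarea.value (LF only).
const apiValue = 'line one\nline two';

// What the server would roughly receive after form submission.
const submitted = toSubmissionValue(apiValue);

console.log(apiValue.length);  // 17
console.log(submitted.length); // 18
```

Each line break in the input therefore adds one extra character to the length seen by the server.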

Unicode Characters

The second issue we observed occurred when the text-area contained a character outside the Basic Multilingual Plane (BMP). There are a couple of very interesting articles discussing Unicode support within JavaScript that I highly recommend reading. In short, JavaScript represents strings as a sequence of unsigned 16-bit integers (UTF-16 code units). This means that any character that does not fit within 16 bits is stored as a surrogate pair, and so contributes two to .length.

'a'.length // 1
'🍕'.length // 2!
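As a small illustration of where the two comes from, the sketch below inspects the individual code units of the pizza emoji (U+1F355) alongside its single code point. The variable name is only for illustration.

```javascript
const pizza = '🍕'; // U+1F355, outside the BMP

// .length counts UTF-16 code units, so the surrogate pair counts twice.
console.log(pizza.length); // 2

// The two halves of the surrogate pair.
console.log(pizza.charCodeAt(0).toString(16)); // d83c (high surrogate)
console.log(pizza.charCodeAt(1).toString(16)); // df55 (low surrogate)

// codePointAt (ES6) combines the pair back into one code point.
console.log(pizza.codePointAt(0).toString(16)); // 1f355
```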

This can get even more confusing when you then send that string to the server. The general consensus is to use the UTF-8 character encoding during transmission (default as of HTML5), in which a character is represented by between one and four bytes. So, for the examples above, a naive strlen, which counts bytes, returns the following.

<?php
strlen('a'); // 1
strlen('🍕'); // 4!

The beauty of UTF-8 is that every ASCII character is encoded as the same single byte it has in ASCII.
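The same byte counts can be reproduced on the client. The following is a rough sketch, assuming an environment that provides the standard TextEncoder (modern browsers and recent Node.js versions do); the helper name utf8ByteLength is hypothetical.

```javascript
// Hypothetical helper: number of bytes the string occupies once
// encoded as UTF-8. TextEncoder always encodes to UTF-8.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

console.log(utf8ByteLength('a'));  // 1 (ASCII, one byte)
console.log(utf8ByteLength('é'));  // 2
console.log(utf8ByteLength('€'));  // 3
console.log(utf8ByteLength('🍕')); // 4 (outside the BMP)
```

This mirrors what a byte-counting function such as strlen sees on the server, which is why it disagrees with .length in the browser.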

Newline Solution

With these two problems in mind, we set out to tackle them. For the newline issue, we decided that a newline would count as a single character towards the supplied limit. As the browser already normalizes the client-side value to LF for us, as described by the HTML5 standard, all we are required to do is strip the carriage return from each CRLF pair on the server side.

<?php
$message = "\r\n"; // example value: one line break as submitted by the browser
strlen(str_replace("\r", '', $message)); // 1

Unicode Solution

That was relatively easy to resolve: newline counts on the client and server now agreed. Next we needed to address the Unicode issue. Fortunately, we had already resolved this on the server side thanks to the PHP mb_* functions. As we had specified UTF-8 as our default encoding, we did not have to pass the desired encoding on each invocation.

<?php
mb_strlen('🍕'); // 1

Instead of naively counting bytes as strlen does, mb_strlen takes the encoding into consideration and counts each multi-byte character once.

On the client side we were still faced with the issue of characters represented by a surrogate pair. To handle this case when calculating the length, we took inspiration from this article. We decided to replace any character that lay outside the BMP with a single character inside it (in the case below a simple ‘_’). An alternative approach, if you are using ES6, is Array.from, which iterates over the string by code point rather than by 16-bit code unit.

'🍕'.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, '_').length // 1
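For comparison, here is a minimal sketch of the Array.from alternative, together with a helper that applies both fixes at once. The function names codePointLength and messageLength are hypothetical and are not taken from our code base.

```javascript
// Counts Unicode code points rather than UTF-16 code units.
// Array.from uses the string iterator, which yields one item per
// code point, so a surrogate pair is counted once.
function codePointLength(text) {
  return Array.from(text).length;
}

// Hypothetical helper combining both fixes: every line break and
// every code point counts as a single character.
function messageLength(text) {
  const normalized = text.replace(/\r\n?/g, '\n'); // CRLF or lone CR -> LF
  return codePointLength(normalized);
}

console.log(codePointLength('🍕'));     // 1
console.log('🍕\r\n🍕'.length);         // 6
console.log(messageLength('🍕\r\n🍕')); // 3
```

Note that this counts code points, not what a reader perceives as characters, so a sequence built from several code points (for example a letter followed by a combining accent) still counts as more than one.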
