1. Overview

Sometimes we need to convert a string that contains Unicode escape sequences into readable text. This typically comes up when data from external sources arrives with its characters escaped rather than written out.

In this article, we’ll explore how to decode such a string in Java.

2. Understanding Unicode Encoding

Unicode is a universal character standard that assigns a unique number (code point) to every character, regardless of platform or program. In Java, a character can be written as an escape sequence of the form “\uXXXX”, where XXXX is four hexadecimal digits giving a UTF-16 code unit. For characters in the Basic Multilingual Plane, this value equals the code point; characters beyond U+FFFF need two such escapes (a surrogate pair).

For example, the string “\u0048\u0065\u006C\u006C\u006F World” is encoded with Unicode escape sequences and represents the phrase “Hello World”.
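One detail worth illustrating: the Java compiler resolves Unicode escapes in source code before the program runs, so decoding is only needed when the backslashes arrive as literal text at runtime. The sketch below contrasts the two cases; the class and method names are hypothetical and chosen only for this illustration.

```java
class UnicodeEscapeLiteralDemo {

    // The compiler translates these escapes, so the String already holds "Hello World".
    static String compiledText() {
        return "\u0048\u0065\u006C\u006C\u006F World";
    }

    // Doubling the backslash keeps each escape as six literal characters,
    // which is how escaped text usually looks when read from a file or network.
    static String rawText() {
        return "\\u0048\\u0065\\u006C\\u006C\\u006F World";
    }

    public static void main(String[] args) {
        System.out.println(compiledText());
        System.out.println(rawText());
        System.out.println(rawText().length());
    }
}
```

The second form is the kind of input that the decoding approaches in the following sections operate on.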

3. Using Apache Commons Text

The Apache Commons Text library provides the StringEscapeUtils utility class, whose unescapeJava() method decodes Unicode escape sequences (along with other Java escapes such as \n and \t) in a string:

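import org.apache.commons.text.StringEscapeUtils;

// The doubled backslashes keep the escapes as literal text in the string
String encodedString = "\\u0048\\u0065\\u006C\\u006C\\u006F World";
String expectedDecodedString = "Hello World";
assertEquals(expectedDecodedString, StringEscapeUtils.unescapeJava(encodedString));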

4. Using Plain Java

Alternatively, we can use the Pattern and Matcher classes from the java.util.regex package to find every Unicode escape sequence in the input string and replace each one with the character it represents:

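import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String decodeWithPlainJava(String input) {
    // Matches a backslash, the letter u, and exactly four hexadecimal digits
    Pattern pattern = Pattern.compile("\\\\u[0-9a-fA-F]{4}");
    Matcher matcher = pattern.matcher(input);

    // appendReplacement(StringBuilder, ...) requires Java 9+; use StringBuffer on Java 8
    StringBuilder decodedString = new StringBuilder();

    while (matcher.find()) {
        String unicodeSequence = matcher.group();
        // Skip the two-character prefix and parse the hex digits as a UTF-16 code unit
        char unicodeChar = (char) Integer.parseInt(unicodeSequence.substring(2), 16);
        // Quote the replacement so a decoded '$' or backslash is inserted literally
        matcher.appendReplacement(decodedString, Matcher.quoteReplacement(Character.toString(unicodeChar)));
    }

    // Copy whatever follows the last match
    matcher.appendTail(decodedString);
    return decodedString.toString();
}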

The regular expression can be interpreted as follows:

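  • \\\\u: Match the literal characters “\u”. In the Java string literal, the four backslashes become the regex \\, which matches a single backslash.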
  • [0-9a-fA-F]: Match any valid hexadecimal digit.
  • {4}: Match exactly four hexadecimal digits in a row.
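To see what the pattern accepts and rejects, the following minimal sketch collects every match in an input string. The class name EscapePatternDemo and the method findEscapes are hypothetical helpers introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapePatternDemo {

    // Same regular expression as in the decoding method above
    private static final Pattern UNICODE_ESCAPE = Pattern.compile("\\\\u[0-9a-fA-F]{4}");

    // Returns each escape sequence found in the input, in order of appearance
    static List<String> findEscapes(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = UNICODE_ESCAPE.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Only the first sequence has four valid hexadecimal digits
        System.out.println(findEscapes("Hi \\u0057, \\u00zz, \\u6F"));
    }
}
```

Sequences with non-hexadecimal characters or fewer than four digits are not matched, so a decoder built on this pattern leaves them in the output unchanged.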

For example, let’s decode the following string:

String encodedString = "Hello \\u0057\\u006F\\u0072\\u006C\\u0064";
String expectedDecodedString = "Hello World";
assertEquals(expectedDecodedString, decodeWithPlainJava(encodedString));
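As a possible variation, the same idea can be written more compactly with the function-based Matcher.replaceAll() overload, assuming Java 9 or later. This is a sketch rather than the article's method: the class name CompactUnicodeDecoder is hypothetical, and the capturing group around the hex digits is an addition made for convenience. Wrapping the result in Matcher.quoteReplacement() matters because a decoded “$” or backslash would otherwise be treated as replacement syntax.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompactUnicodeDecoder {

    // Group 1 captures the four hexadecimal digits after the backslash and the letter u
    private static final Pattern UNICODE_ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String decode(String input) {
        return UNICODE_ESCAPE.matcher(input).replaceAll(match -> {
            // Each escape encodes one UTF-16 code unit
            char decoded = (char) Integer.parseInt(match.group(1), 16);
            // Quote so that '$' and backslash are inserted literally
            return Matcher.quoteReplacement(String.valueOf(decoded));
        });
    }

    public static void main(String[] args) {
        System.out.println(decode("Hello \\u0057\\u006F\\u0072\\u006C\\u0064"));
        System.out.println(decode("Cost: \\u00245"));
    }
}
```

Because each escape is decoded as a single UTF-16 code unit, two consecutive escapes that form a surrogate pair come out as two chars that together represent one supplementary character.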

5. Conclusion

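In this tutorial, we’ve explored two ways to convert a string containing Unicode escape sequences into readable text in Java: the unescapeJava() method from Apache Commons Text and a regex-based approach in plain Java.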

The example code from this article can be found over on GitHub.
