Course – LS – All

Get started with Spring and Spring Boot, through the Learn Spring course:

>> CHECK OUT THE COURSE

1. Overview

Emojis appear in a lot of text that we may need to process in our code. For example, this could be when we’re working with email or instant messaging services.

In this tutorial, we’ll look at several ways to detect emojis in Java applications.

2. How Does Java Represent Emojis?

Every emoji has a unique Unicode value which represents it. Java encodes Unicode characters in Strings using UTF-16.

UTF-16 can encode all Unicode code points. Each code point is encoded as either one or two 16-bit code units. If two are needed because the Unicode value is beyond the range we can store in 16 bits, then we call them a surrogate pair.

A surrogate pair is simply two characters (or code units) which when combined represent a single Unicode character (or code point). There is a reserved range of code units for surrogate pairs.

For example, the Skull and Crossbones emoji has the Unicode value “U+2620”, which is stored in a String as “\u2620”. It only requires a single code unit. However, the Bear Face emoji has the Unicode value “U+1F43B”, which is stored in a String as “\uD83D\uDC3B”. It requires two code units because the Unicode value is too high for a single unit.
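To make the difference between code units and code points concrete, here is a minimal sketch using only standard String and Character methods. The class name CodeUnitDemo and its two helper methods are hypothetical names chosen for this illustration:

```java
class CodeUnitDemo {

    // Number of UTF-16 code units (chars) the String occupies
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points the String contains
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String skull = "\u2620";       // U+2620 fits in one code unit
        String bear = "\uD83D\uDC3B";  // U+1F43B needs a surrogate pair

        // prints "1 unit(s), 1 code point(s)"
        System.out.println(codeUnits(skull) + " unit(s), " + codePoints(skull) + " code point(s)");
        // prints "2 unit(s), 1 code point(s)"
        System.out.println(codeUnits(bear) + " unit(s), " + codePoints(bear) + " code point(s)");

        // Character.toChars performs the UTF-16 encoding of a code point
        char[] units = Character.toChars(0x1F43B);
        // prints "D83D DC3B"
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]);
        // prints "true"
        System.out.println(Character.isSurrogatePair(units[0], units[1]));
    }
}
```

Note that length() counts code units, so it reports 2 for the single Bear Face emoji, while codePointCount() reports 1.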

There are extensions to this scheme, which we’ll look at later, but these are the basics.

3. emoji-java Library

An off-the-shelf solution is to use emoji-java. To use this library in our project, we’ll need to add it as a dependency in our pom.xml:

<dependency>
    <groupId>com.vdurmont</groupId>
    <artifactId>emoji-java</artifactId>
    <version>5.1.1</version>
</dependency>

The latest version is available in the Maven Repository.

It’s simple to use this library to check if a String is an emoji. It provides the static isEmoji() method in the EmojiManager utility class.

The method takes a single String argument and returns true if the String is an emoji, or else returns false:

@Test
void givenAWord_whenUsingEmojiJava_thenDetectEmoji() {
    // Bear Face (U+1F43B), stored as a surrogate pair
    boolean emoji = EmojiManager.isEmoji("\uD83D\uDC3B");
    assertTrue(emoji);

    // A plain letter isn't an emoji
    boolean notEmoji = EmojiManager.isEmoji("w");
    assertFalse(notEmoji);
}

We can see from this test that the library has correctly identified the surrogate pair as an emoji. It has also reported that the single letter “w” is not.

This library has a whole host of other features. So it’s a strong candidate for dealing with emojis in Java.

4. Using Regex

As we discussed earlier, we know roughly what an emoji will look like within a Java String. We also know the potential range of values that are reserved for surrogate pairs. The first code unit will be between U+D800 and U+DBFF, and the second code unit will be between U+DC00 and U+DFFF.

We can use this insight to write a regex for checking if a given String is one of the emojis represented by a surrogate pair. We need to note here that not all surrogate pairs are emojis, so this may give us false positives:

@Test
void givenAWord_whenUsingRegex_thenDetectEmoji() {
    // Java's regex engine matches whole code points, so we express "any surrogate pair"
    // as the supplementary range U+10000 to U+10FFFF that surrogate pairs encode
    String regexPattern = "[\\x{10000}-\\x{10FFFF}]+";
    String emojiString = "\uD83D\uDC3B";
    boolean emoji = emojiString.matches(regexPattern);
    assertTrue(emoji);

    String notEmojiString = "w";
    boolean notEmoji = notEmojiString.matches(regexPattern);
    assertFalse(notEmoji);
}
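The caveat about false positives is easy to reproduce. The following is a rough sketch rather than a recommended detector: the class name, the looksLikeEmoji() helper, and the pattern itself are assumptions made for this illustration. The pattern treats any supplementary code point, meaning any character stored as a surrogate pair, as a match:

```java
import java.util.regex.Pattern;

class SurrogatePairFalsePositive {

    // Matches one or more supplementary code points (U+10000 to U+10FFFF),
    // which are exactly the code points UTF-16 stores as surrogate pairs
    static final Pattern SUPPLEMENTARY = Pattern.compile("[\\x{10000}-\\x{10FFFF}]+");

    static boolean looksLikeEmoji(String s) {
        return SUPPLEMENTARY.matcher(s).matches();
    }

    public static void main(String[] args) {
        String bear = new String(Character.toChars(0x1F43B)); // Bear Face emoji
        String cjk = new String(Character.toChars(0x20000));  // a CJK ideograph, not an emoji
        String skull = "\u2620";                              // emoji stored in one code unit

        System.out.println(looksLikeEmoji(bear));  // true, as expected
        System.out.println(looksLikeEmoji(cjk));   // true, a false positive
        System.out.println(looksLikeEmoji(skull)); // false, a false negative
    }
}
```

The CJK ideograph is accepted even though it isn't an emoji, and the Skull and Crossbones is rejected even though it is one, so a surrogate pair check alone is only a rough heuristic.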

However, it’s not always as simple as checking within the expected range. As we already saw, some emojis only use a single code unit. Also, many emojis take modifiers, which are extra code points appended after the base emoji that change its appearance. We can also form more complex emojis by combining several emojis with Zero Width Joiner (ZWJ) characters in between them.

A good example of this is the Pirate Flag emoji which we can build using a Waving Black Flag and a Skull and Crossbones with a ZWJ in the middle. With this in mind, it’s clear the regex we’d need is much more complex to be certain we’re capturing all emojis.
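To see what such a sequence looks like inside a String, here is a small sketch that assembles the Pirate Flag from its code points and prints them back. The class and helper names are hypothetical. The sketch also appends U+FE0F (Variation Selector-16), which commonly follows the Skull and Crossbones to request emoji presentation; treat that detail as an assumption of this example:

```java
import java.util.stream.Collectors;

class ZwjSequenceDemo {

    // Builds a String from a list of code points
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    // Lists the code points of a String in U+XXXX notation
    static String describe(String s) {
        return s.codePoints()
          .mapToObj(cp -> String.format("U+%04X", cp))
          .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Waving Black Flag + ZWJ + Skull and Crossbones + Variation Selector-16
        String pirateFlag = fromCodePoints(0x1F3F4, 0x200D, 0x2620, 0xFE0F);
        System.out.println(describe(pirateFlag)); // U+1F3F4 U+200D U+2620 U+FE0F
        System.out.println(pirateFlag.length());  // 5 code units for one visible emoji

        // Waving Hand followed by a skin tone modifier
        String wave = fromCodePoints(0x1F44B, 0x1F3FD);
        System.out.println(describe(wave));       // U+1F44B U+1F3FD
    }
}
```

A single visible emoji can therefore span several code points, only some of which are surrogate pairs, which is why a fixed-range check falls short.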

Unicode publishes a document listing all current emoji values. We could either write a parser for this document or extract the ranges into our own configuration files. The results would then be usable for our own reliable emoji finder.
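As a starting point, a parser along these lines might work. It's a minimal sketch that assumes a line format of a hexadecimal code point or an “XXXX..YYYY” range, then a semicolon and a property name, with “#” starting a comment. The class name, method names, and sample lines are all illustrative and aren't copied from the real document:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EmojiRangeParser {

    // Parses each data line into an inclusive {start, end} code point range
    static List<int[]> parse(List<String> lines) {
        List<int[]> ranges = new ArrayList<>();
        for (String line : lines) {
            // Drop trailing comments, then skip lines with no data left
            int hash = line.indexOf('#');
            String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (data.isEmpty()) {
                continue;
            }
            // The first field holds "XXXX" or "XXXX..YYYY" (property field ignored here)
            String[] bounds = data.split(";")[0].trim().split("\\.\\.");
            int start = Integer.parseInt(bounds[0], 16);
            int end = bounds.length > 1 ? Integer.parseInt(bounds[1], 16) : start;
            ranges.add(new int[] { start, end });
        }
        return ranges;
    }

    // Checks whether a code point falls inside any parsed range
    static boolean isListed(List<int[]> ranges, int codePoint) {
        for (int[] range : ranges) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Illustrative sample lines, not the real file contents
        List<String> sample = Arrays.asList(
          "# sample data",
          "2620 ; Emoji # skull and crossbones",
          "1F43B..1F43C ; Emoji # bear, panda");
        List<int[]> ranges = parse(sample);

        System.out.println(isListed(ranges, 0x1F43B)); // true
        System.out.println(isListed(ranges, 'w'));     // false
    }
}
```

This only covers single code points. Modifier and ZWJ sequences would need their own lookup, and a real implementation would also have to decide which property values to accept.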

5. Conclusion

In this article, we looked at how Java represents emojis in UTF-16 Strings, using a single code unit for some and a surrogate pair for others. There’s a library, emoji-java, we can use in our code to detect them. This library offers a simple method to check if a String is an emoji.

We also have the option of writing our own detection code using regex. However, this is complex and needs to cover a wide range of possible values which is ever-growing. To do this successfully, we’d need to be able to accept updates from Unicode into our program.

As always, the full code for the examples is available over on GitHub.
