Course – LS – All

Get started with Spring and Spring Boot, through the Learn Spring course:

>> CHECK OUT THE COURSE

1. Overview

In Java programming, dealing with strings and patterns is essential to many applications. Regular expressions, commonly known as regex, provide a powerful tool for pattern matching and manipulation.

Sometimes, we not only need to identify matches within a string but also locate exactly where these matches occur. In this tutorial, we’ll explore getting the indexes of regex pattern matches in Java.

2. Introduction to the Problem

Let’s start with a String example:

String INPUT = "This line contains <the first value>, <the second value>, and <the third value>.";

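Let’s say we want to extract all “<…>” segments from the string above, such as “<the first value>” and “<the second value>”.

To match these segments, we can use a negated character class in the regex: “<[^>]*>”. It matches a ‘<’, followed by any number of characters other than ‘>’, and then the closing ‘>’.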
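To see why the character class matters here, the following minimal sketch compares “<[^>]*>” with the greedy alternative “<.*>” on the same input. The class name NegatedClassDemo and the helper allMatches() are made up for this illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassDemo {

    static final String INPUT = "This line contains <the first value>, <the second value>, and <the third value>.";

    // Hypothetical helper: collects every match of the given regex in INPUT, in order of appearance.
    static List<String> allMatches(String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(INPUT);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // "[^>]*" stops at the first '>', so each segment is matched separately
        System.out.println(allMatches("<[^>]*>").size()); // prints 3
        // ".*" is greedy and runs up to the last '>', so everything collapses into one match
        System.out.println(allMatches("<.*>").size());    // prints 1
    }
}
```

With the greedy version, the single match spans from the first ‘<’ to the last ‘>’, which isn’t what we want in this tutorial.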

In Java, the Pattern and Matcher classes from the Regex API are important tools for working with pattern matching. These classes provide methods to compile regex patterns and apply them to strings for various operations.

So next, let’s use Pattern and Matcher to extract the desired text. For simplicity, we’ll use AssertJ assertions to verify whether we obtained the expected result:

Pattern pattern = Pattern.compile("<[^>]*>");
Matcher matcher = pattern.matcher(INPUT);
List<String> result = new ArrayList<>();
while (matcher.find()) {
    result.add(matcher.group());
}
assertThat(result).containsExactly("<the first value>", "<the second value>", "<the third value>");

As the code above shows, we extracted all “<…>” parts from the input String. However, sometimes, we want to know exactly where matches are located in the input. In other words, we want to obtain the matches and their indexes in the input string.

Next, let’s extend this code to achieve our goals.

3. Obtaining Indexes of Matches

We’ve used the Matcher class to extract the matches. The Matcher class also offers two methods, start() and end(), which return the start and end indexes of the current match.

It’s worth noting that Matcher.end() returns the index after the last character of the matched subsequence. In other words, the end index is exclusive. An example shows this clearly:

Pattern pattern = Pattern.compile("456");
Matcher matcher = pattern.matcher("0123456789");
String result = null;
int startIdx = -1;
int endIdx = -1;
if (matcher.find()) {
    result = matcher.group();
    startIdx = matcher.start();
    endIdx = matcher.end();
}
assertThat(result).isEqualTo("456");
assertThat(startIdx).isEqualTo(4);
assertThat(endIdx).isEqualTo(7); // matcher.end() returns 7 instead of 6
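Because the end index is exclusive, the pair returned by start() and end() can be passed directly to String.substring(), and their difference is the length of the match. Here is a small sketch of that idea. EndIndexDemo and firstMatchRange() are illustrative names, not part of the Regex API:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndIndexDemo {

    // Hypothetical helper: returns {start, end} of the first match, or null if there is none.
    static int[] firstMatchRange(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (!matcher.find()) {
            return null;
        }
        return new int[] { matcher.start(), matcher.end() };
    }

    public static void main(String[] args) {
        String input = "0123456789";
        int[] range = firstMatchRange("456", input);
        // substring() also treats its second argument as exclusive, so no "+ 1" is needed
        System.out.println(input.substring(range[0], range[1])); // prints 456
        // end - start is the length of the match
        System.out.println(range[1] - range[0]); // prints 3
    }
}
```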

Now that we understand what start() and end() return, let’s see if we can obtain the indexes of each matched “<…>” subsequence in our INPUT:

Pattern pattern = Pattern.compile("<[^>]*>");
Matcher matcher = pattern.matcher(INPUT);
List<String> result = new ArrayList<>();
Map<Integer, Integer> indexesOfMatches = new LinkedHashMap<>();
while (matcher.find()) {
    result.add(matcher.group());
    indexesOfMatches.put(matcher.start(), matcher.end());
}
assertThat(result).containsExactly("<the first value>", "<the second value>", "<the third value>");
assertThat(indexesOfMatches.entrySet()).map(entry -> INPUT.substring(entry.getKey(), entry.getValue()))
  .containsExactly("<the first value>", "<the second value>", "<the third value>");

As the test above shows, we stored each match’s start() and end() results in a LinkedHashMap to preserve the insertion order. Since every match starts at a different position, the start indexes are safe to use as map keys. Then, we extracted substrings from the original input using these index pairs. If we obtained the correct indexes, these substrings must equal the matches.

If we give this test a run, it passes.
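As an alternative to the while loop, on Java 9 or later we could collect the same index pairs with Matcher.results(), which returns a stream of MatchResult objects. The sketch below assumes Java 9+. The class MatchRanges and the method rangesOf() are names invented for this example:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class MatchRanges {

    static final String INPUT = "This line contains <the first value>, <the second value>, and <the third value>.";

    // Hypothetical helper: returns one {start, end} pair per match, in order of appearance.
    static List<int[]> rangesOf(String regex, String input) {
        return Pattern.compile(regex)
          .matcher(input)
          .results() // available since Java 9
          .map(result -> new int[] { result.start(), result.end() })
          .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        for (int[] range : rangesOf("<[^>]*>", INPUT)) {
            // prints 19-36, 38-56, and 62-79, each followed by the matched text
            System.out.println(range[0] + "-" + range[1] + " " + INPUT.substring(range[0], range[1]));
        }
    }
}
```

A list of pairs keeps the order of the matches without relying on the map’s ordering guarantees. Which container to use is a design choice rather than a requirement.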

4. Obtaining Indexes of Matches With Capturing Groups

In regex, capturing groups play a crucial role by allowing us to reference them later or conveniently extract sub-patterns.

To illustrate, suppose we want to extract the content enclosed between ‘<’ and ‘>’. In this case, we can create a pattern with a capturing group: “<([^>]*)>”. As a result, calling Matcher.group(1) gives us the text “the first value”, “the second value”, and so on.

Group 0 always denotes the entire match, regardless of whether the pattern defines any explicit capturing groups. Therefore, calling Matcher.group() is the same as calling Matcher.group(0).
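The relationship between the no-argument methods and group 0 can be checked with a short sketch. The class GroupZeroDemo, its helper method, and the sample input “a <b> c” are assumptions made only for this illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupZeroDemo {

    // Hypothetical helper: checks that the no-argument methods agree with their group-0 counterparts.
    // It must be called after a successful find().
    static boolean groupZeroEqualsWholeMatch(Matcher matcher) {
        return matcher.group().equals(matcher.group(0))
          && matcher.start() == matcher.start(0)
          && matcher.end() == matcher.end(0);
    }

    public static void main(String[] args) {
        Matcher matcher = Pattern.compile("<([^>]*)>").matcher("a <b> c");
        if (matcher.find()) {
            System.out.println(groupZeroEqualsWholeMatch(matcher)); // prints true
            // groupCount() reports only explicit capturing groups; group 0 isn't counted
            System.out.println(matcher.groupCount()); // prints 1
        }
    }
}
```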

Just like Matcher.group(), the Matcher.start() and Matcher.end() methods accept a group index as an argument. If the group didn’t take part in the match, both return -1. Otherwise, they return the start and end indexes of the content matched by that group:

Pattern pattern = Pattern.compile("<([^>]*)>");
Matcher matcher = pattern.matcher(INPUT);
List<String> result = new ArrayList<>();
Map<Integer, Integer> indexesOfMatches = new LinkedHashMap<>();
while (matcher.find()) {
    result.add(matcher.group(1));
    indexesOfMatches.put(matcher.start(1), matcher.end(1));
}
assertThat(result).containsExactly("the first value", "the second value", "the third value");
assertThat(indexesOfMatches.entrySet()).map(entry -> INPUT.substring(entry.getKey(), entry.getValue()))
  .containsExactly("the first value", "the second value", "the third value");
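The same approach also works with named capturing groups, since Matcher.start() and Matcher.end() accept a group name as well. The following sketch assumes we name the group “value”. The class NamedGroupIndexes and the method valuesByIndexes() are hypothetical names chosen for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupIndexes {

    // Hypothetical helper: rebuilds each captured value from the indexes of the named group "value".
    static List<String> valuesByIndexes(String input) {
        Matcher matcher = Pattern.compile("<(?<value>[^>]*)>").matcher(input);
        List<String> values = new ArrayList<>();
        while (matcher.find()) {
            int start = matcher.start("value");
            int end = matcher.end("value");
            values.add(input.substring(start, end));
        }
        return values;
    }

    public static void main(String[] args) {
        String input = "This line contains <the first value>, <the second value>, and <the third value>.";
        System.out.println(valuesByIndexes(input)); // prints [the first value, the second value, the third value]
    }
}
```

Referring to groups by name can make the code easier to read when a pattern contains several groups, as the indexes no longer depend on the order of the parentheses.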

5. Conclusion

In this article, we explored obtaining the indexes of pattern matches within the original input when dealing with regex. We discussed scenarios involving patterns with and without explicitly defined capturing groups.

As always, the complete source code for the examples is available over on GitHub.
