Course – LS – All

Get started with Spring and Spring Boot, through the Learn Spring course:

>> CHECK OUT THE COURSE

1. Introduction

Uniform Resource Locators (URLs) are central to web development because they identify and locate resources on the Internet. Yet URLs that point to the same resource may be written inconsistently or formatted incorrectly, which causes problems when we compare, store, or fetch them.

URL normalization transforms a URL into a canonical form, so that equivalent URLs end up with the same representation and can be processed consistently.

Throughout this tutorial, we’ll investigate different techniques to normalize a URL in Java.

2. Manual Normalization

Performing manual normalization involves applying custom logic to standardize the URLs. This process includes removing extraneous elements, such as unnecessary query parameters and fragment identifiers, to distill the URL down to its essential core. Suppose we have the following URL:

https://www.example.com:8080/path/to/resource?param1=value1&param2=value2#fragment

The normalized URL should be as follows:

https://www.example.com:8080/path/to/resource

Note that we’re considering anything after “?” as unnecessary, as we’re only interested in grouping by resource. But that’ll vary depending on the use case.
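The manual approach described above can be sketched with plain string operations. This is a minimal illustration rather than a definitive implementation: the class name ManualUrlNormalizer and the method stripQueryAndFragment() are hypothetical names chosen for this example, and the sketch assumes that dropping everything from the first "?" or "#" onward is the desired behavior.

```java
final class ManualUrlNormalizer {

    // Hypothetical helper: cuts the URL at the first '?' or '#', whichever comes first.
    static String stripQueryAndFragment(String url) {
        int cut = url.length();
        int query = url.indexOf('?');
        int fragment = url.indexOf('#');
        if (query >= 0) {
            cut = Math.min(cut, query);
        }
        if (fragment >= 0) {
            cut = Math.min(cut, fragment);
        }
        return url.substring(0, cut);
    }

    public static void main(String[] args) {
        String originalUrl = "https://www.example.com:8080/path/to/resource?param1=value1&param2=value2#fragment";
        // prints https://www.example.com:8080/path/to/resource
        System.out.println(stripQueryAndFragment(originalUrl));
    }
}
```

Checking for both delimiters means a URL that has a fragment but no query string is also reduced to its resource part.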

3. Utilizing Apache Commons Validator

The UrlValidator class in the Apache Commons Validator library is a convenient way to validate URLs. It doesn't normalize them itself, so we combine it with the manual approach from the previous section. First, we should ensure that our project includes the Apache Commons Validator dependency as follows:

<dependency>
    <groupId>commons-validator</groupId>
    <artifactId>commons-validator</artifactId>
    <version>1.8.0</version>
    <scope>test</scope>
</dependency>

Now, we’re ready to implement a simple Java code example:

String originalUrl = "https://www.example.com:8080/path/to/resource?param1=value1&param2=value2#fragment";
String expectedNormalizedUrl = "https://www.example.com:8080/path/to/resource";

@Test
public void givenOriginalUrl_whenUsingApacheCommonsValidator_thenValidatedAndMaybeManuallyNormalized() {
    UrlValidator urlValidator = new UrlValidator();
    if (urlValidator.isValid(originalUrl)) {
        // Keep only the part before '?', which drops the query and the fragment after it
        String normalizedUrl = originalUrl.split("\\?")[0];
        assertEquals(expectedNormalizedUrl, normalizedUrl);
    } else {
        fail(originalUrl);
    }
}

Here, we start by instantiating a UrlValidator. Then, we use the isValid() method to determine whether the original URL complies with the validator's default rules.

If the URL is valid, we normalize it by hand, keeping only the part before ‘?’ and thereby dropping the query parameters and the fragment that follows them. A URL that has a fragment but no query string would keep its fragment with this approach. Finally, we use the assertEquals() method to verify that normalizedUrl equals expectedNormalizedUrl.

4. Utilizing Java’s URI Class

Java’s URI class in the java.net package provides further features for managing URIs, including normalization. Let’s see a simple example:

@Test
public void givenOriginalUrl_whenUsingJavaURIClass_thenNormalizedUrl() throws URISyntaxException {
    URI uri = new URI(originalUrl);
    URI normalizedUri = new URI(uri.getScheme(), uri.getAuthority(), uri.getPath(), null, null);
    String normalizedUrl = normalizedUri.toString();
    assertEquals(expectedNormalizedUrl, normalizedUrl);
}

Within this test, we pass the originalUrl to the URI constructor and then derive a normalized URI by reassembling only the scheme, authority, and path. Passing null for the query and fragment leaves them out of the result.
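The URI class also offers a normalize() method, which removes redundant "." and ".." path segments. The following sketch combines it with the component-based rebuild shown above. The class name UriNormalizer and the sample URL with dot segments are assumptions made for this illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriNormalizer {

    // Hypothetical helper: resolves dot segments, then drops query and fragment.
    static String normalize(String url) {
        // URI.create() throws IllegalArgumentException for malformed input
        URI uri = URI.create(url).normalize();
        try {
            return new URI(uri.getScheme(), uri.getAuthority(), uri.getPath(), null, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot rebuild URL: " + url, e);
        }
    }

    public static void main(String[] args) {
        String url = "https://www.example.com:8080/path/./to/../to/resource?param1=value1#fragment";
        // prints https://www.example.com:8080/path/to/resource
        System.out.println(normalize(url));
    }
}
```

Note that normalize() only works on the path. It doesn't change the case of the scheme or host, so those would need separate handling if the use case requires it.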

5. Using Regular Expressions

Regular expressions are another useful mechanism for URL normalization in Java. They let us define patterns that match the parts of a URL we want to keep and discard the rest. Here’s a simple code example:

@Test
public void givenOriginalUrl_whenUsingRegularExpression_thenNormalizedUrl() {
    String regex = "^(https?://[^/]+/[^?#]+)";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(originalUrl);

    if (matcher.find()) {
        String normalizedUrl = matcher.group(1);
        assertEquals(expectedNormalizedUrl, normalizedUrl);
    } else {
        fail(originalUrl);
    }
}

In the above code example, we first define a regex that matches the scheme, host (with an optional port), and path of the URL, stopping at the first ‘?’ or ‘#’. Then, we compile it into a Pattern object and create a Matcher to apply the pattern to the original URL.

Next, we call matcher.find() to look for a subsequence of the input that matches the pattern. If it returns true, matcher.group(1) returns the content of the first capturing group (denoted by parentheses), which is our normalized URL.
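For repeated use, the same idea can be wrapped in a small helper that compiles the pattern once. This is only a sketch: the class name RegexUrlNormalizer is hypothetical, and the pattern is a variation of the one above that makes the path optional, so a URL such as https://www.example.com?x=1 is also handled.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexUrlNormalizer {

    // Scheme and host (with optional port), followed by an optional path up to '?' or '#'
    private static final Pattern BASE_URL = Pattern.compile("^(https?://[^/?#]+(?:/[^?#]*)?)");

    // Returns the URL without query and fragment, or empty if the input doesn't match
    static Optional<String> normalize(String url) {
        Matcher matcher = BASE_URL.matcher(url);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String originalUrl = "https://www.example.com:8080/path/to/resource?param1=value1&param2=value2#fragment";
        // prints https://www.example.com:8080/path/to/resource
        System.out.println(normalize(originalUrl).orElse("no match"));
    }
}
```

Returning an Optional leaves it to the caller to decide how to treat input that isn't an HTTP or HTTPS URL.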

6. Conclusion

In conclusion, we explored several ways to normalize a URL in Java: manual normalization, validation with the Apache Commons Validator library followed by manual normalization, Java’s URI class, and regular expressions.

As usual, the accompanying source code can be found over on GitHub.
