Working with Tabular Data Using Tablesaw

Azure Spring Apps is a fully managed service from Microsoft (built in collaboration with VMware), focused on building and deploying Spring Boot applications on Azure Cloud without worrying about Kubernetes.

The Enterprise plan comes with some interesting features, such as commercial Spring runtime support, a 99.95% SLA and some deep discounts (up to 47%) when you are ready for production.

>> Learn more and deploy your first Spring Boot app to Azure.

And, you can participate in a very quick (1 minute) paid user research from the Java on Azure product team.

Slow MySQL query performance is all too common. Of course it is. A good way to go is, naturally, a dedicated profiler that actually understands the ins and outs of MySQL.

The Jet Profiler was built for MySQL only, so it can do things like real-time query performance, focus on most used tables or most frequent queries, quickly identify performance issues and basically help you optimize your queries.

Critically, it has very minimal impact on your server's performance, with most of the profiling work done separately - so it needs no server changes, agents or separate services.

Basically, you install the desktop application, connect to your MySQL server, hit the record button, and you'll have results within minutes:

>> Try out the Profiler

Accelerate Your Jakarta EE Development with Payara Server!

With best-in-class guides and documentation, Payara essentially simplifies deployment to diverse infrastructures.

Beyond that, it provides intelligent insights and actions to optimize Jakarta EE applications.

The goal is to apply an opinionated approach to get to what's essential for mission-critical applications - really solid scalability, availability, security, and long-term support:

>> Download and Explore the Guide (to learn more)

The AI Assistant to boost Boost your productivity writing unit tests - Machinet AI.

AI is all the rage these days, but for very good reason. The highly practical coding companion, you'll get the power of AI-assisted coding and automated unit test generation.
Machinet's Unit Test AI Agent utilizes your own project context to create meaningful unit tests that intelligently aligns with the behavior of the code.
And, the AI Chat crafts code and fixes errors with ease, like a helpful sidekick.

Simplify Your Coding Journey with Machinet AI:

>> Install Machinet AI in your IntelliJ

Looking for the ideal Linux distro for running modern Spring apps in the cloud?

Meet Alpaquita Linux: lightweight, secure, and powerful enough to handle heavy workloads.

This distro is specifically designed for running Java apps. It builds upon Alpine and features significant enhancements to excel in high-density container environments while meeting enterprise-grade security standards.

Specifically, the container image size is ~30% smaller than standard options, and it consumes up to 30% less RAM:

>> Try Alpaquita Containers now.

DbSchema is a super-flexible database designer, which can take you from designing the DB with your team all the way to safely deploying the schema.

The way it does all of that is by using a design model, a database-independent image of the schema, which can be shared in a team using GIT and compared or deployed on to any database.

And, of course, it can be heavily visual, allowing you to interact with the database using diagrams, visually compose queries, explore the data, generate random data, import data or build HTML5 database reports.

>> Take a look at DBSchema

Slow MySQL query performance is all too common. Of course it is. A good way to go is, naturally, a dedicated profiler that actually understands the ins and outs of MySQL.

Critically, it has very minimal impact on your server's performance, with most of the profiling work done separately - so it needs no server changes, agents or separate services.

Basically, you install the desktop application, connect to your MySQL server, hit the record button, and you'll have results within minutes:

>> Try out the Profiler

1. Introduction

In this article, we’ll learn to use the Tablesaw library to work with tabular data. Firstly, we’ll import some data. Then, we’ll get some insight by playing with the data.

We’ll use the avocado prices dataset. In short, it contains historical data on avocado prices and sales volume in multiple US markets.

2. Importing Data in Tablesaw

First of all, we need to import the data. Tablesaw supports various formats, including CSV, our dataset’s format. So, let’s start by loading the dataset from its CSV file:

CsvReadOptions csvReadOptions =
    CsvReadOptions.builder(file)
        .separator(',')
        .header(true)
        .dateFormat(formatter)
        .build();
table = Table.read().usingOptions(csvReadOptions);

Above, we create the CsvReadOptions class by passing the File object to the builder. Then, we describe how to read the CSV file by configuring the options object correctly.

First, we set the column separator using the separator() method. Second, we read the file’s first line as a header. Third, we provide a DateTimeFormatter to parse the date and time correctly. Finally, we use the freshly created CsvReadOptions to read the table data.

2.1. Validate the Imported Data

Let’s use the structure() method to check the table’s design. It returns another table with column names, indexes, and data types:

         Structure of avocado.csv         
 Index  |  Column Name   |  Column Type  |
------------------------------------------
     0  |            C0  |      INTEGER  |
     1  |          Date  |   LOCAL_DATE  |
     2  |  AveragePrice  |       DOUBLE  |
     3  |  Total Volume  |       DOUBLE  |
    ... |       ...      |       ...     |

Next, let’s inspect its shape by using the shape() method:

assertThat(table.shape()).isEqualTo("avocado.csv: 18249 rows X 14 cols");

This method returns a String with the file name followed by the number of rows and columns. Our dataset contains 18249 rows of data and 14 columns in total.

3. Data Representation Inside Tablesaw

Tablesaw works primarily with tables and columns, which form the base of what is known as a data frame. In short, a table is a set of columns where each column has a fixed type. A row in the table is a set of values, with each value assigned to its matching column.

Tablesaw supports a variety of column types. In addition to the ones extending the primitive types in Java, it offers textual and temporal columns.

3.1. Textual Types

In Tablesaw, there are two two text types: TextColumn and StringColumn. The first is a general-purpose type that holds any text as it is. On the other hand, StringColumn encodes values in a dictionary-like data structure before storing them. This saves data efficiently instead of repeating values inside the column.

For example, in the avocado dataset, the region and the type columns are of type StringColumn. Their repeating values inside the column vector are stored more efficiently and point to the same instance of the text:

StringColumn type = table.stringColumn("type");
List<String> conventional = type.where(type.isEqualTo("conventional")).asList().stream()
    .limit(2)
    .toList();
assertThat(conventional.get(0)).isSameAs(conventional.get(1));

3.2. Temporal Types

There are four temporal types available inside Tablesaw. They map to their equivalent Java objects: DateColumn, DateTimeColumn, TimeColumn, and InstantColumn. As seen above, we can configure, when importing, how to parse these values.

4. Working With Columns

Next, let’s look at how to work with the imported data and extract insights from it. For example, in Tablesaw, we can transform individual columns or work with the whole table.

4.1. Creating a New Column

Let’s create a new column by calling the static method .create() defined on each type of available column. For example, to make a TimeColumn named time, we write:

TimeColumn time = TimeColumn.create("Time");

This column can then be added to a table using the .addColumns() method:

Table table = Table.create("test");
table.addColumns(time);
assertThat(table.columnNames()).contains("time");

4.2. Adding or Modifying Columns Data

Let’s add data to the end of a column using the .append() method:

DoubleColumn averagePrice = table.doubleColumn("AveragePrice");
averagePrice.append(1.123);
assertThat(averagePrice.get(averagePrice.size() - 1)).isEqualTo(1.123);

For tables, we must provide a value for each column to ensure all columns have at least a value. Otherwise, it will throw an IllegalArgumentException when creating tables with columns having different sizes:

DoubleColumn averagePrice2 = table.doubleColumn("AveragePrice").copy();
averagePrice2.setName("AveragePrice2");
averagePrice2.append(1.123);
assertThatExceptionOfType(IllegalArgumentException.class).isThrownBy(() -> table.addColumns(averagePrice2));

We use the .set() method to change a specific value inside the column vector. To use it, we must know the index of the value we want to change:

stringColumn.set(2, "Baeldung");

Removing data from a column can be problematic, especially in the case of tables. So, Tablesaw doesn’t allow removing values from the column vector. Instead, let’s flag values we wish to remove as missing using the .setMissing() and pass the index of each value to this method:

DoubleColumn averagePrice = table.doubleColumn("AveragePrice").setMissing(0);
assertThat(averagePrice.get(0)).isNull();

As a result, it doesn’t remove the value holder from the vector but sets it to null. Consequently, the size of the vector remains the same.

5. Sorting the Data

Next, let’s sort the data we imported previously. To begin with, we’ll sort our table rows according to a set of columns. To do this, we use the .sortAscending() and .sortDescending() methods, which accept the names of columns. Let’s sort to get the oldest and the most recent date present in our dataset:

Table ascendingDateSortedTable = table.sortAscendingOn("Date");
assertThat(ascendingDateSortedTable.dateColumn("Date").get(0)).isEqualTo(LocalDate.parse("2015-01-04"));
Table descendingDateSortedTable = table.sortDescendingOn("Date");
assertThat(descendingDateSortedTable.dateColumn("Date").get(0)).isEqualTo(LocalDate.parse("2018-03-25"));

However, these methods are very limiting. For example, we cannot mix the ascending and descending sorting orders. To work around these limitations, we use the .sortOn() method. It accepts a set of column names and orders them by default. To sort a specific column in descending order, we preface the column name with a minus sign “-“. For example, let’s sort the data by year and with the highest average price in descending order:

Table ascendingYearAndAveragePriceSortedTable = table.sortOn("year", "-AveragePrice");
assertThat(ascendingYearAndAveragePriceSortedTable.intColumn("year").get(0)).isEqualTo(2015);
assertThat(ascendingYearAndAveragePriceSortedTable.numberColumn("AveragePrice").get(0)).isEqualTo(2.79);

These methods don’t fit all use cases. For such cases, Tablesaw accepts a custom implementation of Comparator<VRow> for the .sortOn() method.

6. Filtering the Data

Filters allow us to get a subset of data from the original table. Filtering a table returns another table, and we use the .where() and .dropWhere() methods to apply filters. The first method will return values or rows matching the criteria we specified. And the second method, instead, will drop them.

To specify the filtering criteria, we first need to understand Selections.

6.1. Selections

A Selection is a logical bitmap. In other words, it’s an array containing boolean values that mask values on a column vector. For example, applying the selection to a column will yield another column containing the filtered values — for instance, dropping values where the mask is 0 for a given index. Additionally, the selection vector will be the same size as its original column.

Let’s put this into practice by getting a table of the data from 2017 with an average price above $2 only:

DateColumn dateTable = table.dateColumn("Date");
DoubleColumn averagePrice = table.doubleColumn("AveragePrice");
Selection selection = dateTable.isInYear(2017).and(averagePrice.isGreaterThan(2D));
Table table2017 = table.where(selection);
assertThat(table2017.intColumn("year")).containsOnly(2017);
assertThat(table2017.doubleColumn("AveragePrice")).allMatch(avrgPrice -> avrgPrice > 2D);

Above, we used the method .isInYear() defined on the DateColumn and .isGreaterThan() defined on the DoubleColumn. We combined them in a query-like language with the .and() method. Tablesaw offers many such built-in helper methods. Therefore, we rarely have to build a custom selection ourselves for simple tasks. For complex tasks, we combine them using .and(), .andNot(), or(), and other column filters.

Alternatively, we write custom filters by creating a Predicate and passing it to the .eval() method available on each column. This method returns a Selection object we use to filter our tables or columns.

7. Summarizing Data

After working with the data, we want to extract some insight from it. We use the .summarize() method to aggregate data to learn about it. For example, from the avocado dataset, let’s extract the min, max, mean, and standard deviation of the average price:

Table summary = table.summarize("AveragePrice", max, min, mean, stdDev).by("year");
System.out.println(summary.print());

First, we pass the column name we want to aggregate and the list of AggregateFunction to the .summarize() method. Next, we group the result by each year using the .by() method. Finally, we print the result on the standard output:

                                              avocado.csv summary                                               
 year  |  Mean [AveragePrice]  |  Max [AveragePrice]  |  Min [AveragePrice]  |  Std. Deviation [AveragePrice]  |
----------------------------------------------------------------------------------------------------------------
 2015  |    1.375590382902939  |                2.79  |                0.49  |            0.37559477067238917  |
 2016  |   1.3386396011396013  |                3.25  |                0.51  |            0.39370799476072077  |
 2017  |   1.5151275777700104  |                3.17  |                0.44  |             0.4329056466203253  |
 2018  |   1.3475308641975308  |                 2.3  |                0.56  |             0.3058577391135024  |

Tablesaw provides AggregateFunction for most common operations. Alternatively, we can implement a custom AggregateFunction object, but since that’s beyond the scope of this article, we’ll keep things simple.

8. Saving the Data

Until now, we’ve been printing the data to the standard output. Printing to the console is good when validating our results on the fly, but we need to save our data to files to make the results reusable by others. So, let’s use the .write() method directly on the table:

summary.write().csv("summary.csv");

Above, we used the .csv() method to save the data in CSV format. Currently, Tablesaw supports only the CSV format and the fixed-width format, which is similar to what the .print() method displays on the console. Additionally, we use CsvWriterOptions to customize the CSV output of our data.

9. Conclusion

In this article, we explored working with tabular data using the Tablesaw library.

First, we explained how to import data. Then, we described the internal representation of the data and how to work with it. Next, we explored modifying the structure of the imported table and creating filters to extract the necessary data before aggregating it. Finally, we saved it to a CSV file.

As always, the full source code is available over on GitHub.

Working with Tabular Data Using Tablesaw

Get started with Spring and Spring Boot, through the Learn Spring course:

1. Introduction

2. Importing Data in Tablesaw

2.1. Validate the Imported Data

3. Data Representation Inside Tablesaw

3.1. Textual Types

3.2. Temporal Types

4. Working With Columns

4.1. Creating a New Column

4.2. Adding or Modifying Columns Data

5. Sorting the Data

6. Filtering the Data

6.1. Selections

7. Summarizing Data

8. Saving the Data

9. Conclusion

Get started with Spring and Spring Boot, through the Learn Spring course:

REST with Spring

Learn Spring Security ▼▲

Learn Spring Security Core

Learn Spring Security OAuth

Learn Spring

Learn Spring Data JPA

Persistence

REST

Security

Full Archive

Baeldung Ebooks

About Baeldung

Write for Baeldung

Get started with Spring and Spring Boot, through the Learn Spring course:

1. Introduction

2. Importing Data in Tablesaw

2.1. Validate the Imported Data

3. Data Representation Inside Tablesaw

3.1. Textual Types

3.2. Temporal Types

4. Working With Columns

4.1. Creating a New Column

4.2. Adding or Modifying Columns Data

5. Sorting the Data

6. Filtering the Data

6.1. Selections

7. Summarizing Data

8. Saving the Data

9. Conclusion

Get started with Spring and Spring Boot, through the Learn Spring course: