Java Spliterator Explained
Java has multiple types for traversing elements.
My last article showed how java.util.Iterator<T>
and java.util.ListIterator<T>
can be used to traverse data structures like Collections.
The general concept of iterators exists since Java 1.2, but in Java 8, it got a new relative: java.util.Spliterator<T>
.
Table of Contents
The Spliterator
Besides traversing sequences of data, like an Iterator<T>
, a Spliterator<T>
can also partition it:
Iterator + Splitting => Spliterator
The trySplit()
method allows it to partition off some elements of the sequence as another Spliterator<T>
.
This particular advantage over Iterator makes it the core component of the Stream API. By splitting up data into apt sub-sequences, it allows parallel processing.
Characteristics
Not every data sequence can be just split into multiple parts.
By being an enabler of parallel processing, the actual processor must know more about the data it’s getting from the Spliterator<T>
.
For example, if a data sequence must be ordered, a Stream needs to process it differently from an unordered collection.
To provide as much information about the “spliterated” data sequence, a Spliterator<T>
can have certain characteristics:
CONCURRENT
- Is the underlying element source thread-safe without external synchronization?
DISTINCT
- Are all elements unique and
x.equals(y) == false
for any pair? IMMUTABLE
- Are no changes to the source, like add/remove/replace, allowed during traversal?
NONNULL
- Are all elements guaranteed
!= null
? ORDERED
- Is the element source ordered?
SIZED
- Is the estimated size before traversal the actual size?
SORTED
- Are the elements sorted?
SUBSIZED
- Are split-up parts of the source still
SIZED
?
The characteristics are represented by int
, so to provide multiple values, we need to use bitwise or |
.
Here are the characteristics of Spliterators created by commonly used collection types:
The HashSet<T>
example also shows that the characteristics don’t have to be fixed. They can use the surrounding context to make an informed decision.
The Interface
java.util.Spliterator<T>
has only 8 methods.
And if we don’t count default
methods, the number goes down to 4:
With this article being more of a high-level introduction, I won’t go into detail. The methods have quite descriptive names, and they are documented well.
Primitive Types
With Project Valhalla still being in the distant future, handling primitive types with generics is a no-go. So specialized types for primitives are currently our only options, just like specialized Streams.
Spliterator<T>
3 specialized types:
Other primitive types aren’t directly supported because no specialized Consumer
types exist besides these 3.
We could create our own by utilizing Spliterator.OfPrimitive
, but it would need additional functional interfaces not readily available in the JDK.
Create a Spliterator
The simplest way is using the retro-fitted spliterator()
method of java.util.Collection
, it creates a rudimentary Spliterator<T>
without any characteristics.
But all collection types I’ve encountered in the JDK provide a more specialized Spliterator<T>
with optimized characteristics.
If we need to create a Spliterator<T>
for a custom data sequence, we don’t necessarily have to implement the interface ourselves.
Instead, we can use the convenience methods of java.util.Spliterators
.
Just like other types with only static
convenience methods, it provides a wide range of options:
Especially interesting is the <T> Spliterator<T> spliterator(Iterator<? extends T> iterator, long size, int characteristics)
method.
With its help, every iterable type is usable as a Stream<T>
, even if it doesn’t provide its own Spliterator<T>
:
The utility class java.util.Arrays
also provides static
convenience methods for creating a Spliterator<T>
, but they mostly just wrap a call to Spliterators
:
Parallelism
Parallel data processing was also a complex task. The Java Collection Framework types are either not thread-safe, or they provide synchronized
wrappers to mitigate.
This approach’s problem is thread contention: a conflict over access to a shared data structure between threads, leading to slowdowns while a thread is waiting for another one to finish its work.
Instead of just traversing a single item at a time, a Spliterator<T>
can split up a portion of its elements to allow parallel processing, not provide the parallel behaviour itself.
We can make use of this in Streams by calling Collection#parallelStream()
or the intermediate Stream operation parallel().
If we look at the actual code, there’s not much of a difference regarding the Spliterator<T>
:
The Spliterator<T>
facilitates parallelism, but it’s the stream that has to do the actual work.
When to use parallel streams is out of the scope of this article.
The quintessence is that using a parallel stream won’t automagically lead to better performance. It highly depends on the type and amount of data you process, your execution environment, and more.
Here’s a Medium article about possible pitfalls with parallel streams:
Iterator vs. Spliterator
Even though both Spliterator<T>
and Iterator<T>
share the concept of iteration, they differ in their scope of application.
Iterator<T>
is a universal, non-parallel way of iteration available since Java 1.2.
It’s deeply ingrained in the Java Collection Framework.
Most likely, many of us have some form of hands-on experience with Iterator<T>
or the related java.lang.Iterable<T>
, for concurrent modification, or due to the enhanced for-loop.
Spliterator<T>
can be used for iterating both the Java Collection Framework types and is the core of Stream API.
It also enables parallel traversing if the underlying data allows it.
Even though it’s such an essential part of the Stream API, it’s more of a “background” type.
We only use it through convenience methods or use already existing types in the JDK.
Resources
- Spliterator (Oracle)
- Java Tutorial Parallelism (Oracle)
- Java Stream API (Oracle)
- Java Collection Framework (Oracle)
- Iterator (Oracle)
- How to Iterate with Java (BDD)