Choosing the Right Data Types
Every time we create a new data structure, we have to decide which data types to use.
Usually, the decision is simple: text will most likely become a String, non-floating-point numbers will be int, and so forth.
Most of the time, these almost subconsciously-made decisions will suffice. But to design future-proof data structures, we have to think about choosing the correct data type a little more.
Choosing the right data type
Even though the definition of “right” is highly subjective, choosing a data type depends on 4 different, interconnected factors:
- What are the requirements?
- Is it future-proof?
- How much convenience is provided?
- Will it impact performance or memory consumption?
The “Actual” Requirements
The main reason for choosing a particular data type starts with the requirements. We define what we want and need and have to choose accordingly.
But often, requirements are just too vague to represent the actual needs. Or they don’t match the technical reality. So the first step is finding the actual requirements and translating them into their technical counterpart.
Storing the Age of a User
Our imaginary requirements state that we need the age of a user at registration time. Depending on the target audience, we need between 1 and 3 digits to store that information.
Or do we want to have the actual age, not just the age captured at data entry? An additional digit is needed to save the year of birth.
Or might we want to be even more precise and congratulate them on their birthday? Now we should use a date-based data type instead of a numeric one.
Many use cases won’t initially be visible if the requirements are too vague. Data can often be represented by different data types with varying degrees of precision. That’s why it is crucial to actually translate non-precise or vague requirements into their technical counterpart.
But this doesn’t mean we should always store as much information as possible, without a good reason. Always remember the principle of data avoidance and minimization.
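The most precise reading of the age requirement above could be sketched as follows. The class and method names are hypothetical, chosen only for illustration. The idea is that a stored date of birth lets us derive both the current age and the birthday, while a stored age can't be turned back into either.

```java
import java.time.LocalDate;
import java.time.MonthDay;
import java.time.Period;

// Hypothetical user model: stores the date of birth instead of a fixed age.
final class UserProfile {
    private final LocalDate dateOfBirth;

    UserProfile(LocalDate dateOfBirth) {
        this.dateOfBirth = dateOfBirth;
    }

    // Age in full years on the given day; derived on demand, never stored.
    int ageOn(LocalDate day) {
        return Period.between(dateOfBirth, day).getYears();
    }

    // True if month and day of the given date match the date of birth.
    // Birthdays on 29 February would need an explicit policy for non-leap years.
    boolean hasBirthdayOn(LocalDate day) {
        return MonthDay.from(dateOfBirth).equals(MonthDay.from(day));
    }
}
```

Whether storing the full date is justified still depends on the actual requirements. If only the age at registration matters, a small number remains the more data-sparing choice.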
Compound-Values and Redundancy
A boolean is often an obvious choice that seems like a good fit, but it isn't always enough on its own.
Imagine a content-management-system (CMS).
Content might be deleted someday, and we want to store the date of deletion.
The naïve way to do so is to add a boolean to indicate the document was deleted, and a datetime for storing the deletion date.
These are what I call compound-values. They have meaning by themselves, but they are interconnected to represent the actual state we want to express. This kind of design leads to multiple consistency problems:
deleted == true && timeDeleted == null
The data is clearly deleted, but when?
deleted == false && timeDeleted != null
Is it deleted? Or was it restored? What does the deletion time represent?
Additional validation and logic will be needed to ensure consistency between the two values, which might introduce more bugs than needed. If we wanted to store all the different states possible, we would need to add even more values, with more logic and validation.
Instead of using compound-values for representing the deletion state, why not just choose a single data type, like a nullable datetime?
This way, we have the same information as before: no value means the content isn't deleted, and a value tells us when it was.
And we also eliminated the boolean and any inconsistency problem that results from the interconnection.
If we can, we should avoid interconnected types, because they couple values to each other. Data should be consistent in itself, and the less logic and validation we need to ensure it, the better.
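A minimal sketch of the single-value approach, assuming a hypothetical CmsDocument class and using java.time.Instant as the datetime type. The deletion flag is derived from the timestamp, so the two inconsistent states listed above can't be represented at all.

```java
import java.time.Instant;
import java.util.Objects;

// Hypothetical CMS document: one nullable timestamp replaces the
// boolean/datetime pair.
final class CmsDocument {
    // null means "not deleted"; any other value is the time of deletion.
    private Instant deletedAt;

    // Derived from the timestamp, so it can never disagree with it.
    boolean isDeleted() {
        return deletedAt != null;
    }

    Instant getDeletedAt() {
        return deletedAt;
    }

    void delete(Instant when) {
        deletedAt = Objects.requireNonNull(when, "when");
    }

    void restore() {
        deletedAt = null;
    }
}
```

If we later need to know when a document was restored, that is a new requirement and deserves its own value, instead of overloading this one.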
After finding the actual technical requirements for our data type, we should consider how well it will hold up over time.
Not many requirements will remain unchanged over time, and it’s easier to start with an extensible type, instead of replacing it entirely later on.
The End of Time
Unix time is (at least in hindsight) an example of choosing a non-future-proof data type. In the year 2038, on Tuesday, 19th January, one second after 03:14:07 UTC, a signed 32-bit integer counting the seconds since 1970 will overflow. The timestamp will become negative, and programs might interpret the next second as 20:45:52 UTC on Friday, 13th December 1901.
In the case of Unix time, though, it's most likely not an oversight. The deadline for finding another solution was 68 years in the future. And bits were at a premium back in the day, so a smaller data type made perfect sense and wasn't a wrong decision per se.
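The wrap-around can be reproduced in a few lines. This is only an illustration using Java's int, which happens to be a signed 32-bit type. The class name is made up for the example.

```java
import java.time.Instant;

final class UnixTimeOverflow {
    // Interprets a signed 32-bit second counter as a UTC timestamp.
    static Instant toInstant(int epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds);
    }

    public static void main(String[] args) {
        int last = Integer.MAX_VALUE; // 2^31 - 1 seconds after the epoch
        int next = last + 1;          // int arithmetic wraps to Integer.MIN_VALUE

        System.out.println(toInstant(last)); // 2038-01-19T03:14:07Z
        System.out.println(toInstant(next)); // 1901-12-13T20:45:52Z

        // A 64-bit counter simply keeps counting.
        long wide = (long) last + 1;
        System.out.println(Instant.ofEpochSecond(wide)); // 2038-01-19T03:14:08Z
    }
}
```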
Single-States Don’t Stay Single For Long
Our imaginary CMS now needs admin-users.
To satisfy the technical requirement, we could add a boolean indicating a user is an admin.
But this will most likely become a problem in the future.
What if we need another type of user, e.g., an editor?
We could just add another boolean field. But then we would also have to change all the code handling user types.
Instead, we should choose a data type with more information-density than boolean.
By using an enum representing the different user types, we still only have to deal with a single value.
It can describe various states and can be extended with ease, if necessary.
And if we need multiple states at once, we could rely on something like EnumSet (Java), or plain bitwise operations to achieve our goal.
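A small sketch of both ideas, with hypothetical Role and CmsUser types. The roles READER, EDITOR, and ADMIN are assumptions for this example, not a fixed list.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical user roles; a new type of user is one more constant here.
enum Role { READER, EDITOR, ADMIN }

final class CmsUser {
    // A single field covers every combination of roles.
    private final Set<Role> roles;

    CmsUser(Role first, Role... rest) {
        this.roles = EnumSet.of(first, rest);
    }

    boolean has(Role role) {
        return roles.contains(role);
    }
}
```

A user who is both editor and admin is created with new CmsUser(Role.EDITOR, Role.ADMIN). Adding a fourth role later doesn't change the shape of the data, only the list of constants.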
Not all data types are created equal. Some are more convenient to use, others are not. This can be from a technical standpoint or just our personal experience with different data types.
How (not) to save dates
A long time ago, I was working on a small pet project to display the “Japanese era names” based on a date. Instead of saving the start and end of an era in separate fields, I decided to encode both dates into a 32-bit integer and extract the values with bitwise operations.
It was a really efficient way, at least regarding the bit count. But I had to introduce an abstraction layer to make working with the actual values bearable. And the data was not human-readable in the database, which complicated debugging.
Using two date values instead of a single integer would have been a much better decision.
It would have cost me some more bits.
But the overhead I created instead introduced bugs and made the code harder to reason about than necessary.
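To give an impression of the difference, here is one possible way to pack two dates into an int. It is not the encoding from the original project, which isn't shown here. The base date, the 16-bit day offsets, and all names are assumptions for this sketch.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

final class PackedEra {
    // Hypothetical base date; both dates are stored as day offsets from it.
    private static final LocalDate BASE = LocalDate.of(1868, 1, 1);

    // Packs two day offsets into one int: start in the upper 16 bits,
    // end in the lower 16 bits. Each offset must fit into 16 bits.
    static int pack(LocalDate start, LocalDate end) {
        long s = ChronoUnit.DAYS.between(BASE, start);
        long e = ChronoUnit.DAYS.between(BASE, end);
        if (s < 0 || s > 0xFFFF || e < 0 || e > 0xFFFF) {
            throw new IllegalArgumentException("date out of range");
        }
        return ((int) s << 16) | (int) e;
    }

    // The unsigned shift is required: the packed value may be negative.
    static LocalDate start(int packed) {
        return BASE.plusDays(packed >>> 16);
    }

    static LocalDate end(int packed) {
        return BASE.plusDays(packed & 0xFFFF);
    }
}

// The plain alternative: two fields, readable in a database and a debugger.
final class Era {
    final LocalDate start;
    final LocalDate end;

    Era(LocalDate start, LocalDate end) {
        this.start = start;
        this.end = end;
    }
}
```

The packed variant needs a range check, a base date everyone has to agree on, and an unsigned shift that is easy to get wrong. The plain variant needs none of that.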
Working with Colors
Such examples can be found with almost any kind of value.
Storing RGB values technically requires a range of 256 distinct values per channel. Each channel fits in 8 bits, or 1 byte, like a char in C.
But using an integer can be a better solution because associating a numeric value with a char might not be the first thing that comes to mind.
Also, some programming languages won’t implicitly cast the values, making the code noisier.
We should validate the values anyway, so why not use a more numeric data type than char?
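In Java, the trade-off might look like the following sketch. RgbColor is a hypothetical type, and viaByte exists only to show what happens when a channel is squeezed into Java's 8-bit type, which is signed.

```java
// Hypothetical color type: int channels, validated once at construction.
final class RgbColor {
    final int red;
    final int green;
    final int blue;

    RgbColor(int red, int green, int blue) {
        this.red = check(red);
        this.green = check(green);
        this.blue = check(blue);
    }

    private static int check(int channel) {
        if (channel < 0 || channel > 255) {
            throw new IllegalArgumentException("channel out of range: " + channel);
        }
        return channel;
    }

    // For comparison: stores the channel in a byte and reads it back.
    static int viaByte(int channel) {
        byte stored = (byte) channel; // explicit narrowing cast required
        return stored;                // widened back to int, sign included
    }
}
```

RgbColor.viaByte(200) returns -56, because 200 is outside the range of a signed byte. Getting the original value back requires masking with 0xFF on every read, which is exactly the kind of noise the int version avoids.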
Performance Impacts & Memory Consumption
One aspect that I’ve mostly omitted so far is the impact on performance and memory requirements.
Today’s computer systems have a lot of memory and CPU cycles to throw around. But they still aren’t free, and only a finite supply exists. We often don’t give much thought to the impact a particular data type will have on performance.
And most of the time, it doesn’t actually matter much.
There are scenarios where every single bit counts. But not all of us deal with “high-frequency trading”, or the hardware constraints from embedded software development, etc.
Choosing a data type solely due to performance reasons is actually premature optimization, and should be avoided. That doesn’t mean that we can just ignore how data types differ in memory consumption and performance requirements. But we need to understand when it matters, and when not.
Java has primitive wrapper types, so primitives can be used where only object types are allowed, e.g., generics. They are interchangeable and will be converted to their corresponding type implicitly (autoboxing and unboxing).
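A short illustration of the implicit conversion between int and Integer. The class and method names are made up. The comments describe the calls javac inserts for boxing and unboxing.

```java
import java.util.ArrayList;
import java.util.List;

final class BoxingDemo {
    // Builds the list 1..n; generics only accept the wrapper type Integer.
    static List<Integer> firstN(int n) {
        List<Integer> values = new ArrayList<>();
        for (int i = 1; i <= n; i++) {
            values.add(i); // boxing: compiled as values.add(Integer.valueOf(i))
        }
        return values;
    }

    static int sum(List<Integer> values) {
        int total = 0;
        for (Integer value : values) {
            total += value; // unboxing: compiled as total += value.intValue()
        }
        return total;
    }
}
```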
The compiler will do the actual work of casting and converting the variables. This is really convenient, but it also creates an invisible overhead to our code.
An actual type cast translates into 3 opcodes if the compiler can’t optimize the cast away.
The Java Virtual Machine (JVM) has many optimized opcodes for different primitive types to mitigate the overhead. But not all primitive types have the same kind of opcodes.
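One place where this shows up in everyday code is arithmetic on small types. The JVM has an add instruction for int (iadd), but none for byte or short, so both operands are promoted to int and the result must be narrowed again. A minimal sketch with a made-up class name:

```java
final class SmallTypeArithmetic {
    // Adds two bytes. Both operands are promoted to int before the addition,
    // so the result has to be narrowed back with an explicit cast.
    static byte add(byte a, byte b) {
        return (byte) (a + b);
    }

    // The same operation on int needs no conversion at all.
    static int add(int a, int b) {
        return a + b;
    }
}
```

Besides the extra conversion, the narrowing cast silently wraps: adding the bytes 100 and 100 yields -56, while the int version yields 200.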
Know Your Runtime
A boolean only needs a single bit to represent its current value.
But that’s not how many computer architectures work.
A byte is the smallest amount of addressable memory in most cases.
On top of this behavior is how our runtime will handle the memory internally.
For example, the Java Virtual Machine uses 32-bit slots, so any smaller data type might induce a penalty compared to an int.
And it will be stored in a 32-bit slot anyway.
To choose the right data type subconsciously, we need the knowledge and experience to judge what “right” actually means. It’s highly subjective, but we can still try to achieve the best result possible.
Even if the required range is small enough to fit in a data type, we need to consider how to handle it in our code. The right data type at the creation of a value might not be the right data type for how it will be used later on. Every cast adds overhead, and any additional logic might introduce bugs.
Performance and memory impact is a valid concern, but it shouldn’t be the primary driver of our decision unless absolutely necessary. How a particular runtime maps memory to different data types depends on the actual implementation, the underlying hardware (x86 vs. x64), etc. So it always helps to know a thing or two about our environment’s memory design.