How to read Java Bytecode for fun and profit
We cover what Java Bytecode is, why it exists, and how to read it. This can help you with optimizing performance, enhancing security, and reverse engineering.
Jan 31, 2024 • 6 Minute Read
Embarking on a journey through the world of Java Bytecode? This article covers everything you need to know to get started.
What is bytecode?
Back in 1995, Sun Microsystems, the creator of the Java programming language, made a bold claim: Java would allow you to “write once, run anywhere.” That meant that compiled binaries would be able to run on any system architecture, something that C could not do. That promise remains a core tenet of writing Java to this day.
To achieve this cross-platform capability, Java employs a unique approach when compiling. Instead of going from source code directly into machine code (which would be specific to each system architecture), Java compiles its programs into an intermediate form known as bytecode. Bytecode is a set of instructions that is neither tied to a particular machine language nor dependent on any specific hardware architecture. This abstraction is the key to Java's portability.
The program that executes Java bytecode is called a Java Virtual Machine (JVM). The JVM turns bytecode into the machine code native to the particular system architecture it is running on. It typically starts by interpreting instructions one at a time and then compiles frequently executed code to native machine code while the program runs, a process referred to as "just-in-time" (JIT) compilation. This allows Java bytecode to be executed efficiently on any given platform.
Viewing Bytecode
Bytecode isn’t just useful for the JVM, though. Because the bytecode of a Java class is helpful for reverse engineering, performance optimization, security research, and other static analysis functions, the JDK ships with utilities to help you and me inspect it.
To see an example of bytecode, consider the following two methods from `java.lang.Boolean`, `booleanValue` and `valueOf(boolean)`, which respectively unbox and box the `boolean` primitive type:
```java
public boolean booleanValue() {
    return value;
}

public static Boolean valueOf(boolean b) {
    return (b ? TRUE : FALSE);
}
```
Using the `javap` command, which ships with the JDK, we can see the bytecode for each. You can do this by running `javap` with the `-c` option and the fully-qualified name of the class, like so:
```bash
javap -c java.lang.Boolean
```
The result is the bytecode for the methods in `java.lang.Boolean` (by default, `javap` leaves out private members). Here I’ve copied just the bytecode for `booleanValue` and `valueOf(boolean)`:
```java
public boolean booleanValue();
  Code:
     0: aload_0
     1: getfield      #7      // Field value:Z
     4: ireturn

public static java.lang.Boolean valueOf(boolean);
  Code:
     0: iload_0
     1: ifeq          10
     4: getstatic     #27     // Field TRUE:Ljava/lang/Boolean;
     7: goto          13
    10: getstatic     #31     // Field FALSE:Ljava/lang/Boolean;
    13: areturn
```
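If you would rather capture this output from inside a Java program, the JDK exposes `javap` through `java.util.spi.ToolProvider`. The following is a minimal sketch, assuming you are running on a full JDK (version 9 or later) rather than a trimmed runtime image; the class name `JavapRunner` and the method `disassemble` are made up for illustration.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.spi.ToolProvider;

// Hypothetical helper: runs `javap -c <className>` in-process and returns its output.
class JavapRunner {
    static String disassemble(String className) {
        // javap is only registered as a ToolProvider when the JDK tools are present
        ToolProvider javap = ToolProvider.findFirst("javap")
                .orElseThrow(() -> new IllegalStateException("javap not found; a full JDK is required"));

        StringWriter out = new StringWriter();
        StringWriter err = new StringWriter();
        PrintWriter outWriter = new PrintWriter(out);
        PrintWriter errWriter = new PrintWriter(err);

        int exitCode = javap.run(outWriter, errWriter, "-c", className);
        outWriter.flush();
        errWriter.flush();

        if (exitCode != 0) {
            throw new IllegalStateException("javap failed: " + err);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(disassemble("java.lang.Boolean"));
    }
}
```

The constant pool indexes (`#7`, `#27`, and so on) can differ between JDK versions, so it is safer to search the output for method names and instructions than for exact index values.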
Dissecting Bytecode
At first glance, it looks like an entirely new language to learn. However, it quickly becomes straightforward as you learn what each instruction does and that the JVM operates with a stack.
Take the three bytecode instructions for `booleanValue`, for example:
`aload_n` means to push the reference stored in local variable `n` onto the stack. In an instance method, local variable 0 holds `this`, so `aload_0` pushes `this`.
`getfield` means to pop an object reference off of the stack (here, `this`), read the named field from it, and place that field's value onto the stack
`#7` is the index of the field reference in the class's constant pool
`// Field value:Z` tells us what `#7` refers to: a field named `value` of type `boolean` (`Z`)
`ireturn` means to pop an `int` value off of the stack and return it (on the stack, the JVM represents `boolean`, `byte`, `char`, and `short` values as `int`)
Long story short, these three instructions look up the instance’s `value` field and return it.
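To make the stack movement concrete, here is a small sketch that replays those three instructions against an explicit stack. It is an illustration only, not how the JVM is implemented; the `Box` class is a hypothetical stand-in for `java.lang.Boolean`, and `BooleanValueStackSketch` is a made-up name.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BooleanValueStackSketch {
    // Hypothetical stand-in for java.lang.Boolean: a single boolean field named value
    static final class Box {
        final boolean value;
        Box(boolean value) { this.value = value; }
    }

    // Replays aload_0, getfield, ireturn using an explicit operand stack
    static boolean booleanValue(Box self) {
        Deque<Object> stack = new ArrayDeque<>();

        stack.push(self);                    // aload_0: push `this`

        Box target = (Box) stack.pop();      // getfield: pop the object reference...
        stack.push(target.value ? 1 : 0);    // ...and push the field (a boolean is an int 0 or 1 on the stack)

        int result = (Integer) stack.pop();  // ireturn: pop the int and return it
        return result != 0;
    }

    public static void main(String[] args) {
        System.out.println(booleanValue(new Box(true)));   // prints "true"
        System.out.println(booleanValue(new Box(false)));  // prints "false"
    }
}
```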
As a second example, take a look at the next method, `valueOf(boolean)`:
`iload_n` means to push the `int` stored in local variable `n` onto the stack. Because `valueOf` is static, there is no `this`, so `iload_0` loads the first method parameter (a `boolean` is handled as an `int`)
`ifeq n` means pop the value off of the stack and compare it to zero (`false`); if it is zero, jump to offset `n`, otherwise continue with the next instruction. The numbers to the left of each instruction are byte offsets, not line numbers
`getstatic #n` means push the value of a static field onto the stack
`#27` is the index of the static field reference in the constant pool
`// Field TRUE:Ljava/lang/Boolean;` tells us what `#27` refers to: a static member named `TRUE` of type `Boolean`
`goto n` means jump unconditionally to offset `n` in the bytecode
`areturn` means pop a reference off of the stack and return it
In other words, these instructions say: take the first method parameter; if it’s true, return `Boolean.TRUE`; otherwise, return `Boolean.FALSE`.
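The jumps are easier to follow if you trace them with a program counter. The sketch below steps through the `valueOf(boolean)` listing offset by offset; it is a teaching aid under the assumption that each `case` label stands for the byte offset shown by `javap`, and the class name `ValueOfBranchSketch` is invented for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class ValueOfBranchSketch {
    // Steps through the valueOf(boolean) listing one offset at a time
    static Boolean valueOf(boolean b) {
        Deque<Object> stack = new ArrayDeque<>();
        int local0 = b ? 1 : 0; // slot 0 holds the parameter because the method is static
        int pc = 0;             // program counter: the byte offset of the current instruction

        while (true) {
            switch (pc) {
                case 0: // iload_0
                    stack.push(local0);
                    pc = 1;
                    break;
                case 1: { // ifeq 10: jump when the popped value is zero
                    int top = (Integer) stack.pop();
                    pc = (top == 0) ? 10 : 4;
                    break;
                }
                case 4: // getstatic TRUE
                    stack.push(Boolean.TRUE);
                    pc = 7;
                    break;
                case 7: // goto 13
                    pc = 13;
                    break;
                case 10: // getstatic FALSE
                    stack.push(Boolean.FALSE);
                    pc = 13;
                    break;
                case 13: // areturn
                    return (Boolean) stack.pop();
                default:
                    throw new IllegalStateException("unexpected offset " + pc);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(valueOf(true));   // prints "true"
        System.out.println(valueOf(false));  // prints "false"
    }
}
```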
Leveraging Bytecode Analysis
I mentioned earlier that reading bytecode can be helpful for reverse engineering, performance optimization, and security research. Let’s expand on those now.
Reverse Engineering
When working with third-party libraries or closed-source components, bytecode analysis becomes a powerful tool. Decompiling bytecode can provide a glimpse into the inner workings of these libraries, aiding in integration, troubleshooting, and ensuring compatibility.
In situations where you encounter proprietary or closed-source Java code, reading bytecode can be the only feasible way to understand its functionality. Bytecode analysis allows you to reverse engineer and comprehend the behavior of closed-source applications, facilitating interoperability or customization.
By way of a real-life example, I was recently trying to integrate a third-party package tangle analysis tool into our CI system. Unfortunately, the tool was closed-source, and the vendor only documented how to access the library through their proprietary UI. By analyzing the bytecode, I was able to reverse engineer the expected inputs and outputs of the underlying analytics engine.
Performance Optimization
With bytecode insight, you can make informed decisions about optimizing specific code segments. For instance, if the bytecode reveals redundant operations, you can refactor the code to eliminate inefficiencies, resulting in a more streamlined and performant application.
Consider the simple scenario of using an enhanced for loop vs. managing your own counter. Alongside other low-level tools like JMH, `javap` can help you learn which version produces fewer or more efficient bytecode instructions.
Suppose you run `javap` against a class that performs these two operations:
```java
// Indexed loop: manages its own counter
for (int i = 0; i < list.size(); i++) {
    sum += list.get(i);
}

// Enhanced for loop
for (Integer i : list) {
    sum += i;
}
```
You can see from the bytecode that the first calls `.size()` each time through the loop and `.get(i)` for every element, while the enhanced for loop obtains an `Iterator` once and then only calls `hasNext()` and `next()`:
```java
 4: iload_2
 5: aload_0
 6: getfield        #19     // Field list:Ljava/util/List;
 9: invokeinterface #25, 1  // InterfaceMethod java/util/List.size:()I
14: if_icmpge       42
17: iload_1
18: aload_0
19: getfield        #19     // Field list:Ljava/util/List;
22: iload_2
23: invokeinterface #29, 2  // InterfaceMethod java/util/List.get:(I)Ljava/lang/Object;
```
vs.
```java
 2: aload_0
 3: getfield        #19     // Field list:Ljava/util/List;
 6: invokeinterface #25, 1  // InterfaceMethod java/util/List.iterator:()Ljava/util/Iterator;
11: astore_2
12: aload_2
13: invokeinterface #29, 1  // InterfaceMethod java/util/Iterator.hasNext:()Z
18: ifeq            41
21: aload_2
22: invokeinterface #35, 1  // InterfaceMethod java/util/Iterator.next:()Ljava/lang/Object;
```
Or, in short, prefer the enhanced for loop or at least an `Iterator`; that’s what the JDK does.
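If you want to reproduce the comparison yourself, a complete class along these lines should work. This is a sketch rather than the exact class behind the listings above: the names `LoopComparison`, `sumIndexed`, and `sumEnhanced` are made up, and the offsets and constant pool indexes you get will probably differ.

```java
import java.util.List;

// Hypothetical class holding both loop styles so javap can disassemble them side by side
class LoopComparison {
    private final List<Integer> list;

    LoopComparison(List<Integer> list) {
        this.list = list;
    }

    // Manages its own counter: calls size() on every pass and get(i) for every element
    int sumIndexed() {
        int sum = 0;
        for (int i = 0; i < list.size(); i++) {
            sum += list.get(i);
        }
        return sum;
    }

    // Enhanced for loop: the compiler emits iterator(), hasNext(), and next() calls
    int sumEnhanced() {
        int sum = 0;
        for (Integer i : list) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        LoopComparison comparison = new LoopComparison(List.of(1, 2, 3));
        System.out.println(comparison.sumIndexed() + " " + comparison.sumEnhanced()); // prints "6 6"
    }
}
```

Compile it with `javac LoopComparison.java` and then run `javap -c LoopComparison` to see both methods. Keep in mind that a shorter listing does not by itself prove faster execution once the JIT compiler has done its work, which is why a benchmarking tool such as JMH is a useful companion to `javap`.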
Security Research
Security is paramount in software development. Bytecode analysis can help identify potential security vulnerabilities by revealing insecure coding practices or unintentional exposure of sensitive information.
See if you can find a problem with the given bytecode:
```java
public boolean verifyLogin(java.lang.String, java.lang.String);
  Code:
     0: ldc           #7      // String josh
     2: aload_1
     3: invokevirtual #9      // Method java/lang/String.equals:(Ljava/lang/Object;)Z
     6: ifeq          20
     9: ldc           #15     // String password
    11: aload_2
    12: invokevirtual #9      // Method java/lang/String.equals:(Ljava/lang/Object;)Z
    15: ifeq          20
    18: iconst_1
    19: ireturn
    20: iconst_0
    21: ireturn
```
What do you think is happening here? The first four instructions compare the first method parameter to the value “josh”, and the next four instructions compare the second method parameter to “password”. If either comparison fails, the method jumps to offset 20, where `iconst_0` pushes `false` and `ireturn` returns it. If both pass, `iconst_1` pushes `true` and that is returned instead.
If you guessed that a successful login is josh/password, then you are correct! The credentials are hardcoded, so anyone who can read the class file can recover them.
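Working backwards from the listing, the original source plausibly looked something like the sketch below. This is a reconstruction, not the actual source: the class name `LoginCheck` and the parameter names are guesses, since `javap -c` shows only parameter types and local variable slots.

```java
// Hypothetical reconstruction of the source behind the verifyLogin bytecode
class LoginCheck {
    // Parameter names are assumptions; the bytecode only reveals slots 1 and 2
    public boolean verifyLogin(String username, String password) {
        // ldc "josh" / aload_1 / equals, then ldc "password" / aload_2 / equals
        if ("josh".equals(username) && "password".equals(password)) {
            return true;  // iconst_1, ireturn
        }
        return false;     // iconst_0, ireturn
    }
}
```

Calling `equals` on the string literal rather than on the parameter matches the order of `ldc` and `aload` in the listing, and it also means a `null` argument returns `false` instead of throwing.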
Conclusion
In the ever-evolving landscape of software development, the ability to read and analyze Java bytecode is a powerful skill. As we've explored in this article, bytecode is more than just a byproduct of the Java compilation process; it's a window into the inner workings of your own and others’ Java applications. By demystifying the complexities of bytecode, we unlock a plethora of opportunities for optimizing performance, enhancing security, and even reverse engineering.
Did you find this article helpful? Then check out Josh Cummings' many video courses on Pluralsight, which cover Java and the Spring framework in depth, such as "Secure Coding Practices in Java Applications" and "Securing Spring Data REST APIs."