Reviewing C/C++ Code for Security Vulnerabilities

Jonathan Boyd (jboyd@securityinnovation.com)

Max Chauhan (mchauhan@securityinnovation.com)

Recently, we have been tasked with reviewing millions of lines of native C and C++ source code on Windows and Linux platforms. While reviewing, we’ve noticed some recurring patterns in the code which illustrate weak coding practices. In this post, we will be sharing tips for conducting security code reviews on native source code across platforms; specifically calling out patterns of weak code.

Relative Paths in Command Execute in Unix

Consider the “passwd” command on Unix. This command can be run by any user, but requires permission to modify the “/etc/shadow” file. As such, the command itself is run with elevated privileges. This elevation of privilege is acceptable as the author of the “passwd” command controls the instructions. However, imagine if the “passwd” command executed the following code:

system(“cat /etc/shadow > /tmp/shadow.tmp”);
system(“chmod 600 /tmp/shadow.tmp”);

On the surface, it would appear that the program (running as an elevated user) is making a backup of the “/etc/shadow” file in /tmp. It is then leveraging the “chmod” command to lock down this file. Unfortunately, this fictional example is sloppy and incredibly insecure code. Let’s take a moment and examine why.

First, consider that the name “shadow.tmp” is a known file name. This should be random. Let’s assume an attacker wants to overwrite a file. He can create a symbolic link named “/tmp/shadow.tmp” pointing to a file he doesn’t have permission to overwrite. When the first line runs, that file will be overwritten using the symbolic link with the contents of “/etc/shadow.”

Next, there is a very clear race condition between when the program is creating the backup file and when the backup file is being locked down. The “system” library call does an exec of “/bin/sh” which then does a fork and exec of cat. Cat opens the source file and the contents is written to another file. At this time, an attacker can attempt to access the file. Indeed, while line 2 is running, there is a fork and exec of “/bin/sh” and then another fork and exec of “chmod.” By the time the second fork and exec have occurred, and attacker could have already read the contents of the sensitive backup file.

Finally, we can see both the “cat” and “chmod” commands being called. However, these are not called with absolute paths. As such, these could be the binaries in “/bin/” or “/usr/bin/” or they could be the binaries in: “~attacker/rootkit.” Suppose an attacker has a program named “cat” in the rootkit folder under his home directory. Suppose this Trojan cat program resembles the following:

#/bin/sh
/bin/cp /bin/sh /tmp/rootshell
/bin/chmod 4777 /tmp/rootshell

This copies a shell to /tmp and changes the permissions so it will run as root. As a result, if an attacker were to execute the fictional passwd command in the rootkit directory, he would then have a root shell on the system. The moral of the story is that when executing commands using the “system” library function, always specify the fully qualified name for all commands. This seems easy, but consider the following example:

system(“/usr/bin/cat /etc/passwd | awk ‘{print $1}’ | xargs touch”);

In this example, we’re specifying the fully qualified path for the “cat” command. However, we forgot to specify the fully qualified path for the “awk” and “xargs” commands. Oops!

Relative Paths on Windows

While Unix and Windows have their differences, they also have some similarities. For example, consider the following code from a program developed for the Windows platform:

LoadLibrary(“TestTools.dll”);

This code will effectively load the “TestTools” DLL into memory. Presumably, code after this would “GetProcAddress” to a function exported by that DLL and execute that resulting function. What could go wrong?

Suppose an attacker has dropped a Trojan DLL named “TestTools.dll” onto the Desktop. Further, suppose the application is started from the Desktop folder. We will notice the DLL being loaded is not loaded with a fully qualified path. Instead, it is loaded with a relative path. When the program searches for the DLL, it might find the right DLL, but it might find the Trojan DLL instead. If it finds the Trojan DLL, that DLL will Load. When it loads, it will call “DllMain” which is the DLL version of “main” or “wmain” on Windows. Once the attackers Trojan DLL’s “DllMain” code is executed, his code will execute in the context of the victim; potentially stealing sensitive data.

When I do not have the source code, I typically look for this sort of vulnerability using the ApiMonitor and hook the LoadLibrary entry point. It will be clear when the application is attempting to find and load a library and, to attack it, simply write your own Trojan and drop it on the box.

Native Memory Management Issues

When using higher-level languages like Java and C#, developers do not have to worry memory issues. When a developer allocates memory, it’s in the form of an object. If there’s not enough memory, an exception will be thrown. When that object goes out of scope, the garbage collector cleans it up in the background. Unfortunately, when managed developers write native code, they may not fully understand how memory works. Consider the following example:

void fn(){
wchar_t buffer[42];
wchar_t * psz = (wchar_t *) malloc(42 * sizeof(wchar_t));
memset(psz, 0x0, 42 * sizeof(wchar_t));
fwscanf(pFile, “%s, buffer);
}

The preceeding example code has some issues. First,the “42.” What is that?! It’s a sign of sloppy code. It *should* be a defined constant, not just some magic number hanging out in source code. What happens if they change 42 in one place, but forget to in another?

Secondly, consider the “malloc” library call. It turns out that “malloc” can happily return NULL in low memory conditions. If “malloc” were to return null, then “psz” would be NULL. If “psz” is NULL during the “memset” call then the program will attempt to write to NULL. This will result in the program crashing. As a result, it’s generally considered a good idea to test for NULL before referencing memory. It should be noted that “malloc” isn’t the only call that can return NULL. So can HeapAlloc, LocalAlloc, and other dynamic memory allocation methods.

Next, let’s pretend we’re reading in from a file using the “fwscanf” library function. This will read a string in from a file, but in the above code we’re using a “%s.” This will read until the end of the string is reached in the file. If the file is malformed or the string is greater than 42 wchars, this will result in a stack-based buffer overflow.

Finally, when this function returns the “psz” variable will go out of scope. When that happens, the memory will be leaked. There is no GC. If this is in a long-running service, it could lead to system performance degradation.

For six lines of code, there are a lot of things wrong!

Unicorns and Unicode

In a former life, I spent several years doing systems programming for various flavors of Unix. Being Fortune’s Fool, I later found myself as an engineer at Microsoft writing code. This was my first exposure to developing Unicode-aware applications. All of my code previously had been using ASCII. One character was represented by a single byte. I found myself having to convert strings from ASCII to Unicode. Little did I know then, but there be Dragons!

Consider the following code:

void ConvertUserInput(char * userInput) {
TCHAR tszDestination[42];
MultiByteToWideChar(CP_ACP, 0, userInput, -1, tszDestination, sizeof(tszDestination));

It turns out TCHAR is either char or wchar_t depending on whether this is compiled as ASCII or Unicode. For a moment, let’s assume this has been compiled with unicode. If we look at the “MultiByteToWideChar” function, we see the userInput as a parameter. We also see -1 as the length. This is just telling the function to read userInput until it hits a null terminator. We then see the destination variable, tszDestination, and then the amount of space in tszDestination. Only, “sizeof” returns the number of bytes, NOT the number of TCHARs. So, lets assume that userInput is 80 ASCII characters. The tszDestination will contains 42 wchar_t characters. However, we are telling the “MultiByteToWideChar” function that it will hold 84 (42 * 2…the sizeof 42 wchar_t elements). When the copy happens, it will overwrite the stack-based buffer. This is extremely scary and a little embarrasing! When I review code, I typically look for all calls to “MultiByteToWideChar” and ensure the last parameter is sizeof(tszDestination)/sizeof(tszDestination[0]) to ensure it’s the right size.

Recently, one of my teammates and I were reviewing millions of lines of native code. We decided to write a script to look for this pattern. We were trying to think of a name for the tool and decided the odds of finding this pattern were about the same as finding a Unicorn. Since we were looking for Unicode conversion issues, we decided to call the tool Unicorn Slapper! This has since been renamed to RandomVoodoo and is somewhat of a work-in-progress.

It should be noted that there are a pleasant variety of encoding schemes including UTF-7, UTF-8, and UTF-16. The lengths of these are not necessarily what you would expect. Indeed, the lengths may vary. As a result, great care should always be taken when converting between encodings.

Hardcoded Passwords and Hashes

When I started at Security Innovation, my manager handed me a metaphorical stack of money and told me to go buy a computer. Having previously been operating on a restricted budget, I went a little wild. I hunted down the fastest biggest super-computer-like laptop I could find. I even threw in a few bucks of my own to add some special features (racing stripes, extra RAM, etc). With my new laptop in hand, I felt heady with power! So, I started writing intensive scripts to parse through massive amounts of source code. My first thought was, given a mountain of code, how can I find hidden secrets? So, I wrote a script to parse C++ and C# and extract the variables and string literals in the code. I then tested the variables to see if they contained interesting names like “secretkey”, “password”, “passwd”, “encrypted”, etc. Oddly enough, this turned out to be a great way to find backdoors!

However, developers can be tricky! While reviewing code, we noticed some hardcoded hashes. It turns out, these were hashed passwords. When a user logged in, it would hash the users password and compare it to the hardcoded password. So, armed with my new laptop, I wrote a script to parse out hardcoded hashes. I then read in the entire “rockyou” password list, hashed the contents with a variety of hashing algorithms, and then did parallel table lookups for each of the hardcoded hashes. Rainbow tables are nice, but for a quick sanity test, the script worked really well!

Password Protected Pfx Files

Pfx files contain sensitive cryptographic information that can be used for signing software. While reviewing the source code for one of our clients, we noticed several of these files were included with the source code. We attempted to import the files, but they appeared to be password protected. So, we decided to create a script that tokenizes the source code looking for possible passwords. We then used that script, along with the “rockyou” list to attempt to crack the password-protected Pfx files which had been checked into the tree. The resulting tool was named PersephonePfx, named after the character from the Matrix Reloaded who gave up the Keymaster. It was very effective!

Deadlock Detection

Deadlocks occur when a lock is acquired, and then not subsequently released. These can occur when there are error conditions that happen that cause an early return. Occasionally a developer will forget to release the lock in that specific branch. This can lead to a deadlock, as the lock will still be acquired, and never released later.

In well maintained code, this is usually handled nicely. They have statements such as “if (error) goto cleanup1” The cleanup section handles the releasing of the lock, so even if the error occurs, there will not be a deadlock.

The hardest part about this is that not all code is written equally. Another developer may choose to have four separate if-else statements, and release the lock in each one of those. But what happens if they don’t release it in every section? Or what if they just have two if statements, and assume that the input will fall into one of the categories or the other, and are only releasing the lock in the ”if” statements. This means that if neither if statement is hit, the lock won’t be released.

To deal with this, I wrote some code to recurse through and try all return paths. If any of the return paths didn’t release a lock that had previously been acquired, it would flag it. Consider the following code:

void fn(){
Lock.Acquire();
if(g_A) {
printf(“g_A”);
return;
}
Lock.Release();

If this function is called multiple times, and during one of those times g_A evaluated to true, it could lead to a deadlock.

Another large headache I ran into during this part was dealing with wrappers. Lots of projects wrote their own wrapper for acquiring and releasing locks. The wrappers would obviously get flagged, because they were acquiring a lock without releasing it. To handle this I made my code ignore a function if it acquires a lock and doesn’t release it. This might miss some deadlocks, but I’d prefer that to false positives.