Data Flow

Summary: Our lead engineer uses Understand’s assign references to track data flow and hunt down a memory leak.

Motivation

We frequently hear questions from users about how Understand can be used to track the flow of data through their programs. But what does that mean precisely, and how can we use this kind of analysis to solve real problems? One possible interpretation relates to how the name of data changes as it moves through a call tree (or even within a single function).

There are at least a couple of subtly different kinds of data flow that we might care about. One involves copying the contents of one memory slot to another. This is pass/return/assign by value. The other involves copying the address of a memory slot. This is pass/return/assign by pointer/reference and is also known by the term aliasing.
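To make the distinction concrete, here is a minimal C++ sketch of both kinds of flow. The function and variable names are hypothetical and chosen only for illustration.

```cpp
#include <cassert>

// Pass by value: the callee receives a copy of the caller's memory slot.
void bump_copy(int n) { n += 1; }

// Pass by reference: the callee's parameter aliases the caller's slot.
void bump_alias(int &n) { n += 1; }

// Exercises both kinds of flow and returns the final value of 'count'.
int value_vs_alias() {
    int count = 10;
    int copy = count;     // assign by value: contents copied into a new slot
    int *alias = &count;  // assign by pointer: address copied, same slot

    bump_copy(count);     // modifies only the callee's copy
    bump_alias(count);    // modifies 'count' through the reference
    *alias += 1;          // modifies 'count' through the pointer

    assert(copy == 10);   // the by-value copy never sees later changes
    return count;
}
```

In both cases the data acquires a new name (copy, alias, n), which is the part of the flow that can be followed through named variables.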

While these two kinds of data flow are semantically different, they share a common syntactic mechanism (at least in most common languages). Namely, they are represented by named variables that can be assigned. Both can lead to meaningful questions that a programmer might want to answer.

Assign Reference

For most of its history, the Understand cross-reference database has contained references like “use”, “set” and “modify” for variable entities. The scope of these references has always been defined to be the entity that encloses the reference. For example, function main uses variable argc. Here the function main is the scope and the variable argc is the referenced entity. This is useful and important information and makes a lot of intuitive sense, but these references on their own don’t make it easy to track the flow of data or how its name changes.

A few years ago, a new kind of reference was introduced for this purpose. The “assign” reference appears when one named variable is assigned to another named variable. This includes assignment expressions where variables appear on both the left and right hand sides of the equals operator (int i = j), but, importantly, it also includes assignment from function argument to formal parameter.

The scope and entity of assign references are both variables (j is assigned to i). Assign references chain variable entities together in a way that makes it possible to track how their name changes. And, of course, since the same variable can be assigned any number of times, these references form hierarchies of names that branch out in both directions. The information browser for variables shows these hierarchies in the “Assignments” and “Assigned To” fields.
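As a rough illustration, the comments in the sketch below mark where assign references would be expected, based on the description above. The functions and variable names are hypothetical, and the exact references written depend on the analyzer.

```cpp
#include <cassert>

// 'factor' is a formal parameter; each call site assigns an argument to it.
int scale(int factor) { return factor * 2; }

int run(int input) {
    int local = input;           // assign: input is assigned to local
    int doubled = scale(local);  // assign: local is assigned to factor
                                 // (argument to formal parameter)
    int again = scale(input);    // assign: input is also assigned to factor,
                                 // so the hierarchy branches here
    assert(doubled == again);    // both paths carry the same value
    return doubled;
}
```

Following the chain from input leads to local and factor, and factor is reached along two different paths. This is the kind of branching hierarchy that the “Assigned To” field is described as showing.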

Case Study

During preparation for the recent release of Understand 6.5, we were reviewing the output of an automatic memory leak finding tool. These tools are great at discovering leaks and associating them with the interactions that caused them. Unfortunately, it can still be difficult to track down exactly why the leak happened and who should have been responsible for freeing the memory.

Memory management in C++ can follow several widely used patterns. Whatever the specific pattern, it’s common to think about allocated memory in terms of ownership. Whoever owns the memory is responsible for freeing it when it’s no longer needed. There are parent-child relationships where parent objects are solely responsible for cleaning up their children. Sole ownership can be generalized to shared ownership via smart pointers like std::shared_ptr, where any number of clients own references to a shared object.

Understand, like many C++ programs, employs both these patterns and others, depending on what makes the most sense in a given circumstance. We have hierarchies of GUI widgets that employ the parent-child relationship. Sometimes these widgets need to share ownership of an object at different levels of the hierarchy. They do this by passing around a shared pointer during widget creation.

The automatic leak finding tool uncovered a large number of allocations that could all be traced to a single root object that was managed by a shared pointer. At first this didn’t make a lot of sense. Isn’t a shared pointer supposed to automatically free the object when it’s no longer referenced? We soon reasoned that there must be another much smaller leak lost among the noise of hundreds of allocations for the shared object. That is, there must be a leaked widget somehow disconnected from the hierarchy that owns a reference to the shared pointer.

There are a few different ways to think about finding this hypothetical leaked widget. We could comb through all of the results in the leak finding tool. We could traverse the widget creation call tree in Understand to find which ones are passed the shared pointer. Ultimately, the question we want to answer is, “which other variables is this shared pointer assigned to?” And we want to answer the question recursively. This is exactly what the assign reference is designed to do. More importantly, it answers the question more succinctly and more quickly than any of the other methods.

The actual information browser for this shared pointer looks like this:

In this case, each line in the “Assigned To” tree represents a constructor call where the argument is assigned to a parameter with the given name. So the first line shows the root variable proj assigned to a parameter called settings. The leaf nodes are all copy constructor calls and other is the name of the unresolved parameter in the shared pointer copy constructor implementation.

The highlighted node is where the problem was ultimately found. Clicking on it showed a constructor call missing its parent argument. As expected, the orphaned widget received a reference to the shared pointer, but was itself leaked. Notice how the problem occurs four levels of assignment deep. Notice also how the name changes from proj to settings to proj and finally back to two distinct settings entities. The name change is unfortunate and a bit ugly, but it’s real code, and real code can be messy.
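The shape of the bug can be reproduced in a few lines. The following is a simplified sketch, not the actual Understand code: the Widget and Settings classes are hypothetical stand-ins, and only the names proj and settings are borrowed from the tree described above.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

struct Settings { int value = 0; };

// Hypothetical widget: a parent deletes its children, and every widget
// holds a shared reference to the same Settings object.
class Widget {
public:
    Widget(std::shared_ptr<Settings> settings, Widget *parent = nullptr)
        : settings_(std::move(settings)) {
        if (parent) parent->children_.push_back(this);
    }
    ~Widget() { for (Widget *child : children_) delete child; }

private:
    std::shared_ptr<Settings> settings_;
    std::vector<Widget *> children_;
};

// Builds a two-level hierarchy, tears down the root, and reports how many
// owners of the shared object remain.
long owners_after_teardown(bool pass_parent) {
    auto proj = std::make_shared<Settings>();
    Widget *root = new Widget(proj);
    // Omitting the parent argument orphans this widget.
    Widget *child = new Widget(proj, pass_parent ? root : nullptr);
    (void)child;
    delete root;  // frees the child only if it was attached to the hierarchy
    return proj.use_count();
}
```

When the parent argument is passed, only proj itself still owns the object after teardown. When it is omitted, the orphaned widget is never deleted and keeps its reference alive, so the shared object and everything it owns is reported as leaked even though the shared pointer is behaving correctly.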

After discovering that the assignment tree would be the right tool for the job, the problem was found amazingly quickly. It took about half a minute. That was after poring over the leak stack traces for at least half an hour.

Limitations

Tracking argument to parameter assignment in this way is robust. We could even have mixed in any number of regular assignment expressions without any loss of accuracy. However, not all kinds of assignment conform to the clean one-to-one semantics of shared pointer copy. Sometimes the right hand side of the assignment is a more complex expression involving named variables. For example, int i = j + k and even int i = j + 1 do not form assign references. Should they? We don’t know.
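The sketch below places the three cases side by side. The comments restate the behavior described above, and the function and variable names are hypothetical.

```cpp
#include <cassert>

int combine(int j, int k) {
    int a = j;      // one-to-one: forms an assign reference (j is assigned to a)
    int b = j + k;  // complex right-hand side: no assign reference is formed
    int c = j + 1;  // still an expression, so no assign reference either
    assert(a == j);
    return a + b + c;
}
```

All three variables carry data derived from j, but only a would appear in the “Assigned To” tree for j.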

We also do not currently form assign references for function return values. We considered writing assign references for all named variable returns, or even more generally between the variable and the function itself. That would mean that something like int i = f() would create an assign reference between variable i and function f. This would enable clients to make their own connection to named variables returned by the function. This is the more general and likely scenario, but it breaks the present invariant that the scope and entity of assign references are both variables. We will likely reconsider this point if there is any interest.
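For reference, this is the pattern in question, using hypothetical names. The comments reflect the current behavior as described above.

```cpp
#include <cassert>

int f() {
    int result = 7;
    return result;  // a named variable is returned
}

int caller() {
    int i = f();    // currently no assign reference: neither between
                    // result and i, nor between f and i
    assert(i == 7);
    return i;
}
```

The value clearly flows from result to i, but the chain of names is broken at the function boundary.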

Lastly, not all assignment expressions involve named variables at all. Literal values can be assigned to variables. The Understand database doesn’t represent literals as entities and we only write references between entities. The information browser somewhat makes up for this shortcoming by showing the initial value as the leaf nodes in the “Assignments” tree. Of course, this only works for the initial value and subsequent literal assignments are not represented.

It’s important to note that this feature is only available for C/C++ and only when the strict analyzer is enabled. It should also be considered somewhat experimental and subject to change. We welcome any feedback you may have.
