Assessing Possible Changes in Source Code (a.k.a. My boss asked me to change this code)

TLDR:

To better assess the quality of COVID simulation source code, Ken uses Understand to scope the work needed to verify changes made to a candidate function “setHouseoldAges()” by looking at who calls that function and any side-effects the function has on globals.

Key things he learned:

  1. He will have to familiarize himself with the OpenMP Parallel programming framework.
  2. If changing the function parameters, or code that would affect the passed in arguments, he will have to look through 1500 lines of code with a total cyclomatic complexity of 200)
  3. This function works its magic through side-effects on one very widely used global. Checking out these side-effects may get thorny… with 27 functions, 4300 loc, and a whopping 1059 points of complexity to step through.

He can use this information to work with his software lead to develop a reasonable expectation of what can/should be done, up to and including a full re-write with modern techniques and possibly a modern functional language.

Details:

What If I Change function AssignHouseHoldAges()?

I’ve been exploring the source code for the COVID Simulation model from Imperial College in London, UK. As part of that analysis, I’m picking core functions and assessing what I’d need to do if I changed them. This exercise gives me a stronger assessment of the nature and quality of the code than pure metrics numbers might – ultimately helping decisions about whether maintenance, some refactoring, or if re-writing is the best path forward.

I’ve picked a mid-pack function, in terms of size and complexity, named “AssignHouseholdAges”. It has a comment block, which is sort of helpful in a non-helpful way…

Thailand gets special treatment. Not sure what to do about Slovakia… or 246 other countries of the world.

Work Needed to Change a Function

  1. Whatever it is you want to do to what the function does (maybe handle countries other than Thailand).
  2. Work needed to ensure that those that call that function handle those changes.
  3. Work needed to track down consumers of side-effects of that program and ensure that they correctly handle whatever you did to the function.

I’m going to show how to sort out #2 and #3 using Understand.

Sorting Out Who Uses This Function

Understand faithfully pushes information about what I’m looking at to the Information Browser docked in its default position at the bottom left:

It’s not that many lines of code but has a fairly high Cyclomatic Complexity value signifying that what code is there includes a lot of decisions.

I’ve seen worse. And at least there is only one loop.

I note also that it is a void function, returning nothing, so it acts only by side-effect on globals. Hmmm… that’s a bit of a strike – especially if I wanted to make this code run on concurrent processors or machines to speed it up.

Two important fields help me in my question to understand/scope the work needed if I change this function:

  1. Called By – who calls this function, and who calls them, and who calls them… all the way to main() or whatever callback/interrupt service routine is their root function.
  2. Globals Used – this field reports any/all globals this function uses and/or sets. As this is a void function acting only by global variable side-effects, I’m going to pay particular attention to what globals it sets.

Assessing impact on those who call AssignHouseholdAges()

The Information Browser (IB) above shows me the first calling function is SetupModel(). Right-clicking and choosing Expand All will show me the full tree, which ends up being pretty small.

Just 3 calls. Whew. Not much to check there.

The “References” field also shows this in a detailed way where I can walk each use:

Clicking on the Call reference takes me to the call back in SetupModel.cpp:

Would you say those are descriptive argument names?

I’m sort of expecting this, as all the indicators are that this is a setup function, likely called at the start, and is mainly working thru globals.

If I made changes to the parameters of the functions I’d have to go to each of the places that called and verified they handled it right.

I’d have more digging to do to see what “m”, “i”, and “tn” were, and where / how they were set.

Exploring “m” a bit is concerning…. its IB shows a LOT of sets/uses:

And when I explore “tn” – which I think stands for “number of threads”, I find this code:

Found dependence on OpenMP parallel processing #pragma

indicating a dependence on the OpenMP parallel programming framework. This is encouraging, I’d already been wondering about using globals in a parallel context, and the global I’m impacting in for this function (Hosts) is being semaphored under this OpenMP framework.

The nut of it is that I’ve never used OpenMP framework, so one of the “work” items I’m going to have to do to change this, or consider changes to this, is study OpenMP. That’s cool – I’m always up for learning but it does add to the work needed to do this right.

Work so far…

  1. Familiarize with OpenMP framework
  2. Verify that changes I make to AssignHouseholdAges() match expectations in the arguments m, i, and tn. This isn’t just if I change parameters, some of these arguments could have implicit knowledge of what is expected to be done and I’m gonna need to check it. Right now this is adding up to 3 functions, and about 1500 lines of code, with cyclomatic complexities of 147 and 53.

What globals does AssignHouseholdAges() set?

I’m getting a bit more familiar with AssignHouseholdAges() now and I know that whatever work it does is done by modifying globals (who are semaphored by tools in the OpenMP framework). So I’m learning…

Now let’s see what Globals we are using. This is pretty easy using the IB “Globals” field:

This is useful, I can click and see what they are:

Since this function works by side-effect, I want to see what those are. And the IB can tell me that too. I click on this part:

to give me details on the uses, and then sort those by reference type – looking for “set”:

And a bit of luck… there is just one place a global gets set – the ENTIRE point of this function boils down to one line:

But my hopes are dashed because it’s just walking thru the horribly named local array “a”, and when I look at “a”…

It’s an array of int, size 12, which is done this way:

MAX_HOUSEHOLD_SIZE is a macro defined as 10. +2 is magic number of which there is no rationale given for its use. )-:

There are a LOT of sets and uses of “a”, this shows about 1/2 of them. I’m probably gonna have to visit all of them. Which the IB makes pretty easy to do.

Back to the global Hosts

We explored “a” because it is what set the global “Hosts”. Now we need to check out Hosts and in particular this particular sub-section of Hosts.

Hosts is a “person”, which looks like this (via the Understand Data Members graph). So I’m going to hazard an informed guess that this whole function’s purpose is to set up an array of people and set their age by some to-be-learned method so that their age can be factored into their viral outcomes.

Graphical view of the data structure for Person

Now that I know that I’m dealing with the “age” portion of a Person, I can start focusing on the impact changing it will have. First I’ll look at functions in the callby tree that use or set “age” (using the Object References w/ Callby graph):

What functions read and set the “Ages” field of Person

That’s not looking too bad.

And when the time comes I can use the IB to visit each place it is used to make sure those uses factor in the changes I’ve made – that’s about 70 references, across 27 functions, adding up to x lines of code and y cyclomatic complexity.

And it gets a bit thorny… I’ve got 27 functions to visit, adding up to this amount of code and complexity:

I visited each of these in the IB, but a Python API script would have been better.
I figured this out by visiting in the IB. Not bad one time, but a Python would work better. 

Work to do:

  1. Figure out what “a” is. It has some internal structure used elsewhere, mapping to Hosts.
  2. Make my changes, and make sure they all are good with “a”. That’s about 70 uses, across 180 lines of code, usually in lines of code that look like this:a[i] = State.InvAgeDist[ad][(int)(1000.0 * ranf_mt(tn))];
  3. Verify that consumers of the global Hosts that this function is affecting (thru “a”) handles the changes I made. That’s about 27 function, 4395 lines of code, and a total complexity of 1059.

Summary:

This is not the worse code I’ve seen. And what I’ve done above using Understand is somewhat of a worst-case analysis. The key, though, is I learned:

  1. There is an external framework I need to familiarize myself with to work in this code
  2. If I change the parameters, I’ve got 3 functions to work thru, about 1500 lines of code, moderately complex.
  3. If I change side-effects it gets thorny… 27 functions, 4300 loc, and a whopping 1059 points of complexity to step through.

As an aside, isn’t it better to know? Isn’t it better to let your boss know and to work from knowledge to make a good, believable plan?

If it were easy everybody could do it. Modeling infections of the world is hard. The source code to do it is gonna be a challenge to write and modify. The key is to make a plan, a reasonable, informed plan, and then work it.

Or… use this information to talk your boss into re-writing it using modern techniques in a modern functional language.

Your call. Good luck!