Skip to content
SciTools Blog
Menu
  • Home
  • Blog
  • Support
  • Contact
  • Login
  • Pricing
  • Free Trial
Menu
Sunrise at North Twin Lake

Duplicate Lines of Code

Posted on May 31, 2024

Abstract: revisiting how we detect and report duplicate lines of code, and creating a plugin metric to report it.

If you asked how to find duplicated lines of code with Understand, you were probably directed to this Perl interactive report plugin. So, if a solution has been hanging around for this long, why revisit it now?

The interactive report lists all the duplications. But what if I want to know the file with the most duplicated lines. Then I need a duplicated lines metric, like CBRI Insight. So, that means I’ll be writing a duplicated lines of code metric plugin.

Perl Duplicate Lines of Code Detection

I’d rather avoid reinventing the wheel, so the first step is to understand how the Perl interactive report works. The metric plugin has to be written in Python so using the Perl code directly is not an option.

Actually, the Perl version I examined was for CBRI Insight since I was looking at all the metrics in the plugin.  I’m not great at Perl, but here’s the outline I got from the script:

  1. Create an index from a region of code (N lines long) to the file and line number for every file. Each region of code is processed according to the report options for stripping whitespace and comments.
  2. Create a list of possible matches from every index entry that had more than one location.
  3. Ensure each location occurs in only one match. So, if 6 lines are duplicated and the window was 5 lines big, then there would be two index entries, one for the first five lines and one for the last five lines. This step seems to be intended to remove the second index entry from the list of possible matches.
  4. Grow each remaining match. Add another line to the match object while the text at all locations continues to match. Text at locations is found by opening the file(s) again.
  5. CBRI Insight additionally calculates the lines for each file by summing the length of every match the file is involved in.  

Sample Code

So far so good. But step 5 made me wonder what happens if the same line belonged to multiple match objects. Would the line get counted twice? I’ll create some sample code and see what happens.

As a human looking at the code, I’d expect to see two match objects, one with the framed purple boxes and a second with the shaded purple boxes. The total number of duplicated lines would be 21. 

Here’s what CBRI Insight reports:

6 useful lines duplicated in 2 locations
setrefgraph/duplicates.cpp(13)
setrefgraph/duplicates.cpp(22)
5 useful lines duplicated in 3 locations
setrefgraph/duplicates.cpp(5)
setrefgraph/duplicates.cpp(12)
setrefgraph/duplicates.cpp(21)

The matches are pretty close to expected. The only difference is that the shaded purple boxes are one shorter, starting at lines 13 and 22 instead of 12 and 21.

However, the total number of duplicated lines is wrong. The formula does double count lines in the overlapping region (6 * 2 + 5 * 3 = 27).

Snowballing

Getting the duplicated lines metric without double counting lines seems straightforward. It doesn’t even need steps 3-5. For each potential match, add each location to a set to ensure its unique and then find the size of the set. 

However, as a user, I’d want to be able to get from the duplicated line count to a view of the duplicated lines. So, my metric plugin needs to agree with the other views of duplicated lines in Understand.

Then, I also need to check the original Perl interactive report (since I’ve been looking at the CBRI Insight version) and I need to check the relatively new CodeCheck check. As a bonus, if the CodeCheck has the information I need, then the metric can be calculated from the CodeCheck results to avoid running the duplicate code detection more than once. 

Unfortunately, the original Perl script reported only a single match that doesn’t correspond to either of the expected matches and the CodeCheck script didn’t find any matches.

[-] 5 lines matched in 2 locations
setrefgraph/duplicates.cpp(5)
setrefgraph/duplicates.cpp(21)
[-] View Duplicate Lines
if (val1 == 0) {
calc += val2;
calc *= val3;
calc -= val2;
calc += val3

The New Problem

So, even though calculating the metric only requires steps one and two, I’m going to need to also implement steps 3 and 4 to create an interactive report and a CodeCheck so all three are consistent.

At this point, it’s worth taking a step back to consider whether there are better ways of detecting duplicate code. To summarize a search in that area, it turns out there are lots of different types of duplicate code, from exact copies, to copies with insertions or deletions, to copies with name changes and order changes. Since I’m not a patent lawyer, I’ll stay close to the original approach.

However, I’d like to avoid reading the file twice and I want to support multiple matches starting at the same location like my sample code. So, I’m going to modify the original algorithm to be a little closer to a genome assembly algorithm and track what the next node is. So the sample code would form this graph with a window of 5 lines and showing only nodes and edges with at least two locations:

Each node is still an initial match. In addition, any edge with at least two locations but whose number of locations does not match both nodes, is also a match. So the first edge in the diagram is also a match. Then, instead of extending matches by re-opening the file, adjacent matches are merged together as long as the number of locations is identical. So, the final result is the two desired matches:

Final Results

The plugins can be downloaded from our plugin repository. They all rely on a central file, duplicates.py. So it’s best to download the whole folder. Furthermore, to be as consistent as possible, the ireport and CodeCheck both save data that the metrics plugin can read. So, if CodeCheck is run, then the metrics will be calculated from the same results as CodeCheck was. If neither CodeCheck nor the IReport is run, then default option values for the number of lines, ignoring whitespace, and ignoring comments are used by the metrics.

Results for OpenSSL with 10 lines of code ignoring whitespace and comments.

The file preprocessing code is separate from the match code. So, future work could ignore useless lines of code like CBRI Insight or replace names to detect copies with renames. We’d love to hear from you about what you’d like to see.

Learn more about Understand's Features

  • Dependency Graph
    View Dependency Graphs
  • Comply with Standards
  • View Useful Metrics
  • Team Annotations
  • Browse Depenedencies
  • Edit and Refactor

Related

  • API
  • Architectures
  • Business
  • Code Comparison
  • Code Comprehension
  • Code Navigation
  • Code Visualization
  • Coding Standards
  • Dependencies
  • Developer Notes
  • DevOps
  • Getting Started
  • Legacy Code
  • Licensing
  • Metrics
  • Platform
  • Plugins
  • Power User Tips
  • Programming Practices
  • Uncategorized
  • Useful Scripts
  • User Stories
  • May 2025
  • January 2025
  • December 2024
  • November 2024
  • August 2024
  • June 2024
  • May 2024
  • March 2024
  • February 2024
  • January 2024
  • December 2023
  • November 2023
  • October 2023
  • September 2023
  • August 2023
  • June 2023
  • April 2023
  • January 2023
  • December 2022
  • November 2022
  • September 2022
  • August 2022
  • May 2022
  • April 2022
  • February 2022
  • January 2022
  • December 2021
  • November 2021
  • October 2021
  • September 2021
  • August 2021
  • July 2021
  • June 2021

©2025 SciTools Blog | Design: Newspaperly WordPress Theme