  • Stars: 163
  • Rank: 231,141 (Top 5 %)
  • Language: Python
  • License: MIT License
  • Created about 6 years ago
  • Updated 6 months ago

Duplicate Code Detection Tool

A simple Python3 tool (also available as a GitHub Action) to detect similarities between files within a repository.

What?

A command line tool that receives a directory or a list of files and determines the degree of similarity between them.

Why?

The tool is intended to guide the refactoring efforts of a developer who wishes to reduce code duplication within a component and improve its software architecture.

Its development was initiated within the context of the DAT265 - Software Evolution Project.

How?

The tool uses the gensim Python library to determine the similarity between the source code files supplied by the user. The languages supported by default are C, C++, Java, Python and C#.
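To make the idea of "similarity between files" concrete, here is a minimal sketch of one common way to score it: TF-IDF weighting followed by cosine similarity. It is written in plain Python rather than with gensim, and it is an assumption about the general approach, not the tool's actual implementation. The function names (tokenize, tfidf_vectors, cosine) and the sample file contents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(source):
    """Split source text into lowercase identifier-like tokens (a simplification)."""
    return re.findall(r"[A-Za-z_]\w*", source.lower())


def tfidf_vectors(documents):
    """Return one {token: weight} dict per document, using TF * log(N / df)."""
    tokenized = [Counter(tokenize(doc)) for doc in documents]
    doc_freq = Counter()
    for counts in tokenized:
        doc_freq.update(counts.keys())
    total = len(documents)
    return [
        {tok: tf * math.log(total / doc_freq[tok]) for tok, tf in counts.items()}
        for counts in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical file contents, used only to exercise the functions above.
files = {
    "a.py": "def add(x, y): return x + y",
    "b.py": "def add(x, y): return x + y",
    "c.py": "class Car: wheels = 4",
}
vectors = dict(zip(files, tfidf_vectors(list(files.values()))))
sim_ab = cosine(vectors["a.py"], vectors["b.py"])
sim_ac = cosine(vectors["a.py"], vectors["c.py"])
print(f"a.py vs b.py: {sim_ab:.2f}")
print(f"a.py vs c.py: {sim_ac:.2f}")
```

In this sketch a token that appears in every file gets a weight of log(1) = 0, so it contributes nothing to the score. That is one possible reason approaches of this kind tend to benefit from more input files, though the tool's own models may behave differently.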

Dependencies

The following Python packages have to be installed:

  • nltk
    • pip3 install --user nltk
  • gensim
    • pip3 install --user gensim
  • astor
    • pip3 install --user astor
  • punkt (tokenizer data downloaded through nltk, not a pip package)
    • python3 -m nltk.downloader punkt
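Before running the tool, it can help to confirm that the packages listed above are importable. The following is a small sketch using only the standard library; the helper name missing_packages is hypothetical and is not part of the tool. It does not check the punkt data, which is not an importable package.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Package names taken from the dependency list above.
required = ["nltk", "gensim", "astor"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```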

Get started

Run the tool as python3 -W ignore duplicate_code_detection.py, which suppresses the warnings generated by the libraries it uses, and then supply the necessary arguments. More details can be found by running the tool with the --help option.

Notice: Due to the way the models are created, the more source files you provide to the tool, the more accurate the similarity calculations are. In other words, the bigger the project, the more useful the tool is.

Example

If duplicate-code-detection-tool is the name of the directory in which the tool resides and smartcar_shield/src contains the source files you want to check for similarities, then you can run the following to get the similarity report:

python3 -W ignore duplicate-code-detection-tool/duplicate_code_detection.py -d smartcar_shield/src/

The result should look something like this:

[Screenshot: output of the code duplication tool]
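If you prefer to launch the tool from a Python script, the command line from the example can be assembled as an argument list. This is only a sketch: build_command is a hypothetical helper, and it assumes that -d and -f each accept one or more space-separated paths and are mutually exclusive, which this README does not state explicitly.

```python
import shlex


def build_command(script, directories=None, files=None):
    """Assemble the argument list for one run of the tool."""
    # Assumption: exactly one of -d / -f is given per invocation.
    if bool(directories) == bool(files):
        raise ValueError("pass either directories or files, not both or neither")
    command = ["python3", "-W", "ignore", script]
    if directories:
        command += ["-d", *directories]
    else:
        command += ["-f", *files]
    return command


command = build_command(
    "duplicate-code-detection-tool/duplicate_code_detection.py",
    directories=["smartcar_shield/src/"],
)
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.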

GitHub Action

The tool is also available as a GitHub Action for easy integration with projects hosted on GitHub. An example output of the tool can be seen here.

The Action is meant to be triggered during pull requests to give the developers an impression of the degree of similarity between the files in the source code. Below you will find sample workflow files that illustrate the usage.

Depending on the size of your project, you may want to run the tool multiple times (i.e. in different steps), with each run testing a specific part of your repository for duplicate code. This way you will not compare each file in your codebase with everything else, and you will get back more meaningful reports.
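A quick calculation shows why splitting helps. The sketch below counts unordered file pairs for one run over a whole repository versus two separate runs; the file counts (200, 120 and 80) are made-up numbers for illustration, and pair_count is a hypothetical helper.

```python
from math import comb


def pair_count(file_count):
    """Number of unordered file pairs among `file_count` files."""
    return comb(file_count, 2)


# Hypothetical repository: 200 files in total, split into parts of 120 and 80.
whole_repo = pair_count(200)
split_runs = pair_count(120) + pair_count(80)
print("pairs in one run:", whole_repo)
print("pairs in two runs:", split_runs)
```

Under these assumptions, the two separate runs compare roughly half as many pairs, and every reported pair comes from the same part of the codebase.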

Bare minimum

In the following example, the tool will examine source code (in the languages supported by default) in the src/ and test/ut directories relative to the root directory of your repository. The results will be posted as a comment in the pull request that was opened.

name: Duplicate code

on: pull_request

jobs:
  duplicate-code-check:
    name: Check for duplicate code
    runs-on: ubuntu-20.04
    steps:
      - name: Check for duplicate code
        uses: platisd/duplicate-code-detection-tool@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          directories: "src/, test/ut"

Trigger on pull request comment

If you want to avoid the "spam" of a report on every pull request, you can configure the tool so that it does not always run. Specifically, if you wish to trigger the Action manually, you can do so by leaving a comment in the pull request.

The following workflow will run the tool when a comment containing run_duplicate_code_detection_tool is posted in a pull request. The tool will run using the code in the pull request.

name: Duplicate code

on: issue_comment

jobs:
  duplicate-code-check:
    name: Check for duplicate code
    # Trigger the tool only when a comment containing the keyword is published in a pull request
    if: github.event.issue.pull_request && contains(github.event.comment.body, 'run_duplicate_code_detection_tool')
    runs-on: ubuntu-20.04
    steps:
      - name: Check for duplicate code
        uses: platisd/duplicate-code-detection-tool@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          directories: "."

Important: Because of the way GitHub Actions work, you first have to merge this workflow into your main branch before it takes effect.
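The if: condition in the workflow above can be read as a simple predicate over the event payload. The sketch below mirrors it in Python for illustration only; should_run is a hypothetical function, and the payload dictionaries are trimmed, made-up examples rather than real GitHub events.

```python
def should_run(event, keyword="run_duplicate_code_detection_tool"):
    """Mirror the workflow's `if:` expression for an issue_comment payload."""
    issue = event.get("issue", {})
    comment_body = event.get("comment", {}).get("body", "")
    # The `pull_request` key is only present when the comment is on a pull request.
    is_pull_request = bool(issue.get("pull_request"))
    # GitHub's contains() ignores case for strings, so compare lowercased text.
    return is_pull_request and keyword.lower() in comment_body.lower()


# Hypothetical, heavily trimmed payloads.
pr_comment = {
    "issue": {"pull_request": {"url": "https://example.invalid/pulls/1"}},
    "comment": {"body": "Please run_duplicate_code_detection_tool"},
}
plain_issue_comment = {
    "issue": {},
    "comment": {"body": "run_duplicate_code_detection_tool"},
}
print(should_run(pr_comment))
print(should_run(plain_issue_comment))
```

The first payload satisfies both parts of the condition. The second contains the keyword but is not attached to a pull request, so the job would be skipped.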

Optional configuration

It may not make sense to compare all files or to have files with very low similarity reported. The following workflow demonstrates the different optional arguments.

For the various default values, please consult action.yml.

name: Duplicate code

on: pull_request

jobs:
  duplicate-code-check:
    name: Check for duplicate code
    runs-on: ubuntu-20.04
    steps:
      - name: Check for duplicate code
        uses: platisd/duplicate-code-detection-tool@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          directories: "src"
          # Ignore the specified directories
          ignore_directories: "src/external_libraries"
          # Only examine .h and .cpp files
          file_extensions: "h, cpp"
          # Only report similarities above 5%
          ignore_below: 5
          # If a file is more than 70% similar to another, then the job fails
          fail_above: 70
          # If a file is more than 15% similar to another, show a warning symbol in the report
          warn_above: 15
          # Remove `src/` from the file paths when reporting similarities
          project_root_dir: "src"
          # Remove docstrings from code before analysis
          # For Python source code only. This is checked on a per-file basis
          only_code: true
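The three thresholds in the workflow above interact with one another. The sketch below shows one way to read them, based only on the comments in the example; the classify function is hypothetical, and the handling of values that fall exactly on a threshold is an assumption, so consult action.yml and the tool itself for the real behaviour.

```python
def classify(similarity, ignore_below=5, warn_above=15, fail_above=70):
    """Map a similarity percentage to 'ignored', 'ok', 'warning', or 'fail'."""
    if similarity < ignore_below:
        return "ignored"  # not shown in the report
    if similarity > fail_above:
        return "fail"  # the job fails
    if similarity > warn_above:
        return "warning"  # shown with a warning symbol
    return "ok"  # reported without a warning


# Made-up similarity values, one per outcome.
for value in (3, 10, 42, 85):
    print(value, classify(value))
```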

Using duplicate-code-check with pre-commit

To use the Duplicate Code Detection Tool as a hook with pre-commit, add the following to your .pre-commit-config.yaml file:

-   repo: https://github.com/platisd/duplicate-code-detection-tool.git
    rev: ''  # Use the sha / tag you want to point at
    hooks:
    -   id: duplicate-code-detection

NOTE: This repository sets args: -f for the hook. If you configure duplicate-code-detection-tool using args, you'll want to include either -f (--files) or -d (--directories).

Limitations

  • The only_code option only works with Python files for now
