You know how it goes — you’re in the middle of a team code review when something starts to feel… familiar. That was me. Same structure. Same logic. Same comments. Same AI. I dug through an old repo, and sure enough — near-identical code, copied and repackaged.
It wasn’t plagiarism, but it was close. And it made one thing clear: we needed a better way to detect deep code duplication across our projects.
Why Traditional Tools Didn’t Cut It
At first, I reached for PMD and SonarQube. They’re great — for basic stuff. But they mostly flag surface-level similarities and formatting issues. What they miss are the deeper clones: renamed variables, reordered logic, slight rewrites that still do the same thing.
We needed something that understood semantics, not just syntax.
Enter CodeBERT
A colleague mentioned CodeBERT — “It’s like BERT, but for code.” That phrase stuck. CodeBERT is a transformer-based model trained on GitHub code, capable of understanding structure, data flow, and context in ways static analysis tools just can’t.
I figured — if NLP models can detect semantic similarity in natural language, maybe we can do the same for source code.
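The core idea really is that small: embed each snippet with CodeBERT and compare the embeddings with cosine similarity. Here's a minimal sketch using the public microsoft/codebert-base checkpoint; mean-pooling the last hidden state is just one simple choice, not the only (or best) way to do it.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

def embed(code: str) -> torch.Tensor:
    # Tokenize and truncate to the model's 512-token context window
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the last hidden state into one vector per snippet (a simple choice)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

def similarity(code_a: str, code_b: str) -> float:
    # Cosine similarity between the two embeddings (1.0 means same direction)
    return torch.nn.functional.cosine_similarity(embed(code_a), embed(code_b), dim=0).item()

print(similarity("def add(a, b): return a + b",
                 "def total(x, y): return x + y"))
```

Renamed variables barely move the score, which is exactly the property static analyzers were missing.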

Flask + CodeBERT = MVP Time
I’m not an ML expert. But I do love building stuff. So I grabbed Flask, set up a quick API, and began experimenting.
- One endpoint: /detect
- Input: Two code snippets
- Output: Similarity score
Simple. Or at least, it started that way.
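A bare-bones version of that endpoint looks roughly like this. The field names and the imported similarity() helper are placeholders standing in for the sketch above, not the exact production code.

```python
from flask import Flask, jsonify, request

from detector import similarity  # hypothetical module wrapping the CodeBERT sketch above

app = Flask(__name__)

@app.route("/detect", methods=["POST"])
def detect():
    # Expect a JSON body with two snippets; "code_a"/"code_b" are my own field names
    payload = request.get_json(force=True)
    score = similarity(payload.get("code_a", ""), payload.get("code_b", ""))
    return jsonify({"similarity": round(score, 4)})

if __name__ == "__main__":
    app.run(debug=True)
```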
Early Chaos
My first version flagged a bubble sort and a React component as “clones.” Yeah… tokenization matters.
After cleaning the input, normalizing formatting, and tweaking threshold values, I finally had something usable. CodeBERT started identifying true logic clones — even when variable names and minor details changed.
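The cleanup step was mostly mundane. A rough idea of it, assuming Python input: strip comments and whitespace noise before anything reaches the model, and fall back to crude collapsing for code the tokenizer can't parse.

```python
import io
import tokenize

def normalize(code: str) -> str:
    # Drop comments and layout-only tokens so formatting differences
    # don't skew the embedding; non-Python input gets whitespace-collapsed instead.
    try:
        tokens = [
            tok.string
            for tok in tokenize.generate_tokens(io.StringIO(code).readline)
            if tok.type not in (tokenize.COMMENT, tokenize.NL) and tok.string.strip()
        ]
        return " ".join(tokens)
    except (tokenize.TokenizeError, SyntaxError):
        return " ".join(code.split())

print(normalize("def add(a, b):\n    # adds two numbers\n    return a + b\n"))
```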
The “Oh Crap” Moment
I ran it on a live repo.
It flagged dozens of near-duplicate functions across services — especially around request validation. Turns out, copy-pasting slightly edited functions was common practice. (Guilty.)
Suddenly, we had visibility. The conversation changed from “why is this duplicated?” to “how do we centralize this logic?” That awareness alone was worth it.
Refactoring with Purpose
Refactoring is always tricky — especially when you’re touching code that technically “works.”
Using the similarity scores, we prioritized anything in the 85–95% range. These weren't exact matches, but close enough to centralize. We built a shared utils package and added a pre-commit hook to flag clones in PRs (just as a warning, not a blocker).
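The hook itself can stay simple. Here's a stripped-down, warning-only sketch; the service URL, the shared_utils path, and the 0.85 cutoff are illustrative, and the real version compared at function level rather than whole files.

```python
import subprocess
import sys
from pathlib import Path

import requests

DETECT_URL = "http://localhost:5000/detect"   # hypothetical local detector service
UTILS_DIR = Path("shared_utils")              # hypothetical shared package
THRESHOLD = 0.85

def staged_python_files() -> list:
    # Only look at files added/copied/modified in the current commit
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    )
    return [f for f in result.stdout.splitlines() if f.endswith(".py")]

def main() -> int:
    for staged in staged_python_files():
        code_a = Path(staged).read_text()
        for existing in UTILS_DIR.rglob("*.py"):
            resp = requests.post(
                DETECT_URL, json={"code_a": code_a, "code_b": existing.read_text()}
            )
            score = resp.json()["similarity"]
            if score >= THRESHOLD:
                print(f"warning: {staged} is {score:.0%} similar to {existing}")
    return 0  # warning only, never block the commit

if __name__ == "__main__":
    sys.exit(main())
```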
Initially, some teammates hated it. (“Big Brother is watching my functions!”) But after a few iterations, they saw the value. Bugs dropped. Code reviews got faster. Onboarding became easier.
Unexpected Wins
One junior dev told me:
“I ran the detector and saw my solution was 90% similar to something already in the repo… so I just reused that instead.”
That’s culture shift. That’s learning. And that’s the real win.
We even built a dashboard to visualize “duplication hotspots” in our codebase. Watching clone density drop week by week? Immensely satisfying.
Lessons Learned (The Hard Way)
Thinking of building your own? Keep this in mind:
- Tokenize and normalize your input before passing it to the model.
- Similarity scores are subjective; tune your threshold based on context (there's a rough example of our buckets after this list).
- Expect false positives — nothing’s perfect.
- Understand that CodeBERT has no architectural awareness. It sees patterns, not intent.
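For context, here's roughly how we ended up bucketing scores. The cutoffs are examples tuned against our own repo, not universal truths.

```python
def classify(score: float) -> str:
    # Example buckets only; calibrate against your own codebase
    if score >= 0.95:
        return "near-exact duplicate: consolidate now"
    if score >= 0.85:
        return "likely logic clone: review for extraction"
    if score >= 0.70:
        return "similar shape: usually fine"
    return "unrelated"
```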
Final Thoughts: Totally Worth It
This started as a frustration-fueled side project. It turned into a tool we use across teams.
Not perfect. Not polished. But powerful.
If you’ve ever wished you had a second set of eyes during code reviews — ones that don’t miss patterns or get tired — an AI-powered clone detector might be exactly what you need.
You don’t need to be a machine learning expert. Just curious, resourceful, and okay with breaking things to make better ones.
I started this out of annoyance. I finished it with pride.