https://arxiv.org/abs/2010.01700
Mossad: Defeating Software Plagiarism Detection
Breanna Devore-McDonald, Emery D. Berger
Automatic software plagiarism detection tools are widely used in educational settings to ensure that submitted work was not copied. These tools have grown in use together with the rise in enrollments in computer science programs and the widespread availability of code on-line. Educators rely on the robustness of plagiarism detection tools; the working assumption is that the effort required to evade detection is as high as that required to actually do the assigned work.
This paper shows this is not the case. It presents an entirely automatic program transformation approach, Mossad, that defeats popular software plagiarism detection tools. Mossad comprises a framework that couples techniques inspired by genetic programming with domain-specific knowledge to effectively undermine plagiarism detectors. Mossad is effective at defeating four plagiarism detectors, including Moss and JPlag. Mossad is both fast and effective: it can, in minutes, generate modified versions of programs that are likely to escape detection. More insidiously, because of its non-deterministic approach, Mossad can, from a single program, generate dozens of variants, which are classified as no more suspicious than legitimate assignments. A detailed study of Mossad across a corpus of real student assignments demonstrates its efficacy at evading detection. A user study shows that graduate student assistants consistently rate Mossad-generated code as just as readable as authentic student code. This work motivates the need for both research on more robust plagiarism detection tools and greater integration of naturally plagiarism-resistant methodologies like code review into computer science education.
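As I read the abstract, the genetic-programming-inspired search could be sketched roughly like this (a hypothetical illustration only; the mutation operator and the similarity score here are invented stand-ins, not the paper's actual implementation):

```python
import random

# Hypothetical sketch of a genetic-programming-style evasion loop:
# repeatedly apply a semantics-preserving mutation and stop once the
# variant's similarity to the original drops below a threshold.

def insert_dead_code(lines, k):
    """One semantics-preserving mutation: add an unused statement."""
    i = random.randrange(len(lines) + 1)
    return lines[:i] + [f"_unused_{k} = 0  # dead code"] + lines[i:]

def similarity(a, b):
    """Crude stand-in for a detector score: Jaccard overlap of lines."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def evade(original, threshold=0.5, max_steps=200):
    """Mutate until the variant looks sufficiently unlike the original."""
    variant = list(original)
    for step in range(max_steps):
        if similarity(variant, original) < threshold:
            break
        variant = insert_dead_code(variant, step)
    return variant
```

Real detectors hash token streams rather than whole lines, so Mossad's actual fitness function is certainly more involved; the point is only that the outer loop is a cheap, fully automatic search.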
The tool is basically a deterrent against the very lowest-hanging cheating fruit for students (some still tried, thinking that changing the variable names would help them...)
Though a counterargument to this would be that teachers don't have time to interview every student. If Mossad is so good that teachers can't pick out an objectively suspicious subset, they might have to subjectively pick a random sample, with varying amounts of personal bias involved.
It was simple: I let students work in groups to do coding stuff (it's an intro-type class with students of varying skill levels). I had them work on a project together all they wanted, letting them know it would be turned in about a month or so before the end of the semester. I would review the projects and then, in class, each student would INDIVIDUALLY be quizzed on their own team's project, down to e.g.
"You have a function blahblah, explain what it does. What would happen if I passed it X?"
Forces them to work together and sort of study together. It puts a bit more pressure on the less knowledgeable, but that's probably worth it.
Do you have any small examples of a program that was transformed/generated with Mossad that we could compare against the original? As far as I can tell, the paper just has a really tiny example function.
As it turns out, a lot of student code can look this way anyway. Something like 70% of authentic student assignment submissions can contain dead code.
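For a concrete flavor, here is a hypothetical illustration of the kind of semantics-preserving dead-code insertion the paper describes (my own toy example, not actual Mossad output):

```python
# Original submission
def average(xs):
    return sum(xs) / len(xs)

# A Mossad-style variant: identical behavior, padded with dead code
# that shifts the token fingerprint a detector would hash.
def average(xs):  # redefinition is deliberate, for side-by-side comparison
    unused_total = 0          # never read afterwards
    for x in xs:
        unused_total += x     # dead computation, result discarded
    if False:
        return -1             # unreachable branch
    return sum(xs) / len(xs)
```

Which is exactly why "it contains dead code" is useless as a suspicion signal if authentic submissions are full of it too.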
In addition, does using Stack Overflow or LLM-generated code count as cheating? That would be horrible in and of itself, though it's a separate concern.
*edited mobile typos
In case of a false positive from a faulty detector this is extraordinary evidence.
There’s an open source tool whose idea I love (basically a tool for declarative integration tests), but I really don’t like its implementation. I tried to contribute to improve it, but it’s too much work and it will never fit my ideal.
So I basically decided to "redo it but better", and I’m also tempted to make it a paid, proprietary tool because my implementation diverges enough that I consider it a different codebase altogether (and it would bring legitimate value to companies). I wrote my code from scratch but still had some knowledge of the original code base so I’d be interested in running something like JPlag to make sure I didn’t accidentally plagiarize open source code.
I hope I find a way to make it compare 2 codebases :)
If you didn't plagiarize, you don't need to run the tool. If you did plagiarize and want to hide it, tho...
Plagiarism is not always clear cut because life is messy. That’s why Wine doesn’t allow contributions from people who have seen Windows source code[1] for instance, even though it could be good faith contributions with experience instead of plagiarism
[1]: https://wiki.winehq.org/Developer_FAQ#Who_can't_contribute_t...?
Interestingly, whenever I discussed my thesis, the first reaction from others often revolved around moral concerns.
It seems like the latter based on their wiki, but also that that corpus can be relatively small.
Would that eventually make it impossible to write code even in a clean room, because all possible code strings and mutations thereof would already be patented?
For example, are three notes or chords copyrightable?
It is pretty much impossible to detect software plagiarism, especially on leetcode-style questions, where often only one style or pattern is the most efficient answer.
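For instance (a hypothetical illustration using the canonical two-sum problem): two honest solutions written independently tend to converge on the same hash-map shape, which is exactly what makes them hard to tell apart from copies.

```python
# Two independently written solutions to "find the indices of two
# numbers summing to a target". The idiomatic one-pass hash-map
# approach leaves little room for structural divergence.

def two_sum_a(nums, target):
    seen = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i

def two_sum_b(values, goal):
    index_of = {}
    for j, v in enumerate(values):
        complement = goal - v
        if complement in index_of:
            return [index_of[complement], j]
        index_of[v] = j
```

A token-based detector would fingerprint these almost identically, yet neither author copied anything.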
Though if a student changes it sufficiently enough, they might begin to actually see the invariants and ideas and actually learn the material.
When doing things for real in the workforce, reinventing core parts yourself is often not the best approach, if only because you'll reinvent already-fixed bugs and waste time, and that should be explained too. But understanding how things work below the outer layer of what would otherwise be black boxes lets you better understand when things go wrong, and puts you in a better position to assess whether a pre-made library/service/other is the most suitable option¹. Also, building things from scratch helps teach complexity analysis and, at a slightly higher level, security analysis, both of which are very useful, often vital, at much higher and/or more abstract levels.
If building from first principles, or close to, is being drilled into students as the way to do things full stop, then those students are being taught poorly. It isn't how I remember learning way back when I was last called a student.
For example, my understanding of how b-trees and their relatives work, partly from having built routines to manage them in the dim and distant past, along with a number of other similar bits of knowledge, helps my understanding of how many DBMSs work in general and how certain optimisations at higher levels² do or don't work. I doubt I'll ever need to build any structure like that from anything close to first principles, but having done so in the past was not wasted time. The same goes for knowledge of filesystem construction, network protocols, etc. – I'll probably not use those things directly, but the understanding helps me make choices, create solutions³, and solve problems less directly.
--------
[1] or at least a suitable option
[2] things I do in the query syntax, what the query planner/runner can/can't do with that, etc.
[3] I am sometimes the local master of temporary hacky solutions that get us over the line and allow time to do things more right slightly later instead of things failing right now.
Kind of makes sense, doesn't it? While in school, you want to learn as much as possible (ideally?), while in the workforce, you want (or the company wants you) to be as efficient as possible. Different goals lead to different workflows.
I think the same connection can be made to StackOverflow. If you are/were a computer science student and you did a lot of copy-pasting and not a lot of thinking and trialing, there's a really good chance you didn't get to suffer the mistakes during development that you now know to avoid as a graduate. We have all taken advantage of code from StackOverflow, and as a tool it aids development, but when you treat it as a crutch, you're screwed the moment it doesn't have the answers you need.
One case I saw literally yesterday at work: we had a dev who had written a lot of copy-pasted code, saying it couldn't be generalized to take advantage of a mapping we already have to generate a UI. This dev had not yet had the fortune of learning how to properly abstract this kind of problem in this specific circumstance. I sat down with him for a moment, and instead of spitting out the nonsense the LLM was trying to get him to write, we paired on the issue until we had a more general solution.
He learned some abstraction concepts, we all got a better code base, and he learned a way to help tease an LLM into a better solution going forward. That foundation was required to get that better solution though, in this situation.
Generally speaking, I think you should know all your underlying concepts so you can audit any new development assisting tools.
A surprisingly large number of people do not realise that code on StackOverflow is under a relatively restrictive license.
https://meta.stackexchange.com/questions/12527/do-i-have-to-...
Some companies take this very seriously and others do not care at all. And, of course, there are companies that outright ban any library without the right licence.
"return a + b": plagiarism, disqualified.
"return a + 1 + b - 1": A+
It is funny, but that doesn’t mean it’s wrong.
PS If the domain in your profile is supposed to be valid … it isn’t.
Thanks, I need to redo my website/blog/whatever. I'm still getting email at it!
Where on earth did you go to school that you couldn't just call list.sort() after your first algorithms class? I've never seen a class that teaches students that everything must be implemented from first principles. They only do that for the concepts taught in that class, or 1-2 classes prior, to make sure you are actually learning and have a clue what you're doing. Which... should make sense? You should want engineers to cut costs based on understanding rather than ignorance.