https://arxiv.org/abs/2010.01700
Mossad: Defeating Software Plagiarism Detection
Breanna Devore-McDonald, Emery D. Berger
Automatic software plagiarism detection tools are widely used in educational settings to ensure that submitted work was not copied. These tools have grown in use together with the rise in enrollments in computer science programs and the widespread availability of code on-line. Educators rely on the robustness of plagiarism detection tools; the working assumption is that the effort required to evade detection is as high as that required to actually do the assigned work.
This paper shows this is not the case. It presents an entirely automatic program transformation approach, Mossad, that defeats popular software plagiarism detection tools. Mossad comprises a framework that couples techniques inspired by genetic programming with domain-specific knowledge to effectively undermine plagiarism detectors. Mossad is effective at defeating four plagiarism detectors, including Moss and JPlag. Mossad is both fast and effective: it can, in minutes, generate modified versions of programs that are likely to escape detection. More insidiously, because of its non-deterministic approach, Mossad can, from a single program, generate dozens of variants, which are classified as no more suspicious than legitimate assignments. A detailed study of Mossad across a corpus of real student assignments demonstrates its efficacy at evading detection. A user study shows that graduate student assistants consistently rate Mossad-generated code as just as readable as authentic student code. This work motivates the need for both research on more robust plagiarism detection tools and greater integration of naturally plagiarism-resistant methodologies like code review into computer science education.
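As I read the abstract, the genetic-programming-inspired search could be sketched roughly like this (a hypothetical illustration only; the mutation operator and the similarity score here are invented stand-ins, not the paper's actual implementation):

```python
import random

# Hypothetical sketch of a genetic-programming-style evasion loop:
# repeatedly apply a semantics-preserving mutation and stop once the
# variant's similarity to the original drops below a threshold.

def insert_dead_code(lines, k):
    """One semantics-preserving mutation: add an unused statement."""
    i = random.randrange(len(lines) + 1)
    return lines[:i] + [f"_unused_{k} = 0  # dead code"] + lines[i:]

def similarity(a, b):
    """Crude stand-in for a detector score: Jaccard overlap of lines."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def evade(original, threshold=0.5, max_steps=200):
    """Mutate until the variant looks sufficiently unlike the original."""
    variant = list(original)
    for step in range(max_steps):
        if similarity(variant, original) < threshold:
            break
        variant = insert_dead_code(variant, step)
    return variant
```

Real detectors hash token streams rather than whole lines, so Mossad's actual fitness function is certainly more involved; the point is only that the outer loop is a cheap, fully automatic search.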
The tool is basically a deterrent against the very lowest-hanging cheating fruit for students (some still tried, thinking that changing the variable names would help them...)
Though a counterargument to this would be that teachers don't have time to interview every student. If Mossad is so good that teachers can't pick out an objectively suspicious subset, they might have to subjectively pick a random sample, with varying amounts of personal bias involved.
It was simple: I let students work in groups to do coding stuff (it's an intro-type class with students of varying skill levels). I had them work on a project together all they wanted, letting them know it would be turned in about a month or so before the end of the semester. I would review the projects and then, in class, each student would INDIVIDUALLY be quizzed on their own team's project, down to e.g.
"You have a function blahblah, explain what it does. What would happen if I passed it X?"
Forces them to work together and sort of study together. It puts a bit more pressure on the less knowledgeable, but that's probably worth it.
Do you have any small examples of a program that was transformed/generated with Mossad that we could compare against the original? As far as I can tell, the paper just has a really tiny example function.
As it turns out, a lot of student code can look this way anyway. Something like 70% of authentic student assignment submissions can contain dead code.
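For a concrete flavor, here is a hypothetical illustration of the kind of semantics-preserving dead-code insertion the paper describes (my own toy example, not actual Mossad output):

```python
# Original submission
def average(xs):
    return sum(xs) / len(xs)

# A Mossad-style variant: identical behavior, padded with dead code
# that shifts the token fingerprint a detector would hash.
def average(xs):  # redefinition is deliberate, for side-by-side comparison
    unused_total = 0          # never read afterwards
    for x in xs:
        unused_total += x     # dead computation, result discarded
    if False:
        return -1             # unreachable branch
    return sum(xs) / len(xs)
```

Which is exactly why "it contains dead code" is useless as a suspicion signal if authentic submissions are full of it too.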
In addition, does using Stack Overflow or LLM-generated code count as cheating? That would be horrible in and of itself, though it's a separate concern.
*edited mobile typos
In case of a false positive from a faulty detector this is extraordinary evidence.
There’s an open source tool whose idea I love (basically a tool for declarative integration tests), but I really don’t like its implementation. I tried to contribute to improve it, but it’s too much work and it will never fit my ideal.
So I basically decided to "redo it but better", and I’m also tempted to make it a paid, proprietary tool because my implementation diverges enough that I consider it a different codebase altogether (and it would bring legitimate value to companies). I wrote my code from scratch but still had some knowledge of the original code base so I’d be interested in running something like JPlag to make sure I didn’t accidentally plagiarize open source code.
I hope I find a way to make it compare 2 codebases :)
If you didn't plagiarize, you don't need to run the tool. If you did plagiarize and want to hide it, tho...
Plagiarism is not always clear cut because life is messy. That’s why Wine doesn’t allow contributions from people who have seen Windows source code[1] for instance, even though it could be good faith contributions with experience instead of plagiarism
[1]: https://wiki.winehq.org/Developer_FAQ#Who_can't_contribute_t...?
Interestingly, whenever I discussed my thesis, the first reaction from others often revolved around moral concerns.
It seems like the latter based on their wiki, but also that that corpus can be relatively small.
Would that eventually make it impossible to write code even in a clean room, because all possible code strings and mutations thereof would already be patented?
For example, are three notes or chords copyrightable?
It is pretty much impossible to detect software plagiarism, especially on leetcode-style questions, where often only one style or pattern is the most efficient answer.
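For instance (a hypothetical illustration using the canonical two-sum problem): two honest solutions written independently tend to converge on the same hash-map shape, which is exactly what makes them hard to tell apart from copies.

```python
# Two independently written solutions to "find the indices of two
# numbers summing to a target". The idiomatic one-pass hash-map
# approach leaves little room for structural divergence.

def two_sum_a(nums, target):
    seen = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i

def two_sum_b(values, goal):
    index_of = {}
    for j, v in enumerate(values):
        complement = goal - v
        if complement in index_of:
            return [index_of[complement], j]
        index_of[v] = j
```

A token-based detector would fingerprint these almost identically, yet neither author copied anything.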
Though if a student changes it sufficiently enough, they might begin to actually see the invariants and ideas and actually learn the material.
When doing things for real in the workforce, reinventing core parts yourself is often not the best approach, if only because you'll reinvent already-fixed bugs and waste time, and that should be explained too. But understanding how things work below the outer layer of what would otherwise be black boxes lets you better understand when things go wrong, and puts you in a better position to assess whether a pre-made library/service/other is the most suitable option¹. Also, building things from scratch helps teach complexity analysis and, at a slightly higher level, security analysis, both of which are very useful, often vital, at much higher and/or more abstract levels.
If building from first principles, or close to, is being drilled into students as the way to do things full stop, then those students are being taught poorly. It isn't how I remember learning way back when I was last called a student.
For example, my understanding of how b-trees and their relatives work, partly from having built routines to manage them in the dim and distant past, along with a number of other similar bits of knowledge, helps my understanding of how many DBMSs work in general and how certain optimisations at higher levels² do or don't work. I doubt I'll ever need to build any structure like that from anything close to first principles, but having done so in the past was not wasted time. The same goes for knowledge of filesystem construction, network protocols, etc. – I'll probably not use those things directly, but the understanding helps me make choices, create solutions³, and solve problems less directly.
--------
[1] or at least a suitable option
[2] things I do in the query syntax, what the query planner/runner can/can't do with that, etc.
[3] I am sometimes the local master of temporary hacky solutions that get us over the line and allow time to do things more right slightly later instead of things failing right now.
Kind of makes sense, doesn't it? While in school, you want to learn as much as possible (ideally?), while in the workforce, you want (or the company wants you) to be as efficient as possible. Different goals lead to different workflows.
I think the same connection can be made to StackOverflow. If you are/were a computer science student and you did a lot of copy-pasting and not a lot of thinking and trialing, there's a really good chance you didn't get to suffer the mistakes during development that you now know to avoid as a graduate. We have all taken advantage of code from StackOverflow, and as a tool it aids development, but when you treat it as a crutch, you're screwed the moment it doesn't have the answers you need.
One case I saw literally yesterday at work: we had a dev who had written a lot of copy-pasted code, saying it couldn't be generalized to take advantage of a mapping we already have to generate a UI. This dev had not yet had the fortune of learning how to properly abstract this kind of problem in this specific circumstance. I sat down with him for a moment, and instead of spitting out the nonsense the LLM was trying to get him to write, we paired on the issue until we had a more general solution.
He learned some abstraction concepts, we all got a better code base, and he learned a way to help tease an LLM into a better solution going forward. That foundation was required to get that better solution though, in this situation.
Generally speaking, I think you should know all your underlying concepts so you can audit any new development assisting tools.
A surprisingly large number of people do not realise that code on StackOverflow is under a relatively restrictive license.
https://meta.stackexchange.com/questions/12527/do-i-have-to-...
Some companies take this very seriously and others do not care at all. And, of course, there are companies that outright ban any library without the right licence.
"return a + b": plagiarism, disqualified.
"return a + 1 + b - 1": A+
It is funny, but that doesn’t mean it’s wrong.
PS If the domain in your profile is supposed to be valid … it isn’t.
Thanks, I need to redo my website/blog/whatever. I'm still getting email at it!
Where on earth did you go to school that you couldn't just call list.sort() after your first algorithms class? I've never seen a class that teaches students that everything must be implemented from first principles. They only do that for the concepts taught in that class, or 1-2 classes prior, to make sure you are actually learning and have a clue what you're doing. Which... should make sense? You should want engineers to cut costs based on understanding rather than ignorance.