Every academic their own text-matcher


Plagiarism, text matching, and academic integrity

Every modern academic teacher is in thrall to giant text-matching systems such as Ouriginal or Turnitin. These systems are sold as "plagiarism detectors", which they are not - they are text matching systems, and they generally work by providing a report showing how much of a student's submitted work matches text from other sources. It is up to the academic to decide if the level of text matching constitutes plagiarism.

Although Turnitin sells itself as a plagiarism detector, or at any rate a tool for supporting academic integrity, its software is closed source, so, paradoxically, there's no way of knowing if any of its source code has been plagiarized from another source.

Such systems work by having access to a giant corpus of material: published articles, reports, text on websites, blogs, previous student work obtained from all over, and so on. The more texts a system can try to match a submission against, the more confidence an academic is supposed to have in its findings. (And the more likely an administration will be to see fit to pay the yearly licence costs.)

Of course, in the arms race of academic integrity, you'll find plenty of websites offering advice on "how to beat Turnitin"; in the interests of integrity I'm not going to link to any, but they're not hard to find. And of course Turnitin will presumably up its game to counter these methods, and the sites will be rewritten, and so on.

My problem

I have been teaching a fully online class; although my university is slowly trying to move back (at least partially) into on-campus delivery after 2 1/2 years of Covid remote learning, some classes will still run online.

My students were completing an online "exam": a timed test (un-invigilated) in which the questions were randomized so that no two students got the same set of questions. They were all "Long Answer" questions in the parlance of our learning management system; at any rate, each question provided a text box for the student to enter their answer.

The test was to be marked "by hand". That is, by me.

Many of my students speak English as a second language, and although they are supposed to have a basic competency sufficient for tertiary study, many of them struggle. And if a question asks them to define, for example, "layering" in the context of cybersecurity, I have not the slightest problem with them searching for information online, finding it, and copying it into the text box. If they can search for the correct information and find it, that's good enough for me. This exam is also open book. As far as I'm concerned, finding correct information is a useful and valuable skill; testing only what students can remember, "in their own words", is pedagogically indefensible.

So, working my way grimly through these exams, I had a "this seems familiar..." moment. And indeed, searching through some previous submissions I found exactly the same answer submitted by another student. Well, that can happen. What is less likely to happen, at least by chance, is for almost all of the 16 questions to have the same answers as other students' submissions. People working in the area of academic integrity sometimes speak of a "spidey sense": a sort of sixth sense that alerts you that something's not right, even if you can't quite yet pinpoint the issue. This was that sense, and more.

It turned out that the entire test and all answers could be downloaded and saved as a CSV file, and hence loaded into Python as a Pandas DataFrame.
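Loading the dump is a one-liner; the file name below is just a placeholder for whatever the learning management system called its export.

import pandas as pd

# Hypothetical file name for the LMS export of the exam and its answers.
examdata = pd.read_csv("exam_dump.csv")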

My first attempt had me looking at all pairs of students and their test answers, to see if any of the answer text strings matched. And some indeed did. Because of the randomized nature of the test, one student might receive as question 7, say, the same question that another student might see as question 5, or question 8.

The data I had to work with consisted of two DataFrames. One contained all the exam information:

examdata.dtypes

Username      object
FirstName     object
LastName      object
Q #            int64
Q Text        object
Answer        object
Score        float64
Out Of       float64
dtype: object

This DataFrame was ordered by student, and then by question number. This meant that every student had up to 16 rows of the DataFrame. I had another DataFrame containing just the names and cohorts (there were two distinct cohorts, and this information was not given in the dump of the exam data to the CSV file).

names.dtypes

Username     object
FirstName    object
LastName     object
Cohort       object
dtype: object

I added the cohorts by hand. This could then be merged with the exam data:

data = examdata.merge(names, on=["Username", "FirstName", "LastName"], how="left").reset_index(drop=True)
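The code later on works with a DataFrame called my_data holding only my own cohort's answers; a minimal sketch of how that might be obtained, where "A" is a stand-in for whatever label my cohort had in the Cohort column:

# "A" is a placeholder for the actual cohort label.
# reset_index so that the rows are labelled 0, 1, 2, ... for the .at[] lookups used below.
my_data = data.loc[data["Cohort"] == "A"].reset_index(drop=True)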

String similarity

Since the exam answers in my DataFrame were text strings, any formatting that a student might have used in an answer, such as bullet points, a numbered list, a table, or font changes, was ignored. All I had to work with were ASCII strings.

However, exact string matching led to very few results. This is because there might have been differences in leading or trailing whitespace or other characters, or cases where one student's submission included another student's submission as a substring. Consider these two (synthetic) examples:

  • "A man-in-the-middle attack is a cyberattack where the attacker secretly relays and possibly alters the communications between two parties who believe that they are directly communicating with each other, as the attacker has inserted themselves between the two parties." (from the Wikipedia page on the Man-In-The-Middle attack.)

  • "I think it's this: A man-in-the-middle attack is a cyberattack where Mallory secretly relays and possibly alters the communications between Alice and Bob who believe that they are directly communicating with each other, as Mallory has inserted himself between them."

There are various ways of measuring the distance between strings, or alternatively their similarity. Two much-used methods are the Jaro similarity measure (named for Matthew Jaro, who introduced it in 1989) and the Jaro-Winkler measure, a variant also named for William Winkler, who discussed it in 1990. Both are defined on their Wikipedia page. Winkler's measure adds to the original Jaro measure a factor based on the length of any common beginning substring.
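For the record (the Wikipedia page has the full details), the Jaro similarity of strings \(s_1\) and \(s_2\) is

\[ \mathrm{sim}_J = \begin{cases} 0 & \text{if } m = 0,\\ \dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m-t}{m}\right) & \text{otherwise,} \end{cases} \]

where \(m\) is the number of matching characters and \(t\) is half the number of transpositions; the Jaro-Winkler similarity is then

\[ \mathrm{sim}_W = \mathrm{sim}_J + \ell p\,(1 - \mathrm{sim}_J), \]

where \(\ell\) is the length of the common prefix (capped at 4) and \(p\) is a scaling factor, conventionally 0.1.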

It turns out that the Jaro-Winkler similarity of the two strings above is about 0.78. If the first "I think it's this: " is removed from the second string, then the similarity increases to 0.89.

Both the Jaro and Jaro-Winkler measures are happily implemented in the Python jellyfish package. This package also includes some other standard measurements of the closeness of two strings.
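Here is a minimal sketch of how the numbers above can be reproduced with jellyfish, using the two example strings quoted earlier:

import jellyfish as jf

wiki = ("A man-in-the-middle attack is a cyberattack where the attacker secretly "
        "relays and possibly alters the communications between two parties who believe "
        "that they are directly communicating with each other, as the attacker has "
        "inserted themselves between the two parties.")
student = ("I think it's this: A man-in-the-middle attack is a cyberattack where Mallory "
           "secretly relays and possibly alters the communications between Alice and Bob "
           "who believe that they are directly communicating with each other, as Mallory "
           "has inserted himself between them.")

print(jf.jaro_winkler_similarity(wiki, student))    # about 0.78
print(jf.jaro_winkler_similarity(wiki, student.removeprefix("I think it's this: ")))    # about 0.89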

My approach was to find the pairs of submissions whose Jaro-Winkler similarity exceeded 0.85. I chose this threshold empirically, by checking a number of (what appeared to me to be) very similar submissions and computing their similarities.

Some results

In this class there were 39 students, divided into two cohorts: 12 were taught by me, and the rest by another teacher. I was only concerned with mine. There were 16 questions, but not every student answered every question, and so the maximum size of my DataFrame would be \(12\times 16=192\) rows; in fact I had a total of 171 different answers. The numbers of questions answered by my students were:

11, 16, 14, 16, 16, 16, 15, 13, 12, 12, 16, 14

and so (to avoid comparing any pair of submissions twice) I aimed to compare each student's submissions to the submissions of all students below them in the DataFrame. This makes for 13,383 comparisons. In fact, because I'm a lazy programmer, I simply compared every submission to every submission below it in the DataFrame (which meant that I was also comparing submissions from the same student), for a total of 14,535 comparisons.
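A quick sanity check of those counts:

from math import comb

answers = [11, 16, 14, 16, 16, 16, 15, 13, 12, 12, 16, 14]    # answers per student
total = sum(answers)                                          # 171 answers in all

all_pairs = comb(total, 2)                          # 14535: every pair of answers
same_student = sum(comb(n, 2) for n in answers)     # 1152: pairs coming from a single student
print(all_pairs, all_pairs - same_student)          # 14535 13383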

This is how (assuming that the jellyfish package has been loaded as jf):

match_list = []
N = my_data.shape[0]
for i in range(N):
    for j in range(i + 1, N):
        # Jaro-Winkler similarity of the two answer strings
        jfs = jf.jaro_winkler_similarity(my_data.at[i, "Answer"], my_data.at[j, "Answer"])
        if jfs > 0.85:
            match_list.append([my_data.at[i, "Username"], my_data.at[j, "Username"],
                               my_data.at[i, "Q #"], my_data.at[j, "Q #"], jfs])

I ended up with 33 matches, which I put into a DataFrame:

matches = pd.DataFrame(match_list, columns=["ID 1", "ID 2", "Q# 1", "Q# 2", "Similarity"])

As you see, each row of the DataFrame contained the two student ID numbers, the relevant question numbers, and the similarity measure. Because of the randomisation of the exam, two students might get the same question but with a different number (as I mentioned earlier).

To see if any pair of students appeared more than once, I grouped the DataFrame by their ID numbers:

dg = matches.groupby(["ID 1", "ID 2"]).size()
dg.values

array([ 1,  1,  1,  1,  1,  1,  1, 11,  1,  1,  1,  1,  1,  1,  1,  1,  2,
        1,  1,  2,  1])

Notice something? There's a pair of students who submitted very similar answers to 11 questions! Now this pair can be isolated:

maxd = max(dg.values)
cheats = dg.loc[dg.values == maxd].index[0]
c0, c1 = cheats

The matches can now be listed:

collusion = matches.loc[(matches["ID 1"] == c0) & (matches["ID 2"] == c1)].reset_index(drop=True)

and we can print off these matches as evidence.
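One way of doing that (just a sketch, assuming my_data is still available as above) is to pull the two answer texts out of my_data for each matched pair and print them together:

for _, row in collusion.iterrows():
    # Retrieve each student's answer text for the matched question.
    a0 = my_data.loc[(my_data["Username"] == row["ID 1"])
                     & (my_data["Q #"] == row["Q# 1"]), "Answer"].iloc[0]
    a1 = my_data.loc[(my_data["Username"] == row["ID 2"])
                     & (my_data["Q #"] == row["Q# 2"]), "Answer"].iloc[0]
    print(f'Questions {row["Q# 1"]} and {row["Q# 2"]}, similarity {row["Similarity"]:.3f}:')
    print("  ", a0)
    print("  ", a1)
    print()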