Detect Plagiarism With 100% Accuracy


The goal of the project was to build a machine learning system that can detect plagiarism. According to MarketWatch, the global anti-plagiarism software market is projected to reach US$822.40 million by 2025. The growth is expected to come from an increase in online assignment and project submission platforms.

Executive Summary

The project started with a prototyping phase on AWS SageMaker Notebooks. During prototyping, the data was prepared for feature engineering and model training. Feature engineering consisted of writing functions to calculate the containment and longest common subsequence scores, both of which were used to generate the final features. Once the features were generated, the model was trained, starting with the Random Forest algorithm. After training, the prototyping phase was finished, and the whole codebase was rewritten and packaged as infrastructure as code. The project can be deployed in any AWS Region through CloudFormation.


The dataset is a modified version of a dataset created by Paul Clough and Mark Stevenson from the University of Sheffield. It contains files with short questions and answers that have different levels of plagiarisation. Original answers were included in the dataset. The goal of the project was to create a machine learning system that could detect whether a given text is plagiarised. It turned out that the first algorithm evaluated, Random Forest, delivered 100% accuracy in plagiarism detection, meaning that every evaluated answer was classified correctly. The project is packaged as infrastructure as code (CloudFormation). The rollout takes approximately five minutes, including download, preprocessing and endpoint deployment.

Project Details

The project was developed in two phases. The first phase was about prototyping the model and the second phase was about creating infrastructure as code for automated model deployments.

The prototyping phase started with data exploration using the data provided by the University of Sheffield. It is important to be aware that there were five different levels of plagiarism that needed to be detected correctly. The levels were:

  1. cut - copy-pasted from the source,
  2. light - some copying and paraphrasing,
  3. heavy - plagiarised but used different words and structure,
  4. non - no plagiarisation,
  5. origin - original source text, was used only for comparison.

In total there were 100 files, of which 95 were answers from other people and five were the original source texts from Wikipedia. After the exploration came the feature engineering. The first task was to convert categorical data to numerical data, as some machine learning algorithms cannot deal with categorical values. It was also important to introduce a class feature that marks an answer as plagiarised or not. Figure 1.0 shows the modified version of the metadata data frame.

Figure 1.0: Metadata Dataframe with Category and Class Columns
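The encoding step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the numeric codes in `CATEGORY_CODES`, the column names, and the sample rows are all assumptions made for the example, since the write-up does not state the mapping that was used.

```python
# Hypothetical numeric codes; the write-up does not state the actual mapping.
CATEGORY_CODES = {"non": 0, "heavy": 1, "light": 2, "cut": 3, "origin": -1}


def encode_row(row):
    """Return a copy of a metadata row with a numeric Category and a Class label."""
    code = CATEGORY_CODES[row["Category"]]
    if code == -1:
        label = -1  # original source text, used only for comparison
    elif code == 0:
        label = 0   # not plagiarised
    else:
        label = 1   # any level of plagiarism
    return {**row, "Category": code, "Class": label}


# Made-up rows for illustration only.
rows = [
    {"File": "answer_1.txt", "Task": "a", "Category": "non"},
    {"File": "answer_2.txt", "Task": "a", "Category": "cut"},
    {"File": "source_a.txt", "Task": "a", "Category": "origin"},
]
encoded = [encode_row(row) for row in rows]
```

Keeping the source texts under a separate label makes it easy to exclude them later, because they serve only as the comparison baseline and are not training examples.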

The next step was to split the data into training and test sets. A stratified random sampling algorithm was used to split the data by task and degree of plagiarism. Approximately 26% of the data was used for testing and 74% for training. Stratified sampling was chosen because the dataset contained tasks and categories of plagiarism as features, so both should have the same distribution in the training and test sets.

Figure 1.1: Training and Test Set Distribution Example
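A stratified split of this kind could look roughly like the sketch below. It assumes rows with `Task` and `Category` keys (hypothetical names), a fixed seed, and a simple rounding rule per group. The rows are made up, and the exact procedure used in the project may differ.

```python
import random
from collections import defaultdict


def stratified_split(rows, test_fraction=0.26, seed=1):
    """Sample roughly the same test fraction from every (Task, Category) group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[(row["Task"], row["Category"])].append(row)

    train, test = [], []
    for key in sorted(groups):
        members = groups[key][:]
        rng.shuffle(members)
        # Round to a whole number of rows, but always leave one row for training.
        n_test = min(len(members) - 1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Made-up data: 5 tasks x 4 categories x 4 answers each.
rows = [
    {"File": f"{task}_{category}_{i}.txt", "Task": task, "Category": category}
    for task in "abcde"
    for category in range(4)
    for i in range(4)
]
train, test = stratified_split(rows)
```

Because the sampling happens inside each group, every combination of task and category keeps about the same share in both sets, which is the property the paragraph above asks for.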

The main features for the model were containment and the longest common subsequence. The formulas for both features were provided by Paul Clough and Mark Stevenson, the authors of the plagiarism detection paper.

Containment is defined as the count of n-grams shared between the Wikipedia Source Text (S) and the Student Answer Text (A), divided by the n-gram count of the Student Answer Text. A containment of 0 means that the answer shares no n-grams with the source, which points to original text. A containment of 1 means that every n-gram of the answer also appears in the source, which points to plagiarism.
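As an illustration of the definition, the following sketch computes containment with word n-grams. The tokenisation (lowercasing and splitting on whitespace) and the function names are assumptions for this example, and the project's own implementation may tokenise differently.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count word n-grams in a text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def containment(answer_text, source_text, n=1):
    """Shared n-gram count divided by the n-gram count of the answer."""
    answer = ngram_counts(answer_text, n)
    source = ngram_counts(source_text, n)
    total = sum(answer.values())
    if total == 0:
        return 0.0  # the answer is shorter than n words
    # Counter & Counter keeps the minimum count of each shared n-gram.
    shared = sum((answer & source).values())
    return shared / total


answer = "the cat sat on the mat"
source = "the cat lay on the mat"
unigram_score = containment(answer, source, n=1)  # 5 of 6 unigrams are shared
bigram_score = containment(answer, source, n=2)   # 3 of 5 bigrams are shared
```

Larger values of n are stricter: a single changed word breaks every n-gram that spans it, so the bigram score here is lower than the unigram score.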

After the calculation of the containment score, the next task was to calculate the longest common subsequence (LCS). LCS was used as the second input feature for the model.

The longest common subsequence is the longest sequence of words (or letters) that appear in the same order in both the Wikipedia Source Text (S) and the Student Answer Text (A), though not necessarily next to each other. This value is normalised by dividing it by the total number of words (or letters) in the Student Answer Text.
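A word-level version of this score can be sketched with the standard dynamic programming approach. The function name and the whitespace tokenisation are assumptions for the example.

```python
def lcs_norm_word(answer_text, source_text):
    """Length of the longest common word subsequence, divided by the answer length."""
    answer = answer_text.split()
    source = source_text.split()
    if not answer:
        return 0.0
    # prev[j] is the LCS length of the answer words seen so far and source[:j].
    prev = [0] * (len(source) + 1)
    for word in answer:
        curr = [0]
        for j, other in enumerate(source, start=1):
            if word == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / len(answer)


# "the cat on the mat" is common to both texts: 5 of the 6 answer words.
score = lcs_norm_word("the cat sat on the mat", "the cat lay on the mat")
```

Only two rows of the table are kept at any time, so memory grows with the length of the source text rather than with the product of both lengths.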

Containment and LCS were then used to create multiple features, which were analysed for correlation. Features that were too highly correlated with one another were discarded, because they carry largely redundant information and would not improve the classifier.
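One simple way to filter correlated features is sketched below. The 0.95 threshold, the feature names, and the values are hypothetical, and the write-up does not say which cut-off or selection rule was actually applied.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.95):
    """Keep a feature only if |r| with every already kept feature is below the threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up feature columns: c_2 is an exact multiple of c_1.
features = {
    "c_1": [0.1, 0.5, 0.9, 0.3],
    "c_2": [0.2, 1.0, 1.8, 0.6],
    "lcs_word": [0.9, 0.1, 0.5, 0.4],
}
selected = drop_correlated(features)
```

The result depends on the order in which features are visited, since the first feature of a correlated pair is the one that survives.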

Once the features were created, model training started. The goal was to start with a simple machine learning algorithm to obtain the first results, so Random Forest was chosen as the first algorithm. The hyper-parameters were left at their default values for the first run. The model was trained on AWS SageMaker. After the first training job with the default hyper-parameter setup, an accuracy score of 100% was achieved: the algorithm predicted every class label correctly. Figure 1.2 shows the result.

Figure 1.2: Random Forest Result

Only the accuracy score was used: since it was already 100%, there was no need to look at other metrics such as precision or recall. This result concluded the prototyping phase of the project.

After prototyping, it was important to make the project ready for production and to package it as infrastructure as code. Figure 1.3 shows the CloudFormation stack’s output. The code was split into modules that run independently of each other. Data download and preprocessing, including feature generation, happen on a local machine. Training and inference can run on a local machine as well as in the cloud on AWS SageMaker. The cloud version of the project includes an endpoint to which a request can be sent to check whether a text is plagiarised.

Figure 1.3: CloudFormation Stack’s Output


The project shows that a basic and well-established algorithm such as Random Forest can deliver very strong results. It also shows that thoughtful generation and selection of features enables algorithms to produce highly accurate results. The project is a classic case for applying Occam’s razor. Applied to model selection, the principle suggests that when several algorithms deliver comparable results, the one that is easier to implement and deploy should be preferred.


To make the system truly production-ready, the preprocessing part should be deployed on AWS Glue (Apache Spark). An AWS API Gateway and an AWS Lambda proxy could be added in front of the SageMaker endpoint. By doing so, any application that has access to the API could make predictions in real time over HTTPS.
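The Lambda proxy mentioned above could be sketched as follows. This is an illustration of the proposed extension, not part of the project: the `ENDPOINT_NAME` environment variable, its default value, the JSON request shape, and the CSV payload format are all assumptions. The optional `runtime` parameter is also an addition for this example so that the handler can be exercised without AWS access.

```python
import json
import os

# Hypothetical environment variable holding the SageMaker endpoint name.
ENDPOINT_NAME = os.environ.get("ENDPOINT_NAME", "plagiarism-endpoint")


def lambda_handler(event, context, runtime=None):
    """Forward feature values from an API Gateway proxy event to the endpoint."""
    if runtime is None:
        import boto3  # imported lazily so the module loads without AWS libraries

        runtime = boto3.client("sagemaker-runtime")

    # Assumed request body: {"features": [0.93, 0.85, 0.9]}
    features = json.loads(event["body"])["features"]
    payload = ",".join(str(value) for value in features)

    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",
        Body=payload,
    )
    prediction = response["Body"].read().decode("utf-8").strip()

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"prediction": prediction}),
    }
```

In this sketch the caller sends precomputed feature values. If the API should accept raw text instead, the containment and LCS features would have to be computed inside the handler before the endpoint is invoked.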

The project was developed during the machine learning program at Udacity. The code is not covered by unit tests, and there are no plans for further refactoring.

Source Code