
Gen-AI Text Detection using LLM

Leveraged BERT to distinguish AI-generated from human-written text in a Kaggle competition, training the model on Mistral AI-generated datasets alongside the competition data and achieving 72% training and 58% testing accuracy.


Introduction:

As part of a Kaggle competition, I developed a machine learning model to distinguish AI-generated texts from human-written texts using BERT (Bidirectional Encoder Representations from Transformers). The project involved the following key steps:


Data Preparation:

  • Collected and integrated training data, combining Mistral AI-generated text with the datasets provided by the competition.

  • Conducted data cleaning and preprocessing, including text stemming and removal of punctuation and stopwords.
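The cleaning step above can be sketched in plain Python. This is a minimal illustration, not the project's actual pipeline: the tiny stopword list and the crude suffix-stripping stemmer stand in for a full resource such as NLTK's English stopword list and PorterStemmer.

```python
import string

# Illustrative stopword list; the project likely used a full list
# such as NLTK's English stopwords.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in"}

def simple_stem(word: str) -> str:
    """Crude suffix-stripping stemmer standing in for e.g. a Porter stemmer."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> str:
    # Lowercase, strip punctuation, drop stopwords, then stem each token.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [simple_stem(t) for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(preprocess("The cats are running"))  # cat runn
```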


Model Development:

  • Implemented sequence classification on top of a pre-trained BERT model.

  • Fine-tuned the model on the combined training dataset, which included both human-written and AI-generated texts.
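A single fine-tuning step with Hugging Face Transformers looks roughly like the sketch below. To stay self-contained and runnable offline, it uses a tiny randomly initialized `BertConfig` and dummy token ids; the real setup would instead load a pre-trained checkpoint, e.g. `BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)`, and feed tokenized text.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny random-initialized config so the sketch runs without downloads;
# the project fine-tuned a pre-trained checkpoint instead.
config = BertConfig(vocab_size=30522, hidden_size=64, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=128, num_labels=2)
model = BertForSequenceClassification(config)

# Dummy batch standing in for tokenized texts (label 0 = human, 1 = AI).
input_ids = torch.randint(0, config.vocab_size, (4, 32))
attention_mask = torch.ones_like(input_ids)
labels = torch.tensor([0, 1, 0, 1])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One fine-tuning step: passing labels makes the forward pass return the loss.
outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(outputs.logits.shape)  # torch.Size([4, 2])
```

In practice this step runs inside a loop over mini-batches from a `DataLoader` for several epochs.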


Training and Evaluation:

  • Trained the model using a balanced dataset to address class imbalance.

  • Achieved a training accuracy of 72% and a testing accuracy of 58%.

  • Evaluated model performance using metrics such as accuracy and loss.
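One common way to balance the classes, sketched here with pandas on toy data, is to downsample every class to the size of the smallest one (whether the project used downsampling or another balancing scheme is an assumption).

```python
import pandas as pd

# Toy stand-in for the combined training data:
# six human-written rows (label 0) and two AI-generated rows (label 1).
df = pd.DataFrame({
    "text": [f"text {i}" for i in range(8)],
    "label": [0, 0, 0, 0, 0, 0, 1, 1],
})

# Downsample each class to the size of the rarest class.
n_min = df["label"].value_counts().min()
balanced = df.groupby("label", group_keys=False).sample(n=n_min, random_state=42)

print(balanced["label"].value_counts().sort_index().to_dict())  # {0: 2, 1: 2}
```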


Deployment:

  • Prepared a submission file for the competition by predicting the classification of texts in the test dataset.
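Writing the submission file is a small pandas step, sketched below with hypothetical ids and predictions; the actual column names ("id", "generated") depend on the competition's sample-submission format and are an assumption here.

```python
import pandas as pd

# Hypothetical test-set ids and model predictions
# (0 = human-written, 1 = AI-generated).
test_ids = [101, 102, 103]
predictions = [0, 1, 1]

submission = pd.DataFrame({"id": test_ids, "generated": predictions})
submission.to_csv("submission.csv", index=False)
```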


Key Technologies:

  • NLP: BERT, Tokenization, Text Preprocessing

  • Libraries: Pandas, NumPy, PyTorch, Transformers

  • Data Visualization: Matplotlib, Seaborn

  • Model Evaluation: Accuracy, Confusion Matrix, Loss Calculation
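The evaluation metrics listed above are straightforward to compute from predictions. A minimal NumPy sketch of a binary confusion matrix and the accuracy derived from it (the project may have used library helpers such as scikit-learn instead):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=2):
    """Rows = true class, columns = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
accuracy = np.trace(cm) / cm.sum()  # correct predictions / all predictions
print(cm)        # [[2 1]
                 #  [1 2]]
print(accuracy)  # 0.666...
```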


This project demonstrates my expertise in leveraging advanced NLP techniques and deep learning models to tackle complex text classification problems. The experience has enhanced my skills in data preprocessing, model fine-tuning, and performance evaluation in a competitive data science environment.
