Spam detector AI

Hello everyone!

I developed a spam detector that classifies emails as spam or legitimate (“ham”) using machine learning. It processes email content and provides predictions with confidence scores.

This post is to get your opinion on it and on whether we should adopt it in our infra.

How it works

I used the Naive Bayes method, which is a probabilistic classification algorithm based on Bayes’ Theorem. It’s called “naive” because it assumes that all features (words in the email) are independent of each other - which isn’t entirely true in practice, but works surprisingly well for spam detection.
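To make the independence assumption concrete, here is a minimal multinomial Naive Bayes in plain Python. This is an illustrative sketch, not the project’s actual code; all names and the toy data are made up for the example:

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (text, label) pairs with labels 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts

def predict_proba(word_counts, class_counts, text):
    """Return {label: probability} for a new email body."""
    total_docs = sum(class_counts.values())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    log_scores = {}
    for label in class_counts:
        # log prior: P(class)
        score = math.log(class_counts[label] / total_docs)
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # the "naive" part: treat words as independent, so we just
            # add per-word log-likelihoods, with Laplace (+1) smoothing
            score += math.log((word_counts[label][word] + 1)
                              / (n_words + len(vocab)))
        log_scores[label] = score
    # normalize the two log scores into probabilities (the confidence score)
    m = max(log_scores.values())
    exp = {k: math.exp(v - m) for k, v in log_scores.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}

docs = [
    ("win money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch tomorrow with the team", "ham"),
]
wc, cc = train_nb(docs)
probs = predict_proba(wc, cc, "claim your free money")
```

The normalized probability per class is what I refer to as the confidence score later in this thread.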

Why Naive Bayes is Better for Spam Detection

The state of the art in spam detection would be Natural Language Processing (NLP), but since that requires larger datasets and more hardware, I started with Naive Bayes to build a prototype. Among the machine learning methods that don’t require much data or hardware, it performs very well for spam detection.

Naive Bayes stands out in several key aspects that make it ideal for an initial implementation. Its fundamental advantage is its efficiency with text-based data: it processes thousands of word features quickly, enabling real-time email classification. Unlike Random Forests or SVMs, which can struggle with high-dimensional text data, Naive Bayes handles large vocabularies easily, treating each word as an independent feature without performance degradation.

The algorithm’s probabilistic nature yields confidence scores for predictions without extra machinery, although Naive Bayes probabilities are known to be overconfident, so any decision threshold should be validated empirically. Most importantly, Naive Bayes achieves good results with relatively small training datasets and learns spam patterns quickly, allowing for fast implementation and iteration. This combination of speed, efficiency, and effectiveness makes it a solid starting point for a spam detection system.

Dataset Sources

These were the sources used in this project so far to get ham and spam email samples:

Project structure

After installing the Python dependencies, I downloaded the dataset into data/raw/ and split it into two classes, spam and ham, with src/utils.py. I then trained the model with src/train_model.py, which also calls src/data_processing.py and saves the model to data/models/. The model can then be used with src/predict.py to check whether the files in emails/ are spam or ham.
This is the project structure so far.

spam_detector/
├── data/
│   ├── raw/                    # Downloaded raw emails
│   │   ├── spam/              # Spam emails (*.txt)
│   │   └── ham/               # Legitimate emails (*.txt)
│   └── models/                # Trained models (auto-generated)
├── emails/                    # Folder for new emails to be checked
│   └── mail.txt              # Individual email
├── src/
│   ├── data_processing.py     # Data processing and cleaning
│   ├── train_model.py         # ML model training
│   ├── predict.py             # New email prediction
│   └── utils.py              # Utility functions
├── requirements.txt          # Python dependencies
└── README.md

Let me know what you think: should we adopt it in our infra, create a repo for it, or do you think we shouldn’t use an AI for this at all? I can also provide more info if you have questions.


Sounds like AI advertising AI :stuck_out_tongue:


Not sure what something like this should use, but Python wouldn’t be my first go-to.

I’m just explaining what I did and how.

We often have spammers in mailing lists and even on pagure.io creating repos, issues, and PRs. The goal of this project is, at least, to reduce the amount of spam we need to remove manually.

As for the use of Python: it’s the language with libraries that make ML code easier to write and read, like scikit-learn and TensorFlow, and of course I’m used to it.

Do you think we shouldn’t use AI to try to reduce spam?

Do you have a sense of how this performs in comparison to existing FOSS spam detection tools?

It’s cool to code something up and experiment with it, but spam detection is a problem that’s had a lot of time invested into solutions, and using an already-developed tool should be an option on the table.

That’s mainly why I wouldn’t have it as a first choice; I want compute efficiency and performance first. Python might be OK, but there are faster languages and methods.

I’m not opposed to using AI for this purpose, but I suspect the older, pre-AI-buzz approaches targeted efficiency better. Might there be something else more efficient?

Like SpamAssassin? The Fedora mail server already uses it, if I’m not wrong, but some spammers are bypassing it.


Yes, SpamAssassin, and maybe other services too, I don’t know the space that well.

It would be good to do a performance comparison to understand whether your tool is better than the alternative solutions.

(“Performance” in functional terms - how many spam emails does it let through and how many non-spam emails does it wrongly catch - but also “performance” in terms of speed and resource usage.)

I see, but the point is not to replace SpamAssassin or any other tool. The idea is to add a layer that analyzes emails already in the list inboxes and removes them if they are spam. We could apply business rules based on the Naive Bayes confidence measure, which is essentially the probability that an email is spam: for example, if it’s higher than 95%, remove the email as spam; if it’s lower than that, leave it.
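That business rule is simple enough to sketch directly. The 95% cutoff is just the example threshold from this discussion; in practice it should be tuned against a validation set:

```python
def action_for(spam_probability, threshold=0.95):
    """Decide what to do with an email given the classifier's confidence.

    spam_probability: the model's estimated probability that the email
    is spam (between 0 and 1). The default threshold is illustrative
    and should be chosen empirically to keep false positives low.
    """
    return "remove" if spam_probability >= threshold else "keep"

# examples: very confident -> remove; anything less -> leave it alone
print(action_for(0.99))  # remove
print(action_for(0.60))  # keep
```

A third outcome such as “flag for human review” between two thresholds would also be easy to add later.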

OK, so if you’re taking that approach, then you’d want to evaluate:

  • Rate of false positives - i.e. tested on some realistic dataset, how many non-spam messages does the tool wrongfully remove?
  • Performance impact in terms of speed and resource usage

OK, I can get some emails from the Fedora mail server and use them to check how often the model classifies them correctly.

Naive Bayes is known for being fast and lightweight; it is one of the least resource-intensive machine learning algorithms, so resources and speed are not an issue. NLP, though, would use more hardware for training, but it increases accuracy because it understands context. Anyway, that would be another debate; I’ll work on it and come back with all that info.

Thanks everyone :slight_smile:

Nice work @phsmoura :slight_smile:

So, Naive Bayes is used by a lot of existing spam tools, and has been for a long time. In fact, I wouldn’t class this as AI, because AI is an overloaded term - this is a classic example of machine learning (ML) and building such a filter is a go-to of many courses.

That’s not a criticism, but given NB’s existing use in the industry, I would be surprised if yours performed better than SpamAssassin or other tools right now, because you trained it on general data such as the Enron set. To get better results, we need to train it on our data. So yes, I’d go ahead and push for building this with datasets drawn from Pagure and the lists.

Ideally, we’d want to get to an ROC curve, which is one of the usual ways to assess the accuracy of a binary classifier. Having an ROC curve for an existing tool and for yours will give you a quantified answer to whether it’s worth deploying :wink:
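For anyone unfamiliar: the ROC curve sweeps the decision threshold and plots true-positive rate against false-positive rate, and the area under it (AUC) summarizes the whole curve in one number. A minimal stdlib sketch using the pair-counting formulation (scikit-learn’s roc_curve and roc_auc_score would be used in practice):

```python
def auc(scores, labels):
    """Area under the ROC curve via pair counting.

    scores: classifier spam probabilities; labels: 1 = spam, 0 = ham.
    AUC equals the probability that a randomly chosen spam message
    scores higher than a randomly chosen ham message (ties count half).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# perfect separation -> 1.0; random guessing hovers around 0.5
print(auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
```

Comparing AUC values for the new tool and for SpamAssassin on the same held-out dataset would make the comparison concrete.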

More power to data scientists! :flexed_biceps:


+1, that’s why I used Naive Bayes for the prototype: it’s easy to code, doesn’t need much hardware, and lets us think about how to use it as an additional layer of spam blocking. For the final version of this project, I think we should consider using NLP; predict.py would stay the same or similar, only the model file would be different.

OK, I’m going to retrain it using legitimate emails we have, but I’d have to use public datasets for spam, because we always delete spam right away and it would take a lot of time to gather thousands of samples for a dataset :frowning: I’ll also do the ROC curve and then get back to you all :slight_smile:

Interesting. I don’t think we would want it to block/reject things right away, but we could definitely enable it and see how it does for a while. :wink:

Where do you envision this running and how?

I seem to not be seeing a link to the source? Or is it all local at this point while developing it?

Where do you envision this running and how?

My idea was to have the model file and predict.py running on the mail server with cron, but we could find other ways to deploy it if we decide to adopt it, np.

I seem to not be seeing a link to the source? Or is it all local at this point while developing it?

Yes, I did it locally for now because I’m not sure if and where it should have a repo, or even whether it should be a new branch in the ansible repo as a WIP…

Put it up on Codeberg or Gitlab so people can analyse it.

Hello everyone,

I’ve implemented NLP and uploaded the project to GitHub (GitHub - phsmoura/spam-detector-ai: This project implements a spam classifier that combines natural language processing (NLP) with machine learning algorithms). I haven’t uploaded the datasets because some of them use emails from Fedora mailing lists, and some are marked as private.

I’ve documented the development process, dataset details, and training results in the README. To summarize, the model achieved 99.5% accuracy. I used stratified validation due to the dataset’s imbalance - there were significantly more non-spam emails - and currently, spam detection is based solely on the English language. No false positives were identified; however, spam emails that weren’t in English or were encoded (such as base64) were classified as legitimate.
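On the stratified validation mentioned above: the point is that the train/test split preserves each class’s proportion, so the minority class (spam) isn’t under-represented in the test set. A stdlib sketch of the idea (the project presumably uses scikit-learn’s train_test_split with stratify=labels, which does this in one call):

```python
import random

def stratified_split(items, labels, test_frac=0.25, seed=0):
    """Split into train/test while preserving each class's proportion.

    Illustrative only: groups items by label, shuffles each group,
    and carves off the same fraction of every class for the test set.
    """
    rng = random.Random(seed)
    by_label = {}
    for item, label in zip(items, labels):
        by_label.setdefault(label, []).append(item)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        cut = int(len(group) * test_frac)
        test += [(item, label) for item in group[:cut]]
        train += [(item, label) for item in group[cut:]]
    return train, test

# imbalanced toy data: 80 ham, 20 spam -> test keeps the 4:1 ratio
items = [f"mail{i}" for i in range(100)]
labels = ["ham"] * 80 + ["spam"] * 20
train, test = stratified_split(items, labels)
```

With a stratified split, accuracy on the test set is at least measured against the true class balance, though precision/recall per class (as in the report below) remains the more informative view.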

Regarding the ROC curve, I don’t believe it’s necessary at this moment since the accuracy is already high at 99.5%, and there were no false positives except for those in other languages or encoded formats. Creating a ROC curve might just confirm what we already know, and there’s already substantial evidence in the literature that the combination of Naive Bayes + NLP is excellent for spam detection. There’s also room for improvement, such as adding datasets in other languages. If you still think it’s necessary, I can potentially work on it next quarter.

Below is the model’s performance:

==================================================
Model Performance - NAIVE_BAYES
==================================================
Accuracy: 0.9951
Test set size: 11921 emails
Spam samples in test: 3954
Ham samples in test: 7967

Classification Report:
              precision    recall  f1-score   support

         Ham       0.99      1.00      1.00      7967
        Spam       1.00      0.99      0.99      3954

    accuracy                           1.00     11921
   macro avg       1.00      0.99      0.99     11921
weighted avg       1.00      1.00      1.00     11921

Just so we don’t lose track, I think it’s important for us to focus the discussion on whether it’s worthwhile to use AI as an additional filter against spam and, if so, what the possible ways to implement it would be.
