MaxFilter

I had a class assignment where I needed to do something related to AI, web crawling or similar. I wanted to try neural networks, so after some research I decided to build a spam filter.

The main source of inspiration I found was Lelia's postgraduate report (link), which took the same approach I wanted to try.

Intended Result

The main behavior I wanted from the program was to:

  1. Read my Gmail mailbox
  2. Parse my emails
  3. Flag which emails are spam
  4. Send a results report to my email

I also wanted to compare the results of a simple perceptron against those of an MLP (multilayer perceptron).

Implementation

At the time I knew some C/C++, 8086 assembly and HTML/CSS/JS, but none of them offered simple support for neural networks, mail handling and HTML parsing at the same time. I had heard that Python supported all of these, so this became my first Python project.

Email Reading

To connect to the email server and retrieve my emails I used Python's poplib module. After downloading the emails I used the BeautifulSoup module to strip the HTML markup and keep only the text. The extracted text was then pre-processed further to get better results.
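As a rough illustration of the reading step (not the original Python 2 code), the sketch below downloads messages with poplib and reduces an HTML body to text. To keep it runnable with the standard library only, html.parser stands in for BeautifulSoup here; the function names are made up for this example.

```python
import email
import poplib
from email import policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Drop the markup and join the remaining text with single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def message_text(raw_bytes):
    """Return (subject, text) for one raw message as downloaded from the server."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    subject = str(msg["Subject"] or "")
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return subject, ""
    content = body.get_content()
    if body.get_content_type() == "text/html":
        content = html_to_text(content)
    return subject, content.strip()


def fetch_raw_messages(host, user, password):
    """Download every message in the mailbox over POP3 with SSL."""
    conn = poplib.POP3_SSL(host)
    try:
        conn.user(user)
        conn.pass_(password)
        count, _size = conn.stat()
        # retr() returns (response, list_of_byte_lines, octets)
        return [b"\r\n".join(conn.retr(i)[1]) for i in range(1, count + 1)]
    finally:
        conn.quit()
```

For Gmail the host would typically be pop.gmail.com, and POP access has to be enabled on the account; each item returned by fetch_raw_messages can be passed straight to message_text.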

Features Selection and Handling

For the neural network to decide whether an email is spam, it needs a set of features as input. These were taken from Lelia's report.

The output of the neural network is a binary value: 0 if the evaluated email is not spam, 1 if it is.
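For illustration, the SpamBase dataset used later for training describes each email with 57 numbers: 48 word frequencies, 6 character frequencies and 3 statistics about runs of capital letters. The sketch below computes features of that kind from an email body. The shortened WORDS and CHARS lists and the function name are assumptions made for this example, not the feature list the project used.

```python
import re

# Hypothetical, shortened vocabulary; SpamBase itself tracks 48 words and 6 characters.
WORDS = ("free", "money", "credit")
CHARS = ("!", "$")


def extract_features(text):
    """Return SpamBase-style features for one email body as a list of floats."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    n_tokens = len(tokens) or 1  # avoid dividing by zero on empty bodies
    n_chars = len(text) or 1

    # word frequency: percentage of tokens equal to the word
    word_freq = [100.0 * tokens.count(w) / n_tokens for w in WORDS]
    # character frequency: percentage of characters equal to the character
    char_freq = [100.0 * text.count(c) / n_chars for c in CHARS]

    # capital run length: statistics over unbroken runs of capital letters
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    total = sum(runs)
    average = total / len(runs) if runs else 0.0
    longest = max(runs, default=0)

    return word_freq + char_freq + [average, float(longest), float(total)]
```

With the full 48 words and 6 characters, the same function would produce the 57 inputs that the networks expect.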

Neural Networks Training and Classification

To train and test the neural networks the SpamBase dataset was used.

First the dataset is split into training and testing data. I then implemented the perceptron with the Perceptron class from sklearn, configured to shuffle the dataset it receives and with its learning parameters set.

After that the network is trained and the resulting accuracy is checked. After some training time the network is saved to a file with Python's pickle module, which serializes Python objects.
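The split, train, check and save steps can be sketched as follows. This is a from-scratch stand-in written with the standard library only, not the sklearn Perceptron the project used; the function names, the 70/30 split and the learning parameters are assumptions for the example. The model is kept as a plain dict so that pickle can store it without needing a class definition.

```python
import pickle
import random


def split_dataset(rows, test_ratio=0.3, seed=0):
    """Shuffle (features, label) rows and split them into train and test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_ratio)
    return rows[:len(rows) - n_test], rows[len(rows) - n_test:]


def predict(model, features):
    """Return 1 (spam) when the weighted sum is positive, otherwise 0."""
    activation = model["bias"] + sum(
        w * x for w, x in zip(model["weights"], features)
    )
    return 1 if activation > 0 else 0


def train_perceptron(rows, epochs=20, learning_rate=0.1, seed=0):
    """Classic perceptron rule: adjust the weights only on misclassified rows."""
    rng = random.Random(seed)
    rows = list(rows)
    model = {"weights": [0.0] * len(rows[0][0]), "bias": 0.0}
    for _ in range(epochs):
        rng.shuffle(rows)  # present the training rows in a new order each epoch
        for features, label in rows:
            error = label - predict(model, features)  # -1, 0 or +1
            if error:
                model["bias"] += learning_rate * error
                model["weights"] = [
                    w + learning_rate * error * x
                    for w, x in zip(model["weights"], features)
                ]
    return model


def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(model, f) == y for f, y in rows) / len(rows)


def save_model(model, path):
    with open(path, "wb") as fh:
        pickle.dump(model, fh)


def load_model(path):
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.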

The process for the MLP is similar, with some differences:

  * It uses the MLP from the pylearn2 module.
  * The topology of the network is defined with two layers: the first has 57 nodes and uses a sigmoid function, and the second (the output layer) has 2 nodes and uses a softmax function.
  * An SGD (stochastic gradient descent) method is used to train the network.
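To make that topology concrete, here is a minimal forward pass for a network with a 57-unit sigmoid layer followed by a 2-unit softmax output. It is written in plain Python rather than pylearn2, it assumes 57 inputs (the number of SpamBase features), the weights are random rather than trained, and treating output index 1 as "spam" is an assumption. SGD training is left out.

```python
import math
import random


def init_layer(n_inputs, n_units, rng, scale=0.05):
    """Small random weights (one row per unit) and zero biases."""
    weights = [
        [rng.uniform(-scale, scale) for _ in range(n_inputs)]
        for _ in range(n_units)
    ]
    return weights, [0.0] * n_units


def affine(weights, biases, inputs):
    """Weighted sum plus bias for every unit in the layer."""
    return [
        b + sum(w * x for w, x in zip(row, inputs))
        for row, b in zip(weights, biases)
    ]


def sigmoid(values):
    """Logistic function, written so that large inputs cannot overflow."""
    out = []
    for v in values:
        if v >= 0:
            out.append(1.0 / (1.0 + math.exp(-v)))
        else:
            e = math.exp(v)
            out.append(e / (1.0 + e))
    return out


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(network, features):
    """Run one feature vector through the sigmoid layer and the softmax output."""
    (w_hidden, b_hidden), (w_out, b_out) = network
    hidden = sigmoid(affine(w_hidden, b_hidden, features))
    return softmax(affine(w_out, b_out, hidden))


rng = random.Random(0)
network = (init_layer(57, 57, rng), init_layer(57, 2, rng))
probabilities = forward(network, [0.0] * 57)
label = probabilities.index(max(probabilities))  # assumed: 1 = spam, 0 = not spam
```

The softmax output gives one probability per class, and taking the larger one yields the same binary 0/1 answer that the perceptron produces.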

Email Report

With the emails read from the mailbox and the neural network evaluating each one, the results still need to be reported in some way.

I chose to send an email with the results: once they are available, the report is sent using the smtplib module.

Example of the report email (Assunto = subject, Resultado = result; 1 means spam):

Assunto : Dicas para usar o Gmail : Resultado –> 0
Assunto : Bem-vindo ao Gmail : Resultado –> 0
Assunto : Primeiros passos no Google+ : Resultado –> 0
Assunto : Bem-vindo(a) ao YouTube! : Resultado –> 0
Assunto : Thank you for downloading RapidMiner : Resultado –> 0
Assunto : Plano de Desenvolvimento de Software G13 : Resultado –> 0
Assunto : Your 14-Day Trial of RapidMiner : Resultado –> 0
Assunto : League of Legends: "Summoner's Rift Gameplay" : Resultado –> 0
Assunto : Risk List G13 – Convite para editar : Resultado –> 0
Assunto : How likely are you to recommend RapidMiner? : Resultado –> 0
Assunto : Your RapidMiner License Has Expired : Resultado –> 0
Assunto : League of Legends: "The Pledge – Kalista" : Resultado –> 0
Assunto : Conta do Google: tentativa de login bloqueada : Resultado –> 0
Assunto : League of Legends: "The Terror Beneath" : Resultado –> 0
Assunto : Conta do Google: tentativa de login bloqueada : Resultado –> 0
Assunto : Conta do Google: o acesso a aplicativos menos seguros foi ativado : Resultado –> 0
Assunto : Teste Pop : Resultado –> 1
Assunto : YOOOOO : Resultado –> 0
Assunto : Happy Holidays from RapidMiner : Resultado –> 0
Assunto : Meeting Minutes #8 – Convite para editar : Resultado –> 0
Assunto : Relatorio de SPAM : Resultado –> 0
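A report in this format could be assembled and sent roughly as in the sketch below. It targets Python 3 and uses EmailMessage together with smtplib; the function names, the report subject and the use of port 465 with implicit SSL are assumptions for the example, not taken from the original code.

```python
import smtplib
from email.message import EmailMessage


def build_report(results, sender, recipient):
    """Build the report from (subject, label) pairs, where label 1 means spam."""
    lines = [
        "Assunto : {} : Resultado -> {}".format(subject, label)
        for subject, label in results
    ]
    msg = EmailMessage()
    msg["Subject"] = "Relatorio de SPAM"  # assumed subject for the report itself
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(lines))
    return msg


def send_report(msg, host, user, password, port=465):
    """Send over SMTP with implicit SSL; host and port depend on the provider."""
    with smtplib.SMTP_SSL(host, port) as server:
        server.login(user, password)
        server.send_message(msg)
```

Keeping message construction separate from sending makes it possible to inspect the report without a network connection.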

CLI

On top of all this, the program needs to know which algorithm to use, which email account and password to use, and whether to train more networks.

So a simple CLI was implemented with prints and Python's raw_input function.
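A prompt loop of this kind might look like the sketch below. It is written for Python 3, where input replaces raw_input, and it uses getpass so the password is not echoed; the prompts, option names and function names are hypothetical.

```python
import getpass


def ask_choice(prompt, options, read=input):
    """Keep asking until the answer is one of the allowed options."""
    while True:
        answer = read("{} [{}]: ".format(prompt, "/".join(options))).strip().lower()
        if answer in options:
            return answer
        print("Please answer with one of: " + ", ".join(options))


def ask_settings(read=input, read_secret=getpass.getpass):
    """Collect the run settings into a dict, asking one question at a time."""
    return {
        "algorithm": ask_choice("Algorithm", ("perceptron", "mlp"), read),
        "email": read("Email account: ").strip(),
        "password": read_secret("Password: "),
        "train": ask_choice("Train a new network", ("y", "n"), read) == "y",
    }
```

Calling ask_settings() with no arguments prompts on the terminal; passing other callables for read and read_secret allows the flow to be exercised without typing.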

Results

[Screenshots: Initial CLI, Email Config CLI, Run Settings CLI]

Hello,

FREE STUFF WITH US!!!!

SUPER EASY MONEY!!!

Just to easy steps, check our site - www.test.com

Algorithm    Accuracy (%)   FP (%)   FN (%)
Perceptron   90.11          4.32     5.55
MLP          93.02          3.96     2.99
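Each row adds up to roughly 100%, which suggests that FP (false positives) and FN (false negatives) are given as a share of all test emails. Under that assumption, the three figures can be computed from true and predicted labels as in this sketch; the function name is made up for the example.

```python
def summarize(y_true, y_pred):
    """Accuracy, false positives and false negatives as % of all emails."""
    total = len(y_true)
    # false positive: a legitimate email (0) flagged as spam (1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # false negative: a spam email (1) that was let through (0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = total - fp - fn
    return {
        "accuracy": 100.0 * correct / total,
        "fp": 100.0 * fp / total,
        "fn": 100.0 * fn / total,
    }
```

For a spam filter the FP column usually matters more than the FN column, since a false positive hides a legitimate email.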

I was just trying to learn about neural networks, but still got a better accuracy than the postgraduate report.

Repository

If you want to check the source code, go to the Bitbucket repository.

NOTE: The code was written in Python 2 around 2015, when I was still doing my bachelor's, and I haven't maintained it, so it will probably not work anymore.