Building a Naive Bayes Spam Filter with Testing and Updating Capabilities

admin

1 day ago

In this tutorial, we’ll implement a Naive Bayes spam filter that not only classifies emails but also includes testing functionality and the ability to update its knowledge base with new data.

Overview

Our implementation will include:

Training on labeled data
Spam classification with adjustable threshold
Accuracy testing
Dynamic model updates
Per-message reporting of predictions and scores

Implementation

Here’s the complete implementation of the classifier:

class NaiveBayesClassifier:
    def __init__(self):
        self.spam_words = {}
        self.ham_words = {}
        self.spam_count = 0
        self.ham_count = 0

    def tokenize(self, text):
        """Converts text to lowercase and splits into words."""
        return text.lower().split()

    def train(self, spam_messages, ham_messages):
        """Trains classifier on a labeled dataset; counts accumulate across calls."""
        self.spam_count += len(spam_messages)
        self.ham_count += len(ham_messages)

        # Process spam messages
        for message in spam_messages:
            words = self.tokenize(message)
            for word in words:
                self.spam_words[word] = self.spam_words.get(word, 0) + 1

        # Process ham messages
        for message in ham_messages:
            words = self.tokenize(message)
            for word in words:
                self.ham_words[word] = self.ham_words.get(word, 0) + 1

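    def calculate_probability(self, word, word_counts, message_count, vocabulary_size):
        """Returns a Laplace (add-one) smoothed estimate for a word in one class.

        Note: the denominator uses the class's message count. The textbook
        multinomial estimate uses the class's total word count instead.
        """
        return (word_counts.get(word, 0) + 1) / (message_count + vocabulary_size)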

    def classify(self, message, threshold=0.5):
        """Classifies a message as spam or ham."""
        words = self.tokenize(message)

        # Get complete vocabulary
        vocabulary = set(self.spam_words.keys()) | set(self.ham_words.keys())
        vocabulary_size = len(vocabulary)

        # Calculate prior probabilities
        total_messages = self.spam_count + self.ham_count
        if total_messages == 0:
            raise ValueError("classifier has not been trained yet")
        spam_prior = self.spam_count / total_messages
        ham_prior = self.ham_count / total_messages

        # Initialize with priors
        spam_prob = spam_prior
        ham_prob = ham_prior

        # Calculate likelihood
        for word in words:
            spam_prob *= self.calculate_probability(word, self.spam_words, 
                                                  self.spam_count, vocabulary_size)
            ham_prob *= self.calculate_probability(word, self.ham_words, 
                                                 self.ham_count, vocabulary_size)

        # Calculate spamicity (falls back to 0 if both scores underflow to zero)
        total_prob = spam_prob + ham_prob
        spamicity = spam_prob / total_prob if total_prob != 0 else 0.0

        label = "spam" if spamicity > threshold else "ham"
        return label, spamicity

    def update(self, message, label):
        """Updates classifier with new labeled data."""
        words = self.tokenize(message)
        if label == "spam":
            self.spam_count += 1
            for word in words:
                self.spam_words[word] = self.spam_words.get(word, 0) + 1
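        elif label == "ham":
            self.ham_count += 1
            for word in words:
                self.ham_words[word] = self.ham_words.get(word, 0) + 1
        else:
            # Fail loudly instead of silently ignoring a mistyped label
            raise ValueError(f"unknown label: {label!r}")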

Using the Classifier

Here’s how to use the classifier on a small toy dataset:

# Training data
previous_spam = [
    'send us your password',
    'review our website',
    'send your password',
    'send us your account'
]
previous_ham = [
    'Your activity report',
    'benefits physical activity',
    'the importance vows'
]

# Test data
new_emails = {
    'spam': ['renew your password', 'renew your vows'],
    'ham': ['benefits of our account', 'the importance of physical activity']
}

# Create and train classifier
classifier = NaiveBayesClassifier()
classifier.train(previous_spam, previous_ham)

# Set spam threshold
SPAM_THRESHOLD = 0.6

# Test the classifier
def test_classifier(classifier, test_data, threshold):
    correct = 0
    total = 0

    for true_label, messages in test_data.items():
        for message in messages:
            prediction, spamicity = classifier.classify(message, threshold)
            total += 1
            if prediction == true_label:
                correct += 1

            print(f"Message: '{message}'")
            print(f"Spamicity: {spamicity:.4f}")
            print(f"Prediction: {prediction}")
            print(f"True label: {true_label}\n")

    accuracy = correct / total if total > 0 else 0
    print(f"Accuracy: {accuracy:.2%}")
    return accuracy

Handling Model Updates

One of the key features of our implementation is the ability to update the model with new data:

def update_classifier(classifier, new_data):
    """Updates classifier with new labeled data."""
    for label, messages in new_data.items():
        for message in messages:
            classifier.update(message, label)
            print(f"Updated with message: '{message}' as {label}")

# Test accuracy before updates
print("Initial accuracy:")
initial_accuracy = test_classifier(classifier, new_emails, SPAM_THRESHOLD)

# Update model with new data
print("\nUpdating classifier...")
update_classifier(classifier, new_emails)

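# Test accuracy after updates
# Note: these are the same messages the model was just updated with, so this
# shows how well the model fits them, not how it performs on unseen mail.
print("\nAccuracy after updates:")
final_accuracy = test_classifier(classifier, new_emails, SPAM_THRESHOLD)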

Key Features and Improvements

Laplace Smoothing

Handles unseen words gracefully
Prevents zero probability problems
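
To make the effect of smoothing concrete, the sketch below recomputes a few values by hand from the spam counts in this tutorial's training data. The helper name `smoothed` and the use of `Fraction` are choices made for this illustration only. It also compares the denominator used by `calculate_probability` (the class's message count) with the textbook multinomial choice (the class's total word count), which is the variant whose word probabilities sum to 1 over the vocabulary.

```python
from fractions import Fraction

# Spam word counts from the tutorial's training data (4 spam messages)
spam_words = {"send": 3, "us": 2, "your": 3, "password": 2,
              "review": 1, "our": 1, "website": 1, "account": 1}
# Words that appear only in the ham training messages
ham_only_words = ["activity", "report", "benefits", "physical",
                  "the", "importance", "vows"]

vocabulary = set(spam_words) | set(ham_only_words)  # 15 distinct words
spam_messages = 4
spam_tokens = sum(spam_words.values())              # 14 words in total


def smoothed(word, counts, base, vocabulary_size):
    """Add-one smoothed estimate; `base` is the unsmoothed denominator."""
    return Fraction(counts.get(word, 0) + 1, base + vocabulary_size)


# An unseen word no longer gets probability zero
unseen = smoothed("renew", spam_words, spam_messages, len(vocabulary))
print("P('renew' | spam) =", unseen)

# Sum of the estimates over the whole vocabulary, for each denominator
article_total = sum(smoothed(w, spam_words, spam_messages, len(vocabulary))
                    for w in vocabulary)
multinomial_total = sum(smoothed(w, spam_words, spam_tokens, len(vocabulary))
                        for w in vocabulary)
print("message-count denominator sums to", article_total)
print("word-count denominator sums to", multinomial_total)
```

With the message-count denominator the estimates add up to more than 1, so they are best read as relative scores rather than a normalized distribution.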

Adjustable Threshold

Allows fine-tuning of spam detection sensitivity
Can be adjusted based on requirements
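
As a rough illustration of how the threshold shifts the outcome, the sketch below applies the same comparison that `classify` uses to a handful of made-up spamicity scores. The function name `label_for` and the scores themselves are hypothetical and were not produced by the classifier above.

```python
def label_for(spamicity, threshold=0.5):
    """Mirrors the comparison in classify(): strictly above the threshold is spam."""
    return "spam" if spamicity > threshold else "ham"


# Hypothetical spamicity scores, chosen only to show the effect
scores = [0.35, 0.55, 0.62, 0.91]

flagged = {}
for threshold in (0.5, 0.6, 0.9):
    labels = [label_for(score, threshold) for score in scores]
    flagged[threshold] = labels.count("spam")
    print(f"threshold={threshold}: {labels}")
```

Raising the threshold flags fewer messages, which trades missed spam for fewer legitimate emails landing in the spam folder.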

Comprehensive Testing

Calculates and reports accuracy
Shows detailed classification results
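
Accuracy alone can hide which kind of mistake the filter makes. A minimal sketch of a fuller report, assuming the true and predicted labels have been collected into two lists, is shown here. The function name `evaluate` and the sample labels are hypothetical additions, not part of `test_classifier`.

```python
def evaluate(true_labels, predicted_labels, positive="spam"):
    """Returns accuracy, precision and recall for the `positive` class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        # Of the messages flagged as spam, how many really were spam
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the real spam messages, how many were caught
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels for four test messages
true_labels = ["spam", "spam", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham"]
report = evaluate(true_labels, predicted)
print(report)
```

For a spam filter, precision is often the number to watch, since a false positive means a legitimate email gets hidden.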

Dynamic Updates

Model learns from new labeled data without retraining from scratch
Can improve accuracy over time as more labeled messages arrive

Performance Considerations

Memory Efficiency

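Stores one count per distinct word per class
Maintains minimal state: two dictionaries and two counters

Computational Efficiency

Dictionaries give average O(1) word lookups
Classification is linear in the message length, plus the cost of rebuilding the vocabulary set on each call
Updates take time proportional to the number of words in the new message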

Scalability

Can handle a growing vocabulary
New messages are folded in one at a time, so the cost of an update does not grow with the training set
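
`classify` rebuilds the vocabulary set from both dictionaries on every call. One possible way to avoid that, sketched here under the assumption that counts are only ever added and never removed, is to maintain the vocabulary as messages arrive. `IncrementalCounts` and its method names are hypothetical and do not appear in the class above.

```python
class IncrementalCounts:
    """Hypothetical helper: per-class word counts plus a running vocabulary."""

    def __init__(self):
        self.counts = {"spam": {}, "ham": {}}
        self.vocabulary = set()

    def add(self, message, label):
        """Adds one labeled message and keeps the vocabulary in sync."""
        if label not in self.counts:
            raise ValueError(f"unknown label: {label!r}")
        for word in message.lower().split():
            self.counts[label][word] = self.counts[label].get(word, 0) + 1
            self.vocabulary.add(word)

    @property
    def vocabulary_size(self):
        # No set union needed at classification time
        return len(self.vocabulary)


store = IncrementalCounts()
store.add("send us your password", "spam")
store.add("Your activity report", "ham")
print(store.vocabulary_size)
```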

Potential Improvements

Better Tokenization

def improved_tokenize(self, text):
    """Enhanced tokenization with preprocessing."""
    import re
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)
    # Split into words
    words = text.split()
    # Drop words of one or two characters
    return [w for w in words if len(w) > 2]
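
To see what this preprocessing does in practice, here is the same logic as a standalone function (without `self`, so it runs on its own) applied to two sample strings. The sample inputs are made up for illustration.

```python
import re


def improved_tokenize(text):
    """Standalone copy of the tokenizer above, for experimentation."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)   # strip punctuation
    return [w for w in text.split() if len(w) > 2]


first = improved_tokenize("Send us your PASSWORD, now!!!")
second = improved_tokenize("benefits of our account")
print(first)   # ['send', 'your', 'password', 'now']
print(second)  # ['benefits', 'our', 'account']
```

Note that the length filter also drops "us", which appears in two of the four spam training messages, so it is worth checking that short words are not carrying useful signal before filtering them out.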

Feature Weighting

def calculate_word_weight(self, word):
    """Calculate importance weight for each word."""
    spam_freq = self.spam_words.get(word, 0) / max(self.spam_count, 1)
    ham_freq = self.ham_words.get(word, 0) / max(self.ham_count, 1)
    return abs(spam_freq - ham_freq)
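
As a quick check of what these weights look like, the sketch below applies the same formula to the word counts from this tutorial's training data and ranks the vocabulary. It is written as a standalone function with module-level counts so it can be run on its own; the name `word_weight` is an assumption for this example.

```python
# Word counts from the tutorial's training data
spam_words = {"send": 3, "us": 2, "your": 3, "password": 2,
              "review": 1, "our": 1, "website": 1, "account": 1}
ham_words = {"your": 1, "activity": 2, "report": 1, "benefits": 1,
             "physical": 1, "the": 1, "importance": 1, "vows": 1}
spam_count, ham_count = 4, 3


def word_weight(word):
    """Absolute difference between per-message spam and ham frequencies."""
    spam_freq = spam_words.get(word, 0) / max(spam_count, 1)
    ham_freq = ham_words.get(word, 0) / max(ham_count, 1)
    return abs(spam_freq - ham_freq)


vocabulary = set(spam_words) | set(ham_words)
ranked = sorted(vocabulary, key=word_weight, reverse=True)
for word in ranked[:2]:
    print(f"{word}: {word_weight(word):.3f}")
```

A word such as "your", which shows up in both classes, ends up with a lower weight than "send", which appears only in spam.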

Advanced Probability Calculations

def calculate_log_probability(self, word, is_spam):
    """Use log probabilities to prevent underflow."""
    import math
    # The class stores no vocabulary_size attribute, so derive it from the counts
    vocabulary_size = len(set(self.spam_words) | set(self.ham_words))
    if is_spam:
        prob = self.calculate_probability(word, self.spam_words,
                                          self.spam_count, vocabulary_size)
    else:
        prob = self.calculate_probability(word, self.ham_words,
                                          self.ham_count, vocabulary_size)
    return math.log(prob) if prob > 0 else float('-inf')
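
The reason log probabilities matter becomes visible on long messages, where multiplying many small numbers underflows to zero. The following sketch uses made-up per-word probabilities for a 400-word message and shows one way to turn two log scores back into a spamicity value. The function name `spamicity_from_logs` is hypothetical, and priors are left out to keep the example short.

```python
import math


def spamicity_from_logs(log_spam, log_ham):
    """Normalizes two log scores into spam / (spam + ham) without underflow."""
    highest = max(log_spam, log_ham)
    # Subtracting the larger score keeps both exponents at or below zero
    spam = math.exp(log_spam - highest)
    ham = math.exp(log_ham - highest)
    return spam / (spam + ham)


# Hypothetical per-word probabilities for a 400-word message
spam_word_probs = [0.02] * 400
ham_word_probs = [0.01] * 400

# Direct multiplication: both products underflow to zero
naive_spam = math.prod(spam_word_probs)
naive_ham = math.prod(ham_word_probs)
print(naive_spam, naive_ham)

# Summing logs keeps the scores representable
log_spam = sum(math.log(p) for p in spam_word_probs)
log_ham = sum(math.log(p) for p in ham_word_probs)
spamicity = spamicity_from_logs(log_spam, log_ham)
print(f"log_spam={log_spam:.1f} log_ham={log_ham:.1f} spamicity={spamicity:.4f}")
```

With direct multiplication both scores collapse to zero and `classify` would fall back to a spamicity of 0, even though every word here is twice as likely under spam.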

Conclusion

This implementation provides a robust foundation for spam detection that can be extended and improved based on specific needs. The ability to update the model with new data makes it particularly valuable for real-world applications where spam patterns evolve over time.

Key takeaways:

Simple yet effective implementation
Good balance of accuracy and efficiency
Extensible design for future improvements
Practical for real-world applications

Remember that while this implementation is good for learning and small-scale applications, production systems might need additional features like:

More sophisticated tokenization
Multiple classification features beyond just words
Integration with other spam detection methods
Persistent storage for trained models
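
For the last item, one simple option is to write the counts to a JSON file, since the model state is just two dictionaries and two integers. The sketch below assumes that layout; the function names `save_model` and `load_model`, the file name, and the sample counts are all hypothetical.

```python
import json
import os
import tempfile


def save_model(path, spam_words, ham_words, spam_count, ham_count):
    """Writes the classifier's counts to a JSON file."""
    state = {
        "spam_words": spam_words,
        "ham_words": ham_words,
        "spam_count": spam_count,
        "ham_count": ham_count,
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(state, handle)


def load_model(path):
    """Reads the counts back as a plain dictionary."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


# Round trip through a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "model.json")
    save_model(path, {"send": 3, "password": 2}, {"activity": 2}, 4, 3)
    restored = load_model(path)

print(restored["spam_words"]["send"], restored["spam_count"])
```

The restored values can then be assigned back onto a fresh `NaiveBayesClassifier` instance's `spam_words`, `ham_words`, `spam_count` and `ham_count` attributes.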