« A taxonomy of visualizations | Main | Theories of information and interestingness »
January 09, 2007
Statistical models and spam
In the old days, internet technologies were developed for ethical well-behaved people. But when the hordes were unleashed on the internet, the old technologies could not cope with the bad behavior, but it has been very hard to change the underlying fabric of internet standards and protocols. Spam in particular is one of the most annoying problems. Spam filtering is an automated classification of messages (e-mails, but also blog comments, blog trackbacks, instant messages and so on) into the good (ham) and into the bad (spam).
The so-called Bayesian filtering has been popularized by Paul Graham in his essays A Plan for Spam and Better Bayesian Filtering a few years ago, but goes back to a Microsoft Research who first worked on detecting insults in 1997 and then junk in 1998. The traditional way of dealing with spam has been to identify individual words that seem to be overrepresented in spam (Viagra, free, money, casino, games). The appearance of each such word increases the log-odds that the email is spam. On the other hand, the appearance of words related to one's interests and work increases the log-odds that the email is spam. When we sum up all these log-odds, we obtain a score, which is used to classify the email. This approach is known as the naive Bayesian classification. Of course, "Bayesian" here is as in Bayes rule, not as in Bayesian statistics. Although models such as logistic regression and support vector machines would yield better accuracy for spam filtering, practitioners tell me that naive Bayes is still heavily used in practice because it scales to the huge collections of data: given a list of spam emails and a list of non-spam emails, we can figure out how much log-odds to add or subtract for each word (or some other aspect of the message, such a the number of all-uppercase words, or the maximum length of a sequence of exclamation marks).
My pet peeve with spam filtering is that it doesn't root the problem out. It merely provides an efficient broom to sweep it under the rug. The innocent users are paying for filtering and wasted internet bandwidth, have to keep separating emails into spam and ham, risk the loss of important emails due to spam filters, whereas the spammers incur practically no penalties, only profits. Furthermore, the adaptive aspect of filtering has been used in the adversarial strategy of Bayesian poisoning, where messages that will be classified as spam are made up of purely legitimate words, and where legitimate words are injected into the spam messages. Moreover, spammy messages are now stored in images, which cannot be easily filtered automatically. For these reasons, the effectiveness of spam filters has gone down over the past year or two. Until we get internet postage stamps, internet taxes and internet police, I would prefer vigilante approaches such as the notorious Make Love not Spam. But pessimists like to use the following cannot-solve-spam form.
It is still refreshing to observe new developments. I have come across a paper on the next generation of spam filtering techniques Spam Filtering Using Statistical Data Compression Models by some of my former colleagues. Andrej Bratko et al. have found that models based on individual letters outperform the models based on the word counts. For example, their method can employ the indentation pattern "> " which is far more frequent in legitimate emails than in spam. With sufficient training, they would also be able to detect misspellings and foreign languages. Although not rooting the problem out, they can still buy some time.
Posted by Aleks at January 9, 2007 10:24 AM
Trackback Pings
TrackBack URL for this entry:
http://www.stat.columbia.edu/~cook/movabletype/mt-tb.cgi/765