Short: Junk E-Mail Classifier.
Author: agmsmith@rogers.com (Alexander G. M. Smith)
Uploader: agmsmith@rogers.com (Alexander G. M. Smith)
Website: http://members.rogers.com/agmsmith/
Version: 1.65
Type: internet & network/e-mail
Requires: BeOS 5.0+
Related things: http://www.paulgraham.com/spam.html, http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
AGMSBayesianSpam is a set of BeOS programs for classifying e-mail messages and other text as either spam or genuine. "Spam" is the colloquial name for unwanted junk messages, usually advertising. The name comes from a 1970s Monty Python comedy skit involving lots of unwanted Spam, which is the name for the spicy ham in a can made by the Hormel Foods company, originally from Austin, Minnesota, USA. The program classifies messages as spam or genuine (sometimes called "ham") based on the words they contain and on previous messages which the user has identified as spam or genuine. It's implemented as a server program (AGMSBayesianSpamServer), which keeps track of the word list, and a Mail Daemon Replacement add-on (AGMSBayesianSpamFilter), which uses the server to classify incoming messages. Theoretically other programs, like a news reader, could also use the word database through the scripting interface. There's also a command line interface and a graphical user interface.
If you want to know more about the technique of counting words, have a look at Paul Graham's wonderful write-up at http://www.paulgraham.com/spam.html. This program currently uses an improved method, called Gary-combining, by Gary Robinson. See http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html for Gary's story. There's also the Spambayes mailing list, where further weird experiments are being carried out in the recent renaissance of spam detection methods.
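For readers who want to see the arithmetic, here is a minimal Python sketch of Gary-combining as it is described in Gary Robinson's write-up linked above. It is an illustration, not the code used by AGMSBayesianSpamServer: the function name `gary_combine` and the handling of an empty word list are assumptions, and each per-word probability is assumed to lie strictly between 0 and 1.

```python
import math

def gary_combine(word_probs):
    """Combine per-word spam probabilities into one estimate in 0.0..1.0.

    Sketch of Robinson's combining formula:
      P = 1 - geometric_mean(1 - p_i)
      Q = 1 - geometric_mean(p_i)
      S = (P - Q) / (P + Q), rescaled to (1 + S) / 2
    """
    n = len(word_probs)
    if n == 0:
        return 0.5  # assumption: no words means no evidence either way
    # Geometric means are computed through logarithms to avoid underflow
    # when a message contains many words.
    mean_log_not_spam = sum(math.log(1.0 - p) for p in word_probs) / n
    mean_log_spam = sum(math.log(p) for p in word_probs) / n
    big_p = 1.0 - math.exp(mean_log_not_spam)
    big_q = 1.0 - math.exp(mean_log_spam)
    s = (big_p - big_q) / (big_p + big_q)
    return (1.0 + s) / 2.0
```

A message whose words all have a probability of 0.5 comes out at 0.5, and the result moves toward 1.0 as more of the words lean toward spam.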
Run the AGMSBayesianSpamServer program again. This time it shouldn't complain. Click the "Create" button to make a new database with the default name of "/boot/home/config/settings/AGMSBayesianSpam/AGMSBayesianSpam Database".
Use the "Add Example of Spam/Genuine" button, and select at most 80 files
at a time (otherwise the Tracker/File Requester will lock up and you'll have
to reboot your computer). It will ask you to identify each file as spam or
genuine; you also have the choice of identifying a whole batch of them as
all spam or all genuine.
You can also drag and drop example messages into the bottom half of
the window. Drop in the left side for genuine, right side for spam, but
avoid the middle third of the window.
If you have thousands of messages, use the command line mode.
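The command line mode accepts "set spam" or "set genuine" followed by one or more pathnames (see the help text further down). As a rough sketch, assuming a Python interpreter is available, the helper below splits a long list of files into several command lines. The function name `build_training_commands` and the `batch_size` value are hypothetical; the batching only keeps each command line to a manageable length.

```python
def build_training_commands(paths, kind, batch_size=500):
    """Return a list of argument lists, one per batch of example files.

    kind is "spam" or "genuine"; batch_size is an arbitrary illustrative
    limit on how many pathnames go on one command line.
    """
    if kind not in ("spam", "genuine"):
        raise ValueError("kind must be 'spam' or 'genuine'")
    commands = []
    for start in range(0, len(paths), batch_size):
        batch = list(paths[start:start + batch_size])
        commands.append(["AGMSBayesianSpamServer", "set", kind] + batch)
    return commands

# On a machine with the server installed, each command could be run with
# subprocess.run(command, check=True).
```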
Check for e-mail as usual. If you look at the inbox directory in Tracker, you can add an extra column with the E-mail attribute "Spam/Genuine Estimate" to see how spammy the messages are. 0.0 means the system thinks the message is fully genuine, 1.0 fully spam. Usually if it is over 0.56 (the best cutoff value depends a bit on your database quality, but 0.56 is typical) then it is spam, and the closer it is to 1.0 the more likely it really is spam. But it can be wrong, for things like a friend of yours quoting a spam message. I sort by spam ratio, and manually throw away the messages that are spammy, then I switch the Tracker window back to sorting by thread+date (just a click on the appropriate column title does it) and get on with reading the mail.
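The sort-then-discard routine described above can be sketched in Python as follows. This is only an illustration on plain (subject, estimate) pairs: reading the actual "Spam/Genuine Estimate" attribute from message files is not shown, and the name `triage` is hypothetical.

```python
SPAM_CUTOFF = 0.56  # the typical cutoff mentioned above; tune it for your database

def triage(messages, cutoff=SPAM_CUTOFF):
    """Split (subject, estimate) pairs into (spammy, rest).

    Both lists are sorted with the highest estimate first, so the most
    certain spam is at the top of the spammy list.
    """
    ranked = sorted(messages, key=lambda item: item[1], reverse=True)
    spammy = [item for item in ranked if item[1] > cutoff]
    rest = [item for item in ranked if item[1] <= cutoff]
    return spammy, rest
```

Messages that land just above the cutoff are the ones worth a manual look before deleting, since a friend quoting a spam message can end up there.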
If you turned on the filter option to modify the subject, you'll see spam messages with something like [Spam 95%] in front of the subject (I don't use it because it looks ugly). This only changes the Tracker display of the subject: the actual subject inside the message isn't affected, just the MAIL:subject attribute, which is what the Tracker shows.
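As a small illustration of that subject prefix, the sketch below builds a "[Spam 95%]" style tag from an estimate. How the real filter rounds the percentage, and whether it tags only messages above the cutoff, are assumptions here; `tag_subject` is a hypothetical name.

```python
def tag_subject(subject, estimate, cutoff=0.56):
    """Return the subject with a "[Spam NN%]" prefix when estimate > cutoff."""
    if estimate > cutoff:
        return "[Spam %d%%] %s" % (round(estimate * 100), subject)
    return subject
```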
The accuracy is only as good as your database, so update it with
more example spam and genuine messages. In particular, if it gets the estimate
wrong, add that message to the database to tell it what it should be doing.
You may also want to do this with all your messages (it gives slightly better
results in the long run than just training on the mistakes). A quick way to do
that is to right click on the e-mail in Tracker, and pick Open With...
AGMSBayesianSpamServer.
It should start up and ask you if the message is spam or genuine.
You can also drag and drop the message into the left third of the word list for
genuine messages, or right third for spam messages. Dropping in the middle
third does something else that's mostly harmless and fun.
If you're annoyed by the server window popping up whenever the system checks
for e-mail, you can tell it to hide. Just click the "Server Mode" checkbox.
To make it visible again, start up AGMSBayesianSpamServer (possibly by double
clicking on its icon in /boot/home/config/bin/) and bring up the hidden
window by using the Deskbar.
Besides the graphical user interface, there is also a command line mode. Just type "AGMSBayesianSpamServer help" in the terminal to get a list of the commands and what they do (the ultimate documentation). It also explains all of the mysterious options you see in the graphical user interface. The same commands can be used in scripting, either from some other program or via the "hey" utility which you can get from http://www.bebits.com/app/2042. A useful command, if you have a lot of genuine messages to add, is "AGMSBayesianSpamServer set genuine *" which will use all messages in the current directory as examples of genuine text (use "set spam *" the same way in a directory of spam).
Mon Oct 21 15:35:49 517 /tmp>AGMSBayesianSpamServer help

AGMSBayesianSpamServer - A Spam Database Server
Copyright © 2002 by Alexander G. M. Smith. Released to the public domain.

Compiled on Oct 21 2002 at 13:12:30. $Revision: 1.4 $
$Header: /boot/home/agmsmith/Programming/AGMSBayesianSpam/Server/RCS/AGMSBayesianSpamServer.cpp,v 1.60 2002/10/21 16:41:27 agmsmith Exp $

This is a program for classifying e-mail messages as spam (junk mail which you don't want to read) and regular genuine messages. It can learn what's spam and what's genuine. You just give it a bunch of spam messages and a bunch of non-spam ones. It uses them to make a list of the words from the messages with the probability that each word is from a spam message or from a genuine message. Later on, it can use those probabilities to classify new messages as spam or not spam. If the classifier stops working well (because the spammers have changed their writing style and vocabulary, or your regular correspondants are writing like spammers), you can use this program to update the list of words to identify the new messages correctly.

The original idea was from Paul Graham's algorithm, which has an excellent writeup at: http://www.paulgraham.com/spam.html

Gary Robinson came up with the improved algorithm, which you can read about at: http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html

Thanks go to Isaac Yonemoto for providing a better icon.

Usage: Specify the operation as the first argument followed by more information as appropriate. The program's configuration will affect the actual operation (things like the name of the database file to use, or whether it should allow non-email messages to be added). In command line mode it will do the operation and exit. In GUI/server mode a command line invocation will just send the command to the server. You can also use BeOS scripting (see the "Hey" command which you can get from http://www.bebits.com/app/2042 ) to control the Spam updater. And finally, there's also a GUI interface which shows up if you start it without any command line arguments.

Commands:

Quit
    Stop the program. Useful if it's running as a server.

Get DatabaseFile
    Get the pathname of the current database file. The default name is something like B_USER_SETTINGS_DIRECTORY / AGMSBayesianSpam / AGMSBayesianSpamServer Database

Set DatabaseFile NewValue
    Change the pathname of the database file to use. It will automatically be converted to an absolute path name, so make sure the parent directories exist before setting it. If it doesn't exist, you'll have to use the create command next.

Create DatabaseFile
    Creates a new empty database, will replace the existing database file too.

Delete DatabaseFile
    Deletes the database file and all backup copies of that file too. Really only of use for uninstallers.

Count DatabaseFile
    Returns the number of words in the database.

Set Spam NewValue
    Adds the spam in the given file (specify full pathname to be safe) to the database. The words in the files will be added to the list of words in the database that identify spam messages. The files processed will also have the attribute MAIL:classification added with a value of "Spam" or "Genuine" as specified. If they already have that attribute and it matches the new classification then they won't get processed (and if it is different, they will get removed from the statistics for the old class and added to the statistics for the new one). You can turn off that behaviour with the IgnorePreviousClassification property. The command line version lets you specify more than one pathname.

Count Spam
    Returns the number of spam messages in the database.

Set Genuine NewValue
    Similar to adding spam except that the messages are added to the genuine statistics.

Count Genuine
    Returns the number of genuine messages in the database.

Set IgnorePreviousClassification NewValue
    If set to true then the previous classification (which was saved as an attribute of the e-mail message file) will be ignored, so that you can add the message to the database again. If set to false (the normal case), the attribute will be examined, and if the message has already been classified as what you claim it is, nothing will be done. If it was misclassified, then the message will be removed from the statistics for the old class and added to the stats for the new classification you have requested.

Get IgnorePreviousClassification
    Find out the current setting of the flag for ignoring the previously recorded classification.

Set ServerMode NewValue
    If set to true then error messages get printed to the standard error stream rather than showing up in an alert box. It also starts up with the window minimized.

Get ServerMode
    Find out the setting of the server mode flag.

Flush
    Writes out the database file to disk, if it has been updated in memory but hasn't been saved to disk. It will automatically get written when the program exits, so this command is mostly useful for server mode.

Set PurgeAge NewValue
    Sets the old age limit. Words which haven't been updated since this many message additions to the database may be deleted when you do a purge. A good value is 1000, meaning that if a word hasn't appeared in the last 1000 spam/genuine messages, it will be forgotten. Zero will purge all words, 1 will purge words not in the last message added to the database, 2 will purge words not in the last two messages added, and so on. This is mostly useful for removing those one time words which are often hunks of binary garbage, not real words. This acts in combination with the popularity limit; both conditions have to be valid before the word gets deleted.

Get PurgeAge
    Gets the old age limit.

Set PurgePopularity NewValue
    Sets the popularity limit. Words which aren't this popular may be deleted when you do a purge. A good value is 5, which means that the word is safe from purging if it has been seen in 6 or more e-mail messages. If it's only in 5 or less, then it may get purged. The extreme is zero, where only words that haven't been seen in any message are deleted (usually means no words). This acts in combination with the old age limit; both conditions have to be valid before the word gets deleted.

Get PurgePopularity
    Gets the purge popularity limit.

Purge
    Purges the old obsolete words from the database, if they are old enough according to the age limit and also unpopular enough according to the popularity limit.

Get Oldest
    Gets the age of the oldest message in the database. It's relative to the beginning of time, so you need to do (total messages - age - 1) to see how many messages ago it was added.

Set Evaluate NewValue
    Evaluates a given file (by path name) to see if it is spam or not. Returns the ratio of spam probability vs genuine probability, 0.0 meaning completely genuine, 1.0 for completely spam. Normally you should safely be able to consider it as spam if it is over 0.56. It attaches a MAIL:ratio_spam attribute with the ratio as its float32 value to the file. Also returns the top few interesting words in "words" and the associated per-word probability ratios in "ratios".

Set EvaluateString NewValue
    Like Evaluate, but rather than a file, it evaluates the string argument directly.

ResetToDefaults
    Resets all the configuration options to the default values, including the database name.

InstallThings
    Creates indices for the MAIL:classification and MAIL:ratio_spam attributes on all volumes which support BeOS queries, identifies them to the system as e-mail related attributes (modifies the text/x-email MIME type), and sets up the new MIME type (text/x-vnd.agmsmith.spam_probability_database) for the database file. Also registers names for the sound effects used by the separate filter program (use the installsound BeOS program or the Sounds preferences program to associate sound files with the names).

Set TokenizeMode NewValue
    Sets the method used for breaking up the message into words. Use "Whole" for the whole file (also use it for non-email files). The file isn't broken into parts; the whole thing is converted into words, headers and attachments are just more raw data. Well, not quite raw data since it converts quoted-printable codes (equals sign followed by hex digits or end of line) to the equivalent single characters. "PlainText" breaks the file into MIME components and only looks at the ones which are of MIME type text/plain. "AnyText" will look for words in all text/* things, including text/html attachments. "AllParts" will decode all message components and look for words in them, including binary attachments. "JustHeader" will only look for words in the message header. "AllPartsAndHeader", "PlainTextAndHeader" and "AnyTextAndHeader" will also include the words from the message headers.

Get TokenizeMode
    Gets the method used for breaking up the message into words.

ProcessArgs: The property specified isn't known or doesn't support the requested action., error code $FFFFFFFF/-1 (General OS error) has occured.
AGMSBayesianSpamUpdater shutting down...
Mon Oct 21 15:35:52 518 /tmp>
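The way the two purge limits combine, as the help text describes them, can be sketched in Python like this. It is an interpretation of that description, not the server's code: the function and parameter names are hypothetical, and a word's age is assumed to be the number of the message that last updated it, counted from the beginning of time in the same way as "Get Oldest".

```python
def should_purge(last_update_age, message_count, total_messages,
                 purge_age=1000, purge_popularity=5):
    """Return True when a word is both old enough and unpopular enough.

    last_update_age: number of the message that last updated the word,
        counted from the first message added (0, 1, 2, ...).
    message_count: how many messages the word has been seen in.
    total_messages: messages added to the database so far.
    """
    # Same conversion the help text gives for "Get Oldest".
    messages_ago = total_messages - last_update_age - 1
    old_enough = messages_ago >= purge_age
    unpopular_enough = message_count <= purge_popularity
    return old_enough and unpopular_enough
```

With the suggested limits of 1000 and 5, a word seen in six or more messages is never purged, no matter how long ago it last appeared.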
Another advanced trick is to load the list of words into Gobe Productive's spreadsheet, so that you can find the most popular word or chart the word frequencies. Unfortunately it can only handle about 16000 words. To do that, start up Gobe Productive, pick Open, then from the file requester's "Document Type" menu, pick "Spreadsheet" and then in the submenu pick "Tab-delimited text". Then navigate to the database; the default location is "/boot/home/config/settings/AGMSBayesianSpam/AGMSBayesianSpam Database". Have fun!
I did some tests with tokenizing different parts of mail messages to see what would work best.
The Database:
341 training genuine messages, 406 training spam messages (or 398 when
parsing due to a bug with messages that don't have body text).
40 test genuine messages, 40 test spam messages, all more recent than the
training ones.
Spam threshold is 0.56, Gary-combining method.
The results:
Tokenizing Method | Genuine Test Details | Genuine Accuracy | Spam Test Details | Spam Accuracy |
---|---|---|---|---|
Just headers | Genuine .181352 to .557881, one false positive (a mailbox full announcement). | 2.5% wrong. | Spam .450602 to .750511, 21 false negatives. | 52.5% wrong. |
Whole raw message text | Genuine .163027 to .627022, 3 false positives. | 7.5% wrong. | Spam .509355 to .993985, 1 false negative. | 2.5% wrong. |
Message parsed into parts plus header | Genuine .168857 to .609005, 4 false positives. | 10% wrong. | Spam .614564 to .994364, 0 false negatives. | 0% wrong. |
Message parsed into parts, no header data | Genuine .220161 to .631161, 5 false positives. | 12.5% wrong. | Spam .592501 to .994444, 0 false negatives. | 0% wrong. |
Any text parts and header | Genuine .162697 to .614136, 4 false positives. | 10% wrong. | Spam .614973 to .994362, 0 false negatives. | 0% wrong. |
Any text parts, no headers | Genuine .221923 to .635487, 6 false positives. | 15% wrong. | Spam .594271 to .994441, 0 false negatives. | 0% wrong. |
text/plain parts (including body text) | Genuine .137869 to .583192, 3 false positives. | 7.5% wrong. | Spam .448059 to .994119, 17 false negatives. | 42.5% wrong. |
Only text/plain sub-parts, no headers. 150 spam and 1 genuine training message had no words! | Genuine .219169 to .696899, 9 false positives. | 22.5% wrong. | Spam .660755 to .994116, 0 false negatives, 27 had no words. | 0% wrong. |
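The percentages in the table are simply the number of misclassified test messages divided by the 40 messages in each test set, so 3 false positives out of 40 is 7.5% wrong. A small Python sketch of that calculation follows. The function name `error_rates` is hypothetical, and counting a score of exactly 0.56 as genuine is an assumption.

```python
def error_rates(genuine_scores, spam_scores, cutoff=0.56):
    """Return (genuine % wrong, spam % wrong) for two lists of estimates.

    A genuine message scoring above the cutoff is a false positive;
    a spam message scoring at or below it is a false negative.
    """
    false_positives = sum(1 for score in genuine_scores if score > cutoff)
    false_negatives = sum(1 for score in spam_scores if score <= cutoff)
    return (100.0 * false_positives / len(genuine_scores),
            100.0 * false_negatives / len(spam_scores))
```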
The results look good for the whole message tokenizing method (which also works on non-email files) and for the all text parts plus header. Since the text parts method doesn't add lots of garbage words to the database from trying to find words in binary attachments, it's now the default setting.
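To make the "all text parts plus header" idea concrete, here is a minimal sketch using Python's standard email module. It is not the server's tokenizer: the definition of a word (the `WORD_RE` pattern), the lower-casing, and the function name are all assumptions for illustration. Quoted-printable and base64 bodies are decoded by `get_payload(decode=True)`.

```python
import email
import re

WORD_RE = re.compile(r"[A-Za-z0-9']+")  # hypothetical definition of a "word"

def any_text_and_header_words(raw_message):
    """Collect words from the top-level headers and every text/* part."""
    msg = email.message_from_string(raw_message)
    chunks = ["%s: %s" % (name, value) for name, value in msg.items()]
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skips multipart containers and binary attachments
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset name in the message
            chunks.append(payload.decode("latin-1"))
    return [word.lower() for word in WORD_RE.findall("\n".join(chunks))]
```

Because binary attachments are skipped rather than scanned, none of their garbage "words" end up in the list.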
The header only method is pretty good too for identifying genuine messages, and so-so for spam messages. That may make it useable for pre-download tests (delete some of the spam on the mail server before downloading it, without worrying about deleting too many genuine messages).
The various versions released to the public. These are actually several accumulated minor changes, which you can see by looking at the log in the top of the source code files.
Released to the public domain in 2002 by the author, Alexander G. M. Smith.