In this whitepaper we consider the problem of outbound filtering of email to prevent accidental leakage of confidential information. We examine how to do this with the GPLed open-source spam filter CRM114, and we test the accuracy of this filter against a corpus of more than 10,000 hand-classified emails (both confidential and non-confidential) in Japanese. We look at the moving parts involved in these filters and how they can be set up. The results show that a hybrid of multiple CRM114 filters outperforms a human-crafted regular-expression filter by nearly 100x in recall, detecting more than 99.9% of confidential documents with a simultaneous false alarm rate of less than 5.3%. Because the programmers who created the machine-learning programs cannot read or write Japanese, this problem is an almost ideal instance of Searle's “Chinese Room” problem.
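To make the two reported figures concrete, the following is a minimal sketch of how recall and false alarm rate could be computed for a hybrid of several filters. It is not the CRM114 implementation: the names `hybrid` and `recall_and_false_alarm` are hypothetical, each filter is modeled as a plain function from text to a yes/no decision, and the "flag if any member filter flags" combination rule is an assumption made for illustration, since the abstract does not state how the hybrid combines its members.

```python
from typing import Callable, Iterable, Sequence, Tuple

# A filter is modeled as a function that returns True when a message
# looks confidential. This is an illustrative stand-in, not CRM114's API.
Filter = Callable[[str], bool]


def hybrid(filters: Sequence[Filter]) -> Filter:
    """Combine filters with an OR rule (assumed here, not taken from the paper)."""
    def combined(text: str) -> bool:
        return any(f(text) for f in filters)
    return combined


def recall_and_false_alarm(
    flt: Filter, labeled: Iterable[Tuple[str, bool]]
) -> Tuple[float, float]:
    """Return (recall, false alarm rate) over (text, is_confidential) pairs.

    recall           = flagged confidential / all confidential
    false alarm rate = flagged non-confidential / all non-confidential
    """
    tp = fn = fp = tn = 0
    for text, is_confidential in labeled:
        flagged = flt(text)
        if is_confidential:
            if flagged:
                tp += 1
            else:
                fn += 1
        else:
            if flagged:
                fp += 1
            else:
                tn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, false_alarm
```

Under an OR rule, adding a member filter can only keep recall the same or raise it, at the cost of a false alarm rate that can only stay the same or rise. That trade-off is consistent with the high recall and nonzero false alarm rate reported above, though the actual combination used may differ.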
Mitsubishi Electric Research Laboratories
William Yerazunis is a Senior Principal Research Scientist and Team Lead at Mitsubishi Electric Research Laboratories in Cambridge, Massachusetts, USA. He received the B.S., M.Eng., and Ph.D. degrees in Systems Engineering from Rensselaer Polytechnic Institute in 1978, 1979, and 1987, respectively. Since then, he has worked in a number of fields, including optics, machine vision, and signal processing (for General Electric's jet engine manufacturing); computer graphics (at Rensselaer's Center for Interactive Computer Graphics); artificial intelligence and parallel symbolic computation (at Rensselaer); radio astronomy and SETI (at Harvard University); transplant immunology (for the American Red Cross); virtual and augmented reality, real-time physical and chemical sensing, and ubiquitous computing (for Mitsubishi Electric); and real-time statistical categorization of texts (the CRM114 Discriminator anti-spam system). He is also a Visiting Scientist at Dublin City University in Dublin, Ireland. He has appeared on numerous educational television shows, holds 35 U.S. patents, has an Erdős number of three and a Kevin Bacon number of three, holds FCC ham radio Extra class and Commercial Broadcast/radar engineer licenses, and was voted one of the 50 most powerful people in networking by NetworkWorld magazine in 2006.