Scientists at Xerox Corporation have invented powerful software that’s clever enough to “read” an electronic document, decide how it should be classified by subject, then route it to the right person’s e-mail address or online document management system – all completely automatically.
The software, a categorizing tool, is intended to help businesses keep their e-document collections orderly and easily accessible. It is available for licensing from Xerox.
“A misshelved book in a library might as well be lost. It’s the same with documents that haven’t been properly categorized; the document itself may have to be recreated,” said Eric Gaussier, a research scientist at the Xerox Research Centre Europe in Grenoble, France. “Our new software can help save time and money and increase productivity. It will ensure that documents are properly classified for future retrieval and that the right information gets into the right hands as quickly as possible.”
Categorizing tools currently available on the market treat each subject category independently of the others and are considered “flat.” For example, although it might seem obvious to humans that biochemistry and biophysics are related categories of information, a flat categorization system wouldn’t make the connection. But the Xerox system, based on patented technologies, uses a hierarchical model that is able to understand the dependency between those two categories and therefore make a more informed decision when classifying a document.
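The difference between a flat and a hierarchical categorizer can be illustrated with a toy sketch. It is not the Xerox algorithm, which is not described here. The taxonomy, the keyword sets, the sample text and the function names below are all hypothetical, and simple keyword counting stands in for a real machine-learning model. The flat version scores every leaf category on its own; the hierarchical version first pools the evidence of related categories under a shared parent, then picks the best category inside the winning parent.

```python
import re
from collections import Counter

# Hypothetical two-level taxonomy: parent -> {leaf category: keywords}.
TAXONOMY = {
    "life sciences": {
        "biochemistry": {"enzyme", "metabolism", "protein"},
        "biophysics": {"membrane", "folding", "diffusion"},
    },
    "finance": {
        "banking": {"loan", "interest", "rate"},
        "insurance": {"premium", "claim", "policy"},
    },
}

def tokenize(text):
    """Count lowercase alphabetic tokens in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def leaf_score(counts, keywords):
    """Number of token occurrences that match a category's keywords."""
    return sum(counts[word] for word in keywords)

def classify_flat(text):
    """Score every leaf independently and return the best one."""
    counts = tokenize(text)
    scores = {
        leaf: leaf_score(counts, keywords)
        for leaves in TAXONOMY.values()
        for leaf, keywords in leaves.items()
    }
    return max(scores, key=scores.get)

def classify_hierarchical(text):
    """Pick the best parent from pooled child evidence, then the best leaf in it."""
    counts = tokenize(text)
    parent_scores = {
        parent: sum(leaf_score(counts, kw) for kw in leaves.values())
        for parent, leaves in TAXONOMY.items()
    }
    parent = max(parent_scores, key=parent_scores.get)
    leaves = TAXONOMY[parent]
    leaf = max(leaves, key=lambda name: leaf_score(counts, leaves[name]))
    return parent, leaf

# Hypothetical document: mostly lab science, with some funding vocabulary.
DOC = ("Loan terms and interest rate for the lab loan: "
       "enzyme metabolism, enzyme assays, membrane folding.")

print(classify_flat(DOC))
print(classify_hierarchical(DOC))
```

In this sample the life-science vocabulary is split between biochemistry and biophysics, so no single leaf outscores banking in the flat version. The hierarchical version adds the two related scores together at the parent level before deciding, which is the kind of dependency a flat system cannot use.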
According to data gathered from a pilot test of the software, people found the right documents more often and faster because the software understood relationships between documents and categories.
Anne-Lise Veuthey, a senior researcher at the Swiss Institute of Bioinformatics, an academic nonprofit foundation that researches and develops technology used in biology, participated in the pilot program. “We’ve found it to be extremely accurate in identifying documents containing the very specific information we need to conduct our research on human genes,” Veuthey said.
Three integrated functions make the Xerox categorization technology unique:
*The system can start right away. Using advanced machine-learning techniques, it quickly learns by itself, from only a few examples, how to classify documents hierarchically into existing categories.
*The technology is easy to use and helps people create a comprehensive way to turn unorganized e-files into cleanly labeled document collections.
*The system can learn entirely new categories on its own. The categorization technology detects new or emerging topics and dynamically suggests new categories to the people who are using the system.
The Right Routing
The Xerox categorizer system can handle documents written in up to 20 languages and can be easily adapted for specific customer requirements. The software intelligently routes documents to the right person based on a pre-set user profile.
“This can be used, for example, to route incoming mail to the person responsible for a given topic and eliminate mail in your inbox you aren’t interested in,” said Gaussier. “Imagine clients’ complaints going directly to the person responsible for handling them and your e-mail inbox containing only what you are interested in.”
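Profile-based routing of this kind can be sketched in a few lines. This is only an illustration under stated assumptions: the `UserProfile` class, the `route` function, the topic names and the example.com addresses are hypothetical and are not part of the Xerox software. The sketch assumes a document has already been assigned a category, then looks up which pre-set profiles list that category as a topic of interest.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UserProfile:
    """Hypothetical pre-set profile: an address plus the topics it handles."""
    email: str
    topics: frozenset

def route(category, profiles):
    """Return the addresses of every profile interested in the category."""
    return [p.email for p in profiles if category in p.topics]

# Hypothetical profiles for illustration only.
PROFILES = [
    UserProfile("support@example.com", frozenset({"complaints", "returns"})),
    UserProfile("research@example.com", frozenset({"biochemistry", "biophysics"})),
]

print(route("complaints", PROFILES))
print(route("gardening", PROFILES))
```

A category that matches no profile returns an empty list, so uninteresting mail reaches nobody's inbox in this sketch; a real deployment would need a policy for such documents.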
The categorization technology was developed by XRCE researchers based on their deep expertise in linguistic analysis and machine-learning techniques. The software is written in Java and can be deployed on multiple platforms including UNIX, Linux and Windows. The company anticipates that the technology will be licensed by software vendors or corporations that wish to incorporate it into document systems focused on areas such as customer relationship management, information retrieval and data management.