A case study on the emerging technology of E-Business Intelligence.
Fall 1999
Nicole Wallace, Brian White, Temba Msezane.
As described by its founders, NetSapien is a breakthrough technology that allows companies to review the Internet sites of potential competitors and analyze the content of these sites and their postings for the intent of that content. That is, given client criteria for what constitutes logo infringement, attempts to fraudulently pose as another company, attempts to confuse the client's consumers, or other diversion of revenue belonging to the client, NetSapien can go out on the Web and interpret what messages or meanings other sites intend to send to the client's potential customers. Once NetSapien has gathered this information, human intelligence (business analysts) is also applied to ensure a thorough business analysis. NetSapien is enabled by the following technologies:
· a custom Webcrawler,
· Bots,
· an inference engine,
· and other proprietary search tools and algorithms
How exactly does NetSapien work?
The search begins by using client-specified data to constantly search for sites with relevant topics. It then performs an ‘intelligent’ or deep crawl to look at each linked page, and applies an algorithm to determine whether a link should be examined further. The longer the spider searches on the client-specified topics, the ‘smarter’ it gets about them.
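Because NetSapien's crawler is proprietary, the following is only a minimal sketch of the general idea of a relevance-gated deep crawl, not Cyveillance's method. The names `focused_crawl`, `fetch`, and `is_relevant` are hypothetical, and `fetch` is assumed to return a page's text and its outgoing links.

```python
from collections import deque

def focused_crawl(seeds, fetch, is_relevant, max_pages=100):
    """Breadth-first crawl that follows links only from pages judged relevant.

    fetch(url) is assumed to return (text, links); is_relevant(text) stands in
    for whatever client-specified relevance test is used (an assumption).
    """
    queue = deque(seeds)
    visited = set()
    relevant = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # never fetch the same page twice
        visited.add(url)
        text, links = fetch(url)
        if is_relevant(text):
            relevant.append(url)
            # Only links found on relevant pages are examined further.
            queue.extend(link for link in links if link not in visited)
    return relevant
```

In this sketch an irrelevant page ends that branch of the crawl, which is one simple way to decide whether a link should be examined further.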
The next step filters the results, deleting any broken links and previously searched pages.
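A filtering step of this kind might look roughly like the sketch below. It assumes each crawled result arrives as a (URL, HTTP status) pair and treats any status of 400 or above, or a missing status, as a broken link. The function names and the normalization rules are illustrative assumptions, not NetSapien's actual logic.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Lower-case the scheme and host and drop the fragment,
    so trivially different URLs compare equal."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

def filter_results(crawled, already_seen):
    """Keep only live, not-yet-seen pages.

    crawled: iterable of (url, http_status) pairs (status may be None).
    already_seen: set of normalized URLs; updated in place.
    """
    kept = []
    for url, status in crawled:
        if status is None or status >= 400:
            continue  # broken link
        key = normalize(url)
        if key in already_seen:
            continue  # previously searched page
        already_seen.add(key)
        kept.append(key)
    return kept
```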
This is where the inference engine steps in. Here, algorithms are plugged into the engine to prioritize the results based on the client’s business criteria. The engine looks through all the pages, recognizing text, video, audio, hidden text, meta tags, and links, to assess how revenue generation is to occur and the intent of the content. It then groups the pages back into sites to show the most relevant examples from each site.
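The actual prioritization algorithms are not public. As a stand-in, the sketch below scores each page with a simple keyword-weight heuristic (not an inference engine), then groups pages by host and keeps the best-scoring page as each site's most relevant example. The weights and function names are hypothetical.

```python
from urllib.parse import urlsplit

def score_page(text, weights):
    """Sum the client-assigned weight of every word on the page."""
    return sum(weights.get(word, 0.0) for word in text.lower().split())

def rank_sites(pages, weights):
    """Group pages into sites and rank sites by their best page.

    pages: dict mapping url -> page text.
    Returns a list of (host, best_url, best_score), highest score first.
    """
    best = {}
    for url, text in pages.items():
        host = urlsplit(url).netloc.lower()
        score = score_page(text, weights)
        if host not in best or score > best[host][1]:
            best[host] = (url, score)
    rows = [(host, url, score) for host, (url, score) in best.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```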
NetSapien technology then extracts the relevant data from each page and enters it into a database according to the client’s business criteria. Some questions may be:
· Is this page generating revenue?
· Is it domestic or international?
· Does this target specific clientele?
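To make the extraction step concrete, here is a small sketch that answers two of the questions above for a page and stores the answers in a database. Everything in it is an assumption for illustration: the revenue cue phrases, the use of the host-name suffix as a crude domestic/international test, and the table layout.

```python
import sqlite3
from urllib.parse import urlsplit

# Hypothetical cues suggesting that a page is selling something.
REVENUE_HINTS = ("buy now", "add to cart", "checkout")

def extract_record(url, text, domestic_suffixes=(".com", ".us")):
    """Turn one page into a row of answers to the client's questions."""
    host = urlsplit(url).hostname or ""
    lowered = text.lower()
    return {
        "url": url,
        "generates_revenue": int(any(hint in lowered for hint in REVENUE_HINTS)),
        # Crude assumption: the host suffix indicates where the site operates.
        "is_domestic": int(host.endswith(domestic_suffixes)),
    }

def store(conn, record):
    """Insert or update the record in a simple findings table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS findings "
        "(url TEXT PRIMARY KEY, generates_revenue INTEGER, is_domestic INTEGER)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO findings "
        "VALUES (:url, :generates_revenue, :is_domestic)",
        record,
    )
```

A real system would need far richer criteria (for example, for the targeted clientele), which the source does not describe.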
The last step prepares the data for easy formatting and analysis. Throughout the process, a learning/feedback loop constantly updates the data.
· Automotive Industry
Challenge:
A major car manufacturer that sells through traditional distribution channels needed a mechanism to track and manage sales of parts and accessories over the Internet, with an emphasis on locating unauthorized distributors.
Solution:
Cyveillance identifies sites selling or distributing this client's parts and accessories via the Internet. This feedback enables the manufacturer to manage distribution channels and price points, maintain the quality of their product sold over the Internet and maximize their e-Business objectives.
Challenge:
A leading computer manufacturer's Web site sells $15 million in merchandise daily. However, this volume has spurred other sites to emulate the look and feel of the client's popular Web site, in an effort to divert traffic and revenue that belongs to our client.
Solution:
By identifying unauthorized sites diverting eyeballs and selling products, Cyveillance has prevented millions of dollars in revenue leakage each day through unauthorized sales of our client’s products.
· Music Distribution
Challenge:
The leading recording industry organization sought a way to track and prioritize the thousands of sites that offer nearly half a million MP3 files, compressed music files typically available for free download.
Solution:
Cyveillance provided the client with a system that consistently locates and prioritizes sites containing large numbers of MP3 files, thus enabling them to effectively manage the situation. By using Cyveillance, the association has halted distribution of the equivalent of more than 7,000 downloadable music CDs.
· Music Licensing
Challenge:
A top music licensing organization sought an efficient, cost-effective way to identify and collect licensing fees from commercial or promotional sites streaming music owned by its members. Given the large and ever-increasing number of sites on the Internet, undertaking this task manually was not a feasible option for the organization.
Solution:
Cyveillance teamed with the client to develop a custom application of Cyveillance’s technology to handle the task. Today this powerful technology continuously scours the Internet to identify specific song titles of works being performed on Internet sites. The technology then prioritizes the sites based on client criteria and has identified more than $6 million in music licensing opportunities.
Challenge:
The Web site of a national media company contains proprietary content that is not available for distribution. Because this site employs an advertising e-Business model, lost or diverted traffic means lost ad revenue.
Solution:
Every month, Cyveillance enables the client to reclaim thousands in ad revenue that would otherwise be lost, by identifying sites stealing their proprietary content. Additionally, Cyveillance provides the publishing company with a proactive means of protecting its proprietary content.
Challenge:
A major pharmaceutical manufacturer needed a way to identify misuse of its domain name and trademarked drug name, and any use of the drug's name in ways not consistent with its corporate policies.
Solution:
Cyveillance has identified several hundred sites misusing the drug's name within the domain, in the context of promoting an herbal alternative and on sites selling placebo versions of that drug. Our work has not only boosted revenues for the client, it has also significantly decreased their risk of liability and brand dilution.
Challenge:
A renowned retailer recently set forth a policy stating that they would be the only site authorized to sell their product over the Internet. They sought a proactive method of implementing their policy, thus controlling unauthorized distribution.
Solution:
Cyveillance identifies and prioritizes sites selling their products so that they can effectively capture that revenue stream and prevent cannibalization of their offline distribution channels.
Since NetSapien is a proprietary technology, we cannot gather information to determine the algorithms used to generate such relevant and useful data. Of the technologies employed, however, the inference component of the search engine stands out as the piece of the process (along with the unspecified algorithms) that makes the technology more useful than other search engines. After all, it is the inference engine that gauges the intent of the online messages being sent.
In explaining the technology, it is difficult to say how the engine is constructed. For this reason, we give general explanations of what inference engines are (otherwise known as active logics), some constraints placed on artificial intelligence, and a general model of how the actual inference takes place (abduction versus deduction).
Active logics (Inference Engines)
Active logics are a family of inference engines that incorporate a history of their own reasoning as they run. At any time T, an active logic has a record of its reasoning at all times prior to T. It also knows that the current time is T. As it continues to reason from time T, that reasoning is also recorded in the history, and is marked at time T+1 as having occurred at time T. Thus an active logic records the passage of time in discrete steps, and the "current" time slides forward as the system runs. It is convenient to regard its current inferences as occurring in a working memory, that is then transferred to the history (or long-term memory) in the next time-step.
The key aspect that makes such logics different from traditional temporal logics and from simple archival "dumps" is that in active logics the current time is itself noted in the working memory, as Now(T), and this changes to Now(T+1) one step later. (A time-step should be thought of as very fast, perhaps 0.1 sec, in correspondence with the performance of elementary cognitive tasks by humans.) Thus active logics "ground" Now in terms of real time-passage during reasoning.
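The bookkeeping described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: beliefs are plain strings, rules are (premises, conclusion) pairs, and each time-step applies every rule once. The class and method names are invented for illustration and do not come from any active-logic implementation.

```python
class ActiveLogic:
    """Toy step-wise reasoner that records its own reasoning history."""

    def __init__(self, rules):
        self.rules = rules    # list of (frozenset_of_premises, conclusion)
        self.time = 0         # the T in Now(T)
        self.working = set()  # inferences made at the current time
        self.history = []     # (time, beliefs) pairs for all earlier times

    def observe(self, fact):
        """Add an observation to the working memory at the current time."""
        self.working.add(fact)

    def step(self):
        """Advance from Now(T) to Now(T+1)."""
        known = set(self.working)
        for _, beliefs in self.history:
            known |= beliefs
        # One round of inference: conclusions whose premises are all known.
        derived = {c for premises, c in self.rules
                   if premises <= known and c not in known}
        # What was in working memory at T is filed in the history as time T.
        self.history.append((self.time, frozenset(self.working)))
        self.time += 1
        self.working = derived
        return ("Now", self.time)
```

With the rules a → b and b → c and the observation a at time 0, the conclusion b appears at time 1 and c at time 2, while the history records what was believed at each earlier step.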
Some Problems with Traditional Artificial Intelligence (AI)
Critics of AI often remark that AI programs are "stupid": they do not "really" understand anything, and thus are easily thrown into disarray and made useless. To some extent this criticism is well taken: most AI systems break down when conditions vary even slightly outside of defined bounds. A "smart" agent should be flexible enough to take in stride many kinds of incoming information: contradictions, nonsense, change of topic, ambiguity, and so on. Yet when the defined bounds are violated, systems tend to be unable to respond reasonably, for example by recognizing that they cannot correctly parse the input, that a contradiction has occurred, or that a belief must be revised. NetSapien addresses this through its filtering step and through the application of human intelligence (in the form of the business analyst).
Inference by abduction versus deduction
Abductive inference is the process of concluding, from the rule A → B and the observation that B is true, that A might have caused B to be true. It is an approximate inference: abduction, in contrast to deduction, is not sound. Engineering design and configuration are likely to be abductive rather than deductive, since design is the task of finding a structure given a functional specification. For example, when a designer tries to realize a function F1 for the artifact being designed, and his design knowledge tells him that a component C1, among others, realizes F1 (i.e. C1 → F1), the designer concludes by abduction to select C1. If an inconsistency occurs in later design phases, he will probably replace C1 with another component that can also realize F1. Abductive reasoning can also be found in medical diagnostics: given rules of the form disease → symptoms, the doctor concludes by abduction that a disease is present because of the observed symptoms.
To build an abductive inference engine, we need a component responsible for generating hypotheses and a deductive inference engine (such as the client specifications at Cyveillance). First, a hypothesis H (component or disease) must be generated to account for the fact F (function or symptoms). The deductive inference engine then tries to prove F on the basis of H. In constructing such an engine, the difficulty lies not in generating hypotheses and testing their validity by deduction, but in finding valid hypotheses in a controlled manner, given the enormous search space of possible hypotheses in design or diagnosis, or in the vast content of the Internet.
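The generate-then-prove structure just described can be shown with a deliberately tiny example. The sketch below assumes rules of the form cause → effects stored in a dictionary. It proposes every cause that accounts for at least one observation, then keeps those from which all observations can be deduced. The rule base and function names are made up for illustration, and no attempt is made to control the search space, which is the hard part in practice.

```python
def deduce(hypothesis, rules):
    """Forward-chain from a hypothesis; rules maps a fact to the facts it implies."""
    derived = {hypothesis}
    frontier = [hypothesis]
    while frontier:
        fact = frontier.pop()
        for consequent in rules.get(fact, ()):
            if consequent not in derived:
                derived.add(consequent)
                frontier.append(consequent)
    return derived

def abduce(observations, rules):
    """Return every hypothesis H from which all observed facts F can be proven."""
    observations = set(observations)
    # Hypothesis generation: any cause whose rule yields an observed fact.
    candidates = [cause for cause, effects in rules.items()
                  if observations & set(effects)]
    # Deductive check: keep hypotheses that account for all the observations.
    return sorted(h for h in candidates if observations <= deduce(h, rules))
```

Given the rules flu → {fever, cough} and cold → {cough}, observing only a cough yields both hypotheses, which illustrates why abduction is not sound: the observation alone cannot tell them apart.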
Cyveillance's clients include Bell Atlantic, Dell Computer Corp., Levi Strauss & Co., Mobil Corporation, Time Inc.-New Media, Washington Post, Newsweek Interactive, Bell South, ASCAP and the RIAA, in addition to leading companies in the pharmaceutical, financial services and computer industries, among others.
One of the limitations of the NetSapien technology is data context. Data context is a problem for any search agent: it results from the inability to distinguish the context in which the searched terms are being used. A similar problem is the inability to perform qualitative comparisons. With the exponential growth of the Internet and the limitations on search speed and thoroughness, search agents will have a tough time keeping up. Furthermore, as the information continues to grow, the number of people needed to digest and analyze all the returned information will grow as well.
Another limitation concerns the ability to improve the current spiders or search agents. Although the technology is patented, it will become obsolete as new algorithms and search techniques are developed. Some current competitors include Digimarc, Inforian's Quest, and Agent Technologies' Copernic.
Digimarc is currently developing a fundamentally new way to access and use the Internet by embedding imperceptible digital data in traditional and digital media. This includes printed materials such as magazine advertisements, articles, covers and subscription cards; direct mailers; packaging; debit and credit cards; greeting cards; coupons; catalogues; tickets; business cards; and digital content such as video, images and other creative properties in digital form. The embedded data creates a bridge between these materials and the Internet, permitting users to link directly to relevant Web destinations without any typing or mouse clicks. Digimarc's technology gives digital capabilities to physical media, allowing new forms of interaction with the digital world and enhancing publishing, advertising and electronic commerce.
Several vendors offer tools that make searching for infringing material much easier than it used to be. Inexpensive applications such as Inforian's Quest and Agent Technologies' Copernic search multiple (up to 200) search engines, help gather and index searches, and simplify the entire monitoring process. These tools can locate anything from stolen code to copyrighted files being used across the Web.
Potential
The idea of robots as humanoid machines was first introduced in Karel Capek's 1921 play "R.U.R.," where the playwright conceived Rossum's Universal Robots. Sci-fi writer Isaac Asimov made them famous, beginning with his story collection I, Robot (1950) and continuing through a string of books known as the Robot Series.
On the Web, robots have taken on a new form of life. Since all Web servers are connected, robot-like software is the perfect way to perform the methodical searches needed to find information.
Bots were not invented on the Internet, however. Robotic software is generally believed to have originated with Eliza, one of the first public displays of artificial intelligence. Eliza is a computer program that can engage a human in conversation: it asks the user a question and uses the answer to formulate yet another question. Artificial intelligence is an advanced form of computer science that aims to develop software capable of processing information on its own, without the need for human direction.
A Brief History of WebCrawler
http://webcrawler.com/Help/AboutWC/WCStory.html
Spiders
http://www.cs.indiana.edu/~rawlins/b669-webpages/nreed/spiders.html
http://bots.internet.com/bot/what_is_a_bot.html
Aggressive e-Businesses invest large sums of money to get in front of thousands of eyeballs each day and drive them to their sites to buy, browse, learn and create customer loyalty. Firms are consistently faced with questions of:
· How can I recapture diverted traffic?
· Are other sites using browser magnets to lure traffic that might otherwise arrive at my site?
· Are other sites emulating my site to divert customers away?
· What site features are most prevalent in specific e-Retailing environments?
If firms are not sure whether they’re capturing all the eyeballs that should be visiting their site, buying their products or reading their proprietary content, then they are definitely candidates for the NetSapien technology.
The Internet is changing the entire landscape of business and is impacting the way many firms interact with each other and with consumers. This new economy is projected to produce more than a trillion dollars in e-Business trade by the year 2003. The rapid evolution of e-Business to support more and more business models will unfortunately lead to more cases of Internet fraud and other illegal activities on the Web. Given this, it is critical for companies to keep a finger on the pulse of all activities surrounding their business that may pose threats. Hence the relevance of the NetSapien technology. In the future, we will see such technologies evolve as more firms enter this space with competing products. Cyveillance, being the first mover and creator of the NetSapien technology, will continue to build on its experience in this area while expanding its product offerings. It will become more and more evident to companies employing e-Business strategies that using technologies such as NetSapien offers an immediate and enormous return on investment.