Executive Overview (in true MBA fashion, here you can learn data mining buzzwords without really understanding the material.)
Requirements for Successful Data Mining
Who Uses Data Mining? (Hey, I can use Excel to analyze data. Why do I need data mining?)
Who Supplies Data Mining Products and Services?
Data mining is the extraction of hidden predictive information from large databases, which helps businesses to make proactive, knowledge-driven decisions. Data mining tools scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. The automated, prospective analysis offered by data mining to predict future trends and behaviors moves beyond the analyses of past events provided by retrospective tools. Data mining answers business questions that traditionally have been too time-consuming or complex to answer.
This document was prepared by:
for Patterns of Electronic Commerce, Goizueta Business School, Emory University
March 1997.
Data mining derives its name from the similarities between searching for valuable business information in a large database and mining a mountain for a vein of valuable ore. Both processes require sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides.
Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing specific capabilities:
Automatic prediction of trends and behaviors
Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered quickly and directly from the data. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
Automatic discovery of previously unknown patterns
In one step, data mining tools sweep through databases and identify previously hidden patterns. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.
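The retail example above can be sketched as a simple co-occurrence count. This is a minimal illustration, not the method of any particular product: the function name `frequent_pairs`, the `min_support` threshold, and the sample baskets are all assumptions made up for this sketch.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return product pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical transactions, invented for illustration only.
baskets = [
    ["bread", "milk", "diapers"],
    ["beer", "diapers", "bread"],
    ["beer", "diapers", "milk"],
    ["bread", "milk"],
]

pairs = frequent_pairs(baskets, min_support=2)
for pair, n in sorted(pairs.items()):
    print(pair, n)
```

Counting every pair is practical only for small product catalogs; tools built for large databases prune the search space rather than enumerating all combinations.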
* Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
* Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
* Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID). Both provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome.
* Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset. Sometimes called the k-nearest neighbor technique.
* Rule induction: The extraction of useful if-then rules from data based on statistical significance.
* Data visualization: The visual interpretation of complex relationships in multidimensional data.
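To make one of the techniques above concrete, here is a minimal sketch of the nearest neighbor method. It assumes numeric features compared by Euclidean distance and a simple majority vote; the function name `knn_classify` and the customer records (age, income in thousands) are hypothetical.

```python
import math
from collections import Counter


def knn_classify(record, history, k=3):
    """Classify a record by majority vote among its k closest historical records.

    history is a list of (features, label) pairs.
    """
    nearest = sorted(history, key=lambda item: math.dist(record, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical historical records: (age, income in thousands) -> mailing outcome.
history = [
    ((25, 30), "ignored"),
    ((30, 35), "ignored"),
    ((28, 32), "ignored"),
    ((45, 80), "responded"),
    ((50, 90), "responded"),
    ((48, 85), "responded"),
]

print(knn_classify((47, 82), history))
print(knn_classify((27, 31), history))
```

In practice the features would usually be rescaled first, since a variable measured in large units would otherwise dominate the distance.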
Data mining techniques are the result of a long process of research and product development. This evolution began when business data were first stored on computers, continued with improvements in data access, and more recently produced technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Necessity drove the invention of data mining: commercial databases are growing at unprecedented rates, and companies need a technology to deal with the data overload.
Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:
* Massive data collection and storage
* Powerful multiprocessor computers
* Data mining algorithms
The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least ten years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.
| Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics |
| --- | --- | --- | --- | --- |
| Data Collection (1960s) | "What was my average total revenue over the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery |
| Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level |
| Data Navigation (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, IRI, Arbor, Redbrick, Evolutionary Technologies | Retrospective, dynamic data delivery at multiple levels |
| Data Mining (2000) | "What's likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Lockheed, IBM, SGI, numerous startups (nascent industry) | Prospective, proactive information delivery |
Requirements for Successful Data Mining
Successful data mining has five basic requirements:
When data mining tools are implemented on high performance parallel processing systems, they can analyze massive databases in minutes. Faster processing means that users can automatically experiment with more models to understand complex data. High speed makes it practical for users to analyze huge quantities of data. Larger databases, in turn, yield improved predictions. Databases can be larger in two senses:
* Higher dimensionality. In hands-on analyses, analysts must often limit the number of variables they examine because of time constraints. Yet variables that are discarded because they seem unimportant may carry information about unknown patterns. High performance data mining allows users to explore the full dimensionality of a database, without preselecting a subset of variables.
* Larger samples. Larger samples yield lower estimation errors and variance, and allow users to make inferences about small segments of a population.
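The claim about larger samples can be illustrated with the standard error of a sample mean, which shrinks with the square root of the sample size. The sketch below assumes independent observations with a known standard deviation; the function name and the numbers are illustrative only.

```python
import math


def standard_error(std_dev, n):
    """Standard error of the mean for n independent observations."""
    return std_dev / math.sqrt(n)


# Illustrative standard deviation of 10 units per record.
for n in (100, 400, 10_000):
    print(n, standard_error(10, n))
```

Quadrupling the sample halves the error, which is why a very large database can support estimates for small segments that a modest sample cannot.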
A wide range of companies have deployed successful applications of data mining. While early adopters of this technology have tended to be in information-intensive industries such as financial services and direct mail marketing, the technology is applicable to any company looking to leverage a large data warehouse to better manage its customer relationships. Two factors are critical for success with data mining: a large, well-integrated data warehouse and a well-defined understanding of the business process within which data mining is to be applied (such as customer prospecting, retention, or campaign management).
Many groups are already engaged in data mining projects, from research and experimentation by individual analysts to completed products that have already added value to the business.
Some successful application areas include:
* A pharmaceutical company can analyze its recent sales force activity and the results of that activity to improve targeting of high-value physicians and determine which marketing activities will have the greatest impact in the next few months. The data needs to include competitor market activity as well as information about the local health care systems. The results can be distributed to the sales force via a wide-area network that enables the representatives to review the recommendations from the perspective of the key attributes in the decision process. The ongoing, dynamic analysis of the data warehouse allows best practices from throughout the organization to be applied in specific sales situations.
* A credit card company can leverage its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product. Using a small test mailing, the attributes of customers with an affinity for the product can be identified. Recent projects have indicated more than a 20-fold decrease in costs for targeted mailing campaigns over conventional approaches.
* A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services. Using data mining to analyze its own customer experience, this company can build a unique segmentation identifying the attributes of high-value prospects. Applying this segmentation to a general business database can yield a prioritized list of prospects by region.
* A large consumer package goods company can apply data mining to improve its sales process to retailers. Data from consumer panels, shipments, and competitor activity can be applied to understand the reasons for brand and store switching. Through this analysis, the manufacturer can select promotional strategies that best reach its target customer segments.
These examples share clear common ground: they leverage the knowledge about customers implicit in a data warehouse to reduce costs and improve the value of customer relationships. These organizations can now focus their efforts on the most important (profitable) customers and prospects, and design targeted marketing strategies to best reach them.
IBM provides a data mining service called Quest. Its technologies include mining for association rules, sequential patterns, classification, and time-series clustering. IBM is making these technologies available through its data mining product, IBM Intelligent Miner.
NeoVista has created an integrated suite of data mining tools called Decision Series. The Decision Series is highly scalable, which gives consumers the flexibility to address the dynamic changes in the size and nature of their data or processing needs.
Ultragem has specialized in "genetic" data mining technology. Genetic data mining is the automatic extraction of prediction and classification rules from databases using advanced genetic algorithm technology.
Dun & Bradstreet has jumped on the data mining bandwagon as well and launched a data mining program called the Decision Support Suite through its subsidiary, Pilot Software, Inc. It is designed to be used in the on-line analytical processing (OLAP) environment.
Several other companies also market data mining products and services, including the major relational database companies: Oracle, Sybase, and Informix. Please see our section on Cool Links to find a number of other vendors.
Data mining should not be viewed as a panacea. "Data mining is not as free and fuzzy as people are being led to believe," says Herbert Edelstein, president of Two Crows Corp., a data mining consulting firm based in Potomac, Md. Adds Bernice Grossman, principal of DMRS, a New York City-based data-warehousing consultancy, "The best design and the most brilliant strategic plan all put together in the most accessible, actionable marketing database is virtually useless if you don't spend the time to learn your data. I repeat, it will not matter how much money you spend."
Present-day tools are algorithmically strong but require significant expertise to implement effectively. Nevertheless, these tools can produce results that are an invaluable addition to a business' corporate information assets. As these tools mature, advances in server-side connectivity, the development of business-based models, and user interface improvements will bring data mining into the mainstream of decision-support efforts.
Maybe you don't need a data mining tool so much as you need a data mining application. Vendors offer data mining applications that include customer segmentation, market-basket analysis, and fraud detection, sparing you at least some of the heavy lifting in creating a specific data mine.
Or maybe you don't need a tool or an application so much as you need a complete solution. You could always outsource the whole effort to a company like IceBreaker from Alameda, Calif. You'd ship your data to IceBreaker, and the company would perform the data mining and ship you back the results. IBM also offers a similar service.
In the simplest definition, data mining is a method of applying technologies to use the information stored in a company's existing database more effectively. In a broad sense, this technology is one of the most dramatic uses of a PC's raw computational power.
But there is controversy surrounding the effectiveness of data warehousing and data mining. Skeptics say that data warehousing is actually an expensive step backward, disguised as a step forward. They describe it as stuffing all the data contained in a company's many small databases into one large database, which is then managed by a trained staff and accessed by users through a friendly front end. The detractors see this as revisiting the age of mainframes and dumb terminals. They also say the sheer volume of database records being warehoused would inevitably drag down productivity; lost files and records would take up valuable storage space while important data would remain largely unused.
"If you are going to predict, predict often." - paraphrased from Benn Konsynski
Today data mining exists only in a few large corporations, academic research centers, and large government organizations. This is largely because it requires a large, powerful computer and a large relational database, and because the software is experimental and complex. We believe that within 10 years, individuals will be doing data mining using personal systems. Several developments will allow this to occur.
Once some combination of these requirements occurs, individuals will be able to conduct data mining effectively without access to the corporate data warehouse.