Java Data Mining: Strategy, Standard and Practice

'The secret of getting ahead is getting started.' -- Mark Twain (1835-1910)

By Mark F. Hornick; Erik Marcadé; Sunil Venkayala
June 25, 2007 12:00 PM ET

Computerworld - This article is excerpted from Java Data Mining: Strategy, Standard and Practice, by Mark F. Hornick, Erik Marcadé and Sunil Venkayala. Printed with permission from Morgan Kaufmann, a division of Elsevier. Copyright 2007.

As with any technology, the challenge in gaining proficiency is overcoming the fear of venturing into the unknown. As Mark Twain noted, "the secret of getting ahead is getting started," and a strategy to get ahead with data mining is to start with small problems and data sets, learn some basic techniques and processes, and keep practicing. This chapter introduces a small code example to give the reader a feel for the Java Data Mining (JDM) application programming interface (API) in the context of a specific business problem before going into more detailed examples in Parts II and III of this book. The business problem we address involves response modeling, as discussed in Chapter 2, for a fictitious company, DMWhizz, and its product, Gizmos. Rather than dive right into the code, this chapter follows the CRISP-DM data mining process by first discussing the business understanding, data understanding and data preparation phases. Code is shown for the modeling phase, and the evaluation and deployment phases are discussed. No phase is explored in depth; each is covered just enough to give the reader a feel for it.

Business Understanding

The business objective of DMWhizz is to increase the response rate for its latest campaign for a new product, Gizmos. DMWhizz has run many such campaigns, so from an operational standpoint, there is little risk associated with this project. However, this is the first time DMWhizz is employing data mining to try to increase the response rate over previous efforts. Historically, DMWhizz has obtained a 3% response rate from such campaigns, which it views as better than the norm in the retail industry. DMWhizz will be satisfied with anything over a 4% response rate, a 33% relative increase. The company typically sends a campaign to 400,000 of its customers, chosen at random, and gets about 12,000 responses. With the introduction of data mining, DMWhizz expects at least 16,000 responses from a mailing of the same size.
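The campaign arithmetic above can be checked with a few lines of Java. This is a minimal sketch, not code from the book: the class name CampaignMath and its methods are hypothetical names chosen for illustration, and only the figures (400,000 mailings, 3% and 4% response rates) come from the text.

```java
// Hypothetical helper for the campaign arithmetic; not part of the JDM API.
final class CampaignMath {
    private CampaignMath() {}

    /** Expected number of responders for a mailing of the given size. */
    static long expectedResponses(long mailed, double responseRate) {
        return Math.round(mailed * responseRate);
    }

    /** Relative increase from a baseline rate to a target rate. */
    static double relativeIncrease(double baselineRate, double targetRate) {
        return (targetRate - baselineRate) / baselineRate;
    }

    public static void main(String[] args) {
        long mailed = 400_000;
        System.out.println("Baseline responses: " + expectedResponses(mailed, 0.03));
        System.out.println("Target responses:   " + expectedResponses(mailed, 0.04));
        System.out.println("Relative increase:  " + relativeIncrease(0.03, 0.04));
    }
}
```

The relative increase works out to one third, which the text rounds to 33%.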

The business has a base of 1 million existing customers to potentially send an offer to. Although Gizmos is a new product, it is related to several other less-featured products, so DMWhizz can use historical sales information of customers who have bought these other products.

DMWhizz knows that this data mining solution requires known outcomes to build predictive models: in this case, which customers actually purchased Gizmos. Company staffers therefore include in their plan a small-scale trial campaign to collect this data. From the trial campaign, data mining models can be built to predict which of the remaining customers and prospects are likely to purchase Gizmos.

Specifically, DMWhizz takes a 2% random sample of the 1 million potential customers for Gizmos, totaling 20,000. The company mails the offer to these customers and records which customers purchased the item. Based on previous campaigns, it expects a 3% response rate, or 600 customers purchasing Gizmos. This data, with known outcomes both positive and negative, serves as the basis for the modeling process.
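One way to draw such a sample is to shuffle the customer identifiers and keep the first 2%. The sketch below assumes customers are numbered 1 through 1,000,000 and uses a fixed seed so the sample can be reproduced; the class name, method name, identifier scheme and seed are illustrative assumptions, not details from the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sampler; assumes customer IDs are the integers 1..populationSize.
final class TrialSample {
    private TrialSample() {}

    /** Returns a simple random sample (without replacement) of customer IDs. */
    static List<Integer> sampleIds(int populationSize, double fraction, long seed) {
        List<Integer> ids = new ArrayList<>(populationSize);
        for (int id = 1; id <= populationSize; id++) {
            ids.add(id);
        }
        // A seeded shuffle gives a uniformly random, repeatable ordering.
        Collections.shuffle(ids, new Random(seed));
        int sampleSize = (int) Math.round(populationSize * fraction);
        return new ArrayList<>(ids.subList(0, sampleSize));
    }

    public static void main(String[] args) {
        List<Integer> trial = sampleIds(1_000_000, 0.02, 42L);
        System.out.println("Customers in trial campaign: " + trial.size());
        // At the 3% historical rate, this is the number of expected purchasers.
        System.out.println("Expected responders: " + Math.round(trial.size() * 0.03));
    }
}
```

Shuffling the full list is simple but holds every identifier in memory; for a much larger customer base, a streaming method such as reservoir sampling would be a reasonable alternative.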

Technically speaking, this is a classification problem in data mining. Because DMWhizz employees are unfamiliar with the quality and character of the data, they expect to try several types of classification algorithms and to select the one that provides the best lift. Lift essentially indicates how much better the model performs at predicting a particular outcome than randomly selecting cases (in this instance, customers) to include in the campaign. The concept of lift is explored in more detail in Section 7.1.6.
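As a rough illustration of the idea, lift for the top-scoring fraction of customers can be computed as the response rate within that fraction divided by the response rate across all cases. The sketch below assumes each case already has a model score and a known 0/1 response. LiftSketch and liftAtTopFraction are hypothetical names, and this is one common formulation rather than necessarily the definition used in Section 7.1.6.

```java
import java.util.Arrays;

// Hypothetical lift calculation; not part of the JDM API.
final class LiftSketch {
    private LiftSketch() {}

    /**
     * Lift for the top-scoring fraction of cases:
     * (response rate among the top cases) / (response rate among all cases).
     */
    static double liftAtTopFraction(double[] scores, int[] responses, double fraction) {
        int n = scores.length;
        if (n == 0 || responses.length != n) {
            throw new IllegalArgumentException("scores and responses must be non-empty and equal in length");
        }
        if (fraction <= 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        // Order case indexes by descending score.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));

        int topCount = Math.min(n, Math.max(1, (int) Math.round(n * fraction)));
        int topResponders = 0;
        for (int i = 0; i < topCount; i++) {
            topResponders += responses[order[i]];
        }
        int totalResponders = 0;
        for (int r : responses) {
            totalResponders += r;
        }
        if (totalResponders == 0) {
            throw new IllegalArgumentException("lift is undefined when there are no responders");
        }
        double topRate = (double) topResponders / topCount;
        double overallRate = (double) totalResponders / n;
        return topRate / overallRate;
    }

    public static void main(String[] args) {
        // Toy data: ten customers, two responders, both scored highest.
        double[] scores = {0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05};
        int[] responses = {1, 1, 0, 0, 0, 0, 0, 0, 0, 0};
        System.out.println("Lift in top 20%: " + liftAtTopFraction(scores, responses, 0.2));
    }
}
```

A lift of 1 means the model does no better than random selection. In DMWhizz's terms, reaching a 4% response rate against a 3% baseline corresponds to a lift of about 1.33 over the mailed group.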

The database of 1 million customers and prospects includes demographic data such as age, income, marital status and household size. There is also previous purchase data for three related products (Tads, Zads and Fads), indicating whether or not the customer purchased each item. The target attribute, what is to be predicted, is called response. The response attribute contains a "1" if the customer responded to the Gizmos campaign, and "0" if not.

Data Understanding

DMWhizz's database administrators (DBAs) obtain the data and provide it to the data miner to begin data exploration. Three tables are obtained: CUSTOMER, PURCHASES and PRODUCT. As illustrated in Figure 6-1, a customer can have many purchases, and a product can be purchased many times.

Figure 6-1: Entity-relationship diagram of tables
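To make the relationships concrete, the sketch below shows one way the three tables could be combined into per-customer purchase indicators for Tads, Zads and Fads. The excerpt does not give the column names, so the customerId and productId fields, the product ID values, and the PurchaseFlags class are all assumptions made for illustration; in-memory collections stand in for the database tables.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the CUSTOMER, PURCHASES and PRODUCT tables.
final class PurchaseFlags {
    private PurchaseFlags() {}

    /** One PURCHASES row linking a customer to a product (field names are assumed). */
    static final class Purchase {
        final int customerId;
        final int productId;

        Purchase(int customerId, int productId) {
            this.customerId = customerId;
            this.productId = productId;
        }
    }

    /** Groups purchases by customer, resolving product IDs to names via the PRODUCT lookup. */
    static Map<Integer, Set<String>> productsByCustomer(List<Purchase> purchases,
                                                         Map<Integer, String> productNames) {
        Map<Integer, Set<String>> result = new HashMap<>();
        for (Purchase p : purchases) {
            String name = productNames.get(p.productId);
            if (name == null) {
                continue; // skip purchases of products missing from the lookup
            }
            result.computeIfAbsent(p.customerId, k -> new TreeSet<>()).add(name);
        }
        return result;
    }

    /** Returns 1 if the customer bought the product at least once, otherwise 0. */
    static int flag(Map<Integer, Set<String>> byCustomer, int customerId, String product) {
        Set<String> bought = byCustomer.get(customerId);
        return (bought != null && bought.contains(product)) ? 1 : 0;
    }

    public static void main(String[] args) {
        Map<Integer, String> products = new HashMap<>();
        products.put(1, "Tads");
        products.put(2, "Zads");
        products.put(3, "Fads");
        List<Purchase> purchases = Arrays.asList(
                new Purchase(100, 1), new Purchase(100, 3), new Purchase(100, 1),
                new Purchase(200, 2));
        Map<Integer, Set<String>> byCustomer = productsByCustomer(purchases, products);
        System.out.println("Customer 100 bought Tads: " + flag(byCustomer, 100, "Tads"));
        System.out.println("Customer 100 bought Zads: " + flag(byCustomer, 100, "Zads"));
    }
}
```

Repeat purchases collapse into a single indicator per product, which matches the purchased-or-not attributes described earlier. In practice this step would more likely be a join in the database itself.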

