
Top 5 Data Mining Mistakes

Mining data to extract useful and enduring patterns remains a skill arguably more art than science. Pressure enhances the appeal of early apparent results, but it is all too easy to fool yourself. How can you resist the siren songs of the data and maintain an analysis discipline that will lead to robust results? Here is a Top 5 list of data mining mistakes from Dr. John Elder, Founder of Elder Research, a predictive analytics and data science consultancy.

Source: http://www.elderresearch.com/mistake-lack-relevant-data

Mistake #1: Lack Relevant Data

To really make advances with an analysis, you must have labeled cases, i.e., an output variable. With input variables only, all you can do is look for subsets with similar characteristics (clustering) or find the dimensions that best capture the data's variation (principal components). These unsupervised techniques are much less useful than a good (supervised) prediction or classification model. Even with an output variable, though, the most interesting class or type of observation is usually the rarest, often by orders of magnitude. For instance, roughly 1 in 10 “risky” individuals given credit will default within two years, 1 in 100 people mailed a catalog will respond with a purchase, and perhaps 1 in 10,000 banking transactions of a certain size require auditing. The less probable the interesting events, the more data it takes to obtain enough of them to generalize a model to unseen cases. Some projects probably should not proceed until enough critical data are gathered to make them worthwhile.
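
To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch using the illustrative rates above; the 1,000-positive target is an assumption for the example, not a figure from the article.

```python
# Back-of-the-envelope check: how many labeled records must we collect
# to expect a given number of positive (rare) cases? The rates below are
# the illustrative figures from the text, not measurements.
def records_needed(event_rate: float, positives_wanted: int) -> int:
    """Expected number of records needed to observe `positives_wanted`
    examples of an event that occurs with probability `event_rate`."""
    return int(positives_wanted / event_rate)

for label, rate in [("credit default", 1 / 10),
                    ("catalog response", 1 / 100),
                    ("auditable transaction", 1 / 10_000)]:
    # Aiming for ~1,000 positives (an assumed rough floor for a stable model).
    print(f"{label}: ~{records_needed(rate, 1000):,} records")
```

At the 1-in-10,000 rate, that back-of-the-envelope target already implies roughly ten million records, which is why some projects should wait for more data.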

Mistake #2: Rely on One Technique

“To a little boy with a hammer, all the world’s a nail.” All of us have had colleagues for whom the best solution to a problem happens to be the type of analysis in which they are most skilled. For many reasons, most researchers and practitioners focus too narrowly on one type of modeling technique. But for best results, you need a whole toolkit. At the very least, be sure to compare any new and promising method against a stodgy conventional one, such as linear regression (LR) or linear discriminant analysis (LDA).
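
As a sketch of that baseline discipline, the comparison below pits a flexible learner against linear baselines via cross-validation, assuming scikit-learn is available. Logistic regression stands in for linear regression here since the task is classification, and the synthetic dataset is a stand-in for real data.

```python
# A minimal sketch of benchmarking a "promising" model against stodgy
# conventional baselines. Substitute your own features X and labels y.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

If the fancy method cannot beat the linear baselines by a meaningful margin, the extra complexity is probably not earning its keep.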

Mistake #3: Ask the Wrong Questions

It is important first to have the right project goal; that is, to aim at the right target. This was exemplified by a project at Shannon Labs, led by Daryl Pregibon, to detect fraud in international calls. A conventional approach would have tried to build a single model distinguishing (rare but expensive) fraud from (abundant) nonfraud for any given call. Instead, the researchers characterized normal calling patterns for each account (customer) separately. When a call departed from the normal pattern for that account, an extra level of security, such as involving an operator, was initiated. For instance, if a customer typically made a few brief weekday calls to particular countries, a weekend call to a different region of the world would bear scrutiny.
Efficiently reducing historical billing information to its key features, creating a mechanism for the proper level of adaptation over time, and implementing the models in real time for vast streams of data all posed interesting research challenges. Still, the key to success was asking the right question of the data.
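
The sketch below illustrates the per-account idea in miniature. The features (destination country, weekend flag) and the 5% threshold are assumptions for illustration, not the Shannon Labs design.

```python
# A toy sketch of per-account anomaly scoring: each account gets its own
# profile of normal calling behavior, and a new call is flagged when it
# departs from that profile.
from collections import Counter, defaultdict

profiles: dict[str, Counter] = defaultdict(Counter)

def observe(account: str, destination: str, weekend: bool) -> None:
    """Update the account's profile with one completed, trusted call."""
    profiles[account][(destination, weekend)] += 1

def is_suspicious(account: str, destination: str, weekend: bool,
                  min_share: float = 0.05) -> bool:
    """Flag a call if this (destination, weekend) pattern makes up less
    than `min_share` of the account's history."""
    history = profiles[account]
    total = sum(history.values())
    if total == 0:
        return False  # no baseline yet; route to other checks
    return history[(destination, weekend)] / total < min_share

# Weekday calls to France are routine for this account...
for _ in range(50):
    observe("acct-42", "FR", weekend=False)
# ...so a weekend call to a new country draws scrutiny.
print(is_suspicious("acct-42", "BR", weekend=True))   # True
print(is_suspicious("acct-42", "FR", weekend=False))  # False
```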

Mistake #4: Accept Leaks From the Future

I often evaluate promising investment systems for possible implementation. In one case, a Ph.D. consultant with a couple of books under his belt had prepared a neural network model for a Chicago bank to forecast interest rate changes. The model was 95% accurate, an astonishing figure given how much of the economy such rates drive. The bank board was cautiously ecstatic and sought a second opinion. My colleagues found that a version of the output variable had accidentally been made a candidate input. Thus, the output could be thought of as losing only 5% of its information as it traversed the network.
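
One inexpensive guard against this failure is to screen every candidate input for suspicious predictive power on its own before any modeling begins. A minimal sketch, with made-up data and an illustrative 0.95 threshold:

```python
# A minimal leakage screen: any candidate input that is near-perfectly
# correlated with the target on its own deserves scrutiny before modeling.
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=500)                     # e.g., next-period rate change
features = {
    "lagged_rate": rng.normal(size=500),          # legitimate input
    "rate_change_v2": target + rng.normal(scale=0.05, size=500),  # disguised copy of the output
}

for name, col in features.items():
    r = abs(np.corrcoef(col, target)[0, 1])
    if r > 0.95:
        print(f"WARNING: '{name}' correlates {r:.3f} with the target -- likely a leak")
```

A simple univariate check like this would have caught the disguised output variable long before the board meeting.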

Mistake #5: Discount Pesky Cases

Outliers and leverage points can greatly affect summary results and cloud general trends. Yet one must not dismiss them routinely; they may be the very finding you are looking for. The statistician John Aitchison recalled how a spike in radiation levels over the Antarctic was thrown out for years as an assumed measurement error, when in fact it revealed the hole in the ozone layer, which proved to be an important finding. To the degree possible, visualize your data to help decide whether outliers are mistakes to be purged or findings to be explored.
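
A small sketch of that flag-for-review discipline: mark outliers with a conventional rule (here Tukey's IQR fences, an assumed default) and route them to a human rather than silently deleting them.

```python
# Flag, rather than delete, outliers so a person can judge whether they
# are errors or discoveries. The 1.5x IQR multiplier is a conventional
# default, not a prescription.
import numpy as np

def flag_outliers(values: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Return a boolean mask marking points outside the Tukey fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

readings = np.array([310, 305, 298, 312, 307, 150, 303])  # made-up levels
mask = flag_outliers(readings)
# Route flagged points to review instead of discarding them.
print("review these:", readings[mask])   # -> [150]
```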