Data preparation is the step of ensuring that the data used in the modelling
step is as suitable as possible. It is made up of the following sub-steps.
Select data
The select data sub-step identifies which data will be included and which will
be excluded (The Modelling Agency, 2000 p.22). Reasons for inclusion or
exclusion include the age of the data (more recent data is generally more valid
and relevant) and its completeness (incomplete data can give an inaccurate
picture) (Dunham, 2003 p.15). These inclusions and exclusions should be clearly
indicated so that whoever interprets or acts upon the data knows exactly what it
covers.
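
As a rough illustration of this sub-step, the sketch below filters a small set
of records by age and completeness and keeps a log of why each record was
excluded. The records, field names, the CUTOFF date and the select function are
all made up for the example and do not come from the cited sources.

```python
from datetime import date

# Hypothetical records; the field names are illustrative only.
records = [
    {"id": 1, "recorded": date(2009, 5, 1), "income": 21000},
    {"id": 2, "recorded": date(1998, 3, 9), "income": 18000},
    {"id": 3, "recorded": date(2010, 1, 15), "income": None},
]

CUTOFF = date(2005, 1, 1)  # assumed threshold for "recent enough"


def select(rows, cutoff):
    """Split rows into included rows and a log of exclusions with reasons."""
    included, excluded = [], []
    for row in rows:
        if row["recorded"] < cutoff:
            excluded.append((row["id"], "too old"))
        elif any(value is None for value in row.values()):
            excluded.append((row["id"], "incomplete"))
        else:
            included.append(row)
    return included, excluded


included, excluded = select(records, CUTOFF)
```

Keeping the excluded list alongside the included rows is one simple way of
making the selection visible to whoever uses the data later.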
Clean data
Cleaning data refers to the correction or removal of faulty or incomplete data.
This can extend to using estimation or prediction to populate missing values
(The Modelling Agency, 2000 p.24). As in the select data sub-step, it is
important that any caveats applied to the data are documented and made clear
whenever the data is used.
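
One simple form of estimation is to fill gaps with the mean of the known
values. The sketch below does this for a made-up list of ages and returns a
parallel list of flags so that the estimated values stay identifiable. Mean
imputation is only one possible choice, and the impute_mean function and its
data are assumptions for the example.

```python
from statistics import mean

ages = [34, None, 29, 41, None]  # hypothetical column with missing values


def impute_mean(values):
    """Fill missing entries with the mean of the known ones and flag them."""
    known = [v for v in values if v is not None]
    fill = mean(known)  # raises StatisticsError if every value is missing
    cleaned = [fill if v is None else v for v in values]
    imputed = [v is None for v in values]
    return cleaned, imputed


cleaned, imputed = impute_mean(ages)
```

The imputed flags act as the documented caveat: any later step can tell a
measured value from an estimated one.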
Construct data
Data may need to be assembled before it is used in data mining. Calculations
may need to be applied to produce fields that are meaningful to the model (such
as calculating the age of a subject from today’s date and their birth date),
and fields that are split may need to be brought together, such as post codes,
which are sometimes stored in two separate parts (The Modelling Agency, 2000
p.24).
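
The two examples mentioned here could be sketched as follows. The function
names are invented for the illustration, the reference date is passed in rather
than read from the clock so that the result is repeatable, and the post code
helper assumes the two stored parts should be combined into a single field.

```python
from datetime import date


def age_on(birth_date, today):
    """Return the age in whole years on the given day."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)


def join_post_code(first_part, second_part):
    """Combine the two stored parts of a post code into one field."""
    return f"{first_part.strip().upper()} {second_part.strip().upper()}"


age_before_birthday = age_on(date(1980, 6, 15), date(2010, 6, 14))
age_on_birthday = age_on(date(1980, 6, 15), date(2010, 6, 15))
post_code = join_post_code("sw1a ", "1aa")
```

In day-to-day use the second argument of age_on would typically be
date.today().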
Integrate data
Once the data has been through the previous sub-steps it needs to be put
together. The included data is most likely held in numerous tables, and
possibly in multiple databases, which need to be linked (The Modelling Agency,
2000 p.25). Aggregations may also need to be performed to deal with possible
duplication, which would have been highlighted in the explore data sub-step
(The Modelling Agency, 2000 p.25).
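
A minimal sketch of linking and aggregating is shown below, assuming two
made-up tables that share a customer id. Several order rows per customer are
summed so that the integrated result holds exactly one row per customer. The
table contents and the integrate function are hypothetical.

```python
from collections import defaultdict

# Hypothetical tables: customer id -> city, and (customer id, order amount).
customers = {101: "Leeds", 102: "York"}
orders = [(101, 20.0), (101, 5.0), (102, 12.5)]


def integrate(customers, orders):
    """Link orders to customers and aggregate to one row per customer."""
    totals = defaultdict(float)
    for customer_id, amount in orders:
        totals[customer_id] += amount
    return [
        {"customer_id": cid, "city": city, "total_spend": totals.get(cid, 0.0)}
        for cid, city in customers.items()
    ]


integrated = integrate(customers, orders)
```

In a database the same result would usually come from a join followed by a
grouped sum; the plain Python version only shows the idea.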
Format data
The final sub-step of data preparation is to make any necessary adjustments to
the data so that it is suitable for use in the data mining tool. Changes such
as the introduction of composite keys or the restructuring of data may be
needed (The Modelling Agency, 2000 p.25).
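
As an illustration, the sketch below adds a composite key built from two
fields and writes the rows out as CSV with a fixed column order. The field
names, the key format and the choice of CSV as the target format are all
assumptions; what a particular data mining tool expects will differ.

```python
import csv
import io

# Hypothetical rows; store and week together identify a row.
rows = [
    {"store": "A", "week": 1, "sales": 250},
    {"store": "B", "week": 1, "sales": 310},
]


def format_rows(rows):
    """Add a composite key and emit CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["key", "store", "week", "sales"],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({"key": f"{row['store']}-{row['week']}", **row})
    return buffer.getvalue()


output = format_rows(rows)
```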