Saturday 25 May 2013

Dissertation Series - CRISP-DM - Step Three Data Preparation


Data preparation is the step of ensuring that the data used in the modelling step is as suitable as possible. It is made up of five sub-steps: select, clean, construct, integrate and format.
Select data
The select data sub-step identifies what data will be included and what will be excluded (The Modelling Agency, 2000 p.22). The reasons for inclusion and exclusion can range from the age of the data (more recent data tends to be more valid/relevant) to its completeness (incomplete data can give an inaccurate picture) (Dunham, 2003 p.15).  These exclusions/inclusions should be clearly indicated so that the user digesting/acting upon the data knows what data they are acting upon.
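As a rough sketch of the idea, the snippet below filters a handful of records by age and completeness and keeps a log of why each record was left out. The field names (`id`, `recorded`, `income`), the sample values and the cutoff date are all made up for illustration; they are not taken from CRISP-DM or from any particular data set.

```python
from datetime import date

# Hypothetical records; the field names and values are illustrative only.
records = [
    {"id": 1, "recorded": date(2012, 11, 3), "income": 21000},
    {"id": 2, "recorded": date(2005, 1, 17), "income": 18000},
    {"id": 3, "recorded": date(2013, 2, 9), "income": None},
]

# Assumed threshold for what counts as "recent enough".
CUTOFF = date(2010, 1, 1)


def select(rows, cutoff):
    """Split rows into included rows and a log of exclusions with reasons."""
    included, excluded = [], []
    for row in rows:
        if row["recorded"] < cutoff:
            excluded.append((row["id"], "older than cutoff"))
        elif any(value is None for value in row.values()):
            excluded.append((row["id"], "incomplete"))
        else:
            included.append(row)
    return included, excluded


included, excluded = select(records, CUTOFF)
```

Keeping the `excluded` list alongside the selected rows is one simple way of making the inclusion/exclusion decisions visible to whoever uses the data later.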


Clean data
Cleaning data refers to the correction/removal of faulty or incomplete data; this can be extended to using estimation/prediction to populate missing data (The Modelling Agency, 2000 p.24).  As in the select data sub-step, it is important that any caveats applied to the data are documented and identified whenever the data is used.
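One of the simplest forms of estimation is to fill a missing value with the mean of the known values. The sketch below does that and also records which values were estimated, so the caveat travels with the data. The sample ages are invented, and mean imputation is only one possible choice here, not something CRISP-DM prescribes.

```python
from statistics import mean

# Hypothetical ages; None marks a missing value.
ages = [30, None, 28, 41, None]


def impute_mean(values):
    """Fill missing values with the mean of the known ones.

    Returns the cleaned list plus a parallel list of flags showing
    which entries were estimated rather than observed.
    Assumes at least one value is known.
    """
    known = [v for v in values if v is not None]
    fill = mean(known)
    cleaned = [fill if v is None else v for v in values]
    imputed = [v is None for v in values]
    return cleaned, imputed


cleaned, imputed = impute_mean(ages)
```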
Construct data
Data may need to be assembled before use in data mining. There may be calculations that need to be applied to the data itself to produce meaningful/useful data for the model (such as calculating the age of a subject based on today’s date and their birth date), or fields that have been split and need putting back together, such as post codes, which are sometimes stored in two separate parts (The Modelling Agency, 2000 p.24).
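Both examples are easy to sketch in code. The function names (`age_on`, `join_post_code`) and the field names below are hypothetical, and the reference date is passed in explicitly so the result does not change depending on when the code is run.

```python
from datetime import date


def age_on(birth_date, today):
    """Whole years between birth_date and today."""
    years = today.year - birth_date.year
    # Subtract one if the birthday has not yet happened this year.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def join_post_code(outward, inward):
    """Combine a post code stored in two parts into a single field."""
    return f"{outward.strip().upper()} {inward.strip().upper()}"


# Hypothetical subject; field names are illustrative only.
subject = {"born": date(1980, 6, 15), "pc_out": "sw1a", "pc_in": "1aa"}
subject["age"] = age_on(subject["born"], date(2013, 5, 25))
subject["post_code"] = join_post_code(subject["pc_out"], subject["pc_in"])
```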
Integrate data
Once the data has been through the previous steps it is necessary to put it together. The data included is most likely held in numerous tables, and possibly even multiple databases, which require linking together (The Modelling Agency, 2000 p.25).  Aggregations may need to be performed to deal with possible duplication, which would have been highlighted in the explore data sub-step (The Modelling Agency, 2000 p.25).
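To illustrate, the sketch below links two small in-memory "tables" on a shared key and aggregates the many-rows-per-customer table first, so that the result has one row per customer. The table and column names (`customers`, `orders`, `customer_id`, `amount`) are invented for the example; in practice this would more likely be done with a database join.

```python
from collections import defaultdict

# Two hypothetical tables that share a customer_id column.
customers = [
    {"customer_id": 1, "name": "A"},
    {"customer_id": 2, "name": "B"},
]
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 5.0},
    {"customer_id": 2, "amount": 12.5},
]

# Aggregate first: collapse several order rows into one total per customer.
totals = defaultdict(float)
for order in orders:
    totals[order["customer_id"]] += order["amount"]

# Then link the aggregated figures to the customer table on the shared key.
merged = [
    {**customer, "total_spent": totals.get(customer["customer_id"], 0.0)}
    for customer in customers
]
```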
Format data
The final sub-step of data preparation is to make the necessary adjustments to the data so that it is suitable for use in the data mining tool. Changes may be needed such as the introduction of composite keys or the restructuring of data (The Modelling Agency, 2000 p.25).
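As a small sketch, suppose an invoice number is only unique within a store. A composite key can be built from the two fields, and the rows can then be restructured into CSV for a tool that expects a flat file. The field names, the `-` separator and the choice of CSV are all assumptions made for the example.

```python
import csv
import io

# Hypothetical rows where invoice numbers repeat across stores.
rows = [
    {"store": "LDN", "invoice": 101, "total": 25.0},
    {"store": "MAN", "invoice": 101, "total": 12.5},
]


def with_composite_key(row):
    """Replace store and invoice with a single key that is unique overall."""
    key = f'{row["store"]}-{row["invoice"]}'
    return {"key": key, "total": row["total"]}


formatted = [with_composite_key(row) for row in rows]

# Restructure into CSV text, written to memory rather than to disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["key", "total"], lineterminator="\n")
writer.writeheader()
writer.writerows(formatted)
output = buffer.getvalue()
```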
