This isn't really a blog; it's more of a holding page for my domain (it seems a shame not to have a page). If I know you, add me on LinkedIn or Facebook (links are on the right); if I don't know you, I won't add you!

Saturday 27 July 2013

Easy Oracle SQL updates with Excel

It's always frustrating when end users make assumptions about data; often users will export data into a spreadsheet and believe it is easy to re-import it into the database.

Having done this numerous times using cobbled-together formulas in Excel, I decided to put together a generic template that you can use to carry out updates to your Oracle database with significantly less pain and frustration!

DOWNLOAD THE TEMPLATE

Step 1 - Question the update

Firstly, before carrying out the update you should ask yourself and the user providing your data the following questions:

  • How long ago was the data in the spreadsheet updated, and are other users aware that the update is taking place?
  • Is the user (and are other users) aware that data entered into the database since the spreadsheet was created will be overwritten with the data contained in the spreadsheet?
  • Is there enough information in the spreadsheet to match up to the records correctly (primary key etc.)?
Based on that you can then decide whether to proceed with the update.


Step 2 - Format the data

Take the spreadsheet that you have been provided with and make the following amendments:
  • Ensure the first column includes one of the fields that will identify the record in the destination table (e.g. personcode, learnerid or staffnumber).
  • Check the data to ensure that it is formatted correctly; if the data has been amended manually users may have included spaces etc.
Step 3 - Transpose the data into the template

Paste the data from the spreadsheet your user has provided into the import template in cell B9.

Step 4 - Set the field names, table name and field type 
  • Enter the table name to be updated in cell C4.
  • Identify each column in row 8 by its name in the database.
  • Identify each column in row 7 by its type:
    • Primary key will be used to construct the where clause
    • Update field marks the fields being updated
Step 5 - Review the generated SQL
  • The SQL generated as part of the template needs to be reviewed prior to execution.
  • Once satisfied that the SQL is correct, paste it into your client and execute it.
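For illustration, the generated SQL follows a simple pattern: one UPDATE statement per spreadsheet row, with the update fields in the SET clause and the identifying field in the WHERE clause. Assuming a hypothetical STAFF table keyed on staffnumber, the generated statements would look something like this:

UPDATE STAFF SET SURNAME = 'Smith', EMAIL = 'JSMITH@EXAMPLE.COM' WHERE STAFFNUMBER = '1001';
UPDATE STAFF SET SURNAME = 'Jones', EMAIL = 'BJONES@EXAMPLE.COM' WHERE STAFFNUMBER = '1002';

Remember to commit (or roll back) once you have reviewed the results of the update.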

Wednesday 12 June 2013

Can I use "user" as a parameter name in SSRS (when using Oracle)?

The answer is no!

Although SSRS will allow you to create a parameter called user, if you are using Oracle SQL it will not let you reference the parameter in a dataset; the warning "ORA-01745: invalid host/bind variable name" will be displayed when running your dataset.

I spent ages checking through my dataset until I realised that user must be a reserved term that cannot be used as a parameter name. As with my other posts, hopefully this helps someone else out!
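As a quick sketch of the problem and the fix (the STAFF table and USERNAME column here are purely hypothetical):

-- Fails with ORA-01745: USER is a reserved word in Oracle
SELECT * FROM STAFF WHERE USERNAME = :user

-- Works once the parameter is renamed, e.g. to ReportUser
SELECT * FROM STAFF WHERE USERNAME = :ReportUser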

Tuesday 11 June 2013

Crystal Reports - IF IN Expression

Carrying out development between different tools such as SQL, SSRS expressions and Crystal Reports often ends up in headaches about the syntax and functions available (as there are big similarities).  When writing a Crystal expression that includes an IF statement looking at a range of values (in a similar way to an IN clause in SQL) there are a few things to keep in mind.

For example
IF {MY_ELEMENT.STREAM} = "SA" THEN "Shop closed" ELSE "Shop open"
This expression will only read "Shop closed" if the stream field equals "SA"; however, if I have more than one value that equates to shop closed then I need to think about the construction of my expression.

I could write a simple OR in; however, this becomes unwieldy the more values there are that should display as shop closed.
IF {MY_ELEMENT.STREAM} = "SA" OR {MY_ELEMENT.STREAM} = "SU" THEN "Shop closed" ELSE "Shop open"

I could write a case statement within my SQL; however, it could be that I am using a snapshot at a given moment in time, which requires me to include the logic natively in my report.
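For reference, the SQL version would be something along these lines (a sketch, assuming a MY_ELEMENT table with a STREAM column to match the Crystal field above):

SELECT CASE WHEN STREAM IN ('SA', 'SU') THEN 'Shop closed' ELSE 'Shop open' END AS SHOP_STATUS
FROM MY_ELEMENT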

Using the IN string function, a series of values can be included within the IF statement with minimal bulk added to the expression; however, it operates slightly differently to the way it is used in SQL.

IF {MY_ELEMENT.STREAM} in ("SA""SU") THEN "Shop closed" ELSE "Shop open"

The separate values are placed within the brackets and separately quoted; however, they are not separated by commas as they would be in SQL.

Hope someone finds this useful, I struggled to find anything online about doing this in Crystal!

Thursday 30 May 2013

Dissertation Series - Data Mining Tesco Clubcard Case Study


In 1994 Tesco piloted their Clubcard scheme, then went on to launch it countrywide in 1995 (Humby et al, 2004 p.14); although loyalty schemes were also being operated by other retailers, Tesco intended to make further use of the data generated as a result.
Safeway had tried this with their ABC card, which they later abandoned, having suffered from the problem of too much data and comparing it to "drinking from a fire hose" (Humby et al, 2004 p.99).  This scenario of too much data was a common theme throughout the research carried out in the literature review and often led to the use of data mining.
Dataset content
A huge amount of data was being created as a result of the Clubcard scheme: till transactions broken down to product level and attributed to the Clubcard holder that made the purchase (Humby et al, 2004 p.96).
Dataset analysis
Tesco outsourced the analysis of the data to a company called "Dunnhumby" as they did not possess the IT skills or infrastructure in-house (Humby et al, 2004 p.96).  In addition, technical limitations of the time meant that Dunnhumby weren't able to process all of the Tesco data (Humby et al, 2004 p.97); in fact, in excess of 50 million transactions (shopping trips) were recorded in the first 3 months of the scheme (Humby et al, 2004 p.96).
Dunnhumby took the approach of analysing 10% of the collected data and then worked to apply what had been learnt to the entire dataset (Humby et al, 2004 p.97).
Dataset quality
Errors contained within data can cause false positives in data mining (Thuraisingham, 1999 p.93) and Dunnhumby encountered this with the Clubcard data, having multiple users of one card, users holding multiple cards (in the case of loss/theft) or even local issues preventing customers reaching the store for a period of time (Humby et al, 2004 p.98).  However, as the data was collected by Tesco hardware automatically, there was little likelihood of missing or inconsistent data issues being encountered.
Resistances
Whilst the majority of the Tesco case study talked about positive feedback (Humby et al, 2004 p.116), there were periods during which there was resistance to the processing of personal data (Humby et al, 2004 p.177).  During 1997 there were 20 complaints made to the Data Protection Registrar (the precursor to the Information Commissioner); these complaints pertained to data collected for the Clubcard scheme that was subsequently used in a Tesco Personal Finance mailing campaign (Humby et al, 2004 p.177).  As a result of the complaints and subsequent meetings with the Data Protection Registrar, Tesco revised their practices so as not to pass details from Clubcard to third parties (Humby et al, 2004 p.179).
Benefits
Tesco realised a large number of benefits from the Clubcard scheme, with customer feedback being that the targeted mailings were viewed separately from other commercial mailings (Humby et al, 2004 p.116).
Tesco were also able to target marketing more accurately based on customers' buying habits and as such get a higher return; this marketing was coupon-based mailing, which resulted in customers receiving a bespoke combination of coupons based on their buying patterns (Humby et al, 2004 p.117).
Summary
Tesco encountered technical issues at the beginning of the Clubcard project, mainly caused by the technical limitations of the time.  This resulted in them outsourcing the project to a third party, who then only analysed a small sample.
Tesco encountered resistance to their sharing of personal data with third-party marketers during the Tesco Personal Finance mailing, although they changed their approach to data sharing and mailing as a result.
Otherwise the Tesco viewpoint was that the project was hugely successful and well received amongst its customers.

Wednesday 29 May 2013

Dissertation Series - Data mining Literature Review Summary


Data mining is a widely used technology, often deployed in scenarios where large amounts of data are collected and the analysis of this data is problematic.  The use of data mining allows patterns to be gleaned from data and exploited to further the business/organisation objectives.
Data mining is not an out-of-the-box solution that can be deployed to an organisation without technical intervention, as there are many factors that influence the accuracy and usefulness of the end product.  These factors must be considered prior to undertaking a data mining project, as it may be the case that the chances of success are low and the end product may result in resources being targeted towards false positives.
There are many software solutions used to implement data mining, and modelling techniques that can be used within these packages; these techniques each mine the data in different ways and as such would be used in the appropriate scenario.
There is some resistance to data mining as a technique; this can in the majority of cases be mitigated or at the very least controlled.  There is a common theme of mutual consent between the subject and the organisation: where both parties receive a benefit (as in the Tesco example) the privacy concerns are generally reduced.  This is separate to any legal issues, and past examples have shown that even if an organisation adheres to the law there can be issues (such as those expressed in the N2H2 example).

Oracle Date comparison - DATEDIFF

Many databases are designed in such a way that where a start/end time is stored there is no corresponding duration value; this avoids obvious data duplication and saves storage space, as the duration can be calculated by comparing the start/end times.  However, some novice SQL coders struggle to calculate durations.

In Microsoft SQL Server there is the DATEDIFF function; however, this is not present in Oracle, so the most straightforward method is to subtract the start date from the end date.  This produces the difference expressed as a fraction of a day (i.e. an hour is expressed as 0.0416666...); multiplying the number by 24 then gives the figure in hours.


(END_DATE - START_DATE) * 24
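By the same logic, multiplying by 24 * 60 expresses the difference in minutes:

(END_DATE - START_DATE) * 24 * 60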

An easy way to check your logic is to test it with literal values in a query against dual, such as the one below, obviously including the dates you are anticipating so that you can be sure what figure to expect.  This saves considerable time compared to sticking a date comparison into your where clause and crossing your fingers!

Select
(TO_DATE('01/08/2012 13:00','dd/mm/yyyy hh24:mi') - TO_DATE('01/08/2012 10:00','dd/mm/yyyy hh24:mi')) * 24 Difference_hours,
(TO_DATE('31/07/2013','dd/mm/yyyy') - TO_DATE('01/08/2012','dd/mm/yyyy'))  Difference_Days
From dual

Tuesday 28 May 2013

Dissertation Series - CRISP DM - Step Five Evaluation


Evaluate results
The previous assessment of the model investigated how accurate the model was; the evaluate results sub-step investigates the model's suitability based on the business success factors set at the start of the project (The Modelling Agency, 2000 p.30).
Review process
Using the conclusions of the evaluate results sub-step, it may be that an area for further development of the model is identified and additional work is required; this may be because of the amount of time that has elapsed since the original specification was drawn up, or even down to something being overlooked in the business understanding sub-steps (The Modelling Agency, 2000 p.31).
Determine next steps
Following the assessment and the results of the review process, the next steps can be decided on; there are three possible avenues:
Further development work
If additional requirements are identified during the review process then it may be necessary to carry out additional work on the model (The Modelling Agency, 2000 p.31); it may also be necessary to carry out further development work if the model does not meet the initial requirements (The Modelling Agency, 2000 p.30).
Close the project
IT projects especially are known for going out of tolerance in terms of cost and time (McManus et al, 2008); in some circumstances it is therefore necessary to close a project prematurely (OGC, 2005 p.71) to save resources.  The reasons behind closing the project may be complex, such as industry changes meaning the problem originally defined no longer exists, and as such closure shouldn't always be seen as a failure.
Deploy the model
In the event of the evaluation showing that the model successfully meets the requirements, the model can move to the next stage of being deployed (The Modelling Agency, 2000 p.31).

Sunday 26 May 2013

Dissertation Series - CRISP DM - Step Four Data Modelling


The modelling step involves the actual data mining stage of the project, which breaks down into a number of sub-steps:
Select modelling technique
The problem being solved and the objectives set out in the business understanding step will tend to indicate which modelling technique is suitable (The Modelling Agency, 2000 p.25); however, those new to data mining may test a number of different modelling techniques.
Generate test design
Data mining accuracy is very important, as false positives can cause serious issues between an organisation and its customers (Thuraisingham, 1999 p.93); for this reason a robust test plan must be put in place in order to validate the data mining solution (The Modelling Agency, 2000 p.28).
Build model
This is the sub-step where the data mining application-specific development is undertaken and, as with software design, is iterative with a loop of develop, test and adjust (The Modelling Agency, 2000 p.28).



Assess model
In line with the test design, carry out testing of the model, with the option of rolling back to the build model sub-step to factor in debugging (The Modelling Agency, 2000 p.29).

Saturday 25 May 2013

Dissertation Series - CRISP DM - Step Three Data Preparation


Data preparation is the step of ensuring that the data used in the modelling step is as suitable as possible.
Select Data
The select data sub-step identifies what data will be included and what will be excluded (The Modelling Agency, 2000 p.22); the reasons for inclusion and exclusion can range from the age of the data (more recent data being more valid/relevant) to completeness (as incomplete data can give an inaccurate picture) (Dunham, 2003 p.15).  These exclusions/inclusions should be clearly indicated so that the user digesting/acting upon the data knows what data he/she is acting upon.


Clean data
Cleaning data refers to the correction/removal of faulty or incomplete data; this can be extended to making use of estimation/prediction to populate missing data (The Modelling Agency, 2000 p.24).  In a similar way to the select data sub-step, it is important that any caveats applied to the data are documented and identified when making use of the data.
Construct data
Data may need to be assembled before use in data mining; there may be calculations that need to be applied to the data itself to produce meaningful/useful data for the model (such as calculating the age of a subject based on today's date and their birth date) or even fields that are split, such as postcodes, which are sometimes stored in two separate parts (The Modelling Agency, 2000 p.24).
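As a minimal sketch of the age example in Oracle SQL (the PERSON table and its PERSONCODE and BIRTH_DATE columns here are hypothetical):

SELECT PERSONCODE,
       FLOOR(MONTHS_BETWEEN(SYSDATE, BIRTH_DATE) / 12) AS AGE
FROM PERSON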
Integrate data
Once the data has been through the previous steps it is necessary to put it together; the data included is most likely held in numerous tables and possibly even multiple databases, which require linking together (The Modelling Agency, 2000 p.25).  Aggregations may need to be performed to deal with possible duplication, which would have been highlighted in the explore data sub-step (The Modelling Agency, 2000 p.25).
Format data
The final sub-step of data preparation is to make the necessary adjustments to the data so that it is suitable for use in the data mining tool; changes may be needed such as the introduction of composite keys or the restructuring of data (The Modelling Agency, 2000 p.25).

Friday 24 May 2013

Dissertation Series - CRISP DM - Step one Business Understanding


A development methodology specific to data mining is CRISP-DM (CRoss Industry Standard Process for Data Mining), a methodology conceived in 1996 by IT professionals and based on their experiences of data mining implementations (The Modelling Agency, 2000 p.3).
CRISP-DM is broken into six distinct steps; at some points the outcome of a step may require repeating the previous step.
Step 1 – Business understanding
This step is broken into several sub-steps which set the scene for the data mining development; it is broadly similar to the PID (Project Initiation Document) that makes up a PRINCE2 project (OGC, 2005 p.40-41).
Determine business objectives
This sub-step considers the business and its overall goals/objectives so as to set the scene (The Modelling Agency, 2000 p.16).  Discussing data warehouses, Mukherjee and D'Souza (2003 p.84) make a similar point: "DW implementation can be considered a success not only because it satisfies a need at a point in time, but also because it serves the continuing needs of an organization".
Assess situation
This sub-step looks at available resources and any associated legal issues/risks (The Modelling Agency, 2000 p.17); this prevents undertaking work that cannot be completed due to resourcing issues and prevents work being undertaken that cannot be made use of because of legal issues.  The impact of the development can also be factored in, to compare the intended value gained against the resources required (The Modelling Agency, 2000 p.18).
Determine data mining goals
It is important in any project to ensure that the aim and associated objectives are clear (OGC, 2005 p.50).  This sub-step of business understanding ensures that the goals are clearly defined and understood (The Modelling Agency, 2000 p.18); this allows the final outcome to be measured against them to ensure that the requirements have been met.
Project plan
The final sub-step of business understanding is the construction of a project plan, breaking the project into the steps that will be undertaken and the resources that will be required at each stage, and identifying the dependencies that may cause bottlenecks in the delivery of the project (The Modelling Agency, 2000 p.19).  Traditional project management techniques/tools such as PRINCE2 and Microsoft Project can be made use of in undertaking this.

Thursday 23 May 2013

Dissertation Series - CRISP DM Step Six Deployment


Plan deployment
Deploying new software/solutions in an organisation requires a plan in order to avoid any issues or pitfalls; this could include raising awareness/allaying concerns amongst affected staff (Clark, 2012).  A deployment could be carried out in a number of ways:
Parallel adoption
The new system is run alongside existing systems; this does however mean that in some cases effort is duplicated, although conversely if any issues are encountered, or even if the new system completely fails, the old system is still in place (Weaver, 2004 p.232).
Phased adoption
This is where functionality of the new system is slowly phased in and teething problems emerge gradually rather than in one massive raft of changes (Weaver, 2004 p.232).
Pilot adoption
The pilot approach involves selecting a number of staff or a specific area of the business and introducing the system there, with the aim of gaining feedback and experience with the new system to apply when rolling it out to the rest of the organisation (Weaver, 2004 p.232).
Big bang adoption
The big bang approach is where a new system is introduced and replaces an existing system immediately with no crossover, or where a new system (where an existing system is not in place) is introduced to the entire organisation in one phase.  Where existing systems are replaced using the big bang approach, there can be issues where the new system fails and there is no system in place to support the business activity (Weaver, 2004 p.232).

Plan monitoring and maintenance
The next sub-step involves putting controls in place so that any changes likely to affect the model are considered and documented (The Modelling Agency, 2000 p.33); an example of this would be changing the way in which data is recorded.
Produce final report
A report is produced that documents the outcomes and products of the project (The Modelling Agency, 2000 p.33).
Review project
As with any development project, a final review allows lessons learnt during the project to be discussed and documented (OGC, 2005 p.333); this means that future projects can take these into account and avoid making the same mistakes twice (The Modelling Agency, 2000 p.33).

Oracle SQL - Formatting dates into a more usable form (TO_CHAR)

Dates in Oracle are stored in a format that doesn't always make them friendly from a reporting point of view, or when dealing with data in Excel; the TO_CHAR function can reformat your datetime fields so that they are more user friendly.

Examples

The statement is written with two parameters: the exact name of the date field and the format (see the second table).

TO_CHAR(DATEFIELD,'FORMAT')

Statement                                    Result
TO_CHAR(DATEFIELD,'dd mm year')              23 05 twenty thirteen
TO_CHAR(DATEFIELD,'dd/mm/yy')                23/05/13
TO_CHAR(DATEFIELD,'Day Month Year')          Thursday May Twenty Thirteen
TO_CHAR(DATEFIELD,'dd/mm/yyyy hh24:mi')      23/05/2013 18:38


Example formats

Any combination of the following formats can be used, although obviously certain combinations may not make any sense to your end users.

Format          Description                                                            Example
Year / year     Year spelt out in text (with and without an upper case first letter)   Twenty Thirteen / twenty thirteen
yyyy            Year number in full                                                    2008
yy              Last two digits of the year number                                     08
q               Quarter of the year                                                    1 (February)
mm              Month number within year**                                             12 (December)
Mon / mon       Abbreviated month name (with and without an upper case first letter)   Oct / oct
Month / month   Month name (with and without an upper case first letter)               October / october
w               Week number* (within month)                                            1 (01/05/2013)
ww              Week number* (within year)**                                           18 (01/05/2013)
d               Day of the week                                                        1 (Monday)
dd              Day number within month**                                              23
Dy / dy         Abbreviated day name (with and without an upper case first letter)     Mon / mon
Day / day       Day name (with and without an upper case first letter)                 Monday / monday
hh24            Hour of the day in 24 hour format**                                    16 (4pm)
hh              Hour of the day in 12 hour format**                                    04 (4pm)
mi              Minute of the hour**                                                   52 (16:52)


*Important note about week numbers
Week numbers in Oracle can be confusing, as they start on the first day of the year and change every seven days; for example, 1 January 2013 fell on a Tuesday, so in 2013 the week number increased each Tuesday.
**Suppression of leading zeros
These formats can be prefixed with FM to suppress leading zeros; for example, mm would display January as "01" whereas FMmm would show "1".
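A quick way to see the difference side by side (again using dual):

select to_char(to_date('01/01/2013','dd/mm/yyyy'),'mm') zero_padded,
to_char(to_date('01/01/2013','dd/mm/yyyy'),'FMmm') zero_suppressed
from dual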

How to test your date formatting
The easiest method for testing the reformatting of your dates is to test them against the dual table; run the following SQL statement and substitute the format string ('Year' in the example) with the date format you are trying to use.

select to_char(to_date('01/05/2013 16:30','dd/mm/yyyy hh24:mi'),'Year') from dual

Dissertation Series - Educational Data Mining (EDM)


There are many sectors in which data mining can be applied, such as identifying buying habits in retail (BCS, 2002 p.30) and detecting fraud in the finance sector (Thiruvadi and Patel, 2011 p.710).  Educational data mining is an area of data mining that specifically focuses on the development of data mining with the unique data that comes from education (Baker, 2011).
Describing a model that looks at learner results, Ayesha et al (2010 p.26) identify the action that would be taken:
“the proposed model identifies the weak students before the final exam in order to save them from serious harm.  Teachers can take appropriate steps at right time to improve the performance of the student in final exam”
McGee (2008) talks about an implementation of an EDM model and its success.
“it used the trajectory analysis to identify 60 students at risk of failing state standardized test, and teachers developed plans to address their needs.  Only 10 ended up doing poorly” 

Dissertation Series - Barriers to successful Data Mining - Inconsistent data recording


Databases are capable of storing data from many sources, and as such data can be entered/recorded by different individuals/organisations, which can introduce inconsistencies into the data.
For example, data mined from clothing manufacturers could compare sizes (large, medium etc.), although the definition of that size could vary between manufacturers.  This would have to be kept in mind and factored in when understanding the underlying data.

Wednesday 22 May 2013

Dissertation Series - Barriers to successful Data Mining - Missing data


"as more data is collected, the higher the likelihood of missing data" (Brown et al, 2003 p.611), and as such the treatment of the missing data must be taken into consideration, along with the effect that such treatment will have on the end result: "missing data may be replaced with estimates.  This and other approaches to handling missing data can lead to invalid results in the data mining step" (Dunham, 2003 p.15).
Missing data may be as a result of individuals refusing to provide certain data, Brown and Kros (2003 p.612) refer to an example of a medical environment where “respondents may find certain survey questions offensive or they may be personally sensitive to certain questions”.

Dissertation Series - Barriers to successful Data Mining - Data suitability


As with all technologies, there are certain scenarios/situations where use of the technology will result in benefits not being realised, or even in a negative outcome.  The reasons for this within data mining are as follows.
Data suitability
In some cases some or all of the data held by an organisation may be unsuitable for data mining and this is one of the reasons that some regard the simplicity of modern data mining with caution (Bramer, 1999 p.xii).
The principles of data warehousing can be applied in identifying the suitability of data for data mining:
  • Subject orientated

It is important that data is available that relates to the subject concerned; mining non-subject-related data will clearly result in false positives (Khan, 2005 p.151).
  • Time variant

Any data that is considered for mining should be taken from an appropriate time period (Khan, 2005 p.152); mining data that is particularly old may result in patterns being highlighted that no longer affect the business.
  • Non volatile

Data that is to be mined should also be non-volatile or static, with any amendments being made on a periodic basis; this contrasts with operational databases, which are subject to frequent change as transactions etc. are processed (Khan, 2005 p.152).
  • Integrated

Data that is to be used in data mining should be integrated, which means pulling together the data from the various tables within the database and, where appropriate, bringing in data from other databases (Khan, 2005 p.151).
It is therefore not always possible to mine the data held by an organisation; this can be frustrating to staff who wish to see data mining applied in the organisation.

Dissertation Series - Resistances to Data Mining - Technological Requirements


Data mining requires technical resources in terms of the hardware/software on which the processing will take place, as well as staff to develop and implement it.  Tesco in 1995 had collected a huge amount of data as a result of their Clubcard programme, but the technology to process it was not available at the time (Humby et al, 2004 p.96).  In fact Tesco didn't even have the staff/resources to process even a small percentage of the data (Humby et al, 2004 p.96), so had to outsource to Dunnhumby, a data analysis company (Dunnhumby, 2012).  Dunnhumby took the collected Tesco data and performed analysis on 10% of the weekly transaction data (again because of the technical limitations of the time) (Humby et al, 2004 p.97).
As forecast by Moore's Law (Intel, 2012), computing processing power has dramatically increased since the 1995 Tesco example discussed above.  However, in some cases organisations might not have the equipment/staff to undertake the task in-house, in which case the work could be outsourced or additional resources/staff brought into the organisation; in both cases this brings cost, a resistance to change in itself.

Dissertation Series - Resistances to Data Mining - Legislation concerns


On a similar note to the privacy concerns, some organisations may be concerned about making use of data mining because of the possible legal implications of doing so.  In the United Kingdom data legislation is primarily covered by the Data Protection Act, with section 12 specifically mentioning "rights in relation to automated decision-taking" (Information Commissioners Office B, 2012).

Dissertation Series - Resistances to Data Mining - Awareness


Before an organisation commits resources to any development there must have been a catalyst to drive the company down that development route: technical staff either have to make managers aware of new technologies that could be applied, or senior managers need to be aware of these technologies and seek to employ them in their organisation.
This rationale can be applied to data mining: if there isn't awareness of data mining in the organisation then it is unlikely to be pursued as a future development.

Dissertation Series - Resistances to Data Mining - Accuracy


Data mining is in many cases used to forecast/predict an outcome; it does this with a degree of accuracy, although it is important to note that it is a forecast/prediction and not an actual.  For this reason, results taken from data mining exercises should be acted on with this in mind; any action taken as a result of data mining should consider what the impact would be if the prediction were incorrect.
These incorrect predictions are referred to as false positives, which is where something is flagged as something it is not, whereas a false negative is where something is not grouped as it should be (Thuraisingham, 1999 p.93).
Thuraisingham (1999 p.93) identifies the possible implications of acting on these “false positives”.
"if an agency finds incorrectly that its employee has carried out fraudulent acts and then starts to investigate his behaviour, and if this is known to the employee, then it could damage him"
Conversely the same logic applies to false negatives: "we do not want the data miner to return a result that the employee was well behaved when he is a fraud" (Thuraisingham, 1999 p.93).
An area in which many consumers will have been exposed to the false positives of data mining is credit/debit card fraud prevention: banks look at consumer transaction patterns and place temporary blocks on cards that match the patterns of stolen cards (AAAI, 2012).  A temporary block on a card requires the customer to contact their bank to unblock it; these false positives can be frustrating or even embarrassing for the individual concerned.
 “just because an individual makes a series of credit card purchase that are similar to those often made when a card is stolen does not mean that the card is stolen or that the individual is a criminal” (Dunham, 2003 p.16)

Tuesday 21 May 2013

Dissertation Series - Resistances to Data Mining - Privacy


The introduction of new technologies, or the adaptation of existing technologies, within an organisation can bring with it resistance from the different layers of the organisation: from management resisting the introduction to operational staff resisting the use/uptake.  There are many reasons for management/operational staff resisting changes in an organisation, and approaches to mitigating them (Davidson, 2009), so the focus here will be those specific to data mining.
Privacy
Data mining against individuals inevitably makes use of large amounts of personal data (Busovky, 2011), and with this come concerns about data privacy and the high profile data breaches reported in the media (BBC, 2009).
Wilder and Soat (Wilder et al, 2001) cite the example of N2H2, a Seattle-based company that provides internet content filtering software to schools, which planned to sell the anonymised, aggregated data it collected:
"N2H2 began marketing the data, called Class Clicks, that its filtering tools collected on the website usage trends of elementary and high school students.  The data contained no names or personal information and complied with the new federal Children's Online Privacy Protection Act.  Yet N2H2's new line of business brought such loud howls of protest from online privacy advocates that the company scrapped the effort"
A fictitious example is given by Wang and Liu (Wang et al, 2011) to illustrate the real privacy concerns that could exist when mining a medical database:
"released mining output can also be leveraged to uncover some combinations of symptoms that are so special that only rare people match them ... which qualifies as a severe threat to individuals' privacy"
Many countries have legislation in place to protect individuals and ensure organisations put in place safeguards and controls to protect personal data; the main act in the United Kingdom is the Data Protection Act 1998, which covers many areas of data protection.  Specific to privacy, the seventh principle of the act applies:
“Appropriate technical and organisational measures shall be taken against unauthorised or unlawful processing of personal data and against accidental loss or destruction of, or damage to, personal data” (Information Commissioners Office A, 2012).
There are techniques to prevent unauthorised disclosure of personal data through data mining:
Anonymisation
Privacy can be ensured through anonymising data; however, simply removing customer reference numbers/names is not in itself always sufficient, as discussed by Vaidya et al (2005 p.8): "just because the individual is not identifiable in the data is not sufficient; joining the data with other sources must not enable identification".
An established approach to ensuring that data is truly anonymised is "k-anonymity", a process that involves grouping individuals together within the data (Vaidya et al, 2005 p.8).
Suppression can also be introduced to hide groups/data that consist of small and easily identified sample sizes; this requires footnotes and an accompanying narrative to explain that this has been done, to prevent a misunderstanding of any summarised data (Vaidya et al, 2005 p.8).  A rough SQL sketch of this idea is shown below.
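As a minimal sketch of suppression in SQL (the PATIENT table, the POSTCODE_AREA and AGE_BAND columns, and the threshold of 5 are all hypothetical), small groups can simply be withheld from any published summary:

select postcode_area, age_band, count(*) as people
from patient
group by postcode_area, age_band
-- groups smaller than the threshold are suppressed from the output
having count(*) >= 5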
Clearly defined use of data
Another method to control concerns about privacy is to clearly outline to the data subjects, at the point of data collection, what the data will be used for and the associated benefits to them.
This is evidenced by the success of the Tesco Clubcard scheme, its changed perception amongst its customers, and the separation of its mailings from previously "dumb" junk mail:
"research consistently suggests that customers perceive the quarterly mailing from Tesco Clubcard not as 'junk mail', but as personal mail" (Humby et al, 2004 p.116).
An example of poor understanding between the data subject and the organisation carrying out the data mining process is the case of pharmacies in the US that were selling data gathered from prescriptions to pharmaceutical companies to be data mined.  The pharmaceutical companies were then using that data to target marketing/sales towards specific doctors, based on the prescriptions they had written (Silverman, 2008).  The data subjects in this case (the doctors), represented by the American College of Physicians, have opposed the use of this data for marketing (Walker, 2011).
However, the example also speaks about the use of the data for other purposes:
"direct safety messages to doctors, to track disease progression, to aid law enforcement, to implement risk-mitigation programs, and to do post-marketing surveillance required by the FDA" (Walker, 2011)
It is where there is a benefit to, and consent between, the data subject and the organisation in its use of data mining that there is less likelihood of resistance to data being mined.