Data cleaning:
Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
1. Missing values:
a. Ignore the tuple: this is usually done when the class label is missing. This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
b. Fill in the missing values manually: in general, this approach is time-consuming and may not be feasible for a large data set with many missing values.
c. Use a global constant to fill in the missing values: replace all missing attribute values with the same constant, such as a label like "unknown" or infinity (∞). If missing values are replaced by, say, "unknown", then the mining program may mistakenly think that they form an interesting concept.
d. Use the attribute mean to fill in the missing values (a small sketch of this and of option (e) appears after this list).
e. Use the attribute mean for all samples belonging to the same class as the given tuple.
f. Use the most probable value to fill in the missing values: it is important to note that in some cases a missing value may not imply an error in the data.
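A minimal Python sketch of options (d) and (e), filling a numeric attribute with the overall mean or with the per-class mean; the column and class values are illustrative, and pandas is assumed to be available.

import pandas as pd

# Illustrative data set with one missing 'income' value.
df = pd.DataFrame({
    "income": [30.0, 45.0, None, 52.0],
    "class":  ["A", "B", "A", "A"],
})

# (d) Fill with the attribute mean computed over all tuples.
df["income_d"] = df["income"].fillna(df["income"].mean())

# (e) Fill with the mean of tuples belonging to the same class as the given tuple.
df["income_e"] = df["income"].fillna(df.groupby("class")["income"].transform("mean"))

print(df)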
2. Noisy data:
Noise is a random error or variance in a measured variable. Given a numerical attribute such as, say, price, how can we smooth out the data to remove the noise?
a. Binning: binning methods smooth a sorted data value by consulting its neighborhood, that is, the values around it. The sorted values are distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing (see the sketch after this list).
b. Regression: data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the best line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple regression is an extension of linear regression where more than two attributes are involved and the data are fit to a multidimensional surface.
c. Clustering: outliers may be detected by clustering, where similar values are organized into groups, or clusters.
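A small sketch of smoothing by bin means, assuming equal-frequency (equi-depth) bins; the price values and bin size are illustrative.

# Sorted price values to be smoothed.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3  # equal-frequency bins of three values each

smoothed = []
for start in range(0, len(prices), bin_size):
    bin_values = prices[start:start + bin_size]
    mean = sum(bin_values) / len(bin_values)
    # Every value in the bin is replaced by the bin mean (local smoothing).
    smoothed.extend([round(mean, 1)] * len(bin_values))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]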
Data integration:
Data mining often requires data integration - the merging of data from multiple sources.
A data analysis task will often involve data integration, which combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, and flat files. Issues to consider during data integration include schema integration and object matching: how can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem. Redundancy is another important issue; an attribute may be redundant, and can be removed, if it can be derived from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancy in the resulting data set.
Data transformation:
In data transformation, the data are transformed into forms appropriate for mining. Data transformation can involve the following:
1. Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering.
2. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly or annual total amounts.
3. Generalization of the data, where low-level or raw data are replaced by higher-level concepts through the use of concept hierarchies.
4. Normalization, where the attribute data are scaled so as to fall within a specified range (a small sketch of min-max normalization follows this list).
5. Attribute construction, where new attributes are constructed and added from the given set of attributes to help the mining process.
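A minimal sketch of min-max normalization for item 4, scaling attribute values linearly into a new range such as [0, 1]; the salary values are illustrative.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # The smallest value maps to new_min, the largest to new_max, the rest in between.
    old_min, old_max = min(values), max(values)
    return [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
            for v in values]

salaries = [30000, 45000, 60000, 90000]
print(min_max_normalize(salaries))  # [0.0, 0.25, 0.5, 1.0]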
Data reduction:
Data reduction is the transformation of numerical or alphabetical digital information, derived empirically or experimentally, into a corrected, ordered, and simplified form.
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.
1. Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction, where the data are replaced by alternative, smaller data representations such as parametric models, or nonparametric methods such as clustering, sampling, and the use of histograms (a simple sampling sketch follows).
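As one nonparametric example of numerosity reduction, the sketch below draws a simple random sample without replacement, so that mining can run on the much smaller sample instead of the full data set; the data and sample size are illustrative.

import random

full_data = list(range(1, 10001))       # stand-in for the full data set
sample = random.sample(full_data, 500)  # reduced representation: a 5% simple random sample
print(len(sample), sample[:5])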
OLAP (online analytical processing):
OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling. It is quickly becoming the fundamental foundation for intelligent solutions, including business performance management, planning, budgeting and forecasting, financial reporting and analysis, simulation models, knowledge discovery, and data warehouse reporting. OLAP enables end users to perform ad hoc analysis of data in multiple dimensions, thereby providing the insight and understanding they need for better decision making. OLAP gives fast, easy, controllable access to the intelligence locked in your data, and it can be delivered through Excel spreadsheets and web browsers.
OLAP is performed on data warehouses or data marts. The primary goal of OLAP is to support the ad hoc querying needed for decision support systems (DSS). The multidimensional view of the data is fundamental to OLAP applications. OLAP is an application view, not a data structure or schema, but the complex nature of OLAP applications requires a multidimensional view of the data.
Types of OLAP servers:
1. Relational OLAP
2. Multidimensional OLAP
3. Hybrid OLAP
4. Specialized OLAP
1. Relational OLAP
R-OLAP servers are placed between a relational back-end server and client front-end tools. To store and manage warehouse data, R-OLAP uses a relational or extended-relational DBMS.
2. Multidimensional OLAP
M-OLAP uses array-based multidimensional storage engines for multidimensional views of data; the storage utilization may be low if the data set is sparse (thinly populated).
Therefore, many M-OLAP servers use two levels of data storage representation to handle dense and sparse data sets.
3. Hybrid OLAP
H-OLAP is a combination of both R-OLAP and M-OLAP; it offers the higher scalability of R-OLAP and the faster computation of M-OLAP. H-OLAP servers allow storing large volumes of detailed information.
4. Specialized OLAP
Specialized SQL OLAP servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.
24-Aug-17
Mining association rules:
One of the major technologies in data mining involves the discovery of association rules. The database is regarded as a collection of transactions, each involving a set of items. A common example is market-basket data: here a market basket corresponds to what a consumer buys in a supermarket during one visit. Mining association rules in transactional or relational databases has recently attracted a lot of attention in the database community. For example, from a large set of transaction data one may find association rules such as: if a customer buys milk, he usually buys bread in the same transaction. Since mining association rules may require scanning through a large transaction database to find different association patterns, the amount of processing could be huge, and performance improvement is an essential concern when mining such rules.
The most popularly used data mining and data analysis tools associated with database system products are data generalization and summarization tools, which carry several alternative names such as online analytical processing (OLAP), multidimensional databases, data cubes, data abstraction, etc.
Given a database of sales transactions, it is desirable to discover important associations among items such that the presence of some items in a transaction implies the presence of other items in the same transaction.
30 August 2017
Apriori algorithm:
Apriori is a classical algorithm for learning association rules. Apriori is designed to operate on databases containing transactions, for example collections of items bought by customers or the details of website visits. Other algorithms are designed for finding association rules in data having no transactions. As is common in association rule mining, given a set of itemsets (for example, sets of retail transactions, each listing the individual items purchased), the algorithm attempts to find subsets which are common to at least a minimum number of the itemsets.
The Apriori algorithm uses a bottom-up approach: frequent subsets are extended one item at a time (a step known as candidate generation) and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. The purpose of the Apriori algorithm is to find associations between different sets of data; it is sometimes referred to as market-basket analysis. Each set of data that has a number of items is called a transaction. The output of Apriori is a set of rules that tell us how often items are contained together in sets of data.
Example transactions (each row is one itemset):
Transaction 1: alpha, beta, gamma
Transaction 2: alpha, beta, theta
Transaction 3: alpha, beta, epsilon
Transaction 4: alpha, beta, theta
1. 100% of sets with alpha also contain beta.
2. 25% of sets with alpha, beta also have gamma.
3. 50% of sets with alpha, beta also have theta.
Apriori uses breadth-first search and a hash tree structure to count candidate itemsets efficiently. It generates candidate itemsets of length k from itemsets of length k-1, and then prunes the candidates which have an infrequent sub-pattern. According to the downward closure lemma, the candidate set contains all frequent k-itemsets.
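To illustrate the generate-and-prune loop on the four example transactions above, here is a small Python sketch; the minimum support of 2 and the function names are illustrative, not part of any standard library.

from itertools import combinations

# The four example transactions listed above.
transactions = [
    {"alpha", "beta", "gamma"},
    {"alpha", "beta", "theta"},
    {"alpha", "beta", "epsilon"},
    {"alpha", "beta", "theta"},
]
min_support = 2  # an itemset must appear in at least two transactions

def support(itemset):
    # Number of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

# Frequent 1-itemsets.
items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
k = 2
while frequent:
    print(k - 1, sorted(sorted(s) for s in frequent))
    # Candidate generation: join frequent (k-1)-itemsets to get k-itemsets,
    # then prune any candidate with an infrequent (k-1)-subset (downward closure).
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    candidates = [c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
    frequent = [c for c in candidates if support(c) >= min_support]
    k += 1

From the frequent itemsets, a rule such as {alpha, beta} → theta is then scored by its confidence: support({alpha, beta, theta}) / support({alpha, beta}) = 2/4 = 50%, which matches observation 3 above.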
Multidimensional association rules:
Rules involving more than one dimension or predicate, for example:
age(X, "20...25") ^ occupation(X, "student") → buys(X, "IBM laptop computer")
· Attributes can be categorical or quantitative.
· Quantitative attributes are numeric, for example age, salary, etc.
· Numeric attributes must be discretized.
Three different approaches to multidimensional association rules:
1. Using static discretization of quantitative attributes.
2. Using dynamic discretization of quantitative attributes.
3. Using distance based discretization with clustering.
05-Sep-17
Mining using static discretization:
· Discretization is static and occurs prior to mining
· Attributes are treated as categorical.
· Use the Apriori algorithm to find all frequent k-predicate sets.
· Every subset of a frequent predicate set must also be frequent.
o If in a data cube the 3-D cuboid (age, income, buys) is frequent, this implies that (age, income), (age, buys), and (income, buys) are also frequent.
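A minimal sketch of static discretization, assuming a numeric age is simply mapped onto predefined interval labels before mining; the interval boundaries are illustrative.

def discretize_age(age):
    # Static, predefined intervals chosen before mining begins.
    intervals = [(0, 19, "0-19"), (20, 25, "20-25"), (26, 35, "26-35"), (36, 200, "36+")]
    for low, high, label in intervals:
        if low <= age <= high:
            return label
    return "unknown"

print(discretize_age(23))  # -> "20-25", so the tuple contributes to the predicate age(X, "20-25")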
Mining using dynamic discretization:
Also known as mining quantitative association rules.
· Numeric attributes are dynamically discretized.
Consider rules of the type:
A_quan1 ^ A_quan2 → A_cat (a 2-D quantitative association rule)
For example: age(X, "20...25") ^ income(X, "30K...40K") → buys(X, "laptop computer")
ARCS (association rule clustering system) is an approach for mining quantitative association rules.
Distance based association rules:
Two-step mining process:
· Perform clustering to find the intervals of the attributes involved.
· Obtain association rules by searching for groups of clusters that occur together.
Clustering:
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it often requires the costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group.
Cluster analysis is an important human activity. Early in childhood we learn how to distinguish between cats and dogs, or between animals and plants, by continually improving subconscious clustering schemes. Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing. In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, etc.
Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection.
Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce.
Clustering is used because of the following points:
1. Simplification
2. Pattern detection
3. Useful in data concept construction
4. Unsupervised learning process
Contributing areas of research to data clustering include data mining, statistics, machine learning, database technology, and marketing.
As a branch of statistics, cluster analysis has been studied for many years, focusing mainly on distance-based cluster analysis.
Clustering is a challenging field of research in which its potential applications pose their own special requirements. The following are typical requirements of clustering in data mining:
1. Scalability
2. Ability to deal with different types of attributes
3. Discovery of clusters with arbitrary shape.
4. Ability to deal with noisy data.
5. Incremental clustering
6. High dimensionality
7. Constraint-based clustering
8. Usability and interpretability.
Major clustering methods:
In clustering there are many methods that are used in the data mining process:
1. Partitioning method
2. Hierarchical method
3. Density based method
4. Grid based method
5. Model based method
1. Partitioning method:
Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the following requirements:
1. Each group must contain at least one object.
2. Each object must belong to exactly one group.
The main steps of partitioning methods are:
Divide the data into proper subsets.
Recursively go through each subset and reallocate points between clusters.
Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. To achieve global optimality in partitioning-based clustering, an exhaustive enumeration of all possible partitions would be required.
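A minimal sketch of iterative relocation, using k-means as a representative partitioning method; the one-dimensional price data and k = 2 are illustrative, and standard k-means is assumed rather than any particular method named in these notes.

import random

def kmeans(points, k, iterations=20):
    # Pick k initial centroids at random, then relocate points iteratively.
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(kmeans(prices, 2))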
2. Hierarchical method:
A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming its own cluster and successively merges clusters that are close to one another. The divisive (top-down) approach, in contrast, starts with all of the objects in the same cluster; a cluster is split up into smaller clusters until each object is in a cluster of its own or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone.
There are two approaches to improving the quality of hierarchical clustering:
1. Perform careful analysis of object linkages at each hierarchical partitioning, as in Chameleon.
2. Integrate hierarchical agglomeration and other approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method such as iterative relocation, as in BIRCH.
3. Density based method:
Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty in discovering clusters of arbitrary shapes. Other clustering methods have been developed based on the notion of density. The general idea is to continue growing a given cluster as long as the density (the number of objects or data points) in the neighborhood exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.
4. Grid based method:
Grid-based methods quantize the object space into a finite number of cells that form a grid structure. All of the clustering operations are performed on the grid structure. The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space.
5. Model based method:
Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points. This also leads to a way of automatically determining the number of clusters based on standard statistics, taking noise or outliers into account, and thus yields robust clustering methods.
Further development:
A data warehouse is a store of data which is made available to end users in a way they can understand and use to make decisions. Data warehouses support the analysis of data for decisions about how the enterprise will operate now and in the future. They are designed mainly for ad hoc, complex, and mostly read-only queries over data obtained from a variety of sources (informational data, historical data), and they represent a stable view of the business over a period of time. Current technology is limited in its ability to bring together information from the many departmental information systems that have developed over time; data warehouse technology aims at providing a solution for these problems.
Need for developing data warehouse:
In most organizations, data about specific parts of the business already exists somewhere, in some form, and in large quantities.
Data is available, but not information, and not the right information at the right time.
To help workers in their everyday business activities and improve their productivity.
To help knowledge workers (executives, managers, analysts) make faster and better decisions - decision support systems.
To bring together information from multiple sources so as to provide a consistent database source for decision support queries.
Implementation of data warehouse:
Define the architecture, do capacity planning, and select the storage servers, database and OLAP servers (R-OLAP, M-OLAP), and tools.
Integrate the servers, storage, and client tools. Design the warehouse schema and views. Define the physical warehouse organization, data placement, partitioning, and access methods. Connect the sources using gateways, ODBC drivers, or other wrappers. Design and implement scripts for data extraction, cleaning, transformation, load, and refresh.
Project management data warehouse:
A paramount determining factor in the success of data warehousing is the input of stakeholders. Data warehousing is unique to each organization, its business processes, system architecture, and decision support systems.
Project management for data warehousing allows for large amounts of user input at all phases of the project. There are commercial software products tailored for data warehouse project management.
A good project plan lists the critical tasks that must be performed and when each task should be started and completed. It identifies who is to perform each task, describes the deliverables to be created, and identifies milestones for measuring progress.
Creating and maintaining a warehouse:
Data extraction gathers data from different external data sources: operational databases, files of standard applications, and other documents.
Data cleaning means finding and resolving inconsistencies in the source data. It maintains data refreshment and checks data quality. It also supports the analysis of metadata.
Web mining:
The advent of the World Wide Web (WWW) has provided home computer users with a vast flood of information. On almost any topic one can think of, one can find pieces of information made available by other Internet citizens, ranging from individual users who post an inventory of their record collection to major companies that do business over the web. Obtaining data or information through the web and analyzing it with data warehousing techniques is known as web mining.
Additional information can be obtained from data sources that capture the interaction of users with a website.
Web data are the record of what actions a user takes with the mouse and keyboard while visiting a site. Web data are just another source of data, with their own quirks and with the limitations that come with all other sources of data. Most web data are more detailed than the usual marketing or financial function wants to see.