Data Mining’s Application for Analyzing Performance of Social Empowerment Activities
Pankaj Gupta
Department of Computer Science and Engineering, BIT Mesra:Ranchi, off-Campus:Noida.
Email: pgupta@bitmesra.ac.in
Abstract
Social development activities are flourishing in diverse branches of
social endeavor, despite numerous cross-sectoral hurdles in their way.
They range from providing basic human services, such as education,
health, and entrepreneurship, to more advanced interventions, depending
on the demand at the outset. However, when it comes to discovering true
success cases around the globe, retracing their paths to accumulate
knowledge, and, foremost, using newly emerged information technology
methods to archive and disseminate model cases, not many stand on their
own. This has happened for many reasons; a few of them are improper
program design, inaccurate site selection, incorrect break-even
analysis, insufficient funding, unbalanced manpower selection,
inappropriate budget allocation, and inadequate feedback and
monitoring. Apart from these, there are many hidden parameters that are
not even visible. Furthermore, the visible and invisible parameters are
so intricately intertwined that a lag in one derails the whole project
and the program eventually fails. Not surprisingly, all of these
parameters depend on data and information about implemented programs or
projects, which are mostly lacking. Thus, in most cases, the lack of
data and information on their appropriateness (or inappropriateness)
has turned them into failed projects, despite devoted efforts by the
implementers. This paper focuses on data mining applications and their
use in formulating performance-analysis tools for social development
activities. In this context, it justifies including data mining
applications in monitoring and evaluation tools for various social
development applications. Specifically, it offers in-depth analytical
observations to establish knowledge for the acceptance or rejection of
various social activities and to help transform contemporary human
society into a knowledge society.
Keywords: Data Mining, Social Activities, Empowerment, Knowledge
Introduction
All information pertaining to a successful organization is truly its
asset. Information such as client lists, vendor lists, product details,
employee information, and corporate strategy is invaluable. Without an
appropriate supply of information, a business cannot operate properly
(Utimaco, 2005). This is potentially true for any sort of venture,
whether it serves the scientific community, academia, civil society, or
individuals. However, to make an intelligent decision, the information
needs to be processed and compiled. Data mining is a method of
collecting and processing data that ultimately assists in making
knowledgeable decisions. In today’s information-based environment, data
mining is steadily coming to the fore and attracting more and more
attention, because it is all about acquisition, assessment, and
analysis: by automatic or semiautomatic means, data of any quantity,
huge or small, can be explored to uncover meaningful patterns and
rules. These patterns and rules help enterprises improve their
marketing, sales, and customer support operations and better understand
their end users. Over the years, corporate houses have accumulated very
large databases from applications such as enterprise resource planning
(ERP), customer relationship management (CRM), and other operational
systems. People believe that there are untapped values hidden inside
these data, and data mining techniques can help draw these patterns out
of the data.
Currently, data are being collected and accumulated across a wide
variety of fields at a dramatic pace. Data are no longer a static matter
for an enterprise or an organization; they have become an intrinsic and
highly dynamic part of any management process. For these reasons, data
mining algorithms are imperative for research on making intelligent
decisions. To cope with this new arena of research, there is an urgent
need for a new generation of computational theories and tools to assist
humans in extracting useful information (knowledge) from the rapidly
growing volumes of digital data. At the same time, data mining and
knowledge discovery in databases have been attracting a significant
amount of research, industry, and media attention (Boulicaut, Esposito,
Giannotti & Pedreschi, 2004; Bramer, 1999; Fayyad, Piatetsky-Shapiro &
Smyth, 1996; Freitas, 2002; Kargupta & Chen, 2001; Kloesgen & Zytkow,
2002; Larose, 2004; Miller & Han, 2001). This article focuses on the
application of data mining algorithms in establishing social
development management systems. To that end, it illustrates a few
real-world applications, with specific attention to data mining
algorithms; discusses the challenges involved in those applications of
knowledge discovery; and outlines contemporary and future research
directions in establishing knowledge centers that assist society in
making intelligent decisions. It also considers how data mining
algorithms may be applied in building decision support systems. Until
now, however, little research has been conducted to measure their
impact on society, and few cost-benefit analyses have been carried out.
This article attempts to formulate measuring criteria using data
mining. Finally, it discusses a few challenges, with some hints on
future research directions, before concluding.
Background
In contrast to heuristics (which contain general recommendations based
on statistical evidence or theoretical reasoning), algorithms comprise
completely defined, finite sets of steps, operations, or procedures
that produce a particular outcome. Algorithms are based on finite
patterns and occurrences in events, and their outcomes can be
quantified using mathematical formulations (Abbass, Sarker & Newton,
2002; Adamo, 2001; Kantardzic, 2002; Yoon & Kerschberg, 1993).
Historically, the concept of finding useful patterns in data has been
given a variety of names, including data mining, knowledge extraction,
information discovery, information harvesting, data archeology, data
warehousing, data repository, and data pattern processing. The term
data mining has mainly been used by statisticians, data analysts, and
management information system (MIS) communities. Although it has also
gained popularity in the database field (Chakrabarti, 2002; Fayyad,
Piatetsky-Shapiro & Smyth, 1996; Hand, Mannila & Smyth, 2001; Liu &
Motoda, 1998a, 1998b; Pal & Mitra, 2004; Perner & Petrou, 1999; Pyle,
1999), development partners and researchers implementing development
projects have remained aloof from data mining techniques, both for
preserving their data or content and for deriving their project
outcomes. Data remain a critical means of project evaluation, yet data
processing is treated as a simple conversion of raw data into tables or
charts. The patterns hidden within the data remain hidden, and the
transformation of those data into knowledge has not yet gained concrete
momentum. Furthermore, no mathematical formulation has been derived
that can handle the transformation of data into knowledge and, at the
same time, measure its impact on society or quantify the impact of data
transformation.
The traditional method of turning data into knowledge relies on manual
analysis and interpretation. For example, in the health-care industry,
it is common for physicians or specialists to periodically analyze
current trends and changes in health-care data. The specialists then
provide a report detailing the analysis to the authority, and
ultimately this report becomes the basis for future decision making and
planning for health-care management. In a totally different category of
application, planetary geologists sift through remotely sensed images
of planets and asteroids, carefully locating and cataloging geologic
objects of interest such as impact craters. Yet another application may
be a village information center established in a very remote corner of
a geographically dispersed region. For such undertakings, few
ready-made formulas, algorithms, hypotheses, or measuring criteria have
evolved to recognize their pattern of growth and implementation, nature
of operation, sustainability, or the replication of success cases in
applicable states or stages. Be it science, research, marketing,
finance, health care, a retail shop, a community center, or any other
field, the classical approach to data analysis relies fundamentally on
one or more analysts becoming intimately familiar with the data and
serving as an interface between the data and the users and end products
(Berthold & Hand, 1999; Fayyad, Piatetsky-Shapiro & Smyth, 1996; Maimon
& Last, 2000; Mattison, 1997). Nevertheless, in recent years many
entrepreneurs have been formulating measuring criteria in areas that
include marketing, finance (especially investment), fraud detection,
data access, data cleaning, manufacturing, telecommunications, and
Internet agents.
Here, a few data mining algorithms based on rough set (RS) theory (Cox,
2004; Curotto & Ebecken, 2005; Kantardzic, 2002; Myatt, 2006;
Nanopoulos, Katsaros & Manolopoulos, 2003; Thuraisingham, 1999; Zhou,
Li, Meng & Meng, 2004) are included, which are used to extract
decision-making rules from a data set. Rough set theory provides a neat
methodology to formalize and calculate the results for data mining
problems. In the early 1980s, Z. Pawlak, in cooperation with other
researchers, developed rough set data analysis (RSDA) (Pawlak, 1982).
In keeping with its motto, “let the data speak for themselves”, RSDA
tries to identify the internal characteristics of a data set, such as
categorization, dependency, and association rules, without invoking
external metrics or judgment (Drewry et al., 2002).
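As a minimal sketch of what rough set analysis computes, the following Python example builds the indiscernibility classes of a small decision table and derives the lower and upper approximations of a target concept. The five records and the attribute names are hypothetical and chosen only for illustration; they do not come from the cited works.

```python
from collections import defaultdict

# Hypothetical decision table: two condition attributes and one decision.
RECORDS = {
    "p1": {"funding": "adequate", "monitoring": "regular", "outcome": "success"},
    "p2": {"funding": "adequate", "monitoring": "regular", "outcome": "success"},
    "p3": {"funding": "adequate", "monitoring": "rare", "outcome": "success"},
    "p4": {"funding": "adequate", "monitoring": "rare", "outcome": "failure"},
    "p5": {"funding": "scarce", "monitoring": "rare", "outcome": "failure"},
}


def indiscernibility_classes(records, attributes):
    """Group records that share the same values on the given attributes."""
    classes = defaultdict(set)
    for name, row in records.items():
        key = tuple(row[a] for a in attributes)
        classes[key].add(name)
    return list(classes.values())


def approximations(records, attributes, target):
    """Return the (lower, upper) approximations of the target record set."""
    lower, upper = set(), set()
    for block in indiscernibility_classes(records, attributes):
        if block <= target:   # block lies entirely inside the target
            lower |= block
        if block & target:    # block overlaps the target
            upper |= block
    return lower, upper


successes = {n for n, r in RECORDS.items() if r["outcome"] == "success"}
lower, upper = approximations(RECORDS, ["funding", "monitoring"], successes)
print("certainly successful:", sorted(lower))
print("possibly successful:", sorted(upper))
print("boundary region:", sorted(upper - lower))
```

Records in the lower approximation support certain rules, while the boundary region (upper minus lower) marks cases that the chosen attributes cannot tell apart. Both are obtained from the data alone, without external metrics.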
Analyzing Social Activities using Data Mining
The output of a data mining algorithm is typically a pattern or a set
of patterns that are valid in the given data. A pattern is defined as a
statement (expression) in a given language that describes
(relationships among) the facts in a subset of the given data and is,
in some sense, simpler than the enumeration of all the facts in the
subset (Drewry et al., 2002, p. 2). A given data mining algorithm
usually depends on a built-in class of patterns, and the particular
language of patterns considered depends on the characteristics of the
given data (the attributes and their values). Data mining for
association rules is a useful method for analyzing data that describe
transactions, lists of items, unique phrases (in text mining), and so
forth. For predictive modeling, the decision tree algorithm is probably
the most popular technique.
This section constitutes the main thrust of the chapter and includes a
few models/patterns of data mining algorithms that could be used to
deduce possible measuring criteria of social development processes.
The following example explains some of the basics of the decision tree
algorithm. Table 1 shows a data set that could be used to predict
credit risk. In this example, fictionalized information was generated
on loan seekers, including their debt level, income level, type of
employment, and whether they were a good or bad credit risk.
Loan-Seeker-Id | Debt-Level | Income-Level | Employment-Status | Credit-Risk | Remarks
1              | High       | High         | Self-Employed     | Bad         |
2              | High       | High         | Salaried          | Bad         |
3              | High       | Low          | Self-Employed     | Bad         |
4              | High       | Low          | Salaried          | Bad         |
5              | Low        | High         | Self-Employed     | Bad         | Accepted
6              | Low        | High         | Salaried          | Bad         | Accepted
7              | Low        | Low          | Self-Employed     | Bad         |
8              | Low        | Low          | Salaried          | Bad         |
9              | High       | High         | Self-Employed     | Good        | Accepted
10             | Low        | High         | Self-Employed     | Good        | Accepted
11             | Low        | Low          | Salaried          | Good        | Accepted
Table-1 Loan Seeker’s Info.
In the example illustrated in Figure-1, the decision tree algorithm
might determine that the most significant attribute for predicting
credit risk is debt level. The first split in the decision tree is,
therefore, made on debt level. One of the two new nodes (debt = low) is
a leaf node, containing two cases with bad credit and three cases with
good credit. In this example, a high debt level is a perfect predictor
of a bad credit risk. The other node (debt = high) is still mixed,
having two good credit cases and zero bad credit cases.
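The choice of a first split can be sketched as follows. This is a minimal illustration that assumes information gain (reduction in entropy) as the splitting criterion; the text does not say which criterion the algorithm uses, and the function names are illustrative. The rows are transcribed from Table 1 as printed, so the counts the sketch reports may differ from those read off Figure-1.

```python
from collections import Counter
from math import log2

# Rows transcribed from Table 1: (debt, income, employment, credit risk).
ROWS = [
    ("High", "High", "Self-Employed", "Bad"),
    ("High", "High", "Salaried", "Bad"),
    ("High", "Low", "Self-Employed", "Bad"),
    ("High", "Low", "Salaried", "Bad"),
    ("Low", "High", "Self-Employed", "Bad"),
    ("Low", "High", "Salaried", "Bad"),
    ("Low", "Low", "Self-Employed", "Bad"),
    ("Low", "Low", "Salaried", "Bad"),
    ("High", "High", "Self-Employed", "Good"),
    ("Low", "High", "Self-Employed", "Good"),
    ("Low", "Low", "Salaried", "Good"),
]
ATTRIBUTES = {"debt": 0, "income": 1, "employment": 2}
LABEL = 3  # column index of the credit-risk class


def entropy(rows):
    """Shannon entropy (in bits) of the class labels in rows."""
    counts = Counter(row[LABEL] for row in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def class_counts(rows, column):
    """Class distribution inside each child node of a split on column."""
    counts = {}
    for row in rows:
        counts.setdefault(row[column], Counter())[row[LABEL]] += 1
    return counts


def information_gain(rows, column):
    """Entropy of the parent minus the weighted entropy of the children."""
    remainder = 0.0
    for value in {row[column] for row in rows}:
        subset = [row for row in rows if row[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


for name, column in ATTRIBUTES.items():
    gain = information_gain(ROWS, column)
    print(name, dict(class_counts(ROWS, column)), round(gain, 4))
```

A tree builder following this criterion would split on the attribute with the largest gain and then repeat the procedure inside each resulting node until the nodes are pure or no attributes remain.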
Departmental stores may use data mining to understand customer
behavior, sales trends, and market behavior, and to predict market
strategy. This can be done using the following table. Table 2 combines
two forms of tables: a case table and a nested table. A case table
contains the case information related to the non-nested part of the
data, and a nested table contains information related to the nested
part of the data. Here there are two input tables to the mining model.
One table contains information about customer demographics; it is the
case table. The other contains information about customer purchases; it
is the nested table. In database technology, a nested table is similar
to a transaction table. In the example, the age groups may be made
broader at the cost of accuracy, whereas narrower age groups make the
algorithms more complicated. This applies to the other parameters too.
Age-Group: a-below 15, b-15-20, c-21-26, d-27-32, e-33-38, f-39-44, g-45-50, h-51-56, i-57-62, j-above 62
Marital-Status: M-married, S-separated, D-divorced, U-unmarried
Wealth-Group: A-Less than 50,000, B-Between 50,000-250,000, C-Between 251,000-450,000, D-Above 451,000
            |           |                |              | Product Purchase
Customer-id | Age-Group | Marital-Status | Wealth-Group | Product         | Quantity
1           | C         | M              | B            | Washing Machine | 1
            |           |                |              | TV              | 1
            |           |                |              | Shampoo         | 2
2           | E         | S              | C            | Diet-coke       | 12
            |           |                |              | TV              | 1
            |           |                |              | Jelly           | 3
            |           |                |              | Cake            | 2
3           | B         | M              | A            | Coke            | 3
            |           |                |              | Cake            | 1
            |           |                |              | Jelly           | 1
Table-2
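The case-table and nested-table arrangement can be sketched as a small data structure. The Python example below is illustrative only: the class, field, and function names are assumptions, while the values are those of Table 2. Flattening the nested rows into (customer, product, quantity) records gives the transaction-table view of the same data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Purchase:
    """One row of the nested table."""
    product: str
    quantity: int


@dataclass
class Customer:
    """One row of the case table, carrying its nested purchase rows."""
    customer_id: int
    age_group: str
    marital_status: str
    wealth_group: str
    purchases: List[Purchase] = field(default_factory=list)


CUSTOMERS = [
    Customer(1, "C", "M", "B", [
        Purchase("Washing Machine", 1), Purchase("TV", 1), Purchase("Shampoo", 2),
    ]),
    Customer(2, "E", "S", "C", [
        Purchase("Diet-coke", 12), Purchase("TV", 1),
        Purchase("Jelly", 3), Purchase("Cake", 2),
    ]),
    Customer(3, "B", "M", "A", [
        Purchase("Coke", 3), Purchase("Cake", 1), Purchase("Jelly", 1),
    ]),
]


def to_transactions(customers):
    """Flatten nested purchases into (customer_id, product, quantity) rows."""
    return [
        (c.customer_id, p.product, p.quantity)
        for c in customers
        for p in c.purchases
    ]


def buyers_of(customers, product):
    """Customer ids whose nested table contains the given product."""
    return [
        c.customer_id
        for c in customers
        if any(p.product == product for p in c.purchases)
    ]


for row in to_transactions(CUSTOMERS):
    print(row)
print("TV buyers:", buyers_of(CUSTOMERS, "TV"))
```

Keeping the demographic attributes on the case row and the purchases in a nested list lets a mining model read one customer as one case, however many products that customer bought.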
To illustrate another example of data mining, consider the hidden
patterns inside data. Data mining finds hidden patterns inside data
sets, and these patterns can be used to solve many business problems.
The following table presents a few business questions that are
difficult to answer without data mining, yet whose answers are
essential for making decisions on predictive marketing (de Ville, 2001;
de Ville, 2006; Weiss & Indurkhya, 1997). Fields for Table 3 could be
Cust_ID, Income, Other_Income, Loan, Age_Group, Area_Residence,
Home_Years, Value_House, Home_Type, Insured, Type_of_Insurance,
Education_Level, Leave_Yes_No, and others. Association rule mining is
another fundamental technique in data mining.
Question Number | Question (Data Mining Application)
1 | Identifying those customers who are most likely to depart, based on customer demographic information (decision tree without nested table)
2 | Grouping heterogeneous customers into subgroups based on customer profile to generate a mailing list for marketing purposes (clustering without nested table)
3 | Finding the list of other products that the customer may be interested in, based on the products the customer has purchased (cross-selling using decision tree with nested table)
4 | Grouping customers into more or less homogeneous groups based on the customer profile and the list of banking products they have subscribed to (clustering with nested table)
Table-3 Information for Predictive Marketing
In some real-life applications, such as market basket analysis in
supermarket chain stores, data sets can be too large for manual
analysis, and potentially valuable relations among attributes may not
be evident at a glance. An association rule mining algorithm can find
frequent patterns (sets of database attributes) in a given data set and
generate association rules among database attributes. For example, some
items are frequently sold together, such as milk and cereal, or bread
and butter. Such items can be displayed together to make shopping more
convenient. Association rule mining is generally applicable to
applications in which the data set is large and it is useful to find
frequent patterns and their associations, for example, market basket
analysis, medical research, and intrusion detection. Similarly,
algorithms may be devised for various other social activities, such as
a readymade garments databank (bridging the gap between developed and
developing countries), NGO networks engaged in social development work,
a skill and capacity development databank (migration of skilled
workers), a jobs databank (for youths and the jobless), an online blood
bank (for emergencies and disasters), and a microcredit databank, for
the overall benefit of society.
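The two measures that association rule mining commonly relies on, support and confidence, can be sketched in a few lines of Python. The baskets below are hypothetical and the function names are illustrative; for large data sets a frequent-itemset algorithm such as Apriori (Agrawal & Srikant, 1994) would replace this direct counting.

```python
from itertools import combinations

# Hypothetical market baskets, one set of items per transaction.
BASKETS = [
    {"milk", "cereal", "bread"},
    {"milk", "cereal"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"cereal", "milk", "butter"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Share of baskets with the antecedent that also hold the consequent."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(set(antecedent) | set(consequent), baskets) / base


def frequent_pairs(baskets, min_support):
    """All item pairs whose support reaches the chosen threshold."""
    items = sorted(set().union(*baskets))
    return [
        pair
        for pair in combinations(items, 2)
        if support(pair, baskets) >= min_support
    ]


for pair in frequent_pairs(BASKETS, min_support=0.4):
    print(pair, support(pair, BASKETS))
print("cereal -> milk:", confidence({"cereal"}, {"milk"}, BASKETS))
```

A rule such as "cereal implies milk" is worth acting on only when both its support and its confidence clear thresholds chosen by the analyst; the same counting would apply to records in a jobs databank or a blood bank once they are expressed as item sets.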
Future Issues and Challenges
Data mining algorithms should in future take into account larger
databases, high dimensionality, overfitting, the assessment of
statistical significance, dynamic databases, the adaptation of
knowledge theory, the treatment of missing and noisy data, complex
relationships between fields, the understandability of tattered
patterns, user interaction and prior knowledge, and integration and
versatility with other systems (Wang, 2003). While measuring the
performance impact of social development activities, future research
should formulate a homogeneous pattern of implementation, given that
the environment, economy, culture, and other parameters vary at the
peripheries. Specifically, in terms of knowledge centers, there should
be a symmetric matrix to follow as a guideline, over which each node,
sub-node, or any discrete knowledge center could be established. This
will reduce the design cost, operating expenditure, and monitoring
complexity, and assist in measuring performance quantitatively. Given
the three patterns of implementation model, numerous debates are still
running across the globe about their advantages and disadvantages. A
systematic approach, in terms of establishing a mathematical formula
and its consequential algorithm, will ease debacles of an enormous
nature and lead to a verified threshold as output. Furthermore,
quantifying knowledge development from immensely discrete activities of
a qualitative nature will remain a challenge for future researchers.
Finally, utilizing data mining algorithms to measure performance impact
demands huge stores of data of varying nature; many of these data have
not been archived during the last decade of implementation phases
(collection and archival of existing data), and by far most of them
need to be transformed into recognized data sets so that they can be
used by verified data readers (transformation to any recognized
database structure). Now, before concluding, a pattern of data
transformation is portrayed in Figure-2. If a community would like to
synthesize data and transform them into knowledge, then transformation
patterns are visible. The vertical one is more or less thorough and
involves several stages of action during the transformation process,
though it deserves rigorous study and close observation. Researchers
may derive separate algorithms for this transformation process, so that
an acceptable measuring indicator may evolve in future.
Conclusion
It is well recognized that real-world knowledge-measurement
applications vary in terms of underlying data, complexity, the amount
of human involvement required, and the degree to which parts of the
discovery process can be automated. In most applications, however, an
indispensable part of the measurement process is that the analyst
explores the data and sifts through the raw data to become familiar
with it and to get a feel for what the data may cover. Furthermore,
very often an explicit specification of what one is actually looking
for only arises during an interactive process of data exploration,
analysis, and segmentation (Stumme, Wille & Wille, 1998). Therefore,
proper data mining techniques, with timely feedback analysis of the
executed results, deserve immediate attention for accurate results. It
is a difficult task to eliminate theories of probability, redundancy of
effort, and the abundance of varying data in determining reasonable
mathematical formulae to measure the impact of social development
processes. Complexity accumulates further when it comes to projects or
programmes related to newly evolved ICTs. Many developing and
transitional economies are entangled in severe social problems within
the vicious cycle of poverty; as a result, the evolution of
ICT-emulated performance indicators is extremely difficult to achieve.
They are diverse, tend to diverge, and tend to become vulnerable in the
longer run without a verified mathematical model. Moreover, data mining
algorithms should incorporate design, development, implementation, and
operational factors, in addition to developing mathematical models for
cost-benefit analysis. Foremost, through data mining and rigorous
analysis, success cases should come to the forefront, so that they can
be easily replicated elsewhere with minimal adjustments.
References
1. Abbass, H. A., Sarker, R. A., & Newton, C. S. (Eds.) (2002). Data mining: A heuristic approach. Hershey, PA: IGI Global.
2. Adamo, Jean-Marc (2001). Data mining for association rules and sequential patterns: Sequential and parallel algorithms. Springer Verlag.
3. Agrawal, R. & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases (pp. 487-499), Santiago, Chile.
4. Agrawal, R., Imielinski, T. & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD Special Interest Group on Management of Data (pp. 207-216), Washington, DC.
5. Boulicaut, Jean-Francois, Esposito, F., Giannotti, F. & Pedreschi, D. (Eds.) (2004). Knowledge discovery in databases. In Proceedings of the PKDD 2004: 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, Pisa, Italy.
6. Bramer, M. A. (Ed.) (1999). Knowledge discovery and data mining: Theory and practice. IEE Books.
7. Chakrabarti, S. (2002). Mining the Web: Discovering knowledge from hypertext data. Morgan Kaufmann.
8. Cox, E. (2004). Fuzzy modeling and genetic algorithms for data mining and exploration. Morgan Kaufmann.
9. Curotto, C. L. & Ebecken, N. F. F. (2005). Implementing data mining algorithms in Microsoft® SQL Server™. WIT Press.
10. de Ville, Barry. (2001). Microsoft data mining, Integrated business intelligence for e-commerce and knowledge management.
11. de Ville, Barry (2006). Decision trees for business intelligence and data mining: Using SAS enterprise miner. SAS Press.
12. Drewry et al. (2002). Current state of data mining. Department of Computer Science, University of Virginia.
13. Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases (a survey). AI Magazine, 17(3), 37-54.
14. Giuffrida, G., Cooper, L. G., & Chu, W. W. (1998). A scalable bottom-up data mining algorithm for relational databases. In Proceedings of the Tenth International Conference on Scientific and Statistical Database Management (pp. 206-209).
15. Hale, J., Threet, J., & Shenoi, S. (1994). A practical formalism for imprecise inference control. IFIP Transactions A: Computer Science and Technology, 60, 139-156.
16. Han, J., Kamber, M. & Chiang, J. (1997). Metarule-guided mining of multi-dimensional association rules using data cubes. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD’97) (pp. 207-210).
17. Kloesgen, W. & Zytkow, J. (Eds.) (2002). Handbook of data mining and knowledge discovery. Oxford University Press.
18. Larose, D. T. (2004). Discovering knowledge in data: An introduction to data mining. Wiley-Interscience.
19. Utimaco (2005). Data encryption: The foundation of enterprise security. Foxboro, MA: Utimaco Safeware, Inc.
20. Wang, J. (Ed.) (2003). Data mining opportunities and challenges. IRM Press.
21. Zhou, C., Li, Z., Meng, Y. & Meng, Q. (2004). A data mining algorithm based on rough set theory.