MINING INTERESTING RULES BY COMBINING KNOWLEDGE DISCOVERY FROM DATA WITH LITERATURE BASED DISCOVERY

By

Samah Jamal Fodeh

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Computer Science and Engineering

2006

ABSTRACT

MINING INTERESTING RULES BY COMBINING KNOWLEDGE DISCOVERY FROM DATA WITH LITERATURE BASED DISCOVERY

By Samah Fodeh

The application of data mining and knowledge discovery techniques to real world domains is a rewarding but extremely challenging process. The massive size, high dimensionality, and fast growth of databases are among the key challenges that need to be addressed by these techniques. In this thesis, we investigate the problem of mining interesting association rules from large databases in the presence of background knowledge. We hypothesized that the background knowledge provides useful domain information that may help us to subjectively evaluate and interpret the extracted rules. To illustrate the advantages of incorporating background knowledge, we developed a framework called MIR for mining interesting rules from the medical domain. We showed that the background knowledge helps us to distinguish obvious rules from those that are either spurious or probable. We tested the MIR framework using deidentified Electronic Medical Records (EMR) data from MQIC (Medical Quality Improvement Consortium) and utilized background knowledge available in the PubMed bibliography database.

Acknowledgments

I would like to thank Dr. Pang-Ning Tan, my major advisor, for his help and thoughtful advice throughout my thesis work. I would also like to thank my committee members, Dr. Abdol-Hossein Esfahanian and Dr. Joyce Chai, for agreeing to serve on my committee. I also thank Dr. Henry Barry, Dr. Michael Zaroukian, and Dr. David Weismantel from the College of Human Medicine for providing us with the data and the domain expertise needed to analyze the results of our experiments. Finally, I wish to express my appreciation to my parents and my husband for their support and love, which gave me the inspiration and persistence to overcome the difficulties I encountered while working on this thesis.
Table of Contents

CHAPTER 1  INTRODUCTION
CHAPTER 2  LITERATURE-BASED KNOWLEDGE DISCOVERY
  2.1 Representation of Background Knowledge
    2.1.1 Structured Background Knowledge
    2.1.2 Unstructured Background Knowledge
  2.2 Literature-Based Knowledge Discovery
    2.2.1 Discovery of Associations in Literature
  2.3 Combining Background Knowledge from Literature with Data Mining
    2.3.1 Using Background Knowledge for Association Analysis
    2.3.2 Using Background Knowledge for Clustering
CHAPTER 3  ASSOCIATION ANALYSIS
  3.1 Problem Definition
  3.2 Association Rule Mining
    3.2.1 Frequent Itemsets Generation
    3.2.2 Rule Generation
    3.2.3 Pruning Redundant Rules
  3.3 Interesting Measures
  3.4 Summary
CHAPTER 4  PROPOSED FRAMEWORK
  4.1 MIR: Mining Interesting Rules from Medical Data
  4.2 Rule Evaluation and Interpretation
    4.2.1 Querying Background Knowledge from PubMed
    4.2.2 Distinguishing Obvious Rules from Unexpected Rules
    4.2.3 Distinguishing Probable Rules from Spurious Rules
    4.2.4 Rule Evaluation and Interpretation using UMLS Ontology
  4.3 Summary
CHAPTER 5  EXPERIMENTS
  5.1 Experiment 1: Frequent Itemsets Generation
  5.2 Experiment 2: Estimating the Support from PubMed
  5.3 Experiment 3: Distinguishing Obvious Rules from Unexpected Rules
  5.4 Experiment 4: Distinguishing Probable Rules from Spurious Rules
  5.5 Experiment 5: Rule Interpretation using UMLS
  5.6 Summary
CHAPTER 6  CONCLUSION AND FUTURE WORK
REFERENCES
Chapter 1

Introduction

As we are transitioning from an era of information dearth to information glut [36], there is a need to explore the massively growing data in order to extract precious pieces of information. This exploration process is referred to as knowledge discovery from data (KDD). Knowledge discovery is the intelligent science of mining and extracting previously unidentified but potentially valuable information embedded in the data [32]. It is also the process of searching large databases for models and patterns that describe the structure of the data by applying techniques from statistics, machine learning, and pattern recognition. According to Hand [12], a model is a global representation of a structure that summarizes the systematic component underlying the data or that describes how the data may have arisen. The word "global" here signifies that it is a comprehensive structure referring to many cases. In contrast, a pattern is a local structure relating to just a handful of variables and a few cases. Examples of data mining patterns include outliers, clusters, frequent itemsets, and association rules, whereas classification and regression models are examples of the discovered models.

Over the past two decades, knowledge discovery has emerged as a key enabling technology for a variety of applications, from business intelligence and threat analysis to medical informatics and the modeling of Earth Science data. Despite its remarkable success, existing data mining technology is often plagued by several critical and fundamental challenges:

- The number of extracted patterns tends to overwhelm our ability to analyze them. For example, there has been tremendous interest in integrating a data mining technique known as anomaly detection into intrusion detection systems. However, the number of alarms generated by such systems may exceed the ability of security experts to administer them.

- Lack of information to interpret the significance of data mining results. For example, in association analysis, which is a widely used technique for finding strongly correlated features in data, some patterns may seem novel to domain experts while others may be obvious or even spurious. Though statistical methods such as the chi-square test have been successfully applied to eliminate spurious patterns, separating the novel from the obvious patterns remains a significant barrier for many data mining systems.

A common theme among these challenges is the need to incorporate background knowledge directly into the data mining process. Background knowledge may help domain experts to order the extracted patterns so that it is sufficient for experts to examine only the highest ranked patterns. In addition, it may also provide the information needed to aid domain experts in interpreting the meaning and significance of the patterns. While there has been considerable effort to incorporate background knowledge into various learning tasks, such knowledge must first be hand-coded by domain experts. Recently, there has been considerable interest in developing text mining techniques for extracting knowledge directly from literature, such as digital libraries and other repositories of scholarly articles. Such a process is known as literature-based discovery [42][43][16][17][23][33][38]. The goal of literature-based discovery is to uncover potentially meaningful relations between various concepts in a given application domain using information from the literature.
A classic example of literature-based discovery is given in [34] by Swanson, who successfully discovered an implicit relation between fish oil and Raynaud's disease by manually searching journal articles in the medical domain.

The objective of this thesis is to incorporate background knowledge automatically into the knowledge discovery process. Combining knowledge discovery from data with background knowledge is a challenging task for several reasons. First, each knowledge type has its own representation. For instance, the knowledge discovered from data may be encoded in the form of association rules whereas the background knowledge may be contained in the vast amount of scientific literature. The differences in their representations complicate the integration task because each rule extracted from the data must be mapped to its corresponding concepts in the literature. Second, the background knowledge may be unstructured and high in volume, thus requiring extensive preprocessing steps. Third, a systematic approach is needed to ensure that incorporating background knowledge actually helps to improve the quality of models and patterns discovered from data.

This thesis focuses on utilizing background knowledge (in the form of structured ontologies or scholarly articles in the scientific literature) to evaluate and interpret the association rules extracted from large databases. To illustrate the advantages of using background knowledge, we developed a framework called MIR for mining interesting rules from the medical domain. We showed that the background knowledge helps us to distinguish obvious rules from those that are either spurious or probable. The probable rules correspond to unexpected relationships between two concepts that are indirectly associated with each other (see Chapter 4). We tested the MIR framework using deidentified Electronic Medical Records (EMR) data from MQIC (Medical Quality Improvement Consortium) and utilized background knowledge available in the PubMed bibliography database.

The remainder of this thesis is organized as follows. In Chapter 2, we describe the various representations of background knowledge and introduce the literature-based discovery problem. This chapter also presents other existing work on utilizing background knowledge for knowledge discovery. In Chapter 3, we provide a brief overview of the association rule mining problem, while Chapter 4 summarizes the key components of our proposed MIR framework. In Chapter 5, we demonstrate the results of experiments that were performed to evaluate the effectiveness of the MIR framework. Finally, Chapter 6 gives the conclusions and suggestions for future work.

Chapter 2

Literature-Based Knowledge Discovery

The literature is a rich and major source of information in different areas of science. It is defined in Webster's dictionary as "The collective body of literary productions, embracing the entire results of knowledge and fancy preserved in writing; also, the whole body of literary productions or writings upon a given subject, or in reference to a particular science or branch of knowledge, or of a given country or period." Another definition of literature that is closer to our work is given on answers.com, which describes the literature as the body of written work produced by scholars or researchers in a given field. In this thesis, we will use online literature as the background knowledge that will be incorporated into the process of data mining.
Online literature is the digital version of the written body of literature posted on the web so that it is available for public use and can serve different purposes, such as automated knowledge discovery systems. Background knowledge extracted from online literature may help domain experts to evaluate the patterns extracted from the data, and it may also provide the information needed to interpret the meaning and significance of the patterns.

First, we describe the various ways to represent background knowledge. We then describe literature-based discovery, which is the process of searching for hidden and important connections among information embedded in published literature [9]. Finally, we provide a survey of prior work on how to incorporate background knowledge into data mining.

2.1 Representation of Background Knowledge

With the accelerated growth of the Internet, knowledge from diverse domains has been stored and made accessible for online access. Since acquiring information from the Web tends to be cheaper than acquiring it from human experts, considerable interest has emerged in employing this information as background knowledge for knowledge discovery systems. Background knowledge is generally represented in two different forms: structured and unstructured background knowledge. We discuss these representations in the remainder of this section.

2.1.1 Structured Background Knowledge

Structured background knowledge means that the relevant pieces of information are encapsulated into different blocks that are semantically related to form a tabular or hierarchical structure, such as an ontology. The advantages of using structured background knowledge are: (1) it is reliable, because the structure was built upon the agreement of domain experts, which makes it a strong reference for research purposes; and (2) the simplicity of retrieving the required information from the hierarchy and of mapping the to-be-classified data to the hierarchy, due to the semantic relations between its blocks. These advantages have made researchers favor the structured model over the unstructured one. However, structured background knowledge has a drawback: it remains limited in scope and static in its contents for a considerable duration of time because it is not frequently updated.

An ontology is an example of structured background knowledge. The objective of an ontology is to define the concepts and relationships in a domain. Such a conceptualization is intended to help with automatic translation among existing databases, because it can be used to form a bridge that associates different databases together. In the rest of this section we provide some examples of structured background knowledge.

Unified Medical Language System (UMLS): UMLS is a comprehensive medical compendium that contains many vocabularies from different sources and languages [41] and provides a mapping structure between those vocabularies. UMLS consists of three major sections. The first section is called the Metathesaurus, which contains different concepts (a concept may represent different synonyms of a term from different languages); a group of related concepts is mapped to a semantic type. The second section is the semantic network, where we can find the semantic relations between different semantic types.
Finally, we have the SPECIALIST lexicon, which is developed to provide the lexical information needed for Natural Language Processing (NLP), such as the syntactic, morphological, and orthographic information of each word.

Gene Ontology: The Gene Ontology (GO) is one of the most popular ontologies for bioinformatics research. It is a controlled vocabulary of terms in cellular and molecular biology; it consists of three sub-ontologies: the molecular functions of gene products, the biological processes, and the localization of the components. This ontology is used when analyzing the results of microarray experiments, because the huge volume of results that these experiments yield needs to be organized and classified in a way that makes it easier for the researcher to read and interpret them.

WordNet: WordNet is a lexical reference system, or linguistic dictionary, that contains English nouns, verbs, adjectives, and adverbs organized into synonym sets, each representing a lexical concept, and records the semantic relations that link the synonym sets together. WordNet can be explored in different directions: finding words that mean the same, words that are more general or more specific, as well as getting a brief description. This semantic lexicon can be used as a dictionary as well as for supporting automatic text analysis.

Semantic Web: The Semantic Web is another way of representing knowledge; it aims to make the tremendous information in literature machine processable on a large scale [25]. This requires the data to be in a standardized format to make it easily interchangeable. The goal of the Semantic Web is thus to enhance the usability of the web and its resources by marking the different documents with semantic information [41]. This enables computers to read and associate different documents together. Ontologies are one example of the Semantic Web vision because they encode data into sets of thesauri and record the semantic relations between these sets.

2.1.2 Unstructured Background Knowledge

Unstructured background knowledge could be in the form of video, images, or a large body of text. Unlike structured knowledge, having the information encoded in a text format makes it richer, because the contents of the unstructured representation are updated very frequently. Nevertheless, processing the text is more challenging because it requires special Natural Language Processing algorithms, whereas the organized form of the structured representation makes retrieving and manipulating the information easier. Examples of unstructured background knowledge include PubMed, which is one of the most popular sources of text in the biomedical domain, and Wikipedia, which is a huge general encyclopedia that contains definitions in a free-text format.

2.2 Literature-Based Knowledge Discovery

Literature-based discovery (LBD) is the process of searching for hidden and important connections among information embedded in published literature. The linkages and associations are found by drawing paths between two different objects. Most of the works in LBD follow two approaches for knowledge discovery: open or closed discovery [39].

Figure 1: Open discovery approach.

Open discovery is the task of generating a hypothesis based on a given starting concept [39]. The exploration process attempts to find an association between a starting concept (A) and the "to be discovered" concept (C).
The starting concept A is used as a query to an online database and all the documents that include this concept are retrieved. The most important terms and phrases are then extracted using a text processing technique. The list of these important terms is then filtered to get an intermediate list of B-concepts. The intermediate concepts represent topics that are closely related to the starting concept A. The literature is then queried again using each of the B concepts. The resulting documents are processed to get the most important terms. These terms correspond to candidates for the target concept C. The list of candidates for target concept C is then filtered via manual inspection by the domain experts. As a result of this processing, a new hypothesis could be generated that the concept C is indirectly related to the concept A [9].

Figure 2: Closed discovery approach.

Unlike open discovery, the closed discovery approach starts with known A and C concepts. The relation between the concepts A and C could be a previously generated hypothesis or an observed association. The goal of the closed discovery approach is to find novel B concepts that explain the initial observation or hypothesis [9].

2.2.1 Discovery of Associations in Literature

As explained in Section 2.2, the goal of using literature-based discovery is to find an association between terms A and C via an intermediate term B. The latter relation is quite challenging to discover because of the difficulty in selecting the intermediate concepts B_i from the list of discovered B concepts. Therefore, we need a measure to quantify and rank the relations (A, B1), (A, B2), ..., (A, Bn) and the relations (B1, C), (B2, C), ..., (Bn, C), and choose the highest ranked concept B_i. Structured ontologies and semantic relations may be used to quantify the semantic relations (A, B_i) and (B_i, C).

There are several studies that have recently been conducted on finding associations in the literature. For instance, the work in [42] looks for implicit indirect associations between two concepts in the medical domain using Medline records. To evaluate the candidate relations, the authors used the mutual information measure to determine the strength of associations between (A, B) and (B, C). Another approach is suggested in [43], where the authors seek to find implicit relations between two medical concepts (A, C) via an intermediate concept B. First, they compute the co-occurrences of (A, B) and (B, C) using PubMed abstracts. They then evaluate the semantic relations between these concepts using the UMLS ontology. In the following subsections, we give an overview of these related works.

Extending the Mutual Information Measure to Rank Inferred Literature Relationships [42]

In this work, medical literature is processed to build a network of associations between two selected objects, such that no direct association exists between these objects and an implicit kind of association is to be revealed.

Identifying literature-based associations: The common approach used to associate terms in a text is searching for their co-occurrences, but this technique has some drawbacks due to the fact that those linkages may be very general and useless, since no valuable information is conveyed. Additionally, the resulting relationships may be false alarms, because there is no apparent association between the terms [42].

Mutual Information Measure (MIM): The following measure is used in [42].
It measures the strength of associations between terms in literature and is widely used to quantify dependencies between variables, including co-occurring terms in text:

  MIM(A, B) = P(AB) / (P(A) · P(B))

where P(AB) is the probability of A and B appearing in the same document, and P(A), P(B) are the probabilities of A and B appearing in a document, respectively.

Scoring the inferred associations: The author proposes to extend the MIM measure to score the sub-relations A→B and B→C by computing their MIM values and then combining the results in two different ways:

- Averaging the MIM (AMIM): average the MIM scores for A→Bn and Bn→C, then divide the result by the total number of connections t between A and C to normalize:

  Score(A, C) = (1/t) · Σ_{n=1}^{t} [MIM(A, B_n) + MIM(B_n, C)] / 2

- Minimum value of MIM (MMIM): use the minimum value within the t connections between A and C:

  Score(A, C) = (1/t) · Σ_{n=1}^{t} min(MIM(A, B_n), MIM(B_n, C))

As mentioned by the author, the results from MMIM are slightly better than those from AMIM.

Experiments were done using MEDLINE, an immense online medical database created by the National Library of Medicine. The term capsaicin was examined and the tests revealed associations between capsaicin and other terms like calcium, neurons, and ileums. Those associative patterns were ranked by the proposed MMIM method and by the shared relations (where capsaicin and the other terms share the same documents). When using the shared relationship, the general relationships tended to be at the top of the list, while using the MMIM score placed the famous and obvious relations at the top of the list.

Many of the research methods in data mining depend on querying and screening available information. While this method searches for associations between two different terms, which is a small network of one level, it was mentioned that this could be extended to more levels, but the paper did not discuss how that could be done, because there are some challenges in the next levels (e.g., whether to keep (A, C) as the references while computing MMIM in the next levels, or to choose new terms from the top n-list).

A Semantic Approach for Mining Hidden Links from Complementary and Non-interactive Biomedical Literature [43][16][17]

In this method, the authors hypothesize that considering complementary and non-interactive biomedical literature together can disclose valuable knowledge that is not apparent when considering each individually. Bio-sbKDS is a semantic-based algorithm that discovers hidden relations between biomedical concepts; it uses the UMLS ontology for this purpose. Their method finds the relation C→A by finding the intermediate relations C→B and B→A. The algorithm requires the user to input the kind of relationship he is aiming for, the Initial Semantic Relation (ISR), between the starting concept C and the to-be-discovered concept A. They continue their algorithm by finding the semantic type of C (ST_C), and then they find all the semantic types that are related to C through different relations in UMLS and call this set ST_B_can. On the other hand, they use the relation ISR to identify the semantic type of the to-be-discovered concept A (ST_A), then extend this semantic set (ST_A_can) through the is-a relations in UMLS and call the resulting set ST_A_can_ext. After that, they do the filtering by examining whether there are semantic relations between the semantic types in ST_B_can and ST_A_can_ext. This step is performed using semantic relations from UMLS; the resulting sets will be ST_A and ST_B.
At this point, they incorporate literature again, using text mining this time, by searching PubMed abstracts against the concept C to get the B_can list from the MeSH headings of the abstracts. Now they apply ST_B to the B_can list to exclude irrelevant terms that do not belong to any semantic type in the ST_B set; the terms are ranked and the top n are selected. The same procedure is applied to find the A candidates, but using the top B terms from the previous step as the query search terms against the abstracts. For the final step, they extract the A candidate terms that co-occur with C in PubMed abstracts. Their experiments showed that their algorithm is more efficient. It was compared with an LSI-based (Latent Semantic Indexing) approach, which uses singular-value decomposition analysis and the cosine measure between (C, B) and (B, A) for ranking, and an AR-based (Association Rule Based) approach, which generates all C→B and B→A rules and applies the transitive law to discover the hidden link.

As described above, the method depends on UMLS. Because UMLS is a well-known ontology that contains reliable medical information, the method will discover the association A→C only if the relations A→B and B→C are available in UMLS. This step could be considered an advantage because it makes the results more trustworthy, since it prunes away the irrelevant terms. However, there is a negative side to it, because it bounds the performance of the method to the knowledge contained in UMLS, an evolving ontology that is updated over time. Another drawback is related to the cost of this method, due to the frequent querying of the PubMed literature.

Supporting Discovery in Medicine by Association Rule Mining in Medline and UMLS [14]

Literature is incorporated in this work to evaluate association rules using a new system, the Discovery Support System. This is an interactive system that demonstrates how an embedded relation could be found between a starting concept and the concepts of interest. More specifically, finding relations between X→Y and Y→Z reveals an embedded relation X→Z that is not available in the literature. The Medline database is used along with MeSH descriptors, which are terms assigned to the Medline records as a representation of the contents of the articles [14]. The first step is to prepare the association rules between the MeSH concepts. The next step is to find all the concepts Y related to the starting concept X (X could be any concept, e.g., a disease, and Y could be a symptom); then all the concepts Z related to Y are found [14]. Z could be a chemical substance that alleviates the symptoms Y. In the last step, they check whether X and Z co-occur in the literature; if the association X→Z does not appear in the literature, then this is a newly discovered relation to be confirmed by human intervention or a clinical investigation [14]. Additionally, they add a limiting or filtering procedure on the relations X→Z. This could be implemented by several methods. One way is to incorporate the semantic types from UMLS, or to add a threshold on the support and confidence of the associated rules X→Y and Y→Z. In their experiment, they divided the Medline database and the corresponding association rules into two segments according to the publication dates of the documents [14]. The older segment is within the period 1990-1995 and the new segment is within the period 1996-1999.
This is because they wanted to evaluate their results by using the older segment to mine the expected associations as new discoveries, and then to examine the existence of these associations in the articles of the new segment. One advantage of this method is its reliability, because semantic relations from UMLS are imposed to qualify the predicted relation X→Z. On the other hand, using thresholds for the support and confidence of the intermediate relations X→Y and Y→Z restricts the results to the values of the selected thresholds.

Improving Literature Based Discovery Support by Background Knowledge Integration [15]

This work is an extension of the previous work, but the authors here incorporate different databases from the literature to reach their goal. In this work, they define UMLS as the main source for the relations between the biomedical concepts. Those relations are extracted and stored in a knowledge base. They try to find candidate genes of a certain disease. To achieve this, they include different sources of knowledge:

- The chromosomal locations of the diseases are available in OMIM (Online Mendelian Inheritance in Man), a database catalog of human genes and genetic disorders that contains textual information and references.

- The gene locations are extracted from HUGO (a database that contains gene names and symbols), LocusLink, and OMIM.

- The disease location of manifestation (the tissue or organ where the disease phenotype is expressed) comes from OMIM and Medline [15].

- The gene expression location comes from the UniGene database.

They use background knowledge to connect the dots between the disease of interest and its causing genes. This resembles the goal of [37], but in this work they do not include an anatomical ontology; instead they use a lower, more specific level of chromosomal ontologies. They have a disease as a starting concept and mine for associations in Medline guided by semantic types and relations. The associations should tie a disease to its chromosomal and manifestation locations as well as link the candidate genes with their chromosomal and expression locations. If the regions or locations match in any way, this could be a good indication that the gene is a disease-causing gene.

Other works have used literature for deriving association rules, such as the work in [32][33]. In this work by Swanson and others, the authors looked for implicit associations between terms (A, C) by finding associations between the concepts (A, B) and (B, C), where a direct relation between (A, C) is non-existent. They use MeSH terms as indexes to search Medline. They claim that the coherence of the retrieved literature is necessary for the accuracy of the results. Coherence means that the literature on magnesium, for example, should not include articles that mention magnesium only incidentally [32]. This literature cohesiveness is obtained using MeSH terms. Furthermore, ranking the top n-list of the intermediate B terms depends on the density of the common MeSH terms between the AB and BC records. Likewise, in [33] the authors discussed the prediction of relationships in scientific literature.

2.3 Combining Background Knowledge from Literature with Data Mining

The knowledge available in literature has been incorporated in exploring different domains, as we mentioned above. We can find good examples of this type of research in the medical domain.
Incorporating background knowledge that is available in the online literature directly into the data mining process supports the performance of automated knowledge discovery systems and positively affects their results. It aids domain experts in ordering the extracted patterns so that only the highest ranked patterns need to be examined. In addition, it also provides the information needed to aid domain experts in interpreting the meaning and significance of the patterns. Background knowledge has been utilized in analyzing association rules, clustering, and classification. In the next subsections, we demonstrate some examples of this work.

2.3.1 Using Background Knowledge for Association Analysis

The background knowledge that resides on the World Wide Web is used to build bridges between various categories and blocks in the targeted databases. The following articles are examples of this work.

Integration of text- and data-mining using ontologies successfully selects disease gene candidates [37]

In this paper, the authors' goal is to identify disease gene candidates by using an anatomical ontology to integrate text mining of biomedical literature and data mining of gene expression data. Figure 3 summarizes the proposed algorithm. In effect, they try to build a bridge between two different databases using the eVOC ontology (a controlled vocabulary which describes the human anatomy).

Figure 3: Disease gene discovery [37]. (The figure links a disease name to associated eVOC terms identified in PubMed by DDE and to a disease gene locus identified by linkage analysis; the ranked disease-associated eVOC terms and the genes are then intersected to produce the filtered candidate genes.)

The database on the left is the PubMed abstracts, while the database on the right is the disease genes from the Ensembl online database. Their method starts by processing the PubMed abstracts in the literature and finding the eVOC terms that co-occur with the disease name. Similarly, selected genes from the Ensembl database are annotated with eVOC terms from the eVOC ontology. After that, they rank the selected anatomy terms by computing the frequency of association and the frequency of annotation for each term; the top N eVOC terms are selected from the list and then intersected with the eVOC terms annotated to the candidate disease genes selected from the Ensembl database. The intersection of both sets should yield at least N eVOC terms that match the top N terms associated with the disease name in the PubMed abstracts.

The ranking function, as mentioned, depends on two coefficients. The frequency of association is the number of abstracts containing the eVOC term and the disease name divided by the total number of abstracts containing the disease name. The frequency of annotation is the sum of all annotated genes at the node and descendants of the eVOC term in the eVOC ontology hierarchy, divided by the total number of annotated genes.

  Ranking Score = [2*f(association) + f(annotation)] / 2

It is clear that the method depends on the eVOC ontology when integrating both databases, which is a reasonable step to take, because ontologies are created after intensive and reliable studies, which makes them robust references for researchers.

Using an Interest Ontology for Improved Support in Rule Mining [44]

In this work, the authors use an interest ontology in order to extract association rules that associate ages and gender with interests at a higher level. Interests in the data that are specific are replaced by more general interests at the higher levels of the ontology hierarchy. The ontology used is an interest ontology that consists of interest hierarchies based on Yahoo and ICQ [44]. There are 11 levels of interests in the interest hierarchy.
So, interests in the data that are specific are replaced by more general interests on the higher levels of the ontology hierarchy. The ontology used is an interest ontology that consists of interest hierarchies based on Yahoo and ICQ [44]. There are 11 levels of interests in the interest hierarchy. 21 The data is collected from different home pages and consists of records of real personal information that contains demographic and interest data [44]. Basically, the algorithm stores the cleaned data (collected from the web) in an object relational database. After that association rules are created using WEKA, a tool that implements different data mining techniques. A problem emerged here that most people express their interests on the lower level of the interest hierarchy, which negatively affected the support of the rule. In order to overcome this problem and to get higher accuracy and support [44], a support raising algorithm was developed which includes the records from the database that are mapped to the lower levels of the interest hierarchy. 2.3.2 Using Background Knowledge for Clustering The impetus behind using ontologies in clustering is its organized categorized structure which encodes the similar objects under certain conditions within one block at a certain level. It breaks this block into subdivisions in the lower levels, because of adding more conditions or attributes to the objects. For example, the research in [22] puts the genes with similar expression profiles into one category because this similarity implies strong correlation in their biological process. They evaluate their clusters by mapping them to GO categories by computing the probability of the hyper geometric distribution which measure whether a cluster is enriched with genes from a particular category to a greater extent than that would be expected by chance [22]. So the computed probability would give a flavor if the genes in the cluster have the same biological function or not. Another example of ontology driven clustering is in [40]. The proposed approach uses similarity/distance measure on the items in the GO hierarchy. The similarity 22 between two terms c, , c. I in the hierarchy is computed using Lin’s similarity model. Where S(c,. ,C).) are the set of parent terms shared by both c, ,cj. This algorithm was used for clustering gene records using the GO hierarchy and it gave good results. In the next subsection, we review some of the work that incorporates background knowledge in order to cluster the data. Gene Ontology Friendly Bi-clustering of Expression Profiles [22] [21] This work incorporates Gene Ontology (GO) from literature, to perform a biclustering approach. Before continue explaining the method, here is a brief definition of GO. G0 is a directed acyclic graph, GO=, where V is a set of gene function description (GO terms) and E is the relationship on V [22]. A relationship between a term and its ancestor is a part-of relationship. As a result of this ontology representation, genes that are annotated with a GO term are subset of genes annotated with the term’s ancestor. Formally, if T is a GO term with annotated genes 0, and T’ is the ancestor of T with annotated genes 0’, 0 g 0’. The authors hypothesize that genes with similar expression profiles implies strong correlation in their biological process; where an expression profile of a gene is the process by which a gene’s DNA sequence is converted into the structure and functions of a cell [41]. 
Moreover, they propose that clustering the genes under all conditions or expression levels will separate the biologically related genes from each other [22]. As a result, the clusters they are aiming for should have the maximal number of genes that have coherent tendency (consistent ranks) along a subset of conditions and these are called Tendency-Preserving Clusters (TP). The proposed algorithm Smart Hierarchical Tendency Preserving Clustering (SHTP-Clustering) is based on TP- clustering model that forms the Tendency Preserving tree (TP-tree) and incorporates G0 23 categories in the process such that each sub tree in the TP-tree has a matching sub tree in the GO hierarchy. To define tendency between genes, a ranking function is defined. It translates the conditions in a gene’s expression profile into an alphabetical ordered sequence. Genes with consistent ranks under a subset of conditions or expression levels are grouped in one TP-cluster. After that they build their objective TP- tree using the TP-clusters such that each node is a TP-cluster by itself, and the nodes at level m correspond to m dimensional clusters. Annotation of gene cluster: After having the TP-tree ready, the produced Gene TP-clusters are annotated using GO. To achieve this annotation or mapping, they inspect the enrichment of the functional categories (TP-clusters) by introducing a P-value of the hyper geometric distribution, the P-value measures whether the cluster is enriched with genes from a particular category to greater extent than that would be expected by chance[22]. W“) P:]_z;0 1(n—i (g (n The probability of observing at least (k) genes from a cluster of (n) genes by chance in a category containing (f) genes from a total of (g) genes.[22] If the greater number of the genes in a certain cluster have the same biological function, the probability of the distribution is high and consequently the P-value is close to O. A threshold t is used to determine the significance of the p-value. The annotation itself means mapping a sub tree from the GO to the cluster, to find the most suitable Ontology 24 Sub tree (OST) that represents this cluster. The P-value is computed for each category in GO, such that the OST with the least value is the most significant one. Mapping the TP-tree onto GO-hierarchy: The child-parent relationship in the GO hierarchy was used to evaluate the child—parent relationship in the TP-cluster tree. So if C is a TP-cluster and C’ is one of its descendent and if H is the OST of C and H’ is the OST of C’ then C’ is biologically descendent of C if H ’ = ( H [22], which means the root of H ' appears in H. So the SHTP algorithm starts with a node of the TP-cluster tree, and extracts the OST of this cluster, if the OST is not biologically viable or related to the parent OST, discard this node or cluster and its descendents and continue with the next sibling according to the predefined traversal order [22]. Their experiments showed that the two pruning techniques they used which are the OST and the threshold of the P-value made it a faster algorithm. This method’s goal resembles ours in terms of trying to combine domain knowledge from data with domain knowledge from literature. They do this in an intelligent way using the hyper geometric distribution. One drawback in this algorithm is that it can not be extended to large databases because the whole database has to be resident in the memory to compute the ranking function. 
An Ontology-Driven Clustering Method for Supporting Gene Expression Analysis [40]

The impetus for using ontologies in clustering is that classical clustering of data records lacks the ability to describe the biological meaning of the similarity relationships represented [40]. The proposed approach uses a similarity/distance measure on the items in the GO hierarchy. The similarity between two terms c_i and c_j in the hierarchy is computed using Lin's similarity model:

  Sim(c_i, c_j) = [2 · max_{c ∈ S(c_i, c_j)} log p(c)] / [log p(c_i) + log p(c_j)]

where S(c_i, c_j) is the set of parent terms shared by both c_i and c_j [40], and p(c) is the probability of finding c or one of its children in the annotation database being analyzed [40]. This algorithm was used for clustering gene records using the Gene Ontology hierarchy and it gave good results.

Ontology-Based Induction of High Level Classification Rules [35]

In this work, a high-level classification is performed by developing a query-based tool which combines knowledge encoded in ontologies and knowledge stored in relational databases to find interesting information by extending a query into a collection of queries against the database. According to this query extension, general classification rules at different levels of abstraction are produced. Specifically, the ontology labels are augmented with frequency counts from the database to guide the classification process and prune the search space [35]. The counts are grouped by class for each concept in the hierarchy. Then, the evaluation function of the (attribute, value) pairs D(Av) is computed for each concept that is a strong indicator of class membership, and the selected pair should require the fewest additional constraints to create a query that is satisfied by tuples in a single class [35]. The advantage of this algorithm is that it allows the databases to be queried at a high level of abstraction [35]. One drawback of this work is that the classification process is fully dependent on the involved ontology: the accuracy of the produced classification rules depends on the correctness of the ontology hierarchy.

WordNet Improves Text Document Clustering [13]

This work addresses a different topic from what we have demonstrated so far. Clustering documents is the process of organizing and grouping a large amount of information into small, meaningful collections. As the authors indicated, previous text clustering methods are based on related terminology between documents rather than on the conceptual meaning of concepts [13]. Therefore, they propose an algorithm that incorporates background domain knowledge, namely WordNet, to overcome this deficiency. WordNet is a semantic lexicon for the English language; it groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets [41]. Clearly, they employ the semantic relations between the concepts while clustering the documents. Their approach starts by composing a term vector for each document that encodes the frequencies of the terms. Several processing steps are applied to the terms within the vectors, as follows:

- Step one: Removal of the stop words (the, is, there, in, which, ...) from the vectors, since they are meaningless.

- Step two: Stemming the words using the Porter stemmer. Basically, stemming is the process of reducing a word to its stem or root form, since the morphological variants of a word have similar semantic interpretations.

- Step three: Pruning the rare terms using a threshold, since a rare term does not affect the clustering results. Formally, a term t is excluded from the vector representation if Σ_{d ∈ D} tf(d, t) ≤ δ, where tf(d, t) is the term frequency of the term in a particular document.
0 Step three: Pruning the rare terms using a threshold. A rare term that does not affect the clustering results. Formally, a term is exclude from the vectors representation if Z If (d , t) S 6 ; where tf(d,t) is the term frequency de D of a term in a particular document. 27 0 Step Four: tfidf: Term frequency inverse document frequency. This measure weighs the frequency of the term in a document, with a factor that discounts its significance when it appears in almost all the documents.[l3] _ - | D l TFIDF (d,r) — log( If ((1,!) + l) * log [W] The set D contains all documents, whereas df(t) is the number of documents in which the term t appears, f(d,t) is the term frequency of the document. When the tfidf is applied, the frequency of a term in the document is replaced by the tfidf value for all terms in the vector that represents a particular document. Compiling background knowledge into the text document representation: Including concepts from WordNet resolves synonyms and introduces more general concepts help in identifying related concepts [13]. Concepts from WordNET are included in the terms’ representation in different methods such as: 0 Adding concepts C to the term vector. The concept frequency of a document (1 is computed by investigating all the terms in the vector against this concept. So, if a term is related to this concept in WordNet, then the frequency of the concept is increased. 0 Replacing terms in the vector representation. If a term has at least one corresponding concept in WordNet. However, terms that do not appear in WordNet are not eliminated. 0 Concept vector only. All terms are replaced with their corresponding concepts from WordNet and terms that do not appear in WordNet are eliminated. 28 The mapping of terms to concepts is challenging, since a plethora of concepts may be retrieved from WordNet and this may lead to ambiguity. It is unclear which is the most suitable concepts to include. Different ways are found to resolve this issue such as adding all concepts to the vector representation, or selecting only the first one, because originally concepts are retrieved from WordNet in a certain order that shows the most common word in English to be the first concept with the highest rank. In their experiment, the representation was clustered using Bi-section mean clustering and the similarity between two documents d,,d2 6 Dis measured by the cosine measure between their corresponding vectors t1, t2 respectively. tlor2 II II II oll t2 ll cos(tl,t2) = The clustering results (P) were evaluated using pre—categorized clusters (L) that were done manually on this dataset. Purity of both partitions P, L is computed using the following formula: Purity (P, L) = Z J—Elnlialx (prescision (p,l)) [)6 P I Ipflll lpl Precision (P,L)= ; measures the closeness of results between the two clusters. The results showed that clustering with background knowledge yielded purity up to 61.8% while clustering without background knowledge yielded purity 57%, Furthermore, tfidf was a significant factor in text representation in both cases. 29 The advantage of this work is integrating different combinations of parameters such as disambiguation and term weighting, to enhance the clustering results. Comparing this to our work we did use the term weighting but with a different formula, which also contributed to enhancing our results. 
In this chapter, we discussed literature as an important source of information, because it includes the background knowledge used by researchers to improve the knowledge discovery process. We also discussed the two different types of background knowledge, structured and unstructured, along with the advantages and disadvantages of each. In Section 2.2 we then described literature-based discovery and reviewed several examples of it. Finally, we discussed combining knowledge discovered from data with literature-based discovery. In the next chapter we discuss the pattern discovery process, focusing mainly on association rule mining since it is the data mining technique used in this thesis.

Chapter 3

Association Analysis

Association analysis is a methodology in data mining for discovering interesting relationships hidden in large databases. The discovered relationships are typically represented in two forms: frequent itemsets and association rules. Frequent itemsets represent subsets of items that appear together frequently in the data, e.g., {diaper, milk}; whereas association rules suggest the existence of relationships between the presence of items in the antecedent and the consequent of the rules, e.g., (cough → fever). Association rule mining was initially developed for business intelligence, in which market basket transactions are analyzed to identify potential opportunities for applications such as marketing promotions and shelf management. For example, suppose the rule (diapers → beer) is discovered from the set of transactions recorded in a certain store. Store managers may act on the suggestion made by the rule by keeping both items close to each other on the shelves because they are often sold together. Association rule mining can also be applied to the medical domain to extract relationships from clinical data. For example, it can be used as a reminder system in clinical decision support tools to remind physicians about other potential medical conditions that may explain the symptoms exhibited by a patient.

In this chapter, we present an overview of association rule mining. In Section 3.2 we discuss some of the algorithms that are currently used to efficiently extract the association rules. Section 3.3 demonstrates some of the techniques used to identify interesting association rules and to eliminate the ones that are not useful or are redundant.

3.1 Problem Definition

In this section we introduce the terminology used to explain some of the concepts in association analysis.

Definition 1 (Items): Let I = {i_1, i_2, ..., i_m} denote the set of items available in a data set.

Definition 2 (Itemset): An itemset X is a nonempty subset of all items, i.e., X ⊆ I. If an itemset contains k items then it is called a k-itemset.

Definition 3 (Transactions): Let T be the set of all transactions.

Definition 4 (Support): The support count of a given itemset, denoted as σ(X), is the number of transactions that contain that itemset:

  σ(X) = |{t_i | X ⊆ t_i, t_i ∈ T}|

where |·| denotes the number of elements in a set. The support of an itemset is the fraction of all transactions that contain it; for a rule X → Y,

  support(X → Y) = σ(X ∪ Y) / |T|

Definition 5 (Frequent Itemset): A frequent itemset is an itemset whose support is greater than or equal to some user-specified minimum support threshold.
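As a brief illustration of Definitions 4 and 5, the following minimal sketch counts support over a small set of transactions and lists the frequent itemsets of size one and two. The transactions, item names, and threshold are hypothetical and chosen only for the example.

```python
from itertools import combinations

# Hypothetical transactions (each a set of items), used only to illustrate Definitions 4 and 5.
transactions = [
    {"cough", "fever", "cold"},
    {"cough", "fever"},
    {"fever", "rash"},
    {"cough", "cold"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain the itemset X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

minsup = 0.5
items = sorted(set.union(*transactions))
# Enumerate 1- and 2-itemsets and keep the frequent ones (support >= minsup).
for k in (1, 2):
    for itemset in map(frozenset, combinations(items, k)):
        s = support(itemset, transactions)
        if s >= minsup:
            print(set(itemset), s)
```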
Definition 6 (Association Rule): An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets.

Definition 7 (Confidence): The confidence of an association rule X → Y determines how frequently items in Y appear in transactions that contain items in X:

  confidence(X → Y) = σ(X ∪ Y) / σ(X)

where σ is the support count defined above.

Definition 8 (Association Rule Discovery): Given a set of transactions T, find all the rules that have support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the corresponding support and confidence thresholds.

3.2 Association Rule Mining

As stated in Definition 8 in Section 3.1, association rule mining is the process of finding all the association rules that satisfy the minimum support and minimum confidence conditions. In order to mine these rules, the support and confidence values first have to be computed for all of the rules and then compared with the threshold values to prune the rules with low support or low confidence. But this process is computationally expensive because there is an exponential number of rules that can be extracted from a dataset, according to the following formula:

  R = 3^d − 2^(d+1) + 1

where d is the number of items in the dataset and R is the number of possible rules. However, a large number of these rules will be pruned after applying the support and confidence thresholds, so the computations spent on them will have been wasted. To avoid this problem and to improve the performance of the rule discovery algorithm, the procedure for mining association rules is divided into two steps. The first step is frequent itemset generation, in which all the itemsets that satisfy the minimum support threshold are extracted. The second step is rule generation, in which rules that have high confidence are formed from the previously extracted frequent itemsets. With this methodology, the pruning of spurious rules is done before completing all the computations. For example, if we have the itemset {cough, fever, cold}, the support of all the rules formed from this itemset is identical because the support of any candidate rule is the support of {cough ∪ cold ∪ fever}. Therefore, if this itemset is infrequent, then all candidate rules, such as {cough → cold, fever}, {cold → cough, fever}, and {fever → cough, cold}, are guaranteed to be infrequent.

3.2.1 Frequent Itemsets Generation

Given a data set that contains k items, the number of itemsets that could be generated is 2^k − 1, excluding the empty set. In order to mine the frequent itemsets, the support of each itemset must be computed by scanning each transaction in the dataset. A brute-force approach for doing this is computationally expensive due to the exponential number of itemsets whose support counts must be determined. Several algorithms have been proposed to improve the efficiency of frequent itemset generation. There are two typical strategies adopted by these algorithms: the first is an effective pruning strategy to reduce the combinatorial search space of candidate itemsets; the second is to use a compressed data representation to facilitate in-core processing of the itemsets. Below, we give an example of each strategy.

Apriori Algorithm

The Apriori algorithm is one of the first algorithms to effectively reduce the combinatorial search space. The algorithm is guided by the concept of support-based pruning to systematically control the exponential growth of candidate itemsets. The pruning strategy, also known as the Apriori principle, states that if an itemset is frequent then all of its subsets must also be frequent. For example, if we have a frequent itemset {cough, fever}, then any transaction that contains {cough, fever} must also contain its
3.2.1 Frequent Itemsets Generation

Given a dataset that contains k items, the number of itemsets that could be generated is 2^k − 1, excluding the empty set. In order to mine the frequent itemsets, the support of each itemset must be computed by scanning every transaction in the dataset. A brute-force approach is computationally expensive because of the exponential number of itemsets whose support counts must be determined. Several algorithms have been proposed to improve the efficiency of frequent itemset generation, and they typically adopt one of two strategies: the first is an effective pruning strategy that reduces the combinatorial search space of candidate itemsets; the second is a compressed data representation that facilitates in-core processing of the itemsets. Below, we give an example of each strategy.

Apriori Algorithm

The Apriori algorithm is one of the first algorithms to effectively reduce the combinatorial search space. The algorithm is guided by the concept of support-based pruning to systematically control the exponential growth of candidate itemsets. The pruning strategy, also known as the Apriori principle, states that if an itemset is frequent then all of its subsets must also be frequent. For example, if {cough, fever} is a frequent itemset, then any transaction that contains {cough, fever} must also contain its subsets {cough} and {fever}; hence {cough} and {fever} must also be frequent. Conversely, if an itemset {cough, fever} is infrequent, then all of its supersets, such as {cough, fever, cold} or {cough, fever, swelling, cold}, must be infrequent. The Apriori principle therefore allows us to prune the number of itemsets that must be examined during frequent itemset generation. Apriori is a breadth-first search algorithm that generates candidate itemsets of length k from the frequent itemsets of length (k−1) found in the previous pass. It then scans the database once to determine the support counts of the candidate itemsets. Candidate itemsets whose support counts satisfy the support threshold are declared frequent itemsets.

The second strategy that is often adopted for efficient generation of frequent itemsets is to use a compressed representation of the data. An example of this approach is the FP-Growth algorithm, which stores the data in memory using a data structure called the FP-tree. This approach allows the support count to be determined by accessing the data structure in memory instead of repeatedly scanning the transactions on disk. In the FP-tree representation, each transaction in the database is represented as a path from the root of the FP-tree. Compression is achieved by merging the prefix paths of transactions that share the same items. A vertical database representation is another popular strategy, in which each item is associated with a column of values indicating the transactions that contain the item. For example, algorithms such as CHARM [40] use vertical tid-lists as their data representation, while MAFIA [41] and DCI-Closed [42] use vertical bit vectors. In our proposed framework, we use a frequent itemset mining algorithm called PGMiner, a time- and space-efficient algorithm recently developed by our team.

Prefix Graph Miner Algorithm

Prefix-Graph Miner (PGMiner) is a graph-based approach for mining frequent itemsets that was developed by members of our team. The approach consists of constructing a prefix graph structure and decomposing the database into variable-length bit vectors assigned to the nodes of the graph. The main advantage of this representation is that the bit vectors at each node are considerably shorter than those produced by existing vertical mining methods, which facilitates fast frequency counting of itemsets via intersection operations. Our method maintains a condensed representation of the database in memory, which is vital for achieving good efficiency. As noted above, existing algorithms such as FP-Growth encode the relevant itemsets and their frequency information in an FP-tree, while vertical methods such as CHARM, MAFIA, and DCI-Closed rely on tid-lists or vertical bit vectors.
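To illustrate the intersection-based support counting on which these vertical representations (and, as described next, the PrefixGraph) rely, here is a minimal sketch using the same hypothetical toy transactions as before; it is illustrative only and is not the data layout of any of the cited algorithms.

# Minimal illustration of vertical bit-vector support counting.
# Each item maps to a bit vector with one bit per transaction.
transactions = [
    {"cough", "fever", "cold"},
    {"cough", "cold"},
    {"fever", "headache"},
    {"cough", "fever", "cold", "headache"},
]

def bit_vector(item):
    """Bit i is set when transaction i contains the item."""
    bits = 0
    for i, t in enumerate(transactions):
        if item in t:
            bits |= 1 << i
    return bits

def support(itemset):
    """Intersect (bitwise AND) the item bit vectors and count the set bits."""
    vec = ~0
    for item in itemset:
        vec &= bit_vector(item)
    return bin(vec & ((1 << len(transactions)) - 1)).count("1")

print(support({"cough", "cold"}))       # 3: transactions 1, 2, and 4
print(support({"fever", "headache"}))   # 2: transactions 3 and 4

PGMiner applies the same AND-and-count idea, but to the much shorter per-node bit vectors of the PrefixGraph described below.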
The FP-tree is known to provide a good compressed representation of the data because items are shared between transactions. Nevertheless, when the database is sparse or the support threshold is low, the FP-tree storage requirements may exceed the size of the database itself.

Database | DB Size | Min Support | FP-Tree Nodes | FP-Tree Size | Vertical Bit Vector Size | PrefixGraph Size
Chess | 474.4 KB | 25% | 31,812 | 621 KB | 19.9 KB | 239.6 KB
Pumsb | 14.0 MB | 45% | 183,349 | 3.5 MB | 377.2 KB | 4.28 MB
WebDocs | 1,150.4 MB | 10% | 50,313,644 | 959.6 MB | 52.8 MB | 111.6 MB
Kosarak | 36.8 MB | 0.08% | 3,425,391 | 65.3 MB | 189.3 MB | 9.2 MB

Table 1: Characteristics of various condensed representations [24]

Table 1 shows that the memory requirements for storing vertical bit vectors are generally less than those for the FP-tree, except in the case of the Pumsb data set. The advantage of using vertical bit vectors is that they allow fast support counting using simple bitwise AND operations. However, when the database is large and sparse, handling long bit vectors is quite inefficient because the vectors contain mostly zeros.

Prefix Graph Representation

In this section, we describe the PrefixGraph representation and show in detail how this structure is constructed. The items in an itemset that contains k items (a k-itemset) are assumed to be sorted according to some total order; we use the notation x ≺ y to indicate that x precedes y in this order. A PrefixGraph consists of a set of nodes and a set of directed edges connecting pairs of nodes. Any item in the database that satisfies the minimum support threshold is represented as a node in the PrefixGraph. Each node is also associated with a projected bit vector database (Figure 4). The following definitions introduce the concepts used during the construction of the PrefixGraph.

Definition 1 (Prefix 2-Item): At a node k, an item i is called a prefix 2-item of k if i ≺ k and {i, k} is a frequent 2-itemset.

Definition 2 (Prefix Itemset): Consider an itemset I = {i1, i2, ..., i(j−1), ij, i(j+1), ..., in}. The prefix itemset of I with respect to node ij is defined as the set of items {i1, i2, ..., i(j−1)}.

Prefix Graph Construction

We now illustrate the construction of the PrefixGraph using the transaction database given in Table 2 with a minimum support threshold of 2. Consider the PrefixGraph shown in Figure 4 for this sample database. The total order of the nodes is a ≺ b ≺ d ≺ e ≺ c. The prefix 2-item for node b is a, while the prefix 2-items for node c are a and b. The nodes d, e, and c are suffix nodes of b because b precedes these nodes and {b, d}, {b, e}, and {b, c} are frequent 2-itemsets.

Transaction ID | Items | Frequent Items
1 | a, b, c, d | a, b, d, c
2 | b, d, a, e, f, g | a, b, d, e
3 | d, a, e | a, d, e
4 | i, a, c, b | a, b, c
5 | b, c, e | b, e, c
6 | d, e, h | d, e

Table 2: Sample database [24]

First, we scan the database to identify the set of frequent items and their corresponding support counts. These frequent items form the nodes of the PrefixGraph. For the database in Table 2, the frequent items are {(a:4), (b:4), (c:3), (d:4), (e:4)}, where in the notation (m:n), m is the item and n is its support count. We then order the nodes in descending order of support count. Next, we scan the database again to identify the prefix 2-items for each node. For example, the frequent 2-itemsets involving nodes a, b, d, e, and c are (ab, ad, ac, ae), (ba, bd, bc, be), (da, db, de), (ea, eb, ed), and (ca, cb), respectively.
The prefix 2-items and their corresponding support counts for these nodes are ∅, {a:3}, {a:3, b:2}, {a:2, b:2, d:3}, and {a:2, b:3}, respectively. For each node, we store its set of prefix 2-items in a header table.

In the next step of the graph construction, the transactions are stored as bits in the projected bit vector databases of the nodes. We scan the database and, for each transaction, remove the infrequent items and sort the remaining items in descending order of frequency. Let T be the resulting itemset. For each item k in T, we select node k and compare its prefix 2-items against the prefix itemset of T. If there is a match (of at least one item), the matching items are stored as bits in the projected bit vector database of node k.

Figure 4: PrefixGraph representation [24] (each node carries a header table of its prefix 2-items, suffix links to later nodes, and a projected bit vector database).

For example, consider the first transaction of the sample database. After removing the infrequent items, the remaining items are T = {a, b, d, c}. Since the transaction has four frequent items, we need to consider the nodes a, b, d, and c. Node a has no prefix 2-items, so nothing is stored. For node b, item a of the transaction matches its prefix 2-item (i.e., a), and therefore the bit <1> is stored in its projected bit vector database. For node d, items {a, b} of T match its prefix 2-items, so the bits <11> are stored in its bit vector database. Similarly, for node c, the bits for items {a, b} of the transaction are stored in its bit vector database. For the second transaction, we get T = {a, b, d, e} after removing the infrequent items. Since item a matches the prefix 2-item of node b, it is stored as a bit. Items {a, b} match the prefix 2-items of node d and are therefore stored as the bits <11> in its database. For node e, items {a, b, d} of the transaction match its prefix 2-items, so the bit vector stored at node e is <111>. When storing a transaction such as {d, e} at node e, we store the bits <001>, since only item d of the transaction matches the prefix 2-items of node e.

Algorithm 1 (PrefixGraph Construction)
1. Scan the database DB and find the frequent 1-itemsets (nodes) and their supports.
2. Sort the nodes in descending order of support.
3. Scan DB again, find the frequent 2-itemsets for each node, and create the header tables.
4. Scan DB a third time and, for each transaction:
   i. Sort the frequent items of the transaction in descending frequency order.
   ii. For each item k in the transaction, select node k, match the prefix 2-items of node k against the items of the transaction, and if there is a match, store the matching items as a bit vector in the bit vector database of node k.

Analysis of the PrefixGraph Structure

Based on the algorithm, we need three scans of the transaction database. The first two scans are used to find the frequent 1-itemsets and 2-itemsets, while the third scan is needed to construct the projected bit vector databases. If the number of items in the database is known to be small, the first two scans can be combined so that the algorithm requires only two scans of the database.
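The construction can be sketched end-to-end as follows. This is an illustrative re-implementation of Algorithm 1 over the sample database of Table 2, not the actual PGMiner code; the variable names and dictionary-based data structures are our own choices.

from collections import Counter

# Sample database of Table 2 (raw transactions).
raw = [list("abcd"), list("bdaefg"), list("dae"), list("iacb"), list("bce"), list("deh")]
MIN_SUP = 2

# Scan 1: frequent items (nodes), in descending order of support.
item_counts = Counter(i for t in raw for i in t)
nodes = [i for i, c in item_counts.most_common() if c >= MIN_SUP]
order = {item: pos for pos, item in enumerate(nodes)}

# Keep only frequent items in each transaction, sorted by the node order.
trans = [sorted((i for i in t if i in order), key=order.get) for t in raw]

# Scan 2: prefix 2-items of each node (the header tables).
pair_counts = Counter()
for t in trans:
    for a_pos, a in enumerate(t):
        for b in t[a_pos + 1:]:
            pair_counts[(a, b)] += 1
header = {k: [i for i in nodes if order[i] < order[k] and pair_counts[(i, k)] >= MIN_SUP]
          for k in nodes}

# Scan 3: projected bit vector databases.
bitdb = {k: [] for k in nodes}
for t in trans:
    for k in t:
        prefix = set(t[:t.index(k)])
        bits = [1 if i in prefix else 0 for i in header[k]]
        if any(bits):                      # store only when at least one item matches
            bitdb[k].append(bits)

print(header)   # e.g. node 'e' -> ['a', 'b', 'd']
print(bitdb)    # e.g. node 'e' stores [1, 1, 1] for transaction 2 and [0, 0, 1] for {d, e}

Running the sketch reproduces the header tables and bit vectors discussed above; for instance, node e receives <111> for the second transaction and <001> for the transaction {d, e}.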
Lemma 1: The size of the projected bit vector database of a node is bounded by the support count of the node times the number of prefix 2-items of that node.

Proof: Let m be the number of prefix 2-items of a node k. Since the number of transactions that contain item k equals the support count σ(k), the projected bit vector database of node k would occupy ⌈m × σ(k) / 8⌉ bytes if every such transaction were stored. However, projected transactions that start with item k are not stored as bit vectors at that node. Therefore, the size of the projected bit vector database at node k is ≤ ⌈m × σ(k) / 8⌉ bytes.

Lemma 1 highlights an important benefit of the PrefixGraph structure, since we use bit vector intersections during mining. According to Lemma 1, the length of each bit vector is at most equal to the support count of the node. Since the support count of an item is usually much smaller than the database size, we obtain shorter bit vectors and, accordingly, faster bit vector intersections.

Figure 5: Size of the bit vector databases at each node [24] (Chess and T40I10D100K data sets).

The size of the projected bit vector database at each node varies, as shown in Figure 5 for the Chess and T40I10D100K databases with minimum support thresholds of 30% and 0.10%, respectively. Except for the nodes in the middle, all other nodes have small projected bit vector databases, which facilitates faster mining of itemsets.

3.2.2 Rule Generation

After generating the frequent itemsets, the next step is to generate association rules from them. In principle, each frequent k-itemset can generate up to 2^k − 2 association rules. For example, rules such as (urinary → kidney, nocturia) and (urinary, nocturia → kidney) can be extracted from the frequent itemset {urinary, kidney, nocturia}. However, not all of these rules are interesting, because they may not satisfy the minimum confidence requirement. To efficiently generate high-confidence rules from a given frequent itemset, the following confidence-based pruning strategy can be used. Let Y be a frequent itemset. Any association rule generated from Y can be written in the form X → Y − X, where X is a subset of Y. It can be shown that if the rule X → Y − X does not satisfy the confidence threshold, then any rule X′ → Y − X′, where X′ is a subset of X, cannot satisfy the confidence threshold either. The Apriori rule generation algorithm uses this pruning strategy to efficiently generate high-confidence rules from a given frequent itemset. The algorithm uses a level-wise approach in which each level corresponds to the number of items in the rule consequent [34]. For example, if {abc → d} is a low-confidence rule, then all candidate rules that have d as an item in the consequent should be pruned. In other words, the rules {ab → cd}, {ac → bd}, {bc → ad}, and {c → abd} should be discarded because their confidence is guaranteed to be low.
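The rule generation step just described can be sketched as follows. This is an illustrative simplification using the earlier toy transactions: it enumerates all candidate rules X → Y − X and keeps those that meet the confidence threshold, and it omits the level-wise consequent pruning that the Apriori rule generator applies.

from itertools import combinations

transactions = [
    {"cough", "fever", "cold"},
    {"cough", "cold"},
    {"fever", "headache"},
    {"cough", "fever", "cold", "headache"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def generate_rules(frequent_itemset, min_conf):
    """Enumerate X -> (Y - X) for every proper, non-empty subset X of Y."""
    rules = []
    items = sorted(frequent_itemset)
    for r in range(1, len(items)):
        for lhs in combinations(items, r):
            lhs = frozenset(lhs)
            rhs = frozenset(items) - lhs
            conf = support_count(lhs | rhs) / support_count(lhs)
            if conf >= min_conf:
                rules.append((set(lhs), set(rhs), conf))
    return rules

# The frequent 3-itemset yields at most 2^3 - 2 = 6 candidate rules.
for lhs, rhs, conf in generate_rules({"cough", "fever", "cold"}, min_conf=0.65):
    print(lhs, "->", rhs, round(conf, 2))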
In this section we discussed how confidence and support are used to prune rules. In the next section we describe an additional technique used to further prune the rules.

3.2.3 Pruning Redundant Rules

After applying the support and confidence measures to prune the insignificant rules, there are still some redundant rules that need to be pruned. In our system, we consider a rule X to be redundant if its confidence is lower than or equal to the confidence of a related rule Y whose left-hand side is a subset of the left-hand side of rule X. That is, if X: A → B is a rule with confidence C1 and Y: A′ → B is a rule with confidence C2, where A′ is a subset of A, then rule X is pruned if C1 ≤ C2. For example, if the rule (cough, ear pain → common cold) has a confidence of 70% and the rule (cough → common cold) has a confidence of 75%, then we prune the rule (cough, ear pain → common cold), because knowing that a patient suffers from ear pain does not improve the confidence of predicting common cold among patients who complain about coughing.

3.3 Interesting Measures

Association rule mining algorithms may produce a large number of rules, so it is important to distinguish interesting rules from spurious ones. Various interestingness measures have been developed for this purpose, and they can be classified into two types: objective and subjective. An objective measure uses statistics computed from the data to determine whether a rule is interesting; examples include support, confidence, and lift. In the previous section we described how support and confidence are used to prune unimportant rules. In contrast, subjective interestingness measures require domain knowledge to determine whether a rule is interesting. In general, a rule is considered subjectively interesting if it reveals unexpected information about the data or is potentially actionable. For example, the rule cough → cold is not subjectively interesting despite its high support and confidence values, because it is obvious to most physicians. On the other hand, the rule fish oil → Raynaud's disease is interesting because it associates a substance with a disease that it may be used to treat. In this thesis we investigate the use of background knowledge from the literature to determine the interestingness of rules extracted from the medical domain.

3.4 Summary

In this chapter we described the association rule mining task, including the frequent itemset generation and rule generation processes. We also discussed the pruning approach used to eliminate redundant rules. Finally, we explained the use of objective and subjective interestingness measures for pruning spurious and uninteresting rules. In the next chapter we show how these approaches form the key components of our proposed framework.

Chapter 4 Proposed Framework

The main goal of this thesis is to discover interesting rules in large databases by incorporating background knowledge from the literature. This chapter describes our proposed methodology in the context of mining association rules from the medical domain. More specifically, we integrate the rules extracted from an EMR database with domain knowledge automatically acquired from a large repository of biomedical literature. We then show how the background knowledge helps to distinguish rules that are obvious to domain experts from those that are either spurious or probable. The remainder of this chapter is organized as follows: Section 4.1 introduces the various components of our framework, and Section 4.2 describes how background knowledge from the biomedical literature can be used to evaluate and interpret the rules subjectively.
4.1 MIR: Mining Interesting Rules from Medical Data

MIR, which stands for Mining Interesting Rules, is a web-based system we developed for discovering interesting relationships in the medical domain. MIR can be used as a recommendation tool to support clinical decision making. It may also be used for research purposes, because some of the interesting rules may reveal previously unknown relationships that warrant further investigation. The overall system was built using the C# .NET framework.

Figure 6 illustrates the overall architecture of the proposed system. The input data corresponds to an electronic medical records (EMR) database, which contains both demographic and clinical information about the patients (including their medical complaints, medications, age, and insurance information). The database records all the activities corresponding to 1,145,517 clinical visits by 427,214 patients between 2001 and 2005.

Figure 6: Overall framework (data acquisition from clinical data repositories; preprocessing; knowledge discovery via frequent pattern mining and rule extraction; post-processing with rule evaluation and interpretation against the medical literature; and applications such as clinical decision support, disease management, clinical monitoring, and public health).

The proposed framework consists of three major components: preprocessing, data mining, and post-processing. Since the original EMR data is stored in a relational database that contains 23 tables, it has to be transformed into a binary transaction format before it can be mined by current association analysis algorithms. For the EMR data set, we are interested in examining the relationships between the various symptoms and diseases of patients; this information was originally stored in a table describing the patients' medical complaints. After preprocessing, a transaction data file is created in which each transaction corresponds to a patient's visit to her healthcare provider. Each transaction contains a list of "items" that represent the symptoms and diseases of the patient.

The data mining component performs three major tasks. First, the PGMiner algorithm described in Section 3.2.1 is applied to generate frequent itemsets from the transaction database. Next, the Apriori rule generation algorithm is applied to extract high-confidence rules from the frequent itemsets. Finally, the rule pruning step described in Section 3.2.3 is used to eliminate redundant rules. Rules that survive the support, confidence, and redundancy pruning steps are subsequently evaluated and interpreted during the post-processing step using background knowledge acquired from the literature. A detailed description of the rule evaluation and interpretation steps is given in the next section.

4.2 Rule Evaluation and Interpretation

The goal of the post-processing component of our framework is to perform the following two tasks:

- Rule evaluation: the rules are evaluated using background knowledge from the literature in order to prune those that are subjectively uninteresting.
- Rule interpretation: the background knowledge is used to aid the interpretation of the extracted rules.

Note that the rule evaluation and interpretation tasks are not entirely independent.
This is because the ability to interpret the meaning of a rule also gives us an indication of whether it is interesting or potentially spurious. Therefore, we explain the steps for evaluating and interpreting the rules together in the remainder of this chapter. Our framework uses PubMed as the source of background knowledge. PubMed is a bibliographic database developed at the National Library of Medicine to provide access to citations from the biomedical literature. The procedure for querying and retrieving background knowledge from PubMed is described in Section 4.2.1. The information retrieved from PubMed is used in two contexts. First, it is used to distinguish obvious rules from unexpected ones. A rule (A → C) is considered obvious if there is strong evidence in the literature that A and C are associated with each other; otherwise, the rule is said to be unexpected. We describe our method for distinguishing obvious rules from unexpected rules in Section 4.2.2. Second, the background knowledge is used to distinguish two kinds of unexpected rules: spurious rules and probable rules. An unexpected rule (A → C) is considered probable if there is sufficient evidence in the literature to suggest that the concepts A and C are indirectly related via some other intermediate concepts; otherwise, the unexpected rule is considered spurious. We describe our approach for distinguishing probable rules from spurious rules in Section 4.2.3.

4.2.1 Querying Background Knowledge from PubMed

As previously mentioned, our proposed framework uses PubMed as the source of background knowledge. To perform rule evaluation and interpretation, the framework supports two types of queries to PubMed:

1. A support-based query, to determine the frequency of occurrence of the queried terms in the literature.
2. A retrieval query, to find all the abstracts that contain the queried terms.

A set of E-utilities is available from the National Library of Medicine (NLM) web site at http://eutils.ncbi.nlm.nih.gov to facilitate easy access to PubMed. We wrote a C# program to access PubMed using the E-utilities. Our program seeds the query with the search terms and their corresponding parameters and then submits the query to the Entrez PubMed search engine. Retrieving information from PubMed involves two steps: search and fetch. The search step is executed using the E-search utility, while the fetch step is executed using the E-fetch utility. For support-based queries, it is sufficient to use the E-search utility to determine the frequency of occurrence of the queried terms. For retrieval queries, both the E-search and E-fetch utilities must be executed.

E-search Utility

The general format for querying PubMed using the E-search utility is:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?param1=val1&...&paramn=valn

where parami is a search parameter and vali is its corresponding value. The following parameters are passed to an E-search query (a short programmatic sketch follows the list):

- The URL of the NLM web site, which is http://eutils.ncbi.nlm.nih.gov.
- The type of E-utilities operation to be executed; since this is a search, we request the E-search utility.
- The db parameter, which is the name of the database.
- The usehistory parameter; if this parameter is set to "yes", the history of the search is stored for subsequent retrievals.
- The search term, which is the queried word or phrase.
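Assembled into a URL in this way, a support-based query reduces to a single E-search call. The sketch below is a minimal Python illustration rather than the C# client actually used in MIR; the expected_support helper and the use of an AND query to count co-occurring concepts are our own simplifications of the procedure described here.

# Minimal support-based query (requires network access to the NLM E-utilities).
from urllib.request import urlopen
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def expected_support(*terms):
    """Number of PubMed records that mention all of the given terms."""
    query = " AND ".join(terms)
    params = urlencode({"db": "pubmed", "term": query, "usehistory": "y", "retmax": 0})
    with urlopen(f"{ESEARCH}?{params}") as response:
        xml = ET.parse(response)
    return int(xml.findtext("Count"))

print(expected_support("fish oil", "raynaud disease"))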
For example, to search for the term "diabetes mellitus", the following query can be used:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=Pubmed&usehistory=y&term=diabetes%20mellitus

E-fetch Utility

The fetch step is invoked to retrieve all the abstracts previously located by the search step. Each search query has its own history, with different information stored in it; the query key and web environment are the most important values used to retrieve the records associated with the search results. The format of an E-fetch query is:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?param1=val1&...&paramn=valn

The following parameters can be used in the E-fetch query:

- The URL of the NLM web site, which is http://eutils.ncbi.nlm.nih.gov.
- The type of E-utilities operation to be executed; since this is a fetch, we request the E-fetch utility.
- The db parameter, which is the name of the database.
- The retmax parameter, which is the maximum number of records to be retrieved.
- The query_key parameter, which is a numeric value that indexes the records retrieved by the E-search query.
- The WebEnv parameter, which is a string value that indexes the E-search query.
- The retmode parameter, which specifies the mode in which the records are retrieved, for example XML or ASCII text. In our case we set this parameter to xml and then parse the XML file to extract the individual abstracts.
- The rettype parameter, which specifies the type of record to be retrieved, for example an abstract or a brief summary. In our case we set rettype to abstract.

The following is an example of a fetch query:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=Pubmed&WebEnv=32165HJ4JH7980EIUR345231&query_key=1&retmode=xml&rettype=abstract&retmax=1000

Each query (E-search and E-fetch) incurs a certain cost in terms of execution time, which depends on the number of records retrieved per query. PubMed imposes an upper limit on the number of abstracts that can be fetched in a single query. Therefore, if a queried term appears in more than 1000 abstracts, we need to invoke the E-fetch query multiple times. In Chapter 5, we show that it is possible to reduce the number of queries to PubMed without losing important information.

Once the abstracts have been retrieved, we apply the following text preprocessing steps to extract the relevant background information (a code sketch follows the list):

1. Tokenization, which converts the stream of characters in an abstract into a stream of words (terms).
2. Stopword removal, which discards general terms that are not useful for the rest of the mining task.
3. Stemming, which strips the suffixes of the words in the abstract.
4. Support counting, which counts the frequency of occurrence of each word that appears in the retrieved abstracts.
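These four steps can be strung together into a small pipeline. The sketch below is an illustrative simplification, not the MIR code: the stopword list and the medical-term dictionary are hypothetical stand-ins (the real data dictionary is described next), and the Porter stemmer is taken from NLTK.

import re
from collections import Counter
from nltk.stem import PorterStemmer   # assumes NLTK is installed

stemmer = PorterStemmer()
stopwords = {"the", "of", "and", "in", "with", "was", "were", "a", "to"}   # hypothetical list
# Hypothetical data dictionary built from the EMR "items" (see below); stems are stored.
dictionary = {stemmer.stem(w) for w in ["cough", "fever", "allergy", "edema", "nocturia"]}

def preprocess(abstracts):
    """Tokenize, remove stopwords, stem, and count dictionary terms across abstracts."""
    counts = Counter()
    for text in abstracts:
        tokens = re.findall(r"[a-z]+", text.lower())           # 1. tokenization
        tokens = [t for t in tokens if t not in stopwords]      # 2. stopword removal
        stems = [stemmer.stem(t) for t in tokens]               # 3. stemming
        counts.update(s for s in stems if s in dictionary)      # 4. support counting
    return counts

sample = ["Chronic cough and fevers were reported, with allergies and leg edema."]
print(preprocess(sample))   # e.g. Counter({'cough': 1, 'fever': 1, 'allergi': 1, 'edema': 1})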
Data Dictionary

The data dictionary is a data structure we created to store the key medical terms used when processing the abstracts retrieved from PubMed. The dictionary is built from all the words describing the "items" that exist in the EMR transaction file. Its purpose is to eliminate irrelevant terms and stopwords that appear in the PubMed abstracts. We use the data dictionary when the abstracts are tokenized: each tokenized word is checked against the dictionary to determine whether it is a relevant medical term. If it is present in the dictionary, we count the frequency of the word; otherwise the term is ignored and the parser moves on to the next term.

Porter Stemmer

The Porter stemming algorithm is used to remove the morphological endings of words. Stemming helps to normalize terms that have a common stem but different endings. For example, the terms allergic, allergy, and allergies all share a common stem but have different suffixes. The stemmer is invoked after tokenizing the abstract and removing the stopwords.

4.2.2 Distinguishing Obvious Rules from Unexpected Rules

Objective measures such as support and confidence only help to remove rules that are statistically insignificant. Some of the rules that survive the support, confidence, and redundancy pruning steps may still not be interesting. For example, consider the rule cough → cold, which has relatively high support and confidence values. Such a rule is still considered uninteresting because the relationship is obvious and well known to domain experts. To eliminate such obvious rules, we need to incorporate subjective knowledge into the rule evaluation and interpretation step. Throughout this thesis, we call the non-obvious rules unexpected rules.

MIR uses background knowledge from the PubMed repository to distinguish the obvious rules from the unexpected ones. We assume that the support count of a rule in the PubMed literature provides a good estimate of the obviousness of the rule: the higher the literature support count, the more obvious the rule. To distinguish the support of a rule in the EMR data from its support in the PubMed literature, we use the term expected support throughout the thesis to refer to the support obtained from the PubMed literature. Our framework therefore classifies the subjective interestingness of a rule as follows: a rule is classified as obvious if its expected support is greater than or equal to a user-specified minimum threshold; otherwise, it is classified as unexpected. To perform this classification, we use the support-based query described in Section 4.2.1 to determine the co-occurrence frequency of the rule's concepts in PubMed abstracts. Identifying an appropriate expected support threshold is not an easy task. We investigated several candidate thresholds, such as the mean, the median, and the 80th percentile of the expected support over all the rules. We observed experimentally that the median yields the best results, so we set our threshold to the median value.
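Expressed in code, this classification step is straightforward. The sketch below is illustrative only; it reuses the hypothetical expected_support helper from the E-search sketch in Section 4.2.1 and assumes the mined rules are given as (antecedent, consequent) term tuples.

from statistics import median

def classify_rules(rules, expected_support):
    """rules: list of (antecedent_terms, consequent_terms) pairs.
    expected_support: callable returning the PubMed co-occurrence count."""
    counts = {rule: expected_support(*rule[0], *rule[1]) for rule in rules}
    threshold = median(counts.values())   # median expected support over all rules
    return {rule: ("obvious" if counts[rule] >= threshold else "unexpected")
            for rule in rules}

# Hypothetical usage with two of the rules discussed in Chapter 5:
rules = [
    (("swelling",), ("leg edema",)),
    (("obesity", "Grave's disease"), ("hypertension",)),
]
# labels = classify_rules(rules, expected_support)   # requires network access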
4.2.3 Distinguishing Probable Rules from Spurious Rules

After removing the obvious rules, the remaining rule set contains only unexpected rules, i.e., rules that are non-obvious because they involve concepts that are rarely associated together in the literature. These unexpected rules may represent two types of relationships. First, the relationship suggested by the rule may be spurious despite satisfying the support and confidence thresholds. Second, a hidden (indirect) relationship may exist between the concepts in the rule antecedent and consequent; rules with such a hidden relationship are called probable rules. We explain our methodology for distinguishing probable rules from spurious rules in the rest of this section.

Given a rule A → C, our objective is to build a graph that relates the concepts on each side of the rule (A and C). If such a graph can be established, the rule is deemed probable. The intermediate nodes in the graph represent the common terms extracted from the abstracts retrieved from PubMed when querying the concepts A and C separately. However, instead of using all the words that appear in the retrieved abstracts, we consider only the top-n highest frequency terms as intermediate nodes. This approach is similar to the literature-based discovery process described in Chapter 2. Unlike traditional literature-based discovery, however, our framework is more general because an intermediate term B does not have to be directly associated with both A and C; there may be more than one intermediate node along the path between A and C. Put another way, the interaction between A and C may involve more than one level of indirection.

This approach carries the danger of relating two seemingly unrelated concepts A and C through several levels of indirection. In this situation, the intermediate nodes of the graph may correspond to general terms that relate to most of the items in the database. We consider two approaches to avoid this problem. First, we check the data dictionary to avoid terms that are too general for the domain (e.g., disease). Second, we restrict the maximum number of indirection levels allowed by the framework: if the number of indirection levels required is greater than some threshold (say, 2), the rule is classified as spurious. We treat the task of constructing an association graph between the concepts A and C as a path discovery problem, which MIR addresses as follows.

Path Discovery Problem

Figure 7: One-level network between the concepts on the two sides of a rule.

Let G = (V, E) be a weighted graph with vertices V and edges E. Each vertex corresponds to one of the concepts in the data dictionary. An edge between two nodes P and Q in the association graph indicates that Q belongs to the top-n list of highest frequency terms associated with P. The weight of an edge (P, Q) represents the expected support between the concepts P and Q. Given a source node A ∈ V and a destination node C ∈ V, the path discovery problem aims to discover a path connecting nodes A and C (see Figure 7). Unlike traditional path discovery problems, the weights of the edges are initially unknown and must therefore be obtained by querying PubMed. Since each query incurs some cost, our goal is to construct the graph without incurring too much overhead for querying PubMed.

To do this, we start with a bidirectional search by querying the concepts on each side of the rule simultaneously (say, A and C). After retrieving the abstracts corresponding to A and C, we determine the top-n highest frequency terms associated with each. If there is at least one common term between the two lists, the task is complete and the rule is declared probable. Otherwise, if there is no common term strongly associated with both A and C, we proceed to the next level and query PubMed again. To make the process more efficient, we should not have to query every term that belongs to the top-n lists of A and C. Instead, we choose only the most promising node to expand (i.e., to be queried) at the next level of indirection. We apply a utility function to determine whether a node is a promising candidate for expansion. For example, let {B1, B2, ..., Bn} denote the list of top-n terms associated with the concept A. For each candidate node Bi, its utility is given by:

Utility(Bi) = λ × u(A, Bi) + (1 − λ) × û(Bi, C)

where u(A, Bi) is the expected support obtained from the PubMed literature, û(Bi, C) is an estimate of the expected support between Bi and C, and λ is a weighting parameter that balances the two terms. Since û(Bi, C) is unknown before querying, we estimate its value using the actual support of the itemset {Bi, C} observed in the EMR database; this support was already recorded when applying the PGMiner algorithm to generate frequent itemsets.
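A minimal sketch of this node-selection step is shown below. It is illustrative only: the weighting parameter alpha (standing in for λ above), the emr_support lookup table, and the reuse of the hypothetical expected_support helper are stand-ins for the corresponding pieces of MIR.

def utility(a, b, c, expected_support, emr_support, alpha=0.5):
    """Score candidate intermediate term b for the rule a -> c.
    expected_support(a, b): PubMed co-occurrence count of a and b.
    emr_support: dict mapping frozensets of items to their EMR support counts,
    used as a cheap estimate of the not-yet-queried PubMed support of {b, c}."""
    estimated = emr_support.get(frozenset((b, c)), 0)
    return alpha * expected_support(a, b) + (1 - alpha) * estimated

def most_promising(a, c, top_n_terms, expected_support, emr_support):
    """Pick the candidate from A's top-n list to expand at the next level."""
    return max(top_n_terms, key=lambda b: utility(a, b, c, expected_support, emr_support))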
We also consider an alternative approach for finding intermediate nodes using MeSH terms. MeSH (Medical Subject Headings) is a controlled vocabulary thesaurus provided by the National Library of Medicine. Instead of intersecting the top-n highest frequency terms directly to identify the intermediate nodes, we first map the terms in the top-n lists to their corresponding MeSH concepts. We then intersect the lists of MeSH concepts to identify whether there are any common MeSH concepts that can serve as intermediate nodes of the association graph. The MeSH-based approach is summarized in Table 3, and Figure 8 illustrates how MeSH concepts can be used to find intermediary nodes for distinguishing spurious rules from probable rules.

Table 3: Algorithm for identifying probable and spurious rules using MeSH.
Input: a set of unexpected rules {A → C}
Output: sets of probable and spurious rules
1. Execute an automated query for the antecedent and the consequent of the rule in order to retrieve the relevant abstracts.
2. For each concept ∈ {A, C}:
3.   Parse the abstracts and identify the medical terms using the data dictionary.
4.   Compute the document frequency (DF) of each term.
5.   Compute the significance of each term according to its utility function value.
6.   Sort the terms according to their frequency values.
7.   Select the top-n terms and map each term to its corresponding MeSH concept(s).
8. End for.
9. Intersect the top-n lists of MeSH concepts for the two sides of the rule.
10. If the iteration number < 2 and the intersection set is nonempty, add the rule to the probable rule set.
11. If the iteration number < 2 and the intersection set is empty, then for each side of the established network choose the term with the highest utility value and go to step 1.
12. If the iteration number = 2, add the rule to the spurious rule set.

Figure 8: An illustration of using MeSH concepts for rule evaluation.

4.2.4 Rule Evaluation and Interpretation using the UMLS Ontology

Our proposed framework also allows us to apply the UMLS ontology to evaluate and interpret the rules. UMLS encodes high-level medical concepts (called semantic types) as well as semantic relations between them. Each MeSH concept is first mapped to its corresponding semantic type(s) and to the semantic relations in which those semantic types participate.
The semantic relations provide additional background knowledge that can be used to avoid false alarms (i.e., spurious edges in the association graph) that may result from using only the MeSH concepts or the top-n terms when finding indirect relationships between concepts. The UMLS ontology imposes additional constraints that must be satisfied by the terms, ensuring that only terms semantically related to the query terms (on each side of the rule) are selected as intermediate nodes of the association graph. As a result, MeSH concepts that are unrelated to the queried terms are pruned, which reduces the number of false positives in the output. Table 4 summarizes the UMLS-based rule evaluation and interpretation procedure, and Figure 9 illustrates it. Note that all the rules classified as probable by this procedure are potentially interesting and worth further investigation by the domain expert, because they suggest a hidden relation between the two concepts in the antecedent and consequent of the rule. These hidden relations are not obvious and may suggest new knowledge that was previously unknown to the experts. With UMLS, interpreting these relations is not too difficult, because the semantic relations provide additional information that guides the experts in understanding the meaning of the association graph.

Table 4: Algorithm for identifying probable and spurious rules using MeSH and UMLS.
Input: a set of unexpected rules {A → B}
Output: sets of probable and spurious rules
1. For each rule A → B:
2.   Get the corresponding MeSH concepts and semantic types for A and B.
3.   For each concept d ∈ {A, B}:
4.     Query PubMed and retrieve the abstracts corresponding to the concept d.
5.     Compute the document frequency of each term that appears in the abstracts.
6.     Compute the utility function of the terms (see Section 4.2.3).
7.     Sort the terms according to their frequency values.
8.     Map the terms to their corresponding MeSH concepts and semantic types.
9.     Check for a semantic relation between the semantic type of each MeSH term and the semantic type of the concept d; if there is at least one semantic relation, add the MeSH term to the top-n list.
10.  End for.
11. Intersect the top-n lists of MeSH concepts for the two sides of the rule.
12. If the iteration number < 2 and the intersection set is nonempty, add the rule to the probable rule set.
13. If the iteration number < 2 and the intersection set is empty, then for each side of the established network choose the term with the highest utility value and go to step 1.
14. If the iteration number = 2, add the rule to the spurious rule set.

Figure 9: An illustration of using UMLS for rule evaluation and interpretation.

4.3 Summary

This chapter presented the overall framework of the MIR system. We described the various components of the system and the methods used for incorporating background knowledge from PubMed and UMLS. In the next chapter we present the experiments conducted to evaluate the effectiveness of the methods described in this chapter.

Chapter 5 Experiments

This chapter describes the experiments we performed to evaluate the various components of our system. We use the MQIC EMR data, deidentified by the GE Healthcare data consortium, for our experiments. The EMR data contains 1,145,517 clinical visit records of 427,214 patients during the period 2001-2005 and is stored in a relational database that contains 23 tables, such as patients, complaints, medications, and insurance. The background knowledge for the system is acquired from two sources: PubMed, a repository of biomedical abstracts, and UMLS (the Unified Medical Language System).
Our experiments were conducted to achieve the following objectives:

- To show the efficiency (in terms of runtime and storage requirements) of the PGMiner frequent itemset generation algorithm.
- To determine the reliability of estimating support counts from the PubMed literature.
- To determine the effectiveness of distinguishing obvious rules from unexpected rules using support counts from PubMed.
- To determine the effectiveness of distinguishing probable rules from spurious rules using PubMed and the MeSH ontology.
- To interpret the rules using UMLS.

5.1 Experiment 1: Frequent Itemsets Generation

In this experiment, we compare the efficiency of the PGMiner algorithm against other existing frequent itemset generation algorithms (Apriori and FP-Growth). The input to the algorithms is a file that contains 3,115,630 transactions and 5,819 items. Each transaction corresponds to a patient's visit to the healthcare provider, and each item corresponds to a symptom or disease. As described in Section 3.1, the definition of a frequent itemset depends on the support threshold: the higher the support threshold, the fewer frequent itemsets the algorithm produces. Figure 10 and Table 5 show the effect of applying different support thresholds to the transaction data used in our experiments.

Figure 10: Number of frequent itemsets versus minimum support.

Min support % | 0.0001 | 0.00015 | 0.0002 | 0.00025 | 0.0003 | 0.00035
Itemsets | 10728 | 8486 | 6302 | 5700 | 4886 | 4585

Table 5: Number of frequent itemsets generated at different support thresholds.

For instance, when the support threshold is 0.0001% the number of frequent itemsets generated is 10,728, while at 0.00035% it is 4,585. As described in Section 3.2.1, the Apriori frequent itemset generation algorithm uses support-based pruning to efficiently discover frequent itemsets. Despite its efficiency in pruning the exponential search space, the algorithm is still quite slow because it needs to make multiple passes over the database. Figure 11 and Table 6 show a runtime comparison between PGMiner and Apriori.

Figure 11: Runtime comparison between PGMiner and Apriori.

Min support % | 0.0001 | 0.00015 | 0.0002 | 0.00025 | 0.0003 | 0.00035
Apriori (sec) | 199.02 | 145.00 | 90.16 | 80.02 | 63.78 | 62.41
PGMiner (sec) | 3.39 | 3.34 | 3.20 | 3.22 | 3.20 | 3.13

Table 6: Runtime comparison between Apriori and PGMiner.

Notice that the execution time of PGMiner is extremely fast compared to that of Apriori. For instance, when the support threshold was set to 0.0002%, the runtime for Apriori was 90.16 seconds while PGMiner finished in 3.20 seconds, an improvement of roughly 30-fold. Furthermore, decreasing the support threshold hardly affects the execution time of PGMiner, because the mining step is performed almost instantaneously over this range of support thresholds. In contrast, the execution time of the Apriori algorithm grows rapidly as the support threshold decreases. For example, at a support threshold of 0.00035%, Apriori needed 62.41 seconds to generate the itemsets, compared to only 3.13 seconds for PGMiner to generate the same frequent itemsets.
When the support threshold is further reduced to 0.0001%, the runtime of Apriori increases to 199.02 seconds, while the runtime of PGMiner increases by only 0.26 seconds. The bottleneck of Apriori is the number of passes it makes over the transaction database. To circumvent this problem, the FP-Growth algorithm was proposed by Pei et al. [11] to reduce the I/O overhead. FP-Growth constructs a condensed representation of the transaction data in memory so that subsequent mining steps can be performed efficiently without going to disk multiple times. The tradeoff is that FP-Growth consumes a large amount of memory to store the data and thus may not scale when the database becomes very large relative to the size of main memory. PGMiner also constructs an internal representation of the transaction database, called the PrefixGraph. Figure 12 compares the memory requirements of PGMiner and FP-Growth; it is clear that PGMiner is more space efficient than FP-Growth.

Figure 12: Comparison of the memory requirements of PGMiner and FP-Growth.

For example, when the support threshold is set to 0.0002%, the memory consumed by PGMiner is about 60,000 KB, whereas FP-Growth uses about 90,000 KB. Furthermore, in our experience, the runtime of PGMiner is lower than that of FP-Growth.

After frequent itemset generation, we applied the Apriori rule generation algorithm to generate the association rules. As noted in Section 3.2.2, the confidence metric is used to prune spurious rules: rules that do not pass the minimum confidence requirement are pruned. The number of extracted rules initially depends on the support threshold, because the support threshold determines the number of frequent itemsets produced. We conducted an experiment to show the effect of varying the support threshold on the number of extracted rules, with the confidence threshold fixed at 65%. Figure 13 shows the increasing number of rules generated as the support threshold is decreased.

Figure 13: Number of rules versus minimum support.

In Figure 14 we show the effect of changing the confidence threshold on the number of rules generated; for this experiment, the support threshold is fixed at 0.00015%. Notice that when the confidence threshold is increased, the number of rules decreases quite substantially.

Figure 14: Number of rules versus minimum confidence.

Even so, the resulting rule set still contains some redundant rules. The pruning approach described in Section 3.2.3 is then applied to eliminate redundant rules whose confidence is less than or equal to that of their corresponding parent rules.

5.2 Experiment 2: Estimating the Support from PubMed

PubMed is a huge repository of biomedical literature. Information is accessible through search queries using terms such as title words, author names, phrases, and journal names. As described in Section 4.2.2, one of the key steps in evaluating the subjective interestingness of a rule (A → C) generated from the EMR database is to identify the top-n terms that appear frequently with both A and C.
These high-frequency terms may serve as intermediary concepts to help explain the relationship between A and C. However, in order to identify the top-n terms for A (or C), we need to query PubMed and retrieve all the abstracts that contain the word A (or C). In some cases, the number of abstracts associated with a queried term may exceed 1000, which is the maximum number of abstracts retrieved per query from PubMed. To retrieve more abstracts, say 4000, we would need to query PubMed multiple times to fetch abstracts 1-1000, 1001-2000, 2001-3000, and so on, which makes the querying process highly inefficient. To improve efficiency, we investigate the possibility of determining the top-n terms associated with a given concept from a sample of the abstracts. The challenge is to determine an appropriate sample size so that the top-n terms determined from the sample match the top-n terms determined from all the relevant abstracts.

We conducted the following experiment to determine the appropriate sample size. We queried the term Dysplasia and found 39,009 occurrences of the term in PubMed. Our goal is to find the top-15 terms that appear frequently with the term Dysplasia while varying the sample size over 1000, 5000, 10,000, 20,000, and 39,009 abstracts (the latter using all of the abstracts). Figure 15 shows the results of this experiment.

Figure 15: Document frequency of terms in PubMed for different sample sizes.

The experimental results show that the ordering of the top-15 words is consistent irrespective of the sample size. This result suggests that it is possible to obtain a good estimate of the top-15 terms that appear frequently with a given concept from a sample of 1000 abstracts alone (i.e., in a single query).

5.3 Experiment 3: Distinguishing Obvious Rules from Unexpected Rules

In this experiment we show the role PubMed plays in filtering obvious rules from unexpected ones. We first extracted the rules with high support (more than 0.00015%) and high confidence (more than 65%) from the EMR data using PGMiner and the Apriori rule generation algorithm; 113 rules passed the minimum support and minimum confidence thresholds. For each association rule A → C in the rule set, we query PubMed to find the number of abstracts that contain both concepts A and C. The frequency of {A, C} represents the expected support of the rule based on background knowledge. We argue that if the expected support of a rule is high, the extracted rule should be obvious, because the relationship between A and C is prevalent in the literature. On the other hand, if the expected support of the rule is low (less than a specified threshold), the rule is considered unexpected because A and C are seldom observed together in the literature. In our experiment, we set the expected support threshold to be the median of the expected support over all the extracted rules. Since the median value was 13, any rule with an expected support greater than or equal to 13 is considered obvious, while the rest are declared unexpected. To validate these results, we consulted a physician from the Michigan State University Health Center and asked him to classify the 113 rules into two groups (obvious versus unexpected). Table 7 shows the confusion matrix of the result. Notice that our framework correctly identified 48 of the 57 rules described as obvious by the physician.
 | Predicted obvious | Predicted unexpected
Actual obvious | 48 | 9
Actual unexpected | 19 | 37

Table 7: Confusion matrix for the obvious versus unexpected classification.

The overall accuracy of our framework is 75.2%. Notice that 19 rules were declared obvious by our framework but marked as unexpected by the physician. This is because the expected support computed from the literature may not be indicative of a causal relation between two medical concepts. For example, the rule osteomalacia → osteomyelitis was classified as unexpected by the physician because there is no known relation between the two diseases. However, since both are diseases of the bone (osteomyelitis is an acute bone infection, while osteomalacia is a softening of the bones caused by vitamin D deficiency), they often appear together in the literature as examples of possible diagnoses for a bone-related medical condition. In addition, 9 of the rules declared unexpected by our framework were marked as obvious by the physician. This is because some relationships are so obvious that they are rarely mentioned together in the literature (e.g., urinary frequency, slowing → nocturia).

The following are some examples of rules that were classified correctly by our framework. The rule (swelling → leg edema) is an obvious association because its support from the literature is 1506, which is higher than the literature-based support threshold of 13. Other obvious rules that were classified correctly include (obesity, hyperglycemia → hypercholesterolemia) and (hematuria, urinary incontinence → nocturia). An example of an unexpected association identified by both our method and the physician is (obesity, Grave's disease → hypertension), which is classified as unexpected since its frequency in PubMed is 2, less than the expected support threshold of 13. Applying the frequencies from the PubMed literature alone does not determine whether an unexpected association is spurious or interesting; in the next section we describe how our system addresses this issue.

5.4 Experiment 4: Distinguishing Probable Rules from Spurious Rules

The previous section described how the expected support from PubMed is used to distinguish between obvious and unexpected rules. A rule is declared unexpected if its expected support is low, because there is a lack of evidence in the literature suggesting that the concepts are directly related. Having a low expected support, however, does not necessarily mean that the rule is spurious: a hidden (indirect) relation may still exist between the given concepts via some intermediary concepts. We consider rules that are indirectly associated via such hidden concepts to be probable rules. Unlike the closed literature-based discovery approach described in Section 2.2, we allow the concepts to be associated through more than one level of indirection. For example, the relation between A and C may involve two levels of indirection:

A ↔ B1 ↔ D ↔ B2 ↔ C

In general, the more levels of indirection allowed by the framework, the more likely A and C are to become indirectly associated. However, we observed that if the number of indirection levels is too large, the intermediate concepts become overly general, thereby relating two concepts that might not have any real association. To avoid this problem, we set the maximum level of indirection to 2; that is, any unexpected rule that requires more than 2 levels of indirection is classified as spurious. Otherwise, the unexpected rule is classified as probable.
Note that the intermediate concepts can be extracted based either on the top-15 words that appear frequently with a given concept or on the MeSH terms mapped from those top-15 words (see the discussion in Section 4.2.3). To evaluate these results, we again consulted the physician from the Michigan State University Health Center and asked him to categorize the rules he had previously declared unexpected into two groups: spurious versus probable. Table 8 shows the confusion matrix of the results using the top-15 terms.

 | Predicted obvious | Predicted probable | Predicted spurious
Actual obvious | 48 | 9 | 0
Actual probable | 11 | 7 | 2
Actual spurious | 8 | 23 | 5

Table 8: Confusion matrix using top-15 terms.

Our framework managed to correctly identify 7 of the 9 probable rules classified by the physician. On the other hand, we correctly classified only 5 of the 28 spurious rules. This poor result can be explained as follows. First, our framework is too conservative in classifying a rule as spurious, in order to ensure that it does not miss any interesting rules. This problem can be mitigated by using a stricter criterion for finding intermediary concepts (e.g., using the top-5 terms instead of the top-15). Second, the concepts implied by the rules may be indirectly associated via general medical terms such as cell, tissue, or syndrome. To avoid this problem, we need to remove these general terms from the data dictionary used when processing the PubMed abstracts. When evaluating the unexpected rules using MeSH concepts, we obtained similar results. Table 9 shows the confusion matrix of these classification results.

 | Predicted obvious | Predicted probable | Predicted spurious
Actual obvious | 48 | 9 | 0
Actual probable | 11 | 8 | 1
Actual spurious | 8 | 23 | 5

Table 9: Confusion matrix using MeSH concepts.

The following are some examples of unexpected rules that were classified correctly by our framework. The rule (benign prostatic hypertrophy, Peyronie's disease → nocturia) was correctly identified as spurious, because we could not find any intermediate term relating to both sides of the rule at either the first or the second level of indirection. The rule (urinary frequency, prostate nodule → benign prostatic hypertrophy), whose expected support is 0, is an example of a probable rule, since we could locate the intermediary term (tumor).

Finally, we used a well-known example from literature-based discovery to test the effectiveness of our system; our objective is to verify that MIR is capable of finding indirect relations that are well known in the literature. As mentioned in Section 2.2.1, Swanson found a relationship between Raynaud's disease and fish oil through blood viscosity and platelet aggregation. As shown here, we were able to reproduce this result by using MeSH concepts as intermediate terms: we found blood viscosity and platelet aggregation among the top-15 MeSH terms for both Raynaud's disease and fish oil. We reached this result by querying both sides of the rule and computing the frequencies of all the related medical terms that appear in the abstracts (e.g., terms such as fat, oil, lipid, and blood for fish oil, and terms such as cold, skin, blood, and temperature for Raynaud's disease). Table 10 shows the complete list of top-15 terms extracted from the abstracts associated with each side of the rule.
Table 10 shows the complete list of top-15 terms extracted from the abstracts associated with each side of the rule.

Fish Oil            DF      Raynaud's Disease    DF
acid                566     syndrome             232
oil                 388     sclerosis            196
diet                284     vascular             165
lipid               209     cold                 141
cells               160     skin                 138
plasma              145     blood                133
blood               128     scleroderma          133
fat                 127     tissue               124
protein             127     hand                  98
inflammatory        114     antibodies            93
cardiovascular      101     crest                 84
metabolism           91     finger                80
tissue               90     flow                  78
heart                90     temperature           68
serum                86     Doppler               68

Table 10: List of top-15 terms (with document frequency, DF) for Fish Oil → Raynaud's Disease

We then sorted each list by frequency and mapped the top-15 terms in each list to their corresponding MeSH concepts. Finally, we intersected the two sets of MeSH concepts and found blood viscosity and platelet aggregation to be the intermediate MeSH concepts between Raynaud's disease and fish oil.

5.5 Experiment 5: Rule Interpretation using UMLS

In this experiment, we continue to use the Swanson example described in Experiment 4 to explain how the structured knowledge encoded in UMLS can be used for rule interpretation. Using UMLS, we first extract the semantic types of the concepts on both sides of the rule. We then query PubMed to find the candidate intermediate terms. Next, we map each intermediate term to its MeSH concept and corresponding semantic types. Finally, we verify that a semantic relation exists in UMLS between the semantic type of the intermediate MeSH concept and the semantic types of the concepts on each side of the rule. Once we capture at least one semantic relation, we add the MeSH concept to the top-15 list of intermediate terms.

For example, the word (blood) was mapped to the MeSH concept (blood viscosity), which has the semantic type Physiologic Function. The semantic type Physiologic Function is related to fish oil (which has the semantic type Biologically Active Substance) through the semantic relations disrupts, complicates, and affects. As a result, blood viscosity is added to the top-15 list for fish oil. Similarly, for Raynaud's disease (which has the semantic type Disease or Syndrome), we found the semantic relations process-of, manifestation-of, affects, and result-of between its semantic type and the semantic type of blood viscosity (Physiologic Function). As a result, blood viscosity is also added to the top-15 list for Raynaud's disease. After intersecting the top-15 lists, we found a common intermediate MeSH concept, blood viscosity.

The interpretation of the result is illustrated in Figure 16. It shows the relation between the antecedent of the rule (fish oil) and the intermediate term blood viscosity through the relation type (affects), while the consequent of the rule (Raynaud's disease) is related to blood viscosity by the association (causes). This background knowledge from UMLS simplified the task of understanding the rule.

[Figure 16: An illustrative example of UMLS semantic relations linking fish oil (Biologically Active Substance) and Raynaud's disease (Disease or Syndrome) to the intermediate concept blood viscosity (Physiologic Function).]
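The semantic-type check used in this experiment can be summarized in a few lines of Python. The sketch below is illustrative only: the dictionaries semantic_types and semantic_relations stand in for lookups against the UMLS Metathesaurus and Semantic Network, and the toy entries mirror the fish oil / Raynaud's disease example above.

def accept_intermediate(mesh_concept, rule_concept, semantic_types, semantic_relations):
    """Keep an intermediate MeSH concept only if at least one UMLS semantic relation
    links its semantic type to the semantic type of the rule concept."""
    for inter_type in semantic_types[mesh_concept]:
        for side_type in semantic_types[rule_concept]:
            if semantic_relations.get((inter_type, side_type)):
                return True
    return False

semantic_types = {
    "blood viscosity": ["Physiologic Function"],
    "fish oil": ["Biologically Active Substance"],
    "raynaud's disease": ["Disease or Syndrome"],
}
semantic_relations = {
    ("Physiologic Function", "Biologically Active Substance"):
        ["disrupts", "complicates", "affects"],
    ("Physiologic Function", "Disease or Syndrome"):
        ["process-of", "manifestation-of", "affects", "result-of"],
}

print(accept_intermediate("blood viscosity", "fish oil", semantic_types, semantic_relations))            # True
print(accept_intermediate("blood viscosity", "raynaud's disease", semantic_types, semantic_relations))   # True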
5.6 Summary

In this chapter we performed several experiments to evaluate the various components of our system. First, we showed that PGMiner is more efficient than other frequent itemset generation algorithms. Second, we showed that support counts can be estimated from the PubMed literature. We then demonstrated the effectiveness of our algorithm in distinguishing obvious rules from unexpected ones. Moreover, within the set of unexpected rules, we were able to separate the interesting (probable) rules from the spurious ones. Finally, we used the background knowledge in UMLS to evaluate and interpret the interesting rules.

Chapter 6
Conclusion and Future Work

The main objective of this thesis is to investigate the problem of mining interesting association rules from large databases in the presence of background knowledge. Towards this end, we developed a framework called MIR for finding interesting association rules in the medical domain by incorporating background knowledge from PubMed and UMLS. We showed that the structured (UMLS) and unstructured (PubMed) background knowledge helps to achieve a two-fold goal: (1) to distinguish obvious rules from spurious or probable ones, and (2) to aid in the interpretation of the rules. We evaluated the effectiveness of our rule evaluation approaches by comparing the predictions made by the framework against the actual rule classification provided by a medical expert from the Michigan State University Health Care Center.

First, our experimental results suggest that, using background knowledge, our proposed framework can effectively distinguish obvious rules from unexpected rules. Obvious rules correspond to associations that are highly supported in the literature, while unexpected associations correspond to relationships that are observed frequently in the data but are weakly supported in the literature. Among the unexpected rules, some may be spurious, while others may be interesting because hidden concepts are indirectly associated with their terms. We also showed that our proposed framework is capable of reproducing well-known results from literature-based discovery.

We have also investigated the advantages of using structured background knowledge (from UMLS). Our results indicate that the structured knowledge is not very helpful for our data set in terms of pruning insignificant rules. This is because the predefined semantic types in UMLS are too general (e.g., Physiologic Function, Disease or Syndrome, and Biologically Active Substance). Since all the items in the data are either diseases or symptoms, a semantic relation almost always exists between most of the low-level concepts investigated in this study.

We also conducted experiments to assess the efficiency of our framework in terms of generating frequent itemsets and reducing the number of queries needed for building the association graph. First, we showed that our proposed frequent itemset generation algorithm (PGMiner) is more efficient, in terms of memory consumption and runtime, than other existing frequent itemset generation algorithms (Apriori and FP-growth). To improve the efficiency of querying PubMed, we showed that it is possible to obtain a good estimate of the top-n terms by retrieving only 1000 abstracts in a single query instead of making multiple queries to retrieve all the abstracts for a given concept.

Our experience from this research has highlighted several issues that, if addressed, would enhance our work. First, we plan to use more effective methods (e.g., natural language processing techniques) to identify the important terms and phrases associated with a given abstract. To do this, we may employ tools such as the NP-Parser package, which takes text as input and chunks it into words or simple phrases. Such phrases, when cleaned up, can be matched against entries from controlled vocabularies such as the Medical Subject Headings (MeSH) vocabulary.
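As a rough illustration of the phrase-to-vocabulary matching step (a simple dictionary-based stand-in, not the NP-Parser package itself), candidate word n-grams from an abstract could be matched against a set of cleaned MeSH entry terms as follows; mesh_vocabulary here is a small hypothetical example set.

import re

def candidate_phrases(text, max_len=3):
    """Generate word n-grams (up to max_len words) from lower-cased abstract text."""
    words = re.findall(r"[a-z]+", text.lower())
    for i in range(len(words)):
        for n in range(1, max_len + 1):
            if i + n <= len(words):
                yield " ".join(words[i:i + n])

def match_mesh(text, mesh_vocabulary):
    """Return the candidate phrases that appear in the MeSH vocabulary."""
    return {phrase for phrase in candidate_phrases(text) if phrase in mesh_vocabulary}

mesh_vocabulary = {"blood viscosity", "platelet aggregation", "fish oils"}
print(match_mesh("Dietary fish oils reduce blood viscosity and platelet aggregation.", mesh_vocabulary))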
Another issue that we would like to improve is the method for integrating the supporting frequencies from PubMed with the observed support in the data. In spite of our early success in using the approximated frequencies, we would like to find a more reliable probabilistic approach for combining evidence from the data with evidence from the literature.

Furthermore, our experimental results suggest that our framework is biased toward classifying a rule as probable rather than spurious. We believe this bias is introduced when choosing the top-n terms, and we plan to investigate more principled ways of selecting them. For example, we may apply the lift measure instead of support (lift(X, Y) = P(X, Y) / (P(X) P(Y))), which measures how strongly the query search term and a candidate term are associated, so that only the terms that are conceptually close to the query search term are included in the top-n list. Another possibility is to decrease the size of the top-n list and observe how this affects the results.

As we mentioned earlier, using UMLS as structured background knowledge did not help with the pruning criteria because of the generality of the semantic types. We would like to investigate other ways of utilizing the predefined semantic network in UMLS to aid in the pruning process and thus separate useful rules from spurious ones.

Finally, our current framework used association rule mining to combine the background knowledge in the literature with the data from the medical domain. As part of our future work, we would like to extend the framework to other data mining tasks such as sequential pattern mining, clustering, classification, and anomaly detection.

REFERENCES

[1] Agarwal, R. C. (1998). A tree projection algorithm for generation of large itemsets for association rules. IBM T. J. Watson Research Center.

[2] Agarwal, R. C., Aggarwal, C. C., & Prasad, V. V. V. (2001). A tree projection algorithm for generation of frequent item sets. Journal of Parallel and Distributed Computing, 61(3), 350-371.

[3] Agrawal, R., Imieliński, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. ACM SIGMOD Record, 22(2), 207-216.

[4] Agrawal, R., & Shafer, J. C. (1996). Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6), 962-969.

[5] Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proc. 20th Int. Conf. on Very Large Data Bases (VLDB), 487-499.

[6] Burdick, D., Calimlim, M., & Gehrke, J. (2001). MAFIA: A maximal frequent itemset algorithm for transactional databases. Proceedings of the 17th International Conference on Data Engineering, 443-452.

[7] Cimiano, P., & Staab, S. (2004). Learning by googling. ACM SIGKDD Explorations Newsletter, 6(2), 24-33.

[8] Frawley, W. J., Piatetsky-Shapiro, G., & Matheus, C. J. (1992). Knowledge discovery in databases: An overview. AI Magazine, 13(3), 213-228.

[9] Ganiz, M. C., Pottenger, W. M., & Janneck, C. D. Recent advances in literature based discovery.

[10] Grahne, G., & Zhu, J. (2003). Efficiently using prefix-trees in mining frequent itemsets. Proceedings of the ICDM Workshop on Frequent Itemset Mining Implementations.

[11] Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. ACM SIGMOD Record, 29(2), 1-12.

[12] Hand, D. J. (1998). Data mining: Statistics and more?
The American Statistician, 52(2).

[13] Hotho, A., Staab, S., & Stumme, G. (2003). WordNet improves text document clustering. Proc. of the SIGIR 2003 Semantic Web Workshop.

[14] Hristovski, D., Stare, J., Peterlin, B., & Dzeroski, S. (2001). Supporting discovery in medicine by association rule mining in Medline and UMLS. Medinfo, 10(2), 1344-1348.

[15] Hristovski, D., Peterlin, B., Mitchell, J. A., & Humphrey, S. M. (2003). Improving literature based discovery support by genetic knowledge integration. Studies in Health Technology and Informatics, 95, 68-73.

[16] Hu, X. (2005). Mining novel connections from large online digital library using biomedical ontologies. Library Management, 26(4/5), 261-270.

[17] Hu, X., Yoo, I., Song, M., Zhang, Y., & Song, I. Y. (2005). Mining undiscovered public knowledge from complementary and non-interactive biomedical literature through semantic pruning. Proceedings of the 14th ACM International Conference on Information and Knowledge Management, 249-250.

[18] Huang, W. (2005). Mining scientific literature to predict new relationships. Intelligent Data Analysis, 9(2), 219-234.

[19] Liu, B., Hsu, W., Mun, L. F., & Lee, H. Y. (1999). Finding interesting patterns using user expectations. IEEE Transactions on Knowledge and Data Engineering, 11(6), 817-832.

[20] Liu, J., Paulsen, S., Wang, W., Nobel, A., & Prins, J. (2005). Mining approximate frequent itemsets from noisy data. Proceedings of the Fifth IEEE International Conference on Data Mining, 721-724.

[21] Liu, J., Wang, W., & Yang, J. (2004). A framework for ontology-driven subspace clustering. Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 623-628.

[22] Liu, J., Wang, W., & Yang, J. (2004). Gene ontology friendly biclustering of expression profiles. Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference (CSB 2004), 436-447.

[23] Lucchese, C., Orlando, S., & Perego, R. (2006). Fast and memory efficient mining of frequent closed itemsets. IEEE Transactions on Knowledge and Data Engineering, 18(1), 21-36.

[24] Moonesinghe, K., Fodeh, S., & Tan, P. (2006). Frequent closed itemset mining using prefix graphs with efficient flow-based pruning strategy.

[25] Mukherjea, S., & Sahay, S. Discovering biomedical relations utilizing the World Wide Web.

[26] Nagypal, G. Improving information retrieval effectiveness by using domain knowledge stored in ontologies. Lecture Notes in Computer Science, 780-789.

[27] Orlando, S., Lucchese, C., Palmerini, P., Perego, R., & Silvestri, F. (2003). kDCI: A multi-strategy algorithm for mining frequent sets. Proc. of the Workshop on Frequent Itemset Mining Implementations, in conjunction with ICDM 2003.

[28] Pei, J., Han, J., & Mao, R. (2000). CLOSET: An efficient algorithm for mining frequent closed itemsets. Proc. 2000 ACM-SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (DMKD'00), 11-20.

[29] Shannon, C. E., & Weaver, W. (1963). Mathematical theory of communication. University of Illinois Press.

[30] Srinivasan, P. (2004). Text mining: Generating hypotheses from MEDLINE. Journal of the American Society for Information Science and Technology, 55(5), 396-413.

[31] Srinivasan, P., & Sehgal, A. K. Mining MEDLINE for similar genes and similar drugs. The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA.

[32] Swanson, D., Smalheiser, N., & Torvik, V.
Ranking indirect connections in literature-based discovery: The role of Medical Subject Headings (MeSH). Journal of the American Society for Information Science and Technology.

[33] Swanson, D. R. (1990). Medical literature as a potential source of new knowledge. Bulletin of the Medical Library Association, 78(1), 29-37.

[34] Swanson, D. R. (1986). Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 30(1), 7-18.

[35] Taylor, M., Stoffel, K., & Hendler, J. (1997). Ontology-based induction of high level classification rules. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery.

[36] The Knowledge Management Network. Website (2006). www.brint.com/km/.

[37] Tiffin, N., Kelso, J. F., Powell, A. R., Pan, H., Bajic, V. B., & Hide, W. A. (2005). Integration of text- and data-mining using ontologies successfully selects disease gene candidates. Nucleic Acids Research, 33(5), 1544-1552.

[38] Van der Eijk, C. C., van Mulligen, E. M., Kors, J. A., Mons, B., & van den Berg, J. (2004). Constructing an associative concept space for literature-based discovery. Journal of the American Society for Information Science and Technology, 55(5), 436-444.

[39] Vos, R. (2001). Using concepts in literature-based discovery: Simulating Swanson's Raynaud-fish oil and migraine-magnesium discoveries. Journal of the American Society for Information Science and Technology, 52(7), 548-557.

[40] Wang, H., Azuaje, F., & Bodenreider, O. (2005). An ontology-driven clustering method for supporting gene expression analysis. Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems, 389-394.

[41] Wikipedia. Website (2006). www.wikipedia.org.

[42] Wren, J. D. (2004). Extending the mutual information measure to rank inferred literature relationships. BMC Bioinformatics, 5, 145.

[43] Xiaohua, H., Xiaodan, Z., Illhoi, Y., & Yanqing, Z. (2006). A semantic approach for mining hidden links from complementary and non-interactive biomedical literature. Proceedings of the Sixth SIAM International Conference on Data Mining.

[44] Chen, X., Zhou, X., Scherl, R., & Geller, J. Using an interest ontology for improved support in rule mining.

[45] Zaki, M. J., & Hsiao, C. J. (2002). CHARM: An efficient algorithm for closed itemset mining. Proceedings of the Second SIAM International Conference on Data Mining.

[46] Zhang, J., Silvescu, A., & Honavar, V. (2002). Ontology-driven induction of decision trees at multiple levels of abstraction. Proceedings of the Symposium on Abstraction, Reformulation, and Approximation, 316-323.