LINQS

STATISTICAL RELATIONAL LEARNING GROUP @ UMD

Link-based Classification

Traditional machine learning classification algorithms aim to label entities on the basis of their attribute values. Many real-world datasets, however, contain interlinked entities and exhibit correlations among labels of the interlinked entities. Link-based classification aims to improve classification accuracy by exploiting such correlations in the link structure besides utilizing the attribute values of each entity.

Our research in this area explores various aspects of this problem. We have proposed the use of simple aggregation functions such as sum, mode etc. that summarize the distribution of class labels in the Markov blanket of each entity and learn classifiers using these features and features based on attribute values to improve the classification accuracy. In addition, we have devised simple yet effective iterative approximate inference techniques that, in many cases, perform as well as more sophisticated approximate inference methods such as loopy belief propagation and relaxation labeling. We have also explored the utility of learning with unlabeled entities. We have tested the efficacy of the proposed techniques on various real-world datasets such scientific publication datasets and social network datasets.

In addition to link-based classification where each misclassification is considered to be equally costly, we have also developed techniques to handle unequal misclassification costs. Consider the example of a bank loan officer. The loan officer's task is to classify each application into one of 'grant' or 'deny'. If the officer grants a loan that should have been denied then the bank stands to lose the principal amount of the loan in question (assuming the applicant does not pay any of the loan back). On the other hand, if the officer denies a loan that should have been granted, not only does the bank lose the interest that could have been earned on the loan but also the amount that the applicant might have in her/his account with the bank (assuming the disgruntled applicant decides to take her/his business elsewhere). Such unequal misclassification costs occur in many real-world scenarios. Now consider the case when the bank allows joint accounts. Consider joint account J whose account holders A1 and A2 apply for two separate loans. The amount in J will be lost if either of A1's or A2's application gets denied. This is an example of a misclassification cost that is related to groups of related misclassifications that may arise in scenarios where entities to be labeled are inter-connected via relations. As part of our research on link-based classifiers, we extended classifiers based on Markov networks to handle such varied misclassification costs and demonstrated their efficacy on synthetic and real-world datasets.


Publications


Datasets

These datasets are free to use for research and teaching purposes. Please cite the paper here (bibtex given) when you use these datasets.
  • Document Classification Datasets:

    • CiteSeer: The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words. The README file in the dataset provides more details. Click here to download the tarball containing the dataset.
    • Cora: The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words. The README file in the dataset provides more details. Click here to download the tarball containing the dataset.
    • PubMed Diabetes: The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Click here to download the tarball containing the dataset.
    • WebKB: The WebKB dataset consists of 877 scientific publications classified into one of five classes. The citation network consists of 1608 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1703 unique words. The README file in the dataset provides more details. Click here to download the tarball containing the dataset.
    • Wikipedia: The Wikipedia dataset consists of Wikipedia articles that appeared in the featured list in the period Oct. 7-21, 2009. After stemming and stop-word removal, it represents the text of each document as a tf/idf-weighted feature vector. Each document belongs to one of 19 distinct categories, which were obtained by using the category under which each featured article was listed. The data contains the relations Link which establishes a hyperlink between two documents and Talk which states that the user edited the "Talk" page of the given document. Click here to download the datasets.
  • Social Network Datasets:

    • Terrorists: This dataset contains information about terrorists and their relationships. Unlike the previous datasets, this dataset was designed for classification experiments aimed at classifying the relationships among terrorists. The dataset contains 851 relationships, each described by a 0/1-valued vector of attributes where each entry indicates the absence/presence of a feature. There are a total of 1224 distinct features. Each relationship can be assigned one or more labels out of a maximum of four labels making this dataset suitable for multi-label classification tasks. The README file provides more details. Click here to download the tarball containing the dataset.
    • Terrorist Attacks: This dataset consists of 1293 terrorist attacks each assigned one of 6 labels indicating the type of the attack. Each attack is described by a 0/1-valued vector of attributes whose entries indicate the absence/presence of a feature. There are a total of 106 distinct features. The files in the dataset can be used to create two distinct graphs. The README file in the dataset provides more details. Click here to download the tarball containing the dataset.