# New Graph Classification Data Sets

This is more informal writing about our recent paper “Understanding Isomorphism Bias in Graph Data Sets” that explores the quality of graph data sets. The paper is under review at the moment, but you can already try new clean graph data sets (GitHub and PyTorch-Geometric).

**Introduction**

Recent work in the analysis of datasets used for benchmarking state-of-the-art algorithms in the domain of computer vision revealed that commonly used datasets such as CIFAR and ImageNet contain at least 10% of duplicate images [1,2,3]. This fact potentially invalidates claimed state-of-the-art results of new architectures and questions generalization ability. In particular, evaluating the models on test sets cleaned from duplicate images shows a drop of accuracy by as much as 15%.

In a similar fashion, we have performed an investigation of commonly used graph datasets and tried to show how another form of duplication in this domain, i.e. graph isomorphism, can influence results of graph classification methods.

Recently there has been a rapid increase in the development of various methods for graph classification task including both graph kernel methods and graph neural networks. A lot of these state-of-the-art models assume that incorporating graph isomorphism features into architecture leads to better empirical performance.

However, we have analyzed commonly used data sets for the graph classification task and have discovered that a lot of them contain isomorphic instances that lead to the problem of isomorphism bias, i.e. artificially increasing the accuracy of the models by memorizing target information from the training set.

Existence of isomorphism bias potentially impairs results obtained from state-of-the-art classification methods and leads to the unfair comparison of graph classification algorithms.

**Graph Isomorphism**

Basically, graph isomorphism means that the same graph can exist in more than one form.

Formally, we define isomorphism between two graphs G1 = (V1, E1) and G2 = (V2, E2) is a bijective function φ : V1 → V2 such that any edge (u, v) ∈ E1 if and only if (φ(u), φ(v)) ∈ E2. Graph isomorphism problem asks if such function exists for given two graphs G1 and G2.

Graph Isomorphism Example (Source)

**Datasets**

In order to characterize data sets in terms of the presence of isomorphic graph instances, we have analyzed 54 datasets commonly used for measuring algorithms performance in graph classification tasks and developed several metrics.

Most of the analyzed data sets come either from a biological domain or from a social network domain. Biological data sets such as MUTAG, ENZYMES, PROTEINS are graphs that represent small or large molecules, where edges of the graphs are chemical bonds or spatial proximity between different atoms. Graph labels in these cases encode different properties like toxicity. In social data sets such as IMDB-BINARY, REDDIT- MULTI-5K, COLLAB the nodes represent people and edges are relationships in movies, discussion threads, or citation network respectively. Labels in these cases denote the type of interaction like the genre of the movie/thread or a research subfield.

We propose the following metrics:

- I, number of graphs that have isomorphic counterparts in a data set;
- I,%, proportion of isomorphic graphs to the total data set size, i.e. I/N ;
- IP,%, proportion of isomorphic pairs of graphs to the total number of graph pairs in a data set, (N(N−1))/2

Isomorphic metrics for Top-10 data sets based on the proportion of isomorphic graphs I%. IP% is the proportion of isomorphic pairs of graphs, Mismatched % is the proportion of mismatched labels.

**Explaining isomorphism**

Essentially there are three reasons for having isomorphic graphs in a data set:

- Meta-information
- Size of graphs
- Data set origin

Meta-information includes nodes and/or edges labels and attributes. Labels are discrete, while attributes are continuous and can be presented as a vector. We found out that if we consider node labels when computing graph isomorphism, the proportion of isomorphic graphs drops significantly. However, not all graphs have meta-data (e.g. IMDB or REDDIT does not have it).

The size of graphs affects the number of isomorphic instances in a very intuitive way: the bigger the graph, the lower probability that it will have an isomorphic graph. Hence, data sets with a high average number of nodes will have fewer isomorphic graphs. As many data sets come from Biological domain

Finally, the origin of a data set can explain isomorphism. For example, IMDB is a movie collaboration network, where each graph is an ego network for each actor. Such ego networks are often isomorphic to each other as actors collaborate in a similar way with each other.

Interestingly, we have also found that some of the isomorphic graphs in data sets have different target labels making them invalid in graph classification problem. Essentially, this means that a classifier cannot learn a proper target label for such examples. The reasons for this are (1) noise in data set preparation (different chemical compounds are given the same graph structure) and (2) data set origin (in IMDB data lots of the same actors play in different genres of movies).

Many isomorphic graphs in data sets have different classification labels, so these examples are definitely not suitable for classification.

**Experiments**

The described data sets were used in experiments with several state-of-the-art graph classification methods, such as Weisfeiler-Lehman kernel and Graph Isomorphism Network, in order to assess the influence of isomorphic graph instances on the classifier performance. We observed that performing 10-fold cross-validation shows a more fair accuracy metric, as during a single train-test split accuracy can jump significantly.

To test our hypothesis, that presence of graphs in the test set that has isomorphic graphs in training set can impair generalization ability we had to consider test sets with and without graphs that have isomorphic counterparts in the training set.

As we show for methods that incorporate graph isomorphism features, accuracy on test sets without isomorphic graphs will be lower if the model performs better on excluded isomorphic instances.

The results obtained showed the accuracy of classification will increase in the presence of isomorphic instances in test sets and the performance of any classification model can be overestimated by as much as 5% of accuracy.

**Recommendations**

- Use new clean graph data sets for graph classification that we open-source. This ensures future experimental comparison is correct. Data sets are available in GitHub repo and as part of PyTorch-Geometric.
- If you develop a graph classification model, consider using meta-information for the graph such as node/edge labels/attributes. Labels are discrete, while attributes are continuous arrays.
- Use 10-fold cross-validation as performance can jump from one fold to another. The single train-test split is not sufficient.

The details about the experimental setting, data sets, and theoretical justification can be found in full paper.

[1] *“Do we train on test data? Purging CIFAR of near-duplicates”* Barz et al., 2019

[2] *“Semantic Redundancies in Image-Classification Datasets: The 10% You Don’t Need”*, Birodkar,* *2019

[3] *“Do ImageNet Classifiers Generalize to ImageNet?”*, Recht et al., 2019