Data Science & Data Analytics learning resources

Below I share a list of resources I find valuable to get necessary skills for Data Analysts and Data Scientists. I got many questions from people about it and I hope these resources will be useful for those who want to get started in data science or data analytics.

Last update: 2022-03-25.

Data Analytics

Data Analyst Nanodegree by Udacity: covers the data analysis process of wrangling, exploring, analyzing, and communicating data, separate part on practical statistics and experimentation. Lots of practical cases with python.
- It costs money, but there are discount periods.
Intro to Data Analysis by Udacity: intro to data analysis with numpy and pandas, free.

Business Analytics

Master Reporting Automation with Google Sheets: course by colleague from Bolt – learn business reporting and modeling with spreadsheets.
- Use code #KNOWLEDGEISFREE to get 3 days free access to the course

Statistics

Machine Learning

Machine Learning by Anrdew Ng on Coursera: Old, but Gold, gives a good theoretical background on ML.
Open Machine Learning CourseOpen Machine Learning Course by OpenDataScience:
- Youtube lectures
Open ML course, the Russian version of the previous course with active session (not self-paced)
- Youtube lectures

SQL

Basic SQL on codecademy, free.
Analyzing Big Data with SQL by Coursera: for those with basic knowledge of SQL and looking forward to elevating their skills

Kaggle

Communities

Open Data Science, ODS
Telegram channel Start Career in DS (in Russian)

Newsletters

Data Elixir: more analytics focused.
Data Science Weekly: Data Science news, articles.
The batch: Andrew Ng and his team share recent research, more advanced.

Advanced Data Science topics

Introduction to Causal Inference: basics of causality theory.
- Videos
Machine Learning with Graphs by Stanford: fundamentals of applying ML to Graph theory and represantation learning. Includes applications in different areas: robustness and fragility of food webs and financial markets, algorithms for the World Wide Web, identification of functional modules in biological networks, disease outbreak detection.
- Videos
- My lecture notes

Machine Learning with Graphs: lecture notes, part 4/4

Some time ago, I finished the Stanford course CS224W Machine Learning with Graphs. This is the last Part 4 of blog posts series where I share my notes from watching lectures. The rest you can find here: 1, 2, 3.

Lecture 17 – Reasoning over Knowledge Graphs
Lecture 18 – Limitations of Graph Neural Networks
Lecture 19 – Applications of Graph Neural Networks

Lecture 17 – Reasoning over Knowledge Graphs

Slides, Video

Knowledge graphs are graphs which capture entities, types, and relationships. Nodes in these graphs are entities that are labeled with their types and edges between two nodes capture relationships between entities.

Examples are bibliographical network (node types are paper, title, author, conference, year; relation types are pubWhere, pubYear, hasTitle, hasAuthor, cite), social network (node types are account, song, post, food, channel; relation types are friend, like, cook, watch, listen).

Knowledge graphs in practice:

Google Knowledge Graph.
Amazon Product Graph.
Facebook Graph API.
IBM Watson.
Microsoft Satori.
Project Hanover/Literome.
LinkedIn Knowledge Graph.
Yandex Object Answer.

Knowledge graphs (KG) are applied in different areas, among others are serving information and question answering and conversation agents.

There exists several publicly available KGs (FreeBase, Wikidata, Dbpedia, YAGO, NELL, etc). They are massive (having millions of nodes and edges), but incomplete (many true edges are missing).

Given a massive KG, enumerating all the possible facts is intractable. Can we predict plausible BUT missing links?

Example: Freebase

~50 million entities.
~38K relation types.
~3 billion facts/triples.
Full version is incomplete, e.g. 93.8% of persons from Freebase have no place of birth and 78.5% have no nationality.
FB15k/FB15k-237 are complete subsets of Freebase, used by researchers to learn KG models.
- FB15k has 15K entities, 1.3K relations, 592K edges
- FB15k-237 has 14.5K entities, 237 relations, 310K edges

KG completion

We look into methods to understand how given an enormous KG we can complete the KG / predict missing relations.

Key idea of KG representation:

Edges in KG are represented as triples (ℎ, r ,t) where head (ℎ) has relation (r) with tail (t).
Model entities and relations in the embedding/vector space ℝ^d.
Given a true triple (ℎ, r ,t), the goal is that the embedding of (ℎ, r) should be close to the embedding of t.

How to embed ℎ, r ? How to define closeness? Answer in TransE algorithm.

Translation Intuition: For a triple (ℎ, r ,t), ℎ, r ,t ∈ ℝ^d , h + r = t (embedding vectors will appear in boldface). Score function: f_r(ℎ,t) = ||ℎ + r − t||

TransE training maximizes margin loss ℒ = Σ_{for (ℎ, r ,t)∈G, (ℎ, r ,t’)∉G} [γ + f_r(ℎ,t) − f_r(ℎ,t’)]₊ where γ is the margin, i.e., the smallest distance tolerated by the model between a valid triple (f_r(ℎ,t)) and a corrupted one (f_r(ℎ,t’) ).

*TransE link prediction to answer question: Who has won the Turing award?*

Relation types in TransE

Composition relations: r₁(x, y) ∧ r₂(y, z) ⇒ r₃(x,z) ∀ x, y, z (Example: My mother’s husband is my father). It satisfies TransE: r₃ = r₁ + r₂ (look how it looks on 2D space below)

Symmetric Relations: r (ℎ, t) ⇒ r (t, h) ∀ h, t (Example: Family, Roommate). It doesn’t satisfy TransE. If we want TransE to handle symmetric relations r, for all ℎ,t that satisfy r (ℎ, t), r (t, h) is also True, which means ‖ ℎ + r − t ‖ = 0 and ‖ t + r − ℎ ‖ = 0. Then r = 0 and ℎ = t, however ℎ and t are two different entities and should be mapped to different locations.
1-to-N, N-to-1, N-to-N relations: (ℎ, r ,t₁) and (ℎ, r ,t₂) both exist in the knowledge graph, e.g., r is “StudentsOf”. It doesn’t satisfy TransE. With TransE, t₁ and t₂ will map to the same vector, although they are different entities.

TransR

TransR algorithm models entities as vectors in the entity space ℝ^d and model each relation as vector r in relation space ℝ^k with M_r∈ ℝ^{k x d} as the projection matrix.

ℎ⊥ = Mr ℎ, t⊥ = Mr t, fr(ℎ,t) = ||ℎ⊥ + r − t⊥||

Relation types in TransR

Composition Relations: TransR doesn’t satisfy – Each relation has different space. It is not naturally compositional for multiple relations.
Symmetric Relations: For TransR, we can map ℎ and t to the same location on the space of relation r.
1-to-N, N-to-1, N-to-N relations: We can learn M_r so that t_⊥ = M_rt₁ = M_rt₂, note that t₁ does not need to be equal to t₂.

Types of queries on KG

One-hop queries: Where did Hinton graduate?
Path Queries: Where did Turing Award winners graduate?
Conjunctive Queries: Where did Canadians with Turing Award graduate?
EPFO (Existential Positive First-order): Queries Where did Canadians with Turing Award or Nobel graduate?

One-hop Queries

We can formulate one-hop queries as answering link prediction problems.

Link prediction: Is link (ℎ, r, t) True? -> One-hop query: Is t an answer to query (ℎ, r)?

We can generalize one-hop queries to path queries by adding more relations on the path.

Path Queries

Path queries can be represented by q = (v_a, r₁, … , r_n) where v_a is a constant node, answers are denoted by ||q||.

For example ““Where did Turing Award winners graduate?”, v_a is “Turing Award”, (r₁, r₂) is (“win”, “graduate”).

To answer path query, traverse KG: start from the anchor node “Turing Award” and traverse the KG by the relation “Win”, we reach entities {“Pearl”, “Hinton”, “Bengio”}. Then start from nodes {“Pearl”, “Hinton”, “Bengio”} and traverse the KG by the relation “Graduate”, we reach entities {“NYU”, “Edinburgh”, “Cambridge”, “McGill”}. These are the answers to the query.

How can we traverse if KG is incomplete? Can we first do link prediction and then traverse the completed (probabilistic) KG? The answer is no. The completed KG is a dense graph. Time complexity of traversing a dense KG with |V| entities to answer (v_a, r₁, … , r_n) of length n is O(|V|ⁿ).

Another approach is to traverse KG in vector space. Key idea is to embed queries (generalize TransE to multi-hop reasoning). For v being an answer to q, do a nearest neighbor search for all v based on f_q(v) = ||q − v||, time complexity is O(V).

*Embed path queries in vector space for “Where did Turing Award winners graduate?”*

Conjunctive Queries

We can answer more complex queries than path queries: we can start from multiple anchor nodes.

For example “Where did Canadian citizens with Turing Award graduate?”, start from the first anchor node “Turing Award”, and traverse by relation “Win”, we reach {“Pearl”, “Hinton”, “Bengio”}. Then start from the second anchor node “Canada”, and traverse by relation “citizen”, we reach { “Hinton”, “Bengio”, “Bieber”, “Trudeau”}. Then, we take the intersection of the two sets and achieve {‘Hinton’, ‘Bengio’}. After we do another traverse and arrive at the answers.

Again, another approach is to traverse KG in vector space. But how do we take the intersection of several vectors in the embedding space?

To do that, design a neural intersection operator ℐ:

Input: current query embeddings q₁, …, q_m.
Output: intersection query embedding q.
ℐ should be permutation invariant: ℐ (q₁, …, q_m) = ℐ(q_p(1), …, q_p(m)), [p(1) , … , p(m) ] is any permutation of [1, … , m].

Training is the following: Given an entity embedding v and a query embedding q, the distance is f_q(v) = ||q − v||. Trainable parameters are: entity embeddings d |V|, relation embeddings d |R|, intersection operator ϕ, β.

The whole process:

Training:
- Sample a query q, answer v, negative sample v′.
- Embed the query q.
- Calculate the distance f_q(v) and f_q(v’).
- Optimize the loss ℒ.
Query evaluation:
- Given a test query q, embed the query q.
- For all v in KG, calculate f_q(v).
- Sort the distance and rank all v.

Taking the intersection between two vectors is an operation that does not follow intuition. When we traverse the KG to achieve the answers, each step produces a set of reachable entities. How can we better model these sets? Can we define a more expressive geometry to embed the queries? Yes, with Box Embeddings.

Query2Box: Reasoning with Box Embeddings

The idea is to embed queries with hyper-rectangles (boxes): q = (Center (q), Offset (q)).

Taking intersection between two vectors is an operation that does not follow intuition. But intersection of boxes is well-defined. Boxes are a powerful abstraction, as we can project the center and control the offset to model the set of entities enclosed in the box.

Parameters are similar to those before: entity embeddings d |V| (entities are seen as zero-volume boxes), relation embeddings 2d |R| (augment each relation with an offset), intersection operator ϕ, β (inputs are boxes and output is a box).

Also, now we have Geometric Projection Operator P: Box × Relation → Box: Cen (q’) = Cen (q) + Cen (r), Off (q’) = Off (q) + Off (r).

Another operator is Geometric Intersection Operator ℐ: Box × ⋯× Box → Box. The new center is a weighted average; the new offset shrinks.

Cen (q_inter) = Σ w_i ⊙ Cen (q_i), Off (q_inter) = min (Off (q₁), …, Off (q_n)) ⊙ σ (Deepsets (q₁, …, q_n)) where ⊙ is dimension-wise product, min function guarantees shrinking and sigmoid function σ squashes output in (0,1).

Entity-to-Box Distance: Given a query box q and entity vector v, d_box (q,v) = d_out (q,v) + α d_in (q,v) where 0 < α < 1.

During training, Query2box minimises loss ℒ = − log σ (γ – d_box(q,v)) – log σ (γ – d_box(q,v’_i) – γ)

Query2box can handle different relation patterns:

Composition relations: r₁(x, y) ∧ r₂(y, z) ⇒ r₃(x,z) ∀ x, y, z (Example: My mother’s husband is my father). It satisfies Query2box: if y is in the box of (x, r₁) and z is in the box of (y, r₂), it is guaranteed that z is in the box of (x, r₁ + r₂).

Symmetric Relations: r (ℎ, t) ⇒ r (t, h) ∀ h, t (Example: Family, Roommate). For symmetric relations r, we could assign Cen (r) = 0. In this case, as long as t is in the box of (ℎ, r), it is guaranteed that ℎ is in the box of (t, r). So we have r (ℎ, t) ⇒ r (t, h).
1-to-N, N-to-1, N-to-N relations: (ℎ, r ,t₁) and (ℎ, r ,t₂) both exist in the knowledge graph, e.g., r is “StudentsOf”. Box Embedding can handle since t₁ and t₂ will be mapped to different locations in the box of (ℎ, r).

EPFO queries

Can we embed even more complex queries? Conjunctive queries + disjunction is called Existential Positive First-order (EPFO) queries. E.g., “Where did Canadians with Turing Award or Nobel graduate?” Yes, we also can design a disjunction operator and embed EPFO queries in low-dimensional vector space.For details, they suggest to check the paper, but the link is not working (during the video they skipped this part, googling also didn’t help to find what exactly they meant…).

Lecture 18 – Limitations of Graph Neural Networks

Slides, Video

Recap Graph Neural Networks: key idea is to generate node embeddings based on local network neighborhoods using neural networks. Many model variants have been proposed with different choices of neural networks (mean aggregation + Linear ReLu in GCN (Kipf & Welling ICLR’2017), max aggregation and MLP in GraphSAGE (Hamilton et al. NeurIPS’2017)).

Graph Neural Networks have achieved state-of the-art performance on:

Node classification [Kipf+ ICLR’2017].
Graph Classification [Ying+ NeurIPS’2018].
Link Prediction [Zhang+ NeurIPS’2018].

But GNN are not perfect:

Some simple graph structures cannot be distinguished by conventional GNNs.

*Assuming uniform input node features, GCN and GraphSAGE fail to distinguish the two graphs*

GNNs are not robust to noise in graph data (Node feature perturbation, edge addition/deletion).

Limitations of conventional GNNs in capturing graph structure

Given two different graphs, can GNNs map them into different graph representations? Essentially, it is a graph isomorphism test problem. No polynomial algorithms exist for the general case. Thus, GNNs may not perfectly distinguish any graphs.

To answer, how well can GNNs perform the graph isomorphism test, need to rethink the mechanism of how GNNs capture graph structure.

GNNs use different computational graphs to distinguish different graphs as shown on picture below:

Most discriminative GNNs map different subtrees into different node representations. On the left, all nodes will be classified the same; on the right, nodes 2 and 3 will be classified the same.

Recall injectivity: the function is injective if it maps different elements into different outputs. Thus, entire neighbor aggregation is injective if every step of neighbor aggregation is injective.

Neighbor aggregation is essentially a function over multi-set (set with repeating elements) over multi-set (set with repeating elements). Now let’s characterize the discriminative power of GNNs by that of multi-set functions.

GCN: as GCN uses mean pooling, it will fail to distinguish proportionally equivalent multi-sets (it is not injective).

*Case with GCN: both sets will be equivalent for GCN*

GraphSAGE: as it uses MLP and max pooling, it will even fail to distinguish multi-set with the same distinct elements (it is not injective).

*Case with GraphSAGE: both sets will be equivalent for GraphSAGE*

How can we design injective multi-set function using neural networks?

For that, use the theorem: any injective multi-set function can be expressed by ?( Σ f(x) ) where ? is some non-linear function, f is some non-linear function, the sum is over multi-set.

*The theorem of injective multi-set function*

We can model ? and f using Multi-Layer-Perceptron (MLP) (Note: MLP is a universal approximator).

Then Graph Isomorphism Network (GIN) neighbor aggregation using this approach becomes injective. Graph pooling is also a function over multiset. Thus, sum pooling can also give injective graph pooling.

GINs have the same discriminative power as the WL graph isomorphism test (WL test is known to be capable of distinguishing most of real-world graph, except for some corner cases as on picture below; the prove of GINs relation to WL is in lecture).

*The two graphs look the same for WL test because all the nodes have the same local subtree structure*

GIN achieves state-of-the-art test performance in graph classification. GIN fits training data much better than GCN, GraphSAGE.

*Training accuracy of different GNN architectures*

 Same trend for training accuracy occurs across datasets. GIN outperforms existing GNNs also in terms of test accuracy because it can better capture graph structure.

Vulnerability of GNNs to noise in graph data

Probably, you’ve met examples of adversarial attacks on deep neural network where adding noise to image changes prediction labels.

Adversaries are very common in applications of graph neural networks as well, e.g., search engines, recommender systems, social networks, etc. These adversaries will exploit any exposed vulnerabilities.

To identify how GNNs are robust to adversarial attacks, consider an example of semi-supervised node classification using Graph Convolutional Neural Networks (GCN). Input is partially labeled attributed graph, goal is to predict labels of unlabeled nodes.

Classification Model contains two-step GCN message passing softmax ( Â ReLU (ÂXW⁽¹⁾)W⁽²⁾. During training the model minimizes cross entropy loss on labeled data; during testing it applies the model to predict unlabeled data.

Attack possibilities for target node ? ∈ ? (the node whose classification label attack wants to change) and attacker nodes ? ⊂ ? (the nodes the attacker can modify):

Direct attack (? = {?})
- Modify the target‘s features (Change website content)
- Add connections to the target (Buy likes/ followers)
- Remove connections from the target (Unfollow untrusted users)
Indirect attack (? ∉ ?)
- Modify the attackers‘ features (Hijack friends of target)
- Add connections to the attackers (Create a link/ spam farm)
- Remove connections from the attackers (Create a link/ spam farm)

High level idea to formalize these attack possibilities: objective is to maximize the change of predicted labels of target node subject to limited noise in the graph. Mathematically, we need to find a modified graph that maximizes the change of predicted labels of target node: increase the loglikelihood of target node ? being predicted as ? and decrease the loglikelihood of target node ? being predicted as ? old.

In practice, we cannot exactly solve the optimization problem because graph modification is discrete (cannot use simple gradient descent to optimize) and inner loop involves expensive re-training of GCN.

Some heuristics have been proposed to efficiently obtain an approximate solution. For example: greedily choosing the step-by-step graph modification, simplifying GCN by removing ReLU activation (to work in closed form), etc (More details in Zügner+ KDD’2018).

Attack experiments for semi-supervised node classification with GCN

Left plot below shows class predictions for a single node, produced by 5 GCNs with different random initializations. Right plot shows that GCN prediction is easily manipulated by only 5 modifications of graph structure (|V|=~2k, |E|=~5k).

*GNN are not robust to adversarial attacks*

Challenges of applying GNNs:

Scarcity of labeled data: labels require expensive experiments -> Models overfit to small training datasets
Out-of-distribution prediction: test examples are very different from training in scientific discovery -> Models typically perform poorly.

To partially solve this Hu et al. 2019 proposed Pre-train GNNs on relevant, easy to obtain graph data and then fine-tune for downstream tasks.

Lecture 19 – Applications of Graph Neural Networks

Slides, Video

During this lecture, we describe 3 applications:

1. GNN recommendation (PinSage)

2. Heterogeneous GNN (Decagon)

3. Goal-directed generation (GCPN)

PinSage: GNN for recommender systems

In recommender systems, users interact with items (Watch movies, buy merchandise, listen to music) and the goal os to recommend items users might like:

Customer X buys Metallica and Megadeth CDs.
Customer Y buys Megadeth, the recommender system suggests Metallica as well.

From model perspective, the goal is to learn what items are related: for a given query item(s) Q, return a set of similar items that we recommend to the user.

Having a universal similarity function allows for many applications: Homefeed (endless feed of recommendations), Related pins (find most similar/related pins), Ads and shopping (use organic for the query and search the ads database).

Key problem: how do we define similarity:

1) Content-based: User and item features, in the form of images, text, categories, etc.

2) Graph-based: User-item interactions, in the form of graph/network structure. This is called collaborative filtering:

For a given user X, find others who liked similar items.
Estimate what X will like based on what similar others like.

Pinterest is a human-curated collection of pins. The pin is a visual bookmark someone has saved from the internet to a board they’ve created (image, text, link). The board is a collection of ideas (pins having something in common).

Pinterest has two sources of signal:

Features: image and text of each pin
Dynamic Graph: need to apply to new nodes without model retraining

Usually, recommendations are found via embeddings:

Step 1: Efficiently learn embeddings for billions of pins (items, nodes) using neural networks.
Step 2: Perform nearest neighbour query to recommend items in real-time.

PinSage is GNN which predicts whether two nodes in a graph are related.It generates embeddings for nodes (e.g., pins) in the Pinterest graph containing billions of objects. Key idea is to borrow information from nearby nodes simple set): e.g., bed rail Pin might look like a garden fence, but gates and rely adjacent in the graph.

Pin embeddings are essential in many different tasks. Aside from the “Related Pins” task, it can also be used in recommending related ads, home-feed recommendation, clustering users by their interests.

PinSage Pipeline:

Collect billions of training pairs from logs.
- Positive pair: Two pins that are consecutively saved into the same board within a time interval (1 hour).
- Negative pair: random pair of 2 pins. With high probability the pins are not on the same board
Train GNN to generate similar embeddings for training pairs. Train so that pins that are consecutively pinned have similar embeddings

$\Lambda = \sum_{(u,v)\in D}^{} max(0, -z_u^T z_v + z_u^T z_n + \Delta )$

where D is a set of training pairs from logs, z_vis a “positive”/true example, z_nis a “negative” example and Δ is a “margin” (how much larger positive pair similarity should be compared to negative).

3. Inference: Generate embeddings for all pins

4. Nearest neighbour search in embedding space to make recommendations.

Key innovations:

On-the-fly graph convolutions:
- Sample the neighboUrhood around a node and dynamically construct a computation graph
- Perform a localized graph convolution around a particular node (At every iteration, only source node embeddings are computed)
- Does not need the entire graph during training
Selecting neighbours via random walks
- Performing aggregation on all neighbours is infeasible: how to select the set of neighbours of a node to convolve over? Personalized PageRank can help with that.
- Define Importance pooling: define importance-based neighbourhoods by simulating random walks and selecting the neighbours with the highest visit counts.
  - Choose nodes with top K visit counts
  - Pool over the chosen nodes
  - The chosen nodes are not necessarily neighbours.
  - Compared to GraphSAGE mean pooling where the messages are averaged from direct neighbours, PinSAGE Importance pooling use the normalized counts as weights for weighted mean of messages from the top K nodes
- PinSAGE uses ? = 50. Performance gain for ? > 50 is negligible.
Efficient MapReduce inference

Problem: Many repeated computation if using localized graph convolution at inference step.
Need to avoid repeated computation

Introduce hard negative samples: force model to learn subtle distinctions between pins. Curriculum training on hard negatives starts with random negative examples and then provides harder negative examples over time. Obtain hard negatives using random walks:

Use nodes with visit counts ranked at 1000-5000 as hard negatives
Have something in common, but are not too similar

Example of pairs

Experiments with PinSage

Related Pin recommendations: given a user just saved pin Q, predict what pin X are they going to save next. Setup: Embed 3B pins, find nearest neighbours of Q. Baseline embeddings:

Visual: VGG visual embeddings.
Annotation: Word2vec embeddings
Combined: Concatenate embeddings

*PinSage outperform all other models (****MRR:*** *Mean reciprocal rank of the positive example X w.r.t Q.* ***Hit rate:*** *Fraction of times the positive example X is among top K closest to Q)*

*Comparing PinSage to previous recommender algorithm*

DECAGON: Heterogeneous GNN

So far we only applied GNNs to simple graphs. GNNs do not explicitly use node and edge type information. Real networks are often heterogeneous. How to use GNN for heterogeneous graphs?

The problem we consider is polypharmacy (use of multiple drugs for a disease): how to predict side effects of drug combination? It is difficult to identify manually because it is rare, occurs only in a subset of patients and is not observed in clinical testing.

Problem formulation:

How likely with a pair of drugs ?, ? lead to side effect ??
Graph: Molecules as heterogeneous (multimodal) graphs: graphs with different node types and/or edge types.
Goal: given a partially observed graph, predict labeled edges between drug nodes. Or in query language: Given a drug pair ?, ? how likely does an edge (? , r₂, ?) exist?

Task description: predict labeled edges between drugs nodes, i.e., predict the likelihood that an edge (? , r₂, s) exists between drug nodes c and s (meaning that drug combination (c, s) leads to polypharmacy side effect r₂.

The model is heterogeneous GNN:

Compute GNN messages from each edge type, then aggregate across different edge types.
- Input: heterogeneous graph.
- Output: node embeddings.
During edge predictions, use pair of computed node embeddings.
- Input: Node embeddings of query drug pairs.
- Output: predicted edges.

Experiment Setup

Data:
- Graph over Molecules: protein-protein interaction and drug target relationships.
- Graph over Population: Side effects of individual drugs, polypharmacy side effects of drug combinations.
Setup:
- Construct a heterogeneous graph of all the data .
- Train: Fit a model to predict known associations of drug pairs and polypharmacy side effects.
- Test: Given a query drug pair, predict candidate polypharmacy side effects.

Decagon model showed up to 54% improvement over baselines. It is the first opportunity to computationally flag polypharmacy side effects for follow-up analyses.

*Prediction performance of Decagon compared to other models*

GCPN: Goal-Directed Graph Generation (an extension of GraphRNN)

Recap Graph Generative Graphs: it generates a graph by generating a two level sequence (uses RNN to generate the sequences of nodes and edges). In other words, it imitates given graphs. But can we do graph generation in a more targeted way and in a more optimal way?

The paper [You et al., NeurIPS 2018] looks into Drug Discovery and poses the question: can we learn a model that can generate valid and realistic molecules with high value of a given chemical property?

Molecules are heterogeneous graphs with node types C, N, O, etc and edge types single bond, double bond, etc. (note: “H”s can be automatically inferred via chemical validity rules, thus are ignored in molecular graphs).

Authors proposes the model Graph Convolutional Policy Network which combines graph representation and Reinforcement Learning (RL):

Graph Neural Network captures complex structural information, and enables validity check in each state transition (Valid).
Reinforcement learning optimizes intermediate/final rewards (High scores).
Adversarial training imitates examples in given datasets (Realistic).

Steps of GCPN and shown on the picture below:

(a) Insert nodes/scaffolds.
(b) Compute state via GCN.
(c) Sample next action.
(d) Take action (check chemical validity).
(e, f) Compute reward.

To set the reward in RL part:

Learn to take valid action: at each step, assign small positive reward for valid action.
Optimize desired properties: at the end, assign positive reward for high desired property.
Generate realistic graphs: at the end, adversarially train a GCN discriminator, compute adversarial rewards that encourage realistic molecule graphs

Reward r_t is final reward (domain-specific reward) and step reward (step-wise validity reward).

Training consists of two parts:

Supervised training: train policy by imitating the action given by real observed graphs. Use gradient.
RL training: train policy to optimize rewards. Use standard policy gradient algorithm.

GCPN can perform different tasks:

Property optimization: generate molecules with high specified property score.
Property targeting: generate molecules whose specified property score falls within given range.
Constrained property optimization: edit a given molecule for a few steps to achieve higher specified property score.

*Visualization of GCPN graphs for constrained optimization*

PS: Other lecture notes you can find here: 1, 2, 3.

Machine Learning with Graphs: lecture notes, part 2/4

Recently, I finished the Stanford course CS224W Machine Learning with Graphs. This is Part 2 of blog posts series where I share my notes from watching lectures. The rest you can find here: 1, 3, 4.

Lecture 6 – Message Passing and Node Classification

Slides, Video

Node classification means: given a network with labels on some nodes, assign labels to all other nodes in the network. The main idea for this lecture is to look at collective classification leveraging existing correlations in networks.

Individual behaviors are correlated in a network environment. For example, the network where nodes are people, edges are friendship, and the color of nodes is a race: people are segregated by race due to homophily (the tendency of individuals to associate and bond with similar others).

“Guilt-by-association” principle: similar nodes are typically close together or directly connected; if I am connected to a node with label X, then I am likely to have label X as well.

Collective classification

Intuition behind this method: simultaneously classify interlinked nodes using correlations.

Applications of the method are found in:

document classification,
speech tagging,
link prediction,
optical character recognition,
image/3D data segmentation,
entity resolution in sensor networks,
spam and fraud detection.

Collective classification is based on Markov Assumption – the label Y_i of one node i depends on the labels of its neighbours N_i:

$P(Y_i|i)=P(Y_i|N_i)$

Steps of collective classification algorithms:

Local Classifier: used for initial label assignment.
- Predict label based on node attributes/features.
- Standard classification task.
- Don’t use network information.
Relational Classifier: capture correlations between nodes based on the network.
- Learn a classifier to label one node based on the labels and/or attributes of its neighbors.
- Network information is used.
Collective Inference: propagate the correlation through the network.
- Apply relational classifier to each node iteratively.
- Iterate until the inconsistency between neighboring labels is minimized.
- Network structure substantially affects the final prediction.

If we represent every node as a discrete random variable with a joint mass function p of its class membership, the marginal distribution of a node is the summation of p over all the other nodes. The exact solution takes exponential time in the number of nodes, therefore they use inference techniques that approximate the solution by narrowing the scope of the propagation (e.g., only neighbors) and the number of variables by means of aggregation.

Techniques for approximate inference (all are iterative algorithms):

Relational classifiers (weighted average of neighborhood properties, cannot take node attributes while labeling).
Iterative classification (update each node’s label using own and neighbor’s labels, can consider node attributes while labeling).
Belief propagation (Message passing to update each node’s belief of itself based on neighbors’ beliefs).

Probabilistic Relational classifier

The idea is that class probability of Y_i is a weighted average of class probabilities of its neighbors.

For labeled nodes, initialize with ground-truth Y labels.
For unlabeled nodes, initialize Y uniformly.
Update all nodes in a random order until convergence or until maximum number of iterations is reached.

Repeat for each node i and label c:

$P(Y_i=c)=\frac{1}{\sum_{(i,j)}^{} W(i,j)} \sum_{(i,j \in E)}^{} W(i,j) P(Y_i=c)$

W(i,j) is the edge strength from i to j.

Example of 1^st iteration for one node:

Challenges:

Convergence is not guaranteed.
Model cannot use node feature information.

Iterative classification

Main idea of iterative classification is to classify node i based on its attributes as well as labels of neighbour set N_i:

Architecture of iterative classifier:

Bootstrap phase:
- Convert each node i to a flat vector a_i (Node may have various numbers of neighbours, so we can aggregate using: count , mode, proportion, mean, exists, etc.)
- Use local classifier f(a_i) (e.g., SVM, kNN, …) to compute best value for Y_i .
Iteration phase: iterate till convergence. Repeat for each node i:
- Update node vector a_i.
- Update label Y_i to f(a_i).
- Iterate until class labels stabilize or max number of iterations is reached.

Note: Convergence is not guaranteed (need to run for max number of iterations).

Application of iterative classification:

REV2: Fraudulent User Predictions in Rating Platforms [Kumar et al.]. The model uses fairness, goodness and reliability scores to find malicious apps and users who give fake ratings.

Loopy belief propagation

Belief Propagation is a dynamic programming approach to answering conditional probability queries in a graphical model. It is an iterative process in which neighbor variables “talk” to each other, passing messages.

In a nutshell, it works as follows: variable X₁ believes variable X₂ belongs in these states with various likelihood. When consensus is reached, calculate final belief.

Message passing:

Task: Count the number of nodes in a graph.
Condition: Each node can only interact (pass message) with its neighbors.
Solution: Each node listens to the message from its neighbor, updates it, and passes it forward.

Loopy belief propagation algorithm:

Initialize all messages to 1. Then repeat for each node:

$m_{i\rightarrow j}(Y_i) = \alpha \sum _{Y_i \in \Lambda }{\psi (Y_i,Y_j) } \phi_i (Y_i)\prod_{k \in N_i /j}{m_k \rightarrow j}(Y_i)$

Label-label potential matrix ψ: dependency between a node and its neighbour. ψ(Y_i , Y_j) equals the probability of a node j being in state Y_j given that it has a i neighbour in state Y_i .
Prior belief ϕ_i: probability ϕ_i(Y_i) of node i being in state Y_i.
m(i->j)(Y_i) is i’s estimate of j being in state Y_j.
Λ is the set of all states.
Last term (Product) means that all messages sent by neighbours from previous round.
After convergence b_i(Y_i) = i’s belief of being in state Y_i:

$b_i(Y_i) = \alpha \phi_i(Y_i) \prod_{k \in N_i /j}{m_k \rightarrow j}(Y_i)$

Advantages of the method: easy to program & parallelize, can apply to any graphical model with any form of potentials (higher order than pairwise).
Challenges: convergence is not guaranteed (when to stop), especially if many closed loops.
Potential functions (parameters): require training to estimate, learning by gradient-based optimization (convergence issues during training).

Application of belief propagation:

Netprobe: A Fast and Scalable System for Fraud Detection in Online Auction Networks [Pandit et al.]. Check out mode details here. It describes quite interesting case when fraudsters form a buffer level of “accomplice” users to game the feedback system.

They form near-bipartite cores (2 roles): accomplice trades with honest, looks legit; fraudster trades with accomplice, fraud with honest.

Lecture 7 – Graph Representation Learning

Slides, Video

Supervised Machine learning algorithm includes feature engineering. For graph ML, feature engineering is substituted by feature representation – embeddings. During network embedding, they map each node in a network into a low-dimensional space:

It gives distributed representations for nodes;
Similarity of embeddings between nodes indicates their network similarity;
Network information is encoded into generated node representation.

Network embedding is hard because graphs have complex topographical structure (i.e., no spatial locality like grids), no fixed node ordering or reference point (i.e., the isomorphism problem) and they often are dynamic and have multimodal features.

Embedding Nodes

Setup: graph G with vertex set V, adjacency matrix A (for now no node features or extra information is used).

Goal is to encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the original network.

Learning Node embeddings:

Define an encoder (i.e., a mapping from nodes to embeddings).
Define a node similarity function (i.e., a measure of similarity in the original network).

Optimize the parameters of the encoder so that: Similarity (u, v) ≈ z^T_vz_u.

Two components:

Encoder: maps each node to a low dimensional vector. Enc(v) = z_v where v is node in the input graph and z_v is d-dimensional embedding.
Similarity function: specifies how the relationships in vector space map to the relationships in the original network – z^T_vz_uis a dot product between node embeddings.

“Shallow” encoding

The simplest encoding approach is when encoder is just an embedding-lookup.

Enc(v) = Zv where Z is a matrix with each column being a node embedding (what we learn) and v is an indicator vector with all zeroes except one in column indicating node v.

*Each node is assigned to a unique embedding vector*

Many methods for encoding: DeepWalk, node2vec, TransE. The methods are different in how they define node similarity (should two nodes have similar embeddings if they are connected or share neighbors or have similar “structural roles”?).

Random Walk approaches to node embeddings

Given a graph and a starting point, we select a neighbor of it at random, and move to this neighbor; then we select a neighbor of this point at random, and move to it, etc. The (random) sequence of points selected this way is a random walk on the graph.

Then z^T_vz_u ≈ probability that u and v co-occur on a random walk over the network.

Properties of random walks:

Expressivity: Flexible stochastic definition of node similarity that incorporates both local and higher-order neighbourhood information.
Efficiency: Do not need to consider all node pairs when training; only need to consider pairs that co-occur on random walks.

Algorithm:

Run short fixed-length random walks starting from each node on the graph using some strategy R.
For each node u collect N_R(u), the multiset (nodes can repeat) of nodes visited on random walks starting from u.

Optimize embeddings: given node u, predict its neighbours N_R(u)

$max_{z}\sum_{u \in V}^{}logP(N_R(u)|z_u) \rightarrow \Lambda =\sum_{u \in V} \sum_{v \in N_R(u)}-log(P(v|z_u))$

Intuition: Optimize embeddings to maximize likelihood of random walk co-occurrences.

Then parameterize it using softmax:

$\Lambda =\sum_{u \in V} \sum_{v \in N_R(u)}-log(\frac{exp(z_u^Tz_v)}{\sum_{n \in V}exp(z_u^Tz_n)})$

where first sum is sum over all nodes u, second sum is sum over nodes v seen on random walks starting from u and fraction (under log) is predicted probability of u and v co-occuring on random walk.

Optimizing random walk embeddings = Finding embeddings z_u that minimize Λ. Naive optimization is too expensive, use negative sampling to approximate.

Strategies for random walk itself:

Simplest idea: just run fixed-length, unbiased random walks starting from each node (i.e., DeepWalk from Perozzi et al., 2013). But such notion of similarity is too constrained.
More general: node2vec.

Node2vec: biased walks

Idea: use flexible, biased random walks that can trade off between local and global views of the network (Grover and Leskovec, 2016).

Two parameters:

Return parameter p: return back to the previous node
In-out parameter q: moving outwards (DFS) vs. inwards (BFS). Intuitively, q is the “ratio” of BFS vs. DFS

*Recap Breadth-first search and Depth-first search.*

Node2vec algorithm:

Compute random walk probabilities.
Simulate r random walks of length l starting from each node u.
Optimize the node2vec objective using Stochastic Gradient Descent.

How to use embeddings z_i of nodes:

Clustering/community detection: Cluster points z_i.
Node classification: Predict label f(z_i) of node i based on z_i.
Link prediction: Predict edge (i,j) based on f(z_i,z_j) with concatenation, avg, product, or the difference between the embeddings.

Random walk approaches are generally more efficient.

Translating embeddings for modeling multi-relational data

A knowledge graph is composed of facts/statements about inter-related entities. Nodes are referred to as entities, edges as relations. In Knowledge graphs (KGs), edges can be of many types!

Knowledge graphs may have missing relation. Intuition: we want a link prediction model that learns from local and global connectivity patterns in the KG, taking into account entities and relationships of different types at the same time.

Downstream task: relation predictions are performed by using the learned patterns to generalize observed relationships between an entity of interest and all the other entities.

In Translating Embeddings method (TransE), relationships between entities are represented as triplets h (head entity), l (relation), t (tail entity) => (ℎ, l, t).

Entities are first embedded in an entity space R^k and relations are represented as translations ℎ + l ≈ t if the given fact is true, else, ℎ + l ≠ t.

Embedding entire graphs

Sometimes we need to embed entire graphs to such tasks as classifying toxic vs. non-toxic molecules or identifying anomalous graphs.

Approach 1:

Run a standard graph embedding technique on the (sub)graph G. Then just sum (or average) the node embeddings in the (sub)graph G. (Duvenaud et al., 2016- classifying molecules based on their graph structure).

Approach 2:

Idea: Introduce a “virtual node” to represent the (sub)graph and run a standard graph embedding technique (proposed by Li et al., 2016 as a general technique for subgraph embedding).

Approach 3 – Anonymous Walk Embeddings:

States in anonymous walk correspond to the index of the first time we visited the node in a random walk.

*Number of anonymous walks grows exponentially with increasing length of a walk.*

Anonymous Walk Embeddings have 3 methods:

Represent the graph via the distribution over all the anonymous walks (enumerate all possible anonymous walks a_i of l steps, record their counts and translate it to probability distribution.
Create super-node that spans the (sub) graph and then embed that node (as complete counting of all anonymous walks in a large graph may be infeasible, sample to approximating the true distribution: generate independently a set of m random walks and calculate its corresponding empirical distribution of anonymous walks).
Embed anonymous walks (learn embedding z_i of every anonymous walk a_i. The embedding of a graph G is then sum/avg/concatenation of walk embeddings z_i).

Lecture 8 – Graph Neural Networks

Slides, Video

Recap that the goal of node embeddings it to map nodes so that similarity in the embedding space approximates similarity in the network. Last lecture was focused on “shallow” encoders, i.e. embedding lookups. This lecture will discuss deep methods based on graph neural networks:

Enc(v) = multiple layers of non-linear transformations of graph structure.

Modern deep learning toolbox is designed for simple sequences & grids. But networks are far more complex because they have arbitrary size and complex topological structure (i.e., no spatial locality like grids), no fixed node ordering or reference point and they often are dynamic and have multimodal features.

Naive approach of applying CNN to networks:

Join adjacency matrix and features.
Feed them into a deep neural net.

Issues with this idea: O(N) parameters; not applicable to graphs of different sizes; not invariant to node ordering.

Basics of Deep Learning for graphs

Setup: graph G with vertex set V, adjacency matrix A (assume binary), matrix of node features X (like user profile and images for social networks, gene expression profiles and gene functional information for biological networks; for case with no features it can be indicator vectors (one-hot encoding of a node) or vector of constant 1 [1, 1, …, 1]).

Idea: node’s neighborhood defines a computation graph -> generate node embeddings based on local network neighborhoods. Intuition behind the idea: nodes aggregate information from their neighbors using neural networks.

*Every node defines a computation graph based on its neighborhood*

Model can be of arbitrary depth:

Nodes have embeddings at each layer.
Layer-0 embedding of node x is its input feature, x_u.
Layer-K embedding gets information from nodes that are K hops away.

Different models vary in how they aggregate information across the layers (how neighbourhood aggregation is done).

Basic approach is to average the information from neighbors and apply a neural network.

Initial 0-th layer embeddings are equal to node features: h_v⁰ = x_v. The next ones:

$h_v^k=\sigma(W_k \sum _ {u \in N(v)} \frac {h_v^{k-1}}{|N(v)}+B_kh_v^{k-1}), \forall k \in (1,...,K)$

where σ is non-linearity (e.g., ReLU), h_v^k-1 is previous layer embedding of v, sum (after W_k) is average v-th neghbour’s previous layer embeddings.

Embedding after K layers of neighborhood aggregation: z_v = h_v^K.

We can feed these embeddings into any loss function and run stochastic gradient descent to train the weight parameters. We can train in two ways:

Unsupervised manner: use only the graph structure.
- “Similar” nodes have similar embeddings.
- Unsupervised loss function can be anything from random walks (node2vec, DeepWalk, struc2vec), graph factorization, node proximity in the graph.
Supervised manner: directly train the model for a supervised task (e.g., node classification):

$\Lambda = \sum_{v \in V} y_v log(\sigma (z_v^T \theta))+(1-y_v)log(1-\sigma (z_v^T \theta))$

where z_v^T is encoder output (node embedding), y_v is node class label, θ is classigication weights.

Supervised training steps:

Define a neighborhood aggregation function
Define a loss function on the embeddings
Train on a set of nodes, i.e., a batch of compute graphs
Generate embeddings for nodes as needed

The same aggregation parameters are shared for all nodes:

*The number of model parameters is sublinear in |V| and we can generalize to unseen nodes.*

The approach has inductive capability (generate embeddings for unseen nodes or graphs).

Graph Convolutional Networks and GraphSAGE

Can we do better than aggregating the neighbor messages by taking their (weighted) average (formula for h_v^k above)?

Variants

Mean: take a weighted average of neighbors:

$agg = \sum_{u \in N(v)} \frac {h_u^{k-1}}{|N(v)|}$

Pool: Transform neighbor vectors and apply symmetric vector function ( $\gamma$ is element-wise mean/max):

$agg = \gamma (\left \{ Qh_u^{k-1}, \forall u \in N(v) \right \})$

LSTM: Apply LSTM to reshuffled of neighbors:

$agg = LSTM(\left [ h_u^{k-1}, \forall u \in \pi (N(v)) \right ])$

Graph convolutional networks average neighborhood information and stack neural networks. Graph convolutional operator aggregates messages across neighborhoods, N(v).

a_vu = 1/|N(v)| is the weighting factor (importance) of node u’s message to node v. Thus, a_vu is defined explicitly based on the structural properties of the graph. All neighbors u ∈ N(v) are equally important to node v.

*Example of 2-layer GCN: The output of the first layer is the input of the second layer. Again, note that the neural network in GCN is simply a fully connected layer (*https://www.topbots.com/graph-convolutional-networks/)

GraphSAGE uses generalized neighborhood aggregation:

Graph Attention Networks

Idea: Compute embedding h_v^k of each node in the graph following an attention strategy:

Nodes attend over their neighborhoods’ message.
Implicitly specifying different weights to different nodes in a neighborhood.

Let a_vu be computed as a byproduct of an attention mechanism a. Let a compute attention coefficients e_vu across pairs of nodes u, v based on their messages: e_vu = a(W_kh_u^k-1,W_kh_v^k-1). e_vu indicates the importance of node u’s message to node v. Then normalize coefficients using the softmax function in order to be comparable across different neighborhoods:

$a_{vu} = \frac {exp(e_{vu})}{\sum _{k \in N(v)} exp(e_{vk})}$

$h_v^k= \sigma (\sum _{u \in N(v)} a_{vu}W_k h_u^{k-1})$

Form of attention mechanism a:

The approach is agnostic to the choice of a
- E.g., use a simple single-layer neural network.
- a can have parameters, which need to be estimates.
Parameters of a are trained jointly:
- Learn the parameters together with weight matrices (i.e., other parameter of the neural net) in an end-to-end fashion

Paper by Velickovic et al. introduced multi-head attention to stabilize the learning process of attention mechanism:

Attention operations in a given layer are independently replicated R times (each replica with different parameters). Outputs are aggregated (by concatenating or adding).
Key benefit: allows for (implicitly) specifying different importance values (a_vu) to different neighbors.
Computationally efficient: computation of attentional coefficients can be parallelized across all edges of the graph, aggregation may be parallelized across all nodes.
Storage efficient: sparse matrix operations do not require more than O(V+E) entries to be stored; fixed number of parameters, irrespective of graph size.
Trivially localized: only attends over local network neighborhoods.
Inductive capability: shared edge-wise mechanism, it does not depend on the global graph structure.

Example application: PinSage

Pinterest graph has 2B pins, 1B boards, 20B edges and it is dynamic. PinSage graph convolutional network:

Goal: generate embeddings for nodes (e.g., Pins/images) in a web-scale Pinterest graph containing billions of objects.
Key Idea: Borrow information from nearby nodes. E.g., bed rail Pin might look like a garden fence, but gates and beds are rarely adjacent in the graph.
Pin embeddings are essential to various tasks like recommendation of Pins, classification, clustering, ranking.
Challenges: large network and rich image/text features. How to scale the training as well as inference of node embeddings to graphs with billions of nodes and tens of billions of edges?
- To learn with resolution of 100 vs. 3B, they use harder and harder negative samples – include more and more hard negative samples for each epoch.

Key innovations:

On-the-fly graph convolutions:
- Sample the neighborhood around a node and dynamically construct a computation graph.
- Perform a localized graph convolution around a particular node.
- Does not need the entire graph during training.
Constructing convolutions via random walks:
- Performing convolutions on full neighborhoods is infeasible.
- Importance pooling to select the set of neighbors of a node to convolve over: define importance-based neighborhoods by simulating random walks and selecting the neighbors with the highest visit counts.
Efficient MapReduce inference:
- Bottom-up aggregation of node embeddings lends itself to MapReduce. Decompose each aggregation step across all nodes into three operations in MapReduce, i.e., map, join, and reduce.

*PinSage gives 150% improvement in hit rate and 60% improvement in MRR over the best baseline*

General tips working with GNN

Data preprocessing is important:
- Use renormalization tricks.
- Variance-scaled initialization.
- Network data whitening.
ADAM optimizer: naturally takes care of decaying the learning rate.
ReLU activation function often works really well.
No activation function at your output layer: easy mistake if you build layers with a shared function.
Include bias term in every layer.
GCN layer of size 64 or 128 is already plenty.

Lecture 9 – Graph Neural Networks: Hands-on Session

Video

Colab Notebook has everything to get hands on with PyTorch Geometric. Check it out!

Lecture 10 – Deep Generative Models for Graphs

Slides, Video

In precious lectures, we covered graph encoders where outputs are graph embeddings. This lecture covers the opposite aspect – graph decoders where outputs are graph structures.

Problem of Graph Generation

Why is it important to generate realistic graphs (or synthetic graph given a large real graph)?

Generation: gives insight into the graph formation process.
Anomaly detection: abnormal behavior, evolution.
Predictions: predicting future from the past.
Simulations of novel graph structures.
Graph completion: many graphs are partially observed.
“What if” scenarios.

Graph Generation tasks:

Realistic graph generation: generate graphs that are similar to a given set of graphs [Focus of this lecture].
Goal-directed graph generation: generate graphs that optimize given objectives/constraints (drug molecule generation/optimization). Examples: discover highly drug-like molecules, complete an existing molecule to optimize a desired property.

*Drug discovery: complete an existing molecule to optimize a desired property*

Why Graph Generation tasks are hard:

Large and variable output space (for n nodes we need to generate n*n values; graph size (nodes, edges) varies).
Non-unique representations (n-node graph can be represented in n! ways; hard to compute/optimize objective functions (e.g., reconstruction of an error)).
Complex dependencies: edge formation has long-range dependencies (existence of an edge may depend on the entire graph).

Graph generative models

Setup: we want to learn a generative model from a set of data points (i.e., graphs) {x_i}; p_DATA(x) is the data distribution, which is never known to us, but we have sampled x_i ~ p_DATA(x). p_MODEL(x,θ) is the model, parametrized by θ, that we use to approximate p_DATA(x).

Goal:

Make p_MODEL(x,θ) close to p_DATA(x) (Key Principle: Maximum Likelihood – find the model that is most likely to have generated the observed data x).
Make sure we can sample from a complex distribution p_MODEL(x,θ). The most common approach:
- Sample from a simple noise distribution z_i~ N(0,1).
- Transform the noise z_i via f(⋅): x_i= f(z_i; θ). Then x_i follows a complex distribution.

To design f(⋅) use Deep Neural Networks, and train it using the data we have.

This lecture is focused on auto-regressive models (predict future behavior based on past behavior). Recap autoregressive models: p_MODEL(x,θ) is used for both density estimation and sampling (from the probability density). Then apply chain rule: joint distribution is a product of conditional distributions:

$p_{model}(x; \theta)=\prod_{t=1}^{n} p_{model}(x_t | x_1,...,x_{t-1};\theta)$

In our case: x_t will be the t-th action (add node, add edge).

GraphRNN

[You et al., ICML 2018]

Idea: generating graphs via sequentially adding nodes and edges. Graph G with node ordering π can be uniquely mapped into a sequence of node and edge additions S^π.

The sequence S^π has two levels:

Node-level: at each step, add a new node.
Edge-level: at each step, add a new edge/edges between existing nodes. For example, step 4 for picture above S^π₄ = (S^π_4,1, S^π_4,2, S^π₄₃₁: 0,1,1) means to connect 4&2, 4&3, but not 4&1).

Thus, S is a sequence of sequences: a graph + a node ordering. Node ordering is randomly selected.

We have transformed the graph generation problem into a sequence generation problem. Now we need to model two processes: generate a state for a new node (Node-level sequence) and generate edges for the new node based on its state (Edge-level sequence). We use RNN to model these processes.

GraphRNN has a node-level RNN and an edge-level RNN. Relationship between the two RNNs:

Node-level RNN generates the initial state for edge-level RNN.
Edge-level RNN generates edges for the new node, then updates node-level RNN state using generated results.

Setup: State of RNN after time s_t, input to RNN at time x_t, output of RNN at time y_t, parameter matrices W, U, V, non-linearity σ(⋅). s_t = σ(W⋅x_t+ U⋅ s_t-1 ), y_t = V⋅ s_t

To initialize s₀, x₁, use start/end of sequence token (SOS, EOS)- e.g., zero vector. To generate sequences we could let x_t+1 = y_t. but this model is deterministic. That’s why we use y_t = p_model(x_t|x₁, …, x_t-1; θ). Then x_t+1 is a sample from y_t: x_t+1~ y_t. In other words, each step of RNN outputs a probability vector; we then sample from the vector, and feed the sample to the next step.

During training process, we use Teacher Forcing principle depicted below: replace input and output by the real sequence.

RNN steps:

Assume Node 1 is in the graph. Then add Node 2.
Edge RNN predicts how Node 2 connects to Node 1.
Edge RNN gets supervisions from ground truth.
New edges are used to update Node RNN.
Edge RNN predicts how Node 3 connects to Node 2.
Edge RNN gets supervisions from ground truth.
New edges are used to update Node RNN.
Node 4 doesn’t connect to any nodes, stop generation.
Backprop through time: All gradients are accumulated across time steps.
Replace ground truth by GraphRNN’s own predictions.

GraphRNN has an issue – tractability:

Any node can connect to any prior node.
Too many steps for edge generation:
- Need to generate a full adjacency matrix.
- Complex long edge dependencies.

Steps to generate graph below: Add node 1 – Add node 2 – Add node 3 – Connect 3 with 1 and 2 – Add node 4. But then Node 5 may connect to any/all previous nodes.

To limit this complexity, apply Breadth-First Search (BFS) node ordering. Steps with BFS: Add node 1 – Add node 2 – Connect 2 with 1 – Add node 3 – Connect 3 with 1 – Add node 4 – Connect 4 with 2 and 3. Since Node 4 doesn’t connect to Node 1, we know all Node 1’s neighbors have already been traversed. Therefore, Node 5 and the following nodes will never connect to node 1. We only need memory of 2 “steps” rather than n − 1 steps.

Benefits:

Reduce possible node orderings: from O(n!) to the number of distinct BFS orderings.
Reduce steps for edge generation (number of previous nodes to look at).

*BFS reduces the number of steps for edge generation*

When we want to define similarity metrics for graphs, the challenge is that there is no efficient Graph Isomorphism test that can be applied to any class of graphs. The solution is to use a visual similarity or graph statistics similarity.

Application: Drug Discovery

[You et al., NeurIPS 2018]

To learn a model that can generate valid and realistic molecules with high value of a given chemical property, one can use Goal-Directed Graph Generation which:

Optimize a given objective (High scores), e.g., drug-likeness (black box).
Obey underlying rules (Valid), e.g., chemical valency rules.
Are learned from examples (Realistic), e.g., imitating a molecule graph dataset.

Authors of paper present Graph Convolutional Policy Network that combines graph representation and RL. Graph Neural Network captures complex structural information, and enables validity check in each state transition (Valid), Reinforcement learning optimizes intermediate/final rewards (High scores) and adversarial training imitates examples in given datasets (Realistic).

*GCPN for generating graphs with high property scores*

*GCPN for editing given graph for higher property scores*

Open problems in graph generation

Generating graphs in other domains:
- 3D shapes, point clouds, scene graphs, etc.
Scale up to large graphs:
- Hierarchical action space, allowing high level action like adding a structure at a time.
Other applications: Anomaly detection.
- Use generative models to estimate probability of real graphs vs. fake graphs.

Lecture 11 – Link Analysis: PageRank

Slides, Video

The lecture covers analysis of the Web as a graph. Web pages are considered as nodes, hyperlinks are as edges. In the early days of the Web links were navigational. Today many links are transactional (used to navigate not from page to page, but to post, comment, like, buy, …).

The Web is directed graph since links direct from source to destination pages. For all nodes we can define IN-set (what other nodes can reach v?) and OUT-set (given node v, what nodes can v reach?).

Generally, there are two types of directed graphs:

Strongly connected: any node can reach any node via a directed path.
Directed Acyclic Graph (DAG) has no cycles: if u can reach v, then v cannot reach u.

Any directed graph (the Web) can be expressed in terms of these two types.

A Strongly Connected Component (SCC) is a set of nodes S so that every pair of nodes in S can reach each other and there is no larger set containing S with this property.

Every directed graph is a DAG on its SCCs:

SCCs partitions the nodes of G (that is, each node is in exactly one SCC).
If we build a graph G’ whose nodes are SCCs, and with an edge between nodes of G’ if there is an edge between corresponding SCCs in G, then G’ is a DAG.

*Strongly connected components of the graph G: {A,B,C,G}, {D}, {E}, {F}. G’ is a DAG*

Broder et al.: Altavista web crawl (Oct ’99): took a large snapshot of the Web (203 million URLS and 1.5 billion links) to understand how its SCCs “fit together” as a DAG.

Authors computed IN(v) and OUT(v) by starting at random nodes. The BFS either visits many nodes or very few as seen on the plot below:

Based on IN and OUT of a random node v they found Out(v) ≈ 100 million (50% nodes), In(v) ≈ 100 million (50% nodes), largest SCC: 56 million (28% nodes). It shows that the web has so called “Bowtie” structure:

PageRank

There is a large diversity in the web-graph node connectivity -> all web pages are not equally “important”.

The main idea: page is more important if it has more links. They consider in-links as votes. A “vote” (link) from an important page is worth more:

Each link’s vote is proportional to the importance of its source page.
If page i with importance r_i has d_i out-links, each link gets r_i / d_i votes.
Page j’s own importance r_j is the sum of the votes on its in-links.

r_j = r_i /3 + r_k/4. Define a “rank” r_j for node j (d_i is out-degree of node i):

$r_j = \sum_{i\rightarrow j} \frac{r_i}{d_i}$

If we represent each node’s rank in this way, we get system of linear equations. To solve it, use matrix Formulation.

Let page j have d_j out-links. If node j is directed to i, then stochastic adjacency matrix M_ij = 1 / d_j . Its columns sum to 1. Define rank vector r – an entry per page: r_i is the importance score of page i, ∑r_i = 1. Then the flow equation is r = M ⋅ r.

Random Walk Interpretation

Imagine a random web surfer. At any time t, surfer is on some page i. At time t + 1, the surfer follows an out-link from i uniformly at random. Ends up on some page j linked from i. Process repeats indefinitely. Then p(t) is a probability distribution over pages, it’s a vector whose i^th coordinate is the probability that the surfer is at page i at time t.

To identify where the surfer is at time t+1, follow a link uniformly at random p(t+1) = M ⋅ p(t). Suppose the random walk reaches a state p(t+1) = M ⋅ p(t) = p(t). Then p(t) is stationary distribution of a random walk. Our original rank vector r satisfies r = M ⋅ r. So, r is a stationary distribution for the random walk.

As flow equations is r = M ⋅ r, r is an eigenvector of the stochastic web matrix M. But Starting from any vector u, the limit M(M(…(M(Mu))) is the long-term distribution of the surfers. With math we derive that limiting distribution is the principal eigenvector of M – PageRank. Note: If r is the limit of M(M(…(M(Mu)), then r satisfies the equation r = M ⋅ r, so r is an eigenvector of M with eigenvalue 1. Knowing that, we can now efficiently solve for r with the Power iteration method.

Power iteration is a simple iterative scheme:

Initialize: r⁽⁰⁾ = [1/N,….,1/N]^T
Iterate: r^(t+1) = M · r^(t)
Stop when |r^(t+1) – r^(t)|₁ < ε (Note that |x|₁ = Σ|x_i| is the L1 norm, but we can use any other vector norm, e.g., Euclidean). About 50 iterations is sufficient to estimate the limiting solution.

How to solve PageRank

Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks:

Assign each node an initial page rank.
Repeat until convergence (Σ |r^(t+1) – r^(t)|₁ < ε).

Calculate the page rank of each node (d_i …. out-degree of node i):

$r_j^{(t+1)} = \sum_{i\rightarrow j} \frac{r_i^{(t)}}{d_i}$

PageRank solution poses 3 questions: Does this converge? Does it converge to what we want? Are results reasonable?

PageRank has some problems:

Some pages are dead ends (have no out-links, right part of web “bowtie”): such pages cause importance to “leak out”.
Spider traps (all out-links are within the group, left part of web “bowtie”): eventually spider traps absorb all importance.

Solution for spider traps: at each time step, the random surfer has two options:

With probability β, follow a link at random.
With probability 1-β, jump to a random page.

Common values for β are in the range 0.8 to 0.9. Surfer will teleport out of spider trap within a few time steps.

Solution to Dead Ends: Teleports – follow random teleport links with total probability 1.0 from dead-ends (and adjust matrix accordingly)

Why are dead-ends and spider traps a problem and why do teleports solve the problem?

Spider-traps are not a problem, but with traps PageRank scores are not what we want.
- Solution: never get stuck in a spider trap by teleporting out of it in a finite number of steps.
Dead-ends are a problem: the matrix is not column stochastic so our initial assumptions are not met.
- Solution: make matrix column stochastic by always teleporting when there is nowhere else to go.

This leads to PageRank equation [Brin-Page, 98]:

$r_j = \sum_{i\rightarrow j} \beta \frac{r_i}{d_i}+(1- \beta) \frac{1}{N}$

This formulation assumes that M has no dead ends. We can either preprocess matrix M to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead-ends. (d_i … out-degree of node i).

The Google Matrix A becomes as:

$A = \beta M + (1 - \beta) \left [ \frac{1}{N} \right ]_{N\times N}$

[1/N]_NxN…N by N matrix where all entries are 1/N.

We have a recursive problem: r = A ⋅ r and the Power method still works.

Computing Pagerank

The key step is matrix-vector multiplication: r_new = A · r_old. Easy if we have enough main memory to hold each of them. With 1 bln pages, matrix A will have N² entries and 10¹⁸ is a large number.

But we can rearrange the PageRank equation

$r = \beta M\cdot r +\left [ \frac{1-\beta}{N} \right ]_N$

where [(1-β)/N]_N is a vector with all N entries (1-β)/N.

M is a sparse matrix (with no dead-ends):10 links per node, approx 10N entries. So in each iteration, we need to: compute r_new = β M · r_old and add a constant value (1-β)/N to each entry in r_new. Note if M contains dead-ends then ∑r_{j new} < 1 and we also have to renormalize r_new so that it sums to 1.

*Complete PageRank algorithm with input Graph G and parameter β*

Random Walk with Restarts

Idea: every node has some importance; importance gets evenly split among all edges and pushed to the neighbors. Given a set of QUERY_NODES, we simulate a random walk:

Make a step to a random neighbor and record the visit (visit count).
With probability ALPHA, restart the walk at one of the QUERY_NODES.
The nodes with the highest visit count have highest proximity to the QUERY_NODES.

The benefits of this approach is that it considers: multiple connections, multiple paths, direct and indirect connections, degree of the node.

PS: Other lecture notes you can find here: 1, 3, 4.

Machine Learning with Graphs: lecture notes, part 1/4

Recently, I finished the Stanford course CS224W Machine Learning with Graphs. In the following series of blog posts, I share my notes which I took watching lectures. I hope it gives you a quick sneak peek overview of how ML applied for graphs. The rest of the blog posts you can find here: 2, 3, 4.

NB: I don’t have proofs and derivations of theorems here, check out original sources if you are interested.

Lecture 1 – Intro to graph theory
Lecture 2 – Properties of networks. Random Graphs
Recitation 1 – Introduction to snap.py
Lecture 3 – Motifs and Structural Roles in Networks
Lecture 4 – Community Structure in Networks
Lecture 5 – Spectral Clustering

Lecture 1 – Intro to graph theory

Slides, Video

The lecture gives an overview of basic terms and definitions used in graph theory. Why should you learn graphs? The impact! The impact of graphs application is proven for Social networking, Drug design, AI reasoning.

My favourite example of applications is the Facebook social graph which shows that all users have only 4-degrees of separation:

*[Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]*

Another interesting example of application: predict side-effects of drugs (when patients take multiple drugs at the same time, what are the potential side effects? It represents a link prediction task).

What are the ways to analyze networks:

Node classification (Predict the type/color of a given node)
Link prediction (Predict whether two nodes are linked)
Identify densely linked clusters of nodes (Community detection)
Network similarity (Measure similarity of two nodes/networks)

Key definitions:

A network is a collection of objects (nodes) where some pairs of objects are connected by links (edges).
Types of networks: Directed vs. undirected, weighted vs. unweighted, connected vs. disconnected
Degree of node i is the number of edges adjacent to node i.
The maximum number of edges on N nodes (for undirected graph) is

$E_{max}=\frac{N(N-1)}{2}$

Complete graph is an undirected graph with the number of edges E = E_max, and its average degree is N-1.
Bipartite graph is a graph whose nodes can be divided into two disjoint sets U and V such that every link connects a node in U to one in V; that is, U and V are independent sets (restaurants-eaters for food delivery, riders-drivers for ride-hailing).
Ways for presenting a graph: visual, adjacency matrix, adjacency list.

Lecture 2 – Properties of networks. Random Graphs

Slides, Video

Metrics to measure network:

Degree distribution, P(k): Probability that a randomly chosen node has degree k.

N_k = # nodes with degree k.

Path length, h: a length of the sequence of nodes in which each node is linked to the next one (path can intersect itself).
Distance (shortest path, geodesic) between a pair of nodes is defined as the number of edges along the shortest path connecting the nodes. If the two nodes are not connected, the distance is usually defined as infinite (or zero).
Diameter: maximum (shortest path) distance between any pair of nodes in a graph.
Clustering coefficient, C_i: describes how connected are i’s neighbors to each other.

$C_{i}\in [0,1], C_{i}=\frac{2e_{i}}{k_{i}(k_{i}-1)}$

where e_i is the number of edges between neighbors of node i.

Average clustering coefficient:

$C=\frac{1}{N}\sum_{i}^{N}C_{i}$

Connected components, s: a subgraph in which each pair of nodes is connected with each other via a path (for real-life graphs, they usually calculate metrics for the largest connected components disregarding unconnected nodes).

Erdös-Renyi Random Graphs:

G_np: undirected graph on n nodes where each edge (u,v) appears i.i.d. with probability p.
Degree distribution of G_np is binomial.
Average path length: O(log n).
Average clustering coefficient: k / n.
Largest connected component: exists if k > 1.

Comparing metrics of random graph and real life graphs, real networks are not random. While G_np is wrong, it will turn out to be extremely useful!

Real graphs have high clustering but diameter is also high. How can we have both in a model? Answer in Small-World Model [Watts-Strogatz ‘98].

Small-World Model introduces randomness (“shortcuts”): first, add/remove edges to create shortcuts to join remote parts of the lattice; then for each edge, with probability p, move the other endpoint to a random node.

*Difference between regular random and small world networks*

Kronecker graphs: a recursive model of network structure obtained by growing a sequence of graphs by iterating the Kronecker product over the initiator matrix. It may be illustrated in the following way:

Random graphs are far away from real ones, but real and Kronecker networks are very close.

Recitation 1 – Introduction to snap.py

Slides

Stanford Network Analysis Platform (SNAP) is a general purpose, high-performance system for analysis and manipulation of large networks (http://snap.stanford.edu). SNAP software includes Snap.py for Python, SNAP C++ and SNAP datasets (over 70 network datasets can be found at https://snap.stanford.edu/data/).

Useful resources for plotting graph properties:

Gnuplot: http://www.gnuplot.info
Visualizing graphs with Graphviz: http://www.graphviz.org
Other options are in Matplotlib: http://www.matplotlib.org

I will not go into details here, better to get hands on right away with the original snappy tutorial at https://snap.stanford.edu/snappy/doc/tutorial/index-tut.html.

Lecture 3 – Motifs and Structural Roles in Networks

Slides, Video

Subnetworks, or subgraphs, are the building blocks of networks, they have the power to characterize and discriminate networks.

Network motifs are:

Recurring	found many times, i.e., with high frequency
Significant	Small induced subgraph: Induced subgraph of graph G is a graph, formed from a subset X of the vertices of graph G and all of the edges connecting pairs of vertices in subset X.
Patterns of interconnections	More frequent than expected. Subgraphs that occur in a real network much more often than in a random network have functional significance.

Motifs help us understand how networks work, predict operation and reaction of the network in a given situation.

Motifs can overlap each other in a network.

Examples of motifs:

Significance of motifs: Motifs are overrepresented in a network when compared to randomized networks. Significance is defined as:

$SP_{i}=Z_{i}/\sqrt{\sum_{j}^{}Z_{j}^{2}}\: \: and \, \, \; \! Z_{i}=(N_{i}^{real}-\bar{N_{i}^{rand}})/std(N_{i}^{rand})$

where N_i^real is # of subgraphs of type i in network G^real and N_i^real is # of subgraphs of type i in network G^rand.

Z_i captures statistical significance of motif i.
Generally, larger networks display higher Z-scores.
To calculate Z_i, one counts subgraphs i in G^real, then counts subgraphs i in G^rand with a configuration model which has the same #(nodes), #(edges) and #(degree distribution) as G^real.

Graphlets

Graphlets are connected non-isomorphic subgraphs (induced subgraphs of any frequency). Graphlets differ from network motifs, since they must be induced subgraphs, whereas motifs are partial subgraphs. An induced subgraph must contain all edges between its nodes that are present in the large network, while a partial subgraph may contain only some of these edges. Moreover, graphlets do not need to be over-represented in the data when compared with randomized networks, while motifs do.

Node-level subgraph metrics:

Graphlet degree vector counts #(graphlets) that a node touches.
Graphlet Degree Vector (GDV) counts #(graphlets) that a node touches.

Graphlet degree vector provides a measure of a node’s local network topology: comparing vectors of two nodes provides a highly constraining measure of local topological similarity between them.

Finding size-k motifs/graphlets requires solving two challenges:

Enumerating all size-k connected subgraphs
Counting # of occurrences of each subgraph type via graph isomorphisms test.

Just knowing if a certain subgraph exists in a graph is a hard computational problem!

Subgraph isomorphism is NP-complete.
Computation time grows exponentially as the size of the motif/graphlet increases.
Feasible motif size is usually small (3 to 8).

Algorithms for counting subgaphs:

Exact subgraph enumeration (ESU) [Wernicke 2006] (explained in the lecture, I will not cover it here).
Kavosh [Kashani et al. 2009].
Subgraph sampling [Kashtan et al. 2004].

Structural roles in networks

Role is a collection of nodes which have similar positions in a network

Roles are based on the similarity of ties between subsets of nodes.
In contrast to groups/communities, nodes with the same role need not be in direct, or even indirect interaction with each other.
Examples: Roles of species in ecosystems, roles of individuals in companies.
Structural equivalence: Nodes are structurally equivalent if they have the same relationships to all other nodes.

*Nodes 3 and 4 are structurally equivalent*

Examples when we need roles in networks:

Role query: identify individuals with similar behavior to a known target.
Role outliers: identify individuals with unusual behavior.
Role dynamics: identify unusual changes in behavior.

Structural role discovery method RoIX:

Input is adjacency matrix.
Turn network connectivity into structural features with Recursive feature Extraction [Henderson, et al. 2011a]. Main idea is to aggregate features of a node (degree, mean weight, # of edges in egonet, mean clustering coefficient of neighbors, etc) and use them to generate new recursive features (e.g., mean value of “unweighted degree” feature between all neighbors of a node). Recursive features show what kinds of nodes the node itself is connected to.
Cluster nodes based on extracted features. RolX uses non negative matrix factorization for clustering, MDL for model selection, and KL divergence to measure likelihood.

Lecture 4 – Community Structure in Networks

Slides, Video

The lecture shows methods for community detection problems (nodes clustering).

Background example from Granovetter work, 1960s: people often find out about new jobs through acquaintances rather than close friends. This is surprising: one would expect your friends to help you out more than casual acquaintances.

Why is it that acquaintances are most helpful?

Long-range edges (socially weak, spanning different parts of the network) allow you to gather information from different parts of the network and get a job.
Structurally embedded edges (socially strong) are heavily redundant in terms of information access.

*Visualisation of strong and weak relations in a network*

From this observation we find one type of structural role – Triadic Closure: if two people in a network have a friend in common, then there is an increased likelihood they will become friends themselves.

Triadic closure has high clustering coefficient. Reasons: if A and B have a friend C in common, then:

A is more likely to meet B (since they both spend time with C).
A and B trust each other (since they have a friend in common).
C has incentive to bring A and B together (since it is hard for C to maintain two disjoint relationships).

Granovetter’s theory suggests that networks are composed of tightly connected sets of nodes and leads to the following conceptual picture of networks:

Network communities

Network communities are sets of nodes with lots of internal connections and few external ones (to the rest of the network).

Modularity is a measure of how well a network is partitioned into communities: the fraction of the edges that fall within the given groups minus the expected fraction if edges were distributed at random.

Given a partitioning of the network into groups disjoint s:

$s\in S: \enspace Q\propto \sum_{s\in S}^{}[(number \; of \; edges \; within \; group \; s)-(expected \; number \; of \; edges \; within \; group \; s)]$

Modularity values take range [−1,1]. It is positive if the number of edges within groups exceeds the expected number. Values greater than 0.3-0.7 means significant community structure.

To define the expected number of edges, we need a Null model. After derivations (can be found in slides), modularity is defined as:

$Q(G, S)=\frac{1}{2m}\sum_{s \in S}^{}\sum_{i \in s}^{}\sum_{j \in s}^{}(A_{ij}-\frac{k_{i}k_{j}}{2m})$

A_ij represents the edge weight between nodes i and j;
k_i and kj are the sum of the weights of the edges attached to nodes i and j;
2m is the sum of all the edges weights in the graph.

Louvian algorithm

Louvian algorithm is a greedy algorithm for community detection with O(n log(n)) run time.

Supports weighted graphs.
Provides hierarchical communities.
Widely utilized to study large networks because it is fast, has rapid convergence and high modularity output (i.e., “better communities”).

Each pass of Louvain algorithm is made of 2 phases:

Phase 1 – Partitioning:

Modularity is optimized by allowing only local changes to node-communities memberships;
For each node i, compute the modularity delta (∆Q) when putting node i into the community of some neighbor j;
Move i to a community of node j that yields the largest gain in ∆Q.

Phase 2 – Restructuring:

identified communities are aggregated into super-nodes to build a new network;
communities obtained in the first phase are contracted into super-nodes, and the network is created accordingly: super-nodes are connected if there is at least one edge between the nodes of the corresponding communities;
the weight of the edge between the two supernodes is the sum of the weights from all edges between their corresponding communities;
goto Phase 1 to run it on the super-node network.

The passes are repeated iteratively until no increase of modularity is possible.

Detecting overlapping communities: BigCLAM

BigCLAM is a model-based community detection algorithm that can detect densely overlapping, hierarchically nested as well as non-overlapping communities in massive networks (more details and all proofs are in Yang, Leskovec “Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach”, 2013).

In BigCLAM communities arise due to shared community affiliations of nodes. It explicitly models the affiliation strength of each node to each community. Model assigns each node-community pair a nonnegative latent factor which represents the degree of membership of a node to the community. It then models the probability of an edge between a pair of nodes in the network as a function of the shared community affiliations.

Step 1: Define a generative model for graphs that is based on node community affiliations – Community Affiliation Graph Model (AGM). Model parameters are Nodes V, Communities C, Memberships M. Each community c has a single probability p_c.

Step 2: Given graph G, make the assumption that G was generated by AGM. Find the best AGM that could have generated G. This way we discover communities.

Lecture 5 – Spectral Clustering

Slides, Video

NB: these notes skip lots of mathematical proofs, all derivations can be found in slides themselves.

Three basic stages of spectral clustering:

Pre-processing: construct a matrix representation of the graph.
Decomposition: compute eigenvalues and eigenvectors of the matrix; map each point to a lower-dimensional representation based on one or more eigenvectors.
Grouping: assign points to two or more clusters, based on the new representation.

Lecture covers them one by one starting with graph partitioning.

Graph partitioning

Bi-partitioning task is a division of vertices into two disjoint groups A and B. Good partitioning will maximize the number of within-group connections and minimize the number of between-group connections. To achieve this, define edge cut: a set of edges with one endpoint in each group:

$Cut(A,B)=\sum_{i \in A, j \in B}^{}w_{ij}$

Graph cut criterion:

Minimum-cut: minimize weight of connections between groups. Problems: it only considers external cluster connections and does not consider internal cluster connectivity.
Conductance: connectivity between groups relative to the density of each group. It produces more balanced partitions, but computing the best conductance cut is NP-hard.

Conductance φ(A,B) = cut(A, B) / min(total weighted degree of the nodes in A or in B)

To efficiently find good partitioning, they use Spectral Graph Theory.

Spectral Graph Theory analyze the “spectrum” of matrix representing G: Eigenvectors x⁽ⁱ⁾ of a graph, ordered by the magnitude (strength) of their corresponding eigenvalues λ_i.

Λ = {λ₁,λ₂,…,λ_n}, λ₁≤ λ₂≤ …≤ λ_n (λ_i are sorted in ascending (not descending) order).

If A is adjacency matrix of undirected G (a_ij=1 if (i,j) is an edge, else 0), then Ax = λx.

What is the meaning of Ax? It is a sum of labels x_j of neighbors of i. It is better seen if represented in the form as below:

Examples of eigenvalues for different graphs:

For d-regular graph (where all nodes have degree d) if try x=(1,1,1,…1):

Ax = (d,d,..,d) = λx -> λ = d. This d is the largest eigenvalue of d-regular graph (proof is in slides).

An N x N matrix can have up to n eigenpairs.

Matrix Representation:

Adjacency matrix (A): A = [a_ij], a_ij = 1 if edge exists between node i and j (0-s on diagonal). It is a symmetric matrix, has n real eigenvalues; eigenvectors are real-valued and orthogonal.
Degree matrix (D): D=[d_ij],dijis a degree of node i.
Laplacian matrix (L): L = D-A. Trivial eigenpair is with x=(1,…,1) -> Lx = 0 and so λ = λ₁ = 0.

So far, we found λ and λ₁. From theorems, they derive

$\lambda _{2}=min \frac {\sum_{(i,j) \in E} (x_{i}-x_{j})^{2}}{\sum_{i}x_{i}^{2}}$

(all labelings of node i so that sum(x_i) = 0, meaning to assign values x_i to nodes i such that few edges cross 0 (x_i and x_j to subtract each other as shown on the picture below).

From theorems and profs, we go to the Cheeger inequality:

$\frac {\beta ^{2}}{2k_{max}}\leqslant \lambda _{2} \leqslant 2 \beta$

where k_max is the maximum node degree in the graph and β is conductance of the cut (A, B).

β = (# edges from A to B) / |A|. This is the approximation guarantee of the spectral clustering: spectral finds a cut that has at most twice the conductance as the optimal one of conductance β.

To summarize 3 steps of spectral clustering:

Pre-processing: build Laplacian matrix L of the graph.
Decomposition: find eigenvalues λ and eigenvectors x of the matrix L and map vertices to corresponding components of x₂.
Grouping: sort components of reduced 1-dimensional vector and identify clusters by splitting the sorted vector in two.

How to choose a splitting point?

Naïve approaches: split at 0 or median value.
More expensive approaches: attempt to minimize normalized cut in 1-dimension (sweep over ordering of nodes induced by the eigenvector).

K-way spectral clustering:

Recursive bi-partitioning [Hagen et al., ’92] is the method of recursively applying bi-partitioning algorithm in a hierarchical divisive manner. It has disadvantages: inefficient and unstable.

More preferred way:

Cluster multiple eigenvectors [Shi-Malik, ’00]: build a reduced space from multiple eigenvectors. Each node is now represented by k numbers; then we cluster (apply k-means) the nodes based on their k-dimensional representation.

To select k, use eigengap: the difference between two consecutive eigenvalues (the most stable clustering is generally given by the value k that maximizes eigengap).

*K-way spectral clustering: choose k=2 as it has the highest eigengap.*

Motif-based spectral clustering

Different motifs will give different modular structures. Spectral clustering for nodes generalizes for motifs as well.

Similarly to edge cuts, volume and conductance, for motifs:

vol_M(S) = #(motif endpoints in S), φ(S) = #(motifs cut) / vol_M(S).

Motif Spectral Clustering works as follows:

Input: Graph G and motif M.
Using G form a new weighted graph W^(M).
Apply spectral clustering on W^(M).
Output the clusters.

Resulting clusters will obtain near optimal motif conductance.

The same 3 steps as before:

Preprocessing: given motif M. Form weighted graph W^(M).

W_ij^(M)=# times edge (i,j) participates in the motif M.

Apply spectral clustering: Compute Fiedler vector x associated with λ₂ of the Laplacian of L^(M).

L^(M) = D^(M) – W^(M) where degree matrix D_ij^(M) = sum(W_ij^(M)). Compute L^(M)x = λ₂x. Use x to identify communities.

Sort nodes by their values in x: x₁, x₂, …x_n. Let S_r = {x₁, x₂, …x_n} and compute the motif conductance of each S_r.

At the end of the lecture, there are two examples of how motif-based clustering is used for food webs and gene regulatory networks.

There exist other methods for clustering:

METIS: heuristic but works really well in practice.
Graclus: based on kernel k-means.
Louvain: based on Modularity optimization.
Clique percolation method: for finding overlapping clusters.

PS: Other lecture notes you can find here: 2, 3, 4.