
CS224W: Machine Learning with Graphs - 08 GNN Augmentation and Training


GNN Augmentation and Training

0. A General GNN Framework

Idea: raw input graph $\neq$ computational graph

1). Why Augment Graphs?

Our assumption so far has been: raw input graph = computational graph
Reasons for breaking this assumption

It is unlikely that the input graph happens to be the optimal computation graph for embeddings.

2). Graph Augmentation Approaches

1. Feature Augmentation on Graphs

To be updated

1). Message Computation

Message function: $m_u^l=\text{MSG}^l(h_u^{l-1})$
Intuition: each node will create a message, which will be sent to other nodes later
Example: a linear layer $m_u^l=W^lh_u^{l-1}$
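
A minimal sketch of the message step, assuming plain PyTorch and a shared linear layer playing the role of $W^l$ (the tensor shapes and layer sizes are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

# Each node u creates a message m_u^l = W^l h_u^{l-1} via a shared linear layer.
N, d_in, d_out = 5, 16, 32                      # toy sizes (assumed)
msg_layer = nn.Linear(d_in, d_out, bias=False)  # plays the role of W^l
h_prev = torch.randn(N, d_in)                   # h_u^{l-1} for all N nodes
messages = msg_layer(h_prev)                    # m_u^l, shape (N, d_out)
```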

2). Message Aggregation

Intuition: each node $v$ will aggregate the messages from its neighbors
$h_v^l=\text{AGG}^l(\{m_u^l, u\in N(v)\})$
Example: sum, mean, max aggregator
Issue: information from node $v$ itself could get lost (the computation of $h_v^l$ does not directly depend on $h_v^{l-1}$)
Solution: include $h_v^{l-1}$ when computing $h_v^l$
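
A minimal sketch of sum aggregation on a toy dense adjacency matrix, with $h_v^{l-1}$ added back so the node's own information is not lost (the graph and shapes are assumptions for illustration):

```python
import torch

N, d = 5, 32
A = torch.zeros(N, N)                 # toy adjacency matrix (assumed example graph)
A[0, 1] = A[1, 0] = A[1, 2] = A[2, 1] = 1.0

messages = torch.randn(N, d)          # m_u^l for all nodes (from the message step)
h_prev = torch.randn(N, d)            # h_v^{l-1}, each node's own previous embedding

aggregated = A @ messages             # sum of neighbor messages for each node
h_new = aggregated + h_prev           # include h_v^{l-1} so self-information is kept
```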

2. Classical GNN Layers

1). Graph Convolutional Networks (GCNs)

$h_v^l=\sigma\left(W^l\sum_{u\in N(v)}\frac{h_u^{l-1}}{|N(v)|}\right)=\sigma\left(\sum_{u\in N(v)}W^l\frac{h_u^{l-1}}{|N(v)|}\right)$
Message: each neighbor's message is $m_u^l=\frac{1}{|N(v)|}W^lh_u^{l-1}$ (normalized by node degree)
Aggregation: sum over messages from neighbors, then apply activation: $h_v^l=\sigma(\text{Sum}(\{m_u^l, u\in N(v)\}))$
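
A minimal GCN layer sketch in matrix form, assuming a dense adjacency matrix and ReLU as the activation; this is an illustration of the formula above, not a reference implementation:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """h_v^l = sigma( sum_{u in N(v)} W^l h_u^{l-1} / |N(v)| ), in matrix form."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)    # W^l

    def forward(self, A, H):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)  # |N(v)| for each node
        msgs = self.W(H)                               # W^l h_u^{l-1} for all u
        return torch.relu(A @ msgs / deg)              # degree-normalized sum + activation
```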

2). GraphSAGE

$h_v^l=\sigma(W^l\cdot\text{CONCAT}(h_v^{l-1}, \text{AGG}^l(\{h_u^{l-1}, u\in N(v)\})))$

a). GraphSAGE neighbor aggregation
b). $L_2$ normalization

Optional: apply $L_2$ normalization to $h_v^l$ at every layer
$h_v^l\leftarrow\frac{h_v^l}{\|h_v^l\|_2} \quad \forall v \in V$, where $\|u\|_2=\sqrt{\sum_iu_i^2}$ ($L_2$-norm)
Without $L_2$ normalization, the embedding vectors have different scales
In some cases, normalization of embeddings results in a performance improvement
After $L_2$ normalization, all vectors will have the same $L_2$-norm
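
A minimal GraphSAGE-style layer sketch combining the concatenation above with a mean aggregator and the optional $L_2$ normalization (again assuming a dense adjacency matrix; the mean aggregator is only one of several choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAGELayer(nn.Module):
    """h_v^l = sigma( W^l . CONCAT(h_v^{l-1}, AGG({h_u^{l-1}})) ), then optional L2 norm."""
    def __init__(self, d_in, d_out, l2_normalize=True):
        super().__init__()
        self.W = nn.Linear(2 * d_in, d_out)   # W^l, acting on the concatenation
        self.l2_normalize = l2_normalize

    def forward(self, A, H):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)
        agg = A @ H / deg                                   # mean aggregation over neighbors
        h = torch.relu(self.W(torch.cat([H, agg], dim=1)))  # concat self and aggregated
        if self.l2_normalize:
            h = F.normalize(h, p=2, dim=1)                  # every row gets unit L2-norm
        return h
```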

3). Graph Attention Networks (GATs)

a). Not all nodes’ neighbors are equally important

$h_v^l=\sigma(\sum_{u\in N(v)}\alpha_{vu}W^lh_u^{l-1})$
Goal: specify arbitrary importance to different neighbors of each node in the graph
Idea: compute the embedding $h_v^l$ of each node in the graph following an attention strategy

b). Attention mechanism

Let $\alpha_{vu}$ be computed as a byproduct of an attention mechanism $a$

Form of attention mechanism $a$: the approach is agnostic to the choice of $a$
Example: use a simple single-layer neural network ($a$ has trainable parameters in the Linear layer)
$e_{AB}=a(W^lh_A^{l-1}, W^lh_B^{l-1})=\text{Linear}(\text{Concat}(W^lh_A^{l-1}, W^lh_B^{l-1}))$
Normalize the scores $e_{vu}$ into the final attention weights $\alpha_{vu}$ with the softmax function, so that $\sum_{u\in N(v)}\alpha_{vu}=1$
Parameters of $a$ are trained together with the weight matrices (i.e., the parameters of $W^l$) in an end-to-end fashion.
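
A minimal single-head attention sketch under the same dense-adjacency assumption: it computes the scores $e_{vu}$ with a linear layer over the concatenation, masks non-neighbors, normalizes with softmax to get $\alpha_{vu}$, and takes the weighted sum (the original GAT additionally applies a LeakyReLU to the scores, omitted here for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)   # W^l
        self.attn = nn.Linear(2 * d_out, 1)           # the attention mechanism a

    def forward(self, A, H):
        Wh = self.W(H)                                # W^l h_u^{l-1}, shape (N, d_out)
        N = Wh.size(0)
        # e_vu = Linear(Concat(W^l h_v^{l-1}, W^l h_u^{l-1})) for every pair (v, u)
        pairs = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                           Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = self.attn(pairs).squeeze(-1)              # (N, N) attention scores
        e = e.masked_fill(A == 0, float('-inf'))      # only attend to neighbors
        alpha = F.softmax(e, dim=1)                   # alpha_vu, normalized over N(v)
        alpha = torch.nan_to_num(alpha)               # guard nodes with no neighbors
        return torch.relu(alpha @ Wh)                 # weighted sum, then activation
```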

c). Multi-head attention

To be updated

d). Benefits of attention mechanism

Key benefit: allow for (implicitly) specifying different importance values to different neighbors

e). GNN layer in practice

We can include modern deep learning modules that proved to be useful in many domains
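
A sketch of what such a layer might look like in practice, wrapping a message-passing layer (reusing the GCNLayer sketch above) with batch normalization and dropout; the ordering and dropout rate are common choices, not prescribed by the lecture:

```python
import torch.nn as nn

class GNNBlock(nn.Module):
    def __init__(self, d_in, d_out, p_drop=0.5):
        super().__init__()
        self.conv = GCNLayer(d_in, d_out)   # any message-passing layer would do
        self.bn = nn.BatchNorm1d(d_out)     # stabilizes and speeds up training
        self.drop = nn.Dropout(p_drop)      # regularization

    def forward(self, A, H):
        # GCNLayer already applies the activation, so only norm + dropout remain here
        return self.drop(self.bn(self.conv(A, H)))
```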

3. Stacking GNN Layers

0). How to Connect GNN Layers into a GNN?

1). The Over-smoothing Problem

Issue: all the node embeddings converge to the same value after stacking many GNN layers. This is bad because we want to use node embeddings to differentiate nodes

a). Receptive field of a GNN

Receptive field: the set of nodes that determine the embedding of a node of interest
In a $K$-layer GNN, each node has a receptive field of its $K$-hop neighborhood. The number of shared neighbors grows quickly when we increase the number of hops (number of GNN layers)

b). Receptive field & over-smoothing

Stack many GNN layers $\rightarrow$ Nodes will have highly overlapped receptive fields $\rightarrow$ Node embeddings will be highly similar $\rightarrow$ Suffer from the over-smoothing problem

c). Be cautious when stacking GNN layers

Unlike neural networks in other domains, adding more GNN layers does not always help

2). Expressive Power for Shallow GNNs

a). Increase the expressive power within each GNN layer
b). Add layers that do not pass messages

A GNN does not necessarily only contain GNN layers. We can add MLP layers before and after GNN layers as preprocessing layers and postprocessing layers.

In practice, adding these layers works well.
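
A sketch of this layout under the same assumptions as before (dense adjacency, the GCNLayer sketch for the message-passing layers; layer sizes are arbitrary):

```python
import torch.nn as nn

class GNNWithMLP(nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        # pre-processing layers: encode the raw node features
        self.pre = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        # a shallow stack of message-passing layers
        self.gnn1 = GCNLayer(d_hidden, d_hidden)
        self.gnn2 = GCNLayer(d_hidden, d_hidden)
        # post-processing layers: e.g., a prediction head on top of the embeddings
        self.post = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU(),
                                  nn.Linear(d_hidden, d_out))

    def forward(self, A, X):
        H = self.pre(X)
        H = self.gnn1(A, H)
        H = self.gnn2(A, H)
        return self.post(H)
```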

3). Add skip connections in GNNs

Observation from over-smoothing: node embeddings in earlier GNN layers can sometimes better differentiate nodes.
Solution: we can increase the impact of earlier layers on the final node embeddings by adding shortcuts in GNNs
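
A minimal residual-style sketch: the layer output is added to its input, so the final embedding mixes the new message-passing result with the earlier embedding (input and output dimensions are assumed equal so the addition is well defined; GCNLayer is the earlier sketch):

```python
import torch.nn as nn

class SkipGCN(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.conv = GCNLayer(d, d)

    def forward(self, A, H):
        return self.conv(A, H) + H   # F(x) + x: shortcut from the previous layer
```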

Source: https://blog.csdn.net/fxb163/article/details/122283645