Neo4j: Graph queries and SVM link prediction

Utkarsh Jyani
May 1, 2023

Introduction

Graph databases have grown in popularity in recent years because they model complex relationships between entities efficiently and intuitively. Neo4j is one such graph database, with support for distributed deployments, and it is increasingly the go-to option for developers working with connected data. In this blog post, we take a closer look at a project that combines Neo4j’s strengths with SVMs (Support Vector Machines) to predict links. Link prediction, which estimates the likelihood that two nodes in a network will form a relationship, is a crucial task in graph-based data processing.

Understanding Neo4j’s distributed capabilities

Neo4j is a NoSQL graph database management system, available for free, for managing complex, connected data. Because it represents data as nodes, relationships, and properties rather than the tables of a typical relational database, it is more efficient and flexible for highly interconnected datasets.

Key features of Neo4j:

• ACID compliance, which guarantees the consistency and integrity of the data.

• Querying with high performance using the Cypher query language.

• Native graph processing and storage.

• High availability and scalability via data sharding and clustering.

Setup

The dataset used in this project is large, with over 150,000 nodes and 6.5 million edges. Loading it with Cypher’s LOAD CSV clause proved time-consuming. After some experimentation, we found that Neo4j’s neo4j-admin import terminal command could build the network in less than a minute.

The import command loaded a network of 168,114 nodes and 6,797,557 edges in approximately 20 seconds while preserving all node properties. By contrast, loading the edges alone with LOAD CSV took more than 40 minutes.

Note that the nodes_header.csv and edges_header.csv files passed to the command specify the data type of each column in addition to the column names. This type specification is essential for an accurate import, because values are otherwise treated as strings by default.
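As a rough illustration of what such header files might contain, the sketch below writes two one-line header files in the typed-column style used by the Neo4j import tool (`:ID`, `:int`, `:START_ID`, `:END_ID`). The column names are assumptions inferred from the properties queried later in this post (numeric_id and views); the real dataset may have more columns.

```python
import csv
import tempfile
from pathlib import Path

# Assumed columns: only numeric_id and views appear in this post's queries.
NODE_HEADER = ["numeric_id:ID", "views:int"]
EDGE_HEADER = [":START_ID", ":END_ID"]


def write_header(path: Path, columns: list[str]) -> None:
    """Write a one-line CSV header file."""
    with path.open("w", newline="") as handle:
        csv.writer(handle).writerow(columns)


def write_import_headers(directory: Path) -> tuple[Path, Path]:
    """Create nodes_header.csv and edges_header.csv inside `directory`."""
    nodes = directory / "nodes_header.csv"
    edges = directory / "edges_header.csv"
    write_header(nodes, NODE_HEADER)
    write_header(edges, EDGE_HEADER)
    return nodes, edges


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        nodes, edges = write_import_headers(Path(tmp))
        print(nodes.read_text().strip())
        print(edges.read_text().strip())
```

Each header file is passed to the import command ahead of its data file. The exact command name and options differ between Neo4j versions, so check the documentation for the version in use.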

SVM link prediction

Support Vector Machines (SVMs) are a popular supervised machine learning approach for classification and regression. In link prediction, an SVM classifies candidate node pairs as linked or not linked based on a set of features. The fundamental aim of an SVM is to find the hyperplane that separates the data points into classes with the largest possible margin.
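For reference, the margin idea can be written as the standard soft-margin optimisation problem. The notation is the usual textbook one and is not taken from the project: x_i is the feature vector of a node pair, y_i ∈ {−1, +1} says whether the pair is linked, w and b define the hyperplane, ξ_i are slack variables, and C is the regularisation parameter.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n
```

The margin between the two classes has width 2/‖w‖, so minimising ‖w‖ maximises the margin, while C controls how heavily points that violate the margin are penalised.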

Data pre-processing and feature engineering

Before we can begin, we must select a suitable dataset with a wide variety of nodes and relationships. Social networks, citation networks, and collaboration networks are a few examples of such datasets. We must also extract features from the dataset so that we can train the SVM model.

The following are some typical features for link prediction:

· Preferential attachment score

· Adamic/Adar index

· Common neighbours

· Jaccard coefficient

· Resource allocation index
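To make these scores concrete, here is a minimal sketch that computes all five for one node pair from a plain adjacency dictionary. The function name `link_features` and the toy graph are hypothetical; the graph is assumed to be undirected, with `adj` mapping each node to the set of its neighbours.

```python
import math


def link_features(adj: dict, u, v) -> dict:
    """Neighbourhood-based link prediction features for the pair (u, v)."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    common = nu & nv
    union = nu | nv
    return {
        "common_neighbours": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "preferential_attachment": len(nu) * len(nv),
        # Skip degree-1 nodes to avoid dividing by log(1) = 0; this can only
        # happen in the degenerate case u == v.
        "adamic_adar": sum(
            1.0 / math.log(len(adj[z])) for z in common if len(adj[z]) > 1
        ),
        "resource_allocation": sum(1.0 / len(adj[z]) for z in common),
    }


# Toy undirected graph (hypothetical data, not the dataset from this post).
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

for name, value in link_features(toy, "b", "d").items():
    print(f"{name}: {value:.4f}")
```

Each candidate pair becomes one row of the feature matrix, with a label saying whether the edge actually exists.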

Training the SVM model

Once the features have been specified, we split the dataset into training and testing subsets. The SVM model is trained on the training data, and hyperparameters, including the kernel function, the regularisation parameter, and kernel-specific parameters, are tuned using grid search and cross-validation.
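A minimal sketch of this training step with scikit-learn might look as follows. The feature matrix here is synthetic stand-in data, and the grid values are illustrative assumptions, not the settings used in the project.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 5 features per node pair, label 1 = link, 0 = no link.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale the features, then fit an SVM; probability=True enables predict_proba.
pipeline = make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))

# Kernel, regularisation parameter C, and the kernel-specific gamma
# (gamma is ignored by the linear kernel).
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.1],
}

# 5-fold cross-validated grid search, selecting by ROC AUC.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("held-out accuracy:", search.best_estimator_.score(X_test, y_test))
```

Putting the scaler inside the pipeline means it is refitted on each cross-validation fold, so no information from the validation folds leaks into training.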

Evaluating the performance of the SVM model

Once the SVM model is trained, we will evaluate its performance on the test dataset.

Common evaluation metrics for link prediction include:

• Precision

• Recall

• F1-score

• Area Under the Receiver Operating Characteristic (ROC) curve (AUC-ROC)
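These metrics can be computed from first principles, as in the sketch below. The labels and scores are made-up values for illustration, and the 0.5 decision threshold is an assumption.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = link, 0 = no link)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: true labels and predicted link probabilities.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(precision_recall_f1(y_true, y_pred))
print(roc_auc(y_true, scores))
```

Precision, recall, and F1 depend on the chosen threshold, whereas AUC-ROC depends only on how the scores rank the pairs.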

Integrating Neo4j and SVM for link prediction

Now that the SVM model has been trained and validated, we can use it to predict links in our Neo4j database. Using the official Neo4j Python driver, we write a Python script that connects to Neo4j, retrieves the relevant features for a pair of nodes, and estimates the likelihood of a relationship using our SVM model.
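Such a script could be sketched roughly as below. This is not the project’s actual code: the connection URI, credentials, model file name, and node ids are placeholders, only two features are retrieved for brevity, and the model is assumed to have been trained on exactly those two features in the same order.

```python
FEATURE_QUERY = """
MATCH (a {numeric_id: $a}), (b {numeric_id: $b})
OPTIONAL MATCH (a)--(c)--(b)
WITH a, b, count(DISTINCT c) AS common_neighbours
RETURN common_neighbours,
       size([(a)--() | 1]) AS degree_a,
       size([(b)--() | 1]) AS degree_b
"""


def fetch_pair_features(session, a, b):
    """Return [common neighbours, preferential attachment] for nodes a and b."""
    record = session.run(FEATURE_QUERY, a=a, b=b).single()
    if record is None:
        raise LookupError(f"node pair ({a}, {b}) not found")
    return [record["common_neighbours"], record["degree_a"] * record["degree_b"]]


def link_probability(model, session, a, b):
    """Link probability from a fitted classifier that offers predict_proba."""
    features = fetch_pair_features(session, a, b)
    return float(model.predict_proba([features])[0][1])


def main():
    # Imported here so the helpers above can be exercised without a database.
    import joblib
    from neo4j import GraphDatabase

    # Placeholders: adjust the URI, credentials, model path and node ids.
    model = joblib.load("svm_link_model.joblib")
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    with driver.session() as session:
        print(link_probability(model, session, 1, 2))
    driver.close()
```

Calling `main()` requires a running Neo4j instance and a saved model. Whether numeric_id should be passed as an integer or a string depends on how the column was typed at import time.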

Cypher Queries

Neo4j’s query language, Cypher, is widely used in data science applications because it handles intricate data connections well and can extract insights from large datasets. It is designed to be readable, so even complex queries remain comprehensible to others. The simplest Cypher query is:

match (n) return n

This query returns every node in the network. Note, however, that the Neo4j Browser caps the number of nodes it displays for a single query (the cap is configurable), so for a large graph not all nodes will be visible on screen. The top 10 nodes (based on the number of connections) in this dataset may be retrieved by:

match (s)-[]->(t) return s.numeric_id, count(t) as connections order by connections desc limit 10

To rank nodes by their number of views instead, the Cypher command would be

match (n) return n.numeric_id, n.views as gamers order by n.views desc limit 10

Clustering groups nodes with similar characteristics according to specific criteria. Neo4j’s Graph Data Science (GDS) plugin provides a range of clustering techniques under the name “Community Detection.” In our analysis of the dataset, we used GDS’s Louvain community detection algorithm, which produced 19 distinct clusters.

GDS algorithms run on an in-memory projection of the graph, so the network must first be projected. This can be achieved by executing the following command:

CALL gds.graph.project.cypher(
  'twitch',
  'MATCH (n)
   RETURN
     id(n) AS id,
     n.views AS views',
  'MATCH (n)-[]->(m) RETURN id(n) AS source, id(m) AS target'
)
YIELD
  graphName, nodeCount AS nodes, relationshipCount AS rels
RETURN graphName, nodes, rels

This stores an in-memory graph projection named “twitch”, including the supplied node property (views), for the lifetime of the current database session.

Louvain clustering can be triggered by

call gds.louvain.write('twitch', {writeProperty: 'louvain'})

Running this command applies the Louvain algorithm and writes the result to a node property named “louvain”, which holds the community ID of each node. To display the clusters individually, the following code converts the node property into a node label.

match (n)
call apoc.create.addLabels([id(n)], [toString(n.louvain)])
yield node
with node remove node.louvain return node

