Published by
BluePi
JanusGraph with Cassandra
JanusGraph is a graph database. First, let’s look at what a graph database is.
Graph Database:
In computing, a graph database is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A key concept of the system is the graph (edge/relationship), which directly relates data items in the store. The relationships allow data in the store to be linked together directly, and in many cases retrieved with one operation.
Brief Description:
Graph databases are based on graph theory and employ nodes, edges, and properties.
- Nodes represent entities such as people, businesses, accounts, or any other item to be tracked. They are roughly similar to the record, relation, or row in a relational database, or the document in a document database.
- Edges, also termed graphs or relationships, are the lines that connect nodes to other nodes; they represent the relationship between them. Meaningful patterns emerge when examining the connections and interconnections of nodes, properties, and edges. Edges are the key concept in graph databases, representing an abstraction that is not directly implemented in other systems.
- Properties are germane information that relates to nodes. For example, if BluePi were one of the nodes, it might be tied to properties such as website, reference material, or words that start with the letter b, depending on which aspects of BluePi are germane to a given database.
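The three building blocks above can be sketched in a few lines of plain Java. This is only a minimal in-memory illustration, not JanusGraph's API: the class names (PropertyGraphSketch, Node, Edge), the connect helper, and the "worksAt" edge label are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model for illustration only; not JanusGraph's API.
class PropertyGraphSketch {

    // A node carries a label, key/value properties, and its outgoing edges.
    static class Node {
        final String label;
        final Map<String, Object> properties = new HashMap<>();
        final List<Edge> outgoing = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }
    }

    // An edge is a labeled, directed connection between two nodes.
    static class Edge {
        final String label;
        final Node from;
        final Node to;

        Edge(String label, Node from, Node to) {
            this.label = label;
            this.from = from;
            this.to = to;
        }
    }

    // Creates an edge and records it on the source node so it can be followed later.
    static Edge connect(Node from, String label, Node to) {
        Edge edge = new Edge(label, from, to);
        from.outgoing.add(edge);
        return edge;
    }

    public static void main(String[] args) {
        Node company = new Node("company");
        company.properties.put("name", "BluePi");

        Node person = new Node("person");
        person.properties.put("name", "John");

        connect(person, "worksAt", company);

        // Follow the stored relationships directly from the person node.
        for (Edge e : person.outgoing) {
            System.out.println(e.from.properties.get("name")
                    + " -" + e.label + "-> " + e.to.properties.get("name"));
        }
    }
}
```

Keeping the edge list on the node itself is what lets a traversal start at one node and walk outward without consulting any other structure.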
An Advantage of Graph Databases over Relational Databases:
The relational model gathers data together using information in the data. For example, one might look for all the “users” whose phone number contains the area code “311”. This would be done by searching selected datastores, or tables, looking in the selected phone number fields for the string “311”. This is a time-consuming process in large tables, so relational databases offer the concept of a database index, which allows data like this to be stored in a smaller sub-table, containing only the selected data and unique key of the record it is part of. If the phone numbers are indexed, the same search would occur in the smaller index table, gathering the keys of matching records, and then looking in the main data table for the records with those keys. Generally, the tables are physically stored so that lookups on these keys are fast.
In contrast, graph databases directly store the relationships between records. Instead of an email address being found by looking up its user’s key in the userpk column, the user record has a pointer directly to the email address record. That is, after selecting a user, the pointer can be followed directly to the email records; there is no need to search the email table to find the matching records. This can eliminate costly join operations. For example, if one searches for all of the email addresses for users in area code “311”, the engine would first perform a conventional search to find the users in “311”, but then retrieve the email addresses by following the links found in those records. A relational database would first find all the users in “311”, extract a list of their primary keys, perform another search for any records in the email table with those primary keys, and link the matching records together. For this kind of common operation, a graph database can be significantly faster.
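The difference between the two lookups can be sketched in plain Java. This is a simplified illustration under stated assumptions: the classes (UserRow, EmailRow, UserNode) and both methods are hypothetical stand-ins, and the area code is stored as its own field to keep the example short. A real relational engine would use an index instead of the table scan shown here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: hypothetical in-memory stand-ins, not a real database engine.
class LookupComparisonSketch {

    // Relational style: rows refer to each other only through key values.
    static class UserRow {
        final int userPk;
        final String areaCode;

        UserRow(int userPk, String areaCode) {
            this.userPk = userPk;
            this.areaCode = areaCode;
        }
    }

    static class EmailRow {
        final int userPk;
        final String address;

        EmailRow(int userPk, String address) {
            this.userPk = userPk;
            this.address = address;
        }
    }

    // Graph style: the user record holds direct references to its email records.
    static class UserNode {
        final String areaCode;
        final List<String> emails = new ArrayList<>();

        UserNode(String areaCode) {
            this.areaCode = areaCode;
        }
    }

    // Step 1: collect the keys of matching users. Step 2: search the email table for those keys.
    static List<String> emailsViaJoin(List<UserRow> users, List<EmailRow> emails, String areaCode) {
        Set<Integer> keys = new HashSet<>();
        for (UserRow user : users) {
            if (user.areaCode.equals(areaCode)) {
                keys.add(user.userPk);
            }
        }
        List<String> result = new ArrayList<>();
        for (EmailRow email : emails) {
            if (keys.contains(email.userPk)) {
                result.add(email.address);
            }
        }
        return result;
    }

    // Find the matching users, then follow their links; the email collection is never searched.
    static List<String> emailsViaLinks(List<UserNode> users, String areaCode) {
        List<String> result = new ArrayList<>();
        for (UserNode user : users) {
            if (user.areaCode.equals(areaCode)) {
                result.addAll(user.emails);
            }
        }
        return result;
    }
}
```

Both methods still need a conventional search to find the users in the area code; only the second step differs, which is where the join cost is avoided.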
List of graph databases:
The following is a list of major graph databases:
AllegroGraph, ArangoDB, Blazegraph, Cayley, Dgraph, DataStax Enterprise Graph, Sparksee, GraphBase, gStore, InfiniteGraph, JanusGraph, MarkLogic, Neo4j, OpenLink Virtuoso, Oracle Spatial and Graph, OrientDB, SAP HANA, Sqrrl Enterprise, Teradata Aster, Microsoft SQL Server 2017.
The Benefits of JanusGraph:
- Support for very large graphs. JanusGraph graphs scale with the number of machines in the cluster.
- Support for very many concurrent transactions and operational graph processing. JanusGraph’s transactional capacity scales with the number of machines in the cluster and answers complex traversal queries on huge graphs in milliseconds.
- Support for global graph analytics and batch graph processing through the Hadoop framework.
- Support for geo, numeric range, and full-text search for vertices and edges on very large graphs.
- Native support for the popular property graph data model exposed by TinkerPop.
- Native support for Gremlin, the graph traversal language.
- Effortless integration with Gremlin graph server for programming language connectivity.
- Numerous graph-level configurations provide knobs for tuning performance.
- Vertex-centric indices provide vertex-level querying to alleviate issues with the infamous supernode problem.
- Provides an optimized disk representation to allow for efficient use of storage and speed of access.
- Open source under the liberal Apache 2 license.
Benefits of JanusGraph with Cassandra:
- Continuously available with no single point of failure.
- No read/write bottlenecks to the graph as there is no master/slave architecture.
- Elastic scalability allows for introducing and removing machines.
- Caching layer makes sure that continuously accessed data is available in the memory.
- Increase the size of the cache by adding more machines to the cluster.
- Integration with Hadoop.
- Open source under the liberal Apache 2 license.
Using JanusGraph and Relational Database:
A relational database is based on the relational model of data. Virtually all relational databases use SQL (Structured Query Language) for querying and maintaining the database.
This model organizes the data into one or more tables of columns and rows, with a unique key identifying each row. Rows are also called records or tuples. Columns are also known as attributes. Generally, each table (relation) represents one ‘entity type’ (for example user or item). The rows represent instances of that type of entity (for example ‘John’ or ‘mobile’) and the columns represent values attributed to that instance (for example ‘address’ or ‘price’).
One can use both a relational database and a graph database in an application, depending on the project requirements. Suppose all user data is stored in a relational database with a unique key (for example userId) for every user, and the application must also record relations between users. Storing those relations in the relational database is difficult. In this case, both databases can be used together: creating vertices with a unique property (userId) and creating edges between these vertices (relations between users) solves the problem of storing relations between users.
In this way, one can use both relational and graph database in a single project/application.
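The division of labour described above can be sketched with two in-memory maps. This is only an illustration under stated assumptions: HybridStoreSketch and its methods are hypothetical, one map stands in for the relational user table, and the other stands in for the graph, where each userId plays the role of a vertex property.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: two maps stand in for the relational store and the graph store.
class HybridStoreSketch {

    // Stand-in for a relational table keyed by userId (userId -> name).
    private final Map<Integer, String> userTable = new HashMap<>();

    // Stand-in for the graph: each userId "vertex" maps to the userIds it has an edge to.
    private final Map<Integer, Set<Integer>> relations = new HashMap<>();

    // Stores the user's data in the table and creates the matching vertex.
    void insertUser(int userId, String name) {
        userTable.put(userId, name);
        relations.putIfAbsent(userId, new LinkedHashSet<>());
    }

    // Creates a directed edge between two existing users.
    void relate(int fromUserId, int toUserId) {
        if (!userTable.containsKey(fromUserId) || !userTable.containsKey(toUserId)) {
            throw new IllegalArgumentException("unknown userId");
        }
        relations.get(fromUserId).add(toUserId);
    }

    // Follows the edges in the graph, then fetches the details from the table by userId.
    List<String> relatedUserNames(int userId) {
        List<String> names = new ArrayList<>();
        Set<Integer> related = relations.getOrDefault(userId, Collections.<Integer>emptySet());
        for (int relatedId : related) {
            names.add(userTable.get(relatedId));
        }
        return names;
    }
}
```

The userId is the only value shared between the two stores, so the graph stays small and the relational database remains the source of the user details.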
Now let’s see how to configure JanusGraph with Cassandra
Prerequisites:
- Download and install Java 8: https://www.oracle.com/technetwork/java/javase/downloads/index.html
- JanusGraph’s shell scripts expect that the $JAVA_HOME environment variable points to the directory where JRE or JDK is installed.
- Download and install Cassandra: https://cassandra.apache.org/doc/latest/getting_started/installing.html
Getting started with JanusGraph
Step 1: Download JanusGraph from https://github.com/JanusGraph/janusgraph/releases
Step 2: Unzip the zip file that was downloaded.
Step 3: Configure JanusGraph to use Cassandra for data storage
- Open the conf/janusgraph-cassandra.properties file inside the unzipped directory.
- Set the below values and save the file.
storage.backend=cassandrathrift
storage.username=[cassandra username]
storage.password=[cassandra password]
storage.cassandra.keyspace=[keyspace name, default is janusgraph]
storage.hostname=[machine’s ip where cassandra is running]
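Before starting the console, it can help to confirm that the file parses and that the storage settings are filled in. The sketch below uses only java.util.Properties from the JDK. The StorageConfigCheck class, the choice of which keys to treat as required, and the sample values are assumptions made for illustration; JanusGraph performs its own validation when the graph is opened.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical pre-flight check; not part of JanusGraph.
class StorageConfigCheck {

    // Assumption: these two settings are treated as mandatory for this check.
    static final String[] REQUIRED_KEYS = {"storage.backend", "storage.hostname"};

    // Parses properties-file text into a Properties object.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns the required keys that are absent or blank.
    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Sample values are placeholders, not a recommended configuration.
        String sample = "storage.backend=cassandrathrift\n"
                + "storage.cassandra.keyspace=janusgraph\n"
                + "storage.hostname=127.0.0.1\n";
        System.out.println("Missing keys: " + missingKeys(parse(sample)));
    }
}
```

To check the real file, the same Properties.load call can read from a FileReader opened on conf/janusgraph-cassandra.properties.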
Step 4: Run the gremlin.sh script inside the bin folder. If everything goes right, the Gremlin console should appear.
Step 5: Load JanusGraph with the properties file saved earlier by running the command below in the console:
graph = JanusGraphFactory.open('conf/janusgraph-cassandra.properties');
The JanusGraph instance is now open and ready to use.
Creating a vertex:
mgmt = graph.openManagement();
person = mgmt.makeVertexLabel('person').make();
mgmt.commit();
// Create a labeled vertex
v = graph.addVertex(label, 'person');
// Create an unlabeled vertex
v = graph.addVertex();
graph.tx().commit();
Creating a labeled vertex with property:
person = graph.addVertex(label, 'person');
person.property('personId', 1);
graph.tx().commit();
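Creating an edge label:
// Open a new management transaction; a committed one cannot be reused
mgmt = graph.openManagement();
mgmt.makeEdgeLabel('edgeLable').make();
mgmt.commit();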
Creating an edge between 2 vertices:
// First create 2 vertices
user1 = graph.addVertex('person');
user2 = graph.addVertex('person');
// Add an edge from user1 to user2
user1.addEdge('edgeLable', user2);
// Commit so the new vertices and edge are persisted
graph.tx().commit();
To display all the vertices:
graph.traversal().V();
// To display personIds
graph.traversal().V().values('personId');
To display all the edges:
graph.traversal().E();
Now, let’s see how to load the graph and create vertices and edges from Java:
The Java method below illustrates how to load the graph and create vertices and edges.
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;
import org.janusgraph.core.JanusGraphTransaction;
import org.apache.tinkerpop.gremlin.structure.Edge;
import org.apache.tinkerpop.gremlin.structure.T;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class GraphFactory {

    public void createVertexAndEdge() {
        // First configure the graph
        JanusGraphFactory.Builder config = JanusGraphFactory.build();
        config.set("storage.backend", "cassandrathrift");
        config.set("storage.hostname", "13.126.71.131"); // IP address where Cassandra is installed
        config.set("storage.username", "cassandra");
        config.set("storage.password", "cassandra");
        config.set("storage.port", "9160");
        config.set("storage.cassandra.keyspace", "testing");

        // Get the instance of the graph
        JanusGraph graph = config.open();

        // Open a transaction
        JanusGraphTransaction tx = graph.newTransaction();

        // Create a vertex with a label
        Vertex v1 = tx.addVertex(T.label, "user");
        // Add a property to the vertex
        v1.property("userId", 1);

        Vertex v2 = tx.addVertex(T.label, "user");
        v2.property("userId", 2);

        // Create an edge between the 2 vertices
        Edge edge = v1.addEdge("edgeLable", v2);

        // Finally commit the transaction
        tx.commit();

        // Close the graph to release the connection to Cassandra
        graph.close();
    }
}
Below are the screenshots for creating vertices and edges on gremlin console