Following on from where we ended in part 1 of this series, we've now got a dedicated Neo4j graph database sitting on an Ubuntu instance and we are ready to start building graphs. This post describes how I used a fairly straightforward Python script, together with the Python embedded bindings, to populate the graph with nodes and relations.
I'm not going to spend a huge amount of time describing the Python, as this blog is aimed at developers with some experience in the language. Instructions on how to set up the Python embedded library for Neo4j can be found here. Make sure you also download the required Python dependency, JPype, something I somehow forgot to do the first time I tried this library! The relevant docs for the Python embedded library are here.
You might wonder why I haven't made use of any of the various other Python libraries that are available for Neo4j. I was keen to get as close to the database as possible when writing these initial scripts. This mirrors my approach to writing scripts for Postgres, where I typically just use Sublime Text and the Python psycopg2 library to effect transactions directly on the database. I run the scripts from the command line and then check the effects with psql, the command-line tool provided with Postgres. This works well for me, mainly because I typically write relatively short scripts (100-200 lines of code) rather than fully object-oriented applications.
Similarly, with Neo4j I wanted to follow the same low-level approach without abstracting myself too far away from the database. One complication that comes with this is that the database is single threaded for local connections. Thus, if you have the console open in your browser and then try to execute a Python script from the command line, you'll get a database locking error. From what I can make out (and feel free to correct me here), because the console allows both read and write operations on the graph, the thread is occupied for as long as the console is open.
Transactions are the key concept in the Python embedded bindings for Neo4j, and everything happens through them. In your script you need to import the library and define the database:
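The original snippet is not reproduced here, so the following is only a minimal sketch of that step. It assumes the `neo4j-embedded` package, which exposes `GraphDatabase`; the helper name `open_database` and the store path in the comment are hypothetical.

```python
def open_database(path):
    """Open (or create) the embedded Neo4j store found at `path`."""
    # Imported inside the function because neo4j-embedded needs JPype and a
    # JVM; a lazy import keeps any failure close to where the db is opened.
    from neo4j import GraphDatabase
    return GraphDatabase(path)


# Hypothetical usage (the path is a placeholder, not the real store location):
#   db = open_database('/path/to/graph.db')
#   with db.transaction:
#       ...  # writes to the graph go inside a transaction block
```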
This variable 'db' is now the reference that you use to effect transactions on that particular database.
Now this does open up some interesting routes for future experimentation. What if I had two graphs and I wanted to grab some nodes and/or relations from one and insert them into another?
Granted, correct domain modelling literature will suggest that you have one single graph for the two different concepts. But what if you are in a large enterprise and you don't know that the other team or department even has a Neo4j graph database? I could definitely imagine that in some of the larger organisations that I've worked at. Well, perhaps you could create db and db2 variables and read from one and write to the other. I'm definitely going to give this a shot at some stage, will keep you posted!
Anyhow, back to our hydraGraph script. The transaction creates some blank indices for the nodes. (For some reason, and I can't recall exactly why now, I created local variables for each of the index names as well. I've left them in for completeness.)
Following on from above, we need to test whether the index already exists in the db. This was useful because, when I was getting started, there were plenty of false starts where I ended up having to delete everything and start again, reloading the nodes and relations. Having this check made the deletion step easier, as the indices remained in place.
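As a rough sketch of this "create only if missing" step (not the original script), the helper below uses the `exists`, `get` and `create` calls on `db.node.indexes` from the Python embedded bindings. The index names and both function names are hypothetical.

```python
# Hypothetical index names, held in variables as described above.
COMPANY_INDEX = 'company'
PRIMARY_INDEX = 'primary'
SECONDARY_INDEX = 'secondary'


def get_or_create_node_index(db, name):
    """Return the node index called `name`, creating it only if it is missing."""
    with db.transaction:
        if db.node.indexes.exists(name):
            return db.node.indexes.get(name)
        return db.node.indexes.create(name)


def prepare_indices(db, names):
    """Map each index name to its (existing or newly created) index."""
    return dict((name, get_or_create_node_index(db, name)) for name in names)


# Hypothetical usage:
#   indices = prepare_indices(db, [COMPANY_INDEX, PRIMARY_INDEX, SECONDARY_INDEX])
```

Because the helper checks first, the script can be re-run after wiping the nodes without tripping over indices that are still there.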
Once the indices had been created for the nodes (I didn't bother with any indices for the relations yet, it's on my list of todos! Remember folks, I'm learning here as well) we can progress to creating the nodes themselves. The nice thing here is that, as the indices are already created, we can add the nodes to them at the same time! This saves us having to iterate through the nodes again, which I would imagine is useful if you are creating a large graph. So here is a snippet that shows the node coolgarifTech being created. It has several attributes including name, description, established and founders.
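The snippet itself did not survive, so here is a minimal sketch of the idea rather than the original code. It creates a node with `db.node(**properties)` and adds it to an index in the same transaction using the `index[key][value] = node` form from the embedded bindings. The helper name is hypothetical and the property values in the usage comment are placeholders, not the real data.

```python
def create_indexed_node(db, index, key, **properties):
    """Create a node and index it under properties[key] in one transaction."""
    value = properties[key]  # fail early if the indexed property is missing
    with db.transaction:
        node = db.node(**properties)
        index[key][value] = node
    return node


# Hypothetical usage; every value except the name is a placeholder:
#   coolgarifTech = create_indexed_node(
#       db, company_idx, 'name',
#       name='Coolgarif Tech',
#       description='...',
#       established='...',
#       founders='...')
```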
This pattern is then repeated for the rest of the nodes that I wanted to create for this domain.
Granted, this is not the world's biggest graph, and you wouldn't use this script for a graph with even a hundred nodes, not to mind a million. At this stage it's just about getting some data into the graph so we can get up and running. For the more exciting stuff you'll just have to wait until I blog about streaming nodes into Neo4j directly from a large Postgres database using various Python libraries and some magic incantations (James's current favourite coding phrase; personally I think he's been watching far too much Harry Potter recently!)
Next up in the script are the relationships in the graph. The hydraGraph is something we dreamt up as a way of showing some of the areas that Coolgarif Tech focuses on. I reckoned that the best way to model this was a single node to reflect the company itself, plus combinations of primary and secondary nodes that lead to ever more specialised concepts, with multiple relationships arising between primary and secondary nodes.
So the primary relationships were pretty straightforward: just connect the Coolgarif node to all the primary nodes in the graph.
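One way to sketch that fan-out, assuming the `node.relationships.create(type, other_node)` call from the embedded bindings. The helper name and the relationship type in the usage comment are hypothetical, not taken from the original script.

```python
def connect_to_all(db, source, targets, rel_type):
    """Create one `rel_type` relationship from `source` to each target node."""
    with db.transaction:
        return [source.relationships.create(rel_type, target)
                for target in targets]


# Hypothetical usage, with a made-up relationship type:
#   connect_to_all(db, coolgarifTech, primary_nodes, 'FOCUSES_ON')
```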
From there the script just needed to create the relationships between the primary and secondary nodes in our domain model. Here is an example of one:
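The original example is missing, so this is just a sketch of how several primary-to-secondary relationships could be declared as data and created in one transaction. It assumes the nodes are held in a plain dict keyed by name; the helper name, node names and relationship types are all hypothetical.

```python
def create_relationships(db, nodes, triples):
    """Create a relationship for each (from_name, rel_type, to_name) triple.

    `nodes` maps a name to the node object created earlier in the script.
    """
    created = []
    with db.transaction:
        for from_name, rel_type, to_name in triples:
            start, end = nodes[from_name], nodes[to_name]
            created.append(start.relationships.create(rel_type, end))
    return created


# Hypothetical usage; the same pair of nodes can appear in several triples:
#   create_relationships(db, nodes, [
#       ('primaryA', 'USES', 'secondaryX'),
#       ('primaryA', 'SUPPORTS', 'secondaryX'),
#   ])
```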
Once this is complete, all that is left is to add some final housekeeping code at the end of the script. Because we don't have the console running to see whether the script did its job, I found it easier to print to the command line the number of nodes and relations now existing in the db. I did mean to add something smarter, which would give the total number of nodes added by that particular execution of the script, but it is pretty trivial to add so I'll leave it for you to implement.
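A small sketch of that housekeeping, assuming `len(db.nodes)` and `len(db.relationships)` from the embedded bindings. It also shows one possible way to report how much a single run added, by taking a count before and after. Both function names are hypothetical.

```python
def graph_size(db):
    """Return (number of nodes, number of relationships) currently in the db."""
    return len(db.nodes), len(db.relationships)


def summarise(before, after):
    """Format the totals plus the difference made by this run of the script."""
    return ('nodes: %d (+%d), relationships: %d (+%d)'
            % (after[0], after[0] - before[0],
               after[1], after[1] - before[1]))


# Hypothetical usage:
#   before = graph_size(db)
#   ... create nodes and relationships ...
#   print(summarise(before, graph_size(db)))
```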
Finally, and very importantly in a single-threaded situation such as this, close the connection! If you don't, then when you try to fire up the console to create a lovely visualisation of the graph it's going to complain about database locking, as discussed above.
Use this statement to end your script:
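In the Python embedded bindings the call for this is `db.shutdown()`. The wrapper below is a hedged sketch (the name `run_and_close` is hypothetical) that makes sure the shutdown happens even if the build step raises, so a failed run does not leave the store locked.

```python
def run_and_close(db, build):
    """Run build(db), then always shut the database down, even on error."""
    try:
        return build(db)
    finally:
        db.shutdown()


# Hypothetical usage, where build_graph(db) creates the nodes and relations;
# in a plain script the last line would simply be db.shutdown():
#   run_and_close(db, build_graph)
```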
So if you now fire up the console (see how in part 1) and play with the data browser you should be able to see something like this (apologies for the colours!). Also the full code base for this graph build is available here.
OK, so now the database has a graph that we can start to play with. The next part of the series will explore running some Cypher queries on our hydraGraph using neo4jrestclient and exposing the results via an API using the lightweight Python Flask library.