Streaming Endpoint
Although Databridge is primarily designed for bulk data import, which requires Neo4j to be offline, we recently added the capability to import data into a running Neo4j instance.
This was prompted by a request from a user who pointed out that people often want to do a fast bulk-load of an initial large dataset with the database offline, and then apply small incremental updates to that data with the database running. This seemed like a great idea, so we added the streaming endpoint to support exactly that workflow.
The streaming endpoint uses Neo4j’s Bolt binary protocol, and the good news is that you don’t need to change any of your existing import configuration to use it. Simply pass the -s option to the import command, and it will automatically use the streaming endpoint.

Example: use the -s option to import the hawkeye dataset into a running instance of Neo4j.

bin/databridge import -s hawkeye
The streaming endpoint connects to Neo4j using the following defaults:
neo4j.url=bolt://localhost
neo4j.username=neo4j
neo4j.password=password
You can override these defaults by creating a file custom.properties in the Databridge config folder and setting the values as appropriate for your particular Neo4j installation.

Please note that despite using the Bolt protocol, the streaming endpoint will take quite a bit longer to run than the offline endpoint for large datasets, so it isn’t really intended to replace bulk import. For small incremental updates, however, this should not be a problem.
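As a concrete illustration, a custom.properties pointing the streaming endpoint at a different Neo4j instance might look something like this (the host and password below are purely illustrative, not real defaults):

# custom.properties, placed in the Databridge config folder; example values only
neo4j.url=bolt://neo4j.example.com:7687
neo4j.username=neo4j
neo4j.password=mySecretPassword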
Updates from the streaming endpoint are batched, with the transaction commit size currently set to 1000, and the plan is to make the commit size user-configurable in the near future.
Specifying the Output Database Folder
By default, Neo4j-Databridge creates a new graph.db database in the same folder as the import task. We’ve now added the ability for you to define the output path to the database explicitly. To do this, use the -o option to specify the output folder path to the import command.

Example: use the -o option to import the hawkeye dataset into a user-specified database.

bin/databridge import -o /databases/common hawkeye
In the example above, the hawkeye dataset will be imported into /databases/common/graph.db, instead of the default location hawkeye/graph.db.

Among other things, this new feature allows you to import different datasets into the same physical database:
Example: use the -o option to allow the hawkeye and epsilon datasets to co-exist in the same Neo4j database.

bin/databridge import -o /databases/common hawkeye
bin/databridge import -o /databases/common epsilon
Simpler Commands
The eagle-eyed among you will have spotted that the above examples use the import command, while in our first blog post our examples all used the run command, which was invoked with a variety of different option flags. The original run command still exists, but we’ve added some additional commands to make life a bit simpler.

All the new commands also now support a -l option to limit the number of rows imported. This can be very useful when testing a new import task, for example. The new commands are:

import: runs the specified import task

usage: import [-cdsq] [-o target] [-l limit]
c: allow multiple copies of this import to co-exist in the target database
d: delete any existing dataset prior to running this import
s: stream data into a running instance of Neo4j
q: run the import task in the background, logging output to import.log instead of the console
o target: use the specified target database for this import
l limit: the maximum number of rows to process from each resource during the import

test: performs a dry run of the specified import task, but does not create a database

usage: test [-l limit]

l limit: the maximum number of rows to process from each resource during the dry run
profile: profiles the resources for an import task. Databridge uses a profiler at the initial phase of every import. The profiler examines the various data resources that will be loaded during the import and generates tuning information for the actual import phase.

usage: profile [-l limit]

l limit: the maximum number of rows to profile from each resource

The profiler displays the statistics that will be used to tune the import. For nodes, these statistics include the average key length (akl) of the unique identifiers for each node type, as well as an upper bound (max) on the number of nodes of each type. For relationships, the statistics include an upper bound on the number of edges of each type. (The max values are upper bounds because the profiler doesn’t attempt to detect possible duplicates.)

Profile statistics are displayed in JSON format:

{
  nodes: [
    { 'Orbit': {'max':11, 'akl':10.545455} }
    { 'Satellite': {'max':11, 'akl':8.909091} }
    { 'SpaceProgram': {'max':11, 'akl':9.818182} }
    { 'Location': {'max':11, 'akl':4.818182} }
  ],
  edges: [
    { 'LOCATION': {'max':11} }
    { 'ORBIT': {'max':11} }
    { 'LAUNCHED': {'max':11} }
    { 'LIVE': {'max':11} }
  ]
}
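For instance, before committing to a full import you might profile a dataset and then perform a limited dry run. Something along the following lines should work, assuming the dataset name is passed as the final argument exactly as it is for the import command (the limit of 100 rows is arbitrary):

bin/databridge profile -l 100 hawkeye
bin/databridge test -l 100 hawkeye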
Deleting and Copying Individual Datasets
In order to support the new streaming endpoint as well as the ability to host multiple import datasets in the same database, Databridge only creates a brand new database the first time you run an import task.
If you run the same import task multiple times with the same datasets, Databridge will not create any new nodes or relationships in the graph during the second and subsequent imports.
If you want to force Databridge to clear down any previous data and re-import it, you can use the -d option, which will delete the existing dataset first.

Example: use the -d option to delete an existing dataset prior to re-importing it.

bin/databridge import hawkeye
bin/databridge import -d hawkeye
On the other hand, if you want to create a copy of an existing dataset, you can use the -c option instead.

Example: use the -c option to create a copy of a previously imported dataset.

bin/databridge import hawkeye
bin/databridge import -c hawkeye
Deleting All the Things
If you need to delete everything in the graph database and start again with a completely clean slate, you can use the purge command:

bin/databridge purge hawkeye
Note that if you have imported multiple datasets into the same physical database, you should purge each of them individually, specifying the database path each time:

bin/databridge purge -o /databases/common hawkeye
bin/databridge purge -o /databases/common epsilon
Conclusion
Well, that about wraps up this quick survey of what’s new in Databridge from GraphAware. If you’re interested in finding out more, please take a look at the project WIKI, and in particular the Tutorials section.
If you believe Databridge would be useful for your project or organisation and are interested in trying it out, please contact me directly at vince@graphaware.com or drop an email to databridge@graphaware.com and one of the GraphAware team members will get in touch.
GraphAware is a Gold sponsor of GraphConnect Europe. Use discount code GRAPHAWARE30 to get 30% off your tickets and trainings.