Load HTML

This is the APOC Extended documentation.

APOC Extended is not supported by Neo4j. For the officially supported APOC Core, go to the APOC Core page.

Scraping Data from HTML Pages

apoc.load.html('url',{name: jquery, name2: jquery}, config) YIELD value

Loads an HTML page and returns the result as a map.

This procedure provides a convenient API for querying a page with DOM, CSS, and jQuery-like selectors. It relies on the jsoup library.

CALL apoc.load.html(url, {name: <css/dom query>, name2: <css/dom query>}, {config}) YIELD value

The result is a map with one key per name in the query map; each value is the list of DOM elements matched by that selector, i.e.

{name: <list of elements>, name2: <list of elements>}
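Because each value in the result map is a list, it is often convenient to flatten it into one row per element. The query below is a minimal sketch of that pattern. It assumes each element is itself a map exposing `tagName` and `text` keys; the page above does not list the element keys, so check them against the output of your APOC version.

```cypher
// Sketch: one row per <h2> element found on the page.
CALL apoc.load.html("https://en.wikipedia.org/", {h2: "h2"}) YIELD value
// value.h2 is the list of elements matched by the "h2" selector.
UNWIND value.h2 AS heading
// tagName and text are assumed element keys, not confirmed by this page.
RETURN heading.tagName AS tag, heading.text AS text
```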

Config

The config parameter is optional; its default value is an empty map.

charset

Default: UTF-8

baseUri

Default: "" (empty string). Used to resolve relative paths.

htmlString

Default: false. Set to true to pass an HTML string instead of a URL as the first parameter.
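As an illustration of the `htmlString` option, the following sketch passes a literal HTML document as the first argument instead of a URL. The HTML snippet and the key names `heading` and `links` are made up for the example.

```cypher
// Sketch: parse an inline HTML string rather than fetching a URL.
CALL apoc.load.html(
  '<html><body><h1>Title</h1><a href="/docs">Docs</a></body></html>',
  {heading: "h1", links: "a"},   // hypothetical key names chosen for this example
  {htmlString: true}             // treat the first argument as HTML, not as a URL
) YIELD value
RETURN value.heading AS heading, value.links AS links
```

This form can be handy for trying out selectors without depending on a live page.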

Example with real data

The examples below use the Wikipedia home page.

CALL apoc.load.html("https://en.wikipedia.org/",{metadata:"meta", h2:"h2"})

You will get this result:

[Image: apoc.load.htmlall]

CALL apoc.load.html("https://en.wikipedia.org/",{links:"link"})

You will get this result:

[Image: apoc.load.htmllinks]

CALL apoc.load.html("https://en.wikipedia.org/",{metadata:"meta", h2:"h2"}, {charset: "UTF-8"})

You will get this result:

[Image: apoc.load.htmlconfig]
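The selectors in the query map are not limited to bare tag names. As a hedged sketch, the query below uses the CSS attribute selector `a[href]` to collect anchors that carry an `href` and returns one row per link. It assumes each element map has an `attributes` map and a `text` key, which this page does not state explicitly.

```cypher
// Sketch: list link targets and link text from the Wikipedia home page.
CALL apoc.load.html("https://en.wikipedia.org/", {anchors: "a[href]"}) YIELD value
UNWIND value.anchors AS anchor
// attributes and text are assumed element keys; verify against your APOC version.
RETURN anchor.attributes.href AS href, anchor.text AS text
LIMIT 10
```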