HTML
An HTML document can be represented as a tree according to the Document Object Model (DOM) specification. HTML elements (also known as tags) can be considered containers.
According to the Facade-X model, SPARQL Anything uses:
- RDF properties for specifying tag attributes;
- container membership properties for specifying relations to child elements in the DOM tree. These may include text, which is expressed as RDF literals of type xsd:string;
- tag names for typing the containers. Specifically, the tag name is used to mint a URI that identifies the class of the corresponding containers.
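The mapping described above can be illustrated with a minimal Python sketch built on the standard library's `html.parser`. This is not SPARQL Anything's implementation: the class `FacadeXSketch`, the helpers `lit` and `facade_x_triples`, and the `_:bN` node labels are hypothetical names chosen for illustration. The sketch assumes every element is explicitly closed, and it omits `fx:root`, `whatwg:innerHTML`, and `whatwg:innerText`.

```python
from html.parser import HTMLParser

XHTML = "http://www.w3.org/1999/xhtml#"  # namespace used in the output on this page


def lit(text):
    """Render a plain string literal (hypothetical helper)."""
    return f'"{text}"'


class FacadeXSketch(HTMLParser):
    """Collects (subject, predicate, object) tuples shaped like Facade-X."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self.stack = []  # open containers as [node_id, children_seen]
        self.count = 0

    def _attach(self, child):
        # Link the child to the innermost open container through rdf:_n.
        if self.stack:
            parent = self.stack[-1]
            parent[1] += 1
            self.triples.append((parent[0], f"rdf:_{parent[1]}", child))

    def handle_starttag(self, tag, attrs):
        self.count += 1
        node = f"_:b{self.count}"
        self._attach(node)
        # The tag name mints the class URI that types the container.
        self.triples.append((node, "rdf:type", XHTML + tag))
        # Each attribute becomes an RDF property on the container.
        for name, value in attrs:
            self.triples.append((node, XHTML + name, lit(value or "")))
        self.stack.append([node, 0])

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Non-empty text becomes a literal member of the enclosing container.
        text = data.strip()
        if text:
            self._attach(lit(text))


def facade_x_triples(html):
    parser = FacadeXSketch()
    parser.feed(html)
    parser.close()
    return parser.triples


if __name__ == "__main__":
    page = '<html><body><p class="paragraph">Hello world</p></body></html>'
    for triple in facade_x_triples(page):
        print(triple)
```

Keeping a stack of open containers, each with its own child counter, is what lets the sketch number `rdf:_1`, `rdf:_2`, ... per parent rather than globally.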
Extensions
SPARQL Anything selects this transformer for the following file extensions:
- .html
Media types
SPARQL Anything selects this transformer for the following media types:
- text/html
Default implementation
Default Transformation
Data
<html>
<head>
<title>Hello world!</title>
</head>
<body>
<p class="paragraph">Hello world</p>
</body>
</html>
Query
CONSTRUCT
{
?s ?p ?o .
}
WHERE
{ SERVICE <x-sparql-anything:location=https://sparql-anything.cc/examples/simple.html>
{ ?s ?p ?o }
}
Facade-X RDF
@prefix fx: <http://sparql.xyz/facade-x/ns/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix whatwg: <https://html.spec.whatwg.org/#> .
@prefix xhtml: <http://www.w3.org/1999/xhtml#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix xyz: <http://sparql.xyz/facade-x/data/> .
[ rdf:type xhtml:html , fx:root ;
rdf:_1 [ rdf:type xhtml:head ;
rdf:_1 [ rdf:type xhtml:title ;
rdf:_1 "Hello world!" ;
whatwg:innerHTML "Hello world!" ;
whatwg:innerText "Hello world!"
] ;
whatwg:innerHTML "<title>Hello world!</title>" ;
whatwg:innerText "Hello world! Hello world!"
] ;
rdf:_2 [ rdf:type xhtml:body ;
rdf:_1 [ rdf:type xhtml:p ;
rdf:_1 "Hello world" ;
xhtml:class "paragraph" ;
whatwg:innerHTML "Hello world" ;
whatwg:innerText "Hello world"
] ;
whatwg:innerHTML "<p class=\"paragraph\">Hello world</p>" ;
whatwg:innerText "Hello world Hello world"
] ;
whatwg:innerHTML "<head> \n <title>Hello world!</title> \n</head> \n<body> \n <p class=\"paragraph\">Hello world</p> \n</body>" ;
whatwg:innerText "Hello world! Hello world Hello world! Hello world! Hello world Hello world"
] .
Options
Summary
Option name | Description | Valid Values | Default Value |
---|---|---|---|
html.selector | A CSS selector that restricts the HTML tags to consider for the triplification. | Any valid CSS selector. | No Value |
html.browser | Tells the triplifier to use the specified browser to navigate to the page and obtain the HTML. By default, no browser is used. Using a browser requires additional dependencies; see BROWSER. | chromium, webkit, or firefox | No Value |
html.browser.timeout | When using a browser to navigate, the maximum time (in milliseconds) to wait for the page's load event. If loading takes longer, the operation times out. | Any integer | 30000 |
html.browser.wait | When using a browser to navigate, the number of seconds the triplifier waits (after telling the browser to navigate to the page) before attempting to obtain the HTML. | Any integer | No Value |
html.browser.screenshot | When using a browser to navigate, takes a screenshot of the webpage (for example, for troubleshooting) and saves it at the given location. | A file URI, e.g. "file:///tmp/screenshot.png" | No Value |
html.metadata | It tells the triplifier to extract inline RDF from HTML pages. The triples extracted will be included in the default graph. (cf. issue 164) | true/false | false |
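
As the queries on this page show, options are passed as comma-separated `name=value` pairs inside the SERVICE IRI, after `x-sparql-anything:`. The following Python sketch assembles such an IRI and wraps it in a CONSTRUCT query. The function names `service_iri` and `construct_all` are hypothetical, the timeout value is only an example, and the sketch assumes that no option value contains `,` or `=` (such values would need escaping, which is not attempted here).

```python
def service_iri(location, options=None):
    """Build an x-sparql-anything: SERVICE IRI from a location and an options dict."""
    pairs = [("location", location)]
    pairs.extend((options or {}).items())
    # Assumption: values contain no ',' or '='; escaping is not handled.
    return "x-sparql-anything:" + ",".join(f"{k}={v}" for k, v in pairs)


def construct_all(location, options=None):
    """Wrap the IRI in a CONSTRUCT query that returns every triple."""
    iri = service_iri(location, options)
    return (
        "CONSTRUCT { ?s ?p ?o }\n"
        f"WHERE {{ SERVICE <{iri}> {{ ?s ?p ?o }} }}"
    )


if __name__ == "__main__":
    # Example values only: render the page with Chromium, 10 s load timeout.
    print(construct_all(
        "https://sparql-anything.cc/examples/simple.html",
        {"html.browser": "chromium", "html.browser.timeout": 10000},
    ))
```

The generated string can then be handed to whichever SPARQL Anything entry point is in use; how the query is executed is outside the scope of this sketch.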
html.selector
Description
A CSS selector that restricts the HTML tags to consider for the triplification.
Valid Values
Any valid CSS selector.
Default Value
No Value
Examples
Input
<html>
<head>
<title>Hello world!</title>
</head>
<body>
<p class="paragraph">Hello world</p>
</body>
</html>
Located at https://sparql-anything.cc/examples/simple.html
Use Case 1: Selecting text contained in elements of the class "paragraph"
Query
PREFIX whatwg: <https://html.spec.whatwg.org/#>
SELECT ?text
WHERE
{ SERVICE <x-sparql-anything:location=https://sparql-anything.cc/examples/simple.html,html.selector=.paragraph>
{ ?s whatwg:innerText ?text }
}
Result
-----------------
| text |
=================
| "Hello world" |
-----------------
html.browser
Description
Tells the triplifier to use the specified browser to navigate to the page and obtain the HTML. By default, no browser is used. Using a browser requires additional dependencies; see BROWSER.
Valid Values
chromium|webkit|firefox
Default Value
No Value
Examples
Please refer to justin2004's blog post on Scraping Webpages with SPARQL.
html.browser.timeout
Description
When using a browser to navigate, the maximum time (in milliseconds) to wait for the page's load event. If loading takes longer, the operation times out.
Valid Values
any integer
Default Value
30000
Examples
Please refer to justin2004's blog post on Scraping Webpages with SPARQL.
html.browser.wait
Description
When using a browser to navigate, the number of seconds the triplifier waits (after telling the browser to navigate to the page) before attempting to obtain the HTML.
Valid Values
any integer
Default Value
No Value
Examples
Please refer to justin2004's blog post on Scraping Webpages with SPARQL.
html.browser.screenshot
Description
When using a browser to navigate, takes a screenshot of the webpage (for example, for troubleshooting) and saves it at the given location.
Valid Values
a file URI e.g. "file:///tmp/screenshot.png"
Default Value
No Value
Examples
Please refer to justin2004's blog post on Scraping Webpages with SPARQL.
html.metadata
Description
It tells the triplifier to extract inline RDF from HTML pages. The triples extracted will be included in the default graph. (cf. issue 164)
Valid Values
true/false
Default Value
false
Examples
Input
<!DOCTYPE html>
<html>
<body>
<div itemscope itemtype="https://schema.org/Movie">
<h1 itemprop="name">Avatar</h1>
<span>Director: James Cameron (born August 16, 1954)</span>
</div>
</body>
</html>
Located at https://sparql-anything.cc/examples/Microdata1.html
Use Case 1: Extracting the triples embedded in the web page at https://sparql-anything.cc/examples/Microdata1.html
Query
CONSTRUCT
{
?s ?p ?o .
}
WHERE
{ SERVICE <x-sparql-anything:location=https://sparql-anything.cc/examples/Microdata1.html,html.metadata=true>
{ GRAPH ?g
{ ?s ?p ?o }
}
}
Result
@prefix fx: <http://sparql.xyz/facade-x/ns/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix whatwg: <https://html.spec.whatwg.org/#> .
@prefix xhtml: <http://www.w3.org/1999/xhtml#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix xyz: <http://sparql.xyz/facade-x/data/> .
[ rdf:type fx:root , xhtml:html ;
rdf:_1 [ rdf:type xhtml:head ] ;
rdf:_2 [ rdf:type xhtml:body ;
rdf:_1 [ rdf:type xhtml:div ;
rdf:_1 [ rdf:type xhtml:h1 ;
rdf:_1 "Avatar" ;
xhtml:itemprop "name" ;
whatwg:innerHTML "Avatar" ;
whatwg:innerText "Avatar"
] ;
rdf:_2 [ rdf:type xhtml:span ;
rdf:_1 "Director: James Cameron (born August 16, 1954)" ;
whatwg:innerHTML "Director: James Cameron (born August 16, 1954)" ;
whatwg:innerText "Director: James Cameron (born August 16, 1954)"
] ;
xhtml:itemscope "" ;
xhtml:itemtype "https://schema.org/Movie" ;
whatwg:innerHTML "<h1 itemprop=\"name\">Avatar</h1> <span>Director: James Cameron (born August 16, 1954)</span>" ;
whatwg:innerText "Avatar Director: James Cameron (born August 16, 1954) Avatar Director: James Cameron (born August 16, 1954)"
] ;
whatwg:innerHTML "<div itemscope itemtype=\"https://schema.org/Movie\"> \n <h1 itemprop=\"name\">Avatar</h1> <span>Director: James Cameron (born August 16, 1954)</span> \n</div>" ;
whatwg:innerText "Avatar Director: James Cameron (born August 16, 1954) Avatar Director: James Cameron (born August 16, 1954) Avatar Director: James Cameron (born August 16, 1954)"
] ;
whatwg:innerHTML "<head></head>\n<body> \n <div itemscope itemtype=\"https://schema.org/Movie\"> \n <h1 itemprop=\"name\">Avatar</h1> <span>Director: James Cameron (born August 16, 1954)</span> \n </div> \n</body>" ;
whatwg:innerText "Avatar Director: James Cameron (born August 16, 1954) Avatar Director: James Cameron (born August 16, 1954) Avatar Director: James Cameron (born August 16, 1954) Avatar Director: James Cameron (born August 16, 1954)"
] .
<https://sparql-anything.cc/examples/Microdata1.html>
<http://www.w3.org/1999/xhtml/microdata#item>
[ rdf:type <https://schema.org/Movie> ;
<https://schema.org/name> "Avatar"
] .