Validation with SHACL
Shapes Constraint Language (SHACL) is one of two validation languages for RDF, the other being Shape Expressions (ShEx). I prefer SHACL for its comprehensive reporting features. SHACL can also be used to generate data collection fields on forms, so that data is validated at the time of collection. It can also replace much of the modeling provided by OWL, an approach recently adopted by TopQuadrant.
If you want to learn more about SHACL, the essential book “Validating RDF Data” is available free online or for hardcopy purchase.
When using SHACL, graph data is compared and validated against a graph pattern (shape). A shape is a set of constraints against which the data is evaluated. Shapes can be used to evaluate nodes (Subjects and Objects) as well as relations (Predicates). Graph data can be validated before or after upload to a database. I find it most efficient to validate before upload, so that errors are detected and corrected before the data enters the database. However, if your constraints depend on inference from a reasoner, those shapes must be validated after upload to a database that supports reasoning.
Preparation
I use Apache Jena SHACL to validate TTL files prior to upload to the database. The latest Jena release is available online. If your JRE is not up to date and you lack the administrator privileges needed to update Java on your machine, you may need to download an earlier Jena version. Setup of the command line tools is described online. A key requirement is that the folder containing the shacl.bat file is in your system PATH variable. You can use commands similar to these on an MS Windows system with Jena installed at C:\Jena:
- SET JENA_HOME=C:\Jena
- SET PATH=%PATH%;%JENA_HOME%\bat
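Before running a validation, it can help to confirm that the launcher is actually discoverable. The sketch below is an illustration in Python, not part of the Jena setup. The helper name find_shacl is hypothetical, and it assumes the launcher is invoked as shacl (on Windows, shutil.which also tries the extensions listed in PATHEXT, which normally include .BAT).

```python
import shutil


def find_shacl(command="shacl"):
    """Return the full path of the SHACL launcher, or None if it is not on PATH.

    The default command name "shacl" is an assumption based on the Jena
    command line tools described above.
    """
    # shutil.which searches the directories listed in the PATH variable.
    return shutil.which(command)


location = find_shacl()
if location is None:
    print("shacl not found: check JENA_HOME and PATH")
else:
    print("shacl found at", location)
```

If the first message is printed, the PATH setting from the SET commands above has probably not taken effect in the current session.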
Example data
Data was created to test SHACL constraints for file header information. In the example data below, the FILE has nonsense characters in the title, description, and note fields, with the exception of the words “file header” in the title. The data is used to evaluate a constraint that looks for a minimum of two terms extracted from each file header. The constraint should identify this file, which has only one term, using an information message stating that there appears to be inadequate information in the file header.
:FILE_93ded90b
    dcterms:description "asdf asdfgasf sdfas asdfasdf safasdf"^^xsd:string ;
    dcterms:title "asdfa adfasdf asdfasd file header"^^xsd:string ;
    a :FileTypeR ;
    :fileName "SHACLTest-oneterm.R"^^xsd:string ;
    :filePath "C:\\_github\\NovasTaylor\\FileCat\\scripts\\r\\shaclTest\\SHACLTest-oneterm - Copy.R"^^xsd:string ;
    :hasTerm docdict:file_header ;
    :note "asdf asdfas asdfasdf f ada"^^xsd:string ;
    :status filecat:CompleteStatus .
Creating the shape
The shape below defines restrictions on the File nodes, requiring they have a minimum number of assigned terms and a status value that matches an approved list. The shape should generate:
- An Information message when a file has fewer than two terms
- A Warning message when a file has no terms
- An Error message when file status (STAT:) does not match the approved list of values
Validation investigates FILE nodes, so a NodeShape will be defined for the R, RMD, and SPARQL file types. In the example TTL data above, we see the subject file node :FILE_93ded90b is of type :FileTypeR.
The SHACL file, saved as TTL, begins by specifying the sh:targetClass values for files of type R, RMD, and SPARQL:
@prefix :   <https://www.example.org/tw/filecat#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

:SPARQLFileShape a sh:NodeShape ;
    sh:targetClass :FileTypeR, :FileTypeRMD, :FileTypeSPARQL ;
An Information message should be generated for a file node that has only one :hasTerm predicate relation. The constraint is added to the shape by stating there should be a minimum of two values (sh:minCount 2) along the property path (sh:path) :hasTerm.
@prefix :   <https://www.example.org/tw/filecat#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

:SPARQLFileShape a sh:NodeShape ;
    sh:targetClass :FileTypeR, :FileTypeRMD, :FileTypeSPARQL ;
    # Terms: Info if file has fewer than two terms.
    sh:property [
        sh:path :hasTerm ;
        sh:name "lessThanTerm" ;
        sh:description "Files with less than two terms extracted from the file header should be reviewed. Header text and dictionary may require updates" ;
        sh:message "File has less than 2 associated terms." ;
        sh:severity sh:Info ;
        sh:minCount 2
    ] .
Additional constraints are then added to generate a Warning when there is no :hasTerm and a Violation when the :status value is not in the approved list of values:
@prefix :   <https://www.example.org/tw/filecat#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

:SPARQLFileShape a sh:NodeShape ;
    sh:targetClass :FileTypeR, :FileTypeRMD, :FileTypeSPARQL ;
    # Terms: Info if file has fewer than two terms.
    sh:property [
        sh:path :hasTerm ;
        sh:name "lessThanTerm" ;
        sh:description "Files with less than two terms extracted from the file header should be reviewed. Header text and dictionary may require updates" ;
        sh:message "File has less than 2 associated terms." ;
        sh:severity sh:Info ;
        sh:minCount 2
    ] ;
    # Terms: Warning if file has no associated terms
    sh:property [
        sh:path :hasTerm ;
        sh:name "minTerm" ;
        sh:description "A file must have at least one associated term extracted from its file header text. Header text and dictionary may require update." ;
        sh:message "File lacks associated term. At least one is required." ;
        sh:severity sh:Warning ;
        sh:minCount 1
    ] ;
    # Status: Violation if status not in approved list
    sh:property [
        sh:path :status ;
        sh:name "statusValues" ;
        sh:description "Status must be in a defined list of permissible values." ;
        sh:message "File Status is not in approved list." ;
        sh:severity sh:Violation ;
        sh:in (:CompleteStatus :DevelopmentStatus :ValidatedStatus :OutdatedStatus)
    ] .
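To make the intent of the three property constraints concrete, here is a minimal sketch in plain Python that mimics their logic for a single file node. It is not a SHACL engine and is not how Jena evaluates shapes. The function name check_file and the plain-string representation of terms and status values are assumptions made for illustration only.

```python
# Local names of the values listed in sh:in (namespace prefix omitted).
APPROVED_STATUSES = {
    "CompleteStatus",
    "DevelopmentStatus",
    "ValidatedStatus",
    "OutdatedStatus",
}


def check_file(terms, statuses):
    """Mimic the three property constraints for one file node.

    terms: values of :hasTerm; statuses: values of :status.
    Returns a list of (severity, message) pairs.
    """
    results = []
    # sh:minCount 2 on :hasTerm, reported with severity sh:Info
    if len(terms) < 2:
        results.append(("Info", "File has less than 2 associated terms."))
    # sh:minCount 1 on :hasTerm, reported with severity sh:Warning
    if len(terms) < 1:
        results.append(
            ("Warning", "File lacks associated term. At least one is required.")
        )
    # sh:in on :status: each value outside the list is a sh:Violation
    for status in statuses:
        if status not in APPROVED_STATUSES:
            results.append(("Violation", "File Status is not in approved list."))
    return results


# The example file has one term and an approved status.
print(check_file(["file_header"], ["CompleteStatus"]))
# prints [('Info', 'File has less than 2 associated terms.')]
```

Note that a file with no terms fails both sh:minCount constraints, so it receives the Info and the Warning together.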
The shape constraints are now ready to be executed on the data file.
Command Line Report Generation
See the Apache Jena SHACL documentation for command details.
The command below assumes the SHACL constraint file SHACL_FileCat.TTL is in the same folder as the data file FileCat.TTL.
shacl v -s SHACL_FileCat.TTL -d FileCat.TTL
The report can also be redirected to a file, which can be converted into a more readable rendition with RMarkdown. Because the report is itself data in the form of RDF triples, the results can be uploaded into the database and linked to the files that appear in the report.
shacl v -s SHACL_FileCat.TTL -d FileCat.TTL > ValReport.TTL
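When validation is one step in a larger script, the same command can be launched programmatically. The following is a minimal sketch in Python using only the standard library. The function names build_command and run_validation are hypothetical, and the sketch assumes the shacl launcher is on the PATH as described under Preparation.

```python
import shutil
import subprocess


def build_command(shapes_file, data_file, launcher="shacl"):
    """Build the argument list for: shacl v -s <shapes> -d <data>."""
    return [launcher, "v", "-s", shapes_file, "-d", data_file]


def run_validation(shapes_file, data_file, report_file):
    """Run the validation and write its standard output to report_file."""
    # Resolve the launcher to a full path (shacl.bat on Windows).
    launcher = shutil.which("shacl")
    if launcher is None:
        raise FileNotFoundError("shacl launcher not found on PATH")
    command = build_command(shapes_file, data_file, launcher)
    # Capture standard output, which carries the validation report.
    completed = subprocess.run(command, capture_output=True, text=True)
    with open(report_file, "w", encoding="utf-8") as handle:
        handle.write(completed.stdout)
    return completed.returncode


# Example call (requires the Jena command line tools to be installed):
# run_validation("SHACL_FileCat.TTL", "FileCat.TTL", "ValReport.TTL")
```

The report file is written even when the data does not conform, since a non-conforming result is still a normal report.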
This excerpt from the report shows the information message for the file that has only one term identified in its comment header:
[ a sh:ValidationReport ;
  sh:conforms false ;
  sh:result [
      a sh:ValidationResult ;
      sh:focusNode filecat:FILE_93ded90b ;
      sh:resultMessage "File has less than 2 associated terms." ;
      sh:resultPath filecat:hasTerm ;
      sh:resultSeverity sh:Info ;
      sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
      sh:sourceShape _:b0
  ] ;
We see that the data does not conform to the shape (the report contains sh:conforms false) and the offending file is correctly identified as :FILE_93ded90b.
The file IRI is not very informative to the human reader. What is the file name? Where is it located? As all the data is in graph form, including the validation report, this additional information is easy to obtain using a SPARQL query.
PREFIX : <https://www.example.org/tw/filecat#>

SELECT ?pathToFile
WHERE {
    :FILE_93ded90b :filePath ?pathToFile .
}
pathToFile
C:/_github/NovasTaylor/FileCat/scripts/r/shaclTest/SHACLTest-oneterm.R
When dealing with a large number of files and many constraints, it is helpful to summarize the Validation Report in an interactive HTML file generated using RMarkdown. R can read the TTL report file and merge the result with a query to the graph database (or query to the TTL file not yet uploaded to the database), providing more information about the entities that fail validation.