Code Flow

Similar to how the sequence of events in the high-level process can be represented as graph data, the more detailed code execution steps can also be captured in an ordered sequence.

Step 1: 01-FileCatMain.R writes FileCatRunLog.txt

Step 2: 01-FileCatMain.R reads RDFPrefixes.xlsx

Step 3: 01-FileCatMain.R writes FileCatInfo.rda

... and so on.

The steps already look like triples in the form File1 action File2. How can order be assigned to these action triples to capture (and later query) the sequence of events? RDF reification is one way to “say things about triples,” but the challenge of representing the order remains. As noted by Bob DuCharme, representing order in RDF is cumbersome, even with the improvements in SPARQL 1.1. Rather than attempt to implement the ordered list concept in RDF, I chose to represent process step order using RDF-Star and to query it with SPARQL-Star.

RDF* (RDF-Star) and SPARQL* (SPARQL-Star)

RDF-Star is relatively new and not implemented in all RDF databases. It also presents a challenge when creating RDF because most R packages do not support it, so some kludging is necessary when using R to write RDF-Star data. I won’t go into the details here. See the Resources page for good introductions to RDF-Star and SPARQL-Star.

Substituting the file names for their corresponding IRIs, the first three steps in the File Catalog code flow can be written in TTL as:

:FILE_b7afd304  :writes :FILE_afbb9439 .

:FILE_b7afd304  :reads  :FILE_26b6c512 .

:FILE_b7afd304  :writes :FILE_d37a1d00 .

A step number must be assigned to these triples. An easy way to do this is to use RDF-Star’s embedded triple concept: enclose the triple within << and >>. The embedded triple can appear in the subject or object position, in data or in a query. The triples below are now in RDF-Star form, each with a step number IRI attached.


<<:FILE_b7afd304  :writes :FILE_afbb9439 >>
  :hasStep :CODSTEP_ID1 .

<<:FILE_b7afd304  :reads  :FILE_26b6c512 >>
  :hasStep :CODSTEP_ID2 .

<<:FILE_b7afd304  :writes :FILE_d37a1d00 >>
  :hasStep :CODSTEP_ID3 .
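Where an RDF library cannot write embedded triples, one workaround is to assemble the RDF-Star statements as plain text. The following is a minimal Python sketch of that idea and is not the method used in this project. The function name `embedded_triple` and the list `events` are hypothetical; the IRIs are copied from the example above.

```python
def embedded_triple(subject, predicate, obj, step_iri):
    """Return one RDF-Star statement that attaches a step IRI to a triple."""
    return f"<< {subject} {predicate} {obj} >>\n  :hasStep {step_iri} ."


# (subject, action, object) events in execution order; IRIs from the example above.
events = [
    (":FILE_b7afd304", ":writes", ":FILE_afbb9439"),
    (":FILE_b7afd304", ":reads", ":FILE_26b6c512"),
    (":FILE_b7afd304", ":writes", ":FILE_d37a1d00"),
]

# The position in the list supplies the step number used in the step IRI.
statements = [
    embedded_triple(s, p, o, f":CODSTEP_ID{n}")
    for n, (s, p, o) in enumerate(events, start=1)
]
turtle_star = "\n\n".join(statements)
print(turtle_star)
```

Because the step IRI is derived from list position, reordering `events` is enough to renumber the steps.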

Why not assign an integer literal value like :stepNum "1"^^xsd:integer instead of the more opaque :hasStep :CODSTEP_ID1 ? An IRI is used to represent the step because additional information must be specified about each step, including a description of the step and the hasNext / hasPrevious relationships between the steps. Here are the first three events with the order defined from Step 1 to Step 2, then Step 2 to Step 3:

<<:FILE_b7afd304  :writes :FILE_afbb9439 >>
  :hasStep :CODSTEP_ID1 .

<<:FILE_b7afd304  :reads  :FILE_26b6c512 >>
  :hasStep :CODSTEP_ID2  .

<<:FILE_b7afd304  :writes :FILE_d37a1d00 >>
  :hasStep :CODSTEP_ID3 .


:CODSTEP_ID1
    rdf:type :CodeFlowStep ;
    :stepNum "1"^^xsd:integer ;
    dcterms:description "Step 1 in Code Flow Process"^^xsd:string ;
    :hasNext :CODSTEP_ID2 .

:CODSTEP_ID2
    rdf:type :CodeFlowStep ;
    :stepNum "2"^^xsd:integer ;
    dcterms:description "Step 2 in Code Flow Process"^^xsd:string ;
    :hasNext :CODSTEP_ID3 .
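The step descriptions follow a regular pattern, so they could be generated rather than written by hand. The sketch below is one possible way to do that in Python, under the assumption that the steps form a simple numbered chain; the function `step_node` and the value of `total_steps` are hypothetical and not part of the project code.

```python
def step_node(num, last):
    """Return Turtle text describing one step; the final step has no :hasNext."""
    lines = [
        f":CODSTEP_ID{num}",
        "    rdf:type :CodeFlowStep ;",
        f'    :stepNum "{num}"^^xsd:integer ;',
        f'    dcterms:description "Step {num} in Code Flow Process"^^xsd:string',
    ]
    if num < last:
        # Link to the following step and close the statement there.
        lines[-1] += " ;"
        lines.append(f"    :hasNext :CODSTEP_ID{num + 1} .")
    else:
        # Last step: close the statement after the description.
        lines[-1] += " ."
    return "\n".join(lines)


total_steps = 3  # hypothetical count, chosen only for illustration
step_turtle = "\n\n".join(
    step_node(n, total_steps) for n in range(1, total_steps + 1)
)
print(step_turtle)
```

Keeping :hasNext as the last predicate means every statement ends with a single period, which avoids the dangling-semicolon errors that are easy to make when editing Turtle by hand.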


Current Limitations Creating RDF-Star using R

The R package rdflib is used extensively in this project to create triples, with the exception of the embedded triples necessary to implement RDF-Star. At the time of this writing (September 2021) the rdflib R package does not yet support RDF-Star, even though a branch of the Python library supports it according to this issue thread.

Embedded RDF-Star for this project was implemented in a crude manner that is not detailed here, pending a more elegant process once the R libraries have been updated.

Reification

The Ontotext article “What is RDF-Star” provides a good description of RDF-Star and the more traditional RDF reification approach where statements are made about triples.

:man :hasSpouse :woman .

:id1 rdf:type rdf:Statement ;
    rdf:subject :man ;
    rdf:predicate :hasSpouse ;
    rdf:object :woman ;
    :startDate "2020-02-11"^^xsd:date .

I did not use reification in this project because I wanted to gain experience with RDF-Star and SPARQL-Star.
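For comparison, the following Python sketch shows roughly what the first code flow step might look like if standard reification had been used instead. It is an illustration only: the function `reify` and the statement IRI `:STMT_1` are hypothetical, and no reified data was created in this project.

```python
def reify(stmt_iri, subject, predicate, obj, step_iri):
    """Return standard RDF reification Turtle for one triple plus its step."""
    return "\n".join([
        f"{stmt_iri} rdf:type rdf:Statement ;",
        f"    rdf:subject {subject} ;",
        f"    rdf:predicate {predicate} ;",
        f"    rdf:object {obj} ;",
        f"    :hasStep {step_iri} .",
    ])


# :STMT_1 is a hypothetical identifier for the reified statement.
reified = reify(
    ":STMT_1", ":FILE_b7afd304", ":writes", ":FILE_afbb9439", ":CODSTEP_ID1"
)
print(reified)
```

The reified form needs five triples and an extra statement IRI to say what RDF-Star expresses with a single embedded triple and one :hasStep predicate.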

Retrieve Code Flow Order Using SPARQL-Star

It should come as no surprise that the SPARQL-Star query closely resembles the RDF-Star data. The << embedded triple >> pattern specifies the two files and the relationship between them. The step is associated with the embedded triple using the :hasStep predicate and its value. Each file has a fileName (derived and assigned in previous steps), and the integer step number is retrieved using the :stepNum predicate.

PREFIX :  <https://www.example.org/tw/filecat#>
SELECT ?step ?fileName1 ?action ?fileName2

WHERE{
  <<?f1 ?action ?f2>> :hasStep ?st .
     ?f1 :fileName ?fileName1 .
     ?f2 :fileName ?fileName2 .
     ?st :stepNum ?step .
} ORDER BY ?step

step fileName1          action  fileName2
1    01-FileCatMain.R   writes  FileCat-RunLog.txt
2    01-FileCatMain.R   reads   RDFPrefixes.xlsx
3    01-FileCatMain.R   writes  FileCatInfo.rda

... remaining steps omitted.
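To make the join performed by the query concrete, here is a plain-Python analogue that works on in-memory data rather than a triple store. It is a sketch under stated assumptions: the names `annotated`, `file_name`, `step_num`, and `code_flow` are hypothetical, and the IRI-to-file-name mapping is inferred from the examples above, with the log file name taken from the result table.

```python
# Each entry pairs an embedded triple (f1, action, f2) with the step IRI
# attached to it by :hasStep. The entries are deliberately out of order.
annotated = [
    ((":FILE_b7afd304", "reads", ":FILE_26b6c512"), ":CODSTEP_ID2"),
    ((":FILE_b7afd304", "writes", ":FILE_d37a1d00"), ":CODSTEP_ID3"),
    ((":FILE_b7afd304", "writes", ":FILE_afbb9439"), ":CODSTEP_ID1"),
]

# ?f :fileName ?fileName (mapping inferred from the examples above).
file_name = {
    ":FILE_b7afd304": "01-FileCatMain.R",
    ":FILE_afbb9439": "FileCat-RunLog.txt",
    ":FILE_26b6c512": "RDFPrefixes.xlsx",
    ":FILE_d37a1d00": "FileCatInfo.rda",
}

# ?st :stepNum ?step
step_num = {":CODSTEP_ID1": 1, ":CODSTEP_ID2": 2, ":CODSTEP_ID3": 3}


def code_flow(annotated, file_name, step_num):
    """Join the three lookups and sort by step, like ORDER BY ?step."""
    rows = [
        (step_num[st], file_name[f1], action, file_name[f2])
        for (f1, action, f2), st in annotated
    ]
    return sorted(rows, key=lambda row: row[0])


for row in code_flow(annotated, file_name, step_num):
    print(*row)
```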

Query Along the Code Flow Path

Here begins a cautionary tale about SPARQL-Star.

Similar to the Process Flow question in the previous section, we want to determine which code files are affected when the dictionary file is changed. In other words, which process steps occur from Step 4 onward, where Step 4 is the R program 01-FileCatMain.R sourcing the dictionary program 02-Dictionary.R?

Once again we can use the hasNext+ predicate, this time applied to the embedded triple that represents Step 3, since we want to retrieve Step 4 onward. Information about the next steps is available from the embedded triples for each later step.


SELECT ?startStep ?step ?fileName1 ?action ?fileName2

WHERE{
    # Step 3:  FILE1  writes File4
   << :FILE_b7afd304 :writes :FILE_d37a1d00 >>  :hasStep ?st .
     ?st :stepNum  ?startStep ;
          # All next steps after Step 3
          :hasNext+ ?nextStep .  
    << ?file1 ?action ?file2 >> :hasStep ?nextStep .
    ?nextStep :stepNum  ?step .
    ?file1 :fileName ?fileName1 .
    ?file2 :fileName ?fileName2 .
}

All seems well with results returned as:

startStep step fileName1          action   fileName2
3         4    01-FileCatMain.R   sources  02-Dictionary.R
3         5    02-Dictionary.R    reads    dictionary.xlsx
3         6    02-Dictionary.R    writes   fileDictNoMatch.Rda
... and so on.
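The behaviour of the :hasNext+ property path can be illustrated with a small traversal in plain Python. This is only a sketch of the semantics, not how a triple store evaluates the path. The function `steps_after` and the `has_next` mapping are hypothetical, and the linear chain of six steps is an assumption made for the example; list values are used so that a branching flow could also be represented.

```python
def steps_after(start, has_next):
    """Follow :hasNext links one or more times, like the :hasNext+ path."""
    reached = []
    seen = set()
    frontier = list(has_next.get(start, []))
    while frontier:
        step = frontier.pop(0)
        if step in seen:
            continue  # guard against revisiting a step in a branching flow
        seen.add(step)
        reached.append(step)
        frontier.extend(has_next.get(step, []))
    return reached


# Hypothetical chain of step numbers: step -> list of direct successors.
has_next = {1: [2], 2: [3], 3: [4], 4: [5], 5: [6]}

print(steps_after(3, has_next))  # prints [4, 5, 6]
```

As with the "+" operator in SPARQL, the starting step itself is not returned, only the steps reachable by at least one :hasNext link.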

Problems appear as soon as the reasoner is turned on in the Stardog database. The exact same query now throws the error:

ReasoningPlanNode cannot be cast to com.complexible.stardog.plan.ScanNode

What is happening? According to technical support staff at Stardog, the database does not yet (September 2021) support computing inference over edge properties. No problem: turn off the reasoner and the query will return the expected result. However, this also means that the :hasPrevious+ relation cannot be used as it was in the Process Flow example to find preceding steps from a starting point. A reasoner is required to infer :hasPrevious relations, because only :hasNext is asserted in the data and :hasPrevious is declared as the owl:inverseOf :hasNext.

:hasNext     rdf:type      owl:ObjectProperty .

:hasPrevious rdf:type      owl:ObjectProperty ;
             owl:inverseOf :hasNext .
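One possible workaround, not attempted in this project, would be to assert the :hasPrevious triples explicitly instead of relying on the reasoner to infer them. The Python sketch below shows the idea on a small in-memory mapping; the function `invert` and the `has_next` data are hypothetical.

```python
def invert(has_next):
    """Derive hasPrevious links by reversing every hasNext link."""
    has_previous = {}
    for step, successors in has_next.items():
        for nxt in successors:
            has_previous.setdefault(nxt, []).append(step)
    return has_previous


# Hypothetical chain of step numbers: step -> list of direct successors.
has_next = {1: [2], 2: [3], 3: [4]}
has_previous = invert(has_next)

# Write the inverse links as ordinary Turtle triples.
previous_triples = [
    f":CODSTEP_ID{step} :hasPrevious :CODSTEP_ID{prev} ."
    for step in sorted(has_previous)
    for prev in has_previous[step]
]
print("\n".join(previous_triples))
```

Materializing the inverse links makes the data larger and must be repeated whenever the steps change, which is the bookkeeping the owl:inverseOf declaration is meant to avoid.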

The experiment with RDF-Star and SPARQL-Star concludes with a mix of success and failure, with the hope that Stardog and other databases will soon support inference in embedded triple patterns or at least provide better error messages.

Visualize

Code flow can be visualized using R and steps similar to those described for the Process Flow. The visualization is more complex due to the branching nature of the flow and the different types of inputs and outputs. The R package visNetwork was used, with data added to place each node on an x,y grid of rows and columns based on step number and file type (work not shown). Click on the thumbnail image to open an interactive version of the diagram.

Figure 5.3. Code Flow Process. Thumbnail of interactive diagram.

Click on the image to open the full, interactive version in a new tab.
