Stardog Similarity Model

In the previous section I was able find similar files by searching for terms extracted from their comment headers and by visualizing how files are connected by common terms. Finding similar entities in the graph is usually much more complex than these simple examples. This is where machine learning models could be helpfule.

The Stardog database provides an easy way to specify and use such machine learning models, as described in this blog post. The basic steps are to :

Create a Similarity Model.
Execute a Similarity Query based on the model.

1. Create the Similarity Model

The model for this project is very basic. I want to find files that are similar to a file I have identified. I created a model where terms are used to predict files. The model is inserted into the database by executing the code:

PREFIX spa:     <tag:stardog:api:analytics:>
PREFIX filecat: <https://www.example.org/tw/filecat#>
# Predict file based on terms
INSERT {
  graph spa:model {
        :simModel a             spa:SimilarityModel ;
                  spa:arguments (?terms) ; # Use terms to
                  spa:predict   ?file .    # predict file name
  }
}
WHERE {
  SELECT (spa:set(?term) as ?terms) ?file {
      ?file filecat:hasTerm  ?term .
  }
  GROUP BY ?file
}

2. Similarity Query

With both the model and instance data in the database it is possible to find similar files based on terms. I want to find files similar to 01-FileCatMain.R, which I know from a separate query has the unique IRI filecat:FILE_b7afd304 . I limit the query to predict the top three matches ( limit 3) and sort by descending confidence:

PREFIX filecat:  <https://www.example.org/tw/filecat#>
PREFIX spa:      <tag:stardog:api:analytics:>

SELECT ?similarFileName ?confidence
WHERE {
  graph spa:model {
    :simModel spa:arguments  (?terms) ;
              spa:confidence ?confidence ;
              spa:parameters [ spa:limit 3] ;  # Select the top 3
              spa:predict    ?similarFile .
  }
  { ?similarFile filecat:fileName ?similarFileName }
  {
    SELECT (spa:set(?term) as ?terms) ?file
      {
          ?file filecat:hasTerm ?term .
          VALUES ?file { filecat:FILE_b7afd304 }
      }
      GROUP BY ?file
  }
} ORDER BY DESC(?confidence)

The best match is the 100% confidence value for the specified file matched to itself, which at least hints that the model may be working. The next two matches are of interest.

similarFileName   confidence
01-FileCatMain.R  1.0
02-Dictionary.R   0.5853694070049636
03-FileCatTTL.R   0.502518907629606

The result is not a surprise. Revisiting the terms visualization for 01-FileCatMain.R, the file 02-Dictionary.R has five terms in common with 01-FileCatMain.R : parse, create ttl, file header, driver file, and dictionary.

Figure 6.7 Terms Visualization for 02-Dictionary.R (screen shot.)

NEXT: 7. Conclusion

Last updated on Nov 22, 2021