Standards

Standards facilitate collection of the information required to answer the Competency Questions. A description of the file, notes, and related files like screen shot images can be listed in a file comment header. The amount of manual data entry must be minimized to ensure consistency and decrease maintenance burden. Values like file name and date last modified are best obtained from the file system. The notes and description header fields can be mined for keyword terms instead of having a separate, and difficult to maintain, keywords field in the header.

File Headers

While a full description of each field of the standardized header is not within the scope of this discussion, some explanation is warranted. The fields TITL:, DESC: and NOTE: are mined for terms extraction using Natural Language Processing. IMAG: (path to a related image file), GRAP: (named graph for a linked data project), and REFE: (URL to a reference) are optional.

Some fields are multi-value and/or multi-line. Code authorship can be obtained from the log of the version control system used to manage the files. Status could also be obtained from version control branching, but in these examples it is specified manually in the STAT: field.

Design your headers to capture the information that will answer your own Competency Questions, follow the required commenting style for the file type. Your choices will have implications for later file parsing and you will likely revise the header structure a number of times over the course of your work.

Standard File Header for R, SPARQL

  ##############################################################################
  # TITL:
  # DESC:
  # STAT:
  # IMAG: (optional)
  # GRAP: (optional)
  # REFE: (optional)
  # NOTE:
  ################################################################################

Example of a completed R header

  ###############################################################################
  # TITL: Visnetwork plot of files related by keyword.
  # DESC: File for plot is manually selected by IRI and not by name for this example
  # STAT: Development
  # IMAG: C:\_github\NovasTaylor\FileCat\scripts\r\FilesRelatedByTerms_FileId_visnetwork.png
  # REFE: https://stackoverflow.com/questions/23549605/merge-two-r-data-frames-and-identify-the-source-of-each-row
  # NOTE: For node position see code example Concentric-VisNetwork-CalculatedXY.R
  #       Reference is for merging and identifying the source of the value.
  ###############################################################################

R Markdown header

Headers for RMarkdown files contain similar fields as parameters in the YAML header.

---
params:
  TITL:
  DESC:
  STAT:
  IMAG: (optional)
  GRAP: (optional)
  REFE: (optional)
  NOTE:
---

Folder Structure

Folder Descriptions

A project typically contains multiple subfolders, with each folder containing code for a related purpose. A description of the folder content is documented in a Readme.txt file within each folder. This antiquated approach is needed until MS Windows provides metadata capture at the folder level. A different file name like folderDesc.txt may be preferred if you use ReadMe.txt for other purposes.

Figure 1.1 Partial folder structure for file catalog project.
Figure 1.1 Partial folder structure for file catalog project.

For the File Catalog Project, the readme file contains the single field DESC: for a multi-line description of the folder content.

  ################################################################################
  # DESC:
  #
  ################################################################################

Example of a completed Readme.txt file:

  ###############################################################################
  DESC: SPARQL Scripts for the File Catalog project.
  ###############################################################################

Report Folder

Report files and data from the cataloging process are created under the project’s required /fileCatReport folder. Files include the folder inventory as an interactive HTML table (generated using R Markdown) and inventory data in graph form for upload to the database.

Figure 1.2 Report folder.
Figure 1.2 Report folder.

With these standards in place it is time to consider classification of the entities and relationships using ontologies.

NEXT: 2. Ontology Development