Data Collection

Term Extraction from File Headers

R scripts extract information from the file system by descending through the project folders and extracting values from individual file headers. Python or another language of your choice would work equally well.
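
As a rough illustration of this step, the sketch below walks a folder tree with base R and pulls tagged fields out of the leading comment lines of each file. Only the field tags (TITL:, DESC:, NOTE:, IMAG:) come from this chapter. The header layout (comment lines such as "# TITL: ..."), the helper name read_header_fields, and the 30-line limit are assumptions made for the example, not the FileCat implementation.

```r
# Hypothetical helper: extract "TAG: value" pairs from the top of one file.
# Assumes header lines look like "# TITL: My script"; the real FileCat
# header layout may differ.
read_header_fields <- function(path,
                               tags = c("TITL", "DESC", "NOTE", "IMAG"),
                               max_lines = 30) {
  lines <- readLines(path, n = max_lines, warn = FALSE)
  pattern <- paste0("^[[:space:]]*#+[[:space:]]*(",
                    paste(tags, collapse = "|"),
                    "):[[:space:]]*(.*)$")
  hits <- regmatches(lines, regexec(pattern, lines))
  hits <- hits[lengths(hits) == 3]   # full match + 2 groups = a matching line
  if (length(hits) == 0) {
    return(data.frame(file = character(), field = character(),
                      value = character(), stringsAsFactors = FALSE))
  }
  data.frame(
    file  = path,
    field = vapply(hits, `[`, character(1), 2),
    value = trimws(vapply(hits, `[`, character(1), 3)),
    stringsAsFactors = FALSE
  )
}

# Small self-contained demo in a temporary folder (invented content).
demo_root <- file.path(tempdir(), "filecat_demo")
dir.create(file.path(demo_root, "projA"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("# TITL: Convert data to RDF",
             "# DESC: Reads CSV files and writes triples",
             "x <- 1"),
           file.path(demo_root, "projA", "convert.R"))

# Descend through the folders, then parse each file's header.
files   <- list.files(demo_root, pattern = "\\.R$", recursive = TRUE,
                      full.names = TRUE)
headers <- do.call(rbind, lapply(files, read_header_fields))
print(headers[, c("field", "value")])
```

Returning one row per file and field keeps the result in a long format that is easy to filter by tag or to join with other per-file information later.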

Project information is collected with a single call to an R function that drives additional scripts, including data conversion to RDF and generation of reports. The function takes several parameters:

FileCat(
  pathRoot     = ".../NovasTaylor\\FileCat",
  parseFolders = TRUE,
  filePattern  = '\\.rq$|\\.r$|\\.Rmd$|readMe\\.txt$',
  omitFolders  = "\\/archive\\/|\\/testProject\\/",
  recurse      = TRUE,
  sImgFolder   = ".../NovasTaylor/FileCat/scripts/r/findFileTerms-app/www/scriptImages",
  debug        = TRUE  
)
Parameter      Description
pathRoot       Path to the project's root folder.
parseFolders   Parse folders (=TRUE) or rely on a previous parsing to generate a report from existing data (=FALSE).
filePattern    Regular expression that identifies the types and names of files that should be parsed.
omitFolders    Regular expression that identifies folders to exclude from parsing.
recurse        Parse folders below the pathRoot folder (=TRUE) or only parse files within the pathRoot folder (=FALSE).
sImgFolder     Folder where images (specified in the IMAG: field) should be copied for access by the "Find Files" R Shiny app.
debug          Verbose debugging messages (=TRUE) or standard messaging (=FALSE).
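
To make the two regular-expression parameters concrete, here is a minimal sketch of how a file pattern and a folder-omission pattern could be combined with grepl(). The file paths are invented, and how FileCat applies these parameters internally is not shown in this chapter, so treat the filtering logic as an assumption. The file pattern used here anchors every alternative with $ and escapes the dot in readMe.txt.

```r
# Hypothetical file paths, using forward slashes as list.files() returns them.
paths <- c(
  "proj/scripts/query.rq",
  "proj/scripts/load.r",
  "proj/docs/report.Rmd",
  "proj/readMe.txt",
  "proj/archive/old.r",
  "proj/testProject/try.rq",
  "proj/data/values.csv"
)

# Patterns modeled on the FileCat() call.
filePattern <- "\\.rq$|\\.r$|\\.Rmd$|readMe\\.txt$"
omitFolders <- "\\/archive\\/|\\/testProject\\/"

# Keep files whose names match filePattern and whose paths avoid omitFolders.
keep     <- grepl(filePattern, paths) & !grepl(omitFolders, paths)
selected <- paths[keep]
print(selected)
```

Note that these patterns are case-sensitive, so a file ending in ".R" would not match "\\.r$" unless ignore.case = TRUE is passed to grepl().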

The parsing script performs a number of functions in addition to extracting the fields from the headers:

  • Obtains file information: full path to the file, project name (parsed from the file path), and date last modified
  • Removes "stop words" from parsed text
  • Checks spelling
  • Generates the Project Catalog
  • Creates Graph Data

Text Mining

The tidytext package was used to extract words from the combined fields TITL:, DESC: and NOTE: using the methods described in the book Text Mining with R by Silge and Robinson.

Text mining extracts a large number of words, especially for projects that contain many diverse files. A dictionary file is used to identify words to either exclude (by adding them to the stop_words dataframe) or to keep and use as keyword identifiers associated with the project files.
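
The dictionary idea can be illustrated with a small base R sketch. It is a simplified stand-in for the tidytext workflow, not a reproduction of it: the header text, the dictionary layout (a word column plus an "exclude" or "keep" action), and the short stop_list vector are all invented for the example.

```r
# Header text per file (invented examples).
headers <- data.frame(
  file = c("convert.R", "report.Rmd"),
  text = c("Convert the study data to RDF",
           "Report of the RDF validation results"),
  stringsAsFactors = FALSE
)

# Split text into lower-case words, one row per file/word pair.
tokens <- do.call(rbind, lapply(seq_len(nrow(headers)), function(i) {
  words <- unlist(strsplit(tolower(headers$text[i]), "[^a-z0-9]+"))
  data.frame(file = headers$file[i], word = words[nzchar(words)],
             stringsAsFactors = FALSE)
}))

# Hypothetical project dictionary: action is "exclude" or "keep".
dictionary <- data.frame(
  word   = c("study", "results", "rdf", "validation"),
  action = c("exclude", "exclude", "keep", "keep"),
  stringsAsFactors = FALSE
)

# Tiny stand-in for a stop-word list, extended with dictionary exclusions.
stop_list <- c("the", "to", "of", "a", "and")
stop_list <- union(stop_list, dictionary$word[dictionary$action == "exclude"])

# Drop stop words, then flag the words the dictionary marks as keywords.
keywords <- tokens[!tokens$word %in% stop_list, ]
keywords$inDictionary <-
  keywords$word %in% dictionary$word[dictionary$action == "keep"]
print(keywords)
```

Words that survive filtering but are not flagged (here "convert", "data" and "report") are the candidates to review and add to the dictionary as either exclusions or keywords.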

A report in RMarkdown helps to identify the words and files that require corrections. Words flagged by the R package hunspell may be project acronyms or other unusual words that should be added to the dictionary as either words to ignore (add to stop_words) or to be linked to files.

Many extracted keywords do not help identify or relate scripts, and parsing a large project yields a vast number of low-value terms. In addition to excluding keywords with the dictionary, a list of bigrams is compiled. Bigrams are adjacent word pairs extracted from the file header strings after stop words are removed, and they provide better context than individual terms when searching for files. Using bigrams also allows many individual keywords to be dropped from the data in favor of the bigrams that contain them.
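
A minimal base R sketch of bigram extraction follows. Stop words are removed first, so a pair can span a word that was dropped. The helper name make_bigrams, the sample sentence, the stop_list vector, and the set of bigrams kept after review are all invented for illustration.

```r
# Hypothetical helper: adjacent word pairs from one string, formed after
# stop words have been removed.
make_bigrams <- function(text, stop_list) {
  words <- unlist(strsplit(tolower(text), "[^a-z0-9]+"))
  words <- words[nzchar(words) & !words %in% stop_list]
  if (length(words) < 2) return(character())
  paste(head(words, -1), tail(words, -1))   # word i paired with word i + 1
}

stop_list <- c("the", "to", "of", "a", "and", "for")
bigrams <- make_bigrams("Validation of the SPARQL query for the study data",
                        stop_list)
print(bigrams)

# Single keywords covered by a retained bigram can be dropped.
keywords  <- c("validation", "sparql", "query", "study", "data")
chosen    <- c("sparql query", "study data")   # bigrams kept after review
covered   <- unique(unlist(strsplit(chosen, " ", fixed = TRUE)))
remaining <- setdiff(keywords, covered)
print(remaining)
```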

Refinement of keywords and bigrams is an iterative process that leads to improvements in code documentation. An additional RMarkdown report identifies words that are not matched with files, either because they are not listed in the project dictionary or because they remain in need of spelling correction. Analysis of this report leads to further refinement of the dictionary.

At this point, R dataframes contain the information extracted from file headers and folder descriptions, including keywords and bigrams from the text mining operation. The data can be fed directly into the dynamic RMarkdown reports, as shown in the next section, and is ready for conversion to graph data, as described in chapter 4, Graph Data.
