Getting Git: Scanning Directories

Abstract: My algorithm for finding files at a particular git revision.

Understand databases come in two main flavors: watched directory projects and imported projects. A watched directory project decides which files belong in the project by scanning the file system. An imported project uses a Visual Studio project/solution, an Xcode project, or a CMake compile commands JSON file to determine the files in the project and their settings.

I’m working on a feature to create a database at a provided git commit. For it to work correctly, watched directory projects need to scan the git repository instead of the file system to find files in the project. 

The current directory scanning code uses Qt. The basic algorithm is something like this:

void RescanDir::rescanImpl(const QDir &dir, QList<QFileInfo> &files)
{
  foreach (const QFileInfo &info, dir.entryInfoList(kFilters)) {
    // Report progress.

    // Handle regular files.
    QString ext = info.suffix();
    if (!info.isDir() || kBundleExts.contains(ext)) {
      if (/*file doesn’t match options*/)
        continue;

      // Add the file.
      files.append(info);
      continue;
    }

    // Handle directories
    if (!mRecursive || /*directory doesn’t match options*/)
      continue;

    // Enter subdirectory.
    rescanImpl(QDir(info.filePath()), files);
  }
}

For Git access, Understand uses a wrapper around libgit2. The wrapper is essentially the same as the git wrapper (src/git) in the open-source GitAhead project, also written by SciTools. I’m by no means an expert on git, but in general tree objects are kind of like directories and blobs are kind of like files. So, a rescan operation for git would be something like:

void rescanImpl(const git::Tree & tree, const QString & curPath, QStringList & files)
{
  for (int i = 0; i < tree.count(); i++) {
    QString path = curPath + "/" + tree.name(i);
    git::Object object = tree.object(i);
    git::Tree subTree(object);
    if (!subTree) {
      if (/*file doesn't match options*/)
        continue;
      
      files.append(path);
      continue;
    }
    
    if (/*directory doesn't match options*/)
      continue;
    rescanImpl(subTree, path, files);
  }
}
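
To see the shape of this recursion without libgit2, here is a minimal standalone sketch. Entry and collectFiles are hypothetical names, not part of the wrapper; the in-memory struct merely stands in for git::Tree. Unlike the snippet above, it omits the separator at the root so that paths come out repository-relative, which is an assumption about the desired output rather than something the wrapper dictates.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a git tree entry: a name plus, for
// subtrees, a list of children. Blobs have isTree == false.
struct Entry {
  std::string name;
  bool isTree;
  std::vector<Entry> children;
};

// Depth-first walk mirroring the recursion above: blobs are recorded
// with their full path, subtrees are entered.
void collectFiles(const std::vector<Entry> &tree,
                  const std::string &curPath,
                  std::vector<std::string> &files)
{
  for (const Entry &entry : tree) {
    // Skip the separator for entries in the root tree.
    std::string path =
      curPath.empty() ? entry.name : curPath + "/" + entry.name;
    if (!entry.isTree) {
      files.push_back(path);
      continue;
    }
    collectFiles(entry.children, path, files);
  }
}
```

Calling collectFiles(root, "", files) on a root tree that holds README.md and a src subtree fills files with README.md followed by the paths under src, in tree order.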

It seems simple enough, but there are some problems. The first problem is ignored files. It’s relatively common practice to have git ignore automatically generated files, but those files may still be part of the Understand project. So, when scanning for project files, what we actually want is a list of files that are in git plus any files that exist on disk that are ignored by git. A filter over the on-disk entries of one directory might look like this:

void gitFilter(const QDir & dir, QList<QFileInfo> & onDisk)
{
  QString path = mRepo.workdir().relativeFilePath(dir.absolutePath());
  git::Id treeId = mCommit.tree().id(path);
  // TODO Convert the treeId to a tree object
  QSet<QString> gitEntries;
  for (int i = 0; i < tree.count(); i++)
    gitEntries.insert(tree.name(i));
  
  // Remove on-disk entries that are neither in this commit nor ignored
  QSet<QString> found;
  auto iter = onDisk.begin();
  while (iter != onDisk.end()) {
    QString filename = iter->fileName();
    if (gitEntries.contains(filename) ||
        mRepo.isIgnored(path + "/" + filename)) {
      found.insert(filename);
      ++iter;
    } else {
      // Doesn't exist in the repository for this commit; remove it.
      iter = onDisk.erase(iter);
    }
  }

  // Add the entries that exist in git but not on disk
  foreach (const QString & gitFile, gitEntries) {
    if (!found.contains(gitFile))
      onDisk.append(QFileInfo(dir.filePath(gitFile)));
  }
}
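
Stripped of the Qt and git types, the filter computes a simple set expression: everything in the commit, plus whatever is on disk and ignored. The standalone sketch below illustrates that with standard containers; mergeEntries and the isIgnored callback are hypothetical stand-ins, not functions from the wrapper.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>

// Returns the names to scan for one directory:
//  - every name tracked in the commit (whether or not it is on disk)
//  - every name on disk that git ignores
// `isIgnored` is a hypothetical callback standing in for the
// repository's ignore check.
std::set<std::string> mergeEntries(
  const std::set<std::string> &onDisk,
  const std::set<std::string> &inCommit,
  const std::function<bool(const std::string &)> &isIgnored)
{
  std::set<std::string> result = inCommit;
  for (const std::string &name : onDisk) {
    if (isIgnored(name))
      result.insert(name);
  }
  return result;
}
```

For example, with main.cpp, gen.cpp, and scratch.txt on disk, main.cpp and old.cpp in the commit, and gen.cpp ignored, the result contains gen.cpp, main.cpp, and old.cpp. The untracked, non-ignored scratch.txt is dropped, and old.cpp is kept even though it no longer exists on disk.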

There is one immediately obvious problem with gitFilter, flagged by its TODO: the wrapper library doesn’t provide a way to find a tree object by ID or by path. There is a function to look up a blob by ID in Repository:

Blob Repository::lookupBlob(const Id &id) const
{
  git_object *obj = nullptr;
  git_object_lookup(&obj, d->repo, id, GIT_OBJECT_BLOB);
  return Blob(reinterpret_cast<git_blob *>(obj));
}

Adding a function to look up a tree turns out to be pretty simple:

Tree Repository::lookupTree(const Id &id) const
{
  git_object *obj = nullptr;
  git_object_lookup(&obj, d->repo, id, GIT_OBJECT_TREE);
  return Tree(reinterpret_cast<git_tree *>(obj));
} 

Now, to test it. Unfortunately, it doesn’t work. The problem is that QFileInfo::isDir() and the related functions only work if the file exists. Even ensuring the directories end in ‘/’ won’t change the result. So, the file system rescan needs additional information for the files that only existed in git. How much? The file type is probably enough. Again, there isn’t a wrapper function to access the type. But adding one is just a matter of finding the libgit2 function to wrap: 

git_filemode_t Tree::filemode(int index) const
{
  const git_tree_entry *entry = git_tree_entry_byindex(*this, index);
  return git_tree_entry_filemode(entry);
}

Now, we can return a map from the name to the file type from the filter function, and update the rescan function:

void RescanDir::rescanImpl(const QDir &dir, QList<QFileInfo> &files)
{
  auto dirEntries = dir.entryInfoList(kFilters);
  auto gitOnly = VersionControlManager::instance()->filteredEntries(
                   dir, dirEntries, mCommit);
  foreach (const QFileInfo &info, dirEntries) {
    // Report progress.

    // Check git
    bool isGit = gitOnly.contains(info.fileName());
    git_filemode_t gitMode = gitOnly.value(info.fileName(),
                                           GIT_FILEMODE_UNREADABLE);
    // Handle regular files.
    QString ext = info.suffix();
    // info.isDir() only works for files that exist (not git only files)
    bool isDir = isGit ? (gitMode == GIT_FILEMODE_TREE ||
                          gitMode == GIT_FILEMODE_COMMIT) : info.isDir();
    if (!isDir || kBundleExts.contains(ext)) {
      if (/*file doesn’t match options*/)
        continue;

      // Add the file.
      files.append(info);
      continue;
    }

    // Handle directories
    if (!mRecursive || /*directory doesn’t match options*/)
      continue;

    // Enter subdirectory.
    rescanImpl(QDir(info.filePath()), files);
  }
}

There is one other problem to be solved: submodules. That topic will be covered in another post, but a small hint is that GIT_FILEMODE_COMMIT is in the code. From the rescan function’s perspective, that condition ensures submodules are treated like directories.
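
The directory test for git-only entries can be isolated as a small predicate. The sketch below mirrors the values of libgit2's git_filemode_t in a local enum so that it compiles on its own; FileMode and isDirectoryLike are hypothetical names, and real code would use the libgit2 enum directly.

```cpp
#include <cassert>
#include <cstdint>

// Local mirror of libgit2's git_filemode_t values (octal), used here
// only so the sketch builds without libgit2.
enum FileMode : std::uint32_t {
  kUnreadable     = 0000000,
  kTree           = 0040000, // subdirectory
  kBlob           = 0100644, // regular file
  kBlobExecutable = 0100755, // executable file
  kLink           = 0120000, // symbolic link
  kCommit         = 0160000, // submodule entry
};

// Hypothetical helper: true for entries the scanner should descend
// into, i.e. trees and submodule entries.
bool isDirectoryLike(FileMode mode)
{
  return mode == kTree || mode == kCommit;
}
```

Everything else, including symbolic links and the unreadable default used when a name is missing from the map, falls through to the regular file branch.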