[Archived] Simple utility to format and clean CSV datasets

clone

read-only
https://git.sr.ht/~wrycode/DWCHelper
read/write
git@git.sr.ht:~wrycode/DWCHelper


Archived Project

DWCHelper is now unmaintained. Forks are obviously fine :) The build scripts for the Windows installer (install.nsi and winbuild.sh) may be useful if you are trying to make a cross-platform, command-line Go program. Good luck!

Description

DWCHelper is a command-line utility to help format and clean up CSV files (for instance, exported from Microsoft Access). It:

  • formats the file according to RFC 4180 (cleans up extra quotes, etc.)
  • detects and suggests aliases to Darwin Core terms
  • detects terms that appear to be unused and suggests removing them
  • allows the user to rename or remove terms
  • saves the conversion settings for future runs (to accommodate changes to the dataset)
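
As a rough illustration of the RFC 4180 clean-up step, here is a minimal sketch using Go's standard encoding/csv package. It is not DWCHelper's actual code: the function name normalizeCSV is hypothetical, and the choice to tolerate stray quotes and ragged rows on input is an assumption about what "cleaning up" might involve.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"strings"
)

// normalizeCSV (hypothetical name) reads loosely quoted CSV and writes it
// back out with minimal quoting and CRLF line endings, as in RFC 4180.
func normalizeCSV(input string) (string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.LazyQuotes = true    // assumption: accept stray quotes inside fields
	r.FieldsPerRecord = -1 // assumption: accept rows of differing lengths

	records, err := r.ReadAll()
	if err != nil {
		return "", err
	}

	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	w.UseCRLF = true // RFC 4180 terminates records with CRLF
	if err := w.WriteAll(records); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Every field is quoted on input; only fields that need quotes keep them.
	in := "\"name\",\"note\"\n\"Bos\",\"large, robust\"\n"
	out, err := normalizeCSV(in)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", out)
}
```

The writer only quotes fields that contain a comma, a quote, or a line break, which is what removes the redundant quotes that some exporters add around every value.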

Installation

Windows: An installer for the latest release can be found on the Releases page.

Linux: You can use the binary provided on the releases page, or easily build from source with the following steps:

  • set your GOPATH
  • install the only dependency: go get -u github.com/fatih/camelcase
  • clone the repo and run go build

Usage

Navigate to the location of your CSV dataset in the console and run: DWCHelper <input-filename.csv> <output-filename.csv>

For Windows users, this means you need to navigate to the folder containing your CSV file in Windows Explorer, then click in the navigation bar and type cmd (and press Enter). The black command prompt window that opens up is where you type DWCHelper <input-filename.csv> <output-filename.csv>.

On the first run for each dataset, DWCHelper will prompt you for various corrections to the data. It will save your choices in the .settings file (in Windows Explorer, it appears as <filename>.txt with the type SETTINGS, but is still a normal text file that you can open with Notepad) for subsequent runs; if you want to redo the prompts, simply delete this file.

Editing .settings

The .settings file can be edited with a text editor to avoid redoing the prompts for small changes. DWCHelper is fairly tolerant of errors in this file and will simply ignore typos and terms that aren't in your dataset.

The first line is a CSV list of terms to remove completely from the dataset during the conversion.

Any lines after that are term aliases. The first value on each line is the term to be renamed and the second value is the new name.
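
The layout described above can be sketched in Go as follows. This is an illustration under stated assumptions, not the project's real parser: the settings type and the parseLine, parseSettings, and applyToHeader functions are hypothetical names, and the sketch assumes that removals are matched against the original term names and that malformed lines are silently skipped.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// settings is a hypothetical in-memory form of a .settings file.
type settings struct {
	remove  map[string]bool   // terms to drop from the dataset
	aliases map[string]string // old term name -> new term name
}

// parseLine splits one line into trimmed CSV fields.
// Empty or unparsable lines yield nil.
func parseLine(line string) []string {
	r := csv.NewReader(strings.NewReader(line))
	r.LazyQuotes = true
	fields, err := r.Read()
	if err != nil {
		return nil
	}
	for i := range fields {
		fields[i] = strings.TrimSpace(fields[i])
	}
	return fields
}

// parseSettings treats the first line as the removal list and every
// later line as an "old,new" alias pair, ignoring anything malformed.
func parseSettings(text string) settings {
	s := settings{remove: map[string]bool{}, aliases: map[string]string{}}
	lines := strings.Split(strings.ReplaceAll(text, "\r\n", "\n"), "\n")
	for i, line := range lines {
		fields := parseLine(line)
		if i == 0 {
			for _, term := range fields {
				if term != "" {
					s.remove[term] = true
				}
			}
			continue
		}
		if len(fields) < 2 || fields[0] == "" || fields[1] == "" {
			continue // tolerate typos and incomplete lines
		}
		s.aliases[fields[0]] = fields[1]
	}
	return s
}

// applyToHeader returns the renamed header and the indices of the
// columns that survive removal. Terms not in the settings pass through.
func applyToHeader(header []string, s settings) (names []string, keep []int) {
	for i, term := range header {
		if s.remove[term] {
			continue
		}
		if alias, ok := s.aliases[term]; ok {
			term = alias
		}
		names = append(names, term)
		keep = append(keep, i)
	}
	return names, keep
}

func main() {
	s := parseSettings("Notes,Internal ID\nGenus name,genus\nbroken line\n")
	names, keep := applyToHeader([]string{"Genus name", "Notes", "Locality"}, s)
	fmt.Println(names, keep)
}
```

Terms listed in the file but absent from the dataset (such as "Internal ID" in this example) simply never match a column, which mirrors the tolerant behaviour described above.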

About

DWCHelper is one component of my 2019 Undergraduate Research and Creativity Award project, which is a collaborative effort with the Anthropology department at UNCG.

The eventual goal of the project is to provide a tool for researchers at different sites in Olduvai Gorge, Tanzania to easily share, compare, and combine datasets and create useful, publishable data visualizations.

In June of 2019, I will be traveling to Tanzania to excavate and analyze animal bones, and I hope to gain a broader understanding of the context surrounding these 1.4 to 2 million-year-old specimens. My objective is to understand what types of questions researchers may need answered in their quest to understand this period of human evolution.

TODO

  • comment/clean helper functions, tidy up everything
  • add better testing/examples
  • set up continuous integration and publish releases on Sourcehut instead
  • remove stringIsVariation or tighten it up to cut down on false positives (waiting for more sample aliases before doing this)