List of Text Processing Tools

Introduction

This is a small, hand-maintained list of automated text processing tools. You may also be interested in my list of text editors and IDEs.

General-Purpose Preprocessors

  • GPP - a general-purpose preprocessor. Supports several alternative syntax modes. Open source (GPL).

  • filepp - an adaptation and extension of the C preprocessor for general-purpose use. Written in Perl. Open source (GPL-2-or-later).

  • chpp (Chakotay Preprocessor) - a powerful preprocessor that aims to be non-intrusive, and which can be considered a full-fledged programming system. Has been unmaintained since 1999. Open source (GPLv2).

  • Website Meta Language - an offline preprocessor primarily intended for HTML, but which may have some general-purpose utility. Quite slow and considered legacy. Open source (GPLv2).

  • m4 - a macro language with several open-source implementations, including GNU m4. I find it vile and recommend against using it, even though it appears to be the most popular preprocessor.

General-Purpose Template Systems

  • Template Toolkit - a flexible and highly extensible template processing system for Perl. Open source (same terms as Perl).

  • ClearSilver - a language-agnostic and fast templating system written in C. Open source (2-clause BSD).

  • Jinja2 - a “full-featured” template engine for Python 2 and Python 3. Open source under a BSD-style licence.

  • Tenjin - “the fastest template engine in the world” - available for several dynamic languages.

  • eRuby - a Ruby-based template system with several implementations. Open source.

  • Smarty - a PHP template engine. Open source.

  • HTML-Template and Text-Template - two other CPAN template systems popular in the Perl world. Open source.

  • Cheetah3 - a Python-powered template engine. “Fast, Flexible, Powerful”. Open source (MIT licence). After being unmaintained for some years, it was forked and is maintained again (as of May 2020). Supports Python 3.x and 2.7.x.
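
For readers new to template systems, the basic idea shared by the tools above can be sketched with Python's standard `string.Template` class. This is only a minimal illustration and not the API of any system listed here; the template text, the `render` helper and the variable names are made up for the example.

```python
from string import Template

# Hypothetical template text; "$name" and "$count" are placeholders.
page = Template("Hello, $name! You have $count new messages.")


def render(values):
    """Fill the placeholders from a dict of values.

    substitute() raises KeyError when a placeholder has no value,
    which surfaces typos in variable names early.
    """
    return page.substitute(values)


rendered = render({"name": "World", "count": 3})
print(rendered)
```

Full template systems add loops, conditionals, includes and escaping on top of this kind of substitution.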

Parser Generators

Regular Expression Libraries

Diffing and Patching Tools

  • GNU Diffutils - an open source (GPLv3+) package which provides diff and other programs.

  • GNU patch - apply a patch/diff file. Open source (GPLv3+).

  • patchutils - a small collection of programs that operate on patch files. Open source.

  • comm - a UNIX command used to compare two files for common and distinct lines.

  • Meld - a GUI diff/merge tool for gtk+. Open source.

  • KDiff3 - a GUI diff/merge tool for KDE. Open source.

  • GNU wdiff - a front-end to GNU diff for comparing files word by word.
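
As a rough illustration of what line-based diff tools produce, here is a minimal sketch that uses Python's standard `difflib` module to generate a unified diff. It does not call any of the programs listed above; the file names and sample lines are invented for the example.

```python
import difflib

# Two hypothetical versions of a file, as lists of lines.
old = ["apple\n", "banana\n", "cherry\n"]
new = ["apple\n", "blueberry\n", "cherry\n"]

# unified_diff() yields header lines, hunk markers, and
# "-"/"+" lines for removed and added text.
diff_lines = list(
    difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt")
)
print("".join(diff_lines))
```

Removed lines are prefixed with "-" and added lines with "+", which is the same general shape as the output of `diff -u`.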

Specialised Processors

XML Processors

Standard UNIX Text Processing Tools

  • echo - output strings (possibly with some configurable transformations).

  • cat - output or concatenate files.

  • cut - extract sections (bytes, characters or fields) from each line of input.

  • head - output the start of a file or stream.

  • tail - output the end of a file or stream.

  • paste - join multiple files horizontally.

  • sort - sort the lines of the input.

  • csplit - split files based on context lines.

  • join - merge lines of two files that share a common field.

  • uniq - collapse adjacent duplicate lines. The output is fully unique only if the input is sorted first.

  • grep - search for lines matching regular expressions.

  • fold - wrap long lines at a certain width.

  • fmt - format natural language text for readability.

  • par - a replacement for fmt.

  • sed - stream editor - a mini programming language for text processing, based on the ed text editor.

  • AWK - an even more full-fledged programming language for text processing in UNIX (with some quirks and idiosyncrasies).
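
The behaviour of uniq on unsorted input is a common source of surprise, so here is a minimal sketch of it in Python. This is not how the real tool is implemented; the `uniq` function and the sample data are assumptions made for illustration, with `itertools.groupby` standing in for the adjacent-line collapsing.

```python
from itertools import groupby


def uniq(lines):
    """Collapse runs of adjacent identical lines, like plain `uniq`."""
    return [key for key, _run in groupby(lines)]


# Hypothetical input lines.
lines = ["b", "a", "b", "b", "a"]

# Without sorting, only the adjacent "b", "b" pair is collapsed.
unsorted_result = uniq(lines)

# Sorting first puts equal lines next to each other, like `sort | uniq`.
sorted_result = uniq(sorted(lines))

print(unsorted_result)
print(sorted_result)
```

This is why uniq is usually seen at the end of a `sort | uniq` pipeline.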

Some General-Purpose Programming Languages with Good Text Processing Support

Licence

This document is Copyright by Shlomi Fish, 2012, and is available under the terms of the Creative Commons Attribution License (CC-by) 3.0 Unported (or at your option any later version of that licence).

To secure additional rights, please contact Shlomi Fish, and see the explicit requirements that are spelt out for abiding by that licence.