List of Text Processing Tools
Introduction
This is a small, hand-maintained list of automated text processing tools. You may also be interested in my list of text editors and IDEs.
General-Purpose Preprocessors
GPP - a general-purpose preprocessor. Supports several alternative syntax modes. Open source (GPL).
filepp - an adaptation and extension of the C preprocessor for general-purpose use. Written in Perl. Open source (GPL-2-or-later).
chpp (Chakotay Preprocessor) - a powerful preprocessor that aims to be non-intrusive, and which can be considered a full-fledged programming system. Has been unmaintained since 1999. Open source (GPLv2).
Website Meta Language - an offline preprocessor primarily intended for HTML, but which may have some general-purpose utility. Quite slow and considered legacy. Open source (GPLv2).
m4 - a macro language with some open-source implementations, including GNU m4. I find it very vile and would recommend against using it despite the fact that it appears to be the most popular preprocessor.
General-Purpose Template Systems
Template Toolkit - a flexible and highly extensible template processing system for Perl. Open source (same terms as Perl).
ClearSilver - a language-agnostic and fast templating system written in C. Open source (2-clause BSD).
Jinja2 - a “full-featured” template engine for Python 2 and Python 3. Open source under a BSD-style licence.
Tenjin - “the fastest template engine in the world” - available for several dynamic languages.
eRuby - a Ruby-based template system with several implementations. Open source.
Smarty - a PHP Template Engine. Open Source.
HTML-Template and Text-Template - two other CPAN template systems popular in the Perl world. Open Source.
Cheetah3 - a Python-Powered Template Engine. “Fast, Flexible, Powerful”. Open Source (MIT licence). After being unmaintained for some years, it was forked and became maintained again (as of May 2020). Supports Python 3.x and 2.7.x.
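To illustrate what a template system does at its simplest, here is a minimal sketch using Python's standard-library string.Template. It is not one of the systems listed above and is far less capable than any of them (no loops, conditionals, or includes); the placeholder names and the message text are made up for the example.

```python
from string import Template

# A template is ordinary text with named placeholders ($name, $count).
template = Template("Hello, $name! You have $count new messages.")

# substitute() fills in every placeholder and raises KeyError if one is missing.
rendered = template.substitute(name="Alice", count=3)
print(rendered)
```

The full-featured engines in this list add control flow, filters, and template inheritance on top of this basic substitution idea.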
Parser Generators
Yacc - an LALR parser generator standard, with some popular implementations such as Berkeley Yacc (byacc) (Open source, public domain) and GNU Bison (Open source, GPLed).
ANTLR - “ANTLR, ANother Tool for Language Recognition, is a language tool that provides a framework for constructing recognizers, interpreters, compilers, and translators from grammatical descriptions containing actions in a variety of target languages.” Open Source (3-clause BSD licence).
Parse-RecDescent - a parser-generator for Perl 5. Open source (same terms as Perl).
Marpa - a parser that aims to be able to parse everything in BNF. Open source (LGPL-version-3-or-later).
Regexp::Grammars - “Add grammatical parsing features to Perl 5.10 regexes”.
Parser::MGC - build simple Recursive-Descent parsers in Perl.
Lemon Parser Generator - an LALR parser generator for C that is maintained as part of the SQLite project. Open source (public domain).
"Parsing in Python: all the tools and libraries you can use"
Regular Expression Libraries
Diffing and Patching Tools
GNU Diffutils - an open source (GPLv3+) package which provides diff and other programs.
GNU patch - apply a patch/diff file. Open source (GPLv3+).
patchutils - “Patchutils is a small collection of programs that operate on patch files”. Open source.
comm - a UNIX command used to compare two files for common and distinct lines.
Meld - a GUI diff/merge tool for gtk+. Open source.
KDiff3 - a GUI diff/merge tool for KDE. Open source.
GNU wdiff - a front-end to GNU diff for comparing files on a word-per-word basis.
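For readers unfamiliar with what these tools produce and consume, the following is a small sketch that generates a unified diff using Python's standard-library difflib module. It is an illustration only and not how GNU diff is implemented; the file names old.txt and new.txt and the sample lines are hypothetical.

```python
import difflib

# Two versions of a small text file, as lists of lines (sample data).
old = ["apple\n", "banana\n", "cherry\n"]
new = ["apple\n", "blueberry\n", "cherry\n"]

# unified_diff() yields header lines, a hunk marker, and then context lines
# (prefixed with a space), removals ("-"), and additions ("+").
diff_lines = list(
    difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt")
)
print("".join(diff_lines), end="")
```

Unified format is also what "diff -u" emits, and it is one of the formats that patch accepts.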
Specialised Processors
XML Processors
libxslt, Apache Xalan, and SAXON - open-source processors for the XSLT (Extensible Stylesheet Language Transformations) language.
XQuery - a language designed to query collections of XML data.
XML transformation languages - a Wikipedia page containing more alternatives.
Standard UNIX Text Processing Tools
echo - output strings (possibly with some configurable transformations).
printf - emulates the C function and is reportedly more portable than "echo".
cat - output or concatenate files.
cut - extract sections from each line of input.
head - output the start of a stream.
tail - output the end of a stream.
paste - join multiple files horizontally.
sort - sorts input.
csplit - split files based on context lines.
join - merges lines of two files based on commonalities.
uniq - collapses adjacent duplicate lines; on sorted input, this makes the output unique.
grep - search for lines matching regular expressions.
fold - wrap long lines at a certain width.
fmt - format natural language text for readability.
par - a replacement for fmt.
sed - stream editor - a mini programming language for text processing, based on the ed text editor.
AWK - an even more full-fledged programming language for text processing in UNIX (with some quirks and idiosyncrasies).
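A common stumbling block with the tools above is that uniq only compares neighbouring lines, which is why it is usually paired with sort. The sketch below mimics that behaviour in Python to make the difference visible; the uniq function here is a hypothetical stand-in written for this example, not the real utility, and the input is sample data.

```python
from itertools import groupby


def uniq(lines):
    """Collapse runs of adjacent identical lines, as plain `uniq` does."""
    # groupby() groups consecutive equal items, so keeping one key per
    # group removes adjacent repeats but not repeats further apart.
    return [key for key, _group in groupby(lines)]


lines = ["b", "a", "a", "b", "b", "a"]

adjacent_only = uniq(lines)             # like: uniq < file
sorted_then_uniq = uniq(sorted(lines))  # like: sort file | uniq

print(adjacent_only)
print(sorted_then_uniq)
```

Without sorting, "a" and "b" each still appear twice in the result; after sorting, every distinct line appears exactly once.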
Some General-Purpose Programming Languages with Good Text Processing Support
Perl (also see the Perl Beginners’ Site).
Perl 6 (since renamed Raku) - a different language from Perl 5, with many powerful features. Also see Perl 6 Maven.
Links
Text Related Tools in the book “GNU/Linux Tools Summary” - from the Linux Documentation Project.
structured-text-tools: A list of command-line tools for manipulating structured text data - on GitHub.
“Lightweight markup language” article on Wikipedia - also contains a comparison.
“Which Open Source Wiki Works for You?” - an article I wrote about wikis (also see the update).
WikiMatrix - compare all the wiki engines.
ikiwiki - an open-source wiki engine that stores pages and history in a version control system.
“Text Parsing in Perl” and “Text Generation in Perl” pages on the Perl Beginners’ Site.
Fun Links
Stack Overflow comment on parsing HTML using regular expressions.
XSLT Facts (on this site).
Noise to Signal: “Expressionless” - a cartoon.
xkcd: “Regular Expressions” - a cartoon.
Licence
This document is Copyright by Shlomi Fish, 2012, and is available under the terms of the Creative Commons Attribution License (CC-by) 3.0 Unported (or at your option any later version of that licence).
For securing additional rights, please contact Shlomi Fish and see the explicit requirements that are spelt out for abiding by that licence.