Lists of Text Processing Tools

This is a small, hand-maintained list of automated text processing tools. You may also be interested in my list of text editors and IDEs.

GPP - a general-purpose preprocessor. Supports several alternative syntax modes. Open source (GPL).
filepp - an adaptation and extension of the C preprocessor for general-purpose use. Written in Perl. Open source (GPL-2-or-later).
chpp (Chakotay Preprocessor) - a powerful preprocessor that aims to be non-intrusive, and which can be considered a full-fledged programming system. Has been unmaintained since 1999. Open source (GPLv2).
Website Meta Language - an offline preprocessor primarily intended for HTML, but which may have some general-purpose utility. Quite slow and considered legacy. Open source (GPLv2).
m4 - a macro language with some open-source implementations, including GNU m4. I find it very vile and would recommend against using it despite the fact that it appears to be the most popular preprocessor.

Template Toolkit - a flexible and highly extensible template processing system for Perl. Open source (same terms as Perl).
ClearSilver - a language-agnostic and fast templating system written in C. Open source (2-clause BSD).
Jinja2 - a “full-featured” template engine for Python 2 and Python 3. Open source under a BSD-style licence.
Tenjin - “the fastest template engine in the world” - available for several dynamic languages.
eRuby - a Ruby-based template system with several implementations. Open source.
Smarty - a PHP Template Engine. Open Source.
HTML-Template and Text-Template - two other CPAN template systems popular in the Perl world. Open Source.
Cheetah3 - a Python-Powered Template Engine. “Fast, Flexible, Powerful”. Open Source (MIT licence). After being unmaintained for some years, it was forked and became maintained again (as of May 2020). Supports Python 3.x and 2.7.x.

Yacc - a LALR parser generator standard, with some popular implementations such as Berkeley Yacc (byacc) (Open source, public domain) and GNU Bison (Open source, GPLed).
ANTLR - “ANTLR, ANother Tool for Language Recognition, is a language tool that provides a framework for constructing recognizers, interpreters, compilers, and translators from grammatical descriptions containing actions in a variety of target languages.” Open Source (3-clause BSD licence).
Parse-RecDescent - a parser-generator for Perl 5. Open source (same terms as Perl).
Marpa - a parser that aims to be able to parse everything in BNF. Open source (LGPL-version-3-or-later).
SGLR, the Scannerless Generalized LR Parser.
Regexp::Grammars - “Add grammatical parsing features to Perl 5.10 regexes”.
Parser::MGC - build simple Recursive-Descent parsers in Perl.
Lemon Parser Generator - an LALR parser generator for C that is maintained as part of the SQLite project. Open source (public domain).
"Parsing in Python: all the tools and libraries you can use"
Stack Overflow question about parser generators for PHP.

GNU Diffutils - an open source (GPLv3+) package which provides diff and other programs.
GNU patch - apply a patch/diff file. Open source (GPLv3+).
patchutils - Patchutils is a small collection of programs that operate on patch files. Open source.
comm - a UNIX command used to compare two files for common and distinct lines.
Meld - a GUI diff/merge tool for gtk+. Open source.
KDiff3 - a GUI diff/merge tool for KDE. Open source.
GNU wdiff - a front-end to GNU diff for comparing files on a word-by-word basis.
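
As a rough illustration of what the line-based diff tools above produce, here is a minimal sketch that uses Python's standard difflib module rather than GNU diff itself. The file names "old.txt" and "new.txt" and the sample lines are made up for the example.

```python
import difflib

# Hypothetical sample data standing in for the contents of two files.
old_lines = ["apple\n", "banana\n", "cherry\n"]
new_lines = ["apple\n", "blueberry\n", "cherry\n"]


def make_unified_diff(old, new):
    """Return a unified diff of two lists of lines as a single string."""
    return "".join(
        difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt")
    )


diff_text = make_unified_diff(old_lines, new_lines)
print(diff_text)
```

The result is in the unified format, with removed lines prefixed by "-" and added lines prefixed by "+", which is the kind of input that patch-style tools expect.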

libxslt, Apache Xalan, and SAXON - open-source processors for the XSLT (Extensible Stylesheet Language Transformations) language.
XQuery - a language designed to query collections of XML data.
XML transformation languages - a Wikipedia page containing more alternatives.

echo - output strings (possibly with some configurable transformations).
- printf - emulates the C function and is reportedly more portable than "echo".
cat - output or concatenate files.
cut - extract sections from each line of input.
head - output the start of a stream.
tail - output the end of a stream.
paste - join multiple files horizontally.
sort - sorts input.
csplit - split files based on context lines.
join - merges lines of two files based on a common field.
uniq - collapses adjacent duplicate lines; the output is fully unique only if the input is sorted first.
grep - search for lines matching regular expressions.
fold - wrap long lines at a certain width.
fmt - format natural language text for readability.
par - a replacement for fmt.
sed - stream editor - a mini programming language for text processing, based on the ed text editor.
AWK - an even more full-fledged programming language for text processing in UNIX (with some quirks and idiosyncrasies).
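
The uniq entry above only collapses duplicate lines that are adjacent, which is why it is usually combined with sort. A minimal sketch of that behaviour in Python follows; the uniq function here is a hypothetical stand-in written for this example, not the real command.

```python
from itertools import groupby


def uniq(lines):
    """Collapse runs of adjacent identical lines, as uniq does by default."""
    return [key for key, _group in groupby(lines)]


# Made-up sample input with duplicates that are not all adjacent.
lines = ["a", "a", "b", "a", "b", "b"]

adjacent_only = uniq(lines)         # non-adjacent duplicates survive
fully_unique = uniq(sorted(lines))  # sorting first, like `sort | uniq`

print(adjacent_only)
print(fully_unique)
```

On the unsorted input the repeated "a" and "b" entries remain, while sorting first brings all duplicates together so that each line appears once.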

Perl (also see the Perl Beginners’ Site).
Python
Ruby
Lua
Perl 6 (renamed Raku in 2019) - a different language from Perl 5, with many powerful features. Also see Perl 6 Maven.

This document is Copyright by Shlomi Fish, 2012, and is available under the terms of the Creative Commons Attribution License (CC-by) 3.0 Unported (or at your option any later version of that licence).

For securing additional rights, please contact Shlomi Fish and see the explicit requirements that are being spelt out for abiding by that licence.
