zipfR - LNRE models of word frequency distributions in R

zipfR is a library for the statistical computing environment R that implements several statistical models for the distribution of word frequencies in natural language. These models can also be applied to many other linguistic and non-linguistic quantities and are often called LNRE models (for large number of rare events). Since its first public release in summer 2006, zipfR has attracted an international user community. Currently, work is under way on a new release that will feature more sophisticated models and other improvements. More information can be found on the project homepage:

UCS - Association measures for collocation extraction

The UCS toolkit is a collection of libraries and scripts for the analysis of cooccurrence data, based on statistical association measures. Written in Perl and R, it provides reference implementations for a wide range of association measures, as well as various analysis and visualization techniques that can be used to achieve a better understanding of these measures.
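
To give a concrete idea of what an association measure computes, here is a minimal Python sketch (not UCS code, and not its API) that derives two widely used measures, pointwise mutual information and the log-likelihood ratio, from a 2x2 contingency table of cooccurrence counts. The function names and the example counts are hypothetical and chosen only for illustration.

```python
import math

def expected_frequencies(o11, o12, o21, o22):
    """Expected cell counts under independence, from the row and column totals."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    return (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)

def pointwise_mi(o11, o12, o21, o22):
    """log2 of observed over expected cooccurrence frequency (requires o11 > 0)."""
    e11 = expected_frequencies(o11, o12, o21, o22)[0]
    return math.log2(o11 / e11)

def log_likelihood(o11, o12, o21, o22):
    """Log-likelihood ratio G2 = 2 * sum(O * ln(O / E)); empty cells contribute 0."""
    observed = (o11, o12, o21, o22)
    expected = expected_frequencies(o11, o12, o21, o22)
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Hypothetical counts: the pair cooccurs 10 times in a sample of 10,000 pairs.
mi = pointwise_mi(10, 90, 90, 9810)
g2 = log_likelihood(10, 90, 90, 9810)
```

Both measures compare the observed cooccurrence frequency with the frequency expected if the two words were independent; they differ in how strongly they reward high-frequency pairs.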

The UCS toolkit accompanies my PhD thesis "The Statistics of Word Cooccurrences", and a new release including code and data for (almost) all results presented in the thesis is currently in preparation. The software is open source and can be obtained from

Indexing Google's Web 1T 5-grams database with Perl and SQLite

Web1T5-Easy is a collection of Perl scripts for indexing and querying the Google Web 1T 5-grams database with the open-source SQLite database engine. The package offers a convenient way to build an interactively searchable version of the Web1T5 database, including a full collocation analysis and a simple but powerful Web interface. It is not designed as a high-performance implementation: it requires a considerable amount of disk space (approx. 220 GiB) as well as patience (indexing took ca. 2 weeks on a state-of-the-art server in 2007).
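
The general idea of indexing n-gram counts in SQLite can be sketched in a few lines. The Python example below is only an illustration under simplifying assumptions: the table layout, the function names and the toy bigram counts are hypothetical and do not reflect the actual Web1T5-Easy schema or the real Web 1T data.

```python
import sqlite3

def build_index(ngrams):
    """Load (w1, w2, frequency) tuples into an in-memory SQLite table and index w1."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bigrams (w1 TEXT, w2 TEXT, freq INTEGER)")
    conn.executemany("INSERT INTO bigrams VALUES (?, ?, ?)", ngrams)
    # The index lets lookups by first word avoid a full table scan.
    conn.execute("CREATE INDEX idx_bigrams_w1 ON bigrams (w1)")
    conn.commit()
    return conn

def collocates(conn, word, limit=10):
    """Return the most frequent right-hand neighbours of `word`."""
    cur = conn.execute(
        "SELECT w2, freq FROM bigrams WHERE w1 = ? ORDER BY freq DESC LIMIT ?",
        (word, limit),
    )
    return cur.fetchall()

# Made-up toy counts, for illustration only.
toy_counts = [
    ("strong", "tea", 120),
    ("strong", "coffee", 95),
    ("powerful", "computer", 80),
    ("strong", "wind", 60),
]
conn = build_index(toy_counts)
top = collocates(conn, "strong", limit=2)
```

At the scale of the full database, the index accounts for a large share of the disk space and build time, which is consistent with the resource requirements mentioned above.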

You can try an online demo of the Web1T5-Easy Web interface or download the latest release version 1.1.

GOPHER styles and the GopherPHP framework for simple Web sites

GOPHER is a set of CSS stylesheets for basic single-page or multi-page Web sites (e.g. the homepage of a software project, a conference Web site, etc.). They have been designed for a clear layout, good readability and compatibility with all major Web browsers. The acronym GOPHER stands for Good Old Plain HTML for the Occasional WritER. The associated GopherPHP framework automatically generates dynamic navigation bars for small Web sites and offers some other useful features (such as allowing users to change the font size of Web pages).

If you want to set up a simple Web site without worrying about CSS specifications and the details of page navigation, GOPHER is just what you need! More information and downloads can be found on the GOPHER homepage:

Perltjes - Small but useful Perl scripts and modules

HotSync - Hoover


HotSync is a convenient wrapper around the rsync, ssh and GNU tar programs for synchronizing files between two computers (over a network connection) as well as creating backup archives of directory trees (and storing them on a remote host). Synchronization and backup archive tasks are defined in a simple configuration file and can be listed and invoked from the command line or through an interactive console.
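
The wrapper idea can be illustrated with a short Python sketch that turns one task description into an rsync command line. This is not HotSync code: the task dictionary, its keys and the host name are hypothetical stand-ins for whatever the real configuration file contains. Only standard rsync options are used (-a for archive mode, -z for compression, -e to select ssh as the transport, --delete to remove files that no longer exist at the source).

```python
import shlex

def rsync_command(task):
    """Build an rsync argument list for one synchronization task."""
    cmd = ["rsync", "-az", "-e", "ssh"]
    if task.get("delete", False):
        cmd.append("--delete")
    # A trailing slash makes rsync copy the directory contents, not the directory itself.
    cmd.append(task["source"].rstrip("/") + "/")
    cmd.append(f'{task["host"]}:{task["target"]}')
    return cmd

# Hypothetical task definition, as it might be read from a configuration file.
task = {
    "source": "/home/user/papers",
    "host": "backup.example.org",
    "target": "/srv/mirror/papers",
    "delete": True,
}
cmd = rsync_command(task)
printable = shlex.join(cmd)  # shell-quoted form, e.g. for a dry-run listing
```

The resulting list could be passed to subprocess.run(cmd, check=True) to perform the transfer.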


When TeX/LaTeX is used to format a document, many temporary files are created in the same directory as the source file. Unfortunately, TeX does not offer an easy way to remove these temporary files when they are no longer needed. The Hoover program was designed to take over this job. It also deletes Emacs backup, recovery and font lock cache files.
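
A cleanup of this kind can be sketched in Python as follows. This is an illustration, not the Hoover implementation: the list of suffixes is an assumption covering a few common LaTeX by-products, Emacs backups are recognized by a trailing "~" in the file name, and the function names are hypothetical.

```python
from pathlib import Path

# Assumed set of temporary-file suffixes; a real tool would likely cover more.
TEMP_SUFFIXES = {".aux", ".log", ".toc", ".out", ".bbl", ".blg"}

def find_temp_files(directory):
    """List regular files that look like TeX temporaries or Emacs backups (name~)."""
    found = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        if path.suffix in TEMP_SUFFIXES or path.name.endswith("~"):
            found.append(path)
    return found

def remove_temp_files(directory, dry_run=True):
    """Delete the files found above; with dry_run=True only report them."""
    targets = find_temp_files(directory)
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets
```

Defaulting to a dry run is a deliberate choice in this sketch, since deleting files by pattern is easy to get wrong.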

