cgrep [ -help ]
cgrep [ -version ]
In addition, cgrep allows a regular expression to be used to define a search universe and reports elements of the search universe that contain (or alternatively do not contain) occurrences of the pattern. Useful search universes include email messages, news articles and similar components of structured documents.
The behavior of cgrep differs substantially from that of grep(1) and other related utilities. Those utilities perform matching only within a line, and only lines containing the pattern may be reported. The approach taken by cgrep increases the usefulness of regular expression search, allowing matching across lines and allowing non-text and binary files to be searched. Examples are given toward the end of this man page to illustrate some of the possibilities.
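The difference from line-oriented matching can be sketched outside cgrep. The Python fragment below is only an analogy: Python's re module is not cgrep's matcher, and the sample text is invented for illustration. It looks for a word repeated across a line break, which a per-line search cannot see.

```python
import re

# Invented sample text: "the" is duplicated across a line break.
text = "We hold that the\nthe people shall decide.\n"

pattern = re.compile(r"the\nthe")

# Line-oriented search (the grep model): each line is examined in isolation,
# so a pattern that contains a newline can never match.
per_line_hits = [line for line in text.splitlines() if pattern.search(line)]

# Whole-input search: the newline is just another character,
# so a match may span lines.
whole_hit = pattern.search(text)

print(per_line_hits)
print(repr(whole_hit.group(0)) if whole_hit else None)
```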
Regular expressions are written in a notation based on that of egrep(1) and POSIX 1003.2 (excluding internationalization features) with some additions. These additions include an intersection operator, escape sequences for non-printable characters, and a macro facility.
cgrep begins its execution by reading and processing macros defined in the file $HOME/.cgreprc. It then processes its command line arguments, reading and processing in order any macro definition files specified on the command line. Finally, each input file is read and searched, with occurrences of the pattern reported as they are encountered. As each occurrence of the pattern is printed it may be optionally delimited with user-defined start and end tags. If no input file is specified, standard input is read.
In addition to the standard POSIX 1003.2 operators, we accept '&' for the intersection of two regular expressions. The precedence of the intersection operator is the same as that of union ('|'). The union and intersection operators associate left to right.
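As a rough illustration of what intersection means: a string belongs to the language of A&B exactly when it belongs to both the language of A and the language of B. The Python sketch below checks that property for a single candidate string. The helper name in_intersection is hypothetical, and Python's re syntax stands in for cgrep's notation, which Python does not implement.

```python
import re

def in_intersection(regex_a: str, regex_b: str, candidate: str) -> bool:
    """Return True when candidate is matched in full by both expressions."""
    return (re.fullmatch(regex_a, candidate) is not None
            and re.fullmatch(regex_b, candidate) is not None)

# Strings that contain "cat" and also contain "dog", in either order.
print(in_intersection(r".*cat.*", r".*dog.*", "dog chases cat"))
print(in_intersection(r".*cat.*", r".*dog.*", "cat sleeps"))
```

The mail example later in this page uses the operator in the same way, intersecting one expression that constrains the sender with another that constrains the subject.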
The characters `<' and `>' may be used to match the beginning and end of file respectively.
We make one addition to the character classes defined by POSIX 1003.2: Within a bracket expression, the sequence `[:print:]' matches any printable character. Character class membership is based on the ctype(3) macros.
Escape sequences for non-printable characters follow the syntax of ANSI C, including the sequences for hexadecimal and octal constants. Escape sequences undefined by ANSI C represent the literal character following the '\'. In particular, an escape consisting of a `\' followed by any punctuation character may be used to represent the literal punctuation mark, avoiding any special meaning of the character.
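The rule above can be sketched as a small decoder. This is a minimal Python illustration under stated assumptions, not cgrep's parser: the function name decode_escape is hypothetical, it handles one escape at a time, and it places no limit on the number of hexadecimal or octal digits, which cgrep may well do.

```python
# Single-letter escapes defined by ANSI C.
SIMPLE_ESCAPES = {
    "a": "\a", "b": "\b", "f": "\f", "n": "\n",
    "r": "\r", "t": "\t", "v": "\v",
}

def decode_escape(sequence: str) -> str:
    """Decode one backslash escape into the single character it denotes."""
    if len(sequence) < 2 or not sequence.startswith("\\"):
        raise ValueError("expected a backslash followed by at least one character")
    body = sequence[1:]
    if body[0] == "x" and len(body) > 1:
        return chr(int(body[1:], 16))   # hexadecimal constant
    if body[0] in "01234567":
        return chr(int(body, 8))        # octal constant
    if len(body) != 1:
        raise ValueError("expected a single character after the backslash")
    # Defined letter escapes map to control characters; anything else,
    # punctuation included, stands for itself.
    return SIMPLE_ESCAPES.get(body, body)

for escape in (r"\n", r"\x41", r"\101", r"\0", r"\*"):
    print(escape, "->", repr(decode_escape(escape)))
```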
Support for macros is provided. Macro calls come in two flavors: fast and tedious. A fast call consists of an `@' character followed by a single alphabetic character. A tedious macro call has the form:
[@name(parameter0, parameter1, ...)]
where each of the up to 9 parameters is a regular expression. If the macro requires no parameters, the bracket-enclosed parameter list is omitted completely. Be careful not to put any extra whitespace in the parameter list; this extra whitespace will be counted as part of the parameter.
An un-parameterized macro definition has the form:
name=regular-expression
and a parameterized macro definition has the form:
name#n=regular-expression
where the number of parameters is indicated by a single digit following the `#' character. Within the body of a parameterized macro, the actual parameters may be referenced as `#1' through `#9'. A macro name must start with an alphabetic character, and may include only alphanumeric characters and the character `_'. Be careful not to put any extra whitespace after the `='; this whitespace counts as part of the regular expression.
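The definition and call syntax can be illustrated with a short Python sketch. It assumes that expansion is plain textual substitution, which this page does not state, and it simplifies heavily: fast calls are ignored, calls may not be nested, and parameters may not contain commas. The names parse_definitions, expand and CALL are hypothetical; the macro definitions are the mail macros from the examples in this page.

```python
import re

def parse_definitions(lines):
    """Map each macro name to (parameter count, body)."""
    macros = {}
    for line in lines:
        head, body = line.split("=", 1)   # whitespace after '=' stays in the body
        name, _, count = head.partition("#")
        macros[name] = (int(count) if count else 0, body)
    return macros

# A tedious call: [@name] or [@name(p1,p2,...)].
CALL = re.compile(r"\[@([A-Za-z][A-Za-z0-9_]*)(?:\((.*?)\))?\]")

def expand(expression, macros):
    """Replace every tedious macro call in expression with its body."""
    def substitute(match):
        name, raw_params = match.group(1), match.group(2)
        if name not in macros:
            raise ValueError(f"undefined macro: {name}")
        count, body = macros[name]
        params = raw_params.split(",") if raw_params is not None else []
        if len(params) != count:
            raise ValueError(f"macro {name} expects {count} parameter(s)")
        # Replace #1..#9 in a single pass so parameter text is never rescanned.
        return re.sub(r"#([1-9])", lambda m: params[int(m.group(1)) - 1], body)
    return CALL.sub(substitute, expression)

macros = parse_definitions([
    "Mail=^From .*(^From |>)",
    "From#1=^From:[^$]*#1",
    "Re#1=^Subject:[^$]*#1",
])
print(expand("[@Mail]", macros))
print(expand("(.*[@From([Cc]owan)].*)&(.*[@Re(brewpubs)].*)", macros))
```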
cgrep '^.*United[[:space:]]*States.*$' constitution.txt
will print the lines of text that contain it. The command
cgrep -list 'the\nthe' *.txt
checks for a typing error that's hard to spot visually and prints the names of the files that contain it. The command
cgrep -insensitive -U '/\*.*\*/' POSIX cgrep.c
prints all the comments in the C source file cgrep.c that contain the string ``posix'' in any combination of lower and upper case letters (under some mild assumptions). The command
cgrep '[^[:print:]][[:print:]]{4,}[\n\0]' a.out
reports strings of four or more printable characters ending in a newline or null character that appear in the executable file a.out. Each match is printed on a separate line. If the -binary flag were specified, the resulting matches would be run together without separating newlines. Each match is started by an unprintable character and may contain superfluous null characters. The output could be piped to
cgrep -binary '[[:print:]\n]'
to strip these unprintable characters (or the tr(1) command could be used for the same purpose). As a final example, cgrep may be used to search a mail file and extract mail based on patterns in the sender or subject lines, or in other parts of the header and body. Standard macros for handling mail may be defined in the $HOME/.cgreprc file:
Mail=^From .*(^From |>)
From#1=^From:[^$]*#1
Re#1=^Subject:[^$]*#1
The command
cgrep -U '[@Mail]' '(.*[@From([Cc]owan)].*)&(.*[@Re(brewpubs)].*)' mbox
would then extract all mail messages in the file mbox that are from Cowan and are on the subject of brewpubs. It's then necessary to pipe the output through
sed '/^From $/d'
or equivalently
cgrep -V '^.*$' '^From $'
to strip out the extra characters needed to detect the end of each mail message and create a validly formatted mail file.
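For readers who want to see the idea of the mail example without cgrep at hand, the Python sketch below approximates its outcome: a universe of messages is carved out of the input, and only the messages satisfying both header conditions are kept. It is an analogy under stated assumptions, not a description of how cgrep works internally. The mailbox contents are invented, Python's re syntax replaces cgrep's, and a lookahead is used to end each message, so no clean-up step is needed afterwards.

```python
import re

# Invented two-message mailbox, for illustration only.
mbox = (
    "From cowan Mon Feb  6 10:00:00 1995\n"
    "From: Cowan\n"
    "Subject: brewpubs in Waterloo\n"
    "\n"
    "Any recommendations?\n"
    "From smith Tue Feb  7 11:00:00 1995\n"
    "From: Smith\n"
    "Subject: meeting\n"
    "\n"
    "See you at noon.\n"
)

# Universe: each element starts at a line beginning "From " and runs up to
# the next such line or the end of input (compare the Mail macro).
messages = re.findall(r"(?ms)^From .*?(?=^From |\Z)", mbox)

# Pattern: both header conditions must hold within the same message,
# which is the role the intersection operator plays in the cgrep command.
from_cowan = re.compile(r"(?m)^From:.*[Cc]owan")
about_brewpubs = re.compile(r"(?m)^Subject:.*brewpubs")
selected = [m for m in messages
            if from_cowan.search(m) and about_brewpubs.search(m)]

print(len(messages), "messages in the universe")
for message in selected:
    print(message, end="")
```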
POSIX 1003.2, section 2.8 (Regular Expression Notation).
Charles L. A. Clarke and Gordon V. Cormack. On the use of Regular Expressions for Searching Text. University of Waterloo Computer Science Department Technical Report number CS-95-07, University of Waterloo, Waterloo, Ontario N2L 3G7, Canada. February 1995. ftp://plg.uwaterloo.ca/pub/mt/TechReports/CS-95-07/regexp.ps
The syntax for macros is ugly. An undefined macro is reported as a syntax error.
This man page needs to be extended with a complete and precise description of the regular expression format.
The software is an alpha release. Report bugs to mt@plg.uwaterloo.ca.