Extending ctags with Regex parser (optlib)

Maintainer: Masatake YAMATO <yamato@redhat.com>

Option files

An “option” file is a file in which command line options are written line by line. ctags loads it and runs as if the options in the file had been passed on the command line.

The following file is an example of an option file:
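
The example file itself is missing at this point. A minimal sketch, written to be consistent with the equivalent command line given afterwards (the comment text is illustrative, not taken from the source):

```ctags
# Exclude directories that don't contain real code
--exclude=Units
    # indentation is ignored
    --exclude=tinst-root
--exclude=Tmain
```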

The character # can be used as a start marker of a line comment. Whitespaces at the start of lines are ignored during loading.

And it works exactly as if we had called:

ctags --exclude=Units --exclude=tinst-root --exclude=Tmain

There are two categories of option files, though they both contain command line options: preload and optlib option files.

Preload option file

Preload option files are option files loaded by ctags automatically at start-up time. The set of files loaded at start-up differs substantially from that of Exuberant-ctags.

At start-up time, Universal-ctags loads files having .ctags as a file extension under the following statically defined directories:

  1. $XDG_CONFIG_HOME/ctags, or $HOME/.config/ctags if $XDG_CONFIG_HOME is not defined (on other than Windows)
  2. $HOME/.ctags.d
  3. $HOMEDRIVE$HOMEPATH/ctags.d (in Windows)
  4. .ctags.d
  5. ctags.d

ctags visits the directories in the order listed above for preloading files. ctags loads files having .ctags as file extension in alphabetical order (strcmp(3) is used for comparing, so for example .ctags.d/ZZZ.ctags will be loaded before .ctags.d/aaa.ctags).
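
The ordering rule can be checked without ctags. A minimal sketch, assuming a POSIX shell and using sort in the C locale as a stand-in for strcmp(3) byte comparison (the file names are the ones from the paragraph above):

```shell
# LC_ALL=C makes sort compare raw bytes, which is how strcmp(3) orders strings.
# Uppercase letters have smaller byte values than lowercase ones.
loaded_first=$(printf '%s\n' aaa.ctags ZZZ.ctags | LC_ALL=C sort | head -n 1)
echo "$loaded_first"
```

A practical consequence is that numeric prefixes (for example 10-base.ctags, 20-local.ctags, both hypothetical names) give a predictable load order.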

Quoted from man page of Exuberant-ctags:

           /ctags.cnf (on MSDOS, MSWindows only)
           $HOME/ctags.cnf (on MSDOS, MSWindows only)
           ctags.cnf (on MSDOS, MSWindows only)
                          If any of these configuration files exist, each will
                          be expected to contain a set of default options
                          which are read in the order listed when ctags
                          starts, but before the CTAGS environment variable is
                          read or any command line options are read.  This
                          makes it possible to set up site-wide, personal or
                          project-level defaults. It is possible to compile
                          ctags to read an additional configuration file
                          before any of those shown above, which will be
                          indicated if the output produced by the --version
                          option lists the "custom-conf" feature. Options
                          appearing in the CTAGS environment variable or on
                          the command line will override options specified in
                          these files. Only options will be read from these
                          files.  Note that the option files are read in
                          line-oriented mode in which spaces are significant
                          (since shell quoting is not possible). Each line of
                          the file is read as one command line parameter (as
                          if it were quoted with single quotes). Therefore,
                          use new lines to indicate separate command-line

What follows explains the differences and their intentions…

Directory oriented configuration management

Exuberant-ctags provides a way to customize ctags with options like --langdef=<LANG> and --regex-<LANG>. These options are powerful and make ctags popular for programmers.

Universal-ctags extends this idea; we have added new options for defining a parser, and have extended existing options. Defining a new parser with the options is more than “customizing” in Universal-ctags.

To make a parser defined with these options easier to maintain, you can put each language parser in its own option file. Universal-ctags doesn’t preload a single file. Instead, it loads all the files having the .ctags extension under the directories listed above. If you have multiple parser definitions, put them in different files.

Avoiding option incompatibility issues

The Universal-ctags options are different from those of Exuberant-ctags, therefore Universal-ctags doesn’t load any of the files Exuberant-ctags loads at start-up. Otherwise there would be incompatibility issues if Exuberant-ctags loaded an option file that used a newly introduced option in Universal-ctags, and vice versa.

No system wide configuration

To make the preload path list short and because it was rarely ever used, Universal-ctags does not load any option files for system wide configuration. (i.e., no /etc/ctags.d)

Using .ctags for the file extension

Extensions .cnf and .conf are obsolete. Use the unified extension .ctags only.

Optlib option file

From a syntax perspective, there is no difference between optlib option files and preload option files; ctags options are written line by line in a file.

Optlib option files are option files that are not loaded automatically at start-up time. To load one, specify its pathname explicitly with the --options=PATHNAME option. The pathname can be just the filename if the file is in the current directory.

Exuberant-ctags has the --options option, but you can only specify a single file to load. Universal-ctags extends the option in two aspects:

  • You can specify a directory, to load all the files in that directory.
  • You can specify a PATH list to look in. See next section for details.

Specifying a directory

If you specify a directory instead of a file as the argument for the --options=PATHNAME, Universal-ctags will load all files having a .ctags extension under said directory in alphabetical order.

Specifying an optlib PATH list

Much like a command line shell, ctags has an “optlib PATH list” in which it can look for a file (or directory) to load.

When loading a file (or directory) specified with --options=PATHNAME, ctags first checks whether PATHNAME is an absolute path or a relative path. For this purpose, ctags treats a path starting with ‘/’ or ‘.’ as absolute. If PATHNAME is an absolute path, ctags tries to load it immediately.

If PATHNAME is instead a relative path, ctags does two things. First, it looks for the file (or directory) in the “optlib PATH list” and tries to load it.

Second, if the file doesn’t exist in the PATH list, ctags treats PATHNAME as a path relative to the working directory and loads the file.

By default, the optlib PATH list is empty. To set or add a directory path to the list, use --optlib-dir=PATH.

For setting (adding one after clearing):
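
The option line is missing here. A sketch of the setting form, where PATH is a placeholder for the directory:

```ctags
--optlib-dir=PATH
```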


For adding:
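
The option line is missing here as well. A sketch of the adding form, where the leading + keeps the existing list instead of clearing it (PATH is again a placeholder):

```ctags
--optlib-dir=+PATH
```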


Tips for writing an option file

  • Use --quiet --options=NONE to disable preloading.
  • --_echo=MSG and --_force-quit=[NUM] options are introduced for debugging the process of loading option files. See “OPTIONS” section of ctags-optlib(7).
  • Universal-ctags has an optlib2c script that translates an option file into C source code. Your optlib parser can thus easily become a built-in parser if you contribute it to Universal-ctags on GitHub. You could be famous! Examples are in the optlib directory in the Universal-ctags source tree.

Regular expression (regex) engine

Universal-ctags currently uses the same regex engine as Exuberant-ctags: the POSIX.2 regex engine in GNU glibc-2.10.1. By default it uses the Extended Regular Expressions (ERE) syntax, as used by most engines today; however it does not support many of the “modern” extensions such as lazy captures, non-capturing grouping, atomic grouping, possessive quantifiers, look-ahead/behind, etc. It is also notoriously slow when backtracking, and has some known “quirks” with respect to escaping special characters in bracket expressions.

For example, a pattern of [^\]]+ is invalid in POSIX.2, because the ] is not special inside a bracket expression, and thus should not be escaped. Most regex engines ignore this subtle detail in POSIX.2, and instead allow escaping it with \] inside the bracket expression and treat it as the literal character ]. GNU glibc, however, does not generate an error but instead considers it undefined behavior, and in fact it will match very odd things. Instead you must use the more unintuitive [^]]+ syntax. The same is technically true of other special characters inside a bracket expression, such as [^\)]+, which should instead be [^)]+. The [^\)]+ will appear to work usually, but only because what it is really doing is matching any character but \ or ). The only exceptions for using \ inside a bracket expression are for \t and \n, which ctags converts to their single literal character control codes before passing the pattern to glibc.
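
The bracket-expression rule can be tried outside ctags. A small sketch, using grep -E as a stand-in POSIX ERE engine (an assumption: it is not necessarily the engine ctags is built with, and -o is a common grep extension rather than part of POSIX):

```shell
# In a POSIX bracket expression, ] is literal when it comes first
# (right after the ^ of a negated set), so no backslash is needed.
first_match=$(printf '%s\n' 'key]rest' | grep -oE '[^]]+' | head -n 1)
echo "$first_match"
```

The pattern stops at the first ] in the input, so only the text before it is reported as the first match.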

Another detail to keep in mind is how the regex engine treats newlines. Universal-ctags compiles the regular expressions in the --regex-<LANG> and --mline-regex-<LANG> options with REG_NEWLINE set. What that means is documented in the POSIX spec. One obvious effect is that the regex special dot any-character . does not match newline characters, the ^ anchor does match right after a newline, and the $ anchor matches right before a newline. A more subtle issue is this text from the Regular Expressions chapter: “the use of literal <newline>s or any escape sequence equivalent produces undefined results”. What that means is using a regex pattern with [^\n]+ is invalid, and indeed in glibc produces very odd results. Never use \n in patterns for --regex-<LANG>, and never use them in non-matching bracket expressions for --mline-regex-<LANG> patterns. For the experimental --_mtable-regex-<LANG> you can safely use \n because that regex is not compiled with REG_NEWLINE.

You should always test your regex patterns against test files with strings that do and do not match. Pay particular attention to cases where a pattern should not match, and to how much it matches when it should. A common error is forgetting that a POSIX.2 ERE engine is always greedy; the * and + quantifiers match as much as possible, before backtracking from the end of their match.

For example this pattern:


Will match this entire string, not just the first part:

foobar, bar, and even more bar
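
The pattern itself is not shown above. Assuming it is something like foo.*bar, the greedy behavior can be checked with grep -E, used here as a stand-in for a POSIX ERE engine rather than as the engine ctags uses:

```shell
# .* is greedy: it runs to the last "bar" on the line, not the first.
text='foobar, bar, and even more bar'
matched=$(printf '%s\n' "$text" | grep -oE 'foo.*bar')
echo "$matched"
```

Because POSIX ERE has no lazy quantifier, the usual workaround is a negated bracket expression that cannot run past the intended delimiter.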

Regex option argument flags

Many regex-based options described in this document support additional arguments in the form of long flags. Long flags are specified with surrounding { and }.

The general format and placement is as follows:
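
The format line is missing here. As a hedged reconstruction based on the examples in this section (the placeholder names are assumptions, not the official synopsis from the man page):

```text
--regex-<LANG>=/<PATTERN>/<REPLACEMENT>/[<KIND-SPEC>/]<FLAGS>
```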


Some examples:

--regex-Pod=/^=head1[ \t]+(.+)/\1/c/
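
Only one example survives above. Two further illustrative lines, using a hypothetical language Foo that is not part of ctags. The first adds a long flag after the kind-spec; the second omits the kind-spec, so only two slashes follow the pattern:

```ctags
--regex-Foo=/set=([^;]+)/\1/v/{icase}
--regex-Foo=/^#//{exclusive}
```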

Note that the last example only has two / forward slashes following the regex pattern, as a shortened form when no kind-spec exists.

The --mline-regex-<LANG> option also follows the above format. The experimental --_mtable-regex-<LANG> option follows a slightly modified version as well.

The --langdef=<LANG> option also supports long flags, but not using forward-slash separators.

Regex control flags

The regex matching can be controlled by adding flags to the --regex-<LANG>, --mline-regex-<LANG>, and experimental --_mtable-regex-<LANG> options. This is done either with the single-character short flags b, e and i, as explained in the ctags.1 man page, or with the long flags described earlier. The long flags require more typing but are much more readable.

The mapping between the older short flag names and long flag names is:

short flag   long flag   description
b            basic       POSIX basic regular expression syntax.
e            extend      POSIX extended regular expression syntax (default).
i            icase       Case-insensitive matching.

So the following --regex-<LANG> expression:


is the same as:
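
Both option lines are missing above. Going by the mapping table, a pair of equivalent options would look like this (regexp, replacement and kind-spec are placeholders):

```ctags
# short flags
--regex-<LANG>=/regexp/replacement/kind-spec/bi
# the same, written with long flags
--regex-<LANG>=/regexp/replacement/kind-spec/{basic}{icase}
```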


The characters { and } may not be suitable for command line use, but long flags are mostly intended for option files.

Exclusive flag in regex

By default, lines read from the input files will be matched against all the regular expressions defined with --regex-<LANG>. Each successfully matched regular expression will emit a tag.

In some cases another policy, exclusive-matching, is preferable to the all-matching policy. Exclusive-matching means that once one regular expression matches an input line, the remaining regular expressions are not tried for that line.

For specifying exclusive-matching the flags exclusive (long) and x (short) were introduced. For example, this is used in optlib/gdbinit.ctags for ignoring comment lines in gdb files, as follows:
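
The line from optlib/gdbinit.ctags is not reproduced here. Assuming that file has not changed, it is along these lines:

```ctags
--regex-Gdbinit=/^#//{exclusive}
```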


Comments in gdb files start with #, so the above line is the first regex line in gdbinit.ctags. Once it matches, subsequent regexes are not tried for that input line.

If an empty name pattern (//) is used for the --regex-<LANG> option, ctags warns that the option is being used incorrectly. However, if the exclusive or x flag is specified, the warning is suppressed.

NOTE: This flag does not make sense in the multi-line --mline-regex-<LANG> option nor the multi-table --_mtable-regex-<LANG> option.

Experimental flags


These flags are experimental. They apply to all regex option types: basic --regex-<LANG>, multi-line --mline-regex-<LANG>, and the experimental multi-table --_mtable-regex-<LANG> option.


The {_extra=XNAME} flag indicates the tag should only be generated if the given ‘extra’ type is enabled, as explained in Conditional tagging with extras.

The {_field=FIELDNAME:VALUE} flag allows a regex match to add additional custom fields to the generated tag entry, as explained in Adding custom fields to the tag output.

The {_role=ROLE} flag allows a regex match to generate a reference tag entry and specify the role of the reference, as explained in Capturing reference tags.

The {_anonymous=PREFIX} flag allows a regex match to generate an anonymous tag entry. ctags generates a name starting with PREFIX and emits it. This flag is useful for recording the position of a language object that has no name. A lambda function in a functional programming language is a typical example.

Consider the following input (input.foo):

(let ((f (lambda (x) (+ 1 x))))

Consider the following optlib file (foo.ctags):

--kinddef-Foo=l,lambda,lambda functions
--regex-Foo=/.*\(lambda .*//l/{_anonymous=L}

You get the following tags file:

$ u-ctags  --options=foo.ctags -o - /tmp/input.foo
Le4679d360100   /tmp/input.foo  /^(let ((f (lambda (x) (+ 1 x))))$/;"   l

Ghost kind in regex parser

If a whitespace character is used as a kind letter, the kind is never printed when ctags is called with the --list-kinds option. This kind is automatically assigned to an empty name pattern.

Normally you don’t need to know this.

Scope tracking in a regex parser

For the {scope=..} flag used for scope tracking, see the “FLAGS FOR --regex-<LANG> OPTION” section of ctags-optlib(7).

Example 1:

# in /tmp/input.foo
class foo:
    def bar(baz):
class goo:
    def gar(gaz):
# in /tmp/foo.ctags:
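
The contents of foo.ctags are missing. A sketch that would produce the output shown for this example, assuming {scope=set} on the class line and {scope=ref} on the def line (the exact patterns are a reconstruction, not the author's text):

```ctags
--langdef=Foo
--map-Foo=+.foo
--regex-Foo=/^class[[:blank:]]+([[:alpha:]]+):/\1/c,class/{scope=set}
--regex-Foo=/^[[:blank:]]+def[[:blank:]]+([[:alpha:]]+).*:/\1/d,definition/{scope=ref}
```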

$ ctags --options=/tmp/foo.ctags -o - /tmp/input.foo
bar     /tmp/input.foo  /^    def bar(baz):$/;" d       class:foo
foo     /tmp/input.foo  /^class foo:$/;"        c
gar     /tmp/input.foo  /^    def gar(gaz):$/;" d       class:goo
goo     /tmp/input.foo  /^class goo:$/;"        c

Example 2:

// in /tmp/input.pp
class foo {
    include bar
}
# in /tmp/pp.ctags:
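
The contents of pp.ctags are missing. A sketch using {scope=push} on the opening of a class and {scope=pop} on the closing brace, which is the usual pairing for brace-delimited scopes (the patterns and kind descriptions are assumptions):

```ctags
--langdef=pp
--map-pp=+.pp
--regex-pp=/^class[[:blank:]]*([[:alnum:]]+)[[:blank:]]*\{/\1/c,class,classes/{scope=push}
--regex-pp=/^[[:blank:]]*include[[:blank:]]*([[:alnum:]]+).*/\1/v,variable,variables/{scope=ref}
--regex-pp=/^[[:blank:]]*\}.*//{scope=pop}{exclusive}
```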

$ ctags --options=/tmp/pp.ctags -o - /tmp/input.pp
bar     /tmp/input.pp   /^    include bar$/;"   v       class:foo
foo     /tmp/input.pp   /^class foo {$/;"       c

NOTE: This flag doesn’t work well with --mline-regex-<LANG>.

Overriding the letter for file kind

One of the built-in tag kinds in Universal-ctags is the F file kind. Overriding the letter for file kind is not allowed in Universal-ctags.


Don’t use F as a kind letter in your parser. (See issue #317 on GitHub.)

Generating fully qualified tags automatically from scope information

If scope fields are filled properly with {scope=…} regex flags, you can use the field values for generating fully qualified tags. About the {scope=..} flag itself, see “FLAGS FOR --regex-<LANG> OPTION” section of ctags-optlib(7).

Specify {_autoFQTag} at the end of the --langdef=<LANG> option, as in --langdef=Foo{_autoFQTag}, to make ctags generate fully qualified tags automatically.

The (ctags global) default separator for combining names into a fully qualified tag is “.”. You can customize separators with the --_scopesep-<LANG>=... option.


input.foo:

class X
   var y


foo.ctags:

--regex-foo=/class ([A-Z]*)/\1/c/{scope=push}
--regex-foo=/[ \t]*var ([a-z]*)/\1/v/{scope=ref}
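
The two --regex-foo lines alone do not define the language or enable the feature. A fuller foo.ctags sketch, in which the --map and --kinddef lines are assumptions chosen so that the kind names match the class:X scope in the output shown for this example:

```ctags
--langdef=foo{_autoFQTag}
--map-foo=+.foo
--kinddef-foo=c,class,classes
--kinddef-foo=v,var,variables
--regex-foo=/class ([A-Z]*)/\1/c/{scope=push}
--regex-foo=/[ \t]*var ([a-z]*)/\1/v/{scope=ref}
```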


$ u-ctags --quiet --options=NONE --options=./foo.ctags -o - input.foo
X       input.foo       /^class X$/;"   c
y       input.foo       /^      var y$/;"       v       class:X

$ u-ctags --quiet --options=NONE --options=./foo.ctags --extras=+q -o - input.foo
X       input.foo       /^class X$/;"   c
X.y     input.foo       /^      var y$/;"       v       class:X
y       input.foo       /^      var y$/;"       v       class:X

“X.y” is printed as a fully qualified tag when --extras=+q is given.

Customizing scope separators

To customize separators for a language that uses {_autoFQTag}, use the --_scopesep-<LANG>=[<parent-kindLetter>]/<child-kindLetter>:<sep> option.


parent-kindLetter

The kind letter for a tag of outer-scope.

You can use * as a wildcard meaning “any kind” for a tag of outer-scope.

If you omit parent-kindLetter, the separator is used as a prefix for tags having the kind specified with child-kindLetter. This prefix can be used to refer to the global namespace or similar concepts if the language has one.

child-kindLetter

The kind letter for a tag of inner-scope.

You can use * as a wildcard meaning “any kind” for a tag of inner-scope.

sep

In a qualified tag, if the outer-scope has kind parent-kindLetter and the inner-scope has kind child-kindLetter, then sep is inserted between the scope names in the generated tags file.

Specifying * as both parent-kindLetter and child-kindLetter sets sep as the language default separator. It is used as a fallback.

Specifying * as child-kindLetter and omitting parent-kindLetter sets sep as the language default prefix. It is used as a fallback.

NOTE: There is no ctags global default prefix.

NOTE: The --_scopesep-<LANG>=... option affects only a parser that enables _autoFQTag. A parser that builds fully qualified tags manually ignores the option.

Let’s see an example. The input file is written in Tcl. The Tcl parser is not an optlib parser. However, it uses the _autoFQTag feature internally, so the --_scopesep-Tcl=... option works with it. The Tcl parser defines two kinds: n (namespace) and p (procedure).

By default, the Tcl parser uses :: as the scope separator. It also uses :: as the root prefix.

namespace eval N {
        namespace eval M {
                proc pr0 {s} {
                        puts $s
                }
        }
}

proc pr1 {s} {
        puts $s
}

M is defined under the scope of N. pr0 is defined under the scope of M. N and pr1 are at top level (so they are candidates for having a prefix added). M and N are language objects with n (namespace) kind. pr0 and pr1 are language objects with p (procedure) kind.

$ ctags -o - --extras=+q input.tcl
::N     input.tcl       /^namespace eval N {$/;"        n
::N::M  input.tcl       /^      namespace eval M {$/;"  n       namespace:::N
::N::M::pr0     input.tcl       /^              proc pr0 {s} {$/;"      p       namespace:::N::M
::pr1   input.tcl       /^proc pr1 {s} {$/;"    p
M       input.tcl       /^      namespace eval M {$/;"  n       namespace:::N
N       input.tcl       /^namespace eval N {$/;"        n
pr0     input.tcl       /^              proc pr0 {s} {$/;"      p       namespace:::N::M
pr1     input.tcl       /^proc pr1 {s} {$/;"    p

Let’s change the default separator to ->:

$ ctags -o - --extras=+q --_scopesep-Tcl='*/*:->' input.tcl
::N     input.tcl       /^namespace eval N {$/;"        n
::N->M  input.tcl       /^      namespace eval M {$/;"  n       namespace:::N
::N->M->pr0     input.tcl       /^              proc pr0 {s} {$/;"      p       namespace:::N->M
::pr1   input.tcl       /^proc pr1 {s} {$/;"    p
M       input.tcl       /^      namespace eval M {$/;"  n       namespace:::N
N       input.tcl       /^namespace eval N {$/;"        n
pr0     input.tcl       /^              proc pr0 {s} {$/;"      p       namespace:::N->M
pr1     input.tcl       /^proc pr1 {s} {$/;"    p

Let’s define ‘^’ as default prefix:

$ ctags -o - --extras=+q --_scopesep-Tcl='*/*:->' --_scopesep-Tcl='/*:^' input.tcl
M       input.tcl       /^      namespace eval M {$/;"  n       namespace:^N
N       input.tcl       /^namespace eval N {$/;"        n
^N      input.tcl       /^namespace eval N {$/;"        n
^N->M   input.tcl       /^      namespace eval M {$/;"  n       namespace:^N
^N->M->pr0      input.tcl       /^              proc pr0 {s} {$/;"      p       namespace:^N->M
^pr1    input.tcl       /^proc pr1 {s} {$/;"    p
pr0     input.tcl       /^              proc pr0 {s} {$/;"      p       namespace:^N->M
pr1     input.tcl       /^proc pr1 {s} {$/;"    p

Let’s override the separator used for combining a namespace and a procedure with ‘+’. (For combining a namespace with another namespace, ctags still uses the default separator.)

$ ctags -o - --extras=+q --_scopesep-Tcl='*/*:->' --_scopesep-Tcl='/*:^' \
                        --_scopesep-Tcl='n/p:+' input.tcl
M       input.tcl       /^      namespace eval M {$/;"  n       namespace:^N
N       input.tcl       /^namespace eval N {$/;"        n
^N      input.tcl       /^namespace eval N {$/;"        n
^N->M   input.tcl       /^      namespace eval M {$/;"  n       namespace:^N
^N->M+pr0       input.tcl       /^              proc pr0 {s} {$/;"      p       namespace:^N->M
^pr1    input.tcl       /^proc pr1 {s} {$/;"    p
pr0     input.tcl       /^              proc pr0 {s} {$/;"      p       namespace:^N->M
pr1     input.tcl       /^proc pr1 {s} {$/;"    p

Let’s override the prefix for a namespace with ‘@’. (For procedures, ctags still uses the default prefix.)

$ ctags -o - --extras=+q --_scopesep-Tcl='*/*:->' --_scopesep-Tcl='/*:^' \
                         --_scopesep-Tcl='n/p:+' --_scopesep-Tcl='/n:@' input.tcl
@N      input.tcl       /^namespace eval N {$/;"        n
@N->M   input.tcl       /^      namespace eval M {$/;"  n       namespace:@N
@N->M+pr0       input.tcl       /^              proc pr0 {s} {$/;"      p       namespace:@N->M
M       input.tcl       /^      namespace eval M {$/;"  n       namespace:@N
N       input.tcl       /^namespace eval N {$/;"        n
^pr1    input.tcl       /^proc pr1 {s} {$/;"    p
pr0     input.tcl       /^              proc pr0 {s} {$/;"      p       namespace:@N->M
pr1     input.tcl       /^proc pr1 {s} {$/;"    p

Multi-line pattern match

We often need to scan multiple lines to generate a tag, whether due to needing contextual information to decide whether to tag or not, or to constrain generating tags to only certain cases, or to grab multiple substrings to generate the tag name.

Universal-ctags has two ways to accomplish this: multi-line regex options, and the experimental multi-table regex options described later.

The newly introduced --mline-regex-<LANG> is similar to --regex-<LANG> except the pattern is applied to the whole file’s contents, not line by line.

This example is based on issue #219, posted by @andreicristianpetcu:

// in input.java:

@Subscribe
public void catchEvent(SomeEvent e)

@Subscribe
public void
recover(Exception e)

The Java code above resembles code written for the Java Spring framework. The @Subscribe annotation is a keyword for the framework, and the developer would like to have a tag generated for each method annotated with @Subscribe, using the name of the method followed by a dash followed by the type of the argument. For example, the developer wants the tag name Event-SomeEvent generated for the first method shown above.

To accomplish this, the developer creates a spring.ctags file with the following:

# in spring.ctags:
--mline-regex-javaspring=/@Subscribe([[:space:]])*([a-z ]+)[[:space:]]*([a-zA-Z]*)\(([a-zA-Z]*)/\3-\4/s,subscription/{mgroup=3}
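
The option above assumes the javaspring language already exists, and the output shown for this example includes line: and language: fields. A sketch of a complete spring.ctags under those assumptions (the --langdef, --map and --fields lines are reconstructions, not quoted from the source):

```ctags
--langdef=javaspring
--map-javaspring=+.java
--mline-regex-javaspring=/@Subscribe([[:space:]])*([a-z ]+)[[:space:]]*([a-zA-Z]*)\(([a-zA-Z]*)/\3-\4/s,subscription/{mgroup=3}
--fields=+ln
```

Here --fields=+ln turns on the language (l) and line number (n) fields.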

And now using spring.ctags the tag file has this:

$ ./ctags -o - --options=./spring.ctags input.java
Event-SomeEvent input.java      /^public void catchEvent(SomeEvent e)$/;"       s       line:2  language:javaspring
recover-Exception       input.java      /^    recover(Exception e)$/;"  s       line:10 language:javaspring

Multiline pattern flags


These flags also apply to the experimental --_mtable-regex-<LANG> option described later.


The {mgroup=N} flag indicates the pattern should be applied to the whole file contents, not line by line. N is the number of a capture group in the pattern, which is used to record the line number location of the tag. In the above example 3 is specified, so the start position of regex capture group 3, relative to the whole file, is used.


You must add an {mgroup=N} flag to the multi-line --mline-regex-<LANG> option, even if the N is 0 (meaning the start position of the whole regex pattern). You do not need to add it for the multi-table --_mtable-regex-<LANG>.


A regex pattern is applied to the whole file’s contents iteratively. The {_advanceTo=N[start|end]} long flag specifies where the pattern should be applied from in the next iteration. When a pattern matches, the next pattern matching starts from the start or end of capture group N. By default it advances to the end of the whole match (i.e., {_advanceTo=0end} is the default).

Consider the following input:

def def abc

Consider two sets of options, foo and bar.

# foo.ctags:
--mline-regex-foo=/def *([a-z]+)/\1/a/{mgroup=1}
# bar.ctags:
--mline-regex-bar=/def *([a-z]+)/\1/a/{mgroup=1}{_advanceTo=1start}

foo.ctags emits the following tags output:

def  input.foo       /^def def abc$/;"       a

bar.ctags emits the following tags output:

def  input-0.bar     /^def def abc$/;"       a
abc  input-0.bar     /^def def abc$/;"       a

_advanceTo=1start is specified in bar.ctags. This allows ctags to capture “abc”.

At the first iteration, the patterns of both foo.ctags and bar.ctags match as follows:

0   1       (start)
v   v
def def abc
       ^
       0,1  (end)

“def” in group 1 is captured as a tag in both languages. At the next iteration, the position where pattern matching resumes differs between the two languages.

foo.ctags:

       0end (default)
       v
def def abc

bar.ctags:

    1start (as specified in _advanceTo long flag)
    v
def def abc

This difference in positions accounts for the difference in the tags output.

A more relevant use-case is when {_advanceTo=N[start|end]} is used in the experimental --_mtable-regex-<LANG>, to “advance” back to the beginning of a match, so that one can generate multiple tags for the same input line(s).


This flag doesn’t work well with scope related flags and exclusive flags.

Advanced pattern matching with multiple regex tables


This is a highly experimental feature. This will not go into the man page of 6.0. But let’s be honest, it’s the most exciting feature!

In some cases, the --regex-<LANG> and --mline-regex-<LANG> options are not sufficient to generate the tags for a particular language. Some of the common reasons for this are:

  • To ignore commented lines or sections for the language file, so that tags aren’t generated for symbols that are within the comments.
  • To enter and exit scope, and use it for tagging based on contextual state or with end-scope markers that are difficult to match to their associated scope entry point.
  • To support nested scopes.
  • To change the pattern searched for, or the resultant tag for the same pattern, based on scoping or contextual location.
  • To break up an overly complicated --mline-regex-<LANG> pattern into separate regex patterns, for performance or readability reasons.

To help handle such things, Universal-ctags has been enhanced with multi-table regex matching. The feature is inspired by lex, the fast lexical analyzer generator, which is a popular tool on Unix environments for writing parsers, and RegexLexer of Pygments. Knowledge about them will help you understand the new options.

The new options are:


--_tabledef-<LANG>=<TABLE>

Declares a new regex matching table of a given name for the language, as described in Declaring a new regex table.

--_mtable-regex-<LANG>=<TABLE>/<PATTERN>/<NAME>/[<KIND>]/LONGFLAGS

Adds a regex pattern and associated tag generation information and flags, to the given table, as described in Adding a regex to a regex table.

--_mtable-extend-<LANG>=disttable+srctable

Includes a previously-defined regex table in the named one.

The above will be discussed in more detail shortly.

First, let’s explain the feature with an example. Consider an imaginary language “X” with a syntax similar to JavaScript: “var” is used for defining variable(s), and “/* … */” is used for block comments.

Here is our input, input.x:

var dont_capture_me;
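
The listing above appears to have lost its comment delimiters and the declaration of a and b. A hypothetical complete input.x that matches the description in this section (the comment text is an assumption):

```text
/* BLOCK COMMENT
var dont_capture_me;
*/
var a /* ANOTHER BLOCK COMMENT */, b;
```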

We want ctags to capture a and b - but it is difficult to write a parser that will ignore dont_capture_me in the comment with a classical regex parser defined with --regex-<LANG> or --mline-regex-<LANG>, because of the block comments.

The --regex-<LANG> option only works on one line at a time, so it cannot know that dont_capture_me is within a comment. The --mline-regex-<LANG> option could do it in theory, but due to the greedy nature of the regex engine it is impractical and potentially inefficient to do so, given that there could be multiple block comments in the file, with * inside them, etc.

A parser written with multi-table regex, on the other hand, can capture only a and b safely. But it is more complicated to understand.

Here is a 1st version of X.ctags:
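
The listing is missing. Going by the description that follows (a language X, a .x suffix, and a kind for variables), a first version would be roughly this; the kind name and description are assumptions:

```ctags
--langdef=X
--map-X=+.x
--kinddef-X=v,var,variables
```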


Not so interesting. It doesn’t really do anything yet. It just creates a new language named X, for files ending with a .x suffix, and defines a new tag for variable kinds.

When writing a multi-table parser, you have to think about the necessary states of parsing. For the parser of language X, we need the following states:

  • toplevel (initial state)
  • comment (inside comment)
  • vars (var statements)

Declaring a new regex table

Before adding regular expressions, you have to declare tables for each state with the --_tabledef-<LANG>=<TABLE> option.

Here is the 2nd version of X.ctags doing so:
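
The listing is missing. A sketch that declares one table per state listed earlier, with toplevel first so that it becomes the initial table (the first three lines carry over the assumed first version):

```ctags
--langdef=X
--map-X=+.x
--kinddef-X=v,var,variables
--_tabledef-X=toplevel
--_tabledef-X=comment
--_tabledef-X=vars
```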



For table names, only characters in the range [0-9a-zA-Z_] are acceptable.

For a given language, for each input file the ctags multi-table parser begins with the first declared table. For X.ctags, that is toplevel. The other tables are only ever entered/checked if another table specifies to do so, starting with the first table. In other words, if the first declared table does not find a match for the current input, and does not specify to go to another table, the other tables for that language won’t be used. The flags to go to another table are {tenter}, {tleave}, and {tjump}, as described later.

Adding a regex to a regex table

The new option to add a regex to a declared table is --_mtable-regex-<LANG>, and it follows this form:
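
The form itself is missing. Pieced together from the parameter names used in the following paragraphs, it is along these lines (treat the exact bracket placement as an assumption):

```text
--_mtable-regex-<LANG>=<TABLE>/<PATTERN>/<NAME>/[<KIND>]/LONGFLAGS
```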


The parameters for --_mtable-regex-<LANG> look complicated. However, <PATTERN>, <NAME>, and <KIND> are the same as the parameters of the --regex-<LANG> and --mline-regex-<LANG> options. <TABLE> is simply the name of a table previously declared with the --_tabledef-<LANG> option.

A regex pattern added to a parser with --_mtable-regex-<LANG> is matched against the input at the current byte position, not line. Even if you do not specify the ^ anchor at the start of the pattern, ctags adds ^ to the pattern automatically. Unlike the --regex-<LANG> and --mline-regex-<LANG> options, a ^ anchor does not mean “beginning of line” in --_mtable-regex-<LANG>; instead it means the beginning of the input string (i.e., the current byte position).

The LONGFLAGS include the already discussed flags for --regex-<LANG> and --mline-regex-<LANG>: {scope=...}, {mgroup=N}, {_advanceTo=N}, {basic}, {extend}, and {icase}. The {exclusive} flag does not make sense for multi-table regex.

In addition, several new flags are introduced exclusively for multi-table regex use:


{tenter=TABLE}

Push the current table on the stack, and enter another table.

{tleave}

Leave the current table, pop the stack, and go to the table that was just popped from the stack.

{tjump=TABLE}

Jump to another table, without affecting the stack.

{treset=TABLE}

Clear the stack, and go to another table.

{tquit}

Clear the stack, and stop processing the current input file for this language.

To explain the above new flags, we’ll continue using our example in the next section.

Skipping block comments

Let’s continue with our example. Here is the 3rd version of X.ctags:
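
The listing is missing. A sketch with four --_mtable-regex-X lines for skipping block comments: each table gets one line that recognizes a comment delimiter and one catch-all line that consumes a single character without tagging it. The catch-all lines are an assumption about how the remaining input is consumed.

```ctags
--langdef=X
--map-X=+.x
--kinddef-X=v,var,variables
--_tabledef-X=toplevel
--_tabledef-X=comment
--_tabledef-X=vars

--_mtable-regex-X=toplevel/\/\*//{tenter=comment}
--_mtable-regex-X=toplevel/.//

--_mtable-regex-X=comment/\*\///{tleave}
--_mtable-regex-X=comment/.//
```

Order matters within a table: the delimiter pattern has to come before the catch-all, since the first matching line in a table wins.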





Four --_mtable-regex-X lines are added for skipping the block comments. Let’s discuss them one by one.

For each new file it scans, ctags always chooses the first pattern of the first table of the parser. Even if it’s an empty table, ctags will only try the first declared table. (In such a case it would immediately fail to match anything, and thus stop processing the input file and effectively do nothing.)

The first declared table (toplevel) has the following regex added to it first:

--_mtable-regex-X=toplevel/\/\*//{tenter=comment}

A pattern of \/\* is added to the toplevel table, to match the beginning of a block comment. A backslash character is used in front of the leading / to escape the separation character / that separates the fields of --_mtable-regex-<LANG>. Another backslash inside the pattern is used before the asterisk *, to make it a literal asterisk character in regex.

The last // means ctags should not tag something matching this pattern. In --regex-<LANG> you never use // because it would be pointless to match something and not tag it using a single-line --regex-<LANG>; in multi-line --mline-regex-<LANG> you rarely see it, because it would rarely be useful. But in multi-table regex it’s quite common, since you frequently want to transition from one state to another (i.e., tenter or tjump from one table to another).

The long flag added to our first regex of our first table is tenter, which is a long flag for switching the table and pushing on the stack. {tenter=comment} means “switch the table from toplevel to comment”.

So given the input file input.x shown earlier, ctags will begin at the toplevel table and try to match the first regex. It will succeed, and thus push on the stack and go to the comment table.

It will begin at the top of the comment table (it always begins at the top of a given table), and try each regex line in sequence until it finds a match. If it fails to find a match, it will pop the stack and go to the table that was just popped from the stack, and begin trying to match at the top of that table. If it continues failing to find a match, and ultimately reaches the end of the stack, it will stop processing for this file. For the next input file, it will begin again from the top of the first declared table.

Getting back to our example, the top of the comment table has this regex:

--_mtable-regex-X=comment/\*\///{tleave}

Similar to the previous toplevel table pattern, this one for \*\/ uses a backslash to escape the separator /, as well as one before the * to make it a literal asterisk in regex. So what it’s looking for, from a simple string perspective, is the sequence */. Note that this means even though you see three slashes /// at the end, the first one is escaped and used for the pattern itself, and the --_mtable-regex-X only has // to separate the regex pattern from the long flags, instead of the usual ///. Thus it’s using the shorthand form of the --_mtable-regex-X option. It could instead have been:

--_mtable-regex-X=comment/\*\////{tleave}

The above would have worked exactly the same.
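To see how the escaped separator affects field splitting, the shorthand and long forms discussed above can be run through a small splitter. This is only a Python sketch under the assumption that fields are separated by any / not preceded by a backslash; the function split_fields is hypothetical and is not how ctags itself is implemented.

```python
def split_fields(spec, sep="/"):
    """Split `spec` on separators that are not escaped with a backslash."""
    fields, current, i = [], [], 0
    while i < len(spec):
        ch = spec[i]
        if ch == "\\" and i + 1 < len(spec):
            # Keep the backslash and the escaped character together.
            current.append(spec[i:i + 2])
            i += 2
        elif ch == sep:
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields


short_form = split_fields(r"comment/\*\///{tleave}")
long_form = split_fields(r"comment/\*\////{tleave}")

print(short_form)
print(long_form)
```

In both results the second field is the pattern \*\/ and the name field is empty; the long form simply carries one more empty field (the kind) before the flags.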

Getting back to our example, remember we’re looking at the input.x file, currently using the comment table, and trying to match the first regex of that table, shown above, at the following location:

   ,ctags is trying to match starting here
  v
/* BLOCK COMMENT
var dont_capture_me;
*/
var a /* ANOTHER BLOCK COMMENT */, b;

The pattern doesn’t match for the position just after /*, because that position is a space character. So ctags tries the next pattern in the same table:

--_mtable-regex-X=comment/.//

This pattern matches any one character, including newline; the current position moves one character forward. Now the character at the current position is B. The first pattern of the table, */, still does not match the input, so ctags uses the next pattern again. When the current position moves to the */ on the 3rd line of input.x, it will finally match this:

--_mtable-regex-X=comment/\*\///{tleave}

In this pattern, the long flag {tleave} is specified. This triggers table switching again. {tleave} makes ctags switch the table back to the last table used before doing {tenter}. In this case, toplevel is the table. ctags manages a stack where references to tables are put. {tenter} pushes the current table to the stack. {tleave} pops the table at the top of the stack and chooses it.

So now ctags is back to the toplevel table, and tries the first regex of that table, which was this:

--_mtable-regex-X=toplevel/\/\*//{tenter=comment}

It tries to match that against its current position, which is now the newline on line 3, between the */ and the word var:

/* BLOCK COMMENT
var dont_capture_me;
*/ <--- ctags is now at this newline (\n) character
var a /* ANOTHER BLOCK COMMENT */, b;

The first regex of the toplevel table does not match a newline, so it tries the second regex:

--_mtable-regex-X=toplevel/.//

This matches a newline successfully, but has no actions to perform. So ctags moves one character forward (the newline it just matched), and goes back to the top of the toplevel table, and tries the first regex again. Eventually we’ll reach the beginning of the second block comment, and do the same things as before.

When ctags finally reaches the end of the file (the position after b;), it will not be able to match either the first or second regex of the toplevel table, and quit processing the input file.

So far, we’ve successfully skipped over block comments for our new X language, but haven’t generated any tags. The point of ctags is to generate tags, not just keep your computer warm. So now let’s move onto actually tagging variables…

Capturing variables in a sequence

Here is the 4th version of X.ctags:



--_mtable-regex-X=toplevel/var[ \n\t]//{tenter=vars}



One pattern in toplevel was added, and a new table vars with four patterns was also added.

The new regex in toplevel is this:

--_mtable-regex-X=toplevel/var[ \n\t]//{tenter=vars}

The purpose of this being in toplevel is to switch to the vars table when the keyword var is found in the input stream. We need to switch states (i.e., tables) because we can’t simply capture the variables a and b with a single regex pattern in the toplevel table, because there might be block comments inside the var statement (as there are in our input.x), and we also need to create two tags: one for a and one for b, even though the word var only appears once. In other words, we need to “remember” that we saw the keyword var, when we later encounter the names a and b, so that we know to tag each of them; and saving that “in-variable-statement” state is accomplished by switching tables to the vars table.

The first regex in our new vars table is:

--_mtable-regex-X=vars/;//{tleave}

This pattern is used to match a single semi-colon ;, and if it matches pop back to the toplevel table using the {tleave} long flag. We didn’t have to make this the first regex pattern, because it doesn’t overlap with any of the other ones other than the /.// last one (which must be last for this example to work).

The second regex in our vars table is:

--_mtable-regex-X=vars/\/\*//{tenter=comment}

We need this because block comments can be in variable definitions:

var a /* ANOTHER BLOCK COMMENT */, b;

So to skip block comments in such a position, the pattern \/\* is used just like it was used in the toplevel table: to find the literal /* beginning of the block comment and enter the comment table. Because we’re using {tenter} and {tleave} to push/pop from a stack of tables, we can use the same comment table for both toplevel and vars to go to, because ctags will “remember” the previous table and {tleave} will pop back to the right one.

The third regex in our vars table is:


This is nothing special, but is the one that actually tags something: it captures the variable name and uses it for generating a variable (shorthand v) tag kind.

The last regex in the vars table we’ve seen before:

--_mtable-regex-X=vars/.//

This makes ctags ignore any other characters, such as whitespace or the comma ,.
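The whole walk through the toplevel, comment, and vars tables can be reproduced with a toy interpreter. The following Python sketch is an approximation written for this explanation, not the ctags implementation: the tuple layout of the rules, the function run, and in particular the identifier pattern IDENT are assumptions. When no rule matches, it pops the stack as described earlier, and stops once the stack is empty.

```python
import re

# Assumed identifier pattern; not taken from X.ctags.
IDENT = r"[a-zA-Z_][a-zA-Z0-9_]*"

# Each rule: (pattern, kind letter or None, table action, target table).
TABLES = {
    "toplevel": [
        (r"/\*", None, "tenter", "comment"),
        (r"var[ \n\t]", None, "tenter", "vars"),
        (r".", None, None, None),
    ],
    "comment": [
        (r"\*/", None, "tleave", None),
        (r".", None, None, None),
    ],
    "vars": [
        (r";", None, "tleave", None),
        (r"/\*", None, "tenter", "comment"),
        ("(" + IDENT + ")", "v", None, None),
        (r".", None, None, None),
    ],
}


def run(text, tables, first="toplevel"):
    """Walk `text` with a table stack and return (name, kind, line) tags."""
    compiled = {
        name: [(re.compile(p, re.DOTALL), k, a, t) for p, k, a, t in rules]
        for name, rules in tables.items()
    }
    pos, table, stack, tags = 0, first, [], []
    while pos < len(text):
        for regex, kind, action, target in compiled[table]:
            m = regex.match(text, pos)  # anchored at the current position
            if m is None:
                continue
            if kind is not None:
                line = text.count("\n", 0, m.start()) + 1
                tags.append((m.group(1), kind, line))
            pos = m.end()
            if action == "tenter":
                stack.append(table)
                table = target
            elif action == "tleave":
                table = stack.pop()
            break  # restart from the top of the (possibly new) table
        else:
            # No rule matched: fall back to the saved table, or stop.
            if not stack:
                break
            table = stack.pop()
    return tags


INPUT_X = (
    "/* BLOCK COMMENT\n"
    "var dont_capture_me;\n"
    "*/\n"
    "var a /* ANOTHER BLOCK COMMENT */, b;\n"
)
tags = run(INPUT_X, TABLES)
print(tags)
```

Note that dont_capture_me is never tagged: while the comment table is active, the var rule of toplevel is simply not consulted.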

Running our example

$ cat input.x
/* BLOCK COMMENT
var dont_capture_me;
*/
var a /* ANOTHER BLOCK COMMENT */, b;

$ u-ctags -o - --fields=+n --options=X.ctags input.x
a       input.x /^var a \/* ANOTHER BLOCK COMMENT *\/, b;$/;"   v       line:4
b       input.x /^var a \/* ANOTHER BLOCK COMMENT *\/, b;$/;"   v       line:4

It works!

You can find additional examples of multi-table regex in our GitHub repository, under the optlib directory. For example, puppetManifest.ctags is a serious example. It is the primary parser for testing multi-table regex parsers, and is used in the actual ctags program for parsing Puppet manifest files.

Conditional tagging with extras

If a matched pattern should only be tagged when an extra flag is enabled, mark the pattern with {_extra=XNAME}, where XNAME is the name of the extra. You must define XNAME with the --_extradef-<LANG>=XNAME,DESCRIPTION option before defining a regex flag marked {_extra=XNAME}.

if __name__ == '__main__':

To capture lines like the one above in a Python program (input.py), an extra flag can be used.

--_extradef-Python=main,__main__ entry points
--regex-Python=/^if __name__ == '__main__':/__main__/f/{_extra=main}

The above optlib (python-main.ctags) introduces the main extra to the Python parser. The pattern matching is done only when the main extra is enabled.

$ ./ctags --options=python-main.ctags -o - --extras-Python='+{main}' input.py
__main__        input.py        /^if __name__ == '__main__':$/;"        f
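The gating behaviour of an extra can be pictured as a rule that is skipped entirely while its extra is disabled. The following Python sketch illustrates only that idea; EXTRA_RULES and collect_tags are hypothetical names, and the sample source text is invented for the demonstration.

```python
import re

# Each rule: (pattern, tag name, kind letter, extra that must be enabled).
EXTRA_RULES = [
    (re.compile(r"^if __name__ == '__main__':"), "__main__", "f", "main"),
]


def collect_tags(source, enabled_extras):
    """Return (name, kind) tags, honouring the extra attached to each rule."""
    tags = []
    for line in source.splitlines():
        for pattern, name, kind, extra in EXTRA_RULES:
            if extra not in enabled_extras:
                continue  # the rule is not tried while its extra is off
            if pattern.search(line):
                tags.append((name, kind))
    return tags


source = "def run():\n    pass\n\nif __name__ == '__main__':\n    run()\n"
without_extra = collect_tags(source, set())
with_extra = collect_tags(source, {"main"})

print(without_extra, with_extra)
```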

Adding custom fields to the tag output

Exuberant-ctags allows just one of the specified groups in a regex pattern to be used as a part of the name of a tagEntry.

Universal-ctags allows using the other groups in the regex pattern.

An optlib parser can have its specific fields. The groups can be used as a value of the fields of a tagEntry.

Let’s think about Unknown, an imaginary language. Here is a source file (input.unknown) written in Unknown:

public func foo(n, m);
protected func bar(n);
private func baz(n,...);

With --regex-Unknown=… Exuberant-ctags can capture foo, bar, and baz as names. Universal-ctags can attach extra context information to the names as values for fields. Let’s focus on bar. protected is a keyword to control how widely the identifier bar can be accessed. (n) is the parameter list of bar. protected and (n) are extra context information of bar.

With the following optlib file (unknown.ctags), ctags can attach protected to the field protection and (n) to the field signature.


--_fielddef-unknown=protection,access scope

--regex-unknown=/^((public|protected|private) +)?func ([^\(]+)\((.*)\)/\3/f/{_field=protection:\1}{_field=signature:(\4)}


For the line protected func bar(n); you will get following tags output:

bar     input.unknown   /^protected func bar(n);$/;"    f       protection:protected    signature:(n)
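The group numbering used by the pattern can be checked with any regex engine. Below is a small Python sketch that applies the same pattern to the three lines of input.unknown; the helper tag_for and the dictionary layout are assumptions for illustration, and Python’s re module stands in for the regex engine used by ctags.

```python
import re

PATTERN = re.compile(
    r"^((public|protected|private) +)?func ([^\(]+)\((.*)\)"
)


def tag_for(line):
    """Map the regex groups of one line to a name and two field values."""
    m = PATTERN.match(line)
    if m is None:
        return None
    return {
        "name": m.group(3),                   # \3 in the option
        "protection": m.group(1),             # \1 in the option
        "signature": "(" + m.group(4) + ")",  # (\4) in the option
    }


lines = [
    "public func foo(n, m);",
    "protected func bar(n);",
    "private func baz(n,...);",
]
tags = [tag_for(line) for line in lines]
bar = tags[1]
keyword_only = PATTERN.match(lines[1]).group(2)

print(bar)
```

In Python’s re semantics, group 1 also carries the spaces matched by the trailing " +", while group 2 holds the bare keyword. The sketch does not show how ctags itself treats that trailing whitespace.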

Let’s see the detail of unknown.ctags.

--_fielddef-unknown=protection,access scope

--_fielddef-<LANG>=name,description defines a new field for a parser specified by <LANG>. Before defining a new field for the parser, the parser must be defined with --langdef=<LANG>. protection is the field name used in tags output. access scope is the description used in the output of --list-fields and --list-fields=Unknown.


A second --_fielddef-unknown option defines a field named signature in the same way.

--regex-unknown=/^((public|protected|private) +)?func ([^\(]+)\((.*)\)/\3/f/{_field=protection:\1}{_field=signature:(\4)}

This option requests making a tag for the name captured by group 3 of the pattern, attaching group 1 to the tag as the value of the protection field, and attaching group 4 as the value of the signature field. You can use the long regex flag _field for attaching fields to a tag with the following notation rule:

{_field=FIELDNAME:GROUP}

--fields-<LANG>=[+|-]{FIELDNAME} can be used to enable or disable specified field.

When defining a new parser specific field, it is disabled by default. Enable the field explicitly to use the field. See Parser specific fields about --fields-<LANG> option.

The passwd parser is a simple example that uses the --fields-<LANG> option.

Capturing reference tags

To make a reference tag with an optlib parser, specify a role with _role long regex flag. Let’s see an example:

--_roledef-FOO.m=imported,imported module
--regex-FOO=/import[ \t]+([a-z]+)/\1/m/{_role=imported}

A role must be defined before specifying it as value for _role flag. --_roledef-<LANG>.<KIND>=<ROLE>,<ROLEDESC> option is for defining a role. See the line, --regex-FOO=.... In this parser FOO, the name of an imported module is captured as a reference tag with role imported.

To specify the KIND in which the role is defined, you can use either a kind letter or a kind name surrounded by { and }.

The option has two parameters separated by a comma:

  • the role name, and
  • the description of the role.

The first parameter is the name of the role. The role is defined in the kind <KIND> of the language <LANG>. In the example, imported role is defined in the module kind, which is specified with m. You can use {module}, the name of the kind instead.

The kind specified in --_roledef-<LANG>.<KIND> option must be defined before using the option. See the description of --kinddef-<LANG> for defining a kind.

The roles are listed with --list-roles=<LANG>. The name and description passed to --_roledef-<LANG>.<KIND> option are used in the output like:

$ ./ctags --langdef=FOO --kinddef-FOO=m,module,modules \
                        --_roledef-FOO.m='imported,imported module' --list-roles=FOO
m/module   imported on      imported module

By specifying the _role regex flag multiple times with different roles, you can assign multiple roles to a reference tag. Consider the following C input:

x  = 0;
i += 1;

An ultra fine grained C parser may capture the variable x with lvalue role and the variable i with lvalue and incremented roles.

You can implement such roles by extending the built-in C parser:

# c-extra.ctags
--_roledef-C.v=lvalue,locator values
--_roledef-C.v=incremented,incremented with ++ operator
--regex-C=/([a-zA-Z_][a-zA-Z_0-9]*) *=/\1/v/{_role=lvalue}
--regex-C=/([a-zA-Z_][a-zA-Z_0-9]*) *\+=/\1/v/{_role=lvalue}{_role=incremented}
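To check which of the two patterns fires on which line, the same regular expressions can be tried in Python. This is an illustrative sketch, not ctags behaviour: RULES and reference_tags are hypothetical names, and each pattern is searched anywhere in the line because neither starts with ^.

```python
import re

# (pattern, roles attached by the corresponding --regex-C option)
RULES = [
    (re.compile(r"([a-zA-Z_][a-zA-Z_0-9]*) *="), ["lvalue"]),
    (re.compile(r"([a-zA-Z_][a-zA-Z_0-9]*) *\+="), ["lvalue", "incremented"]),
]


def reference_tags(source):
    """Return (name, roles) for every rule that matches a line."""
    tags = []
    for line in source.splitlines():
        for pattern, roles in RULES:
            m = pattern.search(line)
            if m:
                tags.append((m.group(1), roles))
    return tags


tags = reference_tags("x  = 0;\ni += 1;\n")
print(tags)
```

The first pattern does not match "i += 1;" because the identifier is followed by + rather than =, so i is captured only by the second rule, which carries both roles.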

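Running ctags with --options=c-extra.ctags --extras=+r --fields=+r emits reference tags for x with the lvalue role, and for i with the lvalue and incremented roles.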

Running a guest parser with _guest regex flag

With _guest regex flag, you can run a parser (a guest parser) on an area of the current input file. See Applying a parser to specified areas of input file (guest/host) about the concept of the guest parser.

The _guest regex flag specifies guest spec, and attaches it to the associated regex pattern.

A guest spec has three fields: PARSER, START of area, and END of area. The _guest regex flag has the following form:

{_guest=PARSER,START,END}

ctags maintains a data structure called a guest request during parsing. The guest request also has three fields: parser, start of area, and end of area.

You, a parser developer, have to fill in the fields of guest specs. ctags consults the guest spec when the regex pattern associated with it matches, tries to fill the fields of the guest request, and runs a guest parser when all the fields of the guest request are filled.

If you don’t use Multi-line pattern match to define a host parser, ctags can fill the fields of the guest request incrementally; more than one guest spec can be used to fill the fields. In other words, you can leave some of the fields of a guest spec empty. On the other hand, you must specify all the fields of a guest spec for Multi-line pattern match.

The PARSER field of _guest regex flag

For PARSER, you can specify one of the following items:

a name of a parser

If you know the guest parser you want to run before parsing the input file, specify the name to the PARSER.

For example, {_guest=C,START,END} runs the C parser as a guest parser on the area given by START and END.


the group number of a regex pattern, prefixed with \ (backslash)

If a parser name appears in an input file, write a regex pattern to capture the name. Specify the group number where the name is stored to the PARSER. In such case, use \ as the prefix for the number.

Let’s see an example. GitHub Flavored Markdown (GFM) is a language for documentation. It provides a notation for quoting a snippet of program code; the language treats the area from ~~~ to ~~~ as a snippet. You can specify the programming language of the snippet by starting the area with ~~~THE_NAME_OF_LANGUAGE, like ~~~C or ~~~Java.

To run a guest parser on the area, you have to capture THE_NAME_OF_LANGUAGE with a group in a regex pattern. If the pattern captures the language name with regex group 1, specify \1 as PARSER.


the group number of a regex pattern, prefixed with * (asterisk)

If a file name implying a programming language appears in an input file, capture the file name with the regex pattern where the guest spec attaches to. ctags tries to find a proper parser for the file name by inquiring the langmap.

Use * as the prefix to the number for specifying the group of the regex pattern that captures the file name.

Let’s see an example. Suppose you have a shell script that emits program code instantiated from templates. Here documents are used to represent the templates, like this:

cat > foo.c <<EOF
        int main (void) { return $i; }
EOF

cat > foo.el <<EOF
        (defun foo () (1+ $i))
EOF

To run guest parsers for the here document areas, the shell script parser of ctags must choose the parsers from the file names (foo.c and foo.el):

--regex-sh=/cat > ([a-z.]+) <<EOF//{_guest=*1,0end,}

The pattern captures the file name in the input file with regex group 1, and *1 in the _guest flag specifies it as PARSER.
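The lookup from a captured file name to a parser can be pictured as follows. This Python sketch is only an analogy: LANGMAP is a tiny hypothetical stand-in for the real langmap, the parser names in it are assumptions, and guest_parser_for is an invented helper that uses nothing but the file extension.

```python
import os
import re

# Hypothetical, tiny stand-in for the langmap; the real one is much larger.
LANGMAP = {".c": "C", ".el": "EmacsLisp"}

HEREDOC = re.compile(r"cat > ([a-z.]+) <<EOF")


def guest_parser_for(line):
    """Pick a parser from the file name captured in group 1, if any."""
    m = HEREDOC.search(line)
    if m is None:
        return None
    _, ext = os.path.splitext(m.group(1))
    return LANGMAP.get(ext)


parsers = [
    guest_parser_for(line)
    for line in ("cat > foo.c <<EOF", "cat > foo.el <<EOF", "echo hi")
]
print(parsers)
```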


The START and END fields of _guest regex flag

The START and END fields specify the area the PARSER parses. START specifies the start of the area. END specifies the end of the area.

The forms of the two fields are the same: a regex group number followed by “start” or “end”, e.g. “3start” or “0end”. The suffixes “start” and “end” each represent one of the two boundaries of the group.

Let’s see an example:

{_guest=C,2end,3start}

This guest regex flag means running C parser on the area between “2end” and “3start”. “2end” means the area starts from the end of matching of the 2nd regex group associated with the flag. “3start” means the area ends at the beginning of matching of the 3rd regex group associated with the flag.

Let’s see a more realistic example. Here is an optlib file for an imaginary language “single”.

--langdef=single
--map-single=.single
--regex-single=/^(BEGIN_C<).*(>END_C)$//{_guest=C,1end,2start}

This parser can run C parser and extract “main” function from the following input file:

BEGIN_C<int main (int argc, char **argv) { return 0; }>END_C
        ^                                             ^
        `- "1end" points here.                        |
                               "2start" points here. -+
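The way “1end” and “2start” delimit the guest area can be reproduced with match offsets. The following Python sketch assumes that a boundary is the character offset of the start or end of a regex group; the helper boundary is a hypothetical name, and Python’s re module stands in for the ctags regex engine.

```python
import re

line = "BEGIN_C<int main (int argc, char **argv) { return 0; }>END_C"
m = re.match(r"^(BEGIN_C<).*(>END_C)$", line)


def boundary(match, spec):
    """Resolve a spec such as '1end' or '2start' to a character offset."""
    if spec.endswith("start"):
        return match.start(int(spec[:-len("start")]))
    if spec.endswith("end"):
        return match.end(int(spec[:-len("end")]))
    raise ValueError("unknown boundary spec: " + spec)


start = boundary(m, "1end")
end = boundary(m, "2start")
guest_area = line[start:end]

print(guest_area)
```

The slice contains exactly the text between the two markers, which is what the guest C parser would be asked to parse.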

Submitting an optlib file to the Universal-ctags project

You are encouraged to submit your .ctags file to our repository on GitHub through a pull request.

Universal-ctags provides a facility for “Option library”. Read “Option library” about the concept and usage first.

Here I will explain how to merge your .ctags into Universal-ctags as part of the option library. Here I assume you consider contributing an option library in which a regex-based language parser is defined.

First you need your option library (which you have seen in this part of the guide). See How to Add Support for a New Language to Exuberant Ctags (EXTENDING) to learn how to write a regex-based language parser in C.

In this section I explain what to do after you have your parser.

Like in the link, I use Swine as the name of programming language that the parser deals with. Assume source files written in Swine language have a suffix .swn. The file name of the option library is swine.ctags.

Units test cases

We, the Universal-ctags developers, don’t have enough time to learn all the languages supported by ctags. In other words, we cannot review the code. Only test cases help us know whether a contributed option library works well or not. We may reject any contribution without a test case.

Read “Using Units” about how to write Units test cases. Do not write one big test case: smaller cases are helpful to know about the intent of the contributor. For example:

  • Units/sh-alias.d
  • Units/sh-comments.d
  • Units/sh-quotes.d
  • Units/sh-statements.d

are good examples of small test cases. Big test cases are acceptable if smaller test cases exist.

See also parser-m4.r/m4-simple.d, especially parser-m4.r/m4-simple.d/args.ctags. Your test cases need ctags to have already loaded your option library, swine.ctags, so you must load it explicitly in the test case’s own args.ctags.

Assume your test name is swine-simile.d. Put --options=swine in Units/swine-simile.d/args.ctags.

Incorporating your parser to ctags build process

Add your optlib file, swine.ctags, to the OPTLIB2C_INPUT variable in makefiles/optlib2c_input.mak in the Universal-ctags source tree.


Let’s verify all your work here.

  1. Run the tests and check whether your test case is passed or failed:

    $ make units
  2. Verify your files are installed as expected:

    $ mkdir /tmp/tmp
    $ ./configure --prefix=/tmp/tmp
    $ make
    $ make install
    $ /tmp/tmp/bin/ctags -o - --languages=Swine something_input.swn


Please, consider submitting your well written optlib parser to Universal-ctags. Your .ctags is a treasure and can be shared as a first class software component in Universal-ctags.

Pull-requests are welcome.