Running multiple parsers on an input file¶
Universal Ctags provides parser developers ways (guest/host and sub/base) to run multiple parsers for an input file.
This section describes the concepts behind running multiple parsers, shows real examples, and introduces the APIs.
Applying a parser to specified areas of input file (guest/host)¶
The guest/host combination covers the case where an input file contains areas written in languages different from the language of the input file itself.
A host parser parses the input file and detects the areas. The host parser schedules guest parsers to parse the areas. The guest parsers then parse them.
Guest parsers run only when --extras=+g is given. If --fields=+E is given, every tag generated by a guest parser is marked guest in its extras: field.
Examples of guest/host combinations¶
{CSS,JavaScript}/HTML parser combination¶
For an HTML file, you may want to run the HTML parser, of course. The HTML file may have CSS areas and JavaScript areas. Meanwhile, Universal Ctags has both CSS and JavaScript parsers. Don’t you think it is useful if you can apply these parsers to the areas?
In this case, the HTML parser is responsible for detecting the CSS and JavaScript areas and recording their positions. The HTML parser schedules delayed invocations of the CSS and JavaScript parsers on the areas with the promise API.
Here HTML parser is a host parser. CSS and JavaScript parsers are guest parsers.
See “The new HTML parser” and parsers/html.c.
C/Yacc parser combination¶
A yacc file has some areas written in C. Universal Ctags has both YACC and C parsers. You may want the YACC parser to run the C parser on those areas.
Here the YACC parser is a host parser. The C parser is a guest parser.
See “promise API” and parsers/yacc.c.
Pod/Perl parser combination¶
Pod (Plain Old Documentation) is a language for documentation. It can be used not only in a stand-alone file but also inside a Perl script.
Universal Ctags has parsers for both Perl and Pod. The Perl parser recognizes the areas where a Pod document is embedded in a Perl script and schedules the Pod parser as a guest parser on those areas.
API for running a parser in an area¶
The promise API serves this purpose. A host parser using the interface is responsible for detecting areas in the input stream and recording them together with the names of the guest parsers to be applied to them.
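The bookkeeping a host parser performs can be illustrated with a toy model. The following sketch is not the promise API of Universal Ctags; struct area and record_areas are hypothetical names invented for this illustration. It assumes areas are delimited by a fixed opening and closing marker, and it records byte offsets together with the guest parser name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record of one area to be parsed later by a guest parser.
   This models the idea only; it is not the ctags promise API. */
struct area {
    const char *guest;   /* name of the guest parser, e.g. "JavaScript" */
    size_t start;        /* byte offset of the first character of the area */
    size_t end;          /* byte offset just past the last character */
};

/* Scan input for open ... close pairs and record each enclosed area.
   Returns the number of areas stored in out (at most max). */
static size_t record_areas(const char *input, const char *open,
                           const char *close, const char *guest,
                           struct area *out, size_t max)
{
    size_t n = 0;
    const char *p = input;

    while (n < max && (p = strstr(p, open)) != NULL) {
        const char *body = p + strlen(open);
        const char *q = strstr(body, close);

        if (q == NULL)
            break;               /* unterminated area: stop recording */
        out[n].guest = guest;
        out[n].start = (size_t)(body - input);
        out[n].end = (size_t)(q - input);
        n++;
        p = q + strlen(close);
    }
    return n;
}
```

A host would call record_areas once per kind of embedded area while it parses, and the recorded areas would be handed to the named guest parsers after the host has finished.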
Running guest parsers from an optlib based host parser¶
Tagging definitions of higher (upper) level language (sub/base)¶
Background¶
Consider an application written in language X. The application has concepts of its own domain. Developers of the application may try to express the concepts in the syntax of language X.
At the level of language X, the developers can define functions, variables, types, and so on. Furthermore, if the syntax of X allows, the developers want to define higher level (= application level) things for implementing the domain concepts.
Let me show part of the source code of SPY-WARS, an imaginary game application. It is written in Scheme, a dialect of Lisp. (Here Gauche is assumed as the implementation of the Scheme interpreter.)
(define agent-tables (make-hash-table))
(define-class <agent> ()
  ((rights :init-keyword :rights)
   (responsibilities :init-keyword :responsibilities)))
(define-macro (define-agent name rights responsibilities)
  `(hash-table-put! agent-tables ',name
     (make <agent>
        :rights ',rights
        :responsibilities ',responsibilities)))
(define-agent Bond (kill ...) ...)
(define-agent Bourne ...)
...
define, define-class, and define-macro are keywords of scheme for defining a variable, class and macro. Therefore the scheme parser of ctags should make tags for agent-tables with variable kind, <agent> with class kind, and define-agent with macro kind. There is no discussion here.
NOTE: To be exact, define-class and define-macro are not part of the scheme language. They are part of gauche. That means three parsers are stacked: scheme, gosh, and SPY-WARS.
The interesting things here are Bond and Bourne.
(define-agent Bond (kill ...) ...)
(define-agent Bourne ...)
At the scheme parser level, the two expressions define nothing; they are just expansions of a macro (define-agent).
However, at the application level, they define agents, as the macro name suggests. At this level Universal Ctags should capture Bond and Bourne. The question is which parser should capture them. The scheme parser should not; define-agent is not part of the scheme language. A newly defined SPY-WARS parser is the answer.
Though define-agent is just a macro at the scheme parser level, it is a keyword in the SPY-WARS parser. The SPY-WARS parser makes a tag for the token next to define-agent.
The above example illustrates levels of language in an input file. scheme is used as the base language. On top of the base language we can assume an imaginary higher level language named SPY-WARS is used to write the application. To parse the source code of an application written in two stacked languages, ctags uses two stacked parsers.
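The division of labor between the two stacked parsers can be sketched in a few lines of C. This is a toy model, not code from Universal Ctags: form_hook, base_kind, spywars_kind, read_token, and classify_form are hypothetical names, and the sketch assumes each input string holds exactly one top-level form. The base level knows only define, define-class, and define-macro; every other form head is offered to the higher level parser through a callback.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: a higher level parser receives the head of a
   form and the token following it, and returns a kind name or NULL. */
typedef const char *(*form_hook)(const char *head, const char *next);

/* Base level: kinds the scheme-like parser knows by itself. */
static const char *base_kind(const char *head)
{
    if (strcmp(head, "define") == 0) return "variable";
    if (strcmp(head, "define-class") == 0) return "class";
    if (strcmp(head, "define-macro") == 0) return "macro";
    return NULL;
}

/* Higher level: SPY-WARS treats define-agent as a keyword. */
static const char *spywars_kind(const char *head, const char *next)
{
    (void)next;
    return strcmp(head, "define-agent") == 0 ? "agent" : NULL;
}

/* Copy the next token at *p into buf (size must be at least 1) and
   advance *p past it. Leading whitespace and '(' are skipped. */
static void read_token(const char **p, char *buf, size_t size)
{
    size_t n = 0;

    while (**p == '(' || isspace((unsigned char)**p))
        (*p)++;
    while (**p != '\0' && **p != '(' && **p != ')'
           && !isspace((unsigned char)**p)) {
        if (n + 1 < size)
            buf[n++] = **p;
        (*p)++;
    }
    buf[n] = '\0';
}

/* Classify one top-level form. The defined name is stored in name.
   Returns the kind, or NULL when neither level recognizes the form. */
static const char *classify_form(const char *form, form_hook sub,
                                 char *name, size_t size)
{
    char head[64];
    const char *p = form;
    const char *kind;

    read_token(&p, head, sizeof head);
    read_token(&p, name, size);
    kind = base_kind(head);
    if (kind == NULL && sub != NULL)
        kind = sub(head, name);   /* give the higher level a chance */
    return kind;
}
```

With spywars_kind passed as the hook, the form (define-agent Bond (kill)) is classified as an agent named Bond; with no hook, the same form is not tagged at all, which mirrors what the scheme parser alone would do.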
Making a higher level language is a very popular technique in the languages of the lisp family (see “On Lisp” for more details). However, it is not special to lisp.
The following code is taken from the linux kernel, which is written in C:
DEFINE_EVENT(mac80211_msg_event, mac80211_info,
    TP_PROTO(struct va_format *vaf),
    TP_ARGS(vaf)
);
There is no concept of EVENT in the C language; however, it makes sense in the source tree of the linux kernel. So we can consider a linux parser, based on the C parser, which tags mac80211_msg_event as event kind.
Terms¶
Base parser and subparser¶
In the context of the SPY-WARS example, the scheme parser is called a base parser. The SPY-WARS parser is called a subparser. A base parser tags definitions found in the lower level view. A subparser on the base parser tags definitions found in the higher level view. This relationship can be nested. A subparser can be a base parser for another subparser.
At a glance the relationship between the two parsers is similar to the relationship between the guest parser and the host parser described in “Applying a parser to specified areas of input file”. However, they are different. Though a guest parser can run stand-alone, a subparser cannot; a subparser needs help from its base parser to work.
Top down parser choice and bottom up parser choice¶
There are two ways to run a subparser: top down or bottom up parser choice.
Universal Ctags can choose a subparser automatically. Matching file name patterns and extensions is the typical way of choosing. A user can also choose a subparser with the --language-force= option. Choosing a parser in these deterministic ways is called top down. When a parser is chosen as a subparser in the top down way, the subparser must call its base parser. The base parser may call methods defined in the subparser.
Universal Ctags uses the bottom up choice when the top down way doesn’t work; that is, when a given file name doesn’t match any patterns or extensions of subparsers and the user doesn’t specify --language-force= explicitly. Choosing a subparser in the bottom up way assumes that a base parser for the subparser can be chosen in the top down way. While the base parser is running, it tries to detect the use of higher level languages in the input file. As shown later in this section, the base parser utilizes methods defined in its subparsers for the detection. If the base parser detects the use of a higher level language, a subparser for the higher level language is chosen. Choosing a parser in this non-deterministic way (dynamic way) is called bottom up.
Here is an example. Universal Ctags has both an m4 parser and an Autoconf parser. The m4 parser is a base parser. The Autoconf parser is a subparser based on the m4 parser. If configure.ac is given as an input file, Universal Ctags chooses the Autoconf parser automatically because the Autoconf parser has ac in its extension list (top down choice).
If input.m4 is given as an input file, the Autoconf parser is not chosen. Instead the m4 parser is chosen automatically because the m4 parser has .m4 in its extension list. The m4 parser passes every token it finds in the input file to the Autoconf parser. The Autoconf parser gets the chance to probe whether it can handle the input or not; if a token name starts with a known prefix such as AC_, the Autoconf parser reports “this is Autoconf input though its file extension is .m4” to the m4 parser. As a result the Autoconf parser is chosen (bottom up choice).
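The bottom up choice can be modeled as a loop that offers every token to every registered subparser until one of them claims the input. The sketch below is a simplified illustration, not the code of the m4 parser; struct toy_subparser, probe_autoconf, and choose_bottom_up are hypothetical names, and the probe checks only the AC_ prefix.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified subparser record for illustration only. */
struct toy_subparser {
    const char *name;
    bool (*probe)(const char *token);   /* may be NULL: cannot be probed */
};

/* Claim the input when a token carries the AC_ prefix. */
static bool probe_autoconf(const char *token)
{
    return strncmp(token, "AC_", 3) == 0;
}

/* Offer each token to each subparser. Return the first subparser that
   claims the input, or NULL when no subparser recognizes any token. */
static const struct toy_subparser *
choose_bottom_up(const char *const *tokens, size_t ntokens,
                 const struct toy_subparser *subs, size_t nsubs)
{
    for (size_t i = 0; i < ntokens; i++)
        for (size_t j = 0; j < nsubs; j++)
            if (subs[j].probe != NULL && subs[j].probe(tokens[i]))
                return &subs[j];
    return NULL;
}
```

An input consisting only of plain m4 tokens yields NULL, so the base parser keeps running alone; an input containing AC_INIT selects the Autoconf entry.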
Some subparsers can be chosen in both the top down and bottom up ways. Other subparsers can be chosen only in the top down way or only in the bottom up way.
Exclusive subparser and coexisting subparser¶
TBW. This must be filled when I implement python-celery parser.
API for making a combination of base parser and subparsers¶
Outline¶
You have to work on both sides: a base parser and subparsers.
A base parser must define a data structure type (baseMethodTable) for its subparsers by extending struct subparser defined in main/subparser.h. A subparser defines a variable (subparser var) of type baseMethodTable, fills its fields, and registers subparser var to the base parser using the dependency API.
The base parser calls the functions pointed to by the baseMethodTable of its subparsers during parsing. A function for probing a higher level language may be included in baseMethodTable. Which fields baseMethodTable should include is up to the design of the base parser and the requirements of its subparsers. A method for probing is one of them.
Registering a subparser var to a base parser is enough for the bottom up choice. Handling the top down choice (e.g. specifying --language-force=<subparser> on a command line) needs more code.
In the top down choice, the subparser must call scheduleRunningBaseparser, declared in main/subparser.h, in its parser method. Here, parser method means a function assigned to the parser member of the parserDefinition of the subparser. scheduleRunningBaseparser takes an integer argument that specifies the dependency used for registering the subparser var.
By extending struct subparser you can define a type for your subparser. Then make a variable of that type and declare a dependency on the base parser.
Details¶
Fields of subparser type¶
Here the source code of the Autoconf/m4 parsers is used as an example.
main/types.h:
struct sSubparser;
typedef struct sSubparser subparser;
main/subparser.h:
typedef enum eSubparserRunDirection {
    SUBPARSER_BASE_RUNS_SUB = 1 << 0,
    SUBPARSER_SUB_RUNS_BASE = 1 << 1,
    SUBPARSER_BI_DIRECTION = SUBPARSER_BASE_RUNS_SUB|SUBPARSER_SUB_RUNS_BASE,
} subparserRunDirection;
struct sSubparser {
    ...
    /* public to the parser */
    subparserRunDirection direction;
    void (* inputStart) (subparser *s);
    void (* inputEnd) (subparser *s);
    void (* exclusiveSubparserChosenNotify) (subparser *s, void *data);
};
A subparser must fill the fields of subparser.
The direction field specifies how the subparser is called. If a subparser runs exclusively and is chosen in the top down way, set the SUBPARSER_SUB_RUNS_BASE flag. If a subparser runs in the coexisting way and is chosen in the bottom up way, set SUBPARSER_BASE_RUNS_SUB. Use SUBPARSER_BI_DIRECTION if both cases can occur.
The SystemdUnit parser runs as a subparser of the iniconf base parser. The SystemdUnit parser specifies SUBPARSER_SUB_RUNS_BASE because unit files of systemd have very specific file extensions though they are written in iniconf syntax. Therefore we expect the SystemdUnit parser to be chosen in the top down way. The same logic is applicable to the YumRepo parser.
The Autoconf parser specifies SUBPARSER_BI_DIRECTION. For an input file named configure.ac, the Autoconf parser is chosen in the top down way by file name matching. On the other hand, for a file named foo.m4, the Autoconf parser can be chosen in the bottom up way.
inputStart is called before the base parser starts parsing a new input file. inputEnd is called after the base parser finishes parsing the input file. The main part of Universal Ctags calls these methods. Therefore, a base parser doesn’t have to call them.
exclusiveSubparserChosenNotify is called when a parser is chosen as an exclusive parser. Calling this method is a job of a base parser.
Extending subparser type¶
The m4 parser extends the subparser type as follows:
parsers/m4.h:
typedef struct sM4Subparser m4Subparser;
struct sM4Subparser {
    subparser subparser;
    bool (* probeLanguage) (m4Subparser *m4, const char* token);
    /* return value: Cork index */
    int (* newMacroNotify) (m4Subparser *m4, const char* token);
    bool (* doesLineCommentStart) (m4Subparser *m4, int c, const char *token);
    bool (* doesStringLiteralStart) (m4Subparser *m4, int c);
};
Put subparser as the first member of the extended struct (here sM4Subparser). In addition to the first field, four methods are defined in the extended struct.
Until a subparser is chosen for the current input file, the m4 parser calls the probeLanguage method of its subparsers each time it finds a token in the input file. A subparser returns true if, by analyzing the tokens passed from the base parser, it recognizes that the input file is written for itself.
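Placing the generic struct first matters because C guarantees that a pointer to a struct, suitably converted, points to its first member and vice versa. Generic code can therefore hold a pointer to the embedded member while the base parser casts it back to the extended type. The following sketch demonstrates the pattern with hypothetical stand-in types (toy_sub, toy_m4_sub); it is not the ctags definition.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the generic subparser type. */
struct toy_sub {
    const char *language;
};

/* Hypothetical stand-in for a type extended by a base parser. */
struct toy_m4_sub {
    struct toy_sub sub;   /* must be the first member */
    bool (*probe)(struct toy_m4_sub *self, const char *token);
};

static bool probe_ac(struct toy_m4_sub *self, const char *token)
{
    (void)self;
    return strncmp(token, "AC_", 3) == 0;
}

/* Generic code sees only struct toy_sub *. The base parser, which knows
   the extended layout, casts the pointer back before calling a method. */
static bool base_probe(struct toy_sub *generic, const char *token)
{
    struct toy_m4_sub *m4 = (struct toy_m4_sub *)generic;

    return m4->probe != NULL && m4->probe(m4, token);
}
```

If the generic struct were placed anywhere other than first, the cast in base_probe would yield a pointer into the middle of the extended struct and the method lookup would read the wrong memory.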
parsers/autoconf.c:
extern parserDefinition* AutoconfParser (void)
{
    static const char *const patterns [] = { "configure.in", NULL };
    static const char *const extensions [] = { "ac", NULL };
    parserDefinition* const def = parserNew("Autoconf");
    static m4Subparser autoconfSubparser = {
        .subparser = {
            .direction = SUBPARSER_BI_DIRECTION,
            .exclusiveSubparserChosenNotify = exclusiveSubparserChosenCallback,
        },
        .probeLanguage = probeLanguage,
        .newMacroNotify = newMacroCallback,
        .doesLineCommentStart = doesLineCommentStart,
        .doesStringLiteralStart = doesStringLiteralStart,
    };
The probeLanguage function defined in autoconf.c is connected to the probeLanguage member of autoconfSubparser. The probeLanguage function of Autoconf is very simple:
parsers/autoconf.c:
static bool probeLanguage (m4Subparser *m4, const char* token)
{
    return strncmp (token, "m4_", 3) == 0
        || strncmp (token, "AC_", 3) == 0
        || strncmp (token, "AM_", 3) == 0
        || strncmp (token, "AS_", 3) == 0
        || strncmp (token, "AH_", 3) == 0
        ;
}
This function checks the prefix of the passed token. If a known prefix is found, Autoconf assumes this is Autoconf input and returns true.
parsers/m4.c:
if (m4tmp->probeLanguage
    && m4tmp->probeLanguage (m4tmp, token))
{
    /* tmp is the generic subparser pointer, as in the full listing of
       maySwitchLanguage later in this section */
    chooseExclusiveSubparser (tmp, NULL);
    m4found = m4tmp;
}
The m4 parser calls the probeLanguage function of a subparser. If true is returned, the m4 parser calls the chooseExclusiveSubparser function, which is defined in the main part. chooseExclusiveSubparser calls the exclusiveSubparserChosenNotify method of the chosen subparser.
The method is implemented in the Autoconf subparser as follows:
parsers/autoconf.c:
static void exclusiveSubparserChosenCallback (subparser *s, void *data)
{
    setM4Quotes ('[', ']');
}
It changes quote characters of the m4 parser.
Making a tag in a subparser¶
By calling callback functions defined in its subparsers, a base parser gives them the chance to make tag entries.
The m4 parser calls the newMacroNotify method when it finds that an m4 macro is used. The Autoconf parser connects the newMacroCallback function defined in parsers/autoconf.c to this method.
parsers/autoconf.c:
static int newMacroCallback (m4Subparser *m4, const char* token)
{
    int keyword;
    int index = CORK_NIL;
    keyword = lookupKeyword (token, getInputLanguage ());
    /* TODO:
       AH_VERBATIM
     */
    switch (keyword)
    {
    case KEYWORD_NONE:
        break;
    case KEYWORD_init:
        index = makeAutoconfTag (PACKAGE_KIND);
        break;
    ...
extern parserDefinition* AutoconfParser (void)
{
    ...
    static m4Subparser autoconfSubparser = {
        .subparser = {
            .direction = SUBPARSER_BI_DIRECTION,
            .exclusiveSubparserChosenNotify = exclusiveSubparserChosenCallback,
        },
        .probeLanguage = probeLanguage,
        .newMacroNotify = newMacroCallback,
In the newMacroCallback function, the Autoconf parser receives the name of the macro found by the base parser and analyzes whether the macro is interesting in the context of the Autoconf language or not. If it is an interesting name, the Autoconf parser makes a tag for it.
Calling methods of subparsers from a base parser¶
A base parser can use the foreachSubparser macro for accessing its subparsers. A base parser should call enterSubparser before calling a method of a subparser, and call leaveSubparser after calling the method. The macro and functions are declared in main/subparser.h.
parsers/m4.c:
static m4Subparser * maySwitchLanguage (const char* token)
{
    subparser *tmp;
    m4Subparser *m4found = NULL;
    foreachSubparser (tmp, false)
    {
        m4Subparser *m4tmp = (m4Subparser *)tmp;
        enterSubparser(tmp);
        if (m4tmp->probeLanguage
            && m4tmp->probeLanguage (m4tmp, token))
        {
            chooseExclusiveSubparser (tmp, NULL);
            m4found = m4tmp;
        }
        leaveSubparser();
        if (m4found)
            break;
    }
    return m4found;
}
foreachSubparser takes a variable having type subparser. For each iteration, the value of the variable is updated.
enterSubparser takes a variable having type subparser. By calling enterSubparser, the current language (the value returned from getInputLanguage) is temporarily switched to the language specified with the variable. One of the effects of switching is that the language field of tags made in the callback function called between enterSubparser and leaveSubparser is adjusted.
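The effect of switching can be modeled with a small stack of language names. The following is only a sketch under that assumption; toy_enter, toy_leave, toy_current_language, and toy_make_tag are hypothetical names, not functions of Universal Ctags, and the real implementation may keep its state differently.

```c
#include <stddef.h>

/* Hypothetical model of the "current language" state; not ctags code.
   The base language "M4" sits at the bottom of the stack. */
enum { TOY_DEPTH_MAX = 8 };
static const char *toy_stack[TOY_DEPTH_MAX] = { "M4" };
static size_t toy_depth = 1;

static const char *toy_current_language(void)
{
    return toy_stack[toy_depth - 1];
}

/* Switch to the language of a subparser. Returns 0 on success,
   or -1 when the stack is full. */
static int toy_enter(const char *language)
{
    if (toy_depth == TOY_DEPTH_MAX)
        return -1;
    toy_stack[toy_depth++] = language;
    return 0;
}

/* Switch back to the previous language. The base entry is never popped. */
static void toy_leave(void)
{
    if (toy_depth > 1)
        toy_depth--;
}

struct toy_tag {
    const char *name;
    const char *language;
};

/* A tag records whichever language is current when it is made. */
static struct toy_tag toy_make_tag(const char *name)
{
    struct toy_tag t = { name, toy_current_language() };
    return t;
}
```

A tag made between toy_enter("Autoconf") and toy_leave() carries the language Autoconf, while tags made before and after carry M4. This is why a base parser should bracket every subparser method call with the enter and leave calls.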
Registering a subparser to its base parser¶
Use the DEPTYPE_SUBPARSER dependency in a subparser for registration.
parsers/autoconf.c:
extern parserDefinition* AutoconfParser (void)
{
    parserDefinition* const def = parserNew("Autoconf");
    static m4Subparser autoconfSubparser = {
        .subparser = {
            .direction = SUBPARSER_BI_DIRECTION,
            .exclusiveSubparserChosenNotify = exclusiveSubparserChosenCallback,
        },
        .probeLanguage = probeLanguage,
        .newMacroNotify = newMacroCallback,
        .doesLineCommentStart = doesLineCommentStart,
        .doesStringLiteralStart = doesStringLiteralStart,
    };
    static parserDependency dependencies [] = {
        [0] = { DEPTYPE_SUBPARSER, "M4", &autoconfSubparser },
    };
    def->dependencies = dependencies;
    def->dependencyCount = ARRAY_SIZE (dependencies);
DEPTYPE_SUBPARSER is specified in the 0th element of the dependencies function static variable. Next, a literal string “M4” is specified, and autoconfSubparser follows. The intent of the code is to register the autoconfSubparser subparser definition to a base parser named “M4”.
The dependencies function static variable must be assigned to the dependencies field of a variable of type parserDefinition. The main part of Universal Ctags refers to the field when initializing parsers.
[0] emphasizes that this is “the 0th element”. The subparser may refer to the index of the array when the subparser calls scheduleRunningBaseparser.
Scheduling running the base parser¶
When a subparser is chosen in the top down way, the subparser must call scheduleRunningBaseparser in its main parser method.
parsers/autoconf.c:
static void findAutoconfTags(void)
{
    scheduleRunningBaseparser (0);
}
extern parserDefinition* AutoconfParser (void)
{
    ...
    parserDefinition* const def = parserNew("Autoconf");
    ...
    static parserDependency dependencies [] = {
        [0] = { DEPTYPE_SUBPARSER, "M4", &autoconfSubparser },
    };
    def->dependencies = dependencies;
    ...
    def->parser = findAutoconfTags;
    ...
    return def;
}
A subparser does nothing actively by itself. A base parser makes its subparsers work by calling their methods. Therefore a subparser must run its base parser when the subparser is chosen in the top down way. The main part provides the scheduleRunningBaseparser function for this purpose.
A subparser should call the function from the parser method of its parserDefinition. scheduleRunningBaseparser takes an integer that specifies the index of the dependency used for registering the subparser.
Command line interface¶
Running subparsers can be controlled with the subparser (s) extras flag. It is enabled by default. To turn off running subparsers, specify --extras=-s.
When the --extras=+s option is given, a tag entry recorded by a subparser is marked as follows:
TMPDIR input.ac /^AH_TEMPLATE([TMPDIR],$/;" template extras:subparser end:4
See also “Defining a subparser”.
Examples of sub/base combinations¶
Automake/Make parser combination¶
Simply put, the syntax of Automake is a subset of Make. However, the Automake parser is interested in Make macros having special suffixes such as _PROGRAMS, _LTLIBRARIES, and _SCRIPTS.
Here is an example of input for Automake:
bin_PROGRAMS = ctags
ctags_CPPFLAGS = \
    -I. \
    -I$(srcdir) \
    -I$(srcdir)/main
From the point of view of the Make parser, bin_PROGRAMS is just a macro; the Make parser tags bin_PROGRAMS as a macro. The Make parser doesn’t tag ctags on the right side of ‘=’ because it is not a new name: it is just a value assigned to bin_PROGRAMS. However, for the Automake parser ctags is a new name; the Automake parser tags ctags with kind Program. The Automake parser can tag it with help from the Make parser.
The Automake parser is an exclusive subparser. It is chosen in the top down way; an input file name Makefile.am gives enough information for choosing the Automake parser.
To give the Automake parser chances to capture Automake’s own definitions, the Make parser provides the following interface in parsers/make.h:
struct sMakeSubparser {
    subparser subparser;
    void (* valueNotify) (makeSubparser *s, char* name);
    void (* directiveNotify) (makeSubparser *s, char* name);
    void (* newMacroNotify) (makeSubparser *s,
                             char* name,
                             bool withDefineDirective,
                             bool appending);
};
The Automake parser defines methods for tagging Automake’s own definitions in a variable of type struct sMakeSubparser, and runs the Make parser by calling the scheduleRunningBaseparser function.
The Make parser tags Make’s own definitions in an input file. In addition, the Make parser calls the methods while parsing the input file.
$ ctags --fields=+lK --extras=+r -o - Makefile.am
bin Makefile.am /^bin_PROGRAMS = ctags$/;" directory language:Automake
bin_PROGRAMS Makefile.am /^bin_PROGRAMS = ctags$/;" macro language:Make
ctags Makefile.am /^bin_PROGRAMS = ctags$/;" program language:Automake directory:bin
ctags_CPPFLAGS Makefile.am /^ctags_CPPFLAGS = \\$/;" macro language:Make
bin_PROGRAMS and ctags_CPPFLAGS are tagged as macros of Make. In addition bin is tagged as directory, and ctags as program of Automake.
bin is tagged in a callback function assigned to the newMacroNotify method. ctags is tagged in a callback function assigned to the valueNotify method.
--extras=+r is used in the example. The reference (r) extra is needed to tag bin. bin is not defined in the line bin_PROGRAMS =; it is referenced as the name of a directory where programs are stored. Therefore r is needed.
To tag ctags, the Automake parser must first recognize bin in bin_PROGRAMS. ctags is tagged because it is specified as a value for bin_PROGRAMS. As a result, r is also needed to tag ctags.
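Recognizing bin in bin_PROGRAMS amounts to splitting a macro name at a known suffix. The following sketch shows one way to do that; split_primary is a hypothetical helper written for this illustration and is not taken from the Automake parser.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* If macro ends with suffix and has a non-empty part before it, copy
   that part into dir and return true. Return false when the suffix does
   not match or when the part does not fit into dir (size bytes). */
static bool split_primary(const char *macro, const char *suffix,
                          char *dir, size_t size)
{
    size_t mlen = strlen(macro);
    size_t slen = strlen(suffix);

    if (mlen <= slen || strcmp(macro + mlen - slen, suffix) != 0)
        return false;
    if (mlen - slen >= size)
        return false;            /* no room for the terminating NUL */
    memcpy(dir, macro, mlen - slen);
    dir[mlen - slen] = '\0';
    return true;
}
```

For bin_PROGRAMS with the suffix _PROGRAMS the helper yields bin, which corresponds to the directory tag in the example output. For ctags_CPPFLAGS it returns false, so that macro stays a plain Make macro.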
Only Automake related tags are emitted if Make parser is disabled.
$ ctags --languages=-Make --fields=+lKr --extras=+r -o - Makefile.am
bin Makefile.am /^bin_PROGRAMS = ctags$/;" directory language:Automake roles:program
ctags Makefile.am /^bin_PROGRAMS = ctags$/;" program language:Automake directory:bin
Autoconf/M4 parser combination¶
Universal Ctags uses the m4 parser as a base parser and the Autoconf parser as a subparser for a configure.ac input file.
AC_DEFUN([PRETTY_VAR_EXPAND],
    [$(eval "$as_echo_n" $(eval "$as_echo_n" "${$1}"))])
The m4 parser finds no definition here. However, the Autoconf parser finds PRETTY_VAR_EXPAND as a macro definition. Syntax like (...) is part of the M4 language, so the Autoconf parser is implemented as a subparser of the m4 parser. Most of the tokens in input files are handled by the m4 parser. The Autoconf parser gives hints for parsing configure.ac and registers callback functions to the m4 parser.