Topic: parsing c++ without a symbol table!


Author: "KNAPEN, GREGORY" <gregory.knapen@bell.ca>
Date: 1998/08/02
Raw View
Steve Clamage wrote:
>
> "KNAPEN, GREGORY" <gregory.knapen@bell.ca> writes:
> >What I am trying to do is to parse source code which might have
> missing
> >header files.  In that case the symbol table is incomplete. This
> means
> >that unambiguous declarations can fail because the type was not
> >previously declared.
>
> You will not be able to do anything about code that uses templates
> without a scoped symbol table, and you won't be able to parse
> code at all if template declarations are missing. You have to
> know whether a name is the name of a namespace, type template, or
> something else IN THE CURRENT SCOPE to have any chance of parsing
> general C++ code.
>
>         int V, z;
>         template<class T> int f(T t) {
>                 T *x; // x is local variable of template arg type
>                 V *z; // multiply global V by z and discard result

This example does not only apply to templates. Yes recognizing between a
declaration and an expression is difficult in c++.  I my case I would
always consider this to be a var declaration.  This is obviouly wrong,
but it would not affect the type of analysis I want to do.

>                 {
>                         typedef T V;
>                         V* z; // z is local variable of template arg
> type
>                 }
>         }


Maybe I should clarify things a bit more, I not trying to write a
compiler.  I just want to build a parser that will recognize as much as
possible of a given program that has missing header files. Of course,
for any ambiguous sentence such as

T(a); // function call or varible declaration(if T is a type)
      // in a local scope

I will not be able to determine what this means without knowing exactly
the type of T.  To make things worse, the scoping rules of c++ really
make matters worse.
In fact this sentence remains always ambiguous as soon as there is a
missing header file.

Using a small example:

// missing.h
int X(int p){ // body of function }
// end missing .h

//main.C
#include "missing.h"

struct X{ int x;};
int a = 0;

int main()
{
   X(a); //declaration or function call?
 // you could assume that X is a variable declaration since X is
 // defined as a struct
 // but in reality it is a function call because the name of the
function         // hides the name of the class
}

Even with X in the symbol table(defined as a Type), you can not be 100%
sure that this is a variable declaration since a function or variable
name could hide the name of the X struct.

Even if I have a symbol table, in the abscence of header files, this
table is potentially incomplete, and there are many sentences that will
remain ambiguous due to the scoping rules of c++.

Greg Knapen
---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://reality.sgi.com/austern_mti/std-c++/faq.html              ]






Author: AllanW@my-dejanews.com
Date: 1998/08/02
Raw View
In article <6pod5c$717@engnews1.Eng.Sun.COM>,
  clamage@Eng.Sun.COM (Steve Clamage) wrote:
> "KNAPEN, GREGORY" <gregory.knapen@bell.ca> writes:
>
> >Ok, I think I jumped to conclusions too quickly.  Thank you all for
> >correcting me.
>
> >I am still interested in gathering a list of all the ambiguities in the
> >language that can only be resolved by using semantics (usually looking
> >up a symbol table).  Usually we want to know if a given symbol is a type
> >or not.
>
> >What I am trying to do is to parse source code which might have missing
> >header files.  In that case the symbol table is incomplete. This means
> >that unambiguous declarations can fail because the type was not
> >previously declared.

It might make more sense to use a symbol table after all, but to
find the missing header files automatically.  For instance, if a
program uses cout but cout has never been defined, you could
spit out an error message stating that either <iostream> or
<iostream.h> should have been included. Then you can choose one
based on user preferences and re-analyze the program as if that
header file had been included after all.

I frequently find myself in the position of wanting to remove
unwanted header files, particularly after deleting more than a
few "old" functions from a source file (or moving them to a new
source file). After trying a variety of techniques, I've settled
on commenting out all the header files and then recompiling.
Naturally I get errors relating to certain symbols; I then find
out which header files define those symbols, and uncomment them.
Once I have a clean compile, I usually remove the comments for
the no-longer-needed header files.

> You will not be able to do anything about code that uses templates
> without a scoped symbol table, and you won't be able to parse
> code at all if template declarations are missing. You have to
> know whether a name is the name of a namespace, type template, or
> something else IN THE CURRENT SCOPE to have any chance of parsing
> general C++ code.
>
>  int V, z;
>  template<class T> int f(T t) {
>   T *x; // x is local variable of template arg type
>   V *z; // multiply global V by z and discard result
>   {
>    typedef T V;
>    V* z; // z is local variable of template arg type
>   }
>  }

Doesn't even need to involve templates.

    int main(int,char**) {
        int V,z;
        V*z; // Again, multiply and discard result
        {
            typedef int V;
            V*z; // Again, z is local variable
        }
    }

-----== Posted via Deja News, The Leader in Internet Discussion ==-----
http://www.dejanews.com/rg_mkgrp.xp   Create Your Own Free Member Forum
---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://reality.sgi.com/austern_mti/std-c++/faq.html              ]






Author: "KNAPEN, GREGORY" <gregory.knapen@bell.ca>
Date: 1998/07/29
Raw View
Nathan Myers wrote:
>
> KNAPEN, GREGORY<gregory.knapen@bell.ca> wrote:
> >
> >For example, a c-style type cast is usually recognized by checking if
> >the identifier between parenthesis is a type or not.  It is possible
> to
> >find a type cast by the syntax alone.
> >
> >var = (Type1)(Type2)...(TypeN)(expression);
> >
> >an expression between () is a type cast if and only if it is followed
> by
> >another typecast or an expression.
>
> Sorry, not true.  This could be a function call, or a series of
> function calls.  For example,
>
>   (f)(g)(h)(i,j)
>
> could be equivalent to
>
>   f(g).operator()(h).operator(i,j)

Ok, I think I jumped to conclusions too quickly.  Thank you all for
correcting me.

I am still interested in gathering a list of all the ambiguities in the
language that can only be resolved by using semantics (usually looking
up a symbol table).  Usually we want to know if a given symbol is a type
or not.

What I am trying to do is to parse source code which might have missing
header files.  In that case the symbol table is incomplete. This means
that unambiguous declarations can fail because the type was not
previously declared.

T a; // will fail if T was declared in a missing .h file
// because T will not be available in symbol table

So I try to resort to semantics only when I absolutely need it.  Hence
the idea of making this list of ambiguous cases. These would probably be
treated differently in my c++ grammar.

So I am still looking for more examples like the ones discussed in this
thread.

Thanks


Greg Knapen
---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://reality.sgi.com/austern_mti/std-c++/faq.html              ]






Author: James Kuyper <kuyper@wizard.net>
Date: 1998/07/30
Raw View
KNAPEN, GREGORY wrote:
...
> What I am trying to do is to parse source code which might have missing
> header files.  In that case the symbol table is incomplete. This means
> that unambiguous declarations can fail because the type was not
> previously declared.

Header files often contain #define's; your chances of coming even close
to correctly parsing the code, even if just to the accuracy required to
gather metrics, can drop to zero if certain #define statements are
missing.

      [ Send an empty e-mail to c++-help@netlab.cs.rpi.edu for info ]
      [ about comp.lang.c++.moderated. First time posters: do this! ]
---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://reality.sgi.com/austern_mti/std-c++/faq.html              ]





Author: AllanW@my-dejanews.com
Date: 1998/07/30
Raw View
In article <35BC8C0D.748DB34E@bell.ca>,
  "KNAPEN, GREGORY" <gregory.knapen@bell.ca> wrote:
> Hi,
>
> I am building a c++ parser that recognizes c++ by using the syntax only.
> I don't use any semantic information i.e. there is no need for a symbol
> table.  Of course, this parser can not be use as a compiler because the
> language contains ambiguities. This parser is intended to gather metrics
> from source code.
>
> While doing this project, I found that most of c++ can be parsed by
> using the syntax alone except for three cases:
[Snip]
> For example, a c-style type cast is usually recognized by checking if
> the identifier between parenthesis is a type or not.  It is possible to
> find a type cast by the syntax alone.
>
> var = (Type1)(Type2)...(TypeN)(expression);
>
> an expression between () is a type cast if and only if it is followed by
> another typecast or an expression.  This requires a lot of
> backtracking(inefficient) but it illustrates the point that the sentence
> can be recognized without using semantic information.

    static const char str[] = "Hello";
    int strlen(const char*);
    int len;
    /* */
    len = (strlen)(str);
Since you're working by syntax alone, you will conclude that str is
being typecast to type strlen.

    bool flag, comp; // or in C use int
    static const char str1[] = "Hello", str2[] = "HELL-NO";
    /* */
    comp = 0==(flag ? stricmp : strcmp)(str1,str2);
Here, you will conclude that the comma operator is being applied to
str1,str2 (result is str2), and that this is being typecast to type
    flag ? stricmp : strcmp
and the result compared to 0.

-----== Posted via Deja News, The Leader in Internet Discussion ==-----
http://www.dejanews.com/rg_mkgrp.xp   Create Your Own Free Member Forum

      [ Send an empty e-mail to c++-help@netlab.cs.rpi.edu for info ]
      [ about comp.lang.c++.moderated. First time posters: do this! ]
---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://reality.sgi.com/austern_mti/std-c++/faq.html              ]





Author: clamage@Eng.Sun.COM (Steve Clamage)
Date: 1998/07/30
Raw View
"KNAPEN, GREGORY" <gregory.knapen@bell.ca> writes:

>Ok, I think I jumped to conclusions too quickly.  Thank you all for
>correcting me.

>I am still interested in gathering a list of all the ambiguities in the
>language that can only be resolved by using semantics (usually looking
>up a symbol table).  Usually we want to know if a given symbol is a type
>or not.

>What I am trying to do is to parse source code which might have missing
>header files.  In that case the symbol table is incomplete. This means
>that unambiguous declarations can fail because the type was not
>previously declared.

You will not be able to do anything about code that uses templates
without a scoped symbol table, and you won't be able to parse
code at all if template declarations are missing. You have to
know whether a name is the name of a namespace, type template, or
something else IN THE CURRENT SCOPE to have any chance of parsing
general C++ code.

 int V, z;
 template<class T> int f(T t) {
  T *x; // x is local variable of template arg type
  V *z; // multiply global V by z and discard result
  {
   typedef T V;
   V* z; // z is local variable of template arg type
  }
 }
--
Steve Clamage, stephen.clamage@sun.com

      [ Send an empty e-mail to c++-help@netlab.cs.rpi.edu for info ]
      [ about comp.lang.c++.moderated. First time posters: do this! ]
---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://reality.sgi.com/austern_mti/std-c++/faq.html              ]





Author: kanze@my-dejanews.com
Date: 1998/07/29
Raw View
In article <35BC8C0D.748DB34E@bell.ca>,
  "KNAPEN, GREGORY" <gregory.knapen@bell.ca> wrote:

> I was wodering if there were other such cases where a sentence needs
> semantic information to be made non ambiguous.  Any case that can be
> recognized by de syntax alone does not qualify.  I assume that I have
> infinite lookahead(backtracking).

I believe that Andy Koenig is the author of this one:

    a < b > ( c )

If a is declared as a class template, and b is a type, then this
declares
a variable c of type a<b>.  If a, b and c are normal variables, then
this
is a simple (rather stupid) expression, which first compares a with b,
and then compares the results with c.

I think that there are a number of variants of this.

--
James Kanze    +33 (0)1 39 23 84 71    mailto: kanze@gabi-soft.fr
        +49 (0)69 66 45 33 10    mailto: jkanze@otelo.ibmmail.com
GABI Software, 22 rue Jacques-Lemercier, 78000 Versailles, France
Conseils en informatique orient   e objet --
              -- Beratung in objektorientierter Datenverarbeitung

-----== Posted via Deja News, The Leader in Internet Discussion ==-----
http://www.dejanews.com/rg_mkgrp.xp   Create Your Own Free Member Forum

      [ Send an empty e-mail to c++-help@netlab.cs.rpi.edu for info ]
      [ about comp.lang.c++.moderated. First time posters: do this! ]
---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://reality.sgi.com/austern_mti/std-c++/faq.html              ]





Author: "Bill Wade" <bill.wade@stoner.com>
Date: 1998/07/29
Raw View
KNAPEN, GREGORY wrote in message <35BC8C0D.748DB34E@bell.ca>...

>an expression between () is a type cast if and only if it is followed by
>another typecast or an expression.


(foo)*bar;    // foo*bar; or (foo)(*bar); ?
(foo)-bar;    // foo-bar; or (foo)(-bar); ?

It seems you need to know whether foo is a type or a value.




      [ Send an empty e-mail to c++-help@netlab.cs.rpi.edu for info ]
      [ about comp.lang.c++.moderated. First time posters: do this! ]
---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://reality.sgi.com/austern_mti/std-c++/faq.html              ]





Author: David R Tribble <david.tribble@noSPAM.central.beasys.com>
Date: 1998/07/29
Raw View
KNAPEN, GREGORY wrote:
> I am building a c++ parser that recognizes c++ by using the syntax
> only. I don't use any semantic information i.e. there is no need for a
> symbol table.  Of course, this parser can not be use as a compiler
> because the language contains ambiguities. This parser is intended to
> gather metrics from source code.
>
> While doing this project, I found that most of c++ can be parsed by
> using the syntax alone except for three cases:
 [snip]
>
> I was wondering if there were other such cases where a sentence needs
> semantic information to be made non ambiguous.  Any case that can be
> recognized by de syntax alone does not qualify.  I assume that I have
> infinite lookahead(backtracking).

All the cases you mention can be resolved if you keep a table of
type names (class/struct/union tags, enum tags, and typedef names).
A type-name identifier can then be a different token that a non-type
identifier token.  (This is how some C compilers work, BTW.)
(The only special cases are constructor and destructor function
names, which are really type (class) names.)

The problem you run into now is that you have to deal with name
scope problems; an identifier might be a type-name in one scope and
a variable or function name in another scope.  This complicates
your parser considerably.

Of course, you can take the easy way out if you assume that all
of your code follows some naming scheme that separates type names
from variable/function names (such as making all type names begin
with an uppercase letter, and all variable/function/label names
begin with lowercase letters).  Then maybe your parser won't have
to work so hard.

-- David R. Tribble, dtribble@technologist.com --

      [ Send an empty e-mail to c++-help@netlab.cs.rpi.edu for info ]
      [ about comp.lang.c++.moderated. First time posters: do this! ]
---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://reality.sgi.com/austern_mti/std-c++/faq.html              ]





Author: ncm@nospam.cantrip.org (Nathan Myers)
Date: 1998/07/29
Raw View
KNAPEN, GREGORY<gregory.knapen@bell.ca> wrote:
>
>For example, a c-style type cast is usually recognized by checking if
>the identifier between parenthesis is a type or not.  It is possible to
>find a type cast by the syntax alone.
>
>var = (Type1)(Type2)...(TypeN)(expression);
>
>an expression between () is a type cast if and only if it is followed by
>another typecast or an expression.

Sorry, not true.  This could be a function call, or a series of
function calls.  For example,

  (f)(g)(h)(i,j)

could be equivalent to

  f(g).operator()(h).operator(i,j)

--
Nathan Myers
ncm@nospam.cantrip.org  http://www.cantrip.org/


      [ Send an empty e-mail to c++-help@netlab.cs.rpi.edu for info ]
      [ about comp.lang.c++.moderated. First time posters: do this! ]
---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://reality.sgi.com/austern_mti/std-c++/faq.html              ]





Author: "KNAPEN, GREGORY" <gregory.knapen@bell.ca>
Date: 1998/07/28
Raw View
Hi,

I am building a c++ parser that recognizes c++ by using the syntax only.
I don't use any semantic information i.e. there is no need for a symbol
table.  Of course, this parser can not be use as a compiler because the
language contains ambiguities. This parser is intended to gather metrics
from source code.

While doing this project, I found that most of c++ can be parsed by
using the syntax alone except for three cases:

1. ambiguity between function call and variable declaration

ex: T(a); or T(*a); etc..

this would be a variable declaration if T is a type or a function call
if T is a function.

2. ambiguity between function declaration and variable declaration

ex: int X(A);

if A is a type A X is a function declaration
if A is a variable x is a var initialized with A

3. ambiguous parameter

ex: int F(T(C));

if C is a type the declaration becomes int F(T(*fp)(C c));
if C is a new id it becomes int F(T C);

I was wodering if there were other such cases where a sentence needs
semantic information to be made non ambiguous.  Any case that can be
recognized by de syntax alone does not qualify.  I assume that I have
infinite lookahead(backtracking).

For example, a c-style type cast is usually recognized by checking if
the identifier between parenthesis is a type or not.  It is possible to
find a type cast by the syntax alone.

var = (Type1)(Type2)...(TypeN)(expression);

an expression between () is a type cast if and only if it is followed by
another typecast or an expression.  This requires a lot of
backtracking(inefficient) but it illustrates the point that the sentence
can be recognized without using semantic information.

So I was wondering if there were other families of sentences besides the
ones listed that required semantic information to be made non ambiguous?


Greg Knapen
---
[ comp.std.c++ is moderated.  To submit articles, try just posting with ]
[ your news-reader.  If that fails, use mailto:std-c++@ncar.ucar.edu    ]
[              --- Please see the FAQ before posting. ---               ]
[ FAQ: http://reality.sgi.com/austern_mti/std-c++/faq.html              ]