
compiler and metadata, request opinions...

On Wed, 22 Apr 2009 18:06:30 -0700, "cr88192" <...@hotmail.com

originally I started writing all this to the mono people, but it is OT and I
doubt they would care...

so, here is the status:
I have partial/incomplete frontend support for both Java and C#, as well as
good old C.
I am actually using the same parser for all 3 languages (and C++ as well,
but this is a lower priority), where an internal "lang" variable is used to
remember which language is being processed at the time (and adapt behavior
as appropriate).
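
As an illustration of the shared-parser idea above, here is a minimal sketch in Python. It is not the actual parser: the keyword sets, the `Parser` class, and the `classify` helper are all hypothetical, and only show how a single remembered "lang" value can change how the same input is treated.

```python
# Hypothetical sketch: one front end shared by several languages, with a
# "lang" field deciding which words count as keywords. The keyword sets are
# deliberately tiny and are assumptions, not the real compiler's tables.
COMMON_KEYWORDS = {"if", "else", "while", "for", "return", "int", "char"}
EXTRA_KEYWORDS = {
    "c": {"struct", "union", "typedef"},
    "java": {"class", "interface", "extends", "implements", "import"},
    "cs": {"class", "interface", "struct", "namespace", "using"},
}


class Parser:
    def __init__(self, lang):
        if lang not in EXTRA_KEYWORDS:
            raise ValueError(f"unknown language: {lang}")
        self.lang = lang  # remembered for the whole parse

    def is_keyword(self, word):
        # Shared syntax is handled once; only the differences consult lang.
        return word in COMMON_KEYWORDS or word in EXTRA_KEYWORDS[self.lang]

    def classify(self, source):
        # Tag each whitespace-separated word as keyword ("kw") or identifier ("id").
        return [("kw" if self.is_keyword(w) else "id", w) for w in source.split()]
```

With this sketch, `class` is an ordinary identifier when parsing C but a keyword when parsing Java or C#, while `if` and `return` are handled by the same code path in all three.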

a lot of the runtime machinery (classes/objects/interfaces, structs,
exception handling, ...) is in place.

a lot of the upper/middle compiler machinery is still lacking (such as
support for all the above features...). so, C is still the only language
which currently works...

generics currently fill me with dread (C# and Java have them, C++ calls them
templates, but they look to be a horrible pain if/when I have to implement
them...).

currently, the existing path used for C is being "widened" to facilitate the
newer features, leading to (thus far) much internal reworking of the
compiler.

I have determined that, due to semantic and architectural issues, I can't
embed the metadata directly into the object modules (the reason being that
COFF and ELF modules are loaded as needed, but for technical reasons all of
the metadata needs to be available to the runtime prior to the linking
process).

further context:
portions of the runtime may register themselves with the linker, where a
request for a particular piece of information is embedded in a symbol (sort
of like in HTTP CGI requests), and so when a module is linked, the runtime
may receive the request and generate any code or data necessary to fulfill
this request.
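
A rough sketch of the symbol-as-request idea, in Python. The symbol spelling (`name?key=value&...`), the `Linker` class, and the `__rt_field_offset` handler are all made up for illustration; the post does not say how requests are actually encoded, only that they resemble CGI requests.

```python
from urllib.parse import parse_qsl


class Linker:
    """Hypothetical linker that lets runtime components answer symbol requests."""

    def __init__(self):
        self.handlers = {}  # request name -> callable(args dict)
        self.symbols = {}   # already resolved symbols

    def register(self, name, handler):
        self.handlers[name] = handler

    def resolve(self, symbol):
        if symbol in self.symbols:
            return self.symbols[symbol]
        # Split "name?query" much like a CGI request line.
        name, sep, query = symbol.partition("?")
        handler = self.handlers.get(name)
        if not sep or handler is None:
            raise KeyError(f"unresolved symbol: {symbol}")
        value = handler(dict(parse_qsl(query)))
        self.symbols[symbol] = value  # generate once, reuse afterwards
        return value


# Assumed example data: field offsets that the runtime would know at link time.
FIELD_OFFSETS = {("Foo", "x"): 8, ("Foo", "y"): 12}


def field_offset(args):
    return FIELD_OFFSETS[(args["class"], args["field"])]


linker = Linker()
linker.register("__rt_field_offset", field_offset)
```

Here a module that references `__rt_field_offset?class=Foo&field=y` gets the value 12 patched in at link time, with no API call in the generated code.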

I had decided this approach was preferable to having masses of API calls in
the produced code for reasons of both implementing the compiler machinery
(having to manually manage and generate thunks to call into the API is
awkward), as well as performance (since the runtime can generate much
lighter-weight code when accessing, say, static members or methods than
when accessing instance members or virtual methods, but this benefit would
be lost when using API calls).

all this is because caching is used (most code is dynamically loaded at
runtime, and it is preferable to only recompile changed modules), and COFF
or ELF modules may be used for caching purposes, where collections of such
modules may be packaged in a GNU-AR library (ZIP is another possibility...).

after much mental debate, I settled on using a table-based structure for
storing metadata, although the exact contents of these tables differ
somewhat from .NET metadata (the structure and contents were more influenced
by Java .class files, but are organized in tables more like in .NET). these
tables are more or less based on the relational model (and are queried in a
similar manner), but differ in that rows may be referred to by index (not
technically allowed in relational databases, where one can usually only
refer to a row via its primary key or by its contents).
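
To make the table idea concrete, here is a small Python sketch. The `Table` class and the column names (`name`, `super`, `owner`, `sig`) are assumptions, not the actual schema; the point is only that rows can be found either by content (relational style) or directly by index, and that one table can point into another by row index.

```python
class Table:
    """Hypothetical metadata table: named columns, rows addressable by index."""

    def __init__(self, *columns):
        self.columns = columns
        self.rows = []

    def add(self, *values):
        if len(values) != len(self.columns):
            raise ValueError("wrong number of columns")
        self.rows.append(values)
        return len(self.rows) - 1  # the row index doubles as a reference

    def get(self, index, column):
        return self.rows[index][self.columns.index(column)]

    def select(self, **where):
        # Relational-style query by content; returns matching row indices.
        matches = []
        for i, row in enumerate(self.rows):
            record = dict(zip(self.columns, row))
            if all(record[k] == v for k, v in where.items()):
                matches.append(i)
        return matches


classes = Table("name", "super")
fields = Table("owner", "name", "sig")

obj = classes.add("Object", -1)   # -1: no superclass (an assumed convention)
foo = classes.add("Foo", obj)     # refers to the Object row by index
fields.add(foo, "x", "i")
fields.add(foo, "y", "i")
```

`fields.select(owner=foo)` finds the fields of Foo by content, while `classes.get(classes.get(foo, "super"), "name")` follows an index reference to reach the superclass row without any search.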

my primary reason for choosing tables was mostly related to the amount of
information likely to be present, and concern for memory overhead, where my
other options were S-Expressions and DOM trees, which although more
convenient, would have a much higher memory overhead (S-Exps would take
easily 2x to 3x more memory, and DOM trees far more, and I have many more
things in need of ram than metadata...).

externally, these tables are represented as line-orientated text files (it
is within the realm of possibility that these be stored in the AR/ZIP
libraries as well).
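
One possible shape for such a line-oriented file, sketched in Python. The `table`/`row` line format is invented here for illustration and is not the real file format; it also assumes that no value contains whitespace.

```python
def dump_tables(tables):
    """Write {name: (columns, rows)} as text: one 'table' line, then 'row' lines."""
    lines = []
    for name, (columns, rows) in tables.items():
        lines.append("table " + name + " " + " ".join(columns))
        for row in rows:
            lines.append("row " + " ".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


def load_tables(text):
    """Inverse of dump_tables; every value comes back as a string."""
    tables = {}
    current = None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # tolerate blank lines
        if parts[0] == "table":
            current = (parts[2:], [])
            tables[parts[1]] = current
        elif parts[0] == "row":
            if current is None:
                raise ValueError("row before any table line")
            current[1].append(parts[1:])
        else:
            raise ValueError("unknown line type: " + parts[0])
    return tables
```

A format like this can be read one line at a time and diffed or edited by hand, which is convenient if the file sits next to the object modules in an AR or ZIP library.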

probably when loading the library, this text file would be checked for, and
if present, it will be loaded into an in-memory version of the database. as
needed, contents may be queried from the database, and used to build other
structures (such as the in-memory class contexts, ...). the reason for not
building all of these structures outright, is that it is likely that not all
of these classes/interfaces/namespaces/... may be needed, and an in-memory
class context will use more space than its representation in the tables.
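
The lazy approach described above could look roughly like this in Python. `MetadataDB`, `ClassContext`, and the row shapes are hypothetical; the sketch only shows that a heavier class context is built from the compact rows on first request and then reused.

```python
class ClassContext:
    """Hypothetical expanded, in-memory form of a class."""

    def __init__(self, name, fields):
        self.name = name
        self.fields = fields  # field name -> signature string


class MetadataDB:
    def __init__(self, class_names, field_rows):
        self.class_names = class_names  # compact form: just names
        self.field_rows = field_rows    # compact form: (owner, name, sig) tuples
        self._contexts = {}             # contexts built so far
        self.built = 0                  # counter, only here to show laziness

    def get_class(self, name):
        ctx = self._contexts.get(name)
        if ctx is None:
            if name not in self.class_names:
                raise KeyError(name)
            # Build the context from the table rows only when first needed.
            fields = {f: sig for (owner, f, sig) in self.field_rows if owner == name}
            ctx = ClassContext(name, fields)
            self._contexts[name] = ctx
            self.built += 1
        return ctx
```

Classes that are never asked for stay in their table form and cost nothing beyond their rows.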

so, yeah, the upper compiler when compiling an "assembly" may access
existing databases and query contents from them, and may produce a new
database representing the contents of the current assembly, which will be
stored along with the associated object modules.



On Fri, 24 Apr 2009 20:31:08 +0200, Hans-Peter Diettrich <...@aol.com

cr88192 wrote:

I'd use a different frontend (parser...) for each language, each building
a canonical AST, where "canonical" means that the AST is understood by
all following stages.

Problematic are e.g. classes, which have different implementations and
behaviour in C++, Java, .NET etc., so that the generated code for
dealing with an object has to take into account the language's class
model and lifetime rules.

What exactly is "metadata"?

That's a logical extension of the selectable frontend (language...).

The FreePascal compiler has an interesting model for dealing with
different target widgetsets, machines and systems. cpp will have a
similar model (dunno). You can declare abstract classes or interfaces
for your AST nodes and targets, and instantiate the appropriate
object(s) when the language, library type etc. is known.

DoDi

On Sat, 25 Apr 2009 13:15:03 -0700, "cr88192" <...@hotmail.com

"Hans-Peter Diettrich" <...@aol.com
writing 3 parsers would mean maintaining 3 parsers, which is unnecessary
since most of the syntax is common between the languages...

I have mostly been developing a "common superset" approach.

there are actually several different types of classes and structs:
struct/union: good old C struct/union;
struct/union(1): shared between C++ classes/structs, and C# structs
(currently N/A in Java);
class: C#/Java class, '__gc class' (or '__class') in C++ (C++ defaults to
'__nogc class');
interface: C#/Java interface, exported as '__interface' in C++.

1: these use the same tags at present, but are structurally different (using
different tags may be a good idea here, but at present they are recognized
by the structural difference in the ASTs).

there are different flags and flag semantics, which have not as of yet been
addressed.

this area is the point of greatest divergence in the current parsing and
processing logic...
another area is in the handling of namespaces (not fully resolved thus far).

the compiler will presently allow things to be done which are technically
not allowed in the respective languages:
using namespaces as an import mechanism in C++ (though, unless supported
explicitly, this would not allow importing types);
declaration of top-level and namespace-scoped variables and functions in C#;
Java and C# both include a textual preprocessor;
...

metadata here means information which describes things like:
all of the namespaces, classes (and class layouts), interfaces, functions
and signatures, ...

all of this stuff needs to be available for the runtime and compilers to
work properly (in part due to C# and Java not using the "include
teh-crapload of text" approach taken by C and C++...).

it is the same sort of thing which .NET drags along with its assemblies.
in Java (in the "proper"/JVM sense), this info is usually stored in the
class files along with the bytecode.

originally, I had wanted to store all of this in the object files, and so
when linked all this info would be conveniently embedded in the image along
with all the other code and data.

but, as a consequence of certain things being done at link time, and linking
being incremental in my framework, this approach could not be used (the
metadata would then need to be in a form which can be accessed apart from
having to link the image).

note that unlike in a more traditional C++ compile/link process, a lot of
info (such as the physical in-memory layout of objects) is not directly
handled by the compiler, but is instead left to dynamic link-time (OTOH, C
structs/unions are fixed at compile time).
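
As a sketch of what leaving object layout to link time might involve, here is a small Python example. The one-letter signature codes, their sizes, and the natural-alignment rule are assumptions for illustration; nothing here is taken from the actual runtime.

```python
# Hypothetical signature codes and their sizes in bytes.
SIZES = {"b": 1, "s": 2, "i": 4, "l": 8, "d": 8}


def layout(fields, base_size=0):
    """Assign offsets to (name, sig) fields, starting after base_size bytes.

    base_size stands for whatever a superclass or object header already
    occupies. Each field is aligned to its own size (assumed rule).
    Returns (offsets dict, total size rounded up to the largest alignment).
    """
    offsets = {}
    offset = base_size
    max_align = 1
    for name, sig in fields:
        size = SIZES[sig]
        align = size
        offset = (offset + align - 1) // align * align  # round up to alignment
        offsets[name] = offset
        offset += size
        max_align = max(max_align, align)
    total = (offset + max_align - 1) // max_align * max_align
    return offsets, total
```

Because a function like this runs in the linker or runtime rather than in the compiler, a class can gain or reorder fields without forcing every module that uses it to be recompiled.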

I think something like this is likely needed to be able to compile a Java or
C#-like language to native-code object files (either that, or creating a
custom object format which behaves similarly to Java class files, rather
than acting like good old COFF or ELF...).

actually, I could embed a lot of this kind of data in COFF or ELF files via
the use of special purpose sections, but this would require a little work
(and further creation of special linking tools, as almost invariably linking
it with something like GNU-LD would mess everything up...). as is, partial
linking via LD would be allowed (although there is not much reason to do
so...), but at the cost that if the tables are misplaced, it may not be
possible to properly link or load the code...

actually, by the time most of the metadata comes into question, the
compiler is out of the process (the compiler runs, and spews out object code
and tables).

the linker and runtime use this information, but are physically disjoint
from the compiler.
(the compiler may also use some of this info from libraries, but mostly to
answer really basic questions like "is 'Foo' a class?", "what is the type of
Bar.z?", ...).

this issue, however, does make implementing templates/generics look a little
scary (since it is not entirely clear how to instantiate a generic without
having to call back into the compiler, which I regard as ugly...).

but, at least on the upside:
by the time the machinery is in place for instantiating generics, the
machinery will also be in place for handling expression-level eval (at
present, 'eval' can only be done at the module or function level...).

On Mon, 27 Apr 2009 14:08:53 +0200, Hans-Peter Diettrich <...@aol.com

cr88192 wrote:

Parsers usually have to deal with semantics (for disambiguation...) as
well, in particular with context-sensitive C-ish languages.

DoDi

On Tue, 28 Apr 2009 22:46:54 -0700, "cr88192" <...@hotmail.com

"Hans-Peter Diettrich" <...@aol.com
My parser does close to the minimum required to get the code parsed
(it handles declarations and typedefs, but little beyond
this). everything else goes into the AST, which in my case is
represented in an XML-based form (fairly similar to DOM). (I actually
prefer context-independent ASTs, but C can't be parsed in a
context-independent manner).

a lot of the rest of the issues (semantics, ...) are handled by the upper
compiler, which converts the ASTs into the IL (an RPN-based IL I call
RPNIL). the ASTs received by the upper compiler are still mostly language
specific (apart from the large amount of common features which exist between
the languages involved), and the upper compiler is aware which input
language is being used.

RPNIL no longer knows or cares what the input language is, as by this
point the semantics are presumably normalized...
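
To illustrate the general step of lowering an XML-style AST to an RPN-style IL, here is a minimal Python sketch. The tag names (`binary`, `ref`, `int`) and the opcode spellings are invented for this example and are not RPNIL's actual syntax.

```python
import xml.etree.ElementTree as ET


def to_rpn(node):
    """Flatten an expression tree into a postfix (RPN) instruction list."""
    if node.tag == "int":
        return ["push " + node.get("value")]
    if node.tag == "ref":
        return ["load " + node.get("name")]
    if node.tag == "binary":
        left, right = list(node)
        # Postfix order: both operands first, then the operator.
        return to_rpn(left) + to_rpn(right) + [node.get("op")]
    raise ValueError("unhandled AST node: " + node.tag)


# Assumed AST for the expression x + 2 * y.
ast = ET.fromstring(
    '<binary op="add">'
    '<ref name="x"/>'
    '<binary op="mul"><int value="2"/><ref name="y"/></binary>'
    '</binary>'
)
rpn = to_rpn(ast)  # ["load x", "push 2", "load y", "mul", "add"]
```

Once the expression is in this form, nothing in the instruction list says whether it came from C, Java, or C#, which matches the point that the IL is language-neutral.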