Kevin Atkinson
kevina at cs utah edu
MyLang will be a system programming language designed primarily to replace C and C++, but it will also be powerful enough to replace Ada, Fortran, Java, and C#. MyLang is a temporary name until a better one is decided on.
C is an old language with plenty of limitations and flaws; however, it is still used by a large number of people. C++ was designed as an improvement on C, but it is rather ugly and few programmers fully understand its rules, so it is not used by as many people as it could be. System programmers, for example, often avoid it because they are afraid that it may do things, such as allocate dynamic memory, without them realizing it. MyLang aims to be a much better tool for system programming than C while still giving the programmer complete control of what is going on when needed.
Macros are looked down upon by many language designers, to the point that most new languages do not provide them in any fashion. This is because the only macro system many programmers are familiar with is the C preprocessor, whose macros are extremely limited and error prone. Even with these limitations, however, they are heavily used by C programmers, as a macro is often the easiest way to get a job done. Higher level language constructs avoid the need for macros in many cases but do not eliminate them. The fact of the matter is that macros are a very powerful tool that should not be overlooked. MyLang will have a powerful macro system which will avoid the many pitfalls of the C preprocessor.
Many new languages focus on safety above all else. As a result, such languages can never truly replace C, as they can never be as efficient as C, in both space and speed, in all cases. In order for a language to truly replace C, safety should be provided, but when speed or space is important the programmer should be allowed to do unsafe things. MyLang will be one such language.
Most languages are designed around a particular programming paradigm. Some languages, such as Java, force everything into this model, even when it is ill suited to the problem. Not only will MyLang support multiple programming paradigms, it will allow new ones to be created.
The major aim is to be a practical language which allows the programmer to get the job done with a minimal amount of effort. However, the language should also be flexible and expressive, and should not force the programmer to think about a problem in an unnatural way. If the current language constructs are ill fitted to the problem at hand, the language should allow new ones to be defined.
The language should be safe by default but it should still allow a programmer to do something unsafe provided that they know what they are doing. It should not be possible to do something unsafe unintentionally.
MyLang will not be designed around a particular paradigm, but will support as many paradigms as possible, and quite possibly allow other ones to be invented.
MyLang will incorporate as many useful features as possible. Features will not be rejected because:
MyLang should be safe by default but it should still allow a programmer to do something that has the potential of being unsafe, provided that they know what they are doing and are sure it is safe. It should not be possible to do something unsafe unintentionally. For example, all arrays should have bounds checks by default, but a programmer should be able to disable the checks. With garbage collection it should not be necessary to manually free memory, but a programmer should still be allowed to do so. Code that does unsafe things will be labeled as ``unsafe''. However, a programmer, after carefully reviewing the code in question, should be able to declare it as safe. Thus an individual component can do unsafe things but still be considered safe to use.
However, the language should NOT be designed so that it is necessary to do unsafe things, unless there is no way around it. One example of this is when interfacing with low level hardware. Other than when it is strictly necessary, the main reason to do unsafe things is performance. For example, bounds checking can be a serious bottleneck in an inner loop. In the simple case the compiler may be able to eliminate the checks, but there will always be cases where the programmer knows an access is safe but the compiler cannot prove it, because the programmer has more domain knowledge than the compiler.
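As a rough illustration of the safe-by-default idea in present-day C++ (this is not MyLang syntax), the sketch below uses a hypothetical CheckedArray wrapper. Its ordinary subscript is bounds checked, while a deliberately named unchecked accessor lets the programmer skip the check where the loop bound already guarantees safety. The class and function names are assumptions made for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical wrapper: indexing is checked unless the caller opts out.
class CheckedArray {
public:
    explicit CheckedArray(std::size_t n) : data_(n, 0) {}

    std::size_t size() const { return data_.size(); }

    // Safe default: throws std::out_of_range on a bad index.
    int& operator[](std::size_t i) { return data_.at(i); }

    // Explicit opt-out: no check in release builds. The assert only
    // documents the caller's promise and vanishes when NDEBUG is defined.
    int& unchecked(std::size_t i) {
        assert(i < data_.size());
        return data_[i];
    }

private:
    std::vector<int> data_;
};

// The loop bound already guarantees i < size(), so the checks would be
// redundant here and the unchecked accessor is used on purpose.
inline int sum(CheckedArray& a) {
    int total = 0;
    for (std::size_t i = 0; i < a.size(); ++i) total += a.unchecked(i);
    return total;
}
```

The unsafe operation is visible at the call site, which is the property asked for above: it cannot be reached by accident through the ordinary subscript.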
Very few languages have attempted this middle ground. C# does to some extent with C style pointers, but it is not very well developed.
MyLang will be designed around Compile Time Functions (CTF). This will keep the core language (i.e. what the compiler has to be able to handle) as simple as possible. CTF will be used to define almost all of the user visible language constructs. The use of CTF will allow users to design their own language constructs.
Due to the minimal nature of the core language, when I talk about MyLang I will be referring to features of the core language, those provided by CTF, and the standard libraries.
MyLang will have a simple C like syntax and an advanced type system. Type inference will be used on local variables. MyLang will be statically typed by default but dynamic typing will be available when desired. Garbage collection will also be available but it does not have to be used as many simple programs simply don't need it.
Other things which will be kept in mind when designing MyLang include:
The rest of this paper will detail various aspects of MyLang. MyLang is a work in progress and this paper is by no means meant to be a complete specification of the language. It will focus on key points which I think are important. Some sections simply mention key features that I would like to see in MyLang without any additional information. The last section, section 9, is a disorganized list of notes that don't fit anywhere else.
The top level grammar is extremely simple
and that's it. All other language constructs are defined as specialized expressions.
But the grammar alone is rather suggestive. For one thing, everything is an expression, and separate expressions are separated by a ';'. It also suggests that expressions are grouped by an O/C pair and not by begin and end clauses. Putting these two ideas together suggests a syntax such as
Notice the ';' after each block. At first glance the ';' seems unnecessary. But without it the parser, given the grammar above, will not know whether the c = b after the } is part of the if expression or not. From context it is obvious that it is not; however, without knowing the meaning of if it is not so obvious. The idea was to avoid having to know any context in order to be able to separate expressions. This will allow a large degree of freedom in how an expression can be defined. In fact it will allow users to define their own expressions.
The core language will be as simple as possible. Everything else is defined as specialized expressions. Moreover, builtin expressions will generally not be used by the end user. Instead macros will transform ``standard expressions'', provided by the default library, into builtin ones.
Examples of builtin expressions
Compile Time Functions (CTF) will be an integral part of the language. In short, CTF are functions which are executed at compile time rather than run time. They will be very similar to Lisp macros except that they are even more powerful, and not always textual expansions. CTF will generally be referred to as macros throughout this text even though they are not always expansions. For more info on CTF see section 5.1.
There are no predefined statements in my new language meant for the end user. Instead a set of standard statements and expressions will be provided in the default library. In order for a program to be considered MyLang it must use the default libraries.
By default everything is case insensitive, unlike most other languages. However, certain identifiers can be made case sensitive when it is desirable to use case to distinguish between two identifiers.
The standard control flow statements will be provided such as if/then/else, while, switch and the syntax will be very similar to those of C++ except for an extra semicolon at the end (see 2.1 for why this is necessary).
Variables are prefixed with ``var'', functions with ``fun'', and truly const variables with ``const''. Types now come after variables. Example conversions from C++ to MyLang syntax:
A const is like a variable except that it is more of a binding than a variable. Its value cannot be changed. It is similar to a const variable in C++ except that its address cannot be taken. Read only variables can also be defined, which are more like the const variables used in C++. The syntax for a const is the same as for var except that const is used:
Functions and consts can be defined in any order and cannot be redefined. Furthermore, a const can only be defined from other consts or ``pure'' functions, that is, functions whose output depends only on the input and which do not modify any non-local memory.
Types will use ML style syntax:
possible syntax for new type:
Like ML and most other functional languages, MyLang will have a tuple type. Tuples are a special type of struct whose members are numbers
There will be two types of pointers, ones which only point to an object, and ones in which pointer arithmetic will be allowed. The syntax will be something like:
No "->". The dot ('.') operator is always used to dereference objects. It does not matter if it is a pointer or the actual object.
Arrays will also be provided; however, the array subscript operator can also be multi-dimensional
No comma operator, instead use {}. Unlike C++ blocks can be treated as expressions. The last expression evaluated in the block is the return value:
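C++ has no block expressions, but an immediately invoked lambda comes close and may help picture the idea. The sketch below is only an analogy in C++, not MyLang syntax, and the function name clamp_to is made up for the example.

```cpp
#include <cassert>

// Clamp v into [lo, hi]. The initializer of `result` is a whole block
// whose value is whatever the block returns; the trailing () runs it
// immediately, so the block behaves like an expression.
inline int clamp_to(int v, int lo, int hi) {
    assert(lo <= hi);
    const int result = [&] {
        if (v < lo) return lo;
        if (v > hi) return hi;
        return v;
    }();
    return result;
}
```

In MyLang the lambda wrapper and the explicit return would not be needed, since the block itself would yield its last expression.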
Goto is still allowed. Labels, however, are local to the innermost block, like variables, as opposed to C++ where they are local to a function.
Labels can also be used for blocks so that they can be used to break out of multiple loops at once by breaking to the label.
A better switch syntax will be provided. It will at least allow ranges and avoid the need for break.
Possibly provide Perl style ifs ``x = 20 if ....''.
``int / int'' returns a rational, not an integer. It can be truncated to an integer however. This way 1/2 will work as expected.
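A minimal sketch of what exact integer division could look like, written in C++17. The Rational struct and the function names are hypothetical and only illustrate the behaviour described above; they say nothing about how MyLang would represent rationals.

```cpp
#include <cassert>
#include <numeric>   // std::gcd (C++17)

// Minimal rational number kept in lowest terms with a positive denominator.
struct Rational {
    long num;
    long den;
};

// Exact division of two integers; the divisor must be non-zero.
inline Rational divide(long a, long b) {
    assert(b != 0);
    long g = std::gcd(a, b);   // gcd of the absolute values, > 0 since b != 0
    if (b < 0) g = -g;         // move the sign into the numerator
    return Rational{a / g, b / g};
}

// Explicit truncation toward zero, which is what C++ integer division does.
inline long to_int(Rational r) { return r.num / r.den; }
```

With this, divide(1, 2) keeps the value one half, and the programmer has to ask for to_int explicitly to get the C behaviour of 0.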
Are short cut operators with Perl like semantics.
A very common programming mistake is using = instead of ==. We think of the two as being the same, but in fact they are two very different operators. Some languages use := for assignment; however, assignment is used more often than comparison in most programs, so it makes sense to change the comparison operator rather than the assignment operator. It is also possible to allow assignment only in certain places and comparison in others, so that the same operator can be used for both, but this can limit the expressiveness of the language. One solution I thought of is to allow assignment only at the beginning of a statement. Things like ``while ((x = next()) != 0) ...'' can then become ``while ({x = next(); x != 0}) ...'' (as {} are now treated as expressions). However, that raises the question of what a statement is and what makes a statement different from an expression. Another solution is to make the return type of the assignment operator void so that ``if (x = 5) ...'' is not valid, but that would also prevent ``x = y = 5'', which is sometimes useful. There is no easy answer and I am not sure how I will handle it.
MyLang will have a powerful, flexible, and precise type system.
In MyLang everything is an object with a specific type. Variables are special objects that can be assigned to, functions are objects that can be called, etc. Objects can have sub-objects, which are generally accessed via the dot operator, but not always, as the dot operator for an object can be defined to do anything.
Types for local variables generally will not need to be specified, instead the type is implied using simple and easy to understand inference rules like ML.
There will be one int type. However the range, size, and overflow behaviour can be modified to provide other integer types.
I am considering making the default range for an int only 24 bits (i.e. [-2^24+1, 2^24-1]) to leave 8 bits for the compiler to use for whatever it needs. Larger values will be undefined. I am not sure if this is necessary or even a good idea. Unsigned types may also be limited to the positive values representable by a signed int of the same size, so that comparisons between an unsigned and a signed value are always safe.
A basic integer can be modified by the use of type attributes. These attributes are: size (in bits), unsigned|signed, range, and overflow mode, which is one of undefined, modular (wrap around), or saturated.
Typedefs will be provided such as
byte = 8 bits, unsigned, modular
short = 16 bits, signed, modular
u8 = byte
i8 = 8 bits, signed, modular
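To make the three overflow modes concrete, here is a small C++ sketch for an 8 bit unsigned integer. The function names are hypothetical; in MyLang the mode would be an attribute of the type rather than a choice of function.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Modular (wrap around): the operands are promoted to int, and converting
// the sum back to uint8_t reduces it modulo 256.
inline std::uint8_t add_modular(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a + b);
}

// Saturated: the result sticks at the largest representable value.
inline std::uint8_t add_saturated(std::uint8_t a, std::uint8_t b) {
    const int wide = int{a} + int{b};   // cannot overflow: at most 510
    const int max = std::numeric_limits<std::uint8_t>::max();
    return static_cast<std::uint8_t>(wide > max ? max : wide);
}

// "Undefined": overflow is the caller's bug. Here it is only caught by an
// assert, which disappears when NDEBUG is defined.
inline std::uint8_t add_unchecked(std::uint8_t a, std::uint8_t b) {
    assert(int{a} + int{b} <= 255);
    return static_cast<std::uint8_t>(a + b);
}
```

For example, 250 + 10 gives 4 in modular mode and 255 in saturated mode, and is simply not allowed in the undefined mode.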
Characters are not integers but an enumeration.
Strings are an array of characters but will be much more powerful than C strings and an integral part of the language.
A raw memory type is designed for dealing with blocks of raw memory. It is something like ``void *'' with the extension of allowing pointer arithmetic. This type has special type conversion rules similar to ``void *'' but acts more like an ``unsigned char *''.
Two different types can not alias each other unless they both are also aliasing a raw memory type.
Types in MyLang can be modified in several ways.
Types can be restricted in how they can be used. For example, the const modifier can be used to make a type read only. For integer values the range can be restricted. A less restricted type can be implicitly converted to a more restricted type but not vice versa. For example, a non-const object can be converted to a const object but not vice versa, and an int with a range of [1,10] can be converted to an int with a range of [1,200] but not vice versa.
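The range rule can be approximated with C++ templates. In the sketch below, Ranged is a hypothetical class invented for this example: a value whose range fits inside the target range converts implicitly, and any other conversion fails to compile.

```cpp
#include <cassert>

// Hypothetical integer type restricted to the closed range [Lo, Hi].
template <int Lo, int Hi>
class Ranged {
    static_assert(Lo <= Hi, "empty range");
public:
    // Construction from a plain int is explicit; the range is checked at
    // run time by an assert (debug builds only).
    explicit Ranged(int v) : value_(v) { assert(Lo <= v && v <= Hi); }

    // Implicit conversion from a range that fits entirely inside this
    // one; anything else is rejected at compile time.
    template <int L2, int H2>
    Ranged(Ranged<L2, H2> other) : value_(other.get()) {
        static_assert(Lo <= L2 && H2 <= Hi, "source range does not fit");
    }

    int get() const { return value_; }

private:
    int value_;
};

using Small = Ranged<1, 10>;
using Large = Ranged<1, 200>;

// Accepts a Large; a Small argument converts implicitly.
inline int twice(Large x) { return 2 * x.get(); }
```

Passing a Small to twice compiles, while passing a Large to a function expecting a Small would trip the static_assert.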
The behaviour of a type can also be modified. For example for integers the overflow mode can be changed from undefined to either wrap-around or saturated.
Since everything is an object there is nothing special about a class or a struct; they are just objects with specific sub-objects.
MyLang will provide an enumeration type which is much more powerful than those provided in C or C++:
For an extended enum the user can specify how it is packed by providing functions to pack and unpack the structure (which includes recognizing which enum it is). This can be useful if the layout needs to match some ABI. If layout functions for an extended enum are not provided, the compiler will generate them using macros.
Basic pattern matching on extended enum types may also be possible.
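The closest thing in standard C++17 to an enum whose members carry data is std::variant, with std::visit standing in for pattern matching. The sketch below is an analogy only; the Shape type and its alternatives are made up for the example and do not reflect MyLang syntax.

```cpp
#include <cassert>
#include <type_traits>
#include <variant>

// Each alternative plays the role of one enum member with its own payload.
struct Empty  {};
struct Circle { double radius; };
struct Rect   { double width, height; };

using Shape = std::variant<Empty, Circle, Rect>;

// "Pattern match" on the active alternative. For each alternative type
// only the matching branch of the if constexpr chain is compiled.
inline double area(const Shape& s) {
    return std::visit([](const auto& alt) -> double {
        using T = std::decay_t<decltype(alt)>;
        if constexpr (std::is_same_v<T, Circle>)
            return 3.141592653589793 * alt.radius * alt.radius;
        else if constexpr (std::is_same_v<T, Rect>)
            return alt.width * alt.height;
        else
            return 0.0;
    }, s);
}

// Small usage example.
inline void shape_example() {
    Shape s = Rect{2.0, 3.0};
    assert(std::holds_alternative<Rect>(s));
    assert(area(s) == 6.0);
}
```

Here the library decides how the tag and payload are stored; the pack and unpack functions described above would give the user that control instead.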
Each sub-object of an object can be static, inner, or static inner (with the default being static inner). An inner object knows about its parent: it maintains a pointer back to its parent, much like Java inner classes.
A static inner class does not maintain a pointer back to its parent; instead, the parent is provided automatically. It is an error to call a static inner class method without giving the compiler a way to figure out the outer class in the expression. For example:
It is possible to precisely control how an object or sub-object behaves. For example, an object may act like a variable but not have any storage associated with it. Functions are used to provide an actual value for the object or to allow it to be assigned to.
Any type can be opened, which means that all of its sub-objects are directly accessible. For example, if the object ``X'' is opened then instead of using ``X.foo'' you can just use ``foo''. Inside class members C++ ``opens'' the class so that the class members can be accessed without using this. It is the same basic idea but a lot more flexible. Given two objects X and Y, if Y is a sub-object of X then X can open Y so that Y's sub-objects can be directly accessed from X. For example, instead of using ``X.Y.foo'' you can just use ``X.foo''. When a class is inherited in C++ the members of the parent class are ``opened'' so that they can be accessed directly via the child class.
For any object a:
might end up calling huge_object's destructor and let f write directly to huge_object, therefore making it as efficient as if huge_object was passed by reference. Thus these special methods may be called at unpredictable places. Therefore they should only do what they are designated to do and not other weird things.
Other examples of things the compiler is allowed to do:
It should be possible to precisely define how types can be converted. For example, after saying that an int can convert to a double, this new double may be converted again. C++ does not allow multiple conversions. These rules should be specified in ``src -> dest'' form and not in the form of type conversion operators or single parameter constructors as in C++, although certain constructors may implicitly add type conversion rules.
Extended enums (see 3.6), virtual functions, boxed types, etc. are really all the same concept, "run time type identification" (RTTI). They should all be merged into a unified concept with only syntactic sugar separating them. Basically they specify what overloaded function to use. RTTI is very similar to providing function pointers (or more powerful closures) for each operation performed on the type, so that should be tied into the same framework as well. Some of the elements of this common framework include:
Functions are only used when needed, so not all of them necessarily have to be defined. It is a compile time error to compare typeids of different base classes.
Both can be implemented in terms of each other, but one or the other must be implemented. In C++, if both are implemented in terms of the other, this will cause infinite recursion. But this can be checked at compile time, so the constraints should be something like:
Constants can be overloaded based on type for example
Which also allows
As stated previously everything is an object. To the compiler there is no real distinction between basic types such as integers and aggregated types such as arrays, and structs, or more high level types such as classes.
Every object type has the following members:
Sub-objects can be accessed via the builtin ``element(info, obj, id/num)'' and info on a sub object can be accessed via ``element_info(info, id/num)''.
All higher level structures are created from the basic low level objects via macros. MyLang will not provide native support for anything but the low level object, including inheritance as there is no need to. For example a C structure can be created something like:
With a little more code, more advanced types can be created.
The dot operator allows unlimited freedom in how members can be accessed. There is no reason that the dot operator can only be used to access sub-objects. For example, to implement simple non-virtual inheritance or anonymous structures it is necessary to also access a sub-object's sub-objects via the dot operator:
The dot operator can also be used for adding methods to an object by returning a function instead of a sub-object, among other things.
Since it is possible to get information on an object's sub-objects, it is possible to create generic code that performs an operation on all of its sub-objects, such as printing them. This is something that is impossible to do in C++.
The basic syntax for a function will be something like:
The second tuple is the return type. The ones before the ';' cannot be ignored; the ones after it can. The variables are named so that they can be referred to in the function. Returned objects are not copied. They are allocated directly on the stack of the calling function, which makes it okay to return huge values.
Parameters to functions are passed by:
Functions can have pre and post conditions which can be checked at run time. A compiler can also use these conditions to optimize better.
Attributes are a special form of pre/post conditions. For example, a "pure" function is one that takes parameters only by value, modifies no external state, and returns a value. The output should depend only on the input or global variables. If global variables are used, the function is annotated with the global variables it uses. A pure function can only call other "pure" functions; this will be enforced by the compiler's type checking system.
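In today's C++ the nearest approximations are assertions for pre and post conditions and constexpr for functions that can be evaluated from their inputs alone. The sketch below uses them to show the idea; constexpr is weaker than the "pure" attribute described above (it does not stop a function from reading global constants, for instance), and the function names are invented for the example.

```cpp
#include <cassert>

// Takes its parameter by value, touches no external state, and its result
// depends only on the input. constexpr lets the compiler evaluate it at
// compile time when the argument is a constant.
constexpr long long square(long long x) { return x * x; }

// Integer square root with its contract written as assertions.
// Precondition:  n >= 0
// Postcondition: r*r <= n < (r+1)*(r+1)
inline int isqrt(int n) {
    assert(n >= 0);                                 // precondition
    int r = 0;
    while (square(r + 1) <= n) ++r;
    assert(square(r) <= n && n < square(r + 1));    // postcondition
    return r;
}

// A call with a constant argument is checked by the compiler itself.
static_assert(square(12) == 144, "evaluated at compile time");
```

A compiler that understood the contract could, as suggested above, also use it for optimization, for example by assuming the result of isqrt is never negative.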
Like C++, functions can be overloaded based on their parameters, but unlike C++ they can also be overloaded based on the return type. To avoid extreme confusion, functions that are overloaded by return type should essentially do the same thing, but perhaps just return the result in a slightly different way.
When an exact match is not found for any given function then the compiler may do one of three things
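C++ cannot overload on return type directly, but the effect can be imitated with a proxy object whose conversion operators are chosen by the type the caller asks for. The sketch below shows that workaround; Parsed and parse are hypothetical names, and the trick is only an approximation of what MyLang would offer natively.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Proxy returned by parse(); the conversion that runs is selected by the
// type of the variable being initialized.
class Parsed {
public:
    explicit Parsed(std::string text) : text_(std::move(text)) {}

    operator int() const { return std::stoi(text_); }
    operator double() const { return std::stod(text_); }
    operator std::string() const { return text_; }

private:
    std::string text_;
};

inline Parsed parse(const std::string& text) { return Parsed(text); }

// Usage: the same call yields a different type depending on context.
inline void return_overload_example() {
    int i = parse("42");
    double d = parse("2.5");
    std::string s = parse("abc");
    assert(i == 42);
    assert(d == 2.5);
    assert(s == "abc");
}
```

All three conversions do essentially the same job, which matches the guideline above that return type overloads should differ only in how the result is delivered.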
Two types of nested functions:
Closures will make a copy of any variables needed from the local environment and will not go out of scope.
The ones before the ';' are variables from the local environment; a copy will be made when the closure is created. The ones after the ';' are the normal function parameters. Functionally a closure is equivalent to (in C++)
but a lot more convenient.
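For readers who want to see the C++ side of that comparison spelled out, here is one way a hand-written function object with a copied variable might look, next to the lambda form that later C++ standards added. The names AddOffset and make_adder are assumptions made for this sketch.

```cpp
#include <cassert>

// Hand-written closure: the captured variable becomes a data member that
// is copied when the object is created; the call operator takes the
// ordinary parameters.
class AddOffset {
public:
    explicit AddOffset(int offset) : offset_(offset) {}
    int operator()(int x) const { return x + offset_; }
private:
    int offset_;
};

// Returning the object is safe because it owns its copy of `offset`.
inline AddOffset make_adder(int offset) { return AddOffset(offset); }

// The same thing as a lambda that captures by copy.
inline auto make_adder_lambda(int offset) {
    return [offset](int x) { return x + offset; };
}

// The copy is taken at creation time, so later changes are not seen.
inline void closure_example() {
    int k = 1;
    AddOffset f(k);
    k = 100;
    assert(f(1) == 2);
}
```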
A closure may be optimized as a nested function if the compiler can be sure that a reference to the function will not be used when the closure goes out of scope.
Closures can also be created by partially calling a function. For example: ``return fun(x,,y)'' will return a closure which takes a single parameter.
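As a sketch of what such a partial call amounts to, the C++ below fixes the first and third arguments of a three parameter function and leaves the middle one open. The function digits and the helper fix_outer are hypothetical names used only for illustration.

```cpp
#include <cassert>

// An ordinary three-parameter function whose result shows which argument
// ended up in which position.
inline int digits(int a, int b, int c) { return a * 100 + b * 10 + c; }

// Fix the first and third arguments and leave the middle one open,
// roughly what ``digits(x,,y)'' would produce.
inline auto fix_outer(int x, int y) {
    return [x, y](int middle) { return digits(x, middle, y); };
}

// Usage: the closure takes the single remaining parameter.
inline void partial_call_example() {
    auto f = fix_outer(1, 3);
    assert(f(2) == 123);
}
```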
It should be possible to be able to prevent certain functions from calling other functions:
B should not be allowed to call A but A should be allowed to call B, and other blocks can only call A.
Be able to provide multiple versions of the same function. With certain parameters one version will be called, with others a different version. This will allow specialization to optimize a function for a common set of parameters. Perhaps the compiler can decide when to do this if it can determine that it will be beneficial.
Allow the calling conventions of functions to be changed, when the function is not an external one, in order to make function calls cheap. For example, if only one function calls another one but does so multiple times, the calling convention of that function could be changed to avoid unnecessary overhead, such as pushing parameters on the stack or shuffling registers around.
Compile time functions (CTF) are functions that are executed at compile time rather than run time. They are similar to preprocessor macros (i.e. ``#define'' in C and C++) but are a lot more powerful and less error prone. Unlike preprocessor macros, CTF are more than simple expansions. CTF are written in MyLang itself and thus have the complete language at their disposal rather than a limited set of operators as preprocessor macros do. Unlike preprocessor macros, CTF also obey namespace rules so they can be defined locally without polluting the global namespace.
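C++ constexpr functions give a small taste of this idea: ordinary code in the host language that the compiler runs during compilation and that obeys the normal scoping rules. They are far more limited than the CTF envisioned here (they cannot define new language constructs, for instance), so the sketch below is only an analogy, and bits_needed is a made-up example.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Ordinary C++ code that the compiler can run during compilation.
constexpr std::size_t bits_needed(unsigned long value) {
    std::size_t bits = 0;
    while (value != 0) {
        value >>= 1;
        ++bits;
    }
    return bits;
}

// Checked by the compiler, before the program ever runs.
static_assert(bits_needed(255) == 8, "255 fits in 8 bits");
static_assert(bits_needed(256) == 9, "256 needs 9 bits");

// The result can size an array, which a run-time call could not do.
struct Flags {
    unsigned char storage[bits_needed(1000)];   // 10 elements
};

// The same function remains callable with run-time values.
inline void ctf_example(unsigned long v) {
    assert(bits_needed(v) <= sizeof(v) * CHAR_BIT);
}
```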
CTF functions will be triggered based on pattern matching. This will allow new language constructs to be defined with CTF.
The basic MyLang macros will be ones that expand to a list of tokens or a string to be used in place of the function. However, unlike C macros these will be written in MyLang. As with all CTF they will obey namespace rules. In addition they will not be able to expand to just anything. They must evaluate to a valid grammatical element or a list of them.
For example:
These macros will also have access to special functions to create new structures and the like. See section 3.13 (Low Level Objects) for an example of these special functions.
This type of CTF essentially offers the same power that Lisp macros do. However, since MyLang will have syntactic closures (as in Scheme) the problem of unintentional variable capture will be avoided.
Another type of macro are those that simply expand into another set of tokens or a string. Like the basic macros these can only expand into valid grammatical elements. An expansion can be recursive, thus avoiding the need for any sort of loops. On the surface these macros are similar to preprocessor macros, but since they obey namespace rules and can only expand into valid grammatical elements they are a lot safer. They are also more elegant as they avoid the need to explicitly generate code.
Unfortunately they are not as powerful as the basic macros, and thus do not avoid the need for them. Perhaps basic macros and pattern expansion macros can somehow be combined into one. For example, a macro can by default be a pattern expansion but also allow a special syntax to be used when code is needed.
A more advanced type of macro that MyLang will support will directly manipulate the compile time environment via API calls. For complex tasks this type of CTF may be cleaner and less error prone. However, supporting them means that a stable API must be developed.
For example, a CTF that creates a new object type might look something like:
And one that prints a structure out might look like:
An even more advanced type of macro will be ones that manipulate the compile time environment and are executed as if they were executed at run time. These CTF can contain code that depends on both the compile-time and run-time environment. This type of CTF will be the most natural to write because the user does not have to worry about the separation of the compile-time and run-time environment.
For example one that prints a structure out might look like (notice the lack of the add_code function)
Unfortunately these will be the most difficult to implement, in particular because MyLang is meant to be compiled and not interpreted. If implemented, there will have to be some restrictions on what these functions can do. For example, it will be difficult to support expressions that depend on both the run-time and compile-time state. If the expressions also modify the compile-time state then they will be virtually impossible to support.
CTF can also become ``ordinary'' functions. For example, consider the macro ``OR(x,y)'' which will return true if x or y is true but will only evaluate y if x is false. When called directly the macro will be used, but when it needs to be treated as an ordinary function the function ``OR_f(x,y) {OR(x,y)}'' will be used. Naturally this new function loses its special ability to avoid evaluating y, but it can now be passed to functions expecting another function as a parameter. Of course, this trick will not work for all macros. If an attempt is made to use such a macro as a function, the compiler will report an error.
Since CTF are executed at compile time rather than run time, the parameters they take will be slightly different. In particular, the type of an expression will generally not be known unless it is a constant. Thus, most expressions will be passed in as strings, and the result of an expression is not known by the function. However, it is also possible to pass in compile time constants that the CTF can use; the most common type will be an integer constant, but other types are possible. CTF may also take in compile time objects so that they can get information about and manipulate the compile time environment. Thus, the types of parameters a CTF can take will be something like:
MyLang will also have general support for compile time expressions which are not necessarily functions. Preprocessor ``#if'' directives are a good example of this. Of course they will be written in MyLang, but will have special syntax to indicate that the expression is to be evaluated at compile time rather than execution time. If I can figure out how to implement pseudo-CTF then this distinction might not even be necessary.
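The difference between the macro and its function wrapper can be demonstrated with the C preprocessor, which is of course the very macro system MyLang wants to improve on; it is used here only because it makes the evaluation order easy to observe. The helper names counted_false and apply are assumptions made for this sketch.

```cpp
#include <cassert>

// Macro form: `y` is only evaluated when `x` is false, because the
// expansion uses the built-in short-circuit operator.
#define OR(x, y) ((x) || (y))

// Function form: both arguments are evaluated before the call, so the
// short-circuit behaviour is lost, but the function can be passed around.
inline bool OR_f(bool x, bool y) { return OR(x, y); }

// Helper that records how many times it has been evaluated.
inline bool counted_false(int& calls) { ++calls; return false; }

// Applies any two-argument boolean function; a macro cannot be passed here.
inline bool apply(bool (*op)(bool, bool), bool a, bool b) { return op(a, b); }

inline void or_example() {
    int calls = 0;
    bool a = OR(true, counted_false(calls));     // macro: y is skipped
    assert(a && calls == 0);
    bool b = OR_f(true, counted_false(calls));   // function: y is evaluated
    assert(b && calls == 1);
}
```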
Exceptions are very useful, and will definitely be included but they have problems. MyLang if possible will try to avoid these problems.
One major problem with exceptions is that they can sometimes mask errors. An unexpected exception thrown by some function is passed through to a caller which is not expecting it, so the caller also passes it on, even though it is not supposed to throw that exception (this is usually not specified explicitly, as is generally the case in C++ where exception specifications were an afterthought). The exception keeps propagating until it reaches some function which is expected to throw that exception "itself", but is not expecting it from any of the functions it calls. It then gets handled, when it shouldn't be.
Having to always handle all exceptions is very annoying especially when something unexpected happens (like maybe its working directory was deleted) and the best course of action is to abort as there is nothing useful it can do. (Well, maybe it can attempt to quit cleanly, but I account for that by having a special class of exceptions for the truly unexpected).
MyLang will likely have two classes of exceptions: Exceptions and Errors.
Functions must specify what exceptions they will throw. If a function attempts to throw an exception not specified then the exception will turn into an error. Errors can be handled via the try, throw block OR via registered functions, much like signal handlers. In fact POSIX signals will generate exceptions which, if not caught, will turn into "errors". (The name is not the best; how about "unexpected"?)
This is a compromise between Java, where all exceptions must be handled, and C++, where exception specifications were an afterthought.
Certain types of Exceptions can also return and/or can be thrown asynchronously. This will allow exceptions to be able to handle signals elegantly.
Exceptions can also be turned off for a region of code. If any function throws an uncaught exception it will either cause an abort or be postponed, depending on the nature of the exception.
If possible, exceptions will be implemented so that the exception object is NEVER copied. Exceptions can also refer to local objects on the stack, provided that the local objects used are marked so that the compiler knows not to call the destructor or overwrite them. This will allow exceptions to be implemented extremely efficiently. In fact they may even be more efficient than returning an error code.
No user written header files. That job is up to the compiler.
Very precise dependencies, which will greatly cut back on unnecessary recompiles. A dependency header file is written for each object file. It will describe precisely what symbols the object file uses and how. For example, given the struct A {int x; int y; int z;}, if the object file only refers to z then the dependency info will say that object file X depends on symbol z in struct A, which it expects to be an int with an offset of 8. If an object file only creates a new A but never uses it then the dependency info will say that object file X creates a new A, which it expects to be of size 12 and without non-trivial constructors.
C++ (and to some extent C) programmers spend a good deal of effort designing their interfaces to minimize recompiles. Sometimes C++ programmers will go to great lengths to avoid exposing the implementation so that the entire project does not need to be recompiled because of the addition of a private helper function. This is because C++ requires way too much information in the header file due to the class syntax. (See C++ FAQ Lite).
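The facts such a dependency record would carry can be expressed in C++ with offsetof, sizeof and a type trait. The sketch below is only an illustration of the check, not a proposal for the file format; the constant and function names are hypothetical, and the numbers 8 and 12 assume a 4 byte int with no padding, which is common but not guaranteed by the language.

```cpp
#include <cassert>
#include <cstddef>       // offsetof
#include <type_traits>

struct A { int x; int y; int z; };

// What an object file that only reads A.z, or only creates an A, was
// compiled against (assumes a 4-byte int and no padding).
constexpr std::size_t expected_offset_of_z = 8;
constexpr std::size_t expected_size_of_A = 12;

// True when the current definition of A still matches those expectations,
// in which case the dependent object file need not be recompiled.
constexpr bool layout_matches() {
    return offsetof(A, z) == expected_offset_of_z
        && sizeof(A) == expected_size_of_A
        && std::is_trivially_constructible<A>::value;
}

// A build tool could run a check of this kind before reusing the object.
inline void require_layout() { assert(layout_matches()); }
```

Adding a member function to A or renaming x would leave all three facts unchanged, so under this scheme neither kind of client would be rebuilt.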
By automatically generating header files and emitting very precise dependency information the programmer will never have to worry about this anymore. Programmers can design the interface without worrying about what will go in the header files...
The standard library should at least provide
Provide vector types which are mini arrays. Their main purpose is to make it easy to take advantage of the vector based instructions of the hardware. An array of X can be implicitly converted to a vector of X.
The user specifies the vector type, or the compiler can decide the best size based on the current instruction set.
May also support layout rules where { } and ; are placed implicitly, like Haskell. Layout rules WILL be context sensitive out of necessity.
Perhaps make it possible to force a variable into a lower scope.
The compiler is allowed to rearrange the storage of the variables on the stack in order to allow variables of a lower scope to appear anywhere in the statement. Of course there will be restrictions. Might not be possible, but worth pursuing.
Also, it might not be a good idea in the first place due to readability problems.
Also allow for a special grouping syntax where ALL variables are in the lower scope. Useful in for loops etc. This is a must.
Syntax Maybe (exp; exp;...)
Possibly provide a very powerful looping construct which has 7 parts: init, preinc, pretest, body, posttest, postinc, final. Implemented something like:
The question is what should the syntax for such a beast be?
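Leaving the syntax question aside, the semantics of such a loop can be sketched as a C++ function template taking the seven parts as callables. The order in which the parts run below is my assumption of one reasonable reading, and general_loop and sum_below are hypothetical names.

```cpp
#include <cassert>

// One possible reading of the seven parts; the ordering is an assumption.
template <class Init, class PreInc, class PreTest, class Body,
          class PostTest, class PostInc, class Final>
void general_loop(Init init, PreInc preinc, PreTest pretest, Body body,
                  PostTest posttest, PostInc postinc, Final final_part) {
    init();
    for (;;) {
        preinc();
        if (!pretest()) break;
        body();
        if (!posttest()) break;
        postinc();
    }
    final_part();
}

// A C-style `for (i = 0; i < n; ++i) sum += i;` expressed with it.
inline int sum_below(int n) {
    int i = 0, sum = 0;
    general_loop(
        [&] { i = 0; },            // init
        [] {},                     // preinc (unused)
        [&] { return i < n; },     // pretest
        [&] { sum += i; },         // body
        [] { return true; },       // posttest (always continue)
        [&] { ++i; },              // postinc
        [] {});                    // final (unused)
    return sum;
}

inline void loop_example() {
    assert(sum_below(5) == 10);    // 0 + 1 + 2 + 3 + 4
}
```

A do/while loop falls out of the same template by making pretest always true and putting the condition in posttest.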
Should I implement some sort of laziness?
Lazy lists would be useful. Much like the iterator concept.
It should not be necessary to provide both
Somehow prevent this: Let O be some object that has resources that need to be freed;
f(a) will call the destructor for A and NOT B. This will lead to a memory leak that is very hard to track down unless you are already familiar with this type of mistake.
Include elements of Table Oriented Programming: http://www.geocities.com/tablizer/