Is it really necessary to separate the lexical scanner from the parser?
It seems that space and time considerations led to an architecture where the scanner translates the whole input into a stream of tokens, which is then fed into the parser's input port. In a modern system, a recursive descent parser can easily do the regexp matching lazily, i.e., only when it tries to read the next token.
This will make it possible to:
- Easily define "local" keywords: if a certain token is tagged as a keyword in some rule, in other rules it may be tagged as an identifier. For example, the "throws" keyword in Java is used in one very specific place in Java's grammar. If we turn it into a local keyword, we will be able to have a variable named "throws" with no conflicts arising.
- The parser generator can check whether two competing regexps have a non-empty intersection and issue an error/warning. In current global lexers this is an inherent issue, e.g.: every Java keyword is also a legal identifier. The resulting ambiguities are resolved through precedence rules (usually: the first definition wins). If we use "local" keywords, the set of competing regexps is much smaller => the chances of conflicts are smaller => we can afford to report an error whenever such a conflict occurs.
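The lazy-lexing idea above can be sketched in a few lines. This is a minimal illustration, not a real parser generator: the class and method names (`LazyLexerParser`, `tryToken`, `acceptKeyword`, `expectIdentifier`) are hypothetical. The point is that the parser asks for a token only when a grammar rule needs one, so "throws" can be accepted as an identifier in a rule that expects identifiers, and as a keyword only in the rule that expects the keyword:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a recursive descent parser that matches regexps
// lazily against the raw input, instead of consuming a pre-tokenized stream.
class LazyLexerParser {
    private final String input;
    private int pos = 0;

    LazyLexerParser(String input) { this.input = input; }

    private void skipWhitespace() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
    }

    // Try to match a regexp at the current position; return the lexeme and
    // advance on success, or return null (without advancing) on failure.
    String tryToken(String regexp) {
        skipWhitespace();
        Matcher m = Pattern.compile(regexp).matcher(input);
        m.region(pos, input.length());
        if (m.lookingAt()) { pos = m.end(); return m.group(); }
        return null;
    }

    // A rule that wants an identifier happily accepts "throws" here,
    // because "throws" is not a keyword in this context.
    String expectIdentifier() {
        String id = tryToken("[A-Za-z_][A-Za-z0-9_]*");
        if (id == null) throw new RuntimeException("identifier expected at offset " + pos);
        return id;
    }

    // A rule that wants a specific keyword asks for it explicitly;
    // on a mismatch the position is restored and the parse can continue.
    boolean acceptKeyword(String keyword) {
        int saved = pos;
        String word = tryToken("[A-Za-z_][A-Za-z0-9_]*");
        if (keyword.equals(word)) return true;
        pos = saved;
        return false;
    }
}
```

With this scheme, a variable-declaration rule can parse `int throws = 5;` by calling `acceptKeyword("int")` and then `expectIdentifier()`, while a method-header rule would call `acceptKeyword("throws")` in the one place the grammar uses it.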
Why is the grammar of a language so important?
Again, I think we stick with an old practice which needs to be reconsidered (to make it clear: I am not saying that a language designer should not publish the grammar of his language. My point is that it delivers very little information).
So again: in the old days, languages were not as rich as today's (in particular, they had much more primitive type systems), so there were very few errors that could be caught by the compiler's semantic analyzer. Most of the errors reported by the compiler were syntax/parsing errors.
On the other hand, in a modern statically typed language (object-oriented or functional) there are tons of errors that are caught by semantic analysis, such as: extending a final class, naming a class (rather than an interface) in an implements clause, a non-initialized final field, a non-conforming throws clause, instantiating an abstract class, and so on.
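The point can be demonstrated directly with the standard `javax.tools` compiler API (available in any JDK): the program below is syntactically flawless, so the parser accepts it, yet compilation fails in semantic analysis because a final class is extended. The class name `SemanticErrorDemo` is just for this sketch:

```java
import java.net.URI;
import java.util.List;
import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

public class SemanticErrorDemo {
    // In-memory source file, so we can hand javac a plain string.
    static class Source extends SimpleJavaFileObject {
        final String code;
        Source(String code) {
            super(URI.create("string:///Demo.java"), Kind.SOURCE);
            this.code = code;
        }
        @Override public CharSequence getCharContent(boolean ignoreErrors) { return code; }
    }

    // Returns true iff the system compiler accepts the program.
    static boolean compiles(String program) {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        return javac.getTask(null, null, null, null, null,
                             List.of(new Source(program))).call();
    }

    public static void main(String[] args) {
        // Parses fine, but rejected by semantic analysis ("cannot inherit from final A"):
        System.out.println(compiles("final class A {} class B extends A {}")); // false
        // Same syntactic shape, semantically fine:
        System.out.println(compiles("class A {} class B extends A {}"));       // true
    }
}
```

A grammar tells you nothing about why the first program is illegal; only the semantic rules in the language specification do.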
As a result, most of the information we developers need does not come from the grammar rules of the language. It comes from the language specification document, which focuses on semantics (and run-time behavior). And when we look for the correct syntax of a language construct, we simply go to Google.