Item 43462650

pjc50 • 7 days ago

Parsing isn't too bad compared to, say, Perl.

The preprocessor is a classic example of simplicity in the wrong direction: it's simple to implement, and pretty simple to describe, but when actually using it you have to deal with complexity like macro arguments being evaluated multiple times.
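To make the multiple-evaluation point concrete, here is a minimal sketch. The `MAX` macro, `next_value`, and `max_int` are hypothetical names made up for illustration; they are not from any particular codebase.

```c
/* Hypothetical function-like macro: each argument appears twice in the
 * expansion, so an argument with side effects may run more than once. */
#define MAX(a, b) ((a) > (b) ? (a) : (b))

static int calls = 0;

/* Hypothetical helper with a side effect: counts its own invocations. */
static int next_value(void) {
    calls++;
    return calls * 10;
}

/* Ordinary function: its arguments are evaluated exactly once. */
static int max_int(int a, int b) {
    return a > b ? a : b;
}

/* Returns how many times next_value() ran when passed through the macro. */
int macro_call_count(void) {
    calls = 0;
    (void)MAX(next_value(), 5);   /* expands to two next_value() calls */
    return calls;
}

/* Returns how many times next_value() ran when passed to the function. */
int function_call_count(void) {
    calls = 0;
    (void)max_int(next_value(), 5);
    return calls;
}
```

Under these assumptions the macro version runs `next_value()` twice (once in the comparison, once in the selected branch), while the function version runs it once.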

The semantics are a disaster ("undefined behavior").

kibwen • 7 days ago

> Parsing isn't too bad compared to, say, Perl.

This is damning with faint praise. Perl is undecidable to parse! Even if C isn't as bad as Perl, it's still bad enough that there's an entire Wikipedia article devoted to how bad it is: https://en.wikipedia.org/wiki/Lexer_hack

account42 • 6 days ago

> The Clang parser handles the situation in a completely different way, namely by using a non-reference lexical grammar. Clang's lexer does not attempt to differentiate between type names and variable names: it simply reports the current token as an identifier. The parser then uses Clang's semantic analysis library to determine the nature of the identifier. This allows a simpler and more maintainable architecture than The Lexer Hack. This is also the approach used in most other modern languages, which do not distinguish different classes of identifiers in the lexical grammar, but instead defer them to the parsing or semantic analysis phase, when sufficient information is available.

That sounds less like a problem with the language than with the design of earlier compilers.

kibwen • 6 days ago

Unifying identifiers in the lexer doesn't solve the problem. The problem is getting the parser to produce a sane AST without needing information from deeper in the pipeline. If all you have is `foo * bar;`, what AST node do you produce for the operator? Something generic like "Asterisk", and then its child nodes get some generic "Identifier" node (when at this stage, unlike in the lexer, you should be distinguishing between types and variables), and you fix it up in some later pass. It's a flaw in the grammar, period. And it's excusable, because C is older than Methuselah and was hacked together in a weekend like JavaScript and was never intended to be the basis for the entire modern computing industry. But it's a flaw that modern languages should learn from and avoid.
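As a rough illustration of the ambiguity, the same token sequence `foo * bar;` appears in both functions below and means two different things depending on what `foo` names. The function names `as_expression` and `as_declaration` are invented for this sketch.

```c
/* Case 1: foo is a variable, so `foo * bar;` is a multiplication
 * expression whose result is discarded (compilers may warn about it). */
int as_expression(void) {
    int foo = 6, bar = 7;
    foo * bar;              /* expression statement */
    return foo * bar;
}

/* Case 2: foo is a typedef name, so the same tokens declare
 * bar as a pointer to foo (here, a pointer to int). */
int as_declaration(void) {
    typedef int foo;
    foo * bar;              /* declaration: bar has type foo* */
    foo value = 42;
    bar = &value;
    return *bar;
}
```

A parser cannot choose between the two readings without knowing whether `foo` was previously declared as a type, which is exactly the information that lives in the symbol table.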

C ain't simple, it's an organically complex language that just happens to be small enough that you can fit a compiler into the RAM of a PDP-11.

mort96 • 7 days ago

I would probably describe Perl as really complex to parse as well if I knew enough about it. Both are difficult to parse compared to languages with more "modern sensibilities" like Go and Rust, with their nice, mostly context-free grammars, which can be parsed without terrible lexer hacks and separately from semantic analysis.

Walter Bright (who, among other things, has been employed to work on a C preprocessor) seems to disagree that the C preprocessor is simple to implement: https://news.ycombinator.com/item?id=20890749

> The preprocessor is fiendishly tricky to write. [...] I had to scrap mine and reimplement it 3 times.

I have seen other people in the general "C implementer/standards community" complain about it as well.

pjc50 • 7 days ago

I wonder if we can dig out the original K&R preprocessor implementation?

dbrower • 7 days ago

It was a lot simpler in capabilities. Much of the complexity is because of feature creep.

uecker • 6 days ago

What feature creep did the preprocessor have?