| ucpp-1.0 is a C preprocessor mostly compliant with ISO-C99. |
| |
| Author: Thomas Pornin <thomas.pornin@ens.fr> |
| Main site: http://www.di.ens.fr/~pornin/ucpp/ |
| |
| |
| |
| INTRODUCTION |
| ------------ |
| |
| A C preprocessor is a part of a C compiler responsible for macro |
| replacement, conditional compilation and inclusion of header files. |
| It is often found as a stand-alone program on Unix systems. |
| |
| Ucpp is such a preprocessor; it is designed to be quick and light, |
| yet fully compliant with the ISO standard 9899:1999, also known |
| as C99. Ucpp can be compiled as a stand-alone program, or linked to |
| some other code; in the latter case, ucpp will output tokens, one |
| at a time, on demand, as an integrated lexer. |
| |
| Ucpp operates in two modes: |
| -- lexer mode: ucpp is linked to some other code and outputs a stream of |
| tokens (each call to the lex() function will give one token) |
| -- non-lexer mode: ucpp preprocesses text and outputs the resulting text |
| on a file descriptor; if linked to some other code, the cpp() function |
| must be called repeatedly, otherwise ucpp is a stand-alone binary. |
| |
| |
| |
| INSTALLATION |
| ------------ |
| |
| 1. Uncompress the archive file and extract the source files. |
| |
| 2. Edit tune.h. Here is a short explanation of compile-time options: |
| |
| LOW_MEM |
| Enable memory-saving functions; this is for low-end and old systems, |
| but seems to be good for larger systems too. Keep it. |
| NO_LIBC_BUF |
| NO_UCPP_BUF |
| Two options used to disable the two bufferings inside ucpp. Define |
| both options for maximum memory saving but you will probably want |
| to keep libc buffering if you want decent performance. Define none |
| on large systems (modern 32 or 64-bit systems). |
| UCPP_MMAP |
| With this option, if ucpp internal buffering is active, ucpp will |
| try to mmap() the input files. This might give a slight performance |
| improvement, but will work only on a limited set of architectures. |
| PRAGMA_TOKENIZE |
| Make ucpp generate tokenized PRAGMA tokens on #pragma and _Pragma(); |
| tokenization is made this way: tokens are assembled as a null |
| terminated array of unsigned chars; if a token has a string value |
| (as defined by the STRING_TOKEN macro), the value follows the token, |
| terminated by PRAGMA_TOKEN_END (by default, a newline character cast |
| to unsigned char). Whitespace tokens are skipped. The "name" value |
| of the PRAGMA token is a pointer to that array. This setting is |
| irrelevant in non-lexer mode. |
| PRAGMA_DUMP |
| In non-lexer mode, keep #pragma in output; non-void _Pragma() are |
| translated to the equivalent #pragma. Irrelevant in lexer mode. |
| NO_PRAGMA_IN_DIRECTIVE |
| Do not evaluate _Pragma() inside #if, #include, #include_next and #line |
| directives; instead, emit an error (since the remaining _Pragma will |
| surely imply a syntax error). |
| INMACRO_FLAG |
| In lexer mode, set the inmacro flag to 1 if the current token comes |
| from a macro replacement, 0 otherwise. macro_count maintains an |
| increasing counter of such replacements. CONTEXT tokens count as |
| one macro replacement each. #pragma, and _Pragma() that do not come |
| from a macro replacement, also count as one macro replacement each. |
| This setting is irrelevant in non-lexer mode. |
| STD_INCLUDE_PATH |
| Default include path in stand-alone ucpp. |
| STD_MACROS |
| Default predefined macros in stand-alone ucpp. |
| STD_ASSERT |
| Default assertions in stand-alone ucpp. |
| NATIVE_INTMAX |
| NATIVE_UINTMAX |
| SIMUL_UINTMAX |
| WCHAR_SIGNEDNESS |
| Those options define how #if expressions are evaluated; see the |
| cross-compilation section of this file for more info. |
| DEFAULT_LEXER_FLAGS |
| DEFAULT_CPP_FLAGS |
| Default flags in respectively lexer and non-lexer modes. |
| POSIX_JMP |
| Define this if your architecture defines sigsetjmp() and |
| siglongjmp(); it is known to (very slightly) improve performance |
| on AIX systems. |
| MAX_CHAR_VAL |
| Ucpp will consider characters whose value is equal to or above |
| MAX_CHAR_VAL as outside the C source charset (so they will be |
| treated just like '@', for instance). For ASCII systems, 128 |
| is fine. 256 is a safer value, but uses more (static) memory. |
| For performance reasons, use a power of two. If MAX_CHAR_VAL is |
| correctly adjusted, ucpp should be compatible with any character |
| set. |
| UNBREAKABLE_SPACE |
| If you want an extra-whitespace character, define this macro to that |
| character. For instance, define this to 160 on an ISO-8859-1 system |
| if you want the 'unbreakable space' to be considered as whitespace. |
| SEMPER_FIDELIS |
| With this option set, ucpp, when used as a lexer, will pass |
| whitespace tokens to its caller, and those tokens will have their |
| true content; this is intended for reconstruction of the source |
| line. Beware that some comments may have embedded newlines. |
| COPY_LINE_LENGTH |
| Ucpp can maintain a copy of the current source line, up to that |
| length. Irrelevant to stand-alone version. |
| *_MEMG |
| Those settings modify ucpp behaviour with regard to memory allocation. |
| With higher values, ucpp will perform fewer malloc() calls and will |
| run faster, but it will use more memory. Reduce INPUT_BUF_MEMG and |
| OUTPUT_BUF_MEMG on low-memory systems, if you kept ucpp buffering |
| (see NO_UCPP_BUF option). |
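To illustrate the PRAGMA_TOKENIZE layout described above, here is a
minimal sketch of a routine that walks such an array. The token codes
(DEMO_NAME, DEMO_LPAR, ...), the DEMO_STRING_TOKEN() macro and
DEMO_TOKEN_END are hypothetical stand-ins, not the real values from
the ucpp headers; only the layout (one byte per token, an optional
string value closed by a terminator, a final zero byte) is taken from
the description.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real ucpp token codes. */
enum { DEMO_NAME = 1, DEMO_NUMBER = 2, DEMO_LPAR = 3, DEMO_RPAR = 4 };
/* Stand-in for PRAGMA_TOKEN_END (a newline cast to unsigned char). */
#define DEMO_TOKEN_END ((unsigned char)'\n')
/* Stand-in for STRING_TOKEN(): true for tokens carrying a string value. */
#define DEMO_STRING_TOKEN(t) ((t) == DEMO_NAME || (t) == DEMO_NUMBER)

/*
 * Count the tokens stored in a tokenized pragma array, and report in
 * *with_value how many of them carry a string value.
 */
static size_t count_pragma_tokens(const unsigned char *p, size_t *with_value)
{
	size_t n = 0;

	assert(p != NULL && with_value != NULL);
	*with_value = 0;
	while (*p != 0) {
		unsigned char t = *p++;

		n++;
		if (DEMO_STRING_TOKEN(t)) {
			(*with_value)++;
			/* skip the value, up to and including its terminator */
			while (*p != 0 && *p != DEMO_TOKEN_END) p++;
			if (*p == DEMO_TOKEN_END) p++;
		}
	}
	return n;
}
```

With these stand-in codes, a pragma such as "pack(1)" would be stored
as NAME "pack", LPAR, NUMBER "1", RPAR, that is four tokens, two of
which carry a value.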
| |
| 3. Edit the Makefile. You should define the variables CC and FLAGS; |
| there are the following options: |
| |
| -DAUDIT |
| Enable internal sanity checks; this slows ucpp down a bit. Do not |
| define unless you plan to debug ucpp. |
| -DMEM_CHECK |
| With this setting, ucpp will check for the return value of malloc() |
| and exit with a diagnostic when out of memory. MEM_CHECK is implied |
| by AUDIT. |
| -DINLINE=foobar |
| The ucpp code uses "inline" qualifier for some functions; by |
| default, that qualifier is macro-replaced with nothing. Define |
| INLINE to the correct replacement for your compiler, if supported. |
| Note that all "inline" functions in ucpp are also "static". For any |
| C99-compliant compiler, the GNU compiler (gcc), and the Compaq C |
| compiler under Linux/Alpha, no -DINLINE is needed (see tune.h for |
| details). |
| |
| 4. Compile by typing "make". This should produce the ucpp executable |
| file. |
| |
| 5. Install wherever you want the binary and the man page ucpp.1. |
| |
| 6. If you do not have the make utility, compile each file separately |
| and link them together. The exact details depend on your compiler. |
| You must define the macro STAND_ALONE when compiling cpp.c (there |
| is such a definition, commented out, in cpp.c, line 34). |
| |
| There is no "configure" script since: |
| -- I do not like the very idea of a "configure" script. |
| -- Ucpp is written in ANSI-C and should be fairly portable. |
| -- There is no such thing as "standard" settings for a C preprocessor. |
| The predefined system macros, standard assertions,... must be tuned |
| by the sysadmin. |
| -- The primary goal of ucpp is to be included in compilers. The |
| stand-alone version is mainly a debugging tool. |
| |
| Please note that you need an ISO-C90 (formerly ANSI) C compiler suite |
| (including the standard library) to compile ucpp. If your compiler is |
| not C99 (or later), read the cross-compilation section in this README |
| file. |
| |
| The C90 standard states that external linkage names might be |
| considered equal or different based upon only their first 6 |
| characters (C99 raises this limit to 31); this rule might prevent |
| ucpp from compiling on a conformant C90 implementation. I have yet to |
| see such an implementation, however. |
| |
| If you want to use ucpp as an integrated preprocessor and lexer, see the |
| section REUSE. Compiling ucpp as a library is an exercise left to the |
| reader. |
| |
| With the LOW_MEM code enabled, ucpp can run on a Minix-86 or Msdos |
| 16-bit small-memory-model machine. On such an architecture it will |
| not be fully compliant with C99, since C99 states that at least one |
| source code with 4095 simultaneously defined macros must be processed; |
| ucpp will be limited to about 1500 macros (at most) due to memory |
| restrictions. At least ucpp can preprocess its own code in these |
| conditions. LOW_MEM is on by default because it seems to improve |
| performance on large systems. |
| |
| |
| |
| LICENSE |
| ------- |
| |
| The copyright notice and license are at the beginning of the Makefile and |
| each source file. It is basically a BSD license, without the advertising |
| subclause (which BSD dropped recently anyway) and with no reference to |
| Berkeley (since the code is all mine, written from scratch). Informally, |
| this means that you can reuse and redistribute the code as you want, |
| provided that you state in the documentation (or any substantial part |
| of the software) of redistributed code that I am the original author. |
| (If you press a cdrom with 200 software packages, I do not insist on |
| having my name on the cover of the cdrom -- just keep a Readme file |
| somewhere on the cdrom, with the copyright notice included) |
| |
| As a courteous gesture, if you reuse my code, please drop me a mail. |
| It raises my self-esteem. |
| |
| |
| |
| REUSE |
| ----- |
| |
| The code was designed as part of a bigger project; it can be |
| used as an integrated lexer that will read files, process them as a |
| C preprocessor, and output a stream of C tokens. To include this code |
| into a project, compile with STAND_ALONE undefined. |
| |
| To use the preprocessor and lexer, several steps should be performed. |
| See the file 'sample.c' for an example. |
| |
| 1. call init_cpp(). This function initializes the lexer automaton. |
| |
| 2. set the following global variables: |
| no_special_macros |
| non-zero if the special macros (__FILE__ and others) |
| should not be defined. This is a global flag since |
| it affects the redefinition of such macros (which are |
| allowed if the special macros are not defined) |
| c99_compliant |
| if non-zero, define __STDC_VERSION__ to 199901L; this |
| is the default; otherwise, do not define __STDC_VERSION__. |
| Note that ucpp allows __STDC_VERSION__ to be undefined |
| with a #undef directive. |
| c99_hosted |
| if strictly positive, define __STDC_HOSTED__ to 1. |
| If zero, define __STDC_HOSTED__ to 0. If negative, |
| do not define __STDC_HOSTED__. The default is 1. |
| emit_defines and emit_assertions should be set to 0 for |
| step 3. |
| |
| 3. call init_tables(). This function initializes the macro table |
| and other things; it will initialize assertions if it has a non-zero |
| argument. |
| |
| 4. call init_include_path(). This function will reset the include |
| path to the list of paths given as argument. |
| |
| 5. set the following global variables |
| emit_dependencies |
| set to 1 if dependencies should be emitted during |
| preprocessing |
| set to 2 if dependencies should also be emitted for |
| system include files |
| emit_defines |
| set to non-zero if #define macro definitions should be |
| emitted when macros are defined |
| emit_assertions |
| set to non-zero if #assert directives should be |
| emitted when assertions are made |
| emit_output |
| the FILE * where the above items are sent if one of the |
| three emit_ variables is set to non zero |
| transient_characters |
| this is for some cross-compilation; see the relevant |
| part in this README file for details |
| |
| 6. call set_init_filename() with the initial filename as argument; |
| the second argument indicates whether the filename is real or |
| conventional ("real" means "an fopen() on it will work"). |
| |
| 7. initialize your struct lexer_state: |
| call init_lexer_state() |
| call init_lexer_mode() if the preprocessor is supposed to |
| output a list of tokens, otherwise set the flags field |
| to DEFAULT_CPP_FLAGS and set the output field to the |
| FILE * where output should be sent |
| (init_lexer_mode(), if called at all, must be called after |
| init_lexer_state()) |
| adjust the flags field; here is the meaning of flags: |
| |
| WARN_STANDARD |
| emit the standard warnings |
| WARN_ANNOYING |
| emit the useless and annoying warnings |
| WARN_TRIGRAPHS |
| count trigraphs encountered; it is up to the caller to emit |
| a warning if some trigraphs were indeed encountered; the count |
| is stored in the count_trigraphs field of the struct lexer_state |
| WARN_TRIGRAPHS_MORE |
| emit a warning for every trigraph encountered |
| WARN_PRAGMA |
| emit a warning for each non-void _Pragma encountered in non-lexer |
| mode (because these are not dumped in the output) and for each |
| #pragma too, if ucpp was compiled without PRAGMA_DUMP |
| FAIL_SHARP |
| emit errors on '#' tokens beginning a line and not followed |
| by a valid cpp directive |
| CCHARSET |
| emit errors when non-C characters are encountered; if this flag |
| is not set, each non-C character will be considered as a BUNCH |
| token (since C99 states that non-C characters are allowed as |
| long as they "disappear" during preprocessing [through macro |
| replacement and stringification for instance], this flag must |
| not be set, for maximum C99 compliance) |
| DISCARD_COMMENTS |
| do not keep comments in output (irrelevant in lexer mode) |
| CPLUSPLUS_COMMENTS |
| understand new style comments (//) (mandatory for C99) |
| LINE_NUM |
| emit #line directives when entering a file, if not in lexer mode; |
| emit CONTEXT token in lexer mode for #line and new files |
| GCC_LINE_NUM |
| if LINE_NUM is set, emit gcc-like directives instead of #line |
| HANDLE_ASSERTIONS |
| understand assertions in #if expressions (and #assert, #unassert) |
| HANDLE_PRAGMA |
| make PRAGMA tokens for #pragma; irrelevant in non-lexer mode |
| (handling of some pragmas is required in C99 but is not the |
| responsibility of the preprocessor; without this flag, ucpp will |
| ignore the contents of #pragma and _Pragma directives) |
| MACRO_VAARG |
| understand macros with a variable number of arguments (mandatory |
| for C99) |
| UTF8_SOURCE |
| understand UTF-8 encoding: multibyte characters are considered |
| equivalent to letters as far as syntax is concerned (they can |
| be used in identifiers) |
| LEXER |
| act as a lexer, outputting tokens |
| TEXT_OUTPUT |
| this flag should be set to 0 if ucpp works as a lexer, 1 otherwise. |
| It is somewhat redundant with the LEXER flag, but the presence of |
| those two different flags is needed in ucpp. |
| KEEP_OUTPUT |
| in non-lexer mode, emit the result of preprocessing |
| COPY_LINE |
| maintain a copy of the last read line in the copy_line field of |
| the struct lexer_state ; see below for how to use this buffer |
| HANDLE_TRIGRAPHS |
| understand trigraphs, such as ??/ for \. This option should be |
| set by default, except for some legacy code. |
| |
| There are other flags, but they are for private usage of ucpp. |
| |
| 8. adjust the input field in the lexer_state to the FILE * from which |
| the source file is read. If you use the UCPP_MMAP compile-time option, |
| and your input file is eligible to mmap(), then you can call |
| fopen_mmap_file() to open it, then set_input_file() to set ls->input |
| and some other internal options. Do not call set_input_file() unless |
| you have just called fopen_mmap_file() on the same file. |
| |
| 9. call add_incpath() to add an include path, define_macro() and |
| undef_macro() to add or remove macros, make_assertion() and |
| destroy_assertion() to add or remove assertions. |
| |
| 10. call enter_file() (this is needed only in non-lexer mode, or if |
| LINE_NUM is set). |
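The flag adjustment of step 7 is ordinary bit manipulation on an
unsigned long. The sketch below uses a mock structure and hypothetical
bit values (the DEMO_ names are not the real ucpp definitions) to show
how a caller might derive a stricter, pre-C99 flag set from a default
one. That clearing these two flags is enough for C90-style operation
is an assumption, based only on their being described as mandatory for
C99.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit assignments, for illustration only. */
#define DEMO_CPLUSPLUS_COMMENTS 0x0001UL
#define DEMO_MACRO_VAARG        0x0002UL
#define DEMO_HANDLE_TRIGRAPHS   0x0004UL
#define DEMO_WARN_TRIGRAPHS     0x0008UL
#define DEMO_DEFAULT_FLAGS \
	(DEMO_CPLUSPLUS_COMMENTS | DEMO_MACRO_VAARG | DEMO_HANDLE_TRIGRAPHS)

/* Mock of the flags field; the real struct lexer_state has many more. */
struct demo_state { unsigned long flags; };

/* Turn off the features described as C99-only, and count trigraphs. */
static void restrict_to_c90(struct demo_state *ls)
{
	assert(ls != NULL);
	/* neither // comments nor variadic macros */
	ls->flags &= ~(DEMO_CPLUSPLUS_COMMENTS | DEMO_MACRO_VAARG);
	/* let the caller warn about trigraphs after preprocessing */
	ls->flags |= DEMO_WARN_TRIGRAPHS;
}
```

In a real program the same two statements would be applied to the
flags field after init_lexer_state() (and init_lexer_mode(), if used),
with the genuine flag names.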
| |
| |
| Afterwards: |
| -- if you are in lexer mode, call lex(); each call will make the ctok |
| field point to the next token. A non-zero return value is an error. |
| lex() skips whitespace tokens. The memory used by the string value |
| of some tokens (identifiers, numbers...) is automatically freed, |
| so copy the contents of each such token if you want to keep it |
| (tokens with a string content are identified by the STRING_TOKEN |
| macro applied to their type). |
| When lex() returns a non-zero value: if it is CPPERR_EOF, then |
| end-of-input was reached. Otherwise, it is a genuine error and |
| ls->ctok is an undefined token; skip it and call lex() again to |
| ignore the error. |
| |
| -- otherwise, call cpp(); each call will analyze one or more tokens |
| (one token if it did not find a cpp directive, or a macro name). |
| A positive return value is an error. |
| |
| For both functions, if the return value is CPPERR_EOF (which is a |
| strictly positive value), then it means that the end of file was |
| reached. Call check_cpp_errors() after end of file for pending errors |
| (unfinished #if constructions for instance). In non-lexer mode, |
| call flush_output(). |
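The return-value convention described above for lex() can be sketched
as a loop. The code below does not link with ucpp: scripted_lex() is a
hypothetical stand-in that replays a list of return codes, and
DEMO_EOF stands for CPPERR_EOF (its value here is arbitrary). Only the
control flow is meant to carry over.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_EOF 1000	/* hypothetical stand-in for CPPERR_EOF */

/*
 * A token source: returns 0 on success, DEMO_EOF at end of input, and
 * any other non-zero value on error (the convention given for lex()).
 */
typedef int (*demo_lex_fn)(void *state);

struct demo_counts { unsigned long tokens, errors; };

/* Drain the token source, skipping over errors as the text suggests. */
static struct demo_counts drain(demo_lex_fn next, void *state)
{
	struct demo_counts c = { 0, 0 };
	int r;

	assert(next != NULL);
	while ((r = next(state)) != DEMO_EOF) {
		if (r != 0) c.errors++;	/* the current token is undefined */
		else c.tokens++;	/* real code would copy the token here */
	}
	return c;
}

/* Scripted stand-in for lex(): replays a list of return codes. */
struct demo_script { const int *codes; size_t pos, len; };

static int scripted_lex(void *state)
{
	struct demo_script *s = state;

	if (s->pos >= s->len) return DEMO_EOF;
	return s->codes[s->pos++];
}
```

A real caller would follow the loop with check_cpp_errors(), as
explained above, to catch pending errors such as an unfinished #if.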
| |
| In the struct lexer_state, the following fields might be read: |
| line the current input line number |
| oline the current output line number (in non-lexer mode) |
| flags the flags described above |
| count_trigraphs the number of trigraphs encountered |
| inmacro the current token comes from a macro |
| macro_count the current macro counter |
| "flags" is an unsigned long and might be modified; the three others |
| are of long type. |
| |
| |
| To perform another preprocessing: use free_lexer_state() to release |
| memory used by the buffers referenced in lexer_state, and go back to |
| step 2. The different tables (macros, assertions...) should be reset to |
| their respective initial contents. (Warning: this is currently untested) |
| |
| |
| The COPY_LINE buffer: the struct lexer_state contains two interesting |
| fields, copy_line[] and cli. If the COPY_LINE flag is on, each read |
| line is stored in this buffer, up to (at most) COPY_LINE_LENGTH - 1 |
| characters (COPY_LINE_LENGTH is defined in tune.h). The last character |
| of the buffer is always a zero, and if the line was read entirely, it is |
| zero terminated; the trailing newline is not included. |
| |
| The purpose of this buffer is error-reporting. When an error occurs |
| (cpp() returns a strictly positive value, or lex() returns a non-zero |
| value), if your struct lexer_state is called ls, use this code: |
| |
| if (ls.cli != 0) ls.copy_line[ls.cli] = 0; |
| |
| This will add a trailing 0 if the line was not read entirely. |
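Here is a minimal sketch of how that line could be used in an error
report. The structure is a mock holding only the two fields discussed
here, DEMO_COPY_LINE_LENGTH stands for COPY_LINE_LENGTH, and
format_error_line() is a hypothetical helper; the field types are
assumptions. snprintf() requires a C99 library.

```c
#include <assert.h>
#include <stdio.h>

#define DEMO_COPY_LINE_LENGTH 80	/* stand-in for COPY_LINE_LENGTH */

/* Mock of the two relevant fields; the real struct has many more. */
struct demo_lexer_state {
	unsigned char copy_line[DEMO_COPY_LINE_LENGTH];
	int cli;
};

/*
 * Terminate a partially read line, then format "line N: <text>" into
 * out. Returns what snprintf() returns.
 */
static int format_error_line(struct demo_lexer_state *ls, long line,
	char *out, size_t outlen)
{
	assert(ls != NULL && out != NULL);
	assert(ls->cli >= 0 && ls->cli < DEMO_COPY_LINE_LENGTH);
	/* same statement as in the text, written for a pointer */
	if (ls->cli != 0) ls->copy_line[ls->cli] = 0;
	return snprintf(out, outlen, "line %ld: %s", line,
		(const char *)ls->copy_line);
}
```

The report then shows the source text up to the point where the error
was detected, which is usually what the user needs to see.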
| |
| |
| |
| COMPATIBILITY NOTES |
| ------------------- |
| |
| The C language has a long history. Nowadays, C comes in three |
| flavours: |
| |
| -- Traditional C, aka "K&R". This is the language first described by |
| Brian Kernighan and Dennis Ritchie, and implemented in the first C |
| compiler that was ever coded. There are actually several dialects of |
| K&R, and all of them are considered as deprecated. |
| |
| -- ISO 9899:1990, aka C90, aka C89, aka ANSI-C. Formalized by ANSI |
| in 1989 and adopted by ISO the next year, it is the C flavour many C |
| compilers understand. It is mostly backward compatible with K&R C, but |
| with enhancements, clarifications and several new features. |
| |
| -- ISO 9899:1999, aka C99. This is an evolution on C90, almost fully |
| backward compatible with C90 (exhibiting a code that makes a difference |
| is a tricky exercise). C99 introduces many new and useful features, |
| however, including in the preprocessor. |
| |
| There was also a normative addendum in 1995, that added a few features |
| to C90 (for instance, digraphs) that are also present in C99. |
| |
| |
| Ucpp implements the C99 standard, but can be used in a stricter mode, |
| to enforce C90 compatibility (it will, however, still recognize some |
| constructions that are not in plain C90). |
| |
| Ucpp also knows several extensions to C99: |
| |
| -- Assertions: this is an extension to the defined() operator, with |
| its own namespace. Assertions seem to be used in several places, |
| therefore ucpp knows about them. It is recommended to enable |
| assertions by default on Solaris systems. |
| -- Unicode: the C99 norm specifies that extended characters, from |
| the ISO-10646 charset (aka "unicode") can be used in identifiers |
| with the notations \u and \U. ucpp also accepts (with the proper |
| flag) the UTF-8 encoding in the source file for such characters. |
| -- #include_next directive: it works as a #include, but will look |
| for files only in the directories specified in the include path |
| after the one where the current file was found. This is a GNU-ism that |
| is useful for writing transparent wrappers around header files. |
| |
| Assertions and unicode are activated by specific flags; the #include_next |
| support is always active. |
| |
| The ucpp code itself should be compatible with any ISO-C90 compiler. |
| The cpp.c file is rather big (~ 53kB), it might confuse old 16-bit C |
| compilers; the macro.c file is somewhat large also (~ 43kB). |
| |
| The evaluation of #if expressions is subject to some subtleties, see the |
| section "cross-compilation". |
| |
| The lexer code makes no assumption about the source character set, but |
| the following: source characters (those which have a syntactic value in |
| C; comment and string literal contents are not concerned) must have a |
| strictly positive value that is strictly lower than MAX_CHAR_VAL. The |
| strict positivity is already assured by the C standard, so you just need |
| to adjust MAX_CHAR_VAL. |
| |
| Ucpp has been tested successfully on ASCII/ISO-8859-1 and EBCDIC systems. |
| Beware that UTF-8 is NOT compatible with EBCDIC. |
| |
| Pragma handling: when used in non-lexer mode, ucpp tries to output |
| a source text that, read again, will give the exact same stream of |
| tokens. This is not completely true with regards to line numbering in |
| some tricky macro replacements, but it should work correctly otherwise, |
| especially with pragma directives if the compile-time option PRAGMA_DUMP |
| was set: #pragma are dumped, non-void _Pragma() are converted to the |
| corresponding #pragma and dumped also. |
| |
| Ucpp does not macro-replace the contents of #pragma and _Pragma(); |
| if you want a macro-replaced pragma, use this: |
| |
| #define pragma_(x) _Pragma(#x) |
| #define pragma(x) pragma_(x) |
| |
| Anyway, pragmas do not nest (a _Pragma() cannot be evaluated if it is |
| inside a #pragma or another _Pragma). |
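The two-level definition is needed because the # operator stringifies
its argument before any macro replacement takes place; the extra level
lets the argument be replaced first. The sketch below shows the same
mechanism with plain strings instead of _Pragma(), so that the effect
can be checked at run time. STR_, STR and LEVEL are names made up for
this illustration.

```c
#include <assert.h>
#include <string.h>

#define LEVEL 3

/* one level: the argument is stringified as written */
#define STR_(x) #x
/* two levels: the argument is macro-replaced, then handed to STR_() */
#define STR(x) STR_(x)

/* Check both spellings; aborts through assert() on a mismatch. */
static void check_stringify(void)
{
	assert(strcmp(STR_(LEVEL), "LEVEL") == 0);
	assert(strcmp(STR(LEVEL), "3") == 0);
}
```

By the same reasoning, pragma_(x) passes the unreplaced spelling of
its argument to _Pragma(), whereas pragma(x) passes the replaced one.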
| |
| |
| I wrote ucpp according to what is found in "The C Programming Language" |
| by Brian Kernighan and Dennis Ritchie (2nd edition) and the C99 |
| standard; but I could have misinterpreted some points. On some tricky |
| points I got help from the helpful people of the comp.std.c newsgroup. |
| For assertions and #include_next, I mimicked the behaviour of GNU cpp, |
| as stated in the GNU cpp info documentation. An open question is |
| related to the following code: |
| |
| #define undefined ! |
| #define makeun(x) un ## x |
| #if makeun(defined foo) |
| qux |
| #else |
| bar |
| #endif |
| |
| Ucpp will replace 'defined foo' with 0 first (since foo is not defined), |
| then it will replace the macro makeun, and the expression will become |
| 'un0', which is replaced by 0 since this is a remaining identifier. The |
| expression evaluates to false, and 'bar' is emitted. |
| However, some other preprocessors will replace makeun first, considering |
| that it is not part of a 'defined' operator application; this will |
| produce the macro 'undefined', which is replaced, and the expression |
| becomes '!foo'. 'foo' is replaced by 0, the expression evaluates to |
| true, and 'qux' is emitted. |
| |
| My opinion is that the behaviour is undefined, because use of the |
| 'defined' operator does not match an allowed form prior to macro |
| replacement (I mean, its syntax matches, but macro replacement then |
| makes the operator disappear, so its use no longer matches). Other |
| people think that the behaviour is well-specified, and contrary to what |
| ucpp does. The only thing clear to me is that the wording of the |
| standard (paragraph 6.10.1.3) is unclear. |
| |
| Since the ucpp behaviour makes ucpp code simpler and cleaner, and |
| since it is unlikely that any real-life code would ever be disturbed |
| by that interpretation of the standard, ucpp will keep its current |
| behaviour until convincing evidence of my misinterpretation of the |
| standard is given to me. The problem can only occur if one uses ## to |
| make a 'defined' operator disappear from a #if expression (everybody |
| agrees that the generation of a 'defined' operator triggers undefined |
| behaviour). |
| |
| |
| About _Pragma: the standard is not clear about when this operator is |
| evaluated, and if it is allowed inside #if directives and such. For |
| ucpp, I coded _Pragma as a special macro with lazy replacement: it will |
| be evaluated wherever a macro could be replaced, and only at the end of |
| the macro replacement (for practical purposes, _Pragma can be considered |
| as a macro taking one argument, and being replaced by nothing, except |
| for some tricky uses of the # and ## operators). This means that, by |
| default, ucpp will evaluate _Pragma inside some directives (mainly, #if, |
| #include, #include_next and #line), but it can be taught not to do so by |
| defining NO_PRAGMA_IN_DIRECTIVE in tune.h. |
| |
| |
| |
| CROSS-COMPILATION |
| ----------------- |
| |
| If compiled with a C99 development suite, ucpp should be fully |
| C99-compliant on the host platform (up to my own understanding of the |
| standard -- remember that this software is distributed as-is, without |
| any guarantee). However, if a pre-C99 compiler is used, or if the |
| target machine is not the host machine (for instance when you build a |
| cross-compiler), the evaluation of #if expressions is subject to some |
| cross-compiling issues: |
| |
| |
| -- character constants: when evaluating expressions, character constants |
| are interpreted in the source character set context; this is allowed |
| by the standard but this can lead to problems with code that expects |
| this interpretation to match the one made in the C code. To ease |
| cross-compilation, you can define a conversion array, and make the |
| global variable transient_characters point to it. The array should |
| contain 256 int; transient_characters[x] is the value of the character |
| whose value is x in the source character set. |
| |
| This facility is provided for inclusion of ucpp inside another code; |
| if you want a stand-alone ucpp with that conversion, hard-code the |
| conversion table into eval.c and make transient_characters[] statically |
| point to it. Alternatively, you could provide an option syntax to |
| provide such a table on command-line, if you feel like it. |
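As an illustration, here is a minimal sketch that fills such a
256-entry table for an ASCII build host and an EBCDIC target. Only a
few characters are converted, to keep the example short; a real table
would cover the whole character set. fill_demo_table() is a
hypothetical helper, not part of ucpp.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill a 256-entry conversion table: identity first, then a few
 * ASCII -> EBCDIC overrides. Assumes the build host uses ASCII.
 */
static void fill_demo_table(int table[256])
{
	int i;

	assert(table != NULL);
	for (i = 0; i < 256; i++) table[i] = i;
	table[' '] = 0x40;
	table['A'] = 0xC1;
	table['a'] = 0x81;
	/* EBCDIC digits are contiguous, from 0xF0 to 0xF9 */
	for (i = 0; i < 10; i++) table['0' + i] = 0xF0 + i;
}
```

In a program linked with ucpp, the table would be given static storage
and transient_characters would be made to point to it before
preprocessing starts (step 5 of the REUSE section).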
| |
| |
| -- wide character constants signedness: by default, ucpp gives wide |
| characters the same signedness as plain chars on the build host. To |
| force wide character constant signedness, define WCHAR_SIGNEDNESS to 0 |
| (for unsigned) or 1 (for signed). Beware, however, that "native" wide |
| character constants, even signed, are considered positive. Non-wide |
| character constants are, according to the C99 standard, of type int, and |
| therefore always signed. |
| |
| |
| -- evaluation type: C90 states that all constants in #if expressions |
| are considered as either long or unsigned long, and that the evaluation |
| is performed with operands of that size. In C99, the situation is |
| equivalent, except that the types used are intmax_t and uintmax_t, as |
| defined in <stdint.h>. |
| |
| Ucpp can use two expression evaluators: one uses native integer types |
| (one signed and one unsigned), the other evaluator emulates big integer |
| numbers by representing them with two "unsigned long". By default, it |
| will use the first evaluator, using (u)intmax_t as native types if the |
| compiler is C99-compliant, or (unsigned) long otherwise. If you want |
| another behaviour, modify the relevant section in tune.h. Here are |
| examples of definitions: |
| |
| /* evaluate natively with type "long long" */ |
| #define NATIVE_UINTMAX unsigned long long |
| #define NATIVE_INTMAX long long |
| |
| /* evaluate natively with type "long" (even if bigger is available) */ |
| #define NATIVE_UINTMAX unsigned long |
| #define NATIVE_INTMAX long |
| |
| /* evaluate with bignum evaluation */ |
| #undef NATIVE_UINTMAX |
| #define SIMUL_UINTMAX |
| |
| The bignum evaluation handles signed integers in two's complement |
| representation, whether this is the native integer representation or |
| not. The code makes the non-standard assumption that unsigned long are |
| represented unpadded in memory, that is, unsigned long are made up of |
| exactly sizeof(unsigned long) * CHAR_BIT bits. I have never heard of any |
| architecture where this assumption would be false. |
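To give an idea of what emulating a big integer with two "unsigned
long" involves, here is a minimal sketch of addition, two's complement
negation and sign test on such a pair. This is not the code from
eval.c; the structure and function names are made up for this
illustration. check_unpadded() tests the no-padding assumption
mentioned above.

```c
#include <assert.h>
#include <limits.h>

/*
 * A double-width number: value = hi * 2^N + lo, where N is the number
 * of bits in an unsigned long.
 */
struct demo_big { unsigned long hi, lo; };

/* The sketch relies on unsigned long having no padding bits. */
static void check_unpadded(void)
{
	assert((ULONG_MAX >> (sizeof(unsigned long) * CHAR_BIT - 1)) == 1);
}

/* Addition modulo 2^(2N), propagating the carry out of the low word. */
static struct demo_big big_add(struct demo_big a, struct demo_big b)
{
	struct demo_big r;

	r.lo = a.lo + b.lo;			/* wraps modulo 2^N */
	r.hi = a.hi + b.hi + (r.lo < a.lo);	/* carry iff the sum wrapped */
	return r;
}

/* Two's complement negation: invert all bits, then add one. */
static struct demo_big big_neg(struct demo_big a)
{
	struct demo_big one = { 0, 1 };

	a.hi = ~a.hi;
	a.lo = ~a.lo;
	return big_add(a, one);
}

/* In two's complement, the top bit of the high word is the sign. */
static int big_is_negative(struct demo_big a)
{
	return (a.hi >> (sizeof a.hi * CHAR_BIT - 1)) != 0;
}
```

Since all the arithmetic is done on unsigned words, the result does
not depend on how the host represents negative integers.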
| |
| |
| |
| FUTURE EVOLUTIONS |
| ----------------- |
| |
| Ucpp is quite complete now. There was a longstanding project of |
| "traditional" preprocessing, but I dropped it because it would not |
| map cleanly on the token-based ucpp structure. Maybe I will code a |
| string-based preprocessor one day; it would certainly use some of the |
| code from lexer.c, eval.c, mem.c and hash.c. However, making such a tool |
| is almost irrelevant nowadays. If someone wants to take on such a project, |
| using ucpp as code base, I would happily provide some help, if needed. |
| |
| |
| |
| CHANGES |
| ------- |
| |
| From 0.9 to 1.0: |
| |
| * bugfix: crash after erroneous #assert |
| * changed ERR_SHARP to FAIL_SHARP, EMUL_UINTMAX to SIMUL_UINTMAX |
| * made "inline" default on gcc and DEC ccc (Linux/Alpha) |
| * semantic of -I is now Unix-like (added directories are looked first) |
| * added -J flag (to add include directories after the system ones) |
| * cleaned up non-ascii issues |
| * bugfix: missing brace in no-LOW_MEM code |
| * bugfix: argument number check in variadic macros |
| * bugfix: crash in non-lexer mode after some cases of unreplaced macro |
| * bugfix: _Pragma() handling wrt # and ## |
| * made evaluation of _Pragma() optional in #if, #include and #line |
| * bugfix: re-dump of multiline #pragma |
| * added the inmacro and macro_count flags |
| * added mmap() support |
| * added option to retain whitespace content in lexer mode |
| |
| From 0.8 to 0.9: |
| |
| * added check for division by 0 in #if evaluation |
| * added check for non-standard line numbers |
| * added check for trailing garbage in most directives |
| * corrected signedness of char constants (always int, therefore always signed) |
| * made LOW_MEM code, so that ucpp runs smoothly on low memory architectures |
| * multiple bugfixes (using the GNU cpp testsuite) |
| * added handling of _Pragma (as a macro) |
| * added tokenization of pragma directives |
| * added conservation of pragma directives in text output |
| * produced Msdos 16-bit small memory model executable |
| * produced Minix-86 executable |
| |
| From 0.7 to 0.8: |
| |
| * added some support for Amiga systems |
| * fixed extra spacing in stringified tokens |
| * fixed bug related to %:% and tolerated rogue sharps |
| * namespace cleanup |
| * bugfix for macro redefinition |
| * added warning for evaluated comma operators in #if (ISO requirement) |
| * -Dfoo now defines foo with content 1 (and not void content) |
| * trigraphs can be disabled (for incorrect but legacy code) |
| * fixed semantics for #include "file" (local directory) |
| * fixed detection of protected files |
| * produced a Msdos 16-bit executable |
| |
| From 0.6 to 0.7: |
| |
| * officially changed the goal to full C99 compliance |
| * added the CONTEXT token and let NEWLINE tokens go |
| * added report_context() for error reporting |
| * enforced matching of #if/#endif (file-global nesting level = 0) |
| * added support of C99 digraphs |
| * added UTF-8 encoding support |
| * added universal character names |
| * rewrote #if expressions (sizes fixed, bignum, signed/unsigned fixed) |
| * fixed incomplete evaluation of #if expressions |
| * added transient_characters[] |
| |
| From 0.5 to 0.6: |
| |
| * disappearance of error_nonl() |
| * added extra optional warnings for trigraphs |
| * some bugfixes, especially in lexer mode |
| * handled Macintosh files correctly |
| |
| From 0.4 to 0.5: |
| |
| * nicer #pragma handling (a token can be emitted) |
| * bugfix in lexer mode after #line and #error |
| * added sample.c, an example of code linked with ucpp |
| * made #if expressions conforming to standard signed/unsigned handling |
| * added the copy_line[] buffer feature |
| |
| From 0.3 to 0.4: |
| |
| * relaxed interpretation of '#include foo' when foo ends up, after macro |
| substitution, with a '<bar>' content |
| * corrected the 'double-dot' bug |
| * corrected two bugs related to the treatment of macro aborted calls (due |
| to lack of arguments) |
| * some namespaces cleanup, to ease integration into other code |
| * documented the way to include ucpp into another program |
| * made newlines embedded into strings illegal (and reported as such) |
| |
| From 0.2 to 0.3: |
| |
| * added support for system predefined macros |
| * made several bugfixes |
| * checked C99 compliance for most of the features |
| * ucpp now accepts non-C characters on standard when used stand-alone |
| * removed many useless spaces in the output |
| |
| From 0.1 to 0.2: |
| |
| * added support for assertions |
| * added support for macros with variable arguments |
| * split the pharaonic cpp.c file into many |
| * made several bugfixes |
| * relaxed the behaviour with regards to the void arguments |
| * made C++-like comments an option |
| |
| |
| |
| THANKS TO |
| --------- |
| |
| Volker Barthelmann, Neil Booth, Stéphane Ecolivet, Antoine Leca, Cyrille |
| Lefevre, Dave Rivers, Loic Tortay and Laurent Wacrenier, for suggestions |
| and beta-testing. |
| |
| Paul Eggert, Douglas A. Gwyn, Clive D.W. Feather, and the other guys from |
| comp.std.c, for explanations about the standard. |
| |
| Dave Brolley, Jamie Lokier and Neil Booth, for discussion about tricky |
| points on nesting macros. |
| |
| Brian Kernighan and Dennis Ritchie, for bringing C to mortal Men. |