With character sets, I was initially going to support non-Unicode text by adding a --char-set flag to the assembler, but I decided that the character set should be defined somehow inside each program. My thought was that they could be defined as large table-like macros, something like the following:
%BYTE:n #nnnn_nnnn ;
%CHAR:n
?[n 'A' ==] BYTE:0x01
?[n 'B' ==] BYTE:0x02
?[n 'C' ==] BYTE:0x03
?[n 'D' ==] BYTE:0x04 ;
CHAR:"ABCDABCD"
This is, admittedly, quite unwieldy for character sets exceeding a few hundred characters, but it would work passably for small character sets like those used for HD44780-style LCD screens. What character sets did you have in mind?

Octal was another feature I couldn't make up my mind about, just because I wasn't familiar with any architectures that require it. It'll be trivial to tack on, though. For the Z80 instruction set, since the instruction encoding tends to cleave along octal lines, I used the following macro to pack octal digits into bytes, which has the advantage of allowing variables to be passed into each digit (the ADDr macro shows how it's used):
%XYZ:x:y:z #xxyyyzzz ;
%ADDr:r XYZ:2:0:r ;
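For example, with B encoded as register number 0 (per the Z80's three-bit register field), the following invocation should pack x=2, y=0, z=0 into 0b10000000, which is 0x80, the opcode for ADD A,B:
ADDr:0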
Thanks for the heads-up about the table of contents, the links should all work now.

I think the "0" prefix for octal (as used in the C programming language) is not so good; if octal literals are implemented, "0o" would be better.
For non-Unicode text, probably the simplest thing would be to treat the input as a sequence of bytes instead of Unicode characters, or equivalently to treat it as ISO-8859-1 (although handling it as ISO-8859-1 may be less efficient than just using bytes; I don't know much about the workings of Rust, so I can't say for certain).
By "non-Unicode text", I did not mean character mapping, although character mapping is another feature that would be useful to implement, similar to what you mentioned although it could be made more efficient (like you mention). Some way to map a input sequence of bytes (whether or not it is valid UTF-8) to a output character code, would work, probably.
I wholeheartedly agree with using "0o" as the octal prefix; I've never been a fan of "0". I've just implemented this feature and released it in v2.2.0; you can grab it from the project page. Thanks for the suggestion!
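As a quick illustration of the new syntax (assuming octal literals are accepted anywhere other integer literals are), reusing the BYTE macro from earlier:
BYTE:0o377
should pack down to a single 0xFF byte.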
I'm not too sure I understand what you're describing with non-Unicode text. Torque doesn't have a built-in concept of bytes; instead, each character is treated as an integer whose value is the Unicode code point of that character (in decimal, 65 for 'A', 955 for 'λ', 129302 for '🤖', etc). It's up to the programmer to choose how to pack the character (integer) into a sequence of bits. Code points are different to encodings like UTF-8 or UTF-16, which define how a code point (integer) is packed down into one or more bytes.
If you want to assemble 7-bit ASCII text, one byte per character, you define a macro that packs each character value into the lower 7 bits of an 8-bit byte. If the string contains a non-ASCII character, the character value will be too large to fit into the field and an error will be displayed.
%ASCII:c #0ccc_cccc ;
ASCII:"This is a string."
Assembling ISO-8859-1 text would be similar, but would involve remapping the characters above 0x7F like this:
%BYTE:n #nnnn_nnnn ;
%ISO-8859-1:c
?[c 0x7F <=] BYTE:c
?[c '¡' ==] BYTE:0xA1
?[c '¢' ==] BYTE:0xA2
?[c '£' ==] BYTE:0xA3 ;
ISO-8859-1:"£190.00"
UTF-8, being a variable-width encoding, requires a more complicated macro arrangement (you can see a full example here [0], and a rough sketch of the two-byte case below). But the key point is that strings aren't treated as byte sequences; they're just character/integer sequences until they get baked down into a byte encoding.

Please let me know if that doesn't answer your suggestion, I'm keen to understand your use-case.
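A minimal sketch of the two-byte case, assuming literal bits can be mixed with field bits (as in the ASCII macro above) and one comparison per condition; code points above 0x7FF would overflow the 11-bit field and raise the usual too-large error:
%UTF8:c
?[c 0x7F <=] #0ccc_cccc
?[c 0x7F >] #110c_cccc_10cc_cccc ;
UTF8:"£1.50"
Here '£' (code point 0xA3) packs into the two bytes 0xC2 0xA3, and the ASCII characters into one byte each.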
[0] https://benbridle.com/projects/torque/manual-v2.2.0.html#usi...
I meant that the input should not have to be Unicode. The input could be any extended-ASCII encoding (supporting ASCII is necessary so that the syntax will work, which rules out some encodings such as Shift-JIS and UTF-16) and would not have to be mapped to Unicode. (Even if the input actually is UTF-8, you may have a character that is made up of multiple code points. If the input is treated as a sequence of bytes then it can be any sequence of bytes, including a combination of multiple UTF-8 code points.)
It is a separate issue from mapping the character codes for output, which, as you say, can be unwieldy for large character sets; another kind of table specification (which might also be useful for things other than text) would probably help and be more efficient than a sequence of conditions in a macro.
About octal: someone argued against using "0o", but it is usually not that difficult to distinguish if the "o" is always lowercase. Another alternative would be how it works in PostScript, which uses the "8#" prefix for octal and "16#" for hexadecimal (e.g. 8#17 is 15 and 16#FF is 255 in decimal). (My opinion is that "0o" is good enough, though.)
Oh, I think I see what you mean now: you're talking about the encoding of the Torque source code that gets fed into the assembler. To be honest, I'd never really considered anything other than UTF-8; the parsing is all implemented in Rust, which requires strings to be valid UTF-8 anyway. Are you wanting to write Torque code with a non-Unicode text editor, or are you thinking about how the document encoding affects the handling of string literals inside the assembler?
Some kind of table syntax would be useful for character mappings, but I'm not sure what it'd look like or if it'd be applicable outside of dealing with characters. I'll think more on that.
> I wholeheartedly agree with using "0o" as the octal prefix
Can I give a vote against this? Distinguishing "o" and "0" can be a huge pain when using the wrong font.
Also, why not 0c? 0x already uses the "x" in "hexadecimal", so why not the "c" in "octal"? That it also reads a bit like a 13375p34k abbreviation of "octal" is a nice bonus.