shasm

NAME

shasm - binary file assembler written entirely in the GNU Bash shell

PAGE DATE

Jan. 2001 / Jan. 2002

INTERFACE

osimplay filename

after having done...

. osimplay

output is to the created files output and listing

DESCRIPTION

Good news

shasm is a trivially extensible, utterly flexible collection of unix shell routines for assembling arbitrary binary files. It is now merely what underlies the osimplay x86 compembler, but is documented here separately to keep the modularity of the osimplay design visible. shasm is thus the part of osimplay analogous to assembler directives like .org, .ascii and so on in other assemblers. You probably already know a lot about how it works. shasm can be run on any computer on which you can install Bash or a similar shell. It uses only the shell. shasm provides gobs of user feedback. osimplay can currently build several Linux user programs and an x86 bootsector.

Bad News

shasm is about 100 times slower than Gas. The functionality provided is just what I need to assemble a particular programming language on 386+. shasm needs a fairly featureful shell; ash won't cut it. pdksh has worked at one point, zsh should, and stuff like Plan 9 rc and the Amiga CLI might be coerced to. shasm should be presumed not to work for anything using features not demonstrated in the provided examples.

local-scope jargoneering

bytes, duals and quads are integers of 1, 2 and 4 bytes respectively. A cell is an int at the native size of an address on the machine in question. This is 2 _OR_ 4 on the 386. An oper is the first byte or the characteristic byte pair of a specific x86 machine instruction. An argument is any syntactic modifier to an instruction other than prefixes. Arguments in shasm are separated by spaces. An operand is the actual value the instruction will act on at runtime, as defined by the arguments. Note that I've based this on "instruction" without defining that. It usually means the thing there is an Intel name for, but not always. A macro in shasm is a shell expression or routine additional to what shasm provides.

amble

shasm is an assembler written in the GNU Bash unix-style command interpreter "shell" to make my H3sm 3-stack programming language maximally portable, and make it an OS. The initial shasm is for that purpose, and is for a subset of the x86 instruction set which is most useful for systems programming languages and operating systems. No FPU, for example. A side-effect of that goal is that shasm is a flexible and relatively easy-to-use means of creating any kind of arbitrary binary file. Because it's a script, the script itself is the executable, the authoritative documentation, and part of the user interface. shasm does not use anything external to the shell, such as sed, dd and so on. Most assemblers these days are geared to run in the background supporting a high-level language. shasm is more for coding machine language directly, or making odd binary files of any kind, interactively.

The initial shasm provides machine code assembly for a subset of the x86 instruction set. shasm implements a non-cryptic set of names for x86 instructions that I find helpful called asmacs. If you prefer Intel names you can easily transliterate them back in, for the most part. There are a few names that don't map one-to-one though.

shasm interprets the file to be assembled as a shell script. The opcodes in shasm are shell subroutines ("functions"), and any routine in shasm, and any functionality of Bash, is available throughout the assembly source.
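A toy sketch of that idea follows. It is not shasm's code: image, emit, nop_demo and halt_demo are made-up names for illustration, and the bytes are collected in a Bash array rather than written to a file. It shows opcodes as shell functions and the assembly source as ordinary shell text, so a loop works inside it.

```shell
#!/bin/bash
# Hypothetical toy assembler; none of these names come from shasm.
declare -a image=()            # assembled bytes, kept as integers

emit() {                       # append each argument as one byte
    local b
    for b in "$@"; do image+=( $(( b & 0xFF )) ); done
}

nop_demo()  { emit 0x90; }     # x86 NOP is the single byte 0x90
halt_demo() { emit 0xF4; }     # x86 HLT is the single byte 0xF4

# The "assembly source" is plain shell text; evaluating it runs the opcodes.
source_text='
nop_demo
for i in 1 2 3; do nop_demo; done
halt_demo
'
eval "$source_text"

printf '%02X ' "${image[@]}"; echo    # prints: 90 90 90 90 F4
```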

shell argument syntax

shasm's syntax is actually the behavior of shell argument processing. Usually one machine instruction and its arguments are assembled by one shell routine and its following arguments. An operator is followed by space-separated arguments. Arguments to the operator can be shell expressions if they are contiguous or quoted. The exception is that x86 "instruction prefixes" are handled like separate instructions by shasm. Instruction delimiting and argument delimiting are thus shell-style. Instructions are separated by ends of lines, and lines may be continued with a terminating \ or subdivided with ; as per usual in the shell. I don't do much of either, myself. Arguments are separated by spaces, also as is typical in the shell. shasm itself doesn't do any character-by-character parsing/lexing, so some things that are usually prefixes in other assemblers are separate tokens in shasm. For example, there are two separators for the source and destination sides of an instruction's arguments. They are to and from. These are the equivalent of a comma in e.g. GNU gas, and must be separated from other arguments by spaces on both sides.
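The following sketch shows why to and from need spaces on both sides when no character-level parsing is done. split_operands and show are hypothetical helpers written for this example, and A and B are placeholder tokens, not shasm argument names.

```shell
#!/bin/bash
# Illustrative only: split_operands is a made-up helper, not part of shasm.
# It splits an argument list at a standalone "to" or "from" word.
split_operands() {
    local word side=first
    first=() second=() separator=none
    for word in "$@"; do
        case $word in
            to|from) separator=$word; side=second ;;
            *)  if [ "$side" = first ]; then first+=( "$word" )
                else second+=( "$word" ); fi ;;
        esac
    done
}

show() { echo "first=[${first[*]}] separator=$separator second=[${second[*]}]"; }

split_operands A to B;                 show   # first=[A] separator=to second=[B]
split_operands "$(( 2 + 3 ))" from B;  show   # first=[5] separator=from second=[B]
split_operands Ato B;                  show   # first=[Ato B] separator=none second=[]
```

In the last call the shell hands over the single word Ato, so the separator is never seen.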

The most important variable is here, which is the current assembly address. here is equivalent to period in other assemblers (and is degenerately analogous to HERE in Forth). I think the equivalent in MASM is $. here is a declared integer. You can use here in Bash expressions as you see fit. L is the label specifier. allot is similar to ALLOT in Forth, and does a relative-addressed ".org".
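As a rough sketch of a declared-integer address counter: emit_demo and allot_demo below are hypothetical stand-ins, and they only move the counter. Whether shasm's allot also writes filler bytes to the output is not stated here, so that part is left out.

```shell
#!/bin/bash
# Sketch only: emit_demo and allot_demo are hypothetical, not shasm's routines.
declare -i here=0             # integer attribute: += performs arithmetic addition

emit_demo()  { here+=$#; }    # pretend to assemble one byte per argument
allot_demo() { here+=$1; }    # move the address relative to its current value

emit_demo 0x90 0x90           # here is now 2
allot_demo 16                 # here is now 18
emit_demo 0xF4                # here is now 19

# here can take part in any Bash arithmetic expression.
echo "here=$here next_even=$(( (here + 1) & ~1 ))"    # here=19 next_even=20
```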

The high-level utility of the shell provides many other features typical of assemblers implicitly. Examples:

I use the terms "byte", "dual" and "quad" for integers of 1, 2 and 4 bytes. The directives bytes, duals and quads assemble integers literally. They take one or more numeric or expression arguments, as is typical for shell commands. Bash and other recent unix-like shells provide a rich set of operators, but expression syntax is tricky. shasm itself is full of examples. For each argument to e.g. "bytes", one integer of the size specified (a byte in this case) is appended to the assembly. Arguments with values too large for the type being appended are truncated, with the low-significance end surviving. For x86 there are also operand qualifiers called byte, dual and quad. These are not directives.
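The truncation rule can be sketched as below. The names emit_le, bytes_demo, duals_demo and quads_demo are invented for the example, and the low-byte-first order is an assumption based on x86 being little-endian, not something read out of shasm.

```shell
#!/bin/bash
# Hypothetical sketch; these are not shasm's own directives.
declare -a image=()

emit_le() {        # emit_le WIDTH VALUE...: append each VALUE as WIDTH bytes, low byte first
    local width=$1 value i
    shift
    for value in "$@"; do
        for (( i = 0; i < width; i++ )); do
            image+=( $(( (value >> (8 * i)) & 0xFF )) )
        done
    done
}

bytes_demo() { emit_le 1 "$@"; }
duals_demo() { emit_le 2 "$@"; }
quads_demo() { emit_le 4 "$@"; }

bytes_demo 0x41 0x1FF          # 0x1FF is truncated to FF
duals_demo 0x12345             # truncated to 0x2345, stored as 45 23
quads_demo "$(( 1 << 8 ))"     # an expression argument: 00 01 00 00

printf '%02X ' "${image[@]}"; echo    # prints: 41 FF 45 23 00 01 00 00
```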

Assembler directives are machine-independent. shasm is therefore split into two scripts: the main one and the one for the CPU in question. Currently you have one choice of CPU: x86. shasm has no linker, sections, or debugging functionality. Please let me know if any of that changes.

The L style labels are for branch resolution. If you want to label some point in the assembly for other uses do


	mydatalabel=$here

and be careful with name conflicts. The ascii directive assembles a string.
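A small sketch of such a data label in use, under stated assumptions: ascii_demo is a hypothetical stand-in that only advances here by the string length, and the starting address 0x100 is arbitrary.

```shell
#!/bin/bash
# Hypothetical stand-in for a string directive: it only advances the address.
declare -i here=0x100
ascii_demo() { here+=${#1}; }

greeting=$here                        # data label: a plain variable holding the address
ascii_demo "hello, world"
greeting_len=$(( here - greeting ))   # size of the data between the label and here

printf 'greeting at 0x%X, %d bytes\n' "$greeting" "$greeting_len"
# prints: greeting at 0x100, 12 bytes
```

Because labels like this are ordinary shell variables, a label named here, or one matching any other variable the assembler relies on, would silently overwrite it.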

A shasm opcode writes to two output files, ./output and ./listing. output is the raw binary assembly, and listing is a hexadecimal/octal listing. A three-digit item in listing, such as 234, is a byte in octal, whereas a two-digit item, such as 22, is a byte in hex. The 386 modR/M and SIB bytes get built as octal and might as well be displayed that way.
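One way to get this behavior with shell builtins alone is sketched below. It is an assumption about technique, not shasm's actual code: put_byte, modrm and the temporary file names are made up. The modR/M byte has a 2-bit mod field and two 3-bit fields, so each field is exactly one octal digit.

```shell
#!/bin/bash
# Sketch only: file names and helper names are invented for this example.
out="${TMPDIR:-/tmp}/demo_output.$$"
lst="${TMPDIR:-/tmp}/demo_listing.$$"
: > "$out"; : > "$lst"

put_byte() {                          # put_byte VALUE [o|x]
    local value=$(( $1 & 0xFF )) octal
    printf -v octal '%03o' "$value"
    printf "\\$octal" >> "$out"       # an octal escape produces the raw byte
    if [ "${2:-x}" = o ]; then
        printf '%s ' "$octal" >> "$lst"       # three octal digits
    else
        printf '%02X ' "$value" >> "$lst"     # two hex digits
    fi
}

modrm() { echo $(( ($1 << 6) | ($2 << 3) | $3 )); }   # mod, reg, r/m fields

put_byte 0x89                         # an opcode byte, listed in hex
put_byte "$(modrm 3 0 1)" o           # modR/M byte 0xC1, listed in octal as 301

listing=$(< "$lst")
echo "$listing"                       # the listing reads: 89 301
```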

A shell is roughly 100 times slower than compiled C at low-level stuff. I've just tried to avoid making shasm unnecessarily worse than that. More importantly, I don't see any data capacities in shasm that are likely to be exceeded by any reasonable file of code. I do think shasm makes machine language less daunting, and may be useful for playing around with other types of binary data files, such as GIFs.

x86 shasm has its own seedoc for operator syntax and so on.