shasm
NAME
shasm - binary file assembler written entirely in the GNU Bash shell
PAGE DATE
Jan. 2001/ Jan. 2002
INTERFACE
osimplay filename
after having done...
. osimplay
output is to the created files output and listing
DESCRIPTION
Good news
shasm is a trivially extensible, utterly flexible collection of unix
shell routines for assembling arbitrary binary files. It is now merely
what underlies the osimplay x86 compembler, but is documented
here separately to keep the modularity of the osimplay design visible.
shasm is thus the parts of osimplay analogous to assembler
directives like .org, .ascii and so on in other assemblers. You probably
already know a lot about how it works. shasm can be run on any computer
you can get Bash or similar installed on. It uses only the shell.
shasm provides gobs of user-feedback. osimplay currently can build
several Linux user programs and an x86 bootsector.
Bad news
shasm is about 100 times slower than Gas. The functionality provided is
just what I need to assemble a particular programming language on 386+.
shasm needs a fairly featureful shell; ash won't cut it. pdksh has worked,
at one point, zsh should, and stuff like Plan 9 rc and the Amiga CLI might
be coerced to. shasm should be presumed to not work for anything using
features not demonstrated in the provided examples.
local-scope jargoneering
bytes, duals and quads are integers of 1, 2
and 4 bytes respectively. A cell is an int at the native size of
an address on the machine in question. This is 2 _OR_ 4 on the 386. An
oper is the first byte or the characteristic byte pair of a
specific x86 machine instruction. An argument is any syntactic
modifier to an instruction other than prefixes. Arguments in shasm are
separated by spaces. An operand is the actual value the
instruction will act on at runtime, as defined by the arguments. Note
that I've based this on "instruction" without defining that. It usually
means the thing there is an Intel name for, but not always. A
macro in shasm is a shell expression or routine additional to
what shasm provides.
amble
shasm is an assembler written in the GNU Bash unix-style command
interpreter "shell" to make my H3sm 3-stack programming language
maximally portable, and make it an OS. The initial shasm is for that
purpose, and is for a subset of the x86 instruction set which is most
useful for systems programming languages and operating systems. No FPU,
for example. A side-effect of that goal is that shasm is a flexible and
relatively easy-to-use means of creating any kind of arbitrary binary
file. Because it's a script, the script itself is the executable, the
authoritative documentation, and is part of the user interface. shasm
does not use anything external to the shell, such as sed, dd and so on.
Most assemblers these days are geared to run in the background supporting
a high-level language. shasm is more for coding machine language directly,
or making odd binary files of any kind, interactively.
The initial shasm provides machine code assembly for a subset of the x86
instruction set. shasm implements a non-cryptic set of names for x86
instructions that I find helpful called asmacs. If you prefer
Intel names you can easily transliterate them back in, for the most part.
There are a few names that don't map one-to-one though.
shasm interprets the file to be assembled as a shell script. The opcodes
in shasm are shell subroutines ("functions"), and any routine in shasm,
and any functionality of Bash, is available throughout the assembly
source.
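A minimal sketch of that idea, not taken from shasm's source: the names
emit_byte, no_op and return_near are made up for illustration, and the
listing variable only stands in for the ./listing file.

```shell
# Illustrative sketch only; none of these routine names come from shasm itself.

declare -i here=0     # current assembly address
listing=""            # stands in for the ./listing file

# emit_byte VALUE: assumed helper that records one byte in hex and advances here.
emit_byte() {
    local hex
    printf -v hex '%02x' $(( $1 & 0xff ))
    listing+="$hex "
    here+=1
}

# Two made-up opcode routines. 0x90 and 0xc3 are the x86 encodings of
# NOP and near RET.
no_op()       { emit_byte 0x90; }
return_near() { emit_byte 0xc3; }

# The "assembly source" is ordinary shell code calling those routines,
# so loops and other shell constructs work in the middle of it.
for i in 1 2; do no_op; done
return_near

echo "listing: $listing"
echo "here:    $here"
```

Because the opcodes are plain functions, redefining or wrapping one is an
ordinary shell operation rather than a special assembler feature.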
shell argument syntax
shasm's syntax is actually the behavior of shell argument processing.
Usually one machine instruction and its arguments are assembled by one
shell routine and its following arguments. An operator is followed by
space-separated arguments. Arguments to the operator can be shell
expressions if they are contiguous or quoted. The exception is that x86
"instruction prefixes" are handled like separate instructions by shasm.
Instruction delimiting and argument delimiting are thus shell-style.
Instructions are separated by ends of lines, and lines may be continued
with a terminating \ or subdivided with ; as per usual
in the shell. I don't do much of either, myself. Arguments are separated
by spaces, also as is typical in the shell. shasm itself doesn't do any
character-by-character parsing/lexing, so some things that are usually
prefixes in other assemblers are separate tokens in shasm. For example,
there are two separators for the source and destination sides of an
instruction's arguments. They are to and from. These are
the equivalent of a comma in e.g. GNU gas, and must be separated from
other arguments by spaces on both sides.
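The sketch below shows how a routine might find the separator using nothing
but the shell's own word splitting. parse_operands is a hypothetical name,
and the reading that "A to B" makes A the source while "A from B" makes B
the source is an assumption, not something taken from shasm's code.

```shell
# parse_operands ARGS...: hypothetical helper that splits its arguments at
# the first "to" or "from" token. No character-level parsing is done; the
# shell has already broken the line into words.
parse_operands() {
    local sep="" word
    local -a before=() after=()
    for word in "$@"; do
        if [[ -z $sep && ( $word == to || $word == from ) ]]; then
            sep=$word
        elif [[ -z $sep ]]; then
            before+=("$word")
        else
            after+=("$word")
        fi
    done
    case $sep in
        to)   echo "source=[${before[*]}] destination=[${after[*]}]" ;;
        from) echo "source=[${after[*]}] destination=[${before[*]}]" ;;
        *)    echo "no separator: [${before[*]}]" ;;
    esac
}

parse_operands A to B
parse_operands A from "$(( 2 + 3 ))"
parse_operands A toB      # "to" is not its own word, so it is not seen
```

The last call illustrates why the separators need spaces on both sides.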
The most important variable is here, which is the current
assembly address. here is equivalent to period in other assemblers (and is
degenerately analogous to HERE in Forth). I think the equivalent in MASM
is $. here is a declared integer. You can use here in Bash expressions as
you see fit. L is the label specifier. allot is similar
to ALLOT in Forth, and does a relative-addressed ".org".
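A rough sketch of using here in Bash arithmetic. The allot defined here is a
stand-in that only moves the address counter; what the real allot does to
the output file is not shown, and the starting address is arbitrary.

```shell
declare -i here=0x100      # integer attribute: the hex literal is stored as 256

# allot N: hypothetical stand-in that only moves here forward by N bytes.
allot() { here=$(( here + ($1) )); }

allot 5
after_allot=$here          # 261

# here can appear in any Bash arithmetic, e.g. to pad to a 16-byte boundary.
allot $(( (16 - here % 16) % 16 ))

echo "after allot 5: $after_allot, after alignment: $here"
```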
The high-level utility of the shell provides many other features typical
of assemblers implicitly. Examples:
- . <filename> is your .include directive.
- MOV () { copy $* ; } creates an opcode or
shasm routine synonym.
- Shell routines more complex than the preceding constitute "macros".
- Suffixes and other constructs are implicit to shell string concatenation.
- declare -i pi=314159 declares an integer constant in the state
of the shell, and thus in the assembly state.
- echo can send arbitrary progress info to the user anytime
- shell conditional and looping constructs can control assembly
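Several of the items above can be tried together in a few lines. In this
sketch copy is only a stub that prints its arguments, and repeat is a
made-up macro name; neither reflects what shasm's own routines do.

```shell
# Stub standing in for a real opcode routine; it only reports its arguments.
copy() { echo "copy $*"; }

# Synonym: MOV now behaves exactly like copy.
MOV() { copy "$@"; }

# Integer constant held in the shell state.
declare -i pi=314159

# A small "macro": run an instruction COUNT times.
repeat() {
    local -i count=$1 i
    shift
    for (( i = 0; i < count; i++ )); do "$@"; done
}

MOV A to B
repeat 2 MOV "$(( pi / 1000 ))" to A
```

The synonym here passes "$@" rather than $* so that a quoted argument
containing spaces reaches copy as one word.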
I use the terms "byte", "dual" and "quad" for integers of 1, 2 and 4
bytes. The directives bytes, duals and quads
assemble integers literally. They take one or more numeric or expression
arguments, as is typical for shell commands. Bash and other recent
unix-like shells provide a rich set of operators, but expression syntax is
tricky. shasm itself is full of examples. For each argument to e.g.
"bytes", one integer of the size specified (a byte in this case) is
appended to the assembly. Arguments with larger values than the type being
appended are truncated, low-significance end surviving the truncate. For
x86 there are also operand qualifiers called byte, dual
and quad. These are not directives.
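The truncation rule can be sketched as follows. emit_le is a hypothetical
helper, the three wrappers only print hex instead of writing to the output
files, and the least-significant-byte-first order is an assumption based on
the x86 being little-endian.

```shell
# emit_le WIDTH VALUE...: print each value as WIDTH bytes in hex, least
# significant byte first. Bits above the requested width are dropped.
emit_le() {
    local -i width=$1 value i
    local arg hex out=""
    shift
    for arg in "$@"; do
        value=$(( arg ))          # arguments may be numbers or expressions
        for (( i = 0; i < width; i++ )); do
            printf -v hex '%02x' $(( (value >> (8 * i)) & 0xff ))
            out+="$hex "
        done
    done
    echo "${out% }"
}

# Stand-ins named after the directives; they only print, they do not assemble.
bytes() { emit_le 1 "$@"; }
duals() { emit_le 2 "$@"; }
quads() { emit_le 4 "$@"; }

bytes 65 0x141        # 0x141 is truncated to its low byte, 0x41
duals 0x12345         # truncated to 0x2345
quads "1 + 1"
```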
Assembler directives are machine-independent. shasm is therefore split
into two scripts: the main one and the one for the CPU in question.
Currently you have one choice of CPU; x86. shasm has no linker, sections,
or debugging functionality. Please let me know if any of that changes.
The L style labels are for branch resolution. If you want to label some
point in the assembly for other uses do
mydatalabel=$here
and be careful with name conflicts. The ascii directive assembles
a string.
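A sketch of a data label in use. The ascii defined here is a simplified
stand-in that handles plain ASCII only and appends hex to a variable rather
than to the output files; the label names are arbitrary.

```shell
declare -i here=0x20      # arbitrary starting address (32)
listing=""                # stands in for the ./listing file

# ascii STRING: stand-in that records each character's code and advances here.
ascii() {
    local s=$1 hex
    local -i i
    for (( i = 0; i < ${#s}; i++ )); do
        printf -v hex '%02x' "'${s:i:1}"   # leading quote: character code
        listing+="$hex "
        here+=1
    done
}

message=$here             # label the start of the string
ascii "Hi!"
declare -i message_len=$(( here - message ))

echo "message at $message, length $message_len, bytes: $listing"
```

Since a label is just a shell variable, anything computed from it, such as
the length above, is available to later instructions as an argument.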
A shasm opcode writes to two output files, ./output and
./listing. output is the raw binary assembly, and listing is a
hexadecimal/octal listing. An item in listing of the form 234 is a byte in
octal, whereas 22 is hex. The 386 modR/M and SIB bytes get built as octal
and might as well be displayed that way.
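The reason octal suits these bytes is that a modR/M byte is split into
fields of 2, 3 and 3 bits, so each octal digit shows one field. The helper
names below are made up for illustration and are not part of shasm.

```shell
# modrm MOD REG RM: pack the three fields of an x86 modR/M byte.
modrm() { echo $(( ($1 << 6) | ($2 << 3) | $3 )); }

# show_byte VALUE: print one byte in three-digit octal and two-digit hex.
show_byte() { printf '%03o (hex %02x)\n' "$1" "$1"; }

show_byte "$(modrm 3 0 3)"
show_byte "$(modrm 1 2 5)"
```

In the octal form the fields 3, 0, 3 and 1, 2, 5 can be read straight off
the digits, which the hex form hides.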
A shell is roughly 100 times slower than compiled C at low-level stuff.
I've just tried to avoid making shasm unnecessarily worse than that. More
importantly, I don't see any data capacities in shasm that are likely to
be exceeded by any reasonable file of code. I do think shasm makes machine
language less daunting, and may be useful for playing around with other
types of binary data files, such as GIFs.
x86 shasm has its own seedoc for operator syntax and so on.