This document explains:
It has recently become common practice to send email containing attachments. These attachments are often encoded with MIME, a technically better replacement for uuencode in that it allows binary data to be transmitted reliably by email.
Attachments that contain word processor documents are, however, different. It is the practice of software companies to record these documents in a binary format. This serves two important purposes for them:
These same features are to the distinct disadvantage of the user. Documents written in word processor formats may be impossible to convert to other formats even if the conversion programs are supposedly available. At my work, we attempted to convert a document written in Italian to a format that will last. After spending an entire afternoon using 2 Sun workstations, 2 Macs, a PC powerbook, disks and ethernet for transfer, three of us, who were well experienced in all of these systems, were unable to succeed. Apparently the word processor binaries are not compatible between different computers (i.e. PC vs Mac).
The practices of the companies have the effect that the user is tied not only to the particular word processor program, but sometimes also to the computer that they wrote the document on. Transfer of the document for archival purposes becomes increasingly difficult as the old computer becomes obsolete. That is, the documents may be easy to create, but they will not last 20 years or more. Anyone who has used a word processor from a now defunct company or computer knows that unless they converted the documents before moving on, the documents are lost.
Because I would like to be able to read email from a long time ago, I cannot store word processor binary formats. Furthermore, if someone sends me such a format, it would force me to buy the product. While I'm sure the computer companies like this, it is inappropriate, perhaps even unethical, for people to be forcing me to buy poorly designed software.
bytes | document type | fold difference |
6633 | text/plain | 1 |
17277 | application/wordperfect | 2.6 fold larger! |
53949 | application/msword | 8.1 fold larger! |
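The fold differences in the table above are simple ratios against the size of the plain-text version. A small Python sketch, using the byte counts from the table (the function name `fold_difference` is made up for this illustration):

```python
def fold_difference(size_bytes, baseline_bytes):
    """Return how many times larger size_bytes is than baseline_bytes (one decimal)."""
    return round(size_bytes / baseline_bytes, 1)


# Byte counts copied from the table above.
sizes = {
    "text/plain": 6633,
    "application/wordperfect": 17277,
    "application/msword": 53949,
}

baseline = sizes["text/plain"]
for doc_type, size in sizes.items():
    print(f"{size:>6} | {doc_type} | {fold_difference(size, baseline)} fold")
```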
bytes | document type | fold difference |
6600 | html | 1 |
177664 | .doc | 26.9 fold |
bytes | document type | fold difference |
746 | itinerary.txt: text/plain | 1 |
5718 | itinerary.wpd: application/octet-stream | 8 fold larger! |
27667 | itinerary.doc: application/msword | 37 fold larger! |
Someone sent me a scientific paper that they had printed in their version of wordY. The equations were impossible to understand, making me think that the author was a nut case. Working with another person who printed the same paper out, I learned what had happened. The times symbol was printed as a double-headed arrow, the alpha symbol became an underscore, and the sigma symbol disappeared entirely. Thus the same wordY document does not print the same way for two people! Based on this experience, I suggest that no scientist should ever risk using word processor formats!
bytes | document type | fold difference |
62344 | original base64 encoded email | - |
46080 | letter.doc wordy file | 1 |
1276 | letter.txt: text/plain made using openoffice.org | 36.1 fold smaller |
1183 | letter.txt: trimmed by hand | 39.0 fold smaller! |
bytes | document type | fold difference |
19968 | speaker.doc | 1 |
266 | speaker.txt | 75 fold smaller! |
There are several alternatives available. For email the accepted standard is pure ASCII. That is, a text-only format. While this may seem primitive, it will last. Further, it serves many purposes adequately. Any documents that one wants to keep for a long time are best stored in an ASCII format.
How can one store complex documents for a long time without worrying that the company will go out of business or make the document obsolete? There are two ASCII based formats that are particularly good. The older one is called TeX or, in the more convenient form of TeX, LaTeX. (The even older troff and nroff are still used by some people, but they are not as good as LaTeX.) In this format one types commands such as \emph{emphasis in the form of italics will be generated from this}. Learning such commands is not as hard as people sometimes imagine, but they are far more powerful than a GUI (graphical user interface) because they allow the user to make up new commands and are fast because one does not need to move the mouse through long menus. After typing the commands, one puts the text through a converter that produces beautiful typesetting. This is TRUE typesetting; it runs circles around what word processors can do. It has been used to typeset entire books. LaTeX is used around the world.
A common complaint about LaTeX is that it is not WYSIWYG. That stands for What You See Is What You Get, but it often is BNWYW: But Not What You Want! I have been frustrated by a commonly used word processor that could not make an equation beautiful, something LaTeX does very well. But the complaint is valid: rapid feedback on the results of typesetting is useful.
Some time ago I invented a program called atchange. This program simply watches one or more files. When a file changes, atchange will execute any series of commands that I want. It only takes 10 seconds to set up the commands, but atchange uses them hundreds of times. This means that I spend less time moving the mouse and much more time doing work. Since computers have become so fast, atchange allows LaTeX to become a WYSIWYG (and you get what you want).
I work in a simple but ergonomic editor (vi or vim) and when I type one key, a comma, the file is written out. Atchange notices this and calls the commands, including LaTeX, to typeset and display the text. On a 200 MHz machine, a 50 page technical paper can be typeset in about a second. As computers get faster, the time becomes negligible. So one gets the best of both worlds: a fully programmable, powerful typesetting language that uses ASCII (and so will last) AND a WYSIWYG.
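The watch-and-run idea can be sketched in a few lines of Python. This is only an illustration of the technique and not the actual atchange program: it assumes that polling the file's modification time is good enough, and the names `file_changed` and `watch`, the polling interval, and the file name in the usage comment are all hypothetical.

```python
import os
import subprocess
import time


def file_changed(path, last_mtime):
    """Return (changed, current_mtime) by comparing modification times."""
    current = os.stat(path).st_mtime
    return current != last_mtime, current


def watch(path, command, poll_seconds=0.5, max_runs=None):
    """Poll path and run command each time the file's modification time changes.

    max_runs limits how many times the command is run (None means forever).
    Returns the number of times the command was run.
    """
    last = os.stat(path).st_mtime
    runs = 0
    while max_runs is None or runs < max_runs:
        time.sleep(poll_seconds)
        changed, last = file_changed(path, last)
        if changed:
            # check=False: a failed typesetting run should not stop the watcher.
            subprocess.run(command, check=False)
            runs += 1
    return runs


# Hypothetical usage: re-typeset paper.tex every time the editor writes it out.
# watch("paper.tex", ["latex", "paper.tex"])
```

Polling is the simplest approach and works on any system. A real tool might instead use the operating system's file notification facilities, and would probably accept several files and several commands.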
Another feature of LaTeX that is incredibly nice: automatically formatted bibliographies. I just type "\cite{Shannon1949}" and the bibliographic entry is put into the paper in the right place and all references throughout the paper are altered automatically. It is incredible to see people still struggling with these things when such a powerful tool is freely available. I have set up atchange to redo the bibliography automatically whenever a new entry is put in the paper OR when a new entry appears in my reference database.
The second ASCII based format is HTML (HyperText Markup Language), the language that is used to create pages on the World Wide Web, such as this one (which I typed in vi). Like TeX/LaTeX, HTML has commands for defining how a page is to be typeset. Unfortunately it is not the same as LaTeX and so some powerful features were lost. I believe that in time HTML and LaTeX will be fused or a third language that covers both will emerge. A conversion program has been written that takes LaTeX to HTML (latex2html). Because both languages are in ASCII, it will always be possible to write conversion programs reasonably easily. Since the Netscape program can be told to go to pages or to refresh a page, in combination with atchange one can have a WYSIWYG for HTML. (Further information is on the atchange page.)
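To give a feel for why conversion between two plain-text markup languages is tractable, here is a toy Python sketch. It is not latex2html and makes no claim about how that program works: it handles only three commands, the function name `latex_to_html` is made up for this illustration, and real LaTeX documents need a real converter.

```python
import html
import re

# Only these three commands are handled in this toy example.
COMMANDS = {"emph": "em", "textbf": "strong", "texttt": "code"}

# Matches \command{text} where text contains no braces (the innermost command).
PATTERN = re.compile(r"\\(" + "|".join(COMMANDS) + r")\{([^{}]*)\}")


def _replace(match):
    tag = COMMANDS[match.group(1)]
    return f"<{tag}>{match.group(2)}</{tag}>"


def latex_to_html(text):
    """Convert a few simple LaTeX commands to the corresponding HTML tags."""
    # Escape &, < and > first so the source text cannot be mistaken for HTML tags.
    result = html.escape(text, quote=False)
    previous = None
    # Repeat until nothing changes, so nested commands are converted inside out.
    while previous != result:
        previous = result
        result = PATTERN.sub(_replace, result)
    return result


print(latex_to_html(r"This is \emph{important} and \textbf{very \emph{bold}}."))
```

Because the input is ordinary text, the whole converter is a pattern substitution. Nothing comparable is possible for a binary format without knowing its undocumented layout.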
Since HTML has rapidly become a widespread standard, it seems reasonable that email in HTML should be acceptable. Although purists object, at least HTML is readable without a web browser and so, if done carefully, does not come out as pure garbage when viewed without a program.
There is one time that we can predict will cause trouble for the ASCII format, and that is when computers begin to use Unicode. ASCII defines only 7 bits per character, although each character is normally stored in an 8-bit byte, and the high-order bit can cause trouble if one uses it in text files. Unicode was designed around 16 bits per character and so allows much larger character sets; Unicode 2.0 contains 38,885 distinct coded characters, making it a truly international standard. At some point it will be necessary to transform ASCII documents to Unicode, but conversion programs should be easy to create since all they have to do is map each ASCII character to its Unicode equivalent.
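The mapping really is that simple, because the first 128 Unicode code points are the ASCII characters with the same numbers. A minimal Python sketch follows. The function names are made up for this illustration, and the mention of UTF-8 and UTF-16 goes beyond what is said above: they are two common ways of storing Unicode text.

```python
def is_pure_ascii(data):
    """True if no byte has its high-order bit set (every value is below 128)."""
    return all(byte < 0x80 for byte in data)


def ascii_to_unicode(data):
    """Map each ASCII byte to the Unicode character with the same code point."""
    if not is_pure_ascii(data):
        raise ValueError("input contains bytes outside 7-bit ASCII")
    return "".join(chr(byte) for byte in data)


original = b"TeX, LaTeX and HTML"
text = ascii_to_unicode(original)
print(text)

# Stored as UTF-8, pure ASCII text is byte-for-byte unchanged.
print(text.encode("utf-8") == original)

# Stored as 16-bit units (UTF-16), each character simply takes two bytes.
print(len(text.encode("utf-16-be")) == 2 * len(original))
```

The check for the high-order bit matters: a file that uses byte values of 128 and above is not ASCII, and those bytes cannot be converted without knowing which 8-bit character set was intended.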
TeX, LaTeX, HTML and atchange are FREE. There are people around the world working to improve them. Physicists and mathematicians use TeX and LaTeX because of their wonderful ability to typeset mathematical equations and scientific notation. For some reason biologists have lagged behind.
Recommendations
This page was written entirely with my personal resources on my own time. No governmental funds, equipment or electrons were used.