Converting MS Word Documents to Text
Version 2.1, February, 2000
by Titmouse
[ editor's note: regardless of whether you use MS Word or not, the basic information in this article applies to all word/text processor problems, and is worth reading if you want to avoid some very common problems. ]
Like many others, I prefer to do my serious writing in
Microsoft Word. While it's not perfect, it has the
tools and capabilities to do everything I need and
most of what I want. But messages and stories posted
to Usenet newsgroups should be in plain ASCII text
and, as many have discovered, Word does not make this
conversion gracefully. This is a source of much
complication, confusion and irritation. After many
trials and errors, I think I've finally figured most
of it out, and this is a report on my conclusions.
The best and most successful tactic is problem
avoidance. It is much easier to prevent things that
will cause conversion problems than to fix them after
the fact. Some of the information presented here is
general and concerns basic formatting, but most of it
deals with the specific issues of converting Word
documents.
The discussion assumes you are working with Microsoft
Word 97 or Word 2000. Although I have not made
exhaustive tests with the new version, no changes
appear to be necessary in this document. The two
macros presented below also work without modification.
Initial Considerations
Most conversion problems stem from three sources:
document formats, paragraph formats, and the extended
character set. If you can avoid introducing problems,
the conversion should go smoothly. Word is designed
to defeat our purpose here, though, so we will have to
force it to do what we want. Defaults for all three
of these problem areas are wrong for text documents
posted to the Internet.
Unless all you do is write stories for posting to the
Internet, however, the changes you will need to make
are not ones you will want for other kinds of
documents. A secondary problem, then, is how to avoid
wrecking Word for other purposes.
I thought originally that it would be relatively
simple -- just create a new template designed for
plain text documents with all the bells and whistles
turned off, and there should be no problem. Not so.
My second theory was that the issues could be resolved
by creating an alternate version of the Normal.Dot
template. Also not so. Rather than recount a long
process of experimentation, I'll just report my
conclusions.
The Document Format
Problems with both document and paragraph formats are
most easily handled by creating a template that you
can use whenever you start a new story or other longer
document intended for the Internet. The template will
have the correct font, margins and paragraph format.
Then, if you remember to turn off Word's fancy text
gizmos as explained later, you can write your document
without creating problems for yourself.
To create a new template, launch Word and modify the
blank document as follows. First, go to File, Page
Setup and set the left and right margins. The top and
bottom margins don't matter, as they are ignored when
the document is saved as text.
You'll be using a fixed width font, so the line length
will determine the number of characters per line. I
use and recommend 55 characters per line (as in this
document) and strongly recommend that you not exceed
60 characters per line. When you post to the
Internet, your message will be handled and read by a
wide variety of programs. If your line length is too
long, one or more of them may force an early line
wrap. You won't know about it until your story
arrives on the newsgroup with alternating long and
short lines. I've never seen this problem occur with
55 character lines, however.
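If you'd like to check a finished text file for
over-long lines before posting, a few lines of script
will do it. This is only a minimal Python sketch, not
part of the Word procedure; the function name and the
default limit of 60 are my own choices for
illustration.

```python
def find_long_lines(text, limit=60):
    """Return (line_number, length) for every line longer than limit."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


# Three lines: one short, one of 61 characters, one of exactly 60.
sample = "short line\n" + "x" * 61 + "\n" + "y" * 60
print(find_long_lines(sample))  # prints [(2, 61)]
```

An empty list means every line fits within the limit.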
How to set the margins depends on the font size.
Word's default is 10-point type, but I recommend 12-
point type, which produces 10 characters per inch in
fixed width fonts. In this case, you'll need a 5.5-
inch line length. Any combination of left and right
margins that totals three inches will work. I use a
left margin of one inch and a right margin of two
inches, but 1.25 and 1.75 work just as well.
If you insist on 10-point type, you'll need a 4.5-inch
line length, so the margins should total four inches.
That's actually 54 characters per line, if you're
paying close attention, but near enough to 55.
METRIC NOTE: If you use the common alternative
standard of A4 paper and metric measurements, the
above recommendations translate to left and right
margins of 33mm, assuming the font is Courier New at
12 points. This produces 56 characters per line.
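The arithmetic behind these figures can be sketched in
a few lines of Python. It assumes a fixed-width font
that yields 120 divided by the point size in
characters per inch, which matches the numbers above
(10 per inch at 12 points, 12 per inch at 10 points).
The function name and its parameters are mine, for
illustration only.

```python
import math

MM_PER_INCH = 25.4


def chars_per_line(page_width_in, margins_total_in, point_size):
    """Whole characters that fit on one line of fixed-width type.

    Assumption: the font gives 120 / point_size characters per inch,
    as Courier New does (12 point -> 10 per inch).
    """
    chars_per_inch = 120 / point_size
    line_length_in = page_width_in - margins_total_in
    # The tiny epsilon guards against floating-point results such as
    # 54.99999999 being rounded down to 54.
    return math.floor(line_length_in * chars_per_inch + 1e-9)


# US Letter (8.5 in wide), margins totalling 3 in, 12-point type
letter_12pt = chars_per_line(8.5, 3.0, 12)
# US Letter, margins totalling 4 in, 10-point type
letter_10pt = chars_per_line(8.5, 4.0, 10)
# A4 (210 mm wide), 33 mm margins on each side, 12-point type
a4_12pt = chars_per_line(210 / MM_PER_INCH, 66 / MM_PER_INCH, 12)

print(letter_12pt, letter_10pt, a4_12pt)  # prints 55 54 56
```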
All the other settings on the Page Setup dialog should
remain the same as usual, so just click OK to set
the margins. If you prefer to work in Page Layout
View (rather than Normal View), set it now by
selecting View, Page Layout. Assuming you plan to use
12-point type, click on the Zoom control near the
right edge of the Standard toolbar (the one that
begins with the New, Open and Save icons) and set the
Zoom to 75%. If you stick with 10-point type, skip
this last step.
If your standard setup includes headers or footers,
eliminate them from this document. Go to View,
Headers and Footers and delete anything in either of
them. Otherwise, this information will appear in the
final text version.
Font and Paragraph Format
Now, press Ctrl-A (or use Edit, Select All) to select
the entire document. While the document may appear
blank, it contains a paragraph marker. In Word, the
paragraph marker is much more than an end-of-line
character; a great deal of formatting is stored with
it. If you don't include that initial paragraph
marker in your changes, the defaults will remain and
return to bite you later.
With the entire document selected, change the font to
Courier New, 12 point. If you have another fixed-
width font that you prefer, you can use it, since font
information will not be saved in the final text
version.
Now, with the entire document still selected, go to
Format, Paragraph. Make sure that Alignment is set to
Left, Indentation and Spacing are zeroed, and the
Outline level is set to Body Text. The most vital
setting is for Special. The default is First Line
with a half-inch indent. Set this to (none).
This last setting causes problems for many users.
Although the First Line setting will indent the first
line of paragraphs, no tab or other character is
actually placed in the document to cause it. Instead,
the setting is stored in the paragraph marker and
disappears when converted to text, which is why you
see a lot of documents where the indent appears to
have been lost. In fact, it was never really there in
the first place.
The final option, Line Spacing, should be set to
single. Don't worry about tab settings. Click OK
to implement the paragraph format.
Now, we're ready to save our new template. Select
File, Save As. Give your template a name -- I call
mine 'Text' -- and change the type to 'Document
Template.' Word will automatically place it in your
template folder. Click the Save button.
Sticking with ASCII
Now for the thorniest problem, which is Word's
insistence on putting extended character codes in
documents and leaving them there even when you convert
them to text. A little explanation is needed here,
although some will already be familiar with this
information.
The Usenet standard for text-oriented newsgroups calls
for plain ASCII text. ASCII (American Standard Code
for Information Interchange) predates widespread
computer use and is most closely associated with
Teletype machines. It is a seven-bit coding scheme,
since seven bits provide 128 numbers (0-127). At the
time, that seemed sufficient to represent the 52
capital and lower case letters, the 10 digits, common
punctuation symbols, and various control codes for
line feeds, carriage returns, tabs, page feeds and so
on.
Binary computers, though, use powers of two, most
famously the eight-bit byte. ASCII coding fit neatly
into a byte, with one bit left over which was
initially ignored. That didn't last, of course, and
several schemes evolved for extending the character
set by using that spare bit to provide an additional
128 codes (128-255). The most popular of these today
is ANSI (American National Standards Institute) in
which the first 128 codes correspond to ASCII. What
the upper 128 represent, at least in the Microsoft
world, depends on context, including language, font
and software.
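To see how much the upper 128 codes depend on context,
here is a small Python sketch that decodes one and the
same byte three ways. I am assuming that "ANSI" on a
Western-language Windows system means code page 1252,
and I use the old MS-DOS code page 437 and ISO 8859-1
for comparison; your own setup may differ.

```python
# One byte above 127, decoded under three different assumptions.
byte = b"\x93"

as_windows = byte.decode("cp1252")   # Windows "ANSI", Western languages
as_dos = byte.decode("cp437")        # original IBM PC / MS-DOS code page
as_latin1 = byte.decode("latin-1")   # ISO 8859-1

for name, char in [("cp1252", as_windows),
                   ("cp437", as_dos),
                   ("latin-1", as_latin1)]:
    # Show the Unicode code point each interpretation produces.
    print(f"{name:8} U+{ord(char):04X}")

# Codes 0-127 (plain ASCII) mean the same thing in all three.
same_everywhere = (
    b"A".decode("cp1252") == b"A".decode("cp437") == b"A".decode("latin-1")
)
```

Under code page 1252 the byte is a left curly quote,
under code page 437 it is an accented "o", and under
ISO 8859-1 it is an unprintable control code. Only the
lower 128 codes are safe everywhere.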
Here's the problem. When Word converts a document to
text, it uses ANSI, not ASCII. Extended character
codes above 127 remain in the text. What shows up on
the screen -- letters from other languages, math
symbols, and little black boxes for anything the
software can't display -- depends partly on which
flavor of conversion you used but mostly on the
software used to read it.
There is no cure within Word; your only choice is
prevention.
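Outside Word, though, it is easy to check whether any
extended codes slipped through. The following is a
minimal Python sketch, not a Word feature; the
function name is hypothetical.

```python
def find_non_ascii(text):
    """Return (line, column, character) for each character above 127."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, char in enumerate(line, start=1):
            if ord(char) > 127:
                hits.append((line_no, col_no, char))
    return hits


# A sample with curly quotes on the second line.
sample = "Plain line\nShe said \u201chello\u201d"
problems = find_non_ascii(sample)
for line_no, col_no, char in problems:
    print(f"line {line_no}, column {col_no}: U+{ord(char):04X}")
```

To check a saved file, read it into a string first
(for a file saved by Word as text, opening it with
encoding "cp1252" or "cp437" is a reasonable guess,
depending on which flavor of conversion you used) and
pass the contents to the function. An empty result
means the text is plain ASCII.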
Avoiding Extended Character Codes
While you can put extended codes in your documents
intentionally -- nearly everything on the Insert menu
will do so, for example -- the ones Word does for you
without asking are the biggest source of problems.
These mostly originate from the 'AutoFormat As You
Type' tab of the AutoCorrect page of the Tools menu.
The 'AutoCorrect' tab contributes a few additional
gotchas, and the (plain) 'AutoFormat' tab can also
cause problems.
The crux of the problem is that these settings are not
stored in any template. They stay with the program,
not the document, and they retain their settings until
you change them explicitly.
Since you probably will want at least some of these
features turned on for standard Word documents, there
are only two choices. One is to turn them on and off
manually depending on what you're working on; the
other choice is to use a pair of macros to do the work
for you. (You'll still have to remember to run the
macros, of course.) I have included the two macros
necessary and will explain how to implement and use
them later.
Copying Your AutoCorrect Setup
Before making any changes, make a copy of your current
setup. Start Word with a blank document, click on
'Tools' on the top menu bar, and then choose
'AutoCorrect...' You will see the AutoCorrect page
with four tabs: AutoCorrect, AutoFormat As You Type,
AutoText, and AutoFormat.
The third of these, AutoText, provides boilerplate
entries that require a manual step to insert in a
document. If you use this facility in documents
intended for publication in Usenet newsgroups, just
make sure such entries don't contain non-ASCII
characters. This caveat aside, AutoText is not
relevant to our text-conversion problems.
We may change the other three tabs, though, so let's
make a backup copy. Click on the AutoCorrect heading
to make sure the dialog has the focus, then hold down
the Alt key and press PrintScreen. This copies the
dialog to the clipboard. Close the dialog and, in
your blank document, press Ctrl + V (or click Edit,
Paste) to insert a picture of the dialog in your
document. Press Enter.
Now return to Tools, AutoCorrect. Select the
'AutoFormat As You Type' tab. Press Alt + PrintScreen
again. Close the dialog and press Ctrl + V to insert
a copy of this tab in your document. Press Enter.
Finally, return to Tools, AutoCorrect one more time
and select the 'AutoFormat' tab. Copy it to the
clipboard with Alt + PrintScreen, close the dialog,
and insert it into your document with Ctrl + V. Now,
save the document as 'AutoCorrect Settings' and print
a copy for reference.
The AutoCorrect Tab
This tab is concerned with typing mistakes. In the
top part are five checkbox options. I have four of
the five turned on normally, omitting the second,
'Capitalize first letter of sentences.' In my
experience, checking this box makes Word capitalize
things I don't want it to. In any case, you can set
the first four checkboxes according to your
preferences. They don't create conversion problems.
The fifth checkbox controls the bottom half of this
tab and can cause problems, however. In particular,
it converts (c) and (r) to the Copyright and
Registered symbols and three successive periods to the
ellipsis symbol. These, of course, all require
extended codes. In preparing a plain text document,
you don't need to change any of the replacements.
Just uncheck the 'Replace text as you type' checkbox,
and Word will ignore the list. This also means it
will not correct the many common typographical errors
on the list, however, so a spelling check becomes more
important than ever.
There is an alternative, which is what I've chosen to
do. I deleted the first several entries in the table
-- the ones that convert smilies as well as the
copyright and registered symbols. Now I can leave the
autocorrection of common typos turned on without
danger of substituting an illegal character. It's
something of an awkward choice, but personally I'd
rather catch the typos.
The AutoFormat-As-You-Type Tab
This is the bad boy, responsible for most of the
problems experienced in converting Word documents to
plain text. For standard documents, I have everything
checked except for hyperlinks, the third from last.
For text documents, I turn everything off.
As you can see, the middle section converts straight
quotes to curly quotes, ordinals to superscript,
common fractions to their graphic equivalents, dual
hyphens to real dashes, and *bold* and _underlining_
to actual bold and underlining. The quotes, fractions
and dashes use upper level codes, the rest depend on
Word formatting, and most of them don't convert
properly to text.
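Prevention is still the better course, but if a few
of these characters do turn up in a saved text file,
they can be mapped back to ASCII outside Word. Here
is a minimal Python sketch. The table covers only the
substitutions mentioned in this article, and the
ASCII stand-ins are my own choices, not any standard.

```python
# Word's automatic substitutions and plain ASCII stand-ins for them.
ASCII_EQUIVALENTS = {
    "\u2018": "'", "\u2019": "'",      # curly single quotes
    "\u201c": '"', "\u201d": '"',      # curly double quotes
    "\u2013": "-", "\u2014": "--",     # en dash, em dash
    "\u2026": "...",                   # ellipsis symbol
    "\u00a9": "(c)", "\u00ae": "(r)",  # copyright, registered
    "\u00bc": "1/4", "\u00bd": "1/2", "\u00be": "3/4",
}
_TABLE = str.maketrans(ASCII_EQUIVALENTS)


def to_plain_ascii(text):
    """Replace the characters listed above; leave everything else alone."""
    return text.translate(_TABLE)


sample = "\u201cWait\u2026\u201d she said \u2014 \u00a9 1999"
cleaned = to_plain_ascii(sample)
print(cleaned)  # prints "Wait..." she said -- (c) 1999
```

Anything not in the table passes through unchanged,
so the result should still be inspected by eye.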
The AutoFormat Tab
The settings on this tab are almost identical with
those on the previous one. Where the first makes its
changes as you type, the changes on this tab are made
only if you tell Word -- by selecting Format,
AutoFormat -- to perform them. If you don't do that,
you can leave these settings alone. Since I change my
settings via macro, it's just as easy to switch them
off and on.
Macros to Turn Text Settings On and Off
As you can see, considerable labor is required to
change these settings manually, especially if you
switch between document types frequently. As
mentioned earlier, you can't solve this problem by
putting the desired settings in the Text.dot template.
You can't even fix it by creating an alternate version
of Normal.dot, the template Word always uses. The
AutoCorrect settings are independent of the template.
Instead, the simplest way to switch is with a pair of
macros. You could record them yourself if you know
how, but I've provided copies here and directions on
how to create them.
First, if you haven't already, save this document and
load it into Word. Find this location again, and
follow the steps below. Be sure you have a copy of
your original AutoCorrect settings before proceeding.
As provided, the macros switch almost everything off
for text documents and back on for others. You may
prefer a different setup. It's easy to change. The
lines in the macro correspond exactly to the
checkboxes on the three AutoCorrect tabs, with True
meaning checked and False meaning unchecked. Using
the copy of your setup as a guide, change the Text_OFF
settings in the provided example from True to False or
vice versa.
The Text_OFF settings should correspond to your
current, preferred setup for normal documents. I
recommend that you use the suggested settings for the
Options section of Text_ON, but the first four entries
in the AutoCorrect section can be changed as desired.
The fifth entry under AutoCorrect toggles the Replace
Text feature off and on. If you delete the problem-
causing entries from the table, you can leave this
alone. Just delete the line from both macros and it
won't be changed by either of them.
Keep a copy of this document with your preferred
settings. If you decide later to modify them, it's
easy to change the macros. First, edit the text to
reflect your new preferences. Then go to the Macros
dialog (Alt + F8), delete the old versions, and then
recreate them using your modified versions.
Creating the Macros
In the section immediately below labeled TEXT_ON
MACRO, highlight and copy the lines between START and
STOP. The shortcut for Copy is Ctrl + C.
TEXT_ON MACRO
START
With AutoCorrect
.CorrectInitialCaps = True
.CorrectSentenceCaps = False
.CorrectDays = True
.CorrectCapsLock = True
.ReplaceText = False
End With
With Options
.AutoFormatAsYouTypeApplyHeadings = False
.AutoFormatAsYouTypeApplyBorders = False
.AutoFormatAsYouTypeApplyBulletedLists = False
.AutoFormatAsYouTypeApplyNumberedLists = False
.AutoFormatAsYouTypeApplyTables = False
.AutoFormatAsYouTypeReplaceQuotes = False
.AutoFormatAsYouTypeReplaceSymbols = False
.AutoFormatAsYouTypeReplaceOrdinals = False
.AutoFormatAsYouTypeReplaceFractions = False
.AutoFormatAsYouTypeReplacePlainTextEmphasis = False
.AutoFormatAsYouTypeReplaceHyperlinks = False
.AutoFormatAsYouTypeFormatListItemBeginning = False
.AutoFormatAsYouTypeDefineStyles = False
.AutoFormatApplyHeadings = False
.AutoFormatApplyLists = False
.AutoFormatApplyBulletedLists = False
.AutoFormatApplyOtherParas = False
.AutoFormatReplaceQuotes = False
.AutoFormatReplaceSymbols = False
.AutoFormatReplaceOrdinals = False
.AutoFormatReplaceFractions = False
.AutoFormatReplacePlainTextEmphasis = False
.AutoFormatReplaceHyperlinks = False
.AutoFormatPreserveStyles = False
.AutoFormatPlainTextWordMail = False
End With
STOP
Now, press Alt + F8. This brings up the Macros
dialog. If there's anything in the top box, Macro
Name, press the Delete key to clear it. Type Text_ON,
then click the Create button.
This will open the Visual Basic Editor. In the right
pane, you should see the cursor on a blank line.
Above it will be several lines beginning with 'Sub
Text_ON.' Immediately below will be a line that says
'End Sub.' Press Ctrl + V (or use Edit, Paste) to
insert the text you copied. Click the X in the upper
right corner, which will close the Visual Basic Editor
and return you to this document.
Now, repeat the process to create a Text_OFF macro.
Begin by copying the following lines between START and
STOP as before:
TEXT_OFF MACRO
START
With AutoCorrect
.CorrectInitialCaps = True
.CorrectSentenceCaps = False
.CorrectDays = True
.CorrectCapsLock = True
.ReplaceText = True
End With
With Options
.AutoFormatAsYouTypeApplyHeadings = True
.AutoFormatAsYouTypeApplyBorders = True
.AutoFormatAsYouTypeApplyBulletedLists = True
.AutoFormatAsYouTypeApplyNumberedLists = True
.AutoFormatAsYouTypeApplyTables = True
.AutoFormatAsYouTypeReplaceQuotes = True
.AutoFormatAsYouTypeReplaceSymbols = True
.AutoFormatAsYouTypeReplaceOrdinals = True
.AutoFormatAsYouTypeReplaceFractions = True
.AutoFormatAsYouTypeReplacePlainTextEmphasis = True
.AutoFormatAsYouTypeReplaceHyperlinks = True
.AutoFormatAsYouTypeFormatListItemBeginning = True
.AutoFormatAsYouTypeDefineStyles = True
.AutoFormatApplyHeadings = True
.AutoFormatApplyLists = True
.AutoFormatApplyBulletedLists = True
.AutoFormatApplyOtherParas = True
.AutoFormatReplaceQuotes = True
.AutoFormatReplaceSymbols = True
.AutoFormatReplaceOrdinals = True
.AutoFormatReplaceFractions = True
.AutoFormatReplacePlainTextEmphasis = True
.AutoFormatReplaceHyperlinks = True
.AutoFormatPreserveStyles = True
.AutoFormatPlainTextWordMail = True
End With
STOP
Once again, press Alt + F8 to bring up the Macros
dialog. Press the Delete key to clear the Macro Name
box, and type Text_OFF, then click the Create button.
The cursor will again be on a blank line below several
lines beginning with 'Sub Text_OFF' and above a line
that says 'End Sub.' Press Ctrl + V (or use Edit,
Paste) to insert the text you copied. Click the X in
the upper right corner to close the Visual Basic
Editor and return to this document.
You should now have two macros, Text_ON and Text_OFF.
To test them, press Alt + F8, and double-click the
Text_ON macro (or click Text_ON and then the Run
button). Go to the Tools, AutoCorrect dialog and
check the 'AutoFormat As You Type' tab. Everything
should be turned off. Now run the Text_OFF macro and
check the dialog again. Everything should be switched
back to your preferred settings.
Creating Your Text
So, with these tools in hand, you're ready to start a
new project. To create a document, use File, New and
select the Text template you created earlier. Before
doing anything, run the Text_ON macro. You'll need to
run the macro again each time you begin a new editing
session and run the Text_OFF macro whenever you switch
to another kind of document.
Now, all you have to do is to keep in mind the
eventual goal. Mostly that means not doing things you
know won't convert, such as Word styles, bulleted
lists, section breaks, columns, and so on. Avoid
bold, italics and underlining. If you need this kind
of emphasis, follow the plain text conventions of
indicating bold by preceding and following the text
with asterisks like *this* and underlining or italic
with underscores like _this_. With the AutoFormat
features turned off, these will not be converted.
For titles, I recommend a simple block at the left
margin, as in the following example.
Converting Word Documents to Text
By Titmouse
(C) August, 1999
You may wish to use capital letters for the actual
title. For section headings, I recommend placing two
blank lines before and one after. I've used this
convention throughout this document.
If you want to underline a heading, do so with hyphens
on a separate line beneath. Keep in mind, however,
that if you do this in any font other than Courier (or
some other monospaced font), you actually have no idea
how many hyphens are needed unless you count the
characters in the heading. Most fonts are
proportional. Each character, that is, has a separate
width, so that an 'm' and an 'i' take up different
amounts of line space. With monospaced fonts like
Courier, each character has the same width.
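Counting the characters is exactly the sort of job a
script does well. As a minimal Python sketch (the
function name is hypothetical):

```python
def underline_heading(heading, char="-"):
    """Return the heading followed by a rule of the same length."""
    return heading + "\n" + char * len(heading)


underlined = underline_heading("Final Thoughts")
print(underlined)
```

The rule lines up with the heading only when the
reader views the text in a monospaced font, which is
the usual case for plain text newsgroups.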
You also need to decide how you want to separate
paragraphs in your text. There are two basic
approaches. In one, paragraphs are not indented and
an extra blank line separates them. In the other,
paragraphs are indented with a tab or spaces and the
extra line is omitted. Either of these is acceptable,
but the first is preferred. Some software seems to
strip out tabs and spaces.
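For illustration, the first (block) approach can be
sketched in Python with the standard textwrap module.
The function name is hypothetical, and the width of
55 simply follows the line length recommended earlier.

```python
import textwrap


def block_format(paragraphs, width=55):
    """Wrap each paragraph and separate paragraphs with a blank line.

    Leading tabs and spaces are dropped and runs of whitespace are
    collapsed, so no indent survives into the output.
    """
    wrapped = [
        textwrap.fill(" ".join(paragraph.split()), width=width)
        for paragraph in paragraphs
    ]
    return "\n\n".join(wrapped)


story = block_format(["\tOne two.", "   Three  four."])
print(story)
```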
Saving the Document as Text
While you're working on the story or article, save it
as a normal Word document. You'll probably want to
maintain an archive version in that format anyway.
When you've finished the final editing for your story
and are ready to post it, save a new copy as
MS-DOS Text with Line Breaks
Then close the document in Word (or exit Word),
double-click on your new document to load it into
Notepad or Wordpad, and inspect it carefully for
surprises. If you need to make corrections other than
centering titles and headings with spaces, go back to
your Word document to make them and then resave over
your text version, always specifying 'MS-DOS Text with
Line Breaks.'
There is an alternative for those who use tab-indented
paragraphs or spaces to provide formatting. If you
save your final text version as 'MS-DOS Text with
Layout,' Word converts tabs to spaces and generally
preserves the visual layout. For reasons that escape
me, an extension of 'asc' is used for such documents.
You'll probably want to rename it with 'txt,' since
the 'asc' extension probably won't be recognized. Be
aware, though, that some software eliminates "extra"
spaces. This is why block format is preferred.
When you're ready to post it, open the text version,
copy the contents and paste into whatever software
you're using to post with. This should work in all
cases except for longer stories that exceed the limits
of certain providers (AOL, most notoriously). If you
have that problem, you'll need to go back to your
original story and break it into segments that fit
under the limits.
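If the story is in block format, the splitting can be
done mechanically at the blank lines between
paragraphs. This is a minimal Python sketch under that
assumption. The function name is hypothetical, and the
size limit is whatever your provider imposes; I don't
know the actual figures.

```python
def split_into_parts(text, max_chars):
    """Split block-format text into parts of at most max_chars each,
    breaking only at the blank lines between paragraphs.

    A single paragraph longer than max_chars is kept whole, so that
    part will exceed the limit and needs manual attention.
    """
    parts = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            parts.append(current)
            current = paragraph
    if current:
        parts.append(current)
    return parts


text = "aaaa\n\nbbbb\n\ncccc"
parts = split_into_parts(text, 10)
print(len(parts))  # prints 2
```

Joining the parts again with blank lines gives back
the original text, so nothing is lost in the split.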
Final Thoughts
Okay, that's more than enough. I hope I haven't left
out anything significant or made any stupid mistakes.
I'm sure wiser heads will let me know, if so. I'll
repost this note periodically with accumulated
corrections. A copy of the latest version will also
be available on the FAQs pages (both web and ftp
versions) at ASSTR.
After the original publication of this document, there
was considerable discussion about various problems in
converting existing documents to ASCII and correcting
format problems in other people's documents. I
included some ideas in the original version, but this
seems to me to be a topic of sufficient complexity to
require its own discussion. If there's enough
interest, I would be willing to take it on.
Please note that if you want to e-mail me directly,
the address is 'nitesweats |AT| aol.com' not the dummy
address in the header.
Peace,
Titmouse