5 : RTF produced

Documentation for the AscToRTF conversion utility

5 RTF produced

5.1 Text layout

5.1.1 Indentation

AscToRTF performs statistical analysis on the document to determine at what character positions indentations occur. This information is used on the output pass to determine the indentation level for each source line.
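As a rough illustration of this kind of analysis, the Python sketch below counts how often each leading-space offset occurs and keeps the offsets that recur. The names `find_indent_positions` and `indent_level`, and the `min_share` threshold, are assumptions made for the example; they are not AscToRTF's actual algorithm or settings.

```python
from collections import Counter

def find_indent_positions(lines, min_share=0.1):
    """Return the leading-space offsets used by at least min_share of the non-blank lines."""
    offsets = Counter()
    for line in lines:
        if line.strip():
            expanded = line.expandtabs(8)
            offsets[len(expanded) - len(expanded.lstrip(" "))] += 1
    total = sum(offsets.values())
    return sorted(pos for pos, n in offsets.items() if n / total >= min_share)

def indent_level(line, positions):
    """Map a line to the index of the deepest detected offset at or below its own."""
    expanded = line.expandtabs(8)
    offset = len(expanded) - len(expanded.lstrip(" "))
    level = 0
    for i, pos in enumerate(positions):
        if pos <= offset:
            level = i
    return level
```

The first function corresponds to the analysis pass and the second to the output pass: a line indented slightly off a detected position is snapped to the nearest level below it.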

5.1.2 Hanging paragraph indents

Some documents have hanging paragraph indents. That is, the first line of each paragraph starts at an offset to the rest of the paragraph.

AscToRTF struggles heroically with this, and tries not to treat this as text at two indent levels, but it does occasionally get confused.

If writing a text file from scratch with AscToRTF in mind, then it is best to avoid this practice.

5.1.3 Bullets and lists

AscToRTF detects and supports several types of bullets and lists. However, as of version 1.0, it does not attempt to convert these into auto-numbered lists (a feature introduced in a later version of RTF).

Such text is marked up using the "Bullet" Style. See "the use of RTF stylesheets".

5.1.3.1 Bullet chars

Bullet chars are lines of the type

        - this is a bullet line

        - this is a bullet paragraph
          because it carries over onto
          more lines

That is, a single character followed by the bullet line. AscToRTF can determine via statistical analysis which character, if any, is being used in this way. Special attention is paid to the '-' and 'o' characters.
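One way to picture this analysis is to count which lone leading characters are followed by a space and some text. The sketch below does that in Python; `guess_bullet_char` and its `min_count` threshold are hypothetical, and restricting candidates to punctuation plus 'o' is an assumption based on the characters mentioned above.

```python
from collections import Counter

def guess_bullet_char(lines, min_count=2):
    """Return the most frequent lone leading bullet candidate, or None."""
    counts = Counter()
    for line in lines:
        stripped = line.lstrip()
        # A candidate is one character followed by a space and some text.
        if len(stripped) > 2 and stripped[1] == " " and stripped[2:].strip():
            ch = stripped[0]
            # Assumed rule: punctuation, or the letter 'o', may act as a bullet.
            if ch == "o" or not ch.isalnum():
                counts[ch] += 1
    if not counts:
        return None
    ch, n = counts.most_common(1)[0]
    return ch if n >= min_count else None
```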

5.1.3.2 Numbered bullets

AscToRTF can spot numbered bullets. These can sometimes be confused with section headings in some documents. This is one area where the use of a document policy really pays dividends in sorting the sheep from the goats.

5.1.3.3 Alphabetic bullets

AscToRTF detects upper and lower case alphabetic bullets.

5.1.3.4 Roman Numeral bullets

AscToRTF detects upper and lower case roman numeral bullets.

5.1.4 Centred text

AscToRTF can attempt to spot sections of centred text. However, because this can easily go wrong this option is normally switched off.

Centering is not currently implemented in this version of AscToRTF.

5.1.5 Definitions

5.1.5.1 Definition lines

A definition line is a single line that appears to be defining something. Usually this is a line with either a colon (:) or an equals sign (=) in it. For example

        IMHO = In my humble opinion

        Address : Somewhere over the rainbow.

AscToRTF attempts to determine what definition characters are used and whether they are "strong" (only ever used in a definition) or "weak" (only sometimes used in a definition).
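The strong/weak distinction can be sketched as a simple tally. In the Python example below, a line counts as a definition when a short term sits to the left of the character and some text to the right; that heuristic, the function name and the `max_term_words` limit are all assumptions for illustration, not the program's real test.

```python
def classify_definition_chars(lines, candidates=(":", "="), max_term_words=3):
    """Label each candidate 'strong', 'weak' or None (never seen in a definition)."""
    result = {}
    for ch in candidates:
        as_definition = elsewhere = 0
        for line in lines:
            if ch not in line:
                continue
            term, _, rest = line.partition(ch)
            # Assumed heuristic: a short term on the left, some text on the right.
            if 0 < len(term.split()) <= max_term_words and rest.strip():
                as_definition += 1
            else:
                elsewhere += 1
        if as_definition == 0:
            result[ch] = None
        else:
            result[ch] = "strong" if elsewhere == 0 else "weak"
    return result
```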

AscToRTF marks up definition lines by placing a line break on the end of the line to preserve the original line structure. Where this decision is made incorrectly unexpected breaks can appear in text.

AscToRTF offers the option of marking up the definition term in bold. This is not the default behaviour however.

5.1.5.2 Definition paragraphs

AscToRTF also recognises the use of definition paragraphs such as :-

      Note:     This is a definition paragraph whereby the whole
                paragraph is defining the term shown on the first line.
                Unfortunately AscToRTF currently only copes with single
                paragraphs (i.e. not with continuation paragraphs), and
                only with single word definitions.

5.2 Text formatting

5.2.1 Quoted lines

AscToRTF recognises that, especially in Internet files, it is increasingly common to quote from other text sources such as e-mail. The convention used in such cases is to insert a quote character such as ">" at the start of each line.

Consequently, AscToRTF adds a line break at the end of each such line to preserve the line structure of the original, and marks the line up in italics to differentiate the quoted text.

Such text is marked up using the "Quoted" Style. See "the use of RTF stylesheets".

5.2.2 Emphasis

AscToRTF can look for text emphasised by placing asterisks (*) or underscores (_) either side of it. Text enclosed in asterisks is converted to bold, and text enclosed in underscores to italic.

AscToRTF will also look for combinations of asterisks and underscores, and place the enclosed text in bold italic. The asterisks and underscores should be properly nested.

The emphasised word or phrase should span no more than a few lines, and in particular should not span a blank line. If the phrase is longer, or if AscToRTF fails to match opening and closing emphasis marks, the characters are left unconverted.

Tests are made to ignore double asterisks and underscores, and sometimes adjacent punctuation will prevent the text being marked up.
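The rules above can be approximated with two regular expressions. The Python sketch below wraps matches in the RTF control words \b and \i; it is a simplification that only matches within a single line and does not escape RTF special characters, and `mark_emphasis` is a hypothetical name rather than anything in AscToRTF.

```python
import re

# A single marker, text that starts and ends with a non-space character,
# then a matching single marker. Doubled markers and markers inside words fail.
_BOLD = re.compile(r"(?<![*\w])\*([^*\s](?:[^*\n]*[^*\s])?)\*(?![*\w])")
_ITALIC = re.compile(r"(?<![_\w])_([^_\s](?:[^_\n]*[^_\s])?)_(?![_\w])")

def mark_emphasis(text):
    """Convert *bold* and _italic_ spans into RTF groups, leaving the rest alone."""
    text = _BOLD.sub(r"{\\b \1}", text)
    return _ITALIC.sub(r"{\\i \1}", text)
```

Because the bold pass runs first, a properly nested combination such as `*_both_*` ends up as an italic group inside a bold group, while unmatched or doubled markers are left unconverted.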

5.3 Added hyperlinks

5.3.1 Contents List lines

Unlike AscToHTM, AscToRTF leaves any detected contents list intact and unchanged.

5.3.2 Cross-references

AscToRTF can convert cross-references to other sections into hyperlinks to those sections. Unfortunately this is currently only possible for second, third, fourth... level numeric headings (n.n, n.n.n, n.n.n.n etc.).

This is because the error rate becomes too high on single numbers/letters or roman numerals. This may be refined in future releases, although it's hard to see how that would work.
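To see why multi-level numbers are the safer target, consider the Python sketch below. It matches only references with at least one dot and then keeps those that correspond to a known heading. The pattern and the `find_section_refs` helper are assumptions for illustration.

```python
import re

# Two or more dot-separated numbers (4.3, 7.1.6, ...), not embedded in a word.
_SECTION_REF = re.compile(r"(?<![\w.])\d+(?:\.\d+)+(?!\w)")

def find_section_refs(text, known_sections):
    """Return the n.n style references in text that name a known section."""
    return [m.group(0) for m in _SECTION_REF.finditer(text)
            if m.group(0) in known_sections]
```

A bare "5" never matches, and something like a version number "1.0" is matched but then discarded unless a section 1.0 actually exists.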

5.3.3 URLs

AscToRTF can convert any URLs in the document to hyperlinks. This includes http and ftp URLs and any web addresses beginning with www.

Unlike AscToHTM, AscToRTF will only convert full URLs (i.e. those where a site name is supplied) to hyperlinks. If a URL like "\home\index.html" is detected it is left unconverted. This is because it is less likely that the relationship between source and target can be relied on.
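For readers unfamiliar with how a hyperlink looks inside an RTF file, the Python sketch below builds a standard RTF HYPERLINK field for a full URL and declines anything without a site name. The regular expression and the `rtf_hyperlink` helper are assumptions; the exact markup AscToRTF emits may differ, and escaping of RTF special characters is omitted.

```python
import re

# "Full" here means an http, https or ftp URL, or a bare www. address.
_FULL_URL = re.compile(r"^(?:(?:https?|ftp)://|www\.)[^\s/]+\.[^\s]+$", re.IGNORECASE)

def rtf_hyperlink(url, text=None):
    """Return an RTF HYPERLINK field for a full URL, or None if it is not full."""
    if not _FULL_URL.match(url):
        return None
    # Assumption: bare www. addresses are treated as http links.
    target = url if "://" in url else "http://" + url
    return '{\\field{\\*\\fldinst HYPERLINK "%s"}{\\fldrslt %s}}' % (target, text or url)
```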

5.3.4 Usenet Newsgroups

AscToRTF can convert any newsgroup names it spots into hyperlinks to those newsgroups. Because this is prone to error, AscToRTF currently only converts newsgroups in known USENET hierarchies such as rec.gardens by default.

This can be overcome in any of the following ways:

placing "news:" in front of the newsgroup name (e.g. news:this.is.a.newsgroup.honest)

relaxing this condition via a document policy (see the policy "Only use known groups")

specifying the newsgroup hierarchy as recognised via a policy "Recognised USENET groups".
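The interplay of these three options can be sketched in Python as follows. The hierarchy list is an assumed sample, not the list built into AscToRTF, and the parameters `only_known` and `extra_hierarchies` are hypothetical stand-ins for the two policies named above.

```python
import re

# Assumed sample of well-known top-level hierarchies.
KNOWN_HIERARCHIES = {"comp", "misc", "news", "rec", "sci", "soc", "talk", "alt"}

# An optional "news:" prefix, then two or more dot-separated lower-case parts.
_GROUP = re.compile(
    r"(?<![\w.@/])(news:)?([a-z][a-z0-9+-]*(?:\.[a-z0-9+-]+)+)(?![\w@])"
)

def find_newsgroups(text, only_known=True, extra_hierarchies=()):
    """Return the newsgroup names in text that pass the current restrictions."""
    allowed = KNOWN_HIERARCHIES | set(extra_hierarchies)
    found = []
    for m in _GROUP.finditer(text):
        explicit, name = m.group(1), m.group(2)
        if explicit or not only_known or name.split(".")[0] in allowed:
            found.append(name)
    return found
```

With the defaults, a file name such as example.txt is rejected because "example" is not a known hierarchy, which is the kind of false match the restriction is there to prevent.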

5.3.5 E-mail addresses

AscToRTF can convert any email addresses into hypertext mailto: links.

5.3.6 User-specified keywords

AscToRTF can convert user-specified keywords into hyperlinks. The word or phrase to be converted must lie on a single line in the source document. Care should be taken to ensure keywords are unambiguous. Normally I mark my keywords in [] brackets if authoring for conversion by AscToRTF.

See the discussions on "link dictionaries" in 4.3.2.2 and 4.4.2.

5.4 Section headings

AscToRTF recognises various types of headings. Where headings are found, and deemed to be consistent with the prevailing document policy (correct indentation, right type, in numerical sequence etc), AscToRTF will use the standard "Heading n" styles.

In addition to this, AscToRTF will insert a named bookmark to allow hyperlink jumps to this point. These bookmarks are used for example in any cross-reference hyperlinks that AscToRTF generates.

5.4.1 Numbered headings

This is the preferred heading type and the type that AscToRTF has most success with. Sections of type N.N.N can be checked for consistency, and references to them can be spotted and converted into hyperlinks.

At present more exotic numbering schemes using roman numerals and letters of the alphabet are not fully supported. This is planned to be implemented soon, possibly via user policy files.

5.4.2 Capitalised headings

AscToRTF can treat wholly capitalised lines as headings. It also allows for such headings to be spread over more than one line.

5.4.3 Underlined headings

AscToRTF can recognise underlined text (e.g. a row of minus signs), and optionally promote the preceding line to be a section header.

The "underlining" line should have no gaps in it, and should be a similar length to the preceding heading. If these conditions aren't met you'll probably get a horizontal rule instead.
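The two conditions can be expressed as a small decision function. The Python sketch below is illustrative only: the set of underline characters and the `tolerance` of three characters are assumptions, not values taken from AscToRTF.

```python
def classify_underline(prev_line, line, tolerance=3):
    """Return 'heading', 'rule' or None for a candidate underline line."""
    mark = line.strip()
    if len(mark) < 3 or not set(mark) <= set("-=_ "):
        return None                      # not a drawn line at all
    title = prev_line.strip()
    no_gaps = " " not in mark
    similar = bool(title) and abs(len(title) - len(mark)) <= tolerance
    # Both conditions must hold for a heading; otherwise fall back to a rule.
    return "heading" if no_gaps and similar else "rule"
```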

5.4.4 Numbered paragraphs

Some types of documents use what look like section numbers to number paragraphs (e.g. legal documents, or sets of rules).

AscToRTF can recognise this, and mark up such lines by placing the number in bold, and not using the "Heading n" style on the whole line.

5.4.5 Mail and USENET headers

Some documents, especially those that were originally email or USENET posts, come with header lines, usually in the form of a number of lines with a keyword followed by a colon and then some value.

AscToRTF can recognise these (to a limited extent). Where these are detected the program will parse the header lines to extract the Subject, Author and Date of the article concerned. A heading containing this information will then be generated to replace all the unsightly header lines.
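A comparable extraction can be done with Python's standard `email` package, as sketched below. The `summarise_headers` name and the " | " separator are assumptions for the example; the layout of the heading AscToRTF generates is not specified here.

```python
from email import message_from_string

def summarise_headers(raw_text):
    """Return (heading, body): Subject, From and Date joined, plus the remaining text."""
    msg = message_from_string(raw_text)
    parts = [msg.get("Subject"), msg.get("From"), msg.get("Date")]
    heading = " | ".join(p for p in parts if p)
    return heading, msg.get_payload()
```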

5.5 Pre-formatted text

5.5.1 Lines and form feeds

Lines are interpreted in context. If they appear to be underlining text, or part of some pre-formatted structure such as a table, then they are treated as such. Otherwise they become horizontal rules.

Form feeds or page breaks also become horizontal rules.

5.5.2 User defined pre-formatted text

AscToRTF allows users to define their own regions of pre-formatted text, using the BEGIN_PRE and END_PRE pre-processor tags (see Using the preprocessor).

Such areas are marked up in the "Preformatted" style (see "the use of RTF stylesheets"), which uses a non-proportional font to preserve the relative spacing.

For example :-

      The use of BEGIN_PRE and END_PRE preprocessor commands (see 7.1.6) in
      the text documents tells AscToRTF that this portion of the document
      has been formatted by the user and should be left unchanged.
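The effect of these tags on processing can be sketched in Python as a simple state switch. The `$_$_` prefix is assumed here by analogy with the `$_$_TITLE` command mentioned in 5.6.1, and placing each tag on a line of its own is likewise an assumption; see the preprocessor documentation for the exact syntax.

```python
def split_pre_regions(lines, begin="$_$_BEGIN_PRE", end="$_$_END_PRE"):
    """Yield (is_preformatted, line) pairs, dropping the marker lines themselves."""
    in_pre = False
    for line in lines:
        tag = line.strip()
        if tag == begin:
            in_pre = True
        elif tag == end:
            in_pre = False
        else:
            yield in_pre, line
```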

5.5.3 Automatically detected pre-formatted text

AscToRTF attempts to spot sections of preformatted text. This can vary from a single line (e.g. a line with a page number on the right-hand margin) to a complete table of data.

Where such text is detected AscToRTF analyses the section to determine what type of pre-formatted text it is. Options include

Tables

Code samples

Ascii Art and diagrams

some other formatted text

You can adjust the sensitivity of AscToRTF to pre-formatted text by setting the minimum number of lines required for a pre-formatted region using the "Minimum automatic <PRE> size" policy.

5.5.3.1 Tables

Tables are marked out by their use of white space, and by a regular pattern of gaps or vertical bars being spotted on each line. AscToRTF will attempt to spot the table, its columns, its headings, its cell alignment and entries that span multiple columns or rows.
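The "regular pattern of gaps" idea can be illustrated by looking for character positions that are blank on every row. The Python sketch below covers only that one step; the real analysis of headings, alignment and spanning cells is far more involved, and `find_column_gaps` is a hypothetical name.

```python
def find_column_gaps(lines):
    """Return (start, end) character ranges that are blank on every non-empty line."""
    rows = [l.rstrip().expandtabs(8) for l in lines if l.strip()]
    if not rows:
        return []
    width = max(len(r) for r in rows)
    # A column is blank if every row is either too short to reach it or has a space there.
    blank = [all(c >= len(r) or r[c] == " " for r in rows) for c in range(width)]
    gaps, start = [], None
    for c, is_blank in enumerate(blank):
        if is_blank and start is None:
            start = c
        elif not is_blank and start is not None:
            gaps.append((start, c))
            start = None
    return [g for g in gaps if g[0] > 0]   # ignore the left margin
```

Each returned range is a candidate column boundary; a block whose rows share several such gaps is more likely to be a table than ordinary text.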

Should AscToRTF wrongly detect the extent of a table, you can mark up a section of text by using the BEGIN_TABLE ... END_TABLE pre-processor commands (see 7.1.2). Alternatively you can try adding blank lines before and after, as the analysis uses white space to delimit tables.

You can alter the characteristics of all tables via the table policies (see 6.3.7).

You can alter the characteristics of all or individual tables via the table pre-processor commands (see 7.4).

Or you can suppress the whole thing altogether via the "Attempt TABLE generation" policy.

Tables will be marked up using the "Table" style. See "the use of RTF stylesheets".

5.5.3.2 Code samples

AscToRTF attempts to recognise code fragments in technical documents. The code is assumed to be "C++" or "Java"-like, and key indicators are, for example, the presence of ";" characters on the end of lines.
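A heuristic in this spirit might look like the Python sketch below. Only the trailing ";" indicator comes from the description above; treating trailing braces as a further indicator, the 50% threshold and the `looks_like_code` name are assumptions.

```python
def looks_like_code(lines, threshold=0.5):
    """Guess whether a block is C++/Java-like from how its lines end."""
    body = [l.rstrip() for l in lines if l.strip()]
    if not body:
        return False
    # Count lines ending in ';' (from the text) or a brace (assumed extra indicator).
    hits = sum(1 for l in body if l.endswith((";", "{", "}")))
    return hits / len(body) >= threshold
```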

Should AscToRTF wrongly detect the extent of a code fragment, you can mark up a section of text by using the BEGIN_CODE ... END_CODE pre-processor commands (see 7.1.4).

Or you can suppress the whole thing altogether via the policy "Expect code samples".

Code samples will be marked up using the "Code" style. See "the use of RTF stylesheets".

5.5.3.3 Ascii art and diagrams

AscToRTF attempts to recognise Ascii art and diagrams in documents. Key indicators include large numbers of non-alphanumeric characters and the use of white space.

However, some diagrams use the same mix of line and alphabetic characters as tables, so the two sometimes get confused.

Should AscToRTF wrongly detect the extent or type of a diagram, you can mark up a section of text by using the BEGIN_DIAGRAM ... END_DIAGRAM pre-processor commands (see 7.1.5).

Diagrams are marked up using the "Diagram" style. See "the use of RTF stylesheets".

5.5.3.4 Other formatted text

If AscToRTF detects formatted text, but decides that it is neither table, code nor art (and it knows what it likes), then the text may be put out "as normal", but with the original line structure preserved.

In such regions other markup (such as bullets) may not be processed as it would be elsewhere.

5.6 Added value markup

5.6.1 Document Title

AscToRTF can calculate - or be told - the title of a document. This will be placed in the document properties section in the header of each RTF file produced.

The title is calculated in the order shown below. Once one step returns a value, the subsequent ones are ignored.

If a $_$_TITLE pre-processor command (see 7.2.1) is placed in the source text, that value is used

If the "Use first header as title" policy is set then the first heading (if any) encountered is used as the title.

Note:
Depending on your document structure, this is prone to give bland titles like "Introduction", "Overview" and "Summary".

If the "Use first line as title" policy is set then the first line in the file is used as the title.

If the "Document title" policy is set then this value is used.

Note:
If this is the value you want, ensure the other policies outlined above are disabled.

Finally, if none of the above result in a title the text "Converted from <filename>" is used.
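The order of precedence above can be written out as a chain of checks. In the Python sketch below the function and parameter names are hypothetical; each argument stands in for the corresponding pre-processor command or policy.

```python
def choose_title(filename, title_command=None, first_heading=None, first_line=None,
                 use_first_header=False, use_first_line=False, document_title=None):
    """Return the document title using the order of precedence described above."""
    if title_command:                        # $_$_TITLE command in the source
        return title_command
    if use_first_header and first_heading:   # "Use first header as title"
        return first_heading
    if use_first_line and first_line:        # "Use first line as title"
        return first_line
    if document_title:                       # "Document title"
        return document_title
    return "Converted from " + filename
```

This also shows why the "Document title" policy is ignored while either of the earlier policies is enabled and finds something to use.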

5.6.2 Contents lists

AscToRTF can detect the presence of a contents list in the original document, or it can insert a field code that will generate a contents list from the headings that it observes. This field can be recalculated in Word by pressing F9.

There are a number of policies that give you control over how and where a contents list is generated (see 6.3.4).

In addition you can request that a contents list field is inserted by using the "Add contents list" policy.

Contents lists placement

By default the contents list will be placed at the top of the output file. You can cause contents lists to be placed wherever you want by using the CONTENTS_LIST preprocessor command (see 7.3.2).

5.7 The use of RTF stylesheets

AscToRTF supports the use of stylesheets, that is, the marking up of text in particular styles. AscToRTF uses this to identify how the text was analysed; thus headings acquire a "Heading n" style, and bulleted lists are marked up in the "Bullet" style.

Initially most of these styles are the same, but if you use a word processor that supports RTF stylesheets (such as Word), you'll be able to globally change attributes like font face and colour. For example you could turn all code samples green by changing the attributes of the "Code" style.

Styles are implemented in a hierarchy, with style attributes being inherited from their parents. Later versions of AscToRTF may allow style attributes to be selected before conversion.

The style hierarchy is as follows

      Normal                            (generic normal text style)
        |
        +-- 1 Body                      (main body text)
        |       |
        |       +--- 11 ShortLine       (short lines)
        |       +--- 12 Bullet          (bullets and numbered lists)
        |       +--- 13 Quoted          ("quoted" text as found in emails)
        |       +--- 14 Hanging         (hanging paragraphs)
        |       +--- 15 Definition      (definitions)
        |
        +-- 2 Table                     (Table text)
        +-- 3 Preform                   (preformatted text)
        +-- 4 Diagram                   (diagrams)
        +-- 5 Code                      (code samples)
        |
        +-- 6 Heading                   (generic heading style)
        |       |
        |       +--- 61 Heading1        (level 1 headings)
        |       +--- 62 Heading2        (level 2 headings)
        |       +--- 63 Heading3        (level 3 headings)
        |       +--- 64 Heading4        (level 4 headings)
        |       +--- 65 Heading5        (level 5 headings)
        |
        +-- 7 TOC                       (generic TOC style)
        |       |
        |       +--- 71 TOC1            (level 1 TOC entry)
        |       +--- 72 TOC2            (level 2 TOC entry)
        |       +--- 73 TOC3            (level 3 TOC entry)
        |       +--- 74 TOC4            (level 4 TOC entry)
        |       +--- 75 TOC5            (level 5 TOC entry)
        o
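Attribute inheritance in a hierarchy like this can be sketched in Python as a walk from the root down to the requested style. The parent links below are copied from part of the tree above; the attribute names and the `resolve_style` helper are illustrative assumptions, with values loosely based on the default implementations described in this section.

```python
# Parent links for a subset of the hierarchy; attribute values are illustrative.
PARENT = {
    "Normal": None,
    "Body": "Normal", "Preform": "Normal",
    "ShortLine": "Body", "Bullet": "Body", "Quoted": "Body",
}
OWN = {
    "Body": {"font": "user-supplied", "align": "justified"},
    "Quoted": {"italic": True, "align": "left"},
    "Preform": {"font": "Courier", "size_pt": 8},
}

def resolve_style(name):
    """Merge attributes from the root down, so a child overrides its parents."""
    chain = []
    while name is not None:
        chain.append(name)
        name = PARENT[name]
    attrs = {}
    for style in reversed(chain):
        attrs.update(OWN.get(style, {}))
    return attrs
```

Changing an attribute on Body in this model changes every style beneath it that does not set that attribute itself, which is the behaviour a word processor gives you when you edit a parent style.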

The default implementations of these styles are as follows:-

Body

Uses the user-supplied font. Creates
justified text by default.

ShortLine

Same as Body, but with a \par at the end
of each line to preserve the original line structure.
These paragraphs have zero spacing before
and after, to closely mimic the original text
file structure.

Bullet

Styling is the same as Body, but the bullet
itself is output using a hanging indent with a
tab after the bullet.

Quoted

Text is placed in italics, and left justified.
Each line is given a \par to preserve the original
line structure.

Hanging

The text is divided into two parts. The first
is placed on the left, and the "hanging" part is
placed on the right, after a tab. The position
of the tab stop is calculated according to
the size of the text to be placed on the left.
Often text that AscToHTM would put in a table comes
out as a hanging list.

Definition

Much like Hanging. The definition term is on left,
the rest is hung on the right after a tab. Options
exist to allow the definition term to be made bold.

Table

The text is styled as in Body, but is placed into
cells in a table. Table analysis is complex, and
deserves a document in its own right, but
in essence the text is placed in cells and
aligned according to original placement and data
type. The whole process can sometimes go wrong.

Preformatted

Preformatted text is output in a non-proportional font
(usually Courier) with no spacing between lines and
a \par on each line to preserve the line structure.
A font size of 8pt is used as this best represents
80 characters across a page without wrapping.

Diagram
Same as Preformatted.

Code
Same as Preformatted.

Heading

Heading itself is unused, but acts as a common parent
for the actual styles "Heading 1", "Heading 2" etc.
These are set to be the same as the Microsoft Word
equivalents.

TOC

The table of contents style TOC itself is unused,
but acts as a common parent for the actual styles
"TOC 1", "TOC 2" etc. These are set to be the same
as the Microsoft Word equivalents.
