Documentation for the Detagger html to text converter and markup removal utility
The latest version of these files is available online at http://www.jafsoft.com/doco/docindex.html
Here is the change history of Detagger
Version 2.4 (June 2005)
- Changes made in 2.4
- Policies added in 2.4
Version 2.3.2 (September 2004)
- Changes in version 2.3.2
- New policies in version 2.3.2
- Bugs fixed in version 2.3.2
Version 2.3 (April 2004)
- Changes in Version 2.3
- New policy options in Version 2.3
Version 2.2 (May 2003)
- Changes in Version 2.2
- New features in Version 2.2
Version 2.1 (March 2003)
- Changes in Version 2.1
- New features in Version 2.1
Version 2.0 (December 2002)
- Changes in Version 2.0
- New features in Version 2.0
Version 1.0 (August 2002)
Version 2.4 contains a small number of minor bug fixes and policy changes.
Bugs fixed :-
Version 2.3.2 contains a number of bug fixes and minor enhancements over version 2.3. It also contains enhanced support for handling Unicode, especially UTF-16.
Here are the policies added in version 2.3.2 :-
- Fixed a number of bugs when processing tables with missing <TR>, </TR> and <TD> tags. These could cause the table to not be processed or, in extreme cases, to be omitted from the output.
- Fixed bug where lines starting with a quotation mark didn't get correctly set to the target page width.
- Fixed bug whereby a nested table was indented when converted to delimited data.
- Fixed a number of bugs in which the presence of tags in embedded JavaScript (such as <style>, <body>, <title> etc.) confused the file parsing and occasionally left unwanted JavaScript in the output.
- Files containing Unicode entities were not getting the Unicode file marker added when converted to text. This prevented some files (e.g. Arabic) from being displayed properly.
- A bug meant that occasionally the last line of a file wouldn't be read properly. This might have caused some conversion issues but was actually spotted when the last line in a policy file was ignored.
- Documents with a missing or misplaced </head> tag produced an empty file when converted to text.
- When removing hyperlinks, the first link after a NAME anchor tag wasn't being removed.
- When converting multiple files (as opposed to wildcards) the program would sometimes get confused when calculating the output filename.
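The Unicode file marker mentioned in the list above can be illustrated with a short Python sketch. This is not Detagger's code: it assumes the marker is the standard UTF-8 byte order mark, and the function names `write_text` and `has_marker` are hypothetical, chosen only for this example.

```python
import codecs
import os
import tempfile


def write_text(path, text, add_marker=True):
    """Write text as UTF-8, optionally prefixed with the byte order mark."""
    # "utf-8-sig" makes Python emit the three-byte marker EF BB BF first.
    encoding = "utf-8-sig" if add_marker else "utf-8"
    with open(path, "w", encoding=encoding, newline="") as handle:
        handle.write(text)


def has_marker(path):
    """Return True if the file starts with the UTF-8 byte order mark."""
    with open(path, "rb") as handle:
        return handle.read(3) == codecs.BOM_UTF8


if __name__ == "__main__":
    # Write a short Arabic sample and confirm the marker is present.
    target = os.path.join(tempfile.mkdtemp(), "arabic.txt")
    write_text(target, "مرحبا\n")
    print(has_marker(target))
```

Without the marker, some applications guess a legacy code page for the file, which is one plausible reason non-Latin text would display incorrectly.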
Here are the policies added in version 2.3 :-
For example SEC filings use the HTML-like <TEXT> tag to markup plain text. By using a Text Command to change this into a <PRE> tag, the HTML converter is then tricked into leaving the format of the text in this section alone.
See Using a Text Commands File
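The substitution described above can be sketched in Python. This is only an illustration of the idea, not the syntax of a Detagger Text Commands file; the helper name `text_to_pre` and the sample input are assumptions made for the example.

```python
import re

# Matches <TEXT>, </TEXT> and variants with attributes, in any letter case.
# The \b stops the pattern from matching unrelated tags such as <TEXTAREA>.
_TEXT_TAG = re.compile(r"<(/?)TEXT\b[^>]*>", re.IGNORECASE)


def text_to_pre(html):
    """Rewrite opening and closing <TEXT> tags as <PRE> tags."""
    return _TEXT_TAG.sub(lambda match: "<" + match.group(1) + "PRE>", html)


if __name__ == "__main__":
    sample = (
        "<DOCUMENT>\n"
        "<TEXT>\n"
        "Revenue    1,200\n"
        "Costs        800\n"
        "</TEXT>\n"
        "</DOCUMENT>"
    )
    print(text_to_pre(sample))
```

Because the content ends up inside <PRE>, an HTML-to-text converter that honours preformatted sections keeps the column alignment instead of reflowing the lines.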
Alternatively, if you prefer, the tips can be read in sequence by using the next/last buttons, and the screen can be brought up from an option on the Settings menu.
If anyone has suggestions as to topics they would like tips on, please feel free to send them to info<at>jafsoft.com.
Several new policy options were added in version 2.3.
General
External configuration files
Conversion to text
Markup removal
Version 2.2 contains a small number of improvements and enhancements over version 2.1.
On the Settings menu a new option allows you to Remember settings on exit. If selected, the current file, output directory, policy file and conversion options are remembered and used as the starting values the next time you run the program.
A number of new options have been added to allow you to remove certain types of tags and attributes from inside tables only.
A new "Tables" option has been added under "markup manipulation" on the Conversion options menu. This takes you to the Detag Tables options tab which has the following options
Version 2.1 contains a small number of improvements and enhancements over version 2.0.
Version 2.0 contains a number of changes suggested by users of Detagger, as well as a number of bug fixes and code enhancements.
Several new features have been added to Detagger since version 1.0.
Markup removal
Tag removal options
Remove emphasis tags
Removes all the bold and italic markup from the HTML.
Remove style sheet
Removes all the <STYLE> sections from the HTML, together with any reference to an external CSS style sheet.
Remove HTML <IMG> tags
Removes all the <IMG> tags from an HTML document.
HTML-to-Text conversion
Paragraph formatting
Output each paragraph on a single line
Each paragraph is output without hard line breaks (except at the end). This can be useful, depending on how and where the resulting text is to be used.
Miscellaneous text formatting
May add Unicode marker to output file
When Unicode is detected in the source, the software outputs the text as UTF-8 and can optionally add a file marker that labels the file as "Unicode" in a way that most Unicode-aware applications will recognize.
Hyperlinks handling
Display link URLs
An option to display hyperlink URLs immediately after the display text in the output.
Replace <IMG> tags by a text marker
Option to place a marker in the text to show where an image has been removed.
Use the ALT attribute to replace <IMG> tags
Option to use the ALT attribute of an <IMG> tag in its text marker. This can help give some sense of what was being shown on the original page.
Tables conversion
Convert table to plain text
Convert table to comma-delimited data
Convert table to tab-delimited data
These options (which are mutually exclusive) determine how any <TABLE>s in the HTML should be output in the text. The options allow the table to be converted to plain text, or to delimited text better suited for loading into a spreadsheet such as Excel.
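To make the delimited-data options more concrete, here is a minimal Python sketch of turning a simple HTML table into comma- or tab-delimited text. It is an illustration only and not how Detagger is implemented: the class `TableToRows` and function `table_to_delimited` are hypothetical names, nested tables are not handled, and only a missing <TR> or </TD> is tolerated.

```python
import csv
import io
from html.parser import HTMLParser


class TableToRows(HTMLParser):
    """Collect the cell text of a simple, non-nested HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()
            self._row = []
        elif tag in ("td", "th"):
            self._end_cell()       # tolerate a missing </TD>
            if self._row is None:  # tolerate a missing <TR>
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _end_cell(self):
        if self._cell is not None:
            # Collapse runs of whitespace inside the cell to single spaces.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def table_to_delimited(html, delimiter=","):
    """Return the table rows as delimited text, one line per row."""
    parser = TableToRows()
    parser.feed(html)
    parser.close()
    parser._end_row()  # flush a row left open by a truncated table
    out = io.StringIO()
    csv.writer(out, delimiter=delimiter, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


if __name__ == "__main__":
    sample = (
        "<TABLE><TR><TD>Name</TD><TD>Qty</TD></TR>"
        "<TR><TD>Widget, large</TD><TD>3</TD></TR></TABLE>"
    )
    print(table_to_delimited(sample), end="")
    print(table_to_delimited(sample, delimiter="\t"), end="")
```

The `csv` module quotes a cell only when it contains the delimiter, which is why comma-delimited output needs quoting for some cells while tab-delimited output of the same table may not.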
The initial release.
Converted from a single text file by AscToHTM © 1997-2005 John A Fotheringham