Documentation for the Detagger HTML-to-text converter and markup removal utility
The latest version of these files is available online at http://www.jafsoft.com/doco/docindex.html
The software is available as both a Windows program and a console program. The console version can be run from the command line, and is better suited for use in command files and batch conversions.
Contents of this section
Running as a Windows application
Main Dialog
Menu Bar
File menu
Conversion options menu
Selecting the Text Commands File
Selecting the Text Fragments File
Settings menu
Language menu
View menu
Help menu
Update menu
Status window
Console version
The /CONCAT command line qualifier
The /CONSOLE command line qualifier
The /DETAG command line qualifier
The /HELP command line qualifier
The /LOG command line qualifier
The /OUTPUT command line qualifier
The /OVERWRITE command line qualifier
The /POLICY command line qualifier
The /SILENT command line qualifier
The /SUBFOLDERS command line qualifier
The /TREE command line qualifier
Running from the 'SendTo' menu
Working with Unicode
What is Unicode?
Unicode Byte Order Marks (BOMs)
Auto-detecting Unicode input
Creating Unicode output
Controlling Unicode handling through use of policies
Detagger can be invoked as a normal Windows application. On start-up you will be presented with the main window. This consists of a menu bar across the top of the window, and some data entry fields in the main body of the window.
To convert your files, take the following steps :-
Normally you simply need to select the input file(s) using the Browse button, and the rest of the fields will be set to default values. If you want to use wildcards, type the file specification directly into the file(s) box.
If you check the Search Sub-folders option, the program will look for matching files in sub-directories.
The program supports a number of conversion types. You should select the one that you want to perform.
Convert to text | The output will be a plain text version of the input file. You can fine tune the conversion to text by using the text conversion policies.
Selectively remove markup | The output will be HTML, with some of the tags selectively removed. You can control which by using the Markup removal policies.
If you're removing markup (as opposed to creating text files), then you will typically be creating HTML files from HTML files. This means you run the risk of asking the program to overwrite the input file with the results. If this is what you want, you will need to check the May overwrite existing files option.
When you select your input, the output will default to being a file in the same directory. However, there are a number of options available to you.
You can select the output type. The default is to file(s), but you can also output to the clipboard. When converting wildcards the clipboard option is only useful if you have a clipboard manager in place, otherwise only the last file will be held there.
When converting to file, you can select the output filename. This is not a sensible choice when using wildcards to select the input file(s). If you don't change this field the output file will match the input filename, but may have a different extension.
When converting to file, you can select the output directory. If you are converting wildcard files and including sub-directories, this option will put all the files in the one directory. There is no option at present to output to a parallel directory structure.
The program supports a number of output types that determine where the output should go. You should select the one that you want to perform.
Output text to file
The output will be a file. Depending on the type of conversion performed, the file will either be a HTML file or a plain text file.
Output to the clipboard
You can use the Conversion Type to select the option of placing the generated output onto the Windows clipboard, ready for use in other Windows applications.
Using Detagger in this way can be a very powerful technique which allows you to merge converted text with more traditionally authored content.
This approach becomes even more powerful if you use a Clipboard extender like ClipMate (see www.thornsoft.com) to remember and organise everything copied to the clipboard. You could convert a few files, and then use ClipMate to recall the pasted text at your leisure for insertion into your other files.
Concatenate results into one file
When you select a conversion type of Concatenate results into one file, all your results will be added together in one big results file.
Converting to text
When you're converting to text, the output will be one big text file, with the results from converting each input file added one after another.
Removing markup
When selectively removing markup the output will be one big HTML file. This file will inherit the <HEAD> and <BODY> tags (including any <TITLE>) from the first input file. All other <HEAD> tags will be discarded. Because of this, many of the result's properties (e.g. style sheets and <META> tags) will be whatever was in the first file.
The validity of the resulting HTML file will depend entirely on how well the markup of the multiple files goes together. It's a classic case of garbage in, garbage out.
File separators
In the output file a separator can be added between the results of one input file and the next. These are defined using the text fragment feature.
When creating a text file, the separator fragment is called TEXT_SEPARATOR, and when creating a HTML file it's HTML_SEPARATOR.
In the registered version both separators are absent by default unless you choose to define these fragments. In the 30-day trial, both separators contain short messages.
When outputting to file, there are a number of options for choosing where the output files will go.
The default is to output the files to the same directory as the input files. When the conversion type is set to selectively remove markup this has the potential to overwrite the original files. For this reason you have to select the 'May overwrite the input files' option.
Alternatively you can choose to output files to a different directory. In this case the program will overwrite any files already there because these are not the input files.
Finally, if you have selected the 'Search sub-folders' option, you can elect to replicate the input directory structure under the output directory, rather than have all the files found placed in just the one directory.
The main menu bar appears at the top of the main screen. It has the following options:-
File | File options
Conversion Options | Options that affect the conversion
Settings | Edit the program's settings
Language | Select the language you'd like the program's user interface to be in
View | View the created HTML files or the messages for the last conversion
Help | Various help files and on-line resources
Update | Check for more recent updates
The file menu offers the following options:-
Convert | This will prompt you for a file to convert and will then convert the selected file(s).
Load policy file | Loads settings previously saved to a policy file. See policy files.
Save policies to file | Saves the program settings to a policy file. See policy files.
Exit | Exit the program.
The Conversion Options menu allows you to alter the parameters of the conversions that you do. These options can be saved in policy files for later use. The options available include :-
This option allows access to dialogs that allow the options ("policies") that control markup removal to be adjusted
Detag policies
Detag Tables policies
Tag manipulation policies
This option allows access to dialogs that allow the options ("policies") that control the text conversion process to be adjusted
The Config File Location menu allows you to specify the location of additional configuration files. The locations you select will be stored in your policy file, so in a sense these files act as extensions of the policy file, but by being stored in separate files the same configuration files can be shared by multiple policy files.
The options on this menu allow you to locate the following :-
This option allows you to select the Text Commands File you wish to use.
This option allows you to select the Text Fragments File you wish to use.
Detagger has many program options known as "policies" to help you tailor the conversion process. These policies can be saved in a policy file for re-use in future conversions. This dialog screen is primarily intended to allow you to load a previously saved policy file.
For a fuller description see the section on policy files.
Options on this screen include: -
Load policies from an existing policy file
Save policies to a policy file for later re-use
Reset all policies to their default values
This window is displayed whenever you wish to save your policies to a file, usually for use in later conversions.
To save the file, simply select the policy file name, usually with a .pol extension.
This window contains a radio button with two options:
Save only those policies that have changed
If this option is selected, then only those policies that have been loaded from an existing file and/or been edited during the current session will be saved.
This is the recommended option, as it leaves out every policy that the program has already set to a suitable value automatically.
Save all policies
If this option is selected, then all policies are written to the file. This is a good way of documenting the policies used, but the result is usually too restrictive to be loaded as input into conversions of other files.
The saved file is a text file designed so that it may be manually edited and reloaded. If you do so, take care not to change the key phrases at the start of each line.
Note: If you find that conversions that used to work "stop working" it's possibly because you're using a complete policy file. If you find this happens, try creating a new policy file from scratch, or manually removing options from your existing policy file.
This option will reset all conversion options ("policies") to their default values. If a policy file has been loaded, it will be unloaded.
The program settings menu allows you to customise the way Detagger executes each time it is invoked.
This menu has the following options: -
Diagnostic Settings | Set message filters and alter the error reporting level to control the number and type of messages generated during conversions
Drag and drop settings | Set the program's properties when invoked by dragging files onto the icon on the desktop
Results viewers settings | Specify the viewers to be used for viewing results files, and their method of invocation
Use of policy file settings | Specify any default policy file to be used.
In addition to the above sub-menus, this menu allows you to toggle the following options, indicated by tick marks.
Show Tool Tips | If checked, tool tips will be available to offer help on the controls on each dialog screen.
Show Status Dialog | If checked, the Status Window will show during the conversion, showing messages describing how the conversion is going.
Automatically view results | If checked, a file viewer will be launched after the conversion to view the results. This will either be a HTML browser or a text editor depending on the type of conversion being done. See results viewers settings.
Remember settings on exit | If checked, the program will remember the selected files and conversion details for next time.
Tip of the Day | If selected, the 'Tip of the Day' screen is shown and you can choose whether or not this should also be displayed on startup.
These options allow you to set the level of error reporting, or to suppress messages of various types from being displayed during conversion.
The types of messages include :-
INFO messages | Informational messages. These convey information telling you what has been done and why.
WARNING messages | Warning messages. These tell you that something you have requested has not been done, or something has been done which may not be correct. It's possible you may be able to take corrective action.
TAG ERROR messages | Tagging errors. These only occur when you use the preprocessor in-line tags and directives.
PROGRAM ERROR messages | Program errors. The program has detected it has done something wrong. The conversion may still be successful, but there is nothing you can do about such messages except report them to the program's author at info<at>jafsoft.com
These options specify the behaviour of Detagger when invoked via drag and drop (i.e. by dropping a file icon on the program's icon).
Show the status screen | The status dialog, showing messages reporting how the conversion is going, should be shown.
View results in browser once complete | The selected viewer (browser) for the results files should be invoked on the last file converted once conversion is complete.
Start program after conversion | The program should be launched in Windows mode once the conversion is completed.
This identifies the viewers to be used whenever Detagger launches an application to view a results or documentation file. Viewers may be required for both HTML (when detagging) and TEXT (when converting to text) files.
Automatically view results files
You can elect to have results viewed automatically after each conversion. This will normally result in the named application being launched to view the last file converted.
Command used to view HTML files
For HTML, you can elect to use Dynamic Data Exchange (DDE) to have the results displayed in a currently active browser. This can be quicker and more efficient than launching a new instance of the browser each time. You should ensure your DDE browser matches the program named as the default browser so that, if it is not already active, the program can start a fresh instance.
When DDE is used the results will vary from browser to browser. IE for example will come to the front, whereas Netscape will not, and if it is minimised you won't see the results until you maximise the browser again.
NOTE: On some systems problems can occur with DDE that will cause the program to hang whenever it attempts to display a HTML file. When this happens the program will need to be stopped via the task manager. The next time the program runs it will detect that this problem has occurred and disable the use of DDE.
Add "file://localhost/" prefix
For HTML files viewed from your local hard drive the prefix "file://localhost/" should be used in place of the "http://" used for Internet access.
Unfortunately some browsers (take a bow IE 3.0) do not support this, so the addition of this prefix may be disabled if you're using such a browser.
Command used to view TEXT files
For TEXT files, DDE is not currently available, so you simply provide the command to view TEXT files (usually just a text editor or NotePad).
Using a default policy file
This determines which policy file, if any, is to be used by default when the program is first invoked. The actual policy file used can, of course, be changed via the policy dialogue.
The default policy file will also be used if the program is invoked via drag'n'drop. This avoids the need for creating batch files with the policy file name on the command line.
Always reload policy file during conversion
This specifies that the current policy file should be reloaded every time the conversion is done. If the file is large, and you are repeatedly converting using the same policy file, then this can slow you down. On the other hand if you are editing the policy file by hand outside the program between conversions then you will want this option enabled.
The "Tip of the Day" screen is shown by default each time you start up the program. This behaviour can be disabled by clearing the checkbox on the screen.
The tip shown will change each time the screen is displayed, and in addition you can review all the tips available by using the buttons marked "<<" and ">>" to go to the previous and next tips. The number of each tip is shown in case you should want to revisit it at a later date.
The Tip of the Day screen can be shown at any time by selecting the option on the Settings menu.
At present all tips are only available in English.
It is possible to change the user interface to the language of your choice. Translations are provided by a number of volunteers who help by translating the menu, dialog, and ToolTips text. The message and documentation text remains in English for the time being. As such these don't offer a full translation, but will hopefully be of some use to those whose first language isn't English.
At any given time you may still find English text, especially in the messages displayed and in the help and documentation files, but it is hoped that the efforts of these volunteers will make the program easier to use for non-English speakers.
Supported languages
At present work is under way on :-
Spanish | Gonzalo San Martin is undertaking the Spanish translation. Gonzalo operates a highly popular Real Madrid fan page (in Spanish and English) which you can visit at http://members.bigfoot.com/~G.SanMartin/ and can be contacted at G.SanMartin<at>bigfoot.com
Italian | The Italian translation is being undertaken by Gianluigi Pizzuto who can be contacted at gibly<at>libero.it and has a web page at http://web.tiscalinet.it/fotone
Swedish | The Swedish translation is being undertaken by Dan Svarreby who can be contacted at dan.svarreby<at>home.se
French | The French translation is being undertaken by Andre Martinez.
Russian | The Russian translation is being undertaken by Alexander (aka J-34) at j34<at>mail.ru
Dutch | The Dutch translation is being undertaken by Jurrien Dokter, who can be contacted at info<at>axswebsolutions.nl and runs the web site at http://www.axswebsolutions.nl/
If you would like to volunteer to help with this effort, please email info<at>jafsoft.com (replace "<at>" by "@") or visit the web page at
http://www.jafsoft.com/products/translations.html
The View menu contains the following options :-
Messages from last conversion | View the messages generated in the last conversion by bringing back the Status window
Results of last conversion | View the last file converted in your preferred browser
Once you've converted a file, you can view the results in the browser of your choice. Detagger will detect the default browser used on your system. If you wish, you can change this through the Settings menu.
You can view results in the selected browser by selecting the option on the View menu or by pressing the View results button on the main screen.
Detagger can also be configured to automatically view results when run from the command line or in drag'n'drop operation.
The help menu has the following options:-
Contents | Brings up the contents page of this help file. Help can be brought up anywhere in the program by pressing F1.
Register (online) | This option will take you to the registration page, or - if you have already registered - to the updates page.
HTML doco (offline) | Brings up the local copy of the HTML documentation in your preferred browser.
HTML doco (online) | Brings up the Internet copy of the HTML documentation in your preferred browser.
Other products | Links to web pages for JafSoft and their various software products.
About | Shows the program version and other details. Includes buttons to take you to the home page etc. on the web.
The update menu has the following option :-
Check for newer versions | This option will take you to the web site, where a check will be made to tell you if this is still the latest version of the software.
The status window is displayed whenever a conversion is in progress. It displays messages showing how the conversion is progressing. You can prevent it from appearing during conversions by clearing the Show Status Dialog option on the Settings menu. You can also bring up this window afterwards by selecting the "messages from last conversion" option on the View menu.
The messages displayed are usually just informational messages telling you what Detagger is doing. You should review these messages and check they don't indicate an error in conversion.
Once conversion is complete you can dismiss the window. You can automate this by ticking the "dismiss on completion" box.
Should you wish to, you can use the save to file button to save the displayed messages to a file. This can be useful for reviewing messages, extracting URLs reported by the software (if showing URLs is enabled), or for sending details when requesting support.
In addition to the Windows version of Detagger, there is a console version. This can be invoked from the command line, and is thus well suited to use in batch and automated conversions.
The console version is free to users who register the Windows version. A trial copy of the console version can be obtained by visiting
The console version is used from the command line. Most of these command options are also supported by the Windows version, but the console version is better suited to batch operation.
The console version is called h2acons. You can see a list of the commands by using the command
c:> h2acons /help
This gives
Usage : h2acons filespec1 filespec2 [policy_file.pol] [/qualifiers]
Recognised qualifiers include
/CONCAT | Concatenate the results into one file
/CONSOLE | Write output direct to console
/DETAG | Selectively remove HTML markup
/HELP | Display this useful list of commands ("/?" also works)
/LOG=filename | Generate a .log file
/OUTPUT=filespec | Filespec for output file(s)
/OVERWRITE | May overwrite input files with the output
/POLICY=filename | Document policies used in a .pol file
/SILENT | Suppress all output messages (except these :-)
/SUBFOLDERS | Process files that match the filespec in sub-folders as well
/TREE | Place output files in parallel folder structure to input files
Qualifiers are case insensitive and may be reduced to shortest unique name (e.g. "/lo" for "/log")
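The "shortest unique name" rule can be illustrated with a small Python sketch. This is not Detagger's own code, only one plausible way such matching could work: the function name resolve_qualifier is hypothetical, the qualifier list is copied from the help text above, and the "/?" alias for /HELP is not handled.

```python
# Qualifier names as listed in the h2acons help text.
QUALIFIERS = ["CONCAT", "CONSOLE", "DETAG", "HELP", "LOG", "OUTPUT",
              "OVERWRITE", "POLICY", "SILENT", "SUBFOLDERS", "TREE"]


def resolve_qualifier(text):
    """Return the full qualifier name for an abbreviation such as '/lo'.

    Hypothetical helper: matches case-insensitively on a unique prefix.
    """
    # Drop the leading slash and any '=value' part, then ignore case.
    name = text.lstrip("/").split("=", 1)[0].upper()
    if not name:
        raise ValueError("empty qualifier")
    matches = [q for q in QUALIFIERS if q.startswith(name)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError("unknown qualifier: " + text)
    raise ValueError("ambiguous qualifier " + text + ": " + ", ".join(matches))
```

Under these assumptions "/lo" resolves to LOG, whereas "/con" would be rejected as ambiguous because it is a prefix of both CONCAT and CONSOLE.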
Most of the configuration options are passed using a "policy file". This is most easily created by running the Windows version, selecting the options you want and then saving those to a policy file.
The policy file itself is just a text file, with one policy per line (hard break). If you look at the list of policies in the documentation you can edit this by hand, but usually it's just simpler to use the Windows version.
When present, the /CONCAT qualifier states that all the results should be output to a single file. This only makes a difference if you've supplied multiple filespecs on the command line, or used a wildcard.
When present, the /CONSOLE qualifier specifies that the output should be written to the console window. This might be useful in piping operations.
If you use this, you will usually want to also use the /SILENT qualifier so that messages are not mixed in with the output.
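As a rough illustration of the piping use described here, the following Python sketch runs the console version with /CONSOLE and /SILENT and captures what it writes. It assumes that h2acons is on the PATH and that "the console" means standard output; neither is confirmed by this manual. The function names are hypothetical, and the result is returned as raw bytes because the output encoding depends on the conversion.

```python
import subprocess


def build_command(html_file, exe="h2acons"):
    """Build the argument list for a console conversion to text."""
    # /CONSOLE writes the converted text to the console (assumed: stdout);
    # /SILENT suppresses the messages that would otherwise be shown too.
    return [exe, html_file, "/console", "/silent"]


def convert_to_text(html_file, exe="h2acons"):
    """Run the conversion and return whatever was captured, as bytes."""
    result = subprocess.run(build_command(html_file, exe),
                            capture_output=True)
    return result.stdout
```

Keeping the command construction separate from the call makes it easy to check the arguments without having the program installed.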
When present, the /DETAG qualifier specifies that Detagger should selectively remove HTML markup and create a HTML output file. The default behaviour otherwise is to convert the file to text.
If you want to specify which removal options should apply you'll need to create a policy file and add that to the command line.
The /HELP qualifier displays the list of supported qualifiers.
When present, the /LOG qualifier specifies that Detagger will create a .log file listing all the actions it takes and any messages created.
When present, the /OUTPUT qualifier tells Detagger where the output should be placed. If omitted, the default is to output the results in the same folder as the source file, with an extension (.txt or .html) appropriate to the type of conversion being attempted.
Examples :-
c:> h2acons input.html /out="c:\my files\output.txt"
File is output to "c:\my files\output.txt". Because there is a space in the directory path the filename needs to be in quotes
c:> h2acons in*.html /out=c:\output\
All the files in*.html will be converted and placed in the directory "c:\output\"
c:> h2acons in*.html /concat/detag/out=c:\output\bigfile.html
In this case the /concat/detag means that Detagger will selectively remove markup and concatenate the results in the single file "c:\output\bigfile.html"
When the /DETAG qualifier is specified then by default the output file will be a HTML file in the same directory as the source file. In this case Detagger could end up replacing the original file by the output file. That is only allowed if the /OVERWRITE qualifier is present. If it isn't, an error message is generated.
An alternative to using the /OVERWRITE qualifier is to use the /OUTPUT qualifier to direct the output to a different folder, or to a different name in the same folder.
When present, the /POLICY qualifier tells Detagger to create a .pol policy file listing all the policies used in the conversion and their values. You should not normally want to do this unless you want to create a policy file to edit, or want to check that your policies are being used.
To pass in a set of policies, just list the policy file on the command line. It must have a .pol extension. For example the command
c:> h2acons in*.html input.pol /policy=output.pol
will read the policies in "input.pol", use those in the conversion, and then create a file "output.pol" listing the policies used, which will be a mixture of default values and those loaded from "input.pol".
When the /SILENT qualifier is present, all the messages usually displayed in the console window are suppressed.
You'd want to use this if you were using the /CONSOLE qualifier.
When the /SUBFOLDERS qualifier is present, the software will search the sub-folders of the input directory looking for other files that match the input filespec.
See also The /TREE command line qualifier
When the /TREE qualifier is present, the software will place output files in a directory structure that matches the input structure. This will only apply when using the /SUBFOLDERS and /OUTPUT options as well. So for example the command
c:> h2acons c:\input\a*.html /output=d:\new\ /subfolders/tree
would look for all files a*.html in the folder c:\input\ and its sub-folders. The output files will be placed in d:\new\ and sub-folders of that, so for example c:\input\sub\answer.html would be converted to d:\new\sub\answer.txt. If it doesn't already exist, the sub-folder d:\new\sub\ will be created.
See also The /SUBFOLDERS command line qualifier
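The path mapping performed by /TREE can be sketched in Python as follows. This is only an illustration of the rule described above, not the program's implementation; the function name tree_output_path and its parameters are hypothetical, and the .txt extension assumes a conversion to text.

```python
from pathlib import PureWindowsPath


def tree_output_path(input_root, output_root, source_file, new_ext=".txt"):
    """Map a source file under input_root to the parallel location
    under output_root, swapping the extension for the output type."""
    # Path of the source file relative to the input folder, e.g. sub\answer.html
    relative = PureWindowsPath(source_file).relative_to(PureWindowsPath(input_root))
    # Re-root it under the output folder and change the extension.
    return PureWindowsPath(output_root) / relative.with_suffix(new_ext)


print(tree_output_path(r"c:\input", r"d:\new", r"c:\input\sub\answer.html"))
```

With the folders from the example command, c:\input\sub\answer.html maps to d:\new\sub\answer.txt. PureWindowsPath is used so that the sketch handles Windows-style paths on any platform without touching the file system.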
Detagger can make a useful addition to your "Send to" menu (available when you right-click on a file in explorer).
To add Detagger to this menu, simply add a shortcut to your Send To shortcuts directory. Under Windows 9x this is
/Windows/SendTo
Under Windows XP this is
/Documents and Settings/<Your_User_Name>/SendTo
If you want to use a standard policy file (e.g. with a particular colour scheme), then change the properties of the shortcut so that the command is
Detagger %1 standard.pol
Detagger was not originally designed with Unicode in mind. Support for Unicode text has been added gradually over time, so earlier versions of Detagger may not support all the features described in this manual. If in doubt, please contact JafSoft for details.
Traditional single-byte character sets interpret the 8-bit character values (128-255) as special characters. On a Russian machine these would be interpreted as Cyrillic, but on a different machine they could be read (wrongly) as Arabic (and vice versa). On most English-based PCs, the 8-bit characters are used for the accented characters found in certain European languages, so a Russian text would appear to have lots of accented 'i's, 'e's and 'a's.
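The effect is easy to reproduce. The short Python sketch below encodes a Russian word using the Windows Cyrillic code page and then reads the same bytes back using the Western European one; the choice of cp1251 and cp1252 is an assumption made for illustration, as the manual does not name specific code pages.

```python
text = "Привет"                       # Russian for "hello"

raw = text.encode("cp1251")           # one byte per letter, all in 128-255
as_cyrillic = raw.decode("cp1251")    # what a Russian machine would show
as_western = raw.decode("cp1252")     # what an English-based PC would show

print(raw.hex(" "))
print(as_cyrillic)
print(as_western)
```

The bytes are identical in both cases; only the interpretation differs, which is why the second result is a run of accented Latin letters rather than Cyrillic.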
Unicode is a way of implementing text that supports multiple types of character sets at the same time so that - for example - it is possible to display Chinese and Cyrillic on the same page unambiguously. It does this by allocating each character in each language a unique code value, so that codes used for Cyrillic characters no longer overlap and conflict with those assigned to Arabic.
However, these code values are in most cases larger than can be represented in a single byte. As a result a way has to be chosen to represent each character by one or more bytes.
The following Unicode representations are commonly used :-
UTF-8
Each character is represented by 1 to 4 bytes, depending on which range the Unicode code value falls into (most characters in everyday use need 1, 2 or 3). This has the advantage that all ASCII characters are a single byte, so for example the HTML tags in a document take one byte per character. This also means there are no null bytes contained in the text, which can make programming software to work with this text easier.
UTF-16
Each character is represented by a 2-byte pair (characters outside the Basic Multilingual Plane require two such pairs). For most characters the 2-byte pair is just the numerical representation of the Unicode value of the character. This makes the files easier to interpret, but also means that the byte order depends on how the machine stores its bytes - i.e. is the machine big-endian or little-endian. Because ASCII characters have a Unicode value less than 128, they map onto byte pairs in which one of the bytes is null. Because each character requires two bytes, a single byte wrongly inserted into a UTF-16 stream will render all the text that follows as gibberish.
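The difference between the two representations can be seen by encoding a few sample characters in Python. This is a small illustrative sketch; the helper name byte_lengths and the sample characters are arbitrary choices. The last sample lies outside the Basic Multilingual Plane, so it needs four bytes in UTF-8 and two 2-byte pairs in UTF-16.

```python
def byte_lengths(ch):
    """Return (UTF-8 length, UTF-16 length) in bytes for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in ["A", "é", "Ж", "€", "😀"]:
    utf8_len, utf16_len = byte_lengths(ch)
    print(f"U+{ord(ch):04X}: {utf8_len} byte(s) in UTF-8, "
          f"{utf16_len} byte(s) in UTF-16")

# The same ASCII letter in the two UTF-16 byte orders: one byte is null,
# and its position depends on the endianness.
print("A".encode("utf-16-le"), "A".encode("utf-16-be"))
```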
Files that contain Unicode often identify themselves by inserting a "Byte Order Mark" (BOM) at the top of the file. This is a two-byte marker for UTF-16 files and a three-byte marker for UTF-8 files. Modern applications will test for this byte marker and, if it is present, will then know how to interpret the contents of the file. For example Notepad as supplied with Windows XP can do this, whereas Notepad as supplied with Windows 98 could not.
In UTF-16 each character is represented by two bytes, and computers can store a two-byte value in different ways (known as "big-endian" and "little-endian"). Each operating system uses one method or another and it isn't usually an issue, but when Unicode files get passed from one machine to another, this becomes important. The BOM allows the two forms of UTF-16 (known as "UTF-16BE" and "UTF-16LE") to be distinguished.
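A minimal sketch of BOM testing in Python is shown below. It illustrates the general technique only and is not a description of how Detagger itself detects Unicode; the function names and the cp1252 fallback are assumptions, and UTF-32 (whose little-endian BOM begins with the same two bytes as UTF-16LE) is deliberately ignored.

```python
import codecs

# Known markers and the encoding each one identifies.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data):
    """Return (encoding name, BOM length) or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data, fallback="cp1252"):
    """Decode bytes using the BOM if present, else a fallback code page."""
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or fallback)
```

For example, decode_with_bom(open("page.html", "rb").read()) would decode a file according to its marker; the file name here is purely illustrative.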
The software has some ability to auto-detect Unicode text, and will generally do so under the following circumstances
The software will create Unicode output whenever it detects that the input files were Unicode, or whenever Unicode characters have been detected in the HTML entities of the original.
At present all Unicode output files will be UTF-8.
The following policies can be used to control the handling of Unicode during the conversion :-
By default the software will attempt to auto-detect whether or not the input is Unicode, but if this fails you can explicitly tell the software the encoding using this policy.
May add Unicode marker to output file
When Unicode is detected in the source the software will output the text as UTF-8 and optionally add a file marker that will label the file as "Unicode" in a way that most applications that can cope with Unicode will recognize.
Allow ANSI alternatives (e.g. space for )
Certain common HTML entities don't have a single ANSI character but have common ASCII representations. If you enable this policy you tell the software to use ASCII/ANSI alternatives where possible, thereby reducing the chance of Unicode being necessary for the output file.
Converted from a single text file by AscToHTM © 1997-2005 John A Fotheringham