Import Content from HTML or MS Word documents

Figure 1. Import HTML or MS Word interface.

The Import content (HTML or MS Word) option allows you to define a set of rules for each Project Topic that will attempt to match existing content in either HTML or MS Word documents and then import it into the Fact Sheet Fusion project. Each topic can have its own set of rules that define what to look for as well as what to exclude. The HTML import option also supports importing images and their captions.

Important Tip: Defining sets of rules that work well across lots of existing content can be tricky, especially if you haven't done this type of operation before. It is highly recommended that you try out your import rules on a new database and project before attempting to import into an existing project that already contains content. This will allow you to easily check your results. It is also worth checking the imported topic's code tab in the editor to ensure you are removing any undesirable formatting tags or unclosed tags. Another way to check the import is to export the entities and view the resulting fact sheets. This will help determine whether your rules captured everything you need from the source documents.
It is also highly recommended that you first make a backup of an existing database and media store prior to importing.

Importing Content

Step 1.

The first step in undertaking an import is to define the Entities you wish to import content for. These Entities must already exist within your project; new Entities cannot be added via the Import dialog. Entities can be filtered via the Filter by Subsets option located at the top left of the Import dialog. Entities can also be excluded from the import simply by leaving their check box in the Include column unselected. More information on this is given below in Step 3.

Step 2.

Select the document format you are importing from, i.e. HTML or MS Word.

Step 3.

Use the document folder browse button to select the folder where the HTML or MS Word documents are located. Once a valid folder has been defined, the folder will be scanned for matching document types (.htm, .html or .doc, .docx). These will then be compared against your Entity list. Matching documents will be automatically mapped and their file names listed in the 'Matching Filename' column. Closely matched documents will be highlighted in light yellow, while less closely matched documents will be highlighted in a darker yellow. These highlighted documents should be examined to determine whether they are the correct document to import for the given Entity. If the matched document is incorrect, change the file name by typing the correct one in the 'Matching Filename' column, or copy the file name from Windows Explorer and paste it into the 'Matching Filename' column to reduce the chance of errors.

Example matched documents to Entities

Example of matched documents to entities. One closely matched document is highlighted to check.
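
The idea of scoring how closely a file name matches an Entity name can be illustrated with a short Python sketch. This is not the matching method used by Fact Sheet Fusion, which is not described here; it assumes a simple similarity ratio from Python's standard difflib module, and the function name best_match and the sample file names are hypothetical.

```python
from difflib import SequenceMatcher
from pathlib import Path


def best_match(entity, filenames):
    """Return (filename, score) for the file whose name is most similar to the entity.

    The score is a ratio between 0.0 (nothing in common) and 1.0 (identical),
    compared case-insensitively and ignoring the file extension.
    """
    def score(name):
        return SequenceMatcher(None, entity.lower(), Path(name).stem.lower()).ratio()

    best = max(filenames, key=score)
    return best, score(best)


if __name__ == "__main__":
    files = ["Acacia dealbata.html", "Acacia_mearnsii.html", "Banksia.html"]
    for entity in ["Acacia dealbata", "Acacia mearnsii"]:
        name, ratio = best_match(entity, files)
        print(f"{entity!r} -> {name!r} (similarity {ratio:.2f})")
```

In a sketch like this, a score of 1.0 corresponds to an exact name match, while anything lower is a candidate that should be checked by hand, much like the highlighted rows in the dialog.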

Once you have mapped the Entities to the documents you wish to import from, select these Entities using the check boxes in the 'Include' column. You can select or un-select all Entities for inclusion or exclusion via the two check box options shown in the screen capture below.

Select or un-select all options

Select and un-select all Entities for import.

Step 4.

In the middle of the Import dialog is the list of the current project's Topics. Each of these Topics can be used to define a set of rules to capture content from the Entity's matched document.

Example Topics and Rules panel.

Select each Topic in turn to define the desired rules. Any Topic without rules will not have any content imported against it. Once you have finished defining your rule set, click the Import button. During the import process Fact Sheet Fusion will show the progress of the import via a green progress bar that appears at the bottom of the dialog. Detailed information on defining import rules is provided below.

Final stage - Select the Import button.

Tip: If you are getting unexpected results from your rule set, check the Fact Sheet Fusion log file (via Help...About menu) as errors and warnings will be logged there.

Import Rules

Loading and Saving Rules

Defining rules for multiple Topics and images can take some time to get right, particularly when examining the documents for consistent content and tags to match against. You can save your rule set for later use by selecting the save icon located in the Topics Rules panel. This will save all Global and Topic rules to a file with the '.rules' extension. The file can be loaded at a later time via the Open button located to the left of the Save button. The loaded rules are matched against the current Topic list. Any rule contained within a rules file that doesn't match an existing Topic will be ignored.

Note: If a topic has been renamed since the rule set was saved, you will need to either redefine the rules for that topic or edit the rules file in your preferred text editor and update the topic label.

Look For Rules

Look for rules are the instructions given to the import algorithm to find content within the Entity's matched document. Multiple Look for rules can be defined, though each rule must be defined on its own line. Defining multiple Look for rules can be very handy when you want to capture content that may be inconsistent between documents. For example, across many documents information may use consistent headings that differ slightly due to pluralisation: some documents may contain a heading "Family:", while others contain "Families:".

Fragment content example:
<h4>Family: Proteaceae</h4>

A second fragment example showing a different identifier of interest, targeted for the same Topic:
<h4>Families: Myrtaceae, Mimosaceae and Rutaceae</h4>

Look for rules can be defined in a number of ways:

Note: The import algorithm uses an asterisk (*) as a wildcard character. A wildcard character is used to represent one or more characters when searching. The wildcard character is a reserved character when defining search rules.

Double wildcard search

Using the simple examples above dealing with capturing the family name(s), the following two rules would be defined:

<h4*Family:*</h4>
<h4*Families:*</h4>

Tip: The left string to find only defines part of the heading four tag. This is done because HTML tags may contain additional style definitions, e.g. <h4 style="...">. This ensures that any kind of heading four tag (plain or styled) is captured.

Only heading four tags that contain either 'Family:' or 'Families:' would be returned. However, since the Topic in this instance is already called 'Family', we don't need to retain the 'Family:' or 'Families:' component of the returned match. This is where we would use the exclusion rules to remove this text. Details of the exclusion rules are outlined further below in the Exclude Rules section.
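
To make the double wildcard idea more concrete, the following Python sketch applies the two rules above to a small HTML fragment. It is only an illustration of the matching idea, not the Fact Sheet Fusion import algorithm: the function name double_wildcard_find is hypothetical, and the sketch assumes that the whole span from the left string to the right string is returned and that the middle string must occur between them.

```python
def double_wildcard_find(rule, text):
    """Find spans matching a rule of the form left*middle*right.

    Assumption of this sketch: a span runs from an occurrence of `left` to the
    next occurrence of `right`, and is kept only if `middle` occurs in between.
    """
    left, middle, right = rule.split("*")
    matches = []
    pos = 0
    while True:
        start = text.find(left, pos)
        if start == -1:
            break
        end = text.find(right, start + len(left))
        if end == -1:
            break
        end += len(right)
        span = text[start:end]
        # Look for the middle string only between the left and right strings.
        if middle in span[len(left):len(span) - len(right)]:
            matches.append(span)
        pos = end
    return matches


if __name__ == "__main__":
    html = (
        "<h4>Genus: Banksia</h4>\n"
        '<h4 style="color:green">Family: Proteaceae</h4>\n'
        "<h4>Families: Myrtaceae, Mimosaceae and Rutaceae</h4>"
    )
    # One rule per line, as in the Look for rules box.
    for rule in ["<h4*Family:*</h4>", "<h4*Families:*</h4>"]:
        print(rule, "->", double_wildcard_find(rule, html))
```

Note that the styled heading is still found because the left string is only '<h4', and that the 'Genus:' heading is skipped by both rules.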

Single wildcard search

Using the single wildcard option will return everything matched between the strings to find, and will also include the strings to find in the returned matched text. For example, if we wanted to capture HTML tables and their content, we would need to search for the beginning and end table tags, but also retain them, so as not to break the HTML formatting.

E.g.

<table*</table>

Tip: Note that the left string to find only defines the start of the table tag. This is done because HTML tags can contain additional definitions such as <table border="1" ...>. This ensures that all beginning table tags are captured no matter how they are defined.

Additional Rule Options

When using the single wildcard search option it is not always desirable to keep the string to find. To remove it from the returned match use the token '[-]'. To remove the left string to find, place the token to its left; to remove the right string to find, place the token to its right. Adding the remove token to both the left and right strings to find is similar to the double wildcard, but without the additional ability to match on inner content.

As an example, if we wanted to match individual images within a block of images, the only reliable next tag to search on may be the next image. E.g.

<div>
<img src="../../pict.jpg" width="500" height="600"><img src="../../plant1.jpg" width="450" height="300"><img src="../seed3.jpg" width="250" height="300">
<img src="../../leaf223.jpg" width="100" height="300" border="1">
</div>

To capture these image tags we could do the following:

<img*<img[-]

In this example everything from the beginning of an image tag up until the next partial image tag (i.e. '<img') would be captured, though the '<img' part of the right hand string to find would be discarded from the matched results.

Note: The above example would also need an additional rule to capture the last image, as there is no following image tag for it to be matched on. See the Greedy Match option below.
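
The behaviour of the removal token can be sketched in a few lines of Python. This is a simplified illustration rather than the actual import algorithm: the function name find_with_right_removed is hypothetical, and the sketch assumes that the search for the next match resumes at the discarded right hand string.

```python
def find_with_right_removed(rule, text):
    """Apply a single wildcard rule of the form left*right[-] to text.

    Each match runs from an occurrence of `left` up to, but not including,
    the next occurrence of `right`.
    """
    if not rule.endswith("[-]"):
        raise ValueError("this sketch only handles rules ending in [-]")
    left, right = rule[:-3].split("*")
    matches = []
    start = text.find(left)
    while start != -1:
        end = text.find(right, start + len(left))
        if end == -1:
            break  # nothing left to close the match, e.g. the last image
        matches.append(text[start:end])
        # Assumption: resume at the discarded right string, so that it can
        # act as the left string of the next match.
        start = text.find(left, end)
    return matches


if __name__ == "__main__":
    html = (
        "<div>\n"
        '<img src="../../pict.jpg" width="500" height="600">'
        '<img src="../../plant1.jpg" width="450" height="300">'
        '<img src="../seed3.jpg" width="250" height="300">\n'
        '<img src="../../leaf223.jpg" width="100" height="300" border="1">\n'
        "</div>"
    )
    for match in find_with_right_removed("<img*<img[-]", html):
        print(repr(match))
```

Run against the example block, the sketch returns the first three images only, which is why a second rule is needed for the last one.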

Greedy Match option

The Greedy Match option can be added to the end of any Look for rule and can be used in conjunction with the string removal token. Consider the following example HTML content:

<div>
<img src="../../pict.jpg" width="500" height="600"><img src="../../plant1.jpg" width="450" height="300"><img src="../seed3.jpg" width="250" height="300">
<img src="../../leaf223.jpg" width="100" height="300" border="1">
</div>

If we wanted to capture the last image within this div block we don't necessarily have anything unique to match on. We can't define the last image tag (<img src="../../leaf223.jpg" width="100" height="300" border="1">) as the left string to find since the file name and size will change from file to file. We could use the start of an image tag (<img). E.g.

<img*</div>

However, this would return everything from the first image tag found to the end Div tag. E.g.

<img src="../../pict.jpg" width="500" height="600"><img src="../../plant1.jpg" width="450" height="300"><img src="../seed3.jpg" width="250" height="300">
<img src="../../leaf223.jpg" width="100" height="300" border="1">
</div>

Using the Greedy option tells the matching algorithm to keep matching until the last instance of the match is found. E.g.

<img*</div>[g]

Would return:

<img src="../../leaf223.jpg" width="100" height="300" border="1">
</div>

We also don't need the end Div tag (</div>) as this would add a "broken" HTML tag to our matched content. To strip this from the matched results we just need to add the removal token. E.g.

<img*</div>[g][-]

This would return:

<img src="../../leaf223.jpg" width="100" height="300" border="1">
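
The effect of the Greedy option, with and without the removal token, can be sketched in Python as follows. Again this is only an illustration under stated assumptions, not the real import algorithm: the function name greedy_find is hypothetical, and the sketch assumes the rule is matched against the first occurrence of the right string and the last occurrence of the left string before it.

```python
def greedy_find(rule, text):
    """Apply a rule of the form left*right[g] or left*right[g][-] to text.

    Returns the text from the LAST occurrence of `left` that precedes `right`,
    or None if no match is found. With [-] the right string is stripped.
    """
    strip_right = rule.endswith("[-]")
    if strip_right:
        rule = rule[:-3]
    if not rule.endswith("[g]"):
        raise ValueError("this sketch only handles greedy ([g]) rules")
    left, right = rule[:-3].split("*")

    end = text.find(right)
    if end == -1:
        return None
    # Search backwards from the right string for the closest left string.
    start = text.rfind(left, 0, end)
    if start == -1:
        return None
    return text[start:end] if strip_right else text[start:end + len(right)]


if __name__ == "__main__":
    html = (
        "<div>\n"
        '<img src="../../pict.jpg" width="500" height="600">'
        '<img src="../../plant1.jpg" width="450" height="300">'
        '<img src="../seed3.jpg" width="250" height="300">\n'
        '<img src="../../leaf223.jpg" width="100" height="300" border="1">\n'
        "</div>"
    )
    print(repr(greedy_find("<img*</div>[g]", html)))
    print(repr(greedy_find("<img*</div>[g][-]", html)))
```

In this sketch the line break between the last image tag and the end Div tag is kept as part of the match.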

Exclude Rules

Exclude rules use exactly the same rule types as the Look for rules; however, Exclude rules only work on the matched results of the Look for rules. Unlike Look for rules, Exclude rules can contain no wildcard characters. When no wildcard character is defined, the entire string block is searched for and, if found, removed from the matched results. Exclude rules allow you to remove undesired content such as words or HTML tags. Each Exclude rule must be defined on a separate line.

Some example Exclude rules:

Comment below here
Removes the string 'Comment below here'

<!--*-->
Removes all HTML comments

<br>
Removes all line break tags

<font*>
Removes all start font tags

</font>
Removes all end font tags

<b*>
Removes all start bold tags

</b>
Removes all end bold tags

<img*>
Removes all images

<div*>
Removes all start Div tags

</div>
Removes all end Div tags

Tip: If removing specific tags ensure you remove both the start and end tag.
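
As a rough illustration of how Exclude rules act on a matched result, the Python sketch below removes literal strings and wildcard patterns from a piece of matched text. It is not the Fact Sheet Fusion implementation: the function name apply_exclude_rules is hypothetical, and the sketch assumes that the wildcard may stand for any run of characters (including none) and matches as little text as possible.

```python
import re


def apply_exclude_rules(match, rules):
    """Remove every occurrence of each exclude rule from the matched text.

    Rules without a wildcard are removed as literal strings. Rules with an
    asterisk are translated into a non-greedy regular expression (an
    assumption of this sketch).
    """
    for rule in rules:
        if "*" in rule:
            pattern = ".*?".join(re.escape(part) for part in rule.split("*"))
            match = re.sub(pattern, "", match, flags=re.DOTALL)
        else:
            match = match.replace(rule, "")
    return match


if __name__ == "__main__":
    matched = '<h4 style="x">Family: <font color="red">Proteaceae</font></h4>'
    rules = ["<h4*>", "</h4>", "<font*>", "</font>", "Family:"]
    print(repr(apply_exclude_rules(matched, rules).strip()))
```

One consequence worth noting in this sketch is that a short rule such as <b*> would also remove <br> tags, so wildcard Exclude rules are best tested on a sample document first.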

Topic Rules

Example Topic rules.

Exclude check box: when selected, this Topic's rules are excluded from the import process. This is useful, for example, if you have saved a set of rules for processing multiple folders of content but don't want this Topic's rules to be processed in one or more instances.
Look for rules defined for the selected Topic.
Exclude rules applied to the matched content of the Look for rules.
Replace, if topic already exists check box: when selected, any text that already exists for that Entity and Topic combination is replaced, provided matching results are returned. If it is not selected and text already exists for the Entity and Topic combination, no matched text will be saved.

Global Rules

Example global rules.

Global rules are applied either for every topic or for matching images.

Clean HTML check box: when selected, any MS Word generated HTML formatting contained within the matched results of a rule is cleaned and removed.
Topics Exclude rules. Any Topic Exclude rules defined here will be applied to every Topic result match after any specific Topic Exclude rules have been processed. These global Topic Exclude rules save you from having to define the same set of rules for every Topic where you want to remove common elements.

Images

The Fact Sheet Fusion import algorithm can also capture images and their captions, saving them to your database media store and automatically attaching them to the Entity being imported against.

Note: Detection and retrieval of embedded images within MS Word documents is not currently supported.

Look for rules - The Image Look for rules are the same rule types as defined in the Import Rules outlined above. However in many instances images also have a corresponding caption. When matching on image tags any remaining content captured outside of the image tag is treated as the caption block. Consider the following example HTML that contains a table with images and their captions on the row below:

<table>
<tbody>
<tr>
<td><img src="../../pict.jpg" width="500" height="600"></td>
<tr>
<td><p>Example of the habit.</p></td>
</tr>
<tr>
<td><img src="../../plant1.jpg" width="450" height="300"></td>
</tr>
<tr>
<td><p>Example of the tree in flower.</p></td>
</tr>
<tr>
<td><img src="../seed3.jpg" width="250" height="300"></td>
<tr>
<td><p>Mature seed pod.</p></td>
</tr>
<td><img src="../../leaf223.jpg" width="100" height="300" border="1"></td>
<tr>
<td><p>Bipinnate leaves</p></td>
</tr>
</tbody>
</table>

We could use the following Image Look for rules:

<img*<img[-]
<img*</table>[g][-]

The first rule will return each image along with all content that follows it, except for the last image, as there is no further image tag to match on.
The second rule uses the Greedy matching option to find the image tag closest to the end table tag, which picks up the last image. The removal token discards the end table tag itself.

If we were to just define these two rules to find the desired images then we would be left with lots of broken inner table tags such as row tags and column tags. e.g.

First match:

<img src="../../pict.jpg" width="500" height="600"></td>
<tr>
<td><p>Example of the habit.</p></td>
</tr>
<tr>
<td>

Second match:

<img src="../../plant1.jpg" width="450" height="300"></td>
</tr>
<tr>
<td><p>Example of the tree in flower.</p></td>
</tr>
<tr>
<td>

Third match:

<img src="../seed3.jpg" width="250" height="300"></td>
<tr>
<td><p>Mature seed pod.</p></td>
</tr>
<td>

Final match:

<img src="../../leaf223.jpg" width="100" height="300" border="1"></td>
<tr>
<td><p>Bipinnate leaves</p></td>
</tr>
</tbody>

As you can see we need to use the Image Exclude rules to remove the remaining undesirable tags.

Following on from the Look for rules, these Exclude rules could be used:

<tr>
</tr>
<td>
</td>
</tbody>

Each of the above string blocks consisting of tags will be removed from each match. E.g.

First match:

<img src="../../pict.jpg" width="500" height="600">
<p>Example of the habit.</p>

Second match:

<img src="../../plant1.jpg" width="450" height="300">
<p>Example of the tree in flower.</p>

Third match:

<img src="../seed3.jpg" width="250" height="300">
<p>Mature seed pod.</p>

Final match:

<img src="../../leaf223.jpg" width="100" height="300" border="1">
<p>Bipinnate leaves</p>

Given these matches the import algorithm will take each of the image tags and find the corresponding image file and caption text, copy it and register it to the media store, then attach it as an Entity image.
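
The final step of separating a cleaned match into an image file reference and a caption could look something like the Python sketch below. This is a hedged illustration only: the function name split_image_and_caption is hypothetical, and the sketch assumes that the caption is reduced to plain text, which may differ from how Fact Sheet Fusion stores captions.

```python
import re
from html import unescape


def split_image_and_caption(match):
    """Split a cleaned match into (image source path, caption text).

    Returns None if the match contains no image tag with a src attribute.
    Everything outside the image tag is treated as the caption block.
    """
    img = re.search(r'<img\b[^>]*\bsrc="([^"]*)"[^>]*>', match, flags=re.IGNORECASE)
    if img is None:
        return None
    remainder = match[:img.start()] + match[img.end():]
    # Assumption: remaining tags such as <p> are dropped from the caption.
    caption = unescape(re.sub(r"<[^>]+>", "", remainder)).strip()
    return img.group(1), caption


if __name__ == "__main__":
    cleaned = (
        '<img src="../../pict.jpg" width="500" height="600">\n'
        "<p>Example of the habit.</p>"
    )
    print(split_image_and_caption(cleaned))
```

The returned path is relative to the source document, so it would still need to be resolved against the document folder before the image file could be copied.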

Skip, if Entity already has images option: when selected, matched images from the import will not be attached or stored if the Entity already has images associated with it.

Replace existing images option, if selected, will overwrite images that already exist in the database media store.