The Fuzzy Grouping transform allows grouping of records by looking at the similarity between the values of various columns.
Two records in which a possible misspelling occurs can be grouped together for further analysis, or duplicates can be removed by the setting Output Top Level Records Only. The sensitivity can be set adjusted by setting the Probability Threshold.
The Fuzzy Grouping transform requires 1 input transform that has at least 1 column.
The input could be a SQL Select transform, or the result of another transform. For example, the input data is:
2. Add the Transform
Steps to add the transform:
- Select the connector link.
- Select the transform from the menu.
- To Edit/Configue the transform, select the newly added transform, and click the Configure menu.
Steps to configure the Fuzzy Grouping transform:
- Select the columns to be included in the output.
- Drag and Drop the column(s) you want to be part of the grouping.
- Enter the Probability Threshold. Valid values are from .0001 to 1.0. Value of 1.0 will require input data to be an "exact match" for them to be grouped together.
- Select Ignore String Case if you want a non-case sensitive match.
- Select Output Top Level Records Only if you want to omit the duplicate records.
The figure below illustrates the output from the Fuzzy Grouping transform.
- With the Output Top Level Records Only option not selected:
- With the Output Top Level Records Only option selected: