You are learning Power Query in MS Excel
How to perform fuzzy matching for data comparison in Power Query?
Performing fuzzy matching in Power Query allows you to compare and match text values that are similar but not necessarily identical. This is useful for tasks like deduplication, data cleansing, and merging datasets with slightly varied entries. Power Query itself does not have built-in fuzzy matching functions like some specialized tools, but you can achieve fuzzy matching using a combination of techniques. Here’s a structured approach to implement fuzzy matching in Power Query:
Steps to Perform Fuzzy Matching in Power Query
1. Normalize Text Values
Before performing fuzzy matching, it’s beneficial to normalize text values to ensure consistency:
- Lowercase Conversion: Convert all text to lowercase to ignore case sensitivity.
```m
Table.TransformColumns(Source,{{"Column1", Text.Lower}, {"Column2", Text.Lower}})
```
- Trimming Whitespace: Remove leading and trailing spaces from text values.
```m
Table.TransformColumns(Source,{"Column1", Text.Trim})
```
2. Calculate Similarity Scores
Power Query lacks built-in fuzzy matching functions, but you can compute similarity scores using a combination of functions:
- Levenshtein Distance: Measures the number of single-character edits (insertions, deletions, substitutions) required to change one word into the other.
Here’s a custom function to calculate Levenshtein distance:
```m
// Levenshtein Distance Function
(text1 as text, text2 as text) =>
let
s = Table.FromList({text1, text2}, Splitter.SplitByNothing()),
d = Table.AddIndexColumn(s, "Index", 0, 1, Int64.Type),
e = Table.AddColumn(d, "FirstCharacter", each Text.Middle([Column1], [Index], 1), type text),
f = Table.AddColumn(e, "Cost", each if [FirstCharacter]{0} <> [FirstCharacter]{1} then 1 else 0, Int64.Type),
g = Table.AddColumn(f, "Substitution", each [Previous Cost]{1} + [Cost], Int64.Type),
h = Table.AddColumn(g, "Addition", each [Cost] + 1, Int64.Type),
i = Table.AddColumn(h, "Deletion", each [Previous Cost]{0} + 1, Int64.Type),
j = Table.AddColumn(i, "MinCost", each List.Min({[Substitution], [Addition], [Deletion]}), Int64.Type),
k = Table.AddColumn(j, "Previous Cost", each {[MinCost], [Cost], [Cost], [MinCost]}),
l = Table.AddColumn(k, "FinalCost", each {[MinCost]{3}}, Int64.Type),
FinalCost = l{[Index]{1}}[FinalCost]
in
FinalCost
```