Edit distance in Java: Edit distance with a scoring matrix

Author: Internet

Earlier, we defined edit distance as the minimal number of insertions, deletions, and substitutions required to transform one string into another. The metric can also be formulated differently: assign a cost to each operation and define edit distance as the minimal total cost of a sequence of operations that converts one string into the other.

For example, we may say that each of the described operations costs 1. In this case, there is no difference between the two formulations. But sometimes it is convenient to assign costs in another way.
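
As a sketch of the cost-based formulation, the usual dynamic-programming recurrence can be written with explicit operation costs. The symbols $D$, $c_{\mathrm{ins}}$, $c_{\mathrm{del}}$, and $c_{\mathrm{sub}}$ are notation assumed here for illustration and do not come from the original text; $D(i, j)$ stands for the minimal cost of turning the first $i$ symbols of $s$ into the first $j$ symbols of $t$.

```latex
\begin{aligned}
D(0, 0) &= 0, \\
D(i, 0) &= i \cdot c_{\mathrm{del}}, \qquad D(0, j) = j \cdot c_{\mathrm{ins}}, \\
D(i, j) &= \min \begin{cases}
  D(i-1, j) + c_{\mathrm{del}} \\
  D(i, j-1) + c_{\mathrm{ins}} \\
  D(i-1, j-1) + c_{\mathrm{sub}}(s_i, t_j)
\end{cases}
\end{aligned}
```

With $c_{\mathrm{ins}} = c_{\mathrm{del}} = 1$ and $c_{\mathrm{sub}}(a, b)$ equal to $0$ when $a = b$ and $1$ otherwise, this reduces to the count-based definition.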

Assume we are working on a spelling-correction system. The algorithm is as follows: we receive a user's request, find the most similar word in a database of correct words using the edit distance metric, and use that word in place of the original.

Suppose we get the request "flaq". In this case, at least two words are at edit distance 1 from the initial string: "flaw" and "flat". So which one should we use? Under unit costs there is no difference. But "flaw" is arguably a better match for "flaq", because the letters "q" and "w" are closer on a keyboard than "q" and "t", so it is more likely that the user meant to type "flaw" rather than "flat".
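
The tie can be checked with a minimal sketch of the unit-cost edit distance. The class name `UnitEditDistance` and the method `distance` are illustrative names, not part of the original task.

```java
final class UnitEditDistance {
    // Classic edit distance: every insertion, deletion, and substitution costs 1.
    static int distance(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        // Turning a prefix into the empty string takes one deletion per character.
        for (int i = 0; i <= s.length(); i++) {
            d[i][0] = i;
        }
        // Building a prefix from the empty string takes one insertion per character.
        for (int j = 0; j <= t.length(); j++) {
            d[0][j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int sub = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                int best = Math.min(d[i][j - 1] + 1, d[i - 1][j] + 1);
                d[i][j] = Math.min(best, d[i - 1][j - 1] + sub);
            }
        }
        return d[s.length()][t.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("flaq", "flaw")); // prints 1
        System.out.println(distance("flaq", "flat")); // prints 1
    }
}
```

Both candidates come out at distance 1, so unit costs alone cannot rank them.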

To process such cases correctly, one may use a so-called scoring matrix. A scoring matrix is a table $m$ where $m[s_1][s_2]$ is the cost of substituting a symbol $s_1$ with a symbol $s_2$. For example, to solve the previous problem, we can use a matrix that assigns lower costs to symbols that are close on a keyboard and higher costs to symbols that are far from each other.
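
One way such a keyboard-based cost could look is sketched below. This is a toy model under loud assumptions: only the top QWERTY letter row is modeled, the cost is the number of keys between two letters in that row, and any letter outside the row gets an arbitrary fallback cost. The names `KeyboardCost`, `TOP_ROW`, and `FAR` are hypothetical.

```java
final class KeyboardCost {
    // Assumption: only the top letter row of a QWERTY keyboard is modeled.
    private static final String TOP_ROW = "qwertyuiop";
    // Assumption: arbitrary fallback cost for letters outside the modeled row.
    private static final int FAR = TOP_ROW.length();

    // Cost of substituting a with b: 0 for equal letters, otherwise the
    // horizontal distance in keys between the two letters in the top row.
    static int substitutionCost(char a, char b) {
        if (a == b) {
            return 0;
        }
        int i = TOP_ROW.indexOf(a);
        int j = TOP_ROW.indexOf(b);
        if (i < 0 || j < 0) {
            return FAR;
        }
        return Math.abs(i - j);
    }

    public static void main(String[] args) {
        System.out.println(substitutionCost('q', 'w')); // prints 1
        System.out.println(substitutionCost('q', 't')); // prints 4
    }
}
```

Under this toy cost, replacing "q" with "w" is cheaper than replacing "q" with "t", so "flaw" would be preferred over "flat" for the request "flaq".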

So, your task here is to implement a simple system for correction of spelling mistakes. For convenience, we will use a shortened version of the alphabet.

Input: The first line contains a string $s$, the user's request. The second line contains an integer $k$, the size of the database. Each of the next $k$ lines contains one string, a correct word. Every string consists only of the letters a, s, d, b, n, m.

Output: The first line should contain the edit distance $d_E(s, t)$, where $t$ is the word with the minimal edit distance to $s$ among all words in the database. The second line should contain the word $t$ itself. If several words have the minimal edit distance, print the one that occurs first in the database.
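
The tie-breaking rule can be sketched separately from the distance computation: scan the distances in database order and replace the current best only on a strictly smaller value. The class `FirstMinimum` and the method `indexOfFirstMin` are illustrative names, and the input array is assumed to hold one precomputed distance per database word.

```java
final class FirstMinimum {
    // Returns the index of the smallest value; on ties the earliest index wins
    // because the comparison is strict.
    static int indexOfFirstMin(int[] distances) {
        if (distances.length == 0) {
            throw new IllegalArgumentException("database must not be empty");
        }
        int best = 0;
        for (int i = 1; i < distances.length; i++) {
            if (distances[i] < distances[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(indexOfFirstMin(new int[] {2, 1, 1})); // prints 1
        System.out.println(indexOfFirstMin(new int[] {2, 2}));    // prints 0
    }
}
```

Using `<=` instead of `<` would return the last minimal word instead of the first.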

Consider the cost of an insertion and of a deletion to be equal to $1$. To calculate the cost of a substitution, use the following scoring matrix:

  a s d b n m
a 0 1 2 5 6 7
s 1 0 1 5 6 7
d 2 1 0 5 6 7
b 5 6 7 0 1 2
n 5 6 7 1 0 1
m 5 6 7 2 1 0
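
A minimal sketch of how this table might be looked up in code follows, with rows read as the original symbol and columns as the replacement, which is an assumption consistent with the definition of $m[s_1][s_2]$. The class `ScoringMatrix` and its members are hypothetical names.

```java
final class ScoringMatrix {
    // Row and column order of the table.
    private static final String LETTERS = "asdbnm";

    // COST[row][col]: cost of substituting LETTERS[row] with LETTERS[col].
    private static final int[][] COST = {
        {0, 1, 2, 5, 6, 7},
        {1, 0, 1, 5, 6, 7},
        {2, 1, 0, 5, 6, 7},
        {5, 6, 7, 0, 1, 2},
        {5, 6, 7, 1, 0, 1},
        {5, 6, 7, 2, 1, 0}
    };

    static int substitutionCost(char from, char to) {
        int row = LETTERS.indexOf(from);
        int col = LETTERS.indexOf(to);
        if (row < 0 || col < 0) {
            throw new IllegalArgumentException("letter outside the alphabet asdbnm");
        }
        return COST[row][col];
    }

    public static void main(String[] args) {
        System.out.println(substitutionCost('a', 's')); // prints 1
        System.out.println(substitutionCost('s', 'b')); // prints 5
        System.out.println(substitutionCost('b', 's')); // prints 6
    }
}
```

Two things are worth noticing. The table is not symmetric: replacing "s" with "b" costs 5, while replacing "b" with "s" costs 6. Also, because a deletion followed by an insertion costs 1 + 1 = 2, any substitution priced above 2 is never the cheapest way to change one letter into another.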

Sample Input 1:

aad
3
mad
sad
bad

Sample Output 1:

1
sad
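
To see where this output comes from: each database word differs from "aad" only in its first letter, so that letter can either be substituted directly or be deleted and the new letter inserted (cost $1 + 1$). A sketch of the arithmetic, using the scoring matrix $m$ given above:

```latex
\begin{aligned}
d_E(\text{aad}, \text{mad}) &= \min\bigl(m[\text{a}][\text{m}],\ 1 + 1\bigr) = \min(7, 2) = 2 \\
d_E(\text{aad}, \text{sad}) &= \min\bigl(m[\text{a}][\text{s}],\ 1 + 1\bigr) = \min(1, 2) = 1 \\
d_E(\text{aad}, \text{bad}) &= \min\bigl(m[\text{a}][\text{b}],\ 1 + 1\bigr) = \min(5, 2) = 2
\end{aligned}
```

The smallest value is 1, reached only by "sad".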

Sample Input 2:

asa
3
ama
aba
ada

Sample Output 2:

1
ada

Solution:

import java.util.Scanner;

public class Main {
    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);

        // Read the request, the database size, and the database words.
        String s = scanner.next();
        int k = scanner.nextInt();
        String[] database = new String[k];

        for (int i = 0; i < k; i++) {
            database[i] = scanner.next();
        }

        // Row and column order of the scoring matrix.
        String letters = "asdbnm";

        // scoringMatrix[x][y] is the cost of substituting letters[x] with letters[y].
        int[][] scoringMatrix = {
            {0, 1, 2, 5, 6, 7},
            {1, 0, 1, 5, 6, 7},
            {2, 1, 0, 5, 6, 7},
            {5, 6, 7, 0, 1, 2},
            {5, 6, 7, 1, 0, 1},
            {5, 6, 7, 2, 1, 0}
        };

        int minDistance = Integer.MAX_VALUE;
        String result = null;

        for (String t : database) {
            // distances[i][j] is the edit distance between the first i
            // characters of s and the first j characters of t.
            int[][] distances = new int[s.length() + 1][t.length() + 1];
            // Base cases: i deletions or j insertions, each of cost 1.
            for (int i = 0; i < s.length() + 1; i++) {
                distances[i][0] = i;
            }
            for (int j = 0; j < t.length() + 1; j++) {
                distances[0][j] = j;
            }
            for (int i = 1; i < s.length() + 1; i++) {
                for (int j = 1; j < t.length() + 1; j++) {
                    int insCost = distances[i][j - 1] + 1;
                    int delCost = distances[i - 1][j] + 1;
                    // Substitution cost from the scoring matrix (0 for equal letters).
                    int match = scoringMatrix[letters.indexOf(s.charAt(i - 1))][letters.indexOf(t.charAt(j - 1))];
                    int subCost = distances[i - 1][j - 1] + match;
                    distances[i][j] = Math.min(Math.min(insCost, delCost), subCost);
                }
            }
            // The strict comparison keeps the first word among ties.
            if (distances[s.length()][t.length()] < minDistance) {
                minDistance = distances[s.length()][t.length()];
                result = t;
            }
        }

        System.out.println(minDistance);
        System.out.println(result);
    }
}

Source: https://www.cnblogs.com/longlong6296/p/13521288.html