Trouble understanding Paul Heckel's Diff Algorithm
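I have been looking at Paul Heckel's Diff Algorithm, but I don't seem to understand it fully.
I reproduced steps 1-5 as shown in the Python code below, but I can't get the final step of the algorithm to display the differences. I would be grateful if someone could explain the final step in terms of the Python code.
Also, I don't fully understand why you need to reference the table entries in steps 4 and 5, so an explanation of that would be great too!
Many thanks.
Here is my current code: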
def find_diff(current_file_as_list, different_file_as_list):
    N = current_file_as_list
    O = different_file_as_list
    table = {}
    OA = []
    NA = []
    for i in O:
        OA.append(i)
    for i in N:
        NA.append(i)

    # First pass
    i = 0
    for line in N:
        if not line in table:
            table[line] = {}
            table[line]["NC"] = 1
        else:
            if table[line]["NC"] == 1:
                table[line]["NC"] = 2
            else:
                table[line]["NC"] = "many"
        NA[i] = table[line]
        i += 1

    # second pass
    j = 0
    for line in O:
        if not line in table:
            table[line] = {}
            table[line]["OC"] = 1
        else:
            if not "OC" in table[line]:
                table[line]["OC"] = 1
            elif table[line]["OC"] == 1:
                table[line]["OC"] = 2
            else:
                table[line]["OC"] = "many"
        table[line]["OLNO"] = j  # Gets overwritten with multiple occurrences.
        # Check to see if this is the intended implementation.
        # Maybe only relevant for "OC" == "NC" == 1
        OA[j] = table[line]
        j += 1

    # third pass
    i = 0
    for i in range(0, len(NA)):
        # Check if they appear in both files
        if "OC" in NA[i] and "NC" in NA[i]:
            # Check if they appear exactly once
            if NA[i]["OC"] == NA[i]["NC"] == 1:
                olno = NA[i]["OLNO"]
                NA[i], OA[olno] = olno, i
        i += 1

    # fourth pass
    # ascending
    for i in range(0, len(NA)):
        for j in range(0 , len(OA)):
            if NA[i] == OA[j] and i + 1 < len(NA) and j + 1 < len(OA) and NA[i + 1] == OA[j + 1]:
                OA[j + 1] = table[O[i + 1]]
                NA[i + 1] = table[N[j + 1]]

    # fifth pass
    # descending
    for i in range(len(NA) - 1, 0, -1):
        for j in range(len(OA) - 1, 0, -1):
            if NA[i] == OA[j] and i - 1 > 0 and j - 1 > 0 and NA[i - 1] == OA[j - 1]:
                OA[j - 1] = table[O[i - 1]]
                NA[i - 1] = table[N[j - 1]]

    # final step implementation should go here but I'm not sure how to approach it but this is my current attempt (which I am certain is wrong):
    k = 0
    array = []
    for i in range(0, len(NA)):
        if isinstance(NA[i], int):
            array.append("= " + str(N[i]))
            k = NA[i] + 1
        elif isinstance(NA[i], dict):
            array.append("+ " + N[i])

        for j in range(k, len(OA)):
            k = j + 1
            print("j - " + str(j))
            if not isinstance(OA[j], int):
                array.append("- " + O[j])
            else:
                break
You can pass any two strings or lists of strings as input to the function, e.g. find_diff("hello", "hell").
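Answer:
I'm not sure where you found this explanation and code, but it contains several mistakes. One of the references on Wikipedia's data comparison page is a reference to Paul Heckel's paper, which proved the most helpful in understanding the algorithm.
First of all, as far as I can tell, your implementation of the last step is correct (assuming the previous steps were done correctly).
Let's start with a syntax/language issue: maybe I'm missing something, but I don't understand why you (and the code you linked to) increment the index i yourself in the third pass. The for loop over range already advances i on every iteration, so the extra i += 1 (and the i = 0 before the loop) has no effect.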
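A small demonstration of that point, as a sketch only (the function name `visited_indices` is invented for this example): a `for ... in range(...)` loop rebinds the loop variable at the start of every iteration, so reassigning it inside the body does not change which indices are visited.

```python
def visited_indices(n):
    """Return the indices a range-based loop visits despite an inner `i += 1`."""
    seen = []
    for i in range(n):
        seen.append(i)
        i += 1  # this rebinding is discarded when the next iteration starts
    return seen


print(visited_indices(4))  # [0, 1, 2, 3]
```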
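About the counters in the table entries: the linked code contains a commented question asking why we need the value 2 at all. The answer is that we don't! Heckel states explicitly in the paper that the only values a counter should have are 0, 1 and many. You can see that we never use or query a counter for the value 2. My guess is that this mistake comes from implementing the algorithm in a language more flexible than the ones Heckel had in mind when he wrote it, since asking whether a counter exists for a given table entry is the same as asking whether that counter's value is 0.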
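One way to keep the counters to those three states is to start both at 0 and let them saturate at "many". The following is a minimal sketch, not code from the paper: `MANY`, `bump` and `build_table` are names invented here, and the entry layout (`NC`, `OC`, `OLNO`) simply follows the question's code.

```python
MANY = "many"


def bump(count):
    """Advance a counter through the only states needed: 0 -> 1 -> many."""
    return 1 if count == 0 else MANY


def build_table(new_lines, old_lines):
    """Passes 1 and 2: count each line's occurrences in the new and old file."""
    table = {}
    for line in new_lines:
        entry = table.setdefault(line, {"NC": 0, "OC": 0, "OLNO": None})
        entry["NC"] = bump(entry["NC"])
    for j, line in enumerate(old_lines):
        entry = table.setdefault(line, {"NC": 0, "OC": 0, "OLNO": None})
        entry["OC"] = bump(entry["OC"])
        entry["OLNO"] = j  # only meaningful when OC ends up as 1
    return table


table = build_table(["a", "b", "a", "c"], ["a", "b", "d"])
print(table["a"])  # {'NC': 'many', 'OC': 1, 'OLNO': 0}
```

With explicit zeros, the third pass can test `entry["NC"] == 1 and entry["OC"] == 1` directly, without first checking whether the keys exist.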
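Lastly and most importantly, the fourth and fifth passes in this implementation are wrong. I believe the wording of these passes in the paper can be confusing, and whoever wrote the linked code got it wrong. Your second question already reveals it. The fourth pass goes over NA in ascending order, and for each position whose value points to a position in OA (which, in the implementation under discussion, means it is of type int), we check whether the values at the next positions in both arrays point to the same table entry. If they do, we replace those pointers with each other's positions, overwriting the pointers with ints. So your second question is on point: we don't use the table entry pointers here at all, we only compare them.
This way, the unique lines discovered in the third pass serve as anchors for finding the unchanged lines that come immediately after them and are part of their "block" but are not unique in the file. The same thing happens in the fifth pass, but backwards, so identical lines that precede an unchanged unique line are also classified as unchanged.
Here are the fourth and fifth passes as I described them: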
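# fourth pass
# ascending
for i in range(0, len(NA) - 1):
    # Extend only from a matched line (int) onto a neighbour that still holds a
    # table entry (dict); "is" checks that both slots refer to the same entry.
    if (isinstance(NA[i], int) and (NA[i] + 1) < len(OA)
            and isinstance(NA[i + 1], dict) and NA[i + 1] is OA[NA[i] + 1]):
        NA[i + 1] = NA[i] + 1
        OA[NA[i] + 1] = i + 1

# fifth pass
# descending
for i in range(len(NA) - 1, 0, -1):
    # Same check, but looking at the lines just before a matched line.
    if (isinstance(NA[i], int) and (NA[i] - 1) >= 0
            and isinstance(NA[i - 1], dict) and NA[i - 1] is OA[NA[i] - 1]):
        NA[i - 1] = NA[i] - 1
        OA[NA[i] - 1] = i - 1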
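To see how the passes fit together, here is a compact end-to-end sketch. It is an illustration under stated assumptions, not a reference implementation: the name `heckel_diff` is invented, inputs are assumed to be lists of strings, counters are limited to 0, 1 and "many", and the final reporting loop is my own simplification that prints matches, insertions and deletions without reporting moved blocks as moves.

```python
def heckel_diff(old_lines, new_lines):
    """Return lines prefixed with '= ', '+ ' or '- ' (moves are not reported)."""
    O, N = list(old_lines), list(new_lines)
    table = {}

    # Passes 1 and 2: every array slot starts as a reference to its table entry.
    NA = []
    for line in N:
        entry = table.setdefault(line, {"NC": 0, "OC": 0, "OLNO": None})
        entry["NC"] = 1 if entry["NC"] == 0 else "many"
        NA.append(entry)
    OA = []
    for j, line in enumerate(O):
        entry = table.setdefault(line, {"NC": 0, "OC": 0, "OLNO": None})
        entry["OC"] = 1 if entry["OC"] == 0 else "many"
        entry["OLNO"] = j
        OA.append(entry)

    # Pass 3: lines that occur exactly once in each file become anchors.
    for i, entry in enumerate(NA):
        if entry["NC"] == 1 and entry["OC"] == 1:
            j = entry["OLNO"]
            NA[i], OA[j] = j, i

    # Pass 4 (ascending): grow each matched block forwards.
    for i in range(len(NA) - 1):
        j = NA[i]
        if (isinstance(j, int) and j + 1 < len(OA)
                and isinstance(NA[i + 1], dict) and NA[i + 1] is OA[j + 1]):
            NA[i + 1], OA[j + 1] = j + 1, i + 1

    # Pass 5 (descending): grow each matched block backwards.
    for i in range(len(NA) - 1, 0, -1):
        j = NA[i]
        if (isinstance(j, int) and j - 1 >= 0
                and isinstance(NA[i - 1], dict) and NA[i - 1] is OA[j - 1]):
            NA[i - 1], OA[j - 1] = j - 1, i - 1

    # Reporting (assumption of this sketch): an int slot is an unchanged line,
    # a dict slot in NA is an insertion, a dict slot in OA is a deletion.
    out, k = [], 0
    for i, slot in enumerate(NA):
        if isinstance(slot, int):
            out.extend("- " + O[j] for j in range(k, slot) if isinstance(OA[j], dict))
            k = max(k, slot + 1)
            out.append("= " + N[i])
        else:
            out.append("+ " + N[i])
    out.extend("- " + O[j] for j in range(k, len(OA)) if isinstance(OA[j], dict))
    return out


print(heckel_diff(["a", "b", "c", "d"], ["a", "x", "c", "d", "e"]))
# ['= a', '+ x', '- b', '= c', '= d', '+ e']
```

Note that a repeated line is only recognised as unchanged when it is adjacent to a block that contains an anchor; a repeated line with no anchored neighbour shows up as both an insertion and a deletion.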
Tags: version-control, python, algorithm, file-diffs. Source: https://codeday.me/bug/20191026/1934384.html