4.3 Creating a binary diff (delta) file
4.3 Constructing a Delta file
Contents:
Constructing a Delta file
Here is an example to better understand what a binary diff file is and how target is recreated from the source file and the delta file.
Step 1
E.G.
The content of the source (old) file is: “abcdfghjq”
The content of the target (new) file is: “abcdefgijkrxy”
We note symbol (character) positions starting at 0. At position 0 we have symbol 'a' in the old file and symbol 'a', also, in the new file, and so on so forth. See attached table at column POSITION.
Creating a Delta file means finding a solution to the problem that is often called “the longest common subsequence” problem.
When a Delta file is created, we use only the two files, the source and the target. Thus, we first define a set of commands or instructions, that 'tells' the delta creation application how to construct the new content relying only on the old content.
The processing is done byte by byte, or symbol by symbol, starting at position 0 and comparing symbols from the target file with the symbols in the old file.
Step 2
These instructions are:
▪insert (symbol): we note this command with a plus sign, since it signals an addition of content, ‘+’
▪delete (symbol): we note this command with a minus sign, since it signals a removal of content, ‘-’
▪copy (symbol): we note this command with ‘#’. Current character will be copied from the old file to the new file.
"Insert" (symbol) command is interpreted by inserting the symbol present at the current position in the source (old) file into target (new) file at its current position.
Step 3
Since we are interested in creating the target (new) file, we start by recreating the new content. The above commands are inserted as instructions to reconstruct the new file while leaving out older content which is found to be obsolete.
Following these steps, the new content is now:
# (‘a’ is left out, since it is common to the old content)
# (‘b’left out) # (‘c’ left out)
# (‘d’ left out) command insert (+, e, at position 4… ) and so on so forth
Here is the attached table for all the steps taken to construct the binary diff file.
POSITION |
OLD FILE |
NEW FILE |
DELTA |
COMMENT |
CURRENT INDEX |
0 |
a |
a |
# |
Copy ‘a’ |
0 |
1 |
b |
b |
# |
Copy ‘b’ |
1 |
2 |
c |
c |
# |
Copy ‘c’ |
2 |
3 |
d |
d |
# |
Copy ‘d’ |
3 |
4 |
f |
e |
+e |
Add ‘e’ |
4 |
5 |
g |
f |
# |
Copy ‘f’ |
5 |
6 |
h |
g |
# |
Copy ‘g’ / Remove ‘h’ |
6 |
7 |
j |
i |
Add ‘i’ |
7 |
|
8 |
q |
j |
Copy ‘j’ / Remove ‘q’ |
8 |
|
9 |
z |
k |
Add ‘k’ |
9 |
|
10 |
r |
Add ‘r’ |
10 |
||
11 |
x |
Add ‘x’ |
11 |
||
12 |
y |
Add ‘y’ |
12 |
The binary diff file now has the following content: # # # # + e # # - h + i # - q + k + r + x + y
Step 4
We could also write these commands and data (also called payload) in a much more compact form, by removing the obvious redundancy: # # # # will become #.3 meaning just copy data from position current position to current position + 3
See attached table for newly compacted commands:
POSITION |
COMMAND |
START FROM POSITION |
END AT POSITION |
DESCRIPTION |
COMMENT |
0 |
# |
0 |
3 |
Copy data from position 0 to position 3 |
Copy ‘abcd’ |
1 |
+e |
4 |
4 |
add ‘e’ at the next position, 4 |
+e |
2 |
# |
5 |
6 |
Copy data from position 5 to position 6 |
Copy ‘fg’ |
3 |
-h |
6 |
6 |
Remove data from position 6 to position 6 |
-h |
4 |
+i |
7 |
7 |
Copy data from position 7 to position 7 |
+I |
5 |
# |
8 |
8 |
Copy data from position 8 to position 8 |
Copy ‘j’ |
6 |
-q |
8 |
8 |
Remove data from position 8 to position 8 |
-q |
7 |
+ |
9 |
12 |
Add ‘krxy’ |
+k +r +x +y |
As a side note: We always copy data from the source file (old) to the target file (new file).