Longest Common Subsequence (LCS)

Distribute Education like Computer Geek

Introduction

.

The Longest Common Subsequence (LCS) algorithm is a popular method in computer science used to find the longest subsequence common to two sequences (such as strings).

A subsequence is a sequence that appears in the same order but not necessarily consecutively.

The LCS algorithm uses a dynamic programming approach, which breaks down the problem into smaller subproblems, solves each subproblem once, and stores the results for future reference.

This method significantly reduces the computational effort compared to naive approaches and recursion approaches that might involve excessive repetition of calculations.

.

History

The study of string matching and sequence alignment problems dates back to the early days of computer science. Researchers were interested in finding efficient ways to compare and align sequences, which has applications in text processing, DNA and RNA sequence analysis, and other areas.

The concept of dynamic programming, which is crucial for efficiently solving the LCS problem, was introduced by Richard Bellman in the 1950s. Dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems and storing the results of these subproblems to avoid redundant calculations.

.

How it can be solved

There are several methods for solving LCS problem

  1. Naïve Method
  2. Recursion Method
  3. Dynamic Programming

.

1. Naïve Method

  • Generate all possible sub-sequences for both sequences.
  • Compare each subsequence of the first sequence with each subsequence of the second sequence.
  • Identify the longest common subsequence.

Example

Sequence 1 = ABCD

Sequence 2 = ACDB

From Naïve method –

Sequence 1 – all possible sub-sequences

A, B, C, D, AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, ABCD

Sequence 2 – all possible sub-sequences

A, C, D, B, AC, AD, AB, CD, CB, DB, ACD, ACB, ADB, CDB, ACDB

.

Compare each sub-sequence of first sting to the second string

‘A’, ‘B’, ‘C’, ‘D’, ‘AB’, ‘AC’, ‘AD’, ‘CD’, ‘ACD’

.

Longest common sub-sequence is ACD

 .

Time Complexity of brute force algorithm.

For a sequence of length n, there are 2n possible sub-sequences.

Comparing all sub-sequences of two sequences of length m and n takes O(2m * 2n) time.

Space Complexity of brute force algorithm is O(2m * 2n) to store all sub-sequences.

.

2. Recursion Method

  • Compare the last characters of both sequences.
  • If they match, the character is part of the LCS.
  • If they do not match, recursively find the LCS in two scenarios: excluding the last character of the first sequence and excluding the last character of the second sequence.
  • The result is the longer LCS found in the two scenarios.

Example –

Index

0

1

2

3

Sequence 1

A

B

C

D

Sequence 2

A

C

D

B

From Recursion method –

Longest Common Subsequence (LCS) in Recursion Method

.

Using the recursion method, we find that the LCS length of Sequence 1 = “ABCD” and Sequence 2 = “ACDB” is 3.

The actual LCS can be determined by tracking the characters included in the matching cases, which in this example would be “ACD”.

Time Complexity of Recursion Method – O(2min(m,n)), because for each character pair, we have two choices (exclude one character from either sequence), leading to a total of 2min(m,n) combinations in the worst case. But in the recursion method, it only explores relevant subproblems rather than generating all sub-sequences, so therefore its time complexity reduces to min(m,n)2.

Space Complexity of Recursion Method – O(n), due to the call stack depth.

 

  1. Dynamic Programming Algorithm –

The dynamic programming approach for solving the Longest Common Subsequence (LCS) problem is efficient and avoids redundant calculations by storing intermediate results.

This method uses a table to keep track of the lengths of LCS for different substrings of the given sequences.

.

Algorithm for LCS using Dynamic Programming (dp)

  1. Input
    • Two sequences seq1 of length m.
    • Two sequences seq2 of length n.
  2. Create a 2D Table
    • Create a 2D table dp with dimensions (m+1) x (n+1).
    • Initialize all elements of dp to 0.
  3. Fill the Table
    • Loop through each character in seq1 and seq2.
    • If characters match, update the dp table with the value from the diagonal cell plus one.
    • If characters do not match, update the dp table with the maximum value from the cell to the left or the cell above.
  4. Retrieve the Result:
    • The value in dp[m][n] contains the length of the LCS.

Example

Let’s consider seq1 = “ABCD” and seq2 = “ACDB”

Create and initialize the table

Longest Common Subsequence (LCS)

For i=1 (character A in seq1):

  • For j=1 (character A in seq2): dp[1][1] = 1 and directions[1][1] = “↖” (characters match).

Whenever a character matches, the cell contains a North-West arrow (↖), and the value in that cell is the value from the North-West cell plus 1.

  • For j=2 (character C in seq2): dp[1][2] = 1 and directions[1][2] = “←” (Take the maximum value from the cell directly above or the cell directly to the left.).
  • For j=3 (character D in seq2): dp[1][3] = 1 and directions[1][3] = “←” (Take the maximum value from the cell directly above or the cell directly to the left.).
  • For j=4 (character B in seq2): dp[1][4] = 1 and directions[1][4] = “←” (Take the maximum value from the cell directly above or the cell directly to the left.).

Longest Common Subsequence (LCS)

.

For i=2 (character B in seq1):

  • For j=1 (character A in seq2, unmatched): dp[2][1] = dp[1][1] = 1, directions[2][1] = “↑”
  • For j=2 (character C in seq2, unmatched): dp[2][2] = dp[2][1] = 1, directions[2][2] = “↑”
  • For j=3 (character D in seq2, unmatched): dp[2][3] = dp[2][2] = 1, directions[2][3] = “↑”
  • For j=4 (character B in seq2, matched): dp[2][4] = dp[1][3] + 1 = 2, directions[2][4] = “↖”

Longest Common Subsequence (LCS)

.

For i=3 (character C in seq1):

  • For j=1 (character A in seq2): dp[3][1] = dp[2][1] = 1, directions[3][1] = “↑”
  • For j=2 (character C in seq2): dp[3][2] = dp[2][1] + 1 = 2, directions[3][2] = “↖”
  • For j=3 (character D in seq2): dp[3][3] = dp[3][2] = 2, directions[3][3] = “←”
  • For j=4 (character B in seq2): dp[3][4] = dp[3][3] = 2, directions[3][4] = “←”

Longest Common Subsequence (LCS)

.

For i=4 (character D in seq1):

  • For j=1 (character A in seq2): dp[4][1] = dp[3][1] = 1, directions[4][1] = “↑”
  • For j=2 (character C in seq2): dp[4][2] = dp[3][2] = 2, directions[4][2] = “↑”
  • For j=3 (character D in seq2): dp[4][3] = dp[3][2] + 1 = 3, directions[4][3] = “↖”
  • For j=4 (character B in seq2): dp[4][4] = dp[4][3] = 3, directions[4][4] = “←”

Longest Common Subsequence (LCS)

The final answer is located in the last cell of the dynamic programming table, which is highlighted in red. The length of the Longest Common Subsequence (LCS) for Sequence 1 and Sequence 2 is 3.

Also, the lcs is ACD because –

  • Start from the last cell and follow the directions of arrow.
  • If the direction is “↖” (North-West), it means a character match. Add this character to the LCS.
  • If the direction is “↑” (Up), move up.
  • If the direction is “←” (Left), move left.
  • The collected characters are D, C, and A. Reverse this sequence to get the LCS – ACD.

.

Source Code of Longest Common Sub-sequence

Time Complexity of LCS

Initializing the DP table and directions table involves filling in values for each cell. This step takes O(m × n) time.

Also, finding the characters of LCS involves tracing back through the DP table, which takes linear time in the length of the LCS, so this step is O(m + n).

Combining both steps, the overall time complexity is dominated by the table construction phase, which is O(m × n).

.

Space Complexity of LCS

The space complexity is also O(m × n) because we use a 2D table to store the lengths of the LCS and the directions for reconstruction.

.

Advantages of the Longest Common Subsequence (LCS)
  1. Optimal Solution
  2. Polynomial Time Complexity
  3. Clear and Systematic Approach
  4. Finding Subsequence – Unlike some algorithms, the dynamic programming approach not only computes the length of the LCS but also allows for the finding of the LCS itself by tracing back.

.

Disadvantages of the Longest Common Subsequence (LCS)
  1. Space complexity
  2. Fixed Table Size
  3. Complexity for large Sequences – While polynomial, the time complexity can still be high for very long sequences, making it less suitable for extremely large inputs.
  4. Preprocessing Overhead

.

Applications of LCS Algorithm
  1. LCS is used in diff tools for comparing text files to identify the longest common subsequence between different versions of a file.
  2. In genetics and molecular biology, LCS algorithms help compare DNA, RNA, or protein sequences to find similarities and evolutionary relationships.
  3. LCS is employed in version control systems to merge changes from different branches by identifying the common subsequence of changes.
  4. LCS can be used to detect and eliminate duplicate data by finding common sub-sequences in data streams or files.
  5. In natural language processing (NLP), LCS helps align sentences or words in tasks such as machine translation and speech recognition.
  6. LCS is used to align and compare protein sequences to find similarities in biological functions.

Test Yourself

Q1- Explain the concept of the Longest Common Subsequence (LCS) and how it differs from the Longest Common Prefix (LCP).

The Longest Common Subsequence (LCS) is a sequence that appears in the same order but not necessarily consecutively within two sequences. It finds the longest sequence that can be derived from both sequences without reordering. For example, for sequences “ABCD” and “ACDB”, the LCS is “ACD”.

In contrast, the Longest Common Prefix (LCP) is the longest common initial segment of two sequences. For example, for sequences “ABCF” and “ABEF”, the LCP is “AB”.

Q2- Discuss the significance of dynamic programming in solving the LCS problem.

Dynamic programming is significant in solving the LCS problem because it breaks down the problem into smaller subproblems and stores the results of these subproblems to avoid redundant calculations.

This approach reduces the time complexity compared to naive methods, which might involve excessive repetition of calculations.

Q3- What is the time complexity of the naive method for solving the LCS problem and why is it inefficient?

The time complexity of the naive method for solving the LCS problem is O(2^m * 2^n), where m and n are the lengths of the two sequences. This method generates all possible subsequences of both sequences and compares them, which is inefficient because it involves exponential growth in the number of subsequences to be considered.

Q4- Describe the process of reconstructing the LCS from the dynamic programming table.

To reconstruct the LCS from the dynamic programming table, start from the bottom-right cell of the table and trace back the directions recorded in the directions table.

Move diagonally (↖) if characters match, up (↑) if the cell above has a higher value, or left (←) if the cell to the left has a higher value. Collect characters during diagonal moves to form the LCS.

Q5- How does the recursion method for LCS differ from the dynamic programming approach in terms of efficiency?

The recursion method for LCS has a time complexity of O(2^min(m, n)), as it explores all possible subsequences but does not store intermediate results, leading to redundant calculations. In contrast, the dynamic programming approach has a time complexity of O(m * n) and avoids redundant calculations by storing the results of subproblems in a table.

Q6- Explain the role of the directions table in the dynamic programming approach for LCS.

The directions table in the dynamic programming approach records the direction of the previous cell that led to the current cell’s value. It helps in reconstructing the LCS by indicating whether to move diagonally, up, or left, which guides the path to trace back the common subsequence.

Q7- Discuss the advantages and disadvantages of using LCS in text comparison tools.

Advantages of using LCS in text comparison tools include its ability to identify the longest common subsequence between different versions of text files, which helps in highlighting differences and similarities. Disadvantages include its high space complexity and potentially high time complexity for very large files, which might lead to inefficiencies in processing.

Q8- How does the LCS algorithm apply to biological sequence analysis, such as DNA or protein sequences?

In biological sequence analysis, LCS algorithms help compare DNA, RNA, or protein sequences to find similarities and evolutionary relationships. By identifying common sub-sequences, researchers can understand how different sequences are related and track changes or mutations over time.

Q9- Describe an optimization technique that can be used to reduce the space complexity of the LCS problem.

An optimization technique to reduce the space complexity of the LCS problem is to use a space-efficient version of the dynamic programming algorithm that only keeps track of the current and previous rows of the DP table, rather than storing the entire table.

This reduces the space complexity from O(m * n) to O(min(m, n)).

Q10- What is the time complexity of the dynamic programming approach for solving the LCS problem?
  1. O(m * n)
  2. O(m + n)
  3. O(2m * 2n)
  4. O(min(m, n)2)

Ans – (1)

Explanation – The dynamic programming approach fills a table of size m x n, where m and n are the lengths of the two sequences. Therefore, the time complexity is O(m * n).

Q11- Which method for solving the LCS problem involves generating all possible sub-sequences?
  1. Naive Method
  2. Recursion Method
  3. Dynamic Programming
  4. Greedy Algorithm

Ans – (1)

Explanation – The naive method generates all possible sub-sequences of both sequences and compares them to find the LCS.

Q12- In which application is the LCS algorithm used to align and compare sequences in natural language processing (NLP)?
  1. Text summarization
  2. Speech recognition
  3. Machine translation
  4. Sentiment analysis

Ans – (3)

Explanation – In natural language processing, LCS is used in machine translation to align and compare sentences or words between different languages.

BOOKS

Algorithm T H CoremanAlgorithm by Udit AgarwalAlgorithm Ellis HorowitzData Structure & Algorithm Made Easy Narasimha