Sunday, February 19, 2012

How to Find Common Characters among two Strings

There are several ways to find the common characters among two strings. Speaking of the worst-case time complexity, different algorithms perform differently from quadratic to linear, such as O(n2), O(nlogn) and O(n)

The average run time of naive method can be O(n2), which is straightforward to design. So here are the other two possible methods whose time complexity are O(nlogn) and O(n).

Let's assume the strings to be compared are a and b, and the functions are:
    string comchr(string&, string&)
and
    string comchr2(string&, string&)
in C++ respectively.

1. comchr(): Sort the strings first and then find the common characters - O(nlogn)

If we use merge sort to sort the two strings and then do the comparison, the average and worst time complexity of the problem will be of order O(nlogn), where n is the max(a.length(), b.length()). So here are the C++ implementation of the algorithm:


string comchr(string &a, string &b)
{
    string r;
    string::size_type i, j, k;
    if (a.length()+b.length()<=1)    return "";
    if (a.compare(b)==0)    return a;
    merge_sort(a, 0, a.length()-1);
    merge_sort(b, 0, b.length()-1);
    i=j=k=0;
    while (i<a.length() && j<b.length())
    {
        if (a[i]==b[j])
        {
            if (k==0 || r[k-1]!=a[i])     
            {
                r.push_back(a[i]);           
                k++;
            }
            i++; j++;
        }        
        else
  
        (a[i]<b[j]) ? i++ : j++;

    return r;
}
The merge_sort() is the merge sort routine that sorts the string and contributes O(nlogn) to the complexity.

2. comchr2(): Map the characters in string a into a set and then compare each of the characters in this set with string b - O(n) 
A simple way to improve the running time is to use a direct-address access table (here called t) for insertion (O(1)) of each character of one string and for table look up (O(1)). The implementation could be:

string comchr2(string &a, string &b)
{
    string r;
    set<char> s;    //s to remove duplicate chars
    bitset<256> t;    // All ASCII char
    if (a.length()+b.length()<=1)    return "";
    if (a.compare(b)==0)    return a;
    for (string::size_type i=0; i<a.length(); i++)
        if (!t.test(a[i]))
            t.set(a[i]);
    for (string::size_type i=0; i<b.length(); i++)
        if (t.test(b[i]))
            s.insert(b[i]);
    for (set<char>::iterator it=s.begin(); it!=s.end(); it++)
        r.push_back(char(*it));
    return r;
}

Note that the function uses a bitset t for direct-address access. Of course, the comchr2() is easier to design
3. Short experiment

A simple experiment to compare the performance of the two functions is done using test set:
                a            b
Test 1    abcdefghi     jklmnopqr
Test 2    abcdefghi     abcdefghi
Test 3    abcdefghi     aaabbbccc

The same function will be called 1000000 times to scale up the time to run the problem. The average running time (in second) of each function on different strings is shown below:
   


The comchr2() function runs faster than the comchr() when the test strings are not 'too' similar. And both functions perform similarly on the test case 2 because the test strings are same in both cases.

4. Conclusion

In programming practice, sometimes the simplest approach may also be the 'best'! Given the scale of the problem, the computing resource or the performance requirement, we may tune the program in different ways or even have different solutions.

For example, what if the strings we compare use wide character set (e.g. Unicode)?

What if the strings are stored in separate files on disk?

 



3 comments:

(Coding && Eating) || Sleeping