leetcode

Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", 
Return:
["AAAAACCCCC", "CCCCCAAAAA"].

Naive solution, use one set to store repeated 10 chars DNA sequences and another set to store already added 10 chars sequences (dedup purpose).

This will not pass, because once the string becomes longer and longer, it will consume a lot of memory in the set.

public List<String> findRepeatedDnaSequencesString(String s) {
        List<String> results = new ArrayList<String>();
        if (s==null || s.length()<10) {
            return results;
        }

        Set<String> repeated = new HashSet<String>();
        Set<String> temp = new HashSet<String>();

        StringBuilder sb = new StringBuilder();
        for (int i=0; i<s.length(); i++) {
            if (i==9) {
                repeated.add(sb.toString());
            } else {
                sb.deleteCharAt(0);
                sb.append(s.charAt(i));          

                String str = sb.toString();
                if (repeated.contains(str)) {
                    temp.add(str);
                } else {
                    repeated.add(str);
                }
            }
        }

        results.addAll(temp);

        return results;
    }

In another words, we can treat 'A' -> 00, 'C' -> 01, 'G' -> '10', 'T' -> '11' as binaries and store 20 bits (10 chars, each char is two bits) in the set.

//this will convert a string to bits.
public int hashcode(String str, Map<Character, Integer> map) {
        int hashcode = 0;

        for (int i=0; i<str.length(); i++) {
            hashcode = (hashcode<<2) | map.get(str.charAt(i));
        }

        return hashcode;
    }

    public List<String> findRepeatedDnaSequences(String s) {
        List<String> results = new ArrayList<String>();
        if (s==null || s.length()<10) {
            return results;
        }

        Map<Character, Integer> map = new HashMap<Character, Integer>();
        map.put('A', 0);
        map.put('C', 1);
        map.put('G', 2);
        map.put('T', 3);

        Set<Integer> repeated = new HashSet<Integer>();
        Set<Integer> temp = new HashSet<Integer>();

        StringBuilder sb = new StringBuilder();
        for (int i=0; i<s.length(); i++) {
            if (sb.length()<10) {
                sb.append(s.charAt(i));
                if (i==9) {
                    repeated.add(hashcode(sb.toString(), map));
                }
            }  else {
                sb.deleteCharAt(0);
                sb.append(s.charAt(i));

                String str = sb.toString();
                int hint = hashcode(str, map);
                if (repeated.contains(hint) && !temp.contains(hint)) {
                    results.add(str);
                    temp.add(hint);
                } else {
                    repeated.add(hint);
                }
            }
        }

        return results;
    }