Repeated DNA Sequences
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
Return:
["AAAAACCCCC", "CCCCCAAAAA"].
Analysis
- cannot store the 10-letter substrings themselves because they consume too much memory and the program will exceed memory limit.
- So we have only 4 possible letters, and we can use as little bits as possible to store each character of our 10-letter string. We really need only 2 bits (bits, not bytes) for this. Specifically the solution uses the following coding:
0 = 00 (bits in binary number system) = 'A'
1 = 01 (bits in binary number system) = 'C'
2 = 10 (bits in binary number system) = 'G'
3 = 11 (bits in binary number system) = 'T'
Note that since there 10 letters and each letter requires only 2 bits, we will need only 10 * 2= 20 bits to code the string (which is less then size of integer in java (as well as in all othere popular languages), which is 4 bytes = 32 bits).
For example, this is how "AACCTCCGGT" string will be coded:
A A C C T C C G G T
00 00 01 01 11 01 01 10 10 11 = 00000101110101101011 (binary) = 23915 (decimal)
Code
public List<String> findRepeatedDnaSequences(String s) {
Set<Integer> words = new HashSet<>();
Set<Integer> doubleWords = new HashSet<>();
List<String> rv = new ArrayList<>();
char[] map = new char[26];
//map['A' - 'A'] = 0;
map['C' - 'A'] = 1;
map['G' - 'A'] = 2;
map['T' - 'A'] = 3;
//i:开始的字母; j:扫10个字母; v:存转化后的数字
for(int i = 0; i < s.length() - 9; i++) {
int v = 0;
for(int j = i; j < i + 10; j++) {
v <<= 2;
v |= map[s.charAt(j) - 'A'];
}
//if words does not have v, !words.add(v) return false and doubleWords.add(v) will not excute. rv.add does not excute, v is added to words
//if words do have v, !words.add(v) return true, if doubleWords does not have v, v is added to doubleWords, if condition return true, substring added to rv
//if words do have v and doubleWords do have v, if is not ture, substring is not added to rv
if(!words.add(v) && doubleWords.add(v)) {
rv.add(s.substring(i, i + 10));
}
}
return rv;
}
Note
- set.add(): return true if set did not already contains the specific element