Top 10 occurring words in a very large file java algorithm

Upasana | October 18, 2020 | 4 min read | 422 views | algorithm-datastructures

Questions worth asking the interviewer

Can file containing words fit into main memory?
How many distinct words are there in the file?

Please note that there are limited number of natural language words available and all of them can be easily fit into today’s computer RAM. For example oxford English dictionary contains total of around 0.6 million words.

We will discuss two approaches to solve this problem -

Approach 1: Find the Top K Occurrence count using a hashmap and min-heap (PriorityQueue in Java)

Pseudo code for the algorithm

Finding the Word Occurrence Count - Stream the words into a HashMap (put operation is Big O(1)) keeping the value as word occurrence count. On every word occurrence, update the word count.
Track Top K occurring Words Using Binary Min Heap (PriorityQueue with Natural ordering) - This can be achieved by maintaining a binary min heap of max size K, and then for each word count in hashmap -
- Check if the heap size if less than K - then add the new word count to min heap. Otherwise
- Check if the peek element (that is minimum value in binary min heap) is less than the new word count, if it is, then remove the existing number and insert the new word count into min heap.
- When we are done traversing the entire word-counts then we will have heap containing the top K frequently occurring words.

Java 8 Source Code (Find Top 10 word occurrences)

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.stream.Collectors;
public class SplitWordCount {
    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
                "Coding is great",
                "Search Engine are great",
                "Google is a nice search engine",
                "Bing is also a nice engine");
        TopOccurrence topOccurrence = new TopOccurrence(2);
        terms.parallelStream() //Utilizes multi-core hardware
                .flatMap(s -> Arrays.asList(s.split(" ")).stream())
                .collect(Collectors.toConcurrentMap(w -> w.toLowerCase(), w -> 1, Integer::sum)) // Big O(n)
                .forEach((s, integer) -> topOccurrence.add(new WordCount(s, integer)));
        System.out.println(topOccurrence);
    }
    static class TopOccurrence {
        private final PriorityQueue<WordCount> minHeap;
        private final int maxSize;
        public TopOccurrence(int maxSize) {
            this.maxSize = maxSize;
            this.minHeap = new PriorityQueue<WordCount>(Comparator.comparingInt(wc -> wc.count));
// This constructs a min heap (when order of elements is natural i.e. ascending order).
// We are using Natural order for integers (wc.count)
// In order to create a max-heap, we just need to provide reversed comparator i.e. that sorts in descending order,as shown below
// this.minHeap = new PriorityQueue<WordCount>(Comparator.comparingInt((WordCount wc) -> wc.count).reversed());
        }
        public void add(WordCount data) {
            if (minHeap.size() < maxSize) { // size() is Big O(1)
                minHeap.offer(data); // Big O(log(k)) where k is the number of top occurrences required
            } else if (minHeap.peek().count < data.count) { // peek() is Big O(1)
                minHeap.poll(); // Big O(log(k))
                minHeap.offer(data); // Big O(log(k))
            }
        }
        @Override
        public String toString() {
            return "TopOccurrence{" + "minHeap=" + minHeap + ", maxSize=" + maxSize + '}';
        }
    }
    static class WordCount {
        protected final String word;
        protected final int count;
        WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }
        @Override
        public String toString() {
            return "{" + "word='" + word + '\'' + ", count=" + count + '}'+"\r\n";
        }
    }
}

The overall time complexity of the above algorithm should be O (n log k) where n is the total number of elements and k is the number of top occurrence elements that we need. Space complexity would be Big O(k + d), where d is the total number of distinct words in the file.

Notes

We preferred to choose Binary Heap over TreeSet because TreeSet provide a get method with Big O(log n) time complexity over PriorityQueue’s peek() method with Big O(1), so its a big time saver for the given requirement.

Binary Min Heap is a complete binary tree data structure in which each Node is less than or equal to each of its children. Heap is very efficient O(1) for finding minima and maxima from a given data set. PriorityQueue in JDK 1.6 is a implementation for Binary Min Heap.

A heap data structure should not be confused with the heap which is a common name for dynamically allocated memory. The term was originally used only for the data structure.

Approach 2: Using Counting Array Approach to find the Top x words

Pseudo code for algorithm -

Iterate all words and maintain a count inside hash data structure. This is Big O(n) time complexity task, this will result in structure like this - hashMap = {"first":42, "second":100, "third":59};
Traverse the hash and find the with maximum frequency, "second":100 in this case, then create an array of string with size 100 - this is Big O(x) where x number of unique words in hashmap
Now Traverse the hash again and use the number of occurrences of words as array index and append the word at that position in the array, we will get something like this -
Then just traverse the array from the end and collect the k non null words. Time Complexity - Big O(k)

Java Reference Implementation

import java.util.*;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;

public class SplitWordCount2 {

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
                "Coding is great",
                "Search Engine are great",
                "Google is a nice search engine",
                "Bing is also a nice engine");
        ConcurrentMap<String, Integer> wordFreq = terms
                .parallelStream() //Utilizes multi-core hardware
                .flatMap(s -> Arrays.stream(s.split(" ")))
                .collect(Collectors.toConcurrentMap(w -> w.toLowerCase(), w -> 1, Integer::sum)); // Big O(n)

        OptionalInt max = wordFreq
                .values()
                .parallelStream()
                .mapToInt(Integer::valueOf)
                .max();

        if (max.isPresent()) {
            String[] freq = new String[max.getAsInt()];
            wordFreq.forEach((s, integer) -> {
                if (freq[integer - 1] == null) {
                    freq[integer - 1] = s;
                } else {
                    freq[integer - 1] = freq[integer - 1] + "," + s;
                }
            });
            for (int i = max.getAsInt() - 1; i >= 0; i--) {
                System.out.println((i + 1) + " --> " + freq[i]);
            }
        }
    }
}

Output

3 --> engine,is
2 --> a,search,great,nice
1 --> coding,are,google,also,bing

Approach 3: Other Strategies for Scalable Design

If we have very limited memory in our device then we can utilize memory efficient data structures for storing words - TRIE and DAG.

TRIE - memory is shared between multiple words with common prefix and word count can be maintained along with the word termination mark, but it would be more time consuming than the HashMap