How would you count word frequency in a very large file. How to keep track of top 10 occurring words?

Carvia Tech | May 19, 2019 | 4 min read | 156 views | algorithm-datastructures


Questions worth asking the interviewer
  • Can file containing words fit into main memory?

  • How many distinct words are there in the file?

word count 1

Please note that there are limited number of natural language words available and all of them can be easily fit into today’s computer RAM. For example oxford English dictionary contains total of around 0.6 million words.

We will discuss two approaches to solve this problem -

Approach 1: Find the Top K Occurrence count using a hashmap and min-heap (PriorityQueue in Java)

Pseudo code for the algorithm
  1. Finding the Word Occurrence Count - Stream the words into a HashMap (put operation is Big O(1)) keeping the value as word occurrence count. On every word occurrence, update the word count.

  2. Track Top K occurring Words Using Binary Min Heap (PriorityQueue with Natural ordering) - This can be achieved by maintaining a binary min heap of max size K, and then for each word count in hashmap -

    • Check if the heap size if less than K - then add the new word count to min heap. Otherwise

    • Check if the peek element (that is minimum value in binary min heap) is less than the new word count, if it is, then remove the existing number and insert the new word count into min heap.

    • When we are done traversing the entire word-counts then we will have heap containing the top K frequently occurring words.

Java 8 Source Code (Find Top 10 word occurrences)
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.stream.Collectors;
public class SplitWordCount {
    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
                "Coding is great",
                "Search Engine are great",
                "Google is a nice search engine",
                "Bing is also a nice engine");
        TopOccurrence topOccurrence = new TopOccurrence(2);
        terms.parallelStream() //Utilizes multi-core hardware
                .flatMap(s -> Arrays.asList(s.split(" ")).stream())
                .collect(Collectors.toConcurrentMap(w -> w.toLowerCase(), w -> 1, Integer::sum)) // Big O(n)
                .forEach((s, integer) -> topOccurrence.add(new WordCount(s, integer)));
        System.out.println(topOccurrence);
    }
    static class TopOccurrence {
        private final PriorityQueue<WordCount> minHeap;
        private final int maxSize;
        public TopOccurrence(int maxSize) {
            this.maxSize = maxSize;
            this.minHeap = new PriorityQueue<WordCount>(Comparator.comparingInt(wc -> wc.count));
// This constructs a min heap (when order of elements is natural i.e. ascending order).
// We are using Natural order for integers (wc.count)
// In order to create a max-heap, we just need to provide reversed comparator i.e. that sorts in descending order,as shown below
// this.minHeap = new PriorityQueue<WordCount>(Comparator.comparingInt((WordCount wc) -> wc.count).reversed());
        }
        public void add(WordCount data) {
            if (minHeap.size() < maxSize) { // size() is Big O(1)
                minHeap.offer(data); // Big O(log(k)) where k is the number of top occurrences required
            } else if (minHeap.peek().count < data.count) { // peek() is Big O(1)
                minHeap.poll(); // Big O(log(k))
                minHeap.offer(data); // Big O(log(k))
            }
        }
        @Override
        public String toString() {
            return "TopOccurrence{" + "minHeap=" + minHeap + ", maxSize=" + maxSize + '}';
        }
    }
    static class WordCount {
        protected final String word;
        protected final int count;
        WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }
        @Override
        public String toString() {
            return "{" + "word='" + word + '\'' + ", count=" + count + '}'+"\r\n";
        }
    }
}

The overall time complexity of the above algorithm should be O (n log k) where n is the total number of elements and k is the number of top occurrence elements that we need. Space complexity would be Big O(k + d), where d is the total number of distinct words in the file.

Notes

We preferred to choose Binary Heap over TreeSet because TreeSet provide a get method with Big O(log n) time complexity over PriorityQueue’s peek() method with Big O(1), so its a big time saver for the given requirement.

Binary Min Heap is a complete binary tree data structure in which each Node is less than or equal to each of its children. Heap is very efficient O(1) for finding minima and maxima from a given data set. PriorityQueue in JDK 1.6 is a implementation for Binary Min Heap.

A heap data structure should not be confused with the heap which is a common name for dynamically allocated memory. The term was originally used only for the data structure.

Approach 2: Using Counting Array Approach to find the Top x words

Pseudo code for algorithm -
  1. Iterate all words and maintain a count inside hash data structure. This is Big O(n) time complexity task, this will result in structure like this - hashMap = {"first":42, "second":100, "third":59};

  2. Traverse the hash and find the with maximum frequency, "second":100 in this case, then create an array of string with size 100 - this is Big O(x) where x number of unique words in hashmap

  3. Now Traverse the hash again and use the number of occurrences of words as array index and append the word at that position in the array, we will get something like this -

  4. Then just traverse the array from the end and collect the k non null words. Time Complexity - Big O(k)

Java Reference Implementation
import java.util.*;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;
public class SplitWordCount2 {

public static void main(String[] args) {
List<String> terms = Arrays.asList( "Coding is great",
"Search Engine are great", "Google is a nice search engine", "Bing is also a nice engine");
ConcurrentMap<String, Integer> wordFreq = terms.parallelStream() //Utilizes multi-core hardware .flatMap(s -> Arrays.asList(s.split(" ")).stream())
.collect(Collectors.toConcurrentMap(w -> w.toLowerCase(), w -> 1, Integer::sum)); // Big O(n)
OptionalInt max = wordFreq.values().parallelStream().mapToInt(Integer::valueOf).max(); if (max.isPresent()) {
String[] freq = new String[max.getAsInt()]; wordFreq.forEach((s, integer) -> {
if (freq[integer - 1] == null) { freq[integer - 1] = s;
} else {
} freq[integer - 1] = freq[integer - 1] + "," + s;
});
for (int i = max.getAsInt() - 1; i >= 0; i--) {
} System.out.println((i + 1) + " -- " + freq[i]);
} }
}

Approach 3: Other Strategies for Scalable Design

If we have very limited memory in our device then we can utilize memory efficient data structures for storing words - TRIE and DAG.

TRIE - memory is shared between multiple words with common prefix and word count can be maintained along with the word termination mark, but it would be more time consuming than the HashMap


Top articles in this category:
  1. Top 50 Multi-threading Java Interview Questions for Investment Bank
  2. BlackRock Top Java Interview Questions: Investment Banking Domain
  3. Top 20 Java Concurrency Interview Questions and Answers
  4. ION Trading Java Interview Questions
  5. Sapient Global Market Java Interview Questions and Coding Exercise
  6. Citibank Java developer interview questions
  7. Cracking core java interviews - question bank



Find more on this topic:
Java Interviews image
Java Interviews

Interview - Product Companies, eCommerce Companies, Investment Banking, Healthcare Industry, Service Companies and Startups.

Last updated 1 week ago


Recommended books for interview preparation:

This website uses cookies to ensure you get the best experience on our website. more info