
Can file containing words fit into main memory?

How many distinct words are there in the file?
Top 10 occurring words in a very large file java algorithm
Upasana  October 18, 2020  4 min read  422 views  algorithmdatastructures
Please note that there are limited number of natural language words available and all of them can be easily fit into today’s computer RAM. For example oxford English dictionary contains total of around 0.6 million words. 
We will discuss two approaches to solve this problem 
Approach 1: Find the Top K Occurrence count using a hashmap and minheap (PriorityQueue in Java)

Finding the Word Occurrence Count  Stream the words into a HashMap (put operation is Big O(1)) keeping the value as word occurrence count. On every word occurrence, update the word count.

Track Top K occurring Words Using Binary Min Heap (PriorityQueue with Natural ordering)  This can be achieved by maintaining a binary min heap of max size K, and then for each word count in hashmap 

Check if the heap size if less than K  then add the new word count to min heap. Otherwise

Check if the peek element (that is minimum value in binary min heap) is less than the new word count, if it is, then remove the existing number and insert the new word count into min heap.

When we are done traversing the entire wordcounts then we will have heap containing the top K frequently occurring words.

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.stream.Collectors;
public class SplitWordCount {
public static void main(String[] args) {
List<String> terms = Arrays.asList(
"Coding is great",
"Search Engine are great",
"Google is a nice search engine",
"Bing is also a nice engine");
TopOccurrence topOccurrence = new TopOccurrence(2);
terms.parallelStream() //Utilizes multicore hardware
.flatMap(s > Arrays.asList(s.split(" ")).stream())
.collect(Collectors.toConcurrentMap(w > w.toLowerCase(), w > 1, Integer::sum)) // Big O(n)
.forEach((s, integer) > topOccurrence.add(new WordCount(s, integer)));
System.out.println(topOccurrence);
}
static class TopOccurrence {
private final PriorityQueue<WordCount> minHeap;
private final int maxSize;
public TopOccurrence(int maxSize) {
this.maxSize = maxSize;
this.minHeap = new PriorityQueue<WordCount>(Comparator.comparingInt(wc > wc.count));
// This constructs a min heap (when order of elements is natural i.e. ascending order).
// We are using Natural order for integers (wc.count)
// In order to create a maxheap, we just need to provide reversed comparator i.e. that sorts in descending order,as shown below
// this.minHeap = new PriorityQueue<WordCount>(Comparator.comparingInt((WordCount wc) > wc.count).reversed());
}
public void add(WordCount data) {
if (minHeap.size() < maxSize) { // size() is Big O(1)
minHeap.offer(data); // Big O(log(k)) where k is the number of top occurrences required
} else if (minHeap.peek().count < data.count) { // peek() is Big O(1)
minHeap.poll(); // Big O(log(k))
minHeap.offer(data); // Big O(log(k))
}
}
@Override
public String toString() {
return "TopOccurrence{" + "minHeap=" + minHeap + ", maxSize=" + maxSize + '}';
}
}
static class WordCount {
protected final String word;
protected final int count;
WordCount(String word, int count) {
this.word = word;
this.count = count;
}
@Override
public String toString() {
return "{" + "word='" + word + '\'' + ", count=" + count + '}'+"\r\n";
}
}
}
The overall time complexity of the above algorithm should be O (n log k) where n is the total number of elements and k is the number of top occurrence elements that we need. Space complexity would be Big O(k + d), where d is the total number of distinct words in the file.
Notes
We preferred to choose Binary Heap over TreeSet because TreeSet provide a get method with Big O(log n) time complexity over PriorityQueue’s peek() method with Big O(1), so its a big time saver for the given requirement.
Binary Min Heap is a complete binary tree data structure in which each Node is less than or equal to each of its children. Heap is very efficient O(1) for finding minima and maxima from a given data set. PriorityQueue in JDK 1.6 is a implementation for Binary Min Heap.
A heap data structure should not be confused with the heap which is a common name for dynamically allocated memory. The term was originally used only for the data structure.
Approach 2: Using Counting Array Approach to find the Top x words

Iterate all words and maintain a count inside hash data structure. This is Big O(n) time complexity task, this will result in structure like this  hashMap = {"first":42, "second":100, "third":59};

Traverse the hash and find the with maximum frequency, "second":100 in this case, then create an array of string with size 100  this is Big O(x) where x number of unique words in hashmap

Now Traverse the hash again and use the number of occurrences of words as array index and append the word at that position in the array, we will get something like this 

Then just traverse the array from the end and collect the k non null words. Time Complexity  Big O(k)
import java.util.*;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;
public class SplitWordCount2 {
public static void main(String[] args) {
List<String> terms = Arrays.asList(
"Coding is great",
"Search Engine are great",
"Google is a nice search engine",
"Bing is also a nice engine");
ConcurrentMap<String, Integer> wordFreq = terms
.parallelStream() //Utilizes multicore hardware
.flatMap(s > Arrays.stream(s.split(" ")))
.collect(Collectors.toConcurrentMap(w > w.toLowerCase(), w > 1, Integer::sum)); // Big O(n)
OptionalInt max = wordFreq
.values()
.parallelStream()
.mapToInt(Integer::valueOf)
.max();
if (max.isPresent()) {
String[] freq = new String[max.getAsInt()];
wordFreq.forEach((s, integer) > {
if (freq[integer  1] == null) {
freq[integer  1] = s;
} else {
freq[integer  1] = freq[integer  1] + "," + s;
}
});
for (int i = max.getAsInt()  1; i >= 0; i) {
System.out.println((i + 1) + " > " + freq[i]);
}
}
}
}
3 > engine,is
2 > a,search,great,nice
1 > coding,are,google,also,bing
Approach 3: Other Strategies for Scalable Design
If we have very limited memory in our device then we can utilize memory efficient data structures for storing words  TRIE and DAG.
TRIE  memory is shared between multiple words with common prefix and word count can be maintained along with the word termination mark, but it would be more time consuming than the HashMap
Top articles in this category:
 Multithreading Java Interview Questions for Investment Bank
 BlackRock Java Interview Questions
 Sapient Global Market Java Interview Questions and Coding Exercise
 ION Trading Java Interview Questions
 Goldman Sachs Java Interview Questions
 Cracking core java interviews  question bank
 Java topics covered in Investment Banking Interviews (Morgan Stanley, Barclays, RBS, UBS, BlackRock)