Count word frequency in Java

Carvia Tech | October 18, 2020


In this article we will calculate the frequency of each word in a given sentence using several approaches: a plain Java loop, the Java 8 Map.compute method, and parallel streams. We then use a PriorityQueue to report the most frequent words.

1. Using HashMap and a loop

This is the simplest but most verbose approach: we track the count of each word in a HashMap.

Approach
  • Split the sentence into a list of words

  • Loop over the word list

    • If the HashMap already contains the word, increment its frequency count

    • Otherwise, put the word into the HashMap with a frequency of 1

HashMap based implementation
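// Imports needed by this listing
import static java.util.Arrays.asList;

import java.util.HashMap;
import java.util.Map;

public static void wordFreqV1() {
    String text = "Ann while Bob had had had had had had had had had had had a better effect on on the teacher";
    Map<String, Integer> freqMap = new HashMap<>();
    // Split on single spaces and count each word
    asList(text.split(" ")).forEach(s -> {
        if (freqMap.containsKey(s)) {
            // Word seen before: increment its count
            Integer count = freqMap.get(s);
            freqMap.put(s, count + 1);
        } else {
            // First occurrence of the word
            freqMap.put(s, 1);
        }
    });
    System.out.println(freqMap);
}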

2. Using Java 8 Map & compute

Java 8 added the compute method to the Map interface. It takes a remapping function that receives the key and its current value (null if the key is absent) and returns the new value. This reduces the amount of code compared with the previous example.

Using HashMap with compute method
public static void wordFreqV2() {
    String text = "Ann while Bob had had had had had had had had had had had a better effect on on the teacher";
    Map<String, Integer> freqMap = new HashMap<>();
    // Split on whitespace or periods; compute passes null as the count for a new word
    asList(text.split("[\\s.]")).forEach(s ->
        freqMap.compute(s, (word, count) -> count == null ? 1 : count + 1));
    System.out.println(freqMap);
}
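
As a possible variation on the compute approach, Map.merge (also added in Java 8) expresses the same "insert 1 or add 1" logic without an explicit null check. The sketch below is only an illustration: the class name WordFreqMerge and the helper countWords are hypothetical, and the split pattern is widened to "[\\s.]+" so that consecutive delimiters do not produce empty words.

```java
import java.util.HashMap;
import java.util.Map;

class WordFreqMerge {

    // Hypothetical helper (not from the article): counts words with Map.merge.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> freqMap = new HashMap<>();
        // "[\\s.]+" treats a run of whitespace or periods as one delimiter
        for (String word : text.split("[\\s.]+")) {
            if (word.isEmpty()) {
                continue; // a leading delimiter produces one empty token
            }
            // Store 1 for a new word, otherwise add 1 to the existing count
            freqMap.merge(word, 1, Integer::sum);
        }
        return freqMap;
    }

    public static void main(String[] args) {
        String text = "Ann while Bob had had had had had had had had had had had a better effect on on the teacher";
        System.out.println(countWords(text));
    }
}
```

Whether compute or merge reads better is largely a matter of taste; merge tends to be shorter when the new value is simply combined with the old one.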

3. Using Java 8 parallel stream

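We can leverage parallel computing (utilizing multiple cores) by creating a parallel stream that computes the word frequency. Because several threads update the result at the same time, the words are collected into a ConcurrentMap. Note that this version also lower-cases each word and skips empty tokens. For an input as short as this sentence, the overhead of parallelism will likely outweigh any gain; the approach is more likely to pay off on large inputs.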

Using parallel stream
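// Imports needed by this listing
import static java.util.Arrays.asList;

import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;

public static void textWordFreq() {
    String text = "Ann while Bob had had had had had had had had had had had a better effect on on the teacher";
    ConcurrentMap<String, Integer> freqMap =
            asList(text.split("[\\s.]"))
                    .parallelStream()
                    .filter(s -> !s.isEmpty())
                    // Each word maps to 1; duplicate keys are merged by summing their counts
                    .collect(Collectors.toConcurrentMap(w -> w.toLowerCase(), w -> 1, Integer::sum));
    System.out.println(freqMap);
}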
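
An alternative way to phrase the same parallel count is Collectors.groupingByConcurrent combined with Collectors.counting(). This is a sketch under assumptions rather than the article's method: the class name WordFreqGrouping and the helper countWords are hypothetical, and the counts come back as Long instead of Integer.

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;

class WordFreqGrouping {

    // Hypothetical helper: groups lower-cased words and counts each group.
    static ConcurrentMap<String, Long> countWords(String text) {
        return Arrays.stream(text.split("[\\s.]+"))
                .parallel()
                .filter(s -> !s.isEmpty())
                // counting() produces a Long per distinct word
                .collect(Collectors.groupingByConcurrent(String::toLowerCase, Collectors.counting()));
    }

    public static void main(String[] args) {
        String text = "Ann while Bob had had had had had had had had had had had a better effect on on the teacher";
        System.out.println(countWords(text));
    }
}
```

The grouping form states the intent ("group by word, then count") directly, while toConcurrentMap keeps the Integer value type used in the earlier examples.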

Showing the top 3 most frequent words

We can keep track of the top X most frequently used words using a PriorityQueue whose comparator orders words by their frequency.

PriorityQueue in Java is a min-heap implementation. We create a comparator that orders the heap elements by frequency, so the lowest-frequency word is always at the head of the queue. Whenever the queue grows beyond X elements we remove the head (in O(log n) time), so only the X highest-frequency words remain as more words are added.

Keep track of the top occurring words
public static void textWordFreq() {
    String text = "Ann while Bob had had had had had had had had had had had a better effect on on the teacher";
    ConcurrentMap<String, Integer> freqMap =
            asList(text.split("[\\s.]"))
                    .parallelStream()
                    .filter(s -> !s.isEmpty())
                    .collect(Collectors.toConcurrentMap(w -> w.toLowerCase(), w -> 1, Integer::sum));
    System.out.println(freqMap.toString());

    // Min-heap ordered by word frequency, capped at 3 elements
    PriorityQueue<String> pq = new PriorityQueue<>(Comparator.comparingInt(freqMap::get));  // (1)
    for (String key : freqMap.keySet()) {
        pq.add(key);    // (2)
        if (pq.size() > 3) {
            pq.poll();  // (3)
        }
    }
    // Note: printing the queue shows heap order, not sorted order
    System.out.println("Top 3 words by occurrences  : " + pq);
}
(1) A min-heap that orders its elements by the frequency of each key in the frequency map, i.e. the word with the lowest frequency is at the head.
(2) Add a new element to the min-heap.
(3) If the min-heap holds more than 3 elements, remove the one with the lowest frequency by calling the poll() method.

Features of PriorityQueue

  • The elements of the queue are ordered according to their natural ordering or by a comparator provided in the constructor

  • The head of the queue is the least element with respect to the specified ordering.

  • PQ does not permit null elements

  • PQ is not thread safe. If multiple threads can modify the queue concurrently, use the PriorityBlockingQueue class instead

  • The iterator does not guarantee any particular order. If you need ordered traversal of the elements, copy them with pq.toArray() and sort the array, passing the queue's comparator to Arrays.sort when the queue uses one
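
The ordered-traversal point above can be sketched as follows. The class name OrderedTraversal, the helper mostFrequentFirst and the small frequency map are all hypothetical and exist only for illustration; the idea is to copy the queue's elements and sort the copy with the same frequency comparator, reversed so the most frequent word comes first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class OrderedTraversal {

    // Hypothetical helper: returns the queued words from most to least
    // frequent, leaving the queue itself unchanged.
    static List<String> mostFrequentFirst(PriorityQueue<String> pq, Map<String, Integer> freqMap) {
        List<String> words = new ArrayList<>(pq); // copy arrives in heap order
        Comparator<String> byFrequency = Comparator.comparingInt(freqMap::get);
        words.sort(byFrequency.reversed());
        return words;
    }

    public static void main(String[] args) {
        // Illustrative frequencies only
        Map<String, Integer> freqMap = new HashMap<>();
        freqMap.put("had", 11);
        freqMap.put("on", 2);
        freqMap.put("the", 1);

        PriorityQueue<String> pq = new PriorityQueue<>(Comparator.comparingInt(freqMap::get));
        pq.addAll(freqMap.keySet());

        System.out.println(mostFrequentFirst(pq, freqMap)); // [had, on, the]
    }
}
```

Calling poll() repeatedly would also yield the elements in order, but it empties the queue; sorting a copy keeps the queue intact.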

Time Complexity

The PriorityQueue (min-heap) operations have the following time complexity in Big O notation:

  • O(log n) time for the enqueuing and dequeuing methods offer(), poll(), remove() and add()

  • O(1) constant time for the retrieval methods peek(), element() and size()

  • O(n) linear time for remove(Object) and contains(Object)
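
To make these costs concrete, here is a minimal sketch that generalizes the top-3 example to any k. The class name TopKWords and the helper topK are hypothetical. Because the heap never holds more than k + 1 entries, each offer() or poll() costs O(log k), so processing n distinct words takes roughly O(n log k) time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopKWords {

    // Hypothetical helper: returns the k most frequent words, most frequent first.
    static List<String> topK(Map<String, Integer> freqMap, int k) {
        Comparator<Map.Entry<String, Integer>> byCount = Map.Entry.comparingByValue();
        PriorityQueue<Map.Entry<String, Integer>> pq = new PriorityQueue<>(byCount);
        for (Map.Entry<String, Integer> entry : freqMap.entrySet()) {
            pq.offer(entry);        // O(log k): heap holds at most k + 1 entries
            if (pq.size() > k) {
                pq.poll();          // O(log k): drop the current lowest count
            }
        }
        // Draining the min-heap yields ascending counts, so reverse at the end
        List<String> result = new ArrayList<>();
        while (!pq.isEmpty()) {
            result.add(pq.poll().getKey());
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        // Illustrative frequencies only
        Map<String, Integer> freqMap = new HashMap<>();
        freqMap.put("had", 11);
        freqMap.put("on", 2);
        freqMap.put("the", 1);
        freqMap.put("a", 1);
        System.out.println(topK(freqMap, 2)); // [had, on]
    }
}
```

Words with equal counts may come out in any relative order, since the comparator looks only at the count.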

That’s all for this article.

