Get distinct words from a given file in Java

Upasana | February 13, 2020 | 2 min read


We will extract distinct words from a given file using Java.

Concepts

  • A Set does not allow duplicate elements, so adding every word to a Set filters out duplicates automatically.

  • A line of text can be split into words either with a regular expression (for example String.split) or with Java's StringTokenizer class, which splits on a fixed set of delimiter characters rather than on a regex.

  • An input file must be closed after use to avoid leaking file handles in the Java program. A try-with-resources statement closes the underlying input stream automatically when the block exits, whether normally or through an exception.
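The first two ideas can be sketched in isolation before reading a file. This is a minimal illustration rather than part of the solution that follows: the class name `WordSplitSketch`, the method name `distinctWords`, and the delimiter pattern are assumptions chosen for the example. It uses the regex route (`String.split`) and a `LinkedHashSet` so that the printed order is predictable.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class WordSplitSketch {

    // Hypothetical helper: splits one line on runs of whitespace and the listed
    // punctuation characters, lower-cases each token, and collects them in a Set.
    public static Set<String> distinctWords(String line) {
        Set<String> words = new LinkedHashSet<>(); // keeps first-seen order
        for (String token : line.split("[\\s,.;:\"]+")) {
            // split() can produce a leading empty string when the line
            // starts with a delimiter, so skip empty tokens.
            if (!token.isEmpty()) {
                words.add(token.toLowerCase()); // duplicates are ignored by the Set
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [the, cat, hat]
        System.out.println(distinctWords("The cat, the hat; the CAT."));
    }
}
```

Six tokens go in, but only three distinct words come out, because the Set rejects repeats once the case has been normalised.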

Java 11 code solution

We will use Java 11 to implement the solution to this coding problem.

DistinctWords.java
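import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.logging.Level;
import java.util.logging.Logger;

public class DistinctWords {

    private static final Logger LOGGER = Logger.getLogger("DistinctWords");

    public Set<String> getDistinctWords(String fileName) {
        // A HashSet silently ignores duplicates, so each word is stored once.
        Set<String> wordSet = new HashSet<>();
        // try-with-resources closes the reader (and the underlying stream) automatically.
        // The file is assumed to be UTF-8; without an explicit charset,
        // Java 11 falls back to the platform default.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Each character in the second argument acts as a delimiter.
                StringTokenizer st = new StringTokenizer(line, " ,.;:\"");
                while (st.hasMoreTokens()) {
                    // Lower-case so that "Word" and "word" count as the same word.
                    wordSet.add(st.nextToken().toLowerCase());
                }
            }
        } catch (IOException e) {
            // On failure, log the error and return whatever was collected so far.
            LOGGER.log(Level.SEVERE, "IOException occurred", e);
        }
        return wordSet;
    }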

    public static void main(String[] args) {
        DistinctWords distinctFileWords = new DistinctWords();
        Set<String> wordList = distinctFileWords.getDistinctWords("<path-to-file>");
        for (String str : wordList) {
            System.out.println(str);
        }
    }

}
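As an alternative to the explicit loops, the same result can be obtained with the Stream API. The following is a sketch under stated assumptions, not the article's solution: the class name `DistinctWordsStream` is hypothetical, the file is assumed to be UTF-8, and the delimiter set is extended to any whitespace. Unlike the version above, it lets the `IOException` propagate to the caller instead of logging it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DistinctWordsStream {

    // Runs of whitespace or the listed punctuation separate words (assumed delimiter set).
    private static final Pattern DELIMITERS = Pattern.compile("[\\s,.;:\"]+");

    public static Set<String> getDistinctWords(Path file) throws IOException {
        // Files.lines keeps the file open until the stream is closed,
        // so it is wrapped in try-with-resources.
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines
                    .flatMap(DELIMITERS::splitAsStream) // one stream of tokens across all lines
                    .filter(word -> !word.isEmpty())    // drop empty tokens from leading delimiters
                    .map(String::toLowerCase)
                    .collect(Collectors.toSet());       // the Set removes duplicates
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java DistinctWordsStream <path-to-file>");
            return;
        }
        getDistinctWords(Path.of(args[0])).forEach(System.out::println);
    }
}
```

Whether this reads better than the loop is a matter of taste. The loop version makes the resource handling and tokenising steps more explicit, while the stream version keeps the whole pipeline in a single expression.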

Kotlin implementation

The Kotlin implementation of the same logic is more concise.

DistinctWords.kt
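import java.io.File

class DistinctWords {
    fun getDistinctWords(fileName: String): Set<String> {
        val wordSet = mutableSetOf<String>()
        // forEachLine reads the file as UTF-8 by default and closes it when done.
        File(fileName).forEachLine { line ->
            // Pass each delimiter as a Char; a single String argument would be
            // treated as one multi-character delimiter and never match.
            line.split(' ', ',', '.', ';', ':', '"')
                .filter { it.isNotEmpty() }
                // Lower-case to match the Java version (lowercase() needs Kotlin 1.5+).
                .forEach { wordSet.add(it.lowercase()) }
        }
        return wordSet
    }
}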

fun main() {
    val distinctFileWords = DistinctWords()
    val wordList = distinctFileWords.getDistinctWords("<path-to-file>")
    wordList.forEach { str ->
        println(str)
    }
}

That’s all.

