Word frequency

List word occurrence counts sorted by frequency for quick text analysis.

Overview

Word frequency analysis is one of the simplest and most useful foundations of computational linguistics and natural language processing. In 1949, linguist George Kingsley Zipf published an observation that became known as Zipf's Law: in a sufficiently large text corpus, the frequency of a word is roughly inversely proportional to its rank in the frequency table. The most common word appears approximately twice as often as the second most common, three times as often as the third, and so on. The pattern holds approximately across natural languages such as English, Portuguese, and Mandarin, and similar distributions have been reported for programming language source code.
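
As a rough illustration of the relationship described above, the following Python sketch computes the counts an idealized Zipf distribution would predict from the count of the top-ranked word. The function name zipf_expected and the sample count of 6000 are hypothetical, and real corpora follow the law only approximately.

```python
def zipf_expected(top_count: int, ranks: int) -> list[float]:
    """Idealized Zipf counts: the word at rank r appears top_count / r times."""
    return [top_count / r for r in range(1, ranks + 1)]


# Hypothetical corpus in which the most frequent word occurs 6000 times.
for rank, expected in enumerate(zipf_expected(6000, 5), start=1):
    print(f"rank {rank}: about {expected:.0f} occurrences")
```

Comparing these idealized values with the counts measured on a real text gives a quick sense of how closely that text follows the law.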

In NLP (Natural Language Processing), word frequency is the basis for techniques like TF-IDF (Term Frequency - Inverse Document Frequency), used in search engines to weigh the importance of each term in a document relative to an entire corpus. Term weighting of this kind was a staple of classical text retrieval before search engines adopted neural language models such as BERT (released in 2018). Word clouds are frequency visualizations in which the size of each word is proportional to its number of occurrences in the text. Though widely criticized in serious analytical contexts, they remain an intuitive way to visualize the dominant vocabulary of a text.
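
To make the TF-IDF idea concrete, here is a minimal Python sketch. It assumes the plain, unsmoothed variant (term frequency is count divided by document length, inverse document frequency is the natural log of N over the document frequency) and non-empty, pre-tokenized documents. The function name tf_idf and the three-document corpus are hypothetical, and libraries typically use smoothed variants of the formula.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Score each term in each tokenized document with tf * idf."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical three-document corpus, already tokenized and lowercased.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
for doc_scores in tf_idf(corpus):
    print(doc_scores)
```

In this sketch a term that occurs in every document, such as "the", scores zero, while a term unique to one document scores highest within it.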

For more meaningful content analysis, filtering stop words is usually necessary. Stop words are high-frequency words with little semantic content, such as 'the', 'is', 'at', and 'which'. In most natural languages, the 50 most frequent words are largely stop words. Whether to include or exclude them depends on the goal: for writing style analysis, including them makes sense; for content analysis (what the text is about), filtering them is essential. Stop word lists are available in libraries like NLTK and spaCy.
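
A minimal sketch of stop word filtering in Python follows. The STOP_WORDS set is a tiny hand-written list assumed for illustration only, and the function name content_word_counts is hypothetical; a real analysis would load a fuller, language-specific list from a library such as NLTK or spaCy.

```python
from collections import Counter

# Tiny illustrative stop word list (an assumption for this sketch);
# NLTK and spaCy ship much longer, language-specific lists.
STOP_WORDS = {"the", "is", "at", "which", "a", "on", "and", "of"}


def content_word_counts(tokens: list[str]) -> Counter:
    """Count tokens, skipping any that appear in STOP_WORDS."""
    return Counter(t for t in tokens if t not in STOP_WORDS)


tokens = "the cat is on the mat and the cat is asleep".split()
print(content_word_counts(tokens).most_common())
```

For writing style analysis, the same counting can be run without the filter so that the function words stay in the result.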

This tool tokenizes text on spaces and common punctuation, counts occurrences, and displays results in descending frequency order. The tokenization is simple — it does not perform stemming (reducing words to their root) or lemmatization (normalizing conjugations and plurals). 'run', 'running', and 'ran' will be counted as distinct words. For deep linguistic analysis, this is a limitation; for quick content analysis — checking whether a text uses a keyword with the right frequency, identifying excessive repetition, comparing the vocabulary of two texts — it is exactly what you need.
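
The behavior described above can be approximated with the following Python sketch. It is not the tool's actual implementation (which runs as JavaScript in the browser): the punctuation set, the function name word_frequencies, the fold_case parameter, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def word_frequencies(text: str, fold_case: bool = True) -> list[tuple[str, int]]:
    """Split on whitespace and common punctuation, then count tokens."""
    if fold_case:
        text = text.lower()
    # Any run of whitespace or listed punctuation separates tokens.
    tokens = [t for t in re.split(r'[\s.,;:!?()\[\]{}"]+', text) if t]
    # most_common() sorts by descending count; ties keep first-seen order.
    return Counter(tokens).most_common()


# Hypothetical input text, chosen for illustration.
sample = "The cat saw the dog. The dog chased the cat."
for word, count in word_frequencies(sample):
    print(f"{word}: {count}")
```

Because there is no stemming or lemmatization, calling word_frequencies("run running ran") yields three separate entries with a count of one each.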

Technical deep dive

Common questions summarized

  • What is this tool for?: It counts how often each word appears in a text and lists the words from most to least frequent, running fully in your browser.
  • Are my inputs sent to a server?: No. Processing happens locally with JavaScript, and we do not store what you paste into the text areas.
  • Can I use this for real production data?: Use at your own risk. For secrets (passwords, tokens), prefer controlled environments and follow your company policies. Always review the generated output, and never blindly trust what you see on the internet.

Sample payload to try

  • See also the larger "Code Snippets" sample. Excerpt of the expected output: the: 4, cat: 2

Tool guide

  • What frequency analysis is: Counting how often each word appears, useful for summaries and basic text stats.

  • What the tool does: Tokenizes the text, optionally folds case, aggregates counts, and lists words from most to least frequent.

  • Why use it: Spot repeated terms and run quick vocabulary checks, all locally.

Code Snippets

Example output

the: 4
cat: 2
