Top K Frequently Mentioned Keywords
Also known as “Top K Buzzwords”.
Description:
Given a list of reviews, a list of keywords, and an integer k, find the k most popular keywords in order from most to least frequently mentioned.
The comparison of strings is case-insensitive. If two keywords are mentioned an equal number of times in the reviews, sort them alphabetically.
Example 1:
Input:
k = 2
keywords = ["anacell", "cetracular", "betacellular"]
reviews = [
"Anacell provides the best services in the city",
"betacellular has awesome services",
"Best services provided by anacell, everyone should use anacell",
]
Output:
["anacell", "betacellular"]
Explanation:
"anacell" occurs in 2 different reviews and "betacellular" occurs in only 1 review.
Example 2:
Input:
k = 2
keywords = ["anacell", "betacellular", "cetracular", "deltacellular", "eurocell"]
reviews = [
"I love anacell Best services; Best services provided by anacell",
"betacellular has great services",
"deltacellular provides much better services than betacellular",
"cetracular is worse than anacell",
"Betacellular is better than deltacellular.",
]
Output:
["betacellular", "anacell"]
Explanation:
"betacellular" occurs in 3 different reviews. "anacell" and "deltacellular" each occur in 2 reviews, but "anacell" is lexicographically smaller.
Related problems:
- https://leetcode.com/problems/top-k-frequent-words/
- https://leetcode.com/problems/top-k-frequent-elements/
Solution:
To solve this task you need to:
- parse the reviews and break them into words;
- check every word of every review against the list of keywords to count, for each keyword, how many reviews mention it;
- sort the found keywords by their counts and, for ties, by lexicographical order.
First of all, in order to compare the words of the reviews against the keywords, we need fast lookup of the given keywords. The obvious solution is to copy the array of keywords into a hash table: a hash table offers constant average search time, which is why most solutions I have seen on the internet do this.
I don't like this task in the context of C++ because it forces you to work with strings and break them into tokens. C++ is weak here; the code looks ugly, long, and unsafe. But I like this task because it makes you think about the many algorithms and data structures that could be used to solve it.
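To illustrate the parsing step, here is a minimal sketch of breaking a review into lowercase words and testing each one against a hash table of keywords. It is only a sketch under my own naming, not the code from the repository linked below:

#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Break a review into lowercase alphabetic tokens; any other
// character (space, punctuation, digits) acts as a delimiter.
std::vector<std::string> tokenize(const std::string& review) {
    std::vector<std::string> tokens;
    std::string word;
    for (unsigned char c : review) {
        if (std::isalpha(c)) {
            word += static_cast<char>(std::tolower(c));
        } else if (!word.empty()) {
            tokens.push_back(word);
            word.clear();
        }
    }
    if (!word.empty()) tokens.push_back(word);
    return tokens;
}

// Constant average-time membership test against the keyword set.
bool is_keyword(const std::unordered_set<std::string>& keywords,
                const std::string& token) {
    return keywords.count(token) > 0;
}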
To solve the second part, most solutions on the internet use a priority queue as the main data structure. A priority queue keeps its data ordered by a given priority and offers very fast, O(1), access to the item with the highest priority. I don't know why this kind of solution is so popular. Maybe the people who chose this data structure hoped for fast access to the first k elements when returning the result, but that step takes a very small share of the time compared to parsing the text and counting the keywords, and in any case a priority queue exposes only its top item: to get the next item you have to pop the previous one, and popping the top element takes O(log N). Or maybe they chose it for fast sorting, because adding a single element to a priority queue takes O(log N) while quicksort takes O(N log N) to sort and then O(log N) to search. But they don't take into account that O(log N) is the cost of adding only one element: adding all N elements to the queue takes the same O(N log N) as quicksort. They perform the sorting by supplying a custom comparator that gives top priority to the keyword with the smallest count. It looks like this (Java example):
PriorityQueue<Map.Entry<String, Long>> priorityQueue = new PriorityQueue<>(
        (a, b) -> a.getValue().equals(b.getValue())
                ? b.getKey().compareTo(a.getKey())   // tie: alphabetically larger key is worse
                : Long.compare(a.getValue(), b.getValue()));  // min-heap: smallest count on top
for (Map.Entry<String, Long> entry : wordCount.entrySet()) {
    priorityQueue.offer(entry);
    while (priorityQueue.size() > k) {
        priorityQueue.poll();  // evict the current worst keyword
    }
}
and they believe that the loop takes O(N log k), where N is the number of counted keywords. To be fair, that analysis holds: the queue is capped so it never grows beyond k + 1 elements, and each offer and each poll therefore costs O(log k), giving O(N log k) for the whole loop. But when k is comparable to N this is the same O(N log N) as an ordinary sort, and on top of it you pay for the memory management inside the queue on every push and pop.
Then, to get the result, they do something like
List<String> result = new ArrayList<>();
while (!priorityQueue.isEmpty()) {
    result.add(priorityQueue.poll().getKey());
}
Collections.reverse(result);
or
while (!priorityQueue.isEmpty()) result.add(0, priorityQueue.poll().getKey());
Either way, this reverses the elements of result and puts the keywords with the biggest counts first.
The first variant takes O(k log k) for polling all the elements from the queue plus O(k) for reversing them. The second takes O(k log k) for the polls plus O(k²) for the insertions, because adding to the head of an ArrayList shifts every element already in the list.
Thus the time complexity of a typical "best" internet solution based on a priority queue is:
- parse the reviews and break them into words: O(N);
- count the keyword mentions with a hash map: O(N);
- select the top k keywords with the size-capped priority queue: O(N log k);
- extract the result from the queue: O(k log k).
The general complexity is about O(N log k), which becomes O(N log N) when k approaches N.
My solution:
I decided to base my solution on hash tables. A hash table has constant complexity for search, insertion, and deletion of elements if we reserve enough buckets to hold the hashes of all the elements we need. We know the number of keywords in advance, so we can reserve the buckets at the start and avoid rehashing the table later; this matters only if we process billions of keywords, though, and usually amortized constant time is good enough.
Therefore we have constant time complexity for adding, searching, and deleting keywords during the longest phase, parsing the reviews. Then we put all the counted keywords into an array, sort the array with a custom comparator that orders the keywords according to the requirements, and return the first k elements of this array. The sorting takes O(N log N), where N here is the number of distinct keywords found, and taking the first k elements takes O(k).
In short, most of the work is done with hash tables at constant cost per operation, and the expensive sorting is applied only to the final result.
First, I use std::unordered_map to keep all the keywords and their counts. I add all the keywords to the map at the beginning and use the same map to recognize keywords, so I don't need an additional hash table for comparing keywords during the parsing process.
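As a minimal sketch of that initialization (the function name is mine, not the repository's):

#include <string>
#include <unordered_map>
#include <vector>

// Pre-populate the counting table with every keyword at count zero.
// Reserving the buckets up front means the table never has to rehash
// while the reviews are being parsed, since only keywords are inserted.
std::unordered_map<std::string, int> make_counter(const std::vector<std::string>& keywords) {
    std::unordered_map<std::string, int> counts;
    counts.reserve(keywords.size());
    for (const auto& kw : keywords) counts.emplace(kw, 0);
    return counts;
}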
Thus the comparison of all the keywords with all the words in the reviews costs O(N), and there is no priority queue to maintain, with its O(log k) cost and internal memory management on every push and pop.
Then, to sort the result, I use std::set with a custom comparator. IMHO there is no big difference compared with a simple sort of a vector, but I believe some implementations of std::set may insert elements in sorted order faster than std::sort can move the elements of a vector around during sorting.
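Putting it all together, here is a self-contained sketch of the approach described above. The function name and the use of std::sort on a vector (instead of the std::set used in the repository) are my choices, and it assumes the keywords are given in lowercase, as in the examples:

#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// For each keyword, count the number of reviews mentioning it
// (a keyword counts once per review, as in the examples), then sort
// by count descending and name ascending, and return the first k.
std::vector<std::string> top_k_keywords(int k,
                                        const std::vector<std::string>& keywords,
                                        const std::vector<std::string>& reviews) {
    // One hash table both recognizes keywords and stores their counts.
    std::unordered_map<std::string, int> counts;
    counts.reserve(keywords.size());  // avoid rehashing during parsing
    for (const auto& kw : keywords) counts.emplace(kw, 0);

    for (const auto& review : reviews) {
        std::unordered_set<std::string> seen;  // dedupe within one review
        std::string word;
        auto flush = [&]() {
            if (word.empty()) return;
            auto it = counts.find(word);
            if (it != counts.end() && seen.insert(word).second) ++it->second;
            word.clear();
        };
        for (unsigned char c : review) {
            if (std::isalpha(c)) word += static_cast<char>(std::tolower(c));
            else flush();
        }
        flush();
    }

    // Collect the mentioned keywords and sort: bigger count first,
    // ties broken alphabetically.
    std::vector<std::pair<std::string, int>> found;
    for (const auto& p : counts)
        if (p.second > 0) found.emplace_back(p.first, p.second);
    std::sort(found.begin(), found.end(), [](const auto& a, const auto& b) {
        return a.second != b.second ? a.second > b.second : a.first < b.first;
    });

    std::vector<std::string> result;
    for (int i = 0; i < k && i < static_cast<int>(found.size()); ++i)
        result.push_back(found[i].first);
    return result;
}

For Example 1 above, top_k_keywords(2, keywords, reviews) returns ["anacell", "betacellular"]: "anacell" is counted once for each of the two reviews that mention it, even though one of those reviews mentions it twice.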
You can find the repository with the full project here: https://github.com/jolly-fellow/amazon/tree/master/top_n_buzzwords
Return to the table of contents.