SQL query to find the most popular hashtag in a column that stores a list of hashtags as an array

I have been trying to find the most popular hashtag in a table that looks something like this:

| Tweet_id |                 Hashtags                |
-----------------------------------------------------
|    id1   | [hashtag1,hashtag2,hashtag3]            |
|    id2   | [hashtag2,hashtag4]                     |
|    id3   | []                                      |
|    id4   | [hashtag1]                              |                             

So I am trying to print the most frequently occurring hashtag in the table using a MySQL query. From the research I have done, I was only able to retrieve a single hashtag using FIND_IN_SET. But, as can be seen, the number of hashtags differs from row to row, and my query has to search all the hashtags in each array to produce the result.

Note: What I am really doing is loading a JSON file with Spark's sqlContext and registering it as a table. The table looks like the one above. I am using sqlContext.sql("//sqlquery//") to retrieve data from these tables.

Update: This is the schema:

root
 |-- hashtag: array (nullable = true)
 |     |-- element: string (containsNull = true)

Answers


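You can split each list into individual hashtags and count them. The query below assumes a table named tab whose HashTag column holds the list as a bracketed, comma-separated string: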

SELECT sub.val AS `HashTag`, COUNT(*) AS `count`
FROM
(
  -- Pick the n-th comma-separated element of each list
  SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(t.HashTag, ',', n.n), ',', -1) AS val
  -- Strip the surrounding [ and ] before splitting
  FROM (SELECT SUBSTRING(HashTag, 2, LENGTH(HashTag) - 2) AS HashTag FROM tab) AS t
  CROSS JOIN
  (
    -- Numbers 1..100, so up to 100 hashtags per row are handled
    SELECT a.N + b.N * 10 + 1 AS n
    FROM
      (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) a,
      (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) b
  ) n
  -- Keep only as many positions as the list has elements (commas + 1)
  WHERE n.n <= 1 + (LENGTH(t.HashTag) - LENGTH(REPLACE(t.HashTag, ',', '')))
) sub
WHERE val <> ''  -- ignore rows with an empty list
GROUP BY sub.val
ORDER BY `count` DESC
-- LIMIT 1  (uncomment to return only the most frequent hashtag)
;


Output:

╔═══════════╦═══════╗
║ HashTag   ║ count ║
╠═══════════╬═══════╣
║ hashtag1  ║     2 ║
║ hashtag2  ║     2 ║
║ hashtag4  ║     1 ║
║ hashtag3  ║     1 ║
╚═══════════╩═══════╝
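
To sanity-check the split-and-count logic outside the database, here is a minimal Python sketch. It assumes each list arrives as a bracketed, comma-separated string, the same assumption the SQL above makes; the function name count_hashtags is hypothetical and used only for illustration.

```python
from collections import Counter


def count_hashtags(rows):
    """Count hashtag occurrences across strings such as '[a,b,c]'."""
    counts = Counter()
    for raw in rows:
        inner = raw.strip()[1:-1]  # drop the surrounding [ and ]
        for tag in inner.split(","):
            tag = tag.strip()
            if tag:  # skip the empty string produced by '[]'
                counts[tag] += 1
    return counts


# Sample data copied from the table in the question.
rows = [
    "[hashtag1,hashtag2,hashtag3]",
    "[hashtag2,hashtag4]",
    "[]",
    "[hashtag1]",
]

# most_common() sorts by count, highest first.
for tag, count in count_hashtags(rows).most_common():
    print(tag, count)
```

On the sample rows this gives the same counts as the query output above: 2 each for hashtag1 and hashtag2, and 1 each for hashtag3 and hashtag4.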

In any case, you should normalize your table: store one row per (tweet, hashtag) pair instead of a list in a single column.
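
As a rough illustration of what the normalized shape looks like, the following Python sketch flattens each row into one (tweet_id, hashtag) pair per hashtag. The function name normalize and the in-memory row format are assumptions made for the example, not part of the original data model.

```python
def normalize(rows):
    """Return one (tweet_id, hashtag) pair per hashtag in each row."""
    return [(tweet_id, tag) for tweet_id, tags in rows for tag in tags]


# Sample data copied from the table in the question.
rows = [
    ("id1", ["hashtag1", "hashtag2", "hashtag3"]),
    ("id2", ["hashtag2", "hashtag4"]),
    ("id3", []),
    ("id4", ["hashtag1"]),
]

# Tweets with no hashtags (id3) simply contribute no pairs.
for pair in normalize(rows):
    print(pair)
```

With the data stored this way, finding the most popular hashtag needs only a plain GROUP BY on the hashtag column with COUNT(*), and no string splitting.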

