How do I delete duplicate rows from these web logs?
I am currently analyzing some Apache web logs. Some rows are near-duplicates (not exact duplicates, since the datetime values can differ by a few seconds), as you can see in the image below. I am mostly using SQL within Spark, and I want to keep only one row from each group of duplicates.
You can use the dropDuplicates method to remove the duplicates instead of a GROUP BY in the query. Note that because your duplicate rows differ in their datetime values, calling dropDuplicates on whole rows will treat them as distinct; pass it a subset of columns that excludes the timestamp, or select only the columns you care about first. For example:

weblogs_filter_bekijk = sqlContext.sql("select endpoint from basetable5 where ip_address = '220.127.116.11'").dropDuplicates()

Here only the endpoint column is selected, so dropDuplicates() with no arguments already ignores the differing timestamps. See the Spark documentation on dropDuplicates for a detailed explanation of this method.
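To illustrate why the deduplication key must exclude the timestamp, here is a minimal plain-Python sketch of the same idea (the sample rows and column layout are invented for illustration, not taken from the actual logs):

```python
# Keep the first row seen for each (ip, endpoint) pair, ignoring the
# timestamp column -- the same idea as dropDuplicates(["ip", "endpoint"]).
rows = [
    ("220.127.116.11", "/index.html", "2016-03-01 10:00:01"),
    ("220.127.116.11", "/index.html", "2016-03-01 10:00:04"),  # near-duplicate
    ("220.127.116.11", "/about.html", "2016-03-01 10:05:00"),
]

seen = set()
deduped = []
for ip, endpoint, ts in rows:
    key = (ip, endpoint)  # deduplication key excludes the timestamp
    if key not in seen:
        seen.add(key)
        deduped.append((ip, endpoint, ts))

print(deduped)  # one row per (ip, endpoint) pair, first timestamp kept
```

Had the key included the timestamp, all three rows would survive, which is exactly the problem with deduplicating on full rows here.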
You can also use GROUP BY in a SQL query. Note that every column in the SELECT list must either appear in the GROUP BY clause or be aggregated, so "select * ... group by x_column" is not valid SQL. For example:

select x_column from table where x = y group by x_column

Group on every column except the timestamp, and pick one timestamp per group with an aggregate such as min(). Alternatively, SELECT DISTINCT removes exact duplicate rows.
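A runnable sketch of the GROUP BY approach using SQLite as a stand-in (the table name and columns are invented for illustration; against Spark the equivalent query would run on basetable5):

```python
import sqlite3

# In-memory SQLite table standing in for the Spark table; schema is invented.
conn = sqlite3.connect(":memory:")
conn.execute("create table weblogs (ip_address text, endpoint text, ts text)")
conn.executemany(
    "insert into weblogs values (?, ?, ?)",
    [
        ("220.127.116.11", "/index.html", "2016-03-01 10:00:01"),
        ("220.127.116.11", "/index.html", "2016-03-01 10:00:04"),  # near-duplicate
        ("220.127.116.11", "/about.html", "2016-03-01 10:05:00"),
    ],
)

# Group on everything except the timestamp; keep the earliest timestamp per group.
deduped = conn.execute(
    "select ip_address, endpoint, min(ts) "
    "from weblogs "
    "group by ip_address, endpoint "
    "order by min(ts)"
).fetchall()
print(deduped)  # one row per (ip_address, endpoint) pair
```

The min(ts) aggregate is an arbitrary but deterministic way to keep exactly one timestamp per group; max(ts) would work equally well.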