How do I delete duplicate rows from these web logs

I am currently analyzing some Apache web logs. Some rows are duplicates (not exact duplicates, since the datetime can be a few seconds apart), as you can see in the image below. I am mostly using SQL within Spark. I want to keep only one row from each set of duplicates.

(Screenshot of the duplicate log rows not reproduced here.)

Answers


You can use the dropDuplicates method to remove the duplicates instead of a GROUP BY in the query.

    weblogs_filter_bekijk = sqlContext.sql("select endpoint from basetable5 where ip_address = '91.74.184.68'").dropDuplicates()

This should help you. You can refer to the link below for a detailed explanation of the method:

https://spark.apache.org/docs/1.5.1/api/java/org/apache/spark/sql/DataFrame.html
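For the case in the question, where the datetime differs by a few seconds, dropDuplicates can also take a subset of columns to compare, so rows that match on everything except the timestamp are still treated as duplicates. A minimal sketch, assuming an existing sqlContext and that basetable5 has ip_address, endpoint and datetime columns (the datetime column name is an assumption):

    # Assumes sqlContext already exists in the Spark session, as in the snippet above.
    # Column names ip_address/endpoint come from the question; "datetime" is a guess.
    weblogs = sqlContext.sql(
        "select ip_address, endpoint, datetime "
        "from basetable5 "
        "where ip_address = '91.74.184.68'"
    )

    # dropDuplicates() with no arguments only removes rows identical in every column.
    # Since the timestamps differ by a few seconds, compare only the columns that
    # actually identify a duplicate:
    deduped = weblogs.dropDuplicates(["ip_address", "endpoint"])
    deduped.show()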


You can also use GROUP BY in the SQL query itself. Group on the columns that identify a duplicate and you get one row per group, for example:

    select x_column from table where x = y group by x_column
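Applied to the question's data, a minimal sketch (again assuming basetable5 with ip_address, endpoint and a datetime column; the column names are assumptions): grouping on the identifying columns collapses the near-duplicates, and min(datetime) keeps the earliest of the slightly different timestamps.

    # Assumes the same sqlContext and basetable5 schema as above.
    deduped = sqlContext.sql(
        "select ip_address, endpoint, min(datetime) as datetime "
        "from basetable5 "
        "where ip_address = '91.74.184.68' "
        "group by ip_address, endpoint"
    )
    deduped.show()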
