How do I delete duplicate rows from these web logs?

I am currently analyzing some Apache web logs. Some rows are duplicates (not exact duplicates, as the datetime can be a few seconds apart), as you can see in the image below. I am mostly using SQL within Spark. I want to keep only one row from each group of duplicates.

See Image here


You can use the 'dropDuplicates' method to remove the duplicates instead of a GROUP BY within the query. Because the query below selects only the endpoint column, rows that differ only in their datetime collapse into one.

weblogs_filter_bekijk = sqlContext.sql("select endpoint from basetable5 where ip_address = ''").dropDuplicates()

This should help. You can refer to the link below for a detailed explanation of this method.

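You can also use GROUP BY in a SQL query. Every selected column must either appear in the GROUP BY clause or be wrapped in an aggregate function, for example:

select x_column from table where x = y group by x_column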
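To make the GROUP BY idea concrete for logs whose timestamps differ by a few seconds, here is a minimal sketch. The table name weblogs and the columns ip_address, endpoint and log_time are hypothetical, as is the sample data. Python's built-in sqlite3 module stands in for Spark only so that the sketch runs without a cluster; the query uses plain GROUP BY and MIN, so it should carry over to Spark SQL, though that is an assumption worth checking against your schema.

```python
import sqlite3

# Hypothetical table and column names; sqlite3 is only a stand-in for Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weblogs (ip_address TEXT, endpoint TEXT, log_time TEXT)")
conn.executemany(
    "INSERT INTO weblogs VALUES (?, ?, ?)",
    [
        ("10.0.0.1", "/index.html", "2016-01-01 10:00:00"),
        ("10.0.0.1", "/index.html", "2016-01-01 10:00:03"),  # near-duplicate row
        ("10.0.0.1", "/about.html", "2016-01-01 10:01:00"),
        ("10.0.0.2", "/index.html", "2016-01-01 10:02:00"),
    ],
)

# Group on the columns that define a duplicate and aggregate the one that
# differs: MIN keeps the earliest timestamp of each group.
dedup_query = """
    SELECT ip_address, endpoint, MIN(log_time) AS first_seen
    FROM weblogs
    GROUP BY ip_address, endpoint
    ORDER BY ip_address, endpoint
"""
deduped = conn.execute(dedup_query).fetchall()
for row in deduped:
    print(row)
```

Note that this keeps one row per ip_address and endpoint pair no matter how far apart the timestamps are. If repeated requests minutes or hours apart should count as separate rows, the grouping would also need a time bucket.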
