Apply a function to a single column of a csv in Spark

Using Spark, I'm reading a CSV and want to apply a function to one of its columns. I have some code that works, but it's very hacky. What is the proper way to do this?

My code

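import sys
from pyspark import SparkContext
from pyspark.sql import Row, SparkSession

# Ship myfile.py to the executors so myFunction can be imported there
SparkContext().addPyFile("myfile.py")
spark = SparkSession\
    .builder\
    .appName("myApp")\
    .getOrCreate()
from myfile import myFunction

df = spark.read.csv(sys.argv[1], header=True,
    mode="DROPMALFORMED")
# Rebuild every row by hand just to transform the fourth column
a = df.rdd.map(lambda line: Row(id=line[0], user_id=line[1], message_id=line[2], message=myFunction(line[3]))).toDF()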

I would like to be able to just call the function on the column name instead of mapping each row to line and then calling the function on line[index].

I'm using Spark version 2.0.1

Answers


You can use a user-defined function (UDF) combined with withColumn:

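from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

# Declare the type myFunction actually returns; IntegerType() fits if it returns an int
udf_myFunction = udf(myFunction, IntegerType())
# With header=True the columns are named after the CSV header rather than "_3";
# df.columns[3] is the fourth column, i.e. line[3] in the question
df = df.withColumn("message", udf_myFunction(df.columns[3]))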

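withColumn returns a new DataFrame in which the message column holds the result of myFunction applied to each value of the source column (line[3] in the question). If a column named message already exists it is replaced; otherwise a new column is added. The original df is not modified, so keep the returned DataFrame.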
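
For reference, here is a minimal end-to-end sketch of the UDF approach. It makes a few assumptions that are not in the question: my_function is a hypothetical stand-in for myFunction that returns the message length, and the CSV header is assumed to name the column "message". The Spark imports sit inside main() only so that the plain Python function can be tried without a Spark installation.

```python
import sys


def my_function(message):
    """Hypothetical stand-in for myFunction: returns the length of a message."""
    # Spark passes None for null cells, so guard against it.
    return len(message) if message is not None else None


def main(csv_path):
    # Imported here so my_function above can be exercised without Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.appName("myApp").getOrCreate()
    df = spark.read.csv(csv_path, header=True, mode="DROPMALFORMED")

    # Wrap the Python function; the declared return type must match
    # what the function really returns (an int in this sketch).
    udf_my_function = udf(my_function, IntegerType())

    # Assumption: the CSV header contains a column called "message".
    result = df.withColumn("message", udf_my_function(df["message"]))
    result.show()
    spark.stop()


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

If the real function returns a string, declare StringType() instead. When the declared type does not match the returned value, Spark tends to produce nulls in the column rather than raise an error, so it is worth checking the output.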

