Apply a function to a single column of a csv in Spark

Using Spark, I'm reading a CSV and want to apply a function to one of its columns. I have some code that works, but it's very hacky. What is the proper way to do this?

My code

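import sys
from pyspark.sql import SparkSession, Row
from myfile import myFunction

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv(sys.argv[1], header=True)  # path assumed to be the first command-line argument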
a = df.rdd.map(lambda line: Row(id=line[0], user_id=line[1], message_id=line[2], message=myFunction(line[3]))).toDF()

I would like to be able to just call the function on the column name instead of mapping each row to line and then calling the function on line[index].

I'm using Spark version 2.0.1


You can wrap your function in a user-defined function (UDF) and apply it to the column with withColumn:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

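udf_myFunction = udf(myFunction, IntegerType())  # declare the type myFunction actually returns; IntegerType() if it returns an int
df = df.withColumn("message", udf_myFunction("message"))  # pass the input column's name; "message" assumes the CSV header gives that name to the fourth column (line[3])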

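withColumn returns a new DataFrame rather than modifying df in place, so assign the result. The resulting column holds myFunction applied to each row's input value, which is the equivalent of myFunction(line[3]) in your code. If a column with the given name already exists, it is replaced; otherwise a new column is added.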

