Wednesday, September 23, 2020

Spark Column Order matters even for ORC files

Spark column order matters in files: even with a proper header for CSV, and even for ORC files, columns are matched by position rather than by name.

Test with CSV files with headers

orcColumnOrderTest]$ ls test1
bookOrder11.txt  bookOrder12.txt

$ cat bookOrder11.txt
id,author,bookNm,reprintYear,publishYear
1,john,aa11,2020,2010
2,david,aa12,2019,2010
3,bob,ba11,2020,2015
4,rose,ba12,2019,2015

$ cat bookOrder12.txt
id,bookNm,author,publishYear,reprintYear
1,da11,john1,2010,2020
2,da12,david1,2010,2019
3,ea11,bob1,2015,2020
4,da12,rose1,2015,2019
5,fa11,alice1,2000,2020


scala> val books=spark.read.option("header", "true").csv("/<path>/sree/testing/orcColumnOrderTest/test1")
books: org.apache.spark.sql.DataFrame = [id: string, bookNm: string ... 3 more fields]

scala> books.printSchema
root
 |-- id: string (nullable = true)
 |-- bookNm: string (nullable = true)
 |-- author: string (nullable = true)
 |-- publishYear: string (nullable = true)
 |-- reprintYear: string (nullable = true)

scala> books.show
+---+------+------+-----------+-----------+
| id|bookNm|author|publishYear|reprintYear|
+---+------+------+-----------+-----------+
|  1|  da11| john1|       2010|       2020|
|  2|  da12|david1|       2010|       2019|
|  3|  ea11|  bob1|       2015|       2020|
|  4|  da12| rose1|       2015|       2019|
|  5|  fa11|alice1|       2000|       2020|
|  1|  john|  aa11|       2020|       2010|
|  2| david|  aa12|       2019|       2010|
|  3|   bob|  ba11|       2020|       2015|
|  4|  rose|  ba12|       2019|       2015|
+---+------+------+-----------+-----------+

As shown above, the columns got mixed up and the data is unusable: Spark took the header from one file and applied that schema by position to the rows of every file in the directory, so author values landed under bookNm (and vice versa) for the other file.

Now let's check it another way, reading each file into its own DataFrame and unioning them.

orcColumnOrderTest]$ cat test11/bookOrder11.txt
id,author,bookNm,reprintYear,publishYear
1,john,aa11,2020,2010
2,david,aa12,2019,2010
3,bob,ba11,2020,2015
4,rose,ba12,2019,2015

orcColumnOrderTest]$ cat test12/bookOrder12.txt
id,bookNm,author,publishYear,reprintYear
1,da11,john1,2010,2020
2,da12,david1,2010,2019
3,ea11,bob1,2015,2020
4,da12,rose1,2015,2019
5,fa11,alice1,2000,2020

scala> val test11=spark.read.option("header", 
"true").csv("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test11")
test11: org.apache.spark.sql.DataFrame = [id: string, author: string ... 3 more fields]

scala> val test12=spark.read.option("header", "true").csv("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test12")
test12: org.apache.spark.sql.DataFrame = [id: string, bookNm: string ... 3 more fields]

scala> test11.union(test12).show  --> union pairs columns by position, producing a mix-and-match of values under the wrong headers
+---+------+------+-----------+-----------+
| id|author|bookNm|reprintYear|publishYear|
+---+------+------+-----------+-----------+
|  1|  john|  aa11|       2020|       2010|
|  2| david|  aa12|       2019|       2010|
|  3|   bob|  ba11|       2020|       2015|
|  4|  rose|  ba12|       2019|       2015|
|  1|  da11| john1|       2010|       2020|
|  2|  da12|david1|       2010|       2019|
|  3|  ea11|  bob1|       2015|       2020|
|  4|  da12| rose1|       2015|       2019|
|  5|  fa11|alice1|       2000|       2020|
+---+------+------+-----------+-----------+
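The reason is that `DataFrame.union` resolves columns by position, not by name. Since Spark 2.3, `unionByName` matches columns by name instead. A minimal sketch, using the same `test11`/`test12` DataFrames as above:

```scala
// union() pairs the i-th column of one DataFrame with the i-th column of the
// other, ignoring names. unionByName() (Spark 2.3+) matches columns by name,
// so differing column orders in the two DataFrames are handled correctly.
val fixed = test11.unionByName(test12)
fixed.show()
// author values now stay under "author" and bookNm under "bookNm"
// for rows coming from both files.
```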

Now let's write each DataFrame as ORC and test again.
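The write step itself isn't shown in the transcript; presumably the two DataFrames were saved with something like the following (paths taken from the directory listings below):

```scala
// Write each DataFrame out as ORC (Spark compresses with snappy by default).
test11.write.orc("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test11Orc")
test12.write.orc("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test12Orc")
```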

orcColumnOrderTest]$ ls test11Orc
part-00000-1fe6cc9b-160a-444c-a3f2-e2a999ad4542-c000.snappy.orc  _SUCCESS

orcColumnOrderTest]$ ls test12Orc
part-00000-841b62e3-2f03-4964-98d7-9e9fb9e65b99-c000.snappy.orc  _SUCCESS

scala> val orc11=spark.read.orc("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test11Orc")

scala> orc11.printSchema
root
 |-- id: string (nullable = true)
 |-- author: string (nullable = true)
 |-- bookNm: string (nullable = true)
 |-- reprintYear: string (nullable = true)
 |-- publishYear: string (nullable = true)


scala> val orc12=spark.read.orc("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test12Orc")

scala> orc12.printSchema
root
 |-- id: string (nullable = true)
 |-- bookNm: string (nullable = true)
 |-- author: string (nullable = true)
 |-- publishYear: string (nullable = true)
 |-- reprintYear: string (nullable = true)

scala> orc11.union(orc12).show --> same mix-up: even though ORC is a columnar, self-describing format, DataFrame.union still matches columns by position, not by name
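Because ORC files are self-describing, each DataFrame above does carry the correct column names; the fix is simply to align the columns by name before (or instead of) a positional union. A sketch, assuming the `orc11`/`orc12` DataFrames from the reads above:

```scala
import org.apache.spark.sql.functions.col

// Option 1: reorder orc12's columns to match orc11's order, then the
// positional union is safe.
val aligned = orc11.union(orc12.select(orc11.columns.map(col): _*))

// Option 2 (Spark 2.3+): let Spark match the columns by name.
val aligned2 = orc11.unionByName(orc12)
```

Either way, the union result keeps author values under `author` and book names under `bookNm`, which is what the positional `union` above failed to do.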