Spark: Column Order Matters in Files, Even with a Proper CSV Header -- and Even for ORC Files
Test with CSV files that have headers
orcColumnOrderTest]$ ls test1
bookOrder11.txt bookOrder12.txt
$ cat bookOrder11.txt
id,author,bookNm,reprintYear,publishYear
1,john,aa11,2020,2010
2,david,aa12,2019,2010
3,bob,ba11,2020,2015
4,rose,ba12,2019,2015
$ cat bookOrder12.txt
id,bookNm,author,publishYear,reprintYear
1,da11,john1,2010,2020
2,da12,david1,2010,2019
3,ea11,bob1,2015,2020
4,da12,rose1,2015,2019
5,fa11,alice1,2000,2020
scala> val books=spark.read.option("header", "true").csv("/<path>/sree/testing/orcColumnOrderTest/test1")
books: org.apache.spark.sql.DataFrame = [id: string, bookNm: string ... 3 more fields]
scala> books.printSchema
root
|-- id: string (nullable = true)
|-- bookNm: string (nullable = true)
|-- author: string (nullable = true)
|-- publishYear: string (nullable = true)
|-- reprintYear: string (nullable = true)
scala> books.show
+---+------+------+-----------+-----------+
| id|bookNm|author|publishYear|reprintYear|
+---+------+------+-----------+-----------+
| 1| da11| john1| 2010| 2020|
| 2| da12|david1| 2010| 2019|
| 3| ea11| bob1| 2015| 2020|
| 4| da12| rose1| 2015| 2019|
| 5| fa11|alice1| 2000| 2020|
| 1| john| aa11| 2020| 2010|
| 2| david| aa12| 2019| 2010|
| 3| bob| ba11| 2020| 2015|
| 4| rose| ba12| 2019| 2015|
+---+------+------+-----------+-----------+
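One way around this is to read each file separately, so that each file's own header is honored, and then combine the DataFrames by column name rather than by position. A minimal self-contained sketch of that idea (it recreates the two files from the post in a temp directory and uses a local SparkSession instead of the spark-shell session):

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object PerFileCsvRead {
  // Local session just for this sketch; the post uses spark-shell.
  lazy val spark = SparkSession.builder.master("local[1]").appName("per-file-csv").getOrCreate()

  // Write the two headers-in-different-order files from the post to a temp dir,
  // read each one separately so its own header is honored, then unionByName.
  def combined(): DataFrame = {
    val dir = Files.createTempDirectory("orcColumnOrderTest")
    Files.write(dir.resolve("bookOrder11.txt"),
      "id,author,bookNm,reprintYear,publishYear\n1,john,aa11,2020,2010\n".getBytes)
    Files.write(dir.resolve("bookOrder12.txt"),
      "id,bookNm,author,publishYear,reprintYear\n1,da11,john1,2010,2020\n".getBytes)
    val a = spark.read.option("header", "true").csv(dir.resolve("bookOrder11.txt").toString)
    val b = spark.read.option("header", "true").csv(dir.resolve("bookOrder12.txt").toString)
    a.unionByName(b) // align by column name, not by position
  }
}
```

With this, "john1" ends up under author and "da11" under bookNm, as the second file's header says.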
So, as shown above, the columns got mixed up and the data is unusable. Spark inferred the schema from one file's header (here evidently bookOrder12.txt's) and then applied it positionally to the rows of both files, so the values from bookOrder11.txt landed under the wrong columns.
Now let's check it another way, reading each file from its own directory.
orcColumnOrderTest]$ cat test11/bookOrder11.txt
id,author,bookNm,reprintYear,publishYear
1,john,aa11,2020,2010
2,david,aa12,2019,2010
3,bob,ba11,2020,2015
4,rose,ba12,2019,2015
orcColumnOrderTest]$ cat test12/bookOrder12.txt
id,bookNm,author,publishYear,reprintYear
1,da11,john1,2010,2020
2,da12,david1,2010,2019
3,ea11,bob1,2015,2020
4,da12,rose1,2015,2019
5,fa11,alice1,2000,2020
scala> val test11=spark.read.option("header", "true").csv("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test11")
test11: org.apache.spark.sql.DataFrame = [id: string, author: string ... 3 more fields]
scala> val test12=spark.read.option("header", "true").csv("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test12")
test12: org.apache.spark.sql.DataFrame = [id: string, bookNm: string ... 3 more fields]
scala> test11.union(test12).show --> union matches columns by position, so test12's values land under the wrong columns
+---+------+------+-----------+-----------+
| id|author|bookNm|reprintYear|publishYear|
+---+------+------+-----------+-----------+
| 1| john| aa11| 2020| 2010|
| 2| david| aa12| 2019| 2010|
| 3| bob| ba11| 2020| 2015|
| 4| rose| ba12| 2019| 2015|
| 1| da11| john1| 2010| 2020|
| 2| da12|david1| 2010| 2019|
| 3| ea11| bob1| 2015| 2020|
| 4| da12| rose1| 2015| 2019|
| 5| fa11|alice1| 2000| 2020|
+---+------+------+-----------+-----------+
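The remedy for this case is `unionByName` (available since Spark 2.3), which matches columns by name instead of position. A minimal self-contained sketch, using small in-memory DataFrames with the same column layout as test11/test12 and a local SparkSession in place of the spark-shell session:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionByNameDemo {
  // Local session just for this sketch; the post uses spark-shell.
  lazy val spark = SparkSession.builder.master("local[1]").appName("unionByName-demo").getOrCreate()

  // Mirror test11/test12: identical column names, different order.
  def aligned(): DataFrame = {
    import spark.implicits._
    val test11 = Seq(("1", "john", "aa11")).toDF("id", "author", "bookNm")
    val test12 = Seq(("1", "da11", "john1")).toDF("id", "bookNm", "author")
    // union would place "da11" under author; unionByName matches by column name.
    test11.unionByName(test12)
  }
}
```

The result keeps test11's column order (id, author, bookNm), with test12's values under the right headers.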
Now let's write each DataFrame as ORC and test again
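The write step itself isn't shown in the transcript; it would have been a plain `test11.write.orc(...)` to each output directory. A self-contained sketch of that round trip (a temp directory stands in for the test11Orc path, and a local SparkSession for the spark-shell session), showing that ORC embeds the schema and preserves the written column order:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object OrcWriteDemo {
  // Local session just for this sketch; the post uses spark-shell.
  lazy val spark = SparkSession.builder.master("local[1]").appName("orc-write-demo").getOrCreate()

  // Write a DataFrame as ORC and read it back;
  // a temp dir stands in for the test11Orc directory from the post.
  def roundTrip(): Array[String] = {
    import spark.implicits._
    val test11 = Seq(("1", "john", "aa11", "2020", "2010"))
      .toDF("id", "author", "bookNm", "reprintYear", "publishYear")
    val out = Files.createTempDirectory("test11Orc").toString
    test11.write.mode("overwrite").orc(out) // ORC stores the schema in the file...
    spark.read.orc(out).columns             // ...and reading back preserves the column order
  }
}
```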
orcColumnOrderTest]$ ls test11Orc
part-00000-1fe6cc9b-160a-444c-a3f2-e2a999ad4542-c000.snappy.orc _SUCCESS
orcColumnOrderTest]$ ls test12Orc
part-00000-841b62e3-2f03-4964-98d7-9e9fb9e65b99-c000.snappy.orc _SUCCESS
scala> val orc11=spark.read.orc("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test11Orc")
scala> orc11.printSchema
root
|-- id: string (nullable = true)
|-- author: string (nullable = true)
|-- bookNm: string (nullable = true)
|-- reprintYear: string (nullable = true)
|-- publishYear: string (nullable = true)
scala> val orc12=spark.read.orc("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test12Orc")
scala> orc12.printSchema
root
|-- id: string (nullable = true)
|-- bookNm: string (nullable = true)
|-- author: string (nullable = true)
|-- publishYear: string (nullable = true)
|-- reprintYear: string (nullable = true)
scala> orc11.union(orc12).show --> even though ORC is a columnar format that stores the schema, union still matched columns by position and mixed the data
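Since the ORC files carry their schemas, the fix is the same as for the CSV case: align by name before combining, either with `unionByName` or by reordering one side's columns with `select`. A self-contained sketch of the `select` variant, using in-memory stand-ins for orc11/orc12 with the same schemas as in the transcript:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object OrcUnionFix {
  // Local session just for this sketch; the post uses spark-shell.
  lazy val spark = SparkSession.builder.master("local[1]").appName("orc-union-fix").getOrCreate()

  // Stand-ins for orc11/orc12 with the schemas from the transcript.
  def fixed(): DataFrame = {
    import spark.implicits._
    val orc11 = Seq(("1", "john", "aa11", "2020", "2010"))
      .toDF("id", "author", "bookNm", "reprintYear", "publishYear")
    val orc12 = Seq(("1", "da11", "john1", "2010", "2020"))
      .toDF("id", "bookNm", "author", "publishYear", "reprintYear")
    // Reorder orc12's columns to match orc11 before the positional union;
    // orc11.unionByName(orc12) would achieve the same in one call.
    orc11.union(orc12.select(orc11.columns.map(col): _*))
  }
}
```

After the reorder, "john1" lands under author and "da11" under bookNm, so the combined data is usable again.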