Wednesday, September 23, 2020

Spark Column Order matters even for ORC files

Spark column order matters in files: even with a proper header for CSV, and even for ORC files, columns are matched by position rather than by name.

Test with CSV files with headers

orcColumnOrderTest]$ ls test1
bookOrder11.txt  bookOrder12.txt

$ cat bookOrder11.txt
id,author,bookNm,reprintYear,publishYear
1,john,aa11,2020,2010
2,david,aa12,2019,2010
3,bob,ba11,2020,2015
4,rose,ba12,2019,2015

$ cat bookOrder12.txt
id,bookNm,author,publishYear,reprintYear
1,da11,john1,2010,2020
2,da12,david1,2010,2019
3,ea11,bob1,2015,2020
4,da12,rose1,2015,2019
5,fa11,alice1,2000,2020


scala> val books=spark.read.option("header", "true").csv("/<path>/sree/testing/orcColumnOrderTest/test1")
books: org.apache.spark.sql.DataFrame = [id: string, bookNm: string ... 3 more fields]

scala> books.printSchema
root
 |-- id: string (nullable = true)
 |-- bookNm: string (nullable = true)
 |-- author: string (nullable = true)
 |-- publishYear: string (nullable = true)
 |-- reprintYear: string (nullable = true)

scala> books.show
+---+------+------+-----------+-----------+
| id|bookNm|author|publishYear|reprintYear|
+---+------+------+-----------+-----------+
|  1|  da11| john1|       2010|       2020|
|  2|  da12|david1|       2010|       2019|
|  3|  ea11|  bob1|       2015|       2020|
|  4|  da12| rose1|       2015|       2019|
|  5|  fa11|alice1|       2000|       2020|
|  1|  john|  aa11|       2020|       2010|
|  2| david|  aa12|       2019|       2010|
|  3|   bob|  ba11|       2020|       2015|
|  4|  rose|  ba12|       2019|       2015|
+---+------+------+-----------+-----------+

As shown above, the columns got mixed up and the data is unusable: Spark took the header from one file and applied that schema by position to the rows of every file in the directory, so author values landed under bookNm (and vice versa) for the other file.

Now let's check it another way, reading each file into its own DataFrame and unioning them.

orcColumnOrderTest]$ cat test11/bookOrder11.txt
id,author,bookNm,reprintYear,publishYear
1,john,aa11,2020,2010
2,david,aa12,2019,2010
3,bob,ba11,2020,2015
4,rose,ba12,2019,2015

orcColumnOrderTest]$ cat test12/bookOrder12.txt
id,bookNm,author,publishYear,reprintYear
1,da11,john1,2010,2020
2,da12,david1,2010,2019
3,ea11,bob1,2015,2020
4,da12,rose1,2015,2019
5,fa11,alice1,2000,2020

scala> val test11=spark.read.option("header", 
"true").csv("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test11")
test11: org.apache.spark.sql.DataFrame = [id: string, author: string ... 3 more fields]

scala> val test12=spark.read.option("header", "true").csv("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test12")
test12: org.apache.spark.sql.DataFrame = [id: string, bookNm: string ... 3 more fields]

scala> test11.union(test12).show  --> union pairs columns by position, producing a mix-and-match of values under the wrong headers
+---+------+------+-----------+-----------+
| id|author|bookNm|reprintYear|publishYear|
+---+------+------+-----------+-----------+
|  1|  john|  aa11|       2020|       2010|
|  2| david|  aa12|       2019|       2010|
|  3|   bob|  ba11|       2020|       2015|
|  4|  rose|  ba12|       2019|       2015|
|  1|  da11| john1|       2010|       2020|
|  2|  da12|david1|       2010|       2019|
|  3|  ea11|  bob1|       2015|       2020|
|  4|  da12| rose1|       2015|       2019|
|  5|  fa11|alice1|       2000|       2020|
+---+------+------+-----------+-----------+
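The reason is that `DataFrame.union` resolves columns by position, not by name. Since Spark 2.3, `unionByName` matches columns by name instead. A minimal sketch, using the same `test11`/`test12` DataFrames as above:

```scala
// union() pairs the i-th column of one DataFrame with the i-th column of the
// other, ignoring names. unionByName() (Spark 2.3+) matches columns by name,
// so differing column orders in the two DataFrames are handled correctly.
val fixed = test11.unionByName(test12)
fixed.show()
// author values now stay under "author" and bookNm under "bookNm"
// for rows coming from both files.
```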

Now let's write each DataFrame as ORC and test again.
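The write step itself isn't shown in the transcript; presumably the two DataFrames were saved with something like the following (paths taken from the directory listings below):

```scala
// Write each DataFrame out as ORC (Spark compresses with snappy by default).
test11.write.orc("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test11Orc")
test12.write.orc("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test12Orc")
```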

orcColumnOrderTest]$ ls test11Orc
part-00000-1fe6cc9b-160a-444c-a3f2-e2a999ad4542-c000.snappy.orc  _SUCCESS

orcColumnOrderTest]$ ls test12Orc
part-00000-841b62e3-2f03-4964-98d7-9e9fb9e65b99-c000.snappy.orc  _SUCCESS

scala> val orc11=spark.read.orc("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test11Orc")

scala> orc11.printSchema
root
 |-- id: string (nullable = true)
 |-- author: string (nullable = true)
 |-- bookNm: string (nullable = true)
 |-- reprintYear: string (nullable = true)
 |-- publishYear: string (nullable = true)


scala> val orc12=spark.read.orc("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test12Orc")

scala> orc12.printSchema
root
 |-- id: string (nullable = true)
 |-- bookNm: string (nullable = true)
 |-- author: string (nullable = true)
 |-- publishYear: string (nullable = true)
 |-- reprintYear: string (nullable = true)

scala> orc11.union(orc12).show --> same mix-up: even though ORC is a columnar, self-describing format, DataFrame.union still matches columns by position, not by name
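Because ORC files are self-describing, each DataFrame above does carry the correct column names; the fix is simply to align the columns by name before (or instead of) a positional union. A sketch, assuming the `orc11`/`orc12` DataFrames from the reads above:

```scala
import org.apache.spark.sql.functions.col

// Option 1: reorder orc12's columns to match orc11's order, then the
// positional union is safe.
val aligned = orc11.union(orc12.select(orc11.columns.map(col): _*))

// Option 2 (Spark 2.3+): let Spark match the columns by name.
val aligned2 = orc11.unionByName(orc12)
```

Either way, the union result keeps author values under `author` and book names under `bookNm`, which is what the positional `union` above failed to do.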