Sunday, November 8, 2020

Realtime app issues under high load - SocketException: Too many open files

 


java.net.SocketException: Too many open files?

ulimit - check the shell's open-file limits with ulimit -n (soft limit) and ulimit -Hn (hard limit)

If you know the process IDs (PID) of the specific user, you can get the limits for each process with:
cat /proc/<PID>/limits   (the default hard limit for "Max open files" seems to be 4096)

You can get the number of open files for each PID with:

ls -1 /proc/<PID>/fd | wc -l


Or: /usr/sbin/lsof -u <userName> | wc -l


Reference: how-to-fix-javanetsocketexception-too-many-open-files-java-tomcat


urlconnection-leads-too-many-open-files - stackOverflow reference

Make sure to disconnect the connection object:

finally {
    if (con != null) con.disconnect();
}
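
A fuller sketch in Scala (the same language as the Spark notes below); fetch and the URL handling here are made up for illustration, the point is closing the stream and then disconnecting in finally:

import java.net.{HttpURLConnection, URL}
import scala.io.Source

def fetch(urlStr: String): String = {
  val con = new URL(urlStr).openConnection().asInstanceOf[HttpURLConnection]
  try {
    val in = con.getInputStream              // opening the stream is what uses the socket / file descriptor
    try Source.fromInputStream(in).mkString
    finally in.close()                       // close the stream first
  } finally {
    con.disconnect()                         // then release (and possibly close) the underlying socket
  }
}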


URLConnection or HTTPClient (stackOverflow reference)


con.disconnect() 

https://techblog.bozho.net/caveats-of-httpurlconnection/ 
just for quick reference - copied from above - This is still unclear, but gives us a hint that there’s something more. After reading a couple of stackoverflow and java.net answers (1, 2, 3, 4) and also the android documentation of the same class, which is actually different from the Oracle implementation, it turns out that .disconnect() actually closes (or may close, in the case of android) the underlying socket.


 

Monday, October 5, 2020

User Story - INVEST criteria

 

User stories describe one thing a user wants the system to do.

They are typically written in a structured way using the form: As a <type of user>, I want <to do something>, so that <I get some benefit>.

Another commonly used form is: Given <some context>, when <I do something>, then <this should happen>.

So, when writing stories, give each story a title that describes its purpose. As a starting point, follow this with a concise one-sentence description of the story in one of the forms just described; this form captures the user role, what they want to do, and why they want to do it.

As an example, consider a banking system and a story to determine the available balance of a bank account. The title of the story could be "Balance Inquiry". Following the template, we describe the story as: As an account holder, I want to check my available balance at any time of day, so I am sure not to overdraw my account.

This explains the role, what they want to do, and why they want to do it.

User stories provide a clear and simple way of agreeing on requirements with a customer / end user. The INVEST criteria can be used to evaluate good user stories; let me go through each letter of these criteria.

Evaluate the stories with the INVEST criteria

Independent - A story should be independent of others, to prevent problems with prioritization and planning

Negotiable - Stories are not written contracts but are used to stimulate discussion between customer and developers until there is a clear agreement; they encourage collaboration

Valuable - Stories should provide value to users. Think about outcomes and impact, not outputs and deliverables

Estimable - The story must be estimable. If it is not, that often indicates missing details or that the story is too large

Small - Good stories should be small. This keeps scope small and therefore less ambiguous, and supports fast feedback from users

Testable - Stories must be testable so that developers can verify that the story has been implemented correctly and validate that the requirement has been met / is done


Created for my quick reference only.

ref: google documents

Wednesday, September 23, 2020

Spark Column Order matters even for ORC files

Spark column order matters in data files, even if there is a proper header for CSV, and even for ORC files.

Test with CSV files with headers:

orcColumnOrderTest]$ ls test1
bookOrder11.txt  bookOrder12.txt

$ cat bookOrder11.txt
id,author,bookNm,reprintYear,publishYear
1,john,aa11,2020,2010
2,david,aa12,2019,2010
3,bob,ba11,2020,2015
4,rose,ba12,2019,2015

$ cat bookOrder12.txt
id,bookNm,author,publishYear,reprintYear
1,da11,john1,2010,2020
2,da12,david1,2010,2019
3,ea11,bob1,2015,2020
4,da12,rose1,2015,2019
5,fa11,alice1,2000,2020


scala> val books=spark.read.option("header", "true").csv("/<path>/sree/testing/orcColumnOrderTest/test1")
books: org.apache.spark.sql.DataFrame = [id: string, bookNm: string ... 3 more fields]

scala> books.printSchema
root
 |-- id: string (nullable = true)
 |-- bookNm: string (nullable = true)
 |-- author: string (nullable = true)
 |-- publishYear: string (nullable = true)
 |-- reprintYear: string (nullable = true)

scala> books.show
+---+------+------+-----------+-----------+
| id|bookNm|author|publishYear|reprintYear|
+---+------+------+-----------+-----------+
|  1|  da11| john1|       2010|       2020|
|  2|  da12|david1|       2010|       2019|
|  3|  ea11|  bob1|       2015|       2020|
|  4|  da12| rose1|       2015|       2019|
|  5|  fa11|alice1|       2000|       2020|
|  1|  john|  aa11|       2020|       2010|
|  2| david|  aa12|       2019|       2010|
|  3|   bob|  ba11|       2020|       2015|
|  4|  rose|  ba12|       2019|       2015|
+---+------+------+-----------+-----------+

So, as shown above, all the columns got mixed up and the data is not usable: Spark applied a single header (from one of the files) to the whole directory and read every file's columns by position.

Now let's check it in another way.

orcColumnOrderTest]$ cat test11/bookOrder11.txt
id,author,bookNm,reprintYear,publishYear
1,john,aa11,2020,2010
2,david,aa12,2019,2010
3,bob,ba11,2020,2015
4,rose,ba12,2019,2015

orcColumnOrderTest]$ cat test12/bookOrder12.txt
id,bookNm,author,publishYear,reprintYear
1,da11,john1,2010,2020
2,da12,david1,2010,2019
3,ea11,bob1,2015,2020
4,da12,rose1,2015,2019
5,fa11,alice1,2000,2020

scala> val test11=spark.read.option("header", "true").csv("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test11")
test11: org.apache.spark.sql.DataFrame = [id: string, author: string ... 3 more fields]

scala> val test12=spark.read.option("header", "true").csv("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test12")
test12: org.apache.spark.sql.DataFrame = [id: string, bookNm: string ... 3 more fields]

scala> test11.union(test12).show  --> union matches columns by position, not by name, so this mixed and matched the values of the differently ordered columns (see the unionByName sketch at the end of this post)
+---+------+------+-----------+-----------+
| id|author|bookNm|reprintYear|publishYear|
+---+------+------+-----------+-----------+
|  1|  john|  aa11|       2020|       2010|
|  2| david|  aa12|       2019|       2010|
|  3|   bob|  ba11|       2020|       2015|
|  4|  rose|  ba12|       2019|       2015|
|  1|  da11| john1|       2010|       2020|
|  2|  da12|david1|       2010|       2019|
|  3|  ea11|  bob1|       2015|       2020|
|  4|  da12| rose1|       2015|       2019|
|  5|  fa11|alice1|       2000|       2020|
+---+------+------+-----------+-----------+

Now let's write both DataFrames as ORC and test again.
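
The write step wasn't captured in these notes; presumably something like this produced the two directories (a sketch; the exact calls may have differed):

scala> test11.write.orc("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test11Orc")

scala> test12.write.orc("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test12Orc")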

orcColumnOrderTest]$ ls test11Orc
part-00000-1fe6cc9b-160a-444c-a3f2-e2a999ad4542-c000.snappy.orc  _SUCCESS

orcColumnOrderTest]$ ls test12Orc
part-00000-841b62e3-2f03-4964-98d7-9e9fb9e65b99-c000.snappy.orc  _SUCCESS

scala> val orc11=spark.read.orc("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test11Orc")

scala> orc11.printSchema
root
 |-- id: string (nullable = true)
 |-- author: string (nullable = true)
 |-- bookNm: string (nullable = true)
 |-- reprintYear: string (nullable = true)
 |-- publishYear: string (nullable = true)


scala> val orc12=spark.read.orc("/axp/ccsg/metrhub/app/hub/sree/testing/orcColumnOrderTest/test12Orc")

scala> orc12.printSchema
root
 |-- id: string (nullable = true)
 |-- bookNm: string (nullable = true)
 |-- author: string (nullable = true)
 |-- publishYear: string (nullable = true)
 |-- reprintYear: string (nullable = true)

scala> orc11.union(orc12).show --> it created the same mess even though ORC is a columnar format that stores column names in the files; union is still positional
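
A sketch of the fix, assuming Spark 2.3+ where Dataset.unionByName is available: it resolves columns by name instead of by position, so the differently ordered schemas line up for both the CSV and the ORC case.

scala> orc11.unionByName(orc12).show   // john1 now lands under author and da11 under bookNm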



Saturday, August 1, 2020

Character Encoding - ASCII, Unicode, Extended-ASCII etc



Ascii and Unicode

ASCII

https://www.ascii-code.com/ - ASCII - 7 bits (128 chars) and extended ASCII - 8 bits (256 chars)

ASCII control characters (character code 0-31)

The first 32 characters in the ASCII table are unprintable control codes and are used to control peripherals such as printers.

ASCII printable characters (character code 32-127)

Codes 32-127 are common to all the different variations of the ASCII table. They are called printable characters and represent letters, digits, punctuation marks, and a few miscellaneous symbols; you will find almost every character on your keyboard here. Character 127 represents the command DEL.

The extended ASCII codes (character code 128-255)

There are several different variations of the 8-bit ASCII table. A common one is Windows-1252 (CP-1252), which is a superset of ISO 8859-1 (also called ISO Latin-1) in terms of printable characters, but differs from IANA's ISO-8859-1 by using displayable characters rather than control characters in the 128 to 159 range.
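
A quick Scala sketch of that 128-159 difference; byte 0x93 is just one example value:

import java.nio.charset.Charset

val b = Array(0x93.toByte)  // a byte in the 128-159 range
// Windows-1252 maps 0x93 to the printable left double quote (U+201C)
println(new String(b, Charset.forName("windows-1252")))               // “
// ISO-8859-1 maps the same byte to the unprintable C1 control U+0093
println(new String(b, Charset.forName("ISO-8859-1")).codePointAt(0))  // 147 (0x93)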

Unicode https://home.unicode.org/

Characters before Unicode

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number to each one. Before the Unicode standard was developed, there were many different systems, called character encodings, for assigning these numbers. These earlier character encodings were limited and did not cover characters for all the world's languages. Even for a single language like English, no single encoding covered all the letters, punctuation, and technical symbols in common use. Pictographic languages, such as Japanese, were a challenge to support with these earlier encoding standards.

Early character encodings also conflicted with one another. That is, two encodings could use the same number for two different characters, or use different numbers for the same character. Any given computer might have to support many different encodings, and whenever data was passed between computers or between encodings, there was a risk of corruption or errors.

Character encodings existed for a handful of “large” languages.  But many languages lacked character support altogether.

Unicode characters - A Global Standard to Support All the World's Languages

Is Unicode a 16-bit encoding?
No. Depending on the encoding form you choose (UTF-8, UTF-16, or UTF-32), each character is represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit.

Can Unicode text be represented in more than one way?
Yes, there are several possible representations of Unicode data, including UTF-8, UTF-16, and UTF-32.
Unicode defines two families of mapping methods: UTF (Unicode Transformation Format) and UCS (Universal Coded Character Set, from ISO/IEC 10646).
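
A small Scala sketch of those three forms; the BE variants of getBytes avoid byte-order marks, and UTF-32 support depends on the JDK's extended charsets, so treat this as illustrative:

val samples = Seq("A", "é", "€", "😀")  // 1, 2, 3 and 4 bytes in UTF-8
for (s <- samples) {
  val utf8  = s.getBytes("UTF-8").length         // one to four 8-bit bytes
  val utf16 = s.getBytes("UTF-16BE").length / 2  // one or two 16-bit code units
  val utf32 = s.getBytes("UTF-32BE").length / 4  // a single 32-bit code unit
  println(s"$s -> UTF-8: $utf8, UTF-16: $utf16, UTF-32: $utf32")
}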