Text file differences

dos2unix


Understanding Line Endings: DOS, Linux, and Apple

When working with text files across different operating systems, you might encounter issues with line endings. These differences can affect how text files are displayed and processed. Let's explore the distinctions between DOS carriage returns, Linux line feeds, and Apple text, and understand how the dos2unix tool helps in managing these differences.

Line Endings Explained

  1. DOS/Windows (CRLF)
    • Carriage Return (CR): Moves the cursor to the beginning of the line.
    • Line Feed (LF): Moves the cursor down to the next line.
    • CRLF (Carriage Return + Line Feed): Combines both actions, represented as \r\n or the byte pair 0x0D 0x0A in hexadecimal. This is the standard for text files in DOS and Windows environments.
  2. Linux/Unix (LF)
    • Line Feed (LF): A single character that moves the cursor to the next line without returning to the beginning. Represented as \n or 0x0A in hexadecimal. This is the standard for text files in Linux and Unix systems.
  3. Classic Mac (CR)
    • Carriage Return (CR): Moves the cursor to the beginning of the line. Represented as \r or 0x0D in hexadecimal. This was the standard for text files in older Macintosh systems.
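
As a rough illustration of the three conventions above, the sketch below counts CR and LF bytes to guess which ending a file uses. The function name detect_eol is made up for this example, and the heuristic assumes the file uses a single convention throughout.

```shell
# Hypothetical helper: guess a file's line-ending convention by counting bytes.
# Assumes one convention per file; a file mixing endings is reported as CRLF.
detect_eol() {
  local cr lf
  cr=$(( $(tr -cd '\r' < "$1" | wc -c) ))   # number of 0x0D bytes
  lf=$(( $(tr -cd '\n' < "$1" | wc -c) ))   # number of 0x0A bytes
  if [ "$cr" -eq 0 ]; then
    echo "LF"      # no carriage returns: Unix style (or no line breaks at all)
  elif [ "$lf" -eq 0 ]; then
    echo "CR"      # carriage returns only: classic Mac style
  else
    echo "CRLF"    # both present: DOS/Windows style
  fi
}

# Example with a throwaway temporary file
eol_demo=$(mktemp)
printf 'one\r\ntwo\r\n' > "$eol_demo"
detect_eol "$eol_demo"    # prints CRLF
rm -f "$eol_demo"
```

Counting bytes with tr and wc avoids relying on any tool that already interprets line endings itself.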

Cross-Platform Compatibility

Different line endings can cause compatibility issues when transferring text files between systems. For example:
  • A file with CRLF line endings from Windows might display extra ^M characters in a Unix system.
  • A file with LF line endings from Unix might appear as a single long line in Windows programs that only recognize CRLF (older versions of Notepad, for example).
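
The first case is easy to reproduce. A minimal sketch, using a throwaway temporary file and cat -v, which prints control characters in caret notation:

```shell
# Create a small file with Windows-style CRLF endings (temporary file for illustration).
crlf_demo=$(mktemp)
printf 'first line\r\nsecond line\r\n' > "$crlf_demo"

# cat -v shows each carriage return as ^M at the end of the line.
cat -v "$crlf_demo"
# first line^M
# second line^M
```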

The dos2unix Tool

The dos2unix command-line utility is designed to convert text files with DOS-style line endings (CRLF) to Unix-style line endings (LF). This ensures compatibility when transferring files between Windows and Unix/Linux systems.
Usage Examples:
  1. Convert a file in place:
Code:
   dos2unix filename.txt

This command converts the line endings of filename.txt from CRLF to LF.
  2. Create a copy with Unix-style line endings:
Code:
   dos2unix -n original.txt converted.txt

This command creates a new file converted.txt with Unix-style line endings, leaving the original file unchanged.
By understanding these differences and using tools like dos2unix, you can ensure your text files are compatible across different operating systems, avoiding potential formatting issues.
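
Where dos2unix is not installed, a similar CRLF-to-LF conversion can be approximated with standard tools. The sketch below is not how dos2unix itself is implemented; the function name crlf_to_lf is made up for illustration, and it only strips a carriage return that sits directly before a line feed.

```shell
# Hypothetical stand-in for "dos2unix -n in out", built on sed.
# Removes a carriage return only when it is the last character on a line.
crlf_to_lf() {
  local cr
  cr=$(printf '\r')               # a literal CR byte, so no sed-specific \r escape is needed
  sed "s/${cr}\$//" "$1" > "$2"
}

# Usage: crlf_to_lf original.txt converted.txt
```

Like the -n form shown above, this writes a new file and leaves the original untouched.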
 


Understanding %20 in Web URLs

In a web URL, %20 represents a space character. This is part of a process called percent-encoding or URL encoding, which is used to encode special characters in URLs to ensure they are transmitted correctly over the internet.

Why %20 is Needed

URLs can only contain a limited set of characters from the ASCII character set. Characters like spaces, certain punctuation, and non-ASCII characters need to be encoded to avoid misinterpretation by web browsers and servers. Percent-encoding replaces each such character with a % followed by two hexadecimal digits giving its byte value. A non-ASCII character becomes one %XX group per byte of its encoding, which is normally UTF-8.

For example:

  • A space character is encoded as %20 because 20 is the hexadecimal value for a space in ASCII.

Example URLs

Here are some example URLs with %20 encoded spaces, enclosed in [PRE] tags to prevent them from being decoded when pasted into a web page:

[PRE]
[/PRE]

These URLs will remain encoded when you paste them into your web page, ensuring they display correctly.

Other Common Encodings

Here are a few more examples of percent-encoded characters:

  • %3A represents :
  • %2F represents /
  • %3F represents ?
These encodings ensure that URLs are correctly interpreted by web browsers and servers, avoiding issues with special characters that might otherwise be misinterpreted.
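
The codes listed above are simply the hexadecimal byte values of the characters, which can be checked from a shell. A small sketch, assuming a made-up helper name percent_code and single ASCII characters only:

```shell
# Hypothetical helper: print the percent-encoded form of one ASCII character.
# Multi-byte (non-ASCII) characters are out of scope for this sketch.
percent_code() {
  local hex
  # od prints the byte in hex; strip padding and uppercase the digits.
  hex=$(printf '%s' "$1" | od -An -tx1 | tr -d ' \n' | tr 'a-f' 'A-F')
  printf '%%%s\n' "$hex"
}

percent_code ' '   # %20
percent_code ':'   # %3A
percent_code '/'   # %2F
percent_code '?'   # %3F
```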
 
I was too lazy to download the unix2dos package so I wrote my own little program. This is a C program. This platform wouldn't let me attach it unless it was renamed to .txt instead.

Signed,

Matthew Campbell
 


You can show hidden whitespace characters in vi/vim by using the :set list command. This will display normally invisible characters such as tabs and line endings. Here's how you can do it:

  1. Open your file in Vim:
    Code:
     vim filename
  2. Enable the display of hidden characters:
    Code:
     :set list
  3. Customize the display of these characters (optional):
    Code:
     :set listchars=eol:$,tab:>-,trail:~,extends:>,precedes:<
    • eol:$ shows the end of line with a $ symbol.
    • tab:>- shows tabs with > followed by -.
    • trail:~ shows trailing spaces with ~.
    • extends:> and precedes:< mark lines that continue beyond the right or left edge of the window when line wrapping is off.
To turn off the display of hidden characters, use:
Code:
 :set nolist

This setup can help you identify and manage whitespace issues more effectively.
 
Here are some methods for URL encoding in Linux/Bash:

  1. Using curl:
    Code:
    curl -G --data-urlencode "param=value" ""

  2. Using jq:
    Code:
    printf '%s' "string to encode" | jq -sRr @uri

  3. Using perl:
    Code:
    perl -MURI::Escape -e 'print uri_escape($ARGV[0]);' "string to encode"

  4. Using a custom Bash function: You can define a function in your .bashrc or .zshrc file:
    Code:
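    urlencode() {
      # Percent-encode every character outside the unreserved set (ASCII input assumed).
      local i c length="${#1}"
      for (( i = 0; i < length; i++ )); do
        c="${1:i:1}"
        case $c in
          [a-zA-Z0-9.~_-]) printf '%s' "$c" ;;   # unreserved characters pass through
          *) printf '%%%02X' "'$c" ;;            # everything else becomes %XX
        esac
      done
      echo
    }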

    Then use it like this:
    Code:
    urlencode "string to encode"
These methods should help you achieve URL encoding in Linux/Bash.
 

  3. Classic Mac (CR)
    • Carriage Return (CR): Moves the cursor to the beginning of the line. Represented as \r or 0x0D in hexadecimal. This was the standard for text files in older Macintosh systems.



Well, no surprise, but Apple decided to do that differently as well!
 
I recently had a situation where dos2unix didn't work, and I had to write a script like this.

Code:
#!/bin/bash

# Input file
input_file="macOS.txt"
# Output file
output_file="output.txt"

# Replace carriage return characters with newline and save to output file
tr '\r' '\n' &lt; "$input_file" &gt; "$output_file"

echo "Replacement complete. Check the output file: $output_file"
 
There appeared to be some difficulty with the conversion script in post #7.
A description of working out a solution based on that script follows.

A file was created:
Code:
$ cat input.txt
one
two
three

Since there is no file here with carriage-return line endings (such as a macOS file), one needs to be created. First, to check that input.txt contains the newline character, \n, run:
Code:
$ od -atcx1 input.txt
0000000   o   n   e  nl   t   w   o  nl   t   h   r   e   e  nl
          o   n   e  \n   t   w   o  \n   t   h   r   e   e  \n
         6f  6e  65  0a  74  77  6f  0a  74  68  72  65  65  0a
0000016

To change the newline character \n to a carriage return character run:
Code:
$ cat input.txt | tr '\n' '\r' > inputCR.txt

To check that the new file, inputCR.txt, has the carriage return character run:
Code:
$ od -atcx1 inputCR.txt
0000000   o   n   e  cr   t   w   o  cr   t   h   r   e   e  cr
          o   n   e  \r   t   w   o  \r   t   h   r   e   e  \r
         6f  6e  65  0d  74  77  6f  0d  74  68  72  65  65  0d
0000016

To use the script from post #7, a couple of modifications were made which don't change its intent but try to make it work here. To allow an input file to be given on the command line, the input_file variable was changed to $1. The script name is cr2nl. Here is the modified script:
Code:
$ cat cr2nl
#!/bin/bash

# Input file
input_file="$1"
# Output file
output_file="output.txt"

# Replace carriage return characters with newline and save to output file
tr '\r' '\n' &lt; "$input_file" &gt; "$output_file"

echo "Replacement complete. Check the output file: $output_file"

Running the modified script returns the following:
Code:
$ ./cr2nl inputCR.txt
./cr2nl: line 9: lt: command not found
./cr2nl: line 9: inputCR.txt: command not found
./cr2nl: line 9: gt: command not found
./cr2nl: line 9: output.txt: command not found
Replacement complete. Check the output file: output.txt

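No output file was produced.

The errors come from the &lt; and &gt; on the tr line. These are the HTML entities for < and >, presumably left over from how the script was rendered in the post. Bash reads each & as the background operator and each ; as a command separator, so it tries to run lt, inputCR.txt, gt and output.txt as commands. Replacing the entities with the real redirection operators gives the following script: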
Code:
#!/bin/bash

# Input file
input_file="$1"
# Output file
output_file="output.txt"

# Replace carriage return characters with newline and save to output file
tr '\r' '\n' < "$input_file"  > "$output_file"

echo "Replacement complete. Check the output file: $output_file"

Running the newly modified script outputs the following:
Code:
$ ./cr2nl inputCR.txt
Replacement complete. Check the output file: output.txt

Inspecting output.txt to check whether the carriage returns have been converted to newlines gives:
Code:
$ od -atcx1 output.txt
0000000   o   n   e  nl   t   w   o  nl   t   h   r   e   e  nl
          o   n   e  \n   t   w   o  \n   t   h   r   e   e  \n
         6f  6e  65  0a  74  77  6f  0a  74  68  72  65  65  0a
0000016

All looks good with newlines replacing the carriage returns :)
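
One caveat with a plain tr '\r' '\n': on a file with CRLF endings it turns every CR into an extra LF, so each line break becomes a blank line. A hedged sketch that covers both cases, assuming a made-up function name any_to_lf and a file that uses only one convention:

```shell
# Hypothetical helper: convert CR-only or CRLF line endings to LF.
# Files that mix conventions are not handled by this sketch.
any_to_lf() {
  local cr lf
  cr=$(printf '\r')                               # a literal CR byte for sed
  lf=$(( $(tr -cd '\n' < "$1" | wc -c) ))         # number of LF bytes in the input
  if [ "$lf" -eq 0 ]; then
    tr '\r' '\n' < "$1" > "$2"                    # CR only: translate each CR to LF
  else
    sed "s/${cr}\$//" "$1" > "$2"                 # CRLF: drop the CR before each LF
  fi
}

# Usage: any_to_lf inputCR.txt output.txt
```

The output can be checked with od -atcx1 in the same way as above.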
 